Statistics & Query Planning: Mistakes

Module: Query Optimization & Performance

Never running ANALYZE after bulk INSERT of 5M rows

Run ANALYZE immediately after bulk operations: BEGIN; INSERT INTO orders SELECT * FROM staging; ANALYZE orders; COMMIT;

Optimizer still thinks table has old row count. Chooses nested loop for "small" table, but table is now huge. Cardinality error causes catastrophic plan choice.

Always ANALYZE after bulk operations >10% of table size. Make it part of your ETL process.

Critical

Query plan degrades from hash join (8s) to nested loop (600s) - 75x slower

Assuming uniform distribution on skewed column (status: 99% active, 1% pending)

Run ANALYZE to create histogram: ANALYZE orders; Check histogram: SELECT most_common_vals, most_common_freqs FROM pg_stats WHERE tablename = 'orders' AND attname = 'status';

Without histogram, optimizer assumes each distinct value appears equally. For skewed data, this causes massive estimation errors. Wrong memory allocation leads to disk spills.

Create histograms on all columns used in WHERE clauses, especially status/type/category columns with skewed distributions.

High

Memory spill to disk - estimates 5M rows, actual 9.9M rows. Query takes 120s instead of 3s (40x slower)

Treating correlated columns as independent: WHERE city = 'Seattle' AND state = 'WA' (100% correlated)

Create extended statistics: CREATE STATISTICS city_state_stats (dependencies) ON city, state FROM users; ANALYZE users;

Optimizer assumes city and state are independent: P(Seattle) × P(WA) = 1% × 2% = 0.02%. Reality: if city=Seattle then state=WA with 100% probability, so 1%. Extended statistics teach optimizer about correlation.

Identify correlated columns (city/state, product/category, year/quarter) and create extended statistics. Check correlation: SELECT corr(col1_numeric, col2_numeric) FROM table;

High

Cardinality estimate off by 50x - estimates 20K rows, actual 1M rows. Wrong join order, 10x slower

Not monitoring statistics age - statistics last updated 90 days ago on rapidly growing table

Monitor statistics age: SELECT tablename, last_analyze, last_autoanalyze FROM pg_stat_user_tables WHERE last_analyze < NOW() - INTERVAL '7 days'; Alert if >7 days old.

Statistics become stale as data changes. Row counts, distributions, and correlations drift from reality. Optimizer makes increasingly bad decisions. Problem goes unnoticed until users complain.

Set up monitoring alerts for stale statistics. Check statistics age weekly. Automate ANALYZE on schedule for critical tables.

Medium

Gradual performance degradation as data grows. Query that took 5s now takes 60s (12x slower)

Running ANALYZE on every single INSERT - table has 1000 inserts/second

Let autovacuum handle it (triggers after ~20% change) or schedule ANALYZE daily: 0 2 * * * psql -c "ANALYZE orders"

ANALYZE is not free - it scans table and computes statistics. Running too frequently wastes CPU and I/O. Statistics don't change significantly after small inserts. Diminishing returns.

Use automatic statistics updates for most tables. Manual ANALYZE only after bulk operations (>10% of table). Don't ANALYZE on every row.

Medium

ANALYZE takes 500ms, blocks 1000 inserts/second = 500 blocked inserts. Throughput drops 50%