Statistics & Cardinality: Mistakes

Module: Query Optimization & Performance

Querying correlated columns (city, state) without extended statistics

CREATE STATISTICS city_state_stats (dependencies) ON city, state FROM users; ANALYZE users;

Optimizer assumes city and state are independent: P(Seattle) × P(WA) = 0.01 × 0.02 = 0.0002. Reality: 100% correlated, P(Seattle AND WA) = P(Seattle) = 0.01. Extended statistics teach optimizer about correlation, fixing 50x estimation error.

Create extended statistics for all correlated column pairs: geographic (city/state), hierarchical (product/category), temporal (year/quarter). Check correlation: SELECT corr(col1::int, col2::int) FROM table. If >0.7, create extended stats.

Critical

Cardinality estimate off by 50x - estimates 2K rows, actual 100K rows. Chooses nested loop, takes 5 minutes instead of 5 seconds (60x slower)

Querying skewed column (status: 99% active, 1% pending) without histogram

ANALYZE orders; Verify histogram: SELECT most_common_vals, most_common_freqs FROM pg_stats WHERE tablename = 'orders' AND attname = 'status';

Without histogram, optimizer assumes uniform distribution: 2 distinct values = 50% each. Reality: highly skewed (99%/1%). 2x underestimate causes memory allocation too small (512MB vs 1GB needed), leading to disk spill. Histogram provides actual frequencies, enabling correct memory allocation.

Always create histograms on skewed columns used in WHERE/GROUP BY. Check distribution: SELECT col, COUNT(*), ROUND(100.0*COUNT(*)/SUM(COUNT(*)) OVER(),2) as pct FROM table GROUP BY col. If any value >20%, histogram is critical.

High

Query on active status estimates 5M rows (assumes uniform 50%), actual 9.9M rows (2x error). Memory spill to disk, 200 seconds instead of 3 seconds (67x slower)

Not monitoring estimation accuracy - query slow but never checked EXPLAIN ANALYZE

EXPLAIN ANALYZE query; Compare estimated vs actual rows. If error >10x, investigate statistics. Check: SELECT last_analyze FROM pg_stat_user_tables WHERE tablename = 'orders';

EXPLAIN shows estimated rows, EXPLAIN ANALYZE shows actual rows. Comparing them reveals estimation errors. Common causes: outdated statistics, missing histogram, correlated columns, complex predicates. Without monitoring, bad estimates cause persistent performance problems.

Make EXPLAIN ANALYZE part of performance troubleshooting workflow. Rule: <2x error = OK, 2-10x = investigate, >10x = fix immediately. Set up automated monitoring: log queries with >10x estimation errors.

High

Estimation error goes unnoticed. Query takes 10 minutes, could take 10 seconds with correct statistics. Problem persists for months until someone investigates

Using default statistics target (100) for high-cardinality column (1M distinct user IDs)

ALTER TABLE orders ALTER COLUMN user_id SET STATISTICS 1000; ANALYZE orders; Verify: SELECT n_distinct, array_length(most_common_vals, 1) FROM pg_stats WHERE attname = 'user_id';

Default statistics target = 100 means 100 histogram buckets and 100 MCV entries. For high-cardinality columns (>100K distinct values), this is insufficient. Increasing to 1000 provides 10x more granularity: each bucket represents 1K values instead of 10K. Trade-off: 10x more storage, 2x longer ANALYZE, but 10x more accurate estimates.

Increase statistics target for columns with >100K distinct values. Check: SELECT attname, n_distinct FROM pg_stats WHERE tablename = 'orders' AND n_distinct > 100000. Set STATISTICS 1000 for these columns. Cost: minimal (few MB storage, seconds longer ANALYZE).

Medium

Histogram has only 100 buckets for 1M distinct values. Each bucket represents 10K values. Estimation accuracy: ±10K rows. For queries returning 50K rows, estimate could be 40K-60K (1.5x error)

Assuming 2x estimation error is harmless - "close enough"

Investigate any error >2x. In multi-table joins, errors compound: 2x per table = 2^5 = 32x total error in 5-table join. Fix statistics to prevent compounding.

Single-table 2x error seems acceptable. But errors multiply in joins: estimated_rows = table1_est × sel1 × table2_est × sel2 × ... If each estimate is 2x off, total error is 2^n where n = number of joins. 5 joins = 32x error. This causes catastrophically wrong join orders and algorithm choices.

Never ignore estimation errors, even small ones. In complex queries (3+ joins), even 1.5x error per table compounds to 7.6x total error. Fix statistics proactively. Monitor: log queries with >2x error on any operation. Investigate root cause (stale stats, missing histogram, correlated columns).

Medium

5-table join, each table has 2x error. Total error: 32x. Estimated 100K rows, actual 3.2M rows. Wrong join order chosen, query 100x slower