Statistics & Query Planning: Interview
Module: Query Optimization & Performance
What are database statistics and why does the query optimizer need them?
Database statistics are metadata about table data: row counts, distinct value counts, NULL fractions, value distributions (histograms), and physical ordering (correlation). The optimizer uses statistics to estimate how many rows each operation will return (cardinality estimation) and calculate costs for different execution plans. Without statistics, the optimizer is blind - it might choose a nested loop when a hash join would be 100x faster, or allocate 1MB of memory when it needs 1GB. Statistics are the foundation of cost-based optimization.
Explain how a histogram helps the optimizer make better decisions for skewed data.
A histogram stores the distribution of values in a column. For skewed data (e.g., status column with 99% active, 1% pending), without a histogram the optimizer assumes uniform distribution - each value appears equally (50% each). This causes massive estimation errors. With a histogram, the optimizer knows the actual frequencies: active=99%, pending=1%. For "WHERE status = 'active'" it estimates 99% of rows (accurate) instead of 50% (2x error). This affects memory allocation, join algorithm choice, and join order. A 2x cardinality error can cause a 100x performance degradation if it triggers the wrong join algorithm.
Why do cardinality estimation errors compound in multi-table joins?
Each join operation uses the estimated cardinality from the previous operation. If each table has a 10% estimation error, the errors multiply: 1.1 × 1.1 × 1.1 ... = 1.1^n. For a 10-table join, this becomes 1.1^10 = 2.6x total error. Worse, if the error causes the optimizer to choose the wrong join order or algorithm early in the plan, subsequent operations are based on that wrong choice. Example: Estimate 100K rows, choose nested loop, actually 10M rows - nested loop becomes 100x slower than hash join. The wrong algorithm choice has exponential impact.
You bulk loaded 50M rows into a 10M row table. The query that took 5 seconds now takes 10 minutes. EXPLAIN shows a nested loop join instead of the previous hash join. What happened and how do you fix it?
-- Diagnosis: Statistics are stale
-- Optimizer thinks table still has 10M rows
-- Actually has 60M rows (6x underestimate)
-- Chose nested loop for "small" table
-- Nested loop with 60M rows is disaster
-- Check statistics age
SELECT schemaname, tablename, last_analyze, last_autoanalyze,
n_live_tup, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE tablename = 'orders';
-- Result: last_analyze = 30 days ago, n_mod_since_analyze = 50M
-- Fix: Update statistics
ANALYZE orders;
-- Verify statistics updated
SELECT tablename, last_analyze FROM pg_stat_user_tables
WHERE tablename = 'orders';
-- Result: last_analyze = now
-- Run query again - should use hash join
-- Execution time: back to 5 seconds
-- Prevention: ANALYZE after bulk operations
BEGIN;
INSERT INTO orders SELECT * FROM staging_orders;
ANALYZE orders; -- Update statistics immediately
COMMIT;
The bulk load changed the table size from 10M to 60M rows, but statistics were not updated. The optimizer still thought the table had 10M rows and chose a nested loop join (good for small tables). With 60M rows, nested loop becomes catastrophic - 60M × cost per row. Hash join would be much faster - build hash table once, probe 60M times. The fix is simple: run ANALYZE to update statistics. The optimizer will then see the true row count and choose hash join. Always run ANALYZE after bulk operations >10% of table size.
A query on "WHERE status = 'active'" spills to disk and takes 2 minutes. EXPLAIN shows estimated rows = 5M, actual rows = 95M (19x error). The status column has 2 values: active (95%) and inactive (5%). How do you fix the cardinality estimation?
-- Diagnosis: No histogram, optimizer assumes uniform distribution
-- 2 distinct values → assumes 50% each
-- Actually: active=95%, inactive=5% (highly skewed)
-- Check if histogram exists
SELECT tablename, attname, n_distinct,
most_common_vals, most_common_freqs
FROM pg_stats