Statistics & Cardinality: Interview

Module: Query Optimization & Performance

Explain the difference between selectivity and cardinality. How does the optimizer use them?

Selectivity is the fraction of rows that pass a filter (0.0 to 1.0). Cardinality is the estimated number of rows. Formula: Cardinality = Table rows × Selectivity. Example: Table has 1M rows, WHERE status = 'active' has selectivity 0.8, cardinality = 1M × 0.8 = 800K rows. The optimizer uses cardinality to choose execution plans: <1K rows → nested loop, 1K-1M rows → hash join, >1M rows → merge join. It also uses cardinality to allocate memory for hash tables and sort buffers. Accurate cardinality estimates are critical - a 10x error can cause the optimizer to choose nested loop when hash join would be 100x faster.

Why does the independence assumption fail for correlated columns? Give an example with the math.

The optimizer assumes columns are independent: P(A AND B) = P(A) × P(B). This fails when columns are correlated. Example: city and state columns. WHERE city = 'Seattle' AND state = 'WA'. Independence assumption: P(Seattle) × P(WA) = 0.01 × 0.02 = 0.0002 (0.02%). Reality: Seattle is always in WA (100% correlation), so P(Seattle AND WA) = P(Seattle) = 0.01 (1%). Error: 0.01 / 0.0002 = 50x underestimate. For 10M rows: estimates 2K, actual 100K. This causes the optimizer to choose nested loop (good for 2K) when it should choose hash join (good for 100K), making the query 60x slower. Fix: CREATE STATISTICS (dependencies) ON city, state teaches the optimizer about correlation.

How do estimation errors compound in multi-table joins? Why is a 2x error per table dangerous?

Estimation errors multiply across joins because each join uses the estimated cardinality from the previous operation. If each table has a 2x error, total error = 2^n where n = number of joins. Example: 5-table join, each table has 2x error. Total error: 2^5 = 32x. Estimated 100K rows, actual 3.2M rows. Impact: (1) Wrong join order - optimizer joins large tables first instead of small tables first, (2) Wrong join algorithm - chooses nested loop for "100K rows" when hash join is needed for 3.2M rows, (3) Wrong memory allocation - allocates 10MB when 100MB needed, causing memory spill. Result: query 100x slower. This is why even 2x errors must be investigated in complex queries. Single-table 2x error seems harmless, but compounds to 32x in 5-table join.

EXPLAIN ANALYZE shows: Estimated rows=10000, Actual rows=1000000 (100x error). The query filters on city = 'Seattle' AND state = 'WA'. What is the likely cause and how do you fix it?

-- Diagnosis: Correlated columns treated as independent

-- Optimizer assumes: P(Seattle) × P(WA) = 0.001 × 0.01 = 0.00001

-- Reality: 100% correlated, P(Seattle AND WA) = P(Seattle) = 0.001

-- Error: 100x underestimate

-- Check if extended statistics exist

SELECT stxname, stxkeys, stxkind

FROM pg_statistic_ext

WHERE stxrelid = 'users'::regclass;

-- Result: (empty) - no extended statistics

-- Check correlation

SELECT

COUNT(*) FILTER (WHERE city = 'Seattle') as seattle_count,

COUNT(*) FILTER (WHERE state = 'WA') as wa_count,

COUNT(*) FILTER (WHERE city = 'Seattle' AND state = 'WA') as both_count

FROM users;

-- If both_count = seattle_count, they're 100% correlated

-- Fix: Create extended statistics

CREATE STATISTICS city_state_correlation (dependencies)

ON city, state FROM users;

-- Update statistics

ANALYZE users;

-- Verify extended statistics created

SELECT stxname, stxkeys, stxkind, stxdependencies

FROM pg_statistic_ext

WHERE stxname = 'city_state_correlation';

-- Result: stxkind = {d} (dependencies)

-- Run EXPLAIN ANALYZE again

EXPLAIN ANALYZE

SELECT * FROM users

WHERE city = 'Seattle' AND state = 'WA';

-- Now shows: Estimated rows=1000000 (accurate)

The 100x underestimate indicates correlated columns treated as independent. The optimizer multiplied P(Seattle) × P(WA) assuming independence, but Seattle is always in WA (100% correlation). Extended statistics teach the optimizer about this correlation. After creating extended statistics, the optimizer knows: if city=Seattle, then state=WA with 100% probability, so it uses P(Seattle) directly instead of multiplying probabilities. This fixes the 100x estimation error. Always create extended statistics for geographic data (city/state, city/country), hierarchical data (product/category), and temporal data (year/quarter).

A query on WHERE status = 'active' takes 3 minutes with memory spill. EXPLAIN ANALYZE shows: Estimated rows=5000000, Actual rows=9900000 (2x error). The status column has 2 values: active (99%), pending (1%). How do you fix this?