Statistics & Cardinality: Examples
Module: Query Optimization & Performance
Independence Assumption Failure - Correlated Columns 50x Error
advanced
E-commerce site with users table (10M rows). Query filters on city = 'Seattle' AND state = 'WA'. These columns are 100% correlated (Seattle is always in WA). Optimizer assumes independence, estimates 2K rows, actual 100K rows (50x error). Chooses nested loop join, takes 5 minutes instead of 5 seconds.
-- Table: 10M users
-- city = 'Seattle': 1% of rows (100K users)
-- state = 'WA': 2% of rows (200K users)
-- Correlation: 100% (Seattle is always in WA)
-- Query: Find Seattle users with their orders
EXPLAIN ANALYZE
SELECT u.user_id, u.name, COUNT(o.order_id) as order_count
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE u.city = 'Seattle' AND u.state = 'WA'
GROUP BY u.user_id, u.name;
-- Without extended statistics:
-- Optimizer assumes independence
-- Estimate: P(Seattle) × P(WA) = 0.01 × 0.02 = 0.0002
-- Estimated rows: 10M × 0.0002 = 2000 rows
-- EXPLAIN output:
Nested Loop (cost=0..50000 rows=2000)
-> Index Scan on users (cost=0..1000 rows=2000)
Filter: (city = 'Seattle' AND state = 'WA')
-- Estimated: 2000 rows
-> Index Scan on orders (cost=0..20 rows=5)
Index Cond: (user_id = u.user_id)
-- Actual execution:
-- Actual rows: 100K (50x more than estimate)
-- Nested loop with 100K outer rows = disaster
-- Time: 300 seconds (5 minutes)
-- Fix: Create extended statistics
CREATE STATISTICS city_state_correlation (dependencies)
ON city, state FROM users;
ANALYZE users;
-- Verify extended statistics
SELECT stxname, stxkeys, stxkind, stxdependencies
FROM pg_statistic_ext
WHERE stxname = 'city_state_correlation';
-- Result: stxkind = {d} (dependencies detected)
-- Run query again
EXPLAIN ANALYZE