SQL Practice Logo

SQLPractice Online

Statistics & Cardinality: Examples

Module: Query Optimization & Performance

Independence Assumption Failure - Correlated Columns 50x Error

advanced

E-commerce site with users table (10M rows). Query filters on city = 'Seattle' AND state = 'WA'. These columns are 100% correlated (Seattle is always in WA). Optimizer assumes independence, estimates 2K rows, actual 100K rows (50x error). Chooses nested loop join, takes 5 minutes instead of 5 seconds.

-- Table: 10M users

-- city = 'Seattle': 1% of rows (100K users)

-- state = 'WA': 2% of rows (200K users)

-- Correlation: 100% (Seattle is always in WA)

-- Query: Find Seattle users with their orders

EXPLAIN ANALYZE

SELECT u.user_id, u.name, COUNT(o.order_id) as order_count

FROM users u

JOIN orders o ON u.user_id = o.user_id

WHERE u.city = 'Seattle' AND u.state = 'WA'

GROUP BY u.user_id, u.name;

-- Without extended statistics:

-- Optimizer assumes independence

-- Estimate: P(Seattle) × P(WA) = 0.01 × 0.02 = 0.0002

-- Estimated rows: 10M × 0.0002 = 2000 rows

-- EXPLAIN output:

Nested Loop (cost=0..50000 rows=2000)

-> Index Scan on users (cost=0..1000 rows=2000)

Filter: (city = 'Seattle' AND state = 'WA')

-- Estimated: 2000 rows

-> Index Scan on orders (cost=0..20 rows=5)

Index Cond: (user_id = u.user_id)

-- Actual execution:

-- Actual rows: 100K (50x more than estimate)

-- Nested loop with 100K outer rows = disaster

-- Time: 300 seconds (5 minutes)

-- Fix: Create extended statistics

CREATE STATISTICS city_state_correlation (dependencies)

ON city, state FROM users;

ANALYZE users;

-- Verify extended statistics

SELECT stxname, stxkeys, stxkind, stxdependencies

FROM pg_statistic_ext

WHERE stxname = 'city_state_correlation';

-- Result: stxkind = {d} (dependencies detected)

-- Run query again

EXPLAIN ANALYZE