Statistics & Cardinality: Real-World

Module: Query Optimization & Performance

Query estimates 1000 rows but returns 1M rows (1000x error) - wrong join algorithm chosen, 100x slower. Multi-column filter WHERE city = 'Seattle' AND state = 'WA' estimates 100 rows (assumes independence), actual 50K rows (500x error). Skewed data: status column 99% active - without histogram, estimates 50% causing memory spill.

Airbnb: Geographic Query Cardinality Disaster - 100x Error

Airbnb search query filters on city = 'Paris' AND country = 'France' (100% correlated). Optimizer assumes independence: P(Paris) × P(France) = 0.001 × 0.01 = 0.00001. Estimates 1K listings, actual 100K listings (100x error). Chooses nested loop join with reviews table. Query takes 5 minutes instead of 3 seconds. Search page timeout, users see error. Revenue impact: $10K/hour in lost bookings.

Create extended statistics for city/country correlation. Run ANALYZE to update statistics. Optimizer now knows: if city=Paris, then country=France with 100% probability. Estimates 100K listings correctly. Chooses hash join. Query takes 3 seconds. Search page loads instantly. Add extended statistics for all geographic column pairs (city/country, city/state, zip/city) and hierarchical pairs (neighborhood/city, category/subcategory).

-- Emergency diagnosis

EXPLAIN ANALYZE

SELECT l.listing_id, l.title, AVG(r.rating) as avg_rating

FROM listings l

JOIN reviews r ON l.listing_id = r.listing_id

WHERE l.city = 'Paris' AND l.country = 'France'

GROUP BY l.listing_id, l.title;

-- Estimated rows: 1000, Actual rows: 100000 (100x error)

-- Plan: Nested Loop (disaster for 100K rows)

-- Time: 300 seconds

-- Check correlation

SELECT

COUNT(*) FILTER (WHERE city = 'Paris') as paris_count,

COUNT(*) FILTER (WHERE country = 'France') as france_count,

COUNT(*) FILTER (WHERE city = 'Paris' AND country = 'France') as both

FROM listings;

-- Result: paris_count = both (100% correlation)

-- Fix: Create extended statistics

CREATE STATISTICS city_country_correlation (dependencies)

ON city, country FROM listings;

CREATE STATISTICS city_state_correlation (dependencies)

ON city, state FROM listings;

ANALYZE listings;

-- Verify fix

EXPLAIN ANALYZE

SELECT l.listing_id, l.title, AVG(r.rating) as avg_rating

FROM listings l

JOIN reviews r ON l.listing_id = r.listing_id

WHERE l.city = 'Paris' AND l.country = 'France'

GROUP BY l.listing_id, l.title;

-- Estimated rows: 100000 (accurate)

-- Plan: Hash Join (correct for 100K rows)

-- Time: 3 seconds

-- Prevent recurrence: Add to schema setup

-- Create extended stats for all correlated pairs

CREATE STATISTICS neighborhood_city_corr (dependencies)