Statistics & Cardinality: Real-World
Module: Query Optimization & Performance
Query estimates 1000 rows but returns 1M rows (1000x error) - wrong join algorithm chosen, 100x slower. Multi-column filter WHERE city = 'Seattle' AND state = 'WA' estimates 100 rows (assumes independence), actual 50K rows (500x error). Skewed data: status column 99% active - without histogram, estimates 50% causing memory spill.
Airbnb: Geographic Query Cardinality Disaster - 100x Error
Airbnb search query filters on city = 'Paris' AND country = 'France' (100% correlated). Optimizer assumes independence: P(Paris) × P(France) = 0.001 × 0.01 = 0.00001. Estimates 1K listings, actual 100K listings (100x error). Chooses nested loop join with reviews table. Query takes 5 minutes instead of 3 seconds. Search page timeout, users see error. Revenue impact: $10K/hour in lost bookings.
Create extended statistics for city/country correlation. Run ANALYZE to update statistics. Optimizer now knows: if city=Paris, then country=France with 100% probability. Estimates 100K listings correctly. Chooses hash join. Query takes 3 seconds. Search page loads instantly. Add extended statistics for all geographic column pairs (city/country, city/state, zip/city) and hierarchical pairs (neighborhood/city, category/subcategory).
-- Emergency diagnosis
EXPLAIN ANALYZE
SELECT l.listing_id, l.title, AVG(r.rating) as avg_rating
FROM listings l
JOIN reviews r ON l.listing_id = r.listing_id
WHERE l.city = 'Paris' AND l.country = 'France'
GROUP BY l.listing_id, l.title;
-- Estimated rows: 1000, Actual rows: 100000 (100x error)
-- Plan: Nested Loop (disaster for 100K rows)
-- Time: 300 seconds
-- Check correlation
SELECT
COUNT(*) FILTER (WHERE city = 'Paris') as paris_count,
COUNT(*) FILTER (WHERE country = 'France') as france_count,
COUNT(*) FILTER (WHERE city = 'Paris' AND country = 'France') as both
FROM listings;
-- Result: paris_count = both (100% correlation)
-- Fix: Create extended statistics
CREATE STATISTICS city_country_correlation (dependencies)
ON city, country FROM listings;
CREATE STATISTICS city_state_correlation (dependencies)
ON city, state FROM listings;
ANALYZE listings;
-- Verify fix
EXPLAIN ANALYZE
SELECT l.listing_id, l.title, AVG(r.rating) as avg_rating
FROM listings l
JOIN reviews r ON l.listing_id = r.listing_id
WHERE l.city = 'Paris' AND l.country = 'France'
GROUP BY l.listing_id, l.title;
-- Estimated rows: 100000 (accurate)
-- Plan: Hash Join (correct for 100K rows)
-- Time: 3 seconds
-- Prevent recurrence: Add to schema setup
-- Create extended stats for all correlated pairs
CREATE STATISTICS neighborhood_city_corr (dependencies)