Statistics & Query Planning: Concept
Module: Query Optimization & Performance
Database statistics are metadata about your data - row counts, value distributions, data patterns. The query optimizer uses these statistics to estimate how many rows each operation will return (cardinality estimation) and choose the fastest execution plan.
Think of statistics as a map for the optimizer. Without accurate statistics, the optimizer is driving blind - it might choose a nested loop when a hash join would be 100x faster, or allocate 1MB of memory when it needs 1GB.
Key concept: The optimizer never sees your actual data during planning. It only sees statistics. Bad statistics = bad plans.
**What Statistics Track:**
1. **Row Count (n_tuples)**
- Total rows in table
- Most basic statistic
- Used for cost estimation baseline
2. **Distinct Values (n_distinct)**
- Number of unique values per column
- Used to estimate selectivity
- Example: gender column has 3 distinct values (M, F, Other)
- Selectivity = 1 / n_distinct = 1/3 = 33%
3. **NULL Fraction (null_frac)**
- Percentage of NULL values
- Affects WHERE clause estimates
- Example: 20% NULLs means WHERE col IS NOT NULL returns 80% of rows
4. **Histogram (most_common_vals + histogram_bounds)**
- Value distribution for skewed data
- Most common values with exact frequencies
- Histogram buckets for remaining values
- Example: status column - 'active': 80%, 'pending': 15%, 'cancelled': 5%
5. **Correlation**
- Physical vs logical ordering
- Affects index scan cost
- High correlation = sequential I/O (fast)
- Low correlation = random I/O (slow)
**How Optimizer Uses Statistics:**
Step 1: Estimate cardinality for each operation
WHERE status = 'active'
- Table has 1M rows
- Histogram shows 'active' = 80%
- Estimated rows: 1M * 0.8 = 800K rows
Step 2: Estimate costs for different plans
Option A: Sequential scan
- Cost: read all 1M rows = 10,000 I/O operations
- Filter to 800K rows
- Total cost: 10,000
Option B: Index scan
- Cost: index lookup = 100 I/O + read 800K rows = 8,000 I/O