Statistics & Cardinality: Concept
Module: Query Optimization & Performance
Cardinality is the estimated number of rows an operation will return. The query optimizer uses cardinality estimates to choose execution plans - which join algorithm, which join order, how much memory to allocate.
Selectivity is the fraction of rows that pass a filter. Cardinality = Table rows × Selectivity.
Example: Table has 1M rows, WHERE status = 'active' has selectivity 0.8 (80%), cardinality = 1M × 0.8 = 800K rows.
Accurate cardinality estimates are critical. A 10x error can cause the optimizer to choose nested loop instead of hash join, making the query 1000x slower.
**Selectivity Calculation:**
**1. Equality Predicate (WHERE col = value)**
Selectivity = 1 / n_distinct
Example: gender column has 3 distinct values (M, F, Other)
WHERE gender = 'M'
Selectivity = 1/3 = 0.33 (33%)
If table has 1M rows:
Cardinality = 1M × 0.33 = 333K rows
**2. Range Predicate (WHERE col > value)**
Without histogram: Assumes uniform distribution
Selectivity = (max - value) / (max - min)
Example: age column, min=18, max=80
WHERE age > 65
Selectivity = (80 - 65) / (80 - 18) = 15/62 = 0.24 (24%)
With histogram: Uses actual distribution
If histogram shows 10% of values > 65:
Selectivity = 0.10 (10%)
**3. Multi-Column Predicates (Independence Assumption)**
Optimizer assumes columns are independent:
Selectivity = sel(col1) × sel(col2) × sel(col3)
Example:
WHERE status = 'active' AND priority = 'high'
- status selectivity: 0.8 (80% active)
- priority selectivity: 0.1 (10% high)
- Combined: 0.8 × 0.1 = 0.08 (8%)
Problem: If columns are correlated, this is wrong!
**4. Correlated Columns (Independence Assumption Fails)**
Example: city and state columns
WHERE city = 'Seattle' AND state = 'WA'
Optimizer assumes independence:
- city = 'Seattle': 0.5% of rows
- state = 'WA': 2% of rows
- Combined: 0.005 × 0.02 = 0.0001 (0.01%)
- Estimate: 1M × 0.0001 = 100 rows
Reality: 100% correlation (Seattle is always in WA)
- Actual: 0.5% of rows