Statistics & Cardinality: Concept

Module: Query Optimization & Performance

Cardinality is the estimated number of rows an operation will return. The query optimizer uses cardinality estimates to choose execution plans - which join algorithm, which join order, how much memory to allocate.

Selectivity is the fraction of rows that pass a filter. Cardinality = Table rows × Selectivity.

Example: Table has 1M rows, WHERE status = 'active' has selectivity 0.8 (80%), cardinality = 1M × 0.8 = 800K rows.

Accurate cardinality estimates are critical. A 10x error can cause the optimizer to choose nested loop instead of hash join, making the query 1000x slower.

**Selectivity Calculation:**

**1. Equality Predicate (WHERE col = value)**

Selectivity = 1 / n_distinct

Example: gender column has 3 distinct values (M, F, Other)

WHERE gender = 'M'

Selectivity = 1/3 = 0.33 (33%)

If table has 1M rows:

Cardinality = 1M × 0.33 = 333K rows

**2. Range Predicate (WHERE col > value)**

Without histogram: Assumes uniform distribution

Selectivity = (max - value) / (max - min)

Example: age column, min=18, max=80

WHERE age > 65

Selectivity = (80 - 65) / (80 - 18) = 15/62 = 0.24 (24%)

With histogram: Uses actual distribution

If histogram shows 10% of values > 65:

Selectivity = 0.10 (10%)

**3. Multi-Column Predicates (Independence Assumption)**

Optimizer assumes columns are independent:

Selectivity = sel(col1) × sel(col2) × sel(col3)

Example:

WHERE status = 'active' AND priority = 'high'

- status selectivity: 0.8 (80% active)

- priority selectivity: 0.1 (10% high)

- Combined: 0.8 × 0.1 = 0.08 (8%)

Problem: If columns are correlated, this is wrong!

**4. Correlated Columns (Independence Assumption Fails)**

Example: city and state columns

WHERE city = 'Seattle' AND state = 'WA'

Optimizer assumes independence:

- city = 'Seattle': 0.5% of rows

- state = 'WA': 2% of rows

- Combined: 0.005 × 0.02 = 0.0001 (0.01%)

- Estimate: 1M × 0.0001 = 100 rows

Reality: 100% correlation (Seattle is always in WA)

- Actual: 0.5% of rows