Index Design & Selection: Interview
Module: Query Optimization & Performance
Explain why column order matters in composite indexes. How do you determine the optimal order?
Column order in composite indexes is critical because indexes work like a phone book - sorted by first column, then second, then third.
**How Composite Indexes Work:**
Index on (last_name, first_name, age):
- Sorted by last_name first
- Within each last_name, sorted by first_name
- Within each first_name, sorted by age
Can efficiently find:
- WHERE last_name = 'Smith' (uses first column)
- WHERE last_name = 'Smith' AND first_name = 'John' (uses first + second)
- WHERE last_name = 'Smith' AND first_name = 'John' AND age = 30 (uses all three)
Cannot efficiently find:
- WHERE first_name = 'John' (doesn't start with last_name)
- WHERE age = 30 (doesn't start with last_name)
This is the "leftmost prefix" rule.
**Determining Optimal Order:**
1. **Equality before range:**
- Equality: WHERE status = 'active'
- Range: WHERE created_at > '2024-01-01'
- Put equality conditions first
2. **Most selective first:**
- Selectivity = COUNT(DISTINCT column) / COUNT(*)
- Higher selectivity = fewer rows
- Example: user_id (100K unique) before status (5 unique)
3. **Consider query patterns:**
- If queries always filter by user_id, put it first
- Even if less selective than other columns
**Example:**
Query: WHERE user_id = 123 AND status = 'pending' AND created_at > '2024-01-01'
Data:
- user_id: 100K unique (high selectivity)
- status: 5 unique (low selectivity)
- created_at: continuous (range)
Optimal order: (user_id, status, created_at)
- user_id first: Equality, most selective (narrows to 100 rows)
- status second: Equality (narrows to 20 rows)
- created_at last: Range (narrows to 5 rows)
Wrong order: (created_at, status, user_id)
- created_at first: Range, scans 5M rows
- Then filters by status and user_id