Aggregate Query Optimization: Interview
Module: Query Optimization & Performance
Explain the difference between WHERE and HAVING in aggregate queries. When would using HAVING instead of WHERE cause performance problems?
WHERE filters rows BEFORE aggregation. HAVING filters groups AFTER aggregation.
**Execution Order:**
1. FROM - Get tables
2. WHERE - Filter rows (reduces data early)
3. GROUP BY - Group remaining rows
4. Aggregate functions - Compute COUNT, SUM, etc.
5. HAVING - Filter groups
6. SELECT - Return results
**Performance Impact:**
Bad (HAVING for row filtering):
SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING department IN ('Sales', 'Engineering');
- Aggregates all 50 departments (10M employees)
- Then filters to 2 departments
- Wasted work: aggregated 48 unnecessary departments
- Time: 12 seconds
Good (WHERE for row filtering):
SELECT department, AVG(salary)
FROM employees
WHERE department IN ('Sales', 'Engineering')
GROUP BY department;
- Filters to 2 departments first (400K employees)
- Aggregates only relevant rows
- Time: 0.5 seconds (24x faster)
**When to Use Each:**
WHERE:
- Filter individual rows
- Conditions on non-aggregated columns
- Example: WHERE active = true, WHERE salary > 50000
- Benefit: Reduces rows before expensive aggregation
HAVING:
- Filter aggregated results
- Conditions on aggregate functions
- Example: HAVING COUNT(*) > 10, HAVING AVG(salary) > 75000
- Use case: "Show departments with more than 100 employees"
**Rule:** Use WHERE whenever possible. Only use HAVING for conditions that require aggregation.
What is a covering index and why is it particularly effective for aggregate queries? Explain with an example.