CTE Performance Optimization: Interview
Module: Subqueries & CTEs
What are the main strategies for optimizing CTE performance? Explain each.
Five main strategies:
1. **Check Execution Plan**: Use EXPLAIN ANALYZE to see actual performance. Shows execution time, rows scanned, index usage, materialization status. Never optimize without this data.
2. **Materialize Reused CTEs**: If CTE referenced 2+ times, add MATERIALIZED hint (PostgreSQL). Prevents duplicate execution. Check execution plan for "CTE Scan" vs inlined subquery.
3. **Filter Early**: Push WHERE conditions into CTE definition, not after. Reduces data volume early. Filtering 50M rows to 500K in CTE is 100x less work than filtering after.
4. **Index Join Columns**: CREATE INDEX on columns used in JOIN, WHERE, ORDER BY. Enables index scans instead of full table scans. Can be 60x faster.
5. **Consider Alternatives**: Use window functions for running totals/rankings (100x faster than correlated subqueries). Use JOINs for simple combinations. Use temp tables for very large intermediate results.
Priority: Check execution plan first, then apply optimizations based on what you find.
How do you determine if a CTE should be materialized? Walk through the decision process.
Decision process:
1. **Check how many times CTE is referenced**:
- Used once: Don't materialize (adds overhead)
- Used 2+ times: Consider materializing
2. **Check if CTE is expensive**:
- Simple filter (SELECT * WHERE): Don't materialize
- Aggregation (GROUP BY, SUM): Consider materializing
- Complex join: Consider materializing
- Correlated subquery: Definitely materialize
3. **Check execution plan**:
- Run EXPLAIN ANALYZE
- Look for "CTE Scan" (materialized) vs inlined subquery
- If CTE referenced 3 times but only executed once: Already materialized
- If CTE referenced 3 times and executed 3 times: Need MATERIALIZED
4. **Test both ways**:
- Run with MATERIALIZED: Check execution time
- Run with NOT MATERIALIZED: Check execution time
- Use whichever is faster
5. **Consider memory**:
- Large materialized CTEs use work_mem
- If CTE results are huge (>1GB), may spill to disk
- Consider temp table instead
Rule of thumb: Materialize if CTE is referenced 2+ times AND is expensive (aggregation, complex join).
Explain the difference between filtering inside a CTE vs after a CTE. Why does it matter for performance?
Key difference: WHERE the filtering happens in the execution pipeline.
Filtering AFTER CTE:
WITH all_data AS (
SELECT * FROM orders -- Scans 50M rows
)
SELECT * FROM all_data WHERE date >= '2024-01-01'; -- Filters to 500K
Execution: