Correlated Subqueries Deep Dive: Interview

Q: How do you optimize a slow correlated subquery? Walk through the optimization process.

Optimization process: (1) Identify the problem: Use EXPLAIN ANALYZE to see execution time and loop count. Look for "Seq Scan" (bad) vs "Index Scan" (good). Check if subquery executes per row. (2) Create proper indexes: For pattern WHERE t2.foreign_key = t1.id AND t2.filter_column = value, create composite index: CREATE INDEX idx_table_fk_filter ON table(foreign_key, filter_column). This enables index seek instead of table scan. Impact: 100-1000x faster per execution. (3) Consider alternatives: C

Module: Subqueries & CTEs

What is a correlated subquery and how does it differ from a non-correlated subquery? Explain the execution model.

A correlated subquery references columns from the outer query, creating a dependency where the subquery result depends on which outer row is being processed. Non-correlated subquery is independent - executes once and returns a result used for all outer rows. Correlated subquery executes ONCE PER OUTER ROW. Execution model: (1) Outer query processes first row, (2) Subquery executes with that row's values, (3) Result used for that row, (4) Outer query processes next row, (5) Subquery executes AGAIN with new row's values, (6) Repeat for all N rows = N executions. Example: SELECT e1.name FROM employees e1 WHERE e1.salary > (SELECT AVG(e2.salary) FROM employees e2 WHERE e2.department = e1.department). For each employee, subquery calculates THEIR department average. Alice in Engineering gets Engineering average. Bob in Sales gets Sales average. Different calculation per row. This is powerful but expensive - 10,000 employees = 10,000 subquery executions.

Why are correlated subqueries slow? Explain the performance implications with specific numbers.

Correlated subqueries are slow because they execute once per outer row with no result caching. Performance breakdown: (1) Repeated execution: N outer rows = N subquery executions. For 10,000 rows, that is 10,000 executions. (2) No caching: Even if multiple rows have same values (e.g., same department), subquery still executes for each row. Database doesn't cache intermediate results. (3) Potential full table scans: Without proper indexes, each execution scans entire table. 10,000 rows × 10,000 scans = 100 million row reads! Real numbers: 10,000 employees calculating department averages. Without index: 10,000 full scans × 10,000 rows = 100M reads, 8000ms. With index on (department, salary): 10,000 index seeks × 50 rows = 500K reads, 450ms (18x faster). Window function alternative: 1 scan × 10,000 rows = 10K reads, 95ms (84x faster). The key insight: correlated subqueries scale poorly. Double the rows = double the execution time. This is why they are dangerous in production.

When should you use a correlated subquery vs alternatives like JOINs or window functions? Give specific examples.

Use correlated subquery when: (1) Row-dependent logic that is hard to express otherwise. Example: WHERE EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id AND total > c.credit_limit) - need to compare order to THAT customer's limit. Different threshold per customer. (2) Small outer table (<100 rows) with proper indexes. Performance acceptable. (3) Complex conditions that would make JOIN unreadable. Use window functions when: (1) Per-group aggregates. Instead of (SELECT AVG(salary) WHERE dept = e.dept), use AVG(salary) OVER (PARTITION BY dept). 5-10x faster. (2) Running totals, rankings, row numbering. Window functions are O(N), correlated subqueries are O(N²). 50-200x faster. Use JOINs when: (1) Multiple aggregates from same table. Instead of 3 correlated subqueries, use 1 JOIN with GROUP BY. 5-10x faster. (2) Counting or summing related rows. JOIN + GROUP BY is more efficient. Rule of thumb: Correlated subqueries are last resort. Try window functions first, then JOINs, then correlated subqueries only if necessary.

How do you optimize a slow correlated subquery? Walk through the optimization process.

Optimization process: (1) Identify the problem: Use EXPLAIN ANALYZE to see execution time and loop count. Look for "Seq Scan" (bad) vs "Index Scan" (good). Check if subquery executes per row. (2) Create proper indexes: For pattern WHERE t2.foreign_key = t1.id AND t2.filter_column = value, create composite index: CREATE INDEX idx_table_fk_filter ON table(foreign_key, filter_column). This enables index seek instead of table scan. Impact: 100-1000x faster per execution. (3) Consider alternatives: Can this be a window function? AVG() OVER (PARTITION BY) is 5-10x faster. Can this be a JOIN? JOIN + GROUP BY is 3-10x faster for multiple aggregates. (4) Filter outer query: Add WHERE clauses to reduce outer rows. 10% of rows = 10x faster. (5) Test and measure: Compare execution times. Verify index usage with EXPLAIN. Test with production data volumes. Example: Slow query with correlated subquery, 8000ms. Add index: 450ms (18x faster). Rewrite as window function: 95ms (84x faster). Always test alternatives - correlated subqueries are rarely the best solution.

Write a query to find employees earning above their department average. Then rewrite it using a window function. Compare performance.

-- Method 1: Correlated subquery

SELECT

e1.employee_id,

e1.name,

e1.department,

e1.salary

FROM employees e1

WHERE e1.salary > (

SELECT AVG(e2.salary)

FROM employees e2

WHERE e2.department = e1.department

)

ORDER BY e1.department, e1.salary DESC;

-- Required index:

CREATE INDEX idx_employees_dept_salary

ON employees(department, salary);

-- Performance: 10,000 employees

-- With index: 450ms

-- Without index: 8000ms

-- Method 2: Window function (better)

WITH dept_averages AS (

SELECT

employee_id,

name,

department,

salary,

AVG(salary) OVER (PARTITION BY department) AS dept_avg

FROM employees

)

SELECT

employee_id,