Subquery Optimization: Interview

Q: Why is NOT IN dangerous with NULLs? How do you fix it?

NOT IN with NULLs returns empty result (no rows) even when matches exist! Example: SELECT * FROM customers WHERE id NOT IN (SELECT customer_id FROM orders) If ANY order has NULL customer_id, entire query returns 0 rows. Why: NULL comparisons are undefined. id NOT IN (1, 2, NULL) evaluates to UNKNOWN for all values, filtered out. Fix 1 - LEFT JOIN with NULL check (best): SELECT c.* FROM customers c LEFT JOIN orders o ON c.id = o.customer_id WHERE o.customer_id IS NULL; Fix 2 - NOT EXISTS (als

Module: Query Optimization & Performance

Explain the difference between correlated and non-correlated subqueries. Why are correlated subqueries slow?

Non-correlated subquery: Independent of outer query, executes once. Example: WHERE id IN (SELECT customer_id FROM orders WHERE total > 1000). Runs once, result cached.

Correlated subquery: References outer query columns, executes for EVERY outer row. Example: WHERE EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id). If 1M customers, executes 1M times.

Why slow: Correlated subqueries have O(n * m) complexity. Each outer row triggers full subquery execution. 1M customers × 10M orders = 10 trillion comparisons without index. With index: 1M × log(10M) = 23M comparisons (still slow). JOIN is O(n + m) - single operation, 11M comparisons.

Key: Correlated in SELECT is worst - executes per row with no optimization. Always convert to JOIN or window function.

When should you use EXISTS vs IN vs JOIN? Give examples of each.

EXISTS: For existence checks only, don't need data from subquery. Short-circuits at first match. Example: Find customers with orders: WHERE EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id). Time: 1.5s. Best when: Checking existence, large subquery, need short-circuit.

IN: For small lists (<100 values) or literal values. Example: WHERE status IN ('active', 'pending'). OK for small subqueries (<1000 rows). Avoid for large subqueries - processes all values, no short-circuit.

JOIN: For data retrieval, large result sets. Example: SELECT c.*, o.order_date FROM customers c JOIN orders o ON c.id = o.customer_id. Time: 0.8s. Best when: Need data from both tables, large datasets, best performance.

Rule: EXISTS for existence, JOIN for data, avoid IN with large subqueries.

Why is NOT IN dangerous with NULLs? How do you fix it?

NOT IN with NULLs returns empty result (no rows) even when matches exist!

Example: SELECT * FROM customers WHERE id NOT IN (SELECT customer_id FROM orders)

If ANY order has NULL customer_id, entire query returns 0 rows. Why: NULL comparisons are undefined. id NOT IN (1, 2, NULL) evaluates to UNKNOWN for all values, filtered out.

Fix 1 - LEFT JOIN with NULL check (best):

SELECT c.* FROM customers c

LEFT JOIN orders o ON c.id = o.customer_id

WHERE o.customer_id IS NULL;

Fix 2 - NOT EXISTS (also safe):

SELECT * FROM customers c

WHERE NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id);

Fix 3 - Filter NULLs in subquery:

WHERE id NOT IN (SELECT customer_id FROM orders WHERE customer_id IS NOT NULL);

Best: Use LEFT JOIN - fastest and NULL-safe.

Optimize this slow query: SELECT p.product_name, (SELECT AVG(rating) FROM reviews WHERE product_id = p.id) as avg_rating, (SELECT COUNT(*) FROM reviews WHERE product_id = p.id) as review_count FROM products p. Currently takes 45 seconds for 100K products.

-- Problem: 2 correlated subqueries × 100K products = 200K executions

-- Solution 1: JOIN with GROUP BY

SELECT

p.product_name,

AVG(r.rating) as avg_rating,

COUNT(r.id) as review_count

FROM products p

LEFT JOIN reviews r ON p.id = r.product_id

GROUP BY p.id, p.product_name;

-- Time: 2 seconds (22x faster)

-- Solution 2: CTE for clarity

WITH review_stats AS (

SELECT

product_id,

AVG(rating) as avg_rating,