Subqueries in WHERE Clause: Interview
Module: Subqueries & CTEs
Explain the difference between IN and EXISTS. When should you use each?
IN checks if a value exists in a list returned by subquery. It executes subquery once, builds hash table of results, then checks membership. Best for small result sets (<100 values) and non-correlated subqueries. EXISTS checks if subquery returns any rows. It executes per outer row but stops at first match. Best for large result sets, correlated subqueries, and when you only care about existence. Performance: For small lists, IN and EXISTS are similar. For large lists (>1000 values), EXISTS is 3-5x faster because it stops at first match instead of building complete list. For correlated checks, EXISTS is usually 2-3x faster. Use IN when: subquery returns small list, non-correlated, readability matters. Use EXISTS when: large result sets, correlated subquery, performance critical, or using NOT EXISTS (to avoid NULL trap).
Why does NOT IN fail with NULL values? How do you avoid this trap?
NOT IN fails with NULL because of how SQL handles NULL comparisons. If subquery returns (1, 2, NULL), then "id NOT IN (1, 2, NULL)" means "id != 1 AND id != 2 AND id != NULL". The "id != NULL" part evaluates to unknown (not true or false). Since the entire condition needs to be true, and one part is unknown, the whole expression becomes unknown, which SQL treats as false. Result: zero rows returned, even if there are valid rows. This is a silent bug - no error, just wrong results. Solution: Always use NOT EXISTS instead of NOT IN. NOT EXISTS checks existence, not value comparison, so it handles NULL correctly. Example: WHERE NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id). This is one of the most important SQL gotchas to remember.
What are ANY and ALL operators? Give examples of when to use them.
ANY means condition is true if it matches at least one value in subquery result. ALL means condition is true if it matches every value in subquery result. Examples: "price > ANY (SELECT price FROM products WHERE category = 'Budget')" means price is greater than at least one budget product price - equivalent to price > MIN(budget prices). "price > ALL (SELECT price FROM products WHERE category = 'Budget')" means price is greater than every budget product price - equivalent to price > MAX(budget prices). Use ANY when: you want to match at least one value, finding values above minimum, checking if value exceeds any threshold. Use ALL when: you want to match all values, finding values above maximum, ensuring value exceeds all thresholds. Note: = ANY is equivalent to IN. != ALL is equivalent to NOT IN (but still has NULL trap). In practice, ANY and ALL are less common than IN/EXISTS, but useful for range comparisons.
When should you use a subquery in WHERE vs a JOIN? Compare performance.
Use subquery in WHERE when: (1) You only need to filter, not retrieve columns from related table, (2) Logic is clearer with subquery ("find customers who have orders"), (3) You're checking existence (EXISTS pattern), (4) You're comparing to aggregate (salary > AVG). Use JOIN when: (1) You need columns from both tables, (2) Performance is critical, (3) You're combining data, not just filtering, (4) Query optimizer can better optimize JOINs. Performance comparison for "customers with orders": EXISTS subquery: 180ms, IN subquery: 450ms, INNER JOIN: 95ms. JOIN is usually fastest because optimizer can use indexes optimally and avoid repeated subquery execution. However, for simple existence checks with proper indexes, EXISTS is nearly as fast and more readable. Modern optimizers often rewrite IN/EXISTS as JOINs internally. Rule of thumb: For filtering only, use EXISTS. For retrieving data, use JOIN. Test both with EXPLAIN to see which is faster for your specific case.
Write a query to find customers who have never placed an order. Use NOT EXISTS.
SELECT
c.customer_id,
c.name,
c.email,
c.join_date
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
)
ORDER BY c.join_date DESC;
-- Index for performance:
CREATE INDEX idx_orders_customer ON orders(customer_id);
NOT EXISTS checks if subquery returns any rows. For each customer, it looks for orders with matching customer_id. If no orders found, customer is included. Handles NULL correctly (unlike NOT IN). Index on orders(customer_id) critical for performance.
Find employees earning above their department average. Use correlated subquery.
SELECT
e1.employee_id,
e1.name,
e1.department,
e1.salary,
(SELECT ROUND(AVG(e2.salary), 0)
FROM employees e2
WHERE e2.department = e1.department) AS dept_avg
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department = e1.department
)
ORDER BY e1.department, e1.salary DESC;