WHERE Clause & Filtering: Interview
Module: SQL Fundamentals
WHERE runs before SELECT in the logical processing order. What are two practical consequences of this ordering that developers must keep in mind?
First: you cannot reference SELECT column aliases in WHERE because SELECT has not yet run. A common mistake is SELECT salary * 12 AS annual_salary FROM employees WHERE annual_salary > 100000 — this fails because annual_salary does not exist yet during WHERE evaluation. Second: aggregate functions (COUNT, SUM, AVG) cannot be used in WHERE — aggregation happens in step 4 (GROUP BY), which is after WHERE. To filter on aggregates, use HAVING.
Explain why WHERE col = NULL always returns zero rows and how to correctly check for NULL.
SQL uses three-valued logic: conditions evaluate to TRUE, FALSE, or UNKNOWN. Any comparison involving NULL produces UNKNOWN — including NULL = NULL. WHERE only passes rows where the condition is TRUE; UNKNOWN is treated identically to FALSE. Therefore WHERE col = NULL evaluates to UNKNOWN for every row and returns nothing. The correct operators are IS NULL (returns rows where col is NULL) and IS NOT NULL (returns rows where col has a non-NULL value).
What is a sargable predicate? Give two examples of non-sargable predicates and rewrite them to be sargable.
A sargable predicate is one where the database engine can use an index to seek directly to matching rows rather than scanning the entire table. The rule: the indexed column must appear bare on one side of the operator. Non-sargable example 1: WHERE YEAR(order_date) = 2024 — function on the column prevents index use. Sargable rewrite: WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'. Non-sargable example 2: WHERE customer_id = '10042' when customer_id is INT — implicit type cast forces per-row evaluation. Sargable rewrite: WHERE customer_id = 10042.
A composite index exists on (status, created_at, customer_id). Write a query that uses all three columns and explain the left-prefix rule.
A query using all three columns: WHERE status = 'active' AND created_at >= '2024-01-01' AND customer_id = 42. The left-prefix rule: a composite B-tree index is ordered first by the leftmost column, then by the next column within each group of the first, and so on. The database can only use the index from the left. If status is not in the WHERE clause, the index cannot be used at all. Additionally, once a range predicate is used on a column, subsequent columns in the index cannot be used for navigation — so equality predicates should come first in the index definition.
Why is NOT IN dangerous with subqueries, and what is the safe alternative?
NOT IN checks that a value is not equal to any value in the set. NULL equality evaluates to UNKNOWN under three-valued logic. If the subquery returns even a single NULL value, 'value NOT IN (list including NULL)' evaluates to UNKNOWN for every outer row — no rows pass the WHERE filter, and the query silently returns zero results with no error. The safe alternative is NOT EXISTS: WHERE NOT EXISTS (SELECT 1 FROM subquery WHERE correlated_condition). NOT EXISTS tests for the absence of a matching row using existence semantics, so it correctly handles NULLs. A LEFT JOIN anti-join pattern (LEFT JOIN ... WHERE right_key IS NULL) is also safe.
How does three-valued logic affect AND and OR when NULLs are involved? Give the result of TRUE AND UNKNOWN and FALSE OR UNKNOWN.
TRUE AND UNKNOWN = UNKNOWN (row excluded). FALSE AND UNKNOWN = FALSE (FALSE dominates — short-circuit). TRUE OR UNKNOWN = TRUE (TRUE dominates — short-circuit). FALSE OR UNKNOWN = UNKNOWN (row excluded). The practical implication: in an AND expression, a single FALSE eliminates the row regardless of NULLs. In an OR expression, a single TRUE includes the row regardless of NULLs. This means rows can be unexpectedly included or excluded when NULL columns participate in complex AND/OR conditions, which is why explicit IS NULL / COALESCE handling is important for nullable columns.