WHERE Clause & Filtering: Performance
Module: SQL Fundamentals
1. Index the columns that appear in WHERE predicates. A full scan on a 10M-row table takes seconds; an index seek takes milliseconds. The most impactful optimization is adding the right index, not rewriting the query.
2. Keep indexed columns bare. Any function or expression applied to a column in a WHERE predicate makes the predicate non-sargable and forces a full scan. Rewrite function predicates as range predicates on the raw column.
3. Match the data type. Comparing a VARCHAR column against an integer literal forces an implicit cast on every row, bypassing the index. Use the correct literal type to match the column.
4. Design composite indexes around query patterns. Index column order matters: (status, created_at) serves WHERE status = 'active' AND created_at > X but not WHERE created_at > X alone. Lead with equality predicates, follow with range predicates.
5. Prefer EXISTS over IN for correlated subqueries on large tables. EXISTS short-circuits on the first match; IN materializes the full subquery result before the join. For anti-join patterns, NOT EXISTS is typically faster than NOT IN when NULLs are possible.
6. Watch for OR expansion killing index use. A single column index cannot simultaneously seek two disjoint ranges connected by OR. The optimizer may choose a full scan. Consider UNION ALL of two separate indexed queries as an alternative.
7. Avoid leading wildcards in LIKE. WHERE name LIKE '%Smith' cannot use a B-tree index. WHERE name LIKE 'Smith%' can. For full substring search, use a full-text index.
8. Filter early to reduce downstream work. Every row eliminated by WHERE is a row that GROUP BY, ORDER BY, and JOIN do not have to process. In multi-join queries, push the most selective single-table predicates into the WHERE rather than leaving them to HAVING.
Index every column that appears alone in a WHERE predicate — the difference between an index seek and a full scan is orders of magnitude on large tables
Never apply a function to an indexed column in WHERE. YEAR(col) = 2024 is a full scan; col >= 2024-01-01 AND col < 2025-01-01 is an index range scan
Composite index left-prefix rule: the index (a, b, c) is only usable if column a is in the WHERE clause. Column b is only usable if column a is also an equality predicate
For OR across different columns, consider UNION ALL with one indexed query per branch — each branch can use its own index
EXISTS short-circuits on the first match — it is typically more efficient than IN for correlated subqueries on large tables
WHERE col = NULL — returns zero rows silently. Use IS NULL instead
WHERE A OR B AND C — parsed as WHERE A OR (B AND C) due to AND precedence. Use parentheses
WHERE YEAR(col) = 2024 — function wrapping makes the predicate non-sargable, bypassing the index
WHERE id NOT IN (SELECT customer_id FROM ...) — returns zero rows if the subquery contains any NULL. Use NOT EXISTS
WHERE created_at BETWEEN 2024-01-01 AND 2024-06-30 on DATETIME — misses rows at 2024-06-30 23:59:59. Use < 2024-07-01 as upper bound
Composite index column-order mismatch — querying on columns 2 and 3 without column 1 results in a full scan despite the index existing