Query Rewriting Patterns: Mistakes

Module: Query Optimization & Performance

SELECT * FROM products WHERE category = "Electronics" OR brand = "Apple";

SELECT * FROM products WHERE category = "Electronics" UNION ALL SELECT * FROM products WHERE brand = "Apple" AND category != "Electronics";

OR on different columns prevents index usage. Database must scan entire table checking both conditions. With UNION ALL, first query uses index on category (500K rows), second uses index on brand (200K rows). Total: 700K rows scanned vs 5M. 15 seconds → 0.5 seconds (30x faster). Critical: Add AND category != "Electronics" to second query to avoid duplicates.

Use UNION ALL for OR on different indexed columns. Add exclusion condition to second query to prevent duplicates.

High

Seq Scan on products (cost=0..125000 rows=5000000) - cannot use indexes

SELECT * FROM customers WHERE id NOT IN (SELECT customer_id FROM orders);

SELECT c.* FROM customers c LEFT JOIN orders o ON c.id = o.customer_id WHERE o.customer_id IS NULL;

NOT IN with NULL in subquery returns empty result. If orders has even one NULL customer_id, NOT IN (1, 2, NULL) evaluates to customer_id != 1 AND customer_id != 2 AND customer_id != NULL. Since customer_id != NULL is UNKNOWN, entire expression is UNKNOWN, returning no rows. LEFT JOIN handles NULLs correctly and is 15x faster (hash join vs nested loop). This is a critical production bug - query succeeds but returns wrong results.

NEVER use NOT IN with subqueries. Always use LEFT JOIN + IS NULL or NOT EXISTS. NOT IN with NULLs is a silent data bug.

Critical

Returns 0 rows if orders.customer_id contains ANY NULL value (silent bug!)

SELECT DISTINCT customer_id FROM orders;

SELECT customer_id FROM orders GROUP BY customer_id;

DISTINCT sorts all 10M rows to find duplicates. GROUP BY can use index on customer_id, avoiding sort. If index exists, GROUP BY does index scan (already sorted), while DISTINCT still sorts. 8 seconds → 1 second (8x faster). Note: Only faster when column has index. Without index, performance is similar.

Use GROUP BY instead of DISTINCT when column has index. Check EXPLAIN plan to verify index usage.

Medium

Sort (cost=500000..525000 rows=10000000) - sorts entire result set

SELECT product_id, (SELECT AVG(rating) FROM reviews WHERE product_id = p.id) FROM products p;

SELECT p.product_id, AVG(r.rating) FROM products p LEFT JOIN reviews r ON p.id = r.product_id GROUP BY p.product_id;

Correlated subquery in SELECT is nested loop: 100K products × 50M reviews = 5 trillion comparisons. Each product triggers full scan of reviews. LEFT JOIN + GROUP BY: single join operation, 50.1M comparisons. 45 seconds → 2 seconds (22x faster). Alternative: window function AVG(rating) OVER (PARTITION BY product_id) for same performance.

Replace correlated subqueries in SELECT with JOIN + GROUP BY or window functions. Correlated subqueries are almost always slow.

High

Subquery executes 100,000 times (once per product), each scanning reviews table

SELECT * FROM orders_2023 UNION SELECT * FROM orders_2024;

SELECT * FROM orders_2023 UNION ALL SELECT * FROM orders_2024;

UNION removes duplicates by sorting combined results. If tables are partitioned by year (no overlap), duplicates are impossible. UNION sorts 20M rows unnecessarily. UNION ALL just concatenates results. 10 seconds → 2 seconds (5x faster). Warning: Only use UNION ALL if certain no duplicates exist or duplicates are acceptable.

Use UNION ALL instead of UNION when duplicates are impossible (partitioned tables) or acceptable. UNION adds expensive sort.

Medium

Sort (cost=800000..850000) to remove duplicates, even when tables don't overlap