Aggregate Query Optimization: Mistakes

Module: Query Optimization & Performance

SELECT category, COUNT(*), AVG(price) FROM products GROUP BY category;

CREATE INDEX idx_category_price ON products(category, price);

SELECT category, COUNT(*), AVG(price) FROM products GROUP BY category;

Without index, database must scan entire table and sort 5M rows by category. With covering index on (category, price), database does index-only scan with no sorting - data already grouped. 5M rows, 8 seconds → 0.5 seconds (16x faster).

Always create covering index with GROUP BY columns first, then aggregate columns. Index column order matters: (group_col, agg_col) not (agg_col, group_col).

High

Seq Scan on products (cost=0..125000 rows=5000000) -> Sort (cost=500000..525000) -> GroupAggregate

SELECT department, AVG(salary) FROM employees GROUP BY department HAVING department IN ("Sales", "Engineering");

SELECT department, AVG(salary) FROM employees WHERE department IN ("Sales", "Engineering") GROUP BY department;

HAVING filters after aggregation. This query aggregates all 50 departments (10M employees), then throws away 48 departments. WHERE filters before aggregation - only processes 2 departments (400K employees). 10M rows aggregated → 400K rows aggregated = 25x faster. Rule: Use WHERE for row filtering, HAVING for aggregate filtering (HAVING AVG(salary) > 50000).

WHERE = filter rows before GROUP BY. HAVING = filter groups after aggregation. Use WHERE whenever possible.

High

No error, but aggregates all departments then filters (slow)

SELECT product_id, (SELECT AVG(rating) FROM reviews WHERE product_id = p.id) as avg_rating FROM products p;

SELECT p.product_id, AVG(r.rating) as avg_rating FROM products p LEFT JOIN reviews r ON p.id = r.product_id GROUP BY p.product_id;

Scalar subquery in SELECT executes for every product. 100K products = 100K subquery executions scanning reviews table (50M rows each). Total: 5 trillion comparisons, 45 seconds. JOIN with GROUP BY: single join operation, 11M comparisons, 2 seconds (22x faster). Never use correlated subqueries in SELECT for aggregates.

Replace scalar subqueries with JOIN + GROUP BY. Or use window functions: AVG(rating) OVER (PARTITION BY product_id).

Critical

Correlated subquery executes once per product (100K times)

SELECT DATE(order_date), COUNT(*), SUM(total) FROM orders WHERE order_date >= "2024-01-01" GROUP BY DATE(order_date);

CREATE INDEX idx_order_date ON orders(order_date);

SELECT DATE(order_date), COUNT(*), SUM(total) FROM orders WHERE order_date >= "2024-01-01" GROUP BY DATE(order_date);

GROUP BY DATE(order_date) uses function, so regular index on order_date cannot be used for grouping. Database must scan all rows, apply DATE() function, then sort. 10M rows, 15 seconds. Better: store date separately or use function-based index. Best: if querying by day frequently, add date_only column and index it. Alternative: GROUP BY order_date::date (PostgreSQL) or use generated column.

Avoid functions in GROUP BY. Store pre-computed values (date_only column) or use function-based indexes if supported.

High

Seq Scan on orders (cannot use index on function)

SELECT COUNT(*) FROM orders WHERE status = "completed";

SELECT SUM(total) FROM orders WHERE status = "completed";

SELECT AVG(total) FROM orders WHERE status = "completed";

SELECT COUNT(*) as order_count, SUM(total) as revenue, AVG(total) as avg_value FROM orders WHERE status = "completed";

Three separate queries = three table scans of 10M rows. Each scan: 5 seconds. Total: 15 seconds. Single query with multiple aggregates: one table scan, 5 seconds (3x faster). Database can compute COUNT, SUM, AVG in single pass. Always combine aggregates on same dataset into one query.

Combine multiple aggregates into single query. Database computes them in one pass over data.

Medium

No error, but scans table 3 times instead of once