PARTITION BY vs GROUP BY: Mistakes

Module: Window Functions

SELECT employee_name, department, AVG(salary) FROM employees GROUP BY department;

SELECT employee_name, department, salary, AVG(salary) OVER (PARTITION BY department) as dept_avg FROM employees;

GROUP BY collapses rows, so you cannot select individual employee names when grouping by department. Use PARTITION BY to keep individual rows while adding department averages.

Use PARTITION BY when you need both individual details and group calculations

High

SQL error: employee_name not in GROUP BY clause

SELECT customer_id, order_date, order_amount, SUM(order_amount) OVER (PARTITION BY customer_id) FROM orders; -- For executive summary

SELECT customer_id, SUM(order_amount) as total_spent, COUNT(*) as order_count FROM orders GROUP BY customer_id;

When executives only need customer totals, PARTITION BY returns all individual orders with repeated totals. GROUP BY provides clean summaries that are faster and more appropriate for reporting.

Use GROUP BY for dashboards and reports where only summaries are needed

Medium

Unnecessary complexity and memory usage for simple summary

SELECT dept, COUNT(*) FROM employees WHERE COUNT(*) > 5 GROUP BY dept;

SELECT dept, COUNT(*) FROM employees GROUP BY dept HAVING COUNT(*) > 5;

WHERE filters individual rows before grouping, but COUNT(*) is calculated after grouping. Use HAVING to filter groups based on aggregate conditions.

Use WHERE for row filtering, HAVING for group filtering

High

Cannot use aggregate functions in WHERE clause

No indexes on GROUP BY columns: SELECT region, SUM(sales) FROM large_table GROUP BY region;

CREATE INDEX idx_region ON large_table(region); SELECT region, SUM(sales) FROM large_table GROUP BY region;

Without indexes, GROUP BY must scan the entire table and sort/hash all rows. Proper indexing can improve performance by 10-100x on large datasets.

Always index GROUP BY and PARTITION BY columns for optimal performance

High

Extremely slow performance on large datasets

SELECT *, RANK() OVER (PARTITION BY very_unique_column ORDER BY amount) FROM large_table;

SELECT *, RANK() OVER (PARTITION BY meaningful_group_column ORDER BY amount) FROM large_table;

Partitioning by unique columns (like transaction_id) creates tiny partitions with one row each, eliminating the benefits of window functions and adding overhead.

Choose partition columns that create meaningful groups with multiple rows

Medium

Creates as many partitions as rows, defeating the purpose

Complex self-joins instead of window functions: SELECT e1.*, e2.avg_salary FROM employees e1 JOIN (SELECT dept, AVG(salary) as avg_salary FROM employees GROUP BY dept) e2 ON e1.dept = e2.dept;

SELECT *, AVG(salary) OVER (PARTITION BY dept) as avg_salary FROM employees;

Self-joins with GROUP BY subqueries are the old way to add group context to individual rows. PARTITION BY eliminates the need for complex joins and is often faster.

Use PARTITION BY instead of self-joins for adding group context to individual rows

Medium

Unnecessary complexity, harder to maintain, potentially slower