PARTITION BY vs GROUP BY: Mistakes
Module: Window Functions
SELECT employee_name, department, AVG(salary) FROM employees GROUP BY department;
SELECT employee_name, department, salary, AVG(salary) OVER (PARTITION BY department) as dept_avg FROM employees;
GROUP BY collapses rows, so you cannot select individual employee names when grouping by department. Use PARTITION BY to keep individual rows while adding department averages.
Use PARTITION BY when you need both individual details and group calculations
High
SQL error: employee_name not in GROUP BY clause
SELECT customer_id, order_date, order_amount, SUM(order_amount) OVER (PARTITION BY customer_id) FROM orders; -- For executive summary
SELECT customer_id, SUM(order_amount) as total_spent, COUNT(*) as order_count FROM orders GROUP BY customer_id;
When executives only need customer totals, PARTITION BY returns all individual orders with repeated totals. GROUP BY provides clean summaries that are faster and more appropriate for reporting.
Use GROUP BY for dashboards and reports where only summaries are needed
Medium
Unnecessary complexity and memory usage for simple summary
SELECT dept, COUNT(*) FROM employees WHERE COUNT(*) > 5 GROUP BY dept;
SELECT dept, COUNT(*) FROM employees GROUP BY dept HAVING COUNT(*) > 5;
WHERE filters individual rows before grouping, but COUNT(*) is calculated after grouping. Use HAVING to filter groups based on aggregate conditions.
Use WHERE for row filtering, HAVING for group filtering
High
Cannot use aggregate functions in WHERE clause
No indexes on GROUP BY columns: SELECT region, SUM(sales) FROM large_table GROUP BY region;
CREATE INDEX idx_region ON large_table(region); SELECT region, SUM(sales) FROM large_table GROUP BY region;
Without indexes, GROUP BY must scan the entire table and sort/hash all rows. Proper indexing can improve performance by 10-100x on large datasets.
Always index GROUP BY and PARTITION BY columns for optimal performance
High
Extremely slow performance on large datasets
SELECT *, RANK() OVER (PARTITION BY very_unique_column ORDER BY amount) FROM large_table;
SELECT *, RANK() OVER (PARTITION BY meaningful_group_column ORDER BY amount) FROM large_table;
Partitioning by unique columns (like transaction_id) creates tiny partitions with one row each, eliminating the benefits of window functions and adding overhead.
Choose partition columns that create meaningful groups with multiple rows
Medium
Creates as many partitions as rows, defeating the purpose
Complex self-joins instead of window functions: SELECT e1.*, e2.avg_salary FROM employees e1 JOIN (SELECT dept, AVG(salary) as avg_salary FROM employees GROUP BY dept) e2 ON e1.dept = e2.dept;
SELECT *, AVG(salary) OVER (PARTITION BY dept) as avg_salary FROM employees;
Self-joins with GROUP BY subqueries are the old way to add group context to individual rows. PARTITION BY eliminates the need for complex joins and is often faster.
Use PARTITION BY instead of self-joins for adding group context to individual rows
Medium
Unnecessary complexity, harder to maintain, potentially slower