PARTITION BY vs GROUP BY: Interview

Module: Window Functions

Explain the fundamental difference between GROUP BY and PARTITION BY. When would you choose each approach?

GROUP BY collapses rows into groups, returning one summary row per group - perfect for dashboards and reports where you only need aggregated data. PARTITION BY preserves all original rows while adding group-level calculations as additional columns - ideal for individual analysis with contextual comparisons. Choose GROUP BY when executives need "total sales by region" (3 summary rows). Choose PARTITION BY when analysts need "each salesperson's performance vs their region average" (all individual rows with context). The key decision: do you need to see the forest (GROUP BY) or the trees with forest context (PARTITION BY)?

A query using PARTITION BY is running slowly on a large dataset. How would you optimize it, and when might GROUP BY be a better alternative?

First, optimize PARTITION BY: 1) Index partition columns in the same order as specified, 2) Use composite indexes (partition_col, order_col), 3) Add WHERE clauses for partition pruning, 4) Use named windows to reuse specifications. However, if the business need is actually summary data, GROUP BY would be 3-5x faster and use 60-90% less memory. For example, if a dashboard shows "average salary by department" but the query uses PARTITION BY returning all employees, switch to GROUP BY for just department summaries. Consider hybrid approaches: use GROUP BY for summaries in materialized views, then PARTITION BY for detailed analysis when needed.

How can you combine GROUP BY and PARTITION BY in the same analysis? Provide a practical scenario.

Combine them using CTEs or subqueries for comprehensive analysis. Scenario: E-commerce customer segmentation. Step 1: Use GROUP BY to create customer summaries (total orders, lifetime value). Step 2: Use PARTITION BY to rank customers within segments and compare to peer averages. Step 3: Join results for complete analysis. This gives you both customer-level summaries (GROUP BY) and individual order context with peer comparisons (PARTITION BY). The pattern: GROUP BY creates the foundation data, PARTITION BY adds analytical context, final query combines insights for actionable business intelligence.

Explain the memory and processing differences between GROUP BY and PARTITION BY. How does this impact query design?

GROUP BY processes all rows but returns only group summaries, using minimal memory for results. PARTITION BY must keep all original rows plus calculated columns, requiring significantly more memory. For 1M input rows: GROUP BY might return 100 summary rows (low memory), PARTITION BY returns 1M rows with context (high memory). This impacts design: use GROUP BY for executive dashboards processing millions of transactions but showing monthly summaries. Use PARTITION BY for analytical applications where users need individual transaction details with contextual comparisons. For very large datasets, consider materialized views with GROUP BY summaries, then PARTITION BY on filtered subsets.

What are the common SQL errors when mixing GROUP BY and non-aggregate columns? How do window functions solve this?

The classic error: "SELECT name, department, AVG(salary) FROM employees GROUP BY department" fails because name is not in GROUP BY or an aggregate function. GROUP BY collapses rows, so individual names cannot be selected when grouping by department. Window functions solve this: "SELECT name, department, salary, AVG(salary) OVER (PARTITION BY department) FROM employees" keeps all individual rows while adding department averages. This is why PARTITION BY is powerful - it provides group context without losing individual detail, eliminating the need to choose between detail and summary.

Write two queries for sales analysis: one using GROUP BY for an executive summary, another using PARTITION BY for individual salesperson analysis.

-- Executive Summary (GROUP BY)

SELECT

region,

COUNT(*) as total_sales,

SUM(sale_amount) as total_revenue,

AVG(sale_amount) as avg_sale_amount,

COUNT(DISTINCT salesperson_id) as active_salespeople

FROM sales

WHERE sale_date >= '2024-01-01'

GROUP BY region

ORDER BY total_revenue DESC;

-- Individual Analysis (PARTITION BY)

SELECT

salesperson_name,

region,

sale_amount,

sale_date,

COUNT(*) OVER (PARTITION BY region) as region_total_sales,

AVG(sale_amount) OVER (PARTITION BY region) as region_avg_sale,

sale_amount - AVG(sale_amount) OVER (PARTITION BY region) as vs_region_avg,

RANK() OVER (PARTITION BY region ORDER BY sale_amount DESC) as region_rank,

PERCENT_RANK() OVER (PARTITION BY region ORDER BY sale_amount) as region_percentile

FROM sales

WHERE sale_date >= '2024-01-01'

ORDER BY region, sale_amount DESC;

This demonstrates the key difference: GROUP BY provides executive summaries for strategic decisions, while PARTITION BY provides individual performance analysis with regional context for coaching and development.

Create a query that combines GROUP BY and PARTITION BY to analyze customer behavior: customer summaries with individual order context.

WITH customer_summaries AS (

-- GROUP BY: Customer-level aggregations