Window Frame Specifications: Interview

Module: Window Functions

Explain the difference between ROWS and RANGE frame types. When would you use each?

ROWS counts physical rows regardless of values - "give me the 5 rows before this one". RANGE groups by ORDER BY values - "give me all rows with ORDER BY values within this range". ROWS is more predictable and commonly used for moving averages, running totals, and most analytical functions. RANGE is useful when you want to treat ties (duplicate ORDER BY values) as a group, or for time-based analysis like "all orders within 30 days". Example: ROWS 3 PRECEDING always gives exactly 3 rows, but RANGE 3 PRECEDING gives all rows where ORDER BY value is 3 less than current row value.

Why does LAST_VALUE often return unexpected results, and how do you fix it?

LAST_VALUE with default frame returns the current row value instead of the actual last value because the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW - it only sees up to the current row, not beyond it. Fix by specifying the full frame: LAST_VALUE(column) OVER (ORDER BY col ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING). This is the most common window function mistake. The same principle applies to FIRST_VALUE for consistency, though it usually works correctly with defaults.

How do window frames affect performance, and what optimization strategies would you use?

Frame performance depends on size and type: smaller frames are faster, ROWS is 20-40% faster than RANGE, and UNBOUNDED frames require more memory. Optimization strategies: 1) Index PARTITION BY and ORDER BY columns, 2) Use fixed-size frames instead of UNBOUNDED when possible, 3) Use ROWS instead of RANGE unless you need tie-handling, 4) Filter data early with WHERE clauses, 5) Use named windows for multiple functions with same frame, 6) Consider pre-aggregation for complex calculations on large datasets. For very large partitions (>1M rows), monitor memory usage with UNBOUNDED frames.

What are the default frame specifications, and why should you avoid relying on them?

Default frames: With ORDER BY = RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (running total behavior). Without ORDER BY = entire partition (UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING). Avoid defaults because: 1) RANGE default often unexpected (groups ties), 2) LAST_VALUE gives wrong results, 3) Code is unclear about intent, 4) Maintenance issues when ORDER BY is added/removed. Always specify explicitly: ROWS BETWEEN X AND Y for clarity and predictability.

How would you implement a centered moving average, and what are its limitations?

Centered moving average uses equal rows before and after current row: AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) for 5-period centered average. Benefits: best smoothing, removes noise effectively, symmetric window. Limitations: 1) Cannot be used for real-time analysis (needs future data), 2) Loses data points at beginning and end of series, 3) Not suitable for forecasting or live dashboards. Use trailing averages (N PRECEDING AND CURRENT ROW) for real-time applications, centered averages for historical analysis and smoothing.

Write a query to calculate 7-day moving average, 30-day moving average, and compare current value to both averages.

SELECT

date,

daily_sales,

-- 7-day moving average

ROUND(AVG(daily_sales) OVER (

ORDER BY date

ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

), 2) as moving_avg_7day,

-- 30-day moving average

ROUND(AVG(daily_sales) OVER (

ORDER BY date

ROWS BETWEEN 29 PRECEDING AND CURRENT ROW

), 2) as moving_avg_30day,

-- Compare current to 7-day average

ROUND(

daily_sales - AVG(daily_sales) OVER (

ORDER BY date

ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

), 2

) as vs_7day_avg,

-- Compare current to 30-day average

ROUND(

daily_sales - AVG(daily_sales) OVER (

ORDER BY date

ROWS BETWEEN 29 PRECEDING AND CURRENT ROW

), 2

) as vs_30day_avg,

-- Trend indicator

CASE