Advanced Analytics Functions: Interview

Module: Window Functions

Explain the difference between PERCENT_RANK and CUME_DIST. When would you use each?

PERCENT_RANK shows relative rank as percentage (0 to 1) using formula (rank-1)/(total-1), while CUME_DIST shows cumulative distribution using (rows ≤ current)/total. PERCENT_RANK is better for performance rankings ("top 10% performers") because it spreads values evenly. CUME_DIST is better for threshold analysis ("80% of customers spend less than $500") because it shows actual distribution. Example: In salary data, PERCENT_RANK helps identify top performers, while CUME_DIST helps set salary bands based on actual distribution.

How do you interpret correlation values, and what are the limitations of correlation analysis in business?

Correlation ranges from -1 to +1: +1 = perfect positive correlation, -1 = perfect negative correlation, 0 = no linear relationship. Values above 0.7 indicate strong correlation, 0.3-0.7 moderate, below 0.3 weak. Limitations: 1) Correlation ≠ causation, 2) Only measures linear relationships, 3) Sensitive to outliers, 4) Requires sufficient sample size. In business: High correlation between marketing spend and sales suggests effectiveness, but external factors (seasonality, competition) may be the real drivers. Always combine correlation with domain knowledge and additional analysis.

When would you use STDDEV vs VARIANCE, and how do they help in business decision making?

STDDEV (standard deviation) is the square root of VARIANCE, making it same unit as original data. Use STDDEV for interpretability ("sales vary by ±$10K"), VARIANCE for mathematical calculations. In business: STDDEV measures consistency - low STDDEV in customer satisfaction means reliable service, high STDDEV in sales means unpredictable revenue. VARIANCE is used in portfolio theory, quality control (process capability), and risk management. Example: STDDEV helps set realistic targets, VARIANCE helps calculate confidence intervals for forecasting.

Write a query to identify the top 20% of customers by spending and analyze their characteristics compared to the overall population.

-- Top 20% customer analysis

WITH customer_spending AS (

SELECT

customer_id,

SUM(order_amount) as total_spent,

COUNT(*) as order_count,

AVG(order_amount) as avg_order_value,

PERCENT_RANK() OVER (ORDER BY SUM(order_amount) DESC) as spending_percentile

FROM orders

GROUP BY customer_id

top_customers AS (

SELECT *

FROM customer_spending

WHERE spending_percentile <= 0.2 -- Top 20%

comparison_stats AS (

SELECT

'Top 20%' as segment,

AVG(total_spent) as avg_spending,

STDDEV(total_spent) as spending_stddev,

AVG(order_count) as avg_orders,

AVG(avg_order_value) as avg_order_size

FROM top_customers

UNION ALL

SELECT

'All Customers' as segment,

AVG(total_spent),

STDDEV(total_spent),

AVG(order_count),

AVG(avg_order_value)

FROM customer_spending

)