Advanced Analytics Functions: Interview
Module: Window Functions
Explain the difference between PERCENT_RANK and CUME_DIST. When would you use each?
PERCENT_RANK shows relative rank as percentage (0 to 1) using formula (rank-1)/(total-1), while CUME_DIST shows cumulative distribution using (rows ≤ current)/total. PERCENT_RANK is better for performance rankings ("top 10% performers") because it spreads values evenly. CUME_DIST is better for threshold analysis ("80% of customers spend less than $500") because it shows actual distribution. Example: In salary data, PERCENT_RANK helps identify top performers, while CUME_DIST helps set salary bands based on actual distribution.
How do you interpret correlation values, and what are the limitations of correlation analysis in business?
Correlation ranges from -1 to +1: +1 = perfect positive correlation, -1 = perfect negative correlation, 0 = no linear relationship. Values above 0.7 indicate strong correlation, 0.3-0.7 moderate, below 0.3 weak. Limitations: 1) Correlation ≠ causation, 2) Only measures linear relationships, 3) Sensitive to outliers, 4) Requires sufficient sample size. In business: High correlation between marketing spend and sales suggests effectiveness, but external factors (seasonality, competition) may be the real drivers. Always combine correlation with domain knowledge and additional analysis.
When would you use STDDEV vs VARIANCE, and how do they help in business decision making?
STDDEV (standard deviation) is the square root of VARIANCE, making it same unit as original data. Use STDDEV for interpretability ("sales vary by ±$10K"), VARIANCE for mathematical calculations. In business: STDDEV measures consistency - low STDDEV in customer satisfaction means reliable service, high STDDEV in sales means unpredictable revenue. VARIANCE is used in portfolio theory, quality control (process capability), and risk management. Example: STDDEV helps set realistic targets, VARIANCE helps calculate confidence intervals for forecasting.
Write a query to identify the top 20% of customers by spending and analyze their characteristics compared to the overall population.
-- Top 20% customer analysis
WITH customer_spending AS (
SELECT
customer_id,
SUM(order_amount) as total_spent,
COUNT(*) as order_count,
AVG(order_amount) as avg_order_value,
PERCENT_RANK() OVER (ORDER BY SUM(order_amount) DESC) as spending_percentile
FROM orders
GROUP BY customer_id
),
top_customers AS (
SELECT *
FROM customer_spending
WHERE spending_percentile <= 0.2 -- Top 20%
),
comparison_stats AS (
SELECT
'Top 20%' as segment,
AVG(total_spent) as avg_spending,
STDDEV(total_spent) as spending_stddev,
AVG(order_count) as avg_orders,
AVG(avg_order_value) as avg_order_size
FROM top_customers
UNION ALL
SELECT
'All Customers' as segment,
AVG(total_spent),
STDDEV(total_spent),
AVG(order_count),
AVG(avg_order_value)
FROM customer_spending
)