# Data Analyst SQL Interview Questions: Complete Preparation Guide 2025
## Introduction
Data analyst roles require strong SQL skills for extracting insights from data, creating reports, and supporting business decisions. This guide covers the SQL interview questions most commonly asked in data analyst interviews. The examples use PostgreSQL syntax (DATE_TRUNC, AGE, PERCENTILE_CONT, and similar), so adjust for your dialect as needed.
## Core Data Analysis SQL Concepts
### 1. Aggregate Functions and GROUP BY
**Question 1: Calculate monthly sales totals**
```sql
SELECT
    DATE_TRUNC('month', order_date) as month,
    SUM(total_amount) as monthly_sales,
    COUNT(*) as order_count,
    AVG(total_amount) as avg_order_value
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
```
**Question 2: Find top-performing products by category**
```sql
SELECT
    p.category,
    p.product_name,
    SUM(oi.quantity * p.price) as total_revenue -- assumes price lives on products
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.category, p.product_name
HAVING SUM(oi.quantity * p.price) > 10000
ORDER BY p.category, total_revenue DESC;
```
### 2. Window Functions for Analytics
**Question 3: Calculate running totals and moving averages**
```sql
SELECT
    order_date,
    daily_sales,
    SUM(daily_sales) OVER (ORDER BY order_date) as running_total,
    AVG(daily_sales) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as moving_avg_7_days,
    LAG(daily_sales, 1) OVER (ORDER BY order_date) as previous_day_sales
FROM (
    SELECT
        order_date,
        SUM(total_amount) as daily_sales
    FROM orders
    GROUP BY order_date
) daily_totals
ORDER BY order_date;
```
**Question 4: Rank customers by purchase behavior**
```sql
SELECT
    customer_id,
    customer_name,
    total_spent,
    order_count,
    RANK() OVER (ORDER BY total_spent DESC) as spending_rank,
    NTILE(4) OVER (ORDER BY total_spent DESC) as quartile,
    -- Each NTILE call below returns the same tile for a given row,
    -- so repeating it inside the CASE is verbose but correct
    CASE
        WHEN NTILE(4) OVER (ORDER BY total_spent DESC) = 1 THEN 'High Value'
        WHEN NTILE(4) OVER (ORDER BY total_spent DESC) = 2 THEN 'Medium-High Value'
        WHEN NTILE(4) OVER (ORDER BY total_spent DESC) = 3 THEN 'Medium Value'
        ELSE 'Low Value'
    END as customer_segment
FROM (
    SELECT
        c.customer_id,
        c.customer_name,
        SUM(o.total_amount) as total_spent,
        COUNT(o.order_id) as order_count
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name
) customer_metrics;
```
## Business Intelligence Scenarios
### 1. Cohort Analysis
**Question 5: Customer retention cohort analysis**
Group customers by the month of their first purchase, then track how many from each cohort place an order in each subsequent month.
```sql
WITH first_purchase AS (
    SELECT
        customer_id,
        MIN(order_date) as first_purchase_date,
        DATE_TRUNC('month', MIN(order_date)) as cohort_month
    FROM orders
    GROUP BY customer_id
),
monthly_activity AS (
    SELECT
        fp.customer_id,
        fp.cohort_month,
        DATE_TRUNC('month', o.order_date) as activity_month,
        -- AGE() splits the gap into years and months, so combine both parts;
        -- EXTRACT(MONTH ...) alone wraps back to 0 after 12 months
        EXTRACT(YEAR FROM AGE(o.order_date, fp.first_purchase_date)) * 12
            + EXTRACT(MONTH FROM AGE(o.order_date, fp.first_purchase_date)) as months_since_first
    FROM first_purchase fp
    JOIN orders o ON fp.customer_id = o.customer_id
)
SELECT
    cohort_month,
    months_since_first,
    COUNT(DISTINCT customer_id) as active_customers,
    ROUND(
        COUNT(DISTINCT customer_id) * 100.0 /
        FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
            PARTITION BY cohort_month
            ORDER BY months_since_first
        ), 2
    ) as retention_rate
FROM monthly_activity
GROUP BY cohort_month, months_since_first
ORDER BY cohort_month, months_since_first;
```
### 2. Sales Performance Analysis
**Question 6: Year-over-year growth analysis**
```sql
WITH monthly_sales AS (
    SELECT
        EXTRACT(YEAR FROM order_date) as year,
        EXTRACT(MONTH FROM order_date) as month,
        SUM(total_amount) as monthly_revenue
    FROM orders
    GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date)
)
SELECT
    year,
    month,
    monthly_revenue,
    -- LAG(..., 12) assumes an unbroken monthly series; a month with no
    -- orders would misalign the year-over-year comparison
    LAG(monthly_revenue, 12) OVER (ORDER BY year, month) as same_month_last_year,
    ROUND(
        (monthly_revenue - LAG(monthly_revenue, 12) OVER (ORDER BY year, month)) * 100.0 /
        LAG(monthly_revenue, 12) OVER (ORDER BY year, month), 2
    ) as yoy_growth_percent
FROM monthly_sales
ORDER BY year, month;
```
## Data Quality and Validation
### 1. Data Completeness Checks
**Question 7: Identify data quality issues**
```sql
-- Check for missing values and data quality issues.
-- Note: UNION ALL takes its column names from the first SELECT, so the
-- orders row reuses the customers headers; the zero placeholders only
-- pad the column count to keep the two branches compatible.
SELECT
    'customers' as table_name,
    COUNT(*) as total_records,
    COUNT(customer_name) as non_null_names,
    COUNT(*) - COUNT(customer_name) as missing_names,
    COUNT(email) as non_null_emails,
    COUNT(*) - COUNT(email) as missing_emails,
    COUNT(DISTINCT email) as unique_emails,
    COUNT(email) - COUNT(DISTINCT email) as duplicate_emails
FROM customers
UNION ALL
SELECT
    'orders' as table_name,
    COUNT(*) as total_records,
    COUNT(customer_id) as non_null_customer_ids,
    COUNT(*) - COUNT(customer_id) as missing_customer_ids,
    COUNT(CASE WHEN total_amount <= 0 THEN 1 END) as invalid_amounts,
    COUNT(CASE WHEN order_date > CURRENT_DATE THEN 1 END) as future_dates,
    0 as placeholder1,
    0 as placeholder2
FROM orders;
```
### 2. Outlier Detection
**Question 8: Find statistical outliers in sales data**
```sql
WITH sales_stats AS (
    SELECT
        AVG(total_amount) as mean_amount,
        STDDEV(total_amount) as std_amount,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) as q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) as q3
    FROM orders
),
outlier_bounds AS (
    SELECT
        q1 - 1.5 * (q3 - q1) as lower_bound,
        q3 + 1.5 * (q3 - q1) as upper_bound,
        mean_amount - 3 * std_amount as lower_3std,
        mean_amount + 3 * std_amount as upper_3std
    FROM sales_stats
)
SELECT
    o.order_id,
    o.customer_id,
    o.total_amount,
    CASE
        -- Test the stricter 3-sigma bound first; the 1.5 * IQR fence is
        -- usually narrower, so checking it first would capture every
        -- extreme value and the 3-sigma label would never fire
        WHEN o.total_amount < ob.lower_3std OR o.total_amount > ob.upper_3std
            THEN '3-Sigma Outlier'
        WHEN o.total_amount < ob.lower_bound OR o.total_amount > ob.upper_bound
            THEN 'IQR Outlier'
        ELSE 'Normal'
    END as outlier_type
FROM orders o
CROSS JOIN outlier_bounds ob
WHERE o.total_amount < ob.lower_bound
    OR o.total_amount > ob.upper_bound
    OR o.total_amount < ob.lower_3std
    OR o.total_amount > ob.upper_3std
ORDER BY o.total_amount DESC;
```
## Advanced Analytics Queries
### 1. Customer Lifetime Value (CLV)
**Question 9: Calculate customer lifetime value**
```sql
WITH customer_metrics AS (
    SELECT
        customer_id,
        MIN(order_date) as first_purchase,
        MAX(order_date) as last_purchase,
        COUNT(DISTINCT order_id) as total_orders,
        SUM(total_amount) as total_spent,
        AVG(total_amount) as avg_order_value,
        -- Subtracting two DATEs in PostgreSQL yields an integer day count
        -- (for timestamps, use EXTRACT(DAY FROM ...) on the interval instead)
        (MAX(order_date) - MIN(order_date)) + 1 as customer_lifespan_days
    FROM orders
    GROUP BY customer_id
)
SELECT
    customer_id,
    total_orders,
    total_spent,
    avg_order_value,
    customer_lifespan_days,
    CASE
        WHEN customer_lifespan_days > 0
            THEN ROUND(total_orders::DECIMAL / (customer_lifespan_days / 365.0), 2)
        ELSE total_orders
    END as purchase_frequency_per_year,
    CASE
        WHEN customer_lifespan_days > 0
            THEN ROUND(
                avg_order_value *
                (total_orders::DECIMAL / (customer_lifespan_days / 365.0)) *
                2.0, -- Assumed 2-year customer lifetime
                2
            )
        ELSE total_spent
    END as estimated_clv
FROM customer_metrics
WHERE total_orders >= 2 -- Focus on repeat customers
ORDER BY estimated_clv DESC;
```
### 2. Market Basket Analysis
**Question 10: Find frequently bought together products**
```sql
WITH order_products AS (
    -- DISTINCT so a product repeated within one order counts once per order
    SELECT DISTINCT
        o.order_id,
        p.product_name,
        p.category
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
),
product_pairs AS (
    SELECT
        op1.product_name as product_a,
        op2.product_name as product_b,
        COUNT(*) as co_occurrence_count
    FROM order_products op1
    JOIN order_products op2 ON op1.order_id = op2.order_id
    WHERE op1.product_name < op2.product_name -- Avoid duplicate pairs
    GROUP BY op1.product_name, op2.product_name
),
product_totals AS (
    SELECT
        product_name,
        COUNT(DISTINCT order_id) as total_orders
    FROM order_products
    GROUP BY product_name
)
SELECT
    pp.product_a,
    pp.product_b,
    pp.co_occurrence_count,
    pt1.total_orders as product_a_orders,
    pt2.total_orders as product_b_orders,
    -- Share of A's orders that also contain B: in market basket terms this
    -- is the confidence of the rule A -> B (support would divide by the
    -- total number of orders instead)
    ROUND(
        pp.co_occurrence_count * 100.0 / pt1.total_orders, 2
    ) as confidence_a_to_b,
    ROUND(
        pp.co_occurrence_count * 100.0 / pt2.total_orders, 2
    ) as confidence_b_to_a
FROM product_pairs pp
JOIN product_totals pt1 ON pp.product_a = pt1.product_name
JOIN product_totals pt2 ON pp.product_b = pt2.product_name
WHERE pp.co_occurrence_count >= 5 -- Minimum co-occurrence threshold
ORDER BY pp.co_occurrence_count DESC;
```
## Reporting and Dashboard Queries
### 1. Executive Dashboard Metrics
**Question 11: Create executive summary metrics**
```sql
WITH current_month AS (
    SELECT DATE_TRUNC('month', CURRENT_DATE) as month_start
),
previous_month AS (
    SELECT DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month') as month_start
),
current_metrics AS (
    SELECT
        COUNT(DISTINCT o.order_id) as orders,
        COUNT(DISTINCT o.customer_id) as customers,
        SUM(o.total_amount) as revenue,
        AVG(o.total_amount) as avg_order_value
    FROM orders o
    CROSS JOIN current_month cm
    WHERE o.order_date >= cm.month_start
        AND o.order_date < cm.month_start + INTERVAL '1 month'
),
previous_metrics AS (
    SELECT
        COUNT(DISTINCT o.order_id) as orders,
        COUNT(DISTINCT o.customer_id) as customers,
        SUM(o.total_amount) as revenue,
        AVG(o.total_amount) as avg_order_value
    FROM orders o
    CROSS JOIN previous_month pm
    WHERE o.order_date >= pm.month_start
        AND o.order_date < pm.month_start + INTERVAL '1 month'
)
SELECT
    'Current Month' as period,
    cm.orders,
    cm.customers,
    cm.revenue,
    cm.avg_order_value,
    -- NULLIF guards against division by zero when the previous month is empty
    ROUND((cm.orders - pm.orders) * 100.0 / NULLIF(pm.orders, 0), 2) as orders_growth,
    ROUND((cm.customers - pm.customers) * 100.0 / NULLIF(pm.customers, 0), 2) as customers_growth,
    ROUND((cm.revenue - pm.revenue) * 100.0 / NULLIF(pm.revenue, 0), 2) as revenue_growth
FROM current_metrics cm
CROSS JOIN previous_metrics pm;
```
## Common Data Analyst Interview Scenarios
### Scenario 1: A/B Test Analysis
**Question: Analyze the results of an A/B test on website conversion rates**
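The query below approximates an unpooled two-proportion z-test. For conversion proportions p1 and p2 from samples of size n1 and n2, the statistic is z = (p2 - p1) / sqrt(p1(1 - p1)/n1 + p2(1 - p2)/n2). The query keeps the rates on the 0-100 percent scale, which works because the scaling factors cancel in the ratio.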
```sql
WITH test_results AS (
    SELECT
        test_group,
        COUNT(*) as total_users,
        COUNT(CASE WHEN converted = true THEN 1 END) as conversions,
        ROUND(
            COUNT(CASE WHEN converted = true THEN 1 END) * 100.0 / COUNT(*), 2
        ) as conversion_rate
    FROM ab_test_data
    GROUP BY test_group
),
statistical_significance AS (
    SELECT
        tr1.test_group as group_a,
        tr2.test_group as group_b,
        tr1.conversion_rate as rate_a,
        tr2.conversion_rate as rate_b,
        tr2.conversion_rate - tr1.conversion_rate as rate_difference,
        -- Unpooled two-proportion z-test on the percent scale; the scaling
        -- cancels, though the rounded rates lose a little precision
        ABS(tr2.conversion_rate - tr1.conversion_rate) /
        SQRT(
            (tr1.conversion_rate * (100 - tr1.conversion_rate) / tr1.total_users) +
            (tr2.conversion_rate * (100 - tr2.conversion_rate) / tr2.total_users)
        ) as z_score
    FROM test_results tr1
    CROSS JOIN test_results tr2
    WHERE tr1.test_group = 'control' AND tr2.test_group = 'treatment'
)
SELECT
    *,
    CASE
        WHEN z_score > 1.96 THEN 'Statistically Significant (95% confidence)'
        WHEN z_score > 1.645 THEN 'Statistically Significant (90% confidence)'
        ELSE 'Not Statistically Significant'
    END as significance_level
FROM statistical_significance;
```
### Scenario 2: Churn Prediction Data
**Question: Identify customers at risk of churning**
```sql
WITH customer_activity AS (
    SELECT
        customer_id,
        MAX(order_date) as last_order_date,
        COUNT(*) as total_orders,
        SUM(total_amount) as total_spent,
        AVG(total_amount) as avg_order_value,
        -- DATE subtraction yields an integer day count in PostgreSQL
        -- (for timestamps, use EXTRACT(DAY FROM ...) on the interval)
        CURRENT_DATE - MAX(order_date) as days_since_last_order
    FROM orders
    GROUP BY customer_id
),
churn_indicators AS (
    SELECT
        ca.*,
        CASE
            WHEN days_since_last_order > 90 THEN 'High Risk'
            WHEN days_since_last_order > 60 THEN 'Medium Risk'
            WHEN days_since_last_order > 30 THEN 'Low Risk'
            ELSE 'Active'
        END as churn_risk,
        NTILE(5) OVER (ORDER BY total_spent DESC) as value_quintile
    FROM customer_activity ca
)
SELECT
    churn_risk,
    value_quintile,
    COUNT(*) as customer_count,
    AVG(days_since_last_order) as avg_days_inactive,
    AVG(total_spent) as avg_customer_value,
    SUM(total_spent) as total_at_risk_value
FROM churn_indicators
WHERE churn_risk IN ('High Risk', 'Medium Risk')
GROUP BY churn_risk, value_quintile
ORDER BY churn_risk, value_quintile DESC;
```
## Interview Tips for Data Analysts
### 1. Business Context Understanding
- Always ask about the business problem before writing SQL
- Consider data limitations and assumptions
- Think about actionable insights from your analysis
### 2. Data Validation Approach
- Check for data quality issues first
- Validate results with business logic (see the reconciliation sketch after this list)
- Consider edge cases and outliers
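"Validate results with business logic" can be as simple as a reconciliation query. Here is a minimal sketch, assuming the schema used throughout this guide (with `price` on `products`, as in Question 2): every order's stored total should match the sum of its line items.
```sql
-- Flag orders whose stored header total disagrees with their line items
SELECT
    o.order_id,
    o.total_amount as header_total,
    SUM(oi.quantity * p.price) as line_item_total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY o.order_id, o.total_amount
HAVING o.total_amount <> SUM(oi.quantity * p.price);
```
Any rows returned are either data quality problems or legitimate business rules (discounts, shipping fees) that your analysis needs to account for.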
### 3. Performance Considerations
- Use appropriate indexes for large datasets (see the indexing sketch after this list)
- Consider query execution time for reporting
- Optimize for readability and maintainability
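As a minimal sketch of the indexing point, assuming PostgreSQL and the `orders` table used throughout: index the columns your reports filter and group on, then confirm with `EXPLAIN ANALYZE` that the planner actually uses the index.
```sql
-- Most reporting queries in this guide filter or group on order_date
CREATE INDEX IF NOT EXISTS idx_orders_order_date ON orders (order_date);

-- A composite index covers per-customer history lookups
CREATE INDEX IF NOT EXISTS idx_orders_customer_date ON orders (customer_id, order_date);

-- Inspect the plan to verify the index is used and see actual timings
EXPLAIN ANALYZE
SELECT DATE_TRUNC('month', order_date) as month, SUM(total_amount)
FROM orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY 1;
```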
### 4. Communication Skills
- Explain your analytical approach
- Discuss limitations and assumptions
- Suggest follow-up analyses
## Common Mistakes to Avoid
1. **Not considering NULL values in calculations** (see the sketch after this list)
2. **Forgetting to handle duplicate records**
3. **Using inappropriate aggregation functions**
4. **Not validating results against business logic**
5. **Ignoring data quality issues**
6. **Over-complicating simple queries**
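To make the first two mistakes concrete, here is a minimal sketch against a hypothetical `payments` table with a nullable `amount` column and possible duplicate loads; it contrasts naive aggregates with NULL-aware, deduplicated versions.
```sql
-- Hypothetical table: payments(payment_id, customer_id, amount NUMERIC NULL)
SELECT
    COUNT(*) as total_rows,
    COUNT(amount) as rows_with_amount,      -- COUNT(column) skips NULLs
    AVG(amount) as avg_of_non_null_rows,    -- NULLs are silently excluded
    AVG(COALESCE(amount, 0)) as avg_counting_null_as_zero,
    SUM(amount) as naive_sum,
    (SELECT SUM(amount)
     FROM (SELECT DISTINCT payment_id, amount FROM payments) d
    ) as deduplicated_sum                   -- guards against double-loaded rows
FROM payments;
```
The two averages diverge whenever NULLs exist; which one is correct depends on whether a missing amount means zero or unknown.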
## Conclusion
Data analyst SQL interviews focus on practical business scenarios and analytical thinking. Success requires:
- **Strong foundation in SQL fundamentals**
- **Understanding of statistical concepts**
- **Business acumen and domain knowledge**
- **Data quality awareness**
- **Clear communication of insights**
Practice with real datasets and business scenarios to build confidence in translating business questions into SQL queries.
Ready to practice data analysis scenarios? [Start your SQL interview preparation →](/interview)