Real-World Performance Tuning Cases: Interview
Module: Query Optimization & Performance
Explain your systematic approach to diagnosing and fixing a slow query in production.
Systematic approach: (1) Measure: Identify slow query using pg_stat_statements or slow query log. Look at total time (execution × calls), not just average. (2) Analyze: Run EXPLAIN ANALYZE to see actual execution plan. Look for red flags: Seq Scan on large table, estimated ≪ actual rows, Disk usage in Sort/Hash, high "Rows Removed by Filter". (3) Diagnose: Identify root causes - often multiple: missing index, stale statistics, inefficient query, memory spill, lock contention. (4) Prioritize: Fix biggest bottleneck first (highest impact). (5) Optimize: Apply appropriate technique: add index, ANALYZE, rewrite query, increase work_mem. (6) Verify: Measure improvement, validate results match, check no regressions. (7) Monitor: Set up alerts for future issues. Document everything: problem, diagnosis, solution, results.
How do you prioritize multiple slow queries when you have limited time?
Prioritize by total impact: execution_time × call_frequency. Example: Query A takes 10 seconds, called 10 times/day = 100 seconds total impact. Query B takes 1 second, called 10,000 times/day = 10,000 seconds total impact. Fix Query B first (100x more impact). Also consider: (1) Effort: Quick wins first (add index = minutes, schema redesign = days). (2) Risk: Low-risk changes first (add index can be dropped, schema change is risky). (3) Business impact: User-facing queries before internal reports. (4) Cascading effects: Fix queries that block others. Framework: Impact = (Speedup × Frequency) / (Effort × Risk). Sort by this score, fix highest first.
A query was optimized and is now 10x faster, but users report incorrect results. How do you handle this?
Immediate action: (1) Rollback optimization immediately - correctness > performance. (2) Investigate difference: Compare results before and after. Use EXCEPT to find differing rows: (original EXCEPT optimized) UNION (optimized EXCEPT original). (3) Identify cause: Common issues - LEFT JOIN changed to INNER JOIN (excludes NULLs), DISTINCT removed (duplicates), subquery rewrite changed semantics, WHERE clause modified. (4) Fix optimization: Adjust to preserve correctness. Example: If INNER JOIN excludes NULLs, use LEFT JOIN with COALESCE. (5) Validate thoroughly: Test edge cases (NULLs, empty sets, duplicates), compare row counts, verify sample results. (6) Re-deploy with validation. (7) Post-mortem: Why did testing miss this? Add test cases for edge cases. Lesson: Always validate results before deploying optimizations. Correctness is non-negotiable.
Dashboard loads in 2 minutes. You have 1 hour to fix it. Walk through your approach.
-- Step 1: Identify slow queries (5 minutes)
SELECT
substring(query, 1, 100) as query_preview,
calls,
total_exec_time / 1000 as total_seconds,
mean_exec_time / 1000 as avg_seconds
FROM pg_stat_statements
WHERE query LIKE '%dashboard%'
ORDER BY total_exec_time DESC
LIMIT 10;
-- Result: 3 queries taking 120 seconds total
-- Step 2: Analyze slowest query (10 minutes)
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;
-- Result: Seq Scan on orders (10M rows), 40 seconds
-- Diagnosis: Missing index
-- Step 3: Quick fix - add index (5 minutes)
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders(customer_id);
-- CONCURRENTLY = no table lock
-- Step 4: Verify (5 minutes)
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;
-- Result: Index Scan, 100ms
-- Speedup: 40s → 0.1s = 400x faster
-- Step 5: Analyze second query (10 minutes)
EXPLAIN ANALYZE
SELECT COUNT(*) FROM orders
GROUP BY status;
-- Result: Estimated 1M, Actual 10M (10x error)
-- Diagnosis: Stale statistics
-- Step 6: Update statistics (5 minutes)
ANALYZE orders;