Real-World Performance Tuning Cases: Interview

Module: Query Optimization & Performance

Explain your systematic approach to diagnosing and fixing a slow query in production.

Systematic approach: (1) Measure: Identify slow query using pg_stat_statements or slow query log. Look at total time (execution × calls), not just average. (2) Analyze: Run EXPLAIN ANALYZE to see actual execution plan. Look for red flags: Seq Scan on large table, estimated ≪ actual rows, Disk usage in Sort/Hash, high "Rows Removed by Filter". (3) Diagnose: Identify root causes - often multiple: missing index, stale statistics, inefficient query, memory spill, lock contention. (4) Prioritize: Fix biggest bottleneck first (highest impact). (5) Optimize: Apply appropriate technique: add index, ANALYZE, rewrite query, increase work_mem. (6) Verify: Measure improvement, validate results match, check no regressions. (7) Monitor: Set up alerts for future issues. Document everything: problem, diagnosis, solution, results.

How do you prioritize multiple slow queries when you have limited time?

Prioritize by total impact: execution_time × call_frequency. Example: Query A takes 10 seconds, called 10 times/day = 100 seconds total impact. Query B takes 1 second, called 10,000 times/day = 10,000 seconds total impact. Fix Query B first (100x more impact). Also consider: (1) Effort: Quick wins first (add index = minutes, schema redesign = days). (2) Risk: Low-risk changes first (add index can be dropped, schema change is risky). (3) Business impact: User-facing queries before internal reports. (4) Cascading effects: Fix queries that block others. Framework: Impact = (Speedup × Frequency) / (Effort × Risk). Sort by this score, fix highest first.

A query was optimized and is now 10x faster, but users report incorrect results. How do you handle this?

Immediate action: (1) Rollback optimization immediately - correctness > performance. (2) Investigate difference: Compare results before and after. Use EXCEPT to find differing rows: (original EXCEPT optimized) UNION (optimized EXCEPT original). (3) Identify cause: Common issues - LEFT JOIN changed to INNER JOIN (excludes NULLs), DISTINCT removed (duplicates), subquery rewrite changed semantics, WHERE clause modified. (4) Fix optimization: Adjust to preserve correctness. Example: If INNER JOIN excludes NULLs, use LEFT JOIN with COALESCE. (5) Validate thoroughly: Test edge cases (NULLs, empty sets, duplicates), compare row counts, verify sample results. (6) Re-deploy with validation. (7) Post-mortem: Why did testing miss this? Add test cases for edge cases. Lesson: Always validate results before deploying optimizations. Correctness is non-negotiable.

Dashboard loads in 2 minutes. You have 1 hour to fix it. Walk through your approach.

-- Step 1: Identify slow queries (5 minutes)

SELECT

substring(query, 1, 100) as query_preview,

calls,

total_exec_time / 1000 as total_seconds,

mean_exec_time / 1000 as avg_seconds

FROM pg_stat_statements

WHERE query LIKE '%dashboard%'

ORDER BY total_exec_time DESC

LIMIT 10;

-- Result: 3 queries taking 120 seconds total

-- Step 2: Analyze slowest query (10 minutes)

EXPLAIN ANALYZE

SELECT * FROM orders WHERE customer_id = 123;

-- Result: Seq Scan on orders (10M rows), 40 seconds

-- Diagnosis: Missing index

-- Step 3: Quick fix - add index (5 minutes)

CREATE INDEX CONCURRENTLY idx_orders_customer_id

ON orders(customer_id);

-- CONCURRENTLY = no table lock

-- Step 4: Verify (5 minutes)

EXPLAIN ANALYZE

SELECT * FROM orders WHERE customer_id = 123;

-- Result: Index Scan, 100ms

-- Speedup: 40s → 0.1s = 400x faster

-- Step 5: Analyze second query (10 minutes)

EXPLAIN ANALYZE

SELECT COUNT(*) FROM orders

GROUP BY status;

-- Result: Estimated 1M, Actual 10M (10x error)

-- Diagnosis: Stale statistics

-- Step 6: Update statistics (5 minutes)

ANALYZE orders;