DISTINCT & Removing Duplicates: Performance
Module: SQL Fundamentals
**Performance Impact:**
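- DISTINCT must sort or hash every row in the result set, which is expensive on large datasets
- Use it only when duplicates are actually possible and unwanted
- An index on the DISTINCT columns can let the engine read values in order and skip the separate sort or hash step
- Alternative: GROUP BY on the same columns returns the same rows; add MIN/MAX when you need one representative value from the other columns
- Avoid DISTINCT when the data is already unique (for example, a single-table query that selects the primary key)
- Profile queries: EXPLAIN shows the extra sort or hash step that DISTINCT adds and, in engines that report it, the estimated cost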
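
The points about indexes, GROUP BY, and EXPLAIN can be checked on a small scale. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` table, its columns, and the index name are hypothetical, and the plan text differs between engines and SQLite versions, so treat it as an illustration rather than a reference output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
# 1000 hypothetical orders spread over 50 customers.
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 50, float(i)) for i in range(1000)],
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT DISTINCT customer_id FROM orders"

# Plan before any index exists on the DISTINCT column.
plan_without_index = plan(query)

# Plan after indexing the DISTINCT column.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = plan(query)

# DISTINCT and GROUP BY on the same column return the same rows.
distinct_rows = conn.execute(query + " ORDER BY customer_id").fetchall()
group_rows = conn.execute(
    "SELECT customer_id FROM orders GROUP BY customer_id ORDER BY customer_id"
).fetchall()

# GROUP BY with MAX keeps one representative value per customer.
largest_per_customer = conn.execute(
    "SELECT customer_id, MAX(amount) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

Comparing the two printed plans shows whether the engine still needs a separate deduplication step once the index exists.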
**Benchmark:**
- 1M rows without DISTINCT: 100 ms
- 1M rows with DISTINCT: 500 ms (5x slower)
- 1M rows with DISTINCT and an index on the DISTINCT column: 200 ms (2x slower than without DISTINCT)
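
Timings like these depend on the engine, hardware, and data distribution, so it is worth measuring on your own data. A minimal timing sketch, assuming SQLite in memory and a hypothetical `events` table with a `user_id` column (100,000 rows here to keep it quick); each query is wrapped in `COUNT(*)` so that only one row is fetched and the measurement is not dominated by transferring results:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER)")
# 100,000 hypothetical rows with 1,000 distinct user ids.
conn.executemany(
    "INSERT INTO events VALUES (?)", ((i % 1000,) for i in range(100_000))
)


def timed(sql):
    """Run a single-value query; return (value, elapsed seconds)."""
    start = time.perf_counter()
    value = conn.execute(sql).fetchone()[0]
    return value, time.perf_counter() - start


plain_count, plain_secs = timed(
    "SELECT COUNT(*) FROM (SELECT user_id FROM events)"
)
distinct_count, distinct_secs = timed(
    "SELECT COUNT(*) FROM (SELECT DISTINCT user_id FROM events)"
)

# Repeat the DISTINCT query after indexing the column.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
indexed_count, indexed_secs = timed(
    "SELECT COUNT(*) FROM (SELECT DISTINCT user_id FROM events)"
)

print(f"no DISTINCT:          {plain_secs * 1000:.1f} ms")
print(f"DISTINCT:             {distinct_secs * 1000:.1f} ms")
print(f"DISTINCT with index:  {indexed_secs * 1000:.1f} ms")
```

A single run is noisy; repeating each query several times and taking the median gives a more trustworthy comparison.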
For existence checks, use EXISTS instead of DISTINCT with a JOIN: EXISTS can stop at the first match and never produces duplicate rows that need removing.
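
As an illustration of the EXISTS rewrite, the sketch below finds customers who have at least one order in two ways. The `customers` and `orders` tables and their contents are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders (customer_id) VALUES (1), (1), (1), (2);
    """
)

# JOIN produces one row per order, so DISTINCT is needed to collapse them.
join_distinct = conn.execute(
    """
    SELECT DISTINCT c.id, c.name
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()

# EXISTS only asks whether a matching order exists; no duplicates arise.
with_exists = conn.execute(
    """
    SELECT c.id, c.name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.id
    """
).fetchall()

print(join_distinct)
print(with_exists)
```

Both queries return the same customers, but the EXISTS form states the intent directly and leaves no duplicate rows for the engine to remove.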
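DISTINCT applies to the entire selected row, not to individual columns.
Each column added to the SELECT list can only increase the number of distinct combinations, so fewer rows are collapsed.
To aggregate over unique values, put DISTINCT inside the aggregate, as in COUNT(DISTINCT col); SELECT DISTINCT COUNT(col) deduplicates the result rows, not the values being counted.
DISTINCT carries a performance cost on large datasets.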
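
The row-level behavior and the aggregate placement are easy to see on a tiny table. This is a minimal sketch with a hypothetical `staff` table, run against in-memory SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staff (dept TEXT, city TEXT);
    INSERT INTO staff VALUES
        ('Sales', 'Oslo'),
        ('Sales', 'Oslo'),
        ('Sales', 'Bergen'),
        ('Ops',   'Oslo');
    """
)

# One column: duplicates of dept collapse to two rows.
one_col = conn.execute(
    "SELECT DISTINCT dept FROM staff ORDER BY dept"
).fetchall()

# Two columns: DISTINCT compares whole (dept, city) rows, so three remain.
two_cols = conn.execute(
    "SELECT DISTINCT dept, city FROM staff ORDER BY dept, city"
).fetchall()

# DISTINCT inside the aggregate counts unique dept values.
count_of_distinct = conn.execute(
    "SELECT COUNT(DISTINCT dept) FROM staff"
).fetchone()[0]

# DISTINCT outside the aggregate only deduplicates the single result row.
distinct_of_count = conn.execute(
    "SELECT DISTINCT COUNT(dept) FROM staff"
).fetchone()[0]

print(one_col, two_cols, count_of_distinct, distinct_of_count)
```

Adding `city` to the SELECT list turns two result rows into three, and only `COUNT(DISTINCT dept)` answers "how many departments are there".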
Common mistake: using DISTINCT to hide duplicate rows produced by a bad JOIN instead of fixing the JOIN condition itself.
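
A sketch of how this goes wrong, assuming hypothetical `sales` and `prices` tables where a price is identified by both `product_id` and `region`. Joining on `product_id` alone multiplies rows; DISTINCT makes the row count look right while the underlying join is still wrong.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (id INTEGER PRIMARY KEY, product_id INTEGER, region TEXT);
    CREATE TABLE prices (product_id INTEGER, region TEXT, price REAL);
    INSERT INTO sales VALUES (1, 10, 'EU'), (2, 10, 'US');
    INSERT INTO prices VALUES (10, 'EU', 9.0), (10, 'US', 12.0);
    """
)

# Incomplete join condition: every sale matches every regional price.
buggy = conn.execute(
    """
    SELECT s.id
    FROM sales s
    JOIN prices p ON p.product_id = s.product_id
    ORDER BY s.id
    """
).fetchall()

# DISTINCT hides the duplicated ids but does not repair the join.
masked = conn.execute(
    """
    SELECT DISTINCT s.id
    FROM sales s
    JOIN prices p ON p.product_id = s.product_id
    ORDER BY s.id
    """
).fetchall()

# Joining on the full key gives one correct price per sale, no DISTINCT needed.
fixed = conn.execute(
    """
    SELECT s.id, p.price
    FROM sales s
    JOIN prices p
      ON p.product_id = s.product_id
     AND p.region = s.region
    ORDER BY s.id
    """
).fetchall()

print(buggy, masked, fixed)
```

The masked query returns the expected two ids, yet any price or total computed from that join would still be wrong; the fixed join removes the duplicates at their source.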