DISTINCT & Removing Duplicates: Performance

Module: SQL Fundamentals

**Performance Impact:**

- DISTINCT requires sorting or hashing all rows (expensive on large datasets)

- Use only when necessary

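- An index on the DISTINCT columns helps performance: the engine can often read values in index order instead of sorting or hashing every row

- Alternative: GROUP BY on the same columns returns the same rows as DISTINCT; add MIN/MAX to keep one value per group from the remaining columns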

- Avoid DISTINCT if the data is already unique (for example, a single-table query whose select list includes the primary key)

- Profile queries: EXPLAIN shows how the engine executes DISTINCT (sort, hash, or index scan) and, in most engines, its estimated cost
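The points above about GROUP BY, indexes, and EXPLAIN can be tried out in a small sandbox. The following is a minimal sketch using Python's built-in `sqlite3` module; the `orders` table, its columns, and the index name `idx_orders_customer` are hypothetical, and `EXPLAIN QUERY PLAN` is SQLite's variant of EXPLAIN (other engines use different syntax and output).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample table, used only for illustration.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 10), ("bob", 25), ("ana", 40), ("bob", 5), ("cy", 7)],
)

# DISTINCT and GROUP BY on the same column return the same set of rows.
distinct_rows = conn.execute(
    "SELECT DISTINCT customer FROM orders ORDER BY customer"
).fetchall()
grouped_rows = conn.execute(
    "SELECT customer FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# GROUP BY with MAX additionally keeps one value per group.
max_per_customer = conn.execute(
    "SELECT customer, MAX(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()


def plan(sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Compare the plan before and after adding an index on the DISTINCT column.
plan_without_index = plan("SELECT DISTINCT customer FROM orders")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_with_index = plan("SELECT DISTINCT customer FROM orders")

print(distinct_rows, grouped_rows, max_per_customer)
print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact plan text varies by SQLite version, so compare the two printed plans rather than relying on specific wording.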

**Benchmark:**

- 1M rows without DISTINCT: 100ms

- 1M rows with DISTINCT: 500ms (5x slower)

- With index on DISTINCT column: 200ms (2x slower than without DISTINCT, 2.5x faster than unindexed DISTINCT)
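Timings like these depend on the engine, hardware, and data, so it is worth measuring on your own system. Below is a rough timing sketch with Python's `sqlite3` and `time.perf_counter`; the `events` table, the row counts, and the helper `timed` are assumptions made for illustration, and the numbers it prints will not match the figures above.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER)")
# Synthetic data: 200,000 rows spread over 1,000 distinct user_id values.
conn.executemany(
    "INSERT INTO events VALUES (?)", ((i % 1000,) for i in range(200_000))
)


def timed(sql):
    """Run a single-value query and return (value, elapsed_seconds)."""
    start = time.perf_counter()
    value = conn.execute(sql).fetchone()[0]
    return value, time.perf_counter() - start


# Counting inside the engine keeps Python's row-fetching cost out of the timing.
plain_count, plain_s = timed("SELECT COUNT(*) FROM (SELECT user_id FROM events)")
distinct_count, distinct_s = timed(
    "SELECT COUNT(*) FROM (SELECT DISTINCT user_id FROM events)"
)

conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
indexed_count, indexed_s = timed(
    "SELECT COUNT(*) FROM (SELECT DISTINCT user_id FROM events)"
)

print(f"without DISTINCT:      {plain_count} rows, {plain_s * 1000:.1f} ms")
print(f"with DISTINCT:         {distinct_count} rows, {distinct_s * 1000:.1f} ms")
print(f"with DISTINCT + index: {indexed_count} rows, {indexed_s * 1000:.1f} ms")
```

A single run is noisy; repeating each query several times and taking the median gives a steadier comparison.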

For "existence" queries, use EXISTS instead of DISTINCT + JOIN: EXISTS can stop at the first matching row and never produces duplicates that need removing
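To make the EXISTS advice concrete, here is a small sketch with Python's `sqlite3`. The `customers` and `orders` tables and their contents are hypothetical; the two queries return the same customers here because the names are unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one customer can have many orders.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ana'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders (customer_id, amount) VALUES (1, 10), (1, 40), (2, 5);
""")

# The JOIN repeats 'ana' once per order, so DISTINCT is needed to collapse the rows.
join_distinct = conn.execute("""
    SELECT DISTINCT c.name
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

# EXISTS only asks whether at least one order exists; no duplicates are produced.
exists_rows = conn.execute("""
    SELECT c.name
    FROM customers AS c
    WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()

print(join_distinct)
print(exists_rows)
```

Customer 'cy' has no orders and is absent from both results.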

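DISTINCT applies to the entire selected row, not to individual columns

Each column added to the select list creates more combinations that count as "unique", so the result can contain more rows

To aggregate over unique values, put DISTINCT inside the aggregate, as in COUNT(DISTINCT col); SELECT DISTINCT COUNT(col) only deduplicates the single result row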
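The row-level behavior of DISTINCT and its placement relative to aggregates can be seen in a short sketch. This uses Python's `sqlite3` with a hypothetical `visits` table; the table and column names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: four visits, two cities, two devices.
conn.executescript("""
    CREATE TABLE visits (city TEXT, device TEXT);
    INSERT INTO visits VALUES
        ('Lima', 'phone'), ('Lima', 'laptop'), ('Lima', 'phone'), ('Oslo', 'phone');
""")

# One column: duplicates are judged on city alone.
one_column = conn.execute(
    "SELECT DISTINCT city FROM visits ORDER BY city"
).fetchall()

# Two columns: duplicates are judged on the (city, device) pair, so more rows survive.
two_columns = conn.execute(
    "SELECT DISTINCT city, device FROM visits ORDER BY city, device"
).fetchall()

# DISTINCT inside the aggregate counts unique cities.
distinct_cities = conn.execute(
    "SELECT COUNT(DISTINCT city) FROM visits"
).fetchone()[0]

# DISTINCT outside the aggregate deduplicates the one-row result; it still counts all rows.
outer_distinct = conn.execute(
    "SELECT DISTINCT COUNT(city) FROM visits"
).fetchone()[0]

print(one_column)       # [('Lima',), ('Oslo',)]
print(two_columns)      # [('Lima', 'laptop'), ('Lima', 'phone'), ('Oslo', 'phone')]
print(distinct_cities)  # 2
print(outer_distinct)   # 4
```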

DISTINCT carries a performance cost on large datasets

Using DISTINCT to "fix" duplicates from a bad JOIN, instead of fixing the JOIN condition, hides the bug rather than removing it
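The sketch below illustrates the bad-JOIN case with Python's `sqlite3`. The `orders` and `prices` tables are hypothetical: prices are keyed by product and region, and the faulty query joins on product alone. DISTINCT makes the row list look right, but an aggregate over the same join is still wrong.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a price is identified by (product_id, region).
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, product_id INTEGER, region TEXT, qty INTEGER);
    CREATE TABLE prices (product_id INTEGER, region TEXT, price INTEGER);
    INSERT INTO orders VALUES (1, 100, 'EU', 2), (2, 100, 'US', 1);
    INSERT INTO prices VALUES (100, 'EU', 10), (100, 'US', 12);
""")

# Incomplete join condition: each order matches both regional prices.
BAD_JOIN = "FROM orders AS o JOIN prices AS p ON p.product_id = o.product_id"
# Complete join condition: each order matches exactly one price.
GOOD_JOIN = BAD_JOIN + " AND p.region = o.region"

bad_rows = conn.execute(f"SELECT o.id {BAD_JOIN} ORDER BY o.id").fetchall()
# DISTINCT hides the duplicated rows but does not repair the join.
masked_rows = conn.execute(f"SELECT DISTINCT o.id {BAD_JOIN} ORDER BY o.id").fetchall()
bad_total = conn.execute(f"SELECT SUM(o.qty * p.price) {BAD_JOIN}").fetchone()[0]

good_rows = conn.execute(f"SELECT o.id {GOOD_JOIN} ORDER BY o.id").fetchall()
good_total = conn.execute(f"SELECT SUM(o.qty * p.price) {GOOD_JOIN}").fetchone()[0]

print(bad_rows)     # [(1,), (1,), (2,), (2,)]
print(masked_rows)  # [(1,), (2,)]
print(bad_total)    # 66
print(good_rows)    # [(1,), (2,)]
print(good_total)   # 32
```

With the join condition corrected, no DISTINCT is needed and the total is computed from one price per order.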