DISTINCT & Removing Duplicates: Concept
Module: SQL Fundamentals
DISTINCT removes duplicate rows from query results based on ALL selected columns. Only rows where the entire combination of values is unique are returned.
**Why Duplicates Occur:**
Duplicates appear when:
• Joining tables creates multiple matching rows
• Selecting non-unique columns (multiple employees in same department)
• Combining data from multiple sources with UNION ALL (UNION ALL keeps duplicates; plain UNION removes them)
• Querying denormalized tables with repeated data
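As a minimal sketch of the join case, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The `customers` and `orders` tables and their rows are hypothetical, invented only for illustration. Each customer name is repeated once per matching order, and adding DISTINCT collapses the repeats.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# The join returns one row per order, so 'Ada' appears twice.
joined = conn.execute("""
    SELECT c.name
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

# DISTINCT collapses the repeated names into one row each.
unique = conn.execute("""
    SELECT DISTINCT c.name
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

print(joined)  # [('Ada',), ('Ada',), ('Grace',)]
print(unique)  # [('Ada',), ('Grace',)]
```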
**DISTINCT Keyword:**
DISTINCT removes duplicate rows based on ALL selected columns:
• SELECT DISTINCT column1, column2 returns unique combinations
• Applied after WHERE filtering but before ORDER BY
• Compares entire row, not individual columns
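• NULL values are treated as equal to each other for deduplication (two NULLs collapse into one row), even though the comparison NULL = NULL does not evaluate to true in a WHERE clause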
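The behaviors listed above can be checked with a small sketch, again using Python's sqlite3 module and a hypothetical `employees` table with made-up rows. It shows that DISTINCT works on the combination of selected columns, that WHERE filters before deduplication, and that rows with NULL in the same column collapse into one.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, city TEXT);
    INSERT INTO employees VALUES
        ('Ada',   'Engineering', 'London'),
        ('Grace', 'Engineering', 'London'),
        ('Linus', 'Engineering', 'Helsinki'),
        ('Edgar', NULL,          'London'),
        ('Tim',   NULL,          'London');
""")

# One column: 'Engineering' and NULL each appear once.
one_col = conn.execute(
    "SELECT DISTINCT department FROM employees"
).fetchall()

# Two columns: uniqueness is judged on the (department, city) pair.
two_cols = conn.execute(
    "SELECT DISTINCT department, city FROM employees"
).fetchall()

# WHERE runs first, so only Helsinki rows are deduplicated.
filtered = conn.execute(
    "SELECT DISTINCT department FROM employees WHERE city = 'Helsinki'"
).fetchall()

print(len(one_col))   # 2
print(len(two_cols))  # 3
print(filtered)       # [('Engineering',)]
```

Without an ORDER BY, the order of the deduplicated rows is not guaranteed, which is why the sketch prints row counts rather than the full lists.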
**DISTINCT vs GROUP BY:**
Both remove duplicates, but serve different purposes:
DISTINCT:
• Returns unique rows without aggregation
• Simpler syntax for basic deduplication
• Does not compute per-group aggregates on its own (COUNT(DISTINCT column) is a separate form that counts unique values inside an aggregate)
• Example: SELECT DISTINCT department FROM employees
GROUP BY:
• Groups rows for aggregation
• Enables COUNT, SUM, AVG, MIN, MAX
• More flexible, at the cost of more verbose syntax
• Example: SELECT department, COUNT(*) FROM employees GROUP BY department
Rule: Use DISTINCT for simple deduplication. Use GROUP BY when you need counts or aggregates.
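To make the rule concrete, here is a minimal sketch using Python's sqlite3 module and a hypothetical `employees` table (the rows are invented for illustration). The first two queries return the same deduplicated list; only the GROUP BY form can also report how many rows fall into each department.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT);
    INSERT INTO employees VALUES
        ('Ada',   'Engineering'),
        ('Grace', 'Engineering'),
        ('Linus', 'Sales');
""")

# Simple deduplication with DISTINCT.
distinct_rows = conn.execute(
    "SELECT DISTINCT department FROM employees ORDER BY department"
).fetchall()

# GROUP BY without aggregates yields the same unique values.
grouped_rows = conn.execute(
    "SELECT department FROM employees GROUP BY department ORDER BY department"
).fetchall()

# GROUP BY with an aggregate adds a per-department count.
counts = conn.execute(
    "SELECT department, COUNT(*) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()

print(distinct_rows)  # [('Engineering',), ('Sales',)]
print(grouped_rows)   # [('Engineering',), ('Sales',)]
print(counts)         # [('Engineering', 2), ('Sales', 1)]
```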
DISTINCT is widely used in data analysis, reporting, and UI components such as dropdown lists, and it appears frequently in data pipelines, dashboards, and analytics queries. Deduplication generally requires the database to sort or hash the result set, so understanding the performance trade-offs between DISTINCT and GROUP BY matters for production queries.
Duplicate rows in query results are common when joining tables or selecting non-unique columns. DISTINCT removes duplicate rows, returning only unique combinations of selected columns. Understanding when to use DISTINCT versus GROUP BY is crucial for writing efficient queries.