Partitioning & Sharding: Concept
Module: Schema Design & Advanced DDL
Partitioning splits a large table into smaller pieces (partitions) within the same database. Each partition is a separate physical table, but you query them as one logical table.
Sharding splits data across multiple databases (shards). Each shard is an independent database on a separate server.
Think of it this way:
- Partitioning = splitting a filing cabinet into drawers (same cabinet, organized sections)
- Sharding = splitting files across multiple filing cabinets (different locations)
Partitioning improves performance on a single server. Sharding enables horizontal scaling across multiple servers.
## Partitioning: Splitting Tables Within One Database
Three main types:
**1. Range Partitioning** - Split by value ranges (dates, IDs)
Most common. Perfect for time-series data.
**2. List Partitioning** - Split by specific values (regions, categories)
Good for categorical data with known values.
**3. Hash Partitioning** - Split by hash function (even distribution)
Good for uniform distribution when no natural partition key.
Partition pruning: Database automatically skips irrelevant partitions based on WHERE clause. This is the main performance benefit.
## Sharding: Splitting Data Across Multiple Databases
Horizontal partitioning across servers. Each shard is independent database.
Common strategies:
- Range-based: User IDs 1-10M on shard 1, 10M-20M on shard 2
- Hash-based: hash(user_id) % num_shards determines shard
- Geographic: US users on US shard, EU users on EU shard
- Entity-based: Each customer gets their own shard (multi-tenant)
Challenges: Cross-shard queries, distributed transactions, rebalancing, complexity.
Every company with large datasets uses partitioning. It's the standard solution for tables over 100 million rows.
Sharding is for massive scale. Instagram, Discord, Uber, and Twitter all shard their databases. It's complex but necessary when a single database can't handle the load.
In interviews, understanding when to partition vs shard shows you know production systems. Junior developers think "just add an index." Senior engineers know when you need partitioning (single server optimization) vs sharding (multi-server scaling).
You have an orders table with 500 million rows. Queries scanning the full table take 2 minutes. Indexes are 50GB. Backups take 8 hours. Archiving old data requires locking the entire table.
After partitioning by year, queries with date filters only scan relevant partitions. A query for 2024 orders scans 50 million rows instead of 500 million. Query time drops from 2 minutes to 10 seconds (12x faster). Each partition has its own 5GB index. You can archive 2020 data by dropping a partition in 1 second instead of deleting 100 million rows.
This is partitioning: splitting one large table into smaller, manageable pieces within the same database.
Sharding is different. When Instagram hit 1 billion users, a single database couldn't handle the load. They sharded users across 100 databases. User IDs 1-10M go to shard 1, 10M-20M to shard 2, etc. Each database handles 10 million users. Queries are fast because each shard is small. The system scales horizontally by adding more shards.