Partitioning & Sharding: Real-World

Module: Schema Design & Advanced DDL

You have an orders table with 500 million rows. Queries scanning the full table take 2 minutes. Indexes are 50GB. Backups take 8 hours. Archiving old data requires locking the entire table.

After partitioning by year, queries with date filters only scan relevant partitions. A query for 2024 orders scans 50 million rows instead of 500 million. Query time drops from 2 minutes to 10 seconds (12x faster). Each partition has its own 5GB index. You can archive 2020 data by dropping a partition in 1 second instead of deleting 100 million rows.

This is partitioning: splitting one large table into smaller, manageable pieces within the same database.

Sharding is different. When Instagram hit 1 billion users, a single database couldn't handle the load. They sharded users across 100 databases. User IDs 1-10M go to shard 1, 10M-20M to shard 2, etc. Each database handles 10 million users. Queries are fast because each shard is small. The system scales horizontally by adding more shards.

Instagram: User Sharding Across 100+ Databases

Instagram has 1 billion users. Single database can't handle the load. Sharded users across 100+ databases based on user ID. Each shard handles 10 million users.

Hash-based sharding: shard_id = user_id % num_shards. Application routes queries to correct shard. Each shard is independent PostgreSQL database.

-- Application logic determines shard

function get_shard(user_id) {

return user_id % 100; // 100 shards

}

-- Query routes to specific shard

const shard = get_shard(user_id);

const db = connect_to_shard(shard);

db.query('SELECT * FROM users WHERE user_id = ?', [user_id]);

-- Each shard has same schema

CREATE TABLE users (

user_id BIGINT PRIMARY KEY,

username VARCHAR(50) UNIQUE,

email VARCHAR(255),

created_at TIMESTAMP

);

-- Shard 0: user_id % 100 = 0

-- Shard 1: user_id % 100 = 1

-- Shard 99: user_id % 100 = 99

Handles 1 billion users across 100+ shards

Each shard: 10 million users, manageable size

Horizontal scaling: Add more shards as users grow

Query performance: Fast because each shard is small

Challenge: Cross-shard queries (following users on different shards)

PostgreSQL

Discord: Message Partitioning by Month

Discord stores billions of messages. Partitioned messages table by month. Old messages archived, recent messages fast to query.

Range partitioning by message timestamp. Each month is separate partition. Queries for recent messages only scan recent partitions. Old partitions archived to cold storage.

CREATE TABLE messages (

message_id BIGINT,

channel_id BIGINT,

user_id BIGINT,

content TEXT,

created_at TIMESTAMP NOT NULL