Partitioning & Sharding: Interview
Module: Schema Design & Advanced DDL
What is the difference between partitioning and sharding?
Partitioning splits a table into smaller pieces within the same database. All partitions are on one server. Sharding splits data across multiple databases on different servers. Partitioning is for single-server optimization (faster queries, easier maintenance). Sharding is for horizontal scaling (handle more load by adding servers). Use partitioning when table is large (100M+ rows) but single server can handle it. Use sharding when single server can't handle the load (billions of rows, high traffic). Example: Partition orders by year (same database). Shard users across 100 databases (different servers).
Explain partition pruning and why it improves performance.
Partition pruning is when the database automatically skips irrelevant partitions based on WHERE clause. If you partition orders by year and query for 2024 orders, database only scans orders_2024 partition, skipping orders_2022 and orders_2023. This provides 10-100x speedup because you scan 10% of data instead of 100%. Requires including partition key in WHERE clause. Without partition key in query, database scans all partitions (no benefit). Partition pruning is the main performance benefit of partitioning.
When would you use range vs list vs hash partitioning?
Range partitioning: Time-series data (orders by date, logs by timestamp). Most common. Easy to archive old data. List partitioning: Categorical data with known values (sales by region, products by category). Good for grouping related data. Hash partitioning: Even distribution when no natural partition key (users by ID). Ensures balanced partition sizes. Choose based on query patterns and data characteristics. Range for time-based queries, list for category-based queries, hash for uniform distribution.
Design a partitioning strategy for a logs table with 1 billion rows that grows by 10 million rows per day. Queries typically filter by date.
CREATE TABLE logs (
log_id BIGSERIAL,
log_date DATE NOT NULL,
level VARCHAR(10),
message TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) PARTITION BY RANGE (log_date);
-- Create partitions by month
CREATE TABLE logs_2024_01 PARTITION OF logs
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE logs_2024_02 PARTITION OF logs
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Create indexes on each partition
CREATE INDEX ON logs_2024_01(level, created_at);
CREATE INDEX ON logs_2024_02(level, created_at);
-- Automate partition creation (pg_cron)
CREATE OR REPLACE FUNCTION create_next_month_partition()
RETURNS void AS $$
DECLARE
next_month DATE := date_trunc('month', CURRENT_DATE + INTERVAL '1 month');
partition_name TEXT := 'logs_' || to_char(next_month, 'YYYY_MM');
BEGIN
EXECUTE format(
'CREATE TABLE IF NOT EXISTS %I PARTITION OF logs FOR VALUES FROM (%L) TO (%L)',
partition_name,
next_month,
next_month + INTERVAL '1 month'
);
END;
$$ LANGUAGE plpgsql;
-- Archive old partitions (drop partitions older than 90 days)
DROP TABLE logs_2023_10;
Partition by month balances partition size (~300M rows) and number of partitions (12 per year). Range partitioning perfect for time-series data. Queries with date filter use partition pruning. Automate partition creation for new months. Archive old data by dropping partitions (instant vs DELETE hours). Index each partition for fast queries.