Table Partitioning Strategies: Interview

Module: Query Optimization & Performance

Explain how partition pruning works and why it makes queries faster. What conditions must be met for pruning to occur?

Partition pruning is the optimizer's ability to skip irrelevant partitions when executing a query. Process: (1) Optimizer analyzes WHERE clause, (2) Identifies filters on partition key, (3) Determines which partitions match the filter, (4) Excludes non-matching partitions from execution plan, (5) Scans only matching partitions. Example: Table with 12 monthly partitions, query filters WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01'. Optimizer scans only January partition (8% of data) instead of all 12 partitions (100%). This makes query 12x faster. Conditions for pruning: (1) WHERE clause must filter on partition key, (2) Filter must be a simple comparison (=, <, >, BETWEEN), not a function, (3) Partition key must be in WHERE clause, not just SELECT or JOIN. If any condition fails, optimizer scans all partitions.

Compare range, list, and hash partitioning. When would you use each type?

Range partitioning: Partition by continuous ranges (dates, numbers). Use for: time-series data (logs, orders, events), data with natural time boundaries, archiving old data. Example: monthly partitions for orders table. Pruning works for range queries (WHERE date BETWEEN). List partitioning: Partition by discrete values (categories). Use for: data with known categories (country, region, status), uneven distribution, data isolation (compliance). Example: partition users by region (US, EU, Asia). Pruning works for equality and IN queries (WHERE country = 'US'). Hash partitioning: Partition by hash of column value. Use for: even data distribution, no natural boundaries, parallel processing. Example: partition users by user_id hash. Pruning works only for equality (WHERE user_id = 123), not ranges. Choice depends on: (1) Query patterns - what do queries filter on? (2) Data distribution - even or skewed? (3) Maintenance needs - archiving old data?

Why is DROP partition faster than DELETE for archiving old data? Explain the difference in how they work.

DROP partition is instant (metadata operation), DELETE takes hours (data operation). DELETE process: (1) Scan entire table to find matching rows, (2) Delete rows one by one, (3) Update all indexes to remove entries, (4) Mark space as free (but not reclaimed), (5) VACUUM needed to reclaim space (scans table again). For 1B row table, this takes 6-8 hours. DROP partition process: (1) Remove partition metadata from catalog, (2) Drop partition table file. This is instant (1 second) because: no table scan, no row-by-row deletion, no index updates, space immediately reclaimed. Example: Archive January 2023 data. DELETE: DELETE FROM orders WHERE created_at < '2023-02-01' (6 hours) + VACUUM (2 hours) = 8 hours. DROP: DROP TABLE orders_2023_01 (1 second). Speedup: 28,800x faster. This is why partitioning is essential for tables with archiving needs.

EXPLAIN shows all 12 partitions being scanned instead of 1. The query is: SELECT * FROM orders WHERE customer_id = 123 AND created_at >= '2024-01-01'. The table is partitioned by created_at (monthly). Why is pruning failing?

-- Diagnosis: Incomplete date range in WHERE clause

-- Query has created_at >= '2024-01-01' but no upper bound

-- Optimizer doesn't know which partitions to exclude

-- Check current query plan

EXPLAIN

SELECT * FROM orders

WHERE customer_id = 123 AND created_at >= '2024-01-01';

-- Result: Scans all partitions from 2024-01 onwards

-- Append

-- -> Seq Scan on orders_2024_01

-- -> Seq Scan on orders_2024_02

-- -> Seq Scan on orders_2024_03

-- ...

-- -> Seq Scan on orders_2024_12

-- Fix: Add upper bound to date range

EXPLAIN

SELECT * FROM orders

WHERE customer_id = 123

AND created_at >= '2024-01-01'

AND created_at < '2024-02-01';

-- Result: Scans only orders_2024_01

-- Append

-- -> Seq Scan on orders_2024_01

-- Filter: (customer_id = 123 AND created_at >= '2024-01-01' AND created_at < '2024-02-01')

-- Alternative: Use BETWEEN

SELECT * FROM orders

WHERE customer_id = 123

AND created_at BETWEEN '2024-01-01' AND '2024-01-31';

-- Or: Use DATE_TRUNC for current month

SELECT * FROM orders

WHERE customer_id = 123

AND created_at >= DATE_TRUNC('month', CURRENT_DATE)

AND created_at < DATE_TRUNC('month', CURRENT_DATE) + INTERVAL '1 month';