Join Optimization Strategies: Interview

Q: Why does join order matter? Give an example where wrong order causes 100x slowdown.

Join order determines size of intermediate results. Each join creates temporary result set. Larger intermediates = more comparisons. Example: orders (10M) JOIN order_items (50M) JOIN products (500K) Filter: order_date > last 30 days (reduces orders to 500K) Bad order (left-to-right): 1. orders (10M) JOIN order_items (50M) = 50M intermediate 2. 50M JOIN products (500K) = 25 billion comparisons Time: 180 seconds Good order (filter first): 1. Filter orders: 10M to 500K 2. orders (500K) JOIN prod

Module: Query Optimization & Performance

Explain the three join algorithms (Nested Loop, Hash Join, Merge Join) and when each is optimal.

Nested Loop: Loops through outer table, looks up each row in inner table. Best for small outer (<1K rows) with indexed inner. O(n * log m) with index. Example: 100 orders joining 1M customers with index on customer_id. Used in OLTP.

Hash Join: Builds hash table from smaller table, probes with larger. Best for large tables without indexes, equality joins. O(n + m) linear time. Needs memory for hash table. Example: 1M orders joining 500K customers, no indexes. Used in analytics.

Merge Join: Merges two sorted datasets. Best when both tables indexed/sorted on join column. O(n + m) linear time. Example: Orders and payments both indexed on order_id. Used for range joins.

Key: Nested Loop with index is usually fastest for OLTP. Hash Join for analytics without indexes. Merge Join for sorted data.

Why does join order matter? Give an example where wrong order causes 100x slowdown.

Join order determines size of intermediate results. Each join creates temporary result set. Larger intermediates = more comparisons.

Example: orders (10M) JOIN order_items (50M) JOIN products (500K)

Filter: order_date > last 30 days (reduces orders to 500K)

Bad order (left-to-right):

1. orders (10M) JOIN order_items (50M) = 50M intermediate

2. 50M JOIN products (500K) = 25 billion comparisons

Time: 180 seconds

Good order (filter first):

1. Filter orders: 10M to 500K

2. orders (500K) JOIN products (500K) = 500K intermediate

3. 500K JOIN order_items = 2.5M final

Time: 1.8 seconds (100x faster)

Rule: Start with most filtered table, join to smallest related tables, keep intermediates small.

What is a covering index for joins and when should you use it?

Covering index: Index includes ALL columns query needs (join + filter + select). Query reads only index, never accesses table. Called "index-only scan".

Example:

CREATE INDEX idx_orders_covering ON orders(customer_id, order_date)

INCLUDE (order_id, status, total_amount);

Query: SELECT order_id, order_date, status, total_amount FROM orders WHERE customer_id = ? AND order_date > ?

Benefit: 3-5x faster than regular index (no disk I/O for table lookup).

Use when:

- Hot query (>1K req/sec)

- Read-heavy workload (100:1 read/write ratio)

- Selecting few columns (not SELECT *)

- Worth storage cost (index 2-3x larger)

Don't use when:

- Cold query (<100 req/sec)

- Write-heavy workload

- Selecting many columns (index too large)

Given: orders (5M rows), customers (1M rows). Query: SELECT o.*, c.name FROM orders o JOIN customers c ON o.customer_id = c.id WHERE o.order_date > '2024-01-01' (returns 500K orders). Currently takes 25 seconds. Optimize to under 1 second.

-- Step 1: Index foreign key (if not exists)

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

-- Improvement: 25s to 8s (3x faster)

-- Why: Enables nested loop with index lookup instead of hash join