UPDATE with Joins: Interview

Q: Why is UPDATE with JOIN 100x faster than row-by-row updates in a loop? Explain the technical reasons.

UPDATE with JOIN is 100x faster because it eliminates three major bottlenecks: (1) Network overhead: Row-by-row requires N network round-trips (one per row). UPDATE with JOIN sends 1 query, gets 1 response. For 10,000 rows with 5ms latency: 50 seconds vs 0.05 seconds = 1000x faster. (2) Transaction overhead: Row-by-row creates N separate transactions (each with BEGIN/COMMIT overhead). UPDATE with JOIN uses 1 transaction. For 10,000 rows: 10,000 commits vs 1 commit = 100x less overhead. (3) Query

Q: How do you safely update millions of rows without locking the table for hours? Explain batching strategy.

Safely update millions of rows using batching strategy: (1) Batch size: Process 1000-5000 rows per transaction. Too small = too many commits (slow). Too large = long locks (blocks queries). (2) ID range batching: Split by primary key range. Process rows 1-1000, commit. Process rows 1001-2000, commit. Repeat until done. (3) Monitor progress: Log each batch completion, estimate time remaining. (4) Pause between batches: Sleep 0.1 seconds between batches to allow other queries. (5) Run during off-p

Module: Data Modification & Transactions

Explain the syntax differences for UPDATE with JOIN across PostgreSQL, MySQL, and SQL Server. When would you use correlated subquery instead?

UPDATE with JOIN syntax varies by database: (1) PostgreSQL: UPDATE table SET col = val FROM source WHERE condition. Uses FROM clause for JOIN. Example: UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id. (2) MySQL: UPDATE table JOIN source ON condition SET col = val. JOIN directly after UPDATE. Example: UPDATE orders o JOIN shipments s ON o.id = s.order_id SET o.status = s.status. (3) SQL Server: UPDATE t SET col = val FROM table t JOIN source ON condition. Similar to PostgreSQL. Example: UPDATE o SET status = s.status FROM orders o JOIN shipments s ON o.id = s.order_id. (4) Oracle: No UPDATE...JOIN syntax. Use MERGE or correlated subquery. Correlated subquery (portable): UPDATE orders o SET status = (SELECT status FROM shipments s WHERE s.order_id = o.order_id) WHERE EXISTS (SELECT 1 FROM shipments s WHERE s.order_id = o.order_id). Works on all databases. When to use correlated subquery: (1) Need portable SQL (works on PostgreSQL, MySQL, SQL Server, Oracle), (2) Updating few rows (< 1000), (3) Complex logic easier with subquery, (4) JOIN syntax not supported (Oracle). Trade-offs: JOIN is faster (aggregates once, efficient hash join) but database-specific. Subquery is portable but slower (executes per row). Real-world: Multi-database applications use subqueries for portability. Performance-critical apps use database-specific JOIN syntax. Lesson: Know your database syntax or use portable subqueries.

Why is UPDATE with JOIN 100x faster than row-by-row updates in a loop? Explain the technical reasons.

UPDATE with JOIN is 100x faster because it eliminates three major bottlenecks: (1) Network overhead: Row-by-row requires N network round-trips (one per row). UPDATE with JOIN sends 1 query, gets 1 response. For 10,000 rows with 5ms latency: 50 seconds vs 0.05 seconds = 1000x faster. (2) Transaction overhead: Row-by-row creates N separate transactions (each with BEGIN/COMMIT overhead). UPDATE with JOIN uses 1 transaction. For 10,000 rows: 10,000 commits vs 1 commit = 100x less overhead. (3) Query planning: Row-by-row plans and executes N queries. UPDATE with JOIN plans once, executes once. Database can optimize JOIN (hash join, index seek) more efficiently than N separate queries. Technical details: (1) JOIN in memory: Database loads both tables, performs hash join in memory (fast). (2) Batch index updates: Updates indexes once after all rows modified (faster than per-row). (3) Lock optimization: Acquires locks once, releases once (less lock overhead). Real-world example: Amazon updates millions of order totals daily. Row-by-row would take hours. UPDATE with JOIN takes minutes. Performance: Row-by-row: 10 seconds for 10K orders (1 order/ms). UPDATE with JOIN: 0.1 seconds for 10K orders (100 orders/ms, 100x faster). Lesson: Always use UPDATE with JOIN for bulk updates. Never loop with individual UPDATEs in production code.

How do you safely update millions of rows without locking the table for hours? Explain batching strategy.

Safely update millions of rows using batching strategy: (1) Batch size: Process 1000-5000 rows per transaction. Too small = too many commits (slow). Too large = long locks (blocks queries). (2) ID range batching: Split by primary key range. Process rows 1-1000, commit. Process rows 1001-2000, commit. Repeat until done. (3) Monitor progress: Log each batch completion, estimate time remaining. (4) Pause between batches: Sleep 0.1 seconds between batches to allow other queries. (5) Run during off-peak: Schedule during low-traffic hours to minimize impact. Example batching code (PostgreSQL): DO $$ DECLARE batch_size INT := 1000; min_id INT; max_id INT; current_id INT; BEGIN SELECT MIN(order_id), MAX(order_id) INTO min_id, max_id FROM orders; current_id := min_id; WHILE current_id <= max_id LOOP UPDATE orders o SET total = (SELECT SUM(quantity * price) FROM order_items WHERE order_id = o.order_id) WHERE o.order_id >= current_id AND o.order_id < current_id + batch_size; current_id := current_id + batch_size; COMMIT; PERFORM pg_sleep(0.1); -- Pause between batches END LOOP; END $$; Benefits: (1) Short locks (other queries can run between batches), (2) Small transaction log (manageable memory), (3) Fast rollback if error (only 1000 rows to undo), (4) Can monitor progress (log batch completion). Real-world: Netflix updates billions of viewing records using batches of 5000 rows. Runs during off-peak hours. Monitors progress with logging table. Lesson: Batch large updates to avoid long locks and huge transaction logs.

Write a PostgreSQL query to update order totals from order_items using UPDATE...FROM. Include proper indexing and WHERE clause.

-- Step 1: Create indexes on join columns (if not exists)

CREATE INDEX IF NOT EXISTS idx_order_items_order_id

ON order_items(order_id);

CREATE INDEX IF NOT EXISTS idx_orders_order_id

ON orders(order_id);

-- Primary key usually auto-indexed

-- Step 2: Update order totals using UPDATE...FROM

UPDATE orders o

SET

subtotal = oi.item_total,

tax = oi.item_total * 0.1,

total = oi.item_total * 1.1,

item_count = oi.item_count,

updated_at = NOW()

FROM (

SELECT

order_id,

SUM(quantity * price) as item_total,

COUNT(*) as item_count

FROM order_items

GROUP BY order_id

) oi

WHERE o.order_id = oi.order_id

AND o.status = 'pending';

-- WHERE limits to pending orders only

-- Step 3: Verify updates

SELECT

order_id,

subtotal,

tax,

total,

item_count,

updated_at