UPDATE with Joins: Mistakes

Q: Not Testing on Subset Before Production

Running UPDATE on production without testing causes data corruption. Problem: UPDATE logic has bug, runs on 10M rows, corrupts data. Cannot easily rollback (no backup). Solution: Test on small subset first. Step 1: Test on 100 rows. BEGIN; UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id AND o.order_id < 100; SELECT * FROM orders WHERE order_id < 100; -- Verify results. ROLLBACK; -- Undo if wrong. Step 2: If correct, run on all rows with batching. R

Module: Data Modification & Transactions

Using Wrong Database Syntax

Using PostgreSQL UPDATE...FROM syntax on MySQL: UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id

Use MySQL UPDATE...JOIN syntax: UPDATE orders o JOIN shipments s ON o.id = s.order_id SET o.status = s.status

Each database has different UPDATE with JOIN syntax. PostgreSQL/SQL Server use UPDATE...FROM. MySQL uses UPDATE...JOIN. Oracle uses MERGE or correlated subquery. Problem: Code written for PostgreSQL fails on MySQL. Example: UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id; -- Works on PostgreSQL, fails on MySQL. Solution: Use database-specific syntax or portable correlated subquery. MySQL: UPDATE orders o JOIN shipments s ON o.id = s.order_id SET o.status = s.status; Portable: UPDATE orders o SET status = (SELECT status FROM shipments s WHERE s.order_id = o.order_id) WHERE EXISTS (SELECT 1 FROM shipments s WHERE s.order_id = o.order_id); Real-world: Multi-database applications use correlated subqueries for portability. Performance-critical apps use database-specific JOIN syntax. Lesson: Know your database syntax or use portable correlated subqueries.

PostgreSQL/SQL Server: UPDATE...FROM. MySQL: UPDATE...JOIN. Portable: Correlated subquery. Check database documentation for syntax.

Critical

Syntax error: MySQL does not support FROM clause in UPDATE

UPDATE Without WHERE Clause

UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id (missing WHERE on orders)

UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id AND o.status = 'pending'

Missing WHERE clause on target table updates all rows, even those without matching JOIN. Problem: UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id; This updates ALL orders. Orders without shipments get NULL status (if JOIN fails). Solution: Add WHERE clause to limit scope. UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id AND o.status = 'pending'; Now only pending orders with shipments are updated. Alternative: Use INNER JOIN (MySQL) or WHERE EXISTS (portable). Real-world: Production incident at e-commerce company: UPDATE without WHERE set all order statuses to NULL (orders without shipments). Required restore from backup. Cost: 2 hours downtime, lost sales. Lesson: Always use WHERE clause to limit UPDATE scope. Test on subset first.

Always add WHERE clause to limit rows updated. Test on 100 rows before millions. Use transactions to allow rollback.

Critical

Updates all orders, not just those with shipments

Not Indexing Join Columns

UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id (no index on order_id)

CREATE INDEX idx_order_items_order_id ON order_items_summary(order_id); then UPDATE

UPDATE with JOIN requires indexes on join columns for performance. Without indexes, database does full table scan (checks every row). Problem: UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id; Without index on oi.order_id: Full table scan, 10 seconds for 10K orders. With index on oi.order_id: Index seek, 0.1 seconds for 10K orders (100x faster). Solution: Create indexes on all join columns. CREATE INDEX idx_order_items_order_id ON order_items_summary(order_id); CREATE INDEX idx_orders_order_id ON orders(order_id); -- Usually exists as PK. Now UPDATE uses index seek (fast). Real-world: Data warehouse ETL job took 6 hours to update 10M rows. Added index on join column, now takes 5 minutes (72x faster). Lesson: Always index join columns before large UPDATE operations.

Index all join columns (foreign keys and primary keys). Use EXPLAIN to verify index usage. Create indexes before large updates.

High

Very slow (full table scan), 100x slower than with index

Updating Millions of Rows in Single Transaction

UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id (10M rows in 1 transaction)

Batch updates: Process 1000-5000 rows per transaction, commit, repeat

Updating millions of rows in single transaction causes problems. Issues: (1) Table locked for entire update (blocks reads/writes), (2) Transaction log grows huge (10GB+), (3) Rollback takes forever if error occurs, (4) Other queries blocked. Problem: UPDATE 10M rows in 1 transaction, takes 2 hours, locks table entire time. Solution: Batch updates in chunks of 1000-5000 rows. Process batch 1: BEGIN; UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id AND o.order_id >= 1 AND o.order_id < 1001; COMMIT; Process batch 2: BEGIN; UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id AND o.order_id >= 1001 AND o.order_id < 2001; COMMIT; Repeat for all 10M rows. Benefits: (1) Short locks (other queries can run between batches), (2) Small transaction log, (3) Fast rollback if error, (4) Can monitor progress. Real-world: Netflix updates billions of viewing records using batches of 5000 rows. Runs during off-peak hours. Lesson: Batch large updates to avoid long locks.

Batch size: 1000-5000 rows per transaction. Monitor progress with logging. Run during off-peak hours. Pause between batches if needed.

High

Locks table for hours, blocks other queries, huge transaction log, rollback takes forever

Not Testing on Subset Before Production

Run UPDATE on production with millions of rows without testing

Test on 100 rows first: UPDATE...WHERE order_id < 100, verify results, then run on all rows

Running UPDATE on production without testing causes data corruption. Problem: UPDATE logic has bug, runs on 10M rows, corrupts data. Cannot easily rollback (no backup). Solution: Test on small subset first. Step 1: Test on 100 rows. BEGIN; UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id AND o.order_id < 100; SELECT * FROM orders WHERE order_id < 100; -- Verify results. ROLLBACK; -- Undo if wrong. Step 2: If correct, run on all rows with batching. Real-world: E-commerce company ran UPDATE on 5M orders without testing. Logic had bug: multiplied prices by 100 instead of 1.01. All orders had wrong totals. Required restore from backup, 4 hours downtime. Cost: $500K lost sales. Lesson: Always test on subset before production. Use transactions to allow rollback.

Test on 100 rows first. Verify results with SELECT. Use transactions (BEGIN/ROLLBACK). Backup before large updates. Run during off-peak hours.

Critical

Data corruption, incorrect updates, cannot easily rollback