UPDATE with Joins: Mistakes
Module: Data Modification & Transactions
Using Wrong Database Syntax
Using PostgreSQL UPDATE...FROM syntax on MySQL: UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id
Use MySQL UPDATE...JOIN syntax: UPDATE orders o JOIN shipments s ON o.id = s.order_id SET o.status = s.status
Each database has different UPDATE with JOIN syntax. PostgreSQL/SQL Server use UPDATE...FROM. MySQL uses UPDATE...JOIN. Oracle uses MERGE or correlated subquery. Problem: Code written for PostgreSQL fails on MySQL. Example: UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id; -- Works on PostgreSQL, fails on MySQL. Solution: Use database-specific syntax or portable correlated subquery. MySQL: UPDATE orders o JOIN shipments s ON o.id = s.order_id SET o.status = s.status; Portable: UPDATE orders o SET status = (SELECT status FROM shipments s WHERE s.order_id = o.order_id) WHERE EXISTS (SELECT 1 FROM shipments s WHERE s.order_id = o.order_id); Real-world: Multi-database applications use correlated subqueries for portability. Performance-critical apps use database-specific JOIN syntax. Lesson: Know your database syntax or use portable correlated subqueries.
PostgreSQL/SQL Server: UPDATE...FROM. MySQL: UPDATE...JOIN. Portable: Correlated subquery. Check database documentation for syntax.
Critical
Syntax error: MySQL does not support FROM clause in UPDATE
UPDATE Without WHERE Clause
UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id (missing WHERE on orders)
UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id AND o.status = 'pending'
Missing WHERE clause on target table updates all rows, even those without matching JOIN. Problem: UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id; This updates ALL orders. Orders without shipments get NULL status (if JOIN fails). Solution: Add WHERE clause to limit scope. UPDATE orders o SET status = s.status FROM shipments s WHERE o.id = s.order_id AND o.status = 'pending'; Now only pending orders with shipments are updated. Alternative: Use INNER JOIN (MySQL) or WHERE EXISTS (portable). Real-world: Production incident at e-commerce company: UPDATE without WHERE set all order statuses to NULL (orders without shipments). Required restore from backup. Cost: 2 hours downtime, lost sales. Lesson: Always use WHERE clause to limit UPDATE scope. Test on subset first.
Always add WHERE clause to limit rows updated. Test on 100 rows before millions. Use transactions to allow rollback.
Critical
Updates all orders, not just those with shipments
Not Indexing Join Columns
UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id (no index on order_id)
CREATE INDEX idx_order_items_order_id ON order_items_summary(order_id); then UPDATE
UPDATE with JOIN requires indexes on join columns for performance. Without indexes, database does full table scan (checks every row). Problem: UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id; Without index on oi.order_id: Full table scan, 10 seconds for 10K orders. With index on oi.order_id: Index seek, 0.1 seconds for 10K orders (100x faster). Solution: Create indexes on all join columns. CREATE INDEX idx_order_items_order_id ON order_items_summary(order_id); CREATE INDEX idx_orders_order_id ON orders(order_id); -- Usually exists as PK. Now UPDATE uses index seek (fast). Real-world: Data warehouse ETL job took 6 hours to update 10M rows. Added index on join column, now takes 5 minutes (72x faster). Lesson: Always index join columns before large UPDATE operations.
Index all join columns (foreign keys and primary keys). Use EXPLAIN to verify index usage. Create indexes before large updates.
High
Very slow (full table scan), 100x slower than with index
Updating Millions of Rows in Single Transaction
UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id (10M rows in 1 transaction)
Batch updates: Process 1000-5000 rows per transaction, commit, repeat
Updating millions of rows in single transaction causes problems. Issues: (1) Table locked for entire update (blocks reads/writes), (2) Transaction log grows huge (10GB+), (3) Rollback takes forever if error occurs, (4) Other queries blocked. Problem: UPDATE 10M rows in 1 transaction, takes 2 hours, locks table entire time. Solution: Batch updates in chunks of 1000-5000 rows. Process batch 1: BEGIN; UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id AND o.order_id >= 1 AND o.order_id < 1001; COMMIT; Process batch 2: BEGIN; UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id AND o.order_id >= 1001 AND o.order_id < 2001; COMMIT; Repeat for all 10M rows. Benefits: (1) Short locks (other queries can run between batches), (2) Small transaction log, (3) Fast rollback if error, (4) Can monitor progress. Real-world: Netflix updates billions of viewing records using batches of 5000 rows. Runs during off-peak hours. Lesson: Batch large updates to avoid long locks.
Batch size: 1000-5000 rows per transaction. Monitor progress with logging. Run during off-peak hours. Pause between batches if needed.
High
Locks table for hours, blocks other queries, huge transaction log, rollback takes forever
Not Testing on Subset Before Production
Run UPDATE on production with millions of rows without testing
Test on 100 rows first: UPDATE...WHERE order_id < 100, verify results, then run on all rows
Running UPDATE on production without testing causes data corruption. Problem: UPDATE logic has bug, runs on 10M rows, corrupts data. Cannot easily rollback (no backup). Solution: Test on small subset first. Step 1: Test on 100 rows. BEGIN; UPDATE orders o SET total = oi.total FROM order_items_summary oi WHERE o.order_id = oi.order_id AND o.order_id < 100; SELECT * FROM orders WHERE order_id < 100; -- Verify results. ROLLBACK; -- Undo if wrong. Step 2: If correct, run on all rows with batching. Real-world: E-commerce company ran UPDATE on 5M orders without testing. Logic had bug: multiplied prices by 100 instead of 1.01. All orders had wrong totals. Required restore from backup, 4 hours downtime. Cost: $500K lost sales. Lesson: Always test on subset before production. Use transactions to allow rollback.
Test on 100 rows first. Verify results with SELECT. Use transactions (BEGIN/ROLLBACK). Backup before large updates. Run during off-peak hours.
Critical
Data corruption, incorrect updates, cannot easily rollback