INSERT Patterns: Mistakes

Module: Data Modification & Transactions

Using Single INSERT in Loop Instead of Bulk INSERT

Loop with single INSERT per row: for each row: INSERT INTO table VALUES (row)

Bulk INSERT with multiple VALUES: INSERT INTO table VALUES (row1), (row2), (row3)

Single INSERT in loop is extremely slow because each INSERT is separate transaction with network round-trip. Problem: Application loops through 10,000 rows, executes 10,000 separate INSERT statements. Each INSERT: (1) Network round-trip to database (latency), (2) Separate transaction (commit overhead), (3) Index update per row (slow), (4) Constraint check per row. Time: 10 seconds for 10,000 rows. Solution: Bulk INSERT with multiple VALUES tuples. INSERT INTO products VALUES (row1), (row2), ..., (row10000). Benefits: (1) 1 network round-trip (not 10,000), (2) 1 transaction (not 10,000), (3) 1 index update (not 10,000), (4) 1 constraint check (not 10,000). Time: 0.1 seconds for 10,000 rows (100x faster). Real-world: Shopify uses bulk inserts for flash sales (50K orders/min during Black Friday). Without bulk inserts, would take 100x longer. Lesson: Never use single INSERT in loop. Always use bulk INSERT for multiple rows.

Always use bulk INSERT for multiple rows. Build VALUES list with all rows, execute 1 INSERT statement. Optimal batch size: 1000-5000 rows per statement.

Critical

Very slow (N transactions, N network round-trips), 100x slower than bulk INSERT

Not Using RETURNING/OUTPUT to Get Inserted IDs

INSERT then SELECT to get ID: INSERT INTO users VALUES (...); SELECT user_id FROM users WHERE email = ...

Use RETURNING/OUTPUT: INSERT INTO users VALUES (...) RETURNING user_id (PostgreSQL)

Not using RETURNING/OUTPUT requires extra SELECT query to get inserted ID. Problem: (1) Extra network round-trip (slower), (2) Race condition (another INSERT might happen between INSERT and SELECT), (3) Not atomic (INSERT and SELECT are separate operations). Example without RETURNING: INSERT INTO users (name, email) VALUES ('John', 'john@example.com'); SELECT user_id FROM users WHERE email = 'john@example.com'; -- Extra query, race condition possible. Example with RETURNING: INSERT INTO users (name, email) VALUES ('John', 'john@example.com') RETURNING user_id; -- Returns user_id immediately, atomic, no race condition. Benefits: (1) No extra query (faster), (2) Atomic operation (no race condition), (3) Can use returned ID in subsequent queries (CTEs). Real-world: Stripe uses RETURNING for payment processing. INSERT payment, get payment_id immediately, use for payment_items. No extra SELECT needed. Lesson: Always use RETURNING (PostgreSQL) or OUTPUT (SQL Server) to get inserted IDs.

PostgreSQL: Use RETURNING clause. SQL Server: Use OUTPUT clause. MySQL: Use LAST_INSERT_ID() for auto-increment (but only returns first row ID for bulk inserts).

High

Extra SELECT query (slower), race condition (another INSERT might happen between), not atomic

Not Disabling Indexes During Large Bulk Loads

Bulk load with indexes enabled: INSERT INTO table SELECT * FROM source (10M rows)

Disable indexes, load data, rebuild indexes: DROP INDEX, INSERT, CREATE INDEX

Bulk loading with indexes enabled is very slow because index is updated for every inserted row. Problem: Loading 10M rows with 3 indexes. Each INSERT updates 3 indexes (slow). Index updates: (1) Find insertion point in B-tree (log N), (2) Insert into B-tree (may require rebalancing), (3) Update index statistics. Time: 10 minutes for 10M rows with indexes. Solution: Disable indexes before load, rebuild after. DROP INDEX idx1; DROP INDEX idx2; DROP INDEX idx3; INSERT INTO table SELECT * FROM source; -- 10M rows. CREATE INDEX idx1 ON table(col1); CREATE INDEX idx2 ON table(col2); CREATE INDEX idx3 ON table(col3); Time: 1 minute for 10M rows (10x faster). Why faster: (1) No index updates during INSERT (fast), (2) Rebuild indexes once after load (bulk index build is faster than incremental updates). Real-world: Data warehouses disable indexes during nightly ETL loads. Amazon Redshift drops indexes, loads data, rebuilds indexes. Lesson: Always disable indexes for large bulk loads (> 100K rows).

For large bulk loads (> 100K rows): (1) DROP indexes before load, (2) Load data, (3) CREATE indexes after load. 10x faster than loading with indexes enabled.

High

Very slow (index updated per row), 10x slower than disabling indexes

Using Huge Transactions for Bulk Loads

Load 10M rows in 1 transaction: BEGIN; INSERT 10M rows; COMMIT

Batch commits: Load 1000-5000 rows per transaction, commit, repeat

Loading millions of rows in 1 transaction uses too much memory and locks tables for too long. Problem: BEGIN; INSERT INTO table SELECT * FROM source; -- 10M rows. COMMIT; Issues: (1) Transaction log grows huge (10GB+), uses too much memory, (2) Table locked for entire load (blocks other queries), (3) If error occurs, rollback takes forever (must undo 10M inserts). Solution: Batch commits (1000-5000 rows per transaction). Process 1000 rows: BEGIN; INSERT INTO table SELECT * FROM source LIMIT 1000 OFFSET 0; COMMIT; Process next 1000 rows: BEGIN; INSERT INTO table SELECT * FROM source LIMIT 1000 OFFSET 1000; COMMIT; Repeat 10,000 times for 10M rows. Benefits: (1) Small transaction log (manageable memory), (2) Short locks (other queries can run between batches), (3) Fast rollback if error (only 1000 rows to undo). Trade-off: Slightly slower than 1 huge transaction (more commits), but much safer. Real-world: Netflix batches log inserts (1000 rows every 5 seconds). Balances performance and memory. Lesson: Batch commits for large loads (1000-5000 rows per transaction).

Optimal batch size: 1000-5000 rows per transaction. Balance performance (fewer commits) and memory (smaller transaction log). Monitor transaction log size.

Medium

Too much memory (transaction log grows huge), long locks (blocks other queries), rollback takes forever if error

Not Specifying Column List in INSERT

INSERT INTO users VALUES (1, 'John', 'john@example.com') - no column list

INSERT INTO users (user_id, name, email) VALUES (1, 'John', 'john@example.com') - explicit columns

Not specifying column list makes INSERT fragile and hard to read. Problem: INSERT INTO users VALUES (1, 'John', 'john@example.com'); -- What if column order changes? If table schema changes (column added, removed, or reordered), INSERT breaks. Example: ALTER TABLE users ADD COLUMN phone VARCHAR(20) AFTER name; -- Now INSERT breaks (wrong number of columns). Solution: Always specify column list. INSERT INTO users (user_id, name, email) VALUES (1, 'John', 'john@example.com'); Benefits: (1) Explicit columns (clear what each value means), (2) Resilient to schema changes (works even if columns reordered), (3) Can skip columns with defaults (INSERT INTO users (name, email) VALUES ('John', 'john@example.com')). Real-world: Production code always specifies column list. Makes code maintainable and resilient to schema changes. Lesson: Always specify column list in INSERT statements.

Always specify column list: INSERT INTO table (col1, col2) VALUES (val1, val2). Makes code clear and resilient to schema changes.

Medium

Breaks if table schema changes (column added/removed/reordered), hard to read (what does each value mean?)