MERGE/UPSERT: Interview

Module: Data Modification & Transactions

What is UPSERT and why is it better than separate INSERT/UPDATE operations?

UPSERT is insert-or-update operation: inserts row if it does not exist, updates if it does. Single atomic operation. Better than separate INSERT/UPDATE because: (1) Atomic (no race conditions): Traditional approach (try INSERT, catch error, UPDATE) has race condition when two processes run concurrently - both try INSERT, both fail, both UPDATE, last one wins (lost update). UPSERT is atomic, no race condition. (2) 10x faster: Separate operations require error handling overhead (slow). UPSERT is single operation (fast). Benchmark: 10,000 rows/sec with try-catch vs 100,000 rows/sec with UPSERT. (3) Simpler code: No error handling, no transaction management, just single statement. (4) Idempotent: Safe to retry (critical for ETL pipelines). If operation fails and retries, UPSERT produces same result. Database-specific syntax: PostgreSQL INSERT...ON CONFLICT, MySQL INSERT...ON DUPLICATE KEY UPDATE, SQL Server/Oracle MERGE. All provide same functionality. Real-world: Stripe processes 1B+ payment events daily using UPSERT (insert new payments, update status for existing). Shopify syncs 10M+ products using UPSERT (new products inserted, prices updated).

Explain the differences between PostgreSQL ON CONFLICT, MySQL ON DUPLICATE KEY UPDATE, and SQL Server MERGE.

Three different syntaxes for UPSERT, same concept: PostgreSQL INSERT...ON CONFLICT: (1) Syntax: INSERT...ON CONFLICT (column) DO UPDATE SET col = EXCLUDED.col, (2) EXCLUDED keyword refers to proposed values, (3) DO NOTHING option to ignore conflicts, (4) WHERE clause in DO UPDATE for conditional updates, (5) RETURNING clause with xmax to track inserts vs updates, (6) Best performance for bulk UPSERT (100,000 rows/sec). MySQL INSERT...ON DUPLICATE KEY UPDATE: (1) Syntax: INSERT...ON DUPLICATE KEY UPDATE col = VALUES(col), (2) VALUES() function refers to proposed values (deprecated in 8.0.20+), (3) AS new syntax in 8.0.20+ (more readable), (4) ROW_COUNT() returns 1 for insert, 2 for update, (5) Checks all UNIQUE indexes (can cause unexpected behavior), (6) Good performance (80,000 rows/sec). SQL Server MERGE: (1) Syntax: MERGE target USING source ON condition WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT, (2) More powerful: can handle insert/update/delete in single statement, (3) WHEN NOT MATCHED BY SOURCE for deletes, (4) OUTPUT clause with $action to track operations, (5) More complex execution plan (slower for simple UPSERT: 50,000 rows/sec), (6) Best for complex sync operations. Choose based on database: PostgreSQL for best performance, MySQL for simplicity, SQL Server for complex sync with deletes. Real-world: Most companies use database-native UPSERT for best performance.

What is idempotent operation and why is UPSERT idempotent? Why is this important?

Idempotent operation: Running same operation multiple times produces same result. UPSERT is idempotent: First run inserts or updates row to desired state. Second run (with same data) produces same result (row already in desired state, no change or updates to same values). Example: UPSERT payment_id=123 with status=completed. First run: Inserts payment or updates status to completed. Second run: Updates status to completed (already completed, no change). Result: Same final state. Why important: (1) ETL pipelines: If pipeline fails and retries, UPSERT ensures correct data (no duplicates, no lost updates). (2) API sync: If API call times out and retries, UPSERT prevents duplicate records. (3) Event processing: If event processed twice (message delivered twice), UPSERT handles gracefully. (4) Distributed systems: Network failures cause retries, UPSERT ensures correctness. Non-idempotent example: INSERT without UPSERT. First run: Inserts row. Second run: Duplicate key error or duplicate row (wrong). Counter increment: view_count = view_count + 1 is NOT idempotent (each run increments). UPSERT with counter: view_count = view_count + VALUES(view_count) is idempotent if VALUES(view_count) is always 1. Real-world: Payment processing systems use UPSERT to handle duplicate payment events from retries. Stripe processes billions of events daily, many are duplicates from retries, UPSERT ensures correct state.

Write PostgreSQL UPSERT to sync product catalog from staging table. Handle new products (insert) and existing products (update price/stock). Include conditional update to only update if price or stock changed.

-- PostgreSQL product catalog sync with conditional update

-- Step 1: Create products table with unique constraint

CREATE TABLE products (

product_id VARCHAR(50) PRIMARY KEY,

name VARCHAR(200),

price DECIMAL(10,2),

stock INT,

updated_at TIMESTAMP DEFAULT NOW()

);

-- Step 2: Create staging table with supplier data

CREATE TABLE staging_products (

product_id VARCHAR(50),

name VARCHAR(200),

price DECIMAL(10,2),

stock INT

);

-- Step 3: Load supplier data into staging

-- ... data loading code ...

-- Step 4: Bulk UPSERT from staging to products

INSERT INTO products (product_id, name, price, stock, updated_at)

SELECT product_id, name, price, stock, NOW()

FROM staging_products

ON CONFLICT (product_id)

DO UPDATE SET

name = EXCLUDED.name,

price = EXCLUDED.price,

stock = EXCLUDED.stock,

updated_at = NOW()

WHERE

products.price != EXCLUDED.price

OR products.stock != EXCLUDED.stock;

-- Only updates if price or stock changed (avoids unnecessary writes)

-- Step 5: Track inserts vs updates using RETURNING