DELETE & TRUNCATE: Interview

Module: Data Modification & Transactions

What is the difference between DELETE and TRUNCATE? When would you use each?

DELETE and TRUNCATE both remove data but work completely differently. DELETE: (1) Removes specific rows based on WHERE clause, (2) Row-by-row processing (slow: 10K-50K rows/sec), (3) Fully logged (every row deletion recorded), (4) Fires triggers (good for audit logging), (5) Can rollback (transactional), (6) Does NOT reset identity/sequence, (7) Supports WHERE clause (selective removal). TRUNCATE: (1) Removes all rows (no WHERE clause), (2) Deallocates data pages (fast: 1M+ rows/sec, 100x-1000x faster), (3) Minimal logging (only page deallocations), (4) No triggers fire, (5) Rollback varies by database (PostgreSQL: yes, MySQL: no), (6) Resets identity/sequence to starting value, (7) Cannot use with foreign key references (in most databases). Use DELETE when: Need WHERE clause (selective removal), table has foreign keys, need triggers to fire, need rollback support. Use TRUNCATE when: Removing all rows, fast cleanup of staging/temp tables, reset identity to 1, no foreign key references. Real-world example: Amazon Redshift uses TRUNCATE for ETL staging table cleanup (100M rows in 10 seconds vs 1000 seconds with DELETE). Stripe uses DELETE with WHERE for selective cleanup of old payment records (maintains referential integrity).

Explain foreign key CASCADE behavior. What are the dangers of ON DELETE CASCADE?

Foreign keys enforce referential integrity - child rows must reference valid parent rows. ON DELETE CASCADE automatically deletes child rows when parent is deleted. Example: customers table (parent) and orders table (child) with ON DELETE CASCADE. When you DELETE FROM customers WHERE customer_id = 100, it automatically deletes all orders for that customer. Cascade chain: Delete 1 customer → cascades to 10,000 orders → cascades to 50,000 order_items → cascades to 100,000 shipments (160,001 total rows deleted from single DELETE). Dangers: (1) Unexpected mass deletion (delete 1 row, cascade deletes millions), (2) Performance impact (cascade deletes are slow, row-by-row), (3) Cannot undo (once committed, data is gone), (4) Hard to track (cascade happens automatically, no explicit DELETE statements). Other CASCADE options: ON DELETE SET NULL (sets FK to NULL instead of deleting), ON DELETE RESTRICT (prevents deletion if child rows exist, default behavior), ON DELETE NO ACTION (same as RESTRICT). Prevention: (1) Check cascade impact before deleting: SELECT COUNT(*) from each child table, (2) Use transaction to verify: BEGIN, DELETE, check counts, ROLLBACK if unexpected, (3) Consider soft delete: UPDATE status = deleted instead of DELETE, (4) Remove CASCADE if not needed. Real-world: Instagram checks cascade impact before account deletion (prevent accidental deletion of millions of posts/comments). GDPR "right to be forgotten" uses CASCADE to delete all user data across tables.

What is the archive-then-delete pattern? Why is it important for production systems?

Archive-then-delete is a safe deletion pattern: (1) Copy data to archive table, (2) Verify archive matches source (counts, totals, date ranges), (3) Delete from source table, (4) Reclaim storage space. Example: Stripe deletes old payment records (7+ years) for storage optimization. Steps: INSERT INTO payments_archive SELECT * FROM payments WHERE payment_date < CURRENT_DATE - INTERVAL 7 years; (copy to archive), verify counts match, DELETE FROM payments WHERE payment_date < CURRENT_DATE - INTERVAL 7 years; (delete from source), VACUUM to reclaim storage. Benefits: (1) Can recover if deletion goes wrong (data in archive), (2) Maintains compliance (7-year retention for financial data), (3) Enables historical analysis (archive table for reporting), (4) Reduces risk (verify before delete). Alternative: Soft delete pattern - UPDATE status = deleted instead of DELETE. Benefits: Can undo (set status = active), maintains referential integrity (FK still valid), enables audit trail. Trade-off: Table grows larger (deleted rows still stored), queries must filter WHERE status != deleted. Real-world: Amazon uses archive-then-delete for order retention (7-year policy), Netflix uses soft deletes for user data (enable recovery), Stripe archives payment records before deletion (compliance). Production best practices: (1) Always backup before bulk delete, (2) Test on subset first (100 rows before millions), (3) Use transactions (can rollback), (4) Monitor progress (batch deletes with logging), (5) Verify after delete (check counts).

Write a query to safely delete old orders (older than 2 years) in batches of 5000 rows. Include progress monitoring and pause between batches.

-- PostgreSQL batch delete with progress monitoring

DO $$

DECLARE

deleted_count INT;

total_deleted INT := 0;

batch_num INT := 0;

start_time TIMESTAMP := CLOCK_TIMESTAMP();

BEGIN

LOOP

-- Delete batch of 5000 rows

DELETE FROM orders

WHERE order_id IN (

SELECT order_id

FROM orders

WHERE order_date < CURRENT_DATE - INTERVAL '2 years'

LIMIT 5000

);

-- Get number of rows deleted

GET DIAGNOSTICS deleted_count = ROW_COUNT;

total_deleted := total_deleted + deleted_count;

batch_num := batch_num + 1;

-- Log progress

RAISE NOTICE 'Batch %: Deleted % rows, Total: %, Elapsed: %',

batch_num, deleted_count, total_deleted,

CLOCK_TIMESTAMP() - start_time;

-- Exit when no more rows to delete

EXIT WHEN deleted_count = 0;

-- Pause 100ms between batches (let other queries run)

PERFORM pg_sleep(0.1);

END LOOP;

RAISE NOTICE 'Deletion complete. Total deleted: %, Total time: %',

total_deleted, CLOCK_TIMESTAMP() - start_time;

END $$;