Database System Fundamentals: Interview

Q: Why is MySQL InnoDB faster for primary key lookups but slower for secondary key lookups compared to PostgreSQL? Explain the storage engine difference.

MySQL InnoDB uses clustered index: data is stored in a B+tree ordered by primary key. The primary key index IS the data. Secondary indexes store: secondary_key → primary_key. Primary key lookup: 1 B+tree traversal (3-4 page reads) = fast. Secondary key lookup: 2 B+tree traversals (secondary index → primary key, then clustered index → data) = slower. Example: SELECT * FROM orders WHERE order_id=12345 (PK): 1 lookup in clustered index = 1ms. SELECT * FROM orders WHERE customer_id=100 (secondary):

Q: What are the 4 core components of RDBMS architecture? Explain how they work together to execute a query.

The 4 core components are: (1) Query Processor, (2) Storage Engine, (3) Transaction Manager, (4) Buffer Pool. Query execution flow: (1) Query Processor receives SQL: "SELECT * FROM users WHERE city = 'NYC'" → Parser validates syntax, builds parse tree → Optimizer generates execution plan using statistics (table size, index availability, cardinality estimates) and cost model (I/O cost, CPU cost) → Chooses index scan vs table scan → Executor runs the plan. (2) Executor requests data from Storage E

Module: Database-Specific Features

Explain how PostgreSQL MVCC allows non-blocking reads. Why does this enable higher concurrency than MySQL locking?

PostgreSQL MVCC (multi-version concurrency control) keeps multiple versions of each row. When a transaction updates a row, PostgreSQL creates a new version and keeps the old version. Each transaction sees a snapshot of data at its start time (identified by transaction ID). Readers see old versions, writers create new versions, so they never block each other. Example: Transaction 1 (xid=100) reads users table, sees all rows with xmin≤100. Transaction 2 (xid=101) updates user_id=1, creates new version (xmin=101). Transaction 1 still reads old version (xmin=100) because its snapshot xid=100 < 101. No blocking. MySQL InnoDB uses locking: readers acquire shared locks, writers acquire exclusive locks. Writers block readers, readers block writers (unless READ UNCOMMITTED). Example: Transaction 1 reads users table, acquires shared locks. Transaction 2 tries to update user_id=1, waits for shared lock to release. Transaction 1 blocks Transaction 2. MVCC enables higher concurrency because readers and writers never block each other. Uber uses PostgreSQL MVCC for ride matching: 50K writes/sec (driver location updates), zero blocking reads (user searches for rides). Trade-off: MVCC requires vacuum to remove old versions, MySQL locking has no cleanup overhead.

Why is MySQL InnoDB faster for primary key lookups but slower for secondary key lookups compared to PostgreSQL? Explain the storage engine difference.

MySQL InnoDB uses clustered index: data is stored in a B+tree ordered by primary key. The primary key index IS the data. Secondary indexes store: secondary_key → primary_key. Primary key lookup: 1 B+tree traversal (3-4 page reads) = fast. Secondary key lookup: 2 B+tree traversals (secondary index → primary key, then clustered index → data) = slower. Example: SELECT * FROM orders WHERE order_id=12345 (PK): 1 lookup in clustered index = 1ms. SELECT * FROM orders WHERE customer_id=100 (secondary): 1 lookup in idx_customer gets order_ids [12345, 12346, 12347], then 3 lookups in clustered index = 3ms. PostgreSQL uses heap storage: data is stored in insertion order (heap). All indexes (primary and secondary) store: key → TID (tuple identifier = page + offset). Primary key lookup: 2 steps (index → TID, heap → data). Secondary key lookup: 2 steps (index → TID, heap → data). Same cost. Example: SELECT * FROM orders WHERE order_id=12345 (PK): 1 lookup in PK index gets TID, 1 lookup in heap = 2ms. SELECT * FROM orders WHERE customer_id=100 (secondary): 1 lookup in idx_customer gets TIDs, 1 lookup in heap = 2ms (multiple TIDs on same page). InnoDB is 2x faster for PK lookups (1ms vs 2ms), PostgreSQL is 1.5x faster for secondary lookups (2ms vs 3ms). Choose based on query patterns: E-commerce orders by order_id (PK-heavy) = InnoDB. User search by email (secondary key) = PostgreSQL.

What are the 4 core components of RDBMS architecture? Explain how they work together to execute a query.

The 4 core components are: (1) Query Processor, (2) Storage Engine, (3) Transaction Manager, (4) Buffer Pool. Query execution flow: (1) Query Processor receives SQL: "SELECT * FROM users WHERE city = 'NYC'" → Parser validates syntax, builds parse tree → Optimizer generates execution plan using statistics (table size, index availability, cardinality estimates) and cost model (I/O cost, CPU cost) → Chooses index scan vs table scan → Executor runs the plan. (2) Executor requests data from Storage Engine: "Fetch rows from users table where city = 'NYC'" → Storage Engine checks Buffer Pool first (cache). (3) Buffer Pool: If data is in memory (cache hit), return immediately (100ns). If not in memory (cache miss), read from disk (100μs for SSD, 10ms for HDD) and cache in buffer pool. (4) Storage Engine returns rows to Executor → Executor applies any remaining filters, sorts, aggregations → Returns result to client. (5) Transaction Manager ensures ACID: Atomicity (all or nothing, uses WAL), Consistency (enforces constraints), Isolation (MVCC or locking), Durability (writes to WAL before data files). Example: UPDATE users SET balance = balance - 100 WHERE user_id = 1 → Transaction Manager writes to WAL first (sequential I/O, fast) → Then updates data in buffer pool → Checkpoint process flushes to disk later (batch I/O, efficient). If crash occurs, replay WAL to recover. Understanding these components helps diagnose issues: Slow query? Check execution plan (Query Processor). Random I/O? Check indexes (Storage Engine). Deadlocks? Check isolation level (Transaction Manager). Out of memory? Check buffer pool size.

Your application has 1000 concurrent users. Queries are timing out with "too many connections" error. MySQL max_connections=100. How do you fix this? Explain connection pooling setup.

-- Problem: 1000 concurrent users, max_connections=100

-- Each HTTP request creates new connection = 1000 connections needed

-- Database rejects connections after 100 = "too many connections" error

-- Solution 1: Increase max_connections (NOT RECOMMENDED)

SET GLOBAL max_connections = 1000;

-- Problem: Each connection uses 1-10MB RAM

-- 1000 connections = 1-10GB RAM just for connections

-- Plus buffer pool, query memory = OOM (out of memory)

-- Solution 2: Connection pooling (RECOMMENDED)

-- Install PgBouncer (PostgreSQL) or ProxySQL (MySQL)

-- PgBouncer configuration (pgbouncer.ini)

[databases]

mydb = host=localhost port=5432 dbname=mydb

[pgbouncer]

listen_port = 6432

listen_addr = *

auth_type = md5

auth_file = /etc/pgbouncer/userlist.txt

pool_mode = transaction -- Connection released after transaction

default_pool_size = 25 -- 25 connections per database

max_client_conn = 1000 -- Accept 1000 client connections

reserve_pool_size = 5 -- Extra connections for spikes

-- How it works:

-- 1000 clients connect to PgBouncer (port 6432)

-- PgBouncer maintains 25 connections to database (port 5432)

-- Each database connection serves ~40 clients (1000/25)

-- When client sends query, PgBouncer assigns available connection

-- After query completes, connection returns to pool

-- Next client uses same connection (no overhead)

-- Application connection string (before)

-- postgresql://localhost:5432/mydb

-- Application connection string (after)

-- postgresql://localhost:6432/mydb -- Connect to PgBouncer