Database System Fundamentals: Performance

Module: Database-Specific Features

Database selection impacts performance 10-100x. Key factors: (1) Workload: OLTP (many small transactions, use MySQL/PostgreSQL) vs OLAP (few large analytical queries, use columnar databases like Redshift). (2) Concurrency: High read concurrency (use MVCC databases like PostgreSQL), high write concurrency (use locking databases like MySQL with proper indexing). (3) Data size: <100GB (single server), 100GB-1TB (read replicas), >1TB (sharding/partitioning). (4) Query patterns: PK lookups (use InnoDB clustered index), secondary key lookups (use PostgreSQL heap), full table scans (use columnar storage). (5) Budget: Open source (MySQL, PostgreSQL) vs commercial (Oracle, SQL Server). Example: Uber migrated from MySQL to PostgreSQL because MVCC allowed non-blocking reads during high-concurrency ride matching. Shopify uses MySQL because InnoDB clustered indexes are fast for order lookups by order_id (primary key).

Primary key lookups: InnoDB 2x faster than PostgreSQL (clustered index), use InnoDB for order_id, user_id lookups

Secondary key lookups: PostgreSQL 3x faster than InnoDB (direct TID), use PostgreSQL for email, username searches

High read concurrency: PostgreSQL MVCC allows non-blocking reads (50K writes/sec, zero blocking), MySQL locks block readers

Write-heavy workloads: MySQL InnoDB with proper indexing, no MVCC bloat, simpler maintenance than PostgreSQL vacuum

Large tables (>100GB): Use partitioning (range, list, hash), enables partition pruning (10-100x faster queries)

Buffer pool sizing: Start with 70% of RAM, monitor hit ratio, increase if <95%, decrease if swapping occurs

Connection limits: PostgreSQL max_connections=100-200 (high memory per connection), MySQL max_connections=1000+ (low memory per connection)

Batch inserts: Use multi-row INSERT (10-100x faster than single-row), disable autocommit, commit every 1000-10000 rows

Wrong database for workload: Using MySQL for high read concurrency (locks block readers), use PostgreSQL MVCC instead

Insufficient buffer pool: <90% hit ratio causes excessive disk I/O (1000x slower), increase innodb_buffer_pool_size or shared_buffers

Wrong isolation level: Using SERIALIZABLE for all queries (10x slower), use READ COMMITTED for most queries, SERIALIZABLE only for critical data

No connection pooling: Creating new connection per request (100ms overhead), use PgBouncer or ProxySQL (1ms overhead)

Ignoring MVCC bloat: PostgreSQL dead tuples cause table bloat (10x larger), run VACUUM regularly, monitor n_dead_tup