Caching & Memory Management: Mistakes

Q: Common mistake in Caching & Memory Management

Temp files indicate memory spills - sorts or hash joins that don't fit in work_mem. Each temp file write/read is disk I/O (slow). High temp file usage means many queries are spilling, causing persistent performance problems. Temp files can fill disk, causing failures. Monitoring temp files reveals which queries need more work_mem or query optimization.

Module: Query Optimization & Performance

Using default work_mem = 4MB for large hash joins (hash table 200MB)

SET work_mem = '250MB' for specific query or increase globally to 50-100MB for analytical workload

Hash join builds hash table on smaller table. If hash table doesn't fit in work_mem, PostgreSQL splits into batches and writes to temp files. Each batch requires disk I/O (write + read). 50 batches = 50x more I/O. Disk is 1000x slower than memory. Increasing work_mem to fit hash table eliminates spill, making query 60x faster.

Check EXPLAIN ANALYZE for "Batches: >1" or "Disk Usage". Calculate hash table size: rows × row_width. Set work_mem = hash_table_size × 1.5. For one-off large queries, use SET work_mem instead of global change.

Critical

Hash join spills to disk in 50 batches. Query takes 15 minutes instead of 15 seconds (60x slower). EXPLAIN shows: Batches: 50, Disk Usage: 196MB

shared_buffers = 1GB on server with 64GB RAM and 100GB database

shared_buffers = 16GB (25% of RAM) to fit hot dataset in cache

Buffer pool (shared_buffers) caches frequently accessed data pages. With 1GB buffer pool and 100GB database, only 1% of data fits in cache. Hot dataset (frequently accessed data) is typically 20-30% of database. With 1GB cache, most hot data is on disk. Each disk read is 1000x slower than memory. Increasing shared_buffers to 16GB fits hot dataset in cache, reducing disk I/O by 90%.

Monitor cache hit ratio: SELECT sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) FROM pg_statio_user_tables. Target >95% for OLTP, >90% for analytics. If below target, increase shared_buffers. Rule: 25% of RAM for dedicated DB server.

Critical

Cache hit ratio: 65% (35% of reads from disk). Queries take 10x longer due to disk I/O. Disk I/O at 100% utilization, CPU idle waiting for disk

Setting work_mem = 500MB globally with 1000 concurrent connections

Use connection pooling (100 connections) + work_mem = 100MB, or use SET work_mem for specific large queries only

work_mem is allocated per operation (sort, hash), not per query. Complex query with 5 sorts + 3 hash joins uses 8 × work_mem. With 1000 connections, max usage = 1000 × 10 × 500MB = 5TB. This exceeds RAM, causing swapping (disk I/O for memory). Swap is 100,000x slower than RAM. System becomes unresponsive. Solution: connection pooling reduces active connections, or use SET work_mem for specific queries.

Calculate max work_mem usage: max_connections × avg_operations_per_query × work_mem. Should be <50% of RAM. Use connection pooling (pgBouncer, pgPool) to reduce connections. For mixed workloads, keep work_mem low (4-50MB) and use SET work_mem for large analytical queries.

Critical

Max memory usage: 1000 connections × 10 operations × 500MB = 5TB (exceeds 64GB RAM). System starts swapping, performance collapses. OOM killer terminates database

Not monitoring temp file usage - queries creating 100GB temp files daily

Monitor temp files: SELECT temp_files, pg_size_pretty(temp_bytes) FROM pg_stat_database. Investigate queries creating temp files, increase work_mem or optimize queries

Temp files indicate memory spills - sorts or hash joins that don't fit in work_mem. Each temp file write/read is disk I/O (slow). High temp file usage means many queries are spilling, causing persistent performance problems. Temp files can fill disk, causing failures. Monitoring temp files reveals which queries need more work_mem or query optimization.

Set up monitoring: SELECT datname, temp_files, pg_size_pretty(temp_bytes) FROM pg_stat_database. Alert if temp_bytes >10GB/day. Identify culprit queries: enable log_temp_files = 0 in postgresql.conf to log all temp file usage. Investigate logged queries with EXPLAIN ANALYZE.

High

Queries spilling to disk go unnoticed. Persistent 10-100x slowdowns. Temp files fill disk, causing disk full errors. Queries fail

Not warming cache after PostgreSQL restart - first queries take 10x longer

After restart, warm cache with pg_prewarm or run typical queries before production traffic

PostgreSQL buffer pool is in RAM, not persistent. After restart, cache is empty. First queries must read all data from disk (cold start). Disk I/O is 1000x slower than memory. As queries run, frequently accessed data fills cache (warm up). This takes 30-60 minutes. During warmup, queries are slow. Solution: explicitly warm cache after restart with pg_prewarm extension or by running typical queries.

Install pg_prewarm extension: CREATE EXTENSION pg_prewarm. Warm critical tables: SELECT pg_prewarm('orders'); SELECT pg_prewarm('customers'). Or create warmup script that runs typical queries. Run after every restart. Monitor cache hit ratio to verify warmup: should reach >90% within 5 minutes.

Medium

After restart, buffer pool is empty. First queries read everything from disk (cold cache). Queries take 10x longer. Users experience slow response times. Cache gradually warms up over 30-60 minutes