PostgreSQL Features Deep Dive: Mistakes

Module: Database-Specific Features

Using JSON type instead of JSONB for storing JSON data

Always use JSONB (binary JSON) instead of JSON (text JSON)

PostgreSQL has two JSON types: JSON (text-based) and JSONB (binary). JSON stores data as text, parsed on every query (slow). JSONB stores data as binary, parsed once on insert (fast queries). JSONB supports GIN indexes (10-100x faster queries). JSON does not support GIN indexes. Example: SELECT * FROM products WHERE data->>brand = Sony. With JSON: 500ms (full table scan). With JSONB + GIN index: 5ms (100x faster). Real-world: Shopify uses JSONB for 1M+ products. Migrating from JSON to JSONB: ALTER TABLE products ALTER COLUMN data TYPE JSONB USING data::JSONB. Lesson: Always use JSONB unless you need to preserve exact JSON formatting (rare).

Always use JSONB. Only use JSON if you need to preserve exact formatting (whitespace, key order). For 99% of cases, use JSONB.

High

Queries are 2-3x slower with JSON. JSON cannot be indexed with GIN (only JSONB can)

Not creating GIN indexes on JSONB or array columns, resulting in slow queries

Create GIN indexes on JSONB and array columns: CREATE INDEX idx ON table USING GIN (jsonb_column)

GIN (Generalized Inverted Index) is designed for JSONB and arrays. Without GIN index, PostgreSQL does full table scan (slow). With GIN index, PostgreSQL uses index scan (fast). Example: SELECT * FROM products WHERE attributes @> {brand: Sony}. Without GIN index: 500ms for 100K rows (full table scan). With GIN index: 5ms (100x faster). GIN index size: ~30% of table size. Worth it for query performance. Create GIN index: CREATE INDEX idx_products_attributes ON products USING GIN (attributes). For specific fields, use expression index: CREATE INDEX idx_brand ON products ((attributes->>brand)). Real-world: Shopify uses GIN indexes on all JSONB columns. Lesson: Always create GIN indexes on JSONB/array columns.

Create GIN indexes on all JSONB and array columns. Use EXPLAIN to verify index usage. If query does Seq Scan instead of Index Scan, add GIN index.

Critical

Queries on JSONB/arrays are 10-100x slower without GIN indexes. Full table scans instead of index scans

Not running VACUUM regularly, causing dead tuples to accumulate and slow queries

Run VACUUM ANALYZE regularly (weekly) or tune autovacuum settings

PostgreSQL MVCC creates new row version on UPDATE (doesn't modify old version). Old versions become dead tuples after transaction commits. Dead tuples waste space and slow queries (more data to scan). VACUUM removes dead tuples, reclaims space. ANALYZE updates statistics for query planner. PostgreSQL has autovacuum (runs automatically), but may need tuning for high-write workloads. Check dead tuples: SELECT relname, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC. If n_dead_tup > 10% of table, run VACUUM. Manual VACUUM: VACUUM ANALYZE table_name. Tune autovacuum: ALTER TABLE table_name SET (autovacuum_vacuum_scale_factor = 0.05). Real-world: Uber runs VACUUM daily on high-write tables. Lesson: Monitor dead tuples, run VACUUM regularly.

Monitor dead tuples with pg_stat_user_tables. Run VACUUM ANALYZE weekly. For high-write tables, tune autovacuum or run VACUUM daily.

High

Dead tuples accumulate (old row versions from MVCC), bloat table, slow queries. Table size grows unnecessarily

Opening too many database connections (1000s) without connection pooling

Use PgBouncer connection pooling to handle 1000s of connections with ~100 database connections

PostgreSQL creates process per connection (not thread). Each process uses ~10MB memory. 1000 connections = 10GB memory overhead. Context switching between 1000 processes is slow. Solution: PgBouncer connection pooling. Application opens 10K connections to PgBouncer. PgBouncer pools to 100 database connections. 100x more efficient. PgBouncer modes: (1) Session pooling (connection held for session), (2) Transaction pooling (connection held for transaction, most efficient), (3) Statement pooling (connection held for statement, rarely used). Real-world: Instagram uses PgBouncer to handle 50K connections with 100 database connections. Configuration: max_connections = 100 (PostgreSQL), max_client_conn = 10000 (PgBouncer). Lesson: Always use connection pooling for web applications.

Use PgBouncer for connection pooling. Set max_connections = 100 in PostgreSQL, max_client_conn = 10000 in PgBouncer. Use transaction pooling mode for best performance.

Critical

PostgreSQL handles ~100 connections well. With 1000s of connections, performance degrades (context switching, memory overhead)

Using arrays for complex many-to-many relationships that need referential integrity or metadata

Use arrays for simple lists (tags, categories). Use junction tables for complex relationships with metadata

Arrays are great for simple lists (tags, categories) but not for complex relationships. Arrays limitations: (1) No referential integrity (can't enforce foreign keys), (2) No metadata (can't store created_at, order, etc), (3) Difficult to query from both sides (can't efficiently find all posts for a tag). When to use arrays: Simple lists, no metadata needed, order matters, query from one side only. When to use junction tables: Complex relationships, need referential integrity, need metadata, query from both sides. Example: Blog posts with tags. Use arrays if you only query posts by tag (simple). Use junction table if you also query tags by post count (complex). Real-world: Instagram uses arrays for hashtags (simple list, query photos by hashtag). GitHub uses junction table for repository stars (need metadata: starred_at, query from both sides). Lesson: Use arrays for simple lists, junction tables for complex relationships.

Use arrays for simple lists (tags, categories). Use junction tables when you need referential integrity, metadata, or bidirectional queries.

Medium

Arrays have no referential integrity (no foreign keys). Cannot store metadata (created_at, order). Difficult to query from both sides