Denormalization Strategies: Concept

Module: Schema Design & Advanced DDL

Denormalization is the process of intentionally adding redundant data to your database to improve read performance. After you've normalized your schema to eliminate redundancy and anomalies, you selectively add some redundancy back based on actual query patterns and performance measurements.

Think of normalization as organizing a library where every book appears exactly once, and you use a card catalog to find related information. Denormalization is like putting popular reference books on every floor so people don't have to go to the main library every time. You're duplicating data to make access faster.

The key word is "selectively." You don't denormalize everything. You measure your queries, identify the expensive joins or aggregations that happen frequently, and denormalize those specific cases. You're making a conscious trade-off: accepting slower writes and more storage to get faster reads.

Here's what happens in a normalized vs denormalized schema:

**Normalized (3NF):**

- orders table: order_id, customer_id, order_date, total

- customers table: customer_id, name, email, city

- Query: SELECT o.order_id, c.name, c.email FROM orders o JOIN customers c ON o.customer_id = c.id

- Every query requires a join. If you're showing 1 million orders per day, that's 1 million joins.

**Denormalized:**

- orders table: order_id, customer_id, customer_name, customer_email, order_date, total

- Query: SELECT order_id, customer_name, customer_email FROM orders

- No join needed. Query runs 5x faster. But when a customer changes their email, you update 1 customer row plus potentially 100 order rows.

The question isn't "is this normalized?" The question is "does the performance gain justify the maintenance cost?"

## Why Denormalization Exists

Normalization optimizes for data integrity and storage efficiency. It assumes that writes and reads are equally important. But in real systems, reads often outnumber writes by 100:1 or even 1000:1.

If you have a query that runs 10,000 times per day and takes 500ms because of joins, that's 5,000 seconds (83 minutes) of total query time daily. If denormalizing makes it run in 50ms, you save 4,500 seconds (75 minutes) daily. The cost? Maybe updating 5 extra rows when data changes, which happens 10 times per day and takes 50ms extra per update. That's 500ms of extra write time to save 75 minutes of read time.

The math is clear when reads vastly outnumber writes.

## Seven Denormalization Patterns

### 1. Redundant Columns (Most Common)

Add frequently-joined columns to avoid joins entirely.

**Example:** E-commerce orders

```

-- Normalized

orders: order_id, customer_id, product_id, quantity

customers: customer_id, name, email

products: product_id, name, price

-- Query requires 2 joins

SELECT o.order_id, c.name, p.name, o.quantity

FROM orders o

JOIN customers c ON o.customer_id = c.id

JOIN products p ON o.product_id = p.id;

-- Denormalized

orders: order_id, customer_id, customer_name, product_id, product_name, product_price, quantity

-- No joins needed

SELECT order_id, customer_name, product_name, quantity

FROM orders;

```

**When to use:** The joined data is read frequently but changes rarely. Customer names change occasionally. Product names change rarely. But you query orders constantly.

**Maintenance:** When customer name changes, update customers table + all orders for that customer. Use triggers or application logic.