Denormalization Strategies: Interview
Module: Schema Design & Advanced DDL
When would you choose to denormalize a database, and what trade-offs would you consider?
I would denormalize when read performance is critical and reads vastly outnumber writes (typically 100:1 or higher). The key trade-offs are: (1) Read performance improves significantly - queries run 10-100x faster by eliminating joins. (2) Write performance degrades - updates must modify multiple tables to maintain consistency. (3) Storage increases - redundant data takes more space. (4) Consistency becomes harder - must use triggers or application logic to keep copies in sync. (5) Maintenance complexity increases - developers must remember to update all copies. I would measure the actual query performance and frequency before denormalizing. If a query runs 10,000 times per day at 500ms, and denormalization reduces it to 50ms, that saves 4,500 seconds daily. If writes happen only 100 times per day and take 50ms extra, that costs 5 seconds daily. The net benefit is 4,495 seconds, making denormalization worth it. I would also consider alternatives like caching or better indexes before denormalizing.
Explain the difference between a regular view and a materialized view. When would you use each?
A regular view is a stored query that executes every time you query it. It always shows current data but offers no performance benefit - it's just a query alias. A materialized view stores the query results physically on disk. It's like a denormalized table that the database manages. Querying a materialized view is fast because you're reading pre-computed results, not executing the query. The trade-off is staleness - data is only as fresh as the last refresh. Use regular views for: (1) Security - hiding sensitive columns. (2) Abstraction - simplifying complex queries for users. (3) When you need real-time data. Use materialized views for: (1) Expensive aggregations that don't need real-time data (analytics dashboards). (2) Complex joins that run frequently. (3) Reports that can tolerate staleness (hourly or daily refresh). Example: Netflix uses materialized views for "Top 10 in your country." That's calculated from billions of viewing records but only refreshes every few hours. Users don't need real-time rankings, and the performance gain is massive.
How would you maintain consistency in a denormalized database where customer email is stored in both customers and orders tables?
I would use a combination of strategies: (1) Database triggers - Create an AFTER UPDATE trigger on customers.email that automatically updates orders.customer_email for all orders belonging to that customer. This ensures consistency at the database level and can't be bypassed. (2) Application-level transactions - When updating customer email in application code, wrap both updates in a transaction: BEGIN; UPDATE customers SET email = new_email; UPDATE orders SET customer_email = new_email WHERE customer_id = X; COMMIT; This ensures atomicity. (3) Consistency checks - Run daily batch jobs that verify orders.customer_email matches customers.email and alert on mismatches. (4) Decide on consistency model - For customer email, I'd use strong consistency (immediate updates via triggers) because users expect to see their current email everywhere. For less critical data like cached product names, eventual consistency (hourly batch sync) might be acceptable. (5) Documentation - Clearly document which fields are denormalized and how they're maintained so all developers know to update both places. The key is choosing the right consistency strategy based on how critical the data is and how frequently it changes.
You have an orders table (10M rows) that joins with customers and products tables. The order listing page takes 2 seconds to load because of these joins. Design a denormalization strategy to improve performance.
-- Step 1: Add redundant columns to orders
ALTER TABLE orders
ADD COLUMN customer_name VARCHAR(100),
ADD COLUMN customer_email VARCHAR(255),
ADD COLUMN product_name VARCHAR(200),
ADD COLUMN product_price DECIMAL(10,2);
-- Step 2: Populate existing data
UPDATE orders o
SET
customer_name = (SELECT name FROM customers c WHERE c.customer_id = o.customer_id),
customer_email = (SELECT email FROM customers c WHERE c.customer_id = o.customer_id),
product_name = (SELECT name FROM products p WHERE p.product_id = o.product_id),
product_price = (SELECT price FROM products p WHERE p.product_id = o.product_id);
-- Step 3: Create indexes
CREATE INDEX idx_orders_customer_name ON orders(customer_name);
CREATE INDEX idx_orders_product_name ON orders(product_name);
-- Step 4: Create triggers for consistency
CREATE TRIGGER sync_customer_data
AFTER UPDATE OF name, email ON customers
FOR EACH ROW
BEGIN
UPDATE orders
SET customer_name = NEW.name, customer_email = NEW.email
WHERE customer_id = NEW.customer_id;
END;
CREATE TRIGGER sync_product_data
AFTER UPDATE OF name, price ON products
FOR EACH ROW
BEGIN
UPDATE orders
SET product_name = NEW.name, product_price = NEW.price
WHERE product_id = NEW.product_id;
END;