Normalization (1NF to BCNF): Interview

Module: Schema Design & Advanced DDL

Explain each normal form (1NF, 2NF, 3NF, BCNF) with examples. What problems does each solve?

1NF (First Normal Form): Atomic values, no repeating groups. Example: orders table with product1, product2, product3 columns violates 1NF. Solution: Create order_items table with one row per product. Solves: Cannot store unlimited items, wasted space, complex queries. 2NF (Second Normal Form): 1NF + no partial dependencies. Example: order_items with (order_id, product) as primary key, but customer_name depends only on order_id (partial dependency). Solution: Split into customers, products, orders, order_items tables. Solves: Redundancy, update anomalies (change customer email in multiple rows). 3NF (Third Normal Form): 2NF + no transitive dependencies. Example: customers table with customer_id → zip_code → city, state (transitive). Solution: Split into zip_codes and customers tables. Solves: City/state duplication, update anomalies (change city for zip in multiple rows). BCNF (Boyce-Codd Normal Form): 3NF + every determinant is candidate key. Example: enrollments with instructor → course, but instructor not candidate key. Solution: Make instructor primary key in separate table. Solves: Cannot add instructor without student, deletion loses instructor-course mapping. Real-world: Most applications normalize to 3NF, check BCNF for edge cases.

What are functional dependencies? How do you identify them? Provide examples.

Functional dependency X → Y means: if you know X, you can determine Y uniquely. Example: student_id → student_name (one student_id maps to exactly one student_name). Identifying functional dependencies: (1) List all attributes, (2) For each attribute, ask "what determines this value?", (3) Example: orders table - order_id determines customer_id, order_date; customer_id determines customer_name, email. Types: (1) Full dependency: Non-key attribute depends on entire primary key. Example: (order_id, product_id) → quantity. (2) Partial dependency: Non-key attribute depends on part of composite key. Example: (order_id, product_id) → customer_name (depends only on order_id, violates 2NF). (3) Transitive dependency: A → B → C (indirect). Example: order_id → customer_id → customer_email (violates 3NF). Determinant: Left side of dependency (X in X → Y), the attribute that determines others. Candidate key: Minimal set of attributes that uniquely identifies row. Real-world: Identify functional dependencies before normalizing, helps determine how to split tables.

When should you denormalize? What are the trade-offs?

Denormalize when: (1) Read-heavy workload (many SELECTs, few UPDATEs), (2) Performance critical (milliseconds matter), (3) Expensive joins (5+ tables, > 100ms), (4) Proven bottleneck (measured, not assumed). Example: E-commerce product catalog - denormalize category_name into products table (avoid join for every product display). Trade-offs: Benefits: (1) Faster reads (no joins), (2) Simpler queries (SELECT * FROM products), (3) Lower latency (< 10ms vs 100ms). Costs: (1) Redundancy (category_name duplicated), (2) Update complexity (change category name in multiple products), (3) Data inconsistency risk (forget to update some products), (4) More storage (duplicate data). When NOT to denormalize: (1) Write-heavy workload (many UPDATEs), (2) Data consistency critical (financial, healthcare), (3) Storage expensive, (4) Not measured (premature optimization). Real-world: Stripe normalizes to 3NF for payment data (consistency critical), denormalizes for analytics (read-heavy). Amazon normalizes orders (write-heavy), denormalizes product catalog (read-heavy). Best practice: Normalize to 3NF first, measure performance, selectively denormalize proven bottlenecks (< 1% of queries).

Given this unnormalized table, normalize to 3NF. Identify functional dependencies and explain each step. Table: student_courses(student_id, student_name, student_email, course1, course2, course3, instructor1, instructor2, instructor3)

-- Step 1: Identify functional dependencies

-- student_id → student_name, student_email

-- course → instructor (assume each course has one instructor)

-- Step 2: Convert to 1NF (eliminate repeating groups)

CREATE TABLE student_courses_1nf (

student_id INT,

student_name VARCHAR(100),

student_email VARCHAR(100),

course VARCHAR(100),

instructor VARCHAR(100),

PRIMARY KEY (student_id, course)

);

-- Now in 1NF: Atomic values, no repeating groups

-- Can store unlimited courses per student

-- Step 3: Check for partial dependencies (2NF)

-- Primary key: (student_id, course)

-- student_name depends on student_id only (not course) - PARTIAL

-- student_email depends on student_id only (not course) - PARTIAL

-- instructor depends on course only (not student_id) - PARTIAL

-- Convert to 2NF (eliminate partial dependencies)

CREATE TABLE students (

student_id SERIAL PRIMARY KEY,

student_name VARCHAR(100) NOT NULL,

student_email VARCHAR(100) UNIQUE NOT NULL

);

CREATE TABLE courses (

course_id SERIAL PRIMARY KEY,

course_name VARCHAR(100) UNIQUE NOT NULL,

instructor VARCHAR(100) NOT NULL

);

CREATE TABLE enrollments (

student_id INT REFERENCES students(student_id),

course_id INT REFERENCES courses(course_id),