Normalization and Denormalization
Normalization is about putting each fact in exactly one place. Denormalization is about putting it in many places to make reads fast. Pick by the read/write ratio you actually have, not by what the textbook says.
Two opposite philosophies
Normalization is the discipline of removing duplication from your schema. Each fact (a user's email, a product's price) lives in exactly one place. If that fact changes, you update one row. The trade-off: you need joins to assemble useful data.
Denormalization is the opposite. You store duplicate or computed data on purpose, because joins are too expensive. The trade-off: writes get harder. When the underlying fact changes, you have to update many rows.
Most production schemas live somewhere on the spectrum. Pure normalization is too slow. Pure denormalization is too error-prone. The art is knowing where to be.
Normalization, in five minutes
Codd's normal forms (1NF, 2NF, 3NF, BCNF) are the formal rules. The simplified version:
- 1NF: no repeating columns. Atomic values per cell. No comma-separated lists in one column.
- 2NF: every non-key column depends on the whole primary key, not part of it.
- 3NF: no transitive dependencies. Non-key columns depend only on the key, not on other non-key columns.
For 95% of OLTP applications, getting to 3NF is enough. Don't memorize the rules; learn the smell. If you ever find yourself wanting to update the same fact in multiple rows, your schema is under-normalized.
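The smell is easy to demonstrate. A minimal sketch with Python's sqlite3 (the table and column names are invented for illustration): storing dept_name next to dept_id is a transitive dependency, so one department rename touches every employee row.

```python
import sqlite3

# dept_name depends on dept_id, not on the employee's key -- a
# transitive dependency, i.e. a 3NF violation.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE employees
              (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER, dept_name TEXT)""")
db.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
               [(1, "Ada", 10, "Research"), (2, "Lin", 10, "Research")])

# Renaming the department means updating the same fact in many rows:
cur = db.execute("UPDATE employees SET dept_name = 'R&D' WHERE dept_id = 10")
print(cur.rowcount)  # 2 -- two rows rewritten for one fact
```

The fix is a departments(id, name) table and a plain dept_id foreign key, after which the rename is a single-row update.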
An example progression
Imagine you have orders and want to store the customer's name on each order.
Bad (denormalized for no reason)
orders(id, customer_name, customer_email, customer_address, items, total)
If a customer changes their email, you update every order they ever placed. Worse, the customer table doesn't even exist. There's no source of truth.
Normalized
customers(id, name, email, address)
orders(id, customer_id, total)
order_items(id, order_id, product_id, qty, price)
Customer's email lives in one place. Joins assemble the order view.
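A runnable sketch of that normalized shape, using sqlite3 (columns follow the schema above; the sample data is invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT, address TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(id), total REAL);
CREATE TABLE order_items (id INTEGER PRIMARY KEY,
                          order_id INTEGER REFERENCES orders(id),
                          product_id INTEGER, qty INTEGER, price REAL);
""")
db.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com', '1 Loop Rd')")
db.execute("INSERT INTO orders VALUES (100, 1, 42.0)")

# The email lives in one place: changing it is a single-row update.
db.execute("UPDATE customers SET email = 'ada@new.example' WHERE id = 1")

# A join assembles the order view.
row = db.execute("""
    SELECT o.id, c.name, c.email, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchone()
print(row)  # (100, 'Ada', 'ada@new.example', 42.0)
```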
Smartly denormalized
customers(id, name, email, address)
orders(id, customer_id, customer_name_at_time_of_order, total, ...)
order_items(...)
Why duplicate the name on the order? Because the order is a historical record. If the customer changes their name later, you still want to know what name was on the receipt. The duplication is intentional and immutable.
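One way that snapshot might be written, again sketched with sqlite3: the name is copied from customers at insert time and never synced afterward (the INSERT ... SELECT shape is one option, not the only one).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     customer_name_at_time_of_order TEXT, total REAL);
""")
db.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace')")

# Copy the current name onto the order: an intentional, immutable snapshot.
db.execute("""
    INSERT INTO orders (id, customer_id, customer_name_at_time_of_order, total)
    SELECT 100, id, name, 42.0 FROM customers WHERE id = 1
""")

# A later rename does not rewrite history:
db.execute("UPDATE customers SET name = 'Ada King' WHERE id = 1")
receipt_name = db.execute(
    "SELECT customer_name_at_time_of_order FROM orders WHERE id = 100").fetchone()[0]
print(receipt_name)  # Ada Lovelace
```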
When to denormalize
Denormalize when:
- Reads vastly outnumber writes. A 1000:1 read:write ratio justifies precomputing the read shape, e.g. with a materialized view.
- Joins are too expensive. Cross-shard joins, joins over huge tables, queries that hit the database 1000 times for one user view.
- You need historical accuracy. Order receipts, invoices, audit logs. Even after the source data changes, the historical record stays the same.
- You are using a NoSQL database. Document stores expect denormalized data; embed related data in the document.
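The read-heavy case above often takes the form of an aggregate precomputed at write time instead of counted on every read. A sketch, assuming an invented order_count column on customers:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT,
                        order_count INTEGER DEFAULT 0);  -- denormalized aggregate
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")
db.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")

def place_order(db, order_id, customer_id, total):
    # The write pays extra: every insert also bumps the counter.
    db.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, total))
    db.execute("UPDATE customers SET order_count = order_count + 1 WHERE id = ?",
               (customer_id,))

place_order(db, 100, 1, 42.0)
place_order(db, 101, 1, 7.5)

# Reads become a single indexed lookup instead of a COUNT over orders.
count = db.execute("SELECT order_count FROM customers WHERE id = 1").fetchone()[0]
print(count)  # 2
```

In a real system the insert and the counter update belong in one transaction, or the duplicate drifts out of sync with the truth.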
Materialized views: the disciplined denormalization
A materialized view is a denormalized table the database maintains for you. You declare the query; the DB stores the result and refreshes it on demand or, depending on the database, on a schedule or on commit. Postgres, Oracle, and BigQuery all support them. You get the read speed of denormalization without writing the sync code.
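SQLite has no materialized views, so here is a hand-rolled simulation of the idea: a plain table holding a query's result, rebuilt by an explicit refresh function (roughly what Postgres does for you with REFRESH MATERIALIZED VIEW; the table and function names are invented).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
-- The "materialized view": a stored query result.
CREATE TABLE order_totals_mv (customer_id INTEGER PRIMARY KEY, total_spent REAL);
""")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 8.0)])

def refresh_order_totals(db):
    # Declare the query once; refreshing replaces the stored result wholesale.
    db.executescript("""
        DELETE FROM order_totals_mv;
        INSERT INTO order_totals_mv
        SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id;
    """)

refresh_order_totals(db)
total = db.execute(
    "SELECT total_spent FROM order_totals_mv WHERE customer_id = 1").fetchone()[0]
print(total)  # 15.0
```

Reads hit order_totals_mv directly; the cost of the aggregation is paid only at refresh time, and staleness between refreshes is the price.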
NoSQL changes the math
Document databases like MongoDB nudge you toward denormalization. Embedding the customer's name inside each order document is the natural shape. Joins are weak (or nonexistent). The trade-off is the same as before; you just pay it earlier and more often. Choose what to embed by read patterns, not gut feel.
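The embedded shape, sketched as a plain Python dict (field names are illustrative; in a document store like MongoDB this would be inserted as one document):

```python
# An order document with the customer embedded: rendering the order
# reads one document, at the cost of duplicated customer data.
order_doc = {
    "_id": 100,
    "customer": {             # embedded snapshot, not a foreign key
        "id": 1,
        "name": "Ada Lovelace",
    },
    "items": [                # line items embedded too
        {"product_id": 7, "qty": 2, "price": 9.99},
    ],
    "total": 19.98,
}

# No join, no second collection: everything the order view needs is here.
print(order_doc["customer"]["name"])  # Ada Lovelace
```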
The senior heuristic
Start normalized. Measure. When a specific query is too slow, denormalize for that query. Track the duplications you introduce so future engineers know where to update. Resist the temptation to denormalize speculatively for performance you have not measured.
Database design is a long game. Schemas are sticky. Spend the time to get the basic shape right, allow for evolution, and remember: clean schemas keep dev teams happy and production stable for years.