Back-of-the-Envelope Estimation

Back-of-the-envelope numbers turn vague requirements into concrete designs. You learn a handful of rules of thumb, do quick math, and suddenly know whether you need one server or a thousand.

Why architects do napkin math

"It depends" is the most useless phrase in system design. So is "a lot of users". The job of estimation is to convert vague phrases into numbers, because numbers force decisions. Once you can say "10K writes per second, growing 2x a year, 1KB per write", you can compute everything else: storage, bandwidth, server count, cache size. The estimation is rough on purpose. We are not trying to be precise, we are trying to be approximately right.

If your estimate is off by 2x, you will probably still build the right thing. If it is off by 100x, you will build the wrong thing. The whole point is to make sure you are in the right order of magnitude.

The numbers every engineer should know

Memorize these. They show up in every estimation problem.

Powers of 2 (storage)

Suffix   Value        What it represents
KB       10³ ≈ 2¹⁰    1,000 bytes (a paragraph of text)
MB       10⁶ ≈ 2²⁰    1 million bytes (a small image)
GB       10⁹ ≈ 2³⁰    1 billion bytes (a movie)
TB       10¹²         1,000 GB
PB       10¹⁵         1,000 TB

Latency numbers (Jeff Dean's classic list, simplified)

Operation                        Time
L1 cache reference               0.5 ns
L2 cache reference               7 ns
Main memory reference            100 ns
SSD random read                  150 µs
Read 1 MB from memory            250 µs
Round trip in same datacenter    500 µs
Read 1 MB from SSD               1 ms
HDD seek                         10 ms
Round trip across continents     150 ms

The headline insight: a main-memory reference (100 ns) is over 1,000x faster than an SSD random read (150 µs), an SSD random read is nearly 100x faster than an HDD seek (10 ms), and a round trip across continents (150 ms) is roughly 300x slower than a round trip inside the same datacenter (500 µs). Caching exists because of these gaps.

Time math

Seconds in a day: 86,400. Round to 100K. Seconds in a month: ~2.5 million. Seconds in a year: ~30 million. These three numbers will save you ten minutes of math in every estimation problem.
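As a sanity check on those shortcuts, here is a small Python sketch comparing the exact constants to the rounded ones (the variable names are mine, not from the text):

```python
# Exact time constants vs the rounded "friendly" versions used for napkin math.
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 exact
SECONDS_PER_MONTH = SECONDS_PER_DAY * 30  # ~2.59 million
SECONDS_PER_YEAR = SECONDS_PER_DAY * 365  # ~31.5 million

ROUNDED_DAY = 100_000       # "100K"
ROUNDED_MONTH = 2_500_000   # "~2.5 million"
ROUNDED_YEAR = 30_000_000   # "~30 million"

# Each rounding error stays under 20% -- well inside order-of-magnitude tolerance.
for exact, rounded in [(SECONDS_PER_DAY, ROUNDED_DAY),
                       (SECONDS_PER_MONTH, ROUNDED_MONTH),
                       (SECONDS_PER_YEAR, ROUNDED_YEAR)]:
    error = abs(rounded - exact) / exact
    print(f"exact={exact:>12,}  rounded={rounded:>12,}  error={error:.0%}")
```

The worst case is the day (100K overshoots 86,400 by about 16%), which is still far inside the 2x tolerance the section aims for.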

The estimation recipe

Given a system, follow these five steps in order.

  1. Estimate users. Daily active users (DAU). Often given. If not, derive from monthly active users assuming DAU ≈ 30% of MAU.
  2. Estimate per-user activity. How many reads and writes does a user make per day? Be specific by feature.
  3. Convert to QPS. Total daily ops divided by 100K seconds. Multiply by 2-3 for peak.
  4. Estimate storage. Per-write size × writes per day × retention period.
  5. Estimate bandwidth. Per-read size × reads per second.
ESTIMATION FUNNEL

    DAU (e.g. 100M)
  × actions/user/day (e.g. 5 tweets)
  ÷ 100K seconds        = 5K writes/sec average
  × peak factor of 2-3  = 10-15K writes/sec peak

From DAU to peak QPS in four steps. At each step you make one clear assumption that anyone can challenge.
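The funnel above fits in a few lines of Python; `peak_qps` and its parameter names are hypothetical, chosen here for illustration:

```python
def peak_qps(dau: int, actions_per_user_per_day: float,
             peak_factor: float = 3.0) -> float:
    """DAU -> peak QPS, using ~100K seconds/day and a 2-3x peak multiplier."""
    average_qps = dau * actions_per_user_per_day / 100_000
    return average_qps * peak_factor

# The funnel's own example: 100M DAU, 5 actions/user/day, peak factor 3.
print(peak_qps(100_000_000, 5))  # 15000.0 writes/sec at peak
```

Keeping the peak factor as an explicit parameter mirrors the point of the funnel: every multiplier is a named assumption that a reviewer can challenge.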

Worked example: Twitter-like service

Let's design for 200M DAU. The PM tells you the average user posts 2 tweets a day and reads 200 tweets a day. Ready?

Writes per second

200M users × 2 tweets = 400M tweets/day. Divide by 100K seconds = 4,000 writes/sec average. Peak is roughly 3x → 12K writes/sec peak.

Reads per second

200M users × 200 tweet reads = 40 billion reads/day. Divide by 100K = 400K reads/sec average. Peak ~ 1.2M reads/sec.
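The write and read arithmetic above, in runnable form (variable names are mine):

```python
DAU = 200_000_000
TWEETS_PER_USER = 2     # writes per user per day
READS_PER_USER = 200    # tweet reads per user per day
SECONDS_PER_DAY = 100_000  # rounded from 86,400
PEAK_FACTOR = 3

avg_writes = DAU * TWEETS_PER_USER / SECONDS_PER_DAY  # 4,000 writes/sec
avg_reads = DAU * READS_PER_USER / SECONDS_PER_DAY    # 400,000 reads/sec
peak_writes = avg_writes * PEAK_FACTOR                # 12,000 writes/sec
peak_reads = avg_reads * PEAK_FACTOR                  # 1.2M reads/sec
read_write_ratio = avg_reads / avg_writes             # 100:1

print(f"{avg_writes:,.0f} writes/sec avg, {peak_reads:,.0f} reads/sec peak")
```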

Read to write ratio is 100:1. That tells you immediately you need to optimize hard for reads. Caching becomes mandatory, not optional.

Storage

Each tweet ~ 300 bytes (text) + 100 bytes (metadata). Round to 500 bytes. With media references: ~1 KB. Daily writes: 400M × 1 KB = 400 GB/day. Yearly: ~150 TB. With 3x replication: ~450 TB/year. Over 5 years: ~2 PB.
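The storage chain can be checked the same way; this sketch uses 365 days and decimal units, so it lands slightly under the rounded figures in the text:

```python
TWEETS_PER_DAY = 400_000_000
BYTES_PER_TWEET = 1_000   # ~1 KB including media references
REPLICATION_FACTOR = 3
RETENTION_YEARS = 5

daily_gb = TWEETS_PER_DAY * BYTES_PER_TWEET / 1e9       # 400 GB/day
yearly_tb = daily_gb * 365 / 1_000                      # ~146 TB/year (~150 rounded)
replicated_tb_per_year = yearly_tb * REPLICATION_FACTOR # ~440 TB/year (~450 rounded)
five_year_pb = replicated_tb_per_year * RETENTION_YEARS / 1_000  # ~2.2 PB (~2 rounded)

print(f"{daily_gb:.0f} GB/day -> {five_year_pb:.1f} PB over 5 years")
```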

Bandwidth

Reads: 400K/sec × 1 KB = 400 MB/sec read bandwidth. That's just text. Add images and the number can 10x or 100x.

From those four lines of math, you now know:

  - Peak write load: ~12K writes/sec.
  - Peak read load: ~1.2M reads/sec, a 100:1 read-to-write ratio, so caching is mandatory.
  - Storage: ~2 PB over 5 years with replication.
  - Read bandwidth: ~400 MB/sec for text alone, 10-100x more with media.

Five minutes of math gave you the entire architectural skeleton.

Common mistakes

Forgetting peak. Daily averages lie. A social network at 8 PM in California is doing 5x its average. Always multiply by 2-3 for peak.

Ignoring read amplification. A single user fetching their feed might trigger 100 internal reads (one for each tweet, comment, like, profile picture). The user-facing read count can hide a 100x amplification.

Forgetting replication overhead. If you replicate data three times for durability, multiply storage and write bandwidth by 3.

Confusing bytes and bits. Network bandwidth is measured in bits per second; storage is in bytes. 1 Gbps ≠ 1 GBps. Off by 8.
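A quick sketch of the factor-of-8 trap (the function name is mine):

```python
def gbps_to_mb_per_s(gigabits_per_second: float) -> float:
    """Convert network bandwidth (gigabits/sec) to storage units (megabytes/sec)."""
    return gigabits_per_second * 1_000 / 8  # 1 Gbps = 1,000 Mbit/s = 125 MB/s

# A "1 Gbps" link moves 125 MB/s, not 1,000 MB/s -- off by 8 if you mix the units.
print(gbps_to_mb_per_s(1.0))  # 125.0
```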

The "round to nice numbers" rule

In your head, 86,400 is annoying. 100,000 is a friend. 365 is annoying. 400 is a friend. Round generously up. Your estimates will be slightly conservative, which is exactly what you want when capacity planning.

Estimation in interviews

If you skip estimation in a system design interview, the interviewer will assume you do not know how to do it. Even if it feels rushed, take three minutes after gathering requirements to do these calculations out loud. Show your work. Even if you get a number wrong, the interviewer can correct you and move on. Skipping it entirely is the actual red flag.

You do not need a calculator. You do not need precision. You need order of magnitude and you need to show that you reach for numbers reflexively whenever someone says "make it scalable".