Back-of-the-Envelope Estimation
Back-of-the-envelope numbers turn vague requirements into concrete designs. Learn a handful of rules of thumb, do quick math, and suddenly you know whether you need one server or a thousand.
Why architects do napkin math
"It depends" is the most useless phrase in system design. So is "a lot of users". The job of estimation is to convert vague phrases into numbers, because numbers force decisions. Once you can say "10K writes per second, growing 2x a year, 1KB per write", you can compute everything else: storage, bandwidth, server count, cache size. The estimation is rough on purpose. We are not trying to be precise, we are trying to be approximately right.
If your estimate is off by 2x, you will probably still build the right thing. If it is off by 100x, you will build the wrong thing. The whole point is to make sure you are in the right order of magnitude.
The numbers every engineer should know
Memorize these. They show up in every estimation problem.
Powers of 2 (storage)
| Suffix | Value | What it represents |
|---|---|---|
| KB | 10³ ≈ 2¹⁰ | 1,000 bytes (a paragraph of text) |
| MB | 10⁶ ≈ 2²⁰ | 1 million bytes (a small image) |
| GB | 10⁹ ≈ 2³⁰ | 1 billion bytes (a movie) |
| TB | 10¹² ≈ 2⁴⁰ | 1,000 GB |
| PB | 10¹⁵ ≈ 2⁵⁰ | 1,000 TB |
Latency numbers (Jeff Dean's classic list, simplified)
| Operation | Time |
|---|---|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
| Read 1 MB from memory | 250 µs |
| SSD random read | 150 µs |
| Round trip in same datacenter | 500 µs |
| Read 1 MB from SSD | 1 ms |
| HDD seek | 10 ms |
| Round trip across continents | 150 ms |
The headline insight: a memory reference is ~1,000x faster than an SSD random read, an SSD random read is roughly 100x faster than an HDD seek, and a round trip across continents is ~300x slower than a round trip inside the datacenter. Caching exists because of these gaps.
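Those ratios fall straight out of the table. A quick sketch (figures copied from the table above; real hardware varies):

```python
# Latency numbers from the table, in nanoseconds.
LATENCY_NS = {
    "main_memory_ref": 100,
    "ssd_random_read": 150_000,
    "hdd_seek": 10_000_000,
    "dc_round_trip": 500_000,
    "cross_continent_round_trip": 150_000_000,
}

def ratio(slow: str, fast: str) -> float:
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]

print(ratio("ssd_random_read", "main_memory_ref"))           # 1500.0
print(ratio("hdd_seek", "ssd_random_read"))                  # ≈ 67
print(ratio("cross_continent_round_trip", "dc_round_trip"))  # 300.0
```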
Time math
Seconds in a day: 86,400. Round to 100K. Seconds in a month: ~2.5 million. Seconds in a year: ~30 million. These three numbers will save you ten minutes of math in every estimation problem.
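A quick check of how much the rounding costs you (the rounded constants are the ones from the paragraph above):

```python
# Exact time constants vs the rounded versions used for napkin math.
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 — round to 100K
SECONDS_PER_MONTH = 30 * SECONDS_PER_DAY  # 2,592,000 — round to 2.5M
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # 31,536,000 — round to 30M

# Rounding a day up to 100K overstates it by ~16% — well inside
# order-of-magnitude tolerance.
print(SECONDS_PER_DAY)                       # 86400
print(round(100_000 / SECONDS_PER_DAY, 2))   # 1.16
```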
The estimation recipe
Given a system, follow these five steps in order.
- Estimate users. Daily active users (DAU). Often given. If not, derive from monthly active users assuming DAU ≈ 30% of MAU.
- Estimate per-user activity. How many reads and writes does a user make per day? Be specific by feature.
- Convert to QPS. Total daily ops divided by 100K seconds. Multiply by 2-3 for peak.
- Estimate storage. Per-write size × writes per day × retention period.
- Estimate bandwidth. Per-read size × reads per second.
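The five steps above can be sketched as one function. This is an illustration, not a standard API; the function and parameter names are invented here:

```python
def napkin_estimate(dau, writes_per_user_day, reads_per_user_day,
                    bytes_per_write, retention_days, peak_factor=3):
    """Run the five-step recipe and return the derived numbers."""
    SECONDS_PER_DAY = 100_000  # rounded, as in the text
    daily_writes = dau * writes_per_user_day
    daily_reads = dau * reads_per_user_day
    return {
        "write_qps_avg": daily_writes / SECONDS_PER_DAY,
        "write_qps_peak": peak_factor * daily_writes / SECONDS_PER_DAY,
        "read_qps_avg": daily_reads / SECONDS_PER_DAY,
        "read_qps_peak": peak_factor * daily_reads / SECONDS_PER_DAY,
        "storage_bytes": daily_writes * bytes_per_write * retention_days,
    }

# 200M DAU, 2 writes and 200 reads per user per day, 1 KB writes, 1 year.
est = napkin_estimate(200_000_000, 2, 200, 1_000, 365)
print(est["write_qps_avg"])   # 4000.0
print(est["read_qps_peak"])   # 1200000.0
```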
Worked example: Twitter-like service
Let's design for 200M DAU. The PM tells you the average user posts 2 tweets a day and reads 200 tweets a day. Ready?
Writes per second
200M users × 2 tweets = 400M tweets/day. Divide by 100K seconds = 4,000 writes/sec average. Peak is roughly 3x → 12K writes/sec peak.
Reads per second
200M users × 200 tweet reads = 40 billion reads/day. Divide by 100K = 400K reads/sec average. Peak ~ 1.2M reads/sec.
Read to write ratio is 100:1. That tells you immediately you need to optimize hard for reads. Caching becomes mandatory, not optional.
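The QPS arithmetic above takes four lines to verify (constants copied from the worked example):

```python
DAU = 200_000_000
SECONDS_PER_DAY = 100_000  # rounded

writes_per_day = DAU * 2
write_qps = writes_per_day / SECONDS_PER_DAY   # 4,000
peak_write_qps = 3 * write_qps                 # 12,000

reads_per_day = DAU * 200
read_qps = reads_per_day / SECONDS_PER_DAY     # 400,000
peak_read_qps = 3 * read_qps                   # 1,200,000

print(read_qps / write_qps)                    # 100.0 — the read:write ratio
```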
Storage
Each tweet: ~300 bytes of text + ~100 bytes of metadata ≈ 400 bytes; round up to 500 bytes. With media references: ~1 KB. Daily writes: 400M × 1 KB = 400 GB/day. Yearly: ~150 TB. With 3x replication: ~450 TB/year. Over 5 years: ~2 PB.
Bandwidth
Reads: 400K/sec × 1 KB = 400 MB/sec read bandwidth. That's just text. Add images and the number can 10x or 100x.
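The storage and bandwidth chains above can be checked in one pass (constants from the worked example; decimal units throughout):

```python
TWEETS_PER_DAY = 400_000_000
BYTES_PER_TWEET = 1_000     # ~500 B text+metadata, ~1 KB with media refs
REPLICATION = 3

daily_bytes = TWEETS_PER_DAY * BYTES_PER_TWEET            # 400 GB/day
yearly_tb = daily_bytes * 365 / 1e12                      # 146 TB/year
replicated_yearly_tb = yearly_tb * REPLICATION            # 438 TB/year
five_year_pb = replicated_yearly_tb * 5 / 1_000           # ~2.2 PB

READ_QPS = 400_000
read_bandwidth_mb_s = READ_QPS * BYTES_PER_TWEET / 1e6    # 400 MB/s

print(yearly_tb, replicated_yearly_tb, five_year_pb, read_bandwidth_mb_s)
```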
From those four lines of math, you now know:
- You need a horizontally scaled write tier (12K/sec is too much for one DB).
- You need aggressive read caching (1.2M reads/sec).
- You need sharded storage (PB-scale data).
- You need a CDN (high read bandwidth).
Five minutes of math gave you the entire architectural skeleton.
Common mistakes
Forgetting peak. Daily averages lie. A social network at 8 PM in California is doing 5x its average. Always multiply by at least 2-3 for peak, more if your traffic is spiky.
Ignoring read amplification. A single user fetching their feed might trigger 100 internal reads (one for each tweet, comment, like, profile picture). The user-facing read count can hide a 100x amplification.
Forgetting replication overhead. If you replicate data three times for durability, multiply storage and write bandwidth by 3.
Confusing bytes and bits. Network bandwidth is measured in bits per second; storage is in bytes. 1 Gbps ≠ 1 GBps. Off by 8.
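A one-line converter keeps the factor of 8 honest (the function name is illustrative):

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert network gigabits/sec to megabytes/sec: divide by 8."""
    return gbps * 1_000 / 8

print(gbps_to_mb_per_s(1))   # 125.0 MB/s — not 1,000 MB/s
```

By this conversion, the worked example's 400 MB/sec of text-only read bandwidth already needs 3.2 Gbps of network capacity.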
Estimation in interviews
If you skip estimation in a system design interview, the interviewer will assume you do not know how to do it. Even if it feels rushed, take three minutes after gathering requirements to do these calculations out loud. Show your work. Even if you get a number wrong, the interviewer can correct you and move on. Skipping it entirely is the actual red flag.
You do not need a calculator. You do not need precision. You need order of magnitude and you need to show that you reach for numbers reflexively whenever someone says "make it scalable".