Database Sharding

When one machine cannot hold your data, you split it across many. Sharding sounds simple. The actual choice of shard key, the resharding strategy, and the cross-shard queries are where systems live or die.

Why we split data

A single database server has limits. CPU. RAM. Disk. Network. At some point, even the biggest machine you can buy cannot keep up. Sharding splits the data across multiple machines, each holding a slice of the whole. Each query goes to the shard that owns the relevant data.

Sharding is the heavy artillery of horizontal scale. It works, and it is also a pain to operate. You do not shard until you have to.

The mechanics

A shard is a partition of the data on its own machine (or cluster). To know which shard owns a row, you compute a function of the row's key. The choice of that function is the choice that decides everything.

Range-based sharding

Split by ranges of the key. Users 1-1M on shard A, 1M-2M on shard B, etc. Easy to implement. Easy to do range queries within a shard. The downside: hot spots. If the latest users are the most active (very common), the newest shard is overloaded while older ones are idle.
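Range routing is a sorted list of boundaries plus a binary search. A minimal sketch, assuming a hypothetical three-shard layout (users 0-1M on shard A, 1M-2M on shard B, the rest on shard C); the shard names and bounds are illustrative:

```python
import bisect

# Upper bounds (exclusive) of each range, kept in ascending order.
RANGE_BOUNDS = [1_000_000, 2_000_000]
SHARDS = ["shard-a", "shard-b", "shard-c"]

def range_shard(user_id: int) -> str:
    """Route a key to the shard whose range contains it."""
    return SHARDS[bisect.bisect_right(RANGE_BOUNDS, user_id)]
```

Because the boundaries are ordered, a range query like "users 500k to 900k" touches only the shards whose ranges overlap it, often just one.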

Hash-based sharding

Apply a hash function to the key, take modulo N. Spreads load uniformly. The cost: you destroy ordering. Range queries (give me users from 1M to 2M) now hit every shard. And reshuffling when N changes is brutal.
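A sketch of hash-mod-N routing, assuming string keys. Note the deliberate use of a stable hash: Python's built-in `hash()` is randomized per process, so two application servers would disagree on where a key lives.

```python
import hashlib

N_SHARDS = 4

def hash_shard(key: str) -> int:
    # MD5 here is for stable, uniform routing, not security.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS
```

Every caller computes the same shard for the same key, but adjacent keys land on unrelated shards, which is exactly why range queries must fan out.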

Directory-based sharding

A lookup service tracks which shard owns which key. Maximum flexibility, you can rebalance at will. The lookup service becomes a bottleneck and a single point of failure if you are not careful.
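The directory is conceptually just a key-to-shard map. A minimal in-memory sketch; a production directory would be a replicated service backed by a consensus store precisely to avoid the single point of failure mentioned above:

```python
# Hypothetical in-memory directory mapping keys to shard names.
directory: dict[str, str] = {}

def assign(key: str, shard: str) -> None:
    directory[key] = shard

def lookup(key: str) -> str:
    return directory[key]

def rebalance(key: str, new_shard: str) -> None:
    # Moving a key is just a directory update (plus the data copy);
    # no hash math constrains where it can go.
    directory[key] = new_shard
```

The flexibility is visible in `rebalance`: any key can move to any shard at any time, which neither range nor hash schemes allow.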

Geo-based sharding

Shard by user location. EU users on EU shards, US users on US shards. Helps with data residency laws and reduces latency for users near their shard. Cross-region queries are the cost.
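Geo routing can be as simple as a region table with a fallback. A sketch with hypothetical region and shard names:

```python
# Hypothetical region-to-shard table for geo-based sharding.
REGION_SHARDS = {"eu": "shard-eu-1", "us": "shard-us-1"}
DEFAULT_SHARD = "shard-us-1"

def geo_shard(user_region: str) -> str:
    # EU users stay on EU hardware for data residency;
    # unknown regions fall back to a default shard.
    return REGION_SHARDS.get(user_region, DEFAULT_SHARD)
```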

[Diagram: a sharded database. The application sends queries to a shard router, which routes user_id ranges 0–25M, 25M–50M, 50M–75M, and 75M–100M to shards 1 through 4.]
A shard router decides which shard owns each row by hashing or ranging the key.

Picking the shard key

This is the most important choice in your sharded system. A bad shard key haunts you for years.

Good shard keys have: high cardinality (many distinct values), an even distribution of reads and writes across those values, and a place in nearly every common query so lookups stay on one shard.

Bad shard keys: anything that creates hot spots (today's date, auto-incrementing IDs, status flags). Anything where a single user maps to multiple shards (random UUID per row, when queries always group by user).

Cross-shard queries are expensive

Once you shard, any query that does not include the shard key must hit every shard, gather results, and merge. This is called scatter-gather. It costs the latency of the slowest shard plus coordination overhead. Avoid where possible. Design your access patterns so common queries land on one shard.

Resharding: the part nobody warns you about

You started with 4 shards. You're at 90% capacity. You want to go to 8. Now what?

Naive hash mod N means changing N from 4 to 8 reshuffles roughly half the data. While the system is live. Customers expect their queries to keep working. This is where consistent hashing comes in (next topic). It limits the data movement to roughly 1/N of the total. Most production systems use it.
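The "roughly half" claim is easy to verify. This sketch counts how many keys change shards when N goes from 4 to 8 under naive modulo routing:

```python
def moved_fraction(n_keys: int, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes when mod-N routing
    goes from old_n to new_n shards."""
    moved = sum(1 for k in range(n_keys) if k % old_n != k % new_n)
    return moved / n_keys
```

For 4 to 8 shards, exactly half the keys move (every key whose value mod 8 is 4 or more); consistent hashing cuts that to roughly 1/N.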

Even with consistent hashing, resharding is one of the riskiest operations you do. Practice it before you need it.

Don't shard prematurely

Modern databases on a single machine handle terabytes and tens of thousands of QPS. Vertical scaling, read replicas, and good indexes solve 90% of "we need to shard" conversations. Shard when your numbers say so, not when your architect crush says so.

Functional sharding (different data on different DBs)

An older but still useful pattern: put different tables on different DBs. Users on one, orders on another, analytics on a third. It is easier than sharding within a table, frequently enough on its own, and a common stepping stone before you need real sharding.
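Functional sharding routes by table rather than by key. A minimal sketch with hypothetical table and database names:

```python
# Hypothetical table-to-database routing for functional sharding.
TABLE_TO_DB = {
    "users": "users-db",
    "orders": "orders-db",
    "events": "analytics-db",
}

def db_for(table: str) -> str:
    """Every query for a table goes to that table's database."""
    return TABLE_TO_DB[table]
```

The catch is the same as cross-shard queries: a JOIN between `users` and `orders` now spans two databases and has to happen in the application.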

Real-world examples

Instagram shards by user ID. Each user's photos all live on the same shard, so loading a profile is a one-shard operation. Discord shards messages by channel ID, because most queries scope to a channel. Slack shards workspaces; cross-workspace queries are rare. The shard key always tracks the dominant access pattern.

Sharding solves real scale problems and creates new operational ones. Pick the shard key carefully. Design queries to respect it. Plan resharding before you need it. The patterns above will cover 90% of what you'll see in practice.