Distributed Caching
When one cache node is not enough, you spread the cache across many. Sharding by key, dealing with hot keys, and surviving node failures are the three main concerns.
Why distribute the cache
One Redis instance can hold tens of GBs in memory and serve hundreds of thousands of ops per second. Plenty for many systems. But once you cross those limits, or once you need redundancy in case the cache box dies, you split the cache across multiple nodes.
The two distribution strategies
Client-side sharding. The application chooses which node to talk to (often via consistent hashing). Simple, no extra hops, but every client must know the topology. Memcached works this way.
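To make the client-side approach concrete, here is a minimal consistent-hash ring sketch in Python. It is an illustration of the technique, not any particular client library's implementation; the node names and virtual-node count are made up for the example.

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes for smoother key spread."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position at or after the key's hash, wrapping around.
        h = self._hash(key)
        i = bisect.bisect(self.keys, h) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.node_for("article:42")  # same key always maps to the same node
```

The payoff of consistent hashing over naive `hash(key) % N`: when a node joins or leaves, only the keys adjacent to it on the ring move, instead of nearly all of them.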
Server-side cluster. The cache nodes form a cluster. The client sends to any node; the cluster routes internally if needed. Redis Cluster, DynamoDB DAX. More moving parts, easier client.
Hot keys
The single biggest problem in distributed caches. One key gets 10x more traffic than all others combined (think: the front-page article). The node owning that key is overloaded, the rest are idle.
Mitigations:
- Key replication. Store hot keys on multiple nodes. Reads can hit any of them.
- Local cache layer. Cache hot keys in the application process for a few seconds. Massive read volume becomes local.
- Key splitting. Append a random suffix when reading (article:42:rand0 through article:42:rand9) and write to all variants. Spreads the load across the nodes that own the variants.
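The key-splitting pattern can be sketched in a few lines. A plain dict stands in for the distributed cache here (an assumption for the sake of a self-contained example); in practice the reads and writes would go to Redis or Memcached.

```python
import random

N_VARIANTS = 10
cache = {}  # stand-in for a distributed cache client

def write_hot(key, value):
    # Write every variant so a reader of any variant sees the fresh value.
    for i in range(N_VARIANTS):
        cache[f"{key}:{i}"] = value

def read_hot(key):
    # Read a random variant; traffic spreads across the owning nodes.
    return cache.get(f"{key}:{random.randrange(N_VARIANTS)}")

write_hot("article:42", {"title": "Front page"})
value = read_hot("article:42")
```

The trade-off is explicit: writes get N times more expensive so that reads fan out across N nodes. That is exactly the right trade for a key that is read millions of times and written rarely.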
What happens when a node dies
Two scenarios.
Cache is just a performance layer. Losing a node means a temporary spike in DB queries while keys re-warm on other nodes. Annoying, survivable.
Cache holds critical data with no DB backing. Now data is gone forever. Don't do this. Always treat the cache as derivable from a source of truth.
Production caches usually run with replicas (Redis Sentinel or Cluster) so a node failure triggers failover, not data loss.
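For reference, the Sentinel side of that setup is a few lines of configuration. A minimal sketch, with an assumed master name, address, and timeouts:

```
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
```

The trailing 2 on the monitor line is the quorum: how many Sentinels must agree the master is down before a failover starts.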
Redis vs Memcached: the eternal question
| | Redis | Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, streams | Strings only |
| Persistence | Optional (RDB, AOF) | None |
| Replication | Built-in | Client-side or via add-ons |
| Pub/Sub | Yes | No |
| Performance | Mostly single-threaded, ~100K ops/sec per node | Multi-threaded, can scale a single node |
| Use case | General purpose, cache + queue + leaderboards | Plain key-value cache |
Today, Redis wins almost every "we need a cache" decision. Memcached is still around for very simple, very high-throughput scenarios.
Common architectures
Single Redis with replica. Plenty for most teams. Redis Sentinel handles failover.
Redis Cluster. 3-6+ shards, each with a replica. Scales out as needed.
Tiered cache. In-process cache (per-app instance) plus Redis. Hot keys served from process; warm keys from Redis. Best latency, cost in complexity.
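The tiered pattern boils down to a tiny in-process TTL cache in front of a slower backend. A minimal sketch, where the backend callable stands in for a Redis GET (names and the TTL value are illustrative assumptions):

```python
import time

class TieredCache:
    """In-process L1 cache with a per-entry TTL, fronting a slower backend."""

    def __init__(self, backend_get, ttl=5.0):
        self.backend_get = backend_get  # e.g. a wrapper around Redis GET
        self.ttl = ttl
        self.local = {}  # key -> (value, expires_at)

    def get(self, key):
        hit = self.local.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]                  # L1 hit: served in-process
        value = self.backend_get(key)      # L1 miss: go to the remote tier
        self.local[key] = (value, time.monotonic() + self.ttl)
        return value

calls = []
def slow_backend(key):
    calls.append(key)
    return f"value-for-{key}"

tc = TieredCache(slow_backend, ttl=60)
first = tc.get("article:42")
second = tc.get("article:42")  # within TTL: backend is not called again
```

The complexity cost mentioned above shows up here as staleness: each app instance may serve a value up to one TTL old, so the TTL has to be chosen per workload.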
Distributed caching is one of those topics where the implementation looks fine until traffic shifts and you discover hot keys, replication lag, or a missed invalidation. Build for failure modes from day one. Monitor hit rate. Have a runbook for "cache is down".