Design a Distributed Rate Limiter
Design a distributed rate limiter that enforces limits across many servers in real time. The hard part is not the algorithm; it is keeping the counts consistent and fast across a fleet.
Problem
Build a service that enforces rate limits like "1000 requests per minute per API key" across a fleet of API servers. Every request consults the limiter before being served. The limiter must be fast (sub-millisecond), accurate (no leaking past the limit), and survive failures.
Requirements check
Functional: enforce per-API-key, per-IP, per-user, per-endpoint limits. Different limits for different tiers. Return 429 with Retry-After when limited. Non-functional: sub-millisecond p99 latency, 100K+ requests per second, available even when one node fails.
Where it lives
Two main spots: in the API gateway (edge), or as a middleware library in each service. The gateway pattern is more common because it covers everything uniformly. Either way, the actual counting state lives in a shared store so all instances see the same numbers.
The algorithm
Token bucket is the most flexible. Sliding window is more accurate for hard limits. Fixed window is simplest. We covered all four in the rate limiting topic. For this design, let's go with a sliding window log — a sorted set of request timestamps — using Redis.
Per request, run a Lua script on Redis (Lua scripts execute atomically, so the check-and-increment cannot race with other clients):
-- KEYS[1] = "rate:apikey:abc"
-- ARGV[1] = current time in seconds
-- ARGV[2] = window size in seconds (60)
-- ARGV[3] = limit (1000)
-- ARGV[4] = unique request id (member must be unique, or two requests
--           in the same second overwrite each other and undercount)
redis.call('ZREMRANGEBYSCORE', KEYS[1], '-inf', ARGV[1] - ARGV[2])
local count = redis.call('ZCARD', KEYS[1])
if count >= tonumber(ARGV[3]) then
  return {0, count} -- denied
end
redis.call('ZADD', KEYS[1], ARGV[1], ARGV[4])
redis.call('EXPIRE', KEYS[1], ARGV[2] * 2)
return {1, count + 1} -- allowed
This uses a sorted set of timestamps: prune entries older than the window, count what remains, and add the current request's timestamp on success. The unique member keeps multiple requests in the same second from colliding. The TTL cleans up keys for idle API keys.
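To make the window arithmetic concrete, here is the same check as an in-process Python sketch — single node, no Redis, so it is illustrative only; the class and method names are my own, not part of any library:

```python
import time
from collections import defaultdict

class SlidingWindowLog:
    """In-process sketch of the sorted-set logic above (illustrative;
    a real deployment keeps this state in shared Redis)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = defaultdict(list)  # key -> request times

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        cutoff = now - self.window
        # Prune entries outside the window (the ZREMRANGEBYSCORE step).
        log = [t for t in self.timestamps[key] if t > cutoff]
        if len(log) >= self.limit:
            self.timestamps[key] = log
            return False  # denied -> caller returns 429
        log.append(now)   # the ZADD step
        self.timestamps[key] = log
        return True       # allowed
```

Note the order: prune first, then count, then record — exactly what the Lua script does in one atomic round trip.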
Sharding Redis
One Redis instance handles maybe 100K ops per second. For higher throughput, shard by API key. Each gateway hashes the API key and routes to the right shard. Hot keys (one super-active customer) can exceed a single shard's capacity. Solution: split that key across N pseudo-keys and aggregate, accepting slight inaccuracy.
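A minimal sketch of the routing described above, assuming a fixed shard list (the hostnames and key format here are hypothetical). A stable hash like MD5 matters because every gateway must independently pick the same shard for a given key:

```python
import hashlib

NUM_SHARDS = 8
# Hypothetical shard addresses for illustration.
REDIS_SHARDS = [f"redis-{i}.internal:6379" for i in range(NUM_SHARDS)]

def shard_for(api_key: str) -> str:
    """Stable hash so every gateway routes a key to the same shard.
    (Python's built-in hash() is salted per process, so use hashlib.)"""
    h = int(hashlib.md5(api_key.encode()).hexdigest(), 16)
    return REDIS_SHARDS[h % NUM_SHARDS]

def pseudo_keys(api_key: str, n: int) -> list:
    """Split a hot key across n pseudo-keys, each checked against
    limit/n on its own shard; totals are aggregated, so enforcement
    becomes slightly approximate."""
    return [f"rate:apikey:{api_key}:{i}" for i in range(n)]
```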
Failure handling
What happens if Redis is unreachable? Two strategies. Fail open (allow the request) — easier on users but lets bad actors through. Fail closed (deny the request) — safer but punishes everyone. Most companies fail open, with circuit breakers and alerting. Some sensitive endpoints (login, payments) fail closed.
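The per-endpoint policy can be a thin wrapper around the Redis check. This is a sketch under stated assumptions: the endpoint list and function names are hypothetical, and `limiter_call` stands in for whatever raises on Redis connection trouble:

```python
# Hypothetical policy: most endpoints fail open, sensitive ones closed.
FAIL_CLOSED_ENDPOINTS = {"/login", "/payments"}

def check_limit(limiter_call, endpoint: str) -> bool:
    """Returns True if the request should proceed. limiter_call is any
    callable returning True/False, raising ConnectionError when the
    rate-limit store is unreachable."""
    try:
        return limiter_call()
    except ConnectionError:
        # Store unreachable: allow most traffic through (fail open),
        # deny sensitive endpoints (fail closed). A real system would
        # also trip a circuit breaker and fire an alert here.
        return endpoint not in FAIL_CLOSED_ENDPOINTS
```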
Approximate limits
If you can tolerate slight slop, you can dramatically reduce the cost. Each gateway holds a local counter that syncs to Redis every 100 ms. Limits are enforced approximately — a customer might briefly exceed by N gateways' worth of slack, but the average over time is correct. This pattern (approximate counters with periodic reconciliation) scales to insanely high request rates.
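A sketch of that reconciliation loop, with the shared store stood in by a plain dict (a real one would be Redis INCRBY calls; all names here are illustrative). Each gateway admits against its last-known global total plus its own unflushed delta, so it can briefly over-admit:

```python
import threading

class ApproximateLimiter:
    """Per-gateway local counters, flushed to a shared store every
    ~100 ms in practice. Illustrative sketch, not a library API."""

    def __init__(self, limit):
        self.limit = limit
        self.local = {}         # key -> requests since last flush
        self.global_count = {}  # key -> last-known fleet-wide count
        self.lock = threading.Lock()

    def allow(self, key):
        with self.lock:
            seen = self.global_count.get(key, 0) + self.local.get(key, 0)
            if seen >= self.limit:
                return False
            self.local[key] = self.local.get(key, 0) + 1
            return True

    def flush(self, store):
        """Push local deltas to the shared store and pull back the
        fleet-wide totals. Run on a timer in a real gateway."""
        with self.lock:
            for key, delta in self.local.items():
                store[key] = store.get(key, 0) + delta
            self.local.clear()
            self.global_count = dict(store)
```

Two gateways can each admit requests against a stale global count between flushes — that is exactly the "N gateways' worth of slack" the text describes, reconciled at the next sync.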