Rate Limiting

Rate limiting is how you protect a system from being overwhelmed, whether by a buggy client, a malicious actor, or just genuine traffic spikes. We will look at the five classic algorithms and where each fits.

Why rate limit at all

Picture this: your API serves 1,000 requests per second comfortably. Some intern writes a script with a typo and accidentally fires 50,000 requests per second from one IP. Without rate limiting, your servers crash, your database gets hammered, and every legitimate user gets a 500 error. Rate limiting puts a ceiling on how fast any single client (or the system overall) can hit you. It protects you from abuse, from buggy clients, from cost runaways on metered services, and from cascading failures.

The HTTP status code for rate limiting is 429 Too Many Requests. Good rate limiters also send a Retry-After header so the client knows when to try again instead of hammering you in a loop.

The five algorithms

Fixed window

Simplest. Count requests in a calendar window, like "100 requests per minute starting at the top of the minute". Easy to implement with a counter and a timer. The problem is the boundary effect: a client can fire 100 requests at 12:00:59 and 100 more at 12:01:00, getting 200 requests in two seconds and still passing the check.
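
Here is a minimal in-memory sketch of a fixed window counter (single process; the class name and defaults are made up for illustration):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (client_id, window index) -> request count

    def allow(self, client_id):
        window_index = int(time.time() // self.window)
        key = (client_id, window_index)
        if self.counts[key] >= self.limit:
            return False  # over the limit for this window: reject (429)
        self.counts[key] += 1
        return True
```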

Sliding window log

Keep a timestamp for every request in the last 60 seconds. To check, count entries newer than (now minus 60s). Accurate but expensive — you store one entry per request.
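
A sketch of the same idea, assuming one deque of timestamps per client (names and defaults are again illustrative):

```python
import time
from collections import deque

class SlidingWindowLogLimiter:
    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.logs = {}  # client_id -> deque of request timestamps

    def allow(self, client_id):
        now = time.time()
        log = self.logs.setdefault(client_id, deque())
        # Evict timestamps older than the sliding window
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```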

Sliding window counter

A clever compromise. Keep two fixed-window counters, one for the previous window and one for the current window, and weight the previous one by how much of it still overlaps the sliding window. If you are 30 percent into the current minute, the estimate is roughly (70 percent of last minute's count) plus (this minute's count so far). Cheap and very close to accurate.
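
A sketch of that weighted-counter estimate, with the same caveat that the names are illustrative:

```python
import time

class SlidingWindowCounterLimiter:
    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.buckets = {}  # client_id -> {window index: count}

    def allow(self, client_id):
        now = time.time()
        current = int(now // self.window)
        previous = current - 1
        counts = self.buckets.setdefault(client_id, {})
        fraction = (now % self.window) / self.window  # how far into the current window we are
        # Weight the previous window by how much of it still overlaps
        # the sliding window that ends right now.
        estimate = counts.get(previous, 0) * (1 - fraction) + counts.get(current, 0)
        if estimate >= self.limit:
            return False
        counts[current] = counts.get(current, 0) + 1
        return True
```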

Token bucket

Imagine a bucket that fills with tokens at a steady rate, say 10 tokens per second, capped at 100. Every request consumes one token. If the bucket is empty, you reject. This naturally allows short bursts (use up the saved tokens) while enforcing the long-run average. Most production systems use this.
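
A minimal token bucket sketch (illustrative names; a multi-server deployment would also need this state in a shared store):

```python
import time

class TokenBucket:
    def __init__(self, capacity=100, refill_rate=10.0):
        self.capacity = capacity
        self.refill_rate = refill_rate       # tokens added per second
        self.tokens = float(capacity)        # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Top up based on elapsed time, never exceeding capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # bucket empty: reject with 429 and a Retry-After
```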

Leaky bucket

Cousin of token bucket. Requests enter a queue that drains at a fixed rate. If the queue is full, new requests are dropped. Smooths traffic into a constant outflow. Used a lot in network gear.
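
A sketch of the leaky bucket used as a meter rather than a real queue (illustrative, same caveats as above):

```python
import time

class LeakyBucket:
    def __init__(self, capacity=100, leak_rate=10.0):
        self.capacity = capacity
        self.leak_rate = leak_rate        # requests drained per second
        self.level = 0.0                  # current queue depth
        self.last_leak = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drain the queue at the fixed rate since the last check
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.level + 1 > self.capacity:
            return False  # queue full: drop the request
        self.level += 1
        return True
```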

[Diagram: token bucket flow. Tokens refill at 10/sec into a bucket capped at 100; each request consumes one token and is allowed (200 OK), or rejected with 429 Too Many Requests and a Retry-After header when the bucket is empty.]

[Interactive demo: token bucket in motion. Tokens refill at the configured rate (capacity 10, refill 2/sec); send single requests or a burst and watch the bucket empty and refill. Allowed requests turn green, rejected ones turn red.]

Where to put the rate limiter

The closer to the edge, the cheaper bad traffic is to drop. Most companies layer it: at the CDN or WAF (cheap volumetric protection), at the API gateway (per-user, per-API-key limits), and inside services (per-tenant limits, expensive operation limits). The deeper you go, the more you have already paid for the request.

Distributed rate limiting

One server with a counter is easy. Twenty servers behind a load balancer is harder. You cannot let each server count independently, because a client could spread its requests across servers and route around the limit. The standard answer is a shared store like Redis: each request runs an atomic INCR with a TTL on a key like rate:user-42:minute-1234. A single Redis instance comfortably handles this kind of counter traffic, and many providers ship managed rate-limiting solutions on top of it.
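
A rough sketch of that pattern, assuming the redis-py client; the key format and limits are placeholders, and a production version would set the TTL atomically, for example in a Lua script:

```python
import time
import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379)

def allow(user_id, limit=100, window_seconds=60):
    window = int(time.time() // window_seconds)
    key = f"rate:{user_id}:minute-{window}"
    count = r.incr(key)  # atomic, so concurrent servers cannot race past the limit
    if count == 1:
        # First request in this window: give the key a TTL so it cleans itself up
        r.expire(key, window_seconds)
    return count <= limit
```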

Production tip: Always include rate limit info in response headers so clients can self-throttle: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. Polite clients will respect them. Hostile ones will not, but at least your good clients can be good.
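
For illustration, those headers could be built from the limiter's state like this (plain dict, no particular framework assumed):

```python
def rate_limit_headers(limit, remaining, reset_epoch_seconds):
    # Values a well-behaved client can use to self-throttle before seeing 429s
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch_seconds),  # when the current window resets
    }
```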