Rate Limiting
Rate limiting is how you protect a system from being overwhelmed, whether by a buggy client, a malicious actor, or just genuine traffic spikes. We will look at the five classic algorithms and where each fits.
Why rate limit at all
Picture this: your API serves 1000 requests per second comfortably. Some intern writes a script with a typo and accidentally fires 50,000 requests per second from one IP. Without rate limiting, your servers crash, your database gets hammered, and every legitimate user gets a 500 error. Rate limiting puts a ceiling on how fast any single client (or the system overall) can hit you. It protects you from abuse, from buggy clients, from cost runaways on metered services, and from cascading failures.
The HTTP status code for rate limiting is 429 Too Many Requests. Good rate limiters also send a Retry-After header so the client knows when to try again instead of hammering you in a loop.
The five algorithms
Fixed window
Simplest. Count requests in a calendar window, like "100 requests per minute starting at the top of the minute". Easy to implement with a counter and a timer. The problem is the boundary effect: a client can fire 100 requests at 12:00:59 and 100 more at 12:01:00, getting 200 requests in two seconds and still passing the check.
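A minimal in-memory sketch of the idea (the class and method names here are illustrative, not from any particular library). The window id is just the current time divided by the window length, so windows align to the clock:

```python
import time

class FixedWindowLimiter:
    """Allow up to `limit` requests per calendar-aligned window."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window)  # calendar-aligned window id
        if window != self.current_window:
            self.current_window = window  # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Note that nothing stops a client from spending its full budget in the last second of one window and again in the first second of the next; that is the boundary effect.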
Sliding window log
Keep a timestamp for every request in the last 60 seconds. To check, count entries newer than (now minus 60s). Accurate but expensive — you store one entry per request.
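A sketch using a deque of timestamps (again, illustrative names). Old entries are evicted lazily on each check:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Allow if fewer than `limit` requests occurred in the last `window` seconds."""

    def __init__(self, limit, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # one timestamp per accepted request

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Evict timestamps that have aged out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The memory cost is the problem: at the limit, you hold `limit` timestamps per client, per window.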
Sliding window counter
A clever compromise. Keep two fixed-window counters and weight them by how far through the current window you are. If you are 30 percent into the current minute, the rate is roughly (70 percent of last minute) plus (current minute count). This assumes last minute's requests were spread evenly, which is rarely exactly true but close enough in practice. Cheap and very close to accurate.
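The weighting can be sketched like this (hypothetical class, constant memory per client: just two counters and a window id):

```python
class SlidingWindowCounter:
    """Approximate a sliding window from two fixed-window counters."""

    def __init__(self, limit, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self.current_window = 0
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now):
        window = int(now // self.window)
        if window != self.current_window:
            # Slide forward: current becomes previous; anything older drops out.
            self.previous_count = self.current_count if window == self.current_window + 1 else 0
            self.current_count = 0
            self.current_window = window
        elapsed = (now % self.window) / self.window  # fraction of current window elapsed
        # E.g. 30% in: count 70% of last window plus all of this one.
        estimated = self.previous_count * (1.0 - elapsed) + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False
```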
Token bucket
Imagine a bucket that fills with tokens at a steady rate, say 10 tokens per second, capped at 100. Every request consumes one token. If the bucket is empty, you reject. This naturally allows short bursts (use up the saved tokens) while enforcing the long-run average. Most production systems use this.
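A common way to implement this is lazy refill: instead of a background timer adding tokens, compute how many tokens accrued since the last request. A sketch (names are illustrative):

```python
import time

class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request costs one token."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full, so initial bursts are allowed
        self.last = None

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last is not None:
            # Lazy refill: credit tokens for the time elapsed since the last call.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The capacity sets the burst size; the refill rate sets the long-run average.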
Leaky bucket
Cousin of token bucket. Requests enter a queue that drains at a fixed rate. If the queue is full, new requests are dropped. Smooths traffic into a constant outflow. Used a lot in network gear.
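A real leaky bucket holds a queue and dequeues on a timer; the sketch below is the simpler "leaky bucket as a meter" variant, which only decides admit-or-drop by tracking the queue depth as water that drains over time (hypothetical class names):

```python
class LeakyBucket:
    """Admit requests while the queue depth stays under `capacity`;
    the queue drains at `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.water = 0.0  # current queue depth
        self.last = 0.0

    def allow(self, now):
        # Leak: the queue has been draining since the last call.
        self.water = max(0.0, self.water - (now - self.last) * self.rate)
        self.last = now
        if self.water + 1.0 <= self.capacity:
            self.water += 1.0
            return True
        return False
```

The difference from token bucket: downstream sees at most `rate` requests per second, always, with no bursts.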
Where to put the rate limiter
The closer to the edge, the cheaper bad traffic is to drop. Most companies layer it: at the CDN or WAF (cheap volumetric protection), at the API gateway (per-user, per-API-key limits), and inside services (per-tenant limits, expensive operation limits). The deeper you go, the more you have already paid for the request.
Distributed rate limiting
One server with a counter is easy. Twenty servers behind a load balancer is harder. You cannot have each server count independently, because a client could multiply its effective limit by twenty just by spreading requests across servers. The standard answer is a shared store like Redis. Each request runs an atomic INCR with a TTL on a key like rate:user-42:minute-1234. A single Redis instance comfortably sustains tens of thousands of these operations per second, and most API gateways and cloud providers ship this pattern as a managed feature.
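The INCR-with-TTL pattern looks roughly like this. The function takes any client with a Redis-style pipeline interface (in production you would pass a `redis.Redis()` from the redis-py library; the function name and key format are illustrative):

```python
import time

def allow(client, user_id, limit=100, window=60, now=None):
    """Shared fixed-window counter. `client` is a Redis-like object
    exposing pipeline() / incr() / expire() / execute()."""
    now = time.time() if now is None else now
    window_id = int(now // window)
    key = f"rate:{user_id}:minute-{window_id}"
    pipe = client.pipeline()
    pipe.incr(key)                # atomic even with many app servers racing
    pipe.expire(key, window * 2)  # stale window keys clean themselves up
    count, _ = pipe.execute()
    return count <= limit
```

Because the key embeds the window id, a new minute starts a fresh counter automatically, and the TTL garbage-collects old ones.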
Alongside the 429 response, most APIs advertise their limits in response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. Polite clients will respect them. Hostile ones will not, but at least your good clients can be good.