Retries and Exponential Backoff

Retries are how distributed systems heal from transient failures. Exponential backoff with jitter is how you retry without making the failure worse. This is one of those small ideas with huge impact.

Why retries matter

In a distributed system, most failures are transient. A network blip, a brief CPU spike, a momentary GC pause, a load balancer reshuffle. If your code gives up on the first error, you turn a 50-millisecond hiccup into a permanent failure for the user. In most cases, a retry simply succeeds on the second attempt.

But retries are also dangerous. If your downstream is overloaded and every client retries immediately, you have just multiplied the load by however many retries you allow. You took a system that was struggling and made it die. This is called a retry storm or thundering herd.

Naive retry is wrong

while True:
    try:
        return call_api()
    except TransientError:
        continue  # immediate retry: no delay, no jitter, no cap

Three problems with this. First, no delay means you hit the downstream three times in milliseconds, which is exactly when it least wants more traffic. Second, no jitter means every client retries at the same instant, creating synchronized waves. Third, no cap means a flapping downstream burns CPU on retries forever.

Exponential backoff

The fix is to wait longer before each successive retry: 1 second, then 2, then 4, then 8. The idea is that if the system is in trouble, you give it room to recover. Most retries succeed in the first one or two attempts anyway, so longer waits later cost you very little in the success case but save your downstream in the failure case.

delay = base * (2 ** attempt)
# attempt 0: 1s, attempt 1: 2s, attempt 2: 4s, attempt 3: 8s
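Wired into a retry loop, the formula might look like this (a sketch; `call_with_backoff`, the `TransientError` stand-in, and the attempt count are illustrative, and jitter is deliberately still missing at this point):

```python
import time

class TransientError(Exception):
    """Stand-in for whatever transient failure your client raises."""

def call_with_backoff(fn, base=1.0, max_attempts=4):
    """Retry fn with exponentially growing delays: base, 2*base, 4*base, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail loudly
            time.sleep(base * (2 ** attempt))
```

Note that the final attempt re-raises instead of swallowing the error, so callers see the failure.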

Why jitter matters

Pure exponential backoff still synchronizes clients. If a thousand clients all hit a 500 at the same instant, they all wait exactly 1 second, then all retry at the same instant, then all wait 2, and so on. The downstream sees thundering waves. Jitter breaks the synchrony by adding randomness.

The AWS recommended pattern is "full jitter": pick a random delay between zero and the exponential value.

delay = random.uniform(0, base * (2 ** attempt))

This spreads retries evenly across the backoff window, smoothing the load on the downstream.
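The smoothing effect is easy to see in a quick simulation (a standalone sketch; the client count and backoff window are arbitrary):

```python
import random

base, attempt = 1.0, 2   # third retry: backoff window is 4 s
clients = 1000

# Without jitter, every client computes the identical delay.
no_jitter = {base * (2 ** attempt) for _ in range(clients)}

# With full jitter, delays spread uniformly across [0, 4 s].
full_jitter = [random.uniform(0, base * (2 ** attempt)) for _ in range(clients)]

print(no_jitter)                           # a single value: one synchronized wave
print(min(full_jitter), max(full_jitter))  # spread across the whole window
```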

[Figure: retry load over time. Without jitter (delay = base * 2^attempt), retries arrive in synchronized waves at t=0, +1s, +3s, +7s. With full jitter (delay = random(0, base * 2^attempt)), the same number of retries reaches the downstream as steady load instead of waves.]

What you should NOT retry

Not every error deserves a retry. Retrying makes things worse when the failure is permanent.

What you should NOT retry: 4xx client errors such as 400, 401, 403, and 404 (the same request will fail the same way every time), and non-idempotent operations such as payments, unless the API supports idempotency keys.

What you SHOULD retry: 5xx errors, network timeouts, connection resets, 429s with a Retry-After header, and any clearly transient infrastructure failure.
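The two lists collapse into a small predicate (a sketch; `is_retryable` and its signature are illustrative, and the exception tuple should be extended with whatever your client library raises):

```python
# Transient network failures worth retrying.
RETRYABLE_EXCEPTIONS = (TimeoutError, ConnectionResetError)

def is_retryable(status=None, exc=None):
    """True for transient failures (5xx, 429, network errors);
    False for permanent ones like other 4xx."""
    if exc is not None:
        return isinstance(exc, RETRYABLE_EXCEPTIONS)
    return status == 429 or (status is not None and 500 <= status < 600)
```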

The retry budget

Cap the total number of retries (usually 3 to 5) and cap the maximum delay (usually 30 to 60 seconds). Beyond that, fail loudly. Some teams also implement a global retry budget at the service level: if more than X percent of traffic is retries, stop retrying and let the failure propagate. This prevents the service from drowning in its own retries.
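A service-level budget can be sketched as a simple ratio check (illustrative only; real implementations typically use token buckets with time windows rather than lifetime counters):

```python
class RetryBudget:
    """Allow retries only while they stay under a fraction of total requests."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio      # e.g. at most 10% of traffic may be retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Spend from the budget if there is room; otherwise let the failure propagate.
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False
```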

Combine with circuit breakers: Retries handle individual transient failures. Circuit breakers handle sustained failures. Together they cover both ends of the spectrum without making things worse.
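A minimal circuit breaker to pair with the retry loop might look like this (the threshold, cooldown, and class shape are assumptions, not any specific library's API):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures, reject calls for
    `cooldown` seconds, then allow a trial call (half-open)."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # half-open: allow a trial call once the cooldown has expired
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```

The retry loop handles each call's transient errors; the breaker watches the failure trend across calls and stops sending traffic when retries alone cannot help.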