Retries and Exponential Backoff
Retries are how distributed systems heal from transient failures. Exponential backoff with jitter is how you retry without making the failure worse. This is one of those small ideas with huge impact.
Why retries matter
In a distributed system, most failures are transient. A network blip, a brief CPU spike, a momentary GC pause, a load balancer reshuffle. If your code gives up on the first error, you turn a 50 millisecond hiccup into a permanent failure for the user. A retry, in most cases, just works the second time.
But retries are also dangerous. If your downstream is overloaded and every client retries immediately, you have just multiplied the load by however many retries you allow. You took a system that was struggling and made it die. This is called a retry storm or thundering herd.
Naive retry is wrong
for attempt in range(3):
    try:
        return call_api()
    except TransientError:
        continue  # immediate retry, no delay
Three problems with this. First, there is no delay, so you hit the downstream three times within milliseconds, which is exactly when it least wants more traffic. Second, there is no jitter, so every client retries at the same instant, creating synchronized waves. Third, there is no cap in sight: the even more common while True version of this loop keeps going forever, so a flapping downstream burns CPU on retries indefinitely.
Exponential backoff
The fix is to wait longer between each retry: 1 second, then 2, then 4, then 8. The idea is simple: if the system is in trouble, give it room to recover. Most retries succeed on the first or second attempt anyway, so the longer waits later cost you very little in the success case but save your downstream in the failure case.
delay = base * (2 ** attempt)
# with base = 1 second: attempt 0: 1s, attempt 1: 2s, attempt 2: 4s, attempt 3: 8s
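In practice you also cap the delay so late attempts do not grow without bound (more on caps under the retry budget below). A one-line sketch, where the 30-second cap is just an illustrative value:

cap = 30.0  # illustrative; see the retry budget section
delay = min(cap, base * (2 ** attempt))
# with base = 1s: 1, 2, 4, 8, 16, 30, 30, ...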
Why jitter matters
Pure exponential backoff still synchronizes clients. If a thousand clients all hit a 500 at the same instant, they all wait exactly 1 second, then all retry at the same instant, then all wait 2, and so on. The downstream sees thundering waves. Jitter breaks the synchrony by adding randomness.
The AWS recommended pattern is "full jitter": pick a random delay between zero and the exponential value.
delay = random.uniform(0, base * (2 ** attempt))
This spreads retries evenly across the backoff window, smoothing the load on the downstream.
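Put together, a retry loop with exponential backoff and full jitter might look like the sketch below. It reuses call_api and TransientError from the naive example above; the wrapper name, the 1-second base, the 30-second cap, and the 5-attempt limit are illustrative choices, not fixed recommendations.

import random
import time


def call_with_retries(fn, max_attempts=5, base=1.0, cap=30.0):
    """Call fn(), retrying transient failures with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail loudly
            # full jitter: random delay between 0 and the capped exponential value
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))


result = call_with_retries(call_api)

Note the bare raise on the final attempt: when the retries are exhausted, the caller sees the original error rather than a silent None.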
What you should NOT retry
Not every error deserves a retry. Retrying makes things worse if the failure is permanent.
- 4xx errors: Bad request, unauthorized, not found. The request itself is wrong, and retrying does not change that. The exceptions are 429 Too Many Requests (back off and retry) and 408 Request Timeout.
- Validation errors: Your input is wrong. It will be wrong again next time.
- Non-idempotent writes without an idempotency key: Retrying might create duplicates. See the idempotency topic.
What you SHOULD retry: 5xx errors, network timeouts, connection resets, 429s with a Retry-After header, and any clearly transient infrastructure failure.
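In code, this split often comes down to a small predicate over the response status, with timeouts and connection resets surfacing as exceptions and handled in the retry loop's except clause instead. A rough sketch, where the function name and the exact status set are illustrative choices mirroring the lists above:

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def is_retryable(status_code: int) -> bool:
    """Retry rate limits, request timeouts, and server errors; never other 4xx."""
    if status_code in RETRYABLE_STATUS:
        return True
    if 400 <= status_code < 500:
        return False  # bad request, unauthorized, not found: a retry changes nothing
    return status_code >= 500  # treat any other 5xx as transient

When a 429 arrives with a Retry-After header, honoring that value in place of the computed backoff is usually the polite choice.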
The retry budget
Cap the total number of retries (usually 3 to 5) and cap the maximum delay (usually 30 to 60 seconds). Beyond that, fail loudly. Some teams also implement a global retry budget at the service level: if more than X percent of traffic is retries, stop retrying and let the failure propagate. This prevents the service from drowning in its own retries.
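A minimal sketch of such a budget, assuming a simple sliding window and a 10 percent threshold (both numbers and the class itself are illustrative; production implementations are often shared token buckets):

import time
from collections import deque


class RetryBudget:
    """Permit retries only while they stay below a fraction of recent requests."""

    def __init__(self, max_retry_ratio=0.1, window_seconds=10.0):
        self.max_retry_ratio = max_retry_ratio
        self.window = window_seconds
        self.requests = deque()  # timestamps of first attempts
        self.retries = deque()   # timestamps of retries

    def _trim(self, q, now):
        while q and now - q[0] > self.window:
            q.popleft()

    def record_request(self):
        self.requests.append(time.monotonic())

    def allow_retry(self):
        now = time.monotonic()
        self._trim(self.requests, now)
        self._trim(self.retries, now)
        if len(self.retries) >= self.max_retry_ratio * max(len(self.requests), 1):
            return False  # over budget: stop retrying, let the failure propagate
        self.retries.append(now)
        return True

Each first attempt calls record_request(), and the retry loop checks allow_retry() before sleeping; once the budget is exhausted, failures propagate instead of amplifying.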