Circuit Breaker Pattern
When a downstream service is failing, hammering it harder makes things worse. A circuit breaker temporarily stops calls to that service, gives it time to recover, and protects your own system from cascading failure.
The cascading failure problem
Service A calls service B. B is slow. A's threads start piling up waiting on B. A runs out of threads. A starts failing too. C calls A; same thing happens to C. Within minutes, half the system is down because of one slow service.
The circuit breaker cuts the chain. When A notices B is failing, A stops calling B for a while. A's threads stay free. A degrades gracefully (returns a default, errors quickly, falls back). B gets breathing room.
The three states
- Closed. Normal operation. Calls pass through. The breaker counts failures.
- Open. The failure threshold has been exceeded. Calls fail immediately without reaching the downstream service. The breaker waits for the open duration to elapse.
- Half-open. After the open duration elapses, the breaker lets a few calls through to test the waters. If they succeed, the breaker closes again. If they fail, it reopens.
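The three states can be sketched as a small state machine. This is a minimal illustration, not a production implementation: the names (CircuitBreaker, CircuitOpenError, failure_threshold, open_seconds) are hypothetical, it counts consecutive failures rather than failures in a time window, it allows a single trial call in half-open, and it is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # All parameter names and defaults are illustrative assumptions.
    def __init__(self, failure_threshold=3, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_seconds:
                # Open: fail immediately without touching the downstream.
                raise CircuitOpenError("circuit is open")
            # Open duration elapsed: let one trial call through.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures while closed, opens the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success resets the count and closes the breaker.
        self.failures = 0
        self.state = State.CLOSED
```

A caller would wrap each downstream request, for example breaker.call(fetch_profile, user_id), where fetch_profile stands in for whatever function talks to the downstream service.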
Tuning
- Failure threshold: N failures in M seconds before opening. Set it too low and the breaker flaps open and closed on transient errors; set it too high and the cascade starts before the breaker trips.
- Open duration: how long to stay open before testing. Long enough for the downstream to recover; short enough to test recovery promptly.
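- Half-open allowance: how many test calls to send. Too few and a single unlucky failure reopens the breaker; too many and a still-recovering service takes a fresh burst of load.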
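One way to make these knobs concrete is a config object plus a sliding-window failure counter for the "N failures in M seconds" rule. The sketch below is an assumption-laden illustration: BreakerConfig, FailureWindow, and every field name and default value are hypothetical, and real libraries may use failure rates or bucketed counters instead.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    # Illustrative defaults only; tune them against real traffic.
    failure_threshold: int = 5     # N failures...
    window_seconds: float = 10.0   # ...within M seconds opens the breaker
    open_seconds: float = 30.0     # how long to stay open before testing
    half_open_max_calls: int = 3   # trial calls allowed while half-open


class FailureWindow:
    """Tracks failure timestamps and reports when the threshold is reached."""

    def __init__(self, config):
        self.config = config
        self._timestamps = deque()

    def record_failure(self, now):
        self._timestamps.append(now)
        self._evict(now)

    def should_open(self, now):
        self._evict(now)
        return len(self._timestamps) >= self.config.failure_threshold

    def _evict(self, now):
        # Drop failures that have aged out of the window.
        cutoff = now - self.config.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
```

Passing the current time in explicitly (for example from time.monotonic()) keeps the counter easy to test and avoids wall-clock jumps.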
Pair with fallbacks
When the breaker is open, what does your service return? Three options:
- Fail fast. Return an error. Acceptable for non-critical paths.
- Cached response. Show stale data. Better than nothing for read-heavy paths.
- Default value. A reasonable placeholder (zero items in cart, default avatar).
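The three options can be layered: try the cache first, then a default, and fail fast only when neither exists. The sketch below assumes a breaker that signals rejection by raising an exception; CircuitOpenError, with_fallback, and the dict-based cache are hypothetical names chosen for illustration.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""


_MISSING = object()  # sentinel so that None can be a legitimate default


def with_fallback(fetch, key, cache, default=_MISSING):
    """Call fetch(key); if the circuit is open, degrade gracefully.

    Order of fallbacks: cached (possibly stale) value, then default value,
    then fail fast by re-raising CircuitOpenError.
    """
    try:
        value = fetch(key)
    except CircuitOpenError:
        if key in cache:
            return cache[key]   # stale data beats no data on read paths
        if default is not _MISSING:
            return default      # reasonable placeholder
        raise                   # fail fast: let the caller see the error
    cache[key] = value          # refresh the cache on every success
    return value
```

For instance, with_fallback(fetch_cart, "cart:42", cache, default=[]) would return an empty cart while the breaker is open and nothing is cached; fetch_cart here stands in for a breaker-wrapped call.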
Bulkhead pattern (worth pairing)
Isolate resources so one slow dependency can't drain everything. Allocate separate thread pools or connection pools per downstream. If A's calls to service B exhaust the pool reserved for B, the pool A uses for service C is unaffected. Often deployed alongside circuit breakers.
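A minimal sketch of per-downstream thread pools, using Python's standard concurrent.futures.ThreadPoolExecutor. The Bulkheads class, the dependency names, and the pool sizes are hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One bounded thread pool per downstream dependency."""

    def __init__(self, sizes):
        # sizes maps a dependency name to its maximum number of worker threads.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in sizes.items()
        }

    def submit(self, dependency, func, *args, **kwargs):
        # Work for one dependency only ever runs on that dependency's threads.
        return self._pools[dependency].submit(func, *args, **kwargs)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)


# Example wiring: slow calls to service_b can tie up at most 2 threads,
# leaving service_c's threads free.
bulkheads = Bulkheads({"service_b": 2, "service_c": 2})
```

One caveat with this approach: ThreadPoolExecutor queues excess work without bound, so waiting on a result should use a timeout (for example future.result(timeout=1.0)) to keep callers from blocking behind a saturated pool.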
Circuit breakers prevent cascading failure. They're not magic; they're an explicit choice to fail fast when something is sick. Combined with retries, timeouts, and bulkheads, they keep your system robust under partial failure.