High Availability
High availability is about staying up when things break. We measure it in nines, design for it with redundancy and failover, and accept that perfect uptime is a myth.
What does high availability actually mean?
When someone says a system is highly available, they mean it stays up almost all the time, even when individual pieces fail. The fancy way to describe this is in nines. Two nines means 99 percent uptime, which sounds great until you do the math and realize it allows for about three and a half days of downtime a year. Three nines is around 8 hours and 45 minutes per year. Four nines is roughly 52 minutes. Five nines, the gold standard most people throw around without understanding the cost, allows only 5 minutes and 15 seconds of downtime in a year. That includes planned maintenance, deploys, network blips, and that 3am page nobody wants.
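If you want to check those numbers yourself, the arithmetic is a one-liner. Here is a quick sketch, assuming a plain 365-day year:

```python
# Downtime allowed per year for a given number of nines.
# Sketch only: assumes a 365-day year and ignores leap days.

def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime in minutes per year for N nines of availability."""
    availability = 1 - 10 ** (-nines)   # e.g. 3 nines -> 0.999
    return 365 * 24 * 60 * (1 - availability)

for n in range(2, 6):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: about {minutes:,.1f} minutes of downtime per year")
# 2 nines: about 5,256.0 minutes of downtime per year
# 3 nines: about 525.6 minutes of downtime per year
# 4 nines: about 52.6 minutes of downtime per year
# 5 nines: about 5.3 minutes of downtime per year
```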
The point I want you to take away is this: every additional nine costs roughly ten times more to engineer. Going from 99 to 99.9 is a different game than going from 99.9 to 99.99. Most internal tools live happily at three nines. Payment systems, healthcare, and anything else where downtime means money or lives push toward four or five.
The patterns that get you there
Redundancy
You cannot be highly available with one server. If that server dies, you are down. Redundancy means running multiple copies of every component that matters. Two web servers behind a load balancer. Two database replicas. Two availability zones. The math is simple: if one server has 99 percent uptime, two independent ones together have 99.99 percent (assuming failures are uncorrelated, which they often are not, but the principle holds).
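The multiplication behind that claim is worth seeing once. A minimal sketch, assuming failures really are independent (the big if):

```python
# Combined availability of N redundant copies, assuming independent failures.
# Real failures correlate (shared rack, shared deploy, shared bug), so treat
# this as an upper bound, not a promise.

def combined_availability(single: float, copies: int) -> float:
    """Probability that at least one of `copies` independent nodes is up."""
    return 1 - (1 - single) ** copies

for copies in (1, 2, 3):
    print(f"{copies} x 99%: {combined_availability(0.99, copies):.6%} available")
# 1 x 99%: 99.000000% available
# 2 x 99%: 99.990000% available
# 3 x 99%: 99.999900% available
```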
Failover
Redundancy is useless if traffic does not actually shift when something fails. Failover is the mechanism. In active-passive, one node serves traffic and the other sits warm waiting to take over. In active-active, both serve traffic and you just stop sending to the broken one. Active-active is more efficient but harder to build because both nodes need to handle writes safely.
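Stripped of everything else, active-passive is just "prefer the primary, fall back to the standby the moment it fails a check." A toy sketch of that logic (the hostnames and the /healthz probe are placeholders, not a real setup):

```python
import urllib.request

# Priority order: primary first, standby second. Hostnames are placeholders.
NODES = ["https://primary.internal.example", "https://standby.internal.example"]

def is_healthy(node: str, timeout: float = 1.0) -> bool:
    """Treat a 200 from /healthz within the timeout as healthy."""
    try:
        with urllib.request.urlopen(f"{node}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_node() -> str:
    """Return the first healthy node in priority order; failover happens here."""
    for node in NODES:
        if is_healthy(node):
            return node
    raise RuntimeError("no healthy node available")
```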
Health checks and circuit breakers
Failover only triggers if you can detect failure quickly. Load balancers ping each server every few seconds. If three checks fail in a row, that server is removed from the pool. Inside services, circuit breakers stop sending traffic to a downstream that is clearly broken, which prevents your healthy services from melting down trying to reach a dead one.
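A circuit breaker is mostly bookkeeping: count consecutive failures, trip open, refuse calls for a cooldown, then try again. A bare-bones sketch, with thresholds invented for illustration:

```python
import time

class CircuitBreaker:
    """Trips open after `max_failures` consecutive errors, then refuses calls
    until `reset_after` seconds have passed. Bare-bones: no half-open probing."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call to broken downstream")
            # Cooldown elapsed: close the breaker and try the downstream again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```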
Graceful degradation
The best HA systems do not just stay up, they fall back gracefully. If the recommendation service is down, show a generic list. If the comments database is slow, hide comments but render the post. The user gets a slightly worse experience instead of a 500 page.
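In code this tends to be nothing fancier than a try/except around the optional dependency, with a boring default on standby. A sketch, where the recommendation client is hypothetical:

```python
# Fallback content for when the recommendation service is unavailable.
FALLBACK_RECOMMENDATIONS = ["Most popular this week", "Editor's picks", "New arrivals"]

def recommendations_for(user_id: str, client) -> list[str]:
    """Fetch personalized recommendations without letting their failure break the page.
    `client.personalized` is a hypothetical wrapper around the recommendation service."""
    try:
        return client.personalized(user_id, timeout=0.2)
    except Exception:
        # Degrade: the user gets a generic list instead of a 500.
        return FALLBACK_RECOMMENDATIONS
```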
Failure domains
When you design for HA, think about what is shared. Two servers in the same rack share a power cable. Two racks in the same datacenter share a generator. Two datacenters in the same region share a fiber path. The smaller your failure domain, the better your isolation. Cloud providers split this for you with zones and regions. A multi-zone deployment survives a single zone outage. A multi-region deployment survives an entire region going dark, but it costs more and adds latency to writes.
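One way to keep yourself honest about shared failure domains is to check placements mechanically. A toy sketch, assuming each instance carries a zone label (zone names here are just illustrative):

```python
from collections import Counter

def survives_zone_loss(instance_zones: list[str], min_survivors: int = 1) -> bool:
    """True if losing any single zone still leaves at least `min_survivors` instances."""
    counts = Counter(instance_zones)
    total = len(instance_zones)
    return all(total - n >= min_survivors for n in counts.values())

print(survives_zone_loss(["us-east-1a", "us-east-1a"]))                # False: one zone holds everything
print(survives_zone_loss(["us-east-1a", "us-east-1b", "us-east-1c"]))  # True: any single zone can go dark
```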
SLAs, SLOs, SLIs
SLA is what you promise customers contractually, with money attached if you miss. SLO is your internal goal, usually slightly stricter than the SLA so you have a buffer. SLI is the actual measurement of what is happening right now. The error budget is the gap between your SLO and 100 percent. If your SLO is 99.9, you have 0.1 percent of error budget per month. Once you burn through it, you stop shipping risky features and focus on stability. This is the core idea behind site reliability engineering as Google practices it.
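The error budget falls straight out of that arithmetic. A quick sketch of the monthly numbers, assuming a 30-day month:

```python
# Minutes of error budget per month for a given availability SLO.
# Assumes a 30-day month; scale to whatever window your SLI actually uses.

def monthly_error_budget_minutes(slo: float) -> float:
    return 30 * 24 * 60 * (1 - slo)

print(monthly_error_budget_minutes(0.999))   # ~43.2 minutes: burn this and you stop shipping
print(monthly_error_budget_minutes(0.9999))  # ~4.3 minutes: barely one bad deploy
```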