Load Balancing

A load balancer is the traffic cop of your system. It spreads requests across servers, hides failed servers, and is the place you implement most cross-cutting concerns. Critical, often invisible.

What a load balancer does

A load balancer sits in front of your application servers. Every client request hits the LB first; the LB picks a healthy server and forwards the request. To the client, it looks like there is one big server. Behind the scenes, work is distributed.

Three jobs:

  1. Distribute load. Spread requests so no server is overwhelmed.
  2. Health check. If a server fails, stop sending it traffic.
  3. Cross-cutting concerns. SSL termination, rate limiting, request logging.
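The three jobs can be sketched in a few lines. This is a toy in-memory model, not a real proxy; the server names, the `mark_down` method, and the logging format are all made up for illustration.

```python
import itertools

class LoadBalancer:
    """Toy model of the three jobs: distribute, health-check, log."""

    def __init__(self, servers):
        self.servers = servers             # list of server names
        self.healthy = set(servers)        # maintained by health checks
        self._rr = itertools.cycle(servers)

    def mark_down(self, server):
        # In a real LB this happens when health checks fail.
        self.healthy.discard(server)

    def pick(self):
        # Round-robin, but skip servers marked unhealthy.
        for _ in range(len(self.servers)):
            s = next(self._rr)
            if s in self.healthy:
                return s
        raise RuntimeError("no healthy servers")

    def handle(self, request):
        server = self.pick()
        print(f"LOG {request} -> {server}")  # cross-cutting concern: logging
        return server
```

To the client, `handle` looks like a single server answering; behind it, work spreads across whatever is healthy.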

L4 vs L7 load balancers

The difference is where in the network stack the LB looks at traffic.

L4 (transport layer). Operates on TCP/UDP. Looks at IP addresses and ports. Fast, simple, but doesn't understand HTTP. Examples: AWS NLB, HAProxy in TCP mode.

L7 (application layer). Operates on HTTP. Sees URLs, headers, cookies. Can route based on path or hostname, do SSL termination, content-based routing. Heavier than L4 but vastly more flexible. Examples: NGINX, Envoy, AWS ALB, HAProxy in HTTP mode.

Modern systems usually use L7 for everything except very high-throughput TCP traffic.
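What L7 routing buys you is easiest to see in code: the LB inspects the Host header and URL path and picks a backend pool, something an L4 LB (which sees only IPs and ports) cannot do. The hostnames and pool names below are illustrative, not any particular product's config.

```python
# Hypothetical L7 routing table: (host, path prefix, backend pool).
# Rules are checked in order; first match wins.
ROUTES = [
    ("api.example.com", "/v1/",     "api-pool"),
    ("example.com",     "/static/", "cdn-pool"),
    ("example.com",     "/",        "web-pool"),
]

def route(host: str, path: str) -> str:
    for rule_host, prefix, pool in ROUTES:
        if host == rule_host and path.startswith(prefix):
            return pool
    return "default-pool"
```

An L4 LB making the same decision would have nothing to go on but `client_ip:port -> server_ip:port`.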

[Diagram: clients → load balancer (health checks · routing) → server 1 ✓, server 2 ✓, server 3 ✗, server 4 ✓]
A load balancer routes around failed servers and spreads traffic across the healthy ones.

Routing algorithms

| Algorithm | How it picks | When to use |
| --- | --- | --- |
| Round robin | Cycle through servers in order | Simple default, fine when servers are equal |
| Least connections | Server with fewest active connections | When request times vary widely |
| Weighted | Round robin with weights for bigger servers | Heterogeneous fleet |
| IP hash | Hash of client IP picks server | Sticky sessions without cookies |
| Least response time | Fastest-responding server | Mixed-performance servers |
| Consistent hash | Hash of request key | Cache affinity, stateful systems |
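A few of these algorithms, sketched minimally. The server names and connection counts are illustrative; real LBs track connections and latency from live traffic.

```python
import hashlib
import itertools

servers = ["s1", "s2", "s3"]

# Round robin: cycle through servers in order.
rr = itertools.cycle(servers)

# Least connections: pick the server with the fewest active connections.
active = {"s1": 4, "s2": 1, "s3": 7}  # illustrative live counts

def least_connections() -> str:
    return min(active, key=active.get)

# IP hash: the same client IP always maps to the same server,
# as long as the server list doesn't change.
def ip_hash(client_ip: str) -> str:
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[h % len(servers)]
```

Note the weakness of plain IP hash: adding or removing a server remaps almost every client, which is exactly what consistent hashing is designed to avoid.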

Health checks

Health checks are how the LB knows which servers are alive. Two flavors: active checks, where the LB periodically probes a health endpoint on each server, and passive checks, where the LB watches real traffic and marks a server down after repeated errors or timeouts.

Make your health endpoint deep enough to detect real problems (DB connection works, dependencies reachable) but shallow enough to be fast. A common mistake is making the health check itself fail under load, taking down healthy servers.
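A minimal active health checker might look like the sketch below. The `/health` path, the one-second timeout, and the failure threshold are assumptions for illustration; real LBs make all three configurable, and the grace period before eviction is what prevents one slow response from taking a healthy server out of rotation.

```python
import urllib.request

FAIL_THRESHOLD = 3  # consecutive failures before eviction (assumed value)

def check(server_url: str, timeout: float = 1.0) -> bool:
    """Probe a hypothetical /health endpoint; any error counts as a failure."""
    try:
        with urllib.request.urlopen(f"{server_url}/health", timeout=timeout) as r:
            return r.status == 200
    except OSError:
        return False

def update_health(servers: dict) -> set:
    """servers maps url -> consecutive failure count. Returns the healthy set."""
    healthy = set()
    for url in servers:
        if check(url):
            servers[url] = 0           # success resets the counter
            healthy.add(url)
        else:
            servers[url] += 1
            if servers[url] < FAIL_THRESHOLD:
                healthy.add(url)       # grace period before eviction
    return healthy
```

Run `update_health` on a timer and feed the resulting set to the routing layer.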

SSL termination

The LB decrypts incoming HTTPS, talks plain HTTP to the backend servers, and encrypts responses on the way back to the client. Why? Backend servers are spared TLS handshakes, which are CPU-heavy; the LB can use hardware acceleration or optimized crypto libraries; and certificates are managed in one place.

Trade-off: traffic between LB and backend is plain text. Fine inside a private network; not fine across the internet.

Sticky sessions

Sometimes you need the same user to keep hitting the same backend server (in-memory session, websocket connection). Sticky sessions (session affinity) bind a user to one server, usually via a cookie.

Sticky sessions are a smell. They prevent free horizontal scaling and make rollouts harder. Prefer stateless services with shared session storage. Use stickiness only when you must (websockets, some legacy apps).
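When you do need stickiness, cookie-based affinity is the usual mechanism: pick a server on the first request, pin it in a cookie, and honor the cookie afterward. A sketch, with a made-up cookie name:

```python
import itertools

class StickyLB:
    """Toy cookie-based session affinity. Cookie name is an assumption."""
    COOKIE = "SERVERID"

    def __init__(self, servers):
        self._rr = itertools.cycle(servers)
        self.servers = set(servers)

    def handle(self, cookies: dict) -> tuple:
        """Return (chosen server, cookies to send back to the client)."""
        pinned = cookies.get(self.COOKIE)
        if pinned in self.servers:       # honor existing affinity
            return pinned, cookies
        server = next(self._rr)          # first visit: pick and pin
        return server, {**cookies, self.COOKIE: server}
```

Note what this sketch omits: if the pinned server dies, the user's session state dies with it. That failure mode is exactly why the text above calls stickiness a smell.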

The single point of failure

A single load balancer is itself a SPOF. Run LBs in pairs (active-passive or active-active) with health monitoring between them. Cloud LBs (AWS ELB, GCP LB) do this for you under the hood.

Where the LB fits

Most production systems have multiple layers:

  1. DNS / global LB: routes users to the nearest region.
  2. Edge LB: per-region public-facing LB. SSL termination, DDoS protection, public IP.
  3. Internal LB: in front of each service tier. L7 routing.

Each layer has its own scaling and failure characteristics. Skip a layer at your peril.