Horizontal vs Vertical Scaling
Scaling vertically means a bigger machine. Scaling horizontally means more machines. Both are valid, both have limits. Modern systems lean horizontal because there is no biggest machine.
The two ways to handle more load
When your single server can no longer keep up, you have exactly two options. Make the server bigger (vertical scaling, also called scale up). Or add more servers (horizontal scaling, also called scale out).
That's it. Every other technique you will read about (load balancing, sharding, replication) is in service of one of these two strategies.
Vertical scaling
You take your existing server and upgrade. More CPU. More RAM. Faster disk. Your code does not change. The application is unaware. The database, the cache, the API server, all just get more headroom.
Pros: simple. No code changes. No new failure modes. The pre-cloud default.
Cons: there is a ceiling. Even the largest instance AWS sells tops out somewhere. And it is expensive: cost grows nonlinearly, so doubling capacity often more than doubles the price. It is also a single point of failure: if the one big box dies, you are completely down.
Horizontal scaling
You run many servers in parallel, each handling a fraction of the load. A load balancer distributes incoming requests. Need more capacity? Spin up another server. Done.
Pros: no practical upper limit (or a very high one). Resilient: lose one server and the others keep going. Often cheaper at scale.
Cons: your application must support it. Stateless servers (covered later) are easy. Anything with shared state (sessions, in-memory caches, sticky data) needs work to scale horizontally.
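The load balancer's job above can be sketched in a few lines. This is a minimal round-robin sketch, not a real load balancer: the server names and request labels are hypothetical, and production balancers add health checks and smarter policies.

```python
from itertools import cycle

# Hypothetical pool of identical app servers behind one entry point.
servers = ["app-1", "app-2", "app-3"]

def make_balancer(pool):
    """Round-robin: each incoming request goes to the next server in the pool."""
    ring = cycle(pool)
    return lambda request: (next(ring), request)

route = make_balancer(servers)
assignments = [route(f"req-{i}")[0] for i in range(6)]
# Six requests spread evenly: each server handles two.
```

Adding capacity is then just appending another name to the pool, which is the whole point of scaling out.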
The reality: most systems do both
Even the most horizontal system has servers that are individually beefy. A typical setup: vertical-scale each node to a sensible size (say 8-16 cores, 32-64 GB), then horizontally scale by adding more of those nodes. The vertical helps avoid coordination overhead from too many tiny nodes; the horizontal removes the ceiling.
Databases are an interesting middle ground. Postgres scales vertically by default, with read replicas adding horizontal read scale. To horizontally scale writes, you shard. Each shard is itself a vertically scaled instance with its own replicas. Layered.
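The shard routing described above boils down to a stable key-to-shard mapping. A minimal sketch, with an illustrative shard count; real systems layer on consistent hashing so resharding moves fewer keys.

```python
import hashlib

NUM_SHARDS = 4  # illustrative; each shard is its own vertically scaled instance

def shard_for(user_id: str) -> int:
    """Map a key to a shard deterministically, so a given user's
    writes always land on the same shard (and its replicas)."""
    # Python's built-in hash() is randomized per process, so use a
    # stable digest instead.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

The determinism is the important property: the router, not the application, decides where data lives, and it must decide the same way every time.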
Stateless services: the prerequisite for horizontal
If your service stores state in memory (sessions, caches, counters) and that state is unique to each server, horizontal scaling breaks. Server A holds your session. The load balancer sends your next request to server B. You appear logged out.
The fix: pull state out of the application servers and into shared infrastructure (Redis for sessions, distributed cache for cached data, database for everything else). Now any server can serve any request.
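The before-and-after is easy to demonstrate. In this sketch a plain dict stands in for Redis (same get/set shape as a real client); the server and session names are made up. Because both servers read from the shared store, the login survives the load balancer switching servers mid-session.

```python
# Shared session store: a dict standing in for Redis.
session_store = {}

class AppServer:
    """Stateless: keeps no per-user data in its own memory between requests."""
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, session_id, user):
        self.store[session_id] = {"user": user}

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return session["user"] if session else None

a = AppServer("app-a", session_store)
b = AppServer("app-b", session_store)
a.login("sess-123", "dana")   # first request lands on server A
b.whoami("sess-123")          # next request hits server B: still logged in
```

Had each server kept its own dict, `b.whoami` would return None, which is exactly the "you appear logged out" failure above.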
Auto-scaling
The cloud lets you horizontally scale automatically. Set rules: when CPU is over 70% for 5 minutes, add a server. When CPU is under 30% for 10 minutes, remove one. AWS Auto Scaling Groups, Kubernetes HPA, GCP Managed Instance Groups all do this.
Auto-scaling sounds like magic. It is mostly fine, but watch for: scale-up lag (it takes minutes to spin up a new server), thundering herd (everyone scales out at the same time), and autoscaler oscillation (scale up, scale down, repeat). Tune carefully.
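The scaling rule above reduces to a small decision function. A toy sketch, using the 70%/30% thresholds from the text; the floor, ceiling, and function shape are assumptions, and the dwell-time checks ("for 5 minutes") that real autoscalers apply are omitted for brevity.

```python
def desired_count(current: int, cpu_pct: float,
                  floor: int = 2, ceiling: int = 20) -> int:
    """Decide the next server count from the current CPU utilization."""
    if cpu_pct > 70 and current < ceiling:
        return current + 1   # hot: add a server
    if cpu_pct < 30 and current > floor:
        return current - 1   # idle: remove a server
    return current           # in the 30-70% band: do nothing
```

The dead band between the two thresholds is what damps oscillation: if scale-up and scale-down triggered at the same utilization, the fleet would flap between sizes.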
The senior choice
Vertical first when the problem is simple and the load is moderate. It buys time. Horizontal when you cross the price-performance break-even or need redundancy more than raw power. Most production systems eventually use both, with the proportion shifting horizontal as the system matures.