Monitoring and Alerting

Monitoring is watching what matters. Alerting is being woken up when something is actually broken. Doing both well requires picking the right signals and tuning aggressively to avoid alert fatigue.

Monitoring vs alerting

Monitoring is the continuous collection and visualization of system signals. Alerting is the subset of monitoring rules that produce a notification (page, email, Slack ping) when something crosses a threshold. Every alert should be tied to a metric you are monitoring, but most monitored metrics will not produce alerts. Dashboards are for humans actively investigating. Alerts are for telling humans there is something to investigate.
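The distinction can be made concrete in a few lines. This is a hedged sketch, not any particular monitoring system's API: the metric names and the 0.1 percent threshold are invented for illustration.

```python
# Many metrics are collected; only some have an alert rule attached.
metrics = {
    "http_requests_total": 120_000,
    "http_errors_total": 240,
    "heap_bytes": 3_200_000_000,  # dashboard-only: monitored, never alerts
}

def error_rate(m):
    """Derived metric: fraction of requests that failed."""
    return m["http_errors_total"] / m["http_requests_total"]

ALERT_THRESHOLD = 0.001  # hypothetical rule: notify above 0.1% errors

rate = error_rate(metrics)
if rate > ALERT_THRESHOLD:
    print(f"ALERT: error rate {rate:.2%} exceeds {ALERT_THRESHOLD:.2%}")
```

Here `heap_bytes` is monitored (it appears on dashboards for investigation) but has no rule, while `error_rate` is both monitored and alerting.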

The four golden signals

These come from Google's SRE book and are now widely adopted. If you have nothing else, monitor these four for every user-facing service.

Signal       What it tells you              Example metric
Latency      How long requests take         p50, p95, p99 response time
Traffic      How much demand                requests per second
Errors       Failure rate                   5xx rate, error rate per endpoint
Saturation   How full your resources are    CPU, memory, disk, connection pool usage

Note that "average" latency is mostly useless. Always look at percentiles. p99 latency tells you the experience of your worst-served 1 percent of users, which is often where outages start.
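A quick simulation shows why the average hides the tail. This sketch uses the nearest-rank percentile method on an invented traffic shape (mostly fast requests plus a 2 percent slow tail); the numbers are illustrative, not from any real service.

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    samples at or below it."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(0, k)]

# Simulated latencies (ms): 980 fast requests, 20 slow ones.
random.seed(1)
latencies = [random.gauss(50, 5) for _ in range(980)]
latencies += [random.uniform(400, 900) for _ in range(20)]

mean = sum(latencies) / len(latencies)
print(f"mean: {mean:.0f} ms")                        # looks healthy
print(f"p50:  {percentile(latencies, 50):.0f} ms")   # also looks healthy
print(f"p99:  {percentile(latencies, 99):.0f} ms")   # exposes the slow tail
```

The mean lands around 60 ms and the median near 50 ms, while p99 lands in the hundreds: the 2 percent of users being hurt are invisible to both averages.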

Symptom-based alerting

The cardinal rule: alert on symptoms users feel, not on causes you imagine. If you alert on "CPU is 90 percent", you might page in the middle of the night for something that is not actually a problem (some workloads are CPU-saturated by design). If you alert on "p99 latency exceeded SLO" or "error rate is climbing", you only page when users are actually being hurt.

The exception: leading indicators that always precede user impact. "Disk will be full in 4 hours" is fine to alert on even if no user has noticed yet, because by the time they notice, it is too late.
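A leading-indicator check like "disk full in 4 hours" is usually a simple linear extrapolation of recent growth. This is a minimal sketch under that assumption; the function names and the two-sample growth estimate are my own, not a real tool's API.

```python
def hours_until_full(used_now, used_hour_ago, capacity):
    """Linearly extrapolate disk growth. Returns None if not growing."""
    growth_per_hour = used_now - used_hour_ago
    if growth_per_hour <= 0:
        return None
    return (capacity - used_now) / growth_per_hour

def should_page(used_now, used_hour_ago, capacity, horizon_hours=4):
    """Page only if the disk will fill within the horizon."""
    eta = hours_until_full(used_now, used_hour_ago, capacity)
    return eta is not None and eta <= horizon_hours

GIB = 2**30
# 800 GiB used on a 1 TiB disk, growing 10 GiB/hour: full in ~22 h, no page.
print(should_page(800 * GIB, 790 * GIB, 1024 * GIB))  # False
# Growing 60 GiB/hour: full in under 4 h, page now.
print(should_page(800 * GIB, 740 * GIB, 1024 * GIB))  # True
```

In practice you would fit the trend over a longer window than two samples to avoid paging on a single bursty write, but the shape of the rule is the same.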

Alert severity pyramid:

  Page: wake someone up (high urgency, user impact now)
  Ticket: look at it during business hours
  Dashboard / Log: check during investigation (low urgency, context only)

Tuning to avoid alert fatigue

The biggest failure mode of an alerting system is too many alerts. When the on-call gets paged 30 times a night, they stop reading them. They stop investigating. The boy-who-cried-wolf effect kicks in and the one alert that actually mattered gets ignored.

Rules I have seen work in practice:

- Every page must be actionable. If the on-call cannot do anything about it, it is not a page.
- Every page must have a runbook. Link a doc that says what to check and what to do.
- Track noise. Count alerts per week, retro the ones that did not require action, and tune them out.
- Require sustained conditions. Fire only if the condition holds for at least N minutes, to ignore brief spikes.
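The "hold for N minutes" rule (Prometheus calls this a `for` clause) is easy to sketch. This is an illustrative state machine, not any monitoring system's actual implementation; the class and method names are mine.

```python
import time

class ForDuration:
    """Fire only after a condition has held continuously for hold_seconds."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.since = None  # when the condition first became true

    def update(self, condition, now=None):
        now = time.monotonic() if now is None else now
        if not condition:
            self.since = None  # any dip resets the timer
            return False
        if self.since is None:
            self.since = now
        return now - self.since >= self.hold_seconds

alert = ForDuration(hold_seconds=300)        # require 5 sustained minutes
print(alert.update(True, now=0))             # False: just started
print(alert.update(True, now=200))           # False: only 200 s so far
print(alert.update(False, now=250))          # False: spike ended, timer resets
print(alert.update(True, now=600))           # False: counting again from 600
print(alert.update(True, now=900))           # True: held for 300 s, fire
```

The brief spike at 0-200 s never pages; only the condition that stays true pages.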

SLOs and error budgets

An SLO (Service Level Objective) is your internal availability or performance goal. Say 99.9 percent of requests succeed in under 200 ms. The error budget is the inverse: 0.1 percent of requests can fail or be slow. If you are well within budget, you can ship risky features. If you have burned through it, you stop and focus on stability. This frames alerting nicely: you do not alert on "any error", you alert on "burning the error budget too fast".
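"Burning the budget too fast" is usually expressed as a burn rate: the observed error rate divided by the rate the budget allows. A minimal sketch, with illustrative thresholds rather than anyone's production values:

```python
def burn_rate(error_rate, slo_target):
    """How fast the error budget is being consumed.
    1.0 means exactly on budget; 14 means burning 14x faster than allowed."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

SLO = 0.999

# 1.4% errors against a 99.9% SLO: a 30-day budget gone in ~2 days. Page.
print(round(burn_rate(0.014, SLO), 1))   # 14.0
# 0.05% errors: half the allowed rate, comfortably within budget. No alert.
print(round(burn_rate(0.0005, SLO), 1))  # 0.5
```

Real setups typically pair a fast-burn alert over a short window (page) with a slow-burn alert over a long window (ticket), which is exactly the severity split described above.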

The on-call rotation

Production systems need someone to respond when things break. A healthy rotation has at least five people, so each person is on call roughly one week a month at most. Hand-offs include a review of recent alerts and ongoing issues. The post-incident review is mandatory, blameless, and produces concrete action items. Burnout from on-call is a real cost; if your rotation is breaking people, you have an alerting problem to fix, not a tougher engineer to hire.

If I had to give one rule: the only thing that should ever page a human is "users are being hurt and only a human can fix it". Anything else is a ticket, a dashboard, or an automated remediation. Protect the on-call's sleep like it is sacred, because it is.