Monitoring and Alerting
Monitoring is watching what matters. Alerting is being woken up when something is actually broken. Doing both well requires picking the right signals and tuning aggressively to avoid alert fatigue.
Monitoring vs alerting
Monitoring is the continuous collection and visualization of system signals. Alerting is the subset of monitoring rules that produce a notification (page, email, Slack ping) when something crosses a threshold. Every alert should be tied to a metric you are monitoring, but most monitored metrics will not produce alerts. Dashboards are for humans actively investigating. Alerts are for telling humans there is something to investigate.
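To make the distinction concrete, here is a toy sketch. The in-memory store, metric names, and the 500 ms / 5 percent thresholds are all illustrative, not any particular tool's API.

```python
import time

# Hypothetical in-memory store standing in for a real metrics backend.
METRICS: dict[str, list[float]] = {}

def record_metric(name: str, value: float) -> None:
    """Monitoring: record every observation; dashboards read from here."""
    METRICS.setdefault(name, []).append(value)

def evaluate_alerts() -> list[str]:
    """Alerting: a small, deliberate subset of rules that page a human."""
    pages = []
    latencies = METRICS.get("checkout.latency_ms", [])
    recent = latencies[-100:]  # only look at the most recent requests
    if recent and sum(v > 500 for v in recent) / len(recent) > 0.05:
        pages.append("checkout latency: >5% of recent requests over 500 ms")
    return pages

# Many metrics are recorded; only a few rules ever page.
record_metric("checkout.latency_ms", 120.0)
record_metric("checkout.cache_hit_ratio", 0.93)   # monitored, never pages
print(evaluate_alerts())
```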
The four golden signals
The four golden signals come from Google's SRE book and are now widely adopted. If you monitor nothing else, monitor these four for every user-facing service.
| Signal | What it tells you | Example metric |
|---|---|---|
| Latency | How long requests take | p50, p95, p99 response time |
| Traffic | How much demand | requests per second |
| Errors | Failure rate | 5xx rate, error rate per endpoint |
| Saturation | How full your resources are | CPU, memory, disk, connection pool usage |
Note that "average" latency is mostly useless. Always look at percentiles. p99 latency tells you the experience of your worst-served 1 percent of users, which is often where outages start.
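A minimal sketch of how raw latency samples reduce to those percentiles. The sample data and the nearest-rank method here are illustrative, not any specific library's implementation.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value at or above p percent of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative latencies in ms: mostly fast, with a slow tail.
latencies = [12, 15, 14, 18, 20, 16, 13, 450, 17, 900]

print("mean:", sum(latencies) / len(latencies))   # 147.5, skewed by the tail
print("p50:", percentile(latencies, 50))          # 16: typical user is fine
print("p95:", percentile(latencies, 95))          # 900: the tail is not
print("p99:", percentile(latencies, 99))
```

Note how the mean sits between the two experiences and describes neither.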
Symptom-based alerting
The cardinal rule: alert on symptoms users feel, not on causes you imagine. If you alert on "CPU is 90 percent", you might page in the middle of the night for something that is not actually a problem (some workloads are CPU-saturated by design). If you alert on "p99 latency exceeded SLO" or "error rate is climbing", you only page when users are actually being hurt.
The exception: leading indicators that always precede user impact. "Disk will be full in 4 hours" is fine to alert on even if no user has noticed yet, because by the time they notice, it is too late.
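A sketch of that kind of leading-indicator check, using a naive linear extrapolation. The 4-hour threshold, sampling interval, and disk sizes are assumptions for illustration.

```python
def hours_until_disk_full(used_gb_then: float, used_gb_now: float,
                          hours_between: float, capacity_gb: float) -> float:
    """Linearly extrapolate recent growth to estimate time until the disk fills."""
    growth_per_hour = (used_gb_now - used_gb_then) / hours_between
    if growth_per_hour <= 0:
        return float("inf")  # not growing; never fills at this rate
    return (capacity_gb - used_gb_now) / growth_per_hour

# Two usage samples taken 6 hours apart on a 500 GB volume.
remaining = hours_until_disk_full(390.0, 460.0, 6.0, 500.0)
if remaining < 4:  # alert well before users feel it
    print(f"PAGE: disk full in ~{remaining:.1f} hours at the current growth rate")
```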
Tuning to avoid alert fatigue
The biggest failure mode of an alerting system is too many alerts. When the on-call gets paged 30 times a night, they stop reading them. They stop investigating. The boy-who-cried-wolf effect kicks in and the one alert that actually mattered gets ignored.
Rules I have seen work in practice:
- Every page must be actionable. If the on-call cannot do anything about it, it is not a page.
- Every page must have a runbook. Link a doc that says what to check and what to do.
- Track noise. Count alerts per week, retro the ones that did not require action, and tune them out.
- Require sustained conditions. Fire only if the condition has held for at least N minutes (or across both a short and a long evaluation window), so brief spikes do not page. A sketch follows this list.
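A toy version of that sustained-condition check, assuming a 60-second evaluation interval and a 5 percent error-rate threshold (both illustrative):

```python
from collections import deque

class SustainedAlert:
    """Fires only when the condition has been true for every check in the window."""

    def __init__(self, window_checks: int):
        self.history = deque(maxlen=window_checks)

    def evaluate(self, condition_is_true: bool) -> bool:
        self.history.append(condition_is_true)
        return len(self.history) == self.history.maxlen and all(self.history)

# Check every 60 s; page only after 5 consecutive bad checks (~5 minutes).
alert = SustainedAlert(window_checks=5)
error_rates = [0.02, 0.08, 0.09, 0.07, 0.11, 0.12]  # fraction of 5xx responses
for rate in error_rates:
    if alert.evaluate(rate > 0.05):
        print(f"PAGE: error rate {rate:.0%} sustained for ~5 minutes")
```

The single 2 percent reading and any brief spike never page; only the sustained breach does.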
SLOs and error budgets
An SLO (Service Level Objective) is your internal availability or performance goal. Say 99.9 percent of requests succeed in under 200 ms. The error budget is the inverse: 0.1 percent of requests can fail or be slow. If you are well within budget, you can ship risky features. If you have burned through it, you stop and focus on stability. This frames alerting nicely: you do not alert on "any error", you alert on "burning the error budget too fast".
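A back-of-the-envelope sketch of the budget math and a burn-rate check. The 30-day window and the 10x paging multiplier are common choices, not fixed rules, and the request counts are made up.

```python
SLO = 0.999                      # 99.9% of requests succeed within 200 ms
WINDOW_DAYS = 30
ERROR_BUDGET = 1 - SLO           # 0.1% of requests may fail or be slow

def burn_rate(bad_requests: int, total_requests: int) -> float:
    """How fast the budget is being consumed: 1.0 means exactly on pace to
    spend the whole budget over the window; higher means faster."""
    observed_error_ratio = bad_requests / total_requests
    return observed_error_ratio / ERROR_BUDGET

# Last hour: 1,000,000 requests, 3,000 failed or breached the latency target.
rate = burn_rate(3_000, 1_000_000)
print(f"burn rate: {rate:.1f}x")   # 3.0x: budget gone in ~10 days, worth watching
if rate > 10:                      # page only when the budget is burning fast
    print("PAGE: burning the 30-day error budget more than 10x too fast")
```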
The on-call rotation
Production systems need someone to respond when things break. A healthy rotation has at least five people, so each person is on call no more than about one week in five. Hand-offs include reviewing recent alerts and ongoing issues. The post-incident review is mandatory, blameless, and produces concrete action items. Burnout from on-call is a real cost; if your rotation is breaking people, you have an alerting problem to fix, not a tougher engineer to hire.