Logging, Metrics, and Tracing
The three pillars of observability are logs, metrics, and traces. Each tells you a different kind of story about what your system is doing. Used together, they let you debug failure modes you have never seen before.
Why three pillars
Monitoring tells you when something is wrong. Observability tells you why. The difference matters: monitoring is about predefined questions ("is CPU high?"), observability is about being able to answer questions you did not anticipate ("why did this specific user's checkout fail?"). To get there, you need three different types of data, and each one is good at a different thing.
Logs
Logs are timestamped events. Plain text traditionally, structured JSON ideally. They tell you what happened, when, and to whom. The shift to structured logs was huge — instead of grepping unstructured strings, you can query level=error AND user_id=42 like a database.
Best practices: log at boundaries (request in, request out, external call in, external call out), include request IDs so you can correlate across services, never log secrets, sample high-volume logs. Tools: ELK stack (Elasticsearch + Logstash + Kibana), Loki, Datadog, Splunk.
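To make the structured-logging idea concrete, here is a minimal sketch using only Python's standard library. The JsonFormatter class and the field names (request_id, user_id) are illustrative, not a fixed schema; real setups usually lean on a logging library or the log shipper to do this.

```python
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Merge structured fields passed via logging's `extra` argument.
        for key in ("request_id", "user_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Log at the boundaries: request in, external call out.
log.info("request received", extra={"request_id": "req-7f3a", "user_id": 42})
log.error("payment gateway timed out", extra={"request_id": "req-7f3a", "user_id": 42})
# Each line is now queryable, e.g. level=error AND user_id=42
```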
Metrics
Metrics are numeric measurements over time. Counters (total requests), gauges (current memory), histograms (request latency distribution). They are cheap to store at scale because they are aggregated. You typically have one number per metric per time bucket per dimension, not one entry per event.
The four golden signals from the SRE book: latency, traffic, errors, saturation. If you only track four things, track these. Tools: Prometheus, Graphite, InfluxDB, Datadog. Visualization: Grafana is the de facto standard.
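A short sketch of the three metric types, assuming the prometheus_client Python library is installed; the metric names, labels, and port are illustrative.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being handled")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")


def handle_request():
    IN_FLIGHT.inc()                     # gauge: goes up and down
    with LATENCY.time():                # histogram: records the elapsed time
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    IN_FLIGHT.dec()
    REQUESTS.labels(method="GET", status="200").inc()  # counter: only ever increases


if __name__ == "__main__":
    start_http_server(8000)             # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request()
```

Note how cheap this is to store: no matter how many requests flow through, the scrape produces one number per metric per label combination per interval.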
Traces
A trace follows a single request as it travels through your system. Each step (a span) has a start time, an end time, and a parent. Visualize it as a flame graph or Gantt chart and you can see exactly where time is spent.
The reason traces matter so much in microservices: a slow request might involve 20 services. Logs and metrics tell you something is slow, but only traces tell you which hop is the bottleneck. The standard now is OpenTelemetry — vendor-neutral SDKs that emit traces in a common format. Backends include Jaeger, Zipkin, Honeycomb, Datadog APM, Tempo.
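A minimal tracing sketch using the OpenTelemetry Python SDK, exporting spans to the console instead of a real backend; the service and span names are made up for illustration. The nesting of the context managers is what gives a backend the parent/child structure it renders as a flame graph.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle_checkout"):            # root span
    with tracer.start_as_current_span("reserve_inventory"):      # child span
        pass  # call to the inventory service would go here
    with tracer.start_as_current_span("charge_card") as span:    # sibling child span
        span.set_attribute("payment.provider", "example")
```

Swapping ConsoleSpanExporter for an OTLP exporter pointed at Jaeger, Tempo, or a vendor backend is a configuration change, not a code rewrite, which is the whole point of the vendor-neutral SDK.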
How they work together
You see a metric spike in error rate. You drill down to the affected service. You pull traces for failed requests during that window. You see the failures cluster in one downstream call. You pull logs for that specific call and see a database timeout. Three pillars, one investigation. Without all three, you would either know something is broken without knowing what (only metrics), drown in irrelevant data (only logs), or have no aggregated view at all (only traces).
One habit ties the pillars together: assign each request a correlation ID at the edge, propagate it on every hop (X-Request-Id or the W3C Trace Context traceparent header), and include it in every log line. This single habit makes debugging cross-service problems a hundred times easier.
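A sketch of what that habit looks like inside one Python service, assuming a contextvar to carry the ID and a logging filter to stamp it onto every record; the names request_id_var and RequestIdFilter are hypothetical, and a real service would also forward the ID on outgoing calls.

```python
import contextvars
import logging
import uuid

# Holds the current request's correlation ID for the duration of handling it.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Stamp the current correlation ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s",
)
log = logging.getLogger("orders")
log.addFilter(RequestIdFilter())


def handle_request(headers):
    # Reuse the caller's ID if present, otherwise mint one at the edge.
    request_id_var.set(headers.get("X-Request-Id", str(uuid.uuid4())))
    log.info("order placed")  # every line from this request now carries the ID


handle_request({"X-Request-Id": "req-7f3a"})
```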