Message Queues
A message queue lets one service send work to another without waiting for it. The producer drops a message in the queue and moves on. A consumer picks it up later. Because neither side blocks on the other, a slow or failed consumer does not stall the producer, and consumers can be scaled independently of producers.
The blocking problem
Imagine a sign-up flow that does five things: create the user, send a welcome email, create a default workspace, send analytics events, warm up some caches. If you do all five synchronously inside the HTTP request, the user waits for all of them. Worse, if the email service is down, sign-up fails entirely.
The fix: do only the essential work inside the request and queue the rest. The user gets a fast response. The other work happens when it can.
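A minimal sketch of that split, using Python's in-process `queue.Queue` as a stand-in for a real broker. The function name `sign_up`, the job names, and the `users` store are hypothetical and exist only for illustration.

```python
import queue

# In-process stand-in for a real message broker (assumption for this sketch).
background_jobs = queue.Queue()
users = {}  # hypothetical user store keyed by email


def sign_up(email):
    # Essential work: without a user record, sign-up has not happened.
    user = {"id": len(users) + 1, "email": email}
    users[email] = user

    # Everything else is queued; a consumer handles it whenever it can.
    for job in ("send_welcome_email", "create_default_workspace",
                "send_analytics_events", "warm_caches"):
        background_jobs.put({"job": job, "user_id": user["id"]})

    return user  # the HTTP response can go out now


user = sign_up("ada@example.com")
print(user["id"], background_jobs.qsize())  # 1 4
```

If the email service is down, only the `send_welcome_email` job is delayed; the sign-up itself has already succeeded.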
What a queue gives you
- Decoupling. Producer and consumer don't need to be running at the same time, on the same machine, or even at the same speed.
- Buffering. Traffic spikes get absorbed by the queue. Consumers process at their own pace.
- Retries. If a consumer crashes before acknowledging a message, the message stays in the queue and is redelivered.
- Workload spreading. Multiple consumers pull from the same queue, parallelizing work.
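The retry behaviour depends on acknowledgements: a message is removed only after a consumer confirms it finished. The toy class below sketches that idea; `AckQueue` and its methods are hypothetical, and real brokers trigger redelivery through a timeout or a dropped connection rather than an explicit call.

```python
from collections import deque


class AckQueue:
    """Toy queue: a message is gone only once a consumer acknowledges it."""

    def __init__(self):
        self._ready = deque()
        self._in_flight = {}
        self._next_id = 0

    def send(self, body):
        self._ready.append((self._next_id, body))
        self._next_id += 1

    def receive(self):
        if not self._ready:
            return None
        msg_id, body = self._ready.popleft()
        self._in_flight[msg_id] = body  # delivered, but not yet acknowledged
        return msg_id, body

    def ack(self, msg_id):
        self._in_flight.pop(msg_id, None)

    def requeue_unacked(self):
        # Stand-in for the broker noticing that a consumer went away.
        for msg_id, body in self._in_flight.items():
            self._ready.append((msg_id, body))
        self._in_flight.clear()


q = AckQueue()
q.send("resize image 17")

msg_id, body = q.receive()    # consumer A takes the message...
q.requeue_unacked()           # ...and crashes before acknowledging it

msg_id2, body2 = q.receive()  # consumer B receives the same message
q.ack(msg_id2)                # B finishes, so the message is removed
```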
Delivery semantics
Three guarantees, ranked by difficulty:
- At most once. Message may be lost but never duplicated. Easy. Used when occasional drops are fine.
- At least once. Message is delivered, possibly multiple times. The default for most queues. Requires consumers to be idempotent.
- Exactly once. The holy grail. Hard in distributed systems. In practice it is usually approximated by at-least-once delivery plus idempotent consumers, so duplicates still arrive but have no additional effect.
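An idempotent consumer can be sketched as a handler that remembers which message IDs it has already applied. The names `handle_deposit`, `processed_ids`, and `balances` are hypothetical, and the in-memory set is an assumption: a real system would store the ID and the side effect in the same transaction.

```python
processed_ids = set()    # IDs of messages already applied
balances = {"alice": 0}  # hypothetical state the messages modify


def handle_deposit(message):
    """Apply a deposit once; return False if the message was already handled."""
    if message["id"] in processed_ids:
        return False  # duplicate delivery, safely ignored
    balances[message["account"]] += message["amount"]
    processed_ids.add(message["id"])
    return True


msg = {"id": "m-1", "account": "alice", "amount": 50}
handle_deposit(msg)
handle_deposit(msg)       # the queue redelivers the same message
print(balances["alice"])  # 50
```

The deduplication key has to come from the producer (here, `id`), since the consumer cannot otherwise tell a redelivery from a second, genuine deposit of the same amount.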
Common gotchas
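Poison messages. A poison message is a bad message that crashes every worker that picks it up. Without a dead-letter queue (DLQ), it is redelivered and fails forever, wasting capacity and delaying the messages behind it. Always configure a DLQ, with a limit on delivery attempts.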
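As a rough illustration, the loop below retries a failing message a fixed number of times and then moves it aside. The `drain` function and the limit of three attempts are assumptions for this sketch; managed brokers typically expose the same idea as a configuration setting rather than code you write.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit for this sketch


def drain(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Process messages, moving repeated failures to a dead-letter list."""
    work = deque((m, 0) for m in messages)
    done, dead_letters = [], []
    while work:
        message, attempts = work.popleft()
        try:
            handler(message)
            done.append(message)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(message)      # stop retrying, keep for inspection
            else:
                work.append((message, attempts))  # redeliver later


    return done, dead_letters


def parse_int(message):
    int(message)  # raises ValueError on malformed input


done, dead_letters = drain(["1", "not-a-number", "3"], parse_int)
print(done, dead_letters)  # ['1', '3'] ['not-a-number']
```

The healthy messages still get through, and the bad one ends up somewhere a human can look at it.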
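Ordering. Most queues do not guarantee order across consumers. If order matters per key, partition by key so that all messages for one key go to the same consumer in sequence. Kafka preserves order within a partition and routes messages with the same key to the same partition; SQS FIFO queues preserve order within a message group.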
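Partitioning by key can be sketched with a stable hash. The partition count, the event tuples, and the choice of CRC32 are assumptions made for illustration; real brokers choose their own hash function.

```python
import zlib

NUM_PARTITIONS = 4  # assumed for this sketch


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "renamed"),
    ("user-1", "deleted"),
]
for key, event in events:
    partitions[partition_for(key)].append((key, event))

# Every event for user-1 sits in one partition, in the order it was sent.
user1_events = [e for k, e in partitions[partition_for("user-1")] if k == "user-1"]
print(user1_events)  # ['created', 'renamed', 'deleted']
```

Order is only guaranteed within a key. Events for `user-1` and `user-2` may still be processed in any relative order.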
Backpressure. If producers outpace consumers for long enough, the queue grows until it exhausts memory or disk. Set a maximum length, alert on consumer lag, and scale consumers when lag rises.
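A bounded queue makes the limit explicit. This sketch uses the standard library's `queue.Queue` with `maxsize`; the helper `try_enqueue` and the tiny limit of two are hypothetical, chosen so the rejection is easy to see.

```python
import queue

jobs = queue.Queue(maxsize=2)  # deliberately tiny limit for the example


def try_enqueue(job):
    """Return True if the job was queued, False if the queue is full."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        return False  # caller can shed load, slow down, or retry later


results = [try_enqueue(n) for n in range(3)]
print(results)  # [True, True, False]
```

Rejecting work at the edge is a design choice: the producer learns about overload immediately instead of discovering it later as unbounded latency.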
Queues turn brittle synchronous flows into resilient asynchronous ones. The price: harder reasoning. Trace IDs become essential. Idempotency becomes essential. The investment usually pays off the first time a downstream service has a bad day.