Design a Notification System

Design a system that sends push, email, SMS, and in-app notifications at scale. The challenge is reliable delivery across multiple channels with personalization, scheduling, and aggressive rate control.

Problem

Build a service that lets other services in the company say "send a notification to user X about event Y" and reliably delivers it via the right channels (push, email, SMS, in-app). Handle billions of notifications per day. Respect user preferences. Avoid spam and duplicates.

Why this is hard

Each channel has its own provider with its own quirks: APNs for iOS push, FCM for Android, SES or SendGrid for email, Twilio for SMS. Each has its own rate limits, error codes, and retry semantics, and each offers different delivery guarantees. Users have preferences across channels and topics. Some notifications are urgent, some can be batched. And the volume is huge.

Architecture

Diagram: Notification System

  Producers: Order Service, Auth Service, Marketing
  Notification API: template engine, user prefs lookup
  Per-channel queues: push / email / sms, with priority lanes
  Workers: Push worker → APNs/FCM, Email worker → SES, SMS worker → Twilio, In-app worker → WS
  Supporting components: User Prefs (channels · topics · DND), Dedupe Cache (recent notification IDs), Delivery Tracking (opens · clicks · bounces)

The flow

  1. Producer service POSTs to Notification API: {user_id, type: "order_shipped", data: {...}}
  2. API loads user preferences. If they have opted out of marketing emails or are in Do Not Disturb hours, drop or defer that channel.
  3. API renders the message from a template (per channel, per locale).
  4. Dedupe check: have we sent this exact notification (same key) recently? If yes, skip.
  5. API enqueues one job per channel into the appropriate queue.
  6. Channel workers consume jobs, call the provider API, handle errors.
  7. Delivery status (sent, delivered, opened, bounced) is tracked back via webhooks for analytics.
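Steps 2 through 5 of the flow can be sketched in a few lines. This is a minimal in-memory illustration, not a reference implementation: the names NotificationApi, UserPrefs, TEMPLATES, and dedupe_key are hypothetical, the dedupe cache is a plain set with no expiry, and Do Not Disturb simply drops the push channel rather than deferring it.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field

# Hypothetical templates keyed by (notification type, channel).
TEMPLATES = {
    ("order_shipped", "push"): "Your order {order_id} has shipped.",
    ("order_shipped", "email"): "Hi {name}, your order {order_id} is on its way.",
}


@dataclass
class UserPrefs:
    enabled_channels: set  # channels the user has not opted out of
    dnd: bool = False      # True while the user is in Do Not Disturb hours


@dataclass
class NotificationApi:
    prefs: dict                                   # user_id -> UserPrefs
    seen_keys: set = field(default_factory=set)   # stand-in for the dedupe cache
    queues: dict = field(default_factory=lambda: defaultdict(deque))

    def submit(self, user_id, type_, data, dedupe_key):
        """Return the channels a job was enqueued for."""
        # Step 2: load preferences and drop channels the user should not get.
        user = self.prefs[user_id]
        allowed = [
            c for c in sorted(user.enabled_channels)
            if not (user.dnd and c == "push")
        ]
        # Step 3: render one message per channel that has a template.
        rendered = {
            c: TEMPLATES[(type_, c)].format(**data)
            for c in allowed
            if (type_, c) in TEMPLATES
        }
        # Step 4: skip if this key was already accepted.
        if dedupe_key in self.seen_keys:
            return []
        self.seen_keys.add(dedupe_key)
        # Step 5: enqueue one job per channel.
        for channel, body in rendered.items():
            self.queues[channel].append(
                {"id": dedupe_key, "user_id": user_id, "body": body}
            )
        return list(rendered)
```

Calling submit twice with the same dedupe_key enqueues jobs only the first time. A production version would presumably give the dedupe entries a time-to-live so that "recently" has a bounded meaning.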

Channel-specific gotchas

Reliability

Delivery is at-least-once by default, so a job may be retried or redelivered. An idempotency key (the notification ID) ensures that retries do not double-send. Sends that still fail go to a dead-letter queue for manual investigation. Critical notification types (auth codes, payment confirmations) get higher-priority queues with SLAs.
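One way a channel worker might combine the idempotency key, retries, and the dead-letter queue is sketched below. The names deliver, TransientError, sent_ids, and dead_letters are assumptions made for illustration; the set and list stand in for a shared store and a real queue, and backoff between attempts is left out.

```python
class TransientError(Exception):
    """Hypothetical marker for provider errors worth retrying (timeouts, 5xx)."""


def deliver(job, send, sent_ids, dead_letters, max_attempts=3):
    """Try to deliver one job; return 'sent', 'duplicate', or 'dead_lettered'."""
    # Idempotency check: a redelivered job with a known ID is not sent again.
    if job["id"] in sent_ids:
        return "duplicate"
    for _ in range(max_attempts):
        try:
            send(job)  # call into the provider (APNs, SES, Twilio, ...)
        except TransientError:
            continue   # a real worker would back off before retrying
        sent_ids.add(job["id"])
        return "sent"
    # Retries exhausted: park the job for manual investigation.
    dead_letters.append(job)
    return "dead_lettered"
```

This sketch only guards against redelivery after a recorded success. If a worker crashes between the provider call and recording the ID, a duplicate is still possible unless the provider also accepts an idempotency key.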

Throttling and aggregation

If a user has 50 events in 5 minutes, do not send 50 push notifications. Aggregate: "You have 50 new likes". The aggregation logic is per notification type. Some are time-windowed, some are count-windowed, some are user-controlled in their preferences.
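A time-windowed aggregator could look roughly like this. The Aggregator class, its 300-second default window, and the naive pluralization are all assumptions for illustration; a count-windowed variant would flush on len(stamps) reaching a threshold instead of on elapsed time.

```python
from collections import defaultdict


class Aggregator:
    """Buffers events per (user, type) and emits one summary per window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.pending = defaultdict(list)  # (user_id, type) -> event timestamps

    def add(self, user_id, type_, now):
        self.pending[(user_id, type_)].append(now)

    def flush(self, now):
        """Return summaries for every buffer whose window has elapsed."""
        out = []
        for key in list(self.pending):
            stamps = self.pending[key]
            # The window starts at the first buffered event.
            if now - stamps[0] >= self.window_seconds:
                user_id, type_ = key
                count = len(stamps)
                noun = type_ if count == 1 else type_ + "s"
                out.append((user_id, f"You have {count} new {noun}"))
                del self.pending[key]
        return out
```

Here timestamps are passed in explicitly as seconds so the behavior is easy to test; a scheduler would call flush periodically with the current time.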

Operational lesson: Your notification system will eventually be used to spam users. Build admin tools that can pause specific notification types globally, mute marketing across the system, or rate-limit by event source. The day a bug sends 50 notifications to every user is the day you need that switch.
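The admin switches described above might be modeled as a single gate that every notification passes through before it is enqueued. This is a sketch under stated assumptions: AdminControls and its field names are hypothetical, the state lives in memory rather than in a shared config store, and the per-source counter is never reset, which a real rate limiter would do at each window boundary.

```python
from dataclasses import dataclass, field


@dataclass
class AdminControls:
    paused_types: set = field(default_factory=set)      # types paused globally
    marketing_muted: bool = False                       # system-wide marketing mute
    source_limits: dict = field(default_factory=dict)   # source -> max sends per window
    source_counts: dict = field(default_factory=dict)   # source -> sends so far

    def allow(self, type_, source, is_marketing=False):
        """Return True if the notification may proceed to the queues."""
        if type_ in self.paused_types:
            return False
        if is_marketing and self.marketing_muted:
            return False
        limit = self.source_limits.get(source)
        if limit is not None:
            used = self.source_counts.get(source, 0)
            if used >= limit:
                return False
            self.source_counts[source] = used + 1
        return True
```

Pausing a type is then a single operation, for example controls.paused_types.add("new_like"), which takes effect on the next call to allow.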