Design a Notification System
Design a system that sends push, email, SMS, and in-app notifications at scale. The challenge is reliable delivery across multiple channels with personalization, scheduling, and aggressive rate control.
Problem
Build a service that lets other services in the company say "send a notification to user X about event Y" and reliably delivers it via the right channels (push, email, SMS, in-app). Handle billions of notifications per day. Respect user preferences. Avoid spam and duplicates.
Why this is hard
Each channel has its own provider with its own quirks: APNs for iOS push, FCM for Android, SES or SendGrid for email, Twilio for SMS. Each provider has its own rate limits, error codes, retry semantics, and delivery guarantees. Users have preferences across channels and topics. Some notifications are urgent, while others can be batched. And the volume is huge.
Architecture
The flow
- Producer service POSTs to Notification API: {user_id, type: "order_shipped", data: {...}}
- API loads user preferences. If the user has opted out of marketing emails or is in Do Not Disturb hours, drop or defer that channel.
- API renders the message from a template (per channel, per locale).
- Dedupe check: have we sent this exact notification (same key) recently? If yes, skip.
- API enqueues one job per channel into the appropriate queue.
- Channel workers consume jobs, call the provider API, handle errors.
- Delivery status (sent, delivered, opened, bounced) is tracked back via webhooks for analytics.
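The API-side steps above (preferences, rendering, dedupe, per-channel enqueue) can be sketched roughly as follows. This is a minimal in-memory illustration, not a production design: the `Preferences`, `NotificationApi`, and `TEMPLATES` names, the `dedupe_key` parameter, and the Do Not Disturb defaults are all assumptions made for the example. A real system would back the dedupe set with a store that expires keys and would key templates by locale as well.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Preferences:
    # Channels the user has opted out of, e.g. {"email"} (hypothetical shape).
    opted_out: set = field(default_factory=set)
    # Do Not Disturb window in local hours, [start, end); may wrap past midnight.
    dnd_start: int = 22
    dnd_end: int = 7

    def in_dnd(self, hour: int) -> bool:
        if self.dnd_start <= self.dnd_end:
            return self.dnd_start <= hour < self.dnd_end
        return hour >= self.dnd_start or hour < self.dnd_end


# Hypothetical templates keyed by (type, channel); locale is omitted for brevity.
TEMPLATES = {
    ("order_shipped", "push"): "Your order {order_id} has shipped.",
    ("order_shipped", "email"): "Good news: order {order_id} is on its way.",
}


class NotificationApi:
    def __init__(self, prefs):
        self.prefs = prefs                # user_id -> Preferences
        self.seen = set()                 # dedupe keys; stand-in for a store with TTL
        self.queues = defaultdict(deque)  # channel -> pending jobs
        self.deferred = []                # jobs held back by Do Not Disturb

    def submit(self, user_id, type_, data, dedupe_key, hour):
        prefs = self.prefs.get(user_id, Preferences())

        # Steps 1-2: apply preferences, then render one message per allowed channel.
        jobs = []
        for (template_type, channel), template in TEMPLATES.items():
            if template_type != type_ or channel in prefs.opted_out:
                continue
            jobs.append({
                "user_id": user_id,
                "channel": channel,
                "body": template.format(**data),
                "key": f"{dedupe_key}:{channel}",
            })

        # Step 3: skip if this exact notification was already accepted.
        if dedupe_key in self.seen:
            return "duplicate"
        self.seen.add(dedupe_key)

        # Step 4: enqueue one job per channel, or defer during Do Not Disturb.
        for job in jobs:
            if prefs.in_dnd(hour):
                self.deferred.append(job)
            else:
                self.queues[job["channel"]].append(job)
        return "accepted"
```

For example, a user who has opted out of email gets only a push job for "order_shipped", and a second call with the same `dedupe_key` returns "duplicate" without enqueuing anything.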
Channel-specific gotchas
- APNs/FCM: A device token becomes invalid when the user uninstalls or reinstalls the app. The worker must handle an "invalid token" response by marking the device dead in your DB. Providers enforce rate limits per app, so plan accordingly.
- Email: Bounce handling is critical. Hard bounces should suppress the address permanently. Soft bounces should be retried. Sender reputation depends on getting this right.
- SMS: Expensive (roughly $0.01-$0.10 per message). Rate-limit heavily and reserve it for high-value notifications. Comply with regulations (TCPA in the US, GDPR in the EU). Honor STOP responses.
- In-app: If the user is online, push the notification over their WebSocket connection. If offline, store it in a notifications inbox that they see the next time they open the app.
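One way a channel worker might turn provider responses into the actions described above is sketched below. The `Outcome` and `Action` values are hypothetical labels invented for this example, not real APNs, FCM, SES, or Twilio error codes; a real worker would first map each provider's actual responses onto categories like these.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical normalized outcomes, not real provider codes.
    OK = "ok"
    INVALID_TOKEN = "invalid_token"
    HARD_BOUNCE = "hard_bounce"
    SOFT_BOUNCE = "soft_bounce"
    RATE_LIMITED = "rate_limited"


class Action(Enum):
    DONE = "done"
    RETRY = "retry"
    SUPPRESS = "suppress"


class DeliveryState:
    def __init__(self):
        # channel -> addresses (tokens, emails, phone numbers) we must not send to
        self.blocked = {}

    def can_send(self, channel, address):
        return address not in self.blocked.get(channel, set())

    def record_stop(self, phone):
        # A STOP reply opts the number out of further SMS.
        self.blocked.setdefault("sms", set()).add(phone)

    def handle(self, channel, address, outcome):
        # Dead device tokens and hard bounces are permanent: suppress the address.
        if outcome in (Outcome.INVALID_TOKEN, Outcome.HARD_BOUNCE):
            self.blocked.setdefault(channel, set()).add(address)
            return Action.SUPPRESS
        # Soft bounces and rate limiting are transient: retry later.
        if outcome in (Outcome.SOFT_BOUNCE, Outcome.RATE_LIMITED):
            return Action.RETRY
        return Action.DONE
```

Workers would call `can_send` before every send, so a suppressed address never reaches the provider again.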
Reliability
Delivery is at-least-once by default, so a job may be processed more than once. An idempotency key (the notification ID) lets workers detect retries and avoid double-sending. Sends that still fail after retries go to a dead-letter queue for manual investigation. Critical notification types (auth codes, payment confirmations) get higher-priority queues with SLAs.
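A minimal sketch of these reliability mechanics, assuming an in-memory priority queue and a hypothetical `send` callable that raises `TransientError` on a retryable failure. The `Dispatcher` class, the attempt limit of 3, and the convention that a lower number means higher priority are illustrative choices. Backoff delays between retries are omitted to keep the example short.

```python
import heapq
import itertools


class TransientError(Exception):
    """Raised by the (hypothetical) send function for retryable failures."""


class Dispatcher:
    MAX_ATTEMPTS = 3  # illustrative limit before dead-lettering

    def __init__(self, send):
        self.send = send
        self.heap = []                    # (priority, seq, id, payload, attempt)
        self.seq = itertools.count()      # tie-breaker keeps FIFO order per priority
        self.delivered = set()            # idempotency record of notification IDs
        self.dead_letter = []

    def enqueue(self, notification_id, payload, priority=1, attempt=1):
        # Lower priority number is served first, e.g. 0 for auth codes.
        entry = (priority, next(self.seq), notification_id, payload, attempt)
        heapq.heappush(self.heap, entry)

    def run(self):
        while self.heap:
            priority, _, nid, payload, attempt = heapq.heappop(self.heap)
            if nid in self.delivered:
                continue  # redelivered job: already sent, so skip it
            try:
                self.send(nid, payload)
            except TransientError:
                if attempt >= self.MAX_ATTEMPTS:
                    self.dead_letter.append((nid, payload))
                else:
                    self.enqueue(nid, payload, priority, attempt + 1)
                continue
            self.delivered.add(nid)
```

Note that the idempotency record only guards against duplicates inside this service. If a worker crashes after the provider accepts a message but before the ID is recorded, a duplicate can still go out unless the provider also supports idempotent requests.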
Throttling and aggregation
If a user generates 50 events in 5 minutes, do not send 50 push notifications. Aggregate them into one: "You have 50 new likes". The aggregation logic is defined per notification type. Some types are time-windowed, some are count-windowed, and some are controlled by the user through their preferences.
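The time-windowed variant could look roughly like the sketch below. The `Aggregator` class, the 300-second default window, and the `summarize` wording are assumptions for illustration. Timestamps are passed in explicitly so the behavior is easy to follow, and the window starts at the first event of each (user, type) pair.

```python
class Aggregator:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.buckets = {}  # (user_id, type) -> [window_start, count]

    def add(self, user_id, type_, now):
        # Open a bucket on the first event, then just count within the window.
        bucket = self.buckets.setdefault((user_id, type_), [now, 0])
        bucket[1] += 1

    def flush(self, now):
        """Return one summary per bucket whose window has elapsed."""
        ready = []
        for key in list(self.buckets):
            start, count = self.buckets[key]
            if now - start >= self.window:
                user_id, type_ = key
                ready.append((user_id, summarize(type_, count)))
                del self.buckets[key]
        return ready


def summarize(type_, count):
    # Naive pluralization; real templates would be per type and per locale.
    if count == 1:
        return f"You have 1 new {type_}"
    return f"You have {count} new {type_}s"
```

A count-windowed variant would flush as soon as `count` reaches a threshold instead of waiting for the window to elapse.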