Design a notification system (push, SMS, and email)
A notification system delivers messages across push (APNs/FCM), SMS (Twilio/Vonage), email (SES/SendGrid), and in-app channels reliably and at high scale. The engineering challenges span multi-channel routing, strict user-preference enforcement, provider variability, deliverability, and handling bursty traffic from product launches or breaking news events that can spike to hundreds of thousands of messages per second.
Functional requirements: send via any combination of push, SMS, and email; honor per-user channel preferences, language, time zone, and quiet hours; support transactional flows (login OTP, order update) and marketing flows (newsletters, promotions); track delivery receipts, opens, clicks, and bounces; allow versioned templates with personalization; provide a developer-facing API with idempotency guarantees. Non-functional requirements: 99.99% availability, sub-second dispatch for transactional messages, throughput of 100k+ msgs/sec at peak, and effectively exactly-once delivery, achieved as at-least-once transport plus deduplication on idempotency keys.
The core architecture is a queue-driven multi-channel pipeline. At the front sits a Notification API that accepts a request payload including recipient, template ID, channel preferences, and a caller-supplied idempotency key. The API deduplicates by checking the key against short-lived Redis entries (one key per request, each with a TTL), persists the notification record to a relational DB, and publishes an event to Kafka. From there, dedicated consumer groups, one per channel, each own their topic: push-topic, sms-topic, email-topic. This isolation means a Twilio outage cannot block APNs delivery.
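The ingestion path above (dedup, persist, publish) can be sketched as follows. This is a minimal illustration, not a reference implementation: the class and field names are hypothetical, and plain dicts and lists stand in for Redis, the relational DB, and the Kafka producer. With a real Redis, the check-and-record step would typically be a single SET with the NX and EX options so that it is atomic.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class NotificationApi:
    """In-memory stand-ins (all hypothetical): `seen` plays the Redis dedup
    store, `db` the notifications table, `published` the Kafka producer."""
    dedup_ttl_s: float = 24 * 3600
    seen: dict = field(default_factory=dict)        # idempotency_key -> (expiry, id)
    db: dict = field(default_factory=dict)          # notification_id -> record
    published: list = field(default_factory=list)   # (topic, notification_id)

    def submit(self, idempotency_key, recipient, template_id, channels, now=None):
        now = time.time() if now is None else now
        hit = self.seen.get(idempotency_key)
        if hit and hit[0] > now:
            # Duplicate inside the dedup window: return the original ID,
            # write nothing and publish nothing.
            return hit[1], False
        notification_id = str(uuid.uuid4())
        self.db[notification_id] = {
            "recipient": recipient,
            "template_id": template_id,
            "channels": list(channels),
            "status": "accepted",
        }
        self.seen[idempotency_key] = (now + self.dedup_ttl_s, notification_id)
        for channel in channels:
            # One event per requested channel, on that channel's own topic.
            self.published.append((f"{channel}-topic", notification_id))
        return notification_id, True
```

A second call with the same key inside the TTL returns the first notification ID and a `False` flag, which lets the caller retry safely after a timeout.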
Priority queues per channel: within each channel, every message carries a priority tag of critical (security alerts, OTPs), high (order updates, payment confirmations), medium (social notifications), or low (marketing). Separate Kafka topics or SQS queues per priority, each with its own consumers, ensure critical messages are never stuck behind a backlog of promotional blasts. Partitions of a single topic are a weaker fit, because Kafka has no built-in notion of priority between partitions.
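A small sketch of the priority scheme, under stated assumptions: the type-to-priority mapping and the queue naming pattern are hypothetical, and in-process deques stand in for per-priority topics or queues. The worker drains strictly by priority, which is the simplest policy; a real deployment might reserve some capacity for lower tiers so they are not starved.

```python
from collections import deque
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Hypothetical mapping from notification type to priority tier.
PRIORITY_BY_TYPE = {
    "otp": Priority.CRITICAL,
    "security_alert": Priority.CRITICAL,
    "order_update": Priority.HIGH,
    "payment_confirmation": Priority.HIGH,
    "social": Priority.MEDIUM,
    "marketing": Priority.LOW,
}


def queue_name(channel, notification_type):
    """Name of the per-priority queue; unknown types fall back to LOW."""
    priority = PRIORITY_BY_TYPE.get(notification_type, Priority.LOW)
    return f"{channel}-{priority.name.lower()}"


class ChannelQueues:
    """One FIFO per priority tier for a single channel."""

    def __init__(self):
        self.queues = {p: deque() for p in Priority}

    def enqueue(self, notification_type, message):
        priority = PRIORITY_BY_TYPE.get(notification_type, Priority.LOW)
        self.queues[priority].append(message)

    def next_message(self):
        # Enum members iterate in definition order: CRITICAL first, LOW last.
        for priority in Priority:
            if self.queues[priority]:
                return self.queues[priority].popleft()
        return None
```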
User Preference Service: before any channel worker dispatches a message, it reads the user's preferences from a fast-path cache (Redis), covering opted-in channels, do-not-disturb windows, language, and unsubscribe lists. If the user has opted out of marketing SMS, the SMS worker skips that message without calling the provider and records the decision in the audit log.
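The preference check might look like the sketch below. The `UserPrefs` shape and the decision strings are hypothetical. Two assumptions are built in and would need confirming for a real system: transactional messages bypass quiet hours, and non-transactional messages are deferred rather than dropped. The function accepts any `tzinfo`; a fixed offset is used here only to keep the example dependency-free, whereas production code would more likely pass a `zoneinfo.ZoneInfo` so that daylight saving time is handled.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta, timezone, tzinfo

# Fixed offset for illustration only; it ignores daylight saving time.
FIXED_UTC_MINUS_5 = timezone(timedelta(hours=-5))


@dataclass
class UserPrefs:
    tz: tzinfo = timezone.utc
    opted_in: set = field(default_factory=set)   # {(channel, category), ...}
    quiet_start: time = time(22, 0)
    quiet_end: time = time(7, 0)


def in_quiet_hours(prefs: UserPrefs, now_utc: datetime) -> bool:
    local = now_utc.astimezone(prefs.tz).time()
    start, end = prefs.quiet_start, prefs.quiet_end
    if start <= end:
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local >= start or local < end


def decide(prefs: UserPrefs, channel: str, category: str, now_utc: datetime) -> str:
    if (channel, category) not in prefs.opted_in:
        return "skip:opted_out"          # caller records this in the audit log
    if category != "transactional" and in_quiet_hours(prefs, now_utc):
        return "defer:quiet_hours"       # assumption: defer, do not drop
    return "send"
```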
Template Service: templates are stored versioned by template_id, with localization variants per locale. Workers resolve the template at send time, interpolate user data (name, order ID, amount), and hand the rendered content to the provider client. Keeping templates server-side avoids re-deployment for copy changes.
Provider abstraction with health-based routing: each channel has multiple provider adapters behind a thin router. If Twilio's error rate crosses a threshold (monitored by a circuit breaker), traffic shifts automatically to Vonage. For email, one possible split sends bulk mail through SES and transactional mail through SendGrid, whose deliverability dashboards help with debugging. Running more than one provider per channel means no single vendor outage stops delivery.
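The health-based routing described above can be sketched with a simple count-based circuit breaker. Everything here is an assumption for illustration: the window size, the 50% threshold, the 30-second cooldown, and the `(name, send_fn)` adapter shape. Real adapters would wrap the provider SDKs, and a production breaker would usually add a half-open state that admits only a few probe requests.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the failure ratio over the last `window` calls reaches
    `threshold`; lets traffic through again after `cooldown_s` seconds."""

    def __init__(self, window=20, threshold=0.5, cooldown_s=30.0):
        self.results = deque(maxlen=window)
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.opened_at = None

    def available(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: reset and let traffic probe the provider again.
            self.opened_at = None
            self.results.clear()
            return True
        return False

    def record(self, ok, now):
        self.results.append(ok)
        window_full = len(self.results) == self.results.maxlen
        failure_ratio = self.results.count(False) / len(self.results)
        if window_full and failure_ratio >= self.threshold:
            self.opened_at = now


class ProviderRouter:
    def __init__(self, providers):
        # providers: ordered list of (name, send_fn); earlier entries are preferred.
        self.providers = providers
        self.breakers = {name: CircuitBreaker() for name, _ in providers}

    def send(self, message, now=None):
        now = time.monotonic() if now is None else now
        for name, send_fn in self.providers:
            breaker = self.breakers[name]
            if not breaker.available(now):
                continue
            try:
                send_fn(message)
            except Exception:
                breaker.record(False, now)
                continue   # fall through to the next provider
            breaker.record(True, now)
            return name
        raise RuntimeError("no provider available")
```

Failing over inside a single `send` call keeps the caller simple, but a timeout on the first provider may hide a message that was actually accepted, so this pattern still relies on dedup at the egress boundary.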
Retry and dead-letter queues: transient provider failures trigger retries with exponential backoff (1s, 4s, 16s, up to N attempts). Messages that exhaust their retries land in a dead-letter-queue where an operator can inspect, requeue, or discard them. Each provider call carries the original notification ID as its idempotency key, so a retry does not double-send as long as the provider, or a send log on the worker side, deduplicates on that ID.
Rate limiting per user: a Redis token bucket enforces a cap of at most N notifications per user per hour across all channels combined, preventing notification fatigue. Marketing notifications can also be batched into digest emails when more than M events accumulate within a window.
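The per-user cap can be illustrated with an in-memory token bucket. This is a sketch, not the Redis implementation: a shared deployment would keep the token count and timestamp in Redis and update them atomically, for example in a Lua script. The class names and the default of 5 per hour are illustrative. A token bucket only approximates "N per hour", since a full bucket plus refill can allow slightly more than N within some 60-minute windows.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_s, now):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)   # start full
        self.updated = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class UserRateLimiter:
    """One bucket per user, shared across all channels."""

    def __init__(self, per_hour=5):
        self.per_hour = per_hour
        self.buckets = {}

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        bucket = self.buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self.per_hour, self.per_hour / 3600.0, now)
            self.buckets[user_id] = bucket
        return bucket.allow(now)
```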
Fan-out for marketing blasts: sending a campaign to 10 million users is a fundamentally different workload from transactional sends. A batch pipeline reads the audience segment from a data warehouse, partitions it into shards of ~10k users, and feeds each shard into the channel queues at a controlled rate. Pacing the shards keeps traffic under each provider's per-second limits instead of overwhelming its API.
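The sharding and pacing arithmetic can be sketched as below. The function names are hypothetical, and the 5,000 msgs/sec ceiling used in the usage note is an assumed provider limit, not a figure from any provider's documentation.

```python
from itertools import islice


def shards(user_ids, shard_size=10_000):
    """Yield successive lists of at most `shard_size` user IDs."""
    it = iter(user_ids)
    while True:
        batch = list(islice(it, shard_size))
        if not batch:
            return
        yield batch


def release_schedule(total_users, shard_size, max_msgs_per_s):
    """Return (shard_index, start_offset_seconds) pairs spaced so that the
    average enqueue rate stays at or below `max_msgs_per_s`."""
    seconds_per_shard = shard_size / max_msgs_per_s
    shard_count = -(-total_users // shard_size)   # ceiling division
    return [(i, i * seconds_per_shard) for i in range(shard_count)]
```

With 10 million users, 10k-user shards, and a 5,000 msgs/sec ceiling, `release_schedule(10_000_000, 10_000, 5_000)` yields 1,000 shards released 2 seconds apart, so the whole campaign enqueues in roughly 33 minutes.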
Delivery tracking and analytics: provider feedback (APNs and FCM send responses, Twilio delivery status callbacks, SES bounce, complaint, and open events) feeds back into a Kafka topic, which streams into ClickHouse for near-real-time dashboards. APNs does not push delivery receipts, so push delivery has to be inferred from the send response or from acknowledgements reported by the app. Bounces are written back to the User Preference Service to suppress dead addresses automatically. Open-rate and click-rate data drive A/B testing of templates.
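The feedback loop from tracking events to suppression and per-template metrics might be sketched like this. The event dictionary keys are hypothetical rather than any provider's actual payload format, and suppressing only hard bounces (not soft ones such as a full mailbox) is an assumption layered on the text.

```python
from collections import Counter, defaultdict


def process_events(events, suppressed):
    """Consume normalized tracking events.

    Each event is a dict with hypothetical keys: template_id, type, address,
    and optionally bounce_type. Adds hard-bounced addresses to `suppressed`
    and returns per-template counts by event type.
    """
    counts = defaultdict(Counter)
    for event in events:
        counts[event["template_id"]][event["type"]] += 1
        if event["type"] == "bounce" and event.get("bounce_type") == "hard":
            suppressed.add(event["address"])
    return counts


def open_rate(template_counts):
    """Opens divided by deliveries; 0.0 when nothing was delivered."""
    delivered = template_counts["delivered"]
    return template_counts["open"] / delivered if delivered else 0.0
```

Comparing `open_rate` across two template versions is the raw input for the A/B tests mentioned above; a real analysis would also check sample sizes before declaring a winner.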
A typical transactional message flows as follows: the calling service (say, the payments service) calls POST /notifications with idempotency_key = txn-{uuid}. The API validates, persists, and emits to Kafka in under 20ms. The email worker picks it up, checks user preferences (no quiet hours at this time), renders the template in the user's locale, calls SES, and marks the notification as dispatched in the DB within 500ms end-to-end.
Key trade-off, at-least-once vs. exactly-once: Kafka consumers typically run with at-least-once delivery, so a message can be processed more than once; the idempotency key at the provider call boundary is what elevates this to effectively exactly-once. Without the key, a retried worker would call Twilio twice and the user would receive a duplicate SMS. Always design the dedup at the provider egress boundary, not just at queue ingestion.
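One way to picture dedup at the egress boundary, combined with the 1s, 4s, 16s backoff schedule, is the sketch below. The names are hypothetical, and a dict stands in for a durable send log keyed by notification ID and channel. The sketch closes the redelivery case only: if the worker crashes after the provider accepts the message but before the log write, a duplicate is still possible unless the provider itself deduplicates on the ID.

```python
import time


class TransientError(Exception):
    """Raised by a provider client for retryable failures (assumed type)."""


def backoff_delays(attempts, base_s=1.0, factor=4.0):
    # attempts=4 gives waits of 1s, 4s, 16s between the four tries.
    return [base_s * factor ** i for i in range(attempts - 1)]


def deliver(notification_id, channel, call_provider, send_log, dead_letters,
            attempts=4, sleep=time.sleep):
    key = (notification_id, channel)
    if send_log.get(key) == "sent":
        # Redelivered queue message: the provider was already called.
        return "duplicate_skipped"
    delays = backoff_delays(attempts)
    for attempt in range(attempts):
        try:
            call_provider(notification_id)
        except TransientError:
            if attempt < len(delays):
                sleep(delays[attempt])
            continue
        send_log[key] = "sent"
        return "sent"
    dead_letters.append(key)   # retries exhausted: park for an operator
    return "dead_lettered"
```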
For scheduled notifications (e.g., "send at 9am in the user's local time zone"), a delay queue or a dedicated scheduler (Quartz) holds messages until their fire time, then injects them into the normal pipeline. If the delay queue lives in Cassandra, a poller has to scan time-bucketed rows for due messages, because TTL columns only expire data and do not trigger delivery. Compliance (CAN-SPAM, GDPR, TCPA) requires storing opt-in proof, honoring unsubscribes promptly (CAN-SPAM allows at most 10 business days), and maintaining suppression lists. Cost-aware routing for SMS is non-trivial: prices can vary by country by up to 100x, so the router should pick the cheapest compliant provider for each destination country. Add full observability: per-template delivery SLO dashboards, alerts on bounce-rate spikes, and per-channel p99 latency monitoring.
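For the scheduled case ("send at 9am in the user's local time zone"), the fire time has to be computed in the user's zone and stored in UTC. The helper below is a hypothetical sketch of that conversion. It accepts any `tzinfo`; with a `zoneinfo.ZoneInfo` it follows daylight saving rules, though local times that fall into a DST gap would need an explicit policy that is not covered here.

```python
from datetime import datetime, time, timedelta, timezone


def next_fire_time_utc(now_utc, user_tz, local_time=time(9, 0)):
    """Return the next occurrence of `local_time` in `user_tz`, as UTC."""
    local_now = now_utc.astimezone(user_tz)
    candidate = datetime.combine(local_now.date(), local_time, tzinfo=user_tz)
    if candidate <= local_now:
        # Today's slot has already passed in the user's zone: use tomorrow.
        candidate = datetime.combine(
            local_now.date() + timedelta(days=1), local_time, tzinfo=user_tz
        )
    return candidate.astimezone(timezone.utc)
```

For example, at 01:00 UTC a user at UTC+9 is already past 9am local time, so the next fire time is 00:00 UTC on the following day, while a user at UTC-5 gets 14:00 UTC the same day.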
Lead by naming the four channels and immediately explaining the queue-per-channel isolation model; this shows you understand that a single shared queue would let one channel's problems cascade to the others. Mention idempotency keys and dedup at the provider boundary: most candidates say 'use Kafka' but miss where exactly-once semantics actually need to be enforced.
Strong numbers to cite: 100k msgs/sec at peak for a large platform, and a per-user rate limit of no more than 5 notifications per hour to prevent fatigue. Don't forget quiet hours and compliance (TCPA, GDPR); interviewers at consumer companies often probe regulatory awareness.