Design an e-commerce platform like Amazon

Question

Accepted Answer

An e-commerce platform like Amazon spans an enormous surface area: product catalog, search, recommendations, cart, checkout, payment, inventory, fulfillment, reviews, and customer service, each with different scale characteristics, consistency requirements, and team ownership. The architecture must be microservice-based: coupling these domains in a monolith creates a deployment, reliability, and scaling bottleneck that becomes unsustainable within a few years of serious traffic growth.

Scale framing: Amazon handles millions of product SKUs, hundreds of millions of active customers, and peak traffic during events like Prime Day that is 10–20x baseline. Checkout must be ACID-correct on inventory and payment. Search and catalog are read-heavy and can be eventually consistent. Recommendation inference must be low-latency (under 50 ms). The architecture therefore uses different data stores, consistency models, and scaling approaches per service.

Product catalog service: manages products, categories, attributes, images, and base pricing. The data model is hierarchical (category → subcategory → product → SKU/variant). Storage is a relational database for structured attributes plus a document store (DynamoDB or MongoDB) for the schemaless per-category attribute bag (a couch has different attributes than a CPU). Reads are served through a CDN-cached REST API; product pages are nearly static, and cache-hit ratios above 95% are normal.

Search service: product names, descriptions, brand, category, and behavioral signals (click-through rate, conversion rate) are indexed in Elasticsearch or a custom inverted index. Queries support full-text search, faceted filtering (price range, brand, average rating, Prime eligibility), spell correction, and autocomplete. A learning-to-rank model trained on click and purchase signals re-ranks the Elasticsearch results; this is the "relevance" personalization layer.
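The re-ranking step in the search service can be illustrated with a minimal sketch. It assumes each candidate returned by the retrieval stage carries a text relevance score plus historical click-through and conversion rates, and it combines them with a hand-picked linear formula. The `Candidate` type, the field names, and the weights are all hypothetical; a real learning-to-rank model would learn its scoring function from click and purchase logs rather than use fixed weights.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    product_id: str
    text_score: float     # relevance score from the retrieval stage
    click_through: float  # historical click-through rate, in [0, 1]
    conversion: float     # historical conversion rate, in [0, 1]


# Hypothetical weights chosen only for illustration; a trained model
# would replace this fixed linear combination.
WEIGHTS = {"text_score": 1.0, "click_through": 2.0, "conversion": 4.0}


def rerank_score(c: Candidate) -> float:
    """Blend text relevance with behavioral signals into one score."""
    return (
        WEIGHTS["text_score"] * c.text_score
        + WEIGHTS["click_through"] * c.click_through
        + WEIGHTS["conversion"] * c.conversion
    )


def rerank(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates ordered by blended score, best first."""
    return sorted(candidates, key=rerank_score, reverse=True)


candidates = [
    Candidate("sku-a", text_score=0.9, click_through=0.02, conversion=0.01),
    Candidate("sku-b", text_score=0.7, click_through=0.10, conversion=0.08),
]
reranked = rerank(candidates)
```

In this toy input, `sku-b` has the weaker text match but stronger behavioral signals, so it moves ahead of `sku-a` after re-ranking.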
Recommendation service: "customers who bought X also bought Y" style recommendations are generated by offline collaborative filtering (matrix factorization, e.g. ALS) or deep learning models trained on the purchase and view event log. Recommendations are pre-computed per user and per item and stored in a key-value store. At request time the service looks up pre-computed scores (under 5 ms) rather than running inference live (hundreds of ms). Real-time contextual signals (session browsing) are fed into a lightweight online model to fine-tune the pre-computed recommendations.

Cart service: the shopping cart must be fast, durable, and device-independent. Carts for authenticated users are stored in DynamoDB keyed by user_id, enabling access from any device. A Redis layer caches the active cart for the current session. Guest carts are stored in a browser cookie or a short-TTL Redis entry keyed by a session token. On login, the guest cart is merged with the authenticated cart (union of items, keeping the higher quantity for duplicates).

Checkout saga and distributed transactions: checkout must atomically reserve inventory, validate the cart, calculate tax and shipping, authorize payment, persist the order, and trigger fulfillment. This spans several services. Rather than a distributed ACID transaction (which would require 2PC and couple all services), the checkout orchestrator implements a saga pattern with compensating actions: (1) reserve inventory with a TTL hold; (2) calculate total; (3) authorize payment; (4) commit inventory decrement; (5) create order record; (6) dispatch to fulfillment. If step 3 fails (payment declined), a compensating action releases the inventory hold from step 1. Each step is idempotent so the saga can be safely retried.

Inventory service: the source of truth for stock levels per SKU per warehouse. This service requires strong consistency to prevent overselling: concurrent checkouts for the last unit of a product must not both succeed.
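The overselling constraint can be made concrete with a minimal sketch of a version-checked (compare-and-swap) decrement. It assumes an in-memory stand-in for the inventory table; `InventoryStore`, `StockRow`, and the method names are hypothetical, and the lock only models the atomicity that a database gives a single-row update. In SQL the same idea is roughly an `UPDATE ... WHERE version = :expected AND quantity >= :amount` followed by a check of the affected row count.

```python
import threading
from dataclasses import dataclass


@dataclass
class StockRow:
    quantity: int
    version: int = 0


class InventoryStore:
    """In-memory stand-in for a table keyed by (sku_id, warehouse_id)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # models the DB's atomic row update

    def put(self, sku_id, warehouse_id, quantity):
        with self._lock:
            self._rows[(sku_id, warehouse_id)] = StockRow(quantity)

    def read(self, sku_id, warehouse_id):
        """Return a snapshot copy of the row, including its version."""
        with self._lock:
            row = self._rows[(sku_id, warehouse_id)]
            return StockRow(row.quantity, row.version)

    def compare_and_decrement(self, sku_id, warehouse_id,
                              expected_version, amount=1):
        """Decrement only if the row is unchanged since it was read
        and enough stock remains; otherwise report failure."""
        with self._lock:
            row = self._rows[(sku_id, warehouse_id)]
            if row.version != expected_version or row.quantity < amount:
                return False
            row.quantity -= amount
            row.version += 1
            return True


store = InventoryStore()
store.put("sku-1", "wh-1", quantity=1)

# Two checkouts read the same snapshot of the last unit...
snap_a = store.read("sku-1", "wh-1")
snap_b = store.read("sku-1", "wh-1")

# ...but only the first write still matches the stored version.
first = store.compare_and_decrement("sku-1", "wh-1", snap_a.version)
second = store.compare_and_decrement("sku-1", "wh-1", snap_b.version)
```

Here `first` succeeds and `second` is rejected, so the losing checkout has to re-read the row and discover that the stock is gone instead of overselling it.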
The inventory implementation uses PostgreSQL with row-level locking on the (sku_id, warehouse_id) row, or optimistic concurrency with a version token (compare-and-swap decrement). Write throughput is managed by sharding by sku_id. Flash-sale scenarios use a reservation queue (pre-issued tokens) to buffer burst demand without hammering the DB.

Payment service: integrates with payment processors (Stripe, Adyen, Braintree) and tokenizes card data when it is collected (the raw PAN never touches application servers). Payment intents follow an authorize-then-capture model: authorization reserves funds at checkout; capture occurs when the item ships. The service emits payment events to Kafka for accounting, fraud, and analytics. Idempotency keys on all payment API calls prevent double charges on retries.

Order management and fulfillment: the order state machine is placed → payment confirmed → warehouse picked → shipped → delivered → optionally returned. Each state transition is published as a Kafka event consumed by fulfillment, notifications, and analytics. The warehouse management system receives pick-and-pack instructions