Design an e-commerce platform like Amazon
An e-commerce platform like Amazon spans an enormous surface area — product catalog, search, recommendations, cart, checkout, payment, inventory, fulfillment, reviews, and customer service — each with different scale characteristics, consistency requirements, and team ownership. The architecture must be microservice-based: coupling these domains in a monolith creates a deployment, reliability, and scaling bottleneck that grows unsustainable within a few years of serious traffic growth.
Scale framing: Amazon handles millions of product SKUs, hundreds of millions of active customers, and peak traffic during events like Prime Day that is 10–20x baseline. Checkout must be ACID-correct on inventory and payment. Search and catalog are read-heavy and can be eventually consistent. Recommendation inference must be low-latency (under 50 ms). The architecture uses different data stores, consistency models, and scaling approaches per service.
Product catalog service: manages products, categories, attributes, images, and base pricing. The data model is hierarchical (category → subcategory → product → SKU/variant). Storage is a relational database for structured attributes plus a document store (DynamoDB or MongoDB) for the schemaless per-category attribute bag (a couch has different attributes than a CPU). Reads are served through a CDN-cached REST API; product pages are nearly static and cache-hit ratios above 95% are normal.
Search service: product names, descriptions, brand, category, and behavioral signals (click-through rate, conversion rate) are indexed in Elasticsearch or a custom inverted index. Queries support full-text search, faceted filtering (price range, brand, average rating, Prime eligibility), spell correction, and autocomplete. A learning-to-rank model trained on click and purchase signals re-ranks Elasticsearch results; this is the "relevance" personalization layer.
Recommendation service: "customers who bought X also bought Y" style recommendations are generated by offline collaborative filtering (matrix factorization via ALS) or deep learning models running on the purchase and view event log. Recommendations are pre-computed per user and per item and stored in a key-value store. At request time the service looks up pre-computed scores (under 5 ms) rather than running inference live (hundreds of ms). Real-time contextual signals (session browsing) are fed into a lightweight online model that adjusts the pre-computed results.
Cart service: the shopping cart must be fast, durable, and device-independent. Carts for authenticated users are stored in DynamoDB keyed by user_id, enabling access from any device. A Redis layer caches the active cart for the current session. Guest carts are stored in a browser cookie or a short-TTL Redis entry keyed by a session token. On user login, the guest cart is merged with the authenticated cart (union of items, keeping the higher quantity for duplicates).
Checkout saga and distributed transactions: checkout must, as one logical unit of work, reserve inventory, validate the cart, calculate tax and shipping, authorize payment, persist the order, and trigger fulfillment. This spans several services (inventory, pricing, payment, order management, fulfillment). Rather than a distributed ACID transaction (which would require 2PC and couple all services), the checkout orchestrator implements a saga pattern with compensating actions: (1) reserve inventory with a TTL hold; (2) calculate total; (3) authorize payment; (4) commit inventory decrement; (5) create order record; (6) dispatch to fulfillment. If step 3 fails (payment declined), a compensating action releases the inventory hold from step 1. If a step after authorization fails, the authorization is voided as well. Each step is idempotent so the saga can be safely retried.
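The saga steps above can be sketched as a small orchestrator that runs forward actions in order and, on failure, runs the compensations of the completed steps in reverse. This is a minimal in-memory sketch under stated assumptions: SagaStep, run_saga, and every stub function are hypothetical names, the stubs only record what happened, and the compensations for steps 3 and 4 (voiding the authorization, restoring stock) are illustrative choices rather than a prescribed design. Real steps would be idempotent calls to the inventory, payment, and order services.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]                       # forward action; raises on failure
    compensate: Optional[Callable[[dict], None]] = None  # undo action, if one exists


class SagaFailed(Exception):
    """Raised once compensations have run for a failed saga."""


def run_saga(steps: list[SagaStep], ctx: dict) -> None:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            for done in reversed(completed):
                if done.compensate is not None:
                    done.compensate(ctx)
            raise SagaFailed(f"{step.name} failed: {exc}") from exc


# Hypothetical stand-ins for service calls; they only append to a log.
def reserve_inventory(ctx): ctx["log"].append("hold placed")
def release_inventory(ctx): ctx["log"].append("hold released")
def calculate_total(ctx): ctx["total"] = 42.00
def authorize_payment(ctx):
    if ctx.get("card_declined"):
        raise RuntimeError("payment declined")
    ctx["log"].append("payment authorized")
def void_authorization(ctx): ctx["log"].append("authorization voided")
def commit_inventory(ctx): ctx["log"].append("inventory decremented")
def restore_inventory(ctx): ctx["log"].append("inventory restored")
def create_order(ctx): ctx["log"].append("order created")
def dispatch_fulfillment(ctx): ctx["log"].append("fulfillment dispatched")


CHECKOUT_STEPS = [
    SagaStep("reserve_inventory", reserve_inventory, release_inventory),
    SagaStep("calculate_total", calculate_total),
    SagaStep("authorize_payment", authorize_payment, void_authorization),
    SagaStep("commit_inventory", commit_inventory, restore_inventory),
    SagaStep("create_order", create_order),
    SagaStep("dispatch_fulfillment", dispatch_fulfillment),
]
```

Calling run_saga(CHECKOUT_STEPS, {"log": [], "card_declined": True}) raises SagaFailed after releasing the hold from step 1, which is the declined-payment path described above. Keeping the compensation next to its forward action makes it harder to add a step without deciding how to undo it.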
Inventory service: the source of truth for stock levels per SKU per warehouse. This service requires strong consistency to prevent overselling: concurrent checkouts for the last unit of a product must not both succeed. The implementation uses PostgreSQL with row-level locking on the (sku_id, warehouse_id) row, or optimistic concurrency with a version token (compare-and-swap decrement). Write throughput is managed by sharding by sku_id. Flash-sale scenarios use a reservation queue (pre-issued tokens) to buffer burst demand without hammering the DB.
Payment service: integrates with payment processors (Stripe, Adyen, Braintree) and tokenizes card data at the point of entry (the raw PAN never touches application servers). Payment intents follow an authorize-then-capture model: authorization reserves funds at checkout; capture occurs when the item ships. The service emits payment events to Kafka for accounting, fraud, and analytics. Idempotency keys on all payment API calls prevent double-charges on retries.
Order management and fulfillment: the order state machine is placed → payment confirmed → warehouse picked → shipped → delivered → optionally returned. Each state transition is an event in Kafka consumed by fulfillment, notifications, and analytics. The warehouse management system receives pick-and-pack instructions; a shipping carrier API (FedEx, UPS, USPS) is called for label generation and tracking numbers. Returns trigger a reverse-fulfillment flow and a conditional refund saga.
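The order lifecycle above can be modeled as an explicit transition table, so that an illegal jump (for example placed straight to shipped) is rejected before any event is emitted. This is a minimal sketch: the state names follow the text, while OrderState, ALLOWED, transition, and the "OrderStateChanged" event shape are hypothetical, and a real service would publish each event to Kafka rather than append it to a list.

```python
from enum import Enum


class OrderState(Enum):
    PLACED = "placed"
    PAYMENT_CONFIRMED = "payment_confirmed"
    WAREHOUSE_PICKED = "warehouse_picked"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    RETURNED = "returned"


# Each state maps to the set of states it may move to next.
ALLOWED = {
    OrderState.PLACED: {OrderState.PAYMENT_CONFIRMED},
    OrderState.PAYMENT_CONFIRMED: {OrderState.WAREHOUSE_PICKED},
    OrderState.WAREHOUSE_PICKED: {OrderState.SHIPPED},
    OrderState.SHIPPED: {OrderState.DELIVERED},
    OrderState.DELIVERED: {OrderState.RETURNED},  # returns are optional
    OrderState.RETURNED: set(),
}


class InvalidTransition(Exception):
    pass


def transition(current: OrderState, target: OrderState, events: list) -> OrderState:
    """Validate the move, record an event for downstream consumers, return the new state."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    events.append(
        {"type": "OrderStateChanged", "from": current.value, "to": target.value}
    )
    return target
```

Because the event is appended only after validation, consumers such as fulfillment and notifications never see a transition the state machine would have refused.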
The full checkout request flow: the user clicks "Place Order" → the checkout API validates the cart against current pricing → calls the inventory service to place holds on each SKU → calls the pricing service for tax and shipping → calls the payment service to authorize the card → if all succeed, commits the inventory decrements and creates the order record as the final saga steps → publishes an OrderCreated event → fulfillment picks up the event and initiates warehouse operations → customer receives a confirmation email within seconds.
Key trade-off — strong consistency for inventory vs eventual consistency everywhere else: the most important consistency decision is that inventory and payment demand strong consistency (row-level locking, synchronous quorum writes) while everything else — catalog, search, recommendations, reviews — is fine with eventual consistency and aggressive caching. Trying to apply strong consistency universally at Amazon's scale is not economically or technically feasible. The design explicitly identifies the narrow set of operations that require ACID (the inventory decrement and the payment charge) and applies the appropriate tools there only.
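The narrow strongly consistent operation, the inventory decrement, can be sketched as a compare-and-swap on a version token. This is an in-memory stand-in under stated assumptions: InventoryStore, StockRow, and decrement_stock are hypothetical names, and the lock only models the atomicity that a single conditional UPDATE guarded by the version column would provide in PostgreSQL.

```python
import threading
from dataclasses import dataclass


@dataclass
class StockRow:
    quantity: int
    version: int = 0


class InventoryStore:
    """In-memory stand-in for a table keyed by (sku_id, warehouse_id)."""

    def __init__(self):
        self._rows = {}
        # Models the atomicity of one conditional write in a real database.
        self._lock = threading.Lock()

    def put(self, key, quantity):
        self._rows[key] = StockRow(quantity)

    def read(self, key):
        with self._lock:
            row = self._rows[key]
            return row.quantity, row.version

    def compare_and_swap(self, key, expected_version, new_quantity):
        """Write only if no one has written since expected_version was read."""
        with self._lock:
            row = self._rows[key]
            if row.version != expected_version:
                return False
            row.quantity = new_quantity
            row.version += 1
            return True


def decrement_stock(store, key, amount=1, max_retries=5):
    """Return True if stock was decremented; False if out of stock or retries ran out."""
    for _ in range(max_retries):
        quantity, version = store.read(key)
        if quantity < amount:
            return False
        if store.compare_and_swap(key, version, quantity - amount):
            return True
    return False
```

When two checkouts race for the last unit, both may read the same version, but only one compare-and-swap succeeds; the loser re-reads, sees zero stock, and fails cleanly instead of overselling.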
Additional topics: CQRS (Command Query Responsibility Segregation) separates the write path (transactional, Postgres) from the read path (cached, replicated, Elasticsearch) for the catalog and inventory — reads query a denormalized read model populated by event-driven projections. Multi-region active-active requires careful routing: catalog and recommendations can serve from any region; inventory and payment must route to the owning region for a SKU to avoid cross-region coordination. Flash-sale engineering adds pre-warming of inventory caches, CDN-pinned product pages, and a virtual waiting-room queue. Review fraud detection runs an ML model over review text and reviewer behavior before publication.
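The event-driven projection behind the CQRS read model can be sketched as a consumer that folds events into a denormalized view. This is a minimal sketch with several assumptions: ProductReadModel, the event types "ProductUpdated" and "StockChanged", and the per-SKU sequence number seq are all hypothetical, chosen only to show how a read model can tolerate redelivered events.

```python
class ProductReadModel:
    """Denormalized per-SKU view, built only from events on the write side."""

    def __init__(self):
        self.views = {}         # sku_id -> view served to readers
        self.last_applied = {}  # sku_id -> seq of the last event applied

    def apply(self, event):
        sku = event["sku_id"]
        # Skip duplicates and stale events so redelivery is harmless.
        if event["seq"] <= self.last_applied.get(sku, -1):
            return
        view = self.views.setdefault(
            sku, {"title": None, "price": None, "in_stock": False}
        )
        if event["type"] == "ProductUpdated":
            view["title"] = event["title"]
            view["price"] = event["price"]
        elif event["type"] == "StockChanged":
            view["in_stock"] = event["quantity"] > 0
        self.last_applied[sku] = event["seq"]

    def get(self, sku):
        """Return a copy of the view, or None if no event has arrived yet."""
        view = self.views.get(sku)
        return dict(view) if view is not None else None
```

The read model lags the write path by however long events take to arrive, which is the eventual consistency that the catalog can tolerate and the checkout path cannot.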
Lead by naming all the major microservices and stating upfront that they have different consistency requirements — this immediately signals architectural maturity. Spend the most time on the checkout saga because it is the hardest and most commonly probed part: walk through the steps in order, name the compensating action for each, and explain why a saga is used instead of 2PC (availability and service decoupling).
Mentioning that inventory uses row-level locking or optimistic concurrency with a version token, while recommendations use pre-computed stores, is the detail that distinguishes a strong candidate.