Design a ride-sharing service like Uber or Lyft

Question

Accepted Answer

A ride-sharing service like Uber or Lyft must match riders to nearby drivers in real time, track both parties throughout a trip, and process payment upon completion — all at the scale of millions of concurrent drivers and tens of millions of daily trips worldwide. The three hardest problems are real-time geospatial indexing of continuously moving drivers, low-latency matching under bursty demand, and the distributed trip state machine that must remain consistent even if individual services fail mid-trip. Scale framing: Uber at peak processes tens of thousands of ride requests per second globally. Drivers send location updates every 3–5 seconds — millions of writes per second across the fleet. The match-to-driver round trip must complete in under 500 ms to feel instant to the rider. Surge pricing must react to supply/demand imbalances within seconds. The system is geographically partitioned by metro area, reducing coordination scope significantly. Driver location service — every active driver sends a (lat, lng, status, heading) update via WebSocket or HTTP/2 to a location service every 3–5 seconds. The service encodes each coordinate into a geohash , S2 cell , or H3 hexagon — all are hierarchical spatial indexing schemes that map a geographic region to a compact string. Driver locations are written to Redis using the GEOADD command ( Redis GEO stores coordinates as geohash-encoded sorted sets), enabling radius searches in O(log n). The last-known position per driver is also persisted to a Cassandra store for durability. Matching service — when a rider submits a request, the matching service queries Redis GEO (or the H3 cell index) for available drivers within a radius (e.g., 1 km, expanding if insufficient supply). Candidates are ranked by a scoring function combining ETA, driver rating, vehicle type, and a surge-adjusted willingness score. The top candidate receives a push offer; if declined or timed out (typically 10–15 seconds), the next candidate is tried. This cascade continues until a driver accepts or the request times out. The cascade logic runs in a stateless service, with the offer state held in Redis with a TTL. Trip state machine — a trip progresses through a well-defined sequence: requested → matched → accepted → en route to pickup → trip in progress → completed → rated . This state and every transition are persisted in a relational database ( PostgreSQL ) with ACID semantics, ensuring payment and dispatch integrity. Each state transition publishes an event to Kafka , which fans out to billing, surge pricing, notification, analytics, and safety services asynchronously. This event-driven architecture decouples downstream consumers from the core trip service. Surge pricing service — the city is divided into H3 hexagonal cells at a resolution of roughly 1–2 km per cell. The service continuously monitors the ratio of open rider requests to available drivers per cell. When demand exceeds supply by a configurable threshold, a surge multiplier (1.2×, 1.5×, 2.0×, etc.) is applied to fares in that cell. Surge prices are recalculated every 30–60 seconds and cached in Redis . The surge surface is displayed on the rider app as a heat map. Routing and ETA service — ETAs are computed on a precomputed road network graph using contraction hierarchies (a bidirectional Dijkstra variant optimized for time-dependent road networks). Live traffic data (from Google Maps, HERE, or the platform's own GPS traces) adjusts edge weights in real time. ETA is recalculated as the driver moves and pushed to the rider via WebSocket. Payment service — fare is calculated at trip end from distance, duration, surge multiplier, and any promotions. Payment is authorized against the rider's stored tokenized card via a payment gateway ( Stripe or Braintree ). Payout to the driver is batched (daily or weekly) via ACH or a similar bank transfer. A ledger service records every money movement for reconciliation and dispute resolution. Notification and presence — driver and rider apps maintain persistent WebSocket connections to a notification gateway. Location updates, status transitions, ETA recalculations, and chat messages are all pushed over this channel. The gateway is stateless; connection state is stored in Redis keyed by device_id . Request flow for a single ride: the rider submits a request → the API gateway authenticates and routes to the ride request service → the ride request service fetches the rider's location and calls the matching service → matching queries Redis GEO for nearby available drivers → the top-ranked driver receives a push offer via the notification gateway → driver accepts → the trip service persists the accepted state and publishes an event → Kafka consumers update surge metrics, trigger billing pre-auth, and send a confirmation push to the rider → driver and rider both receive live ETAs via WebSocket until the trip ends. Key trade-off — greedy nearest-driver matching vs global batch assignment: greedy matching