Design a ride-sharing service like Uber or Lyft
A ride-sharing service like Uber or Lyft must match riders to nearby drivers in real time, track both parties throughout a trip, and process payment upon completion — all at the scale of millions of concurrent drivers and tens of millions of daily trips worldwide. The three hardest problems are real-time geospatial indexing of continuously moving drivers, low-latency matching under bursty demand, and the distributed trip state machine that must remain consistent even if individual services fail mid-trip.
Scale framing: Uber at peak processes tens of thousands of ride requests per second globally. Drivers send location updates every 3–5 seconds — millions of writes per second across the fleet. The match-to-driver round trip must complete in under 500 ms to feel instant to the rider. Surge pricing must react to supply/demand imbalances within seconds. The system is geographically partitioned by metro area, reducing coordination scope significantly.
Driver location service — every active driver sends a
(lat, lng, status, heading)update via WebSocket orHTTP/2to a location service every 3–5 seconds. The service encodes each coordinate into a geohash, S2 cell, or H3 hexagon — all are hierarchical spatial indexing schemes that map a geographic region to a compact string. Driver locations are written toRedisusing the GEOADD command (Redis GEOstores coordinates as geohash-encoded sorted sets), enabling radius searches in O(log n). The last-known position per driver is also persisted to a Cassandra store for durability.Matching service — when a rider submits a request, the matching service queries
Redis GEO(or the H3 cell index) for available drivers within a radius (e.g., 1 km, expanding if insufficient supply). Candidates are ranked by a scoring function combining ETA, driver rating, vehicle type, and a surge-adjusted willingness score. The top candidate receives a push offer; if declined or timed out (typically 10–15 seconds), the next candidate is tried. This cascade continues until a driver accepts or the request times out. The cascade logic runs in a stateless service, with the offer state held inRediswith a TTL.Trip state machine — a trip progresses through a well-defined sequence: requested → matched → accepted → en route to pickup → trip in progress → completed → rated. This state and every transition are persisted in a relational database (
PostgreSQL) with ACID semantics, ensuring payment and dispatch integrity. Each state transition publishes an event toKafka, which fans out to billing, surge pricing, notification, analytics, and safety services asynchronously. This event-driven architecture decouples downstream consumers from the core trip service.Surge pricing service — the city is divided into H3 hexagonal cells at a resolution of roughly 1–2 km per cell. The service continuously monitors the ratio of open rider requests to available drivers per cell. When demand exceeds supply by a configurable threshold, a surge multiplier (1.2×, 1.5×, 2.0×, etc.) is applied to fares in that cell. Surge prices are recalculated every 30–60 seconds and cached in
Redis. The surge surface is displayed on the rider app as a heat map.Routing and ETA service — ETAs are computed on a precomputed road network graph using contraction hierarchies (a bidirectional Dijkstra variant optimized for time-dependent road networks). Live traffic data (from Google Maps, HERE, or the platform's own GPS traces) adjusts edge weights in real time. ETA is recalculated as the driver moves and pushed to the rider via WebSocket.
Payment service — fare is calculated at trip end from distance, duration, surge multiplier, and any promotions. Payment is authorized against the rider's stored tokenized card via a payment gateway (
StripeorBraintree). Payout to the driver is batched (daily or weekly) via ACH or a similar bank transfer. A ledger service records every money movement for reconciliation and dispute resolution.Notification and presence — driver and rider apps maintain persistent WebSocket connections to a notification gateway. Location updates, status transitions, ETA recalculations, and chat messages are all pushed over this channel. The gateway is stateless; connection state is stored in
Rediskeyed bydevice_id.
Request flow for a single ride: the rider submits a request → the API gateway authenticates and routes to the ride request service → the ride request service fetches the rider's location and calls the matching service → matching queries Redis GEO for nearby available drivers → the top-ranked driver receives a push offer via the notification gateway → driver accepts → the trip service persists the accepted state and publishes an event → Kafka consumers update surge metrics, trigger billing pre-auth, and send a confirmation push to the rider → driver and rider both receive live ETAs via WebSocket until the trip ends.
Key trade-off — greedy nearest-driver matching vs global batch assignment: greedy matching (dispatch the closest available driver instantly) is simple, fast, and easy to implement, but it is locally optimal, not globally optimal. During surge periods it leads to empty miles and suboptimal city-wide utilization. Batch matching (solve a bipartite assignment problem — Hungarian algorithm or auction algorithm — every 5 seconds over all open requests and available drivers) produces significantly better outcomes but adds latency to dispatch and requires more complex infrastructure. Uber moved toward batch matching with its H3-partitioned dispatch system; the right answer for an interview is to start with greedy and describe batch as the scaling upgrade.
Additional depth: geospatial cell granularity is a tuning parameter — smaller H3 cells give precise supply/demand signals but are more expensive to aggregate; larger cells smooth noise but may mis-price boundary zones. Multi-region deployment partitions the system by metro area (each city runs a mostly-independent stack) to reduce cross-region coordination. Fraud and safety services monitor trip anomalies (unexpected route deviation, sudden stop) in real time. Driver onboarding (background checks, document verification) and rating systems are separate services that feed signals into the matching ranking function.
Open with the core insight: driver location is a continuously moving write-heavy stream (millions of updates per second), while matching is a read-heavy geospatial lookup that must complete in under 500 ms. Bring up H3 or geohash early as the partitioning primitive.
Impress the interviewer by comparing greedy matching vs batch bipartite assignment and explaining why Uber moved toward batch — it is a real production decision with a measurable quality-of-service improvement. Mentioning the trip state machine as ACID (PostgreSQL) with async event fan-out to Kafka shows you understand where consistency is mandatory.