Design a distributed key-value store like DynamoDB or Redis

Question

Accepted Answer

A distributed key-value store like Amazon DynamoDB or Redis Cluster provides single-digit-millisecond reads and writes at arbitrarily large scale by distributing data across a cluster of commodity nodes. The four fundamental design axes are data partitioning (which node owns a key), replication (how many copies and where), consistency (what guarantees reads provide after a write), and failure handling (how the cluster recovers from node loss without losing data). Requirements and scale: functional requirements include get(key) , put(key, value) , delete(key) , optional TTLs, and optionally atomic compare-and-swap ( CAS ). Non-functional: p99 read/write latency under 10 ms, linear horizontal scalability, 99.999% availability, durability across full data-center failures, and predictable performance under hot keys or skewed access patterns. Consistent hashing for partitioning — keys are hashed (e.g., with SHA-256 or MurmurHash3 ) onto a logical ring of 2^32 positions. Each node owns a contiguous arc of the ring. A key is assigned to the first node clockwise from its hash position. Adding or removing a node only transfers the keys on the adjacent arc — no rehashing of the entire dataset. Virtual nodes (vnodes) — where each physical node owns 100–200 small, non-contiguous arcs — smooth out load imbalance and make rebalancing more granular when nodes join or leave. Replication — each key is stored on N replicas (typically N=3), placed on the next N distinct physical nodes clockwise from the primary. This means any two nodes can fail simultaneously without data loss. Writes are sent to all N replicas; reads can be served from any replica. The quorum rule R + W > N governs consistency: with N=3, W=2, R=2, any read will overlap with at least one node that received the last write — providing strong consistency. Lowering to W=1, R=1 maximizes availability at the cost of potentially stale reads. LSM-tree storage engine — the write path appends to a write-ahead log (WAL) on disk for crash durability, then updates an in-memory sorted structure (the memtable ). When the memtable exceeds a size threshold it is flushed as an immutable SSTable to disk. Background compaction merges SSTables, dropping deleted keys and reconciling overwrites. This gives high write throughput because every write is sequential. Reads consult the memtable first, then SSTables in recency order; a bloom filter per SSTable avoids reading files that cannot contain the key. Failure detection via gossip protocol — every node periodically exchanges cluster state (node list, liveness, version vectors) with a random subset of peers. Failures propagate through the cluster in O(log N) rounds without a central coordinator. A node that has not been heard from for a configurable timeout is marked suspect and then down after further silence. This eliminates single points of failure in cluster membership management. Hinted handoff for temporary failures — if the target replica for a write is temporarily unavailable, another node stores the write as a hint (the data plus metadata indicating its intended destination). When the down node recovers, the hinting node forwards the accumulated hints. This prevents write failures due to short outages while maintaining eventual consistency. Anti-entropy and read repair — replicas can drift apart due to network partitions or node restarts. Read repair fixes divergence opportunistically: during a read, if different replicas return different values, the coordinator writes the most recent version back to lagging replicas. Anti-entropy is a background process that compares replica state using Merkle trees — a hash tree where each leaf hashes a key range and each internal node hashes its children. Replicas exchange tree roots and walk down the tree to identify and transfer only the diverged key ranges, not the entire dataset. Conflict resolution — in eventual-consistency mode, concurrent writes to the same key can produce conflicting versions. Vector clocks track causality: if version A happened-before version B, B wins; if the versions are concurrent, both are kept as siblings and resolution is delegated to the application (or resolved automatically by last-write-wins using timestamps, which can silently lose updates). DynamoDB uses last-write-wins; Riak offered vector-clock-based siblings to the application. A write request flow: the client hashes the key to find the coordinator node (using a consistent-hash ring embedded in the client library or a proxy). The coordinator writes to itself and to the next N−1 nodes in the ring. It waits for W acknowledgements before returning success to the client. The write-ahead log on each replica ensures durability even if the process crashes immediately after acknowledging. Background compaction and anti-entropy keep the system clean over time. Key trade-off — CAP theorem and the consistency-availability dial: during a network partition, a distributed key-value store must choose between re