Design a search autocomplete / typeahead system

Question

Accepted Answer

A search autocomplete (typeahead) system returns ranked query completions for a user's prefix on every keystroke, with a strict budget of sub-100ms end-to-end latency . At Google or Amazon scale this means serving millions of QPS globally while refreshing suggestions from continuously streaming query logs. The hardest parts are the data structure choice, the offline aggregation pipeline, and the serving architecture that keeps the index fresh without sacrificing latency. Functional requirements : return the top-K completions (typically K=10) for any prefix the user has typed; rank by a blend of global popularity and personalization signals; support new trending queries flowing in with a lag of minutes to an hour; allow blocklist filtering for profanity or sensitive terms; optionally support fuzzy matching for misspellings. Non-functional : p99 latency under 100ms (the user types the next character in that time), millions of QPS at peak across all concurrent typing sessions, a fresh index that reflects trending queries within minutes, not days. Trie with pre-computed top-K per node — the canonical data structure is a trie (prefix tree) where each node stores the prefix and a fixed-size list of the top-K completions rooted at that subtree, pre-ranked by frequency. A query is then a single walk from the root to the prefix node, returning the cached top-K — no subtree traversal or scoring at query time. Pre-aggregation trades storage for query speed: each node's top-K is recomputed as the aggregation pipeline updates scores. Memory efficiency — FST/DAFSA encoding — a naive trie for millions of queries can consume tens of gigabytes. Representing it as a Finite State Transducer (FST) or a Directed Acyclic Word Graph (DAWG) shares suffixes across paths, reducing memory by 5–10x. This allows the entire serving trie to fit in RAM on a moderately sized node, eliminating disk I/O on the hot path. Sharding by prefix — for very large dictionaries or very high QPS, shard the trie by prefix range. A routing layer maps incoming prefixes to the appropriate shard: for example, prefixes a–m go to shard 1, n–z to shard 2. Replication within each shard gives read throughput and fault tolerance. Each shard is updated independently by the aggregation pipeline. Offline aggregation pipeline — query logs stream into Kafka , then a Flink or Spark Streaming job aggregates query counts over rolling windows (e.g., last 7 days with daily decay). Older queries fade via exponential decay so stale trends don't dominate. The job outputs a table of (prefix, top-K queries with scores) , which is snapshotted every few minutes and pushed to the serving tier. This pipeline is the freshness lever — shorter windows mean more real-time updates but noisier counts. Real-time overlay for trending queries — a separate in-memory layer absorbs queries that have spiked in the last few minutes (e.g., a breaking news event) and haven't propagated through the batch pipeline yet. At query time, the handler merges results from the snapshot trie and the real-time overlay, applying a recency boost to surface trending completions. Personalization — after fetching the global top-50 suggestions from the trie, a lightweight reranker on the edge injects per-user signals: recent search history, location, language, and click-through rates on past completions. This layer runs in under 5ms because it only reorders a small candidate set, not the full index. Client-side optimizations — the client debounces keystrokes with a 150–300ms delay before issuing a request, dramatically reducing QPS; this also allows showing cached results from a previous shorter prefix instantly. The browser caches responses keyed by prefix for the session, so backspacing reuses earlier results without a network round trip. Safety and blocklists — before any completion is published from the aggregation pipeline, it passes through a safety filter: profanity lists, NSFW classifiers, and manually curated blocklists. A compromised trending term can be hot-patched by adding it to the blocklist, which is distributed to serving nodes within seconds via a control-plane push. The serving path for a single keystroke: the client sends GET /autocomplete?q=coff ; the request is CDN-routed to a regional cluster, then sticky-routed to the shard responsible for the prefix coff ; the shard performs a single in-memory trie lookup in under 1ms; results are merged with the real-time overlay, filtered for blocked terms, reranked with personalization signals, and returned as JSON — total server time under 10ms, with CDN caching absorbing the most common prefixes entirely. Key trade-off — pre-aggregated snapshot vs. real-time freshness: Pre-computing top-K at every trie node makes queries trivially fast but rebuilding the trie is expensive (minutes for a large vocabulary). Increasing refresh frequency improves freshness but increases infra cost. The hybrid solution — a batch snapshot updated every few minutes combined wi