Design Google Search — web search engine

Question

Accepted Answer

Google Search — a web-scale search engine — must index hundreds of billions of pages, understand the semantic intent behind a query, and return the most relevant results in under 200 milliseconds. The system is architecturally decomposed into three distinct pillars: crawling (discovering and downloading the web), indexing (building a queryable representation of page content and authority), and query serving (executing queries and ranking results in real time). Scale framing: the crawlable web contains over a trillion unique URLs. Google processes over 8 billion queries per day. The index itself is sharded across hundreds of thousands of machines. Each query touches thousands of machines in parallel and must return within a strict latency budget. The system must simultaneously handle freshness (breaking news indexed within minutes) and the long tail (obscure pages refreshed monthly). Inverted index — the core data structure of any search engine. For every term in the vocabulary, the index stores a posting list : the ordered list of (doc_id, term_frequency, field_mask, position_list) entries for every document containing that term. A query for "distributed systems" retrieves the posting lists for both terms and intersects them — doc IDs present in both lists are candidates. Posting lists for common terms ( "the", "is" ) are millions of entries long; rare terms have short lists. Lists are compressed using delta encoding ( VByte or PForDelta ) to reduce I/O. Index sharding strategy — the index can be sharded in two ways. Document-partitioned sharding (used by Google): each shard holds a disjoint subset of the corpus and maintains its own complete inverted index over those documents. A query fans out to all shards in parallel (the scatter-gather pattern), each shard returns its local top-K results, and a merger aggregates the global top-K. This is preferable because it minimizes cross-shard coordination. The alternative — term-partitioned sharding (shard by term) — means each term lives on one machine but requires complex multi-machine coordination per query. PageRank and link graph authority — PageRank models the web as a directed graph and computes an iterative random-walk score for each page: a page is important if many important pages link to it. This is computed offline in batch (Pregel or Spark GraphX) over the entire link graph and stored as a per-doc signal. At query time, a document's PageRank is one input to the ranking function — high PageRank boosts authoritative content (Wikipedia, official docs) over thin affiliate pages. Ranking and scoring — the ranking function combines hundreds of signals: BM25/TF-IDF for lexical relevance, PageRank for authority, query-document semantic similarity from a neural model ( BERT -based encoders or a dual-encoder dense retrieval model), freshness (recent pages score higher for time-sensitive queries), user engagement signals (click-through rate, dwell time), and spam/quality classifiers. The first-pass ranking uses fast BM25 + PageRank on the posting-list intersection. A small set of top candidates (typically 100–1000) is re-ranked by the heavier neural model. Query processing pipeline — the query front-end receives a raw query string and runs: tokenization and normalization → spell correction (statistical n-gram language model or neural corrector) → synonym expansion and query rewriting → intent classification (navigational vs informational vs transactional) . The processed query is then dispatched to all index shards. At the merger, results are re-ranked, snippets are fetched from the document store, and vertical results (images, news, maps, shopping) are blended into the SERP. Document store and snippet generation — the inverted index stores only term presence and position, not the full text. The document store (a distributed key-value store keyed by doc_id ) holds the original HTML, extracted text, and page metadata. Snippet generation retrieves the relevant passage around query-term matches and highlights them. This is a separate I/O path from the index lookup and must also be fast — it is typically parallelized with the re-ranking step. Caching at every layer — popular queries are cached as complete result pages (a query result cache keyed by the normalized query string). Posting lists for high-frequency terms are pinned in RAM. Snippets for the top-10K most-queried documents are pre-computed. Aggressive caching means the vast majority of query traffic never touches the index compute at all. Tiered serving and freshness — not all of the web fits in the "hot" index tier that is queried for every request. A small high-quality index (the most authoritative ~10% of pages) serves most queries with low latency. Misses fall through to a deeper index tier. A separate real-time index (news, Twitter-scale freshness) is updated in near-real-time and merged with the base index result at query time. A query's full journey: the user's query string hits the search front-end