Design Google Search — web search engine
Google Search — a web-scale search engine — must index hundreds of billions of pages, understand the semantic intent behind a query, and return the most relevant results in under 200 milliseconds. The system is architecturally decomposed into three distinct pillars: crawling (discovering and downloading the web), indexing (building a queryable representation of page content and authority), and query serving (executing queries and ranking results in real time).
Scale framing: the crawlable web contains over a trillion unique URLs. Google processes over 8 billion queries per day. The index itself is sharded across hundreds of thousands of machines. Each query touches thousands of machines in parallel and must return within a strict latency budget. The system must simultaneously handle freshness (breaking news indexed within minutes) and the long tail (obscure pages refreshed monthly).
Inverted index — the core data structure of any search engine. For every term in the vocabulary, the index stores a posting list: the ordered list of
(doc_id, term_frequency, field_mask, position_list)entries for every document containing that term. A query for "distributed systems" retrieves the posting lists for both terms and intersects them — doc IDs present in both lists are candidates. Posting lists for common terms ("the", "is") are millions of entries long; rare terms have short lists. Lists are compressed using delta encoding (VByteorPForDelta) to reduce I/O.Index sharding strategy — the index can be sharded in two ways. Document-partitioned sharding (used by Google): each shard holds a disjoint subset of the corpus and maintains its own complete inverted index over those documents. A query fans out to all shards in parallel (the scatter-gather pattern), each shard returns its local top-K results, and a merger aggregates the global top-K. This is preferable because it minimizes cross-shard coordination. The alternative — term-partitioned sharding (shard by term) — means each term lives on one machine but requires complex multi-machine coordination per query.
PageRank and link graph authority — PageRank models the web as a directed graph and computes an iterative random-walk score for each page: a page is important if many important pages link to it. This is computed offline in batch (Pregel or Spark GraphX) over the entire link graph and stored as a per-doc signal. At query time, a document's PageRank is one input to the ranking function — high PageRank boosts authoritative content (Wikipedia, official docs) over thin affiliate pages.
Ranking and scoring — the ranking function combines hundreds of signals: BM25/TF-IDF for lexical relevance, PageRank for authority, query-document semantic similarity from a neural model (
BERT-based encoders or a dual-encoder dense retrieval model), freshness (recent pages score higher for time-sensitive queries), user engagement signals (click-through rate, dwell time), and spam/quality classifiers. The first-pass ranking uses fast BM25 + PageRank on the posting-list intersection. A small set of top candidates (typically 100–1000) is re-ranked by the heavier neural model.Query processing pipeline — the query front-end receives a raw query string and runs: tokenization and normalization → spell correction (statistical n-gram language model or neural corrector) → synonym expansion and query rewriting → intent classification (navigational vs informational vs transactional). The processed query is then dispatched to all index shards. At the merger, results are re-ranked, snippets are fetched from the document store, and vertical results (images, news, maps, shopping) are blended into the SERP.
Document store and snippet generation — the inverted index stores only term presence and position, not the full text. The document store (a distributed key-value store keyed by
doc_id) holds the original HTML, extracted text, and page metadata. Snippet generation retrieves the relevant passage around query-term matches and highlights them. This is a separate I/O path from the index lookup and must also be fast — it is typically parallelized with the re-ranking step.Caching at every layer — popular queries are cached as complete result pages (a query result cache keyed by the normalized query string). Posting lists for high-frequency terms are pinned in RAM. Snippets for the top-10K most-queried documents are pre-computed. Aggressive caching means the vast majority of query traffic never touches the index compute at all.
Tiered serving and freshness — not all of the web fits in the "hot" index tier that is queried for every request. A small high-quality index (the most authoritative ~10% of pages) serves most queries with low latency. Misses fall through to a deeper index tier. A separate real-time index (news, Twitter-scale freshness) is updated in near-real-time and merged with the base index result at query time.
A query's full journey: the user's query string hits the search front-end → spell correction and query understanding produce a normalized form → the processed query fans out in parallel to all document-partitioned index shards → each shard intersects posting lists, scores candidates with BM25 + PageRank, and returns its top-50 → the merger collects all per-shard top-50 lists, re-ranks with the neural model, fetches snippets from the document store, blends vertical results, and renders the SERP — all within 200 ms end-to-end.
Key trade-off — index freshness vs serving cost: rebuilding the entire index continuously is prohibitively expensive. The practical solution is a tiered freshness architecture: a base index rebuilt daily or weekly covers the authoritative long-tail; an incremental index updated hourly covers mainstream content; a real-time index updated in minutes or seconds covers news and breaking content. Query time merges results from all tiers. The trade-off is complexity at query merge time vs freshness guarantees — the design explicitly accepts that long-tail pages may be hours or days stale, which is acceptable because they change infrequently.
Additional depth to discuss: geographic replication of the serving stack (full index replica per major region) puts query latency within 30–50 ms of the user. Web spam detection (link-graph anomaly detection, content classifiers, manual review queues) runs as a pre-indexing filter and as a ranking penalty. Knowledge Graph integration adds structured-entity answers directly in the SERP, bypassing the document-ranking system for factual queries ("what is the capital of France"). Personalization applies user history and location signals at re-rank time without exposing the raw data to the ranking model.
Structure your answer around the three pillars: crawl, index, serve. The interviewer will probe each, so prepare depth on at least one — the inverted index and scatter-gather query execution are the most commonly asked.
Bring up document-partitioned vs term-partitioned sharding and explain why document-partitioned wins (each shard is self-contained, queries fan out to all shards in parallel). Mentioning the tiered freshness model (daily base index + real-time index merged at query time) is the detail that distinguishes a production-informed answer from a textbook answer.