Design a web crawler like Google's crawler
A web crawler at the scale of Google's Googlebot must discover, download, parse, and archive billions of web pages continuously — the crawlable web contains over a trillion unique URLs, and the freshest content changes within minutes of being published. The hardest engineering challenges are politeness (not overwhelming any individual site), deduplication (avoiding redundant work on near-identical pages), URL frontier management (choosing what to crawl next from an essentially infinite queue), and horizontal scalability across commodity hardware.
Scale and non-functional requirements: a Google-scale crawler targets millions of pages per second across the cluster, petabytes of HTML stored per day, and freshness ranging from minutes for major news outlets to months for static long-tail pages. The system must run 24/7, survive node failures without losing work, and respect every site's robots.txt directives absolutely.
URL Frontier — the heart of the crawler. The frontier is a large priority queue of unvisited URLs, partitioned by host so every URL for a given domain is assigned to the same worker. Priority is determined by a combination of PageRank (estimated authority), content freshness (how often the page has historically changed), and discovery recency. News sites are re-crawled every few minutes; static documentation pages monthly. The frontier is typically a distributed system combining in-memory heaps per shard with an overflow on disk or a key-value store like
RocksDB.Fetching layer — workers pull the next URL from their assigned domain queue, check the host's cached
robots.txtrules, enforce a per-domain rate limit (e.g., one request per second), then issue anHTTP/1.1orHTTP/2GET with a crawl-botUser-Agent. DNS results are aggressively cached (minutes-long TTL) to avoid repeated lookups. Connection pools are maintained per host to amortize TCP and TLS handshakes. The raw HTML response (with headers includingLast-ModifiedandETag) is streamed to object storage (S3) and a parse event is emitted to a queue.URL deduplication — before adding any extracted URL to the frontier, the system checks whether it has been crawled before. A bloom filter per shard provides O(1) membership testing with near-zero memory per URL and an acceptable false-positive rate (a small fraction of new URLs are incorrectly treated as seen). A complementary exact-match set in a key-value store (
RedisorCassandra) handles confirmed duplicates. URL normalization is applied first: lowercase scheme and host, remove fragment identifiers, resolve relative paths, canonicalize query-string parameter order — this eliminates a large class of synthetic duplicates before dedup even runs.Content-level deduplication — the same content is often served at multiple URLs (mirrors, syndication, pagination). SimHash (a locality-sensitive hash) computes a 64-bit fingerprint of the page body; pages with fingerprints within a Hamming distance of 3 are treated as near-duplicates and the lower-priority copy is discarded. MinHash can also estimate Jaccard similarity across large text corpora. Only canonical, unique pages enter the indexing pipeline.
Politeness enforcement — assigning all URLs for a domain to a single worker naturally serializes requests per domain. The worker maintains a per-host crawl delay (from
robots.txtCrawl-delaydirective or a default of 1 second). Therobots.txtfile for each domain is fetched once per worker startup and cached with a TTL (typically 24 hours). Hosts that are slow or returning errors trigger exponential backoff. Depth limits and per-host URL caps prevent spider traps (dynamically generated infinite URLs like calendar pages with infinite date parameters).JavaScript rendering — a growing fraction of the web requires JavaScript execution to reveal content. A headless browser farm (Puppeteer/Chromium) handles these pages. This is orders of magnitude more expensive than plain HTTP fetching — roughly 10–50x the CPU and memory — so it is applied selectively to high-value domains and pages whose fetch fingerprint suggests client-side rendering.
Geographically distributed crawlers — deploying crawlers in multiple regions reduces round-trip time to international sites and provides failover redundancy. Crawler output (raw HTML, extracted links) is unified in a centralized object store and feed into the downstream indexing pipeline (inverted index construction, link graph computation).
The end-to-end flow for a single URL: a worker dequeues it from the domain frontier, checks the host's robots.txt cache, waits for the per-host rate-limit window, issues the GET, receives HTML, stores raw content to S3, parses the HTML (link extraction, metadata, main-content extraction), normalizes and deduplicates extracted URLs, enqueues new URLs into the frontier with appropriate priority, and emits a parsed-content event downstream. The loop repeats continuously across tens of thousands of workers.
Key insight — partition the frontier by domain, not by URL hash: partitioning by domain (rather than by URL hash or randomly) is the single most important architectural decision in a polite crawler. It ensures one worker owns all requests to one domain, making rate limiting and robots.txt enforcement trivially local rather than requiring distributed coordination. The trade-off is potential load skew (a worker assigned to a large domain like Wikipedia becomes a hotspot), which is mitigated by splitting high-volume domains across multiple workers while keeping per-worker rate limits intact.
Additional depth to cover in an interview: re-crawl scheduling can use a predicted change frequency model (pages that change often get shorter TTLs); ETag and If-None-Match headers allow conditional GETs that return 304 Not Modified without re-transmitting the body — a major bandwidth saving. Failure handling uses a dead-letter queue for persistently failing URLs. The crawler's output feeds an indexing pipeline that is a separate system design problem (inverted index sharding, ranking signals) covered in the Google Search design.
Open by stating the two-part challenge: throughput (millions of pages per second) and politeness (never hammering a host). Immediately bring up partitioning the URL frontier by domain as the key architectural decision that enables both.
Interviewers are impressed when candidates quantify the bloom-filter trade-off (false-positive rate of 1% is fine because you pay O(1) memory per URL) and explain why JavaScript rendering is selectively applied rather than universally — it is a cost and latency decision, not a capability gap.