Design a web crawler like Google's crawler

Question

Accepted Answer

A web crawler at the scale of Google's Googlebot must discover, download, parse, and archive billions of web pages continuously — the crawlable web contains over a trillion unique URLs, and the freshest content changes within minutes of being published. The hardest engineering challenges are politeness (not overwhelming any individual site), deduplication (avoiding redundant work on near-identical pages), URL frontier management (choosing what to crawl next from an essentially infinite queue), and horizontal scalability across commodity hardware. Scale and non-functional requirements: a Google-scale crawler targets millions of pages per second across the cluster, petabytes of HTML stored per day, and freshness ranging from minutes for major news outlets to months for static long-tail pages. The system must run 24/7, survive node failures without losing work, and respect every site's robots.txt directives absolutely. URL Frontier — the heart of the crawler. The frontier is a large priority queue of unvisited URLs, partitioned by host so every URL for a given domain is assigned to the same worker. Priority is determined by a combination of PageRank (estimated authority), content freshness (how often the page has historically changed), and discovery recency . News sites are re-crawled every few minutes; static documentation pages monthly. The frontier is typically a distributed system combining in-memory heaps per shard with an overflow on disk or a key-value store like RocksDB . Fetching layer — workers pull the next URL from their assigned domain queue, check the host's cached robots.txt rules, enforce a per-domain rate limit (e.g., one request per second), then issue an HTTP/1.1 or HTTP/2 GET with a crawl-bot User-Agent . DNS results are aggressively cached (minutes-long TTL) to avoid repeated lookups. Connection pools are maintained per host to amortize TCP and TLS handshakes. The raw HTML response (with headers including Last-Modified and ETag ) is streamed to object storage ( S3 ) and a parse event is emitted to a queue. URL deduplication — before adding any extracted URL to the frontier, the system checks whether it has been crawled before. A bloom filter per shard provides O(1) membership testing with near-zero memory per URL and an acceptable false-positive rate (a small fraction of new URLs are incorrectly treated as seen). A complementary exact-match set in a key-value store ( Redis or Cassandra ) handles confirmed duplicates. URL normalization is applied first: lowercase scheme and host, remove fragment identifiers, resolve relative paths, canonicalize query-string parameter order — this eliminates a large class of synthetic duplicates before dedup even runs. Content-level deduplication — the same content is often served at multiple URLs (mirrors, syndication, pagination). SimHash (a locality-sensitive hash) computes a 64-bit fingerprint of the page body; pages with fingerprints within a Hamming distance of 3 are treated as near-duplicates and the lower-priority copy is discarded. MinHash can also estimate Jaccard similarity across large text corpora. Only canonical, unique pages enter the indexing pipeline. Politeness enforcement — assigning all URLs for a domain to a single worker naturally serializes requests per domain. The worker maintains a per-host crawl delay (from robots.txt Crawl-delay directive or a default of 1 second). The robots.txt file for each domain is fetched once per worker startup and cached with a TTL (typically 24 hours). Hosts that are slow or returning errors trigger exponential backoff . Depth limits and per-host URL caps prevent spider traps (dynamically generated infinite URLs like calendar pages with infinite date parameters). JavaScript rendering — a growing fraction of the web requires JavaScript execution to reveal content. A headless browser farm (Puppeteer/Chromium) handles these pages. This is orders of magnitude more expensive than plain HTTP fetching — roughly 10–50x the CPU and memory — so it is applied selectively to high-value domains and pages whose fetch fingerprint suggests client-side rendering. Geographically distributed crawlers — deploying crawlers in multiple regions reduces round-trip time to international sites and provides failover redundancy. Crawler output (raw HTML, extracted links) is unified in a centralized object store and feed into the downstream indexing pipeline (inverted index construction, link graph computation). The end-to-end flow for a single URL: a worker dequeues it from the domain frontier, checks the host's robots.txt cache, waits for the per-host rate-limit window, issues the GET, receives HTML, stores raw content to S3, parses the HTML (link extraction, metadata, main-content extraction), normalizes and deduplicates extracted URLs, enqueues new URLs into the frontier with appropriate priority, and emits a parsed-content event downstream. The loop repeats continuously across tens of thousands of workers. Key insight — partition