Sprinter: Speeding Up High-Fidelity Crawling of the Modern Web

Ayush Goel, Jingyuan Zhu, Ravi Netravali, Harsha V. Madhyastha

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Crawling the web at scale forms the basis of many important systems: web search engines, smart assistants, generative AI, web archives, and so on. Yet, the research community has paid little attention to this workload in the last decade. In this paper, we highlight the need to revisit the notion that web crawling is a solved problem. Specifically, to discover and fetch all page resources dependent on JavaScript and modern web APIs, crawlers today have to employ compute-intensive web browsers. This significantly inflates the scale of the infrastructure necessary to crawl pages at high throughput. To make web crawling more efficient without any loss of fidelity, we present Sprinter, which combines browser-based and browserless crawling to get the best of both. The key to Sprinter’s design is our observation that crawling workloads typically include many pages per site and, unlike in traditional user-facing page loads, there is significant potential to reuse client-side computations across pages. Taking advantage of this property, Sprinter crawls a small, carefully chosen, subset of pages on each site using a browser, and then efficiently identifies and exploits opportunities to reuse the browser’s computations on other pages. Sprinter was able to crawl a corpus of 50,000 pages 5x faster than browser-based crawling, while still closely matching a browser in the set of resources fetched.

Original languageEnglish (US)
Title of host publicationProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
PublisherUSENIX Association
Pages893-906
Number of pages14
ISBN (Electronic)9781939133397
StatePublished - 2024
Event21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 - Santa Clara, United States
Duration: Apr 16 2024Apr 18 2024

Publication series

NameProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024

Conference

Conference21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
Country/TerritoryUnited States
CitySanta Clara
Period4/16/244/18/24

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Control and Systems Engineering

Fingerprint

Dive into the research topics of 'Sprinter: Speeding Up High-Fidelity Crawling of the Modern Web'. Together they form a unique fingerprint.

Cite this