TY - GEN
T1 - Sprinter
T2 - 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
AU - Goel, Ayush
AU - Zhu, Jingyuan
AU - Netravali, Ravi
AU - Madhyastha, Harsha V.
N1 - Publisher Copyright:
© 2024 Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Crawling the web at scale forms the basis of many important systems: web search engines, smart assistants, generative AI, web archives, and so on. Yet, the research community has paid little attention to this workload in the last decade. In this paper, we highlight the need to revisit the notion that web crawling is a solved problem. Specifically, to discover and fetch all page resources dependent on JavaScript and modern web APIs, crawlers today have to employ compute-intensive web browsers. This significantly inflates the scale of the infrastructure necessary to crawl pages at high throughput. To make web crawling more efficient without any loss of fidelity, we present Sprinter, which combines browser-based and browserless crawling to get the best of both. The key to Sprinter’s design is our observation that crawling workloads typically include many pages per site and, unlike in traditional user-facing page loads, there is significant potential to reuse client-side computations across pages. Taking advantage of this property, Sprinter crawls a small, carefully chosen, subset of pages on each site using a browser, and then efficiently identifies and exploits opportunities to reuse the browser’s computations on other pages. Sprinter was able to crawl a corpus of 50,000 pages 5x faster than browser-based crawling, while still closely matching a browser in the set of resources fetched.
AB - Crawling the web at scale forms the basis of many important systems: web search engines, smart assistants, generative AI, web archives, and so on. Yet, the research community has paid little attention to this workload in the last decade. In this paper, we highlight the need to revisit the notion that web crawling is a solved problem. Specifically, to discover and fetch all page resources dependent on JavaScript and modern web APIs, crawlers today have to employ compute-intensive web browsers. This significantly inflates the scale of the infrastructure necessary to crawl pages at high throughput. To make web crawling more efficient without any loss of fidelity, we present Sprinter, which combines browser-based and browserless crawling to get the best of both. The key to Sprinter’s design is our observation that crawling workloads typically include many pages per site and, unlike in traditional user-facing page loads, there is significant potential to reuse client-side computations across pages. Taking advantage of this property, Sprinter crawls a small, carefully chosen, subset of pages on each site using a browser, and then efficiently identifies and exploits opportunities to reuse the browser’s computations on other pages. Sprinter was able to crawl a corpus of 50,000 pages 5x faster than browser-based crawling, while still closely matching a browser in the set of resources fetched.
UR - http://www.scopus.com/inward/record.url?scp=85194166875&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85194166875&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85194166875
T3 - Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
SP - 893
EP - 906
BT - Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024
PB - USENIX Association
Y2 - 16 April 2024 through 18 April 2024
ER -