Jawa: Web Archival in the Era of JavaScript

Ayush Goel, Jingyuan Zhu, Ravi Netravali, Harsha V. Madhyastha

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

By repeatedly crawling and saving web pages over time, web archives (such as the Internet Archive) enable users to visit historical versions of any page. In this paper, we point out that existing web archives are not well designed to cope with the widespread presence of JavaScript on the web. Some archives store petabytes of JavaScript code, and yet many pages render incorrectly when users load them. Other archives which store the end-state of page loads (e.g., screen captures) break post-load interactions implemented in JavaScript. To address these problems, we present Jawa, a new design for web archives which significantly reduces the storage necessary to save modern web pages while also improving the fidelity with which archived pages are served. Key to enabling Jawa's use at scale are our observations on a) the forms of non-determinism which impair the execution of JavaScript on archived pages, and b) the ways in which JavaScript's execution fundamentally differs between live web pages and their archived copies. On a corpus of 1 million archived pages, Jawa reduces overall storage needs by 41%, when compared to the techniques currently used by the Internet Archive.

Original languageEnglish (US)
Title of host publicationProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
PublisherUSENIX Association
Pages805-820
Number of pages16
ISBN (Electronic)9781939133281
StatePublished - 2022
Event16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022 - Carlsbad, United States
Duration: Jul 11 2022Jul 13 2022

Publication series

NameProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022

Conference

Conference16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
Country/TerritoryUnited States
CityCarlsbad
Period7/11/227/13/22

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems

Fingerprint

Dive into the research topics of 'Jawa: Web Archival in the Era of JavaScript'. Together they form a unique fingerprint.

Cite this