TY - GEN
T1 - Jawa
T2 - 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
AU - Goel, Ayush
AU - Zhu, Jingyuan
AU - Netravali, Ravi
AU - Madhyastha, Harsha V.
N1 - Publisher Copyright:
© 2022 by The USENIX Association. All rights reserved.
PY - 2022
Y1 - 2022
N2 - By repeatedly crawling and saving web pages over time, web archives (such as the Internet Archive) enable users to visit historical versions of any page. In this paper, we point out that existing web archives are not well designed to cope with the widespread presence of JavaScript on the web. Some archives store petabytes of JavaScript code, and yet many pages render incorrectly when users load them. Other archives which store the end-state of page loads (e.g., screen captures) break post-load interactions implemented in JavaScript. To address these problems, we present Jawa, a new design for web archives which significantly reduces the storage necessary to save modern web pages while also improving the fidelity with which archived pages are served. Key to enabling Jawa's use at scale are our observations on a) the forms of non-determinism which impair the execution of JavaScript on archived pages, and b) the ways in which JavaScript's execution fundamentally differs between live web pages and their archived copies. On a corpus of 1 million archived pages, Jawa reduces overall storage needs by 41%, when compared to the techniques currently used by the Internet Archive.
AB - By repeatedly crawling and saving web pages over time, web archives (such as the Internet Archive) enable users to visit historical versions of any page. In this paper, we point out that existing web archives are not well designed to cope with the widespread presence of JavaScript on the web. Some archives store petabytes of JavaScript code, and yet many pages render incorrectly when users load them. Other archives which store the end-state of page loads (e.g., screen captures) break post-load interactions implemented in JavaScript. To address these problems, we present Jawa, a new design for web archives which significantly reduces the storage necessary to save modern web pages while also improving the fidelity with which archived pages are served. Key to enabling Jawa's use at scale are our observations on a) the forms of non-determinism which impair the execution of JavaScript on archived pages, and b) the ways in which JavaScript's execution fundamentally differs between live web pages and their archived copies. On a corpus of 1 million archived pages, Jawa reduces overall storage needs by 41%, when compared to the techniques currently used by the Internet Archive.
UR - http://www.scopus.com/inward/record.url?scp=85141032162&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141032162&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85141032162
T3 - Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
SP - 805
EP - 820
BT - Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
PB - USENIX Association
Y2 - 11 July 2022 through 13 July 2022
ER -