Speculative Recovery: Cheap, Highly Available Fault Tolerance with Disaggregated Storage

Nanqinqin Li, Anja Kalaba, Michael J. Freedman, Wyatt Lloyd, Amit Levy

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Scopus citations

Abstract

The ubiquity of disaggregated storage in cloud computing has led to a nascent technique for fault tolerance: instead of utilizing application-level replication, newly-launched backup instances recover application state from disaggregated storage (REDS) after a primary's failure. Attractively, REDS provides fault tolerance at a much lower cost than traditional replication schemes, wherein at least two instances are running. Failover in REDS is slow, however, because it sequentially first detects primary failure and only then starts recovery on a backup. We propose speculative recovery to accelerate failover and thus increase the availability of applications using REDS. Instead of proceeding with failover sequentially, speculative recovery safely and efficiently parallelizes detecting primary failure and running recovery on a backup, by employing our new super and collapse primitives for disaggregated storage. Our implementation and evaluation of speculative recovery demonstrate that it considerably reduces failover time.
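The contrast the abstract draws between sequential and parallel failover can be illustrated with a small timing model. This is only a sketch under idealized assumptions, not the authors' implementation: the function names, the `commit_s` overhead parameter, and the sample durations are all hypothetical, and the model says nothing about how the super and collapse primitives actually work.

```python
def sequential_failover(detect_s: float, recover_s: float) -> float:
    """Failover time when recovery starts only after failure detection ends."""
    return detect_s + recover_s


def speculative_failover(detect_s: float, recover_s: float,
                         commit_s: float = 0.0) -> float:
    """Idealized failover time when detection and recovery run in parallel.

    commit_s is a hypothetical fixed cost for finalizing the speculative
    result once the primary's failure is confirmed (assumption, not from
    the paper).
    """
    return max(detect_s, recover_s) + commit_s


# Illustrative durations in seconds; these are made-up inputs, not measurements.
detect_s, recover_s = 10.0, 30.0
print(sequential_failover(detect_s, recover_s))
print(speculative_failover(detect_s, recover_s))
```

In this model the parallel approach is bounded by the slower of the two phases rather than their sum, which is the intuition behind the abstract's claim of reduced failover time.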

Original language: English (US)
Title of host publication: Proceedings of the 2022 USENIX Annual Technical Conference, ATC 2022
Publisher: USENIX Association
Pages: 271-286
Number of pages: 16
ISBN (Electronic): 9781939133298
State: Published - 2022
Event: 2022 USENIX Annual Technical Conference, ATC 2022 - Carlsbad, United States
Duration: Jul 11 2022 to Jul 13 2022

Publication series

Name: Proceedings of the 2022 USENIX Annual Technical Conference, ATC 2022

Conference

Conference: 2022 USENIX Annual Technical Conference, ATC 2022
Country/Territory: United States
City: Carlsbad
Period: 7/11/22 to 7/13/22

All Science Journal Classification (ASJC) codes

  • General Computer Science
