Who’s afraid of uncorrectable bit errors? Online recovery of flash errors with distributed redundancy

Amy Tai, Andrew Kryczka, Shobhit O. Kanaujia, Kyle Jamieson, Michael J. Freedman, Asaf Cidon

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

18 Scopus citations

Abstract

Due to its high performance and decreasing cost per bit, flash storage is the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. In this paper, we present an approach for addressing the flash lifetime problem by allowing devices to operate at much higher bit error rates. We present DIRECT, a set of techniques that harnesses distributed-level redundancy to enable the adoption of new generations of denser and less reliable flash storage technologies. DIRECT does so by using an end-to-end approach to increase the reliability of distributed storage systems. We implemented DIRECT on two real-world storage systems: ZippyDB, a distributed key-value store in production at Facebook that is backed by and supports transactions on top of RocksDB, and HDFS, a distributed file system. When tested on production traces at Facebook, DIRECT reduces application-visible error rates in ZippyDB by more than 100X and recovery time by more than 10,000X. DIRECT also allows HDFS to tolerate a 10,000–100,000X higher bit error rate without experiencing application-visible errors. By significantly increasing the availability of distributed storage systems in the face of bit errors, DIRECT helps extend flash lifetimes.

Original language: English (US)
Title of host publication: Proceedings of the 2019 USENIX Annual Technical Conference, USENIX ATC 2019
Publisher: USENIX Association
Pages: 977-991
Number of pages: 15
ISBN (Electronic): 9781939133038
State: Published - Jan 1 2019
Event: 2019 USENIX Annual Technical Conference, USENIX ATC 2019 - Renton, United States
Duration: Jul 10 2019 → Jul 12 2019

Publication series

Name: Proceedings of the 2019 USENIX Annual Technical Conference, USENIX ATC 2019

Conference

Conference: 2019 USENIX Annual Technical Conference, USENIX ATC 2019
Country/Territory: United States
City: Renton
Period: 7/10/19 → 7/12/19

All Science Journal Classification (ASJC) codes

  • General Computer Science
