Runtime asynchronous fault tolerance via speculation

Yun Zhang, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, David I. August

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

25 Scopus citations

Abstract

Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.

Original language: English (US)
Title of host publication: Proceedings - International Symposium on Code Generation and Optimization, CGO 2012
Pages: 145-154
Number of pages: 10
State: Published - 2012
Event: 10th International Symposium on Code Generation and Optimization, CGO 2012 - San Jose, CA, United States
Duration: Mar 31 2012 - Apr 4 2012

Publication series

Name: Proceedings - International Symposium on Code Generation and Optimization, CGO 2012

Other

Other: 10th International Symposium on Code Generation and Optimization, CGO 2012
Country/Territory: United States
City: San Jose, CA
Period: 3/31/12 - 4/4/12

All Science Journal Classification (ASJC) codes

  • Software
