DAFT: Decoupled acyclic fault tolerance

Yun Zhang, Jae W. Lee, Nick P. Johnson, David I. August

Research output: Contribution to journal › Article › peer-review

27 Scopus citations

Abstract

Higher transistor counts, lower voltage levels, and reduced noise margin increase the susceptibility of multicore processors to transient faults. Redundant hardware modules can detect such faults, but software techniques are more appealing for their low cost and flexibility. Recent software proposals have not achieved widespread acceptance because they either increase register pressure, double memory usage, or are too slow in the absence of hardware extensions. This paper presents DAFT, a fast, safe, and memory-efficient transient fault detection framework for commodity multicore systems. DAFT replicates computation across multiple cores and schedules fault detection off the critical path. Where possible, values are speculated to be correct and only communicated to the redundant thread at essential program points. DAFT is implemented in the LLVM compiler framework and evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a commodity multicore system. Evaluation results demonstrate that speculation allows DAFT to improve the performance of software redundant multithreading by 2.17× with no degradation of fault coverage.
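The decoupled, one-way structure described in the abstract can be illustrated with a small sketch. This is not DAFT's implementation, which is a compiler transformation in LLVM; it is a toy Python model under stated assumptions. The function name `run_redundant`, the squaring workload, and the `fault_at` bit-flip injection are all hypothetical and chosen only for illustration. Python threads also do not execute in parallel on separate cores, so the sketch shows the communication pattern, not the performance behavior.

```python
import queue
import threading


def run_redundant(inputs, fault_at=None):
    """Toy model of decoupled redundant execution (hypothetical, not DAFT itself).

    A leading thread computes results and forwards them through a one-way
    channel. A trailing thread recomputes each result and compares. The
    leading thread never waits for a comparison, so checking stays off its
    critical path. `fault_at` optionally injects a single-bit error into the
    leading thread's result at that index to mimic a transient fault.
    """
    channel = queue.Queue()  # one-way: leading -> trailing (acyclic)
    mismatches = []          # indices where the two copies disagreed

    def leading():
        for i, x in enumerate(inputs):
            y = x * x
            if i == fault_at:
                y ^= 1  # flip the lowest bit: simulated transient fault
            channel.put((i, y))  # send and continue without waiting
        channel.put(None)  # end-of-stream marker

    def trailing():
        while True:
            item = channel.get()
            if item is None:
                break
            i, y = item
            if inputs[i] * inputs[i] != y:  # redundant recomputation
                mismatches.append(i)

    threads = [threading.Thread(target=leading), threading.Thread(target=trailing)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mismatches
```

Calling `run_redundant([1, 2, 3])` reports no mismatches, while `run_redundant([1, 2, 3], fault_at=1)` reports index 1. Because data flows only from the leading thread to the trailing thread, the leading thread's progress does not depend on how quickly the checks complete.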

Original language: English (US)
Pages (from-to): 118-140
Number of pages: 23
Journal: International Journal of Parallel Programming
Volume: 40
Issue number: 1
State: Published - Feb 2012

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Information Systems

Keywords

  • Compiler
  • Fault tolerance
  • Speculation
