DAFT: Decoupled acyclic fault tolerance

Yun Zhang, Jae W. Lee, Nick P. Johnson, David I. August

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

42 Scopus citations

Abstract

Higher transistor counts, lower voltage levels, and reduced noise margin increase the susceptibility of multicore processors to transient faults. Redundant hardware modules can detect such errors, but software transient fault detection techniques are more appealing for their low cost and flexibility. Recent software proposals double register pressure or memory usage, or are too slow in the absence of hardware extensions, preventing widespread acceptance. This paper presents DAFT, a fast, safe, and memory-efficient transient fault detection framework for commodity multicore systems. DAFT replicates computation across multiple cores and schedules fault detection off the critical path. Where possible, values are speculated to be correct and only communicated to the redundant thread at essential program points. DAFT is implemented in the LLVM compiler framework and evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a commodity multicore system. Results demonstrate DAFT's high performance and broad fault coverage. Speculation allows DAFT to reduce the performance overhead of software redundant multithreading from an average of 200% to 38% with no degradation of fault coverage.
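The decoupled, one-way checking idea in the abstract can be illustrated with a minimal sketch. This is not the paper's LLVM implementation: the names `compute`, `leading`, `trailing`, and `run_redundant` are hypothetical, the `fault_at` parameter only simulates a bit flip for demonstration, and Python threads stand in for cores without providing real parallelism. The leading thread assumes its values are correct and keeps going; the trailing thread recomputes each value and compares it off the critical path.

```python
import queue
import threading


def compute(x):
    # Stand-in workload (hypothetical); any deterministic function works.
    return x * x + 1


def leading(inputs, q, fault_at=None):
    # Leading thread: computes results and forwards them one way.
    # It never waits for an acknowledgement from the checker.
    results = []
    for i, x in enumerate(inputs):
        y = compute(x)
        if i == fault_at:
            y ^= 1  # simulate a transient bit flip in the leading copy
        q.put((i, y))
        results.append(y)
    q.put(None)  # end-of-stream marker
    return results


def trailing(inputs, q, faults):
    # Trailing (redundant) thread: recomputes each value and compares.
    while True:
        item = q.get()
        if item is None:
            break
        i, y = item
        if compute(inputs[i]) != y:
            faults.append(i)  # mismatch: record the index as a detected fault


def run_redundant(inputs, fault_at=None):
    q = queue.Queue()
    faults = []
    checker = threading.Thread(target=trailing, args=(inputs, q, faults))
    checker.start()
    results = leading(inputs, q, fault_at)
    checker.join()
    return results, faults
```

Because communication flows only from the leading thread to the trailing thread, the leading thread is never stalled by the checker in this sketch; a real system would also need to decide which operations must wait for verification before they take effect.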

Original language: English (US)
Title of host publication: PACT'10 - Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 87-97
Number of pages: 11
ISBN (Print): 9781450301787
State: Published - 2010
Event: 19th International Conference on Parallel Architectures and Compilation Techniques, PACT 2010 - Vienna, Austria
Duration: Sep 11 2010 → Sep 15 2010

Publication series

Name: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
ISSN (Print): 1089-795X

Conference

Conference: 19th International Conference on Parallel Architectures and Compilation Techniques, PACT 2010
Country/Territory: Austria
City: Vienna
Period: 9/11/10 → 9/15/10

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture

Keywords

  • multicore
  • speculation
  • transient fault
