Partitioned Encoding Schemes for Algorithm-Based Fault Tolerance in Massively Parallel Systems

Research output: Contribution to journal › Article › peer-review

16 Scopus citations

Abstract

This short note considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at low cost for computations on matrices of moderate size; however, they do not scale well to floating-point operations on large systems. To provide scalability, the note proposes a partitioned linear encoding scheme. Matrix algorithms employing this scheme are presented and compared with current ABFT schemes. The partitioned scheme is shown to provide scalable linear codes with improved numerical properties, at the cost of only a small increase in hardware and time overhead.
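
The abstract does not spell out the encoding itself, so the following is only a minimal sketch of the general idea of partitioned checksums, under stated assumptions: it is not the scheme from the paper. It assumes a plain column-sum checksum computed separately for each group of `block` consecutive rows, and relies on the fact that such checksums carry through a matrix product. The helper names (`matmul`, `block_checksums`, `faulty_blocks`) and the tolerance `tol` are hypothetical and chosen for illustration.

```python
def matmul(a, b):
    """Plain dense matrix product on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def block_checksums(m, block):
    """One column-sum checksum row per group of `block` consecutive rows."""
    return [[sum(col) for col in zip(*m[i:i + block])]
            for i in range(0, len(m), block)]


def faulty_blocks(c, expected, block, tol=1e-9):
    """Indices of row blocks whose recomputed checksum disagrees with `expected`."""
    actual = block_checksums(c, block)
    return [k for k, (got, want) in enumerate(zip(actual, expected))
            if any(abs(g - w) > tol * max(1.0, abs(w))
                   for g, w in zip(got, want))]


# Illustrative data only (assumed, not taken from the paper).
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
b = [[1.0, 0.5], [2.0, 1.5]]
block = 2

# Checksums of A's row blocks, multiplied by B, predict the checksums of C's row blocks.
expected = matmul(block_checksums(a, block), b)

c = matmul(a, b)
clean = faulty_blocks(c, expected, block)    # no error injected yet

c[2][1] += 1.0                               # simulate a transient error in row 2 (block 1)
flagged = faulty_blocks(c, expected, block)  # the check now points at block 1
```

In a sketch like this, each checksum sums only `block` rows rather than the whole matrix, so the floating-point rounding error that the tolerance has to absorb depends on the block size instead of the full matrix dimension. That is one plausible reading of why partitioning can help with scaling, offered here as an assumption rather than as the paper's argument.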

Original language: English (US)
Pages (from-to): 649-653
Number of pages: 5
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 5
Issue number: 6
State: Published - Jun 1994

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

Keywords

  • Algorithm-based fault tolerance
  • checksum code
  • detection
  • error
  • error correction
  • transient errors
