Algorithm-based fault tolerance for floating-point operations in massively parallel systems

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

21 Scopus citations

Abstract

This paper considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. This paper proposes the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes, with respect to numerical stability and hardware/time overhead. The partitioned scheme provides scalable linear codes with improved numerical properties with only a small increase in hardware and time overhead.
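
The abstract does not spell out the partitioned encoding, so the following is only a minimal sketch of the general idea, assuming the classic column-checksum form of ABFT applied to a matrix product C = A·B. Instead of one checksum row summed over all rows of A, the rows are split into groups of `block` rows and each group gets its own checksum row, so every checksum sums fewer floating-point terms. The function names, the `block` parameter, and the relative tolerance `tol` are illustrative assumptions, not the paper's notation or method.

```python
import math

def column_sums(rows):
    """Sum each column over the given rows (one checksum row)."""
    return [math.fsum(col) for col in zip(*rows)]

def matmul(a, b):
    """Plain product of two list-of-lists matrices."""
    inner = len(b)
    return [[math.fsum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def partition_checksums(a, block):
    """One checksum row per group of `block` consecutive rows of `a`."""
    return [column_sums(a[s:s + block]) for s in range(0, len(a), block)]

def faulty_blocks(c, expected_checks, block, tol=1e-9):
    """Return indices of row groups of `c` whose column sums disagree
    with the independently computed checksum rows (assumed tolerance)."""
    bad = []
    for g, expected in enumerate(expected_checks):
        observed = column_sums(c[g * block:(g + 1) * block])
        if any(abs(o - e) > tol * max(1.0, abs(e))
               for o, e in zip(observed, expected)):
            bad.append(g)
    return bad

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
b = [[0.5, 1.0], [1.5, 2.0], [2.5, 3.0]]

c = matmul(a, b)
# Checksum rows of A times B predict the column sums of each row group of C.
checks = matmul(partition_checksums(a, 2), b)
print(faulty_blocks(c, checks, 2))  # [] : no fault injected

c[2][1] += 1.0                      # simulate a fault in one element of C
print(faulty_blocks(c, checks, 2))  # [1] : second row group is flagged
```

In this sketch a smaller `block` means each checksum accumulates less rounding error and localizes a fault to fewer rows, at the price of more checksum rows to compute and store. That is the kind of stability versus overhead trade-off the abstract refers to.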

Original language: English (US)
Title of host publication: 1992 IEEE International Symposium on Circuits and Systems, ISCAS 1992
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 649-652
Number of pages: 4
ISBN (Electronic): 0780305930
State: Published - 1992
Event: 1992 IEEE International Symposium on Circuits and Systems, ISCAS 1992 - San Diego, United States
Duration: May 10 1992 to May 13 1992

Publication series

Name: Proceedings - IEEE International Symposium on Circuits and Systems
Volume: 2
ISSN (Print): 0271-4310

Conference

Conference: 1992 IEEE International Symposium on Circuits and Systems, ISCAS 1992
Country/Territory: United States
City: San Diego
Period: 5/10/92 to 5/13/92

All Science Journal Classification (ASJC) codes

  • Electrical and Electronic Engineering
