Abstract
This short note considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes provide effective, low-cost fault tolerance for computation on matrices of moderate size, but they do not scale well to floating-point operations on large systems. To provide scalability, the note proposes a partitioned linear encoding scheme. Matrix algorithms employing this scheme are presented and compared with current ABFT schemes. The partitioned scheme is shown to provide scalable linear codes with improved numerical properties, at the cost of only a small increase in hardware and time overhead.
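The abstract does not give the details of the encoding, so the following is only an illustrative sketch of the general idea, not the algorithm from the paper. It assumes a simple column-checksum code in which one checksum row is appended per partition of `block` rows, instead of a single checksum row for the whole matrix. Because each checksum then sums only a few rows, its magnitude and rounding error stay bounded as the matrix grows. The function names (`encode_rows`, `check`), the relative tolerance, and the test matrices are hypothetical choices made for this example.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def encode_rows(a, block):
    """Append one checksum row (column-wise sum) after every `block` data rows."""
    encoded = []
    for start in range(0, len(a), block):
        group = a[start:start + block]
        encoded.extend(group)
        encoded.append([sum(col) for col in zip(*group)])
    return encoded


def check(c_encoded, block, n_rows, tol=1e-9):
    """Return the indices of partitions whose checksum row disagrees with its data rows.

    `n_rows` is the number of data rows before encoding; `tol` is an assumed
    relative tolerance that absorbs floating-point rounding.
    """
    bad = []
    pos, part, remaining = 0, 0, n_rows
    while remaining > 0:
        size = min(block, remaining)
        group = c_encoded[pos:pos + size]
        checksum = c_encoded[pos + size]
        recomputed = [sum(col) for col in zip(*group)]
        if any(abs(r - s) > tol * max(1.0, abs(s)) for r, s in zip(recomputed, checksum)):
            bad.append(part)
        pos += size + 1
        remaining -= size
        part += 1
    return bad


# Hypothetical data: a 4x2 matrix split into two partitions of two rows each.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
B = [[1.0, 0.5], [2.0, 1.0]]

# Multiplying the encoded A by B yields a product whose checksum rows
# are (up to rounding) the per-partition column sums of A * B.
c_enc = matmul(encode_rows(A, 2), B)
clean = check(c_enc, 2, 4)

# Simulate a transient error in a data row of the second partition.
c_enc[3][1] += 1.0
faulty = check(c_enc, 2, 4)
```

In this sketch a fault is localized to a partition rather than to a single element. Locating or correcting the exact element would need additional checksums (for example a row checksum on `B`), which are omitted here.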
Original language | English (US) |
---|---|
Pages (from-to) | 649-653 |
Number of pages | 5 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 5 |
Issue number | 6 |
State | Published - Jun 1994 |
All Science Journal Classification (ASJC) codes
- Signal Processing
- Hardware and Architecture
- Computational Theory and Mathematics
Keywords
- Algorithm-based fault tolerance
- checksum code
- detection
- error
- error correction
- transient errors