Abstract
Algorithm-based fault tolerance (ABFT) is a low-overhead system-level concurrent error detection and fault location scheme for multiprocessor systems. In this short note, we present new methods for the design of ABFT systems. Our design procedure is applicable to a wide range of systems in which processors share data elements. A feature of our design approach is that the type of checks to be used in the final system can be controlled by the system designer. We also present some new bounds on the number of checks needed in ABFT system design.
Original language | English (US) |
---|---|
Pages (from-to) | 1099-1106 |
Number of pages | 8 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 5 |
Issue number | 10 |
State | Published - Oct 1994 |
All Science Journal Classification (ASJC) codes
- Signal Processing
- Hardware and Architecture
- Computational Theory and Mathematics
Keywords
- Algorithm-based fault tolerance
- concurrent error detection
- fault detectability
- fault diagnosability
- system-level fault tolerance