Abstract
Parallel processing architectures are now in common use for signal processing and other computation-intensive applications. These applications are characterized by high throughput and long processing periods. Such characteristics decrease the reliability of high-performance architectures. The erroneous data produced by faulty processors could have damaging consequences, particularly in critical real-time applications. It is therefore desirable that any erroneous data produced by the system be detected and located as quickly as possible. Algorithm-based fault tolerance (ABFT) is a low-cost system-level concurrent error detection and fault location scheme. We apply methods used in the analysis of multiprocessor systems employing system-level diagnosis to the analysis of ABFT systems. A new algorithm to analyze an ABFT system for its fault diagnosability is developed using these methods. Based on this work, a fault diagnosis algorithm is developed for ABFT systems. No such algorithm has been presented previously.
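As a rough illustration of the checksum encoding that ABFT schemes typically rely on, the sketch below multiplies two checksum-augmented integer matrices and uses the row and column checksums of the product to detect and locate a single corrupted data element. It is a minimal sketch of the textbook checksum-matrix idea, not the diagnosability analysis or the diagnosis algorithm developed in the paper. All function names are hypothetical, and it assumes that at most one data element (not a checksum element) is in error.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum(a):
    """Append a row holding the sum of each column of `a`."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append a column holding the sum of each row of `b`."""
    return [row[:] + [sum(row)] for row in b]


def locate_error(c_full):
    """Return (row, col) of a single corrupted data element, or None.

    `c_full` is a full-checksum matrix: its last column holds row sums
    and its last row holds column sums. Assumption (hypothetical fault
    model): at most one data element is wrong and checksums are intact.
    """
    n = len(c_full) - 1        # number of data rows
    p = len(c_full[0]) - 1     # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:p]) != c_full[i][p]]
    bad_cols = [j for j in range(p)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None            # all checksums consistent
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error pattern not locatable under the single-error assumption")


# Encode the operands, multiply, and check the result.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(add_column_checksum(A), add_row_checksum(B))
clean_result = locate_error(C_full)

# Inject a fault into one data element and locate it.
C_faulty = [row[:] for row in C_full]
C_faulty[1][0] += 5
faulty_result = locate_error(C_faulty)
```

Integer arithmetic is used so that checksum comparisons are exact; with floating-point data the equality tests would need a tolerance.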
Original language | English (US) |
---|---|
Pages (from-to) | 924-937 |
Number of pages | 14 |
Journal | IEEE Transactions on Computers |
Volume | 42 |
Issue number | 8 |
State | Published - Aug 1993 |
All Science Journal Classification (ASJC) codes
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computational Theory and Mathematics
Keywords
- Algorithm-based fault tolerance
- checksum encoding
- concurrent error detection
- concurrent fault diagnosis
- fault diagnosability