Abstract
Parallel processing architectures are now in common use for signal processing and other computation-intensive applications. These applications are characterized by high throughput and long processing periods. Such characteristics decrease the reliability of high-performance architectures. The erroneous data produced by faulty processors could have damaging consequences, particularly in critical real-time applications. It is therefore desirable that any erroneous data produced by the system be detected and located as quickly as possible. Algorithm-based fault tolerance (ABFT) is a low-cost system-level concurrent error detection and fault location scheme. We apply methods used in the analysis of multiprocessor systems employing system-level diagnosis to the analysis of ABFT systems. A new algorithm to analyze an ABFT system for its fault diagnosability is developed using these methods. Based on this work, a fault diagnosis algorithm is developed for ABFT systems. No such algorithm has been presented previously.
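As a concrete illustration of the checksum encoding that ABFT schemes commonly rely on, the sketch below applies the classic checksum-matrix approach to matrix multiplication. It is not the diagnosability analysis or the fault diagnosis algorithm developed in the paper. The function names (`column_checksum`, `row_checksum`, `matmul`, `locate_error`) are hypothetical, and integer data is assumed so that checksums can be compared exactly; floating-point data would need a tolerance.

```python
def column_checksum(a):
    """Return a copy of matrix a with an extra row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Return a copy of matrix b with an extra column holding each row's sum."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def locate_error(c_full):
    """Check a full-checksum product matrix.

    Returns None if every row and column checksum is consistent, or the
    (row, col) position of a single erroneous data element. Other error
    patterns are outside this sketch and raise ValueError.
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    raise ValueError("error pattern is not a single data-element error")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(column_checksum(a), row_checksum(b))
print(locate_error(c_full))   # None: all checksums agree

c_full[1][0] += 5             # simulate a faulty processor corrupting one result
print(locate_error(c_full))   # (1, 0): the failing row and column checksums intersect here
```

The product of a column-checksum matrix and a row-checksum matrix carries both row and column checksums, so a single corrupted data element shows up as exactly one inconsistent row and one inconsistent column. Mapping such failed checks back to the faulty processor is the diagnosis question the abstract refers to, and it depends on how computations and checks are assigned to processors, which this sketch does not model.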
| Original language | English (US) |
|---|---|
| Pages (from-to) | 924-937 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Computers |
| Volume | 42 |
| Issue number | 8 |
| State | Published - Aug 1993 |
All Science Journal Classification (ASJC) codes
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computational Theory and Mathematics
Keywords
- Algorithm-based fault tolerance
- checksum encoding
- concurrent error detection
- concurrent fault diagnosis
- fault diagnosability