Abstract
This paper presents a way to perform fast, incremental checkpointing of multicomputers and distributed systems by using N + 1 parity. A basic algorithm is described that uses two extra processors for checkpointing and enables the system to tolerate any single processor failure. The algorithm's speed comes from a combination of N + 1 parity, extra physical memory, and virtual memory hardware so that checkpoints need not be written to disk. This eliminates the most time-consuming portion of checkpointing. The algorithm requires each application processor to allocate a fixed amount of extra memory for checkpointing. This amount may be set statically by the programmer, and need not be equal to the size of the processor's writable address space. This alleviates a major restriction of previous checkpointing algorithms using N + 1 parity [28]. Finally, we outline how to extend our algorithm to tolerate any m processor failures with the addition of 2m extra checkpointing processors.
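The parity idea the abstract relies on can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract describes two extra processors, extra physical memory, and virtual memory hardware, none of which are modeled here. The sketch assumes a single parity holder that stores the bytewise XOR of every application processor's checkpointed memory, so that any one lost memory image can be rebuilt from the parity and the survivors. All function names, buffer contents, and the 4-byte "page" size are hypothetical.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(memories: List[bytes]) -> bytes:
    """Parity of N memory images: the XOR of all of them (hypothetical helper)."""
    return reduce(xor_bytes, memories)


def update_parity(parity: bytes, old_page: bytes, new_page: bytes, offset: int) -> bytes:
    """Incremental update: fold (old XOR new) for one changed page into the parity."""
    delta = xor_bytes(old_page, new_page)
    end = offset + len(delta)
    return parity[:offset] + xor_bytes(parity[offset:end], delta) + parity[end:]


def recover(parity: bytes, survivors: List[bytes]) -> bytes:
    """Rebuild the single lost memory image from the parity and the N-1 survivors."""
    return reduce(xor_bytes, survivors, parity)


# Three hypothetical application processors, each with an 8-byte memory image.
memories = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
parity = compute_parity(memories)

# Processor 1 modifies its second 4-byte page; only that page's delta is applied.
new_page = b"ZZZZ"
parity = update_parity(parity, memories[1][4:8], new_page, offset=4)
memories[1] = memories[1][:4] + new_page

# Processor 1 fails; its state is reconstructed from the parity and the others.
restored = recover(parity, [memories[0], memories[2]])
```

Because XOR is its own inverse, the same `recover` call works no matter which single processor is lost, and an incremental update touches only the bytes of the changed page rather than the whole memory image.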
Original language | English (US) |
---|---|
Title of host publication | Digest of Papers - International Symposium on Fault-Tolerant Computing |
Publisher | IEEE |
Pages | 288-297 |
Number of pages | 10 |
ISBN (Print) | 0818655224 |
State | Published - Jan 1 1994 |
Externally published | Yes |
Event | Proceedings of the 24th International Symposium on Fault-Tolerant Computing - Austin, TX, USA; Duration: Jun 15 1994 → Jun 17 1994 |
All Science Journal Classification (ASJC) codes
- Hardware and Architecture
- Engineering (all)