Abstract
Checkpointing systems are a convenient way for users to make their programs fault-tolerant by intermittently saving program state to disk and restoring that state following a failure. The main concern with checkpointing is the overhead that it adds to the running time of the program. This paper describes memory exclusion, an important class of optimizations that reduce the overhead of checkpointing. Some forms of memory exclusion are well-known in the checkpointing community. Others are relatively new. In this paper, we describe all of them within the same framework. We have implemented these optimization techniques in two checkpointers: libckpt, which works on Unix-based workstations, and CLIP, which works on the Intel Paragon. Both checkpointers are publicly available at no cost. We have checkpointed various long-running applications with both checkpointers and have explored the performance improvements that may be gained through memory exclusion. Results from these experiments are presented and show the improvements in time and space overhead.
Original language | English (US) |
---|---|
Pages (from-to) | 125-142 |
Number of pages | 18 |
Journal | Software - Practice and Experience |
Volume | 29 |
Issue number | 2 |
State | Published - Feb 1999 |
All Science Journal Classification (ASJC) codes
- Software
Keywords
- Checkpoint optimizations
- Checkpointing
- Fault-tolerance
- Memory exclusion
- Rollback recovery