Memory Exclusion: Optimizing the Performance of Checkpointing Systems

James S. Plank, Yuqun Chen, Kai Li, Micah Beck, Gerry Kingsley

Research output: Contribution to journal › Article

57 Scopus citations

Abstract

Checkpointing systems are a convenient way for users to make their programs fault-tolerant by intermittently saving program state to disk and restoring that state following a failure. The main concern with checkpointing is the overhead that it adds to the running time of the program. This paper describes memory exclusion, an important class of optimizations that reduce the overhead of checkpointing. Some forms of memory exclusion are well-known in the checkpointing community. Others are relatively new. In this paper, we describe all of them within the same framework. We have implemented these optimization techniques in two checkpointers: libckpt, which works on Unix-based workstations, and CLIP, which works on the Intel Paragon. Both checkpointers are publicly available at no cost. We have checkpointed various long-running applications with both checkpointers and have explored the performance improvements that may be gained through memory exclusion. Results from these experiments are presented and show the improvements in time and space overhead.
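The idea in the abstract can be illustrated with a toy sketch. It is not the libckpt or CLIP interface: the region names, the `write_checkpoint` and `restore_checkpoint` helpers, and the use of `pickle` are all assumptions made for illustration. It models program memory as named byte buffers and shows that leaving out a region that is not needed for recovery shrinks the checkpoint while recovery still succeeds.

```python
import os
import pickle
import tempfile


def write_checkpoint(path, regions, excluded=()):
    """Save every region not named in `excluded`; return the file size in bytes."""
    saved = {name: bytes(buf) for name, buf in regions.items()
             if name not in excluded}
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(saved, f)
    # Rename into place so a crash mid-write does not clobber the last good checkpoint.
    os.replace(tmp, path)
    return os.path.getsize(path)


def restore_checkpoint(path, regions):
    """Copy saved contents back; excluded regions keep whatever they currently hold."""
    with open(path, "rb") as f:
        saved = pickle.load(f)
    for name, data in saved.items():
        regions[name][:] = data


def demo():
    # Hypothetical layout: a small live region and a large scratch buffer
    # whose contents are always recomputed before they are read again.
    regions = {
        "state": bytearray(b"\x01" * 1024),
        "scratch": bytearray(b"\x02" * 65536),
    }
    with tempfile.TemporaryDirectory() as d:
        full = write_checkpoint(os.path.join(d, "full.ckpt"), regions)
        small_path = os.path.join(d, "small.ckpt")
        small = write_checkpoint(small_path, regions, excluded={"scratch"})

        # Simulate a failure that wipes the live state, then recover.
        regions["state"][:] = b"\x00" * 1024
        restore_checkpoint(small_path, regions)
        recovered = bytes(regions["state"]) == b"\x01" * 1024
    return full, small, recovered
```

Calling `demo()` returns the two checkpoint sizes and a flag saying whether the live state was recovered from the smaller checkpoint. A real checkpointer works on address ranges or pages rather than named buffers, so this sketch only conveys the space and time trade-off, not the mechanism.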

Original language: English (US)
Pages (from-to): 125-142
Number of pages: 18
Journal: Software - Practice and Experience
Volume: 29
Issue number: 2
State: Published - Feb 1999

All Science Journal Classification (ASJC) codes

  • Software

Keywords

  • Checkpoint optimizations
  • Checkpointing
  • Fault-tolerance
  • Memory exclusion
  • Rollback recovery
