Diskless checkpointing

James S. Plank, Kai Li, Michael A. Puening

Research output: Contribution to journal › Article › peer-review

266 Scopus citations

Abstract

Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.

Original language: English (US)
Pages (from-to): 972-986
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 9
Issue number: 10
State: Published - 1998

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

Keywords

  • Checkpointing
  • Copy-on-write
  • Error-correcting codes
  • Fault tolerance
  • Memory redundancy
  • RAID systems
  • Rollback recovery
