Low-Latency, Concurrent Checkpointing for Parallel Programs

Research output: Contribution to journal › Article › peer-review

89 Scopus citations

Abstract

This short note presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
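
As a rough illustration of the copy-on-write idea mentioned in the abstract, the sketch below uses the POSIX `fork` call, whose child process receives a copy-on-write view of the parent's memory. This is not the algorithm from the paper, which targets parallel programs on shared-memory multiprocessors; it is a single-process analogue under stated assumptions. The names `checkpoint_concurrently`, `restore`, and `demo` are hypothetical, and pickling to a temporary file is an assumption made only to keep the example self-contained.

```python
import os
import pickle
import tempfile


def checkpoint_concurrently(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    Hypothetical helper: the target program is interrupted only for the
    duration of fork(), and the slow write to disk happens in the child.
    """
    pid = os.fork()
    if pid == 0:
        # Child: sees a copy-on-write snapshot of memory as of the fork,
        # so later writes by the parent do not affect what is saved here.
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
            status = 0
        finally:
            os._exit(status)  # leave without running the parent's cleanup
    return pid


def restore(path):
    """Load a previously written checkpoint (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def demo():
    state = {"step": 1, "data": list(range(5))}
    fd, path = tempfile.mkstemp()
    os.close(fd)

    pid = checkpoint_concurrently(state, path)

    # Parent: keeps computing while the child is still writing.
    state["step"] = 2
    state["data"].append(99)

    _, wait_status = os.waitpid(pid, 0)
    saved = restore(path)
    os.remove(path)
    return saved, state, wait_status


if __name__ == "__main__":
    saved, live, _ = demo()
    print("checkpointed:", saved)
    print("live state:  ", live)
```

The point of the design is that the expensive part of the checkpoint, writing the state out, overlaps with the running program, which corresponds to the low-interruption property the abstract measures. The sketch requires a POSIX system, since `os.fork` is not available on Windows.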

Original language: English (US)
Pages (from-to): 874-879
Number of pages: 6
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 5
Issue number: 8
State: Published - Aug 1994

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

Keywords

  • Checkpointing
  • Fault tolerance
  • Copy-on-write
  • Multiprocessing
  • Backward error recovery
