An efficient checkpointing method for multicomputers with wormhole routing

Kai Li, Jeffrey F. Naughton, James S. Plank

Research output: Contribution to journal › Article

11 Scopus citations

Abstract

Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processor p to continue running the application being checkpointed except during the time that p is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements.

Original language: English (US)
Pages (from-to): 159-180
Number of pages: 22
Journal: International Journal of Parallel Programming
Volume: 20
Issue number: 3
State: Published - Jun 1 1991

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Information Systems
