Faster checkpointing with N+1 parity

James S. Plank, Kai Li

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

55 Scopus citations

Abstract

This paper presents a way to perform fast, incremental checkpointing of multicomputers and distributed systems by using N + 1 parity. A basic algorithm is described that uses two extra processors for checkpointing and enables the system to tolerate any single processor failure. The algorithm's speed comes from a combination of N + 1 parity, extra physical memory, and virtual memory hardware so that checkpoints need not be written to disk. This eliminates the most time-consuming portion of checkpointing. The algorithm requires each application processor to allocate a fixed amount of extra memory for checkpointing. This amount may be set statically by the programmer, and need not be equal to the size of the processor's writable address space. This alleviates a major restriction of previous checkpointing algorithms using N + 1 parity [28]. Finally, we outline how to extend our algorithm to tolerate any m processor failures with the addition of 2m extra checkpointing processors.
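The N + 1 parity idea named in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it leaves out the virtual memory hardware, the second checkpointing processor, and the fixed-size extra memory, and it shows only how one parity buffer, kept as the bytewise XOR of N equal-sized checkpoint buffers, can rebuild any single lost buffer. All function names below are hypothetical and chosen for illustration.

```python
from functools import reduce
from typing import Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(checkpoints: Sequence[bytes]) -> bytes:
    """Parity buffer: XOR of all N application checkpoints."""
    return reduce(xor_bytes, checkpoints)


def update_parity(parity: bytes, old_page: bytes, new_page: bytes) -> bytes:
    """Incremental update (illustrative): fold one changed buffer into
    the parity without rereading the other N - 1 buffers."""
    return xor_bytes(parity, xor_bytes(old_page, new_page))


def recover(parity: bytes, survivors: Sequence[bytes]) -> bytes:
    """Rebuild the single missing checkpoint from the parity and the
    N - 1 surviving checkpoints."""
    return reduce(xor_bytes, survivors, parity)


if __name__ == "__main__":
    # Three hypothetical application processors with 2-byte checkpoints.
    checkpoints = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = compute_parity(checkpoints)

    # Processor 1 fails; its checkpoint is rebuilt from the others.
    rebuilt = recover(parity, [checkpoints[0], checkpoints[2]])
    assert rebuilt == checkpoints[1]
```

Because XOR is its own inverse, the same operation serves for building the parity, updating it incrementally, and recovering a lost buffer, which is why a single failure among the N + 1 buffers is tolerable in this simplified setting.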

Original language: English (US)
Title of host publication: Digest of Papers - International Symposium on Fault-Tolerant Computing
Publisher: Publ by IEEE
Pages: 288-297
Number of pages: 10
ISBN (Print): 0818655224
State: Published - Jan 1 1994
Externally published: Yes
Event: Proceedings of the 24th International Symposium on Fault-Tolerant Computing - Austin, TX, USA
Duration: Jun 15 1994 to Jun 17 1994

Other

Other: Proceedings of the 24th International Symposium on Fault-Tolerant Computing
City: Austin, TX, USA
Period: 6/15/94 to 6/17/94

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
  • Engineering(all)

Cite this

Plank, J. S., & Li, K. (1994). Faster checkpointing with N+1 parity. In Digest of Papers - International Symposium on Fault-Tolerant Computing (pp. 288-297). Publ by IEEE.