CLIP: A checkpointing tool for message-passing parallel programs

Yuqun Chen, James S. Plank, Kai Li

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

45 Scopus citations

Abstract

Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost. Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP. We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.

Original language: English (US)
Title of host publication: Proceedings of the 1997 ACM/IEEE Conference on Supercomputing, SC 1997
Publisher: Association for Computing Machinery
ISBN (Print): 0897919858, 9780897919852
State: Published - 1997
Event: 1997 ACM/IEEE Conference on Supercomputing, SC 1997 - San Jose, CA, United States
Duration: Nov 15 1997 to Nov 21 1997

Publication series

Name: Proceedings of the International Conference on Supercomputing

Other

Other: 1997 ACM/IEEE Conference on Supercomputing, SC 1997
Country/Territory: United States
City: San Jose, CA
Period: 11/15/97 to 11/21/97

All Science Journal Classification (ASJC) codes

  • General Computer Science
