TY - GEN
T1 - CLIP
T2 - 1997 ACM/IEEE Conference on Supercomputing, SC 1997
AU - Chen, Yuqun
AU - Plank, James S.
AU - Li, Kai
PY - 1997
Y1 - 1997
N2 - Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost. Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP. We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.
AB - Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost. Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP. We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.
UR - http://www.scopus.com/inward/record.url?scp=84900298636&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84900298636&partnerID=8YFLogxK
U2 - 10.1145/509593.509626
DO - 10.1145/509593.509626
M3 - Conference contribution
AN - SCOPUS:84900298636
SN - 0897919858
SN - 9780897919852
T3 - Proceedings of the International Conference on Supercomputing
BT - Proceedings of the 1997 ACM/IEEE Conference on Supercomputing, SC 1997
PB - Association for Computing Machinery
Y2 - 15 November 1997 through 21 November 1997
ER -