TY - JOUR
T1 - Low-Latency, Concurrent Checkpointing for Parallel Programs
AU - Li, Kai
AU - Naughton, J.
AU - Plank, J.
N1 - Funding Information:
Manuscript received July 7, 1992; revised July 9, 1993. This work was supported in part by the National Science Foundation under Grants CCR-8814265 and IRI-8909795, and in part by the Digital Equipment External Research Program and Systems Research Center. K. Li is with the Department of Computer Science, Princeton University, Princeton, NJ 08544 USA. J. Naughton is with the Department of Computer Science, University of Wisconsin, Madison, WI 53706 USA. J. Plank is with the Department of Computer Science, University of Tennessee, Knoxville, TN 37966 USA. IEEE Log Number 9401208.
PY - 1994/8
Y1 - 1994/8
N2 - This short note presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
AB - This short note presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
KW - Checkpointing
KW - fault tolerance
KW - copy-on-write
KW - multiprocessing
KW - backward error recovery
UR - http://www.scopus.com/inward/record.url?scp=0028485392&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0028485392&partnerID=8YFLogxK
U2 - 10.1109/71.298215
DO - 10.1109/71.298215
M3 - Article
AN - SCOPUS:0028485392
SN - 1045-9219
VL - 5
SP - 874
EP - 879
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 8
ER -