TY - GEN
T1 - Algorithm-based fault tolerance for floating-point operations in massively parallel systems
AU - Rexford, Jennifer L.
AU - Jha, Niraj Kumar
N1 - Publisher Copyright: © 1992 IEEE.
PY - 1992
Y1 - 1992
N2 - This paper considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. This paper proposes the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes, with respect to numerical stability and hardware/time overhead. The partitioned scheme provides scalable linear codes with improved numerical properties with only a small increase in hardware and time overhead.
UR - http://www.scopus.com/inward/record.url?scp=70449763796&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70449763796&partnerID=8YFLogxK
U2 - 10.1109/ISCAS.1992.230168
DO - 10.1109/ISCAS.1992.230168
M3 - Conference contribution
AN - SCOPUS:70449763796
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
SP - 649
EP - 652
BT - 1992 IEEE International Symposium on Circuits and Systems, ISCAS 1992
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 1992 IEEE International Symposium on Circuits and Systems, ISCAS 1992
Y2 - 10 May 1992 through 13 May 1992
ER -