TY - GEN
T1 - Runtime asynchronous fault tolerance via speculation
AU - Zhang, Yun
AU - Ghosh, Soumyadeep
AU - Huang, Jialu
AU - Lee, Jae W.
AU - Mahlke, Scott A.
AU - August, David I.
PY - 2012
Y1 - 2012
N2 - Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.
AB - Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.
UR - http://www.scopus.com/inward/record.url?scp=84863492598&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84863492598&partnerID=8YFLogxK
U2 - 10.1145/2259016.2259035
DO - 10.1145/2259016.2259035
M3 - Conference contribution
AN - SCOPUS:84863492598
SN - 9781605586359
T3 - Proceedings - International Symposium on Code Generation and Optimization, CGO 2012
SP - 145
EP - 154
BT - Proceedings - International Symposium on Code Generation and Optimization, CGO 2012
T2 - 10th International Symposium on Code Generation and Optimization, CGO 2012
Y2 - 31 March 2012 through 4 April 2012
ER -