TY - GEN
T1 - Design and evaluation of hybrid fault-detection systems
AU - Reis, George A.
AU - Chang, Jonathan
AU - Vachharajani, Neil
AU - Rangan, Ram
AU - August, David I.
AU - Mukherjee, Shubhendu S.
PY - 2005
Y1 - 2005
N2 - As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and software-only fault-detection mechanisms to identify and mitigate the deleterious effects of transient faults. These two fault-detection systems, however, are extremes in the design space, representing sharp trade-offs between hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection mechanisms as promising alternatives to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. We propose and evaluate CRAFT, a suite of three such hybrid techniques, to illustrate the potential of the hybrid approach. For fair, quantitative comparisons among hardware, software, and hybrid systems, we introduce a new metric, Mean Work To Failure, which is able to compare systems for which machine instructions do not represent a constant unit of work. Additionally, we present a new simulation framework which rapidly assesses reliability and does not depend on manual identification of failure modes. Our evaluation illustrates that CRAFT, and hybrid techniques in general, offer attractive options in the fault-detection design space.
UR - http://www.scopus.com/inward/record.url?scp=27544438520&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=27544438520&partnerID=8YFLogxK
U2 - 10.1109/ISCA.2005.21
DO - 10.1109/ISCA.2005.21
M3 - Conference contribution
AN - SCOPUS:27544438520
SN - 076952270X
T3 - Proceedings - International Symposium on Computer Architecture
SP - 148
EP - 159
BT - Proceedings - 32nd International Symposium on Computer Architecture, ISCA 2005
T2 - 32nd International Symposium on Computer Architecture, ISCA 2005
Y2 - 4 June 2005 through 8 June 2005
ER -