TY - GEN
T1 - Fault-tolerant typed assembly language
AU - Perry, Frances
AU - MacKey, Lester
AU - Reis, George A.
AU - Ligatti, Jay
AU - August, David I.
AU - Walker, David P.
PY - 2007
Y1 - 2007
N2 - A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal transfers. In this paper, we propose a new scheme for provably safe and reliable computing in the presence of transient hardware faults. In our scheme, software computations are replicated to provide redundancy while special instructions compare the independently computed results to detect errors before writing critical data. In stark contrast to any previous efforts in this area, we have analyzed our fault tolerance scheme from a formal, theoretical perspective. To be specific, first, we provide an operational semantics for our assembly language, which includes a precise formal definition of our fault model. Second, we develop an assembly-level type system designed to detect reliability problems in compiled code. Third, we provide a formal specification for program fault tolerance under the given fault model and prove that all well-typed programs are indeed fault tolerant. In addition to the formal analysis, we evaluate our detection scheme and show that it only takes 34% longer to execute than the unreliable version.
AB - A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal transfers. In this paper, we propose a new scheme for provably safe and reliable computing in the presence of transient hardware faults. In our scheme, software computations are replicated to provide redundancy while special instructions compare the independently computed results to detect errors before writing critical data. In stark contrast to any previous efforts in this area, we have analyzed our fault tolerance scheme from a formal, theoretical perspective. To be specific, first, we provide an operational semantics for our assembly language, which includes a precise formal definition of our fault model. Second, we develop an assembly-level type system designed to detect reliability problems in compiled code. Third, we provide a formal specification for program fault tolerance under the given fault model and prove that all well-typed programs are indeed fault tolerant. In addition to the formal analysis, we evaluate our detection scheme and show that it only takes 34% longer to execute than the unreliable version.
KW - Fault tolerance
KW - Soft faults
KW - Transient hardware faults
KW - Typed assembly language
UR - http://www.scopus.com/inward/record.url?scp=35448936015&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35448936015&partnerID=8YFLogxK
U2 - 10.1145/1250734.1250741
DO - 10.1145/1250734.1250741
M3 - Conference contribution
AN - SCOPUS:35448936015
SN - 1595936335
SN - 9781595936332
T3 - Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)
SP - 42
EP - 53
BT - PLDI'07
T2 - PLDI'07: 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation
Y2 - 10 June 2007 through 13 June 2007
ER -