TY - JOUR
T1 - Safety and reliability driven task allocation in distributed systems
AU - Srinivasan, Santhanam
AU - Jha, Niraj K.
N1 - Funding Information:
This work was supported in part by the U.S. Office of Naval Research under Contract no. N00014-91-J-1199 and in part by the U.S. National Science Foundation under Grant no. MIP-9423574. This work was done when Santhanam Srinivasan was at Princeton University.
Copyright:
Copyright 2012 Elsevier B.V., All rights reserved.
PY - 1999
Y1 - 1999
N2 - Distributed computer systems are increasingly being employed for critical applications, such as aircraft control, industrial process control, and banking systems. Maximizing performance has been the conventional objective in the allocation of tasks for such systems. Inherently, distributed systems are more complex than centralized systems. The added complexity could increase the potential for system failures. Some work has been done in the past in allocating tasks to distributed systems, considering reliability as the objective function to be maximized. Reliability is defined to be the probability that none of the system components fails while processing. This, however, does not give any guarantees as to the behavior of the system when a failure occurs. A failure, not detected immediately, could lead to a catastrophe. Such systems are unsafe. In this paper, we describe a method to determine an allocation that introduces safety into a heterogeneous distributed system and at the same time attempts to maximize its reliability. First, we devise a new heuristic, based on the concept of clustering, to allocate tasks for maximizing reliability. We show that for task graphs with precedence constraints, our heuristic performs better than previously proposed heuristics. Next, by applying the concept of task-based fault tolerance, which we have previously proposed, we add extra assertion tasks to the system to make it safe. We present a new heuristic that does this in such a way that the decrease in reliability for the added safety is minimized. For the purpose of allocating the extra tasks, this heuristic performs as well as previously known methods and runs an order of magnitude faster. We present a number of simulation results to prove the efficacy of our scheme.
AB - Distributed computer systems are increasingly being employed for critical applications, such as aircraft control, industrial process control, and banking systems. Maximizing performance has been the conventional objective in the allocation of tasks for such systems. Inherently, distributed systems are more complex than centralized systems. The added complexity could increase the potential for system failures. Some work has been done in the past in allocating tasks to distributed systems, considering reliability as the objective function to be maximized. Reliability is defined to be the probability that none of the system components fails while processing. This, however, does not give any guarantees as to the behavior of the system when a failure occurs. A failure, not detected immediately, could lead to a catastrophe. Such systems are unsafe. In this paper, we describe a method to determine an allocation that introduces safety into a heterogeneous distributed system and at the same time attempts to maximize its reliability. First, we devise a new heuristic, based on the concept of clustering, to allocate tasks for maximizing reliability. We show that for task graphs with precedence constraints, our heuristic performs better than previously proposed heuristics. Next, by applying the concept of task-based fault tolerance, which we have previously proposed, we add extra assertion tasks to the system to make it safe. We present a new heuristic that does this in such a way that the decrease in reliability for the added safety is minimized. For the purpose of allocating the extra tasks, this heuristic performs as well as previously known methods and runs an order of magnitude faster. We present a number of simulation results to prove the efficacy of our scheme.
UR - http://www.scopus.com/inward/record.url?scp=0032683084&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0032683084&partnerID=8YFLogxK
U2 - 10.1109/71.755824
DO - 10.1109/71.755824
M3 - Article
AN - SCOPUS:0032683084
SN - 1045-9219
VL - 10
SP - 238
EP - 251
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 3
ER -