TY - JOUR
T1 - Middleware support for many-task computing
AU - Raicu, Ioan
AU - Foster, Ian
AU - Wilde, Mike
AU - Zhang, Zhao
AU - Iskra, Kamil
AU - Beckman, Peter
AU - Zhao, Yong
AU - Szalay, Alex
AU - Choudhary, Alok
AU - Little, Philip
AU - Moretti, Christopher
AU - Chaudhary, Amitabh
AU - Thain, Douglas
N1 - Funding Information:
Acknowledgements This work was supported in part by the NASA Ames Research Center GSRP Grant Number NNA06CB89H and by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy, under Contract DE-AC02-06CH11357. This research was also supported in part by the National Science Foundation through TeraGrid resources provided by UC/ANL.
PY - 2010
Y1 - 2010
N2 - Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Many-task computing denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. Traditional techniques found in production systems in the scientific community to support many-task computing do not scale to today's largest systems, due to issues in local resource manager scalability and granularity, efficient utilization of the raw hardware, long wait queue times, and shared/parallel file system contention and scalability. To address these limitations, we adopted a "top-down" approach to building a middleware called Falkon, to support the most demanding many-task computing applications at the largest scales. Falkon (Fast and Light-weight tasK executiON framework) integrates (1) multi-level scheduling to enable dynamic resource provisioning and minimize wait queue times, (2) a streamlined task dispatcher able to achieve orders-of-magnitude higher task dispatch rates than conventional schedulers, and (3) data diffusion which performs data caching and uses a data-aware scheduler to co-locate computational and storage resources. Micro-benchmarks have shown Falkon to achieve over 15K+ tasks/s throughputs, scale to hundreds of thousands of processors and to millions of queued tasks, and execute billions of tasks per day. Data diffusion has also shown to improve applications scalability and performance, with its ability to achieve hundreds of Gb/s I/O rates on modest sized clusters, with Tb/s I/O rates on the horizon. Falkon has shown orders of magnitude improvements in performance and scalability than traditional approaches to resource management across many diverse workloads and applications at scales of billions of tasks on hundreds of thousands of processors across clusters, specialized systems, Grids, and supercomputers. Falkon's performance and scalability have enabled a new class of applications called Many-Task Computing to operate at previously so-believed impossible scales with high efficiency.
AB - Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Many-task computing denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. Traditional techniques found in production systems in the scientific community to support many-task computing do not scale to today's largest systems, due to issues in local resource manager scalability and granularity, efficient utilization of the raw hardware, long wait queue times, and shared/parallel file system contention and scalability. To address these limitations, we adopted a "top-down" approach to building a middleware called Falkon, to support the most demanding many-task computing applications at the largest scales. Falkon (Fast and Light-weight tasK executiON framework) integrates (1) multi-level scheduling to enable dynamic resource provisioning and minimize wait queue times, (2) a streamlined task dispatcher able to achieve orders-of-magnitude higher task dispatch rates than conventional schedulers, and (3) data diffusion which performs data caching and uses a data-aware scheduler to co-locate computational and storage resources. Micro-benchmarks have shown Falkon to achieve over 15K+ tasks/s throughputs, scale to hundreds of thousands of processors and to millions of queued tasks, and execute billions of tasks per day. Data diffusion has also shown to improve applications scalability and performance, with its ability to achieve hundreds of Gb/s I/O rates on modest sized clusters, with Tb/s I/O rates on the horizon. Falkon has shown orders of magnitude improvements in performance and scalability than traditional approaches to resource management across many diverse workloads and applications at scales of billions of tasks on hundreds of thousands of processors across clusters, specialized systems, Grids, and supercomputers. Falkon's performance and scalability have enabled a new class of applications called Many-Task Computing to operate at previously so-believed impossible scales with high efficiency.
KW - Computing
KW - Data-intensive distributed computing
KW - Falkon
KW - High-performance computing
KW - High-throughput
KW - Loosely-coupled applications
KW - Many-task computing
KW - Petascale
KW - Swift
UR - http://www.scopus.com/inward/record.url?scp=77955096331&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77955096331&partnerID=8YFLogxK
U2 - 10.1007/s10586-010-0132-9
DO - 10.1007/s10586-010-0132-9
M3 - Article
AN - SCOPUS:77955096331
SN - 1386-7857
VL - 13
SP - 291
EP - 314
JO - Cluster Computing
JF - Cluster Computing
IS - 3
ER -