TY - GEN
T1 - The quest for scalable support of data-intensive workloads in distributed systems
AU - Raicu, Ioan
AU - Foster, Ian T.
AU - Zhao, Yong
AU - Little, Philip
AU - Moretti, Christopher M.
AU - Chaudhary, Amitabh
AU - Thain, Douglas
PY - 2009
Y1 - 2009
AB - Data-intensive applications involving the analysis of large datasets often require large amounts of compute and storage resources, for which data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. To explore the feasibility of data diffusion, we offer both a theoretical and an empirical analysis. We define an abstract model for data diffusion, introduce new scheduling policies with heuristics to optimize real-world performance, and develop a competitive online cache eviction policy. We also offer many empirical experiments to explore the benefits of dynamically expanding and contracting resources based on load, to improve system responsiveness while keeping wasted resources small. We show performance improvements of one to two orders of magnitude across three diverse workloads when compared to the performance of parallel file systems, with throughputs approaching 80 Gb/s on a modest cluster of 200 processors. We also compare data diffusion with a best model for active storage, contrasting the difference between a pull model found in data diffusion and a push model found in active storage.
KW - Data diffusion
KW - Data management
KW - Data-aware scheduling
KW - Falkon
UR - http://www.scopus.com/inward/record.url?scp=70449623522&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70449623522&partnerID=8YFLogxK
U2 - 10.1145/1551609.1551642
DO - 10.1145/1551609.1551642
M3 - Conference contribution
AN - SCOPUS:70449623522
SN - 9781605585871
T3 - Proc. 18th ACM International Symposium on High Performance Distributed Computing, HPDC '09
SP - 21
EP - 29
BT - Proc. 18th ACM International Symposium on High Performance Distributed Computing, HPDC '09
T2 - 18th ACM International Symposium on High Performance Distributed Computing, HPDC '09
Y2 - 11 June 2009 through 13 June 2009
ER -