TY - GEN
T1 - Harnessing the power of many
T2 - 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018
AU - Balasubramanian, Vivek
AU - Turilli, Matteo
AU - Hu, Weiming
AU - Lefebvre, Matthieu
AU - Lei, Wenjie
AU - Modrak, Ryan
AU - Cervone, Guido
AU - Tromp, Jeroen
AU - Jha, Shantenu
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/8/3
Y1 - 2018/8/3
N2 - Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. We introduce the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability they pose. We describe the design and implementation of EnTK, characterize its performance and integrate it with two exemplar use cases: seismic inversion and adaptive analog ensembles. We perform nine experiments, characterizing EnTK overheads, strong and weak scalability, and the performance of the two use case implementations, at scale and on production infrastructures. We show how EnTK meets the following general requirements: (i) implementing dedicated abstractions to support the description and execution of ensemble applications; (ii) support for execution on heterogeneous computing infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv) task-level fault tolerance. We discuss novel computational capabilities that EnTK enables and the scientific advantages arising thereof. We propose EnTK as an important addition to the suite of tools in support of production scientific computing.
AB - Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. We introduce the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability they pose. We describe the design and implementation of EnTK, characterize its performance and integrate it with two exemplar use cases: seismic inversion and adaptive analog ensembles. We perform nine experiments, characterizing EnTK overheads, strong and weak scalability, and the performance of the two use case implementations, at scale and on production infrastructures. We show how EnTK meets the following general requirements: (i) implementing dedicated abstractions to support the description and execution of ensemble applications; (ii) support for execution on heterogeneous computing infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv) task-level fault tolerance. We discuss novel computational capabilities that EnTK enables and the scientific advantages arising thereof. We propose EnTK as an important addition to the suite of tools in support of production scientific computing.
KW - Ensemble applications
KW - High performance computing
UR - http://www.scopus.com/inward/record.url?scp=85052246163&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052246163&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2018.00063
DO - 10.1109/IPDPS.2018.00063
M3 - Conference contribution
AN - SCOPUS:85052246163
SN - 9781538643686
T3 - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018
SP - 536
EP - 545
BT - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 21 May 2018 through 25 May 2018
ER -