Harnessing the power of many: Extensible toolkit for scalable ensemble applications

Vivek Balasubramanian, Matteo Turilli, Weiming Hu, Matthieu Lefebvre, Wenjie Lei, Ryan Modrak, Guido Cervone, Jeroen Tromp, Shantenu Jha

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

35 Scopus citations

Abstract

Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. We introduce the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability they pose. We describe the design and implementation of EnTK, characterize its performance and integrate it with two exemplar use cases: seismic inversion and adaptive analog ensembles. We perform nine experiments, characterizing EnTK overheads, strong and weak scalability, and the performance of the two use case implementations, at scale and on production infrastructures. We show how EnTK meets the following general requirements: (i) implementing dedicated abstractions to support the description and execution of ensemble applications; (ii) support for execution on heterogeneous computing infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv) task-level fault tolerance. We discuss novel computational capabilities that EnTK enables and the scientific advantages arising thereof. We propose EnTK as an important addition to the suite of tools in support of production scientific computing.

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 536-545
Number of pages: 10
ISBN (Print): 9781538643686
State: Published - Aug 3 2018
Event: 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018 - Vancouver, Canada
Duration: May 21 2018 → May 25 2018

Publication series

Name: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018

Other

Other: 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018
Country/Territory: Canada
City: Vancouver
Period: 5/21/18 → 5/25/18

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems and Management

Keywords

  • Ensemble applications
  • High performance computing
