TY - JOUR
T1 - Parthenon—a performance portable block-structured adaptive mesh refinement framework
AU - Grete, Philipp
AU - Dolence, Joshua C.
AU - Miller, Jonah M.
AU - Brown, Joshua
AU - Ryan, Ben
AU - Gaspar, Andrew
AU - Glines, Forrest
AU - Swaminarayan, Sriram
AU - Lippuner, Jonas
AU - Solomon, Clell J.
AU - Shipman, Galen
AU - Junghans, Christoph
AU - Holladay, Daniel
AU - Stone, James M.
AU - Roberts, Luke F.
N1 - Funding Information:
The authors would like to thank the Athena++ team, in particular Kengo Tomida, Kyle Felker, and Chris White for having provided an open, well-engineered basis for Parthenon. We also thank the Kokkos team for their continued support throughout the project and John Holmen for supporting the scaling tests on Frontier. Moreover, we would like to thank Daniel Arndt, Kyle Felker, Max Katz, and Tim Williams for their contributions to this work during the Argonne GPU Virtual Hackathon 2021. Finally, we would like to thank our lovely bots, especially par-hermes, who is a very good bot. This work has been assigned a document release number LA-UR-22-21270. The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the U.S. Department of Energy through the Los Alamos National Laboratory (LANL). LANL is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001). PG acknowledges funding from LANL through Subcontract No. 615487. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 101030214. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources during the April 2022 Texascale Days. Code development, testing, and benchmarking were made possible through various computing grants, including allocations on OLCF Summit and Frontier (AST146), Jülich Supercomputing Centre JUWELS (athenapk), Stony Brook’s Ookami (BrOs091321F), and Michigan State University’s High Performance Computing Center.
Publisher Copyright:
© The Author(s) 2022.
PY - 2023/9
Y1 - 2023/9
N2 - On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance portable programming models are available, support at the application level lags behind. To address this issue, we present the performance portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code, but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model and provides various levels of abstraction, from multidimensional variables, to packages defining and separating components, to launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures, including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of 1.7 × 10¹³ zone-cycles/s on 9216 nodes (73,728 logical GPUs) at (Formula presented.) weak scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework to target exascale simulations in which the downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR.
AB - On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance portable programming models are available, support at the application level lags behind. To address this issue, we present the performance portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code, but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model and provides various levels of abstraction, from multidimensional variables, to packages defining and separating components, to launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures, including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of 1.7 × 10¹³ zone-cycles/s on 9216 nodes (73,728 logical GPUs) at (Formula presented.) weak scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework to target exascale simulations in which the downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR.
KW - Adaptive mesh refinement
KW - high-performance computing
KW - parallel computing
KW - performance portability
UR - http://www.scopus.com/inward/record.url?scp=85144188348&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85144188348&partnerID=8YFLogxK
U2 - 10.1177/10943420221143775
DO - 10.1177/10943420221143775
M3 - Article
AN - SCOPUS:85144188348
SN - 1094-3420
VL - 37
SP - 465
EP - 486
JO - International Journal of High Performance Computing Applications
JF - International Journal of High Performance Computing Applications
IS - 5
ER -