GAMER: a GPU-Accelerated Adaptive Mesh Refinement Code for Astrophysics
We present the newly developed code GAMER (GPU-accelerated Adaptive MEsh
Refinement code), which adopts a novel approach to accelerate adaptive mesh
refinement (AMR) astrophysical simulations by a large factor using the
graphics processing unit (GPU). The AMR implementation is based on a hierarchy
of grid patches organized in an oct-tree data structure. We adopt a
three-dimensional relaxing TVD scheme for the hydrodynamic solver and a
multi-level relaxation scheme for the Poisson solver. Both solvers have been
implemented on the GPU, allowing hundreds of patches to be advanced in parallel.
The overhead of data transfer between CPU and GPU is reduced by exploiting
asynchronous memory copies, and the time spent computing ghost-zone values for
each patch is hidden by overlapping it with GPU computations. We demonstrate
the accuracy of the code by performing several standard test problems in
astrophysics. GAMER is a parallel code that can be run on a multi-GPU cluster.
We measure the performance of the code by performing purely baryonic
cosmological simulations on different hardware configurations, in which
detailed timing analyses compare computations with and without GPU
acceleration. Maximum speed-up factors of 12.19 and 10.47 are
demonstrated using 1 GPU with 4096^3 effective resolution and 16 GPUs with
8192^3 effective resolution, respectively.
Comment: 60 pages, 22 figures, 3 tables. More accuracy tests are included. Accepted for publication in ApJ
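The transfer-hiding technique mentioned in the abstract (asynchronous copies overlapped with kernel execution) is not spelled out there; the following is a minimal CUDA sketch of the general pattern, not GAMER source code, with a made-up advance_patch kernel and illustrative patch sizes:

    // Minimal sketch of overlapping host-device transfers with kernel execution
    // via CUDA streams and pinned memory. Kernel, sizes, and names are illustrative.
    #include <cuda_runtime.h>

    __global__ void advance_patch(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;            // stand-in for the hydro update
    }

    int main() {
        const int kPatches = 8, kPatchSize = 1 << 16;
        float *h_buf, *d_buf;
        cudaMallocHost((void**)&h_buf, kPatches * kPatchSize * sizeof(float)); // pinned host memory
        cudaMalloc((void**)&d_buf, kPatches * kPatchSize * sizeof(float));

        cudaStream_t streams[2];
        for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

        for (int p = 0; p < kPatches; ++p) {
            cudaStream_t s = streams[p % 2];
            float* h = h_buf + p * kPatchSize;
            float* d = d_buf + p * kPatchSize;
            // Enqueue copy-in, compute, and copy-out for patch p asynchronously;
            // work queued in the two streams overlaps on the copy engines.
            cudaMemcpyAsync(d, h, kPatchSize * sizeof(float), cudaMemcpyHostToDevice, s);
            advance_patch<<<(kPatchSize + 255) / 256, 256, 0, s>>>(d, kPatchSize);
            cudaMemcpyAsync(h, d, kPatchSize * sizeof(float), cudaMemcpyDeviceToHost, s);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        return 0;
    }

The same idea allows ghost-zone data for the next batch of patches to be prepared on the CPU while the GPU is still advancing the current batch, which is the overlap the abstract describes.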
Parthenon -- a performance portable block-structured adaptive mesh refinement framework
On the path to exascale, the landscape of computer device architectures and
corresponding programming models has become much more diverse. While various
low-level performance portable programming models are available, support at the
application level lags behind. To address this issue, we present the
performance portable block-structured adaptive mesh refinement (AMR) framework
Parthenon, derived from the well-tested and widely used Athena++ astrophysical
magnetohydrodynamics code, but generalized to serve as the foundation for a
variety of downstream multi-physics codes. Parthenon adopts the Kokkos
programming model, and provides various levels of abstraction, from
multi-dimensional variables, to packages defining and separating components, to
launching of parallel compute kernels. Parthenon allocates all data in device
memory to reduce data movement, supports the logical packing of variables and
mesh blocks to reduce kernel launch overhead, and employs one-sided,
asynchronous MPI calls to reduce communication overhead in multi-node
simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong
scaling on various architectures including AMD and NVIDIA GPUs, Intel and AMD
x86 CPUs, IBM Power9 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale
on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of
zone-cycles/s on 9,216 nodes (73,728 logical GPUs) at ~92%
weak scaling parallel efficiency (starting from a single node). In combination
with being an open, collaborative project, this makes Parthenon an ideal
framework to target exascale simulations in which the downstream developers can
focus on their specific application rather than on the complexity of handling
massively-parallel, device-accelerated AMR.
Comment: 17 pages, 11 figures, accepted for publication in IJHPCA. Codes available at https://github.com/parthenon-hpc-la
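For readers unfamiliar with Kokkos, the pattern the abstract refers to (device-resident data plus portable kernel launches) looks roughly like the fragment below. This is a minimal sketch, not Parthenon code; the "density" view and the kernel body are illustrative assumptions:

    // Minimal Kokkos sketch: data lives in a device-resident View and a kernel
    // is launched with parallel_for. Names and the kernel body are illustrative.
    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
            const int n = 1 << 20;
            // Allocated in the default (device) memory space.
            Kokkos::View<double*> rho("density", n);

            Kokkos::parallel_for("init_density", n, KOKKOS_LAMBDA(const int i) {
                rho(i) = 1.0;                  // stand-in for a physics kernel
            });
            Kokkos::fence();
        }
        Kokkos::finalize();
        return 0;
    }

The same source compiles against CUDA, HIP, or host backends, which is what makes this style of code performance portable across the GPU and CPU architectures listed above.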
Scaling Deep Learning on GPU and Knights Landing clusters
The speed of deep neural network training has become a major bottleneck for
deep learning research and development. For example, training GoogLeNet on the
ImageNet dataset with one Nvidia K20 GPU takes 21 days. To speed up training,
current deep learning systems rely heavily on hardware accelerators. However,
these accelerators have limited on-chip memory compared with CPUs. To handle
large datasets, they need to fetch data from either CPU memory or remote
processors. We use both self-hosted Intel Knights Landing (KNL) clusters and
multi-GPU clusters as our target platforms. From an algorithmic perspective,
current distributed machine learning systems are mainly designed for cloud
systems. These methods are asynchronous because of the slow networks and high
fault-tolerance requirements of cloud systems. We focus on Elastic Averaging
SGD (EASGD) to design algorithms for HPC clusters. The original EASGD uses a
round-robin method for communication and updates, ordered by machine rank ID,
which is inefficient on HPC clusters.
First, we redesign four efficient algorithms for HPC systems to improve
EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD
are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild
SGD, respectively) in all our comparisons. Finally, we design Sync EASGD, which
ties for the best performance among all the methods while being deterministic.
In addition to the algorithmic improvements, we use several system-algorithm
codesign techniques to scale up the algorithms. By reducing the fraction of
time spent on communication from 87% to 14%, our Sync EASGD achieves a 5.3x
speedup over the original EASGD on the same platform. We achieve 91.5% weak
scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art
implementation.
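For context, the elastic-averaging update that all of these variants build on couples each worker's parameters to a shared center variable. The following is a sketch assuming the standard EASGD formulation; the learning rate \eta, coupling strength \rho, worker count p, local parameters x^i, and center variable \tilde{x} are notation introduced here, not taken from the abstract:

    x^i_{t+1} = x^i_t - \eta \bigl( \nabla f_i(x^i_t) + \rho\,(x^i_t - \tilde{x}_t) \bigr),
    \qquad
    \tilde{x}_{t+1} = \tilde{x}_t + \eta \rho \sum_{i=1}^{p} (x^i_t - \tilde{x}_t)

The algorithmic variants discussed above differ mainly in how and when this worker-center exchange is carried out: round-robin, asynchronously, lock-free, or as a synchronous step.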
From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions
We study the simulation of stellar mergers, which requires complex,
computationally demanding simulations. We have developed Octo-Tiger, a
finite volume grid-based hydrodynamics simulation code with Adaptive Mesh
Refinement which is unique in conserving both linear and angular momentum to
machine precision. To face the challenge of increasingly complex, diverse, and
heterogeneous HPC systems, Octo-Tiger relies on high-level programming
abstractions.
We use HPX with its futurization capabilities to ensure scalability both
between nodes and within them, and present first results replacing MPI with
libfabric, achieving up to a 2.8x speedup. We extend Octo-Tiger to heterogeneous
GPU-accelerated supercomputers, demonstrating node-level performance and
portability. We show scalability up to full-system runs on Piz Daint. For the
scenario's maximum resolution, the compute-critical parts (hydrodynamics and
gravity) achieve 68.1% parallel efficiency at 2048 nodes.
Comment: Accepted at SC1
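The abstract does not show what "futurization" means in practice. The fragment below is a rough illustration using only the C++ standard library; Octo-Tiger itself uses HPX, whose hpx::async and hpx::future play the analogous role but schedule lightweight tasks and extend across nodes. The solve_gravity_on_subgrid function is a made-up stand-in:

    // Futurization sketch: launch independent work eagerly, block only when a
    // result is actually needed. Names and the workload are illustrative.
    #include <cstdio>
    #include <future>
    #include <vector>

    double solve_gravity_on_subgrid(int id) { return 1.0 * id; }  // stand-in for real work

    int main() {
        std::vector<std::future<double>> pending;
        for (int id = 0; id < 8; ++id)
            pending.push_back(std::async(std::launch::async, solve_gravity_on_subgrid, id));

        double total = 0.0;
        for (auto& f : pending) total += f.get();   // synchronize as late as possible
        std::printf("total = %f\n", total);
        return 0;
    }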
Investigating applications portability with the Uintah DAG-based runtime system on PetaScale supercomputers
Present trends in high performance computing pose formidable challenges for applications code using multicore nodes, possibly with accelerators and/or co-processors and reduced memory, while still attaining scalability. Software frameworks that execute machine-independent applications code using a runtime system that shields users from architectural complexities offer a possible solution. The Uintah framework, for example, solves a broad class of large-scale problems on structured adaptive grids using fluid-flow solvers coupled with particle-based solids methods. Uintah executes directed acyclic graphs of computational tasks with a scalable, asynchronous, and dynamic runtime system for the CPU cores and/or accelerators/co-processors on a node. Uintah's clear separation between application and runtime code has led to scalability increases of 1000x without significant changes to application code. This methodology is tested on three leading Top500 machines, OLCF Titan, TACC Stampede, and ALCF Mira, using three diverse and challenging applications problems. This investigation of scalability with regard to the different processors and communications performance leads to the overall conclusion that the adaptive DAG-based approach provides a very powerful abstraction for solving challenging multi-scale multi-physics engineering problems on some of the largest and most powerful computers available today.
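As a toy illustration of the DAG bookkeeping such a runtime performs (not Uintah code; the task names are invented, and the ready queue is drained sequentially here, whereas Uintah dispatches ready tasks asynchronously to CPU cores and accelerators):

    // Toy task DAG: each task records how many inputs are still pending and
    // becomes runnable when that count reaches zero.
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <string>
    #include <vector>

    struct Task {
        std::string name;
        std::function<void()> work;
        int unmet_deps = 0;              // edges still waiting on upstream tasks
        std::vector<int> dependents;     // tasks to notify on completion
    };

    int main() {
        std::vector<Task> dag = {
            {"compute_flux", [] { std::puts("flux"); }},
            {"advect",       [] { std::puts("advect"); }},
            {"apply_bcs",    [] { std::puts("bcs"); }},
        };
        auto add_edge = [&](int from, int to) {
            dag[from].dependents.push_back(to);
            ++dag[to].unmet_deps;
        };
        add_edge(0, 1);                  // advect depends on compute_flux
        add_edge(1, 2);                  // apply_bcs depends on advect

        std::queue<int> ready;
        for (int i = 0; i < (int)dag.size(); ++i)
            if (dag[i].unmet_deps == 0) ready.push(i);

        while (!ready.empty()) {         // a real runtime would run these concurrently
            int t = ready.front(); ready.pop();
            dag[t].work();
            for (int d : dag[t].dependents)
                if (--dag[d].unmet_deps == 0) ready.push(d);
        }
        return 0;
    }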