O2ATH: An OpenMP Offloading Toolkit for the Sunway Heterogeneous Manycore Platform
The next-generation Sunway supercomputer employs the SW26010pro processor,
which features a specialized on-chip heterogeneous architecture. Applications
with significant hotspots can exploit the much greater computational capacity
of the Sunway many-core architecture, but only through intensive manual
many-core parallelization. However, some legacy projects with large codebases,
such as CESM, ROMS and WRF, do not have significant hotspots. The cost of
manually porting such applications to the Sunway architecture is almost
unaffordable. To overcome this challenge, we have developed a toolkit named
O2ATH. O2ATH forwards GNU
OpenMP runtime library calls to Sunway's Athread library, which greatly
simplifies the parallelization work on the Sunway architecture. O2ATH enables
users to write both MPE and CPE code in a single file, and parallelization can
be achieved by utilizing OpenMP directives and attributes. In practice, O2ATH
has helped us to port two large projects, CESM and ROMS, to the CPEs of the
next-generation Sunway supercomputers via the OpenMP offload method. In the
experiments, kernel speedups range from 3 to 15 times, resulting in
whole-application speedups of 3 to 6 times. Furthermore, O2ATH requires
significantly fewer code modifications than manually crafting CPE functions.
This indicates that O2ATH can greatly enhance development efficiency when
porting or optimizing large software projects on Sunway supercomputers.
Comment: 15 pages, 6 figures, 5 tables
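As a rough illustration of the directive-based style the O2ATH abstract describes, the sketch below annotates an ordinary loop with a standard OpenMP `parallel for` pragma. The `axpy` function and its arguments are hypothetical, and the abstract does not state which directives or attributes O2ATH actually expects, so this shows only the general idea of marking a loop for parallel execution without hand-writing a separate CPE function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel (not taken from the paper): y[i] += a * x[i].
// The pragma is standard OpenMP. A compiler without OpenMP enabled ignores
// it and runs the loop serially, so the result is the same either way.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    const double* xp = x.data();
    double* yp = y.data();
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        yp[i] += a * xp[i];  // iterations are independent, so they can run in parallel
    }
}
```

The loop body stays in the same source file as its caller, which is the property the abstract highlights; how the runtime calls generated for the pragma are forwarded to Athread is internal to the toolkit and not shown here.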
HOMMEXX 1.0: a performance-portable atmospheric dynamical core for the Energy Exascale Earth System Model
We present an architecture-portable and
performant implementation of the atmospheric dynamical core (High-Order
Methods Modeling Environment, HOMME) of the Energy Exascale Earth System
Model (E3SM). The original Fortran implementation is highly performant and
scalable on conventional architectures using the Message Passing Interface
(MPI) and Open Multi-Processing (OpenMP) programming models.
We rewrite the model in C++ and use the Kokkos library to
express on-node parallelism in a largely architecture-independent
implementation. Kokkos provides an abstraction of a compute node or device,
layout-polymorphic multidimensional arrays, and parallel execution
constructs. The new implementation achieves the same or better performance on
conventional multicore computers and is portable to GPUs. We present
performance data for the original and new implementations on multiple
platforms, on up to 5400 compute nodes, and study several aspects of the
single- and multi-node performance characteristics of the new implementation
on conventional CPU (e.g., Intel Xeon), many-core CPU (e.g., Intel Xeon Phi Knights Landing),
and Nvidia V100 GPU.
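The idea of a layout-polymorphic multidimensional array mentioned in the HOMMEXX abstract can be sketched without Kokkos itself. In the minimal example below, `RowMajor`, `ColMajor`, `Array2D` and `fill_index_sum` are hypothetical stand-ins, not the Kokkos API: the kernel is written once against `(i, j)` indexing, and the layout type decides how indices map to memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins, not the Kokkos API.
// Each layout maps a 2D index (i, j) to an offset in flat storage.
struct RowMajor {  // last index is contiguous in memory
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t /*n0*/, std::size_t n1) {
        return i * n1 + j;
    }
};
struct ColMajor {  // first index is contiguous in memory
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t n0, std::size_t /*n1*/) {
        return i + j * n0;
    }
};

template <class Layout>
class Array2D {
public:
    Array2D(std::size_t n0, std::size_t n1)
        : n0_(n0), n1_(n1), data_(n0 * n1, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) {
        assert(i < n0_ && j < n1_);
        return data_[Layout::offset(i, j, n0_, n1_)];
    }
    std::size_t extent0() const { return n0_; }
    std::size_t extent1() const { return n1_; }
    const std::vector<double>& raw() const { return data_; }
private:
    std::size_t n0_, n1_;
    std::vector<double> data_;
};

// Kernel written once; it never mentions the memory layout.
template <class Layout>
void fill_index_sum(Array2D<Layout>& a) {
    for (std::size_t i = 0; i < a.extent0(); ++i)
        for (std::size_t j = 0; j < a.extent1(); ++j)
            a(i, j) = 10.0 * static_cast<double>(i) + static_cast<double>(j);
}
```

Switching the template argument changes which index is contiguous in memory without touching the kernel, which is the kind of separation that lets one source target both CPUs and GPUs.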