Embracing a new era of highly efficient and productive quantum Monte Carlo simulations
QMCPACK has enabled cutting-edge materials research on supercomputers for
over a decade. It scales nearly ideally but has low single-node efficiency
because its physics-based abstractions use array-of-structures objects, which
vectorize poorly. We present a systematic approach to transform
QMCPACK to better exploit the new hardware features of modern CPUs in portable
and maintainable ways. We develop miniapps for fast prototyping and
optimization. We implement new containers in a structure-of-arrays data layout
to facilitate vectorization by the compilers. Further speedups and smaller
memory footprints are obtained by computing data on the fly with the vectorized
routines and by expanding single-precision use. All of these are seamlessly
incorporated into production QMCPACK. We demonstrate up to 4.5x speedups on
recent Intel processors and IBM Blue Gene/Q for representative workloads.
Energy consumption is reduced significantly, commensurate with the speedup
factor. Memory footprints are reduced by up to 3.8x, opening the possibility
of solving much larger problems in the future.

Comment: 12 pages, 10 figures, 2 tables, to be published at SC1
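The structure-of-arrays transformation described above can be sketched in a few lines of C++. This is an illustrative example, not QMCPACK's actual container code; the `ParticleAoS`, `ParticlesSoA`, and `shift_x` names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: each particle's fields are interleaved in memory,
// so a loop over x-coordinates strides by sizeof(ParticleAoS) bytes.
struct ParticleAoS { double x, y, z; };

// Structure-of-arrays: each field is a contiguous array, so a loop over a
// single field is unit-stride and straightforward for compilers to vectorize.
struct ParticlesSoA {
  std::vector<double> x, y, z;
  explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n) {}
  std::size_t size() const { return x.size(); }
};

// Shift all x-coordinates; the unit-stride access pattern lets the compiler
// emit packed SIMD loads and stores instead of gathers.
void shift_x(ParticlesSoA& p, double dx) {
  for (std::size_t i = 0; i < p.size(); ++i) p.x[i] += dx;
}
```

With an array-of-structures layout, the same loop would touch only one field out of every three consecutive doubles, wasting memory bandwidth and defeating packed vector instructions.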
Honing and proofing Astrophysical codes on the road to Exascale. Experiences from code modernization on many-core systems
The complexity of modern and upcoming computing architectures poses severe
challenges for code developers and application specialists, and forces them to
expose the highest possible degree of parallelism, in order to make the best
use of the available hardware. The second-generation Intel Xeon Phi
(code-named Knights Landing, henceforth KNL) is the latest many-core
system; it implements several interesting hardware features, such as a large
number of cores per node (up to 72), 512-bit-wide vector registers, and
high-bandwidth memory. The unique features of KNL make this platform a
powerful testbed for modern HPC applications. The performance of codes on KNL
is therefore a useful proxy of their readiness for future architectures. In
this work we describe the lessons learnt during the optimisation of the widely
used computational-astrophysics codes P-Gadget-3, Flash, and Echo. Moreover,
we present results for the visualisation and analysis tools VisIt and yt. These
examples show that modern architectures benefit from code optimisation at
different levels, even more than traditional multi-core systems. However, the
level of modernisation of typical community codes still needs improvement if
they are to fully utilise the resources of novel architectures.

Comment: 16 pages, 10 figures, 4 tables. To be published in Future Generation
of Computer Systems (FGCS), Special Issue on "On The Road to Exascale II:
Advances in High Performance Computing and Simulations"
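As an illustration of the kind of loop-level modernisation these abstracts describe (this is not code from the papers themselves; `scale_add` and its arguments are hypothetical names), a unit-stride loop can be annotated so the compiler maps it onto wide vector units such as KNL's 512-bit registers:

```cpp
#include <cstddef>
#include <vector>

// A unit-stride loop the compiler can map onto wide SIMD registers
// (e.g. eight doubles per 512-bit register on KNL). The pragma asserts
// that iterations are independent, so the compiler can vectorize without
// runtime dependence checks; without OpenMP the pragma is simply ignored.
void scale_add(std::vector<double>& y, const std::vector<double>& x, double a) {
  const std::size_t n = y.size();
  #pragma omp simd
  for (std::size_t i = 0; i < n; ++i) y[i] = a * y[i] + x[i];
}
```

Exposing this degree of parallelism at the innermost-loop level is one of the "different levels" of optimisation the abstract refers to; thread- and node-level parallelism sit above it.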
Memory-Efficient Object-Oriented Programming on GPUs
Object-oriented programming is often regarded as too inefficient for
high-performance computing (HPC), despite the fact that many important HPC
problems have an inherent object structure. Our goal is to bring efficient,
object-oriented programming to massively parallel SIMD architectures,
especially GPUs.
In this thesis, we develop various techniques for optimizing object-oriented
GPU code. Most notably, we identify the object-oriented Single-Method
Multiple-Objects (SMMO) programming model. We first develop an embedded C++
Structure of Arrays (SOA) data layout DSL for SMMO applications. We then design
a lock-free, dynamic memory allocator that stores allocations in SOA layout.
Finally, we show how to further optimize the memory access of SMMO applications
with memory defragmentation.

Comment: Ph.D. Thesis
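The SMMO pattern above, where one method is applied uniformly to all objects of a class, combines naturally with an SOA layout. A minimal CPU-side sketch (hypothetical names, not the thesis's DSL) looks like this:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical SMMO sketch: all instances of a conceptual "Body" class are
// stored in structure-of-arrays form, and a single method is run over every
// object. On a GPU, each loop iteration would map to one thread (one SIMD
// lane per object), with consecutive lanes reading consecutive memory.
class Bodies {
  std::vector<double> pos_, vel_;  // one contiguous array per field
 public:
  explicit Bodies(std::size_t n) : pos_(n, 0.0), vel_(n, 1.0) {}
  std::size_t size() const { return pos_.size(); }

  // The single method, applied to multiple objects in lockstep.
  void update(double dt) {
    for (std::size_t i = 0; i < pos_.size(); ++i) pos_[i] += vel_[i] * dt;
  }

  double pos(std::size_t i) const { return pos_[i]; }
};
```

Because every lane executes the same method body, control-flow divergence is minimized, and the per-field arrays give the coalesced memory accesses that SIMD architectures reward.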