1,164 research outputs found
Matching non-uniformity for program optimizations on heterogeneous many-core systems
As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: the deepening non-uniform relations among the computing elements in both hardware and software. Besides traditional non-uniform memory accesses, much deeper non-uniformity shows in a processor, runtime, and application, exemplified by the asymmetric cache sharing, memory coalescing, and thread divergences on multicore and many-core processors. Being oblivious to the non-uniformity, current applications fail to tap into the full potential of modern computing devices.;My research presents a systematic exploration into the emerging property. It examines the existence of such a property in modern computing, its influence on computing efficiency, and the challenges for establishing a non-uniformity--aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement and a controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise of these techniques in maximizing computing throughput, especially for programs with complex data access patterns
On-the-fly memory compression for multibody algorithms.
Memory and bandwidth demands challenge developers of particle-based codes that have to scale on new architectures, as the growth of concurrency outperforms improvements in memory access facilities, as the memory per core tends to stagnate, and as communication networks cannot increase bandwidth arbitrary. We propose to analyse each particle of such a code to find out whether a hierarchical data representation storing data with reduced precision caps the memory demands without exceeding given error bounds. For admissible candidates, we perform this compression and thus reduce the pressure on the memory subsystem, lower the total memory footprint and reduce the data to be exchanged via MPI. Notably, our analysis and transformation changes the data compression dynamically, i.e. the choice of data format follows the solution characteristics, and it does not require us to alter the core simulation code
A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics
Mesoscopic simulations of hydrocarbon flow in source shales are challenging,
in part due to the heterogeneous shale pores with sizes ranging from a few
nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid
and fluid-solid interactions in nano- to micro-scale shale pores, which are
physically and chemically sophisticated, must be captured. To address those
challenges, we present a GPU-accelerated package for simulation of flow in
nano- to micro-pore networks with a many-body dissipative particle dynamics
(mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the
code offloads all intensive workloads on GPUs. Other advancements, such as
smart particle packing and no-slip boundary condition in complex pore
geometries, are also implemented for the construction and the simulation of the
realistic shale pores from 3D nanometer-resolution stack images. Our code is
validated for accuracy and compared against the CPU counterpart for speedup. In
our benchmark tests, the code delivers nearly perfect strong scaling and weak
scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge
National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU
benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device
NVLink can boost performance over PCIe by a remarkable 40\%. Lastly, we
demonstrate, through a flow simulation in realistic shale pores, that the CPU
counterpart requires 840 Power9 cores to rival the performance delivered by our
package with four V100 GPUs on ORNL's Summit architecture. This simulation
package enables quick-turnaround and high-throughput mesoscopic numerical
simulations for investigating complex flow phenomena in nano- to micro-porous
rocks with realistic pore geometries
Efficient Implementations of Molecular Dynamics Simulations for Lennard-Jones Systems
Efficient implementations of the classical molecular dynamics (MD) method for
Lennard-Jones particle systems are considered. Not only general algorithms but
also techniques that are efficient for some specific CPU architectures are also
explained. A simple spatial-decomposition-based strategy is adopted for
parallelization. By utilizing the developed code, benchmark simulations are
performed on a HITACHI SR16000/J2 system consisting of IBM POWER6 processors
which are 4.7 GHz at the National Institute for Fusion Science (NIFS) and an
SGI Altix ICE 8400EX system consisting of Intel Xeon processors which are 2.93
GHz at the Institute for Solid State Physics (ISSP), the University of Tokyo.
The parallelization efficiency of the largest run, consisting of 4.1 billion
particles with 8192 MPI processes, is about 73% relative to that of the
smallest run with 128 MPI processes at NIFS, and it is about 66% relative to
that of the smallest run with 4 MPI processes at ISSP. The factors causing the
parallel overhead are investigated. It is found that fluctuations of the
execution time of each process degrade the parallel efficiency. These
fluctuations may be due to the interference of the operating system, which is
known as OS Jitter.Comment: 33 pages, 19 figures, add references and figures are revise
Generalized full sparse tiling of loop chains
2013 Fall.Includes bibliographical references.Computer and computational scientists are tackling increasingly large and complex problems and are seeking ways of improving the performance of their codes. The key issue faced is how to reach an effective balance between parallelism and locality. In trying to reach this balance, a problem commonly encountered is that of ascertaining the data dependences. Approaches that rely on automatic extraction of data dependences are frequently stymied by complications such as interprocedural and alias analysis. Placing the dependence analysis burden upon the programmer creates a significant barrier to adoption. In this work, we present a new programming abstraction, the loop chain, that specifies a series of loops and the data they access. Given this abstraction, a compiler, inspector, or runtime optimizer can avoid the computationally expensive process of formally determining data dependences, yet still determine beneficial and legal data and iteration reorderings. One optimization method that has been previously applied to irregular scientific codes is full sparse tiling. Full sparse tiling has been used to improve the performance of a handful of scientific codes, but in each case the technique had to be applied from scratch by an expert after careful manual analysis of the possible data dependence patterns. The full sparse tiling approach was extended and generalized as part of this work to apply to any code represented by the loop chain abstraction. Using only the abstraction, the generalized algorithm can produce a new data and iteration ordering as well as a parallel execution schedule. Insight into tuning a generalized full sparse tiled application was gained through a study of the different factors influencing tile count. This work lays the foundation for an efficient autotuning approach to optimizing tile count
- …