398 research outputs found
A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics
Mesoscopic simulations of hydrocarbon flow in source shales are challenging,
in part due to the heterogeneous shale pores with sizes ranging from a few
nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid
and fluid-solid interactions in nano- to micro-scale shale pores, which are
physically and chemically sophisticated, must be captured. To address those
challenges, we present a GPU-accelerated package for simulation of flow in
nano- to micro-pore networks with a many-body dissipative particle dynamics
(mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the
code offloads all intensive workloads on GPUs. Other advancements, such as
smart particle packing and no-slip boundary condition in complex pore
geometries, are also implemented for the construction and the simulation of the
realistic shale pores from 3D nanometer-resolution stack images. Our code is
validated for accuracy and compared against the CPU counterpart for speedup. In
our benchmark tests, the code delivers nearly perfect strong scaling and weak
scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge
National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU
benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device
NVLink can boost performance over PCIe by a remarkable 40\%. Lastly, we
demonstrate, through a flow simulation in realistic shale pores, that the CPU
counterpart requires 840 Power9 cores to rival the performance delivered by our
package with four V100 GPUs on ORNL's Summit architecture. This simulation
package enables quick-turnaround and high-throughput mesoscopic numerical
simulations for investigating complex flow phenomena in nano- to micro-porous
rocks with realistic pore geometries
Near Memory Acceleration on High Resolution Radio Astronomy Imaging
Modern radio telescopes like the Square Kilometer Array (SKA) will need to
process in real-time exabytes of radio-astronomical signals to construct a
high-resolution map of the sky. Near-Memory Computing (NMC) could alleviate the
performance bottlenecks due to frequent memory accesses in a state-of-the-art
radio-astronomy imaging algorithm. In this paper, we show that a sub-module
performing a two-dimensional fast Fourier transform (2D FFT) is memory bound
using CPI breakdown analysis on IBM Power9. Then, we present an NMC approach on
FPGA for 2D FFT that outperforms a CPU by up to a factor of 120x and performs
comparably to a high-end GPU, while using less bandwidth and memory
Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories
Modern computing systems are embracing hybrid memory comprising of DRAM and
non-volatile memory (NVM) to combine the best properties of both memory
technologies, achieving low latency, high reliability, and high density. A
prominent characteristic of DRAM-NVM hybrid memory is that it has NVM access
latency much higher than DRAM access latency. We call this inter-memory
asymmetry. We observe that parasitic components on a long bitline are a major
source of high latency in both DRAM and NVM, and a significant factor
contributing to high-voltage operations in NVM, which impact their reliability.
We propose an architectural change, where each long bitline in DRAM and NVM is
split into two segments by an isolation transistor. One segment can be accessed
with lower latency and operating voltage than the other. By introducing tiers,
we enable non-uniform accesses within each memory type (which we call
intra-memory asymmetry), leading to performance and reliability trade-offs in
DRAM-NVM hybrid memory. We extend existing NVM-DRAM OS in three ways. First, we
exploit both inter- and intra-memory asymmetries to allocate and migrate memory
pages between the tiers in DRAM and NVM. Second, we improve the OS's page
allocation decisions by predicting the access intensity of a newly-referenced
memory page in a program and placing it to a matching tier during its initial
allocation. This minimizes page migrations during program execution, lowering
the performance overhead. Third, we propose a solution to migrate pages between
the tiers of the same memory without transferring data over the memory channel,
minimizing channel occupancy and improving performance. Our overall approach,
which we call MNEME, to enable and exploit asymmetries in DRAM-NVM hybrid
tiered memory improves both performance and reliability for both single-core
and multi-programmed workloads.Comment: 15 pages, 29 figures, accepted at ACM SIGPLAN International Symposium
on Memory Managemen
Runtime Mechanisms to Survive New HPC Architectures: A Use-Case in Human Respiratory Simulations
Computational Fluid and Particle Dynamics (CFPD) simulations are of paramount importance for studying and improving drug effectiveness. Computational requirements of CFPD codes demand high-performance computing (HPC) resources. For these reasons we introduce and evaluate in this paper system software techniques for improving performance and tolerate load imbalance on a state-of-the-art production CFPD code. We demonstrate benefits of these techniques on Intel-, IBM-, and Arm-based HPC technologies ranked in the Top500 supercomputers, showing the importance of using mechanisms applied at runtime to improve the performance independently of
the underlying architecture. We run a real CFPD simulation of particle tracking on the human respiratory system, showing performance improvements of up to 2x, across different architectures, while applying runtime techniques and keeping constant the computational resources.This work is partially supported by the Spanish Government (SEV-2015-0493), by the Spanish Ministry of Science and Technology project (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European Mont-Blanc projects (288777, 610402 and 671697).Peer ReviewedPreprin
TensorFlow on state-of-the-art HPC clusters: a machine learning use case
The recent rapid growth of the data-flow programming paradigm enabled the development of specific architectures,
e.g., for machine learning. The most known example is the Tensor Processing Unit (TPU) by Google. Standard data-centers, however, still can not foresee large partitions dedicated to machine learning specific architectures. Within data-centers, the High-Performance Computing (HPC) clusters are highly parallel machines targeting a broad class of compute-intensive workflows, as such they can be used for tackling machine learning challenges. On top of this, HPC architectures are rapidly changing, including accelerators and instruction sets other than the classical x86 CPUs. In this blurry scenario, identifying which are the best hardware/software configurations to efficiently support machine
learning workloads on HPC clusters is not trivial. In this paper, we considered the workflow of TensorFlow for image recognition. We highlight the strong dependency of the performance in the training phase on the availability of arithmetic libraries optimized for the underlying architecture. Following the example of Intel leveraging the MKL libraries for improving the TensorFlow performance, we plugged the Arm Performance Libraries into TensorFlow and tested on an HPC cluster based on Marvell ThunderX2 CPUs. Also, we performed a scalability study on three state-of-the-art HPC clusters based on different CPU architectures, x86 Intel Skylake, Arm-v8 Marvell ThunderX2, and PowerPC IBM Power9.Postprint (author's final draft
Experiences in porting mini-applications to OpenACC and OpenMP on heterogeneous systems
This article studies mini-applications—Minisweep, GenASiS, GPP, and FF—that use computational methods commonly encountered in HPC. We have ported these applications to develop OpenACC and OpenMP versions, and evaluated their performance on Titan (Cray XK7 with K20x GPUs), Cori (Cray XC40 with Intel KNL), Summit (IBM AC922 with Volta GPUs), and Cori-GPU (Cray CS-Storm 500NX with Intel Skylake and Volta GPUs). Our goals are for these new ports to be useful to both application and compiler developers, to document and describe the lessons learned and the methodology to create optimized OpenMP and OpenACC versions, and to provide a description of possible migration paths between the two specifications. Cases where specific directives or code patterns result in improved performance for a given architecture are highlighted. We also include discussions of the functionality and maturity of the latest compilers available on the above platforms with respect to OpenACC or OpenMP implementations
Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering
The resurgence of machine learning has increased the demand for
high-performance basic linear algebra subroutines (BLAS), which have long
depended on libraries to achieve peak performance on commodity hardware.
High-performance BLAS implementations rely on a layered approach that consists
of tiling and packing layers, for data (re)organization, and micro kernels that
perform the actual computations. The creation of high-performance micro kernels
requires significant development effort to write tailored assembly code for
each architecture. This hand optimization task is complicated by the recent
introduction of matrix engines by IBM's POWER10 MMA, Intel AMX, and Arm ME to
deliver high-performance matrix operations. This paper presents a compiler-only
alternative to the use of high-performance libraries by incorporating, to the
best of our knowledge and for the first time, the automatic generation of the
layered approach into LLVM, a production compiler. Modular design of the
algorithm, such as the use of LLVM's matrix-multiply intrinsic for a clear
interface between the tiling and packing layers and the micro kernel, makes it
easy to retarget the code generation to multiple accelerators. The use of
intrinsics enables a comprehensive performance study. In processors without
hardware matrix engines, the tiling and packing delivers performance up to 22x
(Intel), for small matrices, and more than 6x (POWER9), for large matrices,
faster than PLuTo, a widely used polyhedral optimizer. The performance also
approaches high-performance libraries and is only 34% slower than OpenBLAS and
on-par with Eigen for large matrices. With MMA in POWER10 this solution is, for
large matrices, over 2.6x faster than the vector-extension solution, matches
Eigen performance, and achieves up to 96% of BLAS peak performance
- …