850 research outputs found
Macroservers: An Execution Model for DRAM Processor-In-Memory Arrays
The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in single memory accesses and supporting complex address manipulations in the memory. Furthermore, large arrays of PIMs can be arranged into a massively parallel architecture. In this report, we describe an object-based programming model based on the notion of a macroserver. Macroservers encapsulate a set of variables and methods; threads, spawned by the activation of methods, operate asynchronously on the variables' state space. Data distributions provide a mechanism for mapping large data structures across the memory region of a macroserver, while work distributions allow explicit control of bindings between threads and data. Both data and work distributuions are first-class objects of the model, supporting the dynamic management of data and threads in memory. This offers the flexibility required for fully exploiting the processing power and memory bandwidth of a PIM array, in particular for irregular and adaptive applications. Thread synchronization is based on atomic methods, condition variables, and futures. A special type of lightweight macroserver allows the formulation of flexible scheduling strategies for the access to resources, using a monitor-like mechanism
Vectorizing unstructured mesh computations for many-core architectures.
Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon-Phi on a key class of irregular applications: unstructured mesh computations. Using single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library and explore how irregular memory accesses and race conditions can be organized on different hardware. We benchmark Intel Xeon CPUs and the Xeon-Phi, using a tsunami simulation and a representative CFD benchmark. Results are compared with previous work on CPUs and NVIDIA GPUs to provide a comparison of achievable performance on current many-core systems. We show that auto-vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques but results in efficient code and near-optimal performance, two times faster than non-vectorized code. We observe that the Xeon-Phi does not provide good performance for these applications but is still comparable with a pair of mid-range Xeon chips
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
The molecular dynamics simulation package GROMACS runs efficiently on a wide
variety of hardware from commodity workstations to high performance computing
clusters. Hardware features are well exploited with a combination of SIMD,
multi-threading, and MPI-based SPMD/MPMD parallelism, while GPUs can be used as
accelerators to compute interactions offloaded from the CPU. Here we evaluate
which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most
economical way. We have assembled and benchmarked compute nodes with various
CPU/GPU combinations to identify optimal compositions in terms of raw
trajectory production rate, performance-to-price ratio, energy efficiency, and
several other criteria. Though hardware prices are naturally subject to trends
and fluctuations, general tendencies are clearly visible. Adding any type of
GPU significantly boosts a node's simulation performance. For inexpensive
consumer-class GPUs this improvement equally reflects in the
performance-to-price ratio. Although memory issues in consumer-class GPUs could
pass unnoticed since these cards do not support ECC memory, unreliable GPUs can
be sorted out with memory checking tools. Apart from the obvious determinants
for cost-efficiency like hardware expenses and raw performance, the energy
consumption of a node is a major cost factor. Over the typical hardware
lifetime until replacement of a few years, the costs for electrical power and
cooling can become larger than the costs of the hardware itself. Taking that
into account, nodes with a well-balanced ratio of CPU and consumer-class GPU
resources produce the maximum amount of GROMACS trajectory over their lifetime
Accelerating sequential programs using FastFlow and self-offloading
FastFlow is a programming environment specifically targeting cache-coherent
shared-memory multi-cores. FastFlow is implemented as a stack of C++ template
libraries built on top of lock-free (fence-free) synchronization mechanisms. In
this paper we present a further evolution of FastFlow enabling programmers to
offload part of their workload on a dynamically created software accelerator
running on unused CPUs. The offloaded function can be easily derived from
pre-existing sequential code. We emphasize in particular the effective
trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging
from traditional multi-core CPU processors, supporting a wide class of
applications but delivering moderate computing performance, to many-core GPUs,
exploiting aggressive data-parallelism and delivering higher performances for
streaming computing applications. In this scenario, code portability (and
performance portability) become necessary for easy maintainability of
applications; this is very relevant in scientific computing where code changes
are very frequent, making it tedious and prone to error to keep different code
versions aligned. In this work we present the design and optimization of a
state-of-the-art production-level LQCD Monte Carlo application, using the
directive-based OpenACC programming model. OpenACC abstracts parallel
programming to a descriptive level, relieving programmers from specifying how
codes should be mapped onto the target architecture. We describe the
implementation of a code fully written in OpenACC, and show that we are able to
target several different architectures, including state-of-the-art traditional
CPUs and GPUs, with the same code. We also measure performance, evaluating the
computing efficiency of our OpenACC code on several architectures, comparing
with GPU-specific implementations and showing that a good level of
performance-portability can be reached.Comment: 26 pages, 2 png figures, preprint of an article submitted for
consideration in International Journal of Modern Physics
OP2-Clang : a source-to-source translator using Clang/LLVM LibTooling
Domain Specific Languages or Active Library frameworks have recently emerged as an important method for gaining performance portability, where an application can be efficiently executed on a wide range of HPC architectures without significant manual modifications. Embedded DSLs such as OP2, provides an API embedded in general purpose languages such as C/C++/Fortran. They rely on source-to-source translation and code refactorization to translate the higher-level API calls to platform specific parallel implementations. OP2 targets the solution of unstructured-mesh computations, where it can generate a variety of parallel implementations for execution on architectures such as CPUs, GPUs, distributed memory clusters and heterogeneous processors making use of a wide range of platform specific optimizations. Compiler tool-chains supporting source-to-source translation of code written in mainstream languages currently lack the capabilities to carry out such wide-ranging code transformations. Clang/LLVM’s Tooling library (LibTooling) has long been touted as having such capabilities but have only demonstrated its use in simple source refactoring tasks.
In this paper we introduce OP2-Clang, a source-to-source translator based on LibTooling, for OP2’s C/C++ API, capable of generating target parallel code based on SIMD, OpenMP, CUDA and their combinations with MPI. OP2-Clang is designed to significantly reduce maintenance, particularly making it easy to be extended to generate new parallelizations and optimizations for hardware platforms. In this research, we demonstrate its capabilities including (1) the use of LibTooling’s AST matchers together with a simple strategy that use parallelization templates or skeletons to significantly reduce the complexity of generating radically different and transformed target code and (2) chart the challenges and solution to generating optimized parallelizations for OpenMP, SIMD and CUDA. Results indicate that OP2-Clang produces near-identical parallel code to that of OP2’s current source-to-source translator. We believe that the lessons learnt in OP2-Clang can be readily applied to developing other similar source-to-source translators, particularly for DSLs
- …