277 research outputs found

    Developing performance-portable molecular dynamics kernels in Open CL

    Get PDF
    This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia’s miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs. We demonstrate that the performance bottlenecks of miniMD’s short-range force calculation kernel are the same across these architectures, and detail a number of platform- agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture

    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    Get PDF
    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.Comment: 32 pages, 11 figure

    Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation

    Get PDF
    Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent a powerful and affordable tool for scientists who look to speed up simulations of complex systems. However, porting code to such devices requires a detailed understanding of heterogeneous programming tools and effective strategies for parallelization. In this paper we present a source to source compilation approach with whole-program analysis to automatically transform single-threaded FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized kernels. The main contributions of our work are: (1) whole-source refactoring to allow any subroutine in the code to be offloaded to an accelerator. (2) Minimization of the data transfer between the host and the accelerator by eliminating redundant transfers. (3) Pragmatic auto-parallelization of the code to be offloaded to the accelerator by identification of parallelizable maps and reductions. We have validated the code transformation performance of the compiler on the NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow water component of the ocean model Gmodel; the Linear Baroclinic Model, an atmospheric climate model and Flexpart-WRF, a particle dispersion simulator. The automatic parallelization component has been tested on as 2-D Shallow Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated versions of the 2DSW and the UFLES are resp. 9x and 20x faster on GPU than the original code on CPU, in both cases this is the same performance as manually ported code.Comment: 12 pages, 5 figures, submitted to "Computers and Fluids" as full paper from ParCFD conference entr

    Towards a portable and future-proof particle-in-cell plasma physics code

    Get PDF
    We present the first reported OpenCL implementation of EPOCH3D, an extensible particle-in-cell plasma physics code developed at the University of Warwick. We document the challenges and successes of this porting effort, and compare the performance of our implementation executing on a wide variety of hardware from multiple vendors. The focus of our work is on understanding the suitability of existing algorithms for future accelerator-based architectures, and identifying the changes necessary to achieve performance portability for particle-in-cell plasma physics codes. We achieve good levels of performance with limited changes to the algorithmic behaviour of the code. However, our results suggest that a fundamental change to EPOCH3D’s current accumulation step (and its dependency on atomic operations) is necessary in order to fully utilise the massive levels of parallelism supported by emerging parallel architectures

    Contributions to the efficient use of general purpose coprocessors: kernel density estimation as case study

    Get PDF
    142 p.The high performance computing landscape is shifting from assemblies of homogeneous nodes towards heterogeneous systems, in which nodes consist of a combination of traditional out-of-order execution cores and accelerator devices. Accelerators provide greater theoretical performance compared to traditional multi-core CPUs, but exploiting their computing power remains as a challenging task.This dissertation discusses the issues that arise when trying to efficiently use general purpose accelerators. As a contribution to aid in this task, we present a thorough survey of performance modeling techniques and tools for general purpose coprocessors. Then we use as case study the statistical technique Kernel Density Estimation (KDE). KDE is a memory bound application that poses several challenges for its adaptation to the accelerator-based model. We present a novel algorithm for the computation of KDE that reduces considerably its computational complexity, called S-KDE. Furthermore, we have carried out two parallel implementations of S-KDE, one for multi and many-core processors, and another one for accelerators. The latter has been implemented in OpenCL in order to make it portable across a wide range of devices. We have evaluated the performance of each implementation of S-KDE in a variety of architectures, trying to highlight the bottlenecks and the limits that the code reaches in each device. Finally, we present an application of our S-KDE algorithm in the field of climatology: a novel methodology for the evaluation of environmental models

    Heterogeneous CPU/GPU Memory Hierarchy Analysis and Optimization

    Get PDF
    In this master thesis, we propose a scheduling reordering for heterogeneous processors based on a hysteresis detector to give some fairness and speedup to the memory request threads taking advantage of the bank level parallelism at the memory system organization
    corecore