931 research outputs found

    Exponential Integrators on Graphic Processing Units

    Full text link
    In this paper we revisit stencil methods on GPUs in the context of exponential integrators. We further discuss boundary conditions, in the same context, and show that simple boundary conditions (for example, homogeneous Dirichlet or homogeneous Neumann boundary conditions) do not affect the performance if implemented directly into the CUDA kernel. In addition, we show that stencil methods with position-dependent coefficients can be implemented efficiently as well. As an application, we discuss the implementation of exponential integrators for different classes of problems in a single and multi GPU setup (up to 4 GPUs). We further show that for stencil based methods such parallelization can be done very efficiently, while for some unstructured matrices the parallelization to multiple GPUs is severely limited by the throughput of the PCIe bus.Comment: To appear in: Proceedings of the 2013 International Conference on High Performance Computing Simulation (HPCS 2013), IEEE (2013

    From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

    Full text link
    Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.Comment: 18 pages, 4 figures, accepted for publication in Scientific Programmin

    A Massive Data Parallel Computational Framework for Petascale/Exascale Hybrid Computer Systems

    Full text link
    Heterogeneous systems are becoming more common on High Performance Computing (HPC) systems. Even using tools like CUDA and OpenCL it is a non-trivial task to obtain optimal performance on the GPU. Approaches to simplifying this task include Merge (a library based framework for heterogeneous multi-core systems), Zippy (a framework for parallel execution of codes on multiple GPUs), BSGP (a new programming language for general purpose computation on the GPU) and CUDA-lite (an enhancement to CUDA that transforms code based on annotations). In addition, efforts are underway to improve compiler tools for automatic parallelization and optimization of affine loop nests for GPUs and for automatic translation of OpenMP parallelized codes to CUDA. In this paper we present an alternative approach: a new computational framework for the development of massively data parallel scientific codes applications suitable for use on such petascale/exascale hybrid systems built upon the highly scalable Cactus framework. As the first non-trivial demonstration of its usefulness, we successfully developed a new 3D CFD code that achieves improved performance.Comment: Parallel Computing 2011 (ParCo2011), 30 August -- 2 September 2011, Ghent, Belgiu

    High-level programming of stencil computations on multi-GPU systems using the SkelCL library

    Get PDF
    The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually-tuned coding using low-level approaches like OpenCL and CUDA. This makes development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach that combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard by three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons which specifically target stencil computations – MapOverlap and Stencil – and we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers competitive performance to hand-tuned OpenCL code
    • …
    corecore