263 research outputs found

    High-level programming of stencil computations on multi-GPU systems using the SkelCL library

    Get PDF
    The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually-tuned coding using low-level approaches like OpenCL and CUDA. This makes development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach that combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard by three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons which specifically target stencil computations – MapOverlap and Stencil – and we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers competitive performance to hand-tuned OpenCL code

    Simulating the weak death of the neutron in a femtoscale universe with near-Exascale computing

    Full text link
    The fundamental particle theory called Quantum Chromodynamics (QCD) dictates everything about protons and neutrons, from their intrinsic properties to interactions that bind them into atomic nuclei. Quantities that cannot be fully resolved through experiment, such as the neutron lifetime (whose precise value is important for the existence of light-atomic elements that make the sun shine and life possible), may be understood through numerical solutions to QCD. We directly solve QCD using Lattice Gauge Theory and calculate nuclear observables such as neutron lifetime. We have developed an improved algorithm that exponentially decreases the time-to solution and applied it on the new CORAL supercomputers, Sierra and Summit. We use run-time autotuning to distribute GPU resources, achieving 20% performance at low node count. We also developed optimal application mapping through a job manager, which allows CPU and GPU jobs to be interleaved, yielding 15% of peak performance when deployed across large fractions of CORAL.Comment: 2018 Gordon Bell Finalist: 9 pages, 9 figures; v2: fixed 2 typos and appended acknowledgement

    Doctor of Philosophy

    Get PDF
    dissertationMemory access irregularities are a major bottleneck for bandwidth limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices. Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented

    Doctor of Philosophy in Computer Science

    Get PDF
    dissertationStencil computations are operations on structured grids. They are frequently found in partial differential equation solvers, making their performance critical to a range of scientific applications. On modern architectures where data movement costs dominate computation, optimizing stencil computations is a challenging task. Typically, domain scientists must reduce and orchestrate data movement to tackle the memory bandwidth and latency bottlenecks. Furthermore, optimized code must map efficiently to ever increasing parallelism on a chip. This dissertation studies several stencils with varying arithmetic intensities, thus requiring contrasting optimization strategies. Stencils traditionally have low arithmetic intensity, making their performance limited by memory bandwidth. Contemporary higher-order stencils are designed to require smaller grids, hence less memory, but are bound by increased floating-point operations. This dissertation develops communication-avoiding optimizations to reduce data movement in memory-bound stencils. For higher-order stencils, a novel transformation, partial sums, is designed to reduce the number of floating-point operations and improve register reuse. These optimizations are implemented in a compiler framework, which is further extended to generate parallel code targeting multicores and graphics processor units (GPUs). The augmented compiler framework is then combined with autotuning to productively address stencil optimization challenges. Autotuning explores a search space of possible implementations of a computation to find the optimal code for an execution context. In this dissertation, autotuning is used to compose sequences of optimizations to drive the augmented compiler framework. This compiler-directed autotuning approach is used to optimize stencils in the context of a linear solver, Geometric Multigrid (GMG). GMG uses sequences of stencil computations, and presents greater optimization challenges than isolated stencils, as interactions between stencils must also be considered. The efficacy of our approach is demonstrated by comparing the performance of generated code against manually tuned code, over commercial compiler-generated code, and against analytic performance bounds. Generated code outperforms manually optimized codes on multicores and GPUs. Against Intel's compiler on multicores, generated code achieves up to 4x speedup for stencils, and 3x for the solver. On GPUs, generated code achieves 80% of an analytically computed performance bound

    Automatic code generation and tuning for stencil kernels on modern shared memory architectures

    Get PDF
    In this paper, we present Patus, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. Patus, which stands for "Parallel Autotuned Stencils,” generates a compute kernel from a specification of the stencil operation and a strategy which describes the parallelization and optimization to be applied, and leverages the autotuning methodology to optimize strategy-specific parameters for the given hardware architectur

    Autotuning Stencil Computations with Structural Ordinal Regression Learning

    Get PDF
    Stencil computations expose a large and complex space of equivalent implementations. These computations often rely on autotuning techniques, based on iterative compilation or machine learning (ML), to achieve high performance. Iterative compilation autotuning is a challenging and time-consuming task that may be unaffordable in many scenarios. Meanwhile, traditional ML autotuning approaches exploiting classification algorithms (such as neural networks and support vector machines) face difficulties in capturing all features of large search spaces. This paper proposes a new way of automatically tuning stencil computations based on structural learning. By organizing the training data in a set of partially-sorted samples (i.e., rankings), the problem is formulated as a ranking prediction model, which translates to an ordinal regression problem. Our approach can be coupled with an iterative compilation method or used as a standalone autotuner. We demonstrate its potential by comparing it with state-of-the-art iterative compilation methods on a set of nine stencil codes and by analyzing the quality of the obtained ranking in terms of Kendall rank correlation coefficients
    corecore