31 research outputs found

    Type oriented parallel programming for Exascale

    Get PDF
    Whilst there have been great advances in HPC hardware and software in recent years, the languages and models that we use to program these machines have remained much more static. This is not from a lack of effort, but instead by virtue of the fact that the foundation that many programming languages are built on is not sufficient for the level of expressivity required for parallel work. The result is an implicit trade-off between programmability and performance which is made worse due to the fact that, whilst many scientific users are experts within their own fields, they are not HPC experts. Type oriented programming looks to address this by encoding the complexity of a language via the type system. Most of the language functionality is contained within a loosely coupled type library that can be flexibly used to control many aspects such as parallelism. Due to the high level nature of this approach there is much information available during compilation which can be used for optimisation and, in the absence of type information, the compiler can apply sensible default options thus supporting both the expert programmer and novice alike. We demonstrate that, at no performance or scalability penalty when running on up to 8196 cores of a Cray XE6 system, codes written in this type oriented manner provide improved programmability. The programmer is able to write simple, implicit parallel, HPC code at a high level and then explicitly tune by adding additional type information if required.Comment: As presented at the Exascale Applications and Software Conference (EASC), 9th-11th April 201

    NVIDIA Tensor Core Programmability, Performance & Precision

    Full text link
    The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using of NVIDIA Tensor Cores.Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 201

    Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs

    Get PDF
    In this paper, we target the parallel solution of sparse linear systems via iterative Krylov subspace-based method enhanced with a block-Jacobi preconditioner on a cluster of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the preconditioned conjugate gradient method that improve the interoperability between the message-passing interface and OmpSs programming models. Specifically, we progressively integrate several communication-reduction and iteration-fusing strategies into the initial code, obtaining more efficient versions of the method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 32 nodes with 24 cores each. The experimental analysis shows that the techniques described in the paper outperform the classical method by a margin that varies between 6 and 48%, depending on the evaluation
    corecore