96 research outputs found

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    Get PDF
    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The tasks graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014

    Power Aware Computing on GPUs

    Get PDF
    Energy and power density concerns in modern processors have led to significant computer architecture research efforts in power-aware and temperature-aware computing. With power dissipation becoming an increasingly vexing problem, power analysis of Graphical Processing Unit (GPU) and its components has become crucial for hardware and software system design. Here, we describe our technique for a coordinated measurement approach that combines real total power measurement and per-component power estimation. To identify power consumption accurately, we introduce the Activity-based Model for GPUs (AMG), from which we identify activity factors and power for microarchitectures on GPUs that will help in analyzing power tradeoffs of one component versus another using microbenchmarks. The key challenge addressed in this thesis is real-time power consumption, which can be accurately estimated using NVIDIA\u27s Management Library (NVML) through Pthreads. We validated our model using Kill-A-Watt power meter and the results are accurate within 10\%. The resulting Performance Application Programming Interface (PAPI) NVML component offers real-time total power measurements for GPUs. This thesis also compares a single NVIDIA C2075 GPU running MAGMA (Matrix Algebra on GPU and Multicore Architectures) kernels, to a 48 core AMD Istanbul CPU running LAPACK

    A fast, dense Chebyshev solver for electronic structure on GPUs

    Full text link
    Matrix diagonalization is almost always involved in computing the density matrix needed in quantum chemistry calculations. In the case of modest matrix sizes (≲\lesssim 5000), performance of traditional dense diagonalization algorithms on modern GPUs is underwhelming compared to the peak performance of these devices. This motivates the exploration of alternative algorithms better suited to these types of architectures. We newly derive, and present in detail, an existing Chebyshev expansion algorithm [W. Liang et al, J. Chem. Phys. 2003] whose number of required matrix multiplications scales with the square root of the number of terms in the expansion. Focusing on dense matrices of modest size, our implementation on GPUs results in large speed ups when compared to diagonalization. Additionally, we improve upon this existing method by capitalizing on the inherent task parallelism and concurrency in the algorithm. This improvement is implemented on GPUs by using CUDA and HIP streams via the MAGMA library and leads to a significant speed up over the serial-only approach for smaller (≲\lesssim 1000) matrix sizes. Lastly, we apply our technique to a model system with a high density of states around the Fermi level which typically presents significant challenges.Comment: Submitted to Journal of Chemical Physics Communication

    Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms

    Get PDF
    An emerging trend in processor architecture seems to indicate the doubling of the number of cores per chip every two years with same or decreased clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms. The Fast Fourier transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto the recent, CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations for the computation of multi-dimensional FFT. We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for the recent NVIDIA GPUs. In addition to the massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-bytes device memory transactions. As a result, we have achieved up to 375GFLOPS with a bandwidth of 120GB/s on the GTX 480. We further extend our methodology to deal with CPU-GPU based heterogeneous platforms for the case when the input is too large to fit on the GPU global memory. We develop optimization techniques for memory-bound, and computation-bound application. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap as much as possible these transfers with kernel execution. For memory-bounded applications, we achieve a near-peak effective PCIe bus bandwidth, 9-10GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU based software pipeline to a computation-bound application-DGEMM, and achieve the illusion of a memory of the CPU memory size and a computation throughput similar to a pure GPU

    Tasking in accelerators: performance evaluation

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.In this work, we analyze the implications and results of implementing dynamic parallelism, concurrent kernels and CUDA Graphs to solve task-oriented problems. As a benchmark we propose three different methods for solving DGEMM operation on tiled-matrices; which might be the most popular benchmark for performance analysis. For the algorithms that we study, we present significant differences in terms of data dependencies, synchronization and granularity. The main contribution of this work is determining which of the previous approaches work better for having multiple task running concurrently in a single GPU, as well as stating the main limitations and benefits of every technique. Using dynamic parallelism and CUDA Streams we were able to achieve up to 30% speedups and for CUDA Graph API up to 25x acceleration outperforming state of the art results.This project has received funding from the EPEEC project from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 801051, from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII ( TIN2015-65316-P ) and the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Pro-gramació i Entorns d’Execució Paral·lels (2014-SGR-1051 ). Finally, this project also received funding from the Spanish Ministry of Economy and Competitiveness under the Juan de la Cierva Grant Agreement No IJCI-2017-33511 , and from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska Curie grant agreement No. 749516 .Peer ReviewedPostprint (author's final draft

    Efficient computation of CPW2000 using a CPU-GPU heterogeneous platform

    Get PDF
    Dissertação de mestrado em Engenharia de InformáticaThe modelling and simulation of complex systems in natural science usually require powerfull and expensive computational resources. The study of the plane wave properties in crystals, based on quantum mechanichs pose challenging questions to computer scientists to improve the e ciency of the numerical methods and algorithms. Numerical libraries had a signi cant boost in recent years, taking advantage of multi-threaded environments. This dissertation work addresses e ciency improvements in a plane wave package, CPW2000, developed by a physicist scientist, targeted to a heterogeneous platform with multicore CPU and CUDA enabled GPU devices. The performance botlenecks were previously identifed as being the module functions with FFT computations, and the study started with the application analysis and pro ling. This study shows that (i)over 90% of the code execution time was spent in two functions, DGEMM and FFT, (ii) code ef- ciency of current numerical libraries is hard to improve, and (iii) DGEMM function calls were spread in the code, while FFT was concentrated in a single function. These features were adequately explored to develop a new code version where parts of the code are computed on a multicore CPU with others taking advantage of the GPU multistreaming and parallel computing power. Experimental results show that CPU-GPU combined solutions o er near 10x speedup on the program routines that we proposed to improve, giving us a promising future work.A modelação e simulação de sistemas complexos em áreas científicas geralmente necessita de enormes e dispendiosos recursos computacionais de processamento. O estudo das propriedades de cristais em ondas planas, com base na mecânica quântica, oferece alguns desafios aos cientistas da computação para melhorar a eficiência dos métodos numéricos e algoritmos. As bibliotecas numéricas evoluíram muito tirando vantagem de ambientes multi-threading de computação. O trabalho apresentado nesta dissertação baseia-se na melhoria da eficiência de um programa de ondas planas, o CPW2000, desenvolvido por um investigador da área da física, orientado para uma plataforma heterogénea de computação com um CPU multicore e um GPU com suporte à plataforma CUDA. As principais causas da deterioração da eficiência foram identificadas no módulo que contêm os cálculos de FFT, e o estudo começou com a análise dos tempos de execução de cada componente da aplicação. Este estudo mostra que (i) mais de 90% do tempo total de computação é dividido por duas funções, DGEMM e FFT, (ii) é difícil de melhorar a eficiência das bibliotecas numéricas atuais, e (iii) que as funções DGEMM estão distribuídas pela aplicação enquanto as funções FFT estão concentradas numa função. Estas características foram devidamente exploradas de forma a desenvolver código em que partes deste executa num CPU multicore e outras aproveitam o paralelismo e multistreaming presente nos GPU. Resultados experimentais mostram que as soluções combinadas de CPU-GPU oferecem uma melhoria de aproximadamente 10x nas funções que nos propusemos a melhorar a eficiência, culminando num trabalho futuro promissor

    Batched Linear Algebra Problems on GPU Accelerators

    Get PDF
    The emergence of multicore and heterogeneous architectures requires many linear algebra algorithms to be redesigned to take advantage of the accelerators, such as GPUs. A particularly challenging class of problems, arising in numerous applications, involves the use of linear algebra operations on many small-sized matrices. The size of these matrices is usually the same, up to a few hundred. The number of them can be thousands, even millions. Compared to large matrix problems with more data parallel computation that are well suited on GPUs, the challenges of small matrix problems lie in the low computing intensity, the large sequential operation fractions, and the big PCI-E overhead. These challenges entail redesigning the algorithms instead of merely porting the current LAPACK algorithms. We consider two classes of problems. The first is linear systems with one-sided factorizations (LU, QR, and Cholesky) and their solver, forward and backward substitution. The second is a two-sided Householder bi-diagonalization. They are challenging to develop and are highly demanded in applications. Our main efforts focus on the same-sized problems. Variable-sized problems are also considered, though to a lesser extent. Our contributions can be summarized as follows. First, we formulated a batched linear algebra framework to solve many data-parallel, small-sized problems/tasks. Second, we redesigned a set of fundamental linear algebra algorithms for high- performance, batched execution on GPU accelerators. Third, we designed batched BLAS (Basic Linear Algebra Subprograms) and proposed innovative optimization techniques for high-performance computation. Fourth, we illustrated the batched methodology on real-world applications as in the case of scaling a CFD application up to 4096 nodes on the Titan supercomputer at Oak Ridge National Laboratory (ORNL). Finally, we demonstrated the power, energy and time efficiency of using accelerators as compared to CPUs. Our solutions achieved large speedups and high energy efficiency compared to related routines in CUBLAS on NVIDIA GPUs and MKL on Intel Sandy-Bridge multicore CPUs. The modern accelerators are all Single-Instruction Multiple-Thread (SIMT) architectures. Our solutions and methods are based on NVIDIA GPUs and can be extended to other accelerators, such as the Intel Xeon Phi and AMD GPUs based on OpenCL

    A pseudospectral matrix method for time-dependent tensor fields on a spherical shell

    Full text link
    We construct a pseudospectral method for the solution of time-dependent, non-linear partial differential equations on a three-dimensional spherical shell. The problem we address is the treatment of tensor fields on the sphere. As a test case we consider the evolution of a single black hole in numerical general relativity. A natural strategy would be the expansion in tensor spherical harmonics in spherical coordinates. Instead, we consider the simpler and potentially more efficient possibility of a double Fourier expansion on the sphere for tensors in Cartesian coordinates. As usual for the double Fourier method, we employ a filter to address time-step limitations and certain stability issues. We find that a tensor filter based on spin-weighted spherical harmonics is successful, while two simplified, non-spin-weighted filters do not lead to stable evolutions. The derivatives and the filter are implemented by matrix multiplication for efficiency. A key technical point is the construction of a matrix multiplication method for the spin-weighted spherical harmonic filter. As example for the efficient parallelization of the double Fourier, spin-weighted filter method we discuss an implementation on a GPU, which achieves a speed-up of up to a factor of 20 compared to a single core CPU implementation.Comment: 33 pages, 9 figure

    LU Factorization with Partial Pivoting for a Multicore System with Accelerators

    Full text link
    • …
    corecore