5,283 research outputs found

    CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication

    Full text link
    Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from the CSR to the CSR5 can be as low as the cost of a few SpMV operations. We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. For the 14 regular matrices in the suite, we achieve comparable or better performance over the previous work. For the 10 irregular matrices, the CSR5 obtains average performance improvement of 17.6\%, 28.5\%, 173.0\% and 293.3\% (up to 213.3\%, 153.6\%, 405.1\% and 943.3\%) over the best existing work on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low-overhead for format conversion. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5Comment: 12 pages, 10 figures, In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15

    A sparse octree gravitational N-body code that runs entirely on the GPU processor

    Get PDF
    We present parallel algorithms for constructing and traversing sparse octrees on graphics processing units (GPUs). The algorithms are based on parallel-scan and sort methods. To test the performance and feasibility, we implemented them in CUDA in the form of a gravitational tree-code which completely runs on the GPU.(The code is publicly available at: http://castle.strw.leidenuniv.nl/software.html) The tree construction and traverse algorithms are portable to many-core devices which have support for CUDA or OpenCL programming languages. The gravitational tree-code outperforms tuned CPU code during the tree-construction and shows a performance improvement of more than a factor 20 overall, resulting in a processing rate of more than 2.8 million particles per second.Comment: Accepted version. Published in Journal of Computational Physics. 35 pages, 12 figures, single colum

    Hardware Acceleration Using Functional Languages

    Get PDF
    Cílem této práce je prozkoumat možnosti využití funkcionálního paradigmatu pro hardwarovou akceleraci, konkrétně pro datově paralelní úlohy. Úroveň abstrakce tradičních jazyků pro popis hardwaru, jako VHDL a Verilog, přestáví stačit. Pro popis na algoritmické či behaviorální úrovni se rozmáhají jazyky původně navržené pro vývoj softwaru a modelování, jako C/C++, SystemC nebo MATLAB. Funkcionální jazyky se s těmi imperativními nemůžou měřit v rozšířenosti a oblíbenosti mezi programátory, přesto je předčí v mnoha vlastnostech, např. ve verifikovatelnosti, schopnosti zachytit inherentní paralelismus a v kompaktnosti kódu. Pro akceleraci datově paralelních výpočtů se často používají jednotky FPGA, grafické karty (GPU) a vícejádrové procesory. Praktická část této práce rozšiřuje existující knihovnu Accelerate pro počítání na grafických kartách o výstup do VHDL. Accelerate je možno chápat jako doménově specifický jazyk vestavěný do Haskellu s backendem pro prostředí NVIDIA CUDA. Rozšíření pro vysokoúrovňovou syntézu obvodů ve VHDL představené v této práci používá stejný jazyk a frontend.The aim of this thesis is to research how the functional paradigm can be used for hardware acceleration with an emphasis on data-parallel tasks. The level of abstraction of the traditional hardware description languages, such as VHDL or Verilog, is becoming to low. High-level languages from the domains of software development and modeling, such as C/C++, SystemC or MATLAB, are experiencing a boom for hardware description on the algorithmic or behavioral level. Functional Languages are not so commonly used, but they outperform imperative languages in verification, the ability to capture inherent paralellism and the compactness of code. Data-parallel task are often accelerated on FPGAs, GPUs and multicore processors. In this thesis, we use a library for general-purpose GPU programs called Accelerate and extend it to produce VHDL. Accelerate is a domain-specific language embedded into Haskell with a backend for the NVIDIA CUDA platform. We use the language and its frontend, and create a new backend for high-level synthesis of circuits in VHDL.

    Highly accelerated simulations of glassy dynamics using GPUs: caveats on limited floating-point precision

    Full text link
    Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes ranging up to about 1 million particles. Particular emphasis is put on the numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10^8 MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single-floating point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to compare with a parallelised MD simulation using 64 distributed cores.Comment: 12 pages, 7 figures, to appear in Comp. Phys. Comm., HALMD package licensed under the GPL, see http://research.colberg.org/projects/halm
    corecore