    Algorithmic patterns for $\mathcal{H}$-matrices on many-core processors

    In this work, we consider the reformulation of hierarchical ($\mathcal{H}$) matrix algorithms for many-core processors, with a model implementation on graphics processing units (GPUs). $\mathcal{H}$-matrices approximate specific dense matrices, e.g., from discretized integral equations or kernel ridge regression, leading to log-linear time complexity in dense matrix-vector products. The parallelization of $\mathcal{H}$-matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on accelerating existing $\mathcal{H}$-matrix CPU implementations with many-core processors, here we aim to rely entirely on that processor type. As our main contribution, we introduce the parallel algorithmic patterns needed to map the full $\mathcal{H}$-matrix construction and the fast matrix-vector product to many-core hardware. Crucial ingredients are space-filling curves, parallel tree traversal, and batching of linear algebra operations. The resulting model GPU implementation, hmglib, is, to the best of the authors' knowledge, the first entirely GPU-based open-source $\mathcal{H}$-matrix library of this kind. We conclude this work with an in-depth performance analysis and a comparative performance study against a standard $\mathcal{H}$-matrix library, highlighting profound speedups of our many-core parallel approach.
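    One of the named ingredients can be illustrated in isolation: ordering points along a Z-order (Morton) space-filling curve makes spatially neighboring points contiguous in memory, which turns cluster tree construction into flat array operations that map well to GPUs. Below is a minimal NumPy sketch of 2D Morton ordering; the function names and the 16-bit quantization are our own illustrative choices, not code from hmglib.

        import numpy as np

        def interleave_bits(v, bits=16):
            # Spread the low `bits` bits of each integer so that one zero bit
            # separates consecutive bits (the classic "Part1By1" bit trick).
            v = v & ((1 << bits) - 1)
            v = (v | (v << 8)) & 0x00FF00FF
            v = (v | (v << 4)) & 0x0F0F0F0F
            v = (v | (v << 2)) & 0x33333333
            v = (v | (v << 1)) & 0x55555555
            return v

        def morton_order(points, bits=16):
            # Quantize points in [0,1)^2 to a 2^bits grid and sort by the
            # interleaved (Z-order) key; nearby points end up nearby in the order.
            q = np.clip((points * (1 << bits)).astype(np.uint32), 0, (1 << bits) - 1)
            keys = interleave_bits(q[:, 0], bits) | (interleave_bits(q[:, 1], bits) << 1)
            return np.argsort(keys, kind="stable")

        points = np.random.rand(1000, 2)
        order = morton_order(points)  # permutation that clusters nearby points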

    Exact Sparse Matrix-Vector Multiplication on GPUs and Multicore Architectures

    We propose different implementations of sparse matrix--dense vector multiplication (SpMV) for finite fields and rings $\mathbb{Z}/m\mathbb{Z}$. We take advantage of graphics processing units (GPUs) and multi-core architectures. Our aim is to improve the speed of SpMV in the LinBox library, and hence the speed of its black-box algorithms. In addition, we use this and a new parallelization of the sigma-basis algorithm in a parallel block Wiedemann rank implementation over finite fields.
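    For illustration (this is not LinBox code), an exact CSR SpMV over $\mathbb{Z}/m\mathbb{Z}$ differs from its floating-point counterpart only in that products are accumulated in integers and reduced modulo m. The sketch below assumes each row's accumulated sum stays within 64-bit range, and uses SciPy only to generate a random test matrix.

        import numpy as np
        from scipy.sparse import random as sparse_random

        def spmv_mod(indptr, indices, data, x, m):
            # CSR sparse matrix-vector product with all arithmetic in Z/mZ.
            # Accumulate in int64 and reduce once per row; this stays exact as
            # long as row_nnz * (m - 1)^2 fits in 64 bits.
            y = np.zeros(len(indptr) - 1, dtype=np.int64)
            for i in range(len(indptr) - 1):
                lo, hi = indptr[i], indptr[i + 1]
                y[i] = np.dot(data[lo:hi], x[indices[lo:hi]]) % m
            return y

        m = 97
        A = sparse_random(50, 50, density=0.1, format="csr", dtype=np.int64,
                          data_rvs=lambda n: np.random.randint(0, m, n))
        x = np.random.randint(0, m, 50).astype(np.int64)
        y = spmv_mod(A.indptr, A.indices, A.data, x, m)
        assert np.array_equal(y, (A @ x) % m)  # matches the ordinary product mod m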

    A High-Throughput Solver for Marginalized Graph Kernels on GPU

    We present the design and optimization of a linear solver on general-purpose GPUs for the efficient and high-throughput evaluation of the marginalized graph kernel between pairs of labeled graphs. The solver implements a preconditioned conjugate gradient (PCG) method to compute the solution to a generalized Laplacian equation associated with the tensor product of two graphs. To cope with the gap between the instruction throughput and the memory bandwidth of current-generation GPUs, our solver forms the tensor product linear system on the fly, without storing it in memory, when performing matrix-vector product operations in PCG. This on-the-fly computation is accomplished by using the threads in a warp to cooperatively stream the adjacency and edge-label matrices of the individual graphs in small square matrix blocks called tiles, which are staged in registers and shared memory for later reuse. Warps across a thread block can further share tiles via shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles in a coordinate format and the nonzero elements within each tile using bitmaps. In addition, we propose a new partition-based reordering algorithm for aggregating the nonzero elements of the graphs into fewer but denser tiles to improve the efficiency of the sparse format. We carry out extensive theoretical analyses of the graph tensor product primitives for tiles of various densities and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.
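    The on-the-fly idea can be sketched without any GPU machinery: a Kronecker (tensor) product matvec never needs the product matrix, since (A1 ⊗ A2) vec(X) = vec(A1 X A2^T) for the row-major reshaping X of x. The NumPy sketch below plugs this matvec into SciPy's conjugate gradient through a LinearOperator; it is a dense stand-in for the paper's tiled sparse solver, and all names are our own.

        import numpy as np
        from scipy.sparse.linalg import LinearOperator, cg

        def kron_matvec(A1, A2, x):
            # (A1 ⊗ A2) @ vec(X) = vec(A1 @ X @ A2.T) for the row-major
            # reshaping X of x; the n1*n2 x n1*n2 matrix is never formed.
            n1, n2 = A1.shape[0], A2.shape[0]
            X = x.reshape(n1, n2)
            return (A1 @ X @ A2.T).reshape(-1)

        rng = np.random.default_rng(0)
        A1 = rng.random((4, 4)); A1 = A1 + A1.T + 8 * np.eye(4)    # SPD factor
        A2 = rng.random((5, 5)); A2 = A2 + A2.T + 10 * np.eye(5)   # SPD factor
        b = rng.random(20)

        # Hand CG only the matvec; it never sees the 20x20 Kronecker matrix.
        op = LinearOperator((20, 20), matvec=lambda v: kron_matvec(A1, A2, v))
        y, info = cg(op, b)
        assert info == 0
        assert np.allclose(np.kron(A1, A2) @ y, b, atol=1e-4)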

    Design Principles for Sparse Matrix Multiplication on the GPU

    We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and on load balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory-access pattern that allows efficient access to both the input and output matrices and that is crucial to excellent SpMM performance. By combining these two ingredients, (i) merge-based load balancing and (ii) row-major coalesced memory access, we demonstrate a 4.1x peak speedup and a 31.7% geomean speedup over state-of-the-art SpMM implementations on real-world datasets.
    Comment: 16 pages, 7 figures, International European Conference on Parallel and Distributed Computing (Euro-Par) 2018
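    As a reference point for what the kernel computes (the merge-based load balancing and coalesced access are GPU-level concerns this deliberately omits), a CSR SpMM consumes the matrix in its native format, with row-major access to the dense input and output. A minimal NumPy sketch, verified against SciPy:

        import numpy as np
        from scipy.sparse import random as sparse_random

        def csr_spmm(indptr, indices, data, B):
            # Sparse (CSR) times dense: C[i, :] = sum over A's row i of
            # data[k] * B[indices[k], :]. Row-major access to B and C is the
            # pattern the paper's coalesced memory accesses exploit on the GPU.
            n_rows = len(indptr) - 1
            C = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
            for i in range(n_rows):
                for k in range(indptr[i], indptr[i + 1]):
                    C[i] += data[k] * B[indices[k]]
            return C

        A = sparse_random(60, 40, density=0.05, format="csr")
        B = np.random.rand(40, 8)
        assert np.allclose(csr_spmm(A.indptr, A.indices, A.data, B), A @ B)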