129 research outputs found

    GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

    Full text link
    High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.Comment: 50 pages, 14 figures, 14 table

    A High-Throughput Solver for Marginalized Graph Kernels on GPU

    Get PDF
    We present the design and optimization of a linear solver on General Purpose GPUs for the efficient and high-throughput evaluation of the marginalized graph kernel between pairs of labeled graphs. The solver implements a preconditioned conjugate gradient (PCG) method to compute the solution to a generalized Laplacian equation associated with the tensor product of two graphs. To cope with the gap between the instruction throughput and the memory bandwidth of current generation GPUs, our solver forms the tensor product linear system on-the-fly without storing it in memory when performing matrix-vector dot product operations in PCG. Such on-the-fly computation is accomplished by using threads in a warp to cooperatively stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. Besides, we propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to improve the efficiency of the sparse format.We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales

    The Reverse Cuthill-McKee Algorithm in Distributed-Memory

    Full text link
    Ordering vertices of a graph is key to minimize fill-in and data structure size in sparse direct solvers, maximize locality in iterative solvers, and improve performance in graph algorithms. Except for naturally parallelizable ordering methods such as nested dissection, many important ordering methods have not been efficiently mapped to distributed-memory architectures. In this paper, we present the first-ever distributed-memory implementation of the reverse Cuthill-McKee (RCM) algorithm for reducing the profile of a sparse matrix. Our parallelization uses a two-dimensional sparse matrix decomposition. We achieve high performance by decomposing the problem into a small number of primitives and utilizing optimized implementations of these primitives. Our implementation shows strong scaling up to 1024 cores for smaller matrices and up to 4096 cores for larger matrices

    Distributed-Memory Breadth-First Search on Massive Graphs

    Full text link
    This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm as well as the recently discovered direction optimizing algorithm. We analyze the performance and scalability trade-offs in using different local data structures such as CSR and DCSC, enabling in-node multithreading, and graph decompositions such as 1D and 2D decomposition.Comment: arXiv admin note: text overlap with arXiv:1104.451

    Graphs, Matrices, and the GraphBLAS: Seven Good Reasons

    Get PDF
    The analysis of graphs has become increasingly important to a wide range of applications. Graph analysis presents a number of unique challenges in the areas of (1) software complexity, (2) data complexity, (3) security, (4) mathematical complexity, (5) theoretical analysis, (6) serial performance, and (7) parallel performance. Implementing graph algorithms using matrix-based approaches provides a number of promising solutions to these challenges. The GraphBLAS standard (istc- bigdata.org/GraphBlas) is being developed to bring the potential of matrix based graph algorithms to the broadest possible audience. The GraphBLAS mathematically defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the GraphBLAS and describes how the GraphBLAS can be used to address many of the challenges associated with analysis of graphs.Comment: 10 pages; International Conference on Computational Science workshop on the Applications of Matrix Computational Methods in the Analysis of Modern Dat

    Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    Full text link
    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. Our algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. We provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number

    Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation

    Full text link
    Across a variety of scientific disciplines, sparse inverse covariance estimation is a popular tool for capturing the underlying dependency relationships in multivariate data. Unfortunately, most estimators are not scalable enough to handle the sizes of modern high-dimensional data sets (often on the order of terabytes), and assume Gaussian samples. To address these deficiencies, we introduce HP-CONCORD, a highly scalable optimization method for estimating a sparse inverse covariance matrix based on a regularized pseudolikelihood framework, without assuming Gaussianity. Our parallel proximal gradient method uses a novel communication-avoiding linear algebra algorithm and runs across a multi-node cluster with up to 1k nodes (24k cores), achieving parallel scalability on problems with up to ~819 billion parameters (1.28 million dimensions); even on a single node, HP-CONCORD demonstrates scalability, outperforming a state-of-the-art method. We also use HP-CONCORD to estimate the underlying dependency structure of the brain from fMRI data, and use the result to identify functional regions automatically. The results show good agreement with a clustering from the neuroscience literature.Comment: Main paper: 15 pages, appendix: 24 page

    Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

    Full text link
    Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrencies. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research

    Extreme Scale De Novo Metagenome Assembly

    Get PDF
    Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into the accurate representation of the underlying microbiomes's genomes. State-of-the-art tools require big shared memory machines and cannot handle contemporary metagenome datasets that exceed Terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and therefore can run seamlessly on shared and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms the state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand challenge metagenomes. We demonstrate the unprecedented capability of MetaHipMer by computing the first full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion reads - size 2.6 TBytes.Comment: Accepted to SC1
    • …
    corecore