
    Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

    Sparse matrix-matrix multiplication (SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.
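
    The 3D formulation referenced above splits the summation dimension of C = AB across a third process-grid axis, so each layer computes a partial product that is then reduced. Below is a minimal serial sketch of that split-k idea, using dense NumPy blocks for clarity; the function name and grid handling are illustrative, not the paper's MPI implementation.

```python
import numpy as np

def spgemm_split_k(A, B, c=2):
    # Split the summation index k into c "layers"; each layer forms a
    # partial product over its k-slice, and a reduction over layers
    # assembles C. A real 3D SpGEMM distributes sparse blocks over a
    # p x p x c process grid and performs this reduction with MPI.
    n = A.shape[1]
    layers = np.array_split(np.arange(n), c)       # k-ranges per layer
    partials = [A[:, ks] @ B[ks, :] for ks in layers]
    return sum(partials)                           # inter-layer reduction

A, B = np.random.rand(6, 6), np.random.rand(6, 6)
assert np.allclose(spgemm_split_k(A, B, c=3), A @ B)
```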

    A Novel Partitioning Method for Accelerating the Block Cimmino Algorithm

    We propose a novel block-row partitioning method to improve the convergence rate of the block Cimmino algorithm for solving general sparse linear systems of equations. The convergence rate of the block Cimmino algorithm depends on the orthogonality among the block rows produced by the partitioning method. The proposed method takes numerical orthogonality among block rows into account through a row inner-product graph model of the coefficient matrix. In the graph partitioning formulation defined on this model, the objective of minimizing the cutsize directly corresponds to minimizing the sum of inter-block inner products between block rows, thus improving the eigenvalue spectrum of the iteration matrix. This in turn leads to a significant reduction in the number of iterations required for convergence. Extensive experiments on a large set of matrices confirm the validity of the proposed method against a state-of-the-art method.
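
    A small sketch of the row inner-product graph described above (function names are illustrative): vertices are the rows of A, and an edge (i, j) carries weight |<a_i, a_j>| whenever rows i and j are not orthogonal. Minimizing the cut then keeps strongly non-orthogonal rows in the same block, leaving the blocks nearly mutually orthogonal.

```python
import numpy as np
from scipy.sparse import csr_matrix

def row_inner_product_graph(A):
    # S[i, j] holds the inner product of rows i and j of A.
    S = (A @ A.T).tocoo()
    edges = {}
    for i, j, v in zip(S.row, S.col, S.data):
        if i < j and v != 0:                   # strict upper triangle only
            edges[(int(i), int(j))] = float(abs(v))
    return edges                               # input to a graph partitioner

A = csr_matrix(np.array([[1., 0., 2.],
                         [0., 3., 0.],
                         [1., 1., 0.]]))
print(row_inner_product_graph(A))              # {(0, 2): 1.0, (1, 2): 3.0}
```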

    SciDAC Institute: Combinatorial Scientific Computing and Petascale Simulations (CSCAPES). Final Report


    09061 Abstracts Collection -- Combinatorial Scientific Computing

    From 01.02.2009 to 06.02.2009, the Dagstuhl Seminar 09061 "Combinatorial Scientific Computing" was held at Schloss Dagstuhl -- Leibniz Center for Informatics. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are collected in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided where available.

    Deep Multilevel Graph Partitioning

    Partitioning a graph into blocks of "roughly equal" weight while cutting only a few edges is a fundamental problem in computer science with a wide range of applications. In particular, the problem is a building block in applications that require parallel processing. While the number of available cores in parallel architectures has increased significantly in recent years, state-of-the-art graph partitioning algorithms do not work well when the input needs to be partitioned into a large number of blocks. Currently available algorithms often compute highly imbalanced solutions, solutions of low quality, or have excessive running times in this case. This is because most high-quality general-purpose graph partitioners are multilevel algorithms, which perform graph coarsening to build a hierarchy of graphs, initial partitioning to compute an initial solution, and local improvement to improve the solution throughout the hierarchy. However, for a large number of blocks, the smallest graph in the hierarchy that is used for initial partitioning still has to be large. In this work, we substantially mitigate these problems by introducing deep multilevel graph partitioning and a shared-memory implementation thereof. Our scheme continues the multilevel approach deep into initial partitioning, integrating it into a framework where recursive bipartitioning and direct k-way partitioning are combined so that they can operate with high performance and quality. Our integrated approach is stronger, more flexible, arguably more elegant, and reduces bottlenecks for parallelization compared to existing multilevel approaches. For example, for a large number of blocks our algorithm is on average at least an order of magnitude faster than competing algorithms while computing partitions of comparable solution quality. At the same time, our algorithm consistently produces balanced solutions. Moreover, for a small number of blocks, our algorithms are the fastest among competing systems with comparable quality.
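
    For context on the recursive-bipartitioning half of that combination, here is the classic recursive-bisection skeleton in which a 2-way partitioner is applied repeatedly to obtain k blocks. This is a hedged toy sketch, with `bisect` standing in for any balanced 2-way partitioner; deep multilevel partitioning interleaves this recursion with the coarsening hierarchy itself, which the sketch does not show.

```python
def recursive_bipartition(nodes, k, bisect):
    # Obtain k blocks via recursive 2-way splits. `bisect(nodes, frac)`
    # must return two node sets in roughly frac : (1 - frac) proportion.
    if k == 1:
        return [nodes]
    k_left = k // 2
    left, right = bisect(nodes, k_left / k)
    return (recursive_bipartition(left, k_left, bisect)
            + recursive_bipartition(right, k - k_left, bisect))

# Toy 2-way "partitioner": split a list proportionally, ignoring edges.
def toy_bisect(nodes, frac):
    cut = round(len(nodes) * frac)
    return nodes[:cut], nodes[cut:]

print(recursive_bipartition(list(range(10)), 4, toy_bisect))
# [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```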

    A Parallel Network Flow-Based Refinement Technique for Multilevel Hypergraph Partitioning


    New Iterative Algorithms for Weighted Matching

    Matching is an important combinatorial problem with a number of practical applications. Even though there exist polynomial-time solutions to most matching problems, there are settings where these are too slow. This has led to the development of several fast approximation algorithms that in practice compute matchings very close to the optimal. The current paper introduces a new deterministic approximation algorithm named G3 for weighted matching. The algorithm is based on two main ideas. The first is to compute heavy subgraphs of the original graph on which we can compute optimal matchings. The second idea is to repeatedly merge these matchings into new matchings of even higher weight than the original ones. Both of these steps are achieved using dynamic programming in linear or close-to-linear time. We compare G3 with the randomized algorithm GPA+ROMA, which is the best known algorithm for this problem. Experiments on a large collection of graphs show that G3 is substantially faster than GPA+ROMA while computing matchings of comparable weight.
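
    For reference, the textbook baseline that fast approximation algorithms of this kind improve on is the greedy 1/2-approximation sketched below. This is not the paper's G3 algorithm (whose heavy-subgraph and merge steps are more involved), just a minimal illustration of the problem setting.

```python
def greedy_matching(edges):
    # Classic greedy 1/2-approximation for maximum weight matching:
    # scan edges in decreasing weight order and take an edge whenever
    # both of its endpoints are still free.
    # `edges` is an iterable of (weight, u, v) tuples.
    matched, matching = set(), []
    for w, u, v in sorted(edges, reverse=True):
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

edges = [(5.0, 'a', 'b'), (4.0, 'b', 'c'), (3.0, 'c', 'd')]
print(greedy_matching(edges))   # [('a', 'b'), ('c', 'd')], total weight 8
```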

    Exploiting Locality in Sparse Matrix-Matrix Multiplication on Many-Core Architectures

    Exploiting spatial and temporal locality is investigated for efficient row-by-row parallelization of the general sparse matrix-matrix multiplication (SpGEMM) operation of the form C = AB on many-core architectures. Hypergraph and bipartite graph models are proposed for 1D rowwise partitioning of matrix A to evenly distribute the work across threads with the objective of reducing the number of B-matrix words to be transferred from memory and between different caches. A hypergraph model is proposed for B-matrix column reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for C-matrix rows. A similarity graph model is proposed for B-matrix row reordering to increase temporal reuse of these accumulation array entries. The proposed models and methods are tested on a wide range of sparse matrices from real applications, and the experiments were carried out on a 60-core Intel Xeon Phi processor as well as a two-socket Xeon processor. Results show the validity of the models and methods proposed for enhancing locality in parallel SpGEMM operations.
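
    The row-by-row formulation and the accumulation arrays the abstract refers to can be made concrete with a serial Gustavson-style SpGEMM sketch over CSR inputs (names are illustrative). The dense accumulator `acc` plays the role of the thread-private temporary array whose access pattern the proposed column and row reorderings aim to make cache-friendly.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sprandom

def spgemm_rowwise(A, B):
    # Gustavson-style row-by-row SpGEMM: row i of C = A @ B is built by
    # scaling and accumulating the rows of B selected by the nonzeros of
    # A's row i. Locality of accesses to `acc` is governed by the
    # ordering of B's columns and rows.
    n_rows, n_cols = A.shape[0], B.shape[1]
    acc = np.zeros(n_cols)                    # dense accumulation array
    flag = np.zeros(n_cols, dtype=bool)       # which acc entries are live
    rows, cols, vals = [], [], []
    for i in range(n_rows):
        touched = []
        for kk in range(A.indptr[i], A.indptr[i + 1]):
            k, a_ik = A.indices[kk], A.data[kk]
            for jj in range(B.indptr[k], B.indptr[k + 1]):
                j = B.indices[jj]
                if not flag[j]:
                    flag[j] = True
                    touched.append(j)
                acc[j] += a_ik * B.data[jj]
        for j in touched:                     # gather row i, reset scratch
            rows.append(i); cols.append(j); vals.append(acc[j])
            acc[j] = 0.0; flag[j] = False
    return csr_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))

A = sprandom(40, 30, density=0.1, format='csr')
B = sprandom(30, 50, density=0.1, format='csr')
assert np.allclose(spgemm_rowwise(A, B).toarray(), (A @ B).toarray())
```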

    Understanding Coarsening for Embedding Large-Scale Graphs

    A significant portion of the data today, e.g., social networks, web connections, etc., can be modeled by graphs. A proper analysis of graphs with Machine Learning (ML) algorithms has the potential to yield far-reaching insights into many areas of research and industry. However, the irregular structure of graph data constitutes an obstacle to running ML tasks on graphs, such as link prediction, node classification, and anomaly detection. Graph embedding is a compute-intensive process of representing graphs as a set of vectors in a d-dimensional space, which in turn makes them amenable to ML tasks. Many approaches have been proposed in the literature to improve the performance of graph embedding, e.g., using distributed algorithms, accelerators, and pre-processing techniques. Graph coarsening, which can be considered a pre-processing step, is a structural approximation of a given, large graph with a smaller one. As the literature suggests, the cost of embedding significantly decreases when coarsening is employed. In this work, we thoroughly analyze the impact of coarsening quality on embedding performance, both in terms of speed and accuracy. Our experiments with a state-of-the-art, fast graph embedding tool show that there is an interplay between the coarsening decisions taken and the embedding quality.

    Comment: 10 pages, 6 figures, submitted to the 2020 IEEE International Conference on Big Data
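
    To make the coarsening step concrete, below is one level of a common matching-based coarsening scheme: greedily match each vertex to a heavy unmatched neighbor, contract matched pairs into super-vertices, and re-accumulate edge weights. This is a generic illustration under those assumptions, not the specific coarsening algorithms evaluated in the paper.

```python
def coarsen_once(num_nodes, edges):
    # `edges` maps (u, v) with u < v to an edge weight.
    # Step 1: greedy heavy-edge matching.
    match = {}
    for (u, v), w in sorted(edges.items(), key=lambda e: -e[1]):
        if u not in match and v not in match:
            match[u] = v; match[v] = u
    # Step 2: assign each vertex (or matched pair) a coarse id.
    coarse_id, next_id = {}, 0
    for u in range(num_nodes):
        if u not in coarse_id:
            coarse_id[u] = next_id
            if u in match:
                coarse_id[match[u]] = next_id
            next_id += 1
    # Step 3: re-accumulate edge weights between coarse vertices.
    coarse_edges = {}
    for (u, v), w in edges.items():
        cu, cv = coarse_id[u], coarse_id[v]
        if cu != cv:
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return next_id, coarse_edges, coarse_id

# Toy 4-cycle: the two heaviest edges get contracted,
# leaving 2 coarse nodes joined by one edge of weight 2.
edges = {(0, 1): 5, (1, 2): 1, (2, 3): 5, (0, 3): 1}
print(coarsen_once(4, edges))
```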