31 research outputs found

    Reducing Synchronization Overheads in CG-type Parallel Iterative Solvers by Embedding Point-to-point Communications into Reduction Operations

    Parallel iterative solvers are widely used in solving large sparse linear systems of equations on large-scale parallel architectures. These solvers generally contain two different types of communication operations: point-to-point (P2P) and global collective communications. In this work, we present a computational reorganization method to exploit a property that is commonly found in Krylov subspace methods. This reorganization allows P2P and collective communications to be performed simultaneously. We realize this opportunity by embedding the content of the messages of P2P communications into the messages exchanged in the collective communications in order to reduce the latency overhead of the solver. Experiments on two different supercomputers with up to 2048 processors show that the proposed latency-avoiding method exhibits superior scalability, especially as the number of processors increases.

    Analyzing and enhancing OSKI for sparse matrix-vector multiplication

    Sparse matrix-vector multiplication (SpMxV) is a kernel operation widely used in iterative linear solvers. The same sparse matrix is multiplied by a dense vector repeatedly in these solvers. Matrices with irregular sparsity patterns make it difficult to utilize cache locality effectively in SpMxV computations. In this work, we investigate single- and multiple-SpMxV frameworks for exploiting cache locality in SpMxV computations. For the single-SpMxV framework, we propose two cache-size-aware top-down row/column-reordering methods based on 1D and 2D sparse matrix partitioning by utilizing the column-net and enhancing the row-column-net hypergraph models of sparse matrices. The multiple-SpMxV framework depends on splitting a given matrix into a sum of multiple nonzero-disjoint matrices so that the SpMxV operation is performed as a sequence of multiple input- and output-dependent SpMxV operations. For the effective matrix splitting required in this framework, we propose a cache-size-aware top-down approach based on 2D sparse matrix partitioning by utilizing the row-column-net hypergraph model. The primary objective in all three methods is to maximize the exploitation of temporal locality. We evaluate the validity of our models and methods on a wide range of sparse matrices by performing actual runs using OSKI. Experimental results show that the proposed methods and models outperform state-of-the-art schemes. Comment: arXiv admin note: substantial text overlap with arXiv:1202.385
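
The multiple-SpMxV idea in the abstract above (write A as a sum of nonzero-disjoint matrices and apply them one after another) can be illustrated with a minimal sketch. The CSR layout is standard, but the helper names and the plain column cut used for splitting are assumptions made only for illustration; the abstract's actual splitting is driven by hypergraph partitioning, which is not reproduced here.

```python
def csr_spmv(indptr, indices, data, x, y=None):
    """Accumulate y += A @ x for a matrix A stored in CSR form."""
    n_rows = len(indptr) - 1
    if y is None:
        y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] += acc
    return y


def split_by_column(indptr, indices, data, col_cut):
    """Split A into two nonzero-disjoint CSR matrices A1 + A2 = A.

    A1 keeps the nonzeros in columns < col_cut, A2 keeps the rest.
    The column cut is a hypothetical stand-in for a real partitioner.
    """
    parts = ([0], [], []), ([0], [], [])  # (indptr, indices, data) per part
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            part = parts[0] if indices[k] < col_cut else parts[1]
            part[1].append(indices[k])
            part[2].append(data[k])
        for part in parts:
            part[0].append(len(part[1]))  # close row i in both parts
    return parts


# 3x4 example matrix:
# [1 0 2 0]
# [0 3 0 4]
# [5 0 0 6]
indptr, indices, data = [0, 2, 4, 6], [0, 2, 1, 3, 0, 3], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 2.0, 3.0, 4.0]

y_single = csr_spmv(indptr, indices, data, x)

a1, a2 = split_by_column(indptr, indices, data, col_cut=2)
y_multi = csr_spmv(*a1, x)            # first pass touches only x[0:2]
y_multi = csr_spmv(*a2, x, y_multi)   # second pass touches only x[2:4]
```

Each pass reads a smaller slice of the input vector, which is the kind of temporal locality the splitting aims to exploit; both orders of evaluation give the same output vector.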

    Revisiting Hypergraph Models for Sparse Matrix Partitioning


    Inverted index compression based on term and document identifier reassignment

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2008. Thesis (Master's) -- Bilkent University, 2008. Includes bibliographical references (leaves 43-46). Compression of inverted indexes has received great attention in recent years. An inverted index consists of lists of document identifiers, also referred to as posting lists, for each term. Compressing an inverted index reduces the size of the index, which also improves query performance due to the reduction in disk access times. Recent studies have shown that reassigning document identifiers has a great effect on the compression of an inverted index. In this work, we propose a novel technique that reassigns both term and document identifiers of an inverted index by transforming the matrix representation of the index into a block-diagonal form, which improves the compression ratio dramatically. We adapted the row-net hypergraph-partitioning model for the transformation into block-diagonal form, which improves the compression ratio by as much as 50%. To the best of our knowledge, this method performs more effectively than previous inverted index compression techniques. Baykan, İzzet Çağrı. M.S
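
Why identifier reassignment helps compression can be seen with a small sketch. It assumes posting lists are stored as gaps between consecutive document identifiers and coded with variable-byte encoding; both the codec and the hand-picked remapping below are assumptions for illustration, not the hypergraph-based reassignment of the thesis.

```python
def vbyte_len(n):
    """Bytes needed to variable-byte encode a positive integer (7 payload bits per byte)."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def index_size(postings):
    """Total encoded size of all posting lists when stored as d-gaps.

    Document identifiers are assumed to start at 1 so every gap is positive.
    """
    total = 0
    for doc_ids in postings.values():
        prev = 0
        for doc_id in sorted(doc_ids):
            total += vbyte_len(doc_id - prev)
            prev = doc_id
    return total


def remap(postings, new_id):
    """Apply a document identifier reassignment to every posting list."""
    return {term: [new_id[d] for d in docs] for term, docs in postings.items()}


# Two terms whose documents are scattered across the identifier space.
postings = {"a": [1, 1000, 2000], "b": [500, 1500]}

# Hypothetical reassignment that gives documents sharing a term adjacent identifiers.
new_id = {1: 1, 1000: 2, 2000: 3, 500: 4, 1500: 5}

size_before = index_size(postings)               # gaps 1, 999, 1000 and 500, 1000
size_after = index_size(remap(postings, new_id)) # gaps 1, 1, 1 and 4, 1
```

Clustering documents that share terms shrinks the gaps, so each gap fits in fewer bytes; a block-diagonal reordering pursues the same effect on the whole term-document matrix.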

    Efficient successor retrieval operations for aggregate query processing on clustered road networks

    Get-Successors (GS), which retrieves all successors of a junction, is a kernel operation used to facilitate aggregate computations in road network queries. Efficient implementation of the GS operation is crucial since the disk access cost of this operation constitutes a considerable portion of the total query processing cost. First, we propose a new successor retrieval operation, Get-Unevaluated-Successors (GUS), which retrieves only the unevaluated successors of a given junction. The GUS operation is an efficient implementation of the GS operation, where the candidate successors to be retrieved are pruned according to the properties and state of the algorithm. Second, we propose a hypergraph-based model for clustering junctions that are successively retrieved by GUS operations into the same pages. The proposed model utilizes query logs to correctly capture the disk access cost of GUS operations. The proposed GUS operation and associated clustering model are evaluated for two different instances of GUS operations, which typically arise in Dijkstra's single-source shortest path algorithm and the incremental network expansion framework. Our simulation results show that the proposed successor retrieval operation together with the proposed clustering hypergraph model is quite effective in reducing the number of disk accesses in query processing. © 2010 Published by Elsevier Inc.
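
A rough sketch of how a GUS-style operation could sit inside Dijkstra's algorithm follows. It assumes that "unevaluated" can be read as "not yet settled", and it uses an in-memory dictionary as the road network; the function names and the retrieval counter are hypothetical, and the disk-page clustering from the abstract is not modeled.

```python
import heapq


def get_unevaluated_successors(graph, junction, settled):
    """Return only those successors of `junction` that are not settled yet.

    Reading 'unevaluated' as 'not settled' is an assumption of this sketch.
    """
    return [(v, w) for v, w in graph.get(junction, {}).items() if v not in settled]


def dijkstra_with_gus(graph, source):
    """Single-source shortest paths that count how many successors are retrieved."""
    dist = {source: 0}
    settled = set()
    heap = [(0, source)]
    retrieved = 0
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        for v, w in get_unevaluated_successors(graph, u, settled):
            retrieved += 1
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist, retrieved


# Small undirected road network: junction -> {successor: edge cost}.
graph = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 2},
    "c": {"a": 4, "b": 2},
}
dist, retrieved = dijkstra_with_gus(graph, "a")
total_successors = sum(len(nbrs) for nbrs in graph.values())
```

On this network a plain GS call at every settled junction would retrieve all six successor entries, while the pruned variant retrieves three, which is the kind of saving the operation targets when each retrieval may cost a disk access.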

    A novel method for scaling iterative solvers: avoiding latency overhead of parallel sparse-matrix vector multiplies

    In parallel linear iterative solvers, sparse matrix-vector multiplication (SpMxV) incurs irregular point-to-point (P2P) communications, whereas inner product computations incur regular collective communications. These P2P communications cause an additional synchronization point with relatively high message latency costs due to small message sizes. In these solvers, each SpMxV is usually followed by an inner product computation that involves the output vector of SpMxV. Here, we exploit this property to propose a novel parallelization method that avoids the latency costs and synchronization overhead of P2P communications. Our method involves a computational and a communication rearrangement scheme. The computational rearrangement provides an alternative method for forming the input vector of SpMxV and allows P2P and collective communications to be performed in a single phase. The communication rearrangement realizes this opportunity by embedding P2P communications into global collective communication operations. The proposed method guarantees a fixed bound on the maximum number of messages communicated, regardless of the sparsity pattern of the matrix. The downside, however, is an increased message volume and a negligible amount of redundant computation. We favor reducing the message latency costs at the expense of increasing message volume. Yet, we propose two iterative-improvement-based heuristics to alleviate the increase in the volume through one-to-one task-to-processor mapping. Our experiments on two supercomputers, Cray XE6 and IBM BlueGene/Q, with up to 2,048 processors show that the proposed parallelization method exhibits superior scalable performance compared to the conventional parallelization method.
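
One way to picture embedding P2P traffic into a collective is a simulated recursive-doubling all-reduce in which each exchange also carries any vector entries that still need to cross to the partner's side. The sketch below is a sequential simulation under that assumption (a power-of-two process count and a hypercube exchange pattern); the function name and data layout are hypothetical, and it is not claimed to be the authors' exact scheme.

```python
def embedded_allreduce(partials, outbox):
    """Simulate an all-reduce that piggybacks P2P items on its exchanges.

    partials[p] is process p's local scalar (e.g. a partial inner product).
    outbox[p] is a list of (destination, payload) items process p must deliver.
    Returns the reduced value per process, the items delivered to each
    process, and the number of messages each process sends.
    """
    n_procs = len(partials)
    assert n_procs > 0 and n_procs & (n_procs - 1) == 0, "power-of-two process count assumed"
    sums = list(partials)
    held = [list(items) for items in outbox]  # items currently stored at each process
    messages_per_process = 0
    step = 1
    while step < n_procs:
        new_sums = list(sums)
        new_held = [[] for _ in range(n_procs)]
        for p in range(n_procs):
            q = p ^ step  # exchange partner in this round
            new_sums[p] = sums[p] + sums[q]
            for dest, payload in held[p]:
                # Forward the item if its destination lies on the partner's side.
                target = q if (dest ^ p) & step else p
                new_held[target].append((dest, payload))
        sums, held = new_sums, new_held
        messages_per_process += 1
        step <<= 1
    return sums, held, messages_per_process


partials = [1, 2, 3, 4]
outbox = [[(3, "x0")], [], [(1, "x2"), (3, "x2b")], []]
sums, delivered, n_messages = embedded_allreduce(partials, outbox)
```

Every process sends log2(P) messages no matter how many destinations its vector entries have, which mirrors the bounded message count described above; the price is that items are forwarded through intermediate processes, so the total volume grows.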

    Optimal block-tridiagonalization of matrices for coherent charge transport

    Numerical quantum transport calculations are commonly based on a tight-binding formulation. A wide class of quantum transport algorithms requires the tight-binding Hamiltonian to be in the form of a block-tridiagonal matrix. Here, we develop a matrix reordering algorithm based on graph partitioning techniques that yields the optimal block-tridiagonal form for quantum transport. The reordered Hamiltonian can lead to significant performance gains in transport calculations, and allows conventional two-terminal algorithms to be applied to arbitrarily complex geometries, including multi-terminal structures. The block-tridiagonalization algorithm can thus be the foundation for a generic quantum transport code, applicable to arbitrary tight-binding systems. We demonstrate the power of this approach by applying the block-tridiagonalization algorithm together with the recursive Green's function algorithm to various examples of mesoscopic transport in two-dimensional electron gases in semiconductors and graphene. Comment: 28 pages, 14 figures; submitted to Journal of Computational Physic
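
To show what a block-tridiagonal ordering means in graph terms, the sketch below groups sites by breadth-first distance from a set of starting sites. This is a deliberately simpler technique than the partitioning-based algorithm of the abstract and makes no claim of optimality; the function names and the tiny lattice are hypothetical.

```python
from collections import deque


def bfs_levels(adj, start_sites):
    """Breadth-first distance of every reachable site from the starting sites."""
    level = {u: 0 for u in start_sites}
    queue = deque(start_sites)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def level_blocks(level):
    """Group sites into blocks by level; block k couples only to blocks k-1 and k+1."""
    blocks = [[] for _ in range(max(level.values()) + 1)]
    for site in sorted(level):
        blocks[level[site]].append(site)
    return blocks


# 2x3 tight-binding lattice with nearest-neighbor hopping, sites numbered row by row:
# 0 - 1 - 2
# |   |   |
# 3 - 4 - 5
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}

level = bfs_levels(adj, start_sites=[0, 3])  # sites attached to a hypothetical left lead
blocks = level_blocks(level)
max_level_jump = max(abs(level[u] - level[v]) for u in adj for v in adj[u])
```

Because neighboring sites differ by at most one level, ordering the Hamiltonian by these blocks leaves nonzero entries only on the block diagonal and the two adjacent block off-diagonals.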