541 research outputs found
A Novel Partitioning Method for Accelerating the Block Cimmino Algorithm
We propose a novel block-row partitioning method in order to improve the
convergence rate of the block Cimmino algorithm for solving general sparse
linear systems of equations. The convergence rate of the block Cimmino
algorithm depends on the orthogonality among the block rows obtained by the
partitioning method. The proposed method takes numerical orthogonality among
block rows into account by proposing a row inner-product graph model of the
coefficient matrix. In the graph partitioning formulation defined on this graph
model, the partitioning objective of minimizing the cutsize directly
corresponds to minimizing the sum of inter-block inner products between block
rows thus leading to an improvement in the eigenvalue spectrum of the iteration
matrix. This in turn leads to a significant reduction in the number of
iterations required for convergence. Extensive experiments conducted on a large
set of matrices confirm the validity of the proposed method in comparison
with a state-of-the-art partitioning method.
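As a rough illustration of the quantity such a partitioning targets, the following sketch (with illustrative names, not taken from the paper) measures the coupling between block rows as the sum of squared inter-block row inner products; a partition with smaller coupling yields block rows that are closer to mutually orthogonal, which is what improves the block Cimmino iteration:

```python
import numpy as np

def inter_block_coupling(A, parts):
    """Sum of squared inner products between rows assigned to different blocks.

    A: dense 2D array whose rows are the rows of the coefficient matrix.
    parts: parts[i] = block index of row i.
    This is (conceptually) the quantity the row inner-product graph model
    minimizes via its cutsize; the function name is illustrative.
    """
    G = A @ A.T                      # Gram matrix of pairwise row inner products
    n = A.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if parts[i] != parts[j]:
                total += G[i, j] ** 2
    return total

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0]])
# Different block-row partitions give different inter-block couplings,
# and hence different convergence behavior for block Cimmino.
print(inter_block_coupling(A, [0, 0, 1]))  # 5.0
print(inter_block_coupling(A, [0, 1, 1]))  # 8.0
```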
Efficient successor retrieval operations for aggregate query processing on clustered road networks
Get-Successors (GS), which retrieves all successors of a junction, is a kernel operation used to facilitate aggregate computations in road network queries. Efficient implementation of the GS operation is crucial since the disk access cost of this operation constitutes a considerable portion of the total query processing cost. First, we propose a new successor retrieval operation, Get-Unevaluated-Successors (GUS), which retrieves only the unevaluated successors of a given junction. The GUS operation is an efficient implementation of the GS operation in which the candidate successors to be retrieved are pruned according to the properties and state of the algorithm. Second, we propose a hypergraph-based model for clustering junctions successively retrieved by GUS operations into the same disk pages. The proposed model utilizes query logs to correctly capture the disk access cost of GUS operations. The proposed GUS operation and the associated clustering model are evaluated for two different instances of GUS operations that typically arise in Dijkstra's single-source shortest path algorithm and the incremental network expansion framework. Our simulation results show that the proposed successor retrieval operation, together with the proposed clustering hypergraph model, is quite effective in reducing the number of disk accesses during query processing.
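The pruning idea behind GUS can be sketched in the context of Dijkstra's algorithm, where junctions that have already been settled never need to be retrieved again. This is a minimal in-memory model: the paper's operation prunes disk-resident successors using the algorithm's state, which is modeled here by the `settled` set:

```python
import heapq

def get_unevaluated_successors(adj, u, settled):
    """GUS sketch: return only the successors of u the algorithm still needs.

    A plain GS would return every (neighbor, weight) pair; GUS prunes those
    already finalized, which on disk translates into fewer page accesses.
    """
    return [(v, w) for v, w in adj[u] if v not in settled]

def dijkstra(adj, src):
    """Dijkstra's single-source shortest paths using the GUS-style retrieval."""
    dist = {src: 0}
    settled = set()
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        for v, w in get_unevaluated_successors(adj, u, settled):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy road network: junction -> list of (successor, travel cost).
adj = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}
print(dijkstra(adj, 0))  # {0: 0, 2: 1, 1: 3, 3: 4}
```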
Petascaling Machine Learning Applications with MR-MPI
This whitepaper addresses the applicability of the Map/Reduce paradigm for scalable and easy parallelization of fundamental data mining approaches, with the aim of exploring and enabling the processing of terabytes of data on PRACE Tier-0 supercomputing systems. To this end, we first test the usage of the MR-MPI library, a lightweight Map/Reduce implementation that uses the MPI library for inter-process communication, on PRACE HPC systems; we then propose MR-MPI-based implementations of a number of machine learning algorithms and constructs; and we finally provide an experimental analysis measuring the scaling performance of the proposed implementations. We test the machine learning algorithms on several different datasets. The obtained results show that utilization of the Map/Reduce paradigm can be a strong enabler on the road to petascale.
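The Map/Reduce structure that MR-MPI exposes can be modeled sequentially as follows. This is a toy single-process sketch of the map, collate, and reduce phases only; MR-MPI itself distributes the collate (shuffle) over MPI ranks:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Minimal sequential model of the Map/Reduce pattern.

    map_fn(record) yields (key, value) pairs; values are collated by key,
    then reduce_fn(key, values) produces one result per key.
    """
    buckets = defaultdict(list)
    for rec in records:                 # map phase
        for key, value in map_fn(rec):
            buckets[key].append(value)  # collate (shuffle) phase
    return {k: reduce_fn(k, vs) for k, vs in buckets.items()}  # reduce phase

# Toy example: per-label counting, a kernel inside many ML algorithms
# (e.g., class priors for naive Bayes). Names are illustrative.
data = [("spam", 1), ("ham", 1), ("spam", 1)]
counts = map_reduce(data, lambda r: [r], lambda k, vs: sum(vs))
print(counts)  # {'spam': 2, 'ham': 1}
```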
Reducing Synchronization Overheads in CG-type Parallel Iterative Solvers by Embedding Point-to-point Communications into Reduction Operations
Parallel iterative solvers are widely used for solving large sparse linear systems of equations on large-scale
parallel architectures. These solvers generally contain two different types of communication operations: point-to-point
(P2P) and global collective communications. In this work, we present a computational reorganization
method that exploits a property commonly found in Krylov subspace methods. This reorganization allows
P2P and collective communications to be performed simultaneously. We exploit this opportunity by embedding the
contents of the P2P messages into the messages exchanged in the collective communications,
thereby reducing the latency overhead of the solver. Experiments on two different supercomputers with up to 2048
processors show that the proposed latency-avoiding method exhibits superior scalability, especially with
increasing numbers of processors.
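The embedding idea can be sketched conceptually: in one collective round, each process contributes a scalar to the global reduction (as in a CG inner product) and piggybacks the payloads it would otherwise send as separate P2P messages. This is a plain-Python simulation with illustrative names, not the authors' MPI implementation:

```python
def embedded_allreduce(contributions, p2p_msgs):
    """Conceptual sketch: piggyback P2P payloads on a reduction.

    contributions[p]: process p's scalar for the global sum reduction.
    p2p_msgs[p]: dict {dest: payload} of messages p would otherwise
    send as separate P2P operations. A single combined round delivers
    both, so the P2P latency is hidden inside the collective.
    """
    nprocs = len(contributions)
    global_sum = sum(contributions)          # the reduction result
    inbox = [dict() for _ in range(nprocs)]  # delivered P2P payloads
    for src in range(nprocs):
        for dest, payload in p2p_msgs[src].items():
            inbox[dest][src] = payload
    # Every process receives the reduced value plus its embedded messages.
    return [(global_sum, inbox[p]) for p in range(nprocs)]

results = embedded_allreduce(
    [1.0, 2.0, 3.0],
    [{1: "halo-0"}, {2: "halo-1"}, {0: "halo-2"}],
)
print(results[1])  # (6.0, {0: 'halo-0'})
```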
Simultaneous input and output matrix partitioning for outer-product-parallel sparse matrix-matrix multiplication
For outer-product-parallel sparse matrix-matrix multiplication (SpGEMM) of the form C=A×B, we propose three hypergraph models that achieve simultaneous partitioning of input and output matrices without any replication of input data. All three hypergraph models perform conformable one-dimensional (1D) columnwise and 1D rowwise partitioning of the input matrices A and B, respectively. The first hypergraph model performs two-dimensional (2D) nonzero-based partitioning of the output matrix, whereas the second and third models perform 1D rowwise and 1D columnwise partitioning of the output matrix, respectively. This partitioning scheme induces a two-phase parallel SpGEMM algorithm, in which communication-free local SpGEMM computations constitute the first phase and multiple single-node-accumulation operations on the local SpGEMM results constitute the second phase. In these models, the two partitioning constraints defined on the vertex weights encode balancing the computational loads of processors during the two separate phases of the parallel SpGEMM algorithm. The partitioning objective of minimizing the cutsize defined over the cut nets encodes minimizing the total volume of communication that occurs during the second phase of the parallel SpGEMM algorithm. An MPI-based parallel SpGEMM library is developed to verify the validity of our models in practice. Parallel runs of the library for a wide range of realistic SpGEMM instances on two large-scale parallel systems, JUQUEEN (an IBM BlueGene/Q system) and SuperMUC (an Intel-based cluster), show that the proposed hypergraph models attain high speedup values.
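The two-phase outer-product formulation can be sketched with a toy sparse representation (illustrative, not the paper's MPI library): phase one forms communication-free local outer products for the conformably partitioned column-of-A/row-of-B pairs assigned to each processor, and phase two accumulates the partial results:

```python
from collections import defaultdict

def outer_product_spgemm(A_cols, B_rows, part):
    """Two-phase outer-product SpGEMM sketch computing C = A x B.

    A_cols[k]: nonzeros {i: a_ik} of column k of A.
    B_rows[k]: nonzeros {j: b_kj} of row k of B.
    part[k]: which "processor" owns the conformable pair (A column k,
    B row k). Names and data layout are illustrative.
    """
    nprocs = max(part.values()) + 1
    partials = [defaultdict(float) for _ in range(nprocs)]
    for k in A_cols:                        # phase 1: local outer products
        p = part[k]
        for i, a in A_cols[k].items():
            for j, b in B_rows[k].items():
                partials[p][(i, j)] += a * b
    C = defaultdict(float)                  # phase 2: accumulate partials
    for p in range(nprocs):                 # (single-node accumulation)
        for ij, v in partials[p].items():
            C[ij] += v
    return dict(C)

A_cols = {0: {0: 1.0, 1: 2.0}, 1: {0: 3.0}}
B_rows = {0: {0: 4.0}, 1: {0: 5.0, 1: 6.0}}
print(outer_product_spgemm(A_cols, B_rows, {0: 0, 1: 1}))
```

Only the phase-two accumulation requires communication in the parallel setting, which is why the models minimize the cutsize corresponding to that phase's volume.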
Efficient fast hartley transform algorithms for hypercube-connected multicomputers
Although the fast Hartley transform (FHT) provides
efficient spectral analysis of real discrete signals, the literature
that addresses the parallelization of FHT is extremely rare. FHT
is a real transformation and does not necessitate any complex
arithmetic. On the other hand, the FHT algorithm has an irregular
computational structure, which makes efficient parallelization
harder. In this paper, we propose an efficient restructuring of the
sequential FHT algorithm which brings regularity and symmetry
to the computational structure of the FHT. Then, we propose
an efficient parallel FHT algorithm for medium-to-coarse grain
hypercube multicomputers by introducing a dynamic mapping
scheme for the restructured FHT. The proposed parallel algorithm
achieves perfect load-balance, minimizes both the number
and volume of concurrent communications, allows only nearest-neighbor
communications, and achieves in-place computation and
communication. The proposed algorithm is implemented on a 32-node
iPSC/2 hypercube multicomputer. High efficiency values
are obtained even for small-size FHT problems.
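For reference, the transform being parallelized is the discrete Hartley transform, which is purely real-valued. A naive O(N^2) version (the FHT computes the same result in O(N log N)) can be sketched as:

```python
import math

def dht(x):
    """Naive discrete Hartley transform: H[k] = sum_n x[n] * cas(2*pi*k*n/N),
    where cas(t) = cos(t) + sin(t). Entirely real arithmetic; an FHT
    computes the same transform with a fast butterfly structure.
    """
    n = len(x)
    return [
        sum(x[m] * (math.cos(2 * math.pi * k * m / n)
                    + math.sin(2 * math.pi * k * m / n))
            for m in range(n))
        for k in range(n)
    ]

x = [1.0, 2.0, 3.0, 4.0]
H = dht(x)
# The DHT is involutory up to scaling: DHT(DHT(x)) = N * x.
back = [v / len(x) for v in dht(H)]
print([round(v, 6) for v in back])  # [1.0, 2.0, 3.0, 4.0]
```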
Performance of query processing implementations in ranking-based text retrieval systems using inverted indices
Similarity calculations and document ranking form the computationally expensive parts of query processing in ranking-based text retrieval. In this work, 11 alternative implementation techniques for these calculations are presented under four different categories, and their asymptotic time and space complexities are investigated. To our knowledge, six of these techniques have not been discussed in any prior publication. Furthermore, analytical experiments are carried out on a 30 GB document collection to evaluate the practical performance of the different implementations in terms of query processing time and space consumption. The advantages and disadvantages of each technique are illustrated under different querying scenarios, and several experiments investigating the scalability of the implementations are presented.
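One basic strategy in this design space, term-at-a-time processing with per-document score accumulators over an inverted index, can be sketched as follows (details are illustrative and not tied to the paper's specific techniques):

```python
from collections import defaultdict

def score_taat(inverted_index, query_terms, k):
    """Term-at-a-time ranking sketch over an inverted index.

    inverted_index[term]: posting list of (doc_id, weight) pairs.
    Partial similarities are summed into per-document accumulators,
    then the top-k documents are returned by score.
    """
    acc = defaultdict(int)                # document score accumulators
    for term in query_terms:
        for doc, w in inverted_index.get(term, []):
            acc[doc] += w                 # add this term's contribution
    return sorted(acc.items(), key=lambda kv: -kv[1])[:k]

# Toy index with integer term weights (e.g., term frequencies).
index = {
    "sparse": [(1, 5), (2, 2)],
    "matrix": [(1, 4), (3, 7)],
}
print(score_taat(index, ["sparse", "matrix"], 2))  # [(1, 9), (3, 7)]
```

The accumulator structure (hash map, static array, priority queue, etc.) is exactly the kind of implementation choice whose time/space trade-offs such a comparison evaluates.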