83 research outputs found
Taming computational complexity: efficient and parallel SimRank optimizations on undirected graphs
SimRank is considered one of the most promising link-based ranking algorithms for evaluating the similarity of web documents in modern search engines. In this paper, we investigate the optimization of SimRank similarity computation on undirected web graphs. We first present a novel algorithm that estimates the SimRank between vertices in O(n^3 + Kn^2) time, where n is the number of vertices and K is the number of iterations. In comparison, the most efficient implementation of the SimRank algorithm in [1] takes O(Kn^3) time in the worst case. To handle large-scale computations efficiently, we also propose a parallel implementation of the SimRank algorithm on multiple processors. Experimental evaluations on both synthetic and real-life data sets demonstrate the improved computation time and parallel efficiency of the proposed techniques.
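The SimRank iteration the abstract refers to can be sketched in its naive matrix form (an O(Kn^3) baseline for comparison, not the paper's optimized O(n^3 + Kn^2) algorithm; the decay factor C and iteration count K below are illustrative choices):

```python
import numpy as np

def simrank(adj, C=0.8, K=10):
    """Naive matrix-form SimRank on an undirected graph.

    adj : (n, n) symmetric 0/1 adjacency matrix.
    Returns the (n, n) similarity matrix after K iterations.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=0)
    # Column-normalized adjacency: W[i, j] = adj[i, j] / deg(j)
    W = adj / np.where(deg == 0, 1, deg)
    S = np.eye(n)
    for _ in range(K):
        S = C * (W.T @ S @ W)      # average similarity over neighbor pairs
        np.fill_diagonal(S, 1.0)   # s(v, v) = 1 by definition
    return S

# Tiny example: path graph 0-1-2; vertices 0 and 2 share neighbor 1.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
S = simrank(A)
```

On this path graph, vertices 0 and 2 share their only neighbor, so s(0, 2) converges to C = 0.8, while s(0, 1) remains 0.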
Out-of-core macromolecular simulations on multithreaded architectures
We address the solution of large-scale eigenvalue problems that appear in the motion simulation of complex macromolecules on multithreaded platforms, consisting of multicore processors and possibly a graphics processor (GPU). In particular, we compare specialized implementations of several high-performance eigensolvers that, by relying on disk storage and out-of-core (OOC) techniques, can in principle tackle the large memory requirements of these biological problems, which in general do not fit into the main memory of current desktop machines. All these OOC eigensolvers, except for one, are composed of compute-bound (i.e., arithmetically intensive) operations, which we accelerate by exploiting the performance of current multicore processors and, in some cases, by additionally off-loading certain parts of the computation to a GPU accelerator. One of the eigensolvers is a memory-bound algorithm, which strongly constrains its performance when the data is on disk. However, this method exhibits a much lower arithmetic cost compared with its compute-bound alternatives for this particular application. Experimental results on a desktop platform, representative of current server technology, illustrate the potential of these methods to address the simulation of biological activity.
Leveraging task-parallelism in message-passing dense matrix factorizations using SMPSs
In this paper, we investigate how to exploit task-parallelism during the execution of the Cholesky factorization on clusters of multicore processors with the SMPSs programming model. Our analysis reveals that the major difficulties in adapting the ScaLAPACK code for this operation to SMPSs lie in algorithmic restrictions and the semantics of the SMPSs programming model, but also that both can be overcome with a limited programming effort. The experimental results report considerable gains in performance and scalability of the routine parallelized with SMPSs when compared with conventional approaches to executing the original ScaLAPACK implementation in parallel, as well as with two recent message-passing routines for this operation. In summary, our study opens the door to reusing message-passing legacy codes/libraries for linear algebra by introducing up-to-date techniques, such as dynamic out-of-order scheduling, that significantly upgrade their performance while avoiding a costly rewrite/reimplementation.
This research was supported by Project EU INFRA-2010-1.2.2 "TEXT: Towards EXaflop applicaTions". The researcher at BSC-CNS was supported by the HiPEAC-2 Network of Excellence (FP7/ICT 217068), the Spanish Ministry of Education (CICYT TIN2011-23283, TIN2007-60625 and CSD2007-00050), and the Generalitat de Catalunya (2009-SGR-980). The researcher at CIMNE was partially funded by the UPC postdoctoral grants under the programme "BKC5-Atracció i Fidelització de talent al BKC". The researcher at UJI was supported by project CICYT TIN2008-06570-C04-01 and FEDER.
We thank Jesus Labarta, from BSC-CNS, for helpful discussions on SMPSs and his help with the performance analysis of the codes with
Paraver. We thank Vladimir Marjanovic, also from BSC-CNS, for his help in the set-up
and tuning of the MPI/SMPSs tools on JuRoPa. Finally, we thank Rafael Mayo, from UJI, for his support in the preliminary stages of this work.
The authors gratefully acknowledge the computing time granted on the supercomputer JuRoPa at Jülich Supercomputing Centre.
Elemental: A new framework for distributed memory dense matrix computations
Abstract Parallelizing dense matrix computations to distributed memory architectures is a well-studied subject and generally considered to be among the best understood domains of parallel computing. Two packages, developed in the mid 1990s, still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape of distributed memory architectures within a single processor, these packages must be revisited since it will likely not be practical to use MPI-based implementations. Thus, this is a good time to review what lessons we have learned since the introduction of these two packages and to propose a simple yet effective alternative. Preliminary performance results show the new solution achieves considerably better performance than the previously developed libraries.
Using parallel computation to apply the singular value decomposition (SVD) in solving for large Earth gravity fields based on satellite data
Using satellite data alone to estimate an Earth gravity field introduces
the problem of an ill-conditioned system of equations. This mathematical difficulty
amplifies as the number of unknown gravity field parameters increases, requiring
a stabilization of the inversion for solution. But the number of parameters to be
estimated can also be too large to allow inversion using a sequential algorithm (one
computer processor). Therefore the challenge is two-fold. A stabilized inversion
must be performed with a parallel (multi-processor) algorithm.
Thus, new code was developed in the parallel computing infrastructure of
Parallel Linear Algebra Package (PLAPACK) to achieve the task of applying the
Singular Value Decomposition (SVD) to invert for (and stabilize) very large gravity
fields of well over 25,000 unknown parameters. This new code is named PLASS
(Parallel LArge Svd Solver). The SVD was chosen because it offers multiple
opportunities for stabilization. Poorly observed parameter corrections are removed from
the culpable eigenspace of the normal matrix of CHAMP or the singular vector
space of the upper R triangular matrix of GRACE. Solutions were stabilized based
on the removal of either eigenvalues or singular values using four standard
optimization criteria: Inspection, Relative Error, Norm minimization, and the trace
of the Mean Square Error (MSE) matrix, and with a fifth method, independently
introduced for this investigation, that optimizes removal of eigenvalues or singular
values based on Kaula’s power rule of thumb. This method is given the name “Kaula
Eigenvalue (KEV) or Kaula Singular Value (KSV) relation”. For the gravity fields
of this investigation, orbital fits, geodetic evaluations and error propagations of the
best of the resulting SVD gravity fields were performed, and shown to be comparable to the CHAMP solution obtained by the GeoForschungsZentrum (GFZ) and to
the full-rank GRACE solution obtained by the Center for Space Research (CSR).
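The stabilization idea described above, discarding poorly observed singular directions before inverting, can be sketched as a truncated-SVD least-squares solve (the relative-threshold criterion and tolerance here are illustrative stand-ins; the thesis compares several criteria, including the KEV/KSV relation):

```python
import numpy as np

def truncated_svd_solve(A, b, tol=1e-6):
    """Least-squares solve of A x = b, stabilized by discarding singular
    values below tol * s_max (a generic truncation criterion for
    illustration; tol is an assumed parameter).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tol * s[0]           # retain only well-observed directions
    # x = V diag(1/s) U^T b, restricted to the retained singular triplets
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])

# Ill-conditioned 3x3 system: the third direction is nearly unobservable,
# so its component is suppressed rather than amplified by ~1e12.
A = np.diag([1.0, 0.5, 1e-12])
b = np.array([1.0, 1.0, 1.0])
x = truncated_svd_solve(A, b)
```

Here the truncation returns the minimum-norm solution [1, 2, 0] instead of blowing up along the nearly singular direction.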
Implementation of Parallel Least-Squares Algorithms for Gravity Field Estimation
This report was prepared by Jing Xie, a graduate research associate in the Department of Civil and Environmental Engineering and Geodetic Science at the Ohio State University, under the supervision of Professor C. K. Shum. This research was partially supported by grants from the NSF Earth Sciences program: EAR-0327633, and the NASA Office of Earth Science program: NNG04GF01G and NNG04GN19G. This report was also submitted to the Graduate School of the Ohio State University as a thesis in partial fulfillment of the requirements for the Master of Science degree.
NASA/GFZ's Gravity Recovery and Climate Experiment (GRACE) twin-satellite
mission, launched in 2002 for a five-year nominal mission, has provided accurate
scientific products which help scientists gain new insights on climate signals which
manifest as temporal variations of the Earth’s gravity field. This satellite mission also
presents a significant computational challenge to analyze the large amount of data
collected to solve a massive geophysical inverse problem every month. This paper
focuses on applying parallel (primarily distributed) computing techniques capable of
rigorously inverting monthly geopotential coefficients using GRACE data. The gravity
solution is based on the energy conservation approach, which establishes a linear
relationship between the in-situ geopotential difference of two satellites and the position
and velocity vectors using the high-low (GPS to GRACE) and the low-low (GRACE
spacecrafts) satellite-to-satellite tracking data, and the accelerometer data from both
GRACE satellites. Both the direct (rigorous) inversion and the iterative (conjugate
gradient) methods are studied. Our goal is to develop numerical algorithms and a portable
distributed-computing code, which is potentially “scalable” (i.e., keeping constant
efficiency with increased problem size and number of processors), capable of efficiently
solving the GRACE problem and also applicable to other generalized large geophysical
inverse problems.
Typical monthly GRACE gravity solutions require solving spherical harmonic
coefficients complete to degree 120 (14,637 parameters) and other nuisance parameters.
The accumulation of the 259,200 monthly low-low GRACE observations (at a 0.1 Hz
sampling rate) into the normal equations matrix needs more than 55 trillion floating point
operations (FLOPs) and ~1.7 GB of central memory to store it. Its inversion adds ~1 trillion
FLOPs. To meet this computational challenge, we use a 16-node SGI 750
cluster with 32 Itanium processors running at 733 MHz to test our algorithm. We choose the
object-oriented Parallel Linear Algebra Package (PLAPACK) as the main tool and
Message Passing Interface (MPI) as the underlying communication layer to build the
parallel code. MPI parallel I/O technique is also implemented to increase the speed of
transferring data between hard drive and memory. Furthermore, we optimize both the
serial and parallel codes by carefully analyzing the cost of the numerical operations, fully
exploiting the power of the Itanium architecture and utilizing highly optimized numerical
libraries.
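The storage and operation counts quoted above follow from standard dense linear-algebra cost formulas (roughly m·n^2 multiply-adds to accumulate the normal matrix A^T A and n^3/3 for a Cholesky-based inversion; the figures below are approximate):

```python
# Back-of-the-envelope check of the figures quoted above (degree-120 field).
n_params = 14637          # spherical harmonic coefficients to degree 120
n_obs = 259200            # one month of low-low observations at 0.1 Hz

# Storing the dense normal equations matrix as 8-byte doubles:
normal_matrix_gb = n_params**2 * 8 / 1e9            # ~1.7 GB

# Accumulating N = A^T A costs about m * n^2 multiply-adds:
accumulation_flops = n_obs * n_params**2            # ~5.5e13, i.e. 55 trillion

# A Cholesky-based inversion costs on the order of n^3 / 3:
inversion_flops = n_params**3 / 3                   # ~1e12, i.e. 1 trillion
```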
For direct inversion, we tested the implementations of the Normal equations
Matrix Accumulation (NMA) method, which computes the design as well as the normal
equations matrix locally and accumulates them to global objects afterwards, and the
Design Matrix Accumulation (DMA) approach, which forms small-size design matrices
locally first and transfers them to global scale by matrix-matrix multiplication to obtain a
global normal equations matrix. The creation of the normal equations matrix takes the
majority of the entire wall clock time. Our preliminary results indicate that the NMA
method is very fast but at present cannot be used to estimate extremely high degree and
order coefficients due to the lack of central memory. The DMA method can solve for all
geopotential coefficients complete to spherical harmonic degree 120 in roughly 30
minutes using 24 CPUs. The serial implementation of the direct inverse method takes
about 7.5 hours for the same inversion problem on a single processor of the same system.
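Both accumulation strategies exploit the fact that the global normal equations matrix N = A^T A is the sum of per-block contributions; a serial miniature of that step (block count and problem sizes are illustrative, and in the parallel codes the blocks live on different processors) is:

```python
import numpy as np

# Serial miniature of the accumulation step: the global normal matrix
# N = A^T A is built from per-observation-block contributions A_b^T A_b,
# which is what both the NMA and DMA strategies rely on.
rng = np.random.default_rng(0)
m, n, n_blocks = 600, 20, 6
A = rng.standard_normal((m, n))       # full design matrix

# Each block contributes its own small normal matrix to the global sum.
N_blocked = sum(Ab.T @ Ab for Ab in np.array_split(A, n_blocks))
N_full = A.T @ A                      # reference: one-shot accumulation
```

The blocked accumulation reproduces the full normal matrix exactly, which is why the work can be distributed across processors and summed afterwards.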
In the realization of the conjugate gradient method on the distributed platform, the
preconditioner is chosen as the block diagonal part of the normal equations matrix. The
approximate computation of the variance-covariance matrix for the solution is also
implemented. With significantly fewer arithmetic operations and lower memory usage, the
conjugate gradient method only spends approximately 8 minutes (wall clock time) to
solve for the gravity field coefficients up to degree 120 using 24 CPUs after 21 iterations,
while the serial code runs roughly 3.5 hours to achieve the same results on a single
processor.
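A serial sketch of the conjugate gradient scheme described above, preconditioned with the inverse of the block-diagonal part of the normal equations matrix (block size, tolerance, and the small test system are illustrative assumptions):

```python
import numpy as np

def block_diag_pcg(N, b, block=25, tol=1e-10, max_iter=200):
    """Conjugate gradients on N x = b, preconditioned with the inverse of
    the block-diagonal part of N (simplified serial sketch)."""
    n = N.shape[0]
    # Precompute inverses of the diagonal blocks of N.
    blocks = [np.linalg.inv(N[i:i + block, i:i + block])
              for i in range(0, n, block)]

    def apply_prec(r):
        z = np.empty_like(r)
        for k, i in enumerate(range(0, n, block)):
            z[i:i + block] = blocks[k] @ r[i:i + block]
        return z

    x = np.zeros(n)
    r = b - N @ x
    z = apply_prec(r)
    p = z.copy()
    rz = r @ z
    for it in range(max_iter):
        Np = N @ p
        alpha = rz / (p @ Np)
        x += alpha * p
        r -= alpha * Np
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it + 1
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Small SPD test system standing in for the normal equations matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((400, 100))
N = A.T @ A + 1e-3 * np.eye(100)
x_true = rng.standard_normal(100)
b = N @ x_true
x, iters = block_diag_pcg(N, b)
```

As in the report, the iterative solver avoids forming or inverting the full normal matrix explicitly; only matrix-vector products and small block inverses are needed per iteration.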
Both the direct inversion method and the iterative method give good estimates of
the unknown geopotential coefficients. In this sense, the iteration approach is better for
the much shorter running time, but only an approximation of the estimated variance-covariance
matrix is provided. Scalability of the direct and iterative methods is also
analyzed in this study. Numerical results show that the NMA method and the conjugate
gradient method achieve good scalability in our simulation. While the DMA method is
not as scalable as the other two for smaller problem sizes, its efficiency improves
gradually with the increase of problem sizes and processor numbers. The developed
codes are potentially transportable across different computer platforms and applicable to
other generalized large geophysical inverse problems.