A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices
We present the submatrix method, a highly parallelizable method for the
approximate calculation of inverse p-th roots of large sparse symmetric matrices, which are required in various scientific applications. We follow the
idea of Approximate Computing, allowing imprecision in the final result in
order to be able to utilize the sparsity of the input matrix and to allow
massively parallel execution. For an n x n matrix, the proposed algorithm
allows the calculations to be distributed over n nodes with little
communication overhead. The approximate result matrix exhibits the same
sparsity pattern as the input matrix, allowing for efficient reuse of allocated
data structures.
We evaluate the algorithm with respect to the error that it introduces into
calculated results, as well as its performance and scalability. We demonstrate
that the error is relatively limited for well-conditioned matrices and that
results are still valuable for error-resilient applications like
preconditioning even for ill-conditioned matrices. We discuss the execution
time and scaling of the algorithm on a theoretical level and present a
distributed implementation of the algorithm using MPI and OpenMP. We
demonstrate the scalability of this implementation by running it on a
high-performance compute cluster comprising 1024 CPU cores, showing a speedup of 665x compared to single-threaded execution.
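The column-wise core of the submatrix method can be sketched in a few lines of NumPy. This is a hedged serial illustration only: the function name, the tolerance parameter, and the dense eigendecomposition used to take each submatrix's inverse p-th root are my assumptions, and the paper's MPI/OpenMP distribution over n nodes is not shown.

```python
import numpy as np

def inv_pth_root_submatrix(A, p=2, tol=1e-12):
    """Approximate A^(-1/p) for a sparse symmetric positive definite A,
    column by column, keeping the sparsity pattern of A (sketch only)."""
    n = A.shape[0]
    R = np.zeros_like(A)
    for i in range(n):
        # indices forming the sparsity pattern of column i
        idx = np.nonzero(np.abs(A[:, i]) > tol)[0]
        sub = A[np.ix_(idx, idx)]            # small dense principal submatrix
        w, V = np.linalg.eigh(sub)           # dense eigendecomposition (assumed SPD)
        sub_root = (V * w ** (-1.0 / p)) @ V.T
        # keep only the entries belonging to column i of the result
        local_i = np.where(idx == i)[0][0]
        R[idx, i] = sub_root[:, local_i]
    return R
```

Because each column's submatrix is independent, the loop body is what would be farmed out one-column-per-node in the distributed setting; for a diagonal matrix the method is exact.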
Online Tensor Methods for Learning Latent Variable Models
We introduce an online tensor decomposition based approach for two latent variable modeling problems, namely (1) community detection, in which we learn
the latent communities that the social actors in social networks belong to, and
(2) topic modeling, in which we infer hidden topics of text articles. We
consider decomposition of moment tensors using stochastic gradient descent. We
optimize multilinear operations within SGD and avoid directly forming the tensors, saving computational and storage costs. We present optimized algorithms for two platforms. Our GPU-based implementation exploits the
parallelism of SIMD architectures to allow for maximum speed-up by a careful
optimization of storage and data transfer, whereas our CPU-based implementation
uses efficient sparse matrix computations and is suitable for large sparse
datasets. For the community detection problem, we demonstrate accuracy and
computational efficiency on Facebook, Yelp and DBLP datasets, and for the topic
modeling problem, we also demonstrate good performance on the New York Times
dataset. We compare our results to the state-of-the-art algorithms such as the
variational method, and report a gain in accuracy and a gain of several orders of magnitude in execution time.
Comment: JMLR 201
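The idea of taking SGD steps without ever forming the moment tensor can be illustrated with a hedged sketch. This is not the paper's exact update: it is the gradient of a symmetric rank-k CP fit of the empirical third moment, written purely in terms of inner products, with constant factors absorbed into the learning rate.

```python
import numpy as np

def sgd_cp_step(V, x, lr):
    """One stochastic gradient step for fitting sum_j v_j^(x3) to the
    third moment of x, using only inner products -- the d x d x d
    tensor is never formed.  V is d x k; x is one sample of shape (d,)."""
    xv = V.T @ x                      # contractions <v_j, x>, shape (k,)
    G = V @ ((V.T @ V) ** 2)          # column j holds sum_i <v_i, v_j>^2 v_i
    grad = G - np.outer(x, xv ** 2)   # constant factors absorbed into lr
    return V - lr * grad
```

Every operation is a d x k or k x k matrix product, so the cost per sample is O(dk^2) rather than the O(d^3) needed to materialize the tensor.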
Massively Parallel Latent Semantic Analyses Using a Graphics Processing Unit
Latent Semantic Analysis (LSA) aims to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever-expanding size of datasets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. A graphics processing unit (GPU) can solve some highly parallel problems much faster than a traditional sequential processor or central processing unit (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a PC cluster. Due to the GPU's application-specific architecture, harnessing the GPU's computational prowess for LSA is a great challenge. We presented a parallel LSA implementation on the GPU, using NVIDIA® Compute Unified Device Architecture and Compute Unified Basic Linear Algebra Subprograms software. The performance of this implementation is compared to a traditional LSA implementation on a CPU using an optimized Basic Linear Algebra Subprograms library. After implementation, we discovered that the GPU version of the algorithm was twice as fast for large matrices (1000 x 1000 and above) that had dimensions not divisible by 16. For large matrices that did have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version. The large variation is due to architectural benefits of the GPU for matrices divisible by 16. It should be noted that the speeds of the CPU version did not deviate from the norm when the matrix dimensions were divisible by 16. Further research is needed in order to produce a fully implementable version of LSA. With that in mind, the research we presented shows that the GPU is a viable option for increasing the speed of LSA, in terms of cost/performance ratio.
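The computation LSA performs can be sketched on the CPU with NumPy. This is a minimal reference version, not the abstract's CUDA/CUBLAS implementation; the function name and the choice to fold the singular values into the document coordinates are my own.

```python
import numpy as np

def lsa(term_doc, k):
    """Rank-k LSA: project terms and documents into a k-dimensional
    latent space via a truncated SVD of the term-document matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    doc_coords = (s[:k, None] * Vt[:k]).T    # documents in latent space
    term_coords = U[:, :k]                   # terms in latent space
    return term_coords, doc_coords
```

The product term_coords @ doc_coords.T is the best rank-k approximation of the term-document matrix; the dense SVD here is exactly the step the GPU implementation accelerates.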
A biconjugate gradient type algorithm on massively parallel architectures
The biconjugate gradient (BCG) method is the natural generalization of the classical conjugate gradient algorithm for Hermitian positive definite matrices to general non-Hermitian linear systems. Unfortunately, the original BCG algorithm is susceptible to possible breakdowns and numerical instabilities. Recently, Freund and Nachtigal have proposed a novel BCG type approach, the quasi-minimal residual method (QMR), which overcomes the problems of BCG. Here, an implementation is presented of QMR based on an s-step version of the nonsymmetric look-ahead Lanczos algorithm. The main feature of the s-step Lanczos algorithm is that, in general, all inner products, except for one, can be computed in parallel at the end of each block; this is unlike the other standard Lanczos process where inner products are generated sequentially. The resulting implementation of QMR is particularly attractive on massively parallel SIMD architectures, such as the Connection Machine
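The classical BCG iteration that this work builds on can be sketched as follows. This is the plain unpreconditioned method, without the look-ahead safeguards that QMR and the s-step Lanczos variant add, so it retains exactly the breakdown risk the abstract describes (a vanishing inner product).

```python
import numpy as np

def bicg(A, b, x0=None, tol=1e-10, maxiter=100):
    """Plain biconjugate gradient for a general square A.
    No look-ahead: can break down when an inner product vanishes."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    rt = r.copy()                    # shadow residual for A^T
    p, pt = r.copy(), rt.copy()
    rho = rt @ r
    for _ in range(maxiter):
        q, qt = A @ p, A.T @ pt
        alpha = rho / (pt @ q)       # breakdown if pt @ q == 0
        x += alpha * p
        r -= alpha * q
        rt -= alpha * qt
        if np.linalg.norm(r) < tol:
            break
        rho_new = rt @ r             # breakdown if rho_new == 0
        beta = rho_new / rho
        p = r + beta * p
        pt = rt + beta * pt
        rho = rho_new
    return x
```

Note that the two inner products per iteration depend on each other sequentially; the s-step reformulation described above exists precisely to batch such inner products for parallel evaluation.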
Fast and Robust Parametric Estimation of Jointly Sparse Channels
We consider the joint estimation of multipath channels obtained with a set of
receiving antennas and uniformly probed in the frequency domain. This scenario
fits most of the modern outdoor communication protocols for mobile access or
digital broadcasting among others.
Such channels satisfy a Sparse Common Support (SCS) property, which was used in
a previous paper to propose a Finite Rate of Innovation (FRI) based sampling
and estimation algorithm. In this contribution we improve the robustness and
computational complexity aspects of this algorithm. The method relies on projections onto Krylov subspaces to reduce complexity, and on a new criterion, the Partial Effective Rank (PER), to estimate the level of sparsity and gain robustness.
If P antennas measure a K-multipath channel with N uniformly sampled
measurements per channel, the algorithm has O(KPN log N) complexity and
an O(KPN) memory footprint instead of O(PN^3) and O(PN^2) for the direct
implementation, making it suitable for K << N. The sparsity is estimated online
based on the PER, giving the algorithm a degree of introspection: it can relinquish the sparsity assumption when sparsity is lacking. The estimation performance is tested on field measurements with synthetic AWGN, and the proposed algorithm outperforms non-sparse reconstruction in the medium-to-low SNR range (< 0 dB), increasing the rate of successful symbol decodings by 1/10th on average, and by 1/3rd in the best case. The experiments also show that the
algorithm does not perform worse than a non-sparse estimation algorithm in
non-sparse operating conditions, since it may fall back to that algorithm if the PER
criterion does not detect a sufficient level of sparsity.
The algorithm is also tested against a method assuming a "discrete" sparsity
model as in Compressed Sensing (CS). The conducted test indicates a trade-off
between speed and accuracy.
Comment: 11 pages, 9 figures, submitted to IEEE JETCAS special issue on Compressed Sensing, Sep. 201
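The idea of estimating the number of paths K from the data itself can be illustrated with a hedged stand-in for the PER criterion: a K-path channel yields frequency-domain samples that are a sum of K complex exponentials, so a Hankel matrix built from them has numerical rank K. The function name, window length, and relative threshold below are my assumptions, not the paper's definition of the PER.

```python
import numpy as np

def estimate_num_paths(h_freq, L=None, rel_tol=1e-6):
    """Estimate the number of multipath components K from uniformly
    sampled frequency-domain channel values, via the numerical rank
    of a Hankel matrix (a K-path channel yields a rank-K Hankel)."""
    N = len(h_freq)
    L = N // 2 if L is None else L
    # rows are shifted windows, so H[i, j] = h_freq[i + j] (Hankel)
    H = np.array([h_freq[i:i + N - L + 1] for i in range(L)])
    s = np.linalg.svd(H, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))
```

A full SVD is used here for clarity; the Krylov-subspace projections mentioned in the abstract exist precisely to avoid this O(PN^3) cost.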
A Distributed and Secure Algorithm for Computing Dominant SVD Based on Projection Splitting
In this paper, we propose and study a distributed and secure algorithm for
computing dominant (or truncated) singular value decompositions (SVD) of large
and distributed data matrices. We consider the scenario where each node
privately holds a subset of columns and only exchanges ''safe'' information
with other nodes in a collaborative effort to calculate a dominant SVD for the
whole matrix. In the framework of alternating direction methods of multipliers
(ADMM), we propose a novel formulation for building consensus by equalizing
subspaces spanned by splitting variables instead of equalizing the variables themselves. This
technique greatly relaxes feasibility restrictions and accelerates convergence
significantly, while at the same time yielding simple subproblems. We design
several algorithmic features, including a low-rank multiplier formula and
mechanisms for controlling subproblem solution accuracies, to increase the
algorithm's computational efficiency and reduce its communication overhead.
More importantly, the possibility appears remote, if not impossible, for a malicious node to uncover the data stored on another node from the shared quantities available in our algorithm, which is not the case for existing distributed or parallelized algorithms. We present the convergence analysis
results, including a worst-case complexity estimate, and extensive experimental
results indicating that the proposed algorithm, while safely guarding data
privacy, has strong potential to deliver cutting-edge performance, especially when communication costs are high.
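For the same column-partitioned setting, a much simpler baseline than the paper's ADMM projection-splitting scheme is distributed block power iteration, sketched below; it is not the proposed algorithm, but it shows the kind of exchange involved: each node contributes only an n x k product, never its raw columns.

```python
import numpy as np

def distributed_dominant_svd(blocks, k, iters=200, seed=0):
    """Leading k left singular vectors of A = [A_1 ... A_m], where node i
    holds column block A_i.  Each round, node i contributes A_i (A_i^T X);
    only these n x k products -- never the raw columns -- are exchanged."""
    n = blocks[0].shape[0]
    rng = np.random.default_rng(seed)
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(iters):
        Y = sum(Ai @ (Ai.T @ X) for Ai in blocks)   # = (A A^T) X
        X, _ = np.linalg.qr(Y)                      # re-orthonormalize
    return X
```

Convergence of this baseline is governed by the gap between the k-th and (k+1)-th singular values, and each round costs one all-reduce of an n x k matrix; the ADMM formulation in the abstract targets exactly this communication bottleneck.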
Using reconfigurable computing technology to accelerate matrix decomposition and applications
Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems and computer vision. Matrix decompositions are computationally expensive and their sequential implementations often fail to meet the requirements of many time-sensitive applications.
The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
• We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
• We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrary sized matrices.
• We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
• We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
• By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
• We present a configurable architecture to accelerate Homotopy l1-minimization, in which the modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application- and dimension-dependent speedups over an optimized software implementation that range from 1.5x to 43.6x in terms of computation time.
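The Givens-rotation variant mentioned for the QRD architecture can be sketched in software. Hardware pipelines these 2x2 plane rotations deeply, but the arithmetic is the same; this is a plain reference version, with the function name my own.

```python
import numpy as np

def givens_qr(A):
    """QR factorization by Givens rotations: zero out subdiagonal
    entries one at a time with 2x2 plane rotations."""
    R = A.astype(float).copy()
    m, n = R.shape
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue                      # entry already zero
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])   # rotation zeroing R[i, j]
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
    return Q, R
```

Each rotation touches only two rows, which is what makes this formulation attractive for a systolic FPGA pipeline, whereas Householder reflections update whole trailing submatrices at once.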