Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense matrix multiplication using the conventional
algorithm, where the input matrices were too large to fit in the small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as $\Omega(\#\text{arithmetic operations} / \sqrt{M})$, where M is the size
of the fast memory (or local memory in the parallel case). Here we generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization,
algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR,
eigenvalue and the SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
Comment: 27 pages, 2 tables
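To make the bandwidth bound concrete, the following is a minimal sketch that compares the lower bound (with its constant factor dropped) against the words moved by a simple blocked matrix multiplication. The traffic model, the function names, and the choice of block size b = floor(sqrt(M/3)) are illustrative assumptions, not taken from the paper.

```python
import math


def blocked_matmul_words_moved(n, M):
    """Words moved between slow and fast memory in a simple traffic model.

    Assumed model (not from the paper): C = A * B is computed block by block
    with square b x b blocks, where three blocks (one each of A, B, C) must
    fit in a fast memory of M words. If b does not divide n, the matrix is
    treated as padded to a multiple of b, so the count is an overestimate.
    """
    b = max(1, min(n, math.isqrt(M // 3)))
    blocks = math.ceil(n / b)
    # Per block of C: read and write it once (2 * b^2 words), and for each of
    # the `blocks` steps along k read one block of A and one of B (2 * b^2).
    per_c_block = 2 * b * b + blocks * 2 * b * b
    return blocks * blocks * per_c_block


def lower_bound_words(n, M):
    """#arithmetic operations / sqrt(M), taking n^3 as the operation count
    and dropping the constant hidden in the Omega notation."""
    return n ** 3 / math.sqrt(M)


if __name__ == "__main__":
    n, M = 1024, 3 * 64 * 64
    moved = blocked_matmul_words_moved(n, M)
    bound = lower_bound_words(n, M)
    print(f"words moved: {moved}, bound (no constant): {bound:.0f}, "
          f"ratio: {moved / bound:.2f}")
```

In this model the blocked algorithm stays within a small constant factor of the bound, which is the sense in which such an algorithm is said to attain it.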
Computational linear algebra over finite fields
We present here algorithms for efficient computation of linear algebra
problems over finite fields
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores have attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR
Comment: 22 pages, 8 figures, Published at Parallel Computing (PARCO)
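As background for the abstract above, here is a minimal sequential sketch of CSR-based SpMV written as a segmented sum. It does not implement the speculative CPU-GPU scheme the paper proposes; the function name `csr_spmv` and the argument layout are assumptions chosen for illustration.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    row_ptr[i]:row_ptr[i + 1] is the slice of col_idx/values that belongs
    to row i, so an empty row has row_ptr[i] == row_ptr[i + 1].
    """
    # Step 1: one product per stored nonzero.
    products = [v * x[c] for v, c in zip(values, col_idx)]
    # Step 2: segmented sum, where each row is one segment of `products`.
    return [
        sum(products[row_ptr[i]:row_ptr[i + 1]])
        for i in range(len(row_ptr) - 1)
    ]


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 0, 0],
    #      [0, 3, 4]]
    row_ptr = [0, 2, 2, 4]
    col_idx = [0, 2, 1, 2]
    values = [1, 2, 3, 4]
    print(csr_spmv(row_ptr, col_idx, values, [1, 2, 3]))  # [7, 0, 18]
```

The empty middle row shows why segment boundaries matter: a parallel segmented sum has to handle rows that own no products, which is one place where a speculative result may need correcting.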
Faster Inversion and Other Black Box Matrix Computations Using Efficient Block Projections
Block projections were used in [Eberly et al. 2006] to obtain an
efficient algorithm for solving sparse systems of linear equations. A
bound of $\tilde{O}(n^{2.5})$ machine operations is obtained assuming that the input
matrix can be multiplied by a vector with constant-sized entries in $\tilde{O}(n)$
machine operations. Unfortunately, the correctness of this algorithm depends on
the existence of efficient block projections, which had only been conjectured. In
this paper we establish the correctness of the algorithm from [Eberly et al.
2006] by proving the existence of efficient block projections over sufficiently
large fields. We demonstrate the usefulness of these projections by deriving
improved bounds for the cost of several matrix problems, considering, in
particular, "sparse" matrices that can be multiplied by a vector using
$\tilde{O}(n)$ field operations. We show how to compute the inverse of a sparse
matrix over a field F using an expected number of $\tilde{O}(n^{2.27})$ operations in
F. A basis for the null space of a sparse matrix, and a certification of its
rank, are obtained at the same cost. An application to Kaltofen and Villard's
Baby-Steps/Giant-Steps algorithms for the determinant and Smith Form of an
integer matrix yields algorithms requiring $\tilde{O}(n^{2.66})$ machine operations.
The derived algorithms are all probabilistic of the Las Vegas type
Learning computationally efficient dictionaries and their implementation as fast transforms
Dictionary learning is a branch of signal processing and machine learning
that aims at finding a frame (called a dictionary) in which some training data
admits a sparse representation. The sparser the representation, the better the
dictionary. The resulting dictionary is in general a dense matrix, and its
manipulation can be computationally costly both at the learning stage and later
in the usage of this dictionary, for tasks such as sparse coding. Dictionary
learning is thus limited to relatively small-scale problems. In this paper,
inspired by usual fast transforms, we consider a general dictionary structure
that allows cheaper manipulation, and propose an algorithm to learn such
dictionaries (and their fast implementation) over training data. The approach
is demonstrated experimentally with the factorization of the Hadamard matrix
and with synthetic dictionary learning experiments
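To illustrate why a factorized structure allows cheaper manipulation, the sketch below applies the Hadamard matrix through its well-known butterfly factorization into log2(n) sparse stages and checks the result against the dense product. This uses the classical fast Walsh-Hadamard transform, not factors learned by the paper's algorithm; all function names are illustrative assumptions.

```python
def hadamard(n):
    """Dense Sylvester-type Hadamard matrix of order n (n a power of two)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H


def dense_apply(H, x):
    """Multiply the dense matrix H by the vector x: O(n^2) operations."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]


def fwht(x):
    """Apply the same Hadamard matrix in O(n log n) operations.

    Each pass of the outer loop is one sparse factor with two nonzeros
    per row; log2(n) such factors multiply out to the dense matrix.
    """
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    print(fwht(x))                      # [10, -2, -4, 0]
    print(dense_apply(hadamard(4), x))  # [10, -2, -4, 0]
```

For n = 4 the saving is negligible, but the dense product costs on the order of n^2 operations while the factorized form costs on the order of n log n, which is the gap a learned fast dictionary aims to exploit.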