Communication-optimal Parallel and Sequential Cholesky Decomposition
Numerical algorithms have two kinds of costs: arithmetic and communication,
by which we mean either moving data between levels of a memory hierarchy (in
the sequential case) or over a network connecting processors (in the parallel
case). Communication costs often dominate arithmetic costs, so it is of
interest to design algorithms minimizing communication. In this paper we first
extend known lower bounds on the communication cost (both for bandwidth and for
latency) of conventional (O(n^3)) matrix multiplication to Cholesky
factorization, which is used for solving dense symmetric positive definite
linear systems. Second, we compare the costs of various Cholesky decomposition
implementations to these lower bounds and identify the algorithms and data
structures that attain them. In the sequential case, we consider both the
two-level and hierarchical memory models. Combined with prior results in [13,
14, 15], this gives a set of communication-optimal algorithms for O(n^3)
implementations of the three basic factorizations of dense linear algebra: LU
with pivoting, QR and Cholesky. But it goes beyond this prior work on
sequential LU by optimizing communication for any number of levels of memory
hierarchy.
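As a rough illustration of the kind of algorithm the paper analyzes, the following is a minimal right-looking blocked Cholesky sketch in Python/NumPy; the block size b and the helper names are choices made for this sketch rather than the paper's notation, and whether such a factorization actually attains the lower bounds also depends on the block size and the data layout, as the paper discusses.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(A, b):
    # Right-looking blocked Cholesky: returns lower-triangular L with A = L L^T.
    # In a two-level memory model one would pick b on the order of sqrt(M),
    # where M is the fast-memory size, so each block step runs in fast memory.
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, b):
        end = min(k + b, n)
        # Factor the diagonal block: A_kk = L_kk L_kk^T.
        A[k:end, k:end] = cholesky(A[k:end, k:end], lower=True)
        if end < n:
            # Panel: compute L_ik = A_ik L_kk^{-T} for the blocks below the diagonal.
            A[end:, k:end] = solve_triangular(
                A[k:end, k:end], A[end:, k:end].T, lower=True).T
            # Trailing update: A_ij -= L_ik L_jk^T (symmetric rank-b update).
            A[end:, end:] -= A[end:, k:end] @ A[end:, k:end].T
    return np.tril(A)

On a symmetric positive definite matrix, blocked_cholesky(A, 64) should agree with np.linalg.cholesky(A) up to rounding.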
Delay-Doppler Channel Estimation with Almost Linear Complexity
A fundamental task in wireless communication is channel estimation: computing
the channel parameters a signal undergoes while traveling from a transmitter to
a receiver. In the case of the delay-Doppler channel, a widely used method is the
Matched Filter algorithm. It uses a pseudo-random sequence of length N, and, in
case of non-trivial relative velocity between transmitter and receiver, its
computational complexity is O(N^{2}log(N)). In this paper we introduce a novel
approach to designing sequences that allow faster channel estimation. Using
group representation techniques we construct sequences that enable us to
introduce a new algorithm, called the flag method, that significantly improves
the matched filter algorithm. The flag method finds the channel parameters in
O(mNlog(N)) operations for a channel of sparsity m. We also discuss applications
of the flag method to GPS, radar systems, and mobile communication.
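For context, the baseline matched filter can be sketched as follows in Python/NumPy (the sign and normalization conventions are assumptions of this sketch): it scans all N Doppler shifts and, for each one, correlates over all N delays with FFTs, which is where the O(N^{2}log(N)) cost comes from; the flag method uses the specially designed sequences to avoid this full scan.

import numpy as np

def matched_filter(r, s):
    # Delay-Doppler matched filter (cross-ambiguity) of a received signal r
    # against a known pseudo-random sequence s, both of length N.
    # Each Doppler shift costs one O(N log N) FFT correlation, so the total
    # cost is O(N^2 log N).
    N = len(s)
    S = np.fft.fft(s)
    n = np.arange(N)
    M = np.empty((N, N), dtype=complex)   # M[delay, doppler]
    for omega in range(N):
        # Undo the Doppler shift omega, then cyclically correlate against s.
        r_demod = r * np.exp(-2j * np.pi * omega * n / N)
        M[:, omega] = np.fft.ifft(S * np.conj(np.fft.fft(r_demod)))
    return M   # peaks of |M| estimate the channel's delay-Doppler parameters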
Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds
A parallel algorithm has perfect strong scaling if its running time on P
processors is linear in 1/P, including all communication costs.
Distributed-memory parallel algorithms for matrix multiplication with perfect
strong scaling have only recently been found. One is based on classical matrix
multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's
fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz,
2012). Both algorithms scale perfectly, but only up to a certain number of
processors, beyond which the inter-processor communication cost no longer
decreases like 1/P.
We obtain memory-independent communication cost lower bounds on classical
and Strassen-based distributed-memory matrix multiplication algorithms. These
bounds imply that no classical or Strassen-based parallel matrix multiplication
algorithm can strongly scale perfectly beyond the ranges already attained by
the two parallel algorithms mentioned above. The memory-independent bounds and
the strong scaling bounds generalize to other algorithms.
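To sketch, roughly, how the two kinds of bounds interact (in the notation used across these abstracts: n-by-n matrices, P processors, local memory M per processor, arithmetic exponent ω_0, with ω_0 = 3 for classical and ω_0 = log_2 7 for Strassen):

    memory-dependent bandwidth bound:   Ω( n^{ω_0} / (P · M^{ω_0/2 - 1}) )
    memory-independent bandwidth bound: Ω( n^2 / P^{2/ω_0} )

The first bound decreases like 1/P, but the second decreases only like 1/P^{2/ω_0}; the two meet at P = Θ(n^{ω_0} / M^{ω_0/2}), which is the largest processor count at which perfect strong scaling is possible. For ω_0 = 3 this gives Ω(n^3/(P√M)), Ω(n^2/P^{2/3}), and the limit P = Θ(n^3/M^{3/2}).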
Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense, n-by-n matrix multiplication using the conventional
O(n^3) algorithm, where the input matrices were too large to fit in the small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as Ω(#arithmetic operations / √M), where M is the size
of the fast memory (or local memory in the parallel case). Here we generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, LDL^T factorization, QR factorization,
algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for the dense LU, Cholesky, QR,
eigenvalue, and SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
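To make the quoted bound concrete (a brief sketch, writing G for the number of arithmetic operations and M for the fast or local memory size):

    words moved (bandwidth): Ω( G / √M )
    messages (latency):      Ω( G / M^{3/2} ),  since each message carries at most M words.

For a dense n-by-n factorization with G = Θ(n^3), these become Ω(n^3/√M) words and Ω(n^3/M^{3/2}) messages, which are the targets the communication-optimal algorithms mentioned above are designed to attain.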
Improving the numerical stability of fast matrix multiplication
Fast algorithms for matrix multiplication, namely those that perform
asymptotically fewer scalar operations than the classical algorithm, have been
considered primarily of theoretical interest. Apart from Strassen's original
algorithm, few fast algorithms have been efficiently implemented or used in
practical applications. However, there exist many practical alternatives to
Strassen's algorithm with varying performance and numerical properties. Fast
algorithms are known to be numerically stable, but because their error bounds
are slightly weaker than those of the classical algorithm, they are not used even in
cases where they provide a performance benefit.
We argue in this paper that the numerical sacrifice of fast algorithms,
particularly for the typical use cases of practical algorithms, is not
prohibitive, and we explore ways to improve the accuracy both theoretically and
empirically. The numerical accuracy of fast matrix multiplication depends on
properties of the algorithm and of the input matrices, and we consider both
contributions independently. We generalize and tighten previous error analyses
of fast algorithms and compare their properties. We discuss algorithmic
techniques for improving the error guarantees from two perspectives:
manipulating the algorithms, and reducing input anomalies by various forms of
diagonal scaling. Finally, we benchmark performance and demonstrate our
improved numerical accuracy.
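As one concrete illustration of the diagonal-scaling idea, here is a minimal sketch in Python/NumPy; the particular choice of row and column scales ("outside" scaling) and the single level of Strassen recursion are assumptions of this example, not the paper's exact algorithms.

import numpy as np

def strassen_one_level(A, B):
    # One level of Strassen's recursion (n must be even); the seven half-size
    # products use the classical algorithm. A full fast algorithm would recurse.
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

def scaled_fast_multiply(A, B, multiply=strassen_one_level):
    # Equilibrate the rows of A and the columns of B before the fast multiply,
    # then undo the scaling: (D_A^{-1} A)(B D_B^{-1}) = D_A^{-1} (A B) D_B^{-1}.
    dA = np.max(np.abs(A), axis=1)   # row scales for A
    dB = np.max(np.abs(B), axis=0)   # column scales for B
    dA[dA == 0] = 1.0
    dB[dB == 0] = 1.0
    C_scaled = multiply(A / dA[:, None], B / dB[None, :])
    return C_scaled * dA[:, None] * dB[None, :]

The scaled inputs have entries bounded by 1 in magnitude, so the norm products that enter fast-algorithm error bounds can be better matched to the size of the entries of the result when the original inputs have rows or columns of widely varying magnitude.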
Graph Expansion and Communication Costs of Fast Matrix Multiplication
The communication cost of algorithms (also known as I/O-complexity) is shown
to be closely related to the expansion properties of the corresponding
computation graphs. We demonstrate this on Strassen's and other fast matrix
multiplication algorithms, and obtain the first lower bounds on their communication
costs.
In the sequential case, where the processor has a fast memory of size M, too
small to store three n-by-n matrices, the lower bound on the number of words
moved between fast and slow memory is, for many of the matrix multiplication
algorithms, Ω((n/√M)^{ω_0} · M), where ω_0 is the exponent in the arithmetic
count (e.g., ω_0 = log_2 7 for Strassen and ω_0 = 3 for conventional matrix
multiplication). With P parallel processors, each with fast memory of size M,
the lower bound is P times smaller.
These bounds are attainable both for sequential and for parallel algorithms,
and hence are optimal. They can also be attained by many fast algorithms in
linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester
equation).
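As a quick check on the form of this bound, Ω((n/√M)^{ω_0} · M) = Ω(n^{ω_0} / M^{ω_0/2 - 1}). With ω_0 = 3 this recovers the classical Ω(n^3/√M) bound of Hong and Kung, while with ω_0 = log_2 7 ≈ 2.81 it gives Ω(n^{log_2 7} / M^{(log_2 7)/2 - 1}) for Strassen, which is below the classical bound whenever the matrices are too large to fit in fast memory. Dividing by P gives the parallel, per-processor version stated above.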