6,508 research outputs found
Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense, matrix-multiplication using the conventional
algorithm, where the input matrices were too large to fit in the small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as (#arithmetic operations / ), where M is the size
of the fast memory (or local memory in the parallel case). Here we generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, factorization, QR factorization,
algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR,
eigenvalue and the SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain.Comment: 27 pages, 2 table
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
The nested parallel (a.k.a. fork-join) model is widely used for writing
parallel programs. However, the two composition constructs, i.e. ""
(parallel) and "" (serial), are insufficient in expressing "partial
dependencies" or "partial parallelism" in a program. We propose a new dataflow
composition construct "" to express partial dependencies in
algorithms in a processor- and cache-oblivious way, thus extending the Nested
Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign
several divide-and-conquer algorithms ranging from dense linear algebra to
dynamic-programming in the ND model and prove that they all have optimal span
while retaining optimal cache complexity. We propose the design of runtime
schedulers that map ND programs to multicore processors with multiple levels of
possibly shared caches (i.e, Parallel Memory Hierarchies) and provide
theoretical guarantees on their ability to preserve locality and load balance.
For this, we adapt space-bounded (SB) schedulers for the ND model. We show that
our algorithms have increased "parallelizability" in the ND model, and that SB
schedulers can use the extra parallelizability to achieve asymptotically
optimal bounds on cache misses and running time on a greater number of
processors than in the NP model. The running time for the algorithms in this
paper is , where is the cache complexity of task ,
is the cost of cache miss at level- cache which is of size ,
is a constant, and is the number of processors in an
-level cache hierarchy
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present the use of a hybrid static/dynamic scheduling strategy of the task
dependency graph for direct methods used in dense numerical linear algebra.
This strategy provides a balance of data locality, load balance, and low
dequeue overhead. We show that the usage of this scheduling in communication
avoiding dense factorization leads to significant performance gains. On a 48
core AMD Opteron NUMA machine, our experiments show that we can achieve up to
64% improvement over a version of CALU that uses fully dynamic scheduling, and
up to 30% improvement over the version of CALU that uses fully static
scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic
scheduling approach is up to 8% faster than the version of CALU that uses a
fully static scheduling or fully dynamic scheduling. Our algorithm leads to
speedups over the corresponding routines for computing LU factorization in well
known libraries. On the 48 core AMD NUMA machine, our best implementation is up
to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to
82% faster than MKL. Our approach also shows significant speedups compared with
PLASMA on both of these systems
Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion
We describe a new data format for storing triangular, symmetric, and
Hermitian matrices called RFPF (Rectangular Full Packed Format). The standard
two dimensional arrays of Fortran and C (also known as full format) that are
used to represent triangular and symmetric matrices waste nearly half of the
storage space but provide high performance via the use of Level 3 BLAS.
Standard packed format arrays fully utilize storage (array space) but provide
low performance as there is no Level 3 packed BLAS. We combine the good
features of packed and full storage using RFPF to obtain high performance via
using Level 3 BLAS as RFPF is a standard full format representation. Also, RFPF
requires exactly the same minimal storage as packed format. Each LAPACK full
and/or packed triangular, symmetric, and Hermitian routine becomes a single new
RFPF routine based on eight possible data layouts of RFPF. This new RFPF
routine usually consists of two calls to the corresponding LAPACK full format
routine and two calls to Level 3 BLAS routines. This means {\it no} new
software is required. As examples, we present LAPACK routines for Cholesky
factorization, Cholesky solution and Cholesky inverse computation in RFPF to
illustrate this new work and to describe its performance on several commonly
used computer platforms. Performance of LAPACK full routines using RFPF versus
LAPACK full routines using standard format for both serial and SMP parallel
processing is about the same while using half the storage. Performance gains
are roughly one to a factor of 43 for serial and one to a factor of 97 for SMP
parallel times faster using vendor LAPACK full routines with RFPF than with
using vendor and/or reference packed routines
On fast multiplication of a matrix by its transpose
We present a non-commutative algorithm for the multiplication of a
2x2-block-matrix by its transpose using 5 block products (3 recursive calls and
2 general products) over C or any finite field.We use geometric considerations
on the space of bilinear forms describing 2x2 matrix products to obtain this
algorithm and we show how to reduce the number of involved additions.The
resulting algorithm for arbitrary dimensions is a reduction of multiplication
of a matrix by its transpose to general matrix product, improving by a constant
factor previously known reductions.Finally we propose schedules with low memory
footprint that support a fast and memory efficient practical implementation
over a finite field.To conclude, we show how to use our result in LDLT
factorization.Comment: ISSAC 2020, Jul 2020, Kalamata, Greec
- …