9 research outputs found
Graph Expansion and Communication Costs of Fast Matrix Multiplication
The communication cost of algorithms (also known as I/O-complexity) is shown
to be closely related to the expansion properties of the corresponding
computation graphs. We demonstrate this on Strassen's and other fast matrix
multiplication algorithms, and obtain first lower bounds on their communication
costs.
In the sequential case, where the processor has a fast memory of size ,
too small to store three -by- matrices, the lower bound on the number of
words moved between fast and slow memory is, for many of the matrix
multiplication algorithms, ,
where is the exponent in the arithmetic count (e.g., for Strassen, and for conventional matrix multiplication).
With parallel processors, each with fast memory of size , the lower
bound is times smaller.
These bounds are attainable both for sequential and for parallel algorithms
and hence optimal. These bounds can also be attained by many fast algorithms in
linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester
equation)
QR factorization over tunable processor grids
The increasing complexity of modern computer architectures has greatly influenced algorithm design. Algorithm performance on these architectures is now determined by the movement of data. Therefore,
modern algorithms should prioritize minimizing communication.
In this work, we present a new parallel QR factorization algorithm solved over a
tunable processor grid in a distributed memory environment. The processor grid can be
tuned between one and three dimensions, resulting in tradeoffs in the asymptotic costs of
synchronization, horizontal bandwidth,
flop count, and memory footprint. This parallel algorithm is
the first to efficiently extend the Cholesky-QR2 algorithm to matrices with an
arbitrary number of rows and columns. Along its critical path of execution on P processors, our tunable algorithm improves upon the horizontal bandwidth cost of the existing
Cholesky-QR2 algorithm by up to a factor of c^2 when solved over a c x d x c processor grid
subject to P = c^2 d and E[1,P^1/3].
The costs attained by our algorithm are asymptotically
equivalent to state-of-the-art QR factorization algorithms that have yet to
be implemented.
We argue that ours achieves better practicality and
flexibility while still attaining minimal communication.Ope
Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds
International audienceCommunication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more efficient algorithms that match the leading-order terms in the lower bounds exactly and improve practical performance. The main result of this work is the establishment of memory-independent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the aspect ratios of the matrices
Analytical cost metrics: days of future past
2019 Summer.Includes bibliographical references.Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators, special-purpose hardware that will increase the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general purpose workstations, tablets, phones and other media devices. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. This work builds analytical cost models for cost metrics such as time, energy, memory access, and silicon area. These models are used to predict the performance of applications, for performance tuning, and chip design. The idea is to work with domain specific accelerators where analytical cost models can be accurately used for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in a few ways. For stencil applications and GPU architectures, the analytical cost models are developed for execution time as well as energy. The models are used for performance tuning over existing architectures, and are coupled with silicon area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed form solutions for off-chip data movement are built and used to minimize the total data movement cost of a minimum op count tree