
    Graph Expansion and Communication Costs of Fast Matrix Multiplication

    The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, $\Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot M\right)$, where $\omega_0$ is the exponent in the arithmetic count (e.g., $\omega_0 = \lg 7$ for Strassen's algorithm and $\omega_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is $p$ times smaller. These bounds are attainable both for sequential and for parallel algorithms, and hence are optimal. They can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation).
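    As a quick illustration of the bound's shape, the Python sketch below (our own, purely illustrative; variable names are not from the paper) evaluates the leading-order term of the sequential bound for Strassen's and for conventional multiplication. Constant factors are omitted, since the bound is asymptotic.

```python
import math

def comm_lower_bound(n: int, M: int, omega0: float) -> float:
    """Leading-order sequential communication lower bound:
    Omega((n / sqrt(M))**omega0 * M), constant factors omitted."""
    return (n / math.sqrt(M)) ** omega0 * M

n, M = 2**14, 2**20  # matrix dimension and fast-memory size (in words); illustrative values
print(f"Strassen : {comm_lower_bound(n, M, math.log2(7)):.3e} words")
print(f"Classical: {comm_lower_bound(n, M, 3):.3e} words")
# With p parallel processors, each with fast memory of size M,
# divide either bound by p.
```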

    QR factorization over tunable processor grids

    The increasing complexity of modern computer architectures has greatly influenced algorithm design. Algorithm performance on these architectures is now determined largely by the movement of data, so modern algorithms should prioritize minimizing communication. In this work, we present a new parallel QR factorization algorithm solved over a tunable processor grid in a distributed-memory environment. The processor grid can be tuned between one and three dimensions, yielding tradeoffs among the asymptotic costs of synchronization, horizontal bandwidth, flop count, and memory footprint. This parallel algorithm is the first to efficiently extend the Cholesky-QR2 algorithm to matrices with an arbitrary number of rows and columns. Along its critical path of execution on P processors, our tunable algorithm improves upon the horizontal bandwidth cost of the existing Cholesky-QR2 algorithm by up to a factor of c^2 when solved over a c x d x c processor grid subject to P = c^2 d and c ∈ [1, P^(1/3)]. The costs attained by our algorithm are asymptotically equivalent to those of state-of-the-art QR factorization algorithms that have yet to be implemented. We argue that ours achieves better practicality and flexibility while still attaining minimal communication.
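    To make the grid-tuning constraint concrete, here is a small illustrative Python sketch (our own helper, not the paper's code) that enumerates the valid c x d x c grids for a given P under P = c^2 d with c ∈ [1, P^(1/3)]; c = 1 recovers a 1D grid and c = P^(1/3) a cubic 3D grid.

```python
def tunable_grids(P: int):
    """Enumerate c x d x c processor grids with P = c**2 * d.
    c = 1 gives a 1D grid; c = P**(1/3) gives a cubic 3D grid."""
    grids = []
    c = 1
    while c * c <= P:
        if P % (c * c) == 0:
            d = P // (c * c)
            if d >= c:  # equivalent to c <= P**(1/3), since d = P / c**2
                grids.append((c, d, c))
        c += 1
    return grids

print(tunable_grids(64))  # [(1, 64, 1), (2, 16, 2), (4, 4, 4)]
```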

    Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds

    Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more efficient algorithms that match the leading-order terms in the lower bounds exactly and improve practical performance. The main result of this work is the establishment of memory-independent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the aspect ratios of the matrices.
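    For rough orientation (our own sketch, not the paper's analysis): the classical memory-independent argument gives a per-processor bandwidth term on the order of (mkn/P)^(2/3) for an m x k by k x n product on P processors. The Python below evaluates only that leading term, deliberately omitting the tight constants and the three aspect-ratio cases that are this paper's contribution.

```python
def memory_independent_bound(m: int, k: int, n: int, P: int) -> float:
    """Leading-order memory-independent bandwidth term,
    on the order of (m*k*n / P)**(2/3) words per processor.
    Tight constants, which depend on the matrices' aspect
    ratios relative to P, are omitted here."""
    return (m * k * n / P) ** (2 / 3)

# Near-square product versus a highly rectangular one on P = 64:
print(memory_independent_bound(4096, 4096, 4096, 64))
print(memory_independent_bound(4096, 64, 4096, 64))
```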

    Analytical cost metrics: days of future past

    Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators: special-purpose hardware that increases the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general-purpose workstations, tablets, phones, and other media devices. With Moore's law driving the evolution of hardware platforms toward exascale, the dominant performance metric (time efficiency) has expanded to also incorporate power/energy efficiency. This work builds analytical cost models for metrics such as time, energy, memory access, and silicon area. These models are used to predict application performance, to guide performance tuning, and to inform chip design. The idea is to focus on domain-specific accelerators, where analytical cost models can be applied accurately for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in several ways. For stencil applications and GPU architectures, analytical cost models are developed for execution time as well as energy; these models are used for performance tuning on existing architectures and are coupled with silicon-area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed-form solutions for off-chip data movement are derived and used to minimize the total data-movement cost of a minimum-operation-count evaluation tree.
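    The matrix-chain component rests on the classic dynamic program that finds a minimum-operation-count evaluation tree; the thesis then attaches closed-form data-movement costs to such a tree. A minimal sketch of the underlying DP (illustrative, not the thesis code) is shown below.

```python
def matrix_chain_order(dims):
    """Minimum scalar-multiplication count for a chain of matrices,
    where matrix i has shape dims[i] x dims[i+1]; standard O(n^3) DP."""
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for span in range(1, n):            # subchain length minus one
        for i in range(n - span):
            j = i + span
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)    # try every split point
            )
    return cost[0][n - 1]

# Four matrices: 10x30, 30x5, 5x60, 60x8
print(matrix_chain_order([10, 30, 5, 60, 8]))  # 4300
```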