Graph Expansion and Communication Costs of Fast Matrix Multiplication
The communication cost of algorithms (also known as I/O-complexity) is shown
to be closely related to the expansion properties of the corresponding
computation graphs. We demonstrate this on Strassen's and other fast matrix
multiplication algorithms, and obtain the first lower bounds on their communication
costs.
In the sequential case, where the processor has a fast memory of size M, too small to store three n-by-n matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, Ω((n/√M)^ω₀ · M), where ω₀ is the exponent in the arithmetic count (e.g., ω₀ = log₂ 7 for Strassen, and ω₀ = 3 for conventional matrix multiplication). With P parallel processors, each with fast memory of size M, the lower bound is P times smaller.
These bounds are attainable both for sequential and for parallel algorithms, and hence are optimal. They can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation).
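To make the sequential bound concrete, here is a minimal Python sketch (ours, not the paper's) that simply evaluates Ω((n/√M)^ω₀ · M) for a hypothetical problem size n and fast-memory size M, comparing the conventional and Strassen exponents and the P-fold reduction in the parallel case; constant factors are ignored.

    import math

    def comm_lower_bound(n, M, omega0):
        # Words moved between fast and slow memory, up to a constant factor:
        # (n / sqrt(M))**omega0 * M
        return (n / math.sqrt(M)) ** omega0 * M

    n, M = 4096, 1024        # hypothetical matrix dimension and fast-memory size (words)
    for name, omega0 in [("conventional", 3.0), ("Strassen", math.log2(7))]:
        print(f"{name}: >= {comm_lower_bound(n, M, omega0):.3e} words")

    # With P processors, each with fast memory of size M, the bound is P times smaller:
    P = 64
    print(f"Strassen, per processor (P={P}): >= {comm_lower_bound(n, M, math.log2(7)) / P:.3e} words")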
Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds
A parallel algorithm has perfect strong scaling if its running time on P
processors is linear in 1/P, including all communication costs.
Distributed-memory parallel algorithms for matrix multiplication with perfect
strong scaling have only recently been found. One is based on classical matrix
multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's
fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz,
2012). Both algorithms scale perfectly, but only up to a certain number of processors, beyond which the inter-processor communication no longer scales.
We obtain a memory-independent communication cost lower bound on classical
and Strassen-based distributed-memory matrix multiplication algorithms. These
bounds imply that no classical or Strassen-based parallel matrix multiplication
algorithm can strongly scale perfectly beyond the ranges already attained by
the two parallel algorithms mentioned above. The memory-independent bounds and
the strong scaling bounds generalize to other algorithms.
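As an illustration of where perfect strong scaling must stop, the sketch below (ours; the parameters are illustrative and constant factors are dropped) combines the standard memory-dependent per-processor bound with the memory-independent bound from this paper: the former falls linearly in 1/P, the latter falls more slowly, so beyond their crossover the communication cost can no longer scale perfectly.

    import math

    def parallel_comm_lower_bound(n, M, P, omega0):
        # Memory-dependent bound, up to constants:
        #   n**omega0 / (P * M**(omega0/2 - 1))   -- scales linearly in 1/P
        dep = n ** omega0 / (P * M ** (omega0 / 2 - 1))
        # Memory-independent bound (this paper), up to constants:
        #   n**2 / P**(2/omega0)                  -- scales more slowly in 1/P
        indep = n ** 2 / P ** (2 / omega0)
        return max(dep, indep)

    n, M = 1 << 14, 1 << 20                  # illustrative problem size and per-processor memory (words)
    omega0 = math.log2(7)                    # Strassen; use 3.0 for classical multiplication
    p_max = n ** omega0 / M ** (omega0 / 2)  # crossover: beyond this, the memory-independent bound dominates
    for P in [2 ** k for k in range(4, 21, 4)]:
        flag = "  (past the perfect strong scaling range)" if P > p_max else ""
        print(f"P={P:>7}: words >= {parallel_comm_lower_bound(n, M, P, omega0):.3e}{flag}")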
Algorithms and Conditional Lower Bounds for Planning Problems
We consider planning problems for graphs, Markov decision processes (MDPs),
and games on graphs. While graphs represent the most basic planning model, MDPs
represent interaction with nature and games on graphs represent interaction
with an adversarial environment. We consider two planning problems where there
are k different target sets, and the problems are as follows: (a) the coverage
problem asks whether there is a plan for each individual target set, and (b)
the sequential target reachability problem asks whether the targets can be
reached in sequence. For the coverage problem, we present a linear-time
algorithm for graphs and a quadratic conditional lower bound for MDPs and games
on graphs. For the sequential target problem, we present a linear-time
algorithm for graphs, a sub-quadratic algorithm for MDPs, and a quadratic
conditional lower bound for games on graphs. Our results with conditional lower
bounds establish (i) model-separation results showing that for the coverage
problem MDPs and games on graphs are harder than graphs and for the sequential
reachability problem games on graphs are harder than MDPs and graphs; (ii)
objective-separation results showing that for MDPs the coverage problem is
harder than the sequential target problem.
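For the graph case of the coverage problem, the linear-time idea is a single reachability pass followed by one membership test per target set. The sketch below is an illustrative rendering under the assumption that a plan is simply a path from a designated start vertex into the target set; the names are ours, not the paper's.

    from collections import deque

    def coverage(graph, start, target_sets):
        # graph: dict mapping each vertex to a list of successors.
        # Returns, for every target set, whether some target vertex is reachable from start.
        reachable = {start}
        queue = deque([start])
        while queue:                                   # one BFS: O(|V| + |E|)
            u = queue.popleft()
            for v in graph.get(u, []):
                if v not in reachable:
                    reachable.add(v)
                    queue.append(v)
        # One pass over the target sets: O(sum of their sizes).
        return [any(t in reachable for t in targets) for targets in target_sets]

    g = {0: [1, 2], 1: [3], 2: [], 3: []}
    print(coverage(g, 0, [{3}, {2, 4}, {5}]))          # [True, True, False]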
On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution
Technology trends are making the cost of data movement increasingly dominant,
both in terms of energy and time, over the cost of performing arithmetic
operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to the total computational power (also referred to as the machine balance parameter) in parallel computer systems is decreasing. It is therefore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support them on future systems can be well understood. In this paper, we develop an extension of the well-known red-blue pebble game to derive lower bounds on the data movement complexity for the
parallel execution of computational directed acyclic graphs (CDAGs) on parallel
systems. We model multi-node multi-core parallel systems, with the total
physical memory distributed across the nodes (that are connected through some
interconnection network) and in a multi-level shared cache hierarchy for
processors within a node. We also develop new techniques for lower bound
characterization of non-homogeneous CDAGs. We demonstrate the use of the
methodology by analyzing the CDAGs of several numerical algorithms, to develop
lower bounds on data movement for their parallel execution.
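The sketch below is not the paper's extended pebble-game formalism; it only illustrates the quantity being bounded. Given a CDAG and an assignment of its vertices to the nodes of a distributed-memory machine, it counts the produced values that must cross the interconnect under an assumed owner-computes execution, a simple proxy for inter-node data movement.

    def internode_traffic(edges, owner):
        # edges: (producer, consumer) pairs of a computational DAG.
        # owner: dict mapping each vertex to the node assigned to execute it.
        # Each produced value is counted once per remote node that consumes it.
        remote = {(u, owner[v]) for u, v in edges if owner[u] != owner[v]}
        return len(remote)

    # Tiny example: a diamond-shaped CDAG split across two nodes.
    edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
    owner = {"a": "node0", "b": "node0", "c": "node1", "d": "node1"}
    print(internode_traffic(edges, owner))             # 2 values cross the interconnect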
Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential
Emerging computer architectures will feature drastically decreased bytes/flop (ratio of memory bandwidth to peak processing rate), as highlighted by recent
studies on Exascale architectural trends. Further, flops are getting cheaper
while the energy cost of data movement is increasingly dominant. The
understanding and characterization of data locality properties of computations
is critical in order to guide efforts to enhance data locality. Reuse distance
analysis of memory address traces is a valuable tool to perform data locality
characterization of programs. A single reuse distance analysis can be used to
estimate the number of cache misses in a fully associative LRU cache of any
size, thereby providing estimates on the minimum bandwidth requirements at
different levels of the memory hierarchy to avoid being bandwidth bound.
However, such an analysis only holds for the particular execution order that
produced the trace. It cannot estimate potential improvement in data locality
through dependence preserving transformations that change the execution
schedule of the operations in the computation. In this article, we develop a
novel dynamic analysis approach to characterize the inherent locality
properties of a computation and thereby assess the potential for data locality
enhancement via dependence preserving transformations. The execution trace of a
code is analyzed to extract a computational directed acyclic graph (CDAG) of
the data dependences. The CDAG is then partitioned into convex subsets, and the
convex partitioning is used to reorder the operations in the execution trace to
enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific execution order of the operations of a computation when characterizing its data locality properties. It can play a valuable role in identifying promising code regions for manual transformation,
as well as assessing the effectiveness of compiler transformations for data
locality enhancement. We demonstrate the effectiveness of the approach using a
number of benchmarks, including case studies where the potential shown by the
analysis is exploited to achieve lower data movement costs and better
performance.
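As a concrete rendering of the baseline the article builds on, the sketch below computes reuse distances from an address trace and derives the miss count of a fully associative LRU cache of a given size (a reference misses exactly when its reuse distance is at least the cache size). This is the standard textbook formulation, not the article's tool.

    def reuse_distances(trace):
        # trace: sequence of addresses. For each reference, returns the number of
        # distinct addresses touched since the previous access to the same address
        # (infinity for first-time accesses). Quadratic-time for clarity;
        # production tools use a balanced tree or similar structure.
        lru_stack = []                                 # addresses ordered by most recent use
        distances = []
        for addr in trace:
            if addr in lru_stack:
                d = lru_stack.index(addr)              # distinct addresses used since last access
                lru_stack.remove(addr)
            else:
                d = float("inf")
            lru_stack.insert(0, addr)
            distances.append(d)
        return distances

    def lru_misses(trace, cache_size):
        # A reference hits in a fully associative LRU cache of the given size
        # exactly when its reuse distance is smaller than the cache size.
        return sum(1 for d in reuse_distances(trace) if d >= cache_size)

    trace = ["a", "b", "c", "a", "b", "a"]
    print(reuse_distances(trace))                      # [inf, inf, inf, 2, 2, 1]
    print(lru_misses(trace, 2))                        # 5 misses with a 2-line cache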
On Characterizing the Data Access Complexity of Programs
Technology trends will cause data movement to account for the majority of
energy expenditure and execution time on emerging computers. Therefore,
computational complexity will no longer be a sufficient metric for comparing
algorithms, and a fundamental characterization of data access complexity will
be increasingly important. The problem of developing lower bounds for data
access complexity has been modeled using the formalism of Hong & Kung's
red/blue pebble game for computational directed acyclic graphs (CDAGs).
However, previously developed approaches to lower-bound analysis for the red/blue pebble game are of very limited effectiveness when applied to the CDAGs of real programs, whose computations are composed of multiple sub-computations with differing DAG structure. We address this problem by developing an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive the asymptotic data-access lower bounds of programs, as a function of the problem size and cache size.
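To make the model concrete, the sketch below plays a restricted, greedy variant of Hong & Kung's red/blue pebble game on a small CDAG: vertices are scheduled in topological order with S fast-memory slots (red pebbles) and LRU eviction, and the load/store transfers are counted. Any one schedule only upper-bounds the minimum data access cost; the lower-bound techniques in the paper bound that minimum from below over all valid schedules. The example CDAG and parameters are ours.

    from collections import OrderedDict

    def io_cost(order, preds, S):
        # order: vertices of the CDAG in topological order.
        # preds: dict mapping a vertex to the list of its predecessors.
        # Greedy schedule with S fast-memory slots (red pebbles) and LRU eviction;
        # returns (loads, stores), i.e. slow-memory transfers for this schedule.
        uses_left = {}
        for v in order:
            for p in preds.get(v, []):
                uses_left[p] = uses_left.get(p, 0) + 1
        fast = OrderedDict()                           # values currently in fast memory, LRU order
        loads = stores = 0

        def bring_in(v, needs_load):
            nonlocal loads, stores
            if v in fast:
                fast.move_to_end(v)
                return
            if needs_load:
                loads += 1                             # fetch v from slow memory
            while len(fast) >= S:
                evicted, _ = fast.popitem(last=False)
                if uses_left.get(evicted, 0) > 0:
                    stores += 1                        # still needed later: write it back
            fast[v] = True

        for v in order:
            for p in preds.get(v, []):
                bring_in(p, needs_load=True)           # operands must sit in fast memory
                uses_left[p] -= 1
            bring_in(v, needs_load=not preds.get(v))   # source vertices are inputs and must be loaded
        return loads, stores

    preds = {"c": ["a", "b"], "e": ["c", "d"], "f": ["e", "a"]}
    print(io_cost(["a", "b", "c", "d", "e", "f"], preds, S=3))   # (4, 1) for this tiny CDAG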