Communication Lower Bounds for Distributed-Memory Computations by Scquizzato, Michele & Silvestri, Francesco
arXiv:1307.1805v2 [cs.DS] 20 Sep 2013
Communication Lower Bounds for Distributed-Memory
Computations∗
Michele Scquizzato† Francesco Silvestri‡
August 7, 2018
Abstract
We give lower bounds on the communication complexity required to solve several compu-
tational problems in a distributed-memory parallel machine, namely standard matrix multipli-
cation, stencil computations, comparison sorting, and the Fast Fourier Transform. We revisit
the assumptions under which preceding results were derived and provide new lower bounds
which use much weaker and appropriate hypotheses. Our bounds rely on a mild assumption on
work distribution, and strengthen previous results which require either the computation to be
balanced among the processors, or specific initial distributions of the input data, or an upper
bound on the size of processors’ local memories.
Keywords: Communication, lower bounds, distributed memory, parallel algorithms, BSP.
∗This work was supported, in part, by the University of Padova Projects STPD08JA32 and CPDA121378.
†Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260, USA. Supported by a fellow-
ship of “Fondazione Ing. Aldo Gini”, University of Padova, Italy. E-mail: scquizza@pitt.edu. Most of this work
was done while this author was a Ph.D. student at the University of Padova.
‡Department of Information Engineering, University of Padova, 35131 Padova, Italy.
E-mail: silvest1@dei.unipd.it
1 Introduction
Communication is a major factor determining the performance of algorithms on current computing
systems, as the time and energy needed to transfer data between processing and storage elements
is often significantly higher than that of performing arithmetic operations. The gap between com-
putation and communication costs, which is ultimately due to basic physical principles, is expected
to become wider and wider as architectural advances make it possible to build systems of increasing size and
complexity. Hence, the cost of data movement will play an even greater role in future years.
As in all endeavors where performance is systematically pursued, it is important to evaluate
the distance from optimality of a proposed algorithmic solution, by establishing appropriate lower
bounds. Given the well-known difficulty of establishing lower bounds, results are often obtained
under restrictive assumptions that may severely limit their applicability. It is therefore important
to progressively reduce or fully eliminate such restrictions.
In this spirit, we consider lower bounds on the amount of communication that is required to
solve some classical computational problems on a distributed-memory parallel system. Specifically,
we revisit the assumptions and constraints under which preceding results were derived, and prove
new lower bounds which use much weaker hypotheses and thus have wider applicability. Even when
the functional form of the bounds remains the same, our results do yield new insights to algorithm
developers since they might reveal if some settings are needed, or not, in order to obtain better
performance.
We model the machine using the standard Bulk Synchronous Parallel (BSP) model of compu-
tation [32], which consists of a collection of p processors, each equipped with an unbounded private
memory and communicating with each other through a communication network. The starting point
of any investigation of algorithms for a distributed-memory model is the specification of an I/O
protocol, which defines where input elements reside at the beginning of the computation and where
the outputs produced by the algorithm must be placed. The distribution of inputs and outputs
effectively forms a part of the problem specification, thus restricting the applicability of upper and
lower bounds. Much of previous work on BSP algorithms considers a version of the BSP model
equipped with an additional external memory, which serves as the source of the input and the des-
tination for the output (see, e.g., [31]). This modification significantly alters the spirit of the BSP
of serving as a model for distributed-memory machines, making it very similar to shared-memory
models like the LPRAM [1]. In fact, in a distributed-memory machine, the inputs might already
be distributed in some manner prior to the invocation of the algorithm, and the outputs are usu-
ally left distributed in the processors’ local memories at the end of the execution, especially if the
computation is a subroutine of a larger computation. Thus, lower bounds that use this assumption,
which essentially exploit this “hack” to guarantee that acquiring the n input elements contributes
to the communication cost of algorithms (as some processor must read at least ⌈n/p⌉ input values),
are not directly applicable to distributed-memory architectures.
Other authors, within the original BSP model, assume specific distributions of the input data.
As we shall see later, it is usually assumed that the input is initially evenly distributed among the
p processors. However, this apparently “reasonable” hypothesis is not part of the logic of the BSP
model. In fact, the physical distribution of input data across the processors may depend on several
factors, ranging from how the inputs get acquired to the file system policies. Moreover, this hy-
pothesis may lead to unsatisfactory communication lower bounds. Consider, e.g., the computation
of a directed acyclic graph (DAG) with “few” input nodes and a “long” critical path. In this case,
naive algorithms which entrust the whole computation to one processor might be communication
optimal. This is misleading, since it steers towards algorithms which are not parallel at all.
One possibility to overcome both the issues discussed above is to require, in place of the even
distribution of the inputs and of the presence of an external memory, that algorithms exhibit some
level of load balancing of the computation. Typically, if W denotes the total work required by
any algorithm to solve the given problem, it is required that each processor performs O (W/p)
elementary computations. However, this way we are assuming, but not proving, that optimal
solutions balance computation. In fact, in general there is a tradeoff between computation costs
and communication costs. Some papers (see, e.g., [22, 35]) quantify such tradeoffs by establishing
lower bounds on the communication cost of any algorithm as a function of its computation time.
Nevertheless, results of this kind usually indicate that the higher lower bounds on communication
correspond only to perfectly (to within constant factors) work-balanced computations, and such
bounds are tight since they are achieved by balanced algorithms. This leaves open the possibility that a
substantial saving on communication costs could actually be achieved at the price of a small imbalance
of the computation loads.
Another common assumption is putting an upper bound on the size of processors’ local memories.
However, current technological advances make it possible to build cheap memory and storage devices
that, for many applications, allow a single machine to store the whole input data set and the inter-
mediate data. Moreover, results derived under this assumption are less general than results that
put no limits on the amount of storage available to processors; indeed, lower bounds are relatively
easier to establish, as the model essentially becomes a parallel version of the standard external
memory (EM) model for sequential computations, for which many more results and techniques are
known (see, e.g., [16, 2]).
In contrast, lower bounds presented in this paper do not hinge on any of the above assumptions.
We develop new lower bounds for a number of key computational problems, namely standard
matrix multiplication, stencil computations, comparison sorting, and the Fast Fourier Transform,
using the weak assumption that no processor performs more than a constant fraction of the total
required work. This requires more involved arguments, and substantially strengthens previous work
on communication lower bounds for distributed-memory computations.
The model. The Bulk Synchronous Parallel (BSP) model of computation was introduced by
Valiant [32] as a bridging model for general-purpose parallel computing. The architectural
component of the model consists of p processing elements $P_0, P_1, \ldots, P_{p-1}$, each equipped with an
unbounded local memory, interconnected by a communication medium. The execution of a BSP
algorithm consists of a sequence of supersteps, where each processor can perform operations on data
in its local memory, send messages and, at the end, execute a global synchronization. The running
time of the i-th superstep is expressed in terms of two parameters, g and ℓ, as $T_i = w_i + h_i g + \ell$,
where $w_i$ is the maximum number of local operations performed by any processor, and $h_i$ is the
maximum number of messages sent or received by any processor. The running time $T_{\mathcal{A}}$ of a BSP
algorithm $\mathcal{A}$ is the sum of the times of its supersteps and can be expressed as
$W_{\mathcal{A}} + H_{\mathcal{A}} g + S_{\mathcal{A}} \ell$, where $S_{\mathcal{A}}$ is the number of supersteps,
$W_{\mathcal{A}} = \sum_{i=1}^{S_{\mathcal{A}}} w_i$ is the local computation complexity, and
$H_{\mathcal{A}} = \sum_{i=1}^{S_{\mathcal{A}}} h_i$ is the communication complexity.
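The cost formula above can be illustrated with a minimal Python sketch. The representation of a superstep as a list of per-processor triples (local operations, messages sent, messages received) and the function name `bsp_cost` are assumptions made here for illustration; they are not part of the model's definition.

```python
from typing import List, Tuple

# One superstep: for each processor, (local_ops, messages_sent, messages_received).
# This encoding is a hypothetical convenience, not part of the BSP definition.
Superstep = List[Tuple[int, int, int]]


def bsp_cost(supersteps: List[Superstep], g: float, ell: float):
    """Return (W, H, S, T) with T = W + H*g + S*ell."""
    W = 0  # local computation complexity: sum of per-superstep maxima w_i
    H = 0  # communication complexity: sum of per-superstep maxima h_i
    for step in supersteps:
        w_i = max(ops for ops, _, _ in step)
        # h_i is the maximum number of messages sent or received by any processor
        h_i = max(max(sent, received) for _, sent, received in step)
        W += w_i
        H += h_i
    S = len(supersteps)
    return W, H, S, W + H * g + S * ell


# Two supersteps on two processors, with g = 2 and ell = 10.
example = [
    [(10, 3, 1), (4, 1, 3)],  # w_1 = 10, h_1 = 3
    [(2, 0, 5), (8, 5, 0)],   # w_2 = 8,  h_2 = 5
]
print(bsp_cost(example, g=2, ell=10))
```

In this example W = 18, H = 8 and S = 2, so the running time is 18 + 8·2 + 2·10 = 54. The lower bounds in this paper concern only the H component.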
Previous work. The complexity of communication on various models of computation has re-
ceived considerable attention. Lower bounds are often established through adaptations of the
techniques of Hong and Kung [16] for hierarchical memory, or by critical path arguments, such as
those in [1]. For applications of these and other techniques see [22, 2, 25, 15, 8, 6, 17, 24, 4, 9, 5]
as well as [26] and references therein. In the following, we discuss previous work on lower bounds
for the communication complexity of the problems studied in this paper.
A standard computational problem is the multiplication of two n × n matrices. For the classical
$\Theta(n^3)$ algorithm, an $\Omega(n^2/p^{2/3})$ lower bound has been previously derived for the BSP [30] and
the LPRAM [1]. However, both results hinge on the hypothesis that the input initially resides
outside the processors’ local memories and thus must be read, contributing to the communication
complexity of the algorithms. As such, these results are an immediate consequence of a result of [16]
(then restated in [17]) which, loosely speaking, bounds from above the amount of computation
that can be performed with a given quantity of data. When input is assumed to be initially
evenly distributed across the p processors’ local memories, the same lower bound is claimed in [11].
Recently, Ballard et al. [3] obtained a result of the same form by assuming perfectly balanced (to
within constant factors) computations, and disallowing any initial replication of inputs. Restricting
to balanced computations allows one to reduce to the situation where inputs are evenly distributed: in
fact, it is easy to prove that, given a problem on n inputs which requires N operations to be
solved, if each processor performs at most αN/p operations, for some α with 1 ≤ α ≤ p/2, then
there exists one processor which initially holds at most 2αn/p inputs, and that performs at least
α/(2α − 1) · N/p operations. The very same bound was found also by Irony et al. [17], who restrict
their attention to computations that take place on machines where processors’ local memory size is
assumed to be $M = O(n^2/p^{2/3})$. Finally, Solomonik and Demmel [27] investigate tradeoffs between
input replication and communication complexity (see also [4]).
A class of computations ubiquitous in scientific computing is that of stencil computations, where
each computing node in a multi-dimensional grid is updated with weighted values contributed
by neighboring nodes. These computations include the diamond DAG in the two-dimensional
case and the cube DAG in three dimensions. For the former, Papadimitriou and Ullman [22]
present a communication-time tradeoff which yields a tight Ω (n) lower bound on the communication
complexity only for the case of balanced computations. Aggarwal et al. [1] extend this result to all
algorithms whose computational complexity is within a constant factor of the number of nodes of
the DAG. To the best of our knowledge, this is the sole example of a tight lower bound that holds
under the same hypothesis used in this paper. By generalizing the technique in [22], Tiskin [30]
establishes a tight bound for the cube DAG, and claims its extension to higher dimensions. However,
these results only hold when the computational load is balanced among the p processors.
Another key problem is sorting. Many papers assume that the n inputs initially reside outside
processors’ local memories, thus obtaining an Ω (n/p) lower bound which turns out to be tight
when it is additionally assumed that problem instances have sufficient slackness, that is, n ≫ p
(e.g., $p^2 \le n$ is a common assumption). Under some technical assumptions, a bound of the form
Ω (n log n/(p log(n/p))), which is tight for all values of p ≤ n, was first given within the LPRAM
model [1].¹ This bound, however, includes the cost to read the input from the shared memory. A
similar lower bound was derived later by Goodrich [15] within the BSP model, but the result holds
only for the subclass of algorithms performing supersteps of degree h = Θ(n/p), and when the
inputs are evenly distributed among the processors.
Previous work on the communication required to compute an FFT DAG of size n is similar to
previous work for sorting. By exploiting the property that, as shown in [34], the cascade of three
FFT networks has the topology of a full sorting network, the aforementioned lower bounds for
sorting also hold for the FFT DAG. In a recent paper [9], we obtain the same result assuming that
the maximum number of outputs held by any processor at the end of the algorithm is at most n/2,
and without assumptions on the distribution of the input and of the computational loads; while
these hypotheses are not equivalent to the one we are using in this paper, the result in [9] is the
closest to the one that we will develop in Section 5.
¹ For notational convenience, within asymptotic notations we will henceforth use log x to denote $\max\{1, \log_2 x\}$.
Our contribution. In this paper we present lower bounds on the communication complexity
required by key computational problems such as standard matrix multiplication, stencil compu-
tations, comparison sorting, and the Fast Fourier Transform, when solved by parallel algorithms
on the BSP model. These results, which are all tight for the whole range of model parameters,
rely solely on the hypothesis that no processor performs more than a constant fraction of the total
required work. More formally, let $\mathcal{W}$ be the total work required by any algorithm to solve the given
problem (if the problem is represented by a directed acyclic graph, then $\mathcal{W}$ is the number of nodes
of the DAG, otherwise $\mathcal{W}$ is a lower bound on the computation time required by any sequential
algorithm), and let W be the maximum amount of work performed by any BSP processor; then, in
the same spirit of the aforementioned result for the diamond DAG in [1], W is assumed to satisfy
the bound $W \le \epsilon \mathcal{W}$, for some constant ε ∈ (0, 1). The rationale behind this approach is that
communication is the major bottleneck of a distributed-memory computation unless the latter is
sequential or “nearly sequential”, in which case the main contribution to the running time T of an
algorithm comes from computation. Since it is directly linked to the running time metric, and it
does not allow for any other restrictive assumptions suggested by orthogonal constraints, we believe
that this is the right approach to perform a systematic analysis of the communication requirements
of distributed-memory computations.
We emphasize that, in contrast to previous work, our lower bounds do not count the commu-
nication required to acquire the input, allow for any initial distribution of the input among the
processors’ local memories, assume no upper bound on the sizes of the latter, and do not require
computations to be balanced. On the other hand, some of our results make use of additional
technical assumptions, such as the non-recomputation of intermediate results in the course of the
computation, or some restrictions on the replication of input data. Such restrictions, however, were
already in place in almost all of the corresponding state-of-the-art lower bounds.
2 Matrix Multiplication
In this section we consider the problem of multiplying two n × n matrices, A and B, using only
semiring operations, that is, addition and multiplication. Hence, each element $c_{i,j}$ of the output
matrix C is an explicit sum of products $a_{i,k} \cdot b_{k,j}$, which are called multiplicative terms. This rules
out, e.g., Strassen’s algorithm [28] and the Boolean matrix multiplication algorithm of Tiskin [29].
As shown in [19], any algorithm using only semiring operations must compute at least $n^3$ distinct
multiplicative terms.
In this section we establish a lower bound on the communication complexity of any parallel
algorithm for matrix multiplication on a BSP with p processors. This result is derived assuming
that no processor performs more than a constant fraction of the $n^3$ total work required by any
algorithm, measured as the number of scalar multiplications, and that each input element is initially
stored in the local memory of exactly one processor. The bound has the form $\Omega(W^{2/3})$, where
W is the maximum number of multiplicative terms evaluated by a processor, and is tight for all
values of p between two and $n^2$. The argument through which we establish such a result is a repeated
application of a “bandwidth” argument which, loosely speaking, is as follows. Consider a processor
which performs the maximum amount of work. If this processor initially holds “few” input values,
then, since it computes at least $n^3/p$ multiplicative terms, it must receive “many” inputs from the
submachine including the other processors; otherwise, if it initially holds “many” inputs, then it
has to send many of them to the other processors, because it cannot perform too much work on its
own, and thus the other processors have to perform at least a constant fraction of the total work.
The lower bound applies to any distribution of input and output matrices, and only requires that
the input matrices are not initially replicated.
Towards this end, we first establish a lower bound of $\Omega(n^2)$ under the same hypotheses outlined
above for two processors. This result is derived using a bandwidth argument that bounds from
below the amount of data that must travel across the communication network of a two-processor
machine. A bound of the same form can be found in [17, Section 6], which holds only when the
elements of the input matrices A and B are evenly, or almost evenly, distributed among the two
processors. Our result, which instead allows any initial distribution of the input matrices (without
replication), establishes the same bound by using a mild hypothesis on the maximum computation
load faced by the processors.
Lemma 1. Let $\mathcal{A}$ be any algorithm for computing the matrix product C = AB, using only semiring
operations, on a BSP with two processors. If each processor computes at most $\epsilon n^3$ multiplicative
terms, where ε is an arbitrary constant in (1/2, 1), and the input matrices are not initially replicated,
then the communication complexity of the algorithm is
\[ H_{\mathcal{A}}(n, p) = \Omega(n^2). \]
Proof. We use a bandwidth argument like the one employed in [17, Theorem 6.1]. By hypothesis,
each processor computes at most $\epsilon n^3$ multiplicative terms. Let K be the number of elements of C
whose corresponding multiplicative terms have not all been computed by the same processor.
If $K \ge (1-\epsilon)n^2/2$, then the communication complexity is $\Omega(n^2)$, since a processor receives
a message for at least K/2 of such entries containing a multiplicative term or a partial prefix sum.
Suppose now that $K < (1-\epsilon)n^2/2$. Then there are at least $(1+\epsilon)n^2/2$ entries of C whose n
multiplicative terms have been entirely computed by the same processor. We denote with $n_0$ and
$n_1$ the number of entries of C computed entirely by processor $P_0$ and $P_1$, respectively, and suppose
without loss of generality that $n_0 \ge n_1$. Clearly, $n_0 + n_1 \ge (1+\epsilon)n^2/2$. Since $n_0 \ge n_1$ and since
each processor can compute at most $\epsilon n^3$ multiplicative terms, it follows that $n_0 \le \epsilon n^2$, and thus
$n_1 \ge (1-\epsilon)n^2/2$. Let $r_i$ and $c_i$ denote the number of rows of A and columns of B, respectively,
whose n entries have all been accessed by processor $P_i$, with i ∈ {0, 1}, during the lifespan of the
algorithm. Since $n_i$ entries of C are computed entirely by processor $P_i$, we have $r_i c_i \ge n_i$.
Let $\alpha = \sqrt{\epsilon} + \sqrt{(1-\epsilon)/2}$; we observe that $\alpha \in (1, \sqrt{3/2}]$ since ε ∈ (1/2, 1), the maximum
being attained at ε = 2/3. If $r_0 + r_1 \ge \alpha n$,
at least $(\alpha - 1)n = \mu n$ rows of A, where μ is a suitable constant in $(0, \sqrt{3/2} - 1]$, are used by
both processors, incurring $\Omega(n^2)$ messages for exchanging the rows since, by hypothesis, the input
matrices are not initially replicated. Suppose now that $r_0 + r_1 < \alpha n$. Then, we have
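\[ c_0 + c_1 \ge \frac{n_0}{r_0} + \frac{n_1}{r_1} \ge \frac{n_0}{r_0} + \frac{n_1}{\alpha n - r_0}. \]
By taking the derivative of the last term with respect to $r_0$ we see that the right-hand side is
minimized at $r_0 = \alpha n \sqrt{n_0}/(\sqrt{n_0} + \sqrt{n_1})$, whence
\[ c_0 + c_1 \ge \frac{\left(\sqrt{n_0} + \sqrt{n_1}\right)^2}{\alpha n}. \]
Since $n_0 + n_1 \ge (1+\epsilon)n^2/2$ and $n_0 \le \epsilon n^2$, the term $\sqrt{n_0} + \sqrt{n_1}$ is minimized when $n_0$ assumes the
largest allowed value (i.e., $\epsilon n^2$) and $n_0 + n_1 = (1+\epsilon)n^2/2$. Thus, we have
\[ c_0 + c_1 \ge \frac{\left(\sqrt{\epsilon n^2} + \sqrt{(1-\epsilon)n^2/2}\right)^2}{\alpha n} = \alpha n. \]
The lemma follows since at least $(\alpha - 1)n = \Theta(n)$ columns of B are used by both processors, entailing
$\Omega(n^2)$ messages for exchanging them.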
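The optimization step in the proof above can be checked numerically. The following Python sketch, whose function names are hypothetical, evaluates the bound $n_0/r_0 + n_1/(\alpha n - r_0)$ on a grid of values of $r_0$ and compares the minimum with the closed form $(\sqrt{n_0} + \sqrt{n_1})^2/(\alpha n)$, taking the extremal values $n_0 = \epsilon n^2$ and $n_1 = (1-\epsilon)n^2/2$ used in the proof. It is a sanity check under these assumptions, not a substitute for the argument.

```python
import math


def alpha(eps: float) -> float:
    """alpha = sqrt(eps) + sqrt((1 - eps) / 2), as defined in the proof."""
    return math.sqrt(eps) + math.sqrt((1 - eps) / 2)


def column_bound(r0: float, n0: float, n1: float, alpha_n: float) -> float:
    """Lower bound on c0 + c1 when r0 + r1 < alpha * n."""
    return n0 / r0 + n1 / (alpha_n - r0)


def minimizer(n0: float, n1: float, alpha_n: float) -> float:
    """Stationary point r0 = alpha*n*sqrt(n0) / (sqrt(n0) + sqrt(n1))."""
    s0, s1 = math.sqrt(n0), math.sqrt(n1)
    return alpha_n * s0 / (s0 + s1)


def check(n: int, eps: float, samples: int = 1000):
    a = alpha(eps)
    n0 = eps * n * n            # largest value allowed for n0
    n1 = (1 - eps) * n * n / 2  # so that n0 + n1 = (1 + eps) n^2 / 2
    r_star = minimizer(n0, n1, a * n)
    closed_form = (math.sqrt(n0) + math.sqrt(n1)) ** 2 / (a * n)
    # Brute-force minimum over a grid strictly inside (0, alpha * n).
    grid_min = min(
        column_bound(a * n * k / samples, n0, n1, a * n)
        for k in range(1, samples)
    )
    return r_star, closed_form, grid_min


r_star, closed_form, grid_min = check(n=100, eps=0.75)
print(r_star, closed_form, grid_min)
```

For these extremal values the closed form equals $\alpha n$, so the grid minimum should never fall below $\alpha n$ up to floating-point error.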
To prove our main result we also need the following technical lemma, which was first given by
Hong and Kung in their seminal paper on I/O complexity, and then restated in [17] by applying
the Loomis-Whitney inequality [21].
Lemma 2 ([16, Lemma 6.1]). Consider the matrix multiplication of two n × n matrices A and B
using scalar additions and multiplications only. During the computation, if a processor accesses
at most K elements of each input matrix and contributes to at most K elements of the output
matrix C, then it can compute at most $2K^{3/2}$ multiplicative terms.
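The geometric content of Lemma 2 can be sketched in Python via the Loomis-Whitney inequality mentioned above. A multiplicative term $a_{i,k} \cdot b_{k,j}$ is encoded here as the triple (i, j, k), an encoding chosen for this illustration; the three projections are the entries of A, B and C that the term touches. The Loomis-Whitney inequality states that the number of triples is at most the square root of the product of the three projection sizes, hence at most $K^{3/2}$ when each projection has at most K elements, which is within the $2K^{3/2}$ of the lemma.

```python
import random


def projections(terms):
    """Entries of A, B and C touched by a set of terms (i, j, k) = a[i][k] * b[k][j]."""
    a_entries = {(i, k) for i, j, k in terms}
    b_entries = {(k, j) for i, j, k in terms}
    c_entries = {(i, j) for i, j, k in terms}
    return a_entries, b_entries, c_entries


def loomis_whitney_holds(terms) -> bool:
    """Check |V|^2 <= |A-proj| * |B-proj| * |C-proj| using integer arithmetic."""
    a_entries, b_entries, c_entries = projections(terms)
    return len(terms) ** 2 <= len(a_entries) * len(b_entries) * len(c_entries)


def random_terms(n: int, count: int, seed: int):
    """A random set of `count` distinct multiplicative terms for n x n matrices."""
    all_terms = [(i, j, k) for i in range(n) for j in range(n) for k in range(n)]
    return set(random.Random(seed).sample(all_terms, count))


terms = random_terms(n=6, count=50, seed=1)
print(loomis_whitney_holds(terms))

# A full s x s x s cube of terms meets the inequality with equality.
cube = {(i, j, k) for i in range(3) for j in range(3) for k in range(3)}
print(len(cube) ** 2, [len(p) for p in projections(cube)])
```

The cube example shows why cubic blocks of terms are the most communication-efficient shape: 27 terms are computed from 9 entries of each matrix.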
Now we have all the tools to prove the main result of this section. The following theorem
establishes an $\Omega(W^{2/3})$ lower bound on the communication complexity of any standard algorithm,
where W denotes the maximum number of multiplicative terms evaluated by a processor. By the
result of [19] and by the pigeonhole principle, there exists a processor that computes at least $n^3/p$
multiplicative terms, from which the standard $\Omega(n^2/p^{2/3})$ lower bound follows.
Theorem 1. Let $\mathcal{A}$ be any algorithm for computing the matrix product C = AB, using only
semiring operations, on a BSP with p processors, where $1 < p \le n^2$, and let W be the maximum
number of multiplicative terms evaluated by a processor. If $W \le \max\{n^3/p, n^3/11^3\}$, and the input
matrices are not initially replicated, then the communication complexity of the algorithm is
\[ H_{\mathcal{A}}(n, p) = \Omega(W^{2/3}). \]
Proof. Without loss of generality, we assume that any multiplicative term computed by the pro-
cessors is actually used towards the computation of some entry of the output matrix C (that is,
processors do not perform “useless” computations). Consider one of the processors that compute
W multiplicative terms, and without loss of generality let P0 denote such a processor. Let I be the
number of input elements initially held by this processor in its local memory.
Consider first the case $I \le W^{2/3}/5$. By Lemma 2, a processor that computes W multiplicative
terms either accesses, during the whole execution of algorithm $\mathcal{A}$, at least $(W/2)^{2/3}$ input elements,
or computes multiplicative terms relative to at least $(W/2)^{2/3}$ elements of the output matrix.
In the first case, since $P_0$ initially holds $I \le W^{2/3}/5$ input elements, it must receive at least
$(W/2)^{2/3} - I = \Omega(W^{2/3})$ data words from other processors, and the theorem follows. On the other
hand, suppose $P_0$ computes multiplicative terms relative to $(W/2)^{2/3}$ entries of the output matrix,
and partition such entries into three groups: $G_1$, the set of entries whose multiplicative terms have
all been computed by the processor; $G_2$, the set of entries produced by the processor but for which
some multiplicative term or partial sum has been communicated by some other processor; $G_3$, the
set of entries not produced by the processor. Clearly, at least one of these three groups must have
size at least $(W/2)^{2/3}/3$. If $|G_1| \ge (W/2)^{2/3}/3$, then $P_0$ must have computed at least $n(W/2)^{2/3}/3$
multiplicative terms, and since any entry of the input matrices occurs in only n of such terms,
the processor must have received $(W/2)^{2/3}/3 - I = \Omega(W^{2/3})$ elements from other processors. If
$|G_2| \ge (W/2)^{2/3}/3$, then for each entry in $G_2$, $P_0$ has received some term from other processors,
therefore accounting for a total of $\Omega(W^{2/3})$ incoming data words. Finally, if $|G_3| \ge (W/2)^{2/3}/3$,
then, since any multiplicative term must be used towards the computation of some entry of the
output matrix C, for each entry in $G_3$, $P_0$ must send some multiplicative term or partial sum to
the processor that will produce the corresponding entry of C, and this implies that $P_0$ must send
$\Omega(W^{2/3})$ data words. In all three cases, $\Omega(W^{2/3})$ messages have to be exchanged by $P_0$ with the
other processors, and the claim follows for $I \le W^{2/3}/5$.
Now suppose $I > W^{2/3}/5$ and $p \ge 11^3$. Assume, without loss of generality, that $P_0$ initially holds
at least I/2 elements of matrix A. Since any entry of the input matrices occurs in n multiplicative
terms, there are at least In/2 multiplicative terms that depend on the entries of A initially held
by the processor. Since W multiplicative terms are computed by the processor, the remaining
In/2 − W ones are computed by other processors. Since, by hypothesis, each entry of A is initially
not replicated and a processor can compute at most n multiplicative terms using a single entry
of A, we have that (In/2 − W)/n messages are required for sending the appropriate entries of
A to the processors that will compute the remaining terms. Hence, $H_{\mathcal{A}}(n, p) \ge (In/2 - W)/n$.
Finally, observe that since $p \ge 11^3$, by hypothesis it holds that $W \le n^3/11^3$. Putting all
pieces together yields
\[ H_{\mathcal{A}}(n, p) \ge \frac{In/2 - W}{n} > \frac{W^{2/3}}{10} - \frac{W}{n}
   = W^{2/3}\left(\frac{1}{10} - \frac{W^{1/3}}{n}\right)
   \ge W^{2/3}\left(\frac{1}{10} - \frac{1}{11}\right) = \frac{W^{2/3}}{110}, \]
which concludes the proof of the second case.
Finally, when $I > W^{2/3}/5$ and $p < 11^3$, the sought lower bound follows by Lemma 1. Indeed, the
p processors can be virtually partitioned into two subsets, each consisting of exactly p/2 processors;
in particular, processor $P_0^*$ will be identified with the submachine including the first half of the p
processors, and $P_1^*$ with the submachine including the second half. Since $p < 11^3$, by hypothesis each
BSP processor computes at most $n^3/p$ multiplicative terms, and thus both $P_0^*$ and $P_1^*$ compute at
most $(n^3/p)(p/2) = n^3/2$ multiplicative terms overall. Hence we can apply Lemma 1 to processors
$P_0^*$ and $P_1^*$, obtaining the desired result.
The proposed bound is tight and is matched by the algorithm that decomposes the problem
into $n^3/W \le p$ subproblems of size $W^{1/3} \times W^{1/3}$, and then solves each subproblem sequentially
in each round. Since $W \ge n^3/p$, the minimum communication complexity is $\Omega(n^2/p^{2/3})$, which is
achieved by the standard 3D algorithm [17].
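The counting behind this matching upper bound can be sketched in Python. The sketch below is not the 3D algorithm of [17]; it only enumerates a decomposition of the $n^3$ multiplicative terms into cubic subproblems of side s (so that $W = s^3$) and counts the matrix entries each subproblem touches. The function names and the assumption that s divides n are made here for illustration.

```python
from itertools import product


def block_footprint(s: int, bi: int, bj: int, bk: int):
    """Number of terms in block (bi, bj, bk) and number of entries of A, B, C it touches."""
    rows = range(bi * s, (bi + 1) * s)
    cols = range(bj * s, (bj + 1) * s)
    inner = range(bk * s, (bk + 1) * s)
    terms = [(i, j, k) for i in rows for j in cols for k in inner]
    a_entries = {(i, k) for i, _, k in terms}
    b_entries = {(k, j) for _, j, k in terms}
    c_entries = {(i, j) for i, j, _ in terms}
    return len(terms), len(a_entries) + len(b_entries) + len(c_entries)


def decomposition_summary(n: int, s: int):
    """Return (number of subproblems, terms per subproblem, words touched per subproblem)."""
    if n % s != 0:
        raise ValueError("this sketch assumes that s divides n")
    blocks = n // s
    footprints = [block_footprint(s, *b) for b in product(range(blocks), repeat=3)]
    terms = max(t for t, _ in footprints)
    words = max(w for _, w in footprints)
    return len(footprints), terms, words


print(decomposition_summary(n=8, s=2))
```

With n = 8 and s = 2 there are 64 subproblems of W = 8 terms each, and each touches 12 = 3·$W^{2/3}$ matrix entries, which matches the $\Omega(W^{2/3})$ lower bound up to a constant factor.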
Finally, we observe that the above theorem can be extended to the case $W \le \epsilon n^3$, for an
arbitrary constant ε ∈ (0, 1), as long as each multiplicative term is computed once. Also, we
remark that, if each processor holds $O(W^{2/3})$ inputs, our bound applies even when each input
element may be present in more than one processor at the beginning of the computation. We
also conjecture that our bound can be extended up to $p^{1/3}$ replication, as shown in [27] assuming
balanced memory or work.
3 Stencil Computations
A stencil defines the computation of an element in a (d − 1)-dimensional spatial grid at time t as a
function of neighboring grid elements at time t − 1, . . . , t − τ, for some value τ ≥ 1 and constant d > 1
(see, e.g., [14]). We provide an $\Omega(n^{d-1}/p^{(d-2)/(d-1)})$ lower bound on the communication complexity
of any algorithm evaluating n time steps of a (d − 1)-dimensional stencil. For simplicity we assume
τ = 1; however, our bounds still apply in the general case. The bound follows by investigating
the (n, d)-array problem, which consists of evaluating all nodes of a d-dimensional array DAG of
size n. Indeed, the DAG given by the (d − 1)-dimensional grid plus the time dimension spans a
d-dimensional spacetime containing an (n/2, d)-array as a subgraph. A d-dimensional array DAG has
$n^d$ nodes $\langle i_0, \ldots, i_{d-1} \rangle$, for each $0 \le i_0, \ldots, i_{d-1} < n$, and there is an arc from
$\langle i_0, \ldots, i_k, \ldots, i_{d-1} \rangle$ to $\langle i_0, \ldots, i_k + 1, \ldots, i_{d-1} \rangle$, for each $0 \le k < d$ and
$0 \le i_0, \ldots, i_{d-1} < n - 1$. Observe that $\langle 0, \ldots, 0 \rangle$
and $\langle n-1, \ldots, n-1 \rangle$ are the single input and output nodes, respectively.
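The array DAG just defined can be generated with a few lines of Python. The sketch assumes that an arc along dimension k exists whenever $i_k < n - 1$, which is the reading consistent with the DAG having a single input node and a single output node; the function names are hypothetical.

```python
from itertools import product


def array_dag(n: int, d: int):
    """Nodes and arcs of the d-dimensional array DAG of size n."""
    nodes = list(product(range(n), repeat=d))
    arcs = []
    for u in nodes:
        for k in range(d):
            if u[k] + 1 < n:  # assumption: arc whenever the k-th index can grow
                v = u[:k] + (u[k] + 1,) + u[k + 1:]
                arcs.append((u, v))
    return nodes, arcs


def sources_and_sinks(nodes, arcs):
    """Nodes with no incoming arc (inputs) and with no outgoing arc (outputs)."""
    has_in = {v for _, v in arcs}
    has_out = {u for u, _ in arcs}
    sources = [x for x in nodes if x not in has_in]
    sinks = [x for x in nodes if x not in has_out]
    return sources, sinks


nodes, arcs = array_dag(n=3, d=2)
print(len(nodes), len(arcs), sources_and_sinks(nodes, arcs))
```

For n = 3 and d = 2 this is the diamond DAG with 9 nodes and $d \cdot n^{d-1}(n-1) = 12$ arcs, whose only input is ⟨0, 0⟩ and whose only output is ⟨2, 2⟩.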
Our result hinges on the restriction on the nature of the computation whereby each vertex
of the DAG is computed exactly once. In this setting, the crucial property is that for each arc
(u, v) such that u is computed by processor P and v is computed by processor P′, with P ≠ P′, there
corresponds a message from P to P′ (which may also cross other processors). Such arcs are referred
to as communication arcs.
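A small Python sketch makes the notion of communication arc concrete. Given a list of arcs and a map assigning each node to the single processor that computes it (no recomputation), it lists the arcs whose endpoints lie on different processors and tallies, per processor, how many such arcs it is incident to. The names `communication_arcs` and `per_processor_load` are hypothetical, and counting one message per arc follows the property stated above.

```python
from collections import Counter


def communication_arcs(arcs, owner):
    """Arcs (u, v) whose endpoints are computed by different processors."""
    return [(u, v) for u, v in arcs if owner[u] != owner[v]]


def per_processor_load(arcs, owner):
    """For each processor, the number of communication arcs leaving or entering it."""
    load = Counter()
    for u, v in communication_arcs(arcs, owner):
        load[owner[u]] += 1  # the value of u is sent by owner[u]
        load[owner[v]] += 1  # and received by owner[v]
    return load


# A 2 x 2 diamond DAG, with row 0 assigned to processor 0 and row 1 to processor 1.
arcs = [
    ((0, 0), (0, 1)),
    ((0, 0), (1, 0)),
    ((0, 1), (1, 1)),
    ((1, 0), (1, 1)),
]
owner = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 1}
print(communication_arcs(arcs, owner))
print(per_processor_load(arcs, owner))
```

In this example the two vertical arcs cross the processor boundary, so each processor is incident to two communication arcs.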
We now introduce some preliminary definitions, which will be used throughout the section.
We envision an (n, d)-array as partitioned into $p^{d/(d-1)}$ smaller d-dimensional arrays, called blocks,
of size $n/p^{1/(d-1)}$, and denote each block with $B_{i_0,\ldots,i_{d-1}}$ for $0 \le i_0, \ldots, i_{d-1} < p^{1/(d-1)}$. Block
$B_{i_0,\ldots,i_{d-1}}$ contains nodes $\langle i'_0, \ldots, i'_{d-1} \rangle$, for each $i_k n/p^{1/(d-1)} \le i'_k < (i_k+1)n/p^{1/(d-1)}$. A block has
$n^d/p^{d/(d-1)}$ nodes, and is said to be ℓ-owned if more than half of its nodes are evaluated by processor $P_\ell$,
with 0 ≤ ℓ < p. A block is owned if there exists some ℓ, with 0 ≤ ℓ < p, such that it is ℓ-owned; it
is shared otherwise. Two blocks $B_{i_0,\ldots,i_{d-1}}$ and $B_{i'_0,\ldots,i'_{d-1}}$ are said to be adjacent if their coordinates
differ in just one position k and $|i_k - i'_k| = 1$ (i.e., they share a face). For the sake of simplicity, we
assume that n and p are powers of $2^{d-1}$ and thus the previous values (e.g., $n/p^{1/(d-1)}$) are integral:
since d is a constant, this assumption is verified by suitably increasing n and decreasing p by a
constant factor which does not asymptotically affect our lower bounds.
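The definitions of owned, shared and adjacent blocks can be sketched in Python as follows. The block side is passed directly as the parameter `side` (it plays the role of $n/p^{1/(d-1)}$), and the assignment of nodes to processors is given as a function `owner`; both choices, and the function names, are assumptions made for this illustration.

```python
from collections import Counter
from itertools import product


def classify_blocks(n: int, d: int, side: int, owner):
    """Map each block index to the processor owning it, or to None if the block is shared."""
    if n % side != 0:
        raise ValueError("this sketch assumes that side divides n")
    result = {}
    for block in product(range(n // side), repeat=d):
        ranges = [range(c * side, (c + 1) * side) for c in block]
        counts = Counter(owner(node) for node in product(*ranges))
        proc, count = counts.most_common(1)[0]
        # l-owned: strictly more than half of the block's nodes are evaluated by P_l
        result[block] = proc if 2 * count > side ** d else None
    return result


def adjacent(b1, b2) -> bool:
    """Blocks are adjacent if their indices differ in exactly one position, by one."""
    return sum(abs(x - y) for x, y in zip(b1, b2)) == 1


# n = 4, d = 2, blocks of side 2: the top two rows go to P0, the bottom two to P1.
by_rows = classify_blocks(4, 2, 2, lambda node: 0 if node[0] < 2 else 1)
print(by_rows)

# A checkerboard assignment gives each processor exactly half of every block: all shared.
checker = classify_blocks(4, 2, 2, lambda node: (node[0] + node[1]) % 2)
print(checker)
```

In the first assignment the 0-owned block (0, 0) is adjacent to the 1-owned block (1, 0), which is the situation addressed by the lemma on adjacent blocks later in this section.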
In order to establish our main lower bound, we need two preliminary lemmas. The first one
gives a slack lower bound based on the d-dimensional version of the Loomis-Whitney geometric
inequality [21], and is reminiscent of the result of Theorem 1 for matrix multiplication when d = 3.
Lemma 3. Let $\mathcal{A}_d$ be any algorithm solving the (n, d)-array problem, without recomputation, on a
BSP with p processors, where $1 < p \le n^{d-1}$, and denote with W the maximum number of nodes
evaluated by a processor. If $W \le \epsilon n^d$, for an arbitrary constant ε ∈ (0, 1), then the communication
complexity of the algorithm is
\[ H_{\mathcal{A}_d}(n, p) = \Omega\left(W^{(d-1)/d}\right). \]
Proof. Let $P_0$ be a processor evaluating W nodes, and suppose $W \le n^d/2$. Denote with Ψ the set of
nodes evaluated by $P_0$, and with $N_i$ the set of points obtained by dropping the i-th dimension from
the set Ψ, for each $0 \le i < d$. Let $n_i = |N_i|$. Applying the discrete Loomis-Whitney inequality [21],
we have $W^{d-1} \le \prod_{i=0}^{d-1} n_i$, and hence $\max\{n_0, \ldots, n_{d-1}\} \ge W^{(d-1)/d}$. Assume, without loss of
generality, $n_0 \ge W^{(d-1)/d}$. Let the set $N'_0$ contain the points in $N_0$ such that processor $P_0$ evaluates
all nodes defined by the associated 1-dimensional arrays: more formally, $\langle i_1, \ldots, i_{d-1}\rangle \in N'_0$ if
$\langle i_1, \ldots, i_{d-1}\rangle \in N_0$ and processor $P_0$ evaluates $\langle 0, i_1, \ldots, i_{d-1}\rangle, \ldots, \langle n-1, i_1, \ldots, i_{d-1}\rangle$. Let
$n'_0 = |N'_0|$. If $n'_0 > n_0/2^{1/d}$, we get
\[ W \ge n'_0 n > \frac{n_0 n}{2^{1/d}} \ge \frac{W^{(d-1)/d}\, n}{2^{1/d}} \ge W, \]
where in the last inequality we have used the hypothesis that $W \le n^d/2$. This yields the contradiction
$W > W$, and thus it must be that $n'_0 \le n_0/2^{1/d}$. Then, there are $n_0 - n'_0 \ge (1 - 1/2^{1/d}) W^{(d-1)/d}$ points
in $N_0$ whose respective 1-dimensional arrays have not been completely evaluated by $P_0$: therefore,
there is one communication arc associated to each array, and the lemma follows.
If $W > n^d/2$, we consider the remaining p − 1 processors as a single virtual processor evaluating
$n^d - W \le n^d/2$ nodes, since recomputation is disallowed. An argument equivalent to the previous
one gives the claim.
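The use of the Loomis-Whitney inequality in the proof of Lemma 3 can be checked numerically on small instances. The following is a minimal sketch, not part of the original argument: the helper names (`projections`, `loomis_whitney_holds`, `largest_projection`) and the grid size are assumptions made for illustration. It samples random node sets Ψ in a small d-dimensional grid and verifies that W^(d-1) is at most the product of the projection sizes, and that the largest projection has at least W^((d-1)/d) points.

```python
import itertools
import random

def projections(points, d):
    """Return the d projections N_i obtained by dropping coordinate i."""
    return [{p[:i] + p[i + 1:] for p in points} for i in range(d)]

def loomis_whitney_holds(points, d):
    """Check W^(d-1) <= prod_i |N_i| with exact integer arithmetic."""
    w = len(points)
    prod = 1
    for proj in projections(points, d):
        prod *= len(proj)
    return w ** (d - 1) <= prod

def largest_projection(points, d):
    """Size of the largest of the d projections."""
    return max(len(proj) for proj in projections(points, d))

random.seed(0)
n, d = 6, 3  # illustrative grid side and dimension
grid = list(itertools.product(range(n), repeat=d))
for _ in range(200):
    psi = set(random.sample(grid, random.randint(1, len(grid))))
    assert loomis_whitney_holds(psi, d)
    # max_i n_i >= W^((d-1)/d) is equivalent to (max_i n_i)^d >= W^(d-1)
    assert largest_projection(psi, d) ** d >= len(psi) ** (d - 1)
```

A random check of this kind only illustrates the inequality; the proof relies on [21] for the general statement.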
Now we need a second lemma that bounds from below the number of messages exchanged by
a processor Pℓ while evaluating nodes in an ℓ-owned block and in an adjacent block which is not
ℓ-owned.
Lemma 4. Consider an ℓ-owned block B adjacent to a shared or ℓ′-owned block B′, with ℓ ≠ ℓ′.
Then, the number of messages exchanged by processor $P_\ell$ for evaluating, without recomputation,
nodes in B and B′ is
\[ \Omega\left(\frac{n^{d-1}}{p}\right). \]
Proof. We suppose without loss of generality that $B = B_{0,0,\ldots,0}$ and $B' = B_{1,0,\ldots,0}$. We call a node
blue if it is evaluated by $P_\ell$, and red otherwise; let $n_b$ and $n_r$ (resp., $n'_b$ and $n'_r$) be the number
of blue and red nodes in B (resp., B′). By definition, $n_b \ge n^d/(2p^{d/(d-1)})$. Moreover, since B′ is
either shared or ℓ′-owned with ℓ′ ≠ ℓ, we have that $n'_r \ge n^d/(2p^{d/(d-1)})$.
Suppose $n_b \ge 3n^d/(4p^{d/(d-1)})$ and $n'_r \ge 3n^d/(4p^{d/(d-1)})$. Consider the $n^{d-1}/p$ arrays defined
by nodes in B and B′ sharing the last d − 1 indexes (i.e., the lines orthogonal to the adjacency
face): more formally, for each $0 \le i_1, \ldots, i_{d-1} < n/p^{1/(d-1)}$, each array contains nodes
$\langle 0, i_1, \ldots, i_{d-1}\rangle, \ldots, \langle 2n/p^{1/(d-1)} - 1, i_1, \ldots, i_{d-1}\rangle$. There are at least $2n^{d-1}/(3p)$ arrays containing at least
$n/(4p^{1/(d-1)})$ blue (resp., red) nodes in B (resp., B′). Then, at least $n^{d-1}/(3p)$ arrays contain both
red and blue nodes, and thus for each of them there is a communication arc. Since recomputation
is disallowed, each of these arcs entails the communication of a datum, and the lemma follows.
Finally, suppose $n_b < 3n^d/(4p^{d/(d-1)})$ (resp., $n'_r < 3n^d/(4p^{d/(d-1)})$). The lemma follows by
applying Lemma 3 to B (resp., B′), with $W = n_b$ (resp., $W = n'_r$), and considering processors $P_i$
with $i \ne \ell$ as a single virtual processor.
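The counting step in the proof of Lemma 4 (if at least three quarters of a block is blue, then at least two thirds of its lines contain at least a quarter of blue nodes, and similarly for red nodes in the adjacent block) can be illustrated with a small simulation. This is only a sketch under assumed parameters: `lines` and `length` stand in for $n^{d-1}/p$ and $n/p^{1/(d-1)}$, and the helper names are hypothetical.

```python
import random

def heavy_lines(block, threshold):
    """Indices of lines holding at least `threshold` marked nodes."""
    return {i for i, line in enumerate(block) if sum(line) >= threshold}

def random_block(lines, length, marked, rng):
    """A block with `lines` lines of `length` nodes, `marked` of which are marked."""
    cells = [1] * marked + [0] * (lines * length - marked)
    rng.shuffle(cells)
    return [cells[i * length:(i + 1) * length] for i in range(lines)]

rng = random.Random(1)
lines, length = 30, 8  # illustrative stand-ins, not values from the paper
volume = lines * length
for _ in range(500):
    marked = rng.randint(3 * volume // 4, volume)  # at least 3/4 of the block
    blue = heavy_lines(random_block(lines, length, marked, rng), length / 4)
    red = heavy_lines(random_block(lines, length, marked, rng), length / 4)
    # at least 2/3 of the lines are heavy in each block ...
    assert 3 * len(blue) >= 2 * lines and 3 * len(red) >= 2 * lines
    # ... hence at least 1/3 of the lines are heavy in both
    assert 3 * len(blue & red) >= lines
```

Random blocks rarely come close to the worst case, so the simulation is an illustration of the pigeonhole argument rather than a verification of it.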
The next theorem gives the claimed $\Omega\left(n^{d-1}/p^{(d-2)/(d-1)} + W^{(d-1)/d}\right)$ lower bound, and its
proof is inspired by the argument in [30] for the cube DAG (which however assumes balanced
work). The lower bound is matched by the balanced algorithm given in [30], which decomposes the
(n, d)-array into p arrays with dimension d and size $n/p^{1/(d-1)}$.
Theorem 2. Let $A_d$ be any algorithm for solving the (n, d)-array problem, without recomputation,
on a BSP with p processors, where $1 < p \le n^{d-1}$, and let W be the maximum number of nodes
evaluated by a processor. If $W \le \epsilon n^d$, for an arbitrary constant $\epsilon \in (0, 1)$, then the communication
complexity of the algorithm is
\[ H_{A_d}(n, p) = \Omega\left(\frac{n^{d-1}}{p^{(d-2)/(d-1)}} + W^{(d-1)/d}\right). \]
Proof. The second term of the lower bound follows directly from Lemma 3 and dominates the first
one as long as $W = \Omega\left(n^d/p^{d(d-2)/(d-1)^2}\right)$. In the remainder of the proof, we focus on the first term and assume
$W < n^d/p^{d(d-2)/(d-1)^2}$.
Suppose the number of shared blocks to be at least $p^{d/(d-1)}/2$. Then, there exists a processor,
say $P_0$, computing $W' \ge n^d/(2p)$ nodes in $b \ge 1$ shared blocks. Denote with $w_i$, for $0 \le i < b$, the
number of nodes computed by $P_0$ in the i-th shared block. We have $\sum_{i=0}^{b-1} w_i = W'$. By Lemma 3,
the messages exchanged by $P_0$ within the i-th block are $\Omega\left(w_i^{(d-1)/d}\right)$, and thus the communication
complexity is at least
\[ H_{A_d}(n, p) = \Omega\left(\sum_{i=0}^{b-1} w_i^{(d-1)/d}\right). \]
The summation is minimized when each $w_i$, $i \in \{0, 1, \ldots, b-1\}$, is set to the maximum allowed value,
that is $w_i = n^d/(2p^{d/(d-1)})$ since each block is shared, and b is set to $W'/(n^d/(2p^{d/(d-1)})) \ge p^{1/(d-1)}$.
The claim follows.
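The minimization step above rests on the concavity of $x \mapsto x^{(d-1)/d}$: for a fixed total, a sum of concave terms with capped arguments is smallest when the work is packed into as few blocks as possible. A minimal numerical sketch follows; the function names and the values of `total`, `cap` and `d` are assumptions chosen for illustration only.

```python
import random

def comm_cost(ws, d):
    """Sum of w_i^((d-1)/d), the per-block quantity bounded by Lemma 3."""
    return sum(w ** ((d - 1) / d) for w in ws)

def concentrated_split(total, cap):
    """Fill blocks up to `cap`, one after the other, until `total` is used."""
    full, rest = divmod(total, cap)
    return [cap] * full + ([rest] if rest else [])

def random_split(total, cap, rng):
    """A random split of `total` into positive parts, none exceeding `cap`."""
    parts = []
    while total > 0:
        w = rng.randint(1, min(cap, total))
        parts.append(w)
        total -= w
    return parts

rng = random.Random(0)
total, cap, d = 1000, 64, 3  # illustrative values
best = comm_cost(concentrated_split(total, cap), d)
for _ in range(200):
    # no random split beats the concentrated one (up to rounding error)
    assert best <= comm_cost(random_split(total, cap, rng), d) + 1e-9
```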
Suppose now the number of shared blocks to be less than $p^{d/(d-1)}/2$. Intuitively, in the following
argument we search for a hypercube that is almost entirely evaluated by a single processor
communicating a number of messages proportional to the surface area. Then, we highlight a
critical sequence of these hypercubes which are evaluated one after the other: the total amount of
communication performed by the associated processors gives the claimed bound.
We define a chain of length f as a sequence of blocks $B_{i_0^0,\ldots,i_{d-1}^0}, \ldots, B_{i_0^{f-1},\ldots,i_{d-1}^{f-1}}$ such that $i_k^j =
i_k^{j-1} + 1$, for each $0 \le k < d$ and $1 \le j < f$. For instance, when d = 3 a chain is a sequence of blocks
parallel to the main diagonal. Since fewer than $p^{d/(d-1)}/2$ blocks are shared, there exists at least one
chain of length $c'p^{1/(d-1)}$ containing at least $cp^{1/(d-1)}$ owned blocks, for suitable constants c′ and c
in (0, 1). For simplicity, we denote with $B_k$ the k-th block in this chain, for $0 \le k < c'p^{1/(d-1)}$.
Consider an ℓ-owned block $B_k$. An s-hypercube of $B_k = B_{i_0^k,\ldots,i_{d-1}^k}$ is defined as the set of
blocks $\bigcup_{0 \le i_0,\ldots,i_{d-1} < s} \{B_{i_0^k+i_0,\ldots,i_{d-1}^k+i_{d-1}}\}$, and an s-hypercube is ℓ-owned if at least $s^d/2$ blocks are
ℓ-owned. Note that an s-hypercube of $B_k$ contains $B_k, B_{k+1}, \ldots, B_{k+s-1}$. The k-size $s_k$ of $B_k$ is the
smallest value such that the $s_k$-hypercube is not ℓ-owned, that is, there are at least $s_k^d/2$ shared or
ℓ′-owned blocks, with ℓ′ ≠ ℓ. By definition, an $s_k$-hypercube contains at least $(s_k - 1)^d/2$ ℓ-owned
blocks. Envision an $s_k$-hypercube as a d-dimensional array of size $s_k$ where processor $P_\ell$ evaluates
$(s_k - 1)^d/2 \le W' < s_k^d/2$ nodes (i.e., the ℓ-owned blocks); by Lemma 3, there are $\Omega\left(W'^{(d-1)/d}\right)$
communication arcs: that is, there are $\Theta\left(s_k^{d-1}\right)$ ℓ-owned blocks adjacent to (distinct) non ℓ-owned
blocks. Then, by Lemma 4, $P_\ell$ exchanges $\Omega\left(s_k^{d-1} n^{d-1}/p\right)$ messages for evaluating the nodes of the
$s_k$-hypercube.
Consider now the longest sequence of owned blocks $B_{k_0}, \ldots, B_{k_{t-1}}$ such that $k_i - k_{i-1} \ge s_{k_{i-1}}$ for
each $0 < i < t$, and $k_{t-1} \le c'p^{1/(d-1)} - 2p^{1/(d-1)^2}$. We observe that the last assumption guarantees
that the hypercube in $B_{k_{t-1}}$ with size $s_{k_{t-1}}$ is well defined, that is, it does not exceed the boundaries
of the (n, d)-array: indeed, since $W < n^d/p^{d(d-2)/(d-1)^2}$, the side $s_{k_{t-1}}$ is smaller than $2p^{1/(d-1)^2}$.
We also notice that blocks $B_{k_0}, \ldots, B_{k_{t-1}}$ may be owned by different processors. By construction,
each node within block $B_{k_i}$ depends on all the nodes in $B_{k_{i-1}}$, and thus all messages exchanged
while evaluating nodes in $B_{k_i}$ are subsequent to those exchanged while evaluating nodes in $B_{k_{i-1}}$.
Then, by summing the number of messages exchanged by the owners of the t blocks for evaluating
nodes in the respective $s_{k_i}$-hypercubes, we have
\[ H_{A_d}(n, p) = \Omega\left(\sum_{i=0}^{t-1} \frac{s_{k_i}^{d-1}\, n^{d-1}}{p}\right). \tag{1} \]
Let $S = \sum_{i=0}^{t-1} s_{k_i}$. Since there are $\Theta\left(p^{1/(d-1)}\right)$ owned blocks in the chain, we have $S = \Theta\left(p^{1/(d-1)}\right)$.
Equation 1 is minimized by setting $s_{k_i} = S/t$, and then we obtain the claimed communication
complexity by exploiting the fact that $t \le p^{1/(d-1)}$.
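For completeness, the last step can be spelled out as follows. This is a sketch of the omitted calculation, using only the quantities defined in the proof (the convexity of $x \mapsto x^{d-1}$, the estimate $S = \Theta(p^{1/(d-1)})$, and $t \le p^{1/(d-1)}$):

```latex
\[
\sum_{i=0}^{t-1} \frac{s_{k_i}^{d-1}\, n^{d-1}}{p}
\;\ge\; t \left(\frac{S}{t}\right)^{d-1} \frac{n^{d-1}}{p}
\;=\; \frac{S^{d-1}}{t^{d-2}} \cdot \frac{n^{d-1}}{p}
\;=\; \Omega\!\left(\frac{p}{p^{(d-2)/(d-1)}} \cdot \frac{n^{d-1}}{p}\right)
\;=\; \Omega\!\left(\frac{n^{d-1}}{p^{(d-2)/(d-1)}}\right).
\]
```

For instance, for d = 3 the expression becomes $\Omega(n^2/p^{1/2})$.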
4 Sorting
In this section we give a lower bound on the communication complexity of comparison-based sorting
algorithms. Comparison sorting is defined as the problem in which a given set X of n input keys
from an ordered set has to be sorted, such that the only operations allowed on members of X
are pairwise comparisons. Our bound only requires that no processor does more than a constant
fraction $\epsilon$ of the Θ (n log n) comparisons required by any comparison sorting algorithm, for any
$\epsilon \in (0, 1)$, and does not impose any protocol on the distribution of the inputs and the outputs on
the processors, nor upper bounds on the size of their local memories, or specific communication
patterns. As in previous work, we still need the technical assumptions that the inputs are not
initially replicated, and that the processors store only a constant number of copies of any input key
at any moment during the execution of the algorithm.
The main result follows from the application of two lemmas, each of which provides a different
and independent lower bound to the communication complexity of sorting. Both rely on non-
trivial counting arguments, adapted from [2, 1], that hinge on the fact that any comparison sorting
algorithm must be able to distinguish between all the n! permutations of the n inputs. The first
lemma provides a lower bound as a function of the maximum number S of input keys initially held
by a processor. The second gives a lower bound as a function of the number Π of permutations
that can be distinguished before any communications take place. We begin by stating and proving
the first lemma.
Lemma 5. Let A be any algorithm sorting n keys on a BSP with p processors, with 1 < p ≤ n,
and let S denote the maximum number of input keys initially held by a processor. If each processor
performs at most $\epsilon (n \log n)$ comparisons, with $\epsilon$ being an arbitrary constant in (0, 1), and the input
is not initially replicated, then the communication complexity of the algorithm is
\[ H_A(n, p) = \Omega(S). \]
Proof. Without loss of generality, denote with $P_0$ a processor holding S input keys at the beginning,
and let $P^*$ identify the submachine including the remaining p − 1 processing units. Clearly, since
the input is not initially replicated, $P^*$ initially holds n − S input elements. Finally, for convenience,
we redefine $\epsilon$ as 1/(1 + δ), with δ being an arbitrary constant greater than zero.
Suppose first that $S > \left(1 - \frac{\delta}{4(1+\delta/2)}\right) n$. By hypothesis, each processor performs at most
(n log n)/(1 + δ) comparisons and thus processor $P_0$ can boost the number of distinguishable
permutations by a factor of at most
\[ 2^{\frac{n \log n}{1+\delta}} \le \left(\frac{n}{e(1+\delta/2)}\right)^{\frac{n}{1+\delta/2}} \le \left(\frac{n}{1+\delta/2}\right)!, \]
where the first inequality can be verified by taking the logarithm of both sides, and applies for n
larger than a suitable constant, while the second one follows from Stirling's approximation. This
holds independently of the number of keys that $P_0$ contains initially (which could be even n) or
that it receives from $P^*$ during the execution of the algorithm. Therefore, $P^*$ must distinguish at
least $n!/(n/(1+\delta/2))!$ permutations. Then, if we denote with $S^* = n - S$ the number of keys
initially held by $P^*$, and with $h^*$ the number of keys sent by $P_0$ to $P^*$, we must have
\[ (S^* + h^*)! \ge \frac{n!}{(n/(1+\delta/2))!}. \]
By taking the logarithm of both sides and after some manipulation, we obtain
\[ (S^* + h^*) \log(S^* + h^*) \ge \frac{\delta n}{2(1+\delta/2)} \log \frac{n}{1+\delta/2}, \]
from which follows
\[ S^* + h^* \ge \frac{\delta n}{3(1+\delta/2)}. \]
Then, since $S^* < \frac{\delta n}{4(1+\delta/2)}$ and $S \le n$,
\[ h^* > \frac{\delta n}{12(1+\delta/2)} \ge \frac{\delta S}{12(1+\delta/2)}, \]
and the lemma follows.
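The requirement that n be "larger than a suitable constant" in the first case can be made concrete numerically. The sketch below, whose function names are assumptions introduced here, compares the two ends of the chain $2^{(n \log n)/(1+\delta)} \le (n/(1+\delta/2))!$ in the logarithmic domain, taking logarithms in base 2 and evaluating the factorial of a real argument through the log-gamma function.

```python
import math

def log2_factorial(x):
    """log2(x!) for real x >= 0, via the log-gamma function."""
    return math.lgamma(x + 1.0) / math.log(2.0)

def permutation_gap(n, delta):
    """log2((n/(1+delta/2))!) - (n log2 n)/(1+delta).

    A non-negative value means 2^(n log n/(1+delta)) <= (n/(1+delta/2))!.
    """
    comparisons = n * math.log2(n) / (1.0 + delta)
    return log2_factorial(n / (1.0 + delta / 2.0)) - comparisons

# With delta = 1 the bound fails for small n and holds for large n.
for n in (16, 2 ** 10, 2 ** 20):
    print(n, permutation_gap(n, 1.0) >= 0)
```

The threshold on n grows as δ approaches zero, which is consistent with δ being treated as a constant in the lemma.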
Now consider the case $S \le \left(1 - \frac{\delta}{4(1+\delta/2)}\right) n$. Let $h'$ and $h^*$ be the number of keys received by
$P_0$ and $P^*$, respectively, and let $V'$ and $V^*$ be the maximum number of permutations distinguished
by $P_0$ and $P^*$, respectively. We must have $V'V^* \ge n!$. We also have $V' \le S!\binom{S+h'}{h'}$: indeed, $P_0$ can
distinguish all the S! permutations of the S input keys, and the number of ways to intersperse the
$h'$ received keys within the group of S inputs is $\binom{S+h'}{h'}$. (Note that the $h'!$ permutations of the $h'$
messages are accounted for in $V^*$.) Similarly, $V^* \le (n-S)!\binom{n-S+h^*}{h^*}$. Thus, we have
\[ S!\,(n-S)! \binom{S+h'}{h'} \binom{n-S+h^*}{h^*} \ge n!, \]
whence
\[ \binom{S+h}{h} \binom{n-S+h}{h} \ge \binom{n}{S}, \]
where $h = \max\{h', h^*\}$. By using the fact that $(a/b)^b \le \binom{a}{b} \le (ea/b)^b$ for any integers $a \ge b \ge 1$,
and then by taking the logarithm of both sides, we get
\[ h \log\left(\frac{e^2}{h^2}(S+h)(n-S+h)\right) \ge S \log \frac{n}{S}, \tag{2} \]
where e is Euler's number, the base of the natural logarithm.
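The standard bounds on binomial coefficients used to derive Equation 2 are easy to check exhaustively for small arguments. The following is a minimal sketch; `binomial_sandwich` is a hypothetical helper name, and the small multiplicative tolerance only guards against floating-point rounding in the cases where the lower bound holds with equality.

```python
import math

def binomial_sandwich(a, b):
    """Return ((a/b)^b, C(a, b), (e*a/b)^b) for integers a >= b >= 1."""
    lower = (a / b) ** b
    exact = math.comb(a, b)
    upper = (math.e * a / b) ** b
    return lower, exact, upper

for a in range(1, 40):
    for b in range(1, a + 1):
        lo, mid, hi = binomial_sandwich(a, b)
        assert lo <= mid * (1 + 1e-12) and mid <= hi
```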
In the rest of the proof we will prove that h ≥ βS for a suitable constant β ∈ (0, 1) that will
be defined later. Suppose, for the sake of contradiction, that h < βS. We first observe that the
left-hand side of Equation 2 is increasing in h. Indeed, we have
\[ h \log\left(\frac{e^2}{h^2}(S+h)(n-S+h)\right) = 2h \log e + h \log \frac{S+h}{h} + h \log \frac{n-S+h}{h}, \]
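where $h \log((x+h)/h)$, with $x \in \{S, n-S\}$, is strictly increasing in h as soon as x > 0. Therefore,
the left-hand side of Equation 2 can be upper bounded as follows:
\begin{align*}
h \log\left(\frac{e^2}{h^2}(S+h)(n-S+h)\right)
  &< \beta S \log\left(e^2 \, \frac{(S+\beta S)}{\beta S} \, \frac{(n-S+\beta S)}{\beta S}\right) \\
  &= \beta S \log\left(\frac{e^2 (1+\beta)(n - S(1-\beta))}{\beta^2 S}\right) \\
  &< \beta S \log\left(\frac{2e^2 n}{\beta^2 S}\right) \\
  &< S \log\left(\frac{2en}{\beta S}\right)^{2\beta},
\end{align*}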
where we have also used the facts that β < 1 and S ≤ n. We now argue that the last term in the
above formula is upper bounded by S log(n/S). We shall consider two separate cases. The first is
when n/S ≥ 2. In this case, we set $\beta = \log(n/S)/(8 \log(2en/S))$. (Observe that 0 < β < 1, as
required.) Standard calculus shows that $8 \log(2en/S)/\log(n/S) < (2en/S)^3$ when n/S ≥ 2. Hence,
we can write
\[ \left(\frac{2en}{\beta S}\right)^{2\beta} = \left(\frac{2en \cdot 8 \log(2en/S)}{S \log(n/S)}\right)^{\frac{\log(n/S)}{4 \log(2en/S)}} < \left(\frac{2en}{S}\right)^{\frac{\log(n/S)}{\log(2en/S)}} = \frac{n}{S}. \]
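The choice of β in the case n/S ≥ 2 can be sanity-checked numerically. The sketch below is an illustration only, with hypothetical function names; it writes r for the ratio n/S and evaluates $(2en/(\beta S))^{2\beta}$ with $\beta = \log r/(8 \log(2er))$, confirming on a few sample ratios that β lies in (0, 1) and that the expression stays below r. The base of the logarithm cancels in the definition of β, so natural logarithms are used.

```python
import math

def beta_for_ratio(r):
    """beta = log(n/S) / (8 log(2e n/S)), where r = n/S >= 2."""
    return math.log(r) / (8.0 * math.log(2.0 * math.e * r))

def lhs(r):
    """(2e n / (beta S))^(2 beta), evaluated at r = n/S."""
    beta = beta_for_ratio(r)
    return (2.0 * math.e * r / beta) ** (2.0 * beta)

for r in (2.0, 3.0, 10.0, 1e3, 1e6, 1e12):
    assert 0.0 < beta_for_ratio(r) < 1.0
    assert lhs(r) < r
```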
Consider now the case $4(1+\delta/2)/(4+\delta) \le n/S < 2$. Then, we have $(2en/(\beta S))^{2\beta} < (11/\beta)^{2\beta}$.
Since δ is a constant, and since the right-hand term of the above inequality tends to one as β tends
to zero, then for each δ > 0 there exists a constant β ∈ (0, 1) such that $(11/\beta)^{2\beta} \le 4(1+\delta/2)/(4+\delta)$.
Therefore, we have shown for both cases that, if h < βS,
\[ h \log\left(\frac{e^2}{h^2}(S+h)(n-S+h)\right) < S \log \frac{n}{S}, \]
which is in contradiction with Equation 2. It follows that there exists a constant β > 0 such that
h ≥ βS, giving the lemma.
We now provide a second lemma, which bounds from below the communication complexity of
sorting in BSP as a function of the number Π of permutations that can be distinguished before any
communications take place, that is, when processors can only compare their local inputs.
As an aside, we observe that the proof of this lemma can be straightforwardly cast for the
LPRAM model, yielding a much simpler proof for Theorem 3.2 of [1], which bounds from below the
communication delay, that is, the number of communication steps, required for comparison-based
sorting.
Lemma 6. Let A be any algorithm sorting n keys on a BSP with p processors, where 1 < p ≤ n,
and let Π be the number of distinct permutations that can be distinguished by A before the second
superstep, that is, by comparing the inputs that (possibly) reside initially in the processors' local
memories. If A stores only a constant number of copies of any key at any time instant, then the
communication complexity of the algorithm is
\[ H_A(n, p) = \Omega\left(\frac{n \log n - \log \Pi}{p \log(n/p)}\right). \]
Proof. We prove the lemma only for the case when every input key is present in only one of the local
memories of the processors at any time instant; the extension to the case when a data element is
simultaneously present in a constant number of local memories is straightforward and thus omitted.
We suppose that A performs 1-relations in each superstep, that is, each processor can send and
receive only one message. This is without loss of generality because we observe that each superstep
of A where each processor performs an h-relation (i.e., it sends and receives at most h messages),
can be decomposed into h 1-relation supersteps without increasing the communication complexity
of A (since the latter does not charge a synchronization cost due to the latency incurred by each
superstep). Let $m_j$ denote the number of input keys in the local memory of processor $P_j$ after a given
superstep of the algorithm. Since, by hypothesis, a data element is present in only one of the
local memories of the processors at any time instant, we have that $\sum_{j=0}^{p-1} m_j \le n$. Hence, after a
communication superstep, which by hypothesis entails a 1-relation, the space of permutations can
be divided, at most, by the value of an optimal solution of the following convex program (observe,
in fact, that $P_j$ may already have distinguished $(m_j - 1)!$ permutations before the last superstep):
\[ \max \prod_{j=0}^{p-1} m_j \quad \text{s.t.} \quad \sum_{j=0}^{p-1} m_j \le n. \]
Since the solution is given by $m_j = n/p$ for each j, its value is $(n/p)^p$. Thus, after x supersteps, the
space of permutations can have been divided by at most $(n/p)^{px}$. Since there remain n!/Π distinct
possible permutations, we must have
\[ \left(\frac{n}{p}\right)^{px} \ge \frac{n!}{\Pi}. \]
By taking the logarithm of both sides, we obtain
\[ x = \Omega\left(\frac{n \log n - \log \Pi}{p \log(n/p)}\right), \]
as desired.
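To get a feeling for the quantity bounded in Lemma 6, the inequality $(n/p)^{px} \ge n!/\Pi$ can be solved for x on concrete values. The sketch below is illustrative only: the function names are assumptions, logarithms are in base 2, and the evenly distributed case $\Pi = ((n/p)!)^p$ is just one possible initial configuration.

```python
import math

def log2_factorial(x):
    """log2(x!) for real x >= 0, via the log-gamma function."""
    return math.lgamma(x + 1.0) / math.log(2.0)

def min_one_relation_supersteps(n, p, log2_pi):
    """Smallest integer x with (n/p)^(p*x) >= n!/Pi, given log2_pi = log2(Pi).

    Requires 1 < p < n, so that log2(n/p) is positive.
    """
    if not (1 < p < n):
        raise ValueError("need 1 < p < n")
    remaining_bits = log2_factorial(n) - log2_pi
    if remaining_bits <= 0:
        return 0
    return math.ceil(remaining_bits / (p * math.log2(n / p)))

def log2_pi_even_distribution(n, p):
    """log2 of Pi = ((n/p)!)^p: each processor first sorts its n/p local keys."""
    return p * log2_factorial(n / p)

n, p = 1024, 4
print(min_one_relation_supersteps(n, p, log2_pi_even_distribution(n, p)))
```

When the keys are evenly distributed the numerator is roughly n log p, which is why this bound alone is weak for concentrated inputs and is combined with Lemma 5.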
Now we are ready to prove the main result of this section, an $\Omega((n \log n)/(p \log(n/p)))$ lower
bound on the communication complexity of any comparison sorting algorithm. The result follows
by combining the bounds given by the previous two lemmas. Neither bound is tight when
considered on its own: the first (Lemma 5) is weak when the input keys are initially
distributed evenly among the processors, while the second (Lemma 6) is weak
when the input keys are concentrated on one or a few processors. However, the simultaneous
application of both provides the sought (tight) lower bound.
Theorem 3. Let A be any algorithm for sorting n keys on a BSP with p processors, with 1 < p ≤ n.
If each processor performs at most $\epsilon (n \log n)$ comparisons, with $\epsilon$ being an arbitrary constant in
(0, 1), the inputs are not initially replicated, and the p processors store only a constant number of
copies of any key at any time instant, then the communication complexity of the algorithm is
\[ H_A(n, p) = \Omega\left(\frac{n \log n}{p \log(n/p)}\right). \]
Proof. The result follows by combining Lemma 5 with Lemma 6. Since, by hypothesis, each
processor performs at most $\epsilon (n \log n)$ comparisons, with $\epsilon \in (0, 1)$, and the inputs are not initially
replicated, we can apply Lemma 5, obtaining
\[ H_A(n, p) = \Omega(S), \]
where S denotes the maximum number of input keys initially held by a processor. Moreover, since
by hypothesis the p processors store a constant number of copies of any key at any time instant,
we can also apply Lemma 6, obtaining
\[ H_A(n, p) = \Omega\left(\frac{n \log n - \log \Pi}{p \log(n/p)}\right), \tag{3} \]
where Π denotes the number of distinct permutations that can be distinguished by A by comparing
the inputs that initially reside in processors' local memories. In order to compare the latter bound
with the first one, we need to bound Π from above as a function of S. To this end, let $s_i$ denote
the number of input keys initially held by processor $P_i$. Hence, $S = \max\{s_0, s_1, \ldots, s_{p-1}\}$. The
number of permutations that can be distinguished by A without requiring communication, that
is, by letting each processor sort the keys that it holds at the beginning of the computation, is
therefore $\Pi = \prod_{i=0}^{p-1} s_i!$. Since the inputs are not initially replicated, an upper bound on Π as a
function of S is given by the value of an optimal solution of the following mathematical program:
\[ \max \prod_{i=0}^{p-1} s_i! \quad \text{s.t.} \quad \sum_{i=0}^{p-1} s_i = n, \quad s_i \le S \;\; \forall i = 0, 1, \ldots, p-1. \]
Since $a!\,b! \le (a+b)!$ for any integer a and b, by a convexity argument it follows that $\prod_{i=0}^{p-1} s_i! \le
(S!)^{n/S}$. Therefore, we can plug $\Pi = (S!)^{n/S}$ in Equation 3, obtaining
\[ H_A(n, p) = \Omega\left(\frac{n \log n - n \log S}{p \log(n/p)}\right). \]
Putting the pieces together, we conclude that
\[ H_A(n, p) = \Omega\left(\frac{n \log(n/S)}{p \log(n/p)} + S\right). \]
Standard calculus shows that the right-hand side of the above equation is increasing in S when
S = Ω(n/(p log(n/p))). The theorem follows by observing that S ≥ ⌈n/p⌉.
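Two steps of this proof lend themselves to a quick numerical check: the bound $\prod_i s_i! \le (S!)^{n/S}$, and the fact that the combined expression evaluated at S = n/p coincides with $n \log n/(p \log(n/p))$. The sketch below is illustrative; the function names and the small values of n and p are assumptions, and the product bound is verified by brute force over all initial distributions of n = 12 keys on p = 4 processors.

```python
import itertools
import math

def log_fact(x):
    """Natural logarithm of x! for real x >= 0."""
    return math.lgamma(x + 1.0)

def product_bound_holds(sizes):
    """Check prod_i s_i! <= (S!)^(n/S), with n = sum(sizes) and S = max(sizes)."""
    n, s_max = sum(sizes), max(sizes)
    lhs = sum(log_fact(s) for s in sizes)
    rhs = (n / s_max) * log_fact(s_max)
    return lhs <= rhs + 1e-9  # tolerance for the equality cases

def combined_bound(n, p, s):
    """n log(n/S) / (p log(n/p)) + S, the expression preceding the final step."""
    return n * math.log(n / s) / (p * math.log(n / p)) + s

n, p = 12, 4
for sizes in itertools.product(range(n + 1), repeat=p):
    if sum(sizes) == n:
        assert product_bound_holds(sizes)

# At S = n/p the combined expression equals n log n / (p log(n/p)).
n, p = 1024, 4
target = n * math.log(n) / (p * math.log(n / p))
assert math.isclose(combined_bound(n, p, n / p), target)
```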
5 Fast Fourier Transform
In this section we consider the problem of computing the Discrete Fourier Transform of n values
using the n-input FFT DAG. In the FFT DAG, a vertex is a pair 〈w, l〉, with 0 ≤ w < n and
0 ≤ l ≤ log n, and there exists an arc between two vertices 〈w, l〉 and 〈w′, l′〉 if l′ = l+1, and either
w and w′ are identical or their binary representations differ exactly in the l′-th bit. We show that,
when no processor computes more than a constant fraction of the total number of vertices of the
DAG, the communication complexity is Ω (n log n/(p log(n/p))). Our bound does not assume any
particular I/O protocol, and only requires that every input resides in the local memory of exactly
one processor before the computation begins; as for preceding results, our bound also hinges on the
restriction on the nature of the computation whereby each vertex of the FFT DAG is computed
exactly once. The bound is tight for any p ≤ n, and is achieved by the well-known recursive
decomposition of the DAG into two sets of smaller $\sqrt{n}$-input FFT DAGs, with each set containing
$\sqrt{n}$ of such subDAGs (see, e.g., [7]).
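The definition of the FFT DAG given above translates directly into a small construction routine. The following is a minimal sketch; the function name `fft_dag` is hypothetical, and the convention that bits are numbered from 1 starting at the least significant position is an assumption (the text only says "the l′-th bit").

```python
def fft_dag(n):
    """Build the n-input FFT DAG as (vertices, arcs); n must be a power of two.

    A vertex is a pair (w, l) with 0 <= w < n and 0 <= l <= log n.
    Assumption: bits are numbered 1..log n from the least significant one.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, n >= 2")
    log_n = n.bit_length() - 1
    vertices = [(w, l) for l in range(log_n + 1) for w in range(n)]
    arcs = []
    for l in range(log_n):
        bit = 1 << l  # mask of the l'-th bit, with l' = l + 1
        for w in range(n):
            arcs.append(((w, l), (w, l + 1)))        # w' identical to w
            arcs.append(((w, l), (w ^ bit, l + 1)))  # w' differs in the l'-th bit
    return vertices, arcs

vertices, arcs = fft_dag(8)
print(len(vertices), len(arcs))  # n (log n + 1) vertices and 2 n log n arcs
```

Every vertex at a level l > 0 has exactly two predecessors, so the DAG has n log n non-input vertices, which is the node count that the work assumptions in this section refer to.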
We will first establish a lemma which, under the same hypotheses as the main result, provides
a lower bound on the communication complexity as a function of the maximum work performed
by any processor. The proof of the lemma is based on a bandwidth argument, which exploits the
fact that an FFT DAG can perform all cyclic shifts (see, e.g., [20]), and on the following technical
result which is implicit in the work of Hong and Kung (a simplified proof is due to Aggarwal and
Vitter [2]).
Lemma 7 ([16]). Consider the computation of the n-input FFT DAG. During the computation, if
a processor accesses at most S nodes of the DAG, then it can evaluate at most 2S log S nodes, for
any S ≥ 2.
Lemma 8. Let A be any algorithm computing, without recomputation, an n-input FFT DAG on
a BSP with p processors, with 1 < p ≤ n, and let W be the maximum number of nodes of the FFT
DAG computed by a processor. If $W \le \epsilon (n \log n)$, for an arbitrary constant $\epsilon \in (0, 1)$, and the inputs
are not initially replicated, then the communication complexity of the algorithm is
\[ H_A(n, p) = \Omega\left(\frac{W}{\log W}\right). \]
Proof. Let $P_0$ be a processor computing W nodes of the FFT DAG, and consider the remaining
p − 1 processing units as a single processor $P^*$.
Suppose first that processor $P^*$ contains at least n/2 of the n output nodes at the end of the
algorithm. Let K = W/(2 log W). Since $P_0$ evaluates W nodes, it follows from Lemma 7 that $P_0$
accesses at least K node values during the execution of the algorithm. These nodes can be either
inputs initially held by $P_0$, or nodes whose values have been evaluated and then sent by processor $P^*$.
If at least K/2 of them have been sent by $P^*$, the lemma follows. Otherwise, $P_0$ initially contains
at least K/2 input nodes. Since the inputs are not initially replicated, and since an FFT DAG can
perform all cyclic shifts, by [26, Lemma 10.5.2] there exists a cyclic shift that permutes K/4 input
nodes initially held by processor $P_0$ into K/4 output nodes held by $P^*$ at the end of the algorithm.
Since K = W/(2 log W) and since, by hypothesis, $W \le \epsilon (n \log n)$, it holds that K/4 ≤ n/2, and
thus at least K/4 messages are actually needed. Therefore, $H_A(n, p) \ge W/(8 \log W)$.
Now suppose that processor $P^*$ contains at most n/2 output nodes at the end of the algorithm.
Thus, there are at least n/2 output nodes in $P_0$. Since, by hypothesis, recomputation is disallowed,
$P^*$ computes $W^* = n \log n - W \le n \log n$ nodes of the DAG. The lemma follows by exchanging the
roles of $P_0$ and $P^*$ and setting $K = W^*/(2 \log W^*)$ in the previous argument.
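The way Lemma 7 is inverted in this proof (evaluating W nodes requires accessing at least K = W/(2 log W) node values) can be checked on concrete numbers. The sketch below is only an illustration, with hypothetical function names and logarithms taken in base 2: it finds by linear scan the smallest S with 2S log S ≥ W and verifies that it is never below K.

```python
import math

def max_evaluated(s):
    """Lemma 7: at most 2 S log S nodes can be evaluated by accessing S nodes."""
    return 2.0 * s * math.log2(s)

def min_accessed(w):
    """Smallest integer S >= 2 with 2 S log S >= w (linear scan)."""
    s = 2
    while max_evaluated(s) < w:
        s += 1
    return s

for w in (4, 64, 1000, 10 ** 5):
    k = w / (2.0 * math.log2(w))
    assert min_accessed(w) >= k
```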
We note that the above bound is matched when $W = O(n^{\epsilon} \log n)$, for any constant $\epsilon \in (0, 1)$, by
the previous recursive algorithm by ending the recursion when the subproblem size is Θ (W/ log W).
The main result of this section follows by a simple application of the preceding lemma and of
a result implicit in the proof of the lower bound due to Bilardi et al. [9, Corollary 1].
Theorem 4. Let A be any algorithm computing, without recomputation, an n-input FFT DAG on
a BSP with p processors, where 1 < p ≤ n. If each processor computes at most $\epsilon (n \log n)$ nodes
of the FFT DAG, for an arbitrary constant $\epsilon \in (0, 1)$, and the inputs are not initially replicated, then
the communication complexity of the algorithm is
\[ H_A(n, p) = \Omega\left(\frac{n \log n}{p \log(n/p)}\right). \]
Proof. If $W \ge n^{1/4}$, we have that $W \ge \max\{(n \log n)/p, n^{1/4}\}$, and thus we observe that the bound
given by Lemma 8 dominates the one claimed by the theorem. Otherwise, when $W < n^{1/4}$, we
use the following argument. By reasoning as in [9, Corollary 1], if at the end of the algorithm
A each processor holds at most U ≤ n output nodes of the FFT DAG and recomputation is not
allowed, then the communication complexity of A is $\Omega\left(\max\{0, n \log(n/U^2)/(p \log(n/p))\}\right)$. Since
$W < n^{1/4}$, each processor cannot contain more than $n^{1/4}$ output nodes, that is, $U \le n^{1/4}$, and the
theorem follows.
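The last step of the second case can be made explicit. As a sketch of the omitted calculation, substituting $U \le n^{1/4}$ into the bound obtained by reasoning as in [9, Corollary 1] gives:

```latex
\[
U \le n^{1/4}
\;\Longrightarrow\;
\frac{n}{U^2} \ge n^{1/2}
\;\Longrightarrow\;
\frac{n \log(n/U^2)}{p \log(n/p)}
\;\ge\; \frac{n \cdot \tfrac{1}{2} \log n}{p \log(n/p)}
\;=\; \Omega\!\left(\frac{n \log n}{p \log(n/p)}\right).
\]
```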
6 Conclusions
We have presented new lower bounds on the amount of communication required to solve some key
computational problems in distributed-memory parallel architectures. All our bounds have the
same functional form as previous results that appear in the literature; however, the latter rely
critically on assumptions that rule out a large part of the possible algorithms.
The novelty and the significance of our results stem from the assumptions under which our lower
bounds are developed, which are much weaker than those used in previous work.
Our bounds are derived within the BSP model of computation, but can be easily extended
to other models for distributed computations based on or similar to the BSP, such as LogP [13]
and MapReduce [18, 23]. Moreover, we believe that our results can also be ported to models for
multicore computing (see, e.g., [10, 33, 12]), since our proofs are based on some techniques that
have already been exploited in this scenario.
There is still much to do towards the establishment of a definitive theory of communication-efficient
algorithms. In fact, we were not able to remove all the restrictions that were in place
in previous work: in some cases our lower bounds still make use of some technical assumptions,
such as the non-recomputation of intermediate results, or restrictions on the replication of input
data. Although it seems that such restrictions can be relaxed to encompass a small amount of
recomputation or input replication, it is an open question whether these assumptions are
inherent to our proof techniques or can be removed. In particular, it is not clear, in general, when
recomputation has the power to reduce communication, since many lower bound techniques do
not apply in this more general scenario (see, e.g., [5]). Providing tight lower bounds that hold also
when recomputation is allowed is a fascinating and challenging avenue for future research.
Acknowledgments. The authors would like to thank Gianfranco Bilardi and Andrea Pietracap-
rina for useful discussions.
References
[1] A. Aggarwal, A. K. Chandra, and M. Snir. Communication complexity of PRAMs. Theoretical
Computer Science, 71:3–28, 1990.
[2] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems.
Communications of the ACM, 31(9):1116–1127, 1988.
[3] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Brief announcement: strong
scaling of matrix multiplication algorithms and memory-independent communication lower
bounds. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Archi-
tectures (SPAA), pages 77–79, 2012.
[4] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical
linear algebra. SIAM Journal on Matrix Analysis and Applications, 32(3):866–901, 2011.
[5] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Graph expansion and communication costs
of fast matrix multiplication. Journal of the ACM, 59(6):32:1–32:23, 2012.
[6] G. Bilardi, A. Pietracaprina, and P. D’Alberto. On the space and access complexity of compu-
tation DAGs. In Proceedings of the 26th International Workshop on Graph-Theoretic Concepts
in Computer Science (WG), pages 47–58, 2000.
[7] G. Bilardi, A. Pietracaprina, G. Pucci, and F. Silvestri. Network-oblivious algorithms. In
Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium
(IPDPS), pages 1–10, 2007.
[8] G. Bilardi and F. Preparata. Processor-time tradeoffs under bounded-speed message propaga-
tion: Part II, lower bounds. Theory of Computing Systems, 32(5):531–559, 1999.
[9] G. Bilardi, M. Scquizzato, and F. Silvestri. A lower bound technique for communication on
BSP with application to the FFT. In Proceedings of the 18th International Conference on
Parallel Processing (Euro-Par), pages 676–687, 2012.
[10] G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch.
Provably good multicore cache performance for divide-and-conquer algorithms. In Proceedings
of the 19th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 501–510, 2008.
[11] T. Cheatham, A. F. Fahmy, D. C. Stefanescu, and L. G. Valiant. Bulk synchronous paral-
lel computing – a paradigm for transportable software. In Proceedings of the 28th Hawaii
International Conference on System Sciences (HICSS), pages 268–275, 1995.
[12] R. A. Chowdhury, V. Ramachandran, F. Silvestri, and B. Blakeley. Oblivious algorithms
for multicores and networks of processors. Journal of Parallel and Distributed Computing,
73(7):911–925, 2013.
[13] D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, E. E. Santos, K. E. Schauser, R. Subra-
monian, and T. von Eicken. LogP: A practical model of parallel computation. Communications
of the ACM, 39(11):78–85, 1996.
[14] M. Frigo and V. Strumpen. Cache oblivious stencil computations. In Proceedings of the 19th
International Conference on Supercomputing (ICS), pages 361–366, 2005.
[15] M. T. Goodrich. Communication-efficient parallel sorting. SIAM Journal on Computing,
29(2):416–432, 1999.
[16] J.-W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proceedings of
the 13th ACM Symposium on Theory of Computing (STOC), pages 326–333, 1981.
[17] D. Irony, S. Toledo, and A. Tiskin. Communication lower bounds for distributed-memory
matrix multiplication. Journal of Parallel and Distributed Computing, 64(9):1017–1026, 2004.
[18] H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for MapReduce. In
Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
938–948, 2010.
[19] L. R. Kerr. The Effect of Algebraic Structure on the Computational Complexity of Matrix
Multiplication. PhD thesis, Cornell University, 1970.
[20] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hyper-
cubes. Morgan Kaufmann Publishers Inc., 1992.
[21] L. Loomis and H. Whitney. An inequality related to the isoperimetric inequality. Bulletin of
The American Mathematical Society, 55:961–962, 1949.
[22] C. H. Papadimitriou and J. D. Ullman. A communication-time tradeoff. SIAM Journal on
Computing, 16(4):639–646, 1987.
[23] A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-round tradeoffs for
MapReduce computations. In Proceedings of the 26th International Conference on Supercom-
puting (ICS), pages 235–244, 2012.
[24] D. Ranjan, J. Savage, and M. Zubair. Strong I/O lower bounds for binomial and FFT com-
putation graphs. In Proceedings of the 17th Annual International Conference on Computing
and Combinatorics (COCOON), pages 134–145, 2011.
[25] J. E. Savage. Extending the Hong-Kung model to memory hierarchies. In Proceedings of the
First Annual International Conference on Computing and Combinatorics (COCOON), pages
270–281, 1995.
[26] J. E. Savage. Models of Computation: Exploring the Power of Computing. Addison-Wesley
Longman Publishing Co., Inc., 1998.
[27] E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and
LU factorization algorithms. In Proceedings of the 17th International Conference on Parallel
Processing (Euro-Par), pages 90–109, 2011.
[28] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 14(3):354–356,
1969.
[29] A. Tiskin. Bulk-synchronous parallel multiplication of Boolean matrices. In Proceedings of the
25th International Colloquium on Automata, Languages and Programming (ICALP), pages
494–506, 1998.
[30] A. Tiskin. The Design and Analysis of Bulk-Synchronous Parallel Algorithms. PhD thesis,
University of Oxford, 1998.
[31] A. Tiskin. BSP (bulk synchronous parallelism). In Encyclopedia of Parallel Computing, pages
192–199. Springer, 2011.
[32] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM,
33(8):103–111, 1990.
[33] L. G. Valiant. A bridging model for multi-core computing. Journal of Computer and System
Sciences, 77(1):154–166, 2011.
[34] C.-L. Wu and T.-Y. Feng. The universality of the shuffle-exchange network. IEEE Transactions
on Computers, 30:324–332, 1981.
[35] I.-C. Wu and H. T. Kung. Communication complexity for parallel divide-and-conquer. In
Proceedings of the 32nd annual Symposium on Foundations of Computer Science (FOCS),
pages 151–162, 1991.
