Network-Oblivious Algorithms by Bilardi, Gianfranco et al.
Published version DOI: http://dx.doi.org/10.1145/2812804
Availability: This version is available at: 11577/3156396 since: 2017-05-14T16:43:11Z
Università degli Studi di Padova, Padua Research Archive - Institutional Repository
Terms of use: Open Access. This article is made available under terms and conditions applicable to Open Access Guidelines, as described at
http://www.unipd.it/download/file/fid/55401 (Italian only)
Network-Oblivious Algorithms∗
Gianfranco Bilardi† Andrea Pietracaprina‡ Geppino Pucci§ Michele Scquizzato¶
Francesco Silvestri‖
Abstract
A framework is proposed for the design and analysis of network-oblivious algorithms, namely,
algorithms that can run unchanged, yet efficiently, on a variety of machines characterized by
different degrees of parallelism and communication capabilities. The framework prescribes that
a network-oblivious algorithm be specified on a parallel model of computation where the only
parameter is the problem’s input size, and then evaluated on a model with two parameters,
capturing parallelism granularity and communication latency. It is shown that, for a wide
class of network-oblivious algorithms, optimality in the latter model implies optimality in the
Decomposable BSP model, which is known to effectively describe a wide and significant class of
parallel platforms. The proposed framework can be regarded as an attempt to port the notion
of obliviousness, well established in the context of cache hierarchies, to the realm of parallel
computation. Its effectiveness is illustrated by providing optimal network-oblivious algorithms
for a number of key problems. Some limitations of the oblivious approach are also discussed.
∗This work was supported, in part, by MIUR of Italy under project AMANDA and by the University of Padova
under projects STPD08JA32 and CPDA121378/12. A preliminary version of this paper appeared in Proceedings of
the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007.
†Department of Information Engineering, University of Padova, 35131 Padova, Italy.
E-mail: bilardi@dei.unipd.it.
‡Department of Information Engineering, University of Padova, 35131 Padova, Italy. E-mail: capri@dei.unipd.it.
§Department of Information Engineering, University of Padova, 35131 Padova, Italy. E-mail: geppo@dei.unipd.it.
¶Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260, USA.
E-mail: scquizza@pitt.edu. Most of this work was done while this author was a Ph.D. student at the Uni-
versity of Padova.
‖Department of Information Engineering, University of Padova, 35131 Padova, Italy.
E-mail: silvest1@dei.unipd.it.
arXiv:1404.3318v1 [cs.DS] 12 Apr 2014
1 Introduction
Communication plays a major role in determining the performance of algorithms on current com-
puting systems and has a considerable impact on energy consumption. Since the relevance of
communication increases with the size of the system, it is expected to play an even greater role
in the future. Motivated by this scenario, a large body of results has been devised concerning
the design and analysis of communication-efficient algorithms. While often useful and deep, these
results do not yet provide a coherent and unified theory of the communication requirements of
computations. One major obstacle toward such a theory lies in the fact that, prima facie, commu-
nication is defined only with respect to a specific mapping of a computation onto a specific machine
structure. Furthermore, the impact of communication on performance depends on the latency and
bandwidth properties of the channels connecting different parts of the target machine. Hence, the
design, optimization, and analysis of algorithms can become highly machine-dependent, which is
undesirable from the economic perspective of developing efficient and portable software. The
outlined situation has been widely recognized, and a number of approaches have been proposed to
solve it or to mitigate it.
On one end of the spectrum, we have the parallel slackness approach, based on the assumption
that, as long as a sufficient amount of parallelism is exhibited, general and automatic latency-hiding
techniques can be deployed to achieve an efficient execution. Broadly speaking, the required algo-
rithmic parallelism should be proportional to the product of the number of processing units by the
worst-case latency of the target machine [44]. Further assuming that this amount of parallelism
is available in computations of practical interest, algorithm design can dispense altogether with
communication concerns and focus on the maximization of parallelism. The functional/data-flow
and the PRAM models of computations have often been supported with similar arguments. Unfor-
tunately, as argued in [13, 14, 15], latency hiding is not a scalable technique, due to fundamental
physical constraints. Hence, parallel slackness does not really solve the communication problem.
(Nevertheless, functional and PRAM models are quite valuable and have significantly contributed
to the understanding of other dimensions of computing.)
On the other end of the spectrum, we could place the universality approach, whose objective is
the development of machines (nearly) as efficient as any other machine of (nearly) the same cost, at
executing any computation (see, e.g., [38, 13, 7, 16]). To the extent that a universal machine with
very small performance and cost gaps could be identified, one could adopt a model of computation
sufficiently descriptive of such a machine, and focus most of the algorithmic effort on this model.
As technology approaches the inherent physical limitations to information processing, storage, and
transfer, the emergence of a universal architecture becomes more likely. Economy of scale can also
be a force favoring convergence in the space of commercial machines. While this appears as a
perspective worthy of investigation, at the present stage, neither the known theoretical results nor
the trends of commercially available platforms indicate an imminent strong convergence.
In the middle of the spectrum, a variety of computational models proposed in the literature can
be viewed as variants of an approach aiming at realizing an efficiency/portability/design-complexity
tradeoff [9]. Well-known examples of these models are LPRAM [3], BSP [44] and its refinements
(such as D-BSP [25, 11], BSP* [5], E-BSP [34], and BSPRAM [43]), LogP [24], QSM [30], and several
others. These models aim at capturing features common to most (reasonable) machines, while
ignoring features that differ. The hope is that performance of real machines be largely determined
by the modeled features, so that optimal algorithms in the proposed model translate into near
optimal ones on real machines. A drawback of these models is that they include parameters that
affect execution time. Then, in general, efficient algorithms are parameter-aware, since different
algorithmic strategies can be more efficient for different values of the parameters. One parameter
present in virtually all models is the number of processors. Most models also exhibit parameters
describing the time required to route certain communication patterns. Increasing the number
of parameters, from just a small constant to logarithmically many in the number of processors,
can considerably increase the effectiveness of the model with respect to realistic architectures,
such as point-to-point networks, as extensively discussed in [11]. A price is paid in the increased
complexity of algorithm design necessary to gain greater efficiency across a larger class of machines.
The complications further compound if the hierarchical nature of the memory is also taken into
account, so that communication between processors and memories becomes an optimization target
as well.
It is natural to wonder whether, at least for some problems, parallel algorithms can be designed
that, while independent of any machine/model parameters, are nevertheless efficient for wide ranges
of these parameters. In other words, we are interested in exploring the world of efficient network-
oblivious algorithms with a spirit similar to the one that motivated the development of efficient
cache-oblivious algorithms [27]. In this paper, we define the notion of network-oblivious algorithms
and propose a framework for their design and analysis. Our framework is based on three models
of computation, each with a different role, as briefly outlined below.
The three models are based on a common organization consisting of a set of CPU/memory nodes
communicating through some interconnection. Inspired by the BSP model and its aforementioned
variants, we assume that the computation proceeds as a sequence of supersteps, where in a superstep
each node performs local computation and sends/receives messages to/from other nodes, which will
be consumed in the subsequent superstep. Each message occupies a constant number of words.
The first model of our framework (specification model) is used to specify network-oblivious
algorithms. In this model, the number of CPU/memory nodes, referred to as virtual processors, is
a function v(n) of the input size and captures the amount of parallelism exhibited by the algorithm.
The second model (evaluation model) is the basis for analyzing the performance of network-oblivious
algorithms on different machines. It is characterized by two parameters, independent of the input:
the number p of CPU/memory nodes, simply referred to as processors in this context, and a fixed
latency/synchronization cost σ per superstep. The communication complexity of an algorithm
is defined in this model as a function of p and σ. Finally, the third model (execution machine
model) enriches the evaluation model, by replacing parameter σ with two independent parameter
vectors of size logarithmic in the number of processors, which represent, respectively, the inverse
of the bandwidth and the latency costs of suitable nested subsets of processors. In this model,
the communication time of an algorithm is analyzed as a function of p and of the two parameter
vectors. In fact, the execution machine model of our framework coincides with the Decomposable
Bulk Synchronous Parallel (D-BSP) model [25, 11], which is known to describe reasonably well the
behavior of a large class of point-to-point networks by capturing their hierarchical structure [10].
A network-oblivious algorithm is designed in the specification model but can be run on the
evaluation or execution machine models by letting each processor of these models carry out the
work of a pre-specified set of virtual processors. The main contribution of this paper is an optimality
theorem showing that for a wide and interesting class of network-oblivious algorithms, which satisfy
some technical conditions and whose communication requirements depend only on the input size
and not on the specific input instance, optimality in the evaluation model automatically translates
into optimality in the D-BSP model, for suitable ranges of the models’ parameters. It is this
circumstance that motivates the introduction of the intermediate evaluation model, which simplifies
the analysis of network-oblivious algorithms, while effectively bridging the performance analysis to
the more realistic D-BSP model.
In order to illustrate the potentiality of the framework we devise network-oblivious algorithms
for several fundamental problems, such as matrix multiplication, Fast Fourier Transform, comparison-
based sorting, and a class of stencil computations. In all cases, except for stencil computations,
we show, through the optimality theorem, that these algorithms are optimal when executed on the
D-BSP, for wide ranges of the parameters. Unfortunately, there exist problems for which optimality
on the D-BSP cannot be attained in a network-oblivious fashion for wide ranges of parameters. We
show that this is the case for the broadcast problem.
To help place our network-oblivious framework into perspective, it may be useful to compare it
with the well-established sequential cache-oblivious framework [27]. In the latter, the specification
model is the Random Access Machine; the evaluation model is the Ideal Cache Model IC(M,B),
with only one level of cache of size M and line length B; and the execution machine model is a
machine with a hierarchy of caches, each with its own size and line length. In the cache-oblivious
context, the simplification in the analysis arises from the fact that, under certain conditions, optimality
on IC(M,B), for all values of M and B, translates into optimality on multilevel hierarchies.
The notion of obliviousness in parallel settings has been recently addressed by several papers.
In a preliminary version of this work [12] (see also [31]), we proposed a framework similar to the one
presented here, where messages are packed in blocks whose fixed size is a parameter of the evaluation
and execution machine models. While blocked communication may be preferable for models where
the memory and communication hierarchies are seamlessly integrated (e.g., multicores), latency-
based models like the one used here are equivalent for that scenario and also capture the case when
communication is accomplished through a point-to-point network. Later, Chowdhury et al. [20]
introduced a multilevel hierarchical model for multicores and the notion of multicore-oblivious
algorithm for this model. A multicore-oblivious algorithm is specified with no mention of any
machine parameters, such as the number of cores, number of cache levels, cache sizes and block
lengths. Cole and Ramachandran [21, 22, 23] presented parallel algorithms for a shared-memory
multicore model featuring a two-level memory, which are oblivious to the number of processors and
to the memory parameters. Finally, Blelloch et al. [17] introduced a parallel version of the cache-
oblivious framework in [27], named the Parallel Cache-Oblivious model, and described a scheduler
for oblivious irregular computations. In contrast to these approaches, Valiant [45] studies parallel
algorithms for multicore architectures advocating a parameter-aware design of portable algorithms.
In his paper, he presents optimal algorithms for the Multi-BSP, a bridging model for multicore
architectures which exhibits a hierarchical structure akin to that of our execution machine model.
In fact, for the latter model we are able to show that, for the same computational problems, optimal
parallel algorithms do exist that are parameter-free.
The rest of the paper is organized as follows. In Section 2 we formally define the three models
relevant to the framework and in Section 3 we prove the optimality theorem mentioned above.
In Section 4, we present the network-oblivious algorithms for matrix multiplication, Fast Fourier
Transform, comparison-based sorting, and stencil computations. We also discuss the impossibility
result regarding the broadcast problem. Section 5 extends the optimality theorem by presenting a
less powerful version which, however, applies to a wider class of algorithms. Section 6 concludes
the paper with some final remarks.
2 The Framework
We begin by introducing a parallel machine model M(v), which underlies the specification, the
evaluation, and the execution components of our framework. Specifically, M(v) consists of a set of
v processing elements, denoted by $P_0, P_1, \ldots, P_{v-1}$, each equipped with a CPU and an unbounded
local memory, which communicate through some interconnection. (For simplicity, throughout this
paper, v is assumed to be a power of two.) The instruction set of each CPU is essentially that of a
standard Random Access Machine, augmented with the three primitives sync(i), send(m, q), and
receive(). Furthermore, each $P_r$ has access to its own index r and to the number v of processing
elements. When $P_r$ invokes primitive sync(i), with i in the integer range [0, log v), it waits
until each processing element whose index shares with r the i most significant bits has also performed
a sync(i) or has terminated its execution.¹ In other words, sync(i) is a barrier which synchronizes
the $v/2^i$ processing elements whose indices share the i most significant bits. When $P_r$ invokes
send(m, q), with 0 ≤ q < v, a constant-size message m is sent to $P_q$; the message will be available
in $P_q$ only after a sync(k), where k is not larger than the number of most significant bits shared
by r and q. The function receive(), in turn, returns an element in the set of messages
received up to the preceding barrier and removes it from the set.
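The index arithmetic behind sync(i) and message delivery can be sketched as follows. This is only an illustrative sketch: the helper names (`log2_exact`, `sync_group`, `shared_msb`, `available_after`) are hypothetical and do not come from the paper, and indices are assumed to be plain integers in [0, v) with v ≥ 2 a power of two.

```python
def log2_exact(v: int) -> int:
    """Return log2(v) for a power of two v >= 2 (hypothetical helper)."""
    assert v >= 2 and v & (v - 1) == 0, "v must be a power of two"
    return v.bit_length() - 1


def sync_group(r: int, i: int, v: int) -> range:
    """Indices of the processing elements synchronized with P_r by sync(i)."""
    log_v = log2_exact(v)
    assert 0 <= r < v and 0 <= i < log_v
    size = v >> i                    # v / 2^i elements share the i leading bits
    start = (r // size) * size       # zero out the low-order (log v - i) bits of r
    return range(start, start + size)


def shared_msb(r: int, q: int, v: int) -> int:
    """Number of most significant bits (out of log v) shared by r and q."""
    return log2_exact(v) - (r ^ q).bit_length()


def available_after(r: int, q: int, k: int, v: int) -> bool:
    """True if a message sent from P_r to P_q is delivered by a sync(k)."""
    return k <= shared_msb(r, q, v)


print(list(sync_group(5, 1, 8)))    # [4, 5, 6, 7]
print(shared_msb(5, 4, 8))          # 2
print(available_after(5, 4, 2, 8))  # True
print(available_after(5, 7, 2, 8))  # False
```

With v = 8, indices 5 = 101 and 4 = 100 share two leading bits, so a sync(2) suffices to deliver a message between them, whereas 5 and 7 = 111 share only one bit and need a sync(k) with k ≤ 1.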
In this paper, we restrict our attention to algorithms where the sequence of labels of the sync
operations is the same for all processing elements, and where the last operation executed by each
processing element is a sync.² In this case, the execution of an algorithm can be viewed as a
sequence of supersteps, where a superstep consists of all operations performed between two consecutive
sync operations, including the second of these sync operations. Supersteps are labeled by the
index of their terminating sync operation: namely, a superstep terminating with sync(i) will be
referred to as an i-superstep, for 0 ≤ i < log v. Furthermore, we make the reasonable assumption
that in an i-superstep, each $P_r$ can send messages only to processing elements whose index agrees
with r in the i most significant bits, that is, message exchange occurs only between processors
belonging to the same synchronization subset. We observe that the results of this paper would hold
even if, in the various models considered, synchronizations were not explicitly labeled. However,
explicit labels can help reduce synchronization costs. For instance, they become crucial for the
efficient execution of the algorithms on point-to-point networks, especially those of large diameter.
Consider an M(v)-algorithm A satisfying the above restrictions. For a given input instance I, we
use $L^i_A(I)$ to denote the set of i-supersteps executed by A on input I, and define $S^i_A(I) = |L^i_A(I)|$,
for 0 ≤ i < log v. Algorithm A can be naturally and automatically adapted to execute on a smaller
machine M(p), with p < v, by stipulating that processing element $P_j$ of M(p) will carry out the
operations of the v/p consecutively numbered processing elements of M(v) starting with $P_{j(v/p)}$,
for each 0 ≤ j < p. We call this adaptation folding. Under folding, supersteps with a label i < log p
on M(v) become supersteps with the same label on M(p), while supersteps with label i ≥ log p on
M(v) become local computation on M(p). Hence, when considering the communication occurring
in the execution of A on M(p), the set $L^i_A(I)$ is relevant as long as i < log p.
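Folding amounts to an integer division on processor indices plus a test on superstep labels. The sketch below illustrates this under the assumption that v and p are powers of two with 2 ≤ p ≤ v; the function names `fold_owner` and `label_after_folding` are hypothetical.

```python
def fold_owner(k: int, v: int, p: int) -> int:
    """Processor of M(p) that carries out the work of element P_k of M(v)."""
    assert 1 <= p <= v and v % p == 0 and 0 <= k < v
    return k // (v // p)             # P_j simulates P_{j(v/p)}, ..., P_{(j+1)(v/p)-1}


def label_after_folding(i: int, p: int):
    """Label of an i-superstep of M(v) on M(p), or None if it becomes local."""
    log_p = p.bit_length() - 1       # log2(p) for a power of two p >= 2
    return i if i < log_p else None


v, p = 8, 2
print([fold_owner(k, v, p) for k in range(v)])         # [0, 0, 0, 0, 1, 1, 1, 1]
print([label_after_folding(i, p) for i in range(3)])   # [0, None, None]
```

On M(2), only 0-supersteps still involve communication; 1-supersteps and 2-supersteps of M(8) take place entirely inside one of the two processors.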
A network-oblivious algorithm A for a given computational problem Π is designed on M(v(n)),
referred to as specification model, where the number v(n) of processing elements, which is a function
of the input size, is selected as part of the algorithm design. The processing elements are called
virtual processors and are denoted by $VP_0, VP_1, \ldots, VP_{v(n)-1}$, in order to distinguish them from
the processing elements of the other two models. Since the folding mechanism illustrated above
enables A to be executed on a smaller machine M(p), the design effort can be kept focussed on
just one convenient virtual machine size, oblivious to the actual number of processors on which the
algorithm will be executed.
¹ For notational convenience, throughout this paper we use log x to mean max{1, $\log_2 x$}.
² As we will see in the paper, several algorithms naturally comply or can easily be adapted to comply with these
restrictions. Nevertheless, a less restrictive family of algorithms for M(v) can be defined, by allowing processing
elements to feature different traces of labels of their sync operations, still ensuring termination. The exploration of
the potentialities of these algorithms is left for future research.
While a network-oblivious algorithm is specified for a large virtual machine, it is useful to
analyze its communication requirements on machines with reduced degrees of parallelism. For
these purposes, we introduce the evaluation model M(p, σ), where p ≥ 1 is a power of two and
σ ≥ 0, which is essentially an M(p) where the additional parameter σ is used to account for the
latency plus synchronization cost of each superstep. The processing elements of M(p, σ) are called
processors and are denoted by $P_0, P_1, \ldots, P_{p-1}$. Consider the execution of an algorithm A on
M(p, σ) for a given input I. For each superstep s, the metric of interest that we use to evaluate
the communication requirements of the algorithm is the maximum number of messages $h^s_A(I, p)$
sent/destined by/to any processor in that superstep. Thus, the set of messages exchanged in the
superstep can be viewed as forming an $h^s_A(I, p)$-relation, where $h^s_A(I, p)$ is often referred to as the
degree of the relation. In the evaluation model, the communication cost of a superstep of degree h
is defined as h + σ, and it is independent of the superstep’s label. For our purposes, it is convenient
to consider the cumulative degree of all i-supersteps, for 0 ≤ i < log p:
\[ F^i_A(I, p) = \sum_{s \in L^i_A(I)} h^s_A(I, p). \]
Then, the communication complexity of A on M(p, σ) is defined as
\[ H_A(n, p, \sigma) = \max_{I : |I| = n} \left\{ \sum_{i=0}^{\log p - 1} \left( F^i_A(I, p) + S^i_A(I) \cdot \sigma \right) \right\}. \tag{1} \]
We observe that the evaluation model with this performance metric coincides with the BSP
model [44] where the bandwidth parameter g is set to 1 and the latency/synchronization parameter
ℓ is set to σ.
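As an illustration of the cost metric of the evaluation model, the following sketch computes superstep degrees and the sum inside Equation 1 for one fixed input, so the maximum over instances is not taken. The trace representation (a list of `(label, messages)` pairs, with each message a `(source, destination)` pair of virtual processor indices) and the function names are assumptions made for this example only.

```python
from collections import Counter


def degree(messages, v, p):
    """Degree h of one superstep of an M(v) trace after folding onto p processors."""
    per = v // p                         # virtual processors per processor
    sent, recv = Counter(), Counter()
    for src, dst in messages:
        a, b = src // per, dst // per
        if a != b:                       # messages inside one processor are local
            sent[a] += 1
            recv[b] += 1
    return max(max(sent.values(), default=0), max(recv.values(), default=0))


def comm_complexity(trace, v, p, sigma):
    """Sum of (h + sigma) over the supersteps with label i < log p."""
    log_p = p.bit_length() - 1           # log2(p) for a power of two p >= 2
    return sum(degree(msgs, v, p) + sigma
               for label, msgs in trace if label < log_p)


# Hypothetical trace on M(4): one 0-superstep followed by one 1-superstep.
trace = [
    (0, [(k, (k + 2) % 4) for k in range(4)]),   # exchange across the two halves
    (1, [(k, k ^ 1) for k in range(4)]),         # exchange inside each half
]
print(comm_complexity(trace, 4, 4, 3))   # 8 = (1 + 3) + (1 + 3)
print(comm_complexity(trace, 4, 2, 3))   # 5 = (2 + 3); the 1-superstep is local
```

On M(2, 3) the 0-superstep has degree 2, because each processor now carries the traffic of two virtual processors, while the 1-superstep no longer contributes.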
Next, we turn our attention to the last model used in the framework, called execution machine
model, which represents the machines where network-oblivious algorithms are actually executed.
We focus on parallel machines whose underlying interconnection exhibits a hierarchical structure,
and use the Decomposable BSP (D-BSP) model [25, 11] as our execution machine model. A
D-BSP(p, g, ℓ), with $g = (g_0, \ldots, g_{\log p - 1})$ and $\ell = (\ell_0, \ldots, \ell_{\log p - 1})$, is an M(p) where the cost of an
i-superstep depends on parameters $g_i$ and $\ell_i$, for 0 ≤ i < log p. The processing elements, called
processors and denoted by $P_0, P_1, \ldots, P_{p-1}$ as in the evaluation model, are partitioned into nested
clusters: for 0 ≤ i ≤ log p, a set formed by all the $p/2^i$ processors whose indices share the most
significant i bits is called an i-cluster. Hence, during an i-superstep each processor communicates
only with processors of its i-cluster. For the communication within an i-cluster, parameter $\ell_i$
represents the latency plus synchronization cost (in time units), while $g_i$ represents an inverse
measure of bandwidth (in units of time per message). By importing the notation adopted in the
evaluation model, we define the communication time of an algorithm A on D-BSP(p, g, ℓ) as
\[ D_A(n, p, g, \ell) = \max_{I : |I| = n} \left\{ \sum_{i=0}^{\log p - 1} \left( F^i_A(I, p)\, g_i + S^i_A(I)\, \ell_i \right) \right\}. \tag{2} \]
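For a single fixed input, the sum inside Equation 2 is a weighted sum of the per-label quantities. The sketch below evaluates it from given cumulative degrees and superstep counts; the function name `dbsp_time` and the numeric values are hypothetical and serve only to show how the two cost measures relate.

```python
def dbsp_time(F, S, g, ell):
    """Sum over labels i of F[i] * g[i] + S[i] * ell[i] (one fixed input)."""
    assert len(F) == len(S) == len(g) == len(ell)
    return sum(F[i] * g[i] + S[i] * ell[i] for i in range(len(F)))


# Hypothetical algorithm on p = 4 processors: one 0-superstep and one
# 1-superstep, each of degree 1.
F, S = [1, 1], [1, 1]

print(dbsp_time(F, S, g=[4, 1], ell=[8, 2]))   # 15 = (4 + 8) + (1 + 2)

# Setting every g_i to 1 and every ell_i to sigma gives the cost of the
# evaluation model M(p, sigma); here sigma = 3.
print(dbsp_time(F, S, g=[1, 1], ell=[3, 3]))   # 8
```

The second call shows that the evaluation model is the special case of the D-BSP in which all clusters have unit inverse bandwidth and the same latency.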
Through the folding mechanism discussed above, any network-oblivious algorithm A specified
on M(v(n)) can be transformed into an algorithm for M(p) with p < v(n), hence into an algorithm
for M(p, σ) or D-BSP(p, g, ℓ). In this case, the quantities $H_A(n, p, \sigma)$ and $D_A(n, p, g, \ell)$
denote, respectively, the communication complexity and communication time of the folded algorithm.
Moreover, since algorithms designed on the evaluation model M(p, σ) or on the execution
machine model D-BSP(p, g, ℓ) can be regarded as algorithms for M(p), once the parameters σ or
g and ℓ are fixed, we can also analyze the communication complexities/times of their foldings on
smaller machines. These relations among the models are crucial for the effective exploitation of our
framework.
The following definitions establish useful notions of optimality for the two complexity measures
introduced above relative to the evaluation and execution machine models. For each measure,
optimality is defined with respect to a class of algorithms, whose actual role will be made clear
later in the paper. Let $\mathcal{C}$ denote a class of algorithms, solving a given problem Π.
Definition 1. Let 0 < β ≤ 1. An M(p, σ)-algorithm $B \in \mathcal{C}$ is β-optimal on M(p, σ) with respect
to $\mathcal{C}$ if for each M(p, σ)-algorithm $B' \in \mathcal{C}$ and for each n,
\[ H_B(n, p, \sigma) \le \frac{1}{\beta} H_{B'}(n, p, \sigma). \]
Definition 2. Let 0 < β ≤ 1. A D-BSP(p, g, ℓ)-algorithm $B \in \mathcal{C}$ is β-optimal on D-BSP(p, g, ℓ)
with respect to $\mathcal{C}$ if for each D-BSP(p, g, ℓ)-algorithm $B' \in \mathcal{C}$ and for each n,
\[ D_B(n, p, g, \ell) \le \frac{1}{\beta} D_{B'}(n, p, g, \ell). \]
Note that the above definitions do not require β to be a constant: intuitively, larger values of
β correspond to higher degrees of optimality.
3 Optimality Theorem for Static Algorithms
In this section, we show that, for a certain class of network-oblivious algorithms, β-optimality in
the evaluation model, for suitable ranges of parameters p and σ, translates into β′-optimality in the
execution machine model, for some β′ = Θ(β) and suitable ranges of parameters p, g, and ℓ. This
result, which we refer to as optimality theorem, holds under a number of restrictive assumptions;
nevertheless, it is applicable in a number of interesting case studies, as illustrated in the subsequent
sections. The optimality theorem shows the usefulness of the intermediate evaluation model since it
provides a form of “bootstrap,” whereby from a given degree of optimality on a family of machines
we infer a related degree of optimality on a much larger family. It is important to remark that the
class of algorithms for which the optimality theorem holds includes algorithms that are network-
aware, that is, whose code can make explicit use of the architectural parameters of the model (p
and σ, for the evaluation model, and p, g, and ℓ, for the execution machine model) for optimization
purposes.
In a nutshell, the approach we follow hinges on the fact that both communication complexity and
communication time (Equations 1 and 2) are expressed in terms of quantities of the type $F^i_A(I, p)$.
If communication complexity is low then these quantities must be low, whence communication time
must be low as well. Below, we discuss a number of obstacles to be faced when attempting to refine
the outlined approach into a rigorous argument and how they can be handled.
A first obstacle arises whenever the performance functions are linear combinations of other
auxiliary metrics. Unfortunately, worst-case optimality of these metrics does not imply optimality
of their linear combinations (nor vice versa), since the worst case of different metrics could be
realized by different input instances. In the cases of our interest, the “losses” incurred cannot
be generally bounded by constant factors. To circumvent this obstacle, we restrict our attention
to static algorithms, defined by the property that the following quantities are equal for all input
instances of the same size n: (i) the number of supersteps; (ii) the sequence of labels of the various
supersteps; and (iii) the set of source-destination pairs of the messages exchanged in any individual
superstep. This restriction allows us to overload the notation, writing n instead of I in the argument
of functions that become invariant for instances of the same size, namely $L^i_A(n)$, $S^i_A(n)$, $h^s_A(n, p)$,
and $F^i_A(n, p)$. Likewise, the max operation becomes superfluous and can be omitted in Equation 1
and Equation 2. Static algorithms naturally arise in DAG (Directed Acyclic Graph) computations.
In a DAG algorithm, for every instance size n there exists (at most) one DAG where each node with
indegree 0 represents an input value, while each node with indegree greater than 0 represents a
value produced by a unit-time operation whose operands are the values of the node’s predecessors
(nodes with outdegree 0 are viewed as outputs). The computation requires the execution of all
operations specified by the nodes, complying with the data dependencies imposed by the arcs.³
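The three conditions defining a static algorithm can be checked mechanically on recorded runs. The following sketch assumes that each run is available as a list of `(label, messages)` pairs, with messages given as `(source, destination)` pairs; this representation and the names `signature` and `is_static` are assumptions of the example, not notation from the paper. Passing the check on a finite sample of inputs is of course only evidence, not a proof, that an algorithm is static.

```python
def signature(trace):
    """Labels in order, plus the set of (src, dst) pairs of each superstep."""
    return [(label, frozenset(msgs)) for label, msgs in trace]


def is_static(traces):
    """True if all runs on inputs of the same size share one communication pattern."""
    first = signature(traces[0])
    return all(signature(t) == first for t in traces[1:])


run_a = [(0, [(0, 1), (1, 0)])]
run_b = [(0, [(1, 0), (0, 1)])]      # same pairs, listed in a different order
run_c = [(0, [(0, 1)])]              # one message fewer: input-dependent pattern

print(is_static([run_a, run_b]))     # True
print(is_static([run_a, run_c]))     # False
```

Comparing the signatures covers conditions (i) to (iii) at once: equal lists have the same length, the same label sequence, and the same set of source-destination pairs in every superstep.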
In order to prove the optimality theorem, we need a number of technical results and definitions.
Recall that folding can be employed to transform an M(p, σ)-algorithm into an M(q, σ)-algorithm,
for any q < p. The following lemma establishes a useful relation between the communication metrics
when folding is applied.
Lemma 1. Let B be a static M(p, σ)-algorithm. For every 1 ≤ j ≤ log p and for every input size
n, considering the folding of B on $M(2^j, 0)$ we have
\[ \sum_{i=0}^{j-1} F^i_B(n, 2^j) \le \frac{p}{2^j} \sum_{i=0}^{j-1} F^i_B(n, p). \]
Proof. The lemma follows by observing that in every i-superstep, with i < j, messages sent/destined
by/to processor $P_k$ of $M(2^j, 0)$, with $0 \le k < 2^j$, are a subset of those sent/destined by/to the $p/2^j$
M(p, σ)-processors whose computations are carried out by $P_k$.
It is easy to come up with algorithms where the bound stated in the above lemma is not
tight. In fact, while in an i-superstep each message must be exchanged between processors whose
indices share at least i most significant bits, some messages which contribute to $F^i_B(n, p)$ may be
exchanged between processors whose indices share j > i most significant bits, thus not contributing
to $F^i_B(n, 2^j)$. Motivated by this observation, we define below a class of network-oblivious algorithms
where a parameter α quantifies how tight the upper bound of Lemma 1 is, when considering their
foldings on smaller machines. This parameter will be employed to control the extent to which
an optimality guarantee in the evaluation model translates into an optimality guarantee in the
execution model.
Definition 3. A static network-oblivious algorithm A specified on M(v(n)) is said to be (α, p)-wise,
for some 0 < α ≤ 1 and 1 < p ≤ v(n), if considering the folding of A on $M(2^j, 0)$ we have
\[ \sum_{i=0}^{j-1} F^i_A(n, 2^j) \ge \alpha \frac{p}{2^j} \sum_{i=0}^{j-1} F^i_A(n, p), \]
for every 1 ≤ j ≤ log p and every input size n.
³ In the literature, DAG problems have also been referred to as pebble games [39, Section 10.1].
(We remark that in the above definition parameter α is not necessarily a constant and can be made,
for example, a function of p.) As an example, a network-oblivious algorithm for M(v(n)) where, for
each i-superstep there is always at least one segment of $v(n)/2^{i+1}$ virtual processors consecutively
numbered starting from $k \cdot (v(n)/2^{i+1})$, for some k ≥ 0, each sending a number of messages equal
to the superstep degree to processors outside the segment, is an (α, p)-wise algorithm for each
1 < p ≤ v(n) and α = 1. However, (α, p)-wiseness holds even if the aforementioned communication
scenario is realized only in an average sense. Furthermore, consider a pair of values α′ and p′ such
that 1 < p′ ≤ p, and 0 < α′ ≤ α. It is easy to see that $(p/p')F^i_A(n, p) \ge F^i_A(n, p')$, for every
0 ≤ i < log p′, and this implies that a network-oblivious algorithm which is (α, p)-wise is also
(α′, p′)-wise.
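To make the role of α concrete, the sketch below computes, for one recorded communication pattern, the largest α for which the inequality of Definition 3 holds, and checks the upper bound of Lemma 1 along the way. The trace format (a list of `(label, messages)` pairs over virtual processor indices) and the function names are assumptions of this example; a result of 0 means that the pattern is not (α, p)-wise for any admissible α.

```python
from collections import Counter
from fractions import Fraction


def degree(messages, v, q):
    """Degree of one superstep of an M(v) trace after folding onto q processors."""
    per = v // q
    sent, recv = Counter(), Counter()
    for src, dst in messages:
        a, b = src // per, dst // per
        if a != b:                       # messages inside one processor are local
            sent[a] += 1
            recv[b] += 1
    return max(max(sent.values(), default=0), max(recv.values(), default=0))


def wiseness(trace, v, p):
    """Largest alpha satisfying the inequality of Definition 3 for this trace and p."""
    log_p = p.bit_length() - 1
    alpha = Fraction(1)
    for j in range(1, log_p + 1):
        q = 1 << j
        folded = sum(degree(m, v, q) for label, m in trace if label < j)
        full = sum(degree(m, v, p) for label, m in trace if label < j)
        if full == 0:
            continue                     # no communication: the inequality is trivial
        ratio = Fraction(folded, (p // q) * full)
        assert ratio <= 1                # upper bound of Lemma 1
        alpha = min(alpha, ratio)
    return alpha


# Two hypothetical 0-supersteps on M(8).
flip_msb = [(0, [(k, k ^ 4) for k in range(8)])]   # partner in the other half
flip_lsb = [(0, [(k, k ^ 1) for k in range(8)])]   # partner is the neighbour

print(wiseness(flip_msb, 8, 8))   # 1
print(wiseness(flip_lsb, 8, 8))   # 0
```

In the second pattern every message of the 0-superstep stays inside a processor once the algorithm is folded on two processors, so the folded degree drops to zero and the bound of Lemma 1 is as loose as possible.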
A final issue to consider is that the degrees of supersteps with different labels contribute with
the same weight to the communication complexity while they contribute with different weights to
the communication time. The following lemma will help in bridging this difference.
Lemma 2. For m ≥ 1, let $\langle X_0, \ldots, X_{m-1} \rangle$ and $\langle Y_0, \ldots, Y_{m-1} \rangle$ be two arbitrary sequences of real
values, and let $\langle f_0, \ldots, f_{m-1} \rangle$ be a non-increasing sequence of non-negative values. If
$\sum_{i=0}^{k-1} X_i \le \sum_{i=0}^{k-1} Y_i$, for every 1 ≤ k ≤ m, then
\[ \sum_{i=0}^{m-1} X_i f_i \le \sum_{i=0}^{m-1} Y_i f_i. \]
Proof. By defining $S_0 = 0$ and $S_k = \sum_{j=0}^{k-1} (Y_j - X_j) \ge 0$, for 1 ≤ k ≤ m, we have:
\[ \sum_{i=0}^{m-1} f_i (Y_i - X_i) = \sum_{i=0}^{m-1} f_i (S_{i+1} - S_i) = \sum_{i=0}^{m-1} f_i S_{i+1} - \sum_{i=1}^{m-1} f_i S_i \ge \sum_{i=0}^{m-1} f_i S_{i+1} - \sum_{i=1}^{m-1} f_{i-1} S_i = f_{m-1} S_m \ge 0. \]
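A small numeric instance may help to see why the weights in Lemma 2 must be non-increasing. The sequences below are made up for illustration, and the helper names `prefix_dominated` and `weighted` are hypothetical.

```python
from itertools import accumulate


def prefix_dominated(X, Y):
    """True if every prefix sum of X is at most the matching prefix sum of Y."""
    return all(x <= y for x, y in zip(accumulate(X), accumulate(Y)))


def weighted(Z, f):
    """Weighted sum of the entries of Z with weights f."""
    return sum(z * w for z, w in zip(Z, f))


X = [3, -1, 2]        # prefix sums 3, 2, 4
Y = [3, 1, 0]         # prefix sums 3, 4, 4
print(prefix_dominated(X, Y))                 # True

f = [5, 2, 1]         # non-increasing, non-negative: Lemma 2 applies
print(weighted(X, f), weighted(Y, f))         # 15 17

f_bad = [1, 2, 5]     # increasing weights: the hypothesis on f fails
print(weighted(X, f_bad), weighted(Y, f_bad)) # 11 5
```

With increasing weights the conclusion can fail, since the late surplus of X over Y (2 versus 0 in the last position) is amplified instead of being damped.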
We are now ready to state and prove the optimality theorem. Let $\mathcal{C}$ denote a class of static
algorithms solving a problem Π, with the property that for any algorithm $A \in \mathcal{C}$ for q processing
elements, all of its foldings on q′ processing elements, 2 ≤ q′ < q, also belong to $\mathcal{C}$.
Theorem 1 (Optimality theorem). Let A ∈ C be network-oblivious and (α, p?)-wise, for some
α ∈ (0, 1] and a power of two p?. Let also (σm0 , . . . , σmlog p?−1) and (σM0 , . . . , σMlog p?−1) be two vectors
of non-negative values, with σmj ≤ σMj , for every 0 ≤ j < log p?. If A is β-optimal on M(2j , σ)
w.r.t. C , for σmj−1 ≤ σ ≤ σMj−1 and 1 ≤ j ≤ log p?, then, for every power of two p ≤ p?, A is
αβ/(1 + α)-optimal on D-BSP(p, g, `) w.r.t. C as long as:
• gi ≥ gi+1 and `i/gi ≥ `i+1/gi+1, for 0 ≤ i < log p− 1;
• max1≤k≤log p{σmk−12k/p} ≤ `i/gi ≤ min1≤k≤log p{σMk−12k/p}, for 0 ≤ i < log p.4
4Note that in order to allow for a nonempty range of values for the ratio `i/gi, the σ
m and σM vectors must
be such that max1≤k≤log p{σmk−12k/p} ≤ min1≤k≤log p{σMk−12k/p}. This will always be the case for the applications
discussed in the next section.
Proof. Fix the value p and the vectors g and ℓ so as to satisfy the hypotheses of the theorem, and
consider a D-BSP(p, g, ℓ)-algorithm $C \in \mathcal{C}$. By the β-optimality of A on the evaluation model
$M(2^j, \psi p/2^j)$, for each $1 \le j \le \log p$ and ψ such that $\sigma^m_{j-1} \le \psi p/2^j \le \sigma^M_{j-1}$, we have
$$H_A\left(n, 2^j, \frac{\psi p}{2^j}\right) \le \frac{1}{\beta}\, H_C\left(n, 2^j, \frac{\psi p}{2^j}\right)$$
since C can be folded into an algorithm for $M(2^j, \psi p/2^j)$, still belonging to $\mathcal{C}$. By the definition of
communication complexity it follows that
$$\sum_{i=0}^{j-1} \left( F^i_A(n, 2^j) + S^i_A(n)\, \frac{\psi p}{2^j} \right) \le \frac{1}{\beta} \sum_{i=0}^{j-1} \left( F^i_C(n, 2^j) + S^i_C(n)\, \frac{\psi p}{2^j} \right),$$
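and then, by applying Lemma 1 to the right side of the above inequality, we obtain
$$\sum_{i=0}^{j-1} \left( F^i_A(n, 2^j) + S^i_A(n)\, \frac{\psi p}{2^j} \right) \le \frac{1}{\beta} \sum_{i=0}^{j-1} \left( \frac{p}{2^j}\, F^i_C(n, p) + S^i_C(n)\, \frac{\psi p}{2^j} \right). \qquad (3)$$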
Define $\psi^m_p = \max_{1 \le k \le \log p}\{\sigma^m_{k-1} 2^k/p\}$ and $\psi^M_p = \min_{1 \le k \le \log p}\{\sigma^M_{k-1} 2^k/p\}$. The condition
imposed by the theorem on the ratio $\ell_i/g_i$ implies that $\psi^m_p \le \psi^M_p$, hence, by definition of these two
quantities, we have that $\sigma^m_{j-1} 2^j/p \le \psi^m_p \le \psi^M_p \le \sigma^M_{j-1} 2^j/p$.
Let us first set $\psi = \psi^M_p$ in Inequality 3, and note that, by the above observation,
$\sigma^m_{j-1} \le \psi^M_p\, p/2^j \le \sigma^M_{j-1}$. By multiplying both terms of the inequality by $2^j/(\psi^M_p\, p)$, and by exploiting the
non-negativeness of the $F^i_A(n, 2^j)$ terms, we obtain
$$\sum_{i=0}^{j-1} S^i_A(n) \le \frac{1}{\beta} \sum_{i=0}^{j-1} \left( \frac{F^i_C(n, p)}{\psi^M_p} + S^i_C(n) \right).$$
Next, we make log p applications of Lemma 2, one for each j = 1, 2, . . . , log p, by setting m = j,
$X_i = S^i_A(n)$, $Y_i = (1/\beta)\left(F^i_C(n, p)/\psi^M_p + S^i_C(n)\right)$, and $f_i = \ell_i/g_i$. This gives
$$\sum_{i=0}^{j-1} S^i_A(n)\, \frac{\ell_i}{g_i} \le \frac{1}{\beta} \sum_{i=0}^{j-1} \left( F^i_C(n, p)\, \frac{\ell_i}{\psi^M_p\, g_i} + S^i_C(n)\, \frac{\ell_i}{g_i} \right),$$
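for $1 \le j \le \log p$. Since, by hypothesis, $\ell_i/g_i \le \psi^M_p$, for each $0 \le i < \log p$, we have $\ell_i/(\psi^M_p\, g_i) \le 1$,
hence we can write
$$\sum_{i=0}^{j-1} S^i_A(n)\, \frac{\ell_i}{g_i} \le \frac{1}{\beta} \sum_{i=0}^{j-1} \left( F^i_C(n, p) + S^i_C(n)\, \frac{\ell_i}{g_i} \right), \qquad (4)$$
for $1 \le j \le \log p$.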
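Now, let us set $\psi = \psi^m_p$ in Inequality 3, which, again, guarantees $\sigma^m_{j-1} \le \psi^m_p\, p/2^j \le \sigma^M_{j-1}$. By
exploiting the wiseness of A in the left side and the non-negativeness of $S^i_A(n)$, we obtain
$$\sum_{i=0}^{j-1} \alpha\, \frac{p}{2^j}\, F^i_A(n, p) \le \frac{1}{\beta} \sum_{i=0}^{j-1} \left( \frac{p}{2^j}\, F^i_C(n, p) + S^i_C(n)\, \frac{\psi^m_p\, p}{2^j} \right).$$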
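By multiplying both terms by $2^j/(p\alpha)$ and observing that, by hypothesis, $\psi^m_p \le \ell_i/g_i$, for each
$0 \le i < \log p$, we get
$$\sum_{i=0}^{j-1} F^i_A(n, p) \le \frac{1}{\alpha\beta} \sum_{i=0}^{j-1} \left( F^i_C(n, p) + S^i_C(n)\, \frac{\ell_i}{g_i} \right). \qquad (5)$$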
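Summing Inequality 4 with Inequality 5 yields
$$\sum_{i=0}^{j-1} \left( F^i_A(n, p) + S^i_A(n)\, \frac{\ell_i}{g_i} \right) \le \frac{1 + \alpha}{\alpha\beta} \sum_{i=0}^{j-1} \left( F^i_C(n, p) + S^i_C(n)\, \frac{\ell_i}{g_i} \right),$$
for $1 \le j \le \log p$. Applying Lemma 2 with $m = \log p$, $X_i = F^i_A(n, p) + S^i_A(n)\,\ell_i/g_i$,
$Y_i = \frac{1 + \alpha}{\alpha\beta}\left(F^i_C(n, p) + S^i_C(n)\,\ell_i/g_i\right)$, and $f_i = g_i$ yields
$$\sum_{i=0}^{\log p - 1} \left( F^i_A(n, p)\, g_i + S^i_A(n)\, \ell_i \right) \le \frac{1 + \alpha}{\alpha\beta} \sum_{i=0}^{\log p - 1} \left( F^i_C(n, p)\, g_i + S^i_C(n)\, \ell_i \right).$$
Then, by definition of communication time we have
$$D_A(n, p, g, \ell) \le \frac{1 + \alpha}{\alpha\beta}\, D_C(n, p, g, \ell),$$
and the theorem follows.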
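The hypotheses of Theorem 1 on the D-BSP parameters are mechanical to check. The sketch below is illustrative only: the function names are hypothetical, the vectors are passed as plain Python lists indexed from 0, and hidden constants in asymptotic bounds are taken to be 1.

```python
import math

def admissible_ratio_range(p, sigma_min, sigma_max):
    """Return (psi_m, psi_M), the bounds that every ratio l_i/g_i must respect."""
    log_p = p.bit_length() - 1
    if p < 2 or (1 << log_p) != p:
        raise ValueError("p must be a power of two greater than 1")
    ks = range(1, log_p + 1)
    psi_m = max(sigma_min[k - 1] * 2**k / p for k in ks)
    psi_M = min(sigma_max[k - 1] * 2**k / p for k in ks)
    return psi_m, psi_M

def dbsp_parameters_admissible(p, g, ell, sigma_min, sigma_max):
    """Check the two bulleted conditions of the theorem for vectors g and ell."""
    log_p = p.bit_length() - 1
    psi_m, psi_M = admissible_ratio_range(p, sigma_min, sigma_max)
    ratios = [ell[i] / g[i] for i in range(log_p)]
    g_non_increasing = all(g[i] >= g[i + 1] for i in range(log_p - 1))
    ratios_non_increasing = all(ratios[i] >= ratios[i + 1] for i in range(log_p - 1))
    ratios_in_range = all(psi_m <= r <= psi_M for r in ratios)
    return g_non_increasing and ratios_non_increasing and ratios_in_range

def optimality_factor(alpha, beta):
    """Optimality guaranteed on the D-BSP: alpha * beta / (1 + alpha)."""
    return alpha * beta / (1 + alpha)
```

For instance, with p = 8, a zero lower vector and the upper vector (64, 32, 16), every term of the minimum equals 16, so any non-increasing ratio sequence bounded by 16 is admissible.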
A few remarks are in order regarding the above optimality theorem. The complexity metrics
adopted in this paper target exclusively interprocessor communication, thus a (sequential) network-
oblivious algorithm specified on M(v) but using only one of the virtual processors would clearly
be optimal with respect to these metrics. For meaningful applications of the theorem, the class
C must be suitably defined to exclude such degenerate cases and to contain algorithms where the
work is sufficiently well balanced among the processing elements.
Note that the theorem requires that both the $g_i$'s and $\ell_i/g_i$'s form non-increasing sequences.
The assumption is rather natural since it reflects the fact that larger submachines exhibit more
expensive communication (hence, a larger g parameter) and larger network capacity (hence, a larger
ℓ/g ratio).
Some of the issues encountered in establishing the optimality theorem have an analog in the
context of memory hierarchies. For example, time in the Hierarchical Memory Model (HMM) can
be linked to I/O complexity as discussed in [1], so that optimality of the latter for different cache
sizes implies the optimality of the former for wide classes of functions describing the access time
to different memory locations. Although, to the best of our knowledge, the question has not been
explicitly addressed in the literature, a careful inspection of the arguments of [1] shows that some
restriction to the class of algorithms is required, to guarantee that the maximum value of the I/O
complexity for different cache sizes is simultaneously reached for the same input instance. (For
example, the optimality of HMM time does not follow for the class of arbitrary comparison-based
sorting algorithms, since the known I/O complexity lower bound for this problem [4] may not be
simultaneously reachable for all relevant cache sizes.) Moreover, the monotonicity we have assumed
for the $g_i$ and the $\ell_i/g_i$ sequences has an analog in the assumption that the function used in [1] to
model the memory access time is polynomially bounded.
In the cache-oblivious framework the equivalent of our optimality theorem requires algorithms
to satisfy the regularity condition [27, Lemma 6.4], which requires that the number of cache misses
decreases by a constant factor when the cache size is doubled. On the other hand, our optimality
theorem gives the best bound when the network-oblivious algorithm is (Θ(1), p)-wise, that is, when
the communication complexity decreases by a constant factor when the number of processors is
doubled. Although the regularity condition and wiseness cannot be formalized in a similar fashion
due to the significant differences between the cache- and network-oblivious frameworks, we observe
that both assumptions require the oblivious algorithms to react seamlessly and smoothly to small
changes of the machine parameters.
4 Algorithms for Key Problems
In this section, we illustrate the use of the proposed framework by developing efficient network-
oblivious algorithms for a number of fundamental computational problems: matrix multiplication
(Subsection 4.1), Fast Fourier Transform (Subsection 4.2), and sorting (Subsection 4.3). All of
our algorithms exhibit Θ(1)-optimality on the D-BSP for wide ranges of the machine parameters.
In Subsection 4.4, we also present network-oblivious algorithms for stencil computations. These
latter algorithms run efficiently on the D-BSP although they do not achieve Θ(1)-optimality, which
appears to be a hard challenge in this case. In Subsection 4.5, we also establish a negative re-
sult by proving that there cannot exist a network-oblivious algorithm for broadcasting which is
simultaneously Θ(1)-optimal on two sufficiently different M(p, σ) machines.
As prescribed by our framework, the performance of the network-oblivious algorithms on the
D-BSP is derived by analyzing their performance on the evaluation model. Optimality is assessed
with respect to classes of algorithms where the computation is not excessively unbalanced among
the processors, namely, algorithms where an individual processor cannot perform more than a
constant fraction of the total minimum work for the problem. For this purpose, we exploit some
recent lower bounds which rely on mild assumptions on work distributions and strengthen previous
bounds based on stronger assumptions [40]. Finally, we want to stress that all of our algorithms
are also work optimal.
4.1 Matrix Multiplication
The n-MM problem consists of multiplying two $\sqrt{n} \times \sqrt{n}$ matrices, A and B, using only semiring
operations. For convenience, we assume that n is an even power of 2 (the general case requires
minor yet tedious modifications). A result in [35] shows that any static algorithm for the n-MM
problem which uses only semiring operations must compute all $n^{3/2}$ multiplicative terms, that is,
the products $A[i, k] \cdot B[k, j]$, with $0 \le i, j, k < \sqrt{n}$.
Let C denote the class of static algorithms for the n-MM problem such that any A ∈ C for q
processing elements satisfies the following properties: (i) no entry of A or B is initially replicated
(however, the entries of A and B are allowed to be initially distributed among the processing
elements in an arbitrary fashion); (ii) no processing element computes more than $n^{3/2}/\min\{q, 11^3\}$
multiplicative terms; (iii) all of the foldings of A on q′ processing elements, 2 ≤ q′ < q, also belong
to C . The following lemma establishes a lower bound on the communication complexity of the
algorithms in C .
Lemma 3. The communication complexity of any n-MM algorithm in C when executed on M(p, σ)
is $\Omega\left(n/p^{2/3} + \sigma\right)$.
Proof. The bound for σ = 0 is proved in [40, Theorem 2], and it clearly extends to the case σ > 0.
The additive σ term follows since at least one message is sent by some processing element.
We now describe a static network-oblivious algorithm for the n-MM problem, which follows from
the parallelization of the respective cache-oblivious algorithm [27]. Then, we prove its optimality
in the evaluation model, for wide ranges of the parameters, and in the execution model through the
optimality theorem. The algorithm is specified on M(n), and requires that the input and output
matrices be evenly distributed among the n VPs. We denote with A, B and C the two input
matrices and the output matrix, respectively, and with $A_{hk}$, $B_{hk}$ and $C_{hk}$, with $0 \le h, k \le 1$, their
four quadrants. The network-oblivious algorithm adopts the following recursive strategy:
1. Partition the VPs into eight segments $S_{hk\ell}$, with $0 \le h, k, \ell \le 1$, containing the same number
of consecutively numbered VPs. Replicate and distribute the inputs so that the entries of
$A_{h\ell}$ and $B_{\ell k}$ are evenly spread among the VPs in $S_{hk\ell}$.
2. In parallel, for each $0 \le h, k, \ell \le 1$, compute recursively the product $M_{hk\ell} = A_{h\ell} \cdot B_{\ell k}$ within
$S_{hk\ell}$.
3. In parallel, for each $0 \le i, j < \sqrt{n}$, the VP responsible for C[i, j] collects $M_{hk0}[i', j']$ and
$M_{hk1}[i', j']$, with $h = \lfloor 2i/\sqrt{n} \rfloor$, $k = \lfloor 2j/\sqrt{n} \rfloor$, $i' = i \bmod (\sqrt{n}/2)$ and $j' = j \bmod (\sqrt{n}/2)$,
and computes $C[i, j] = M_{hk0}[i', j'] + M_{hk1}[i', j']$.
At the i-th recursion level, with $0 \le i \le (\log n)/3$, $8^i$ $(n/4^i)$-MM subproblems are solved by distinct
$M(n/8^i)$'s formed by distinct segments of VPs. The recursion stops at $i = (\log n)/3$ when each VP
solves sequentially an $n^{1/3}$-MM subproblem. It is easy to see, by unfolding the recursion, that the
algorithm comprises a constant number of 3i-supersteps for every $0 \le i < (\log n)/3$, where each
VP sends/receives $O(2^i)$ messages. In order to enforce wiseness, without worsening the asymptotic
communication complexity and communication time exhibited by the algorithm in the evaluation
and execution machine models, we may assume that, in each 3i-superstep, $\mathrm{VP}_j$ sends $2^i$ dummy
messages to $\mathrm{VP}_{j + n/2^{3i+1}}$, for $0 \le j < n/2^{3i+1}$.
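The arithmetic of the three-step strategy can be illustrated with a purely sequential simulation. This is a sketch under simplifying assumptions and not the distributed algorithm itself: there are no VPs or supersteps, matrices are lists of lists whose side (the quantity $\sqrt{n}$ of the text) is a power of two, the recursion is carried down to 1 × 1 blocks rather than stopping at level (log n)/3, and the names `quadrant` and `recursive_mm` are hypothetical.

```python
def quadrant(X, h, k):
    """Return quadrant X_hk of a square matrix with even side."""
    half = len(X) // 2
    return [row[k * half:(k + 1) * half] for row in X[h * half:(h + 1) * half]]

def recursive_mm(A, B):
    """Multiply two square matrices with the eight-subproblem recursion."""
    side = len(A)
    if side == 1:
        return [[A[0][0] * B[0][0]]]
    half = side // 2
    # Steps 1 and 2: the eight products M_hkl = A_hl * B_lk.
    M = {(h, k, l): recursive_mm(quadrant(A, h, l), quadrant(B, l, k))
         for h in (0, 1) for k in (0, 1) for l in (0, 1)}
    # Step 3: C[i][j] = M_hk0[i'][j'] + M_hk1[i'][j'].
    C = [[0] * side for _ in range(side)]
    for i in range(side):
        for j in range(side):
            h, k = (2 * i) // side, (2 * j) // side
            ip, jp = i % half, j % half
            C[i][j] = M[h, k, 0][ip][jp] + M[h, k, 1][ip][jp]
    return C
```

Only additions and multiplications of entries are used, in keeping with the semiring restriction of the n-MM problem.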
Theorem 2. The communication complexity of the above n-MM network-oblivious algorithm when
executed on M(p, σ) is
$$H_{\mathrm{MM}}(n, p, \sigma) = O\left( \frac{n}{p^{2/3}} + \sigma \log p \right),$$
for every $1 < p \le n$ and $\sigma \ge 0$. The algorithm is (Θ(1), n)-wise and Θ(1)-optimal with respect to
C on any M(p, σ) with $1 < p \le n$ and $\sigma = O(n/(p^{2/3} \log p))$.
Proof. From the previous discussion, we conclude that
$$H_{\mathrm{MM}}(n, p, \sigma) = O\left( \sum_{i=0}^{(\log p)/3} \left( \frac{n 2^i}{p} + \sigma \right) \right) = O\left( \frac{n}{p^{2/3}} + \sigma \log p \right).$$
As anticipated, the wiseness is guaranteed by the dummy messages introduced in each superstep.
Finally, it is easy to see that the algorithm satisfies the three requirements for belonging to C,
hence its optimality follows from Lemma 3.
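For completeness, the closed form of the sum in the proof can be worked out explicitly. The following derivation is a sketch that assumes, for simplicity, that $(\log p)/3$ is an integer; it uses only the geometric series and the identity $2^{(\log p)/3} = p^{1/3}$.

```latex
\begin{align*}
\sum_{i=0}^{(\log p)/3} \left( \frac{n 2^i}{p} + \sigma \right)
  &= \frac{n}{p} \left( 2^{(\log p)/3 + 1} - 1 \right)
     + \sigma \left( \frac{\log p}{3} + 1 \right) \\
  &\le \frac{2 n\, p^{1/3}}{p} + \sigma \left( \frac{\log p}{3} + 1 \right)
   = \frac{2n}{p^{2/3}} + \sigma \left( \frac{\log p}{3} + 1 \right)
   = O\left( \frac{n}{p^{2/3}} + \sigma \log p \right),
\end{align*}
```

where the last step uses $\log p \ge 1$, which holds since $p > 1$.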
Corollary 1. The above n-MM network-oblivious algorithm is Θ(1)-optimal with respect to C on
any D-BSP(p, g, ℓ) machine with $1 < p \le n$, non-increasing $g_i$'s and $\ell_i/g_i$'s, and $\ell_0/g_0 = O(n/p)$.
Proof. Since the network-oblivious algorithm is (Θ(1), n)-wise and belongs to C, the corollary
follows by plugging $p^\star = n$, $\sigma^m_i = 0$, and $\sigma^M_i = \Theta\left(n/((i + 1) 2^{2i/3})\right)$ into Theorem 1.
4.1.1 Space-Efficient Matrix Multiplication
Observe that the network-oblivious algorithm described above incurs an $O(n^{1/3})$ memory blow-up
per VP. As described below, the recursive strategy can be modified so as to incur only a constant
memory blow-up, at the expense of an increased communication complexity. The resulting
network-oblivious algorithm turns out to be Θ(1)-optimal with respect to the class of algorithms featuring
constant memory blow-up.
We assume, as before, that the entries of A, B and C are evenly distributed among the VPs. The
VPs are (recursively) divided into four segments which solve the eight (n/4)-MM subproblems in
two rounds: in the first round, the segments compute $A_{00} \cdot B_{00}$, $A_{01} \cdot B_{11}$, $A_{11} \cdot B_{10}$ and $A_{10} \cdot B_{01}$ (one
product per segment), while in the second round they compute $A_{01} \cdot B_{10}$, $A_{00} \cdot B_{01}$, $A_{10} \cdot B_{00}$ and
$A_{11} \cdot B_{11}$ (again, one product per segment). The recursion ends when each VP solves sequentially
a 1-MM subproblem. By unfolding the recursion, we get that for every $0 \le i < (\log n)/2$, the
algorithm executes $\Theta(2^i)$ 2i-supersteps where each VP sends/receives Θ(1) messages. At any time
each VP contains only O(1) matrix entries, but the recursion requires it to handle a stack of O(log n)
entries. However, it is easy to see that only a constant number of bits are needed for each stack
entry, hence, under the natural assumption that each matrix entry occupies a constant number of
Θ(log n)-bit words, the entire stack at each VP requires storage proportional to O(1) matrix entries.
Therefore, the algorithm incurs only a constant memory blow-up. As before, the algorithm can be
easily made (Θ(1), n)-wise by adding suitable dummy messages and, when executed on M(p, σ),
its communication complexity becomes $O\left(n/\sqrt{p} + \sigma\sqrt{p}\right)$.
Let C ′ denote the class of static algorithms for the n-MM problem such that any A ∈ C ′
for q processing elements satisfies the following properties: (i) the local storage required at each
processing element is O(n/q); (ii) all of the foldings of A on q′ processing elements, 2 ≤ q′ < q, also
belong to C′. Since it is proved in [32] that any n-MM algorithm in C′ when running on M(p, 0)
must exhibit an $\Omega\left(n/\sqrt{p}\right)$ communication complexity, the above network-oblivious algorithm is
Θ(1)-optimal with respect to C′ on any M(p, σ) with $1 < p \le n$ and $\sigma = O(n/p)$. Consequently,
Theorem 1 yields optimality of the algorithm on any D-BSP(p, g, ℓ) machine with $1 < p \le n$,
non-increasing $g_i$'s and $\ell_i/g_i$'s, and $\ell_0/g_0 = O(n/p)$.
4.2 Fast Fourier Transform
The n-FFT problem consists of computing the Discrete Fourier Transform of n values using the
n-input FFT DAG, where a vertex is a pair 〈w, l〉, with 0 ≤ w < n and 0 ≤ l < log n, and there
exists an arc between two vertices 〈w, l〉 and 〈w′, l′〉 if l′ = l + 1, and either w and w′ are identical
or their binary representations differ exactly in the l-th bit [37].
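The definition of the FFT DAG translates directly into code. The following sketch enumerates the arcs exactly as stated above, with levels $0 \le l < \log n$; the function name `fft_dag_arcs` is hypothetical, and orienting each arc from level l to level l + 1 is an assumption of this illustration.

```python
def fft_dag_arcs(n):
    """List the arcs ((w, l), (w', l + 1)) of the n-input FFT DAG defined above."""
    log_n = n.bit_length() - 1
    if n < 2 or (1 << log_n) != n:
        raise ValueError("n must be a power of two greater than 1")
    arcs = []
    for l in range(log_n - 1):
        for w in range(n):
            arcs.append(((w, l), (w, l + 1)))             # w' identical to w
            arcs.append(((w, l), (w ^ (1 << l), l + 1)))  # w' differs in the l-th bit
    return arcs
```

Every vertex at a level l ≥ 1 has exactly two incoming arcs, which is the butterfly structure exploited by the recursive decomposition used later in this subsection.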
Let C denote the class of static algorithms for the n-FFT problem such that any A ∈ C for q
processing elements satisfies the following properties: (i) each DAG node is evaluated exactly once
(i.e., recomputation is not allowed); (ii) no input value is initially replicated; (iii) no processing
element computes more than $\epsilon n \log n$ DAG nodes, for some constant $0 < \epsilon < 1$; (iv) all of the
foldings of A on q′ processing elements, 2 ≤ q′ < q, also belong to C. Note that, as in the preceding
subsection, the class of algorithms we are considering makes no assumptions on the input and output
distributions. The following lemma
establishes a lower bound on the communication complexity of the algorithms in C.
Lemma 4. The communication complexity of any n-FFT algorithm in C when executed on M(p, σ)
is Ω((n log n)/(p log(n/p)) + σ).
Proof. The bound for σ = 0 is proved in [40, Theorem 11], and it clearly extends to the case σ > 0.
The additive σ term follows since at least one message is sent by some processing element.
We now describe a static network-oblivious algorithm for the n-FFT problem, and then prove
its optimality in the evaluation and execution models. The algorithm is specified on M(n), and
exploits the well-known decomposition of the FFT DAG into two sets of $\sqrt{n}$-input FFT subDAGs,
with each set containing $\sqrt{n}$ such subDAGs [2]. For simplicity, in order to ensure integrality of the
quantities involved, we assume $n = 2^{2^k}$ for some integer $k \ge 0$. We assume that at the beginning
the n inputs are evenly distributed among the n VPs. In parallel, each of the $\sqrt{n}$ segments of $\sqrt{n}$
consecutively numbered VPs computes recursively the assigned subDAG. Then, the outputs of the
first set of subDAGs are permuted in a 0-superstep so as to distribute the inputs of each subDAG
of the second set among the VPs of a distinct segment. The permutation pattern is equivalent to
the transposition of a $\sqrt{n} \times \sqrt{n}$ matrix. Finally, each segment computes recursively the assigned
subDAG.
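The transposition pattern used in the 0-superstep can be made concrete as follows. This is a minimal sketch, assuming that VP j holds the matrix entry in row ⌊j/√n⌋ and column j mod √n (a row-major layout that the text does not spell out); the name `transpose_destination` is hypothetical.

```python
import math

def transpose_destination(j, n):
    """Destination VP of the value held by VP j under a sqrt(n) x sqrt(n) transposition."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    row, col = divmod(j, side)   # row = segment index, col = position in the segment
    return col * side + row
```

Under this layout, the VPs of any one segment send exactly one value to each of the $\sqrt{n}$ segments, so every VP sends and receives a single message.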
At the i-th recursion level, with $i \ge 0$, $n^{1 - 1/2^i}$ $n^{1/2^i}$-FFT subproblems are solved by $n^{1 - 1/2^i}$
$M(n^{1/2^i})$ models formed by distinct segments of VPs. The recursion stops at $i = \log \log n$ when
each segment of two VPs computes a 2-input subDAG. It is easy to see, by unfolding the recursion,
that the algorithm comprises $O(2^i)$ supersteps with label $(1 - 1/2^i) \log n$, for each $0 \le i < \log \log n$,
where each VP sends/receives O(1) messages. As before, in order to enforce wiseness without
affecting the algorithm's asymptotic performance, we assume that in each $(1 - 1/2^i) \log n$-superstep,
$\mathrm{VP}_j$ sends a dummy message to $\mathrm{VP}_{j + n^{1/2^i}/2}$, for each $0 \le j < n^{1/2^i}/2$.
Theorem 3. The communication complexity of the above n-FFT network-oblivious algorithm when
executed on M(p, σ) is
$$H_{\mathrm{FFT}}(n, p, \sigma) = O\left( \left( \frac{n}{p} + \sigma \right) \frac{\log n}{\log(n/p)} \right),$$
for every $1 < p \le n$ and $\sigma \ge 0$. The algorithm is (Θ(1), n)-wise and Θ(1)-optimal with respect to
C on any M(p, σ) with $1 < p \le n$ and $\sigma = O(n/p)$.
Proof. From the previous discussion, we conclude that
$$H_{\mathrm{FFT}}(n, p, \sigma) = O\left( \sum_{i=0}^{\log(\log n/\log(n/p))} 2^i \left( \frac{n}{p} + \sigma \right) \right) = O\left( \left( \frac{n}{p} + \sigma \right) \frac{\log n}{\log(n/p)} \right).$$
The wiseness is ensured by the dummy messages, and since the algorithm satisfies the requirements
for belonging to C, its optimality follows from Lemma 4.
We now apply Theorem 1 to show that the network-oblivious algorithm is Θ(1)-optimal on the
D-BSP for wide ranges of the machine parameters.
Corollary 2. The above n-FFT network-oblivious algorithm is Θ(1)-optimal with respect to C on
any D-BSP(p, g, ℓ) machine with $1 < p \le n$, non-increasing $g_i$'s and $\ell_i/g_i$'s, and $\ell_0/g_0 = O(n/p)$.
Proof. Since the network-oblivious algorithm is (Θ(1), n)-wise and belongs to C, we get the claim
by plugging $p^\star = n$, $\sigma^m_i = 0$, and $\sigma^M_i = \Theta\left(n/2^i\right)$ in Theorem 1.
We observe that although we described the network-oblivious algorithm assuming $n = 2^{2^k}$, in
order to ensure integrality of the quantities involved, the above results can be generalized to the
case where n is an arbitrary power of two. In this case, the FFT DAG is recursively decomposed into a set
of $2^{\lfloor \log \sqrt{n} \rfloor}$-input FFT subDAGs and a set of $n/2^{\lfloor \log \sqrt{n} \rfloor}$-input FFT subDAGs. The optimality of
the resulting algorithm in both the evaluation and execution machine models can be proved in a
similar fashion as before.
4.3 Sorting
The n-sort problem requires labeling n (distinct) input keys with their ranks, using only comparisons,
where the rank of a key is the number of smaller keys in the input sequence.
Let C denote the class of static algorithms for the n-sort problem such that any A ∈ C for q
processing elements satisfies the following properties: (i) initially, no input key is replicated and,
during the course of the algorithm, only a constant number of copies per key are allowed at any
time; (ii) no processing element performs more than $\epsilon n \log n$ comparisons, for an arbitrary constant
$0 < \epsilon < 1$; (iii) all of the foldings of A on q′ processing elements, 2 ≤ q′ < q, also belong to C.
We make no assumptions on how the keys are distributed among the processing elements at the
beginning and at the end of the algorithm. The following lemma establishes a lower bound on the
communication complexity of the algorithms in C .
Lemma 5. The communication complexity of any n-sort algorithm in C when executed on M(p, σ)
is Ω((n log n)/(p log(n/p)) + σ).
Proof. The bound for σ = 0 is proved in [40, Theorem 8], and it clearly extends to the case σ > 0.
The additive σ term follows since at least one message is sent by some processing element.
We now present a static network-oblivious algorithm for the n-sort problem, and then prove
its optimality in the evaluation and execution models. The algorithm implements a recursive
version of the Columnsort strategy, as described in [36]. Consider the n input keys as an r × s
matrix, with $r \cdot s = n$ and $r \ge s^2$. Columnsort is organized into eight phases numbered from
1 to 8. During Phases 1, 3, 5 and 7 the keys in each column are sorted recursively (in Phase 5
adjacent columns are sorted in reverse order). During Phases 2, 4, 6 and 8 the keys of the matrix
are permuted: in Phase 2 (resp., 4) a transposition (resp., diagonalizing permutation [36]) of the
r × s matrix is performed maintaining the r × s shape; in Phase 6 (resp., 8) an r/2-cyclic shift
(resp., the reverse of the r/2-cyclic shift) is done.⁵ Columnsort can be implemented on M(n)
as follows. For convenience, assume that $n = 2^{(3/2)^d}$ for some integer $d \ge 0$, and set $r = n^{2/3}$
and $s = n/r$ (the more general case is discussed later). The algorithm starts with the input
keys evenly distributed among the n VPs. In the odd phases the keys of each column are evenly
distributed among the VPs of a distinct segment of r consecutively numbered VPs, which form
an independent M(r). Then, each segment solves recursively the subproblem corresponding to the
column it received. The even phases entail a constant number of 0-supersteps of constant degree.
⁵ In the original paper [36], the shift in Phase 6 is not cyclic: a new column is added containing the r/2 overflowing
keys and r/2 large dummy keys, while the first column is filled with r/2 small dummy keys. However, it is easy to
see that a cyclic shift suffices if the first r/2 keys in the first column are considered smaller than the last r/2 keys.
At the i-th recursion level, with $0 \le i \le \log_{3/2} \log n$, each segment of $n^{(2/3)^i}$ consecutively numbered
VPs forming an independent $M(n^{(2/3)^i})$ solves $4^i$ subproblems of size $n^{(2/3)^i}$. The recursion stops
at $i = \log_{3/2} \log n$ when each VP solves, sequentially, a subproblem of constant size. It is easy
to see, by unfolding the recursion, that the algorithm consists of $\Theta(4^i)$ supersteps with label
$(1 - (2/3)^i) \log n$, for each $0 \le i < \log_{3/2} \log n$, where each VP sends/receives O(1) messages. As
before, in order to enforce wiseness, without affecting the algorithm's asymptotic performance, we
assume that in each $(1 - (2/3)^i) \log n$-superstep, $\mathrm{VP}_j$ sends a dummy message to $\mathrm{VP}_{j + n^{(2/3)^i}/2}$, for
each $0 \le j < n^{(2/3)^i}/2$.
Theorem 4. The communication complexity of the above network-oblivious algorithm for n-sort
when executed on M(p, σ) is
$$H_{\mathrm{sort}}(n, p, \sigma) = O\left( \left( \frac{n}{p} + \sigma \right) \left( \frac{\log n}{\log(n/p)} \right)^{\log_{3/2} 4} \right),$$
for every $1 < p \le n$ and $\sigma \ge 0$. The algorithm is (Θ(1), n)-wise and it is Θ(1)-optimal with respect
to C on any M(p, σ) with $p = O\left(n^{1-\delta}\right)$, for any arbitrary constant $\delta \in (0, 1)$, and $\sigma \ge 0$.
Proof. By the previous description of the algorithm, we conclude that
$$H_{\mathrm{sort}}(n, p, \sigma) = O\left( \sum_{i=0}^{\log_{3/2}(\log n/\log(n/p))} 4^i \left( \frac{n}{p} + \sigma \right) \right) = O\left( \left( \frac{n}{p} + \sigma \right) \left( \frac{\log n}{\log(n/p)} \right)^{\log_{3/2} 4} \right).$$
The wiseness is guaranteed by the dummy messages. Since the algorithm satisfies the three
requirements to be in C, its optimality follows from Lemma 5.
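The exponent $\log_{3/2} 4$ in the bound comes from the identity $4^{\log_{3/2} x} = x^{\log_{3/2} 4}$ applied to the last term of the geometric sum. The sketch below evaluates both sides numerically; it is illustrative only, the function names are hypothetical, floating-point arithmetic is used throughout, and p < n is assumed so that log(n/p) is positive.

```python
import math

EXPONENT = math.log(4, 1.5)  # log_{3/2} 4, roughly 3.42

def recursion_depth(n, p):
    """t = log_{3/2}(log n / log(n/p)), the (real-valued) upper index of the sum."""
    return math.log(math.log2(n) / math.log2(n / p), 1.5)

def summed_supersteps(n, p):
    """Sum of 4^i for i = 0, ..., floor(t)."""
    t = math.floor(recursion_depth(n, p))
    return sum(4**i for i in range(t + 1))

def closed_form(n, p):
    """(log n / log(n/p)) raised to log_{3/2} 4."""
    return (math.log2(n) / math.log2(n / p)) ** EXPONENT
```

Since the sum is geometric with ratio 4, it is at most 4/3 times its largest term, which in turn is at most the closed form.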
Corollary 3. The above n-sort network-oblivious algorithm is Θ(1)-optimal with respect to C on
any D-BSP(p, g, ℓ) machine with $p = O\left(n^{1-\delta}\right)$, for some arbitrary constant $\delta \in (0, 1)$, and
non-increasing $g_i$'s and $\ell_i/g_i$'s.
Proof. Since the network-oblivious algorithm is (Θ(1), n)-wise and belongs to C, we get the claim
by plugging $p^\star = n$, $\sigma^m_i = 0$, and $\sigma^M_i = +\infty$ in Theorem 1.
Consider now the more general case when n is an arbitrary power of two. Now, the input keys
must be regarded as the entries of an r × s matrix, where r is the smallest power of two greater
than or equal to $n^{2/3}$. Simple, yet tedious, calculations show that the results stated in Theorem 4
and Corollary 3 continue to hold in this case.
Finally, we remark that the above network-oblivious sorting algorithm turns out to be Θ(1)-optimal
on any D-BSP(p, g, ℓ), as long as $p = O\left(n^{1-\delta}\right)$ for constant δ, with respect to a wider
class of algorithms which satisfy the requirements (i), (ii), and (iii), specified above for C, but
need not be static. By applying the lower bound for sorting in [40] on two processors, it is easy to
show that Ω(n) messages must cross the bisection for this class of algorithms. Therefore, we get
an $\Omega(g_0 n/p)$ lower bound on the communication time on D-BSP(p, g, ℓ), which is matched by our
network-oblivious algorithm.
4.4 Stencil Computations
A stencil defines the computation of any element in a d-dimensional spatial grid at time t as a
function of neighboring grid elements at time t−1, t−2, . . . , t−τ , for some integers τ ≥ 1 and d ≥ 1.
Stencil computations arise in many contexts, ranging from iterative finite-difference methods for
the numerical solution of partial differential equations, to algorithms for the simulation of cellular
automata, as well as in dynamic programming algorithms and in image-processing applications.
Also, the simulation of a d-dimensional mesh [14] can be envisioned as a stencil computation.
In this subsection, we restrict our attention to stencil computations with τ = 1. To this
purpose, we define the (n, d)-stencil problem which represents a wide class of stencil computations
(see, e.g., [28]). Specifically, the problem consists in evaluating all nodes of a DAG of $n^{d+1}$ nodes,
each represented by a distinct tuple $\langle i_0, i_1, \ldots, i_d \rangle$, with $0 \le i_0, i_1, \ldots, i_d < n$, where each node
$\langle i_0, i_1, \ldots, i_d \rangle$ is connected, through an outgoing arc, to (at most) $3^d$ neighbors, namely
$\langle i_0 + \delta_0, i_1 + \delta_1, \ldots, i_{d-1} + \delta_{d-1}, i_d + 1 \rangle$ for each $\delta_0, \delta_1, \ldots, \delta_{d-1} \in \{0, \pm 1\}$ (whenever such nodes exist).
We suppose n to be a power of two. Intuitively, the (n, d)-stencil problem consists of n time steps of
a stencil computation on a d-dimensional spatial grid of side n, where each DAG node corresponds
to a grid element (first d coordinates) at a given time step (coordinate $i_d$).
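The arc structure of the (n, d)-stencil DAG can be enumerated directly from the definition. The following is a minimal sketch; the function name `stencil_successors` is hypothetical, and nodes are represented as Python tuples whose last coordinate is the time step.

```python
from itertools import product

def stencil_successors(node, n):
    """Return the out-neighbors of a node <i_0, ..., i_d> of the (n, d)-stencil DAG."""
    *space, t = node
    if t + 1 >= n:
        return []  # nodes at the last time step have no outgoing arcs
    successors = []
    for deltas in product((-1, 0, 1), repeat=len(space)):
        nxt = tuple(c + dc for c, dc in zip(space, deltas))
        if all(0 <= c < n for c in nxt):  # keep only neighbors inside the grid
            successors.append(nxt + (t + 1,))
    return successors
```

An interior node has exactly $3^d$ successors, while nodes on the boundary of the spatial grid have fewer, which is the "at most" in the definition.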
Let $C_d$ denote the class of static algorithms for the (n, d)-stencil problem such that any $A \in C_d$
for q processing elements satisfies the following properties: (i) each DAG node is evaluated once
(i.e., recomputation is not allowed); (ii) no processing element computes more than $\epsilon n^{d+1}$ DAG
nodes, for some constant $0 < \epsilon < 1$; (iii) all of the foldings of A on q′ processing elements, 2 ≤ q′ < q,
also belong to $C_d$. Note that, as before, this class of algorithms makes no assumptions on the input
and output distributions. The following lemma establishes a lower bound on the communication
complexity of the algorithms in $C_d$.
Lemma 6. The communication complexity of any (n, d)-stencil algorithm in $C_d$ when executed on
M(p, σ) is $\Omega\left(n^d/p^{(d-1)/d} + \sigma\right)$.
Proof. The bound for σ = 0 is proved in [40, Theorem 5], and it clearly extends to the case σ > 0.
The additive σ term follows since at least one message is sent by some processing element.
In what follows, we develop efficient network-oblivious algorithms for the (n, d)-stencil problem,
for the special cases d ∈ {1, 2}. The generalization to values d > 2, and to other types of stencils,
is left as an open problem.
4.4.1 The (n, 1)-Stencil Problem
The (n, 1)-stencil problem consists of the evaluation of a DAG shaped as a 2-dimensional array
of side n. We reduce the solution of the stencil problem to the computation of a diamond DAG.
Specifically, we define a diamond DAG of side n as the intersection of a (2n − 1, 1)-stencil DAG
with the following four half-planes: $i_0 + i_1 \ge n - 1$, $i_0 - i_1 \le n - 1$, $i_0 - i_1 \ge -(n - 1)$, and
$i_0 + i_1 \le 3(n - 1)$ (i.e., the largest diamond included in the stencil).⁶ It follows that an (n, 1)-stencil
DAG can be partitioned into five full or truncated diamond DAGs of side less than n which can
be evaluated in a suitable order, with the outputs of one DAG evaluation providing the inputs for
subsequent DAG evaluations.
⁶ We observe that our definition of diamond DAG is consistent with the one in [14], whose edges are a superset of
those of the diamond DAG defined in [3].
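The four half-planes defining the diamond DAG can be applied literally to enumerate its nodes. This is an illustrative sketch; the function name `diamond_nodes` is hypothetical.

```python
def diamond_nodes(n):
    """Nodes (i_0, i_1) of the diamond DAG of side n inside a (2n - 1, 1)-stencil DAG."""
    m = n - 1
    return [(i0, i1)
            for i0 in range(2 * n - 1)
            for i1 in range(2 * n - 1)
            if i0 + i1 >= m            # first half-plane
            and i0 - i1 <= m           # second half-plane
            and i0 - i1 >= -m          # third half-plane
            and i0 + i1 <= 3 * m]      # fourth half-plane
```

The four constraints are equivalent to $|i_0 - (n-1)| + |i_1 - (n-1)| \le n - 1$, so the diamond is centered at $(n-1, n-1)$ and contains $2n^2 - 2n + 1$ nodes.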
Figure 1: The decomposition of the diamond DAG performed by our algorithm. (Labels shown in the figure: n, k, r-th stripe.)
Our network-oblivious algorithm for the (n, 1)-stencil is specified on M(n) and consists of five
stages, where in each stage the whole M(n) machine takes care of the evaluation of a distinct
diamond DAG (full or truncated) according to the aforementioned partition. We require that all of
the O(n) inputs necessary for the evaluation of a diamond DAG are evenly distributed among the
n VPs at the start of the stage in charge of the DAG. No matter how the inputs are assigned to
the VPs at the beginning of the algorithm, the data movement required to guarantee the correct
input distribution at the various stages can be accomplished in O(1) 0-supersteps where each VP
sends/receives O(n) messages.
We now focus on the evaluation of the individual diamond DAGs. For ease of presentation, we
consider the evaluation of a full diamond DAG of side n on M(n). Simple yet tedious modifications
are required for dealing with truncated or smaller diamond DAGs. We exploit the fact that the DAG
can be decomposed recursively into smaller diamonds. (Oblivious parallel algorithms adopting this
strategy have been studied in [19, 29], but these algorithms, specified for a different computational
framework, are not oblivious to the number of processors p, hence they cannot be directly compared
to ours.)
Let $k = 2^{\lceil \sqrt{\log n} \rceil}$. The diamond DAG is partitioned into 2k − 1 horizontal stripes, each containing
up to k diamonds of side n/k, as depicted in Figure 1. The DAG evaluation is accomplished in
2k − 1 non-overlapping phases. In the r-th such phase, with 0 ≤ r < 2k − 1, the diamonds in the
r-th stripe are evaluated in parallel by distinct M(n/k) submachines formed by disjoint segments
of consecutively numbered VPs.⁷ At the beginning of each phase, a 0-superstep is executed so as to
provide the VPs of each M(n/k) submachine with the appropriate input, that is, the immediate
predecessors (if any) of the diamond assigned to the submachine. In this superstep each VP
sends/receives O(1) messages. In each phase, the diamonds of side n/k are evaluated recursively.
In general, at the i-th recursive level, with $i \ge 1$, a total of $(2k - 1)^i$ non-overlapping phases are
executed where diamonds of side $n_i = n/k^i$ are evaluated in parallel by distinct $M(n_i)$ submachines.
Each such phase starts with a superstep of label $(i - 1) \cdot \log k$ in order to provide each $M(n_i)$ with
the appropriate input. In turn, the evaluation of a diamond of side $n_i$ within an $M(n_i)$ submachine
is performed recursively by partitioning its nodes into 2k − 1 horizontal stripes of diamonds of side
$n_{i+1} = n/k^{i+1}$ which are evaluated in 2k − 1 non-overlapping phases by $M(n_{i+1})$ submachines,
with each phase starting with a superstep of label $i \cdot \log k$ where each VP sends/receives O(1)
messages (and thus each processor sends/receives O(n/p) messages). The recursion ends at level
$\tau = \lfloor \log_k n \rfloor$, which is the first level where the diamond of side $n_\tau$ becomes smaller than k. If
$n_\tau > 1$, each diamond of side $n_\tau$ assigned to an $M(n_\tau)$ submachine is evaluated straightforwardly
in $2 n_\tau - 1$ supersteps of label $\tau \cdot \log k$. Instead, if $n_\tau = 1$, at recursion level τ each VP independently
evaluates a 1-node diamond, and no communication is required.
By unfolding the recursion, one can easily see that the evaluation of a diamond DAG of side $n$
entails, overall, $(2k-1)^i$ supersteps of label $(i-1) \cdot \log k$, for $1 \le i \le \tau$, and, if $n_\tau > 1$, $(2k-1)^\tau n_\tau$
supersteps of label $\tau \cdot \log k$. In each of these supersteps, every VP sends/receives O(1) messages.
In order to guarantee (Θ(1), n)-wiseness of our algorithm, we assume that suitable dummy
messages are added in each superstep so as to make each VP exchange the same number of
messages.
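The recursion parameters described above can be tabulated with a short script. This is a minimal sketch, assuming that n is a power of two with n ≥ 2; the function name `stencil1_recursion` and the returned dictionary layout are illustrative choices, not part of the algorithm's specification. For the last level it counts $2n_\tau - 1$ supersteps per diamond, which matches the $(2k-1)^\tau n_\tau$ stated in the text up to a factor of at most 2.

```python
import math


def stencil1_recursion(n: int) -> dict:
    """Tabulate k, tau, n_tau and the number of supersteps per label
    for the recursive evaluation of one diamond DAG of side n.

    Assumption (ours): n is a power of two and n >= 2.
    """
    log_n = n.bit_length() - 1
    assert n >= 2 and (1 << log_n) == n, "n must be a power of two >= 2"

    # log k = ceil(sqrt(log n)); math.isqrt returns the floor, so round up.
    log_k = math.isqrt(log_n)
    if log_k * log_k < log_n:
        log_k += 1
    k = 2 ** log_k

    # tau = floor(log_k n) = floor(log n / log k), both logs being integers.
    tau = log_n // log_k
    n_tau = n // k ** tau

    # Map: superstep label -> number of supersteps with that label.
    supersteps = {}
    for i in range(1, tau + 1):
        supersteps[(i - 1) * log_k] = (2 * k - 1) ** i
    if n_tau > 1:
        # Each of the (2k-1)^tau base diamonds takes 2*n_tau - 1 supersteps.
        supersteps[tau * log_k] = (2 * k - 1) ** tau * (2 * n_tau - 1)

    return {"k": k, "tau": tau, "n_tau": n_tau, "supersteps": supersteps}


if __name__ == "__main__":
    for size in (16, 32, 1024):
        print(size, stencil1_recursion(size))
```

Because k depends only on n, the table is the same for every choice of p and σ, which is what makes the strategy network-oblivious.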
Theorem 5. The communication complexity of the above network-oblivious algorithm for the (n, 1)-stencil problem when executed on M(p, σ) is
$$H_{1\text{-stencil}}(n, p, \sigma) = O\left(n\, 4^{\sqrt{\log n}}\right),$$
for every $1 < p \le n$ and $0 \le \sigma = O(n/p)$. The algorithm is (Θ(1), n)-wise and $\Omega\left(1/4^{\sqrt{\log n}}\right)$-optimal
with respect to $C_1$ on any M(p, σ) with $1 < p \le n$ and $\sigma = O(n/p)$.
Proof. As observed before, the communication required at the beginning of each of the five stages
contributes an additive factor O(n) to the communication complexity, hence it is negligible. Let
us then concentrate on the communication complexity for one diamond DAG evaluation. Recall
that $\tau = \lfloor \log_k n \rfloor$. First suppose that $p \le k^\tau$. Observe that at every recursion level $i$, with
$0 \le i < \lceil \log_k p \rceil$, the evaluation of each diamond of side $n_i = n/k^i$ is performed by $p/k^i > 1$
processors, while at every recursion level $i$, with $\lceil \log_k p \rceil \le i \le \tau$, each diamond of side $n/k^i$
is evaluated by a single processor of M(p, σ) and no communication takes place. Then, by the
previous discussion it follows that
$$
\begin{aligned}
H_{1\text{-stencil}}(n, p, \sigma)
 &= O\left(\sum_{i=0}^{\lceil \log_k p \rceil - 1} (2k-1)^{i+1} \left(\frac{n}{p} + \sigma\right)\right)
  = O\left((2k)^{\log_k p + 1}\, \frac{n}{p}\right) \\
 &= O\left(n\, 2^{\log_k p}\, k\right)
  = O\left(n\, 4^{\sqrt{\log n}}\right),
\end{aligned}
$$
where we exploited the upper bound on σ. Instead, if $k^\tau < p \le n$, we have that at every recursion
level $i$, with $0 \le i \le \tau$, the evaluation of each diamond of side $n_i = n/k^i$ is performed by $p/k^i > 1$
processors. Then, by the above discussion and recalling that for $i = \tau$, diamonds of side $n_\tau = n/k^\tau$
are evaluated straightforwardly in $2n_\tau - 1$ supersteps, we obtain
$$
\begin{aligned}
H_{1\text{-stencil}}(n, p, \sigma)
 &= O\left(\sum_{i=0}^{\tau - 1} (2k-1)^{i+1} \left(\frac{n}{p} + \sigma\right)\right)
  + O\left((2k-1)^{\tau}\, \frac{n}{k^{\tau}} \left(\frac{n}{p} + \sigma\right)\right) \\
 &= O\left((2k)^{\tau}\, \frac{n}{k^{\tau}}\, \frac{n}{p}\right)
  = O\left(n\, 2^{\tau}\, k\right)
  = O\left(n\, 4^{\sqrt{\log n}}\right),
\end{aligned}
$$
where we exploited the upper bound on σ and the fact that $p > k^\tau$ hence, by definition of $\tau$,
$n/p < k$. The wiseness is ensured by the dummy messages. It is easy to see that the algorithm
complies with the requirements for belonging to $C_1$, hence the claimed optimality is a consequence
of Lemma 6, and the theorem follows.

Footnote 7: We observe that some M(n/k) submachines may not be assigned to subproblems, since the number of diamonds
in a stripe could be smaller than k. In order to comply with the requirement that in the algorithm execution the
sequence of superstep labels is the same at each processing element, we assume that idle M(n/k) submachines are
assigned dummy diamonds of side n/k to be evaluated.
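The last step of both chains in the proof of Theorem 5 replaces $2^{\log_k p}\,k$ and $2^{\tau}\,k$ by $O(4^{\sqrt{\log n}})$. The arithmetic behind this step can be sketched as follows; this expansion is not part of the original proof and uses only $k = 2^{\lceil \sqrt{\log n} \rceil}$, $p \le n$, and $\tau = \lfloor \log_k n \rfloor$.

```latex
\begin{align*}
\log_k p \;\le\; \log_k n
  &= \frac{\log n}{\lceil \sqrt{\log n} \rceil} \;\le\; \sqrt{\log n},
  &
k &= 2^{\lceil \sqrt{\log n} \rceil} \;\le\; 2 \cdot 2^{\sqrt{\log n}}, \\
2^{\log_k p}\, k
  &\le 2^{\sqrt{\log n}} \cdot 2 \cdot 2^{\sqrt{\log n}}
   = 2 \cdot 4^{\sqrt{\log n}},
  &
2^{\tau}\, k &\le 2^{\log_k n}\, k \;\le\; 2 \cdot 4^{\sqrt{\log n}}.
\end{align*}
```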
Finally, we show that the network-oblivious algorithm for the (n, 1)-stencil problem achieves
$\Omega\left(1/4^{\sqrt{\log n}}\right)$-optimality on the D-BSP as well, for wide ranges of machine parameters.
Corollary 4. The above network-oblivious algorithm for the (n, 1)-stencil problem is $\Omega\left(1/4^{\sqrt{\log n}}\right)$-optimal
with respect to $C_1$ on any D-BSP(p, g, ℓ) machine with $1 < p \le n$, non-increasing $g_i$'s and
$\ell_i/g_i$'s, and $\ell_0/g_0 = O(n/p)$.
Proof. The corollary follows by Theorem 5 and by applying Theorem 1 with $p^\star = n$, $\sigma^m_i = 0$, and
$\sigma^M_i = \Theta\left(n/2^i\right)$.
We remark that a tighter analysis of the algorithm and/or the adoption of different values for
the recursion degree k, still independent of p and σ, may yield slightly better efficiency. However,
it is an open problem to devise a network-oblivious algorithm which is Θ(1)-optimal on the D-BSP
for wide ranges of the machine parameters.
4.4.2 The (n, 2)-Stencil Problem
In this subsection we present a network-oblivious algorithm for the (n, 2)-stencil problem, which
requires the evaluation of a DAG shaped as a 3-dimensional array of side n. Both the algorithm and
its analysis are a suitable adaptation of the ones for the (n, 1)-stencil problem. In order to evaluate
a 3-dimensional domain we make use of two types of subdomains which, intuitively, play the same
role as the diamond for the (n, 1)-stencil: the octahedron and the tetrahedron. An octahedron of
side $n$ is the intersection of a $(2n-1, 2)$-stencil with the following eight half-spaces: $i_0 + i_2 \ge (n-1)$,
$i_0 - i_2 \le (n-1)$, $i_0 - i_2 \ge -(n-1)$, $i_0 + i_2 \le 3(n-1)$, $i_0 + i_1 \ge (n-1)$, $i_0 - i_1 \le (n-1)$,
$i_0 - i_1 \ge -(n-1)$, and $i_0 + i_1 \le 3(n-1)$; a tetrahedron of side $n$ is the intersection of a $(2n-1, 2)$-stencil
with the following four half-spaces: $i_0 + i_1 \ge (n-1)$, $i_0 - i_1 \ge (n-1)$, $i_1 + i_2 \le 2(n-1)$,
and $i_1 - i_2 \le 0$.
As shown in [14], a 3-dimensional array of side $n$ can be partitioned into 17 instances of (possibly
truncated) octahedra or tetrahedra of side $n$ (see Figure 6 of [14]). Our network-oblivious algorithm
exploits this partition and is specified on $M(n^2)$. It consists of 17 stages, where in each stage the VPs
take care of the evaluation of one polyhedron of the partition. We assume that at the beginning of
the algorithm the inputs are evenly distributed among the $n^2$ VPs, and also impose that the inputs
of each stage be evenly distributed among the VPs. The data movement required to guarantee the
correct input distribution for each stage can be accomplished in O(1) 0-supersteps, where each VP
sends/receives O(1) messages.
Let $k = 2^{\lceil \sqrt{\log n} \rceil}$. An octahedron of side $n$ can be partitioned into octahedra and tetrahedra
of side $n/k$ in $\log k$ steps, where the $i$-th such step, with $1 \le i \le \log k$, refines a partition of
the initial octahedron into octahedra or tetrahedra of side $n/2^{i-1}$ by decomposing each of these
polyhedra into smaller ones of side $n/2^i$, according to the scheme depicted in [14, Figure 5]. The
final partition is obtained at the end of the $\log k$-th step. The octahedra and tetrahedra of the
final partition can be grouped in horizontal stripes in such a way that the polyhedra of each stripe
can be evaluated in parallel. Consider first the set of octahedra of the partition. It can be seen
that the projection of these octahedra on the $(i_0, i_2)$-plane coincides with the decomposition of the
diamond DAG depicted in Figure 1. As a consequence, we can identify $2k-1$ horizontal stripes of
octahedra, where each stripe contains up to $k^2$ octahedra of side $n/k$. Moreover, the interleaving
of octahedra and tetrahedra in the basic decompositions of [14, Figure 5] implies that there is a
stripe of tetrahedra between each pair of consecutive stripes of octahedra. Hence, there are also
$(2k-1)-1$ horizontal stripes of tetrahedra, each containing up to $k^2$ tetrahedra of side $n/k$. Overall,
the octahedron of side $n$ is partitioned into $4k-3$ horizontal stripes of at most $k^2$ polyhedra of
side $n/k$ each, where stripes of octahedra are interleaved with stripes of tetrahedra. With a similar
argument one can derive a partition of a tetrahedron of side $n$ into $2k-1 \le 4k-3$ horizontal
stripes of at most $k^2$ polyhedra of side $n/k$ each, where stripes of octahedra are interleaved with
stripes of tetrahedra.
Having established the above preliminaries, the network-oblivious algorithm to evaluate a 3-dimensional
array of side $n$ on $M(n^2)$ closely follows the recursive strategy used for the
(n, 1)-stencil problem: the evaluation of an octahedron is accomplished in $4k-3$ non-overlapping
phases, in each of which the polyhedra (either octahedra or tetrahedra) of side $n/k$ in one horizontal
stripe of the partition described above are recursively evaluated in parallel by distinct $M(n^2/k^2)$
submachines formed by disjoint segments of consecutively numbered VPs; a tetrahedron of side
$n$ can be evaluated through a recursive strategy similar to the one for the octahedron within the
same complexity bounds. As usual, we add to each superstep O(1) dummy messages per VP to
guarantee $(\Theta(1), n^2)$-wiseness.
Theorem 6. The communication complexity of the above network-oblivious algorithm for the (n, 2)-stencil problem when executed on M(p, σ) is
$$H_{2\text{-stencil}}(n, p, \sigma) = O\left(\frac{n^2}{\sqrt{p}}\, 8^{\sqrt{\log n}}\right),$$
for every $1 < p \le n^2$ and $0 \le \sigma = O(n^2/p)$. The algorithm is $(\Theta(1), n^2)$-wise and $\Omega\left(1/8^{\sqrt{\log n}}\right)$-optimal
with respect to $C_2$ on any M(p, σ) with $1 < p \le n^2$ and $\sigma = O\left(n^2/p\right)$.
Proof. Let $H_{\text{octahedron}}(n, p, \sigma)$ be the communication complexity required by the recursive strategy
presented above for the evaluation of an octahedron of side $n$, when executed on M(p, σ). The
recursion depth of that strategy is $\tau = \lfloor \log_k n \rfloor$. First suppose that $p \le k^{2\tau}$. At every recursion
level $i$, with $0 \le i < \lceil (\log_k p)/2 \rceil$, the evaluation of each polyhedron of side $n_i = n/k^i$ is performed
by $p/k^{2i} > 1$ processors, while at every recursion level $i$, with $\lceil (\log_k p)/2 \rceil \le i \le \tau$, each polyhedron
of side $n/k^i$ is evaluated by a single processor of M(p, σ) and no communication takes place. Then,
by the previous discussion it follows that
$$
\begin{aligned}
H_{\text{octahedron}}(n, p, \sigma)
 &= O\left(\sum_{i=0}^{\lceil (\log_k p)/2 \rceil - 1} (4k-3)^{i+1} \left(\frac{n^2}{p} + \sigma\right)\right)
  = O\left((4k)^{(\log_k p)/2 + 1}\, \frac{n^2}{p}\right) \\
 &= O\left(\frac{n^2}{\sqrt{p}}\, 2^{\log_k p}\, k\right)
  = O\left(\frac{n^2}{\sqrt{p}}\, 4^{\sqrt{\log n}}\right),
\end{aligned}
$$
where we used the hypothesis $\sigma = O\left(n^2/p\right)$. Instead, when $k^{2\tau} < p \le n^2$, we have that at every
recursion level $i$, with $0 \le i \le \tau$, the evaluation of each polyhedron of side $n_i = n/k^i$ is performed
by $p/k^{2i} > 1$ processors. Then, since for $i = \tau$, polyhedra of side $n_\tau = n/k^\tau$ are evaluated
straightforwardly in $\Theta(n_\tau)$ supersteps, we obtain
$$
\begin{aligned}
H_{\text{octahedron}}(n, p, \sigma)
 &= O\left(\sum_{i=0}^{\tau - 1} (4k-3)^{i+1} \left(\frac{n^2}{p} + \sigma\right)\right)
  + O\left((4k-3)^{\tau}\, \frac{n}{k^{\tau}} \left(\frac{n^2}{p} + \sigma\right)\right) \\
 &= O\left((4k)^{\tau}\, \frac{n}{k^{\tau}}\, \frac{n^2}{p}\right)
  = O\left(\frac{n^2}{\sqrt{p}}\, 4^{\tau}\, k\right)
  = O\left(\frac{n^2}{\sqrt{p}}\, 8^{\sqrt{\log n}}\right),
\end{aligned}
$$
where we used the hypothesis $\sigma = O\left(n^2/p\right)$ and the inequalities $n/k^\tau < k$ and $k^{2\tau} < p$. Similar
upper bounds on the communication complexity can be proved for the evaluation of a tetrahedron
of side $n$ and for the evaluation of truncated octahedra or tetrahedra.
Recall that the algorithm for the (n, 2)-stencil problem consists of 17 stages, where in each stage
the VPs take care of the evaluation of one (possibly truncated) octahedron or tetrahedron of side
$n$, and that the data movement which ensures the correct input distribution for each stage can
be accomplished in O(1) 0-supersteps, where each VP sends/receives O(1) messages. This implies
that
$$H_{2\text{-stencil}}(n, p, \sigma) = O\left(\frac{n^2}{\sqrt{p}}\, 8^{\sqrt{\log n}}\right).$$
Since the strategies for the evaluation of (possibly truncated) octahedra or tetrahedra can be made
$(\Theta(1), n^2)$-wise, through the introduction of suitable dummy messages, the overall algorithm is also
$(\Theta(1), n^2)$-wise. Moreover, the algorithm complies with the requirements for belonging to $C_2$, hence
the claimed optimality is a consequence of Lemma 6.
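For the case $k^{2\tau} < p \le n^2$ in the proof of Theorem 6, the passage from $(4k)^{\tau}(n/k^{\tau})(n^2/p)$ to $(n^2/\sqrt{p})\, 8^{\sqrt{\log n}}$ can be unpacked as below. This is a sketch of the intermediate arithmetic, not part of the original proof; it relies only on the two inequalities quoted there and on $k = 2^{\lceil \sqrt{\log n} \rceil}$.

```latex
\begin{align*}
(4k)^{\tau}\, \frac{n}{k^{\tau}}\, \frac{n^2}{p}
  &= 4^{\tau} \cdot \frac{n}{k^{\tau}} \cdot \frac{k^{\tau}}{\sqrt{p}} \cdot \frac{n^2}{\sqrt{p}}
   \;<\; 4^{\tau}\, k\, \frac{n^2}{\sqrt{p}}
  && \text{since } \frac{n}{k^{\tau}} < k \text{ and } k^{\tau} < \sqrt{p}, \\
4^{\tau}\, k
  &\le 4^{\sqrt{\log n}} \cdot 2 \cdot 2^{\sqrt{\log n}}
   = 2 \cdot 8^{\sqrt{\log n}}
  && \text{since } \tau \le \sqrt{\log n} \text{ and } k \le 2 \cdot 2^{\sqrt{\log n}}.
\end{align*}
```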
Corollary 5. The above network-oblivious algorithm for the (n, 2)-stencil problem is $\Omega\left(1/8^{\sqrt{\log n}}\right)$-optimal
with respect to $C_2$ on any D-BSP(p, g, ℓ) machine with $1 < p \le n^2$, non-increasing $g_i$'s and
$\ell_i/g_i$'s, and $\ell_0/g_0 = O\left(n^2/p\right)$.
Proof. The corollary follows by Theorem 6 and by applying Theorem 1 with $p^\star = n^2$, $\sigma^m_i = 0$, and
$\sigma^M_i = \Theta\left(n^2/2^i\right)$.
4.5 Limitations of the Oblivious Approach
In this subsection, we establish a negative result by showing that for the broadcast problem, defined
below, a network-oblivious algorithm can achieve O(1)-optimality on M(p, σ) only for very limited
ranges of σ. Let V[0, . . . , n − 1] be a vector of n entries. The n-broadcast problem requires copying
the value V[0] into all other V[i]'s. Let C denote the class of static algorithms for the n-broadcast
problem such that any A ∈ C for q processing elements satisfies the following properties: (i) at least
εq processing elements hold entries of V, for some constant 0 < ε ≤ 1, and the distribution of the
entries of V among the processing elements cannot change during the execution of the algorithm;
(ii) all of the foldings of A on q′ processing elements, 2 ≤ q′ < q, also belong to C. The following
theorem establishes a lower bound on the communication complexity of the algorithms in C.
Theorem 7. The communication complexity of any n-broadcast algorithm in C when executed on
M(p, σ), with $1 < p \le n$ and $\sigma \ge 0$, is $\Omega\left(\max\{2, \sigma\} \log_{\max\{2,\sigma\}} p\right)$.
Proof. Let A be an algorithm in C. Suppose that the execution of A on M(p, σ) requires $t$
supersteps, and let $p_i$ denote the number of processors that "know" the value V[0] by the end
of the $i$-th superstep, for $1 \le i \le t$. Clearly, $p_0 = 1$ and $p_t \ge \epsilon p$, since by definition of C, at least $\epsilon p$
processors (with ε the constant in the definition of C) hold entries of V to be updated with the value V[0]. During the $i$-th superstep, $p_i - p_{i-1}$
new processors get to know V[0]. Since at the beginning of this superstep only $p_{i-1}$ processors
know the value, we conclude that the superstep involves an $h$-relation with $h \ge \lceil (p_i - p_{i-1})/p_{i-1} \rceil$.
Therefore, the communication complexity of A is
$$
H_A(n, p, \sigma) \ge \sum_{i=1}^{t} \left( \left\lceil \frac{p_i - p_{i-1}}{p_{i-1}} \right\rceil + \sigma \right)
 = \sum_{i=1}^{t} \left( \left\lceil \frac{p_i}{p_{i-1}} \right\rceil - 1 + \sigma \right).
$$
Assuming, without loss of generality, that the $p_i$'s are strictly increasing, we obtain
$$H_A(n, p, \sigma) = \Omega\left( t \max\{2, \sigma\} + \sum_{i=1}^{t} \frac{p_i}{p_{i-1}} \right).$$
Since $\prod_{i=1}^{t} p_i/p_{i-1} = p_t$, it follows that $\sum_{i=1}^{t} p_i/p_{i-1}$ is minimized for $p_i/p_{i-1} = (p_t)^{1/t} \ge (\epsilon p)^{1/t}$,
for $1 \le i \le t$. Hence,
$$H_A(n, p, \sigma) = \Omega\left( t \left( \max\{2, \sigma\} + p^{1/t} \right) \right). \qquad (6)$$
Standard calculus shows that the right-hand side is minimized (to within a constant factor) by
choosing $t = \Theta\left(\log_{\max\{2,\sigma\}} p\right)$, and the claim follows.
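The calculus step at the end of the proof of Theorem 7 can be spelled out as follows. This is a sketch of one possible argument, not taken from the original proof; it introduces the shorthand $m = \max\{2, \sigma\}$, $f(t) = t\,(m + p^{1/t})$ and $t_0 = \log_m p$, and treats $t$ as a positive real.

```latex
\begin{align*}
f(t_0) &= t_0\,\bigl(m + p^{1/t_0}\bigr) = 2m \log_m p
  && \text{since } p^{1/t_0} = m, \\
f(t) &\ge t\,m \ge m \log_m p
  && \text{for } t \ge t_0, \\
\frac{d}{dt}\bigl(t\,p^{1/t}\bigr) &= p^{1/t}\Bigl(1 - \frac{\ln p}{t}\Bigr) < 0
  && \text{for } 0 < t < \ln p, \\
f(t) &\ge t\,p^{1/t} \ge t_0\,p^{1/t_0} = m \log_m p
  && \text{for } t < t_0 \text{ and } m \ge e \ (\text{so } t_0 \le \ln p), \\
f(t) &\ge t\,p^{1/t} \ge e \ln p = e \ln m \cdot \log_m p \ge (\ln 2)\, m \log_m p
  && \text{for } t < t_0 \text{ and } 2 \le m < e.
\end{align*}
```

Under these assumptions $f(t) \ge (\ln 2)\, m \log_m p$ for every $t > 0$, while $f(t_0) = 2m \log_m p$, so the minimum is $\Theta(m \log_m p)$ and is attained, up to a constant factor, at $t = \log_m p$.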
The above lower bound is tight. Consider the following M(p, σ) algorithm for n-broadcast.
Let the entries of V be evenly distributed among the processors, with V[0] held by processor $P_0$.
For convenience we assume n is a power of two. Let κ be the smallest power of 2 greater than
or equal to $\max\{2, \sigma\}$. The algorithm consists of $\lceil \log_\kappa p \rceil$ supersteps: in the $i$-th superstep, with
$0 \le i < \lceil \log_\kappa p \rceil$, each $P_{jp/\kappa^i}$, with $0 \le j < \kappa^i$, sends the value V[0] to $P_{(\kappa j + \ell)p/\kappa^{i+1}}$, for each
$0 \le \ell < \kappa$. (When $\log_\kappa p$ is not an integer value, in the last superstep only values of $\ell$ that are
multiples of $\kappa^{i+1}/p$ are used.) It is immediate to see that the algorithm belongs to C and that its
communication complexity on M(p, σ) is
$$H_{\mathrm{broad}_\kappa}(n, p, \sigma) = O((\kappa + \sigma) \log_\kappa p) = O\left(\max\{2, \sigma\} \log_{\max\{2,\sigma\}} p\right).$$
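The σ-aware broadcast just described can be simulated directly. The following is a minimal sketch, assuming p is a power of two and tracking only which processors know V[0]; the function names `kappa_for` and `broadcast` are illustrative, and the cost charged to a superstep is the maximum number of messages sent by a processor plus σ (messages a processor would send to itself are not counted).

```python
def kappa_for(sigma: float) -> int:
    """Smallest power of two greater than or equal to max(2, sigma)."""
    kappa = 2
    while kappa < sigma:
        kappa *= 2
    return kappa


def broadcast(p: int, sigma: float):
    """Simulate the kappa-ary broadcast on p processors (p a power of two).

    Returns (informed, supersteps, cost), where cost is the sum over all
    supersteps of (h + sigma) and h is the superstep's degree.
    """
    kappa = kappa_for(sigma)
    steps = 0
    while kappa ** steps < p:  # steps = ceil(log_kappa p)
        steps += 1

    informed = {0}  # P_0 holds V[0] initially
    cost = 0.0
    for i in range(steps):
        degree = 0
        newly_informed = set()
        for j in range(kappa ** i):
            src = j * p // kappa ** i
            assert src in informed  # every sender already knows V[0]
            sent = 0
            for l in range(kappa):
                # In the last superstep, when log_kappa p is not an integer,
                # only multiples of kappa^(i+1)/p give valid processor indices.
                if (l * p) % kappa ** (i + 1) != 0:
                    continue
                dst = (kappa * j + l) * p // kappa ** (i + 1)
                if dst != src:
                    newly_informed.add(dst)
                    sent += 1
            degree = max(degree, sent)
        informed |= newly_informed
        cost += degree + sigma
    return informed, steps, cost


if __name__ == "__main__":
    for sigma in (0, 3, 16):
        known, steps, cost = broadcast(64, sigma)
        print(sigma, kappa_for(sigma), steps, cost, len(known) == 64)
```

Note that the number of supersteps changes with σ through κ, which is exactly the kind of adaptivity a network-oblivious algorithm cannot use.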
Therefore, the algorithm is O(1)-optimal. Observe that the algorithm is aware of parameter σ
and, in fact, this knowledge is crucial to achieve optimality. In order to see this, we prove that
any network-oblivious algorithm for n-broadcast can be Θ(1)-optimal on M(p, σ) only for limited
ranges of σ. Let $H^\star(n, p, \sigma)$ denote the best communication complexity achievable on M(p, σ) by
an algorithm for n-broadcast belonging to C. By the above discussion we know that $H^\star(n, p, \sigma) =
\Theta\left(\max\{2, \sigma\} \log_{\max\{2,\sigma\}} p\right)$. Let A ∈ C be a network-oblivious algorithm for n-broadcast specified
on M(v(n)). For every $1 < p \le v(n)$ and $0 \le \sigma_1 \le \sigma_2$, we define the maximum slowdown incurred
by A with respect to the best M(p, σ)-algorithm in C, for $\sigma \in [\sigma_1, \sigma_2]$, as
$$\mathrm{GAP}_A(n, p, \sigma_1, \sigma_2) = \max_{\sigma_1 \le \sigma \le \sigma_2} \left\{ \frac{H_A(n, p, \sigma)}{H^\star(n, p, \sigma)} \right\}.$$
Theorem 8. Let A ∈ C be a network-oblivious algorithm for n-broadcast specified on M(v(n)).
For every $1 < p \le v(n)$ and $0 \le \sigma_1 \le \sigma_2$, we have
$$\mathrm{GAP}_A(n, p, \sigma_1, \sigma_2) = \Omega\left( \frac{\log \max\{2, \sigma_2\}}{\log \max\{2, \sigma_1\} + \log \log \max\{2, \sigma_2\}} \right).$$
Proof. The definition of function GAP implies that
$$\mathrm{GAP}_A(n, p, \sigma_1, \sigma_2) = \Omega\left( \frac{H_A(n, p, \sigma_1)}{H^\star(n, p, \sigma_1)} + \frac{H_A(n, p, \sigma_2)}{H^\star(n, p, \sigma_2)} \right).$$
Let $t$ be the number of supersteps executed by the folding of A on M(p, σ), and note that, since A is
network-oblivious, this number cannot depend on σ. By arguing as in the proof of Theorem 7 (see
Inequality (6)) we get that $H_A(n, p, \sigma) = \Omega\left( t \left( \max\{2, \sigma\} + p^{1/t} \right) \right)$, for any σ, hence $\mathrm{GAP}_A(n, p, \sigma_1, \sigma_2)$
is bounded from below by
$$
\Omega\left( \frac{t \left( \max\{2, \sigma_1\} + p^{1/t} \right)}{\max\{2, \sigma_1\} \log_{\max\{2,\sigma_1\}} p}
 + \frac{t \left( \max\{2, \sigma_2\} + p^{1/t} \right)}{\max\{2, \sigma_2\} \log_{\max\{2,\sigma_2\}} p} \right),
$$
which is minimized for $t = \Theta(\log p/(\log \max\{2, \sigma_1\} + \log \log \max\{2, \sigma_2\}))$. Substituting this value
of $t$ in the above formula yields the stated result.
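To get a feeling for the bound in the proof of Theorem 8, one can evaluate the expression inside the Ω(·) numerically and minimize it over the number of supersteps t. The sketch below does this with all hidden constants set to 1, which is an assumption made purely for illustration; the function names are hypothetical and the resulting numbers indicate trends only, not actual slowdowns.

```python
import math


def gap_expression(p: int, sigma1: float, sigma2: float, t: int) -> float:
    """Expression bounding GAP_A from below in the proof of Theorem 8,
    for a fixed number t of supersteps, with hidden constants set to 1."""
    m1 = max(2.0, sigma1)
    m2 = max(2.0, sigma2)
    root = p ** (1.0 / t)
    term1 = t * (m1 + root) / (m1 * math.log(p, m1))
    term2 = t * (m2 + root) / (m2 * math.log(p, m2))
    return term1 + term2


def best_oblivious_gap(p: int, sigma1: float, sigma2: float):
    """Minimize the expression over integer t; returns (value, best t).

    The search range 1..4*ceil(log2 p) is an arbitrary illustrative cap.
    """
    max_t = 4 * max(1, math.ceil(math.log2(p)))
    return min(
        (gap_expression(p, sigma1, sigma2, t), t) for t in range(1, max_t + 1)
    )


if __name__ == "__main__":
    p = 2 ** 20
    for sigma2 in (2, 2 ** 5, 2 ** 10, 2 ** 16):
        value, t = best_oblivious_gap(p, 2, sigma2)
        print(f"sigma2={sigma2:6d}  best t={t:3d}  expression={value:.2f}")
```

A single choice of t has to serve both σ1 and σ2, so widening the interval [σ1, σ2] pushes the minimized expression up, in line with the statement of the theorem.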
An immediate consequence of the above theorem is that if a network-oblivious algorithm for
n-broadcast is Θ(1)-optimal on M(p, σ) it cannot be simultaneously Θ(1)-optimal on an M(p, σ′),
for any σ′ sufficiently larger than σ. A similar limitation of the optimality of a network-oblivious
algorithm for n-broadcast can be argued with respect to its execution on D-BSP(p, g, `).
5 Extension to the Optimality Theorem
The optimality theorem of Section 3 makes crucial use of the wiseness property. Broadly speaking, a
network-oblivious algorithm is (Θ(1), p)-wise when the communication performed in the various supersteps
is somewhat balanced in the sense that the maximum number of messages sent/received by
a virtual processor does not differ significantly from the average number of messages sent/received
by other virtual processors belonging to the same region of suitable size. While there exist (Θ(1), p)-wise
network-oblivious algorithms for a number of important problems, as shown in Section 4, there
are cases where wiseness may not be guaranteed.
As a simple example of poor wiseness, consider a network-oblivious algorithm A for M(n)
consisting of one 0-superstep where $\mathrm{VP}_0$ sends $n$ messages to $\mathrm{VP}_{n/2}$. Fix $p$ with $2 \le p \le n$.
Clearly, for each $1 \le j \le \log p$ we have that $H_A(n, 2^j, 0) = n$, hence the algorithm is (α, p)-wise
only for $\alpha = O(1/p)$. When executed on a D-BSP$(p, g, \mathbf{0})$, the communication time of the algorithm
is $n g_0$. However, as already observed in [11], under reasonable assumptions the communication
time of the algorithm’s execution on the D-BSP can be improved by first evenly spreading the n
messages among clusters of increasingly larger size which include the sender, and then gathering
the messages within clusters of increasingly smaller size which include the receiver. Motivated by
this observation, we introduce a more effective protocol to execute network-oblivious algorithms
on the D-BSP. By employing this protocol we are able to prove an alternative optimality theorem
which requires a much weaker property than wiseness, at the expense of a slight (polylogarithmic)
loss of efficiency.
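The imbalance in the example above can be checked by computing the degree of the single superstep when the algorithm is folded on p processors. The sketch below is illustrative only: the helper name `folded_degree` and the message encoding `(source VP, destination VP, count)` are assumptions, and the balanced pattern used for comparison (every VP sends one message to the VP at distance n/2) is a made-up contrast case, not an algorithm from the paper.

```python
from collections import Counter


def folded_degree(n: int, p: int, messages) -> int:
    """Degree h of one superstep of an M(n) algorithm folded on p processors.

    Processor q hosts the n/p consecutively numbered VPs starting at q*(n/p).
    Only messages that cross processor boundaries are counted.
    """
    block = n // p
    sent, received = Counter(), Counter()
    for src_vp, dst_vp, count in messages:
        src_proc, dst_proc = src_vp // block, dst_vp // block
        if src_proc != dst_proc:
            sent[src_proc] += count
            received[dst_proc] += count
    return max(max(sent.values(), default=0), max(received.values(), default=0))


if __name__ == "__main__":
    n = 16
    skewed = [(0, n // 2, n)]                               # VP_0 sends n messages to VP_{n/2}
    balanced = [(v, (v + n // 2) % n, 1) for v in range(n)]  # n messages, evenly spread
    for p in (2, 4, 8, 16):
        print(p, folded_degree(n, p, skewed), folded_degree(n, p, balanced))
```

Both patterns move n messages in total, but the skewed one keeps degree n for every p, whereas the balanced one scales as n/p.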
Let A be a network-oblivious algorithm specified on M(v(n)), and consider its execution on
a D-BSP(p, g, ℓ), with $1 \le p \le v(n)$. As before, each D-BSP processor $P_j$, with $0 \le j < p$,
carries out the operations of the $v(n)/p$ consecutively numbered VPs of M(v(n)) starting with
$\mathrm{VP}_{j(v(n)/p)}$. However, the communication required by each superstep is now performed on D-BSP
more effectively by enforcing a suitable balancing. More precisely, each $i$-superstep $s$ of A, with
$0 \le i < \log p$, is executed on the D-BSP through the following protocol, which we will refer to as
the novel protocol:
1. (Computation phase) Each D-BSP processor performs the local computations of its assigned
virtual processors.
2. (Ascend phase) For $k = \log p - 1$ downto $i + 1$: within each $k$-cluster $\Gamma_k$, the messages which
originate in $\Gamma_k$ but are destined outside $\Gamma_k$ are evenly distributed among the $p/2^k$ processors
of $\Gamma_k$.
3. (Descend phase) For $k = i$ to $\log p - 1$: within each $k$-cluster $\Gamma_k$, the messages currently
residing in $\Gamma_k$ are evenly distributed among the processors of the $(k + 1)$-clusters inside $\Gamma_k$
which contain their final destinations.
Observe that each iteration of the ascend/descend phases requires a prefix-like computation to as-
sign suitable intermediate destinations to the messages in order to guarantee their even distribution
in the appropriate clusters.
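The order in which cluster levels are visited by the ascend and descend phases can be listed with a few lines of code. This is a minimal sketch of the iteration schedule only (it does not move any messages); the function name `protocol_schedule` and the tuple layout `(phase, k, processors per k-cluster)` are illustrative assumptions, and p is assumed to be a power of two.

```python
def protocol_schedule(p: int, i: int):
    """Iterations performed by the protocol for one i-superstep on p processors.

    Returns a list of (phase, k, cluster_size) tuples, where cluster_size is
    p / 2^k, the number of processors in a k-cluster.
    """
    log_p = p.bit_length() - 1
    assert (1 << log_p) == p, "p must be a power of two"
    assert 0 <= i < log_p, "superstep label must satisfy 0 <= i < log p"

    # Ascend phase: k = log p - 1 down to i + 1.
    ascend = [("ascend", k, p >> k) for k in range(log_p - 1, i, -1)]
    # Descend phase: k = i up to log p - 1.
    descend = [("descend", k, p >> k) for k in range(i, log_p)]
    return ascend + descend


if __name__ == "__main__":
    for phase, k, size in protocol_schedule(8, 0):
        print(f"{phase:8s} k={k} cluster size={size}")
```

For an i-superstep the schedule has (log p − 1 − i) ascend iterations and (log p − i) descend iterations, each of which also needs the prefix-like computation mentioned above.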
Lemma 7. Let A be a network-oblivious algorithm specified on M(v(n)) and consider its execution
on D-BSP(p, g, ℓ), with $1 < p \le v(n)$, using the novel protocol. Let $\xi_s$ be the sequence of supersteps
employed by the protocol for executing an $i$-superstep $s$ of A, where $0 \le i < \log p$. Then, for
every $i < k < \log p$, $\xi_s$ comprises O(1) $k$-supersteps of degree $O\left(2^k h^s_A(n, 2^k)/p\right)$, and $O(\log p)$
$k$-supersteps each of constant degree.
Proof. Consider iteration $k$ of the ascend phase of the protocol, with $i + 1 \le k \le \log p - 1$, and a
$k$-cluster $\Gamma_k$. As invariant at the beginning of the iteration, we have that the at most $h^s_A(n, 2^{k+1})$
messages originating in each $(k+1)$-cluster $\Gamma'$ included in $\Gamma_k$ and destined outside $\Gamma_k$ are evenly
distributed among the processors of $\Gamma'$. Hence, the even distribution of these messages among the
$p/2^k$ processors of $\Gamma_k$ requires a prefix-like computation and an $O\left(\lceil 2^{k+1} h^s_A(n, 2^{k+1})/p \rceil\right)$-relation
within $\Gamma_k$. Consider now iteration $k$ of the descend phase of the protocol, with $i \le k \le \log p - 1$,
and a $k$-cluster $\Gamma_k$. As invariant at the beginning of the iteration, we have that the at most
$2 h^s_A(n, 2^{k+1})$ messages to be moved in the iteration are evenly distributed among the processors of
$\Gamma_k$. Since each $(k + 1)$-cluster included in $\Gamma_k$ receives at most $h^s_A(n, 2^{k+1})$ messages, the iteration
requires a prefix-like computation and an $O\left(\lceil 2^{k+1} h^s_A(n, 2^{k+1})/p \rceil\right)$-relation within $\Gamma_k$. The lemma
follows, since each prefix-like computation in a $k$-cluster can be performed in $O(\log p)$ $k$-supersteps
of constant degree (e.g., using a straightforward tree-based strategy [33]).
We now define the notion of fullness which is weaker than wiseness but which still allows us to
port the optimality of network-oblivious algorithms with respect to the evaluation model onto the
execution machine model, although incurring some loss of efficiency.
Definition 4. A static network-oblivious algorithm A specified on M(v(n)) is said to be (γ, p)-full,
for some $\gamma > 0$ and $1 < p \le v(n)$, if considering the folding of A on $M(2^j, 0)$ we have
$$\sum_{i=0}^{j-1} F^i_A\left(n, 2^j\right) \ge \gamma\, \frac{p}{2^j} \sum_{i=0}^{j-1} S^i_A(n),$$
for every $1 \le j \le \log p$ and input size $n$.
It is easy to see that a (Θ(1), p)-wise network-oblivious algorithm A is also (Θ(1), p)-full as
long as $h^s_A(n, p) \ge 1$, for every $i$-superstep $s$ of A and every $1 < p \le v(n)$. On the other hand, a
(Θ(1), p)-full algorithm is not necessarily (Θ(1), p)-wise, as witnessed by the previously mentioned
network-oblivious algorithm consisting of a single 0-superstep where $\mathrm{VP}_0$ sends $n$ messages to
$\mathrm{VP}_{n/2}$, which is (Θ(1), p)-full but not (Θ(1), p)-wise, for any $2 \le p \le n$. In this sense, (γ, p)-fullness
is a weaker condition than (Θ(1), p)-wiseness.
The following theorem shows that when (γ, p)-full algorithms are executed on the D-BSP using
the novel protocol, optimality in the evaluation model is preserved on the D-BSP within a polylog-
arithmic factor. As in Section 3, let C denote a given class of static algorithms solving a problem
Π, with the property that for any algorithm A ∈ C , all its foldings on p ≥ 2 processing elements
also belong to C .
Theorem 9. Let A ∈ C be a $(\gamma, p^\star)$-full network-oblivious algorithm for some $\gamma > 0$ and a power of
two $p^\star$. Let also $\{\sigma^m_0, \ldots, \sigma^m_{\log p^\star - 1}\}$ and $\{\sigma^M_0, \ldots, \sigma^M_{\log p^\star - 1}\}$ be two vectors of non-negative values,
with $\sigma^m_j \le \sigma^M_j$, for every $0 \le j < \log p^\star$. If A is β-optimal on $M(2^j, \sigma)$ w.r.t. C, for $\sigma^m_{j-1} \le \sigma \le
\sigma^M_{j-1}$ and $1 \le j \le \log p^\star$, then, for every power of two $p \le p^\star$, A is $\Theta\left(\beta/((1 + 1/\gamma) \log^2 p)\right)$-optimal
on D-BSP(p, g, ℓ) w.r.t. C when executed with the above protocol, as long as:
• the execution of A on D-BSP(p, g, ℓ) using the above protocol is in C;
• $g_i \ge g_{i+1}$ and $\ell_i/g_i \ge \ell_{i+1}/g_{i+1}$, for $0 \le i < \log p - 1$;
• $\max_{1 \le k \le \log p}\{\sigma^m_{k-1} 2^k/p\} \le \ell_i/g_i \le \min_{1 \le k \le \log p}\{\sigma^M_{k-1} 2^k/p\}$, for $0 \le i < \log p$.
Proof. Consider the execution of A on a D-BSP(p, g, ℓ) using the novel protocol. Let $\tilde{A}$ denote the
actual sequence of supersteps performed on the D-BSP in this execution of A. Note that once the
D-BSP parameters are fixed, $\tilde{A}$ can be regarded as a network-oblivious algorithm specified on M(p).
Clearly, any optimality considerations on the communication time of the execution of $\tilde{A}$ (regarded
as a network-oblivious algorithm) on D-BSP(p, g, ℓ) using the standard protocol will also apply to
the communication time of the execution of A on D-BSP(p, g, ℓ) using the novel protocol, since
the two communication times are the same.
We will assess the degree of optimality of the communication time of $\tilde{A}$ by resorting to Theorem 1.
This entails analyzing the communication complexity of $\tilde{A}$ on $M(2^j, \sigma)$, for any $1 \le j \le \log p$,
and determining its wiseness. Focus on $M(2^j, \sigma)$ for some $1 \le j \le \log p$, and consider an arbitrary
$i$-superstep $s$ of A, for some $0 \le i < j$. Let $\xi_s$ be the sequence of supersteps in $\tilde{A}$ executed in
the ascend and descend phases associated with superstep $s$. From Lemma 7, we know that for
every $i < k < \log p$, $\xi_s$ comprises O(1) $k$-supersteps of degree $O\left(2^k h^s_A(n, 2^k)/p\right)$, and $O(\log p)$
$k$-supersteps each of constant degree. Now, in the execution on $M(2^j, \sigma)$ a $k$-superstep with $k \ge j$
becomes local to the processors and does not contribute to the communication complexity. Since
each processor of $M(2^j, \sigma)$ corresponds to $p/2^j$ processors of M(p), the communication complexity
on $M(2^j, \sigma)$ contributed by the sequence $\xi_s$ is
$$O\left( \sum_{k=i+1}^{j-1} \left( \frac{p}{2^j} \left( \frac{2^k}{p}\, h^s_A(n, 2^k) + \log p \right) + \sigma \log p \right) \right).$$
Therefore, since $h^s_A(n, 2^k) \le 2^{j-k} h^s_A(n, 2^j)$, the above summation is upper bounded by
$$O\left( \sum_{k=i+1}^{j-1} \left( h^s_A(n, 2^j) + \frac{p \log p}{2^j} + \sigma \log p \right) \right)
 = O\left( \left( h^s_A(n, 2^j) + \frac{p}{2^j} + \sigma \right) \log^2 p \right).$$
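Recall that $L^i_A(n)$ denotes the set of $i$-supersteps executed by A, and $S^i_A(n) = |L^i_A(n)|$. Thus, the
communication complexity of $\tilde{A}$ on $M(2^j, \sigma)$ can be written as
$$
\begin{aligned}
H_{\tilde{A}}(n, 2^j, \sigma)
 &= O\left( \sum_{i=0}^{j-1} \sum_{s \in L^i_A(n)} \left( h^s_A(n, 2^j) + \frac{p}{2^j} + \sigma \right) \log^2 p \right) \\
 &= O\left( \log^2 p \left( \sum_{i=0}^{j-1} \sum_{s \in L^i_A(n)} \left( h^s_A(n, 2^j) + \sigma \right)
     + \sum_{i=0}^{j-1} \sum_{s \in L^i_A(n)} \frac{p}{2^j} \right) \right) \\
 &= O\left( \log^2 p \left( H_A(n, 2^j, \sigma) + \frac{p}{2^j} \sum_{i=0}^{j-1} S^i_A(n) \right) \right) \\
 &= O\left( (1 + 1/\gamma) \log^2 p \cdot H_A(n, 2^j, \sigma) \right),
\end{aligned}
$$
where the last inequality follows by the $(\gamma, p^\star)$-fullness of A.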
The above inequality shows that algorithm $\tilde{A}$ is $\beta/((1 + 1/\gamma) \log^2 p)$-optimal as a consequence
of the β-optimality of A. Let us now assess the wiseness of $\tilde{A}$. Consider again the sequence $\xi_s$
of supersteps of $\tilde{A}$ associated with an arbitrary $i$-superstep $s$ of A, for some $0 \le i < \log p$. We
know that for every $i < k < \log p$, $\xi_s$ comprises O(1) $k$-supersteps of degree $O\left(2^k h^s_A(n, 2^k)/p\right)$,
and $O(\log p)$ $k$-supersteps each of constant degree. Moreover, we can assume that suitable dummy
messages are added so that in a $k$-superstep of degree $O\left(2^k h^s_A(n, 2^k)/p\right)$ (resp., degree O(1)) all
processors of a $(k + 1)$-cluster send $\Theta\left(2^k h^s_A(n, 2^k)/p\right)$ (resp., Θ(1)) messages to the sibling $(k + 1)$-cluster
included in the same $k$-cluster. It is easy to see that the above considerations about the
optimality of $\tilde{A}$ remain unchanged, while $\tilde{A}$ becomes (Θ(1), p)-wise. Finally, we recall that $\tilde{A}$
belongs to class C by hypothesis, and this is so even forcing it into being wise. Therefore, by
applying Theorem 1 to $\tilde{A}$, we can conclude that $\tilde{A}$, hence A, is $\Theta(\beta/((1 + 1/\gamma) \log^2 p))$-optimal on
a D-BSP(p, g, ℓ) with parameters satisfying the initial hypotheses.
We conclude this section by observing that the relation stated by the above theorem between
optimality in the evaluation model and optimality in D-BSP can be tightened when the $g_i$ and
$\ell_i$ parameters of the D-BSP decrease geometrically. In this case, it is known that a prefix-like
computation within a $k$-cluster, for $0 \le k < \log p$, can be performed in $O(g_k + \ell_k)$ communication
time (e.g., see [11, Proposition 2.2.2]). Then, by an argument similar to the one used to prove Theorem 9, it
can be shown that a (γ, p)-full algorithm A which is β-optimal in the evaluation model becomes
$\Theta(\beta/((1 + 1/\gamma) \log p))$-optimal when executed on the D-BSP, thus reducing by a factor $\log p$ the
gap between the two optimality factors.
6 Conclusions
We introduced a framework to explore the design of network-oblivious algorithms, that is, al-
gorithms which run efficiently on machines with different processing power and different band-
width/latency characteristics, without making explicit use of architectural parameters for tuning
performance. In the framework, a network-oblivious algorithm is written for v(n) virtual processors
(specification model), where n is the input size and v(·) a suitable function. Then, the performance
of the algorithm is analyzed in a simple model (evaluation model) consisting of p ≤ v(n) proces-
sors and where the impact of the network topology on communication costs is accounted for by a
latency parameter σ. Finally, the algorithm is executed on the D-BSP model [25, 11] (execution
machine model), which describes reasonably well the behavior of a large class of point-to-point
networks by capturing their hierarchical structures. A D-BSP consists of p ≤ v(n) processors and
its network topology is described by the log p-size vectors g and ℓ, which account for bandwidth
and latency costs within nested clusters, respectively. We have shown that for static network-
oblivious algorithms, where the communication requirements depend only on the input size and
not on the specific input instance (e.g., algorithms arising in DAG computations), the optimality on
the evaluation model for certain ranges of p and σ translates into optimality on the D-BSP model
for corresponding ranges of the model’s parameters. This result justifies the introduction of the
evaluation model that allows for a simple analysis of network-oblivious algorithms while effectively
bridging the performance analysis to D-BSP which more accurately models the communication
infrastructure of parallel platforms through a logarithmic number of parameters.
We devised Θ(1)-optimal static network-oblivious algorithms for prominent problems such as
matrix multiplication, FFT, and sorting, although in the case of sorting optimality is achieved
only when the available parallelism is polynomially sublinear in the input size. Also, we devised
suboptimal, yet efficient, network-oblivious algorithms for stencil computations, and we explored
limitations of the oblivious approach by showing that for the broadcasting problem optimality in
D-BSP can be achieved by a network-oblivious algorithm only for rather limited ranges of the
parameters. Similar negative results were also proved in the realm of cache-oblivious algorithms
(e.g., see [8, 18, 41, 42]). Despite these limitations, the pursuit of oblivious algorithms appears
worthwhile even when the outcome is a proof that no such algorithm can be Θ(1)-optimal on an
ample class of target machines. Indeed, the analysis behind such a result is likely to reveal what
kind of adaptivity to the target machine is necessary to obtain optimal performance.
The present work can be naturally extended in several directions, some of which are briefly
outlined below. First, it would be useful to further assess the effectiveness of our framework by
developing novel efficient network-oblivious algorithms for prominent problems beyond the ones of
this paper. Some progress in this direction has been made in [20, 26]. For the problems considered
here, in particular sorting and stencil computations, it would be very interesting to investigate
the potential of the network-oblivious approach more fully. More generally, it would be
interesting to develop lower-bound techniques to limit the level of optimality that network-oblivious
algorithms can reach on certain classes of target platforms. Another challenging goal concerns the
generalization of the results of Theorems 1 and 9 to algorithms for a wider class of problems,
and the weakening of the assumptions (wiseness or fullness) required to prove these results. It
would also be useful to identify other classes of machines, for which network-oblivious algorithms
can be effective. Another natural question to investigate is how to determine an efficient virtual-
to-physical processor mapping for a network with arbitrary topology. Again, it would be very
interesting to generalize our work to apply to computing scenarios, such as traditional time-shared
systems as well as emerging global computing environments, where the amount of resources devoted
to a specific application can itself vary dynamically over time, in the same spirit as [6] generalized
the cache-oblivious framework to environments in which the amount of memory available to an
algorithm can fluctuate. Finally, a somewhat ultimate goal is represented by the integration of
cache- and network-obliviousness in a unified framework for the development of machine-oblivious
computations, as recently done in [20] for shared-memory platforms.
Acknowledgments. The authors would like to thank Vijaya Ramachandran for helpful discus-
sions.
References
[1] Alok Aggarwal, Bowen Alpern, Ashok K. Chandra, and Marc Snir. A model for hierarchical
memory. In Proceedings of the 19th ACM Symposium on Theory of Computing (STOC), pages
305–314, 1987.
[2] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Hierarchical memory with block transfer.
In Proceedings of the 28th IEEE Symposium on Foundations of Computer Science (FOCS),
pages 204–216, 1987.
[3] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs.
Theoretical Computer Science, 71:3–28, 1990.
[4] Alok Aggarwal and Jeffrey S. Vitter. The input/output complexity of sorting and related
problems. Communications of the ACM, 31(9):1116–1127, 1988.
[5] Armin Bäumker, Wolfgang Dittrich, and Friedhelm Meyer auf der Heide. Truly efficient parallel
algorithms: 1-optimal multisearch for an extension of the BSP model. Theoretical Computer
Science, 203(2):175–203, 1998.
[6] Michael A. Bender, Roozbeh Ebrahimi, Jeremy T. Fineman, Golnaz Ghasemiesfeh, Rob John-
son, and Samuel McCauley. Cache-adaptive algorithms. In Proceedings of the 25th annual
ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 958–971, 2014.
[7] Sandeep N. Bhatt, Gianfranco Bilardi, and Geppino Pucci. Area-time tradeoffs for universal
VLSI circuits. Theoretical Computer Science, 408(2-3):143–150, 2008.
[8] Gianfranco Bilardi and Enoch Peserico. A characterization of temporal locality and its porta-
bility across memory hierarchies. In Proceedings of the 28th International Colloquium on
Automata, Languages and Programming (ICALP), pages 128–139, 2001.
[9] Gianfranco Bilardi and Andrea Pietracaprina. Theoretical models of computation. In David A.
Padua, editor, Encyclopedia of Parallel Computing, pages 1150–1158. Springer, 2011.
[10] Gianfranco Bilardi, Andrea Pietracaprina, and Geppino Pucci. A quantitative measure of
portability with application to bandwidth-latency models for parallel computing. In Proceed-
ings of the 5th International Euro-Par Conference on Parallel Processing (Euro-Par), pages
543–551, 1999.
[11] Gianfranco Bilardi, Andrea Pietracaprina, and Geppino Pucci. Decomposable BSP: A
bandwidth-latency model for parallel and hierarchical computation. In John Reif and
Sanguthevar Rajasekaran, editors, Handbook of Parallel Computing: Models, Algorithms and
Applications, pages 277–315. CRC Press, 2007.
[12] Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci, and Francesco Silvestri. Network-
oblivious algorithms. In Proceedings of the 21st IEEE International Parallel and Distributed
Processing Symposium (IPDPS), pages 1–10, 2007.
[13] Gianfranco Bilardi and Franco Preparata. Horizons of parallel computation. Journal of Parallel
and Distributed Computing, 27(2):172–182, 1995.
[14] Gianfranco Bilardi and Franco Preparata. Processor-time tradeoffs under bounded-speed mes-
sage propagation: Part I, upper bounds. Theory of Computing Systems, 30(6):523–546, 1997.
[15] Gianfranco Bilardi and Franco Preparata. Processor-time tradeoffs under bounded-speed mes-
sage propagation: Part II, lower bounds. Theory of Computing Systems, 32(5):531–559, 1999.
[16] Gianfranco Bilardi and Geppino Pucci. Universality in VLSI computation. In David A. Padua,
editor, Encyclopedia of Parallel Computing, pages 2112–2118. Springer, 2011.
[17] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Harsha Vardhan Simhadri.
Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 23rd
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 355–366,
2011.
[18] Gerth S. Brodal and Rolf Fagerberg. On the limits of cache-obliviousness. In Proceedings of
the 35th ACM Symposium on Theory of Computing (STOC), pages 307–315, 2003.
[19] Rezaul A. Chowdhury and Vijaya Ramachandran. Cache-efficient dynamic programming algo-
rithms for multicores. In Proceedings of the 20th ACM Symposium on Parallelism in Algorithms
and Architectures (SPAA), pages 207–216, 2008.
[20] Rezaul A. Chowdhury, Vijaya Ramachandran, Francesco Silvestri, and Brandon Blakeley.
Oblivious algorithms for multicores and networks of processors. Journal of Parallel and Dis-
tributed Computing, 73(7):911–925, 2013.
[21] Richard Cole and Vijaya Ramachandran. Resource oblivious sorting on multicores. In Pro-
ceedings of the 37th International Colloquium on Automata, Languages and Programming
(ICALP), pages 226–237, 2010.
[22] Richard Cole and Vijaya Ramachandran. Efficient resource oblivious algorithms for multicores
with false sharing. In Proceedings of the 26th IEEE International Parallel and Distributed
Processing Symposium (IPDPS), pages 201–214, 2012.
[23] Richard Cole and Vijaya Ramachandran. Revisiting the cache miss analysis of multithreaded
algorithms. In Proceedings of the 10th Latin American Theoretical Informatics Symposium
(LATIN), pages 172–183, 2012.
[24] David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Eunice E. Santos,
Klaus E. Schauser, Ramesh Subramonian, and Thorsten von Eicken. LogP: A practical model
of parallel computation. Communications of the ACM, 39(11):78–85, 1996.
[25] Pilar de la Torre and Clyde P. Kruskal. Submachine locality in the bulk synchronous setting. In
Proceedings of the 2nd International Euro-Par Conference on Parallel Processing (Euro-Par),
pages 352–358, 1996.
[26] James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Ben Lipshitz, Oded Schwartz, and
Omer Spillinger. Communication-optimal parallel recursive rectangular matrix multiplication.
In Proceedings of the 27th IEEE International Parallel and Distributed Processing Symposium
(IPDPS), pages 261–272, 2013.
[27] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-
oblivious algorithms. ACM Transactions on Algorithms, 8(1), 2012.
[28] Matteo Frigo and Volker Strumpen. Cache oblivious stencil computations. In Proceedings of
the 19th International Conference on Supercomputing (ICS), pages 361–366, 2005.
[29] Matteo Frigo and Volker Strumpen. The cache complexity of multithreaded cache oblivious
algorithms. Theory of Computing Systems, 45(2):203–233, 2009.
[30] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. Can a shared-memory model
serve as a bridging model for parallel computation? Theory of Computing Systems, 32(3):327–
359, 1999.
[31] Kieran T. Herley. Network obliviousness. In David A. Padua, editor, Encyclopedia of Parallel
Computing, pages 1298–1303. Springer, 2011.
[32] Dror Irony, Sivan Toledo, and Alexandre Tiskin. Communication lower bounds for distributed-
memory matrix multiplication. Journal of Parallel and Distributed Computing, 64(9):1017–
1026, 2004.
[33] Joseph JáJá. An Introduction to Parallel Algorithms. Addison-Wesley Longman Publishing
Co., Inc., 1992.
[34] Ben H. H. Juurlink and Harry A. G. Wijshoff. A quantitative comparison of parallel compu-
tation models. ACM Transactions on Computer Systems, 16(3):271–318, 1998.
[35] Leslie Robert Kerr. The Effect of Algebraic Structure on the Computational Complexity of
Matrix Multiplication. PhD thesis, Cornell University, 1970.
[36] Frank T. Leighton. Tight bounds on the complexity of parallel sorting. IEEE Transactions on
Computers, 34(4):344–354, 1985.
[37] Frank T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees,
Hypercubes. Morgan Kaufmann Publishers Inc., 1992.
[38] Charles E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing.
IEEE Transactions on Computers, 34(10):892–901, October 1985.
[39] John E. Savage. Models of Computation: Exploring the Power of Computing. Addison-Wesley
Longman Publishing Co., Inc., 1998.
[40] Michele Scquizzato and Francesco Silvestri. Communication lower bounds for distributed-
memory computations. In Proceedings of the 31st Symposium on Theoretical Aspects of Com-
puter Science (STACS), pages 627–638, 2014.
[41] Francesco Silvestri. On the limits of cache-oblivious matrix transposition. In Proceedings of
the 2nd Symposium on Trustworthy Global Computing (TGC), pages 233–243, 2006.
[42] Francesco Silvestri. On the limits of cache-oblivious rational permutations. Theoretical Com-
puter Science, 402(2-3):221–233, 2008.
[43] Alexandre Tiskin. The bulk-synchronous parallel random access machine. Theoretical Com-
puter Science, 196(1-2):109–130, 1998.
[44] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM,
33(8):103–111, 1990.
[45] Leslie G. Valiant. A bridging model for multi-core computing. Journal of Computer and
System Sciences, 77(1):154–166, 2011.
