A framework is proposed for the design and analysis of network-oblivious algorithms, namely algorithms that can run unchanged, yet efficiently, on a variety of machines characterized by different degrees of parallelism and communication capabilities. The framework prescribes that a network-oblivious algorithm be specified on a parallel model of computation where the only parameter is the problem's input size, and then evaluated on a model with two parameters, capturing parallelism granularity and communication latency. It is shown that for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in the decomposable bulk synchronous parallel model, which is known to effectively describe a wide and significant class of parallel platforms. The proposed framework can be regarded as an attempt to port the notion of obliviousness, well established in the context of cache hierarchies, to the realm of parallel computation. Its effectiveness is illustrated by providing optimal network-oblivious algorithms for a number of key problems. Some limitations of the oblivious approach are also discussed.
INTRODUCTION
Communication plays a major role in determining the performance of algorithms on current computing systems and has a considerable impact on energy consumption. Since the relevance of communication increases with the size of the system, it is expected to play an even greater role in the future. Motivated by this scenario, a large body of results has been devised concerning the design and analysis of communication-efficient algorithms. Although often useful and deep, these results do not yet provide a coherent and unified theory of the communication requirements of computations. One major obstacle toward such a theory lies in the fact that, prima facie, communication is defined only with respect to a specific mapping of a computation onto a specific machine structure. Furthermore, the impact of communication on performance depends on the latency and bandwidth properties of the channels connecting different parts of the target machine. Hence, the design, optimization, and analysis of algorithms can become highly machine dependent, which is undesirable from the economic perspective of developing efficient and portable software. The outlined situation has been widely recognized, and a number of approaches have been proposed to solve it or to mitigate it.
On one end of the spectrum, we have the parallel slackness approach, based on the assumption that as long as a sufficient amount of parallelism is exhibited, general and automatic latency-hiding techniques can be deployed to achieve an efficient execution. Broadly speaking, the required algorithmic parallelism should be at least proportional to the product of the number of processing units and the worst-case latency of the target machine [Valiant 1990]. Further assuming that this amount of parallelism is available in computations of practical interest, algorithm design can dispense altogether with communication concerns and focus on the maximization of parallelism. The functional/dataflow and the PRAM models of computation have often been supported with similar arguments. Unfortunately, as argued in Bilardi and Preparata [1995, 1997], latency hiding is not a scalable technique due to fundamental physical constraints (namely, upper bounds to the speed of messages and lower bounds to the size of devices). Hence, parallel slackness does not really solve the communication problem. (Nevertheless, functional and PRAM models are quite valuable and have significantly contributed to the understanding of other dimensions of computing.)
On the other end of the spectrum, we could place the universality approach, whose objective is the development of machines (nearly) as efficient as any other machine of (nearly) the same cost, at executing any computation (e.g., see Leiserson [1985], Bilardi and Preparata [1995], Bhatt et al. [2008], and Bilardi and Pucci [2011]). To the extent that a universal machine with very small performance and cost gaps could be identified, one could adopt a model of computation sufficiently descriptive of such a machine and focus most of the algorithmic effort on this model. As technology approaches the inherent physical limitations to information processing, storage, and transfer, the emergence of a universal architecture becomes more likely. Economy of scale can also be a force favoring convergence in the space of commercial machines. Although this appears as a perspective worthy of investigation, at the present stage neither the known theoretical results nor the trends of commercially available platforms indicate an imminent strong convergence.
In the middle of the spectrum, a variety of computational models proposed in the literature can be viewed as variants of an approach aiming at realizing an efficiency/portability/design-complexity trade-off [Bilardi and Pietracaprina 2011]. Well-known examples of these models are LPRAM [Aggarwal et al. 1990], DRAM [Leiserson and Maggs 1988], BSP [Valiant 1990] and its refinements (e.g., D-BSP [de la Torre and Kruskal 1996; Bilardi et al. 2007a], BSP* [Bäumker et al. 1998], E-BSP [Juurlink and Wijshoff 1998], and BSPRAM [Tiskin 1998]), LogP [Culler et al. 1996], QSM [Gibbons et al. 1999], MapReduce [Karloff et al. 2010; Pietracaprina et al. 2012], and several others. These models aim at capturing features common to most (reasonable) machines while ignoring features that differ. The hope is that performance of real machines is largely determined by the modeled features so that optimal algorithms in the proposed model translate into near optimal ones on real machines. A drawback of these models is that they include parameters that affect execution time. Then, in general, efficient algorithms are parameter aware, as different algorithmic strategies can be more efficient for different values of the parameters. One parameter present in virtually all models is the number of processors. Most models also exhibit parameters describing the time required to route certain communication patterns. Increasing the number of parameters, from just a small constant to logarithmically many in the number of processors, can considerably increase the effectiveness of the model with respect to realistic architectures, such as point-to-point networks, as extensively discussed in Bilardi et al. [2007a]. A price is paid in the increased complexity of algorithm design necessary to gain greater efficiency across a larger class of machines.
The complications further compound if the hierarchical nature of the memory is also taken into account, so communication between processors and memories becomes an optimization target as well.
It is natural to wonder whether, at least for some problems, parallel algorithms can be designed that, although independent of any machine/model parameters, are nevertheless efficient for wide ranges of these parameters. In other words, we are interested in exploring the world of efficient network-oblivious algorithms with a spirit similar to the one that motivated the development of efficient cache-oblivious algorithms [Frigo et al. 2012]. In this article, we define the notion of network-oblivious algorithms and propose a framework for their design and analysis. Our framework is based on three models of computation, each with a different role, as briefly outlined next.
The three models are based on a common organization consisting of a set of CPU/memory nodes communicating through some interconnection. Inspired by the bulk synchronous parallel (BSP) model and its aforementioned variants, we assume that the computation proceeds as a sequence of supersteps, where in a superstep each node performs local computation and sends/receives messages to/from other nodes, which will be consumed in the subsequent superstep. Each message occupies a constant number of words.
The first model of our framework (specification model) is used to specify network-oblivious algorithms. In this model, the number of CPU/memory nodes, referred to as virtual processors, is a function v(n) of the input size and captures the amount of parallelism exhibited by the algorithm. The second model (evaluation model) is the basis for analyzing the performance of network-oblivious algorithms on different machines. It is characterized by two parameters, independent of the input: the number p of CPU/memory nodes, simply referred to as processors in this context, and a fixed latency/synchronization cost σ per superstep. The communication complexity of an algorithm is defined in this model as a function of p and σ. Finally, the third model (execution machine model) enriches the evaluation model by replacing parameter σ with two independent parameter vectors of size logarithmic in the number of processors, which represent, respectively, the inverse of the bandwidth and the latency costs of suitable nested subsets of processors. In this model, the communication time of an algorithm is analyzed as a function of p and of the two parameter vectors. In fact, the execution machine model of our framework coincides with the decomposable bulk synchronous parallel (D-BSP) model [de la Torre and Kruskal 1996; Bilardi et al. 2007a], which is known to describe reasonably well the behavior of a large class of point-to-point networks by capturing their hierarchical structure.
A network-oblivious algorithm is designed in the specification model but can be run on the evaluation or execution machine models by letting each processor of these models carry out the work of a prespecified set of virtual processors. The main contribution of this article is an optimality theorem showing that for a wide and interesting class of network-oblivious algorithms, which satisfy some technical conditions and whose communication requirements depend only on the input size and not on the specific input instance, optimality in the evaluation model automatically translates into optimality in the D-BSP model for suitable ranges of the models' parameters. It is this circumstance that motivates the introduction of the intermediate evaluation model, which simplifies the analysis of network-oblivious algorithms, while effectively bridging the performance analysis to the more realistic D-BSP model.
To illustrate the potentiality of the framework, we devise network-oblivious algorithms for several fundamental problems, such as matrix multiplication, fast Fourier transform (FFT), comparison-based sorting, and a class of stencil computations. In all cases, except for stencil computations, we show, through the optimality theorem, that these algorithms are optimal when executed on the D-BSP for wide ranges of the parameters. Unfortunately, there exist problems for which optimality on the D-BSP cannot be attained in a network-oblivious fashion for wide ranges of parameters. We show that this is the case for the broadcast problem.
To help place our network-oblivious framework into perspective, it may be useful to compare it to the well-established sequential cache-oblivious framework [Frigo et al. 2012]. In the latter, the specification model is the random access machine; the evaluation model is the ideal cache model IC(M, B), with only one level of cache of size M and line length B; and the execution machine model is a machine with a hierarchy of caches, each with its own size and line length. In the cache-oblivious context, the simplification in the analysis arises from the fact that, under certain conditions, optimality on IC(M, B), for all values of M and B, translates into optimality on multilevel hierarchies.
The notion of obliviousness in parallel settings has been addressed by several research works. In a preliminary version of the current work [Bilardi et al. 2007b] (see also [Herley 2011]), we proposed a framework similar to the one presented here, where messages are packed in blocks whose fixed size is a parameter of the evaluation and execution machine models. Although blocked communication may be preferable for models where the memory and communication hierarchies are seamlessly integrated, such as multicores, latency-based models like the one used here are equivalent for that scenario and also capture the case when communication is accomplished through a point-to-point network. In recent years, obliviousness in parallel platforms has been explored in the context of multicore architectures, where processing units communicate through a multilevel cache hierarchy at the top of a shared memory [Chowdhury et al. 2013; Cole and Ramachandran 2010, 2012a, 2012b; Blelloch et al. 2010, 2011]. Although these works have significantly contributed to the development of oblivious algorithmics, the proposed results exploit the blocked and shared-memory nature of the communication system and thus do not suit platforms with distributed memories and point-to-point networks, for which our model of obliviousness is more appropriate. Chowdhury et al. [2013] introduced a multilevel hierarchical model for multicores and the notion of a multicore-oblivious algorithm for this model. A multicore-oblivious algorithm is specified with no mention of any machine parameters, such as the number of cores, number of cache levels, cache sizes, and block lengths, but it may include some simple hints to the runtime scheduler, like space requirements. These hints are then used by a suitable scheduler, aware of the multicore parameters, to efficiently schedule the algorithm on multicores with a multilevel cache hierarchy and any given number of cores.
Cole and Ramachandran [2010, 2012a, 2012b] presented resource-oblivious algorithms: these are multicore-oblivious algorithms with no hints, which can be efficiently executed on two-level memory multicores by schedulers that are not aware of the multicore parameters. In Blelloch et al. [2010, 2011], it is shown that multicore resource-oblivious algorithms can be analyzed independently of both the parallel machine and the scheduler. In the first work, the claim is shown for hierarchies of only private or only shared caches. In the second work, the result is extended to a multilevel hierarchical multicore by introducing a parallel version of the cache-oblivious framework of Frigo et al. [2012], named the parallel cache-oblivious model, and a scheduler for oblivious irregular computations. In contrast to these oblivious approaches, Valiant [2011] studies parallel algorithms for multicore architectures advocating a parameter-aware design of portable algorithms. The work presents optimal algorithms for multi-BSP, a bridging model for multicore architectures that exhibits a hierarchical structure akin to that of our execution machine model.
The rest of the article is organized as follows. In Section 2, we formally define the three models relevant to the framework, and in Section 3, we prove the optimality theorem mentioned earlier. In Section 4, we present the network-oblivious algorithms for matrix multiplication, FFT, comparison-based sorting, and stencil computations. We also discuss the impossibility result regarding the broadcast problem. Section 5 extends the optimality theorem by presenting a less powerful version, which, however, applies to a wider class of algorithms. Section 6 concludes the article with some final remarks. Appendix A provides a table that summarizes the main notations and symbols used in the article.
THE FRAMEWORK
We begin by introducing a parallel machine model $M(v)$, which underlies the specification, the evaluation, and the execution components of our framework. Specifically, $M(v)$ consists of a set of $v$ processing elements, denoted by $P_0, P_1, \ldots, P_{v-1}$, each equipped with a CPU and an unbounded local memory, which communicate through some interconnection. For simplicity, throughout this article, we assume that the number of processing elements is always a power of 2. The instruction set of each CPU is essentially that of a standard random access machine, augmented with the three primitives sync(i), send(m, q), and receive(). Furthermore, each $P_r$ has access to its own index $r$ and to the number $v$ of processing elements. When $P_r$ invokes primitive sync(i), with $i$ in the integer range $[0, \log v)$, a barrier synchronization is enforced among the $v/2^i$ processing elements whose indices share the $i$ most significant bits with $r$.$^1$ When $P_r$ invokes send(m, q), with $0 \le q < v$, a constant-size message $m$ is sent to $P_q$; the message will be available in $P_q$ only after a sync(k), where $k$ is not bigger than the number of most significant bits shared by $r$ and $q$. On the other hand, the function receive() returns an element in the set of messages received up to the preceding barrier and removes it from the set.
In this article, we restrict our attention to algorithms where the sequence of labels of the sync operations is the same for all processing elements, and where the last operation executed by each processing element is a sync.$^2$ In this case, the execution of an algorithm can be viewed as a sequence of supersteps, where a superstep consists of all operations performed between two consecutive sync operations, including the second of these sync operations. Supersteps are labeled by the index of their terminating sync operation; namely, a superstep terminating with sync(i) will be referred to as an $i$-superstep, for $0 \le i < \log v$. Furthermore, we make the reasonable assumption that in an $i$-superstep, each $P_r$ can send messages only to processing elements whose index agrees with $r$ in the $i$ most significant bits; that is, message exchange occurs only between processors belonging to the same synchronization subset. We observe that the results of this work would hold even if, in the various models considered, synchronizations were not explicitly labeled. However, explicit labels can help reduce synchronization costs. For instance, they become crucial for the efficient execution of the algorithms on point-to-point networks, especially those of large diameter.
In a more intuitive formulation, processing elements in $M(v)$ can be conceptually envisioned as the leaves of a complete binary tree of height $\log v$. When a processing element $P_r$ invokes the primitive sync(i), all processing elements belonging to the subtree rooted at the ancestor of $P_r$ at level $i$ are synchronized. Similarly, an $i$-superstep imposes that message exchange and synchronization are performed independently within the groups of leaves associated with the different subtrees rooted at level $i$. However, we remark that the tree is a conceptual construction and that $M(v)$ should not be confused with a tree network, as no assumption is made on the specific communication infrastructure between processing elements.

$^1$ For notational convenience, throughout this article we use $\log x$ to mean $\max\{1, \log_2 x\}$.

$^2$ As we will see in the article, several algorithms naturally comply or can easily be adapted to comply with these restrictions. Nevertheless, a less restrictive family of algorithms for $M(v)$ can be defined by allowing processing elements to feature different traces of labels of their sync operations, still ensuring termination. The exploration of the potentialities of these algorithms is left for future research.
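The synchronization and delivery rules of $M(v)$ described above can be illustrated with a small Python sketch. The function names (`log2_exact`, `sync_cluster`, `shared_msb`, `delivered_by`) are hypothetical helpers introduced only for this illustration and are not part of the model; the sketch also assumes $v \ge 2$, so that the convention $\log x = \max\{1, \log_2 x\}$ plays no role.

```python
def log2_exact(v: int) -> int:
    """Return log2(v) for v a power of 2 with v >= 2 (assumption of this sketch)."""
    if v < 2 or v & (v - 1):
        raise ValueError("v must be a power of 2 and at least 2")
    return v.bit_length() - 1


def sync_cluster(r: int, i: int, v: int) -> range:
    """Indices of the v / 2**i elements that synchronize when P_r calls sync(i)."""
    if not 0 <= i < log2_exact(v):
        raise ValueError("i must lie in [0, log v)")
    size = v >> i               # v / 2**i elements share the i leading bits of r
    start = (r // size) * size  # clear the low-order bits of r
    return range(start, start + size)


def shared_msb(r: int, q: int, v: int) -> int:
    """Number of most significant bits (out of log v) shared by r and q."""
    return log2_exact(v) - (r ^ q).bit_length()


def delivered_by(r: int, q: int, k: int, v: int) -> bool:
    """True if a message sent by P_r to P_q becomes available after sync(k)."""
    return 0 <= k < log2_exact(v) and k <= shared_msb(r, q, v)
```

For instance, with $v = 8$, `sync_cluster(5, 1, 8)` covers indices 4 through 7, which are the leaves of the subtree rooted at the level-1 ancestor of $P_5$ in the conceptual binary tree.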
Consider an $M(v)$-algorithm $A$ satisfying the preceding restrictions. For a given input instance $I$, we use $L^i_A(I)$ to denote the set of $i$-supersteps executed by $A$ on input $I$, and define $S^i_A(n) = \max_{I : |I| = n} |L^i_A(I)|$, for $0 \le i < \log v$.
Algorithm $A$ can be naturally and automatically adapted to execute on a smaller machine $M(2^j)$, with $0 \le j < \log v$, by stipulating that processing element $P_r$ of $M(2^j)$ will carry out the operations of the $v/2^j$ consecutively numbered processing elements of $M(v)$ starting with $P_{r \cdot (v/2^j)}$, for each $0 \le r < 2^j$. We call this adaptation folding. Under folding, supersteps with a label $i < j$ on $M(v)$ become supersteps with the same label on $M(2^j)$, whereas supersteps with label $i \ge j$ on $M(v)$ become local computation on $M(2^j)$. Hence, when considering the communication occurring in the execution of $A$ on $M(2^j)$, the set $L^i_A(I)$ is relevant as long as $i < j$.
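As a minimal sketch of the folding rule just described, the following Python helpers map processing elements of $M(v)$ to the processors of $M(2^j)$ that simulate them, and relabel supersteps accordingly. The names `simulated_elements`, `fold_owner`, and `fold_label` are hypothetical and introduced only for illustration; `None` is used here as an arbitrary marker for "local computation".

```python
from typing import Optional


def simulated_elements(r: int, v: int, j: int) -> range:
    """Elements of M(v) whose operations are carried out by P_r of M(2**j)."""
    block = v >> j  # v / 2**j consecutively numbered elements per processor
    return range(r * block, (r + 1) * block)


def fold_owner(element: int, v: int, j: int) -> int:
    """Index of the processor of M(2**j) that simulates a given element of M(v)."""
    return element // (v >> j)


def fold_label(i: int, j: int) -> Optional[int]:
    """Label of an i-superstep after folding onto M(2**j).

    Supersteps with label i < j keep their label; supersteps with i >= j
    involve only elements simulated by one processor, so they become local
    computation, marked here by None.
    """
    return i if i < j else None
```

With $v = 8$ and $j = 1$, processor $P_1$ of $M(2)$ simulates elements 4 through 7 of $M(8)$, and only 0-supersteps still generate communication.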
A network-oblivious algorithm $A$ for a given computational problem is designed on $M(v(n))$, referred to as specification model, where the number $v(n)$ of processing elements, which is a function of the input size, is selected as part of the algorithm design. The processing elements are called virtual processors and are denoted by $VP_0, VP_1, \ldots, VP_{v(n)-1}$ to distinguish them from the processing elements of the other two models. Since the folding mechanism illustrated earlier enables $A$ to be executed on a smaller machine, the design effort can be kept focused on just one convenient virtual machine size, oblivious to the actual number of processors on which the algorithm will be executed.
Although a network-oblivious algorithm is specified for a large virtual machine, it is useful to analyze its communication requirements on machines with reduced degrees of parallelism. For these purposes, we introduce the evaluation model $M(p, \sigma)$, where $p \ge 1$ is a power of 2 and $\sigma \ge 0$, which is essentially an $M(p)$ where the additional parameter $\sigma$ is used to account for the latency plus synchronization cost of each superstep. The processing elements of $M(p, \sigma)$ are called processors and are denoted by $P_0, P_1, \ldots, P_{p-1}$. Consider the execution of an algorithm $A$ on $M(p, \sigma)$ for a given input $I$. For each superstep $s$, the metric of interest that we use to evaluate the communication requirements of the algorithm is the maximum number of messages $h^s_A(I, p)$ sent/destined by/to any processor in that superstep. Thus, the set of messages exchanged in the superstep can be viewed as forming an $h^s_A(I, p)$-relation, where $h^s_A(I, p)$ is often referred to as the degree of the relation. In the evaluation model, the communication cost of a superstep of degree $h$ is defined as $h + \sigma$, and it is independent of the superstep's label. For our purposes, it is convenient to consider the cumulative degree of all $i$-supersteps, for $0 \le i < \log p$:

$$F^i_A(I, p) = \sum_{s \in L^i_A(I)} h^s_A(I, p).$$

Then, the communication complexity of $A$ on $M(p, \sigma)$ is defined as

$$H_A(n, p, \sigma) = \max_{I : |I| = n} \left\{ \sum_{i=0}^{\log p - 1} \left( F^i_A(I, p) + S^i_A(n) \cdot \sigma \right) \right\}. \qquad (1)$$
We observe that the evaluation model with this performance metric coincides with the BSP model [Valiant 1990] where the bandwidth parameter $g$ is set to 1 and the latency/synchronization parameter is set to $\sigma$. Next, we turn our attention to the last model used in the framework, called the execution machine model, which represents the machines where network-oblivious algorithms are actually executed. We focus on parallel machines whose underlying interconnection exhibits a hierarchical structure and use the D-BSP model [de la Torre and Kruskal 1996; Bilardi et al. 2007a] as our execution machine model. A D-BSP$(p, g, \ell)$, with $g = (g_0, g_1, \ldots, g_{\log p - 1})$ and $\ell = (\ell_0, \ell_1, \ldots, \ell_{\log p - 1})$, is an $M(p)$ where the cost of an $i$-superstep depends on parameters $g_i$ and $\ell_i$, for $0 \le i < \log p$. The processing elements, called processors and denoted by $P_0, P_1, \ldots, P_{p-1}$ as in the evaluation model, are partitioned into nested clusters: for $0 \le i \le \log p$, a set formed by all the $p/2^i$ processors whose indices share the most significant $i$ bits is called an $i$-cluster. As for the case of the specification model, if we envision a conceptual tree-like organization with the $p$ D-BSP processors at the leaves, then $i$-clusters correspond to the leaves of subtrees rooted at level $i$. Observe that during an $i$-superstep, each processor communicates only with processors of its $i$-cluster. For the communication within an $i$-cluster, parameter $\ell_i$ represents the latency plus synchronization cost (in time units), whereas $g_i$ represents an inverse measure of bandwidth (in units of time per message). By importing the notation adopted in the evaluation model, we define the communication time of an algorithm $A$ on D-BSP$(p, g, \ell)$ as

$$D_A(n, p, g, \ell) = \max_{I : |I| = n} \left\{ \sum_{i=0}^{\log p - 1} \left( F^i_A(I, p) \, g_i + S^i_A(n) \, \ell_i \right) \right\}. \qquad (2)$$
Prior results [Bilardi et al. 2007a] provide evidence that D-BSP is an effective machine model, as its hierarchical structure and its $2 \log p$ bandwidth and latency parameters are sufficient to capture reasonably well the cost of both balanced and unbalanced communication for a large class of point-to-point networks. Through the folding mechanism discussed earlier, any network-oblivious algorithm $A$ specified on $M(v(n))$ can be transformed into an algorithm for $M(p)$ with $p < v(n)$, and hence into an algorithm for $M(p, \sigma)$ or D-BSP$(p, g, \ell)$. In this case, the quantities $H_A(n, p, \sigma)$ and $D_A(n, p, g, \ell)$ denote, respectively, the communication complexity and communication time of the folded algorithm. Moreover, since algorithms designed on the evaluation model $M(p, \sigma)$ or on the execution machine model D-BSP$(p, g, \ell)$ can be regarded as algorithms for $M(p)$, once the parameters $\sigma$ or $g$ and $\ell$ are fixed, we can also analyze the communication complexities/times of their foldings on smaller machines (i.e., machines with $2^j$ processors, for any $0 \le j < \log p$). These relations among the models are crucial for the effective exploitation of our framework.
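The two cost measures can be contrasted with a short Python sketch for a single fixed input instance (the maximum over instances of the same size is not modeled). It assumes that an $i$-superstep of degree $h$ costs $h + \sigma$ in the evaluation model, as stated above, and $h \cdot g_i + \ell_i$ on the D-BSP, which is the usual D-BSP charge and is taken here as an assumption. The function names and the representation of a run as a list of `(label, degree)` pairs are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def degree(messages: Iterable[Tuple[int, int]]) -> int:
    """Degree of the h-relation formed by (source, destination) pairs:
    the maximum number of messages sent or received by any processor."""
    sent, received = Counter(), Counter()
    for src, dst in messages:
        sent[src] += 1
        received[dst] += 1
    return max([0, *sent.values(), *received.values()])


def communication_complexity(supersteps: Sequence[Tuple[int, int]],
                             sigma: float) -> float:
    """Evaluation model: every superstep of degree h costs h + sigma,
    independently of its label."""
    return sum(h + sigma for _label, h in supersteps)


def communication_time(supersteps: Sequence[Tuple[int, int]],
                       g: Sequence[float], ell: Sequence[float]) -> float:
    """D-BSP: an i-superstep of degree h is charged h * g[i] + ell[i]
    (assumed cost function for this sketch)."""
    return sum(h * g[i] + ell[i] for i, h in supersteps)
```

Setting every entry of `g` to 1 and every entry of `ell` to `sigma` makes the two functions agree, which mirrors the observation that the evaluation model is a BSP with unit bandwidth parameter.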
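The following definitions establish useful notions of optimality for the two complexity measures introduced earlier relative to the evaluation and execution machine models. For each measure, optimality is defined with respect to a class of algorithms, whose actual role will be made clear later in the article. Let $\mathcal{C}$ denote a class of algorithms, solving a given problem $\Pi$.

Definition 2.1. Let $0 < \beta \le 1$. An $M(p, \sigma)$-algorithm $B \in \mathcal{C}$ is $\beta$-optimal on $M(p, \sigma)$ with respect to $\mathcal{C}$ if for each $M(p, \sigma)$-algorithm $B' \in \mathcal{C}$ and for each $n$,

$$H_B(n, p, \sigma) \le \frac{1}{\beta} H_{B'}(n, p, \sigma).$$

Definition 2.2. Let $0 < \beta \le 1$. A D-BSP$(p, g, \ell)$-algorithm $B \in \mathcal{C}$ is $\beta$-optimal on D-BSP$(p, g, \ell)$ with respect to $\mathcal{C}$ if for each D-BSP$(p, g, \ell)$-algorithm $B' \in \mathcal{C}$ and for each $n$,

$$D_B(n, p, g, \ell) \le \frac{1}{\beta} D_{B'}(n, p, g, \ell).$$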
Note that the preceding definitions do not require β to be a constant: intuitively, larger values of β correspond to higher degrees of optimality.
OPTIMALITY THEOREM FOR STATIC ALGORITHMS
In this section, we show that for a certain class of network-oblivious algorithms, $\beta$-optimality in the evaluation model, for suitable ranges of parameters $p$ and $\sigma$, translates into $\beta'$-optimality in the execution machine model, for some $\beta' = \Theta(\beta)$ and suitable ranges of parameters $p$, $g$, and $\ell$. This result, which we refer to as optimality theorem, holds under a number of restrictive assumptions; nevertheless, it is applicable in several interesting case studies, as illustrated in subsequent sections. The optimality theorem shows the usefulness of the intermediate evaluation model since it provides a form of "bootstrap," whereby from a given degree of optimality on a family of machines we infer a related degree of optimality on a much larger family. It is important to remark that the class of algorithms for which the optimality theorem holds includes algorithms that are network aware; that is, whose code can make explicit use of the architectural parameters of the model ($p$ and $\sigma$ for the evaluation model, and $p$, $g$, and $\ell$ for the execution machine model) for optimization purposes. In a nutshell, the approach we follow hinges on the fact that both communication complexity and communication time (Equations (1) and (2)) are expressed in terms of quantities of the type $F^i_A(I, p)$. If communication complexity is low, then these quantities must be low, and thus communication time must be low as well. Next, we discuss a number of obstacles to be faced when attempting to refine the outlined approach into a rigorous argument and how they can be handled.
A first obstacle arises whenever the performance functions are linear combinations of other auxiliary metrics. Unfortunately, worst-case optimality of these metrics does not imply optimality of their linear combinations (nor vice versa), as the worst case of different metrics could be realized by different input instances. In the cases of our interest, the "losses" incurred cannot be generally bounded by constant factors. To circumvent this obstacle, we restrict our attention to static algorithms, defined by the property that the following quantities are equal for all input instances of the same size $n$: (i) the number of supersteps, (ii) the sequence of labels of the various supersteps, and (iii) the set of source-destination pairs of the messages exchanged in any individual superstep. This restriction allows us to overload the notation, writing $n$ instead of $I$ in the argument of functions that become invariant for instances of the same size (e.g., $F^i_A(n, p)$ in place of $F^i_A(I, p)$).
Likewise, the max operation becomes superfluous and can be omitted in Equations (1) and (2). Static algorithms naturally arise in directed acyclic graph (DAG) computations. In a DAG algorithm, for every instance size $n$, there exists (at most) one DAG where each node with indegree 0 represents an input value, whereas each node with indegree greater than 0 represents a value produced by a unit-time operation whose operands are the values of the node's predecessors (nodes with outdegree 0 are viewed as outputs). The computation requires the execution of all operations specified by the nodes, complying with the data dependencies imposed by the arcs.$^3$

To prove the optimality theorem, we need a number of technical results and definitions. Recall that folding can be employed to transform an $M(p, \sigma)$-algorithm into an $M(2^j, \sigma)$-algorithm, for any $1 \le j \le \log p$: as already mentioned, an algorithm designed on $M(p, \sigma)$ can be regarded as an algorithm for $M(p)$, once the parameter $\sigma$ is fixed; then, we can analyze the communication complexity of its folding on a smaller $M(2^j, \sigma)$ machine, for any $0 \le j < \log p$ and $\sigma \ge 0$. The following lemma establishes a useful relation between the communication metrics when folding is applied.
PROOF. The lemma follows by observing that in every $i$-superstep, with $i < j$, messages sent/destined by/to processor $P_k$ of $M(2^j, \sigma)$, with $0 \le k < 2^j$, are a subset of those sent/destined by/to the $p/2^j$ $M(p, \sigma)$-processors whose computations are carried out by $P_k$.
It is easy to come up with algorithms where the bound stated in the preceding lemma is not tight. In fact, whereas in an $i$-superstep each message must be exchanged between processors whose indices share at least $i$ most significant bits, some messages that contribute to $F^i_B(n, p)$ may be exchanged between processors whose indices share $j > i$ most significant bits, thus not contributing to $F^i_B(n, 2^j)$. Motivated by this observation, next we define a class of network-oblivious algorithms where a parameter $\alpha$ quantifies how tight the upper bound of Lemma 3.1 is when considering their foldings on smaller machines. This parameter will be employed to control the extent to which an optimality guarantee in the evaluation model translates into an optimality guarantee in the execution model.
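The gap described above can be seen on a toy superstep with a short Python sketch. It assumes, as the proof argument suggests, that the relevant upper bound on the degree after folding onto $M(2^j)$ is $p/2^j$ times the degree on $M(p)$; the function `folded_degree` and the two message sets are hypothetical and serve only as an illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def folded_degree(messages: Iterable[Tuple[int, int]], p: int, j: int) -> int:
    """Degree of a superstep of M(p) after folding onto M(2**j).

    Messages whose endpoints are simulated by the same processor of M(2**j)
    become local and no longer count toward the degree.
    """
    block = p >> j  # p / 2**j processors of M(p) per processor of M(2**j)
    sent, received = Counter(), Counter()
    for src, dst in messages:
        a, b = src // block, dst // block
        if a != b:
            sent[a] += 1
            received[b] += 1
    return max([0, *sent.values(), *received.values()])


# A 0-superstep on M(4) whose only message stays inside the 1-cluster {0, 1}:
# folding onto M(2) makes it local, so the assumed bound (4/2) * 1 = 2 is loose.
inside = [(0, 1)]

# A 0-superstep on M(4) where both processors of one 1-cluster send across:
# folding onto M(2) yields degree 2, matching the assumed bound exactly.
across = [(0, 2), (1, 3)]
```

In the first case the folded degree drops to 0, while in the second it reaches the assumed bound, which is the kind of behavior the parameter $\alpha$ is meant to quantify.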
for every 1 ≤ j ≤ log p and every input size n.
(We remark that in the preceding definition, parameter $\alpha$ is not necessarily a constant and can be made, for example, a function of $p$.) Intuitively, $(\alpha, p)$-wiseness is meant to capture, in an average sense, the property that for each $i$-superstep involving an $h$-relation, there exists an $i$-cluster where an $\alpha$-fraction of the processors send/receive $h$ messages to/from processors belonging to a different $(i + 1)$-subcluster. As an example, a network-oblivious algorithm for $M(v(n))$ where, for each $i$-superstep, there is always at least one segment of $v(n)/2^{i+1}$ virtual processors consecutively numbered starting from $k \cdot (v(n)/2^{i+1})$, for some $k \ge 0$, each sending a number of messages equal to the superstep degree to processors outside the segment, is an $(\alpha, p)$-wise algorithm for each $1 < p \le v(n)$ and $\alpha = 1$. However, $(\alpha, p)$-wiseness holds even if the aforementioned communication scenario is realized only in an average sense. Furthermore, consider a pair of values $\alpha'$ and $p'$ such that $1 < p' \le p$ and $0 < \alpha' \le \alpha$. It is easy to see that
, for every $0 \le i < \log p'$, and this implies that a network-oblivious algorithm that is $(\alpha, p)$-wise is also $(\alpha', p')$-wise.
A final issue to consider is that the degrees of supersteps with different labels contribute with the same weight to the communication complexity while they contribute with different weights to the communication time. The following lemma will help in bridging this difference.

LEMMA 3.3. For $m \ge 1$, let $X_0, X_1, \ldots, X_{m-1}$ and $Y_0, Y_1, \ldots, Y_{m-1}$ be two arbitrary sequences of real values, and let $f_0, f_1, \ldots, f_{m-1}$ be a nonincreasing sequence of nonnegative real values.
We then get the desired inequality m−1
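Lemma 3.3 can be sanity-checked numerically. The sketch below assumes that the lemma has the standard prefix-sum form: if every prefix sum of the X_i is at most the corresponding prefix sum of the Y_i, then the f-weighted sum of the X_i is at most the f-weighted sum of the Y_i. Both this reading of the statement and the function names are assumptions made for illustration.

```python
import random


def prefix_dominated(xs, ys):
    """True if sum(xs[:k]) <= sum(ys[:k]) for every 1 <= k <= len(xs)."""
    sx = sy = 0
    for x, y in zip(xs, ys):
        sx += x
        sy += y
        if sx > sy:
            return False
    return True


def weighted_sum(fs, vals):
    """Sum of f_i * v_i."""
    return sum(f * v for f, v in zip(fs, vals))


def check_lemma(trials=2000, m=5, seed=0):
    """Random search for a counterexample; returns True if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(m)]
        ys = [rng.randint(-5, 5) for _ in range(m)]
        # Nonincreasing sequence of nonnegative weights.
        fs = sorted((rng.randint(0, 9) for _ in range(m)), reverse=True)
        if prefix_dominated(xs, ys) and weighted_sum(fs, xs) > weighted_sum(fs, ys):
            return False
    return True
```

Under this reading, the inequality follows from summation by parts: the weighted difference is a nonnegative combination of the prefix-sum differences, with coefficients f_{k−1} − f_k and f_{m−1}.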
We are now ready to state and prove the optimality theorem. Let C denote a class of static algorithms solving a problem Π, with the property that for any algorithm A ∈ C for v processing elements, all of its foldings on 2^j processing elements, for each 1 ≤ j < log v, also belong to C.
THEOREM 3.4 (OPTIMALITY THEOREM). Let A ∈ C be network oblivious and (α, p*)-wise, for some α ∈ (0, 1] and p* a power of 2. Let also (σ m
and 1 ≤ j ≤ log p*, then for every p power of 2, p ≤ p*, A is αβ/(1 + α)-optimal on D-BSP(p, g, ℓ) with respect to C as long as
Fix the value p and the vectors g and ℓ so as to satisfy the hypotheses of the theorem, and consider a D-BSP(p, g, ℓ)-algorithm C ∈ C. By the β-optimality of A on the evaluation model M(2^j, ψp/2^j), for each 1 ≤ j ≤ log p and ψ such that
since C can be folded into an algorithm for M(2^j, ψp/2^j), still belonging to C. By the definition of communication complexity, it follows that
and then, by applying Lemma 3.1 to the right side of the preceding inequality, we obtain
Define ψ^m_p = max_{1≤k≤log p} {σ^m_{k−1} 2^k/p} and ψ^M_p = min_{1≤k≤log p} {σ^M_{k−1} 2^k/p}. The condition imposed by the theorem on the ratio ℓ_i/g_i implies that ψ^m_p ≤ ψ^M_p, and hence, by definition of these two quantities, we have that σ^m_{j−1} ≤ ψ^M_p p/2^j ≤ σ^M_{j−1}. Let us set ψ = ψ^M_p in Inequality (3), and note that by the preceding observation,
By multiplying both terms of the inequality by 2^j/(ψ^M_p p), and by exploiting the nonnegativeness of the F^i_A(n, 2^j) terms, we obtain
Next, we make log p applications of Lemma 3.3, one for each j = 1, 2, . . . , log p, by setting m = j,
for 1 ≤ j ≤ log p. Now, let us set ψ = ψ^m_p in Inequality (3), which again guarantees σ^m_{j−1} ≤ ψ^m_p p/2^j ≤ σ^M_{j−1}. By exploiting the wiseness of A in the left side and the nonnegativeness of S^i_A(n), we obtain
By multiplying both terms by 2^j/(pα) and observing that by hypothesis ψ m
Summing Inequality (4) with Inequality (5) yields
Then, by definition of communication time, we have
and the theorem follows.
Note that the theorem requires that both the g_i's and the ℓ_i/g_i's form nonincreasing sequences. The assumption is rather natural, as it reflects the fact that larger submachines exhibit more expensive communication (hence, a larger g parameter) and larger network capacity (hence, a larger ℓ/g ratio).
A few remarks regarding the preceding optimality theorem are in order. First, the proof of the theorem heavily relies on the manipulation of linear combinations of worst-case metrics related to executions of the algorithms with varying degrees of parallelism. This justifies the restriction to static algorithms, since, as anticipated at the beginning of the section, the variation of the metrics with the input instances would make the derivations invalid. However, based on the fact that the linear combinations involve a logarithmic number of terms, the proof of the theorem can be extended to nonstatic algorithms by increasing the gap between optimality in the evaluation model and optimality in the execution machine model by an extra O(log p) factor. Specifically, for arbitrary algorithms, after a straightforward reinterpretation of the quantities in a worst-case sense, the summation on the right-hand side of Equation (6), although not necessarily equal to D_C(n, p, g, ℓ), can be shown to be a factor at most O(log p) larger.
The complexity metrics adopted in this article target exclusively interprocessor communication, and thus a (sequential) network-oblivious algorithm specified on M(v) but using only one of the virtual processors would clearly be optimal with respect to these metrics. For meaningful applications of the theorem, the class C must be suitably defined to exclude such degenerate cases and to contain algorithms where the work is sufficiently well balanced among the processing elements. In addition, one could argue that the effectiveness of our framework is confined only to very regular algorithms, because of the wiseness hypothesis and the fact that the evaluation model uses the maximum number of messages sent/received by a processor as the key descriptor for communication costs, thus disregarding the overall communication volume. However, it has to be remarked that wiseness can be achieved even under communication patterns that are globally unbalanced, as long as some balancing is locally guaranteed within some cluster. Additionally, since the quest for optimality requires evaluating an algorithm at different levels of granularity, communication patterns with the same maximum message count at a processor but different overall communication volume may be discriminated, to some extent, by their different communication costs at coarser granularities.
Some of the issues encountered in establishing the optimality theorem have an analog in the context of memory hierarchies. For example, time in the hierarchical memory model (HMM) can be linked to I/O complexity as discussed in so that optimality of the latter for different cache sizes implies the optimality of the former for wide classes of functions describing the access time to different memory locations. Although, to the best of our knowledge, the question has not been explicitly addressed in the literature, a careful inspection of the arguments of shows that some restriction to the class of algorithms is required to guarantee that the maximum value of the I/O complexity for different cache sizes is simultaneously reached for the same input instance. (For example, the optimality of HMM time does not follow for the class of arbitrary comparison-based sorting algorithms, as the known I/O complexity lower bound for this problem [Aggarwal and Vitter 1988] may not be simultaneously reachable for all relevant cache sizes.) Moreover, the monotonicity that we have assumed for the g_i and the ℓ_i/g_i sequences has an analog in the assumption that the function used in to model the memory access time is polynomially bounded.
In the cache-oblivious framework, the equivalent of our optimality theorem requires algorithms to satisfy the regularity condition [Frigo et al. 2012, Lemma 6.4], which requires that the number of cache misses decreases by a constant factor when the cache size is doubled. On the other hand, our optimality theorem gives the best bound when the network-oblivious algorithm is (Θ(1), p)-wise, that is, when the communication complexity decreases by a constant factor when the number of processors is doubled. Although the regularity condition and wiseness cannot be formalized in a similar fashion due to the significant differences between the cache- and network-oblivious frameworks, we observe that both assumptions require the oblivious algorithms to react seamlessly and smoothly to small changes of the machine parameters.
ALGORITHMS FOR FUNDAMENTAL PROBLEMS
In this section, we illustrate the use of the proposed framework by developing efficient network-oblivious algorithms for a number of fundamental computational problems: matrix multiplication (Section 4.1), FFT (Section 4.2), and sorting (Section 4.3). All of our algorithms exhibit Θ(1)-optimality on the D-BSP for wide ranges of the machine parameters. In Section 4.4, we also present network-oblivious algorithms for stencil computations. These latter algorithms run efficiently on the D-BSP, although they do not achieve Θ(1)-optimality, which appears to be a hard challenge in this case. In Section 4.5, we also establish a negative result by proving that there cannot exist a network-oblivious algorithm for broadcasting that is simultaneously Θ(1)-optimal on two sufficiently different M(p, σ) machines.
As prescribed by our framework, the performance of the network-oblivious algorithms on the D-BSP is derived by analyzing their performance on the evaluation model. Optimality is assessed with respect to classes of algorithms where the computation is not excessively unbalanced among the processors, namely algorithms where an individual processor cannot perform more than a constant fraction of the total minimum work for the problem. For this purpose, we exploit some recent lower bounds that rely on mild assumptions on work distributions and strengthen previous bounds based on stronger assumptions [Scquizzato and Silvestri 2014] . Finally, we want to stress that all of our algorithms are also work optimal.
Matrix Multiplication
The n-MM problem consists of multiplying two √n × √n matrices, A and B, using only semiring operations. A result in Kerr [1970] shows that any static algorithm for the n-MM problem that uses only semiring operations must compute all n^{3/2} multiplicative terms, that is, the products
Let C denote the class of static algorithms for the n-MM problem such that any A ∈ C for v processing elements satisfies the following properties: (i) no entry of A or B is initially replicated (however, the entries of A and B are allowed to be initially distributed among the processing elements in an arbitrary fashion); (ii) no processing element computes more than n^{3/2}/min{v, 11^3} multiplicative terms; and (iii) all of the foldings of A on 2^j processing elements, for each 1 ≤ j < log v, also belong to C. The following lemma establishes a lower bound on the communication complexity of the algorithms in C.
LEMMA 4.1. The communication complexity of any n-MM algorithm in C when executed on M(p, σ) is Ω(n/p^{2/3} + σ).
PROOF. The bound for σ = 0 is proved in Theorem 2 of Scquizzato and Silvestri [2014] , and it clearly extends to the case σ > 0. The additive σ term follows since at least one message is sent by some processing element.
We now describe a static network-oblivious algorithm for the n-MM problem, which follows from the parallelization of the respective cache-oblivious algorithm [Frigo et al. 2012]. Then, we prove its optimality in the evaluation model, for wide ranges of the parameters, and in the execution model through the optimality theorem. For convenience, we assume that n is a power of 2^3 (the general case requires minor yet tedious modifications). The algorithm is specified on M(n) and requires that the input and output matrices be evenly distributed among the n VPs. We denote with A, B, and C the two input matrices and the output matrix, respectively, and with A_{hk}, B_{hk}, and C_{hk}, with 0 ≤ h, k ≤ 1, their four quadrants. The network-oblivious algorithm adopts the following recursive strategy:
(1) Partition the VPs into eight segments S_{hkℓ}, with 0 ≤ h, k, ℓ ≤ 1, containing the same number of consecutively numbered VPs. Replicate and distribute the inputs so that the entries of A_{hℓ} and B_{ℓk} are evenly spread among the VPs in S_{hkℓ}.
(2) In parallel, for each 0 ≤ h, k, ℓ ≤ 1, recursively compute the product
At the i-th recursion level, with 0 ≤ i ≤ (log n)/3, 8^i (n/4^i)-MM subproblems are solved by distinct M(n/8^i)'s formed by distinct segments of VPs. The recursion stops at i = (log n)/3 when each VP sequentially solves an n^{1/3}-MM subproblem. By unfolding the recursion, we get that the algorithm comprises a constant number of 3i-supersteps at the i-th recursive level, where each VP sends/receives O(2^i) messages. To easily claim that the algorithm is (Θ(1), n)-wise, we may assume that in each 3i-superstep, VP_j sends 2^i dummy messages to VP_{j+n/2^{3i+1}}, for 0 ≤ j < n/2^{3i+1}. These messages do not affect the asymptotic communication complexity and communication time exhibited by the algorithm in the evaluation and execution machine models. (In fact, constant wiseness is already achieved by the original communication pattern, but a direct proof would have required a more convoluted argument than resorting to dummy messages. Indeed, we will use the same trick in the other network-oblivious algorithms presented in the article.)
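The eight-way recursive strategy can be sketched sequentially in Python. This is only an illustration of the decomposition into the subproducts of quadrant A_{hℓ} by quadrant B_{ℓk} and their combination into the quadrants of C. It does not model VPs, segments, or data distribution, and it recurses down to 1 × 1 blocks rather than stopping at n^{1/3}-MM subproblems. The function names are hypothetical, and the matrix side is assumed to be a power of 2.

```python
def split(mat):
    """Return the four quadrants of a square matrix, keyed by (row_half, col_half)."""
    h = len(mat) // 2
    return {
        (a, b): [row[b * h:(b + 1) * h] for row in mat[a * h:(a + 1) * h]]
        for a in (0, 1) for b in (0, 1)
    }


def add(x, y):
    """Entrywise sum of two equally sized matrices."""
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def mm_recursive(a, b):
    """Multiply square matrices (side a power of 2) via eight subproducts."""
    if len(a) == 1:
        return [[a[0][0] * b[0][0]]]
    aq, bq = split(a), split(b)
    # Eight independent subproblems M[h, k, ell] = A[h, ell] * B[ell, k]; in the
    # network-oblivious algorithm each would be assigned to its own segment.
    m = {
        (h, k, ell): mm_recursive(aq[h, ell], bq[ell, k])
        for h in (0, 1) for k in (0, 1) for ell in (0, 1)
    }
    # Combine: C[h, k] = M[h, k, 0] + M[h, k, 1].
    c = {(h, k): add(m[h, k, 0], m[h, k, 1]) for h in (0, 1) for k in (0, 1)}
    top = [r0 + r1 for r0, r1 in zip(c[0, 0], c[0, 1])]
    bottom = [r0 + r1 for r0, r1 in zip(c[1, 0], c[1, 1])]
    return top + bottom


def mm_naive(a, b):
    """Reference product used only to check the recursive version."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]
```

Since only additions and multiplications are used, the sketch stays within semiring operations, in line with the problem definition.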
THEOREM 4.2. The communication complexity of the preceding n-MM network-oblivious algorithm when executed on M(p, σ) is
for every 1 < p ≤ n and σ ≥ 0. The algorithm is (Θ(1), n)-wise and Θ(1)-optimal with respect to C on any M(p, σ) with 1 < p ≤ n and σ = O(n/(p^{2/3} log p)).
PROOF. When executed on M(p, σ), the preceding algorithm decomposes the problem into eight subproblems that are solved by eight distinct M(p/8, σ) machines, and each processor sends/receives O(n/p) messages in O(1) supersteps for processing the inputs and outputs of the eight subproblems. The communication complexity satisfies the recurrence relation:
By unrolling the recurrence, we get
As anticipated, the wiseness is guaranteed by the dummy messages introduced in each superstep. Finally, it is easy to see that the algorithm satisfies the three requirements for belonging to C , and hence its optimality follows from Lemma 4.1.
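The unrolling step can be checked numerically with a small sketch. It assumes a recurrence of the shape suggested by the proof: one level costs n/p + σ with unit hidden constants, after which an (n/4)-MM instance is solved on p/8 processors, and the cost is zero on a single processor. The function names, the unit constants, and the restriction to p a power of 8 dividing n are assumptions made for illustration.

```python
def h_mm_unrolled(n, p, sigma):
    """Unroll H(n, p) = H(n/4, p/8) + n/p + sigma, with H(., 1) = 0.

    Assumes unit hidden constants, p a power of 8, and p dividing n.
    """
    total = 0
    while p > 1:
        total += n // p + sigma  # one level: about n/p messages plus latency sigma
        n //= 4                  # each subproblem is an (n/4)-MM instance ...
        p //= 8                  # ... solved by one eighth of the processors
    return total


def h_mm_closed_form(n, p, sigma):
    """n/p^(2/3) - n/p + sigma * log_8(p): the exact sum of the unrolled terms."""
    levels = 0
    q = p
    while q > 1:
        q //= 8
        levels += 1
    return n // 4 ** levels - n // p + sigma * levels
```

Under these assumptions the total is of order n/p^{2/3} + σ log p, which is consistent with the lower bound of Lemma 4.1 and with the range σ = O(n/(p^{2/3} log p)) stated in the theorem. For example, both functions return 198 for n = 4096, p = 64, and σ = 3.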
COROLLARY 4.3. The preceding n-MM network-oblivious algorithm is Θ(1)-optimal with respect to C on any D-BSP(p, g, ℓ) machine with 1 < p ≤ n, nonincreasing g_i's and ℓ_i/g_i's, and ℓ_0/g_0 = O(n/p).
PROOF. Since the network-oblivious algorithm is (Θ(1), n)-wise and belongs to C, the corollary follows by plugging p* = n, σ^m_i = 0, and σ^M_i = Θ(n/((i + 1)2^{2i/3})) into Theorem 3.4.
4.1.1. Space-Efficient Matrix Multiplication. Observe that the network-oblivious algorithm described earlier incurs an O(n^{1/3}) memory blow-up per VP. As described next, the recursive strategy can be modified to incur only a constant memory blow-up, at the expense of an increased communication complexity. The resulting network-oblivious algorithm turns out to be Θ(1)-optimal with respect to the class of algorithms featuring constant memory blow-up.
We assume, as before, that the entries of A, B, and C are evenly distributed among the VPs. The VPs are (recursively) divided into four segments that solve the eight (n/4)-MM subproblems in two rounds: in the first round, the segments compute A_{00} · B_{00}, A_{01} · B_{11}, A_{11} · B_{10}, and A_{10} · B_{01} (one product per segment), whereas in the second round, they compute A_{01} · B_{10}, A_{00} · B_{01}, A_{10} · B_{00}, and A_{11} · B_{11} (again, one product per segment). The recursion ends when each VP sequentially solves a 1-MM subproblem. By unfolding the recursion, we get that for every 0 ≤ i < (log n)/2, the algorithm executes Θ(2^i) 2i-supersteps where each VP sends/receives Θ(1) messages. At any time, each VP contains only O(1) matrix entries, but the recursion requires it to handle a stack of O(log n) entries. However, it is easy to see that only a constant number of bits are needed for each stack entry, and hence, under the natural assumption that each matrix entry occupies a constant number of (log n)-bit words, the entire stack at each VP requires storage proportional to O(1) matrix entries. Therefore, the algorithm incurs only a constant memory blow-up. As before, the algorithm can be easily made (Θ(1), n)-wise by adding suitable dummy messages.
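The two-round schedule can be checked mechanically. The sketch below encodes each product as a pair of quadrant indices and verifies that, within a round, every quadrant of A, of B, and of the output C is touched exactly once, and that the two rounds together cover all eight subproducts. The encoding and the function name are hypothetical and serve only as an illustration.

```python
# Each entry is ((h, l), (l2, k)): quadrants A_{hl} and B_{l2 k} multiplied by
# one of the four segments; the result contributes to the output quadrant C_{hk}.
ROUNDS = [
    [((0, 0), (0, 0)), ((0, 1), (1, 1)), ((1, 1), (1, 0)), ((1, 0), (0, 1))],
    [((0, 1), (1, 0)), ((0, 0), (0, 1)), ((1, 0), (0, 0)), ((1, 1), (1, 1))],
]


def check_schedule(rounds):
    """True if every round uses each A, B, and C quadrant exactly once and the
    rounds together cover all eight (h, k, l) subproducts."""
    covered = set()
    for products in rounds:
        # The inner indices of the two factors must agree.
        if any(a[1] != b[0] for a, b in products):
            return False
        a_quads = {a for a, _ in products}
        b_quads = {b for _, b in products}
        c_quads = {(a[0], b[1]) for a, b in products}
        if not (len(a_quads) == len(b_quads) == len(c_quads) == 4):
            return False
        covered |= {(a[0], b[1], a[1]) for a, b in products}
    return len(covered) == 8
```

Because no quadrant is needed by two segments in the same round, no input quadrant has to be replicated within a round, which is consistent with the constant memory blow-up claimed above.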
When executed on M(p, σ), the preceding space-efficient algorithm exhibits a communication complexity, denoted with H_{MM-space}(n, p, σ), that satisfies the recurrence relation:
By unrolling the relation, we get
Let C denote the class of static algorithms for the n-MM problem such that any A ∈ C for v processing elements satisfies the following properties: (i) the local storage required at each processing element is O(n/v), and (ii) all of the foldings of A on 2^j processing elements, for each 1 ≤ j < log v, also belong to C. Since it is proved in Irony et al. [2004] that any n-MM algorithm in C when running on M(p, 0) must exhibit an Ω(n/√p) communication complexity, the preceding network-oblivious algorithm is Θ(1)-optimal with respect to C on any M(p, σ) with 1 < p ≤ n and σ = O(n/p). Consequently, Theorem 3.4 yields optimality of the algorithm on any D-BSP(p, g, ℓ) machine with 1 < p ≤ n, nonincreasing g_i's and ℓ_i/g_i's, and ℓ_0/g_0 = O(n/p).
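The communication bound behind this optimality claim can be retraced with a short worked derivation. It is a sketch under stated assumptions: the recurrence is taken to have the two-round shape suggested by the description of the algorithm (two sequential rounds, each solving (n/4)-MM subproblems on p/4 processors, plus a per-level cost proportional to n/p + σ), c denotes an unspecified hidden constant, H stands for H_{MM-space}, and p = 4^t.

```latex
\begin{align*}
H(n,p,\sigma) &\le 2\,H\!\left(\tfrac{n}{4},\tfrac{p}{4},\sigma\right)
                 + c\left(\tfrac{n}{p}+\sigma\right),
                 \qquad H(n,1,\sigma)=0,\\
H(n,p,\sigma) &\le c\sum_{i=0}^{t-1} 2^{i}\left(\tfrac{n}{p}+\sigma\right)
               = c\,(2^{t}-1)\left(\tfrac{n}{p}+\sigma\right)
               = O\!\left(\tfrac{n}{\sqrt{p}}+\sigma\sqrt{p}\right).
\end{align*}
```

The ratio n/p is unchanged from one level to the next, so level i contributes 2^i copies of the same term. For σ = O(n/p) the bound becomes O(n/√p), which matches the Ω(n/√p) lower bound recalled above.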
Fast Fourier Transform
The n-FFT problem consists of computing the discrete Fourier transform of n values using the n-input FFT DAG, where a vertex is a pair ⟨w, l⟩, with 0 ≤ w < n and 0 ≤ l < log n, and there exists an arc between two vertices ⟨w, l⟩ and ⟨w′, l′⟩ if l′ = l + 1, and either w and w′ are identical or their binary representations differ exactly in the l-th bit [Leighton 1992].
Let C denote the class of static algorithms for the n-FFT problem such that any A ∈ C for v processing elements satisfies the following properties: (i) each DAG node is evaluated exactly once (i.e., recomputation is not allowed); (ii) no input value is initially replicated; (iii) no processing element computes more than εn log n DAG nodes, for some constant 0 < ε < 1; and (iv) all of the foldings of A on 2^j processing elements, for each 1 ≤ j < log v, also belong to C. Note that, as in the preceding section, the class of algorithms that we are considering makes no assumptions on the input and output distributions. The following lemma establishes a lower bound on the communication complexity of the algorithms in C.
LEMMA 4.4. The communication complexity of any n-FFT algorithm in C when executed on M(p, σ) is Ω((n log n)/(p log(n/p)) + σ).
PROOF. The bound for σ = 0 is proved in Theorem 11 of Scquizzato and Silvestri [2014] , and it clearly extends to the case σ > 0. The additive σ term follows since at least one message is sent by some processing element.
We now describe a static network-oblivious algorithm for the n-FFT problem and then prove its optimality in the evaluation and execution models. The algorithm is specified on M(n) and exploits the well-known decomposition of the FFT DAG into two sets of √n-input FFT subDAGs, with each set containing √n such subDAGs. For simplicity, to ensure integrality of the quantities involved, we assume n = 2^{2^k} for some integer k ≥ 0. We assume that at the beginning, the n inputs are evenly distributed among the n VPs. In parallel, each of the √n segments of √n consecutively numbered VPs recursively computes the assigned subDAG. Then, the outputs of the first set of subDAGs are permuted in a 0-superstep so as to distribute the inputs of each subDAG of the second set among the VPs of a distinct segment. The permutation pattern is equivalent to the transposition of a √n × √n matrix. Finally, each segment recursively computes the assigned subDAG.
At the i-th recursion level, with 0 ≤ i < log log n, n^{1−1/2^i} n^{1/2^i}-FFT subproblems are solved by n^{1−1/2^i} M(n^{1/2^i}) models formed by distinct segments of VPs. The recursion stops at i = log log n when each segment of two VPs computes a 2-input subDAG. It is easy to see, by unfolding the recursion, that the algorithm comprises O(2^i) supersteps with label (1 − 1/2^i) log n at the i-th recursive level, where each VP sends/receives O(1) messages. As before, to enforce wiseness without affecting the algorithm's asymptotic performance, we assume that in each (1 − 1/2^i) log n-superstep, VP_j sends a dummy message to VP_{j+n^{1/2^i}/2}, for each 0 ≤ j < n^{1/2^i}/2.
THEOREM 4.5. The communication complexity of the preceding n-FFT network-oblivious algorithm when executed on M(p, σ) is
for every 1 < p ≤ n and σ ≥ 0. The algorithm is (Θ(1), n)-wise and Θ(1)-optimal with respect to C on any M(p, σ) with 1 < p ≤ n and σ = O(n/p).
PROOF. When executed on M( p, σ ), the preceding algorithm decomposes the problem into two sets of √ n subproblems that are solved by √ n distinct M( p/ √ n, σ ) machines and each processor sends/receives O(n/ p) messages in O(1) supersteps for processing the inputs and outputs of the 2 √ n subproblems. The communication complexity satisfies the recurrence relation:
The wiseness is ensured by the dummy messages, and since the algorithm satisfies the requirements for belonging to C , its optimality follows from Lemma 4.4.
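The structure of the FFT recursion and its cost after folding can be explored with a short sketch. It enumerates the superstep labels produced by unfolding the recursion and then charges n/p + σ for every superstep whose label is smaller than log p, since supersteps with larger labels only involve VPs mapped to the same processor. The sketch rests on several assumptions that should be read as such: n = 2^{2^k}, each VP sends exactly one message per superstep, the base case on two VPs costs one superstep, and a superstep of degree h is charged h + σ. The function names are hypothetical.

```python
from collections import Counter


def fft_superstep_labels(log_m, offset=0):
    """Yield, in execution order, the labels of the supersteps executed by one
    segment of 2**log_m VPs whose indices share `offset` leading bits.

    Assumes log_m is a power of 2, i.e., the segment size is 2^(2^k).
    """
    if log_m == 1:
        yield offset  # two VPs evaluate a 2-input subDAG
        return
    half = log_m // 2
    yield from fft_superstep_labels(half, offset + half)  # first set of subDAGs
    yield offset                                          # transposition
    yield from fft_superstep_labels(half, offset + half)  # second set of subDAGs


def folded_cost(log_n, log_p, sigma):
    """Communication cost on M(p, sigma) under the assumptions stated above."""
    per_processor = 2 ** (log_n - log_p)  # VPs simulated by each processor
    return sum(per_processor + sigma
               for label in fft_superstep_labels(log_n) if label < log_p)


# For n = 256, count how many supersteps carry each label.
LABEL_COUNTS_256 = Counter(fft_superstep_labels(8))
```

For n = 256, the labels 0, 4, and 6 occur 1, 2, and 4 times, respectively, matching the 2^i supersteps with label (1 − 1/2^i) log n at level i. The number of supersteps that survive the folding grows roughly as log n / log(n/p), in line with the lower bound of Lemma 4.4.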
We now apply Theorem 3.4 to show that the network-oblivious algorithm is Θ(1)-optimal on the D-BSP for wide ranges of the machine parameters.
COROLLARY 4.6. The preceding n-FFT network-oblivious algorithm is Θ(1)-optimal with respect to C on any D-BSP(p, g, ℓ) machine with 1 < p ≤ n, nonincreasing g_i's and ℓ_i/g_i's, and ℓ_0/g_0 = O(n/p).
PROOF. Since the network-oblivious algorithm is (Θ(1), n)-wise and belongs to C, we get the claim by plugging p* = n, σ^m_i = 0, and σ^M_i = Θ(n/2^i) in Theorem 3.4.
We observe that although we described the network-oblivious algorithm assuming n = 2^{2^k}, to ensure integrality of the quantities involved, the preceding results can be generalized to the case of n an arbitrary power of 2. In this case, the FFT DAG is recursively decomposed into a set of 2 log √ n -input FFT subDAGs and a set of n/2 log √ n -input FFT subDAGs. The optimality of the resulting algorithm in both the evaluation and execution machine models can be proved in a similar fashion as before.
Sorting
The n-sort problem requires labeling n (distinct) input keys with their ranks, using only comparisons, where the rank of a key is the number of smaller keys in the input sequence.
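As a concrete reading of the problem statement, the following sequential sketch labels each key with its rank, that is, the number of smaller keys in the input. It only illustrates the required input/output relation for distinct keys and is not a parallel algorithm; the function name is hypothetical.

```python
def ranks(keys):
    """Return a list r where r[i] is the number of keys smaller than keys[i].

    Assumes the keys are distinct; uses only comparisons between keys.
    """
    order = sorted(range(len(keys)), key=keys.__getitem__)
    result = [0] * len(keys)
    for rank, index in enumerate(order):
        result[index] = rank
    return result
```

For example, the keys 30, 10, 20 receive ranks 2, 0, 1.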
Let C denote the class of static algorithms for the n-sort problem such that any A ∈ C for v processing elements satisfies the following properties: (i) initially, no input key is replicated and, during the course of the algorithm, only a constant number of copies per key are allowed at any time; (ii) no processing element performs more than εn log n comparisons, for an arbitrary constant 0 < ε < 1; and (iii) all of the foldings of A on 2^j processing elements, 1 ≤ j < log v, also belong to C. We make no assumptions on how the keys are distributed among the processing elements at the beginning and at the end of the algorithm. The following lemma establishes a lower bound on the communication complexity of the algorithms in C.
LEMMA 4.7. The communication complexity of any n-sort algorithm in C when executed on M(p, σ) is Ω((n log n)/(p log(n/p)) + σ).
PROOF. The bound for σ = 0 is proved in Theorem 8 of Scquizzato and Silvestri [2014] , and it clearly extends to the case σ > 0. The additive σ term follows since at least one message is sent by some processing element.
We now present a static network-oblivious algorithm for the n-sort problem and then prove its optimality in the evaluation and execution models. The algorithm implements a recursive version of the Columnsort strategy, as described in Leighton [1985]. Consider the n input keys as an r × s matrix, with r · s = n and r ≥ s^2. Columnsort is organized into eight phases numbered from 1 to 8. During Phases 1, 3, 5, and 7, the keys in each column are sorted recursively (in Phase 5, adjacent columns are sorted in reverse order). During Phases 2, 4, 6, and 8, the keys of the matrix are permuted: in Phase 2 (respectively, Phase 4), a transposition (respectively, diagonalizing permutation [Leighton 1985]) of the r × s matrix is performed maintaining the r × s shape; in Phase 6 (respectively, Phase 8), an r/2-cyclic shift (respectively, the reverse of the r/2-cyclic shift) is done.
Columnsort can be implemented on M(n) as follows. For convenience, assume that n = 2^{(3/2)^d} for some integer d ≥ 0, and set r = n^{2/3} and s = n/r (the more general case is discussed later). The algorithm starts with the input keys evenly distributed among the n VPs. In the odd phases, the keys of each column are evenly distributed among the VPs of a distinct segment of r consecutively numbered VPs, which form an independent M(r). Then, each segment recursively solves the subproblem corresponding to the column it received. The even phases entail a constant number of 0-supersteps of constant degree. At the i-th recursion level, with 0 ≤ i ≤ log_{3/2} log n, each segment of n^{(2/3)^i} consecutively numbered VPs forming an independent M(n^{(2/3)^i}) solves 4^i subproblems of size n^{(2/3)^i}. The recursion stops at i = log_{3/2} log n when each VP solves, sequentially, a subproblem of constant size. It is easy to see, by unfolding the recursion, that the algorithm consists of Θ(4^i) supersteps with label (1 − (2/3)^i) log n at the i-th recursive level, where each VP sends/receives O(1) messages.
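The eight-phase structure can be illustrated with a sequential sketch. Note that it implements the textbook, non-recursive formulation of Columnsort, which may differ in detail from the variant described above. Columns are sorted with Python's `sorted` instead of recursively, Phase 4 is the inverse of the Phase 2 permutation, Phase 5 sorts all columns in the same order, and the shape condition is the stronger one commonly used for this formulation (r even, s dividing r, and r ≥ 2s^2). The function name and the column-major list layout are choices made for illustration.

```python
import math


def columnsort(keys, r, s):
    """Sort r*s keys laid out column-major in an r x s matrix.

    Textbook Columnsort; assumes r is even, s divides r, and r >= 2*s*s.
    """
    n = r * s
    if len(keys) != n or r % 2 or r % s or r < 2 * s * s:
        raise ValueError("unsupported shape for this sketch")

    def sort_columns(flat):
        # Sort each block of r consecutive entries (one column) independently.
        return [v for j in range(0, len(flat), r) for v in sorted(flat[j:j + r])]

    a = sort_columns(list(keys))                          # Phase 1
    b = [None] * n                                        # Phase 2: read column-major,
    for t, v in enumerate(a):                             # write row-major
        b[(t % s) * r + t // s] = v
    a = sort_columns(b)                                   # Phase 3
    a = [a[(t % s) * r + t // s] for t in range(n)]       # Phase 4: inverse of Phase 2
    a = sort_columns(a)                                   # Phase 5
    half = r // 2
    padded = [-math.inf] * half + a + [math.inf] * half   # Phase 6: shift by r/2
    padded = sort_columns(padded)                         # Phase 7
    return padded[half:half + n]                          # Phase 8: undo the shift
```

In the network-oblivious algorithm, each call to `sorted` on a column would instead be a recursive invocation on an independent segment of VPs, and the permutations in the even phases would be realized by 0-supersteps.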
As before, to enforce wiseness without affecting the algorithm's asymptotic performance, we assume that in each (1 − (2/3)^i) log n-superstep, VP_j sends a dummy message to VP_{j+n^{(2/3)^i}/2}, for each 0 ≤ j < n^{(2/3)^i}/2.
THEOREM 4.8. The communication complexity of the preceding n-sort network-oblivious algorithm when executed on M(p, σ) is
for every 1 < p ≤ n and σ ≥ 0. The algorithm is (Θ(1), n)-wise and is Θ(1)-optimal with respect to C on any M(p, σ) with p = O(n^{1−δ}), for any arbitrary constant δ ∈ (0, 1), and σ ≥ 0.
PROOF. When executed on M(p, σ), the preceding algorithm decomposes the problem into four sets of n^{1/3} subproblems that are solved in four phases by n^{1/3} distinct M(p/n^{1/3}, σ) machines, and each processor sends/receives O(n/p) messages in O(1) supersteps for processing the inputs and outputs of the 4n^{1/3} subproblems. The communication complexity satisfies the recurrence relation:
The wiseness is guaranteed by the dummy messages. Since the algorithm satisfies the three requirements to be in C , its optimality follows from Lemma 4.7.
COROLLARY 4.9. The preceding n-sort network-oblivious algorithm is Θ(1)-optimal with respect to C on any D-BSP(p, g, ℓ) machine with p = O(n^{1−δ}), for some arbitrary constant δ ∈ (0, 1), and nonincreasing g_i's and ℓ_i/g_i's.
PROOF. Since the network-oblivious algorithm is (Θ(1), n)-wise and belongs to C, we get the claim by plugging p* = n, σ^m_i = 0, and σ^M_i = +∞ in Theorem 3.4.
Consider now the more general case when n is an arbitrary power of 2. Now, the input keys must be regarded as the entries of an r × s matrix, where r is the smallest power of 2 greater than or equal to n^{2/3}. Simple yet tedious calculations show that the results stated in Theorem 4.8 and Corollary 4.9 continue to hold in this case.
Finally, we remark that the preceding network-oblivious sorting algorithm turns out to be Θ(1)-optimal on any D-BSP(p, g, ℓ), as long as p = O(n^{1−δ}) for constant δ, with respect to a wider class of algorithms that satisfy requirements (i), (ii), and (iii), specified earlier for C, but need not be static. By applying the lower bound for sorting in Scquizzato and Silvestri [2014] on two processors, it is easy to show that Ω(n) messages must cross the bisection for this class of algorithms. Therefore, we get an Ω(g_0 n/p) lower bound on the communication time on D-BSP(p, g, ℓ), which is matched by our network-oblivious algorithm.
Stencil Computations
A stencil defines the computation of any element in a d-dimensional spatial grid at time t as a function of neighboring grid elements at time t − 1, t − 2, . . . , t − ρ, for some integers ρ ≥ 1 and d ≥ 1. Stencil computations arise in many contexts, ranging from iterative finite-difference methods for the numerical solution of partial differential equations to algorithms for the simulation of cellular automata, as well as in dynamic programming algorithms and in image-processing applications. In addition, the simulation of a d-dimensional mesh [Bilardi and Preparata 1997] can be envisioned as a stencil computation.
In this section, we restrict our attention to stencil computations with ρ = 1. To this purpose, we define the (n, d)-stencil problem, which represents a wide class of stencil computations (e.g., see Frigo and Strumpen [2005]). Specifically, the problem consists of evaluating all nodes of a DAG of n^{d+1} nodes, each represented by a distinct tuple ⟨i_0, i_1, . . . , i_d⟩, with 0 ≤ i_0, i_1, . . . , i_d < n, where each node ⟨i_0, i_1, . . . , i_d⟩ is connected, through an outgoing arc, to (at most) 3^d neighbors, namely ⟨i_0 + δ_0, i_1 + δ_1, . . . , i_{d−1} + δ_{d−1}, i_d + 1⟩ for each δ_0, δ_1, . . . , δ_{d−1} ∈ {0, ±1} (whenever such nodes exist). We suppose n to be a power of 2. Intuitively, the (n, d)-stencil problem consists of n timesteps of a stencil computation on a d-dimensional spatial grid of side n, where each DAG node corresponds to a grid element (first d coordinates) at a given timestep (coordinate i_d).
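The arc structure of the (n, d)-stencil DAG can be made concrete with a small generator that lists the successors of a node. It follows the definition above directly; the function name and the representation of nodes as Python tuples (d spatial coordinates followed by the timestep) are choices made for illustration.

```python
from itertools import product


def successors(node, n):
    """Yield the successors of `node` in the (n, d)-stencil DAG.

    `node` is a tuple (i_0, ..., i_{d-1}, i_d): d spatial coordinates followed
    by the timestep i_d. Nodes falling outside the grid are skipped.
    """
    *space, t = node
    if t + 1 >= n:
        return  # last timestep: no outgoing arcs
    for delta in product((-1, 0, 1), repeat=len(space)):
        nxt = tuple(c + d for c, d in zip(space, delta))
        if all(0 <= c < n for c in nxt):
            yield (*nxt, t + 1)
```

An interior node has 3^d successors, while nodes on the boundary of the spatial grid have fewer and nodes at the last timestep have none.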
Let C_d denote the class of static algorithms for the (n, d)-stencil problem such that any A ∈ C_d for v processing elements satisfies the following properties: (i) each DAG node is evaluated once (i.e., recomputation is not allowed); (ii) no processing element computes more than εn^{d+1} DAG nodes, for some constant 0 < ε < 1; and (iii) all of the foldings of A on 2^j processing elements, 1 ≤ j < log v, also belong to C_d. Note that as before, this class of algorithms makes no assumptions on the input and output distributions. The following lemma establishes a lower bound on the communication complexity of the algorithms in C_d.
LEMMA 4.10. The communication complexity of any (n, d) 
PROOF. The bound for σ = 0 is proved in Theorem 5 of Scquizzato and Silvestri [2014] , and it clearly extends to the case σ > 0. The additive σ term follows since at least one message is sent by some processing element.
In what follows, we develop efficient network-oblivious algorithms for the (n, d)-stencil problem, for the special cases of d ∈ {1, 2}. The generalization to values d > 2, and to other types of stencils, is left as an open problem.
4.4.1. The (n, 1)-Stencil Problem. The (n, 1)-stencil problem consists of the evaluation of a DAG shaped as a two-dimensional array of side n. We reduce the solution of the stencil problem to the computation of a diamond DAG. Specifically, we define a diamond DAG of side n as the intersection of a (2n − 1, 1)-stencil DAG with the following four half-planes: i_0 + i_1 ≥ (n − 1), i_0 − i_1 ≤ (n − 1), i_0 − i_1 ≥ −(n − 1), and i_0 + i_1 ≤ 3(n − 1) (i.e., the largest diamond included in the stencil). It follows that an (n, 1)-stencil DAG can be partitioned into five full or truncated diamond DAGs of side less than n that can be evaluated in a suitable order, with the outputs of one DAG evaluation providing the inputs for subsequent DAG evaluations.
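The four half-plane constraints translate directly into a membership test. The following sketch enumerates the nodes of a diamond DAG of side n inside the (2n − 1, 1)-stencil DAG; the function names are hypothetical.

```python
def in_diamond(i0, i1, n):
    """True if node (i0, i1) satisfies the four half-plane constraints that
    define the diamond DAG of side n."""
    m = n - 1
    return (i0 + i1 >= m and i0 - i1 <= m
            and i0 - i1 >= -m and i0 + i1 <= 3 * m)


def diamond_nodes(n):
    """List the nodes of the (2n - 1, 1)-stencil DAG that lie in the diamond."""
    side = 2 * n - 1
    return [(i0, i1) for i0 in range(side) for i1 in range(side)
            if in_diamond(i0, i1, n)]
```

The constraints are equivalent to |i_0 − (n − 1)| + |i_1 − (n − 1)| ≤ n − 1, so the diamond is centered at (n − 1, n − 1) and contains 2n^2 − 2n + 1 nodes; for n = 1 it reduces to a single node.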
Our network-oblivious algorithm for the (n, 1)-stencil is specified on M(n) and consists of five stages, where in each stage the whole M(n) machine takes care of the evaluation of a distinct diamond DAG (full or truncated) according to the aforementioned partition. We require that all of the O(n) inputs necessary for the evaluation of a diamond DAG are evenly distributed among the n VPs at the start of the stage in charge of the DAG. No matter how the inputs are assigned to the VPs at the beginning of the algorithm, the data movement required to guarantee the correct input distribution at the various stages can be accomplished in O(1) 0-supersteps where each VP sends/receives O(n) messages.
Fig. 1. The decomposition of the diamond DAG performed by our algorithm.
We now focus on the evaluation of the individual diamond DAGs. For ease of presentation, we consider the evaluation of a full diamond DAG of side n on M(n). Simple yet tedious modifications are required for dealing with truncated or smaller diamond DAGs. We exploit the fact that this DAG can be decomposed recursively into smaller diamonds. Parallel algorithms for stencil computations based on this or similar decompositions are known [Chowdhury and Ramachandran 2008; Frigo and Strumpen 2009; Tang et al. 2011] , but their focus is on optimizing processor cache efficiency rather than interprocessor communications.
Let k = 2^{√log n}. The diamond DAG is partitioned into 2k − 1 horizontal stripes, each containing up to k diamonds of side n/k, as depicted in Figure 1. The DAG evaluation is accomplished in 2k − 1 nonoverlapping phases. In the r-th such phase, with 0 ≤ r < 2k − 1, the diamonds in the r-th stripe are evaluated in parallel by distinct M(n/k) submachines formed by disjoint segments of consecutively numbered VPs. At the beginning of each phase, a 0-superstep is executed to provide the VPs of each M(n/k) submachine with the appropriate input, that is, the immediate predecessors (if any) of the diamond assigned to the submachine. In this superstep, each VP sends/receives O(1) messages. In each phase, the diamonds of side n/k are evaluated recursively.
In general, at the i-th recursive level, with i ≥ 1, a total of (2k − 1)^i nonoverlapping phases are executed where diamonds of side n_i = n/k^i are evaluated in parallel by distinct M(n_i) submachines. Each such phase starts with a superstep of label (i − 1) · log k to provide each M(n_i) with the appropriate input. In turn, the evaluation of a diamond of side n_i within an M(n_i) submachine is performed recursively by partitioning its nodes into 2k − 1 horizontal stripes of diamonds of side n_{i+1} = n/k^{i+1} that are evaluated in 2k − 1 nonoverlapping phases by M(n_{i+1}) submachines, with each phase starting with a superstep of label i · log k where each VP sends/receives O(1) messages (and thus each processor sends/receives O(n/p) messages). The recursion ends at level τ = ⌊log_k n⌋, which is the first level where the diamond of side n_τ becomes smaller than k. If n_τ > 1, each diamond of side n_τ assigned to an M(n_τ) submachine is evaluated straightforwardly in 2n_τ − 1 supersteps of label τ · log k. Instead, if n_τ = 1, at recursion level τ each VP independently evaluates a 1-node diamond, and no communication is required.
By unfolding the recursion, one can see that the evaluation of a diamond DAG of side $n$ entails, overall, $(2k-1)^i$ supersteps of label $(i-1)\cdot\log k$, for $1 \le i \le \tau$, and, if $n_\tau > 1$, $(2k-1)^\tau(2n_\tau - 1)$ supersteps of label $\tau\cdot\log k$. In each of these supersteps, every VP sends/receives $O(1)$ messages.
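The unfolded count can be tabulated with a short sketch. It assumes that k is a power of 2 (so that log k is an integer), that k^τ divides n with τ = ⌊log_k n⌋, and that each of the (2k-1)^τ diamonds at the last level takes 2n_τ - 1 supersteps when n_τ > 1. The function name is hypothetical.

```python
def superstep_counts(n, k):
    """Return {label: number of supersteps} for a full diamond DAG of side n.

    Assumes k is a power of 2 and k**tau divides n, with tau = floor(log_k n).
    """
    log_k = k.bit_length() - 1          # log2(k) for a power of 2
    tau = 0
    while k ** (tau + 1) <= n:          # tau = floor(log_k n)
        tau += 1
    n_tau = n // k ** tau               # side of the diamonds at the last level
    counts = {}
    for i in range(1, tau + 1):
        # (2k-1)^i phases at level i, each opened by one superstep.
        counts[(i - 1) * log_k] = (2 * k - 1) ** i
    if n_tau > 1:
        # Each last-level diamond is evaluated in 2*n_tau - 1 supersteps.
        label = tau * log_k
        counts[label] = counts.get(label, 0) + (2 * k - 1) ** tau * (2 * n_tau - 1)
    return counts


if __name__ == "__main__":
    print(superstep_counts(16, 4))   # n = 16 gives k = 2^sqrt(log n) = 4
    print(superstep_counts(32, 4))
```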
To guarantee $(\Theta(1), n)$-wiseness of our algorithm, we assume that suitable dummy messages are added in each superstep to make each VP exchange the same number of messages.
THEOREM 4.11. The communication complexity of the preceding network-oblivious algorithm for the $(n,1)$-stencil problem when executed on $M(p,\sigma)$ is
$$H_A(n,p,\sigma) = O\left(n\,4^{\sqrt{\log n}}\right)$$
for every $1 < p \le n$ and $0 \le \sigma = O(n/p)$. The algorithm is $(\Theta(1), n)$-wise and $\Theta(1/4^{\sqrt{\log n}})$-optimal with respect to $\mathcal{C}_1$ on any $M(p,\sigma)$ with $1 < p \le n$ and $\sigma = O(n/p)$.
PROOF. As observed earlier, the communication required at the beginning of each of the five stages contributes an additive term $O(n)$ to the communication complexity, and hence it is negligible. Let us then concentrate on the communication complexity for one diamond DAG evaluation. Recall that $\tau = \lfloor\log_k n\rfloor$. First suppose that $p \le k^\tau$. Observe that at every recursion level $i$, with $0 \le i < \log_k p$, the evaluation of each diamond of side $n_i = n/k^i$ is performed by $p/k^i > 1$ processors, and each processor sends/receives $O(n/p)$ messages in $O(1)$ supersteps for processing the inputs and outputs of these subproblems; on the other hand, at every recursion level $i$, with $\log_k p \le i \le \tau$, each diamond of side $n/k^i$ is evaluated by a single processor of $M(p,\sigma)$ and no communication takes place. Thus, the communication complexity satisfies the recurrence relation:
This recurrence has the following solution,
where we exploited the upper bound on $\sigma$. Instead, if $k^\tau < p \le n$, we have that at every recursion level $i$, with $0 \le i \le \tau$, the evaluation of each diamond of side $n_i = n/k^i$ is performed by $p/k^i > 1$ processors. Then, by the preceding discussion and recalling that for $i = \tau$, diamonds of side $n_\tau = n/k^\tau$ are evaluated straightforwardly in $2n_\tau - 1$ supersteps, we obtain
where we exploited the upper bound on $\sigma$ and the fact that $p > k^\tau$, and hence, by definition of $\tau$, $n/p < k$. The wiseness is ensured by the dummy messages. It is easy to see that the algorithm complies with the requirements for belonging to $\mathcal{C}_1$, and hence the claimed optimality is a consequence of Lemma 4.10, and the theorem follows.
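The two cases of the proof can be summarized by the following worked sketch. It is a reconstruction from the textual description, not the authors' displayed formulas: it assumes that each of the $2k-1$ phases of a level costs $O(n/p + \sigma)$ on top of the recursive call, and it uses the facts that the ratio $n/p$ is invariant along the recursion, that $\log_k n = \sqrt{\log n}$, and that $k^\tau n_\tau = n$.

```latex
% Case p <= k^tau (assumed form of the recurrence)
H(n,p,\sigma) \le (2k-1)\Bigl[ H\bigl(\tfrac{n}{k},\tfrac{p}{k},\sigma\bigr)
      + O\bigl(\tfrac{n}{p}+\sigma\bigr) \Bigr] \quad (p > 1),
\qquad H(n,p,\sigma) = 0 \quad (p \le 1).

% Unfolding for log_k p levels, with sigma = O(n/p)
H(n,p,\sigma) = O\Bigl( \sum_{i=1}^{\log_k p} (2k-1)^i \,\tfrac{n}{p} \Bigr)
             = O\bigl( (2k)^{\log_k p}\,\tfrac{n}{p} \bigr)
             = O\bigl( n\,2^{\log_k p} \bigr)
             = O\bigl( n\,2^{\sqrt{\log n}} \bigr).

% Case k^tau < p <= n: all tau levels communicate, and the last level adds
% (2k-1)^tau (2 n_tau - 1) supersteps of cost O(n/p + sigma) each
H(n,p,\sigma) = O\Bigl( \sum_{i=1}^{\tau} (2k)^i \,\tfrac{n}{p}
      + (2k)^{\tau}\, n_\tau \,\tfrac{n}{p} \Bigr)
             = O\bigl( 2^{\tau}\, k^{\tau} n_\tau \, k \bigr)
             = O\bigl( n\, 2^{\sqrt{\log n}}\, k \bigr)
             = O\bigl( n\,4^{\sqrt{\log n}} \bigr).
```

In the second case the bound $n/p < k$ is what turns the per-superstep cost into a factor $k = 2^{\sqrt{\log n}}$, which accounts for the $4^{\sqrt{\log n}}$ term.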
Finally, we show that the network-oblivious algorithm for the $(n,1)$-stencil problem achieves $\Theta(1/4^{\sqrt{\log n}})$-optimality on the D-BSP as well, for wide ranges of machine parameters.
COROLLARY 4.12. The preceding network-oblivious algorithm for the $(n,1)$-stencil problem is $\Theta(1/4^{\sqrt{\log n}})$-optimal with respect to $\mathcal{C}_1$ on any D-BSP$(p, g, \ell)$ machine with $1 < p \le n$, nonincreasing $g_i$'s and $\ell_i/g_i$'s, and $\ell_0/g_0 = O(n/p)$.
PROOF. The corollary follows by Theorem 4.11 and by applying Theorem 3.4 with $p^* = n$, $\sigma^m_i = 0$, and $\sigma^M_i = \Theta(n/2^i)$.
We remark that a tighter analysis of the algorithm and/or the adoption of different values for the recursion degree $k$, still independent of $p$ and $\sigma$, may yield slightly better efficiency. The two techniques recently proposed in Tang et al. [2015] to improve the parallelism of recursive cache-efficient dynamic programming algorithms might also have the potential to lead to improved bounds. However, it is an open problem to devise a network-oblivious algorithm that is $\Theta(1)$-optimal on the D-BSP for wide ranges of the machine parameters.
4.4.2. The (n, 2)-Stencil Problem. In this section, we present a network-oblivious algorithm for the $(n,2)$-stencil problem, which requires the evaluation of a DAG shaped as a three-dimensional array of side $n$. Both the algorithm and its analysis are a suitable adaptation of the ones for the $(n,1)$-stencil problem. To evaluate a three-dimensional domain, we make use of two types of subdomains that intuitively play the same role as the diamond for the $(n,1)$-stencil: the octahedron and the tetrahedron. An octahedron of side $n$ is the intersection of a $(2n-1, 2)$-stencil with the following eight half-spaces: ..., and $i_0 + i_1 \le 3(n-1)$; a tetrahedron of side $n$ is the intersection of a $(2n-1, 2)$-stencil with the following four half-spaces: $i_0 + i_1 \ge (n-1)$, $i_0 - i_1 \ge (n-1)$, $i_1 + i_2 \le 2(n-1)$, and $i_1 - i_2 \le 0$.
As shown in Bilardi and Preparata [1997], a three-dimensional array of side $n$ can be partitioned into 17 instances of (possibly truncated) octahedra or tetrahedra of side $n$ (see Figure 6 of Bilardi and Preparata [1997]). Our network-oblivious algorithm exploits this partition and is specified on $M(n^2)$. It consists of 17 stages, where in each stage the VPs take care of the evaluation of one polyhedron of the partition. We assume that at the beginning of the algorithm, the inputs are evenly distributed among the $n^2$ VPs and also impose that the inputs of each stage be evenly distributed among the VPs. The data movement required to guarantee the correct input distribution for each stage can be accomplished in $O(1)$ 0-supersteps, where each VP sends/receives $O(1)$ messages.
Let $k = 2^{\sqrt{\log n}}$. An octahedron of side $n$ can be partitioned into octahedra and tetrahedra of side $n/k$ in $\log k$ steps, where the $i$-th such step, with $1 \le i \le \log k$, refines a partition of the initial octahedron into octahedra or tetrahedra of side $n/2^{i-1}$ by decomposing each of these polyhedra into smaller ones of side $n/2^i$, according to the scheme depicted in Figure 5 of Bilardi and Preparata [1997]. The final partition is obtained at the end of the $\log k$-th step. The octahedra and tetrahedra of the final partition can be grouped in horizontal stripes in such a way that the polyhedra of each stripe can be evaluated in parallel. Consider first the set of octahedra of the partition. It can be seen that the projection of these octahedra on the $(i_0, i_2)$-plane coincides with the decomposition of the diamond DAG depicted in Figure 1. As a consequence, we can identify $2k-1$ horizontal stripes of octahedra, where each stripe contains up to $k^2$ octahedra of side $n/k$. Moreover, the interleaving of octahedra and tetrahedra in the basic decompositions of Figure 5 of Bilardi and Preparata [1997] implies that there is a stripe of tetrahedra between each pair of consecutive stripes of octahedra. Hence, there are also $(2k-1) - 1$ horizontal stripes of tetrahedra, each containing up to $k^2$ tetrahedra of side $n/k$. Overall, the octahedron of side $n$ is partitioned into $4k-3$ horizontal stripes of at most $k^2$ polyhedra of side $n/k$ each, where stripes of octahedra are interleaved with stripes of tetrahedra. With a similar argument, one can derive a partition of a tetrahedron of side $n$ into $2k-1 \le 4k-3$ horizontal stripes of at most $k^2$ polyhedra of side $n/k$ each, where stripes of octahedra are interleaved with stripes of tetrahedra.
Once the preceding preliminaries have been established, the network-oblivious algorithm to evaluate a three-dimensional array of side $n$ on $M(n^2)$ follows closely from the recursive strategy used for the $(n,1)$-stencil problem: the evaluation of an octahedron is accomplished in $4k-3$ nonoverlapping phases, in each of which the polyhedra (either octahedra or tetrahedra) of side $n/k$ in one horizontal stripe of the partition described earlier are recursively evaluated in parallel by distinct $M(n^2/k^2)$ submachines formed by disjoint segments of consecutively numbered VPs; a tetrahedron of side $n$ can be evaluated through a recursive strategy similar to the one for the octahedron within the same complexity bounds. As usual, we add to each superstep $O(1)$ dummy messages per VP to guarantee $(\Theta(1), n^2)$-wiseness.
THEOREM 4.13. The communication complexity of the preceding network-oblivious algorithm for the $(n,2)$-stencil problem when executed on $M(p,\sigma)$ is
$$H_A(n,p,\sigma) = O\left(n^2\,8^{\sqrt{\log n}}\right)$$
for every $1 < p \le n^2$ and $0 \le \sigma = O(n^2/p)$. The algorithm is $(\Theta(1), n^2)$-wise and $\Theta(1/8^{\sqrt{\log n}})$-optimal with respect to $\mathcal{C}_2$ on any $M(p,\sigma)$ with $1 < p \le n^2$ and $\sigma = O(n^2/p)$.
PROOF. Let $H_{\text{octahedron}}(n,p,\sigma)$ be the communication complexity required by the recursive strategy presented earlier for the evaluation of an octahedron of side $n$, when executed on $M(p,\sigma)$. The recursion depth of that strategy is $\tau = \lfloor\log_k n\rfloor$. First suppose that $p \le k^{2\tau}$. At every recursion level $i$, with $0 \le i < (\log_k p)/2$, the evaluation of each polyhedron of side $n_i = n/k^i$ is performed by $p/k^{2i} > 1$ processors, and each processor sends/receives $O(n^2/p)$ messages in $O(1)$ supersteps for processing the inputs and outputs of these subproblems; on the other hand, at every recursion level $i$, with $(\log_k p)/2 \le i \le \tau$, each polyhedron of side $n/k^i$ is evaluated by a single processor of $M(p,\sigma)$ and no communication takes place. Thus, the communication complexity satisfies the recurrence relation:
where we used the hypothesis $\sigma = O(n^2/p)$. Instead, when $k^{2\tau} < p \le n^2$, we have that at every recursion level $i$, with $0 \le i \le \tau$, the evaluation of each polyhedron of side $n_i = n/k^i$ is performed by $p/k^{2i} > 1$ processors. Then, since for $i = \tau$ the polyhedra of side $n_\tau = n/k^\tau$ are evaluated straightforwardly in $\Theta(n_\tau)$ supersteps, we obtain
where we used the hypothesis $\sigma = O(n^2/p)$ and the inequalities $n/k^\tau < k$ and $k^{2\tau} < p$. Similar upper bounds on the communication complexity can be proved for the evaluation of a tetrahedron of side $n$ and for the evaluation of truncated octahedra or tetrahedra.
Recall that the algorithm for the (n, 2)-stencil problem consists of 17 stages, where in each stage the VPs take care of the evaluation of one (possibly truncated) octahedron or tetrahedron of side n, and that the data movement that ensures the correct input distribution for each stage can be accomplished in O(1) 0-supersteps, where each VP sends/receives O(1) messages. This implies that
Since the strategies for the evaluation of (possibly truncated) octahedra or tetrahedra can be made $(\Theta(1), n^2)$-wise, through the introduction of suitable dummy messages, the overall algorithm is also $(\Theta(1), n^2)$-wise. Moreover, the algorithm complies with the requirements for belonging to $\mathcal{C}_2$, and hence the claimed optimality is a consequence of Lemma 4.10.
COROLLARY 4.14. The preceding network-oblivious algorithm for the $(n,2)$-stencil problem is $\Theta(1/8^{\sqrt{\log n}})$-optimal with respect to $\mathcal{C}_2$ on any D-BSP$(p, g, \ell)$ machine with $1 < p \le n^2$, nonincreasing $g_i$'s and $\ell_i/g_i$'s, and $\ell_0/g_0 = O(n^2/p)$.
PROOF. The corollary follows by Theorem 4.13 and by applying Theorem 3.4 with $p^* = n^2$, $\sigma^m_i = 0$, and $\sigma^M_i = \Theta(n^2/2^i)$.
Limitations of the Oblivious Approach
In this section, we establish a negative result by showing that for the broadcast problem, defined next, a network-oblivious algorithm can achieve O(1)-optimality on $M(p,\sigma)$ only for very limited ranges of $\sigma$. Let $V[0, 1, \ldots, n-1]$ be a vector of $n$ entries. The $n$-broadcast problem requires copying the value $V[0]$ into all other $V[i]$'s. Let $\mathcal{C}$ denote the class of static algorithms for the $n$-broadcast problem such that any $A \in \mathcal{C}$ for $v$ processing elements satisfies the following properties: (i) at least $\epsilon v$ processing elements hold entries of $V$, for some constant $0 < \epsilon \le 1$, and the distribution of the entries of $V$ among the processing elements cannot change during the execution of the algorithm, and (ii) all of the foldings of $A$ on $2^j$ processing elements, $1 \le j < \log v$, also belong to $\mathcal{C}$. The following theorem establishes a lower bound on the communication complexity of the algorithms in $\mathcal{C}$.
THEOREM 4.15. The communication complexity of any $n$-broadcast algorithm in $\mathcal{C}$ when executed on $M(p,\sigma)$, with $1 < p \le n$ and $\sigma \ge 0$, is $\Omega(\max\{2,\sigma\}\log_{\max\{2,\sigma\}} p)$.
PROOF. Let $A$ be an algorithm in $\mathcal{C}$. Suppose that the execution of $A$ on $M(p,\sigma)$ requires $t$ supersteps, and let $p_i$ denote the number of processors that "know" the value $V[0]$ by the end of the $i$-th superstep, for $1 \le i \le t$. Clearly, $p_0 = 1$ and $p_t \ge \epsilon p$, since by definition of $\mathcal{C}$, at least $\epsilon p$ processors hold entries of $V$ to be updated with the value $V[0]$. During the $i$-th superstep, $p_i - p_{i-1}$ new processors get to know $V[0]$. Since at the beginning of this superstep only $p_{i-1}$ processors know the value, we conclude that the superstep involves an $h$-relation with $h \ge (p_i - p_{i-1})/p_{i-1}$. Therefore, the communication complexity of $A$ is
Assuming without loss of generality that the p i 's are strictly increasing, we obtain
Standard calculus shows that the right-hand side is minimized (to within a constant factor) by choosing $t = \Theta(\log_{\max\{2,\sigma\}} p)$, and the claim follows.
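The minimization can be checked numerically with a small sketch. It assumes that the bound being minimized has the form t(max{2, σ} + p^{1/t}), which is the expression quoted later in the proof of Theorem 4.16, and it ignores the constant ε. The function names are hypothetical.

```python
import math


def proxy_cost(t, p, sigma):
    """Lower-bound expression t * (max{2, sigma} + p**(1/t)) for t supersteps."""
    return t * (max(2, sigma) + p ** (1.0 / t))


def best_supersteps(p, sigma):
    """Integer t in [1, log2 p] that minimizes the proxy cost."""
    t_max = max(1, int(math.log2(p)))
    return min(range(1, t_max + 1), key=lambda t: proxy_cost(t, p, sigma))


if __name__ == "__main__":
    p, sigma = 2 ** 20, 16
    m = max(2, sigma)
    t = best_supersteps(p, sigma)
    # Compare the minimizer with log_m p and the minimum with m * log_m p.
    print(t, math.log(p, m), proxy_cost(t, p, sigma) / (m * math.log(p, m)))
```

For these sample values the minimizer stays close to log_{max{2,σ}} p and the minimum is within a small constant factor of max{2, σ} log_{max{2,σ}} p, in line with the claim.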
The preceding lower bound is tight. Consider the following $M(p,\sigma)$ algorithm for $n$-broadcast. Let the entries of $V$ be evenly distributed among the processors, with $V[0]$ held by processor $P_0$. For convenience, we assume that $n$ is a power of 2. Let $\kappa$ be the smallest power of 2 greater than or equal to $\max\{2,\sigma\}$. The algorithm consists of $\lceil\log_\kappa p\rceil$ supersteps: in the $i$-th superstep, with $0 \le i < \lceil\log_\kappa p\rceil$, each $P_{jp/\kappa^i}$, with $0 \le j < \kappa^i$, sends the value $V[0]$ to $P_{(\kappa j+\ell)p/\kappa^{i+1}}$, for each $0 \le \ell < \kappa$. (When $\log_\kappa p$ is not an integer value, in the last superstep only values of $\ell$ that are multiples of $\kappa^{i+1}/p$ are used.) It is immediate to see that the algorithm belongs to $\mathcal{C}$ and that its communication complexity on $M(p,\sigma)$ is $O(\max\{2,\sigma\}\log_{\max\{2,\sigma\}} p)$.
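The broadcast scheme can be simulated with a short sketch. It assumes that p is a power of 2, that σ is a nonnegative integer, and that a superstep of degree h costs h + σ; this cost convention and the function name are assumptions made for illustration.

```python
def kappa_broadcast(p, sigma):
    """Simulate the kappa-ary broadcast on p processors.

    Returns (informed, supersteps, cost), where cost sums h + sigma over the
    supersteps and h is the maximum number of messages sent by a processor.
    """
    kappa = 2
    while kappa < max(2, sigma):            # smallest power of 2 >= max{2, sigma}
        kappa *= 2
    informed, supersteps, cost = {0}, 0, 0
    i = 0
    while kappa ** i < p:
        # q > 1 only in a partial last superstep (log_kappa p not an integer).
        q = max(1, kappa ** (i + 1) // p)
        newly, h = set(), 0
        for j in range(kappa ** i):
            src = j * p // kappa ** i
            assert src in informed          # every sender already knows V[0]
            sent = 0
            for ell in range(0, kappa, q):
                dst = (kappa * j + ell) * p // kappa ** (i + 1)
                if dst != src:              # ell = 0 is the sender itself
                    newly.add(dst)
                    sent += 1
            h = max(h, sent)
        informed |= newly
        cost += h + sigma
        supersteps += 1
        i += 1
    return informed, supersteps, cost


if __name__ == "__main__":
    print(kappa_broadcast(16, 3))
    print(kappa_broadcast(8, 3))
```

The second call exercises the case in which log_κ p is not an integer, where only every (κ^{i+1}/p)-th value of ℓ is used in the last superstep.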
Therefore, the algorithm is O(1)-optimal. Observe that the algorithm is aware of parameter $\sigma$, and, in fact, this knowledge is crucial to achieve optimality. To see this, we prove that any network-oblivious algorithm for $n$-broadcast can be $\Theta(1)$-optimal on $M(p,\sigma)$ only for limited ranges of $\sigma$. Let $H^*(n,p,\sigma)$ denote the best communication complexity achievable on $M(p,\sigma)$ by an algorithm for $n$-broadcast belonging to $\mathcal{C}$. By the preceding discussion, we know that $H^*(n,p,\sigma) = \Theta(\max\{2,\sigma\}\log_{\max\{2,\sigma\}} p)$. Let $A \in \mathcal{C}$ be a network-oblivious algorithm for $n$-broadcast specified on $M(v(n))$. For every $1 < p \le v(n)$ and $0 \le \sigma_1 \le \sigma_2$, we define the maximum slowdown incurred by $A$ with respect to the best $M(p,\sigma)$-algorithm in $\mathcal{C}$, for $\sigma \in [\sigma_1, \sigma_2]$, as
$$\mathrm{GAP}_A(n,p,\sigma_1,\sigma_2) = \max_{\sigma_1 \le \sigma \le \sigma_2} \frac{H_A(n,p,\sigma)}{H^*(n,p,\sigma)}.$$
THEOREM 4.16. Let A ∈ C be a network-oblivious algorithm for n-broadcast specified on M(v(n) ). For every 1 < p ≤ v(n) and 0 ≤ σ 1 ≤ σ 2 , we have
PROOF. The definition of function GAP implies that
$$\mathrm{GAP}_A(n,p,\sigma_1,\sigma_2) = \Omega\left(\frac{H_A(n,p,\sigma_1)}{H^*(n,p,\sigma_1)} + \frac{H_A(n,p,\sigma_2)}{H^*(n,p,\sigma_2)}\right).$$
Let $t$ be the number of supersteps executed by the folding of $A$ on $M(p,\sigma)$, and note that since $A$ is network oblivious, this number cannot depend on $\sigma$. By arguing as in the proof of Theorem 4.15 (see Inequality (7)), we get that $H_A(n,p,\sigma) = \Omega(t(\max\{2,\sigma\} + p^{1/t}))$, for any $\sigma$, and hence $\mathrm{GAP}_A(n,p,\sigma_1,\sigma_2)$ is bounded from below by
which is minimized for $t = \Theta(\log p/(\log\max\{2,\sigma_1\} + \log\log\max\{2,\sigma_2\}))$. Substituting this value of $t$ in the preceding formula yields the stated result.
An immediate consequence of the preceding theorem is that if a network-oblivious algorithm for $n$-broadcast is $\Theta(1)$-optimal on $M(p,\sigma)$, it cannot be simultaneously $\Theta(1)$-optimal on an $M(p,\sigma')$, for any $\sigma'$ sufficiently larger than $\sigma$. A similar limitation of the optimality of a network-oblivious algorithm for $n$-broadcast can be argued with respect to its execution on D-BSP$(p, g, \ell)$.
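The trade-off behind this limitation can be explored numerically. The sketch below is only a proxy: it takes t(max{2, σ} + p^{1/t}) as the cost of an algorithm that uses t supersteps regardless of σ, and measures, at the two endpoints σ1 and σ2, the slowdown with respect to the best t for each endpoint. The function names are hypothetical.

```python
import math


def proxy_cost(t, p, sigma):
    """Proxy cost t * (max{2, sigma} + p**(1/t)) of a t-superstep broadcast."""
    return t * (max(2, sigma) + p ** (1.0 / t))


def min_worst_slowdown(p, sigma1, sigma2):
    """Best achievable worst-case slowdown over {sigma1, sigma2} with one fixed t."""
    ts = range(1, max(1, int(math.log2(p))) + 1)
    # Best proxy cost at each endpoint when t may be tuned to sigma.
    best = {s: min(proxy_cost(t, p, s) for t in ts) for s in (sigma1, sigma2)}
    return min(max(proxy_cost(t, p, s) / best[s] for s in (sigma1, sigma2))
               for t in ts)


if __name__ == "__main__":
    p = 2 ** 20
    for sigma2 in (0, 2 ** 5, 2 ** 10, 2 ** 15):
        print(sigma2, min_worst_slowdown(p, 0, sigma2))
```

When σ1 = σ2 a single t is optimal and the slowdown is 1; once the two latencies are far apart, no single t is best for both endpoints under this proxy.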
EXTENSION TO THE OPTIMALITY THEOREM
The optimality theorem of Section 3 makes crucial use of the wiseness property. Broadly speaking, a network-oblivious algorithm is $(\Theta(1), p)$-wise when the communication performed in the various supersteps is somewhat balanced in the sense that the maximum number of messages sent/received by a virtual processor does not differ significantly from the average number of messages sent/received by other virtual processors belonging to the same region of suitable size. Although there exist $(\Theta(1), p)$-wise network-oblivious algorithms for a number of important problems, as shown in Section 4, there are cases where wiseness may not be guaranteed.
As a simple example of poor wiseness, consider a network-oblivious algorithm $A$ for $M(n)$ consisting of one 0-superstep where VP$_0$ sends $n$ messages to VP$_{n/2}$. Fix $p$ with $2 \le p \le n$. Clearly, for each $1 \le j \le \log p$, we have that $H_A(n, 2^j, 0) = n$, and hence the algorithm is $(\alpha, p)$-wise only for $\alpha = O(1/p)$. When executed on a D-BSP$(p, g, 0)$, the communication time of the algorithm is $n g_0$. However, as already observed in Bilardi et al. [2007a], under reasonable assumptions the communication time of the algorithm's execution on the D-BSP can be improved by first evenly spreading the $n$ messages among clusters of increasingly larger size that include the sender, then gathering the messages within clusters of increasingly smaller size that include the receiver. Motivated by this observation, we introduce a more effective protocol to execute network-oblivious algorithms on the D-BSP. By employing this protocol, we are able to prove an alternative optimality theorem that requires a much weaker property than wiseness at the expense of a slight (polylogarithmic) loss of efficiency.
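The effect of folding on this example can be seen in a small sketch. It assumes that folding assigns v/p consecutively numbered VPs to each processor, that messages between VPs of the same processor are local, and that the degree of a superstep is the maximum number of messages sent or received by a processor. The function name is hypothetical.

```python
def folded_degree(messages, v, p):
    """Degree h of one superstep of M(v) when folded on p processors.

    messages: list of (source VP, destination VP) pairs; p must divide v.
    """
    block = v // p                       # consecutive VPs per processor
    sent, recv = [0] * p, [0] * p
    for src, dst in messages:
        ps, pd = src // block, dst // block
        if ps != pd:                     # same-processor messages are local
            sent[ps] += 1
            recv[pd] += 1
    return max(max(sent), max(recv))


if __name__ == "__main__":
    n = 16
    skewed = [(0, n // 2)] * n                            # VP_0 sends n messages
    balanced = [(i, (i + n // 2) % n) for i in range(n)]  # one message per VP
    for p in (2, 4, 8, 16):
        print(p, folded_degree(skewed, n, p), folded_degree(balanced, n, p))
```

For the skewed pattern the folded degree stays at n for every p, whereas for the balanced pattern, which moves the same number of messages, it shrinks as n/p.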
Let $A$ be a network-oblivious algorithm specified on $M(v(n))$, and consider its execution on a D-BSP$(p, g, \ell)$, with $1 \le p \le v(n)$. As before, each D-BSP processor $P_j$, with $0 \le j < p$, carries out the operations of the $v(n)/p$ consecutively numbered VPs of $M(v(n))$ starting with VP$_{j(v(n)/p)}$. However, the communication required by each superstep is now performed on D-BSP more effectively by enforcing a suitable balancing. More precisely, each $i$-superstep $s$ of $A$, with $0 \le i < \log p$, is executed on the D-BSP through the following protocol, which we will call the ascend-descend protocol:
(1) Computation phase: Each D-BSP processor performs the local computations of its assigned virtual processors.
(2) Ascend phase: For $k = \log p - 1$ down to $i + 1$: within each $k$-cluster, the messages that originate in the cluster but are destined outside it are evenly distributed among the $p/2^k$ processors of the cluster.
(3) Descend phase: For $k = i$ to $\log p - 1$: within each $k$-cluster, the messages currently residing in the cluster are evenly distributed among the processors of the $(k+1)$-clusters inside it that contain their final destinations.
Observe that each iteration of the ascend/descend phases requires a prefix-like computation to assign suitable intermediate destinations to the messages to guarantee their even distribution in the appropriate clusters.
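One way to realize such a prefix-like assignment is sketched below. It assumes that each processor of a cluster knows how many messages it holds, that an exclusive prefix sum gives every message a global rank within the cluster, and that the message of rank r is sent to processor r mod m; this round-robin rule and the function name are assumptions, since the text does not fix a specific rule.

```python
from itertools import accumulate


def balance(counts):
    """Evenly redistribute messages among the m processors of one cluster.

    counts[q] is the number of messages initially held by processor q.
    Returns (plan, loads): plan lists (source, destination) processor pairs,
    loads[q] is the number of messages held by processor q afterwards.
    """
    m = len(counts)
    # Exclusive prefix sums: rank of the first message held by each processor.
    offsets = [0] + list(accumulate(counts))[:-1]
    loads = [0] * m
    plan = []
    for src, (off, c) in enumerate(zip(offsets, counts)):
        for rank in range(off, off + c):
            dst = rank % m               # round-robin on the global rank
            plan.append((src, dst))
            loads[dst] += 1
    return plan, loads


if __name__ == "__main__":
    print(balance([10, 0, 0, 2])[1])
    print(balance([5, 0, 0])[1])
```

With this rule the final loads differ by at most one message, which is the even distribution that each iteration of the ascend and descend phases relies on.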
LEMMA 5.1. Let $A$ be a network-oblivious algorithm specified on $M(v(n))$, and consider its execution on D-BSP$(p, g, \ell)$, with $1 < p \le v(n)$, using the ascend-descend protocol. Let $s$ be an $i$-superstep, for some $0 \le i < \log p$, and let $\xi_s$ be the sequence of supersteps employed by the protocol for executing $s$. Then, for every $i < k < \log p$, $\xi_s$ comprises $O(1)$ $k$-supersteps of degree $O(2^k h^s_A(n, 2^k)/p)$ and $O(\log p)$ $k$-supersteps each of constant degree.
PROOF. Consider iteration $k$ of the ascend phase of the protocol, with $i + 1 \le k \le \log p - 1$, and a $k$-cluster. As invariant at the beginning of the iteration, we have that the at most $h^s_A(n, 2^{k+1})$ messages originating in each $(k+1)$-cluster included in the $k$-cluster and destined outside the $k$-cluster are evenly distributed among the processors of that $(k+1)$-cluster. Hence, the even distribution of these messages among the $p/2^k$ processors of the $k$-cluster requires a prefix-like computation and an $O(\lceil 2^{k+1} h^s_A(n, 2^{k+1})/p \rceil)$-relation within the $k$-cluster. Consider now iteration $k$ of the descend phase of the protocol, with $i \le k \le \log p - 1$, and a $k$-cluster. As invariant at the beginning of the iteration, we have that the at most $2h^s_A(n, 2^{k+1})$ messages to be moved in the iteration are evenly distributed among the processors of the $k$-cluster. Since each $(k+1)$-cluster included in the $k$-cluster receives at most $h^s_A(n, 2^{k+1})$ messages, the iteration requires a prefix-like computation and an $O(\lceil 2^{k+1} h^s_A(n, 2^{k+1})/p \rceil)$-relation within the $k$-cluster. The lemma follows, as each prefix-like computation in a $k$-cluster can be performed in $O(\log p)$ $k$-supersteps of constant degree (e.g., using a straightforward tree-based strategy [JáJá 1992]).
We now define the notion of fullness, which is weaker than wiseness but which still allows us to port the optimality of network-oblivious algorithms with respect to the evaluation model onto the execution machine model, at the price of some loss of efficiency.
Definition 5.2. A static network-oblivious algorithm A specified on M(v(n)) is said to be (γ, p)-full, for some γ > 0 and 1 < p ≤ v(n), if the folding of A on M(2 j , 0) satisfies
for every 1 ≤ j ≤ log p and input size n.
It is easy to see that a $(\Theta(1), p)$-wise network-oblivious algorithm $A$ is also $(\Theta(1), p)$-full as long as $h^s_A(n, p) \ge 1$, for every $i$-superstep $s$ of $A$ and every $1 < p \le v(n)$. On the other hand, a $(\Theta(1), p)$-full algorithm is not necessarily $(\Theta(1), p)$-wise, as witnessed by the previously mentioned network-oblivious algorithm consisting of a single 0-superstep where VP$_0$ sends $n$ messages to VP$_{n/2}$, which is $(\Theta(1), p)$-full but not $(\Theta(1), p)$-wise, for any $2 \le p \le n$. In this sense, $(\gamma, p)$-fullness is a weaker condition than $(\Theta(1), p)$-wiseness.
The following theorem shows that when $(\gamma, p)$-full algorithms are executed on the D-BSP using the ascend-descend protocol, optimality in the evaluation model is preserved on the D-BSP within a polylogarithmic factor. As in Section 3, let $\mathcal{C}$ denote a class of static algorithms solving a given problem, with the property that for any algorithm $A \in \mathcal{C}$ for $v$ processing elements, all of its foldings on $2^j$ processing elements, for each $1 \le j < \log v$, also belong to $\mathcal{C}$.
THEOREM 5.3. Let $A \in \mathcal{C}$ be a $(\gamma, p^*)$-full network-oblivious algorithm for some $\gamma > 0$ and $p^*$ a power of 2. Let also $\{\sigma^m_0, \sigma^m_1, \ldots, \sigma^m_{\log p^* - 1}\}$ and $\{\sigma^M_0, \sigma^M_1, \ldots, \sigma^M_{\log p^* - 1}\}$ be two vectors of nonnegative values, with $\sigma^m_j \le \sigma^M_j$, for every $0 \le j < \log p^*$. If $A$ is $\beta$-optimal on $M(2^j, \sigma)$ with respect to $\mathcal{C}$, for $\sigma^m_{j-1} \le \sigma \le \sigma^M_{j-1}$ and $1 \le j \le \log p^*$, then for every $p$ power of 2, $p \le p^*$, $A$ is $\Theta(\beta/((1 + 1/\gamma)\log^2 p))$-optimal on D-BSP$(p, g, \ell)$ with respect to $\mathcal{C}$ when executed with the ascend-descend protocol, as long as
- the execution of $A$ on D-BSP$(p, g, \ell)$ using the ascend-descend protocol is in $\mathcal{C}$; ...
PROOF. Consider the execution of $A$ on a D-BSP$(p, g, \ell)$ using the ascend-descend protocol. Let $\tilde{A}$ denote the actual sequence of supersteps performed on the D-BSP in this execution of $A$. Note that once the D-BSP parameters are fixed, $\tilde{A}$ can be regarded as a network-oblivious algorithm specified on $M(p)$. Clearly, any optimality considerations on the communication time of the execution of $\tilde{A}$ (regarded as a network-oblivious algorithm) on D-BSP$(p, g, \ell)$ using the standard protocol will also apply to the communication time of the execution of $A$ on D-BSP$(p, g, \ell)$ using the ascend-descend protocol, since the communication time of $A$ and $\tilde{A}$ is the same.
We will assess the degree of optimality of the communication time of $\tilde{A}$ by resorting to Theorem 3.4. This entails analyzing the communication complexity of $\tilde{A}$ on $M(2^j, \sigma)$, for any $1 \le j \le \log p$, and determining its wiseness. Focus on $M(2^j, \sigma)$ for some $1 \le j \le \log p$, and consider an arbitrary $i$-superstep $s$ of $A$, for some $0 \le i < j$. Let $\xi_s$ be the sequence of supersteps in $\tilde{A}$ executed in the ascend and descend phases associated with superstep $s$. From Lemma 5.1, we know that for every $i < k < \log p$, $\xi_s$ comprises $O(1)$ $k$-supersteps of degree $O(2^k h^s_A(n, 2^k)/p)$ and $O(\log p)$ $k$-supersteps each of constant degree. Now, in the execution on $M(2^j, \sigma)$, a $k$-superstep with $k \ge j$ becomes local to the processors and does not contribute to the communication complexity. Since each processor of $M(2^j, \sigma)$ corresponds to $p/2^j$ processors of $M(p)$, the communication complexity on $M(2^j, \sigma)$ contributed by the sequence $\xi_s$ is
Therefore, since $h^s_A(n, 2^k) \le 2^{j-k} h^s_A(n, 2^j)$, the preceding summation is upper bounded by
Recall that $L^i_A(n)$ denotes the set of $i$-supersteps executed by $A$, and $S^i_A(n) = |L^i_A(n)|$. Thus, the communication complexity of $\tilde{A}$ on $M(2^j, \sigma)$ can be written as
where the last inequality follows by the $(\gamma, p^*)$-fullness of $A$.
The preceding inequality shows that algorithm $\tilde{A}$ is $\Theta(\beta/((1 + 1/\gamma)\log^2 p))$-optimal as a consequence of the $\beta$-optimality of $A$. Let us now assess the wiseness of $\tilde{A}$. Consider again the sequence $\xi_s$ of supersteps of $\tilde{A}$ associated with an arbitrary $i$-superstep $s$ of $A$, for some $0 \le i < \log p$. We know that for every $i < k < \log p$, $\xi_s$ comprises $O(1)$ $k$-supersteps of degree $O(2^k h^s_A(n, 2^k)/p)$ and $O(\log p)$ $k$-supersteps each of constant degree. Moreover, we can assume that suitable dummy messages are added so that in a $k$-superstep of degree $O(2^k h^s_A(n, 2^k)/p)$ (respectively, degree $O(1)$) all processors of a $(k+1)$-cluster send $\Theta(2^k h^s_A(n, 2^k)/p)$ (respectively, $\Theta(1)$) messages to the sibling $(k+1)$-cluster included in the same $k$-cluster. It is easy to see that the preceding considerations about the optimality of $\tilde{A}$ remain unchanged, whereas $\tilde{A}$ becomes $(\Theta(1), p)$-wise. Finally, we recall that $\tilde{A}$ belongs to class $\mathcal{C}$ by hypothesis, and this remains true even after forcing it to be wise. Therefore, by applying Theorem 3.4 to $\tilde{A}$, we can conclude that $\tilde{A}$, and hence $A$, is $\Theta(\beta/((1 + 1/\gamma)\log^2 p))$-optimal on a D-BSP$(p, g, \ell)$ with parameters satisfying the initial hypotheses.
As remarked earlier, the fullness requirement is considerably less stringent than wiseness. Algorithmic strategies that could benefit from this weaker requirement might be, for example, those designed for processor networks characterized by low-bandwidth decompositions into subnets. Typical communication patterns arising in these strategies may not feature constant wiseness since at each level of the decomposition a small fraction of boundary processors communicates across subnets, whereas they may exhibit constant fullness as long as a sufficiently large number of messages are exchanged among these boundary processors.
We conclude this section by observing that the relation stated by Theorem 5.3 between optimality in the evaluation model and optimality in D-BSP can be tightened when the $g_i$ and $\ell_i$ parameters of the D-BSP decrease geometrically. In this case, it is known that a prefix-like computation within a $k$-cluster, for $0 \le k < \log p$, can be performed in $O(g_k + \ell_k)$ communication time (e.g., see Proposition 2.2.2 in Bilardi et al. [2007a]). Then, by an argument similar to the one used to prove Theorem 5.3, it can be shown that a $(\gamma, p)$-full algorithm $A$ that is $\beta$-optimal in the evaluation model becomes $\Theta(\beta/((1 + 1/\gamma)\log p))$-optimal when executed on the D-BSP, thus reducing by a factor $\log p$ the gap between the two optimality factors.
CONCLUSIONS
We introduced a framework to explore the design of network-oblivious algorithms, that is, algorithms that run efficiently on machines with different processing power and different bandwidth/latency characteristics, without making explicit use of architectural parameters for tuning performance. In the framework, a network-oblivious algorithm is written for $v(n)$ virtual processors (specification model), where $n$ is the input size and $v(\cdot)$ a suitable function. Then, the performance of the algorithm is analyzed in a simple model (evaluation model) consisting of $p \le v(n)$ processors and where the impact of the network topology on communication costs is accounted for by a latency parameter $\sigma$. Finally, the algorithm is executed on the D-BSP model [de la Torre and Kruskal 1996; Bilardi et al. 2007a] (execution machine model), which describes reasonably well the behavior of a large class of point-to-point networks by capturing their hierarchical structures. A D-BSP consists of $p \le v(n)$ processors, and its network topology is described by the $\log p$-size vectors $g$ and $\ell$, which account for bandwidth and latency costs within nested clusters, respectively. We have shown that for static network-oblivious algorithms, where the communication requirements depend only on the input size and not on the specific input instance (e.g., algorithms arising in DAG computations), the optimality on the evaluation model for certain ranges of $p$ and $\sigma$ translates into optimality on the D-BSP model for corresponding ranges of the model's parameters. This result justifies the introduction of the evaluation model that allows for a simple analysis of network-oblivious algorithms while effectively bridging the performance analysis to D-BSP, which more accurately models the communication infrastructure of parallel platforms through a logarithmic number of parameters.
We devised $\Theta(1)$-optimal static network-oblivious algorithms for prominent problems such as matrix multiplication, FFT, and sorting, although in the case of sorting, optimality is achieved only when the available parallelism is polynomially sublinear in the input size. In addition, we devised suboptimal, yet efficient, network-oblivious algorithms for stencil computations, and we explored limitations of the oblivious approach by showing that for the broadcasting problem, optimality in D-BSP can be achieved by a network-oblivious algorithm only for rather limited ranges of the parameters. Similar negative results were also proved in the realm of cache-oblivious algorithms (e.g., see Peserico [2001], Brodal and Fagerberg [2003], and Silvestri [2006, 2008]). Despite these limitations, the pursuit of oblivious algorithms appears worthwhile even when the outcome is a proof that no such algorithm can be $\Theta(1)$-optimal on an ample class of target machines. Indeed, the analysis behind such a result is likely to reveal what kind of adaptivity to the target machine is necessary to obtain optimal performance.
The present work can be naturally extended in several directions, some of which are briefly outlined next. First, it would be useful to further assess the effectiveness of our framework by developing novel efficient network-oblivious algorithms for prominent problems beyond the ones of this article. Some progress in this direction has been made in Chowdhury et al. [2013] and Demmel et al. [2013]. For the problems considered here, particularly sorting and stencil computations, it would be very interesting to investigate the potential of the network-oblivious approach more fully. More generally, it would be interesting to develop lower-bound techniques to limit the level of optimality that network-oblivious algorithms can reach on certain classes of target platforms. Another challenging goal concerns the generalization of the results of Theorems 3.4 and 5.3 to a wider class of algorithms, such as by removing the restriction to static algorithms and/or by weakening the assumptions (wiseness or fullness) required to prove these theorems. It would also be useful to identify other classes of machines for which network-oblivious algorithms can be effective. Another open problem is to augment our framework by incorporating memory constraints in the evaluation model to study the interplay between communication, parallelism, and memory. In this context, it is important to devise suitable schedulers that map network-oblivious algorithms on the evaluation model without violating the memory constraints and to study the inherent trade-offs for fundamental problems. Preliminary results in these directions include space-bounded schedulers for multicores (e.g., Chowdhury et al. [2013] and Simhadri et al. [2014]) and trade-offs for linear algebra problems (e.g., Irony et al. [2004] and Ballard et al. [2011, 2012]).
More generally, it would be very interesting to extend our work to computing scenarios, such as traditional time-shared systems and emerging global computing environments, where the amount of resources devoted to a specific application can itself vary dynamically over time, in the same spirit as Bender et al. [2014] generalized the cache-oblivious framework to environments in which the amount of memory available to an algorithm can fluctuate.
Finally, we observe that some of the network-oblivious algorithms presented in this article share a similar structure with their cache-oblivious counterparts (e.g., see the matrix multiplication and FFT algorithms). It would be interesting to explore whether there is a deeper relation between the two kinds of obliviousness. We conjecture that cache-oblivious algorithms can be obtained by simulating network-oblivious ones using a suitable adaptation of the technique developed in Pietracaprina et al. [2006]. However, the other direction seems far more challenging, as cache-oblivious algorithms do not necessarily have to exhibit parallelism. The ultimate goal would be the integration of cache- and network-obliviousness in a unified framework for the development of machine-oblivious computations. The results obtained by Blelloch et al. [2010] and Chowdhury et al. [2013] in the context of shared-memory platforms could be a source of inspiration toward this goal.
APPENDIX

A. LIST OF NOTATIONS AND SYMBOLS
The following table summarizes the most important notations and symbols used in the article.
| Notation/Symbol | Meaning |
| --- | --- |
| $n$ | Input size. |
| $M(v)$ | Computational model that underlies the specification, evaluation, and execution models. It consists of $v$ processing elements. |
| $M(v(n))$ | Specification model with $v(n)$ virtual processors. |
| $M(p, \sigma)$ | Evaluation model with $p$ processors. |
| D-BSP$(p, g, \ell)$ | Execution model with $p$ processors. |
| $v$ | Number of processing elements in the underlying model. The symbol $v$ can thus refer to any (specification, evaluation, or execution) model. |
| $v(n)$ | Number of virtual processors in the specification model. |
| $p$ | Number of processors in the evaluation or execution models. |
| $\sigma$ | Latency parameter in the evaluation model $M(p, \sigma)$. |
| $g = (g_0, g_1, \ldots, g_{\log p - 1})$ | Bandwidth parameters of the execution model D-BSP$(p, g, \ell)$. |
| $\ell = (\ell_0, \ell_1, \ldots, \ell_{\log p - 1})$ | Latency parameters of the execution model D-BSP$(p, g, \ell)$. |
| $L^i_A(I)$ (respectively, $L^i_A(n)$) | Set of $i$-supersteps executed by an algorithm $A$ on input $I$ (respectively, by a static algorithm $A$ on an input of size $n$). |
