We give lower bounds on the communication complexity required to solve several computational problems in a distributed-memory parallel machine, namely standard matrix multiplication, stencil computations, comparison sorting, and the Fast Fourier Transform. We revisit the assumptions under which preceding results were derived and provide new lower bounds which use much weaker and appropriate hypotheses. Our bounds rely on a mild assumption on work distribution, and strengthen previous results which require either the computation to be balanced among the processors, or specific initial distributions of the input data, or an upper bound on the size of processors' local memories.
Introduction
Communication is a major factor determining the performance of algorithms on current computing systems, as the time and energy needed to transfer data between processing and storage elements are often significantly higher than those needed to perform arithmetic operations. The gap between computation and communication costs, which is ultimately due to basic physical principles, is expected to widen as architectural advances make it possible to build systems of increasing size and complexity. Hence, the cost of data movement will play an even greater role in future years.
As in all endeavors where performance is systematically pursued, it is important to evaluate the distance from optimality of a proposed algorithmic solution, by establishing appropriate lower bounds. Given the well-known difficulty of establishing lower bounds, results are often obtained under restrictive assumptions that may severely limit their applicability. It is therefore important to progressively reduce or fully eliminate such restrictions.
In this spirit, we consider lower bounds on the amount of communication that is required to solve some classical computational problems on a distributed-memory parallel system. Specifically, we revisit the assumptions and constraints under which preceding results were derived, and prove new lower bounds which use much weaker hypotheses and thus have wider applicability. Even when the functional form of the bounds remains the same, our results do yield new insights to algorithm developers since they might reveal if some settings are needed, or not, in order to obtain better performance.
We model the machine using the standard Bulk Synchronous Parallel (BSP) model of computation [32] , which consists of a collection of p processors, each equipped with an unbounded private memory and communicating with each other through a communication network. The starting point of any investigation of algorithms for a distributed-memory model is the specification of an I/O protocol, which defines where input elements reside at the beginning of the computation and where the outputs produced by the algorithm must be placed. The distribution of inputs and outputs effectively forms a part of the problem specification, thus restricting the applicability of upper and lower bounds. Much of previous work on BSP algorithms considers a version of the BSP model equipped with an additional external memory, which serves as the source of the input and the destination for the output (see, e.g., [31] ). This modification significantly alters the spirit of the BSP of serving as a model for distributed-memory machines, making it very similar to shared-memory models like the LPRAM [1] . In fact, in a distributed-memory machine, the inputs might already be distributed in some manner prior to the invocation of the algorithm, and the outputs are usually left distributed in the processors' local memories at the end of the execution, especially if the computation is a subroutine of a larger computation. Thus, lower bounds that use this assumption, which essentially exploit this "hack" to guarantee that acquiring the n input elements contributes to the communication cost of algorithms (as some processor must read at least ⌈n/p⌉ input values), are not directly applicable to distributed-memory architectures.
Other authors, within the original BSP model, assume specific distributions of the input data. As we shall see later, it is usually assumed that the input is initially evenly distributed among the p processors. However, this apparently "reasonable" hypothesis is not part of the logic of the BSP model. In fact, the physical distribution of input data across the processors may depend on several factors, ranging from how the inputs get acquired to the file system policies. Moreover, this hypothesis may lead to unsatisfactory communication lower bounds. Consider, e.g., the computation of a directed acyclic graph (DAG) with "few" input nodes and a "long" critical path. In this case, naive algorithms which entrust the whole computation to one processor might be communication optimal. This is misleading, since it steers towards algorithms which are not parallel at all.
One possibility to overcome both the issues discussed above is to require, in place of the even distribution of the inputs and of the presence of an external memory, that algorithms exhibit some level of load balancing of the computation. Typically, if W denotes the total work required by any algorithm to solve the given problem, it is required that each processor performs O (W/p) elementary computations. However, this way we are assuming, but not proving, that optimal solutions balance computation. In fact, in general there is a tradeoff between computation costs and communication costs. Some papers (see, e.g., [22, 35] ) quantify such tradeoffs by establishing lower bounds on the communication cost of any algorithm as a function of its computation time. Nevertheless, results of this kind usually indicate that the higher lower bounds on communication correspond only to perfectly (to within constant factors) work-balanced computations, and such bounds are tight since achieved by balanced algorithms. This leaves open the possibility that a substantial saving on communication costs could actually be achieved at a price of a small unbalance of the computation loads.
Another common assumption is an upper bound on the size of processors' local memories. However, current technological advances make it possible to build cheap memory and storage devices that, for many applications, allow a single machine to store the whole input data set and the intermediate data. Moreover, results derived under this assumption are less general than results that put no limits on the amount of storage available to processors; indeed, lower bounds are relatively easier to establish, as the model essentially becomes a parallel version of the standard external memory (EM) model for sequential computations, for which many more results and techniques are known (see, e.g., [16, 2]).
In contrast, the lower bounds presented in this paper do not hinge on any of the above assumptions. We develop new lower bounds for a number of key computational problems, namely standard matrix multiplication, stencil computations, comparison sorting, and the Fast Fourier Transform, using the weak assumption that no processor performs more than a constant fraction of the total required work. This requires more involved arguments, and substantially strengthens previous work on communication lower bounds for distributed-memory computations.
The model. The Bulk Synchronous Parallel (BSP) model of computation was introduced by Valiant [32] as a bridging model for general-purpose parallel computing. The architectural component of the model consists of $p$ processing elements $P_0, P_1, \ldots, P_{p-1}$, each equipped with an unbounded local memory, interconnected by a communication medium. The execution of a BSP algorithm consists of a sequence of supersteps, where each processor can perform operations on data in its local memory, send messages and, at the end, execute a global synchronization. The running time of the $i$-th superstep is expressed in terms of two parameters, $g$ and $\ell$, as
$$T_i = w_i + h_i g + \ell,$$
where $w_i$ is the maximum number of local operations performed by any processor, and $h_i$ is the maximum number of messages sent or received by any processor. The running time $T_{\mathcal{A}}$ of a BSP algorithm $\mathcal{A}$ is the sum of the times of its supersteps and can be expressed as $W_{\mathcal{A}} + H_{\mathcal{A}} g + S_{\mathcal{A}} \ell$, where $S_{\mathcal{A}}$ is the number of supersteps, $W_{\mathcal{A}} = \sum_{i=1}^{S_{\mathcal{A}}} w_i$ is the local computation complexity, and $H_{\mathcal{A}} = \sum_{i=1}^{S_{\mathcal{A}}} h_i$ is the communication complexity.
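As a minimal sketch of the cost model just described, the following Python fragment accumulates the three BSP cost components from a list of supersteps. The names `BspCost` and `bsp_cost`, and the representation of a superstep as two per-processor lists, are illustrative assumptions and not part of the model's definition.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class BspCost:
    """Cost components of a BSP algorithm (hypothetical helper type)."""
    W: int  # local computation complexity: sum over supersteps of w_i
    H: int  # communication complexity: sum over supersteps of h_i
    S: int  # number of supersteps

    def running_time(self, g: float, ell: float) -> float:
        # T = W + H*g + S*ell, as in the BSP model
        return self.W + self.H * g + self.S * ell


def bsp_cost(supersteps: Sequence[Tuple[List[int], List[int]]]) -> BspCost:
    """Each superstep is (ops, msgs): ops[j] is the number of local
    operations of processor j, and msgs[j] is the larger of the number
    of messages it sends and the number it receives (an assumption of
    this sketch about how the input is encoded)."""
    W = H = 0
    for ops, msgs in supersteps:
        W += max(ops)   # w_i: maximum local work in superstep i
        H += max(msgs)  # h_i: maximum messages sent or received
    return BspCost(W=W, H=H, S=len(supersteps))


# Two supersteps on two processors.
cost = bsp_cost([([4, 2], [1, 3]), ([5, 5], [0, 0])])
total = cost.running_time(g=2, ell=10)
```

Only the term `H` is the subject of the lower bounds in this paper; `W` and `S` are shown for completeness.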
Previous work. The complexity of communication on various models of computation has received considerable attention. Lower bounds are often established through adaptations of the techniques of Hong and Kung [16] for hierarchical memory, or by critical path arguments, such as those in [1] . For applications of these and other techniques see [22, 2, 25, 15, 8, 6, 17, 24, 4, 9, 5] as well as [26] and references therein. In the following, we discuss previous work on lower bounds for the communication complexity of the problems studied in this paper.
A standard computational problem is the multiplication of two $n \times n$ matrices. For the classical $\Theta(n^3)$ algorithm, an $\Omega(n^2/p^{2/3})$ lower bound has been previously derived for the BSP [30] and the LPRAM [1]. However, both results hinge on the hypothesis that the input initially resides outside the processors' local memories and thus must be read, contributing to the communication complexity of the algorithms. As such, these results are an immediate consequence of a result of [16] (then restated in [17]) which, loosely speaking, bounds from above the amount of computation that can be performed with a given quantity of data. When the input is assumed to be initially evenly distributed across the $p$ processors' local memories, the same lower bound is claimed in [11]. Recently, Ballard et al. [3] obtained a result of the same form by assuming perfectly balanced (to within constant factors) computations, and disallowing any initial replication of inputs. Restricting to balanced computations allows one to reduce to the situation where inputs are evenly distributed: in fact, it is easy to prove that, given a problem on $n$ inputs which requires $N$ operations to be solved, if each processor performs at most $\alpha N/p$ operations, for some $\alpha$ with $1 \le \alpha \le p/2$, then there exists one processor which initially holds at most $2\alpha n/p$ inputs and performs at least $\frac{\alpha}{2\alpha-1} \cdot \frac{N}{p}$ operations. The very same bound was also found by Irony et al. [17], who restrict their attention to computations that take place on machines where processors' local memory size is assumed to be $M = O(n^2/p^{2/3})$. Finally, Solomonik and Demmel [27] investigate tradeoffs between input replication and communication complexity (see also [4]).
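The reduction from balanced computations to evenly distributed inputs mentioned above can be checked with a short counting argument. The sketch below assumes that inputs are not replicated, so that the numbers $n_j$ of inputs initially held by the processors sum to $n$, and that the operation counts $N_j$ sum to at least $N$; the symbols $n_j$, $N_j$, $\mathcal{H}$, $\mathcal{L}$, $h$ and $f$ are introduced here for illustration only.

```latex
\begin{align*}
\mathcal{H} &= \{\, j : n_j > 2\alpha n/p \,\}, \qquad
\mathcal{L} = \{0,\ldots,p-1\} \setminus \mathcal{H}, \qquad h = |\mathcal{H}|,\\[2pt]
h \cdot \frac{2\alpha n}{p} &< \sum_{j \in \mathcal{H}} n_j \le n
  \quad\Longrightarrow\quad h < \frac{p}{2\alpha} \le \frac{p}{2},\\[2pt]
\sum_{j \in \mathcal{L}} N_j &\ge N - h \cdot \frac{\alpha N}{p}
  \quad\Longrightarrow\quad
  \max_{j \in \mathcal{L}} N_j \ge f(h) := \frac{N\,(1 - \alpha h/p)}{p - h},\\[2pt]
f'(h) &= \frac{N\,(1-\alpha)}{(p-h)^2} \le 0
  \quad\Longrightarrow\quad
  f(h) \ge f\!\left(\frac{p}{2\alpha}\right)
  = \frac{N/2}{p\,(1 - 1/(2\alpha))}
  = \frac{\alpha}{2\alpha - 1} \cdot \frac{N}{p}.
\end{align*}
```

Hence some processor in $\mathcal{L}$, which by definition initially holds at most $2\alpha n/p$ inputs, performs at least $\frac{\alpha}{2\alpha-1} \cdot \frac{N}{p}$ operations.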
A class of computations ubiquitous in scientific computing is that of stencil computations, where each computing node in a multi-dimensional grid is updated with weighted values contributed by neighboring nodes. These computations include the diamond DAG in the two-dimensional case and the cube DAG in three dimensions. For the former, Papadimitriou and Ullman [22] present a communication-time tradeoff which yields a tight $\Omega(n)$ lower bound on the communication complexity only for the case of balanced computations. Aggarwal et al. [1] extend this result to all algorithms whose computational complexity is within a constant factor of the number of nodes of the DAG. To the best of our knowledge, this is the sole example of a tight lower bound that holds under the same hypothesis used in this paper. By generalizing the technique in [22], Tiskin [30] establishes a tight bound for the cube DAG, and claims its extension to higher dimensions. However, these results only hold when the computational load is balanced among the $p$ processors.
Another key problem is sorting. Many papers assume that the $n$ inputs initially reside outside processors' local memories, thus obtaining an $\Omega(n/p)$ lower bound which turns out to be tight when it is additionally assumed that problem instances have sufficient slackness, that is, $n \gg p$ (e.g., $p^2 \le n$ is a common assumption). Under some technical assumptions, a bound of the form $\Omega(n \log n/(p \log(n/p)))$, which is tight for all values of $p \le n$, was first given within the LPRAM model [1]. This bound, however, includes the cost of reading the input from the shared memory. A similar lower bound was derived later by Goodrich [15] within the BSP model, but the result holds only for the subclass of algorithms performing supersteps of degree $h = \Theta(n/p)$, and when the inputs are evenly distributed among the processors.
Previous work on the communication required to compute an FFT DAG of size n is similar to previous work for sorting. By exploiting the property that, as shown in [34] , the cascade of three FFT networks has the topology of a full sorting network, the aforementioned lower bounds for sorting also hold for the FFT DAG. In a recent paper [9] , we obtain the same result assuming that the maximum number of outputs held by any processor at the end of the algorithm is at most n/2, and without assumptions on the distribution of the input and of the computational loads; while these hypotheses are not equivalent to the one we are using in this paper, the result in [9] is the closest to the one that we will develop in Section 5.
Our contribution. In this paper we present lower bounds on the communication complexity required by key computational problems such as standard matrix multiplication, stencil computations, comparison sorting, and the Fast Fourier Transform, when solved by parallel algorithms on the BSP model. These results, which are all tight for the whole range of model parameters, rely solely on the hypothesis that no processor performs more than a constant fraction of the total required work. More formally, let $\mathcal{W}$ be the total work required by any algorithm to solve the given problem (if the problem is represented by a directed acyclic graph, then $\mathcal{W}$ is the number of nodes of the DAG, otherwise $\mathcal{W}$ is a lower bound on the computation time required by any sequential algorithm), and let $W$ be the maximum amount of work performed by any BSP processor; then, in the same spirit as the aforementioned result for the diamond DAG in [1], $W$ is assumed to satisfy the bound $W \le \epsilon \mathcal{W}$, for some constant $\epsilon \in (0, 1)$. The rationale behind this approach is that communication is the major bottleneck of a distributed-memory computation unless the latter is sequential or "nearly sequential", in which case the main contribution to the running time $T$ of an algorithm comes from computation. Since this hypothesis is directly linked to the running time metric, and does not introduce restrictive assumptions suggested by orthogonal constraints, we believe that this is the right approach to perform a systematic analysis of the communication requirements of distributed-memory computations.
We emphasize that, in contrast to previous work, our lower bounds do not count the communication required to acquire the input, allow for any initial distribution of the input among the processors' local memories, assume no upper bound on the sizes of the latter, and do not require computations to be balanced. On the other hand, some of our results make use of additional technical assumptions, such as the non-recomputation of intermediate results in the course of the computation, or some restrictions on the replication of input data. Such restrictions, however, were already in place in almost all of the corresponding state-of-the-art lower bounds.
Matrix Multiplication
In this section we consider the problem of multiplying two $n \times n$ matrices, $A$ and $B$, using only semiring operations, that is, addition and multiplication. Hence, each element $c_{i,j}$ of the output matrix $C$ is an explicit sum of products $a_{i,k} \cdot b_{k,j}$, which are called multiplicative terms. This rules out, e.g., Strassen's algorithm [28] and the Boolean matrix multiplication algorithm of Tiskin [29]. As shown in [19], any algorithm using only semiring operations must compute at least $n^3$ distinct multiplicative terms.
In this section we establish a lower bound on the communication complexity of any parallel algorithm for matrix multiplication on a BSP with $p$ processors. This result is derived assuming that no processor performs more than a constant fraction of the $n^3$ total work required by any algorithm, measured as the number of scalar multiplications, and that each input element is initially stored in the local memory of exactly one processor. The bound has the form $\Omega(W^{2/3})$, where $W$ is the maximum number of multiplicative terms evaluated by a processor, and is tight for all values of $p$ between two and $n^2$. The argument through which we establish such a result is a repeated application of a "bandwidth" argument which, loosely speaking, is as follows. Consider a processor which performs the maximum amount of work. If this processor initially holds "few" input values, then, since it computes at least $n^3/p$ multiplicative terms, it must receive "many" inputs from the submachine including the other processors; otherwise, if it initially holds "many" inputs, then it has to send many of them to the other processors, because it cannot perform too much work on its own, and thus the other processors have to perform at least a constant fraction of the total work. The lower bound applies to any distribution of input and output matrices, and only requires that the input matrices are not initially replicated.
Towards this end, we first establish a lower bound of $\Omega(n^2)$ under the same hypotheses outlined above for two processors. This result is derived using a bandwidth argument that bounds from below the amount of data that must travel across the communication network of a two-processor machine. A bound of the same form can be found in [17, Section 6], which holds only when the elements of the input matrices $A$ and $B$ are evenly, or almost evenly, distributed among the two processors. Our result, which instead allows any initial distribution of the input matrices (without replication), establishes the same bound by using a mild hypothesis on the maximum computation load faced by the processors.
Lemma 1. Let $\mathcal{A}$ be any algorithm for computing the matrix product $C = AB$, using only semiring operations, on a BSP with two processors. If each processor computes at most $\epsilon n^3$ multiplicative terms, where $\epsilon$ is an arbitrary constant in $(1/2, 1)$, and the input matrices are not initially replicated, then the communication complexity of the algorithm is
$$H_{\mathcal{A}}(n, 2) = \Omega(n^2).$$
Proof. We use a bandwidth argument similar to the one employed in [17, Theorem 6.1]. By hypothesis, each processor computes at most $\epsilon n^3$ multiplicative terms. Let $K$ be the number of elements of $C$ whose multiplicative terms have not all been computed by the same processor. If $K \ge (1-\epsilon)n^2/2$, then the communication complexity is $\Omega(n^2)$, since a processor receives a message containing a multiplicative term or a partial prefix sum for at least $K/2$ of such entries. Suppose now that $K < (1-\epsilon)n^2/2$. Then there are at least $(1+\epsilon)n^2/2$ entries of $C$ whose $n$ multiplicative terms have been entirely computed by the same processor. We denote with $n_0$ and $n_1$ the number of entries of $C$ computed entirely by processors $P_0$ and $P_1$, respectively, and suppose without loss of generality that $n_0 \ge n_1$. Clearly, $n_0 + n_1 \ge (1+\epsilon)n^2/2$. Since each processor can compute at most $\epsilon n^3$ multiplicative terms, it follows that $n_0 \le \epsilon n^2$, and thus $n_1 \ge (1-\epsilon)n^2/2$. Let $r_i$ and $c_i$ denote the number of rows of $A$ and columns of $B$, respectively, whose $n$ entries have all been accessed by processor $P_i$, with $i \in \{0, 1\}$, during the lifespan of the algorithm. Since $n_i$ entries of $C$ are computed entirely by processor $P_i$, we have
$$r_i c_i \ge n_i.$$
Let $\alpha = \sqrt{\epsilon} + \sqrt{(1-\epsilon)/2}$, which is a constant greater than one for every $\epsilon \in (1/2, 1)$. If $r_0 + r_1 \ge \alpha n$, then at least $(\alpha - 1)n = \Theta(n)$ rows of $A$ are used by both processors, incurring $\Omega(n^2)$ messages for exchanging the rows since, by hypothesis, the input matrices are not initially replicated. Suppose now that $r_0 + r_1 < \alpha n$. Then, we have
$$c_0 + c_1 \ge \frac{n_0}{r_0} + \frac{n_1}{r_1} > \frac{n_0}{r_0} + \frac{n_1}{\alpha n - r_0}.$$
By taking the derivative of the last term with respect to $r_0$ we see that the right-hand side is minimized when $r_0 = \alpha n \sqrt{n_0}/(\sqrt{n_0} + \sqrt{n_1})$, where it equals
$$\frac{(\sqrt{n_0} + \sqrt{n_1})^2}{\alpha n}.$$
Since $n_0 + n_1 \ge (1+\epsilon)n^2/2$ and $n_0 \le \epsilon n^2$, the term $\sqrt{n_0} + \sqrt{n_1}$ is minimized when $n_0$ assumes the largest allowed value (i.e., $\epsilon n^2$) and $n_0 + n_1 = (1+\epsilon)n^2/2$. Thus, we have
$$c_0 + c_1 > \frac{\left(\sqrt{\epsilon} + \sqrt{(1-\epsilon)/2}\right)^2 n^2}{\alpha n} = \alpha n.$$
The lemma follows since at least $(\alpha - 1)n = \Theta(n)$ columns of $B$ are used by both processors, entailing $\Omega(n^2)$ messages for exchanging them.
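The minimization step in the proof above (taking the derivative with respect to $r_0$) is an instance of an elementary fact: for positive $a$, $b$, $R$, the function $a/r + b/(R-r)$ on $0 < r < R$ attains its minimum $(\sqrt{a}+\sqrt{b})^2/R$ at $r = R\sqrt{a}/(\sqrt{a}+\sqrt{b})$. The following Python sketch compares this closed form against a brute-force grid search; the function names and the sample values are illustrative choices, with $a$, $b$, $R$ standing for $n_0$, $n_1$, $\alpha n$.

```python
import math


def objective(a: float, b: float, R: float, r: float) -> float:
    """The quantity a/r + b/(R - r) bounded from below in the proof."""
    return a / r + b / (R - r)


def argmin_r(a: float, b: float, R: float) -> float:
    """Stationary point obtained by setting the derivative to zero."""
    return R * math.sqrt(a) / (math.sqrt(a) + math.sqrt(b))


def closed_form_min(a: float, b: float, R: float) -> float:
    """Value of the objective at the stationary point."""
    return (math.sqrt(a) + math.sqrt(b)) ** 2 / R


def grid_min(a: float, b: float, R: float, steps: int = 10_000) -> float:
    """Brute-force minimum over an evenly spaced grid in (0, R)."""
    return min(objective(a, b, R, R * k / steps) for k in range(1, steps))


# Sample values (normalized by n^2 and n): a = 0.8, b = 0.1, R = 1.2.
a, b, R = 0.8, 0.1, 1.2
exact = closed_form_min(a, b, R)
approx = grid_min(a, b, R)
```

The grid search never goes below the closed form and approaches it as the grid is refined, which is what the convexity of the objective predicts.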
To prove our main result we also need the following technical lemma, which was first given by Hong and Kung in their seminal paper on I/O complexity, and then restated in [17] by applying the Loomis-Whitney inequality [21] .
Lemma 2 ([16, Lemma 6.1]). Consider the matrix multiplication of two $n \times n$ matrices $A$ and $B$ using scalar additions and multiplications only. During the computation, if a processor accesses at most $K$ elements of each input matrix and contributes to at most $K$ elements of the output matrix $C$, then it can compute at most $2K^{3/2}$ multiplicative terms.
Now we have all the tools to prove the main result of this section. The following theorem establishes an $\Omega(W^{2/3})$ lower bound on the communication complexity of any standard algorithm, where $W$ denotes the maximum number of multiplicative terms evaluated by a processor. By the result of [19] and by the pigeonhole principle, there exists a processor that computes at least $n^3/p$ multiplicative terms, from which the standard $\Omega(n^2/p^{2/3})$ lower bound follows.
Theorem 1. Let $\mathcal{A}$ be any algorithm for computing the matrix product $C = AB$, using only semiring operations, on a BSP with $p$ processors, where $1 < p \le n^2$, and let $W$ be the maximum number of multiplicative terms evaluated by a processor. If $W \le \max\{n^3/p, n^3/11^3\}$, and the input matrices are not initially replicated, then the communication complexity of the algorithm is
$$H_{\mathcal{A}}(n, p) = \Omega(W^{2/3}).$$
Proof. Without loss of generality, we assume that any multiplicative term computed by the processors is actually used towards the computation of some entry of the output matrix $C$ (that is, processors do not perform "useless" computations). Consider one of the processors that compute $W$ multiplicative terms, and without loss of generality let $P_0$ denote such a processor. Let $I$ be the number of input elements initially held by this processor in its local memory. Consider first the case $I \le W^{2/3}/5$. By Lemma 2, a processor that computes $W$ multiplicative terms either accesses, during the whole execution of algorithm $\mathcal{A}$, at least $(W/2)^{2/3}$ input elements, or computes multiplicative terms relative to at least $(W/2)^{2/3}$ elements of the output matrix. In the first case, since $P_0$ initially holds $I \le W^{2/3}/5$ input elements, it must receive at least $(W/2)^{2/3} - I = \Omega(W^{2/3})$ data words from other processors, and the theorem follows. On the other hand, suppose $P_0$ computes multiplicative terms relative to at least $(W/2)^{2/3}$ entries of the output matrix, and partition such entries into three groups: $G_1$, the set of entries whose multiplicative terms have all been computed by the processor; $G_2$, the set of entries produced by the processor but for which some multiplicative term or partial sum has been communicated by some other processor; $G_3$, the set of entries not produced by the processor. Clearly, at least one of these three groups must have size at least $(W/2)^{2/3}/3$. If $|G_1| \ge (W/2)^{2/3}/3$, then $P_0$ must have computed at least $n(W/2)^{2/3}/3$ multiplicative terms, and since any entry of the input matrices occurs in only $n$ of such terms, the processor must have received at least $(W/2)^{2/3}/3 - I = \Omega(W^{2/3})$ elements from other processors. If $|G_2| \ge (W/2)^{2/3}/3$, then for each entry in $G_2$, $P_0$ has received some term from other processors, therefore accounting for a total of $\Omega(W^{2/3})$ incoming data words.
Finally, if $|G_3| \ge (W/2)^{2/3}/3$, then, since any multiplicative term must be used towards the computation of some entry of the output matrix $C$, for each entry in $G_3$, $P_0$ must send some multiplicative term or partial sum to the processor that will produce the corresponding entry of $C$, and this implies that $P_0$ must send $\Omega(W^{2/3})$ data words. In all three cases, $\Omega(W^{2/3})$ messages have to be exchanged by $P_0$ with the other processors, and the claim follows for $I \le W^{2/3}/5$. Now suppose $I > W^{2/3}/5$ and $p \ge 11^3$. Assume, without loss of generality, that $P_0$ initially holds at least $I/2$ elements of matrix $A$. Since any entry of the input matrices occurs in $n$ multiplicative terms, there are at least $In/2$ multiplicative terms that depend on the entries of $A$ initially held by the processor. Since $W$ multiplicative terms are computed by the processor, at least $In/2 - W$ of them are computed by other processors. Since, by hypothesis, each entry of $A$ is initially not replicated and a processor can compute at most $n$ multiplicative terms using a single entry of $A$, at least $(In/2 - W)/n$ messages are required for sending the appropriate entries of $A$ to the processors that will compute the remaining terms. Hence, $H_{\mathcal{A}}(n, p) \ge (In/2 - W)/n$. Finally, observe that since $p \ge 11^3$, by hypothesis it holds that $W \le n^3/11^3$, that is, $W/n \le W^{2/3}/11$. Putting all pieces together yields
$$H_{\mathcal{A}}(n, p) \ge \frac{I}{2} - \frac{W}{n} > \frac{W^{2/3}}{10} - \frac{W^{2/3}}{11} = \Omega(W^{2/3}),$$
which concludes the proof of the second case. Finally, when $I > W^{2/3}/5$ and $p < 11^3$, the sought lower bound follows by Lemma 1. Indeed, the $p$ processors can be virtually partitioned into two subsets, each consisting of exactly $p/2$ processors; in particular, processor $P^*_0$ will be identified with the submachine including the first half of the $p$ processors, and $P^*_1$ with the submachine including the second half. Since $p < 11^3$, by hypothesis each BSP processor computes at most $n^3/p$ multiplicative terms, and thus both $P^*_0$ and $P^*_1$ compute at most $(n^3/p)(p/2) = n^3/2$ multiplicative terms overall. Hence we can apply Lemma 1 to processors $P^*_0$ and $P^*_1$, obtaining the desired result.
The proposed bound is tight and is matched by the algorithm that decomposes the problem into $n^3/W \le p$ subproblems of size $W^{1/3} \times W^{1/3}$, and then solves each subproblem sequentially in each round. Since $W \ge n^3/p$, the minimum communication complexity is $\Omega(n^2/p^{2/3})$, which is achieved by the standard 3D algorithm [17].
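The geometric fact behind Lemma 2, as restated in [17] through the Loomis-Whitney inequality, is that a set $V$ of multiplicative terms $a_{i,k} \cdot b_{k,j}$ satisfies $|V|^2 \le |V_A| \cdot |V_B| \cdot |V_C|$, where $V_A$, $V_B$ and $V_C$ are the sets of entries of $A$, $B$ and $C$ that the terms involve. The Python sketch below checks this inequality on arbitrary sets of index triples; the encoding of a term as a triple `(i, j, k)` and the function names are assumptions made for illustration.

```python
import random
from typing import Iterable, Set, Tuple

Term = Tuple[int, int, int]  # (i, j, k) stands for the product a[i][k] * b[k][j]


def projections(terms: Iterable[Term]) -> Tuple[int, int, int]:
    """Number of distinct entries of A, B and C touched by the terms."""
    terms = list(terms)
    a_entries = {(i, k) for i, j, k in terms}
    b_entries = {(k, j) for i, j, k in terms}
    c_entries = {(i, j) for i, j, k in terms}
    return len(a_entries), len(b_entries), len(c_entries)


def loomis_whitney_holds(terms: Set[Term]) -> bool:
    """Check |V|^2 <= |V_A| * |V_B| * |V_C| using exact integer arithmetic."""
    na, nb, nc = projections(terms)
    return len(terms) ** 2 <= na * nb * nc


def random_terms(n: int, size: int, rng: random.Random) -> Set[Term]:
    """A random set of at most `size` distinct terms of an n x n product."""
    return {(rng.randrange(n), rng.randrange(n), rng.randrange(n))
            for _ in range(size)}


rng = random.Random(0)
ok = all(loomis_whitney_holds(random_terms(6, 40, rng)) for _ in range(200))
```

The inequality is attained with equality by a full cube of terms, which is why cubic blocks of multiplicative terms are the most communication-efficient assignment of work.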
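To see why cubic subproblems match the bound, consider the following simplified accounting, written as a Python sketch. It assumes $p = q^3$ processors with $q$ dividing $n$, one cubic block of multiplicative terms per processor, and it pessimistically charges a processor for every input entry it reads and every output entry it contributes to; the function name and these simplifications are ours and do not reproduce the round structure of the algorithm cited above.

```python
from typing import Tuple


def cube_decomposition_cost(n: int, q: int) -> Tuple[int, int]:
    """Return (W, words) for one processor out of p = q**3.

    W is the number of multiplicative terms in its (n/q)^3 block, and
    words is an upper bound on the data words it exchanges, assuming it
    initially holds none of the entries it needs (worst case).
    """
    if q <= 0 or n % q != 0:
        raise ValueError("q must be a positive divisor of n")
    side = n // q
    work = side ** 3                 # terms a[i][k] * b[k][j] in the block
    inputs_needed = 2 * side ** 2    # one side x side block of A and one of B
    outputs_touched = side ** 2      # one side x side block of partial sums of C
    return work, inputs_needed + outputs_touched


# n = 12 and p = 27: each processor handles a 4 x 4 x 4 block.
W, words = cube_decomposition_cost(12, 3)
```

In this accounting `words` equals $3W^{2/3}$, so the communication per processor is within a constant factor of the $\Omega(W^{2/3})$ lower bound.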
Finally, we observe that the above theorem can be extended to the case $W \le \epsilon n^3$, for an arbitrary constant $\epsilon \in (0, 1)$, provided that each multiplicative term is computed once. Also, we remark that, if each processor holds $O(W^{2/3})$ inputs, our bound applies even when each input element may be present in more than one processor at the beginning of the computation. We also conjecture that our bound can be extended up to $p^{1/3}$ replication, as shown in [27] assuming balanced memory or work.
Stencil Computations
Our result hinges on the restriction on the nature of the computation whereby each vertex of the DAG is computed exactly once. In this setting, the crucial property is that to each arc $(u, v)$ such that $u$ is computed by processor $P$ and $v$ is computed by processor $P'$, with $P \ne P'$, there corresponds a message from $P$ to $P'$ (which may also cross other processors). Such arcs are referred to as communication arcs.
We now introduce some preliminary definitions, which will be used throughout the section. We envision an $(n, d)$-array as partitioned into $p^{d/(d-1)}$ blocks of side $n/p^{1/(d-1)}$, denoted by $B_{i_0,\ldots,i_{d-1}}$ with $0 \le i_k < p^{1/(d-1)}$ for each $0 \le k < d$. Each block contains $n^d/p^{d/(d-1)}$ nodes, and is said to be $\ell$-owned if more than half of its nodes are evaluated by processor $P_\ell$, with $0 \le \ell < p$. A block is owned if there exists some $\ell$, with $0 \le \ell < p$, such that it is $\ell$-owned; it is shared otherwise. Two blocks $B_{i_0,\ldots,i_{d-1}}$ and $B_{i'_0,\ldots,i'_{d-1}}$ are said to be adjacent if their coordinates differ in just one position $k$ and $|i_k - i'_k| = 1$ (i.e., they share a face). For the sake of simplicity, we assume that $n$ and $p$ are powers of $2^{d-1}$ and thus the previous values (e.g., $n/p^{1/(d-1)}$) are integral: since $d$ is a constant, this assumption can be met by suitably increasing $n$ and decreasing $p$ by a constant factor, which does not asymptotically affect our lower bounds.
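The notions of owned, shared and adjacent blocks can be made concrete with a small Python sketch. It assumes that the array is given as a dictionary mapping each node (a tuple of $d$ coordinates in $[0, n)$) to the index of the processor that evaluates it, and that blocks are axis-aligned hypercubes of a given side; the names `classify_blocks` and `adjacent` and this data layout are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict, Optional, Tuple

Coord = Tuple[int, ...]


def classify_blocks(owner: Dict[Coord, int], n: int, d: int,
                    side: int) -> Dict[Coord, Optional[int]]:
    """Map each block index to l if the block is l-owned, or to None if shared.

    A block is l-owned when processor l evaluates more than half of its
    side**d nodes.
    """
    blocks_per_dim = n // side
    result: Dict[Coord, Optional[int]] = {}
    for block in product(range(blocks_per_dim), repeat=d):
        counts: Counter = Counter()
        for offset in product(range(side), repeat=d):
            node = tuple(b * side + o for b, o in zip(block, offset))
            counts[owner[node]] += 1
        proc, cnt = counts.most_common(1)[0]
        result[block] = proc if 2 * cnt > side ** d else None
    return result


def adjacent(b1: Coord, b2: Coord) -> bool:
    """Blocks are adjacent if they differ by 1 in exactly one coordinate."""
    return sum(abs(x - y) for x, y in zip(b1, b2)) == 1


# A 4 x 4 array with 2 x 2 blocks: processor 0 evaluates the columns x < 2,
# while processors 1 and 2 alternate in a checkerboard on the rest.
owner = {(x, y): 0 if x < 2 else 1 + (x + y) % 2
         for x in range(4) for y in range(4)}
blocks = classify_blocks(owner, n=4, d=2, side=2)
```

In this example the two blocks with first index 0 are 0-owned, and the two blocks with first index 1 are shared, since no processor evaluates more than half of their nodes.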
In order to establish our main lower bound, we need two preliminary lemmas. The first one gives a slack lower bound based on the $d$-dimensional version of the Loomis-Whitney geometric inequality [21], and is reminiscent of the result of Theorem 1 for matrix multiplication when $d = 3$.
where in the last inequality we have used the hypothesis that $W \le n^d/2$. This yields a contradiction, and thus it must be that
points in $N_0$ whose respective 1-dimensional arrays have not been completely evaluated by $P_0$: therefore, there is one communication arc associated with each array, and the lemma follows.
If $W > n^d/2$, we consider the remaining $p - 1$ processors as a single virtual processor evaluating $n^d - W \le n^d/2$ nodes, since recomputation is disallowed. An argument equivalent to the previous one gives the claim.
Now we need a second lemma that bounds from below the number of messages exchanged by a processor $P_\ell$ while evaluating nodes in an $\ell$-owned block and in an adjacent block which is not $\ell$-owned.
Lemma 4. Consider an $\ell$-owned block $B$ adjacent to a shared or $\ell'$-owned block $B'$, with $\ell \ne \ell'$. Then, the number of messages exchanged by processor $P_\ell$ for evaluating, without recomputation, nodes in $B$ and $B'$ is
$$\Omega\left(n^{d-1}/p\right).$$
Proof. We suppose without loss of generality that $B = B_{0,0,\ldots,0}$ and $B' = B_{1,0,\ldots,0}$. We call a node blue if it is evaluated by $P_\ell$, and red otherwise; let $n_b$ and $n_r$ (resp., $n'_b$ and $n'_r$) be the number of blue and red nodes in $B$ (resp., $B'$). By definition, (d−1) ) blue (resp., red) nodes in $B$ (resp., $B'$). Then, at least $n^{d-1}/(3p)$ arrays contain both red and blue nodes, and thus for each of them there is a communication arc. Since recomputation is disallowed, each of these arcs entails the communication of a datum, and the lemma follows.
Finally, suppose
The lemma follows by applying Lemma 3 to $B$ (resp., $B'$), with $W = n_b$ (resp., $W = n'_r$), and considering processors $P_i$ with $i \ne \ell$ as a single virtual processor.
The next theorem gives the claimed $\Omega\left(n^{d-1}/p^{(d-2)/(d-1)}\right)$ lower bound, and its proof is inspired by the argument in [30] for the cube DAG (which, however, assumes balanced work). The lower bound is matched by the balanced algorithm given in [30], which decomposes the $(n, d)$-array into p arrays with dimension $d$ and size $n/p^{1/(d-1)}$.
Theorem 2. Let $\mathcal{A}_d$ be any algorithm for solving the $(n, d)$-array problem, without recomputation, on a BSP with $p$ processors, where $1 < p \le n^{d-1}$, and let $W$ be the maximum number of nodes evaluated by a processor. If $W \le \epsilon n^d$, for an arbitrary constant $\epsilon \in (0, 1)$, then the communication complexity of the algorithm is
$$H_{\mathcal{A}_d}(n, p) = \Omega\left(\frac{n^{d-1}}{p^{(d-2)/(d-1)}} + W^{(d-1)/d}\right).$$
Proof. The second term of the lower bound follows directly from Lemma 3 and dominates the first one as long as
$$W \ge n^d / p^{d(d-2)/(d-1)^2}.$$
In the remainder, we focus on the first term and assume
$$W < n^d / p^{d(d-2)/(d-1)^2}.$$
Suppose the number of shared blocks to be at least $p^{d/(d-1)}/2$. Then, there exists a processor, say $P_0$, computing $W' \ge n^d/(2p)$ nodes in $b \ge 1$ shared blocks. Denote with $w_i$, for $0 \le i < b$, the number of nodes computed by $P_0$ in the $i$-th shared block. We have $\sum_{i=0}^{b-1} w_i = W'$. By Lemma 3, the messages exchanged by $P_0$ within the $i$-th block are $\Omega\left(w_i^{(d-1)/d}\right)$, and thus the communication complexity is at least
$$\Omega\left(\sum_{i=0}^{b-1} w_i^{(d-1)/d}\right).$$
The summation is minimized when each $w_i$, $i \in \{0, 1, \ldots, b-1\}$, is set to the maximum allowed value, that is $w_i = n^d/(2p^{d/(d-1)})$ since each block is shared, and $b$ is set to $W'/w_i \ge p^{1/(d-1)}$, which gives the first term of the lower bound.
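The minimization claim above rests on the concavity of $w \mapsto w^{(d-1)/d}$: for a fixed total, a sum of concave terms with bounded arguments is smallest when as many terms as possible sit at the upper bound. The Python sketch below illustrates this numerically by comparing the saturated split against random splits of the same total; the function names, the integer node counts and the sampling scheme are assumptions made for the illustration.

```python
import random
from typing import List


def comm_sum(ws: List[int], d: int) -> float:
    """Sum of w^((d-1)/d) over the blocks, the quantity bounded in the proof."""
    return sum(w ** ((d - 1) / d) for w in ws)


def saturated_split(total: int, cap: int) -> List[int]:
    """Put `cap` nodes in as many blocks as possible, plus one remainder."""
    full, rem = divmod(total, cap)
    return [cap] * full + ([rem] if rem else [])


def random_split(total: int, cap: int, rng: random.Random) -> List[int]:
    """A random split of `total` into parts of size between 1 and `cap`."""
    parts: List[int] = []
    left = total
    while left > 0:
        w = min(left, rng.randint(1, cap))
        parts.append(w)
        left -= w
    return parts


# d = 3, at most 8 nodes per block, 64 nodes in total.
rng = random.Random(0)
best = comm_sum(saturated_split(64, 8), d=3)
others = [comm_sum(random_split(64, 8, rng), d=3) for _ in range(100)]
```

Every random split yields a sum at least as large as the saturated one, consistent with the choice of $w_i$ made in the proof.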
Suppose now the number of shared blocks to be less than $p^{d/(d-1)}/2$. Intuitively, in the following argument we search for a hypercube that is almost entirely evaluated by a single processor, which exchanges a number of messages proportional to the surface area of the hypercube. Then, we highlight a critical sequence of these hypercubes which are evaluated one after the other: the total amount of communication performed by the associated processors gives the claimed bound.
We define a chain of length f a sequence of blocks B We also notice that blocks $B_{k_0}, \ldots, B_{k_{t-1}}$ may be owned by different processors. By construction, each node within block $B_{k_i}$ depends on all the nodes in $B_{k_{i-1}}$, and thus all messages exchanged while evaluating nodes in $B_{k_i}$ are subsequent to those exchanged while evaluating nodes in $B_{k_{i-1}}$. Then, by summing the number of messages exchanged by the owners of the $t$ blocks for evaluating nodes in the respective $s_{k_i}$-hypercubes, we have
Let S = 
Sorting
In this section we give a lower bound on the communication complexity of comparison-based sorting algorithms. Comparison sorting is defined as the problem in which a given set X of n input keys from an ordered set has to be sorted, such that the only operations allowed on members of X are pairwise comparisons. Our bound only requires that no processor performs more than a constant fraction ε of the Θ(n log n) comparisons required by any comparison sorting algorithm, for any ε ∈ (0, 1), and does not impose any protocol on the distribution of the inputs and the outputs among the processors, nor upper bounds on the size of their local memories, nor specific communication patterns. As in previous work, we still need the technical assumptions that the inputs are not initially replicated, and that the processors store only a constant number of copies of any input key at any moment during the execution of the algorithm.
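The Θ(n log n) comparison count mentioned above comes from the standard information-theoretic argument: each comparison has two outcomes, so at least log2(n!) comparisons are needed to tell apart all n! input orderings. The following is a minimal sketch of that count; the function name `min_comparisons` is a hypothetical helper introduced here for illustration, and base-2 logarithms are an assumption.

```python
import math


def min_comparisons(n: int) -> int:
    """Information-theoretic lower bound on comparisons: ceil(log2(n!))."""
    # math.log2 accepts arbitrarily large Python integers.
    return math.ceil(math.log2(math.factorial(n)))


if __name__ == "__main__":
    for n in (4, 16, 1024):
        # Compare the exact counting bound with the n*log2(n) growth rate.
        print(n, min_comparisons(n), round(n * math.log2(n)))
```

The printed pairs grow at the same rate, which is the sense in which a processor limited to a fraction ε of n log n comparisons cannot sort alone.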
The main result follows from the application of two lemmas, each of which provides a different and independent lower bound to the communication complexity of sorting. Both rely on nontrivial counting arguments, adapted from [2, 1] , that hinge on the fact that any comparison sorting algorithm must be able to distinguish between all the n! permutations of the n inputs. The first lemma provides a lower bound as a function of the maximum number S of input keys initially held by a processor. The second gives a lower bound as a function of the number Π of permutations that can be distinguished before any communications take place. We begin by stating and proving the first lemma.
Lemma 5. Let A be any algorithm sorting n keys on a BSP with p processors, with 1 < p ≤ n, and let S denote the maximum number of input keys initially held by a processor. If each processor performs at most ε(n log n) comparisons, with ε being an arbitrary constant in (0, 1), and the input is not initially replicated, then the communication complexity of the algorithm is $H_A(n, p) = \Omega(S)$.
Proof. Without loss of generality, let $P_0$ denote a processor holding S input keys at the beginning, and let $P^*$ identify the submachine comprising the remaining p − 1 processing units. Clearly, since the input is not initially replicated, $P^*$ initially holds n − S input elements. Finally, for convenience, we redefine ε as 1/(1 + δ), with δ being an arbitrary constant greater than zero.
Suppose first that $S > \left(1 - \frac{\delta}{4(1+\delta/2)}\right) n$. By hypothesis, each processor performs at most (n log n)/(1 + δ) comparisons and thus processor $P_0$ can boost the number of distinguishable permutations by a factor of at most $2^{\frac{n \log n}{1+\delta}}$
where the first inequality can be verified by taking the logarithm of both sides, and applies for n larger than a suitable constant, while the second one follows from Stirling's approximation. This holds independently of the number of keys that $P_0$ holds initially (which could even be n) or that it receives from $P^*$ during the execution of the algorithm. Therefore, $P^*$ must distinguish at least n!/(n/(1 + δ/2))! permutations. Then, if we denote by $S^* = n - S$ the number of keys initially held by $P^*$, and by $h^*$ the number of keys sent by $P_0$ to $P^*$, we must have
By taking the logarithm of both sides and after some manipulation, we obtain
, from which follows
Then, since $S^* < \frac{\delta n}{4(1+\delta/2)}$ and S ≤ n,
, and the lemma follows.
Now consider the case $S \le \left(1 - \frac{\delta}{4(1+\delta/2)}\right) n$. Let $h'$ and $h^*$ be the number of keys received by $P_0$ and $P^*$, respectively, and let $V'$ and $V^*$ be the maximum number of permutations distinguished by $P_0$ and $P^*$, respectively. We must have $V' V^* \ge n!$. We also have $V' \le S! \binom{S+h'}{h'}$: indeed, $P_0$ can distinguish all the S! permutations of the S input keys, and the number of ways to intersperse the $h'$ received keys within the group of S inputs is $\binom{S+h'}{h'}$. Thus, we have
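where $h = \max\{h', h^*\}$. By using the fact that $(a/b)^b \le \binom{a}{b} \le (ea/b)^b$ for any positive integers a and b with b ≤ a, and then by taking the logarithm of both sides, we get
$$h \log\left(\frac{e^2}{h^2}(S+h)(n-S+h)\right) \ge S \log\frac{n}{S}, \qquad (2)$$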
where e is Euler's number. In the rest of the proof we will prove that h ≥ βS for a suitable constant β ∈ (0, 1) that will be defined later. Suppose, for the sake of contradiction, that h < βS. We first observe that the left-hand side of Equation 2 is increasing in h. Indeed, we have $h \log\left(\frac{e^2}{h^2}(S+h)(n-S+h)\right) = 2h \log e + h \log\frac{S+h}{h} + h \log\frac{n-S+h}{h}$,
where h log((x + h)/h), with x ∈ {S, n − S}, is strictly increasing in h as soon as x > 0. Therefore, the left-hand side of Equation 2 can be upper bounded as follows:
h log e 2 h 2 (S + h)(n − S + h) < βS log e
where we have also used the facts that β < 1 and S ≤ n. We now argue that the last term in the above formula is upper bounded by S log(n/S). We shall consider two separate cases. The first is when n/S ≥ 2. In this case, we set β = log(n/S)/(8 log(2en/S)). (Observe that 0 < β < 1, as required.) Standard calculus shows that $8 \log(2en/S)/\log(n/S) < (2en/S)^3$ when n/S ≥ 2. Hence, we can write
Consider now the case 4(1 + δ/2)/(4 + δ) ≤ n/S < 2. Then, we have $(2en/(\beta S))^{2\beta} < (11/\beta)^{2\beta}$. Since δ is a constant, and since the right-hand side of the above inequality tends to one as β tends to zero, for each δ > 0 there exists a constant β ∈ (0, 1) such that $(11/\beta)^{2\beta} \le 4(1+\delta/2)/(4+\delta)$. Therefore, we have shown for both cases that, if h < βS, $h \log\left(\frac{e^2}{h^2}(S+h)(n-S+h)\right) < S \log\frac{n}{S}$, which is in contradiction with Equation 2. It follows that there exists a constant β > 0 such that h ≥ βS, giving the lemma.
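The second case of the proof rests on two counting facts: h′ received keys can be interspersed among S already sorted keys in C(S + h′, h′) ways, and binomial coefficients satisfy (a/b)^b ≤ C(a, b) ≤ (ea/b)^b. The sketch below checks both on small instances; the function names are hypothetical helpers, and the check is a sanity test rather than a proof.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def count_merges(s: int, h: int) -> int:
    """Count order-preserving merges of a sorted run of s keys with a sorted run of h keys."""
    if s == 0 or h == 0:
        return 1
    # The largest key of the merged order comes from one run or the other.
    return count_merges(s - 1, h) + count_merges(s, h - 1)


def binomial_sandwich_holds(a: int, b: int) -> bool:
    """Check (a/b)^b <= C(a, b) <= (e*a/b)^b for 1 <= b <= a."""
    c = math.comb(a, b)
    return (a / b) ** b <= c <= (math.e * a / b) ** b


if __name__ == "__main__":
    print(count_merges(3, 2), math.comb(5, 2))
    print(all(binomial_sandwich_holds(a, b)
              for a in range(1, 31) for b in range(1, a + 1)))
```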
We now provide a second lemma, which bounds from below the communication complexity of sorting in BSP as a function of the number Π of permutations that can be distinguished before any communication takes place, that is, when processors can only compare their local inputs.
As an aside, we observe that the proof of this lemma can be straightforwardly cast for the LPRAM model, yielding a much simpler proof for Theorem 3.2 of [1] , which bounds from below the communication delay, that is, the number of communication steps, required for comparison-based sorting.
Lemma 6. Let A be any algorithm sorting n keys on a BSP with p processors, where 1 < p ≤ n, and let Π be the number of distinct permutations that can be distinguished by A before the second superstep, that is, by comparing the inputs that (possibly) reside initially in the processors' local memories. If A stores only a constant number of copies of any key at any time instant, then the communication complexity of the algorithm is $H_A(n, p) = \Omega\left(\frac{n \log n - \log \Pi}{p \log(n/p)}\right)$.
Proof. We prove the lemma only for the case when every input key is present in only one of the local memories of the processors at any time instant; the extension to the case when a data element is simultaneously present in a constant number of local memories is straightforward and thus omitted. We suppose that A performs 1-relations in each superstep, that is, each processor can send and receive only one message. This is without loss of generality, because each superstep of A where each processor performs an h-relation (i.e., it sends and receives at most h messages) can be decomposed into h 1-relation supersteps without increasing the communication complexity of A (since the latter does not charge a synchronization cost due to the latency incurred by each superstep). Let $m_j$ denote the number of input keys in the local memory of processor $P_j$ after a given superstep of the algorithm. Since, by hypothesis, a data element is present in only one of the local memories of the processors at any time instant, we have that $\sum_{j=1}^{p} m_j \le n$. Hence, after a communication superstep, which by hypothesis entails a 1-relation, the space of permutations can be divided, at most, by the value of an optimal solution of the following convex program (observe, in fact, that $P_j$ may already have distinguished $(m_j - 1)!$ permutations before the last superstep):
$$\max \prod_{j=1}^{p} m_j \quad \text{subject to} \quad \sum_{j=1}^{p} m_j \le n, \quad m_j \ge 0 \text{ for each } j.$$
Since the solution is given by $m_j = n/p$ for each j, its value is $(n/p)^p$. Thus, after x supersteps, the space of permutations can have been divided by at most $(n/p)^{px}$. Since there remain n!/Π distinct possible permutations, we must have $\left(\frac{n}{p}\right)^{px} \ge \frac{n!}{\Pi}$.
By taking the logarithm of both sides, we obtain $x = \Omega\left(\frac{n \log n - \log \Pi}{p \log(n/p)}\right)$, as desired.
Now we are ready to prove the main result of this section, an Ω((n log n)/(p log(n/p))) lower bound on the communication complexity of any comparison sorting algorithm. The result follows by combining the bounds given by the previous two lemmas. Neither bound is tight when considered on its own: the first (Lemma 5) is weak when the input keys are initially distributed evenly among the processors, while the second (Lemma 6) is weak when the input keys are concentrated on one or a few processors. However, the simultaneous application of both provides the sought (tight) lower bound.
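To get a feel for the counting argument behind Lemma 6, the following sketch evaluates the smallest number x of 1-relation supersteps for which a division factor of (n/p)^p per superstep can cover the n!/Π remaining permutations. It is only an illustration under stated assumptions: logarithms are taken in base 2, p < n is required so that log(n/p) > 0, and the names `log2_factorial` and `superstep_lower_bound` are hypothetical.

```python
import math


def log2_factorial(m: int) -> float:
    """log2(m!) computed through the log-gamma function."""
    return math.lgamma(m + 1) / math.log(2)


def superstep_lower_bound(n: int, p: int, log2_pi: float) -> float:
    """Smallest real x with (n/p)^(p*x) >= n!/Pi, given log2(Pi)."""
    remaining = log2_factorial(n) - log2_pi   # log2 of n!/Pi
    per_superstep = p * math.log2(n / p)      # log2 of (n/p)^p
    return max(0.0, remaining / per_superstep)


if __name__ == "__main__":
    n, p = 1024, 32
    # Pi = 1: nothing can be distinguished locally.
    print(superstep_lower_bound(n, p, 0.0))
    # Pi = n!: one processor can sort everything locally.
    print(superstep_lower_bound(n, p, log2_factorial(n)))
```

The two extreme values of Π show why the lemma alone is weak when the keys are concentrated on few processors: the bound vanishes as Π approaches n!.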
Theorem 3. Let A be any algorithm for sorting n keys on a BSP with p processors, with 1 < p ≤ n. If each processor performs at most ε(n log n) comparisons, with ε being an arbitrary constant in (0, 1), the inputs are not initially replicated, and the p processors store only a constant number of copies of any key at any time instant, then the communication complexity of the algorithm is $H_A(n, p) = \Omega\left(\frac{n \log n}{p \log(n/p)}\right)$.
Proof. The result follows by combining Lemma 5 with Lemma 6. Since, by hypothesis, each processor performs at most ε(n log n) comparisons, with ε ∈ (0, 1), and the inputs are not initially replicated, we can apply Lemma 5, obtaining $H_A(n, p) = \Omega(S)$,
where S denotes the maximum number of input keys initially held by a processor. Moreover, since by hypothesis the p processors store a constant number of copies of any key at any time instant, we can also apply Lemma 6, obtaining
$$H_A(n, p) = \Omega\left(\frac{n \log n - \log \Pi}{p \log(n/p)}\right), \qquad (3)$$
where Π denotes the number of distinct permutations that can be distinguished by A by comparing the inputs that initially reside in processors' local memories. In order to compare the latter bound with the first one, we need to bound Π from above as a function of S. To this end, let $s_i$ denote the number of input keys initially held by processor $P_i$. Hence, $S = \max\{s_0, s_1, \ldots, s_{p-1}\}$. The number of permutations that can be distinguished by A without requiring communication, that is, by letting each processor sort the keys that it holds at the beginning of the computation, is therefore $\Pi = \prod_{i=0}^{p-1} s_i!$. Since the inputs are not initially replicated, an upper bound on Π as a function of S is given by the value of an optimal solution of the following mathematical program:
$$\max \prod_{i=0}^{p-1} s_i! \quad \text{subject to} \quad \sum_{i=0}^{p-1} s_i = n, \quad 0 \le s_i \le S \text{ for each } i.$$
Since a!b! ≤ (a + b)! for any integer a and b, by a convexity argument it follows that $\prod_{i=0}^{p-1} s_i! \le (S!)^{n/S}$. Therefore, we can plug $\Pi = (S!)^{n/S}$ in Equation 3, obtaining $H_A(n, p) = \Omega\left(\frac{n \log n - n \log S}{p \log(n/p)}\right)$.
Putting pieces together, we conclude that $H_A(n, p) = \Omega\left(S + \frac{n \log(n/S)}{p \log(n/p)}\right)$.
Standard calculus shows that the right-hand side of the above equation is increasing in S when S = Ω (n/(p log(n/p))). The theorem follows by observing that S ≥ ⌈n/p⌉.
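The inequality between the product of the factorials of the initial key counts and (S!)^{n/S}, used in the proof above, can be sanity-checked by enumerating every way of splitting n keys into positive parts. The sketch below does so in log space to avoid overflow; the helper names are hypothetical, processors holding no keys are ignored since 0! = 1, and the small tolerance only absorbs floating-point rounding.

```python
import math


def partitions(n: int, max_part: int):
    """Yield non-increasing tuples of positive integers that sum to n, each at most max_part."""
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def bound_holds(counts) -> bool:
    """Check prod(s_i!) <= (S!)^(n/S), where n = sum(counts) and S = max(counts)."""
    n, s_max = sum(counts), max(counts)
    log_product = sum(math.lgamma(s + 1) for s in counts)
    return log_product <= (n / s_max) * math.lgamma(s_max + 1) + 1e-9


if __name__ == "__main__":
    print(all(bound_holds(c) for c in partitions(12, 12)))
```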
Fast Fourier Transform
In this section we consider the problem of computing the Discrete Fourier Transform of n values using the n-input FFT DAG. In the FFT DAG, a vertex is a pair $\langle w, l \rangle$, with 0 ≤ w < n and 0 ≤ l ≤ log n, and there exists an arc between two vertices $\langle w, l \rangle$ and $\langle w', l' \rangle$ if $l' = l + 1$, and either w and w′ are identical or their binary representations differ exactly in the l′-th bit. We show that, when no processor computes more than a constant fraction of the total number of vertices of the DAG, the communication complexity is Ω(n log n/(p log(n/p))). Our bound does not assume any particular I/O protocol, and only requires that every input resides in the local memory of exactly one processor before the computation begins; as in preceding results, our bound also hinges on the restriction that each vertex of the FFT DAG is computed exactly once. The bound is tight for any p ≤ n, and is achieved by the well-known recursive decomposition of the DAG into two sets of smaller $\sqrt{n}$-input FFT DAGs, with each set containing $\sqrt{n}$ such subDAGs (see, e.g., [7]).
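The definition of the FFT DAG given above can be turned into a short construction. The sketch below lists the arcs of the n-input FFT DAG for n a power of two; the function name `fft_dag_arcs` is hypothetical, and the bit-numbering convention (the l′-th bit is counted from the least significant end, starting at 1) is an assumption, since the text does not fix it.

```python
def fft_dag_arcs(n: int):
    """Return the arcs ((w, l), (w2, l + 1)) of the n-input FFT DAG."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    levels = n.bit_length() - 1  # log2(n)
    arcs = []
    for l in range(levels):
        # Assumption: bit l' = l + 1 is the (l + 1)-th least significant bit.
        bit = 1 << l
        for w in range(n):
            arcs.append(((w, l), (w, l + 1)))        # same index w
            arcs.append(((w, l), (w ^ bit, l + 1)))  # index differing in bit l'
    return arcs


if __name__ == "__main__":
    arcs = fft_dag_arcs(8)
    print(len(arcs))  # 2 * n * log2(n) arcs
```

Every vertex at a level l ≥ 1 has exactly two predecessors, which gives the n(log n + 1) vertices and 2n log n arcs of the butterfly structure.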
We will first establish a lemma which, under the same hypothesis of the main result, provides a lower bound to the communication complexity as a function of the maximum work performed by any processor. The proof of the lemma is based on a bandwidth argument, which exploits the fact that an FFT DAG can perform all cyclic shifts (see, e.g., [20] ), and on the following technical result which is implicit in the work of Hong and Kung (a simplified proof is due to Aggarwal and Vitter [2] ).
Lemma 7 ([16]). Consider the computation of the n-input FFT DAG. During the computation, if a processor accesses at most S nodes of the DAG, then it can evaluate at most 2S log S nodes, for any S ≥ 2.
Lemma 8. Let A be any algorithm computing, without recomputation, an n-input FFT DAG on a BSP with p processors, with 1 < p ≤ n, and let W be the maximum number of nodes of the FFT DAG computed by a processor. If W ≤ ε(n log n), for an arbitrary constant ε ∈ (0, 1), and the inputs are not initially replicated, then the communication complexity of the algorithm is $H_A(n, p) = \Omega\left(\frac{W}{\log W}\right)$.
Proof. Let $P_0$ be a processor computing W nodes of the FFT DAG, and consider the remaining p − 1 processing units as a single processor $P^*$. Suppose first that processor $P^*$ contains at least n/2 of the n output nodes at the end of the algorithm. Let K = W/(2 log W). Since $P_0$ evaluates W nodes, it follows from Lemma 7 that $P_0$ accesses at least K node values during the execution of the algorithm. These nodes can be either inputs initially held by $P_0$, or nodes whose values have been evaluated and then sent by processor $P^*$.
If at least K/2 of them have been sent by P * , the lemma follows. Otherwise, P 0 initially contains at least K/2 input nodes. Since the inputs are not initially replicated, and since an FFT DAG can perform all cyclic shifts, by [26, Lemma 10.5.2] there exists a cyclic shift that permutes K/4 input nodes initially held by processor P 0 into K/4 output nodes held by P * at the end of the algorithm. Since K = W/(2 log W ) and since, by hypothesis, W ≤ ǫ(n log n), it holds that K/4 ≤ n/2, and thus at least K/4 messages are actually needed. Therefore, H A (n, p) ≥ W/(8 log W ). Now suppose that processor P * contains at most n/2 output nodes at the end of the algorithm. Thus, there are at least n/2 output nodes in P 0 . Since, by hypothesis, recomputation is disallowed, P * computes W * = n log n − W ≤ n log n nodes of the DAG. The lemma follows by inverting the role of P 0 and P * and setting K = W * /(2 log W * ) in the previous argument.
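The step in the proof that derives at least K = W/(2 log W) accesses from Lemma 7 can be checked numerically: a processor accessing only K nodes could evaluate at most 2K log K nodes, which is strictly less than W. The sketch below verifies this for a range of values of W; base-2 logarithms and the helper names are assumptions made for illustration.

```python
import math


def max_evaluated(accessed: float) -> float:
    """Upper bound of Lemma 7: at most 2*S*log(S) nodes evaluated when S nodes are accessed."""
    return 2 * accessed * math.log2(accessed)


def accessed_lower_bound(w: int) -> float:
    """K = W / (2 log W), the access count used in the proof of Lemma 8."""
    return w / (2 * math.log2(w))


if __name__ == "__main__":
    for w in (16, 1024, 10**6):
        k = accessed_lower_bound(w)
        # Accessing only K nodes cannot account for all W evaluations.
        print(w, k, max_evaluated(k) < w)
```

The range starts at W = 16 because Lemma 7 requires S ≥ 2, and K reaches 2 exactly at W = 16 under base-2 logarithms.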
We note that the above bound is matched when $W = O(n^{\epsilon} \log n)$, for any constant ε ∈ (0, 1), by the previous recursive algorithm by ending the recursion when the subproblem size is Θ(W/ log W).
The main result of this section follows by a simple application of the preceding lemma and of a result implicit in the proof of the lower bound due to Bilardi et al. [9, Corollary 1] .
Theorem 4. Let A be any algorithm computing, without recomputation, an n-input FFT DAG on a BSP with p processors, where 1 < p ≤ n. If each processor computes at most ε(n log n) nodes of the FFT DAG, for an arbitrary constant ε ∈ (0, 1), and the inputs are not initially replicated, then the communication complexity of the algorithm is $H_A(n, p) = \Omega\left(\frac{n \log n}{p \log(n/p)}\right)$.
Proof. If $W \ge n^{1/4}$, we have that $W \ge \max\{(n \log n)/p, n^{1/4}\}$, and thus we observe that the bound given by Lemma 8 dominates the one claimed by the theorem. Otherwise, when $W < n^{1/4}$, we use the following argument. By reasoning as in [9, Corollary 1], if at the end of the algorithm A each processor holds at most U ≤ n output nodes of the FFT DAG and recomputation is not allowed, then the communication complexity of A is $\Omega\left(\max\{0, n \log(n/U^2)/(p \log(n/p))\}\right)$. Since $W < n^{1/4}$, each processor cannot contain more than $n^{1/4}$ output nodes, that is, $U \le n^{1/4}$, and the theorem follows.
Conclusions
We have presented new lower bounds on the amount of communication required to solve some key computational problems in distributed-memory parallel architectures. All our bounds have the same functional form as previous results that appear in the literature; however, the latter rely critically on assumptions that rule out a large part of possible algorithms. The novelty and the significance of our results stem from the assumptions under which our lower bounds are developed, which are much weaker than those used in previous work. Our bounds are derived within the BSP model of computation, but can be easily extended to other models for distributed computations based on or similar to the BSP, such as LogP [13] and MapReduce [18, 23]. Moreover, we believe that our results can also be ported to models for multicore computing (see, e.g., [10, 33, 12]), since our proofs are based on some techniques that have already been exploited in this scenario.
There is still much to do towards the establishment of a definitive theory of communication-efficient algorithms. In fact, we were not able to remove all the restrictions that were in place in previous work: in some cases our lower bounds still make use of some technical assumptions, such as the non-recomputation of intermediate results, or restrictions on the replication of input data. Although it seems that such restrictions can be relaxed to encompass a small amount of recomputation or input replication, it is an open question whether these assumptions are inherent to our proof techniques or can be removed. In particular, it is not clear, in general, when recomputation has the power to reduce communication, since many lower bound techniques do not apply in this more general scenario (see, e.g., [5]). Providing tight lower bounds that hold also when recomputation is allowed is a fascinating and challenging avenue for future research.
