This paper initiates the study of communication complexity when the processors have limited work space. The following trade-offs between the number C of communication steps and space S are proved:
COMMUNICATION AND SPACE
The minimum communication required in order to solve problems in the two-processor model has been studied extensively. (See, for example, [28, 1, 3, 19].) This paper initiates the study of communication complexity when the processors have limited work space. As is customary, the systems we study consist of two communicating processors that are given private inputs x and y, respectively, and that are to output some function f(x, y). We restrict our attention to the case in which these processors execute straight-line protocols (defined in Section 2) and measure both the number S of work registers and the number C of communication steps used.
With no restriction on S it is impossible to prove superlinear lower bounds on C, since one processor can send its entire input to the other, which then computes and outputs f(x, y). In contrast, we prove the following trade-offs when space is restricted to S:
1. For multiplying two n × n matrices in the arithmetic model with two-way communication, $CS = \Theta(n^3)$.
2. For convolution of two degree-n polynomials in the arithmetic model with two-way communication, $CS = \Theta(n^2)$.
3. For multiplying an n × n matrix by an n-vector in the Boolean model with one-way communication, $CS = \Theta(n^2)$.
The proof technique used in the arithmetic lower bounds (Section 3) is quite different from that used in the Boolean lower bound (Section 4). The lower bounds are based on a new pebble game that models the space and communication requirements of straight-line protocols.
If a single processor can compute f(x, y) in time C and space S, then a system of two processors can compute f(x, y) in communication O(C) and space O(S), simply by communicating every intermediate value computed by either. Viewed another way, almost all the known time-space trade-offs are special cases of communication-space trade-offs, in which one processor receives all the inputs but is incapable of computation, being allowed only to communicate the inputs to the other processor. Thus, the new lower bounds outlined above imply previous time-space trade-offs of Ja'Ja' [15], Tompa [26], and Grigoryev [13]. (They do not imply the results of Yesha [30] and Abrahamson [2].)
The converse, however, is false. Whereas the time T and space S must satisfy $TS = \Omega(n^2)$ when computing the discrete Fourier transform [26] or sorting [7, 26], in Section 6 we demonstrate that both of these functions can be computed in linear communication steps and O(log n) space simultaneously. As further motivation for studying space-bounded communication complexity, in Section 5 we show that the search problems of Karchmer and Wigderson [16] associated with any language in $NC^k$ can be solved in $O(\log^k n)$ communication steps and $O(\log^k n)$ space simultaneously.
Hong and Kung [14] introduced the "red-blue pebble game" to study the space and I/O requirements of straight-line programs. This too can be viewed as a special case of our new pebble game, in which one processor (the memory) has no space bound but is incapable of computing.
The closest previous work is that of Papadimitriou and Ullman [20], in which they proved a communication-time trade-off. Both their work and the aforementioned work of Hong and Kung [14] studied lower bounds for straight-line implementations of a single circuit, whereas our lower bounds apply to any circuit that solves the problem. For instance, Hong and Kung proved that the space S and I/O Q required to implement the standard straight-line matrix multiplication algorithm satisfy $Q\sqrt{S} = \Omega(n^3)$, whereas we prove that $CS = \Theta(n^3)$ for any straight-line matrix multiplication algorithm.
A PEBBLE GAME THAT MODELS COMMUNICATION AND SPACE
In this section, we examine what it means for communicating processors to execute "straight-line protocols." Each processor is assumed to have a set of private registers $R_1, R_2, \ldots$ and private inputs. A straight-line protocol consists of two sequences of instructions, one executing on each processor. There are four kinds of instructions:
1. $R_i \leftarrow a$, where a is an input or constant.
2. $R_i \leftarrow f(R_{j_1}, R_{j_2}, \ldots, R_{j_k})$, where f is a primitive operator.
3. send $R_i$.
4. $R_i \leftarrow$ receive.
If x and y denote the respective inputs to the two processors, the processors themselves will be referred to as the x-processor and the y-processor. We first consider the case in which there is a one-way communication channel from the x-processor to the y-processor. In this case the y-processor cannot execute the send instruction and the x-processor cannot execute the receive instruction. A send causes a copy of the value stored in the specified register to be loaded into the communication channel, and the sending processor pauses until the value is received by the other processor. In executing a receive, a processor waits until there is some value on the channel, then copies that value into the designated register. A two-way communication channel can be viewed as two one-way communication channels, one in each direction.
A protocol is said to compute a set of values if each value in the set is stored in some private register at some time during the execution.
The space used by a straight-line protocol is the maximum number of registers used by either processor. The communication is the total number of send instructions executed by both processors.
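The four instruction kinds and the two cost measures defined above can be illustrated with a small interpreter. This is only a sketch under stated assumptions: the tuple encoding of instructions, the function name `run_protocol`, and the buffered channel (a sender does not pause here, unlike in the model) are all hypothetical choices made for illustration, not part of the paper.

```python
import operator
from collections import deque

def run_protocol(prog_x, prog_y):
    """Interpret two straight-line instruction lists (hypothetical encoding).

    Instructions: ("load", i, a), ("op", i, f, [j1, ...]), ("send", i), ("recv", i).
    Returns (registers, space, communication).
    """
    progs = {"x": prog_x, "y": prog_y}
    regs = {"x": {}, "y": {}}
    pc = {"x": 0, "y": 0}
    chan = {"x": deque(), "y": deque()}   # one queue per receiving processor
    other = {"x": "y", "y": "x"}
    sends = 0
    progress = True
    while progress:
        progress = False
        for p in ("x", "y"):
            while pc[p] < len(progs[p]):
                ins = progs[p][pc[p]]
                kind = ins[0]
                if kind == "load":
                    regs[p][ins[1]] = ins[2]
                elif kind == "op":
                    _, i, f, args = ins
                    regs[p][i] = f(*(regs[p][j] for j in args))
                elif kind == "send":
                    chan[other[p]].append(regs[p][ins[1]])
                    sends += 1
                elif kind == "recv":
                    if not chan[p]:
                        break             # wait for a value on the channel
                    regs[p][ins[1]] = chan[p].popleft()
                pc[p] += 1
                progress = True
    if any(pc[p] < len(progs[p]) for p in progs):
        raise RuntimeError("protocol deadlocked on a receive")
    # Space: maximum number of distinct registers used by either processor.
    space = max(len(regs["x"]), len(regs["y"]))
    return regs, space, sends

# Inner product of x = (2, 3) and y = (4, 5) with one-way communication.
prog_x = [("load", 1, 2), ("send", 1), ("load", 1, 3), ("send", 1)]
prog_y = [("recv", 1), ("load", 2, 4), ("op", 1, operator.mul, [1, 2]),
          ("recv", 2), ("load", 3, 5), ("op", 2, operator.mul, [2, 3]),
          ("op", 1, operator.add, [1, 2])]
regs, space, comm = run_protocol(prog_x, prog_y)
```

In this example the y-processor ends with the inner product in its register 1, using three registers and two communication steps.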
There is a natural way in which a straight-line protocol gives rise to an equivalent circuit. However, a circuit can be realized by several straight-line protocols. To determine the space and communication required to implement a circuit, we introduce a new pebble game. The idea is analogous to the use of other pebble games in studying time and space requirements of straight-line programs [21]. This new game is implicit in the work of Papadimitriou and Ullman [20].
There are two sets of pebbles, referred to as x-pebbles and y-pebbles. In each move, one can remove zero or more pebbles from the circuit and then, according to the following rules, choose a vertex v of the circuit and pebble it.
1. If v is an input vertex from x (y), then an x-pebble (resp. y-pebble) can be placed on v.
2. If, at the beginning of the move, all immediate predecessors of v were x-pebbled (y-pebbled), then v can be x-pebbled (resp. y-pebbled).
3. If, at the beginning of the move, there was an x-pebble on v, then v can be y-pebbled. In this case we say a communication (from the x-processor to the y-processor) has occurred at v.
The game as described models one-way communication from the x-processor to the y-processor. If two-way communication is allowed, then rule 3 is duplicated with x and y interchanged.
The goal of the game is to have pebbled each output vertex of the circuit at least once. Each pebbling strategy corresponds uniquely to a straight-line protocol as follows. Each pebble corresponds to a register of one of the processors. Pebbling a vertex v corresponds to loading the value computed at v into the corresponding register. Therefore, the maximum number of pebbles of either type used by a pebbling strategy measures the space used by the equivalent straight-line protocol, and the number of applications of rule 3 above measures the number of communication steps used.
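The rules of the game can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding in which `preds` maps each gate to its immediate predecessors, `owner` maps each input vertex to "x" or "y", and a move is a triple (pebbles to remove, vertex, pebble type); none of these names come from the paper. A move is charged as a communication only when neither rule 1 nor rule 2 applies.

```python
def play(preds, owner, outputs, moves, two_way=False):
    """Validate a pebbling strategy; return (space, communication steps)."""
    pebbles = {"x": set(), "y": set()}
    max_pebbles = {"x": 0, "y": 0}
    comm = 0
    done = set()
    for removals, v, c in moves:
        # Legality is judged against the configuration at the start of the move.
        start = {k: set(s) for k, s in pebbles.items()}
        for u, cu in removals:
            pebbles[cu].discard(u)
        other = "y" if c == "x" else "x"
        if v in owner:
            local = owner[v] == c                          # rule 1
        else:
            local = all(u in start[c] for u in preds[v])   # rule 2
        if not local:
            # Rule 3: copy a pebble of the other type (x to y, or either way).
            if v in start[other] and (c == "y" or two_way):
                comm += 1
            else:
                raise ValueError(f"illegal move on vertex {v!r}")
        pebbles[c].add(v)
        max_pebbles[c] = max(max_pebbles[c], len(pebbles[c]))
        if v in outputs:
            done.add(v)
    if done != set(outputs):
        raise ValueError("some output vertex was never pebbled")
    return max(max_pebbles.values()), comm

# Circuit with one gate g = a * b, where a belongs to x and b belongs to y.
preds = {"g": ["a", "b"]}
owner = {"a": "x", "b": "y"}
moves = [([], "a", "x"),   # rule 1
         ([], "a", "y"),   # rule 3: one communication
         ([], "b", "y"),   # rule 1
         ([], "g", "y")]   # rule 2
space, comm = play(preds, owner, ["g"], moves)
```

Here the strategy uses at most three pebbles of one type and a single communication step.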
TRADE-OFFS FOR ARITHMETIC STRAIGHT-LINE PROTOCOLS
This section is divided into two subsections. In Subsection 3.1, we observe that, without loss of generality, we can restrict our attention to bilinear arithmetic circuits while proving the communication-space trade-offs for general arithmetic straight-line programs that compute bilinear forms. This is reminiscent of a similar scenario when we count only the number of nonscalar multiplications performed by arithmetic straight-line programs for computing bilinear forms [8]. Then, in Subsection 3.2, we prove the communication-space trade-offs claimed in Section 1 for arithmetic straight-line protocols.
Preliminary Remarks
Lemmas 1 and 2 stated below follow immediately from a result of Winograd, and its generalization due to Strassen and Unger, respectively [8]. These lemmas will enable us to restrict our attention to bilinear straight-line programs in Subsection 3.2. For a definition of bilinear (linear) programs or circuits, see either of [8, 26].
Define $\mathrm{diag}(a_1, a_2, \ldots, a_k)$ to be the k × k diagonal matrix that has $a_1, a_2, \ldots, a_k$ on the main diagonal, in that order. We use the Binet-Cauchy theorem in order to prove the following corollary, which is required in Subsection 3.2.
Proof:
Since rank(A) > r, we can select subsets $X \subseteq [m]$ and $Y \subseteq [n]$
LEMMA 5 (Valiant

The Arithmetic Trade-offs
Suppose that $\{x_1, x_2, \ldots$
Now we are ready to state the main theorem of this section. This theorem states that if a set of bilinear forms is "sufficiently independent," then arithmetic straight-line programs for computing it exhibit a communication-space trade-off. As corollaries to this theorem, we will prove the main results of this section: (a) multiplication of two n × n matrices requires $CS = \Theta(n^3)$; (b) convolution of two degree-n polynomials requires $CS = \Theta(n^2)$.
Then any pebbling strategy for $\mathcal{G}$ that uses C (two-way) communication steps and 2S pebbles satisfies $C \ge \lfloor m/(2S+1) \rfloor (r - 2S)$.
Lower bounds on the rank in a linear subspace of $R^{n \times n}$, like the one used in Theorem 6, have been used earlier to relate the complexity of certain bilinear forms to the parameter of a corresponding error-correcting code [9, 17]. In order to prove this theorem, we will need the following result, whose proof is given later in this section.
Proof of Theorem 6 (using Lemma 7). By Lemmas 1 and 2 we can restrict our attention to bilinear circuits. Consider the pebble game of Section 2 on a bilinear circuit $\mathcal{G}$. We can partition the whole game into $\lceil m/(2S+1) \rceil$ phases. The ith phase begins immediately after a total of $(i-1)(2S+1)$ outputs have been pebbled, and it ends immediately after a total of $i(2S+1)$ outputs are pebbled. The last phase may be an exception; it ends when all outputs have been pebbled. Observe that during each phase (except possibly the last one) exactly $2S+1$ new outputs are pebbled. Using Lemma 7, we will prove that each phase requires at least $r - 2S$ communication steps. This, in turn, implies the theorem.
Consider the ith phase. Let T be the set of $2S+1$ indices of new outputs pebbled in this phase. Let $\mathcal{H}$ be the subcircuit of $\mathcal{G}$ consisting of all the edges and the vertices that lie on directed paths to at least one of the vertices of T. Lemma 7 is applicable to $\mathcal{H}$ and guarantees the existence of suitable sets U, V, and W together with two sets of vertex-disjoint paths (which may have vertices in common); assume that these paths connect $v_i \in V$ and $w_i \in W$ to $u_i \in U$.
Let $U' = \{ u_i \mid u_i \in U$, and the path from $v_i$ to $u_i$ ($w_i$ to $u_i$) is free of y-pebbles (resp. x-pebbles) at the beginning of this phase$\}$. Then $|U'| \ge r - 2S$, since each of the 2S pebbles eliminates at most one gate from U'. For each $u_i \in U'$, there must be at least one communication from the x-processor to the y-processor on the path from $v_i$ to $u_i$, or one from the y-processor to the x-processor on the path from $w_i$ to $u_i$. Hence, at least $|U'| \ge r - 2S$ communication steps must occur in this phase. ∎
Before we prove Lemma 7, we need some additional notation. Suppose that there are p multiplication gates in $\mathcal{H}$. Number them from 1 to p, and let $q_i$ be the bilinear form that is the output of the ith multiplication gate. Then $q_i = g_i h_i$, where $g_i = \sum_{j=1}^{n} \mu_{ji} x_j$ and $h_i = \sum_{k=1}^{n} \nu_{ki} y_k$. Define column vectors $A_i = [\mu_{1i}\ \mu_{2i}\ \cdots\ \mu_{ni}]^T$ and $B_i = [\nu_{1i}\ \nu_{2i}\ \cdots\ \nu_{ni}]^T$. Then the coefficient matrix $Q_i$ for the bilinear form $q_i$ is given by $Q_i = A_i B_i^T$.
In order to prove Lemma 7, we need an auxiliary result.
Proof of Lemma 7 (using Lemma 8). Using Lemma 8, choose $i_1, i_2, \ldots, i_d$ and $\Lambda$ such that $\mathrm{rank}(A \Lambda B^T) = k \ge r$. We will prove that, after a suitable rearrangement of indices, V and W can be chosen as follows: $V = \{ v_i \mid v_i$ is the input vertex in $\mathcal{H}$ corresponding to the input $x_i$, $1 \le i \le k \}$; $W = \{ w_i \mid w_i$ is the input vertex in $\mathcal{H}$ corresponding to the input $y_i$, $1 \le i \le k \}$.
Corollary 4 implies that, after a suitable rearrangement of indices, there exists $Z \in$ . Then, we will prove that there is another multiplication gate $q \notin X$ that has a pebble-free path to at least one of the outputs $f_1, f_2, \ldots, f_{2S+1}$. This contradicts the maximality of X, and hence will be sufficient to prove the lemma. We know that
(1)
Next we show that $t_1, t_2, \ldots, t_{2S+1}$ are linearly independent. Suppose that this is not the case. Then there are constants $\alpha_1, \alpha_2, \ldots, \alpha_{2S+1}$, not all zero, such that $\sum_{i=1}^{2S+1} \alpha_i t_i = 0$. Substituting for $t_i$, we get $\sum_{i=1}^{2S+1} \alpha_i f_i - \sum_{i} \beta_i q_i = 0$, or $\sum_{i=1}^{2S+1} \alpha_i f_i = \sum_{i} \beta_i q_i$. This identity means that the coefficients of all the $n^2$ terms $x_j y_k$ on both sides are equal, i.e., $\sum_{i=1}^{2S+1} \alpha_i m^{(i)}_{jk} = \sum_{i} \beta_i \mu_{ji} \nu_{ki}$. These $n^2$ equations can be written in the matrix form as follows:
The desired lower bound is an immediate consequence. For a matching upper bound, we divide the computation of $\{z_0, \ldots, z_{n-1}\}$ into $\lceil n/S \rceil$ phases. In the ith phase the coefficients $z_{(i-1)S}, \ldots, z_{iS-1}$ are computed by the y-processor as follows. At the beginning of each phase, the y-processor resets all its registers to zero. Then the x-processor starts transmitting the sequence $x_0, x_1, \ldots, x_{n-1}$. After receiving $x_j$, the y-processor updates its registers so that they contain $\sum_{l=0}^{j} x_l y_{(k-l) \bmod n}$, for $(i-1)S \le k < iS$. Since each phase has exactly n communication steps, the total communication required is $O(n^2/S)$. ∎
A similar argument can be used to prove a communication-space trade-off for matrix multiplication. Let A and B be n × n matrices whose elements are drawn from a ring. Let $C = AB$, $(A)_{ij} = x_{ij}$, $(B)_{ij} = y_{ij}$, and $(C)_{ij} = z_{ij}$. Then $z_{ij} = \sum_{k=1}^{n} x_{ik} y_{kj}$, for any $1 \le i, j \le n$. Initially the A-processor and the B-processor have inputs $\{x_{11}, \ldots, x_{nn}\}$ and $\{y_{11}, \ldots, y_{nn}\}$, respectively. They cooperate to compute all entries of C in an arbitrary order. Again, each $z_{ij}$ can be viewed as a bilinear form in the $n^2$ entries of A and B. Let $M_{ij}$ be the $n^2 \times n^2$ coefficient matrix of the bilinear form $z_{ij}$. $M_{ij}$ can be arranged into an n × n matrix of blocks, where each block submatrix has size n × n, such that its (i, j)th block is an identity matrix, and the others are all equal to zero.
COROLLARY 11. The arithmetic straight-line complexity for multiplying two n × n matrices satisfies $CS = \Theta(n^3)$.
Proof: The block structure of the $M_{ij}$'s implies that $\mathrm{rank}(\sum_{i,j} \alpha_{ij} M_{ij}) \ge n$, whenever not all of the $\alpha_{ij}$'s are equal to zero. The proof can be completed along the lines of the proof of Corollary 10.
A matching upper bound is provided by an algorithm similar to the one described in Corollary 10 for convolution. We split the computation in phases, computing S outputs in each phase using one-way communication. ∎
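The phase-based upper bound for convolution from Corollary 10 can be simulated directly. The sketch below assumes cyclic convolution, $z_k = \sum_l x_l y_{(k-l) \bmod n}$, as suggested by the "mod n" in the text; the function name `convolve_in_phases` and the way communication is counted are illustrative assumptions. The y-processor's own inputs are read directly and are not counted as registers.

```python
def convolve_in_phases(x, y, S):
    """Cyclic convolution computed S outputs at a time.

    Returns (z, comm), where comm counts the values sent by the x-processor.
    """
    n = len(x)
    z = [0] * n
    comm = 0
    for start in range(0, n, S):
        ks = range(start, min(start + S, n))
        acc = [0] * len(ks)          # the y-processor's registers, reset per phase
        for l in range(n):
            comm += 1                # x-processor transmits x_l
            for r, k in enumerate(ks):
                acc[r] += x[l] * y[(k - l) % n]
        for r, k in enumerate(ks):
            z[k] = acc[r]            # outputs of this phase
    return z, comm

x = [1, 2, 3, 4]
y = [5, 6, 7, 8]
z, comm = convolve_in_phases(x, y, S=2)
```

Each phase costs exactly n transmissions, so the count is n times the number of phases, which matches the $O(n^2/S)$ bound stated above.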
TRADE-OFFS FOR BOOLEAN STRAIGHT-LINE PROTOCOLS
In this section we turn from arithmetic circuits to Boolean circuits. The lower bounds are complicated by the fact that we know less in this case about the internal structure of the circuits. In particular, there are no distinguished internal gates corresponding to the multiplication gates in bilinear circuits that indicate where the communications occur. As a result, we must be content with bounds on the model with one-way communication.
In Theorems 14 and 15 we prove a tight communication-space trade-off for any Boolean circuit that multiplies a matrix by a vector, one theorem for each direction of communication. The technique used is derived from that of Grigoryev [13] for Boolean time-space trade-offs. We also need the following combinatorial lemma of Sauer [22] and Perles and Shelah [23].
LEMMA 13. Let $\mathcal{G}$ be any Boolean circuit that inputs an n × n matrix A and a vector $x \in \{0,1\}^n$, and outputs $Ax \in \{0,1\}^n$. Consider any pebbling strategy on $\mathcal{G}$ that has one-way communication from the x-processor to the A-processor and that uses S A-pebbles and any number of x-pebbles. Then in the course of pebbling any set Y of S + 1 outputs, starting from any configuration of pebbles on $\mathcal{G}$, communication steps occur at more than $n - (S+1) \log_2 n$ gates.
Proof
Note that, because of the one-way communication, all the outputs in Y must be A-pebbled. Suppose, by way of contradiction, that communication steps occur only at gates in the set K, where $|K| \le n - (S+1) \log_2 n$. There must be a set $F \subseteq \{0,1\}^n$ of at least $2^n / 2^{|K|} \ge n^{S+1}$ distinct assignments of values to the inputs in x that fix the values of the gates in K. Construct an $|F| \times n$ Boolean matrix M whose rows are the elements of F. Lemma 12 demonstrates that M contains some row permutation of $B_{S+1}$ as a submatrix, since
That is, there is a set X of S + 1 input vertices in x and a set $F' \subseteq F$ that assigns each of the possible $2^{S+1}$ values to X. Note that the $2^{S+1}$ assignments in F' fix the values computed at K. Now fix A to be a permutation matrix that maps the S + 1 inputs in X to the S + 1 outputs in Y. As the assignments to x vary over
More interesting examples derive from the "search problems" of Karchmer and Wigderson [16]. Associated with any language L is the following search problem $S_L$: the 1-processor receives any $x \in L$, and the 0-processor any $y \notin L$. At the end of their computation, they must output some index i such that their inputs differ in the ith position.
Proof: The proof is identical to a corresponding one of Karchmer and Wigderson [16, Lemma 2.1], except that they had no need to account for the space used by the communicating processors.
Let M be an alternating Turing machine that accepts L in time T. Assume without loss of generality that each configuration of M has at most two successors and that M reads only one input character along any computation path, halting immediately after doing so. The latter is accomplished as follows. Whenever M intended to read an input in the middle of its computation, M instead existentially guesses the value to be read and universally does two things: verify that the read value is correct and, in parallel, continue with the successor of the original read configuration as though the guess were correct.
To solve $S_L$, each processor uses its space to record M's current configuration Q, whose length is proportional to the space bound of M and hence O(T). Initially, Q is M's initial configuration. In general, if Q is existential (universal), the 1-processor (resp., 0-processor) chooses a successor of Q that leads to acceptance (resp., rejection) of its own input. It can do so in space O(T), since $ATIME(T) \subseteq DSPACE(T)$ [10]. It then communicates a single bit to the other processor indicating which of the two successors it has chosen, and they each update Q accordingly. After at most T communicated bits, Q will be a configuration in which M is reading some input character at some position i, at which point both processors output i.
By a straightforward induction on the number of steps that M has taken, M, when begun in configuration Q on the input given to the 1-processor (0-processor), will eventually accept (resp., reject). In particular, the inputs given to the processors must differ in the ith position, since M halts immediately after reading this input. ∎
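The alternation between the two processors can be seen on a simpler object than an alternating Turing machine. The sketch below swaps in a Boolean formula tree (OR nodes play the role of existential configurations, AND nodes of universal ones, and literals of input-reading configurations); this substitution, the tuple encoding, and the names `evaluate` and `search` are assumptions made for illustration. It does not model the space bound, since each processor simply re-evaluates subformulas.

```python
def evaluate(f, w):
    """Evaluate a formula on input w; f is ("lit", i, positive) or (op, left, right)."""
    if f[0] == "lit":
        _, i, positive = f
        return w[i] == (1 if positive else 0)
    kind, left, right = f
    if kind == "and":
        return evaluate(left, w) and evaluate(right, w)
    return evaluate(left, w) or evaluate(right, w)

def search(f, x, y):
    """Return (i, bits): an index where x and y differ, and the bits communicated."""
    assert evaluate(f, x) and not evaluate(f, y)
    bits = 0
    # Invariant: the current subformula accepts x and rejects y.
    while f[0] != "lit":
        kind, left, right = f
        if kind == "or":
            # The 1-processor picks a child that accepts its input x.
            f = left if evaluate(left, x) else right
        else:
            # The 0-processor picks a child that rejects its input y.
            f = left if not evaluate(left, y) else right
        bits += 1   # one bit names the chosen child
    return f[1], bits

# (w0 AND w1) OR (NOT w2)
formula = ("or", ("and", ("lit", 0, True), ("lit", 1, True)), ("lit", 2, False))
x = [1, 1, 1]   # accepted
y = [1, 0, 1]   # rejected
i, bits = search(formula, x, y)
```

The number of bits exchanged is at most the depth of the formula, mirroring the bound of T bits for a machine running in time T.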
SOME PROBLEMS THAT DO NOT EXHIBIT A COMMUNICATION-SPACE TRADE-OFF
In this section, we study sorting, ranking, and the discrete Fourier transform. We show that these problems do not exhibit any communication-space trade-offs by exhibiting algorithms that use minimum communication and minimum space, simultaneously. The fact that these problems do not exhibit a communication-space trade-off is somewhat surprising at first glance, because these problems are known to exhibit time-space trade-offs in the single processor model. This is evidence that the problem of determining a communication-space trade-off for a given problem is inherently different from the problem of determining a time-space trade-off for the same problem.
Sorting and Ranking
In the sorting problem, the two processors input sets X and Y of integers, respectively. For convenience, assume that $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ are such that $X \cup Y$ consists of 2n distinct integers from the set $\{0, 1, \ldots, 2^n - 1\}$. The processors sort these integers, and the Y-processor outputs them in ascending order. It is known [25] that $\Omega(n^2)$ bits of communication are required even if each processor has unbounded space available to it. Below, we present an algorithm that requires only O(log n) space, and $O(n^2)$ bits of communication.
The algorithm works in 2n phases. At the end of the kth phase, the kth smallest integer from the set $X \cup Y$ is output by the Y-processor. At all times, both the processors maintain pointers i and j to the smallest elements of the sets X and Y, respectively, that have not been output so far. During any phase, the numbers pointed to by the pointers i and j are compared (using n bits of communication); the smaller one is output and the corresponding pointer is updated.
In order to compare $x_i$ and $y_j$, the X-processor starts transmitting the bits of $x_i$, starting with the most significant bit. The Y-processor receives these bits and simultaneously begins to compare these bits to the leading bits of $y_j$. If the result of a bit comparison is an equality, then that bit is output. As soon as the Y-processor determines the first bit position in which $x_i$ and $y_j$ differ, it knows which one of them is the smaller one. It outputs the rest of the smaller number and then notifies the X-processor of the inequality. Then, one of the processors updates its pointer, and the phase ends. In order to update the pointer i, we must determine the smallest element in the set X that is larger than $x_i$. This can be done easily in O(log n) space. The pointer j can be updated in a similar manner.
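The phases of the sorting protocol can be simulated as follows. This is a sketch under assumptions: the names `next_larger` and `sort_two_party` are hypothetical, one notification bit per phase is charged to the Y-processor, and the X-processor is assumed to send all n bits whenever its own number is the one output (so that the Y-processor can output the rest of it) and to send all n bits once Y is exhausted.

```python
def next_larger(values, current):
    """Smallest element of values greater than current (None if there is none).

    A single scan suffices, so only a pointer needs to be stored.
    """
    return min((v for v in values if current is None or v > current), default=None)

def sort_two_party(X, Y, n):
    """Return (sorted output of the Y-processor, number of bits communicated)."""
    bits = 0
    out = []
    xi = next_larger(X, None)
    yj = next_larger(Y, None)
    while xi is not None or yj is not None:
        if xi is None:                      # only Y's elements remain: no communication
            out.append(yj)
            yj = next_larger(Y, yj)
            continue
        diff = None
        if yj is not None:
            for pos in range(n - 1, -1, -1):    # most significant bit first
                if ((xi >> pos) & 1) != ((yj >> pos) & 1):
                    diff = pos
                    break
            assert diff is not None, "elements are assumed distinct"
        x_smaller = yj is None or ((xi >> diff) & 1) == 0
        # X sends all n bits if its number is output, else only up to the difference.
        bits += n if x_smaller else (n - diff)
        bits += 1                           # Y notifies X of the outcome
        if x_smaller:
            out.append(xi)
            xi = next_larger(X, xi)
        else:
            out.append(yj)
            yj = next_larger(Y, yj)
    return out, bits

X = [5, 1, 6]
Y = [3, 7, 0]
out, bits = sort_two_party(X, Y, n=3)
```

With 2n phases of at most n + 1 bits each, the simulated cost stays within the $O(n^2)$ bound.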
The ranking problem is similar to the sorting problem. As in the sorting problem, the two processors input sets X and Y of integers. The X-processor outputs integers $r_1, r_2, \ldots, r_n$ (in that order) such that $r_k$ is the number of elements in the set $X \cup Y$ that are less than $x_k$. It is known [25] that $\Omega(n^2)$ bits of communication are required even if each processor has unbounded space available to it. A slight modification of the sorting algorithm described above can be used to solve this problem using only O(log n) space, and $O(n^2)$ bits of communication.
The X-processor can easily determine the number of elements of X that are less than $x_k$. The number of elements of Y that are less than $x_k$ can be determined in O(n) bits of communication as follows. After determining the number of elements in $Y_j = \{y_1, y_2, \ldots, y_j\}$ that are less than $x_k$, the Y-processor also maintains two pointers l and m with the property that of all the elements in $Y_j$, $y_m$ and $x_k$ have the largest common prefix, and the length of this common prefix is $l < n$. Initially, m is undefined, and $l = 0$. In order to check if $y_{j+1} < x_k$, the Y-processor compares the first l bits of $y_m$ and $y_{j+1}$. If this is not sufficient to determine the relation between $x_k$ and $y_{j+1}$, then the X-processor transmits additional bits of $x_k$, starting with the $(n-l-1)$th bit, until the inequality between $x_k$ and $y_{j+1}$ is determined. Then, the Y-processor updates the pointers l and m.
This shows that the problems of sorting and ranking can be solved with linear (i.e., $O(n^2)$) communication even with a minimal amount of space and that there is no possibility of a nontrivial communication-space trade-off for these problems.
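The prefix-pointer technique that the Y-processor uses in the ranking protocol can be sketched for a single $x_k$ as follows. The function names and the bit-indexing convention (position 0 is the most significant bit) are assumptions for illustration. In this simplified version the bit at which a mismatch is found is not remembered, so it may be resent once per element of Y; the total is still at most n + |Y| bits.

```python
def bit(v, pos, n):
    """Bit of the n-bit number v at position pos, where 0 is the most significant."""
    return (v >> (n - 1 - pos)) & 1

def count_smaller(xk, ys, n):
    """Return (number of elements of ys below xk, bits of xk transmitted)."""
    l, m = 0, None       # ys[m] agrees with xk on its first l bits
    sent = 0
    count = 0
    for j, y in enumerate(ys):
        pos = 0
        # Compare y against the known prefix of xk, read off from ys[m].
        while pos < l and bit(y, pos, n) == bit(ys[m], pos, n):
            pos += 1
        if pos < l:
            # The known prefix already decides the comparison; no communication.
            if bit(y, pos, n) < bit(ys[m], pos, n):
                count += 1
            continue
        # y agrees with xk on the first l bits: request further bits of xk.
        while pos < n:
            xb = bit(xk, pos, n)
            sent += 1
            if xb != bit(y, pos, n):
                if bit(y, pos, n) < xb:
                    count += 1
                break
            pos += 1
        m, l = j, pos    # y now has the longest known common prefix with xk
    return count, sent

ys = [1, 7, 4, 6, 0]
count, sent = count_smaller(5, ys, n=3)
```

Summing this cost over all n elements of X gives $O(n^2)$ bits in total, in line with the bound claimed for ranking.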
Discrete Fourier Transform
The eight-point FFT (fast Fourier transform) circuit is shown in Fig. 2, along with two possible ways of distributing the inputs to the two processors. In general, the $2^k$-point FFT circuit has inputs labeled with the binary representations of $0, 1, \ldots, 2^k - 1$ in the natural order. There are two natural policies for distributing these inputs between the processors, namely, according to either the first or last bit of this label. To be more precise, Policy A assigns $\{x^A_0, x^A_1, \ldots, x^A_{2^{k-1}-1}\}$ to the x-processor and $\{y^A_0, y^A_1, \ldots, y^A_{2^{k-1}-1}\}$ to the y-processor, where $x^A_i$ is the input with label $0i$ and $y^A_i$ is the input with label $1i$. On the other hand, Policy B assigns $\{x^B_0, x^B_1, \ldots, x^B_{2^{k-1}-1}\}$ to the x-processor and $\{y^B_0, y^B_1, \ldots, y^B_{2^{k-1}-1}\}$ to the y-processor, where $x^B_i$ is the input with label $(0i)^R$ and $y^B_i$ is the input with label $(1i)^R$, where $w^R$ is the reversal of the string w. These policies are illustrated in Fig. 2.
In the remainder of this subsection, we argue that there is no communication-space trade-off under Policy A (and hence none for the general discrete Fourier
Proof: Consider the pebble game of Section 2 on an n-point FFT circuit, where the inputs are distributed between the two processors in accordance with Policy B. Assume that $S < n/16$, since the result is immediate otherwise. We can partition the whole game into $\lceil n/4S \rceil$ phases. The tth phase begins immediately after a total of $4S(t-1)$ outputs have been pebbled, and it ends immediately after a total of $4St$ outputs have been pebbled. The last phase may be an exception; it ends when all outputs have been pebbled. Observe that during each phase (except possibly the last one) exactly 4S new outputs are pebbled. We will prove that each phase requires at least $2S \lfloor n/8S \rfloor$ communication steps. This, in turn, implies the theorem.
Consider the tth phase. Let $Z = \{z_{l_1}, z_{l_2}, \ldots, z_{l_{4S}}\}$ be the set of new outputs pebbled in this phase. Consider any set $X = \{x^B_{i+1}, x^B_{i+2}, \ldots, x^B_{i+4S}\}$ of 4S consecutive (according to Policy B) inputs of the x-processor. By [26, Lemma 3] and Lemma 5, there are 4S vertex-disjoint paths from X to Z.
Let U be the set of gates that are immediate successors of some vertex in X and have a pebble-free path to one of the outputs in Z. Let $|U| = 4S - a$. Since each pebble can block only one of the 4S vertex-disjoint paths from X to Z, there must be at least a pebbles on non-input vertices. All the gates in U must be pebbled in the tth phase. Since there are at most $2S - a$ pebbles on the circuit inputs, pebbling all the vertices in U requires at least $|U| - (2S - a) = 2S$ communication steps. Since this is true for each of the $\lfloor n/8S \rfloor$ blocks of 4S consecutive inputs to the x-processor and each of the $\lfloor n/4S \rfloor$ phases, the total number of communication steps is $C \ge 2S \lfloor n/8S \rfloor \lfloor n/4S \rfloor = \Omega(n^2/S)$, since $S < n/16$. ∎
In this argument, we used the fact that a certain submatrix of the discrete Fourier transform matrix is nonsingular. As an alternative, we could use the fact that the FFT circuit is a grate in order to prove Theorem 18. (See Tompa [26] .)
Finally, it is worth pointing out that, for any set of linear forms under any policy for distributing the inputs to the two processors, there exists a circuit computing these forms that can be pebbled simultaneously in O(1) pebbles and a linear number of communication steps. The same is true of several other problems, e.g., the reduced sensitivity analysis problem of Bentley and Brown [4] .
OPEN PROBLEMS
There are numerous directions for future research suggested by this work:
trade-off will not be sorting.
Prove a communication-space trade-off for a single-output function. The only examples of this in the time-space trade-off literature are the results on element distinctness [6, 29], which provably has no communication-space trade-off.
Corollary 11 shows that any straight-line protocol that multiplies two n × n matrices A and B in O(log n) space requires $\Omega(n^3/\log n)$ communication steps. Suppose that such a trade-off could be proved for the following related decision problem: One processor is given nonsingular matrices A and C, and the other is given B, and the problem is to decide if AB = C. Suppose it could be proved that this problem cannot be solved simultaneously in space O(log n) and communication $O(n^2)$. Then it would follow that n × n matrices cannot be inverted in space O(log n), since one possible protocol for the decision problem is for the first processor to compute $A^{-1}C$, which it then transmits to the other processor in $n^2$ communication steps for direct comparison to B. (Tiwari [24] had posed a similar open problem, namely, proving that $CS = \Omega(n^2)$ for determining whether $xy \equiv 1 \pmod{2^n}$ for n-bit integers x and y. However, this problem can be solved in O(n) communications and $O(\log^2 n)$ space, by computing $x^{-1} \bmod 2^n$ using the algorithm of von zur Gathen [12].)
4. The lower bounds proved here were all for deterministic algorithms. Is there a problem whose simultaneous communication and space complexity is decreased when randomization is allowed? This would be analogous to the result of Mehlhorn and Schmidt [18] on communication steps alone.
6. Finally, all these studies can be taken to the general case of more than two processors, where one might measure the four resources of space, communication steps, computation steps, and number of processors.
