Abstract
We study fine-grain computation on the Reconfigurable Ring of Processors (RRP), a parallel architecture whose processing elements (PEs) are interconnected via a multiline reconfigurable bus, each of whose lines has one-packet width and can be configured, independently of other lines, to establish an arbitrary PE-to-PE connection. We present a message-passing protocol, Comet, that will, in the presence of suitable implementation technology, endow an RRP with message latency that is logarithmic in the number of PEs a message passes over in transit. Our study focusses on the computational consequences of such latency in such an architecture. Denoting by $(N, L)$-RRP an $N$-PE ring whose bus has $L$ lines, we establish the following computational properties of logarithmic-latency RRPs.
1. A leveled tree-structured algorithm (LTS algorithm) consists of some fixed number of complete up- and/or down-sweeps on a complete binary tree, performing a unit-time computation at each node; (one-to-all) broadcast and accumulation (reduction) are 1-sweep LTS algorithms; parallel-prefix (scan) is a 2-sweep LTS algorithm.
Hence, when $L$ is as large as $N^{1/\log\log N}$, the RRP performs a sweep within time $2 \log N \log\log N$ + l.o.t.
(b) The performance of RRPs on LTS algorithms can be improved by at
1 Introduction
Overview
We study fine-grain computation on the Reconfigurable Ring of Processors (RRP), a parallel architecture whose processing elements (PEs) are interconnected via a multiline reconfigurable bus: each line of the bus has one-packet width and can be configured, independently of other lines, to establish an arbitrary PE-to-PE connection.
Our study is inspired by a novel strategy for (a) designing the bus of an RRP and for (b) passing one-word messages along the lines of the bus in a way that yields message latency that is logarithmic in the number of PEs the message traverses, at least for (MOS) wafer-scale implementations of RRPs. We are currently working on estimating the technological parameters that would enable our Cooperative Message Transmission (Comet) strategy to be realized. Our goal in the current paper, as in its companion paper [7], is to understand the computational consequences of a fine-grain logarithmic delay model for RRPs: sufficiently good news would provide powerful motivation for paying the technological cost that would enable the model to be realized. Henceforth, we focus only on RRPs that have been implemented so as to achieve logarithmic message latency (perhaps, but not necessarily, via Comet). The interested reader should see [6] for another theoretical study of reconfigurable architectures with logarithmic message latency.
In the remainder of this section, we describe, in turn, our main results, the detailed architecture of RRPs, and the Comet message-passing protocol. The subsequent sections of the paper describe our results and their proofs in detail.
Our Results
Our results are parameterized by the number of PEs in an RRP and the number of lines in its bus. We call an $N$-PE ring whose bus has $L$ lines an $(N, L)$-RRP when we want to make all parameters explicit, an $N$-RRP when we want to concentrate only on the number of PEs, and an RRP when the specific parameter values are not consequential.
In [7], we showed that RRPs can efficiently execute any normal hypercube algorithm. Here we study leveled tree-structured (LTS) algorithms, i.e., algorithms that can be specified in terms of a fixed number of complete up- and/or down-sweeps on complete binary trees, performing a unit-time task each time a tree-node is encountered. Problems that can be solved by such algorithms include (segmented versions of) broadcast and accumulation (a/k/a reduction), which are 1-sweep algorithms, and parallel-prefix (a/k/a scan), which is a 2-sweep algorithm. When executed on an $(N, L)$-RRP, our algorithm performs each sweep in time bounded above by expression (1):
$T(N, L) \le \frac{\log^2 N}{\log L} + \log N \log\log L$.   (1)
Hence, when $L$ is as large as $N^{1/\log\log N}$, each sweep takes time at most $2 \log N \log\log N$ plus lower-order terms. (Since we envision only chip- or wafer-scale RRPs, the values of $N$ will always be moderate, no more than, say, a few hundred; hence, a bus with $N^{1/\log\log N}$ lines is quite within the realm of feasibility.) It follows that an $N$-RRP, endowed with sufficient communication bandwidth (as measured by the number of buslines), can execute LTS algorithms almost as fast as can an $N$-node hypercube of commensurate-power PEs: the slowdown incurred is only roughly $2 \log\log N$.
In Section 3, we show that the algorithms of Section 2 are within constant factors of optimal, in a variety of senses. Throughout this section, we use (one-to-all) broadcast as the prototypical 1-sweep LTS algorithm, since it is (in a sense made precise in Section 4) the simplest such algorithm. In Section 3.1, we show that the time for performing a one-to-all broadcast on an $(N, L)$-RRP is bounded below by a constant multiple of $(\log^2 N)/(\log L)$.
Since this bound trivializes when $L$ is as large as $N$ (because even communicating between PEs $0$ and $N/2$ takes time proportional to $\log N$), we were motivated to find a nontrivial lower bound that holds for all RRPs, no matter how many buslines they have.
In Section 3.2 we prove such a bound: the time required for any $N$-RRP, no matter how many buslines it has, to perform a one-to-all broadcast is no smaller than
$T_{\mathrm{Broadcast}} \ge \frac{1}{15} \log N \log\log N$.
This lower bound is surprisingly general: not only does it apply to all $N$-RRPs, it applies also, with minor adaptation, to a much broader class of logarithmic-latency architectures. One can view this fact as suggesting that, at least for LTS algorithms, one could not gain appreciably (i.e., by more than a constant factor) if one replaced the ring-structured topology of RRPs by a more highly connected topology such as the mesh.
The extendibility of the lower bound in Section 3.2 both inspires and lays the technical groundwork for our very general lower bound in Section 4. We focus in that section on the following broad class of algorithms for any parallel architecture that uses point-to-point communication. We term an algorithm nontrivial if it requires some one PE of the architecture to receive and/or send information (either directly or indirectly) to and/or from all other PEs. We prove that, to within constant factors, no nontrivial algorithm can be performed more efficiently than the (one-to-all) broadcast operation. As a consequence, any $N$-PE architecture of the sort discussed in Section 3.2 requires time proportional to $\log N \log\log N$ to execute any nontrivial algorithm. We stress that this lower bound is a fundamental limitation imposed by the logarithmic-latency model and is independent of specific architectural implementations! (See footnote 2.)
1.3 The Abstract RRP Architecture
A static view. An $(N, L)$-RRP is a SIMD architecture that comprises $N$ identical PEs (see footnote 3), $P_0, P_1, \ldots, P_{N-1}$, which we view as placed with equal spacing around a circle (see Figure 1). There is a "bundle" of $L$ lines, each having one-packet width, passing outside the circle "over" the PEs and "through" the switches that provide the reconfiguration capability. Each PE of an $(N, L)$-RRP has $L$ associated communication sub-PEs (CPEs) to help it manage the message traffic on its bus; specifically, CPE $CP_{i,j}$ controls switches on line $j$ that allow PE $P_i$ to participate (either as a source, a destination, or an intermediate node) in a dedicated point-to-point path along that line.
We impose certain limitations on RRPs, in order to minimize the technological resources required to implement them.
First, as is implicit in the fact that buslines form point-to-point paths between pairs of communicating PEs, we assume that each message has exactly one sending PE and exactly one receiving PE; in particular, we do not support any "wired-or" or multireader bus capability. Second, we have our RRPs observe a single-port communication regimen: in a single step a PE can send at most one message and receive at most one (possibly, though not necessarily, involving distinct buslines). A word about these restrictions is in order. The constraint of single-port communication may well decrease the efficiency of RRPs on a broad range of computational problems; we have yet to study the multiport version of our model. However, the absence of a multireader capability has less overall impact than one might initially expect. Clearly, a multireader bus capability would enable an $N$-RRP to perform one-to-all broadcasts in time proportional to $\log N$; however, this [...]
Footnote 2: It is instructive to see how our model and results compare with those of [6].
Footnote 3: Throughout, when we discuss the PEs of an $N$-RRP, all PE-indices are computed modulo $N$.
The SIMD regimen observed by our RRPs allows switch settings to be computed centrally and downloaded to the CPEs; hence, we shall not comment further on this aspect of RRP operation.
For reasons that will become clear when we describe the Comet message-passing protocol (in Section 1.4), we study only "fine-grain" computations by RRPs, i.e., ones in which every inter-PE communication consists of a single one-packet message which is sent along a single busline.
Finally, to simplify algorithm specification and analysis, we assume henceforth that our RRPs have numbers of PEs $N$ and numbers of buslines $L$ that are powers of 2; in particular, we shall write (when convenient) $N = 2^n$ and $L = 2^\ell$. These assumptions, which can be avoided by clerical modifications to our development, will affect only small additive terms in our bounds.
Notation. For any pair of PE-indices $i$ and $j \ne i$, we denote by $B(i, j)$ the block of PEs $\{P_i, P_{i+1}, \ldots, P_j\}$.
Path formation and communication. The first step in performing a communication between PEs $P_i$ and $P_j$ (in either direction) is to establish a dedicated path that connects these two PEs. This is achieved as follows. The SIMD controller appropriates a segment of some busline which runs above either block $B(i, j)$ or block $B(j, i)$ and which is not currently used by any other communication. (Presumably, but not necessarily, if segments above both blocks are available, then the controller will choose the shorter one.) Say, for definiteness, that the available segment is a portion of busline $k$ that runs above block $B(i, j)$. The controller then establishes the following connections, using the switches in the CPEs; see Figure 2.
PE $P_i$ is connected to line $k$, via CPE $CP_{i,k}$;
PE $P_j$ is connected to line $k$, via CPE $CP_{j,k}$;
each CPE $CP_{h,k}$, where $i \le h < j$, is connected to CPE $CP_{h+1,k}$.
(If no busline is free above either block $B(i, j)$ or $B(j, i)$, then the message transmission must be deferred until a later time. The algorithms we present are carefully designed to avoid such an event.) After forming the dedicated path, the desired (one-packet) message is transmitted from $P_i$ to $P_j$ along line $k$.
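The path-formation rule above can be sketched in a few lines of Python. This is only an illustrative model, not the controller design: the class `Bus`, its method `connect`, and the helper `block` are hypothetical names, link `h` stands for the wire between $CP_{h,k}$ and $CP_{h+1,k}$, and the single-port restriction is not modeled.

```python
from typing import List, Optional, Set, Tuple


def block(i: int, j: int, n: int) -> List[int]:
    """PE indices of block B(i, j) = {P_i, P_{i+1}, ..., P_j}, taken modulo n."""
    length = (j - i) % n
    return [(i + step) % n for step in range(length + 1)]


class Bus:
    """Hypothetical bookkeeping for the bus of an (N, L)-RRP."""

    def __init__(self, n: int, num_lines: int) -> None:
        self.n = n
        # busy[k] holds the links of line k that are part of some dedicated path.
        self.busy: List[Set[int]] = [set() for _ in range(num_lines)]

    def connect(self, i: int, j: int) -> Optional[Tuple[int, List[int]]]:
        """Reserve a segment above B(i, j) or B(j, i), trying the shorter block first.

        Returns (line index, PEs under the segment), or None when no line is
        free, in which case the transmission has to be deferred.
        """
        candidates = sorted([(i, j), (j, i)], key=lambda e: (e[1] - e[0]) % self.n)
        for a, b in candidates:
            pes = block(a, b, self.n)
            links = set(pes[:-1])  # one link leaving each PE except the last
            for k, used in enumerate(self.busy):
                if not (links & used):
                    used |= links
                    return k, pes
        return None
```

With a single busline on an 8-PE ring, `Bus(8, 1).connect(0, 3)` reserves the segment above $B(0, 3)$; a later request for a path between $P_1$ and $P_2$ on the same bus returns `None`, which corresponds to the deferred transmission mentioned in the text.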
An Implementation Proposal
Our aim in this study is not to propose yet another abstract model, but rather to explore the consequences of stretching existing VLSI technology in certain directions. Therefore, we view as an integral part of our research the following outline of a strategy for achieving logarithmic (fine-grain) message latency in RRPs. This strategy has been a major motivating factor for our work; its technological feasibility and costs are currently under study.
Remark. The results in this paper, as those in [7], rely only on the abstract properties of RRPs (such as logarithmic message latency) that are described in Section 1.3. Having said this, we would be very excited if the following design strategy could be developed to yield a "real" instantiation of the results.
The hardware-design portion of our strategy is specified only implicitly: we would like to design the CPEs of RRPs so that they can support the Comet "cooperative" message transmission protocol, which (on paper, at least) allows the PEs of RRPs to exchange one-word messages with only logarithmic communication latency.
The Comet protocol builds on the assumption that there is a fixed transit time $\tau$ such that a one-packet message can be transmitted in $\tau$ (machine) cycles between:
any PE $P_i$ and any one of its CPEs $CP_{i,k}$ (in either direction);
any CPE $CP_{i,k}$ and a neighboring CPE $CP_{i \pm 1,k}$.
Footnote 4: All logarithms are to the base 2 unless otherwise specified.
As our description of Comet proceeds, the reader should note two defining properties of the protocol.
Comet builds on specific characteristics of MOS VLSI technology (particularly its being a capacitive, voltage-driven technology, rather than a current-driven one). Therefore, Comet will accelerate message transmission only with chip- or wafer-scale implementations of RRPs, not with implementations that leave an MOS environment. Consequently, we envisage the RRPs we study as comprising at most a few hundred PEs. This worldview makes it imperative that we always seek explicit analyses of bounds, rather than asymptotic ones.
Comet depends on having neighboring CPEs "cooperate" to accelerate the progress of a message in transit. This "cooperation" precludes pipelining messages, so we must restrict attention to routing small packets rather than, say, potentially long worms. This limitation explains our focussing here only on fine-grain computations and messages.
The Logic of Comet. We define the cooperative message transmission protocol that Comet uses to accelerate message transmission, via the following generic example. Let us focus on an arbitrary single-packet $(i \leftrightarrow j)$ message $M$ that PE $P_i$ wishes to transmit to PE $P_j$ on busline $k$. Assume that $P_i$ has already inserted message $M$ onto busline $k$ via CPE $CP_{i,k}$.
[Diagram: Step 0]
In the diagrams, a filled circle denotes a CPE that "knows" message $M$; an empty circle denotes a CPE that does not "know" message $M$; a single arrow denotes an "empty" link of the dedicated path $P(i, j)$; a double arrow denotes a link of the dedicated path $P(i, j)$ that contains message $M$.
Thus, the above diagram is intended to illustrate that, after one step (of duration $\tau$ cycles, where $\tau$ is the transit time), both CPEs $CP_{i,k}$ and $CP_{i+1,k}$ "know" message $M$.
[Diagram: Step 1]
[...] CPEs. In a voltage-driven technology, where delays are caused by capacitive loads, this harder "pumping" allows successively longer line segments to fill to threshold at successive steps of the transmission. This scheme leads to the logarithmic latency model described earlier.
Of course, the trick in making Comet work in a real technology is to build wires that will carry the pumped charge without melting. This limitation explains why we must model and simulate electrical overhead of the Comet protocol, in order to determine how large and fast an RRP the protocol will permit. Specifically, we wish to determine what clock speed (which is embodied in the transit time $\tau$) can be supported with various numbers of PEs. As we stated in Section 1.1, our goal here is to determine whether or not logarithmic message latency leads to computational efficiency that would induce one to pursue vigorously a switch design that will efficiently support the Comet protocol. We believe that our upper bounds, here and in [7], supply an affirmative answer to this question.
Remark. We stress that the Comet protocol is intended to diminish only the capacitive delays of message transmission in MOS technologies; it does not affect transmission-line limitations. Therefore, our speedup scheme does not run into any conflicts with speed-of-light limitations [1, 8].
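To see why cooperative "pumping" gives logarithmic latency, it may help to count steps under one simple reading of the protocol. The sketch below assumes that the number of consecutive CPEs that "know" the message doubles at every step; that doubling rule is an assumption made for illustration (the text only states that two CPEs know the message after the first step and that successively longer segments fill at successive steps), and `comet_steps` is a hypothetical name.

```python
def comet_steps(distance: int) -> int:
    """Steps until a CPE `distance` links from the source knows the message.

    Assumption (not taken from the paper): the set of informed CPEs doubles
    in size at every step, starting from the source CPE alone.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    informed = 1  # only the source CPE knows the message at the start
    steps = 0
    while informed < distance + 1:
        informed *= 2  # cooperating CPEs drive a segment twice as long
        steps += 1
    return steps
```

Under this reading a message needs `comet_steps(d)` steps, each of one transit time, to travel `d` links, which grows like $\log d$ rather than like $d$.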
LTS Algorithms for RRPs
This section is devoted to our upper-bound results. We begin, in Section 2.1, with an algorithm that allows an $(N, L)$-RRP to perform a single sweep of an LTS algorithm in time
$T(N, L) \le \frac{\log^2 N}{\log L} + \log N \log\log L$.
We then describe briefly, in Section 2.2, how to perform the operations of broadcast and accumulation using 1-sweep LTS algorithms and how to compute the parallel-prefix using a 2-sweep LTS algorithm; our descriptions are brief because these LTS algorithms are well known.
Generic Single-Sweep LTS Algorithms
In this subsection, we describe two intimately related parameterized families of 1-LTS algorithms. Each algorithm Down(N, L) (resp., Up(N, L)) simulates, on an $(N, L)$-RRP, a single downward (resp., upward) sweep on an $N$-leaf complete binary tree. For the sake of simplicity, we describe only algorithms that involve the entire RRP. However, because our algorithms actually work on a path of PEs, rather than on a ring, it should be clear that they are easily modified to compute segmented versions of the same operations.
[...] $- 1)$, in parallel, to execute the bottom $n - k$ levels of $T_N$. We assume that this execution ends with the leaves of $T_N$ (which are its level-$n$ nodes) distributed, in left-to-right order, in the PEs of $R_{N,L}$.
Choosing the parameter $k$. We have two goals that jointly determine our choice of the parameter $k$.
1. In order to minimize the communication overhead of our tree-sweep, we wish to be able to perform the remapping (Phase 2) at each level of the recursion via a single global communication. To accomplish this, we choose $k$ so that $2^k \le L + 1$.
We reason that we can remap up to $L$ level-$k$ nodes of $T_N$ via a single global communication, since $R_{N,L}$ has precisely this many buslines. Since the leftmost level-$k$ node does not move in the remapping (staying in PE $P_0$), we arrive at the indicated inequality; since $L$ is a power of 2, our inequality mandates making $k \le \ell$.
We now present a detailed specification of Algorithm Down(N, L). Each recursive invocation of the algorithm requires two parameters: the index root.index of the PE of $R_{N,L}$ that plays the role of the root of the current tree at that point in the recursion, and the number num.leaves of leaves of that tree. With no loss of generality, PE $P_0$ of $R_N$ will play the role of the root of the initial tree $T_N$; of course, the initial number of leaves is $N$. The detailed specification appears in Figure 3. We simplify the specification [...]
Using the following easily verified initial condition for the recurrence, $T(4, 2) = 2$, we now bound the recurrence term by term.
We claim first that, for all $M \ge 4$, $T(M, M) \le \log M \log\log M$.
We proceed by induction on $M$. [...] We now use bounds (3, 4) to establish the claimed bound (2) on $T(N, L)$.
$T(N, L) \le \log N_0 \log\log N_0 + \log N + T(N/N_0, L) \le (\log L \log\log L + \log N) \log_L N = \log N \log\log L + \log^2 N / \log L$.
The second step here uses bound (3) $\log_L N$ times to bound the $T(N/N_0, L)$ term. □
Speci c LTS Algorithms
We describe here simple LTS algorithms for three fundamental operations: broadcasting, accumulation, and parallel-prefix [2]-[4].
One-to-all broadcasting. In the operation of broadcasting, one PE (with no loss of generality, $P_0$) sends a single-packet message $M$ to all other PEs. Let the PEs of the architecture be mapped to the nodes of a complete binary tree in any way that places PE $P_0$ at the root of the tree. Now perform a sweep down the tree: as each PE-node receives the message $M$ from its parent, it relays the message to both of its children. At the end of the downward sweep, each PE "knows" the message $M$.
In the sequel, we denote by Br(N, L) the algorithm for broadcasting within an $(N, L)$-RRP that is based on the downward sweep algorithm Down(N, L).
Accumulation. The operation of accumulation (or, reduction) is defined for any binary associative operator $\otimes$. The $\otimes$-reduction of the vector $\langle x_0, x_1, \ldots, x_{N-1} \rangle$ is the product $x_0 \otimes x_1 \otimes \cdots \otimes x_{N-1}$. An architecture whose PEs are indexed from $0$ to $N - 1$ in such a way that each PE $P_i$ initially "knows" value $x_i$ can compute this reduction as follows. Assign the nodes of $T_N$ to the PEs in any way that assigns the leaves of $T_N$ to the PEs in their natural left-to-right order. Now perform a sweep up the tree: each PE computes the $\otimes$-product of the quantities it receives from its children and passes this product to its parent. At the end of this upward sweep, the PE that is assigned the root of $T_N$ "knows" the accumulated $\otimes$-product.
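The two 1-sweep operations just described can be sketched sequentially on an array-embedded complete binary tree. This is a minimal illustration of the tree sweeps only: it does not model the RRP bus or the Down/Up mapping of Section 2.1, it assumes the number of leaves is a power of 2, and the names `broadcast_down` and `reduce_up` are hypothetical. Node 1 is the root and node `v` has children `2v` and `2v + 1`.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def broadcast_down(message: T, num_leaves: int) -> List[T]:
    """One downward sweep: every node relays the message to both children."""
    tree: List = [None] * (2 * num_leaves)
    tree[1] = message
    for v in range(1, num_leaves):  # internal nodes, visited top-down
        tree[2 * v] = tree[v]
        tree[2 * v + 1] = tree[v]
    return tree[num_leaves:]  # what each leaf knows after the sweep


def reduce_up(values: List[T], op: Callable[[T, T], T]) -> T:
    """One upward sweep: every node combines the values of its two children."""
    n = len(values)
    tree: List = [None] * n + list(values)  # leaves in left-to-right order
    for v in range(n - 1, 0, -1):  # internal nodes, visited bottom-up
        tree[v] = op(tree[2 * v], tree[2 * v + 1])  # left operand first
    return tree[1]
```

Keeping the left child as the first operand matters for operators that are associative but not commutative, such as string concatenation.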
Parallel-prefix. The parallel-prefix (or, scan) operation is also defined for any binary associative operator $\otimes$. The $\otimes$-scan of the vector $\langle x_0, x_1, \ldots, x_{N-1} \rangle$ is the vector $\langle y_0, y_1, \ldots, y_{N-1} \rangle$, where each $y_i = x_0 \otimes x_1 \otimes \cdots \otimes x_i$. An architecture whose PEs are indexed from $0$ to $N - 1$ in such a way that each PE $P_i$ initially "knows" value $x_i$ can compute this scan as follows. Assign the nodes of $T_N$ to the PEs in any way that assigns the leaves of $T_N$ to the PEs in their natural left-to-right order. Now perform a sweep up the tree, followed by a sweep down the tree. During the upward sweep, the architecture performs a $\otimes$-reduction of the vector, but each PE retains (for the downward sweep) the value computed by its left child. During the downward sweep, each PE sends its retained value to its right child, which then computes the $\otimes$-product of its parent's retained value by its retained value (in that order). At the end of the downward sweep, each PE $P_i$ "knows" the quantity $y_i$.
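A sequential sketch of a 2-sweep scan follows. It uses the common textbook bookkeeping, in which each node receives from its parent the product of everything to the left of its subtree; this may differ in detail from the retained-value formulation in the text, so it should be read as one standard variant rather than as the authors' exact procedure. The name `scan_two_sweeps` is hypothetical, the input length is assumed to be a power of 2, and `None` is used internally to mean "nothing to the left".

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def scan_two_sweeps(values: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Inclusive scan via one upward and one downward sweep of a binary tree."""
    n = len(values)
    # Upward sweep: total[v] is the product of the leaves below node v.
    total: List = [None] * n + list(values)
    for v in range(n - 1, 0, -1):
        total[v] = op(total[2 * v], total[2 * v + 1])
    # Downward sweep: prefix[v] is the product of all leaves strictly to the
    # left of v's subtree, or None when there are no such leaves.
    prefix: List = [None] * (2 * n)
    for v in range(1, n):
        left, right = 2 * v, 2 * v + 1
        prefix[left] = prefix[v]
        if prefix[v] is None:
            prefix[right] = total[left]
        else:
            prefix[right] = op(prefix[v], total[left])
    # Each leaf combines what lies to its left with its own value.
    return [
        values[i] if prefix[n + i] is None else op(prefix[n + i], values[i])
        for i in range(n)
    ]
```

For example, scanning the list `[1, 2, 3, 4]` with addition yields the running sums, and scanning characters with concatenation yields the successive prefixes of the string, which checks that operand order is preserved.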
Lower Bounds for LTS Algorithms on RRPs
In this section, we prove lower bounds on the following quantities: [...]
In particular, we prove that, to within constant factors, no $(N, L)$-RRP can perform a one-to-all broadcast in fewer than $(\log^2 N)/(\log L)$ steps; and, no $N$-RRP, no matter how many buslines it has, can perform a one-to-all broadcast in fewer than $\log N \log\log N$ steps. Since broadcasting is intuitively the simplest 1-sweep LTS algorithm (an intuition that is verified in Section 4), these lower bounds demonstrate that our 1-sweep LTS algorithms in Section 2 (which operate within timebound (1)) cannot be sped up by more than a constant factor: our bound on $T^\star(N, L)$ provides the demonstration when the number $L$ of buslines is small; our bound on $T^\star(N)$ provides the demonstration when the number $L$ of buslines is large. We turn now to the details of our bounds and their proofs.
Proof. Say that the $(N, L)$-RRP $R$ has a PE $P_0$ which wants to broadcast a one-word message $M$ to all other PEs. We demonstrate that, no matter how $R$ disseminates $M$ to its PEs (subject, of course, to the limitations of RRPs), it can decrease only slowly the size of the largest remaining block of PEs that are "ignorant" of message $M$. To the end of verifying this, let us consider the execution of an arbitrary broadcast algorithm $A$ on $R$.
At the beginning of the broadcast, only PE $P_0$ "knows" message $M$, so the initial block of "ignorant" PEs is precisely the block $B_0 \stackrel{\mathrm{def}}{=} B(1, N - 1)$. Let $T_0$ be the first instant in the execution of Algorithm $A$ in which a message crosses either PE $P_{\lceil N/3 \rceil}$ or PE $P_{\lceil 2N/3 \rceil}$. By our delay model, $T_0 \ge \log(N/3)$. Because our RRP has $L$ buslines, at time $T_0$, no more than $2L$ messages are "traveling" along the bus and directed to the block $B(\lceil N/3 \rceil, \lceil 2N/3 \rceil)$. Now, even if we assume that all of these $2L$ [...]
If we now substitute for [...] in this bound and simplify, we obtain the desired bound (5). Since our reasoning holds for any broadcast algorithm on an $(N, L)$-RRP, the theorem follows.
This bound tells us nothing about the complexity of the broadcast operation on RRPs, because simply getting message $M$ from PE $P_0$ to PE $P_{N/2}$ must take time proportional to $\log N$ (because of the subadditivity of logarithms). In the next subsection we derive a lower bound on $T^\star(N, L)$ that is nontrivial no matter how large $L$ is, i.e., no matter how many buslines the RRP in question has. Moreover, this bound establishes the optimality (to within a constant factor) of Br(N, L) (hence, of Down(N, L) and Up(N, L)), even for large values of $L$.
A Limitation for All RRPs
In this section, we prove that the time taken by an $N$-RRP $R$ to broadcast must be proportional to $\log N \log\log N$, no matter how many buslines the RRP has. As usual, we assume, with no loss of generality, that PE $P_0$ is broadcasting a message $M$ to all other PEs. The reader should note that we could rephrase the proof of this bound to hold for any 1-sweep LTS algorithm. The rephrasing for downsweeps is little more than a change in terminology; the rephrasing for upsweeps is a bit more complicated, as one has to "run the proof backwards," which requires a bit of reformulation.
Proof. We proceed by induction on $N$, using as a base all $N \le 1024$. Our lower bound is trivial for $N$ in this range, since it asserts only that an RRP requires more than two steps to broadcast when $N > 4$. Let us, therefore, focus on an arbitrary fixed $N > 1024$, and let us assume inductively that the theorem holds for all smaller $N$. Say that we have an optimal algorithm $A$ that broadcasts on an $N$-RRP $R$ within time $T^\star(N)$. By Theorem 2.1, we know that $A$ operates within time $2 \log N \log\log N$. We can derive from $A$ a broadcast tree $BT(A)$ whose structure exposes how $A$ disseminates the broadcast message. The tree $BT(A)$ has node-set $\{0, 1, \ldots, N - 1\}$; $BT(A)$ has an edge from node $i$ to node $j$ precisely if, in Algorithm $A$, PE $P_j$ of $R$ receives the broadcast message for the first time via a direct communication from PE $P_i$. Note that node 0 is the root of $BT(A)$.
Henceforth, let us focus on the broadcast of a message $M$ by Algorithm $A$. The sum of the times for these three phases is clearly a lower bound on the total time for Algorithm $A$'s broadcast. The strategy of our proof is to show that there must exist a barrier for which each of these three operations takes a rather long time. We achieve this by showing that there must exist a barrier whose PEs must supply message $M$ to "many" subroots, all of which are "far" from the barrier and each of which must broadcast message $M$ to PEs of $R$ that belong to a "big" subtree of $BT(A)$. One can view the rest of the proof as quantifying the quoted words in the preceding sentence.
We turn now to the details of the argument. Given a barrier $B$, we call any PE that receives the broadcast message (for the first time) directly from some PE of $B$ a subroot induced by $B$. We say that a subroot $P_i$ covers a PE $P_j$ precisely when node $j$ is in the subtree of $BT(A)$ rooted at node $i$; this is equivalent to saying that $P_j$ receives the broadcast message for the first time from $P_i$, either directly or indirectly. For any set of PEs $\mathcal{P}$, we denote by $\mathrm{Cov}(\mathcal{P})$ the set of PEs $\mathcal{P} \cup \{\text{PEs covered by PEs in } \mathcal{P}\}$.
Notation. We denote the set of subroots induced by the $k$th barrier by $R_k$ (see Figure 4).
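The notions of subroot and cover are easy to compute once the broadcast tree is stored as a parent map. The sketch below is illustrative only: it treats a barrier simply as a given set of PE indices (the precise definition of a barrier is not reproduced here), it applies the subroot definition literally, and the names `subroots` and `cov` are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set


def subroots(parent: Dict[int, Optional[int]], barrier: Iterable[int]) -> Set[int]:
    """PEs whose first copy of the message came directly from a barrier PE."""
    barrier_set = set(barrier)
    return {pe for pe, p in parent.items() if p is not None and p in barrier_set}


def cov(parent: Dict[int, Optional[int]], pes: Iterable[int]) -> Set[int]:
    """Cov(P): the PEs in P together with every PE covered by a PE in P."""
    children: Dict[int, List[int]] = {pe: [] for pe in parent}
    for pe, p in parent.items():
        if p is not None:
            children[p].append(pe)
    covered = set(pes)
    stack = list(covered)
    while stack:  # walk down the subtrees rooted at the given PEs
        v = stack.pop()
        for c in children[v]:
            if c not in covered:
                covered.add(c)
                stack.append(c)
    return covered
```

Here `parent[j] = i` encodes the edge of the broadcast tree from node `i` to node `j`, and the root maps to `None`.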
Further, for notational clarity, we henceforth abbreviate the quantity $\frac{1}{15} \log N \log\log N$ by [...]$_N$. Our first lemma shows that $R$ must have a barrier which induces subroots that are located "far" from the barrier and that cover "many" PEs. Specifically, the lemma is our first step in quantifying the qualifiers we have been putting in quotes.
[...] thus yields the desired set of $k_1$ roots, which completes the proof. □
We are now ready to prove Theorem 3.2. Let $k_0$ and $k_1$ be the integers produced in the proofs of Lemmas 3.1 and 3.2, respectively. We bound the time that the optimal algorithm $A$ takes for each of the three phases of the broadcast defined by the barriers we have selected.
The time for Phase 1. We claim that Algorithm $A$ (indeed, any algorithm) needs time [...]. In order to see this, say that, at the instant when the last copy of these $k_1$ [...]
Since $T^\star(x) \ge \log x$ for all $x$ (see the closing paragraph of Section 3.1), the value of $p$ that minimizes maximization (7) can be determined to have the form $p_0 = k_1 / \log k_1$.
The time for Phase 2. The $k_1$ instances of message $M$ that are sent from within barrier $B_{k_0}$ to the "remote" subroot PEs must pass over block $B_{k_0+1}$ in transit. The time $T_2$ necessary for the last message instance to make this trip can be no smaller than $T_2 \ge \frac{1}{2} \log N$, because of the distance traveled.
The time for Phase 3. 
4 Nontrivial Algorithms on Arbitrary Architectures
This section is devoted to proving two rather surprising generalizations of Theorem 3.2. First, the lower bound of the Theorem holds not only for any single-sweep LTS algorithm, as shown in Section 3.2, but also for any algorithm that requires nontrivial communication. We say that an algorithm $A$ requires nontrivial communication (for short, is nontrivial) if it requires some PE $P$ to receive/send information, directly or indirectly, from/to all other PEs during the course of the computation.
The second generalization of Theorem 3.2 is perhaps even more surprising. The lower bound of the Theorem, even when generalized to all nontrivial algorithms, holds for a much broader class of parallel architectures than just RRPs. In fact, the bound holds for parallel machines that communicate via arbitrary point-to-point fixed interconnection networks, provided only that the machines have been implemented in a way that satisfies the following conditions (which are quite consistent with current technology).
Let M be any N-PE parallel machine having point-to-point connections between PEs, which is implemented in such a way that these three conditions hold.
Theorem 4.1 The N-PE machine M requires time proportional to log N log log N to execute any nontrivial algorithm A.
Proof. When machine $M$ executes algorithm $A$, there is at least one PE whose final state is affected by the initial state of every other PE; let $P_0$ denote one such "sink" PE. We prove the theorem in two steps. First, we prove that the time that $M$ takes to execute Algorithm $A$ can be no smaller than the time that $M$ takes to perform a one-to-all broadcast from PE $P_0$. Second, we bound from below the time that $M$ must take to perform this broadcast. The latter proof evolves in a manner similar to the proof of Theorem 3.2.
Let us attack the first portion of our proof by considering the dependency tree $DT(A)$ of Algorithm $A$, which is constructed as follows. We assign PE $P_0$ to be the root of $DT(A)$. We assign as the children of $P_0$ those PEs (call them $P_1, P_2, \ldots, P_k$) that communicated directly to $P_0$ during the execution of $A$. We assign as the children of $P_0$'s child $P_1$ those PEs that:
communicated directly with $P_1$ before its last communication with $P_0$ and
have not yet been assigned within the tree;
name these grandchildren of the root $P_j$, for $j = k+1, k+2, \ldots$. We repeat the assignment process for PEs $P_2, P_3, \ldots$, in order of the indices we are assigning during this process, until all PEs of $M$ are assigned to tree-nodes. The essential features of the assignment are that the children of PE $P_i$ must have communicated directly with $P_i$ before its last communication with its parent in the tree, and they must not yet have been assigned within the tree; we then index these new tree-nodes using the smallest as-yet unassigned indices.
Now, it is obvious that $DT(A)$ is a spanning tree of the computation graph of Algorithm $A$. Given that communication delay is not affected by the direction of the communication (by condition 3), we can conclude that the time taken by $M$ to execute Algorithm $A$ is no smaller than the time that $M$ would take to perform a one-to-all broadcast from $P_0$, using $DT(A)$ as the broadcast tree (cf. the proof of Theorem 3.2).
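The construction of the dependency tree can be sketched as a breadth-first pass over a log of communications. Several choices here are assumptions made for illustration: the log format `(time, sender, receiver)`, the reading of "communicated directly with" as "sent a message to", the use of a PE's overall last message to its parent as its cutoff time, and the name `dependency_tree`. The sketch also keeps the PEs' original indices instead of renaming them in order of assignment.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Event = Tuple[int, int, int]  # (time, sender, receiver); hypothetical log format


def dependency_tree(events: List[Event], root: int = 0) -> Dict[int, Optional[int]]:
    """Return a map child -> parent describing a dependency tree rooted at `root`."""
    parent: Dict[int, Optional[int]] = {root: None}
    # cutoff[p]: messages arriving at p at or after this time are ignored.
    cutoff: Dict[int, float] = {root: float("inf")}
    queue = deque([root])
    while queue:
        p = queue.popleft()
        senders = sorted(
            {src for (t, src, dst) in events
             if dst == p and t < cutoff[p] and src not in parent}
        )
        for c in senders:
            parent[c] = p
            # Time of c's last communication with its parent p.
            cutoff[c] = max(t for (t, src, dst) in events if src == c and dst == p)
            queue.append(c)
    return parent
```

If the algorithm is nontrivial in the sense defined above, every PE influences the sink and so should end up in the returned map; a PE missing from the map signals that the log does not describe a nontrivial computation under these assumptions.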
We next bound from below the time taken to perform the broadcast. To this end, we formally define the two-dimensional analog of the blocks $B_i$ of Theorem 3.2, for our proof follows the logical flow of the proof of that Theorem.
Given any two-dimensional layout of machine $M$, let us use the position of PE $P_0$ as a reference point, in order to partition the PEs of $M$ into two-dimensional blocks; we call these blocks squares as an aid to the intuition. We effect the partition as follows:
square $Sq_i$ comprises those PEs whose distance from $P_0$ in the tree $DT(A)$ is greater [...]
The following lemmas are the two-dimensional analogs of Lemmas 3.1 and 3.2 and are proved in much the same way.
Let T be the optimal time to broadcast using broadcast tree DT(A). The proof of the theorem now proceeds by induction on N and is similar to that of Theorem 3.2. 
