Abstract. We investigate the complexity of searching a sorted table of n elements on a synchronous, shared memory parallel computer with p processors. We show that $\Omega(\lg n - \lg p)$ steps are required if concurrent accesses to the same memory cell are not allowed, whereas $O(\lg n/\lg p)$ steps are sufficient if simultaneous reads are allowed. The lower bound is valid even if only communication steps are counted, and the computational power of each processor is not restricted. In this model, $\Theta(\sqrt{\lg n})$ steps are required for searching when the number of processors is unbounded. If the amount of information that a memory cell may store is restricted, then the time complexity for searching with an unbounded number of processors is $O(\lg n/\lg\lg n)$. If the amount of information a processor may hold is also restricted, then an $\Omega(\lg n)$ lower bound holds. These lower bounds are first proven for comparison-based algorithms; it is next shown that comparison-based algorithms are as powerful as more general ones in solving problems defined in terms of the relative order of the inputs.
1. Introduction. With the advance of microelectronics it has become feasible to build parallel machines with thousands of cooperating processors. Yet, practice indicates that a thousandfold increase in raw computational power does not increase performance by the same amount. There are two main reasons for that. The first one is that not every problem admits an efficient parallel solution. The second one is that not every parallel algorithm can be mapped efficiently onto a realistic parallel computer architecture. Work sharing between many processors generates significant overheads for communication of data and coordination. The study of parallel complexity is dedicated, to a large extent, to the understanding of these two phenomena.
One useful model for the study of parallel computations is that of a paracomputer. It consists of many identical autonomous processors, each with its own local memory and its own program. In addition, the machine has a shared memory and each processor can in one step access any cell in shared memory.
We obtain successively weaker models by varying the assumptions concerning simultaneous accesses to shared memory:
(1) Concurrent Read, Concurrent Write (CRCW). Both simultaneous reads and simultaneous writes to the same memory cell are allowed. The effect of simultaneous actions by the processors is as if the actions occurred in some serial order (for other possible definitions of CRCW, see [2]).
(2) Concurrent Read, Exclusive Write (CREW). Simultaneous reads are allowed but a processor can modify a shared memory cell only if it has exclusive access to it.
(3) Exclusive Read, Exclusive Write (EREW). Simultaneous accesses to the same shared memory cell are not allowed.
This model can be further weakened in two ways. One can restrict the number of shared memory cells. One can also restrict the set of processors that have read or write access to each memory cell. If there is a unique processor that has read access and a unique processor that has write access to each shared memory cell, then each memory cell represents a unidirectional link that connects two processors. We then speak of an ultracomputer.
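The three access disciplines listed above can be illustrated by a small legality check on one synchronous step. This is only a sketch under our own assumptions: the function name `step_is_legal` and the representation of an access as a `(processor, cell, mode)` tuple are hypothetical and do not come from the paper.

```python
from collections import Counter

def step_is_legal(accesses, model):
    """Check one synchronous step against an access discipline.

    accesses: list of (processor, cell, mode) tuples, mode is "R" or "W"
              (a hypothetical encoding chosen for this sketch).
    model:    "CRCW", "CREW" or "EREW".
    """
    readers = Counter(cell for _, cell, mode in accesses if mode == "R")
    writers = Counter(cell for _, cell, mode in accesses if mode == "W")
    if model == "CRCW":
        # Any mix of simultaneous reads and writes is allowed.
        return True
    if model == "CREW":
        # A written cell must be accessed by its single writer alone.
        return all(writers[c] == 1 and readers[c] == 0 for c in writers)
    if model == "EREW":
        # Every accessed cell is accessed by exactly one processor.
        return all(count == 1 for count in (readers + writers).values())
    raise ValueError(f"unknown model: {model}")
```

For instance, two processors reading the same cell form a legal CREW step but an illegal EREW step.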
In these models communication both of data and of control information is done through the shared memory. Thus, studying the relative power of these models is tantamount to studying the effect of constraints on communication and coordination on computational power.
We consider the problem of searching for a key within a sorted list of n keys. The binary search algorithm solves this problem in (optimal) sequential time $O(\lg n)$. This can be generalized to a "(p + 1)-ary" search algorithm that solves the problem in $O(\log_{p+1} n)$ steps with p processors. At each step p comparisons are done that split the list into p + 1 equal-length segments, and the search proceeds recursively within the unique segment that may contain the key. This algorithm is optimal, so that p processors speed up searching by a factor of lg (p + 1) only: searching does not admit an efficient parallel solution.
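The "(p + 1)-ary" search described above can be simulated sequentially to count its parallel steps. The following is a minimal sketch, assuming that the searched key is freely readable by all simulated processors and that the result wanted is the rank of the key (the number of table entries not exceeding it); the name `p_ary_search` is ours.

```python
def p_ary_search(table, key, p):
    """Simulate (p+1)-ary search on a sorted list.

    Returns (rank, steps): rank is the number of entries <= key,
    steps is the number of simulated parallel steps.
    """
    lo, hi = 0, len(table)  # invariant: table[:lo] <= key < table[hi:]
    steps = 0
    while lo < hi:
        width = hi - lo
        # p probe positions splitting table[lo:hi] into p+1 nearly equal segments
        probes = [lo + (j * width) // (p + 1) for j in range(1, p + 1)]
        # one comparison per simulated processor, all in the same step
        outcomes = [table[pos] <= key for pos in probes]
        steps += 1
        for pos, not_greater in zip(probes, outcomes):
            if not_greater:
                lo = max(lo, pos + 1)
            else:
                hi = min(hi, pos)
    return lo, steps
```

Each step shrinks the candidate range by a factor of at least p + 1, so the step count grows like $\log_{p+1} n$; with p = 1 the sketch reduces to ordinary binary search.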
Consider now the problem of implementing this parallel algorithm. It turns out that the speedup can be achieved only if one item of information can be broadcast to all processors in constant time. On the other hand we show that in the EREW model, where an item of information can be accessed by one processor only at a time, searching requires $\Omega(\lg n - \lg p)$ steps. Note that no transmission of data is required for parallel search, but the processors need to coordinate the search at each iteration. It turns out that the time spent in coordinating the processors offsets exactly the gain obtained from simultaneous table look-ups.
The $\Omega(\lg n - \lg p)$ lower bound is valid even if each processor may in one step do any amount of local processing or transfer any amount of information. The only restriction is that at each step a processor may read or write at a unique location in memory.
This result settles the problem of the relative power of the different shared-memory machine models. It is known that a p-processor machine which supports concurrent accesses to the same location in memory can be simulated by a p-processor machine with no concurrent accesses to the same location in memory with a lg p time penalty [5] . Our result shows that this simulation is optimal.
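To give a feeling for where a lg p factor can come from, the sketch below distributes one value to p processors without any concurrent read, by doubling the number of copies at each step. It is not the simulation of [5]; it only illustrates, under our own simplifying assumptions (one private cell per processor, hypothetical name `erew_broadcast`), that an exclusive-read broadcast can be done in about lg p steps.

```python
def erew_broadcast(value, p):
    """Copy `value` into p cells by doubling, with exclusive reads only.

    Returns (cells, steps). In each step, processor i reads the cell of
    processor i - have, so no cell is read by more than one processor.
    """
    cells = [None] * p
    cells[0] = value
    have = 1      # number of cells that already hold the value
    steps = 0
    while have < p:
        for i in range(have, min(2 * have, p)):
            cells[i] = cells[i - have]  # distinct source cell for each reader
        have = min(2 * have, p)
        steps += 1
    return cells, steps
```

With p = 1000 the loop runs 10 times, against a single step when concurrent reads are allowed.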
Under the same assumptions we prove that the time complexity of searching with an unbounded number of processors is $\Theta(\sqrt{\lg n})$.
The computational model is very strong since there are no restrictions on the amount of information that can be transmitted in one step. This is remedied by assuming that a memory cell may contain a unique input value, and that inputs are atomic entities, so that an input symbol cannot be used to encode the values of several inputs. A memory cell may also store a symbol taken from a small domain of internal values.
In this model the time complexity of searching with an unbounded number of processors is $\Theta(\lg n/\lg\lg n)$. Finally, if we impose a similar restriction on the local memory of each of the processors, then an $\Omega(\lg n)$ lower bound is valid, independently of the number of processors.
The $\Omega(\lg n)$ lower bound on search implies a similar lower bound for the insertion problem on a shared memory parallel machine with no concurrent access to the same memory location. This settles an open problem posed by Borodin and Hopcroft in [2].
The lower bounds for searching with an unbounded number of processors are proven for comparison-based computations. We also prove that comparison-based algorithms are as efficient as more general ones in solving problems that are defined in terms of the relative order of the inputs. We first show that if such a problem can be solved by comparisons for inputs taken from a subset of values where every possible permutation of the inputs occurs, then it can be solved by comparisons, using the same resources, for any input. We next prove that given a paracomputer, one can build a sufficiently large set of input values where the behavior of each processor at each step depends only on the relative order of the pairs of inputs it has access to, but not on their actual values. The last result is proven by an application of Ramsey's theorem.

The terms paracomputer and ultracomputer are taken from [10], but are used here with a slightly different meaning.
The remainder of this paper is organized as follows. The implementation of the "(p + 1)-ary" searching algorithm is discussed in the next section. In §3 we prove the $\Omega(\lg n - \lg p)$ lower bound for a simplified version of the searching problem. This is followed in §4 by a description of an $O(\sqrt{\lg n})$ algorithm for searching in the EREW model with O(n) processors. In §5 we present the reduction of general paracomputer computations to computations using only comparisons. This reduction is used in §6 to prove the $\Omega(\sqrt{\lg n})$ lower bound for searching with an unrestricted number of processors. In §7 we examine the complexity of searching in restricted paracomputer models. Conclusions and open problems are presented in the last section.

We require in the EREW model that the first components of $\alpha_i(s_i^t)$ and $\alpha_j(s_j^t)$ be distinct for any $i \ne j$. In the CREW model we require that if $\alpha_i(s_i^t) = (k, RW)$, then the first component of $\alpha_j(s_j^t)$ is distinct from k for any $j \ne i$. In both cases the processor states and register contents are well defined.
The outputs of the computation are contained at step T in the k output registers.
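Thus, a paracomputer $\Pi$ computes the function $F_\Pi$ that maps the input to the contents of the output registers at step T:
$F_\Pi(\bar{x}) = (c_{j_1}^T(\bar{x}), \ldots, c_{j_k}^T(\bar{x}))$,
where $R_{j_1}, \ldots, R_{j_k}$ are the k output registers.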
Note that we do not restrict the size of the alphabet, or the number of processor states (they may both be infinite).
A function F is finite if it has finite range, say {0, ..., k}.

Proof. The lower bound results from the lack of a mechanism to distribute information instantaneously throughout the system. In order to prove it we have to trace at each step the "information" represented by the state of each processor and the content of each register in shared memory. We do that at each step for all the possible input values, thus obtaining a "synoptic" description of the possible computations.
The information represented by the state of a processor (the content of a register) consists of the set of input values that could produce this state (content). It is important to note that the information represented by the content of a register may change even if this register is not modified, as the fact that no processor stored a new value is informative in itself. Cook and Dwork show in [4] how such "negative" information can be used to advantage.
With each processor (register) is associated at each step a partition of the input symbols, according to the distinct states (values) the processor (register) may assume at this step. The lower bound will be obtained by tracing the evolution of these partitions, and showing that the number of sets in these partitions cannot grow too fast. In fact, we shall not trace the partitions that obtain in an actual computation, but the partitions that would obtain in a computation where there is no "loss of information", i.e. a computation where a processor stores a complete account of the information it has whenever it writes in shared memory. These partitions depend only on the access pattern of the processors.
Rather than counting the number of classes in each partition, it is easier to count the number of critical points of the partition.
Let $\Pi$ be an EREW paracomputer with p processors, q registers, and time bound T, that computes a unary encoding of the finite function DRF associated with the discrete root finding problem for sequences of length n. We define inductively sets $P(i, t)$, $i = 1, \ldots, p$, and $R(j, t)$, $j = 1, \ldots, q$, such that $P(i, t)$ ($R(j, t)$) contains all the critical points of the partition defined by processor $P_i$ (register $R_j$) at step t of the computation.
(iii) $r \in P(i, t)$ iff (a) $r \in P(i, t-1)$, or (b) $P_i$ accesses at step t of the computation on input $\bar{x}_r$ the register $R_j$, and $r \in R(j, t-1)$.
(iv) $r \in R(j, t)$ iff (a) $r \in R(j, t-1)$, or (b) $P_i$ modifies at step t of the computation with input $\bar{x}_r$ the register $R_j$, and $r \in P(i, t-1)$, or (c) $P_i$ modifies at step t of the computation with input $\bar{x}_{r-1}$ the register $R_j$, and $r \in P(i, t-1)$.
These definitions are illustrated in Fig. 1.
CLAIM.
(i) If $r \notin P(i, t)$, then $s_i^t(\bar{x}_r) = s_i^t(\bar{x}_{r-1})$.
(ii) If $r \notin R(j, t)$, then $c_j^t(\bar{x}_r) = c_j^t(\bar{x}_{r-1})$.
Proof. By induction on t. The claim is obvious for t = 0. We suppose it holds for t − 1 and prove it for t.
... inductive assertion, $c_j^{t-1}(\bar{x}_r) = c_j^{t-1}(\bar{x}_{r-1})$, so that ... On the other hand, if $r \in R(j, t-1)$, then $r \in P(i, t)$.
(ii) As $r \notin R(j, t-1)$, $c_j^{t-1}(\bar{x}_r) = c_j^{t-1}(\bar{x}_{r-1})$. Suppose that $P_i$ modifies $R_j$ at step t of the computation with input ... If $r \notin P(i, t-1)$, then, by the inductive assumption, ...
On the other hand, if $r \in P(i,$ ...
At most one processor may access at step t of a computation on input $\bar{x}_r$ the register $R_j$ (this is the point where we are using the EREW property). Thus each occurrence of r in a set $R(j, t-1)$ contributes at most one new occurrence of r in a set $P(i, t)$ according to rule (iii.b). Let $J_t$ be the set of registers accessed by some processor at step t of some computation. The number of distinct registers accessed by $P_i$ at step t of some computation is bounded by the number of segments in the segment partition determined by the points in the set $P(i, t-1)$, i.e. by $|P(i, t-1)| + 1$. It follows that $|J_t| \le \sum_i |P(i, t-1)| + p$.
We obtain the inequality
Combining the last two inequalities one obtains that
We did not use in the proof the fact that concurrent writes are not allowed. Indeed, the lower bound is still valid for an ERCW (exclusive read, concurrent write) parallel machine. It is also valid even if inputs are initially replicated, so that each input value can be accessed concurrently by all the processors.
The last lower bound is optimal. An EREW machine with n + 1 processors can solve the root finding problem for n keys $x_1, \ldots, x_n$ in constant time by comparing in parallel $x_i$ to $x_{i+1}$, $i = 0, \ldots, n$ ($x_0 = 0$, $x_{n+1} = 1$, by definition). This generalizes to an algorithm that solves the problem with p processors $P_1, \ldots, P_p$ in $O(\lg(n/p))$ steps as follows. Let $i_j = \lceil j(n+1)/p \rceil$. Firstly, each processor $P_j$ checks in parallel whether $x_{i_{j-1}} < x_{i_j}$. Next, the unique processor $P_j$ that found a strict inequality continues to execute a serial bisection algorithm on the list $x_{i_{j-1}+1}, \ldots, x_{i_j-1}$.
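The p-processor root finding scheme just described can be sketched as a sequential simulation. The sketch assumes that the keys form a nondecreasing 0/1 sequence (so that exactly one block shows a strict inequality) and that the block boundaries are rounded up; both are our assumptions, as is the name `root_find`. The first round is counted as one step, although on a true EREW machine each boundary key is read by two processors and the round takes two exclusive reads.

```python
def root_find(x, p):
    """Find r with x_r = 0 and x_{r+1} = 1, using sentinels x_0 = 0, x_{n+1} = 1.

    x: nondecreasing list of 0/1 keys x_1..x_n (assumption of this sketch).
    Returns (r, steps), where steps counts simulated parallel steps.
    """
    n = len(x)
    ext = [0] + list(x) + [1]  # ext[i] is x_i, with the two sentinels added
    # block boundaries i_0 = 0 < ... <= i_p = n + 1 (integer ceiling division)
    bounds = [(j * (n + 1) + p - 1) // p for j in range(p + 1)]
    # step 1: every simulated processor P_j compares the ends of its block
    owner = next(j for j in range(1, p + 1)
                 if ext[bounds[j - 1]] < ext[bounds[j]])
    lo, hi = bounds[owner - 1], bounds[owner]  # ext[lo] == 0, ext[hi] == 1
    steps = 1
    # the owning processor finishes alone by serial bisection
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ext[mid] == 0:
            lo = mid
        else:
            hi = mid
        steps += 1
    return lo, steps
```

With n = 100 and p = 8 the blocks hold about 13 keys, so the bisection phase takes at most 4 further steps.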
This simple algorithm can be extended to solve by comparisons the general range searching problem in $O(\lg n - \lg p)$ steps, provided that the searched key can be accessed concurrently by all the processors.

4.

Proof. The idea of the algorithm is illustrated in Fig. 2. The search is carried out according to a multiway search tree where the branching factor is doubled at each level (Fig. 2b). Such a tree of depth t contains $2^{t(t+1)/2} - 1$ keys, so that searching a table of that size requires t accesses to nodes of the tree. An access to a node of this tree is done in one memory access provided that an encoding of the tuple of keys at that node has been stored in one memory cell. Once the encoding of the keys at the node has been read, the decoding and the subsequent comparisons are performed locally, i.e. at no cost.
FIG. 2b. Corresponding multiway search tree.
The search will be carried out by one processor. Concurrently, the remaining processors will compute and store encodings of tuples of keys. The multiway search tree is obtained by "compressing" the binary search tree: a node at level i of the multiway search tree contains the keys belonging to a complete subtree of depth i, rooted at level $\frac{1}{2} i(i-1) + 1$, in the binary tree (Fig. 2a). The processors will compress the binary search tree, increasing by one at each iteration the depth of the subtrees whose encodings have been computed.
We now describe this algorithm more formally. Assume w.l.g. that $n + 1 = 2^{t(t+1)/2}$.
Let $a_k = \frac{1}{2} k(k+1)$. The algorithm consists of t iterations. At the end of iteration i the search has proceeded through i levels of the multiway search tree, that correspond to $a_i$ levels of the binary search tree. Also, encodings have been computed for the keys belonging to each complete subtree of the binary tree that has depth i + 1 and has its leaves at level $a_j$, $j = i+1, \ldots, t$ (see Fig. 2a). In particular, encodings have been computed for the keys of each subtree of depth i + 1 rooted at level $a_i + 1$, i.e. for each tuple of keys belonging to a node of the multiway search tree at level i + 1. During iteration i + 1 the searching processor accesses one of these encodings to push the search one level down on the multiway search tree; each of the remaining processors computes an encoding of the keys belonging to a binary tree of depth i + 2 by combining the encodings for the left and right subtrees which have been computed at the previous iteration, and the key at the root.
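The size of the multiway search tree used in this proof can be checked numerically. The sketch below counts the keys of a tree in which a node at level i holds the $2^i - 1$ keys of a complete binary subtree of depth i and has $2^i$ children, and then finds the depth needed for a given table size. The function names are ours, and the code only illustrates the counting, not the parallel compression itself.

```python
def keys_in_doubling_tree(t):
    """Number of keys in a multiway tree of depth t whose branching doubles.

    A node at level i holds 2**i - 1 keys and has 2**i children.
    """
    nodes, total = 1, 0
    for i in range(1, t + 1):
        total += nodes * (2**i - 1)  # keys stored at level i
        nodes *= 2**i                # number of nodes at level i + 1
    return total

def depth_for(n):
    """Smallest depth t such that the tree holds at least n keys."""
    t = 1
    while keys_in_doubling_tree(t) < n:
        t += 1
    return t
```

The count agrees with $2^{t(t+1)/2} - 1$, so a table of a million keys needs depth 6, against 20 levels for plain binary search; in general t grows like $\sqrt{2 \lg n}$.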
At each iteration, each key and each encoding is accessed at most once, and the total number of new encodings computed is less than n. It follows that each iteration can be performed in constant time using O(n) processors, and that only O(n) registers are needed. The total running time is $O(t) = O(\sqrt{\lg n})$. □

5. Order invariant computations and canonical paracomputers. We prove in this section that algorithms using only comparisons are as powerful as more general algorithms in solving comparison based problems. In order to do so, it will be convenient to work with a paracomputer model where information on input values is clearly distinct from other information.
Let $X^{\le n}$ be the set of strings over X of length at most n. A paracomputer $\Pi$ with n inputs is in canonical form if it fulfills the following conditions.
(i) The set of processor states is of the form $X^{\le n} \times C$ and the set of register symbols is of the form $X^{\le n} \times D$ (n is the number of inputs to $\Pi$). (The input symbol $x \in X$ is identified with the register symbol $(x, 0)$, where $0 \in D$ is a fixed constant.)
(ii) If $\omega((\sigma, s)) = (\tau, v)$, then every element of $\tau$ occurs in $\sigma$ (a processor can write an input symbol only if it is present in its "local memory").
(iii) If $\delta((\sigma, s), (\tau, v)) = (\sigma', s')$ then every element of $\sigma'$ occurs either in $\sigma$ or in $\tau$ (a processor can store an input symbol in its "local memory" only if it is already there, or if it accessed it from shared memory).
We call C the set of control symbols, and D the set of coordination symbols.

We postpone the proof of this theorem to the next section. It is important to note that the number of control and coordination symbols of $\mathcal{T}(\Pi)$ is independent of the number of input symbols.
The following definitions will make precise the notion of a comparison based computation. Two strings $\bar{x}, \bar{y} \in X^n$ are order equivalent ($\bar{x} \equiv \bar{y}$) if, for all i, j, $x_i < x_j \iff y_i < y_j$.
A function F defined on $X^n$ is order invariant if $\bar{x} \equiv \bar{y} \Rightarrow F(\bar{x}) = F(\bar{y})$.

Informally, a canonical paracomputer is order invariant if the behavior of each processor depends only on the value of the control and coordination symbols, and the relative order of the input values it has access to, but not on the value of the input symbols themselves. In particular, the computation will follow the same course on two sets of input values where the inputs have the same relative order.
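The two definitions above translate directly into a small check. The sketch below tests order equivalence of two tuples and evaluates one order invariant function, the rank of the last input among the others. The names `order_equivalent` and `rank_of_last` are ours, and the rank function is only an illustrative stand-in for the searching function studied in the paper.

```python
def order_equivalent(x, y):
    """True iff x_i < x_j holds exactly when y_i < y_j, for all i, j."""
    if len(x) != len(y):
        return False
    n = len(x)
    return all((x[i] < x[j]) == (y[i] < y[j])
               for i in range(n) for j in range(n))

def rank_of_last(x):
    """Number of entries among x[:-1] that do not exceed the last entry.

    Order invariant: x_i <= x_n is the negation of x_n < x_i, so the
    result depends only on the relative order of the inputs.
    """
    return sum(1 for v in x[:-1] if not (x[-1] < v))
```

For example, (3, 1, 2) and (30, 10, 20) are order equivalent and have the same rank, while (1, 1, 2) and (1, 2, 3) are not order equivalent.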
Let $\sigma|_{\bar{x}}^{\bar{y}}$ denote the string obtained by substituting in $\sigma$ each occurrence of $x_i$ by an occurrence of $y_i$. We leave to the reader the straightforward proof of the following lemma, which formalizes the last claim.

... $N = N(n, G, s)$ ... where $\sigma' \in Y^{\le n}$ is order equivalent to $\sigma$ and $\tau' \in Y^{\le n}$ is order equivalent to $\tau$. As $\Pi$ is order invariant on inputs from $Y^n$, $\Pi'$ is well defined, and order invariant on all inputs from $X^n$. Also, the computations of $\Pi'$ are identical to the computations of $\Pi$ for inputs taken from $Y^n$. Thus $\Pi'$ computes a unary encoding of F on $Y^n$, and by Corollary 5.3, computes a unary encoding of F on all $X^n$.

6. Lower bounds on searching with an unbounded number of processors. We prove in this section the $\Omega(\sqrt{\lg n})$ lower bound on searching with an unbounded number of processors. The argument consists of three parts. Firstly, we shall complete the proof of Theorem 5.1, thereby reducing the problem to canonical paracomputers. Secondly, the results of the previous section can be used to reduce the problem to canonical order invariant paracomputers. Finally, an argument similar to that used in the proof of Thm. 3.1 yields the lower bound.
Proof of Theorem 5.1. We shall build $\mathcal{T}(\Pi)$ from $\Pi$ by stipulating that whenever a processor of $\Pi$ writes onto shared memory, then the corresponding processor of $\mathcal{T}(\Pi)$ writes onto shared memory a complete account of the information available to it; whenever a processor of $\Pi$ reads from shared memory, the corresponding processor of $\mathcal{T}(\Pi)$ reads and stores in its local memory (its state) the content of the register accessed. Thus, each processor of $\mathcal{T}(\Pi)$ has at each step sufficient information to simulate the corresponding processor of $\Pi$.
In a processor state $(\sigma, c)$, $\sigma$ will contain the input symbols whose values "are known" to the processor, and $c$ will represent the knowledge of the processor about the memory accesses that were executed. A similar convention holds for register values.
We shall use S-expressions to encode information on memory accesses. $L \in \mathcal{S}(X)$, the set of S-expressions over the set X, if ...
If processor $P_i$ accesses register $R_j$ at step t of the computation with input $\bar{x}$, then $\mathcal{L}(P_i, \bar{x}, t) = (\mathcal{L}(P_i, \bar{x}, t-1)\ \mathcal{L}(R_j, \bar{x}, t-1))$. If processor $P_i$ modifies register $R_j$ at step t of the computation with input $\bar{x}$, then ...; otherwise $\mathcal{L}(R_j, \bar{x}, t) = \mathcal{L}(R_j, \bar{x}, t-1)$.
CLAIM.
(i) The value of i and the state $s_i^t(\bar{x})$ of $P_i$ at step t of the computation with input $\bar{x}$ are uniquely determined by $\mathcal{L}(P_i, \bar{x}, t)$.
(ii) The content $c_j^t(\bar{x})$ of $R_j$ at step t of the computation with input $\bar{x}$ is uniquely determined by j and $\mathcal{L}(R_j, \bar{x}, t)$.
Proof. The claim is trivially true for t = 0. Assume it is valid for t − 1.
(i) Let $L = \mathcal{L}(P_i, \bar{x}, t)$. The value of i can be determined from the first element of ... where $J''$ is an atom-free S-expression (distinct from $J'$) that encodes the number i.

Let us pick 2n + 1 elements from X, $b_0 < a_1 < b_1 < \cdots < a_n < b_n$, and consider the behavior of the algorithm on the n + 1 sets of inputs $\bar{y}_i = (a_1, \ldots, a_n, b_i)$, $i = 0, \ldots, n$. Note that $RS(\bar{y}_i) = i$.
We now follow the same approach as in the proof of Theorem 3. ... and of $x$ ($= a$) are known to P (occur in $\sigma$ and $\sigma'$), in which case $\sigma$ is not order equivalent to $\sigma'$. This motivates the following definitions. We define inductively the sets $P(i, t)$ and $R(j, t)$ as follows.
(ii) $r \in P(i, t)$ iff (a) $r \in P(i, t-1)$, or (b) $P_i$ accesses $R_j$ at step t of the computation on input $\bar{y}_r$, and $r \in R(j, t-1)$, or ...

The last lower bound is valid even if we allow concurrent reads from those input registers that contain the searched table. It is only the access to the searched key that has to be restricted.
7. Paracomputers with bounded bandwidth. The $O(\sqrt{\lg n})$ algorithm relies heavily on the fact that the content of one register may encode the values of an arbitrary number of inputs, so that an arbitrary amount of information can be transferred in one read or write operation, and processed in one instruction cycle. This is not a realistic assumption.
We can restrict this model by restricting the type of operations that can be performed on inputs. This is the approach usually followed in the analysis of comparison based algorithms, where it is assumed that inputs are atomic entities that can be only compared. We obtain a "structured" computational model (in the sense used by [3] ), which is more amenable to analysis.
Such a restriction runs against the basic approach of this paper, which is that of assuming powerful computational nodes, but restricted communication ability. We shall instead impose "structure" on the type of items that can be transmitted in one access to memory. We shall assume that a memory register may contain a unique input symbol; it can also contain a communication symbol, taken from a small set. Inputs are transferred atomically, so that an input symbol cannot encode the values of a tuple of input symbols. Formally, let $\Pi$ be a paracomputer with input set X and set of register symbols V. Then ...

Proof. The algorithm used is similar to that given in Theorem 4.1. Assume w.l.g. that $n = (t+1)! - 1$. The search proceeds according to a multiway search tree of depth $t = O(\lg n/\lg\lg n)$, where nodes at level i contain i keys, and have, therefore, i + 1 children (see Fig. 3). Such a tree, of depth t, contains $(t+1)! - 1$ keys, so that a table of that size can be searched in t iterations. A processor is assigned to each node of that tree. This processor reads at each iteration one key, and stores its value in its local memory. At iteration i the processors assigned to nodes at level i have accessed all the keys at their node. Each processor is also assigned a mailbox. The searched key is initially in the mailbox of the processor assigned to the root. At iteration i the processors assigned to level i nodes access their mailbox. One processor finds the searched key in its mailbox, and compares it to the keys of its node, thereby selecting a node at level i + 1 where the search proceeds. It then puts the searched key in the mailbox associated with the node selected.
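The path followed by the searched key through this tree can be sketched sequentially. The code below assumes a sorted table whose length is exactly $(t+1)! - 1$ and an implicit in-order layout of the tree over the table; the layout and the name `factorial_tree_search` are our assumptions. It follows only the mailbox path and does not model the processors that pre-read the keys of their nodes.

```python
def factorial_tree_search(table, key):
    """Search a sorted table of length (t+1)! - 1 along the multiway tree.

    A node at level i holds i keys and has i + 1 children.
    Returns (rank, levels): rank is the number of entries <= key,
    levels is the number of tree levels visited (equal to t).
    """
    lo, hi = 0, len(table)  # current subtree covers table[lo:hi]
    level = 0
    while lo < hi:
        level += 1
        # size of each of the level + 1 child subranges
        child = (hi - lo - level) // (level + 1)
        # the node's keys sit between consecutive child subranges
        node_keys = [lo + (c + 1) * child + c for c in range(level)]
        # local comparisons done by the processor owning this node
        below = sum(1 for pos in node_keys if table[pos] <= key)
        lo = lo + below * (child + 1)  # descend into the selected child
        hi = lo + child
    return lo, level
```

For t = 3 the table holds 4! − 1 = 23 keys and every search visits exactly 3 nodes, with 1, 2 and 3 local comparisons respectively.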
It is easy to see that each iteration can be implemented in constant time, using O(n) processors, O(n) registers, and two communication symbols. A matching lower bound can be proven, using the methods of the previous section. We leave to the reader the proof of the following analogue to Theorem 5. ...
(i) $S = X^k \times C$, where C is a set of c control state symbols.
(ii) Each input symbol that occurs in $\omega_i((\sigma, c))$ occurs in $\sigma$.
(iii) If $\delta_i((\sigma, d), u) = (\sigma', d')$, then each symbol of $\sigma'$ occurs either in $\sigma$ or in $u$.
The second condition states that a processor may write an input symbol only if it is stored in one of its local registers. The third condition states that the value of a local register at step t is either the value of a local register at step t − 1 or a value read from memory. Note that a paracomputer of bounded processor bandwidth is a canonical paracomputer (provided that $k \le n$).

Proof. The same argument that was twice applied works here as well. Inequality (6.5) of Theorem 6.2 is valid with K = k and $H = 2^{T-1}$. We obtain that $n \le \frac{1}{2} H (2K+3)^T \le \frac{1}{4} (4k+6)^T$, which yields the result. □

The last result is asymptotically optimal: if processors may store in their local memory k keys, then it is possible to search a table of size n in $O(\log_{k+1} n + \lg k)$ steps.

There are few methods known to prove lower bounds for parallel algorithms which are not based on fan-in arguments. This paper contributes one such new method.
It seems to capture two "real-life" problems encountered while writing parallel programs: it is hard to parallelize algorithms with many test and branch operations; and frequent coordination between concurrent processes may offset any gain obtained from concurrency.
This paper also provides a method to generalize lower bounds obtained for comparison based algorithms to less restricted algorithms. In that, we were inspired by the work of Yao [14] . This method can be useful in other settings as well, and in particular can be used to analyse distributed algorithms [8] .
A more natural constraint on information transfer would be to restrict the number of bits that can be stored in one memory cell. We believe that our lower bounds are valid in such a model too, but the proofs seem much harder to obtain.
The paracomputer models we presented admit a few interesting variations. As noted in §2, the $O(\lg n/\lg p)$ searching algorithm can be implemented on any EREW shared memory parallel machine where one processor has the ability to broadcast messages to all the other processors in constant time (a BEREW machine?). If all the processors share this broadcasting ability (only one broadcast is allowed at a time), then this algorithm can be implemented even in the absence of shared memory. We have here a model of parallelism corresponding to a bus-oriented architecture. A similar model was studied by Stout [11].
Another natural variation is to assume that conflicting memory accesses do not result in an error, but rather in a busy signal being returned to all but one of the requests; alternatively one may postulate a queuing scheme at the memory.
In a real parallel machine memory is likely to be organized into modules with exclusive access being enforced at the level of the memory module rather than at the level of the memory cell. This suggests that we consider computational models where the number of shared memory cells is restricted, and where the amount of information that can be transferred in one read or write operation is smaller than the content of a memory cell. The work of Baer, Du and Ladner [1] , and of Vishkin and Wigderson [13] is a useful start in the investigation of such systems.
