Optimal parallel randomized algorithms for sparse addition and identification  by Spirakis, Paul G.
INFORMATION AND COMPUTATION 76, 1-12 (1988) 
Optimal Parallel Randomized Algorithms for 
Sparse Addition and Identification* 
PAUL G. SPIRAKIS 
Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, 
New York 10012; and Computer Technology Institute, 
P.O. Box 1122, 26110 Patras, Greece 
Although many sophisticated parallel algorithms now exist, most of them are not 
sensitive to properties of the input which can be determined only at run-time. For 
example, in the case of parallel addition in shared memory models, we intuitively 
understand that we should not add those inputs whose value is zero. A technique 
which exploits this idea may beat the general lower bound for addition if the count 
of nonzero operands is much smaller than the count of numbers to be added. In this paper, 
we devise such algorithms for two fundamental problems of parallel computation. 
Our model of computation is the CRCW PRAM. We first provide a randomized 
algorithm for parallel addition which never errs and computes the result in 
O(log m) expected parallel time, where m is the count of nonzero entries among the 
n numbers to be added. This algorithm uses O(m) shared space. We then use this 
result to solve an interesting problem of processor identification. All our techniques 
enjoy the following properties: 
(1) They never produce an erroneous answer. 
(2) If T is the actual parallel time and E(T) its expected value, then 
Prob{T > k·E(T)} ≤ n^{-c}, where k is arbitrary and c > 1 is linear in k and can be 
specified by the implementor of the algorithm. 
(3) Our algorithms do not know m initially, but they produce an accurate 
estimate for it. © 1988 Academic Press, Inc. 
1. INTRODUCTION 
Recently there has been much interest in fast parallel algorithms that 
employ randomization. A natural question to ask is whether randomization 
may lead to techniques which can use dynamic properties of the input (e.g., 
sparseness) in an effective way. For example, in the case of parallel addition 
(or multiplication) in shared memory models, we understand intuitively 
* This work was supported in part by the NSF grants MCS 8300630 and DCR 8503497 
and by the Greek Ministry of Energy, Industry and Technology. A previous version of this 
work appeared in the 3rd Symposium on Theoretical Aspects of Computer Science 
(STACS 86). 
Copyright © 1988 by Academic Press, Inc. 
All rights of reproduction in any form reserved. 
that we should not add (multiply) those inputs whose value is zero (one). 
Even if we manage to quickly estimate the number of nonzero inputs, we 
still must organize them in an appropriate manner (e.g., pack them in 
shared memory), in order to perform the addition. Our goal here is to 
devise such algorithms (by use of randomization) which are sensitive to 
such dynamic properties of the input and hence beat the known lower 
bounds (which hold for the general case). 
We use the synchronous concurrent read-concurrent write (CRCW) 
model of parallel computation (called WRAM) (see, e.g., [SV, 83; G, 78]). 
This model assumes the presence of a (potentially) unlimited number of 
processors with (potentially) unlimited local memory in each processor. We 
assume that our processors are capable of doing independent probabilistic 
choices on a fixed input. (This was first used by [R, 82a, b; V, 83].) 
WRAM is like the PRAM of [W, 79; FW, 78] in the sense that different 
processors can read the same memory location at the same time. However, 
in the WRAM model, in the case of a simultaneous write attempt, exactly 
one processor succeeds. We make no assumption as to which one succeeds 
but we do assume that the failed ones are notified. This can be easily 
implemented by having processors read the result of the “write.” 
We first consider the fundamental problem of parallel addition of n numbers. 
Our technique first provides a probabilistic estimate of the count (m) 
of the nonzero inputs, and then uses a probabilistic method to lay them out 
in shared memory and add them. The whole algorithm takes O(log m) 
expected parallel time, uses O(m) shared space, and involves only O(m) 
processors. To our knowledge, deterministic WRAM algorithms for 
addition must take Ω(log n) parallel time when at most n processors are 
used and n numbers are to be added. 
We then examine the related problem of processor identification: n 
processors of a WRAM are given, each holding either a 0 or a 1, and each processor must find out which 
are the processors with the 1's. We assume that each shared memory 
location can fit a number of size at most O(n). We first use a nice technique 
of Erdős and Rényi [ER, 63] to provide an O(n/log n) parallel time 
solution to the problem, under the assumption that counting O(n) units has unit cost. 
We then use our first result (about addition) to provide an 
O(min{m, n log m/log n}) expected parallel time algorithm for the WRAM 
which uses only O(m) shared space. We also give a matching lower bound 
for the parallel time. 
All our results satisfy the following: If T is the actual parallel time of our 
algorithm and E(T) is its expected value, then Prob{T > k·E(T)} ≤ n^{-c}, 
where k > 1 is any constant and c > 1 grows linearly with k and can be 
controlled by the algorithm designer. 
2. THE CASE OF PARALLEL ADDITION 
2.1. The Algorithm 
Let the array M represent the shared memory. Let a ≥ 4 be a positive 
integer constant. Let each processor P_i be equipped with a local variable, 
TIME_i, intended to keep the current parallel step. Initially, each processor 
P_i (1 ≤ i ≤ n) holds locally a number x_i. The goal is to compute the sum of 
the x_i's. Let m be the number of the nonzero x_i's. We give the algorithm in 
two parts: Procedure ADDITION(m') actually performs the addition, 
assuming that an estimate m' = cm + d (c, d > 1 constants) is known. Function 
ESTIMATION produces such an estimate. So, the whole algorithm has the 
following high level description: 
begin 
m' ← ESTIMATION 
ADDITION (m’) 
end 
We provide the description of ESTIMATION first. In ESTIMATION, 
each P_i with x_i ≠ 0 produces k estimates of m (k is a constant) through a 
probabilistic technique, and then does a variance-reduction process to get 
the final estimate. The actual value of k is determined in the analysis. The 
probabilistic technique, which produces the estimates of m, is described in 
the procedure PRODUCE-AN-ESTIMATE. In this procedure, the 
processors with nonzero x's (m in number) produce a Monte Carlo 
estimate of the maximum of m geometric random variables. This is 
achieved by having the m processors flip coins and see how long it takes 
until the last one gets a head. A detailed description of ESTIMATION follows: 
Function ESTIMATION 
procedure PRODUCE-AN-ESTIMATE 
begin 
  stage 1 (Initialization) 
    Processor P_1 initializes a special shared memory location 
    (CLOCK) to zero. Then, each P_i executes TIME_i ← 0. 
  stage 2 (Estimate) 
    Processor P_i 
    if x_i ≠ 0 then 
    begin 
      (1) Flip a fair coin (two-sided) 
      (2) If the outcome is 'tail' then 
          begin 
            (2a) TIME_i ← TIME_i + 1 
            (2b) CLOCK ← TIME_i 
            (2c) go to (1) 
          end 
    end 
    comment: The following is done by processors which flipped 
    a 'head' 
    (3) If x_i ≠ 0 then 
    begin 
      (4) read CLOCK into a local variable R_1 
      (5) wait for 5 steps 
      (6) read CLOCK into a local variable R_2 
      (7) If R_1 ≠ R_2 then go to (4) 
    end 
    comment: At this point, every P_i with x_i ≠ 0 has flipped a 'head' 
    (8) Each P_i with x_i ≠ 0 reads CLOCK and makes its value 
        the current estimate. 
end (of procedure PRODUCE-AN-ESTIMATE) 
begin (main part of ESTIMATION) 
  Each P_i with x_i ≠ 0 runs procedure PRODUCE-AN-ESTIMATE 
  k times and produces estimates E_1, E_2, ..., E_k. Then all compute 
  (1) E ← (log 2) · (E_1 + ... + E_k)/k 
  (2) m' ← exp(2) · exp(E) + d 
  where d > 1 is a constant. 
  m' is the value returned by ESTIMATION. 
  We assume that it is written to a special shared memory location, so that it 
  is available to all processors. 
end (of function ESTIMATION) 
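
The estimation scheme can be simulated sequentially. The following Python sketch is illustrative only and is not the paper's PRAM code: it models each call of PRODUCE-AN-ESTIMATE as the maximum of m geometric random variables (the distribution used in Lemma 1) instead of stepping through shared-memory reads and writes. The function names, the parameter names, and the default d = 2 are assumptions introduced here.

```python
import math
import random


def flips_until_head(rng):
    """Number of fair coin flips up to and including the first head."""
    flips = 1
    while rng.random() < 0.5:  # treat this outcome as a 'tail' and flip again
        flips += 1
    return flips


def produce_an_estimate(m, rng):
    """Model of CLOCK: the maximum of m independent geometric variables."""
    return max(flips_until_head(rng) for _ in range(m))


def estimation(m, k, d=2.0, rng=None):
    """Average k estimates, then return m' = exp(2) * exp(E) + d."""
    rng = rng or random.Random()
    total = sum(produce_an_estimate(m, rng) for _ in range(k))
    e = math.log(2) * total / k
    return math.exp(2) * math.exp(e) + d
```

Here m must be at least 1. With a large k the returned value should fall between m and m·exp(4) + d in almost every run, which is the window that the analysis of Section 2.2 aims for.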
We now provide a description of procedure ADDITION(m’). It has 
three stages. In the first stage (initialization) a number (which is a multiple 
of m) of contiguous memory locations is initialized to zero. In the second 
stage, all processors with nonzero x’s write their x’s in these shared 
memory locations, a different location for each processor. (This stage is 
called memory marking.) Finally, the third stage performs a standard 
parallel addition of the numbers stored in the above-mentioned shared 
memory locations. A detailed description of each stage follows: 
PROCEDURE ADDITION (m') 
Stage 1 (Initialization) 
  In one parallel step, processors initialize am' + 2 shared memory 
  locations to zero, by executing: "Processor P_j writes a zero to M(j), if 
  j ≤ am' + 2". Then, they all execute TIME_j ← 0. 
Stage 2 (Memory Marking) 
  Processor P_j 
  IF x_j ≠ 0 then 
  BEGIN 
    (1) Select y equiprobably at random from {1, 2, ..., am'} 
    (2) TIME_j ← TIME_j + 1 
    (3) Read M(y); TIME_j ← TIME_j + 1 
    (4) If M(y) = 0 then write x_j into M(y). 
        Also, TIME_j ← TIME_j + 1 
    (5) If the "write" failed then 
        BEGIN 
          (5a) write TIME_j into M(am' + 1) 
          (5b) go to (1) 
        END 
  END 
  Comment: The following part is executed by P_j with x_j = 0 and by 
  "successful" P_j with x_j ≠ 0. 
    (6) Read M(am' + 1) into a local variable R_1 
    (7) Wait for 8 steps 
    (8) Read M(am' + 1) into a local variable R_2 
    (9) If R_1 ≠ R_2 then go to (6) 
  Comment: R_1 = R_2 means all processors with x_j ≠ 0 succeeded in writing 
  x_j in a shared memory location, different for each processor, among 
  M(1), ..., M(am'). (If a processor was failing, the value of M(am' + 1) 
  would change.) 
Stage 3 (Addition) 
(Processor P_j is assigned to location M(j), 1 ≤ j ≤ am'.) 
From this point on, processors P_j (where 1 ≤ j ≤ am') perform a standard 
parallel addition of the numbers M(1), ..., M(am'). In the ith parallel 
step of the addition (i = 1, 2, ...), processor P_j adds M(j) and M(j + 2^{i-1}) into M(j), for 
j = k·2^i + 1, k = 0, 1, 2, ..., whenever j + 2^{i-1} ≤ am'. (See, e.g., [K, 82] or [FW, 78] on how 
to do the parallel addition of am' numbers by am' processors in O(m') 
space and O(log m') parallel time.) 
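
The three stages of ADDITION(m') can be sketched as a sequential simulation. This is a minimal illustration under stated assumptions, not the paper's implementation: one loop iteration stands for one parallel round of memory marking, the winner of a concurrent write is taken to be the first contender in an arbitrary order, indices are 0-based, and the termination-detection cell M(am' + 1) is replaced by the loop condition. The names `sparse_add` and `m_est` are hypothetical.

```python
import random


def sparse_add(xs, m_est, a=4, rng=None):
    """Return (sum of xs, number of marking rounds) using a * m_est cells."""
    rng = rng or random.Random()
    size = a * m_est
    M = [0] * size                        # Stage 1: initialise the cells to zero
    pending = [x for x in xs if x != 0]   # processors holding nonzero inputs
    rounds = 0
    while pending:                        # Stage 2: memory marking
        rounds += 1
        claims = {}
        for x in pending:
            claims.setdefault(rng.randrange(size), []).append(x)
        pending = []
        for y, writers in claims.items():
            if M[y] == 0:
                M[y] = writers[0]             # exactly one concurrent write succeeds
                pending.extend(writers[1:])   # the others are notified and retry
            else:
                pending.extend(writers)       # the cell was occupied earlier
    step = 1                              # Stage 3: standard tree addition
    while step < size:
        for j in range(0, size - step, 2 * step):
            M[j] += M[j + step]
        step *= 2
    return M[0], rounds
```

The returned sum does not depend on the random choices, which mirrors the claim that the algorithm never errs. If `m_est` is so small that a * m_est is below the number of nonzero inputs, the marking loop never ends, matching the non-termination caveat of Theorem 1.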
2.2. Analysis of the Algorithm 
LEMMA 1. At the end of each execution of procedure PRODUCE-AN-ESTIMATE, 
the variable CLOCK is a random variable, whose mean and 
variance satisfy 
(1) E(CLOCK) · log 2 > log m + 0.5 
(2) (E(CLOCK) − 1) · log 2 < log m + 0.5 
(3) var(CLOCK) < 4. 
Proof. CLOCK is the maximum of m independent geometric random 
variables X_1, ..., X_m (the numbers of coin flips until a head for the P_i's with 
x_i ≠ 0) with density Prob{X_i = j} = (1/2)^j, j ≥ 1. The rest is a relatively easy 
calculation, since Prob{CLOCK ≤ j} = (Prob{X_1 ≤ j})^m. ∎ 
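
The "relatively easy calculation" can be checked numerically from the product formula for Prob{CLOCK ≤ j}. The sketch below is an illustration, assuming natural logarithms in the lemma and truncating the series at a hypothetical cutoff `max_j`; it uses the tail-sum identities E(X) = Σ Prob{X > j} and E(X²) = Σ (2j + 1)·Prob{X > j}, with sums over j ≥ 0.

```python
import math


def clock_moments(m, max_j=200):
    """Mean and variance of the maximum of m geometric(1/2) variables on {1, 2, ...}."""
    mean = 1.0     # j = 0 term: Prob{CLOCK > 0} = 1
    second = 1.0   # (2*0 + 1) * Prob{CLOCK > 0}
    for j in range(1, max_j):
        # Prob{CLOCK > j} = 1 - (1 - 2^-j)^m, evaluated stably for tiny 2^-j
        tail = -math.expm1(m * math.log1p(-(0.5 ** j)))
        mean += tail
        second += (2 * j + 1) * tail
    return mean, second - mean * mean
```

For m = 2 both the mean and the variance work out to 8/3, which gives a quick sanity check of the routine before it is applied to larger m.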
LEMMA 2. Given any δ > 0, if we choose k ≥ 4/δ then, with probability at 
least 1 − δ, we have 
(1) |E − log m| ≤ 2 and 
(2) The total running time of ESTIMATION is O((1/δ) · log m). 
Proof. From the Chebyshev inequality and Lemma 1 we get 
Prob{|E − log m| ≤ 2} > 1 − 4/k. Also note that the running time of 
ESTIMATION is O(k · E) = O(E_1 + ... + E_k). ∎ 
COROLLARY 1. Given any δ > 0, if we choose k ≥ 4/δ then, with 
probability at least 1 − δ we have 
m ≤ m' ≤ m · exp(4). 
Proof. It follows immediately from Lemma 2. ∎ 
In the following we assume k ≥ 4/δ for a fixed small δ. 
LEMMA 3. Conditioned on the event E = {m ≤ m' ≤ m · exp(4)}, the time 
of stage 2 of procedure ADDITION(m') has an expected value of O(log m). 
Furthermore, the (conditional on E) probability that the time of stage 2 
exceeds β · log m is ≤ m^{−β log a + 1} (and can be made arbitrarily small). 
Proof. It is easy to see that every time a processor P_j attempts to write 
its x_j, and if g ≤ m shared memory locations are already "occupied," the 
competitors of P_j are m − g − 1. Even if all of them manage to select different 
memory locations which were not occupied previously, the 
maximum number of locations that P_j must "avoid" is g + m − g − 1 = 
m − 1. 
So P_j will succeed with probability at least 
$$\frac{am' - (m - 1)}{am'} \ge \frac{am' - (m' - 1)}{am'} > \frac{a - 1}{a}$$
in each trial (and this holds for every P_j). 
A generalization of Lemma 1 about the maximum of m geometrics with 
success probability ≥ 1 − 1/a implies that the average number of parallel 
steps required for all m processors to succeed is O(log m') = O(log m). 
The probability that there exists a processor which continues failing for 
at least β log m rounds is at most 
$$m \cdot (1/a)^{\beta \log m} = m^{-\beta \log a + 1}.$$
∎ 
It is easy to see that the algorithm uses O(m’) shared memory, O(m’) 
processors, and performs the addition correctly, because, at the end of 
stage 2 of ADDITION(m') the m nonzero x_i's are placed one in each of m 
shared memory locations, and these locations are among M(l), . . . . M(am’). 
The rest of these locations contain zeros. So, we have: 
THEOREM 1. Given any δ ∈ (0, 1), we can choose β > 0 such that with 
probability at least 1 − max(δ, m^{−β log a + 1}), our algorithm performs the 
parallel addition in O(log m) time, and uses O(m) shared space and O(m) 
processors. Our algorithm never errs. With diminishingly small probability, it 
may choose a bad estimate m' of m and hence it may never exit the loop 
(6)-(9) of stage 2 of ADDITION(m'). 
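
To get a feel for the tail term in Theorem 1, the union bound of Lemma 3 can be evaluated for concrete parameters. The sketch below is only a numerical illustration; it assumes natural logarithms, and the function name is hypothetical.

```python
import math


def stage2_failure_bound(m, a, beta):
    """Union bound m * (1/a)^(beta * log m) on some processor failing beta * log m rounds."""
    rounds = beta * math.log(m)
    return m * (1.0 / a) ** rounds
```

The value equals m^(1 − β log a), so the bound is below 1 only when β > 1/log a. For example, with m = 1024, a = 4 and β = 2 it is smaller than 10^-5.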
3. THE PROCESSOR IDENTIFICATION PROBLEM 
3.1. An O(n/log n) Parallel Time Algorithm Which Assumes Unit-Cost 
Addition 
The processor identification problem assumes that n processors are given, 
each keeping either a 0 or a 1. The problem is for each processor to find 
out which are the processors with the 1's. We first solve this problem for 
the so-called strong W-RAM (SW-RAM) model. This model has the 
property that simultaneous writes on the same memory location succeed 
only if they write the same value, and, if that is the case, their sum is recorded. 
We also assume that each shared memory location can hold only 
numbers of size up to O(n). Let us imagine that all the processors are equipped 
with the same list L = l_1, l_2, ..., l_s of "testing" sequences, where each l_i is an 
n-bit sequence of 0's and 1's. Let us also assume that L is independent of 
the particular assignment of 0's and 1's to processors. In the following, let v 
be a fixed memory location and let e_i be the value of processor P_i. 
Processors execute the following sequence of steps, s times: 
Round i (1 ≤ i ≤ s) 
(1) P_1 erases v's contents by writing a 0 
(2) Each P_j (1 ≤ j ≤ n) looks at the jth position of l_i. If l_i(j) = 1 and 
e_j = 1 then P_j writes e_j to location v. 
(3) Each P_j reads v's contents. 
At the end of the s rounds, each processor has, for each testing sequence, 
the number of places in which a 1 stands in both the testing sequence and 
the sequence e_1, e_2, ..., e_n to be guessed. If L allows each processor to find 
e_1, ..., e_n after the s rounds, we call L an s-algorithm for the processor 
identification problem. (We allow unrestricted local memory per processor.) 
An obvious L (which would take O(n) parallel time) is that consisting of 
n l_i's with l_i(j) = 0 for i ≠ j and l_i(i) = 1 for all i. 
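
The rounds described above amount to computing one inner product per testing sequence. A small sequential sketch, with hypothetical function names and 0-based indices, shows the values that every processor reads from the shared location and why the "obvious" L recovers the assignment directly:

```python
def run_rounds(L, e):
    """Value read after each round: the number of j with l_i(j) = 1 and e_j = 1."""
    return [sum(bit for bit, row_bit in zip(e, row) if row_bit == 1) for row in L]


def identity_tests(n):
    """The obvious list L: the i-th testing sequence has a single 1, at position i."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]
```

With `identity_tests(n)` the vector of readings is the assignment itself, at the cost of n rounds. A single all-ones testing sequence only reveals the number of 1's.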
Erdős and Rényi [ER, 63] considered a very closely related problem, the 
"coin-weight" problem. Using their techniques, we show that the s needed 
is Θ(n/log n) and that L can be easily constructed. 
Let us view L as an s x n matrix of O’s and 1’s. 
LEMMA 4 (see also [ER, 63]). A matrix L, s × n of 0's and 1's, is an 
s-algorithm for the processor identification problem iff: For each pair c, c' of 
subsets of the set C of columns of L, such that c ≠ c', if we form the row-sums 
of the submatrices L(c) and L(c') (consisting of the selected columns) 
and denote by V_c and V_{c'} the column-vectors consisting of these row-sums, 
then V_c ≠ V_{c'}. 
Proof. After s rounds, each processor has a row-sum vector, V, of L. 
This corresponds to just one subset c of the set of columns of L. This subset 
determines the processors with the 1's, because c is exactly the subset of 
processors with a value equal to 1. ∎ 
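
For small n the condition of Lemma 4 can be tested by brute force over all 2^n column subsets. The checker below is a sketch for illustration (exponential time, hypothetical name), not a construction used in the paper:

```python
from itertools import product


def is_s_algorithm(L):
    """True iff all 2^n subsets of columns of L have distinct row-sum vectors."""
    n = len(L[0])
    seen = set()
    for e in product((0, 1), repeat=n):   # e is the indicator of a column subset
        v = tuple(sum(r * x for r, x in zip(row, e)) for row in L)
        if v in seen:
            return False
        seen.add(v)
    return True
```

As an example, the 3 × 4 matrix with rows 1101, 1011, 0111 passes the test, so four processors can be identified in three rounds. The 2 × 3 matrix with rows 110, 101 fails, because its first column equals the sum of the other two.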
Here, the techniques of [ER, 631 can be used to prove: 
LEMMA 5. A matrix L, s × n, with s = an/log n, a ≥ 6, chosen so that 
the s × n entries are independent random variables each taking on the values 
0, 1 with probability 1/2, is an s-algorithm, with probability tending to 1 as 
n → ∞. 
Proof. Let p = Prob{L is an s-algorithm}, let q = 1 − p, and let 
E(c_1, c_2), where c_1, c_2 are subsets of the set of columns C of L, denote the 
event that V_{c_1} = V_{c_2} (where V_{c_i} is the row-sum vector of L(c_i)). If c_1, c_2 are 
not disjoint and V_{c_1} = V_{c_2}, then for d_1 = c_1 − c_1 ∩ c_2 and d_2 = c_2 − c_1 ∩ c_2 we have 
V_{d_1} = V_{d_2}. Hence, if L is not an s-algorithm, there exist disjoint subsets of 
the set of columns, such that V_{d_1} = V_{d_2}. So 
$$q \le \sum_{d_1, d_2} \mathrm{Prob}\{E(d_1, d_2)\},$$
where d_1, d_2 are disjoint subsets of C. 
One can then get, by some combinatorics, that 
$$q \le 2^{\,n(\log_2 3 - a/2) + o(n)} \quad \text{for } s = \frac{an}{\log n}.$$
So, if we choose a > log_2 9 + 2 then we get 
$$q < 2^{-n} \quad \text{and} \quad \lim_{n \to \infty} q = 0.$$
∎ 
COROLLARY. There exists an s-algorithm, for s = an/log n, a ≥ 6. 
Proof. Since, from Lemma 5, q < 2^{-n}, we get p > 1 − 2^{-n} > 0. (In fact 
the vast majority of random 0-1 s × n matrices are s-algorithms.) ∎ 
3.2. An O(min{m, n(log m/log n)}) Algorithm for the WRAM 
In the following, let m be the number of ones among the e_1, e_2, ..., e_n. 
(a) An O(m) parallel-time algorithm for identification. 
The marking algorithm 
(Stage 1) The WRAM runs the algorithm for parallel addition once (as 
explained in Section 2) for the values e_j of the processors. At the end of this 
process (which takes O(log m) time with high probability), each processor 
knows the number of ones among e_1, e_2, ..., e_n. 
(Stage 2) The WRAM runs the stages 1 (initialization) and 2 (memory 
marking) of the procedure ADDITION(m), with the following 
modification: Each time a processor marks a memory location, it writes its 
id instead of its value. At the end of this stage, the m id's of the processors 
with nonzero values have been placed "contiguously" in 
M(1), M(2), ..., M(am). 
(Stage 3) Each processor reads the memory locations M(1), ..., M(am) in 
sequence. 
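
The marking algorithm can be sketched in the same sequential style as the addition procedure. The code below is an illustration under stated assumptions: Stage 1 is replaced by a direct count, an empty cell is represented by `None` so that id 0 is not mistaken for an unused location, and the function name is hypothetical.

```python
import random


def identify_by_marking(e, a=4, rng=None):
    """Return the sorted ids of the processors holding a 1."""
    rng = rng or random.Random()
    ids = [j for j, bit in enumerate(e) if bit == 1]
    m = len(ids)                      # Stage 1 result, computed directly here
    M = [None] * max(1, a * m)        # Stage 2: a * m empty cells
    pending = ids
    while pending:
        claims = {}
        for j in pending:
            claims.setdefault(rng.randrange(len(M)), []).append(j)
        pending = []
        for y, writers in claims.items():
            if M[y] is None:
                M[y] = writers[0]             # one writer wins the cell
                pending.extend(writers[1:])   # the rest retry next round
            else:
                pending.extend(writers)
    # Stage 3: every processor scans all a * m cells in sequence
    return sorted(j for j in M if j is not None)
```

The final scan over a·m cells is the step that dominates the running time and gives the O(m) bound.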
LEMMA 6. The marking algorithm solves the identification problem in the 
WRAM, in O(m) parallel time, with arbitrarily high probability. 
Proof. By Theorem 1 of Section 2, the first stage of the algorithm takes 
O(log m) time with probability at least 1 − max(δ, m^{−β log a + 1}) where δ and β 
can be selected by the implementer. The second stage of the algorithm 
takes O(log m) time with probability at least 1 − m^{−β log a + 1}, by Lemma 3 
of Section 2. The last stage of our algorithm takes am time. Our algorithm 
never reports an erroneous answer. However, with diminishingly small 
probability, it may never terminate. ∎ 
(b) An O(n(log m/log n)) expected parallel time algorithm for the 
WRAM. The WRAM here will simulate the SW-RAM of Section 3.1, as 
follows: 
(Stage 1) The WRAM runs the algorithm for parallel addition once, for 
the values ej of the processors. At the end of this process, each processor 
knows the number of ones (m) among the e_j's. 
(Stage 2) The WRAM runs the s-algorithm, by simulating step 2 of each 
round, with the procedure ADDITION(m) (described in Section 2). 
LEMMA 7. The simulation algorithm, described above, runs in 
O(n(log m/log n)) expected parallel time. 
Proof. Stage 1 runs in O(log m) expected time, by Theorem 1 of 
Section 2. Each of the s = O(n/log n) rounds of the s-algorithm also runs in O(log m) expected 
time, by Lemma 3 of Section 2. ∎ 
By properties of geometrics, it follows that: 
LEMMA 8. If m = O(n), then our simulation algorithm runs in O(log m) 
parallel time, with probability at least 1 − m^{(−log n/log m)}. 
(c) The combination of the two techniques. We can have the WRAM 
running both algorithms (a) and (b) interleaved (one parallel step of (a), 
and then one parallel step of (b)). When one of the two techniques 
terminates, the processors will stop. 
4. REMARKS AND LOWER BOUNDS 
LEMMA 9. No s-algorithm can have s < n/log(n + 1). 
Proof. Each processor needs at least log(2^n) = n "pieces of information" 
to distinguish between the 2^n possible assignments of 0's and 1's to 
processors. On the other hand, if k processors attempt an addition (in 
step 2 of each round), the amount of information obtained cannot exceed 
log(k + 1) ≤ log(n + 1) because the number of 1's among them is 0, 1, 
..., or k. So, s rounds can give at most s log(n + 1) pieces of information 
to each processor. ∎ 
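
The counting argument translates into a one-line bound on the number of rounds. The helper below is an illustrative sketch (hypothetical name) that assumes base-2 logarithms, in line with log(2^n) = n in the proof:

```python
import math


def min_rounds_lower_bound(n):
    """Smallest integer s allowed by Lemma 9, i.e. s >= n / log2(n + 1)."""
    return math.ceil(n / math.log2(n + 1))
```

For n = 4 this gives 2 rounds as a lower bound. The bound need not be attained for every n.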
Remark. Once the processors have an s-algorithm L, then they can 
construct a table of the possible row-sum vectors V_c and their 
corresponding subsets c of columns of L. Then, given any instance of the identification 
problem, they need O(s) rounds to find V_c and one (indexed) table access 
to find c and solve the problem. Another piece of the preprocessing work is 
the construction of L itself. It seems to us that the n processors of the 
WRAM will need O(n^2/log n) time to agree to a common random L. 
Clearly, our algorithm of Section 3.1 and of Section 3.2(b) becomes prac- 
tical in dynamic environments, where the values of the n processors change. 
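
The table mentioned in the Remark can be sketched as a dictionary from row-sum vectors to assignments. This is an illustration only: the table has 2^n entries, so it is feasible just for small n, and the function name is hypothetical.

```python
from itertools import product


def build_decode_table(L):
    """Map each row-sum vector of L to the unique 0/1 assignment producing it."""
    n = len(L[0])
    table = {}
    for e in product((0, 1), repeat=n):
        v = tuple(sum(r * x for r, x in zip(row, e)) for row in L)
        if v in table:
            raise ValueError("L is not an s-algorithm")
        table[v] = e
    return table
```

After the s rounds each processor holds the vector of readings and recovers the assignment with a single lookup, so the table can be reused whenever the processors' values change.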
We pose as a general open problem the construction of input-sensitive 
parallel algorithms for other problems (so that the “general” lower bounds 
are beaten). A possible candidate is graph connectivity for special types of 
graphs. 
ACKNOWLEDGMENTS 
The author thanks C. H. Papadimitriou, D. Shasha, and Z. Kedem for helpful comments in 
previous versions of this work. 
RECEIVED September 10, 1986; ACCEPTED January 1987 
REFERENCES 
[C, 80] COOK, S., Towards a complexity theory of synchronous parallel computations, in 
"Specker Symposium on Logic and Algorithms, Zurich, Feb. 5-11, 1980." 
[ER, 60] ERDŐS, P., AND RÉNYI, A., On the evolution of random graphs, in "The Art of 
Counting" (J. Spencer, Ed.), MIT Press, Cambridge, MA. 
[ER, 63] ERDŐS, P., AND RÉNYI, A., On two problems of information theory, Magyar Tud. 
Akad. Mat. Kut. Int. Közl. 8; also in "The Art of Counting" (J. Spencer, Ed.), MIT Press, 
Cambridge, MA (1973). 
[FW, 78] FORTUNE, S., AND WYLLIE, J. C., "Parallelism in Random Access Machines," 
Proceedings of the 10th STOC, San Diego, CA, 1978. 
[G, 78] GOLDSCHLAGER, L., A unified approach to models of synchronous parallel machines, 
in "Proceedings, 10th STOC, May 1978." 
[G, 77] GOLDSCHLAGER, L., "Synchronous Parallel Computation," Ph.D. thesis, University 
of Toronto, C. S. Dept. 
[K, 82] KUČERA, L., Parallel computation and conflicts in memory access, Inform. Process. 
Lett. 14, April. 
[MV, 83] MEHLHORN, K., AND VISHKIN, U., Randomized and deterministic simulation of 
PRAMs by parallel machines with restricted granularity of parallel memories, "9th 
Workshop on Graph Theoretic Concepts in Computer Science, University of Osnabrück, 
June 1983." 
[R, 82a] REIF, J., Symmetric complementation, in "14th STOC, San Francisco, CA, May 
1982." 
[R, 82b] REIF, J., On the power of probabilistic choice in synchronous parallel computations, 
in "9th ICALP, Aarhus, Denmark, July 1982." 
[SV, 83] SHILOACH, Y., AND VISHKIN, U., "An O(log n) Parallel Connectivity Algorithm," 
J. of Algorithms, Vol. 3, 1983. 
[V, 83] VISHKIN, U., "Synchronous Parallel Computation - A Survey," preprint, Courant 
Institute, New York University. 
[W, 79] WYLLIE, J. C., "The Complexity of Parallel Computations," Ph.D. Thesis and Tech. 
Report 79-387, Dept. of Computer Science, Cornell University, 1979. 
