Efficient Execution of Nondeterministic Parallel Programs on Asynchronous Systems  by Aumann, Yonatan et al.
Information and Computation  IC2653
Information and Computation 139, 1-16 (1997)
Efficient Execution of Nondeterministic Parallel
Programs on Asynchronous Systems*
Yonatan Aumann†
Department of Mathematics and Computer Science, Bar-Ilan University, Ramat-Gan, Israel
E-mail: aumann@cs.biu.ac.il
Michael A. Bender‡
Aiken Computation Laboratory, Harvard University, Cambridge, Massachusetts 02138
E-mail: bender@das.harvard.edu
and
Lisa Zhang§
Department of Mathematics and Laboratory for Computer Science, Massachusetts Institute of Technology,
Cambridge, Massachusetts 02139
E-mail: ylz@math.mit.edu
We consider the problem of asynchronous execution of parallel pro-
grams. We assume that the original program is designed for a synchronous
system, whereas the actual system may be asynchronous. We seek an
automatic execution scheme, which allows the asynchronous system to
execute the synchronous program. Previous execution schemes provide
solutions only for the case where the original program is deterministic. Here,
we provide the first solution for the more general case where the original
program can be nondeterministic (e.g., randomized). Our scheme is based
on a novel agreement protocol for the asynchronous parallel setting. Our
protocol allows n asynchronous processors to agree on n word-sized
values in O(n log n log log n) total work, assuming an oblivious adversary
scheduler. Total work is defined to be the summation of the number of
steps performed by all processors (including steps from busy waiting).
© 1997 Academic Press
article no. IC972653
0890-5401/97 $25.00
Copyright © 1997 by Academic Press
All rights of reproduction in any form reserved.
* An initial version of this paper was presented in SPAA 1996.
† This work was done while the author was at the MIT Laboratory of Computer Science. Supported
in part by a Wolfson postdoctoral fellowship and DARPA Contract N00014-92-J-1799.
‡ Supported by NSF Contract CCR-9313775.
§ Supported by an NSF graduate fellowship, ARMY Grant DAAH04-95-1-0607, and ARPA Contract
N00014-95-1-1246.
1. INTRODUCTION
Motivation. Parallel programs are frequently designed assuming tightly coupled
processors, operating in synchrony. A typical example of such a programming
model is the PRAM model, in which all processors are assumed to operate in lock-
step on the individual instruction level [16]. In less extreme models synchroniza-
tion is often not assumed at every step, but it is still an indispensable ingredient of
the overall structure (e.g., the BSP model [26]). Synchronization assumptions are
convenient because they free the programmer from the need to consider actual processor
and network timings and let the programmer focus on the major task of
parallelizing the program. However, these assumptions do not correspond to the
way many actual parallel systems operate. Typically, in a multitasking parallel
systemand especially in a network environmentprocessors operate on separate
parts of the same application asynchronously and at considerably different speeds.
For example, a heavily loaded processor may dedicate considerably less CPU time
to a given application than a lightly loaded processor. In this case, the application
running on the system would experience asynchronous behavior of the processors,
even if the clocks were synchronized. Other sources for asynchrony include inter-
rupts, context switches, network congestion, and page faults.
Execution Schemes. Faced with a gap between the idealized synchronous
models, which facilitate program design, and reality, which dictates asynchrony,
an execution scheme [24, 20] is the necessary bridge. (An execution scheme is also
referred to as a method for automatic program transformation [20].) The execution
scheme allows the asynchronous system to run programs written for the idealized
synchronous model. Roughly speaking, the asynchronous system emulates the
operation of the synchronous system. Thus, the programmer can write programs
assuming the idealized synchronous model and run the programs on the
asynchronous system. Designing efficient execution schemes is the focus of much
previous work [7, 8, 20, 24]. However, all previous schemes are restricted to the
deterministic case. If the original program is nondeterministic, e.g., randomized, the
execution scheme fails. In this paper we provide the first efficient solution for
the execution of nondeterministic parallel programs in the asynchronous setting. The
solution provides a scheme that works regardless of the source of nondeterminism
(e.g., randomization, nondeterministic inputs, etc.). Our solution is based on a
novel agreement protocol for the asynchronous parallel setting (A-PRAM),
which assumes the oblivious adversary scheduler. The agreement protocol allows
the n asynchronous processors to agree on n word-sized values in a total of
O(n log n log log n) word operations. Previous asynchronous consensus protocols,
which are geared to operate under the stronger adaptive adversary scheduler, require
$\Omega(n^2)$ operations per consensus and are therefore too slow to be useful in this
context.
The model. For concreteness and simplicity we describe a solution for fine-
grained parallel programs. The same techniques also provide a solution for other
cases (e.g., large-grained programs [7]). We consider an n-thread EREW PRAM
program P written assuming a synchronous PRAM machine. The asynchronous
host system H consists of n processors having a shared memory space and
individual sources of randomness. We postulate a global word size for the system.
Each processor has a small (constant) number of internal registers. Each processor
of the host system has a set of atomic operations which it may execute. We assume
that each atomic operation is executed to completion without interruption. The
atomic operations include:
1. reading a word from shared memory,
2. writing a word to shared memory,
3. executing any one of a fixed set of basic computations (e.g., add, multiply)
on words in local registers. (We assume that the set of basic computations allows
any processor to perform any single computing instruction of the PRAM program
P in a single step, i.e., if the program P includes the instruction $x \leftarrow f(y, z)$, then
computing f is a basic computation).
We assume that in a single atomic operation the host system can read or write
a full word of the PRAM program together with an appropriate timestamp.
(Timestamps are of size O(log n).) We emphasize that no atomic operations can
both read from and write to shared memory. Thus, no compound operation such
as test&set or compare&swap is atomic.
The processors act asynchronously. Formally, with each processor $P_i$ we
associate a schedule function $S_i : \mathbb{N} \to \mathbb{R}^+ \cup \{\infty\}$, which reflects the actual time
instances in which the steps of $P_i$ are performed. The expression $S_i(k) = t$ means
that the $k$th operation of processor $P_i$ is executed at actual time $t$. Thus, $S_i$ is
monotone. An $\infty$ value indicates a faulty processor. The total schedule is the $n$-tuple
$S = (S_1, \ldots, S_n)$. For simplicity we assume that all simultaneous reads succeed and
that among the simultaneous writes to the same location an arbitrary one succeeds.
Our results also hold for the model where all simultaneous accesses to the same
location are rejected (see [26]). Following the convention for asynchronous
PRAMs (see [7, 8, 20, 24]), we postulate an oblivious adversary scheduler, which
determines the schedule independent of the random choices of the processors. The
adversary knows the program P, the input values to the program, and the execu-
tion scheme. The adversary is not provided, however, with the dynamic random
choices made by the processors during the execution. Complexity is measured by
the total number of steps performed in the system, summed over all processors.
Formally, an actual-time interval $I$ is said to contain $w$ work units if
$w = \sum_i |\{k : S_i(k) \in I\}|$. A computation starting at $t_0$ and ending at $t_1$ is said to take $w$
work if the interval $[t_0, t_1]$ contains $w$ work units. Note that this formulation
accounts for busy waiting and idling as well as effective work.
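As an illustration of the work measure, the following Python sketch counts the work units in an interval directly from the definition. The representation of a schedule as a list of step times per processor, and the sample times themselves, are assumptions made for this example only.

```python
import math

def work_in_interval(schedules, t0, t1):
    """Count the steps k with t0 <= S_i(k) <= t1, summed over all processors i."""
    return sum(
        1
        for steps in schedules      # one list of step times per processor
        for t in steps
        if t0 <= t <= t1            # math.inf (a failed step) never qualifies
    )

# Hypothetical schedules for three processors (illustrative values only).
S = [
    [0.0, 1.0, 2.0, 3.0],           # a fast processor
    [0.5, 2.5],                     # a slow processor
    [1.0, math.inf],                # a processor that fails after one step
]
print(work_in_interval(S, 0.0, 2.5))
```

Every step taken inside the interval is counted, whether or not it does useful work, which is why busy waiting contributes to the total.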
Our execution scheme is randomized and is successful with high probability. We
say that an event E occurs with high probability (w.h.p.) if for any c>0 there exists
a proper choice of constants such that $\Pr[E] \ge 1 - n^{-c}$.
Related Work. The problem of efficient parallel computation using unreliable
andor asynchronous processors is the focus of much previous work. Kannellakis
and Shvartman [18, 19] introduce the fail-stop PRAM model and describe
solutions to specific algorithmic problems in this model. Kedem, Palem, and Spirakis
[22], and together with Raghunathan [21], show how to execute any deterministic
PRAM program on a fail-stop PRAM (see also [19]). Gibbons [17] and Cole and
Zajicek [13] introduce the Asynchronous PRAM (A-PRAM). Nishimura [25]
shows how to execute specific computations using this model. Martel et al. [24]
provide a general execution scheme that allows any deterministic PRAM program
to be executed on an A-PRAM, assuming loose read&write atomicity. In this
scheme, executing a $T$-step PRAM program requires $O(T \log n)$ steps. Thus, the
overhead for using the asynchronous system is $O(\log n)$. Kedem et al. [20] provide
the first general solution for the case where only reads and writes are atomic. Their
scheme entails an $O(\log^3 n)$ work overhead. Aumann and Rabin [8, 9] provide a
solution with an $O(\log^2 n)$ overhead. Aumann et al. [7] consider large-grained
programs, as opposed to the PRAM programs discussed in the previous works. For
these programs they provide a solution with an O(log* n) overhead. All the above
schemes are restricted to the execution of deterministic programs and fail if the
original program is nondeterministic. The techniques we provide here allow any of
the above schemes to operate in the nondeterministic case and entail an
O(log n log log n) overhead.
At the heart of our execution scheme is a new asynchronous agreement protocol
designed for the A-PRAM model. The problem is closely related to, but distinct
from, the classical asynchronous consensus problem. The classical problem assumes
an adaptive adversary scheduler, which at any time determines the next processor
to operate based on the entire history of the computation. The A-PRAM model, in
contrast, assumes an oblivious adversary scheduler, where the entire schedule must
be determined by the adversary in advance. However, the time bounds required for
efficient program execution are much stricter than those provided by classical con-
sensus protocols. In particular, the best protocols for the classical problem complete
in O (n) steps per processor per bit, yielding a total of O (n2) per bit. If employed in
an execution scheme, these protocols would result in the unacceptable O (n) over-
head. Thus, a new scheme designed for the A-PRAM setting is needed.
We briefly describe the results for asynchronous consensus where the adaptive
adversary scheduler is assumed. Fischer et al. [15] prove the impossibility of deterministic
asynchronous consensus in the message-passing model, even if only one
processor fails. Chor et al. [12] and Loui and Abu-Amara [23] show that the same
result holds in the shared-memory model (also see [14]). A randomized polyno-
mial-time solution was first given in [12]; the result assumes a weak adaptive adver-
sary which cannot stop a processor between producing a random value and writing
it to shared memory. Abrahamson [1] provides the first (exponential) solution for
the standard model and also gives an improved algorithm for the weak adaptive
model. Aspnes and Herlihy [3] introduce the notion of the weak random coin. They
give the first polynomial solution ($O(n^4)$), using unbounded registers. Attiya et al.
[5] provide a polynomial, bounded-register solution. Aspnes [2] gives an
$O(n^2(p^2+n))$ bounded-register solution, where $p$ is the number of active processors.
Bracha and Rachman [10] give an $O(n^2 \log n)$ unbounded-register solution.
Finally, Aspnes and Waarts [4] provide a solution of $O(n \log^2 n)$ steps per processor,
using unbounded registers.
As mentioned above, Chor et al. [12] and Abrahamson [1] consider a weak
adversary model. Recently, Aumann and Bender [6] and Chandra [11] defined
new intermediate adversary models. These adversaries, which are based on the idea
that a value should not be available to the adversary until it is used by some pro-
cessor, are less powerful than that of [1, 12] but more powerful than the oblivious
adversary of the A-PRAM. Aumann and Bender [6] provide a consensus procedure
that completes in $O(n \log^2 n)$ work per bit, assuming an intermediate adversary.
Chandra [11] provides a consensus procedure that completes in expected
$O(\log^2 n)$ work per processor per bit, also assuming an intermediate adversary
(slightly different from that of [6]).
2. THE EXECUTION SCHEME
In this section we present the overall structure of the execution scheme. The
techniques provided in this paper allow any of the previously known schemes to be
converted from the deterministic to the nondeterministic setting. For concreteness
we describe the result using the scheme presented in [9]. Extensions for other
schemes are analogous.
2.1. The Overall Structure
Consider an n-thread EREW PRAM program P. The program is a sequence of
steps. At each step $\pi$ each thread $T_i$ performs one instruction
$z_i^{(\pi)} \leftarrow f_i^{(\pi)}(x_i^{(\pi)}, y_i^{(\pi)})$,
where $x_i^{(\pi)}$, $y_i^{(\pi)}$, and $z_i^{(\pi)}$ are shared-memory variables and $f_i^{(\pi)}$ is one of the
program's basic operations (e.g., add, multiply). On the ideal PRAM all instructions
of a single step are assumed to be performed simultaneously. In reality, the
program P is executed on an n-processor asynchronous system. The execution is
conducted in a sequence of phases. Each phase corresponds to a single step. A phase
is composed of two subphases, Compute and Copy. In the Compute subphase, for
each $i = 1, \ldots, n$, the values of $x_i^{(\pi)}$ and $y_i^{(\pi)}$ are read from the shared memory and
the function $f_i^{(\pi)}$ is computed. Then the resulting value is stored in a temporary
location NewVal[i] in shared memory. In the Copy subphase the value in
NewVal[i] is copied to $z_i^{(\pi)}$. A schematic picture of the operation appears in
Fig. 1. (This split-execution procedure, introduced in [22], ensures idempotence of
operations.)
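The effect of splitting a phase into Compute and Copy can be seen in a small sequential Python sketch. It is only an illustration under simplifying assumptions: the two subphases run one after the other in a single thread, and the function names, the dictionary used as shared memory, and the sample instructions are all hypothetical.

```python
import operator

def run_phase(memory, instructions, executions):
    """Split execution: Compute writes NewVal[i], then Copy writes z_i."""
    new_val = [None] * len(instructions)
    # Compute subphase: operands are only read, so repeats give the same result.
    for i in executions:
        f, x, y, _ = instructions[i]
        new_val[i] = f(memory[x], memory[y])
    # Copy subphase: repeated copies of NewVal[i] to z_i are harmless.
    for i in executions:
        z = instructions[i][3]
        memory[z] = new_val[i]
    return memory

def run_naive(memory, instructions, executions):
    """No split: a repeated task re-reads a variable it has already overwritten."""
    for i in executions:
        f, x, y, z = instructions[i]
        memory[z] = f(memory[x], memory[y])
    return memory

# Each instruction is (f, x, y, z), meaning z <- f(x, y).
instructions = [(operator.add, "a", "b", "a"), (operator.mul, "c", "d", "c")]
executions = [0, 1, 0, 0, 1]        # tasks 0 and 1 are each executed repeatedly
split = run_phase({"a": 1, "b": 1, "c": 5, "d": 2}, instructions, executions)
naive = run_naive({"a": 1, "b": 1, "c": 5, "d": 2}, instructions, executions)
print(split, naive)
```

With the split, repeated executions leave the same memory as a single synchronous step; without it, the instruction a <- a + b is applied three times.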
At any subphase there are n tasks to be performed. Processors of the asyn-
chronous system repeatedly choose from these tasks at random and execute the
chosen task to completion. The scheme in [9] guarantees that with high probability
the system does not advance to the next subphase until all tasks of the
current subphase are completed.
The Phase Clock. In the asynchronous system processors may go to sleep in
one subphase and wake up much later. Thus, it is necessary to establish a method
for these tardy processors to determine the current subphase. The Phase-Clock
FIG. 1. The execution scheme.
construction from [9] provides such a method. The Phase Clock supports two pro-
cedures: Read-Clock, which returns the current value of the clock and Update-
Clock, which allows processors to participate in advancing the clock. Read-
Clock takes $\Theta(\log n)$ operations, and Update-Clock takes $O(1)$ operations. The
value of the clock is initialized to 0. Every $\Theta(n)$ invocations of Update-Clock the
value of the clock advances from one integral value to the next. Specifically, for any
$\alpha_1 > 0$ there exists an $\alpha_2 \ge \alpha_1$ such that the following holds. For all $n$ there is a
Phase Clock such that at least $\alpha_1 n$ invocations of Update-Clock are necessary
and $\alpha_2 n$ are sufficient to advance the clock from one integral value to the next
(regardless of which processors invoke the procedure).
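The interface of the Phase Clock can be mimicked by a toy sequential stand-in, sketched below in Python. This is not the construction from [9]: it ignores asynchrony and the cost of Read-Clock, and it assumes that exactly alpha * n updates advance the clock, whereas the real clock only guarantees a number between $\alpha_1 n$ and $\alpha_2 n$. The class name and the parameter `alpha` are hypothetical.

```python
class ToyPhaseClock:
    """Sequential stand-in exposing only the Read-Clock / Update-Clock interface."""

    def __init__(self, n, alpha=2):
        self.threshold = alpha * n   # assumed exact number of updates per tick
        self.updates = 0             # total Update-Clock invocations so far

    def update_clock(self):
        """One Update-Clock invocation by any processor."""
        self.updates += 1

    def read_clock(self):
        """Current integral clock value; starts at 0."""
        return self.updates // self.threshold

clock = ToyPhaseClock(n=4, alpha=2)
for _ in range(7):
    clock.update_clock()
before = clock.read_clock()          # 7 updates are not enough for a tick
clock.update_clock()
after = clock.read_clock()           # the 8th update advances the clock
print(before, after)
```

The point of the sketch is only that the clock value depends on how many updates were made, not on which processors made them.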
The Phase Clock serves a double function. First, it assists processors in identify-
ing the current phase and subphase. Second, the Phase Clock guarantees that all
tasks of the subphase are completed before the computation advances to the next
subphase. This is achieved by interleaving clock updates with task execution. With
a proper choice of the constants $\alpha_1$ and $\alpha_2$, it is guaranteed that during each subphase
$\Theta(n \log n)$ tasks are chosen at random. The total number of distinct tasks in
a subphase is n. Thus, with high probability, every task is performed at least once.
For details see [9].
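The claim that $\Theta(n \log n)$ random choices cover all $n$ tasks with high probability is the familiar coupon-collector bound, and can be checked empirically. The Python sketch below is an illustration only; the constant `c = 3`, the natural logarithm, and the values of `n` and `trials` are assumptions chosen for the experiment.

```python
import math
import random

def all_tasks_chosen(n, c, rng):
    """Draw c * n * ln(n) tasks uniformly at random; report whether all n were hit."""
    draws = math.ceil(c * n * math.log(n))
    chosen = {rng.randrange(n) for _ in range(draws)}
    return len(chosen) == n

rng = random.Random(0)               # fixed seed so the experiment is repeatable
n, trials = 256, 200
successes = sum(all_tasks_chosen(n, 3, rng) for _ in range(trials))
print(successes, "of", trials, "trials covered every task")
```

A single task is missed by all draws with probability about $n^{-c}$, so by the union bound some task is missed with probability about $n^{1-c}$; increasing the constant drives the failure probability down polynomially.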
2.2. The Agreement Problem
Consider again a single Compute subphase in the [9] scheme. The scheme
guarantees that each task is executed at least once. However, tasks may also be per-
formed more than once. In fact, in an asynchronous system this redundancy is
unavoidable. If the program P is deterministic, multiple executions of the same task
pose no problem; performing the same deterministic operation several times yields
the same result. However, if the functions $f_i^{(\pi)}$ are nondeterministic, then the values
written to NewVal[i] by different processors may be different. Thus, the processors
need to reach an agreement on all values NewVal[1], ..., NewVal[n] before con-
tinuing to the Copy subphase. For this purpose we augment the Compute subphase
with an agreement protocol that establishes agreement on all new values before the
Copy subphase begins (see Fig. 1). Note that processors need to agree on n different
values, each of which is a machine word. Agreement must be reached within the
time bound of a single subphase, which is $\tilde{O}(n)$ operations. Such an agreement
protocol is the main topic of this paper.
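The need for agreement can be seen in a few lines of Python. The sketch below is illustrative only: it models a task as a function of a random source, and the two sample tasks are hypothetical.

```python
import random

def redundant_compute(f, executions, rng):
    """Return the values that repeated executions of one task would write."""
    return [f(rng) for _ in range(executions)]

def deterministic(rng):
    return 7                         # ignores randomness: always the same word

def randomized(rng):
    return rng.randrange(2**32)      # a fresh random word on every execution

rng = random.Random(1)
same = set(redundant_compute(deterministic, 5, rng))
different = set(redundant_compute(randomized, 5, rng))
print(same, len(different))
```

For the deterministic task every execution would write the same value to NewVal[i]; for the randomized task the executions almost surely disagree, so some rule is needed to select one of the candidate values.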
We formulate the problem in the following abstract terms. There are n
asynchronous processors P1 , ..., Pn performing a sequence of nonoverlapping phases
separated by gaps. The phases correspond to the Compute subphases of the execu-
tion scheme, and the gaps correspond to the Copy subphases. Associated with each
phase $\pi$ there are $n$ nondeterministic functions $f_1^{(\pi)}, \ldots, f_n^{(\pi)}$. We seek a protocol that
provides the following. For each $\pi$, upon completion of phase $\pi$ and during the
entire gap between phases $\pi$ and $\pi+1$, there is a set of $n$ values NewVal[1], ...,
NewVal[n], agreed upon by and available to all processors; the values are such
that for all $i$, NewVal[i] $\in f_i^{(\pi)}$, that is, NewVal[i] is one of the possible outcomes
of $f_i^{(\pi)}$. The Phase Clock determines the duration of the
phases and the gaps.
3. THE AGREEMENT PROTOCOL
The protocol employs a data structure that we call a bin array. The structure
consists of an array of n bins corresponding to the n consensus values to be agreed
upon. Each bin consists of $\beta \log n$ cells ($\beta$ to be determined later). We denote the
$i$th bin by $\mathrm{Bin}_i$ and the $j$th cell of this bin by $\mathrm{Bin}_i[j]$. The same bin array is used
repeatedly in all phases of the execution scheme. Thus, since the processors are
asynchronous, it is possible for a slow processor from one phase to overwrite values
of a later phase. In order to distinguish between current and obsolete values, each
write is time stamped with the current phase number. When a processor reads from
a bin it only considers the values with the current time stamp. We call locations
with current time stamps filled and locations with previous time stamps empty.
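A bin array with time-stamped cells can be sketched as follows in Python. The class and method names, the default `beta`, the use of the base-2 logarithm, and the sentinel stamp -1 for a never-written cell are assumptions made for the illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Cell:
    stamp: int = -1          # phase number of the last write; -1 means never written
    value: object = None     # one word of the PRAM program

class BinArray:
    def __init__(self, n, beta=4):
        self.cells_per_bin = max(1, math.ceil(beta * math.log2(n)))
        self.bins = [[Cell() for _ in range(self.cells_per_bin)] for _ in range(n)]

    def write(self, i, j, value, phase):
        # The model writes a word together with its timestamp in one atomic step.
        self.bins[i][j] = Cell(phase, value)

    def is_filled(self, i, j, phase):
        # A cell is filled only if it carries the current phase number.
        return self.bins[i][j].stamp == phase

ba = BinArray(8, beta=2)
ba.write(0, 0, "v", phase=5)         # a write by a processor in the current phase
ba.write(0, 1, "stale", phase=4)     # a write by a tardy processor
print(ba.cells_per_bin, ba.is_filled(0, 0, 5), ba.is_filled(0, 1, 5))
```

Because emptiness is defined by the timestamp rather than by the stored value, the array never has to be cleared between phases.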
The protocol operates in cycles, which processors execute repeatedly. The cycles
for all processors are identical. We show that after O(n log n) cycles, w.h.p. all n
values are computed and agreed upon, regardless of the asynchronous schedule of
the processors and of the identity of the processors performing the cycles. Thus,
after O(n log n) cycles the phase is completed. Each processor reads the Phase
Clock every log n cycles. The clock indicates the current phase and signals if the
processor is working on an "old" phase.
Pseudocode for one cycle of the agreement procedure appears in Fig. 2. The cycle
starts with the processor $P$ choosing a bin $\mathrm{Bin}_i$ at random. Throughout this entire
cycle, processor $P$ works on $\mathrm{Bin}_i$ only. Cells of the bin are written in increasing
order. (An exception is when a tardy processor writes to a cell.) Processor $P$ first
uses binary search to find the first empty cell $\mathrm{Bin}_i[j]$ of the bin. If this is the first
cell of the bin (i.e., $j = 1$) then $P$ evaluates $f_i^{(\pi)}$ and writes the resulting value in
$\mathrm{Bin}_i[1]$. Otherwise, $P$ copies to the first empty cell the value appearing in the
previous cell.
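The prose description of one cycle can be turned into a short sequential Python sketch. It is a reconstruction from the text above, not the pseudocode of Fig. 2: cells are 0-indexed (so index 0 plays the role of $\mathrm{Bin}_i[1]$), a cell is a (timestamp, value) pair, cycles run one at a time with no tardy processors, and the no-op padding to a fixed number of steps is omitted. All names are hypothetical.

```python
import random

EMPTY = (-1, None)   # (timestamp, value); -1 marks a cell never written

def first_empty(cells, phase):
    """Binary search for the first cell whose timestamp is not `phase`."""
    lo, hi = 0, len(cells)
    while lo < hi:
        mid = (lo + hi) // 2
        if cells[mid][0] == phase:   # filled: the first empty cell lies to the right
            lo = mid + 1
        else:                        # empty: it is here or to the left
            hi = mid
    return lo

def cycle(bins, phase, f, rng):
    """One cycle: pick a random bin, find its first empty cell, write one value."""
    i = rng.randrange(len(bins))
    cells = bins[i]
    j = first_empty(cells, phase)
    if j == len(cells):
        return i, None               # every cell of this bin is already filled
    # First cell: evaluate f_i afresh. Otherwise: copy the previous cell's value.
    value = f(i, rng) if j == 0 else cells[j - 1][1]
    cells[j] = (phase, value)
    return i, j

rng = random.Random(0)
bins = [[EMPTY] * 4 for _ in range(2)]
for _ in range(50):
    cycle(bins, 0, lambda i, r: r.randrange(100), rng)
print(bins)
```

In this benign sequential run only the first write to a bin evaluates the nondeterministic function, and every later write copies it, so each bin ends up holding a single value. The analysis in the paper is about what survives of this picture when cycles overlap and tardy processors clobber cells.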
Work Per Cycle. Each cycle requires O(log log n) steps. For the correctness of
the protocol it is necessary that all cycles execute the exact same number of steps
FIG. 2. One cycle of the agreement procedure.
regardless of the random choices made by the processors. Thus, let $\omega = \alpha \log\log n$
be the maximum number of steps necessary to complete any one cycle. All processors
spend $\omega$ steps on each cycle, executing no-ops if necessary.
Obtaining the agreement values. A processor obtains the $i$th agreement value
NewVal[i] by reading the cells in $\mathrm{Bin}_i$ between $\mathrm{Bin}_i[(\beta \log n)/2]$ and $\mathrm{Bin}_i[\beta \log n]$.
Any value appearing in a filled cell in this range is a valid value for NewVal[i]. In
the next section we prove the following theorem.
Theorem 1. For a sufficiently large $\beta$ and for any given phase $\pi$, after
$O(n \log n \log\log n)$ work units w.h.p. the following holds for each $i$:
1. Uniqueness: there exists a single value $v_i$, such that for all $j \ge (\beta \log n)/2$, if
$\mathrm{Bin}_i[j]$ is filled then it stores the value $v_i$;
2. Stability: the value of $v_i$ does not change (until the next phase begins);
3. Accessibility: half of the cells $\mathrm{Bin}_i[j]$, for $j \ge (\beta \log n)/2$, are filled;
4. Correctness: $v_i \in f_i^{(\pi)}$.
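The rule for obtaining an agreement value, reading the upper half of a bin and taking any filled cell, can be sketched in Python as follows. The representation of a cell as a (timestamp, value) pair, the 0-indexed layout, and the sample bin contents are assumptions made for illustration.

```python
def agreement_value(cells, phase):
    """Return the value of any filled cell in the upper half of the bin, else None."""
    # 1-indexed cells L/2 .. L correspond to 0-indexed cells L/2 - 1 .. L - 1.
    start = max(0, len(cells) // 2 - 1)
    for stamp, value in cells[start:]:
        if stamp == phase:           # only cells with the current timestamp count
            return value
    return None

# Hypothetical bin in phase 3: the lower cells still disagree (17 versus 42),
# and one upper cell was clobbered by a tardy processor from phase 2.
bin_i = [(3, 17), (3, 17), (3, 42), (2, 99), (3, 42), (3, 42), (3, 42), (3, 42)]
print(agreement_value(bin_i, 3))
```

Uniqueness is what makes it safe to return the first filled cell found, and Accessibility is what bounds the number of cells that must be inspected before one is found.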
4. ANALYSIS
We now prove Theorem 1. First we bound the destructive effect tardy processors
have on the current phase. For a given phase $\pi$, we say that a cell $\mathrm{Bin}_i[j]$ is clobbered
if it is overwritten by a cycle associated with a previous phase.
Lemma 1. For any given phase $\pi$ w.h.p. there are at most $O(\log n)$ clobbers in
each bin.
Proof. Processors read the Phase Clock every log n cycles. A processor execut-
ing a cycle writes at most one cell (lines 9 and 11). Thus, any processor clobbers
at most log n cells before it reads the clock. Therefore, during any given phase, each
processor can clobber at most log n cells. Each clobber is to a randomly chosen bin
(line 1). Thus, in the given phase, there are a total of O(n log n) random clobbers
on $n$ bins. W.h.p. there are at most $O(\log n)$ clobbers per bin. ∎
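The last step of the proof is a balls-into-bins bound: $O(n \log n)$ uniformly random clobbers leave $O(\log n)$ in every bin w.h.p. A small Python experiment, given here only as a sanity check, makes the bound concrete; the base-2 logarithm, the fixed seed, and the choice of exactly $n \lceil \log_2 n \rceil$ throws are assumptions.

```python
import math
import random
from collections import Counter

def max_clobbers_per_bin(n, rng):
    """Throw n * ceil(log2 n) clobbers into n bins; return the largest bin load."""
    throws = n * math.ceil(math.log2(n))
    loads = Counter(rng.randrange(n) for _ in range(throws))
    return max(loads.values())

rng = random.Random(0)
for n in (64, 256, 1024):
    print(n, max_clobbers_per_bin(n, rng), math.ceil(math.log2(n)))
```

The maximum load is at least the average load $\lceil \log_2 n \rceil$ and, in such runs, stays within a small constant factor of it.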
4.1. Accessibility
Consider one cycle execution C. We denote the starting time of C by S[C] and
the finishing time of C by F[C]. (I.e., S[C] is the time when C starts executing
line 1, and F[C] is the time when it finishes executing line 9 or 11.) Let D[C] be
the time when the cycle reaches line 5, i.e., after the binary search and before the
write.
Consider a given phase $\pi$. Let $\omega = \Theta(\log\log n)$ be the number of work units
necessary to complete one cycle. Recall that $\omega$ is fixed for all cycles. We divide each
phase into stages as follows. Let $t_0$ be the starting time of the phase. Assume $t_k$ is
defined. Then $t_{k+1}$ is the first time after $t_k$ such that the interval $[t_k, t_{k+1}]$ contains
$3\omega n$ work units. Stage $k$, denoted $\Pi_k$, is the interval $[t_{k-1}, t_k]$. Let $S(\Pi_k)$ and
$F(\Pi_k)$ denote the starting and finishing times of stage $\Pi_k$, respectively. We say that
a cycle $C$ is a complete cycle in $\Pi_k$ if the entire execution of the cycle is performed
within the stage. (By an abuse of language, we use the shorthand term "cycle" to
mean "cycle execution.")
Lemma 2. Each stage contains at least n and at most 3n complete cycles.
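Proof. Since there are $n$ processors, at most $n$ cycles can overlap the beginning
of a stage, and at most $n$ can overlap the end of a stage. A stage contains $3\omega n$ work
units, enough for at most $3n$ cycles of $\omega$ steps each; discounting the at most $2n$
partial cycles leaves at least $n$ complete ones. ∎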
Definition 1. We say that stage $\Pi_k$ is effective in $\mathrm{Bin}_i$ if during the stage
1. there exists a complete cycle in $\Pi_k$ that operates on $\mathrm{Bin}_i$,
2. there is no clobber in $\mathrm{Bin}_i$ during the stage.
Define the frontier of $\mathrm{Bin}_i$ at any given time $t$ to be the lowest indexed cell of the
bin never written in the current phase. A cell j is a hole if it is empty but has an
index smaller than the frontier. Since cells are written in increasing order, holes can
only be formed by clobbers (i.e., writes by tardy processors operating for a previous
phase). Note, however, that holes may prevent the binary search (line 2) from
finding the true frontier of the bin.
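The way a hole can mislead the binary search is easy to reproduce. In the Python sketch below a bin is described by two parallel lists: `stamps`, the timestamp currently stored in each cell, and `written`, whether the cell was ever written in the current phase. This bookkeeping, the 0-indexed cells, and the sample bin are assumptions made for the illustration; a real processor sees only the timestamps.

```python
def frontier(written):
    """Index of the lowest cell never written in the current phase."""
    for j, was_written in enumerate(written):
        if not was_written:
            return j
    return len(written)

def holes(stamps, written, phase):
    """Cells below the frontier that are empty, i.e., carry an old timestamp."""
    return [j for j in range(frontier(written)) if stamps[j] != phase]

def first_empty(stamps, phase):
    """The binary search a cycle performs; it sees timestamps only."""
    lo, hi = 0, len(stamps)
    while lo < hi:
        mid = (lo + hi) // 2
        if stamps[mid] == phase:
            lo = mid + 1
        else:
            hi = mid
    return lo

# Phase 5: cells 0..5 were written, then cell 4 was clobbered with stamp 4.
stamps = [5, 5, 5, 5, 4, 5, 4, 4]
written = [True, True, True, True, True, True, False, False]
print(frontier(written), holes(stamps, written, 5), first_empty(stamps, 5))
```

Here the search stops at the hole in cell 4 rather than at the true frontier in cell 6, so the cycle fills the hole instead of advancing the frontier. This is exactly the second case in the proof of Lemma 3.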
Lemma 3. After $O(\log n)$ effective stages in $\mathrm{Bin}_i$ all cells of the bin have been
written in the current phase.
Proof. Let $f_k$, $l_k$, and $h_k$ denote, respectively, the index of the frontier of $\mathrm{Bin}_i$,
the total number of clobbers to $\mathrm{Bin}_i$, and the number of holes in $\mathrm{Bin}_i$, at the
finishing time of the $k$th effective stage. Let $f'_k$, $l'_k$, and $h'_k$ be defined analogously
for the starting time of the $k$th effective stage. Then,
$$f_k \le f'_{k+1} \le f_{k+1}, \tag{1}$$
$$h_{k+1} \le h'_{k+1} \le h_k + (l_{k+1} - l_k). \tag{2}$$
Inequality (1) is immediate. The left side of (2) follows from the definition of an
effective stage. The right side holds since the $l_{k+1} - l_k$ new clobbers between the
$k$th and $(k+1)$st effective stages can create at most $l_{k+1} - l_k$ holes.
We prove by induction that
$$\forall k,\quad f_k \ge \min\{k - (l_k - h_k),\ \beta \log n\}. \tag{3}$$
If $f'_{k+1} = \beta \log n$, then
$$f_{k+1} \ge f'_{k+1} = \beta \log n \ge \min\{(k+1) - (l_{k+1} - h_{k+1}),\ \beta \log n\}.$$
Thus, we may assume that $f'_{k+1} < \beta \log n$. There are two cases to consider:
1. One of the writes in the $(k+1)$st effective stage pushes the frontier
forward. Hence, from Inequality (1) and the inductive hypothesis
$$f_{k+1} \ge f'_{k+1} + 1 \ge f_k + 1 \ge \min\{k - (l_k - h_k),\ \beta \log n\} + 1.$$
However, $k - (l_k - h_k) \le f_k \le f'_{k+1} < \beta \log n$. Thus,
$$f_{k+1} \ge k - (l_k - h_k) + 1 = k - (l_{k+1} - h_{k+1}) + (l_{k+1} - l_k) + (h_k - h_{k+1}) + 1.$$
From Inequality (2) we conclude that
$$f_{k+1} \ge (k+1) - (l_{k+1} - h_{k+1}).$$
2. None of the writes in the $(k+1)$st effective stage pushes the frontier forward.
However, since the stage is effective, there must be at least one write to the bin
during the stage. Thus, the write during the stage must be to a hole, therefore
"filling" it. Hence, $h_{k+1} + 1 \le h'_{k+1}$. From Inequality (1) and the inductive
hypothesis, and since $f'_{k+1} < \beta \log n$,
$$f_{k+1} = f'_{k+1} \ge f_k \ge \min\{k - (l_k - h_k),\ \beta \log n\} = k - (l_k - h_k).$$
Combining with Inequality (2) we have
$$h_{k+1} + 1 \le h'_{k+1} \le h_k + (l_{k+1} - l_k).$$
Hence, we conclude that
$$f_{k+1} \ge (k+1) - (l_{k+1} - h_{k+1}).$$
This completes the proof of Inequality (3).
From Lemma 1, w.h.p. there are at most $O(\log n)$ clobbers to $\mathrm{Bin}_i$. Thus,
$l_k \le O(\log n)$. Therefore, by Inequality (3), after $O(\log n) + \beta \log n = O(\log n)$ effective
stages, the frontier is at $\mathrm{Bin}_i[\beta \log n]$, the last cell in the bin. Since cells are
written in increasing order, the lemma follows. ∎
Lemma 4. W.h.p. after $O(n \log n \log\log n)$ work units and for each $i$, half of the
cells $\mathrm{Bin}_i[j]$ with $j \ge (\beta \log n)/2$ are filled.
Proof. Each cycle requires $\Theta(\log\log n)$ work units. Thus, $O(n \log n \log\log n)$
work units constitute $c_1 \log n = O(\log n)$ stages. In each stage there are at least $n$
complete cycles. Each cycle randomly chooses the bin on which it operates. Thus,
for any given stage $\Pi_k$ and $\mathrm{Bin}_i$,
$$p_1 = \Pr[\text{there is a complete cycle in } \Pi_k \text{ in } \mathrm{Bin}_i] \ge 1 - \left(1 - \frac{1}{n}\right)^n \ge 1 - e^{-1}.$$
By the Chernoff bound, among $c_1 \log n$ stages w.h.p. at least $c_2 \log n$ stages have
complete cycles in $\mathrm{Bin}_i$, where $c_2 \ge p_1 c_1/2$. By Lemma 1 w.h.p. there are at most
$c_3 \log n = O(\log n)$ clobbers to $\mathrm{Bin}_i$ (with $c_3$ independent of $c_1$ and $c_2$). Thus, w.h.p.,
at least $(c_2 - c_3) \log n$ stages are effective in $\mathrm{Bin}_i$. By Lemma 3 for $c_4 = (c_2 - c_3)$
sufficiently large, after $c_4 \log n$ stages all cells of the bin are written in the current
phase.
At most $c_3 \log n$ cells of the bin are clobbered. Thus, choosing $\beta > 4c_3$, at least
half of the cells $\mathrm{Bin}_i[j]$, with $j \ge (\beta \log n)/2$, store the new value. ∎
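The bound $1 - (1 - 1/n)^n \ge 1 - e^{-1}$ used for $p_1$ in the proof can be checked numerically. The short Python sketch below is only a sanity check; the function name and the sample values of $n$ are chosen for illustration.

```python
import math

def prob_bin_hit(n, cycles=None):
    """Probability that at least one of `cycles` random cycles picks a fixed bin."""
    m = n if cycles is None else cycles
    return 1 - (1 - 1 / n) ** m

for n in (2, 16, 1024, 10**6):
    print(n, prob_bin_hit(n), 1 - math.exp(-1))
```

The probability decreases toward $1 - e^{-1} \approx 0.632$ as $n$ grows, so a constant fraction of the stages is expected to contain a complete cycle in any fixed bin.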
4.2. Uniqueness and Stability
We say that a cell $\mathrm{Bin}_i[j]$ is stable (in a given phase) if, whenever it is filled, it
stores the same value. We say that $\mathrm{Bin}_i$ reaches stability at cell $j$ if all cells $\mathrm{Bin}_i[j']$,
$j' \ge j$, are stable. Fig. 3 gives a low-probability situation where a bin does not
stabilize. We show that w.h.p. all bins reach stability by the middle cell.
FIG. 3. An arrangement of cycles in $\mathrm{Bin}_i$ that prevents $\mathrm{Bin}_i$ from converging. The values in $\mathrm{Bin}_i$
oscillate between 3 and 5. If this low-probability situation continues then $\mathrm{Bin}_i$ never converges.
FIG. 4. A stabilizing structure in $\mathrm{Bin}_i$. Notice that in each of the stages there is exactly one complete
cycle in $\mathrm{Bin}_i$. Besides these two complete intervals there are no other intervals $C$ such that
$D[C] \in (\Pi_{2k-1}, \Pi_{2k})$. If the conditions of Lemma 5 hold, then $\mathrm{Bin}_i$ reaches stability by $\mathrm{Bin}_i[j]$.
Definition 2. A stabilizing structure in $\mathrm{Bin}_i$ is a pair of consecutive stages
$(\Pi_{2k-1}, \Pi_{2k})$ such that
1. in each of the stages $\Pi_{2k-1}$ and $\Pi_{2k}$ there is exactly one complete cycle in
$\mathrm{Bin}_i$;
2. for all cycles $C$ in $\mathrm{Bin}_i$, if $D[C] \in \Pi_{2k-1}$ then $F[C] \in \Pi_{2k-1}$, and if
$D[C] \in \Pi_{2k}$ then $F[C] \in \Pi_{2k}$.
Figure 4 depicts a stabilizing structure.
Lemma 5. Consider a stabilizing structure $(\Pi_{2k-1}, \Pi_{2k})$ on $\mathrm{Bin}_i$. Let $\mathrm{Bin}_i[j]$ be
the frontier at $F[\Pi_{2k}]$. Suppose that
1. there are no clobbers to $\mathrm{Bin}_i$ during $\Pi_{2k-1}$ and $\Pi_{2k}$;
2. the complete cycles in stages $\Pi_{2k-1}$ and $\Pi_{2k}$ do not write to holes;
3. $\mathrm{Bin}_i[j-1]$ is not clobbered after $F[\Pi_{2k}]$.
Then $\mathrm{Bin}_i$ reaches stability at $\mathrm{Bin}_i[j]$.
Proof. Since both $\Pi_{2k-1}$ and $\Pi_{2k}$ contain cycles that do not write to holes, each
of these cycles advances the frontier. Thus, at $S[\Pi_{2k}]$ the frontier must be at most
at $\mathrm{Bin}_i[j-1]$, and at $S[\Pi_{2k-1}]$ the frontier must be at most at $\mathrm{Bin}_i[j-2]$. Let
$v$ be the value in $\mathrm{Bin}_i[j-1]$ at $F[\Pi_{2k}]$. We show that following $F[\Pi_{2k}]$, $v$ is the
only value stored in filled cells $\mathrm{Bin}_i[j']$ for $j' \ge j-1$. Initially, at $F[\Pi_{2k}]$, all cells
$j' > j-1$ are empty, and thus the claim holds. Consider a cycle $C$ writing after
$F[\Pi_{2k}]$. By the definition of the stabilizing structure there are two possibilities
for $C$.
1. $D[C] < S[\Pi_{2k-1}]$. In this case, during the entire binary search of $C$ the
frontier never moves beyond $\mathrm{Bin}_i[j-2]$. Hence, $C$ writes to cell $\mathrm{Bin}_i[j'']$ with
$j'' \le j-2 < j'$.
2. $D[C] > F[\Pi_{2k}]$. We prove by induction on $j' \ge j-1$ that all (nonclobber)
writes to cell $j'$ are with the value $v$. For $j' = j-1$, $\mathrm{Bin}_i[j-1]$ is never clobbered
after it is written in $\Pi_{2k-1}$ and hence never written after $F[\Pi_{2k}]$. Thus, the value $v$ remains in $\mathrm{Bin}_i[j-1]$. Assume by induction that the claim is true for all $j''$, $j-1 \le j'' < j'$, and that $C$ writes to cell $\mathrm{Bin}_i[j']$. Then by the inductive hypothesis, $v$ is the value of $\mathrm{Bin}_i[j'-1]$ and is the value copied to $\mathrm{Bin}_i[j']$. ∎
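The two cases of the proof can be illustrated with a toy model. The Python sketch below is not the paper's protocol: it assumes a bin is a list with `None` for empty cells, that every write after the structure is either a stale write at or below cell j-2 (case 1) or a copy of cell j'-1 into the frontier cell j' (case 2), and that no clobber touches cell j-1 or beyond. The names `run_after_structure` and `stable_suffix_ok` are hypothetical.

```python
import random

def run_after_structure(j, v, steps, size=32, seed=0):
    """Toy model of the writes that follow the end of a stabilizing structure.

    Cells are 1-indexed (index 0 is unused). Cell j-1 holds v and every
    later cell is empty, as assumed at the end of the second stage.
    Requires j >= 2.
    """
    rng = random.Random(seed)
    bin_ = [None] * (size + 1)
    for c in range(1, j - 1):            # cells 1 .. j-2 hold arbitrary old values
        bin_[c] = f"old{c}"
    bin_[j - 1] = v
    frontier = j                          # first empty cell
    for _ in range(steps):
        if j >= 3 and rng.random() < 0.5:
            # Case 1: a cycle that decided early writes at or below cell j-2.
            bin_[rng.randint(1, j - 2)] = "stale"
        elif frontier <= size:
            # Case 2: a late cycle copies cell j'-1 into the frontier cell j'.
            bin_[frontier] = bin_[frontier - 1]
            frontier += 1
    return bin_

def stable_suffix_ok(bin_, j, v):
    """True if every filled cell with index >= j-1 holds v."""
    return all(x is None or x == v for x in bin_[j - 1:])
```

Under these assumptions, `stable_suffix_ok(run_after_structure(5, "v", 40), 5, "v")` holds for any seed, because stale writes never reach cell j-1 and every copy propagates the value already stored there.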
Lemma 6. There exists a constant $p$ such that for any $k$ and $i$, the probability that $(\Pi_{2k-1}, \Pi_{2k})$ constitutes a stabilizing structure on $\mathrm{Bin}_i$ is at least $p$, independent of all other $k$ and $i$.
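Proof. If stage $\Pi_{2k}$ contains $m$ complete cycles, then the probability that exactly one cycle is operating on $\mathrm{Bin}_i$ is
$$m \left(\frac{1}{n}\right)\left(1 - \frac{1}{n}\right)^{m-1}.$$
By Lemma 2, the probability that condition 1 of Definition 2 holds is at least
$$\left[\, n \left(\frac{1}{n}\right)\left(1 - \frac{1}{n}\right)^{3n-1} \right]^{2} \approx e^{-6}.$$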
Since there are $n$ processors, at most $n$ cycles $C$ do not satisfy condition 2 of Definition 2, i.e., $D[C] \in \Pi_{2k-1}$ and $F[C] \notin \Pi_{2k-1}$, or $D[C] \in \Pi_{2k}$ and $F[C] \notin \Pi_{2k}$. The probability that none of these cycles operates on $\mathrm{Bin}_i$ is at least $(1 - (1/n))^{2n} \approx e^{-2}$. Thus, $p > e^{-8}$. ∎
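The constants in this proof are easy to check numerically. The following Python sketch evaluates the two expressions from the proof for a few values of n; the function names are hypothetical, and the only assumption is that each cycle picks its bin uniformly and independently among n bins.

```python
import math

def prob_exactly_one(m, n):
    """Probability that exactly one of m independent cycles picks a fixed bin
    when each cycle picks uniformly among n bins."""
    return m * (1 / n) * (1 - 1 / n) ** (m - 1)

def lemma6_estimate(n):
    """Product of the two factors that appear in the proof."""
    cond1 = (n * (1 / n) * (1 - 1 / n) ** (3 * n - 1)) ** 2   # condition 1, both stages
    cond2 = (1 - 1 / n) ** (2 * n)                             # condition 2
    return cond1 * cond2

# Ratio of the estimate to e^{-8}; it tends to 1 as n grows.
for n in (10, 100, 10_000):
    print(n, lemma6_estimate(n) / math.exp(-8))
```

For finite n the product sits marginally below its limit, so the approximation signs are best read asymptotically: any constant slightly smaller than $e^{-8}$ serves as p for all sufficiently large n.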
Lemma 7. For sufficiently large $\beta$, w.h.p. all bins reach stability by cell $(\beta \log n)/2$.
Proof. Recall that $\omega$ is the amount of work per cycle and that each stage consists of $3n\omega$ work units. Consider $\mathrm{Bin}_i$. For any $\beta$ there exists a $\beta_1 \le \beta$ such that after $\omega \beta_1 n \log n$ work units, w.h.p. there are at most $(\beta \log n)/2$ writes to $\mathrm{Bin}_i$. Thus, after $(\beta_1 \log n)/3$ stages the frontier of $\mathrm{Bin}_i$ has not reached cell $(\beta \log n)/2$.
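The stage count in the last sentence is the work bound divided by the work per stage. As a worked step, with $\omega$ the work per cycle and $\beta_1$ the constant chosen above:

```latex
\[
  \frac{\omega\,\beta_1\, n \log n}{3 n \omega}
  \;=\; \frac{\beta_1 \log n}{3}
  \quad \text{stages.}
\]
```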
Let $S$ be the set of stabilizing structures on $\mathrm{Bin}_i$ in the first $(\beta_1 \log n)/3$ stages. By Lemma 6 and Chernoff bounds, there exists a $\beta_2$, increasing with $\beta$ and independent of $n$, such that w.h.p. $|S| \ge \beta_2 \log n$. We show that conditions 1–3 of Lemma 5 hold for at least one of the stabilizing structures in $S$. From this, by Lemma 5, we get that $\mathrm{Bin}_i$ reaches stability before $\mathrm{Bin}_i[(\beta \log n)/2]$.
By Lemma 1, w.h.p. $\mathrm{Bin}_i$ contains at most $c \log n = O(\log n)$ clobbers, where $c$ is independent of $\beta$. Thus,
1. At most c log n of the stabilizing structures in S contain a clobber.
2. At most c log n of the stabilizing structures in S contain complete cycles
that write to holes. (Each clobber produces at most one hole, and once a cell
is filled in a stage, no complete cycles from future stages write to it unless it is
clobbered again.)
3. For at most c log n of the stabilizing structures in S that do not write to holes, the frontier at the end of the structure is subsequently clobbered. This is because two structures not writing to holes cannot have the same frontier.
For a sufficiently large $\beta$, we have $\beta_2 > 3c$. Thus $|S| \ge \beta_2 \log n > 3c \log n$, so at least one stabilizing structure in $S$ satisfies all three conditions of Lemma 5. ∎
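The final step is a counting argument: each of the three conditions rules out at most c log n structures, so more than 3c log n structures leave at least one untouched. A minimal Python sketch of that count follows; the function name and the random choice of excluded structures are illustrative assumptions, not part of the proof.

```python
import random

def surviving_structures(num_structures, max_bad, seed=0):
    """Return the structures that survive after each of the three conditions
    of Lemma 5 excludes at most max_bad of them (chosen arbitrarily here)."""
    rng = random.Random(seed)
    all_ids = range(num_structures)
    bad = set()
    for _ in range(3):                      # one excluded set per condition
        bad |= set(rng.sample(all_ids, max_bad))
    return [s for s in all_ids if s not in bad]
```

Whenever `num_structures > 3 * max_bad`, the returned list is nonempty regardless of which structures are excluded, which is the role played by the inequality $\beta_2 > 3c$.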
4.3. Correctness
For all $i$, a value written to $\mathrm{Bin}_i[1]$ is $f^{(\phi)}_i$ as computed by some processor. Cell $\mathrm{Bin}_i[j+1]$ is copied from $\mathrm{Bin}_i[j]$. Thus, by induction, any value $v_i$ appearing in any cell is a valid value, i.e., $v_i \in f^{(\phi)}_i$.
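The induction can be exercised on a small model. The sketch below assumes, as a simplification, that every write either stores a freshly computed candidate value in cell 1 or copies some filled cell into its successor, in an arbitrary order; `simulate_bin` and its parameters are hypothetical names, and the read-and-copy is treated as one atomic step.

```python
import random

def simulate_bin(num_writes, candidate_values, size=16, seed=0):
    """Apply writes in a random order and return the resulting cells.

    Each write either stores a computed candidate in cell 1 or copies
    cell j into cell j+1 for some filled cell j. Late or overwriting
    writes are covered because j is unrestricted.
    """
    rng = random.Random(seed)
    cells = [None] * (size + 1)             # 1-indexed; cells[0] is unused
    for _ in range(num_writes):
        filled = [j for j in range(1, size) if cells[j] is not None]
        if not filled or rng.random() < 0.3:
            cells[1] = rng.choice(candidate_values)
        else:
            j = rng.choice(filled)
            cells[j + 1] = cells[j]
    return cells
```

In this model every filled cell holds one of `candidate_values`, mirroring the claim that any value appearing in a bin was computed by some processor.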
4.4. Randomized Programs
In the case of randomized programs we must also prove that the consensus procedure does not disrupt the distribution of the possible values for $f^{(\phi)}_i$.
Claim 8. For any $i$, $\phi$, and value $x$, let $p_i(x)$ be the probability that a computation of $f^{(\phi)}_i$ yields the value $x$. Then,
$$\Pr[v_i = x] = p_i(x).$$
Proof. Consider $\mathrm{Bin}_{i_0}$. Note that by Theorem 1, the agreement value for $\mathrm{Bin}_{i_0}$ converges to a single value $v_{i_0}$ even if each time $f^{(\phi)}_{i_0}$ is computed it yields a different value. Thus, the value of $v_{i_0}$ is actually determined by a single computation of $f^{(\phi)}_{i_0}$; i.e., $v_{i_0}$ is $f^{(\phi)}_{i_0}$ as computed by a single processor in a single cycle $C = C_{\phi, i_0}$ (cycle $C_{\phi, i_0}$ writes $\mathrm{Bin}_{i_0}[1]$, and this value is subsequently copied to the other cells). The identity of $C_{\phi, i_0}$ is determined by the relative schedules of all cycles choosing $i = i_0$ in line 1. Since the schedule is determined before the start of the computation by an oblivious adversary, and the choices of $i$ in line 1 are independent of all other random decisions in the system, it follows that the identity of $C_{\phi, i_0}$ is independent of the value computed for $f^{(\phi)}_{i_0}$ by $C_{\phi, i_0}$. Thus, for any $x$,
$$\Pr[v_{i_0} = x] = \Pr\bigl[f^{(\phi)}_{i_0} = x \text{ as computed by } C_{\phi, i_0}\bigr] = p_{i_0}(x).$$
∎
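The independence argument can be checked empirically. The Python sketch below is a Monte Carlo illustration under simplifying assumptions: the winning cycle is selected before any value is drawn (standing in for the oblivious adversary), and each cycle draws its value independently from the same distribution. The function name, the `value_aware` flag, and the example distribution are all hypothetical.

```python
import random
from collections import Counter

def agreement_distribution(p, num_cycles, trials, seed=0, value_aware=False):
    """Estimate Pr[v = x], where v is the value computed by the winning cycle.

    p: dict mapping each possible value x to its probability p(x).
    value_aware=False: the winner is fixed before values are drawn.
    value_aware=True: a (non-oblivious) scheduler picks the cycle with the
    largest value, shown only for contrast.
    """
    rng = random.Random(seed)
    values, weights = zip(*p.items())
    counts = Counter()
    for _ in range(trials):
        winner = rng.randrange(num_cycles)                   # schedule decided first
        draws = rng.choices(values, weights, k=num_cycles)   # independent computations
        if value_aware:
            winner = max(range(num_cycles), key=lambda c: draws[c])
        counts[draws[winner]] += 1                           # agreed value
    return {x: counts[x] / trials for x in values}
```

With `p = {0: 0.2, 1: 0.8}`, the oblivious variant returns frequencies close to p, whereas the value-aware variant skews heavily toward 1. This is why the claim relies on the schedule being fixed independently of the computed values.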
Received December 31, 1996; final manuscript received May 5, 1997
REFERENCES
1. Abrahamson, K. (1988), On achieving consensus using shared memory, in ‘‘Proceedings of the 7th
Annual ACM Symposium on the Principles of Distributed Computing,’’ pp. 291302.
2. Aspnes, J. (1990), Time- and space-efficient randomized consensus, in ‘‘Proceedings of the 9th ACM
Symposium on Principles of Distributed Computing,’’ pp. 325331.
3. Aspnes, J., and Herlihy, M. (1990), Fast randomized consensus using shared memory, J. Algorithms
11 (3), 441461.
14 AUMANN, BENDER, AND ZHANG
File: DISTIL 265315 . By:DS . Date:20:11:97 . Time:12:57 LOP8M. V8.0. Page 01:01
Codes: 8624 Signs: 3901 . Length: 52 pic 10 pts, 222 mm
4. Aspnes, J., and Waarts, O. (1992), Randomized consensus in expected O(n log2 n) operations per
processor, in ‘‘Proceedings of the 33rd Annual Symposium on the Foundations of Computer
Science,’’ pp. 137146.
5. Attiya, H., Dolev, D., and Shavit, N. (1989), Bounded polynomial randomized consensus, in
‘‘Proceedings of the 8th ACM Symposium on Principles of Distributed Computing,’’ pp. 281294.
6. Aumann, Y., and Bender, M. A. (1996), Efficient asynchronous consensus with the value-oblivious
adversary scheduler, in ‘‘Proceedings of the 23rd International Colloquium on Automata,
Languages, and Programming,’’ pp. 622633.
7. Aumann, Y., Palem, K., Kedem, Z., and Rabin, M. O. (1993), Highly asynchronous execution of
large grained parallel programs, in ‘‘Proceedings of the 34th Annual Symposium on the Foundations
of Computer Science,’’ pp. 271280.
8. Aumann, Y., and Rabin, M. O. (1992), Clock construction in fully asynchronous parallel systems
and PRAM simulation, in ‘‘Proceedings of the 33rd Annual Symposium on the Foundations of
Computer Science,’’ pp. 147156.
9. Aumann, Y., and Rabin, M. O. (1994), Clock construction in fully asynchronous parallel systems
and PRAM simulation, Theoret. Comput. Sci. 128, 330.
10. Bracha, G., and Rachman, O. (1991), Randomized consensus in expected O(n2 log n) operations, in
‘‘Proceedings of the 5th International Workshop on Distributed Algorithms,’’ Springer-Verlag,
BerlinNew York, pp. 143150.
11. Chandra, T. D. (1996), Polylog randomized wait-free consensus, in ‘‘Proceedings of the 15th ACM
Symposium on Principles of Distributed Computing,’’ pp. 166175.
12. Chor, B., Israeli, A., and Li, M. (1987), On processor coordination using asynchronous hardware,
in ‘‘Proceedings of the 6th ACM Symposium on Principles of Distributed Computing,’’ pp. 8697.
13. Cole, R., and Zajicek, O. (1989), The expected advantage of asynchrony, in ‘‘Proceedings of the
ACM Symposium on Parallel Architectures and Algorithms,’’ pp. 8594.
14. Dolev, D., Dwork, S., and Stockmeyer, L. (1987), On the minimal synchronism needed for dis-
tributed consensus, J. Assoc. Comput. Mach. 34 (1), 7797.
15. Fischer, M. J., Lynch, N. A., and Paterson, M. S. (1985), Impossibility of distributed commit with
one faulty process, J. Assoc. Comput. Mach. 32 (2), 374382.
16. Fortune, S., and Wyllie, J. (1978), Parallelism in random access machines, in ‘‘Proceedings of the
10th Annual ACM Symposium on Theory of Computing,’’ pp. 114118.
17. Gibbons, P. B. (1989), A more practical PRAM model, in ‘‘Proceedings of the 1st ACM symposium
on Parallel Architectures and Algorithms,’’ pp. 158168.
18. Kanellakis, P., and Shvartsman, A. (1989), Efficient parallel algorithms can be made robust,
in ‘‘Proceedings of the 8th Annual ACM Symposium on the Principles of Distributed Computing,’’
pp. 211221.
19. Kanellakis, P., and Shvartsman, A. (1991), Efficient parallel algorithms on restartable fail-stop pro-
cessors, in ‘‘Proceedings of the 10th Annual ACM Symposium on the Principles of Distributed Com-
puting,’’ pp. 2336.
20. Kedem, Z. M., Palem, K. V., Rabin, M. O., and Raghunathan, A. (1992), Efficient program transfor-
mation for resilient parallel computation via randomization, in ‘‘Proceedings of the 24th Annual
ACM Symposium on the Theory of Computing,’’ pp. 306317.
21. Kedem, Z. M., Palem, K. V., Raghunathan, A., and Spirakis, P. G. (1991), Comining tentative and
definite executions for very fast dependable parallel computing, in ‘‘Proceedings of the 23rd Annual
ACM Symposium on Theory of Computing,’’ pp. 381390.
22. Kedem, Z. M., Palem, K. V., and Spirakis, P. G. (1990), Efficient robust parallel computations, in
‘‘Proceedings of the 22rd Annual ACM Symposium on Theory of Computing,’’ pp. 138148.
23. Loui, M., and Abu-Amara, H. (1987), Memory requirements for agreement among unreliable
asynchronous processes, Adv. Comput. Res. 4, 163183.
15ASYNCHRONOUS EXECUTION OF PARALLEL PROGRAMS
File: DISTIL 265316 . By:DS . Date:20:11:97 . Time:12:58 LOP8M. V8.0. Page 01:01
Codes: 1439 Signs: 544 . Length: 52 pic 10 pts, 222 mm
24. Martel, C., Park, A., and Subramonian, R. (1990), Asynchronous PRAMs are (almost) as good
assynchronous PRAMs, in ‘‘Proceedings of the 31st Annual Symposium on the Foundations of
Computer Science,’’ pp. 590599.
25. Nishimura, N. (1990), Asynchronous shared memory parallel computation, in ‘‘Proceedings of the
2nd ACM Symposium on Parallel Architectures and Algorithms,’’ pp. 7684.
26. Valiant, L. G. (1990), A bridging model for parallel computation, Comm. Assoc. Comput. Mach. 33
(8), 103111.
16 AUMANN, BENDER, AND ZHANG
