Clock construction in fully asynchronous parallel systems and PRAM simulation  by Aumann, Yonatan & Rabin, Michael O.
Theoretical Computer Science 128 (1994) 3-30 
Elsevier 
Clock construction in fully 
asynchronous parallel systems 
and PRAM simulation 
Yonatan Aumann 
Institute of Mathematics and Computer Science, The Hebrew University, Givat Ram, Jerusalem, 
91904, Israel 
Michael O. Rabin*
Aiken Computing Laboratory, Harvard University, Cambridge, MA 02138, USA, and Institute of 
Mathematics and Computer Science, The Hebrew University, Givat Ram, Jerusalem, 91904, Israel 
Abstract 
Aumann, Y. and M.O. Rabin, Clock construction in fully asynchronous parallel systems and PRAM 
simulation, Theoretical Computer Science 128 (1994) 3-30. 
We consider the problem of simulating synchronous computations on asynchronous shared mem- 
ory systems. The systems we consider allow for arbitrary asynchronous behavior of the processors. 
In addition, we make very limited (and in some cases no) assumptions about the atomicity of read 
and write operations to shared memory. We provide detailed definitions of these asynchronous 
systems and their atomicity properties. 
The first construction in this paper is a novel clock for asynchronous systems. The clock is a basic
tool for synchronization in the asynchronous environment. The construction we give is extremely
robust, and can be implemented in a system with no atomicity assumptions, and in the presence of
an adaptive adversary scheduler. The correct behavior of the clock is obtained with overwhelming
probability ($> 1 - 2^{-\alpha n}$, $\alpha > 0$).
We then show how to harness this clock to drive an efficient PRAM simulation on an asynchronous
system. The simulation requires an $O(\log^2 n)$ work, and $O(\log n)$ space, overhead. This
improves by a $\log n$ factor on the efficiency of previously obtained simulation results, while relaxing
the assumptions on the underlying asynchronous system.
1. Introduction 
Parallel algorithms and programs are most commonly designed and described 
for systems of tightly coupled processors working in almost complete synchrony. 
Correspondence to: Y. Aumann, Laboratory for Computer Science, Massachusetts Institute of Technology, 
545 Technology Square, Room 372, Cambridge, MA 02139, USA. Email: aumann@theory.lcs.mit.edu. 
*Supported in part by ONR contract number N00014-91-J-1981, and NSF grant number CCR-90-07677,
at Harvard University. Email: rabin@das.harvard.edu, rabin@cs.huji.ac.il. 
0304-3975/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved
SSDI 0304-3975(93)E0170-9 
A typical example of such a system is the PRAM model, in which all processors work 
in lock-step on the individual instruction level. In less extreme models (e.g. the BSP 
model [16]), synchronization is not assumed to exist at each and every step, but is still 
an indispensable ingredient of the overall structure. Synchronization assumptions are 
convenient for program development since they free the programmer from the need to 
consider actual processor and network timings, and let him or her focus on the major 
task of parallelizing the program. These assumptions do not, however, correspond to 
the way one would expect actual parallel systems to operate. Typically, in a multitask- 
ing parallel system, processors working on separate parts of the same application 
would do so asynchronously and at considerably different speeds. For example, 
a heavily loaded processor would dedicate to any given application considerably less 
CPU time than a lightly loaded processor. In this case, running the application would 
experience extreme asynchronous behavior of the processors. Note that we are 
considering asynchrony on the application level. Thus, unlike hardware asynchronous 
behavior (e.g. clock-skew), which can reasonably be assumed to be within limited 
bounds, such bounds cannot be postulated in our setting. At the application level, 
asynchronous behavior may be extreme, and cannot be approximated by convenient 
stochastic assumptions. 
Handling asynchrony has thus attracted much research activity and is the topic of 
a large body of work. One important issue is how to simulate the execution of 
a PRAM program on an asynchronous parallel system. But even implementing 
particular algorithms in an asynchronous setting is a challenging task. In this paper 
we focus on asynchronous systems with shared memory. Previous work regarding this 
setting typically assumed some sort of atomic primitives. A minimal assumption is 
that individual reads and writes to shared memory are atomic. More frequently it is 
assumed that some compound instruction of the form “read & write” (e.g. test & set, 
fetch & add, compare & swap) is atomic. Bootstrapping on these primitives, consensus
protocols and synchronization mechanisms are then developed. Carrying out the 
more complex computations, such as PRAM simulations, generally required the 
stronger primitives. Herlihy [3] describes a full hierarchy of atomicity assumptions, 
and proves that primitives of a higher class cannot be deterministically implemented 
by those of a lower class, if wait-free operation is to be guaranteed. In particular, 
complex computations such as full PRAM power require the most powerful primi- 
tives such as compare & swap (see also [4]). Recently, Kedem et al. [7] gave for the first 
time a (randomized) PRAM simulation scheme on an asynchronous PRAM for which 
only individual reads and writes are assumed to be atomic. 
In this paper we consider the power of systems with very weak, or altogether 
lacking, atomicity assumptions. We formally describe two such systems: the fully 
asynchronous parallel system (FAPS) and the atomic asynchronous parallel system 
(AAPS). The formulation of the FAPS captures the full notion of complete asyn- 
chrony, where there is no fixed correlation between processor speeds and actual 
physical time. Thus, we can have the processors stop and start working at any time, 
allowing for arbitrarily complex forms of interleaving activity. In addition, no action 
in the FAPS is assumed to be atomic, not even a read or a write of a single bit. 
Individual read and write operations are viewed as occupying physical time intervals 
of varying lengths, and the intervals of different processors may overlap without 
coinciding. In the AAPS, complete asynchronous behavior is retained, but basic 
actions (e.g. individual reads and writes) are assumed to be atomic. 
There are two major problems that we need to solve when simulating an n-thread
PRAM computation on an n-processor asynchronous machine. In the PRAM computation
$\mathcal{P}$, the $\pi$th parallel step comprises executing instructions
$x_j^\pi \leftarrow f_j^\pi(y_j^\pi, z_j^\pi)$, $j = 1, \ldots, n$. Here $x_j^\pi$, $y_j^\pi$, $z_j^\pi$ are (shared) program
variables and $f_j^\pi$ is a processor instruction such as add or compare. The subscript $j$ and
superscript $\pi$ indicate the dependence on the thread and the parallel step.
When executing the computation $\mathcal{P}$ on an asynchronous machine we cannot simply
assign the execution of the $j$th thread to processor $P_j$. If processor $P_j$ is considerably
slower than the others, or fails, completion of the $\pi$th parallel step will be held up by
waiting for $P_j$. Our strategy is to proceed in parallel (simulation) phases. In phase
$\pi$ every processor $P_i$ randomly and repeatedly chooses threads $TH_j$ and executes
$x_j^\pi \leftarrow f_j^\pi(y_j^\pi, z_j^\pi)$ until all the parallel instructions are completed.
In the asynchronous system we do not have a physically implemented common 
clock for synchronizing the processors. Thus, we must implement, within the simula- 
tion, a logical clock which the processors read, in order to know in which simulation 
phase the computation is. 
However, having a logical phase clock is not sufficient for effecting a correct 
simulation. The main problem with asynchronous execution arises from the possibility
that a slow processor $P_i$ may start executing $x \leftarrow f(y, z)$ in phase $\pi$ by computing
$f(y, z)$. It may then proceed within a later phase $\pi_1 > \pi$ to update $x = f(y, z)$. If $x$ was in
the meantime properly updated in an intermediate phase $\pi < \pi' < \pi_1$, then the update
by Pi clobbers a correct value of x. It does not help if Pi will again read the clock before 
writing x. Reading the clock and writing x are, in our model, separate operations, and 
an arbitrary number of write-x operations by other processors may occur between 
these two. 
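The clobbering scenario can be made concrete with a small deterministic interleaving. The sketch below is only illustrative: the `SharedMemory` class, the integer phase counter standing in for the clock, and the particular interleaving are assumptions introduced here, not part of the model.

```python
class SharedMemory:
    """Toy shared memory: a phase counter (the clock) and one variable x."""
    def __init__(self):
        self.clock = 1   # current simulation phase
        self.x = None


def slow_processor(mem, f_value):
    """Tardy processor: reading the clock and writing x are separate actions.

    The generator yields after each action, so the caller can interleave
    actions of other processors between the two.
    """
    phase_seen = mem.clock           # action 1: read the clock
    yield ("read_clock", phase_seen)
    mem.x = f_value                  # action 2: write x, possibly much later
    yield ("write_x", f_value)


mem = SharedMemory()
slow = slow_processor(mem, f_value="stale result of phase 1")
first_event = next(slow)             # the slow processor reads the clock (phase 1)

# Before its next action, the other processors complete phases 1 and 2.
mem.x = "correct result of phase 1"
mem.clock = 2
mem.x = "correct result of phase 2"
mem.clock = 3

next(slow)                           # the delayed write now clobbers x
```

After the last step the clock reads phase 3 while `mem.x` holds the value computed for phase 1, even though the slow processor checked the clock immediately before its write.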
In this paper, we first give a robust clock construction for the asynchronous 
systems. The clock is a basic synchronization tool for the PRAM simulation on these 
systems. Technically, the clock keeps track of the overall amount of work performed 
in the system. Consequently, if we have an estimate of the amount of work between 
synchronization barriers, then the clock functions as a synchronization mechanism 
(the notion of “work” is formally defined in Section 2). This enables us to perform the 
synchronization on a purely computational basis without reference to actual time, 
which is ill-defined in the asynchronous system. Our clock can be implemented in the
most extreme setting of the FAPSs, and under an adaptive adversary scheduler.
Moreover, with an overwhelming probability ($> 1 - 2^{-\alpha n}$) the clock operates correctly
for an exponential number of steps. 
In the second part of the paper we show how to apply the clock to obtain efficient 
PRAM simulation on the AAPS. We consider an AAPS in which only individual 
reads and writes are assumed to be atomic. No further atomicity (e.g. read & write) is 
assumed. Our solution for the problem of clobbers follows the strategy of [7]. Every 
program or control variable $x$ is represented in shared memory in a replicated form by
$p = O(\log n)$ copies $x^{(1)}, \ldots, x^{(p)}$. When a processor $P_i$ writes in phase $\pi$, $x := v$, it
updates only one copy at a time, $x^{(k)} := v$. Following each update, the processor reads
the clock. Thus, a tardy processor will only clobber at most one copy of a variable.
A probabilistic analysis will then show that, with high probability ($\ge 1 - 1/n^{\alpha}$), altogether
fewer than $p/2$ copies of $x$ will be clobbered. For a synchronous n-processor
exclusive write PRAM program and an n-processor AAPS we provide a simulation
with $O(\log^2 n)$ work overhead and $O(\log n)$ memory blowup. This is a $\log n$ factor
improvement over the previous results. If the program is also exclusive read we obtain 
relaxed assumptions regarding the concurrent memory access the host system must 
handle. Our scheme requires O(1) expected concurrency and, with high probability, 
O(log n) maximum concurrency (instead of O(n) in the previous works). 
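The replicated representation can be illustrated with a minimal sketch. The helper names `write_replicated` and `read_majority` are hypothetical, and recovering the value by a strict majority vote over the copies is an assumption suggested by the "fewer than p/2 copies" bound, not a rule stated in the text above.

```python
from collections import Counter


def write_replicated(copies, value, current_phase, read_clock):
    """Write `value` one copy at a time, re-reading the clock after each update.

    Returns the number of copies touched. A writer that discovers it is late
    stops after the copy it has just written.
    """
    for k in range(len(copies)):
        copies[k] = value
        if read_clock() != current_phase:   # the phase has moved on: stop
            return k + 1
    return len(copies)


def read_majority(copies):
    """Return the value held by a strict majority of the copies, else None."""
    value, count = Counter(copies).most_common(1)[0]
    return value if count > len(copies) / 2 else None


# Seven copies hold the correct value; a tardy writer from phase 1 shows up
# while the clock already reads phase 2.
copies = ["new"] * 7
touched = write_replicated(copies, "stale", current_phase=1, read_clock=lambda: 2)
recovered = read_majority(copies)
```

Here the tardy writer damages exactly one copy, and the majority read still returns the correct value.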
1.1. Previous and related work
Several studies in recent years have addressed the issue of incorporating asyn- 
chrony into the shared memory and PRAM models, and designing methods for 
handling the difficulties it introduces ([1-3,12-15] and others). For a detailed overview,
the reader is referred to the introductions in [7,13].
Martel et al. [12,13] give an $O(nT)$ work simulation scheme for a $T$-step $n$-processor
PRAM program, on an $n/(\log n \log^* n)$ processor limited asynchronous
system, which is work optimal. With $n$ processors in the actual system, the scheme
gives an $O(\log n \log^* n)$ work overhead. This was later improved to eliminate the
log* n factor in [14]. Reads and writes are assumed to be atomic. In addition, there is 
a “loose atomicity” assumption which states that no more than O(n) work units are 
completed in the system between specified read and write instructions by the same 
processor (this prevents tardy processors from clobbering correct results). Thus, the 
system is not completely asynchronous in our sense. In fact, implementing “loose 
atomicity” in an asynchronous system would require various strong primitives such as 
test & set, which our work is avoiding. The scheme is randomized and is successful 
with high probability ($> 1 - n^{-\alpha}$) in the presence of an oblivious adversary scheduler
(to be defined formally in Section 2.1). 
Kedem et al. [7] gave, for the first time, a scheme for the simulation of an EW 
PRAM on a completely asynchronous system, assuming only atomic reads and writes. 
For an n-processor PRAM program and an n-processor system, the scheme requires 
an $O(\log^3 n)$ work overhead. If the actual system consists of $n/\log n$ actual processors
then the overhead reduces to $O(\log^2 n)$. The scheme entails an $O(\log n)$ memory
blowup factor. This scheme also is randomized and successful with high probability in 
the presence of an oblivious adversary scheduler. 
In both of the above schemes it is central that the simulating asynchronous system 
allows concurrent memory accesses, and that simultaneous accesses to the same 
shared memory location produce the correct outcome. For read actions, the schemes 
of [7,12,13] require that the system allows up to O(n) concurrent reads, all obtaining 
the correct value. For write actions, the schemes assume that if two writes occur 
simultaneously, then one of them succeeds. 
In this paper we improve on the previous results, by reducing the work overhead, 
while relaxing the assumptions on the simulating machine. For a system with only 
atomic reads and writes, and an exclusive write PRAM program (the latter is not 
a real restriction), we obtain a simulation with an $O(\log^2 n)$ work overhead and an
$O(\log n)$ memory blowup. If the PRAM program is also exclusive read, then our
simulation entails only $O(1)$ expected simultaneous access to each shared memory
location and, with high probability, no location is ever accessed simultaneously
by more than $O(\log n)$ processors. Furthermore, a slight modification of our
simulation scheme allows one to perform the simulation on a system which does not 
provide for any effective simultaneous memory access. Specifically, the simulation can 
be correctly carried out on a system in which simultaneous accesses to the same 
memory location produce nondeterministic outcomes (as explained in Section 2). 
We omit the details of modified simulation and the related analysis from this 
publication. 
We note, however, that our scheme is correct only in the Monte-Carlo sense, while 
the previous ones give a Las-Vegas behavior. 
The idea of keeping several copies of each program variable, used in this work, was 
introduced in the simulation context in [7]. The latter paper also has a clock. Our 
clock is, however, rather different, both in structure and application, from that of [7]. 
We use the clock to drive the computation rather than for performing lateness tests. 
The possibility of doing so is unique to our new clock. 
A very particular form of asynchrony is the fail-stop behavior. PRAM simulation 
on fail-stop PRAM is dealt with in [6,8,9] and others. Clearly our results hold for this 
restricted model as well. 
Lamport in [10, 11] dealt extensively with the delicate issue of atomicity of
reads and writes, or the lack thereof. In [10], a general definition of asynchronous
systems is described. Our FAPS model is similar to the “global time model” described 
there. 
1.2. Outline and terminology 
The paper is organized as follows. In Section 2 we introduce the asynchronous 
systems, and give a formal description of the FAPS and AAPS. In Section 3 we 
describe the clock, and prove its strong properties. The PRAM simulation is described
in Section 4, and the analysis is given in Section 5. 
We shall say that an event $E$ occurs with high probability (w.h.p.) if for any $\alpha > 0$
there exists a proper choice of the relevant parameters such that $\Pr[E] \ge 1 - n^{-\alpha}$.
We say that the event occurs with overwhelming probability if for $\alpha$ as above
$\Pr[E] \ge 1 - 2^{-\alpha n}$.
2. Asynchronous systems
We consider two types of asynchronous systems: the FAPS and the AAPS. Both 
these systems allow the processors to function in arbitrarily complex patterns of
interleaving activities. The systems differ in the assumptions on the atomicity of 
processor actions. In the FAPS we have processor actions (e.g. read or write) occupy 
real-time intervals of varying lengths. For different processors these action intervals 
may overlap without coinciding. Thus, in the FAPS, no operation is atomic, not even 
a read or a write of a single bit. AAPS actions, in contrast, are assumed to occur 
atomically at an instantaneous point in time. Our formulation allows one to specify, 
within the definition of such a system, which of its actions are atomic. The PRAM 
simulation we give is for the AAPS where the only atomic operations are reading or 
writing a single memory cell. We proceed to give a formal description of these two 
systems. 
2.1. Fully asynchronous parallel systems 
In the definition of the fully asynchronous system we want to capture the idea 
that each processor may have a completely autonomous flow of “time”, not correlated 
to that of other processors. It is important to emphasize that not only can different 
processors disagree on the question “what time is it?” but also on how “fast” time 
passes by. In order to formulate this we express each processor’s internal, subjective, 
view of time in relation to the actual (physical) continuous time axis. Note, however, 
that actual time does not exist for the processors; it serves only in our formulations 
and analysis. 
The FAPS has the following properties: 
(1) The system consists of $n$ independent parallel processors, $\{P_i\}_{i=1}^{n}$, and shared
memory. Processors may also have private memory. 
(2) Processors act by reading from and writing to shared memory, and by perform- 
ing internal computations. We postulate a set of basic actions which include reading or 
writing a single memory cell and a predefined set of internal computations. 
(3) Processors have an internal view of time. Internal time is discrete, ranging over 
the natural numbers, $\mathbb{N}$. At each internal time point, $\tau \in \mathbb{N}$, a processor performs
exactly one of the above basic actions. 
(4) To each processor $P_i$ there corresponds a schedule $T_i$ mapping discrete internal
time into actual continuous-time intervals. Formally, let $\mathrm{Int} = \{[a, b) \mid 0 \le a < b \le \infty\}$;
the function $T_i: \mathbb{N} \to \mathrm{Int} \cup \{\mathrm{Failure}\}$ is a mapping with the following properties.
(a) Nonoverlapping: For $\tau, \sigma \in \mathbb{N}$, $\tau \ne \sigma$, if $T_i(\tau) \in \mathrm{Int}$ and $T_i(\sigma) \in \mathrm{Int}$ then
$T_i(\tau) \cap T_i(\sigma) = \emptyset$.
(b) Order preserving: for $\tau < \sigma$, if $T_i(\tau) = [a, b)$ and $T_i(\sigma) = [c, d)$ then $b \le c$. And if
$T_i(\tau) = \mathrm{Failure}$ then $T_i(\sigma) = \mathrm{Failure}$.
The mapping $T_i$ is called the schedule of processor $P_i$, and the sequence
$T = (T_1, T_2, \ldots, T_n)$ is the total schedule.
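As a sanity check on properties (a) and (b), the following sketch validates a finite prefix of a single processor's schedule. The list-of-tuples representation and the name `is_valid_schedule` are assumptions made for illustration. Since the intervals are half-open, checking $b \le c$ for consecutive intervals (property (b)) already guarantees that they are disjoint (property (a)).

```python
import math

FAILURE = "Failure"


def is_valid_schedule(schedule):
    """Check a finite prefix T_i(0), T_i(1), ... of one processor's schedule.

    Each entry is either FAILURE or a pair (a, b) standing for the half-open
    interval [a, b) with 0 <= a < b <= infinity.
    """
    failed = False
    prev_end = 0.0
    for entry in schedule:
        if entry == FAILURE:
            failed = True
            continue
        if failed:                 # once failed, every later point must be Failure
            return False
        a, b = entry
        if not (0 <= a < b <= math.inf):
            return False
        if a < prev_end:           # order preserving: previous end <= next start
            return False
        prev_end = b
    return True
```

For example, `[(0, 1), (1, 2.5), (4, 9)]` is accepted, whereas `[(0, 2), (1, 3)]` is rejected because the two action intervals overlap.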
An interval, $T_i(\tau)$, in the range of $T_i$ is called an action interval. The action interval
$T_i(\tau)$ is the actual time interval required by processor $P_i$ to perform the basic action
taking place at its internal time point $\tau$. Throughout this interval, and nowhere else,
the internal time for $P_i$ is $\tau$.
This formulation implies the following: 
• There can be arbitrarily long actual time gaps between actions of any given
processor. This allows the n processors to behave in complex forms of interleaving
and overlapping actions.
• The actual time it takes to perform any action, including the basic actions of
reading and writing shared memory cells, may vary from one processor to another,
as well as for the same processor from action to action.
• The model does not assume atomicity of any sort. Even a single read or write action
of a processor to a single shared memory cell is spread over a time interval (rather
than occupying an idealized discrete time point), and during this interval the state
of the memory cell is not determined. Moreover, two processors may access the
same cell during action intervals that overlap, but do not coincide.
(5) Concurrent memory accesses may produce nondeterministic results. If two 
processors perform a read or write action involving the same memory cell and their 
action intervals overlap, then the outcome is nondeterministic, and can produce any 
value. We allow the outcome of such concurrent access actions to be determined by an 
all-powerful adversary. 
(6) The total schedule is determined by an adversary. We consider two types of 
adversary. 
(a) Adaptive adversary: At any actual time instance the adaptive adversary may 
view the entire state of the computation and determine the continuation of the 
schedule. The adversary cannot, however, prescribe what actions processors choose to 
perform in the action intervals granted to them. 
(b) Oblivious adversary: The oblivious adversary determines the entire schedule 
before the parallel computation starts. The adversary has full knowledge of the 
computation to be executed, but cannot make schedule changes during the course of 
the actual computation. In particular, the adversary cannot adapt the schedule to 
random choices made by the processors. 
Our model entails that time is defined only in the topological (ordered) sense. Thus, 
no time-out mechanism can be implemented to obtain synchronization or solve 
lateness problems. We call a system thus described a fully asynchronous parallel system
(FAPS). 
Definition 2.1. A FAPS, $M$, is a triplet $M = (n, \mathcal{A}, T)$, where $n$ is the number of
processors, $\mathcal{A}$ the set of basic actions and $T$ the total schedule.
For a processor $P_i$, we denote by $A_i(\tau)$ the basic action performed by $P_i$ at internal
time point $\tau$ (i.e. occurring during the actual time interval $T_i(\tau)$).
Definition 2.2. We say that actions $A_i(\tau)$ and $A_j(\sigma)$ ($i \ne j$) interfere with each other if
both access (read or write) the same shared memory cell and the corresponding action
intervals overlap ($T_i(\tau) \cap T_j(\sigma) \ne \emptyset$). When $A_i(\tau)$ and $A_j(\sigma)$ interfere with each other we
also say that $A_j(\sigma)$ interferes with $A_i(\tau)$, and vice versa.
An internal computation by a processor, however, never interferes with any other 
action, and cannot be interfered with, even if the actual time intervals overlap. 
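The interference relation can be phrased as a short predicate. This is a sketch under an assumed representation: the `Action` record and its field names are hypothetical, and an internal computation is modelled by `cell=None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    proc: int              # index i of the processor P_i
    start: float           # the action interval is [start, end)
    end: float
    cell: Optional[str]    # shared cell accessed; None for an internal computation


def interfere(a: Action, b: Action) -> bool:
    """True iff a and b belong to different processors, access the same
    shared cell, and their half-open action intervals overlap."""
    if a.proc == b.proc or a.cell is None or a.cell != b.cell:
        return False
    return max(a.start, b.start) < min(a.end, b.end)
```

Two intervals that merely touch, such as $[0, 2)$ and $[2, 3)$, do not overlap and therefore do not interfere.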
Complexity and efficiency in the FAPS cannot be assessed by standard measures of 
time. For a FAPS it is natural to measure work in number of action intervals. Hence 
the following definition. 
Definition 2.3. Let $M$ be a FAPS and $I = [t_0, t_1]$ an actual time interval. We say
that $I$ contains $k$ work units if, summed up over all processors, there are $k$ complete
action intervals in $I$, i.e. $k = \sum_{i=1}^{n} |\{\tau \mid T_i(\tau) \subseteq I\}|$ (where $|A|$ is the cardinality of the
set $A$).
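Definition 2.3 amounts to a simple count. The sketch below assumes, purely for illustration, that each processor's schedule is given as a finite list of pairs `(a, b)` standing for the action intervals $[a, b)$; the function name `work_units` is hypothetical.

```python
def work_units(schedules, t0, t1):
    """Number of action intervals [a, b) lying completely inside [t0, t1],
    summed over all processors."""
    return sum(
        1
        for schedule in schedules
        for (a, b) in schedule
        if t0 <= a and b <= t1
    )


schedules = [
    [(0, 1), (1, 3), (5, 6)],   # processor P_1
    [(0.5, 2), (2, 7)],         # processor P_2
]
k = work_units(schedules, 0, 3)
```

In this example three action intervals are complete within $[0, 3]$; the interval $[2, 7)$ has started but is not counted.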
2.2. Atomic asynchronous parallel systems 
In the AAPS, actions are assumed to occur instantaneously, at a point in time, 
rather than occupying a full interval. The description of the AAPS differs from that of 
the FAPS in items 4 and 5. For the AAPS we have the following: 
(4) For each processor $P_i$ there is a monotone function, $T_i: \mathbb{N} \to \mathbb{R} \cup \{\infty\}$, such
that, for $\tau < \sigma$, if $T_i(\sigma) < \infty$ then $T_i(\tau) < T_i(\sigma)$. The function $T_i$ is the schedule of $P_i$ and
the $n$-tuple $T = (T_1, \ldots, T_n)$ is the total schedule. For a basic action $A$, performed by
processor $P_i$ at internal time $\tau$, $T_i(\tau)$ is the actual point in time at which $A$ is effected.
(5) Simultaneous read actions that access the same location obtain the same value. 
In the case of simultaneous write actions to the same location, one (arbitrary) write 
succeeds. In the case of simultaneous read and write, the read obtains the value prior 
to the write. 
The rest of the FAPS formulation holds also for the AAPS. In particular, the notion 
of basic actions and the specification of the set $\mathcal{A}$ are central in the AAPS. By
determining the set $\mathcal{A}$, we specify which operations of the AAPS are atomic and which
are not.
Definition 2.4. An AAPS, $M$, is a triplet $M = (n, \mathcal{A}, T)$, where $n$ is the number of
processors, $\mathcal{A}$ the set of basic actions and $T$ the total schedule.
For the AAPS, we have the following definition of work. 
Definition 2.5. Let $M$ be an AAPS and $I = [t_0, t_1]$ an actual time interval. We say that
$I$ contains $k$ work units if $k = \sum_{i=1}^{n} |\{\tau \mid T_i(\tau) \in I\}|$.
3. The clock 
Our first goal is to construct a robust clock in the highly asynchronous FAPS 
environment. Clearly such a clock cannot measure actual time in the physical sense, 
rather it will give a useful measure of the amount of work performed. For a system 
with $n$ asynchronous actual processors, the clock advances from $\pi$ to $\pi + 1$ after $O(n)$
work units. In subsequent sections this clock will be harnessed to drive the entire 
parallel asynchronous computation. The clock we construct functions correctly even 
under an adaptive adversary scheduler. 
3.1. Construction 
The clock is composed of three arrays, of $k = cn$ locations each ($c$ to be determined
later), $\mathrm{Clock}^l = (x_1^l, x_2^l, \ldots, x_k^l)$, $l = 0, 1, 2$. Before going into the technical details let us
outline the general behavior of the clock. Later we shall give exact meanings to the 
informal notions first used. 
The clock arrays drive each other (via processor actions) in a circular fashion. 
Initially, the value 0 is written in all locations of $\mathrm{Clock}^0$, 1 in all locations of $\mathrm{Clock}^1$
and 2 in those of $\mathrm{Clock}^2$. Now the value 2 in $\mathrm{Clock}^2$ will start driving the value of
$\mathrm{Clock}^0$ to 3, which in turn drives the value of $\mathrm{Clock}^1$ to 4, and so forth in a circular
fashion (for simplicity, in the following all operations in clock superscripts are taken
mod 3, e.g. $2 + 1 = 0$). Thus, clock array $\mathrm{Clock}^l$ holds only values $\pi \equiv l \bmod 3$. We
ensure that $\mathrm{Clock}^{l+1}$ does not start driving $\mathrm{Clock}^{l+2}$ from $\pi - 1$ to $\pi + 2$ until
$\mathrm{Clock}^{l+1}$ itself has the value $\pi + 1$ “firmly” written in it. And by the time $\mathrm{Clock}^{l+2}$
starts driving $\mathrm{Clock}^l$ from $\pi$ to $\pi + 3$, the value $\pi + 1$ is written in $\mathrm{Clock}^{l+1}$ in an
extremely robust form, durable in the face of any number of “clobbers” by tardy
processors. The actual clock value is obtained by taking the value of $\mathrm{Clock}^0$ and
dividing it by 3. Since the clock is of size $\Theta(n)$, obtaining the value of the clock is
actually achieved by sampling the clock arrays.
We now give an exact formulation of the above outline. In the description we 
employ two processor actions: 
• $\mathrm{Read}_i(x)$, which returns the value at location $x$ (provided the action is not interfered
with).
• $\mathrm{Write}_i(x, v)$, which writes value $v$ in location $x$ (provided the action is not interfered
with).
Let $X$ be a set of memory locations (e.g. $X = \mathrm{Clock}^l$). A $d$-sample $S$, of $X$, is
a reading, by a processor $P_i$, of $d$ randomly chosen locations of $X$. For a sample $S$ and
a value $\pi$, denote by $\mathrm{Density}_i(S, \pi)$ the fraction of the locations in the sample $S$ with
value $\pi$. Note the subscript $i$ in the notation of $\mathrm{Density}_i(S, \pi)$. This serves to emphasize
the fact that the density of the sample is subjective (for example, due to interferences)
to the reading processor $P_i$, and does not necessarily reflect the true density in the
array. The protocol for a $d$-sample is given in Fig. 1.
Protocol 1: $d$-sample of $X$ by $P_i$
(1) For $j := 1$ to $d$ do
• Choose $x_j \in X$ at random.
• $\mathrm{value}_i(x_j) := \mathrm{Read}_i(x_j)$.
(2) For all $\pi$ set $\mathrm{Density}_i(S, \pi) := |\{j : \mathrm{value}_i(x_j) = \pi\}| / d$ (for all relevant $\pi$).
Fig. 1. Protocol for a $d$-sample of $X$.
Protocol 2: clock update by $P_i$
For $l := 0$ to 2 do:
(1) $d$-sample $\mathrm{Clock}^l$, let $S$ be the sample.
(2) If for all values $\pi$, $\mathrm{Density}_i(S, \pi) < 0.8$ then continue to next $l$.
(3) Else, let $\pi$ be such that $\mathrm{Density}_i(S, \pi) \ge 0.8$ ($\pi$ is unique). Choose one location
$x \in \mathrm{Clock}^{l+1}$ at random and $\mathrm{Write}_i(x, \pi + 1)$.
Fig. 2. Clock update protocol.
Updating the clock is performed
by sampling each of the three clock arrays, in sequence, and updating the next array
accordingly. The description of the clock update protocol is given in Fig. 2 ($d$ is
a constant to be determined later). By way of example, consider the update dynamics
following the initial configuration of the clock. First the processor, $P_i$, $d$-samples
$\mathrm{Clock}^0$. Initially, this sample will indicate that the predominant value in $\mathrm{Clock}^0$ is
0 (provided the reads in the sample are not interfered with). Thus, $P_i$ writes the value
1 in a random location of $\mathrm{Clock}^1$, with little effect. Similarly, the sample of $\mathrm{Clock}^1$
indicates that the value of $\mathrm{Clock}^1$ is 1, and $P_i$ writes 2 in a random location of $\mathrm{Clock}^2$.
Finally, the sample of $\mathrm{Clock}^2$ indicates that the predominant value in $\mathrm{Clock}^2$ is
2 ($\mathrm{Density}_i(S, 2) \ge 0.8$) and $P_i$ will write 3 in a random location of $\mathrm{Clock}^0$. Thus, the
values in $\mathrm{Clock}^0$ will gradually shift from 0 to 3. At some point, the sample of $\mathrm{Clock}^0$
will fail to give a definite value, and then the processors will refrain from writing
corresponding values to $\mathrm{Clock}^1$. However, the value of $\mathrm{Clock}^0$ will continue shifting.
Later still, the density of 3's in $\mathrm{Clock}^0$ will reach the level of 0.8. At this point, the
sample of $\mathrm{Clock}^0$ will reflect this situation, and the processors will start writing 4 in
$\mathrm{Clock}^1$. Thus, the value of $\mathrm{Clock}^1$ will now start shifting, and so forth in a circular
fashion.
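The dynamics just described can be explored with a small sequential simulation of Protocols 1 and 2. This is only a sketch under strong simplifying assumptions: actions are executed one at a time, so there are no interferences and no adversarial scheduling, and the array size `k`, the sample size `d` and the function names are illustrative choices rather than the constants of the construction.

```python
import random
from collections import Counter


def make_clock(k):
    """Initial configuration: Clock^l holds the value l in all k locations."""
    return [[l] * k for l in range(3)]


def d_sample(array, d, rng):
    """Protocol 1: read d random locations and return the observed densities."""
    values = [array[rng.randrange(len(array))] for _ in range(d)]
    return {v: count / d for v, count in Counter(values).items()}


def clock_update(clocks, d, rng, threshold=0.8):
    """Protocol 2: one pass of a single processor over the three arrays."""
    for l in range(3):
        density = d_sample(clocks[l], d, rng)
        value, frac = max(density.items(), key=lambda item: item[1])
        if frac < threshold:
            continue                      # no definite value: skip the write
        nxt = clocks[(l + 1) % 3]         # superscripts are taken mod 3
        nxt[rng.randrange(len(nxt))] = value + 1


rng = random.Random(0)
clocks = make_clock(k=30)
clock_update(clocks, d=20, rng=rng)       # the first pass writes one 3 into Clock^0
for _ in range(500):
    clock_update(clocks, d=20, rng=rng)
```

In this idealized setting every value written into $\mathrm{Clock}^{l+1}$ is one more than a value read from $\mathrm{Clock}^l$, so each array only ever holds values congruent to its index mod 3.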
3.2. Analysis 
We prove that, with overwhelming probability, once the arrays are initialized as 
above, the values appearing in the vast majority of the cells of the three clock arrays 
advance monotonically in a circular fashion. Moreover, we give upper and lower 
bounds on the amount of work it takes to advance the clock from one value to the 
next. All the claims in this section are stated for the adaptive adversary scheduler. 
Throughout the analysis we use the well-known Hoeffding bound for the sum of 
independent Bernoulli trials. We use the following version of the bound [5].
Fact 3.1 (Hoeffding [5]). Let $X_1, \ldots, X_k$ be independent Bernoulli trials such that, for
all $i$, $\Pr[X_i = 1] = p_i$. For any $t > 0$
$$\Pr\left[\left|\sum_{i=1}^{k} X_i - \sum_{i=1}^{k} p_i\right| > t\right] \le 2e^{-2t^2/k}.$$
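The bound of Fact 3.1 can be compared with an exact tail probability in the special case where all $p_i$ are equal, so that the sum is binomial. The sketch below is a numerical illustration only; the function names and the parameters $k = 20$, $p = 1/2$, $t = 5$ are arbitrary choices.

```python
import math


def two_sided_tail(k, p, t):
    """Exact Pr[|X - k*p| > t] for X ~ Binomial(k, p)."""
    return sum(
        math.comb(k, s) * p**s * (1 - p) ** (k - s)
        for s in range(k + 1)
        if abs(s - k * p) > t
    )


def hoeffding_bound(k, t):
    """Right-hand side of the bound: 2 * exp(-2 t^2 / k)."""
    return 2 * math.exp(-2 * t * t / k)


tail = two_sided_tail(20, 0.5, 5)
bound = hoeffding_bound(20, 5)
```

For these parameters the exact tail is about 0.0118, comfortably below the bound of about 0.164.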
In our application, the variables will not always be independent. However, we will 
still have bounds on their conditional probabilities. Hence, the following corollary. 
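Corollary 3.2. Let $Y_1, \ldots, Y_m$ be Bernoulli trials and let $0 \le p_i \le 1$, $1 \le i \le m$, be such that,
for every $Y_i$ and every set of values of the preceding trials $(e_1, \ldots, e_{i-1}) \in \{0, 1\}^{i-1}$,
$$\Pr[Y_i = 1 \mid Y_1 = e_1, \ldots, Y_{i-1} = e_{i-1}] \le p_i.$$
Then, for $k \ge m$, $D \ge \sum_{i=1}^{m} p_i$ and $t > 0$,
$$\Pr\left[\sum_{i=1}^{m} Y_i - D > t\right] \le 2e^{-2t^2/k}.$$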
Proof. The conditions on the $Y_i$'s imply that, for any $i$, $1 \le i \le m$, and any relation
$R(Y_1, \ldots, Y_{i-1})$,
$$\Pr[Y_i = 1 \mid R(Y_1, \ldots, Y_{i-1})] \le p_i. \qquad (1)$$
Let $\Omega_0$ be the probability space over which the $Y_i$'s are defined, and let $\nu_0$ be the
probability measure on $\Omega_0$, yielding the specified probabilities for the $Y_i$'s. Consider
a set of $m$ independent Bernoulli trials, $X_1, \ldots, X_m$, such that $\Pr[X_i = 1] = p_i$. These
trials determine a probability space $\Omega_1 = \{0, 1\}^m$, with the probability measure $\nu_1$, such
that
$$\Pr_{\nu_1}[\bar{e}] = \Pr[X_1 = e_1, \ldots, X_m = e_m].$$
The $X_i$'s can naturally be viewed as random variables over $\Omega_1$.
Let $\Omega = \Omega_0 \times \Omega_1$ be the product probability space, with probability measure $\nu$, such
that, for any $\omega \in \Omega_0$ and $\bar{e} \in \Omega_1$,
$$\Pr_{\nu}[(\omega, \bar{e})] = \Pr_{\nu_0}[\omega] \cdot \Pr_{\nu_1}[\bar{e}].$$
The $X_i$'s and $Y_i$'s are naturally extended to random variables over the new space $\Omega$.
We shall now be considering the probabilities in this new probability space.
In the probability space $(\Omega, \nu)$, the set of variables $Y_i$ are independent of the $X_i$'s,
and the latter are independent amongst themselves. Thus, by equation (1), for any
$i$, $1 \le i \le m$, and any relation $R(Y_1, \ldots, Y_{i-1}, X_1, \ldots, X_m)$,
$$\Pr_{\nu}[Y_i = 1 \mid R(Y_1, \ldots, Y_{i-1}, X_1, \ldots, X_m)] \le p_i.$$
For 1=0, . . . ,k, set 
Si= i Yi+ t Xi. 
i=l 1+1 
We prove that, for all t and I, Pr, [S, + 1 > t] d Pr, [S, > tl. 
Denote PI+’ =PrVICfZ1 Yi+cy= I+ 2 Xi > t]. Define the relation l?: 
R(Y,,..., YL,Xlf2, ..‘,X,) 
=(( YI, . ..1 YL,XLi.Z, . . . ,X,):(t-l<Cf=, Yi+Cy=“=,+2Xi<t)}, 
Then, 
Pr,[S,+,>t]=Pr,[C’,+: Yi+xy=r+2Xi>r] 
=Pry[xiZ1 Yi+~~Er++xi>t] 
+Pr,[(t-l<x:=, Yi+C~=“=l+2Xi~t)A(y~+1=1)1 
=P’+‘+(Pr,[R( Y,, . ..) Yl,xl+z, . . . ,X,)1 
.Pr,[Y,+,=lIR(Y,,...,Y,,X,+,,...,X,)l) 
bP’+‘+(Pr,,CR(Y,,..., Y~,XI+~,...,X~)I.PI+~) 
=Pl+l +(Pr,,[R( Y1, . . . , Yl,Xl+z, . . . ,X,)1 
.Prv[XI+1=11R(Y1,...,Y~,X~+2,...,Xm)l) 
=Pr,.[Zf=, Yi+Ci= ,+,Xi>t]=Pr,[S~>t]. 
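Thus,
\begin{align*}
\Pr_{\nu_0}\Big[\sum_{i=1}^{m} Y_i - D > t\Big] &\le \Pr_{\nu_0}\Big[\sum_{i=1}^{m} Y_i > t + \sum_{i=1}^{m} p_i\Big] = \Pr_{\nu}\Big[S_m > t + \sum_{i=1}^{m} p_i\Big] \\
&\le \Pr_{\nu}\Big[S_0 > t + \sum_{i=1}^{m} p_i\Big] = \Pr\Big[\sum_{i=1}^{m} X_i - \sum_{i=1}^{m} p_i > t\Big] \\
&< 2e^{-2t^2/m} \le 2e^{-2t^2/k}. \qquad \Box
\end{align*}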
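As a numerical sanity check of the last step of the proof, the following minimal sketch compares the exact upper tail of a sum of independent Bernoulli trials with the bound $2e^{-2t^2/k}$. The function names and the parameter values are illustrative assumptions, not part of the paper, and only the independent case (the $X_i$'s) is checked.

```python
from math import comb, exp


def binomial_upper_tail(m: int, p: float, t: float) -> float:
    """Exact Pr[X - m*p > t] for X ~ Binomial(m, p) (independent trials)."""
    threshold = m * p + t
    return sum(
        comb(m, j) * p**j * (1 - p) ** (m - j)
        for j in range(m + 1)
        if j > threshold
    )


def corollary_bound(t: float, k: int) -> float:
    """Right-hand side 2*exp(-2*t^2/k) of the tail bound."""
    return 2 * exp(-2 * t * t / k)


if __name__ == "__main__":
    m, p = 200, 0.25  # hypothetical values chosen for illustration
    for t in (5, 10, 20):
        print(t, binomial_upper_tail(m, p, t), corollary_bound(t, m))
```

Taking $k > m$ only weakens the bound, which matches the final inequality in the derivation.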
Before we analyze the overall dynamic behavior of the clock we address the impact 
of concurrent memory accesses, i.e. overlapping read or write actions interfering with 
each other. Recall the notation $A_i(\tau)$ for the basic action performed by $P_i$ at internal
time point $\tau$, and the definition of interfered actions (Section 2.1 and Definition 2.2). In
the FAPS, interfered read and write actions may produce nondeterministic outcomes. 
We prove that these actions have a negligible effect on the overall behavior. 
Lemma 3.3. For any $0 < \lambda < 1$ and $b_0 > 0$ there exists a $c_0 > 0$ such that, for all $b > b_0$, $c > c_0$ (where $cn$ is the size of each of the clock arrays) and $n$ sufficiently large, the following holds. Let $M$ be an $n$-processor FAPS, and $I = [t_0, t_1]$ an actual time interval containing $bn$ action intervals. Assume that the processors of $M$ are continually executing Protocol 2. Then, with overwhelming probability, within $I$, no more than $\lambda bn$ actions are interfered with.
Proof. Consider the following ordering of actions: for $A_i(\tau)$ and $A_j(\sigma)$ with $T_i(\tau) = [a_0, a_1)$ and $T_j(\sigma) = [a_2, a_3)$, denote $A_i(\tau) < A_j(\sigma)$ iff $a_0 < a_2$, or $a_0 = a_2$ and $i < j$. This is a complete ordering of the actions. We say that $A_j(\sigma)$ injures $A_i(\tau)$ if
(1) $A_j(\sigma)$ interferes with $A_i(\tau)$,
(2) $A_i(\tau) < A_j(\sigma)$ and
(3) for all $A_k(\rho)$ with $A_i(\tau) < A_k(\rho) < A_j(\sigma)$, $A_k(\rho)$ does not interfere with $A_i(\tau)$.
The idea motivating the injury relation is that while an action can interfere with several actions, it can injure at most one action. Also, every action which is interfered with by a later action is also injured by some action.
We now count the number of injuries in $I$. Let $A^i$ be the $i$th action in $I$ according to the above ordering. Let
\[ Y_i = \begin{cases} 1 & A^i \text{ injures another action in } I, \\ 0 & \text{otherwise.} \end{cases} \]
Let $t_i$ be the beginning time of action $A^i$. Set
\[ L_i = \{x : \exists j < i \text{ s.t. } A^j \text{ accesses } x \in \mathrm{Clock} \text{ and } A^j \text{ is completed after } t_i\}. \]
The set $L_i$ is the set of all clock array locations which are already being accessed when $A^i$ starts its access. At any given actual time instance there are at most $n$ read or write actions in progress; thus, $|L_i| \le n$. Each clock array contains $cn$ cells and the processors randomly choose which cell to access. Thus, $\Pr[Y_i = 1] \le 1/c$. Setting $c_0 = 4/\lambda$ we obtain that $\Pr[Y_i = 1] < \lambda/4$ for $c > c_0$. The $Y_i$'s are not independent, but satisfy the conditions of Corollary 3.2, with $p_i = \lambda/4$ and $m = bn$. Taking $t = \lambda bn/4$, we see that in a total of $bn$ actions occurring during the interval $I$, for $b > b_0$, $\sum Y_i \ge \lambda bn/2$ holds with probability less than $2e^{-\lambda^2 b_0 n/8}$. Thus, with overwhelming probability, there are no more than $\lambda bn/2$ injuries in $I$. The number of actions interfered with in $I$ is at most double the number of injuries. □
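The arithmetic behind the choice of parameters in this proof can be traced with a short sketch. It assumes the reading $c_0 = 4/\lambda$, $t = \lambda bn/4$ and $k = m = bn$ used above; the function name is hypothetical.

```python
from math import exp


def injury_parameters(lam: float, b: float, n: int):
    """Return (c0, t, tail) for the injury count, following the proof's choices.

    c0   : array-size factor guaranteeing Pr[Y_i = 1] < lam/4 for c > c0
    t    : deviation plugged into the tail bound
    tail : 2*exp(-2*t^2/(b*n)), which simplifies to 2*exp(-lam^2 * b * n / 8)
    """
    c0 = 4 / lam
    t = lam * b * n / 4
    tail = 2 * exp(-2 * t * t / (b * n))
    return c0, t, tail


if __name__ == "__main__":
    print(injury_parameters(0.5, 2, 1000))
```

The tail probability decreases exponentially in $n$, which is what "overwhelming probability" refers to here.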
For a given actual time instance $t$, denote by $\mathrm{Density}(\mathrm{Clock}^l, \pi, t)$ the fraction of locations of $\mathrm{Clock}^l$ which at time $t$ store the value $\pi$.
Consider a $d$-sample $S$ of $\mathrm{Clock}^l$ by processor $P_i$. Let $t'$ be the starting time point of the first action in the sampling process and $t''$ the ending time point of the last action, and set $J = [t', t'')$. We say that the sample $S$ is $\varepsilon$-distorted if there exists a $\pi$ such that
• $\mathrm{Density}_i(S, \pi) \ge 0.8$, while $\max_{t \in J}\{\mathrm{Density}(\mathrm{Clock}^l, \pi, t)\} < 0.8 - \varepsilon$, or
• $\mathrm{Density}_i(S, \pi) < 0.8$, while $\min_{t \in J}\{\mathrm{Density}(\mathrm{Clock}^l, \pi, t)\} > 0.8 + \varepsilon$.
An $\varepsilon$-distorted sample deviates for some $\pi$ by at least $\varepsilon$ from the actual density of $\pi$ in $\mathrm{Clock}^l$ for all $t \in J$, and may produce an erroneous outcome of the protocol.
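The definition of a distorted sample can be phrased as a small predicate. This is a minimal sketch under the assumption that the true densities over $J$ are given as a finite list of observations per value; the data layout and the function name are illustrative only.

```python
def is_distorted(sample_density, true_densities, eps, threshold=0.8):
    """Check the two distortion conditions for every value.

    sample_density : dict mapping a value to the fraction of reads returning it
    true_densities : dict mapping a value to a list of its actual densities
                     in the clock array at the time points of J
    """
    for value in set(sample_density) | set(true_densities):
        seen = sample_density.get(value, 0.0)
        actual = true_densities.get(value, [0.0])
        # Sample says "dense" although the array was never close to dense.
        if seen >= threshold and max(actual) < threshold - eps:
            return True
        # Sample says "not dense" although the array was always clearly dense.
        if seen < threshold and min(actual) > threshold + eps:
            return True
    return False
```

A sample whose reading falls on the same side of 0.8 as the actual density at some time in $J$, up to the margin $\varepsilon$, is not distorted.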
We use the shorthand term clock update to denote an execution, by a processor $P_i$, of the clock update protocol (Fig. 2). Let $I = [t_0, t_1)$ be an actual time interval. Consider a clock update, $E$, which starts at time $t'$ and ends at time $t''$. We say that $E$ transcends $I$ if $[t', t'') \cap I \ne \emptyset$ and $[t', t'') - I \ne \emptyset$ (i.e. the execution partially, but not fully, overlaps $I$). For a time interval $I$, and a clock update $E$, we say that $E$ is bad (with regard to $\varepsilon$ and $I$) if (i) $E$ transcends $I$, or (ii) any of the read or write actions of $E$ are interfered with, or (iii) any one of the samples during $E$ is $\varepsilon$-distorted. A clock update is proper if it is not bad. A bad write is a write action in a bad execution, and a proper write is the write action of a proper execution.
Consider a FAPS $M$ such that all processors are just running the clock update protocol.
Lemma 3.4. [...] in any interval $I$ which contains at most $2cn$ clock updates, there are at most $\alpha cn$ updates which are bad with regard to $\varepsilon$ and $I$.
Proof. First we consider the clock updates for which none of the actions are interfered with. For these clock updates, we bound the number of updates containing an $\varepsilon$-distorted sample. Let $E$ be such an update and let $S$ be a sample in $E$. W.l.o.g. assume $S$ is a sample of $\mathrm{Clock}^0$, and let $J = [t', t'')$ be the time interval of the execution of $S$. Let $\Pi = \{\pi : \max_{t \in J}\{\mathrm{Density}(\mathrm{Clock}^0, \pi, t)\} < 0.8 - \varepsilon\}$. There are a total of $d$ read actions in $S$. Let
\[ X_i = \begin{cases} 1 & \text{the } i\text{th read in } S \text{ reads a value } \pi \in \Pi, \\ 0 & \text{otherwise.} \end{cases} \]
Then
\[ \Pr[X_i = 1] < 0.8 - \varepsilon, \]
and this inequality holds for each $X_i$ separately, regardless of the other $X_i$'s. Thus, by Corollary 3.2,
\[ \Pr[\exists \pi \in \Pi \text{ s.t. } \mathrm{Density}_i(S, \pi) \ge 0.8] \le \Pr\Big[\sum_{\pi \in \Pi} \mathrm{Density}_i(S, \pi) \ge 0.8\Big] = \Pr\Big[\sum X_i \ge 0.8d\Big] \le 2e^{-2\varepsilon^2 d} = p_{\varepsilon,d}. \tag{2} \]
Similarly, suppose that for $\tau$, $\min_{t \in J}\{\mathrm{Density}(\mathrm{Clock}^0, \tau, t)\} > 0.8 + \varepsilon$. There is at most one such $\tau$. In analogy to equation (2),
\[ \Pr[\mathrm{Density}_i(S, \tau) < 0.8] \le p_{\varepsilon,d}. \]
Thus, $\Pr[S \text{ is } \varepsilon\text{-distorted}] \le 2p_{\varepsilon,d}$. Each clock update contains three samples. Thus,
\[ \Pr[E \text{ contains an } \varepsilon\text{-distorted sample}] \le 6p_{\varepsilon,d}. \]
This inequality holds for each noninterfered clock update independently of the behavior of any other update. Set $d_{\alpha,\varepsilon}$ to be the minimal integral $d$ for which
\[ 6p_{\varepsilon,d} \le \alpha/6. \]
By assumption the number of clock updates in $I$ is $k \le 2cn$. Let
\[ Y_i = \begin{cases} 1 & \text{the } i\text{th noninterfered clock update contains an } \varepsilon\text{-distorted sample,} \\ 0 & \text{otherwise.} \end{cases} \]
We have just shown that $\Pr[Y_i = 1] \le \alpha/6$, and this inequality holds independent of
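other $Y_i$'s. Thus, by Corollary 3.2 (with $p_i = \alpha/6$, $D = \alpha cn/3$, $t = \alpha cn/6$ and at most $2cn$ trials),
\[ \Pr\Big[\sum_i Y_i > \alpha cn/2\Big] < 2e^{-\alpha^2 cn/36}. \]
Thus, for any $\alpha$, with $c$ sufficiently large we can make the above probability overwhelmingly small.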
Next, consider the interfered clock updates in $I$. Each clock update takes at most $s = O(d_{\alpha,\varepsilon})$ actions. Thus, the total number of actions in $I$ is $\le 2cns$. By Lemma 3.3, for $c$ large enough, with overwhelming probability, at most $\lambda 2cns$ actions are interfered with in $I$. Thus, at most $\lambda 2cns$ of the clock updates in $I$ contain a read or a write action which is interfered with. With $c$ sufficiently large, $\lambda < \alpha/6s$, and thus the total number of interfered clock updates in $I$ is $< \alpha cn/3$.
Finally, there are at most $2n$ clock updates transcending $I$. Thus, putting it all together, we obtain that, with overwhelming probability, the number of bad clock updates in $I$ is bounded by
\[ 2n + \tfrac{1}{2}\alpha cn + \tfrac{1}{3}\alpha cn < \alpha cn \]
for $c > 12/\alpha$. □
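To get a feeling for the size of the sample required by the proof, the following sketch computes the smallest $d$ with $6p_{\varepsilon,d} \le \alpha/6$, where $p_{\varepsilon,d} = 2e^{-2\varepsilon^2 d}$. The threshold $\alpha/6$ is taken from the reading of the proof above; the function names are hypothetical.

```python
from math import ceil, exp, log


def distortion_prob(eps: float, d: int) -> float:
    """p_{eps,d} = 2*exp(-2*eps^2*d), the per-condition distortion bound."""
    return 2 * exp(-2 * eps**2 * d)


def min_sample_size(alpha: float, eps: float) -> int:
    """Smallest integer d with 6 * p_{eps,d} <= alpha / 6."""
    # Closed form: 12*exp(-2*eps^2*d) <= alpha/6  <=>  d >= ln(72/alpha)/(2*eps^2).
    d = max(1, ceil(log(72 / alpha) / (2 * eps**2)))
    # Adjust for floating-point rounding in either direction.
    while 6 * distortion_prob(eps, d) > alpha / 6:
        d += 1
    while d > 1 and 6 * distortion_prob(eps, d - 1) <= alpha / 6:
        d -= 1
    return d


if __name__ == "__main__":
    print(min_sample_size(0.01, 0.01))
```

The result depends only on $\alpha$ and $\varepsilon$, not on $n$, which is why a clock update costs a constant number of actions for fixed $\alpha$ and $\varepsilon$.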
We are now ready to prove the main inductive lemma. We will be considering the dynamics of the clock, as the density in an array shifts from one value to another. For example, consider the case where the predominant value in $\mathrm{Clock}^1$ is $\pi + 1$, and this value is driving the density of $\pi + 2$ in $\mathrm{Clock}^2$ from 0 to above 0.6. If all clock updates were proper, then, with overwhelming probability, after at most $cn$ clock updates the process will be completed. Shifting the density by less than 0.6 requires even fewer proper updates. Choosing $\alpha < 0.01$ in Lemma 3.4, we bound the number of bad updates, so that their total effect can change the resulting densities by at most 0.01. We can thus follow the dynamics of the clock, and give tight bounds on the amount of work required in order to shift from one state (in the sense of Lemma 3.5) to another.
Lemma 3.5. There exist $c_0 > 0$, $d$, $b_1$, $b_2$ such that, for $c > c_0$ and all $n$, the following holds. Assume that at time instance $t_0$ the state of the clock is the following:
- $\mathrm{Density}(\mathrm{Clock}^{l}, \pi, t_0) > 0.82$,
- $\mathrm{Density}(\mathrm{Clock}^{l+1}, \pi + 1, t_0) > 0.82$,
- $\mathrm{Density}(\mathrm{Clock}^{l+2}, \pi + 2, t_0) = 0.6 + \delta_0$,
with $0 \le \delta_0 \le 0.1$. Then, with overwhelming probability, there exists a $t_1 > t_0$ such that
- $\mathrm{Density}(\mathrm{Clock}^{l}, \pi + 3, t_1) = 0.6 + \delta_1$,
- $\mathrm{Density}(\mathrm{Clock}^{l+1}, \pi + 1, t_1) > 0.82$,
- $\mathrm{Density}(\mathrm{Clock}^{l+2}, \pi + 2, t_1) > 0.82$,
for some $0 \le \delta_1 \le 0.1$, and
(1) the time interval $[t_0, t_1]$ contains $w$ work units, with $b_1 cn \le w \le b_2 cn$,
(2) for all $t \in [t_0, t_1]$, $\mathrm{Density}(\mathrm{Clock}^{l+2}, \pi + 2, t) > 0.5$ and $\mathrm{Density}(\mathrm{Clock}^{l+1}, \pi + 1, t) > 0.8$, and
(3) for any $\sigma \notin \{\pi, \pi + 3\}$ and $t \in [t_0, t_1]$, $\mathrm{Density}(\mathrm{Clock}^{l}, \sigma, t) < 0.5$.
Proof. W.l.o.g. assume $l = 0$. The progress of the clock will take place in two stages. First the clock will advance to the following state:
- $\mathrm{Density}(\mathrm{Clock}^0, \pi, t) > 0.81$,
- $\mathrm{Density}(\mathrm{Clock}^1, \pi + 1, t) > 0.83$,
- $\mathrm{Density}(\mathrm{Clock}^2, \pi + 2, t) = 0.8 \pm \delta_3$,
with $0 \le \delta_3 \le 0.1$. Following this intermediate state, the clock will advance to the final state described in the lemma. Denote the initial state described in the lemma as state 1, the intermediate state described above as state 2 and the final state claimed in the lemma as state 3. First we prove the passage from state 1 to state 2. We loosely use the term "the value of $\mathrm{Clock}^l$" to mean "the predominant value in $\mathrm{Clock}^l$".
Set $\varepsilon = 0.01$, and consider the proper clock updates in the interval following state 1. Initially they have the following outcomes, for the different values of $l$:
• $l = 2$: continue to next $l$.
• $l = 0$: write $\pi + 1$ in a random location of $\mathrm{Clock}^1$.
• $l = 1$: write $\pi + 2$ in a random location of $\mathrm{Clock}^2$.
Thus, disregarding the effect of bad updates, at a time $t_2$, following at most $cn$ proper updates, we will obtain $\mathrm{Density}(\mathrm{Clock}^2, \pi + 2, t_2) > 0.79$. Meanwhile, the proper updates drive $\mathrm{Density}(\mathrm{Clock}^1, \pi + 1, t_2)$ at least up to 0.84. The proper updates do not affect the density at $\mathrm{Clock}^0$. By Lemma 3.4, throughout the interval $[t_0, t_2]$ there are at most $\alpha cn$ bad clock updates. Thus, choosing $\alpha < 0.01$, the bad updates can change the true densities by at most 0.01. All the above statements hold with overwhelming probability. Thus, at time $t_2$, state 2 has been reached.
Following $t_2$, $\mathrm{Clock}^2$ can start driving $\mathrm{Clock}^0$ to $\pi + 3$. At this point, proper updates have the following outcomes:
• $l = 2$: write $\pi + 3$ in a random location of $\mathrm{Clock}^0$, or continue to next $l$.
• $l = 0$: write $\pi + 1$ in a random location of $\mathrm{Clock}^1$, or continue to next $l$.
• $l = 1$: write $\pi + 2$ in a random location of $\mathrm{Clock}^2$.
Initially, if the density of $\pi + 2$ at $\mathrm{Clock}^2$ is close to 0.8, then $\mathrm{Clock}^2$ may fail to drive $\mathrm{Clock}^0$ to $\pi + 3$. However, after at most $cn$ proper updates $\mathrm{Density}(\mathrm{Clock}^2, \pi + 2, t)$ will reach 0.84 ($0.79 + 0.15(1 - e^{-1}) > 0.84$). Following this point, all proper updates to $\mathrm{Clock}^0$ write $\pi + 3$. Thus, after at most $cn$ additional proper updates $\mathrm{Density}(\mathrm{Clock}^0, \pi + 3, t)$ reaches at least $1 - e^{-1} \approx 0.63$.¹ The bad updates can only change the densities by at most 0.01. Thus, after a total of at most $3cn$ clock updates, with overwhelming probability, state 3 will be reached. Each execution takes $s = O(1)$ work units; thus $[t_0, t_1]$ consists of at most $b_2 cn = 3cns$ work units. On the other hand, to effect the change from $\mathrm{Density}(\mathrm{Clock}^0, \pi, t_0) > 0.82$ to $\mathrm{Density}(\mathrm{Clock}^0, \pi + 3, t_1) > 0.6$ surely requires at least $b_1 cn = 0.4cns$ work units.
Finally, it is easy to verify that throughout the process the density of $\pi + 1$ in $\mathrm{Clock}^1$ is at least 0.8, and the density of $\pi + 2$ in $\mathrm{Clock}^2$ is at least 0.5, and that all other values have low densities, as claimed (items 2 and 3). □
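The density figures quoted in the proof follow from the behavior of random writes into an array of $cn$ cells. The sketch below, under the simplifying assumption that every update is proper and writes the new value to a uniformly random cell, compares the expected density with a seeded simulation. All names and the array size are illustrative.

```python
import random


def expected_density(rho0: float, writes: int, cells: int) -> float:
    """Expected fraction of cells holding the new value after `writes`
    uniformly random writes, starting from density rho0."""
    return 1 - (1 - rho0) * (1 - 1 / cells) ** writes


def simulate_density(rho0: float, writes: int, cells: int, seed: int = 0) -> float:
    """Monte Carlo counterpart of expected_density."""
    rng = random.Random(seed)
    marked = [i < int(rho0 * cells) for i in range(cells)]
    for _ in range(writes):
        marked[rng.randrange(cells)] = True
    return sum(marked) / cells


if __name__ == "__main__":
    cells = 10_000  # plays the role of cn
    print(expected_density(0.0, cells, cells))   # close to 1 - 1/e
    print(simulate_density(0.0, cells, cells))
    print(expected_density(0.79, cells, cells))
```

Starting from density 0, $cn$ random writes give an expected density close to $1 - e^{-1}$, and starting from 0.79 they give a density well above 0.84.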
Note that the outcome and conditional states of the lemma are identical, with the 
values shifted by one. Thus, we can apply the lemma inductively. 
Let
\[ t(\tau) = \min\{t : \mathrm{Density}(\mathrm{Clock}^0, 3\tau, t) \ge 0.6\}, \]
if such a $t$ exists.
Theorem 3.6. For all $\gamma, \varepsilon > 0$, and $h$, there exist constants $d$ and $c_0$ such that, for any $c > c_0$ and all $n$, if $M$ is an $n$-processor FAPS, operating under an adaptive adversary scheduler, and every processor of $M$ is just running Protocol 2, then, with probability $\ge 1 - 2^{-\varepsilon n}$, for all natural $\tau \le 2^{\gamma n}$, the following properties hold:
• completeness: $t(\tau)$ exists.
• monotonicity: for all $\sigma > \tau$, $t(\tau) < t(\sigma)$.
• phases: for $\mathrm{Phase}(\tau) = [t(\tau), t(\tau + 1))$,
  (1) $\mathrm{Phase}(\tau)$ contains at most $O(cn)$ work units and
  (2) for any $\sigma \ne 3\tau$ and all $t \in \mathrm{Phase}(\tau)$, $\mathrm{Density}(\mathrm{Clock}^0, \sigma, t) < 0.6$.
• strong intervals: there exists a time interval $\mathrm{Strong}(\tau) \subset \mathrm{Phase}(\tau)$ such that
  (1) for all $t \in \mathrm{Strong}(\tau)$, $\mathrm{Density}(\mathrm{Clock}^0, 3\tau, t) > 0.8$ and
  (2) $\mathrm{Strong}(\tau)$ contains at least $hn$ work units.
Proof. By construction, $t(0) = 0$. Following the initial configurations, $\mathrm{Clock}^2$ drives the value of $\mathrm{Clock}^0$ to 3. After at most $cn$ clock updates the density of 3 in $\mathrm{Clock}^0$ reaches 0.6, and the density of 1 and 2 in $\mathrm{Clock}^1$ and $\mathrm{Clock}^2$ remains higher than 0.82. Let $t_0$ be the first time this state is reached. At $t_0$ the conditions of Lemma 3.5 hold, with $l = 1$ and $\pi = 1$. We now apply the lemma inductively. By induction, assume that $t(\tau)$ exists and that the theorem holds for all $\tau' < \tau$. Furthermore, by induction assume that at time $t(\tau)$, $\mathrm{Density}(\mathrm{Clock}^1, 3\tau - 2, t(\tau)) > 0.82$ and $\mathrm{Density}(\mathrm{Clock}^2, 3\tau - 1, t(\tau)) > 0.82$. Thus, the conditions of Lemma 3.5 hold at $t(\tau)$ with $t_0 = t(\tau)$, $\pi = 3\tau - 2$ and $l = 1$. Apply the lemma once and let $t^1$ be the first time point at which the outcome state of the lemma is reached. At $t^1$, $\mathrm{Density}(\mathrm{Clock}^0, 3\tau, t^1) > 0.82$. Now, the conditions of the lemma hold with $t_0 = t^1$, $\pi = 3\tau - 1$ and $l = 2$. Apply the lemma once more, and let $t^2$ be the first time the outcome state is reached. By Lemma 3.5(2), for any $t \in [t^1, t^2]$, $\mathrm{Density}(\mathrm{Clock}^0, 3\tau, t) > 0.8$. Thus, $[t^1, t^2]$ is a strong interval for $\tau$, consisting of at least $b_1 cn$ work units. Choose $c > \max\{h/b_1, c_0\}$ to obtain the lower bound on the work in $\mathrm{Strong}(\tau)$.
¹ With $c > 10$ the density will not reach over 0.7 without having first reached some $\rho$, $0.6 < \rho < 0.7$.
At $t^2$ the conditions of the lemma hold with $t_0 = t^2$, $\pi = 3\tau$ and $l = 0$. Apply the lemma once more, and let $t^3$ be the first time the outcome state is reached. Then $t^3$ is the first time such that $\mathrm{Density}(\mathrm{Clock}^0, 3\tau + 3, t^3) \ge 0.6$ and thus $t^3 = t(\tau + 1)$. By Lemma 3.5(1), the interval $[t(\tau), t(\tau + 1)]$ consists of at most $3b_2 cn = O(cn)$ work units. Finally, by Lemma 3.5(2) and (3), during the entire interval $[t(\tau), t(\tau + 1)]$, for any other value $\sigma \notin \{3\tau, 3(\tau + 1)\}$ the density of $\sigma$ in $\mathrm{Clock}^0$ is less than 0.5. Thus, the theorem and the inductive assumption hold for $\tau + 1$.
For $c$ sufficiently large, each time the lemma is applied the failure probability is $< 2^{-(\varepsilon + \gamma)n}$. Thus, by adding the probabilities, the induction may be repeated $2^{\gamma n}$ times while keeping the total failure probability $\le 2^{-\varepsilon n}$. □
Thus, the clock gives us a good measure of the amount of work performed on it. 
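In the PRAM simulation, reading the clock by $P_i$ is performed by $(\delta \log n)$-sampling $\mathrm{Clock}^0$. Let $S$ be such a sample. We define
\[ \mathrm{value}_i(\mathrm{Clock}, S) = \begin{cases} \pi & \text{if } \mathrm{Density}_i(S, 3\pi) > 0.7, \\ \text{undefined} & \text{otherwise.} \end{cases} \]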
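The definition of the sampled clock value translates directly into code. In this minimal sketch a sample is assumed to be a plain list of the values read from $\mathrm{Clock}^0$, and `None` stands for "undefined"; both conventions and the function name are illustrative choices.

```python
from collections import Counter


def clock_value(sample, threshold=0.7):
    """Return pi if the value 3*pi has density > threshold in the sample,
    otherwise None (undefined)."""
    if not sample:
        return None
    # Since threshold > 0.5, at most one value can exceed it, so it is
    # enough to inspect the most common value.
    value, count = Counter(sample).most_common(1)[0]
    if count / len(sample) > threshold and value % 3 == 0:
        return value // 3
    return None
```

For example, a sample in which 8 of 10 reads return the value 6 yields the clock value 2, while a 6-to-4 split yields an undefined reading.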
The following lemma establishes the connection between the clock and the PRAM 
simulation. 
Lemma 3.7. Consider an AAPS with atomic reads and writes of shared memory cells. There exists a $\delta$ such that if reading the clock is performed by $(\delta \log n)$-sampling $\mathrm{Clock}^0$, then for any sample $S$ the following hold.
• If $S$ is performed entirely within $\mathrm{Strong}(\pi)$, then w.h.p. $\mathrm{value}_i(\mathrm{Clock}) = \pi$.
• If no action in $S$ is performed during $\mathrm{Phase}(\pi)$, then w.h.p. $\mathrm{value}_i(\mathrm{Clock}) \ne \pi$.
Proof. Corollary 3.2. □
4. PRAM simulation 
In this section we show how to use the clock to obtain an efficient PRAM simulation 
scheme. The simulation we give is for the n-processor exclusive write PRAM on an 
n-processor AAPS. The only AAPS operations which are assumed to be atomic are 
reading and writing individual shared memory cells (and computations on local data). 
4.1. PRAM computation 
The PRAM model we consider is the exclusive write PRAM. For reads, our simulation works both for the concurrent and the exclusive models. For completeness we state the characteristics of an exclusive write PRAM program.
• The program is written in parallel steps. In every parallel step each PRAM processor is to perform one instruction of the form $x \leftarrow f(y, z)$. It is postulated that each of the variables $x, y, z$ occupies a single shared memory cell.
• All instructions in a parallel step are assumed to be performed concurrently and completed together. In particular, no processor has to await the output, in the same parallel step, of any other processor.
• It is assumed that all reads in the parallel step occur before all writes. Thus, if a processor reads a variable, it will obtain the value last given to the variable before the current parallel step.
• The program is written so as to guarantee that, for every variable $x$, during any parallel step at most one processor attempts to write $x$.
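The four properties above can be summarized by a small reference model of one synchronous parallel step. This is only an illustrative sketch: the representation of memory as a dictionary and of instructions as tuples `(x, f, y, z)` is an assumption made here, not part of the paper.

```python
def pram_step(memory, instructions):
    """Execute one exclusive-write PRAM parallel step.

    memory       : dict mapping variable names to values
    instructions : list of (x, f, y, z), one per processor, meaning x <- f(y, z)
    Returns the new memory; the input dictionary is left unchanged.
    """
    targets = [x for x, _, _, _ in instructions]
    if len(set(targets)) != len(targets):
        raise ValueError("exclusive write violated: two writers for one variable")
    # All reads happen first, against the memory as it was before the step.
    results = [(x, f(memory[y], memory[z])) for x, f, y, z in instructions]
    # Only then are all writes applied.
    new_memory = dict(memory)
    for x, value in results:
        new_memory[x] = value
    return new_memory
```

Because reads precede writes, two processors computing `a <- a + b` and `b <- a + b` in the same step both see the old values of `a` and `b`.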
To avoid confusion we refer to the tasks assigned to the PRAM processors as computation threads. Thus, we have the AAPS processors $P_1, \ldots, P_n$ simulate the operation of the program threads $TH_1, \ldots, TH_n$. Associated with each thread, $TH_j$, there is a program counter variable $PC_j$, which stores the address of the next instruction to be performed.
4.1.1. Idempotence 
Consider a PRAM parallel step. Thread $TH_j$ executes the instruction $x_j \leftarrow f_j(y_j, z_j)$. Suppose that syntactically $x_j = y_j$. Then, performing the instruction twice may produce an erroneous result. To avoid this problem, following [9], we transform the program so as to make all actions, within a step, idempotent, as follows.
We hold three auxiliary arrays, in shared memory, $\mathrm{tmp} = [\mathrm{tmp}_1, \ldots, \mathrm{tmp}_n]$, $\mathrm{Loc} = [\mathrm{Loc}_1, \ldots, \mathrm{Loc}_n]$ and $\mathrm{tmpPC} = [\mathrm{tmpPC}_1, \ldots, \mathrm{tmpPC}_n]$. Now we split each step into two substeps. In the first substep, the new values are computed and stored in the temporary arrays. We call this the compute substep. Then, in the copy substep, the new values are copied from the temporary arrays back into the main memory. A description of these two substeps is given in Fig. 3. We use the notation "$x$" to denote the name (or identity) of the variable $x$, as opposed to $x$ which denotes the value of $x$. Note that in the compute substep, the thread can either be updating a program variable or updating the program counter. In the first case, following the step, the program counter is incremented by one. With this transformation, no shared memory variable is both read and written during the same substep, which implies idempotence.
From now on we assume that the program is already given in this transformed idempotent form. Thus, when we refer to the parallel steps of the program, this actually denotes the substeps produced by the transformation of the original program.
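The effect of the transformation can be seen on a single thread executing $x \leftarrow x + y$. The sketch below is a simplified, sequential model of the two substeps; the dictionary-based memory, the `aux` container and the tuple format `(target, f, y, z)` are assumptions made for illustration only.

```python
def compute_substep(mem, aux, pc, program, j):
    """Compute substep for thread j: read inputs, write only auxiliary arrays."""
    p = pc[j]
    target, f, y, z = program[p]
    if target != "PC":
        aux["tmp"][j] = f(mem[y], mem[z])
        aux["loc"][j] = target
        aux["tmp_pc"][j] = p + 1
    else:
        # Instruction of the form PC_j <- f(PC_j, y); z is unused.
        new_pc = f(p, mem[y])
        aux["tmp"][j] = new_pc
        aux["loc"][j] = "PC"
        aux["tmp_pc"][j] = new_pc


def copy_substep(mem, aux, pc, j):
    """Copy substep for thread j: read auxiliary arrays, write main memory."""
    pc[j] = aux["tmp_pc"][j]
    if aux["loc"][j] != "PC":
        mem[aux["loc"][j]] = aux["tmp"][j]
```

Running either substep twice in a row leaves the same state as running it once, because no cell is both read and written within one substep. Executing the untransformed instruction twice would instead add $y$ twice.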
4.2. Program transformation 
We show how to execute the PRAM computation on an AAPS machine. We consider an AAPS $M = (n, T, \mathcal{A})$ for which
• the schedule $T$ is determined by an oblivious adversary, and
• the set $\mathcal{A}$ of basic actions includes (i) reading or writing a single cell (variable) from and to shared memory and (ii) executing the operations $f_i$ specified in the PRAM program (on local data).

Protocol 3: substep compute
For all $j \in \{1, \ldots, n\}$, thread $TH_j$ performs:
(1) Set $p_j := PC_j$.
(2) Set $O_j := [p_j]$ (* $[p_j]$ is the value stored in the memory cell numbered $p_j$ *).
    $O_j$ is the instruction to be performed.
(3) If $O_j = (x_j \leftarrow f_j(y_j, z_j))$, where "$x_j$" $\ne$ "$PC_j$", then
    (3.1) Read $y_j$, $z_j$.
    (3.2) Set $\mathrm{tmp}_j := f_j(y_j, z_j)$.
    (3.3) Set $\mathrm{Loc}_j :=$ "$x_j$" (* the name of $x_j$ *).
    (3.4) Set $\mathrm{tmpPC}_j := p_j + 1$.
(4) Else, $O_j = (PC_j \leftarrow f_j(PC_j, y_j))$,
    (4.1) Read $y_j$.
    (4.2) Set $\mathrm{tmp}_j := f_j(p_j, y_j)$.
    (4.3) Set $\mathrm{Loc}_j :=$ "$PC_j$".
    (4.4) Set $\mathrm{tmpPC}_j := f_j(p_j, y_j)$.

Protocol 4: substep copy
For all $j \in \{1, \ldots, n\}$, thread $TH_j$ performs:
(1) Read $\mathrm{tmpPC}_j$.
(2) Set $PC_j := \mathrm{tmpPC}_j$.
(3) Read "$x_j$" from $\mathrm{Loc}_j$, and $v_j$ from $\mathrm{tmp}_j$.
(4) Set $x_j := v_j$.

Fig. 3. Splitting each step into two idempotent substeps.
Each PRAM parallel step is translated into a phase in the operation of the AAPS. The transformation guarantees that, w.h.p., the computation is:
(1) Correct: Produces the same values for the variables as the original PRAM program would produce (under a suitable interpretation).
(2) Progressive and efficient: If $O(n \log^2 n)$ work units are devoted to a phase, then the corresponding PRAM parallel step is completed.
A single parallel step requires $O(n)$ work units on a synchronous PRAM. Thus, the complexity overhead is $\Theta(\log^2 n)$.
4.2.1. Replicated variables 
In the asynchronous system, processors may "go to sleep" for long periods of time and then "wake up" at a later stage. Suppose processor $P_i$ was about to update (write) variable $x$ and that, before actually writing it, $P_i$ falls asleep. By the time $P_i$ wakes up, $x$ might have been updated several times. In this case, $P_i$ would be overwriting the current updated value of $x$ with an obsolete value. In order to avoid losing the correct value, following [7], we keep $\mu$ copies of each shared memory variable, $\mu = \Theta(\log n)$. We see to it that w.h.p. at all times a majority of the copies of each variable hold the correct value. All variables have $\mu$ such copies: program variables, program counters and all entries of the three auxiliary arrays.

Protocol 5: overall simulation
Each processor $P_i$ forever do:
(1) Perform clock update protocol (Protocol 2).
(2) For $l := 1$ to $g \log n$ do (* $g = O(1)$ *)
    (2.1) $\pi := \mathrm{GetValue}_i(\mathrm{Clock})$. If $\pi$ = undefined then continue to next $l$.
    (2.2) If $\pi$ is odd then perform Sub_Step_Compute($\pi$).
          Else perform Sub_Step_Copy($\pi$).

Fig. 4. The overall simulation protocol.
Definition 4.1. Let $x$ be a variable; the replicated representation of $x$ is a sequence (also denoted by $x$), $x = (x^{(1)}, \ldots, x^{(\mu)})$, where $\mu = \beta \log n$, $\beta = O(1)$.
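A replicated variable can be modeled as a list of $\mu$ copies that is read by majority and written one random copy at a time. The sketch below illustrates why a single obsolete write by a late processor does not change the value that readers obtain; the helper names and the use of Python's `random.Random` are assumptions made for illustration.

```python
import random
from collections import Counter


def get_value(copies):
    """Read all copies and return the most common value."""
    return Counter(copies).most_common(1)[0][0]


def write_one_copy(copies, value, rng):
    """Overwrite a single, uniformly chosen copy with `value`."""
    copies[rng.randrange(len(copies))] = value


if __name__ == "__main__":
    rng = random.Random(1)
    x = [5] * 10                 # all copies hold the correct value 5
    write_one_copy(x, 99, rng)   # a late processor writes an obsolete value
    print(get_value(x))          # readers still obtain the majority value
```

As long as a clear majority of the copies is correct, the reading is unaffected by a bounded fraction of stale or clobbered copies.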
4.2.2. Overall flow of the simulation
Processors divide their efforts between working on the program and advancing the clock, spending $O(\log^2 n)$ work units on the program for every single execution of clock update. Since we have definite bounds on the amount of work it takes to advance the clock, this will also give us an accurate measure of the amount of work devoted to the program. We see to it that a sufficient amount of work is devoted to the program so as to guarantee that w.h.p. by the time the clock advances from one value to the next, the current program substep has been completed.
When a processor chooses to work on the program this could either be in a computing substep or in a copying substep. There are separate protocols for each of these. Reading the clock is performed by $(\delta \log n)$-sampling $\mathrm{Clock}^0$, and dividing by 3 (see Section 3). We postulate a procedure $\mathrm{GetValue}_i(\mathrm{Clock})$ which returns the (sampled) value of the clock. The subscript $i$ reflects the fact that the value is subjective to the reading processor $P_i$. The overall protocol for the simulation is given in Fig. 4 ($g = O(1)$ to be determined later).
4.2.3. The substep protocols 
Consider the compute substep. There are $n$ computation threads to be simulated. Each thread performs some computations, and updates three variables in the auxiliary arrays. Each variable has $\mu = O(\log n)$ copies. Accordingly, in the simulation, each time a processor is in a computing substep it chooses one of the threads at random, performs the corresponding computations, and then updates one, random, copy of each of the three output variables in the auxiliary arrays. This is done as follows. Suppose $P_i$ is in $\mathrm{Phase}(\pi)$ and chooses to simulate thread $TH_j$. First the processor $P_i$ obtains the values of the relevant variables ($PC_j$, $y_j$, $z_j$, "$x_j$"). Obtaining the value of a variable is achieved by reading all copies of the variable and taking the most common value (a description of this procedure, denoted $\mathrm{GetValue}_i(x)$, is given in Fig. 5). After obtaining these values, $P_i$ reads the clock again. If this second clock reading does not give the value $\pi$ (which is the original value obtained) then the execution is late and is aborted. Otherwise, $P_i$ computes the output values and writes one copy of each of the three corresponding variables in the auxiliary arrays. The protocol for simulating the copy substep is analogous. A detailed description of the protocols is given in Fig. 6.

Protocol 6: $\mathrm{GetValue}_i(x)$
(1) For $l := 1$ to $\mu$ do
    $x_l := \mathrm{Read}_i(x^{(l)})$.
(2) Let $v$ be the most common value of the $x_l$'s. Return $v$.

Fig. 5. Obtaining the value of a variable.

Protocol 7: Sub_Step_Compute($\pi$)
(1) Choose $j \in \{1, \ldots, n\}$ and $k \in \{1, \ldots, \mu\}$ at random.
(2) $p_j := \mathrm{GetValue}_i(PC_j)$.
(3) $O_j := \mathrm{GetValue}_i([p_j])$.
(4) If $O_j = (x_j \leftarrow f_j(y_j, z_j))$, and "$x_j$" $\ne$ "$PC_j$" then
    (4.1) $y_j := \mathrm{GetValue}_i(y_j)$.
    (4.2) $z_j := \mathrm{GetValue}_i(z_j)$.
    (4.3) $\tau := \mathrm{GetValue}_i(\mathrm{Clock})$. If $\tau \ne \pi$ then exit.
    (4.4) Compute $f_j(y_j, z_j)$.
    (4.5) $\mathrm{Write}_i(\mathrm{tmp}_j^{(k)}, f_j(y_j, z_j))$.
    (4.6) $\mathrm{Write}_i(\mathrm{Loc}_j^{(k)},$ "$x_j$"$)$.
    (4.7) $\mathrm{Write}_i(\mathrm{tmpPC}_j^{(k)}, p_j + 1)$.
(5) Else, $O_j = (PC_j \leftarrow f_j(PC_j, y_j))$, then
    (5.1) $y_j := \mathrm{GetValue}_i(y_j)$.
    (5.2) $\tau := \mathrm{GetValue}_i(\mathrm{Clock})$. If $\tau \ne \pi$ then exit.
    (5.3) Compute $f_j(p_j, y_j)$.
    (5.4) $\mathrm{Write}_i(\mathrm{tmp}_j^{(k)}, f_j(p_j, y_j))$.
    (5.5) $\mathrm{Write}_i(\mathrm{Loc}_j^{(k)},$ "$PC_j$"$)$.
    (5.6) $\mathrm{Write}_i(\mathrm{tmpPC}_j^{(k)}, f_j(p_j, y_j))$.

Protocol 8: Sub_Step_Copy($\pi$)
(1) Choose $j \in \{1, \ldots, n\}$ and $k \in \{1, \ldots, \mu\}$ at random.
(2) $p_j := \mathrm{GetValue}_i(\mathrm{tmpPC}_j)$.
(3) "$x_j$" $:= \mathrm{GetValue}_i(\mathrm{Loc}_j)$.
(4) $v := \mathrm{GetValue}_i(\mathrm{tmp}_j)$.
(5) $\tau := \mathrm{GetValue}_i(\mathrm{Clock})$. If $\tau \ne \pi$ then exit.
(6) $\mathrm{Write}_i(PC_j^{(k)}, p_j)$.
(7) $\mathrm{Write}_i(x_j^{(k)}, v)$.

Fig. 6. Simulating the substeps.
4.2.4. Uniform protocol execution 
For the correctness, it is important that a protocol execution requires the exact 
same number of work units, regardless of success or failure, and independent of any 
random choices it makes. To achieve this, we consider each of the protocols above and 
determine the maximum number of work units the execution of the protocol could 
require. We then require that any processor Pi executing the protocol always expend 
this maximum number of actions on the execution, expending empty actions if 
necessary. Thus, each of the substep protocols requires some fixed number of work units, $s = O(\log n)$, and each execution of the overall protocol requires a fixed $O(1) + g \log n \cdot s = O(\log^2 n)$ work units.
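The padding rule can be illustrated by a small helper that runs a protocol's actions and then spends empty actions until a fixed budget is reached. This is a minimal sketch; modeling an action as a zero-argument callable that returns `False` to abort is an assumption made here.

```python
def run_padded(actions, budget):
    """Run `actions` in order and pad with empty actions up to `budget`.

    Each call of an action counts as one work unit. An action returning
    False aborts the protocol (for instance, a late clock reading).
    Returns (total_units, real_units); total_units always equals budget.
    """
    used = 0
    for act in actions:
        if used >= budget:
            raise ValueError("budget too small for this protocol")
        used += 1
        if act() is False:
            break
    real = used
    while used < budget:
        used += 1  # empty action: consumes a work unit, does nothing
    return used, real
```

An aborted execution and a successful one therefore occupy the same number of work units, so the schedule alone fixes where executions begin and end.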
5. Analysis 
Recall the definitions of $\mathrm{Phase}(\pi)$ and $\mathrm{Strong}(\pi)$ of Theorem 3.6.
Lemma 5.1. For any $b$ there exist $g$ (of Protocol 5) and $k$ (of Theorem 3.6) such that, for all $n$, with overwhelming probability,
• the number of complete substep protocol executions during $\mathrm{Strong}(\pi)$ is $\ge bn \log n$ and
• the total number of work units in $\mathrm{Phase}(\pi)$ is $O(n \log^2 n)$.
Proof. Consider a total schedule $T$. For any $g$ this schedule induces a schedule $T'$ of actions performed on the clock. By Theorem 3.6, with overwhelming probability, $\mathrm{Strong}(\pi)$ consists of at least $kn$ actions of $T'$, and $\mathrm{Phase}(\pi)$ of at most $O(cn)$ actions of $T'$ (and $k$ can be made as large as desired by increasing $c$). Each clock update protocol takes $q = O(1)$ work units of $T'$. Following each clock update protocol, there are $g \log n$ substep protocol executions. There are at most $2n$ executions transcending $\mathrm{Strong}(\pi)$. Thus, during $\mathrm{Strong}(\pi)$ there are at least $(k/q - 2) gn \log n$ complete executions. With $k > 2q$ and $g \ge b(k/q - 2)^{-1}$, the lower bound follows.
Each substep protocol execution requires $s = O(\log n)$ work units. Thus, the total number of work units in $\mathrm{Phase}(\pi)$ is $\le O(cn) \cdot g \log n \cdot s = O(n \log^2 n)$. □
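The choice of $g$ in the proof is simple arithmetic, sketched below. The function names are hypothetical, and `log_n` is passed in explicitly so that no particular base of the logarithm is assumed.

```python
def min_g(b: float, k: float, q: float) -> float:
    """Smallest g with (k/q - 2) * g >= b; requires k > 2q."""
    if k <= 2 * q:
        raise ValueError("need k > 2q")
    return b / (k / q - 2)


def complete_executions(k: float, q: float, g: float, n: int, log_n: float) -> float:
    """Lower bound (k/q - 2) * g * n * log n on complete executions in Strong."""
    return (k / q - 2) * g * n * log_n


if __name__ == "__main__":
    b, k, q, n, log_n = 3, 10, 2, 100, 7
    g = min_g(b, k, q)
    print(g, complete_executions(k, q, g, n, log_n), b * n * log_n)
```

With this $g$ the number of complete executions during the strong interval is at least $bn \log n$, as required by the first item of the lemma.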
We use the term execution to denote an execution by a processor $P_i$ of either one of the substep protocols in Fig. 6. We say that an execution has successfully terminated if it does not exit after the second reading of the clock (steps 4.3 and 5.2 of Protocol 7 and step 5 of Protocol 8). By construction, only successfully terminated executions perform write actions. For a successfully terminated execution $E$, we say that $E$ is a $\pi$th phase execution if (both) the clock readings in the execution returned the value $\pi$.
We now proceed to prove that the simulation is correct, i.e. we prove that, under a suitable interpretation to be formally defined below, the simulation effects the same computation as the original PRAM program would effect, and produces the same values for the variables. Thus, for a program $\mathcal{P}$, and every $\pi$, we will be relating the state of the simulation in phase $\pi$ to that of the program in parallel step $\pi$. For a program variable $x$ and parallel step $\pi$, denote by $\text{P-val}(x, \pi, \mathcal{P})$ the value of $x$ following the $\pi$th parallel step of $\mathcal{P}$.² The following definition gives a concrete meaning to the notion of the correctness of the simulation.
Definition 5.2. For a program $\mathcal{P}$, variable $x$, and $\pi$, the correct value of $x$ in $\mathrm{Phase}(\pi)$ is $\text{P-val}(x, \pi, \mathcal{P})$. For a copy $x^{(k)}$ of $x$, we say that $x^{(k)}$ is correct in $\mathrm{Phase}(\pi)$ if throughout the phase the copy stores the correct value of $x$. We say that $x$ is correct in $\mathrm{Phase}(\pi)$ if at least 0.6 of the copies of $x$ are correct in the phase. We say that the (entire) memory is correct in $\mathrm{Phase}(\pi)$ if all the variables which are not being updated in parallel step $\pi$ are correct in $\mathrm{Phase}(\pi)$.
From now on we shall consider a given program $\mathcal{P}$, and omit explicit reference to it. Consider a $\pi$th phase execution $E$, simulating thread $TH_j$. The input variables of $E$ are the set of all variables to be read by $TH_j$ in parallel step $\pi$. The output variables of $E$ are the variables $TH_j$ writes in parallel step $\pi$. We shall prove that at all times the entire memory is correct. This has the following implication.
Lemma 5.3. Let $E$ be a $\pi$th phase execution. Assume that all input variables to $E$ are correct in $\mathrm{Phase}(\pi)$; then, w.h.p.,
(1) the only variables $E$ writes are its output variables, and
(2) for each output variable $x$ of $E$, execution $E$ writes one (random) copy of $x$ with the correct value (with regard to $\mathrm{Phase}(\pi)$).
Proof. The clock is read before and after reading the variables. Both readings gave the value $\pi$. By Lemma 3.7, w.h.p., both these clock readings must have overlapped $\mathrm{Phase}(\pi)$. Hence, the reading of the variables was entirely within $\mathrm{Phase}(\pi)$. By assumption, all input variables to $E$ are correct in the phase. Thus, for any variable $y$ which $E$ reads, at least 0.6 of the copies of $y$ store the correct value throughout $\mathrm{Phase}(\pi)$. Thus, the procedure $\mathrm{GetValue}_i(y)$, which returns the most common value, will return the correct value with regard to $\pi$. Thus, $E$ obtains all the correct input values. Hence, the output values are also correct. By construction, $E$ writes one random copy of each of the output variables, and these variables only. □
² We consider the input values to be part of the specification of $\mathcal{P}$.
There are two possible reasons for a copy not to have the correct value. 
l Old copy: the copy was not updated during the most recent update phase. 
l Clobbered copy: the copy was correctly updated, but was later overwritten. 
The following lemma bounds the fraction of old copies. 
Lemma 5.4. There exists $\beta_0$ such that, for any $\beta > \beta_0$, there exists a $b$ (of Lemma 5.1) such that, for $\mu = \beta \log n$ and any $n$, the following holds. Assume that the memory is correct during $\mathrm{Phase}(\pi)$. Let $x$ be a variable to be updated during parallel step $\pi$. W.h.p., at least 0.9 of the copies of $x$ are updated during $\mathrm{Phase}(\pi)$ with the correct value.
Proof. By Lemma 5.1 there are at least $bn \log n$ complete executions during $\mathrm{Strong}(\pi)$. By Lemma 3.7 all clock readings in these executions read $\pi$. Hence, w.h.p., all these executions successfully terminate. Let $E$ be one such execution.
By idempotence, any input variable to $E$ is not being updated in parallel step $\pi$. Thus, since the memory is correct in $\mathrm{Phase}(\pi)$, in particular all input variables to $E$ are correct. By Lemma 5.3, $E$ correctly writes one (random) copy of each of the output variables. Execution $E$ chooses at random the thread to simulate, and $x$ is an output variable of one of these threads. Each variable has $\mu = \beta \log n$ copies. Thus, for copy $x^{(k)}$ of $x$,
\[ \Pr[E \text{ correctly updates } x^{(k)}] = 1/(\beta n \log n). \]
Thus, in the totality of $bn \log n$ executions,
Pr[x’k’ is not correctly updated] < 
for b z 38. This inequality holds for each copy separately, and is true regardless of the 
update process in the other copies (although it is not independent). Thus, by 
Corollary 3.2, 
Pr[more than 0.1~ of the copies of x are not updated] <2e-(“~05)2~~0~n 
,< n -@(Do) 
Thus, with PO sufficiently large, the result follows. 0 
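The two quantities in this proof can be checked numerically. The sketch below is illustrative only: it assumes base-2 logarithms, treats the executions as hitting a fixed copy independently with probability 1/(βn log n), and takes b = 3β as a sample choice. The function names and the sample constants are not from the paper.

```python
import math

def miss_probability(n, beta, b):
    """Chance that a fixed copy is hit by none of b*n*log(n) executions,
    each hitting it independently with probability 1/(beta*n*log(n)).
    Assumes log base 2; the base only rescales the constants."""
    log_n = math.log2(n)
    trials = b * n * log_n
    p_hit = 1.0 / (beta * n * log_n)
    return (1.0 - p_hit) ** trials

def hoeffding_tail(n, beta, deviation=0.05):
    """Tail bound of the form 2*exp(-deviation^2 * mu), mu = beta*log(n)."""
    mu = beta * math.log2(n)
    return 2.0 * math.exp(-(deviation ** 2) * mu)

n = 1024
beta = 4                 # illustrative constant, not taken from the paper
b = 3 * beta             # sample choice of b relative to beta
miss = miss_probability(n, beta, b)      # stays below e^{-3} < 0.05

small_beta_tail = hoeffding_tail(n, 4)   # vacuous: the bound exceeds 1
large_beta_tail = hoeffding_tail(n, 400) # below 1/n for this n
```

The contrast between the two tail values shows why the lemma needs the constant to be sufficiently large: with a deviation of only 0.05, the exponent becomes useful only once β is in the hundreds.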
The next lemma bounds the fraction of clobbered copies. 
Lemma 5.5. There exists a $\beta_1$ such that if $\mu > \beta_1 \log n$, then for any variable x and phase
$\pi$ the following holds. Assume the memory is correct at least until Phase($\pi$), exclusive;
then, w.h.p., following the beginning of Phase($\pi$), at most 0.3 of the copies of x are written
by $\tau$th phase executions, with $\tau < \pi$.
Proof. The schedule T is determined obliviously to the random choices. The schedule 
alone determines the starting and ending times of all protocol executions (independent 
of the random choices). This is true because we forced all protocol executions to 
expend the same number of actions regardless of success or failure or any of the 
random choices it makes (Section 4.2.4). Consider the sequence of all random choices 
made by all the processors during all the clock update protocols, and in all the 
procedures of reading the clock. Partition the probability space according to this 
entire sequence. We prove the lemma for each partition separately. 
Consider one such sequence, and denote it by r. The given schedule T and this fixed 
r completely determine the dynamics of the clock, independent of all other choices. 
For this given dynamics, the sequence r also determines the outcome of all clock 
readings. Thus, for these T and r, the phase of each execution is determined, and is 
independent of all and any other choices. 
For these T and r, let E be a $\tau$th phase successfully terminating execution writing
during or after Phase($\pi$), with $\tau < \pi$. The value of $\tau$ is independent of choices other than
the (fixed) sequence r. In step 1, E chooses the thread, $j \in \{1, \ldots, n\}$, at random. This
choice is independent of the choice of threads made by other executions. By assumption,
the memory is correct in Phase($\tau$). Hence, by Lemma 5.3, E only writes the
correct output variables (with regard to Phase($\tau$)). Since the original PRAM program
is exclusive write, we are guaranteed that there is at most one thread $TH_j$ which is to
update x in parallel step $\tau$. Thus,
\[ \Pr[E \text{ writes a copy of } x] \le \frac{1}{n}. \]
This is true for any such E independently. Since $\tau < \pi$, w.h.p. E started before Phase($\pi$).
There are at most n executions starting before Phase($\pi$) and writing during or after the
phase. With $\beta_1$ sufficiently large, and $\mu > \beta_1 \log n$, there are w.h.p. at most $0.3\mu$ such
executions E, which write x, as claimed. □
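The final counting step can be illustrated with an exact binomial tail. The sketch assumes at most n late executions, each writing a copy of x independently with probability at most 1/n (one thread out of n updates x), and base-2 logarithms. The helper name and the sample constant are hypothetical.

```python
import math

def binomial_upper_tail(trials, p, threshold):
    """Exact Pr[X >= threshold] for X ~ Binomial(trials, p), with 0 <= p < 1.

    The probability mass is advanced term by term, which avoids forming
    huge binomial coefficients."""
    pmf = (1.0 - p) ** trials            # Pr[X = 0]
    tail = 0.0
    for k in range(trials + 1):
        if k >= threshold:
            tail += pmf
        # Step from Pr[X = k] to Pr[X = k + 1].
        pmf *= (trials - k) / (k + 1) * p / (1.0 - p)
    return tail

n = 1024
beta1 = 4                         # illustrative constant, not from the paper
mu = beta1 * math.log2(n)         # number of copies per variable
# Chance that 0.3*mu or more of the n late executions write a copy of x.
tail = binomial_upper_tail(n, 1.0 / n, 0.3 * mu)
```

The expected number of such writes is at most 1, so exceeding a threshold that grows like log n is already very unlikely for moderate constants; a larger constant drives the tail down polynomially in n.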
We have obtained the following theorem. 
Theorem 5.6. Let M be an n-processor AAPS with atomic reads and writes. Let $\mathcal{P}$ be an
m = poly(n)-step exclusive write PRAM program. The above protocols are a transformation
of $\mathcal{P}$ into a program for M such that, with overwhelming probability, the simulation
of each parallel step is completed in $O(n \log^2 n)$ work units, and with high probability at
all times the memory is correct.
Proof. Choose $\beta > \beta_0, \beta_1$, of Lemmas 5.4 and 5.5, and g and c which yield h and b for
Lemmas 5.1 and 5.4.
We prove by induction that during all phases the entire memory is correct. Initially 
all copies of all variables are correct. Assume by induction that at all times prior to
Phase($\sigma$) the memory is correct. We prove for $\sigma$. W.l.o.g. assume $\sigma$ is a compute substep.
Consider a variable x. If x is to be updated in parallel step $\sigma$, then by definition it does
not affect the correctness of the memory in Phase($\sigma$). Otherwise, let $\pi < \sigma$ be the latest
parallel step prior to $\sigma$ in which x was to be updated. By Lemma 5.4, at least 0.9 of the
copies of x were updated with the correct value during Phase($\pi$). We now bound the
number of these copies which are subsequently overwritten with a wrong value.
Consider two cases. First suppose x is a variable of the regular memory, i.e. not one
of the auxiliary arrays. Let E be a $\tau$th phase execution writing a copy of x sometime
between Phase($\pi$) and Phase($\sigma$), inclusive. Since x is a regular memory variable, and $\sigma$ is
a compute substep, we are guaranteed that no $\sigma$th phase execution writes x, regardless
of the correctness of the memory. Thus, $\tau < \sigma$. By the inductive assumption, the memory
is correct in Phase($\tau$). Thus, by Lemma 5.3, all writes of $\tau$th phase executions are correct
writes. However, $\pi$ is the most recent phase in which x is correctly updated. Thus,
$\tau \le \pi$. If $\tau = \pi$ then the write produces the correct value. Thus, the only wrong writes to
x are with $\tau < \pi$. By Lemma 5.5, at most 0.3 of the copies were overwritten during or
after Phase($\pi$) by such executions. Hence at least 0.9 - 0.3 = 0.6 of the copies hold the
correct value, which implies that x is correct in Phase($\sigma$).
Next suppose x is an auxiliary array variable. Let E be a $\sigma$th phase execution. All
input values to E are from the regular memory. We have just shown that all these
variables are correct. Hence, by Lemma 5.3, E only writes the correct variables with
regard to $\sigma$. In particular, E does not write x. Thus, any execution writing x must be
a $\tau$th phase execution with $\tau < \sigma$. Hence, we may repeat the exact same argument as
above for the regular memory variables. We conclude, by induction, that the entire
memory is correct in Phase($\sigma$).
Lemmas 5.3-5.5 hold with probability $1 - n^{-\alpha}$. Adding the failure probabilities we
can carry the induction for a polynomial number of phases, while keeping the failure
probability polynomially small. The work bounds follow from Lemma 5.1. □
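The closing step of the proof is a union bound over polynomially many phases. The following sketch makes the arithmetic concrete under stated assumptions: every bad event fails with probability at most n^(-α), and `events_per_phase` is a hypothetical placeholder for however many such events are charged to one phase. The sample values of the exponent and of the number of phases are illustrative, not taken from the paper.

```python
def union_bound(phases, events_per_phase, n, alpha):
    """Upper bound on the total failure probability when each of
    phases * events_per_phase events fails w.p. at most n**(-alpha)."""
    return min(1.0, phases * events_per_phase * n ** (-alpha))

n = 1024
phases = n ** 2          # m = poly(n) steps; n^2 is an illustrative choice
# Three events per phase, standing in for Lemmas 5.3, 5.4 and 5.5.
total_failure = union_bound(phases, 3, n, alpha=4)

# Copy accounting used in the induction: updated copies minus clobbered ones.
correct_fraction = 0.9 - 0.3
```

With the exponent chosen larger than the degree of the polynomial bounding the number of phases, the summed failure probability remains polynomially small, as the proof requires.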
The above transformation holds both for an exclusive read exclusive write (EREW) 
PRAM program and for a concurrent read exclusive write (CREW) program. However,
the transformation of such programs would result in different concurrency
requirements from the simulating AAPS. Clearly, if the original PRAM program
assumes concurrent reads, then the AAPS must also allow this concurrency. In the 
case that the original program is exclusive read, we have the following result. 
Proposition 5.7. Consider an exclusive read exclusive write PRAM program. With the
above transformation, for any location in the AAPS memory, at all times the expected
number of concurrent accesses is $\le 1$, and w.h.p. no location is ever accessed concurrently
by more than $O(\log n)$ processors.
Proof. Since the memory is always correct, processors only access the right variables. 
At all times there are n threads to simulate and at most n accessing processors. 
Different threads access different variables, and the processors choose at random the 
thread to simulate. □
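The argument is a balls-into-bins one: n processors each pick one of n threads uniformly at random, so each thread attracts one processor in expectation and the largest group is logarithmic w.h.p. The sketch below simulates this under the simplifying assumption that all n processors choose at the same instant; the function name, the seed, and the number of rounds are arbitrary choices made for the example.

```python
import math
import random
from collections import Counter

def max_concurrency(n, rng):
    """n processors each pick one of n threads uniformly at random;
    return the size of the largest group that picked the same thread."""
    picks = Counter(rng.randrange(n) for _ in range(n))
    return max(picks.values())

rng = random.Random(0)   # fixed seed so the run is reproducible
n = 1024
rounds = 100
worst = max(max_concurrency(n, rng) for _ in range(rounds))
log_bound = 4 * math.log2(n)   # generous O(log n) yardstick for comparison
```

Comparing `worst` against `log_bound` over many rounds gives an empirical feel for the O(log n) concurrency claim; it is not a substitute for the proof.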
This is in contrast to [7], where the expected concurrency on the clock is O(n). 
Furthermore, a slight modification of our simulation scheme allows one to perform 
the simulation on a system which does not provide for any effective simultaneous 
memory access. Specifically, the simulation can be correctly carried out on a system in 
which simultaneous accesses to the same memory location produce nondeterministic 
outcomes (in a manner similar to that of the FAPS). We omit the details of the modified
simulation and the related analysis from this publication.
Finally, if we want to have an indication of the termination of the computation, 
then w.l.o.g. the PRAM program can be augmented so as to include a control variable
which is set to the value “Done” iff all program counters for the n PRAM processors 
have reached “halt”. By examining the copies of this variable, one can determine that 
the simulation is completed. The output/results of the simulated PRAM computation 
can then be acquired by reading the copies of the relevant program variables. 
References 
[1] R. Cole and O. Zajicek, The APRAM: incorporating asynchrony into the PRAM model, in: Proc. 1st
ACM Symp. on Parallel Architectures and Algorithms (1989) 169-178.
[2] P.B. Gibbons, A more practical PRAM model, in: Proc. 1st ACM Symp. on Parallel Architectures and
Algorithms (1989) 158-168.
[3] M. Herlihy, Impossibility and universality results for wait-free synchronization, in: Proc. 7th Ann.
ACM Symp. on the Principles of Distributed Computing (1988) 276-290.
[4] M. Herlihy, Impossibility results for asynchronous PRAM, in: Proc. 3rd ACM Symp. on the Parallel
Architectures and Algorithms (1991) 327-336.
[5] W. Hoeffding, Probability inequalities for sums of bounded random variables, Amer. Statist. Assoc. J.
58 (1963) 13-30.
[6] P. Kanellakis and A. Shvartsman, Efficient parallel algorithms on restartable fail-stop processors, in:
Proc. 10th Ann. ACM Symp. on the Principles of Distributed Computing (1991) 23-36.
[7] Z.M. Kedem, K.V. Palem, M.O. Rabin and A. Raghunathan, Efficient program transformation for
resilient parallel computation via randomization, in: Proc. 24th Ann. ACM Symp. on the Theory of
Computing (1992) 306-317.
[8] Z.M. Kedem, K.V. Palem, A. Raghunathan and P.G. Spirakis, Combining tentative and definite
executions for very fast dependable parallel computing, in: Proc. 23rd Ann. ACM Symp. on Theory of
Computing (1991) 381-390.
[9] Z.M. Kedem, K.V. Palem and P.G. Spirakis, Efficient robust parallel computations, in: Proc. 22nd
Ann. ACM Symp. on Theory of Computing (1990) 138-148.
[10] L. Lamport, On interprocess communication. Part i: basic formalism, Distributed Computing 1 (12)
(1986) 77-85.
[11] L. Lamport, On interprocess communication. Part ii: algorithms, Distributed Computing 1 (12) (1986)
86-101.
[12] C. Martel, A. Park and R. Subramonian, Asynchronous PRAMs are (almost) as good as synchronous
PRAMs, in: Proc. 31st Ann. Symp. on the Foundations of Computer Science (1990) 590-599.
[13] C. Martel, A. Park and R. Subramonian, Work optimal asynchronous algorithms for shared memory
parallel computers, SIAM J. Comput. 21 (1992) 1070-1099.
[14] C. Martel and R. Subramonian, On the complexity of certified write-all algorithms, unpublished
manuscript.
[15] N. Nishimura, Asynchronous shared memory parallel computation, in: Proc. 2nd ACM Symp. on
Parallel Architectures and Algorithms (1990) 76-84.
[16] L.G. Valiant, A bridging model for parallel computation, Comm. ACM 33 (8) (1990) 103-111.
