Sharing memory in distributed systems by Aguilar, Oscar Rodrigo
UNLV Retrospective Theses & Dissertations 
1-1-1990 
Sharing memory in distributed systems 
Oscar Rodrigo Aguilar 
University of Nevada, Las Vegas 
Follow this and additional works at: https://digitalscholarship.unlv.edu/rtds 
Repository Citation 
Aguilar, Oscar Rodrigo, "Sharing memory in distributed systems" (1990). UNLV Retrospective Theses & 
Dissertations. 92. 
http://dx.doi.org/10.25669/7c4n-g7up 

Order Number 1341636
Sharing memory in distributed systems
Aguilar, Oscar Rodrigo, M.S. 
University of Nevada, Las Vegas, 1990

SHARING MEMORY IN DISTRIBUTED SYSTEMS
by 
Oscar Rodrigo Aguilar
A thesis submitted in partial fulfillment 
of the requirements for the degree of
Master of Science 
in 
Computer Science
Department of Computer Science 
University of Nevada, Las Vegas 
August 1990
The thesis of Oscar Rodrigo Aguilar for the degree of Master of Science in Computer 
Science is approved.
Chairperson, Ajoy Kumar Datta, Ph.D
Examining Committee Member, Laxmi Gewali, Ph.D
Examining Committee Member, Kazem Taghva, Ph.D
Graduate Faculty Representative, Ebrahim Salehi, Ph.D
Graduate Dean, Ronald W. Smith, Ph.D
University of Nevada, Las Vegas 
August, 1990
ABSTRACT
A set of processes sharing a block of common memory is one of the most frequent configurations for 
distributed systems, and is called the shared memory model. We will study this model and focus on the 
techniques for the specification of concurrent data objects and the verification of their implementation. 
We will also deal with the development of algorithm translations between the shared memory model 
and the message passing model, another highly frequent configuration in distributed systems where the 
processes communicate by exchanging messages only. The translation of algorithms between these two 
models is based on the fact that it is possible to simulate shared memory synchronization primitives in 
message passing systems. We propose an algorithm for simulating atomic registers, test-and-set, fetch- 
and-add, and read-modify-write registers in a message passing system. The algorithm is fault tolerant 
and works correctly in the presence of up to ⌈N/2⌉ − 1 node failures, where N is the number of processors in 
the system. The high resilience of the algorithm is obtained by using randomized consensus algorithms 
and a robust communication primitive. The use of this primitive allows a processor to exchange local 
information with a majority of processors in a consistent way, and therefore to take decisions safely. The 
simulator makes it possible to translate algorithms for the shared memory model into algorithms for the message 
passing model. With some minor modifications the algorithm can be used to robustly simulate shared 
queues, shared stacks, etc.
ACKNOWLEDGEMENTS
I dedicate this thesis to my parents for their encouragement, support, 
and their always opportune advice. They are the source of my inspiration 
and strength. I would like to express my gratitude to my advisor, Dr. 
Ajoy Datta, for introducing me to the area of Distributed Systems, for his 
excellent advice and suggestions, and for many interesting discussions that 
helped to improve the quality of my work. I would also like to extend 
my appreciation to Dr. Evangelos Yfantis, who was my initial advisor 
when I first arrived at UNLV. Acknowledgement is also due to three 
other members of my thesis committee, Dr. Laxmi Gewali, Dr. Kazem 
Taghva, and Dr. Ebrahim Salehi.
Contents
1 INTRODUCTION
1.1 Properties of the Distributed Computer System
1.2 Advantages and Disadvantages of Distributed Systems
1.2.1 Advantages
1.2.2 Disadvantages
1.3 Architectural Models for Distributed Systems
1.3.1 Workstation/Server Model
1.3.2 Processor Pool Model
1.3.3 Integrated Model
1.3.4 Comparing Different Architectures
1.4 Distributed Operating Systems
2 CONCURRENT OBJECTS AND CONCURRENT SYSTEMS
2.1 Some Definitions for Concurrent Systems
2.2 Specification of objects
2.3 Register axioms
2.3.1 The Axioms
3 ATOMIC REGISTERS
3.1 LAMPORT'S REGISTER CLASSIFICATION
3.1.1 Safe Register
3.1.2 Regular Register
3.1.3 Atomic Register
3.2 SAFE AND REGULAR REGISTER CONSTRUCTIONS
3.2.1 Multivalued Safe Register
3.2.2 Multireader Boolean Regular Register
3.2.3 Single Reader Multivalued Regular Register
3.2.4 Multireader Multivalued Safe or Regular Register
3.3 ATOMIC REGISTER CONSTRUCTIONS
3.3.1 I/O Automatons
3.3.2 Atomic Operations
3.3.3 Implementing Atomic Registers
4 Sharing Memory in Asynchronous Message Passing Systems
4.1 Introduction
4.2 Model
4.3 Communication Primitive Exchange
4.4 Simulator
4.5 Correctness
4.5.1 An example
5 Distributed Simulation using ISIS
5.1 The ISIS System
5.2 The Simulation Program
5.2.1 The structure of an ISIS process
5.2.2 Implementation of the procedure EXCHANGE
5.2.3 Implementation of BITCONSENSUS and FLIP-GLOBALXJOIN
5.2.4 Results
6 SUMMARY
List of Figures
1.1 A typical Distributed Computer System
1.2 The two main layers of a Distributed Operating System
1.3 The Amoeba Architecture
1.4 A typical V configuration
2.1 The sequence of read and write operations by the processes P1 and P2
2.2 The operations by the three processes on a register. It is possible to rearrange
the operations in such a way (as shown in Figure 2.3) that the nonconcurrent 
operations preserve their order and the read operations return values that are 
consistent with the write operations.
2.3 The operations of Figure 2.2 rearranged in time to return consistent values
2.4 The operations by three processes on a register. These operations cannot be
rearranged in time to return consistent values because f?d i(l) cannot precede 
W>3(0)
3.1 A reader process and a writer process accessing a shared register. The type of the
register determines which values are legal for a read operation to return when it 
overlaps in time a write operation
3.2 (a) Concurrent read and write operations executed non-atomically. (b), (c) Two
different ways to execute the operations of part (a) atomically
3.3 (a) Two consecutive read operations to a shared register overlapping in time with 
a write operation. (b), (c), (d) Three different ways of simulating serial execution
of the operations in part (a).
Chapter 1
INTRODUCTION
A distributed computer system consists of a set of autonomous computers linked by a network. 
It is designed to enable the individual computers to share resources in the network, so 
that they provide the same computing facilities to a geographically dispersed group of persons 
or institutions. Users of a distributed system are given the impression that they are using a 
single, integrated computing facility, although the facility is actually provided by more than one 
computer and the computers may be in different locations. The most important characteristic of 
a properly designed distributed system is transparency. Transparency offers the user a unified 
interface to a networked collection of computer systems, providing access to programs and data 
objects located in any of the computers in the network using the same names and operations 
regardless of their location. The motivations for building distributed systems became apparent 
with the development of single-user workstations, servers, and high-speed local networks in the 
period 1971-1980. Time-sharing centralized systems of that time did not perform well 
when running highly interactive tasks such as graphics applications. In these cases single-user 
systems showed their superiority, due to the dedicated processor power that enabled 
application programs to maintain an interactive dialogue with the user without interruption. Also, 
the direct connection of the display screen to memory enabled programs to display and 
modify information on the screen almost instantaneously. The first workstation developed was the 
Alto, designed at Xerox PARC between 1971 and 1973. Later, in the eighties, with the development 
of fast and low-cost 16-bit and 32-bit microprocessors, other companies such as Sun Microsystems 
and Apollo developed high-speed, high-performance workstations. Examples are 
the Sun-2 and Sun-3 workstations and the Apollo Domain DN300 and DN600 workstations. 
These workstations have speeds in the range of 1 to 3 MIPS. Figure 1.1 illustrates the different 
components of a distributed system.
[Figure: workstations, file servers, printer servers, other servers, and gateways (to a WAN) connected by a local area network]
Figure 1.1: A typical Distributed Computer System
The initial motivation for building distributed systems was the necessity to share resources. 
For example, program development requires the sharing of source and object code of programs. 
Office applications require the sharing of documents and other files. The necessity of sharing 
information led to the development of file services. A file service provides processes in other 
computers in the network (the clients) with facilities to store and access data files in a manner 
similar to the filing systems of most conventional operating systems. Processes in different 
computers can share files, and most of the time specialized computers called file servers are 
used to store the files and to provide access and concurrency control. Access control is used to 
ensure that only authorized users can access the files. Concurrency control measures are used 
to ensure that updates by different clients to the same file are properly sequenced. The Xerox 
Distributed File Service is an example of such a file service. Other distributed systems developed 
between 1970 and 1980 are the Cambridge Distributed Computing System, the Apollo Domain 
System, the Newcastle Connection, and Locus. In the eighties there has been a rapid expansion 
of research and development activity in the area of Distributed Systems. Among the most 
outstanding and successful research projects are Accent, Mach, Amoeba, Argus, V-system and 
Chorus. In parallel, many distributed systems currently in use have evolved in an environment 
consisting of multi-user computers and workstations with traditional operating systems such 
as UNIX linked by a network. A common practice is to have a network of multi-user computers 
and workstations running BSD 4.2 UNIX with network file service software such as the Sun 
Network File System (NFS). The BSD 4.2 version of UNIX provides the conventional UNIX 
operating system in each separate computer with its own hierarchical file naming scheme and 
password file. The Sun Network File System allows any computer in the network to export the 
names of any of its file stores, allowing them to be mapped as a part of the file name space 
in other computers. The actual mapping is accomplished by an extended version of the UNIX 
mount operation, allowing a remote file system to appear as a part of the directory hierarchy in 
the machine performing the mount operation. The network file system software in the computer 
that has mounted remote file stores intercepts the read, write and other file operations that refer 
to the remote files and maps these references to the correct files in the remote computer. It is worth 
mentioning here that these UNIX-based systems are not considered to be completely 
distributed because they lack some of the properties of a truly distributed 
system. The discussion of these properties is the topic of the next section.
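The mount-based mapping of remote file stores into the local name space can be illustrated with a small sketch. This is not the NFS implementation; it is a minimal model, assuming a hypothetical `MountTable` class, a hypothetical host name `fileserver`, and made-up paths, that shows how a local path is translated into a (remote host, remote path) pair once a remote directory has been mounted.

```python
import posixpath


class MountTable:
    """Hypothetical model: maps local mount points to (host, remote_root)."""

    def __init__(self):
        self._mounts = {}

    def mount(self, remote_host, remote_root, mount_point):
        # Record that remote_root on remote_host appears locally at mount_point.
        key = posixpath.normpath(mount_point)
        self._mounts[key] = (remote_host, posixpath.normpath(remote_root))

    def resolve(self, path):
        """Return (host, path_on_host); host is None for a purely local file."""
        path = posixpath.normpath(path)
        # Try the longest mount point first so nested mounts take precedence.
        for mount_point in sorted(self._mounts, key=len, reverse=True):
            prefix = mount_point.rstrip("/") + "/"
            if path == mount_point or path.startswith(prefix):
                host, remote_root = self._mounts[mount_point]
                suffix = path[len(mount_point):].lstrip("/")
                if suffix:
                    return host, posixpath.join(remote_root, suffix)
                return host, remote_root
        return None, path


table = MountTable()
table.mount("fileserver", "/export/home", "/usr/home")
print(table.resolve("/usr/home/oscar/thesis.tex"))
print(table.resolve("/etc/passwd"))
```

In this sketch, a path under the mount point is redirected to the remote host, while any other path stays local. A real implementation intercepts file operations inside the operating system rather than translating path strings in user code.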
1.1 Properties of the Distributed Computer System
The fundamental properties of a Distributed System are fault tolerance, transparency, and the 
possibility of using parallelism. A distributed system must be able to continue working in the 
face of single-point failures, and must be capable of parallel execution. So it must have the 
following:
1. Multiple processing elements that can run independently. Each processing element or node 
must contain at least a CPU and memory. These processing elements must fail independently, 
such that the failure of one node does not cause the failure of the whole system.
2. Interconnection hardware that allows processes running in parallel to communicate and 
synchronize. The interconnections must be reliable too. The processors that are working 
correctly must always be able to coordinate their work.
3. Shared state for the distributed system. In this way a node failure does not cause part of 
the system state to be lost.
In addition, file transparency should be implemented such that it provides:
1. Location transparency. File names may be used without knowing the location of files.
2. Concurrency transparency. Facilities must be provided such that several users may use 
the same file or the same group of files at the same time without corrupting them.
3. Failure transparency. Even if a client or server crashes during operations on a file or group 
of files, the files must remain in a consistent state.
Systems built on BSD 4.2 UNIX and Sun NFS do not have the properties of concurrency 
transparency, shared state, and failure independence; this is why they are not considered 
completely distributed.
1.2 Advantages and Disadvantages of Distributed Systems
Distributed Systems often evolve from networks of workstations. Owners of workstations connect 
their systems together because of the desire to communicate and share information and resources. 
Information is generated in one place and often needed in another. People and information are 
usually geographically distributed. For these situations, the use of distributed computer systems 
is the most appropriate.
1.2.1 Advantages
The following are the advantages that users and system managers can expect to derive from the 
replacement of conventional systems by distributed ones:
1. Predictable response. Distributed systems offer very good performance and response time 
when used for highly interactive applications or applications that require significant 
processing capacity. The dedication of the workstation processing power to a single 
user ensures a rapid response.
2. Cost. The cost of a computer depends on its performance and the amount of memory 
it has. The price of processors and memory, and consequently the price of workstations, 
is going down very fast. The cost of communication depends on the bandwidth of the 
communication channel and the length of the channel. Bandwidth may be increased, but 
not beyond the limits set by the cables or interfaces used. The cost of replacing cables to 
increase bandwidth, especially in the case of wide area networks, is high. As man-machine 
interfaces become more interactive, users want instant visual or audible feedback from their 
user interface. Latency caused by distances of more than a few kilometers of network is too 
high most of the time. The cost of centralized computers that provide the required number of 
cycles for these applications, together with the cost of the network technology that delivers 
the required number of bits to the user's screen, is prohibitive. Also, computers organized in 
a distributed system may share expensive resources such as high quality laser printers and 
high capacity storage devices.
3. Extensibility. Distributed systems are built in a modular fashion since they are composed 
of autonomous computers. They are capable of incremental growth as the demand for 
service increases. The addition of new file servers or computers can be done easily without 
replacing existing interfaces. The limiting parameter is in any case the network bandwidth, 
since each active workstation adds to the communication load in the network. In a 
centralized system, on the other hand, the capacity of the central computer limits the 
maximum size to which the system can grow.
4. Availability and reliability. In a distributed system data is often replicated. A distributed 
system usually has built-in redundancy in the resources that can fail, in order to provide fault 
tolerance. Therefore, distributed systems have the potential to be available even when 
arbitrary single point failures occur.
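The effect of replication on availability can be made concrete with a short calculation. The sketch below is an illustration only, under the assumption (not stated in the text) that replicas fail independently and that each replica is unavailable with the same probability; the function name `availability` and the sample numbers are hypothetical.

```python
def availability(replica_failure_prob, replicas):
    """Probability that at least one of `replicas` independent copies is up.

    Assumes every replica is unavailable with the same probability and
    that failures are independent (an assumption of this sketch).
    """
    if not 0.0 <= replica_failure_prob <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if replicas < 1:
        raise ValueError("at least one replica is required")
    # The data is unavailable only if every replica is down at once.
    return 1.0 - replica_failure_prob ** replicas


for k in (1, 2, 3):
    print(k, "replica(s):", round(availability(0.1, k), 6))
```

Under these assumptions, each additional replica multiplies the probability of total unavailability by the per-replica failure probability, which is why modest redundancy can mask single point failures.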
1.2.2 Disadvantages
1. Loss of flexibility in the allocation of memory and processing resources. In a centralized 
computer system or a tightly coupled multiprocessor system all of the processor and 
memory resources are available for allocation by the operating system according to the work 
to be executed. In a distributed system, the largest task that can be executed is determined 
by the processor and memory capacity of the workstation used.
2. Dependence on network performance and reliability. Failure of the network causes the 
service to users to be interrupted, or the execution of tasks that require internode coordination 
to be aborted. Overloading the network degrades the performance and responsiveness seen by 
the users.
3. Security weaknesses. To achieve extensibility the software interface is made available to 
the users. Clients have access to the communication software and can access the interface 
to the servers. This creates the necessity of having security software to protect the services 
against accidental or intentional violation of access control or privacy constraints.
The Design Complexity of Distributed Systems
The design of distributed systems is complex mainly because the interconnection of well 
understood components often generates new problems not apparent in the components. Most of 
this complexity shows up as unexpected behavior of systems that are believed to be 
understood. In some cases, formal methods can be used to help predict what will happen when 
two systems are interconnected. Examples of problems that occur in the design of distributed 
systems are:
• Interconnection problems. Different computers may have different ways to represent data 
or provide services such as electronic mail or file services.
• Interference. Concurrent access to the same file, or to the network channels, are examples 
of interference that cause problems in a distributed system. Special software has to be 
designed to handle cases of interference.
• Propagation of effect. Failures in one component of the interconnection network can bring 
the system down if the design is not careful. Potential bottlenecks must be identified and 
eliminated. Also, design effort must be devoted to localizing the effect of failures as much as 
possible.
• Effects of scale. In some cases a resource does not scale up with the rest of the components 
and becomes a bottleneck. This is usually the case with the network bandwidth.
• Partial failures. A distributed system must continue to operate in the presence of partial failure 
of the components. When a node fails its behavior may become erratic; the system must 
be able to hide the effect of this erratic behavior and proceed with its work normally.
Distributed systems are complex because what they have to do is complex. Services such as 
authentication, access control, maintaining quotas, concurrency control to handle concurrent reads 
and writes to files, and recovery mechanisms, among others, are not easy to implement. Sometimes 
simple solutions cannot be implemented because they are too expensive. For example, a simple 
way to achieve a reliable network that withstands several link failures without becoming partitioned 
is to design it as a fully interconnected network with one physical link between every two nodes 
in the network. The implementation of such a network is extremely expensive.
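A short calculation shows why the fully interconnected design becomes expensive. With one link between every pair of nodes, a network of n nodes needs n(n-1)/2 links, so the link count grows quadratically with the number of nodes. The sketch below only illustrates this arithmetic; the function name `full_mesh_links` is hypothetical.

```python
def full_mesh_links(nodes):
    """Number of physical links in a fully interconnected network.

    One link joins every unordered pair of nodes: n * (n - 1) / 2.
    """
    if nodes < 0:
        raise ValueError("node count cannot be negative")
    return nodes * (nodes - 1) // 2


for n in (4, 10, 100):
    # Each node also needs n - 1 network interfaces, one per neighbour.
    print(n, "nodes:", full_mesh_links(n), "links,", n - 1, "interfaces per node")
```

For 100 nodes this gives 4950 links and 99 interfaces per node, compared with a single shared cable in a typical local area network.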
1.3 Architectural Models for Distributed Systems
The architecture of a Distributed System identifies the main hardware and software components 
and modules of the system and defines the relationship between them. Important aspects of this 
are the types of computers used, their location in the network, and the locations at which system 
programs and application programs are executed. We defined a distributed computer system as 
being composed of a group of autonomous computers linked by a network. Distributed systems 
of this type are also known as loosely coupled systems. There is another multiprocessor scheme 
that some also consider distributed. This scheme consists of a set of processors 
sharing a single memory or address space. Distributed systems of this type are known as tightly 
coupled systems. In tightly coupled multiprocessor systems all the hardware resources are under 
the control of the same operating system. The operating system allocates processors and memory 
space to users' tasks and lets them run concurrently. The use of shared memory allows the users' 
tasks to communicate with each other and with the operating system through the use of shared 
variables. In this type of system the maximum number of processors that can be used is usually 
limited by the memory bandwidth. For this reason, a relatively large cache memory is often 
assigned to each processor.
With respect to loosely coupled systems, three main architectural models have 
emerged to date: the workstation/server model, the processor pool model, and the integrated 
model.
1.3.1 Workstation/Server Model
The majority of Distributed Systems currently in use are based on this model. Each user is 
provided with a workstation, where the application programs that the user wants to run are executed. 
The need for workstations is based primarily on user interface requirements in application tasks. 
In order to enable users to share information, and files in general, file servers and directory servers 
are used. Also, in order to allow users to share expensive devices such as high quality printers, 
tape drives, etc., specialized device drivers are used. The workstations are integrated by the 
use of communication software that enables them to access the same set of servers. Servers are 
in charge of handling the special cases that need to be considered when resources are shared 
(authentication, security, concurrency, etc.).
1.3.2 Processor Pool Model
In this model programs are executed in a set of computers managed as a processor service. 
Users are provided with terminals instead of workstations, connected to the network via 
concentrators and interacting with programs via a terminal access protocol. A pool processor 
usually consists of a processor with enough memory to load and run any of the system or 
application programs available in the system. Initially, the terminals are connected to a server 
that manages and allocates pool processors. When a user requests an application program, a 
loader server loads the executable program from the file server into the pool processor assigned 
to the user. The requested application programs can even be conventional operating systems 
adapted to work in a distributed environment. When an operating system has been loaded into 
the user's pool processor, the user can interact with it directly, issuing commands to load and 
execute programs. The Cambridge Distributed Computing System is an example of this type of 
distributed system architecture. The processor pool model offers better utilization of resources 
and more flexibility than the workstation/server model. However, it does not satisfy the needs of high 
performance interactive programs. To overcome this disadvantage designers have turned to the 
implementation of hybrid systems, where workstations as well as a pool of processors are used. In 
this way, workstations can be used for interactive work and pool processors can be used to run 
tasks that are, for example, too big for the workstations. The Amoeba system developed at 
Vrije University in the Netherlands is an example of this type of hybrid architecture.
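The role of the server that manages and allocates pool processors can be sketched in a few lines. The following is a simplified model only, not the design of the Cambridge or Amoeba systems: the class name `ProcessorPool`, the processor identifiers, and the first-free allocation policy are all assumptions made for illustration.

```python
from collections import deque


class ProcessorPool:
    """Hypothetical allocator: hands free pool processors to users."""

    def __init__(self, processor_ids):
        self._free = deque(processor_ids)  # processors not yet assigned
        self._assigned = {}                # user name -> processor id

    def allocate(self, user):
        # A user who already holds a processor keeps the same one.
        if user in self._assigned:
            return self._assigned[user]
        if not self._free:
            raise RuntimeError("no free pool processor")
        processor = self._free.popleft()
        self._assigned[user] = processor
        return processor

    def release(self, user):
        # Return the user's processor to the free list when the session ends.
        processor = self._assigned.pop(user)
        self._free.append(processor)


pool = ProcessorPool(["p0", "p1"])
print(pool.allocate("alice"), pool.allocate("bob"))
pool.release("alice")
print(pool.allocate("carol"))
```

Because any user may be given any free processor, the pool is shared among all terminals, which is the source of the better resource utilization mentioned above.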
1.3.3 Integrated Model
In this model multi-user computers and workstations are integrated into a single computing 
system. Each computer is provided with appropriate software to enable it to act as a server 
and as an application processor. The system software located in each computer is similar to an 
operating system for a centralized multi-user system with the addition of networking software. 
So, every computer deals with its own applications and services. When a program is to be 
executed the system decides on a computer in which to run it. Usually the computer 
where the request was made is chosen. Next the system locates the executable program file and 
loads it into the selected computer. File names are global, allowing users to refer to files 
using the same naming scheme. The mapping from user names to the identifiers used in deciding 
ownership is also uniform throughout the system. Examples of this type of architecture are 
the Locus system and the Newcastle Connection.
1.3.4 Comparing Different Architectures
The underlying architecture of a distributed system can be characterized by a number of 
parameters that describe its performance, reliability and functionality. These parameters are related 
to the communication mechanisms, the type and frequency of failures, the achievable 
parallelism, access latency to the storage devices, etc. Among the communication-related parameters 
the most important are the overhead for inter-thread communication and the communication 
bandwidth (the available data transmission throughput). Among the failure-related parameters 
the most important are the failure and repair rate, the type of processor and stable storage 
failures (fail-stop processors, etc.), and the effect of failures (whether or not processors fail 
independently). The available number of processors on a processing node describes the available 
CPU parallelism in the node. The available number of nodes in the network describes the 
available internode parallelism. The access latency to storage devices describes the delay associated 
with reads and writes and the feasible streaming data rates. The above parameters are used to 
compare the different Distributed System Architectures and their implementations. Depending 
on the application programs executed most frequently, some parameters are more important 
than others.
1.4 Distributed Operating Systems
A distributed operating system has been defined [54] as one that looks like an ordinary centralized
operating system, but runs on multiple independent CPUs. The key concept is transparency:
the use of multiple CPUs is invisible, i.e., transparent, to the user. The user views the system as
a virtual uniprocessor, not as a collection of independent machines. In addition, a distributed
operating system does not have any single point of failure: in no case does the failure of one part
bring down the system.
Operating systems in conventional computers provide several important services to the
application programs. These are:
• file system management and file access facilities
• peripheral device handling
• user authentication and access control
• memory and processor resource allocation
• creation and scheduling of processes
In centralized systems these tasks are usually performed by the operating system kernel. In
a distributed system these system services are located in several separate computers and are
performed by separate software components. All the tasks performed by conventional operating
systems are required in distributed systems, and in addition each computer must also include
software to support communication on a local network. Figure 1.2 shows the main layers of a
distributed operating system.
Figure 1.2: The two main layers of a Distributed Operating System (labels: Distributed Program
Layer; Base Architecture, with Operating System Kernel Sublayer and Hardware Sublayer)
Distribution in a distributed operating system is based primarily on the use of servers
designed to be used by client programs running in workstations and other client computers. The
file system, for example, runs as user-level processes separate from the operating systems used in
the individual computers and can communicate directly with the user-level programs. Examples
of distributed system kernels are Amoeba, Mach, and the V kernel. The common characteristic of
these kernels is the use of lightweight tasks in addition to normal, or heavyweight, tasks
(the terms light and heavy refer to the amount of overhead involved in switching between
tasks). Lightweight tasks are processes (threads of control) that live in the same environment as
their parent process and share its address space via shared memory. In this way interprocess
communication is very easy and flexible.
The Amoeba kernel was developed at the Vrije University in Amsterdam. It was designed
to be a basis for an open system with many of the components of a conventional operating
system, such as the file service, outside the kernel. There are several servers: a block server,
a directory server, a boot server, a loader server, and a database server. All of the computers
in the Amoeba system run this kernel. The kernel includes facilities for creating processes and
for interprocess communication based on a triple of message passing primitives designed to support
Remote Procedure Call messages. These primitives are: request, used by the clients to make
remote calls; and get-request and put-reply, used by the servers to receive and respond to service
calls. The Amoeba architecture consists of four principal components: the workstations, one per
user; the pool processors; the specialized servers; and the gateways, which are used to link Amoeba
systems at different sites. Amoeba currently runs on a collection of 24 Motorola 6810 computers
connected by a 10-Megabyte-per-second local network. Figure 1.3 shows the Amoeba system
architecture.
The Mach kernel was developed at Carnegie Mellon University. It is an open system based 
on a lightweight kernel running in each computer with services such as the file system, network 
service and process management outside the kernel. These services replace the system calls 
found in traditional operating systems. The model provided by Mach is a  service model in 
which objects are managed by servers, and clients make requests for operations on objects by 
using remote procedure calls. Remote procedure calls are supported by efficient and flexible 
interprocess communication facilities in the kernel.
The V kernel is a research project at Stanford University. It performs functions similar
to those of a software backplane, in the sense that it provides an infrastructure
for components (for hardware, boards; for software, processes) to communicate and nothing
else. Consequently, most of the facilities found in traditional operating systems, such as the file
system, resource management and protection, are provided by V servers outside the kernel. The
V system consists of a collection of workstations (SUNs), each running an identical copy of the
V kernel. The kernel consists of three components: the interprocess communication handler, the
kernel server (for providing basic services, such as memory management), and the device server
(for providing uniform access to I/O devices). Some of the workstations support an interactive
user, whereas others function as file servers, print servers, and other kinds of servers. Figure 1.4
illustrates a V-kernel configuration.
Figure 1.3: The Amoeba Architecture (workstations, a processor pool, and specialized servers for
files, databases, etc., on a local area network, with a gateway to the WAN)
Figure 1.4: A typical V configuration (workstations, file servers, a print server and a gateway
server attached to a network)
Chapter 2
CONCURRENT OBJECTS AND
CONCURRENT SYSTEMS
A concurrent system consists of a collection of sequential processes that communicate through
shared objects. This model corresponds to a shared memory multiprocessor system where
processors communicate by reading from and writing to a shared memory board. This model is used
throughout this chapter. A concurrent object is a data structure shared by several processes.
Every object has a type. The object’s type defines the set of possible values it can adopt and the
set of primitive operations available to manipulate it. Sequential objects are usually specified
through a set of axioms that define the meaning of the operations when they are invoked one
at a time by the processors. Specifications for concurrent objects are more complex. More than
one operation may be invoked at a time, and it is necessary to define all possible interleavings
of the operation invocations.
2.1 Some Definitions for Concurrent Systems
Shared registers, shared lists, and shared queues are examples of concurrent objects. The following
concepts will be used to discuss the specification and implementation of concurrent objects:
Operation: Any type of action on an object by a processor, such as read, write, enqueue,
dequeue, etc. An operation consists of two events: the operation invocation and the operation
response. The operation invocation is denoted by x op(args) A, where x is the name of the
object, op is an operation name, args are the arguments needed by the operation, and A
is the name of the processor invoking the operation. The operation response is denoted by
x term(res) A, where term is a terminating condition (successful or failed) and res is the
result(s) of the operation. A response matches an invocation if their object and process
names agree. A pending invocation is an invocation with no matching response.
History: A history is a finite sequence of operation invocation and response events. It is used to
model the execution of a system (concurrent or sequential). A history is sequential if each
invocation is immediately followed by a matching response. A process subhistory, H|P, of
a history H is the sequence of events in H whose process name is P.
A history H induces an irreflexive partial order $\prec_H$ (written $\xrightarrow{H}$ in [36]) on
operations. If $e_0$ and $e_1$ are two operations belonging to H then:
$e_0 \prec_H e_1$ if response($e_0$) precedes invocation($e_1$) in time in H.
$\prec_H$ ($\xrightarrow{H}$) captures the “real time” precedence ordering of operations in H. In this
chapter $\prec_H$ will be used to denote temporal precedence relations among operations belonging
to a history H. Notice that for sequential histories $\prec_H$ is a total order. Since $\prec_H$ is
an irreflexive partial order relation, the following must hold ($e_0$, $e_1$, and $e_2$ are
operations belonging to a history H):
1. $e_0 \nprec_H e_0$
2. if $e_0 \prec_H e_1$ then $e_1 \nprec_H e_0$
3. if $e_0 \prec_H e_1$ and $e_1 \prec_H e_2$ then $e_0 \prec_H e_2$
Operations that are not related by $\prec_H$ are concurrent operations. Lamport [36] introduced
the symbol $\overset{H}{\dashrightarrow}$ to relate concurrent operations in a history¹ H. The symbol
$\dashrightarrow$ means “may affect” and has the following interpretation:
• An operation $e_0$ affects the outcome of another operation $e_1$ if $e_0 \prec_H e_1$. For
example, if $e_0$ is a write operation and $e_1$ is a read operation on some shared register Reg,
$e_0$ affects $e_1$ because it makes $e_1$ return the value it wrote to Reg. Two or more concurrent
operations on an object may or may not affect each other's outcomes, depending on the
implementation of the object. For the above example, if $e_0$ and $e_1$ are concurrent, it is not
possible to tell which value will be returned by $e_1$. It may return the register’s old value, the
value $e_0$ is writing to the register, or a corrupted value due to the interference of $e_0$.
Concurrent operations $e_0$ and $e_1$ are denoted by $e_0 \dashrightarrow e_1$.
Example:
Consider two processes $P_1$ and $P_2$ communicating through a shared register Reg. Reg consists
of 3 bytes $Reg_1$, $Reg_2$, and $Reg_3$. V denotes the set of all values that Reg can adopt, and
$V^j$ denotes the jth value. A write operation by process $P_i$ is denoted by Reg Wr($V^j$) $P_i$ and
Reg Wr.ok.() $P_i$ for operation invocation and response, respectively. Similarly, a read operation
is denoted by Reg Rd() $P_i$ and Reg Rd.ok.($V^j$) $P_i$. Consider the following history of events, H:
Reg Wr($V^1$) $P_2$
Reg Wr.ok.() $P_2$
Reg Rd() $P_1$
Reg Rd.ok.($V^1$) $P_1$
Reg Wr($V^2$) $P_1$
Reg Rd() $P_2$
Reg Wr.ok.() $P_1$
Reg Rd.ok.($V^3$) $P_2$
Figure 2.1: The sequence of read and write operations by the processes $P_1$ and $P_2$ (processes
plotted against time)
¹ Lamport uses the term "system execution" instead of history.
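Figure 2.1 shows the execution of the operations in time. Numbering the operations of H in
order of invocation as $e_0$, $e_1$, $e_2$, $e_3$, Figure 2.1 and the history H show that $e_2$
(the write of $V^2$ by $P_1$) and $e_3$ (the read by $P_2$) are concurrent.
The partial order $\prec_H$ induced by the history H is
$e_0 \prec_H e_1 \prec_H e_2$, $e_0 \prec_H e_1 \prec_H e_3$.
Concurrent operations $e_2$ and $e_3$ cannot be related by $\prec_H$: $e_2 \dashrightarrow e_3$ and
$e_3 \dashrightarrow e_2$. The value $V^3$ returned by $e_3$ can be $V^1$, $V^2$ or a corrupted value.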
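The precedence relation of this example can be computed mechanically. The following Python sketch is only an illustration: the event tuples, the function names, and the numbering of operations by order of invocation are assumptions made here, not notation from [36].

```python
from itertools import combinations

# The history H of the example as (kind, operation, process) events.
# This tuple layout is an assumption made for the sketch.
H = [
    ("inv", "Wr", "P2"), ("res", "Wr", "P2"),   # e0
    ("inv", "Rd", "P1"), ("res", "Rd", "P1"),   # e1
    ("inv", "Wr", "P1"),                        # e2 invoked
    ("inv", "Rd", "P2"),                        # e3 invoked
    ("res", "Wr", "P1"),                        # e2 responds
    ("res", "Rd", "P2"),                        # e3 responds
]

def operations(history):
    """Pair every invocation with the next response of the same process.

    Returns one (invocation position, response position) pair per operation,
    numbered in order of invocation; the response is None while pending.
    """
    ops, pending = [], {}
    for pos, (kind, _op, proc) in enumerate(history):
        if kind == "inv":
            pending[proc] = len(ops)
            ops.append([pos, None])
        else:
            ops[pending.pop(proc)][1] = pos
    return [tuple(op) for op in ops]

def precedes(a, b):
    """a precedes b when response(a) occurs before invocation(b)."""
    return a[1] is not None and a[1] < b[0]

def precedence(ops):
    """All pairs (i, j) such that operation i precedes operation j."""
    return {(i, j) for i, a in enumerate(ops)
            for j, b in enumerate(ops) if precedes(a, b)}

def concurrent_pairs(ops):
    """Pairs of operations that the precedence relation does not order."""
    rel = precedence(ops)
    return {(i, j) for i, j in combinations(range(len(ops)), 2)
            if (i, j) not in rel and (j, i) not in rel}

ops = operations(H)
print(sorted(precedence(ops)))
print(concurrent_pairs(ops))
```

Under these assumptions the only unordered pair is (2, 3), which agrees with the statement that $e_2$ and $e_3$ are concurrent.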
To see how a corrupted value can be returned, consider the following situation:
$V^1$ = 8, 7, 5 (values from left to right contained in $Reg_1$, $Reg_2$, and $Reg_3$, respectively), and
$V^2$ = 4, 2, 1. Assume that during the execution of $e_2$, $P_1$ writes 4 to $Reg_1$ and then control is
switched to $P_2$, which starts executing $e_3$. $P_2$ reads $Reg_1$ (whose value at that moment is 4) and
$Reg_2$ (whose value is still 7), and then the scheduler schedules $P_1$ again. $P_1$ writes 2 to $Reg_2$
and 1 to $Reg_3$, finishing $e_2$, and returns control to $P_2$. $P_2$ finishes $e_3$ by reading $Reg_3$ (whose
value is now 1) and returns the values it read from $Reg_1$, $Reg_2$, and $Reg_3$. But this value is 4,
7, 1, a corrupted value which was never intended to be written to the register. Notice that
$e_2$ and $e_3$ can be scheduled in several ways, and whether $e_2$ will affect $e_3$ or not depends on the
interleaving of the operations on the three registers $Reg_1$, $Reg_2$, and $Reg_3$.
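The interleaving just described can be replayed deterministically. The sketch below models the two processes as Python generators that yield after each byte access; the function run and the schedule encoding are hypothetical helpers introduced only for this illustration.

```python
def run(schedule, old, new):
    """Replay one interleaving of a 3-byte write (P1) and a 3-byte read (P2).

    `schedule` lists which process performs the next single-byte access.
    Returns (value seen by the reader, final register contents).
    """
    reg = list(old)
    seen = []

    def writer():                      # P1 executing e2
        for i, byte in enumerate(new):
            reg[i] = byte
            yield

    def reader():                      # P2 executing e3
        for i in range(len(reg)):
            seen.append(reg[i])
            yield

    procs = {"P1": writer(), "P2": reader()}
    for name in schedule:
        next(procs[name])              # one byte access by that process
    return tuple(seen), tuple(reg)

V1, V2 = (8, 7, 5), (4, 2, 1)

# The interleaving from the text: P1 writes Reg1, P2 reads Reg1 and Reg2,
# P1 writes Reg2 and Reg3, P2 reads Reg3.
print(run(["P1", "P2", "P2", "P1", "P1", "P2"], V1, V2))

# A non-overlapping schedule for comparison: the read sees V2 intact.
print(run(["P1", "P1", "P1", "P2", "P2", "P2"], V1, V2))
```

The first schedule returns 4, 7, 1 to the reader although the register ends up holding 4, 2, 1; the second returns the intact value.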
Lamport [36] proposed the following set of axioms to provide the properties of the temporal
precedence relations $\prec_H$ ($\xrightarrow{H}$) and $\overset{H}{\dashrightarrow}$ induced by a history H
(the $e_i$ denote operations belonging to a history H):
1. The relation $\xrightarrow{H}$ is an irreflexive partial ordering.
2.2 Specification of objects
A sequential history for an object can be summarized by the value of the object at the end of
the history. To reason about the value of an object, axiomatic specifications are used. A
specification is a set of axioms of the form:
{ P }
op(args)/term(res)
{ Q }
where P is a precondition on the value of the object and the argument values (args) to be met for
the operation invocation (op), and Q is a postcondition upon return for the given termination
condition (term). A sequential history H is legal if, for all object subhistories H | x of H,
each operation in H | x satisfies its axiomatic specification (assuming a system with several
concurrent objects). Axiomatic specifications, however, have no meaning for histories that are
not sequential. In concurrent systems the result of an operation may depend on how it is
interleaved with other concurrent operations. In one situation, however, it is possible to reason
about a concurrent object using axiomatic techniques for sequential objects. This happens
when for every history of the concurrent object in the system it is possible to find an equivalent
legal sequential history. To find the equivalent sequential history, the concurrent operations are
rearranged in time according to some definite order. The order must be consistent with the values
the object is observed to contain over time. This means that the implementation of the object must
not permit all possible interleavings of operations but only the ones that generate histories for
which it is possible to find sequential equivalents. The exact specification of the permissible
interleavings determines the correctness conditions for the implementation. Different correctness
conditions have been proposed for the implementation of concurrent objects, namely, sequential
consistency [35], serializability [45], and linearizability [30]. Linearizability provides the
strictest conditions. It requires an object to be implemented in such a way that every operation on
the object appears to have been executed instantaneously between its invocation and response. This
implies that concurrent object operations can still be specified with pre- and postconditions.
Notice that linearizability requires the implementation of the object to respect the order in which
nonoverlapping operations were invoked.
2.3 Register axioms
Misra [38] analyzed in great detail the implementation of shared (concurrent) registers, a type
of concurrent object. He provided an axiomatic definition of this type of register, reasoning in
terms of the registers’ hypothetical values. Misra proposed to build a register such that, to an
external observer, concurrent operations (read and write) on the register appear to be executed
sequentially and nonconcurrent operations are executed in the same order they were requested.
He shows that if a register obeys a certain set of axioms then it may be analyzed as a serial
device where all the operations are sequential. To illustrate his idea consider the following
example:
Example:
A history $H_a$ of read and write operations performed on a one-bit register Reg (flip-flop) is
the following:
Reg Wr(0) $P_1$
Reg Wr.ok.() $P_1$
Reg Wr(1) $P_2$
Reg Rd() $P_1$
Reg Rd.ok.(1) $P_1$
Reg Wr(0) $P_3$
Reg Wr.ok.() $P_3$
Reg Wr.ok.() $P_2$
Reg Rd() $P_2$
Reg Rd.ok.(0) $P_2$
Figure 2.2: The operations by the three processes on a register (processes plotted against time).
It is possible to rearrange the operations in such a way (as shown in Figure 2.3) that the
nonconcurrent operations preserve their order and the read operations return values that are
consistent with the write operations.
The order of execution of the operations in time is shown in Figure 2.2. In Figure 2.2, operations
are denoted by $OP_i(val)$, where OP can be Wr for a write or Rd for a read. The subscript i
is the number of the process executing the operation and val (0 or 1) is the value written or
read by the operation. Notice that $Rd_1(1)$ and $Wr_3(0)$ are both concurrent with $Wr_2(1)$. If the
register behaves as in Figure 2.2 then its behavior is indistinguishable from that of a register
where operations take place sequentially, one at a time, as in Figure 2.3. Notice that in the
sequence of Figure 2.3 the values returned by the read operations are consistent with the write
operations.
The sequential history $H_b$ corresponding to the operations in Figure 2.3 is the following:
Reg Wr(0) $P_1$
Reg Wr.ok.() $P_1$
Reg Wr(1) $P_2$
Reg Wr.ok.() $P_2$
Reg Rd() $P_1$
Reg Rd.ok.(1) $P_1$
Reg Wr(0) $P_3$
Reg Wr.ok.() $P_3$
Reg Rd() $P_2$
Reg Rd.ok.(0) $P_2$
Figure 2.3: The operations of Figure 2.2 rearranged in time to return consistent values
Consider now the following history $H_c$ for the same register Reg (shown in Figure 2.4):
Reg Wr(0) $P_1$
Reg Wr.ok.() $P_1$
Reg Wr(1) $P_2$
Reg Rd() $P_1$
Reg Rd.ok.(1) $P_1$
Reg Wr(0) $P_3$
Reg Wr.ok.() $P_3$
Reg Wr.ok.() $P_2$
Reg Rd() $P_2$
Reg Rd.ok.(1) $P_2$
Figure 2.4: The operations by three processes on a register (processes plotted against time).
These operations cannot be rearranged in time to return consistent values because $Rd_2(1)$ cannot
precede $Wr_3(0)$
The history $H_c$ is almost the same as the history $H_a$. The difference is that the last read
operation returned a 1 instead of a 0. For history $H_c$ it is impossible to rearrange the operations
in such a way that the nonconcurrent operations preserve their order and the read operations
return values consistent with the write operations. Notice that $Rd_1(1)$ has to precede $Wr_3(0)$,
so $Wr_2(1)$ must be ordered before $Wr_3(0)$, whereas the final $Rd_2(1)$ would require $Wr_2(1)$ to be
the most recent write.
Misra analyzed the sequences of operations that he called schedules² and stated that a
schedule is valid if its effect is equivalent to the effect of some sequential schedule where
operations are executed serially. Formally, he defined a valid schedule S to be one for which it is
possible to find a permutation S' satisfying a set of validity conditions. These validity conditions
are (modified to the notation used in this paper):
1. For every operation OP in S', the invocation of OP precedes the response of OP.
2. Every pair of operations in S' has to be nonconcurrent.
3. If operation $OP_1$ precedes operation $OP_2$ in S then $OP_1$ precedes $OP_2$ in S'.
4. In S' every read operation has a preceding write operation, and if $Wr_i(x)$ is the closest
preceding write operation to $Rd_j(y)$ in S' then y = x.
If a set of processes accesses a register in such a way that the schedule for the operation executions
is valid, it is possible to analyze the register behavior as if the operations were not concurrent.
Referring to the example discussed earlier in this section and Figures 2.2 and 2.3, $H_a$ is a valid
history (schedule) with an equivalent permutation $H_b$.
² Schedules are equivalent to histories.
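For small schedules the validity conditions can be checked by exhaustive search. The Python sketch below is a brute-force illustration, not Misra's method: the Op record, the event positions taken from the listings of $H_a$ and $H_c$, and the function names are all assumptions. Conditions 1 and 2 hold by construction, because a permutation is read as a sequence of operations executed one at a time.

```python
from itertools import permutations
from typing import NamedTuple

class Op(NamedTuple):
    kind: str    # "Wr" or "Rd"
    value: int   # value written, or value returned by the read
    inv: int     # position of the invocation event in the schedule S
    res: int     # position of the response event in the schedule S

def respects_order(perm):
    """Condition 3: no operation is placed after one that it precedes in S."""
    return all(not (later.res < earlier.inv)
               for i, earlier in enumerate(perm) for later in perm[i + 1:])

def reads_consistent(perm):
    """Condition 4: every read returns the closest preceding write's value."""
    last = None
    for op in perm:
        if op.kind == "Wr":
            last = op.value
        elif last is None or op.value != last:
            return False
    return True

def find_permutation(schedule):
    """Return a valid sequential permutation S' of the schedule, or None."""
    for perm in permutations(schedule):
        if respects_order(perm) and reads_consistent(perm):
            return perm
    return None

# H_a: Wr1(0), Wr2(1), Rd1(1), Wr3(0), Rd2(0), with event positions 0..9.
H_a = [Op("Wr", 0, 0, 1), Op("Wr", 1, 2, 7), Op("Rd", 1, 3, 4),
       Op("Wr", 0, 5, 6), Op("Rd", 0, 8, 9)]
# H_c: identical except that the last read returns 1.
H_c = H_a[:4] + [Op("Rd", 1, 8, 9)]

print(find_permutation(H_a))   # a sequential order matching H_b
print(find_permutation(H_c))   # None: no valid permutation exists
```

The search is factorial in the number of operations, so it serves only to confirm the two small examples above.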
2.3.1 The Axioms
Misra [38] proposed a set of axioms that defines how registers should behave under concurrent
read and write operations so that it is possible to analyze the outcome of such operations as if
they were nonconcurrent. The set of axioms requires a register to behave in the following way:
1. If a read operation $Rd_i(x)$ returns the value x, then at some point within the time interval
in which it was executed, the register value was x. This means it is impossible to return a value
x if the register never had that value.
2. The value stored in the register before a write operation must be different from the
value stored in it after the operation (he assumes a write operation never writes the
same value twice). This means a write operation always changes the value of the register.
3. If the register value is x at some point in time, there must exist a write operation $Wr_i(x)$
that writes this value to the register.
4. The register does not change its value spontaneously. That is, if the register has the value
x at two different points in time $t_1$ and $t_2$ then its value is also x at any time between
$t_1$ and $t_2$ (again assuming that the write operations always write different values to the
register).
Misra proves that every schedule S for which the axioms are satisfied is valid and,
consequently, that it is possible to find the permutation S'. He also shows that the four axioms
he proposes are all necessary and independent of each other.
Chapter 3
ATOMIC REGISTERS
The concept of atomic registers was introduced by Lamport and is closely related to the
Concurrent Reading and Writing Problem. Concurrent read and write operations on a register must
be handled carefully to prevent consistency problems. If a reading process accesses a register
while a writing process is modifying it, the reader may obtain a corrupted (inconsistent)
value. Also, if two or more writing processes write to the register at the same time, these
concurrent operations may leave the register with a value that was never intended to be written.
Mutual exclusion is the most obvious solution to this problem. This approach, however, has
several drawbacks. The most important one is that it implies some waiting time
between the request for access and the execution of the read or write operation. Wait-free
solutions to the problem, which allow the parallel execution of read and write operations,
can be implemented in several ways. The construction of atomic registers is the most refined version
of these solutions. An atomic register is a construction that simulates an ideal register where
concurrent reads and writes are executed in some definite order. This order of execution must
ensure that the values returned by the read operations are consistent with the write operations.
This chapter is devoted to the description of atomic registers and the techniques for the
specification of these concurrent data structures. Techniques for verifying the correctness
of an atomic register implementation are also discussed. Besides the atomic register, the regular
and the safe registers are also described. These are considered weaker types in the sense that
they cannot handle all the cases of concurrent reading and writing.
3.1 LAMPORT’S REGISTER CLASSIFICATION
Registers were classified by Lamport [36] into safe, regular, and atomic registers, according 
to their behavior under concurrent operations.
3.1.1 Safe Register
In this register a read that is not concurrent with any write obtains the correct value (the most
recent one). In case of concurrent reads and writes, the read operations may return any value. So,
if the register has n bits, the returned value may be any value between 0 and $2^n - 1$. It is the
weakest type of register.
3.1.2 Regular Register
A regular register is also a safe register, but in case of concurrent read and write operations
the register will return either the new value being written or the old (previous) value stored in
the register. In general, a read overlapping a series of write operations will return the register’s
value before the beginning of the write operations or one of the values being written.
3.1.3 Atomic Register
An atomic register is also a safe register, but under concurrent read and write operations it
behaves as if they had occurred in some definite sequential order, even if the operations were
concurrent. Among the restrictions on its behavior, the most important one is that in case
of two or more successive read operations $Rd^i, Rd^{i+1}, Rd^{i+2}, \ldots$ on the register that overlap a
series of successive write operations $Wr^i, Wr^{i+1}, Wr^{i+2}, \ldots$, the case of $Rd^i$ returning the value
written by $Wr^{i+2}$ and later read operations $Rd^{i+1}$ and $Rd^{i+2}$ returning previous values written
by $Wr^i$ and $Wr^{i+1}$ is forbidden.
Figure 3.1: A reader process and a writer process accessing a shared register. The type of the
register determines which values are legal for a read operation to return when it overlaps in time
a write operation (timeline over $t_0, \ldots, t_{12}$: the writer performs $Wr^i(8)$, $Wr^{i+1}(3)$ and
$Wr^{i+2}(11)$; the reader performs $Rd^i(x)$, $Rd^{i+1}(y)$ and $Rd^{i+2}(z)$)
To illustrate the different types of registers an example is given below:
Example:
Consider the sequence of read and write operations on an eight-bit register as shown in Figure
3.1. If the register is safe then the read operation $Rd^i(x)$ returns value x = 3, and $Rd^{i+1}(y)$ and
$Rd^{i+2}(z)$ return values between 0 and 255. For a regular register, $Rd^i(x)$ returns value x = 3, and
$Rd^{i+1}(y)$ and $Rd^{i+2}(z)$ return either 3 or 11. For an atomic register, $Rd^i(x)$ returns x = 3, and
$Rd^{i+1}(y)$ and $Rd^{i+2}(z)$ may return any of the following pairs of values: (3,3), (3,11), (11,11),
where the first value of each pair corresponds to y and the second to z. The case of $Rd^{i+1}(y)$
returning y = 11 and $Rd^{i+2}(z)$ returning z = 3 is forbidden for the atomic register. However, a
regular register would allow this.
The safe type of register is usually implemented in hardware. The regular and atomic types have
to be constructed from safe registers by designing suitable access protocols. In general, the
construction of a register must fit a combination of the following categories:
1. Safe, regular or atomic.
2. Boolean or multivalued.
3. Single reader or multireader.
4. Single writer or multiwriter.
This gives twenty-four different types of registers that can be built. The weakest construction
is a safe, boolean, single reader, single writer register; the strongest is an atomic, multivalued,
multireader, multiwriter register. Constructions of these registers are described in the following
sections.
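The values permitted in the example of Figure 3.1 can be enumerated with a short Python sketch. The time intervals below are hypothetical, chosen only so that $Rd^i$ follows the write of 3 and both later reads overlap the write of 11. For the atomic case the sketch relies on the characterization for a single writer with distinct written values, namely regular behavior plus the absence of the forbidden inversion described in Section 3.1.3; this is an assumption of the sketch rather than a construction from the text.

```python
from typing import NamedTuple

class Write(NamedTuple):
    value: int
    start: int
    end: int

class Read(NamedTuple):
    start: int
    end: int

def allowed_values(kind, read, writes, bits=8):
    """Values one read may return from a "safe" or "regular" register.

    `writes` is in the single writer's order, and at least one write is
    assumed to have completed before the read starts.
    """
    latest = [w for w in writes if w.end < read.start][-1].value
    during = [w.value for w in writes
              if w.start <= read.end and read.start <= w.end]
    if not during:
        return {latest}                  # no overlap: the most recent value
    if kind == "safe":
        return set(range(2 ** bits))     # any value of the register's range
    return {latest, *during}             # regular: old or overlapping values

def atomic_sequences(reads, writes):
    """Return-value tuples for successive reads with no later read
    returning an older value than an earlier read."""
    order = {w.value: i for i, w in enumerate(writes)}   # distinct values
    sequences = [()]
    for read in reads:
        choices = sorted(allowed_values("regular", read, writes), key=order.get)
        sequences = [seq + (v,) for seq in sequences for v in choices
                     if not seq or order[v] >= order[seq[-1]]]
    return sequences

# Hypothetical intervals for the operations of Figure 3.1.
writes = [Write(8, 0, 1), Write(3, 2, 3), Write(11, 6, 11)]
reads = [Read(4, 5), Read(7, 8), Read(9, 10)]

print(allowed_values("regular", reads[1], writes))
print(atomic_sequences(reads, writes))
```

With these intervals the atomic case yields (x, y, z) equal to (3,3,3), (3,3,11) or (3,11,11), matching the pairs listed in the example.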
3.2 SAFE AND REGULAR REGISTER CONSTRUCTIONS
The constructions discussed in this section were developed by Lamport [36]. The weakest
type of register (safe, boolean, single reader, single writer) is assumed to be implemented in
hardware. All the constructions discussed here are also assumed to be single writer.
3.2.1 Multivalued Safe Register
The implementation of a multivalued ($2^n$-valued) multireader safe register M consists of a set
($m_1, \ldots, m_n$) of boolean multireader safe registers. The writing process writes a binary value
$V_1, \ldots, V_n$ to the register by assigning to each $m_i$ the corresponding value $V_i$. Any reading
process gets the register’s value by reading every $m_i$ starting with $m_1$.
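A minimal Python sketch of this construction follows. The class names are hypothetical, the boolean registers are plain stand-ins for hardware bits, and the choice of $m_1$ as the most significant bit is an assumption, since the text does not fix a bit order.

```python
class BooleanSafeRegister:
    """Stand-in for a hardware boolean safe register (hypothetical class)."""
    def __init__(self):
        self.bit = 0

    def write(self, b):
        self.bit = b

    def read(self):
        return self.bit

class MultivaluedSafeRegister:
    """A 2**n-valued register built from n boolean registers m_1..m_n."""
    def __init__(self, n):
        self.m = [BooleanSafeRegister() for _ in range(n)]

    def write(self, value):
        n = len(self.m)
        for i, reg in enumerate(self.m):
            # m_1 (index 0) receives the most significant bit: an assumption.
            reg.write((value >> (n - 1 - i)) & 1)

    def read(self):
        value = 0
        for reg in self.m:               # read every m_i, starting with m_1
            value = (value << 1) | reg.read()
        return value

M = MultivaluedSafeRegister(4)
M.write(11)
print(M.read(), [reg.read() for reg in M.m])
```

A read that overlaps a write may mix old and new bits, which is acceptable here because a safe register may return any value in that case.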
3.2.2 Multireader Boolean Regular Register
The construction of such registers consists of a multireader boolean safe register S and a variable
old internal to the writer’s program (not shared). The write operation of the boolean value b (0
or 1) to the register is executed by assigning b to it only if this value is different from the
previous one (old keeps this value). Because the only possible values for register S are 0 and 1, if
the writer is changing its value and there is a concurrent read operation then the returned value
will always be the old value or the one being written. Therefore, S behaves as a regular register.
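The writer's side of this construction can be sketched as follows. CountingSafeBit is a hypothetical stand-in for the shared safe register S that counts physical writes, so that the effect of the private variable old becomes visible.

```python
class CountingSafeBit:
    """Stand-in for the shared boolean safe register S (hypothetical class)
    that counts how often it is physically written."""
    def __init__(self):
        self.bit, self.writes = 0, 0

    def write(self, b):
        self.bit, self.writes = b, self.writes + 1

    def read(self):
        return self.bit

class RegularBooleanRegister:
    def __init__(self, safe_bit):
        self.s = safe_bit
        self.old = safe_bit.read()   # private to the writer, not shared

    def write(self, b):
        # Touch S only when the value changes, so a read that overlaps a
        # physical write can only see the old value or the new one.
        if b != self.old:
            self.s.write(b)
            self.old = b

    def read(self):
        return self.s.read()

S = CountingSafeBit()
R = RegularBooleanRegister(S)
R.write(0)          # same value as before: S is not written
R.write(1)
R.write(1)          # same value again: S is not written
print(R.read(), S.writes)
```

Skipping redundant writes matters because a safe register could otherwise return the opposite bit during a write that leaves the value unchanged.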
3.2.3 Single Reader Multivalued Regular Register
The algorithm presented in [46] could be used to simulate this register. However, it is not built
strictly from weaker types of registers, because some of its flags have to be atomic. For this
reason, Lamport’s algorithm [36] is discussed here despite being less space efficient. For a
register with values in the range from 1 to n, the construction consists of a set of n boolean
multireader regular registers $r_1, \ldots, r_n$. The register’s value is kept using unary encoding,
where a value v is written by setting register $r_v$ to 1 and then registers $r_{v-1}$ down to $r_1$ to
zero. The read operation reads each register starting with $r_1$ and stops when a 1 is found. One
thing to note is that the write operation proceeds from right to left and the read operation from
left to right. Lamport proves that any read operation concurrent with a write will return only the
previous value or the value being written.
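The claim for a read overlapping a single write can be checked on small cases by enumerating every interleaving of single-bit accesses. The Python sketch below makes one simplification: each access to a bit $r_i$ is treated as an indivisible step. The function names are hypothetical.

```python
def write_steps(v):
    """The writer's bit assignments, right to left: r_v := 1, then
    r_{v-1}, ..., r_1 := 0."""
    return [(v, 1)] + [(i, 0) for i in range(v - 1, 0, -1)]

def read_results(r, steps, i=1):
    """Values a left-to-right read, currently about to examine r_i, can
    return over every interleaving with the remaining writer steps.

    `r` maps the indices 1..n to bits.
    """
    results = set()
    if steps:                                    # the writer moves next
        (j, bit), rest = steps[0], steps[1:]
        updated = dict(r)
        updated[j] = bit
        results |= read_results(updated, rest, i)
    if r[i] == 1:                                # the reader moves next
        results.add(i)                           # first 1 found: return i
    else:
        results |= read_results(r, steps, i + 1)
    return results

# n = 4, register holds 3, a read overlaps the write of 1.
print(read_results({1: 0, 2: 0, 3: 1, 4: 0}, write_steps(1)))
# n = 4, register holds 2, a read overlaps the write of 4.
print(read_results({1: 0, 2: 1, 3: 0, 4: 0}, write_steps(4)))
```

In both cases the reader can obtain only the previous value or the value being written, in line with the property quoted from Lamport.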
3.2.4 Multireader Multivalued Safe or Regular Register
An m-reader, $2^n$-valued safe or regular register R is implemented using m single-reader, $2^n$-valued
registers $r_1, \ldots, r_m$. Each $r_j$ keeps a copy of the register’s value that the writer makes for
every reader. The write operation R := v is executed by assigning to each $r_j$ the value v, starting
with $r_1$. The jth reader executes a read operation by getting the value of $r_j$. If registers
$r_1, \ldots, r_m$ are regular, the construction implements a multireader, multivalued, regular
register. If the registers are safe, the construction is also safe.
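A small Python sketch of this construction is given below. The class name and the generator write_steps, which pauses after each copy so that a write can be observed halfway, are assumptions of the sketch; readers are indexed from 0, so index j stands for register $r_{j+1}$.

```python
class MultiReaderRegister:
    """An m-reader register kept as one private copy per reader."""
    def __init__(self, m, initial=0):
        self.r = [initial] * m           # r[j] stands for r_{j+1}

    def write_steps(self, v):
        """Assign v to each copy, starting with r_1, pausing after each."""
        for j in range(len(self.r)):
            self.r[j] = v
            yield j

    def write(self, v):
        for _ in self.write_steps(v):
            pass

    def read(self, j):
        return self.r[j]                 # reader j consults only its copy

R = MultiReaderRegister(3, initial=5)
w = R.write_steps(9)
next(w)                                  # only the first copy is updated
print(R.read(0), R.read(2))              # first reader: new, last: old
for _ in w:                              # finish the write
    pass
print([R.read(j) for j in range(3)])
```

Halfway through the write, the first reader already sees the new value while the last one still sees the old value. This suggests why the construction is presented only for safe and regular registers.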
3.3 ATOMIC REGISTER CONSTRUCTIONS
In the previous section the implementation of safe and regular registers was discussed. The
basic strategy used in that implementation was to build the complex registers in terms of simpler
ones. In this way, multivalued multireader safe and regular registers were built in terms of safe
boolean single reader registers.
3.3.1 I/O Automatons
A formal way to describe the implementation of a data object in terms of simpler ones is to
model the objects as I/O automatons. An I/O automaton is a non-deterministic automaton
represented by a tuple $(E, Q, S, Q^0)$ where:
• $E = E_{in} \cup E_{out}$ is a set of input and output events.
• Q is a finite or infinite set of states.
• $Q^0$ is a distinguished set of starting states.
• S is a transition function of the form $(q, e, q')$, where $q, q' \in Q \cup Q^0$ and $e \in E$. The
tuple $(q, e, q')$ is called a step and means that an automaton in state q moves to state q' after
the occurrence of event e. The execution of an I/O automaton is represented by a sequence of
steps called a schedule.
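The tuple $(E, Q, S, Q^0)$ can be written down directly. The Python sketch below is an illustration under stated assumptions: the class IOAutomaton, its method names, and the one-bit register used as the example are all hypothetical, and a schedule is given by its sequence of events rather than by complete steps.

```python
class IOAutomaton:
    """A non-deterministic automaton (E, Q, S, Q0); steps are (q, e, q')."""
    def __init__(self, inputs, outputs, states, starts, steps):
        self.events = set(inputs) | set(outputs)     # E = E_in U E_out
        self.states = set(states)                    # Q
        self.starts = set(starts)                    # Q0
        self.steps = set(steps)                      # S

    def run(self, events):
        """States the automaton may be in after the given event sequence."""
        current = set(self.starts)
        for e in events:
            current = {q2 for (q1, ev, q2) in self.steps
                       if q1 in current and ev == e}
        return current

    def accepts(self, events):
        """True when some execution of the automaton produces the events."""
        return bool(self.run(events))

# Example: a one-bit register. "wr0"/"wr1" are input events; "rd0"/"rd1"
# are output events reporting the value read.
steps = ({(q, f"wr{v}", v) for q in (0, 1) for v in (0, 1)}
         | {(q, f"rd{q}", q) for q in (0, 1)})
register = IOAutomaton(inputs={"wr0", "wr1"}, outputs={"rd0", "rd1"},
                       states={0, 1}, starts={0}, steps=steps)

print(register.accepts(["wr1", "rd1"]))   # a read returns the written bit
print(register.accepts(["wr1", "rd0"]))   # no step allows reading 0 here
```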
Processes in a concurrent system can also be modeled as I/O automatons. The process will
then be defined as the execution of a procedure. To model the communication between processes
and objects, the I/O automatons are extended with the addition of ports. This extended I/O
automaton is called a port automaton. A port automaton [49] is formally represented as a tuple
(V, P, CH, M), where:
• V is the set of values that can be sent as messages.
• P is the set of ports. Each port has a type, which is master or slave.
• CH is the set of channels. Each channel is a pair of ports, one of type master and the
other of type slave. A port belongs to at most one channel. A port that does not belong
to a channel is called an external port. The other ports are called internal.
• M is an I/O automaton. Events in this automaton are of the form (v, p) where $v \in V$ and
$p \in P$.
Communication between port autom atons is done through I/O  channels. Messages sent 
from master ports to  slave ports are called commands and correspond to  output events of 
master ports and input events of slave ports. Messages sent from slave ports to  master ports 
are called responses and correspond to output events of slave ports and input events of master 
ports. A port autom aton is said to be well formed if any schedule of events corresponding to its 
ports starts with a  command and consists of alternating commands and responses. A schedule 
of events is said to  be balanced if for every command in the schedule there is a  response. A 
procedure corresponds to  a port autom aton with a t most one slave port. An object corresponds
46
to a  port autom aton with no master ports. Objects are specified by describing their external 
ports and stating all their legal schedules.
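The two schedule properties just defined are easy to state as predicates. The sketch below is an assumption-laden illustration: a schedule at one port is reduced to a list of the strings "command" and "response", and the function names are ours. The empty schedule is treated as well formed, which the text does not address.

```python
def is_well_formed(schedule):
    """Starts with a command and alternates commands and responses."""
    expected = "command"
    for kind in schedule:
        if kind != expected:
            return False
        expected = "response" if expected == "command" else "command"
    return True


def is_balanced(schedule):
    """Every command has a response (checked here by counting, which is
    sufficient when the schedule is also well formed)."""
    return schedule.count("command") == schedule.count("response")


pending = ["command", "response", "command"]   # last command still unanswered
complete = pending + ["response"]
```

With these definitions `pending` is well formed but not balanced, and `complete` is both.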
A composition of several port automatons is done by defining new channels that model
the interaction among the different components. An object $OBJ$ composed of other objects
$OBJ_1, OBJ_2, OBJ_3, \ldots, OBJ_n$ is a port automaton built by interconnecting a number of port
automatons such that every automaton is a procedure, or corresponds to one of the objects
$OBJ_1, OBJ_2, OBJ_3, \ldots, OBJ_n$. Formally, atomic objects are defined as follows [30]¹:
• An object is atomic if for every schedule of events $H$ corresponding to any of its external
ports there exist schedules $H'$ and $S$ such that $H'$ is a balanced extension of $H$, $S$ is a
sequential schedule consisting of the same events as $H'$, and $\prec_H \subseteq \prec_S$.²
3.3.2 Atomic Operations
The notion of atomicity is strongly related to the idea of a total order in a history³ of operations
in a system. When operations are assumed to be atomic, the execution of an operation $op_1$
can affect the execution of another operation $op_2$ only if $op_1$ precedes $op_2$. When a data object
is implemented in terms of smaller objects, the operations on the object are implemented in
terms of the operations on its components. These complex data objects can be described at two
levels: at the high level the description is done through axioms that define the set of operations
available to manipulate the object, and at the low level the implementation of the high level
operations in terms of the operations executed on the components (the low level operations) is
described. In Section 2.1 an example of two processes communicating through a shared register
was given. The shared register Reg consists of three smaller registers $Reg_1$, $Reg_2$, and $Reg_3$,
each one with a size of one byte. When a process wants to read from or write to Reg, it has
to individually access each of Reg's components and read from or write to it. Therefore, a read
(write) operation to Reg consists of three primitive read (write) operations to Reg's components.
¹ This definition has been modified to fit the notation used in this document.
² $S$ is called a linearization of $H$.
³ The word history is used when referring to the execution of a system and the word schedule when referring
to the execution of an automaton.
Atomicity requires that when operation executions are viewed at the high level, they appear
to be executed sequentially even if they are concurrent. Since the high level operations must be
indivisible, the corresponding low level operations cannot be interleaved in time. Using the same
example discussed in this section, consider a write operation $Wr$ to Reg executed by process $P_1$
and a read operation $Rd$ to Reg executed by a process $P_2$. As explained before, each operation
consists of three primitive operations to Reg's components. These low level operations are
denoted by $Wr_1, Wr_2, Wr_3$ for write operation $Wr$ and $Rd_1, Rd_2, Rd_3$ for read operation $Rd$
(the subscript indicates the component which is being accessed). If operations $Rd$ and $Wr$ are
atomic, the case shown in Figure 3.2(a) cannot happen. In Figure 3.2(a), the executions of the low
level operations corresponding to $Rd$ and $Wr$ are interleaved in time, i.e., $Rd$ and $Wr$ are not
executed atomically. Figures 3.2(b) and 3.2(c) show two ways these operations can be executed
atomically. The interleaving shown in Figure 3.2(a) cannot be allowed because the read operation
may return an inconsistent value. So, the operations must be executed strictly sequentially.
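The danger of interleaving the primitive operations can be reproduced in a few lines. The sketch below is an illustration only: the particular interleaving is one we chose in the spirit of Figure 3.2(a), not necessarily the one drawn there, and the byte values are assumptions.

```python
def run(reg, new_value, order):
    """Apply primitive steps in `order` and return the bytes the reader saw.

    A step ("W", i) is the primitive write Wr_i; ("R", i) is the primitive read Rd_i.
    Components are indexed 0..2 here instead of 1..3.
    """
    seen = [None] * 3
    for op, i in order:
        if op == "W":
            reg[i] = new_value[i]
        else:
            seen[i] = reg[i]
    return seen


old = [0x00, 0x00, 0x00]
new = [0xFF, 0xFF, 0xFF]

# Interleaved primitives: the read overlaps the write.
torn = run(list(old), new,
           [("W", 0), ("R", 0), ("R", 1), ("W", 1), ("W", 2), ("R", 2)])

# Serial execution: the whole write, then the whole read.
serial = run(list(old), new,
             [("W", 0), ("W", 1), ("W", 2), ("R", 0), ("R", 1), ("R", 2)])
```

In the interleaved run the reader assembles a value that is neither the old nor the new content of Reg, whereas the serial run returns the new value.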
Figure 3.2: (a) Concurrent read and write operations executed non-atomically. (b), (c) Two
different ways to execute the operations of part (a) atomically.
Simulation of Atomic Reads and Writes
Serial execution is, however, a very strong requirement. Fortunately, as discussed in Section
2, it is possible to implement the object such that it always behaves as if there were a serial
execution of the operations even though sometimes the operations are executed concurrently.
For concurrent read and write operations the object must behave as if they had occurred in some
definite order because this is the way the true atomic read and write operations would have been
executed. Non-concurrent operations must be executed by the system in the same order as they
are requested to maintain consistency with the order observed externally. For example, if it was
observed that a write operation $Wr_1$ preceded in time a write operation $Wr_2$, the object in its
attempt to simulate the serialization of operation executions cannot behave as if $Wr_2$ preceded
$Wr_1$.
Consider the concurrent operations in Figure 3.3(a). The system cannot behave as if the
operations had been executed in the order $Rd_2, Wr_1, Rd_1$ because it contradicts the actual
order of request of $Rd_1$ and $Rd_2$: $Rd_1$ precedes the execution of $Rd_2$. The correct way to simulate
the serial execution is to make the object behave as if each concurrent operation had been
executed at some point in the time interval between its invocation and its response. For the
concurrent operations in Figure 3.3(a), to simulate the serial execution of the operations, the
object may behave in different ways as shown in Figure 3.3(b), (c), and (d) (the asterisk denotes the
point in time where the operations are assumed to take place).
It is easy to check if the object is simulating serialization correctly by looking at the values
returned by the read operations. Because the operations are simulating atomic ones, a read
operation must return the value written by the latest preceding write operation in the simulated
order. For example, in the third situation in Figure 3.3 (part (d)), $Rd_1$ must return the value
written by $Wr_1$ and so must $Rd_2$.
Figure 3.3: (a) Two consecutive read operations to a shared register overlapping in time with a
write operation. (b), (c), (d) Three different ways of simulating serial execution of the operations
in part (a).
Conditions for Simulating Atomic Read and Write Operations
It should be obvious at this point that there cannot be physical concurrent execution of atomic
read and write operations. However, this can be simulated by simulating serialization of
execution of the operations. The object will behave such that to an external observer all the
operations are executed sequentially according to the following observations [21]:
1. Non-concurrent read and write operations will be executed in the object in the same order
(in time) as they are requested.
2. Concurrent read and write operations are executed in some definite order such that the
object behaves as if the operations had been executed at some point in the time interval
between their invocation and response.
3. Reading operations must return the value written by the latest preceding write according
to the simulated order.
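For very small histories the three observations can be checked by exhaustive search. The sketch below is a brute-force illustration under our own assumptions, not a method taken from [21]: an operation is a tuple `(kind, value, invoke, response)`, and a simulated order is accepted when it never places an operation before another that had already finished (observations 1 and 2) and every read returns the latest value written in that order (observation 3).

```python
from itertools import permutations


def has_serial_order(ops, initial):
    """ops: list of (kind, value, invoke, response) with kind "W" or "R"."""
    for order in permutations(range(len(ops))):
        # Reject orders that put a before b although b finished before a started.
        respects_time = all(
            not (ops[b][3] < ops[a][2])
            for k, a in enumerate(order)
            for b in order[k + 1:]
        )
        if not respects_time:
            continue
        current, consistent = initial, True
        for i in order:
            kind, value = ops[i][0], ops[i][1]
            if kind == "W":
                current = value
            elif value != current:
                consistent = False
                break
        if consistent:
            return True
    return False


# Assumed timing in the style of Figure 3.3(a): a write overlapping two reads.
good = [("W", 7, 1, 4), ("R", 0, 0, 2), ("R", 7, 3, 5)]
bad = [("W", 7, 1, 4), ("R", 7, 0, 2), ("R", 0, 3, 5)]   # new value, then old value

good_ok = has_serial_order(good, initial=0)
bad_ok = has_serial_order(bad, initial=0)
```

The first history admits the order $Rd_1, Wr, Rd_2$; the second admits none, because the first read already returned the new value. The search is factorial in the number of operations, so it is only meant for hand-sized examples.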
3.3.3 Implementing Atomic Registers
As discussed in previous sections, two operations are defined on an atomic register: atomic read
and atomic write operations. These atomic operations are implemented by lower level primitive
operations. The history of a system execution consists of a set of primitive operation executions
(also called events). If a total order can be established among the events of a system execution,
the system execution is defined to be atomic. The main interest is, however, to see if it is atomic
from the high level point of view. From an abstract point of view, the general requirements for
atomicity are:
A1. Every high level operation execution must consist of a finite nonempty set of primitive
operation executions.
A2. Each primitive operation execution must belong to one and only one high level operation
execution.
A3. For any primitive operation execution, there must be a finite number of other operations
that precede it in time.
One operation execution precedes another if all the primitive operation executions
corresponding to the first one precede all the primitive operation executions of the second. From this
precedence relationship, it can be determined if the total order can be extended to the high level
operation executions. If this is possible then the system execution is also atomic from the high
level point of view.
From the implementation point of view an atomic register consists of a set of real registers
along with some programs that access them. Any process that needs to read from or write
to the register invokes these programs. The concurrency of read and write operations occurs
when two or more of these programs are invoked at the same time. In this situation the atomic
register must behave as if the programs had been called in some definite order. A system
execution then consists of a series of calls to the reading and writing procedures. The events in
the system execution correspond to the events or primitive operation executions that implement
the procedure calls. The main restriction for the read and write procedures is that they must
not have loops or waiting statements because this could prevent processes from accessing the
register for an indefinite time. This is implied by the requirement A1 (every high level operation
execution must consist of a finite nonempty set of primitive operation executions). Also there
has to be an initial write operation that sets the register to its initial value before any read
operation. This follows from the requirement A3 (in any system execution, for any primitive
operation execution there must be a finite number of other primitive operation executions that
precede it).
An implementation of an atomic register is correct if for every system execution, a total
order can be obtained in the corresponding history of read and write operations to the register.
This total order must be consistent with the partial order defined by non-concurrent operations
in the history. In addition every read operation must return the value written by the latest
preceding write operation.
Establishing an Order among Operation Executions
To make a total order among the operations in a history of a system execution, a function is
defined to map the operations onto the set of integer numbers. It is then possible to order them
in time by assuming that given any two operations, the one with the smaller number is executed
first. If the order obtained is total, the system execution is atomic and the construction is
correct.
In a history of a system execution if there is only one writer, all the write operations are
totally ordered in time. Each execution of a writing operation can be assigned a version number
that represents the order in which it occurs in time. If in a system execution the write
operation is executed $k$ times then the write operations will be assigned numbers from 1 to $k$
($Wr_1, Wr_2, \ldots, Wr_k$). In this way it is possible to distinguish between different executions of
the write operation. The value 0 is assigned to the write operation that writes the initial value
the register had at the beginning of the history. This write operation $Wr_0$ is assumed to have
preceded all read operations.
A function $G$ is defined to assign these version numbers to the read operations according to
the following general rule:
• A read operation is assigned the version number of the write operation that wrote the
value the read returns.
A construction is proved to be atomic if for every system execution the function $G$ satisfies
the following requirements (note that a system execution consists of a set of read and write
operations):
G1. For every read operation $Rd$ there is a write operation $Wr$ whose version number is
assigned as $G(Rd)$.
G2. The write operation $Wr_{G(Rd)}$ precedes the read $Rd$, but the write operation $Wr_{G(Rd)+1}$
does not precede $Rd$.
G3. For any two read operations $Rd$ and $Rd'$, if $Rd$ precedes $Rd'$ then $G(Rd) \le G(Rd')$.
The existence of a function $G$ satisfying the above requirements ensures that the read operations
return values that were written to the register, and it is possible to obtain a total order among
the read and write operations of the system execution. Also, the construction will behave as
if the operations had been executed in that order with every read returning the value written
by $Wr_{G(Rd)}$. The requirement G3 is very important in obtaining an order for the operations.
Consider a scenario where a read operation $Rd$ gets the value written by $Wr_5$ and a later read
operation $Rd'$ gets the value written by $Wr_4$. In this case $G(Rd)$ would be 5 and $G(Rd')$ would
be 4 and this would imply that $Wr_5$ preceded $Wr_4$, which is a contradiction. So, the existence
of a mapping function ensures that a construction is atomic. A formal proof is given in [36].
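The requirements on $G$ can be checked mechanically on a recorded single-writer history. The sketch below is an illustration with assumed names and an assumed data layout. One deliberate deviation is labeled in the code: the first half of G2 is checked in a weaker form (the read must not finish before its write starts), so that a read overlapping the write whose value it returns is still accepted.

```python
def precedes(a, b):
    """a precedes b if a's response time is before b's invocation time."""
    return a[1] < b[0]


def violated_requirement(writes, reads):
    """writes[v] = (invoke, response) of Wr_v, where Wr_0 is the initial write.
    reads = [(invoke, response, g)] with g the version number G(Rd).
    Returns "G1", "G2", "G3", or None if no requirement is violated."""
    for rd in reads:
        g = rd[2]
        if not 0 <= g < len(writes):
            return "G1"                  # no write carries this version number
        if precedes(rd, writes[g]):
            return "G2"                  # assumption: weakened first half of G2
        if g + 1 < len(writes) and precedes(writes[g + 1], rd):
            return "G2"                  # a newer write had already finished
    for a in reads:
        for b in reads:
            if precedes(a, b) and a[2] > b[2]:
                return "G3"              # later read returned an older version
    return None


long_write = [(0, 0), (1, 2), (5, 20)]   # Wr_2 overlaps both reads below
short_write = [(0, 0), (1, 2), (5, 8)]   # Wr_2 finishes before the read below

new_then_old = violated_requirement(long_write, [(6, 7, 2), (9, 10, 1)])
out_of_date = violated_requirement(short_write, [(9, 10, 1)])
fine = violated_requirement(long_write, [(6, 7, 1), (9, 10, 2)])
```

The first history violates G3, the second violates G2, and the third satisfies all three requirements.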
Reading Errors
Peterson and Burns [13] defined a similar function, called the indexed value function $V$, that
assigns a number to each operation in a system execution. The write operations are assigned their
version numbers, so that no two write operations are assigned the same number. A read operation
is assigned the version number of the write operation that wrote the value it returned, with the
condition that the write operation must have preceded the reading. By using this function they
proposed the following theorem:
• A construction is correct if for all of its system executions, there is an indexed value
function $V$ such that there cannot be four (not necessarily distinct) high level operations
$G$, $G'$, $H$ and $H'$ with $V(G) = V(G')$ and $V(H) = V(H')$, where $G$ precedes $H$ and $H'$
precedes $G'$.
The proof of this theorem can be found in [13]. A knowledge graph (directed graph) is built where
the nodes are named with unique labels for every read and write operation. For example, if two
values read or written were assigned labels $A$ and $B$, then the graph contains two nodes named
$A$ and $B$. The edges in the graph represent the precedence relations between the values. If there
is an edge from $A$ to $B$ then there exist $op(A)$ and $op(B)$ ($op$ is a write or a read operation)
such that $op(A)$ preceded $op(B)$. It is shown that a cycle in the graph implies nonatomicity,
and the graph is acyclic if and only if the system execution is atomic. Informally, if the graph
is acyclic then it is possible to obtain a total order among the operations, which implies that
the system execution is atomic. Since an operation can be either a writing or a reading, it is
necessary to analyze sixteen different cases. However, simplifications can be made, and for the
case of a single writer the sixteen cases can be collapsed into two, which are:
1. $W_i$ preceding $W_j$ and $R_j$ preceding $R_i$, with $V(W_i) = V(R_i)$ and $V(W_j) = V(R_j)$.
2. $W_i$ preceding $W_j$ and $W_j$ preceding $R_i$, with $V(W_i) = V(R_i)$.
These two bad cases are considered reading errors and have specific names. The first one is
called the new-then-old reading error. It consists of a read operation returning a new (most
recent) value from the register and a strictly later read operation returning an old (previous to
the most recent) value. The requirement G3 for the function $G$ ensures that this error does not
happen. The second one is called the out-of-date reading error. It consists of a read operation
returning an old value even though the most recent write operation that had placed a new value
in the register was completely finished before the reading. Condition G2 for the function $G$
prevents this error from happening. The second type (out-of-date) of reading error is very easy
to prevent in atomic constructions with a single writer because every write operation completely
erases the previous value.
Chapter 4
Sharing Memory in Asynchronous
Message Passing Systems
In this chapter we present an algorithm to simulate Read-Modify-Write registers in a message
passing system with unreliable asynchronous processors and asynchronous communication. The
algorithm works correctly in the presence of a strong adversary that can stop up to $T$ processors or
stop the delivery of their messages, where $T = \lfloor N/2 \rfloor - 1$ and $N$ is the number of processors in
the system. This is the best resilience that can be achieved in message passing systems. The
high resilience of the algorithm is obtained by using randomized consensus algorithms and a
robust communication primitive. The use of this primitive allows a processor to exchange local
information with a majority of processors in a consistent way and, therefore, to take decisions safely.
The simulator makes it possible to translate algorithms for the shared memory model into
algorithms for the message passing model. With some minor modifications the algorithm can be used to
robustly simulate shared queues, shared stacks, etc.
4.1 Introduction
In the shared memory model for intercommunication in multiprocessor systems, several
synchronization primitives have been proposed to coordinate actions among the processors. The
most common use of these primitives is to resolve conflicts when there are concurrent requests
of access to a shared resource. These primitives are also used when the processors need to select
some unique values, e.g., id's, time stamps, etc. On the other hand, the agreement problems in
which all processors select the same value can also be solved using these synchronization
primitives. In all cases the purpose of a primitive is to let a processor execute some specific operation
without being interrupted. The type of operations we are referring to always involves access
to a shared data object and the uninterrupted (atomic) access to it is absolutely necessary to
guarantee the correct execution of the operation.
The most basic synchronization primitives are the atomic read and write instructions. These
instructions guarantee that concurrent reads and writes to the object will be executed in a serial
manner according to some definite order to maintain consistency of the shared data. A thorough
study of these primitives can be found in [12], [15], [13], [44], [46], [28], [52], [53].
A very useful primitive is the read-modify-write instruction which atomically reads an object
and writes a new value that is a function of the current value. Instructions that belong to this
category are Test-and-Set, Fetch-and-Add, Swap, and Compare-and-Swap. A good description
of these primitives can be found in [27], [33], [28].
Another category of primitives includes memory to memory operations. Examples of this
type are the move instruction which atomically copies the value of one shared object into another,
and the multiple assignment instruction which assigns values to more than one shared data
object atomically. Synchronization can also be obtained through the use of shared queues,
shared stacks, shared lists, etc. [28].
There are several ways to implement these primitives in shared memory architectures. In
a system with a large number of processors, the shared memory board is connected to the
processors through a multistage interconnection network. To resolve conflicts among concurrent
operations to the same memory location a technique called combining is used. With this
technique, two messages requesting to execute operations $f$ and $g$ on the same shared variable
arriving at a network switch are combined into one. The operation the new message carries
is a combination of $f$ and $g$ ($f \circ g$). When the new combined message arrives at the memory
block, the result of the combined operation is stored in the variable. Next, the old content ($v$)
of the variable is returned in a reply message to the switch. The switch remembers which
message arrived first, and returns $v$ to the corresponding sending processor. The other processor
is returned $f(v)$ (or $g(v)$). The combining technique is very useful to implement several types
of read-modify-write instructions and was used in the construction of the New York University
Ultracomputer [27].
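The combining step can be illustrated with two fetch-and-add requests meeting at one switch. The sketch below makes several assumptions that are ours: memory is a dictionary, the switch is a single function, and the request that arrived first is applied first, so the combined operation computes `g(f(v))`.

```python
def combine_at_switch(memory, addr, f, g):
    """Combine request f (arrived first) with request g (arrived second).

    Only one combined request reaches memory. Returns the pair
    (reply to the first requester, reply to the second requester).
    """
    def combined(v):
        return g(f(v))               # assumed order: f is applied before g

    old = memory[addr]               # old content v of the variable
    memory[addr] = combined(old)     # result of the combined operation is stored
    return old, f(old)               # the switch splits the single reply in two


memory = {"x": 10}
first_reply, second_reply = combine_at_switch(
    memory, "x",
    lambda v: v + 3,                 # fetch-and-add 3
    lambda v: v + 4,                 # fetch-and-add 4
)
```

Both requesters see exactly what they would have seen if the two operations had been executed one after the other at the memory: the first receives 10, the second receives 13, and the variable ends up holding 17.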
Can we use these concepts to achieve synchronization in message passing systems? Bar-Noy
and Dolev [4] took the first step in this direction by proposing building blocks: one for the
shared memory model and one for the message passing model. If an algorithm for the shared
memory model is written using the building block, it is possible to translate the algorithm to
work in the message passing model by using the equivalent block, and vice versa. One of their
goals is to identify the basic elementary operations in the two models and develop a general
translation scheme between them. In a more recent work [3] the authors have presented an
algorithm to emulate atomic single writer multireader registers in a message passing system.
One important implication of this translation scheme is that if there is a solution
to a given problem in one of the models then a solution to the same problem can be developed
in the other model too.
4.2 Model
A message passing multiprocessor system can be characterized by a set of parameters that
describes the behavior of its components. These parameters model the environment in which an
algorithm will work. The type of parameters of special importance in this paper is the one that
describes the degree of synchronism in the system among the processors and in the communication
mechanism. The algorithm presented in Section 4 handles asynchronous processors and
asynchronous communication. The message delivery order, however, must be synchronous. This
could be relaxed but it would complicate the description of the communication primitives. In
addition, we assume that the processors have fail-stop type of failures.
Processors in a system are said to be asynchronous if any processor can wait an arbitrary
amount of time between two of its own steps. If processors are synchronous then there is
a constant $\phi$ such that in any time interval in which some processor makes $\phi + 1$ steps all
nonfaulty processors make at least one step.
The communication in the system is asynchronous if messages can take an arbitrary amount
of time to be delivered. Synchronous communication implies that there is a constant $t_d$ such
that every message sent has to be delivered within time $t_d$ to its destination. Another type of
asynchrony in the communication is the message delivery order. If messages can be delivered out
of order then the message delivery order is said to be asynchronous. If messages are delivered in
the same order as they were sent then the delivery is synchronous. Notice that the synchronous
message delivery order guarantees that if two messages are sent by two different processors to
the same destination, the one that was sent at the earlier time arrives first.
Processor and communication asynchrony are potential sources of nondeterminism that an
algorithm must be able to handle when solving a specific problem. Another source of
nondeterminism is the type of processor failures that the system can have. Two types of processor
behavior under failures are usually considered: fail-stop and Byzantine. A processor has
fail-stop behavior if any failure makes the processor stop doing any computation or sending messages.
A processor has Byzantine behavior if a failure makes it behave erratically, making wrong
computations or sending contradicting or corrupted messages. Obviously, the first type of processor
is easier to deal with than the second one.
All these sources of nondeterminism in the system are well represented by a game between
an adversary and a given algorithm. The goal of the adversary is to make the algorithm fail.
For that purpose the adversary is assumed to have some or all of the following privileges:
• Instruct the processors when to operate, or even stop up to $T$ processors and not let
them restart at all (processor asynchrony). The factor $T$ is a resilience parameter. If an
algorithm can withstand up to $T$ processor failures and continue producing correct results then
it is $T$-resilient.
• If a processor is waiting for $N - 1$ messages from other processors, the adversary can suspend
up to $T$ messages (communication asynchrony), $T = \lfloor N/2 \rfloor - 1$, where $N$ is the total number
of processors.
• Dictate the delivery time for messages sent and thus make them arrive out of order (message
delivery order asynchrony).
• Arrive at its decisions by observing the messages to be sent and the internal state of
all the processors. If, however, an algorithm takes random steps like flipping a coin, the
adversary should not be able to predict the outcome of future random steps.
In the next section, a communication primitive that allows processors to exchange local
information robustly is discussed.
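A short calculation shows why this resilience bound leaves a processor with a majority to talk to. The sketch below reads the bound as $T = \lfloor N/2 \rfloor - 1$, which is our interpretation of the formula in the text; the function names are also assumptions. It computes $T$, the number $N - T$ of processors that can always be reached, and the smallest possible overlap between two such groups.

```python
def resilience(n):
    """Return (T, quorum) for n processors, reading T as floor(n/2) - 1."""
    t = n // 2 - 1
    return t, n - t


def min_overlap(n):
    """Smallest possible intersection of two groups of size n - T
    (by inclusion-exclusion, two subsets of size q share at least 2q - n members)."""
    _, quorum = resilience(n)
    return 2 * quorum - n


t, quorum = resilience(7)
overlap = min_overlap(7)
```

For seven processors this gives $T = 2$, groups of five, and an overlap of at least three processors, so any two groups that complete an exchange have members in common.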
4.3 Communication Primitive Exchange
The processors use the primitive Exchange to exchange their knowledge. Exchange ensures
that the global information collected by a processor is consistent with that collected by a majority of
processors. Exchange requires every processor to have an array to store information about all
processors in the system including itself. By invoking Exchange, a processor sends its array
and receives the arrays from other processors. This allows the processor to update its local
information. In order to know which information is the most recent, timestamps must
be used. The timestamps are generated locally by the processors. With all the elements in the
array timestamped, it is possible to compare them and keep the most recent one.
Notation
From this point onward, Ts(A[i]) will denote the timestamp of the element i in the array
A. The procedure Exchange for any processor is given below:
Procedure Exchange (Myarray)
begin
    Counter := 0;
    Done := false;
    while not Done do begin
        Broadcast(Myarray);
        Exit := false;
        while not Exit do begin
            Receive(Otherarray);
            { Otherarray is sent by a processor Pj, j in [1..N] }
            if Less(Otherarray, Myarray) then
                Discard(Otherarray)
            else
                if Equal(Otherarray, Myarray) then begin
                    Counter := Counter + 1;
                    if Counter >= N - T then Exit := true;
                    { N is the number of processors, T < N/2 - 1 }
                end-if
                else begin
                    { Otherarray holds newer entries: merge them and broadcast again }
                    Update(Myarray, Otherarray);
                    Exit := true;
                end-else;
        end-while;
        if Counter >= N - T then Done := true;
    end-while;
    Return(Myarray);
end-Exchange;
Procedure Update (Array1, Array2)
begin
    { keep, for every processor j, the entry with the newer timestamp }
    for j := 1 to N do
        if Ts(Array1[j]) < Ts(Array2[j]) then Array1[j] := Array2[j];
end-Update;
Procedure Less (Array1, Array2)
begin
    Isless := true;
    j := 1;
    while Isless and j <= N do begin
        if Ts(Array1[j]) > Ts(Array2[j]) then Isless := false;
        j := j + 1;
    end-while;
    { an identical array is not older, so it must reach the Equal test in Exchange }
    Return(Isless and not Equal(Array1, Array2));
end-Less;
Procedure Equal (Array1, Array2)
begin
    Isequal := true;
    j := 1;
    while Isequal and j <= N do begin
        if Ts(Array1[j]) ≠ Ts(Array2[j]) then Isequal := false;
        j := j + 1;
    end-while;
    Return(Isequal);
end-Equal;
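The three helper procedures can be exercised in isolation, without any networking. The sketch below is an illustration under stated assumptions: an array is a Python list with one `(timestamp, info)` pair per processor, `update` returns a new list instead of modifying its argument, and `less` is taken to be strict (an identical array is not "less").

```python
def update(mine, other):
    """Slot by slot, keep whichever entry carries the newer timestamp."""
    return [m if m[0] >= o[0] else o for m, o in zip(mine, other)]


def equal(a, b):
    """True if every slot carries the same timestamp in both arrays."""
    return all(x[0] == y[0] for x, y in zip(a, b))


def less(a, b):
    """True if `a` has no slot newer than `b` and is not identical to it."""
    return all(x[0] <= y[0] for x, y in zip(a, b)) and not equal(a, b)


mine = [(3, "a"), (1, "b"), (0, None)]
other = [(2, "x"), (4, "y"), (0, None)]

merged = update(mine, other)
stable = update(merged, other)       # merging the same array again changes nothing
```

After the merge, both inputs compare as `less` than `merged`, and no timestamp in `merged` is older than the corresponding one in `mine`. This is the monotonicity that the second lemma of this section relies on.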
Note that the indices of the arrays correspond to the processors. Therefore, two arrays are
compared entry by entry at the same index ($j$). Procedure Exchange assumes that a majority of
the processors will always remain connected with each other. A processor that gets completely
disconnected from the majority of processors is isolated by them. When the processors invoke
Exchange, they all will have an empty slot in their arrays corresponding to the disconnected
processor. Therefore, they will agree that there is no information available for that processor
and will continue with their jobs without taking the disconnected processor into consideration.
The following lemmas show that a processor belonging to a majority will successfully execute
Exchange in a finite time:
Lemma 3.1
Every invocation of the procedure Exchange terminates.
Proof
By assumption, the adversary can stop up to $T$ processors from sending messages or hide up to
$T$ messages that were sent by good processors. Every processor is then guaranteed to receive
at least $N - T$ messages containing arrays from other processors. This will allow the processor
to update its array with the most recent information. This implies that at some point of time
a majority of processors will have the same information about others. After every update a
processor sends its array and collects $N - T$ arrays from other processors. When the majority
of processors finish updating their arrays, these $N - T$ arrays will be identical. At this point
the procedure terminates.
End of proof
The following lemma shows that the arrays a processor obtains by using Exchange are
linearly ordered in time. This implies that decisions taken based on the information in the array
are safe because they are made according to the most recent information about the system.
Lemma 3.2
Let $X$ and $Y$ be two arrays collected by processor $P_i$ at times $t_0$ and $t_1$, $t_0 < t_1$. Then for all
$j \in [1..N]$, $X[j] \le Y[j]$.
Proof
Assume that $X$ and $Y$ were collected in two consecutive invocations of the procedure Exchange
by $P_i$, and $X[j] > Y[j]$ for some $j \in [1..N]$.
This implies $\exists k \mid X[k] > Y[k]$.
Processor $P_i$ passed $X$ as the parameter to Exchange, which returned $Y$. In the procedure
Exchange, $X$ can only be changed by calling the procedure Update. The procedure changes the
value of element $X[k]$ to $Y[k]$ only if it finds an element $Z[k]$ of one of the collected arrays such
that $Z[k] > X[k]$. This contradicts the assumption that $Y[k] < X[k]$, $k \in [1..N]$.
End of proof
4.4 Simulator
The algorithm in this section simulates Read-Modify-Write registers. This type of register
allows the atomic execution of the following procedure (Reg is a Read-Modify-Write register):
Procedure RMW
begin
    Temp := Reg;
    Reg := f(Temp);
    return(Temp);
end-RMW;
In procedure RMW, $f$ is a function from register values to register values. If $f$ is the
increment function then the register is a fetch-and-add register. If $f$ sets the content of Reg to
1 only when Temp is equal to 0 then the register is a test-and-set register. Other types of
Read-Modify-Write registers include Swap, Compare-and-Swap, etc. To simulate a Read-Modify-Write register in a
message passing system every processor keeps a copy of the register. Since several processors may
attempt to execute a RMW operation at the same time, it must be ensured that all processors
execute the same operation on their local copies. So, at any time, all the copies have the same
value and consistency is maintained. In our algorithm, when a processor wants to execute
an operation, it broadcasts a request message including its ID. The message also contains a
timestamp to distinguish old request messages from new ones. Each processor maintains two sets
of flags $Opreq1_j$ and $Opreq2_j$, $j \in [1..N]$, to keep track of the requesting processors. The request
messages are collected by a procedure Receive. The procedure Receive, after receiving a message
from a processor $P_k$, reads $Opreq1_k$ and writes $Opreq2_k$ with the opposite value. Receive also
stores the timestamp included in the message in a variable Tstamp. When Receive gets a request
for the first time, it retransmits it to ensure that all processors receive it. Receive knows if the
request is old by comparing the timestamps. If an incoming message carries a timestamp
older than or equal to the timestamp of the last message received, it is a duplicate request and
must be discarded. The reason for the retransmission is that the adversary is allowed to hide
messages and one of these hidden messages could be a request message. Since it is only necessary
to recognize a duplicate of a request message from a given processor, the timestamps can be
generated locally. Every processor has a variable called Opnum. This variable is incremented
every time a processor executes an operation on its local copy of the register. Opnum is used as
the timestamp for the request messages. The procedure Receive is described below ($Tstamp_j$
is the variable where Receive stores the timestamp of the last request message received from
processor $P_j$):
Procedure Receive;
begin
    upon receiving operation request message <Req, Opnum, ID>
    if (Opreq1_ID = Opreq2_ID) and (Tstamp_ID < Opnum) then begin
        Opreq2_ID := ¬Opreq1_ID;
        Tstamp_ID := Opnum;
        Broadcast(<Req, ID>);
    end;
end-receive;
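The duplicate test in Receive can be sketched in Python as follows. This is a minimal model under stated assumptions, not the thesis implementation: the class Receiver and its outbox list are hypothetical, processors are numbered from 0, Tstamp is initialized to -1 so that a first request with Opnum = 0 is accepted, and the retransmitted message is assumed to carry the timestamp along with the ID.

```python
class Receiver:
    """Hypothetical model of the per-processor state used by Receive."""

    def __init__(self, n):
        self.opreq1 = [False] * n   # Opreq1_j flags
        self.opreq2 = [False] * n   # Opreq2_j flags
        self.tstamp = [-1] * n      # Tstamp_j; -1 is an assumed "nothing seen yet"
        self.outbox = []            # messages this processor retransmits

    def receive(self, opnum, sender):
        """Handle <Req, opnum, sender>; return True if accepted as new."""
        no_pending = self.opreq1[sender] == self.opreq2[sender]
        is_newer = self.tstamp[sender] < opnum
        if no_pending and is_newer:
            self.opreq2[sender] = not self.opreq1[sender]  # mark request pending
            self.tstamp[sender] = opnum
            self.outbox.append(("Req", opnum, sender))     # retransmit to all
            return True
        return False  # duplicate or stale request: discard


r = Receiver(4)
accepted_first = r.receive(opnum=0, sender=2)   # new request from processor 2
accepted_dup = r.receive(opnum=0, sender=2)     # retransmitted copy of it
```

The second call is rejected both because a request from processor 2 is already pending and because its timestamp is not newer than the stored one.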
To resolve conflicts when two or more processors want to execute an RMW operation at the
same time, the processors are forced to reach an agreement. In the algorithm, every time a
processor wants to execute an operation it goes into a leader election process. If a processor is
elected as the leader, it knows that all other processors agree that it is its turn to execute an
operation. The leader election procedure works based on the processor IDs. The processors reach
an agreement on a single ID by going through several rounds of consensus. In order to defeat an
adversary that can stop processors and hide messages, a randomized consensus algorithm must
be used. It has been proven [26], [24] that deterministic consensus algorithms cannot work
correctly in the presence of even one faulty processor in completely asynchronous environments.
In the Leader_election procedure described in this section, a highly resilient consensus protocol
is used. It was proposed in [4]. This protocol was proven to have resilience $T = \lfloor N/2 \rfloor - 1$ in an
environment of asynchronous processors and asynchronous communication.
In the procedure Leader_election a processor starts with a suggested ID. To decide which
processor will be elected, it goes through $\lceil \log N \rceil$ rounds of consensus. In every round it reaches an
agreement with the rest of the processors on one bit of the leader's ID. If a processor loses a
round of consensus, it selects at random another ID among the processors that have requested
operations. The ID it selects must still have the chance to be selected as the leader's ID. The
IDs of the processors that have requested operations are kept in a set called Reqset. After
selecting a new ID a processor continues with the next round of consensus. This speeds up the
leader election process and makes the algorithm very resilient. The reason for the high resilience
is that all processors cooperate with each other to elect the leader even after losing a round of
consensus. With all the processors participating in the election, there will always be a majority
that elects a leader. One interesting thing to note here is that if a processor dies after making a
request, it can still win the election. The operation it requested is carried out by the remaining
good processors.
The procedure Bitconsensus used inside Leader_election works in the following way. A
processor calls the procedure with an initial value. This value is one of the bits of the ID it
is suggesting as the leader's ID. The processor maintains an array Bcr with N elements. Every
element of Bcr has three fields. The first one is Sugval, which contains the value the
corresponding processor is suggesting. The second field, Round, keeps the number of attempts a
processor has made to reach an agreement. Round is initialized to zero. The third field is
Timestamp, where a processor writes Opnum and Bitnum, the index of the leader's ID bit on which
the processors are trying to reach an agreement. This information is used by the procedure Exchange
to generate labels that identify the messages. Processors exchange values and check if all are
suggesting the same value. If this is true, Bitconsensus terminates. If there is a disagreement,
they flip a coin and adopt a new value according to the result. They exchange the values and flip
the coin again if the disagreement persists. The processors keep repeating these steps until all
flip the coin with identical values. A processor increments Round every time it initiates a new
attempt to reach an agreement. The processors are asynchronous and run at different speeds,
so some processors attempt to reach an agreement faster than others. In Bitconsensus, if a
group of processors that are ahead in their number of attempts agree on their suggested value,
then they do not flip the coin and go into the next round with the same value. Processors that
are behind in their number of attempts also adopt this value. After exchanging values, the
leaders (the processors that are ahead) check if the processors that disagree in their suggested
values are two or more rounds of attempts behind. If this is true, the leaders decide on their
suggested value and exit the procedure. The processors that are behind will eventually
notice that the leaders have reached consensus, accept that value, and exit the procedure. It
is very important to identify which processors are the leaders, because they are the ones who
decide the final value. The leading processors are the ones whose Round is the largest. To
determine the leaders at any given time, a processor needs to know the values of the Round
variables of all other processors. The procedure Exchange does this.
Ideally, if all processors flip the same coin, they all get the same result. This corresponds to
flipping a globally shared coin. With such a coin Bitconsensus terminates in at most two
rounds. It has been proven, however [29], that a perfectly unbiased global coin cannot be built.
The procedure Flip_Global_Coin implements a biased globally shared coin. For $\gamma > 1$, a biased
global coin has the property that with probability greater than [illegible] all the processors flip the
same value. In the Flip_Global_Coin procedure every processor flips a local coin and, depending
on the result, increments or decrements a counter. The processor then gets the values of the
counters of the other processors and adds them up. If the result is greater than $\gamma N$ then it decides
one. If the result is less than $-\gamma N$ then it decides zero. Otherwise it repeats the steps above. To
keep information about other processors, every processor maintains an array Coin of size N. Every
element in Coin has two fields. The first field, Contribution, keeps the value of a processor's
counter. The second field is Timestamp, where a processor stores the values of Opnum, Bitnum,
Coinnum, and Iteration. The variable Coinnum contains the number of times a processor has
flipped a coin in a given consensus round. Iteration keeps the number of times a processor has
flipped a local coin while trying to decide a value in Flip_Global_Coin. These values are used by
Exchange as labels to distinguish old messages from new ones. The expected number of rounds
the processors will go through before flipping the same coin is $(\gamma + 1)^2 N^2$. The proofs of these
results can be found in [4]. IDpool is a set containing the IDs of all processors.
MYID is the ID of the executing processor.
Procedure Leader_election(Opnum, ID);
begin
    for i := 1 to ⌈log N⌉ do begin
        { ID_i and Leader_i are the ith bit of ID and Leader, respectively }
        Leader_i := Bitconsensus(Opnum, i, ID_i);
        if Leader_i ≠ ID_i then
            ID := Randomsel{ NewID ∈ Reqset | NewID_j = Leader_j
                             ∀ j ∈ (1..i) }
        if ID = NULL then begin
            Found := false;
            repeat
                for all NewID ∈ {IDpool - Reqset}, NewID ≠ MYID do begin
                    Requestarrived := (Opreq1_NewID ≠ Opreq2_NewID);
                    if Requestarrived and (NewID_j = Leader_j, j ∈ (1..i)) then begin
                        ID := NewID;
                        Found := true;
                        Exit for-all loop;
                    end-if;
                end-for-all;
            until Found;
        end-if;
    end-for;
end-Leader_election;
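The bit-by-bit narrowing performed by Leader_election can be sketched in Python as a lock-step simulation. Several simplifications are assumptions of this sketch and not part of the thesis: IDs are numbered from 0, every processor holds the same Reqset, all processors advance through the rounds together, and the agreement step is replaced by a stand-in that picks one of the proposed bits at random instead of running Bitconsensus. The function names are hypothetical.

```python
import random


def bit(ident, i, nbits):
    """Return the i-th bit (1-based, most significant first) of ident."""
    return (ident >> (nbits - i)) & 1


def elect_leader(suggested, reqset, n, rng):
    """Simulate Leader_election for all processors in lock step.

    suggested: dict processor -> ID it initially suggests (each in reqset).
    Returns the set of IDs the processors end up with.
    """
    nbits = max(1, (n - 1).bit_length())   # equals ceil(log2 n) for n >= 2
    current = dict(suggested)
    prefix = []                            # leader bits decided so far
    for i in range(1, nbits + 1):
        proposals = [bit(ident, i, nbits) for ident in current.values()]
        decided = rng.choice(proposals)    # stand-in for Bitconsensus
        prefix.append(decided)
        # Requesting IDs that still agree with every decided bit.
        alive = [r for r in sorted(reqset)
                 if all(bit(r, j, nbits) == prefix[j - 1]
                        for j in range(1, i + 1))]
        for p, ident in current.items():
            if bit(ident, i, nbits) != decided:   # p lost this round
                current[p] = rng.choice(alive)    # pick a still-eligible ID
    return set(current.values())


rng = random.Random(0)
leaders = elect_leader({0: 0, 1: 0, 2: 3, 3: 3}, reqset={0, 3}, n=4, rng=rng)
```

Because every surviving suggestion matches all decided bits, the processors hold a single common ID after the last round, and that ID belongs to a processor that requested an operation.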
Procedure Bitconsensus(Opnum, Bitnum, Val);
begin
    for j := 1 to N do begin
        Bcr[j].Round := 0;
        Bcr[j].Sugval := 0;
    end-for;
    while true do begin
        Bcr[MYID].Round := Bcr[MYID].Round + 1;
        Bcr[MYID].Sugval := Val;
        Bcr[MYID].Timestamp := (Opnum, Bitnum);
        Exchange(Bcr);
        Maxround := max_{1 ≤ k ≤ N} (Bcr[k].Round);
        if ( (Bcr[MYID].Round = Maxround) and (∀k, k ∈ (1..N) and
             k ≠ MYID | (Bcr[k].Sugval ≠ Bcr[MYID].Sugval) →
             (Bcr[k].Round < Bcr[MYID].Round - 1) ) ) then
            return(Bcr[MYID].Sugval);
        else
            if (∃v | ∀k, Bcr[k].Round = Maxround, Bcr[k].Sugval = v) then
                Val := v;
            else
                Val := Flip_Global_Coin(Opnum, Bitnum, Bcr[MYID].Round);
    end-while;
end-bitconsensus;
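The three-way choice made inside the while loop of Bitconsensus can be isolated as a pure function. The Python sketch below is an illustration only: the function name and the flattening of Bcr into a list of (round, sugval) pairs are hypothetical, the Timestamp field is omitted, and processors are indexed from 0.

```python
def bitconsensus_step(bcr, me):
    """Evaluate one pass of the Bitconsensus loop after Exchange.

    bcr: list of (round, sugval) pairs, one per processor.
    Returns ("decide", v), ("adopt", v) or ("flip", None).
    """
    my_round, my_val = bcr[me]
    max_round = max(r for r, _ in bcr)
    # Decide: I am a leader and every dissenter is at least two rounds behind.
    if my_round == max_round and all(
        r < my_round - 1
        for k, (r, v) in enumerate(bcr)
        if k != me and v != my_val
    ):
        return ("decide", my_val)
    # Adopt: all leaders (entries at max_round) suggest the same value.
    leader_vals = {v for r, v in bcr if r == max_round}
    if len(leader_vals) == 1:
        return ("adopt", leader_vals.pop())
    # Otherwise the leaders disagree, so the shared coin is flipped.
    return ("flip", None)


far_behind = bitconsensus_step([(3, 1), (3, 1), (1, 0)], me=0)   # dissenter 2 rounds back
close_behind = bitconsensus_step([(3, 1), (3, 1), (2, 0)], me=0) # dissenter 1 round back
split = bitconsensus_step([(3, 1), (3, 0), (2, 0)], me=0)        # leaders disagree
```

In the first case the leader may decide immediately; in the second it keeps its value and tries another round; in the third it falls back to the coin.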
Procedure Flip_Global_Coin(Opnum, Bitnum, Coinnum);
begin
    for j := 1 to N do begin
        Coin[j].Contribution := 0;
    end-for;
    Iteration := 0;
    while true do begin
        Iteration := Iteration + 1;
        Coin[MYID].Contribution := Coin[MYID].Contribution + Localcoinflip;
        { Localcoinflip returns +1 or -1 with equal probability }
        Coin[MYID].Timestamp := (Opnum, Bitnum, Coinnum, Iteration);
        Exchange(Coin);
        Globalvalue := Σ_{j=1..N} Coin[j].Contribution;
        if Globalvalue > γN then return(1);
        if Globalvalue < -γN then return(0);
    end-while;
end-Flip_Global_Coin;
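The counter-based coin can be simulated in Python as a random walk. The sketch below is idealized and its names are hypothetical: it assumes that Exchange always returns every processor's latest counter, so all processors see the same sum. Under that assumption every processor returns the same value; in the asynchronous setting processors may act on stale counters, which is why the real coin is only biased rather than perfect.

```python
import random


def flip_global_coin(n, gamma, rng):
    """Idealized simulation of the shared coin; returns (value, steps).

    n: number of processors, gamma: threshold factor,
    rng: a random.Random instance supplying the local coin flips.
    """
    contribution = [0] * n        # Coin[j].Contribution for each processor
    steps = 0                     # total number of local coin flips
    while True:
        for p in range(n):        # each processor flips its local coin in turn
            contribution[p] += rng.choice((1, -1))
            steps += 1
            total = sum(contribution)        # Globalvalue
            if total > gamma * n:
                return 1, steps
            if total < -gamma * n:
                return 0, steps


value, steps = flip_global_coin(n=4, gamma=2, rng=random.Random(0))
```

With n = 4 and gamma = 2 the walk must leave the interval [-8, 8], so at least nine local flips are needed before any value is returned.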
The procedure used to execute operations on a processor's local copy is Execop. As explained
before, at any time a processor Pi may request to execute an operation. If it succeeds in being
elected the leader, the other processors execute the operation it requested in their local copies.
Execop continuously checks the flags Opreq1_j and Opreq2_j to find out if any processor Pj wants
to execute an operation. If these flags have different values, the procedure Receive has received a
request message from Pj. The IDs of all the requesting processors are written in the set Reqset by
Execop. If Execop receives a request from its local processor, it broadcasts a request message.
Then it calls Leader_election. Next it executes the operation requested by the winner of the
election in the local copy. Execop also resets the request flags Opreq corresponding to the winning
processor by reading Opreq2 and writing Opreq1 with the same value. If the executing processor
Pi is not the winner, Execop calls Leader_election again, and repeats the above steps until Pi
is elected the leader. After Pi executes its local operation, Execop looks into the processors
in Reqset and tries to elect them as leaders. By making Execop help other processors before
executing another operation requested by its own processor, it is guaranteed that every processor
succeeds in executing its operation in finite time. The local processor Pi running Execop requests
to execute an operation by calling the procedure Myrequest. Myrequest signals Execop by
reading Opreq1_i and writing Opreq2_i with the opposite value. Myrequest then waits until
Execop executes the operation by winning an election round. Execop puts the old value of the
register in a variable called Retval. Next, it signals Myrequest by writing Opreq1_i with the
value of Opreq2_i. Myrequest gets the value from Retval and returns it to Pi. In the procedure
below, the function Randomsel selects at random an ID from the set Reqset, and f is the function
the processors use to change the value of the register:
Procedure Execop;
var Opnum: integer;
    Reqset: Set;
    Object, Retval: Register;
    ID: integer;
begin
    Opnum := 0;
    Reqset := ∅;
    while true do begin
        for j := 1 to N do
            if ( (Opreq1_j ≠ Opreq2_j) and (j ≠ MYID) ) then
                Reqset := Reqset + [j];
        Myop := false;
        Otherop := false;
        if Opreq1_MYID ≠ Opreq2_MYID then
            Myop := true;
            ID := MYID;
            Broadcast(<Req, ID>);
        end-if
        else
            if Reqset ≠ ∅ then
                ID := Randomsel{ID ∈ Reqset};
                Otherop := true;
            end-if;
        if Myop or Otherop then begin
            Leader := Leader_election(Opnum, ID);
            if Leader ≠ ID then
                Wait until Opreq1_Leader ≠ Opreq2_Leader;
            Retval := Object;
            Object := f(Retval);
            Opnum := Opnum + 1;
            Opreq1_Leader := Opreq2_Leader;
            if Leader = MYID then
                repeat
                    ID := Randomsel{ID ∈ Reqset};
                    Leader := Leader_election(Opnum, ID);
                    Retval := Object;
                    Object := f(Retval);
                    Opnum := Opnum + 1;
                    Opreq1_Leader := Opreq2_Leader;
                    if (ID = Leader) then
                        Reqset := Reqset - [ID];
                until Reqset = ∅
        end-if;
    end-while
end-execop;
Procedure Myrequest;
begin
    Opreq2_MYID := ¬Opreq1_MYID;
    wait until Opreq2_MYID = Opreq1_MYID;
    return(Retval);
end-myrequest;
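The two-flag handshake between Myrequest and Execop can be traced with a small Python model. This is a sequential sketch with hypothetical names: the waiting loops are not modelled, and f is taken to be the increment function purely for illustration.

```python
class RequestFlags:
    """Hypothetical model of one processor's pair Opreq1_i / Opreq2_i."""

    def __init__(self):
        self.opreq1 = False   # written by Execop
        self.opreq2 = False   # written by Myrequest (or by Receive)

    def signal_request(self):
        # Myrequest: Opreq2 := not Opreq1
        self.opreq2 = not self.opreq1

    def pending(self):
        # A request is outstanding exactly when the flags differ.
        return self.opreq1 != self.opreq2

    def acknowledge(self):
        # Execop: Opreq1 := Opreq2 after executing the operation.
        self.opreq1 = self.opreq2


flags = RequestFlags()
register, retval = 5, None

flags.signal_request()            # Myrequest asks for an operation
if flags.pending():               # Execop notices the request
    retval = register             # the old value goes to Retval
    register = register + 1       # f is assumed to be increment here
    flags.acknowledge()           # Myrequest may now stop waiting
```

Each flag has one kind of writer, so a request is recorded by making the flags differ and cleared by making them equal again, whatever their current values are.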
4.5 Correctness
To prove the correctness of the simulation algorithm for Read-Modify-Write registers we need
to prove that it satisfies the following conditions:
C1) Any processor will execute a requested operation in finite time.
C2) A requested operation is not executed more than once.
C3) If a processor executes an operation in its local copy, the other processors execute the same
operation in their local copies. This guarantees that the processors will see the same value for
the object at any time and that consistency is maintained.
Lemma 4.1
Every invocation of the procedure Leader_election terminates.
Proof
A processor executing Leader_election starts with an initial ID passed to it by Execop. The
processor goes through $\lceil \log N \rceil$ rounds of consensus to get its initial ID selected as the leader's
ID. If it loses any consensus round, it selects another ID from Reqset that can still be selected.
If all processors in Reqset have lost the election, it waits to receive the request message(s)
from the processor(s) that won the last consensus round. This message is guaranteed to arrive
in finite time because every processor retransmits request messages received from others.
When the message arrives, the processor adopts the ID of the sender, and continues executing
Leader_election. All processors participate in every consensus round until the last bit of the
leader's identifier is selected. If some processors fail, by our assumption a majority of $N - T$
processors will always be working to select a leader. Therefore, the procedure Exchange will
always terminate. This implies that Bitconsensus and, consequently, Leader_election will also
terminate.
End of proof
Lemma 4.2
Any processor will succeed in becoming a leader within a finite number of rounds of leader election.
Proof
A processor Pj enters a leader election round after broadcasting a request message. If none of
the N - 1 remaining processors is attempting to be a leader, when they receive Pj's request
they will enter the process of leader election with Pj's ID and will elect Pj the leader. If one
or more of the remaining processors want to be the leader, they will include Pj's ID in the set
Reqset after receiving the request from Pj. When all processors receive Pj's request, Pj's ID
will be in the Reqset of all processors. As explained earlier, after a processor succeeds in executing
an operation it requested, it helps the processors in Reqset to be elected leaders. Because
Pj's ID is in every Reqset, eventually it will be elected the leader. It is not possible to
know the exact number of rounds of leader election that will be executed before Pj is elected
the leader, since Pj's request messages may take an arbitrary amount of time to arrive at their
destination. If the messages arrive instantaneously, Pj would succeed after at most 2N - 1 rounds
of leader election.
End of proof
Lemma 4.3
Every invocation of the procedure Myrequest terminates.
Proof
When Myrequest is invoked, it signals Execop by writing to Opreq2 the opposite value of Opreq1.
It then waits until their values become equal. Execop will set them equal after electing its
processor a leader. Lemmas 4.1 and 4.2 guarantee that any processor becomes the leader in
finite time. Then Execop sets Opreq1 and Opreq2 equal. This allows Myrequest to terminate.
End of proof
Theorem 4.1
Condition C1 is satisfied.
Proof
A processor requests to execute an operation by calling Myrequest. Since Myrequest terminates
in finite time (Lemma 4.3), the processor also executes its operation in finite time.
End of proof
Theorem 4.2
Condition C2 is satisfied.
Proof
An operation requested by a processor Pj is executed more than once only if another processor
Pi receives a duplicate of Pj's request and takes it as a new request. Pi then enters the leader
election process with Pj's ID and elects it the leader. But a duplicate message will not be
accepted. The request messages are timestamped. An incoming message is recognized as a new
request of Pj only if it carries a timestamp greater than that of the last request message received
from Pj.
End of proof
Lemma 4.3
For every invocation of Leader_election at most one leader is elected.
Proof
The processor IDs are assumed to be unique and therefore differ in one or more bits. In every
round of Leader_election the IDs that do not correspond with the bit selected are discarded.
The processors suggesting those IDs select new ones among the ones that can still be selected
as the leader's ID. Since all IDs are different, as processors go through the rounds of consensus,
the number of IDs that can be selected becomes smaller and smaller. In the last round they
will have at most two choices. One of these IDs loses the election and the other becomes the
elected leader's ID.
End of proof
Theorem 4.3
Condition C3 is satisfied.
Proof
Every time a processor wants to execute an operation, it goes through the process of leader
election. Every invocation of Leader_election elects at most one processor (Lemma 4.3). In
Execop all processors execute only the operation requested by the elected leader.
Since there is only one leader, all processors execute the same operation in their local copies.
End of proof
Theorem 4.4
The algorithm simulates Read-Modify-Write registers in a message passing system.
Proof
Theorems 4.1, 4.2 and 4.3 show that conditions C1, C2 and C3 are satisfied. This proves the
correctness of the algorithm.
End of proof 
Theorem 4.5
The algorithm works correctly in the presence of up to $\lfloor N/2 \rfloor - 1$ processor failures.
Proof
It is necessary to prove that the good processors will be able to execute their operations when
up to $\lfloor N/2 \rfloor - 1$ processors have failed. To execute an operation a processor needs to be elected
the leader. It first broadcasts a request message and then calls the procedure Leader_election.
Good processors will be able to execute operations if the procedure Leader_election always
terminates. After a leader is elected, all good processors execute the operation requested by the
leader.
In order to reach an agreement on every bit of the leader's ID, Leader_election calls the
procedure Bitconsensus. To decide a value, Bitconsensus needs to collect information about
the other processors. For this purpose it calls the procedure Exchange. To collect information
about all processors, Exchange needs to communicate with a majority of processors. Information
about a processor will be stored in at least one node belonging to this majority. If at most
$\lfloor N/2 \rfloor - 1$ processors fail, there is always a majority of processors that can exchange information.
This collected information is used by Bitconsensus. The latter eventually decides a value and
terminates. The procedure Leader_election, which calls Bitconsensus $\lceil \log N \rceil$ times, also
eventually terminates and a leader is elected. The good processors will therefore be able to execute
their operations.
End of proof
4.5.1 An example
Consider a message passing distributed system with 4 processors: P1, P2, P3, and P4. These
processors need to generate unique labels¹ such that when a processor selects one, no other
processor can select the same. One solution to this problem is to have a fetch-and-add register.
Every time a processor needs a label, it executes a fetch-and-add(1) instruction on the register
and uses the returned value as the label.
In the simulation, every processor keeps a local copy of the fetch-and-add register. These
registers are R1, R2, R3, and R4, belonging to processors P1, P2, P3, and P4 respectively. The
request flags are initialized as Opreq1_j = Opreq2_j for every processor Pj, j = 1 to N. Opnum for
every processor is initialized to zero and Reqset is initialized as an empty set.
Assume that P1 and P4 want to execute a fetch-and-add(1) operation at the same time.
Both call Myrequest, which signals Execop by setting Opreq2 = ¬Opreq1. Execop sends request
messages to all other processors because this operation is requested by its own processor. P1
sends request messages to P2, P3, and P4. P4 sends request messages to P1, P2, and P3. Execop
running in processors P1 and P4 ignores the request messages because it gives priority to the
requests made by its own processor. Execop puts the ID of any requesting processor in
Reqset. The procedure Receive in processors P2 and P3 gets the request messages and sets
Opreq1_1 ≠ Opreq2_1 and Opreq1_4 ≠ Opreq2_4. Execop running in P2 and P3 selects at random
one of the requests and calls Leader_election with the ID of the selected processor. Assume
that P2 selects the request made by P1 and P3 selects the request made by P4. At this point
all processors are in the process of electing a leader. The processors will go through two rounds
of consensus to select the two bits of the leader's ID. Because the processors try to reach an
agreement by taking random steps (flipping a coin), it is not possible to know beforehand which
one will be elected the leader. Suppose P1 is elected the leader. Every processor executes the
operation requested by P1 in its local copy. Also, they reset the request flags belonging to
P1 by writing Opreq1_1 = Opreq2_1. Execop in processor P1 has the ID of P4 in its Reqset. So
P1 tries to help P4 get elected the leader. The other processors have P4's request also, so they
call Leader_election with P4's ID. Because all processors call the Leader_election procedure with
P4's ID, P4 is elected the leader. All processors execute the operation requested by P4.

¹ These labels can be timestamps used to resolve the order of access of several processors to a database.
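The outcome of this example can be reproduced with a short Python sketch. It models only the effect of the elections, namely that every processor applies the same sequence of fetch-and-add(1) operations to its local copy; the function name run_replicas and the string processor names are hypothetical, and the order of winners is supplied as input rather than computed by an election.

```python
def run_replicas(n, elected_order):
    """Apply fetch-and-add(1) on n local copies in the elected leader order.

    elected_order: sequence of processor names, one per election won
    (assumed identical at every processor, as the elections ensure).
    Returns (local copies, label handed to each requester).
    """
    copies = [0] * n
    labels = {}
    for leader in elected_order:
        old = copies[0]
        for p in range(n):             # every processor updates its own copy
            assert copies[p] == old    # replicas agree before each operation
            copies[p] = old + 1
        labels[leader] = old           # Retval returned to the requester
    return copies, labels


# P1 wins the first election; the others then help P4 win the second.
copies, labels = run_replicas(4, elected_order=["P1", "P4"])
```

Here P1 obtains label 0 and P4 obtains label 1, and all four local copies hold the value 2 afterwards.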
Chapter 5
Distributed Simulation using ISIS
In this chapter we discuss the use of the ISIS Distributed Programming Toolkit for the simulation
of distributed algorithms. We start with a brief description of the ISIS system and then
describe the way we have used it to simulate the algorithm presented in Chapter 4. ISIS
is a UNIX-based programming environment designed to provide the programmer with a set of
tools that make it simple to write adaptive and reconfigurable solutions to applications
that must withstand failures in the processing units or in the communication
mechanisms. Its features become most evident when writing solutions to problems that require
coordinated and cooperative work from a set of processors, and when we need to manage
replicated data.
5.1 The ISIS System
ISIS is based on processes and messages. An ISIS process can be defined as an address space
containing one or more lightweight tasks. Processes communicate through messages. Different
types of messages are handled by separate tasks. Task execution is FIFO and non-preemptive,
but tasks can wait on and signal condition variables when required. ISIS tolerates certain types of
failures, such as lost messages, but cannot handle network partitions. It also supports
reconfiguration and continued execution after crash failures, as long as the crashing sites or programs
just stop executing and do not send corrupted messages or go into infinite loops. ISIS supports
a virtually synchronous environment, in which processes can be structured into process groups.
Events such as broadcasts to the group, changes in the group membership, and state transfer
appear to occur (to the members of the group) synchronously, or in other words instantaneously.
This is of particular importance if we consider that there are cases in which the result of a
distributed computation is affected by the order in which events are observed. When several
processes are working cooperatively on the solution of a problem, if broadcasts to the processors
arrive in different orders, or if failures are observed in different orders by different processors in
the group, inconsistencies may arise. By supporting a virtually synchronous environment, ISIS
offers the possibility of writing programs as if they were going to work in an ideal setting. The
compiler is in charge of producing code that is able to run in a realistic environment. ISIS runs
programs designed for synchronous environments, relaxing any kind of synchronization that the
algorithm does not rely upon. This helps to execute programs in a more parallel fashion. The
implementation of the idea of virtual synchrony is based on a careful analysis of the ordering
requirements of the application being run.
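The inconsistency that arises when broadcasts are observed in different orders can be seen in a few lines of Python. The sketch is a generic illustration and does not use any ISIS interface; the two update functions are hypothetical examples of operations that do not commute.

```python
def apply_all(initial, updates):
    """Apply a sequence of update functions to a replica's state."""
    state = initial
    for update in updates:
        state = update(state)
    return state


def add_ten(x):
    return x + 10      # update broadcast by one process


def double(x):
    return x * 2       # update broadcast by another process


replica_a = apply_all(1, [add_ten, double])   # observes add_ten first
replica_b = apply_all(1, [double, add_ten])   # observes double first
```

Starting from the same value 1, the first replica ends with 22 and the second with 12, so the two copies diverge even though both received exactly the same updates.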
The programming tools offered by ISIS reflect the intention of the designers to address
a particular set of applications in distributed systems. The type of application addressed is
one that, in order to be implemented efficiently, needs to be decomposed into orthogonal
components that can be treated separately. ISIS concentrates on solving the problems caused by the
asynchronous propagation of information among processes. Processes learn about the occurrence
of a particular event through the arrival or non-arrival of messages. This implies that processes
in a system may observe events occurring in different orders if the messages arrive in different
orders. This is a potential source of inconsistency because there are cases in which
a distributed computation is affected by the order in which events are observed. The tools
have orthogonal functionality. They permit the programmer to break up an application into
components that can be solved independently and extended gradually into a complete system.
According to its use, an ISIS tool belongs to one of the following categories:
• Tools for managing process groups. They include tools for creating, joining, or leaving
process groups. Also included are mechanisms for obtaining group views, for monitoring
group changes, for atomically transferring the state of the group to a new member, etc.
• Tools for process and process group communication. ISIS defines a new type of object
called a message. A message is manipulated like an input/output stream and is usually
sent to an entry point defined by another process. For communication with process groups
ISIS provides a set of broadcast and reply mechanisms.
• Tools for managing replicated data. ISIS provides mechanisms to maintain replicated data
using a broadcast to update but reading locally. Processes do not need to block when doing
a read update or releasing a lock.
• Synchronization tools. ISIS provides these tools for cases in which processes executing
concurrently need locking and mutual exclusion mechanisms to avoid interference between
concurrent activities.
• Tools for organizing a distributed computation. There are three ways in which we can
organize a distributed computation using ISIS. The first scheme, called the coordinator-cohort
scheme, selects some member to be responsible for a request. Non-coordinator processes
function as passive backups, taking no action unless a failure occurs. The second scheme,
called redundant, is one where all members of a group execute a request in parallel,
presumably arriving at identical results and changing the replicated data in identical ways.
The third scheme, called subdivided, consists of having each member perform one part of
the request, with the collected outcome being presented to the caller.
• Tools for detecting failures and recovering after they occur. Mechanisms are provided for
detecting failures and informing any interested party of a failure. To allow failed processes
to recover, mechanisms are provided for creating periodic checkpoints or logs that can
be replayed on recovery. The recovery mechanisms take into consideration the possibility
of total or partial failures. In case of a total failure, that is, if all processes that make up an
application fail, it is possible to restart the whole application gracefully using its stable
storage. In case of a partial failure, that is, if one of the processes executing an application fails, it
is possible to reintegrate the failed process into the system and transfer the current system
state to it.
• Other tools. ISIS also provides tools for supporting transactions and protection.
All the tools are fault tolerant and support virtual synchrony. If a process fails
while it is performing some action using the tools, either the action is executed completely
or it is not executed at all. For example, if a process fails while executing a broadcast to a group,
then either all the processes receive the message or none of them does. The tools support virtual
synchrony in the sense that by using the tools all processes in the system observe the same events in
the same order. This applies not only to message delivery events, but also to failure recoveries,
group membership changes, etc.
As discussed before, the main problem to solve if we want to ensure that processes observe
events in the same order is to ensure that messages arrive in the same order. For this reason, ISIS
provides a set of four multicast primitives. The difference between them is the order in which the
broadcast messages are delivered to their destinations with respect to other messages sent to
the same destinations. The stronger the message delivery order accomplished by a multicast
primitive, the more expensive it is in terms of overhead. The multicast primitives provided are
the following:
fbcast It is the least costly broadcast primitive. It provides FIFO ordering on a point-to-point
basis and is reliable in the sense that either all the destinations or none of them receive the
message, even if the sender fails.
cbcast It provides a generalized kind of FIFO ordering. All messages sent using this primitive
are delivered in the order in which they were sent. This means that if a process P1
broadcasts messages Ma, Mb, and Mc, then all the destinations will receive first message
Ma, then Mb, and last Mc. It is also guaranteed that no destination will get message Mc
unless all of them have previously received messages Ma and Mb. ISIS further guarantees
that if any invocations of cbcast are causally related, the corresponding messages are
delivered everywhere in the order of invocation. Two multicast events are causally related
if information about the first could have reached the point where the second was begun
before it was initiated there. The primitive cbcast, however, does not ensure any particular
delivery order when two or more independent processes broadcast messages using this
primitive. This means that if processes P1 and P2 broadcast messages Ma and Mb using
cbcast, then some destinations could receive message Ma first and message Mb next, and
other destinations could receive the messages in the reverse order.
abcast This primitive guarantees a stronger delivery order than the cbcast primitive. It solves
the problem of two independent processes broadcasting messages to the same destinations.
If the processes use the abcast primitive, all the destinations receive the broadcast
messages in the same order. For example, if processes P1 and P2 broadcast messages
Ma and Mb, respectively, then either all the destinations receive message Ma first and
then Mb, or all of them receive the messages in the reverse order. The order accomplished
by abcast is consistent but not predetermined.
gbcast If some processes broadcast messages using the primitive cbcast and other processes use
the primitive abcast (to the same destinations), then again no particular delivery order is
guaranteed. The gbcast broadcast primitive solves this problem. All processes that receive
a message transmitted using gbcast observe the same order between that gbcast and other
messages they receive, regardless of the mechanism used to send those messages.
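The causal ordering promised by cbcast can be illustrated with vector timestamps. The following is only a sketch of one well-known way to decide whether a message may be delivered or must be buffered; it is not taken from the ISIS sources. The group size NPROC and the names causally_deliverable and record_delivery are assumptions made for the illustration. Each process keeps a vector delivered[], where delivered[k] counts the messages from process k delivered so far, and each message carries the sender's vector ts[] as it stood when the message was sent.

```c
#define NPROC 3   /* hypothetical group size, chosen for illustration */

/* Returns 1 if a message stamped ts[] and sent by `sender` may be
   delivered at a process whose delivery counts are delivered[];
   returns 0 if the message must stay buffered for now. */
int causally_deliverable(const int ts[NPROC], int sender,
                         const int delivered[NPROC])
{
    int k;

    for (k = 0; k < NPROC; k++) {
        if (k == sender) {
            /* FIFO part: it must be the next message from the sender. */
            if (ts[k] != delivered[k] + 1)
                return 0;
        } else if (ts[k] > delivered[k]) {
            /* The sender had already seen a message from process k
               that has not been delivered here yet. */
            return 0;
        }
    }
    return 1;
}

/* Records the delivery of one message sent by `sender`. */
void record_delivery(int sender, int delivered[NPROC])
{
    delivered[sender] += 1;
}
```

With this rule, a message Mb that was sent after its sender had received Ma is held back at every destination until Ma has been delivered there, while two messages sent independently can be delivered in either order, which matches the behavior described for cbcast.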
The multicast primitives are the fundamental tools that ISIS provides; the other tools are built
using them. In the simulation of the algorithm presented in Chapter 4 we have basically used
the broadcast primitives and the tools for process group management, so we will not describe
in detail the other tools that ISIS provides. As we explain how we conducted the simulation of
the algorithm, we will discuss the characteristics of the tools used.
5.2 The Simulation Program
In order to test the correctness and fault tolerance of the algorithm and to evaluate its
performance in terms of messages exchanged and execution time, we wrote a simulation program in
which we implemented the procedures EXCHANGE, FLIP_GLOBAL_COIN, and BITCONSENSUS.
As we explained before, the running time of the algorithm is mostly determined by the consensus
protocol that we use, and in particular by the procedure that implements the shared global coin.
Because we can only assure that with probability greater than  all processes will see the same
outcome after they flip the global coin, in the worst case the consensus protocol will run for an
infinite time. As we discussed earlier, for randomized algorithms like our consensus protocol we
can calculate only the expected running time. The expected running time of our algorithm is
$(\gamma + 1)^2 \cdot n^2$. Both procedures FLIP_GLOBAL_COIN and BITCONSENSUS are claimed
to be fault tolerant in the sense that any process running them will be able to finish their
execution even if up to $\lfloor N/2 - 1 \rfloor$ processes in the system fail. Also, any process
running the algorithm must obtain correct results in the presence of failures in the communication
mechanisms. Specifically, a process running the algorithm can expect that, in any broadcast it
makes to the other processors, only $\lfloor N/2 + 1 \rfloor$ messages will arrive at their
destinations.
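The two bounds above can be made concrete with integer arithmetic. The sketch below is an illustration only; the function names max_failures and min_deliveries are hypothetical, and the code assumes N >= 2 so that C integer division agrees with the floor function.

```c
/* Largest number of process failures tolerated: floor(N/2 - 1).
   Assumes n >= 2; integer division of a non-negative int rounds down. */
int max_failures(int n)
{
    return n / 2 - 1;
}

/* Number of destinations a broadcast is assumed to reach: floor(N/2 + 1). */
int min_deliveries(int n)
{
    return n / 2 + 1;
}
```

For N = 4, for example, one failure is tolerated and a broadcast is assumed to reach three destinations. For any N >= 2 the processes that remain after max_failures(N) failures are at least min_deliveries(N) in number, so a majority is always available.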
5.2.1 The structure of an ISIS process
A process in an ISIS application is internally structured into a number of tasks. An ISIS task
looks like a C function and shares the same address space and global variables as the other tasks
and functions in the process. An ISIS program, like a C program, has a function called main,
which is the first one to be executed when we run the program. A task is implemented as a
C function, but it has the particularity that it can be invoked by the system in response to
certain events, the most common of which is message delivery. A task is declared in an ISIS program
with the command isis_task(task_name, function_name). A task that is started up in response
to a message delivery is called an entry. A process can have many entries, and each one is given
a different entry number. When a message is sent, it is addressed to a particular entry. On
delivery, a pointer to the message is passed as a parameter to the entry, which in most cases
reads the content of the message and passes it to other tasks or functions. One thing worth
mentioning about ISIS tasks is that when a task makes certain ISIS system calls (like a
broadcast that waits for replies), it is possible for a new task to start and begin executing
before the system call returns. The original task will later continue from where it left off.
System calls of this type, which allow other tasks to be started before they return, are called
“blocking”.
In an ISIS application, the function main usually just reads the command line arguments, if
there are any, initializes ISIS with the system call isis_init(portnumber) (for our system the
port number is 1623), and sets off the main loop with the system call isis_mainloop(task_name).
The argument to isis_mainloop is the name of the first task to be run. The general structure
of the function main in an ISIS program is then:
main()
{
    variable declaration and function prototyping
    isis_init(<port_no>);
    /* task declaration */
    isis_task(task_name, function_name);
    /* entries declaration */
    isis_entry(entry_number, task_name, function_name);
    isis_mainloop(task_name);
}
When the task started by isis_mainloop begins executing, ISIS inhibits the delivery of messages
from other processes. This ensures that all the necessary initialization can be done before
the program has to respond to events. ISIS automatically invokes the system call isis_start_done
at the end of the main task; this call tells the ISIS system that the start-up sequence is
completed. If the main task remains in an infinite loop, however, the call has to be made
explicitly inside the task. Otherwise, the ISIS program also remains in an infinite loop,
executing only the main loop. Another system call used in the simulation program is
isis_accept_events. This call explicitly tells ISIS to accept events like incoming messages.
It is a blocking call.
5.2.2 Implementation of the procedure EXCHANGE
EXCHANGE needs to broadcast the table passed by the procedure FLIP_GLOBAL_COIN or by
the procedure BITCONSENSUS and then collect the tables sent by the other processors
in order to compare them with the table it previously broadcast. After making the comparison,
EXCHANGE decides whether it needs to update the table or to increment a counter that keeps track
of the number of tables identical to the one broadcast that EXCHANGE has collected. As
discussed in Chapter 4, the basic idea is to make the processes exchange information. In
this way, if a message gets lost, the processes that received the message will pass the
information to the ones that did not receive it. For this reason, instead of just broadcasting a
message containing information about itself, each processor broadcasts a table containing all the
information collected so far about other processes in the system. The tables contain N slots,
each of which corresponds to one process in the system. In these slots a process stores the
information collected from the other processes. Processes using EXCHANGE can be sure that when
the procedure returns, the information obtained from the other processes is representative of
their current state. Remember that every time EXCHANGE updates its table, it rebroadcasts it, and
that EXCHANGE returns only after it has received $\lfloor N/2 - 1 \rfloor$ tables identical to
the one it broadcast.
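The comparison step of EXCHANGE can be sketched as follows. This is an illustration under stated assumptions, not the code of the simulation program: each slot is reduced to a single integer, the marker UNKNOWN stands for a slot about which nothing has been collected yet, and "updating" is assumed to mean copying in the slots that the received table knows and the local table does not. The names NSLOTS, UNKNOWN, table_result, and compare_and_merge are hypothetical.

```c
#define NSLOTS 4        /* hypothetical number of processes N */
#define UNKNOWN (-1)    /* assumed marker for "nothing collected yet" */

enum table_result {
    TABLE_IDENTICAL,    /* caller increments its counter of identical tables */
    TABLE_UPDATED,      /* local table changed: caller rebroadcasts it */
    TABLE_IGNORED       /* received table differs but adds nothing new */
};

/* Compares a received table with the local one and merges into the
   local table every slot that is known only in the received table. */
enum table_result compare_and_merge(int local[NSLOTS],
                                    const int received[NSLOTS])
{
    int i;
    int identical = 1;
    int updated = 0;

    for (i = 0; i < NSLOTS; i++) {
        if (local[i] != received[i])
            identical = 0;
        if (local[i] == UNKNOWN && received[i] != UNKNOWN) {
            local[i] = received[i];
            updated = 1;
        }
    }
    if (identical)
        return TABLE_IDENTICAL;
    return updated ? TABLE_UPDATED : TABLE_IGNORED;
}
```

A caller would loop over the buffered tables, rebroadcast after every TABLE_UPDATED result, and return once the number of TABLE_IDENTICAL results reaches the threshold stated in the text.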
Because we are working in an asynchronous environment, messages can arrive at any time.
Processes are assumed to execute computations at different speeds and to broadcast messages in
an asynchronous fashion, without waiting for acknowledgments from the processes that receive
the message. This implies that in general a process will receive messages when it is not ready to
use them, because it is executing some part of the algorithm that involves only local computations.
In order to handle this problem we implemented a procedure that buffers the received messages
so that EXCHANGE can access them when it is ready to use them. This procedure builds
a linked list, prepending each newly arrived message to the front of the list. EXCHANGE
searches this list and extracts the messages it is looking for by comparing timestamps. Every
element in the table passed to EXCHANGE has a timestamp, which is generated by the local
process. Messages in the list whose timestamp matches the one passed to EXCHANGE are extracted
from the list to be evaluated. Messages in the list, however, are stored in an unordered way with
respect to their timestamps. To decrease the time EXCHANGE spends searching for messages in the
list, in our implementation we store messages in two different lists. In one list we store the
messages containing tables created by the BITCONSENSUS procedure, and in the second list we store
messages containing tables created by the FLIP_GLOBAL_COIN procedure. We also use two "EXCHANGE"
procedures: one is used by BITCONSENSUS and the other is used by FLIP_GLOBAL_COIN. This
improved the organization of our code and made it easier to debug.
The procedures that receive messages in the program are declared as entries. As explained
above, we have two entries. The only work these procedures do is to call a procedure that
prepends the content of the received message (the table) to the front of the corresponding list.
The procedures that extract messages from the lists for EXCHANGE (called cm_receive and
conse_receive) work, as described above, by comparing timestamps. As these procedures search
the list, they compare the timestamp of each message with the one being looked for. If the
timestamp of the message is older, the message is deleted from the list. If the timestamp is
greater, the message is left in the list. If the timestamps are equal, the message is removed
from the list and passed to EXCHANGE. One important thing worth mentioning about extracting
messages from the list is that, in the case of the messages containing tables created by the
procedure BITCONSENSUS, it is necessary to check the value of the variable round in addition to
the timestamp. In the procedure BITCONSENSUS, every time a process goes into a new round it calls
EXCHANGE and consequently broadcasts its table. A slow process will have all these tables
buffered, but the ones that it must consider are the ones that have the highest value for the
variable round. This implies that when conse_receive is searching for messages in the list and
finds a table sent by a particular process, it must search the entire list and make sure that
there is no other message for which the variable round has a higher value.
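The three-way timestamp rule used when extracting messages can be sketched as below. This is a simplified illustration and not the cm_receive or conse_receive code of the appendix: the timestamp is reduced to a single integer, the table is reduced to an integer payload, and the names struct msg, buffer_msg, and extract_msg are hypothetical. The additional check of the variable round needed for BITCONSENSUS tables is not shown.

```c
#include <stdlib.h>

struct msg {
    int timestamp;      /* simplified: one integer instead of a struct */
    int payload;        /* stands in for the received table */
    struct msg *next;
};

/* Prepends a message to the list and returns the new head.
   If memory cannot be allocated the message is dropped. */
struct msg *buffer_msg(struct msg *head, int timestamp, int payload)
{
    struct msg *m = malloc(sizeof *m);

    if (m == NULL)
        return head;
    m->timestamp = timestamp;
    m->payload = payload;
    m->next = head;
    return m;
}

/* Scans the list for a message stamped `wanted`. Older messages met
   on the way are deleted, newer ones are left in place, and the first
   match is unlinked and returned (the caller frees it). Returns NULL
   if no buffered message carries the wanted timestamp. */
struct msg *extract_msg(struct msg **head, int wanted)
{
    struct msg **link = head;

    while (*link != NULL) {
        struct msg *m = *link;

        if (m->timestamp < wanted) {
            *link = m->next;        /* older: delete it */
            free(m);
        } else if (m->timestamp == wanted) {
            *link = m->next;        /* equal: hand it to the caller */
            m->next = NULL;
            return m;
        } else {
            link = &m->next;        /* newer: keep it for later */
        }
    }
    return NULL;
}
```

Using a pointer to the link being examined avoids treating the head of the list as a special case when a node is removed.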
5.2.3 Implementation of BITCONSENSUS and FLIP_GLOBAL_COIN
The implementation of these procedures is straightforward and follows exactly the steps
described in the algorithms. FLIP_GLOBAL_COIN is called by BITCONSENSUS, and the
latter is called by a driver program that makes the process join a group (all processes join the
same group) and initializes the variables. The complete code for the program is included as an
appendix of this document.
5.2.4 Results
By running the program we verified the fault-tolerant properties of our algorithm as well as
its correctness. Processes were killed while they were executing the BITCONSENSUS and
FLIP_GLOBAL_COIN procedures, and this did not disrupt or prevent the remaining operational
processes from finishing the execution of the algorithm. In no case was the failure of a process
observed to cause other processes to arrive at contradictory decisions. One interesting thing
we observed is that when the processes flipped the shared global coin, they always agreed on the
coin's value. This is due to the fact that any process knows the result of the local coin flip of
all the others at every iteration of the algorithm. So they must come up with the same result for
the coin flip at the end of the algorithm. This is not true in the shared memory environment for
which the algorithm was originally written. Due to time restrictions we were not able to test the
algorithm in different circumstances, and for this reason we do not provide tables with results
of the simulation.
Chapter 6
SUMMARY
The main objectives of our work have been to analyze the implementation of concurrent data
objects in the shared memory model, and to develop techniques for the simulation of these objects
in a message passing system. The motivation for the development of simulation techniques is
to translate algorithms written for shared memory systems to work in message passing systems,
and to investigate to what extent and under what conditions problems that are solvable in the
shared memory model can be solved in the message passing model by simulating shared memory
synchronization primitives.
The most important topics that we have addressed are:
1. The description of the shared memory model and the message passing model for distributed
systems, with their advantages and disadvantages.
2. The concepts of concurrent systems and concurrent objects. The notion of the history of a
system execution and the associated partial and total order relations.
3. The axiomatic techniques for the specification of concurrent objects.
4. The methods developed for the verification of the correctness of a concurrent data object
implementation.
5. The notion of atomicity and its close relation to obtaining a total order in a history
of operations in a system.
6. The concept of atomic, regular, and safe registers and the techniques used in their
implementation.
7. The notion of I/O and Port automata and their use to model register constructions.
8. The translation of algorithms for the shared memory model to the message passing model
and the development of an algorithm for the simulation of shared memory synchronization
primitives in a message passing system.
Considerable effort was devoted to explaining the concept of an atomic register and its
implementation in the shared memory model, and to describing how a system can appear to execute
atomic read and write operations even though the execution of these operations is interleaved in
time.
Safe registers are considered to be physically realizable. The regular and the atomic registers
are built using them. For these types of registers, the readers and the writers execute special
protocols to maintain the regularity and the atomicity conditions. From an abstract point of
view, a regular or an atomic register can be visualized as an I/O automaton composed of the
register and the procedure I/O automata. The register automata model the safe registers,
and the procedure automata model the protocols executed by the readers and the writers.
The correct simulation of atomic reads and writes requires that every operation behave as if
it had occurred instantaneously at some point in time between its invocation and its response. In
addition, in a history of operations of a system it must be possible to replace the real operations
with the fictitious instantaneous ones in such a way that a read operation always returns the
value written by the latest preceding write.
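The second condition can be checked mechanically once the instantaneous operations have been placed in a sequence. The sketch below covers only that check: it does not search for a valid placement of the operations between their invocations and responses. The names op_kind, struct op, and is_legal_sequential_history are hypothetical, and the register is assumed to hold a known initial value before the first write.

```c
enum op_kind { OP_WRITE, OP_READ };

struct op {
    enum op_kind kind;
    int value;          /* value written, or value returned by the read */
};

/* Returns 1 if, in the sequence ops[0..n-1] of instantaneous
   operations, every read returns the value of the latest preceding
   write (or `initial` when no write precedes it); returns 0 otherwise. */
int is_legal_sequential_history(const struct op *ops, int n, int initial)
{
    int current = initial;
    int i;

    for (i = 0; i < n; i++) {
        if (ops[i].kind == OP_WRITE)
            current = ops[i].value;
        else if (ops[i].value != current)
            return 0;
    }
    return 1;
}
```

A history is atomic when at least one placement of its operations, consistent with their invocation and response times, passes this check.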
The implementation of an atomic register is difficult. In some cases the algorithms were
incorrect and had to be modified. Intuitive proofs in these cases failed to uncover situations in
which the atomicity conditions were violated.
Atomic registers are used as synchronization primitives in the solution of different problems
in distributed systems. Their use can be found in the solutions to the atomic snapshot problem
[1], the concurrent timestamping problem [25], etc. On the other hand, atomic registers have
their limitations. They cannot be used to implement other common synchronization primitives
like test-and-set and Read-Modify-Write registers. Herlihy [28] proved this by showing that
atomic registers cannot solve the consensus problem even for 2 processes. Any synchronization
primitive that can solve the 2-process consensus problem, like the test-and-set and fetch-and-add
registers, or the general n-process case, cannot be implemented in terms of atomic registers. To
build them, other approaches have been taken [28] [49].
The algorithm proposed in Chapter 4 was designed to simulate Read-Modify-Write registers
in message passing systems. Algorithms for simulating other types of synchronization primitives,
like atomic multiwriter registers, shared queues, etc., can be obtained by making some minor
modifications to the algorithm presented here. For these primitives it is necessary that a
processor that wants to execute an operation pass to the others the value it wants to enqueue or
write. This value could be included in the message the processor broadcasts requesting to execute
an operation. The receiving processor buffers this value and uses it later, when the sender
succeeds in being elected the leader. The algorithm proposed is very resilient. It guarantees that
any processor will be able to execute an operation even in the presence of up to
$\lfloor N/2 \rfloor - 1$ processor failures. This resilience is achieved through the use of a
robust communication primitive and a randomized consensus algorithm. Basically, the algorithm is
based on the fact that if a majority of processors can always communicate and exchange
information, then they will be able to decide which processors have failed. The price paid for
achieving high resilience is the large number of messages exchanged and the large size of the
messages. The existence of algorithms for the simulation of synchronization primitives in message
passing systems implies that it is possible to construct emulators that translate algorithms for
the shared memory model into algorithms for the message passing model. It also implies that the
existence of a solution for a problem in one model guarantees the existence of a solution in the
other model. After the translation is done, optimization should be performed in order to reduce
the number of messages exchanged. One point to emphasize here is that when translating algorithms
from the shared memory model to the message passing model, the degree of resilience cannot be
maintained. Chor and Moscovici [18] have proven that the shared memory model is more powerful in
this respect. One direct application of this algorithm is in the implementation of resilient
mutual exclusion algorithms.
Bibliography
[1] Afek, Y., Attiya, H., Dolev, D., Gafni, E., Merrit, M., and Shavit, N. Atomic
snapshots of shared memory. In 9th Annual ACM Symp. on Principles of Distributed
Computing (August 1990). Quebec City.
[2] Alford, M., Ansat, J., Hommel, G., Lamport, L., Liskov, B., Mullery, G., and
Schneider, F. Distributed Systems. Lecture Notes in Computer Science. Springer-Verlag,
1985.
[3] Bar-Noy, A., Attiya, H., and Dolev, D. Sharing memory robustly in message passing
systems. In 9th Annual ACM Symp. on Principles of Distributed Computing (August 1990).
Quebec City.
[4] Bar-Noy, A., and Dolev, D. Shared memory vs. message-passing in an asynchronous
distributed environment. In 8th Annual ACM Symp. on Principles of Distributed Computing
(August 1989), pp. 307-318. Edmonton, Alberta.
[5] Bermond, J., and Raynal, M. Distributed Algorithms. Lecture Notes in Computer
Science. Springer-Verlag, September 1989.
[6] Biran, O., Moran, S., and Zacks, S. A combinatorial characterization of the distributed
1-solvable tasks. Personal Communication, October 1989.
[7] Birman, K. Replication and fault tolerance in the ISIS system. In 10th ACM SIGOPS
Symp. on Operating Systems Principles (December 1985), pp. 79-86. Orcas Island,
Washington.
[8] Birman, K., Cooper, R., Joseph, T., Marzullo, K., Makpangou, M., Kane, K.,
Schmuck, F., and Wood, M. The ISIS System Manual, Version 2.0. The ISIS Project,
May 1990.
[9] Birman, K., Cooper, R., and Marzullo, K. ISIS and META projects: Progress report.
Tech. Rep. TR90-1103, Department of Computer Science, Cornell University, February
1990.
[10] Birman, K., and Joseph, T. Exploiting virtual synchrony in distributed systems. In 11th
ACM SIGOPS Symp. on Operating Systems Principles (November 1987), pp. 123-138.
[11] Birman, K., Shiper, A., and Stephenson, P. Fast causal multicast. Tech. Rep.
TR90-1105, Department of Computer Science, Cornell University, April 1990.
[12] Bloom, B. Constructing two writer atomic registers. In 6th Annual ACM Symp. on
Principles of Distributed Computing (August 1987), pp. 249-259. Vancouver, B.C.
[13] Burns, J., and Peterson, G. Concurrent reading while writing II: The multiwriter case.
Tech. Rep. TR GIT-ICS-86/26, School of Information and Computer Science, Georgia Tech.,
December 1986.
[14] Burns, J., and Peterson, G. Constructing an atomic single-writer, multireader register
from atomic single reader registers. Tech. Rep. TR GIT-ICS-86/27, School of Information
and Computer Science, Georgia Tech., December 1986.
[15] Burns, J., and Peterson, G. Constructing multireader atomic values from non atomic
values. In 6th Annual ACM Symp. on Principles of Distributed Computing (August 1987),
pp. 222-231. Vancouver, B.C.
[16] Burns, J., and Peterson, G. Pure buffers in concurrent reading while writing. Tech.
Rep. TR GIT-ICS-87/17, School of Information and Computer Science, Georgia Tech.,
April 1987.
[17] Burns, J., and Peterson, G. The ambiguity of choosing. Tech. Rep. TR GIT-ICS-89/12,
School of Information and Computer Science, Georgia Tech., February 1989.
[18] Chor, B., and Moscovici, L. Solvability in asynchronous environments. In 30th Annual
Symp. on Foundations of Computer Science (October 1989), pp. 422-427. Research Triangle
Park, North Carolina.
[19] Coulouris, G., and Dollimore, J. Distributed Systems, second ed. Addison-Wesley,
1989.
[20] Courtois, P., Heymans, F., and Parnas, D. Concurrent control with readers and
writers. Communications of the ACM 14, 10 (October 1971), 667-668.
[21] Debrujin, N. Additional comments on a problem in concurrent programming.
Communications of the ACM 10, 3 (March 1967), 137-138.
[22] Dijkstra, E. Solution to a problem in concurrent programming control. Communications
of the ACM 8, 9 (September 1965), 569.
[23] Dijkstra, E. The structure of the multiprogramming system. Communications of the
ACM 11, 5 (May 1968), 341-346.
[24] Dolev, D., Dwork, C., and Stockmeyer, L. On the minimal synchronism needed for
distributed consensus. Journal of the ACM 34, 1 (January 1987), 77-97.
[25] Dolev, D., and Shavit, N. Bounded concurrent time-stamp systems are constructible.
In 21st Annual ACM Symp. on Theory of Computing (May 1989), pp. 454-465. Seattle,
Washington.
[26] Fisher, M., Lynch, N., and Paterson, M. Impossibility of distributed consensus with
one faulty process. Journal of the ACM 32, 2 (April 1985), 374-382.
[27] Gottlied, A., Grishman, R., Krustal, C., McAuliffe, K., Rudolph, L., and Snir,
M. The NYU Ultracomputer - designing an MIMD parallel computer. IEEE Transactions
on Computers 32, 2 (February 1984), 175-189.
[28] Herlihy, M. Impossibility and universality results for wait-free synchronization. In 7th
Annual ACM Symp. on Principles of Distributed Computing (August 1988), pp. 276-290.
Toronto, Ontario, Canada.
[29] Herlihy, M., and Aspnes, J. Fast randomized consensus using shared memory. Tech.
Rep. CMU-CS-88-205, Computer Science Department, Carnegie Mellon University,
December 1988.
[30] Herlihy, M., and Wing, J. Axioms for concurrent objects. In 14th Annual ACM
Symposium on Principles of Programming Languages (January 1987), pp. 13-26. Munich,
West Germany.
[31] Katseff, H. A new solution to the critical section problem. In 10th Annual ACM
Symposium on Theory of Computing (May 1978), pp. 86-88. San Diego, Calif.
[32] Knuth, D. Additional comments on a problem in concurrent programming control.
Communications of the ACM 9, 5 (May 1966), 321-322.
[33] Krustal, C., Rudolph, L., and Snir, M. Efficient synchronization on multiprocessors
with shared memory. In 5th Annual ACM Symp. on Principles of Distributed Computing
(August 1986), pp. 218-227. Vancouver, B.C.
[34] Lamport, L. A new solution of Dijkstra's concurrent programming problem.
Communications of the ACM 17, 8 (August 1974), 453-455.
[35] Lamport, L. How to make a multiprocessor computer that correctly executes
multiprocessor programs. IEEE Transactions on Computers 28, 9 (September 1979), 690.
[36] Lamport, L. On interprocess communication part I and II. Distributed Computing 1, 1
(1986), 77-101.
[37] Lampson, B., Paul, M., and Siegert, H. J., Eds. Distributed Systems, Architecture
and Implementation, 3 ed. Springer-Verlag, 1985.
[38] Misra, J. Axioms for memory access in asynchronous hardware systems. ACM Transactions
on Programming Languages and Systems 8, 1 (January 1986), 142-153.
[39] Moran, S. Extended impossibility results for asynchronous networks. Information
Processing Letters 26 (November 1987), 145-151. North Holland.
[40] Moran, S., and Taubenfeld, G. Possibility and impossibility results in a shared memory
environment. Personal Communication, March 1990.
[41] Mullender, S., Ed. Distributed Systems, first ed. Frontier. ACM Press, 1989.
[42] Newmann-Wolfe, R. Communication Issues in Parallel Computing. PhD thesis,
University of Rochester, 1986. Chapter V, Concurrent Reading while Writing.
[43] Newmann-Wolfe, R. Economical atomic multireader shared registers. In 24th Annual
Allerton Conference on Communication, Control and Computing (October 1986).
Monticello, IL.
[44] Newmann-Wolfe, R. A protocol for wait-free, atomic multireader shared variables. In 6th
Annual ACM Symp. on Principles of Distributed Computing (August 1987), pp. 232-247.
Vancouver, B.C.
[45] Papadimitriou, C. The serializability of concurrent database updates. Journal of the
ACM 26, 4 (October 1979), 631-653.
[46] Peterson, G. Concurrent reading while writing. ACM Transactions on Programming
Languages and Systems 5, 1 (January 1983), 46-55.
[47] Peterson, G. A new solution to Lamport's concurrent programming problem using small
shared variables. ACM Transactions on Programming Languages and Systems 5, 1 (January
1983), 56-65.
[48] Peterson, G., and Fisher, M. Economical solutions for the critical section problem in a
distributed system. In 9th Annual ACM Symposium on Theory of Computing (May 1977),
pp. 91-97. Boulder, Colo.
[49] Plotkin, S. Sticky bits and the universality of consensus. In 8th Annual ACM Symp. on
Principles of Distributed Computing (August 1989), pp. 159-175. Alberta, Canada.
[50] Rivest, R., and Pratt, V. The mutual exclusion problem for unreliable processes. In
17th Annual Symposium on Foundations of Computer Science (1976), pp. 1-8. Houston,
Texas.
[51] Schaffer, R. On the correctness of atomic-multiwriter registers. Tech. Rep.
MIT/LCS/TM-364, Laboratory for Computer Science, Massachusetts Institute of
Technology, June 1988.
[52] Singh, A., Anderson, J., and Gouda, M. The elusive atomic register. Tech. Rep. TR
86.29, Department of Computer Science, University of Texas at Austin, December 1986.
[53] Singh, A., Anderson, J., and Gouda, M. The elusive atomic register revisited. Tech.
Rep. TR 86.30, Department of Computer Science, University of Texas at Austin, December
1986.
[54] Tanenbaum, A., and van Renesse, R. Distributed operating systems. ACM Computing
Surveys 17, 4 (December 1985), 419-469.
[55] Taubenfeld, G., Katz, S., and Moran, S. Initial failures in distributed computations.
Tech. Rep. 517, Department of Computer Science, Israel Institute of Technology, August
1988. Haifa, Israel.
[56] Taubenfeld, G., Katz, S., and Moran, S. Initial failures in distributed computations.
Tech. Rep. 517, Department of Computer Science, Israel Institute of Technology, August
1988. Haifa, Israel.
[57] Taubenfeld, G., Katz, S., and Moran, S. Impossibility results in the presence of
multiple faulty processes. Tech. Rep. 586, Department of Computer Science, Israel Institute
of Technology, January 1989. Haifa, Israel.
[58] von Bochmann, G. Concepts for Distributed Systems Design. Springer-Verlag, 1983.
[59] Zedan, H., Ed. Distributed Computer Systems, first ed. Butterworths & Co., 1990.
APPENDIX A
The Program
#include <isis.h>
#ifndef _STDIO_H 
#define _STDIO_H 
#include <stdio.h> 
#endif
#ifndef _ERRNO_H 
#define _ERRNO_H 
#include <errno.h> 
#endif
#include <malloc.h> 
#include <string.h>
#define GET_CM 1 
#define GET_CONSEMESS
#ifndef PORTNO 
#define PORTNO 1623 
#endif
#define GAMMA 3
#define NUMPRO 4
#define LOCALCOINFLIP(x) ((int)(random() % 2 == 0 ? -1 : 1))
#ifndef TRUE 
#define TRUE 1 
#endif
#ifndef FALSE 
#define FALSE 0 
#endif
/* Type declaration for all the data structures
   used by the procedure FLIP_GLOBAL_COIN */
struct coin_time_stamp {
    int Opnum;
    int Bitnum;
    int Coinnum;
    int Iteration;
};
struct Mycoin {
    int Contribution;
    int id;
    struct coin_time_stamp timestamp;
};
struct Coin_array {
    struct Mycoin pcoin[NUMPRO];
};
struct Coin_messages {
    struct Coin_array Coin_table;
    struct Coin_messages *next;
    struct Coin_messages *previous;
};
struct Coin_messages *cm_frontqueue = NULL;
struct Coin_messages *cm_endqueue = NULL;
int My_id = 0;
int Numfail;
address *gaddr_p;
/* Type declaration for all the data structures
   used by the procedure BITCONSENSUS */
struct consensus_timestamp {
    int Opnum;
    int Bitnum;
};
struct Myconsensus {
    int round;
    int sugval;
    struct consensus_timestamp timestamp;
};
struct Consensus_array {
    int id;
    struct Myconsensus BCR[NUMPRO];
};
struct Consensus_messages {
    struct Consensus_array Consensus_table;
    struct Consensus_messages *next;
    struct Consensus_messages *previous;
};
struct Consensus_messages *conse_frontqueue = NULL;
struct Consensus_messages *conse_endqueue = NULL;
/**********************************************************************/
main()
{
    void group_change();
    void get_cm();
    void get_consemess();
    void test_loop();
    void convertmsg();

    srandom(time(0)); /* seed random number generator */
    Numfail = ((double)(NUMPRO) / 2.0 != NUMPRO / 2 ?
               (NUMPRO + 1) / 2 - 1 : NUMPRO / 2 - 1);
    isis_init(PORTNO);
    /* Declare tasks and entry points */
    isis_task(test_loop, "test_loop");
    /* Declare the task that monitors group changes */
    isis_task(group_change, "group_change");
    /* Declare entries */
    isis_entry(GET_CM, get_cm, "get_cm");
    isis_entry(GET_CONSEMESS, get_consemess, "get_consemess");
    /* Declare message types */
    isis_define_type(10, 'r', sizeof(struct Coin_array), convertmsg);
    isis_define_type(11, 't', sizeof(struct Consensus_array), convertmsg);
    /* start the ISIS main loop */
    isis_mainloop(test_loop, NULLARG);
}
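The expression that sets Numfail in main picks the largest number of failures t that is still a strict minority of the group (2t < NUMPRO), and the exchange routines later in the listing wait for NUMPRO - Numfail matching tables. The following is a minimal sketch of that arithmetic only; the helper names max_failures and quorum_size are hypothetical and do not appear in the thesis program.

```c
#include <assert.h>

/* Hypothetical helper: largest t such that 2*t < n (a strict minority).
   For n = NUMPRO this equals the value the listing stores in Numfail. */
int max_failures(int n)
{
    /* Integer division makes one formula cover odd and even n. */
    return (n + 1) / 2 - 1;
}

/* Hypothetical helper: number of matching tables an exchange waits for,
   i.e. NUMPRO - Numfail in the listing. */
int quorum_size(int n)
{
    return n - max_failures(n);
}
```

With NUMPRO defined as 4, as in the listing, this gives one tolerated failure and a quorum of three matching tables.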
/*--------------------------------------------------------------------*/
/* This procedure buffers messages originated by the
   FLIP_GLOBAL_COIN procedure */
int queue_cm(coin_mess)
struct Coin_array *coin_mess;
{
    struct Coin_messages *newmess;

    newmess = (struct Coin_messages *) malloc(sizeof(struct Coin_messages));
    memcpy(&newmess->Coin_table, coin_mess, sizeof(struct Coin_array));
    newmess->next = NULL;
    newmess->previous = NULL;
    if (cm_frontqueue == NULL) {
        cm_frontqueue = newmess;
        cm_endqueue = newmess;
    }
    else {
        cm_endqueue->next = newmess;
        newmess->previous = cm_endqueue;
        cm_endqueue = cm_endqueue->next;
    }
}
/*--------------------------------------------------------------------*/
/* This procedure deletes messages originated by the
   FLIP_GLOBAL_COIN procedure */
delete_cm(ptr)
struct Coin_messages *ptr;
{
    if (ptr->previous == NULL) {
        cm_frontqueue = cm_frontqueue->next;
        if (cm_frontqueue) cm_frontqueue->previous = NULL;
    }
    else {
        ptr->previous->next = ptr->next;
    }
    if (ptr->next == NULL) {
        cm_endqueue = cm_endqueue->previous;
        if (cm_endqueue) cm_endqueue->next = NULL;
    }
    else {
        ptr->next->previous = ptr->previous;
    }
    free(ptr);
}
/*--------------------------------------------------------------------*/
/* This procedure extracts messages containing tables
   created by the FLIP_GLOBAL_COIN procedure */
int cm_receive(Other_coin_table, timestamp)
struct Coin_array *Other_coin_table;
struct coin_time_stamp *timestamp;
{
    struct Coin_messages *q_pointer;
    struct Coin_messages *find_first_cm();

    while ((q_pointer = find_first_cm(timestamp)) == NULL) {
        isis_accept_events(0);
    }
    memcpy(Other_coin_table, &q_pointer->Coin_table,
           sizeof(struct Coin_array));
    delete_cm(q_pointer);
}
/*--------------------------------------------------------------------*/
/* This procedure finds the message in the list that has a
   timestamp equal to the one passed to it. If a timestamp
   is encountered that is older than the one passed then
   the message is removed from the list. Messages with
   timestamps that are newer are left in the list.
*/
struct Coin_messages *find_first_cm(timestamp)
struct coin_time_stamp *timestamp;
{
    int result;
    struct Coin_messages *curr;
    struct Coin_messages *old_timestamp;
    struct coin_time_stamp *temp;
    struct coin_time_stamp *max_table_timestamp();

    curr = cm_frontqueue;
    while (curr != NULL) {
        temp = max_table_timestamp(&curr->Coin_table);
        result = timestampcmp(timestamp, temp);
        if (result == 0) {
            break;
        }
        else if (result == 1) {
            old_timestamp = curr;
            curr = curr->next;
            delete_cm(old_timestamp);
        }
        else {
            curr = curr->next;
        }
    }
    return(curr);
}
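The scan performed by find_first_cm (return the first match, discard older messages met on the way, keep newer ones) can be illustrated without ISIS. The sketch below is a simplified stand-in and not the thesis code: a single integer replaces the four-field timestamp, the list is singly linked, and the names msg, push, find_first and queue_length are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified message: one integer stands in for the
   four-field timestamp used by the listing. */
struct msg {
    int stamp;
    struct msg *next;
};

/* Push a message on the front of the list; returns the new head. */
struct msg *push(struct msg *head, int stamp)
{
    struct msg *m = malloc(sizeof *m);
    if (m == NULL)
        abort();
    m->stamp = stamp;
    m->next = head;
    return m;
}

/* Return the first message whose stamp equals `wanted`. Messages met
   before the match that carry an older (smaller) stamp are freed;
   newer ones stay queued. Returns NULL when no match is buffered. */
struct msg *find_first(struct msg **head, int wanted)
{
    struct msg **link = head;

    while (*link != NULL) {
        struct msg *cur = *link;
        if (cur->stamp == wanted)
            return cur;
        if (cur->stamp < wanted) {
            *link = cur->next;   /* unlink the stale message */
            free(cur);
        } else {
            link = &cur->next;   /* newer message: leave it queued */
        }
    }
    return NULL;
}

/* Count the messages still queued. */
int queue_length(const struct msg *head)
{
    int n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

In the listing, the caller (cm_receive) keeps calling isis_accept_events until the scan returns a non-NULL pointer, so a NULL result simply means "wait for more messages".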
/*--------------------------------------------------------------------*/
/* This procedure finds the element in the table with
   the greatest timestamp. It returns that timestamp
*/
struct coin_time_stamp *max_table_timestamp(coin_table)
struct Coin_array *coin_table;
{
    int index;
    static struct coin_time_stamp temp;

    memset(&temp, 0, sizeof(struct coin_time_stamp));
    for (index = 0; index < NUMPRO; index++) {
        if (timestampcmp(&coin_table->pcoin[index].timestamp, &temp) == 1) {
            memcpy(&temp, &coin_table->pcoin[index].timestamp,
                   sizeof(struct coin_time_stamp));
        }
    }
    return(&temp);
}
/*--------------------------------------------------------------------*/
/* This is the procedure exchange used by
   the procedure FLIP_GLOBAL_COIN */
int cm_exchange(Myarray)
struct Coin_array *Myarray;
{
    int nresp;
    int mayo;
    int Counter = 0;
    int Done = FALSE;
    int Exit = FALSE;
    static int replies[NUMPRO];
    struct Coin_array Otherarray;

    mayo = ((int)(NUMPRO / 2)) + 1;
    while (!Done) {
        nresp = fbcast_l("x", gaddr_p, GET_CM, "%R[1]", Myarray, mayo,
                         "%d", replies);
        Exit = FALSE;
        while (!Exit) {
            /* cm_receive expects a pointer to the timestamp */
            cm_receive(&Otherarray, &Myarray->pcoin[My_id].timestamp);
            if (Less(&Otherarray, Myarray)) {
                ; /* don't do anything with msg */
            }
            else if (Equal(&Otherarray, Myarray)) {
                Counter++;
                if (Counter >= NUMPRO - Numfail) {
                    Exit = TRUE;
                } /* end if */
            } /* end else if */
            else {
                Update(Myarray, &Otherarray);
                Exit = TRUE;
            } /* end else */
        } /* end while exit */
        if (Counter >= NUMPRO - Numfail) {
            Done = TRUE;
        }
    } /* end while done */
    return;
}
/*--------------------------------------------------------------------*/
/* Procedures Less, Equal and Update are used by
   procedure cm_exchange to compare tables and update
   the table that was passed to it */
int Less(array1, array2)
struct Coin_array *array1, *array2;
{
    int index;
    int cmpres;

    for (index = 0; index < NUMPRO; index++) {
        cmpres = timestampcmp(&array1->pcoin[index].timestamp,
                              &array2->pcoin[index].timestamp);
        if (cmpres == 1 || cmpres == 0) return(FALSE);
    }
    return(TRUE);
}
int Equal(array1, array2)
struct Coin_array *array1, *array2;
{
    int index;

    for (index = 0; index < NUMPRO; index++) {
        if (timestampcmp(&array1->pcoin[index].timestamp,
                         &array2->pcoin[index].timestamp) != 0) {
            return(FALSE);
        }
    }
    return(TRUE);
}
Update(array1, array2)
struct Coin_array *array1, *array2;
{
    int index;

    for (index = 0; index < NUMPRO; index++) {
        if (timestampcmp(&array1->pcoin[index].timestamp,
                         &array2->pcoin[index].timestamp) == -1) {
            memcpy(&array1->pcoin[index], &array2->pcoin[index],
                   sizeof(struct Mycoin));
        }
    }
    return;
}
/*--------------------------------------------------------------------*/
/* This procedure implements the algorithm for
   flipping the shared global coin */
int Flip_Global_Coin(Opnum, Bitnum, Coinnum)
int Opnum, Bitnum, Coinnum;
{
    int j;
    int Iteration = 0;
    int Globalvalue;
    struct Coin_array Coin;

    for (j = 0; j < NUMPRO; j++) {
        Coin.pcoin[j].Contribution = 0;
        Coin.pcoin[j].id = 0;
        memset(&Coin.pcoin[j].timestamp, 0, sizeof(struct coin_time_stamp));
    }
    while (TRUE) {
        Iteration++;
        Coin.pcoin[My_id].Contribution += LOCALCOINFLIP(0);
        Coin.pcoin[My_id].id = My_id;
        Coin.pcoin[My_id].timestamp.Opnum = Opnum;
        Coin.pcoin[My_id].timestamp.Bitnum = Bitnum;
        Coin.pcoin[My_id].timestamp.Coinnum = Coinnum;
        Coin.pcoin[My_id].timestamp.Iteration = Iteration;
        cm_exchange(&Coin);
        Globalvalue = 0;
        for (j = 0; j < NUMPRO; j++) {
            Globalvalue += Coin.pcoin[j].Contribution;
        }
        if (Globalvalue > GAMMA * NUMPRO) {
            return(1);
        }
        if (Globalvalue < - GAMMA * NUMPRO) {
            return(0);
        }
    }
}
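The decision rule at the end of each Flip_Global_Coin iteration sums all per-process contributions and compares the total against the barriers +GAMMA*NUMPRO and -GAMMA*NUMPRO. The sketch below isolates only that rule, without the message exchange. The names SK_GAMMA, SK_NUMPRO and coin_outcome are hypothetical, and the -1 "undecided" return value is an assumption made for illustration (the listing simply loops again).

```c
#include <assert.h>

#define SK_GAMMA  3   /* same value as GAMMA in the listing */
#define SK_NUMPRO 4   /* same value as NUMPRO in the listing */

/* Returns 1 or 0 once the summed contributions cross a barrier,
   and -1 while the random walk is still undecided. */
int coin_outcome(const int contribution[SK_NUMPRO])
{
    int sum = 0;
    int j;

    for (j = 0; j < SK_NUMPRO; j++)
        sum += contribution[j];
    if (sum > SK_GAMMA * SK_NUMPRO)
        return 1;
    if (sum < -SK_GAMMA * SK_NUMPRO)
        return 0;
    return -1;
}
```

With these constants the barrier is 12, so a total of exactly 12 is still undecided and the walk must move one more step before a value is returned.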
/* This procedure monitors group
   changes */
void group_change(gview_p, arg)
groupview *gview_p;
int arg;
{
    My_id = 0;
    while (!addr_ismine(&gview_p->gv_members[My_id]))
        My_id++;
}
/*--------------------------------------------------------------------*/
/* This procedure is in charge of receiving
   the messages containing tables generated
   by the FLIP_GLOBAL_COIN procedure */
void get_cm(msg_p)
message *msg_p;
{
    static struct Coin_array temp;
    int index;

    msg_get(msg_p, "%R[1]", &temp);
    reply(msg_p, "%d", My_id); /* the sender collects "%d" replies */
    queue_cm(&temp);
}
/*--------------------------------------------------------------------*/
void convertmsg(data)
char *data;
{
}
/* Dummy function for data conversion. It
   is not necessary to implement one in
   this case because all processes are
   running on the same type of computer */
/*--------------------------------------------------------------------*/
/* timestampcmp works like the C function
   strcmp:
    0 if time1 = time2
   -1 if time1 < time2
    1 if time1 > time2
*/
int timestampcmp(time1, time2)
struct coin_time_stamp *time1;
struct coin_time_stamp *time2;
{
    if (time1->Opnum > time2->Opnum) return(1);
    else if (time1->Opnum < time2->Opnum) return(-1);
    if (time1->Bitnum > time2->Bitnum) return(1);
    else if (time1->Bitnum < time2->Bitnum) return(-1);
    if (time1->Coinnum > time2->Coinnum) return(1);
    else if (time1->Coinnum < time2->Coinnum) return(-1);
    if (time1->Iteration > time2->Iteration) return(1);
    else if (time1->Iteration < time2->Iteration) return(-1);
    return(0);
}
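The comparison in timestampcmp is lexicographic: Opnum is the most significant field, then Bitnum, Coinnum and Iteration, and the first field that differs decides the result. The following compact sketch shows the same ordering under assumed names; struct stamp, cmp_int and stamp_cmp are hypothetical stand-ins and not part of the thesis listing.

```c
#include <assert.h>

/* Hypothetical stand-in for struct coin_time_stamp. */
struct stamp {
    int opnum;
    int bitnum;
    int coinnum;
    int iteration;
};

/* Three-way compare of two ints: -1, 0 or 1. */
static int cmp_int(int a, int b)
{
    return (a > b) - (a < b);
}

/* Lexicographic compare: the first differing field decides. */
int stamp_cmp(const struct stamp *a, const struct stamp *b)
{
    int r;

    if ((r = cmp_int(a->opnum, b->opnum)) != 0) return r;
    if ((r = cmp_int(a->bitnum, b->bitnum)) != 0) return r;
    if ((r = cmp_int(a->coinnum, b->coinnum)) != 0) return r;
    return cmp_int(a->iteration, b->iteration);
}
```

For example, the stamp (1, 2, 3, 4) is older than (1, 2, 4, 0): the first two fields tie, Coinnum 3 is less than 4, and the Iteration field is never consulted.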
/*##########################################################################*/
/* This procedure buffers messages
   originated by the BITCONSENSUS
   procedure
*/
int queue_consemess(consemess)
struct Consensus_array *consemess;
{
    struct Consensus_messages *newmess;

    newmess = (struct Consensus_messages *)
              malloc(sizeof(struct Consensus_messages));
    memcpy(&newmess->Consensus_table, consemess,
           sizeof(struct Consensus_array));
    newmess->next = NULL;
    newmess->previous = NULL;
    if (conse_frontqueue == NULL) {
        conse_frontqueue = newmess;
        conse_endqueue = newmess;
    }
    else {
        conse_endqueue->next = newmess;
        newmess->previous = conse_endqueue;
        conse_endqueue = conse_endqueue->next;
    }
}
/*--------------------------------------------------------------------*/
/* This procedure deletes messages
   originated by the BITCONSENSUS
   procedure
*/
delete_consemess(ptr)
struct Consensus_messages *ptr;
{
    if (ptr->previous == NULL) {
        conse_frontqueue = conse_frontqueue->next;
        if (conse_frontqueue) conse_frontqueue->previous = NULL;
    }
    else {
        ptr->previous->next = ptr->next;
    }
    if (ptr->next == NULL) {
        conse_endqueue = conse_endqueue->previous;
        if (conse_endqueue) conse_endqueue->next = NULL;
    }
    else {
        ptr->next->previous = ptr->previous;
    }
    free(ptr);
}
/*--------------------------------------------------------------------*/
/* This procedure extracts messages containing
   tables created by the BITCONSENSUS procedure
*/
int conse_receive(Other_conse_table, timestamp)
struct Consensus_array *Other_conse_table;
struct consensus_timestamp *timestamp;
{
    struct Consensus_messages *q_pointer;
    struct Consensus_messages *find_first_consemess();

    while ((q_pointer = find_first_consemess(timestamp)) == NULL) {
        isis_accept_events(0);
    }
    memcpy(Other_conse_table, &q_pointer->Consensus_table,
           sizeof(struct Consensus_array));
    delete_consemess(q_pointer);
}
/*--------------------------------------------------------------------*/
/* This procedure finds the first message
   in the list with a timestamp that matches
   the one passed as an argument.
*/
struct Consensus_messages *find_first_consemess(timestamp)
struct consensus_timestamp *timestamp;
{
    int result;
    static struct Consensus_messages *curr;
    struct Consensus_messages *old_timestamp;
    struct consensus_timestamp *temp;
    struct consensus_timestamp *max_conse_table_timestamp();
    struct Consensus_messages *pro_max_round();

    curr = conse_frontqueue;
    while (curr != NULL) {
        temp = max_conse_table_timestamp(&curr->Consensus_table);
        result = conse_timestampcmp(timestamp, temp);
        if (result == 0) {
            curr = pro_max_round(curr);
            break;
        }
        else {
            if (result == 1) {
                old_timestamp = curr;
                curr = curr->next;
                delete_consemess(old_timestamp);
            }
            else {
                curr = curr->next;
            }
        }
    }
    return(curr);
}
/*--------------------------------------------------------------------*/
/* This procedure finds the message
   for which the variable round has
   greatest value
*/
struct Consensus_messages *pro_max_round(messfirst)
struct Consensus_messages *messfirst;
{
    int result;
    int theid;
    int currid;
    int currround;
    int maxround;
    int firstmaxround;
    int foundal;
    struct Consensus_messages *curr;
    static struct Consensus_messages *messmaxround;
    struct Consensus_messages *old_round;
    struct consensus_timestamp *temptime;
    struct consensus_timestamp thetimestamp;
    /* declared here because the function is defined later and returns a pointer */
    struct consensus_timestamp *max_conse_table_timestamp();

    theid = messfirst->Consensus_table.id;
    maxround = messfirst->Consensus_table.BCR[theid].round;
    firstmaxround = maxround;
    memcpy(&thetimestamp, &messfirst->Consensus_table.BCR[theid].timestamp,
           sizeof(struct consensus_timestamp));
    currid = theid;
    currround = maxround;
    curr = messfirst;
    messmaxround = messfirst;
    /* First pass: find the highest round sent by the same process */
    while (curr != NULL) {
        temptime = max_conse_table_timestamp(&curr->Consensus_table);
        result = conse_timestampcmp(&thetimestamp, temptime);
        currid = curr->Consensus_table.id;
        currround = curr->Consensus_table.BCR[currid].round;
        if ((result == 0) && (currid == theid) && (currround > maxround)) {
            maxround = currround;
        }
        curr = curr->next;
    }
    if (firstmaxround < maxround) {
        currid = theid;
        curr = messfirst;
        messmaxround = messfirst;
        currid = curr->Consensus_table.id;
        currround = curr->Consensus_table.BCR[currid].round;
        foundal = FALSE;
        /* Second pass: drop lower rounds, keep the first maximal one */
        while (curr != NULL) {
            temptime = max_conse_table_timestamp(&curr->Consensus_table);
            result = conse_timestampcmp(&thetimestamp, temptime);
            currid = curr->Consensus_table.id;
            currround = curr->Consensus_table.BCR[currid].round;
            if ((result == 0) && (currid == theid) && (currround < maxround)) {
                old_round = curr;
                curr = curr->next;
                delete_consemess(old_round);
            }
            else {
                if ((result == 0) && (currid == theid) && (currround == maxround)) {
                    if (foundal == FALSE) {
                        messmaxround = curr;
                        foundal = TRUE;
                        curr = curr->next;
                    }
                    else {
                        curr = curr->next;
                    }
                }
                else {
                    curr = curr->next;
                }
            }
        }
        return(messmaxround);
    }
    else {
        return(messfirst);
    }
}
/*--------------------------------------------------------------------*/
/* This procedure finds the element in the
   table with the greatest timestamp */
struct consensus_timestamp *max_conse_table_timestamp(conse_table)
struct Consensus_array *conse_table;
{
    int index;
    static struct consensus_timestamp temp;

    index = conse_table->id;
    memcpy(&temp, &conse_table->BCR[index].timestamp,
           sizeof(struct consensus_timestamp));
    return(&temp);
}
/*--------------------------------------------------------------------*/
/* conse_timestampcmp works like the C
   function strcmp:
    0 if time1 = time2
   -1 if time1 < time2
    1 if time1 > time2
*/
int conse_timestampcmp(time1, time2)
struct consensus_timestamp *time1;
struct consensus_timestamp *time2;
{
    if (time1->Opnum > time2->Opnum) return(1);
    else if (time1->Opnum < time2->Opnum) return(-1);
    if (time1->Bitnum > time2->Bitnum) return(1);
    /* the second test must compare Bitnum, not Opnum again */
    else if (time1->Bitnum < time2->Bitnum) return(-1);
    return(0);
}
/*--------------------------------------------------------------------*/
/* This is the exchange routine used
   by BITCONSENSUS
*/
int conse_exchange(Myarray)
struct Consensus_array *Myarray;
{
    int mayo;
    int nresp;
    int Counter = 0;
    int Done = FALSE;
    int Exit = FALSE;
    static int replies[NUMPRO];
    struct Consensus_array Otherarray;

    mayo = ((int)(NUMPRO / 2)) + 1;
    while (!Done) {
        nresp = fbcast_l("x", gaddr_p, GET_CONSEMESS, "%T[1]", Myarray, mayo,
                         "%d", replies);
        Exit = FALSE;
        while (!Exit) {
            conse_receive(&Otherarray, &Myarray->BCR[My_id].timestamp);
            if (Conseless(&Otherarray, Myarray)) {
                /* don't do anything */
            }
            else if (Conseequal(&Otherarray, Myarray)) {
                printf("Conseequal\n");
                Counter++;
                if (Counter >= NUMPRO - Numfail) {
                    Exit = TRUE;
                }
            }
            else {
                Conseupdate(Myarray, &Otherarray);
                Exit = TRUE;
            }
        }
        if (Counter >= NUMPRO - Numfail) {
            Done = TRUE;
        }
    }
    return;
}
/*--------------------------------------------------------------------*/
/* Procedures Conseless, Conseequal,
   and Conseupdate are used by
   procedure conse_exchange to compare
   tables and update
   the table that was passed to it */
int Conseless(array1, array2)
struct Consensus_array *array1, *array2;
{
    int index;
    int cmpres;
    int lowerround;

    for (index = 0; index < NUMPRO; index++) {
        lowerround = TRUE;
        cmpres = conse_timestampcmp(&array1->BCR[index].timestamp,
                                    &array2->BCR[index].timestamp);
        if (array1->BCR[index].round >= array2->BCR[index].round) {
            lowerround = FALSE;
        }
        if ((cmpres == 1) || ((cmpres == 0) && (lowerround == FALSE))) {
            return(FALSE);
        }
    }
    return(TRUE);
}
int Conseequal(array1, array2)
struct Consensus_array *array1, *array2;
{
    int index;
    int cmpres;
    int equalround;

    for (index = 0; index < NUMPRO; index++) {
        equalround = TRUE;
        cmpres = conse_timestampcmp(&array1->BCR[index].timestamp,
                                    &array2->BCR[index].timestamp);
        if (array1->BCR[index].round != array2->BCR[index].round) {
            equalround = FALSE;
        }
        if ((cmpres != 0) || ((cmpres == 0) && (equalround == FALSE))) {
            return(FALSE);
        }
    }
    return(TRUE);
}
Conseupdate(array1, array2)
struct Consensus_array *array1, *array2;
{
    int index;
    int cmpres;
    int higherround;

    for (index = 0; index < NUMPRO; index++) {
        higherround = FALSE;
        cmpres = conse_timestampcmp(&array1->BCR[index].timestamp,
                                    &array2->BCR[index].timestamp);
        if (array1->BCR[index].round < array2->BCR[index].round) {
            higherround = TRUE;
        }
        if ((cmpres == -1) || ((cmpres == 0) && (higherround == TRUE))) {
            memcpy(&array1->BCR[index], &array2->BCR[index],
                   sizeof(struct Myconsensus));
        }
    }
    return;
}
/*--------------------------------------------------------------------*/
/* This procedure receives the messages
   originated by the BITCONSENSUS
   procedure.
*/
void get_consemess(msg_p)
message *msg_p;
{
    static struct Consensus_array temp;
    int index;

    msg_get(msg_p, "%T[1]", &temp);
    reply(msg_p, "%d", My_id);
    index = temp.id;
    queue_consemess(&temp);
}
/*--------------------------------------------------------------------*/
/* This procedure implements the
   consensus protocol */
int Bitconsensus(Opnum, Bitnum, val)
int Opnum, Bitnum, val;
{
    int j;
    int round;
    int Maxround;
    int I_am_a_leader;
    int single_leader;
    int count_leaders;
    int ahead_agree;
    int leader_agree;
    int leader_val;
    int other_leader_val;
    struct Consensus_array Consensus;

    for (j = 0; j < NUMPRO; j++) {
        Consensus.BCR[j].round = 0;
        Consensus.BCR[j].sugval = 0;
        memset(&Consensus.BCR[j].timestamp, 0,
               sizeof(struct consensus_timestamp));
    }
    round = 0;
    while (TRUE) {
        round++;
        Consensus.id = My_id;
        Consensus.BCR[My_id].round = round;
        Consensus.BCR[My_id].sugval = val;
        Consensus.BCR[My_id].timestamp.Opnum = Opnum;
        Consensus.BCR[My_id].timestamp.Bitnum = Bitnum;
        conse_exchange(&Consensus);
        Maxround = 0;
        for (j = 0; j < NUMPRO; j++) {
            if (Consensus.BCR[j].round > Maxround) {
                Maxround = Consensus.BCR[j].round;
            }
        }
        for (j = 0; j < NUMPRO; j++) {
            if (Consensus.BCR[j].round == Maxround) {
                leader_val = Consensus.BCR[j].sugval;
                break;
            }
        }
        /* reset on every pass so the flag is never read uninitialized */
        I_am_a_leader = FALSE;
        if (round == Maxround) {
            I_am_a_leader = TRUE;
        }
        if (I_am_a_leader) {
            count_leaders = 0;
            for (j = 0; j < NUMPRO; j++) {
                if (Consensus.BCR[j].round == Maxround) {
                    count_leaders++;
                }
            }
            single_leader = (count_leaders == 1);
            if (single_leader) {
                ahead_agree = TRUE;
                for (j = 0; j < NUMPRO; j++) {
                    if ((Consensus.BCR[j].sugval != val) &&
                        (Consensus.BCR[j].round >= Maxround - 1)) {
                        ahead_agree = FALSE;
                    }
                }
                if (ahead_agree) return(val);
            }
            else {
                leader_agree = TRUE;
                for (j = 0; j < NUMPRO; j++) {
                    if ((Consensus.BCR[j].sugval != val) &&
                        (Consensus.BCR[j].round == Maxround)) {
                        leader_agree = FALSE;
                    }
                }
                ahead_agree = TRUE;
                for (j = 0; j < NUMPRO; j++) {
                    if ((Consensus.BCR[j].sugval != val) &&
                        (Consensus.BCR[j].round >= Maxround - 1)) {
                        ahead_agree = FALSE;
                    }
                }
                if (leader_agree && ahead_agree) return(val);
                if (!leader_agree)
                    val = Flip_Global_Coin(Opnum, Bitnum, round);
            }
        }
        else {
            leader_agree = TRUE;
            for (j = 0; j < NUMPRO; j++) {
                if ((Consensus.BCR[j].round == Maxround) &&
                    (leader_val != Consensus.BCR[j].sugval)) {
                    leader_agree = FALSE;
                    break;
                }
            }
            if (leader_agree) {
                val = leader_val;
            }
            else {
                val = Flip_Global_Coin(Opnum, Bitnum, round);
            }
        }
    }
}
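In Bitconsensus, the leaders of a pass are the processes whose round number equals the largest round in the exchanged table, and the protocol branches on whether there is exactly one of them. The sketch below isolates that bookkeeping on a plain array of round numbers. SK_N, max_round and count_leaders are hypothetical names used only for illustration; they are not functions of the thesis program.

```c
#include <assert.h>

#define SK_N 4   /* hypothetical group size, playing the role of NUMPRO */

/* Highest round number present in the table (rounds start at 0). */
int max_round(const int round[SK_N])
{
    int best = 0;
    int j;

    for (j = 0; j < SK_N; j++)
        if (round[j] > best)
            best = round[j];
    return best;
}

/* Number of processes that have reached the highest round. */
int count_leaders(const int round[SK_N])
{
    int top = max_round(round);
    int n = 0;
    int j;

    for (j = 0; j < SK_N; j++)
        if (round[j] == top)
            n++;
    return n;
}
```

A process would then treat itself as a leader when its own round equals max_round, and as the single leader when count_leaders also returns 1.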
/*--------------------------------------------------------------------*/
/* This is the driver program */
void test_loop()
{
    char buffer[10];
    int Opnum = 1;
    int Bitnum = 1;
    int myval;
    int conseres;
    int j;
    void group_change();

    /* Make the process join the group */
    gaddr_p = pg_join("coinservice", PG_MONITOR, group_change, 0, 0);
    if (addr_isnull(gaddr_p)) {
        printf("Couldn't join group...\n");
        exit(1);
    }
    printf("My ID: %d\n", My_id);
    isis_start_done();
    isis_accept_events(0);
    myval = My_id % 2;
    printf("myval : %d\n", myval);
    conseres = Bitconsensus(Opnum, Bitnum, myval);
    printf("The consensus result is: %d\n", conseres);
    myval = (My_id + 1) % 2;
    conseres = Bitconsensus(Opnum, Bitnum, myval);
    printf("The consensus result is: %d\n", conseres);
    exit(0);
}
