Asynchronous algorithms for shared memory machines by Wu, Michael M.
UMI 
University Microfilms International 
A Bell & Howell Information Company 
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA 
313/761-4700 800/521-0600 

Order Number 9215910 
Asynchronous algorithms for shared memory machines 
Wu, Michael M., Ph.D. 
University of Illinois at Urbana-Champaign, 1992 
Copyright ©1992 by Wu, Michael M. All rights reserved. 

ASYNCHRONOUS ALGORITHMS FOR SHARED MEMORY MACHINES 
BY 
MICHAEL M. WU 
B.S., University of Illinois, 1985 
M.S., University of Illinois, 1987 
THESIS 
Submitted in partial fulfillment of the requirements 
for the degree of Doctor of Philosophy in Electrical Engineering 
in the Graduate College of the 
University of Illinois at Urbana-Champaign, 1992 
Urbana, Illinois 
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
THE GRADUATE COLLEGE 
NOVEMBER 1991 
WE HEREBY RECOMMEND THAT THE THESIS BY 
MICHAEL M. WU 
ENTITLED ASYNCHRONOUS ALGORITHMS 
FOR SHARED MEMORY MACHINES 
BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR 
THE DEGREE OF DOCTOR OF PHILOSOPHY 
Director of Thesis Research 
Committee on Final Examinations† 
Head of Department 
Chairperson 
† Required for doctor's degree but not for master's. 
© Copyright by Michael M. Wu, 1992 
ABSTRACT 
In an effort to develop more realistic models of computation, we introduce several 
asynchronous shared memory machines and design asynchronous algorithms for those 
machines. We first model asynchronous protocols for communication across unreliable 
channels using finite-state machines communicating via an unreliable shared memory. We 
establish lower bounds on the size of machines and the number of symbols in the 
transmission alphabet required to achieve reliable communication. We consider two types of 
finite-state machines and two fault models for the shared memory. In each case, we show that 
there are robust protocols for deletion and insertion errors. We also show that there are no 
robust protocols for mutation errors. In contrast, in the synchronous case, robust protocols 
exist for all of these types of errors. 
The Parallel Random Access Machine (PRAM) is a fundamental model of parallel 
computation, but it is not physically realizable. We introduce a more realistic model of 
parallel computation, the Asynchronous PRAM (APRAM). Let G be a graph with n vertices 
and m edges. We present two APRAM models and algorithms to find the connected 
components of G for each model. Algorithm I runs on an APRAM with only atomic read 
and write primitives and requires O(n log n) rounds. Algorithms II and III run on an 
APRAM with limited read-modify-write primitives and require O(log n) rounds. Algorithm 
III is more efficient than Algorithm II and requires fewer global synchronizations. All three 
algorithms use m + n processors. We then modify our APRAM connected components 
algorithms to obtain APRAM algorithms for finding a spanning forest or a minimum 
spanning forest of G. 
Finally, we present an APRAM algorithm for finding the biconnected components of a 
connected graph G. Our biconnected components algorithm runs on an APRAM with limited 
read-modify-write primitives and requires O(log n) rounds using O(m + n) processors. 
ACKNOWLEDGMENTS 
First, I would like to thank my advisor Professor Michael Loui for his guidance, insight, 
and steadfast encouragement during the course of my graduate studies over the past six years. 
His dedication and enthusiasm are second to none. I would also like to acknowledge the 
other members of my doctoral committee: Professors Kent Fuchs, Bruce Hajek, and 
Constantine Polychronopoulos. I also thank Professor Franco Preparata for serving on my 
preliminary examination committee. 
I am grateful to my family and friends for providing moral support and the welcome 
distractions. In particular, I thank Mary Allison, Tim Yao, and Lyle Kipp. Finally, I would 
like to thank Hosame Abu-Amara, Jerry Trahan, Marsha Woerner, David Luginbuhl, Nancy 
Amato, David Atkinson, Pat McGuinness, and the other members of my research group. 
The following sources gave financial support for the research in this thesis: the 
University of Illinois Campus Research Board, Kodak through a New Faculty Incentive 
Grant, the Office of Naval Research under Contract N00014-85-K-0570, and the Joint 
Services Electronics Program under Contract N00014-90-J-1270. 
TABLE OF CONTENTS 
CHAPTER PAGE 
1. INTRODUCTION 1 
2. MODELING ROBUST ASYNCHRONOUS COMMUNICATION 
PROTOCOLS WITH FINITE-STATE MACHINES 6 
2.1. Introduction 6 
2.2. The System Model 9 
2.3. Complexity of Communicating Finite-state Machines 12 
2.4. Transmission in the Presence of Deletion Errors 19 
2.4.1. Compulsive machines and read faults 19 
2.4.2. Selective machines and read faults 24 
2.4.3. Selective machines and write faults 25 
2.4.4. Compulsive machines and write faults 26 
2.5. Transmission in the Presence of Insertion Errors 30 
2.6. Transmission in the Presence of Mutation and Multiple Errors 32 
2.7. Conclusions 33 
3. PARALLEL ALGORITHMS FOR MINIMUM SPANNING TREE IN 
LOGARITHMIC TIME WITHOUT PRIORITY WRITING 37 
3.1. Introduction 37 
3.2. Model of Computation 39 
3.3. The MST Algorithm of Awerbuch and Shiloach 39 
3.4. Common CRCW PRAM MST Algorithm 43 
4. ASYNCHRONOUS PARALLEL ALGORITHMS FOR GRAPH 
CONNECTIVITY AND RELATED PROBLEMS 47 
4.1. Introduction 47 
4.2. Model of Computation 51 
4.2.1. The PRAM 51 
4.2.2. The asynchronous PRAM 52 
4.3. The Design of APRAM Algorithms 54 
4.3.1. Wait-free data structures 55 
4.3.2. Processor synchronization 56 
4.4. Synchronous Algorithms 59 
4.4.1. Connected components algorithms 59 
4.4.2. Spanning forest algorithm 63 
4.4.3. Minimum spanning forest algorithm 64 
4.5. APRAM Connected Components Algorithm I 65 
4.6. APRAM Connected Components Algorithm II 72 
4.6.1. The algorithm 72 
4.6.2. Analysis 75 
4.6.2.1. The potential function 77 
4.6.2.2. Contribution 83 
4.6.2.3. Performance 85 
4.7. APRAM Connected Components Algorithm III 104 
4.8. APRAM Spanning Forest Algorithms 111 
4.8.1. Spanning forest algorithms 111 
4.8.2. Minimum spanning forest algorithms 112 
4.9. Conclusions 114 
5. AN ASYNCHRONOUS PARALLEL ALGORITHM FOR 
COMPUTING BICONNECTED COMPONENTS 115 
5.1. Introduction 115 
5.2. Model of Computation 116 
5.3. The Synchronous Biconnected Components Algorithm 117 
5.4. APRAM Biconnected Components Algorithm 123 
REFERENCES 126 
VITA 132 
LIST OF TABLES 
Table Page 
2.1. Existence of robust protocols in the presence of various errors 35 
4.1. A summary of the definitions of stage, step, round, and phase 70 
LIST OF FIGURES 
Figure Page 
2.1. A robust protocol when there are no transmission errors 18 
2.2. A robust protocol for deletion errors for compulsive machines 20 
2.3. A robust protocol for deletion errors for selective machines and write faults 25 
2.4. A robust protocol for insertion errors for compulsive machines and write faults 32 
2.5. A robust protocol for deletion and insertion errors for selective machines 
and write faults 34 
4.1. An example of a forest F 80 
4.2. An example of a cluster 84 
CHAPTER 1. 
INTRODUCTION 
A parallel computer is a machine which consists of two or more processors, each with its 
own local memory, and in addition, possibly some globally shared memory. One well-
studied model of a parallel computer is the Parallel Random Access Machine (PRAM). The 
PRAM is an idealized parallel computer that allows algorithm designers to concentrate on the 
computational aspects of problems without concern about the reliability and synchronization 
of processors. 
Although the PRAM is an important and fundamental model of parallel computation, it 
is not physically realizable. In designing algorithms for real machines, the PRAM hides the 
cost of synchronization. As the number of processors becomes large, it becomes impractical 
to synchronize the processors using a single global clock. Synchronization mechanisms, 
whether implemented through software or hardware, slow the execution of algorithms. In 
addition, to prevent more than one processor from simultaneously accessing the same cell in 
the shared memory, read and write operations on the shared memory must also be 
synchronized. 
In an effort to develop more realistic models of computation, in this thesis we introduce 
several asynchronous shared memory machines and design asynchronous algorithms for those 
machines. In the remainder of this chapter, we give an overview of the thesis. 
In Chapter 2, we study asynchronous data-link communication protocols. Consider a 
data communications system consisting of a transmitter and a receiver communicating across 
unreliable channels. The goal of the system is for the transmitter to communicate its input, an 
infinite sequence of 0's and 1's, to the receiver, and for the receiver to output this sequence of 
bits. 
Aho et al. (1982) used communicating finite-state machines to model synchronous 
protocols for reliable communication across unreliable channels. We extend their ideas and 
model asynchronous protocols for communication across unreliable channels using finite-
state machines communicating via an unreliable shared memory. We establish lower bounds 
on the size of machines and the number of symbols in the transmission alphabet required to 
achieve reliable communication. 
We consider two types of finite-state machines, compulsive machines and selective 
machines. In the compulsive model, a machine sends a symbol at every transition. In the 
selective model, a machine does not have to send a symbol at every transition. We also 
consider two fault models for the shared memory. In the read fault model, write operations 
are reliable but read operations are unreliable. In the write fault model, read operations are 
reliable but write operations are unreliable. These seemingly minor differences in the system 
model can significantly affect the results, and thus these differences illustrate the need for 
careful definitions of the machine and shared memory models. 
For each combination of finite-state machine and memory fault model, we show that 
there are robust protocols for deletion and insertion errors. We present protocols for 
compulsive machines in the read fault model and for selective machines in both the read and 
write fault models that have the minimum number of states for the machines and the 
minimum number of symbols in the transmission alphabet simultaneously. For compulsive 
machines in the write fault model, our protocol uses the minimum number of symbols when 
the transmitter and receiver have only two states each. We also show that there are no robust 
protocols for mutation errors. In contrast, in the synchronous case, robust protocols exist for 
all of these types of errors. Chapter 2 has been accepted for publication in IEEE Transactions 
on Communications (Wu and Loui, 1991). 
In Chapter 3, we present Common Concurrent Read Concurrent Write (CRCW) PRAM 
algorithms for finding the minimum spanning tree (MST) of a graph. Although the 
algorithms of Chapter 3 are synchronous, the design of these algorithms provides insight into 
the design of the asynchronous algorithms of Chapter 4. 
Let G be a graph with n vertices and m edges. Hirschberg (1982) gave a randomized 
Common CRCW PRAM algorithm for finding the MST of G that runs in O(log n) expected 
time using $n^2$ processors. Awerbuch and Shiloach (1987) gave an MST algorithm for the 
Priority CRCW PRAM that runs in O(log n) time using O(m + n) processors. 
We present a deterministic algorithm for finding the MST of G that runs on a Common 
CRCW PRAM in O(log n) time using $2m + n^{1+2\epsilon}$ processors, where $\epsilon$ is a constant such that 
$0 < \epsilon < 1/2$. Our algorithm has the same running time as the algorithm of Hirschberg, but is 
deterministic and uses fewer processors. For mildly dense graphs, where $m = \Omega(n^{1+2\epsilon})$, our 
algorithm has the same performance as the algorithm of Awerbuch and Shiloach using a 
weaker CRCW PRAM model. The algorithms of Chapter 3 have been published in a 
technical report (Wu, 1990). 
In Chapter 4, we present a more realistic model of parallel computation, the 
Asynchronous PRAM (APRAM). We introduce two APRAM models and design APRAM 
algorithms to find the connected components of G for each model. 
We show that oblivious PRAM algorithms can be easily converted into APRAM 
algorithms. The connected components of G can be found obliviously by computing the 
transitive closure of the adjacency matrix of G. This method is inefficient, however. The 
fastest known CRCW PRAM algorithm for computing the transitive closure of an $n \times n$ 
Boolean matrix requires $O(\log^2 n)$ time using $O(n^{2.376})$ processors (Karp and Ramachandran, 
1990). 
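To illustrate the oblivious approach, the following sequential Python sketch labels the connected components of a graph by computing the reflexive-transitive closure of its adjacency matrix through repeated Boolean squaring. It is an illustration only and not a PRAM program; the function names and the convention of labeling each vertex with the smallest vertex in its component are assumptions made for this example.

```python
def transitive_closure(adj):
    """Reflexive-transitive closure of a Boolean adjacency matrix.

    The same sequence of squarings is performed for every input graph
    with n vertices, which is what makes the method oblivious.
    """
    n = len(adj)
    # Start from (A OR I) so that paths of length 0 are included.
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    covered = 1  # reach currently accounts for paths of length <= covered
    while covered < n:
        # Boolean matrix product: doubles the path length accounted for.
        reach = [[any(reach[i][k] and reach[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        covered *= 2
    return reach


def component_labels(adj):
    """Label each vertex with the smallest vertex reachable from it."""
    reach = transitive_closure(adj)
    n = len(adj)
    return [min(j for j in range(n) if reach[i][j]) for i in range(n)]
```

Two vertices receive the same label exactly when they lie in the same connected component. The sketch performs about log n squarings of an n x n matrix, which reflects why this route is considered inefficient compared with nonoblivious algorithms.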
Nonoblivious CRCW PRAM algorithms can find the connected components of G much 
more efficiently. But it is also more difficult to implement nonoblivious PRAM algorithms 
on an APRAM efficiently. Although several researchers have developed APRAM 
algorithms, with only one exception, all of the algorithms have been oblivious. 
All three of our connected components algorithms are nonoblivious and require 
significantly fewer global synchronizations than a straightforward asynchronous 
implementation of a synchronous algorithm. Algorithm I runs on an APRAM with only 
atomic read and write primitives and requires O(n log n) rounds. Algorithms II and III run 
on an APRAM with limited read-modify-write primitives, specifically replace-min and 
increment, and require O(log n) rounds. Algorithm III is more efficient than Algorithm II 
and uses fewer global synchronizations. All three algorithms use m + n processors. Finally, 
we modify our APRAM connected components algorithms to obtain APRAM algorithms for 
finding a spanning forest and a minimum spanning forest of G. 
In Chapter 5, we use the algorithms of Chapter 4 to design an APRAM algorithm for 
finding the biconnected components of G. The algorithm runs on an APRAM with replace-min 
and increment primitives and requires O(log n) rounds using m + n processors. 
We discuss the related results and review the literature for each of the various problems 
in their corresponding chapters. Each chapter is self-contained. 
CHAPTER 2. 
MODELING ROBUST ASYNCHRONOUS COMMUNICATION 
PROTOCOLS WITH FINITE-STATE MACHINES 
2.1. Introduction 
Consider a data communications system consisting of a transmitter and a receiver 
communicating across unreliable channels. The transmitter has a channel to send symbols to 
the receiver, and the receiver has a separate channel to send symbols to the transmitter. The 
goal of the system is for the transmitter to communicate its input, an infinite sequence of 0's 
and 1's, to the receiver, and for the receiver to output this sequence of bits. 
Let the transmission alphabet $\Sigma$ be a finite set of symbols including the null symbol $\epsilon$, 
and nonnull symbols 0, 1, 2, A, B, C, .... The transmitter and receiver send symbols of $\Sigma$ to 
each other to communicate the input of the transmitter to the receiver. Since the channels are 
unreliable, the symbols sent between the transmitter and receiver may be corrupted. We 
consider the following kinds of errors in channels: 
(1) Deletion: a nonnull symbol is sent and $\epsilon$ is received. 
(2) Mutation: a nonnull symbol is sent and a different nonnull symbol is received. 
(3) Insertion: $\epsilon$ is sent and some nonnull symbol is received. 
Communication protocols are used to ensure reliable transmission of data across 
unreliable channels. A protocol is robust under a certain class of errors if in the presence of 
those errors the protocol satisfies the following conditions under all possible executions: 
(1) Safety: The output sequence is always a prefix of the input sequence. 
(2) Liveness: If the receiver has not output the complete input sequence and if no 
transmission error occurs within some finite number of alternations of steps by the 
transmitter and receiver, then the receiver outputs at least one more bit. 
Communicating finite-state machines have been used to model and validate 
communication protocols. Bartlett et al. (1969) used finite-state machines to model a 
protocol for reliable full-duplex transmission over half-duplex links. Danthine (1982) used 
finite-state machines to model computer network protocols. 
Zafiropulo et al. (1980) and Brand and Zafiropulo (1983) used communicating finite-
state machines to investigate the problem of protocol verification. Yu and Gouda (1982) 
presented an algorithm to determine whether two communicating finite-state machines may 
deadlock. Peng and Purushothaman (1989) presented a dataflow approach to analyze 
protocols for deadlock and unspecified reception. 
Gouda and The (1985) used communicating finite-state machines to model two 
protocols for the physical layer of the International Standards Organization (ISO) reference 
model for computer networks (Tanenbaum, 1981). The first was an asynchronous start-stop 
protocol and the second was a protocol for synchronous transmission with modems. They 
gave a methodology to verify communication boundedness and progress for each protocol. 
Lynch et al. (1988) used input/output automata to specify the physical and data link layers 
formally. They showed that no data link layer protocol can tolerate crashes of the host 
processors on which the protocol runs. 
For faulty asynchronous communications channels where messages can be reordered or 
deleted, Wang and Zuck (1989) used a knowledge-based approach to derive tight bounds on 
the number of different sequences that can be transmitted when the data items and the 
message alphabet have finite domains. For the same model, Tempero and Ladner (1990) 
introduced three characteristics of protocols, including transmission cost, recovery cost, and 
lookahead, and gave tight bounds on the efficiency of protocols in terms of these measures. 
Aho et al. (1982) used finite-state machines to model synchronous protocols for data 
communication and derived lower bounds on the size of machines needed to achieve reliable 
communication across unreliable channels. Several researchers have proved the correctness 
of their protocols. Hailpern (1985) used the finite-state machine and abstract-program 
approaches, Gouda (1985) used reachability graphs, and Halpern and Zuck (1987) used a 
knowledge-based approach. 
In this chapter, we extend the ideas of Aho et al. and consider finite-state machines 
communicating asynchronously across unreliable channels. We establish lower bounds on 
the size of machines and the number of symbols in the transmission alphabet needed to 
achieve reliable communication. Aho et al. showed that in the synchronous case there are 
robust protocols if at most two of deletion, mutation, and insertion errors are present. We 
show that in the asynchronous case there are robust protocols if deletion and insertion errors 
are present, but no robust protocols if mutation errors are present. 
Our results illustrate the difference between synchronous communication and 
asynchronous communication. Furthermore, we distinguish between read faults and write 
faults, and between two kinds of machines. We consider the identification of these 
distinctions to be the other major contribution of this chapter, for these distinctions emphasize 
the need for careful definitions of fault models and machines. 
In Section 2.2, we describe our models of the communication system. In Section 2.3, we 
establish lower bounds on the size of machines and the number of symbols in the 
transmission alphabet required to achieve robust communication. In Section 2.4, we consider 
transmission in the presence of deletion errors. In Section 2.5, we consider insertion errors. 
In Section 2.6, we consider mutation and multiple types of errors. We summarize our results 
in Section 2.7. 
2.2. The System Model 
The data communications system abstracts the physical and data link layers of the ISO 
computer network model. The physical layer is concerned with transmitting raw bits over a 
communication channel between processors without regard for transmission errors. The 
purpose of the data link layer is to make the physical layer appear free of transmission errors 
to the network layer. Thus the input and output of the data communications system are the 
bits that are sent from one processor to another processor at the data link layer. 
In our system model, processors communicate asynchronously. Lamport (1986) 
discussed the nature of asynchronous communication. Since the sender does not know 
whether the receiver is observing the communication medium while the sender is modifying 
it, any change the sender makes to the state of the communication medium must remain after 
the sender has finished its transmission so that the receiver can sense the change at a later 
time. Lamport calls this a persistent communication act. 
The transmitter and receiver are modeled by finite-state machines. Let T be the 
transmitter and let R be the receiver. Machines T and R communicate by sending symbols 
of $\Sigma$ to each other. Transmitter T has an unreliable channel to send symbols to R, and R has 
a separate unreliable channel to send symbols to T. On each channel, we will refer to the 
sending machine as the source and the receiving machine as the destination. 
The unreliable channels are modeled by an unreliable shared memory with read and 
write operations. The memory simulates the persistent signals of a physical system. We 
consider two fault models for the shared memory. In the read fault model, write operations 
are reliable but read operations are unreliable. The symbol that the destination reads from a 
memory cell c might not be the same as the symbol stored in c. The various transmission 
errors are as follows: 
(1) A deletion error occurs if a nonnull symbol is stored in c and the destination reads an $\epsilon$. 
(2) A mutation error occurs if a nonnull symbol is stored in c and the destination reads a 
different nonnull symbol. 
(3) An insertion error occurs if $\epsilon$ is stored in c and the destination reads a nonnull symbol. 
In the write fault model, read operations are reliable but write operations are unreliable. 
The symbol that is stored in a memory cell c might not be the same as the symbol that the 
source writes into c. Since the channels are unidirectional, the source is not permitted to read 
c to determine whether a write operation is correct. The various transmission errors are as 
follows: 
(1) A deletion error occurs if the source writes a nonnull symbol into c and an $\epsilon$ is stored. 
(2) A mutation error occurs if the source writes a nonnull symbol into c and a different 
nonnull symbol is stored. 
(3) An insertion error occurs if the source writes an $\epsilon$ into c and a nonnull symbol is stored. 
The read fault model corresponds to the case in which errors may occur only within the 
communication medium or when a machine is receiving a signal. The write fault model 
corresponds to the case in which errors may occur only when a machine is sending a signal. 
We consider two fault models because if the location and types of errors that can occur are 
limited, then we can derive optimal protocols in some cases. 
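The two fault models can be sketched in Python as a single memory cell whose read or write operation passes through a fault hook. This is a minimal illustration under stated assumptions: the class `Cell`, the `corrupt` hook, the empty string standing for the null symbol, and the helper `classify` are all hypothetical names introduced here, not part of the model's definition.

```python
EPS = ""  # stands for the null symbol (an assumption of this sketch)


class Cell:
    """One shared memory cell modeling a unidirectional unreliable channel."""

    def __init__(self, model, corrupt):
        if model not in ("read", "write"):
            raise ValueError("model must be 'read' or 'write'")
        self.model = model
        self.corrupt = corrupt  # hypothetical hook: symbol -> possibly faulty symbol
        self.value = EPS        # initially the cell contains the null symbol

    def write(self, sym):
        # Write fault model: the stored symbol may differ from the one written.
        self.value = self.corrupt(sym) if self.model == "write" else sym

    def read(self):
        # Read fault model: the symbol read may differ from the one stored.
        # The stored value itself is never changed by a read.
        return self.corrupt(self.value) if self.model == "read" else self.value


def classify(intended, observed):
    """Name the transmission error, if any, for one symbol."""
    if intended == observed:
        return "none"
    if observed == EPS:
        return "deletion"
    if intended == EPS:
        return "insertion"
    return "mutation"
```

In this sketch a read fault leaves the stored symbol intact, so a later read may still return it, whereas a write fault determines what every subsequent read returns until the next write.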
Machines T and R have two shared memory cells that represent the two unreliable 
channels. Each machine has one outgoing cell and one incoming cell. Let $c_T$ be T's 
outgoing cell and $c_R$ be R's outgoing cell. Transmitter T's outgoing cell is R's incoming 
cell and vice versa. A machine sends a symbol $a$ by writing $a$ into its outgoing cell. A 
machine receives a symbol by reading its incoming cell. The value of a cell may change only 
when it is written. Thus, a read operation does not change the value in a cell. Initially, both 
cells contain $\epsilon$. 
Transmitter T has an input buffer from which it reads input bits and an input pointer to 
keep track of the current input bit. Receiver R has an output buffer into which it writes 
output bits. During a move, a machine may perform a buffer operation. The buffer operation 
for T is advancing its input pointer to the next bit in the input buffer. The buffer operation 
for R is writing a bit into the output buffer. For convenience, we assume that the input is 
infinite. Thus we will not need to specify the termination of protocols at the end of a finite 
input. 
We consider two types of finite-state machines for T and R. In the compulsive model, a 
machine sends a symbol of $\Sigma$ at every transition. A move of T depends on its current state, 
the current bit in the input buffer, and the current symbol in $c_R$. In one move T changes state, 
sends a symbol to R, and chooses whether to advance its input pointer to the next bit in the 
input buffer. A move of R depends on its current state and the current symbol in $c_T$. In one 
move R changes state, sends a symbol to T, and chooses whether to write a bit into the output 
buffer. A state transition $\delta_R(s, a) = s'$ for R means that if the current state of R is $s$ and R 
reads $a$ from its incoming cell, then R makes a transition to state $s'$. If the states $s$ and $s'$ are 
the same, then the transition is a self-loop. A transition output $\lambda_R(s, a) = b$ means that on 
the state transition $\delta_R(s, a)$ machine R writes $b$ into its outgoing cell. 
In the selective model, a machine does not have to send a symbol at every transition. In 
one move a machine changes state, chooses whether to send a symbol, and chooses whether 
to perform a buffer operation. If a machine does not send a symbol on some state transition, 
then there is no corresponding transition output. In Section 2.4.4, we will show that the 
selective and compulsive models are different. We show that in the presence of deletion 
errors and write faults we can design more efficient protocols for the selective model than for 
the compulsive model. 
With both the compulsive and selective models of finite-state machines, the system is 
asynchronous. Thus the moves of T and R are arbitrarily interleaved. We assume, however, 
that each move of a machine is atomic. This ensures that if a machine sends a symbol during 
a move, then the symbol it sends is based on the current symbol in its incoming cell. 
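One atomic move of the receiver can be sketched in Python as follows, assuming the transition function and transition output are stored as dictionaries keyed by (state, symbol). The dictionary layout, the cell keys `"c_T"` and `"c_R"`, and the names `receiver_move`, `delta`, `lam`, and `out_bit` are illustrative assumptions. Atomicity is modeled simply by performing the read, the write, and the buffer operation inside one function call.

```python
def receiver_move(state, cells, delta, lam, out_bit, output, compulsive=True):
    """Perform one atomic move of the receiver R and return its next state.

    cells   : dict with keys "c_T" (R's incoming cell) and "c_R" (R's outgoing cell)
    delta   : (state, symbol) -> next state
    lam     : (state, symbol) -> symbol to send; a missing key means nothing is sent
    out_bit : (state, symbol) -> bit to write; a missing key means no buffer operation
    output  : list acting as the output buffer
    """
    a = cells["c_T"]                 # read the incoming cell
    key = (state, a)
    if key in lam:
        cells["c_R"] = lam[key]      # send a symbol by writing the outgoing cell
    elif compulsive:
        # A compulsive machine sends a symbol at every transition.
        raise ValueError("compulsive machine has no transition output for %r" % (key,))
    if key in out_bit:
        output.append(out_bit[key])  # optional buffer operation
    return delta[key]
```

With `compulsive=False` the same function models a selective machine, which may leave its outgoing cell untouched on a transition. An asynchronous execution is then any interleaving of such calls for T and R.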
2.3. Complexity of Communicating Finite-state Machines 
In this section, we establish lower bounds on the size of machines and the number of 
symbols in the transmission alphabet needed to achieve reliable communication. These 
results apply to both compulsive and selective finite-state machines. We first consider the 
size of machines. As in Aho et al., we use the number of states as the measure of complexity. 
This is a fair measure because the number of states is closely related to the size of the circuits 
needed to implement the machines. 
In the synchronous case, if there are no transmission errors, then there is a simple robust 
protocol consisting of a one-state transmitter and a one-state receiver: the transmitter sends 
the next bit to the receiver after it receives an acknowledgment for the previous bit. Aho et 
al. showed that in the presence of deletion errors there are no robust one-state transmitters or 
receivers. We show that in the asynchronous case there are no robust one-state transmitters 
or receivers, even if there are no transmission errors, regardless of the size of the transmission 
alphabet. 
Theorem 2.3.1: There is no robust one-state transmitter. 
Proof: If T has one state, then the moves of T are self-loops and depend only on the current 
bit in the input buffer and the current symbol in $c_R$. Consider the execution of T on the input 
consisting of $k$ 0's followed by a 1, where $k > 2$. Suppose T's next move is the move on 
which T advances its input pointer from the first 0 to the second 0. When T reads $c_R$, T 
advances its input pointer to the second 0. Since T and R are asynchronous, T may read $c_R$ 
an arbitrary number of times before R writes a new symbol. 
Suppose T reads $c_R$ twice before R moves. Then T advances its input pointer twice and 
skips a 0. The execution of R in this case is the same as an execution of R on an input 
sequence of $k-1$ 0's followed by a 1 where T does not skip any 0's. Thus R outputs a 
sequence of $k-1$ 0's followed by a 1, violating the safety condition of a robust protocol. □ 
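The scheduling argument in this proof can be replayed with a small Python sketch. It assumes a hypothetical one-state transmitter that advances its input pointer whenever it sees the pair (0, "ack"); the symbol "ack", the set `advance_on`, and the function names are inventions for this example. The symbols T writes are omitted, since in every move considered here T sees the same pair and would therefore write the same symbol.

```python
def transmitter_pointer(inp, moves, c_R, advance_on):
    """Run `moves` consecutive moves of a one-state transmitter while c_R is fixed.

    With a single state, each move depends only on the current input bit
    and the symbol in c_R. Returns the final input pointer position.
    """
    ptr = 0
    for _ in range(moves):
        if (inp[ptr], c_R) in advance_on:
            ptr += 1
    return ptr


def is_prefix(out, inp):
    """Safety condition: the output must be a prefix of the input."""
    return out == inp[:len(out)]


k = 3
long_input = [0] * k + [1]         # k 0's followed by a 1
short_input = [0] * (k - 1) + [1]  # k-1 0's followed by a 1
advance_on = {(0, "ack")}          # hypothetical transition table of T

# T reads c_R twice before R moves, versus once in the fault-free schedule.
p_long = transmitter_pointer(long_input, 2, "ack", advance_on)
p_short = transmitter_pointer(short_input, 1, "ack", advance_on)

# From here on both executions present R with the same remaining input.
same_future = long_input[p_long:] == short_input[p_short:]
```

Because the remaining inputs coincide, a receiver that correctly outputs `short_input` in the second execution outputs the same sequence in the first, and `is_prefix(short_input, long_input)` is false.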
Theorem 2.3.2: There is no robust one-state receiver. 
Proof: The proof is similar to the proof of Theorem 2.3.1. If R has one state, then the moves 
of R are self-loops and depend only on the current symbol in $c_T$. Consider the execution of 
R on an input of 01 to T. Suppose R's next move is the move on which R writes 0 into its 
output buffer. Since R and T are asynchronous, R may read $c_T$ an arbitrary number of times 
and output an arbitrary number of 0's before T writes a new symbol. Thus the output may 
violate the safety condition of a robust protocol. □ 
Next we consider the number of symbols required in the transmission alphabet to 
achieve robust communication. We will show that in the asynchronous case at least three 
symbols are necessary, even if there are no transmission errors, no matter how many states 
the machines have. In contrast, in the synchronous case there are robust protocols that use 
only two symbols. 
Let T and R be the machines of a robust protocol. Fix $a \in \Sigma$. An $a$-sequence of R is a 
sequence of distinct states $s_0, s_1, \ldots, s_k$ such that $\delta_R(s_i, a) = s_{i+1}$ for $i = 0, \ldots, k-1$. An 
$a$-cycle of R is a sequence of distinct states $s_0, s_1, \ldots, s_k$ such that states $s_0, s_1, \ldots, s_k$ form an 
$a$-sequence and $\delta_R(s_k, a) = s_0$. A self-loop is an $a$-cycle with one state. 
Lemma 2.3.3: R does not write into the output buffer in the transitions of an $a$-cycle. 
Proof: Suppose to the contrary that R does write into the output buffer in an $a$-cycle. Since 
T and R are asynchronous, R may read $c_T$ an arbitrary number of times before T writes a 
new value. If R is in a state of an $a$-cycle and $a$ is in $c_T$, then R may make an arbitrary 
number of transitions in the $a$-cycle and output an arbitrary number of bits. Thus the output 
may violate the safety condition. □ 
A machine is reduced if for all $a$, the only $a$-cycles are self-loops. A sequence of 
transitions on $a$ is maximal if the last transition of the sequence is a self-loop on $a$. 
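The definitions of an a-cycle and a reduced machine translate directly into a check on the transition function. The following Python sketch assumes the transition function is a dictionary from (state, symbol) to the next state, with missing keys meaning that no transition is defined; the names `a_cycles` and `is_reduced` are chosen for this example.

```python
def a_cycles(states, delta, a):
    """Return the a-cycles of a machine, each as a list of distinct states."""
    cycles, explored = [], set()
    for start in states:
        path, position = [], {}
        s = start
        # Follow transitions on a until no transition is defined, a state on
        # the current path repeats, or an already explored state is reached.
        while s is not None and s not in position and s not in explored:
            position[s] = len(path)
            path.append(s)
            s = delta.get((s, a))
        if s is not None and s in position:
            cycles.append(path[position[s]:])  # states from s back around to s
        explored.update(path)
    return cycles


def is_reduced(states, alphabet, delta):
    """A machine is reduced if, for every symbol, its only cycles are self-loops."""
    return all(len(cycle) == 1
               for a in alphabet
               for cycle in a_cycles(states, delta, a))
```

Since each state has at most one transition on a given symbol, every walk either ends or closes exactly one cycle, so each a-cycle is reported once. The starting state of a reported cycle is whichever of its states the walk reaches first.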
Lemma 2.3.4: Suppose R is not a reduced machine. Then there is a reduced machine R' 
such that if R is replaced with R', then the protocol with T and R' is robust. 
Proof: To obtain R', we break all the $a$-cycles of R with more than one state. Let 
$s_0, s_1, \ldots, s_k$ be an $a$-cycle of R. Let $s_i$ be a state of the $a$-cycle such that there is a transition 
from $s_i$ on $b$, where $b \neq a$, that leads to a sequence of transitions where R writes into the 
output buffer without first reentering a state of the $a$-cycle. Let $b_{i-1} = \lambda_R(s_{i-1}, a)$, where $i-1$ 
is interpreted mod $k+1$. To break the $a$-cycle, replace the transition $\delta_R(s_i, a)$ and its output, 
if any, with the self-loop $\delta_{R'}(s_i, a) = s_i$ and make the transition output $\lambda_{R'}(s_i, a) = b_{i-1}$. By 
Lemma 2.3.3, R does not write into the output buffer during this transition. In this fashion, 
we break every $a$-cycle and construct a machine R' whose only $a$-cycles are self-loops, for 
all $a$. 
Note that in breaking an a -cycle we cannot choose an arbitrary si. Suppose we choose a 
state si of/? such that all sequences of transitions from s,- return to some state of the a-cycle 
and reenter state 5,- without writing into the output buffer. If/?' enters state si, then R' cannot 
output any more bits. 
Next, we show that if R is replaced with R', then the protocol with T and R' is robust. 
Suppose R' is in state SQ of the a-cycle and a is in cj. If R' reads its incoming cell j < i 
times before T makes its next move, then R enters state Sj and writes 6y_i into CR . Thus, R' 
enters the same state and sends the same symbol to T as R would have. 
If R' reads its incoming cell more than i times before T makes its next move, then R' 
arrives at state sx and writes 6,_i into CR . In this case, the execution of T and Rf would be 
the same as an execution of T and R where R read its incoming cell / times before T makes 
its next move. Thus R' leaves state st by making the same transition R would have. • 
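The cycle-breaking step in the proof of Lemma 2.3.4 can be sketched as a transformation on transition tables. The code below is only an illustration under assumptions: the dictionaries `DELTA` and `LAM` (standing for the transition and output functions of R), the three-state a-cycle, and the symbols `x0`, `x1`, `x2` are hypothetical, and the choice of the break index is supplied by hand rather than derived from the rest of the machine.

```python
def break_cycle(delta, lam, cycle, i, a):
    """Return copies of (delta, lam) in which the a-cycle is broken at cycle[i].

    delta maps (state, symbol) -> next state; lam maps (state, symbol) -> the
    symbol written into the outgoing cell.
    """
    s_i = cycle[i]
    s_prev = cycle[(i - 1) % len(cycle)]   # index i-1 interpreted mod k+1
    new_delta, new_lam = dict(delta), dict(lam)
    new_delta[(s_i, a)] = s_i              # self-loop on a at s_i
    new_lam[(s_i, a)] = lam[(s_prev, a)]   # resend b_{i-1}
    return new_delta, new_lam


def read_repeatedly(delta, lam, state, a, reads):
    """Read the unchanged symbol a `reads` times; return (state, last symbol written)."""
    written = None
    for _ in range(reads):
        written = lam[(state, a)]
        state = delta[(state, a)]
    return state, written


# Hypothetical three-state a-cycle s0 -> s1 -> s2 -> s0 with outputs x0, x1, x2.
CYCLE = ["s0", "s1", "s2"]
DELTA = {("s0", "a"): "s1", ("s1", "a"): "s2", ("s2", "a"): "s0"}
LAM = {("s0", "a"): "x0", ("s1", "a"): "x1", ("s2", "a"): "x2"}
DELTA2, LAM2 = break_cycle(DELTA, LAM, CYCLE, 2, "a")
```

With the cycle broken at index 2, any number of reads of at least 2 leaves the modified machine in `s2` with `x1` in its outgoing cell, which is exactly what the original machine shows after 2 reads.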
By Lemma 2.3.4, we may assume without loss of generality that R is a reduced 
machine. We now establish a lower bound on the number of symbols required in the 
transmission alphabet. Note that the proof is independent of the number of states and does 
not depend on the finiteness of the state set. 
Theorem 2.3.5: Any robust protocol requires at least three symbols in its transmission 
alphabet. 
Proof: Suppose to the contrary that two symbols A and B suffice. Let α ∈ {A, B}.
Intuitively, since in the asynchronous case the repetition of a symbol is indistinguishable,
there are essentially only two unbounded sequences of symbols that can be sent from T to R.
Thus there are not enough histories to encode all of the possible inputs.
More precisely, consider an execution in which whenever T writes a symbol α, R makes
a maximal sequence of transitions on α before T writes a new symbol. Since there are only
two symbols, each time T writes a new symbol, R alternately leaves states with a self-loop on
A and states with a self-loop on B. Thus, the sequence of transitions R makes between
different states is fixed after the first transition.
By Lemma 2.3.3, R does not output any bits on moves that are self-loops; otherwise R
may output arbitrarily many bits. Thus R may output a bit only if a move of R causes R to
enter a different state. But since the sequence of transitions made by R between different
states is fixed after the first, so is the sequence of buffer operations made by R. Thus the
output of R is a predetermined sequence of bits. If σ and σ' are two output strings of R and
|σ| < |σ'|, then σ is a prefix of σ'. We call this the prefix property.
It is clear that there are input strings to T that begin with the same bit but do not have
the prefix property. Thus there are input strings to T that cannot be output by R. It follows
then that there is no robust protocol that uses only two symbols in its transmission
alphabet. •
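The counting intuition behind Theorem 2.3.5 can be checked by enumeration. The sketch below assumes, as a simplification, that an asynchronous reader can observe only the sequence of symbol changes, so consecutive repeats are collapsed; the function names are illustrative. With two symbols, all observable histories that start with the same symbol are prefixes of one another; with three symbols they are not.

```python
from itertools import product


def collapse(history):
    """Drop consecutive repeats, which an asynchronous reader cannot count."""
    out = []
    for sym in history:
        if not out or out[-1] != sym:
            out.append(sym)
    return tuple(out)


def observable_histories(alphabet, max_len):
    """All collapsed histories arising from strings of length 1..max_len."""
    return {
        collapse(h)
        for n in range(1, max_len + 1)
        for h in product(alphabet, repeat=n)
    }


def has_prefix_property(histories):
    """True if any two histories with the same first symbol are prefix-comparable."""
    ordered = sorted(histories, key=len)
    return all(
        a[0] != b[0] or b[: len(a)] == a
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
    )


print(has_prefix_property(observable_histories("AB", 6)))
print(has_prefix_property(observable_histories("ABC", 3)))
```

The two-symbol alphabet yields a single chain of histories per starting symbol, which is why the receiver's output is forced to be a predetermined sequence of bits.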
We note that the result of Theorem 2.3.5 can be used to establish the need for a strobe or 
"data ready" signal in asynchronous hardware designs since data lines carry only two 
signals. One example of its use is for asynchronous data transfer between a CPU and I/O 
interfaces in computer systems (Mano, 1982). 
We now present a simple asynchronous protocol in Figure 2.1 for data communication 
in the case in which there are no transmission errors. Other protocols in later sections are 
variations of this protocol. For clarity, we give names to the symbols. An arc from state s to
state s' labeled a/b means that if the machine is in state s and reads a from its incoming
cell, then it changes state to s' and writes b into its outgoing cell. The asterisk (*) indicates
that T advances its input pointer, and the dagger (†) indicates that R writes a bit into its
output buffer.
The initial states of T and R are p and r, respectively. Initially, both shared memory
cells contain ε. When T is in state p, T tells R that T is ready to transmit the current bit in
its input buffer by sending SYNC to R. When R is in state r, R is ready to request the next
output bit from T. When R receives SYNC, R sends REQ to T and moves to state s. When
T receives REQ, T sends DATA to R, where DATA represents the current bit in T's input
buffer, and moves to state q. When R receives DATA, R writes the bit corresponding to
DATA into its output buffer, sends ACK to T, and moves to state r. When T receives ACK,
T advances its input pointer, sends SYNC to R, and moves to state p. Then the cycle
repeats.
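The cycle just described can be exercised with a small simulation. This is a minimal sketch under several assumptions that are not in the original text: each move is an atomic read, state change, and write; the scheduler picks T or R at random; the two DATA symbols are represented as the strings `DATA0` and `DATA1`; and the run stops once T has advanced past the last input bit. The self-loops are taken from the labels of Figure 2.1.

```python
import random

EPS = "eps"  # stands for the null symbol

# (state, symbol read) -> (next state, symbol written, advance input pointer?)
T_TABLE = {
    ("p", EPS): ("p", "SYNC", False),
    ("p", "ACK"): ("p", "SYNC", False),
    ("p", "REQ"): ("q", "DATA", False),
    ("q", "REQ"): ("q", "DATA", False),
    ("q", "ACK"): ("p", "SYNC", True),
}
# (state, symbol read) -> (next state, symbol written, write bit to output buffer?)
R_TABLE = {
    ("r", EPS): ("r", "ACK", False),
    ("r", "DATA"): ("r", "ACK", False),
    ("r", "SYNC"): ("s", "REQ", False),
    ("s", "SYNC"): ("s", "REQ", False),
    ("s", "DATA"): ("r", "ACK", True),
}


def simulate(bits, seed):
    """Run T and R under a random asynchronous schedule; return R's output buffer."""
    rng = random.Random(seed)
    c_t, c_r = EPS, EPS            # c_t: written by T, read by R; c_r: the reverse
    t_state, r_state = "p", "r"
    pointer, output = 0, []
    while pointer < len(bits):     # stop once T has advanced past the last bit
        if rng.random() < 0.5:     # T moves: read c_r, change state, write c_t
            t_state, written, advance = T_TABLE[(t_state, c_r)]
            c_t = "DATA%d" % bits[pointer] if written == "DATA" else written
            pointer += int(advance)
        else:                      # R moves: read c_t, change state, write c_r
            kind = "DATA" if c_t.startswith("DATA") else c_t
            r_state, c_r, emit = R_TABLE[(r_state, kind)]
            if emit:
                output.append(int(c_t[-1]))
    return output


print(simulate([0, 1, 1, 0], seed=1))
```

Whatever the interleaving, a machine that rereads a stale symbol only follows a self-loop, so the output buffer should match the input for every seed.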
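[Figure 2.1: state diagrams of transmitter T (states p, q) and receiver R (states r, s).
T: self-loops ε/SYNC and ACK/SYNC at p; p to q on REQ/DATA; self-loop REQ/DATA at q;
q to p on ACK/SYNC *. R: self-loops ε/ACK and DATA/ACK at r; r to s on SYNC/REQ;
self-loop SYNC/REQ at s; s to r on DATA/ACK †.]
Figure 2.1. A robust protocol when there are no transmission errors.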
Note that T and R alternately move to new states. Thus, during one cycle, T 
communicates one input bit to R, which R writes into its output buffer, and T advances its 
input pointer one bit. It is clear that the protocol satisfies the safety and liveness conditions of 
a robust protocol. 
Theorem 2.3.6: If there are no transmission errors, then the protocol of Figure 2.1 is optimal. 
Proof: T and R each have two states. By Theorems 2.3.1 and 2.3.2, the number of states is
minimum. To show that the number of symbols in the transmission alphabet is the minimum,
we assign values to the symbol names. One possible assignment (DATA = 0, 1; SYNC = ε;
REQ = 0; and ACK = 1) yields a transmission alphabet consisting of the three symbols {0,
1, ε}. By Theorem 2.3.5, three symbols are necessary. •
2.4. Transmission in the Presence of Deletion Errors 
In this section, we consider data transmission in the presence of deletion errors alone. 
We present a robust protocol for deletion errors and show that it is optimal in the presence of 
read faults for both compulsive and selective machines. We then give a more efficient 
protocol for selective machines and write faults and establish its optimality. Finally we show 
that compulsive and selective finite-state machines are different. 
2.4.1. Compulsive machines and read faults 
For read faults, the compulsive stop-and-wait protocol of Figure 2.2 ensures reliable
transmission in the presence of deletion errors. The protocol is basically the same as the
protocol of Figure 2.1 except that if a machine reads an ε, then it does not change state. A
machine reading an ε remains in its current state and resends the last symbol it wrote until it
receives a nonnull symbol. The protocol is robust under deletion errors since the machines
move through the same cycle of states as they do when there are no transmission errors. In
general, any asynchronous protocol that does not send ε and is robust in the absence of
transmission errors can be made robust for deletion errors by adding self-loops on ε at each
state that resend the nonnull symbol the machine last sent.
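The general construction in the last sentence can be sketched as a table transformation followed by a simulation with read faults. The sketch rests on assumptions that are not in the original text: the tables encode the labels of Figure 2.1, a read fault is modeled as a read that returns ε with probability `fault_prob` while the cell itself is unchanged, moves are atomic and randomly scheduled, and all function and variable names are illustrative.

```python
import random

EPS = "eps"  # stands for the null symbol

# Figure 2.1 tables: (state, symbol read) -> (next state, symbol written, flag).
# The flag marks "*" for T (advance pointer) and the dagger for R (output a bit).
T_FIG21 = {
    ("p", EPS): ("p", "SYNC", False), ("p", "ACK"): ("p", "SYNC", False),
    ("p", "REQ"): ("q", "DATA", False), ("q", "REQ"): ("q", "DATA", False),
    ("q", "ACK"): ("p", "SYNC", True),
}
R_FIG21 = {
    ("r", EPS): ("r", "ACK", False), ("r", "DATA"): ("r", "ACK", False),
    ("r", "SYNC"): ("s", "REQ", False), ("s", "SYNC"): ("s", "REQ", False),
    ("s", "DATA"): ("r", "ACK", True),
}


def add_eps_self_loops(table):
    """Add, at each state lacking one, a self-loop on EPS that resends the last symbol."""
    last_sent = {}
    for next_state, written, _flag in table.values():
        last_sent.setdefault(next_state, set()).add(written)
    result = dict(table)
    for state, symbols in last_sent.items():
        if (state, EPS) not in result and len(symbols) == 1:
            result[(state, EPS)] = (state, next(iter(symbols)), False)
    return result


def simulate(bits, seed, fault_prob=0.3):
    """Random asynchronous schedule in which each read may return EPS (read fault)."""
    rng = random.Random(seed)
    t_table, r_table = add_eps_self_loops(T_FIG21), add_eps_self_loops(R_FIG21)
    c_t, c_r = EPS, EPS            # c_t: written by T, read by R; c_r: the reverse
    t_state, r_state, pointer, output = "p", "r", 0, []
    while pointer < len(bits):
        t_moves = rng.random() < 0.5
        seen = c_r if t_moves else c_t
        if rng.random() < fault_prob:
            seen = EPS             # deletion error on this read
        kind = "DATA" if seen.startswith("DATA") else seen
        if t_moves:
            t_state, written, advance = t_table[(t_state, kind)]
            c_t = "DATA%d" % bits[pointer] if written == "DATA" else written
            pointer += int(advance)
        else:
            r_state, c_r, emit = r_table[(r_state, kind)]
            if emit:
                output.append(int(seen[-1]))
    return output
```

Applied to the Figure 2.1 tables, the transformation adds ε/DATA at q and ε/REQ at s, which are the two self-loops that distinguish Figure 2.2 from Figure 2.1.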
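[Figure 2.2: state diagrams of transmitter T (states p, q) and receiver R (states r, s).
T: self-loops ε/SYNC and ACK/SYNC at p; p to q on REQ/DATA; self-loops REQ/DATA and
ε/DATA at q; q to p on ACK/SYNC *. R: self-loops ε/ACK and DATA/ACK at r; r to s on
SYNC/REQ; self-loops SYNC/REQ and ε/REQ at s; s to r on DATA/ACK †.]
Figure 2.2. A robust protocol for deletion errors for compulsive machines.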
In the remainder of this section, we establish a lower bound on the number of symbols 
required in the transmission alphabet in the presence of deletion errors and read faults. We 
then show that the protocol of Figure 2.2 is optimal. 
Let T and R be the machines of a robust protocol for deletion errors. We may assume
R is reduced. Fix $a \in \Sigma$, where a is nonnull. Let $\alpha$ be one of {a, ε} and
$\bar{\alpha}$ be the other. An {a, ε}-sequence of R is a sequence of distinct states
$s_0, s_1, \ldots, s_k$ such that $\delta_R(s_i, \alpha) = s_{i+1}$ for $0 \le i < j_1$,
$\delta_R(s_{j_1}, \alpha) = s_{j_1}$, $\delta_R(s_i, \bar{\alpha}) = s_{i+1}$ for
$j_1 \le i < j_2$, $\delta_R(s_{j_2}, \bar{\alpha}) = s_{j_2}$,
$\delta_R(s_i, \alpha) = s_{i+1}$ for $j_2 \le i < j_3$, ...,
$\delta_R(s_{k-1}, \alpha^*) = s_k$, where $\alpha^*$ is either $\alpha$ or $\bar{\alpha}$.
That is, $s_0, \ldots, s_{j_1}$ is an $\alpha$-sequence, $s_{j_1}, \ldots, s_{j_2}$ is an
$\bar{\alpha}$-sequence, etc. An {a, ε}-cycle of R is a sequence of distinct states
$s_0, s_1, \ldots, s_k$ such that states $s_0, s_1, \ldots, s_k$ form an {a, ε}-sequence and
$\delta_R(s_k, \alpha^*) = s_0$. A state with self-loops on both a and ε is an {a, ε}-cycle
with one state.
Lemma 2.4.1: R does not write into the output buffer in the transitions of an {a, ε}-cycle.
Proof: The proof is similar to the proof of Lemma 2.3.3. Suppose to the contrary that R
does write into the output buffer in an {a, ε}-cycle. If R is in a state of an {a, ε}-cycle and a
is in R's incoming cell, then since T and R are asynchronous and there are read faults, R
may read an arbitrary sequence of a's and ε's before T makes its next move. Thus R may
make an arbitrary number of transitions in the {a, ε}-cycle and output an arbitrary number of
bits. •
For all a, a machine is {a, ε}-reduced if every {a, ε}-cycle has either one or two states.
Lemma 2.4.2: Suppose R is a reduced but not {a, ε}-reduced machine in a robust protocol
for deletion errors. Then there is an {a, ε}-reduced machine R' such that if R is replaced
with R', then the protocol with T and R' is robust.
Proof: Since the proof is similar to the proof of Lemma 2.3.4, we give a sketch. R' is
obtained from R by breaking each {a, ε}-cycle with more than two states and forming an
{a, ε}-cycle with two states.
Let $s_0, s_1, \ldots, s_k$ be an {a, ε}-cycle of R with more than two states. Let $s_i$ be a
state of the {a, ε}-cycle with a self-loop on a such that there is a transition from $s_i$ on some
nonnull symbol b that leads to a sequence of transitions in which R writes into the output
buffer without first reentering a state of the {a, ε}-cycle, if any, or any state of the
{a, ε}-cycle with a self-loop on a otherwise; clearly $b \ne a$. Let $s_j$ be a state of the
{a, ε}-cycle with a self-loop on ε such that there is a transition from $s_j$ on some nonnull
symbol c that leads to a sequence of transitions in which R writes into the output buffer
without first reentering a state of the {a, ε}-cycle, if any, and any state of the {a, ε}-cycle
with a self-loop on ε otherwise; clearly $c \ne a$. Since R is robust, there is a sequence of
transitions from at least one of $s_i$ and $s_j$ in which R outputs a bit. Let
$b_i = \lambda_R(s_i, a)$ and $b_j = \lambda_R(s_j, \epsilon)$.
To break the {a, ε}-cycle, replace the transition $\delta_R(s_i, \epsilon)$ and its output by
$\delta_{R'}(s_i, \epsilon) = s_j$ and $\lambda_{R'}(s_i, \epsilon) = b_j$, and replace the
transition $\delta_R(s_j, a)$ and its output by $\delta_{R'}(s_j, a) = s_i$ and
$\lambda_{R'}(s_j, a) = b_i$. By Lemma 2.4.1, R does not write into the output buffer during
this transition. In this fashion, we break every {a, ε}-cycle and construct a machine R' whose
only {a, ε}-cycles contain at most two states, for all a.
Next we show that if R is replaced with R', then the protocol with T and R' is robust.
Suppose R' is in a state of the {a, ε}-cycle and a is in $c_T$. Since R' may read an arbitrary
sequence of a's and ε's before T makes its next move, R' may leave the {a, ε}-cycle from
state $s_i$ or $s_j$. The execution of T and R' would be the same as an execution of T and R in
which R leaves the {a, ε}-cycle from state $s_i$ or $s_j$, respectively. •
By Lemma 2.4.2, we may assume without loss of generality that R is an {a, ε}-reduced
machine for all nonnull $a \in \Sigma$.
Theorem 2.4.3: Any robust protocol for deletion errors for compulsive machines and read
faults requires at least four symbols, including ε, in its transmission alphabet.
Proof: Suppose to the contrary that only two nonnull symbols A and B and ε suffice.
Intuitively, sending an ε in the presence of deletion errors and read faults does not convey
useful information because a machine receiving an ε cannot determine whether ε was sent or
a read fault occurred on a nonnull symbol. Thus machines should not send ε. The number of
symbols in the transmission alphabet is then, in effect, reduced to the number of nonnull
symbols.
More precisely, consider an execution of R in which whenever T writes a nonnull
symbol a, R makes transitions on a and ε caused by read faults until R enters a state of an
{a, ε}-cycle. By Lemma 2.4.1, R must leave the cycle to output another bit.
Let α be one of {A, B} and β be the other. Suppose R reads a sequence of α's and ε's
and enters a state of an {α, ε}-cycle. Since there are only three symbols, R can leave the
cycle only on β. If the {α, ε}-cycle has one state, then there is only one way R can leave the
cycle. If the {α, ε}-cycle has two states, then R can leave the cycle in two ways. We
consider two cases.
Case 1: While R is in the {α, ε}-cycle, if T sends β only after T sends and R receives
ε, then the only way R can leave the {α, ε}-cycle is on a transition on β from the state of the
cycle with a self-loop on ε.
Case 2: While R is in the {α, ε}-cycle, if T can send β immediately after sending α,
then R can leave the cycle either from the state with a self-loop on α, or if a read fault occurs,
from the state with a self-loop on ε. The latter happens if R reads its incoming cell twice
before T moves and a read fault occurs on the first read. If R reads an ε, then R cannot
determine whether T wrote ε or T wrote β and a deletion error occurred. Thus, if R leaves
the {α, ε}-cycle on a transition on β from the state with a self-loop on ε, then the next bit R
outputs must be the same as the one R would have output if R had left the cycle from the
state with a self-loop on α. A fortiori, all subsequent bits that are output must be the same.
The sequence of bits output by R is fixed after the first. By the proof of Theorem 2.3.5,
the output of R has the prefix property. Thus, there is no robust protocol for deletion errors
for compulsive machines and read faults that uses only two nonnull symbols and ε in its
transmission alphabet. •
We now establish the optimality of the protocol shown in Figure 2.2. 
Theorem 2.4.4: The protocol of Figure 2.2 is optimal for compulsive machines and read 
faults. 
Proof: By Theorems 2.3.1 and 2.3.2, the protocol has the minimum number of states. To
show that the number of symbols in the transmission alphabet is the minimum, we assign
values to the symbol names. One possible assignment (DATA = 0, 1; SYNC = 2; REQ = 0;
ACK = 1; and ε = ε) yields a transmission alphabet consisting of the four symbols {0, 1, 2,
ε}. By Theorem 2.4.3, four symbols are necessary. •
2.4.2. Selective machines and read faults 
The protocol of Figure 2.2 is also robust for selective machines and read faults since a 
compulsive machine is simply a selective machine that writes on every transition. The 
protocol is also optimal for selective machines since the sequence of symbols read by T and
R in the presence of read faults when they are selective machines can be the same as when
they are compulsive machines.
The protocol for selective machines could be slightly simplified, however, since a
selective machine does not have to send a symbol on every transition. Since write operations
are reliable, a machine needs to send a symbol only once. Thus, we could modify the
protocol of Figure 2.2 and eliminate writes on all self-loops, except for the initial ε/SYNC
transition for T to get the protocol started.
2.4.3. Selective machines and write faults 
The protocol of Figure 2.3 is robust for deletion errors for selective machines and write
faults. A dash (-) indicates that a machine does not write a symbol on the transition. The
protocol of Figure 2.3 is essentially the protocol of Figure 2.2 with the SYNC symbol
replaced by ε. The main difference is that in the protocol of Figure 2.3 machines can be made
to wait. Since R sends only nonnull symbols, if T receives an ε, then T remains in its current
state until R successfully resends the last symbol.
[Figure 2.3: state diagrams of transmitter T and receiver R. T: p to q on REQ/DATA;
q to p on ACK/ε *; self-loops REQ/DATA and ε/-. R: r to s on ε/REQ; s to r on
DATA/ACK †; self-loops DATA/ACK and ε/REQ.]
Figure 2.3. A robust protocol for deletion errors for selective machines and write faults.
Theorem 2.4.5: The protocol of Figure 2.3 is optimal for selective machines and write faults. 
Proof: By Theorems 2.3.1 and 2.3.2, the protocol has the minimum number of states. To 
show that the number of symbols in the transmission alphabet is the minimum, we assign 
values to the symbol names. One possible assignment (DATA = 0, 1; REQ = 0; ACK = 1;
and ε = ε) yields a transmission alphabet consisting of the three symbols {0, 1, ε}. By
Theorem 2.3.5, three symbols are necessary. •
2.4.4. Compulsive machines and write faults 
It is straightforward to verify that the protocol of Figure 2.2 is robust under deletion 
errors for compulsive machines and write faults. We showed that for selective machines and 
write faults there is a robust protocol for deletion errors in which T and R each have two 
states and the transmission alphabet consists of three symbols. In contrast, we show in this 
section that for compulsive machines and write faults there is no robust protocol for deletion 
errors in which the transmitter and receiver each have two states and the transmission 
alphabet has three symbols. Thus, the compulsive and selective models are different. 
Theorem 2.4.6: For compulsive machines and write faults there is no robust protocol for 
deletion errors that consists of a two-state transmitter, a two-state receiver, and a transmission 
alphabet with three symbols. 
Proof: Suppose to the contrary that there is such a protocol. We first show that any robust
two-state receiver must be similar to the receiver in the protocol of Figure 2.3.
Let R be the receiver and let the two states of R be r and s. By Lemma 2.3.3, any
buffer operation performed by R must occur on a transition between the two states. Without
loss of generality, assume that if R is in state s and reads a, then R makes a transition to state
r, writes 0 into its output buffer, and sends x to T. Then R has no transition from state r to s
on a; otherwise, R could output an arbitrary number of 0's. Thus, state r has a self-loop on
a.
Next, consider a transition of R that writes a 1 into the output buffer. It can be seen that
there must be a transition either from state r to s or from state s to r that occurs on some
symbol b, where $b \ne a$. We show that R cannot have a transition from r to s on b that
writes 1 into the output buffer.
Suppose to the contrary that R does have such a transition. Then state s has a self-loop
on b. If R is in state r and must next output a 0, then R must make a transition to state s on
some third symbol c, where $c \ne a$ and $c \ne b$. Similarly, if R is in state s and must next
output a 1, then R must make a transition to state r on some symbol d, where $d \ne a$ and
$d \ne b$. Since there are only three symbols, d = c. Thus R has a c-cycle.
By Lemma 2.3.3, R does not write into the output buffer on the transitions of the
c-cycle. Suppose R is in state r and must next output a 0. Then R must make an odd number
of transitions on c before T makes its next move and sends an a. But since T and R are
asynchronous, R may make an even number of moves and fail to output a 0. The next bit that
R outputs may then violate the safety condition.
Thus, the buffer operation for R to write a 1 into the output buffer occurs on a transition
from state s to r on b. Let y be the symbol that R sends to T on the transition. Note that y
may be the same as x. To avoid a b-cycle, state r has a self-loop on b.
Finally, R needs a transition from state r to return to state s. The transition occurs on
some symbol c, where $c \ne a$ and $c \ne b$. Let z be the symbol that R sends to T on the
transition. To avoid a c-cycle that may cause R to fail to make an output, state s has a
self-loop on c.
Now we assign values to the symbols a, b, and c. Since there are deletion errors, a and
b must be nonnull. Otherwise, if R is in state s and a deletion error occurs, then R may
output the wrong bit. Thus c must be ε. We have now established that any robust two-state
receiver is similar to the receiver in the protocol of Figure 2.3.
Next, we consider the structure of the transmitter. Let T be the transmitter and let the
two states of T be p and q. Transmitter T communicates the current input bit to R by
sending a symbol to R. By the structure of R, the symbol must be nonnull. Since there are
deletion errors, however, if the symbol that T sends to R is deleted, then T must be able to
resend that symbol. Thus T cannot advance its input pointer on the transition in which T first
sends a symbol to communicate the current input bit to R. Otherwise, T may skip an input
bit.
From the above restriction, the structure of R, and the requirement that T advance its
input pointer on a transition between two different states, we infer that T communicates the
current input bit by sending either a or b, and T advances its input pointer on a transition
between states p and q on either x or y.
Without loss of generality, assume that T advances its input pointer on the transition
from state q to state p on x. Given the structure of R, T sends c on the transition. To avoid
an x-cycle that may cause T to skip input bits, state p has a self-loop on x. Similarly, there
are corresponding transitions for y. To allow T to return to state q, there must be a transition
from state p to q on some other symbol z. Given the structure of R, T first sends a symbol
to communicate the current input bit to R on a transition from state p to q. Thus there are
two transitions, one for communicating a 0 and one for communicating a 1. To avoid a
z-cycle that can cause T to communicate an input bit more than once and to allow T to resend a
symbol in the case of a deletion error, state q has two self-loops on z, one for communicating
a 0 and one for communicating a 1.
We now show that the construction of T cannot be completed while preserving the
robustness of the protocol. Since there are deletion errors, T must have transitions on ε from
each of states p and q. We show that there is no transition from state q on ε that preserves
the robustness of the protocol. We consider three cases.
Case 1: T has a transition from q to p on ε and T does not advance its input pointer on
the transition. If a deletion error occurs when R sends x or y, then T does not advance its
input pointer after communicating the current input bit, and eventually R will output an extra
bit.
Case 2: T has a transition from q to p on ε and T advances its input pointer on the
transition. Then T may advance its input pointer before communicating the current input bit
to R as follows: T sends a to R and the a is deleted, and then the z that R sends to T is
deleted.
Case 3: T has a self-loop on ε in state q. Consider the following execution. Let the
current input bit be 0. Suppose T is in state q and sends a to R. When R receives a, R
moves to state r, writes 0 into the output buffer, and sends x to T. Now a deletion error
occurs and T receives ε. Since T is compulsive, T must send some symbol to R, but another
deletion error occurs. When R receives ε, R moves to state s and sends z to T. When T
receives z, T sends a to R. When R receives a, R writes an extra 0 into the output buffer. •
Note that in the selective model the self-loop on ε in state q does not cause extra bits to
be output because when T receives ε, T does not have to send another symbol to R. Thus the
nonnull symbol T sent to R remains in R's incoming cell.
We have shown that there is no robust protocol for deletion errors for compulsive 
machines and write faults that consists of a two-state transmitter and a two-state receiver that 
uses only three symbols. The protocol of Figure 2.2 is a protocol that uses four symbols. We 
conjecture that any robust protocol for deletion errors for compulsive machines and write 
faults requires four symbols, regardless of the number of states. 
We have thus constructed robust protocols for deletion errors for both compulsive and 
selective machines, and for both read and write fault models of the shared memory. We have 
established the optimality of the protocols for read faults and for selective machines and write 
faults. We have also shown that compulsive and selective machines are different. 
2.5. Transmission in the Presence of Insertion Errors 
We first consider transmission in the presence of insertion errors and read faults. 
Theorem 2.5.1: Any robust protocol for insertion errors and read faults requires at least three 
nonnull symbols in its transmission alphabet. 
Proof: Suppose to the contrary that only ε and two nonnull symbols A and B are necessary.
If T sends ε, then R may receive ε or any nonnull symbol. Since T and R are
asynchronous, if T writes ε, then R may read an arbitrary sequence of ε's, A's, and B's
caused by insertion errors before T makes its next move. Thus R may read a sequence of
symbols that causes R to output the wrong bit. If T never sends ε, then the transmission
alphabet is reduced to two symbols. By Theorem 2.3.5, three symbols are necessary. •
Theorem 2.5.2: The protocol of Figure 2.1 is robust for insertion errors and read faults and
optimal.
Proof: The protocol of Figure 2.1 is robust for insertion errors since neither machine sends ε.
To show that the number of symbols in the transmission alphabet is the minimum, we assign
values to the symbol names. One possible assignment (DATA = 0, 1; SYNC = 2; REQ = 0;
and ACK = 1) yields a transmission alphabet consisting of the three nonnull symbols
{0, 1, 2} and ε. It follows from Theorems 2.3.1, 2.3.2, and 2.5.1 that the protocol is optimal. •
For compulsive machines and write faults, the protocol of Figure 2.4 is robust under
insertion errors and uses only two nonnull symbols and ε. The pound sign (#) represents any
nonnull symbol. The protocol of Figure 2.4 is essentially the protocol of Figure 2.1 with
SYNC replaced by ε. When R is in state r, R expects to receive ε. If an insertion error
occurs and R receives a nonnull symbol, then R remains in state r until it receives ε. It
follows from Theorems 2.3.1, 2.3.2, and 2.3.5 that the protocol is optimal.
For selective machines and write faults, there is a robust protocol for deletion and
insertion errors that uses two nonnull symbols and ε. We present the protocol in the next
section.
[Figure 2.4: state diagrams of transmitter T and receiver R. Recoverable arc labels:
REQ/DATA, ACK/ε *, REQ/DATA (T); ε/REQ, DATA/ACK †, ε/REQ (R).]
Figure 2.4. A robust protocol for insertion errors for compulsive machines and write faults.
2.6. Transmission in the Presence of Mutation and Multiple Errors 
We first consider transmission in the presence of mutation errors and then in the 
presence of deletion and insertion errors. 
Theorem 2.6.1: There is no robust protocol in the presence of mutation errors. 
Proof: Consider the symbols sent by T and R. Since there are mutation errors, a nonnull
symbol sent by T may be received as any nonnull symbol by R. Thus the transmission
alphabet is, in effect, reduced to only two symbols, ε and #, where # represents any nonnull
symbol of the transmission alphabet. By Theorem 2.3.5, at least three symbols are necessary.
Thus, there is no robust protocol for mutation errors. •
Since there are no robust protocols for mutation errors, there are also no robust protocols 
for combinations of errors that include mutation errors. The only combination remaining is 
deletion and insertion errors. But it can be seen that the protocol of Figure 2.1 for compulsive 
machines is also robust in the presence of both deletion and insertion errors. Since T and R 
do not send ε, the deletion and insertion errors case is, in effect, reduced to the deletion errors
only case. Thus the protocol is robust even if both deletion and insertion errors are present.
Clearly, the protocol is also robust for selective machines.
For the case of selective machines and write faults, however, the protocol of Figure 2.5
requires only three symbols, including ε, in its transmission alphabet. The protocol of Figure
2.5 is very similar to the protocol of Figure 2.4. Since T expects only nonnull symbols, if T
receives an ε, then T remains in its current state and waits for a nonnull symbol. By
Theorems 2.3.1, 2.3.2, and 2.3.5, the protocol is optimal.
2.7. Conclusions 
We have modeled asynchronous communication protocols with communicating finite-
state machines and established lower bounds on the size of machines and the number of 
symbols needed in the transmission alphabet to ensure reliable communication. We carefully 
studied two different types of machines, compulsive and selective, and two fault models for 
the shared memory, the read fault model and the write fault model. 
[Figure 2.5: state diagrams of transmitter T and receiver R. Recoverable arc labels:
REQ/DATA, ACK/ε *, REQ/DATA, ε/- (T); ε/REQ, DATA/ACK †, ε/REQ (R).]
Figure 2.5. A robust protocol for deletion and insertion errors for selective machines and write faults.
We summarize our results and the results of Aho et al. (1982) for synchronous protocols
in Table 2.1. An entry containing a triple x, y, z indicates that there is a robust protocol for
that class of errors and that the protocol has x states for the transmitter, y states for the
receiver, and z symbols in the transmission alphabet. An asterisk (*) indicates the
asynchronous protocols that are optimal in both the size of the machines and the number of
symbols in the transmission alphabet. A "Yes" entry appears for the results of Aho et al.
where they established only the existence of robust protocols for a class of errors. A "No"
entry appears for those classes of errors where no robust protocol exists.

Table 2.1. Existence of robust protocols in the presence of various errors.

                                               Asynchronous
                              Compulsive Machines         Selective Machines
Errors          Synchronous   Read Fault   Write Fault    Read Fault   Write Fault
Deletion        2,2,3         2,2,4*       2,2,4          2,2,4*       2,2,3*
Insertion       Yes           2,2,4*       2,2,3*         2,2,4*       2,2,3*
Mutation        Yes           No           No             No           No
Del.-Ins.       Yes           2,2,4*       2,2,4          2,2,4*       2,2,3*
Del.-Mut.       3,3,2         No           No             No           No
Ins.-Mut.       Yes           No           No             No           No
Del.-Ins.-Mut.  No            No           No             No           No

We have shown that for each combination of machine type and memory fault there is a 
robust asynchronous communication protocol for deletion and insertion errors. We 
established that the protocols for compulsive machines in the read fault model and for 
selective machines in both the read and write fault models have the minimum number of 
states for the machine and the minimum number of symbols in the transmission alphabet 
simultaneously. For compulsive machines in the write fault model, our protocol uses the 
minimum number of symbols when the transmitter and receiver have only two states each. 
We also showed that there are no robust asynchronous protocols for mutation errors. In 
contrast, robust synchronous protocols for mutation errors exist. Finally, our results illustrate 
the differences between read faults and write faults, and between compulsive machines and 
selective machines. 
One area of further research is to investigate protocols for restricted types of mutation 
errors. For example, mutation errors may be limited such that a symbol that is sent is either 
received correctly or mutated to one of a subset of symbols in the transmission alphabet. 
Other restrictions can be placed to limit the number of different symbols that may be mutated 
to the same symbol. 
CHAPTER 3. 
PARALLEL ALGORITHMS FOR MINIMUM SPANNING TREE 
IN LOGARITHMIC TIME WITHOUT PRIORITY WRITING 
3.1. Introduction 
Let G = (V,E) be a connected graph with a set of vertices V and a set of edges E in 
which each edge e has a weight w(e). Without loss of generality, assume that the edge 
weights are distinct. Hence, the minimum spanning tree (MST) of G is unique. Let n = |V|
and m = |E|. We present an algorithm for finding the MST of G on a Common Concurrent
Read Concurrent Write (CRCW) Parallel Random Access Machine (PRAM) in O(log n) time
using $2m + n^{1+2\epsilon}$ processors, where $\epsilon$ is a constant such that $0 < \epsilon < 1/2$.
The MST problem is an important problem of combinatorial optimization. Some 
practical applications of MSTs include the design of computer, communication, and 
transportation networks. Graham and Hell (1985) gave an extensive history of the MST 
problem. 
Yao (1975) and Cheriton and Tarjan (1976) designed sequential MST algorithms that 
run in time O(m log log n). Fredman and Tarjan (1987) gave an improved algorithm which
runs in time $O(m\beta(m,n))$, where $\beta(m,n) = \min\{i \mid \log^{(i)} n \le m/n\}$. If m > n, then
$\beta(m,n) \le \log^* n$. Gallager et al. (1983) presented a distributed MST algorithm that uses at
most 5n log n + 2m messages and O(n log n) time. Awerbuch (1987) presented a
distributed MST algorithm that uses O(m + n log n) messages and O(n) time. The algorithm
is optimal in both communication and time.
There have also been several parallel MST algorithms. Chin et al. (1982) presented an 
efficient Concurrent Read Exclusive Write (CREW) PRAM algorithm that runs in $O(\log^2 n)$
time using $O(n^2/\log^2 n)$ processors. Hirschberg (1982) gave an algorithm for the Common
CRCW PRAM which runs in O(log n) time using $n^3$ processors. Awerbuch and Shiloach
(1987) designed an algorithm for the Priority CRCW PRAM which runs in O(log n) time
using O(m + n) processors. Cole and Vishkin (1986a) presented a Priority CRCW PRAM
algorithm that runs in $O(\log n \log^{(2)} n \log^{(3)} n)$ time using j*" g) processors. Cole
and Vishkin (1986b) then gave a slightly faster Priority CRCW PRAM algorithm that runs in
O(log n) time using fo+gW3** processors.
Our MST algorithm is based on modifications of the algorithm of Awerbuch and 
Shiloach. We employ some of the results of Fich et al. (1988a) to obtain a Common CRCW 
PRAM MST algorithm. A straightforward modification yields an algorithm that runs in 
O(log n) time using $mn + n^{1+2\epsilon}$ processors. We then reduce the number of processors to
$2m + n^{1+2\epsilon}$. Our algorithm has the same running time as the algorithm of Hirschberg but
uses fewer processors. For mildly dense graphs, where $m = \Omega(n^{1+2\epsilon})$, our algorithm has the
same performance as the algorithm of Awerbuch and Shiloach using a weaker CRCW PRAM 
model. 
In Section 3.2, we describe the model of computation. In Section 3.3, we review the 
MST algorithm of Awerbuch and Shiloach. In Section 3.4, we present our Common CRCW 
PRAM algorithms. 
3.2. Model of Computation 
A PRAM consists of a set of reliable synchronous processors, each with its own local 
memory, and a global shared memory. Each step of a processor of the PRAM consists of 
three stages. In the first stage, each processor may read a value from one shared memory cell 
into local memory. In the second stage, each processor may perform a local computation on 
values in its local memory. In the third stage, each processor may write a value from local 
memory into one shared memory cell. In the CRCW PRAM, any number of processors may 
simultaneously read from the same shared memory cell during a read stage, and any number 
of processors may simultaneously write to the same shared memory cell during a write stage. 
If more than one processor writes into the same cell, then the value that is written depends on 
the model. 
Two CRCW PRAM models are the Priority model and the Common model. For each 
model, we describe what happens when two or more processors simultaneously try to write 
into the same cell. In the Priority model, each processor has a unique fixed priority. The 
processor with the highest priority is the one that succeeds. In the Common model, all 
processors must write the same value. Fich et al. (1988a), Fich et al. (1988b), Chlebus et al. 
(1988), and Boppana (1989), among others, have studied the relationships between the two 
CRCW PRAM models. 
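The two write rules can be illustrated with a minimal sketch. The function name resolve_write and the representation of a write request as a (processor index, value) pair are assumptions made for illustration only. The sketch also assumes the convention of Section 3.3 that a smaller processor index means a higher priority.

```python
def resolve_write(requests, model):
    """Resolve simultaneous writes to one shared memory cell.

    requests: list of (processor_index, value) pairs aimed at the same cell.
    model: "priority" or "common".
    Returns the value stored in the cell, or None if no processor writes.
    """
    if not requests:
        return None
    if model == "priority":
        # Assumed convention: the smallest index has the highest priority.
        return min(requests, key=lambda request: request[0])[1]
    if model == "common":
        values = {value for _, value in requests}
        if len(values) != 1:
            # The Common model only permits writers that agree on the value.
            raise ValueError("Common model: all writers must write the same value")
        return values.pop()
    raise ValueError("unknown model: " + model)
```

In the Priority case the cell receives the value of the single winning processor. In the Common case the sketch rejects a conflicting write outright, since the model does not define its result.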
3.3. The MST Algorithm of Awerbuch and Shiloach 
The algorithm of Awerbuch and Shiloach (1987) uses a Priority CRCW PRAM. The 
priority of each processor is determined by its index. The smaller the index, the higher its 
priority. 
Each edge of G is represented by two oppositely directed edges. The algorithm assigns 
processors to edges such that the smaller the weight of an edge, the higher the priority of the 
corresponding processor. The assignment can be made by sorting the edges by weight and 
then assigning processors in order. Let p(i, j) and p(j, i) be the processors assigned to edge
(i, j). The sort can be done in O(log n) time using the parallel merge sort algorithm of Cole
(1988).
Without loss of generality, assume that the vertices of G are numbered from 1 to n. The 
number of a vertex is its identifier, denoted id. We will refer to vertices by their ids. The
notation u < v means that the id of vertex u is smaller than the id of vertex v. 
In the algorithm, there are variables associated with each vertex i. The processors that 
operate on these variables, however, correspond to edges. Each vertex i has a parent P(i),
which is either another vertex or itself. If a vertex is a root, then its parent is itself. The 
parent-child relation defines a directed graph called the parents graph, PG. PG has the same 
vertices as G. Define GP(i) = P(P(i)), and call GP(i) the grandparent of i.
The algorithm maintains a set T of undirected edges which always forms a forest of the 
MST. The algorithm adds edges to T using the property that for any subset of vertices, the 
edge of least weight leaving the set must belong to the MST. Set T grows until it becomes 
the MST. 
A rooted tree is a tree whose edges are directed toward the root. A star is a rooted tree 
with height 1. The algorithm maintains the invariant that after each iteration, for each 
directed tree in PG, there is a subtree in T spanning the same set of vertices. The algorithm 
finds edges of the MST by trying to combine trees in PG. Two trees are combined by 
hooking one tree to the other. The algorithm hooks a tree $T_1$ to a tree $T_2$ by making some
vertex of $T_2$ the parent of the root of $T_1$. Processors that correspond to edges leaving a star
try to hook the star to a tree. Edges that correspond to successful processors are added to T. 
After the stars are hooked, the algorithm reduces the height of each tree with a shortcut 
operation, where each vertex takes its grandparent to be its new parent. 
A Boolean variable T(e) is attached to each edge e; T(e) is initially 0. During the 
algorithm, if edge e is added to T, then T(e) is set to 1. Winner(i) contains the name of
the edge corresponding to the writing processor. After the initialization, the algorithm 
iterates three Parts until all of the vertices are in the same star in PG. The algorithm is 
executed in parallel by each edge processor p (i, j). 
Priority CRCW PRAM Minimum Spanning Tree Algorithm
Initialization
    T(e) := 0 for all e ∈ E;
    P(i) := i for i = 1, ..., n;
For each processor p(i, j)
repeat
    Part 1: (Star hooking)
        if i belongs to a star and P(i) ≠ P(j) then
            P(P(i)) := P(j);
            Winner(P(i)) := (i, j);
        endif
        if Winner(P(i)) = (i, j) then
            T(i, j) := 1;
        endif
    Part 2: (Cycle breaking)
        if i < P(i) and i = GP(i) then
            P(i) := i;
        endif
    Part 3: (Shortcut operation)
        P(i) := GP(i);
until every vertex i belongs to the same star
Part 1 performs the hooking operation. Processors that correspond to edges leaving a 
star try to hook the star to another tree. A star is hooked to a tree by setting the parent of the 
root of the star to a vertex of the tree to which the star is being hooked. If more than one 
processor tries to hook the star, then the processor with the highest priority succeeds. Thus, 
Winner (i) contains the name of the edge e of smallest weight leaving the star. Since edge e 
belongs to the MST, the algorithm sets T(e) := 1. At the end of Part 1, every star is hooked 
to some tree. 
Part 2 eliminates any cycles that may have been formed in the parents graph. A cycle
of length two forms when an edge's endpoints belong to two different stars and the edge is 
the edge of least weight leaving both stars. To break a cycle, the algorithm changes the 
parent pointer of the vertex with the smaller id to point to itself. 
Part 3 performs the shortcut operation. For each vertex i, the algorithm sets the 
grandparent of i to be the new parent of i. Note that if more than one processor updates
P(i), then all processors write the same value. The height of each tree that is not a star 
decreases by a factor of at least 3/2. 
A vertex determines whether it belongs to a star by using Procedure Star_Check. At the 
termination of Star_Check, if ST (i) is true (false), then i belongs (does not belong) to a star. 
Procedure Star_Check
For each processor p(i)
    ST(i) := true;
    if P(i) ≠ GP(i) then
        ST(i) := false;
        ST(GP(i)) := false;
    endif
    ST(i) := ST(P(i));
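A sequential rendering of Star_Check may help in following the three steps. It is a sketch, not the PRAM procedure itself. The name star_check and the list representation of P, with vertices numbered from 0, are assumptions made for illustration. One further assumption deserves emphasis: the last assignment is written as the conjunction of ST(i) and ST(P(i)). Under a strictly synchronous read-then-write reading of the last line, a vertex at depth two could otherwise copy a stale true from its depth-one parent.

```python
def star_check(parent):
    """Return ST, where ST[i] is True iff vertex i belongs to a star.

    parent[i] is P(i); a root satisfies parent[i] == i.
    """
    n = len(parent)
    grand = [parent[parent[i]] for i in range(n)]   # GP(i) = P(P(i))
    st = [True] * n
    # A vertex at depth two or more clears its own flag and its grandparent's.
    for i in range(n):
        if parent[i] != grand[i]:
            st[i] = False
            st[grand[i]] = False
    # Every vertex consults its parent; all reads use the flags from the
    # previous step (assumed conjunction, see the lead-in).
    return [st[i] and st[parent[i]] for i in range(n)]
```

For example, with parent = [0, 0, 1, 3, 3] the tree rooted at 0 has height two, so its three vertices are not in a star, while vertices 3 and 4 form a star.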
Awerbuch and Shiloach established the correctness of their algorithm. We briefly verify 
the running time. Consider each iteration of the three Parts. Parts 1 and 2 ensure that every 
star is hooked to some tree to yield a new tree with height greater than one. Since Part 3 
reduces the height of every tree with height greater than one by a factor of at least 3/2, the 
sum of the heights of all of the trees present at the start of the iteration is reduced by a factor 
of at least 3/2. Thus, O(log n) iterations yield a single star. Since each iteration takes O(1)
time, the algorithm runs in O(log n) time.
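The three Parts can be traced with a small sequential simulation. This is a sketch under several assumptions, not the PRAM algorithm: the names in_star and mst_by_hooking are hypothetical, vertices are numbered from 0, and the priority write is emulated by visiting the edge processors in order of increasing weight and letting the first writer to each root win. The star test uses a conjunction in its last step, which is also an assumption of the sketch. The Winner array is replaced by a dictionary that is rebuilt in every iteration.

```python
def in_star(P):
    """Star flags for the parent array P (last step is an assumed conjunction)."""
    st = [True] * len(P)
    for i in range(len(P)):
        g = P[P[i]]
        if P[i] != g:
            st[i] = False
            st[g] = False
    return [st[i] and st[P[i]] for i in range(len(P))]


def mst_by_hooking(n, edges):
    """Simulate Parts 1-3 sequentially.

    edges: iterable of (weight, u, v) with distinct weights on a connected
    graph with vertices 0..n-1. Returns the MST as a set of (weight, u, v)
    triples with u < v.
    """
    # One processor per directed edge; sorting by weight gives the priorities.
    procs = sorted((w, i, j) for w, u, v in edges for i, j in ((u, v), (v, u)))
    P = list(range(n))
    T = set()
    while any(P[v] != P[0] for v in range(n)):       # not yet a single star
        st = in_star(P)
        old = P[:]                                    # values read before any write
        winner = {}                                   # Winner(root) for this iteration
        # Part 1: the lightest edge leaving each star hooks the star's root.
        for w, i, j in procs:
            root = old[i]
            if st[i] and root != old[j] and root not in winner:
                winner[root] = (w, i, j)
                P[root] = old[j]
        for w, i, j in winner.values():
            T.add((w, min(i, j), max(i, j)))
        # Part 2: break each cycle of length two at the vertex of smaller id.
        old = P[:]
        for i in range(n):
            if i < old[i] and i == old[old[i]]:
                P[i] = i
        # Part 3: shortcut, every vertex adopts its grandparent as its parent.
        old = P[:]
        P = [old[old[i]] for i in range(n)]
    return T
```

On the graph with edges (1,0,1), (2,1,2), (3,2,3), (4,0,3), (5,0,2), written as (weight, u, v), the simulation needs two iterations and returns the three lightest edges.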
3.4. Common CRCW PRAM MST Algorithm 
Our Common CRCW PRAM MST algorithm is the same as the algorithm of Awerbuch 
and Shiloach except that Part 1 is modified to eliminate the priority concurrent write. Thus 
we describe the modified implementation of Part 1 only. In our algorithm, we avoid the 
priority write by determining the processor of highest priority wanting to write to each 
memory cell and having only those processors write. It can be seen that the values written in 
the memory cells are the same as those that would have been written in the Priority CRCW 
PRAM model. 
To determine the processor of highest priority writing to each cell, we solve a special 
case of the r-color minimization problem described in Fich et al. (1988a). 
r-Color Minimization Problem 
Before: Each processor $p_i$, i = 1, ..., p, has a color $x_i$, $0 \le x_i \le r$, known only to itself.
Color $x_i$ represents the cell $p_i$ wants to write, if any, and 0 otherwise.
After: Each processor $p_i$ knows the value $a_i$, where $a_i = 1$ if and only if $p_i$ is the processor
of lowest index writing into the cell represented by $x_i$.
For our algorithm, we consider the case in which r = 1. Fich et al. showed that on a
Common CRCW PRAM with k memory cells, the 1-color minimization problem can be
solved in Q( i * % u \ ) steps. In our discussion, we present a simplified variation of their
method and show how the problem can be solved in $O(\frac{\log p}{\log k})$ steps.
Let $M_1, \ldots, M_k$ be the k memory cells. Assume without loss of generality that $k \le p^\epsilon$,
where $\epsilon$ is a constant such that $0 < \epsilon < 1/2$. If $k > p^{1/2}$, then only the first $p^{1/2}$ cells are
needed to achieve O(1) steps. The algorithm iterates the following steps. Processor $p_i$,
i = 1, ..., k, writes 0 into $M_i$. The processors are then divided into k groups of nearly equal
size, where each group is a set of consecutively numbered processors. The first p mod k
groups contain $\lceil p/k \rceil$ processors, and the remaining groups contain $\lfloor p/k \rfloor$ processors. A
processor $p_i$ in the jth group, $1 \le j \le k$, writes 1 into $M_j$ if and only if $x_i = 1$.
The winner is the processor of smallest index with $x_i = 1$. Thus the winner is in the
group corresponding to the $M_j$ of smallest index containing 1. The algorithm determines the
winning group by using the subroutine Leftmost One.
Leftmost One 
Before: Cells $M_i$, i = 1, ..., k, each contain 0 or 1.
After: $M_i$ contains 1 if and only if all $M_j$ for j < i were initially 0, and $M_i$ was initially 1.
The Leftmost One algorithm compares all pairs of cells $M_i$ and $M_j$, $1 \le i, j \le k$. If j < i
and $M_i$ and $M_j$ both contain 1, then the algorithm writes 0 into $M_i$. The algorithm requires
$k^2 \le p$ processors. After applying the Leftmost One subroutine, processors in group j read
$M_j$. A group determines it is the winning group if its processors read a 1.
All processors that are not in the winning group set $a_i := 0$ and stop. The processors in
the winning group then repeat the 1-color minimization algorithm. This process repeats until 
the winning group contains only one processor, the winner. 
Each iteration of the 1-color minimization algorithm reduces the number of processors 
that may be the winner by a factor of k. Thus the winner is determined in at most $\log_k p$
iterations. Since each iteration takes O(1) steps, the winner is determined in $O(\frac{\log p}{\log k})$ steps.
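The iteration just described can be simulated sequentially. The sketch below is illustrative only: the names leftmost_one and one_color_min are hypothetical, processors are numbered from 0, and the simulation does not model the $k^2 \le p$ processors that perform the pairwise comparisons. It returns the winner together with the number of iterations used, so that the bound of $\log_k p$ iterations can be checked on small inputs.

```python
def leftmost_one(cells):
    """Zero every 1 that has another 1 to its left (one comparison per pair)."""
    before = cells[:]                      # all reads happen before any write
    for i in range(len(cells)):
        for j in range(i):
            if before[i] == 1 and before[j] == 1:
                cells[i] = 0               # every writer writes 0: a common write
    return cells


def one_color_min(x, k):
    """Return (winner, iterations) for the 1-color minimization problem.

    x[i] is 1 if processor i wants to write and 0 otherwise; k >= 2 is the
    number of memory cells. winner is the smallest i with x[i] == 1, or None.
    """
    if k < 2:
        raise ValueError("the sketch needs at least two cells")
    lo, hi = 0, len(x)                     # candidates are processors lo..hi-1
    iterations = 0
    while hi - lo > 1:
        iterations += 1
        groups = min(k, hi - lo)
        q, r = divmod(hi - lo, groups)     # first r groups get q+1 processors
        bounds = [lo]
        for g in range(groups):
            bounds.append(bounds[-1] + q + (1 if g < r else 0))
        cells = [0] * groups               # the cells are cleared first
        for g in range(groups):
            for i in range(bounds[g], bounds[g + 1]):
                if x[i] == 1:
                    cells[g] = 1           # every writer writes 1: a common write
        leftmost_one(cells)
        if 1 not in cells:
            return None, iterations        # no processor wants to write
        g = cells.index(1)                 # the winning group
        lo, hi = bounds[g], bounds[g + 1]
    if hi - lo == 1 and x[lo] == 1:
        return lo, iterations
    return None, iterations
```

With p = 9 processors and k = 3 cells, the input [0, 0, 1, 0, 1, 1, 0, 0, 0] is resolved in two iterations: the first narrows the candidates to processors 0 through 2, and the second selects processor 2.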
In Part 1 of the Priority CRCW PRAM algorithm, if more than one processor tries to 
hook a star with root i to a tree, then a priority write of the variable P (i) occurs. Since there 
is a P(i) for each vertex i, there are n cells into which processors may write. Since 
processors performing the hooking operation correspond to edges leaving stars, as many as m 
processors may want to write into one P (i). 
In the Common CRCW PRAM algorithm, we first determine the processor of highest 
priority writing to each P(i) and then have only that processor write. We begin with the 
direct implementation which requires solving the r-color minimization problem with m
processors and n colors. 
To maintain the O(log n) running time of the MST algorithm, Part 1 must run in time
O(1). In Part 1, the Common PRAM algorithm simultaneously solves n 1-color minimization
problems, one for each P(i). Each problem requires $m + n^{2\epsilon}$ processors to obtain an O(1)
time solution. During the first iteration, m processors are divided into $n^\epsilon$ groups. During
each iteration, a processor can determine the group to which it belongs since it knows its rank
from the sort performed during the initialization phase. Each iteration reduces the number of
contending processors by a factor of $n^\epsilon$, and thus $O(\frac{\log m}{\log n^\epsilon}) = O(1)$ iterations suffice. Since
there are n problems, Part 1 requires a total of $mn + n^{1+2\epsilon}$ processors. We now show how to
reduce the number of processors. 
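The constant bound on the number of iterations can be spelled out as follows. The derivation assumes that G is a simple graph, so that $m < n^2$; this assumption is ours and is not stated explicitly in the text. Let t be the number of iterations needed to reduce m contending processors to one when each iteration divides their number by $n^\epsilon$.

```latex
% Assumption: G is simple, hence m \le n(n-1)/2 < n^2.
\begin{align*}
t \;=\; \left\lceil \frac{\log m}{\log n^{\epsilon}} \right\rceil
  \;=\; \left\lceil \frac{\log m}{\epsilon \log n} \right\rceil
  \;\le\; \left\lceil \frac{\log n^{2}}{\epsilon \log n} \right\rceil
  \;=\; \left\lceil \frac{2}{\epsilon} \right\rceil
  \;=\; O(1).
\end{align*}
```

Since $\epsilon$ is a constant, the bound does not depend on n or m.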
In Part 1, each processor corresponding to an edge leaving a star writes to exactly one 
P(i). Thus each processor that wants to write is a possible winner for only one of the n 1-
color minimization problems. 
The absence of nonwriting processors from the groups of processors formed during the 
solution of the 1-color minimization problem does not affect the outcome since the processors 
would not have written even if they were present. Thus each processor that wants to write 
needs only to participate in the solution of the 1-color minimization problem corresponding to 
the P(i) it wants to write. There are 2m edge processors that may write. Hence, for the n
1-color minimization problems, the algorithm requires a total of only $2m + n^{1+2\epsilon}$ processors.
The remaining Parts of the algorithm require 2m + n processors and only common write 
operations. Thus we have a Common CRCW PRAM algorithm for the MST problem that 
runs in O(log n) time using $2m + n^{1+2\epsilon}$ processors, where $0 < \epsilon < 1/2$.
CHAPTER 4. 
ASYNCHRONOUS PARALLEL ALGORITHMS FOR GRAPH 
CONNECTIVITY AND RELATED PROBLEMS 
4.1. Introduction 
Let G = (V, E) be an undirected graph with the set of vertices V and the set of edges E. 
Let n = | V | and m = | E |. A subgraph of G is a graph whose vertices and edges are in G. 
A connected component of G is a maximal connected subgraph of G. 
Finding the connected components of a graph is a fundamental problem of graph theory. 
Some practical applications of graph connectivity algorithms include the detection of failures 
in computer networks, communication networks, and distributed database systems. On single 
processor machines, sequential algorithms can perform depth-first search to find the 
connected components of G in O(m + n) time. Finding the connected components much
faster, however, requires parallel computers, where multiple processors can work together on
the problem concurrently. One well-studied model of parallel computation is the Parallel 
Random Access Machine (PRAM). 
For the Concurrent Read Exclusive Write (CREW) PRAM, Hirschberg et al. (1979) 
presented a connected components algorithm that runs in $O(\log^2 n)$ time using $O(n^2/\log n)$
processors. Chin et al. (1982) improved their result by reducing the number of processors to
$O(n^2/\log^2 n)$. Han and Wagner (1990) presented a more efficient and nearly optimal algorithm
that runs in time $O(\frac{m}{p} + \frac{n \log n}{p} + \log^2 n)$, where p is the number of processors. Their
algorithm achieves linear speedup when $p \le m/\log^2 n$ and $m \ge n \log n$. Johnson and Metaxas
(1991) designed the fastest known CREW PRAM algorithm. Their algorithm runs in
$O(\log^{3/2} n)$ time using m + n processors.
For the weaker Exclusive Read Exclusive Write (EREW) PRAM, Kruskal et al. (1990) 
gave an algorithm that computes the connected components of a graph in 
O(—-—-—jr#rr rr-4-^-log/?) time using p processors. Their algorithm achieves 
P log(m/(/?2log/?)) P 
linear speedup for $m = \Omega(n \log p)$ and $m = \Omega(p^{2+\epsilon})$ for any constant $\epsilon > 0$.
For the Arbitrary Concurrent Read Concurrent Write (CRCW) PRAM, Shiloach and 
Vishkin (1982) presented an algorithm that runs in O(log n) time using O(m + n) processors.
Awerbuch and Shiloach (1987) gave a similar and slightly simpler algorithm with the same 
performance. The latter algorithm also has a substantially simpler correctness proof. Cole 
and Vishkin (1986a) presented an Arbitrary CRCW PRAM algorithm that runs in 
$O(\log n \log^{(2)} n \log^{(3)} n)$ time using fo ^|ffiffi[*nj) processors, where $\alpha(m,n)$ is the
inverse Ackermann function. Cole and Vishkin (1986b) then designed optimal algorithms for
list ranking and prefix sum, and used these results to obtain a slightly faster connected
components algorithm that runs in O(log n) time using $\frac{(m+n)\alpha(m,n)}{\log n}$ processors. The
running time of their algorithm is the same as that of the algorithms of Shiloach and Vishkin
(1982) and Awerbuch and Shiloach (1987), but it uses fewer processors. Finally, Gazit (1991)
presented an optimal randomized algorithm for the Arbitrary CRCW PRAM that runs in
O(log n) time using $O(\frac{m+n}{\log n})$ processors.
Several researchers have considered a more realistic model of parallel computation, the 
Asynchronous PRAM (APRAM). Gibbons (1989) defined a family of APRAMs that vary in 
the types of synchronization steps that are permitted. Gibbons measured the cost of accessing 
the shared memory in terms of the communication delay. In the Phase APRAM model, a 
computation consists of a series of phases, where during a phase processors run 
asynchronously for a predetermined number of steps. At the end of each phase, all of the 
processors are synchronized. Gibbons presented Phase APRAM algorithms for several basic 
problems, including prefix sum, list ranking, and merging. 
Martel et al. (1989) presented an APRAM model in which asynchronous processors are 
randomly assigned to tasks. Thus their asynchronous algorithms can adapt to the availability 
of processors and are fault tolerant. To measure the performance of an algorithm, they define 
the work of an algorithm to be the total number of single processor operations performed by 
all processors during the execution of the algorithm. Work is analogous to the processor-time 
product for PRAM algorithms. Martel et al. (1990) showed that any p-processor CRCW
PRAM algorithm can be simulated on an asynchronous CRCW PRAM with O(p) expected
work per parallel step using up to $\frac{p}{\log p \log^* p}$ processors.
In the APRAM model of Cole and Zajicek (1989), processors are asynchronous and may 
access any cell of the shared memory in constant time. They assume that a processor can 
access a block of memory to read or write in one atomic operation. Since this assumption is 
not realistic, they consider an algorithm weak if it requires atomic access to more than one 
variable. In their model, a computation is a sequence of all the steps executed by the 
processors. The computation can be partitioned into rounds, where a round is a minimal 
sequence of steps such that every processor executes at least one step. Cole and Zajicek 
presented a complicated weak 0(log n) round graph connectivity algorithm. Later, Cole and 
Zajicek (1990) broadened their notion of rounds to include probabilistic delays between steps 
of each processor. During a round, each processor flips a coin to determine whether it should 
execute an instruction or wait. They presented two APRAM models, a bounded delay model 
and an unbounded delay model, and showed expected time bounds for prefix sum and list 
ranking algorithms for each model. 
Nishimura (1990) took a different probabilistic approach to the APRAM model and 
considered probability distributions over the interleaving of processor steps. She showed that 
the expected maximum number of steps taken by any processor for synchronizing p 
processors or for list ranking on a list of size p is O(log p), assuming that the distribution is
uniform. 
In this chapter we present two deterministic APRAM models. The first is an APRAM 
that has only atomic read and write primitives. The second is an APRAM that has limited 
read-modify-write primitives. We then present efficient nonoblivious APRAM connected 
components algorithms for each model. All of our algorithms use m + n processors. We 
introduce our approach, which is different from the one used in the synchronous algorithms, 
in Algorithm I. Algorithm I runs on an APRAM with only atomic read and write primitives 
and requires O(n log n) rounds. Algorithms II and III run on an APRAM with limited
read-modify-write primitives. Both algorithms require O(log n) rounds. Algorithm III is more
efficient than Algorithm II and uses fewer global synchronizations.
We introduce and motivate our APRAM models in Section 4.2. In Section 4.3, we 
discuss the design of APRAM algorithms. In Section 4.4, we briefly describe the 
synchronous connected components algorithms. Although our APRAM algorithms take a 
different approach, there are some similarities. We present Algorithm I in Section 4.5, 
Algorithm II in Section 4.6, and Algorithm III in Section 4.7.
Two problems that are closely related to the connected components problem are finding 
a spanning forest or a minimum spanning forest of a graph. In Section 4.8, we describe how 
to modify Algorithms II and III to obtain algorithms for computing a spanning forest and a
minimum spanning forest. We give a brief conclusion in Section 4.9. 
4.2. Model of Computation 
4.2.1. The PRAM 
The PRAM is an idealized parallel computer that allows algorithm designers to 
concentrate on the computational aspects of problems without concern about the reliability 
and synchronization of processors. A PRAM consists of a set of reliable synchronous 
processors, each with its own local memory, and a global shared memory through which 
processors communicate. Each step of a processor of the PRAM consists of three parts, 
called stages. In the first stage, a processor may read a value from one shared memory cell 
into local memory. In the second stage, a processor may perform a local computation on 
values in its local memory. In the third stage, a processor may write a value from local 
memory into one shared memory cell. Each stage requires unit time. Since the processors of 
the PRAM are synchronous, they all execute the same stage at the same time. 
Various models of PRAMs restrict the execution of the read and write stages. In the 
EREW PRAM, at each step, in the read and write stages at most one processor may read from 
or write into a particular memory cell. In the CREW PRAM, at each step, in the read stage 
any number of processors may read from the same memory cell, but in the write stage at most 
one processor may write into each memory cell. In the CRCW PRAM, at each step, in the 
read and write stages any number of processors may read from or write into the same memory 
cell. If more than one processor writes into a cell, then the value that is written depends on 
the particular CRCW model. 
Three popular CRCW PRAM models are the Priority, Arbitrary, and Common models. 
For each model, we describe what happens when two or more processors simultaneously try 
to write into the same cell. In the Priority model, each processor has a unique fixed priority. 
The processor with the highest priority is the one that succeeds. In the Arbitrary model, some 
arbitrary processor succeeds, but it is not known a priori which one. In the Common model, 
all processors must write the same value. 
Numerous parallel algorithms and computational techniques have been developed for 
the PRAM. Gibbons and Rytter (1988), Akl (1989), Karp and Ramachandran (1990), and 
Ja'Ja' (1991) surveyed PRAM algorithms. Fich et al. (1988a), Fich et al. (1988b), Chlebus et 
al. (1988), and Boppana (1989), among others, studied the relationships among the various 
PRAM models. 
4.2.2. The asynchronous PRAM 
Although the PRAM is an important and fundamental model of parallel computation, it 
is not physically realizable. The PRAM model hides the cost of synchronization. As the 
number of processors becomes large, it becomes impractical to synchronize the processors 
using a single global clock. In addition, concurrent accesses to the shared memory must also 
be synchronized. We present a more realistic model of parallel computation, the APRAM. 
Our APRAM consists of a set of reliable asynchronous processors, each with its own local 
memory, and a global shared memory. No global clock synchronizes the processors, and 
access to the shared memory is also asynchronous. 
As in the PRAM, each step of a processor of the APRAM consists of a read stage, a 
local computation stage, and a write stage. The difference is that since the processors are 
asynchronous, some may be reading from the shared memory at the same time others are 
writing into the shared memory. A memory operation is either reading from or writing into 
one cell of the shared memory. Lamport (1986) considered the problem of concurrent 
reading and writing of the same cell using safe, regular, and atomic registers. In our APRAM 
model, we allow at most one memory operation to be performed on a cell of the shared 
memory at any time, thus each read operation and write operation is atomic. In real 
machines, hardware can ensure that at all times, only one processor performs a memory 
operation on a shared memory cell. Since the memory operations performed on each cell are 
serialized, there is no ambiguity about the value that is written or read. 
A primitive operation is a sequence of one or more memory operations performed 
atomically on a single cell of the shared memory. We consider two models of the APRAM 
that have different kinds of primitive operations. The first is the basic APRAM, whose only 
primitive operations are atomic read and write. The second is an APRAM with read-
modify-write primitives. A processor executing a read-modify-write primitive atomically 
reads a value v from a cell c of the shared memory into local memory, performs a local 
computation that may depend on v and some other values stored in local memory, and then 
writes a value into cell c. We note that Algorithms II and III actually require only limited
read-modify-write operations, namely replace-min and increment, defined as follows. 
A processor executing a replace-min primitive atomically reads a value v from a cell c 
of the shared memory, compares v with a value v ' in local memory, and then writes v ' into c 
if v ' < v. A processor executing an increment primitive atomically reads a value v from a 
cell c of the shared memory, increments v by 1, and then writes the value v+1 into c. Each 
primitive can be implemented by a single machine instruction on a conventional computer. 
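The two primitives can be sketched in software as follows. The class name Cell and its methods are hypothetical, and the per-cell lock merely stands in for the hardware atomicity described above; it is not how a single machine instruction would be realized. Returning the value that was read is a convenience of this sketch.

```python
import threading


class Cell:
    """One shared memory cell whose primitive operations are atomic."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()    # serializes all operations on this cell

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value

    def replace_min(self, local_value):
        """Atomically store local_value if it is smaller than the cell's value."""
        with self._lock:
            old = self._value
            if local_value < old:
                self._value = local_value
            return old                   # the value v that was read

    def increment(self):
        """Atomically add 1 to the cell's value."""
        with self._lock:
            old = self._value
            self._value = old + 1
            return old                   # the value v that was read
```

Because each operation holds the cell's lock for its whole duration, the operations on a cell are serialized, which matches the requirement that at most one memory operation is performed on a cell at any time.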
An execution of an algorithm consists of a finite sequence of arbitrarily interleaved 
stages of the steps taken by all of the processors of the APRAM, with the restriction that 
when a processor executes a read-modify-write primitive, the stages of the step are atomic 
with respect to the memory cell that is accessed. We can partition the execution of the 
algorithm into rounds, where a round is a minimal sequence of stages such that every 
processor takes at least one complete step. We measure the running time of our algorithms 
using rounds. The number of rounds is a fair measure if we assume that all processors run at 
about the same speed and thus perform about the same number of operations. This 
assumption is reasonable because real parallel computers usually consist of identical 
processors. The algorithms must be correct, however, regardless of the speeds of the 
processors. 
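The partition into rounds can be made concrete with a short sketch. It simplifies the definition in one respect, which is an assumption of the sketch: the execution is given as a sequence of processor indices, each entry standing for one complete step rather than for an individual stage. The function name count_rounds is hypothetical.

```python
def count_rounds(schedule, num_procs):
    """Count the complete rounds in a schedule.

    schedule: sequence of processor indices in 0..num_procs-1; each entry
    stands for one complete step of that processor.
    A round ends as soon as every processor has taken at least one step
    since the previous round ended.
    """
    rounds = 0
    seen = set()
    for proc in schedule:
        seen.add(proc)
        if len(seen) == num_procs:
            rounds += 1
            seen.clear()
    return rounds
```

For three processors, the schedule 0, 1, 2, 0, 0, 1, 2, 2 contains two complete rounds: the first ends after the third step, the second ends after the seventh, and the final step belongs to an unfinished round. A slow processor therefore stretches a round, but it cannot increase the number of rounds.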
4.3. The Design of APRAM Algorithms 
Ideally we would like to design algorithms for which the execution of each processor is 
independent of the executions of other processors. In that case, no processors impede the 
progress of other processors. But most algorithms, and nonoblivious algorithms in particular, 
require processors to communicate data to each other. Thus synchronization between 
processors is needed. In Section 4.3.1 we review previous work in wait-free data structures, 
and in Section 4.3.2 we discuss methods for synchronizing processors. 
4.3.1. Wait-free data structures 
An implementation of a parallel data structure in shared memory is wait-free if every 
processor is guaranteed to complete an operation on the data structure in a finite number of 
steps regardless of the speeds of the other processors. A wait-free implementation tolerates 
failures of processors because the failure of other processors does not prevent the completion 
of an operation. 
Herlihy (1988) defined a hierarchy of data structures such that at each level no data 
structure has a wait-free implementation in terms of data structures from lower levels. For 
example, Herlihy shows that using only atomic read and write registers, it is impossible to 
construct wait-free implementations of queues and stacks, and synchronization primitives 
such as test-and-set and fetch-and-add. A primitive is universal if wait-free data structures 
can be constructed from the primitive. Herlihy established the existence of universal 
primitives. It is not clear, however, that universal primitives can be implemented in 
hardware. 
Some researchers have designed wait-free algorithms. Aspnes and Herlihy (1990) 
considered an APRAM model with only shared atomic registers and gave an algebraic 
characterization of a class of data structures that have wait-free implementations. Recently, 
Anderson and Woll (1991) designed a wait-free algorithm for the union-find problem. 
4.3.2. Processor synchronization 
A barrier synchronization is a sequence of steps that synchronizes all processors. Every 
processor executing the barrier synchronization must complete the operation before any 
processor may proceed further. Barrier synchronization may be used to ensure that all 
processors have reached a particular point of the computation before continuing. 
On a basic APRAM, a barrier synchronization can be implemented with a binary tree. 
In the binary tree method, each processor is associated with one leaf in the tree, and each 
internal node of the tree is associated with some processor. A processor executing a barrier 
synchronization marks its corresponding leaf. A processor corresponding to an internal node 
x marks x after both children of x have been marked. If some child of x is not marked, then 
the processor waits for a while and checks again. The barrier synchronization completes 
when the root is marked. On a basic APRAM with p processors, the binary tree method for 
barrier synchronization requires exactly $\log_2 p$ rounds since the processor corresponding to
the root must wait for $\log_2 p - 1$ levels of nodes to be marked.
Barrier synchronization can be made more efficient if processors do not have to wait to 
execute the barrier. The results of Fischer et al. (1985), Loui and Abu-Amara (1987), Chor et 
al. (1987), and Herlihy (1988) can be used to show that there is no wait-free implementation 
of barrier synchronization on an APRAM that has only atomic read and atomic write 
operations. Thus we consider an APRAM that can read and write a cell in one atomic 
operation, i.e., an APRAM with read-modify-write primitives. 
On an APRAM with an increment primitive, barrier synchronization can be 
implemented using a counter initialized to 0. In the counter method, a processor executing a 
barrier synchronization increments the counter. On an APRAM with p processors, the barrier
synchronization completes when the value of the counter reaches p. With the counter
method, a processor does not have to check whether other processors have executed the
barrier synchronization before incrementing the counter. Thus barrier synchronization 
requires only one round. Note that processors must still wait for the barrier synchronization 
to complete before they can proceed further in the algorithm. 
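The counter method can be sketched in Python as follows. This is only an illustration: threads stand in for APRAM processors, a lock-protected counter stands in for the increment primitive, and the names CounterBarrier and run_processors are assumptions introduced for the example.

```python
import threading
import time


class CounterBarrier:
    """Single-use counter barrier for p participants (illustrative only)."""

    def __init__(self, p):
        self.p = p
        self.count = 0
        # Assumption: a lock emulates the atomic increment primitive.
        self._lock = threading.Lock()

    def arrive(self):
        # One atomic increment; no check of the other processors is needed first.
        with self._lock:
            self.count += 1

    def completed(self):
        with self._lock:
            return self.count == self.p


def run_processors(p):
    """Run p threads through the barrier and return the order of events."""
    barrier = CounterBarrier(p)
    log = []  # list.append is atomic in CPython

    def work(i):
        log.append(("before", i))
        barrier.arrive()
        # A processor must still wait for the barrier to complete.
        while not barrier.completed():
            time.sleep(0.001)
        log.append(("after", i))

    threads = [threading.Thread(target=work, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In any run, every "before" entry of the returned log precedes every "after" entry, which is the barrier property.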
A straightforward way to obtain an APRAM algorithm is to take a PRAM algorithm and 
insert a barrier synchronization after each step of the PRAM algorithm. This conversion is 
inefficient, however. For a basic APRAM with p processors, the round complexity of the 
APRAM algorithm is about $\log_2 p$ times greater than the round complexity of the
corresponding PRAM algorithm. For an APRAM with an increment primitive, the round 
complexity of the APRAM algorithm is about 2 times greater than the PRAM algorithm. In 
either model, with so many barrier synchronizations, the APRAM algorithm proceeds no 
faster than the rate of the slowest processor at each step. Fast processors must wait for slow 
processors. 
One way to reduce the cost of synchronization is to synchronize only a constant number 
of processors at each synchronization point. For example, on a basic APRAM, if only a 
constant number of processors need to be synchronized at each step, then the APRAM 
algorithm is slowed by only a constant factor. Cole and Zajicek (1989) showed that for a 
class of algorithms in which the communication pattern is oblivious, each synchronization 
point involves only a constant number of processors. These algorithms include Batcher's 
bitonic sort (Batcher, 1968), parallel summation, and prefix sum. The memory cells that are 
used in these algorithms can be treated as an implicit complete binary tree. During the 
execution of an algorithm, a processor associated with an internal node may proceed once 
both of its children have been computed. Note that each node is written by only one 
processor. 
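As a rough illustration of this style of synchronization, the sketch below performs parallel summation on an implicit complete binary tree stored in heap order. A randomly chosen pending node is visited at each step, and it proceeds only when both of its children have been computed. The function name async_tree_sum, the heap layout, and the random scheduler are assumptions of the example.

```python
import random


def async_tree_sum(values, seed=0):
    """Sum values on an implicit binary tree; node k has children 2k and 2k+1."""
    p = len(values)
    # Cells 1..p-1 are internal nodes, cells p..2p-1 are the leaves.
    cell = [None] * p + list(values)
    rng = random.Random(seed)
    pending = list(range(1, p))
    while pending:
        node = rng.choice(pending)  # an arbitrary processor takes a step
        left, right = cell[2 * node], cell[2 * node + 1]
        if left is not None and right is not None:
            # Only the processor of this node writes this cell.
            cell[node] = left + right
            pending.remove(node)
        # Otherwise the processor waits and checks again later.
    return cell[1]
```

Each node synchronizes with at most its two children, so no step involves more than a constant number of processors.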
Many oblivious PRAM algorithms, and especially tree-based PRAM algorithms, can be 
converted into APRAM algorithms in this manner since the memory cells that are to be 
accessed and the number of processors that need to be synchronized at each step can be 
determined a priori. Other algorithms that fall into this category include evaluation of 
algebraic expressions, evaluation of associative binary functions, and numeric computations 
such as matrix multiplication. 
For nonoblivious algorithms, however, as is the case for many graph algorithms, 
efficient synchronization is more difficult. Since the communication pattern depends upon 
the input, an algorithm cannot determine a priori the cells that each processor will access. 
Thus each barrier synchronization must involve every processor. 
In this chapter we present efficient nonoblivious APRAM algorithms for computing 
connected components, spanning forest, and minimum spanning forest. In our connected 
components APRAM algorithms, to reduce waiting, we introduce a variation of barrier 
synchronization, which we call rolling synchronization. We now describe this technique.
Rolling synchronization can be implemented with the same data structures used for 
barrier synchronization. We use two methods of rolling synchronization: the tree method 
and the counter method. Suppose we want to synchronize all of the processors after they 
have completed some set of steps L of a loop. After completing L, processor p tries to
execute its part of the rolling synchronization by either marking the nodes of a tree or 
incrementing a counter as in barrier synchronization. The difference is that if some processor 
has not yet executed its part of the rolling synchronization, then p executes L again. Then p 
tries to execute its part of the rolling synchronization, if p has not already completed it, and
checks again whether all processors have completed the rolling synchronization. Processor p
repeatedly executes L until every processor has completed the rolling synchronization. The 
rolling synchronization completes after every processor has executed L at least once. Rolling 
synchronization is an improvement over barrier synchronization since performing the steps L 
again may do more useful work. The algorithm must be correct, however, for an arbitrary 
number of iterations of L. 
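The behavior of rolling synchronization with a counter can be sketched with a small discrete-time simulation. The model is an assumption made for illustration: processor i needs times[i] ticks for each execution of L, and the function name rolling_sync is hypothetical.

```python
def rolling_sync(times):
    """Return how many times each processor executes L before the
    rolling synchronization completes.

    times[i] is the (assumed) number of ticks processor i needs per
    execution of L.
    """
    p = len(times)
    runs = [0] * p
    arrived = [False] * p
    counter = 0  # the shared counter, initialized to 0
    tick = 0
    while counter < p:
        tick += 1
        for i, d in enumerate(times):
            if tick % d == 0:          # processor i finishes an execution of L
                runs[i] += 1
                if not arrived[i]:     # increment the counter only once
                    arrived[i] = True
                    counter += 1
                # If counter < p, processor i simply executes L again.
    return runs
```

Every processor executes L at least once, and a fast processor repeats L instead of idling while it waits for slower ones.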
In real machines, processor synchronization can be expensive. Axelrod (1986) and 
Dubois and Briggs (1991), among others, have examined the degradation of the performance 
of multiprocessor machines caused by barrier synchronizations. Thus, in our APRAM 
algorithms, we attempt to minimize the number of synchronization points. 
4.4. Synchronous Algorithms 
4.4.1. Connected components algorithms 
Shiloach and Vishkin (1982) and Awerbuch and Shiloach (1987) presented similar 
connected component algorithms. We briefly describe their methods because our algorithms 
use some of their ideas. In our discussion, we will focus on the algorithm of Awerbuch and 
Shiloach since it is simpler. 
Assume that the vertices of G are numbered from 1 to n. Then the number of a vertex 
is its id. In our discussion, we will refer to vertices by their ids. The notation u < v means
that the id of vertex u is smaller than the id of vertex v. 
The synchronous algorithm uses an Arbitrary CRCW PRAM. Each edge of G is 
represented by two oppositely directed edges. The algorithm assigns a processor p(i,j) to 
each edge (i, j) and a processor p(v) to each vertex v. Thus, the number of processors
required by the algorithm is 2m + n. 
Each vertex i has a parent P(i). If a vertex is a root, then its parent is itself. The 
grandparent of i is the parent of P (i). 
A rooted tree is a tree whose edges are directed toward the root. A star is a rooted tree 
of height 1. The parent-child relation defines a directed graph called the parents graph, PG, 
which is a forest of rooted trees; PG has the same vertices as G. We will call an edge of PG 
an arc to distinguish the edges of PG from the edges of G. The arcs of PG are (i, P(i)), for
all vertices i. 
Throughout the execution of the algorithm, PG is a forest of rooted trees with self-loops 
at the roots. The set of vertices of each tree of PG is a subset of vertices of some connected 
component of G. The vertices of two trees of PG belong to the same connected component 
if there is an edge of G with an endpoint in each tree. The processor corresponding to such 
an edge tries to combine two such trees by hooking one tree to the other. The algorithm 
hooks a tree T1 to a tree T2 by making some vertex of T2 the parent of the root of T1.
After the hooking operation, the algorithm reduces the height of each tree with a 
shortcut operation, in which each vertex takes its grandparent to be its new parent. The 
algorithm terminates when no trees in PG can be combined, and every tree is a star. All 
vertices that belong to the same connected component have the same parent in PG. 
Arbitrary CRCW PRAM Connected Components Algorithm
Initialization
    For each processor p(i)
        P(i) := i;
For each processor p(i, j)
    loop
        Part 1: (Conditional star hooking)
            if i belongs to a star and P(i) > P(j) then
                P(P(i)) := P(j);
            endif
        Part 2: (Stagnant star hooking)
            if i belongs to a star and P(i) ≠ P(j) then
                P(P(i)) := P(j);
            endif
        Part 3: (Shortcut operation)
            if i does not belong to a star then
                P(i) := P(P(i));
            else
                stop;
            endif
    end loop
A processor determines whether a vertex belongs to a star by using procedure 
Star_Check. At the termination of Star_Check, ST(i) is true if i belongs to a star and false
otherwise. 
Procedure Star_Check
For each processor p(i)
    ST(i) := true;
    if P(i) ≠ P(P(i)) then
        ST(i) := false;
        ST(P(P(i))) := false;
    endif
    ST(i) := ST(P(i));
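The star test can be tried out with a sequential Python sketch that emulates the synchronous steps with snapshots. Two choices here are assumptions of the sketch rather than part of the procedure as stated: vertices are numbered from 0, and the last step is written as a conjunction, ST(i) and ST(P(i)), so that a vertex already marked false stays false when all processors read old values at once.

```python
def star_check(P):
    """Return ST, where ST[i] is True exactly when vertex i belongs to a star.

    P[i] is the parent of vertex i; roots satisfy P[i] == i.
    """
    n = len(P)
    ST = [True] * n
    # A vertex whose parent differs from its grandparent is at depth >= 2:
    # mark the vertex and its grandparent.
    for i in range(n):
        g = P[P[i]]
        if P[i] != g:
            ST[i] = False
            ST[g] = False
    # Synchronous last step: every vertex reads its parent's old flag.
    # Assumption: combined with "and" so earlier False marks are kept.
    snapshot = ST[:]
    return [snapshot[i] and snapshot[P[i]] for i in range(n)]
```

For example, with P = [0, 0, 1, 3, 3] the tree on vertices 0, 1, 2 is a chain and the tree on vertices 3, 4 is a star.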
In Part 1, processors that correspond to edges leaving a star try to hook the star to 
another tree, which may also be a star. If more than one processor tries to hook the star, then 
an arbitrary processor succeeds. If v is a vertex of a star that is hooked, then at the end of 
Part 1, v is no longer a vertex of a star. A star that has not been hooked to another tree or 
hooked onto by another star is stagnant. In Part 2, stars that are stagnant after Part 1 are 
hooked to trees. Stars that are stagnant after Part 2 correspond to connected components that 
are complete. Thus, a processor corresponding to an edge of such a star stops. In Part 3, the 
algorithm reduces the height of each tree that is not a star with a shortcut operation. If the 
height of a tree is h, then after a shortcut operation the height of the tree is $\lceil h/2 \rceil$. The
height of every tree whose height is at least 2 is reduced by a factor of at least 3/2 because in 
the worst case a shortcut operation reduces a tree whose height is 3 to a tree whose height is 
2. 
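The effect of a shortcut operation on the height can be checked on a chain, which is the tallest tree for a given number of vertices. The helper names below are hypothetical, and vertices are numbered from 0 for convenience.

```python
import math


def height(P):
    """Height of the forest given by parent array P (roots satisfy P[v] == v)."""
    def depth(v):
        d = 0
        while P[v] != v:
            v = P[v]
            d += 1
        return d
    return max(depth(v) for v in range(len(P)))


def shortcut(P):
    """Synchronous shortcut: every vertex takes its old grandparent as parent."""
    return [P[P[v]] for v in range(len(P))]


def predicted_height(h):
    """Height after one shortcut according to the ceiling formula."""
    return math.ceil(h / 2)
```

For a chain of height 3, one shortcut yields a tree of height 2, which is the worst-case ratio of 3/2 mentioned above.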
The main difference between the algorithm of Shiloach and Vishkin and the algorithm 
of Awerbuch and Shiloach is that in the former algorithm, during the hooking operation, trees 
that are not stars may also be hooked. In Part 1, if i is a root or a child of a root of a tree and
P(i) > P(j), then processor p(i, j) tries to hook the tree containing vertex i to P(j).
Shiloach and Vishkin (1982) and Awerbuch and Shiloach (1987) established the 
correctness of their algorithms. We briefly verify the running time. Consider each iteration 
of the three Parts. Parts 1 and 2 ensure that if a star does not contain all of the vertices of 
some connected component, then the star is either hooked to another tree or hooked onto by 
another star to yield a new tree with height at least 2. Part 3 reduces the height of every tree 
with a height of at least 2 by a factor of at least 3/2 with a shortcut operation. The algorithm 
ensures the progress of Part 3 by avoiding the creation of cycles during the hooking steps. In 
Part 1, the roots of stars are hooked only to vertices with smaller ids, and in Part 2, only the
roots of stagnant stars are hooked. 
For each connected component C, during each iteration of the algorithm, each star that 
contains vertices of C is hooked to another tree or hooked onto by another star to form a new 
tree of height at least 2. After a shortcut operation, the height of the new tree is reduced by a 
factor of at least 3/2. Thus, the sum of the heights of the trees in PG that contain the vertices 
of C decreases by a factor of at least 3/2. After O(log n) iterations, the vertices of C belong
to a single star. Since each iteration comprises O(1) steps, the algorithm requires O(log n)
total steps. 
4.4.2. Spanning forest algorithm 
If C is a connected component of G, then a spanning tree of C is a tree that is a 
subgraph of G that contains every vertex of C. A spanning forest of G is a set of trees 
consisting of a spanning tree for each connected component of G. We can modify the 
connected components algorithm to find a spanning forest of G by keeping track of the 
processors that perform the hooking operations. We introduce a variable Proc(i) for each 
vertex i. Initially Proc(i)= nil. Parts 1 and 2 of the connected components algorithm are 
modified as follows. 
Part 1: (Conditional star hooking)
    if i belongs to a star and P(i) > P(j) then
        P(P(i)) := P(j);
        if P(P(i)) = P(j) then
            Proc(P(i)) := p(i, j);
        endif
    endif
Part 2: (Stagnant star hooking)
    if i belongs to a star and P(i) ≠ P(j) then
        P(P(i)) := P(j);
        if P(P(i)) = P(j) then
            Proc(P(i)) := p(i, j);
        endif
    endif
In Parts 1 and 2, if a vertex x is the root of a tree and x is hooked to a vertex y, then the
processors that may have set P(x) to y try to write their names into Proc(x), and an arbitrary
one succeeds. Note that if more than one processor tried to hook x to y, then the processor
whose name is written in Proc(x) might not be the processor that hooked x to y. The 
algorithm remains correct, however, since Proc(x) is the name of a processor that 
corresponds to an edge of G whose endpoints are in the trees of PG that contain x and y. 
Note that if a vertex r is the root of a star, then Proc(r) = nil. After all vertices of each 
connected component are contained in a single star, the processors whose names are stored in 
Proc (i), for i = 1,..., n, correspond to the edges of a spanning forest of G. 
4.4.3. Minimum spanning forest algorithm 
A graph G is a weighted graph if a weight w(i,j) is associated with each edge (i,j) of 
G. If G is a weighted graph, then a minimum spanning forest of G is a spanning forest of G 
such that the sum of the weights of the edges in the trees of the spanning forest is a minimum. 
Awerbuch and Shiloach (1987) also presented a minimum spanning forest algorithm for the 
Priority CRCW PRAM that runs in O(log n) time using O(m + n) processors.
The minimum spanning forest algorithm is similar to the connected components 
algorithm, except that during hooking operations, the processor corresponding to the edge of 
least weight leaving a star hooks the star. To determine the edge of least weight, the 
algorithm assigns processors to edges such that the smaller the weight of an edge, the higher 
the priority of the corresponding processor. Thus, if more than one processor tries to hook a 
star, then the processor corresponding to the edge of least weight is the one that succeeds. As 
in the spanning forest algorithm, a variable for each vertex is used to keep track of the 
processor that successfully performed the hooking operation. After all vertices of each 
component are contained in a single star, the processors that successfully hooked stars 
correspond to the edges of a minimum spanning forest of G. 
4.5. APRAM Connected Components Algorithm I 
The main difficulty with an asynchronous implementation of the synchronous algorithm 
is that in the asynchronous case, without a barrier synchronization after Part 1, a processor 
executing Part 2 cannot determine which stars are stagnant. This is because it does not know 
which other processors are performing hooking operations. We show how a cycle may be 
created. 
In Part 2 of the synchronous algorithm, since only stagnant stars are hooked to other 
trees, no cycles are created. In an asynchronous implementation without barrier 
synchronization, however, some processors may be executing different Parts of the algorithm. 
Let r be the root of a star S and r' the root of another star such that r'>r. Suppose a 
processor executing Part 2 determines that S is stagnant and hooks r to r'. If at the same
time a different processor executing Part 1 hooks r' to r, then a cycle of length 2 is created.
In general, longer cycles can be created. If there are cycles, then the progress of the shortcut 
operations is not guaranteed. 
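The two-vertex scenario can be replayed directly. In the sketch below the ids r = 1 and r' = 2 are chosen for illustration, and reaches_root is a hypothetical helper that follows parent pointers.

```python
def reaches_root(P, v):
    """Follow parent pointers from v; return False if a cycle is entered."""
    seen = set()
    while P[v] != v:
        if v in seen:
            return False
        seen.add(v)
        v = P[v]
    return True


def unsynchronized_hooks():
    """Replay the interleaving in which two processors are in different Parts."""
    r, r_prime = 1, 2            # roots of two adjacent stars, r' > r
    P = {r: r, r_prime: r_prime}
    # A processor in Part 2 believes the star rooted at r is stagnant
    # and hooks r to r' (Part 2 ignores the order of the ids).
    P[r] = r_prime
    # Meanwhile a processor in Part 1 hooks r' to r, since r' > r.
    P[r_prime] = r
    return P
```

After both writes neither vertex is a root, so repeated shortcut operations never reach a fixed root.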
In our algorithms, to avoid creating cycles, a vertex is hooked only to a vertex with a 
smaller id. In Algorithm I, the hooking operations act on individual vertices rather than trees, 
as in the synchronous algorithm. Let (x,y) be an edge of G. Let u be the parent of x and v 
be the parent of y. Note that u may be x and v may be y. The parents graph PG is the same 
as in the synchronous algorithm of Section 4.4. In PG, if u and v are different, then {u, v} is
an eligible pair. Suppose u < v. Then processor p(x, y) hooks v to u, and vertices u and v
are coupled. After the hooking step, a processor performs a shortcut operation on each 
endpoint. 
We now present Algorithm I, which runs on a basic APRAM. We use uppercase names 
for variables that are stored in the shared memory and lowercase names for variables stored in 
each processor's local memory. Each vertex v has a parent P(v). Unlike the synchronous 
algorithm, we use one processor for each edge of G. A processor corresponding to an edge 
has local variables parent () and grandparent () for each endpoint to hold the endpoint's 
parent and grandparent, respectively. 
APRAM Connected Components Algorithm I
Initialization
    For each processor p(v)
        P(v) := v;
For each processor p(x, y)
    while there are eligible pairs
        Part 1: (Vertex hooking)
            parent(x) := P(x);
            parent(y) := P(y);
            if parent(x) ≠ parent(y) then
                if parent(x) > parent(y) then
                    P(parent(x)) := parent(y);
                else
                    P(parent(y)) := parent(x);
                endif
            endif
        Part 2: (Shortcut operation)
            parent(x) := P(x);
            grandparent(x) := P(parent(x));
            P(x) := grandparent(x);
            parent(y) := P(y);
            grandparent(y) := P(parent(y));
            P(y) := grandparent(y);
        rolling synchronization with binary tree;
    endwhile
For now we assume that the algorithm can determine whether there are eligible pairs. 
We will justify this assumption shortly. In Part 1, processors that correspond to edges whose 
endpoints have different parents hook the parent with the larger id to the other parent. Note 
that a vertex may be hooked to another vertex of the same tree in PG. In Part 2, processors 
perform a shortcut operation on each endpoint. Processors iterate Parts 1 and 2 until no 
eligible pairs remain in PG. Then all the vertices in each connected component have the 
same parent. 
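A sequential Python sketch of Algorithm I follows. It simulates only one simplified kind of execution, which is an assumption of the sketch: within a phase each processor runs one iteration without interleaving, in a random order, and the loop ends after the first phase in which no processor finds an eligible pair. Vertices are numbered from 0, and the names algorithm_one and find_root are hypothetical.

```python
import random


def algorithm_one(n, edges, seed=0):
    """Simulate Algorithm I; return the parent array and the phase count."""
    rng = random.Random(seed)
    P = list(range(n))                 # Initialization: P(v) := v
    phases = 0
    while True:
        found = False
        order = list(edges)
        rng.shuffle(order)             # arbitrary processor order in this phase
        for x, y in order:
            # Part 1: vertex hooking (larger parent hooked to smaller parent)
            px, py = P[x], P[y]
            if px != py:
                found = True
                if px > py:
                    P[px] = py
                else:
                    P[py] = px
            # Part 2: shortcut operation on each endpoint
            for v in (x, y):
                P[v] = P[P[v]]
        phases += 1
        if not found:                  # no eligible pair in the latest phase
            return P, phases


def find_root(P, v):
    """Follow parent pointers to the root of the tree containing v."""
    while P[v] != v:
        v = P[v]
    return v
```

Because a vertex is only ever hooked to a vertex with a smaller id, the root of each final tree is the smallest id in its connected component.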
We show that the algorithm correctly finds the connected components of G. We can 
establish by induction on the number of hooking operations that all vertices of a tree of PG
are in the same connected component. Initially, each tree of PG comprises a single vertex. 
We verify that a hooking operation couples only vertices that belong to the same connected 
component. 
In PG, if a vertex u of a tree is coupled with a vertex v of a different tree, then the 
processor p(x,y) that performs the hooking operation corresponds to an edge (x,y) of G. 
Let P(x) = u and P(y) = v. By the inductive hypothesis, vertices x and u belong to the 
same component and vertices y and v belong to the same component. Vertices x and y 
belong to the same component since they are joined by an edge of G. Thus vertices u, v, x, 
and y all belong to the same connected component, and the hooking operation is correct. 
Since the only primitive operations allowed on the shared variables are atomic read and 
write, the update of P (v) by one processor may overwrite that of another. But the algorithm 
maintains the invariant that P(v) is always a vertex in the same connected component as v. 
In the hooking operation, v is hooked only to vertices that belong to the same connected 
component as v. In the shortcut operation, at the time a processor writes P(v), it is possible 
that grandparent(v) is no longer the grandparent of v because P(parent(v)) may have been 
updated. But the algorithm sets P(v) to a vertex in the same connected component as v 
because at the time grandparent(v) was determined, the algorithm learned that v and 
grandparent(v) are in the same connected component. 
Next we show that when the algorithm terminates, if two vertices belong to the same 
connected component, then they are in the same tree. Suppose x and y are two vertices that 
belong to the same connected component. Then there is a path W from x to y in G. If x and 
y are in different trees of PG, then the endpoints of some edge e in W must belong to 
different trees. Since the parents of e form an eligible pair, the algorithm makes another 
iteration. 
To analyze the performance of the algorithm, we partition the execution of the algorithm 
into phases, where a phase is a minimal sequence of the stages of the steps of the processors 
in which every processor completes at least one iteration of the while loop. Note that our 
definition of a phase is not the same as the definition of Gibbons (1989). Although the ideas 
of phases and rounds are similar, except in scale, their boundaries do not necessarily coincide 
because a stage of a step of a processor that completes a phase might not complete a round. 
We can, however, bound the number of rounds in a phase. If an iteration of the while loop 
consists of i APRAM processor steps, then a phase contains at most i rounds since in each
round every processor takes at least one step. We review the definitions of stage, step, round, 
and phase in Table 4.1. 
During a phase, if a vertex v is hooked to more than one vertex (in sequence), then the 
last processor to update P(v) determines the vertex to which v is hooked at the end of the 
phase. Since v may be hooked only to a vertex that has a smaller id, if v is hooked to another 
vertex during a phase, then at the end of the phase P (v) has decreased by at least 1 since the 
beginning of the phase. The algorithm maintains the invariant that from one phase to the 
next, P (v) never increases. 
Table 4.1. A summary of the definitions of stage, step, round, and phase.
    Stage   One read operation, local computation, or write
            operation executed by one processor.
    Step    Each step of a processor comprises one read stage, one
            local computation stage, and one write stage.
    Round   A minimal sequence of stages such that every processor
            completes at least one step.
    Phase   A minimal sequence of stages such that every processor
            completes at least one iteration of a loop.
Theorem 4.5.1: Algorithm I requires at most n - 1 phases.
Proof: Let the connected component C contain $n_c$ vertices; without loss of generality,
assume that the vertices are numbered 1 through $n_c$. During each phase, the vertex w of C
with the largest P(w) is hooked to some vertex. Thus, at the end of the phase, P(w) has
decreased by at least 1. At the end of phase k, for each vertex v of C, $P(v) \le n_c - k$. Thus,
P(v) = 1 for every vertex v after at most $n_c - 1$ phases. Since $n_c \le n$, the result follows. •
If G is a star with n vertices and the id of the root is n, then there is an execution of
Algorithm I on G that requires n - 1 phases. At the end of phase k, the n - k trees of PG
consist of a star rooted at vertex n - k with children n through n - k + 1 and isolated vertices 1
through n - k - 1.
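One such execution can be reproduced with the sketch below. The schedule is an assumption chosen to be adversarial: in each phase every processor p(n, j) performs its Part 1 reads before any processor writes, and the writes then land in order of increasing j, so the largest id written is the one that survives. Ids are 1-based as in the text (index 0 is unused), and the function names are hypothetical.

```python
def adversarial_phase(P, n):
    """One phase of Algorithm I on the star with center n; True if a hook occurred."""
    leaves = range(1, n)
    # Part 1, read stages: every processor p(n, j) reads P(n) and P(j) first.
    reads = [(P[n], P[j]) for j in leaves]
    hooked = False
    # Part 1, write stages: later writes overwrite earlier ones.
    for pn, pj in reads:
        if pn != pj:
            P[max(pn, pj)] = min(pn, pj)
            hooked = True
    # Part 2: shortcut operation on both endpoints of every edge.
    for j in leaves:
        for v in (n, j):
            P[v] = P[P[v]]
    return hooked


def worst_case_star(n):
    """Run phases until no hook occurs; return P and the number of hooking phases."""
    P = list(range(n + 1))      # P[0] is unused
    hooking_phases = 0
    while adversarial_phase(P, n):
        hooking_phases += 1
    return P, hooking_phases
```

Under this schedule the root of the growing star moves down by exactly one id per phase, which matches the description of the trees at the end of phase k.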
Finally, we discuss how the algorithm determines whether there are eligible pairs. 
Synchronization is needed to ensure that every processor has had an opportunity to find any 
eligible pairs during the current phase. Otherwise a processor may terminate prematurely. 
In Algorithm I, a shared variable EP is used to keep track of whether there are eligible 
pairs; EP is initially true. At the end of each phase, if some processor found an eligible pair 
during the latest phase, then EP is set to true. But if no processor found an eligible pair, then 
EP is set to false, and the algorithm terminates.
The determination of whether some processor found an eligible pair during a phase is 
performed in conjunction with the rolling synchronization. The binary tree used for rolling 
synchronization is also used to keep track of whether some processor has found an eligible 
pair. When a processor marks its corresponding leaf in the synchronization tree, the 
processor marks its leaf true if it found an eligible pair during the phase, and false if it did 
not. A processor marks an internal node true if either child is marked true and false 
otherwise. The processor corresponding to the root of the synchronization tree sets EP for 
the next phase before marking the root to signal the completion of the rolling synchronization 
for the current phase. If any processor found an eligible pair during the previous phase, then 
EP is set to true. 
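The way the marks combine up the synchronization tree can be sketched as a bottom-up reduction. The heap layout (node k has children 2k and 2k + 1) and the function name next_ep are assumptions of this example, and the sketch ignores timing and shows only the value that reaches the root.

```python
def next_ep(leaf_marks):
    """Return the value of EP for the next phase.

    leaf_marks[i] is True if processor i found an eligible pair
    during the phase and False otherwise.
    """
    p = len(leaf_marks)
    # Cells 1..p-1 are internal nodes, cells p..2p-1 are the leaves.
    mark = [None] * p + list(leaf_marks)
    for node in range(p - 1, 0, -1):
        # An internal node is marked true if either child is marked true.
        mark[node] = mark[2 * node] or mark[2 * node + 1]
    return mark[1]
```

The root's mark is true exactly when at least one leaf was marked true.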
Each processor checks EP before executing an iteration of the while loop. The 
algorithm terminates after the first phase during which no processor finds an eligible pair. 
The determination of the existence of eligible pairs adds one more phase to the algorithm.
On a basic APRAM, rolling synchronization can be implemented with a binary tree. For 
simplicity, a separate tree can be used for each phase of the algorithm. To reduce space, 
however, two synchronization trees can be used alternately as follows. For either tree, a 
processor that corresponds to an internal node first erases the marks of the children of the 
node and then marks the node. At the time the root of the synchronization tree is marked, the 
marks for all other nodes of the tree are erased. 
To be able to erase the mark on the root of the synchronization tree, two trees are 
needed. If the mark on the root is erased too early, then not every processor would see the 
mark. If the mark is erased too late, then a fast processor may see the mark and complete the 
next phase before the mark is erased. With two trees, one tree is used for odd phases and one 
tree for even phases. Before a processor corresponding to the root of a tree marks the root, it
erases the marks on the children of the root and also the mark on the root of the other tree. 
With two synchronization trees, at most one root is marked at any time, and no mark on a root 
is erased until every processor has had an opportunity to see it. 
Theorem 4.5.2: Algorithm I requires O(n log n) rounds.
Proof: Algorithm I requires at most n - 1 phases. Since each rolling synchronization
requires O(log m) = O(log n) rounds, each phase of the algorithm requires O(log n) rounds.
Thus the total number of rounds required by the algorithm is O(n log n). •
We note that the straightforward asynchronous implementation of the synchronous 
algorithm of Section 4.4 yields an APRAM algorithm that requires $O(\log^2 n)$ rounds since the
number of phases is 0(log n). The main purpose of Algorithm I, however, is to present our 
asynchronous approach. 
4.6. APRAM Connected Components Algorithm II 
4.6.1. The algorithm
Algorithm I is inefficient because a vertex v may be hooked to several different vertices
during a phase, and in the worst case at the end of each phase v is hooked to the vertex
whose id is the largest. Algorithm II runs on an APRAM with two read-modify-write
primitives, replace-min and increment. Using these stronger primitives, we can design a 
more efficient algorithm. The replace-min primitive ensures that during each hooking and 
shortcut operation, if the algorithm assigns v a new parent, then the id of the new parent is 
smaller than the id of the old parent. The increment primitive allows each rolling 
synchronization to be executed with a counter in one round. 
Algorithm II differs from Algorithm I in that the algorithm hooks entire trees, and only
by their roots, to other trees. This restriction is necessary for the analysis of the algorithm.
We use the notation replace-min«s» to denote that the assignment statement s is executed
atomically using a replace-min primitive. 
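A replace-min primitive can be emulated in Python as shown below. The class SharedCell and the use of a lock to obtain atomicity are assumptions of the sketch; on the APRAM the primitive is a single read-modify-write operation.

```python
import threading


class SharedCell:
    """A shared memory cell supporting read and replace-min (illustrative)."""

    def __init__(self, value):
        self._value = value
        # Assumption: a lock makes the read-modify-write step atomic.
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def replace_min(self, candidate):
        """Atomically set the cell to min(cell, candidate); return the new value."""
        with self._lock:
            if candidate < self._value:
                self._value = candidate
            return self._value
```

Whatever the order in which concurrent replace-min operations are applied, the cell ends up holding the smallest value offered, so a stale or late write can never raise the value.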
APRAM Connected Components Algorithm II
Initialization
    For each processor p(v)
        P(v) := v;
For each processor p(x, y)
    while there are eligible pairs
        Part 1: (Shortcut operation)
            parent(x) := P(x);
            grandparent(x) := P(parent(x));
            replace-min«P(x) := min {P(x), grandparent(x)}»;
            parent(y) := P(y);
            grandparent(y) := P(parent(y));
            replace-min«P(y) := min {P(y), grandparent(y)}»;
        rolling synchronization with counter;
        Part 2: (Tree hooking)
            parent(x) := P(x);
            parent(y) := P(y);
            if parent(x) ≠ parent(y) then
                if (x was a root at the beginning of the phase) and (parent(x) > parent(y)) then
                    replace-min«P(parent(x)) := min {P(parent(x)), parent(y)}»;
                else if (y was a root at the beginning of the phase) and (parent(y) > parent(x)) then
                    replace-min«P(parent(y)) := min {P(parent(y)), parent(x)}»;
                endif
            endif
            if (x was a root at the beginning of the phase) then
                parent(x) := P(x);
                grandparent(x) := P(parent(x));
                replace-min«P(x) := min {P(x), grandparent(x)}»;
            endif
            if (y was a root at the beginning of the phase) then
                parent(y) := P(y);
                grandparent(y) := P(parent(y));
                replace-min«P(y) := min {P(y), grandparent(y)}»;
            endif
        rolling synchronization with counter;
        Part 3: (Shortcut operation)
            Same as in Part 1;
        rolling synchronization with counter;
    endwhile
We give an example to demonstrate the improved performance of Algorithm II. 
Suppose G is a star and vertex n is the root. In the worst case, Algorithm I requires n - 1 
phases. In Algorithm II, at the end of the first phase, the parent of vertex n is vertex 1. At 
the end of the second phase, vertex 1 is the parent of every vertex. Thus Algorithm II 
requires at most two phases. 
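The star example can be replayed with the sketch below, which simulates one phase of Algorithm II under a schedule chosen for illustration: all Part 2 reads take place before any Part 2 write, and the writes are then applied in random order. Two points are assumptions of the sketch. First, replace-min is modeled with Python's min on a list entry. Second, the root test is applied to the larger of the two parents read in Part 2; in this example that vertex is always an endpoint of the edge, so the test coincides with the one in the pseudocode. Ids are 1-based and index 0 is unused.

```python
import random


def algorithm_two_phase(P, edges, rng):
    """Run one phase of Algorithm II; return True if any hook was attempted."""
    was_root = [P[v] == v for v in range(len(P))]  # roots at the start of the phase

    def shortcut(v):
        P[v] = min(P[v], P[P[v]])      # replace-min on P(v)

    for x, y in edges:                 # Part 1: shortcut operation
        shortcut(x)
        shortcut(y)

    # Part 2: tree hooking. All reads happen first (assumed schedule).
    reads = [(P[x], P[y]) for x, y in edges]
    rng.shuffle(reads)                 # the write order is arbitrary
    hooked = False
    for px, py in reads:
        if px != py:
            hi, lo = max(px, py), min(px, py)
            if was_root[hi]:
                P[hi] = min(P[hi], lo)  # replace-min on P(hi)
                hooked = True
    for x, y in edges:                 # shortcut on vertices that were roots
        for v in (x, y):
            if was_root[v]:
                shortcut(v)

    for x, y in edges:                 # Part 3: same as Part 1
        shortcut(x)
        shortcut(y)
    return hooked
```

With replace-min the outcome of the first phase does not depend on the write order: the parent of vertex n becomes vertex 1 regardless of which processor writes last.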
4.6.2. Analysis 
The correctness of Algorithm II can be established in a manner similar to that used for
Algorithm I. Thus we will analyze only the performance. Let u and v be vertices such that 
u < v. In Algorithm II, a processor p tries to hook v to u if {u, v} is an eligible pair and v 
was the root of a tree at the beginning of the current phase. Note that at the time processor p
tries to hook v to u, vertex v may no longer be the root of a tree since v may have been 
previously hooked onto some other vertex. 
During a phase, if on the last time v is hooked onto a vertex, v is hooked onto u, then 
during that phase u adopts v, v is adopted by u, and u and v participate in an adoption. 
During a phase, a vertex x is stagnant if x does not participate in an adoption. Note that x 
may be stagnant even if some vertex y is hooked to x since during the phase, y may 
subsequently be hooked to some other vertex. If during the phase v is adopted by u, then the
tree rooted at v is adopted by the tree containing u, and the trees are merged. In our 
discussion, the adoption of the root of a tree implies the merging of trees. 
The rolling synchronization after Part 2 ensures that entire trees are adopted. The 
algorithm may perform shortcut operations on the root of a tree after the root is hooked to 
another vertex. The algorithm performs shortcut operations on the remaining vertices of the 
tree only after the root has been adopted during the current phase. If there were no rolling 
synchronization after Part 2, then through shortcut operations, some vertex of the tree might 
take on a parent that was only a temporary parent of the root, and the algorithm would not be 
guaranteed to merge entire trees. 
Let C be a connected component of G and let V(C) be the vertices of C. Let u and v 
be two vertices of V(C). 
Lemma 4.6.1: If {u, v} is an eligible pair, u < v, and v is the root of a tree at the beginning
of phase k, then at the end of Part 2 of phase k, vertex v is adopted.
Proof: Let (x, y) be an edge of G such that P(x) = u and P(y) = v. Then processor p(x, y)
finds the eligible pair {u, v}. Throughout the computation, P(v) decreases monotonically.
In Part 2 of phase k, processor p(x, y) hooks v to u unless v has already been hooked to
another vertex and P(v) < u. Thus, at the end of Part 2 of phase k vertex v is adopted by
some vertex. •
Lemma 4.6.2: If {u, v} is an eligible pair, u < v, and u and v are both roots of trees at the
beginning of Part 2 of phase k, then at the end of Part 2 of phase k+1, vertices u and v have
each participated in at least one adoption, and at least one of u and v is adopted.
Proof: Let p(x, y) be a processor such that P(x) = u and P(y) = v. Then processor p(x, y)
finds the eligible pair {u, v}. By Lemma 4.6.1, during phase k vertex v is adopted by some
vertex. If v is adopted by u, then vertex u also participates in an adoption. If vertex u
remains stagnant during phase k, then at the end of Part 2 of phase k, the parent of v must
have been a vertex b, where b < u. At the end of phase k, after processor p(x, y) performs
at least one shortcut operation, the parent of y is a vertex a, where a ≤ b. During phase k+1
processor p(x, y) finds the eligible pair {a, u}. Since a < u and u is the root of a tree, by
Lemma 4.6.1, vertex u is adopted at the end of Part 2 of phase k+1. •
Two trees $T_1$ and $T_2$ of PG are adjacent if there exist vertices x and y such that x is a
vertex of $T_1$, y is a vertex of $T_2$, and (x, y) is an edge of G. A tree is tall if its height is at
least 2. A tree is tall if and only if it is not a star. As a consequence of Lemma 4.6.2, we have 
the following corollary. 
Corollary 4.6.3: If u and v are the roots of adjacent stars at the beginning of Part 2 of phase
k, then at the end of Part 2 of phase k+1, vertices u and v have each participated in at least
one adoption, and at least one of u and v is adopted.
Corollary 4.6.4: If u is the root of a star at the beginning of phase k and u is stagnant during
phases k and k+1, then at the beginning of phase k, every tree adjacent to the star with root u
is a tall tree of height at least 3.
Proof: Suppose to the contrary that at the beginning of phase k the height of some tree
adjacent to the star rooted at u were at most 2. After a shortcut operation in Part 1, the tree
would be a star. But then by Corollary 4.6.3, at the end of Part 2 of phase k+1, vertex u
would have participated in at least one adoption. □
4.6.2.1. The potential function 
To show that the vertices of V(C) form a single star after O(log n) phases, we measure
the progress of the algorithm with a potential function. We first give some intuition behind 
the potential function. In the synchronous algorithms of Section 4.4, the progress of the 
algorithms is measured by the sum of the heights of the trees in PG. During each iteration 
every star participates in an adoption to form a new tree of height at least 2. After a shortcut 
operation, the height of each tall tree is reduced by a factor of at least 3/2. Thus the sum of 
the heights of the trees that contain the vertices of V(C) decreases by a factor of at least 3/2. 
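The effect of one shortcut operation on tree height can be illustrated with a small sketch. It assumes the usual pointer-jumping form of a shortcut (each vertex takes its grandparent as its new parent), that a root is its own parent, and that height is counted in arcs, so that a star has height 1. The names `shortcut`, `height`, and `path` are illustrative only and are not taken from the algorithms of Section 4.4.

```python
def shortcut(parent):
    """One synchronous shortcut step: every vertex adopts its grandparent.

    parent[v] is the parent of vertex v; a root satisfies parent[v] == v.
    """
    return [parent[parent[v]] for v in range(len(parent))]


def height(parent):
    """Height of the forest in arcs: the longest vertex-to-root distance."""
    def depth(v):
        d = 0
        while parent[v] != v:
            v = parent[v]
            d += 1
        return d
    return max(depth(v) for v in range(len(parent)))


def path(h):
    """A single tree of height h: the chain h -> h-1 -> ... -> 1 -> 0."""
    return [0] + list(range(h))


if __name__ == "__main__":
    for h in (2, 3, 4, 9):
        # Compare the height before and after one shortcut step.
        print(h, height(shortcut(path(h))))
```

On a chain, which is the worst case for a given height, one step takes height h to the ceiling of h/2; the smallest reduction for a tall tree is therefore from 3 to 2, the factor 3/2 quoted above.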
We show that in Algorithm II, during two consecutive phases, only the sum of the 
heights of tall trees and the sum of the heights of stars that are adjacent to other stars are 
guaranteed to decrease by at least a constant factor. During each phase, the algorithm 
performs at least two shortcut operations. Thus during two consecutive phases, the algorithm 
performs at least four shortcut operations. 
If the height of a tall tree is h, then after four shortcut operations, the height of the tree is
at most $\lceil\lceil\lceil\lceil h/2 \rceil /2 \rceil /2 \rceil /2 \rceil = \lceil h/16 \rceil$. If $2 \le h \le 16$, then after four shortcut operations, the
height of the tree is reduced to 1. Since $h \ge 2$, after two consecutive phases, the height of the
tree decreases by a factor of at least 2.
If $h \ge 17$, then after two consecutive phases, the height of the tree is at most $h/16 + 1$.
Since $h \ge 17$, the height of each tree decreases by a factor of at least 272/33.
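The arithmetic in the two preceding paragraphs can be checked numerically. The sketch below assumes only that each shortcut operation takes a height h to at most the ceiling of h/2; the helper names are illustrative.

```python
from fractions import Fraction


def ceil_half(h):
    """Ceiling of h/2 for a positive integer h."""
    return -(-h // 2)


def after_shortcuts(h, count):
    """Worst-case height after `count` shortcut operations on a tree of height h."""
    for _ in range(count):
        h = ceil_half(h)
    return h


def reduction_factor(h):
    """Ratio of the original height to the worst-case height after four shortcuts."""
    return Fraction(h, after_shortcuts(h, 4))


if __name__ == "__main__":
    # Smallest reduction factor for tall trees of height 2..16 and of height >= 17.
    print(min(reduction_factor(h) for h in range(2, 17)))
    print(min(reduction_factor(h) for h in range(17, 2000)))
```

The second minimum is attained at h = 17 and is slightly larger than 272/33, because 272/33 comes from the relaxed bound h/16 + 1 rather than from the exact ceiling.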
For stars that are adjacent to other stars, by Corollary 4.6.3, after hooking operations in 
two consecutive phases, at least half of those stars are adopted to form tall trees. After each 
hooking operation, the algorithm performs at least one shortcut operation. Since the height of 
each tall tree is reduced by a factor of at least 3/2, the sum of the heights of those stars that 
are hooked to form the tall trees is also reduced by a factor of at least 3/2. Stars that are 
adjacent only to tall trees may remain stagnant for several phases, however. But after a tall 
tree becomes a star, then all of the stars that were adjacent to the tall tree participate in an 
adoption within two phases. 
The potential function we use corresponds roughly to the sum of the heights of the trees. 
The difference is that if several stars are adjacent to one tree, then not all of the stars may be 
counted. This is because the maximum height of the new tree that can be formed by merging 
all of the stars with the tree is bounded. We will explain this shortly. 
Consider the vertices V(C) of C. The vertices of V(C) are partitioned into trees in PG. 
At the beginning of phase k, let the forest $F_k$ be the set of trees in PG that contain the
vertices of V(C). During phase k, the algorithm may merge trees of $F_k$. The potential of the
forest, which we define precisely below, is an upper bound on the sum of the heights of the
trees that can be formed if all of the trees in $F_k$ participate in at least one adoption during the
next two phases. If the potential of $F_k$ is 1, then the vertices of V(C) are contained in a
single star. We will show that after every two phases, the potential of the forest in PG that
contains the vertices of V(C) decreases by at least a constant factor. Thus the vertices of
V(C) are contained in a single star after O(log n) phases.
We define the potential function by considering the trees in a forest F. Without loss of 
generality, assume that there are no isolated stars in F, since each such star corresponds to a 
connected component all of whose vertices are contained in a single star. A tree in F is 
external if it is adjacent to one other tree in F and internal if it is adjacent to at least two 
other trees in F. We will partition F into two subsets, and then define the potential of each 
subset. The potential of F is the sum of the potentials of the two subsets. 
To help clarify the derivation of the potential function, we will use an example. 
Suppose F is the forest of trees in PG shown in Figure 4.1. Tall trees are represented by the 
tall triangles, and stars are represented by the short triangles. The dashed lines indicate 
adjacent trees. For each pair of adjacent trees, there is at least one edge of G between the 
trees. During our discussion, we will refer to Figure 4.1. 
A tree $T_1$ covers a tree $T_2$ if $T_1$ is adjacent to $T_2$. Let T be a tree of F. A cluster of T is
a set of stars covered by T. A cluster may contain from none to all of the stars covered by T.
Call T the core of the cluster. Note that T is not in the cluster.
Let AS be the set of stars of F that are adjacent to at least one other star. Let St (F) be a 
minimum cardinality set of stars of AS such that for every pair of adjacent stars in AS, at 
least one of the stars is in St(F). Note that St(F) might not be unique. Also, each star in
St(F) covers at least one star not in St(F); otherwise, St(F) would not be a set of minimum
cardinality.

Figure 4.1. An example of a forest F.
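The definition of St(F) amounts to a minimum vertex cover of the graph whose vertices are the stars of AS and whose edges join adjacent stars. A brute-force sketch is given below for small inputs; it is only an illustration of the definition, not part of Algorithm II, and the star names and adjacencies in the example are hypothetical rather than read from Figure 4.1.

```python
from itertools import combinations


def minimum_star_cover(star_edges):
    """Return one minimum-cardinality set of stars that touches every edge.

    star_edges is a list of pairs (a, b) of adjacent stars. The search tries
    candidate sets in order of increasing size, so the first hit is minimum.
    The running time is exponential; this is meant for small examples only.
    """
    stars = sorted({s for edge in star_edges for s in edge})
    for size in range(len(stars) + 1):
        for candidate in combinations(stars, size):
            chosen = set(candidate)
            if all(a in chosen or b in chosen for a, b in star_edges):
                return chosen
    return set()


if __name__ == "__main__":
    # Hypothetical adjacencies between stars (assumed for illustration).
    edges = [("B", "C"), ("E", "N"), ("L", "M"), ("O", "P"), ("O", "Q")]
    print(sorted(minimum_star_cover(edges)))
```

Several covers of the same size may exist (for instance, either endpoint of an isolated adjacent pair), which is why St(F) need not be unique.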
Let Ta(F) be the set of tall trees in F. Let $Co(F) = Ta(F) \cup St(F)$. Finally, let Cl(F)
be the set of stars of F that are not in Co(F). An assignment is a function that assigns each
star S of Cl(F) to a tree of Co(F) adjacent to S. For an assignment A, A(S) is the tree to
which star S is assigned. Note that each external star may be assigned to only one tree. We
will consider particular assignments later.
Fix an assignment. For each T in Co(F), let Cl(T) be the set of stars of Cl(F) assigned
to T. If T is a tall tree, then Cl(T) is a tall tree cluster. If T is a star, then Cl(T) is a star
cluster. It is clear that each tree of F is in exactly one of the subsets Co(F) and Cl(F).
In the example of Figure 4.1, $Ta(F) = \{T_A, T_D, T_F, T_H, T_I\}$. One possible set St(F) of
minimum cardinality is $St(F) = \{T_B, T_E, T_L, T_O\}$. For this St(F), one possible assignment
yields the clusters $Cl(T_A) = \{T_J, T_K\}$, $Cl(T_B) = \{T_C\}$, $Cl(T_E) = \{T_N\}$, $Cl(T_H) = \{T_G\}$,
$Cl(T_L) = \{T_M\}$, and $Cl(T_O) = \{T_P, T_Q\}$.
Lemma 4.6.5: During the execution of Algorithm II, if two trees of F are merged, then at 
least one of the trees is a tree of Co (F). 
Proof: If either tree is a tall tree, then at least one tree is a tree of Ta (F). If both trees are 
stars, then by definition, at least one tree is a star of St(F). □
We now define the potential of each subset. We define the potential such that the 
potential of F is an upper bound on the sum of the heights of the trees that can be formed if 
every tree of F participates in at least one adoption during the next two phases. 
Let ht(T) denote the height of a tree T. If T is a tree of Co(F), then define $\Phi(T)$, the
potential of T, to be ht(T). Define $\Phi(Co(F))$, the potential of Co(F), to be
$$\Phi(Co(F)) = \sum_{T \in Co(F)} ht(T) = \sum_{T \in Ta(F)} ht(T) + |St(F)|.$$
The value of $\Phi(Co(F))$ is well-defined since Ta(F) is the set of all tall trees of F and St(F)
is a set of minimum cardinality.
For $T \in Co(F)$, we define $\Phi(Cl(T),T)$, the potential of cluster Cl(T) with core T, by
considering four cases. In essence, $\Phi(Cl(T),T)$ is the difference between the height of T and
the height of the tallest tree that can be formed by merging only stars of Cl(T) with T.
Case 1: Cl(T) is empty. In this case, define $\Phi(Cl(T),T) = 0$.
Case 2: If the id of the root of T is smaller than the id of the root of every star of Cl(T),
then the core of the cluster is grounded. If T is grounded, then no stars of Cl(T) can adopt T.
But any number of stars of Cl(T) may be adopted by T. Since every star of Cl(T) is adjacent
to T, regardless of the number of stars that T adopts, the height of the tallest tree that can be
formed is ht(T) + 1. In this case, define $\Phi(Cl(T),T) = 1$.
Case 3: If T is not grounded and Cl(T) contains only one star S, then S may adopt T.
The height of the tallest tree that can be formed is ht(T) + 1. In this case, define
$\Phi(Cl(T),T) = 1$.
Case 4: If T is not grounded and Cl(T) contains at least two stars, then at least one star
of Cl(T) has a root whose id is smaller than the id of the root of T. During the first phase, at
most one star of Cl(T) may adopt T, and T may adopt any number of stars of Cl(T). The
height of the tallest tree T' that can be formed is ht(T) + 2. The algorithm performs at least
two shortcut operations before the hooking operation in the second phase. Thus the height of
T' is reduced so that
$$ht(T') \le \lceil\lceil (ht(T) + 2)/2 \rceil /2 \rceil = \lceil (ht(T) + 2)/4 \rceil.$$
Since $ht(T') \le ht(T)$, the height of the tallest tree that can be formed through hooking
operations in the second phase is at most ht(T) + 2. In this case, define $\Phi(Cl(T),T) = 2$.
This concludes the last case.
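The four cases can be summarized as a small function. This is a sketch of the case analysis only; the function name and its arguments (the id of the core's root and the list of ids of the roots of the stars in the cluster) are assumptions made for illustration.

```python
def cluster_potential(core_root_id, star_root_ids):
    """Potential of a cluster, following Cases 1 through 4.

    core_root_id  -- id of the root of the core T
    star_root_ids -- ids of the roots of the stars assigned to T
    """
    if not star_root_ids:
        return 0  # Case 1: the cluster is empty
    if core_root_id < min(star_root_ids):
        return 1  # Case 2: the core is grounded
    if len(star_root_ids) == 1:
        return 1  # Case 3: not grounded, exactly one star
    return 2      # Case 4: not grounded, at least two stars


if __name__ == "__main__":
    # Hypothetical ids: a core rooted at 10 with stars rooted at 1, 4, 7, 17.
    print(cluster_potential(10, [1, 4, 7, 17]))
```

The potential depends only on the number of stars and on whether some star root has a smaller id than the core's root, so it never exceeds the number of stars in the cluster.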
Let R be the sum of the potentials of the clusters whose cores are trees of Co(F), i.e.,
$$R = \sum_{T \in Co(F)} \Phi(Cl(T),T).$$
Different assignments of stars of Cl(F) to trees of Co(F) yield different clusters Cl(T).
Since the potential of a cluster is determined by the number of stars in the cluster, the id of
the root of the core of the cluster, and the id's of the roots of the stars in the cluster, different
assignments of stars of Cl(F) to trees of Co(F) may yield different values for R. Define
$\Phi(Cl(F))$, the potential of Cl(F), to be the maximum value of R over all possible sets St(F)
and assignments of stars of Cl(F) to trees of Co(F). Note that $\Phi(Cl(F)) \le |Cl(F)|$ since
by definition $\Phi(Cl(T),T) \le |Cl(T)|$. Finally, define $\Phi(F)$, the potential of the forest F, to
be
$$\Phi(F) = \Phi(Co(F)) + \Phi(Cl(F)).$$
4.6.2.2. Contribution 
Suppose that the algorithm is at the beginning of phase k. For a particular connected
component C, let $F_k$ be the forest of trees in PG that contains the vertices of V(C). During
phases k and k+1 of the algorithm, only some of the trees of $F_k$ may be merged. Some tall
trees may remain stagnant but be reduced in height through shortcut operations, and some
stars may remain stagnant and unchanged. Since the potential of $F_k$ is based on clusters, we
need to account for the change in potential when only some of the stars in a cluster are
merged. We define a measure which we call contribution to relate the potential of $F_{k+2}$ to the
potential of $F_k$.
Let forest $F_{k+2}$ be the set of trees of PG that contain the vertices of V(C) at the end of
phase k+1. Then each tree of $F_{k+2}$ contains the vertices of one or more trees of $F_k$. If a tree
T' of $F_{k+2}$ contains all of the vertices of a tree T of $F_k$, then we say that T' incorporates T.
If T' is a tree of $F_{k+2}$ that is not in $F_k$, then T' is a tree formed as a result of shortcut
operations on either a tall tree of $F_k$ or a tree that was created by merging two or more trees
of $F_k$. Let H be a subset of trees of $F_k$ incorporated into T'. The contribution of H to T',
written cont(H,T'), which we define precisely below, is a rough measure of the portion of
the height of T' that is due to trees of H.
Let $l$ be the length of the longest path in T' consisting only of vertices of trees of H. If
$l \ge 1$, then cont(H,T') = $l$. If $l = 0$, i.e., no two vertices of trees of H are joined by an arc of
PG, then cont(H,T') = 1. Note that if T' incorporates only trees of H, then
cont(H,T') = ht(T').
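The definition of cont(H,T') can be sketched in code as follows. The sketch assumes that a path means a chain of parent arcs directed toward the root, that T' is given as a parent map in which the root is its own parent, and that H is given as a non-empty set of vertices; these representation choices and the name `contribution` are assumptions for illustration.

```python
def contribution(parent, h_vertices):
    """cont(H, T'): length of the longest upward path using only vertices of H.

    parent     -- dict mapping each vertex of T' to its parent (root maps to itself)
    h_vertices -- the vertices of the trees of H (assumed non-empty)
    Returns the path length in arcs, or 1 if no arc joins two vertices of H.
    """
    h = set(h_vertices)
    memo = {}

    def run(v):
        # Number of consecutive arcs above v whose endpoints both lie in H.
        if v not in memo:
            p = parent[v]
            memo[v] = 0 if p == v or p not in h else 1 + run(p)
        return memo[v]

    longest = max(run(v) for v in h)
    return max(longest, 1)


if __name__ == "__main__":
    chain = {1: 1, 7: 1, 10: 7}  # hypothetical tree: 10 -> 7 -> 1
    print(contribution(chain, {7, 10}), contribution(chain, {1, 7, 10}))
```

When H covers every vertex of T', the longest such path is a longest vertex-to-root path, which matches the note that cont(H,T') = ht(T') in that situation.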
Let Stag(T) be the set of stars of Cl(T) that were stagnant during phases k and k+1.
Note that Stag(T) could be empty. Let T' be the tree of $F_{k+2}$ that incorporates T. The
contribution of Stag(T) to T' is the potential Stag(T) would have if T' were the core of a
cluster that comprised only the stars of Stag(T).
We clarify the concept of contribution with an example. Suppose that at the beginning 
of phase k the forest $F_k$ contains the tall tree and its cluster shown in Figure 4.2. Arrows
indicate arcs of PG and dotted lines indicate edges of G. The numbers are the id's of the 
vertices. We will refer to trees by the id's of their roots. In the example, tree $T_{10}$ is the core
of a tall tree cluster and $Cl(T_{10}) = \{T_1, T_4, T_7, T_{17}\}$. Since $T_{10}$ is the core of a cluster,
$\Phi(T_{10}) = 3$. Since $T_{10}$ is not grounded and $Cl(T_{10})$ contains at least two stars,
$\Phi(Cl(T_{10}),T_{10}) = 2$.
Suppose that during Part 1 of phase k the algorithm performs one shortcut operation on
each vertex of the tree $T_{10}$. Then at the end of Part 1 of phase k, the height of $T_{10}$ is 2.
During Part 2 of phase k, suppose that vertex 7 adopts $T_{10}$ and vertex 11 of tree $T_{10}$ adopts
$T_{17}$, while $T_1$ and $T_4$ remain stagnant. The height of the new tree T' that is formed is 4.
After the algorithm performs at least one shortcut operation in each of Part 3 of phase k and
Part 1 of phase k+1, the height of T' is reduced to 1. During Part 2 of phase k+1, suppose T'
is adopted by vertex 1 to form a new tree T'' while again $T_4$ remains stagnant. After a
shortcut operation, the height of T'' is 1. Thus, at the end of phase k+1, the contribution of
the set of trees $\{T_1, T_7, T_{10}, T_{17}\}$ to T'' is 1. If T'' were the core of a cluster containing only
the star $T_4$, then the potential of the cluster would be 1. Thus, $cont(\{T_4\}, T'') = 1$.
4.6.2.3. Performance 
To analyze the progress of the algorithm, we examine one execution of the algorithm. 
But the analysis holds for every execution. Assume that at the beginning of phase k we know
a priori how the trees of $F_k$ will be merged and the stars of $F_k$ that will remain stagnant
during the next two phases. Let $St(F_k)$ be the set of stars of $F_k$ and let $A_k$ denote the
assignment of stars of $Cl(F_k)$ to trees of $Co(F_k)$ that determines $\Phi(Cl(F_k))$. For $St(F_k)$, we
define another assignment $A_k$, where the stars of $Cl(F_k)$ are assigned to trees of $Co(F_k)$ in
the following order.
Let S be any star of $Cl(F_k)$. If S is adopted by a vertex of a tree T in $Co(F_k)$ during
phase k or k+1, then assign S to T. If a vertex of S adopts a tree T of $Co(F_k)$ during phase
k or k+1 and S has not already been assigned to some other tree, then assign S to T. If S is
stagnant during phases k and k+1 and S is a star of $Cl(F_{k+2})$, then let T' be the tree of
$Co(F_{k+2})$ to which S is assigned in $A_{k+2}$. In $A_k$, assign S to a tree T adjacent to S in $F_k$ that
is incorporated into T'. Finally, for the set of stagnant stars of $Cl(F_k)$ that are stars of
$Co(F_{k+2})$, find an assignment of those stars to trees of $Co(F_k)$ such that the sum of their
contributions to trees of $F_{k+2}$ is maximized. This completes the specification of $A_k$.
We now define $Cl(T) = \{S \in Cl(F_k) : A_k(S) = T\}$. For the assignment $A_k$, define
$\Psi(Co(F_k))$ to be
$$\Psi(Co(F_k)) = \sum_{T \in Co(F_k)} ht(T),$$
define $\Psi(Cl(F_k))$ to be
$$\Psi(Cl(F_k)) = \sum_{T \in Co(F_k)} \Phi(Cl(T),T),$$
and define $\Psi(F_k)$ to be
$$\Psi(F_k) = \Psi(Co(F_k)) + \Psi(Cl(F_k)).$$
For convenience, let $\phi_k$ be the value of $\Phi(F_k)$, and let $\psi_k$ be the value of $\Psi(F_k)$. By the
definition of $\Phi(F_k)$, it follows that $\psi_k \le \phi_k$.
Lemma 4.6.6: If every tree of $F_k$ participates in at least one adoption during phases k and
k+1, then $\sum_{T' \in F_{k+2}} ht(T') \le \phi_k$.
Proof: Let T' be a tree of $F_{k+2}$. Let M be the set of trees of $F_k$ that are merged to form T'.
By Lemma 4.6.5 and from the assignment $A_k$, M can be partitioned into subsets where each
subset contains exactly one tree T of $Co(F_k)$ and the set of stars Cl(T) that are merged with
T. A set of trees $\{T_1, T_2, \ldots, T_k\}$ is merged in series if $T_1$ adopts $T_2$, $T_2$ adopts $T_3$, ..., and
$T_{k-1}$ adopts $T_k$. The height of T' is greatest if all of the trees of M are merged in series.
If T is a tree of $Co(F_k)$, then $\Phi(T) + \Phi(Cl(T),T)$ is an upper bound on the height of the
tree that can be formed by merging T and the stars of Cl(T) in series. Let Co(M) be the set
of trees of $Co(F_k)$ in M. Then
$$ht(T') \le \sum_{T \in Co(M)} \left( \Phi(T) + \Phi(Cl(T),T) \right)$$
because the sum of the potentials of the cores and their corresponding clusters is an upper
bound on the height of the tree that can be formed by merging all trees of M in series. Since
every tree of $F_k$ participates in at least one adoption during phases k and k+1, every tree of
$F_k$ is incorporated into some tree of $F_{k+2}$. Since T' may be any tree of $F_{k+2}$, it follows from
the assignment $A_k$ and the definition of $\Psi(F_k)$ that
$$\sum_{T' \in F_{k+2}} ht(T') \le \psi_k.$$
Then, by the definition of potential, $\psi_k \le \phi_k$, and the result follows. □
In general, the sum of the heights of the trees that are formed will be less than $\psi_k$. The
trees that are formed depend on the relative ordering of the id's of the roots of the trees and 
the order in which trees are merged. Thus, the actual number of trees that are merged in 
series may be smaller than the number used in the determination of the potential. Also, the 
root of a tree may be adopted by a vertex that is closer to the root than a parent of a leaf 
vertex. Finally, a star of Cl(Fk) that is adjacent to several trees of Co(Fk) might not be 
merged with the tree to which it was assigned when determining the potential. 
Lemma 4.6.7: The sum of the contributions of the trees of $F_k$ to the trees of $F_{k+2}$ is at most
$(3/5)\,\phi_k$.
Proof: Since $\psi_k \le \phi_k$, it suffices to show that after two phases, the sum of the contributions
of the trees of $F_k$ to the trees of $F_{k+2}$ is at most $(3/5)\,\psi_k$. We will partition the trees of $F_k$
into subsets such that for each subset, the sum of the contributions of the trees in the subset to
a tree of $F_{k+2}$ is at most 3/5 the sum of the potentials of those trees in $F_k$.
Let T be a tree of $Co(F_k)$ and let r be the root of T. During phases k and k+1, stars of
Cl(T) may be merged with T. Let M'(T) be the set of stars of Cl(T) that are merged with T
during phase k. At the end of phase k, let T' be the tree of $F_{k+1}$ that incorporates T. Let
M''(T) be the set of stars of Cl(T) that are stagnant during phase k and merged with T'
during phase k+1. At the end of phase k+1, let T'' be the tree of $F_{k+2}$ that incorporates T.
Finally, let $M(T) = M'(T) \cup M''(T)$.
Let Cl(T) + T denote the set of trees $Cl(T) \cup \{T\}$. Let Stag(T) denote the set of stars
of Cl(T) that are stagnant during phases k and k+1. Note that $Cl(T) = M(T) \cup Stag(T)$.
Define $\rho$ to be
$$\rho = \frac{cont(M(T) + T, T'') + cont(Stag(T), T'')}{\Phi(T) + \Phi(Cl(T),T)} = \frac{cont(M(T) + T, T'') + cont(Stag(T), T'')}{ht(T) + \Phi(Cl(T),T)}.$$
We consider seven cases. In each case, we show that $\rho \le 3/5$.
Case 1: T is a star, T is grounded, and Cl(T) contains at least one star. Then
$ht(T) + \Phi(Cl(T),T) = 2$. Since T is grounded, by Lemma 4.6.1, at the end of the hooking
operation in phase k, every star of Cl(T) is adopted by T; hence M'(T) = Cl(T). After at
least one shortcut operation, at the end of phase k, every vertex of Cl(T) + T has a parent
whose id is at most the id of r. Since every vertex of Cl(T) + T has a parent that is either r
or an ancestor of r in T', the length of the longest path in T' consisting only of vertices of the
trees of Cl(T) + T is at most 1. Therefore, by the definition of contribution,
cont(Cl(T) + T, T') = 1. After further shortcut operations, at the end of phase k+1, each vertex
of Cl(T) + T has a parent whose id is at most the id of the parent of the vertex at the end of
phase k. Thus, cont(Cl(T) + T, T'') = 1. Since no stars of Cl(T) are stagnant, Stag(T) is
empty. Thus, cont(Stag(T), T'') = 0. It follows that $\rho = 1/2$.
Case 2: T is a star, T is not grounded, and Cl(T) contains at least one star. Then
$ht(T) + \Phi(Cl(T),T) \ge 2$. Let min-id(Cl(T)) denote the id of the root of the star in Cl(T)
whose id is smallest. Then, at the end of the hooking operation in phase k, the parent of r is
a vertex b, where $b \le$ min-id(Cl(T)). After a shortcut operation, at the end of phase k,
every vertex of T has a parent whose id is at most b, and every vertex of a star of M'(T) has
a parent whose id is at most r.
In phase k+1, after at least one shortcut operation, every vertex of a star of M'(T) has a
parent whose id is at most b. Since every vertex of T has a parent whose id is at most b and
$b \le$ min-id(Cl(T)), by Lemma 4.6.1, at the end of the hooking operation in phase k+1,
every star of Cl(T) that is stagnant during phase k is adopted by T'; hence, M(T) = Cl(T).
After a shortcut operation, every vertex of a star of M''(T) has a parent whose id is at most b.
Thus, at the end of phase k+1, every vertex of a tree of Cl(T) + T has a parent whose id is at
most b. Since $b \le$ min-id(Cl(T)), it follows, as in Case 1, that cont(Cl(T) + T, T'') = 1.
Since Stag(T) is empty, cont(Stag(T), T'') = 0. Thus, $\rho \le 1/2$.
Case 3: The height of T is 2 and Cl(T) contains at least one star. Then
$ht(T) + \Phi(Cl(T),T) \ge 3$. After the first shortcut operation during phase k, the remainder of
the analysis for Case 3 is the same as the analysis for Case 1 or Case 2, depending on whether
T is grounded. In either case, cont(Cl(T) + T, T'') = 1, and thus $\rho \le 1/3$.
Case 4: T is a tall tree and Cl(T) contains no stars. Since T is a tall tree,
$\Phi(T) = ht(T) \ge 2$. Since Cl(T) is empty, $\Phi(Cl(T),T) = 0$. After at least two shortcut
operations during each of phases k and k+1, at the end of phase k+1,
$$cont(T,T'') \le \lceil\lceil\lceil\lceil ht(T)/2 \rceil /2 \rceil /2 \rceil /2 \rceil = \lceil ht(T)/16 \rceil. \tag{4.1}$$
Because Cl(T) is empty, Stag(T) is empty, and cont(Stag(T),T'') = 0. If
$2 \le ht(T) \le 16$, then by Inequality (4.1), cont(T,T'') = 1. Since $ht(T) \ge 2$, it follows that
$\rho \le 1/2$.
Eliminating the ceiling operations in Inequality (4.1), we obtain
$cont(T,T'') \le ht(T)/16 + 1$. If $ht(T) \ge 17$, then it follows that $\rho \le 33/272 < 3/5$.
Case 5: The height of T is at least 3, T is grounded, and Cl(T) contains at least one
star. Then $\Phi(T) + \Phi(Cl(T),T) = ht(T) + 1$. After at least one shortcut operation followed
by a hooking operation in which T is grounded and at least one more shortcut operation, at
the end of phase k,
$$cont(M'(T) + T, T') \le \lceil (\lceil ht(T)/2 \rceil + 1)/2 \rceil.$$
Since T is incorporated into T', the id of the root of T' is at most the id of the root of T.
Thus, T' is grounded with respect to the stars of Cl(T). After at least one more shortcut
operation followed by another hooking operation in which T' is grounded and then at least
one more shortcut operation, at the end of phase k+1,
$$cont(M(T) + T, T'') \le \lceil (\lceil\lceil (\lceil ht(T)/2 \rceil + 1)/2 \rceil /2 \rceil + 1)/2 \rceil = \lceil (\lceil (\lceil ht(T)/2 \rceil + 1)/4 \rceil + 1)/2 \rceil. \tag{4.2}$$
In phases k and k+1, the algorithm performs at least three shortcut operations before the
hooking operation in phase k+1. If $ht(T) \le 8$, then before the hooking operation in phase
k+1, every vertex of T has a parent whose id is at most the id of r. Since T is grounded, by
Lemma 4.6.1, every star of Cl(T) is adopted. Thus, Stag(T) is empty, and
cont(Stag(T),T'') = 0.
If $3 \le ht(T) \le 8$, then by Inequality (4.2), $cont(Cl(T) + T, T'') \le 2$. Since
$ht(T) + \Phi(Cl(T),T) \ge 4$, $\rho \le 1/2$.
If $ht(T) \ge 9$, then stars of Cl(T) may remain stagnant. Since T'' incorporates T, the id
of the root of T'' is at most the id of r. Thus T'' is grounded with respect to the stars of
Stag(T). By the definition of contribution, $cont(Stag(T), T'') \le 1$. Eliminating the ceiling
operations in Inequality (4.2), we obtain $cont(M(T) + T, T'') \le ht(T)/16 + 9/4$. Since
$ht(T) \ge 9$, it follows that $\rho \le 61/160 < 3/5$.
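The numeric claims of Case 5 can be verified by direct evaluation of Inequality (4.2). The sketch below is a check of the arithmetic only; the function names are illustrative, and the bound on the stagnant stars (0 for heights up to 8, otherwise 1) is taken from the argument above.

```python
from fractions import Fraction


def ceil_div(a, b):
    """Ceiling of a/b for positive integers."""
    return -(-a // b)


def bound_4_2(h):
    """Right-hand side of Inequality (4.2), evaluated step by step."""
    x = ceil_div(ceil_div(h, 2) + 1, 2)  # end of phase k
    x = ceil_div(x, 2)                   # shortcut before hooking in phase k+1
    return ceil_div(x + 1, 2)            # hooking, then one more shortcut


def closed_form_4_2(h):
    """Simplified form of (4.2), with the two inner ceilings merged."""
    return ceil_div(ceil_div(ceil_div(h, 2) + 1, 4) + 1, 2)


def rho_case5(h):
    """Upper bound on rho for a grounded core of height h >= 3."""
    stagnant = 0 if h <= 8 else 1
    return Fraction(bound_4_2(h) + stagnant, h + 1)


if __name__ == "__main__":
    print(max(rho_case5(h) for h in range(3, 2000)))
```

Because the exact ceilings are used rather than the relaxed bound ht(T)/16 + 9/4, the computed values for heights of at least 9 sit below the 61/160 stated in the text.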
Case 6: The height of T is at least 3, T is not grounded, and Cl(T) contains at least one
star. We first consider the case where Cl(T) has only one star.
If Cl(T) contains one star S, then $\Phi(T) + \Phi(Cl(T),T) = ht(T) + 1$. If S and T are
incorporated into a tree T'' of $F_{k+2}$, then the value of cont(S + T, T'') is greatest when S
adopts T in phase k+1, because then the algorithm performs fewer shortcut operations on the
tree that is formed. Thus,
$$cont(M(T) + T, T'') \le \lceil (\lceil\lceil\lceil ht(T)/2 \rceil /2 \rceil /2 \rceil + 1)/2 \rceil = \lceil (\lceil ht(T)/8 \rceil + 1)/2 \rceil. \tag{4.3}$$
If $3 \le ht(T) \le 8$, then by Inequality (4.3), cont(M(T) + T, T'') = 1. Since Cl(T)
contains only one star, $cont(Stag(T),T'') \le 1$. Since $ht(T) \ge 3$ and $\Phi(Cl(T),T) = 1$, it
follows that $\rho \le 1/2$.
Eliminating the ceiling operations in Inequality (4.3), we obtain
$cont(M(T) + T, T'') \le ht(T)/16 + 2$. Since $cont(Stag(T),T'') \le 1$ and $\Phi(Cl(T),T) = 1$, it
follows that if $ht(T) \ge 9$, then $\rho \le 57/160 < 3/5$.
If Cl(T) contains at least two stars, then $\Phi(T) + \Phi(Cl(T),T) = ht(T) + 2$. After at least
one shortcut operation followed by a hooking operation in which T is not grounded and at
least one more shortcut operation, at the end of phase k,
$$cont(M'(T) + T, T') \le \lceil (\lceil ht(T)/2 \rceil + 2)/2 \rceil.$$
After at least one more shortcut operation followed by another hooking operation in
which T' may not be grounded and then at least one more shortcut operation, at the end of
phase k+1,
$$cont(M(T) + T, T'') \le \lceil (\lceil\lceil (\lceil ht(T)/2 \rceil + 2)/2 \rceil /2 \rceil + 2)/2 \rceil = \lceil (\lceil (\lceil ht(T)/2 \rceil + 2)/4 \rceil + 2)/2 \rceil. \tag{4.4}$$
If $3 \le ht(T) \le 4$, then at the end of phase k, after at least two shortcut operations, every
vertex of T has a parent whose id is at most the id of r. After the hooking operation in phase
k+1, the parent of r is a vertex b, where $b \le$ min-id(Cl(T)). After at least one more
shortcut operation, at the end of phase k+1, every vertex of T has a parent whose id is at most
b. Vertex b cannot be the root of a star in Stag(T) because either b < min-id(Cl(T)) or b
is the root of a star that participated in an adoption. Since the id of the root of every star of
Stag(T) is larger than b, T'' is grounded with respect to the stars of Stag(T). Thus
$cont(Stag(T),T'') \le 1$. By Inequality (4.4), $cont(M(T) + T, T'') \le 2$. Since
$ht(T) + \Phi(Cl(T),T) \ge 5$, it follows that $\rho \le 3/5$.
If $5 \le ht(T) \le 12$, then by Inequality (4.4), $cont(M(T) + T, T'') \le 2$. Since
$cont(Stag(T),T'') = \Phi(Stag(T),T'') \le 2$ and $\Phi(Cl(T),T) = 2$, if $ht(T) \ge 5$, then
$\rho \le 4/7 < 3/5$.
Eliminating the ceiling operations in Inequality (4.4), we obtain
$cont(M(T) + T, T'') \le ht(T)/16 + 23/8$. Since $cont(Stag(T), T'') \le 2$ and $\Phi(Cl(T),T) = 2$, it
follows that if $ht(T) \ge 13$, then $\rho \le 91/240 < 3/5$.
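The thresholds used in Case 6 can be checked by evaluating Inequalities (4.3) and (4.4) directly. As before, this is only a numerical check of the stated arithmetic; the function names are illustrative, and the bounds on the contribution of the stagnant stars are those given in the argument above.

```python
from fractions import Fraction


def ceil_div(a, b):
    """Ceiling of a/b for positive integers."""
    return -(-a // b)


def bound_4_3(h):
    """Right-hand side of Inequality (4.3): a single star in the cluster."""
    return ceil_div(ceil_div(h, 8) + 1, 2)


def bound_4_4(h):
    """Right-hand side of Inequality (4.4): at least two stars in the cluster."""
    return ceil_div(ceil_div(ceil_div(h, 2) + 2, 4) + 2, 2)


def rho_one_star(h):
    """Bound on rho when Cl(T) has one star: stagnant contribution at most 1."""
    return Fraction(bound_4_3(h) + 1, h + 1)


def rho_many_stars(h):
    """Bound on rho when Cl(T) has at least two stars."""
    stagnant = 1 if h <= 4 else 2
    return Fraction(bound_4_4(h) + stagnant, h + 2)


if __name__ == "__main__":
    heights = range(3, 2000)
    print(max(rho_one_star(h) for h in heights))
    print(max(rho_many_stars(h) for h in heights))
```

The largest value occurs for a core of height 3 with at least two stars, which is where the constant 3/5 of Lemma 4.6.7 is attained.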
Case 7: T is a star and Cl(T) contains no stars. Since no stars are assigned to T, we call
T a free star. Since T is a star of $St(F_k)$, there is at least one internal star adjacent to T. By
Corollary 4.6.3, T participates in at least one adoption during phases k and k+1.
Since T is the core of a cluster, $\Phi(T) = 1$. If T is merged only with other free stars, then
the sum of the potentials of the free stars is the number of stars. Let Q be the set of free stars
of $F_k$ including T that are merged during phases k and k+1. We can partition the stars of Q
into subsets such that for each subset, one star S serves as the core of a cluster Cl(S) and the
remaining stars in the subset are the stars of Cl(S). Using the analysis for Cases 1 and 2, we
see that the contribution of the stars in each group to the tree T'' of $F_{k+2}$ that incorporates
them is at most half the sum of their potential.
If T is not merged with another free star, then suppose we consider T alone. At the end
of phase k+1, cont(T,T'') = 1. In this case, $\rho = 1$. But we want $\rho \le 3/5$. Let H be a set of
trees of Cl(T) + T from one of the Cases 1 through 6 above. We show that if T is added to
H, then for the augmented set of trees H + T, the sum of the contributions of the trees of
H + T to a tree of $F_{k+2}$ is at most 3/5 the sum of the potentials of those trees in $F_k$. We
consider four subcases.
Subcase 7.1: T adopts a vertex v that is the root of a tree $T_c$ adjacent to T. Since T is
free, T adopts no stars. Otherwise at least one star would have been assigned to T. Thus $T_c$
is a tall tree and the core of a cluster. We apply the analysis for Cases 3 and 6 in which $T_c$ is
the core of a cluster containing the stars $Cl(T_c) + T$ and $T_c$ is not grounded.
Subcase 7.2: T is adopted by a tree $T_c$ adjacent to T that is the core of a cluster. We
apply the analysis for the case in which $T_c$ is the core of a cluster containing the stars
$Cl(T_c) + T$. The addition of T has no effect on whether $T_c$ is grounded.
Subcase 7.3: T is adopted by a star S in phase k. Let $T_c$ be the tree for which
$S \in Cl(T_c)$. Since S is assigned to $T_c$, in phase k either S adopts $T_c$ or $T_c$ adopts S. We
consider the two subcases.
Subcase 7.3.1: S adopts $T_c$. Since T is a star, $ht(T) \le ht(T_c)$. It follows from the
analysis of the contribution in Cases 1 through 6 above that $cont(M(T_c) + T_c + T, T'') =
cont(M(T_c) + T_c, T'')$. Thus we can add T to the set of trees $Cl(T_c) + T_c$ and the analysis
remains valid.
Subcase 7.3.2: $T_c$ adopts S. Consider the sum of the contributions of the set of trees
$Cl(T_c) + T_c + T$ to a tree T'' of $F_{k+2}$. Define $\rho'$ to be
$$\rho' = \frac{cont(M(T_c) + T_c + T, T'') + cont(Stag(T_c), T'')}{\Phi(T_c) + \Phi(Cl(T_c),T_c) + \Phi(T)} = \frac{cont(M(T_c) + T_c + T, T'') + cont(Stag(T_c), T'')}{ht(T_c) + \Phi(Cl(T_c),T_c) + 1}.$$
We show that p' 3 1/2. Let rc be the root of 7C. First suppose that Tc is grounded. 
Since Tc adopts S and S adopts 7 in phase 6, trees 7C, S, and 7 are merged in series. After 
at least one shortcut operation followed by a hooking operation in which TC,S, and 7 are 
merged in series and at least one more shortcut operation, at the end of phase 6, 
cont(M'(Tc) + Tc + r , r ) s [ ( | W c ) / 2 ] +2)/2J. 
After at least one more shortcut operation followed by another hooking operation in 
which 7 ' is grounded and then at least one more shortcut operation, at the end of phase 6+1, 
cont(M(TC) + Tc + 7,7") < [([[([/if (7C)/2J + 2)/2J #] + 1)/2J (4.5) 
= [([([to(7c)/2|+2)/4]+l)/2J. 
If ht(Tc) <: 8, then by the analysis for Cases 1, 3, and 5, at the end of phase 6+1, every 
star of CI (Tc) is adopted, and thus cont (Stag (Tc), 7") = 0. 
If ht (Tc) < 4, then by Inequality (4.5), cont (M (Tc) + 7C + 7,7") = 1. Since ht (Tc) > 1, 
<D(C/(7C),7C) = 1, and ht(T) = 1, it follows that p' < 1/3. 
If 5<fa(7c)<8, then by Inequality (4.5), cont(M(Tc) + Tc +7,7")<2. Since 
ht (Tc) > 5,9(Cl (Tc), Tc) = 1, and ht (7) = 1, it follows that p' < 2/7 < 1/2. 
Eliminating the ceiling operations in Inequality (4.5), we obtain 
cont(M(Tc) + Tc +T,T") <ht(Tc)/\6+ 19/8. Since cont (Stag (TC),T")<\, 
<D(C/(7C),7C) = 1, and ht(T) = 1, it follows that if ht(Tc) > 9, then p' < 63/176 < 1/2. 
Now suppose that $T_c$ is not grounded. If $1 \le ht(T_c) \le 2$, then by the analysis for Cases 2
and 3, after the hooking operation in phase $k$, the parent of $r_c$ is a vertex $b$, where
$b < \text{min-id}(Cl(T_c))$. By Lemma 4.6.1, at the end of phase $k+1$, every star of $Cl(T_c)$ is
adopted. Thus, $M(T_c) = Cl(T_c)$ and $cont(Stag(T_c), T'') = 0$. Since $T_c$ adopts $S$ and $S$
adopts $T$ during the hooking operation in phase $k$, after at least three shortcut operations, at
the end of phase $k+1$, every vertex of $T$ has a parent whose id is at most $b$. Thus, as in the
analysis for Case 1, it follows that $cont(Cl(T_c) + T_c + T, T'') = 1$. Since $ht(T_c) \ge 1$,
$\Phi(Cl(T_c), T_c) \ge 1$, and $ht(T) = 1$, it follows that $\rho' \le 1/3$.
In phase $k$, trees $T_c$, $S$, and $T$ are merged in series. Since $T_c$ adopts $S$ and $T_c$ is not
grounded, $Cl(T_c)$ contains at least two stars. After at least one shortcut operation followed
by a hooking operation in which $T_c$ is not grounded and at least one more shortcut operation,
at the end of phase $k$,
$$cont(M'(T_c) + T_c + T, T') \le \lceil (\lceil ht(T_c)/2 \rceil + 3)/2 \rceil.$$
After at least one more shortcut operation followed by another hooking operation in
which $T'$ may not be grounded and then at least one more shortcut operation, at the end of
phase $k+1$,
$$cont(M(T_c) + T_c + T, T'') \le \lceil (\lceil \lceil (\lceil ht(T_c)/2 \rceil + 3)/2 \rceil /2 \rceil + 2)/2 \rceil = \lceil (\lceil (\lceil ht(T_c)/2 \rceil + 3)/4 \rceil + 2)/2 \rceil. \tag{4.6}$$
If $3 \le ht(T_c) \le 4$, then by the analysis of Case 6, $cont(Stag(T_c), T'') \le 1$. By Inequality
(4.6), $cont(M(T_c) + T_c + T, T'') \le 2$. Since $ht(T_c) \ge 3$, $\Phi(Cl(T_c), T_c) = 2$, and $ht(T) = 1$, it
follows that $\rho' \le 1/2$.
If $5 \le ht(T_c) \le 10$, then by Inequality (4.6), $cont(M(T_c) + T_c + T, T'') \le 2$. By the
definition of contribution, $cont(Stag(T_c), T'') \le 2$. Since $ht(T_c) \ge 5$, $\Phi(Cl(T_c), T_c) = 2$, and
$ht(T) = 1$, it follows that $\rho' \le 1/2$.
Finally, eliminating the ceiling operations in Inequality (4.6), we obtain
$cont(M(T_c) + T_c + T, T'') \le ht(T_c)/16 + 3$. Since $cont(Stag(T_c), T'') \le 2$,
$\Phi(Cl(T_c), T_c) = 2$, and $ht(T) = 1$, it follows that if $ht(T_c) \ge 11$, then $\rho' \le 13/32 < 1/2$.
Subcase 7.4: $T$ remains stagnant during phase $k$, and in phase $k+1$ tree $T$ is adopted by
a vertex $v$ of a tree $T_v$ not adjacent to $T$ in $F_k$. Let $T_c$ be the tree of $Co(F_k)$ for which $T_v$ is a
tree of $Cl(T_c) + T_c$. We show that $T$ can be added to the set of trees $Cl(T_c) + T_c$ without
increasing $\rho$.
Since $T_v$ is not adjacent to $T$, some tree $T'$ of $F_{k+1}$ incorporates $T_v$ and a tree adjacent to
$T$ in $F_k$. Otherwise, in phase $k+1$, vertex $v$ and the root of $T$ would not form an eligible pair.
Since $v$ is a vertex of $T_v$, and thus a vertex of $T'$, in phase $k+1$, tree $T'$ adopts $T$. In the
analysis for Cases 2 through 6, the determination of the contribution is valid for an arbitrary
number of stars adopted by $T'$. Thus, by adding $T$ to the set of trees $Cl(T_c) + T_c$, the
analysis still holds.
For Case 1, at the end of phase $k$, every star of $Cl(T_c)$ is adopted by $T_c$. Thus,
$cont(Cl(T_c) + T_c, T') = 1$. In phase $k+1$, during the hooking operation $T'$ adopts $T$. After at
least one shortcut operation, every vertex of $T$ has a parent whose id is at most the id of the
root of $T_c$. Thus, $cont(Cl(T_c) + T_c + T, T'') = 1$. By the analysis from Case 1, $\rho = 1/2$. This
completes the last subcase.
From these four subcases, we conclude that if $T$ is added to some set $H$ from one of the
Cases 1 to 6 above, then for the set of trees $H + T$, $\rho \le 3/5$. This completes Case 7.
Since every tree of Fk is included in one of the above cases, the sum of the contributions 
of the trees of Fk to trees of Ft+2 is at most (3/5) \fk. Since yk < <J>t, the result follows. • 
We give an example to illustrate the computation of $\rho$. Consider the cluster of Figure
4.2. Suppose in $F_k$ tree $T_{10}$ is the core of a cluster containing four stars and $T_{10}$ is not
grounded. Thus, $\Phi(T_{10}) + \Phi(Cl(T_{10}), T_{10}) = 3 + 2 = 5$.
In phase $k$, suppose every processor performs one shortcut operation before the hooking
operation. After the shortcut operation, the height of $T_{10}$ is reduced to 2. During the hooking
operation, processor $p(15,28)$ hooks vertex 10 to vertex 7, and processor $p(13,19)$ hooks
vertex 17 to vertex 11. Thus tree $T_7$ adopts tree $T_{10}$ and tree $T_{10}$ adopts tree $T_{17}$. Tree $T'$
incorporates trees $T_7$, $T_{10}$, and $T_{17}$.
After at least two shortcut operations, before the hooking operation in phase $k+1$, the
parent of every vertex of $T'$ is vertex 7. During the hooking operation in phase $k+1$, tree $T'$
is adopted by tree $T_1$ to form a new tree $T''$. Suppose tree $T_4$ remains stagnant. After at least
one shortcut operation, at the end of phase $k+1$, the height of $T''$ is 1. Thus
$cont(\{T_1, T_7, T_{10}, T_{17}\}, T'') = 1$ and $cont(\{T_4\}, T'') = 1$. It follows then that $\rho = 2/5$. This
concludes the example.
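The arithmetic of the example can be written out in one line. The display below is only a restatement, assuming (as in the definition of the ratio used in Subcase 7.3.2) that the ratio is the total contribution to the trees of the forest two phases later, divided by the total potential of the cluster at the start:

```latex
\[
\rho \;=\;
\frac{cont(\{T_1, T_7, T_{10}, T_{17}\}, T'') + cont(\{T_4\}, T'')}
     {\Phi(T_{10}) + \Phi(Cl(T_{10}), T_{10})}
\;=\; \frac{1 + 1}{3 + 2} \;=\; \frac{2}{5} \;<\; \frac{3}{5}.
\]
```

The value 2/5 lies below the 3/5 bound established for the cases of Lemma 4.6.7.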
A tree of $F_k$ is active if the tree participated in an adoption or was reduced in height by a
shortcut operation during phases $k$ and $k+1$. If $T''$ is a tree of $F_{k+2}$ that is not in $F_k$, then $T''$
incorporates one or more active trees of $F_k$. Call tree $T''$ a marked tree. Let $Stag(F_k)$ be the
stars of $F_k$ that remain stagnant during phases $k$ and $k+1$. Call a stagnant star of $Stag(F_k)$ an
unmarked star.
Let $Ma(F_{k+2})$ be the set of marked trees of $F_{k+2}$. Each tree $T''$ of $Ma(F_{k+2})$
incorporates trees from one or more of the subsets of trees considered in the various cases in
the proof of Lemma 4.6.7. Let $In(Co(F_k), T'')$ be the set of trees of $Co(F_k)$ incorporated
into $T''$. For a tree $T$ of $In(Co(F_k), T'')$, let $cont(M(T) + T, T'')$ be the contribution of the
set of trees comprising $T$ and the stars of $Cl(T)$ merged with $T$ to $T''$. By the definition of
contribution,
$$\sum_{T \in In(Co(F_k), T'')} cont(M(T) + T, T'') \ge ht(T''). \tag{4.7}$$
Define
$$\sigma_{k+2} = \sum_{T'' \in Ma(F_{k+2})} \Bigl( \sum_{T \in In(Co(F_k), T'')} \bigl( cont(M(T) + T, T'') + cont(Stag(T), T'') \bigr) \Bigr).$$
The value of $\sigma_{k+2}$ is the sum of the contributions of the trees of $F_k$ to the marked trees of
$F_{k+2}$. The unmarked trees of $F_{k+2}$ are stagnant stars of $F_k$, and thus their contribution to trees
of $F_{k+2}$ is included in $\sigma_{k+2}$.
Lemma 4.6.8: $\Phi(F_{k+2}) \le \sigma_{k+2}$.
Proof: Every tall tree of $F_{k+2}$ is formed as a result of a shortcut operation on a tall tree of $F_k$
or a shortcut operation on a tall tree created by merging trees of $F_k$ and is thus marked. If $S$
is a star of $Stag(F_k)$, then by Corollary 4.6.4, every tree in $F_k$ adjacent to $S$ is a tall tree.
Thus, every tall tree of $F_k$ adjacent to $S$ is incorporated into a marked tree in $F_{k+2}$. It follows
then that every tree adjacent to $S$ in $F_{k+2}$ is a marked tree. Recall that $\phi_{k+2}$ is the value of
$\Phi(F_{k+2})$. We now show that $\phi_{k+2} \le \sigma_{k+2}$.
We first consider the case in which every star of $St(F_{k+2})$ is marked. Then every tree of
$Co(F_{k+2})$ is marked because every tree of $Ta(F_{k+2})$ is marked. Since every tree of $Co(F_{k+2})$
is marked, we can directly compare $\sigma_{k+2}$ with $\Phi(F_{k+2})$.
By the definition of the potential function,
$$\Phi(Co(F_{k+2})) = \sum_{T'' \in Co(F_{k+2})} ht(T''). \tag{4.8}$$
It follows from Inequality (4.7) and Equation (4.8) that for the trees $T''$ that are the cores of
clusters in $F_{k+2}$,
By the definition of the potential function, $\Phi(Cl(F_{k+2}))$ is determined by a set $St(F_{k+2})$
of minimum cardinality and an assignment $A_{k+2}$ of stars of $Cl(F_{k+2})$ to trees of $Co(F_{k+2})$
such that $\sum_{T'' \in Co(F_{k+2})} \Phi(Cl(T''), T'')$ is maximum. If $S$ is an unmarked star, then since every
star of $St(F_{k+2})$ is marked, $S$ must be a star of $Cl(F_{k+2})$. Note that $Cl(F_{k+2})$ may also contain
marked stars. Suppose we first assign only the unmarked stars of $Cl(F_{k+2})$ to the trees of
$Co(F_{k+2})$ as given by $A_{k+2}$. Let $T''$ be a tree of $Co(F_{k+2})$. Then the stars of $Cl(T'')$ are
stagnant stars of $F_k$. Specifically,
Let $x$ be the sum of the potentials of the clusters that are formed. Then
$$x = \sum_{T'' \in Co(F_{k+2})} \Phi(Cl(T''), T'').$$
By the definition of contribution, $cont(Stag(T), T'') = \Phi(Stag(T), T'')$. But since stagnant
stars from more than one cluster of $F_k$ may be contained in the same cluster of $F_{k+2}$,
Now consider the marked stars of $Cl(F_{k+2})$. Let $Ma(Cl(F_{k+2}))$ be the set of marked
stars of $Cl(F_{k+2})$. Let $S$ be a star of $Ma(Cl(F_{k+2}))$. Suppose the assignment $A_{k+2}$ assigns $S$
to tree $T''$ of $Co(F_{k+2})$. Then the potential of the cluster containing the set of stars
$Cl(T'') + S$ is at most 1 greater than the potential of cluster $Cl(T'')$. After all stars of
$Ma(Cl(F_{k+2}))$ are added to their assigned clusters, then by the definition of potential,
$$\Phi(Cl(F_{k+2})) \le x + |Ma(Cl(F_{k+2}))|. \tag{4.11}$$
Since $S$ is a marked star, by the definition of contribution,
$$\sum_{T \in In(Co(F_k), S)} cont(M(T) + T, S) \ge ht(S) = 1. \tag{4.12}$$
By Inequality (4.12), the sum of the contributions of the trees of $F_k$ incorporated into
each star $S$ of $Ma(Cl(F_{k+2}))$ is at least 1. Thus, the sum of the contributions of trees of $F_k$ to
the stars of $Ma(Cl(F_{k+2}))$ is at least $|Ma(Cl(F_{k+2}))|$. It follows from Inequality (4.10) then
that
S 6 < «j M ,v 6 , „&, s r ' < M ( r ) + r - S ) ) a t + | M a < c ; ( F ^ ) l -
Finally, by Inequalities (4.7), (4.11), and (4.13), we conclude that $\sigma_{k+2} \ge \phi_{k+2}$.
Next we consider the case in which not every star of $St(F_{k+2})$ is marked. Let $W$ be the
set of stars of $Stag(F_k)$ that are stars of $St(F_{k+2})$. Let $Y$ be the set of stars of $Cl(F_{k+2})$
covered by the stars of $W$. Since the stars of $W$ were stagnant during phases $k$ and $k+1$, by
Corollary 4.6.3, no two stars of $W$ are adjacent. By Corollary 4.6.4, every star of $Y$ is
marked. Also, $|Y| \ge |W|$. This is because if $|Y| < |W|$, then since the stars of $Y$ cover
the stars of $W$, $St(F_{k+2})$ would not be a set of minimum cardinality.
To show that $\sigma_{k+2} \ge \Phi(F_{k+2})$, we introduce a modified potential function $\Phi'(F_{k+2})$
similar to $\Phi(F_{k+2})$, except that the sets $St(F_{k+2})$ and $Cl(F_{k+2})$ are modified as follows. Every
star of $W$ is moved from $St(F_{k+2})$ to $Cl(F_{k+2})$, and every star of $Y$ is moved from $Cl(F_{k+2})$ to
$St(F_{k+2})$. Then after this modification, in $\Phi'(F_{k+2})$, every star of $St(F_{k+2})$ is marked. From
the analysis of the case in which every star of $St(F_{k+2})$ is marked, we conclude that
$\sigma_{k+2} \ge \Phi'(F_{k+2})$.
Next, we show that $\Phi'(F_{k+2}) \ge \Phi(F_{k+2})$. In $\Phi(F_{k+2})$, for each star $S$ of $St(F_{k+2})$,
$\Phi(S) = 1$. Thus, the sum of the potentials of the stars of $W$ contained in $St(F_{k+2})$ is $|W|$. If
$Cl(S)$ is a set of stars contained in a cluster with core $S$, then the potential of the stars in
$Cl(S)$ is at most $|Cl(S)|$. Thus, the sum of the potentials of the stars of $Y$ contained in
clusters of $F_{k+2}$ is at most $|Y|$.
In $\Phi'(F_{k+2})$, since each star of $Y$ is a star of $St(F_{k+2})$, the sum of the potentials of the
stars of $Y$ is $|Y|$. Since $|Y| \ge |W|$ and every star of $W$ is adjacent to at least one star of $Y$,
in $\Phi'(F_{k+2})$, each star of $W$ may be assigned to a different star of $Y$. If each star of $W$
belongs to a cluster containing a single star, then the sum of the potentials of the stars of $W$ is
$|W|$. It follows that $\Phi'(F_{k+2}) \ge \Phi(F_{k+2})$. From these two cases, we conclude that
$\sigma_{k+2} \ge \Phi(F_{k+2})$. □
Theorem 4.6.9: Algorithm II requires at most $2 \log_{5/3} n$ phases.
Proof: Let $V(C)$ be the vertices of a connected component $C$. At the beginning of the
algorithm, every vertex of $V(C)$ is the root of a star consisting of a single vertex. Each star is
either the core of a star cluster or contained in a star cluster. The potential of each star that is
the core of a cluster is 1, and the sum of the potential of the stars in clusters is at most the
number of stars. Since the number of vertices in $V(C)$ is at most $n$, $\phi_1 \le n$.
By Lemmas 4.6.7 and 4.6.8, $\phi_{k+2} \le (3/5)\phi_k$ for every $k$. Thus, after every two phases,
the potential of the forest in PG that contains the vertices of $V(C)$ decreases by a factor of at
least 5/3. Every vertex of $V(C)$ is contained in a single star when the potential is 1. Since
$\phi_1 \le n$, Algorithm II requires at most $2 \log_{5/3} n$ phases.
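The bound of Theorem 4.6.9 can be checked numerically. The following is a minimal sketch, not part of the algorithm itself: it assumes only that the potential starts at n and shrinks by a factor of 3/5 every two phases, and the function names are hypothetical. Because the loop counts whole two-phase steps, its result can exceed the real-valued bound by less than two phases.

```python
import math

def phase_bound(n, shrink=3 / 5):
    """Real-valued bound 2 * log_{1/shrink}(n) on the number of phases."""
    return 2 * math.log(n) / math.log(1 / shrink)

def phases_by_iteration(n, shrink=3 / 5):
    """Count phases until a potential that starts at n, and is multiplied by
    `shrink` every two phases, has dropped to at most 1."""
    potential, phases = float(n), 0
    while potential > 1:
        potential *= shrink   # effect of two consecutive phases
        phases += 2
    return phases

for n in (16, 1024, 10**6):
    print(n, round(phase_bound(n), 2), phases_by_iteration(n))
```

Passing `shrink=4/7` gives the corresponding figures for a potential that shrinks by 4/7 every two phases.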
Theorem 4.6.10: Algorithm II requires $O(\log n)$ rounds.
Proof: The hooking and shortcut operations during each phase require $O(1)$ rounds. With an
increment primitive, the rolling synchronization for each phase requires one round. By
Theorem 4.6.9, the number of phases required by Algorithm II is $O(\log n)$. Thus, the total
number of rounds required by Algorithm II is $O(\log n)$. •
Algorithm II is much more efficient than a straightforward asynchronous
implementation of the synchronous algorithm. If each iteration of a synchronous algorithm
requires $i$ PRAM steps, then each phase of a naive asynchronous algorithm would require $i$
barrier synchronizations. The algorithm of Awerbuch and Shiloach (1987) requires $\log_{3/2} n$
iterations. Thus, the naive asynchronous algorithm would require a total of $i \log_{3/2} n$ barrier
synchronizations.
Since our algorithm uses only three rolling synchronizations per phase, the total number
of global synchronizations required by our algorithm is at most $6 \log_{5/3} n$. Thus the total
number of global synchronizations is reduced by a factor of $0.210i$. This is a significant
improvement since in each Part of the synchronous algorithm there is a test to determine
whether a tree is a star. The Star_Check procedure alone requires at least 5 PRAM steps.
Another improvement with our algorithm is that with rolling synchronization, after a
processor completes a hooking or shortcut operation during the current phase, if the
processor's endpoints have parents that are not the roots of trees, then the processor can make
further progress via shortcut operations without waiting.
4.7. APRAM Connected Components Algorithm III
In Algorithm II, during each phase, Parts 1 and 3 are shortcut operations and Part 2 is a
hooking operation. We can obtain a more efficient asynchronous algorithm by performing
the hooking and shortcut operations in a different order depending on the phase. In
Algorithm III, the order of operations depends on whether the phase is odd or even. A
variable phase is used to store the phase of the algorithm. In Algorithm III, we will list only
the operations.
APRAM Connected Components Algorithm III
Initialization
phase := 1;
while there are eligible pairs
    if (phase is odd) then
        Part 1: Shortcut operation;
        Part 2: Shortcut operation;
        rolling synchronization with counter;
        Part 3: Hooking operation;
        rolling synchronization with counter;
    else (* phase is even *)
        Part 1: Shortcut operation;
        rolling synchronization with counter;
        Part 2: Hooking operation;
        rolling synchronization with counter;
        Part 3: Shortcut operation;
        rolling synchronization with counter;
    endif
    phase := phase + 1;
endwhile
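The phase structure of the listing above can be tabulated mechanically. The sketch below is illustrative only: it models each phase as a list of operation names (the names and function signatures are hypothetical) and counts the rolling synchronizations, without performing any hooking or shortcut operation.

```python
def phase_schedule(phase):
    """Operations of one phase of Algorithm III, in order.

    'sync' stands for one rolling synchronization with counter.
    """
    if phase % 2 == 1:
        # Odd phase: two shortcuts share a single synchronization, then hook.
        return ["shortcut", "shortcut", "sync", "hook", "sync"]
    # Even phase: shortcut, hook, shortcut, each followed by a synchronization.
    return ["shortcut", "sync", "hook", "sync", "shortcut", "sync"]

def count_syncs(num_phases):
    """Total rolling synchronizations over phases 1..num_phases."""
    return sum(phase_schedule(k).count("sync") for k in range(1, num_phases + 1))

print(count_syncs(2))    # one odd phase plus one even phase
print(count_syncs(10))   # five odd phases plus five even phases
```

Each odd phase contributes two synchronizations and each even phase three, so every pair of consecutive phases costs five.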
In the odd phases, only one rolling synchronization is needed after Part 2 to determine
that every processor has performed at least two shortcut operations. We analyze the
performance of Algorithm III in the same manner as for Algorithm II. The improvement in
Algorithm III comes as a result of an improvement over Lemma 4.6.7.
Lemma 4.7.1: If $k$ is odd, then the sum of the contributions of trees of $F_k$ to trees of $F_{k+2}$ is
at most $(4/7)\phi_k$.
Proof: The proof is similar to the proof of Lemma 4.6.7. Let $T$, $M'(T)$, $T'$, $M''(T)$, $T''$,
$M(T)$, and $\rho$ be as in the proof of Lemma 4.6.7. We consider 6 cases. In each case, we show
that $\rho \le 4/7$.
Case 1: $T$ is a star and $Cl(T)$ contains at least one star. The analysis is essentially the
same as that for Cases 1 and 2 of Lemma 4.6.7. Thus, $\rho \le 1/2$.
Case 2: $T$ is a tall tree, $ht(T) \le 4$, and $Cl(T)$ contains at least one star. Since $k$ is odd,
the algorithm performs at least two shortcut operations before the hooking operation in phase
$k$. This case reduces to Case 1. Since $T$ is a tall tree, $\Phi(T) = ht(T) \ge 2$. Since $Cl(T)$
contains at least one star, $\Phi(Cl(T), T) \ge 1$. At the end of phase $k+1$,
$cont(Cl(T) + T, T'') = 1$. It follows that $\rho \le 1/3$.
Case 3: $T$ is a tall tree and $Cl(T)$ contains no stars. The analysis is the same as for
Case 4 in the proof of Lemma 4.6.7. Thus, $\rho \le 1/2$.
Case 4: The height of $T$ is at least 5, $T$ is grounded, and $Cl(T)$ contains at least one
star. After at least two shortcut operations followed by a hooking operation in which $T$ is
grounded, at the end of phase $k$,
$$cont(M'(T) + T, T') \le \lceil \lceil ht(T)/2 \rceil /2 \rceil + 1.$$
After at least one more shortcut operation followed by another hooking operation in
which $T$ is grounded and at least one more shortcut operation, at the end of phase $k+1$,
$$cont(M(T) + T, T'') \le \lceil (\lceil (\lceil \lceil ht(T)/2 \rceil /2 \rceil + 1)/2 \rceil + 1)/2 \rceil = \lceil (\lceil (\lceil ht(T)/4 \rceil + 1)/2 \rceil + 1)/2 \rceil. \tag{4.14}$$
If $5 \le ht(T) \le 20$, then by Inequality (4.14), $cont(M(T) + T, T'') \le 2$. Since $T$ is
grounded, $cont(Stag(T), T'') \le 1$. Since $ht(T) \ge 5$ and $\Phi(Cl(T), T) = 1$, it follows that
$\rho \le 1/2$.
Eliminating the ceiling operations in Inequality (4.14), we obtain
$cont(M(T) + T, T'') \le ht(T)/16 + 5/2$. Since $T''$ is grounded with respect to the stagnant
stars of $Cl(T)$, if any, $cont(Stag(T), T'') \le 1$. It follows that if $ht(T) \ge 21$, then since
$\Phi(Cl(T), T) = 1$, $\rho \le 7/32 < 4/7$.
Case 5: The height of $T$ is at least 5, $T$ is not grounded, and $Cl(T)$ contains at least one
star. If $ht(T) \ge 5$, then stars of $Cl(T)$ may remain stagnant during phases $k$ and $k+1$.
If $Cl(T)$ contains one star, then as in Case 6 of the proof of Lemma 4.6.7,
$$cont(M(T) + T, T'') \le \lceil (\lceil ht(T)/8 \rceil + 1)/2 \rceil. \tag{4.15}$$
If $5 \le ht(T) \le 8$, then by Inequality (4.15), $cont(M(T) + T, T'') = 1$. Since $Cl(T)$
contains only one star, $cont(Stag(T), T'') \le 1$. Since $ht(T) \ge 5$ and $\Phi(Cl(T), T) = 1$, it
follows that $\rho \le 1/3$.
Eliminating the ceiling operations in Inequality (4.15), we obtain
$cont(M(T) + T, T'') \le ht(T)/16 + 2$. Since $cont(Stag(T), T'') \le 1$, it follows that if
$ht(T) \ge 9$, then $\rho \le 57/160 < 4/7$.
If $Cl(T)$ contains at least two stars, then after at least two shortcut operations followed
by a hooking operation in which $T$ is not grounded, at the end of phase $k$,
$$cont(M'(T) + T, T') \le \lceil \lceil ht(T)/2 \rceil /2 \rceil + 2.$$
After at least one more shortcut operation followed by another hooking operation in
which $T'$ may not be grounded and at least one more shortcut operation, at the end of phase $k+1$,
$$cont(M(T) + T, T'') \le \lceil (\lceil (\lceil \lceil ht(T)/2 \rceil /2 \rceil + 2)/2 \rceil + 2)/2 \rceil = \lceil (\lceil ht(T)/8 \rceil + 1)/2 \rceil + 1. \tag{4.16}$$
If $5 \le ht(T) \le 8$, then by Inequality (4.16), at the end of phase $k+1$,
$cont(M(T) + T, T'') \le 2$. Since some stars of $Cl(T)$ may be stagnant and $T''$ might not be
grounded with respect to the stars of $Stag(T)$, $cont(Stag(T), T'') \le 2$. It follows that if
$ht(T) \ge 5$ and $\Phi(Cl(T), T) = 2$, then $\rho \le 4/7$.
Eliminating the ceiling operations in Inequality (4.16), we obtain
$cont(M(T) + T, T'') \le ht(T)/16 + 3$. Since $cont(Stag(T), T'') \le 2$, it follows that if
$ht(T) \ge 9$ and $\Phi(Cl(T), T) = 2$, then $\rho \le 89/176 < 4/7$.
Case 6: $T$ is a free star. This is similar to Case 7 in the proof of Lemma 4.6.7. The
analysis for this case is the same as the analysis for Case 7 of Lemma 4.6.7 with the exception
of Subcase 7.3.2. Thus we consider only that subcase and show that $\rho' \le 1/2 < 4/7$.
$T$ is adopted by a star $S$ in phase $k$. Let $T_c$ be the tree for which $S \in Cl(T_c)$. In phase
$k$, tree $T_c$ adopts $S$. Let $r_c$ be the root of $T_c$.
First suppose that $T_c$ is grounded. After at least two shortcut operations followed by a
hooking operation in which $T_c$, $S$, and $T$ are merged in series, at the end of phase $k$,
$$cont(M'(T_c) + T_c + T, T') \le \lceil \lceil ht(T_c)/2 \rceil /2 \rceil + 2.$$
If $ht(T_c) \le 4$, then every star of $Cl(T_c)$ is adopted in phase $k$. Thus, after at least two
more shortcut operations, at the end of phase $k+1$,
$$cont(M(T_c) + T_c + T, T'') \le \lceil \lceil (\lceil \lceil ht(T_c)/2 \rceil /2 \rceil + 2)/2 \rceil /2 \rceil = \lceil (\lceil ht(T_c)/8 \rceil + 1)/2 \rceil. \tag{4.17}$$
If $ht(T_c) \le 4$, then by Inequality (4.17), $cont(M(T_c) + T_c + T, T'') = 1$. Since every star
of $Cl(T_c)$ is adopted in phase $k$, $cont(Stag(T_c), T'') = 0$. Since $ht(T_c) \ge 1$,
$\Phi(Cl(T_c), T_c) = 1$, and $ht(T) = 1$, it follows that $\rho' \le 1/3$.
If $ht(T_c) \ge 5$, then after at least one shortcut operation followed by a hooking operation
in which $T'$ is grounded and then at least one more shortcut operation, at the end of phase
$k+1$,
$$cont(M(T_c) + T_c + T, T'') \le \lceil (\lceil (\lceil \lceil ht(T_c)/2 \rceil /2 \rceil + 2)/2 \rceil + 1)/2 \rceil = \lceil ht(T_c)/16 \rceil + 1. \tag{4.18}$$
If $ht(T_c) \le 8$, then as in Subcase 7.3.2, at the end of phase $k+1$, every star of $Cl(T_c)$ is
adopted, and thus $cont(Stag(T_c), T'') = 0$. If $5 \le ht(T_c) \le 8$, then by Inequality (4.18),
$cont(M(T_c) + T_c + T, T'') \le 2$. Since $ht(T_c) \ge 5$, $\Phi(Cl(T_c), T_c) = 1$, and $ht(T) = 1$, it
follows that $\rho' \le 2/7$.
Eliminating the ceiling operations in Inequality (4.18), we obtain
$cont(M(T_c) + T_c + T, T'') \le ht(T_c)/16 + 2$. Since $T_c$ is grounded, $cont(Stag(T_c), T'') \le 1$.
Since $\Phi(Cl(T_c), T_c) = 1$ and $ht(T) = 1$, it follows that if $ht(T_c) \ge 9$, then $\rho' \le 57/176 < 1/2$.
Now suppose that $T_c$ is not grounded. After at least two shortcut operations followed by
a hooking operation in which $T_c$ is not grounded, at the end of phase $k$,
$$cont(M'(T_c) + T_c + T, T') \le \lceil \lceil ht(T_c)/2 \rceil /2 \rceil + 3.$$
If $ht(T_c) \le 4$, then after the hooking operation in phase $k$, the parent of $r_c$ is a vertex $b$,
where $b < \text{min-id}(Cl(T_c))$. By Lemma 4.6.1, at the end of phase $k+1$, every star of $Cl(T_c)$ is
adopted. Thus, $M(T_c) = Cl(T_c)$ and $cont(Stag(T_c), T'') = 0$. Since $T_c$ adopts $S$ and $S$
adopts $T$ during the hooking operation in phase $k$, after at least two shortcut operations, at the
end of phase $k+1$, every vertex of $T$ has a parent whose id is at most $b$. Thus,
$cont(Cl(T_c) + T_c + T, T'') = 1$. Since $ht(T_c) \ge 1$, $\Phi(Cl(T_c), T_c) \ge 1$, and $ht(T) = 1$, it
follows that $\rho' \le 1/3$.
If $5 \le ht(T_c) \le 8$, then before the hooking operation in phase $k+1$, every vertex of $T_c$
has a parent whose id is at most the id of $r_c$. Because $T'$ incorporates $T$, if $T'$ is merged only
with stars of $Cl(T_c)$, then the height of the tallest tree that can be formed is at most
$ht(T') + 1$. Thus, after at least one more shortcut operation followed by another hooking
operation in which $T'$ may not be grounded and then at least one more shortcut operation, at
the end of phase $k+1$,
$$cont(M(T_c) + T_c + T, T'') \le \lceil (\lceil (\lceil \lceil ht(T_c)/2 \rceil /2 \rceil + 3)/2 \rceil + 1)/2 \rceil = \lceil (\lceil (\lceil ht(T_c)/4 \rceil + 3)/2 \rceil + 1)/2 \rceil. \tag{4.19}$$
If $5 \le ht(T_c) \le 8$, then by Inequality (4.19), $cont(M(T_c) + T_c + T, T'') \le 2$. Since $T_c$ is
not grounded, $cont(Stag(T_c), T'') \le 2$. Since $ht(T_c) \ge 5$, $\Phi(Cl(T_c), T_c) = 2$, and $ht(T) = 1$, it
follows that $\rho' \le 1/2$.
If $ht(T_c) > 8$, then after at least one shortcut operation followed by another hooking
operation in which $T'$ may not be grounded and then at least one more shortcut operation, at
the end of phase $k+1$,
$$cont(M(T_c) + T_c + T, T'') \le \lceil (\lceil (\lceil \lceil ht(T_c)/2 \rceil /2 \rceil + 3)/2 \rceil + 2)/2 \rceil = \lceil (\lceil ht(T_c)/4 \rceil + 3)/4 \rceil + 1. \tag{4.20}$$
Eliminating the ceiling operations in Inequality (4.20), we obtain
$cont(M(T_c) + T_c + T, T'') \le ht(T_c)/16 + 3$. Since $cont(Stag(T_c), T'') \le 2$,
$\Phi(Cl(T_c), T_c) = 2$, and $ht(T) = 1$, it follows that if $ht(T_c) \ge 9$, then $\rho' \le 89/192 < 1/2$. This
completes the analysis for the subcase.
From these 6 cases, we conclude that the sum of the contributions of the trees of $F_k$ to
the trees of $F_{k+2}$ is at most $(4/7)\phi_k$. •
The remainder of the analysis for Algorithm III is the same as for Algorithm II. As a
result of Lemma 4.7.1, we conclude the following.
Theorem 4.7.2: Algorithm III requires at most $2 \log_{7/4} n$ phases.
Algorithm III reduces the number of rolling synchronizations that are required by
Algorithm II in two ways. We have already shown that Algorithm III requires fewer phases.
In addition, Algorithm III uses only five rolling synchronizations for every two consecutive
phases whereas Algorithm II uses six. The total number of global synchronizations required
by Algorithm III is at most $5 \log_{7/4} n$. Compared with the naive asynchronous algorithm, the
number of global synchronizations is reduced by a factor of $0.276i$, and compared with
Algorithm II, the number of global synchronizations is reduced by a factor of 1.31.
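The reduction factors quoted for Algorithms II and III follow from a change of logarithm base and do not depend on n. The sketch below recomputes them from the three synchronization counts stated in the text; the function names and the sample values of n and i are hypothetical and chosen only for illustration.

```python
import math

def log_base(base, n):
    """Logarithm of n to the given base."""
    return math.log(n) / math.log(base)

def naive_syncs(n, i):
    """Barrier synchronizations of the naive algorithm: i * log_{3/2} n."""
    return i * log_base(3 / 2, n)

def alg2_syncs(n):
    """Bound for Algorithm II: 6 * log_{5/3} n."""
    return 6 * log_base(5 / 3, n)

def alg3_syncs(n):
    """Bound for Algorithm III: 5 * log_{7/4} n."""
    return 5 * log_base(7 / 4, n)

n, i = 10**6, 5   # arbitrary sample values; the ratios below are independent of n
print(round(naive_syncs(n, i) / (i * alg2_syncs(n)), 3))  # coefficient of i, Algorithm II
print(round(naive_syncs(n, i) / (i * alg3_syncs(n)), 3))  # coefficient of i, Algorithm III
print(round(alg2_syncs(n) / alg3_syncs(n), 2))            # Algorithm II versus III
```

The first two printed values are the coefficients of i in the stated reduction factors, and the third is the ratio between the bounds for Algorithms II and III.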
4.8. APRAM Spanning Forest Algorithms 
4.8.1. Spanning forest algorithms 
We can modify Algorithms II and III to compute a spanning forest of $G$ by keeping
track of the processors that merge trees. Each processor corresponds to an edge of $G$ that
joins two trees. Let $r$ be the root of a tree $T$. If during a phase processor $p$ hooks $r$ to a tree
that adopts $T$, then the edge of $G$ corresponding to $p$ is an edge of a spanning forest. Since $p$
last updated $P(r)$, $p$ is associated with $r$.
During a phase, $T$ may be hooked more than once. In the asynchronous algorithm, $P(r)$
and the name of the processor that performed the hooking operation must be updated
atomically. Otherwise, the algorithm might not find a spanning forest, as follows.
Suppose $p$ is the processor that last hooked $r$. If $P(r)$ and the name of the processor
that hooked $r$ are updated separately, then since the algorithm is asynchronous, some
processor $q$ that hooks $r$ before $p$ may store its name after $p$ does. The edge corresponding
to $q$, however, may not be an edge between the trees merged by $p$. Note that in the
synchronous algorithm the name of the processor is not stored until the trees have been
adopted. In the asynchronous algorithm, without synchronization, a processor does not know
whether it is the last processor to hook a tree.
During a hooking operation, the parent of $r$ and the name of the processor that performs
the hooking are updated simultaneously in $P(r)$ by incorporating the id of the parent and the
name of the processor into a composite value. If processor $p$ hooks $r$ to a vertex $a$, then the
algorithm sets $P(r) := a.p$. Initially, the algorithm sets $P(v) := v.0$ for each vertex $v \in V$.
If $r$ is hooked to a vertex during a phase, then the last time $r$ is hooked during the phase,
$T$ is adopted. The processor $p$ that performs the last hooking operation corresponds to an
edge of a spanning forest of $G$. Processor $p$ does not hook any other vertices during the
remainder of the algorithm since the endpoints of the edge assigned to $p$ are contained in the
same tree in PG. Also, after $T$ is adopted, the name of the processor in $P(r)$ should not
change. Thus, in the shortcut operations, only the id of the new parent of $r$ is changed in
$P(r)$.
After the vertices of each connected component are contained in a single star, if $v$ is not
the root of a tree, then the name of the processor in $P(v)$ corresponds to an edge of a
spanning forest of $G$. Since each nonroot vertex is adopted exactly once, each nonroot vertex
is associated with one edge of the spanning forest.
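One way to realize the composite value a.p is to pack both fields into a single machine word, so that one write updates them together. The sketch below is an assumption-laden illustration rather than the thesis's implementation: the field width `PROC_BITS`, the helper names, and the use of processor name 0 for "no processor yet" (mirroring the initial value v.0) are all choices made here, and the code runs sequentially, so it shows the data layout only.

```python
PROC_BITS = 32  # hypothetical width of the processor-name field

def pack(parent, proc):
    """Combine a parent id and a processor name into one composite word."""
    return (parent << PROC_BITS) | proc

def unpack(word):
    """Split a composite word into (parent id, processor name)."""
    return word >> PROC_BITS, word & ((1 << PROC_BITS) - 1)

def hook(P, r, a, p):
    """Processor p hooks root r to vertex a; both fields change in ONE write."""
    P[r] = pack(a, p)

def shortcut(P, v):
    """Replace v's parent by its grandparent, keeping the stored processor name."""
    parent, proc = unpack(P[v])
    grandparent, _ = unpack(P[parent])
    P[v] = pack(grandparent, proc)

# Four vertices, each initially its own parent with processor name 0.
P = [pack(v, 0) for v in range(4)]
hook(P, 3, 2, 7)   # processor 7 hooks vertex 3 to vertex 2
hook(P, 2, 1, 5)   # processor 5 hooks vertex 2 to vertex 1
shortcut(P, 3)     # vertex 3 now points to vertex 1, still tagged with processor 7
print([unpack(w) for w in P])
```

Because a later hook overwrites the whole word, the processor name stored with a root is always the one that wrote the current parent, which is the property the atomic update is meant to guarantee.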
4.8.2. Minimum spanning forest algorithms 
Suppose G is a graph with weighted edges, and without loss of generality, suppose the 
weights of the edges of G are unique. We give an APRAM algorithm for finding the 
minimum spanning forest (MSF) of G. The APRAM MSF algorithm may be obtained by 
modifying either Algorithm II or III. 
To find the MSF of G, during each hooking operation, only the processor that 
corresponds to the edge of G of least weight leaving a partial component performs a hooking 
operation. To determine the edge of minimum weight, however, requires the MSF algorithm 
to first reduce each partial component to a star through shortcut operations. 
Let $T$ be a tree of PG with root $r$. The MSF algorithm maintains a variable $Min(r)$ to
keep track of the edge of least weight that leaves $T$. Let $e$ be an edge $(u, v)$ of $G$ such that $u$
is a vertex of $T$ and $v$ is not. When $u$ becomes a child of $r$ through shortcut operations, then
the processor corresponding to $e$ executes a replace-min primitive on $Min(r)$ with the
weight of $e$. When $T$ becomes a star, then $Min(r)$ contains the weight of the edge of least
weight leaving $T$. The processor corresponding to that edge then performs a hooking
operation to merge $T$ with another tree. Note that $T$ cannot be hooked until it is a star. We
keep track of the processors that perform the hooking operations through a variable $Proc(v)$
for each vertex $v$.
The root $r$ of $T$ has a variable $Count(r)$ to maintain a count of the number of children it
has in PG. When a vertex $v$ of $T$ becomes a child of $r$ through a shortcut operation, then the
processor performing the shortcut operation atomically increments $Count(r)$ using the
increment primitive. $Count(r)$ increases after each shortcut operation until $T$ becomes a
star. The algorithm determines that $T$ is a star if after a shortcut operation $Count(r)$ does not
increase. A barrier synchronization is required after each shortcut operation to ensure that
every processor has had an opportunity to update $Count(r)$.
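The star test based on the child count can be illustrated with a small sequential simulation. This is a sketch under simplifying assumptions: every vertex performs its shortcut in lockstep (standing in for the barrier after each shortcut operation), and the count is recomputed from the parent array instead of being maintained with the increment primitive. All names are hypothetical.

```python
def shortcut_all(parent):
    """One lockstep shortcut: every vertex adopts its grandparent as parent."""
    return [parent[parent[v]] for v in range(len(parent))]

def children_of(parent, root):
    """Number of non-root vertices whose parent is the root."""
    return sum(1 for v, p in enumerate(parent) if p == root and v != root)

def shortcuts_until_star_detected(parent, root):
    """Shortcut repeatedly; stop at the first shortcut after which the
    root's child count did not increase, which signals that the tree is a star."""
    count = children_of(parent, root)
    shortcuts = 0
    while True:
        parent = shortcut_all(parent)
        shortcuts += 1
        new_count = children_of(parent, root)
        if new_count == count:
            return shortcuts, parent
        count = new_count

chain = [0, 0, 1, 2, 3]   # a path 4 -> 3 -> 2 -> 1 -> 0 rooted at vertex 0
shortcuts, final_parent = shortcuts_until_star_detected(chain, 0)
print(shortcuts, final_parent)
```

In this model the detection costs one shortcut beyond the one that actually produces the star, since the unchanged count is only observed on the following round.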
The analysis for the APRAM MSF Algorithm is similar to the analysis for Algorithms II
and III. The main difference is that the stars of a cluster cannot be merged with the core of
the cluster until the core has also become a star.
It can be shown that after every two phases the potential of the forest in PG that
contains the vertices that belong to a connected component of $G$ decreases by at least a
constant factor. Thus, the number of phases required by the algorithm is $O(\log n)$. Since
each phase requires $O(1)$ rounds, the total number of rounds needed to compute the MSF of
$G$ is $O(\log n)$.
4.9. Conclusions 
We have presented three APRAM algorithms for finding the connected components of a
graph. Algorithm I runs on a basic APRAM and requires $O(n \log n)$ rounds. Algorithms II
and III run on an APRAM with increment and replace-min primitives and require $O(\log n)$
rounds. All three algorithms use $m + n$ processors.
Algorithm III requires fewer phases and global synchronizations than Algorithm II and
is thus more efficient. One area of possible improvement is to find a more elaborate potential
function and lower the bound on the number of phases required by Algorithms II and III.
Another area of possible improvement is to reduce the number of global synchronizations that
are required.
CHAPTER 5.
AN ASYNCHRONOUS PARALLEL ALGORITHM FOR 
COMPUTING BICONNECTED COMPONENTS 
5.1. Introduction 
Let G = (V, E) be a connected graph with the set of vertices V and the set of edges E. 
Let n = \V\ and m= \E\. A vertex v is an articulation point if the removal of v and the 
edges incident on v leaves the remaining graph disconnected. A graph with no articulation 
points is biconnected. A maximal biconnected subgraph of G is a biconnected component of 
G. 
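The definition of an articulation point can be checked directly by deleting each vertex in turn and testing connectivity. The brute-force sketch below does exactly that; it is meant only to make the definition concrete, is far slower than the algorithms discussed in this chapter, and its function names and example graph are hypothetical.

```python
from collections import deque

def is_connected(vertices, adj):
    """Breadth-first search restricted to `vertices`; True if all are reachable."""
    vertices = set(vertices)
    if not vertices:
        return True
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in vertices and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == vertices

def articulation_points(adj):
    """Vertices whose removal disconnects the (connected) graph `adj`."""
    verts = set(adj)
    return sorted(v for v in verts if not is_connected(verts - {v}, adj))

# Two triangles {0, 1, 2} and {2, 3, 4} that share vertex 2.
bowtie = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
print(articulation_points(bowtie))
```

In this example the shared vertex is the only articulation point, and each triangle is a biconnected component.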
Finding the biconnected components of a graph is a fundamental problem of graph 
theory. Some practical applications of finding the biconnected components of a graph include 
determining the reliability of communication and transportation networks. 
Tarjan (1972) gave a sequential algorithm that uses depth-first search to find the
articulation points and the biconnected components of a graph in $O(m+n)$ time. Savage and
Ja'Ja' (1981) presented a parallel biconnected components algorithm for the CREW PRAM
that runs in $O(\log^2 n)$ time using $O(n^3/\log n)$ processors. Tsin and Chin (1984) presented an
improved CREW PRAM algorithm that runs in $O(\log^2 n)$ time using $n^2/\log^2 n$ processors.
Their algorithm is optimal for dense graphs.
Tarjan and Vishkin (1985) designed a biconnected components algorithm by reducing
the biconnected components problem to a connected components problem. They presented
an algorithm for the Arbitrary CRCW PRAM that runs in $O(\log n)$ time using $O(m+n)$
processors. They showed that their algorithm also runs on a CREW PRAM in $O(\log^2 n)$ time
using the same number of processors. Then they gave an alternative implementation of their
algorithm that runs on a CREW PRAM in $O(n^2/p)$ time using any number $p \le n^2/\log^2 n$ of
processors. Using the reduction idea of Tarjan and Vishkin, Cole and Vishkin (1986b)
designed a biconnected components algorithm for the Arbitrary CRCW PRAM that runs in
$O(\log n)$ time using $(m+n)\,\alpha(m,n)$ processors, where $\alpha(m,n)$ is the inverse Ackermann
function.
In this chapter, we present an Asynchronous PRAM (APRAM) algorithm for computing 
the biconnected components of a graph. Various APRAM models have been defined by 
Gibbons (1989), Martel et al. (1989, 1990), Cole and Zajicek (1989, 1990), and Nishimura 
(1990). We use an APRAM with limited read-modify-write primitives (Wu, 1991). Our 
APRAM algorithm, which is based on the CRCW PRAM algorithm of Tarjan and Vishkin, 
requires $O(\log n)$ rounds using $O(m+n)$ processors.
In Section 5.2, we briefly review our APRAM model. In Section 5.3, we describe the 
synchronous biconnected components algorithm of Tarjan and Vishkin. In Section 5.4, we 
present our APRAM algorithm. 
5.2. Model of Computation 
The APRAM consists of a set of reliable processors, each with its own local memory, 
and a global shared memory through which processors communicate with one another. There 
is no global clock that synchronizes the processors, and access to the shared memory is also 
asynchronous. Each step of a processor of the APRAM consists of three stages. In the first 
stage, a processor may read a value from one shared memory cell into local memory. In the 
second stage, a processor may perform a local computation on values in its local memory. In 
the third stage, a processor may write a value from local memory into one shared memory 
cell. At most one memory operation may be performed on a cell of the shared memory at any 
time; hence, each read operation and write operation is atomic. Thus, there is no ambiguity 
about the value that is read or written. 
In addition, the APRAM has read-modify-write primitives. A processor executing a 
read-modify-write primitive atomically reads a value v from a cell c of the shared memory 
into local memory, performs a local computation that may depend on v or some other values 
stored in local memory, and then writes a value into cell c. The only read-modify-write 
primitives required by our algorithm are replace-min and increment. 
An execution of an APRAM algorithm consists of a finite sequence of arbitrarily 
interleaved stages of the steps taken by all of the processors of the APRAM, with the 
restriction that when a processor executes a read-modify-write primitive, the stages of the 
step are atomic with respect to the memory cell that is accessed. We partition the execution 
of the algorithm into rounds, where a round is a minimal sequence of stages such that every 
processor takes at least one complete step. We measure the running time of our algorithm in 
rounds. For PRAM algorithms, the number of rounds is equal to the running time. 
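The two read-modify-write primitives can be illustrated with a small simulation. The sketch below is not part of the model definition: it assumes that replace-min stores the smaller of the current cell value and the offered value, it uses a lock to stand in for the atomicity of the model, and the names SharedCell and run_processors are hypothetical.

```python
import threading


class SharedCell:
    """One shared-memory cell; a lock stands in for the model's atomicity."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def replace_min(self, candidate):
        """Atomically keep the smaller of the stored value and `candidate`."""
        with self._lock:
            old = self._value              # read stage
            if candidate < old:            # local computation
                self._value = candidate    # write stage
            return old

    def increment(self):
        """Atomically add one to the stored value and return the old value."""
        with self._lock:
            old = self._value
            self._value = old + 1
            return old


def run_processors(values):
    """Each simulated processor offers one value, then counts itself."""
    lowest = SharedCell(float("inf"))
    arrived = SharedCell(0)

    def step(v):
        lowest.replace_min(v)
        arrived.increment()

    threads = [threading.Thread(target=step, args=(v,)) for v in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lowest.read(), arrived.read()
```

Whatever the interleaving of the threads, the first cell ends up holding the minimum offered value and the second the number of processors, because each primitive acts on its cell as a single indivisible step.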
5.3. The Synchronous Biconnected Components Algorithm 
Tarjan and Vishkin (1985) presented an efficient parallel algorithm for finding the 
biconnected components of a graph. Since our algorithm is based on their algorithm, we first 
describe their approach. Their main idea is to reduce the problem of finding the biconnected 
components of a graph G to finding the connected components of an auxiliary graph G', 
where the connected components of G' correspond to the biconnected components of G. A 
similar approach was independently discovered by Tsin and Chin (1984). The auxiliary 
graph of Tsin and Chin contains many more edges, however, and thus their algorithm is less 
efficient. 
Let $e_1$ and $e_2$ be two arbitrary edges of G. Define R to be a relation on the edges of G
such that $e_1 \mathrel{R} e_2$ if and only if either $e_1 = e_2$, or $e_1$ and $e_2$ are on a common simple cycle of
G. Harary (1969) showed that R is an equivalence relation. The subgraphs of G induced by 
the equivalence classes of R are the biconnected components of G. An edge in a singleton 
equivalence class is a bridge of G. If e is an edge that is a bridge of G, then removing e 
disconnects G. 
Tarjan and Vishkin define an auxiliary graph G' of G such that connected components
of G' correspond to biconnected components of G. We describe how to derive G' from G.
Let T be a rooted spanning tree of G. Let v → w denote an edge of T where v is the parent
of w. Let P(w) denote the parent of w in T. Number the vertices of T in preorder from 1 to
n and let the number of each vertex be its id. We will refer to vertices by their ids. The
notation u < v means that the id of vertex u is smaller than the id of vertex v.
The vertices of G' are the sets {u, v} such that (u, v) is an edge of G. Each edge of G'
has one of the following forms:
Case 1: ({u,w}, {v,w}), where u → w is an edge of T and (v,w) is an edge of G - T
such that v < w.
Case 2: ({u,v}, {x,w}), where u → v and x → w are edges of T and (v,w) is an edge
of G - T such that neither v nor w is an ancestor of the other in T.
Case 3: ({u,v}, {v,w}), where u → v and v → w are edges of T and some edge of G
joins a descendant of w in T with a vertex x such that x is not a descendant of v in T.
Tarjan and Vishkin proved the following theorem relating G and G'. 
Theorem 5.3.1: (Tarjan and Vishkin, 1985) Two edges of G are in the same biconnected 
component of G if and only if their corresponding vertices are in the same connected 
component of G'. 
Each edge e of G - T defines a simple cycle in G consisting of e and the path in T
joining the endpoints of e. Thus, all edges on this cycle belong to the same biconnected 
component. The edges of G' are such that the vertices of G' corresponding to the edges of a 
simple cycle in G are contained in the same connected component in G'. 
We now outline the parts of the biconnected components algorithm of Tarjan and 
Vishkin. We then describe their parallel implementation. 
Part 1: Find a spanning tree T of G. Number the vertices of T from 1 to n in preorder.
Compute the number of descendants nd(v) of each vertex v, where v is counted as a
descendant of itself.
Part 2: For each vertex v, compute low(v) (resp. high(v)), the lowest (resp. highest) vertex
that is either a descendant of v or adjacent to a descendant of v by an edge of G - T.
Part 3: Construct the graph G'', the subgraph of G' induced by the edges of T, as
follows. For each edge (w,v) in G - T such that v + nd(v) ≤ w, add the edge ({P(v),v},
{P(w),w}) to G''. These are the edges of Case 2. For each edge v → w of T such that
v ≠ 1, add the edge ({P(v),v}, {v,w}) to G'' if low(w) < v or high(w) ≥ v + nd(v).
These are the edges of Case 3.
Part 4: Find the connected components of G''.
Part 5: Extend the equivalence relation R on the edges of T to the edges of G - T by
defining edge (v,w) equivalent to edge (P(w),w) for each edge (v,w) of G - T such that
v < w. These are the edges of Case 1.
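The five parts above can be traced with a sequential sketch. The following Python function is an illustration only, not the parallel implementation of Tarjan and Vishkin. It assumes a connected simple graph on vertices 0, ..., n-1, builds T by breadth-first search so that Case 2 edges can occur, and replaces the parallel connected components step of Part 4 with a union-find structure. The function name and the variables pre, P, nd, ids, and comp are hypothetical.

```python
from collections import deque


def biconnected_components(n, edges):
    """Map each edge of a connected simple graph to a component label."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    # Part 1: BFS spanning tree T rooted at vertex 0, then preorder ids 1..n.
    par = [-1] * n
    children = [[] for _ in range(n)]
    seen = [False] * n
    seen[0] = True
    queue = deque([0])
    while queue:
        a = queue.popleft()
        for b in adj[a]:
            if not seen[b]:
                seen[b] = True
                par[b] = a
                children[a].append(b)
                queue.append(b)
    pre = [0] * n                      # pre[a] = preorder id of input vertex a
    count = 0
    stack = [0]
    while stack:
        a = stack.pop()
        count += 1
        pre[a] = count
        stack.extend(reversed(children[a]))
    P = [0] * (n + 1)                  # P[id] = id of the parent; P[1] = 0
    for a in range(n):
        if par[a] >= 0:
            P[pre[a]] = pre[par[a]]
    nd = [1] * (n + 1)                 # nd[v] counts v itself
    for v in range(n, 1, -1):          # descendants have larger ids
        nd[P[v]] += nd[v]

    # Every edge as an id pair (v, w) with v < w; tree edges satisfy P[w] == v.
    ids = [tuple(sorted((pre[a], pre[b]))) for a, b in edges]
    nontree = [(v, w) for v, w in ids if P[w] != v]

    # Part 2: local values from nontree edges, then minima/maxima over subtrees.
    low = list(range(n + 1))
    high = list(range(n + 1))
    for v, w in nontree:
        low[w] = min(low[w], v)
        high[v] = max(high[v], w)
    for v in range(n, 1, -1):
        low[P[v]] = min(low[P[v]], low[v])
        high[P[v]] = max(high[P[v]], high[v])

    # Parts 3 and 4: tree edge {P(w), w} is keyed by w; union-find joins them.
    comp = list(range(n + 1))

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x

    for v, w in nontree:               # Case 2
        if v + nd[v] <= w:
            comp[find(v)] = find(w)
    for w in range(2, n + 1):          # Case 3, tree edge v -> w
        v = P[w]
        if v != 1 and (low[w] < v or high[w] >= v + nd[v]):
            comp[find(v)] = find(w)

    # Part 5 (Case 1): every edge (v, w) with v < w joins the class of {P(w), w}.
    return {edge: find(w) for edge, (v, w) in zip(edges, ids)}
```

For two triangles that share a vertex, with one pendant edge attached, the sketch produces three labels. The pendant edge is alone in its class and is therefore a bridge.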
The parallel implementation of the algorithm uses an Arbitrary CRCW PRAM. Each 
edge of G is represented by two oppositely directed edges. The algorithm assigns a processor 
p(i,j) to each edge (i,j) and a processor p(v) to each vertex v. Thus, the number of
processors required by the algorithm is at most 2m + n. 
In Part 1, the algorithm first constructs a spanning tree of G using a modification of the 
connected components algorithm of Shiloach and Vishkin (1982). This takes O(log n) time
using O(m + n) processors.
Next, the algorithm constructs a circular list of the directed edges of the tree to form an
Eulerian tour of the tree. The Eulerian tour corresponds to the order and direction in which 
edges are traversed during a depth-first search traversal of the tree starting from an arbitrary 
vertex. The tree is rooted by breaking the Eulerian tour at an arbitrary edge. Tarjan and 
Vishkin call this the Euler tour technique on trees. 
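The construction of the tour can be sketched as follows. The sketch assumes that the tree has at least two vertices and is given as a dictionary of adjacency lists, and it takes the successor of a directed edge u → v to be the edge from v to the neighbor that follows u cyclically in the adjacency list of v. The tour is walked sequentially here for illustration; in the parallel setting each successor pointer would be computed independently and the numbering obtained by list ranking. The function names are hypothetical.

```python
def euler_tour(adj, root):
    """Return the directed tree edges in Euler-tour order, starting at root."""
    # pos[(v, u)] = index of neighbor u in the adjacency list of v.
    pos = {(v, u): i for v in adj for i, u in enumerate(adj[v])}

    def succ(u, v):
        # After traversing u -> v, leave v toward the neighbor after u (cyclically).
        nbrs = adj[v]
        return (v, nbrs[(pos[(v, u)] + 1) % len(nbrs)])

    first = (root, adj[root][0])       # the circular list is broken at this edge
    tour, edge = [], first
    while True:
        tour.append(edge)
        edge = succ(*edge)
        if edge == first:
            break
    return tour


def preorder_from_tour(tour, root):
    """Number vertices in the order in which the tour first reaches them."""
    pre = {root: 1}
    for _, v in tour:
        if v not in pre:               # first arrival uses the edge P(v) -> v
            pre[v] = len(pre) + 1
    return pre
```

For the tree with edges {1,2}, {1,3}, {2,4}, {2,5} rooted at vertex 1, the tour contains each of the four tree edges once in each direction.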
They combine the Euler tour technique with the doubling technique for list ranking 
(Fortune and Wyllie, 1978) and show that several tree functions, including computing the 
preorder and postorder numbers of vertices and the number of descendants for all vertices of 
the tree, can be performed in O(log n) time using O(n) processors. We next describe the
doubling technique. 
Let L be a list of n elements. For each element k in L, let next(k) be a pointer to the
next element in the list. The last element in the list points to itself. The rank of an element is
its distance from the end of the list; the rank of the last element is 0. Let rank(k) be the rank
of element k. The rank of each element can be determined with the following list ranking
algorithm. A processor p(k) is assigned to each element k. Each processor p(k) maintains a
pointer N(k) that initially points to the element next(k). During the execution of the
algorithm, N(k) progressively points to elements closer to the end of the list. When the
algorithm terminates, N(k) for every element k points to the last element in the list.
List Ranking Algorithm
For each processor p(k)
    N(k) := next(k);
    if N(k) ≠ k then
        dist(k) := 1;
    else
        dist(k) := 0;
    endif
    repeat ⌈log n⌉ times
        if N(k) ≠ N(N(k)) then
            dist(k) := dist(k) + dist(N(k));
            N(k) := N(N(k));
        endif
    end repeat
    rank(k) := dist(k);
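A sequential simulation of the synchronous algorithm may make the doubling step concrete. The sketch below assumes that elements are numbered 0, ..., n-1 and that the list is given as an array of next pointers. It models synchrony by computing all new values from the old arrays before replacing them. The function name list_rank is hypothetical.

```python
import math


def list_rank(next_):
    """Rank every element of a linked list by synchronous pointer doubling."""
    n = len(next_)
    N = list(next_)
    # The last element points to itself and has distance 0.
    dist = [0 if next_[k] == k else 1 for k in range(n)]
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(rounds):
        # All processors read the old dist and N values ...
        new_dist = [
            dist[k] + dist[N[k]] if N[k] != N[N[k]] else dist[k]
            for k in range(n)
        ]
        new_N = [N[N[k]] for k in range(n)]
        # ... and only then are the new values written.
        dist, N = new_dist, new_N
    return dist
```

For the list 2 → 1 → 0 → 3, encoded as next_ = [3, 0, 1, 3], two iterations suffice and the ranks of elements 0, 1, 2, 3 are 1, 2, 3, 0.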
Pointer N(k) is recursively "doubled" by updating N(k) to N(N(k)). After i
iterations, for each element k whose rank is at most $2^i$, N(k) points to the element at the end
of the list. Thus, the rank of each element can be determined after O(log n) iterations. We
now return to the discussion of the parallel algorithm of Tarjan and Vishkin.
In Part 2, the algorithm computes low(v) and high(v) for each vertex v. We sketch
their method. We will describe the computation of low(v); the computation of high(v) is
similar. Each vertex v has an adjacency list of the vertices adjacent to v. Let local_low(v)
be the vertex u with the lowest preorder number such that u is either v or a neighbor of v
such that (u,v) is an edge of G - T. For each vertex v, local_low(v) is computed by
applying the doubling technique on the adjacency list of v. This requires O(log n) time using
O(m) processors.
For an interval [i,j], let global_low[i,j] = min{local_low(x) | i ≤ x ≤ j}. For each
a, 0 ≤ a ≤ log n, and for each x, 1 ≤ x ≤ $n/2^a$, compute global_low[$(x-1)2^a + 1$, $x 2^a$]. The
total number of global_low values is O(n). The global_low values can be computed in
O(log n) time using n processors. In the first iteration, compute global_low[i,i] for each i,
1 ≤ i ≤ n, in parallel. This takes O(1) time using n processors. Then in each successive
iteration, compute the global_low values for intervals that are twice the size of the intervals
of the previous iteration by finding the smaller of two global_low values for two consecutive
intervals. Finally, compute low(v) for each vertex v using the formula
low(v) = min{local_low(x) | v ≤ x ≤ v + nd(v) - 1}.
The values of x are the preorder numbers of the vertices that are descendants of v. Any
interval [i,j], 1 ≤ i ≤ j ≤ n, can be represented as a union of at most 2 log n intervals
whose global_low values have been computed. Thus, for each vertex v, the minimum value of
local_low(x) in the interval [v, v + nd(v) - 1] can be determined by checking at most
2 log n global_low values. Thus, low(v) for each vertex v can be determined in O(log n)
time using n processors. We conclude that the entire computation of Part 2 requires
O(log n) time using O(n) processors.
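The interval decomposition can be sketched sequentially. The code below assumes that local_low is stored in an array indexed by preorder id, with index 0 unused, and that level a of the table holds the global_low values of the aligned intervals of length $2^a$. The function names are hypothetical, and the loops stand in for steps that the parallel algorithm would perform simultaneously.

```python
def build_global_low(local_low):
    """levels[a][x-1] = min of local_low over ids (x-1)*2^a + 1 .. x*2^a."""
    levels = [list(local_low[1:])]     # intervals [i, i]
    while len(levels[-1]) >= 2:
        prev = levels[-1]
        # Each new interval is the union of two consecutive smaller ones.
        levels.append(
            [min(prev[2 * x], prev[2 * x + 1]) for x in range(len(prev) // 2)]
        )
    return levels


def interval_min(levels, i, j):
    """Minimum of local_low over ids i..j from at most two blocks per level."""
    best = float("inf")
    lo, hi = i - 1, j                  # half-open, 0-based block range [lo, hi)
    a = 0
    while lo < hi:
        if lo & 1:                     # leftmost block is a right sibling
            best = min(best, levels[a][lo])
            lo += 1
        if hi & 1:                     # rightmost block is a left sibling
            hi -= 1
            best = min(best, levels[a][hi])
        lo >>= 1
        hi >>= 1
        a += 1
    return best
```

With these helpers, low(v) would be obtained as interval_min(levels, v, v + nd(v) - 1), and since each level contributes at most two blocks, the number of global_low values inspected is at most about 2 log n.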
In Part 3, the algorithm constructs the auxiliary graph G''. Since T has n - 1 edges, G''
has n - 1 vertices. Part 3 requires O(1) time using O(m) processors since testing the
conditions for each possible edge of G'' takes O(1) time. The graph G'' has at most m - 1
edges.
In Part 4, the algorithm finds the connected components of G''. The algorithm of
Shiloach and Vishkin requires O(log n) time using O(m + n) processors. The vertices of G''
that belong to the same connected component correspond to the edges of G in the same
equivalence class and thus the same biconnected component.
In Part 5, the algorithm extends the equivalence relation R found in Part 4 to the edges
of G - T. If (i,j) is a nontree edge such that i < j, then (i,j) is assigned to the same
connected component as edge (P(j),j). Part 5 takes O(1) time using O(m) processors.
Arbitrary concurrent writing arises in the connected components algorithm of Shiloach 
and Vishkin, and thus is required in Parts 1 and 4 of the biconnected components algorithm. 
Since each Part of the algorithm runs in O(log n) time and uses O(m + n) processors, the
biconnected components algorithm of Tarjan and Vishkin runs in O(log n) time using
O(m + n) processors. Two other problems that can be solved by variants of the biconnected
components algorithm in the same resource bounds are finding all of the bridges of a graph, 
and directing the edges of a bridgeless graph so that the resulting directed graph is strongly 
connected. 
5.4. APRAM Biconnected Components Algorithm 
We now present our APRAM biconnected components algorithm. Our algorithm is 
essentially an asynchronous implementation of the biconnected components algorithm of 
Tarjan and Vishkin. We describe our implementation of the various parts of the synchronous 
algorithm. 
In Part 1, we can find a spanning tree T of G using a modification of the APRAM connected
components algorithm given in Chapter 4. Part 1 requires O(log n) rounds using m + n
processors.
Next, we number the vertices of T in preorder and compute the number of descendants
of each vertex using the Euler tour technique on trees and the doubling technique. The
asynchronous implementation of the doubling technique, however, requires a barrier
synchronization between the read and write stages of each doubling operation. This ensures
that the updates of the dist(k) and N(k) variables for each element k are coordinated. With
an increment primitive, each barrier synchronization requires one round. Since the Euler
tour technique on trees requires O(log n) iterations using O(n) processors, the number of
rounds required is O(log n).
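One way a barrier could be built from the increment primitive is a shared counter that is never reset. The sketch below is an assumption made for illustration, not a description of the barrier used in the thesis. Each processor increments the counter on arrival and then rereads it until all num_procs processors have arrived at the current barrier. The class and function names are hypothetical, and a lock simulates the atomic increment.

```python
import threading
import time


class Counter:
    """Shared cell supporting an atomic increment (simulated with a lock)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    def read(self):
        with self._lock:
            return self._value


def barrier(counter, num_procs, phase):
    """Return only after all processors reached barrier number `phase` (1-based)."""
    counter.increment()
    while counter.read() < num_procs * phase:
        time.sleep(0)                  # yield so that other threads can run


def run_phases(num_procs, num_phases):
    """Run simulated processors through phases separated by barriers."""
    counter = Counter()
    log, log_lock = [], threading.Lock()

    def processor():
        for phase in range(1, num_phases + 1):
            with log_lock:
                log.append(phase)      # stands in for the work of this phase
            barrier(counter, num_procs, phase)

    threads = [threading.Thread(target=processor) for _ in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because no processor passes barrier number i before the counter reaches i times num_procs, every log entry of phase i precedes every entry of phase i + 1, whatever the thread schedule.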
We show that if barrier synchronization is not used during the doubling operation, then
the computed distance to the end of the list may be too small. For an element k, suppose that
at the time dist(k) is updated, N(k) = k' and N(N(k)) = k''. Then, when dist(k) is updated,
dist(k) := dist(k) + dist(k'). Meanwhile, since the processors are asynchronous, while
dist(k) is being updated, N(k') may also be updated. Subsequently, when N(k) is updated,
N(k) does not point to element k'', but rather to an element closer to the end of the list.
The value dist(k) accounts only for the distance from k to k'', however, while N(k) now points
beyond k''. Thus, the computed distance to the end of the list will be smaller than the actual
distance.
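The interleaving described above can be replayed on a list of five elements. The schedule in the sketch below is one hypothetical execution chosen for illustration: the processor of element 0 updates dist(0), the processor of element 1 then completes an entire doubling step, and only afterwards is N(0) updated.

```python
def unsynchronized_doubling():
    """Replay one bad interleaving of pointer doubling without barriers."""
    # List 0 -> 1 -> 2 -> 3 -> 4; element 4 is last and points to itself.
    N = [1, 2, 3, 4, 4]
    dist = [1, 1, 1, 1, 0]

    def update_dist(k):
        if N[k] != N[N[k]]:
            dist[k] += dist[N[k]]

    def update_pointer(k):
        N[k] = N[N[k]]

    def full_step(k):
        update_dist(k)
        update_pointer(k)

    update_dist(0)       # dist[0] = 1 + dist[1] = 2, the distance from 0 to 2
    full_step(1)         # meanwhile N[1] moves from element 2 to element 3
    update_pointer(0)    # N[0] = N[1] = 3, although dist[0] only reaches element 2
    full_step(2)
    full_step(3)
    for _ in range(2):   # the remaining two of the three iterations
        for k in range(4):
            full_step(k)
    return dist
```

The function returns [3, 3, 2, 1, 0]: element 0 is assigned rank 3 although its actual rank is 4, because the link from element 2 to element 3 was never counted for it.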
In Part 2, the algorithm asynchronously computes low(v) and high(v) using the same
method as in the synchronous algorithm. Again, barrier synchronization is needed during the
doubling operations. Part 2 requires O(log n) rounds using O(n) processors.
In Part 3, the algorithm asynchronously constructs the auxiliary graph G''. Part 3 takes
O(1) rounds using O(m) processors.
In Part 4, the algorithm asynchronously finds the connected components of G'' using the
APRAM connected components algorithm of Chapter 4. Part 4 requires O(log n) rounds
using m + n processors.
In Part 5, the algorithm extends the equivalence relation R found in Part 4 to the edges
of G - T. Part 5 takes O(1) rounds using O(m) processors.
In addition, a barrier synchronization is inserted between each of the Parts so that 
processors can determine when a part is completed. Since each of the Parts of the 
asynchronous algorithm requires O(log n) rounds and uses O(m + n) processors, and each
barrier synchronization requires one round, our APRAM biconnected components algorithm
requires O(log n) rounds using O(m + n) processors.
REFERENCES 
Aho, A. V., Ullman, J. D., Wyner, A. D., and Yannakakis, M. (1982), Bounds on the Size and 
Transmission Rate of Communications Protocols, Computers & Mathematics with 
Applications, 8, 205-214.
Akl, S. (1989), The Design and Analysis of Parallel Algorithms, Prentice Hall, Englewood 
Cliffs, NJ. 
Anderson, R. J., and Woll, H. S. (1991), Wait-Free Parallel Algorithms for the Union-Find 
Problem, Technical Report 91-04-05, University of Washington. 
Aspnes, J. and Herlihy, M. (1990), Wait-Free Data Structures in the Asynchronous PRAM 
Model, Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and 
Architectures, 340-349. 
Awerbuch, B. (1987), Optimal Distributed Algorithms for Minimum Weight Spanning Tree, 
Counting, Leader Election and Related Problems, Proceedings of the 19th Annual ACM 
Symposium on Theory of Computing, 230-240. 
Awerbuch, B. and Shiloach, Y. (1987), New Connectivity and MSF Algorithms for
Shuffle-Exchange Network and PRAM, IEEE Transactions on Computers, 36, 1258-1263.
Axelrod, T. S. (1986), Effects of Synchronization Barriers on Multiprocessor Performance,
Parallel Computing, 3, 129-140.
Bartlett, K. A., Scantlebury, R. A., and Wilkinson, P. T. (1969), A Note on Reliable
Full-Duplex Transmission over Half-Duplex Links, Communications of the ACM, 12,
260-261.
Batcher, K. (1968), Sorting Networks and Their Applications, Proceedings of the American
Federation of Information Processing Societies Conference, 32, 307-314.
Boppana, R. B. (1989), Optimal Separations Between Concurrent-Write Parallel Machines,
Proceedings of the 21st Annual ACM Symposium on Theory of Computing, 320-326.
Brand, D. and Zafiropulo, P. (1983), On Communicating Finite-State Machines, Journal of
the Association for Computing Machinery, 30, 323-342.
Cheriton, D. and Tarjan, R. E. (1976), Finding Minimum Spanning Trees, SIAM Journal of
Computing, 5, 724-742.
Chin, F. Y., Lam, J., and Chen, I. (1982), Efficient Parallel Algorithms for some Graph
Problems, Communications of the ACM, 25, 659-665.
Chlebus, B. S., Diks, K., Hagerup, T., and Radzik, T. (1988), Efficient Simulations Between
Concurrent-Read Concurrent-Write PRAM Models, Lecture Notes in Computer
Science, 324, 231-239.
Chor, B., Israeli, A., and Li, M. (1987), On Processor Coordination Using Asynchronous 
Hardware, Proceedings of the 6th Annual ACM Symposium on Principles of Distributed 
Computing, 169-178. 
Cole, R. (1988), Parallel Merge Sort, SIAM Journal of Computing, 17, 770-785.
Cole, R. and Vishkin, U. (1986a), Deterministic Coin Tossing and Accelerating Cascades: 
Micro and Macro Techniques for Designing Parallel Algorithms, Proceedings of the 
18th Annual ACM Symposium on Theory of Computing, 206-219. 
Cole, R. and Vishkin, U. (1986b), Approximate and Exact Parallel Scheduling with 
Applications to List, Tree, and Graph Problems, Proceedings of the 27th Annual 
Symposium on Foundations of Computer Science, 478-491. 
Cole, R. and Zajicek, O. (1989), The APRAM: Incorporating Asynchrony into the PRAM 
Model, Proceedings of the 1989 ACM Symposium on Parallel Algorithms and 
Architectures, 169-178. 
Cole, R. and Zajicek, O. (1990), The Expected Advantage of Asynchrony, Proceedings of the 
2nd Annual ACM Symposium on Parallel Algorithms and Architectures, 85-94. 
Danthine, A. (1982), Protocol Representation with Finite State Models, Computer Network 
Architectures and Protocols, Ed. Paul E. Green, Jr., 579-606. 
Dubois, M. and Briggs, F. A. (1991), The Run-Time Efficiency of Parallel Asynchronous 
Algorithms, IEEE Transactions on Computers, 40, 1260-1266.
Fich, F. E., Ragde, P., and Wigderson, A. (1988a), Relations Between Concurrent-Write
Models of Parallel Computation, SIAM Journal on Computing, 17, 606-627.
Fich, F. E., Ragde, P., and Wigderson, A. (1988b), Simulations Among Concurrent-Write
PRAMs, Algorithmica, 3, 43-51.
Fischer, M. J., Lynch, N. A., and Paterson, M. S. (1985), Impossibility of Distributed
Consensus with One Faulty Process, Journal of the Association for Computing
Machinery, 32, 374-382.
Fortune, S. and Wyllie, J. (1978), Parallelism in Random Access Machines, Proceedings of 
the 10th Annual ACM Symposium on Theory of Computing, 114-118. 
Fredman, M. L. and Tarjan, R. E. (1987), Fibonacci Heaps and Their Uses in Improved
Network Optimization Algorithms, Journal of the Association for Computing
Machinery, 34, 596-615.
Gallager, R., Humblet, P., and Spira, P. (1983), A Distributed Algorithm for
Minimum-Weight Spanning Trees, ACM Transactions on Programming Languages and Systems,
5, 66-77.
Gazit, H. (1991), An Optimal Randomized Parallel Algorithm for Finding Connected
Components in a Graph, SIAM Journal of Computing, 20, 1046-1067.
Gibbons, A. and Rytter, W. (1988), Efficient Parallel Algorithms, Cambridge University 
Press, Cambridge, Great Britain. 
Gibbons, P. B. (1989), A More Practical PRAM Model, Proceedings of the 1989 ACM 
Symposium on Parallel Algorithms and Architectures, 158-168. 
Gouda, M. G. (1985), On 'A Simple Protocol Whose Proof Isn't': The State Machine 
Approach, IEEE Transactions on Communications, 33, 382-384.
Gouda, M. G. and The, K.-S. (1985), Modeling Physical Layer Protocols Using 
Communicating Finite State Machines, Proceedings of the 9th Data Communications 
Symposium, 54-62. 
Graham, R. L. and Hell, P. (1985), On the History of the Minimum Spanning Tree Problem,
Annals of the History of Computing, 7, 43-57.
Hailpern, B. (1985), A Simple Protocol Whose Proof Isn't, IEEE Transactions on
Communications, 33, 330-337.
Halpern, J. Y. and Zuck, L. D. (1987), A Little Knowledge Goes a Long Way: Simple 
Knowledge-Based Derivations and Correctness Proofs for a Family of Protocols, 
Proceedings of the 6th Annual ACM Symposium on Principles of Distributed 
Computing, 269-280. 
Han, Y. and Wagner, R. (1990), An Efficient and Fast Parallel Connected Component 
Algorithm, Journal of the Association for Computing Machinery, 37, 626-642.
Harary, F. (1969), Graph Theory, Addison-Wesley, Reading, MA. 
Herlihy, M. (1988), Impossibility and Universality Results for Wait-Free Synchronization, 
Proceedings of the 7th Annual ACM Symposium on Principles of Distributed 
Computing, 276-290. 
Hirschberg, D. S. (1982), Parallel Graph Algorithms Without Memory Conflicts, Proceedings 
of the 20th Annual Allerton Conference on Communications, Control, and Computing, 
257-263. 
Hirschberg, D. S., Chandra, A. K., and Sarwate, D. V. (1979), Computing Connected 
Components on Parallel Computers, Communications of the ACM, 22, 461-464.
Ja'Ja', J. (1991), An Introduction to Parallel Algorithms, Addison-Wesley, Reading, MA. 
Johnson, D. B. and Metaxas, P. (1991), Connected Components in $O(\lg^{3/2} |V|)$ Parallel Time
for the CREW PRAM, Proceedings of the 32nd Annual Symposium on Foundations of 
Computer Science, 688-697. 
Karp, R. M. and Ramachandran, V. (1990), Parallel Algorithms for Shared-Memory 
Machines, Handbook of Theoretical Computer Science, Vol. A: Algorithms and 
Complexity, Ed. J. van Leeuwen, Elsevier (Amsterdam) and MIT Press (Cambridge, 
Mass.), 869-941. 
Kruskal, C. P., Rudolph, L., and Snir, M. (1990), Efficient Parallel Algorithms for Graph
Problems, Algorithmica, 5, 43-64.
Lamport, L. (1986), On Interprocess Communication, Part II: Algorithms, Distributed
Computing, 1, 86-101.
Loui, M. C. and Abu-Amara, H. H. (1987), Memory Requirements for Agreement Among 
Unreliable Asynchronous Processes, Advances in Computing Research, 4, JAI Press, 
Inc., 163-183. 
Lynch, N., Mansour, Y., and Fekete, A. (1988), The Data Link Layer: Two Impossibility
Results, Proceedings of the 7th Annual ACM Symposium on Principles of Distributed 
Computing, 149-170. 
Mano, M. M. (1982), Computer System Architecture, Prentice-Hall, Inc., Englewood Cliffs, 
NJ 07632. 
Martel, C., Park, A., and Subramonian, R. (1989), Optimal Asynchronous Algorithms for
Shared Memory Parallel Computers, Report CSE-89-8, Division of Computer Science, 
University of California, Davis. 
Martel, C., Subramonian, R., and Park, A. (1990), Asynchronous PRAMs are (almost) as
good as Synchronous PRAMs, Proceedings of the 30th Annual Symposium on 
Foundations of Computer Science, 590-599. 
Nishimura, N. (1990), Asynchronous Shared Memory Parallel Computation, Proceedings of 
the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, 76-84. 
Peng, W. and Purushothaman, S. (1989), Towards Dataflow Analysis of Communicating 
Finite State Machines, Proceedings of the 8th Annual ACM Symposium on Principles of 
Distributed Computing, 45-58. 
Savage, C. and Ja'Ja', J. (1981), Fast, Efficient Parallel Algorithms for Some Graph 
Problems, SIAM Journal of Computing, 10, 682-691.
Shiloach, Y. and Vishkin, U. (1982), An O(log n) Parallel Connectivity Algorithm, Journal
of Algorithms, 3, 57-67.
Tanenbaum, A. S. (1981), Computer Networks, Prentice Hall, Inc., Englewood Cliffs, NJ 
07632. 
Tarjan, R. E. (1972), Depth-First Search and Linear Graph Algorithms, SIAM Journal of
Computing, 1, 146-160.
Tarjan, R. E. and Vishkin, U. (1985), An Efficient Parallel Biconnectivity Algorithm, SIAM
Journal of Computing, 14, 862-874.
Tempero, E. and Ladner, R. E. (1990), Tight Bounds for Weakly-Bounded Protocols,
Proceedings of the 9th Annual ACM Symposium on Principles of Distributed 
Computing, 205-218. 
Tsin, Y. H. and Chin, F. Y. (1984), Efficient Parallel Algorithms for a Class of Graph 
Theoretic Problems, SIAM Journal of Computing, 13, 580-599.
Wang, D. and Zuck, L. D. (1989), Tight Bounds for the Sequence Transmission Problem, 
Proceedings of the 8th Annual ACM Symposium on Principles of Distributed 
Computing, 73-83. 
Wu, M. M. (1990), An O(log n) Time Common CRCW PRAM Algorithm for Minimum
Spanning Tree, Technical Report UILU-ENG-90-2250 (ACT-114), Coordinated 
Science Laboratory, University of Illinois at Urbana-Champaign. 
Wu, M. M. (1991), Asynchronous PRAM Algorithms for Graph Connectivity and Related 
Problems, in preparation. 
Wu, M. M. and Loui, M. C. (1991), Modeling Robust Asynchronous Communication Protocols
with Finite-State Machines, to appear in IEEE Transactions on Communications.
Yao, A. C. (1975), An O(|E| log log |V|) Algorithm for Finding Minimum Spanning Trees,
Information Processing Letters, 4, 21-23.
Yu, Y. and Gouda, M. G. (1982), Deadlock Detection for a Class of Communicating Finite
State Machines, IEEE Transactions on Communications, 30, 2514-2518.
Zafiropulo, P., West, C. H., Rudin, H., Cowan, D. D., and Brand, D. (1980), Towards 
Analyzing and Synthesizing Protocols, IEEE Transactions on Communications, 28, 
651-661. 
VITA 
Michael M. Wu was born on January 23, 1964, in Oak Park, Illinois. He received a B.S.
degree in Electrical Engineering with Honors from the University of Illinois at Urbana-
Champaign in 1985. He then entered graduate school at the University of Illinois and 
received the M.S. and Ph.D. degrees in Electrical Engineering in 1987 and 1992, respectively. 
His thesis research was directed by Dr. Michael C. Loui. 
Dr. Wu is a member of Tau Beta Pi and Eta Kappa Nu and an associate member of 
Sigma Xi. He is also a member of the Institute of Electrical and Electronics Engineers and 
the Association for Computing Machinery. His research interests include the design of 
parallel and distributed computer systems. His publications include: 
An Efficient Distributed Algorithm for Maximum Matching in General Graphs, by M. M. Wu 
and M. C. Loui, Algorithmica, 5 (1990), 383-406. 
Modeling Robust Asynchronous Communication Protocols with Finite-State Machines, by M. 
M. Wu and M. C. Loui, accepted for publication in IEEE Transactions on Communications. 
An O(log n) Time Common CRCW PRAM Algorithm for Minimum Spanning Tree,
Technical Report ACT-114, Coordinated Science Laboratory, University of Illinois at 
Urbana-Champaign, 1990. 
Asynchronous PRAM Algorithms for Graph Connectivity and Related Problems, submitted 
for publication. 
