Aspects of practical implementations of PRAM algorithms by Ravindran, Somasundaram
A Thesis Submitted for the Degree of PhD at the University of Warwick 
 
Permanent WRAP URL: 
http://wrap.warwick.ac.uk/137165 
 
Aspects of Practical 
Implementations of PRAM  
Algorithms
Somasundaram Ravindran
A thesis submitted for the 
Degree of Doctor of Philosophy
Department of Computer Science 
University of Warwick
December 1993

Acknowledgments
I am very grateful to my supervisor, Dr Alan Gibbons, who introduced me to the area 
of parallel computation and provided encouragement and support in my research. 
His quick mind and knowledge of the field greatly helped me in my thinking, in writing 
proofs, and in realising what I was doing in a much broader context. He was 
also a constant source of support in non-technical matters.
I would also like to thank the many members of the Computer Science Department 
at the University of Warwick who provided a diversity of assistance over the past 
three years.
Above all I would like to gratefully thank my wife, Mythili. Without her constant 
help and support, this thesis would not have been possible.
Declaration
This thesis is submitted to the University of Warwick in support of my application 
for admission to the degree of Doctor of Philosophy. No part of it has been submitted 
in support of an application for another degree or qualification of this or any other 
institution of learning. Parts of the thesis appeared in the following papers in which 
my own work was that of a full pro-rata contributor:
1. S. Ravindran, B. Dessau and A.M. Gibbons, “An overview of PRAM to practical 
PRAM algorithmics”, Report 6.2.1, Parallel Universal Message-passing Architecture, 
ESPRIT Project 2701 of the EC, 1991.
2. S. Ravindran and A.M. Gibbons, “Dense edge-disjoint embedding of binary trees 
in the hypercube”, Information Processing Letters, Vol. 45, 321-325, 1993.
3. S. Ravindran and A.M. Gibbons, “Densely embedding the complete binary tree 
in communication networks”, 9th British Colloquium for Theoretical Computer Science, 
University of York, England, March 1993.
4. S. Ravindran, A.M. Gibbons and M.S. Paterson, “Dense edge-disjoint embedding 
of complete binary trees in interconnection networks”, submitted to Theoretical 
Computer Science, December 1993.
5. S. Ravindran, N.W. Holloway and A.M. Gibbons, “Approximating minimum 
weight perfect matchings for complete graphs satisfying the triangle inequality”, 
in Proceedings of the 19th International Workshop on Graph-Theoretic Concepts in 
Computer Science, Lecture Notes in Computer Science, Springer-Verlag, 1993.
S. Ravindran 20th December 1993
Abstract
The PRAM is a shared memory model of parallel computation which abstracts 
away from inessential engineering details. It provides a very simple architecture 
independent model and provides a good programming environment. Theoreticians 
of the computer science community have proved that it is possible to emulate the 
theoretical PRAM model using current technology. Solutions have been found for 
effectively interconnecting processing elements, for routing data on these networks 
and for distributing the data among memory modules without hotspots. This thesis 
reviews this emulation and the possibilities it provides for large scale general purpose 
parallel computation. The emulation employs a bridging model which acts as an 
interface between the actual hardware and the PRAM model. We review the evidence 
that such a scheme can achieve scalable parallel performance and portable parallel 
software and that PRAM algorithms can be optimally implemented on such practical 
models. In the course of this review we present the following new results:
1. Concerning parallel approximation algorithms, we describe an NC algorithm 
for finding an approximation to a minimum weight perfect matching in a 
complete weighted graph. The algorithm is conceptually very simple and it 
is also the first NC-approximation algorithm for the task with a sub-linear 
performance ratio.
2. Concerning graph embedding, we describe dense edge-disjoint embeddings of 
the complete binary tree with n leaves in the following n-node communication 
networks: the hypercube, the de Bruijn and shuffle-exchange networks and 
the 2-dimensional mesh. In the embeddings the maximum distance from a 
leaf to the root of the tree is asymptotically optimally short. The embeddings 
facilitate efficient implementation of many PRAM algorithms on networks 
employing these graphs as interconnection networks.
3. Concerning bulk synchronous algorithmics, we describe scalable transportable 
algorithms for the following three commonly required types of computation: 
balanced tree computations, Fast Fourier Transforms and matrix multiplications.
Contents
1 Introduction
2 Classical PRAM Design
2.1 Introduction
2.2 The PRAM model
2.2.1 Concepts of efficient and optimal algorithms
2.3 Basic PRAM techniques
2.3.1 Balanced tree
2.3.2 Doubling
2.3.3 Divide-and-conquer
2.3.4 Reducing the number of processors
2.4 PRAM algorithmic tools
2.4.1 Prefix computation
2.4.2 List ranking
2.5 Graph algorithms
2.5.1 Euler tour on trees
2.5.2 Tree contraction
2.5.3 Ear decomposition
2.5.4 Matrix computations
3 Parallel Approximation Algorithm
3.1 Introduction
3.2 Performance ratios of approximation algorithms
3.3 Minimum weight perfect matching
3.4 Approximate minimum weight perfect matching in a complete weighted graph
3.4.1 Validity of the approximation ratio claim
3.4.2 Parallel execution and complexity of the algorithm
3.5 Further work
4 Realistic Issues
4.1 Introduction
4.2 Interconnection Networks
4.2.1 The hypercube family
4.2.2 The shuffle-exchange and de Bruijn networks
4.2.3 Meshes
4.2.4 Randomly-wired networks
4.3 Contention and congestion
4.3.1 Hashing
4.3.2 Combining
4.3.3 Routing
4.3.4 Information dispersal algorithm
4.4 Synchrony and asynchrony
5 Embeddings
5.1 Introduction
5.2 Efficiency requirements
5.3 The embeddings
5.3.1 Embedding in the de Bruijn graph
5.3.2 Embedding in the shuffle-exchange graph
5.3.3 Embedding in the 2-dimensional mesh
5.3.4 Embedding in the hypercube
5.4 Depths of the embedded trees
5.4.1 Maximum root-to-leaf distances of the embeddings
5.4.2 Lower bounds for the embedded tree depths
5.5 Further remarks and algorithmic issues
5.6 Summary and open problems
6 Practical Parallel Models of Parallel Computation
6.1 Introduction
6.2 The practical PRAM model
6.3 Latency hiding
6.4 Asynchronous computation
6.5 Memory management
6.6 Conclusion
7 Bulk Synchronous Parallel Algorithms
7.1 Introduction
7.2 Balanced tree computation
7.3 Fast Fourier Transform
7.4 Transitive closure and graph algorithms
7.5 Further work
8 Conclusions
Chapter 1
Introduction
A few years ago parallel computers could be found only in research laboratories. 
Due to the rapidly decreasing cost of processors, memory and communication, 
they are now available commercially. It is possible that, within a decade, parallel 
computation will dominate in all areas of computer science and its applications. A 
deep understanding of parallel computations is therefore highly desirable.
The complexity theory research community has developed a rich literature on the 
design and analysis of efficient parallel algorithms. To date, most of this work has 
been based on the Parallel Random Access Machine (PRAM). The PRAM model 
consists of a number of processing elements and a (shared) memory. The processing 
elements operate in lock step synchrony and access a location in the shared memory 
in unit time (chapter 2 gives a detailed description of the PRAM model). The 
PRAM model ignores the realistic issues: PRAM algorithms do not take 
account of the low level details of parallel computation, such as interprocessor or 
processor-to-memory communication, memory management and hardware failure.
For this reason, the PRAM provides a very simple and natural model for architecture 
independent parallel algorithm design.
One benefit of PRAM studies is an extensive list of fast parallel algorithms. By 
a fast algorithm we mean one that takes polylogarithmic parallel time using 
a polynomial number of processors. Problems which can be solved by a fast 
algorithm are said to belong to the class NC. This class name is an acronym for 
Nick (Pippenger’s) Class. The search for parallel solutions that place problems 
in NC has demonstrated that entirely new algorithmic techniques are appropriate 
and that it is usually a bad starting point to attempt to parallelise the best sequential 
algorithms. Design techniques and tools have therefore been developed for parallel 
computation which are completely different from those of the sequential domain. In fact, 
some of the commonly used methods in sequential algorithm design do not adapt 
well to parallelism. Moreover, many problems in the class P, i.e. the problems that have 
polynomial time sequential solutions, have been proven P-complete. If a problem is 
P-complete then it is very unlikely that the problem belongs to the class NC.
The primary interest of the study of parallel complexity is placing the problems in 
P into the class NC. There are problems in P for which we do not know whether 
they lie in NC, and which have not been proven to be P-complete. However, for 
such problems it may be possible to find an NC approximation algorithm. 
For example, if the problem is a minimisation problem then we find a 
solution which is not necessarily minimum, but which is never greater than some factor times 
the optimal solution.
Although the PRAM model has been prominent in the development of the design
and analysis of parallel algorithms, the unrealistic characteristics of the model 
may make us think that the model is not suitable for general purpose parallel 
computation. However, we shall see in this thesis that the PRAM model can be 
emulated by a feasible parallel computer. A feasible parallel computer is a number 
of processing elements interconnected by a network, where each processing element 
performs computational and/or memory operations and the processing elements may 
or may not have to be synchronised. The memory of a feasible parallel computer is 
distributed among memory modules.
In contrast to a PRAM, a (non local) memory access in a feasible parallel computer 
takes much longer than a local computation. This is because the data have to be 
transferred through an interconnection network. Further delays can occur due to 
congestion in the interconnection network and contention at the memory modules. 
Moreover, a PRAM assumes that the processors and the communication links 
are fault-free. Theoretical solutions have been found for routing data to the right 
processing element within a reasonable time, for distributing the data among memory 
modules so that the distribution will not slow down the computation, and for coping 
with processor and network failures.
At present, almost none of the parallel software designed for a realistic physical 
parallel model is portable. This is because the algorithms are often developed 
for a particular network topology of fixed size. This software is not based on general 
principles and to date has not been written for a common virtual machine. Progress 
in technology suggests that rapid changes in parallel machines are still to come. 
Hence, current software will not have a long lifetime. The major challenge in parallel
computing is developing architecture independent parallel software with an expected 
lifetime of several decades. We need an interface (or bridging) model between the 
software and hardware. Such a model will offer architecture independent software 
and will be compatible with technological evolution.
Using the theoretical solutions for routing and for memory management we can 
build a bridging model with virtual shared memory. A user can view this model 
as an extended PRAM, which hides hardware details from the user. Such a model is 
called a practical PRAM. Recently a number of practical PRAMs have been proposed 
which variously take account of communication delay, contention and congestion, 
asynchrony and component failures. In this thesis we review these models and show 
that it is possible to develop architecture independent software with a long 
lifetime.
The remainder of this thesis is organised as follows.
• Chapter 2 reviews the design techniques and tools of parallel computation, 
and provides the evidence for the significance of the PRAM model.
• Chapter 3 gives a novel approximation algorithm for a problem whose parallel 
complexity remains unknown. It is not known whether the problem lies in 
NC, and it has not been proven to be P-complete.
• Chapter 4 focuses on the realistic issues which are ignored by the PRAM 
model. If an interconnection network is to be used for general problem solving 
then it ought to have certain desirable graph theoretic properties. Chapter 4 
describes interconnection networks which satisfy these properties. Furthermore, 
chapter 4 reviews the theoretical solutions for routing, for memory 
management and for fault tolerance.
• The most commonly occurring structure in parallel computation is the complete 
binary tree. It is precisely because such logarithmic depth structures are 
used (either explicitly or implicitly) that polylogarithmic time complexities 
are attained for many PRAM algorithms. Chapter 5 shows that the complete 
binary tree can be embedded in the following interconnection networks: the mesh, 
hypercube, de Bruijn and shuffle-exchange networks, so that an algorithmically 
important parameter, the maximum distance from a leaf to the root, is optimised 
asymptotically. Thus, $O(\log n)$ time PRAM algorithms which use the 
complete binary tree as the algorithmic structure can be translated to optimal 
time on the interconnection network.
• With the use of the theoretical solutions described in chapter 4, chapter 6 
demonstrates that there is no hindrance to designing practical parallel models 
of parallel computation. Moreover, this chapter shows that a PRAM algorithm 
can be implemented on the practical parallel model without any significant 
delay in run time.
• Chapter 7 shows that scalable transportable algorithms for practical parallel 
models can be written for certain basic tasks: balanced tree computations, Fast 
Fourier transforms and matrix multiplication.
Chapter 2
Classical PRAM Design
2.1 Introduction
The PRAM model of parallel computation is described in detail in section 2.2. At 
first glance, the PRAM model of computation might not appear to be suitable as 
a general model for designing and analysing parallel algorithms. For sequential 
computation, it has been of considerable advantage to deal with an abstraction of 
the von Neumann machine, namely the Random-Access Machine or RAM (see [3] 
for details of the model). Similar advantages justify the PRAM model:
• Ease of use: Algorithms can be specified with little intricacy.
• Portability: PRAM algorithms do not need to take into account memory 
organisation, network topology or other hardware design attributes of real 
parallel machines, so that they eliminate obstacles to portability.
• Scalability: Typically the number of processors which are used can be increased 
in a natural way such that the speed of the computation retains the 
same functional dependence on the problem size.
The main preoccupation of PRAM algorithm designers has been to place the problem 
in hand in the class NC. Various idealised models of parallel computation other 
than the PRAM model have been used in the study of parallel algorithms and their 
complexity. These include Boolean circuits and alternating Turing machines. The 
complexity class NC remains unchanged when defined by these models [138, 149]. 
This motivates the definition of the complexity class NC. Note that when using a 
family of Boolean circuits as a model of parallel computation, the family is usually 
required to satisfy a logspace uniformity condition [139]. A family of Boolean 
circuits is logspace uniform if there is a deterministic Turing machine that can construct 
some standard encoding of the $n$th circuit using $O(\log n)$ work space.
It is reasonable to seek an NC algorithm for a problem if the problem can be 
solved sequentially in polynomial time. Let P be the class of decision problems 
that can be solved by a deterministic Turing machine within a polynomial number 
of sequential steps [3]. We can see that $NC \subseteq P$ by converting NC algorithms into 
sequential algorithms in the obvious way. A fundamental open question is whether 
every problem in P lies in NC. The parallel computation thesis states that “time 
bounded parallel machines are polynomially related to space bounded sequential 
machines” [18, 22, 43, 57, 122]. That is, NC computations can be simulated by 
Turing machines using only polylogarithmic space. Thus, $P \subseteq NC$ would imply 
that P is contained in the class of problems that can be solved in polylogarithmic
space by a sequential machine, which is considered very unlikely. This is why we 
believe that there exist problems which do not adapt well to parallelism. In fact, 
many problems have been proven to be P-complete. For a list of P-complete 
problems see [59, 107]. If a problem is P-complete then the problem is unlikely 
to lie in NC. More formally, a problem $L \in P$ is said to be P-complete if every 
other problem in P can be transformed into $L$ by a deterministic Turing machine 
using only log space (such a transformation is said to be a log space reduction). 
It follows that, if $L$ is P-complete and $L \in NC$ then $P = NC$. It turns out that 
some of the commonly used methods in sequential algorithms are likely to be inherently 
sequential. Thus the design of parallel algorithms requires new paradigms 
and techniques.
In this chapter we show that the benefit from the PRAM model lies not only in the 
extensive list of efficient parallel algorithms that have been designed, but also in the 
fundamental paradigms, design techniques and tools that have emerged. These are 
usually completely different from the best known sequential solutions to the same 
problems and indicate new paradigms for parallel algorithm design. Sections 2.3 
and 2.4 describe the techniques and tools respectively. In section 2.5 we review some 
graph algorithms, and show that these are useful not only in their own right for the 
problems they solve, but also as common subroutines in many parallel algorithms. 
Note that in this chapter we consider only graph algorithms. Of course, the PRAM 
can be and has been used to solve problems in many other areas. For example, survey 
papers [38] and [65] list references for computational geometry and pattern matching 
respectively. Many other references can be found in [42, 52, 68, 76, 103, 166].
Figure 2.1: [diagram with labels: main control program, processors, shared memory]
2.2 The PRAM model
The PRAM model is an abstract shared memory model. It was introduced in 
[43, 169]. There are a number of processors working together and communicating 
through the shared memory. The processors synchronously execute the same program 
through the central main control (see figure 2.1). Although performing the 
same instructions, the processors can be operating on different data. Hence, such a 
model is a Single-Instruction, Multiple-Data model, namely an SIMD model. Each 
processor is a uniform-cost random-access machine or RAM with the usual operations 
and instructions. The cost of arithmetical operations (addition, subtraction, equality 
predicate and so on) is constant. In one step each processor can access (either 
reading from it or writing to it) one memory location or execute a single RAM 
operation.
Memory access leads to variants of the model which allow or do not allow more 
than one processor to read or to write to the same memory location. Examples are the
EREW PRAM (exclusive read exclusive write parallel random access machine, which 
allows no concurrent reads and no concurrent writes), the CREW PRAM (concurrent 
reads allowed but only exclusive writes) and the CRCW PRAM (in which both 
concurrent writes and concurrent reads are allowed). In general concurrent reads 
cause no logical errors. However, with concurrent writes additional rules are required 
to resolve the outcome. This leads to variants of the CRCW PRAM, namely the 
so-called common, arbitrary and priority variants. These resolve write conflicts 
as follows: in the common variant all processors writing into the same location must write 
the same value; in the arbitrary variant any one processor participating in a common 
write may succeed and the algorithm should work regardless of which one does; 
and in the priority variant there is a linear order on the processors and the minimum 
numbered processor succeeds in writing.
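The three write rules can be illustrated with a small sequential simulation. This is only a sketch: the function name `resolve_writes` and the representation of write requests as `(processor_id, value)` pairs are assumptions made for illustration and do not come from the thesis.

```python
def resolve_writes(requests, variant):
    """Resolve one step of concurrent writes aimed at a single memory cell.

    requests: list of (processor_id, value) pairs (hypothetical encoding).
    variant:  "common", "arbitrary" or "priority".
    Returns the value held by the cell after the step.
    """
    if not requests:
        raise ValueError("no processor attempted a write")
    if variant == "common":
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("common CRCW requires all written values to agree")
        return values.pop()
    if variant == "arbitrary":
        # Any writer may win; a correct algorithm must not depend on which.
        # Taking the last request is just one admissible choice.
        return requests[-1][1]
    if variant == "priority":
        # The lowest numbered processor wins.
        return min(requests, key=lambda req: req[0])[1]
    raise ValueError(f"unknown variant: {variant}")
```

An algorithm written for the common variant never triggers the error branch, which is why it also runs unchanged under the arbitrary and priority rules.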
Any algorithm that works on an EREW PRAM works on a CREW PRAM, any 
algorithm that works on a CREW PRAM works on a common CRCW PRAM, and 
so on. Thus the list of variants of the PRAM model, EREW, CREW, common 
CRCW, arbitrary CRCW and priority CRCW, presents the models in 
increasing order of their power. However, they do not differ much in power, as the 
following theorem [164] indicates. This justifies the class NC: the class remains 
unchanged regardless of the variant of the PRAM model used.
Theorem 2.2.1 Any algorithm for a priority CRCW PRAM of $p$ processors can be 
simulated by an EREW PRAM with the same number of processors and with the 
parallel time increased by a factor of $O(\log p)$.
2.2.1 Concepts of efficient and optimal algorithms
In the PRAM model, the relevant complexity measures of an algorithm are the time 
for parallel computation and the number of processors used. The time-processor 
product of a PRAM algorithm is called the work of the algorithm.
Suppose a PRAM algorithm runs in time $t(n)$ by employing $p(n)$ processors for 
a problem of size $n$. Then the PRAM algorithm can be converted into a sequential 
algorithm whose running time equals the work of the PRAM algorithm, $t(n) \times p(n)$. We can 
do this by simulating each parallel step of the PRAM algorithm on a sequential 
processor in $p(n)$ time units. This justifies the definition of an optimal algorithm: 
a PRAM algorithm for a problem in NC is optimal if the work of the algorithm 
is asymptotically equal to the fastest sequential computation time for the problem.
An optimal parallel algorithm achieves a high degree of parallelism. Analogously, 
a PRAM algorithm for a problem in NC is efficient if the work of the algorithm 
is within a polylogarithmic factor of the fastest sequential computation time for the 
problem. Designing an optimal algorithm on a CRCW PRAM is easier than on a 
CREW or EREW PRAM. This is because more parallelism can be expressed on a 
CRCW PRAM than on a CREW or EREW PRAM. The class of problems which 
have efficient algorithms remains unchanged regardless of the PRAM model used 
(for example, see theorem 2.2.1). Thus our notion of efficiency is more robust than 
the notion of optimality.
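As a worked illustration of these definitions, consider a hypothetical PRAM algorithm that sums $n$ numbers in $O(\log n)$ time using $n/2$ processors. The figures are assumed for the sake of the example.

```latex
\begin{align*}
t(n) &= O(\log n), \qquad p(n) = n/2, \\
\text{work} &= t(n) \times p(n) = O(n \log n), \\
\text{fastest sequential time} &= \Theta(n).
\end{align*}
```

The work exceeds the sequential time by a factor of $\log n$, so under the definitions above such an algorithm would be efficient but not optimal.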
Concerning speed of computation, one might expect that it is possible to discover 
parallel algorithms that run in constant time. Research on lower bounds for parallel 
computation indicates that this goal is unachievable for almost all interesting
problems. For example, there is a lower bound of $\Omega(\log n / \log\log n)$ time for 
the parity problem of $n$ bits on a priority CRCW PRAM with a polynomial number 
of processors [14]. This is because the parity depends on all the inputs. 
In contrast, the OR or AND of $n$ Boolean variables can be computed 
in constant time on a CRCW PRAM, because the outcome can be determined 
by a single input. However, this computation requires $\Omega(\log n)$ time on a CREW 
PRAM with no restriction on the number of processors [34]. Since interesting 
problems such as the basic PRAM subcomputations of prefix computation and list 
ranking (described in section 2.4) are typically at least as hard as computing the 
parity of $n$ bits or the OR of $n$ variables, we see that constant time parallel computation 
is not attainable for most interesting problems.
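The constant time OR mentioned above can be sketched as follows. This is a sequential simulation of a single parallel step on a common CRCW PRAM; the function name `crcw_or` is an assumption made for illustration.

```python
def crcw_or(bits):
    """Simulate the one-step OR of n bits on a common CRCW PRAM.

    Processor i inspects bits[i]; every processor that sees a 1 writes 1
    into the same shared cell. All writers write the same value, so the
    common write rule is respected. The loop only simulates what the
    processors would do simultaneously in one parallel step.
    """
    result = 0              # shared cell, initialised to 0
    for bit in bits:        # conceptually: for all i in parallel
        if bit:
            result = 1      # concurrent write of the common value 1
    return result
```

The same trick does not work for parity, because no single input decides the answer.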
2.3 Basic PRAM techniques
A number of general techniques and principles in common use in the design of 
parallel algorithms are described in this section.
2.3.1 Balanced tree
The balanced binary tree is a fundamental structure in parallel computation. A tree 
is a connected graph containing no circuits, in which one vertex is distinguished as 
a root. In a tree any vertex of degree one, unless it is the root, is called a leaf. 
A node is said to be an internal node if the node is not a leaf. If $(x, y)$ is an edge of 
a tree such that $x$ lies on the path from the root to $y$, then $x$ is said to be the parent of
$y$ and $y$ is the child of $x$. A tree is balanced if each internal node has the same number 
of children. A balanced tree is called a complete binary tree if each internal node has 
two children and the number of nodes is $2n - 1$, where $n$ is the number of leaves. 
The minimum distance between a leaf and the root in a complete binary tree is $\log n$, 
where $n$ is the number of leaves.
It is precisely because such logarithmic depth structures are used (either explicitly 
or implicitly) that polylogarithmic time complexities are attained for many PRAM 
algorithms. In the PRAM model, the balanced binary tree is employed as follows. 
Data for a problem are placed at the leaves, and each internal node corresponds to 
the computation of a subproblem. The subproblems are solved in bottom-up order 
(or in one or more sweeps up and down the tree), with those at the same level in the 
tree being computed in parallel.
For example, consider the problem of adding $n$ numbers. Let $n = 2^m$ and let $A$ be an 
array of length $2n$. The numbers whose sum is to be found can be placed at the leaves 
of a tree, i.e. we store the $n$ numbers in the locations $A(n), A(n+1), \ldots, A(2n-1)$. 
At each level of the tree, numbers are added together in pairs by different processors 
in parallel and the result sent to the next level as follows [52].
for $k \leftarrow m - 1$ step $-1$ to $0$ do
    for all $j$, $2^k \le j \le 2^{k+1} - 1$, in parallel do $A(j) \leftarrow A(2j) + A(2j+1)$
A processor assigned to an internal node reads the values of the left child and the
right child from the corresponding locations, then writes the sum of the values in
the location corresponding to the internal node. If the location corresponding to
an internal node is $A(j)$ then the locations corresponding to its left child and right
child are respectively $A(2j)$ and $A(2j+1)$. At the end of the computation $A(1)$,
the location corresponding to the root, stores the result. The depth of the tree is
bounded by $\lceil \log n \rceil$ and so the computation time is $O(\log n)$ using $n/2$
processors. This problem clearly belongs to NC. It will be shown later how to reduce
the number of processors to achieve an optimal work measure.
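The bottom-up sweep can be sketched sequentially as follows. This is only an illustrative simulation: the function name `tree_sum` is an assumption, the inner loop stands in for the processors that would act simultaneously on a PRAM, and the array is indexed so that node j has children 2j and 2j+1.

```python
def tree_sum(values):
    """Sum n = 2**m numbers by one upward sweep over a complete binary tree."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    m = n.bit_length() - 1          # n = 2**m
    a = [0] * (2 * n)               # a[0] is unused; the leaves are a[n .. 2n-1]
    a[n:] = values
    for k in range(m - 1, -1, -1):  # levels from just above the leaves to the root
        # On a PRAM each j in this range would be handled by its own processor.
        for j in range(2 ** k, 2 ** (k + 1)):
            a[j] = a[2 * j] + a[2 * j + 1]
    return a[1]                     # the root holds the total
```

The outer loop runs m = log n times, which corresponds to the logarithmic parallel time; the first iteration touches n/2 nodes, which corresponds to the processor count.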
2.3.2 Doubling
This technique is normally applied to an array or to a list of elements. Each element
knows the location of the next element in the data structure. The computation
proceeds by a recursive application of the calculation in hand to all elements over a
certain distance (in the data structure) from each individual element. This distance
doubles in successive steps. Thus after $k$ stages the computation has been performed
(for each element) over all elements within a distance of $2^k$.
For example, consider an array $A$ of $n$ elements which specifies a set of rooted
directed trees, a forest $F$. We have $A(i) = j$ if $j$ is the parent of $i$ in a tree
of $F$, for $1 \le i \le n$, and if $i$ is a root then $A(i) = i$. Suppose, for
each $i$, $1 \le i \le n$, we want to determine the root of the tree containing the node
$i$. Let $s(i)$ be the successor of node $i$, i.e. initially $s(i) = A(i)$, for $1 \le i \le n$.
The successor of each node, $s(i)$, is replaced by the successor's successor, $s(s(i))$,
in successive steps. If a processor is assigned to each array location then $O(\log h)$
steps are sufficient for this computation. Here $h$ is the maximum height of any tree
in the forest. Sometimes this technique is referred to as pointer jumping.
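A minimal sketch of this root-finding computation is given below. It uses 0-based indices rather than the 1-based indices of the text, and the name `find_roots` and the returned round counter are assumptions added for illustration. Each round builds a new successor list from the old one, which mimics the synchronous read-then-write behaviour of a PRAM step.

```python
def find_roots(parent):
    """parent[i] is the parent of node i; a root satisfies parent[i] == i.

    Returns the root of every node together with the number of
    pointer-jumping rounds that were needed.
    """
    s = list(parent)
    n = len(s)
    rounds = 0
    # Stop once every successor is a root (a root is its own successor).
    while any(s[i] != s[s[i]] for i in range(n)):
        # One synchronous step: every node jumps to its successor's successor.
        s = [s[s[i]] for i in range(n)]
        rounds += 1
    return s, rounds
```

For a chain of height 4 the loop finishes after two rounds, in line with the logarithmic bound on the number of steps.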
The doubling technique is not only applicable to arrays and lists. For example, it can
be used to compute $A^n$ for a Boolean matrix $A$. The computation can be performed
in at most $2\lfloor \log_2 n \rfloor$ matrix multiplications. First we compute
$A^2, A^4, A^8, \ldots, A^k$ by squaring the matrix successively, where $k$ is the largest
number such that $k \le n$ and $k$ is a power of 2. Now we can compute $A^n$ by
multiplying together appropriate matrices of the form $A^{2^l}$, $0 \le l \le \lfloor \log_2 n \rfloor$.
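The following sketch illustrates the idea on Boolean matrices stored as lists of lists. The names `bool_mult` and `bool_power` are assumptions, and the sketch interleaves the squarings with the final multiplications (guided by the binary representation of n) rather than storing all the squares first; the multiplication count is returned so that the bound can be checked.

```python
def bool_mult(x, y):
    """Boolean matrix product: or over k of (x[i][k] and y[k][j])."""
    n = len(x)
    return [[any(x[i][k] and y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bool_power(a, n):
    """Return (A**n, number of matrix multiplications used), for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = None      # product of the selected squares so far
    square = a         # holds A**(2**l) for the current bit l
    mults = 0
    while n > 0:
        if n & 1:                      # bit l of n is set: A**(2**l) is needed
            if result is None:
                result = square
            else:
                result = bool_mult(result, square)
                mults += 1
        n >>= 1
        if n:                          # another bit remains: square once more
            square = bool_mult(square, square)
            mults += 1
    return result, mults
```

The number of squarings is the floor of log2 n and the number of combining products is at most the same, giving the stated bound.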
2.3.3 Divide-and-conquer
The divide-and-conquer technique involves partitioning a problem into subproblems,
solving the subproblems, and then combining the solutions to the subproblems to
form the solution for the original problem. The methodology is recursive; that is,
the subproblems themselves may be solved by the divide-and-conquer technique.
This method is widely applicable in sequential computation. In a parallel setting
the method requires that the subproblems at the same level of recursion can be
independently computed in parallel and (in order to reduce the depth and therefore
the computation time) are of similar size.
There are several examples of the divide-and-conquer technique applied in a parallel
setting. Although the field of computational geometry is rather neglected in this
thesis, we complete this description of divide-and-conquer with an application from
this area. Given a finite set $S$ of points in the plane, computing their convex hull,
$CH(S)$, is an important basic problem in computational geometry that arises in
a variety of contexts [38]. $CH(S)$ is the smallest convex polygon containing all
the points of $S$. A polygon $CH(U)$ is convex if, given any two points $p$ and $q$ in
$CH(U)$, the line segment whose endpoints are $p$ and $q$ lies entirely in $CH(U)$. A
tangent of a convex polygon $CH(U)$ is a line passing through a vertex of $CH(U)$
such that $CH(U)$ lies entirely on one side of the line. A tangent is called the upper
(lower) common tangent of two convex polygons, $CH(U)$ and $CH(V)$, if it is
a common tangent such that $CH(U)$ and $CH(V)$ are both below (above) it.
$CH(S)$ can be constructed as follows:
1. Sort the set $S$ in some fixed direction (e.g. in the $x$- or $y$-direction). Let $U$ be
the first $|S|/2$ points in this sorted order, and $V$ the remainder.
2. Recursively construct $CH(U)$ and $CH(V)$ in parallel.
3. Compute the upper and lower common tangents of $CH(U)$ and $CH(V)$.
4. Combine $CH(U)$ and $CH(V)$ by using the upper and lower common tangents
of $CH(U)$ and $CH(V)$, to form $CH(S)$.
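The recursive structure of steps 1 to 4 can be sketched as below. Note the substitution made here: instead of computing the common tangents, the merge step simply rescans the vertices of the two sub-hulls with a monotone chain scan, which yields the same hull because $CH(S)$ is the hull of the vertices of $CH(U)$ and $CH(V)$. The function names are assumptions, points are (x, y) tuples, and the two recursive calls are written sequentially although they are independent.

```python
def cross(o, a, b):
    """Twice the signed area of triangle o, a, b (positive if counter-clockwise)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_of_sorted(points):
    """Monotone chain scan over lexicographically sorted points.

    Returns the hull vertices in counter-clockwise order, starting from the
    leftmost point; collinear boundary points are dropped.
    """
    if len(points) <= 2:
        return list(points)
    lower, upper = [], []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convex_hull(points):
    pts = sorted(set(points))              # step 1: sort in the x-direction

    def solve(lo, hi):
        if hi - lo <= 3:
            return hull_of_sorted(pts[lo:hi])
        mid = (lo + hi) // 2
        left = solve(lo, mid)              # step 2: the two halves are independent
        right = solve(mid, hi)
        # Steps 3 and 4 (stand-in): rescan the sub-hull vertices rather than
        # locating the upper and lower common tangents.
        return hull_of_sorted(sorted(left + right))

    return solve(0, len(pts))
```

A tangent-based merge would replace the last line of `solve` and is what allows the combination step to be done quickly in parallel; the rescan is used here only to keep the sketch short.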
2.3.4 Reducing the number of processors
Consider a computation $A$ that can be done in $t$ parallel steps. Let $o_i$ be the number
of primitive operations at step $i$. To run $A$ directly on a PRAM in $t$ parallel steps, the
number of processors required is the maximum of the $o_i$, say $m$. Suppose we have
$p < m$ processors. The $i$th step can be computed in time $\lceil o_i/p \rceil$, by
partitioning the $o_i$ operations into $p$ groups and assigning a processor to each of the
groups. For each of the $p$ groups in parallel, each processor will be computing (in
sequential style) for at most $\lceil o_i/p \rceil$ time. Hence the total parallel time is no
more than $t + \lfloor o/p \rfloor$, where $o = \sum_{i=1}^{t} o_i$ is the total number of operations.
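The bound on the total time can be checked step by step. Writing $o$ for the total number of operations over all $t$ steps (a symbol introduced here for the derivation), each ceiling exceeds the corresponding floor by at most one, and a sum of floors never exceeds the floor of the sum:

```latex
\sum_{i=1}^{t} \left\lceil \frac{o_i}{p} \right\rceil
  \;\le\; \sum_{i=1}^{t} \left( \left\lfloor \frac{o_i}{p} \right\rfloor + 1 \right)
  \;\le\; t + \left\lfloor \frac{o}{p} \right\rfloor,
\qquad \text{where } o = \sum_{i=1}^{t} o_i .
```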
For example, consider the algorithm described earlier for computing the sum of $n$
numbers using the balanced binary tree method. This executes in $O(\log n)$ time.
Notice that $n/2$ processors are required for the first step. Suppose that we have
$p < n/2$ processors. We can simulate the first step with $p$ processors as follows.
First we divide the $n$ numbers into $p$ groups, such that the $i$th group has elements
indexed from $i\lceil n/p \rceil$ to $(i+1)\lceil n/p \rceil - 1$, for $0 \le i < p$. The
first $p - 1$ of these groups contain $\lceil n/p \rceil$ elements and the remaining group
contains $n - (p-1)\lceil n/p \rceil$ elements. We assign a processor to each of the
groups. For all of the $p$ groups in parallel, each processor now finds the sum of its
group in sequential style, and the computation takes $\lceil n/p \rceil$ time. We have
reduced the original problem of size $n$ to a problem of size $p$, and this problem can be
solved as described before in $O(\log p)$ time using the $p$ processors. Thus, overall,
we can find the sum of the $n$ numbers in $\lceil n/p \rceil + \log p$ time using
$p < n/2$ processors. Notice that if we set $p = n/\log n$ then we obtain a computation
time of $O(\log n)$ and we thus have an optimal algorithm. The work of the algorithm is
$O(n)$, which is the same as that of the best known sequential algorithm.
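The two phases can be simulated as in the following sketch, which also counts the parallel steps charged to each phase. The name `sum_with_p_processors` is an assumption, the list comprehensions stand in for processors working simultaneously, and odd-sized levels of the tree phase are padded with a zero so that p need not be a power of two.

```python
import math


def sum_with_p_processors(values, p):
    """Return (total, parallel_steps) for summing with p simulated processors."""
    n = len(values)
    size = math.ceil(n / p)                    # group size, ceil(n/p)
    # Phase 1: processor i sums its own group sequentially; all groups
    # proceed in parallel, so the phase costs ceil(n/p) steps.
    partial = [sum(values[i * size:(i + 1) * size]) for i in range(p)]
    steps = size
    # Phase 2: balanced-tree summation of the p partial sums.
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0)                  # pad so that pairs can be formed
        partial = [partial[2 * j] + partial[2 * j + 1]
                   for j in range(len(partial) // 2)]
        steps += 1
    return partial[0], steps
```

With n = 16 and p = 4 the sketch uses 4 steps for the groups and 2 steps for the tree, that is $\lceil n/p \rceil + \log p$ steps in total.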
This is an example of applying Brent's scheduling principle [20], and is often used 
in the design of efficient or optimal parallel algorithms. It should be noted that 
this simulation assumes that processor allocation is not a problem. We will see (in
section 2.4.2) that this is sometimes a nontrivial task.
2.4 PRAM algorithmic tools
In this section we describe the algorithmic tools, known as prefix computation and
list ranking, that have been found to be of wide use in the construction of many
parallel algorithms. One can appreciate this from figure 2.2 [166], which illustrates
how the solutions of some PRAM algorithms depend on the solutions of others.
Such a diagram is an example of so-called structural algorithmics. Let $P_1$ and $P_2$
be problems in figure 2.2 such that $P_1$ is above $P_2$. If there is an edge between
$P_1$ and $P_2$ then the NC algorithm that solves $P_2$ is used to obtain an NC algorithm
to solve $P_1$. Thus, if there is a path between $P_1$ and $P_2$ then all the solutions to
the problems on the path are used to obtain an NC algorithm to solve $P_1$. As an
example, the solution to the prefix sum problem is a subroutine in the solution to the
list ranking problem, the solution to list ranking is a key subroutine of the so-called
Euler tour technique, and so on.
Figure 2.2:
2.4.1 Prefix computation
Given an array $[x_0, x_1, \ldots, x_{n-1}]$ of $n$ elements together with an associative
operator $*$, a prefix computation gives $s_i = x_0 * x_1 * \cdots * x_i$, for
$1 \le i \le n - 1$. There is a simple algorithm for performing prefix computation,
discovered by Ladner and Fischer [86]. The algorithm is perhaps most easily understood
with reference to the complete binary tree of the computation. Let $n$ be a power of 2;
otherwise we add a minimum number of dummy elements to achieve this. At the outset we
store the $n$ elements at the leaves of the tree so that $x_i$ is in the location
corresponding to the $i$th leaf, for $0 \le i \le n - 1$. The leaves are numbered from
$0$ to $n - 1$ from left to right. Levels are numbered from $0$ to $\log n$, from the
level of the leaves upwards. Let any node $j$ of a complete binary tree cover the leaves
from positions $p$ to $q$ (note that if $j$ is a leaf then $p = q$). Then let $A(j)$ and
$B(j)$ be storage locations that will be used respectively to store the values of
$x_p * x_{p+1} * \cdots * x_q$ and $x_0 * x_1 * \cdots * x_q$. The following code
performs the desired computation.
for level = 1 to log n do
    for all j ∈ level, in parallel do compute A(j)
B(root) ← A(root)
for level = log n − 1 to 0 do
    for all j ∈ level, in parallel do compute B(j)
At the end of the computation $x_0 * \cdots * x_j$ is stored in the location $B(j)$
corresponding to the $j$th leaf, for $0 \le j \le n - 1$. For each node a processor can
identify the level of the node and its children as in the computation of the sum of $n$
numbers (see section 2.3.1). Moreover, the first phase of the computation from level 1
(one level above the leaves) to the root can be done as explained in section 2.3.1, by
reading the values from the left child and the right child. In the second phase from the root to the
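leaves, $B(i)$ for the node $i$ can be computed as follows. If the node $i$ is a right
child then $B(i)$ is B(i's parent), since both nodes cover the same rightmost leaf;
otherwise $B(i)$ is B(i's parent) $*$ (A(i's sibling))$^{-1}$, that is, the parent's value
with the contribution of the sibling removed (for addition, B(i's parent) $-$ A(i's sibling)).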
The computation can be run in $O(\log n)$ time on an EREW PRAM since there are no
conflicts in memory accesses. At first it would seem that we need $O(n)$ processors
to achieve this time. Since the input is stored in an array as already described (i.e.
in consecutive memory locations), we can easily achieve an optimal algorithm (i.e.
the same time complexity with $O(n/\log n)$ processors) using Brent's scheduling
principle as in section 2.3.4.
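A minimal sketch of the two sweeps is given below, specialised to ordinary addition so that the contribution of a sibling can be removed by subtraction. In this sketch a right child inherits its parent's B value and a left child takes its parent's B value minus its sibling's A value, which is consistent with the definition of $B(j)$ above; this treatment of the left child, the name `prefix_sums`, and the indexing (node j has children 2j and 2j+1, as in section 2.3.1) are assumptions of the sketch.

```python
def prefix_sums(xs):
    """Inclusive prefix sums of n = 2**m numbers via an up sweep and a down sweep."""
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    a = [0] * (2 * n)       # a[j]: sum of the leaves covered by node j
    b = [0] * (2 * n)       # b[j]: sum of all leaves up to the rightmost leaf of j
    a[n:] = xs
    # First phase: from one level above the leaves up to the root.
    start = n // 2
    while start >= 1:
        for j in range(start, 2 * start):      # one level, in parallel on a PRAM
            a[j] = a[2 * j] + a[2 * j + 1]
        start //= 2
    b[1] = a[1]
    # Second phase: from the children of the root down to the leaves.
    start = 2
    while start <= n:
        for j in range(start, 2 * start):      # one level, in parallel on a PRAM
            if j % 2 == 1:                     # right child: same rightmost leaf
                b[j] = b[j // 2]
            else:                              # left child: drop the sibling's part
                b[j] = b[j // 2] - a[j + 1]
        start *= 2
    return b[n:]
```

For an operator without inverses the second phase would have to be organised differently, for instance by passing down the combination of everything to the left of a node.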
Given an array $A$ of locations storing 0 or 1, the associated parity problem is to
determine whether the number of 1s in the array is even or odd. This problem can be
regarded as a special prefix computation problem. Performing such a computation
on $A$ will leave the result of the parity problem in the rightmost location of $A$. A
lower bound of $\Omega(\log n/\log\log n)$ time is known for the parity problem on a priority
CRCW PRAM with a polynomial number of processors. To match this lower bound
for the prefix computation, Cole and Vishkin [28] described an optimal algorithm
(different from the one described above) which runs in time $O(\log n/\log\log n)$
using $n \log\log n/\log n$ processors on a CRCW PRAM.
The fact that the prefix-sums problem appears at the bottom of figure 2.2 is meant
to convey the basic role of this problem. It appears in many guises. For example,
consider compacting a sparse array. Given an array of $n$ elements, many of which
are zero, we wish to generate a new array containing only the non-zero elements
in their original order. This problem can be solved by assigning a value 1 to the
non-zero elements (and 0 to the others), and performing the prefix sums using
arithmetic addition. Such a computation calculates, for each non-zero element of the
array, the position that the element would have in the new array.
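The compaction can be sketched as follows. The sequential `itertools.accumulate` is used only as a stand-in for the parallel prefix computation, and the name `compact` is an assumption; the final loop performs independent writes, one per element, so it could be carried out by separate processors.

```python
from itertools import accumulate


def compact(xs):
    """Return the non-zero elements of xs in their original order."""
    flags = [1 if x != 0 else 0 for x in xs]
    pos = list(accumulate(flags))        # pos[i]: count of non-zeros in xs[0..i]
    out = [None] * (pos[-1] if pos else 0)
    for i, x in enumerate(xs):           # independent writes to distinct cells
        if flags[i]:
            out[pos[i] - 1] = x          # the prefix sum gives the 1-based position
    return out
```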
2.4.2 List ranking
Given a linked list, the list ranking problem is to calculate for each member of
the list its relative position from the end of the list, i.e. its rank in the list. The
importance of this problem was first identified by Wyllie [169]. An obvious solution
to list ranking can be regarded as a prefix computation using addition of 1s within a
pointer structure. A linked list is an alternative to an array for storing sequences of
elements in shared memory. In an array each element knows its address within the
array whereas in a linked list an element does not know a priori its rank in the list.
Using the pointer jumping technique (as explained in section 2.3.2) we can solve
the list ranking problem. Initially we set distance(k) = 1 for each element k except
the last, for which we set distance(k) = 0. The last element can be easily determined
by looking at its pointer, s(k), because the last element uniquely points to
itself. The algorithm is then described as follows [52].
repeat ⌈log n⌉ times
    for each element k in the list in parallel do
        if s(k) ≠ s(s(k)) then
            distance(k) ← distance(k) + distance(s(k))
            s(k) ← s(s(k))
At the end of the computation distance(k) gives the rank of the element k in the
list. By associating a processor with each element of the list we can solve the list
ranking problem in $O(\log n)$ time. However, the algorithm is not optimal, since
the work of the algorithm is $O(n \log n)$ and the sequential time to rank the list
is $O(n)$. In attempting to get an optimal algorithm for this problem using Brent's
scheduling technique as in the prefix computation we run into a problem. Because
the elements are not initially indexed (as in an array) we cannot assign processors
to begin subcomputations at defined positions on the list.
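The pointer jumping algorithm for list ranking can be simulated as in the sketch below. The list is given by a successor array in which the last element points to itself; the name `list_rank` is an assumption, and fresh copies of the arrays are written in each round so that all reads of a round see the values from the previous round, as they would on a synchronous PRAM.

```python
def list_rank(succ):
    """succ[k] is the successor of element k; the last element has succ[k] == k.

    Returns distance[k], the number of links from k to the end of the list.
    """
    n = len(succ)
    s = list(succ)
    dist = [0 if s[k] == k else 1 for k in range(n)]
    rounds = (n - 1).bit_length() if n > 1 else 0     # ceil(log2 n)
    for _ in range(rounds):
        new_s, new_dist = list(s), list(dist)
        for k in range(n):                            # in parallel on a PRAM
            if s[k] != s[s[k]]:
                new_dist[k] = dist[k] + dist[s[k]]
                new_s[k] = s[s[k]]
        s, dist = new_s, new_dist
    return dist
```

Every element is active in every round, which is where the $O(n \log n)$ work of this approach comes from.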
A substantial amount of effort has been put into finding an optimal algorithm for
list ranking. The key step in one optimal algorithm is to cleverly splice out
elements from the list so that $O(n/\log n)$ elements remain, in $O(\log n)$ time with
$n/\log n$ processors [26]. Then we can solve the list ranking problem on the reduced
list as described above, taking $O(\log n)$ time using $n/\log n$ processors. The original
list is then reconstructed by reinserting the elements that were spliced out. This step
can also be done in $O(\log n)$ time with $n/\log n$ processors. The total work of this
algorithm is $O(n)$, which is the best sequential time to rank the list.
Wyllie conjectured that it is impossible to find an optimal parallel algorithm for this
problem [169]. Cole and Vishkin [26] were able to invalidate Wyllie's conjecture
by describing all details of the above algorithm, which runs on an EREW PRAM.
The drawback to their algorithm is that it is complicated and has very large constant
factors, and it relies on an expander graph construction to solve a scheduling
problem that arises. Anderson and Miller [8] gave another optimal algorithm that
runs in $O(\log n)$ time and uses $n/\log n$ processors on an EREW PRAM, which is
much simpler and has reasonable constant factors. Moreover, this algorithm does
not rely on an expander graph construction, although it is still fairly intricate for
practical purposes.
2.5 Graph algorithms
Graphs play an important role in solving real-world problems. Specifically they play
a major role in important problems in combinatorial optimisation. For example, in
graph colouring we assign colours to a graph such that no two adjacent edges or
vertices have the same colour. Edge colouring and vertex colouring can be used
to solve time-tabling problems. Another important graph problem is to decide
whether a given graph is planar. For example, in the layout of printed circuits one
is interested in knowing if a particular electrical network is planar. Other important
problems are concerned with the so-called connectivity of the graph in question. A
graph is said to be connected if there is a path between any two vertices. A graph
is $k$-vertex (or edge) connected if $k$ is the minimum number of vertices (or edges)
whose removal will disconnect the graph. If we think of a graph as representing
a communication network, the vertex connectivity (or edge connectivity) becomes
the smallest number of communication stations (or communication links) whose
breakdown would jeopardise communication in the system. The higher the vertex
connectivity and edge connectivity, the more reliable the network.
The design of efficient parallel algorithms for graph problems has presented a
challenge since traditional sequential graph search techniques have proved not readily
to admit parallelisation. Sequential optimal algorithms for many graph problems
commonly use one of two methods to search a graph: depth-first search (dfs) or
breadth-first search (bfs) [49]. At present, neither of these methods has an efficient
parallel algorithm, and the most useful of these methods (dfs) is P-complete. Thus
new tools are needed to replace dfs or bfs. One such tool is the process of ear
decomposition search (described in section 2.5.3). We need to avoid dfs or bfs
in the parallel algorithm. For example, computing connected components is often
considered a basic problem and the best sequential algorithm for this problem uses
dfs. Two nodes of a graph are in the same component if there is a path from one to
the other in the graph. An efficient parallel algorithm for connected components on a
CRCW PRAM was described in [70, 73, 144] and the algorithm avoids dfs. Another
example is concerned with finding a topological ordering of a directed acyclic graph,
i.e. assigning a number to each of the vertices such that there is no path from a vertex
to a lower numbered one. This can easily be done in linear time sequentially, but the
algorithm does not obviously lend itself to parallelism. Kucera [85] described an
NC algorithm for this problem using the transitive closure technique (as explained
in section 2.5.4).
The following sections briefly describe algorithmic techniques which can be used as 
building blocks for graph algorithms. These exemplify new paradigms for parallel 
algorithm design.
2.5.1 Euler tour on trees
The construction of a rooted spanning tree, and the computation of various tree
functions (for example, preorder and postorder numbering of vertices in the tree,
distances of each vertex from the root of the tree, and the number of descendants
of each vertex in the tree) are common features in many efficient parallel graph
algorithms. It is often the case for particular algorithms that polylogarithmic
efficiency is obtained simply because the algorithm has been contrived to perform
certain functions on a tree. These functions can be computed by finding a so-called
Euler tour of the tree and performing list ranking on the Euler tour. The Euler tour of
a tree reduces the computation of many tree problems to some form of list ranking.
The Euler tour technique was introduced by Tarjan and Vishkin [152].
An Eulerian circuit is a circuit in a graph which traverses every edge precisely once.
Given an undirected and unrooted tree, by replacing each edge of the tree by two
anti-parallel directed edges an Eulerian circuit (or Euler tour) can be constructed
optimally by a clever use of the adjacency lists of the tree vertices and the optimal
list ranking algorithm (see for example [52], pages 21-24). If we break the Eulerian
tour at an arbitrary edge, fixing some edge $(i, j)$ as the first edge of the list so formed,
then it is easy to see that the tour represents a depth-first traversal of the tree with
$i$ as the root. Tarjan and Vishkin called such a list a traversal list of the tree. The
parent-child relation, preorder and postorder numbering, and the number of descendants
of each vertex can be determined by ranking the traversal list using appropriate
weights [152]. Hence, all these tree functions can be computed in $O(\log n)$ time
using $O(n/\log n)$ processors on an EREW PRAM.
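The construction of the traversal list from adjacency lists, and the recovery of the parent-child relation from it, can be sketched as follows. The tree is assumed to be given as a dictionary mapping each vertex to its list of neighbours, the root is assumed to have at least one neighbour, and the function names are hypothetical. The sequential walk along the tour stands in for list ranking, which is what a parallel algorithm would use to obtain the position of each arc.

```python
def euler_tour(adj, root):
    """Return the traversal list of directed arcs, starting at (root, adj[root][0])."""
    # Position of every neighbour within each adjacency list.
    index = {v: {w: i for i, w in enumerate(nbrs)} for v, nbrs in adj.items()}

    def successor(u, v):
        # The arc following (u, v) is (v, w), where w comes after u
        # (cyclically) in the adjacency list of v.
        nbrs = adj[v]
        return (v, nbrs[(index[v][u] + 1) % len(nbrs)])

    first = (root, adj[root][0])
    tour, arc = [], first
    while True:
        tour.append(arc)
        arc = successor(*arc)
        if arc == first:
            break
    return tour


def parents_from_tour(tour, root):
    """u is the parent of v exactly when arc (u, v) precedes arc (v, u)."""
    rank = {arc: i for i, arc in enumerate(tour)}
    parent = {root: None}
    for (u, v) in tour:
        if rank[(u, v)] < rank[(v, u)]:
            parent[v] = u
    return parent
```

The successor of each arc is determined locally from the adjacency lists, which is why the tour itself can be set up in constant parallel time once each arc knows the position of its reverse.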
We can also use the traversal list to compute the distance of each vertex from the
root of the tree. This proved to be a key subroutine used to compute the biconnected
components of a graph [152]. A graph is biconnected if there is no vertex whose
removal leaves the graph disconnected. Using the Euler tour in trees, Schieber and
Vishkin [140] solved another problem on trees, that of finding the lowest common
ancestor of each pair of vertices. This was then used as part of optimal algorithms
for computing strong orientations of sparse graphs. Given an undirected graph, the
strong orientation problem is to assign a direction to each edge so that the resulting
graph is strongly connected. A graph is strongly connected if for any pair of vertices
$u$ and $v$, there exist directed paths from $u$ to $v$ and from $v$ to $u$.
2.5.2 Tree contraction
Tree contraction is an efficient parallel method of evaluating an expression given
the associated tree. The method transforms the input tree in stages using local
operations in such a way that an $n$-node tree is contracted into a single node by local
contractions in $O(\log n)$ stages, each of which takes constant time on a PRAM.
Optimal algorithms for tree contraction that run in $O(\log n)$ time on an EREW PRAM
are described by Gibbons and Rytter [53] and others in [29, 47].
In addition to expression evaluation, tree contraction has been applied to a wide 
variety of problems. The technique easily generalises to arbitrary (nonbinary) 
trees, and has been used to derive parallel algorithms for various graph-theoretic
computations on trees such as maximum matching, minimum vertex cover and 
maximum independent set [62]. Other applications of tree contraction can be found 
in [105].
2.5.3 Ear decomposition
An ear decomposition of a graph is an ordered collection of simple paths called ears,
such that the end points of each ear appear in previous ears but such that the interior
vertices of each ear appear for the first time in that ear. Ear decomposition search
has been developed for undirected graphs and was suggested as a replacement for
dfs to search undirected graphs. This method provides efficient parallel algorithms
for several graph problems on the PRAM model. This can be seen in figure 2.2.
Several of these parallel algorithms convert to entirely new and optimal sequential
algorithms. This is an example of a new emerging discipline enriching an existing
one. Surveys of these results can be found in [42, 76, 128, 166].
2.5.4 Matrix computations
Matrix computation provides a fundamental tool for placing many graph problems
in NC, using the strategy of repeated matrix multiplication. Let $A = (a_{ij})$ and
$B = (b_{ij})$ be $n \times n$ Boolean matrices. Let $C = AB$ be the product of $A$ and $B$;
that is, the $(i, j)$ entry of $C$ is defined by
$c_{ij} = \bigoplus_{k=0}^{n-1} (a_{ik} \otimes b_{kj})$, where $\oplus$ and $\otimes$
are two binary operators. This can be done in $O(\log n)$ time using $n^{2.376}$ processors
on a CRCW PRAM [35]. However, the algorithm is of theoretical interest only
because it is quite complicated, the big-oh notation hides a large constant factor
in the running time, and the algebraic structure with the binary operators $\oplus$ and
$\otimes$ is required to be a ring. The standard method of multiplication, in $O(\log n)$ time
with $n^3$ processors on an EREW PRAM, still remains the algorithm of practical choice.
The transitive closure of $A$ (denoted by $A^*$) is $\bigoplus_{k \ge 0} A^k$, where $A^0 = I$
(the identity matrix) and $A^k = A^{k-1} \otimes A$ for $k \ge 1$. Let matrix $B$ be the matrix
$I \oplus A$. It can be shown that $A^*$ is equal to a sufficiently high power of $B$ [68].
Hence, the straightforward method of computing the transitive closure of an $n \times n$
matrix is to compute the $2^{\lceil \log_2 n \rceil}$th power of the matrix $B$ using
repeated squaring as we explained in section 2.3.2.
We can solve several (directed) graph problems by taking powers of the adjacency
matrix as in the transitive closure problem. For a given graph $G(V, E)$ of $n$ vertices,
the adjacency matrix of the graph is the $n \times n$ matrix $M = (m_{ij})$ (say) such that
$$m_{ij} = \begin{cases} 1 & \text{if edge } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}$$
The transitive closure of a graph $G(V, E)$ is the graph (denoted by $G^*$) with nodes
$V$ and edges $E^* = \{(i, j) \mid \text{there is a path from } i \text{ to } j \text{ in } G\}$.
Let $M^*$ be the adjacency matrix of $G^*$. To find $M^*$ it is sufficient to compute the
transitive closure of $M$, where $\oplus$ and $\otimes$ are replaced by the logical operators
or and and respectively. Hence, by this computation we can find whether one vertex is
reachable from another in a directed graph.
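A minimal sketch of reachability by repeated squaring is shown below, with the two operators taken to be logical or and logical and. The name `transitive_closure` is an assumption. Because the identity matrix is included, every vertex is reported as reachable from itself, in keeping with the $k = 0$ term of the closure.

```python
def transitive_closure(m):
    """m is an n x n 0/1 (or Boolean) adjacency matrix.

    Returns the Boolean matrix whose (i, j) entry is True exactly when
    j is reachable from i by a path of zero or more edges.
    """
    n = len(m)
    # B = I or M
    b = [[bool(m[i][j]) or i == j for j in range(n)] for i in range(n)]
    power = 1
    while power < n:
        # Square B over the (or, and) operators; paths of length up to
        # 2 * power are now accounted for.
        b = [[any(b[i][k] and b[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        power *= 2
    return b
```

Each squaring is one matrix multiplication, so the number of multiplications is logarithmic in n.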
As a second example consider finding the shortest path between each pair of vertices
in a weighted graph. Here the operators $\oplus$ and $\otimes$ are replaced by min and +
respectively [85], and the input matrix is the edge-weight matrix (i.e. $m_{ij}$ is the
weight of the edge $(i, j)$).
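The same repeated squaring can be sketched with min in place of the first operator and + in place of the second. In this sketch absent edges are represented by infinity, the diagonal is set to zero (the counterpart of including the identity matrix), and the graph is assumed to have no negative-weight cycles; these conventions and the function name are assumptions rather than details taken from the text.

```python
INF = float("inf")


def all_pairs_shortest_paths(w):
    """w[i][j] is the weight of edge (i, j), or INF if the edge is absent."""
    n = len(w)
    d = [[0 if i == j else w[i][j] for j in range(n)] for i in range(n)]
    power = 1
    while power < n - 1:
        # One squaring over (min, +): afterwards d[i][j] is the weight of the
        # lightest path from i to j that uses at most 2 * power edges.
        d = [[min(d[i][k] + d[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        power *= 2
    return d
```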
Several other problems on directed graphs can be solved using this strategy of
repeated matrix multiplication. These include topological sorting and strong
components [128].
Chapter 3
Parallel Approximation Algorithm
A major part of this chapter is the description of an efficient parallel approximation
algorithm for finding minimum weight perfect matching in graphs. This represents
joint work with A. M. Gibbons and N. W. Holloway, which was published in the
proceedings of the 19th International Workshop on Graph Theoretic Concepts in
Computer Science [136]. Preliminary sections of the chapter provide essential
background material.
3.1 Introduction
In sequential complexity theory, it has been the consensus view that the so-called
NP-complete problems (which include literally hundreds of computationally important
problems, many in the area of combinatorial optimisation) are computationally
intractable [3, 46, 49]. Although there is no proof of this fact, so much effort
has been fruitlessly expended in the search for polynomial time algorithms that
most theoreticians now believe that none exist. The absence of polynomial time
algorithms for the NP-complete problems has spawned the development of many
approximation algorithms which run in polynomial time but which provide an
approximation (within guaranteed bounds) to the required result [46]. In a similar
vein, NC approximation algorithms for P-complete problems have been developed
recently [102, 141].
For many problems in P, not much is known concerning their parallel complexity.
For example, although the problem of constructing a minimum weight perfect
matching is known to be in RNC it is not known if it belongs to NC. A problem
is said to belong to the class RNC if the problem can be solved in polylogarithmic
parallel time by a randomised parallel algorithm, using a polynomially bounded
number of processors. A matching in a graph is a set of edges $M$, so that no two
elements of $M$ have a common vertex. If every vertex of a graph is an end-point
of some element of $M$ then $M$ is a perfect matching. Not every graph contains a
perfect matching. Given a weighted graph, a minimum-weight perfect matching is a
perfect matching whose sum of edge weights is a minimum.
In this chapter an NC approximation algorithm is described for finding
minimum-weight perfect matchings in complete weighted graphs satisfying the triangle
inequality. In a graph satisfying the triangle inequality, the weight of any single edge
forming a triangle with two other edges is less than or equal to the sum of the weights
of these other two edges. Such an inequality is satisfied in many natural problems.
The problem that we address is an important sub-task for many problems of
combinatorial optimisation and features, for example, in solutions to Chinese Postman
Problems (CPP) and in approximations to the Traveling Salesman Problem (TSP)
[49]. Given a graph (directed or undirected) and a weight for each edge, the CPP is
the problem of finding a circuit of minimum total weight which contains each edge
at least once. The TSP is the problem of finding a minimum weight circuit of a weighted
graph which visits every vertex at least once.
3.2 Performance ratios of approximation algorithms
We can measure the performance of an approximation algorithm by comparing the
optimal solution and the (approximate) solution produced by the approximation
algorithm [46]. If $Q$ is an optimisation problem and $I$ a particular instance of that
problem, then the performance ratio of an approximation algorithm $A$ on $I$ is given
by $R_A(I)$ defined as follows:
$$R_A(I) = \begin{cases} A(I)/Op(I) & \text{if } Q \text{ is a minimisation problem} \\ Op(I)/A(I) & \text{if } Q \text{ is a maximisation problem} \end{cases}$$
Here $A(I)$ and $Op(I)$ are respectively the approximate solution produced by the
algorithm $A$ on $I$ and the optimal solution for $I$. It is clear from this definition that
$R_A(I) \ge 1$ always. However, a useful performance ratio is a ratio that is known
never to exceed some constrained value for any instance of the problem. Let $R_A$
denote the worst-case performance ratio, over all problem instances. Then generally
we are required to prove that $R_A$ is always constrained for any input.
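As a small illustration of the definition, the ratio for a single instance can be computed from the values of the two solutions. The function name and the restriction to positive solution values are assumptions of this sketch.

```python
def performance_ratio(approx_value, optimal_value, minimisation=True):
    """Ratio of the approximate to the optimal value, oriented so that it is >= 1."""
    if approx_value <= 0 or optimal_value <= 0:
        raise ValueError("solution values are assumed to be positive")
    if minimisation:
        return approx_value / optimal_value    # approximate value is at least optimal
    return optimal_value / approx_value        # approximate value is at most optimal
```

For example, a matching of weight 12 where the minimum is 10 has ratio 1.2, and a solution of value 8 to a maximisation instance with optimum 10 has ratio 1.25.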
3.3 Minimum weight perfect matching
As [88] has emphasised, matching problems have played an important role in the 
foundations of sequential algorithmic complexity theory. This is because they are 
important problems that arise in many guises that can be solved in polynomial time, 
but for which all naive algorithms take exponential time. In fact, it was in Edmonds’ 
celebrated paper [41] on matching algorithms that the connection between tractable
problems and polynomial time solvable problems was first made. It is likely that 
matching problems will play a similar role in the development of parallel algorithmic 
complexity theory.
There are practically no extant algorithms placing matching problems in NC, with
the notable exception of the maximal matching problem [67] and certain algorithms
for special classes of graphs. The class of problems which can be solved in
polylogarithmic expected time using a polynomial number of processors is called RNC
(see, for example, [148] and section 2.5.5 of [88]). Most matching problems can
be solved in parallel using randomness [77, 110] and so belong to the class RNC.
It has been stated [88] that whether a modern definition of a tractable problem in
parallel computation is one that can be solved rapidly with randomisation or
one that can be solved rapidly without randomisation may ultimately depend upon
whether fast parallel algorithms for matching require randomisation.
Even with restrictions on the graph such as completeness and triangle inequality
satisfaction, the problem seems very difficult to place in NC. We have therefore
addressed the problem of finding an NC approximation algorithm. Specifically, we
describe an NC approximation algorithm for the minimum-weight perfect matching
problem for graphs satisfying the triangle inequality for which $R_A = 2\log_3 n$. This
is the first such deterministic algorithm with a sub-linear performance ratio. Previously,
it was known (see [148]) that there is an NC approximation algorithm for the
maximum- (equivalently, minimum-) weight perfect matching problem such that
$R_A = n$.
Karp, Upfal and Wigderson [77] described an RNC algorithm for the minimum
weight perfect matching problem, which runs in $O(\log n \log^2(Wn))$ time (after
the improvements of [45]) using $O(W n^{3.5})$ processors, where $W$ is the maximum
weight of any edge. A faster RNC algorithm was obtained in [110], which runs in
$O(\log^2 n)$ time using $O(m W n^{3.5})$ processors, where $m$ is the number of edges.
These algorithms are in RNC only if $W$ is relatively small (that is, $W = n^{O(1)}$).
Although finding an NC algorithm for minimum weight perfect matching seems to
be hard, there are NC algorithms for finding a perfect matching in special classes
of unweighted graphs. Examples are dense graphs [37], bipartite graphs with a
bounded permanent [60], complements of transitively orientable graphs [63] and line
graphs [111]. However, there is no known deterministic NC algorithm for minimum
weight perfect matching for complete graphs. The best known deterministic parallel
algorithm for complete graphs runs in polynomial time, $O(n^3/p + n^2 \log n)$, using $p$
($\le n$) processors [115].
An exact solution for minimum weight perfect matching can be computed in $O(n^3)$
sequential time by the intricate algorithm of Edmonds [41] and this provides a target
for the work measure of parallel algorithms. There are sequential approximation
algorithms for special graphs [66, 121, 137, 151]. The algorithms of [121] find an
approximate minimum-weight perfect matching for graphs satisfying the triangle
inequality. However, it is not clear that these algorithms can be effectively
parallelised. Even if they could be, they would provide a much more intricate solution
to the problem solved in the following section. One of the virtues of the algorithm
described here is its simplicity.
3.4 Approximate minimum weight perfect matching 
in a complete weighted graph
We start by providing an overview of the algorithm, whose input is a complete
weighted graph $G = (V, E)$ with edge set $E$ and vertex set $V$, where $|V| = n$
and $n$ is even. The first part of the algorithm concerns the construction of a graph
$F = (V', E')$ where $E' \subseteq E$; thus $F$ may be obtained from $G$ by a set of edge
deletions. The essential properties of $F$ will be that each component contains an
even number of vertices and the total sum of its edge weights will be less than
$(2\log_3 n)M$, where $M$ is the sum of edge weights of a minimum-weight perfect
matching in $G$. The second part of the algorithm first constructs, for each component
of $F$, a Hamiltonian circuit, which will be of even length. The sum of the edge weights
over all such circuits is less than $(4\log_3 n)M$. A perfect matching in $G$ is then
obtained by taking alternate edges on each such circuit, choosing (of the two
possibilities presented by each circuit) the possibility of lightest weight. The
weight, $M'$, of the perfect matching constructed in this way satisfies the inequality
$M' < 2M\log_3 n$. We now consider the two parts of the algorithm in more detail.
Afterwards we consider precise details of its parallel execution, justify the bounds
on the approximation and consider the complexity parameters.
The Algorithm
1. Construction of F
The construction is performed over at most $\log_3 n$ phases. At the beginning of each
phase, the vertices of $G$ have been partitioned into disjoint subsets whose union is
$V$. Each such subset is called a super-vertex. If such a super-vertex contains an
even number of vertices of $G$, then it is an even super-vertex, otherwise it is an odd
super-vertex. Before the first phase, every vertex of $G$ is a super-vertex.
Now, the action of each phase is as follows. In the $i$th phase, first construct the
weighted complete graph $G_i$, which is the graph whose vertex set is the set of super-
vertices and in which the edge $(V_j, V_k)$ between super-vertices $V_j$ and $V_k$ has weight equal to
the weight of the lightest edge in $G$ that connects a vertex in $V_j$ to a vertex in $V_k$. Note
that $G_i$ need not satisfy the triangle inequality. Now construct the complete weighted
graph $G'_i$ from $G_i$ as follows. The vertices of $G'_i$ are the odd super-vertices of $G_i$ and
the weight of the edge $(V_j, V_k)$ between the super-vertices $V_j$ and $V_k$ of $G'_i$ is the
weight (that is, the sum of the weights of the edges) of a shortest path between these
super-vertices in $G_i$. We now construct a weighted digraph $G''_i$, whose underlying
graph (that is, the graph obtained by removing the edge orientations) is a subgraph
of $G'_i$. The vertex set of $G''_i$ is the vertex set of $G'_i$. We choose precisely one edge
to be directed from each vertex of $G''_i$. If $e$ is an edge of least weight with such a
vertex as an end-point in $G'_i$, then the edge chosen from the vertex in $G''_i$ is directed
towards the vertex corresponding to the other end-point of $e$ in $G'_i$. This directed
edge has weight equal to the weight of $e$. Each such directed edge of $G''_i$ corresponds
to a path and so to a subset of edges (the edges on the path) of $G$, and, for
all such directed edges, we now add to $F$ these corresponding subsets of edges.
As we shall see later (Lemma 3.4.1), the sum of the weights of the edges added to $F$
over all phases does not exceed $2M\log_3 n$. To complete the description of what happens
within each phase, it remains to say how the super-vertices are constructed for the next
phase. Those super-vertices belonging to the same component of $G''_i$ are coalesced into
larger super-vertices. Provided that not all the super-vertices are even super-vertices, we
enter another phase of the construction of $F$.
2. Construction of the matching from F
The input to this stage of the algorithm is the graph $F$. We are interested in
the partition of the vertices of $G$ that is implied by the components of $F$. Each
component $F_i$ of $F$ has an even number of vertices of $G$. We find a minimum-weight
spanning forest of $F$, that is, a minimum-weight spanning tree $T_i$ for each $F_i$. For
each $T_i$ we then find a pre-order numbering of the vertices. Now, for each $T_i$, such
a numbering defines an even-length circuit in $G$ obtained by visiting the vertices in
the order of their pre-order indices. For each such circuit, we take alternate edges
(of the two possible subsets that can be chosen in this way, we choose that of the
smallest weight) to be edges of the approximate minimum-weight perfect matching.
The total weight (Lemma 3.4.2) of edges chosen to belong to the approximate
minimum-weight perfect matching is less than $2M\log_3 n$.
end of the algorithm
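The first part of the algorithm can be illustrated by a small sequential sketch. It is
not the parallel implementation discussed in this chapter: the function name `build_F`,
the Floyd-Warshall shortest-path step and the union-find bookkeeping are assumptions
made purely for illustration, and ties between equally light edges are broken arbitrarily.

```python
from itertools import combinations

def build_F(w):
    """Sequential sketch of part 1: w is a symmetric weight matrix, len(w) even."""
    n = len(w)
    label = list(range(n))                 # label[v]: super-vertex containing v
    F = set()
    while True:
        groups = {}
        for v in range(n):
            groups.setdefault(label[v], []).append(v)
        ids = list(groups)
        odd = [s for s in ids if len(groups[s]) % 2 == 1]
        if not odd:                        # all super-vertices are even: stop
            return F
        # G_i: lightest edge of G between each pair of super-vertices.
        dist = {a: {b: 0 for b in ids} for a in ids}
        hop = {a: {b: b for b in ids} for a in ids}   # next super-vertex on a shortest path
        link = {}
        for a, b in combinations(ids, 2):
            u, v = min(((x, y) for x in groups[a] for y in groups[b]),
                       key=lambda e: w[e[0]][e[1]])
            link[a, b], link[b, a] = (u, v), (v, u)
            dist[a][b] = dist[b][a] = w[u][v]
        # All shortest paths in G_i (Floyd-Warshall as a sequential stand-in).
        for k in ids:
            for a in ids:
                for b in ids:
                    if dist[a][k] + dist[k][b] < dist[a][b]:
                        dist[a][b] = dist[a][k] + dist[k][b]
                        hop[a][b] = hop[a][k]
        # G''_i: every odd super-vertex points to its nearest odd super-vertex.
        root = {s: s for s in odd}
        def find(s):
            while root[s] != s:
                s = root[s]
            return s
        for a in odd:
            b = min((s for s in odd if s != a), key=lambda s: dist[a][s])
            cur = a
            while cur != b:                # add the edges of G on the path to F
                nxt = hop[cur][b]
                u, v = link[cur, nxt]
                F.add((min(u, v), max(u, v)))
                cur = nxt
            root[find(a)] = find(b)
        # Coalesce the odd super-vertices in each component of G''_i.
        label = [find(s) if s in root else s for s in label]
```

For points on a line with weights given by distances, the sketch pairs off close
neighbours in the first phase and only enters a further phase when odd super-vertices remain.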
3.4.1 Validity of the approximation ratio claim
We now prove that the total weight of the edges of the graph $F$ is less than $2M\log_3 n$.
First notice that within each iterated phase employed in the construction of $F$, each
odd super-vertex is coalesced with at least one other so that the number of odd super-
vertices reduces by a factor of at least two within each phase. However, we continue
to iterate only if there are odd super-vertices remaining and this only happens if at
least three odd super-vertices are coalesced. Thus $\log_3 n$ repetitions are sufficient
to remove all the odd super-vertices (notice that by an elementary theorem of graph
theory, there will always be an even number of odd super-vertices). All we now
need for our proof is the following Lemma.
Lemma 3.4.1 The sum of the edge weights of the edges added to $F$ in each iterated
phase of its construction is less than or equal to $2M$.
Proof In any of the iterated phases used in the construction of $F$, the edges added
to $F$ are those belonging, for every odd super-vertex, to a shortest path from such a
super-vertex to another. We first show that, in $G_i$, there exists a path from any odd
super-vertex to some other odd super-vertex only using edges of a minimum-weight
perfect matching. Consider then any odd super-vertex $V$ of $G_i$. Now because there
are an odd number of vertices of $G$ in $V$, not all these vertices can be matched by
edges of a minimum-weight perfect matching of $G$ which connect pairs of vertices
contained in $V$. There must therefore be an edge of a minimum-weight perfect
matching connecting $V$ to some other super-vertex that is a vertex of $G_i$. This is the
first edge of the path whose existence we wish to prove. If this edge takes us to a
vertex corresponding to an odd super-vertex then we have finished. If it takes us to an
even super-vertex, then it will match one of its constituent vertices and there will be
an odd number remaining to be matched by a minimum-weight perfect matching.
The implication is that there is another edge from this vertex corresponding to an
even super-vertex which takes us on to yet another vertex of $G_i$. Continuing in this
way, we see that there must exist a path from every vertex of $G_i$ corresponding to an
odd super-vertex to a similar vertex and that in $G_i$ such a path only uses edges of a
minimum-weight perfect matching. Notice, of course, that any two such paths may
have edges in common. Let $S_k$ be the subset of edges defining the shortest path
from vertex $k$ (which corresponds to some odd super-vertex) to some other similar
vertex which uses edges of a minimum-weight perfect matching only and let $S'_k$ be
the subset defining the path from vertex $k$ to some odd super-vertex as constructed
by the algorithm.
We need to obtain a worst case bound on the weight of the union of the $S'_k$ in terms of
the weight of the union of the $S_k$. This is because we already have a worst case bound
on the weight of the union of the $S_k$ provided by $M$, the weight of a minimum-weight
perfect matching of $G$. Notice that for all $k$, weight($S'_k$) $\le$ weight($S_k$) because
the algorithm chooses minimum-weight paths; however, it does not follow that the
weight of the union of the $S'_k$ will be less than the weight of the union of the $S_k$
because there may be an entirely different sharing of edges between paths in the two
cases. To obtain a worst case bound, we need to consider the cases in which the $S'_k$ have a
minimum union and the $S_k$ have a maximum union. By maximum (minimum) union
we mean that the number of shared edges of paths between the odd super-vertices
is maximum (minimum). In the latter case, notice that no two of the $S_k$ may share
edges whose combined weight is more than half the weight of the lightest set of the
two, otherwise the $S_k$ would not be shortest paths of their type. The situation of a
maximum union of the $S_k$ corresponds to every path sharing half its weight
with every other path. The case of minimising the weight of the union of edges of
the $S'_k$ is essentially that of practically no sharing (although the detail is a little more
subtle, this observation suffices to achieve the bound we seek). Thus, the weight of
the union of the $S'_k$ is bounded, in the worst case, by twice the weight of the union of
the $S_k$, but the weight of the union of the $S_k$ is bounded by $M$ and so the lemma
follows. □
We have proved that the sum of the weights of the edges of $F$ is bounded by
$2M\log_3 n$. The following Lemma provides a similar bound on the approximation
ratio of the algorithm.
Lemma 3.4.2 The weight, $M'$, of the perfect matching found by the algorithm is
bounded as follows:
$M' < 2M\log_3 n$
where $M$ is the weight of a minimum-weight perfect matching and $n$ is the number
of nodes of $G$. The input $G$ is a complete weighted graph satisfying the triangle
inequality.
Proof The total weight of the edges of the graph $F$ is, by Lemma 3.4.1, bounded by
$2M\log_3 n$. For each component $F_i$ of $F$, the total weight of the edges of $T_i$ is at
most the total weight of the edges of $F_i$, because $T_i$ is a minimum-weight spanning
tree of $F_i$. Hence, the total weight of all the tree edges is bounded by $2M\log_3 n$.
For each $T_i$, consider the standard twice-around-the-spanning-tree circuit (see, for
example, [49]), $C_i$, obtained by visiting the nodes in pre-order and making short-cuts
to avoid re-visiting nodes that have already been visited. Such short-cuts are
always possible because $G$ is complete and they will be short-cuts because the
triangle inequality holds. Thus, for each $i$, the weight of $C_i$ is bounded by twice
the weight of the edges of $T_i$. However, the edges chosen for the matching
from $C_i$ constitute at most half the weight of the circuit and so at most the weight of $T_i$.
Thus, over all such circuits, we choose a weight of edges for the matching which is
less than or equal to the weight of $F$ and the lemma is proved. □
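The step from a spanning tree to matching edges can be sketched as follows. The function
name `matching_from_tree` and the input format (a weight matrix and a list of tree edges
over an even number of vertices) are assumptions for illustration only; the pre-order is
computed sequentially here rather than by the Euler tour technique.

```python
def matching_from_tree(w, tree_edges):
    """Pre-order circuit of a spanning tree, then the lighter set of alternate edges."""
    adj = {}
    for u, v in tree_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    root = min(adj)
    order, seen, stack = [], {root}, [root]
    while stack:                           # iterative pre-order traversal
        u = stack.pop()
        order.append(u)
        for v in sorted(adj[u], reverse=True):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    k = len(order)                         # assumed even
    # Visiting vertices in pre-order and returning to the start gives the
    # short-cut circuit; its edges exist because G is complete.
    circuit = [(order[i], order[(i + 1) % k]) for i in range(k)]
    halves = [circuit[0::2], circuit[1::2]]
    return min(halves, key=lambda es: sum(w[u][v] for u, v in es))
```

Because the two alternating edge sets partition the circuit, the lighter one weighs at
most half the circuit, which is the inequality used in the proof above.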
3.4.2 Parallel execution and complexity of the algorithm
Consider first the construction of the graph $F$. There are $\log_3 n$ phases and within
each, the activities dominating the computation time are the construction of all
shortest paths and the coalescing of super-vertices, which can be achieved by an
algorithm for finding connected components of a graph. As we cite later, there are
well known polylogarithmic time parallel algorithms performing these tasks using a
polynomial number of processors. Other tasks are trivially solved by more efficient
parallel algorithms. Thus, we may express the time-complexity for constructing
$F$ as $O((SP(n) + CC(n)) \log n)$ using $\max(p_{SP}(n), p_{CC}(n))$ processors, where
$SP(n)$ is the parallel time for the all shortest paths problem using $p_{SP}(n)$ processors and
$CC(n)$ is the parallel time for the connected components problem using $p_{CC}(n)$ processors.
Now consider the construction of the approximate minimum-weight perfect matching
from the graph $F$. The dominating activity from the point of view of the
computation time might seem to be that of finding a spanning forest. This can be
done in $O(\log^2 n)$ time using $n^2/\log^2 n$ processors [25]. A pre-order numbering of
tree vertices can be found by the Euler tour technique of [152], and the problem of
choosing a set of alternate edges of least weight from a circuit can be solved by
employing ranking and summing. Note that the parallel time and work required for
the construction of the approximate minimum-weight perfect matching from $F$ are
small compared with those required for the construction of $F$.
Thus overall, we see that the problem of finding an approximate minimum-weight
perfect matching has a parallel solution taking $O((SP(n) + CC(n)) \log n)$ time
using $\max(p_{SP}(n), p_{CC}(n))$ processors. We can now see what this means in terms of
variants of the P-RAM and using the best extant parallel algorithms for the problems
of finding all shortest paths and connected components.
Consider first the best practical (in terms of modest constants hidden by the order
notation) extant solutions for the all pairs shortest path problem. The problem can
be solved in $O(\log^2 n)$ time using $n^3/\log n$ processors on a CREW P-RAM or on
an EREW P-RAM. These algorithms are adaptations (see, for example, [52]) of a
common CRCW P-RAM algorithm of Kucera [85]. Clearly, the same algorithm
will run within the same complexity bounds on an arbitrary CRCW P-RAM. On the
model for which it was described, Kucera's algorithm runs in $O(\log n)$ time using
$n^4$ processors. Now consider the best extant solutions for the connected components
problem. The problem can be solved on an arbitrary CRCW P-RAM in $O(\log n)$
time using $(m + n)$ processors [144]. For the EREW P-RAM [73] (and therefore
for the CREW P-RAM, although an earlier algorithm [70] already existed for this
model) the problem can be solved in $O(\log^{3/2} n)$ time using $O(m + n)$ processors,
where $m$ is the number of edges.
From the preceding information, we have the following theorem.
Theorem 3.4.1 There is an algorithm to find an approximate minimum-weight perfect
matching of a complete weighted graph satisfying the triangle inequality, with
$n$ nodes, having a performance ratio $R_A = 2\log_3 n$ which:
1. runs in $O(\log^2 n)$ time using $n^4$ processors on an arbitrary CRCW P-RAM.
2. runs in $O(\log^3 n)$ time using $n^3/\log n$ processors on either a CREW P-RAM
or an EREW P-RAM.
For all P-RAM models, the problem of finding all shortest paths (both in terms of
time complexity and work) dominates the computation time and the work measure
(that is, the processor number, computation time product). The best sequential time-
complexity for an exact solution on complete graphs, $O(n^3)$, can still be achieved
by the primal-dual algorithm of Edmonds [40] as improved by Gabow [44] and
Lawler [87]. Thus, our algorithm is within a factor of $\log^2 n$ of the work measure
of Edmonds' algorithm. The faster computation afforded by implementation on the
CRCW P-RAM comes (as is commonly the case for P-RAM implementations) at a
high cost in terms of numbers of processors required.
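The arithmetic behind these figures can be written out as follows. This is only a
restatement of the bounds quoted above; the symbols $T$ and $W$ for time and work are
introduced here for illustration.

```latex
\begin{align*}
T_{\mathrm{CREW}}(n) &= O\bigl((SP(n) + CC(n))\log n\bigr)
   = O\bigl((\log^2 n + \log^{3/2} n)\log n\bigr) = O(\log^3 n),\\
W_{\mathrm{CREW}}(n) &= \frac{n^3}{\log n}\cdot O(\log^3 n) = O(n^3\log^2 n),\\
T_{\mathrm{CRCW}}(n) &= O\bigl((\log n + \log n)\log n\bigr) = O(\log^2 n),\\
W_{\mathrm{CRCW}}(n) &= n^4\cdot O(\log^2 n) = O(n^4\log^2 n).
\end{align*}
```

The CREW and EREW work measure exceeds the $O(n^3)$ sequential bound by the stated factor
of $\log^2 n$, whereas the CRCW variant pays a further factor of $n$ in work.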
3.5 Further work
There are many open problems in the area of parallel approximation. In general, the
question of approximating certain intransigent problems in $NC$, although often
difficult, does provide some hope for fast parallel computation which may not otherwise
exist. We believe that work in the area of parallel approximation is bound to play
an important role in a deep understanding of parallel computation. The following
items describe some further work in this area.
1. The existence of parallel algorithms. Some P-complete problems do not
have an approximating solution in $NC$ for any value of the performance ratio
unless $NC = P$ [81, 82, 141]. One such problem is Linear Programming
($LP$) [141]. Given an integer $n \times d$ matrix $A$, an integer $n \times 1$ vector $b$, and
an integer $1 \times d$ vector $c$, the $LP$ problem is to find a rational $d \times 1$ vector $x$ such
that $Ax \le b$ and $cx$ is maximised. Identifying such problems will further
refine the complexity classes.
2. Threshold behavior. Some P-complete problems exhibit threshold behavior.
For certain values of the performance ratio, an approximation algorithm exists
while no such algorithm can exist for other values [9, 142, 141]. For example,
consider the High Degree Subgraph ($HDS$) problem, defined as follows: given
a graph $G$, compute the largest $d$ such that $G$ has an induced
subgraph in which every node has degree at least $d$. This problem cannot be approximated
in $NC$ with a performance ratio $R_A < 2$ unless $P = NC$, but it can be
approximated to within any $R_A$ for $R_A > 2$ by an algorithm in $NC$ [9]. Many

Chapter 4
Realistic Issues
4.1 Introduction
The PRAM is a virtual design-space for a parallel computer, that is, the PRAM
is a theoretical model of an idealised parallel computer. The PRAM assumes
constant-length data paths from every processor to every memory cell. In current
technology, this quickly becomes physically unrealisable as we scale up the number
of processors and the size of the shared memory. In feasible large scale parallel
computers, packing constraints such as this make the use of communication
networks inevitable. A feasible parallel model consists of processing elements
each placed at a node of an interconnection network. The machine memory is
split into modules each of which is also located at a node of the network. Consequently,
the network does not generally provide constant time access between every
processor and every memory location. Further delays can occur due to congestion
in the network and contention at the memory modules. The need for such networks
may be obviated in the long term by the appearance of new technologies (optical
communication provides one such hope [8, 99, 131]), but for the foreseeable future
we will have to contend with the complication of networks with communication
delay and restricted message passing density. The PRAM does not account for
possible processor failure or breakdown of communication links. Moreover,
the PRAM assumes that it operates in lock step synchrony, but the processors of the
feasible models may or may not have to be synchronised. This chapter reviews all
of these issues. In chapter 6 we demonstrate that, using the facts described in this
chapter, there is no theoretical hindrance to designing a feasible large scale parallel
computer.
4.2 Interconnection Networks
A feasible parallel computer is a number of processing elements interconnected
by a network. Each processing element may be thought of as a conventional random
access machine. The interconnection network can be depicted by a graph in which
edges represent communication links and nodes represent processing elements or
switching elements. Each processing element or switching element has the ability
to route messages (one in unit time) to adjacent nodes of the graph.
Interconnection networks can be broken into two broad classes depending on whether or
not a processing element is located at every node. In a multistage or indirect network,
the computing elements and memory modules are interconnected by a network of
switches. In a direct network, there is a computing element (processing element)
and a memory module at each node of the network. Note that, in both kinds of network,
each computing element can access any memory module. Hence, we consider the
union of the memory modules in both networks as a virtual shared memory.
The cost of communication is related to the topology of the graph. To reflect physical
packaging constraints we require that the degree of the graph should be small. The
degree of a graph is the maximum number of edges attached to any node of the
graph. The diameter of a graph is the maximum, over all pairs of nodes, of the length
of a shortest path between them. Two processors may be separated by a path of this
length. Therefore the diameter provides a lower bound on the worst-case communication
delay in the graph. Thus we require that both the degree and the diameter of a graph are
small (preferably constant or growing slowly, for example logarithmically with the
size of the graph). Note that we should consider the tradeoff between diameter and
degree. For example, Moore graphs, which are graphs of minimum diameter for
fixed network size and degree, have been studied in [69]. However, the need to be able
to support a high parallel message passing density dictates a move a little away from
optimality in the Moore sense. Also we may advantageously use recursively decomposable
graphs, which not only naturally support recursive algorithms but can also aid physical
construction and size enhancement. Here we describe some of the networks that
have been proposed for general purposes [88, 124, 146, 147, 153].
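The two measures just defined are easy to compute for a small network. The following
sketch assumes a connected graph stored as a dictionary mapping each node to its set of
neighbours; the function names and the ring example are hypothetical.

```python
from collections import deque

def degree(adj):
    """Maximum number of edges attached to any node."""
    return max(len(nbrs) for nbrs in adj.values())

def diameter(adj):
    """Longest shortest path, found by a breadth-first search from every node."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

# Example: a ring of 8 nodes has constant degree but a diameter linear in its size.
ring = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
```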
4.2.1 The hypercube family
Figure 4.1: Hypercube
The $d$-dimensional hypercube, $d$-hypercube, has $n = 2^d$ nodes and $d2^{d-1}$ edges.
For example, 1, 2 and 3 dimensional hypercubes are shown in figure 4.1 (a), (b)
and (c) respectively. Nodes are addressed by binary strings of length $d$ and edges
connect binary strings which differ in precisely one bit position. As a consequence,
each node is adjacent to $d = \log n$ other nodes. Thus the degree of the $d$-hypercube
is $\log n$. The length of a shortest path between any pair of nodes is the number of
positions in which their binary strings differ. Since the binary strings of any two nodes
differ in at most $d$ positions, the diameter of the hypercube is $d$. A shortest path
from a node, $A$, to a node, $B$, can be constructed by successively visiting the nodes
whose labels are those obtained by modifying the bits of $A$ one by one (for example,
from the leftmost bit to the rightmost bit) in order to transform $A$
into $B$. For example, the path constructed in this way between 000 and 011 in the
3-hypercube is 000 → 010 → 011.
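A minimal sketch of this construction, assuming that nodes are represented as integers
whose binary expansions are the $d$-bit addresses (the function names are hypothetical):

```python
def hypercube_neighbours(x, d):
    """Nodes whose addresses differ from x in exactly one of the d bit positions."""
    return [x ^ (1 << i) for i in range(d)]

def bit_fixing_path(a, b, d):
    """Shortest path from a to b, correcting differing bits from left to right."""
    path, cur = [a], a
    for i in reversed(range(d)):           # leftmost (most significant) bit first
        if ((cur ^ b) >> i) & 1:
            cur ^= 1 << i
            path.append(cur)
    return path
```

The length of the returned path equals the number of bit positions in which the two
addresses differ.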
Edges which connect nodes which differ in the $i$th position are called edges of the
$i$th dimension. The hypercube is a recursively decomposable graph. If we remove
the $i$th dimension edges of the $d$-hypercube, for $1 \le i \le d$, then we get two disjoint
copies of a $(d-1)$-hypercube. Conversely, a $d$-hypercube can be constructed from
two $(d-1)$-hypercubes, by joining every vertex of the first $(d-1)$-hypercube to the
vertex of the second having the same number (or binary string). Indeed, it suffices
to renumber the nodes of the first cube as $b_1b_2 \cdots b_{d-1}0$ (add 0 as last bit) and
those of the second as $b_1b_2 \cdots b_{d-1}1$ (add 1 as last bit), where $b_1b_2 \cdots b_{d-1}$ is a
binary string representing the two similar nodes of the $(d-1)$-hypercubes.
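The recursive construction can be sketched directly. The function name
`hypercube_edges` and the integer encoding of addresses are assumptions for illustration.

```python
def hypercube_edges(d):
    """Edge set of the d-hypercube built from two copies of the (d-1)-hypercube."""
    if d == 0:
        return set()
    edges = set()
    for u, v in hypercube_edges(d - 1):
        edges.add((2 * u, 2 * v))              # first copy: append 0 as last bit
        edges.add((2 * u + 1, 2 * v + 1))      # second copy: append 1 as last bit
    for x in range(2 ** (d - 1)):
        edges.add((2 * x, 2 * x + 1))          # join the two similar nodes
    return edges
```

The number of edges produced satisfies the recurrence $E(d) = 2E(d-1) + 2^{d-1}$, whose
solution is $d2^{d-1}$.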
The cube-connected cycles, butterfly and Benes networks
Although the hypercube is quite powerful from a computational point of view, the
degree of the hypercube grows logarithmically with its size. This is a disadvantage of
its use as an interconnection network for parallel computation. The cube-connected
cycles [123], the butterfly network and the Benes network [15] can be regarded as
constant-degree variations of the hypercube. These networks have properties similar
to those of the hypercube.
Figure 4.2:
A $d$-dimensional cube-connected cycles ($d$-cube-connected cycles) is a $d$-hypercube
in which each node of the hypercube has been replaced by a cycle of $d$ nodes. Hence
the $d$-cube-connected cycles has $n = d2^d$ nodes. The $i$th dimensional edge
originally incident to a node of the hypercube is now connected to the $i$th node of the
corresponding cycle. Each node of the cube-connected cycles can be represented by
a pair $(p, c)$, where $p$ is the position of the node in the cycle $c$. For each node $(p, c)$
of the $d$-cube-connected cycles, $c$ is a $d$-bit binary string and $1 \le p \le d$, since there
are $d$ nodes in any cycle of the $d$-cube-connected cycles and each cycle corresponds
to a node of the $d$-hypercube. Two nodes of the $d$-cube-connected cycles $(p, c)$
and $(p', c')$ are connected by an edge if and only if $p = p'$ and $c$ differs from $c'$ in
precisely the $p$th bit, or $c = c'$ and $|p - p'|$ is 1 or $d - 1$. Figure 4.2 illustrates the
graph for $d = 3$. It is easy to see that, for all $d \ge 3$, the degree of a $d$-cube-connected
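cycles is three. The diameter of a $d$-cube-connected cycles is at most
$2d - 1 + \lfloor d/2 \rfloor$, that is, $O(d) = O(\log n)$. This is because a message from $(p, c)$
to $(p', c')$ can reach some node $(p'', c')$ of the destination cycle within $2d - 1$ steps
(moving once round the cycle and correcting the differing bits of $c$ one dimension at a
time), and then $(p'', c')$ to $(p', c')$ takes at most $\lfloor d/2 \rfloor$ steps.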
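The adjacency rule for the cube-connected cycles can be checked with a short sketch.
The function name `ccc` is hypothetical, cycle labels are encoded as integers, and the
$p$th bit is counted from the left as in the text; the sketch assumes $d \ge 3$.

```python
def ccc(d):
    """Adjacency sets of the d-cube-connected cycles; nodes are pairs (p, c)."""
    adj = {}
    for c in range(2 ** d):
        for p in range(1, d + 1):
            adj[(p, c)] = {
                (p % d + 1, c),                # next node on the same cycle
                ((p - 2) % d + 1, c),          # previous node on the same cycle
                (p, c ^ (1 << (d - p))),       # hypercube edge: flip the p-th bit
            }
    return adj
```

For $d = 3$ the graph has $3 \cdot 2^3 = 24$ nodes and every node has exactly three
neighbours.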
Figure 4.3: (butterfly of dimension 3; levels 0 to 3 from left to right)
Figure 4.3 shows the butterfly of dimension 3. In general, a $d$-dimensional butterfly,
$d$-butterfly or $n$-input butterfly, has $2^d(d + 1)$ (or $n(\log n + 1)$) nodes and $2^{d+1}d$ (or
$2n \log n$) edges. The butterfly network is an example of a multistage network. The
nodes are divided into $(d + 1)$ levels with $2^d$ nodes in each level. Sometimes, the
nodes on level 0 are called inputs and the nodes on level $d$ are called outputs. Let
node $(r, l)$ refer to the node on the $r$th row and the $l$th level, where $r$ is a $d$-bit binary
string and $0 \le l \le d$. There is an edge between nodes $(r, l)$ and $(r', l')$ if and only if
$l = l' + 1$ and either $r = r'$ or $r$ and $r'$ differ in precisely the $l$th bit. The entire network
is made of such "butterfly" patterns, hence the name. The butterfly has an interesting
recursive structure: by removing the level 0 nodes of the $d$-butterfly we obtain two
$(d-1)$-butterflies. Moreover, a $d$-hypercube can be obtained from a $d$-butterfly by
collapsing each row of nodes, i.e. the $i$th node of the hypercube corresponds to the $i$th
row of the butterfly.
There is a unique path of length $\log n$ (or $d$) from every input node to every output
node in the $d$-butterfly. This unique path is referred to sometimes as the logical path.
The path between input node $u$ and output node $v$ can be constructed by using the
edge from an $i$th level node $(w, i)$ to an $(i + 1)$th level node $(w', i + 1)$, $0 \le i \le d - 1$, such
that $w$ and $w'$ differ in the $(i + 1)$th bit if $u$ and $v$ differ in the $(i + 1)$th bit; otherwise
$w = w'$. In other words, the path from an input to an output moves downwards
at a node on the $i$th level, $0 \le i \le d - 1$, if the $(i + 1)$th bit of the destination is
a 1, and upward otherwise. The unique path from (001, 0) to (100, 3) is shown in
figure 4.3. As a simple consequence of this fact, we can see that any two nodes in
the $d$-butterfly can be connected by a path of length at most $2 \log n$ (or $2d$). The
diameter of the $d$-butterfly is $2 \log n$ or $2d$, and the (maximum) degree is 4.
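The edge rule and the logical path can be sketched as follows, with rows encoded as
integers and bits counted from the left as in the text. The function names are
hypothetical.

```python
def butterfly_edges(d):
    """Edges ((r', l-1), (r, l)) with r' = r or r' differing from r in the l-th bit."""
    edges = []
    for l in range(1, d + 1):
        for r in range(2 ** d):
            edges.append(((r, l - 1), (r, l)))                        # straight edge
            edges.append(((r ^ (1 << (d - l)), l - 1), (r, l)))       # cross edge
    return edges

def logical_path(u, v, d):
    """Unique path from input (u, 0) to output (v, d)."""
    row, path = u, [(u, 0)]
    for i in range(d):
        mask = 1 << (d - 1 - i)            # the (i+1)-th bit from the left
        if (row ^ v) & mask:
            row ^= mask                    # take the cross edge
        path.append((row, i + 1))
    return path
```

For example, the path from (001, 0) to (100, 3) passes through rows 101, 101 and 100
on levels 1, 2 and 3.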
Figure 4.4:
A number of variations of the butterfly have been proposed in the literature (for
example see [84]) which include the wrapped butterfly, the omega network, the
flip network, the Banyan and Delta networks, the $k$-ary butterflies and the Benes
network. Among the variations the Benes network has an interesting property.
The $d$-dimensional Benes network, $d$-Benes network, is constructed from two
$d$-butterflies connected back-to-back. Figure 4.4 shows the 3-Benes network. On the
Benes network of dimension $d$ we can construct edge-disjoint paths, one path from
each node on level 0 to a unique node at level $2d$, for any permutation of the inputs
to the outputs [15, 50, 168].
4.2.2 The shuffle-exchange and de Bruijn networks
Figure 4.5:
Figure 4.6:
The $d$-dimensional shuffle-exchange graph, $d$-shuffle-exchange graph, has $n$ nodes
and $3n/2$ edges, where $n = 2^d$. Each node $i$, for $0 \le i \le n - 1$, is labelled with a
corresponding $d$-bit binary string [150]. There are two types of edges: a shuffle edge
connects any node $b_1b_2 \cdots b_d$ to the node $b_2b_3 \cdots b_db_1$, and an exchange edge connects
any node $b_1b_2 \cdots b_d$ to the node $b_1b_2 \cdots b'_d$, where $b_d \ne b'_d$. Hence the maximum
degree of the $d$-shuffle-exchange graph is three, for all $d$. For example, see figure 4.5
for a 3-dimensional shuffle-exchange graph. In the figure dashed edges represent
exchange edges, and solid edges represent shuffle edges. The diameter of a
$d$-shuffle-exchange graph is $2 \log n - 1$ or $2d - 1$. We can construct a path of length at
most $2 \log n - 1$ between any two nodes in the $d$-shuffle-exchange graph by simply
using the exchange edges and shuffle edges alternately.
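The degree and diameter claims can be checked on small instances with the sketch below.
It assumes integer node labels and treats edges as undirected; the shuffle self-loops at
the all-zeros and all-ones nodes are dropped, so the simple graph has fewer than $3n/2$
edges. The function names are hypothetical.

```python
from collections import deque

def shuffle_exchange(d):
    """Adjacency sets of the d-shuffle-exchange graph (self-loops omitted)."""
    n = 2 ** d
    adj = {x: set() for x in range(n)}
    for x in range(n):
        shuffled = ((x << 1) | (x >> (d - 1))) & (n - 1)   # cyclic left shift
        for y in (shuffled, x ^ 1):                        # shuffle edge, exchange edge
            if y != x:
                adj[x].add(y)
                adj[y].add(x)
    return adj

def diameter(adj):
    """Longest shortest path, by breadth-first search from every node."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best
```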
The $d$-dimensional de Bruijn graph has $n$ nodes, where $n = 2^d$. The nodes are
labelled from 0 to $n - 1$ with binary labels. The graph has $2n$ edges. There is an
edge between two nodes $u$ and $v$ if and only if $u = b_1b_2 \cdots b_d$ and $v$ is $b_2b_3 \cdots b_d0$,
$b_2b_3 \cdots b_d1$, $0b_1b_2 \cdots b_{d-1}$ or $1b_1b_2 \cdots b_{d-1}$. The $d$-dimensional de Bruijn graphs have
diameter $d$ ($= \log n$) and their degree is four. Figure 4.6 shows the 3-dimensional
de Bruijn graph. In chapter 5 we shall look at this graph in more detail.
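A small sketch of this definition, assuming integer node labels and undirected edges
with self-loops omitted (the function names are hypothetical):

```python
from collections import deque

def de_bruijn(d):
    """Adjacency sets of the d-dimensional de Bruijn graph (self-loops omitted)."""
    n = 2 ** d
    adj = {x: set() for x in range(n)}
    for x in range(n):
        for bit in (0, 1):
            y = ((x << 1) & (n - 1)) | bit     # b2 ... bd followed by the new bit
            if y != x:
                adj[x].add(y)                  # the reverse direction supplies the
                adj[y].add(x)                  # neighbours 0 b1 ... and 1 b1 ...
    return adj

def diameter(adj):
    """Longest shortest path, by breadth-first search from every node."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best
```

Each step of a path changes the number of 1s in a label by at most one, so the all-zeros
and all-ones nodes are at distance exactly $d$.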
Figure 4.7:
4.2.3 Meshes
Meshes are another class of network that have received wide attention. A
$d$-dimensional mesh, $d$-mesh, of dimensions $d_1, d_2, \ldots, d_d$ has nodes
$\{1, \ldots, d_1\} \times \{1, \ldots, d_2\} \times \cdots \times \{1, \ldots, d_d\}$ and edges connecting each node of the form
$[x_1, x_2, \ldots, x_i, \ldots, x_d]$ with the nodes $[x_1, x_2, \ldots, x_i \pm 1, \ldots, x_d]$, for $1 \le i \le d$. Such
a mesh is called a $d_1 \times d_2 \times \cdots \times d_d$ mesh. The diameter of a $d$-mesh is
$(d_1 - 1) + (d_2 - 1) + \cdots + (d_d - 1)$ and it has degree $2d$, if each $d_i \ge 3$.
If all the dimensions are of the same size, $N$ (say), then the diameter is the
minimum for a fixed number of nodes and $d$. That is, the number of nodes $n$ is equal to
$N \times N \times \cdots \times N = N^d$ and the diameter is $\Theta(n^{1/d})$. Moreover, if all the dimensions are
equal to 2 then the $d$-mesh is the $d$-hypercube. A $(4 \times 5)$ 2-mesh, a $(4 \times 4 \times 2)$ 3-mesh
and a $(2 \times 2 \times 2 \times 2)$ 4-mesh (or 4-hypercube) are shown in figure 4.7 (a), (b) and (c)
respectively.
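The node and edge sets of a mesh, and the diameter formula, can be sketched as follows.
The function names and the tuple representation of nodes are assumptions for
illustration.

```python
from itertools import product

def mesh(dims):
    """Adjacency lists of the mesh with side lengths dims = (d_1, ..., d_d)."""
    adj = {}
    for x in product(*(range(1, s + 1) for s in dims)):
        nbrs = []
        for i, s in enumerate(dims):
            for step in (-1, 1):
                if 1 <= x[i] + step <= s:      # stay inside the mesh
                    nbrs.append(x[:i] + (x[i] + step,) + x[i + 1:])
        adj[x] = nbrs
    return adj

def mesh_diameter(dims):
    """Distance between opposite corners: (d_1 - 1) + ... + (d_d - 1)."""
    return sum(s - 1 for s in dims)
```

With all side lengths equal to 2 the construction yields the hypercube, in which every
node has degree $d$.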
Figure 4.8: (panels (a) and (b))
Although (low dimensional) meshes are relatively simple to construct, they suffer
from having large diameter. However, by adding edges to the mesh the diameter can
be reduced by a small constant factor. For example, figure 4.8(a) shows a $(4 \times 4)$
mesh with wrap-around connections and figure 4.8(b) shows the same mesh with
toroidal connections.
By adding nodes and edges to the $d$-dimensional $N \times N \times \cdots \times N$ mesh, such that
the $N$ nodes of each one-dimensional "row" are the leaves of a complete binary
tree, we can construct a graph called a mesh of trees. This has the small diameter
$2d \log N$. For example, consider a 2-dimensional $N \times N$ mesh of trees, which can
be constructed from an $N \times N$ mesh by adding nodes and edges to form a complete
binary tree in each row and each column. In a 2-dimensional mesh of trees any two
nodes $u$, $v$ can be connected by a path of length at most $4 \log N$. Suppose $u$ is a
node on the $i$th row and $v$ is a node on the $j$th column and let $w$ be the node on the
$i$th row and $j$th column. We first construct a path of length at most $2 \log N$ from $u$ to
$w$ using the tree of the $i$th row and then finish the path in at most $2 \log N$ steps from
$w$ to $v$ using the tree of the $j$th column. Hence the diameter is $4 \log N$.
4.2.4 Randomly-wired networks
Among randomly-wired networks the most popular one is called the multibutterfly.
Informally, the $d$-dimensional multibutterfly is like a $d$-butterfly in which the nodes
at each level are randomly connected to the next level [155]. The object of random
wiring is that the logical path between the input nodes (the nodes on level 0) and
the output nodes (the nodes on level $d$) of the $d$-dimensional multibutterfly can be
realised by a myriad of physical paths. We shall see in section 4.3.3 that the large
number of these paths allows an optimal routing to be constructed in the network. Recall
that there is a unique path of length $d$ between every input node and every output
node in the $d$-butterfly. To define the multibutterfly network more formally, we need
to understand the graph called a splitter, which is the basic building block of the
network.
An $(\alpha, \beta, m, 2k)$-splitter, or $m$-input splitter, is a bipartite graph $G = (A, B, E)$, where
$|A| = |B| = m$. A graph is bipartite if it is possible to partition the vertices of
the graph into two subsets, $A$ and $B$, such that every edge of the graph connects a
vertex in $A$ to a vertex in $B$. The nodes in $A$ are called inputs, and the nodes
in $B$ are called outputs. The outputs are divided into two blocks of $m/2$ nodes.
The block consisting of the first $m/2$ nodes is called the upper block, and the block
consisting of the remaining nodes is called the lower block. The edges from the
inputs to the upper block are called up edges, and those to the lower block are called
down edges. The up and down edges are chosen at random subject to the constraint
that each input is incident to $k$ up edges and $k$ down edges, and each output is incident to
$2k$ edges. Moreover, the splitter has $(\alpha, \beta)$-expansion, i.e., every subset of inputs,
$X$, $|X| \le \alpha m$, is connected to at least $\beta|X|$ outputs in the upper block and $\beta|X|$
outputs in the lower block (see figure 4.9). Note that, although splitters can be
constructed deterministically in polynomial time, random choice of edges provides
the best known expansion [71, 95, 155]. See for example [89] for recent
work on practical implementations of this network.
Figure 4.9: A splitter, showing the up edges from the inputs to the upper block of outputs.
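The degree constraints of a splitter can be illustrated with a short Python sketch. The function names `random_splitter` and `output_degrees` are hypothetical, and the wiring method (k random pairings of the inputs onto each output block) is an assumption chosen for simplicity. It may produce parallel edges, and it does not check the $(\alpha, \beta)$-expansion property.

```python
import random

def random_splitter(m, k, seed=None):
    """Wire an m-input splitter at random (illustrative sketch only).

    Returns (up, down), where up[i] lists the k upper-block outputs joined to
    input i and down[i] lists its k lower-block outputs.
    """
    assert m % 2 == 0, "m must be even so the outputs split into two blocks"
    rng = random.Random(seed)
    half = m // 2
    up = [[] for _ in range(m)]
    down = [[] for _ in range(m)]
    for _ in range(k):
        # One round gives every input one up edge and one down edge.
        for edges, offset in ((up, 0), (down, half)):
            order = list(range(m))
            rng.shuffle(order)
            # Inputs at positions 2t and 2t+1 of the shuffled order share output t.
            for pos, inp in enumerate(order):
                edges[inp].append(offset + pos // 2)
    return up, down

def output_degrees(m, up, down):
    """Count the edges incident to each of the m outputs."""
    deg = [0] * m
    for edges in (up, down):
        for outs in edges:
            for o in outs:
                deg[o] += 1
    return deg
```

For example, `random_splitter(8, 3)` gives every input 3 up edges and 3 down edges, and every output exactly 6 = 2k incident edges.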
An $(n, k, \alpha, \beta)$-multibutterfly is a $\log n$ dimensional network, with $n$ input nodes
and $n$ output nodes. The multibutterfly and the $\log n$-butterfly are strongly related
to each other. The only difference between the two networks is that the degree of the
$\log n$-butterfly is 4, and the degree of the multibutterfly is $2k$. More precisely, in the
multibutterfly, the edges from level $l$ to level $l+1$ in rows $jn/2^l$ to $(j+1)n/2^l$ form
an $(\alpha, \beta, n/2^l, 2k)$-splitter, or $n/2^l$-input splitter, for all $0 \le l \le \log n - 1$ and $0 \le j < 2^l$.
Figure 4.10: A multibutterfly with N inputs, built from an N-input splitter, N/2-input splitters and N/4-input splitters, leading to N outputs.
The nodes on each level of the multibutterfly can be partitioned into blocks. All the nodes
on level 0 belong to the same block. On level 1, there are two blocks, one consisting
of the nodes that are in the upper $n/2$ rows, and the other consisting of the nodes that
are in the lower $n/2$ rows. In general, there are $2^l$ blocks of size $n/2^l$ on level $l$, for
$0 \le l \le \log n$. Each block $b_l$ on level $l$ is connected to two blocks of size $n/2^{l+1}$ on
level $l+1$. We refer to the two blocks as the upper block, $u_l$, and lower block, $l_l$.
Now $u_l$ and $l_l$ contain the nodes on level $l+1$ that are in the same rows as the upper
$n/2^{l+1}$ nodes of $b_l$, and the same rows as the lower $n/2^{l+1}$ nodes of $b_l$, respectively.
The three blocks, $b_l$, $u_l$ and $l_l$, form the splitter $(\alpha, \beta, n/2^l, 2k)$, for which $b_l$ forms
the inputs and $(u_l \cup l_l)$ forms the outputs. Figure 4.10 shows a multibutterfly with $n$ inputs.
As in the butterfly, the logical path between every input node and every output
node can be constructed by using up and down edges. Furthermore, since each
node has $k$ up and $k$ down edges, each step of a logical path can take any one of $k$
edges. Hence, we have a large number of choices to construct one logical path.
4.3 Contention and congestion
In the PRAM there are conventions for multiple access to any one memory location
depending on whether the PRAM is defined as EREW, CREW or CRCW. In feasible
machines shared memory is distributed amongst memory modules, and the memory
modules are connected by an interconnection network. Each memory module
contains many memory locations, and each is capable of serving only a constant
number of requests during each time step. This is because the interconnection
network can only carry a certain density of messages, and the memory modules can
only store or retrieve a constant number of requests per time unit. Congestion occurs
when network links or nodes are overloaded with messages to pass. Contention
occurs when memory modules are overloaded with memory requests in any time
step. Local congestion occurs at so-called hot spots. Unless hot spots are managed,
we might not efficiently implement PRAM algorithms on the feasible models. This
is because excessive accesses and transmissions will be serialised and served over
several sequential time steps.
One solution to avoid contention is to randomly distribute the logical addresses to
the memory modules using some hash function. Another solution is to keep
copies of each logical address in several memory modules [104, 156]. A special kind
of contention occurs when, in running a CRCW or CREW PRAM algorithm, several
processors attempt to access (read or write) a value in the same memory location.
This can be avoided by a combining mechanism. Note that if the patterns of shared
memory access are known in advance then we can deterministically distribute the
shared memory among the memory modules to avoid contention.
A large amount of work (for example see [89] for a list of references) has been done
on the development of efficient routing algorithms for interconnection networks, to
reduce congestion and to get the right data to the right place within a reasonable
time. A wide variety of routing problems may arise in practice, such as one-to-one
routing problems or partial permutations: at most one message starts at each
processing element and at most one message is destined for each processing element;
one-to-many routing problems: one message may be destined for more than one place;
and many-to-one routing problems: many messages are destined for the same place.
In general, an $h$-relation is a routing problem in which at most $h$ messages are sent from each
processing element and at most $h$ messages arrive at each processing element. The
natural lower bound for routing is the diameter of a network. Deterministic parallel
routing strategies can often be improved by using randomised algorithms. The
routing paths can be set up in two ways. First, in an off-line fashion, the routing
paths are precomputed using global information. Second, in an on-line fashion,
the routing paths are made using only local information. We are interested in on-line
routing, since the global information required for off-line routing is often not
available, or can only be obtained with excessive time-penalties.
4.3.1 Hashing
The most promising method to avoid contention is to randomly hash the memory.
The PRAM memory locations are distributed among the memory modules of a feasible
machine using some hash function. In this section we assume that concurrent
access does not occur (i.e. we consider only EREW PRAM algorithms). Suppose
an EREW PRAM algorithm uses $M$ memory locations. To implement the algorithm
with $p$ memory modules we randomly distribute the $M$ memory locations among the
$p$ memory modules. More precisely, all of the EREW PRAM memory locations,
which are logical addresses, are mapped into memory locations which are physical
addresses of a feasible model, using some randomly chosen hash function $h$ from
$S_M$, $h : [0, \ldots, M-1] \rightarrow [0, \ldots, M-1]$. $S_M$ is the full set of permutations of
$[0, \ldots, M-1]$. Let memory module $p_z$, $0 \le z \le p-1$, contain the content of location
$x$ if logical address $x$ is mapped into a physical address $y$, and physical address
$y$ is in the memory module $p_z$, i.e.,
$$h(x) = y \quad \text{and} \quad y \bmod p = z, \qquad x, y \in [0, \ldots, M-1].$$
The motivating idea is that if a random mapping is produced by some hash function,
then we can minimise the maximum number of memory requests which are assigned
to the same memory module. Let $R_{\max}$ be the maximum of $R_z$, where $R_z$ is the
number of requests for addresses located in module $z$, $0 \le z \le p-1$. The requests
for each module are queued in some order and served sequentially, thus we need
$R_{\max}$ to be small. In addition, we need to minimise the time for computing $h(x)$,
$0 \le x \le M-1$, and storing the mapping. If $h$ is a totally random function of
the form $h : [0, \ldots, M-1] \rightarrow [0, \ldots, M-1]$ (i.e. randomly chosen from $S_M$), then
every computing element needs an additional $\log M^M = M \log M$ bits (assuming a
suitable encoding) to store an element $h \in S_M$, which can be expensive if $M$ is
large. Fortunately, it is sufficient for us to choose $h$ randomly from among a much
smaller set, $H$ (say), of easily computable functions. Hash functions for this parallel
context have been analysed by Mehlhorn and Vishkin [104]. Their work is based on
universal hashing as introduced by Carter and Wegman [21].
Suppose an arbitrary step of an EREW PRAM requests a set $S$ of up to $v$
addresses of the shared memory (i.e. $|S| \le v$), where $S \subseteq [0, \ldots, M-1]$. Then we
need to make sure that $h(S)$ is spread evenly among the $p$ memory modules. Consider
the class $H$ of hash functions $h$ of the form
$$h(x) = (a_0 + a_1 x + \cdots + a_{k-1} x^{k-1}) \bmod M,$$
where $a_0, a_1, \ldots, a_{k-1}$ are $k$ integers chosen uniformly and independently at random
from the interval $[1, M-1]$. Let $R_{\max}(p, v, H)$ be the maximum number of requests
to the memory module which has the largest such number, with respect to all possible
sets $S$ of addresses of size $v$, and with respect to all hash functions of $H$. Using the results
of [104] Valiant proved the following result [158].
Theorem 4.3.1 Let $v = p \log p$ and $k = 4 \log p$. If $M$ is prime then with high
probability $R_{\max}(p, v, H) = 3 \log p = 3v/p$.
Note that if $M$ is not prime then we can choose $M'$, where $M'$ is the smallest prime
larger than $M$, and define the corresponding $H$ as before but for $M'$ rather than
$M$. From theorem 4.3.1 we can see that $v$ requests are spread evenly among $p$
memory modules: with high probability each module will get no more than $3v/p$
requests, which is only three times the expected number. However, $H$ is not a class
of permutations; the hash functions randomly drawn from $H$ are not one-to-one.
Up to $k = 4 \log p$ addresses may map to the same memory location. This can
be avoided with high probability by choosing different hash functions of the form
h mod [3//;)], where $h \in H$ and $k > 4 \log p$ [158]. Another problem of this
hashing occurs if $v = p$. Then rather than having a constant number of requests
to each memory module, with high probability one memory module will get about
$\log p / \log \log p$ requests and some will get none. Hence, we need to
ensure that $v \ge p \log p$.
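The polynomial hashing scheme above can be sketched in Python as follows. The function names `random_coeffs`, `poly_hash` and `max_module_load` are hypothetical, and the parameter values in the usage note are assumptions picked only to match the shape of theorem 4.3.1. This is an illustration of the technique, not an implementation taken from [104] or [158].

```python
import random
from collections import Counter

def random_coeffs(k, M, rng):
    """Draw the k coefficients a_0 .. a_{k-1} uniformly from [1, M-1]."""
    return [rng.randint(1, M - 1) for _ in range(k)]

def poly_hash(coeffs, M):
    """Return h(x) = (a_0 + a_1 x + ... + a_{k-1} x^{k-1}) mod M."""
    def h(x):
        y = 0
        for a in reversed(coeffs):      # Horner's rule, highest power first
            y = (y * x + a) % M
        return y
    return h

def max_module_load(h, addresses, p):
    """R_max for one step: the largest number of requests sent to one module."""
    load = Counter(h(x) % p for x in addresses)
    return max(load.values())
```

With, say, M = 10007 (a prime), p = 16, k = 4 log p = 16 and v = p log p = 64 requested addresses, `max_module_load(poly_hash(random_coeffs(16, 10007, random.Random()), 10007), S, 16)` can be compared against the bound 3v/p = 12 for different address sets S.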
Since we are using the class of polynomials of degree $k = 4 \log p$, we need
special purpose hardware to compute these polynomials in constant time. Note
that $O(\log \log p)$ parallel steps are needed to evaluate the $\log p$ degree polynomials.
However, recent analysis shows that a fixed degree polynomial is sufficient
[83, 145, 130, 109]. For example, the class of hash functions of the form
$((a + bx) \bmod M')$, where $a, b \in [1, M'-1]$, will result in excellent performance,
if the choice of $a$, $b$ is suitably restricted [109]. The following theorem is proved in
[83] for the class of hashing functions of fixed degree $k$.
Theorem 4.3.2 Let v = y)1 and let k > 6o . If M = r" then with high probability
$R_{\max}(p, v, H) = 2v/p$.
Note that whenever more than $2v/p$ requests are generated for any memory module
we rehash the entire memory, using a new randomly chosen hash function $h \in H$.
However, the probability of rehashing is very small.
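The rehashing rule can be sketched as a simple retry loop. For brevity the sketch uses the linear class $((a + bx) \bmod M')$ mentioned above rather than a general fixed-degree polynomial; that choice, the function names `linear_hash` and `hash_with_rehash`, and the `max_tries` safeguard are all assumptions made for illustration.

```python
import random
from collections import Counter

def linear_hash(a, b, M_prime):
    """Return h(x) = (a + b*x) mod M', with M' assumed prime."""
    return lambda x: (a + b * x) % M_prime

def hash_with_rehash(addresses, p, M_prime, threshold, rng, max_tries=1000):
    """Draw hash functions until no module receives more than `threshold` requests.

    Returns the accepted hash function and the number of functions tried.
    """
    addresses = list(addresses)
    for tries in range(1, max_tries + 1):
        a = rng.randint(1, M_prime - 1)
        b = rng.randint(1, M_prime - 1)
        h = linear_hash(a, b, M_prime)
        load = Counter(h(x) % p for x in addresses)
        if max(load.values()) <= threshold:
            return h, tries            # balanced enough, keep this function
        # Otherwise "rehash": discard h and draw a fresh (a, b).
    raise RuntimeError("no acceptable hash function found")
```

Calling it with `threshold = 2 * v // p`, where v is the number of requested addresses, mirrors the $2v/p$ rule in the text.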
4.3.2 Combining
In the last section we saw how to evenly distribute the memory accesses among 
memory modules when all of the memory accesses are for distinct memory locations 
(or addresses). Several or all the computing elements may wish to access not only 
the same memory module, but also the same memory location at the same time. This 
may happen when running a CRCW or CREW PRAM algorithm. Here, randomly 
distributing the PRAM memory to the memory modules cannot help. The number 
of accesses for the same memory location at the same time will be the same no 
matter where the memory location physically exists.
Fortunately, we can solve this (concurrent) access problem by using a so-called 
combining technique. For example, in a butterfly network each request for accessing 
a common memory location moves along the directed path from its source to the 
destination in the interconnection network. These paths are in general not disjoint. 
Thus these requests can be viewed as flowing from the leaves of a tree to the root. 
Then as explained below we can combine these messages into a single message at 
a node of the tree.
Suppose the requests are for reading the same memory location (concurrent read);
then there is no need to send more than one read request along any branch of this
tree. Of course, the reply message needs to flow in the reverse direction, along each
edge of the tree, so that each requesting processor receives a reply. To accomplish
this, whenever the read requests are combined at a node of the tree, the sources of the
requests in that stage are stored at that node. Suppose several computing elements
want to write a value in the same memory location at the same time (concurrent
write; write conflicts can be resolved as explained in section 2.2). Whenever two
or more concurrent write requests meet at a node of the tree, the node can send any
one of the requests. Moreover, for concurrent write we do not have to worry about
a flow of information in the reverse direction. Note that two or more requests to
distinct memory locations within the same memory module might be passed through
a node, in which case the requests should be concatenated.
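The combining of concurrent reads can be sketched in Python on an idealised complete binary tree whose leaves are the requesting processors. The tree model, the function name `combined_read` and the dictionary standing in for the memory are assumptions made for illustration; a real combining network such as a butterfly interleaves many such trees.

```python
def combined_read(requests, memory):
    """Serve concurrent reads with combining (illustrative sketch).

    requests[i] is the address read by leaf i, or None; the number of leaves
    is assumed to be a power of two. Returns (replies, root_requests), where
    root_requests is the number of requests that actually reach the memory.
    """
    # Upward sweep: each node forwards one request per distinct address and
    # stores which of its children asked for it.
    level = [({a} if a is not None else set()) for a in requests]
    stored = []
    while len(level) > 1:
        table, parents = [], []
        for j in range(0, len(level), 2):
            sources = {}
            for child in (j, j + 1):
                for a in level[child]:
                    sources.setdefault(a, []).append(child)
            table.append(sources)
            parents.append(set(sources))
        stored.append(table)
        level = parents
    root_requests = level[0]
    values = [{a: memory[a] for a in root_requests}]
    # Downward sweep: replies retrace the stored sources in reverse.
    for table in reversed(stored):
        below = [dict() for _ in range(2 * len(table))]
        for j, sources in enumerate(table):
            for a, children in sources.items():
                for child in children:
                    below[child][a] = values[j][a]
        values = below
    replies = [values[i][a] if a is not None else None
               for i, a in enumerate(requests)]
    return replies, len(root_requests)
```

With eight leaves reading only two distinct addresses, two requests reach the memory, yet every leaf receives its reply.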
One approach to implementing combining is to use combining networks, i.e. networks
that can combine and replicate messages in addition to delivering them in
a point-to-point manner. The NYU Ultracomputer incorporates a combining network;
computing elements and memory modules are connected by switches with
the geometry of the Omega network [58], where the switches are capable of combining
the messages. The experiments on the NYU Ultracomputer reveal poor
performance, and combining increases the switch size and cost [119]. The Fluent
machine of Ranade [129] supports inexpensive hardware for a combining network,
and the combining (based on [130]) is efficient. The nodes of the Fluent machine are
connected with the geometry of the butterfly, where each node contains a computing
element, a memory element and 6 switches.
The reason that Ranade's combining works well is that the requests leaving each
node are sorted by destination. This works with the help of ghost messages. These
carry the minimum location address to which subsequent messages can be sent.
After a request leaves a node no more requests
to the same destination will arrive at that node. In other words if two requests to
a common destination pass through a node then they will pass through the node at
the same time, in which case the node combines the two requests and forwards the
result. So, a request waits at each node until the node determines that there are no
more messages arriving at the node which will request the same destination, or
until another request to the same destination arrives.
Ranade's combining method ensures that the queue size of the requests at each
node is $O(1)$. Moreover, if the memory of a CRCW PRAM with size $M$ is randomly
distributed among the $p$ nodes of the butterfly (each node is a computing element and a
memory module), then the combining gives the following theorem. The memory is
randomly distributed by the class of hash functions of the form $((a + bx) \bmod M')$,
where $a, b \in [1, M'-1]$ and $M'$ is a fixed prime no less than $M$.
Theorem 4.3.3 [129] Let $p$ be the number of accesses requested by a CRCW PRAM
at any step. The $p$ accesses can be realised in a $p$ node butterfly in time $15 \log p$
with high probability.
In addition, concurrent requests to the same memory location can be handled
without the use of combining networks. For example, Valiant [158] and Kruskal et
al. [83] provide algorithms for simulating concurrent requests on networks. The
simulation uses a sorting algorithm of [127] to sort the concurrent requests according
to the destination addresses so that access requests to the same location in memory
will be adjacent in the sorted array. In [159] Valiant described another algorithm
which is more efficient and practical than previous solutions [158, 83].
Theorem 4.3.4 [159] For any constant $\epsilon > 0$, $v \le p^{1+\epsilon}$ requests of a CRCW PRAM
can be realised on a $p$ node network in optimal $O(v/p)$ time with high probability.
The results assume that each node of the network has a computing element and
a memory module, and the memory of a CRCW PRAM is distributed among the
memory modules using the randomly chosen hash function as previously described.
Moreover, the $v$ requests of a CRCW PRAM are spread among $p$ components such
that each component sends at most $p^{\epsilon}$ requests.
4.3.3 Routing
The simplest kind of routing problem is a 1-relation. In fact 1-relations can be
realised in a $p$-node network by sorting the $p$ items. If processing elements are
indexed according to some natural scheme, then by sorting the items which are
distributed across the network, one located at each processing element, we can route
the item which would be the $i$th (in a sorted list of the items) to the processing
element with index $i$.
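Routing a full permutation by sorting can be sketched in Python with a compare-exchange network. The sketch uses a bitonic sorting network (one of Batcher's networks) simulated sequentially; the function name `route_by_sorting`, the (destination, payload) representation and the restriction to a power-of-two number of processing elements are assumptions made for illustration.

```python
def route_by_sorting(items):
    """Route a permutation by sorting on destination (illustrative sketch).

    items[i] = (destination, payload) is held by processing element i, and the
    destinations are assumed to be a permutation of 0 .. p-1 with p a power of
    two. Returns a list whose entry d is the payload delivered to element d.
    """
    a = list(items)
    p = len(a)
    k = 2
    while k <= p:                      # size of the bitonic sequences merged
        j = k // 2
        while j >= 1:                  # distance between compared elements
            for i in range(p):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Compare-exchange on the destination key.
                    if (a[i][0] > a[partner][0]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    # After sorting, the item in position d is the one destined for element d.
    return [payload for _, payload in a]
```

Every round of the inner loop consists of independent compare-exchange steps, which is what allows the network to run in parallel on a real machine.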
Among deterministic parallel sorting algorithms the fastest known is based on the
AKS sorting network, which is a $p$-node bounded degree network capable of
sorting $p$ items in $O(\log p)$ time [5]. Although the algorithm is asymptotically
optimal, the constant factor involved in the run time is large, and the network
topology is complicated and nonregular. Batcher's sorting algorithm [13], which has
been known for several decades, remains the most practical algorithm for sorting.
For example, the algorithm can be implemented in $O(p)$ time on a $p \times p$ mesh,
and in $O(\log^2 p)$ time on a $p$-node hypercube and on a $p$-node shuffle-exchange. This
implementation matches the lower bound for the mesh, and is a $\log p$ factor worse
than the lower bound for the hypercube and shuffle-exchange networks. Recently,
Cypher and Plaxton [36] described a sorting algorithm for sorting $p$ items on a $p$-node
hypercube in $O(\log p (\log \log p)^2)$ time, or $O(\log p (\log \log p))$ time with a substantial
amount of off-line computation, which is very close to the lower bound. It is not
known whether or not there is an on-line deterministic algorithm to realise a
1-relation or to sort $p$ items on a $\log p$ dimensional hypercubic network in $O(\log p)$
time.
If we restrict a deterministic routing algorithm to be oblivious, then the following
lower bound holds. A routing algorithm is oblivious if the routing paths are determined
only by the sources and destinations, and not in any way by the interacting
traffic on the network.
Theorem 4.3.5 [19, 72] For any oblivious algorithm there is a permutation routing
problem or 1-relation which will take $\Omega(\sqrt{p}/d^{3/2})$ time on a $p$-node degree-$d$
network.
Among oblivious algorithms the greedy algorithms are known to perform very well
for almost all routing problems, and very poorly for some of the most important
and most common routing problems in practice [88]. A greedy algorithm routes
every packet along a shortest path to its destination. For example, on any $\log p$
dimensional hypercubic network the greedy algorithm can realise any monotone
routing problem in optimal $O(\log p)$ time. A monotone routing problem is
a 1-relation for which the relative order of the packets is unchanged. Similarly
optimal performance is exhibited by the greedy algorithm for a routing problem in
which each message has a destination address which is the complement of its initial
address. This 1-relation can be realised in exactly $\log p$ time on a $p$-node hypercube.
On the other hand, there are bad cases, such as the bit-reversal permutation and the
transpose permutation, for which the greedy algorithm performs very poorly. Consider
the bit-reversal permutation: routing each message to the address given by its initial
location read backwards will take at least $\sqrt{p}$ steps on a $p$-node hypercube. The reason is
that there are $2^{(\log p)/2} = \sqrt{p}$ nodes with addresses of the form $u_1 u_2 \ldots u_{(\log p)/2} 0 0 \ldots 0$,
and all messages starting from such nodes will be at address $00 \ldots 0$ halfway through
execution of the algorithm (see section 4.2.1 for a shortest path between any pair
of nodes in the hypercube). The permutations associated with worst-case performance
can be overcome by randomising the memory and/or by randomised routing.
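The hot spot created by the bit-reversal permutation can be observed with a short Python sketch. It assumes that the greedy algorithm corrects differing address bits from the most significant bit downwards (a bit-fixing shortest path); the function names `greedy_path`, `bit_reverse` and `node_loads` are hypothetical.

```python
def greedy_path(src, dst, d):
    """Bit-fixing shortest path on the d-cube, most significant bit first."""
    path, node = [src], src
    for bit in reversed(range(d)):
        if ((node ^ dst) >> bit) & 1:
            node ^= 1 << bit           # cross the edge in this dimension
            path.append(node)
    return path

def bit_reverse(x, d):
    """Return the d-bit address x read backwards."""
    return int(format(x, "0%db" % d)[::-1], 2)

def node_loads(d):
    """For each node, count the messages whose greedy path visits it."""
    load = [0] * (1 << d)
    for src in range(1 << d):
        for node in set(greedy_path(src, bit_reverse(src, d), d)):
            load[node] += 1
    return load
```

For d = 8 (p = 256 nodes), `node_loads(8)[0]` counts the messages that pass through address 00...0, one for each of the $\sqrt{p} = 16$ sources whose low-order half is all zeros.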
Any 1-relation can be converted into a random routing problem by hashing the
memory. This is because each message, $x$, will now be destined for a random
location, $h(x)$, where the memory is randomised by the hash function $h$. Then the
greedy algorithm can be shown to perform well on the butterfly as in [129, 130].
However, there will still be some permutations which require $\Omega(\sqrt{n})$ time with low
probability on the butterfly, if we use the greedy algorithm [89].
Valiant proposed a routing algorithm to realise any 1-relation without altering the
memory organisation at all (however, to solve memory contention problems we
may need hashing and/or combining). Moreover, the algorithm does not exhibit
consistent worst case behavior for any 1-relation [160, 161, 163]. This algorithm is
known as two phase randomised routing, and works as follows.
1. For each node $i$ wishing to send a message, choose an intermediate target node
$t(i)$ randomly. That is, for the binary address of $t(i)$ choose each bit to be 0
or 1 with equal probability. Then send the message to $t(i)$ by the greedy algorithm.
2. When a message reaches $t(i)$ it is then sent to its true destination, again by the
greedy algorithm.
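The two phases above can be sketched in Python for a single message on the $d$-dimensional hypercube. The greedy step is assumed to be bit-fixing from the most significant bit downwards, and the function names `greedy_path` and `two_phase_route` are hypothetical; the sketch traces one path only and does not model queueing or contention between messages.

```python
import random

def greedy_path(src, dst, d):
    """Bit-fixing shortest path on the d-cube, most significant bit first."""
    path, node = [src], src
    for bit in reversed(range(d)):
        if ((node ^ dst) >> bit) & 1:
            node ^= 1 << bit
            path.append(node)
    return path

def two_phase_route(src, dst, d, rng):
    """Route src -> random intermediate node -> dst, greedily in both phases."""
    mid = rng.getrandbits(d)           # each address bit is 0 or 1 with equal probability
    phase1 = greedy_path(src, mid, d)
    phase2 = greedy_path(mid, dst, d)
    return mid, phase1 + phase2[1:]    # drop the duplicated intermediate node
```

Each phase uses at most d edges, so the combined path has length at most 2d regardless of the permutation being routed.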
Two phase randomised routing was first shown to work for the $d$-hypercube. With
high probability any 1-relation can be realised in less than $8d$ (or $8 \log n$) steps
[163]. Using this two phase randomised routing, subsequent work has
shown that, with high probability, a 1-relation can be realised on a $d$-cube connected
cycles, $d$-butterfly, $d$-shuffle-exchange and $(d \times d)$ 2-mesh in $O(d)$ (i.e. proportional
to the diameter of the network) steps [7, 88, 138, 154, 158]. Moreover, Valiant has
proved the following stronger result for the hypercube.
Theorem 4.3.6 [158] With high probability, every $\log p$-relation can be realised on
a $p$-node hypercube in $O(\log p)$ steps.
An interesting alternative to using two phase randomised routing is to use deterministic
routing on a randomly wired network. It has been shown that any
1-relation can be realised deterministically on a $\log p$ dimensional multibutterfly in $O(\log p)$ steps
[90, 155]. Moreover, the routing algorithm on the multibutterfly is shown to be
robust against faults [90]. Recall that there is a large number of paths between
every input and every output on a multibutterfly.
4.3.4 Information dispersal algorithm
The performance of routing can be affected when a node or a link ceases
to transmit data, and/or when a node or a link transmits incorrect data. Moreover,
if we use only one path to route a packet then every node and every link on that path must be
reliable. An obvious alternative is to create a number of copies of the packet, and
route the copies along different paths. This will result in an increase in the network
load.
Rabin proposed a method in [125] called the Information Dispersal Algorithm
(IDA), which breaks each packet into a collection of subpackets such that only
a fraction of them suffice to reconstruct the original packet. Since only a fraction
of the subpackets have to reach their destination, the algorithm is tolerant to faults.
Using the IDA it has been shown that, with high probability, any 1-relation can be
realised on a $p$-node hypercube without any additional delay (i.e. in $O(\log p)$ steps),
even if many nodes and links in the hypercube are faulty [61, 97]. Similar results
have been obtained for the de Bruijn graph [98].
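The idea of information dispersal can be sketched in Python as follows. The sketch works over the integers modulo the prime 257 and uses Vandermonde rows as the dispersal vectors; these choices, and the function names `disperse`, `solve_mod` and `reconstruct`, are assumptions made for illustration rather than details taken from [125].

```python
P = 257  # a prime larger than any byte value

def disperse(data, n, m):
    """Split bytes into n pieces such that any m of them rebuild the data."""
    padded = list(data) + [0] * (-len(data) % m)
    pieces = []
    for x in range(1, n + 1):
        row = [pow(x, e, P) for e in range(m)]   # Vandermonde row for point x
        piece = [sum(r * s for r, s in zip(row, padded[j:j + m])) % P
                 for j in range(0, len(padded), m)]
        pieces.append((x, piece))
    return pieces

def solve_mod(A, b):
    """Solve A z = b modulo P by Gauss-Jordan elimination (A invertible)."""
    m = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        piv = next(r for r in range(col, m) if M[r][col] % P)
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], P - 2, P)         # inverse by Fermat's little theorem
        M[col] = [v * inv % P for v in M[col]]
        for r in range(m):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(vr - f * vc) % P for vr, vc in zip(M[r], M[col])]
    return [M[r][m] for r in range(m)]

def reconstruct(pieces, m, length):
    """Rebuild the first `length` bytes from any m (x, piece) pairs."""
    chosen = pieces[:m]
    A = [[pow(x, e, P) for e in range(m)] for x, _ in chosen]
    out = []
    for j in range(len(chosen[0][1])):
        out.extend(solve_mod(A, [piece[j] for _, piece in chosen]))
    return bytes(out[:length])
```

Each piece holds one value per m input bytes, so the n pieces together carry about n/m times as many values as the original packet has bytes, rather than n full copies.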
In addition, the IDA can be used to avoid contention in memory [12, 88]. We encode
each item in the memory into a number of sub-items, $k$ (say), such that the original
item can be reconstructed from any sufficiently large subset of the sub-items, and we
store the sub-items across $k$ memory modules. Then contention can be avoided by simply dropping
sub-items if they occur at hot spots.
4.4 Synchrony and asynchrony
The PRAM model operates in lock step synchrony. There is an implicit synchronisation
barrier after every instruction. This lock step synchronisation guarantees
that processors reading from global memory can be sure to obtain the correct value 
from that memory at the end of the previous instruction, easing the task of algorithm 
design. The PRAM model, however, neglects the cost of this synchronisation. If we 
add a synchronisation barrier after every step in a feasible parallel model algorithm, 
then this would increase the complexity by a factor typically of the order of the 
network diameter. This question is addressed again in chapter 6.
Chapter 5
Embeddings
This chapter describes joint research with A.M. Gibbons and M.S. Paterson. The
results described in this chapter on dense edge-disjoint embeddings of binary trees
were published in [133] and [134]. Similar results for the mesh, which appeared in
[51], were merged with those of [133, 134], then extended and published in [135].
5.1 Introduction
Our concern in this chapter is the improvement of running times for PRAM algorithms
when implemented on feasible parallel computers in which processing
elements with associated memory are located at the nodes of an interconnection
network. The implementation of a large class of PRAM computations on these
models can be made to run optimally fast by the employment of certain strategies.
For example, if a particular PRAM algorithmic structure is frequently employed,
then the idea of embedding this structure in the communication network can lead,
as we will show, to optimal implementation on feasible machines. Perhaps the
most commonly occurring structure in this regard is the complete binary tree. It
is precisely because such logarithmic depth structures are used (either explicitly
or implicitly) that polylogarithmic time complexities are often attained for many
PRAM algorithms. In this chapter we describe optimally efficient embeddings of
the complete binary tree in the following graphs: the hypercube, the de Bruijn graph, the
shuffle-exchange graph and the 2-dimensional mesh.
5.2 Efficiency requirements
In the PRAM model, the complete binary tree is most usually employed as follows. 
Data for a problem (or sub-problem) are placed at the leaves, and the required result 
is obtained by performing computations at the internal nodes in one or more sweeps 
up and down the tree, so that computations at the same depth are performed in 
parallel. It should be noted however that some algorithms may require simultaneous 
computation at an arbitrary number of nodes at different depths of the tree. If we are
to embed the complete binary tree into the host topology of some distributed memory 
machine, we therefore need to observe the following requirements to achieve an 
efficient embedding:
1. All tree nodes at the same depth should be mapped into disjoint interconnection
network nodes if (as in the PRAM computation) computations are to
be performed in parallel at these nodes. In addition, PRAM algorithms may
require computation at nodes of the tree which are of different depth. Thus
for greatest utility, the embedding should map at most a constant number of
tree nodes to any node of the host graph.
2. Tree edges at the same depth should correspond to edge-disjoint paths in the
interconnection network if the commonest types of PRAM algorithm employing
this technique are to be simulated. For greatest flexibility, all tree paths should
be mapped to disjoint paths in the host graph.
3. The maximum distance from the root to a leaf of the tree (in terms of edges
of the host graph) in the embedding should be minimised in order that the
routing time is minimised.
4. Consistent with satisfying the above points, the size of the host graph should
be a minimum in the interests of processor economy.
5.3 The embeddings
In the subsections that follow, we describe embeddings of the complete binary tree
with $n$ leaves in the hypercube and de Bruijn graphs and in the doubly-connected
2-dimensional mesh and shuffle-exchange graphs, each with $n$ nodes. These are
all topologies that have been advocated for interconnection networks and which we
individually recall in the following subsections. By doubly-connected, we mean that
each edge in the standard definition of the graph is replaced by two parallel edges.
As we shall see, with the exception of the shuffle-exchange graph, the embeddings
are such as to satisfy the following crucial properties which guarantee that in every
respect the efficiency requirements stated in the previous section are met.
Embedding Properties:
1. Each node of the host graph is assigned exactly one leaf of the tree.
2. Each node of the host graph, except one, is also assigned exactly one internal 
node of the tree.
3. Distinct tree edges are mapped onto edge-disjoint (possibly null) paths in the 
host graph.
4. Consistent with Embedding Properties 1 and 2, the maximum length of images 
in the host graph of tree paths from a leaf to the root is optimally short.
In the case of the shuffle-exchange graph, the embedding that we describe ensures
that embedding properties 1-3 are satisfied. However, we can only conjecture that
embedding property 4 is also satisfied by our embedding. In the embedding, the
maximum length of an image of a leaf to root path is $2 \log_2 n + 2$, whereas our best
lower bound in this case is $(3/2) \log_2 n$.
Let DRCBT be shorthand for Double Rooted Complete Binary Tree. A DRCBT
is a complete binary tree in which the path (of length 2) connecting the two children
of the root is replaced by a path, $P$, of length 3. Each of the two internal nodes of
$P$ (both of degree two) is a root of the DRCBT. These roots will be denoted by
$r_1$ and $r_2$. In the hypercube, de Bruijn and shuffle-exchange graphs, each with $n$
nodes, we shall in fact embed the DRCBT with $2n$ nodes.
The following subsections establish that Embedding Properties 1-3 hold for the
embeddings described. We delay consideration of Embedding Property 4 until the
following section. As we shall see, it is also the case that the multiplicity of the
topologies (that is, the maximum number of parallel edges between any pair of
nodes) is a minimum consistent with Embedding Properties 1-2.
Figure 5.1: (a) a directed de Bruijn graph; (b) a directed DRCBT.
5.3.1 Embedding in the de Bruijn graph
The undirected de Bruijn graph of degree $m$, $m > 0$, has $n = 2^m$ nodes which
are all the possible distinct binary strings of length $m$. Each node $b_1 b_2 \cdots b_m$ is
connected to node $b_2 b_3 \cdots b_m b_1$ by a shuffle edge, and to node $b_2 b_3 \cdots b_m \bar{b}_1$ by a
shuffle-exchange edge. Here $\bar{b}_k$ is the complement of $b_k$. By implication, each node
is also connected to $b_m b_1 b_2 \cdots b_{m-1}$ and to $\bar{b}_m b_1 b_2 \cdots b_{m-1}$.
For our purposes it is convenient to direct the edges of the de Bruijn graph from each
node $b_1 b_2 \cdots b_m$ towards nodes $b_2 b_3 \cdots b_m b_1$ and $b_2 b_3 \cdots b_m \bar{b}_1$. This automatically
assigns a direction to every edge of the graph and ensures that each node has both
out-degree and in-degree of 2. Figure 5.1(a) shows the directed de Bruijn graph of
degree 4.
It is also convenient here to direct the edges of the DRCBT. All edges are directed
away from the roots (the edge between the roots will be bi-directed) as the example
of figure 5.1(b) illustrates. We say that each directed edge is from a parent to a child.
For i . j  both non-negative integers and i < j ,  we now inductively define digraphs 
G(i , j )  as follows:
1. G(0. j )  consists of two isolated vertices each denoted by a binary string of 
length j  consisting (for positive j ) o f  alternating Os and Is, one string begining 
with a 0 and the other begining with a 1.
2. G( i . j ) is constructed from G(i — 1 , j )  as follows. From each vertex v — 
b\b2 ■ • ■ bj of G(i — 1, j )  we add new directed edges (if they do not already 
exist) to (possibly new) vertices b2b^.. .bj 1 and b2b^. . .  bjO. The former is 
called the left-child o f v and the latter the rif>hi-child of v.
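The inductive definition can be exercised directly. The following Python sketch is an illustration only: the names build_G and degrees, and the representation of vertices as strings of '0' and '1', are assumptions of the sketch and do not come from the thesis. It builds G(i, j) step by step and reports vertex degrees, so that the counts claimed in the lemma that follows can be checked on small cases.

```python
def build_G(i, j):
    """Build the digraph G(i, j) for 0 <= i <= j and j >= 1.

    Vertices are binary strings of length j. Returns (nodes, edges),
    where edges is a set of (parent, child) pairs.
    """
    # G(0, j): the two alternating strings 0101... and 1010...
    roots = ["".join("01"[(k + s) % 2] for k in range(j)) for s in (0, 1)]
    nodes, edges = set(roots), set()
    for _ in range(i):
        # Extend from every vertex of the previous graph G(i - 1, j).
        for v in list(nodes):
            for bit in "10":  # left-child appends 1, right-child appends 0
                child = v[1:] + bit
                edges.add((v, child))
                nodes.add(child)
    return nodes, edges


def degrees(nodes, edges):
    """Return (out_degree, in_degree) dictionaries for the digraph."""
    out_deg = {v: 0 for v in nodes}
    in_deg = {v: 0 for v in nodes}
    for u, w in edges:
        out_deg[u] += 1
        in_deg[w] += 1
    return out_deg, in_deg
```

For instance, build_G(2, 4) gives 8 vertices and 8 directed edges with every in-degree equal to 1 (a double-rooted tree whose two roots point at each other), while build_G(3, 3) gives all 8 strings of length 3 with in-degree and out-degree 2 at every vertex.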
Lemma 5.3.1 For $i < j$, $G(i,j)$ is a directed DRCBT and for $i = j$, $G(i,j)$ is a
directed de Bruijn graph.
Proof First suppose that $i < j$ and, to avoid trivial cases, that $j > 2$. Let $[01]_k$
denote either of the binary strings of length $k$ consisting of alternating 0's and 1's
that begins with a 0 or a 1. Now, $G(1,j)$ is easily seen to be the directed DRCBT
with four nodes. The roots are of the form $[01]_j$ and are connected by anti-parallel
edges. The two additional nodes are $c_1 = [01]_{j-2}00$ which is a right-child of one
root and $c_2 = [01]_{j-2}11$ which is the left-child of the other. As long as $i < j$ the
inductive construction of $G(i,j)$ is such as to grow complete out-trees rooted at $c_1$
and $c_2$, each of depth $i$. To see this, it is sufficient to show that at each inductive
step in which $G(i,j)$ is constructed from $G(i-1,j)$, the only new edges connect
leaves of $G(i-1,j)$ to nodes whose labels are distinct from all previously obtained
nodes and are distinct amongst themselves. Thus, the new nodes will be leaves
of $G(i,j)$. It is easy to see that the new nodes are of the form $[01]_{j-i-1}11\alpha$ or
$[01]_{j-i-1}00\alpha$, where $\alpha$ is a binary string of length $(i-1)$. These are distinct from
all previous labels because, starting at the $i$-th position from the right, they contain
either the substring 00 (if they are descendants of $c_1$) or the substring 11 (if they are
descendants of $c_2$) and all previously existing nodes contain either 10 or 01 at this
position. Any two of the new nodes descended from the same $c_k$ will have different
$\alpha$s because each such $\alpha$ is uniquely determined by the path sequence of left-child,
right-child moves that must be traced from $c_k$.
Thus, we have proved that for $i < j$, $G(i,j)$ is a directed DRCBT. By a trivial
proof, if $i = j - 1$ the DRCBT has $2^j$ distinctly labelled nodes. Because this is the
maximum number of distinct binary strings of length $j$, it follows that $G(j,j)$ will
have the same set of nodes as $G(j-1,j)$. Also, every leaf of $G(j-1,j)$ will
have the form $00\alpha$ or $11\alpha$, so that the rightmost substring of length $(j-1)$ is distinct
amongst the labels of the leaves. This ensures that in the inductive construction of
$G(j,j)$ from $G(j-1,j)$ the new edges (all directed from leaves of $G(j-1,j)$) will
be directed to distinct nodes. In this way, every node of $G(j,j)$ has in-degree and
out-degree of 2. In fact, it trivially follows from the construction of $G(j,j)$ that every
node $v = b_1 b_2 \cdots b_j$ is connected to $b_2 b_3 \cdots b_j 0$ and $b_2 b_3 \cdots b_j 1$, which are the children
of $v$, and edges are directed to $v$ from $v_1 = 0 b_1 b_2 \cdots b_{j-1}$ and $v_2 = 1 b_1 b_2 \cdots b_{j-1}$. Of
this last pair of nodes, if $b_1 = 0$ then $v_1$ is a leaf of $G(j-1,j)$ and $v_2$ is the parent
of $v$ in $G(j-1,j)$. If $b_1 = 1$ then the roles of $v_1$ and $v_2$ are reversed. Thus, the
nodal connections of $G(j,j)$ are precisely those of the directed de Bruijn graph and
this observation completes the proof. □
Figure 5.2:
The following theorem follows trivially from the proof of the preceding lemma.
Theorem 5.3.1 The directed DRCBT with $n$ leaves can be embedded in the
directed $n$-node de Bruijn graph so as to satisfy Properties 1-3.
Figure 5.1 provides an illustration of the theorem. Both (a) and (b) are $G(3,3)$. In
(b) each node appears twice, once as a leaf of the directed DRCBT and once as an
internal node. Copies of nodes are identified in (a) to show the directed de Bruijn
graph.
5.3.2 Embedding in the shuffle-exchange graph
An undirected shuffle-exchange graph of dimension $m$ has $n = 2^m$ nodes which are
all the possible distinct binary strings of length $m$. Each node $b_1 b_2 \cdots b_{m-1} b_m$
is connected by an exchange edge to $b_1 b_2 \cdots b_{m-1} \bar{b}_m$ and by a shuffle edge to
$b_2 b_3 \cdots b_m b_1$. By implication, each node $b_1 b_2 \cdots b_m$ is also connected by a
shuffle edge to $b_m b_1 b_2 \cdots b_{m-1}$.
Figure 5.2 shows the shuffle-exchange graph of degree 6 in which each exchange
edge has been replaced by a pair (dashed for emphasis) of anti-parallel edges and
each shuffle edge has been replaced by a pair of parallel edges. This particular form
is derived from a previously known (see [108], for example) embedding of the de
Bruijn graph in the shuffle-exchange graph. The embedding has both congestion
and dilation of 2. The dilation of an embedding is the maximum distance in the
host between the images of adjacent guest nodes. The congestion of an embedding
is the maximum, over all edges $e$ in the host graph, of the number of edges in the
guest graph mapped to a path in the host graph which includes $e$. The embedding
is obtained by removing each shuffle-exchange edge $(b_1 b_2 \cdots b_m,\ b_2 b_3 \cdots b_m \bar{b}_1)$
from the directed de Bruijn graph and replacing it with the directed path consisting
of the shuffle edge $(b_1 b_2 \cdots b_m,\ b_2 b_3 \cdots b_m b_1)$ followed by the exchange edge
$(b_2 b_3 \cdots b_m b_1,\ b_2 b_3 \cdots b_m \bar{b}_1)$ of the shuffle-exchange graph. Because the graph now
employs just the nodal connections of the shuffle-exchange graph, it is precisely
such a graph but with parallel and anti-parallel edges. In this way, for example,
it is easy to see that figure 5.2 can be derived from figure 5.1(a). The following
theorem follows immediately from this embedding of the de Bruijn graph and from
theorem 5.3.1.
Theorem 5.3.2 The DRCBT with $n$ leaves can be embedded in the doubly-connected
shuffle-exchange graph with $n$ nodes so as to satisfy conditions 1-3.
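The replacement of each de Bruijn shuffle-exchange edge by a shuffle edge followed by an exchange edge can be checked mechanically on small cases. The Python sketch below is illustrative only: the function names are hypothetical, vertices are represented as bit strings, and congestion is counted over directed host arcs, which is an assumption of the sketch rather than a statement of how the thesis counts it.

```python
from collections import Counter


def shuffle(v):
    """Left cyclic shift: b1 b2 ... bm -> b2 ... bm b1."""
    return v[1:] + v[0]


def exchange(v):
    """Complement the last bit: b1 ... bm -> b1 ... (not bm)."""
    return v[:-1] + ("1" if v[-1] == "0" else "0")


def embed_de_bruijn(m):
    """Map every arc of the directed de Bruijn graph on m-bit strings to a
    path in the shuffle-exchange graph; return (dilation, congestion)."""
    usage = Counter()  # how many guest arcs use each directed host arc
    dilation = 0
    for x in range(2 ** m):
        v = format(x, "0{}b".format(m))
        w = shuffle(v)
        paths = {
            w: [(v, w)],                              # de Bruijn shuffle edge
            exchange(w): [(v, w), (w, exchange(w))],  # de Bruijn shuffle-exchange edge
        }
        for target, path in paths.items():
            assert path[-1][1] == target  # the path ends at the image of the head
            dilation = max(dilation, len(path))
            usage.update(path)
    return dilation, max(usage.values())
```

With these conventions embed_de_bruijn(3) and embed_de_bruijn(4) both return (2, 2): each shuffle arc carries the two guest arcs leaving its tail, which is why the shuffle edges appear as parallel pairs, and each exchange arc carries one guest arc in each direction.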
5.3.3 Embedding in the 2-dimensional mesh
The 2-dimensional doubly-connected mesh is the target graph for the embedding 
of this sub-section. Adjacent nodes are connected by a pair of anti-parallel edges. 
The guest graph of the embedding is the complete binary in-tree, that is, a complete 
binary tree in which the edges are directed towards the root. Note that there is
a large corpus of work concerning embedding of different classes of graph into
the 2-dimensional mesh, for example see [39, 93, 118, 162]. These use different
constraints on the embedding form than the ones used here, and address different
optimisation issues such as minimising the area of the host graph.
We prove the following theorem [51].
Theorem 5.3.3 For all $m \ge 1$, there are embeddings of the complete binary tree
with $2^{2m}$ and $2^{2m+1}$ leaves, into a doubly-connected $2^m \times 2^m$ mesh and a
doubly-connected $2^m \times 2^{m+1}$ mesh respectively, which satisfy Embedding Properties 1-3.
Proof First consider the embedding in a square for the tree with $2^{2m}$ leaves. The
case $m = 1$ is easy. For $m = 2$, figure 5.3 shows one possible embedding in the
$4 \times 4$ mesh. In this figure, the internal tree nodes and the paths corresponding
to the tree edges are drawn with increasing size and boldness, respectively, from
leaves to root. The edges are directed towards the root. The leaf nodes are not shown
explicitly since there is one at each mesh node. Note that some tree edges, incident
with the leaves, are mapped to null paths, indicated by loops in the figure. The root is
embedded on the left side, but the heavy path shown from this to the top-left corner
is used later in larger embeddings. The node distinguished with a dotted square in
the figure is that unique node which has not yet been assigned an internal tree node.
The small diagram underneath gives the salient features of this embedding, $A_2$, for
use in the recursive construction. The arrows on the perimeter indicate the usage so
far of the outside edges, and show that all the clockwise outside edges on three of
the sides are as yet unused. This embedding is perhaps more easily understood with the
help of figure 5.4. The left hand side of this figure shows the internal nodes of a 32-node
(or $2^4$-leaf) tree, and the right hand side shows a $4 \times 4$ mesh. Each number $i$,
$1 \le i \le 16$, in the figure indicates that the internal node with number $i$ is mapped into
the mesh node with number $i$.
Figure 5.5:
This construction also requires an alternative $4 \times 4$ embedding, $B_2$, shown in
figure 5.5. The root here is embedded in the interior of the square but there is an
outgoing path from it to the lower-right corner. This time, all the clockwise edges
on the top, left and bottom sides are free.
The next stage in our construction, the embeddings for $m \ge 3$, is shown in figure 5.6.
Three $A_{m-1}$s and one $B_{m-1}$ are combined to give embeddings of the $2^{2m}$-leaf tree
in a $2^m \times 2^m$ mesh. The three new internal tree nodes required are shown by
white and black circles, and are connected by paths of appropriate weight. For
the recursion, the embedding is continued in two different ways: the black root
node can be connected to the top-left corner by one of the hatched paths shown,
or joined to the lower-right corner by another hatched path. The first alternative
yields an embedding $A_m$ which has edge characteristics of type $A$ given by the
small diagram in figure 5.3, while the second similarly yields $B_m$. The arrangement
shown in figure 5.6 therefore represents a recursive step by which the construction
can be continued indefinitely. The third path illustrated, with different hatching,
to the lower-left corner, will be used in the $2^{2m+1}$-leaf embedding. For the case
$2^m \times 2^{m+1}$, if $m \ge 3$ we can connect seven copies of $A_{m-1}$ with one copy of $C_{m-1}$
as shown in figure 5.7. The cases where $m < 3$ are simple. □
5.3.4 Embedding in the hypercube
A hypercube has $n$ nodes (where $n = 2^m$, for some positive integer $m$) labelled
from 0 to $n - 1$ in binary and such that there is an edge between two nodes if and
only if their binary labels differ in exactly one bit. For completeness, we briefly
present the following result which was first described in [133].
Theorem 5.3.4 The double-rooted complete binary tree (DRCBT) with $n > 8$ leaves
can be embedded in the hypercube with $n$ nodes so as to satisfy Embedding
Properties 1-3.
Figure 5.6:
Proof For $n > 16$ inductively construct the embedding starting with the base case
of $n = 32$, as shown in figure 5.8. In the figure nodes occur at the corners of the
squares defined by the dashed lines and the labels of nodes in the top left-hand
quarter of the figure are shown. The first two binary bits of the labels of nodes in
the other quarters are shown at the center of their quarter of the figure. The last
three binary bits of such an address will be the same as the corresponding node in
the top left-hand quarter. Generally speaking, figures will only show some edges of
the hypercube, just those that are of interest. Dashed edges happen to correspond
to certain hypercube edges but are used merely as an aid in locating nodes in the
layout. For clarity, two figures (5.8 (a) and (b)) are employed to describe this case.
Figure 5.8(a) shows the embedding of those tree edges which have leaves as
endpoints. For clarity, some embedded tree edges point towards that endpoint which is
a leaf of the tree. Some tree edges are mapped to null paths which are indicated by
loops. Figure 5.8(b) shows the embedding of all other tree edges. Notice that hashed
edges are used for the path of length 3 on which full circles denote the possible roots
of the embedded complete binary tree. Also, notice that the three edges on this path
belong to three different dimensions of the hypercube. In figure 5.8(b), the internal
nodes are drawn with increasing size and the tree edges are drawn with increasing
boldness the nearer they are to the root. It is easy to see that this base case satisfies
Properties 1-3 in all respects.
Figure 5.9 illustrates the inductive step in the construction of the embedding of the
DRCBT with $n$ leaves in the hypercube with $n$ nodes from two embeddings of
$n/2$-leaf DRCBTs in hypercubes with $n/2$ nodes. These two embeddings are denoted
by $T$ and $T'$ in figure 5.9(a). The hashed vertical edges in that figure are edges
of the new dimension of the constructed hypercube. The hashed horizontal paths
($(c_1, r_1, r_2, c_2)$ and $(c'_1, r'_1, r'_2, c'_2)$) are the paths of length 3 which have as internal
nodes the possible roots of the embedded complete binary trees with $n/2$ leaves. The
triangular shapes attached to children of these possible roots represent the embedded
subtrees rooted at these children. The two smaller hypercubes are oriented so that
$r_1$ and $c'_1$ are made to correspond, then the dimension corresponding with the edge
$(r_1, r_2)$ is made to correspond with the dimension of the edge $(c'_1, r'_1)$. In this way,
the nodes $r_2$ and $r'_1$ are made to correspond. Similarly, the dimension of $(r_2, c_2)$ is
made to correspond with the dimension of $(r'_1, r'_2)$ and so node $c_2$ is brought into
correspondence with $r'_2$. This is always possible given the edge transitivity of the
hypercube and given that each of the horizontal hashed paths of length 3 has each
edge of different dimension. Figure 5.9(b) shows the embedding of the DRCBT
with $n$ leaves and unit dilation in the constructed hypercube with $n$ nodes. The
labelling of nodes in this figure makes clear its derivation from figure 5.9(a).
Figure 5.8:
Figure 5.9:
For $n = 16$, an embedding satisfying Properties 1-3 is shown in figure 5.10.
Note that figure 5.10 ((a) and (b)) is the hypercube layout corresponding to the top
half of figure 5.8 ((a) or (b)). Again, for convenience of illustration the embedding
of tree edges attached to leaves is shown in one diagram (figure 5.10(a)) and the
embeddings of all other edges are shown in another (figure 5.10(b)). □
Figure 5.10:
5.4 Depths of the embedded trees
In this section we examine the quality of our embeddings from the point of view of
Embedding Property 4. The maximum distance from the root to a leaf in the image
of the complete binary tree for any host graph is an important algorithmic parameter.
It is a measure of the routing time required for a single sweep of the balanced binary
tree. We denote this distance by $P(n)$ for the complete binary tree with $n$ nodes.
If the embedding satisfies Embedding Properties 1-3 of Section 5.2, the $O(\log n)$
routing time of the PRAM algorithm for such a sweep translates to $O(P(n))$ for the
interconnection network.
We first determine $P(n)$ for the embeddings in each of the four interconnection
networks considered in this chapter. Then we establish lower bounds for maximum
root-to-leaf distances for these embeddings which we use to show that, in all cases
except for that of the shuffle-exchange graph, our values of $P(n)$ are asymptotically
as short as possible. We conjecture that this fact is also true for the shuffle-exchange
graph, although there is a gap between the $P(n)$ of our embedding and the lower
bound obtained.
It will also be evident from this section that the congestion and the load of our host 
graphs are a minimum consistent with satisfying Embedding Properties 1-3. The 
load of an embedding is the maximum number of the guest nodes mapped to a node 
of the host graph.
5.4.1 Maximum root-to-leaf distances of the embeddings
For the $n$-node hypercube and de Bruijn graphs $P(n)$ is $\log_2 n + 1$. This is because
each edge of the $n$-leaf DRCBT is mapped into at most one edge of the $n$-node host
and there are root-to-leaf paths for which every such edge is mapped to precisely
one edge of the host. Thus, for the embedded DRCBT the maximum length of
root-to-leaf paths is $\log_2 n$ in these cases. This translates to $\log_2 n + 1$ for the
complete binary tree when its root is identified with a particular one of the two roots
of the DRCBT.
In our embedding of the $n$-leaf DRCBT in the doubly-connected $n$-node
shuffle-exchange graph one of the pair of edges from each parent to its children is mapped
into two edges of the host and so for this case $P(n) = 2(\log_2 n + 1) = 2\log_2 n + 2$.
Now consider the embedding of the $n = 2^{2m}$-leaf complete binary tree into a
doubly-connected $2^m \times 2^m$ mesh. Let $D(m)$ be $P(n)$ when expressed as a function of $m$. It
is a trivial matter to construct an embedding for $m = 1$ for which $D(m)$ is 2. For $m = 2$,
we see by inspection of the embedding of figure 5.5 that this maximum distance
is 6 mesh steps and so $D(2) = 6$. Now consider square meshes with $n = 2^{2m}$ nodes,
$m \ge 2$. Let $A(m)$, $B(m)$ and $C(m)$ be the maximum distances from a leaf to the
output from the top-left corner of pattern $A_m$, the lower-right corner of pattern $B_m$
and the lower-left corner of pattern $C_m$, respectively. From figures 5.3 and 5.5, we
see that $A(2) = 10$ and $B(2) = 8$. A corresponding layout $C_2$ with $C(2) = 9$ is easy to
derive from $B_2$. We can verify from figure 5.3 the following recurrence equations
for $m \ge 3$:
$$
\begin{aligned}
A(m) &= D(m) + 2^m, \\
B(m) &= D(m) + 2^m - 2, \\
C(m) &= D(m) + 2^m - 1, \\
D(m) &= \max\{A(m-1) + 2,\ B(m-1) + 2\} \\
     &= D(m-1) + 2^{m-1} + 2.
\end{aligned}
$$
The solution to these equations is:
$$D(m) = 2^m + 2m - 2, \quad \text{for } m \ge 1,$$
that is, since $2^m = \sqrt{n}$ and $2m = \log_2 n$,
$$P(n) = \sqrt{n} + \log_2 n - 2 = \sqrt{n} + O(\log n).$$
Now consider the case of embedding the $n = 2^{2m+1}$-leaf tree into a doubly-connected
$2^m \times 2^{m+1}$ mesh. Let $D'(m)$ be the corresponding maximal leaf-to-root distance.
We may verify in figure 5.7 that:
$$D'(m) = \max\{A(m-1) + 4,\ C(m-1) + 3\} + 2^{m-1}$$
and so in this case, since $D'(m) = 3 \cdot 2^{m-1} + 2m$ and $2^m = \sqrt{n/2}$,
$$P(n) = \frac{3}{2\sqrt{2}}\sqrt{n} + O(\log n).$$
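The recurrences above are easy to iterate numerically, which gives a quick check on the closed forms. The following Python sketch is illustrative only: the function name mesh_depths is hypothetical, and the starting values A(2) = 10, B(2) = 8, C(2) = 9 and D(2) = 6 are the ones quoted in the text.

```python
def mesh_depths(m_max):
    """Iterate the recurrences for A, B, C, D (square mesh) and D' (rectangle).

    Returns two dictionaries keyed by m: D for the 2^m x 2^m mesh and
    D_rect for the 2^m x 2^(m+1) mesh (defined for m >= 3).
    """
    A, B, C, D, D_rect = {2: 10}, {2: 8}, {2: 9}, {2: 6}, {}
    for m in range(3, m_max + 1):
        D[m] = max(A[m - 1] + 2, B[m - 1] + 2)
        A[m] = D[m] + 2 ** m
        B[m] = D[m] + 2 ** m - 2
        C[m] = D[m] + 2 ** m - 1
        D_rect[m] = max(A[m - 1] + 4, C[m - 1] + 3) + 2 ** (m - 1)
    return D, D_rect


D, D_rect = mesh_depths(12)
for m in range(3, 13):
    # Square mesh: D(m) = 2^m + 2m - 2, i.e. sqrt(n) + log2(n) - 2 with n = 2^(2m).
    assert D[m] == 2 ** m + 2 * m - 2
    # Rectangle: D'(m) = 3 * 2^(m-1) + 2m, i.e. half the perimeter sum plus 2m.
    assert D_rect[m] == 3 * 2 ** (m - 1) + 2 * m
```

The leading term $3 \cdot 2^{m-1}$ of $D'(m)$ equals $(r + s)/2$ for the $r \times s = 2^m \times 2^{m+1}$ rectangle, so only the additive $2m = O(\log n)$ term separates it from the radius of the host.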
5.4.2 Lower bounds for the embedded tree depths
Here we obtain lower bounds for the depth of the complete binary tree for the different
embeddings of this chapter and show that, in the cases of the mesh, hypercube and
de Bruijn graphs, these bounds asymptotically match the values of $P(n)$ that were
obtained in the previous subsection.
For a given graph, let its radius, $\rho$, be the minimum distance $r$ such that for
some central node $c$, every node is at a distance at most $r$ from $c$. Clearly, for
any embedding of a complete binary tree in a communication network in which
Embedding Properties 1-3 are met, a lower bound for $P(n)$ is provided by $\rho$.
Lemma 5.4.1 The following relationships hold for graphs with $n$ nodes: for the
hypercube $\rho = \log_2 n$ and for the shuffle-exchange graph $\rho$ = ' ($\log_2 n - 1$) for
$\log_2 n$ odd and $\rho$ = \ $\log_2 n + 1$ for $\log_2 n$ even.
Proof Follows easily from the definitions. □
For the hypercube we can obtain a marginally stronger lower bound for $P(n)$ from
the following Lemma.
Lemma 5.4.2 Consider leaf-disjoint embeddings of a complete binary tree with $n$
leaves into the $n$-node hypercube. For any such embedding $P(n) \ge \log_2 n + 1$.
Proof If there were an embedding satisfying Embedding Properties 1-3 and $P(n)$
was $\log_2 n$, then this would imply that a unit dilation embedding of the complete
binary tree (perhaps with some leaf to parent edges mapped to null paths) was
possible in the hypercube. Consider the mapping of the subtree consisting of all
edges other than leaf to parent edges. An embedding of this subtree would have to
be vertex disjoint with every edge being mapped to a hypercube edge. This is not
possible because such a graph is not a subgraph of the hypercube, as follows. Both
the subtree and the hypercube are bipartite graphs. In the case of the hypercube both
halves of the bipartition contain the same number of nodes. This is not the case for the
subtree and (with each subtree edge mapped precisely to an edge of the hypercube)
this would force more than one node of the subtree to be embedded in a single node
of the host. □
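The imbalance used in this proof can be made concrete. The subtree in question is a complete binary tree with $n/2$ leaves and $n - 1$ nodes; colouring its nodes by the parity of their depth gives the two halves of its bipartition, and a unit dilation, vertex-disjoint embedding would have to place each colour class inside one half of the hypercube, which holds only $n/2$ nodes. The Python sketch below is illustrative (the function name is hypothetical) and simply counts the two colour classes.

```python
def bipartition_sizes(n):
    """Sizes of the two colour classes of the complete binary tree with n/2 leaves.

    n is a power of two with n >= 2. Nodes are coloured by the parity of
    their depth; returns (even_depth_count, odd_depth_count).
    """
    depth = n.bit_length() - 2  # leaves of the subtree lie at depth log2(n) - 1
    even = sum(2 ** d for d in range(0, depth + 1, 2))
    odd = sum(2 ** d for d in range(1, depth + 1, 2))
    return even, odd
```

For $n = 8$ the classes have sizes 5 and 2, and for $n = 16$ they have sizes 5 and 10. In both cases the larger class exceeds $n/2$, and the same holds for every larger power of two because the larger class contains roughly two thirds of the $n - 1$ nodes.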
Lemma 5.4.3 Consider leaf-disjoint embeddings of a tree with $n$ leaves into the
mesh. For an arbitrary embedding
$$P(n) \ge \sqrt{n/2} - O(1),$$
for an embedding into an $r \times s$ rectangle where $n = rs$,
$$P(n) \ge \left\lceil \frac{r-1}{2} \right\rceil + \left\lceil \frac{s-1}{2} \right\rceil = (r + s)/2 - O(1).$$
Proof Let $b_r$ be the number of vertices of the mesh graph $Z \times Z$ within path length
$r$ of the origin, where $Z$ is the set of integers. Then $b_0 = 1$, $b_1 = 5$, and in general
$b_r = 1 + \sum_{i=1}^{r} 4i = 2r(r + 1) + 1$ for $r \ge 0$. For any injective mapping of $n$
leaves into the mesh, if $n > b_r$ then some vertex has to be mapped to a mesh node
at distance greater than $r$ from the root. □
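The counting step in this proof is easy to confirm by brute force. The Python sketch below is illustrative only (both function names are hypothetical): the first function counts lattice points within path length r of the origin, and the second returns the smallest r whose ball can hold n leaves, which is the lower bound the proof yields for an arbitrary embedding.

```python
def ball_size(r):
    """Count the nodes of the infinite mesh Z x Z within path length r of the origin."""
    return sum(
        1
        for x in range(-r, r + 1)
        for y in range(-r, r + 1)
        if abs(x) + abs(y) <= r
    )


def radius_lower_bound(n):
    """Smallest r with b_r >= n, using the closed form b_r = 2r(r + 1) + 1."""
    r = 0
    while 2 * r * (r + 1) + 1 < n:
        r += 1
    return r
```

For example, ball_size(1) is 5 and ball_size(2) is 13, so 16 leaves cannot all lie within distance 2 of the root and radius_lower_bound(16) is 3. Since $b_r < 2(r+1)^2$, the value returned always exceeds $\sqrt{n/2} - 1$, in agreement with the first bound of the lemma.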
From the above Lemmas, and consistent with Embedding Properties 1-2, it follows
that the values of $P(n)$ attained by our embeddings are asymptotically as short as
possible for the de Bruijn, two-dimensional mesh and hypercube interconnection
networks. Thus for these graphs, Embedding Property 4 has been optimally met. We
conjecture that Embedding Property 4 has been optimally met also for the shuffle- 
exchange graph, although in this case the P ( n ) of our embedding is asymptotically 
a constant factor of j greater than our lower bound.
5.5 Further remarks and algorithmic issues
Here we briefly justify the use of parallel or anti-parallel edges in our embeddings,
where they occur, so long as the embeddings have the density implied by Embedding
Properties 1-2. We then comment on the complexity gains afforded by our embeddings
when they might be employed for PRAM implementation on the associated
communication networks.
A natural question to consider is whether the pairs of anti-parallel edges are necessary
for the mesh. Can the complete (undirected) tree be densely embedded in the usual
undirected mesh? Each mesh node (except two) is host to one leaf vertex, with
degree one, and one internal vertex, with degree three, and so has a total of at least
four embedded edges incident with it. Note that some of the edges adjacent to
leaves can be mapped into paths of length zero, the loops in our figures, and so some
mesh nodes may require only two of their incident mesh edges. Thus there is no
immediate contradiction from degree considerations. However, we now consider
local details and easily find a contradiction. Consider boundary mesh nodes, away
from the one special node that does not host an internal tree vertex. Any such node
has degree less than four and so must have a loop in the embedding. It therefore
is host to a leaf vertex and the internal node adjacent to that leaf, and requires one
incoming path from another leaf, and one outgoing path to the parent vertex. Since
the neighbouring boundary nodes are in the same predicament, there is an impossible
situation at the boundary, even worse if it is at a corner.
It is also easy to see that we need parallel (or anti-parallel) edges for the shuffle 
exchange graph when embedding the complete binary tree if the embedding is 
consistent with Embedding Properties 1-2. This is because the shuffle exchange has 
degree 3 but any internal node of the tree which is not adjacent to a leaf has to be 
mapped to the same node of the shuffle exchange graph as a leaf. This requires that 
at least four tree edges have this shuffle exchange node as an end-point which is not 
possible without parallel (or anti-parallel) edges being added to the shuffle exchange 
graph to ensure edge-disjointness of the embedding. Notice that it also then follows 
that the dilation of the embedding must be greater than 1.
For the hypercube, de Bruijn and shuffle exchange graphs our embeddings show that
the complete binary tree can be edge-disjointly embedded in hosts that are generally
half the size compared with previously described embeddings without detriment
to the time complexities of PRAM algorithmic implementations that employ the
complete binary tree. The embedding of a complete binary tree in the hypercube
described in [16] meets all our efficiency requirements except that the host graph
is twice as large as it need be. In fact [16] embeds the $n$-leaf complete binary
tree in the hypercube with $2n$ nodes. In [88] (pages 407-410), an embedding is
described in which the $n$-leaf tree is embedded in the $n$-node hypercube. However,
in this embedding, up to $\log_2 n$ tree nodes of different depths are mapped to a
single node of the hypercube. Although the embedding is such as to facilitate the
efficient implementation of most PRAM algorithms, there may be difficulties in the
exceptional cases when simultaneous computation is required to take place at an
arbitrary number of different levels within the tree. Such an example is cited later
in this section.
For the two-dimensional mesh, our embeddings may not only reduce the size of
the host graph but will also improve running times of the implementations. For
example, in the well-known H-tree construction (see for example, page 84 of [153]),
the complete binary tree with $n$ leaves is embedded in the $(2\sqrt{n} - 1) \times (2\sqrt{n} - 1)$
mesh and the maximum root to leaf distance in the mesh image is $2\sqrt{n} - 2$. Of
course, this embedding was not designed from the point of view of our criteria and
would in any case be very costly in terms of unemployed processor sites. In the
embedding of [54], although the complete binary tree with $n$ leaves is embedded
in the square mesh with $n$ nodes, the maximum root to leaf distance is $3.54\sqrt{n}$.
Moreover, only tree edges at the same depth are mapped to disjoint paths.
Compared with previous embeddings and for some PRAM algorithms, the
edge-disjointness property of our embedding in the mesh yields further complexity gains.
Occasionally it is useful for all nodes in the tree, not just those at the same level,
to pass messages simultaneously to their children in such a way that this continues
until all messages (including that from the root) reach the leaves of the tree. An
example of such a cascading requirement is provided within the implementation of a
bracket matching algorithm on a mesh detailed in [54]. This can be simulated in the
embedding of [54] by allowing the messages from the internal tree nodes adjacent to
leaves to be passed directly to the leaves, then subsequently messages from the nodes
at the next level are sent to the leaves and so on, until finally the message from the
root is allowed to be copied down to all descendants. In this way, only tree edges at
the same level are being employed at the same time and the lack of disjointness of all
paths from the root to the leaves is no hindrance. In the embedding of [54], the routing
time for such a process would be $3.54(1 + 1/2 + 1/4 + 1/8 + \cdots + 1/2^{\log n})\sqrt{n} \approx
7.08\sqrt{n}$. The successive terms in the series arise from routing from successive tree
levels. This is because in successive iterations the size of each subproblem and the
required area of the mesh for each subproblem are recursively reduced by a factor of
2. For the embedding of this chapter, the path-disjointness property allows messages
to be passed down the tree simultaneously from all levels, and so the routing time
for the cascading requirement is just that for passing a message from the root to the
leaves (this masks the time for message passing from all other internal nodes) which
is $\sqrt{n}$.
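The comparison of the two routing times is a matter of summing a geometric series, which the following Python sketch makes explicit. It is an illustration under stated assumptions: the function names are hypothetical, the constant 3.54 is the root-to-leaf constant quoted above for the embedding of [54], logarithms are taken to base 2, and only the leading term $\sqrt{n}$ is used for the embedding of this chapter.

```python
import math


def cascade_time_level_by_level(n, c=3.54):
    """Routing time when tree levels are served one after another.

    Level k contributes c * sqrt(n) / 2**k, for k = 0 .. log2(n).
    """
    levels = int(math.log2(n))
    return c * math.sqrt(n) * sum(0.5 ** k for k in range(levels + 1))


def cascade_time_edge_disjoint(n):
    """Leading term of the root-to-leaf distance when all levels route at once."""
    return math.sqrt(n)
```

For $n = 2^{20}$ the level-by-level time is within 0.01 of $7.08\sqrt{n}$, because the series sums to $2 - 1/n$, whereas the edge-disjoint embedding needs about $\sqrt{n} = 1024$ steps.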
5.6 Summary and open problems
We have described dense edge-disjoint embeddings of the complete binary tree with
$n$ leaves in the following $n$-node intercommunication networks: the hypercube, the
de Bruijn and shuffle-exchange graphs and the 2-dimensional mesh. The embeddings
have the following properties: paths of the tree are mapped onto edge-disjoint
paths of the host graphs, and at most two tree nodes (just one of which is a leaf) are
mapped onto each host node. We also proved (except for the shuffle-exchange
graph) that an algorithmically important parameter, the maximum distance from a
leaf to the root of the tree, is asymptotically as short as possible. We conjecture
that for the shuffle-exchange graph this distance is also optimally short within our
embedding. The embeddings facilitate efficient implementation of many PRAM
algorithms on these networks and improve extant results. For the mesh and
shuffle-exchange graphs these embeddings were not possible without replacing each edge
by a pair of parallel (or anti-parallel) edges.
A number of problems remain open. Because of the logarithmic lower order term
in $P(n)$ for the embedding of a complete binary tree in the mesh there is a small
gap between the distance obtained here and the naive lower bound of the network
radius. Whether this gap can be closed, from either side, is an open question. A
mesh architecture sometimes used is in the form of a torus, with no boundary. It
seems unlikely that a complete binary tree with $2^{2m}$ leaves could be embedded in
the directed $2^m \times 2^m$ torus, but we have not been able to prove this. There is
also no proof of our conjecture that for the shuffle-exchange graph our embedding
exhibits a shortest possible maximum root to leaf distance consistent with our other
embedding requirements.
The question of how to find similarly dense embeddings of complete binary trees in
meshes of higher dimension is unanswered. Similarly, the question of finding dense
embeddings of complete trees of fixed higher degree in communication networks is
unsolved. From a graph-theoretic point of view, dense edge-disjoint embeddings
of arbitrary trees in communication networks present a challenge, although these
problems may prove to be of less general algorithmic importance than embedding
complete trees.
Chapter 6
Practical Parallel Models of Parallel 
Computation
The work described in this chapter was carried out as part of the ESPRIT I project
PUMA which was supported by the European Community (under project number
P2701) and appeared in a report [132] to the European Commission. This chapter
reviews the Models of Massively Parallel Computation that have been proposed to
investigate the major issues in practical parallel computation (including scalability,
granularity, asynchrony, latency, fault tolerance, contention and congestion which
were introduced in chapter 4). None of these models has yet achieved a consensus
as a target model for both hardware and software design, but each provides its own
lessons for the design of efficient, fast, reliable parallel software. With the use of
theoretical solutions which are described in chapter 4, this chapter demonstrates that
there is no theoretical hindrance in designing massively parallel models of parallel
computation.
6.1 Introduction
In the previous chapter we described architecture dependent embedding techniques
which can lead to optimal simulation of PRAM algorithms on feasible machines.
Such techniques, although of wide usefulness, are not always applicable.
In this chapter we address the important question of bridging models for general 
purpose parallel computing which (like the von Neumann machine in sequential 
computation) can act as an interface between the actual hardware and the PRAM 
model. Thus software issues will be separated from hardware issues and the prospect 
of genuinely portable software in an environment of user friendly high level coding 
would be a possibility. In other words, from the programmer’s point of view, the 
realistic parallel model can be made to appear like a PRAM.
There have been several models proposed to bridge the gap between the PRAM
model and feasible machines. These models variously take account of communication
latency, contention and congestion, asynchrony and/or component failures.
They have been introduced together with simulation results, demonstrating the extent
to which PRAM algorithms can be implemented with little or no asymptotic
loss in efficiency. In particular, a user can view these models as extended PRAMs,
which hide hardware details from the user. We shall briefly describe some of these
models and survey some of the difficulties involved in simulation. By doing so, we
show the possibility of general purpose parallel computing.
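Figure 6.1: The practical PRAM. (a) Each processing element consists of a computing
element and a memory module. (b) Computing elements and memory modules are separated
by the network, the memory modules forming a shared memory.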
6.2 The practical PRAM model
The practical PRAM consists of a set of components or processing elements con­
nected by a network. Each processing element consists of a computing element 
and a memory module, as in figure 6.1(a). We may even consider that the computing
elements and the memory modules are separated, as in figure 6.1(b). In both
models, all the memory modules can be considered to be global memory or virtual 
shared memory. Each computing element can access any non-local memory module 
through the network.
From an algorithm designer's view, the network becomes a "black box". Reads from
and writes to a non-local memory module by a computing element are subject to a
uniform delay, or communication latency, which is a parameter $l$ of the machine.
As a function of $p$ (the number of processors), $l$ can vary depending on the
architecture (e.g. it might be $O(\sqrt{p})$ for the $\sqrt{p} \times \sqrt{p}$-node
2-dimensional mesh, or $O(\log p)$ for a $p$-node hypercube). By uniform delay, we
mean that the access time to a non-local memory module is independent of the
processor making the request.
There is a router for routing data between computational elements and memory 
modules. As computing elements and (non local) memory modules communicate 
through a router [157], the tasks of computation and communication can be sepa­
rated. The router operates independently of the individual processors. Once a data 
packet is delivered to the router (through a router interface), the packet is routed 
through the network to its destination without any burden on the processing elements 
which may continue their processing. Note that, by separating computation from 
communication, no particular network topology is favoured beyond the requirement 
that a high throughput be delivered.
For example, if computational elements want to access distinct non-local memory
modules, realising a 1-relation, then they send the requests to the router through a
router interface. As we saw in section 4.3.3, a 1-relation can be realised in $O(l)$
time, where $l$ is the diameter of the network. Note that the algorithm designer does
not need to know about the topology of the network.
The global memory of the feasible machine is divided into a number of memory
modules, in contrast to the PRAM's memory which is a single block. Moreover,
accessing a global memory location takes $O(l)$ steps on the practical PRAM,
whereas the time is $O(1)$ on the PRAM. However, the practical PRAM and the PRAM
are similar to the user, since both models have no notion of network locality. In
this chapter we survey optimal simulations of PRAM algorithms on the practical 
PRAM, and argue that the practical PRAM could be a candidate for general purpose 
parallel computing. By optimal simulation we mean that when simulating a PRAM 
algorithm on a practical PRAM the work done for the algorithm on the PRAM is 
equal to, within a constant factor, the work done on the practical PRAM.
6.3 Latency hiding
A naive approach to implementing a PRAM algorithm on the practical PRAM is
to allow $O(l)$ time for message routing after every step of the PRAM algorithm.
This significantly slows down the algorithm. Techniques are needed to "tolerate"
or "hide" the network latency. Recently, a series of results in [1, 2, 24, 55, 83,
116, 117, 157, 158] have shown how parallel algorithms for realistic models can be
designed such that the effect of network latency is minimised with respect to
the work measure.
First we consider [2, 116, 117], which capture the communication and computational
complexity of PRAM algorithms. The model used is called the Local-memory
Parallel Random Access Machine (LPRAM) [2]. The LPRAM is a CREW PRAM
in which, as well as global memory, each processor is provided with an unlimited
amount of local memory. As in our practical PRAM, the LPRAM has a parameter
$l$ which is the time taken by one communication step. Each computation step
takes unit time. The total time of an algorithm is $T + l \times C$, where $T$ is the
number of computation steps and $C$ is the number of communication steps. The
different model of [117] can be thought of as a pipelined version of the LPRAM.
The computational problem to be solved is presented as a data-dependency graph,
which is a directed acyclic graph (DAG). The nodes of the DAG correspond to
operations and its arcs correspond to the values computed by performing such operations.
A computation schedule of the DAG consists of a sequence of computation steps 
and communication steps. At a computation step each processor may evaluate a 
node of the DAG; this evaluation can only take place when its local memory has the 
values of all incoming arcs into this node. At a communication step, any processor 
may write into the global memory any value that is presently in its local memory, 
and then it may read into its local memory a value from global memory.
A node of the DAG with in-degree zero corresponds to the value of an input. The 
inputs are initially stored in the global memory, and the output of the DAG has to
be written into the global memory. Our problem is to efficiently schedule a DAG to 
minimise overall computation time and communication time such that the total time 
is minimum. For example figure 6.2 shows a DAG and the schedule that computes 
the DAG with two processors, P and Q.
This schedule computes the DAG in five communication steps (i.e. $C = 5$) and
three computation steps ($T = 3$), so the total time is $3 + 5l$. Note that if
we allow Q, which is idle at communication steps 1 and 2, to read a and b, then
Q can compute c itself and communication step 3 is not necessary. Hence, the total
time is reduced to $3 + 4l$. In other words, allowing several processors to compute
the same value can save some communication at no additional time delay. It is an
open problem whether recomputation can save more than a constant factor in time or
communication.
Comm. step 1: P reads a
Comm. step 2: P reads b
Comp. step 1: P computes c
Comm. step 3: P writes c, Q reads it
Comp. step 2: P computes d; Q computes e
Comm. step 4: Q writes e, P reads it
Comp. step 3: P computes f
Comm. step 5: P writes f
Figure 6.2:
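The cost accounting for this schedule can be checked mechanically. The following
Python sketch is illustrative only: the `Step` record, the function name
`lpram_time` and the value l = 10 are assumptions made for the example and are not
taken from [2] or [117]. It counts the computation steps T and communication steps
C of the schedule of figure 6.2, and of the variant in which Q also reads a and b
and recomputes c.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str        # "comm" for a communication step, "comp" for a computation step
    actions: tuple   # what each processor does in this step (documentation only)


def lpram_time(schedule, l):
    """Return (T, C, T + l*C): computation steps cost 1, communication steps cost l."""
    T = sum(1 for step in schedule if step.kind == "comp")
    C = sum(1 for step in schedule if step.kind == "comm")
    return T, C, T + l * C


# The schedule of figure 6.2.
figure_6_2 = [
    Step("comm", ("P reads a",)),
    Step("comm", ("P reads b",)),
    Step("comp", ("P computes c",)),
    Step("comm", ("P writes c", "Q reads c")),
    Step("comp", ("P computes d", "Q computes e")),
    Step("comm", ("Q writes e", "P reads e")),
    Step("comp", ("P computes f",)),
    Step("comm", ("P writes f",)),
]

# Variant with recomputation: Q reads a and b as well (concurrent reads are allowed
# on a CREW machine) and computes c locally, so the exchange of c disappears.
with_recomputation = [
    Step("comm", ("P reads a", "Q reads a")),
    Step("comm", ("P reads b", "Q reads b")),
    Step("comp", ("P computes c", "Q computes c")),
    Step("comp", ("P computes d", "Q computes e")),
    Step("comm", ("Q writes e", "P reads e")),
    Step("comp", ("P computes f",)),
    Step("comm", ("P writes f",)),
]

l = 10  # assumed latency, chosen only for illustration
original_cost = lpram_time(figure_6_2, l)
recomputed_cost = lpram_time(with_recomputation, l)
```

With l = 10 the first schedule costs 3 + 5l = 53 time units and the second
3 + 4l = 43, while the number of computation steps stays at three.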
Moreover, it is an NP-complete problem to decide, given a DAG, an integer $l$, and
$T_{max}$, whether there exists a schedule $S$ such that no time greater than
$T_{max}$ is used [117].
In [116] nontrivial trade-offs between communication and computation were shown
for the diamond DAG. Those results were not satisfactory because no general
principle or technique was described which is applicable to all DAGs. In [117],
however, the technique was generalised to all DAGs and applied to three
particular families of DAGs: the complete binary tree, the butterfly, and the diamond.
Aggarwal et al. [2] have shown that matrix multiplication can be modelled as a
DAG in the form of a complete binary tree. Furthermore, they obtained (upper
and lower) bounds for any binary tree DAG (in which each internal node has
exactly two children). The DAG is determined by the design of the algorithm, but
none of [2, 116, 117] describes a general technique of algorithmic design for
these types of trade-offs. However, they show that, if we schedule a DAG such
that temporal locality of reference is utilised, then the communication cost can be
reduced.
Aggarwal, Chandra and Snir introduced a model called the Block PRAM (BPRAM)
in [1]. They show that efficiency can be enhanced by using temporal and spatial
locality of reference. Spatial locality of reference means that the data items to be
accessed by a processor are in contiguous locations in the global memory. The BPRAM
is similar to our practical PRAM and the LPRAM; in it a processor may read a
block of contiguous locations from global memory, and it may write a block into
contiguous locations of the global memory. Recall that access to a global memory
location may take up to $O(l)$ units of time. On the other hand, a block of $b$
consecutive words in global memory (or local memory of another processor for the
model of [83]) can be copied into local memory, optimally, in time $l + b$, and vice
versa. For example, $l$ consecutive words in global memory can be accessed in $O(l)$
time, not in $O(l^2)$ time. Recent randomised routing algorithms provide strong
theoretical support for this [4, 91]. The router can realise any permutation of $p$
messages (i.e. a 1-relation) of size $m$ in time $O(m + \log p)$ with high probability
on a $p$-node hypercube. This is because there is a significant overhead for
establishing a communication. Once established, a large amount of information can be
transferred at low cost. The latency can be hidden in this way by pipelining a block
of global memory accesses.
We can implement block pipelining to hide latency provided that each processor
has multiple requests to global memory at the same time, and the requests can be
grouped into blocks of length $\Omega(l)$. This can be achieved by assigning many PRAM
processors ("virtual processors") to each actual processor ("physical processor") of
the machine. The ratio $v/p$ is called the parallel slackness, where $v$ is the number
of virtual processors and $p$ is the number of physical processors. Instead of
executing one process on each processor, we now provide each processing element
with a scheduler allowing it to share its time between $v/p$ processes. In other words,
the algorithm is written for $v$ virtual processors, where $v$ significantly exceeds $p$,
the number of physical processors. Then each physical processor may make many
requests during each simulation step. Note that here the requests are made to blocks
of contiguous locations in global memory.
Many PRAM algorithms can be restructured specifically to provide for block
accesses using $O(l)$ parallel slackness (i.e. $v > lp$) [1, 24]. For example, consider
the problem of transposing a $\sqrt{v} \times \sqrt{v}$ matrix. The matrix is given in
row major order $a_{1,1}, a_{1,2}, \ldots, a_{2,1}, \ldots, a_{\sqrt{v},\sqrt{v}}$ in the
first $v$ locations of global memory, and the output
$a_{1,1}, a_{2,1}, \ldots, a_{1,2}, \ldots, a_{\sqrt{v},\sqrt{v}}$ is desired in the next
$v$ locations. This computation can be performed on an EREW PRAM in $O(1)$ time
using $v$ processors.
On the BPRAM the transposition can be done in $O(v/p + l\sqrt{v/p})$ time
using $p < v$ processors. During the computation each processor is assigned to
transpose a submatrix of size $\sqrt{v/p} \times \sqrt{v/p}$. That is, each physical
processor does the "job" of $v/p$ virtual processors. The algorithm executes in rounds,
which are either read rounds or write rounds. Each row of a submatrix requires a
separate block read operation from global memory; likewise each column requires a
separate block write operation. During a read round each processor reads a block of
size $\sqrt{v/p}$ (i.e. a row of the assigned submatrix) from global memory, taking
$l + \sqrt{v/p}$ time. Since each submatrix has $\sqrt{v/p}$ rows there are $\sqrt{v/p}$
read rounds. Thus all the read rounds take $O(v/p + l\sqrt{v/p})$ time; similarly all
the write rounds take the same time. Note that each block access is to consecutive
locations of global memory.
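The read and write rounds can be made concrete with a small sequential simulation.
The Python sketch below is an illustration under stated assumptions, not the
algorithm as given in [1]: the function name `bpram_transpose`, the arrangement of
the p processors as a square grid, and the charge of exactly l + b per block access
of b words are all choices made for this example. The parallel machine is simulated
by looping over the processors and charging each round once.

```python
import math


def bpram_transpose(mem, v, p, l):
    """Transpose the sqrt(v) x sqrt(v) row-major matrix in mem[0:v] into mem[v:2v].

    Returns the simulated parallel time, charging l + b for a block access of b words.
    """
    n = math.isqrt(v)                  # side of the matrix
    q = math.isqrt(p)                  # processors are arranged as a q x q grid
    assert n * n == v and q * q == p and n % q == 0
    s = n // q                         # side of each submatrix, sqrt(v/p)
    local = {}                         # local memory of each physical processor
    time = 0

    # Read rounds: in round r every processor block-reads row r of its submatrix.
    for r in range(s):
        for bi in range(q):
            for bj in range(q):
                start = (bi * s + r) * n + bj * s
                local.setdefault((bi, bj), []).append(mem[start:start + s])
        time += l + s                  # all processors work in parallel

    # Write rounds: in round c every processor block-writes column c of its
    # submatrix, which is a contiguous piece of row bj*s + c of the output.
    for c in range(s):
        for bi in range(q):
            for bj in range(q):
                column = [row[c] for row in local[(bi, bj)]]
                start = v + (bj * s + c) * n + bi * s
                mem[start:start + s] = column
        time += l + s
    return time


v, p, l = 16, 4, 3                     # assumed sizes, for illustration only
mem = list(range(v)) + [None] * v
elapsed = bpram_transpose(mem, v, p, l)
```

For these sizes each submatrix is 2 x 2, so there are two read rounds and two write
rounds of cost l + 2 = 5 each, and the simulated time is 20.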
This BPRAM algorithm is an optimal algorithm if $v/p > l^2$. In this case the work
done by the PRAM, $W_{PRAM} = O(1) \times v$, is equal, to within a constant factor, to
the work done by the BPRAM, $W_{BPRAM}$, where
$W_{BPRAM} = O(v/p + l\sqrt{v/p}) \times p = O(v)$.
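The step from the time bound to the work bound can be spelled out as follows. The
only fact used beyond the text is that a slackness $v/p$ of at least $l^2$ implies
$l \le \sqrt{v/p}$.

```latex
\begin{align*}
W_{BPRAM} &= p \cdot O\!\left(\frac{v}{p} + l\sqrt{\frac{v}{p}}\right)
           = O\!\left(v + l\sqrt{vp}\right), \\
l \le \sqrt{\frac{v}{p}} \;&\Longrightarrow\;
  l\sqrt{vp} \le \sqrt{\frac{v}{p}}\,\sqrt{vp} = v, \\
\text{hence}\quad W_{BPRAM} &= O(v) = O(W_{PRAM}).
\end{align*}
```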
In general, a BPRAM algorithm takes up to $l$ times as long to run as its EREW PRAM
counterpart. However, the factor of $l$ can be reduced in the BPRAM for several
problems. These include matrix transposition, matrix multiplication and the Fast
Fourier Transform [1]. The factor of $l$ occurs for problems (for example, performing
general permutations on elements in memory [1]) that have fine granularity [84]:
the computation cannot efficiently use large blocks for communication. This can
either be due to poor spatial locality (data items to be accessed by a processor are
not in contiguous locations) or due to poor temporal locality (successive accesses
cannot be blocked together, e.g. because of control dependencies). In particular,
Chin justifies the claim of Gazit, Miller and Teng [47] that the list ranking procedure
should be replaced with prefix sum whenever possible on the BPRAM [24]. This is
because prefix sum has better locality of reference than list ranking.
Now consider arbitrary pipelining: instead of block access to contiguous locations
in global memory, a processor can access any $h$ locations distributed in global
memory. Block pipelining has the practical advantages that a batch of values is
guaranteed to return in order, and an entire cache line or memory page can be
transferred to local memory as a unit. However, arbitrary pipelining is clearly more
flexible than block pipelining. It is also easier to use, since the programmer does not
have to ensure that values of interest are collected into contiguous locations. That
is, the programmer does not need to worry about locality of reference.
The Bulk Synchronous Parallel (BSP) model of Valiant [157] and the Phase PRAM 
of Gibbons [55] allow arbitrary pipelining. Computation on these models proceeds 
in a sequence of supersteps. In each superstep each processing element is given 
a task that can be executed using the data that is already available locally before 
the start of the superstep. The task can be computation, message transmission or 
message receipt. Let $L$ denote the time to complete a superstep.
Assume that a router can realise any $h$-relation in $gh$ time, where $g$ is the
throughput of the router. Then we choose $L$ to be at least $gh$. That is, in a
superstep $L$ local operations can be performed, or an $L/g$-relation can be realised.
Note that an $L/g$-relation can be realised in $L$ time because $L \ge gh$.
To support the BSP model Valiant provides a strong theoretical result. Theorem
4.3.6 of chapter 4 shows that a $\log p$-relation can be realised on a $p$-node hypercube
in $O(g \log p)$ time. This gives an optimal simulation. A superstep of the BSP model
can be simulated in $O(L)$ time when the topology of the network is a hypercube, every
node of which is a computing element with memory, and $L \ge g \log p$. In each
superstep $L$ local computations are performed or an $L/g$-relation is realised. All this
can be done in $O(L)$ time on the hypercube.
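The relationship between $g$, $h$ and $L$ can be illustrated with a few lines of
Python. This is a minimal sketch under simplifying assumptions: the function names
are hypothetical, the cost of an h-relation is taken to be exactly g*h, and the
constant hidden in the $O(g \log p)$ hypercube bound is taken to be 1.

```python
def superstep_period(g, h):
    """Smallest period L that lets the router realise an h-relation (cost taken as g*h)."""
    return g * h


def superstep_capacity(L, g):
    """What fits in one superstep of length L: L local operations or an (L/g)-relation."""
    return {"local_operations": L, "relation_size": L // g}


def hypercube_period(p, g):
    """Period for a p-node hypercube, assuming a (log p)-relation costs g * log2(p)."""
    log_p = p.bit_length() - 1
    assert 1 << log_p == p, "sketch assumes p is a power of two"
    return superstep_period(g, log_p)


# Assumed machine: 1024 nodes and g = 2, chosen only for illustration.
L = hypercube_period(1024, 2)
capacity = superstep_capacity(L, 2)
```

With these assumed values the period is L = 20, within which a processor can
either perform 20 local operations or take part in a 10-relation.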
6.4 Asynchronous computation
It is clear that a PRAM algorithm can be converted into an asynchronous PRAM 
algorithm by imposing synchronisation after each statement of the algorithm. How­
ever, this significantly slows the algorithm down. In this section we investigate the 
design of algorithms which minimise the cost of synchronisation.
Recent papers [33, 101, 113] focus on the implicit cost of synchronisation. For 
example, in [101] the time complexity of an asynchronous algorithm is the number 
of instructions executed by a processor including busy wait instructions (i.e. taking 
the implicit costs of synchronisation into account). The model used is the CRCW 
PRAM, but the processors can have arbitrary asynchronous behavior, including 
arbitrary unbounded delays in executing instructions. These delays can be overcome 
through the use of randomisation as follows. In this model processors are not 
assumed to have unique IDs; each processor is instead equipped with an independent 
random number generator. A directed acyclic graph (DAG), representing the tasks to
be performed and the dependencies between them, is placed in the shared memory.
Each processor selects a task at random, performs the task if its predecessors in the
graph have been completed, and repeats. In this way, processors that are delayed,
or that have failed, do not unduly slow down the computation: the fast processors 
will simply evaluate more nodes in the graph. For example, a maximum finding 
algorithm proceeds in the following manner:
1. Examine the root node. If it has been evaluated, exit the computation
2. Select an interior node uniformly at random
3. If the children of the interior node have been evaluated, evaluate the node
4. Return to step 1
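The four steps above can be sketched in Python as follows. This is an illustration
rather than the algorithm of [101] itself: the function name `async_max`, the heap
layout of the tree, and the use of a single sequential loop to stand in for the
interleaved actions of many asynchronous processors are all assumptions of the
example. Each loop iteration represents one node selection by some processor.

```python
import random


def async_max(values, seed=0):
    """Evaluate a binary max-tree by repeatedly selecting random interior nodes.

    Returns (maximum, number of node selections). Assumes len(values) is a power of two.
    """
    n = len(values)
    assert n >= 2 and n & (n - 1) == 0
    rng = random.Random(seed)
    tree = [None] * (2 * n)          # heap layout: node i has children 2i and 2i + 1
    tree[n:] = values                # leaves hold the inputs and count as evaluated
    selections = 0
    while tree[1] is None:           # step 1: stop once the root has been evaluated
        node = rng.randrange(1, n)   # step 2: choose an interior node at random
        selections += 1
        left, right = tree[2 * node], tree[2 * node + 1]
        if left is not None and right is not None:
            tree[node] = max(left, right)   # step 3: children ready, evaluate the node
        # step 4: return to step 1
    return tree[1], selections


result, selections = async_max([3, 9, 2, 7, 5, 1, 8, 4])
```

The result does not depend on which processor makes which selection, which is why
slow or failed processors only reduce the rate of progress. The number of
selections varies with the random choices but is never less than the n - 1 interior
nodes that must be evaluated.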
By this method any $n$-processor PRAM algorithm that solves a DAG in $O(T(n))$
time can be transformed into an asynchronous computation with $O(nT(n))$ expected
work using $n/(\log n \log^* n)$ processors [101]. Hence the simulation is optimal.
However, the model does not account for communication delay. If there is a
communication delay (as in practical machines) then choosing a node of a DAG at random
will not minimise the communication delay.
The APRAM, introduced by Cole and Zajicek [32] focuses on the explicit costs of 
synchronisation. The goal of the APRAM model is to design algorithms which avoid 
global synchronisation, thereby reducing the explicit costs of synchronisation. In 
this scheme one time unit is called a round, a notion introduced earlier in [96, 10].
In one round each processor executes at least one instruction: the slowest executes
one and faster processors execute more. For instance, in the parallel summation
problem, the algorithm uses $2n - 1$ (shared) memory locations treated as an implicit
complete binary tree. Each location is assumed to contain an extra valid bit; the
valid bits of the input values are assumed to be initially true and all other locations
start with a valid bit of false. The algorithm terminates when the valid bit of the
root is true. The algorithm for process $i$ is given below, where $L(i)$ and $R(i)$ are
respectively the left and right children of node $i$:
1. wait until (L(i) is valid)
2. wait until (R(i) is valid)
3. V(i) := L(i) + R(i)
4. valid(i) := true
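The behaviour of this algorithm in the worst case, where every process executes
exactly one instruction per round, can be simulated directly. The Python sketch
below is illustrative: the function name `apram_sum`, the heap numbering of the
nodes and the convention that a process observes the valid bits as they stood at
the start of the round are assumptions of the example, not part of [32].

```python
def apram_sum(inputs):
    """Simulate the valid-bit summation with every process taking one step per round.

    Returns (sum, number of rounds). Assumes len(inputs) is a power of two.
    """
    n = len(inputs)
    assert n >= 2 and n & (n - 1) == 0
    value = [0] * (2 * n)            # nodes 1..2n-1; node i has children 2i and 2i + 1
    valid = [False] * (2 * n)
    value[n:] = inputs               # the leaves hold the inputs, with valid bit true
    valid[n:] = [True] * n
    pc = [1] * n                     # pc[i]: next instruction (1..4) of process i
    rounds = 0
    while not valid[1]:              # terminate when the root is valid
        rounds += 1
        seen = valid[:]              # valid bits as they stood at the start of the round
        for i in range(1, n):
            left, right = 2 * i, 2 * i + 1
            if pc[i] == 1 and seen[left]:        # 1. wait until L(i) is valid
                pc[i] = 2
            elif pc[i] == 2 and seen[right]:     # 2. wait until R(i) is valid
                pc[i] = 3
            elif pc[i] == 3:                     # 3. V(i) := L(i) + R(i)
                value[i] = value[left] + value[right]
                pc[i] = 4
            elif pc[i] == 4:                     # 4. valid(i) := true
                valid[i] = True
                pc[i] = 5                        # process i has finished
    return value[1], rounds


total, rounds = apram_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

Under these conventions each level of the tree costs four rounds, so eight inputs
need 4 log 8 = 12 rounds, in line with the logarithmic bound on the number of rounds.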
The sum of $n$ numbers can be computed in $O(\log n)$ rounds; the worst case is
achieved when every processor executes a single step each round. The APRAM
permits multiple sets of processors to synchronise independently and in parallel,
but it does not account for communication delay, and it permits concurrent reads and
writes. The round measure does not capture the extent to which slowing down a subset
of the processors slows down the overall running time of the algorithm. Cole and
Zajicek therefore introduce a complexity measure which divides processes into two
sets, the slow and the normal processes. For example, the summation algorithm takes
$O(\log n) + f(s, r)$, where $f$ is a function of $s$ and $r$. Each of the slow processes
executes at least one event in any $s$ consecutive rounds, and $r$ is the number of slow
processes. However, this measure is difficult to analyse and to implement on practical
machines.
Many asynchronous algorithms have been developed for particular problems [32,
33, 101, 113, 126, 57]. Most of this work is tailored to specific machines and does
not present a general treatment of asynchronous parallel computation. Moreover,
communication delays have not been considered in developing these asynchronous
algorithms. On feasible models, where communication is delayed, their algorithmic
techniques will not give reasonable time complexity.
Gibbons [55] suggested an asynchronous PRAM model, the Phase PRAM, which 
includes extra hardware needed to achieve synchronisation. The set of operations
between two synchronisation barriers is called a phase. The cost of a phase is the 
maximum number of steps taken by any of the processors during that phase, and 
the cost of a synchronisation barrier is B(p), which is a function of the number of 
processors.
Valiant's BSP model [157] incorporates barrier synchronisation in a similar way to
the Phase PRAM. Each processor operates in accordance with a barrier synchronisation
protocol which may be supervised by a master synchroniser. The synchroniser
checks whether all (or a subset of) the processors (and the router) have completed a
superstep (that is a phase) and, if so, signals all processors to continue to the next 
superstep.
In essence, the insertion of synchronisation barriers between supersteps ensures 
that the algorithm is slowed down to the speed of the slowest processor within each 
superstep. Although the number of steps executed remains the same regardless of the 
relative speeds of the processors, the time elapsed to execute this type of algorithm is 
in fact very sensitive to changes in speeds of processors. In particular, if the slowest 
processor is very slow, then so is the actual running time of the algorithm. It is a 
weakness that such considerations are not reflected in the complexity measure.
However we can overcome the problem by synchronising all or a subset of the 
components at regular intervals of L time units where L is a periodicity parameter. 
After each period of L time units, a global check is made to determine whether 
the superstep has been completed by all the processors. If it has not, then the next 
period of L units is allocated to the unfinished superstep. The results of the runtime 
analysis will not change by more than small constant factors.
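The effect of checking for completion only every $L$ time units can be quantified
with a short calculation. The Python sketch below is a hedged illustration: the
function name and the sample running times are invented for the example, and each
superstep is simply rounded up to a whole number of periods.

```python
import math


def elapsed_with_periodic_checks(superstep_times, L):
    """Total elapsed time when completion is checked only at multiples of L.

    superstep_times[i] is the time the slowest processor needs for superstep i.
    """
    total = 0
    for t in superstep_times:
        periods = max(1, math.ceil(t / L))   # an unfinished superstep gets another period
        total += periods * L
    return total


# Assumed running times for three supersteps and an assumed period L = 10.
elapsed = elapsed_with_periodic_checks([7, 10, 23], 10)
```

Here the three supersteps occupy one, one and three periods, giving 50 time units
in place of 40. In general a superstep that needs time t is charged less than
t + L, which is at most 2t whenever t is at least L.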
Synchronisation is needed in a PRAM algorithm precisely when a write to a shared 
memory location is followed by a read to the same location. Provided there is 
no need for synchronisation, processors running asynchronously can execute L 
instructions each between synchronisation barriers. In this way processors in an 
asynchronous machine can perform $L$ instructions, followed by a synchronisation
barrier of cost at most $L$, not in time $O(L^2)$, but optimally in time $O(L)$. We can use
this bulk synchrony to hide the synchronisation overhead, provided that we have 
sufficient parallel slackness so that each processor makes many memory accesses 
during each simulation step.
A common algorithm design technique is to have each processor take about $L$ steps
during each superstep, to balance communication and computation costs. If within
each superstep each component sends or receives at most $h$ messages in time $T$,
then we can fix $L$ to be greater than $T$. Not surprisingly, many PRAM algorithms can
be restructured specifically to provide for bulk synchrony using parallel slackness.
6.5 Memory management
To design algorithms for a virtual shared memory model machine, we need to know 
how to distribute the logical memory addresses among the physical locations of the 
machine such that the distribution will not slow down the computation (i.e. avoids
contention and congestion). The distribution can be done randomly or deterministically.
Under fairly general assumptions, Upfal and Wigderson [156] showed that an on-line
simulation of $T$ PRAM steps by a synchronous practical PRAM with the same
number of processors requires $\Omega(T \log n / \log\log n)$ time, where $n$ is the number
of processors. Their simulation assumes each processing element of the practical
PRAM consists of a computing element and a memory module, and the processing
elements are connected by a complete network.
We saw in section 4.3.1 that the most promising method known for randomly 
evening out memory accesses (to avoid contention) is hashing. Using randomised 
hash functions, the simulation of a PRAM on a practical PRAM is governed by the 
following:
1. The time to evaluate the hash function.
2. The maximum number of shared memory accesses which are mapped to the 
same memory module under the hash function.
3. The time needed to access a memory location (i.e. the communication latency $l$).
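The second factor, the maximum number of accesses mapped to one module, can be
observed in a small experiment. The Python sketch below is an assumption-laden
illustration: it uses a linear hash function of the form ((a x + b) mod P) mod m
with randomly chosen a and b, which is a standard textbook family and not
necessarily the class of hash functions analysed in section 4.3.1, and the names
`make_hash` and `max_module_load` are hypothetical.

```python
import random
from collections import Counter

PRIME = 2_147_483_647  # 2**31 - 1, a prime larger than any address used below


def make_hash(num_modules, rng):
    """Return a randomly chosen linear hash function from addresses to module numbers."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda address: ((a * address + b) % PRIME) % num_modules


def max_module_load(addresses, module_of):
    """Largest number of the given addresses that fall into a single memory module."""
    return max(Counter(module_of(x) for x in addresses).values())


p = 64                                     # assumed number of memory modules
addresses = [p * i for i in range(p)]      # one access per processor, with stride p

# Simple interleaving (address mod p) sends every one of these accesses to module 0.
naive = max_module_load(addresses, lambda x: x % p)

# A randomly chosen hash function spreads the same accesses over the modules.
hashed = max_module_load(addresses, make_hash(p, random.Random(1)))
```

The interleaved mapping serialises all 64 accesses at one module. The hashed
mapping depends on the random choice of a and b; no particular value of `hashed`
is claimed here, only that it can be compared with `naive` by running the sketch.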
Simulation using hash functions was dealt with in [24, 74, 104, 129, 130, 157, 158].
Recently Karp and Luby [75] introduced a simulation which is more involved than
that using a simple hashing scheme. The simulation uses two or more hash functions, 
and thus makes the contents of each PRAM cell accessible in two or more places.
As we saw in section 6.3, efficiency can be enhanced by using spatial locality of 
reference. The results of [24] show the possibility of exploiting locality during the 
simulations by using locality-preserving hash functions. This simulation supports 
block pipelining. Valiant [157, 158] gave strong theoretical evidence for supporting
arbitrary pipelining during simulations of the PRAM on the BSP model. Each
processing element of the BSP model consists of a computational element and a memory
module. Each computational element has a capability for efficiently computing hash
addresses. This can be done by a hardware hashing module associated with a router
interface, without slowing down the computations performed by the processor.
Theorem 6.5.1 [157, 158] Any EREW PRAM step of the $v$ processors can be
simulated on the $p$-processor BSP in optimal expected time ($O(v/p)$ time), provided
$v = p \log p$ and $g = O(1)$, where $g$ is the throughput of the router.
Proof We randomly choose an appropriate hash function that will randomise memory
and distribute memory requests evenly among the $p$ memory modules of the BSP. We
distribute the $v$ processors of the EREW PRAM so that each processor of the BSP
model simulates $v/p = \log p$ of these. In one superstep the BSP model completes
one EREW PRAM step. In the superstep each processor may need to access $\log p$
memory locations. Recall theorem 4.3.1. The expected largest number of accesses
made to any memory module is $O(\log p)$. So each processor will send $\log p$ requests
and each memory module will receive at most $O(\log p)$. Hence the duration of
the superstep, $L$, needs to be large enough to accommodate the routing of a
$\log p$-relation. From theorem 4.3.6, we can choose $L$ as large as $O(g \log p)$ for the
hypercube. By assuming $g = O(1)$, the superstep can be completed in optimal time.
□
Note that, if hashing is to be exploited efficiently, then the periodicity L may as well 
be at least logarithmic. Moreover, Valiant's results are justified only if we assume
that the hash function can be evaluated in constant time.
The prevailing vision of general-purpose parallel computers is that the network 
topology should be hidden, but the programmer should retain control of memory
management, including the decision whether or not to hash the shared memory. 
This decision should take into account the cost of hashing, as well as the relative 
intricacies of the model and simulated PRAM algorithms.
When the patterns of shared memory accesses are known in advance, the memory 
locations can be deterministically addressed, to avoid contention. This reduces 
the amount of slack required in programming [48, 157]. Moreover, we need not
maintain logarithmic periodicity, i.e. the length of a superstep does not need to be
logarithmic.
6.6 Conclusion
Overall the material of this chapter demonstrates that there is no theoretical hindrance 
in designing massively parallel machines for parallel computation.
Chapter 7
Bulk Synchronous Parallel 
Algorithms
The work of this chapter was carried out as part of the ESPRIT I project PUMA as 
explained in the beginning of the last chapter and appeared in a report [132] to the
European Commission. This chapter shows that scalable transportable algorithms
can be written for certain basic tasks: balanced tree computations, the Fast Fourier
Transform (FFT) and matrix multiplication.
7.1 Introduction
There are two modes of programming, automatic mode and direct mode. In the 
automatic mode the virtual shared memory is distributed among memory modules 
by a hash function. Thus, the memory distribution is hidden from the user (as in the 
PRAM  model). However, optimal simulation with the automatic mode requires
the slackness to be at least logarithmic, and g to be close to unity (theorem 6.5.1). 
Moreover, we assume that the hash function can be evaluated in constant time. If the 
patterns of shared memory accesses are known in advance, then the programmer can 
retain control of the memory management to avoid hashing. This is called the direct 
mode of programming. In this chapter we describe some transportable algorithms 
in the direct mode on a BSP model.
Recall that the BSP model is defined as the combination of four attributes [157]:
• A number of components each performing computational and/or memory 
operations.
• A common hashing function being evaluated by an individual hashing module
associated with each computational element. Each hashing module is
implemented in hardware.
• A router for routing data between computational and memory elements. The 
router operates independently of continued computation and storage access in 
the computational and memory elements.
• A synchroniser for all or a subset of the components.
A computation on the BSP model consists of a sequence of supersteps. In each
superstep, each component is allocated a task consisting of some combination of
local computation steps, message transmissions and message arrivals from other
memory elements. After each period of $L$ time units, a global check is made
to determine whether the superstep has been completed. If it has, the machine
proceeds to the next superstep. Otherwise the next period of $L$ units is allocated to
the unfinished superstep. The performance of a BSP algorithm is measured by three
parameters: $p$, the number of processors; $g$, a parameter such that an $h$-relation can
be realised in $gh$ steps; and $L$, which captures the minimum reasonable interval between
global synchronisations. We can choose $L$ as small as the time needed to realise an
$h$-relation, where $h = L/g$. For example, if the network is a $p$-node hypercube then
we choose $h = \log p / g$ and $L = \log p$.
Note that, the algorithms described below assume that each processor of the BSP 
model is performing computational and memory operations.
7.2 Balanced tree computation
As we saw in section 2.3.1, the complete binary tree computation can be performed
on an EREW PRAM by $p$ processors in $O(\log p)$ time. For the BSP model, when
there is substantial communication latency $l$, for example if $l = \log_2 p$, the
complete binary tree is no longer the natural structure for parallel computation.
This is because the synchroniser needs to synchronise at every level of the
tree, and each superstep consists of 2 computational steps in which each computational
element sends or receives at most 2 messages. Recall that each internal node of
a complete binary tree has two children. Each superstep will take $O(\log p)$ time,
that is $L = O(\log p)$. But each superstep consists of only 2 ($< L$) computational
operations.
However, this can be improved by using an $h$-ary tree [117, 48]. Figure 7.1 shows
the 4-ary tree with 64 leaves. At each level of the tree, each active processor reads
$h$ values, computes at most $L$ operations, writes the result, and then
synchronises. Hence, if $h = p^{1/k}$ then $k$ supersteps will suffice and each superstep
will take $O(L)$ time.
Figure 7.1: The 4-ary tree with 64 leaves.
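The saving in supersteps obtained by widening the tree can be seen in a short
simulation. The Python sketch below is illustrative only: the function name
`tree_reduce` is hypothetical, summation stands in for an arbitrary associative
operation, and one pass of the loop models one superstep ending in a barrier.

```python
def tree_reduce(values, arity, op=sum):
    """Combine values with a tree of the given arity, one tree level per superstep.

    Returns (result, number of supersteps).
    """
    level = list(values)
    supersteps = 0
    while len(level) > 1:
        # Each active processor reads `arity` values, combines them locally
        # and writes a single result for the next level.
        level = [op(level[i:i + arity]) for i in range(0, len(level), arity)]
        supersteps += 1              # barrier synchronisation after each level
    return level[0], supersteps


leaves = list(range(1, 65))          # 64 leaves, as in figure 7.1
binary = tree_reduce(leaves, 2)
four_ary = tree_reduce(leaves, 4)
```

For 64 leaves the binary tree needs six supersteps while the 4-ary tree needs
three, matching the case where the arity is the cube root of the number of leaves.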
7.3 Fast Fourier Transform
It is a well known fact that the FFT can be efficiently computed in parallel using
a communication pattern that is a butterfly graph. Recall that the $v$-input butterfly
has $v$ rows and $\log v + 1$ levels (or columns). The inputs are at level 0 and the
outputs at level $\log v$. By assigning a processor to each row we can simulate one
level at a time. In this way a PRAM can compute the FFT in $O(\log v)$ time using $v$
processors. Likewise, the BSP can simulate one such level at a time, synchronising
after each level, to solve an FFT problem in $\log p$ supersteps with $p$ processors,
where $p$ is equal to $v$. Notice that in each superstep we need to realise a 1-relation.
If the time to realise a 1-relation is equal to that of an $h$-relation then we can
improve upon the work as follows [55, 117].
We partition the levels of the butterfly into $\log n / \log h$ stages of $\log h$ consecutive 
levels each. By the structure of the butterfly, the value at each node in the last level 
of its stage depends on the values at $h$ nodes in the last level of the previous stage. 
Moreover, the values of a set of $h$ nodes in the last level of their stage depend on 
the values of a set of $h$ nodes in the last level of the previous stage. For example, 
figure 7.2 shows a butterfly graph of $n = 16$ inputs. The levels are divided into 
stages of $\log h$ levels when $h = 4$. The set of values of the four circled nodes in the last 
level of stage 1 depends on the set of values of the four squared nodes in level 0. The 
dependencies between the circled nodes and the squared nodes are shown as thick 
edges. Interestingly, these thick edges form a butterfly graph of four inputs. There 
are four independent similar butterfly graphs in stage 1. In general, each stage 
consists of $n/h$ independent butterfly graphs of $h$ inputs each. Here the expressions 
are integers rounded appropriately. At each stage a processor can mimic each of 
these butterfly graphs of $h$ rows and $\log h$ levels in $h \log h$ sequential time. All the 
butterfly graphs in a stage can be executed in $h \log h$ time using $n/h$ processors. 
Once a stage (or superstep) is completed the next stage (or superstep) can start. 
In each superstep each processor computes $h \log h$ local operations and sends and 
receives $h$ messages. Each superstep will take $L = O(h \log h)$ time. Initially we 
distribute the $n$ inputs uniformly among $p = n/h$ processors, $h$ inputs in each. We can, 
therefore, evaluate the FFT on $n/h$ processors in $\log n / \log h$ supersteps, with total 
time $h \log n$. Note that the work of this algorithm is $O(n \log n)$ and equal to the work 
of the best known PRAM algorithm. Hence we have an optimal implementation.
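The staging argument can be checked with a minimal sequential sketch. It assumes a radix-2 decimation-in-time FFT with bit-reversed input order, that n and h are powers of two, and that one dictionary entry stands for one processor; the names `staged_fft`, `naive_dft` and `bit_reverse` are hypothetical and are not taken from the thesis. Within a stage, each group of rows is updated using only rows of the same group, which is the independence property described above.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def naive_dft(x):
    """Reference O(n^2) discrete Fourier transform, used only for checking."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def staged_fft(x, h):
    """Radix-2 FFT whose log n levels are grouped into stages of log h levels.

    Returns (transform, number_of_stages). Each stage corresponds to one
    superstep; each group of rows corresponds to one small butterfly that
    a single processor could evaluate locally.
    """
    n = len(x)
    m = n.bit_length() - 1          # log2 n
    b = h.bit_length() - 1          # log2 h
    if n != 1 << m or h != 1 << b or b < 1:
        raise ValueError("n and h must be powers of two with h >= 2")
    a = [complex(x[bit_reverse(i, m)]) for i in range(n)]
    stages = 0
    for lo in range(0, m, b):
        hi = min(lo + b, m)
        # Bits of the row index that are flipped by the levels of this stage.
        mask = ((1 << hi) - 1) ^ ((1 << lo) - 1)
        groups = {}
        for r in range(n):
            groups.setdefault(r & ~mask, []).append(r)
        for rows in groups.values():        # one independent butterfly each
            for level in range(lo, hi):
                span = 1 << level
                for r in rows:
                    if r & span == 0:
                        w = cmath.exp(-2j * cmath.pi * (r % span) / (2 * span))
                        u, t = a[r], w * a[r | span]
                        a[r], a[r | span] = u + t, u - t
        stages += 1
    return a, stages
```

For n = 16 and h = 4 the sketch uses 2 stages of 4 independent four-input butterflies each, while h = 2 degenerates to one level per stage, that is log n stages.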
7.4 Transitive closure and graph algorithms
As described in section 2.5.4, the transitive closure of an adjacency matrix solves 
several graph problems including topological sorting, strong components and all 
pairs shortest paths. Recall that the transitive closure $A^*$ of an $n \times n$ matrix $A$ is 
equal to $D^n$, where $D = I \oplus A$ and $I$ is the identity matrix. Here the matrix 
product is defined with two binary operators $\otimes$ and $\oplus$.
Suppose we have $p < n^2$ processors. We assume that the elements of $A$ are initially 
distributed as evenly as possible among the $p$ processors. First we compute the 
matrix $D$; this can be done in $O(1)$ time. Now $A^*$ can be computed by repeated 
squaring of $D$ as was explained in section 2.3.2, chapter 2. Each squaring operation 
is performed as follows. This is similar to the algorithm of Valiant [157] for 
computing the product of two $n \times n$ matrices.
Consider the squaring operation on matrix $D$, i.e. computing $D^{2s}$ from $D^{s}$, for 
$1 \le s \le \lceil \log_2 n \rceil$. We assign to each processor the subproblem of computing 
an $(n/\sqrt{p} \times n/\sqrt{p})$ submatrix of $D^{2s}$. To compute the position $(i, j)$ of $D^{2s}$, a 
processor has to receive the $i$th row and the $j$th column of $D^{s}$. Suppose a processor 
computes positions $(k, j)$ of $D^{2s}$, where $l \le j < l + n/\sqrt{p}$. Then the processor 
has to receive data describing the $k$th row and the columns from the $l$th column to the 
$(l + n/\sqrt{p} - 1)$th column of matrix $D^{s}$. That is, the processor has to receive $n$ 
elements of the row and $n(n/\sqrt{p})$ elements of the columns. There are $n/\sqrt{p}$ such 
rows required to compute all the positions of the $n/\sqrt{p} \times n/\sqrt{p}$ submatrix. Thus, 
each processing element has to receive $2n^2/\sqrt{p}$ elements of $D^{s}$. Now consider 
the number of messages each processor has to send. Elements of each row of 
$D^{s}$ are required by $\sqrt{p}$ processing elements. This is because each row of $D^{2s}$ is 
partitioned into $\sqrt{p}$ pieces of length $n/\sqrt{p}$, and the pieces are computed by distinct 
processing elements. Similarly, elements of each column of $D^{s}$ are required by $\sqrt{p}$ 
processing elements. Thus each element of $D^{s}$ is required by $2\sqrt{p}$ processors. 
Assume that the $n^2$ elements describing $D^{s}$ are distributed uniformly among $p$ 
processors. Each processor has $n^2/p$ elements of $D^{s}$. Each processor thus has 
to replicate each of its elements $2\sqrt{p}$ times and send the appropriate elements to 
the $2\sqrt{p}$ processors requiring them. Hence we have a communication pattern in which 
each processor has to send $2n^2/\sqrt{p}$ messages and receive $2n^2/\sqrt{p}$ messages, i.e. 
realising a $2n^2/\sqrt{p}$-relation. Now each processor can compute an $n/\sqrt{p} \times n/\sqrt{p}$ 
submatrix of $D^{2s}$ in $2n^3/p$ sequential time, using the standard sequential algorithm. 
Suppose $2n^2/\sqrt{p} = h$, then we can square the matrix $D^{s}$ in $O(n^3/p)$ time (i.e. in 
a superstep of length $O(n^3/p)$). For example, figure 7.3 shows the partition of $D^{2s}$ 
when $n^2 = 16$ and $p = 4$. The elements of the matrices are shown as thick dots. 
Each square in $D^{2s}$ represents a subproblem. Each processor computes the elements 
in a square in $D^{2s}$. For example, the processor that computes the elements of the top left 
hand corner square has to receive the elements of $D^{s}$ that are enclosed 
in squares on the left hand side of the figure.
Figure 7.3: The partition of $D^{2s}$ when $n^2 = 16$ and $p = 4$.
Notice that for the next squaring operation on $D^{2s}$ to compute $D^{4s}$, the elements of 
$D^{2s}$ are distributed uniformly among the $p$ processors by the squaring operation described 
above. We can compute the transitive closure in $\lceil \log_2 n \rceil$ supersteps of length $n^3/p$ 
using $p$ processors. The total work of this algorithm is $O(n^3 \log n)$. We thus have an 
optimal parallel algorithm for topological sorting, strong components and all pairs 
shortest paths on graphs.
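The squaring scheme can be sketched sequentially as follows. This is only an illustration under stated assumptions: the two binary operators are taken to be Boolean AND and OR (reachability), p is assumed to be a perfect square whose root divides n, and the names `blocked_square` and `transitive_closure` are hypothetical. Each (bi, bj) iteration plays the role of one processor, and the count of received elements reproduces the $2n^2/\sqrt{p}$ figure derived above.

```python
import math


def blocked_square(d, p):
    """Square the Boolean matrix d block by block, as p processors would.

    Returns (d squared, list with the number of elements each
    processor receives). Product uses AND, accumulation uses OR.
    """
    n = len(d)
    q = math.isqrt(p)
    if q * q != p or n % q != 0:
        raise ValueError("sketch assumes p is a perfect square and sqrt(p) divides n")
    b = n // q                      # side of each submatrix, n / sqrt(p)
    out = [[False] * n for _ in range(n)]
    received = []
    for bi in range(q):
        for bj in range(q):
            # Data this processor needs: b full rows and b full columns of d.
            rows = [d[i] for i in range(bi * b, (bi + 1) * b)]
            cols = [[d[k][j] for k in range(n)] for j in range(bj * b, (bj + 1) * b)]
            received.append(len(rows) * n + len(cols) * n)
            for i, row in enumerate(rows):
                for j, col in enumerate(cols):
                    out[bi * b + i][bj * b + j] = any(x and y for x, y in zip(row, col))
    return out, received


def transitive_closure(adj, p):
    """Reflexive transitive closure by repeated squaring of D = I OR A.

    Returns (closure, number_of_squarings); each squaring is one superstep.
    """
    n = len(adj)
    d = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    supersteps = 0
    power = 1
    while power < n:                # ceil(log2 n) squarings in total
        d, _ = blocked_square(d, p)
        power *= 2
        supersteps += 1
    return d, supersteps
```

For a directed path on 4 vertices and p = 4, two squarings suffice and every processor receives 16 elements per squaring.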
7.5 Further work
One can observe from this chapter that algorithmic design requires a new discipline 
to get optimal algorithms in the BSP model. A systematic study ought to be the 
subject of an extensive research program. Some work in this direction is 
described in [48].
Chapter 8
Conclusions
This thesis reviewed the evidence for the statement [166] that: Unless parallel 
machines are designed to support the PRAM, or a model of parallel computation 
which is very close to it, the design of parallel algorithms is doomed to be a very 
difficult (or even impossible) task. In the course of this review we presented new 
results concerned with parallel approximation algorithms (chapter 3), embeddings 
(chapter 5) and the design of bulk synchronous algorithms (chapter 7). This work 
has appeared in the literature and has been reported at conferences as detailed in the 
declaration at the beginning of this thesis.
We have seen that the PRAM provides a very simple and natural architecture 
independent model for the parallel algorithm designer. Furthermore, the PRAM has 
proved to be a valuable tool for theoretical computer scientists studying the power 
and fundamental limitations of parallelism. Unfortunately, the gap between real 
parallel machines and the PRAM may force us to think that the PRAM is not a 
particularly practical model for general purpose parallel computing. However, the 
theoretical community has proved that the gap between the PRAM and feasible 
parallel models can be bridged. Solutions have been found for effectively 
interconnecting processing elements, for routing data on these networks and for distributing 
the data among memory modules without hotspots. Using these solutions, we have 
reviewed the possibility of general purpose computing employing a bridging model. 
Such a model acts as an interface between the actual hardware and the PRAM model. 
We reviewed the evidence that if a practical model can be viewed as a PRAM by 
the user (i.e. the model hides all the hardware details) then this will achieve scalable 
parallel performance and portable parallel software. We demonstrated that PRAM 
algorithms can be optimally implemented on such practical models.
Chapter 2 described algorithmic tools and techniques which have been frequently 
used to place many problems in the class NC. In particular, we saw that by 
computing the transitive closure (of the adjacency matrix) several graph problems can 
be placed in NC. Unfortunately, this technique does not lead to an efficient parallel 
algorithm. At this time no efficient parallel solutions are known for this problem. 
This difficulty is known as the transitive closure bottleneck [128]. Designing 
efficient algorithms without the transitive closure technique for those problems is a 
challenging area for research.
In chapter 3 we considered the notion of parallel approximation algorithms. In 
particular we provided such an algorithm for finding a minimum-weight perfect 
matching. The question of whether the problem of finding an exact solution to the 
minimum-weight perfect matching problem can be placed in the class NC remains 
open. Resolution of the existence or otherwise of appropriate algorithms in this
area may ultimately help to place more precise boundaries around what ought to 
be regarded as tractable problems for parallel computation. The minimum-weight 
perfect matching problem remains open for complete weighted graphs, even if 
they satisfy the triangle inequality. The algorithm of chapter 3 places the problem of 
finding an approximate minimum-weight perfect matching in a complete weighted 
graph satisfying the triangle inequality in NC with a performance ratio of $2 \log_3 n$. 
The algorithm is conceptually very simple and comes within a log2 n factor of the 
work measure of the sequential algorithms. It is also the first NC-approximation 
algorithm for the task with a sub-linear performance ratio.
In chapter 5 we described dense edge-disjoint embeddings of the complete binary 
tree with $n$ leaves in the following $n$-node communication networks: the hypercube, 
the de Bruijn and shuffle-exchange networks and the 2-dimensional mesh. In the 
embeddings the maximum distance from a leaf to the root of the tree is 
asymptotically optimal. The embeddings facilitate efficient implementation of many PRAM 
algorithms on these networks. Note that this technique is architecture dependent. 
However, embedding may be hidden by system software/hardware in due course.
In chapter 6 we reviewed the practical PRAM models (in particular the BSP model) 
for architecture independent parallel algorithm design. These models differ from 
the well studied PRAM in two important parameters, namely $l$ and $g$. Although the 
models can cope with the current best values of $l$ and $g$, we suggest that work continue 
on improving these values. A desirable goal is to obtain values of the same order as for 
the PRAM ($l = g = O(1)$).
In chapter 7 we described some (direct) bulk synchronous algorithms, but a 
systematic study ought to be the subject of an extensive research program. Some work 
in this direction is described in [48].
We described some results concerned with fault tolerance, for example fault-tolerant 
routing using the IDA. However, in this thesis we have not given much attention to 
the problem of coping with processor failures. This is important for large parallel 
systems: the larger the number of processors, the greater the probability of failure. 
We nevertheless assumed that the processing elements of the BSP model operate correctly at 
all times. In this context efficient techniques that will allow PRAM algorithms to run 
optimally on fault-prone practical PRAMs need to be developed. Some important 
work has already been done in [78, 79, 80]. However, coping with processor failures 
in a parallel model with communication delay remains to be done.
Bibliography
[1] A. Aggarwal, A.K. Chandra and M. Snir, “Communication complexity of 
PRAMs”, Theoretical Computer Science, Vol. 71, 3-28, 1990.
[2] A. Aggarwal, A.K. Chandra and M. Snir, "On communication latency in PRAM 
computations". Symposium on Parallel Architectures and Algorithms, 11-21, 
1989.
[3] A. V. Aho, J.E. Hopcroft, and J.D. Ullman, The design and analysis of computer 
algorithms, Addison-Wesley, Reading, MA, 1974.
[4] W. Aiello, T. Leighton, B. Maggs and M. Newman, "Fast algorithms for 
bit-serial routing on a hypercube", in Proceedings of the 2nd Annual ACM Symposium 
on Parallel Algorithms and Architectures, 55-64, 1990.
[5] M. Ajtai, J. Komlós and E. Szemerédi, "Sorting in c log n steps", 
Combinatorica, Vol. 3, 1-19, 1983.
[6] S.G. Akl, The Design and Analysis of Parallel Algorithms, Prentice-Hall 
International Editions, 1989.
[7] R. Aleliunas, "Randomised parallel communication", in Proceedings of the 
ACM Symposium on Principles of Distributed Computing, 60-72, 1982.
[8] R.J. Anderson and G.L. Miller, “Optical communication for pointer based 
algorithms”. Technical Report CRI 88-14, University of Southern California, 
1988.
[9] R. Anderson and E. Mayr, "A P-complete problem and approximation to 
it", Research Report STAN-CS-84-1014, Department of Computer Science, 
Stanford University, 1984.
[10] E. Arjmandi, M.J. Fischer and N.A. Lynch, "Efficiency of synchronous versus 
asynchronous systems", Journal of the ACM, Vol. 30, No. 3, 449-456, 1983.
[11] J. Aspnes and M. Herlihy, "Wait free data structures in the asynchronous 
PRAM model", Journal of the ACM, 340-349, 1990.
[12] Y. Aumann and A. Schuster, "Deterministic PRAM simulation with constant 
memory blow-up and no time-stamps", in Proceedings of the 3rd Symposium 
on the Frontiers of Massively Parallel Computation, 22-29, 1990.
[13] K. Batcher, "Sorting networks and their applications", in Proceedings of the 
AFIPS Spring Joint Computing Conference, Vol. 32, 307-314, 1968.
[14] P. Beame and J. Hastad, "Optimal bounds for decision problems on the CRCW 
PRAM", in Proceedings of the 19th Annual ACM Symposium on Theory of 
Computing, 83-93, 1987.
[15] V. Benes, "Permutation groups, complexes and rearrangeable multistage 
connecting networks", Bell System Technical Journal, Vol. 43, 1619-1640, 1964.
[16] S.N. Bhatt and I.C.F. Ipsen, "How to embed trees in hypercubes”. Research 
Report Yale/DCS/RR-443, Department of Computer Science, Yale University, 
1985.
[17] G. Blelloch, "Scans as primitive parallel operations", in Proceedings of 
International Conference on Parallel Processing, 355-362, 1987.
[18] A. Borodin, "On relating time and space to size and depth”, SIAM Journal on 
Computing, Vol. 6, 733-744, 1977.
[19] A. Borodin and J.E. Hopcroft, "Routing, merging and sorting on parallel model 
of computation", Journal of Computer System Science, Vol. 30, 130-145, 1985.
[20] R.P. Brent, "The parallel evaluation of general arithmetic expressions", Journal 
of the ACM, Vol. 21, 201-206, 1974.
[21] J. L. Carter and M. N. Wegman, "Universal classes of hash functions", Journal 
of Computer and System Sciences, Vol. 18, 143-154, 1979.
[22] A.K. Chandra, D.C. Kozen and L.J. Stockmeyer, "Alternation", Journal of the 
ACM, Vol. 28, 114-133, 1981.
[23] K. M. Chandy and S. Taylor, An introduction to parallel programming, Jones 
and Bartlett Publishers, Boston, 1992.
[24] A. Chin, Complexity issues in general purpose parallel computing, D.Phil. 
thesis, University of Oxford, Oxford, 1991.
[25] F.Y. Chin, J. Lam and I. Chen, "Efficient parallel algorithms for some graph 
problems", Communications of the ACM, Vol. 25, 659-665, 1982.
[26] R. Cole and U. Vishkin, "Approximation and exact parallel scheduling with 
applications to list, tree and graph problems", in Proceedings of the 27th Annual 
IEEE Symposium on Foundations of Computer Science, 32-53, 1986.
[27] R. Cole and U. Vishkin, "Approximate parallel scheduling. Part I: The basic 
technique with applications to optimal parallel list ranking in logarithmic time", 
SIAM Journal on Computing, Vol. 17, 128-142, 1988.
[28] R. Cole and U. Vishkin, "Faster optimal parallel prefix sums and list ranking", 
Information and Control, Vol. 81, 335-352, 1989.
[29] R. Cole and U. Vishkin, "Optimal parallel algorithms for expression tree 
evaluation and list ranking", in Proceedings of Aegean Workshop on Computing, 
91-100, 1988.
[30] R. Cole and U. Vishkin, "Deterministic coin tossing with applications to 
optimal parallel list ranking", Information and Control, Vol. 70, 32-53, 1986.
[31] R. Cole and U. Vishkin, "Deterministic coin tossing and accelerating cascades: 
micro and macro techniques for designing parallel algorithms", in Proceedings 
of the 18th Annual ACM Symposium on Theory of Computing, 206-219, 1986.
[32] R. Cole and O. Zajicek, "The APRAM: Incorporating asynchrony into the 
PRAM model", in Proceedings of the 1st Annual ACM Symposium on Parallel 
Algorithms and Architectures, 169-178, 1989.
[33] R. Cole and O. Zajicek, "The expected advantage of asynchrony", in 
Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, 
85-94, 1990.
[34] S.A. Cook, C. Dwork and R. Reischuk, "Upper and lower time bounds for 
parallel random access machines without simultaneous writes", SIAM Journal 
on Computing, Vol. 15, 87-97, 1986.
[35] D. Coppersmith and S. Winograd, "Matrix multiplication via arithmetic 
progressions", Proceedings of the 19th Annual ACM Symposium on Theory of 
Computing, 1-6, 1987.
[36] R. Cypher and C. G. Plaxton, "Deterministic sorting in nearly logarithmic time 
on the hypercube and related computers", in Proceedings of the 22nd Annual ACM 
Symposium on Theory of Computing, 193-203, 1990.
[37] E. Dahlhaus and M. Karpinski, "Parallel Construction of Perfect Matching and 
Hamiltonian Cycles on Dense Graphs". Theoretical Computer Science, Vol. 
61, 121-136, 1988.
[38] C. Ó Dúnlaing, "Some parallel geometric algorithms", in Lectures in Parallel 
Computation (A.M. Gibbons and P.G. Spirakis, eds.), Cambridge University 
Press, 77-108, 1993.
[39] P.E. Dunne, "A result on k-valent graphs and its application to a graph 
embedding problem", Acta Informatica, Vol. 24, 447-459, 1987.
[40] J. Edmonds, "Matching and Polyhedrons with 0,1 Vertices", Journal of 
Research of the National Bureau of Standards B, 125-130, 1965.
[41] J. Edmonds, "Paths, trees and flowers", Canadian Journal of Mathematics, 
Vol. 17, 449-467, 1965.
[42] D. Eppstein and Z. Galil, "Parallel algorithmic techniques for combinatorial 
computation", Annual Review of Computer Science, 233-283, 1988.
[43] S. Fortune and J. Wyllie, "Parallelism in random access machines", Proceedings 
of the 10th Annual ACM Symposium on Theory of Computing, 114-118, 1978.
[44] H.N. Gabow, Implementations of algorithm for Maximum Matching on 
Nonbipartite Graphs, Ph.D. Dissertation, Department of Computer Science, Stanford 
University, 1974.
[45] Z. Galil and V. Pan, "Improved Processor Bounds for Combinatorial Problems 
in RNC", Combinatorica, Vol. 8, 189-200, 1988.
[46] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the 
Theory of NP-completeness, Freeman, 1979.
[47] H. Gazit, G. L. Miller and S. H. Teng, "Optimal tree contraction in the EREW 
model", in Proceedings of Concurrent Computations: Algorithms, Architecture 
and Technology, Plenum, New York, 139-156, 1988.
[48] A.V. Gerbessiotis and L.G. Valiant, "Direct bulk-synchronous parallel 
algorithms", Technical Report TR-10-92, Center for Research in Computing 
Technology, Harvard University, 1992.
[49] A.M. Gibbons, Algorithmic Graph Theory, Cambridge University Press, 1985.
[50] A. M. Gibbons, "An introduction to distributed memory models of parallel 
computation", in Lectures in Parallel Computation (A.M. Gibbons and P.G. 
Spirakis, eds.), Cambridge University Press, 197-215, 1993.
[51] A.M. Gibbons and M.S. Paterson, "Dense edge-disjoint embedding of binary 
trees in the mesh", in Proceedings of the 4th Annual ACM Symposium on 
Parallel Algorithms and Architectures, 257-263, 1992.
[52] A.M. Gibbons and W. Rytter, Efficient Parallel Algorithms, Cambridge 
University Press, 1988.
[53] A. M. Gibbons and W. Rytter, "An optimal parallel algorithm for dynamic 
expression evaluation and its application", in Proceedings of Symposium on 
Foundations of Software Technology and Theoretical Computer Science, 453-469, 
1986.
[54] A. M. Gibbons and R. Ziani, "The balanced binary tree technique on mesh 
connected computers", Information Processing Letters, Vol. 37, 101-109, 1991.
[55] P. B. Gibbons, The asynchronous PRAM: a semi-synchronous model for shared 
memory MIMD machines, Ph.D. Thesis, University of California at Berkeley, 
1989.
[56] A.V. Goldberg, S.A. Plotkin and G.E. Shannon, "Parallel symmetry-breaking 
in sparse graphs", in Proceedings of the 19th Annual ACM Symposium on 
Theory of Computing, 1987.
[57] L.M. Goldschlager, "A unified approach to models of synchronous parallel 
machines", Journal of the ACM, Vol. 29, 1073-1086, 1982.
[58] A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph and M. 
Snir, "The NYU Ultracomputer - designing an MIMD shared memory parallel 
computer", IEEE Transactions on Computers, Vol. C-32, 175-189, 1983.
[59] R. Greenlaw, H.J. Hoover and W.L. Ruzzo, "A Compendium of Problems 
Complete for P", Technical Report TR 91-11, Department of Computer Science, 
University of Alberta, 1991.
[60] D.Y. Grigoriev and M. Karpinski, "The Matching Problem for Bipartite Graphs 
with Polynomially Bounded Permanents is in NC", Proceedings of the 27th 
Annual IEEE Symposium on Foundations of Computer Science, 166-172, 1987.
[61] J. Hastad, T. Leighton and M. Newman, "Fast computation using faulty 
hypercubes", in Proceedings of the 21st Annual ACM Symposium on Theory of 
Computing, 251-263, 1989.
[62] X. He and Y. Yesha, "Binary tree algebraic computation and parallel algorithms 
for simple graphs", Journal of Algorithms, Vol. 9, 6-20, 1986.
[63] D. Helmbold and E. Mayr, "Two-processor Scheduling is in NC", VLSI 
Algorithm and Architectures, editors: Makedon et al., Lecture Notes in Computer 
Science, Vol. 227, 12-25, 1986.
[64] S. W. Hornick and F. P. Preparata, "Deterministic P-RAM simulation with 
constant redundancy", in Proceedings of the 1st Annual ACM Symposium on 
Parallel Algorithms and Architectures, 103-109, 1989.
[65] C.S. Iliopoulos, "Parallel algorithms for string matching", in Lectures in 
Parallel Computation (A.M. Gibbons and P.G. Spirakis, eds.), Cambridge 
University Press, 109-121, 1993.
[66] M. Iri, K. Murota and S. Matsui, "Linear-time Approximation Algorithms 
for Finding the Minimum-Weight Perfect Matching on a Plane", Information 
Processing Letters, Vol. 12, 206-209, 1981.
[67] A. Israeli and Y. Shiloach, "An improved algorithm for maximal matching", 
Information Processing Letters, Vol. 33, 57-60, 1986.
[68] J. JaJa, An Introduction to Parallel Algorithms, Addison Wesley, 1992.
[69] M. R. Jerrum and S. Skyum, "Families of fixed degree graphs for processor 
interconnection", IEEE Transaction on Computing, Vol. 33, 190-194, 1984.
[70] D.B. Johnson and P. Metaxas, "Connected components in $O(\log^{3/2} |V|)$ 
parallel time for the CREW PRAM", in Proceedings of the 31st Annual IEEE 
Symposium on Foundations of Computer Science, 1991.
[71] N. Kahale, "Better expansion for Ramanujan graphs", in Proceedings of the 
32nd Annual IEEE Symposium on Foundations of Computer Science, 398-404, 
1991.
[72] C. Kaklamanis, D. Krizanc and T. Tsantilas, "Tight bounds for oblivious 
routing in the hypercube", in Proceedings of the 2nd Annual ACM Symposium 
on Parallel Algorithms and Architectures, 31-36, 1990.
[73] D.R. Karger, N. Nisan and M. Parnas, "Fast Connected Components 
Algorithms for the EREW PRAM", 4th Annual ACM Symposium on Parallel 
Algorithms and Architectures, 373-38, 1992.
[74] A. R. Karlin and E. Upfal, "Parallel hashing: An efficient implementation of 
shared memory", Journal of the ACM, Vol. 35, No. 4, 876-892, 1988.
[75] R.M. Karp and M. Luby, "Efficient PRAM simulation on a distributed memory 
machine", in Proceedings of the 24th Annual ACM Symposium on Theory of 
Computing, 318-326, 1992.
[76] R. Karp and V. Ramachandran, "Parallel algorithms for Shared Memory 
Machines", Handbook of Theoretical Computer Science, J. van Leeuwen (editor), 
Vol. 1, Elsevier and MIT Press, 1991.
[77] R. Karp, E. Upfal and A. Wigderson, "Constructing a Perfect Matching is in 
Random NC", Proceedings of the 17th Annual ACM Symposium on Theory of 
Computing, 22-32, 1985.
[78] Z.M. Kedem, K.V. Palem, A. Raghunathan and P.G. Spirakis, "Combining 
tentative and definite executions for very fast dependable parallel computing", 
in Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, 
381-390, 1991.
[79] Z.M. Kedem, K.V. Palem, A. Raghunathan and P.G. Spirakis, "Resilient 
parallel computing on unreliable parallel machines", in Lectures in Parallel 
Computation (A.M. Gibbons and P.G. Spirakis, eds.), Cambridge University Press, 
149-175, 1993.
[80] Z.M. Kedem, K.V. Palem, and P.G. Spirakis, "Efficient robust parallel 
computations", in Proceedings of the 22nd Annual ACM Symposium on Theory of 
Computing, 138-148, 1990.
[81] L. Kirousis, M. Serna and P. Spirakis, "The parallel complexity of the subgraph 
connectivity problem", in Proceedings of the 30th Annual IEEE Symposium 
on Foundations of Computer Science, 294-299, 1989.
[82] L. Kirousis and P. Spirakis, "Probabilistic log-space reductions and problems 
probabilistically hard for P", in Proceedings of the 1st Scandinavian Workshop 
on Algorithm Theory, 1988.
[83] C. P. Kruskal, L. Rudolph and M. Snir, "A complexity theory of efficient 
parallel algorithms", Theoretical Computer Science, Vol. 71, 95-132, 1990.
[84] C. Kruskal and M. Snir, "A unified theory of interconnection network 
structure", Theoretical Computer Science, Vol. 48(1), 75-94, 1986.
[85] L. Kucera, "Parallel computation and conflicts in memory access", Information 
Processing Letters, Vol. 14, 93-96, 1982.
[86] R.E. Ladner and M.J. Fischer, "Parallel prefix computation", Journal of the ACM, 
Vol. 27, 831-838, 1980.
[87] E.L. Lawler, Combinatorial Optimization: Networks and Matroids, 
Holt-Rinehart-Winston, New York, 1976.
[88] T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays · 
Trees · Hypercubes, Morgan Kaufmann, California, 1992.
[89] T. Leighton, "Methods for message routing in parallel machines", in 
Proceedings of the 24th Annual ACM Symposium on Theory of Computing, 77-96, 
1992.
[90] T. Leighton and B. Maggs, "Expanders might be practical", in Proceedings 
of the 30th Annual IEEE Symposium on Foundations of Computer Science, 
384-389, 1989.
[91] T. Leighton and C.G. Plaxton, "A (fairly) simple circuit that (usually) sorts", 
in Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer 
Science, 1990.
[92] C. E. Leiserson, "Fat trees: universal network for hardware efficient 
supercomputing", in Proceedings of the International Conference on Parallel Processing, 
393-402, 1985.
[93] C.E. Leiserson, "Area-efficient graph layouts (for VLSI)", in Proceedings of 
the 21st Annual IEEE Symposium on Foundations of Computer Science, 270-281, 
1980.
[94] L. Lovasz, "Computing ears and branching in parallel", in Proceedings of the 26th 
Annual IEEE Symposium on Foundation of Computer Science, 464-467, 1985.
[95] A. Lubotzky, R. Phillips and P. Sarnak, "Ramanujan graphs", Combinatorica, 
Vol. 8, 261-277, 1988.
[96] N. A. Lynch and M. J. Fischer, "On describing the behavior and implementation 
of distributed systems", Theoretical Computer Science, Vol. 13, 17-43, 1981.
[97] Y.-D. Lyuu, "Fast fault-tolerant parallel communication and on-line 
maintenance using information dispersal", in Proceedings of the 2nd Annual ACM 
Symposium on Parallel Algorithms and Architectures, 378-387, 1990.
[98] Y.-D. Lyuu, "Fast fault-tolerant parallel communication with low congestion 
and on-line maintenance using information dispersal", Technical Report TR-
Aiken Computation Lab., Harvard University, 1989.
[99] E.S. Maniloff, K.M. Johnson and J. Reif, "Holographic routing network for 
shared-memory parallel computers”. Technical Report CSE-89-8, University 
of California at Davis, 1989.
[100] Y. Maon, B. Schieber and U. Vishkin, "Parallel ear decomposition search 
(EDS) and st-numbering in graphs”. Theoretical Computer Science, Vol. 47, 
277-298, 1986.
[101] C. Martel, A. Park and R. Subramonian, "Optimal asynchronous algorithms 
for shared memory parallel computers", Technical Report CSE-89-8, 
University of California at Davis, 1989.
[102] E.W. Mayr, "Parallel Approximation Algorithms", Research Report, 
Department of Computer Science, Stanford University, California.
[103] W.F. McColl, "General purpose parallel computing", in Lectures in Parallel 
Computation (A.M. Gibbons and P.G. Spirakis, eds.), Cambridge University 
Press, 243-296, 1993.
[104] K. Mehlhorn and U. Vishkin, "Randomised and deterministic simulation of 
PRAMs by parallel machines with restricted granularity of parallel memories", 
Acta Informatica, Vol. 21, 339-374, 1984.
[105] G. L. Miller and J. H. Reif, "Parallel tree contraction and its application", 
in Proceedings of the 19th Annual ACM Symposium on Theory of Computing, 
254-263, 1987.
[106] J. Misra, "Phase synchronization", Information Processing Letters, Vol. 38, 
101-105, 1991.
[107] S. Miyano, S. Shiraishi and T. Shoudai, "A List of P-Complete problems", 
Technical Report RIFIS-TR-CS-17, Research Institute of Fundamental 
Information Science, Kyushu University, Japan, 1989.
[108] B. Monien and H. Sudborough, “Comparing interconnection networks”, in 
Proceedings o f the Mathematical Foundations o f Computer Science, Lecture 
Notes in Computer Science 324, 139-153, Springer-Verlag, 1988.
[109] J.K. Mullin, “A caution on universal classes of hash functions”, Information 
Processing Letters, Vol. 37, 247-256, 1991.
[110] K. Mulmuley, U. Vazirani and V. Vazirani, "Matching is as easy as matrix 
inversion", in Proceedings of the 19th Annual ACM Symposium on Theory of 
Computing, 345-354, 1987.
[111] J. Naor, "Computing a Perfect Matching in a Line Graph", in Proceedings of 
the 9th Conference on the Foundations of Software Technology and Theoretical 
Computer Science, 139-148, 1989.
[112] D. Nassimi and S. Sahni, "An optimal routing algorithm for mesh connected 
computers", Journal of the ACM, Vol. 27, 6-29, 1980.
[113] N. Nishimura, "Asynchronous shared memory parallel computation", in 
Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and 
Architectures, 76-84, 1990.
[114] D. Nussbaum and A. Aggarwal, "Scalability of parallel machines", 
Communication of the ACM, Vol. 34, 57-61, 1991.
[115] C.N.K. Osiakwan and S.G. Akl, "The Maximum weight perfect matching 
problem for complete weighted graphs is in PC", Proceedings of the 2nd IEEE 
Symposium on Parallel and Distributed Processing, 880-887, 1990.
[116] H. Papadimitriou and J. D. Ullman, "A communication-time tradeoff”, SIAM 
Journal on Computing, Vol. 16, 260-269, 1984.
[117] H. Papadimitriou and M. Yannakakis, "Towards an architecture-independent 
analysis of parallel algorithms", in Proceedings of the 20th Annual ACM 
Symposium on Theory of Computing, 510-513, 1988.
[118] M.S. Paterson, W.L. Ruzzo, L. Snyder, "Bounds on minimax edge-length for 
complete binary trees", in Proceedings of the 13th Annual ACM Symposium 
on Theory of Computing, 293-299, 1981.
[119] G.F. Pfister and V.A. Norton, "Hot spot contention and combining in 
multistage interconnection networks", IEEE Transactions on Computers, Vol. C-34, 
No. 10, 943-948, 1985.
[120] N. Pippenger, "Parallel communication with limited buffers", in Proceedings 
of the 25th Annual IEEE Symposium on Foundations of Computer Science, 
127-136, 1984.
[121] D.A. Plaisted, "Heuristic Matching for Graphs Satisfying the Triangle 
Inequality", Journal of Algorithms, Vol. 5, 163-179, 1984.
[122] V.R. Pratt and L.J. Stockmeyer, "A characterization of the power of vector 
machines", Journal of Computer System Science, Vol. 12, 198-221, 1976.
[123] F. Preparata and J. Vuillemin, "The cube-connected cycles: a versatile 
network for parallel computation", Communication of the ACM, Vol. 24(5), 300-309, 
1981.
[124] M. J. Quinn, “Designing Efficient Algorithms for Parallel Computers", 
McGraw-Hill International Editions, 1988.
[125] M.O. Rabin, "Efficient dispersal of information for security, load balancing, 
and fault tolerance", Journal of the ACM, Vol. 36, 335-348, 1989.
[126] P. Ragde, "Analysis of an asynchronous PRAM algorithm", Information
Processing Letters, Vol. 39, 253-256, 1991.
[127] S. Rajasekaran and J. H. Reif, "Optimal and sublogarithmic time randomised
parallel sorting algorithms", SIAM Journal on Computing, 594-607, 1989.
[128] V. Ramachandran, "Efficient parallel graph algorithms", in Lectures in
Parallel Computation (A.M. Gibbons and P.G. Spirakis, eds.), Cambridge
University Press, 67-76, 1993.
[129] A. G. Ranade, The fluent abstract machine, Ph.D. Thesis, Yale University,
1989.
[130] A. G. Ranade, "How to emulate shared memory", in Proceedings of the 28th
Annual IEEE Symposium on Foundations of Computer Science, 185-194, 1987.
[131] S.B. Rao, "Properties of an interconnection architecture based on wavelength
division multiplexing", TR-92-009-3-0054-2, NEC Research Institute,
Princeton, 1992.
[132] S. Ravindran, B. Dessau and A.M. Gibbons, "An overview of PRAM to
practical PRAM algorithmics", Report 6.2.1, Parallel Universal
Message-passing Architecture, ESPRIT Project 2701 of the EC, 1991.
[133] S. Ravindran and A.M. Gibbons, "Dense edge-disjoint embedding of binary 
trees in the hypercube”, Information Processing Letters, Vol. 45, 321-325, 
1993.
[134] S. Ravindran and A.M. Gibbons, "Densely embedding the complete binary
tree in communication networks", 9th British Colloquium for Theoretical
Computer Science, University of York, England, March 1993.
[135] S. Ravindran, A.M. Gibbons and M.S. Paterson, "Dense edge-disjoint
embedding of complete binary trees in interconnection networks", to appear in
Theoretical Computer Science, 1994.
[136] S. Ravindran, N.W. Holloway and A.M. Gibbons, "Approximating minimum
weight perfect matchings for complete graphs satisfying the triangle
inequality", in Proceedings of the 19th International Workshop on Graph-Theoretic
Concepts in Computer Science, Lecture Notes in Computer Science,
Springer-Verlag, 1993, to appear.
[137] E.M. Reingold and R.E. Tarjan, "On a greedy heuristic for complete matching",
SIAM Journal on Computing, Vol. 10, 676-681, 1981.
[138] J. H. Reif and L. G. Valiant, "A logarithmic time sort for linear size networks",
Journal of the ACM, Vol. 34, 60-76, 1987.
[139] W. L. Ruzzo, "On uniform circuit complexity", Journal of Computer and
Systems Sciences, Vol. 22, 365-383, 1981.
[140] B. Schieber and U. Vishkin, "On finding lowest common ancestors:
simplification and parallelization", SIAM Journal on Computing, Vol. 17, 1253-1262,
1988.
[141] M.J. Serna, The Parallel Approximability of P-complete Problems, Ph.D.
Thesis, Dep. de Llenguatges i Sistemes Informatics, Universitat Politecnica de
Catalunya, 1990.
[142] M.J. Serna and P. Spirakis, "The approximability of problems complete for
P", in Proceedings of the International Symposium on Optimal Algorithms,
Lecture Notes in Computer Science, Springer-Verlag, Vol. 401, 193-204, 1989.
[143] D. Shasha and M. Snir, "Efficient and correct execution of parallel programs
that share memory", ACM Transactions on Programming Languages and
Systems, Vol. 10, 282-312, 1988.
[144] Y. Shiloach and U. Vishkin, "An O(log n) parallel connectivity algorithm",
Journal of Algorithms, Vol. 3, 57-67, 1982.
[145] A. Siegel, "On universal classes of fast high-performance hash functions,
their time-space tradeoff, and their applications", in Proceedings of the 30th
Annual IEEE Symposium on Foundations of Computer Science, 20-25, 1989.
[146] H. Siegel, "Interconnection networks for SIMD machines", Computer, Vol.
12(6), 57-65, 1979.
[147] H. Siegel, Interconnection Networks for Large-Scale Parallel Processing:
Theory and Case Studies, Lexington Books, Lexington, MA, 1984.
[148] P. Spirakis, "PRAM models and fundamental parallel algorithmic techniques:
part II (randomized algorithms)", in Lectures in Parallel Computation (A.M.
Gibbons and P.G. Spirakis, eds.), Cambridge University Press, 77-108, 1993.
[149] L. Stockmeyer and U. Vishkin, "Simulations of Parallel Random Access 
Machines by Circuits", SIAM Journal on Computing, Vol. 13, 409-422, 1984.
[150] H. Stone, "Parallel processing with perfect shuffle", IEEE Transactions on
Computers, Vol. C-20, 153-161, 1971.
[151] K.J. Supowit, D.A. Plaisted and E.M. Reingold, "Heuristics for weighted
perfect matching", Proceedings of the 12th Annual ACM Symposium on Theory
of Computing, 398-419, 1980.
[152] R.E. Tarjan and U. Vishkin, "Finding biconnected components and computing
tree functions in logarithmic parallel time", in Proceedings of the 25th Annual
IEEE Symposium on the Foundations of Computer Science, 12-20, 1984.
[153] J. D. Ullman, Computational Aspects of VLSI, Computer Science Press, 1984.
[154] E. Upfal, "Efficient schemes for parallel communication", Journal of the
ACM, Vol. 31, 507-517, 1984.
[155] E. Upfal, "An O(log n) deterministic packet routing scheme", Journal of the
ACM, Vol. 39, 55-70, 1992.
[156] E. Upfal and A. Wigderson, "How to share memory in a distributed system",
Journal of the ACM, Vol. 34, 116-127, 1987.
[157] L. G. Valiant, "A bridging model for parallel computation", Communications
of the ACM, Vol. 33, 103-111, 1990.
[158] L. G. Valiant, "General purpose parallel architectures", Handbook of
Theoretical Computer Science, Volume A: Algorithms and Complexity (J. Van
Leeuwen, ed.), Amsterdam: North-Holland, 944-971, 1990.
[159] L.G. Valiant, "A combining mechanism for parallel computers", Technical
Report TR-24-92, Center for Research in Computing Technology, Harvard
University, 1992.
[160] L.G. Valiant, "Experiments with a parallel communication scheme", in
Proceedings of the 18th Allerton Conference on Communication, Control and
Computing, 802-811, 1980.
[161] L.G. Valiant, "A scheme for fast parallel communication", SIAM Journal on
Computing, Vol. 11, 350-361, 1982.
[162] L.G. Valiant, "Universality considerations in VLSI circuits", IEEE
Transactions on Computing, Vol. 30, 135-140, 1981.
[163] L.G. Valiant and G.J. Brebner, "Universal schemes for parallel
communications", in Proceedings of the 13th Annual Symposium on Theory of Computing,
263-277, 1981.
[164] U. Vishkin, "Implementation of simultaneous memory address access in
models that forbid it", Journal of Algorithms, 45-50, 1983.
[165] U. Vishkin, "A parallel-design distributed-implementation (PDDI)
general-purpose computer", Theoretical Computer Science, Vol. 13, 157-172, 1984.
[166] U. Vishkin, "Structural parallel algorithmics", in Lectures in Parallel
Computation (A.M. Gibbons and P.G. Spirakis, eds.), Cambridge University Press,
1-18, 1993.
[167] U. Vishkin, "Randomized speed-ups in parallel computations", in
Proceedings of the 16th Annual ACM Symposium on Theory of Computing, 230-239,
1984.
[168] A. Waksman, "A permutation network", Journal of the ACM, Vol. 15,
159-163, 1968.
[169] J. C. Wyllie, The complexity of parallel computation, Ph.D. thesis,
Department of Computer Science, Cornell University, 1979.
