A complexity theory of parallel computation by Parberry, Ian
A Thesis Submitted for the Degree of PhD at the University of Warwick 
 
Permanent WRAP URL: 
http://wrap.warwick.ac.uk/112008 
 
Copyright and reuse:                     
This thesis is made available online and is protected by original copyright.  
A Complexity Theory of 
Parallel Computation
Ian Parberry
A dissertation submitted for the degree of 
Doctor of Philosophy
University of Warwick 
Department of Computer Science
May 1984
Dedicated to my wife and my parents, 
without whose support and encouragement 
this work would not have been possible.
Contents.
Chapter 1: Introduction
Chapter 2: Designing a Parallel Machine Model
2.1. The Basic Model
2.2. The Unit-Cost Measure of Time
2.3. The Assignment of Programs to Processors
2.4. Processor Activation
Chapter 3: Relationships with Other Models
3.1. A Fixed-Structure Model
3.2. Shared Memory Machines
3.3. Reasonableness and Practicality
3.4. A Practical Model
3.5. Speedup of Sequential Machines
Chapter 4: Programming Techniques for Feasible Networks
4.1. Interconnection Patterns and Programming Tools
4.2. Recurrent Interconnection Patterns
4.3. Some Useful Algorithms
4.4. Reducing the Number of Processors
Chapter 5: Practical Simulations
5.1. A General Simulation Theorem
5.2. A Universal Parallel Machine
5.3. A Hardware Measure
5.4. Circuits and Turing Machines
Chapter 6: High-Arity Machines
6.1. A High-Arity Model
6.2. The Computational Power of High-Arity Machines
6.3. A Constant-Degree Universal Machine
6.4. Examples of High-Arity Algorithms
Chapter 7: More on Universal Machines
7.1. Some Lower Bounds
7.2. A Non-Literal Simulation
7.3. Oblivious Simulations
Chapter 8: Conclusion
References
Diagrams.
Figure 2.4.1
Figure 4.1.1
Figure 4.1.2
Figure 4.1.3
Figure 4.2.1
Figure 5.3.1
Figure 5.3.2
Figure 6.2.1
Figure 6.2.2
Figure 6.4.1
Figure 6.4.2
Figure 6.4.3
Tables.
Table 2.4.1
Table 4.3.1
Table 4.3.2
Table 4.3.3
Table 4.3.4
Table 5.1.1
Table 5.1.2
Table 6.2.1
Table 6.4.1
Table 6.4.2
Acknowledgements.
I would like to thank my supervisor, Mike Paterson, for his expertise 
and guidance during all stages of this work. I would be satisfied if it 
could be said that the technical content comes close to meeting his 
exceptionally high standards for conciseness, clarity and elegance. I 
am grateful to a number of people who helped keep me up-to-date with 
the latest developments in parallel complexity theory by sending 
manuscripts and correspondence. These include Allan Borodin, Patrick 
Dymond, Zvi Galil, Tom Leighton, Friedhelm Meyer auf der Heide, Nick 
Pippenger, Uzi Vishkin and Avi Wigderson. The Commonwealth Scholarship 
Commission, via their awarding body in Australia, provided the 
financial support which enabled me to travel to England and undertake 
this degree. Finally, I would like to thank Tony Cohn for assistance with 
typesetting problems, Rod Moore for the loan of drawing equipment, 
and Meurig Beynon for the use of his office while preparing this 
manuscript.
Declaration
In the interests of rapid dissemination, preliminary versions of the results 
in section 4.4, chapter 5 and sections 6.1-6.3 have been published as internal 
Theory of Computation Reports. Material from chapters 4, 5 and 6 can be found 
in [51], [50] and [52] respectively.
Summary.
Parallel complexity theory is currently one of the fastest growing fields of 
theoretical computer science. This rapid growth has led to a proliferation of 
parallel machine models and theoretical frameworks. Our aim is to construct a 
unified theory of parallel computation based on a network model. We claim that 
the network paradigm is fundamental to the understanding of parallel 
computation, and support this claim by providing new and improved theoretical 
results, and new approaches to old questions concerning "reasonable" and 
"practical" models.
This thesis is made up of eight chapters. Chapter 1 contains the 
introduction. In chapter 2 we define the basic model, and justify our choice of 
a unit-cost measure of time, a uniform assignment of programs to processors, and 
simultaneous processor activation. Chapter 3 compares the network model to a 
variety of others, including fixed-structure networks and shared-memory 
machines. We explore the concepts of "reasonableness" and "practicality" in 
parallel machine models, and show that even "reasonable" parallel computers 
are much faster than sequential ones.
Chapter 4 is devoted to programming techniques for a "practical" network 
model (which we call a feasible network), covering interconnection patterns, 
useful algorithms, and some processor-saving theorems. In chapter 5 we find 
efficient simulations of the general network model on more practical machines, 
including a universal feasible network, and uniform circuits. Chapter 6 extends 
the network model, and defines a new resource, that of arity. Although 
increasing arity increases computing power, some efficient constant-arity 
universal machines are found. Chapter 7 takes a final look at universal 
networks, concentrating on lower-bounds and the conditions under which they 
hold. Chapter 8 contains the conclusion.
Chapter 1 
Introduction
As recently as 1980, Schwartz [62] complained of an apparent lack of 
theoretical results concerning the computational complexity of parallel or 
concurrent algorithms.
"In the serial case, the design of algorithms has come to be illuminated by a 
growing body of theoretical knowledge concerning the ultimate limits of algorithm 
performance.... Until a like body of theoretical knowledge has been developed for 
highly concurrent algorithms, we will have little basis for judging the extent to 
which a given concurrent approach can be improved."
Two of the most important and fundamental papers in the field of parallel 
complexity theory (that of Goldschlager [23], later to become [27], and that of 
Pippenger [53]) had already appeared by the time Schwartz's paper reached 
publication. Since then the flow of results has increased from a trickle to a 
steady stream, and is now threatening to become a flood. Today, parallel 
complexity theory must be ranked as one of the fastest-growing fields of 
theoretical computer science.
A theoretical treatment of parallel computation is an attempt to formalize 
the intuitive concept of a "parallel computer" based on practical experience or 
reasonable expectations. Amongst the questions which should be addressed by 
such a formal exposition are the following:
What do we mean by a parallel computation?
What is a good model of a parallel computer?
What are the resources of interest, and how should they be defined?
How should we design a parallel programming language?
Are parallel machines necessarily faster than sequential ones?
What kind of problems can be solved significantly faster by using a parallel 
algorithm?
Can we obtain asymptotically optimal upper and lower bounds on the 
parallel resources needed to compute some "natural" functions?
The last problem appears to be the most popular, judging by sheer volume of 
contributions (for some examples, consult [56,73] and the references contained 
therein). In comparison, relatively little attention has been paid to the first four 
questions, resulting in a proliferation of parallel machine models in the current 
literature. Even the most popular model, the shared-memory machine 
(consisting of a collection of RAMs communicating via a shared memory, see, for 
example, [20,27]) has many variants. There is a growing tendency to 
"customize" a machine to allow short, elegant proofs of a particular upper or 
lower bound, with scant regard to the suitability of the model as a vehicle for 
further research.
Intuitively, a parallel machine should consist of many processors which in 
some way co-operate in order to compute a result. Obviously there are many 
ways of formalizing this intuition. Compare the SIMDAG of Goldschlager [27] to 
the model of Galil and Paul [21]. The processors of the former are RAMs, the 
latter allow RAMs, RACs or even finite-state machines. Lev, Pippenger and 
Valiant [42] insist that they must be RACs. Goldschlager has almost identical 
processors which are all started simultaneously at the start of the computation, 
and communicate via a shared memory. Galil and Paul have similar processors 
which start up at run-time, and communicate via direct processor-to-processor 
links. Some of the more obvious variations on those models include the 
instruction-set (for example, should multiplication be allowed?), and memory 
access conflicts (should multiple attempts to write to a shared register be 
allowed as in [27], or should even multiple reads be disallowed, as in [42]?).
Some of these differences are merely cosmetic in nature, but some are 
extremely important. In order to design a useful parallel machine model, we 
must first determine which choices matter. We have chosen a model which 
consists of a network of interconnected RAMs; each RAM can in one step perform 
an internal computation, or read from or write to a register belonging to one of 
its neighbours. We believe that the network paradigm is fundamental to the 
understanding of parallel computation. One attraction is the fact that it 
possesses a certain theoretical elegance. A RAM is just a network consisting of 
one processor. A shared-memory machine is just a network where all 
processors can communicate with a single distinguished processor and no other, 
and that distinguished processor remains idle throughout the computation. The 
number of extant papers which use the shared-memory model attests to its ease 
of programming, and its usefulness as a tool for proving and communicating 
theoretical results. It is widely accepted, however, that the shared-memory 
model is not in itself a viable architecture. By placing restrictions on our 
network model, it is possible to define a practical variant in a far more natural 
way than is possible with shared-memory machines. This makes the network 
approach doubly attractive.
The aim of this thesis, then, is to shed some light on the nature of parallel 
computation. We shall do this by presenting a unified theory of parallel 
computation based on our network model. We shall demonstrate its utility by 
providing some fairly concise and elegant alternate proofs of results from the 
current literature, which will quite often lead to improved resource bounds or 
more general theorems. We will also attempt to provide answers to the 
questions posed at the start of this introduction, based on this theory.
The main body of this thesis is made up of six chapters. In chapters 2 and 3 
we design and justify our parallel machine model. In section 2.1 we define the 
basic model, and in section 2.2 discuss the consequences of choosing unit-cost 
RAMs as opposed to log-cost RAMs as processors. This decision can be summed 
up by what we call the unit-cost hypothesis: "the unit-cost measure of time is a 
valid one for parallel processors". We will refer to this hypothesis often 
throughout the thesis, and conclude that it holds in most situations of interest. 
Section 2.3 discusses our assignment of programs to processors, comparing and 
contrasting it to the SIMD and MIMD approaches of Flynn [19], and section 2.4 
our decision to have all processors activated simultaneously at the start of the 
computation.
Chapter 3 compares the basic network machine to a selection of other 
models. In section 3.1 we propose an alternative fixed-structure variant. It is 
shown that a fixed-structure network of $2^{O(n)}$ processors and a non-recursive 
interconnection pattern can compute any single-valued Boolean function in a 
constant number of steps, using an instruction-set consisting of addition, 
subtraction and logical shifts. In section 3.2 we compare networks to 
shared-memory machines, and conclude that they are almost identical. In section 3.3 
we discuss the possible bounds which need to be placed on the resources of our 
parallel machines in order to make them "reasonable" or "practical". The 
parallel computation thesis of Goldschlager [27] states that time on a 
"reasonable" model of parallel computation should be polynomially equivalent to 
sequential space. Goldschlager places strong restrictions on his SIMDAGs (a 
variant of the shared-memory model considered in section 3.2) in order to make 
them obey the parallel computation thesis. We find that much less strict bounds 
are sufficient.
In the light of this discussion, in section 3.4 we define a practical variant of
our network model, which we call a feasible network. This has:
(1) Constant degree.
(2) A constant number of registers per processor.
(3) An easy-to-compute interconnection pattern.
(4) Fixed structure.
We will find later that there is an efficient feasible network which is universal for 
the general model of section 2.1. Thus the user of such a universal machine has 
the freedom to program in a high-level language which corresponds to a more 
powerful architecture at little cost, and the theoretician is provided with a 
motivation for studying the more abstract models.
Section 3.5 is devoted to exploring the speed-ups which can be made by a 
parallel machine as opposed to a sequential one. Let $B:\mathbf{N}\to\mathbf{N}$ be an arbitrary 
function. Then a T(n) time-bounded deterministic Turing machine can be 
simulated in time $O(T(n)/B(n))$ by a network of $2^{O(B(n))}+T(n)$ processors. By 
choosing $B(n) = T(n)$ we find that every function in NP can be computed in 
constant time by a network of $2^{n^{O(1)}}$ processors. Choosing $B(n) = T(n)^{1-\epsilon}$ for some 
$\epsilon > 0$ we find that an arbitrary polynomial speed-up is possible on a machine 
which obeys the parallel computation thesis. This is a striking result, because 
an exponential speed-up is not possible for certain natural problems in P unless 
$P \subseteq POLYLOGSPACE$.
In chapter 4 we develop the techniques necessary for the construction of 
our universal feasible network. Section 4.1 demonstrates the usefulness of the 
shuffle-exchange [66] and cube-connected-cycles [55] interconnection patterns. 
In section 4.2 we present a recurrent interconnection pattern $CCL_k$ with $2^k$ 
processors and degree 3 with the property that for all $k \ge j \ge 0$, $CCL_k$ has at least 
$2^{k-j-1}$ disjoint subgraphs which are isomorphic to $CCL_j$, yet it is at least as
powerful as the cube-connected-cycles. Further, using the techniques of 
Meertens [43] we show that any similar interconnection pattern with $2^{k-j}$ such 
disjoint subgraphs cannot share this property. Section 4.3 contains some useful 
algorithms, and section 4.4 some processor-saving theorems. The latter show 
that for any machine based on the above interconnection patterns, a P(n) 
processor network can be simulated on one with F(n) processors, with a 
time-loss of $O(P(n)/F(n))$ for each step. Thus constant multiples in 
processor-bounds can be ignored without asymptotic time-loss, a fact which simplifies 
many of our later proofs. A preliminary version of the results of section 4.4 has 
appeared in [51].
Chapter 5 considers simulations of networks by more practical models. In 
section 5.1 a general machine-independent simulation theorem is given. 
Specific instances of this theorem have been seen before in the literature (see, 
for example, [4,8,21,42,45,71,73]). We use it in section 5.2 to construct our 
universal feasible network, and again in section 5.3 to simulate networks on 
width and depth bounded uniform circuits and space and reversal bounded 
deterministic Turing machines. In section 5.4 we build upon the latter results to 
improve Pippenger's [53] simulation of space and reversal bounded Turing 
machines by width and depth bounded uniform circuits. More specifically, a 
k-tape Turing machine with space S(n) and reversals R(n) can be simulated by a 
uniform circuit of width $O(S(n)^k)$ and depth $O(R(n) \cdot \log^2 S(n) \cdot \log\log S(n))$. A 
preliminary version of the work in chapter 5 has appeared in [50].
Chapter 6 generalizes our model to allow high-arity processors, that is, 
processors which have the power to communicate with more than a constant 
number of their neighbours in unit time (and the power to make good use of this 
ability). High-arity machines have appeared in [8,60,70]. Section 6.1 contains 
our high-arity model. In section 6.2 we show that increasing arity gives more
computing power. In particular, a network with arity A(n) and a polynomial
number of processors needs time $\Omega(\log n / \log A(n))$ to add n numbers, each of
polynomial size, even in the presence of write-conflicts. Thus, for example, a 
polynomial-processor PRAM with multiple-writes needs time $\Omega(\log n)$ to add n 
polynomial-bit numbers. This is the first lower-bound of this nature to be 
achieved on a model which allows write-conflicts. Section 6.3 explores 
simulations of high-arity machines on constant-degree universal machines of 
arity 1. As a corollary, we obtain an improved proof of theorem B of [81]. A 
preliminary version of sections 6.1, 6.2 and 6.3 has appeared in [52]. Section 6.4 
contains some examples of high-arity algorithms, most notably the parallel 
prefix problem.
Section 7.1 is devoted to lower-bounds for universal machines. The 
universal machine of chapter 5 is found to be optimal for simulations of that 
nature. The universal machine of chapter 6 is optimal for the simulation of 
degree-3 machines (Meyer auf der Heide [31] had earlier found it to be optimal 
for the simulation of constant-degree machines). Section 7.2 contains a new 
proof of a result of Meyer auf der Heide [33]. In section 7.3 we obtain
asymptotic upper and lower bounds of $\Theta(\frac{P(n)}{F(n)} + \log P(n))$ for the oblivious
simulation of a P(n) processor network on a constant-degree universal machine 
with F(n) processors. This extends the results of Borodin and Hopcroft [8] and 
Lang [39], who prove the same lower and upper bound respectively for 
$P(n) = F(n)$.
It should be noted that this is a theoretical treatment of parallel 
computation, and as such is based upon a number of assumptions which are 
widely accepted amongst workers in the field of parallel complexity theory. 
Although our model is synchronous (in the sense that the instruction-cycles of
the processors are synchronized), we will see in section 3.4 that this is not an 
important restriction. The advantage of having a synchronous theoretical model 
is that it is easy to program and reason with. We assume that inter-processor 
communications can take place within a single instruction-cycle. In the real 
world, this assumption is unlikely to be true for large numbers of processors; a 
complexity theory based on this observation will differ quite radically from ours 
[81]. However, we feel fairly safe in making the assumption for networks 
consisting of a small number (say in the millions) of fairly large processors 
(about the size of a microprocessor), even though it is unlikely to hold for, say, 
individual gates in a VLSI chip.
Finally, the reader should note that throughout this work, all logarithms are 
to base 2, $\mathbf{N}$ denotes the set of non-negative integers, $\mathbf{Z}$ the set of integers, and if 
$c \in \mathbf{N}$, $d \in \mathbf{Z}$, then $d \bmod c$ is defined to be the unique integer $a \in \mathbf{N}$ such that 
$0 \le a < c$ and there exists $b \in \mathbf{Z}$ such that $a + bc = d$. For those unfamiliar with the 
"order" notation, we provide the following reminder. Let $f,g:\mathbf{N}\to\mathbf{R}^+$ (where $\mathbf{R}^+$ 
denotes the set of positive real numbers). We say that:
(1) $f(n) = O(g(n))$ if there exist $c \in \mathbf{R}^+$, $N \in \mathbf{N}$ such that for all $n \ge N$, $f(n) < c \cdot g(n)$.
(2) $f(n) = \Omega(g(n))$ if $g(n) = O(f(n))$.
(3) $f(n) = \Theta(g(n))$ if $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$.
(4) $f(n) = o(g(n))$ if $\lim_{n\to\infty} f(n)/g(n) = 0$.
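The definition of d mod c given above can be checked mechanically. The following is a minimal sketch in Python, which is not a language used in the thesis; the function name `mod_as_defined` is hypothetical. It assumes c is strictly positive, since no integer a satisfies 0 <= a < c when c = 0.

```python
def mod_as_defined(d: int, c: int) -> int:
    """Return the unique a with 0 <= a < c and a + b*c == d for some integer b.

    Assumption: c > 0 (the definition has no solution for c == 0).
    """
    if c <= 0:
        raise ValueError("c must be a positive integer")
    a = d % c          # for c > 0, Python's % already yields a value in [0, c)
    b = (d - a) // c   # the witness b required by the definition
    assert 0 <= a < c and a + b * c == d
    return a
```

For negative d the result is still non-negative; for instance, d = -7 and c = 3 give a = 2 with witness b = -3.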
Chapter 2
Designing a Parallel Machine Model
In this chapter we present our basic parallel machine model, and attempt 
to justify some of the decisions which contributed to its present form. 
Informally, the model consists of a network of interconnected random-access 
machines, or RAMs. In the first section we give a more formal description, 
providing illustration by way of an example RAM instruction-set. We define the 
major resources of interest: processors (number of RAMs), time (number of 
instructions executed), degree (degree of the interconnection pattern), space 
(number of registers required) and word-size (the size of those registers). In 
order to simplify the presentation of algorithms, a high-level 
pseudo-programming language is sketched.
The second section is devoted to a discussion of our choice of a unit-cost 
measure of time. We have chosen to charge a single unit of time for each 
instruction executed, rather than charge according to some notion of 
"difficulty". This raises an interesting question: for which instruction-sets is this 
a valid measure of time? We shall see in subsequent chapters that the answer 
can be provided in many different ways.
Our basic machines have a single program for all processors. In the third 
section we justify this approach, comparing and contrasting it with the SIMD and 
MIMD machines of Flynn [19]. In a SIMD machine, the processors have their 
program-counters synchronized, with each individual processor either executing 
the common current instruction or remaining dormant for a step. In contrast, 
we allow each processor to be at a potentially different point in the program. A 
MIMD machine has a different program for each processor. Our model is seen to 
be equivalent to a SIMD one, and to a reasonable subset of the MIMD model.
Finally, in the fourth section we justify our decision to start the 
computation with all processors active, rather than have them become active at 
run-time. This latter approach places a not altogether unreasonable upper- 
bound on the number of processors used in a computation, in relation to time. 
We shall see in chapter 3 that it is sometimes profitable to consider machines 
with a larger number of processors. Within this limitation, however, the two 
models are equivalent.
2.1. The Basic Model
Our parallel machine model can be loosely described as an infinite 
collection of interconnected synchronous random-access machines, only finitely 
many of which are active in any particular finite computation. By "random-access 
machine" we refer to a variant of the RAM, which is already well-known as a 
sequential machine model (see, for example, [1,12,63]); and by "synchronous" 
we mean that the instruction-cycles of the RAMs are synchronized. Each RAM
has an infinite number of general-purpose registers $r_0, r_1, \ldots$, each of which is
capable of storing a single integer, and a number of read-only registers which 
are initialized at the start of a computation. These include the processor 
identity register PID and the input-size register SIZE. The PID of the $i$th RAM is 
preset to $i$, for $i = 0,1,\ldots$
More formally, a network M consists of a program and a processor-bound. 
The program of M is a finite list of instructions; each instruction has the form 
either:
(i) Read a value from a register of a neighbouring processor.
(ii) Write a value to a register of a neighbouring processor.
(iii) Perform an internal computation.
(iv) Conditional transfer of control, or halt.
For example, let "$\circ$" denote a binary operation defined on integers. For 
convenience we divide our example instruction-set into two categories. Local 
instructions have the form:
$r_i \leftarrow$ constant (load register with constant)
$r_i \leftarrow r_j \circ r_k$ (binary operation)
$r_i \leftarrow r_{r_j}$ (indirect load)
$r_{r_i} \leftarrow r_j$ (indirect store)
$r_i \leftarrow R$ (store read-only register R)
goto m if $r_i > 0$ (conditional transfer of control)
halt (end execution)
Communication instructions have the form:
$r_i \leftarrow (r_{r_j}$ of $r_k)$ (read)
$(r_{r_i}$ of $r_j) \leftarrow r_k$ (write)
The program is to be executed synchronously in parallel by the (finitely 
many) active processors. As far as local instructions are concerned, their 
behaviour is that of independent RAMs, that is, references to registers in local 
instructions are treated as references to their respective local registers. 
Execution of a read instruction:
$r_i \leftarrow (r_{r_j}$ of $r_k)$
by processor p has the following effect. Suppose registers $r_j$, $r_k$ of processor p 
contain the values q and p' respectively. Then the contents of register $r_q$ of 
processor p' are read and placed into register $r_i$ of processor p. Similarly, 
execution of a write instruction:
$(r_{r_i}$ of $r_j) \leftarrow r_k$
by processor p has the following effect. Suppose registers $r_i$, $r_j$ of processor p 
contain the values q and p' respectively. Then the contents of register $r_k$ of 
processor p are written into register $r_q$ of processor p'.
Multiple reads of the same register are allowed. In the case of multiple 
writes to a single register, we adopt some reasonable convention whereby a 
single processor succeeds and is allowed to write its value, whilst all others fail. 
For example (after [27]) the lowest-numbered processor attempting to write 
succeeds, or (as in chapter 5), the processor which is attempting to write the 
smallest value succeeds, with ties being broken in favour of the lowest-numbered 
processor. A local instruction must compete with incoming data on 
the same basis.
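The two example write-conflict conventions can be sketched as follows. This is an illustrative Python sketch, not part of the thesis: the function name `resolve_writes`, the tuple layout of a write request and the policy names are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of one write attempt:
# (writer_pid, target_pid, target_register, value)
WriteRequest = Tuple[int, int, int, int]


def resolve_writes(
    requests: List[WriteRequest], policy: str = "lowest_pid"
) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Choose one winner per (target_pid, target_register).

    Returns a map from each contested cell to (winning_writer, value).
    policy "lowest_pid":     the lowest-numbered writer succeeds.
    policy "smallest_value": the smallest value succeeds, ties broken
                             in favour of the lowest-numbered writer.
    """
    best: Dict[Tuple[int, int], Tuple[tuple, int, int]] = {}
    for writer, target, reg, value in requests:
        if policy == "lowest_pid":
            rank = (writer,)
        elif policy == "smallest_value":
            rank = (value, writer)
        else:
            raise ValueError("unknown policy: " + policy)
        cell = (target, reg)
        # Keep the request with the smallest rank seen so far for this cell.
        if cell not in best or rank < best[cell][0]:
            best[cell] = (rank, writer, value)
    return {cell: (writer, value) for cell, (_, writer, value) in best.items()}
```

With three processors 1, 2 and 3 writing the values 99, 10 and 10 to the same register, the first policy lets processor 1 write 99, while the second lets processor 2 write 10.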
Suppose $f:\mathbf{Z}^*\to\mathbf{Z}^*$ and $x = \langle x_0, x_1, \ldots, x_{n-1} \rangle$, where $x_i \in \mathbf{Z}$ for $0 \le i < n$. We will
say that x has size or length n, and write $|x| = n$. Let $m = \max_{|x|=n} |f(x)|$ and
$f_n:\mathbf{Z}^n\to\mathbf{Z}^m$ denote the restriction of f to n arguments (we adopt the convention 
that unused output places are filled by zeros). We will variously refer to x as an 
input or input string, and each $x_i$ as an argument or input symbol.
Suppose M is a network with processor-bound $P:\mathbf{N}\to\mathbf{N}$. Let $p = P(n)$. A 
computation of M on input x is defined as follows. Place $x_i$ into register $r_{\lfloor i/p \rfloor}$ of 
processor (i mod p), and set all other general-purpose registers to zero. Set 
register SIZE of all processors to n. Simultaneously activate processors
$0,1,\ldots,p-1$. These synchronously execute the program of M. For $0 \le i < m$ let $y_i$
denote the contents of register $r_{\lfloor i/p \rfloor}$ of processor (i mod p) when all processors 
have finally halted. We say that M computes f if for all $n \ge 0$ and inputs x with 
$|x| = n$, $f_n(x) = \langle y_0, y_1, \ldots, y_{m-1} \rangle$.
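The input and output conventions just described distribute the symbols round-robin over the p active processors. A minimal Python sketch follows; the function names and the dictionary representation of registers are assumptions made for illustration, not the thesis's notation.

```python
from typing import Dict, List


def place_input(x: List[int], p: int) -> Dict[int, Dict[int, int]]:
    """Load x onto p processors: x[i] goes to register floor(i/p) of processor i mod p.

    Registers are modelled as sparse dicts; an absent register holds zero.
    """
    if p <= 0:
        raise ValueError("need at least one active processor")
    registers: Dict[int, Dict[int, int]] = {pid: {} for pid in range(p)}
    for i, symbol in enumerate(x):
        registers[i % p][i // p] = symbol
    return registers


def read_output(registers: Dict[int, Dict[int, int]], p: int, m: int) -> List[int]:
    """Collect y_0..y_{m-1} from the same positions once all processors have halted."""
    return [registers[i % p].get(i // p, 0) for i in range(m)]
```

For example, five input symbols on two processors leave processor 0 holding the symbols with even index in its registers 0, 1, 2 and processor 1 holding those with odd index in its registers 0, 1.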
The interconnection pattern of M is an infinite family of finite graphs 
$G = (G_0, G_1, \ldots)$, one for each input-size. For $n \ge 0$, $G_n$ has vertex-set
$\{0,1,\ldots,P(n)-1\}$, and an edge between vertices i and j if at any time during the
computation of M on an input of length n, processor i attempts to read from or 
write to a register of processor j. Let $D:\mathbf{N}\to\mathbf{N}$. M is said to have degree D(n) if for 
all $n \ge 0$, $G_n$ has degree D(n).
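One way to picture this definition is to build the graph for a single input-size from a log of communication attempts and take its maximum vertex degree. The Python sketch below is illustrative only: the function names and the log format (a list of (i, j) pairs) are assumptions, and a processor addressing itself is assumed to contribute no edge.

```python
from typing import Dict, Iterable, Set, Tuple


def interconnection_graph(
    num_processors: int, attempts: Iterable[Tuple[int, int]]
) -> Dict[int, Set[int]]:
    """Build an undirected graph from (i, j) pairs: processor i addressed processor j."""
    neighbours: Dict[int, Set[int]] = {v: set() for v in range(num_processors)}
    for i, j in attempts:
        if i != j:  # assumption: self-communication adds no edge
            neighbours[i].add(j)
            neighbours[j].add(i)
    return neighbours


def graph_degree(neighbours: Dict[int, Set[int]]) -> int:
    """Degree of the graph, taken as the maximum number of neighbours of any vertex."""
    return max((len(adj) for adj in neighbours.values()), default=0)
```

Repeated attempts between the same pair of processors, in either direction, count as a single edge.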
- 13-
Let $T,S,W:N\to N$. M is said to compute within time T(n) if for all inputs of size
n, all active processors have halted within T(n) steps. For $0\le t\le T(n)$, let $S_t(n)$ be
the maximum (over all inputs of size n) number of registers of M with non-zero
contents after t instructions have been executed. Then M uses space S(n) if
$S(n)=\max_{0\le t\le T(n)} S_t(n)$. It has word-size W(n) if every value which appears in a
register during such a computation has absolute value less than $2^{W(n)}$ (note that
this includes the inputs, outputs and processor identity registers).
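As a rough illustration of these measures, the following Python sketch evaluates them on the trace of a single computation (the definitions above take the maximum over all inputs of size n, which the sketch does not attempt). The trace format and the function names are assumptions made for the example. The second function computes the more common alternative measure, the number of registers that are ever non-zero, for comparison.

```python
def space_and_word_size(trace):
    """trace[t] lists every register of every processor after t instructions."""
    # Space: the largest number of simultaneously non-zero registers.
    space = max(sum(1 for v in snapshot if v != 0) for snapshot in trace)
    # Word-size: |v| < 2**w holds exactly when abs(v).bit_length() <= w.
    word = max(abs(v).bit_length() for snapshot in trace for v in snapshot)
    return space, word


def registers_ever_used(trace):
    """Alternative measure: registers that are non-zero at some point."""
    return sum(1 for column in zip(*trace) if any(v != 0 for v in column))
```

On the trace `[[5,0,0,0],[5,3,0,0],[0,3,-8,0]]` the first measure gives space 2, whereas three distinct registers are used at some point.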
Notes. (1) We have chosen a unit-cost measure of time. This choice will be
discussed in more detail in section 2.2.
(2) The space bound is a measure of the number of registers used in a
computation. It is slightly unusual: the more usual method (see, for example,
[1] for the case of a single RAM) is to define space to be the number of registers
which are assigned non-zero contents at any point during the computation. Our
reasons will become more apparent in chapter 5.
(3) The word-size is a measure of the width of (inter- and intra-processor) data
paths, and a measure of register size. This can be combined with our unit-cost
measure of space to provide an upper-bound on log-cost space.
Consider the example instruction-set given earlier in this section. So far,
we have not specified exactly which binary operations can be used for "$\circ$". In
particular we will be interested in four types of instruction-sets. Each has
two-input Boolean functions (defined on single-bit quantities) and the following
integer operations.
(1) The minimal instruction-set allows addition, subtraction, shifts of a single
bit, and extraction of the least-significant bit.
$r_i\leftarrow r_j\pm r_k$
$r_i\leftarrow\lfloor r_j/2\rfloor$
$r_i\leftarrow r_j \bmod 2$
(2) In addition, the restricted arithmetic instruction-set allows larger shifts and
extractions. Suppose $r_k>0$.
$r_i\leftarrow r_j\cdot 2^{\lceil\log r_k\rceil}$
$r_i\leftarrow\lfloor r_j/2^{\lceil\log r_k\rceil}\rfloor$
$r_i\leftarrow r_j \bmod 2^{\lceil\log r_k\rceil}$
(3) The full arithmetic instruction-set is the minimal instruction-set plus
multiplication, integer division and remaindering.
$r_i\leftarrow r_j\cdot r_k$
$r_i\leftarrow r_j \bmod r_k$
(4) The extended arithmetic instruction-set is the full arithmetic
instruction-set plus exponentiation.
$r_i\leftarrow r_j^{r_k}$
A number of questions spring to mind. Are these instruction-sets reasonable?
Are they powerful enough? Too powerful? Natural? Clearly an unrestricted
instruction-set which allows any computable function as a local instruction is too
powerful, but what kind of instruction-set is reasonable? We will return to these
questions in the next section.
Instead of writing algorithms in the low-level RAM language, we will follow
the common practice of using a high-level language which can easily be
translated into instructions of this form. We use the usual high-level constructs
for flow-of-control, based on sequencing, selection and iteration. Variables of
the form (x of processor i) will be taken as a reference to variable x of processor
i. An unmodified variable x will be taken to mean (x of processor PID), that is, a
local variable. For example, execution of the statement
if y < (y of processor ⌊PID/2⌋)
    then statement_1
    else statement_2
by a network of P(n) processors causes the ith processor, $0\le i<P(n)$, to
simultaneously compare its variable y with variable y of processor $\lfloor i/2\rfloor$. If it
finds that the former is less than the latter, then it executes statement_1,
otherwise it executes statement_2. To aid synchronization, we assume that
statement_1 and statement_2 are translated into blocks of code containing the
same number of instructions, by filling with NO-OPs (such as $r_0\leftarrow r_0$) as necessary.
All of the algorithms in this thesis will maintain synchronization by virtue of this
simple arrangement. As a notational convenience we may occasionally use
multiple, concurrent and conditional assignments.
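The two ideas in this example, a synchronous comparison against processor ⌊i/2⌋ and the padding of branches with NO-OPs, can be sketched in Python as follows. The names `pad_branches`, `choose_branch` and the string used for the NO-OP are illustrative assumptions, and instruction blocks are modelled simply as lists.

```python
NOOP = "r0 <- r0"


def pad_branches(then_block, else_block):
    """Pad the shorter branch with NO-OPs so both take the same number of steps."""
    width = max(len(then_block), len(else_block))

    def pad(block):
        return block + [NOOP] * (width - len(block))

    return pad(then_block), pad(else_block)


def choose_branch(y):
    """Branch taken by each processor i after comparing y[i] with y[i // 2]."""
    before = list(y)  # every processor reads the values held before the step
    return ["then" if before[i] < before[i // 2] else "else"
            for i in range(len(before))]
```

Because both padded branches have equal length, every processor leaves the conditional at the same time-step regardless of which branch it took.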
2.2. The Unit-Cost Measure of Time
In section 2.1 we defined the running-time of our parallel machines to be
the number of instructions executed (synchronously) before all active
processors have halted. That is, we charge a single unit of time for each
instruction executed. This is termed a unit-cost measure of time. The use of
unit-cost charging is a contentious issue. The alternative is log-cost charging,
whereby the cost of an instruction is expressed as a function of the size of its
arguments, thus tying the time required for a particular computation to its
word-size.
We follow Cook [14] in the belief that the major parallel resources of
interest are time and hardware. We also believe that the important issues in the
design of a parallel machine are more clear-cut if these two resources are kept
completely independent. A hardware measure should take into account the
amount of memory used, which is related to word-size. This makes the unit-cost
measure of time more attractive, since it alone is independent of word-size, and
thus hardware.
Even for purely sequential machines, the selection of unit-cost measures
versus log-cost is of fundamental importance. Inter-simulations between various
log-cost models (for example [1], Turing machines and log-cost RAMs) can be
achieved with only a polynomial increase in time, whereas no such simulation
can be obtained between unit-cost and log-cost models. For example, in time t a
unit-cost RAM with multiplication can compute (without input) a value as large
as $2^{2^{t+O(1)}}$, whereas the same machine with log-cost charging can only compute a
value as large as $2^{t+O(1)}$.
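The gap between the two charging schemes can be seen in a small Python experiment. This is only a sketch: the helper `grow` is hypothetical, and the log-cost rule used (charge each instruction the bit-length of its operand) is one simple assumption among several possible definitions.

```python
def grow(steps, op):
    """Apply op repeatedly to 2; return (value, unit_cost, log_cost)."""
    value, log_cost = 2, 0
    for _ in range(steps):
        log_cost += value.bit_length()   # assumed log-cost: size of the operand
        value = op(value)
    return value, steps, log_cost


def square(v):
    return v * v      # available with multiplication


def double(v):
    return v + v      # available with addition only
```

Five squarings reach $2^{32}$ at unit cost 5 but log-cost 36, so under log-cost charging the time spent is at least the length of the value produced. Five doublings reach only 64.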
From a purely practical standpoint, the choice of charging mechanism
depends on the type of computation in question. If the word-size is sufficiently
small, then the unit-cost measure is more applicable. Alternatively, if the values
being manipulated grow very quickly with input-size, requiring the use of
multi-word instructions for quite modest input lengths, then the log-cost measure is
preferable. For example, log-cost would more accurately model a small
microprocessor, and unit-cost a large mainframe.
This issue is neatly encapsulated in what Goldschlager and Lister [29] call
the "sequential computation thesis". This states that time on all "reasonable"
sequential models is polynomially related. This is motivated principally by the
polynomial-time simulations of one log-cost model by another, but in fact breaks
the models into two disjoint classes, those with unit-cost and those with log-cost
measure of time. Members of the same class are polynomially related, but two
models from different classes are not. Given this observation, the important
question which must be addressed by any theoretical treatment is not "which
model is better", but "which model is more accurate for the intended
application".
The parallel analogue of the sequential computation thesis is the so-called
"parallel computation thesis" [9,87]. This states that time on all "reasonable"
parallel models is polynomially related. Furthermore, it attempts to
characterize parallel computers by relating parallel time to a sequential
resource. More precisely, it states that time on a "reasonable" parallel
computer is polynomially equivalent to log-cost sequential (for example, Turing
machine) space. This has two implications. Firstly, a machine which is too weak
to simulate an S(n) space-bounded Turing machine in time $S(n)^{O(1)}$ is not
powerful enough to be called a parallel machine. Secondly, a machine which is
so strong that a T(n) time-bounded computation cannot be simulated in space
$T(n)^{O(1)}$ by a Turing machine is too powerful to be called parallel. We will be
concentrating mainly on the latter aspect of the parallel computation thesis,
since networks with an unrestricted instruction-set are obviously extremely
powerful. Henceforth, by "reasonable" we will mean "not too powerful", in the
sense that it is "reasonable" to expect a parallel computer to have only a
moderate amount of resources at its disposal.
One way of making our model obey the parallel computation thesis is to
restrict the processors to the minimal instruction-set of section 2.1 (this
approach was taken by Goldschlager for his SIMDAG [27]). This ensures that the
word-size grows by at most one in every time-step, and so the log-cost of the
individual instructions executed in any given computation is at most a
polynomial in the unit-cost running-time, provided the input integers are
sufficiently small. In this case, unit-cost and log-cost are polynomially related.
It makes sense to restrict the word-size of parallel processors since (as we saw
in the second paragraph of this section) the extra power of unit-cost RAMs over
log-cost RAMs seems to stem from their ability to generate large integers
quickly. Indeed, a single unit-cost RAM with either the restricted [54] or full
arithmetic [30] instruction-sets obeys the parallel computation thesis, so is
itself as powerful as a parallel machine.
We claim that the unit-cost measure of time is a valid one for parallel
processors. We shall call this the unit-cost hypothesis. It is framed as a
hypothesis because it depends upon the way in which the word "valid" is to be
interpreted; we will meet several interpretations in the remainder of this work.
Whilst it is intuitively obvious that the unit-cost measure of time is unrealistic
for very powerful instruction-sets which allow the computation of infeasible
functions in a single step, we may reasonably expect it to be realistic for fairly
weak instruction-sets, such as the minimal instruction-set of section 2.1.
This raises a number of interesting side-issues. We are in effect asking
when a unit-cost model is "reasonable". We have seen that restricting the
processors to the minimal instruction-set makes our model "reasonable" in the
sense that it obeys the parallel computation thesis. But what do we actually
mean by "reasonable"? Do models which satisfy the parallel computation thesis
successfully formalize the idea of "parallel computers"? What do we really
expect from a parallel machine model? These are amongst the issues that we
will address in chapter 3.
2.3. The Assignment of Programs to Processors
Although every processor of our parallel machine executes the same
program, our model does not fall precisely into the SIMD category of Flynn [19].
This is because the conditional goto instruction takes action depending on the
value of a local register, the contents of which may vary from processor to
processor. Thus different processors may be at different points in the program
at any given time. However, it is fairly easy to show that our model is equal in
power to a SIMD one, and to a reasonable subset of MIMD models, including that
of Galil and Paul [21].
For the sake of discussion, we will call our assignment of programs to
processors a uniform one. We use the term "uniform" in the sense of Karp and
Lipton [36], meaning that every machine has a finite description (in our case,
the program and processor bound). A MIMD model is non-uniform in the sense
that it allows a different program for each processor; thus an infinite family of
finite descriptions (one for each input size) is needed. Some authors (for
example [7,14,59]) use the term "uniform" to denote the fact that an external
"constructibility" condition has been enforced on a non-uniform model in order
to restrict interest to machines with finite descriptions.
A SIMD machine is a uniform one in which, at any given point in time, all
active processors are either executing the same instruction, or are dormant.
Our high-level pseudo-programming language allows the user to write non-SIMD
programs; we believe that this keeps the language simple, elegant and flexible
(it may be argued that it gives the user the flexibility to get into a lot of trouble,
but the same is often said of the goto statement in modern programming
languages). Furthermore, it is not really necessary to force the programmer to
write SIMD programs, since a uniform machine can be simulated by a SIMD one
without asymptotic time-loss, using the same number of processors and degree,
with space and word-size increasing by only a constant.
Suppose M is a P(n)-processor uniform machine. We will construct a SIMD
machine to simulate M as follows. Processor i of the SIMD machine, $0\le i<P(n)$,
simulates processor i of M, using variables PC, VPC, NPC, PR, A and V, and an
infinite array R. PC keeps track of the program counter of the simulated
processor, whilst for $j\ge 0$, R[j] contains the current contents of its register $r_j$.
VPC (the virtual program counter) will cycle from 1 to the program length
(which is a constant, independent of n); when PC = VPC the PCth instruction of
the program of M is simulated. NPC receives the new program counter value,
and if the instruction involves a data-transfer, A and PR receive the address and
PID respectively of the register to be updated, and V its new value. At the end of
the cycle, the arrays R are updated to reflect the new register contents, using
the information in PR, A and V, whilst PC is updated using the contents of NPC.
The process is completed at the end of a cycle in which a halt instruction is
simulated.
We present the algorithm in the high-level language of section 2.1. A
different interpretation is placed on the control constructs however, in order to
make them SIMD. The branches of a selection statement (such as if or case)
are tried one at a time, with a processor executing a particular branch if its
register contents satisfy the entry condition; all other processors remain
dormant during that period. This is opposed to the general (non-SIMD) uniform
case, in which all processors are free to start their respective branches at the
same time, or to enter and leave the construct at different times.
Suppose M has the example instruction-set of section 2.1. Then the
program of the simulating machine is as follows:
A:=V:=0
PC:=VPC:=1
while PC > 0 do
    for VPC := 1 to program length do
        if VPC = PC then
            PR,A,V := case PCth instruction of M of
                "r_i←constant": PID, i, constant
                "r_i←r_j∘r_k": PID, i, R[j]∘R[k]
                "r_i←r_{r_j}": PID, i, R[R[j]]
                "r_{r_i}←r_j": PID, R[i], R[j]
                "r_i←PID": PID, i, PID
                "r_i←(r_{r_j} of r_k)": PID, i, (R[R[j]] of processor R[k])
                "(r_{r_i} of r_j)←r_k": R[j], R[i], R[k]
            NPC := case PCth instruction of M of
                "halt": 0
                "goto m if r_i > 0": if R[i] > 0 then m else PC+1
                "others": PC+1
    (R[A] of processor PR) := V
    PC := NPC
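The cycle structure of this simulation can be tried out with the following Python sketch. It is not the thesis's instruction-set: the tuple encoding of instructions is hypothetical, only local instructions are modelled (remote reads and writes are omitted for brevity), and programs are assumed to end every path with a halt.

```python
def run_simd(program, num_procs):
    """Simulate a uniform machine SIMD-style; return each processor's registers.

    Hypothetical instruction tuples (program positions are 1-based):
      ("const", i, c)     r_i <- c
      ("add", i, j, k)    r_i <- r_j + r_k
      ("pid", i)          r_i <- PID
      ("goto", m, i)      goto m if r_i > 0
      ("halt",)
    """
    R = [dict() for _ in range(num_procs)]       # R[q][k] models register r_k of processor q
    PC = [1] * num_procs                         # PC = 0 marks a halted processor
    while any(pc > 0 for pc in PC):
        updates = []                             # (PR, A, V) triples gathered in this cycle
        NPC = list(PC)
        for VPC in range(1, len(program) + 1):   # try one instruction at a time
            for q in range(num_procs):
                if PC[q] != VPC:
                    continue                     # dormant during this part of the cycle
                op, *args = program[VPC - 1]
                NPC[q] = PC[q] + 1
                if op == "const":
                    updates.append((q, args[0], args[1]))
                elif op == "add":
                    total = R[q].get(args[1], 0) + R[q].get(args[2], 0)
                    updates.append((q, args[0], total))
                elif op == "pid":
                    updates.append((q, args[0], q))
                elif op == "goto":
                    if R[q].get(args[1], 0) > 0:
                        NPC[q] = args[0]
                elif op == "halt":
                    NPC[q] = 0
        for PR, A, V in updates:                 # end of cycle: commit register writes
            R[PR][A] = V
        PC = NPC
    return R
```

In the test program below, processor q loops q times, so the processors are at different program positions for most of the run, yet within each pass of the inner loop only one instruction of the program is being executed.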
Thus we see that our uniform model is equivalent to a SIMD one. Now, a
MIMD model allows a different program for each processor. Let $A:N\to N$ be such
that A(i) is a reasonable encoding of a RAM program (say, using the example
instruction-set of section 2.1), for $i\ge 0$. By "reasonable encoding" we mean that
a universal RAM should be able to decode this program, using negligible
resources, into a format which allows efficient simulation. A MIMD variant of our
model is identical to that of section 2.1, except that processor i of a
P(n)-processor network has program A(i), $0\le i<P(n)$. A is called the processor
assignment function.
Let M be a P(n)-processor MIMD machine which uses resources $R_1(n)$, whose
processor assignment A is such that $A_n=\langle A(0),A(1),\ldots,A(P(n)-1)\rangle$ can be
computed by a P(n)-processor uniform parallel machine using resources $R_2(n)$.
Then clearly there is a uniform P(n)-processor machine which can simulate M in
resources $R_1(n)+R_2(n)$, simply by computing $A_n$, and then having processor i,
$0\le i<P(n)$, simulate program A(i). Each processor of the uniform machine has
an identical program made up of two parts, a part to compute $A_n$, and a
universal RAM.
Thus we see that (provided the resources needed to compute A are kept to 
a feasible level) a uniform machine can efficiently simulate a MIMD one. This can 
be summarized as follows: if a MIMD machine is easy to specify, then it can be 
specified as a uniform machine. Thus a uniform model is equivalent to a useable 
subset of the MIMD model.
2.4. Processor Activation
In our model as presented so far, all P(n) processors are activated
simultaneously at the start of the computation, and begin synchronously
executing the first instruction of the program at time t=1. We call this the
initial activation model. An alternative formulation (lazy activation) is to
start off with some small number of active processors (for example, just
processor 0, or just those which receive input), postponing the activation of the
remainder until run-time. This convention has been adopted by Galil and Paul
[21] and Savitch [80].
There are two essentially different ways of approaching lazy activation. The
first requires that an active processor explicitly activate an inactive one by
executing a special "call" instruction (as in, for example, [60]). This implies that
the number of active processors can at most double in each time-step.
Alternatively, Galil and Paul [21] allow the inactive processors to execute a
polling-loop, in order to decide when to become active. This is really only
feasible for machines with constant degree, in which case it is asymptotically
equivalent to the first approach.
Note that this implies that a T(n) time-bounded machine can have at most
$n2^{O(T(n))}$ (in the case when only input-bearing processors are initially active), or
$2^{O(T(n))}$ (in the case when processor 0 is initially active) processors. Our model is
more general than this, and we shall see in section 3.3 that it does make good
sense to talk about T(n) time-bounded machines with $2^{T(n)^{O(1)}}$ processors.
For definiteness, we will assume that:
(1) Initially, only processor 0 is activated.
(2) In a computation on an input of size n, processor 0 is initially given the
value of P(n) as part of its input.
(3) If an inactive processor has a value written into it for the first time in a
computation during time-step t, it becomes active and executes the first
instruction of the program during time-step t+1. Thereafter, it is
indistinguishable from any other active processor. A processor which has
halted cannot be reactivated.
Lazy activation is essentially the same as initial activation, provided
$P(n)=2^{O(T(n))}$. Clearly an initial-activation machine can simulate a
lazy-activation one without asymptotic loss in resources, by simply maintaining an
activation flag in each processor. Simulation in the other direction is only
slightly more difficult. The problem is to activate P(n) processors and
synchronize them so that they begin the execution of the program of M at the
same time. If M has P(n) processors and runs in time T(n), we will show that the
simulating (lazy) machine runs in time O(T(n)+log P(n)), whilst increasing space
and degree by only a constant amount.
To simplify the presentation, assume that P(n) is of the form $2^k-1$ for some
k>0. We will activate P(n) processors using an interconnection pattern in the
shape of a binary tree (see figure 2.4.1).
Figure 2.4.1. The binary tree interconnection pattern with 10 vertices.
Each processor has variables C and P. P holds the value of P(n) (we assume
that P of processor 0 is set to P(n) at the start of the computation), and C the
number of processors activated so far (we assume that C of processor 0 starts
out at 0). The algorithm consists of a single loop. At each iteration a new level
of the tree is activated; C is used to detect termination. Upon exiting the loop,
all processors are synchronized, and execution of the program of M can begin.
Processor i activates its children at the next level (processors 2i+1 and
2i+2) using the high-level statements
(C,P) of processor 2i+1 := C,P
(C,P) of processor 2i+2 := C,P
This also initializes their variables C, P so that they can join in the loop at the
appropriate stage. Note that the left-hand child is activated before the
right-hand, and so potentially enters the loop earlier. We can avoid this by making
odd-numbered processors wait for α steps (where α is a suitably chosen
constant) before entering the loop. In order to synchronize the newly-activated
processors entering the loop with those already inside it, it is necessary to add
another delay, this time of β steps (where β is another suitably chosen
constant). Note that the values α, β depend only on the exact form of the RAM
instruction-set, and the ability of the compiler to generate succinct code from
high-level statements. In a high-level form, the algorithm is:
Odd-numbered processors wait for α steps
Wait for β steps
C := 2C+1
while C < P do
    (C,P) of processor 2i+1 := C,P
    (C,P) of processor 2i+2 := C,P
    C := 2C+1
To make the synchronization method completely transparent, this program
would generate the following code (using an instruction-set similar to that of
section 2.1, with certain acceptable liberties taken with arithmetic and Boolean
expressions in order to ensure brevity). In this case, α = 2 and β = 1.
1. goto 4 if PID mod 2 = 0
2. NOOP
3. NOOP
4. NOOP
5. C←2*C+1
6. goto 13 if C≥P
7. (C of 2*PID+1)←C
8. (P of 2*PID+1)←P
9. (C of 2*PID+2)←C
10. (P of 2*PID+2)←P
11. C←2*C+1
12. goto 6
13. etc.
Table 2.4.1 gives a trace of this algorithm for P(n) = 7 processors.
Time | proc 0 | proc 1 | proc 2 | proc 3 | proc 4 | proc 5 | proc 6
     | PC P C | PC P C | PC P C | PC P C | PC P C | PC P C | PC P C
   0 |  1 7 0 |
   1 |  4 7 0 |
   2 |  5 7 0 |
   3 |  6 7 1 |
   4 |  7 7 1 |
   5 |  8 7 1 |  1   1 |
   6 |  9 7 1 |  2 7 1 |
   7 | 10 7 1 |  3 7 1 |  1   1 |
   8 | 11 7 1 |  4 7 1 |  4 7 1 |
   9 | 12 7 3 |  5 7 1 |  5 7 1 |
  10 |  6 7 3 |  6 7 3 |  6 7 3 |
  11 |  7 7 3 |  7 7 3 |  7 7 3 |
  12 |  8 7 3 |  8 7 3 |  8 7 3 |  1   3 |        |  1   3 |
  13 |  9 7 3 |  9 7 3 |  9 7 3 |  2 7 3 |        |  2 7 3 |
  14 | 10 7 3 | 10 7 3 | 10 7 3 |  3 7 3 |  1   3 |  3 7 3 |  1   3
  15 | 11 7 3 | 11 7 3 | 11 7 3 |  4 7 3 |  4 7 3 |  4 7 3 |  4 7 3
  16 | 12 7 7 | 12 7 7 | 12 7 7 |  5 7 3 |  5 7 3 |  5 7 3 |  5 7 3
  17 |  6 7 7 |  6 7 7 |  6 7 7 |  6 7 7 |  6 7 7 |  6 7 7 |  6 7 7
  18 | 13 7 7 | 13 7 7 | 13 7 7 | 13 7 7 | 13 7 7 | 13 7 7 | 13 7 7
Table 2.4.1. Activation and synchronization of 7 processors in a lazy-activation
model. Table entry shows the value of the program-counter (PC) and variables P, C
for each processor initially (at time 0), and after each of 18 steps.
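The 13-instruction activation program can be checked mechanically. The Python sketch below is an illustrative simulation written for this purpose, not part of the original development: the function name and data layout are assumptions, line 12 is taken to jump back to the loop test at line 6, and activation follows assumption (3) above (a first write at step t makes the target execute instruction 1 at step t+1).

```python
def lazy_activation(P):
    """Run the activation program; return (finish step per processor, final C)."""
    assert P >= 1 and (P + 1) & P == 0, "P must be of the form 2**k - 1"
    pc = {0: 1}            # program counters of currently active processors
    C = {0: 0}             # variable C of each processor
    Pvar = {0: P}          # variable P; processor 0 is given P(n) initially
    done = {}              # step at which each processor reaches instruction 13
    step = 0
    while len(done) < P:
        step += 1
        writes = []        # (target processor, variable store, value)
        nxt = {}
        for i, k in pc.items():
            nxt[i] = k + 1
            if k == 1 and i % 2 == 0:
                nxt[i] = 4                               # even processors skip the delay
            elif k in (5, 11):
                writes.append((i, C, 2 * C.get(i, 0) + 1))
            elif k == 6 and C.get(i, 0) >= Pvar.get(i, 0):
                nxt[i] = 13                              # leave the loop
            elif k in (7, 9):
                child = 2 * i + (1 if k == 7 else 2)
                writes.append((child, C, C.get(i, 0)))
            elif k in (8, 10):
                child = 2 * i + (1 if k == 8 else 2)
                writes.append((child, Pvar, Pvar.get(i, 0)))
            elif k == 12:
                nxt[i] = 6
        for target, store, value in writes:              # commit after the step
            store[target] = value
            if target not in pc and target not in done:
                nxt.setdefault(target, 1)                # first write activates target
        for i, k in nxt.items():
            if k == 13:
                done[i] = step
        pc = {i: k for i, k in nxt.items() if k != 13}
    return done, C
```

For P = 7 every processor reaches instruction 13 at step 18 with C = 7, which agrees with the last row of the table.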
Chapter 3
Relationships with Other Models
The main aim of this chapter is to compare our network machines to a
number of other models. In the first section, we propose a fixed-structure
variant of the network model, that is, one in which the interconnection patterns
of the machines can be predicted. The more general model computes its own
interconnections, which makes it rather difficult to construct as a physical
device without increase in degree. It is observed that more efficient machines
can be constructed by expending more resources, for example a machine with a
non-recursive interconnection function can compute arbitrary (non-recursive)
single-valued Boolean functions in constant time, given sufficiently many
processors.
The second section compares our model to a shared-memory one. In a
shared-memory machine, the processors communicate indirectly via a common
shared memory, rather than by direct register access. The two models are
easily seen to be almost identical in computing power. The third section
investigates the concept of a "reasonable" parallel machine, touching on such
issues as the parallel computation thesis, bounds on word-size, and restrictions
on inter-processor communication.
In the fourth section we define a practical variant of the network machine,
the so-called "feasible network". We expound the desirability of a feasible
network which is universal for the more general model of section 2.1. Various
types of universal machines are considered, according to the manner in which
they achieve their simulations. In the fifth and final section, we investigate
possible speedups of sequential machines by parallel ones. Any computable
function can be computed in constant time if sufficiently many processors are
present. Alternatively, an arbitrary polynomial speedup in time can be obtained
on a machine which obeys the parallel computation thesis (and, as we shall see
in section 3.3, no such exponential speedup is likely).
3.1. A Fixed-Structure Model
The basic model described in section 2.1 falls into Cook's [14] category of
machines with "modifiable structure" (since processor interconnections are
computed at run-time). This implies that a resource-bound for such a machine
is made up of two parts, corresponding to the resources required to compute
the interconnections and those required to perform the actual computation. In
a "fixed-structure" machine these two components are separated. The former
reflects the cost of building the machine, and the latter the cost of using it.
This separation can become significant when the two components differ by a
large amount. For example, consider a machine with the example
instruction-set of section 2.1, whose only allowable binary operations are (single-bit)
two-input Boolean functions, integer division by 2 and multiplication by 2. Let
$f:\{0,1\}^*\to\{0,1\}^*$ be defined by $f(x_0,\ldots,x_{n-1})=\langle y_0,\ldots,y_{n-1}\rangle$ where for $0\le i<n$,
$y_i=x_i\oplus x_{(i+1) \bmod n}$. An n-processor, constant-degree machine can compute f in a
constant number of steps, provided processor i knows the value of $(i+1) \bmod n$,
$0\le i<n$. However, the same machine requires $\Omega(\log n)$ steps to actually compute
those values. Thus in a model with modifiable structure, the run-time of this
machine is $O(\log n)$; under a fixed-structure model the run-time is O(1) (and any
reasonable fabrication device which includes addition as part of its
instruction-set can compute the interconnections in parallel in a constant amount of time).
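The separation between building and using the machine can be made concrete with a short Python sketch. The two function names are hypothetical: `wire_ring` plays the role of the fabrication device and `xor_with_successor` is the single run-time step performed once the successor of each processor is already known.

```python
def wire_ring(n):
    """Fabrication: port[i] holds (i + 1) mod n, fixed before run-time."""
    return [(i + 1) % n for i in range(n)]


def xor_with_successor(x, port):
    """Run-time: one synchronous step computing y_i = x_i XOR x_{port[i]}."""
    return [x[i] ^ x[port[i]] for i in range(len(x))]
```

For example, with x = <1,0,1,1> the ring wiring gives y = <1,1,0,0>.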
A fixed-structure analogue of our basic model can be defined as follows.
Galil and Paul [21] call this a model with "predictable communication", since the
inter-processor connections need to be known in advance of actually running the
machine. Note that all machines have "predictable communication" in the sense
that they can be fabricated as a completely-connected machine (with each
processor connected to every other), but this may involve an unacceptable
increase in degree.
Our fixed-structure model has the same format as the basic model of
section 2.1, with a number of minor modifications. Each processor is given a
number of additional read-only registers which are preset at the beginning of a
computation. These correspond to values which are "hard-wired" into the
machine during the fabrication process. They consist of the DEGREE register,
and an infinite number of port registers $p_0,p_1,\ldots$, each of which is capable of
holding a single integer.
More formally, a parallel machine consists of a program and an
interconnection scheme. The program is a finite list of instructions, each of
which may have the following form (where p is a port register). Either:
(1) Read a value from a register of processor p.
(2) Write a value to a register of processor p.
(3) Perform an internal computation.
(4) Conditional transfer of control, or halt.
In the example instruction-set of section 2.1, the read instruction:
$r_i\leftarrow(r_{r_j}\ \mathrm{of}\ p_k)$
would be interpreted as meaning "read the $(r_j)$th register of processor $p_k$ and
place the result into register $r_i$", and the write instruction:
$(r_{r_i}\ \mathrm{of}\ p_j)\leftarrow r_k$
would be interpreted as meaning "write the value from register $r_k$ into the $(r_i)$th
register of processor $p_j$".
An interconnection scheme consists of three functions, a processor function
$P:N\to N$, a degree function $D:N\to N$ and an interconnection function
$G:\{i \mid 0\le i<P(n)\}\times\{d \mid 0\le d<D(n)\}\times N\to\{i \mid 0\le i<P(n)\}$.
In a P(n)-processor computation, processor i is connected to processors
G(i,d,n), $0\le d<D(n)$. We adopt the convention that if $i\in\{G(j,d,n) \mid 0\le d<D(n)\}$
then $j\in\{G(i,d,n) \mid 0\le d<D(n)\}$. A computation of M, where M has interconnection
scheme (P,D,G), is defined similarly to section 2.1, with the following addition.
Before the processors are activated, the DEGREE register is set to D(n), and for
$0\le d<D(n)$, $0\le i<P(n)$, register $p_d$ of processor i is set to G(i,d,n). The resources
of space and word-size are modified to include the new registers (the word-size
of the port registers may be measured according to their absolute contents, or
some concise relative encoding, if such is applicable).
Note that a resource bound for computing any given function must include
reference to the complexity of the interconnection scheme. This is because, as
might be expected, more efficient machines can be built by investing more
resources in their construction. Information can be stored in the
interconnection pattern, to be used later as a kind of "look-up table". Take for
example the problem of computing an arbitrary (perhaps non-recursive)
single-valued Boolean function $f:\{0,1\}^*\to\{0,1\}$. We will show how to compute f on n inputs
in a constant number of steps, using $n2^n$ processors.
If $x=\langle x_0,\ldots,x_{n-1}\rangle$ is an input of size n, let int(x) be a binary
encoding of x as an integer. The $n2^n$ processors are broken up into $2^n$ teams $T_i$,
$0\le i<2^n$. For $0\le i<2^n$, team $T_i$ consists of the n processors i.n+j, for $0\le j<n$.
The smallest-numbered processor of each team is a distinguished processor
called the team-leader. For each input x, the team-leader of $T_{int(x)}$ will have the
value f(x) encoded as part of its interconnection pattern. Our problem then,
given an input x, is to notify the appropriate team-leader.
This is achieved as follows. Each team-leader sets a specified register a to
zero. For $0\le i<2^n$, $0\le j<n$, the jth member of team $T_i$ compares the jth symbol of
the input to the jth bit of i. If those two values are different, it writes a one to
register a of its team-leader. The team-leader of $T_{int(x)}$ will be the only
team-leader which is not written to; it then consults its interconnection pattern for
the value of f(x), and writes this value to processor 0 for subsequent output.
The following is a high-level implementation of this algorithm. Assume that
initially variable x of processor p contains the pth bit ($x_p$) of the input, $0\le p<n$.
Each processor has two variables i and j which (as in the previous paragraph)
record that processor's team number and position within that team. Variable a
of the team-leaders will be used for communication with its team members. The
result f(x) will end up in variable r of processor 0.
The interconnection pattern is as follows. For $0\le i<2^n$, $0\le j<n$, processor
n.i+j (the jth member of $T_i$) is connected to processor j (the processor in charge
of the jth bit of the input), and processor n.i (its team-leader). For each input x,
processor n.int(x) is connected to processor f(x) via a special link. It can
determine the value of f(x) by reading the PID of that processor.
    a := r := 0
    i, j := ⌊PID/n⌋, PID mod n
    if (x of processor j) ≠ (j-th bit of i)
        then (a of processor i*n) := 1
    if (j = 0) and (a = 0)
        then (r of processor 0) := PID of processor on special link
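The program above can be traced with a small sequential sketch. This is only an illustration under stated assumptions: the function name `lookup_network` and the list `f_table` (standing in for the special links wired into the team-leaders) are introduced here and do not appear in the thesis, and the concurrent writes of the network are replayed one after another.

```python
def lookup_network(f_table, x):
    """Replay the team-based look-up for input bits x (x[j] is bit j).

    f_table[i] stands in for the special link of the leader of team T_i;
    representing it as a Python list of length 2**n is an assumption of
    this sketch, not part of the original model.
    """
    n = len(x)
    teams = 2 ** n
    # Every processor sets a := 0 (only the leaders' copies are used).
    a = [0] * (n * teams)
    # Member j of team i compares input bit j with bit j of i.
    for pid in range(n * teams):
        i, j = divmod(pid, n)
        if x[j] != (i >> j) & 1:
            a[i * n] = 1  # write a one to register a of the team-leader
    # The only leader whose register a is still zero reports f(x).
    r = None
    for i in range(teams):
        if a[i * n] == 0:
            r = f_table[i]
    return r
```

Each of the three stages corresponds to a constant number of network steps; the loops exist only because the sketch runs on a single sequential machine.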
Notes. (1) The algorithm as presented uses the extended arithmetic
instruction-set. The restricted arithmetic instruction-set can be substituted by
increasing the number of processors in each team to $2^n$. If the minimal
instruction-set is used, the run-time is O(log n). Note that the values $i \cdot n$,
$0 \le i < 2^n$, are not computed at run-time, but are stored as part of the
interconnection pattern.
(2) The degree can be reduced to a constant by the use of binary trees for
routing. Information about f(x) is encoded using the technique of theorem 4 of
Galil and Paul [21]. The run-time is increased to O(log n) on either the full
arithmetic, restricted arithmetic or minimal instruction-sets.
(3) The number of processors can be reduced to 2n*’* [21]. This increases the 
run-time to O(n), although it does reduce the degree to a constant, and uses
only finite-state machines as processors.
3.2. Shared Memory Machines
A popular alternative model is obtained by constraining processors to 
communicate via a common memory, rather than communicating by direct 
processor-to-processor links. Let $D, P, S, T, W, Z: N \to N$.
A shared-memory machine consists of an infinite number of processors
attached to a globally accessible shared memory. Each processor possesses an
infinite number of general purpose registers, and a unique read-only processor
identity register PID which is preset to i in the i-th processor, $i \in N$. A program for
this machine consists of a finite list of instructions; each instruction is of
one of the following forms:
(i) Read a value from a specified place in the global memory.
(ii) Write a value to a specified place in the global memory.
(iii) Perform an internal computation.
(iv) Conditional transfer of control, halt.
The allowable internal computations usually consist of direct and indirect 
register transfers, logical and arithmetic operations.
More formally, each machine is specified by a program P and a processor
bound P(n). A computation proceeds roughly as follows. An input of size n
(where the "size" measure depends on the problem in question) is broken up
into n unit-size pieces, and the i-th piece is stored in global memory location i,
$0 \le i < n$. All other memory locations and general purpose registers are set to
zero. Processors $0, 1, \ldots, P(n)-1$ are activated simultaneously; they synchronously
execute the program P. When all processors have halted, the output is to be
found in some specified place in the global memory.
The processor bound P(n) is a measure of the number of processors used as 
a function of input size. The space S(n) is the maximum number of non-zero 
entries in the global memory and registers at any time during the computation. 
(Note that this includes the input and the processor identity registers). The 
machine is said to have word-size W(n) if the maximum value in any register or
global memory location during any computation on an input of size n has
absolute value less than $2^{W(n)}$. The time bound T(n) is the number of instructions
executed before all processors have halted, again as a function of input size.
Variants of this model have appeared, for example, in
[8,14,20,21,27,42,47,62,64,68,69,71,72]. We assume some reasonable protocol for
dealing with memory access conflicts, as in those references. The general
consensus of opinion is that whilst the shared memory model is a powerful
theoretical tool, it is not feasibly buildable using any foreseeable technology.
A shared-memory machine M can be simulated by a network with identical
internal instruction-set, without asymptotic loss of resources. Suppose M has
P(n) processors. Then the network has P(n)+1 processors. Processor 0 remains
idle throughout the computation whilst processor i, $1 \le i \le P(n)$, simulates the
action of processor i-1 of M. A reference to global memory location m is
replaced by a reference to register $r_m$ of processor 0. The extra processor can
be eliminated by having processor 0 reserve the odd-numbered registers for its
own use, and the even-numbered registers for the contents of the shared
memory. A reference to global memory location m is then replaced by a
reference to register $r_{2m}$ of processor 0.
Alternatively, the global memory contents can be divided up amongst the
processors of the network, provided the instruction-set is sufficiently powerful.
Suppose M has P(n) processors and space S(n). Processor i of the network,
$0 \le i < P(n)$, simulates processor i of M, and in addition holds the values of global
memory locations $i + j \cdot P(n)$, $j \ge 0$. A reference to global memory location m is
replaced by a reference to register $r_{2\lfloor m/P(n) \rfloor}$ of processor m mod P(n) (each
processor can reserve its even-numbered registers for memory locations, and
the odd-numbered registers for its own use).
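The address translation just described is small enough to write out. The sketch below is a hedged illustration; the function names `network_home` and `private_register` are hypothetical and introduced only for this example.

```python
def network_home(m, p):
    """Return (processor, register index) holding shared-memory location m
    when the memory is spread over p network processors.

    Location m lives on processor m mod p, in the even-numbered register
    2 * floor(m / p), as described in the text.
    """
    return m % p, 2 * (m // p)


def private_register(k):
    """Odd-numbered register used for the k-th private register of a
    simulated processor (an assumed numbering, for illustration)."""
    return 2 * k + 1
```

Both functions need division or multiplication by P(n), which is why the text ties this simulation to the full arithmetic instruction-set.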
This assumes that the instruction-set is at least as powerful as the full 
arithmetic instruction-set of section 2.1. If the restricted arithmetic 
instruction-set is used, P(n) should be replaced by $2^{\lceil \log P(n) \rceil}$. For a minimal
instruction-set, the time-loss is O(log P(n)) per instruction, using P(n) 
processors. If sufficiently many processors are used (so that each processor 
holds at most one memory location) this time-loss can be reduced to a constant 
multiple.
Similarly, a network M can be simulated by a shared-memory machine 
without asymptotic loss in resources, provided the instruction-set is sufficiently 
powerful. The registers of the network are stored in the common memory - each 
processor of the shared-memory machine need only have a constant number of 
local registers (note that a similar trick serves to reduce the local-memory 
requirements of all shared-memory machines, subject to similar conditions). A 
reference to register $r_i$ of processor j is replaced by a reference to global
memory location $P(n) \cdot i + j$.
This replacement costs only a constant number of steps per access for
machines with the full arithmetic instruction-set. As before, if P(n) is replaced
by $2^{\lceil \log P(n) \rceil}$ it also costs a constant number of steps with the restricted
arithmetic instruction-set. For machines with the minimal instruction-set, a
similar result can be obtained by storing, along with each register $r_i$, the
contents of $r_i$ multiplied by P(n). This requires time proportional to
Thereafter, these values
can be maintained and used for register access with a constant loss in time for 
each step of M. Alternatively, the multiplication by P(n) can be computed at 
access-time, at a cost of O(log P(n)) per access.
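To make the O(log P(n)) access cost concrete, here is a minimal sketch of the mapping from register $r_i$ of processor j to global location $P(n) \cdot i + j$, with the multiplication done by repeated doubling. It assumes that addition, halving and a parity test are available as unit-cost steps; the thesis does not spell out the minimal instruction-set in this section, so that assumption and both function names are introduced here for illustration only.

```python
def times_p_by_doubling(i, p):
    """Compute p * i with one addition and one doubling per bit of p,
    that is, O(log p) simple steps."""
    result, addend = 0, i
    while p > 0:
        if p & 1:          # current low-order bit of p is set
            result += addend
        addend += addend   # double the addend
        p >>= 1            # move to the next bit of p
    return result


def global_location(i, j, p):
    """Global memory location standing in for register i of processor j,
    where p is the number of processors of the simulated network."""
    return times_p_by_doubling(i, p) + j
```

Storing the product alongside each register, as the text suggests, pays this logarithmic cost once instead of at every access.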
3.3. Reasonableness and Practicality
In section 2.2 we raised the following important question: what constitutes a 
"reasonable" model of parallel computation? In particular, what is a reasonable 
instruction-set for our processors, given that we have chosen a unit-cost
measure of time? Goldschlager, in [27], placed certain restrictions on his
SIMDAG's (a variant of the shared-memory model considered in section 3.2) to 
ensure that they obey the parallel computation thesis: time on any "reasonable" 
parallel model is polynomially equivalent to sequential (log-cost) space.
Evidence for this thesis is provided by a multiplicity of "reasonable" models, for 
example, alternating Turing machines [9], uniform circuits [7] and vector 
machines [54], as well as Goldschlager's SIMDAG and conglomerate.
As we shall see later in this section, in order to make networks and shared- 
memory machines obey the parallel computation thesis, it is necessary to place 
upper bounds on the word-size and type of instructions allowed. These
restrictions can be accepted as "reasonable" purely on practical grounds - for 
example, one can argue that the word-size of problems tackled in practice 
should not grow too rapidly with input-size. In this sense, "reasonable" can be 
equated to "practical".
The parallel computation thesis also provides us with a powerful theoretical 
tool. Suppose that we are interested in those problems from P which have an 
exponential speedup in parallel, that is, those members of P which can be solved
in time $\log^{O(1)} n$ by a "reasonable" parallel machine. If a "reasonable" machine is
one which obeys the parallel computation thesis, then these are precisely the 
members of P which can be solved in polylog space by a Turing machine.
Let POLYLOGSPACE denote the class of languages which can be accepted in
space $\log^{O(1)} n$ by a Turing machine. It is widely conjectured that
$P \not\subseteq$ POLYLOGSPACE (although it is not known for sure whether either class
contains the other). Evidence is provided for this conjecture by the existence of
log-space complete problems (see, for example, [24,25,26,28,34,35,37]); that is,
problems which are members of P, yet if any one of them is a member of
POLYLOGSPACE then $P \subseteq$ POLYLOGSPACE. Thus log-space complete problems
probably do not have an exponential speedup on any "reasonable" parallel 
machine, where the parallel computation thesis is used as a criterion for 
"reasonableness".
Thus we see that there are two facets to the concept of "reasonableness", 
that which is reasonable from a practical point of view, and that which is 
reasonable from the theoretical point of view. It may be theoretically 
interesting to consider networks with an exponential number of processors (as 
in section 3.4), but it is certainly not reasonable to consider them as a practical 
proposition for all except the smallest values of n. A theoretical model is an 
attempt to capture the essence of an intuitive notion of "parallel computation";
a practical model is, in addition, governed by physical and technological 
constraints.
The remainder of this section is devoted to a closer look at some ways of 
defining a "reasonable" model. Earlier in this section we referred to some
additional conditions which ensure that networks obey the parallel computation 
thesis. What exactly are these conditions? Firstly, an S(n) space-bounded 
nondeterministic Turing machine can be simulated by a network with the
minimal instruction-set, in time and word-size O(S(n)), using the techniques of
theorem 2.1 of Goldschlager [27]. Conversely, we have:
Theorem 3.3.1 A T(n) time-bounded network M with word-size W(n) can be
simulated by a deterministic Turing machine using space $T(n) \cdot W(n) + S(n)$, where S(n)
is the space required for the Turing machine to simulate a single instruction of a
processor of M.
Proof. Similar to theorem 2.2 of [27]. □
This enables us to throw some light on the unit-cost hypothesis. As far as 
the parallel computation thesis is concerned, it is reasonable to charge a single 
unit of time for instructions which can be computed by a Turing machine in 
space $T(n)^{O(1)}$, where T(n) is the number of steps in the intended computation.
Given this condition, networks obey the parallel computation thesis provided
$W(n) = T(n)^{O(1)}$. Note that this allows machines with as many as $2^{T(n)^{O(1)}}$
processors; although those who support lazy activation (see section 2.4) insist
that $P(n) = 2^{O(T(n))}$, and some authors insist that $P(n) = n^{O(1)}$ (for example,
[16,17,42,53]).
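The role of the word-size condition can be seen by substituting it into the space bound of Theorem 3.3.1. The following worked inequality is a sketch; the constants c and c' are names introduced here, and it assumes, as in the paragraph above, that a single instruction can be simulated in space polynomial in T(n).

```latex
\begin{align*}
\text{space}(n) &\le T(n) \cdot W(n) + S(n) \\
                &\le T(n) \cdot T(n)^{c} + T(n)^{c'}
                 && \bigl(W(n) \le T(n)^{c},\ S(n) \le T(n)^{c'}\bigr) \\
                &= T(n)^{O(1)} .
\end{align*}
```

Together with the converse simulation (an S(n) space-bounded Turing machine in network time O(S(n))), this gives the polynomial relationship between parallel time and sequential space that the thesis requires.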
To summarize, here are a number of restrictions on network and shared- 
memory models which can be used to define so-called "reasonable" machines.
(1) Restrictions on the instruction-set.
Restrictions on the instruction-set are motivated by a desire to see that the
unit-cost hypothesis holds.
(a) The first premise is that individual processors should behave like
log-cost sequential machines. In particular, the resource of time should
be polynomially related to time on an accepted log-cost sequential
machine model, such as the deterministic Turing machine (c.f. the
sequential computation thesis, section 2.2). Thus instructions which
are valid for a T(n) unit-cost-time bounded computation should
individually take no more than $T(n)^{O(1)}$ steps on a deterministic Turing
machine.
(b) Instructions should be computable in space $T(n)^{O(1)}$ by a Turing
machine. This helps to ensure that the parallel computation thesis
holds. Note that this condition is implied by part (a) above.
(2) Bounds upon processors and time.
Upper-bounds on the number of processors are usually motivated by the
observation that, given enough processors, every computable function can
be computed in constant time (see section 3.5), which makes time a
singularly uninteresting resource.
(a) $P(n) = 2^{O(T(n))}$. This is a consequence of the lazy-activation approach
(see section 2.4).
(b) $P(n) = n^{O(1)}$, $T(n) = \log^{O(1)} n$. Machines with these two properties are
sometimes called small and fast respectively. See, for example,
[16,17,53,59].
(3) Bounds upon word-size.
Upper-bounds on word-size are usually motivated by the observation that
(as previously noted in section 2.2) single-processor machines with the full
[30] or restricted [54] arithmetic instruction-sets obey the parallel
computation thesis, and so can be considered "reasonable" parallel
machines in themselves. This makes the processor-bound an uninteresting
resource.
(a) W(n) = O(T(n)). This can be achieved indirectly (as in Goldschlager
[27]) by restricting the instruction-set and the size of the
input-symbols.
(b) $W(n) = T(n)^{O(1)}$. This condition guarantees that the parallel computation
thesis holds, subject to the additional conditions on the instruction-set
mentioned in 1(b) above.
(c) $W(n) = n^{O(1)}$. This ensures that the input encoding is "concise" in the
sense of [22]. If the input symbols are allowed to be integers with more
than a polynomial number of bits, then n is no longer a reliable (to
within a polynomial) measure of input-size.
Other restrictions are often made in the literature, motivated, it is often
claimed, by practical considerations. These include the following:
(1) Restrictions on degree. It is widely accepted that a completely-connected 
machine is impractical. Some authors (for example, Galil and Paul [21]) 
think that degree should be constant (i.e. independent of input-size).
(2) Restrictions on the interconnection pattern. In the case of fixed-structure
networks (see section 3.1) it is desirable to restrict oneself to machines
with an interconnection pattern which is in a sense easy to compute (see,
for example, [21]). This is also the case for uniform circuits [59] and
conglomerates [27]. One advantage of this approach is that it avoids the
kind of machine described in section 3.1, which can compute a large class
of functions (which may even be non-recursive) in an unnaturally small
amount of time.
(3) Restrictions on register access. Even if higher-degree machines are
acceptable, should every processor necessarily have the freedom of being
able to read any register of its neighbours? An alternative is to provide
each processor with a special communication register, which is the only
register accessible to other processors. This approach is taken, for
example, in [21,36,68]. We will call machines of this kind restricted-access
networks.
(4) Restrictions on multiple register access. Some authors (for example, [20])
insist that simultaneous writes to a single register be disallowed; others (for
example, [42]) insist that simultaneous reads of a single register also be
banned.
We will have more to say on these matters in later sections.
3.4. A Practical Model
In the last section we saw various constraints which can be placed on our 
network model in order to make it "reasonable" or "practical". We are now 
ready to define our own practical variant of the network model. A feasible 
network H is a fixed-structure network (see section 3.1) with interconnection 
scheme (P,D,G), such that:
(i) Each processor has a constant number of general-purpose registers.
(ii) The degree, D(n), is a constant.
(iii) The interconnection function G can be computed in time O(log P(n)) by a
deterministic Turing machine.
These three conditions ensure that the machines are in a sense easy to 
construct. Each processor has a small amount of memory, and a small number 
of easy-to-compute interconnections. Machines with similar characteristics 
have appeared, for example, in [21,46,46,55,62,66,69]. Note that we have made
no attempt to make the model "reasonable" by placing bounds on the number of
processors, space, time, word-size, or the complexity of the instruction-set,
according to the guidelines laid down in the last section. The reader is
completely free to make whatever additional restrictions are required,
according purely to taste, or in order to model a particular kind of computing
environment.
Even if we accept the feasible network as being feasibly constructible, it is
unlikely that the fabrication costs would be so low that the average user would
be willing to build a new machine for each application. More likely, the user
would prefer to present each new machine (in the form of a program) to a
universal parallel computer which can simulate it at a small cost in resources.
The user would thus be able to trade the fabrication cost of a feasible network
for a small increase in resources at run-time.
A further advantage is to be gained if we can find an efficient feasible
network which is universal for the general model of section 2.1. From a
practical point of view, it would provide the user of a feasible network with a
new, flexible high-level programming language. Programs which are written in a
high-level programming language similar to that of section 2.1 could (although
they may correspond to machines which are not feasibly constructible) be run
on a feasible universal machine, for a small extra cost in resources. By building
a single feasible network the user gains the use of a flexible and elegant virtual 
architecture, corresponding to a completely-connected network. From a 
theoretical point of view, we obtain a practical motivation for studying the more 
esoteric parallel machine models of chapter 2.
Note that the universal machine is far more attractive than the machines 
that it can simulate. For example:
(1) It is a fixed-structure machine with a small number of easy-to-compute
interconnections per processor.
(2) The number of registers per processor is small. The problem of whether to 
allow access to arbitrary registers of neighbouring processors thus 
vanishes; each processor can be restricted to communicating via a single 
communication register (as mentioned in section 3.3) without asymptotic 
time-loss.
(3) Because its degree is constant, the problem of whether to allow 
simultaneous access to those communication registers also vanishes. 
Accesses in the universal machine can be restricted to exclusive reads 
without asymptotic time-loss (by use of a polling loop).
(4) The requirement that the universal network is synchronized is no longer 
essential (see [21]).
Yet the machines being simulated need share none of these restrictions.
Exactly what do we mean by a "universal machine"? Suppose U is a P(n)
processor (feasible) network, M is an arbitrary network, and
$x = \langle x_0, \ldots, x_{n-1} \rangle$ is an input of size n. A simulation of M on input x
by U is to proceed as follows. Let p = P(n). Place $x_i$ into register
$r_{\lfloor i/p \rfloor}$ of processor (i mod p). Place into the
remaining registers of processor 0 a concise finite encoding of the program of M.
Set all other general-purpose registers to zero, and simultaneously activate
processors $0, 1, \ldots, p-1$ on the program of U. Suppose M computes a function f,
and $f_n(x) = \langle y_0, \ldots, y_{m-1} \rangle$ (for definitions see section 2.1). U is said to be
universal if for all machines M and inputs x, when all processors of U have
halted, register $r_{\lfloor i/p \rfloor}$ of processor (i mod p) contains $y_i$, for $0 \le i < m$.
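The input and output conventions of the universal machine amount to a round-robin layout of symbols over processors. A minimal sketch follows, assuming each processor's registers are modelled as a Python dictionary; the names `load_input` and `read_output` are hypothetical and introduced only for this example.

```python
def load_input(x, p):
    """Place symbol x[i] into register floor(i / p) of processor i mod p.

    Returns a list of p register files, each a dict from register index
    to contents (an assumed representation, for illustration).
    """
    regs = [dict() for _ in range(p)]
    for i, symbol in enumerate(x):
        regs[i % p][i // p] = symbol
    return regs


def read_output(regs, m):
    """Collect y_0 .. y_{m-1} using the same layout once U has halted."""
    p = len(regs)
    return [regs[i % p][i // p] for i in range(m)]
```

With this layout every processor holds at most ceil(n/p) input symbols, and the encoding of the program of M occupies the remaining registers of processor 0.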
We are interested in a particular kind of simulation, which we shall call
"step-wise". A simulation of a T(n) time-bounded network M on an input of size n
is said to be step-wise if:
(1) For $0 \le i < S(n)$, $0 \le \tau \le T(n)$, each register i of M has a corresponding
dedicated processor $d(i,\tau)$ in U. Note that we may allow $d(i,\tau) = d(j,\tau)$ when
$i \ne j$.
(2) Suppose $t: N \to N$. The simulation consists of three phases:
(a) Initialization. This includes the assignment of, and routing of the input
values to, the dedicated processors, as well as any pre-computations
required for phase (b).
(b) Computation. The computation phase is to take $t(n) \cdot T(n)$ steps. For
$0 \le \tau \le T(n)$ we require that after $t(n) \cdot \tau$ steps of this phase, processor
$d(i,\tau)$ has a distinguished register which contains the contents of
register i of M after $\tau$ steps of M, $0 \le i < S(n)$.
(c) Termination. This includes routing of the output from the dedicated
processors to the output processors.
Such a universal machine is said to have delay t(n). The set-up time is the
time required for phases (a) and (c) combined. Note that the set-up time must
be independent of T(n). A step-wise simulation is also said to be literal if a
data-transfer from register i to register j of M, $0 \le i, j < S(n)$, during time-step $\tau$,
$1 \le \tau \le T(n)$, gives rise to a communication between processors $d(i,\tau-1)$ and $d(j,\tau)$
of U between time-steps $(\tau-1) \cdot t(n) + 1$ and $\tau \cdot t(n)$ of phase (b). More formally,
define a directed multi-graph $G_n$ as follows ($G_n$ is to reflect the information flow
between processors of U during the simulation of time-step $\tau$ of M). $G_n$ has
vertex-set $\{0, 1, \ldots, P'(n)-1\}$ (where U has P'(n) processors), and an edge from
vertex u to vertex v, labelled $\delta$, if during time-step $(\tau-1) \cdot t(n) + \delta$ of phase (b),
processor v of U reads a value from processor u, $1 \le \delta \le t(n)$. (Recall that
processors of U use only exclusive-reads for inter-processor communication.)
We require that there be a path from $d(i,\tau-1)$ to $d(j,\tau)$ in $G_n$ with monotonic
increasing labels on the edges. Thus in a literal simulation, a data transfer
between registers of the simulated machine can give rise to a data transfer
between the corresponding dedicated processors within the simulation of that
time-step. In a non-literal simulation, the required data may (for example) have
started out during the simulation of the previous step of M, and been kept
up-to-date by auxiliary processors along the way (see sections 7.1, 7.2).
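The path condition in the definition of a literal simulation can be checked mechanically. The sketch below is an illustration only: it assumes the multi-graph is given as a list of (u, v, label) triples, it reads "monotonic increasing" as strictly increasing (a value moves at most one hop per step), and the function name `has_increasing_path` is introduced here.

```python
def has_increasing_path(edges, src, dst):
    """Return True if some path from src to dst uses edges whose labels
    strictly increase along the path.

    edges is an iterable of (u, v, label) triples, meaning that processor
    v reads from processor u at sub-step `label` (labels start at 1).
    """
    # earliest[v] is the smallest final label over all increasing paths
    # from src to v; the empty path at src counts as label 0.
    earliest = {src: 0}
    for u, v, label in sorted(edges, key=lambda e: e[2]):
        if u in earliest and earliest[u] < label:
            if v not in earliest or label < earliest[v]:
                earliest[v] = label
    return dst in earliest
```

Processing the edges in label order is enough, because an edge can only extend paths that end with a smaller label, and all such paths have already been recorded when the edge is reached.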
Later, we will consider a more restrictive form of literal simulation in which 
the dedicated processor assignment does not change with time. We will call this 
type of simulation strongly literal. In section 8.3 we give an upper-bound of
O(log P(n)) on the delay for a strongly-literal simulation of a P(n) processor, 
constant-degree, restricted-access machine, and match this with a lower-bound 
in section 7.1.
3.5. Speedup of Sequential Machines
In section 3.3 we briefly touched on the following question: which problems
in P have an exponential speedup in time on a "reasonable" parallel machine,
where "reasonable" is defined using the parallel computation thesis? The only
answer which is currently available is "probably not all of them". Here we tackle
an easier question: what speedups are offered by our networks (reasonable or
otherwise), as opposed to sequential ones? As a partial answer, we provide the
following result.
Theorem 3.5.1 Let $B: N \to N$. A T(n) time-bounded deterministic Turing machine
can be simulated in time $O(T(n)/B(n))$ by a network with $2^{O(B(n)^2)} + T(n)$ processors,
word-size $O(B(n)^2 + \log T(n))$ and a constant number of registers per processor.
Proof. (Outline). Let M be a T(n) time-bounded k-tape deterministic Turing
machine. A configuration of M consists of $k \cdot T(n)$ tape symbols corresponding to
the tape contents, and k integers corresponding to the head positions on the k
tapes. The network has T(n)+1 processors devoted to holding the current
configuration of M in an easily-accessible manner. Processor 0 holds the k head
positions, and for $0 < i \le T(n)$ processor i holds the i-th symbol of each tape. The
simulation will consist of $\lceil T(n)/B(n) \rceil$ phases, each corresponding to B(n) steps of M.
The initial configuration of M is easy to set up, and the simulation will endeavor
to maintain it from phase to phase.
A situation consists of that portion of the tape which may be altered during
the current phase, that is, the $k \cdot (2B(n)-1)$ tape-cells that are within distance
B(n) from a head at the start of the phase. During each phase the simulation
will be conducted using these situations; at the end of each phase the final
situation will be used to update the stored configuration. To be more precise, a
situation consists of $k \cdot (2B(n)-1)$ tape symbols, and k head-pointers (each of
O(log B(n)) bits).
Before the first phase, some pre-computation is carried out. A computation 
of M consists of a string of B(n)+1 situations. The processors are broken up into
$2^{O(B(n)^2)}$ teams (one for each computation), each of B(n)+1 processors. The
lowest numbered processor of each team is a distinguished processor called the 
leader of that team. Our aim is to notify the leaders of the teams which 
correspond to valid computations of M.
The j-th team is made up of processors for which $\lfloor PID/(B(n)+1) \rfloor = j$. The i-th
member of this team has $PID \bmod (B(n)+1) = i$. The value j is interpreted as the
encoding of a computation (note that this computation is the same for all
members of a team). The i-th member of each team, $0 < i \le B(n)$, verifies that the
i-th situation of the computation follows from the (i-1)-th one by the rules of M,
where the situations of a computation are numbered $0, 1, \ldots, B(n)$. If not, then
that processor is said to fail. The team-leader verifies that the heads of the
initial situation of the computation are all at cell B(n) of their respective tapes.
Processors which fail notify their team-leader as follows. Each team-leader
sets a pre-determined register r to zero. Failed processors then attempt to
write a one to register r of their team-leaders. A number of team-leaders will
have their register r remain at zero. The computations of their teams
correspond to valid computations of M. They extract the initial and final
situations I, F of their respective computations, and write F to processor I.
Each phase is broken up into three parts.
(1) Determine the initial situation from the initial configuration of the phase.
This time the processors are broken up into $2^{O(B(n))}$ teams (one for each
possible situation), each of 2B(n)-1 processors. The lowest numbered
processor of each team is a distinguished processor called the leader of
that team. Processors which are not members of a team remain idle.
The i-th member of the j-th team ($i, j \ge 0$) has $i = PID \bmod (2B(n)-1)$ and
$j = \lfloor PID/(2B(n)-1) \rfloor$. Each processor first computes i and j. The value j is
interpreted as the encoding of a situation (note that j is the same for all
members of any particular team). The i-th processor of each team,
$0 \le i < 2B(n)-1$, decodes the head positions and the i-th symbol of each tape
from this situation. Every processor of every team then compares its
symbols to the corresponding symbols of the stored configuration. If they
disagree, the processor is said to fail. Each team-leader sets a
pre-determined register r to zero. Failed processors then attempt to write a
one to register r of their team-leaders.
The team leader whose head-pointers are equal to B(n), and whose register
r remains at zero knows that its value of j is an encoding of the initial
situation of the phase. It writes this value to processor 0 for safe-keeping.
(2) Determine the final situation of the current phase. Processor 0 can obtain
this information from processor I, where I is the initial situation of the
current phase computed in (1) above. Processor I obtained this
information during the pre-computation stage.
(3) Determine the final configuration of the phase from the final situation.
Those processors holding symbols of the configuration which are within
distance B(n) from a head update their values using the final situation
stored in processor 0. Processor 0 likewise updates its head positions.
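The three parts of a phase can be imitated sequentially for a one-tape machine. The sketch below is not the network construction itself: a memoised function stands in for the pre-computed table that maps each initial situation to its final situation, the machine state is carried alongside the window (an assumption of this sketch), and the names `make_block_simulator`, `delta`, `final_situation` and `run` are all introduced here.

```python
from functools import lru_cache


def make_block_simulator(delta, B, blank="_"):
    """Simulate a one-tape Turing machine in phases of B steps each.

    delta maps (state, symbol) to (state, symbol, move), with move in
    {-1, +1}; a missing entry means the machine has halted.
    """

    @lru_cache(maxsize=None)
    def final_situation(state, window):
        # Stand-in for the table built during pre-computation: the
        # window is the 2B-1 cells centred on the head.
        cells, pos = list(window), B - 1
        for _ in range(B):
            if (state, cells[pos]) not in delta:
                break  # halted inside this phase
            state, cells[pos], move = delta[(state, cells[pos])]
            pos += move
        return state, tuple(cells), pos - (B - 1)

    def run(tape, state, steps):
        # Runs whole phases, so up to B-1 extra steps may be taken if
        # the machine has not halted by then.
        tape, head = dict(enumerate(tape)), 0
        for _ in range(-(-steps // B)):  # ceil(steps / B) phases
            lo = head - (B - 1)
            # Part (1): read the initial situation off the configuration.
            window = tuple(tape.get(lo + k, blank) for k in range(2 * B - 1))
            # Part (2): look up the final situation.
            state, cells, shift = final_situation(state, window)
            # Part (3): write the final situation back.
            for k, c in enumerate(cells):
                tape[lo + k] = c
            head += shift
        return state, head, tape

    return run
```

A window of 2B-1 cells suffices because the head writes before it moves, so within B steps it can only alter cells at distance at most B-1 from its starting position.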
The network is then in a position to begin the next phase. T(n) steps of M
are simulated by repeating this for $\lceil T(n)/B(n) \rceil$ phases. The pre-computation and
every phase each take a constant number of steps, assuming that each
processor has the extended arithmetic instruction-set. With care, the restricted
arithmetic instruction-set can be substituted, by using $2^{O(B(n)^2)}$ processors per
team in the pre-computation, and $2^{O(B(n))}$ processors per team in each phase.
The entire simulation takes $O(T(n)/B(n))$ steps, using $2^{O(B(n)^2)} + T(n)$ processors. □
Note that processor 0 is to be given the value of B(n) before the start of a
simulation on an input of size n. This result implies that any computable
function can be computed in constant time if sufficiently many processors are
present. By taking B(n) = T(n) we can extend the simulation to nondeterministic
Turing machines. Thus, for example, every function in NP can be computed in
constant time by a network of $2^{n^{O(1)}}$ processors. This is an improvement over the
result of Savitch [60], who obtains time O(log n) on a network of $n^{O(1)}$
nondeterministic processors. But what about simulation by a "reasonable"
machine? Suppose we require that the parallel computation thesis holds. It is
sufficient in this case to bound the word-size to be a polynomial in the parallel
running time. This means we can choose B(n) to be $T(n)^{1-\epsilon}$ for $0 < \epsilon < 1$. Thus a
T(n) time-bounded deterministic Turing machine can be simulated in time $T(n)^{\epsilon}$
by a "reasonable" network, for $0 < \epsilon < 1$. Thus an arbitrary polynomial speedup is
possible. This is an extremely strong result, since, as we observed in section 3.3,
there are natural problems in P which probably have no exponential speedup on
a parallel machine which obeys the parallel computation thesis.
Dymond [18] has achieved a superior result in the case where word-size is
to be linear (instead of just polynomial) in the parallel running-time. He obtains
parallel time $O(\sqrt{T(n)})$ on $2^{O(\sqrt{T(n)})}$ processors, compared to our $O(T(n)^{2/3})$ on
$2^{O(T(n)^{2/3})}$ processors. We can duplicate his result by doing the pre-computation
sequentially, in time O(B(n)) using $2^{O(B(n))}$ teams, each of one processor. This
gives time $O(T(n)/B(n) + B(n))$ on $2^{O(B(n))}$ processors. By using standard techniques it
is possible (Blum [6]) to simulate a T(n) time-bounded deterministic Turing
machine in time O(log T(n)) using $2^{O(T(n))}$ processors, without the use of multiple
writes. We can achieve the same result by choosing B(n) = I M , and doing the
pre-computation recursively. Blum thinks that parallel machines with this many 
processors are "reasonable", and attacks the parallel computation thesis on this 
basis.
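The exponents quoted in this discussion follow from the time bound O(T(n)/B(n)) and word-size bound O(B(n)^2 + log T(n)) of Theorem 3.5.1 by a short calculation. The derivation below is a sketch that ignores constant factors; the labels on each line are wording introduced here.

```latex
\begin{align*}
\text{time} &= O\!\left(\frac{T(n)}{B(n)}\right), \qquad
\text{word-size} = O\!\left(B(n)^2 + \log T(n)\right). \\[4pt]
\text{Polynomial word-size:}\quad
  & B(n) = T(n)^{1-\epsilon}
    \;\Rightarrow\; \text{time } O\!\left(T(n)^{\epsilon}\right),\
    \text{word-size } O\!\left(T(n)^{2(1-\epsilon)}\right). \\
\text{Linear word-size:}\quad
  & B(n)^2 = \frac{T(n)}{B(n)}
    \;\Rightarrow\; B(n) = T(n)^{1/3},\
    \text{time } O\!\left(T(n)^{2/3}\right). \\
\text{Sequential pre-computation:}\quad
  & \frac{T(n)}{B(n)} + B(n) \text{ is minimised at } B(n) = \sqrt{T(n)},\
    \text{giving time } O\!\left(\sqrt{T(n)}\right).
\end{align*}
```

In the first case the word-size is a fixed power of the running time for any fixed $\epsilon$, which is the polynomial bound asked for by the parallel computation thesis.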
Chapter 4
Programming Techniques for Feasible Networks
In chapter 3 we suggested the possibility of finding a feasible network which 
is universal for the general network model. Before we actually tackle this 
problem, it is instructive to investigate the methods at our disposal. This
chapter consists of four sections. The first section deals with possible 
interconnection patterns, concentrating on the shuffle-exchange of Stone [66] 
and the cube-connected-cycles of Preparata and Vuillemin [55]. The latter 
paper also provides us with a useful programming tool - a large class of fast 
algorithms on the multi-dimensional cube (called composite algorithms) which
can be simulated without loss of resources on either the cube-connected-cycles 
or shuffle-exchange. This will allow us to express the program of our universal 
machine in a high-level form which is to a certain extent independent of 
interconnection pattern.
The second section deals with recurrent interconnection patterns, that is, 
interconnection patterns $G = (G_0, G_1, \ldots)$ such that for all $k \ge j \ge 0$, $G_k$ is made up of
a collection of disjoint subgraphs, each of which is isomorphic to $G_j$. We present
a recurrent interconnection pattern called the cube-connected-lines, which is 
equal to the cube-connected-cycles in its ability to simulate composite 
algorithms. It is shown that a recurrent interconnection pattern with twice as 
many subgraphs as the cube-connected-lines cannot share this property.
The third section contains some composite sub-algorithms which we will 
later find useful for constructing universal machines. The fourth and final 
section presents some theorems which allow a reduction in the number of 
processors in machines with the shuffle-exchange, cube-connected-cycles or 
cube-connected-lines interconnection patterns, at a cost in time. A reduction in 
processors from P(n) to P'(n) results in a delay of O(P(n)/P'(n)). Thus constant 
multiples in processor bounds can be ignored without asymptotic time-loss, a 
fact that we will use often in later chapters. A preliminary version of the 
material contained in the last section has appeared in [51].
4.1. Interconnection Patterns and Programming Tools
As suggested in section 3.4, our aim is to construct a feasible network which 
can efficiently simulate any general network. There are a number of 
interconnection patterns available in the literature which we might use for this 
universal machine. These appear to be roughly equal in computing power. 
Rather than tie ourselves to one particular interconnection pattern, it would be 
more instructive to express our program in a language which can be 
implemented efficiently on several interconnection patterns.
Fortunately, the literature already provides us with some tools. Preparata 
and Vuillemin [55] consider various algorithms which use a multi-dimensional 
cube as the interconnection pattern. Although this has non-constant degree, 
they find that a large class of useful algorithms have strong properties which 
allow them to be simulated without asymptotic time-loss on a feasibly-buildable 
machine which they call the cube-connected-cycles.
First, let us introduce some useful notation. Suppose v and l are non-negative 
integers. If l >= 1, let v_l denote the l-th least-significant bit in the binary 
representation of v, that is, v_l = floor(v / 2^(l-1)) mod 2. Where convenient, 
we may confuse the integer v and a binary representation 
v_k v_(k-1) ... v_1 (where k >= floor(log v) + 1) of v. Also let v^(l) denote the 
integer which differs from v precisely in the l-th (least-significant) bit, that 
is, v^(l) = v + (-1)^(v_l).2^(l-1). If v is in {0,1}, let v̄ denote v^(1), the 
complement of v.
Suppose k is a non-negative integer. The k-cube C_k has vertex-set 
{ v | 0 <= v < 2^k }, and each vertex v is joined to vertices v^(l) for 
1 <= l <= k. C_k has 2^k vertices and degree k; it is this high degree which 
makes it unsuitable as a realistic interconnection pattern. However, it has 
played an important part in motivating the degree-3 interconnection patterns 
which we shall meet below. Figure 4.1.1 shows the four-dimensional cube 
(commonly called the hyper-cube) C_4, which has 16 vertices and degree 4.
Figure 4.1.1 The hyper-cube, C_4.
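The bit notation and the k-cube can be illustrated by a minimal Python sketch. The function names bit, flip and cube_neighbours are hypothetical and chosen only for this illustration; bits are numbered from 1 at the least-significant end, as in the text.

```python
def bit(v, l):
    """Return v_l, the l-th least-significant bit of v (l >= 1)."""
    return (v >> (l - 1)) & 1


def flip(v, l):
    """Return v^(l), the integer differing from v exactly in bit l."""
    return v ^ (1 << (l - 1))


def cube_neighbours(v, k):
    """Neighbours of vertex v in the k-cube C_k: one per dimension 1..k."""
    return [flip(v, l) for l in range(1, k + 1)]


if __name__ == "__main__":
    # Vertex 5 = 0101 in C_4 has one neighbour in each of the 4 dimensions.
    print(cube_neighbours(5, 4))
```

The closed form v^(l) = v + (-1)^(v_l).2^(l-1) agrees with the exclusive-or used in flip, since setting a clear bit adds 2^(l-1) and clearing a set bit subtracts it.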
Consider a network based on the k-cube, with a constant number of 
registers per processor. The link between v and v^(l) is said to be in dimension l. 
Suppose k' <= k. An algorithm is termed simple-ascend (after [55]) if all data 
transfers occur synchronously along dimension 1, then dimensions 2,3,...,k' in 
monotone increasing order. Similarly it is called simple-descend if the data 
transfers occur in the opposite order, from k' down to 1. An algorithm is called 
simple if it is either simple-ascend or simple-descend, and composite if it is 
either simple or made up from local instructions and modules which are 
themselves composite. We learn from [55] that there are fast composite 
algorithms for a rich selection of data routing problems (such as permutations, 
merging and sorting).
The shuffle-exchange SE_k of Stone [68] has vertex-set { v | 0 <= v < 2^k }, and 
each vertex v is joined to vertices v^(1), (2v) mod 2^k + v_k, and 
floor(v/2) + v_1.2^(k-1). Relative to processor v, these three edges are called 
exchange, shuffle and unshuffle links respectively. SE_k has 2^k vertices and 
degree 3. Figure 4.1.2 shows the 8 vertex shuffle-exchange SE_3. As an 
interconnection pattern, SE = (S_0, S_1, ...) where for i >= 0, S_i = SE_ceil(log i).
Figure 4.1.2 The 8 vertex shuffle-exchange, SE_3.
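The three links of the shuffle-exchange can be written down directly from the definition. The following is a small sketch; the function names are hypothetical and are not taken from [68].

```python
def exchange(v, k):
    """Exchange link: v^(1), i.e. complement the least-significant bit."""
    return v ^ 1


def shuffle(v, k):
    """Shuffle link: (2v) mod 2^k + v_k, a left rotation of the k-bit word."""
    return (2 * v) % (1 << k) + ((v >> (k - 1)) & 1)


def unshuffle(v, k):
    """Unshuffle link: floor(v/2) + v_1 * 2^(k-1), a right rotation."""
    return (v >> 1) + (v & 1) * (1 << (k - 1))


if __name__ == "__main__":
    k = 3
    for v in range(1 << k):
        print(v, exchange(v, k), shuffle(v, k), unshuffle(v, k))
```

Viewed on k-bit words, shuffle and unshuffle are cyclic rotations in opposite directions, which is why an unshuffle link out of one processor is a shuffle link into the processor at its other end.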
From this point onwards, to help avoid possible confusion, we will often call 
the processors of the simulated machine processes in order to distinguish them 
from the processors of the simulating machine. This is consistent with the view 
that the simulated machine is presented to the simulator as a program, not as a 
physical collection of processors and wires.
Theorem 4.1.1 A shuffle-exchange with 2^k processors can simulate a 2^k process 
composite algorithm with constant delay.
Proof. (Outline) Suppose k' <= k. Without loss of generality we will prove the 
result for algorithms whose data transfers occur synchronously along 
dimensions 1,2,...,k'-1,k',k'-1,...,2,1 in turn. Both simple-ascend and 
simple-descend class algorithms fit into this category with constant delay, the 
former by taking the last k' data transfers to be null, and the latter the first k'. 
Applying the same technique to each simple module shows that the result also 
holds for composite algorithms.
Each processor will be assigned the task of simulating one process. Since 
each process has only a constant number of registers, it is possible to have a 
simulation in which the processor assignments are flexible. To move a process 
from one processor to another, we need only transfer the contents of its 
registers. If this transfer is to take place between neighbouring processors, the 
entire process can be moved in constant time. We start off with processor i of 
the shuffle-exchange simulating process i, 0 <= i < 2^k. Most importantly where 
composite algorithms are concerned, we also end up in this configuration. 
Initially, we can manage the data transfers along dimension 1, since for 
0 <= i < 2^k, processor i is connected to processor i^(1) via an exchange link.
Next we simply move the entire process from processor i to processor 
i_1 i_k i_(k-1) ... i_2 via the unshuffle link out of processor i (which the 
processor at the other end views as a shuffle link). After this has been done in 
parallel for all i, 0 <= i < 2^k, we see that process i^(2) is then resident in 
processor (i_1 i_k i_(k-1) ... i_2)^(1), which is adjacent to processor 
i_1 i_k i_(k-1) ... i_2 via the exchange link. Thus the necessary transfers 
between processes i and i^(2) can take place over the exchange links. After a 
second unshuffle of processes, data transfers in dimension 3 can take place over 
the exchange links. This continues up to dimension k', and then is reversed back 
down to dimension 1. □
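The residency argument in the proof can be checked mechanically for small k. The sketch below covers only the ascending half with k' = k, and it assumes that one unshuffle of all processes follows each dimension; the function name check_ascend is hypothetical. It tracks which processor hosts each process and confirms that the two processes needing to communicate in the current dimension always sit at opposite ends of an exchange link.

```python
def unshuffle(v, k):
    """Unshuffle link of SE_k: rotate the k-bit word right by one place."""
    return (v >> 1) + (v & 1) * (1 << (k - 1))


def check_ascend(k):
    """Return True if dimensions 1..k can all be served by exchange links."""
    n = 1 << k
    host = list(range(n))  # host[q] = processor currently holding process q
    for dim in range(1, k + 1):
        for q in range(n):
            partner = q ^ (1 << (dim - 1))  # process q^(dim)
            # The two hosts must differ exactly in bit 1 (exchange link).
            if host[q] ^ host[partner] != 1:
                return False
        # Move every process along the unshuffle link of its host.
        host = [unshuffle(h, k) for h in host]
    # After k unshuffles every process is back on its original processor.
    return host == list(range(n))


if __name__ == "__main__":
    print(all(check_ascend(k) for k in range(1, 8)))
```

The descending half of the proof is the mirror image, using shuffle links in place of unshuffle links.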
The cube-connected-cycles CCC_k of Preparata and Vuillemin [55] is defined 
as follows. Let r be such that 2^(r-1) + r - 1 < k <= 2^r + r. CCC_k has vertex-set 
{ (v,p) | 0 <= v < 2^(k-r), 0 <= p < 2^r }, and each vertex (v,p) is joined to 
vertices:
(i) (v^(p+1), p), provided 0 <= p < k-r,
(ii) (v, (p+1) mod 2^r), and 
(iii) (v, (p-1) mod 2^r).
The first link is called a cube edge, the remaining two cycle edges. Relative to 
processor (v,p), the first cycle-edge is called upcycle, the second downcycle. 
CCC_k has 2^k vertices and degree 3. Figure 4.1.3 shows the 16 vertex 
cube-connected-cycles, CCC_4. As an interconnection pattern, 
CCC = (G_0, G_1, ...) where for i >= 0, G_i = CCC_ceil(log i).
Figure 4.1.3 The 16 vertex cube-connected-cycles, CCC_4.
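The definition can be turned into a small graph constructor, which is a convenient way to check the vertex count and degree claimed above. This is only a sketch; the names ccc_parameter_r and ccc_edges are hypothetical, and edges are stored as unordered pairs so that coincident cycle edges (when the cycle length is 1 or 2) are not counted twice.

```python
def ccc_parameter_r(k):
    """Return the r with 2^(r-1) + r - 1 < k <= 2^r + r."""
    r = 0
    while not (2.0 ** (r - 1) + r - 1 < k <= 2 ** r + r):
        r += 1
    return r


def ccc_edges(k):
    """Return the edge set of CCC_k as a set of two-element frozensets."""
    r = ccc_parameter_r(k)
    cycle = 1 << r
    edges = set()
    for v in range(1 << (k - r)):
        for p in range(cycle):
            if p < k - r:
                # Cube edge to (v^(p+1), p): flip bit p+1 of v.
                edges.add(frozenset({(v, p), (v ^ (1 << p), p)}))
            for q in ((p + 1) % cycle, (p - 1) % cycle):
                if q != p:  # skip self-loops when the cycle has length 1
                    edges.add(frozenset({(v, p), (v, q)}))
    return edges


if __name__ == "__main__":
    edges = ccc_edges(4)
    degree = {}
    for e in edges:
        for u in e:
            degree[u] = degree.get(u, 0) + 1
    print(len(degree), max(degree.values()))
```

For k = 4 the parameter r is 2, giving four cycles of length four, with cube edges attached only at positions p = 0 and p = 1 of each cycle.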
Theorem 4.1.2 A cube-connected-cycles with 2^k processors can simulate a 2^k 
process composite algorithm with constant delay.
Proof. Preparata and Vuillemin [55] prove this result for k' (the upper 
dimension in the simple modules) equal to k. A straightforward modification to 
the pipelining argument and to LOOPOPER gives us the desired result. □
By application of these theorems we have:
Theorem 4.1.3 A feasible network with at most 2^ceil(log n) processors can 
permute n items according to some fixed permutation in time O(log n) (provided 
some pre-computation is allowed).
Proof. The algorithm is just a simulation of the permutation network of 
Waksman [75]. See Schwartz [62] or Preparata and Vuillemin [55]. □
Theorem 4.1.4 A feasible network with at most 2^ceil(log n) processors can 
perform the pre-computation mentioned in theorem 4.1.3 in time O(log^4 n).
Proof. See, for example, Nassimi and Sahni [46], Schwartz [62], Opferman and 
Tsao-Wu [48] or Lev, Pippenger and Valiant [42]. □
Theorem 4.1.5 A feasible network with at most 2^ceil(log n) processors can sort 
n items in time O(log^2 n).
Proof. The algorithm is a composite realization of the odd-even or bitonic sorts 
of Batcher [4]. See, for example, Schwartz [62] or Preparata and Vuillemin [55]. 
□
4.2. Recurrent Interconnection Patterns
An interconnection pattern G = (G_0, G_1, ...) with P(n) processors is said to be 
recurrent if for all n,m with 0 <= m <= n, G_n has Ω(P(n)/P(m)) disjoint 
subgraphs which are isomorphic to G_m. The simplest form of recurrence one 
might choose is to have G_n constructed from precisely P(n)/P(m) such 
subgraphs. Unfortunately this type of recurrent interconnection pattern is much 
less powerful than the shuffle-exchange or cube-connected cycles met in 
section 4.1. Later in this
section we will meet a recurrent interconnection pattern where G_n is made up of 
at least P(n)/(2.P(m)) copies of G_m, which is equal to the cube-connected-cycles 
in its ability to simulate composite algorithms.
Suppose c is a fixed positive integer (independent of n). More precisely, a 
recursive interconnection pattern is one in which G_n (n >= c) is made up of 
exactly c disjoint copies of G_floor(n/c) (with fixed graphs for n < c), joined by 
extra edges from some graph G'_n.
Theorem 4.2.1 No constant degree recursive parallel machine with O(n) 
processors can permute n items in O(log n) steps.
Proof. For a contradiction, suppose G = (G_0, G_1, ...) is a recursive 
interconnection pattern with degree d and O(n) processors which can be used to 
permute n items in time O(log n). The following technique is due to Meertens [43].
Suppose n = c^k for some k >= 0. Let
P_k denote the number of vertices in G_n, 
E_k denote the number of edges in G_n, 
E'_k denote the number of edges in G'_n.
Note that E_k/P_k <= d/2. (Let S_k be the sum over all vertices v in G_n of the 
number of edges incident with v. Clearly S_k <= d.P_k. But every edge is counted 
twice, so S_k = 2.E_k). Also, P_k = O(c^k).
We claim that for 0 <= i < k,
    E'_(k-i) = Ω(c^(k-i-1) / (k-i))        (*)
Consider one of the subgraphs of G_(c^(k-i)) isomorphic to G_(c^(k-i-1)). Pick a 
permutation which takes a data item from each (input bearing) vertex of the 
subgraph to a vertex of G_(c^(k-i)) outside that subgraph. These data items must 
pass along the edges of G'_(c^(k-i)), since these are the only edges linking the 
subgraph with the rest of G_(c^(k-i)). Thus in one step, at most E'_(k-i) items 
can be moved. By hypothesis we can move all the items in O(k-i) steps. There are 
Ω(P_(k-i-1)) = Ω(c^(k-i-1)) items to be moved. Hence 
c^(k-i-1) = O(E'_(k-i).(k-i)) as required.
Now
    E_k = E'_k + c.E_(k-1)
        = sum_{i=0}^{k-1} c^i.E'_(k-i) + c^k.E_0      (unfolding the recurrence)
        = Ω( sum_{i=0}^{k-1} c^i.c^(k-i-1) / (k-i) )    by (*)
        = Ω( c^k . sum_{i=0}^{k-1} 1/(k-i) )            since c is constant
        = Ω( c^k . sum_{i=1}^{k} 1/i )                  by re-indexing.
Thus E_k/P_k = Ω( sum_{i=1}^{k} 1/i ), which diverges as k → ∞. But this 
contradicts the fact that E_k/P_k is at most a constant independent of k. Thus no 
such parallel machine can exist. □
This is in contrast to the corresponding result (theorem 4.1.3) for the cube- 
connected-cycles and shuffle-exchange.
The following is a recurrent interconnection pattern which is as powerful as 
the cube-connected-cycles, at least in its ability to simulate composite 
algorithms. The cube-connected-lines, CCL_k, is simply a copy of CCC_k with the 
edges from vertices (v,0) to (v,2^r - 1), 0 <= v < 2^(k-r), deleted. That is, the 
cycles of the cube-connected-cycles are broken, and thus become lines. Figure 
4.2.1 shows some cube-connected-lines graphs with 2, 4, 8 and 16 vertices. CCL_k 
has 2^k vertices and degree 3.
Figure 4.2.1 The 2, 4, 8 and 16 vertex cube-connected-lines graphs, CCL_1 through 
CCL_4.
Theorem 4.2.2 For 0 <= j <= k, CCL_k has at least 2^(k-j-1) disjoint subgraphs 
which are isomorphic to CCL_j.
Proof. Let k >= j >= 0, and r be such that 2^(r-1) + r - 1 < j <= 2^r + r. For 
r >= 0 we call CCL_(2^r + r) a full cube-connected-lines graph.
First suppose that j = k-1, that is, we wish to break the 2^k vertex CCL_k into 
half-sized (2^(k-1) vertex) CCL_(k-1)'s. There are two cases to consider, 
according to whether or not CCL_j is full.
(1) k-1 = 2^r + r. CCL_(k-1) has vertices (v,p) with 0 <= v < 2^(k-r-1), 
0 <= p < 2^r. Vertex (v,p) is joined to vertices:
(i) (v^(p+1),p), 0 <= v < 2^(k-r-1), 0 <= p < 2^r,
(ii) (v,p+1), 0 <= v < 2^(k-r-1), 0 <= p < 2^r - 1,
(iii) (v,p-1), 0 <= v < 2^(k-r-1), 0 < p < 2^r.
CCL_k has vertices (v,p) with 0 <= v < 2^(k-r-1), 0 <= p < 2^(r+1). Vertex (v,p) 
is joined to vertices:
(i) (v^(p+1),p), 0 <= v < 2^(k-r-1), 0 <= p < 2^r,
(ii) (v,p+1), 0 <= v < 2^(k-r-1), 0 <= p < 2^(r+1) - 1,
(iii) (v,p-1), 0 <= v < 2^(k-r-1), 0 < p < 2^(r+1).
Thus CCL_k looks exactly like CCL_(k-1) with lines extended to double the length 
using vertices without cube links. So CCL_k has only one subgraph which is 
isomorphic to CCL_(k-1) (see figure 4.2.1 for the case when k = 2 and k = 4).
(2) k-1 < 2^r + r. CCL_(k-1) has vertices (v,p) with 0 <= v < 2^(k-r-1), 
0 <= p < 2^r. Vertex (v,p) is joined to vertices:
(i) (v^(p+1),p), 0 <= v < 2^(k-r-1), 0 <= p < k-r-1,
(ii) (v,p+1), 0 <= v < 2^(k-r-1), 0 <= p < 2^r - 1,
(iii) (v,p-1), 0 <= v < 2^(k-r-1), 0 < p < 2^r.
CCL_k has vertices (v,p) with 0 <= v < 2^(k-r), 0 <= p < 2^r. Vertex (v,p) is 
joined to vertices:
(i) (v^(p+1),p), 0 <= v < 2^(k-r), 0 <= p < k-r,
(ii) (v,p+1), 0 <= v < 2^(k-r), 0 <= p < 2^r - 1,
(iii) (v,p-1), 0 <= v < 2^(k-r), 0 < p < 2^r.
Deleting the cube-edges from (v,p) to (v^(p+1),p) with p = k-r-1 from CCL_k 
gives two disjoint graphs which are isomorphic to CCL_(k-1) (see figure 4.2.1 
for the case when k = 3).
Thus for 2^(r-1) + r - 1 < j <= k <= 2^r + r we can break CCL_k into 2^(k-j) 
subgraphs isomorphic to CCL_j, by iterating the procedure in (2) above. By then 
applying (1), we see that CCL_k has 2^(k-j-1) subgraphs isomorphic to CCL_j when 
j = 2^(r-1) + r - 1. It remains to show what happens when j and k are separated 
by more than one full CCL.
Now suppose k = 2^r + r and j = 2^r' + r' for some r >= r' >= 0. CCL_j has 
vertices (v,p), 0 <= v < 2^(2^r'), 0 <= p < 2^r'. Vertex (v,p) is joined to vertex:
(i) (v^(p+1),p), 0 <= v < 2^(2^r'), 0 <= p < 2^r',
(ii) (v,p+1), 0 <= v < 2^(2^r'), 0 <= p < 2^r' - 1,
(iii) (v,p-1), 0 <= v < 2^(2^r'), 0 < p < 2^r'.
CCL_k has vertices (v,p), 0 <= v < 2^(2^r), 0 <= p < 2^r. Vertex (v,p) is joined 
to vertex:
(i) (v^(p+1),p), 0 <= v < 2^(2^r), 0 <= p < 2^r,
(ii) (v,p+1), 0 <= v < 2^(2^r), 0 <= p < 2^r - 1,
(iii) (v,p-1), 0 <= v < 2^(2^r), 0 < p < 2^r.
Deleting the line-edges between vertices (v, i.2^r' - 1) and (v, i.2^r') for 
0 <= v < 2^(2^r), 0 < i < 2^(r-r') serves to break CCL_k into 2^(k-j) graphs 
isomorphic to CCL_j. Thus a full CCL_k has 2^(k-j) disjoint subgraphs isomorphic 
to a full CCL_j.
Finally, we now have the tools to prove the result for general j and k.
(a) First reduce CCL_k into subgraphs isomorphic to the next smaller full CCL, 
using (1) and (2) as mentioned above. If CCL_j is encountered along the way, 
then this is sufficient. If j and k are separated by precisely one full CCL, 
further iterations of (2) are sufficient.
(b) Next, reduce the full CCL immediately below CCL_k into subgraphs 
isomorphic to the CCL immediately above CCL_j. The latter can be reduced 
to CCL_j by subsequent iterations of (2).
In this entire process we only once have to reduce a non-full CCL to 
subgraphs which are isomorphic to full ones. Thus CCL_k consists of 2^(k-j-1) 
subgraphs isomorphic to CCL_j. □
Note that any attempt to increase the number of subgraphs from 2^(k-j-1) to 
2^(k-j) is doomed to failure. For if CCL_k had 2^(k-j) subgraphs isomorphic to 
CCL_j, it would then be recursive. Thus by theorem 4.2.1 it would be much weaker 
than the cube-connected-cycles for computing arbitrary permutations. However we 
have:
Theorem 4.2.3 A cube-connected-lines with 2^k processors can simulate a 2^k 
process composite algorithm with constant delay.
Proof. Similar to theorem 4.1.2. □
Reif and Valiant [57] have independently discovered a graph which is almost 
identical to the cube-connected-lines. A degree-4 graph with similar properties 
was earlier devised by Meyer auf der Heide [31,32].
4.3. Some Useful Algorithms
Having developed the idea of a composite algorithm in section 4.1, we are 
now ready to describe some simple sub-algorithms which we will find useful in 
the next three chapters. The algorithms are given for the k-cube 
interconnection pattern, but can be simulated without asymptotic loss of 
resources on either the shuffle-exchange, cube-connected-cycles or cube- 
connected-lines interconnection patterns, as described in sections 4.1 and 4.2. 
It is important to note that the algorithms are SIMD in nature; synchronization 
is maintained by the fact that (as we earlier insisted in section 2.1) the code 
generated for each branch of a selection statement (such as if-then-else, even if 
the "else" branch is null) has the same number of instructions.
Algorithm 1. Broadcast.
Suppose processor 0 has a value v which it wishes to broadcast to all 2^k 
processors of a k-cube. This can be achieved in time O(k) by the following 
simple-ascend algorithm, which terminates with variable V of every processor 
equal to v.
    V := if PID = 0 then v else 0
    for b := 1 to k do
        if PID_b = 1 then V := (V of processor PID^(b))
If 0 <= i < 2^k, define the b-block (after [45,47]) of processor i to be the set 
of 2^b processors { floor(i/2^b).2^b + j | 0 <= j < 2^b }. It is easy to prove by 
induction that after the b-th iteration of the for-loop, variable V of all 
processors in the b-block of processor 0 is equal to v, for b = 0,1,...,k. (By 
the 0th iteration, we mean the point immediately before the loop is entered for 
the first time). Table 4.3.1 shows a trace of the algorithm for k = 4. The 
concept of a b-block will play an important part in the next two algorithms.
Table 4.3.1 Trace of algorithm 1 on 16 processors. Table entry shows the contents 
of variable V of each processor after b iterations of the for-loop.
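Algorithm 1 can be simulated sequentially to watch the b-block of processor 0 grow. The following is a minimal sketch under the assumption that all processors read their neighbour's old value of V before any processor writes, which is how the SIMD step is modelled here; the function name broadcast is hypothetical.

```python
def broadcast(k, v):
    """Simulate algorithm 1 on a k-cube and return V for every processor."""
    n = 1 << k
    V = [v if pid == 0 else 0 for pid in range(n)]
    for b in range(1, k + 1):
        mask = 1 << (b - 1)
        old = V[:]  # snapshot: every read sees the state before this step
        for pid in range(n):
            if pid & mask:  # PID_b = 1
                V[pid] = old[pid ^ mask]  # V of processor PID^(b)
    return V


if __name__ == "__main__":
    print(broadcast(4, 7))
```

After iteration b the processors 0 to 2^b - 1 hold v, which is the induction claim made in the text.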
Algorithm 2. Local Rank.
Suppose every processor of a k-cube holds an integer value in some variable 
V. The local rank of processor i, 0 <= i < 2^k, is defined to be the number of 
processors j, 0 <= j < i, such that for all processors p with j <= p <= i, V of 
processor p equals V of processor i. The following is a simple-ascend class 
algorithm which sets variable R of each processor to its local rank, and runs in 
time O(k).
    VT := V
    R := RT := 0
    for b := 1 to k do
        if (PID_b = 1) and (VT of processor PID^(b)) = V 
            then R := R + (RT of processor PID^(b)) + 1 
        if (VT of processor PID^(b)) = VT 
            then RT := RT + (RT of processor PID^(b)) + 1 
            else if PID_b = 0 then (VT,RT) := (VT,RT) of processor PID^(b)
At the end of the b-th iteration of the for-loop, 0 <= b <= k, variable R of 
processor i, 0 <= i < 2^k, holds that processor's local rank within its b-block. 
At the same time, variables VT and RT contain the values V and R (respectively) 
of the topmost processor in its b-block (i.e. processor 
floor(i/2^b).2^b + 2^b - 1). The correctness of the algorithm follows by 
induction. Table 4.3.2 shows a trace for k = 3, with V of processor i initially 
equal to 0,0,0,6,6,6,6,1 for i = 0,1,...,7 respectively.
Processor   0        1        2        3        4        5        6        7
            V=0      V=0      V=0      V=6      V=6      V=6      V=6      V=1
b           R VT RT  R VT RT  R VT RT  R VT RT  R VT RT  R VT RT  R VT RT  R VT RT
0           0 0  0   0 0  0   0 0  0   0 6  0   0 6  0   0 6  0   0 6  0   0 1  0
1           0 0  1   1 0  1   0 6  0   0 6  0   0 6  1   1 6  1   0 1  0   0 1  0
2           0 6  0   1 6  0   2 6  0   0 6  0   0 1  0   1 1  0   2 1  0   0 1  0
3           0 1  0   1 1  0   2 1  0   0 1  0   1 1  0   2 1  0   3 1  0   0 1  0
Table 4.3.2 Trace of algorithm 2 on 8 processors. Table entry shows the contents 
of variables R, VT and RT of each processor after b iterations of the for-loop.
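The trace can be reproduced by a sequential simulation of algorithm 2. The sketch below assumes that every processor reads its neighbour's VT and RT as they stood at the start of the iteration; the names local_rank_simd and local_rank_reference are hypothetical. The reference function computes local rank straight from the definition, and the two have been compared here only on the example input of the trace, in which equal values occupy consecutive processors.

```python
def local_rank_simd(V, k):
    """Simulate algorithm 2 on a k-cube; return R for every processor."""
    n = 1 << k
    VT, R, RT = list(V), [0] * n, [0] * n
    for b in range(1, k + 1):
        mask = 1 << (b - 1)
        old_VT, old_RT = VT[:], RT[:]  # state at the start of the iteration
        for pid in range(n):
            nb = pid ^ mask  # processor PID^(b)
            upper = bool(pid & mask)  # PID_b = 1
            if upper and old_VT[nb] == V[pid]:
                R[pid] += old_RT[nb] + 1
            if old_VT[nb] == old_VT[pid]:
                RT[pid] = old_RT[pid] + old_RT[nb] + 1
            elif not upper:
                VT[pid], RT[pid] = old_VT[nb], old_RT[nb]
    return R


def local_rank_reference(V):
    """Local rank from the definition: length of the equal run below i."""
    ranks = []
    for i in range(len(V)):
        j = i
        while j > 0 and V[j - 1] == V[i]:
            j -= 1
        ranks.append(i - j)
    return ranks


if __name__ == "__main__":
    V = [0, 0, 0, 6, 6, 6, 6, 1]
    print(local_rank_simd(V, 3), local_rank_reference(V))
```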
Algorithm 3. Fan-out.
Suppose every processor of a k-cube holds two integer values x and y (which 
may be different for each processor). Our aim is to produce a value fanout(i) for 
each processor i, where for 0 <= i < 2^k, fanout(i) is defined to be the y-value 
of the smallest numbered processor j such that for all p with j <= p <= i, x of 
processor p is equal to x of processor i. The following is a simple-ascend class 
algorithm which sets variable Y of processor i to fanout(i), 0 <= i < 2^k, in 
time O(k).
    Y := y
    XT,YT := x,y
    for b := 1 to k do
        if (PID_b = 1) and (XT of processor PID^(b) = x) 
            then Y := YT of processor PID^(b) 
        if (XT ≠ XT of processor PID^(b)) ⊕ (PID_b = 1) 
            then (XT,YT) := (XT,YT) of processor PID^(b)
At the end of the b-th iteration of the for-loop, 0 <= b <= k, variable Y of 
processor i contains the value of fanout(i) restricted to its b-block. At the 
same time, variables XT and YT contain the values of x and Y (respectively) of 
the highest-numbered processor in its b-block (i.e. processor 
floor(i/2^b).2^b + 2^b - 1). The correctness of the algorithm follows by 
induction. Table 4.3.3 shows a trace for k = 3 with (x,y) of processor i equal to 
(0,99), (0,89), (0,69), (6,95), (6,19), (6,28), (6,56), (1,44) for i = 0,1,...,7 
respectively.
Processor   0         1         2         3         4         5         6         7
            x=0       x=0       x=0       x=6       x=6       x=6       x=6       x=1
            y=99      y=89      y=69      y=95      y=19      y=28      y=56      y=44
b           Y  XT YT  Y  XT YT  Y  XT YT  Y  XT YT  Y  XT YT  Y  XT YT  Y  XT YT  Y  XT YT
0           99 0  99  89 0  89  69 0  69  95 6  95  19 6  19  28 6  28  56 6  56  44 1  44
1           99 0  99  99 0  99  69 6  95  95 6  95  19 6  19  19 6  19  56 1  44  44 1  44
2           99 6  95  99 6  95  99 6  95  95 6  95  19 1  44  19 1  44  19 1  44  44 1  44
3           99 1  44  99 1  44  99 1  44  95 1  44  95 1  44  95 1  44  95 1  44  44 1  44
Table 4.3.3 Trace of algorithm 3 on 8 processors. Table entry shows the contents 
of variables Y, XT and YT of each processor after b iterations of the for-loop.
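Algorithm 3 can be simulated in the same style as the earlier algorithms. In this sketch the second test of the loop body is read as an exclusive-or of the two conditions, which is the reading consistent with the legible rows of the trace, and all reads are assumed to see the state at the start of the iteration. The name fanout_simd is hypothetical, and the result has been checked only on the example input of the trace.

```python
def fanout_simd(x, y, k):
    """Simulate algorithm 3 on a k-cube; return Y for every processor."""
    n = 1 << k
    Y, XT, YT = list(y), list(x), list(y)
    for b in range(1, k + 1):
        mask = 1 << (b - 1)
        old_XT, old_YT = XT[:], YT[:]  # state at the start of the iteration
        for pid in range(n):
            nb = pid ^ mask  # processor PID^(b)
            upper = bool(pid & mask)  # PID_b = 1
            if upper and old_XT[nb] == x[pid]:
                Y[pid] = old_YT[nb]
            # Exclusive-or: the lower half copies when the tops differ,
            # the upper half copies when the tops agree.
            if (old_XT[pid] != old_XT[nb]) != upper:
                XT[pid], YT[pid] = old_XT[nb], old_YT[nb]
    return Y


if __name__ == "__main__":
    x = [0, 0, 0, 6, 6, 6, 6, 1]
    y = [99, 89, 69, 95, 19, 28, 56, 44]
    print(fanout_simd(x, y, 3))
```

The output does not depend on the y-values of processors that are not first in their run, so it is unaffected by any uncertainty in those entries.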
Algorithm 4. Scatter.
For the moment we briefly step away from the main theme of this chapter, 
and allow the processors of our machines to have more than a constant number 
of registers each. In particular, we want each of the 2^k processors of a k-cube 
to have an array of 2^k elements. Suppose processor 0 has 2^k items of data in 
its array, and wishes to scatter these amongst processors 0,1,...,2^k - 1 in such 
a manner that each processor receives precisely one value. The algorithm 
consists of k stages. At the end of the i-th stage, the 2^i processors p, 
0 <= p < 2^i, are each in possession of 2^(k-i) data items. Stage i consists of 
processor p, 0 <= p < 2^(i-1), sending 2^(k-i) of its data items to processor 
p^(i). In the following implementation, processor 0 starts off with 2^k items of 
data in an array d[1..2^k]. Each processor receives its value into variable d[1].
    for i := 1 to k do 
        for j := 1 to 2^(k-i) do 
            if PID_i = 1 
                then d[j] := (d[j+2^(k-i)] of processor PID^(i)) 
                else d[j+2^(k-i)] := 0
Table 4.3.4 shows a trace for k = 3, with d[i] of processor 0 initially 
containing i, 1 <= i <= 8.
The algorithm runs in time O(sum_{i=1}^{k} 2^(k-i)) = O(2^k) on a k-cube, but is 
not strictly simple-ascend (because dimension 1 is used 2^(k-1) times in 
succession, not merely once). This makes very little difference as far as the 
shuffle-exchange is concerned (see the proof of theorem 4.1.1). A minor 
modification to the proofs of theorems 4.1.2 and 4.2.3 serves to give the same 
result for the cube-connected-cycles and cube-connected-lines interconnection 
patterns, the key point being the fact that after the i-th iteration, 
0 <= i <= k, only 2^i processors are in possession of data items.
Table 4.3.4 Trace of algorithm 4 on 8 processors. Table entry shows the contents 
of d[j] of each processor for various values of j after i iterations of the outer 
for-loop. A blank entry denotes a content of 0.
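The scatter can also be simulated sequentially, which gives a direct check that every processor ends up with exactly one item and that the number of communication steps is 2^k - 1. This is a sketch under the assumption that, within one step, the read of d[j+2^(k-i)] from the neighbour happens before the neighbour clears that entry; the function name scatter is hypothetical.

```python
def scatter(k):
    """Simulate algorithm 4; return (d[1] of every processor, step count)."""
    n = 1 << k
    # d[pid][j] for 1 <= j <= n; index 0 is unused to match the text.
    d = [[0] * (n + 1) for _ in range(n)]
    d[0][1:] = list(range(1, n + 1))  # processor 0 holds items 1..2^k
    steps = 0
    for i in range(1, k + 1):
        mask = 1 << (i - 1)
        half = 1 << (k - i)
        for j in range(1, half + 1):
            # Read phase: every processor exposes its d[j + half].
            sent = [d[pid][j + half] for pid in range(n)]
            # Write phase.
            for pid in range(n):
                if pid & mask:  # PID_i = 1
                    d[pid][j] = sent[pid ^ mask]
                else:
                    d[pid][j + half] = 0
            steps += 1
    return [d[pid][1] for pid in range(n)], steps


if __name__ == "__main__":
    values, steps = scatter(3)
    print(values, steps)
```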
4.4. Reducing the Number of Processors
In this section we examine a particular kind of time/processor trade-off. 
namely, the question of whether a reduction in the number of processors of a 
network based on the shuffle-exchange, cube-connected-cycles or cube- 
connected-lines interconnection patterns can be made at a reasonable cost in 
time. We find that a small machine based on these interconnection patterns can 
simulate a larger one by having each processor of the former simulate many of 
the latter. The fact that this simple approach works is of course due to the 
highly regular form of the interconnection patterns under consideration.
This section is partly motivated by a result of Galil and Paul. In [21] they 
transform the standard n-processor algorithm for bitonic sort into an algorithm 
which can sort n numbers using m <= n processors in time O((n/m).log m.log n) on 
any of the graphs considered above. This raises some interesting questions. Do 
similar results hold for all algorithms on these graphs? Does the result of Galil 
and Paul depend on some special property of the sorting algorithm? We are able 
to provide an affirmative answer to the first question. With regard to the second 
question, (as one might expect) the time-bounds achieved by our general 
transformations are slightly inferior to that of the special-purpose 
transformation for bitonic sort. The results work both for the general model of 
register access proposed in section 2.1, and the restricted-access model.
Our processor-saving theorems have many applications. Firstly, as pointed 
out in [21], in many situations the input to a parallel computer cannot be read 
in parallel (this assumption has been made, for example, in [44,49,76]). In this 
case our results can be applied to slow down various fast parallel algorithms to 
the speed at which the input becomes available, and thus make a large decrease 
in the number of processors without any observable increase in time. Secondly, 
we can throw some light on the importance of constant factors in processor 
bounds. For example, Galil and Paul [21, theorem 7] are able to reduce the 
number of processors in their universal parallel machine from 0(p) to p while 
increasing time by only a constant multiple. We are able to extend this result by
showing that the number of processors in any parallel machine based on many 
popular interconnection patterns can be reduced by a constant factor without 
asymptotic time-loss.
Constant multiples in processor-bounds pervade the current literature, due 
to the fact that the commonly used interconnection patterns come only in 
certain sizes, typically 2^k or k.2^k for some non-negative integer k. Thus to 
process n inputs may take more than n processors. For example, it has been 
shown (see theorem 4.1.3) that it is possible to permute n items on a 
2^ceil(log n) processor cube-connected-cycles or shuffle-exchange in time 
O(log n) by 
simulating Waksman's [75] permutation network. In this case it may be 
necessary to use as many as 2n-2 processors. Our results enable us to remove 
these "hidden" constants without asymptotic time loss.
Our technique is motivated by the following result.
Theorem 4.4.1. For all d >= 0, C_k can simulate C_(k+d) with delay O(2^d).
Proof. In order to simulate a single step of C_(k+d), every processor i, 
0 <= i < 2^k, will synchronously execute a single step of processes 2^d.i + j, 
for 0 <= j < 2^d. This takes place in two stages.
(1) For 0 <= j < 2^d simulate the communication between process 2^d.i + j and its 
neighbouring processes. Suppose, for example, process 2^d.i + j wishes to 
communicate with process (2^d.i + j)^(m) for some m with 1 <= m <= k+d. If 
m <= d then no interprocessor communication is necessary. Otherwise 
d < m <= k+d and so process (2^d.i + j)^(m) = 2^d.i^(m-d) + j is being simulated 
by processor i^(m-d). Since processor i^(m-d) is a neighbour of processor i in 
C_k, the desired communication can be carried out in O(1) steps, the exact 
constant being dependent on the type of instruction-set in use. Note that 
communications with process 2^d.i + j may be initiated by processes 
2^d.i^(m) + j, 1 <= m <= k, resident in every neighbour i^(m) of processor i. We 
assume that the communication protocol of C_k deals with these possible 
clashes in the same manner as that of C_(k+d). Clashes involving a 
communication between two processes resident in the same processor are 
to be dealt with in a manner which is compatible with this protocol.
(2) Finally, simulate the current step of processes 2^d.i + j, 0 <= j < 2^d, 
making possible use of the information obtained in (1). This is assumed to take 
O(1) steps per process (the constant again being dependent on the instruction- 
set).
At this point, C_k is ready to simulate the next step of C_(k+d). Thus we have 
simulated a single step of a 2^(k+d) processor machine C_(k+d) on a 2^k processor 
C_k with a time-loss of only O(2^d). □
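The process assignment used in this proof is simple enough to verify exhaustively for small k and d. The sketch below assumes the assignment stated in the proof, namely that process q is hosted by processor floor(q/2^d); the names host and check_cube_simulation are hypothetical.

```python
def host(q, d):
    """Processor of C_k that simulates process q of C_(k+d)."""
    return q >> d


def check_cube_simulation(k, d):
    """Check that every link of C_(k+d) maps to a processor or a link of C_k."""
    for q in range(1 << (k + d)):
        i = host(q, d)
        for m in range(1, k + d + 1):
            partner_host = host(q ^ (1 << (m - 1)), d)  # host of q^(m)
            if m <= d:
                ok = partner_host == i  # same processor, no communication
            else:
                ok = partner_host == i ^ (1 << (m - d - 1))  # i^(m-d)
            if not ok:
                return False
    return True


if __name__ == "__main__":
    print(all(check_cube_simulation(k, d)
              for k in range(1, 5) for d in range(0, 4)))
```

Each processor hosts 2^d processes and serves each in O(1) steps, which is where the O(2^d) delay comes from.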
Note that the simulation of theorem 1 can be carried out in such a manner 
that it maintains the simple-ascend or simple-descend property of section 4.1. 
Thus it is strong enough to achieve the desired savings in processors for 
composite algorithms.
As observed earlier in this section, in some instances the input to a parallel 
machine may not be available in parallel. The example taken by Galil and Paul 
in [21] is that of matrix multiplication. It is well-known [55] that two n x n 
matrices can be multiplied in time O(log n) using O(n^3) processors on any of the 
graphs listed in sections 4.1 and 4.2 provided the input can be read in parallel. 
Suppose, to the contrary, that the input can only be read sequentially, so that 
there is an a priori lower bound of Ω(n^2). By applying theorem 1 with 
k = floor(log(n.log n)) and d = ceil(3.log n) - k, we have a linear-time (i.e. 
O(n^2)) algorithm on n.log n processors. If a row (or column) of the input can be 
read in parallel, theorem 1 with k = floor(log(n^2.log n)) and 
d = ceil(3.log n) - k gives us an O(n) time algorithm on O(n^2.log n) processors. 
This improvement over the corresponding results in [21] stems from the fact that 
they have used a universal parallel machine, which leads to a significant 
degradation in performance.
Whilst theorem 1 gives the claimed savings in processors for an important 
class of algorithms on the shuffle-exchange, cube-connected-cycles and cube- 
connected-lines, our aim is to produce a stronger result which holds for all 
algorithms on these graphs. Fortunately the interconnection patterns are 
sufficiently like the k-cube for similar techniques to work. The shuffle-exchange, 
for instance, is particularly amenable.
Thaorem 4.4.2 For all datO. SEk can simulate SEkt.d with delay 0(2*).
Proof. In order to simulate one step of SEk+d, every processor l, 0 s i< 2 k will 
synchronously simulate one step of processes 2*.i + J, for 0s»j<2i . As in the 
proof of theorem 1, this takes place in two phases - each processor first carries 
out any communications required by its processes from their respective 
neighbours, then updates their configurations once all the information has been 
gathered.
Suppose 0*J <2* and process 2*.i+J wishes to communicate with one of its 
neighbours In SE|c«.d. There are three cases to be considered, according to 
whether process 2*.i+j wishes to communicate with its neighbour along the 
exchange, shuffle or unshuffle edge.
Either
(1) Exchange link. It wishes to communicate with process $(2^d i + j)^{(1)}$. In this
case, no interprocessor communication is necessary, since process
$(2^d i + j)^{(1)} = 2^d i + j^{(1)}$ is also being simulated by processor $i$.
(2) Shuffle link. It wishes to communicate with process
$i_{k-1} i_{k-2} \cdots i_1 j_d j_{d-1} \cdots j_1 i_k$. This process is being simulated by processor
$i_{k-1} i_{k-2} \cdots i_1 j_d$. There are two cases to consider:
(a) $i_k = j_d$. Processor $i = j_d i_{k-1} i_{k-2} \cdots i_1$ can communicate with processor
$i_{k-1} i_{k-2} \cdots i_1 j_d$ directly through its shuffle link.
(b) $i_k \ne j_d$. Processor $i = \bar{j}_d i_{k-1} i_{k-2} \cdots i_1$ can communicate indirectly with
processor $i_{k-1} i_{k-2} \cdots i_1 j_d$ by utilizing the shuffle link to processor
$i_{k-1} i_{k-2} \cdots i_1 \bar{j}_d$, and the exchange link from there to processor
$i_{k-1} i_{k-2} \cdots i_1 j_d$.
(3) Unshuffle link. It wishes to communicate with process
$j_1 i_k i_{k-1} \cdots i_1 j_d j_{d-1} \cdots j_2$. This process is being simulated by processor
$j_1 i_k i_{k-1} \cdots i_2$. Again, there are two cases to be considered:
(a) $i_1 = j_1$. Processor $i = i_k i_{k-1} \cdots i_2 j_1$ can communicate directly with
processor $j_1 i_k i_{k-1} \cdots i_2$ through its unshuffle link.
(b) $i_1 \ne j_1$. Processor $i = i_k i_{k-1} \cdots i_2 \bar{j}_1$ can communicate indirectly with
processor $j_1 i_k i_{k-1} \cdots i_2$ by utilizing the exchange link to processor
$i_k i_{k-1} \cdots i_2 j_1$ and the unshuffle link from there to processor
$j_1 i_k i_{k-1} \cdots i_2$.
Thus we have shown how to simulate one step of the $2^{k+d}$ processor $SE_{k+d}$ on
the $2^k$ processor $SE_k$ with a time-loss of only $O(2^d)$, the constant multiple being
dependent on the instruction-set used. □
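The assignment used in this proof can be checked mechanically for small cases. The following Python sketch is illustrative only: the helper names (`host`, `worst_case_hops`) are assumptions of this example and do not appear in the thesis. It places process $2^d i + j$ on processor $i$ and confirms that every shuffle-exchange neighbour of a process is hosted at most two links away in $SE_k$.

```python
def rotl(x, bits):
    """Shuffle: rotate a `bits`-bit address left by one position."""
    return ((x << 1) | (x >> (bits - 1))) & ((1 << bits) - 1)


def rotr(x, bits):
    """Unshuffle: rotate a `bits`-bit address right by one position."""
    return (x >> 1) | ((x & 1) << (bits - 1))


def neighbours(x, bits):
    """Exchange, shuffle and unshuffle neighbours of node x."""
    return {x ^ 1, rotl(x, bits), rotr(x, bits)}


def host(process, d):
    """Processor of SE_k which simulates the given process of SE_{k+d}."""
    return process >> d


def worst_case_hops(k, d):
    """Largest number of SE_k links needed by any simulated link (3 means 'more than 2')."""
    worst = 0
    for q in range(1 << (k + d)):
        p = host(q, d)
        for nq in neighbours(q, k + d):
            target = host(nq, d)
            if target == p:
                hops = 0
            elif target in neighbours(p, k):
                hops = 1
            elif any(target in neighbours(m, k) for m in neighbours(p, k)):
                hops = 2
            else:
                hops = 3
            worst = max(worst, hops)
    return worst
```

Since each processor hosts $2^d$ processes and each simulated link costs a constant number of real links, the delay of $O(2^d)$ follows.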
The results for the variants of the cube-connected-cycles are proved in a 
similar manner.
Theorem 4.4.3 For all $d \ge 0$,
(i) $CCC_k$ can simulate $CCC_{k+d}$, and
(ii) $CCL_k$ can simulate $CCL_{k+d}$,
with delay $O(2^d)$.
Proof. We will demonstrate the technique for $d = 1$. This can be looked upon as a
recursive algorithm for the processor assignment, along with a proof that the
assignment is valid. We consider $CCC_k$ first.
First, suppose that $k$ is of the form $2^r + r$ for some integer $r > 0$. Then the
processors of the simulating machine are of the form (v,p) where $0 \le v < 2^{2^r}$,
$0 \le p < 2^r$. In contrast, the processes of the simulated machine have the form
(v,p) with $0 \le v < 2^{2^r}$, $0 \le p < 2^{r+1}$. Processor (v,p), $0 \le v < 2^{2^r}$, $0 \le p < 2^r$, will simulate
processes (v,p) and $(v, 2^{r+1} - p - 1)$. As before, each processor synchronously
carries out the communications requested by all its processes, and then updates
their configurations internally.
Suppose process (v,p) wishes to communicate with one of its neighbours.
There are three cases to consider.
(1) Cube link. It wishes to communicate with process $(v^{(p+1)}, p)$. This process is
being simulated by processor $(v^{(p+1)}, p)$ which is directly connected to
processor (v,p) via a cube link.
(2) Upcycle link. It wishes to communicate with process (v,p+1). If
$0 \le p < 2^r - 1$ then process (v,p+1) is being simulated by processor (v,p+1),
which is directly connected to processor (v,p) by an upcycle link. Otherwise
$p = 2^r - 1$ and process $(v, 2^r)$ is also being simulated by processor (v,p), so no
interprocessor communication is necessary.
(3) Downcycle link. It wishes to communicate with process $(v, (p-1) \bmod 2^{r+1})$.
If $0 < p < 2^r$ then process (v,p-1) is being simulated by processor (v,p-1),
which is directly connected to processor (v,p) by a downcycle link.
Otherwise $p = 0$ and process $(v, 2^{r+1} - 1)$ is also being simulated by processor
(v,p), so no interprocessor communication is necessary.
Now suppose that process $(v, 2^{r+1} - p - 1)$ wishes to communicate with one of
its neighbours. This is handled similarly to (2) and (3) above (remembering that
processes of this form have only cycle links).
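The folding of a cycle of length $2^{r+1}$ onto a cycle of length $2^r$ described above can be illustrated by the following Python sketch. It is a check under stated assumptions rather than part of the proof: the function names are hypothetical, and only the cycle links (cases (2) and (3)) are examined.

```python
def host(v, q, r):
    """Processor of CCC_k (k = 2**r + r) which simulates process (v, q) of CCC_{k+1}."""
    return (v, q) if q < 2 ** r else (v, 2 ** (r + 1) - q - 1)


def cycle_links_preserved(r):
    """True if cycle-adjacent processes are hosted on the same or cycle-adjacent processors."""
    small, big = 2 ** r, 2 ** (r + 1)
    for q in range(big):
        _, p = host(0, q, r)
        for neighbour in ((q + 1) % big, (q - 1) % big):
            _, p2 = host(0, neighbour, r)
            # Allowed offsets: same processor, upcycle neighbour, downcycle neighbour.
            if (p - p2) % small not in (0, 1, small - 1):
                return False
    return True
```

Each processor hosts exactly two processes, so a single simulated step costs a constant number of real steps when $d = 1$.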
This completes the case where $k$ is of the form $2^r + r$. Now suppose $k$ is not
of that form. Let $r$ be such that $2^{r-1} + r - 1 < k < 2^r + r$. The processors of the
simulating machine are of the form (v,p) where $0 \le v < 2^{k-r}$, $0 \le p < 2^r$. The
processes of the simulated machine are of the form (v,p) where $0 \le v < 2^{k-r+1}$,
$0 \le p < 2^r$.
Processor (v,p) will simulate processes (v,p) and $(v + 2^{k-r}, p)$. As always, in
order to simulate a single step, each processor first carries out the
communications required by its processes, and then updates their
configurations internally.
Suppose process (v,p) wishes to communicate with one of its neighbours.
There are two cases to consider.
(1) Cube link. It wishes to communicate with process $(v^{(p+1)}, p)$. If $p < k-r$ then
process $(v^{(p+1)}, p)$ is being simulated by processor $(v^{(p+1)}, p)$. Otherwise
$p = k-r$ and process $(v^{(p+1)}, p) = (v + 2^{k-r}, p)$ is also being simulated by
processor (v,p), so no interprocessor communication is necessary. Note
that $p$ cannot exceed $k-r$ since processes (v,p) with $p > k-r$ have no cube
links.
(2) Cycle links. It wishes to communicate with process $(v, (p \pm 1) \bmod 2^r)$. This
process is being simulated by processor $(v, (p \pm 1) \bmod 2^r)$ respectively,
which is connected to processor (v,p) by a cycle link.
This completes the simulation of process (v,p). Process $(v + 2^{k-r}, p)$ is
handled similarly.
Thus we have shown how to simulate a step of $CCC_{k+d}$ on $CCC_k$ in time $O(2^d)$.
Since $CCL_k$ is a subgraph of $CCC_k$, part (ii) of the theorem follows immediately. □
Note that the set-up times for theorems 4.4.1 and 4.4.2 are far superior to
that of theorem 4.4.3. Not only are the assignments of processes to processors
easier to compute, but also the input symbols are placed into the correct
processors at the start of a computation according to the convention
established in section 2.1.
Chapter 5
Practical Simulations
This chapter is devoted to simulations of our general network model on 
more practical models of parallel computation. The first section contains a 
general theorem which characterizes the computational power needed to 
simulate a resource-bounded network. Many specific instances of this theorem
(for particular machine models) have already appeared in the literature 
[4,8,21,42,45,71,72]. In the second section we construct our universal feasible 
network. This feasible network is, as has already been mentioned, to be 
universal for the general network model. The result follows as a fairly 
straightforward corollary to the general theorem of the first section, by 
application of the techniques developed in chapter 4.
In the third section we propose a hardware measure for general networks. 
This hardware measure is compared to popular definitions of hardware which 
have appeared in the literature (including size and width of uniform circuits), 
using simulations based on the result of section 1. We examine the extended 
parallel computation thesis [16,17] as a criterion for "reasonableness" based on 
the resources of time and hardware. This states that time and hardware on any 
reasonable parallel machine model are simultaneously polynomially related to 
space and reversals on a deterministic Turing machine. The fourth and final 
section is devoted to obtaining improved simulations of space and reversal-bounded
deterministic Turing machines by width and depth-bounded uniform
circuits.
A preliminary version of the material in this chapter can be found in [50].
5.1. A General Simulation Theorem
The central result of this section is a theorem which describes the 
computational power needed to simulate a resource-bounded network. As a 
fairly easy corollary, we will in section 5.2 be able to construct a feasible 
network which is universal for the general model of section 2.1. In order to keep 
the proof as manageable as possible, the simulation will be functional rather 
than machine-based.
If $n > 0$, define an n-tuple $X$ over some set $S$ to be a sequence of $n$
elements $\langle X_0, X_1, \ldots, X_{n-1} \rangle$, such that $X_i \in S$, $0 \le i < n$. Let $S^n$ denote the set of all
n-tuples over $S$ and $S \times T$ denote $\{\langle s,t \rangle \mid s \in S, t \in T\}$ for arbitrary sets $S$ and $T$.
Ordering of n-tuples is done lexicographically (first-field-first). For example, if
$X, Y \in Z^n$, then $X < Y$ iff there exists $j$ with $0 \le j < n$ such that $X_j < Y_j$ and for $0 \le i < j$,
$X_i = Y_i$. Let $S^* = \bigcup_{n \ge 0} S^n$.
Let M be a P(n) processor, S(n) space bounded network. To simplify the
presentation we will assume that:
(i) All local instructions operate only on registers $r_0, r_1, r_2$. Only registers $r_i$ for
$i \ge 3$ can be read from, or written to.
(ii) Read instructions have the form "extract values p,a from $r_0, r_1$ respectively,
read register $r_a$ of processor p and place the value obtained into $r_0$".
(iii) Write instructions have the form "extract values p,a,w from registers $r_0, r_1, r_2$
respectively and write w into register $r_a$ of processor p".
(iv) Multiple reads are allowed, and in the case of write conflicts, the smallest
value being written into a register is the one which succeeds.
Note that (i), (ii) and (iii) are sufficient for the example instruction-set of
section 2.1, since a processor can address its own registers by the use of reads
and writes. The general case follows in a similar manner.
For convenience we define a special null element null and adopt the
conventions that for all $X$ and $n$, null is always a member of $X^n$, and that for all
$i \ge 0$, $null_i = S(n)$. Define the configuration of M to be a member of
$C_M = (Z^3 \times N)^{P(n)} \times (N^2 \times Z)^{S(n)}$. For example, we take
$\langle \langle x_0, x_1, \ldots, x_{P(n)-1} \rangle, \langle y_0, y_1, \ldots, y_{S(n)-1} \rangle \rangle$
to indicate the following. If $x_i = \langle a_i, b_i, c_i, d_i \rangle$ then processor $i$ has values $a_i, b_i, c_i$ in
its registers $r_0, r_1, r_2$ respectively, and it is to execute the $d_i$th instruction of the
program of M next (with $d_i$ out of range indicating that the processor has
halted). If $null \ne y_i = \langle p_i, a_i, v_i \rangle$, $a_i \ge 3$, then register $a_i$ of processor $p_i$ contains $v_i$.
In particular, where the latter is concerned we insist that:
(i) Only registers with non-zero contents are listed. These are listed left-justified
in increasing order of $\langle p_i, a_i \rangle$.
(ii) The remaining entries are filled, if necessary, with nulls.
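Definitions. We now define some useful functions. Let $sorted((Z^n)^m) \subseteq (Z^n)^m$ be
the set of m-sequences $X$ such that $X_0 \le X_1 \le \cdots \le X_{m-1}$. For convenience, if
$X \in (Z^n)^m$, let $(X_{-1})_0 = (X_m)_0 = -1$. Then
(1) $sort: (Z^3)^n \to (Z^3)^n$ maps unsorted sequences of tuples into sorted ones.
More precisely,
$$sort(X)_i = \begin{cases} X_j & \text{if } i = |\{k \mid 0 \le k < n,\ X_k < X_j\}| \\ null & \text{otherwise} \end{cases}$$
(2) $merge: sorted((Z^3)^n) \times sorted((Z^3)^m) \to (Z^4)^{n+m}$ merges two sorted sequences of
tuples.
$$merge(X,Y)_i = \begin{cases}
\langle (X_j)_0, (X_j)_1, (X_j)_2, 1 \rangle & \text{if } X_j \ne null,\ i = |\{k \mid X_k < X_j\}| + |\{k \mid \langle (Y_k)_0, (Y_k)_1 \rangle \le \langle (X_j)_0, (X_j)_1 \rangle\}| \\
\langle (Y_j)_0, (Y_j)_1, (Y_j)_2, 0 \rangle & \text{if } Y_j \ne null,\ i = |\{k \mid Y_k < Y_j\}| + |\{k \mid \langle (X_k)_0, (X_k)_1 \rangle < \langle (Y_j)_0, (Y_j)_1 \rangle\}| \\
null & \text{otherwise}
\end{cases}$$
(3) $fanout: sorted((Z^4)^n) \to (Z^2)^n$ achieves the fan-out of data values to multiple
read requests.
$$fanout(X)_i = \begin{cases}
\langle (X_i)_2, (X_j)_2 \rangle & \text{if there exists } j \ne i \text{ such that } (X_j)_0 = (X_i)_0 \text{ and } (X_j)_1 = (X_i)_1 \\
 & \text{and } \langle (X_{j-1})_0, (X_{j-1})_1 \rangle \ne \langle (X_j)_0, (X_j)_1 \rangle \text{ and } (X_j)_3 = 0 \\
\langle (X_i)_2, 0 \rangle & \text{if there exists } j \text{ such that } (X_j)_0 = (X_i)_0 \text{ and } (X_j)_1 = (X_i)_1 \\
 & \text{and } \langle (X_{j-1})_0, (X_{j-1})_1 \rangle \ne \langle (X_j)_0, (X_j)_1 \rangle \text{ and } (X_j)_3 = 1 \\
null & \text{otherwise}
\end{cases}$$
(4) $deliver: (Z^4)^n \to (Z^3)^n$ performs the fan-in of multiple write requests.
$$deliver(X)_i = \begin{cases}
\langle (X_i)_0, (X_i)_1, (X_i)_2 \rangle & \text{if } (X_i)_3 = 0 \text{ and } X_{i+1} \ne \langle (X_i)_0, (X_i)_1, v, 1 \rangle \text{ for all } v \in Z, \\
 & \text{or } (X_i)_3 = 1 \text{ and } (X_i)_2 \ne 0 \text{ and } X_{i-1} \ne \langle (X_i)_0, (X_i)_1, v, 1 \rangle \text{ for all } v \in Z \\
null & \text{otherwise}
\end{cases}$$
(5) $concentrate: (Z^*)^n \to (Z^*)^n$ moves all non-null entries to the left-hand end of
the sequence.
$$concentrate(X)_i = \begin{cases} X_j & \text{if } X_j \ne null \text{ and } i = |\{X_k \mid X_k \ne null,\ 0 \le k < j\}| \\ null & \text{otherwise} \end{cases}$$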
Let $\delta_M: C_M \to C_M$ be the next-configuration function of M. That is, if $C \in C_M$ then
$\delta_M(C)$ is the configuration which follows from C according to the program of M.
Let M' be the machine obtained from M by changing all read and write
instructions to null operations (for example, add zero to a register), and define
$\delta'_M = \delta_{M'}$.
Theorem 5.1.1 Suppose a machine can compute the functions:
(i) Merge using resources $R_1(n+m)$.
(ii) Fanout, deliver and concentrate using resources $R_2(n)$.
(iii) Sort using resources $R_3(n)$.
(iv) $\delta'_M$ using resources $R_4(P(n))$.
Then it can compute $\delta_M$ using resources
$R_1(P(n)+S(n)) + R_2(P(n)+S(n)) + R_3(P(n)) + R_4(P(n))$.
Proof. We make the assumption that the model is capable of storing
configurations of M in such a manner that they can be dismantled and
reassembled using negligible resources. For example, we assume that
$readrequest, writerequest: C_M \to (Z^3)^{P(n)}$ and $data: C_M \to (Z^3)^{S(n)}$ defined by
$$readrequest(X,Y)_i = \begin{cases} \langle (X_i)_0, (X_i)_1, i \rangle & \text{if the } (X_i)_3\text{th instruction of M is a read} \\ null & \text{otherwise} \end{cases}$$
$$writerequest(X,Y)_i = \begin{cases} \langle (X_i)_0, (X_i)_1, (X_i)_2 \rangle & \text{if the } (X_i)_3\text{th instruction of M is a write} \\ null & \text{otherwise} \end{cases}$$
$$data(X,Y) = Y$$
can be computed easily.
Let $C \in C_M$ be a configuration of M. The aim is to simulate a single step of M
starting in configuration C. Internal computations can be handled directly by
application of $\delta'_M$. Read requests are satisfied by computing:
x = sort(readrequest(C))
y = merge(x, data(C))
The new processor configurations can then be obtained from
sort(concentrate(fanout(y))). For example, suppose in a particular step,
processors 0, 1, 2 and 3 wish to read register 4 of processor 3, register 6 of
processor 0, register 7 of processor 1 and register 6 of processor 0 respectively.
Further suppose that the only non-zero registers at that time are register 6 of
processor 0, register 9 of processor 1 and register 4 of processor 3, which
contain the values 99, 89 and 69 respectively. Table 5.1.1 gives the simulation
steps in this case.
Readrequest  Sort       Data       Merge         Fanout   Concentrate  Sort
<3,4,0>      <0,6,1>    <0,6,99>   <0,6,99,0>    null     <1,99>       <0,69>
<0,6,1>      <0,6,3>    <1,9,89>   <0,6,1,1>     <1,99>   <3,99>       <1,99>
<1,7,2>      <1,7,2>    <3,4,69>   <0,6,3,1>     <3,99>   <2,0>        <2,0>
<0,6,3>      <3,4,0>    null       <1,7,2,1>     <2,0>    <0,69>       <3,99>
                                   <1,9,89,0>    null     null
                                   <3,4,69,0>    null     null
                                   <3,4,0,1>     <0,69>   null
                                   null          null     null
Table 5.1.1 Simulation of read requests by 4 processors.
Write requests are simulated by computing:
x = sort(writerequest(C))
y = merge(x, data(C))
The new register contents can then be computed from concentrate(deliver(y)).
For example, suppose processors 0, 1, 2 and 3 wish to write values 0, 77, 50 and
28 to register 4 of processor 3, register 6 of processor 0, register 7 of processor
1 and register 6 of processor 0 respectively. Further suppose that the current
register-contents are exactly the same as in the read-request example above
(see table 5.1.1). Table 5.1.2 gives the simulation steps in this case.
Writerequest  Sort       Data       Merge         Deliver
<3,4,0>       <0,6,28>   <0,6,99>   <0,6,99,0>    null
<0,6,77>      <0,6,77>   <1,9,89>   <0,6,28,1>    <0,6,28>
<1,7,50>      <1,7,50>   <3,4,69>   <0,6,77,1>    null
<0,6,28>      <3,4,0>    null       <1,7,50,1>    <1,7,50>
                                    <1,9,89,0>    <1,9,89>
                                    <3,4,69,0>    null
                                    <3,4,0,1>     null
                                    null          null
Table 5.1.2 Simulation of write requests by 4 processors.
□
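The two worked examples of this section can be reproduced with a short sequential sketch. The Python below is an illustration under stated assumptions, not the construction of the proof: it computes the five functions with ordinary sequential code rather than on a parallel machine, it represents null by `None`, and all function names are chosen for this example only.

```python
NULL = None


def sort_tuples(xs):
    """sort: non-null tuples in lexicographic order, nulls at the right-hand end."""
    live = sorted(x for x in xs if x is not NULL)
    return live + [NULL] * (len(xs) - len(live))


def merge(requests, data):
    """merge: tag requests with 1 and data with 0; data precedes requests for the same register."""
    tagged = [r + (1,) for r in requests if r is not NULL]
    tagged += [d + (0,) for d in data if d is not NULL]
    tagged.sort(key=lambda t: (t[0], t[1], t[3], t[2]))
    return tagged + [NULL] * (len(requests) + len(data) - len(tagged))


def fanout(merged):
    """fanout: answer each read request <p, a, i, 1> with <i, value of register a of processor p>."""
    stored, out = {}, []
    for t in merged:
        if t is NULL:
            out.append(NULL)
        elif t[3] == 0:                      # a data item heads its group
            stored[(t[0], t[1])] = t[2]
            out.append(NULL)
        else:                                # a request; unlisted registers hold zero
            out.append((t[2], stored.get((t[0], t[1]), 0)))
    return out


def deliver(merged):
    """deliver: the smallest non-zero write to a register wins; overwritten data is dropped."""
    def is_write_to(u, t):
        return u is not NULL and u[3] == 1 and u[:2] == t[:2]

    out = []
    for i, t in enumerate(merged):
        prev = merged[i - 1] if i > 0 else NULL
        nxt = merged[i + 1] if i + 1 < len(merged) else NULL
        if t is NULL:
            out.append(NULL)
        elif t[3] == 0 and not is_write_to(nxt, t):
            out.append(t[:3])
        elif t[3] == 1 and t[2] != 0 and not is_write_to(prev, t):
            out.append(t[:3])
        else:
            out.append(NULL)
    return out


def concentrate(xs):
    """concentrate: move all non-null entries to the left-hand end."""
    live = [x for x in xs if x is not NULL]
    return live + [NULL] * (len(xs) - len(live))


data = [(0, 6, 99), (1, 9, 89), (3, 4, 69), NULL]
reads = [(3, 4, 0), (0, 6, 1), (1, 7, 2), (0, 6, 3)]
writes = [(3, 4, 0), (0, 6, 77), (1, 7, 50), (0, 6, 28)]

replies = sort_tuples(concentrate(fanout(merge(sort_tuples(reads), data))))
registers = concentrate(deliver(merge(sort_tuples(writes), data)))
```

Here `replies` pairs each requesting processor with the value it reads, and `registers` lists the non-zero register contents after the writes.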
5.2. A Universal Parallel Machine
Specific instances of theorem 5.1.1 (the simulation of networks or shared-
memory machines on other parallel machine models) have appeared many
times over in the current literature. Theorem 5.1.1 is a powerful general result.
It can be used to:
(1) Provide general communication between the processors of a feasible 
network (which is equivalent to simulating a network on a feasible network) 
[45].
(2) Simulate restricted-access networks on a universal network with constant 
degree and easy-to-compute interconnections [21].
(3) Simulate shared-memory machines on a network with constant degree and 
easy-to-compute interconnections. This has been observed in the case 
where no memory access conflicts are allowed [42], or P(n) = S(n) [8].
(4) Remove memory access conflicts from shared-memory machines [72].
(5) Simulate shared memory machines on a variant of the feasible network 
which uses a small number of "large" processors (with a large amount of 
local memory and "powerful" instruction set) and a larger number of 
"small" processors (with a constant amount of local memory and minimal 
instruction set) [71].
(6) Construct a multi-access memory [4] to provide a practical implementation 
of a shared-memory machine as a physical device.
(7) Simulate space and reversal bounded Turing machines by width and depth 
bounded uniform circuits (and vice-versa) [53].
This latter application will be explored further in the next two sections. In
this section we will concentrate on the first application.
Corollary 5.2.1 There is a feasible network which can simulate any network of
P(n) processors and space S(n), using S(n) processors, the same word-size as
the simulated machine, set-up time $O(\log S(n))$ and delay
$$O\left(\frac{\log^2 P(n)}{\log S(n) - \log P(n) + 1} + \log S(n)\right).$$
Proof. A P(n)+S(n) processor feasible network can be used as follows. Note that
$S(n) \ge n$, so initially every processor has at most one input symbol. The first P(n)
processors are to simulate the processes (keeping only registers $r_0, r_1, r_2$ of
their respective process); the remaining S(n) are to hold the remaining register
contents. The set-up time comes from the need to first concentrate the input
values (to get rid of any zeros), and route them out to the register-holders using
procedures Rank and Concentrate from [45,47]. An additional $O(\log P(n))$ steps
are required to broadcast the program of the simulated machine to the first
P(n) processors using algorithm 1 of section 4.3. The result then follows from
theorems 5.1.1 and 4.4.1, noting that a P(n)+S(n) processor feasible network
based on either the shuffle-exchange or cube-connected-cycles can:
(1) Sort P(n) items in time $O\left(\frac{\log^2 P(n)}{\log S(n) - \log P(n) + 1} + \log P(n)\right)$ using the
algorithm of [47].
(2) Merge P(n)+S(n) items in time $O(\log S(n))$ by using a Batcher [4] merge
(see for example [55,63,66]).
(3) Fan-out P(n)+S(n) items in time $O(\log S(n))$ by using algorithm 3 of section
4.3. Alternatively, procedures Rank, Concentrate and Generalize from [45]
can be utilized, as in that reference.
(4) Deliver P(n)+S(n) items in time $O(\log S(n))$ by using procedure Concentrate
from [45,47].
(5) Concentrate P(n)+S(n) items in time $O(\log S(n))$ using procedures Rank and
Concentrate from [45,47]. □
The time complexity of corollary 5.2.1 is dominated by the cost of sorting
the read and write requests. This can be reduced by substituting the sorting
algorithm of Ajtai, Komlós and Szemerédi [2] for (1). Although this results in a
better asymptotic time-bound, the constant multiple is too large to be of any
practical use. The algorithm as presented in [2] has a constant multiple of
several million, although this has more recently been reduced to around 1400 by
M. S. Paterson. For our purposes, corollary 5.2.1 is to be regarded as superior.
There are a number of ways of making the substitution. P(n) values can be
sorted in time $O(\log P(n))$ by using $O(P(n) \log P(n))$ processors, giving rise to:
Corollary 5.2.2 There is a feasible network which can simulate any P(n) processor,
$S(n) = \Omega(P(n) \log P(n))$ space bounded network using $O(S(n))$ processors, the
same word-size and delay $O(\log S(n))$.
Alternatively, P(n) values can be sorted in time $O(\log P(n) \cdot \log\log P(n))$ on
$O(P(n))$ processors, by pipelining a $P(n)/\log P(n)$ processor sorting network, and
merging the $\log P(n)$ sorted sequences that result using a Batcher merge. This
has also been noted in [40]. More recently, Leighton [41] has discovered an
elegant method for sorting P(n) items in time $O(\log P(n))$, using only P(n)
processors. Thus we have:
Corollary 5.2.3 There is a feasible network which can simulate any S(n) space-bounded
network using $O(S(n))$ processors, the same word-size and delay
$O(\log S(n))$.
What if the processors of the universal machine are allowed to have more 
than a constant amount of memory? Then:
(1) O(log S(n)) delay, with a more reasonable constant multiple, can be 
achieved on a probabilistic machine (with overwhelming probability) on S(n) 
processors by using the sorting algorithm of [57],
(2) The processor-bound in (1) can be reduced to P(n), increasing the delay to
$O(\log^2 P(n))$, by using the techniques of Upfal [68].
(3) The bounds of (2) can be achieved on a deterministic machine for the
simulation of restricted-access networks [45]. (The delay can also be
reduced by the use of the technique of corollary 5.2.3.)
Note that the universal machine conserves many of the notions of 
"reasonableness" mentioned in section 3.3. For example
(1) If the machine being simulated obeys the parallel computation thesis, then 
so does the universal machine.
(2) If the simulated machine is small and fast (provided $T(n) = \log^{O(1)} P(n)$) the
universal machine is small and fast.
(3) Bounds upon word-size are maintained.
5.3. A Hardware Measure
In this section, we attempt to capture the idea of a hardware measure on
our network model. The amount of hardware needed to build a universal feasible
network is governed by the amount of memory needed, and the complexity of
the instruction-set. To simplify matters, we will concentrate on networks with
the minimal instruction-set. We claim that space×word-size is a good hardware
measure for such a machine (or indeed, any machine where memory-costs
dominate the cost of a processing-unit). In order to justify this claim, we can
relate this to the measures of hardware on other popularly-accepted models,
whilst maintaining time to within a polynomial.
A uniform circuit is an infinite family $C = (C_0, C_1, \ldots)$ of combinational
circuits, one for each input size (see, for example [7,13,53,59]). Without loss of
generality we assume that the circuits are built using gates which realize
functions drawn from the class $B_2$ of two-input Boolean functions. An input of
size n is presented, in some suitably encoded form, to the inputs of $C_n$. The
output of $C_n$ is then taken as the output of C. C is said to have depth D(n) if the
length of the longest path from an input to an output in $C_n$ is at most D(n), for
$n \ge 0$. It has width W(n) if $C_n$ has width (as defined in [53]) W(n) and size Z(n) if
$C_n$ has Z(n) gates. We assume $D(n) = \Omega(\log W(n))$.
The function $f: N^2 \times \{left, right\} \to N$ where for $n \ge 0$ the j-input of gate $i \ge n$ is
connected to the output of gate f(i,n,j), is called the interconnection function of
C. We assume that gates $0, 1, \ldots, n-1$ are distinguished gates representing the
inputs. The function $g: N^2 \to B_2$, where for $n \ge 0$ gate $i \ge n$ of $C_n$ is a g(i,n)-gate, is
called the gate function of C. We insist that the interconnection and gate
functions be computable in space $O(\log Z(n))$ by a deterministic Turing machine.
Corollary 5.3.1 Every network with P(n) processors, space S(n), time T(n) and
word-size W(n) can be simulated by a uniform circuit of depth
$O(T(n) \cdot \log S(n) \cdot \log W(n))$ and width $O(S(n) \cdot W(n))$.
Proof. (Sketch). The circuit consists of T(n) levels, one for each simulated
time-step. Each level has P(n) sub-circuits corresponding to a single step of a
processor, and a further S(n) sub-circuits carrying register values. Between
each level is a circuit for carrying out inter-processor communication, built out
of a sorter, merger, concentrator etc. as in corollary 5.2.2. Each processor unit
takes as input the program-counter, current values of registers $r_0$, $r_1$ and $r_2$, and
incoming values from read requests. It produces outgoing read and write
requests, and updated values for the aforementioned program-counter and
registers (see figure 5.3.1). These units fit together as in figure 5.3.2.
Figure 5.3.1 Block diagram representation of a circuit to compute a single step of
a processor. (Inputs: program counter, registers, incoming values. Outputs: read and write requests, new program counter, new register contents.)
Figure 5.3.2 Block diagram of a circuit to compute a single step of a network.
Each processor-unit has circuitry which
(1) Deals with incoming data which has arrived in response to a read request in
the last step.
(2) Performs a single instruction, issuing a read or write request as necessary.
The processor units have width $O(W(n))$ and depth $O(\log W(n))$. The circuits
for sort, merge, concentrate etc. have width $O(S(n) \cdot W(n))$ and depth
$O(\log S(n) \cdot \log W(n))$. The register contents have width $O(S(n) \cdot W(n))$. Thus the
complete circuit has width $O(S(n) \cdot W(n))$ and depth $O(T(n) \cdot \log S(n) \cdot \log W(n))$. □
Note that in the general case, if the internal instructions can be computed
by a uniform circuit of depth $d_1(n)$ and width $W_1(n)$, then the above circuit has
depth $O(T(n) \cdot (\log S(n) \cdot \log W(n) + d_1(n)))$ and width $O(S(n) \cdot W(n) + P(n) \cdot W_1(n))$.
In section 3.3 we saw a number of different ways of characterizing a
"reasonable" parallel machine model. For example, the parallel computation
thesis states that a parallel machine model is reasonable if time on that model
is polynomially related to sequential space. Dymond [16,17] gives an extended
version of the parallel computation thesis which takes into account both the
time and the amount of hardware used. This can be loosely summed up as
follows: time and hardware on any reasonable parallel machine model are
simultaneously polynomially related to Turing machine reversals and space
respectively (a reversal is said to occur when any tape-head changes direction).
This raises an obvious question: when are our network machines a
"reasonable" parallel machine model according to the extended parallel
computation thesis, given that space×word-size is taken as a measure of
hardware? We find that a T(n) time, P(n) processor bounded network which uses
space S(n) and has word-size W(n) obeys the extended parallel computation
thesis provided:
(i) Local instructions can be computed by a deterministic Turing machine
using space $(W(n) \cdot S(n))^{O(1)}$ and $T(n)^{O(1)}$ reversals.
(ii) $P(n) = 2^{T(n)^{O(1)}}$.
Part (i) provides more evidence for the unit-cost hypothesis. Note that the 
Turing machine is to be given the value of P(n) in binary along with any input of 
size n.
In particular, for a machine with the minimal instruction-set:
Corollary 5.3.2 Every P(n) processor network which runs in time T(n), space S(n)
and word-size W(n) can be simulated by a deterministic Turing machine using
space $O(S(n) \cdot W(n))$ and reversals $O(T(n) \cdot (\log^2 P(n) + \log S(n)))$.
Proof. (Sketch). This result follows from theorem 5.1.1 much in the same
manner as corollary 5.2.1, substituting the sorting algorithm of theorem 4.1.5
(Batcher sort) for that of [47]. The composite sub-algorithms used thus have
simple modules whose upper dimension is easy to compute.
Consider a simple-ascend class sub-algorithm which ascends to the full 
value of k. and uses the minimal instruction-set. Suppose the n inputs are 
initially encoded as binary strings on tape 1 of the Turing machine, each 
separated by a special blank symbol. The Turing machine computes in k phases 
(one for each dimension), each of which consists of a constant number of passes 
over two tapes. The first phase does the following. First, copy every alternate 
string on to tape 2. In a constant number of left-to-right scans over the tapes,
perform the necessary data transfers in dimension 1, and the internal
operations. Copy the (updated) strings from tape 2 back to tape 1. The word-size
can increase by only a constant, so the overflow from each string can be
stored temporarily by using a large tape alphabet, and the tape contents can be
moved along as part of the copying process by making extra use of the second
tape. (Extra tapes may be necessary for more powerful instruction-sets which
increase the word-size more rapidly.) This is the end of the first phase. Phase $i$,
$2 \le i \le k$, achieves data transfers in dimension $i$ by similarly copying alternate
blocks of $2^{i-1}$ strings from tape 1 to tape 2, performing the transfers in a
constant number of left-to-right scans, and copying the strings back to tape 1.
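The phase structure just described can be sketched in Python, with two lists standing in for the two tapes. This is an illustration only: the function name and the form of the `op` argument (a function taking the pair of strings which meet in the current dimension and returning their updated values) are assumptions of this example, and the marking of copied strings and the handling of word-size overflow are omitted.

```python
def ascend_on_two_tapes(values, op):
    """Run one simple-ascend sub-algorithm over k = log2(len(values)) phases."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "number of strings must be a power of two"
    tape1 = list(values)
    phases = 0
    block = 1                                   # block size is 2**(i-1) in phase i
    while block < n:
        # Pass 1: copy every alternate block of `block` strings on to tape 2.
        on_tape2 = [(i // block) % 2 == 1 for i in range(n)]
        tape2 = [s for s, moved in zip(tape1, on_tape2) if moved]
        tape1 = [s for s, moved in zip(tape1, on_tape2) if not moved]
        # Pass 2: one left-to-right scan of both tapes performs the transfers.
        updated = [op(lo, hi) for lo, hi in zip(tape1, tape2)]
        # Pass 3: copy the strings back to tape 1 in their original positions.
        lows = iter(u[0] for u in updated)
        highs = iter(u[1] for u in updated)
        tape1 = [next(highs) if moved else next(lows) for moved in on_tape2]
        block *= 2
        phases += 1
    return tape1, phases
```

For instance, `ascend_on_two_tapes([7, 0, 0, 0, 0, 0, 0, 0], lambda lo, hi: (lo, lo))` broadcasts the first value to all eight positions in three phases, each phase using a constant number of passes.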
Take for example algorithm 1 of section 4.3, the algorithm to broadcast a
single value to all processors. [The tape diagrams which accompany each step are not reproduced.]
On the input, copy every alternate string to tape 2, leaving a special mark on tape 1 at the
end of every copied string.
Perform data transfers in dimension 1 in a single left-to-right scan of both tapes.
Now copy the strings back to tape 1.
To handle transfers in dimension 2, first copy every alternate block of two
strings to tape 2 (again leaving a special mark at the end of every copied block).
Perform the data transfers, and copy back.
Finally, for dimension 3, first copy across every alternate block of four strings.
Perform transfers.
Note that the sorting algorithm of [47] can be used (as in corollary 5.2.1) to
give an improved reversals bound, provided the upper dimensions in the simple
sub-modules can be computed by a deterministic Turing machine within the
stated resources.
5.4. Circuits and Turing Machines
In order to justify his extended parallel computation thesis, Dymond [16,17]
appeals to a seminal paper by Pippenger [53] which relates depth and size of 
uniform circuits to Turing machine reversals and time. Dymond prefers to use 
Turing machine space instead of time, and circuit width as a measure of 
hardware (rather than size) since it is a measure of the amount of hardware 
which comes into play at any given instant in time. We can use the results of 
this chapter to gain improved simulations of space and reversal bounded Turing 
machines by uniform circuits.
We follow the general structure of the proof appearing in [53]. Pippenger 
simulates a Turing machine on an oblivious Turing machine, and then simulates 
this by a uniform circuit. We will simulate a Turing machine on a network. We 
can then build a uniform circuit by application of corollary 5.3.1.
Theorem 5.4.1. An S(n) space, R(n) reversal bounded k-tape deterministic Turing
machine can be simulated on a network with processors and space
$O\left(\frac{S(n)^k}{\log S(n)}\right)$, time $O(R(n) \cdot \log S(n))$ and word-size $O(\log S(n))$.
Proof. Let M be a k-tape deterministic Turing machine which runs in space S(n)
and reversals R(n). Following [53] define a phase to be all the steps of M from
one reversal to the next (the first move is counted as a reversal for this
purpose), and a situation to be the control state and head positions of M. It may
be assumed that all transition rules of M which write a new value onto a tape cell
also move the head away from that cell. This implies that symbols written
during one phase cannot be read until the next. Let $d(n) = 2^{\lceil \log\log S(n) \rceil}$ and call a
situation special if it has at least one head on the $(l \cdot d(n))$th cell of its tape, for
some $l \in N$. Note that there are at most $O\left(\frac{S(n)^k}{\log S(n)}\right)$ different special situations,
and that at most $O(\log S(n))$ steps of M can occur between one special situation
and the next.
In order to make the proof more readable, we will present the algorithm on 
a shared-memory machine. This reinforces the observation in section 3.2 that 
completely-connected networks are almost identical to shared-memory 
machines. The simulation proceeds roughly as follows. The tape contents at the 
start of the current phase, the head directions and the initial situation for the 
current phase are stored in the global memory. This is easy to do at the start of 
the initial phase; the algorithm will maintain this information from phase to 
phase. We reserve one processor (and two global memory locations) for each 
special situation. The aim is to have these processors confer, via the global 
memory, and decide which special situations are involved in the current phase. 
The processors corresponding to these special situations then simultaneously 
update the tape cell contents in global memory; the final situation (which is 
detected by an attempted reversal) determines the head directions and the 
initial situation for the next phase. This proceeds for a total of R(n) phases.
The simulation of a phase is achieved as follows. Processor i handles the ith 
special situation. Firstly, each processor i computes in parallel the special 
situation which follows from special situation i, by doing a step-by-step read-only 
simulation of M on the tape contents in global memory (by "read-only 
simulation" we mean that the tape-contents are not updated). This value is 
stored into array element s[i] in global memory. If an illegal situation occurs 
during this process, or a reversal is detected (determined by examining the 
head directions for the current phase, which are stored in global memory) then 
s[i] is set to i. All processors i execute the following code synchronously in 
parallel. Upon termination global array element active[i] will be set to true iff
special situation i occurs in the current phase. Each processor can determine 
whether its special situation is the first special situation to occur in the current 
phase by using a step-wise read-only simulation of M starting at the initial 
situation of the phase.
    active[i] := if (first special situation in this phase) = i 
                 then true 
                 else false
    for b := 1 to ⌈log S(n)⌉ do
        if active[i] then active[s[i]] := true 
        s[i] := s[s[i]]
Those processors i with active[i] = true can then update the tape contents; the 
last special situation is readily available (in all entries of s), from which the final 
situation of the current phase can be determined.
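
The loop above is an instance of pointer doubling. The following is a minimal sequential sketch of it in Python, not part of the original proof. The helper name mark_active and the convention that a situation ending the phase points to itself are assumptions made for illustration.

```python
import math

def mark_active(s, first):
    """Sketch of the pointer-doubling loop (hypothetical helper).

    s[i] is the special situation that follows special situation i, with
    s[i] == i for a situation that ends the phase (assumed convention).
    """
    n = len(s)
    s = list(s)
    active = [i == first for i in range(n)]
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # Synchronous step: every active processor marks its current successor.
        for j in [s[i] for i in range(n) if active[i]]:
            active[j] = True
        # Synchronous step: pointer jumping, s[i] := s[s[i]].
        s = [s[s[i]] for i in range(n)]
    return active, s

# Example: the phase visits special situations 3 -> 5 -> 1 -> 6 and ends at 6.
succ = [2, 6, 2, 5, 0, 1, 6, 3]
active, final = mark_active(succ, first=3)
```

After the loop, every active entry of the returned array points at the last special situation of the phase, which is how the final situation becomes available to all the processors involved.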
The running time is dominated by O(log S(n)) for each phase. This comes 
from:
(1) Decoding of PIDs (each of O(log S(n)) bits) into special situations.
(2) Determining the first special situation from the initial situation and the 
final situation from the last special situation by simulating at most 
O(log S(n)) steps of M.
(3) Computing the special-situation transition function by simulating 
O(log S(n)) steps of M.
(4) Computing the active array in O(log S(n)) steps.
(5) Updating the tape contents by simulating O(log S(n)) steps of M.
Repeating this for R(n) phases gives us the required result. □
Corollary 5.4.2. An S(n) space, R(n) reversal bounded deterministic k-tape Turing
machine can be simulated by a uniform circuit of depth 
O(R(n).log^2 S(n).log log S(n)) and width O(S(n)^k).
Proof. By theorem 5.4.1 and corollary 5.3.1. □
Corollary 5.4.3. A T(n) time, R(n) reversal bounded deterministic k-tape Turing 
machine can be simulated by a uniform circuit of depth
O(R(n).log^2 T(n).log log T(n))
and size
O(R(n).T(n)^k.log^2 T(n).log log T(n)).
This is a small improvement over the results of Pippenger [53] who obtains 
depth 0(R(n).log^Hn)) and size 0(R(n).T(n)k log4T(n)),
Chapter 6
High-Arity Machines
In the general network model as described in section 2.1. a communication 
line between two processors A and B is made up of two bidirectional links, one 
under the control of A. and the other under the control of B. We call a 
processor's links active if they are under its control, and passive otherwise. The 
active links of processor A are those which it can use to initiate communication 
(by executing a read or write instruction), whereas its passive links are used for 
communication initiated by its neighbours (attempts to read from or write to a 
register of A).
In section 2.1 we made the assumption that during any time-step, each 
processor can make use of only one of its active links (albeit a potentially 
different one at each time-step). In this chapter we extend our network model 
to give each processor the use of more than one active link simultaneously (and 
the power to make efficient use of the information thus obtained). We call the 
number of active links which can be used by a single processor in any time-step 
the arity of the network.
Although machines with non-constant arity have already appeared in the 
literature, there has so far been no systematic investigation into the extra 
computing power offered by high-arity instructions. For example, the random 
routing results of [69,70] initially appeared in a high-arity form, although this 
has since been redressed [3,6,57,67]. The oblivious lower-bound of [8] is also 
presented for high-arity machines.
In the first section we present our high-arity model. In the second section 
we show that machines of arity and degree A(n) are potentially more powerful 
than those of arity o(A(n)). In particular we are interested in networks with
reasonably small arity and degree; more precisely, those with P(n) processors 
and arity and degree O(log P(n)). Whilst it is apparent from section 2 that these 
machines are more powerful than those of constant degree, we will show in 
section 3 that they are not too much more powerful, in the sense that there 
exists an efficient constant-degree universal machine. Finally, section 4 gives 
some examples of the speed-ups which can be obtained by increasing arity. A 
preliminary version of the material contained in the first three sections of this 
chapter has appeared in [52].
6.1. A High-Arity Model
Our aim is to generalize the network model to give each processor the 
ability to communicate with asymptotically more than a constant number of its 
neighbours in unit time, and sufficient power to make good use of this ability. 
The basic high-arity model is defined in much the same manner as the 
constant-arity one of section 2.1, except for the fact that we allow the 
processors to have instructions which can be simulated in time A(n) by the 
processors of that section. A(n) is then called the arity  of the machine. For 
example, we can replace the example instruction-set of section 2.1 with the 
following:
(1) r_i[r_j ← constant]             (block-load constant)
(2) r_i[r_j ← r_k]                  (duplicate register)
(3) r_i[r_j ← r_k ~ r_l]            (element-by-element operation)
(4) r_i[r_j ← ~r_k]                 (prefix ~)
(5) r_i[r_j ← r_{r_k}]              (indirect loads)
(6) r_i[r_{r_j} ← r_k]              (indirect stores)
(7) r_i ← PID
(8) halt
(9) goto m if r_i > 0
(10) r_i[r_j ← (r_{r_k} of r_l)]    (read)
(11) r_i[(r_{r_j} of r_k) ← r_l]    (write)
Instructions (7-9) are as in section 2.1. Instructions (1-3,5,6,10,11) have the same
effect (in unit time) as the high-level statement:
    for m := 0 to r_i - 1 do S 
where statement S is respectively
(1) r_{j+m} := constant
(2) r_{j+m} := r_k
(3) r_{j+m} := r_{k+m} ~ r_{l+m}
(5) r_{j+m} := r_{r_{k+m}}
(6) r_{r_{j+m}} := r_{k+m}
(10) r_{j+m} := r_{r_{k+m}} of processor r_{l+m}
(11) (r_{r_{j+m}} of processor r_{k+m}) := r_{l+m}
Instruction (4) has the same effect as
    r_j := r_k
    for m := 1 to r_i - 1 do
        r_{j+m} := r_{j+m-1} ~ r_{k+m}
In this particular model, a parallel machine has arity A(n) if for all inputs of 
size n ≥ 0, the largest value present in register r_i during the execution of 
instructions of the form (1-6,10,11) is at most A(n).
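
As an illustration only, the block semantics of instructions (3) and (4) can be mimicked sequentially in Python. The register file is modelled as a list r, the function names are hypothetical, and the body of the prefix loop assumes the usual recurrence in which each result combines the previous result with the next operand.

```python
import operator

def block_op(r, i, j, k, l, op):
    """Instruction (3): for m := 0 to r[i]-1 do r[j+m] := r[k+m] ~ r[l+m]."""
    for m in range(r[i]):
        r[j + m] = op(r[k + m], r[l + m])

def block_prefix(r, i, j, k, op):
    """Instruction (4), assuming the prefix recurrence
    r[j] := r[k]; r[j+m] := r[j+m-1] ~ r[k+m] for 1 <= m < r[i]."""
    r[j] = r[k]
    for m in range(1, r[i]):
        r[j + m] = op(r[j + m - 1], r[k + m])

# Register 0 holds the block length (the arity actually used by the instruction).
regs = [3, 1, 2, 3, 10, 20, 30, 0, 0, 0]
block_op(regs, 0, 7, 1, 4, operator.add)
elementwise = regs[7:10]
block_prefix(regs, 0, 7, 1, operator.add)
prefix = regs[7:10]
```

In the model each call costs one time-step, while a constant-arity processor would need a number of steps proportional to the block length.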
A fixed-structure variant of this model can be defined by augmenting the 
processors with an infinite collection of read-only port registers, and 
interpreting communication instructions after the manner of section 3.1. In 
section 6.3 we will consider a restricted-access, fixed-structure model. Each 
processor is augmented with a communication register COM, and 
communication instructions are restricted to allow reads and writes of those 
registers only. Instructions (10) and (11) of the example instruction-set are 
replaced by:
(10') r_i[r_j ← COM of p_{r_k}]      (read)
(11') r_i[(COM of p_{r_j}) ← r_k]    (write)
which have the same effect as the high-level statements:
(10') for m := 0 to r_i - 1 do
          r_{j+m} := COM of processor p_{r_{k+m}}
(11') for m := 0 to r_i - 1 do
          (COM of processor p_{r_{j+m}}) := r_{k+m}
respectively.
6.2. The Computational Power of High-Arity Machines
Some of the power of high-arity machines comes from the fact that they 
have high degree. It is easy to show that a machine with degree D(n) is 
asymptotically faster than any machine of degree o(D(n)) (with arity kept 
constant). Consider the problem of broadcasting a single value amongst n 
processors. More formally, we wish to compute, in parallel, the function f: N^n → N^n
defined by f(x_0,...,x_{n-1}) = (y_0,...,y_{n-1}) where y_i = x_0 for 0 ≤ i < n. Suppose d ≥ 3.
The following is an n processor, degree d, arity-1 algorithm for computing f on
inputs of size n in time O(log n / log d). Assuming that initially variable x of processor
i contains x_i, the algorithm terminates with variable x of processor i containing
x_0, 0 ≤ i < P(n). The interconnection pattern used is a (d-1)-ary tree. Figure 6.2.1 
shows four levels of a D-ary tree, for arbitrary D.
    b := 1
    while b < n do
        b := b.(d-1)
The time-bound attained by the above algorithm is asymptotically optimal,
regardless of how complicated its interconnections are, how many processors 
are used, or what the arity of those processors is. This follows from the 
observation that there must be a path in the interconnection graph of size n 
from processor 0 to processor i, 0 ≤ i < P(n). From this we can conclude that a 
degree D(n), constant arity parallel machine with n processors is asymptotically 
faster than any machine of degree o(D(n)). Indeed, the latter machine may 
even be allowed to have a different non-recursive program for each processor, 
which may vary with input size.
Figure 6.2.1. A D-ary tree.
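
The broadcast can be sketched in Python as follows. This is a sequential simulation under stated assumptions, not the thesis's listing: the communication step of the loop is assumed to have every non-root processor read x from its parent in a (d-1)-ary tree laid out in heap order, and the name broadcast is hypothetical.

```python
def broadcast(xs, d):
    """Simulate an arity-1 broadcast of xs[0] over a (d-1)-ary tree, d >= 3."""
    if d < 3:
        raise ValueError("the sketch assumes d >= 3")
    n = len(xs)
    x = list(xs)
    b, steps = 1, 0
    while b < n:
        # Assumed step: every non-root processor i reads x from its parent,
        # processor (i - 1) // (d - 1); each processor uses one link per step.
        x = [x[0]] + [x[(i - 1) // (d - 1)] for i in range(1, n)]
        b *= d - 1
        steps += 1
    return x, steps

values, steps = broadcast([7, 1, 2, 3, 4, 5, 6, 8, 9, 10], d=4)
```

Each processor has one parent and at most d-1 children, so the degree is d, and the number of iterations grows as log n / log (d-1).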
Furthermore, we wish to show that increasing the arity of a parallel 
machine increases its power. Unfortunately there are no good lower-bounds
available for even constant-arity machines with simultaneous writes (an 
exception is the recent paper by Wigderson and Vishkin [74], which uses a very 
restrictive model). Even without simultaneous writes, an argument based on 
"information-flow" is often quite difficult, for in many cases, the concept of 
"information" is subtle. Even though a single processor can only receive one 
communication in each time-step, it may receive it from potentially many 
different sources, depending on the input values. Information can be encoded 
both in the value written, and the identity of the source.
Even without simultaneous writes, the situation may not be as easy as it 
looks. For example, information can even be passed by a processor choosing not 
to write. Suppose processor A has a value v ∈ {0,1} which it wishes to 
communicate to processor D. Although A must be connected to D by a path in 
the interconnection graph of the machine, every subgraph which corresponds to 
a particular computation may have A and D disconnected, as follows. Two extra 
processors each initialize a register r to zero. In the first step, A writes a one to 
register r of processor B if v = 1, and a one to register r of processor C if v = 0. 
(Thus register r of processor B holds v, and register r of processor C contains its 
complement). In the second step, B writes a zero to processor D if its register r 
is zero, and C writes a one to processor D if its register r is zero. Thus processor 
D has the value v written to it by a processor which has had no direct 
communication with A (see figure 6.2.2). Therefore, we see that not every input 
symbol need be connected to the output node by a path in the computation 
graph which corresponds to the action of a network on a particular input.
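
The two-step protocol just described is small enough to check exhaustively. The following Python sketch (function and variable names are hypothetical) records which processor finally writes to D.

```python
def relay(v):
    """Simulate the two-step protocol; return (value written to D, writer)."""
    assert v in (0, 1)
    r_b = r_c = 0                 # B and C initialize register r to zero
    # Step 1: A writes a one to B if v = 1, and a one to C if v = 0.
    if v == 1:
        r_b = 1
    else:
        r_c = 1
    # Step 2: whichever of B, C still holds zero writes to D.
    written = None
    if r_b == 0:
        written = (0, "B")        # B was never written to by A
    if r_c == 0:
        written = (1, "C")        # C was never written to by A
    return written

results = {v: relay(v) for v in (0, 1)}
```

In both cases D receives v from the processor that A did not write to, so the computation graph for that input contains no path from A to D.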
    s,b := x,1
    while b < n do
        s := x + Σ_{i = PID.d+1}^{PID.d+d} (s of processor i)
        b := b.d
    if PID > 0 then s := 0

    i       b   0            1             2       3    4    5    6    7
    0       1   x0           x1            x2      x3   x4   x5   x6   x7
    1       3   x0+x1+x2+x3  x1+x4+x5+x6   x2+x7   x3   x4   x5   x6   x7
    2       9   x0+...+x7    x1+x4+x5+x6   x2+x7   x3   x4   x5   x6   x7
    Result      x0+...+x7    0             0       0    0    0    0    0

Table 6.2.1. Trace of summation algorithm on 8 processors of arity 3. Table entry 
shows the value of b and variable s of each processor after i iterations.
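
A sequential Python sketch of this summation reproduces the trace for 8 processors of arity 3. The function name tree_sum is hypothetical, and the sketch assumes that reads addressed to processors numbered n or above contribute nothing, as the trace suggests.

```python
def tree_sum(xs, d):
    """Simulate the arity-d summation; return (s of processor 0, iterations)."""
    if d < 2:
        raise ValueError("the sketch assumes d >= 2")
    n = len(xs)
    s = list(xs)
    b, iterations = 1, 0
    while b < n:
        # One synchronous arity-d step: processor p reads s from processors
        # p*d+1 .. p*d+d (those that exist) and adds its own input x.
        s = [xs[p] + sum(s[i] for i in range(p * d + 1, p * d + d + 1) if i < n)
             for p in range(n)]
        b *= d
        iterations += 1
    return s[0], iterations

total, iterations = tree_sum([1, 2, 3, 4, 5, 6, 7, 8], d=3)
```

With eight inputs and d = 3 the loop runs twice, matching the values b = 3 and b = 9 in the table.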
Thus for D(n) ≥ 2, an n processor, arity D(n), degree D(n)+1 parallel machine 
can compute the sum of n integers in time O(log n / log D(n)). This algorithm is
asymptotically optimal for all machines of arity O(D(n)). In fact, we will show 
that an arity D(n) machine must take at least log n / log(D(n)+2) steps to sum n
integers.
Suppose M is a P(n) ≥ n processor parallel machine of arity D(n) which can
sum n integers in time T(n), and let x = <x_0,...,x_{n-1}> be an input string consisting
of n symbols, each of which is a non-negative integer. Let G_x be the directed 
graph with vertices (p,t), 0 ≤ p < P(n), 0 ≤ t ≤ T(n), and an edge from (p_1,t_1) to 
(p_2,t_2) if t_2 = t_1+1 and either p_1 = p_2 or during time-step t_2 of the computation of 
M on input x, either processor p_2 reads a value from p_1, or p_1 successfully writes 
a value to p_2. The ith symbol of x is said to be reachable if there is a path from 
vertex (i,0) to vertex (0,T(n)) in G_x. The reachable string is the string derived
from x by deleting all unreachable symbols. The unreachable string is similarly 
derived by deleting all reachable symbols.
Suppose the values to be added together are all less than N. We claim that 
(provided N is sufficiently large) there is an input string with all symbols 
reachable. For a contradiction, suppose that every input has at least one 
unreachable symbol. Fix a graph G_x, and consider the strings y such that G_x = G_y. 
Each reachable symbol of y can take on N possible values, giving a total of N^r 
possible reachable strings, where r < n is the number of reachable symbols. 
Further, for each reachable string, the corresponding unreachable string must 
sum to a fixed value, dependent only on the reachable string in question. This 
follows because M must give the same result for two inputs y_1, y_2 such that 
G_{y_1} = G_{y_2} and y_1, y_2 have identical reachable strings. Since m ≥ 1 non-negative 
integers can sum to a fixed value at most N^(m-1) times, we see that there are at 
most N^(n-r-1) unreachable strings which can appear with any reachable string, 
and thus there are at most N^(n-1) choices of y. That is, each graph G_x can be used 
for at most N^(n-1) different input strings x.
Let G(n) = |{G_x | x ∈ N^n}|. By the pigeonhole principle, at least one graph 
must be used for at least N^n / G(n) input strings. If N is chosen such that N > G(n) 
then this value is greater than N^(n-1), which contradicts the result of the previous 
paragraph. Thus there must be an input string for which all symbols are 
reachable. Since for all x, G_x has in-degree D(n)+2, this implies that
n ≤ (D(n)+2)^T(n), that is, T(n) ≥ log n / log(D(n)+2).
Unfortunately, this proof is based heavily on the use of extremely large 
integers as summands. Indeed, it may be necessary to choose N to be as large as 
P(n)^((D(n)+1).P(n).T(n)). Thus:
(1) If we insist that W(n) = T(n)^O(1) (which, as we saw in section 2, ensures that 
the parallel computation thesis holds), then the lower-bound is not valid.
(2) For machines with W(n) = n^O(1) (which is a reasonable restriction since it 
ensures that the input encoding is "concise" in the sense of [22]), the 
lower-bound holds provided P(n) = n^O(1).
(3) If the word-size is arbitrary, the lower-bound holds regardless of arity or 
number of processors. This is despite the fact that machines with large 
word-size are (as observed in section 3.5) exceptionally powerful.
6.3. A Constant-Degree Universal Machine
From the results in the previous section, it is apparent that high-arity 
machines are more powerful than those of constant degree. In this section we 
propose to show that they are not too much more powerful, in the sense that 
efficient constant-degree universal machines exist. We will concentrate mainly 
on a fixed-structure, restricted-access model (see section 6.1). Each dedicated 
processor will be initialized with the processor-identities of the neighbours of its 
processes. We shall see that slightly more efficient simulations are made 
possible by the presence of this extra information (which cannot be provided in 
a model with modifiable structure). In a fixed-structure model, it is quite 
reasonable to expect the user to provide this information (perhaps in the form 
of an easy-to-compute interconnection function, in which case its resource 
requirements should be added to the setup time for the universal machine), 
since it forms part of the specification which would be required by a fabrication 
device, should the network be realized in hardware.
Theorem 6.3.1 There is a P(n).D(n) processor universal parallel machine which 
can simulate any P(n) processor machine of arity-and-degree D(n), with delay 
O(log P(n) + D(n)) and setup time O(log^4 P(n) + D(n)).
Proof. (Outline). Suppose m = ⌈log P(n)⌉ and m' = ⌈log D(n)⌉. We describe our 
algorithm on an (m + m')-cube. Let M be a P(n) processor parallel machine with 
degree and arity D(n). Processor i, 0 ≤ i < P(n) of the universal machine will 
simulate processor i of M. Let i[d] denote the dth neighbour of processor i of M, 
in order of ascending PID. For 0 ≤ i < P(n) let W_i be the m'-cube consisting of 
processors 2^m.k + i, for 0 ≤ k < 2^m', of the universal machine.
As part of the initialization, each processor i, 0 ≤ i < P(n) will receive D(n) 
identification numbers I_i^{i[d]} for 0 ≤ d < D(n) such that:
(1) 0 ≤ I_i^{i[d]} < D(n) for all 0 ≤ i < P(n), 0 ≤ d < D(n), and
(2) For all 0 ≤ i,i',j < P(n), if I_i^j and I_{i'}^j are both defined, and I_i^j = I_{i'}^j then i = i'.
In particular, processor i will be the (I_i^{i[d]})th neighbour of processor i[d] in 
ascending order of PID.
This is achieved as follows. Processor i, 0 ≤ i < P(n) prepares D(n) packets 
(i[d],i), 0 ≤ d < D(n), and scatters them around the 2^m' ≥ D(n) processors of W_i, at 
most one packet per processor, using algorithm 4 of section 4.3. These packets 
are then sorted within the (m + m')-cube in lexicographic (first-field-first) order. 
Each processor j, 0 ≤ j < 2^(m+m') thus receives some packet (i_j[d],i_j). It then sets 
variable V to i_j[d]. Running algorithm 2 of section 4.3 on the (m + m')-cube 
computes the local rank of each processor, which in this case is I_i^{i[d]}. Armed 
with this information, for 0 ≤ i < P(n) the processor in charge of packet (i[d],i) 
transforms it into (i,i[d],I_i^{i[d]}). These packets are sorted back to their respective 
W_i's, and then gathered back into processor i by reversing the scattering 
algorithm.
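
The effect of the sort-and-rank step can be sketched sequentially in Python. This is only an illustration of what the identification numbers are, not of the hypercube algorithms of section 4.3; the function name and the adjacency-list input format are assumptions.

```python
def identification_numbers(neighbours):
    """Map (i, j) to the rank of i among the neighbours of j, in ascending PID.

    neighbours[i] lists the PIDs of the neighbours of processor i.
    """
    # One packet (i[d], i) per link, sorted first-field-first.
    packets = sorted((j, i) for i, adj in enumerate(neighbours) for j in adj)
    ident = {}
    rank, previous = 0, None
    for j, i in packets:
        # Local rank: position of the packet within the run sharing first field j.
        rank = rank + 1 if j == previous else 0
        previous = j
        ident[(i, j)] = rank
    return ident

# Triangle on processors 0, 1, 2.
ident = identification_numbers([[1, 2], [0, 2], [0, 1]])
```

Two distinct neighbours of the same processor j always receive different numbers, which is condition (2) above.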
After the initialization phase, each step of the simulation proceeds as 
follows. First, requests to read communication registers are fulfilled. Processor 
i, 0 ≤ i < P(n) prepares D(n) request packets (i[d],I_i^{i[d]},i), 0 ≤ d < D(n). These are 
scattered at most one-per-processor around the processors of W_i, using 
algorithm 2. Once this has been carried out, let π be the permutation which 
carries packet (i[d],I_i^{i[d]},i) to processor 2^m.I_i^{i[d]} + i[d], for 0 ≤ i < P(n). Once π has 
been applied, W_i contains the D(n) requests from the neighbours of processor i of 
M, 0 ≤ i < P(n). Processor i can then fulfill the D(n) requests by broadcasting the 
contents of the communication register of processor i of M around the 
processors of W_i using algorithm 1 of section 4.3. The fulfilled requests are 
routed back to their originating processors by reversing the above process. 
Processor i of the universal machine can then simulate the internal computation 
of processor i of M, 0 ≤ i < P(n). Finally, requests to write to communication 
registers are handled in a similar manner.
Repeating this t times enables us to simulate t steps of M. Note that π is 
the same for each step. It is well-known (see theorem 4.1.3) that a fixed 
permutation can be carried out in time O(log P(n)) by simulating one of 
Waksman's permutation networks. This requires O(log^4 P(n)) setup time, 
however (see theorem 4.1.4). The total setup time is thus comprised of:
(1) O(log^2 P(n) + D(n)) to compute the identification numbers I_i^{i[d]}. The log^2 term 
comes from sorting using a straightforward simulation (see theorem 4.1.5) 
of one of Batcher's sorting networks. The D(n) term comes from the use of 
algorithm 4 of section 4.3 to scatter D(n) values.
(2) O(log^4 P(n)) to set up π.
The delay is comprised of:
(1) O(D(n)) to prepare and scatter the request packets.
(2) O(log P(n)) to compute π.
(3) O(log D(n)) to fulfill the request packets, 
which gives us the required result. □
Corollary 6.3.2 There is an O(P(n).log P(n)) processor universal parallel machine 
which can simulate, with delay O(log P(n)), any P(n) processor parallel machine 
of arity and degree O(log P(n)).
The proof of theorem 6.3.1 can be modified slightly to give 
Theorem 6.3.3 There is a P(n).D(n) processor universal machine which can simulate
any P(n) processor, degree D(n), arity-1 machine, with delay O(log P(n)) and 
setup time O(log^4 P(n) D(n)).
Note that by making D(n) constant in theorem 6.3.3, and using the 
processor-saving theorems of section 4.4 we obtain a shorter proof of theorem 8 
of [21] (a P(n) processor universal machine which can simulate any P(n) 
processor, constant degree parallel machine with delay O(log P(n))). In section 
7.1 we will see that this result is optimal for this type of literal simulation 
(although in section 7.2 we will see that more efficient non-literal simulations 
are possible, using polynomially more processors). In comparison, corollary 
6.3.2 uses only log P(n) times as many processors to achieve the same delay for 
simulating machines of arity and degree O(log P(n)). Thus in a sense, networks 
with non-constant arity and degree are not much more powerful than those with 
constant arity and degree, provided the former are kept to a reasonable level.
In contrast, for a modifiable-structure machine we have:
Theorem 6.3.4 There is an A(n).P(n) processor, constant degree universal parallel
machine which can simulate, with delay O(A(n)+log P(n)), any P(n) processor 
machine of arity A(n).
Proof. Substitute the sorting technique of [41] for the permutation π in the 
proof of theorem 6.3.1. □
We should note that the improvement of theorem 6.3.4 stems from the use 
of the sorting algorithm of [2], so as such, the constant multiples are too large 
for the result to be of any practical use.
Finally, we turn to the simulation of high-arity machines by feasible 
networks. We can prove a theorem which is analogous to theorem 5.2.1 by 
having each arity-A(n) processor represented by a feasible network of A(n) 
processors. The delay of that theorem is increased by the time for these 
subnetworks to simulate local instructions. For fairly simple instruction-sets 
similar to that of section 6.1, this is an increase of only O(log A(n)), which is 
dominated by routing costs. The use of the techniques of theorem 6.3.1 thus 
leads to a delay of O(log P(n)) on a feasible network of S(n) processors.
6.4. Examples of High-Arity Algorithms
In section 6.2 we saw some examples of functions which can be computed 
faster by increasing arity. These examples were simple in nature, but sufficient 
to meet the lower-bounds of that section. In this section we give two slightly 
more difficult examples, designed to illustrate that the same speed-ups can be 
achieved for a much wider range of problems. As we have already observed, 
time-bounds for some routing problems can be reduced by simply increasing 
degree (while the arity remains fixed at 1).
Suppose we have n processors, and processor i, 0 ≤ i < n has a pair of input 
values (a_i,x_i). Suppose also that for 0 ≤ i,j < n:
(1) If 0 ≤ a_i,a_j < n and i ≠ j then a_i ≠ a_j, and 
(2) |a_i - a_j| ≤ |i - j|
For 0 ≤ i < n, we wish to route x_i to processor a_i. In [45,47], Nassimi and Sahni 
provide a 2^⌈log n⌉ processor algorithm which achieves this process (which they call 
concentration) in time O(log n). We made good use of this algorithm in corollary 
5.2.1, when constructing our universal feasible network. Our aim is to produce a
2^⌈log n⌉ processor, degree D(n) algorithm which runs in time O(log n / log D(n)).
Definitions. Suppose k ≥ d ≥ 0. Define the functions shuffle_k^d, unshuffle_k^d: N → N and 
exchange_k^d: N^2 → N by
    shuffle_k^d(i) = ⌊i / 2^(k-d)⌋ + (i mod 2^(k-d)).2^d
    unshuffle_k^d(i) = shuffle_k^(k-d)(i) 
    exchange_k^d(i,j) = ⌊i / 2^d⌋.2^d + j
Suppose k ≥ d ≥ 0. The d-way shuffle graph S_k^d is defined as follows (a similar 
graph appears in [69]). S_k^d has vertex-set {0,1,...,2^k - 1}, and each vertex v is
joined to vertices:
(1) shuffle_k^d(v)
(2) unshuffle_k^d(v)
(3) exchange_k^d(v,j), 0 ≤ j < 2^d
(4) shuffle_k^(k mod d)(v)
(5) unshuffle_k^(k mod d)(v)
S_k^d has 2^k vertices and degree at most 2^d + 4. Figure 6.4.1 shows the d-way shuffle 
for d = 2, k = 3.
Figure 6.4.1. The 8 vertex 2-way shuffle, S_3^2. Shuffle and unshuffle links appear 
above the vertices; the remainder are exchange links.
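
Reading the three functions as bit operations on k-bit numbers (shuffle rotates left by d bits, unshuffle rotates right by d bits, exchange replaces the low d bits), the graph can be sketched in Python as below. The reading of the definitions and the helper names are assumptions made for illustration.

```python
def shuffle(i, k, d):
    # Rotate the k-bit number i left by d bits.
    return (i >> (k - d)) + (i % (1 << (k - d))) * (1 << d)

def unshuffle(i, k, d):
    # Rotating left by k-d bits is the same as rotating right by d bits.
    return shuffle(i, k, k - d)

def exchange(i, j, d):
    # Replace the d low-order bits of i by j, 0 <= j < 2**d.
    return ((i >> d) << d) + j

def neighbours(v, k, d):
    """Vertices joined to v in the d-way shuffle graph on 2**k vertices."""
    e = k % d
    links = {shuffle(v, k, d), unshuffle(v, k, d),
             shuffle(v, k, e), unshuffle(v, k, e)}
    links.update(exchange(v, j, d) for j in range(1 << d))
    return links

degrees = [len(neighbours(v, 3, 2)) for v in range(8)]
```

For d = 2, k = 3 every vertex has at most 2^d + 4 = 8 links, in line with the degree bound stated above.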
The following is a 2^k processor algorithm based on the d-way shuffle, for 
concentrating 2^k items in time O(k/d). Suppose initially variables a and x of 
processor i contain a_i and x_i respectively. Each processor has an extra register 
r to help with the transfer of data. The latter is achieved with the help of a 
special procedure, defined as follows.
    Procedure transfer(p) 
        r := 0
        if a < n then (a,x,r) of processor p := (a,x,1) 
        if r = 0 then a := n
The main body of the algorithm is then:
    for b := 1 to ⌊k/d⌋ do
(1)     transfer(exchange_k^d(PID, a mod 2^d))
(2)     transfer(unshuffle_k^d(PID))
(3)     if a < n then a := unshuffle_k^d(a)
(4) transfer(exchange_k^(k mod d)(PID, a mod 2^(k mod d)))
(5) transfer(unshuffle_k^(k mod d)(PID))
Table 6.4.1 shows a trace of this algorithm for d = 2, k = 3 and a_1 = 0, a_3 = 1, 
a_4 = 2, a_7 = 3 and x_i = i, 0 ≤ i ≤ 7, with all other a values equal to 8. Those 
processors with a-values equal to 8 have an empty entry in the table.

                          (a,x) of processor
    Step    0      1      2      3      4      5      6      7
    Input          (0,1)         (1,3)  (2,4)                (3,7)
    (1)     (0,1)  (1,3)                              (2,4)  (3,7)
    (2)     (0,1)         (1,3)                (2,4)         (3,7)
    (3)     (0,1)         (2,3)                (4,4)         (6,7)
    (4)     (0,1)         (2,3)         (4,4)         (6,7)
    (5)     (0,1)  (2,3)  (4,4)  (6,7)
    Output  1      3      4      7

Table 6.4.1. Trace of concentration algorithm for d = 2, k = 3. Table entry shows the 
values of variables a and x of each processor initially (at input) and after each 
labelled step of the algorithm.
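
The trace can be reproduced with a short sequential simulation in Python. This is a sketch under stated assumptions: the procedure transfer is modelled by moving each occupied (a,x) pair to its target processor, empty processors are simply absent from the dictionary, and all names are hypothetical.

```python
def shuffle(i, k, d):
    return (i >> (k - d)) + (i % (1 << (k - d))) * (1 << d)

def unshuffle(i, k, d):
    return shuffle(i, k, k - d)

def exchange(i, j, d):
    return ((i >> d) << d) + j

def transfer(items, target):
    """Every occupied processor writes its (a, x) pair to processor target(PID, a)."""
    moved = {}
    for pid, (a, x) in items.items():
        p = target(pid, a)
        assert p not in moved, "two pairs were sent to the same processor"
        moved[p] = (a, x)
    return moved

def concentrate(items, k, d):
    """items maps PID -> (a, x) for the processors whose a-value is below 2**k."""
    e = k % d
    trace = [dict(items)]
    for _ in range(k // d):
        items = transfer(items, lambda pid, a: exchange(pid, a % (1 << d), d))
        trace.append(items)                                   # step (1)
        items = transfer(items, lambda pid, a: unshuffle(pid, k, d))
        trace.append(items)                                   # step (2)
        items = {p: (unshuffle(a, k, d), x) for p, (a, x) in items.items()}
        trace.append(items)                                   # step (3)
    items = transfer(items, lambda pid, a: exchange(pid, a % (1 << e), e))
    trace.append(items)                                       # step (4)
    items = transfer(items, lambda pid, a: unshuffle(pid, k, e))
    trace.append(items)                                       # step (5)
    return trace

trace = concentrate({1: (0, 1), 3: (1, 3), 4: (2, 4), 7: (3, 7)}, k=3, d=2)
final_x = {p: x for p, (a, x) in trace[-1].items()}
```

As in the table, the stored a-values are left in their rotated form at the end; only the position of each x matters for concentration.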
The correctness of the algorithm follows from exactly the same argument as 
used by Nassimi and Sahni in [47]; essentially, every one of our parallel data 
transfers achieves the same result as d of theirs.
In section 6.2 we gave an arity D(n) algorithm for summing n items in time 
O(log n / log D(n)). In fact, we can compute the parallel prefix sum in asymptotically
the same amount of time. Let "~" denote an arbitrary associative binary
operation. Let prefix: N^n → N^n be defined by prefix(x_0,...,x_{n-1}) = (y_0,...,y_{n-1})
where y_0 = x_0 and for 1 ≤ i < n, y_i = y_{i-1} ~ x_i. Then the parallel prefix problem is the 
problem of computing the function prefix using a parallel machine whose 
processors can compute "~" in unit time. Nassimi and Sahni [45] give a 2^⌈log n⌉ 
processor algorithm (which they call Rank, using the operation of integer 
addition for "~") for the parallel prefix problem, which runs in time O(log n) and 
has degree 3 and arity 1.
Ladner and Fischer [38] provide a parallel prefix circuit which has depth 
O(log n), and uses O(n) two-input "~"-gates. An easy generalization gives us a 
parallel prefix circuit of depth O(log n / log d), using O(n) d-input "~"-gates. Let P_d(n)
denote an n-input prefix circuit built from gates of arity d. Figure 6.4.2 shows a 
recursive construction of P_d(n) from P_d(⌈n/d⌉-1).
Figure 6.4.2. Recursive construction of an n-input prefix circuit from one with 
⌈n/d⌉-1 inputs. Solid dots are connections, circles denote "~"-gates.
Let T_d(n) and Z_d(n) denote the depth and number of "~"-gates in an arity-d 
prefix circuit, respectively. Clearly T_d(n) ≤ T_d(⌈n/d⌉-1)+2, and thus
T_d(n) = O(log n / log d). Also, Z_d(n) ≤ (n-⌈n/d⌉)+Z_d(⌈n/d⌉-1)+(n-⌈n/d⌉-d+1), from
which we can conclude that Z_d(n) ≤ 2(n-1) - (d-1).(log n / log d), and so Z_d(n) = O(n) as
claimed. Prefix circuits using gates with unbounded fan-in have been studied in 
detail by Chandra, Fortune and Lipton [10,11]. Note that our results cannot be 
compared with theirs, since they use the number of wires as a measure of the 
size of a circuit. Our construction uses many more wires than theirs, but a large 
number of the wires carry the same value. This is sufficient for us to be able to 
construct a high-arity parallel machine for the parallel prefix problem.
Definition. Suppose 2 ≤ d ≤ n. Define the function g_d: N → N by
    g_d(i) = ⌊i/d⌋                              if i mod d = d-1
    g_d(i) = ⌈n/d⌉ + ⌊i/d⌋.(d-1) + (i mod d)    otherwise
The following algorithm uses n processors of arity d, degree d+1, and runs 
in time O(log n / log d). Assume x_i is initially held in variable x of processor i, 0 ≤ i < n.
(0) b := n 
    while b > 1 do
(1)     if PID < b then
            x := (x of processor ⌊PID/d⌋.d) ~ ... ~ (x of processor PID)
(2)     x of processor g_d(PID) := x
(3)     b := ⌊b/d⌋ 
    while b < n do
(4)     x := x of processor g_d(PID)
(5)     b := b.d
(6)     if (PID < b) and (PID mod d ≠ d-1) and (⌊PID/d⌋ > 0) 
            then x := x ~ x of processor (⌊PID/d⌋.d - 1)
Table 6.4.2 shows a trace of this algorithm for n = 8 and d = 2, and figure 6.4.3 
shows the interconnection graph used by the algorithm for those values.
Figure 6.4.3. Interconnection pattern used by the parallel prefix algorithm for 
n = 8, d = 2. The arrows point from vertex i to vertex g_d(i), 0 ≤ i < 8.

    Step  b    0      1      2      3      4      5      6      7
    (1)   8    [0,0]  [0,1]  [2,2]  [2,3]  [4,4]  [4,5]  [6,6]  [6,7]
    (2)   8    [0,1]  [2,3]  [4,5]  [6,7]  [0,0]  [2,2]  [4,4]  [6,6]
    (3)   4    [0,1]  [2,3]  [4,5]  [6,7]  [0,0]  [2,2]  [4,4]  [6,6]
    (1)   4    [0,1]  [0,3]  [4,5]  [4,7]  [0,0]  [2,2]  [4,4]  [6,6]
    (2)   4    [0,3]  [4,7]  [2,2]  [6,6]  [0,1]  [4,5]  [0,0]  [4,4]
    (3)   2    [0,3]  [4,7]  [2,2]  [6,6]  [0,1]  [4,5]  [0,0]  [4,4]
    (1)   2    [0,3]  [0,7]  [2,2]  [6,6]  [0,1]  [4,5]  [0,0]  [4,4]
    (2)   2    [0,7]  [6,6]  [4,5]  [4,4]  [0,3]  [2,2]  [0,1]  [0,0]
    (3)   1    [0,7]  [6,6]  [4,5]  [4,4]  [0,3]  [2,2]  [0,1]  [0,0]
    (4)   1    [0,3]  [0,7]  [2,2]  [6,6]  [0,1]  [4,5]  [0,0]  [4,4]
    (5)   2    [0,3]  [0,7]  [2,2]  [6,6]  [0,1]  [4,5]  [0,0]  [4,4]
    (6)   2    [0,3]  [0,7]  [2,2]  [6,6]  [0,1]  [4,5]  [0,0]  [4,4]
    (4)   2    [0,1]  [0,3]  [4,5]  [0,7]  [0,0]  [2,2]  [4,4]  [6,6]
    (5)   4    [0,1]  [0,3]  [4,5]  [0,7]  [0,0]  [2,2]  [4,4]  [6,6]
    (6)   4    [0,1]  [0,3]  [0,5]  [0,7]  [0,0]  [2,2]  [4,4]  [6,6]
    (4)   4    [0,0]  [0,1]  [2,2]  [0,3]  [4,4]  [0,5]  [6,6]  [0,7]
    (5)   8    [0,0]  [0,1]  [2,2]  [0,3]  [4,4]  [0,5]  [6,6]  [0,7]
    (6)   8    [0,0]  [0,1]  [0,2]  [0,3]  [0,4]  [0,5]  [0,6]  [0,7]

Table 6.4.2. Trace of the parallel prefix algorithm for n = 8, d = 2. Table entry shows 
contents of variable x of each processor after each labelled step of the algorithm. 
[a,b] denotes x_a ~ x_{a+1} ~ ... ~ x_b.
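
The parallel prefix algorithm can be simulated sequentially in Python as follows. This is a sketch under stated assumptions: step (1) is read as combining the values of the processor's group of d from the start of the group up to the processor itself, step (6) places the neighbour's value on the left so that non-commutative operations come out in index order, and the sketch has only been checked for the small cases shown. The function names are hypothetical.

```python
from functools import reduce
from itertools import accumulate
import operator

def g(i, n, d):
    """The permutation used in steps (2) and (4)."""
    if i % d == d - 1:
        return i // d
    return -(-n // d) + (i // d) * (d - 1) + i % d   # -(-n // d) is ceil(n/d)

def parallel_prefix(xs, d, op):
    n = len(xs)
    x = list(xs)
    b = n
    while b > 1:
        # (1) processors below b combine their group, from its first member up to PID.
        x = [reduce(op, x[(p // d) * d: p + 1]) if p < b else x[p] for p in range(n)]
        # (2) every processor writes its x to processor g(PID).
        y = list(x)
        for p in range(n):
            y[g(p, n, d)] = x[p]
        x = y
        b //= d                                       # (3)
    while b < n:
        x = [x[g(p, n, d)] for p in range(n)]         # (4)
        b *= d                                        # (5)
        # (6) combine with the total held by the last processor of the previous group.
        x = [op(x[(p // d) * d - 1], x[p])
             if p < b and p % d != d - 1 and p // d > 0 else x[p]
             for p in range(n)]
    return x

letters = parallel_prefix(list("abcdefgh"), 2, operator.add)
sums = parallel_prefix(list(range(9)), 3, operator.add)
```

String concatenation is used in the first call because it is associative but not commutative, so it exposes any error in the order of combination.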
Chapter 7
More on Universal Machines
We close the main body of this thesis with a final look at some universal 
networks. The first section is devoted to lower-bounds on literal simulations. 
The delay of corollary 3.2.3 is easily seen to be asymptotically optimal for a 
literal simulation. However, no such elementary lower-bound can be found for a 
simulation which is not literal, as is demonstrated by the existence of a 
nondeterministic universal machine which has constant average delay. We find 
that the delay of theorem 6.3.1 is optimal for a strongly-literal simulation of 
degree-3, arity-1 networks.
In the second section we find that the latter lower-bound can be beaten by a 
non-literal simulation, by giving a simplified presentation of a result of Meyer auf 
der Heide [33]. The third section considers oblivious universal machines. A 
literal simulation is said to be oblivious (after Borodin and Hopcroft [8]) if the 
routes taken by data packets sent in response to read or write requests depend 
only upon their respective sources and destinations. By extending the work of 
Borodin and Hopcroft [8] and Lang [39] we obtain asymptotically matching
upper and lower-bounds of 8 ( log P(n)) for the delay required for an
oblivious simulation of a P(n) processor network on a P'(n) processor, constant- 
degree universal machine.
7.1. Some Lower Bounds
In section 8.2 we saw several examples of an S(n)-processor feasible 
network which can perform an O(log S(n)) delay literal simulation of any S(n) 
space-bounded network (see section 3.4 for definitions). It is easy to see that 
this delay is optimal for a literal simulation. For suppose T:N→N is such that
T(n)<n. Consider the n-processor machine M with the following program, where x 
of processor i is initially the i-th piece of input, 0 ≤ i < n.
    y:=x
    for i:=0 to T(n)-1 do 
        y:=y + (x of processor i)
M runs in time O(T(n)), yet every constant-degree universal machine must 
take time Ω(T(n).log n) to perform a literal simulation of M (no matter how many 
processors are available), even if the universal machine is allowed to have more 
than a constant number of registers per processor. For if there are at most
O(n/log n) dedicated processors then one processor must be looking after
Ω(log n) registers of M, which requires time Ω(log n) to keep up to date 
(assuming the universal machine has asymptotically the same arity as the 
simulated machine). Otherwise, since the simulation is literal, the contents of 
the requisite register must be broadcast to the Ω(n/log n) other dedicated 
processors during each iteration, which takes time Ω(log n) on a 
constant-degree machine (see section 6.2).
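The behaviour of machine M can be sketched sequentially as follows. This is only an illustration: the helper name `run_machine_m` is hypothetical, and a single loop plays the role of all n processors. It shows why a literal simulation generates heavy traffic: in every iteration all n processors read the same register.

```python
def run_machine_m(x, t):
    """Sequential sketch of machine M (hypothetical helper, not from the thesis).

    x: list of inputs; x[i] is register x of processor i.
    t: the value T(n), assumed to satisfy t < len(x).
    Returns the final y register of every processor.
    """
    n = len(x)
    assert t < n
    y = list(x)                      # y := x in every processor
    for i in range(t):               # for i := 0 to T(n)-1
        # every processor reads the same register (x of processor i),
        # so its contents must reach all n processors in this step
        y = [y[p] + x[i] for p in range(n)]
    return y
```

For example, `run_machine_m([1, 2, 3, 4], 2)` adds x of processor 0 and then x of processor 1 to every y.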
In section 5.2 we also saw that a P(n) processor universal machine can 
achieve delay O(log P(n)) when simulating a P(n) processor restricted-access 
network. Machine M above also serves to give us a matching lower-bound in this 
case. Note that these lower-bounds rely on two important facts: the limited 
data-carrying capacity of constant-degree networks, and the fact that a literal 
simulation creates a large amount of traffic. If we relax the requirement that 
the simulation be literal, then no such simple lower-bound technique is 
available. For example a nondeterministic universal machine can achieve a 
constant average delay.
We define a nondeterministic network similarly to the deterministic model 
of section 2.1, with the following modifications:
(1) Two extra instructions are allowed,
(a) $r_i \leftarrow \mathrm{random}(r_j)$, and
(b) fail.
The former assigns to register $r_i$ a value between 0 and the contents of $r_j$, 
and is called a guess. The latter is a special kind of halt instruction. A 
processor which has executed it is said to have failed.
(2) A computation is said to succeed if no processor has failed. A 
nondeterministic parallel machine M is said to compute a relation $R \subseteq N^* \times N^*$ 
if for every input x, $\langle x,y \rangle \in R$ iff there is a sequence of guessed values such 
that M succeeds and produces output y. Resources are defined in the 
obvious manner.
Theorem 7.1.1 There is a P(n).log P(n) processor, constant-degree 
nondeterministic universal parallel machine which can simulate any T(n) time, 
P(n) processor bounded nondeterministic restricted-access network in time 
O(T(n)+log P(n)).
Proof. Suppose M is a T(n) time, P(n) processor bounded restricted-access 
network. For the present, assume T(n) = Ω(log P(n)). Fix n, and let 
$m = \lceil \log P(n) \rceil$, $m' = \lceil \log m \rceil$. We will describe our algorithm on an 
(m+m'+2)-cube. Processor i of the universal machine will simulate process i, 
0 ≤ i < P(n). The algorithm consists of phases, each of which corresponds to 
m steps of M.
The first phase proceeds as follows. Suppose at the t-th step of M, 1 ≤ t ≤ m, 
process i wishes to read the communication register of process $j_t$. Instead of 
obtaining the correct value from processor $j_t$, it nondeterministically guesses 
some value $d_t$ which it uses instead, having recorded it, along with the value $j_t$, 
for later verification. The m values $c_t$, 1 ≤ t ≤ m, where $c_t$ denotes the contents of 
the communication register of processor i at time t, are also recorded. The
effects of possible write-attempts are also guessed and recorded in a similar 
manner.
For each i such that 0 ≤ i < P(n) let $W_i$ be the (m'+2)-cube consisting of 
processors $2^m j + i$, $0 \le j < 2^{m'+2}$. Having simulated m steps with guessed data 
values, the verification procedure is as follows. Processor i, 0 ≤ i < P(n), prepares 
m read-request packets $(j_t, t)$, each being a request for the contents of the 
communication register of process $j_t$ at time t. It also prepares m data packets 
$(i, t, c_t)$, and similarly m write-request packets. These are scattered around the 
$2^{m'+2} \ge 4m$ processors of $W_i$, at most one packet per processor, using algorithm 4 
of section 4.3. The request packets are fulfilled using the techniques of theorem 
5.1.1. The guessed data values are then compared to the fulfilled requests, and 
any processor which detects a discrepancy fails immediately.
Thus m steps of M can be simulated in time O(log P(n)+m) = O(m). Note 
that O(P(n).log P(n)) items can be sorted in time O(log P(n)) using 
O(P(n).log P(n)) processors by guessing a set of switch positions of the Waksman 
permutation network (see theorem 4.1.3), and verifying afterwards that the
permuted values are in sorted order. By repeating this for $\lceil T(n)/m \rceil$ phases we 
are able to simulate T(n) steps of M in time O(T(n)) as required. A set-up time of 
O(log P(n)) is required to broadcast the program of the simulated machine, 
using algorithm 1 of section 4.3. This extra term in the time-bound also takes 
care of the case when T(n) = o(log P(n)). □
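The guess-then-verify pattern used in the proof (guess an answer, then check it and fail on any discrepancy) can be illustrated with the sorting step. The following is a minimal sketch under stated assumptions: nondeterminism is modelled by trying every guess in turn, and a guess is taken to be a whole permutation rather than a setting of Waksman network switches. The function names are hypothetical.

```python
from itertools import permutations


def verify_sorted(values):
    """Verification step: check that the permuted values are in sorted order."""
    return all(values[k] <= values[k + 1] for k in range(len(values) - 1))


def nondeterministic_sort(items):
    """Return the outputs of all succeeding computations.

    Each candidate permutation stands for one sequence of guessed values;
    a branch whose verification does not hold is treated as having failed.
    """
    outcomes = set()
    for guess in permutations(range(len(items))):
        permuted = [items[g] for g in guess]
        if verify_sorted(permuted):          # otherwise this branch fails
            outcomes.add(tuple(permuted))
    return outcomes
```

Every succeeding branch yields the same sorted output, which is what allows the guessed permutation to stand in for an explicit sorting computation.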
In section 6.3 we saw a very special kind of literal simulation of a P(n) 
processor, constant-degree, restricted-access network M on a universal machine 
U. This had the property that
(1) Each processor i of M has a dedicated processor $d_i$ in U.
(2) This dedicated processor looks after all registers of processor i of M.
(3) $d_i \ne d_j$ for $i \ne j$.
(4) The initial dedicated-processor assignment is the same for all simulated 
machines.
(5) The dedicated processor assignment does not change with time.
Under these very strict conditions, a delay of O(log P(n)) was achieved using 
only P(n) processors. We will call simulations with property (5) strongly-literal. 
Meyer auf der Heide [31] has shown that the above delay is optimal for a 
strongly-literal simulation of a constant-degree, restricted-access network. We 
can strengthen this to show that the delay is optimal even for the simulation of 
networks with degree 3.
Theorem 7.1.2 A strongly-literal universal parallel machine with $P(n)^{\alpha P(n)}$ 
processors, where $\alpha < \frac{1}{2}$, and degree D(n) must have delay 
$\Omega(\log P(n) / \log D(n))$ when simulating a P(n) processor, degree-3 network.
Proof. Suppose we have an m-processor, degree-d universal parallel machine 
which can carry out a strongly literal simulation of any degree-3, p-processor
network with delay k. We will show that $k = \Omega(\log p / \log d)$ when simulating a 
special kind of machine whose interconnection pattern is a degree-3 graph called a 
matched-cycle. A p-vertex matched-cycle has vertex-set {0,1,...,p−1} and
(1) Vertex v is joined to vertices (v ± 1) mod p (cycle links).
(2) The remaining edges form a graph of degree 1 (that is, they constitute part 
of a matching).
Let M be a network with one register per processor, whose interconnection 
pattern is a matched cycle. Each processor i of M, 0 ≤ i < p, is assigned a 
dedicated processor $d_i$ in the universal machine. Without loss of generality we
will assume that each processor of the universal machine is to be assigned to at 
most one processor of M (for each multiply-assigned dedicated processor of U 
can be replaced by a ring of distinct dedicated processors, without disturbing 
the time or processor bounds in the statement of the theorem). Let 
$N$ = the number of matched-cycle graphs,
$N_1$ = the number of dedicated processor assignments which work for at least one 
matched-cycle graph, and
$N_2$ = the maximum number of matched-cycles for which any given assignment 
can be used.
Claim 1. $N \ge \frac{p!}{2^{3p/2}\,(p/2)!}$.
Without loss of generality suppose p is even. Then there are 
$(p-1)(p-3)(p-5)\cdots 1 = \frac{p!}{2^{p/2}\,(p/2)!}$ matchings on p vertices. At most $2^p$ of 
these matchings can give rise to the same matched-cycle (by filling in the missing 
cycle edges), so there are at least $\frac{p!}{2^{3p/2}\,(p/2)!}$ matched-cycles.
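The count of matchings used in claim 1, namely that p vertices (p even) admit (p−1)(p−3)...1 = p!/(2^{p/2} (p/2)!) perfect matchings, can be checked by brute force for small p. This is a sketch for illustration only; the function names are hypothetical.

```python
from math import factorial


def matchings_product(p):
    """(p-1)(p-3)...1 for even p."""
    result = 1
    for f in range(p - 1, 0, -2):
        result *= f
    return result


def matchings_closed_form(p):
    """p! / (2^(p/2) * (p/2)!) for even p."""
    return factorial(p) // (2 ** (p // 2) * factorial(p // 2))


def count_perfect_matchings(vertices):
    """Brute-force count: pair the first vertex with each remaining vertex."""
    if not vertices:
        return 1
    rest = vertices[1:]
    total = 0
    for k in range(len(rest)):
        total += count_perfect_matchings(rest[:k] + rest[k + 1:])
    return total
```

All three functions agree for small even p, for instance giving 105 for p = 8.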
Claim 2. $N_1 \le m \cdot d^{k(p-1)}$.
If a particular processor assignment is to work for a matched-cycle, then 
processor $d_i$ must be at distance at most k from $d_{(i+1) \bmod p}$ in the 
interconnection pattern of the universal machine, 0 ≤ i < p. Thus there are m 
choices for $d_0$, but at most $d^k$ choices for $d_1$, and similarly $d^k$ choices for each of 
$d_2, d_3, \ldots, d_{p-1}$.
Claim 3. $N_2 \le d^{kp}$.
Fix a processor assignment. Consider the machines M for which that processor 
assignment works. Each processor i of M can be adjacent to the (at most) $d^k$
processors j such that $d_i$ and $d_j$ are at distance at most k in the interconnection 
pattern of the universal machine. Thus each vertex in the interconnection 
pattern of M can be adjacent to at most $d^k$ other vertices via a matching link.
If the universal machine is to simulate all matched-cycles within the stated 
resource-bounds, then we must have $N_1 \cdot N_2 \ge N$. Thus
$$\frac{p!}{2^{3p/2}\,(p/2)!} \le m \cdot d^{k(2p-1)}$$
i.e. $\frac{1}{2}\, p \log p \le k(2p-1) \log d + \log m + O(p)$.
Hence if $m \le p^{\alpha p}$ for some $\alpha < \frac{1}{2}$ we see that $k = \Omega(\log p / \log d)$. □
Thus a constant-degree universal machine must have delay Ω(log P(n)). 
Here is yet another approach to the unit-cost hypothesis. It is valid to charge 
one time-step for an internal computation which takes time O(log P(n)) on the 
instruction-set of the universal machine.
7.2. A Non-Literal Simulation
In section 7.1 we saw an Ω(log P(n)) lower-bound on the delay of a strongly-
literal simulation of a P(n) processor, constant-degree restricted-access 
network. Here we will see that relaxing the literalness condition allows a more 
efficient simulation of fixed-structure machines. In a literal simulation there is 
ample opportunity during the simulation of a single step for the data to be 
routed from the dedicated processors in response to read or write requests. In 
a non-literal (but step-wise) simulation, this information may start out from the 
dedicated processors at an earlier point in time, being kept up-to-date along the 
way by auxiliary processors. Using this technique, Meyer auf der Heide [33] 
obtains a constant delay (on average) for the simulation of constant-degree, 
fixed-structure, restricted-access machines. The following is a much-simplified 
presentation.
Theorem 7.2.1 There is a constant-degree universal machine with $P(n)^{1+\varepsilon}$ 
processors for any ε > 0 which can simulate any P(n) processor, T(n) time-bounded, 
constant-degree, fixed-structure, restricted-access network in time 
$O(T(n) + \log^4 P(n))$.
Proof. (Sketch). Suppose the machine to be simulated has degree d. Without 
loss of generality we can assume that it communicates by reads alone. The 
universal machine has P(n) dedicated processors, one for each process. Each 
dedicated processor is the root of a complete binary tree of depth $t \lceil \log d \rceil$, 
where t>0 is some value to be determined later. Vertices at depth $i \lceil \log d \rceil$, 
0 ≤ i ≤ t, are said to be on the i-th level. The dedicated processors are thus on the 
0th level.
The simulation will proceed in $\lceil T(n)/t \rceil$ phases, each corresponding to t 
steps of the simulated machine. The trees will be initialized so that each 
processor on the i-th level will be attempting to simulate one of the processes 
which are adjacent to the process of its predecessor on the (i−1)-th level. Each 
process thus has many processors attempting to simulate it. A request from a 
process in a processor on the i-th level, 0 ≤ i < t, to read the communication 
register of one of its neighbours is passed on to whichever (i+1)-th-level successor 
of that processor is attempting to simulate that neighbour. A request by a 
process in a processor on the t-th level to read the communication register of one 
of its neighbours is ignored.
Thus after i steps have been simulated, the processors on level t−i+1 have 
probably been led astray in their simulation of a process by being misinformed 
by processors on the next level. All other processors have simulated correctly. 
After t steps, only the dedicated processors can be guaranteed to have not 
deviated. This part of the simulation takes O(t) steps.
Meanwhile, the dedicated processors have been saving the communication 
register contents of their processes at each of the t simulated time-steps. 
These t values are to be routed to the processors on all levels which are 
attempting to simulate the same process. Armed with this information these 
processors can re-compute the last t steps internally, and get back to a correct 
state. The trees are then ready to simulate another t steps without further 
initialization.
Suppose all processors of the trees are at the head of a distinct sub-cube of 
$2^{\lceil \log t \rceil}$ processors, and that further edges are added to make the whole structure 
into a multidimensional cube (with embedded trees) of $2^{\lceil \log t \rceil} \cdot P(n) \cdot d^t$ processors. 
The correction stage can be carried out by having each level-i processor, 1 ≤ i ≤ t, 
prepare t requests for the correct communication register contents at each of 
the t steps of the phase. The dedicated processors prepare t packets which 
provide this information. We then
(1) Scatter them around the sub-cubes in time O(t) using algorithm 4 of section
4.3.
(2) Permute the requests and data into sorted order. Note that the 
permutation is the same for each phase, so theorem 4.1.3 can be applied to 
give time O(t+log P(n)).
(3) Use the techniques of corollary 5.2.1 to satisfy these requests in 
time O(t+log P(n)).
(4) Gather the satisfied requests back by reversing (1) and (2).
This gives a time-bound of O(t+log P(n)) to simulate t steps. A total of 
$O(P(n) \cdot t \cdot d^t)$ processors are needed. Choosing $t = \varepsilon \log_d P(n) - \log_d \log_d P(n)$ for 
some ε>0 gives a constant average delay using $O(P(n)^{1+\varepsilon})$ processors. The 
set-up time consists of:
(a) Time to assign processes to processors at each level of the trees.
(b) Initialization of the permutation used to sort the requests in part (2) above.
(c) Distribution of inputs, outputs and the program of the simulated machine. 
The process in (a) can be achieved level-by-level starting at the dedicated 
processors, at a cost of $O(\log^2 P(n))$ per level for routing the identities of the new 
processes from the dedicated processors by use of sorting. This cost of 
$O(\log^3 P(n))$, and the cost of (c), is dominated by the $O(\log^4 P(n))$ required in 
theorem 4.1.4 for (b).
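The processor count and average delay claimed for this choice of t can be checked directly. The following worked calculation is a sketch, assuming that the logarithms in the choice of t are taken to base d, so that $t = \varepsilon \log_d P(n) - \log_d \log_d P(n)$, and that d and ε are constants.

```latex
\begin{align*}
d^{t} &= d^{\,\varepsilon \log_d P(n) - \log_d \log_d P(n)}
       = \frac{P(n)^{\varepsilon}}{\log_d P(n)}, \\
P(n) \cdot t \cdot d^{t}
      &\le P(n) \cdot \varepsilon \log_d P(n) \cdot \frac{P(n)^{\varepsilon}}{\log_d P(n)}
       = \varepsilon \, P(n)^{1+\varepsilon}, \\
\frac{O(t + \log P(n))}{t} &= O(1)
       \quad \text{since } t = \Theta(\log P(n)).
\end{align*}
```

The first two lines give the $O(P(n)^{1+\varepsilon})$ processor bound, and the third gives the constant average delay per simulated step.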
All algorithms which use the cube-part of the interconnection pattern are 
composite, and thus the interconnection pattern can be thought of as a 
shuffle-exchange or cube-connected-cycles with embedded trees, of degree 6. □
7.3. Oblivious Simulations
We complete this chapter by considering a very strict form of a strongly 
literal simulation of a restricted-access network, which we shall call oblivious. 
Consider a single step of a strongly literal simulation with dedicated processors 
$d_i$. If process i wishes to read the contents of the communication register of 
process j, then during this simulated time-step the required value can be 
provided by dedicated processor $d_j$ and routed to $d_i$. (Similarly, if process i 
wishes to write into the communication register of process j, the value can be 
routed from $d_i$ to $d_j$). If the routes taken by these data items depend solely on 
the source $d_j$ and the destination $d_i$ (respectively $d_i$ and $d_j$ in the write-mode 
case), then the strongly literal simulation is said to be oblivious. If in addition 
the next step in the route depends solely on the current location of the data 
packet and the eventual destination, then it is said to be source-oblivious.
The following lower bound is a generalization of theorem 1 of [8].
Theorem 7.3.1 An oblivious simulation of a P(n) processor parallel machine on a 
constant-degree, P'(n) processor universal parallel machine must have delay
$\Omega(P(n)/\sqrt{P'(n)} + \log P(n))$.
Proof. Fix n, and let p = P(n), p' = P'(n). Suppose the universal machine has 
degree d, dedicated processors $d_i$, 0 ≤ i < p, and interconnection graph G.
For 0 ≤ i, j < p let $R_{ij}$ be the path in G corresponding to the route taken by a 
data packet sent from processor $d_j$ to processor $d_i$ of the universal machine in 
response to a request by process i to read the communication register of 
process j. Since the simulation is to be oblivious, these paths are invariant with 
time. Note that a path may consist of a single vertex (in the case where $d_i = d_j$), 
and that two paths may coincide along part, or even all, of their length. For 
0 ≤ i < p let $G_i$ be the graph obtained from G by removing all edges which do not 
lie on some route $R_{ij}$, for 0 ≤ j < p.
Suppose k ≥ 0. For each i, 0 ≤ i < p, construct a set of vertices $V_i$ from the 
vertices of $G_i$ as follows. Initially $V_i$ consists solely of vertex $d_i$ (the destination of 
the routes which comprise $G_i$). Repeat the following until no new vertices are 
added: if $|\{ j \mid v \text{ lies on } R_{ij} \}| \ge k$ (i.e. v lies on at least k routes in $G_i$) and v is 
adjacent to some vertex $v' \in V_i$ in $G_i$, then add v to $V_i$. Thus $V_i$ consists of the 
largest set of vertices of $G_i$, clustered around the destination $d_i$, which are on k 
or more routes.
Let $T_i = |V_i|$, and let $\bar{V}_i$ denote the set of vertices of G not in $V_i$. Let q be the 
maximum number of processes to be simulated by any dedicated processor (i.e. 
$q = \max_{0 \le i < p} |\{ j \mid 0 \le j < p \text{ and } d_i = d_j \}|$). Then at most $T_i \cdot q$ routes of $G_i$ start from
vertices in $V_i$, so at least $p - T_i \cdot q$ start from $\bar{V}_i$. In order to get from the vertices 
of $\bar{V}_i$ to the destination $d_i$, these must pass through the vertices of $\bar{V}_i$ which are 
adjacent to vertices of $V_i$ in $G_i$. By the definition of $V_i$, each of these can carry at 
most k−1 routes of $G_i$. Hence there must be at least $\frac{p - T_i q}{k-1}$ such vertices. 
Furthermore, since G has degree d, there can be at most $T_i (d-1)$ vertices of $\bar{V}_i$ 
which are adjacent to vertices of $V_i$ in G. Hence
$$T_i (d-1) \ge \frac{p - T_i q}{k-1},$$
i.e. $T_i \ge \frac{p}{(d-1)(k-1)+q}$.
Let $T = \min_i T_i$, and for each vertex j of G, 0 ≤ j < p', let
$C_j = |\{ i \mid 0 \le i < p \text{ and } j \in V_i \}|$. Now $\sum_{j=0}^{p'-1} C_j \ge pT$, so there must be (by the 
pigeonhole principle) a vertex v with
$$C_v \ge \frac{pT}{p'} \ge \frac{p^2}{p' \, ((d-1)(k-1)+q)}.$$
Choose k = [expression illegible in source] + 1. If k<2 then the result follows: 
if q > [illegible] then a lower bound of $\Omega(p/\sqrt{p'})$ follows immediately (assuming 
the universal machine has asymptotically the same arity as the machine being 
simulated), otherwise $p/\sqrt{p'} = O(1)$, and so a lower bound of $\Omega(p/\sqrt{p'})$ is trivial.
Now suppose k ≥ 2. Then $C_v$ > [illegible]. If q > [illegible] then a lower bound
of $\Omega(p/\sqrt{p'})$ follows immediately. Otherwise k > [illegible], and so v lies on at
least k > [illegible] routes to vertex $d_i$ for $C_v$ choices of destination
i. Thus there is a combination of requests which results in [illegible] packets
being routed through vertex v; furthermore, each packet contains a different 
data item, which precludes the amalgamation of packets (assuming the 
universal machine has the same word-size as the machine being simulated).
Vertex v thus forms a bottleneck, giving us a time-loss of $\Omega(p/\sqrt{p'})$ for each step
of the simulation. This gives us a delay of $\Omega(P(n)/\sqrt{P'(n)})$ for oblivious simulation.
This lower bound very quickly becomes trivial; in fact when $P'(n) = \Omega(P(n)^2)$ it 
gives us no information at all. In this case, theorem 7.1.2 gives us a lower bound 
of Ω(log P(n)). □
The proof of theorem 7.3.1 was motivated by theorem 1 of Borodin and 
Hopcroft [8], where they prove the same result using essentially the same 
methods in the special case where P'(n) = P(n) and all dedicated processors are 
assumed distinct. Lang [39] gives a matching upper bound for the case where 
P'(n) = P(n) and the data-transfers form a permutation (which, in particular, 
means that there can be no read or write conflicts in the machine to be 
simulated). By extending his technique we can derive a general upper bound 
which asymptotically matches the lower bound of theorem 7.3.1.
Theorem 7.3.2 Suppose P(n) ≤ P'(n). There is a P'(n) processor universal 
network based on the shuffle-exchange which can carry out a source-oblivious 
simulation of a P(n) processor network with delay $O(P(n)/\sqrt{P'(n)} + \log P(n))$.
Proof. Fix n, and let $m = \lceil \log P(n) \rceil$, $m' = \lceil \log P'(n) \rceil - m$. We will describe our 
algorithm on an (m+m')-cube. Processor $i \cdot 2^{m'}$ will simulate process i, 
0 ≤ i < P(n). The simulation of a single step proceeds as follows. Suppose process 
i, 0 ≤ i < P(n), wishes to read the communication register of some process $j_i$, 
$0 \le j_i < P(n)$. Then each processor $i \cdot 2^{m'}$, 0 ≤ i < P(n), makes up a request packet 
$(j_i, i)$. These packets are routed to the respective processors $j_i \cdot 2^{m'}$, with multiple 
requests being combined as necessary. The requests are fulfilled and routed 
back to their sources along the same paths. Once read-requests have been dealt 
with in this manner, write-requests are handled analogously with a single 
routing.
The routing of read-requests is broken up into three parts. In each part we
assume that the packets $(j_i, i)$ are held in variables (j,i) of the requisite 
processor. Processors not in possession of a packet are deemed to hold the null 
packet (null,null). A collision is said to occur if two packets are resident in the 
same processor.
(a) Route the packets from $i \cdot 2^{m'}$ to $i \cdot 2^{m'} + (j_i \bmod 2^{m'})$. This can be done quite 
easily using the following algorithm.
    for b:=1 to m' do 
        if PID_b = (j of processor PID^(b))_b 
        then (j,i):=(j,i) of processor PID^(b) 
        else (j,i):=(null,null)
There can be no collisions during this stage of the routing, since a packet 
from processor $i \cdot 2^{m'}$ remains in processors $i \cdot 2^{m'} + x$ for some $0 \le x < 2^{m'}$.
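The intent of stage (a) can be sketched as a synchronous simulation of the cube. This is an illustration under stated assumptions, not the thesis's own code: bit b is counted from 1 at the least-significant end, PID^(b) is taken to be the neighbour whose number differs in bit b, and a processor is assumed to keep a packet it already holds when bit b of j matches its own bit b. The names `bit` and `route_stage_a` are hypothetical.

```python
def bit(value, b):
    """Bit b of value, counting from 1 at the least-significant end."""
    return (value >> (b - 1)) & 1


def route_stage_a(requests, m, m_prime):
    """Move packet (j_i, i) from processor i*2^m' to i*2^m' + (j_i mod 2^m').

    requests[i] is j_i, the process whose register process i wants to read.
    Returns a dict mapping processor number to the packet (j, i) it holds.
    """
    size = 1 << (m + m_prime)
    packets = [None] * size
    for i, j in enumerate(requests):
        packets[i << m_prime] = (j, i)
    for b in range(1, m_prime + 1):
        new_packets = [None] * size
        for pid in range(size):
            neighbour = pid ^ (1 << (b - 1))
            for candidate in (packets[pid], packets[neighbour]):
                # keep or take a packet whose bit b of j equals our own bit b
                if candidate is not None and bit(candidate[0], b) == bit(pid, b):
                    assert new_packets[pid] is None, "collision"
                    new_packets[pid] = candidate
        packets = new_packets
    return {pid: pkt for pid, pkt in enumerate(packets) if pkt is not None}
```

The internal assertion never fires for inputs of this form, in line with the observation that no collisions can occur during this stage.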
(b) Route the packets from $i \cdot 2^{m'} + (j_i \bmod 2^{m'})$ to $\lfloor i/2^{m-m'} \rfloor \cdot 2^m + j_i$.
[Diagram: the (m+m')-bit source location consists of the m bits of i followed by 
the m' low-order bits of $j_i$; the destination location consists of the m' 
high-order bits of i followed by the m bits of $j_i$.]
This involves changing bits m'+1 through m of the current location of each 
packet, which were previously the low-order bits (bits 1 through m−m') of i, 
into the high-order bits (bits m'+1 through m) of j.
To simplify our presentation let us assume at first that there are no 
read-conflicts, so $j_i \ne j_{i'}$ provided $i \ne i'$. We will return to the problem of 
read-conflicts later. Each processor has a queue, with unit-cost operations 
enqueue(x,y) (which places packet (x,y) at the tail of the queue), and
dequeue (which removes the packet from the head of the queue, and returns its 
value as a result). An attempt to dequeue an empty queue returns the null 
packet.
This stage of the algorithm consists of m−m' phases. During the k-th phase, 
1 ≤ k ≤ m−m', we move each packet $(j_i, i)$ so that bit m'+k of its current 
location is the same as bit m'+k of $\lfloor i/2^{m-m'} \rfloor \cdot 2^m + j_i$. This is sufficient to 
move packet $(j_i, i)$ from $i \cdot 2^{m'} + (j_i \bmod 2^{m'})$ to $\lfloor i/2^{m-m'} \rfloor \cdot 2^m + j_i$. Many collisions 
will occur; this is why each processor has a queue. In order to move every 
packet in the system in this manner, we must completely flush the queues 
at each phase. Let $m_k$ be the maximum number of items in each queue at 
the start of phase k, 1 ≤ k ≤ m−m', to be determined later.
    initialize queue to empty queue 
    for k:=1 to m-m' do 
        for t:=1 to m_k do 
            if (j ≠ null) and (j_{m'+k} = PID_{m'+k}) 
            then enqueue(j,i)
            if (j of processor PID^(m'+k) ≠ null) and
               (PID_{m'+k} = (j of processor PID^(m'+k))_{m'+k}) 
            then enqueue((j,i) of processor PID^(m'+k))
            (j,i):=dequeue
To make our analysis easier, we will include the packet (j,i) as part of the 
queue, since we have used variables (j,i) as a dummy head-of-queue in the 
algorithm. At the beginning of phase 1, the queues are empty, so $m_1 = 1$. 
After phase k has terminated, a request from process i to process j will find 
itself in the queue of processor $\lfloor i/2^k \rfloor \cdot 2^{m'+k} + (j \bmod 2^{m'+k})$. Each request 
comes from a different source, and is bound for a different destination.
Thus if two different requests, one from process i to process j, 
and another from process i' to process j', end up in the same 
queue at the end of the k-th phase, then $i \ne i'$, $j \ne j'$ and 
$\lfloor i/2^k \rfloor \cdot 2^{m'+k} + (j \bmod 2^{m'+k}) = \lfloor i'/2^k \rfloor \cdot 2^{m'+k} + (j' \bmod 2^{m'+k})$.
How many different choices of i and i' are there? Since $i \ne i'$ and yet 
$\lfloor i/2^k \rfloor = \lfloor i'/2^k \rfloor$, we are forced to conclude that i and i' differ in the last k bits. 
Thus there are at most $2^k$ choices for the source, and so $m_{k+1} \le 2^k$. Similarly, 
since $j \ne j'$ and yet $j \bmod 2^{m'+k} = j' \bmod 2^{m'+k}$ we are forced to conclude that 
j and j' differ in the leading m−m'−k bits. Thus there are at most $2^{m-m'-k}$ 
choices for the destination, and so $m_{k+1} \le 2^{m-m'-k}$. Putting both of these 
together, we conclude that $m_{k+1} \le \min(2^k, 2^{m-m'-k})$.
Thus the algorithm will work if we set $m_{k+1} = \min(2^k, 2^{m-m'-k})$. The delay is 
proportional to
$$\sum_{k=0}^{m-m'-1} \min(2^k, 2^{m-m'-k}).$$
If m−m' is odd the latter is equal to $2^{(m-m'+3)/2} - 3$, and if it is even it is equal 
to $3 \cdot 2^{(m-m')/2} - 3$. Thus the delay is $O(\sqrt{2^{m-m'}})$.
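The closed form for the total queue-flushing cost can be checked numerically. The sketch below assumes the sum runs over k = 0, ..., m−m'−1 and writes r for m−m'; for this range the value is $2^{(r+3)/2} - 3$ when r is odd and $3 \cdot 2^{r/2} - 3$ when r is even. The function names are hypothetical.

```python
def queue_bound_sum(r):
    """Sum of min(2^k, 2^(r-k)) for k = 0 .. r-1, where r stands for m - m'."""
    return sum(min(2 ** k, 2 ** (r - k)) for k in range(r))


def closed_form(r):
    """Closed form of the same sum, split by the parity of r."""
    if r % 2 == 1:
        return 2 ** ((r + 3) // 2) - 3
    return 3 * 2 ** (r // 2) - 3
```

Both expressions grow like $2^{r/2}$, which is the source of the $O(\sqrt{2^{m-m'}})$ delay.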
The case where read-conflicts are allowed is a little more complicated. As 
well as a queue, arm each processor with a stack, and the usual stack 
operations. Instead of enqueuing a packet, first check to see whether the 
queue already contains a packet bound for the same destination. If so, the 
newly arrived packet is relegated to the stack. This ensures that only 
packets with different destinations are put in any individual queue, which 
preserves the invariants necessary for the above timing analysis.
When, much later, the fulfilled request is routed back along the same path, 
the stack is checked before it is entered on the queue, and any requests for 
the same data item are fulfilled. By also stacking the time at which a
duplicated request was received, the processor can tell when to unstack 
and despatch each fulfilled request in the return routing.
(c) Thirdly and finally, route the packets from processor $\lfloor i/2^{m-m'} \rfloor \cdot 2^m + j_i$ to 
$2^{m'} \cdot j_i$. This is done in two parts. First route it from $\lfloor i/2^{m-m'} \rfloor \cdot 2^m + j_i$ to
$j_i \cdot 2^{m'} + (j_i \bmod 2^{m'})$, and then from there to $j_i \cdot 2^{m'}$. These two parts correspond 
to the two for-loops below.
    for b:=m+m' downto m'+1 do 
        if PID_b = (j of processor PID^(b))_b 
        then (j,i):=(j,i) of processor PID^(b) 
    for b:=m' downto 1 do
        if PID_b = 0
        then (j,i):=(j,i) of processor PID^(b)
Since at all times there are at least m bits of j present in the location of 
each packet (j,i), and at the end of stage (b) above there are no two distinct 
packets bound for the same destination, there are no collisions.
Thus by applying the algorithms of parts (a), (b) and (c) consecutively, we 
can route the request packets in time $O(m + m' + 2^{(m-m')/2})$
on an (m+m')-cube. Part (a) is a simple-ascend class algorithm, and part (c) is 
simple-descend. Thus by the use of theorems 4.1.1 and 4.1.2 they can be 
realized on the cube-connected-cycles or shuffle-exchange interconnection 
patterns without asymptotic time-loss, using P'(n) processors.
The implementation of (b) needs special care however. It would be simple 
ascend class except for the fact that $m_k$ data transfers occur along dimension 
m'+k, instead of the usual 1. A careful analysis of the proof of theorem 4.1.1 
shows that this is easy to handle in the shuffle-exchange case. This does not
appear to be the case with theorem 4.1.2 however, due to the pipelining 
technique used. In the implementation of part (b) on the shuffle-exchange, 
processes must be moved around at the end of each phase. It takes time 
proportional to $m_k$ to move the queue at the start of the k-th phase, giving 
asymptotically the same delay as above.
We have not yet described how the fulfilled requests are to be routed back 
to their sources along the same paths. We simply reverse the above algorithm, 
by making each ascending loop descend, and vice-versa. In the case where 
conflicts are allowed, the stacks need not be moved from processor to processor 
in the simulation of theorem 4.1.1. They are simply implemented as an array at 
each processor, and elements to be stacked are stored in the processor that the 
process is currently residing in. Since the algorithm returns the fulfilled 
requests in a mirror-image of the original routing, each process will be back in 
the correct place when it wishes to remove a particular item from its stack. □
Chapter 8 
Conclusion
We have presented a complexity theory of parallel computation based on a 
network model, and have argued that this model is a good one, from both a 
practical and a theoretical point of view. The concept of a universal network is 
central to our arguments. We have found a practical universal machine which 
can efficiently simulate the more general model. Thus the user of a practical 
universal machine is free to program in a high-level language whose virtual 
architecture corresponds to, and the theoretician is provided with a motive for 
studying, the more abstract models.
We have seen various kinds of universal machine. A literal simulation is 
often more efficient than a strongly-literal one, in the sense that slightly fewer 
processors are needed (this is tied in strongly with our non-standard definition 
of space in section 2.1). On the other hand, the number of processors can be 
reduced even further, and the simulation made strongly-literal, if the machines 
being simulated are restricted-access networks. Upper bounds on the time 
required for these simulations can be asymptotically matched by lower-bounds. 
The situation is quite the opposite, however, in the non-literal case.
We have seen that networks with a large word-size and number of 
processors are very powerful, even when those processors have a modest 
instruction-set. In particular, any computable function can be computed in 
constant time if sufficiently many processors are present. Furthermore, an 
arbitrary polynomial speed-up of a sequential machine is possible on a network 
with "reasonable" resources, although an exponential speed-up is probably not.
The choice of a unit-cost measure of time, although controversial in 
sequential models, can be defended in the parallel case. We have seen a
diversely-motivated collection of evidence in favour of the unit-cost hypothesis.
(a) Networks with a unit-cost measure of time are "reasonable" in the sense 
that they obey the parallel computation thesis, provided a T(n) time-
bounded network has instructions which can be simulated by a Turing
machine using space $T(n)^{O(1)}$. (Section 3.3).
(b) To ensure that individual processors behave like log-cost sequential 
machines, replace Turing machine space by deterministic Turing machine 
time in part (a) above. (Section 3.3).
(c) Networks with a unit-cost measure of time are "reasonable" in the sense 
that they obey the extended parallel computation thesis, provided an S(n) 
space-bounded network with word-size W(n) has instructions which can be 
computed by a deterministic Turing machine using space $(W(n) \cdot S(n))^{O(1)}$ and 
$T(n)^{O(1)}$ reversals. (Section S 3).
(d) In practice, the average user would probably prefer to own a universal 
network, rather than go to the expense of fabricating special-purpose 
networks for each application. In this case, it is valid to use unit-cost 
charging for a P(n) processor machine whose local instructions take time 
O(log P(n)) on the universal machine. (Section 7.1).
Thus the unit-cost hypothesis holds for a wide range of instruction-sets (not
just the commonly-used arithmetic instruction-sets proposed in section 2.1),
including a large class of high-arity machines considered in chapter 6.
Referencea
1. A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The design and analysis of 
computer algorithms, Addison-Wesley (1974).
2. M. Ajtai, J. Komlós, and E. Szemerédi, "An O(n.log n) sorting network", 
Proceedings of the 15th Annual ACM Symposium on Theory of Computing, 
(Apr. 1983).
3. R. Aleliunas, "Randomized parallel communication", Proceedings of the 
ACM Symposium on the Principles of Distributed Computing, (August 1982).
4. K. E. Batcher, "Sorting networks and their applications", Proceedings 
AFIPS Spring Joint Computer Conference 32 pp. 307-314 (April 1968).
5. P. Beame, "Random routing in constant-degree networks", Technical 
Report 161/82, Dept. of Computer Science, University of Toronto (1982).
6. N. Blum, "A note on the 'parallel computation thesis'", Information 
Processing Letters 17 pp. 203-205 (1983).
7. A. Borodin, "On relating time and space to size and depth", SIAM Journal 
on Computing 6(4) pp. 733-744 (Dec. 1977).
8. A. Borodin and J. E. Hopcroft, "Routing, merging and sorting on parallel 
models of computation", Proceedings of the 14th Annual ACM Symposium 
on Theory of Computing, (May 1982).
9. A. K. Chandra, D. C. Kozen, and L. J. Stockmeyer, "Alternation", Journal of 
the ACM 28(1) (Jan. 1981).
10. A. K. Chandra, S. J. Fortune, and R. Lipton, "Unbounded fan-in circuits and 
associative functions", Proceedings of the 15th Annual ACM Symposium on 
Theory of Computing, (April 1983).
11. A. K. Chandra, S. Fortune, and R. Lipton, "Lower bounds for constant depth 
circuits for prefix problems", Proceedings of the 10th ICALP, Springer-Verlag 
Lecture Notes in Computer Science 154 (July 1983).
12. S. A. Cook and R. A. Reckhow, "Time-bounded random access machines", 
Journal of Computer and System Sciences 7(4) pp. 354-375 (1973).
13. S. A. Cook, "Deterministic CFL's are accepted simultaneously in polynomial 
time and log squared space", Proceedings of the 11th Annual ACM 
Symposium on Theory of Computing, (Apr. 1979).
14. S. A. Cook, "Towards a complexity theory of synchronous parallel 
computation", L'Enseignement Mathématique 30 (1980).
15. S. A. Cook and C. Dwork, "Bounds on the time for parallel RAMs to compute 
simple functions", Proceedings of the 14th Annual ACM Symposium on 
Theory of Computing, pp. 231-233 (May 1982).
16. P. W. Dymond, "Simultaneous resource bounds and parallel computations", 
Ph. D. thesis, issued as Technical Report TR145/80, Dept. of Computer 
Science, University of Toronto (Aug. 1980).
17. P. W. Dymond and S. A. Cook, "Hardware complexity and parallel 
computation", Proceedings of the 21st Annual IEEE Symposium on 
Foundations of Computer Science, (Oct. 1980).
18. P. W. Dymond, "Speedup of multi-tape Turing machines by synchronous 
parallel machines", Invited address at the special session on theoretical 
computer science, meeting 792, American Mathematical Society, (Nov. 
1981).
19. M. Flynn, "Very high-speed computing systems", Proceedings of the IEEE 
54 pp. 1901-1909 (Dec. 1966).
20. S. Fortune and J. Wyllie, "Parallelism in random access machines", 
Proceedings of the 10th Annual ACM Symposium on Theory of Computing, 
pp. 114-118 (1978).
21. Z. Galil and W. J. Paul, "An efficient general purpose parallel computer", 
Journal of the ACM 30(2) pp. 360-387 (Apr. 1983).
22. M. R. Garey and D. S. Johnson, Computers and intractability: a guide to the 
theory of NP-completeness, W. H. Freeman (1979).
23. L. M. Goldschlager, "Synchronous parallel computation", Ph. D. Thesis, 
issued as TR-114, Dept. of Computer Science, University of Toronto 
(December 1977).
24. L. M. Goldschlager, "The monotone and planar circuit value problems are 
log space complete for P", SIGACT News 9(2) (1977).
25. L. M. Goldschlager, "Epsilon-productions in context-free grammars", 
Technical Report TR13, Dept. of Computer Science, University of 
Queensland (Apr. 1980).
26. L. M. Goldschlager, R. A. Shaw, and J. Staples, "The maximum flow problem 
is log space complete for P", Technical Report TR28, Dept. of Computer 
Science, University of Queensland (June 1981).
27. L. M. Goldschlager, "A universal interconnection pattern for parallel 
computers", Journal of the ACM 29(4) pp. 1073-1086 (Oct. 1982).
28. L. M. Goldschlager and I. Parberry, "On the construction of parallel 
computers from various bases of boolean functions", Theory of 
Computation Report No. 48, Department of Computer Science, University of 
Warwick (March 1983).
29. L. M. Goldschlager and A. M. Lister, Computer science: a modern 
introduction, Prentice-Hall (1983).
30. J. Hartmanis and J. Simon, "On the power of multiplication in random 
access machines", Proceedings of the 15th Annual IEEE Symposium on 
Switching and Automata Theory, pp. 13-23 (1974).
31. F. Meyer auf der Heide, "Efficiency of universal parallel computers", Acta 
Informatica 19 pp. 269-296 (1983).
32. F. Meyer auf der Heide, "Infinite cube-connected cycles", Information 
Processing Letters 16 pp. 1-2 (Jan. 1983).
33. F. Meyer auf der Heide, "Efficient simulations among several models of 
parallel computers", Interner Bericht 2/83, Fachbereich Informatik, 
Universität Frankfurt (1983).
34. N. D. Jones, Y. E. Lien, and W. T. Laaser, "New problems complete for 
nondeterministic log space", Technical Report TR-75-1, Dept. of Computer 
Science, Kansas University (Apr. 1975).
35. N. D. Jones and W. T. Laaser, "Complete problems for deterministic 
polynomial time", Theoretical Computer Science 3 pp. 105-117 (1977).
36. R. M. Karp and R. J. Lipton, "Turing machines that take advice", 
"Symposium über Logik und Algorithmik" in honour of Ernst Specker, 
L'Enseignement Mathématique 30 (Feb. 1980).
37. R. E. Ladner, "The circuit value problem is log space complete for P", 
SIGACT News 7(1) pp. 18-20 (1975).
38. R. E. Ladner and M. J. Fischer, "Parallel prefix computation", Journal of the 
ACM 27(4) pp. 831-838 (October 1980).
39. T. Lang, "Interconnections between processors and memory modules using 
the shuffle-exchange network", IEEE Transactions on Computers 
C-25(5) (May 1976).
40. F. T. Leighton, "Problem P44", Bulletin of the European Association for 
Theoretical Computer Science (22) p. 110 (February 1984).
41. T. Leighton, "Tight bounds on the complexity of parallel sorting", 
Proceedings of the 16th Annual ACM Symposium on Theory of Computing, 
(April-May 1984).
42. G. F. Lev, N. Pippenger, and L. G. Valiant, "A fast parallel algorithm for 
routing in permutation networks", IEEE Transactions on Computers 
C-30(2) (Feb. 1981).
43. L. G. L. T. Meertens, "Recurrent ultracomputers are not log n - fast", 
Technical Report IW118/79, Dept. of Computer Science, Mathematisch 
Centrum (Sept. 1979).
44. G. Miranker, L. Tang, and C. K. Wong, "A zero-time VLSI sorter", IBM 
Journal of Research and Development 27(2) pp. 140-148 (Mar. 1983).
45. D. Nassimi and S. Sahni, "Data broadcasting in SIMD computers", IEEE 
Transactions on Computers C-30(2) pp. 101-106 (Feb. 1981).
46. D. Nassimi and S. Sahni, "Parallel algorithms to set up the Benes 
permutation network", IEEE Transactions on Computers C-31(2) (Feb. 
1982).
47. D. Nassimi and S. Sahni, "Parallel permutation and sorting algorithms and a 
new generalized connection network", Journal of the ACM 29(3) pp. 642-667 
(July 1982).
48. D. C. Opferman and N. T. Tsao-Wu, "On a class of rearrangeable switching 
networks", Bell System Technical Journal 50 pp. 1579-1618 (1971).
49. J. Orenstein, T. H. Merrett, and L. Devroye, "Linear sorting with O(log n) 
processors", BIT 23 pp. 170-180 (1983).
50. I. Parberry, "Some practical simulations of impractical parallel 
computers", Theory of Computation Report No. 58, Dept. of Computer 
Science, University of Warwick (December 1983).
51. I. Parberry, "Some processor-saving theorems for synchronous parallel 
computers", Theory of Computation Report No. 53, Dept. of Computer 
Science, University of Warwick (October 1983).
52. I. Parberry, "On the power of parallel machines with high-arity instruction 
sets", Theory of Computation Report No. 57, Dept. of Computer Science, 
University of Warwick (November 1983, Updated February 1984).
53. N. Pippenger, "On simultaneous resource bounds", Proceedings of the 20th 
Annual IEEE Symposium on Foundations of Computer Science, (Oct. 1979).
54. V. Pratt and L. J. Stockmeyer, "A characterization of the power of vector 
machines", Journal of Computer and System Sciences 12 pp. 198-221 
(1976).
55. F. P. Preparata and J. Vuillemin, "The cube-connected cycles: a versatile 
network for parallel computation", Communications of the ACM 24(5) pp. 
300-309 (May 1981).
56. M. J. Quinn and N. Deo, "Parallel algorithms and data structures in graph 
theory", Technical Report CS-82-098, Computer Science Department, 
Washington State University (Oct. 1982, Revised June 1983).
57. J. Reif and L. Valiant, "A logarithmic time sort for linear size networks", 
Proceedings of the 15th Annual ACM Symposium on Theory of Computing, 
pp. 10-16 (Apr. 1983).
58. R. Reischuk, "A lower time-bound for parallel random-access machines 
without simultaneous writes", Research Report RJ3431, IBM Research, San 
Jose (Mar. 1982).
59. W. L. Ruzzo, "On uniform circuit complexity", Journal of Computer and 
System Sciences 22(3) pp. 365-383 (June 1981).
60. W. J. Savitch, "Parallel random access machines with powerful instruction 
sets", Mathematical Systems Theory 15 pp. 191-210 (1982).
61. A. Schorr, "Physical parallel devices are not much faster than sequential 
ones", Information Processing Letters 17 pp. 103-106 (August 1983).
62. J. T. Schwartz, "Ultracomputers", ACM Transactions on Programming 
Languages and Systems 2(4) pp. 484-521 (Oct. 1980).
63. J. C. Shepherdson and H. E. Sturgis, "Computability of recursive functions", 
Journal of the ACM 10(2) pp. 217-255 (1963).
64. Y. Shiloach and U. Vishkin, "Finding the maximum, sorting and merging in a 
parallel computation model", Journal of Algorithms 2 pp. 88-102 (1981).
65. H. Simon, "A tight Ω(log log n)-bound on the time for parallel RAM's to 
compute nondegenerated Boolean functions", Information and Control 
55 pp. 102-107 (1982).
66. H. S. Stone, "Parallel processing with the perfect shuffle", IEEE 
Transactions on Computers C-20(3) pp. 153-161 (Feb. 1971).
67. E. Upfal, "Efficient schemes for parallel communication", Proceedings of 
the ACM Symposium on the Principles of Distributed Computing, (1982).
68. E. Upfal, "A probabilistic relation between desirable and feasible models 
for parallel computation", Proceedings of the 16th Annual ACM Symposium 
on Theory of Computing, (April-May 1984).
69. L. G. Valiant and G. J. Brebner, "Universal schemes for parallel 
communication", Proceedings of the 13th Annual ACM Symposium on 
Theory of Computing, pp. 263-277 (1981).
70. L. G. Valiant, "A scheme for fast parallel communication", SIAM Journal on 
Computing 11 pp. 350-361 (1982).
71. U. Vishkin, "A parallel-design space distributed implementation space 
(PDDI) general purpose computer", Research Report RC9541, IBM Thomas 
Watson Research Centre, Yorktown Heights (1982).
72. U. Vishkin, "Implementation of simultaneous memory address accesses in 
models that forbid it", Journal of Algorithms 4(1) pp. 45-50 (Mar. 1983).
73. U. Vishkin, "Synchronous parallel computation - a survey", Technical 
Report #71, Dept. of Computer Science, Courant Institute, New York 
University (April 1983).
74. U. Vishkin and A. Wigderson, "Trade-offs between depth and width in parallel 
computation", Proceedings of the 24th Annual IEEE Symposium on 
Foundations of Computer Science, (November 1983).
75. A. Waksman, "A permutation network", Journal of the ACM 15(1) pp. 159-
163 (Jan. 1968).
76. Y. Wallach, "Alternating sequential/parallel processing", Springer-Verlag 
Lecture Notes in Computer Science 127 (1982).
