Static allocation of computation to processors in multicomputers by Norman, Michael G.
Static Allocation of 
Computation to Processors 
in Multicomputers 
Michael G. Norman 
PhD 
University of Edinburgh 
1993 
To Ruth Thomas 
Acknowledgements 
I would like to thank Peter Thanisch, my supervisor, for sound guidance and 
extremely conscientious supervision of my PhD. His hard work and personal 
interest were invaluable, and way beyond that which one expects of a supervisor. 
I would also like to thank Professor David Wallace, my second supervisor and 
Director of the Edinburgh Parallel Computing Centre, and Professor Roland 
Ibbett, Head of the Department of Computer Science and Associate Director for 
Research in the EPCC. 
As indicated in the declaration much of the work was done in collaboration 
with others. I would like to thank George Chochia, Tim Kempster, Susanna 
Pelagatti, Cristina Boeres, Gary Chang and Emmanuel Issman. 
The work was also supported by more general collaboration with others in 
EPCC and the Department of Computer Science. In particular I would like to 
mention Tim Harris, Brian Wylie, Lyndon Clarke, Tommy Kelly, Neil Macdonald 
and Greg Wilson. 
I would like to thank my family, and finally to thank Ruth for proof reading 
and putting up with the stress of my doing the PhD. 
Abstract 
In this thesis we address the static mapping problem - that is, the problem of 
allocating computation to processors - in a MIMD, distributed-memory 
architecture: a multicomputer. We are primarily interested in the way in which the 
computation and the multicomputer can be modelled: the features of the 
multicomputer and the computation that are included and left out, and the way 
in which that impacts upon the predictions made by the models for the 
performance of computations. We try to put the various published formulations 
of the mapping problem into the context of the multicomputer, and to identify 
correspondences between features of the models underlying the formulations, 
and features of the multicomputer and the computation. 
The two types of models which we choose to consider in detail are precedence 
constrained scheduling with interprocessor communication delay, and static 
process based models. We review approaches to hybridising the two types of 
model and propose such a model of our own. We also consider the impact of 
message contention in the multicomputer. 
We analyse the models underlying formulations of the mapping problem 
in a number of ways. We look at the way in which performance gains can be 
achieved by adding more processors to the models. We consider the way in 
which the complexity of mapping problems depends upon the modelling of 
interprocessor communication. We compare bounds on performance given for 
approximation algorithms in different, but related models. We show, for an 
example computation, how the predictions of the various models differ and 
how these differences might lead the multicomputer programmer to different 
conclusions. 
Finally we relate the predicted performance in some of the models of our 
example computation with that observed when executing it on a real 
multicomputer. 
Table of Contents 
1. Introduction 
1.1 Overview 
1.1.1 Area of Interest and Novelty of Approach 
1.1.2 Thesis Structure 
1.1.3 Issues Not Considered in Detail 
1.2 Formulating the Mapping Problem 
1.2.1 Modelling Costs 
1.2.2 Modelling Processors 
1.2.3 Modelling Modules 
1.3 A Framework for Discussing Mapping Problems 
1.3.1 The Four Basic States 
1.3.2 Time Spent Calculating 
1.3.3 Time Spent Communicating 
1.3.4 Time Spent Housekeeping 
1.3.5 Time Spent Idle 
1.3.6 Interprocessor Communications 
2. An Overview of Scheduling Models 
2.1 No Communication 
2.2 Tasks with Precedence 
2.3 Scheduling with Communication Delays 
2.4 Definitions 
2.4.1 Cardinality, Integer Part, Absolute Value and Modulus 
2.4.2 Directed Graphs 
2.4.3 Models 
2.4.4 Schedules 
2.4.5 Validity of Schedules 
2.4.6 Properties of Schedules 
3. Analysis of Scheduling Models 
3.1 General-Purpose Results 
3.1.1 Lemmas referring to General Delay Scheduling Models 
3.1.2 Results Referring Only to No Delay Scheduling Models 
3.2 Bounds on Profitable use of Processors 
3.2.1 Results for Models Without Communications Delay 
3.2.2 Precedence With Delay but Without Replication 
3.2.3 Precedence with Delay and Replication 
3.3 Complexity Results 
3.3.1 The Types of Complexity Results 
3.3.2 No-Communication Scheduling Models 
3.3.3 No-Delay Scheduling Models 
3.3.4 Unit, Uniform and General Delay Models 
3.4 Comparison with the Framework 
4. Analysis of Scheduling Algorithms 
4.1 Approximation Algorithms for Scheduling Problems 
4.1.1 No-Communication Models 
4.1.2 No-Delay Models 
4.1.3 General, Uniform and Fixed Delay Models 
4.2 Papadimitriou and Yannakakis' Algorithm (PNY) 
4.2.1 A PNY schedule using |Γ| processors 
4.2.2 The Problem of PNY Pruning 
4.2.3 A Slightly Less Processor Wasteful Algorithm 
4.2.4 A PNY Schedule using W(A) 1'-1 Processors 
4.2.5 A PNY Schedule Using an Arbitrary Number of Processors 
4.3 ETF 
4.3.1 The ETF Algorithm 
4.3.2 The ETF bound 
4.4 Comparing ETF and PNY 
4.4.1 A Comparison of Bounds 

5.1 An Overview of Process Based Models 
5.2 Definitions 
5.3 Stone's 1977 Model 
5.3.1 Understanding Stone's model 
5.3.2 Results and Algorithms 
5.3.3 Adding Constraints to Stone's Model 
5.4 Bokhari's 1981 Model 
5.4.1 Bokhari's Model in Terms of the Framework 
6. Message Contention 
6.1 Contention in the Multicomputer 
6.1.1 Packet Switched Systems 
6.1.2 More Complex Interconnection Systems 
6.2 Contention in a Process Based Model 
6.3 The Multicomputer Configuration Problem 
6.4 Contention in a Scheduling Model 
6.4.1 Relevant Complexity Results 
6.4.2 The Complexity of Scheduling with Contention 
7. Hybrid Models 
7.1 Integrating Computation and Communication 
7.1.1 Mini-Max formulations 
7.1.2 Papadimitriou and Ullman's 1987 Models 
7.2 Claud 
7.2.1 Claud and the Mapping Problem Framework 
7.2.2 Complexity Results for Claud 
7.2.3 Performance Guarantees for Claud Scheduling Algorithms 
8. Applying the models 
8.1 The Diamond dag 
8.2 Partitions of the Diamond dag 
8.2.1 Stripes 
8.2.2 Lines 
8.2.3 Boxes 
8.3 The Predictions of the Models 
8.4 The Claud Stripes Schedule 
8.4.1 Predicted Performance for the Claud Models 
9. Validation of the models 
9.1 A Numerical Application 
9.2 A Validation Experiment 
9.2.1 Deriving Parameters of the Model 
9.2.2 Predicted Versus Achieved Performance 
9.2.3 Fitting the Models to the Observed Data 
9.3 Parameter Dependence in the Claud model 
9.4 The Claud Model and our Multicomputer Computation 
9.4.1 Matching CS-tools to our Model 
9.4.2 Atomicity of Communications Events 
10. Concluding Remarks 
10.1 A Summary of Results 
10.1.1 Models and the Framework 
10.1.2 Complexity 
10.1.3 Bounds on Processor Usage 
10.1.4 Performance Guarantees of Algorithms 
10.1.5 Nature of Predictions 
10.1.6 Predictive Power 
10.2 Verisimilitude, Complexity and Predictive Power 
List of Figures 
2-1 An Example Mapping Problem in a No Communication Scheduling Model and a Solution 
2-2 Dag with Shaded Critical Path 
2-3 Schedule Represented as a Gantt chart 
2-4 Scheduling in a Model with Communication Delays 
3-1 Counter-example for W(A) Bound in UET Unit Delay Scheduling models 
3-2 Counter-example for |Γ| Bound in UET General Delay Scheduling 
3-3 Partial Gantt Chart for the Encoding in Decision Problem 3.5 
4-1 The Constructed dag Obtained in the Polynomial Time Reduction from Three Minimum Cover 
4-2 The Schedule Obtained from Algorithm 4.2 on a dag Encoded as in Definition 4.1 
4-3 The Lumpy dag for r = 3 
4-4 The Inverse Binary Tree for r = 6 and its e Values 
4-5 ETF Schedule of the Inverse Binary Tree for r = 6 
5-1 An Example of Bokhari's Mapping Problem and its Solution 
6-1 Carbuncle for Degree 3 
6-2 Carbuncle for Degree 4 
6-3 Partial Gantt Chart for the Encoding in Decision Problem 6.3 
8-1 The 6 x 6 Diamond dag 
8-2 Stripes Schedule in k Processors 
8-3 Lines Schedule in k Processors 
8-4 Boxes Schedule in k Processors 
8-5 Claud Stripes Schedule of the n x n Diamond dag 
8-6 Equal Message Partition of the Stripes Claud Schedule of the 6 x 6 Diamond with n/m = 2 and n/k = 2 Showing Corresponding Send and Receive Events 
8-7 Makespan M as a Function of two Variables k and m, for n = 240, A = 475, = 0.775, w = 43.30 
9-1 Performance with Constant Message Size: Observed (points) vs Predicted (line) 
9-2 Performance with Constant Number of Processors: Observed (points) vs Predicted (line) 
9-3 Performance with Constant Message Size: Observed (points) vs that Predicted by Regression Parameters (line) 
9-4 Performance with Constant Number of Processors: Observed (points) vs that Predicted by Regression Parameters (line) 
9-5 Makespan Against 1 and for n = 240, k = 80, w = 43.3, m = m_opt 
9-6 Gantt Charts Illustrating Overheads that may be Attributable to 4. 
List of Tables 
3-1 A No Communication Scheduling Model in Terms of our Framework 
3-2 A No Delay Scheduling Model in Terms of our Framework 
3-3 A Uniform Delay Scheduling Model in Terms of our Framework 
5-1 Relevant Differences Between Models of Distributed Systems and Models of Multicomputer Systems 
5-2 A Process Based Model in Terms of our Framework 
5-3 A Graph Matching Model in Terms of our Framework 
7-1 A Claud Model in Terms of our Framework 
10-1 A Summary of the Presence of Components of our Framework in Models 
10-2 A summary of Bounds Beyond Which There is No Performance 






This thesis is about the problem of mapping computation to processors in a 
parallel computer in order to optimise performance. We choose to consider 
a restricted class of such problems, one which we argue is important to the 
programmer of the class of parallel computers known as multicomputers. In this 
thesis we bring forward evidence to support the following claim. 
Progress in solving the problem of allocating computation to 
processors in multicomputers is hindered by the lack of appro-
priate architectural and computational models rather than by 
the complexity or non-approximability of the various existing 
formulations of the problem. 
1.1.1 Area of Interest and Novelty of Approach 
The general problem of allocating computation to processors in a parallel com-
puter is well studied, and we certainly do not aim to cover the whole of the field. 
The problems we will consider are restricted in both the type of computation 
and the type of parallel computer architecture. 
The class of architectures we consider in detail is the distributed memory 
multiple-instruction multiple-data computer: the multicomputer, of which the 
various commercial hypercube systems and transputer based machines are 
examples. They consist of processors¹, each of which has its own memory. There 
is no global memory and processors communicate by message passing. The 
processors in a multicomputer may execute asynchronously. Informally, the 
processors need not all be on the same chip, but should be in the same box; that 
is, we are interested in parallel systems, not distributed systems. 
¹ Some authors refer to them as processing elements since they may consist of more 
than just a simple processing unit. 
The types of computation we consider are those for which a static "all-
knowing" representation (usually graph based) can be derived in advance of 
program execution. Although we will often assume the existence of such rep-
resentations and it may be possible to derive them for the regular computations 
discussed in the later chapters of the thesis, it must be understood that they are 
rarely readily available for real computations running on real multicomputers. 
In particular such representations may be program-data dependent. 
Our treatment of the subject is novel in that we are not primarily interested 
in defining algorithms for mapping, but in the assumptions inherent in the 
models underlying the mapping problem. Both the multicomputer and the 
computation must be modelled in some way. We are interested in the way in 
which such models are formulated: in putting into the context of the multicom-
puter the assumptions inherent in such formulations; in the way in which such 
assumptions affect the complexity and approximability of mapping problems; 
and in the way in which different assumptions lead to different predictions of 
performance. We are also, ultimately, interested in the predictive power of such 
models for mapping problems encountered by multicomputer programmers. 
As an example of our treatment, consider the following. There are a number 
of algorithms for mapping computation which have performance guarantees. 
That is they guarantee to find a mapping which performs at worst, say, half as 
well as an optimal mapping. We show that such guarantees of performance 
are only as good as the underlying models. If a given computation and mul-
ticomputer differ from the underlying model, then the performance guarantee 
will not apply to that computation running on that multicomputer. Moreover, 
another algorithm, possibly with a weaker "performance guarantee" might well 
give better results in practice. 
The nature of the thesis is largely analytical: we derive features of mod-
els in the abstract, such as the complexity of mapping problems formulated in 
the models, and bounds on the number of processors that can be used in the 
models by mapping algorithms. We also analyse the performance of particular 
mapping algorithms for particular models. In addition we describe a range of 
models in terms of a single framework, which may be thought of as another 
model, but one which has a strong relationship to the multicomputer and to 
multicomputer computations. Finally, towards the end of the thesis we take 
a more experimental approach in validating the predictions of models, includ-
ing one proposed in the thesis, against the performance of a real computation 
running on a real multicomputer. 
1.1.2 Thesis Structure 
The structure of the thesis reflects the differences between the models under 
consideration. In particular there is no single literature survey. Where mod-
els incorporate markedly different features they are considered in different 
chapters, and existing work is surveyed in many chapters, including this one. 
This chapter sets the context of our work and outlines the underlying liter-
ature. For example it explains the difference between the task based scheduling 
models discussed in Chapters 2, 3 and 4, and the process based models dis-
cussed in Chapters 5 and 6. It also defines the multicomputer-based framework 
in which models are presented, and discusses the influence of the multicom-
puter's interprocessor communication network. Finally, it reviews some rel-
evant literature and some models which are not covered in any detail in later 
chapters. 
Chapter 2 gives an overview of the various scheduling models that have been 
presented in the past. It explains how they relate to each other and develops 
some notation within which they can all be expressed. 
Chapter 3 considers various scheduling models from three different view-
points. First we analyse the way in which performance gains can be achieved by 
adding more processors to the model. Second we review and derive complexity 
results for scheduling in such models. Third we derive the relationship between 
the scheduling models and the framework defined in this chapter. 
Chapter 4 considers algorithms that have been proposed for scheduling 
models. In particular we show how the performance guarantee of a particular 
heuristic algorithm relies on the assumption in the underlying model of the 
availability of an unlimited number of processors. We show the NP 
completeness of remapping its output to achieve comparable performance guarantees 
on a fixed number of processors, and propose a related algorithm for which the 
performance guarantee depends upon the number of processors used. Finally 
in Chapter 4 we compare a pair of heuristic algorithms on the small range of 
models to which they both apply. 
Chapter 5 is primarily a literature review. It describes a pair of process based 
models and puts them into the framework, and develops some notation for such 
models which is used in later chapters. 
Chapter 6 considers the influence of the multicomputer network. First, we 
show the NP completeness of mapping in a process based formulation where 
it is the network of the multicomputer that is to be optimised. Mapping in 
a similar model with a fixed multicomputer network is not known to be NP 
complete. Second we show the NP completeness of a scheduling problem 
(which otherwise has a polynomial time algorithm) when the communications 
network is explicitly modelled. 
Chapter 7 considers models which may be thought of as merging the process 
based models with the scheduling models. We review the relevant literature 
and propose a model of our own which will be used in subsequent chapters. 
Chapters 8 and 9 concern one particular computation: the diamond dag 
which corresponds to, for example, the longest common subsequence compu-
tation which is used in gene sequence matching. In Chapter 8 we derive for a 
range of models the predicted performance of the diamond dag in a number of 
mappings. In Chapter 9 we test our predictions against the performance of a 
real computation running on a Meiko Computing Surface. 
In our concluding remarks we attempt to identify recurring themes in the 
results of our analysis and experimentation, and to identify areas which may 
prove fruitful for further research. 
1.1.3 Issues Not Considered in Detail 
By restricting our discussion to mapping in multicomputers we are exclud-
ing a large number of related areas. For example, we ignore scheduling for 
VLIW architectures and hardware data-flow architectures (e.g. McDowell and 
Appelbe [1986], Gaudiot et al. [1988], and the complexity results of Fellows and 
Langston [1988]). 
We also exclude multicomputers with special-purpose hardware, such as 
Barrier MIMD architectures [Dietz et al. 1992], which keep the processors in 
synchronisation at barrier points in the computation, thereby allowing pro-
gramming techniques akin to those used for VLIW architectures. 
We are not directly interested in the issues in scheduling pipelined or vector 
processors, as reviewed recently by Krishnamurthy [1990]; and we are not 
interested in issues of language design for multicomputers (see Bal et al. [1989]). 
We ignore the interactions of multicomputer programs with operating system 
facilities such as those discussed by Chu and Lan [1987]. 
Casavant and Kuhl [1988] propose a taxonomy for describing mapping 
strategies which is determined by the types of algorithms being used. In their 
terms, our discussion is restricted to the static mapping problem, which allows 
us to concentrate on the underlying modelling issues, rather than, say, the prob-
lems of deriving parameters of computations or the reliability of local mapping 
strategies given only local information. 
1.2 Formulating the Mapping Problem 
Multicomputers can have a performance-price ratio that is vastly superior to 
that of uniprocessors. Having said that, it is a much harder intellectual exercise 
to develop efficient software that can extract this performance from a multicom-
puter. Each stage of the conventional software development cycle - algorithm 
design, implementation, debugging, performance analysis and tuning - is more 
complex when the target hardware is a multicomputer. 
Much of the extra difficulty, however, is caused by components in the soft-
ware development cycle that have no counterparts in the uniprocessor case. 
In particular, the computation must be designed in such a way that it can be 
divided into modules which can be assigned to the various processors in the 
multicomputer. Furthermore, it may also be necessary to devise a schedule for 
each processing element, i.e. an order in which the modules of computation 
assigned to a particular processing element are to be executed. 
In this thesis, we refer to the problem of finding such an assignment - and, 
where appropriate, a schedule - as a mapping problem². 
In broad terms we can consider three aspects to the mapping problem which 
we require to model: the processors and their communications facilities; the 
modules and their communications patterns; the function which is used to 
determine the cost of a mapping. It is worth stressing at this stage that much of 
the work we consider was originally formulated in the context of Operations 
Research or Distributed Systems, and thus our finding it inapplicable to the problem 
of mapping in multicomputers is not in any way a criticism of the work. 
² In the literature on parallel computing, there is no standard definition of 
"mapping". For example, some other authors use the term to mean just an assignment. 
1.2.1 Modelling Costs 
There are several different ways to formulate "optimisation" in the mapping 
problem. For example, for software such as operating systems, it is useful 
to construe the optimisation in terms of maximising the throughput of jobs. 
In a real time system design problem, the designer may be interested in the 
minimum number of processors that can guarantee a particular level of 
performance. Alternatively, the number of processors in the multicomputer may 
be fixed, and the performance of the software may be optimised with respect 
to a single program. The throughput based formulation has been used by a 
number of authors, e.g. Baccelli and Liu [1990] and Bokhari [1981b]; the problem 
of finding a minimum number of processors has been addressed by others such 
as Fernández and Bussel [1973], Al-Mouhammed [1990] and Houstis [1990]. 
The main emphasis in this thesis is on the formulation of the problem for a 
fixed number of processors which are executing a single program. By focussing 
our attention on the execution of a single program, the optimisation of the 
mapping becomes the minimisation of time to completion, which is the elapsed 
wall-clock time between the moment when the multicomputer starts to execute 
the program and the moment at which the result is presented. 
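The time-to-completion objective can be illustrated with a minimal sketch. The `Slot` record and the `makespan` function are hypothetical names introduced here for illustration only, and the sketch assumes a schedule is given as a list of module executions with known start and finish times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """One module execution (hypothetical record, not from the thesis)."""
    module: str
    processor: int
    start: float
    finish: float


def makespan(schedule):
    """Elapsed time from the earliest start to the latest finish."""
    if not schedule:
        return 0.0
    earliest = min(slot.start for slot in schedule)
    latest = max(slot.finish for slot in schedule)
    return latest - earliest


# Illustrative schedule on two processors.
example = [
    Slot("a", 0, 0.0, 3.0),
    Slot("b", 1, 0.0, 2.0),
    Slot("c", 1, 2.0, 6.0),
]
print(makespan(example))
```

Under this reading, comparing two mappings of the same program amounts to comparing the values that `makespan` returns for their schedules.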
1.2.2 Modelling Processors 
We shall often refer to a set P of processors. It is common (e.g. Graham et 
al. [1979]) to consider three ways in which the processors making up a parallel 
computer can vary in processing speed. They may be identical, that is every 
processor processes all modules at the same speed as every other. They may 
be uniform, that is the time of any given processor to process any module is 
a constant integer multiple of a time which is a parameter of that module. 
Alternatively they may be unrelated, for example a processor p could be faster 
than processor q at computing module γ, but slower than q at computing some 
other module δ. Since we are considering the multicomputer we shall be 
interested mainly in identical processors. The more general case of heterogeneous 
multicomputers corresponds to models of unrelated processors. 
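The three speed models can be sketched as three ways of looking up an execution time. This is only an illustration: the function names, the dictionaries and the numbers are assumptions made here, not notation from the thesis.

```python
def identical_time(base, p, module):
    """Identical processors: the time depends on the module only."""
    return base[module]


def uniform_time(base, multiple, p, module):
    """Uniform processors: a per-processor integer multiple of the
    module's own time parameter."""
    return multiple[p] * base[module]


def unrelated_time(table, p, module):
    """Unrelated processors: an arbitrary time for each pair."""
    return table[(p, module)]


# Hypothetical data: two processors p and q, two modules gamma and delta.
base = {"gamma": 4, "delta": 6}
multiple = {"p": 1, "q": 2}
table = {
    ("p", "gamma"): 3, ("q", "gamma"): 5,  # p is faster on gamma
    ("p", "delta"): 9, ("q", "delta"): 2,  # q is faster on delta
}

print(identical_time(base, "q", "delta"))
print(uniform_time(base, multiple, "q", "delta"))
print(unrelated_time(table, "q", "delta"))
```

The `table` above reproduces the situation described in the text, where neither processor is faster on every module, which a single multiplier per processor cannot express.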
In order to model communications facilities between processors we often 
introduce a relationship between the cost of communication between modules 
and the processors to which modules are mapped. 
There are a number of options: 
• The cost of communication between modules is independent of the 
processors to which modules are mapped. 
• The cost of communication between modules depends only upon whether 
or not they have been assigned to the same processor. 
• The cost of communication between modules depends upon the 
processors to which they have been mapped. For example, processors may 
be considered to be connected in an undirected graph. 
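The three options above can be sketched as three cost functions. The names are hypothetical, and charging the message volume once per hop in the third option is an assumption made purely for illustration; the text only says that the cost depends on the processors involved.

```python
from collections import deque


def cost_independent(volume, p, q):
    """Option 1: the cost ignores where the modules are placed."""
    return volume


def cost_same_or_different(volume, p, q):
    """Option 2: the cost is paid only when the processors differ."""
    return 0 if p == q else volume


def hops(links, p, q):
    """Shortest path length between p and q in an undirected graph,
    given as an adjacency dictionary, by breadth-first search."""
    distance = {p: 0}
    queue = deque([p])
    while queue:
        node = queue.popleft()
        if node == q:
            return distance[node]
        for neighbour in links[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    raise ValueError("processors are not connected")


def cost_by_distance(volume, p, q, links):
    """Option 3 (assumed form): volume charged once per hop."""
    return volume * hops(links, p, q)


# Hypothetical ring of four processors.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(cost_independent(5, 0, 2))
print(cost_same_or_different(5, 0, 2))
print(cost_by_distance(5, 0, 2, ring))
```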
1.2.3 Modelling Modules 
We shall often refer to a set Γ of modules that make up the computation. A 
module is a unit of computation that is executed sequentially. Modules can be 
executed preemptively or non-preemptively: that is they may or may not be 
allowed to be suspended. Modules are often algorithmic units - perhaps func-
tions in a functional decomposition. Alternatively in data parallel applications 
they may correspond to the computation associated with divisions of a data 
space. 
There are three basic types of models. Firstly there are models where no com-
munication occurs between modules. Secondly there are models which consist 
of modules (referred to as tasks) arranged in directed acyclic graphs where an 
arc between a pair of tasks corresponds to both a precedence relationship and 
an associated communication event. Thirdly there are models which consist of 
modules (referred to as processes) arranged in undirected graphs where an arc 
corresponds to a volume of communication between processes. The directed 
graph models are often used by researchers interested in scheduling problems, 
and tend to have a basis in shared-memory multiprocessors whereas the undir-
ected graph models tend to have a basis in distributed systems and are often 
used by researchers who are mapping programs which are explicitly parallel. 
Although very different techniques are used for mapping in directed and 
undirected graph based models, these models may simply embody different 
views of the same computation. There is a sense in which the undirected graph 
models correspond to directed graph models in which some tasks have been 
predefined to be mapped as a single unit. Indeed there are mapping techniques 
for directed graphs which are based upon this approach [Kruatrachue and 
Lewis 1988]. 
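The correspondence between the two views can be sketched as follows: given a directed task graph and a grouping of tasks into units, the arcs that cross between units are summed into undirected edges between processes. The function name `collapse` and the data layout are assumptions for illustration, not the method of any cited author.

```python
from collections import defaultdict


def collapse(arcs, cluster_of):
    """Turn a directed task graph into an undirected process graph.

    arcs: dict mapping (from_task, to_task) to a communication volume.
    cluster_of: dict mapping each task to the process it belongs to.
    Arcs inside one process vanish; arcs between two processes are
    summed regardless of direction.
    """
    edges = defaultdict(int)
    for (u, v), volume in arcs.items():
        a, b = cluster_of[u], cluster_of[v]
        if a != b:
            edges[frozenset((a, b))] += volume
    return dict(edges)


# Hypothetical four-task dag grouped into two processes A and B.
arcs = {("t1", "t2"): 2, ("t1", "t3"): 1, ("t2", "t4"): 3, ("t3", "t4"): 4}
cluster_of = {"t1": "A", "t2": "A", "t3": "B", "t4": "B"}
print(collapse(arcs, cluster_of))
```

Note that the precedence information carried by the arcs is discarded by this step, which is one reason the two kinds of model call for different mapping techniques.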
1.3 A Framework for Discussing Mapping Problems 
In this section we present a framework for discussing the mapping problem in 
terms of the activity of the processors of a multicomputer. We consider the case 
of a calculation - that is a set of algorithms and a set of data upon which the 
algorithms are to be applied - which is partitioned into a set of modules which 
are executed on a set, say P, of identical³ processors. The framework bears some 
resemblance to the models of parallel computers described by Fox et al. [1988], 
or by Reed and Fujimoto [1987]. It differs in that it is more explicit in dealing 
only with the state of processors. 
1.3.1 The Four Basic States 
It is assumed that at all times during a given execution of a parallel program 
any processor p ∈ P can be uniquely identified as being in one of four states: 
• Performing the computation required by the calculation 
• Performing computation associated with message transfer 
• Performing computation associated with housekeeping operations 
• Idle 
We shall refer to the time that a processor p spends in these states as $T_{Calc}(p)$, 
$T_{Comm}(p)$, $T_{House}(p)$ and $T_{Idle}(p)$ respectively. We also define: 
$$T_{Calc} = \sum_{p \in P} T_{Calc}(p)$$ 
and similarly for all other values which are subscripts in T. 
³ Recall our definition in Section 1.2.2. 
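The four-state accounting can be sketched as a simple tally over processor timelines. The state labels, function names and numbers below are hypothetical; the sketch assumes each timeline is a list of (state, duration) pairs covering the whole execution of one processor.

```python
STATES = ("calc", "comm", "house", "idle")


def time_in_states(timeline):
    """Total time one processor spends in each of the four states."""
    totals = dict.fromkeys(STATES, 0.0)
    for state, duration in timeline:
        if state not in totals:
            raise ValueError(f"unknown state: {state}")
        totals[state] += duration
    return totals


def totals_over_processors(timelines):
    """Sum each state's time over all processors."""
    grand = dict.fromkeys(STATES, 0.0)
    for timeline in timelines.values():
        for state, spent in time_in_states(timeline).items():
            grand[state] += spent
    return grand


# Hypothetical execution of 8 time units on two processors.
timelines = {
    "p": [("calc", 5), ("comm", 1), ("idle", 2)],
    "q": [("calc", 3), ("comm", 1), ("house", 1), ("idle", 3)],
}
print(totals_over_processors(timelines))
```

Because every processor is in exactly one state at any moment, the four totals in this example add up to the number of processors multiplied by the elapsed time.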
1.3.2 Time Spent Calculating 
We can define $T_{Calc}$ to be a property of P and the calculation being performed, 
and completely independent of the partitioning of the calculation into modules 
and the mapping of modules to processors. Where the set of modules includes 
some re-calculation this will appear as a component of $T_{House}$. 
1.3.3 Time Spent Communicating 
We can sub-divide the term $T_{Comm}(p)$ as follows: 
$$T_{Comm}(p) = T_{Init}(p) + T_{Term}(p) + T_{Route}(p),$$ 
where $T_{Init}(p)$, $T_{Term}(p)$ and $T_{Route}(p)$ are the amounts of time that a processor 
p spends performing the computation associated with message initiation, 
termination and through-routing respectively. Where processor p sends a message 
to processor q, the processing associated with communication contributes to 
$T_{Init}(p)$ and $T_{Term}(q)$, and if, for example, through-routing of the message causes 
costs to be incurred at processors other than p and q then it appears in $T_{Route}$ on 
those processors⁴. It should be noted that the time a message may take to arrive 
is not solely dependent upon these factors. 
$T_{Term}$ and $T_{Init}$ include all the computation which results from the partitioned 
address space: determining whether a message transfer is required; generating 
a packet; function call overhead associated with transfer and receipt of the 
packet and with generating any associated protocol; copying into and out of 
local address space; and stripping of packet headers. 
⁴ The above definition applies for multicast messages if they are considered to be 
multiple message transfers. 
$T_{Route}$, the computation associated with intermediate node transfer of 
messages in a multicomputer, is seen only on such machines as first generation 
hypercubes and transputer based multicomputers running message-passing 
systems. It is envisaged that this overhead will disappear as the computation 
is taken over by dedicated hardware. This is not to say that the influence of 
the underlying processor network disappears, since it continues to show up in 
$T_{House}$ and $T_{Idle}$. 
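The way a single message contributes to the three communication components can be sketched as below. The function name, the fixed per-message costs and the route are assumptions for illustration; in a real multicomputer these costs would depend on message size and on the routing system.

```python
def charge_message(costs, route, init, term, through):
    """Charge one message to the processors along its route.

    route: processors visited, from sender to receiver (at least two).
    The sender pays initiation, the receiver pays termination, and any
    intermediate processor pays the through-routing cost.
    """
    if len(route) < 2:
        raise ValueError("an interprocessor message needs two endpoints")
    sender, receiver = route[0], route[-1]
    costs[sender]["init"] += init
    costs[receiver]["term"] += term
    for node in route[1:-1]:
        costs[node]["route"] += through


def comm_time(costs, p):
    """Communication time at p: initiation plus termination plus routing."""
    return sum(costs[p].values())


# Hypothetical machine with four processors; one message 0 -> 1 -> 2.
costs = {p: {"init": 0, "term": 0, "route": 0} for p in range(4)}
charge_message(costs, [0, 1, 2], init=2, term=3, through=1)
print([comm_time(costs, p) for p in range(4)])
```

With dedicated routing hardware, as the text anticipates, the `through` cost would be zero and intermediate processors would see no charge at all.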
1.3.4 Time Spent Housekeeping 
There are a number of overheads associated with parallel computation which 
are neither present in the sequential case nor directly associated with commu-
nication. For completeness we include these as "housekeeping overheads", 
$T_{\mathrm{House}}$, of which we identify three components:
$$T_{\mathrm{House}}(p) = T_{\mathrm{Sched}}(p) + T_{\mathrm{Inter}}(p) + T_{\mathrm{Recalc}}(p).$$
$T_{\mathrm{Recalc}}(p)$, which of the three is the component we will come across most, is
the overhead associated with recomputation of parts of the calculation. As we
will see in Section 3.2 below, in some models this is worthwhile, and in others
it is not. The computation associated with the calculation may appear only
once in $T_{\mathrm{Calc}}$. Recomputation must appear in $T_{\mathrm{Recalc}}$. If calculation is performed
more than once it is useful to consider the first computation to be calculation, the
others recalculation. If it is initiated simultaneously on more than one processor
then some arbitrary assignment of the computation events to $T_{\mathrm{Calc}}$ and $T_{\mathrm{Recalc}}$
must be made.
The other two overheads feature little in the rest of the thesis since they 
are rarely modelled by those researching into the mapping problem. They 
incorporate all computation that occurs in the parallel case that would not occur 
in the serial case, and that cannot be attributed to initiating and terminating 
communication. $T_{\mathrm{Sched}}(p)$ contains the overheads of local dynamic scheduling
of computation where processor $p$ is assigned more than one module. $T_{\mathrm{Inter}}(p)$
contains all other such overheads.
1.3.5 Time Spent Idle 
$T_{\mathrm{Idle}}(p)$, the time processor $p$ spends idle, is also partitioned:
$$T_{\mathrm{Idle}}(p) = T_{\mathrm{Wait}}(p) + T_{\mathrm{Fin}}(p).$$
$T_{\mathrm{Fin}}$ is the time processors spend idle, having completed all their modules. It
can be thought of as a place-filler. There is no cost associated with the extension
of processing into $T_{\mathrm{Fin}}$.
$T_{\mathrm{Wait}}(p)$, the time that processor $p$ spends idle with none of its modules
executing, before it has performed all the computation associated with its modules,
is a property of the global mapping of modules, and can be regarded as the
overhead associated with load imbalance. Again there is no cost directly associated
with extension of processing into $T_{\mathrm{Wait}}(p)$ and so, for example, $T_{\mathrm{Recalc}}(p)$
can be extended at the expense of $T_{\mathrm{Wait}}(p)$, corresponding to re-computation of
values so as to minimise overall program completion time. It is often useful for
$T_{\mathrm{Wait}}(p)$ to be considered the sum of two components:
$$T_{\mathrm{Wait}}(p) = T_{\mathrm{WaitS}}(p) + T_{\mathrm{WaitD}}(p),$$
where $T_{\mathrm{WaitS}}(p)$ is the time processor $p$ spends waiting for messages before they
are sent, and $T_{\mathrm{WaitD}}(p)$ is the time that processor $p$ spends waiting for messages
that are in transit. If $p$'s next scheduled task is awaiting more than one message
from tasks on other processors then the time contributes to $T_{\mathrm{WaitD}}(p)$ only if all
outstanding messages are in transit.
1.3.6 Interprocessor Communications 
The above "processor-centric" framework does not consider the processor net-
work which mediates the interprocessor communication in the multicomputer. 
This network can often be represented as an undirected graph - such as a hy-
percube or a torus. In its general form the network consists of routing elements 
(vertices) with channels (edges) between them. Messages are passed between 
adjacent routing elements, being through-routed as necessary. The processors are 
connected to individual routing elements in the network. In a number of so-
called direct networks, there is a one-to-one correspondence between processors 
and routing elements. 
Even amongst those corresponding to the same graph, different interprocessor
communication networks have different performance characteristics.
Those which are simplest to model are packet switched whereby messages are 
sent in one or more packets each of which moves independently through a net-
work being stored at intermediate routing elements before being forwarded to 
its subsequent destination. In circuit switched and wormhole networks messages 
are transferred as sequences of flits which do not move independently. Circuit 
switched and wormhole systems differ in the way in which the sequence of 
channels through which a message is transferred is set up. There is a variation 
on the above which is used to avoid deadlock [Dally and Seitz 1987] and also
to enhance performance [Duato 1992], whereby channels are split into many
virtual channels over which flits can move independently. 
The communications network is often described as having quiet and busy 
performance characteristics. Quiet networks are relatively well understood; 
busy networks less so. Busy networks are discussed in Chapter 6. In this section, 
and in the majority of the thesis we consider only quiet networks, by which we 
are making the assumption that communication between all pairs of modules 
in the multicomputer program occurs across hardware which is not shared with 
communications between other pairs of modules, at least for the duration of 
any communication. In these cases it is often possible to approximate the delay 
associated with message transfer in a multicomputer by a simple function of the 
number of interprocessor network links traversed, the number of words being 
transferred in the message and hardware specific constants. 
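As a minimal sketch of such a delay function, assuming a store-and-forward network in which every traversed link forwards the whole message: the parameter names `startup`, `per_hop` and `per_word` are hypothetical hardware-specific constants introduced here for illustration, and the zero delay for a message that crosses no link is likewise an assumption.

```python
def quiet_network_delay(hops: int, words: int,
                        startup: float, per_hop: float, per_word: float) -> float:
    """Estimate the delay of one message on a quiet store-and-forward network.

    hops     -- number of interprocessor network links traversed
    words    -- number of words in the message
    startup, per_hop, per_word -- hypothetical constants that would have to be
                                  measured on the target machine
    """
    if hops < 0 or words < 0:
        raise ValueError("hops and words must be non-negative")
    if hops == 0:
        # Assumption: sender and receiver share a processor, so no network delay.
        return 0.0
    # Each traversed link stores and then forwards the complete message once.
    return startup + hops * (per_hop + words * per_word)


# Illustrative call with made-up constants (arbitrary time units).
example_delay = quiet_network_delay(hops=2, words=100,
                                    startup=10.0, per_hop=1.0, per_word=0.5)
```

The linear form is only one possible "simple function"; on circuit switched or wormhole systems the constants would, as discussed next, have to be replaced by experimentally derived functions.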
This model is most appropriate for modelling simple store-and-forward 
message passing systems such as were written in software on first generation 
commercial hypercube machines and on transputer based machines. Circuit 
switched and wormhole systems are less adequately modelled by this approach. 
The constants must be replaced with functions which are derived experimentally 
from the machine. For example in the Intel iPSC/2 there is a strong discontinuity
in performance at a message length of 100 bytes [Bomans and Roose 1989]. 
Nevertheless it is often possible to derive expressions which accurately model 
the communication delay. A number of studies have been published (see for 
example [Bokhari 1990] and [Bomans and Roose 1989]). See Seitz [1990] and
Dally [1990a] for more discussion of message latencies in packet switched and 
circuit switched networks. 
It should be stressed that there is no strong causal link between the components
of $T_{\mathrm{Comm}}$ and the properties of the communication network. Delays in
communication need not show up as overheads on any processor, and commu-
nications overheads on processors need not show up as delays to messages. We 
discuss this in more detail in Chapter 9 in the context of a particular computation 
executing on a particular multicomputer. 
Chapter 2 
An Overview of Scheduling Models 
This chapter consists of four sections. The first three introduce a number of re-
lated scheduling models which might have predictive power for multicomputer 
performance. The fourth introduces a notation which is used in subsequent 
chapters. 
The No Communication scheduling model of Section 2.1 and the No Delay 
scheduling model outlined in Section 2.2 below, correspond to the formulations 
that are the basis of scheduling theory, which developed from the 1950s onwards, 
and have been reviewed on a number of occasions (Graham et al. [1979], Conway
et al. [1967], Coffman [1976] (including the chapter by Sethi), Gonzalez [1977],
Blazewicz et al. [1991]), and still are of current interest (e.g. Ramamithram et
al. [1990]). Although the complexity results and algorithms for such models
underlie much of what is discussed in the later chapters, it is not our purpose to
discuss them in detail except where they relate to multicomputer systems. 
A number of generalisations of these scheduling models have been proposed 
which might have relevance to the multicomputer programmer but which we 
only mention here in passing. In particular some authors (e.g. Kafura and 
Shen [1977] and Garey and Johnson [1975]) have extended scheduling models
to allow constraints upon allocations of tasks so that they compete for resources, 
thereby modelling the situation in multicomputers which do not allow virtual 
memory and have localised software interfaces and special purpose hardware. 
In addition, a number of authors (e.g. Lo [1992], Blazewicz et al. [1986][1984], 
Wang and Cheng [1992], Du and Leung [1989], Nicol et al. [1992]) have recently
considered the mapping problem for tasks which are themselves parallel pro-
grams. The models correspond closely to the subcube allocation problem for 
hypercubes studied by Chen and Lai [1988] and Chen and Shin [1987]. 
For the purposes of this thesis, however, we restrict our consideration of 
precedence constrained scheduling to models with the following properties: 
• Identical processors
• Uniprocessor tasks
• Arbitrary positive integer task execution times
• No preemption
• Arbitrary non-negative integer interprocessor communication delays.
2.1 No Communication 
The simplest models of parallel computation are those where computations are 
modelled as tasks, each of which is executed sequentially on a single processor 
and between which there is no communication. The problem is often referred 
to as a scheduling problem, but in truth, the latter assumption means that the 
tasks can be executed in any order, and the problem is simply one of balancing 
the load on the various processors to which the computation is being mapped. 
The model corresponds to the execution of independent programs, say for 
example by a parallel batch server, in a way that multicomputer programmers 
often refer to as embarrassingly parallel (e.g. Fox et al. [1988]) or event parallel (e.g.
Pritchard et al. [1987]). For example, in parallel implementations of ray-tracing
algorithms, the computation is often decomposed into tasks corresponding to 
the computation of square tiles which together make up a screen. The tasks may 
be executed independently, and the computation time of the tasks varies. As 
long as the communications overheads associated with startup and termination 
of tasks can be ignored, dynamic mapping problems framed in the model may 
correspond accurately to the problem of predicting the performance of a task 
farming implementation of ray-tracing. Static versions of the mapping problem 














Figure 2-1: An Example Mapping Problem in a No Communication Scheduling
Model and a Solution
As an example of a mapping problem of this type, the tasks in Figure 2-1 
have execution times as shown and are allocated to two processors as indicated. 
The resulting execution time of the computation is 58 milliseconds. One of
the processors is active for 57 milliseconds, the other for 58 milliseconds, and
so the schedule is optimal amongst non-preemptive schedules. No schedule
could have both processors active for 57.5 milliseconds without a preemption 
in which a task is executed on one processor, its execution terminated, and then 
restarted on another processor. 
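The load-balancing character of this model can be illustrated with a small exhaustive search. This is only a sketch: the task times below are hypothetical (they are not read from Figure 2-1, and were merely chosen to total 115 ms, the total implied by the 57 ms and 58 ms loads above), and the function name `best_allocation` is introduced here for illustration. Exhaustive search is exponential in the number of tasks and is usable only for very small instances.

```python
from itertools import product


def best_allocation(times, processors=2):
    """Try every allocation of independent tasks to identical processors.

    Returns (makespan, allocation), where allocation[i] is the processor
    given to task i. No preemption: each task runs whole on one processor.
    """
    best = None
    for allocation in product(range(processors), repeat=len(times)):
        loads = [0] * processors
        for task_time, proc in zip(times, allocation):
            loads[proc] += task_time
        makespan = max(loads)
        if best is None or makespan < best[0]:
            best = (makespan, allocation)
    return best


# Hypothetical execution times in milliseconds (sum = 115).
times = [32, 25, 18, 15, 11, 7, 7]
makespan, allocation = best_allocation(times)
```

With an odd total of 115 ms and integer task times, no non-preemptive allocation to two processors can finish before 58 ms, which is the bound the search attains for these times.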
2.2 Tasks with Precedence 
It might be that multicomputer programs with inter-task communications are 
not handled with sufficient accuracy by the model outlined in Section 2.1. In 
this section we describe a model of non-preemptive scheduling where again the 
program is represented as a set of tasks but this time the tasks communicate 
results to other tasks on termination, and the structure of the computation is 
represented as a directed task graph in which a directed edge connects a pair of
distinct tasks if and only if the task at the end of the edge requires the results
of computation from the task at the start of the edge.
The directed task graph represents a partial order for an agenda of activities. 
It may be, for example, that at two or more separate stages in the computation 
it is necessary to perform a given computation—for example a sort operation 
on two different sets of data. This would be represented by two separate 
tasks/nodes in the task digraph. Consequently it is clear that in a valid task 
graph, there cannot exist a cycle in which a given task requires the results of a 
task to which it, in turn (either directly or indirectly) supplies results. The task 
graph is always a directed acyclic graph or dag. 
Most researchers in this area of the mapping problem would, as for the 
model discussed in Section 2.1, expect the nodes to be labelled with integers 
representing the time required to execute them. The mapping problem, which 
is usually referred to as Precedence Constrained Scheduling, is now the problem of 
assigning tasks to processors and giving the processors local schedules for their 
tasks, subject to three constraints: 
All tasks are executed. (This property is formalised below as the Complete-
ness property of Definition 2.30.) 
No processor is executing more than one task at a time (that is, it is not 
Overbooked as stated in Definition 2.31 below.) 
The tasks that are defined to precede a given task have finished executing 
before that task is started (that is, it is not Temporally Compromised as stated 
in Definition 2.35 below.) 
This definition of what constitutes a valid schedule is identical in all but the 
symbols, to that used by, for example, Papadimitriou and Yannakakis [1990], 
or by Chretienne and Picouleau [1992], but is different from and more complex 
than that which has usually been used in the past, e.g. in the definition of 
Precedence Constrained Scheduling in Garey and Johnson [1979]. The added 
notational burden results from the fact that we wish to use the same definition in 
the models described in Section 2.3 below, where tasks may usefully be executed 
more than once. See Section 3.2 below. 
Let us consider an example task graph (shown in Figure 2-2). This contains 
the same tasks as we discussed in the previous section, but imposes a precedence 
relation on the tasks. There is a single sink node corresponding to the task that 
presents the final result to the user, and a single source node corresponding to 
the task that inputs the user's request to start the computation. 
In this case, we may wish to identify a critical path in the task graph, i.e. a path
from source to sink such that the sum of the node weights is maximised. The 
significance of the critical path is that, even assuming an unbounded number 
of processors, the sum of the execution times labelling the nodes on the critical 
path represents the minimum achievable time to completion for this task graph 
(assuming the program's designer got the labels correct). This contrasts with the 
model outlined in Section 2.1 above, where it is possible to use simultaneously 
as many processors as there are tasks. 
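A critical path can be found by a memoised longest-path search over the dag. The sketch below assumes the task graph is given as a dictionary of node weights and a list of (predecessor, successor) edges; the function name `critical_path` and the five-task example are hypothetical and are not the graph of Figure 2-2.

```python
from functools import lru_cache


def critical_path(weights, edges):
    """Return (length, path) of a maximum-weight directed path in a dag.

    weights -- dict mapping each task to its (positive) execution time
    edges   -- iterable of (pred, succ) pairs; the graph must be acyclic
    """
    succs = {v: [] for v in weights}
    for u, v in edges:
        succs[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(v):
        # Best continuation among the successors of v (none for a sink).
        best_len, best_path = 0, ()
        for w in succs[v]:
            length, path = longest_from(w)
            if length > best_len:
                best_len, best_path = length, path
        return weights[v] + best_len, (v,) + best_path

    # With positive weights the overall maximum starts at a source node.
    return max((longest_from(v) for v in weights), key=lambda r: r[0])


# Hypothetical task graph: a single source "a" and a single sink "e".
weights = {"a": 7, "b": 18, "c": 25, "d": 15, "e": 7}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
length, path = critical_path(weights, edges)
```

For this example the returned length is a lower bound on the completion time of any schedule of the graph in the model of this section, however many processors are used.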
The critical path has been identified on Figure 2-2, and corresponds to the 
shaded nodes. We can consider mapping the tasks to two processors as shown 
in Figure 2-3. Here we show the activity of the processors through time, and
also show communication events as arrows, and the buffering of messages that 
are received. Since we have mapped the critical path and only the critical path 
to one of the two processors, the time to completion of the schedule is minimal. 
Figure 2-2: Dag with Shaded Critical Path
Figure 2-3: Schedule Represented as a Gantt chart (columns: Processor 1, Processor 2; vertical axis: Time)
2.3 Scheduling with Communication Delays
The model described in Section 2.2 captures the essence of interprocessor com-
munication in terms of the implied precedence, but fails to capture any of the 
overheads associated with message transfer. We now introduce an approach 
to modelling such costs whereby messages are subject to a delay. The area has 
been the subject of a recent short review (Veltman et al. [1990]).
The three conditions of validity of the schedule are essentially the same as 
in Section 2.2 above, except that the time at which a task can be executed in 
order for the schedule not to be temporally compromised is determined by the 
time of completion of its predecessors plus the time required for the results of 
its predecessors to reach the processor on which it is executed. This statement 
is made more rigorous in Definition 2.35. 
If we consider our example in Section 2.2, we can annotate each of the edges 
of the graph with a value indicating the number of bytes of data that must be 
communicated from the precedent task to its successor in order to satisfy the 
precedence relation. Such a set of edge labels is shown in Figure 2-4(a). Note 
that, as Efe [1982] points out, the node and edge labels are different in character: 
edges are labelled with data volumes and the vertices are labelled with times. 
In order to make the labels consistent, in the models outlined in this section 
there is often considered to be some per byte communication latency between a 
given pair of processors. One can think of this as the reciprocal of the available 
interconnect bandwidth between these processors. 
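The conversion from edge labels (data volumes) to times can be sketched as follows. The function name `edge_delay_us` is hypothetical, the per-byte latency is left as a parameter, and the zero delay between tasks on the same processor is an assumption of the sketch.

```python
def edge_delay_us(volume_bytes: int, same_processor: bool,
                  latency_us_per_byte: float) -> float:
    """Convert an edge's data volume into a communication delay in microseconds."""
    if volume_bytes < 0:
        raise ValueError("data volume must be non-negative")
    if same_processor:
        # Assumption: results passed between co-located tasks incur no delay.
        return 0.0
    return volume_bytes * latency_us_per_byte


# A 10,000-byte result sent between processors at 1 microsecond per byte.
delay = edge_delay_us(10_000, same_processor=False, latency_us_per_byte=1.0)
```

Dividing the result by 1000 gives milliseconds, the unit in which the node labels of the example are expressed, so that node and edge labels become directly comparable.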
We will consider mapping the computation to two processors, and we will 
assume a 1 μs/byte communications latency. If we return to the schedule of
Figure 2-3, we can respect the order of task execution on each processor but in 
order to allow the requisite interprocessor communication we must delay the 
execution of the tasks as shown in Figure 2-4(b). As a result of this, the schedule 
is actually longer than the execution time that would be achieved if all tasks 
were executed sequentially on the same processor. A shorter time to completion 
can be achieved by repeating the computation of the two tasks that come first 
in the partial order of the dag. The resulting schedule is shown in Figure 2-4(c). 
2.4 Definitions 
In this section we define the majority of the notation that is used in the thesis. 
The notation is based on directed graphs and its primary purpose is to capture 
the various scheduling models which we analyse in Chapters 3 and 4. We 
develop a simpler notation based on undirected graphs, primarily for use in 
Chapters 5 and 6. In addition a few more definitions are required on a chapter-
by-chapter basis, mainly in Chapters 5 and 7. 
2.4.1 Cardinality, Integer Part, Absolute Value and Modulus 
Firstly, to avoid confusion, we clarify the way we will be referring to a few 
simple arithmetic and set operations. 
Definition 2.1 Cardinality of a set. 
We refer to the cardinality of a set $X$ as $|X|$.
Definition 2.2 Absolute Value.
We refer to the absolute value of an integer $X$ as $|X|$.
Definition 2.3 Floor.
Let $R$ be a real number. We refer to the largest integer $i$ such that $i \le R$ as $\lfloor R \rfloor$.
Definition 2.4 Ceiling.
Let $R$ be a real number. We refer to the smallest integer $i$ such that $i \ge R$ as $\lceil R \rceil$.
Figure 2-4: Scheduling in a Model with Communication Delays. Panels: (a) Dag Annotated with Communications Volumes; (b) Schedule without Recomputation; (c) Schedule with Recomputation.
Definition 2.5 Integer Remainder. 
Let $i \in \mathbb{Z}$ and $j \in \mathbb{Z}$. By $i \bmod j$ we denote $j\,(i/j - \lfloor i/j \rfloor)$.
2.4.2 Directed Graphs 
Next we define the way we will be referring to directed graphs and their proper-
ties. Most of these definitions are as given by Swamy and Thulasiram [1981]. 
Definition 2.6 Directed Graph. 
A directed graph, say $\Lambda = (\Gamma, \Delta)$, is an ordered pair consisting of a finite set $\Gamma$
of vertices and a finite set $\Delta$ of edges. Each edge, $\rho \in \Delta$, is an ordered pair of
vertices, say $(\gamma, \delta)$ where $\gamma, \delta \in \Gamma$.
Definition 2.7 Directed Walk.
A directed walk in a directed graph $\Lambda = (\Gamma, \Delta)$ is a finite sequence of vertices, say
$\gamma_1, \gamma_2, \ldots, \gamma_n$, $n \ge 2$, such that for $i = 2, 3, \ldots, n$, $(\gamma_{i-1}, \gamma_i) \in \Delta$. We say that this
directed walk is from $\gamma_1$ to $\gamma_n$.
Definition 2.8 Open/Closed Directed Walk.
A directed walk from $\gamma$ to $\delta$ is closed iff, that is if and only if, $\gamma = \delta$. Otherwise it
is open.
Definition 2.9 Directed Trail.
A directed walk, say $\gamma_1, \gamma_2, \ldots, \gamma_n$, is a directed trail iff each $(\gamma_{i-1}, \gamma_i)$, $i = 2, 3, \ldots, n$,
is distinct. We say that this directed trail is from $\gamma_1$ to $\gamma_n$.
Definition 2.10 Directed Path.
An open directed trail, say $\gamma_1, \gamma_2, \ldots, \gamma_n$, is a directed path iff each $\gamma_i$, $i = 1, \ldots, n$,
is distinct. We say that this directed path is from $\gamma_1$ to $\gamma_n$.
Definition 2.11 Directed Cycle.
A closed directed trail, say $\gamma_1, \gamma_2, \ldots, \gamma_n$, is a directed cycle iff each $\gamma_i$, $i = 1, \ldots, n$,
is distinct, except $\gamma_1 = \gamma_n$.
Definition 2.12 dag. 
A directed graph is acyclic if it has no directed cycles. Such a graph is referred 
to as a dag (see footnote 1).
Definition 2.13 Predecessors; Successors.
Let $\Lambda = (\Gamma, \Delta)$ be a dag. Let $\gamma \in \Gamma$. The predecessor set, $\mathrm{pred}(\gamma, \Lambda)$, of $\gamma$ is defined
to be the set $\{\delta : \text{there is a directed path from } \delta \text{ to } \gamma \text{ in } \Lambda\}$. The successor set
$\mathrm{succ}(\gamma, \Lambda)$ is defined similarly.
Definition 2.14 Antichain.
An antichain of a dag $\Lambda = (\Gamma, \Delta)$ is a set $\Gamma' \subseteq \Gamma$ of vertices such that for each
pair of vertices $\gamma, \delta \in \Gamma'$, $\gamma \ne \delta$, there is neither a directed path from $\gamma$ to $\delta$ nor a
directed path from $\delta$ to $\gamma$ in $\Lambda$.
Definition 2.15 Width.
The width, $\mathcal{W}(\Lambda)$, of a dag $\Lambda = (\Gamma, \Delta)$ is the cardinality of a maximal-size antichain
in $\Lambda$.
Definition 2.16 Height.
The height, $\mathcal{H}(\Lambda)$, of a dag $\Lambda = (\Gamma, \Delta)$ is the cardinality of a maximal-size directed
path in $\Lambda$.
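The graph quantities defined in Definitions 2.13 to 2.16 can be computed directly for small dags. The following is a sketch only: the function names are hypothetical, the width is found by brute force over vertex subsets (exponential, so suitable only for tiny graphs), and a lone vertex is counted as a path of one vertex, which is an assumption for the edge-free case.

```python
from itertools import combinations


def reachable(edges, start):
    """Vertices reachable from start by a directed path of at least one edge."""
    succs = {}
    for u, v in edges:
        succs.setdefault(u, set()).add(v)
    seen, stack = set(), [start]
    while stack:
        for w in succs.get(stack.pop(), ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def pred(vertex, vertices, edges):
    """Predecessor set: every vertex with a directed path to `vertex`."""
    return {u for u in vertices if vertex in reachable(edges, u)}


def width(vertices, edges):
    """Cardinality of a largest antichain, by brute force over subsets."""
    reach = {v: reachable(edges, v) for v in vertices}
    for k in range(len(vertices), 0, -1):
        for subset in combinations(vertices, k):
            if all(b not in reach[a] and a not in reach[b]
                   for a, b in combinations(subset, 2)):
                return k
    return 0


def height(vertices, edges):
    """Number of vertices on a longest directed path (graph assumed acyclic)."""
    succs = {v: [] for v in vertices}
    for u, v in edges:
        succs[u].append(v)
    memo = {}

    def longest(v):
        if v not in memo:
            memo[v] = 1 + max((longest(w) for w in succs[v]), default=0)
        return memo[v]

    return max((longest(v) for v in vertices), default=0)


# Hypothetical diamond-shaped dag: a -> b, a -> c, b -> d, c -> d.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
```

In the diamond, `b` and `c` form the largest antichain and `a, b, d` is a longest path.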
2.4.3 Models 
Our scheduling models consist of dags which encode tasks and the precedence 
relationship between them, processors which are assumed to be identical, and 
functions which encode the costs associated with tasks and with the commu-
nications implied by precedence edges. We start by defining the most general 
scheduling model that we use, and then define special cases of it. 
Footnote 1: This is an abbreviation for directed acyclic graph, although acyclic directed graph is a
better description.
Definition 2.17 General Delay Scheduling Model. 
An instance of a General Delay scheduling model is a 4-tuple $(P, \Lambda, f, d)$ where
$P$ is a set of $n$ processors, $\Lambda = (\Gamma, \Delta)$ is a dag such that $\Gamma$ is a set of tasks,
$f : \Gamma \to \mathbb{Z}^{+}$ is a function returning the time that a task requires for execution
(see footnote 2), and $d : \Delta \times P \times P \to \mathbb{Z}^{+}_{0}$ is a function where $d((\gamma, \delta), p, q)$ returns
the delay associated with communication if task $\gamma$ is mapped to processor $p$ and task $\delta$ is
mapped to processor $q$.
It is useful to classify restricted versions of the General Delay scheduling 
model. 
Definition 2.18 UET Scheduling Model. 
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
$A$ is said to be an instance of a Unit Execution Time (UET) model iff the range of
$f$ is $\{1\}$.
Definition 2.19 No Communication Scheduling Model.
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
$A$ is said to be an instance of a No Communication model iff $\Delta = \{\}$.
Definition 2.20 No Delay Scheduling Model.
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
$A$ is said to be an instance of a No Delay model iff the range of $d$ is $\{0\}$.
Definition 2.21 Unit Delay Scheduling Model.
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
$A$ is said to be an instance of a Unit Delay model iff for all $p \in P$, for all $\rho \in \Delta$,
$d(\rho, p, p) = 0$, and for all $p, q \in P$ where $p \ne q$, for all $\rho \in \Delta$, $d(\rho, p, q) = 1$.
Footnote 2: In Chapter 7 we will also have use for a General Delay scheduling model which
allows $f$ to return 0.
Definition 2.22 Fixed Delay Scheduling Model. 
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
$A$ is said to be an instance of a Fixed Delay model iff for all $p \in P$, for all $\rho \in \Delta$,
$d(\rho, p, p) = 0$, and there exists some non-negative integer $\tau$ such that for all
$p, q \in P$ where $p \ne q$, for all $\rho \in \Delta$, $d(\rho, p, q) = \tau$.
Definition 2.23 Uniform Delay Scheduling Model.
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
$A$ is said to be an instance of a Uniform Delay model iff for all $p \in P$, for all $\rho \in \Delta$,
$d(\rho, p, p) = 0$, and for each $\rho \in \Delta$ there exists some non-negative integer $\tau_\rho$ such
that for all $p, q \in P$, $p \ne q$, $d(\rho, p, q) = \tau_\rho$.
Section 2.1 described a No Communication scheduling model. Section 2.2 
described a No Delay scheduling model. Section 2.3 described a General Delay 
scheduling model. 
It should be clear that there is a hierarchy of generality which means that
General Delay scheduling problems are at least as hard as Uniform Delay sched-
uling problems which are at least as hard as Fixed Delay problems which are at 
least as hard as either Unit Delay scheduling problems or No Delay scheduling 
problems, and that Unit Delay scheduling problems or No Delay problems are at 
least as hard as No Communication scheduling problems. 
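The restricted models of Definitions 2.18 to 2.23 can be read as predicates on an instance. The sketch below encodes an instance as a small data class and tests each restriction; the class and function names are hypothetical. Where a definition speaks of "the range" of a function, the code checks every value actually taken, so vacuous cases (no edges, or a single processor) are treated as satisfying the condition, which is an assumption of the sketch. Non-negativity of delays is not checked.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


@dataclass(frozen=True)
class Instance:
    processors: FrozenSet[Hashable]
    tasks: FrozenSet[Hashable]
    edges: FrozenSet[Edge]
    f: Callable[[Hashable], int]                  # execution time of a task
    d: Callable[[Edge, Hashable, Hashable], int]  # delay of an edge from p to q


def is_uet(a):
    return all(a.f(task) == 1 for task in a.tasks)


def is_no_communication(a):
    return len(a.edges) == 0


def is_no_delay(a):
    return all(a.d(e, p, q) == 0
               for e in a.edges for p in a.processors for q in a.processors)


def _local_free(a):
    # d(edge, p, p) = 0 for every edge and processor.
    return all(a.d(e, p, p) == 0 for e in a.edges for p in a.processors)


def _remote(a, e):
    # Set of delays the edge takes between distinct processors.
    return {a.d(e, p, q) for p in a.processors for q in a.processors if p != q}


def is_uniform_delay(a):
    return _local_free(a) and all(len(_remote(a, e)) <= 1 for e in a.edges)


def is_fixed_delay(a):
    remote = set().union(*(_remote(a, e) for e in a.edges))
    return _local_free(a) and len(remote) <= 1


def is_unit_delay(a):
    remote = set().union(*(_remote(a, e) for e in a.edges))
    return _local_free(a) and remote <= {1}


# Hypothetical instance: two processors, a fixed delay of 3 between them.
example = Instance(
    processors=frozenset({0, 1}),
    tasks=frozenset({"a", "b", "c"}),
    edges=frozenset({("a", "b"), ("a", "c")}),
    f=lambda task: 2,
    d=lambda edge, p, q: 0 if p == q else 3,
)
```

The example is a Fixed Delay (and hence Uniform Delay) instance, but is neither Unit Delay nor No Delay, which mirrors the hierarchy described above.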
2.4.4 Schedules 
Given a scheduling model, the problem of allocating computation to processors 
becomes that of finding a schedule (that is, an allocation of tasks to processors) 
such that the constraints outlined in Section 2.2 are respected. A given task 
may be allocated to more than one processor. We define a schedule as a set of 
tuples, where a tuple refers to the execution of a particular task on a particular 
processor at a particular time. Note that at the point of definition, we are not 
interested in the validity or otherwise of schedules. 
Definition 2.24 Schedule. 
We define a schedule of a set of tasks $\Gamma$ onto a set of processors $P$ as a finite set
of 3-tuples $\{x_1, x_2, \ldots, x_n\}$ where for $i = 1, 2, \ldots, n$, $x_i = (\gamma_i \in \Gamma, p_i \in P, t_i \in \mathbb{Z})$.
The tuple $x_i$ indicates that task $\gamma_i$ is executed on processor $p_i$, beginning at time
$t_i$.
For notational convenience we will often wish to refer only to certain tuples 
within a schedule, such as those which specify a particular processor. The 
following notation is widely used in the thesis to describe a subset of a sched-
ule. Note that, according to the above definition, the resulting subset is also a 
schedule. 
Definition 2.25 Matching Subset of Schedule. 
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
Let $s$ be a schedule for $A$. We use the notation $X(s, \Gamma', P', T')$ as shorthand for
$$\{(\gamma, p, t) \in s : \gamma \in \Gamma', p \in P', t \in T'\},$$
where it is understood that $\Gamma' \subseteq \Gamma$, $P' \subseteq P$ and $T' \subseteq \mathbb{Z}$.
The following definitions are more pieces of shorthand which are used to 
refer to certain properties of schedules. 
Definition 2.26 Makespan. 
Let $s$ be a schedule. We define its makespan, denoted $\mathcal{M}(s)$, as the lowest value
of time beyond which no tuple is being executed in $s$. More formally we define
$$\mathcal{M}(s) = \max_{(\gamma, p, t) \in s} \bigl(t + f(\gamma)\bigr).$$
Definition 2.27 Idle Processor, Busy Processor.
In a schedule $s$ of a task set $\Gamma$ onto a processor set $P$, a processor $p \in P$ is said
to be idle iff $X(s, \Gamma, \{p\}, \mathbb{Z}) = \{\}$. Otherwise it is said to be busy.
Definition 2.28 Processor Usage.
The processor usage, denoted $\mathcal{P}(s)$, of a schedule $s$ of an instance
$(P, \Lambda = (\Gamma, \Delta), f, d)$ of a scheduling model is the number of busy processors. More
formally,
$$\mathcal{P}(s) = |\{p \in P : X(s, \Gamma, \{p\}, \mathbb{Z}) \ne \{\}\}|.$$
Definition 2.29 Active Task Set.
We define the active task set, denoted $\mathcal{A}(s, x)$, of a schedule $s$ of an instance
$(P, \Lambda = (\Gamma, \Delta), f, d)$ of a scheduling model, at time $x$ as
$$\mathcal{A}(s, x) = \{(\gamma, p, t) \in s : t < x \le t + f(\gamma)\}.$$
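The schedule notation of Definitions 2.24 to 2.28 maps naturally onto sets of tuples. In the sketch below a schedule is a Python set of `(task, processor, start_time)` tuples and the execution-time function is a dictionary; the function names and the three-task example are hypothetical, and the value 0 returned for the makespan of an empty schedule is an assumption.

```python
def matching(s, tasks, procs, times):
    """Subset of schedule s whose task, processor and start time all match."""
    return {(g, p, t) for (g, p, t) in s
            if g in tasks and p in procs and t in times}


def makespan(s, f):
    """Latest finishing time over all tuples (0 for an empty schedule)."""
    return max((t + f[g] for (g, _, t) in s), default=0)


def processor_usage(s):
    """Number of busy processors, i.e. processors named in some tuple."""
    return len({p for (_, p, _) in s})


# Hypothetical schedule of three tasks on two processors.
f = {"a": 2, "b": 3, "c": 1}
s = {("a", 0, 0), ("b", 1, 0), ("c", 0, 2)}

# All tuples on processor 0; range(0, 100) stands in for the set of integers.
on_p0 = matching(s, f.keys(), {0}, range(0, 100))
```

Because `matching` returns a set of the same kind of tuples, its result is itself a schedule and can be passed back to `makespan` or `processor_usage`.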
2.4.5 Validity of Schedules 
Having developed a notation in which schedules may be expressed, we are now 
able to define more formally the properties described in Section 2.2. The two 
properties of completeness and overbookedness are defined in a straightforward 
way. The definition of temporal compromise is made in terms of subsidiary 
definitions which at this stage appear superfluous, but which are of use in 
Chapters 6,7 and 9. 
Definition 2.30 (In)Complete Schedule. 
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
Let $s$ be a schedule for $A$. $s$ is said to be complete for $A$ iff every task is executed.
More formally this means that for all tasks $\gamma \in \Gamma$, $X(s, \{\gamma\}, P, \mathbb{Z}) \ne \{\}$. If a
schedule is not complete it is referred to as incomplete.
Definition 2.31 Overbooked Schedule.
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
Let $s$ be a schedule for $A$. $s$ is said to be overbooked if any processor is executing
more than one task at any one time. More formally this means that there exists a
tuple $(\gamma, p, t) \in s$ such that $X(s - \{(\gamma, p, t)\}, \Gamma, \{p\}, \{t, t+1, \ldots, t + f(\gamma) - 1\}) \ne \{\}$.
Definition 2.32 (Interprocessor) Tuple Dependence.
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model,
and $s$ be a schedule for $A$. A tuple dependence for $s$ is an ordered pair of tuples,
say $c = ((\gamma, p, t), (\delta, q, t'))$ where $(\gamma, p, t), (\delta, q, t') \in s$ and $(\gamma, \delta) \in \Delta$. $c$ is an
interprocessor tuple dependence iff $p \ne q$.
Definition 2.33 Valid Tuple Dependence.
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model,
and $s$ be a schedule for $A$. Let $c = ((\gamma, p, t), (\delta, q, t'))$ be a tuple dependence for
$s$. $c$ is valid for $s$ iff
$$t + f(\gamma) + d((\gamma, \delta), p, q) \le t'.$$
Definition 2.34 (Valid) Tuple Dependence Graph.
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model,
and $s$ be a schedule for $A$. A tuple dependence graph, say $D = (s, C)$, is a directed
graph with vertex set $s$ and with a set of tuple dependences, $C$. $D$ is valid for $s$
iff both of the following are true. First, all dependences in $C$ are valid for $s$ and
second, for each tuple in $s$, for each incoming edge to the task in the tuple, there
is at least one tuple dependence in $C$. More formally, $\forall (\delta, q, t') \in s$, $\forall \gamma \in \Gamma$ such
that $(\gamma, \delta) \in \Delta$, $\exists ((\gamma, p, t), (\delta, q, t')) \in C$ such that
$$t' \ge t + f(\gamma) + d((\gamma, \delta), p, q).$$
Definition 2.35 Temporally Compromised Schedule.
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model,
and $s$ be a schedule for $A$. $s$ is said to be temporally compromised iff it has no
valid tuple dependence graph, that is there exists a tuple $(\delta, q, t) \in s$ and an edge
$(\gamma, \delta) \in \Delta$ such that
$$\min_{(\gamma, p, t') \in X(s, \{\gamma\}, P, \mathbb{Z})} \bigl(t' + f(\gamma) + d((\gamma, \delta), p, q)\bigr) > t.$$
Definition 2.36 Valid Schedule. 
A schedule will be said to be valid iff it is complete and neither overbooked nor 
temporally compromised. 
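The three validity conditions can be checked mechanically. The following sketch represents a schedule as a set of `(task, processor, start_time)` tuples, with execution times in a dictionary `f` and the delay function as a callable `d(edge, p, q)`; all function names are hypothetical. Two choices here are assumptions of the sketch: a task starting at `t` is taken to occupy its processor for the start times `t, ..., t + f - 1`, and a scheduled task whose predecessor appears nowhere in the schedule is treated as temporally compromised.

```python
def is_complete(s, tasks):
    """Every task appears in at least one tuple."""
    return set(tasks) <= {g for (g, _, _) in s}


def is_overbooked(s, f):
    """Some processor starts a second tuple while another is still running."""
    for (g, p, t) in s:
        for (g2, p2, t2) in s:
            if (g2, p2, t2) != (g, p, t) and p2 == p and t <= t2 < t + f[g]:
                return True
    return False


def is_temporally_compromised(s, edges, f, d):
    """Some tuple starts before the results of a predecessor can have arrived."""
    for (delta, q, t) in s:
        for (gamma, succ) in edges:
            if succ != delta:
                continue
            # Earliest arrival of gamma's result at q over all copies of gamma.
            arrivals = [t2 + f[gamma] + d((gamma, delta), p, q)
                        for (g2, p, t2) in s if g2 == gamma]
            if not arrivals or min(arrivals) > t:
                return True
    return False


def is_valid(s, tasks, edges, f, d):
    return (is_complete(s, tasks)
            and not is_overbooked(s, f)
            and not is_temporally_compromised(s, edges, f, d))


# Hypothetical two-task instance: a precedes b, remote delay 4, local delay 0.
tasks = {"a", "b"}
edges = {("a", "b")}
f = {"a": 2, "b": 3}
d = lambda edge, p, q: 0 if p == q else 4

remote_ok = {("a", 0, 0), ("b", 1, 6)}     # b waits for the message
remote_early = {("a", 0, 0), ("b", 1, 2)}  # b starts before the message arrives
local_ok = {("a", 0, 0), ("b", 0, 2)}      # same processor, no delay
clash = {("a", 0, 0), ("b", 0, 1)}         # b starts while a is running
```

Taking the minimum over all copies of a predecessor is what allows a task to be executed more than once: only the earliest-arriving copy has to meet the deadline.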
2.4.6 Properties of Schedules 
Finally we define some properties (other than validity) of schedules and some 
ways in which schedules may be related. In later chapters we will prove further 
general properties about schedules which have these basic properties, or are 
related to each other in these ways. 
Definition 2.37 Perfect Schedule. 
A schedule $s$ of an instance $(P, \Lambda = (\Gamma, \Delta), f, d)$ of a scheduling model is referred
to as a perfect schedule iff it is valid and for all times $x = 1, 2, \ldots, \mathcal{M}(s)$,
$|\mathcal{A}(s, x)| = |P|$. Informally each processor is active throughout the whole computation.
Definition 2.38 Pruning.
A schedule $s'$ is defined to be a pruning of a schedule $s$ iff $s' \subseteq s$.
Definition 2.39 Fully Pruned.
A schedule $s$ is said to be fully pruned with respect to an instance of a scheduling
model iff, with respect to that instance, it is valid, and there exists no valid
schedule $s' \subset s$.
Definition 2.40 Remapping.
A schedule $s'$ is defined to be a remapping of a schedule $s$ of an instance
$(P, \Lambda = (\Gamma, \Delta), f, d)$ of a scheduling model iff it executes the same tasks as
$s$ at the same times as in $s$, but not necessarily on the same processors. More
formally, this means $\forall (\gamma, p, t) \in s$, $\exists q \in P$ such that $(\gamma, q, t) \in s'$, and
$\forall (\gamma, q, t) \in s'$, $\exists p \in P$ such that $(\gamma, p, t) \in s$.
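Pruning and remapping reduce to simple set comparisons when schedules are held as sets of `(task, processor, start_time)` tuples. The sketch below uses hypothetical function names; it does not check that the processors of the remapped schedule belong to the instance's processor set, which is left as an assumption on the inputs.

```python
def is_pruning(s_prime, s):
    """s_prime is a pruning of s iff it is a subset of s."""
    return s_prime <= s


def is_remapping(s_prime, s):
    """Same (task, start time) executions; processors may differ."""
    return ({(g, t) for (g, _, t) in s_prime}
            == {(g, t) for (g, _, t) in s})


# Hypothetical schedules over processors 0 and 1.
s = {("a", 0, 0), ("b", 1, 0), ("c", 0, 2)}
pruned = {("a", 0, 0), ("c", 0, 2)}
remapped = {("a", 1, 0), ("b", 0, 0), ("c", 1, 2)}
```

Since a remapping keeps every task's start time, the finishing times, and therefore the latest finishing time, are unchanged; a remapping may nevertheless be overbooked even when the original schedule is not.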
Chapter 3 
Analysis of Scheduling Models 
In this chapter we review and derive a number of results for a variety of sched-
uling models, and put the models into the framework given in Chapter 1. The 
results concern the way in which performance gains can be achieved by adding 
processors, and the complexity of scheduling to achieve a given deadline. These 
results turn out to be strongly dependent upon the exact nature of the sched-
uling model. It is our purpose to highlight the differences between the models. 
As a result of the bounds derived, and assuming some basic properties of sched-
ules, we are able to show differences between the various scheduling models in 
terms of our framework. 
To start with, in Section 3.1 below, we prove a number of results which are of 
use in the later sections of this chapter and in Chapter 4. We also define a simple 
optimal scheduling algorithm for No Delay scheduling models where there are 
at least as many processors as tasks. 
3.1 	General-Purpose Results 
The general purpose results are divided into two subsections: those that refer 
to General Delay scheduling models, and those that refer exclusively to No Delay 
scheduling models. Many of the results are relatively trivial, and appear only 
to avoid replication in later proofs. 
3.1.1 Lemmas referring to General Delay Scheduling Models 
Lemma 3.1 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling
model, and $s$ be a valid schedule for $\Lambda$. Let $s'$ be a schedule formed by pruning $s$.
$M(s') \le M(s)$.
Proof
By Definition 2.38 $s' \subset s$. Thus
$$M(s) = \max_{(\gamma, p, t) \in s} \bigl(t + f(\gamma)\bigr) \ge \max_{(\gamma', p', t') \in s'} \bigl(t' + f(\gamma')\bigr) = M(s').$$
□
Lemma 3.2 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling
model, and $s$ be a schedule for $\Lambda$ which is not overbooked. Let $s'$ be a schedule which is
a pruning of $s$. $s'$ is not overbooked.
Proof
Let us assume as a hypothesis to be proven contradictory that there exists some
tuple, say $(\gamma, p, t) \in s'$ such that $X(s' - \{(\gamma, p, t)\}, \Gamma, p, \{t, t+1, \dots, t+f(\gamma)\}) = x \ne \{\}$.
Then by Definition 2.25 $x \subset s'$. Now since by Definition 2.38 $s \supset s'$, $x \subset s$.
This in turn would imply that $X(s - \{(\gamma, p, t)\}, \Gamma, p, \{t, t+1, \dots, t+f(\gamma)\}) \supseteq x$,
which contradicts our assumption that $s$ is not overbooked. □
Lemma 3.3 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling
model, $s$ be a schedule for $\Lambda$ and $s'$ be a schedule formed by remapping $s$. $M(s') = M(s)$.
Proof
By Definition 2.26 there exists a tuple $(\gamma, p, t) \in s$ such that $t + f(\gamma) = M(s)$. By
Definition 2.40 this implies there is a tuple $(\gamma, p', t) \in s'$ and so $M(s') \ge M(s)$.
Now let us assume as a hypothesis to be proven contradictory that $M(s') > M(s)$.
Thus, by Definition 2.26 there exists a tuple $(\gamma', q, t') \in s'$ such that
$t' + f(\gamma') = M(s')$. Thus, by Definition 2.40 there is a tuple $(\gamma', q', t') \in s$,
corresponding to the execution of a task which terminates at time $M(s')$. By
hypothesis $M(s') > M(s)$, which contradicts Definition 2.26. Thus we may
conclude that $M(s) = M(s')$. □
Lemma 3.4 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling
model, $s$ be a complete schedule for $\Lambda$ and $s'$ be a schedule formed by remapping $s$. $s'$ is
complete.
Proof
Let us assume as a hypothesis to be proven contradictory that $s'$ is not complete.
Thus, by Definition 2.30 there exists some task $\gamma$ such that $X(s', \{\gamma\}, P, Z_0^+) = \{\}$.
By Definition 2.40, $\forall (\gamma, p, t) \in s$, $\exists q \in P$ such that $(\gamma, q, t) \in s'$. Thus
$X(s, \{\gamma\}, P, Z_0^+) = \{\}$. Thus $s$ is incomplete, which leads to a contradiction. □
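Lemma 3.5 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling
model, and $s$ be a valid schedule for $\Lambda$. If in $A$, $\gamma \in pred(\delta, A)$ and there exists a tuple
$(\delta, p, t_\delta) \in s$ then there exists a tuple $(\gamma, q, t_\gamma) \in s$ such that $t_\gamma + f(\gamma) \le t_\delta$.
Proof
Let us assume as a hypothesis to be proven contradictory that $\delta \in \Gamma$ and
$\gamma \in pred(\delta, A)$ and there exists some tuple $(\delta, p, t_\delta) \in s$ but no tuple $(\gamma, q, t_\gamma) \in s$ such
that $t_\gamma + f(\gamma) \le t_\delta$. Let $\gamma = \gamma_1, \gamma_2, \dots, \gamma_x = \delta$ be a directed path in $A$.
For $i = 2, \dots, x$, $(\gamma_{i-1}, \gamma_i) \in \Delta$ and thus, by Definition 2.35, for each tuple
$(\gamma_i, p, t_{\gamma_i}) \in X(s, \{\gamma_i\}, P, Z)$, there exists a tuple, say
$(\gamma_{i-1}, p', t_{\gamma_{i-1}}) \in X(s, \{\gamma_{i-1}\}, P, Z)$
such that
$$t_{\gamma_{i-1}} + f(\gamma_{i-1}) + d((\gamma_{i-1}, \gamma_i), p', p) \le t_{\gamma_i}.$$
However, by Definition 2.17, $d((\gamma_{i-1}, \gamma_i), p', p) \ge 0$, and thus $t_{\gamma_{i-1}} + f(\gamma_{i-1}) \le t_{\gamma_i}$.
In addition, by Definition 2.17, $f(\gamma_i) > 0$. Thus $t_{\gamma_{i-1}} + f(\gamma_{i-1}) < t_{\gamma_i} + f(\gamma_i)$,
$i = 2, 3, \dots, x$, that is
$$t_\gamma + f(\gamma) = t_{\gamma_1} + f(\gamma_1) < t_{\gamma_2} + f(\gamma_2) < \dots < t_{\gamma_{x-1}} + f(\gamma_{x-1}) \le t_{\gamma_x} = t_\delta,$$
which leads to a contradiction. □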
3.1.2 Results Referring Only to No Delay Scheduling Models 
Our first result below states that we can remap schedules of No Delay scheduling 
models without worrying about rendering them temporally compromised. 
Lemma 3.6 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a No Delay scheduling
model. Let $s$ be a valid schedule for $\Lambda$. Let $s'$ be a remapping of $s$. $s'$ is not temporally
compromised.
Proof
Let us assume as a hypothesis to be proven contradictory that $s'$ is temporally
compromised. By Definition 2.20 $\forall p, q \in P$, $\forall \rho \in \Delta$, $d(\rho, p, q) = 0$. Thus, by
Definition 2.35, there exists a tuple, say $(\gamma, p', t_\gamma) \in s'$ such that there exists an
edge, say $(\delta, \gamma) \in \Delta$, such that
$$\min_{(\delta, q, t_\delta) \in X(s', \{\delta\}, P, Z)} \bigl(t_\delta + f(\delta)\bigr) > t_\gamma.$$
Now by Definition 2.40, for our tuple $(\gamma, p', t_\gamma) \in s'$ there exists a processor
$p \in P$ such that $(\gamma, p, t_\gamma) \in s$, and by Definition 2.36 there exists a processor $q \in P$
such that $(\delta, q, t_\delta) \in s$ and $t_\delta + f(\delta) \le t_\gamma$. This in turn means, by Definition 2.40,




which leads to a contradiction and the lemma is proved. □
Our second result relates to the following naïve scheduling algorithm, which 
assigns each task $\gamma_i$ to a different processor $p_i$, at a time which is the maximum
vertex-weighted path length for those paths whose end vertex is $\gamma_i$. We prove
that such schedules are valid and have optimal makespan. In due course we 
will define another algorithm that can be used to make such schedules less 
processor-greedy. 
Algorithm 3.1 Naïve No Delay Scheduler
1 	Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a No Delay Scheduling
Model where $|\Gamma| = |P|$;
2 	Let $p_1, \dots, p_{|P|}$ be an enumeration of $P$;
3 	Let $\gamma_1, \dots, \gamma_{|\Gamma|}$ be an enumeration of $\Gamma$;
4 	Let $s = \{\}$;
5 	Foreach $\gamma_i \in \Gamma$
do
6 	let $t_i = \max_{\delta_1, \dots, \delta_x, \gamma_i \text{ is a directed path in } \Gamma} \sum_{j=1}^{x} f(\delta_j)$;
7 	let $s = s \cup \{(\gamma_i, p_i, t_i)\}$;
enddo
end
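Algorithm 3.1 can be sketched in a few lines of code. The version below is an illustrative sketch rather than the thesis's own implementation: it assumes the dag is given as a list of tasks and a list of (predecessor, successor) edges, that f is a dictionary of execution times, and that a task with no predecessors starts at time 0 (the empty path sum). The function name naive_no_delay_schedule is hypothetical. The maximum over directed paths at Line 6 is computed by the equivalent recursion over immediate predecessors.

```python
from functools import lru_cache
from typing import Dict, Hashable, List, Set, Tuple


def naive_no_delay_schedule(
    tasks: List[Hashable],
    edges: List[Tuple[Hashable, Hashable]],
    f: Dict[Hashable, int],
    processors: List[Hashable],
) -> Set[Tuple[Hashable, Hashable, int]]:
    """Give every task its own processor and its earliest possible start time."""
    if len(tasks) != len(processors):
        raise ValueError("the algorithm assumes as many processors as tasks")

    # Immediate predecessors of each task.
    preds: Dict[Hashable, List[Hashable]] = {g: [] for g in tasks}
    for u, v in edges:
        preds[v].append(u)

    @lru_cache(maxsize=None)
    def start(g: Hashable) -> int:
        # Longest vertex-weighted path ending just before g:
        # the latest finishing time over g's immediate predecessors.
        return max((start(u) + f[u] for u in preds[g]), default=0)

    # Line 7: one tuple (task, processor, start time) per task.
    return {(g, p, start(g)) for g, p in zip(tasks, processors)}
```

The recursion is memoised, so each task's start time is computed once; for very deep dags an explicit topological ordering would avoid Python's recursion limit.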
Theorem 3.1 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a No Delay Scheduling
Model where $|\Gamma| = |P|$. Let $s$ be the result of applying Algorithm 3.1 to $\Lambda$. $s$ is valid
and there exists no valid schedule $s'$ such that $M(s') < M(s)$.
Proof
For $i = 1, \dots, |P|$ let $p_i$, $\gamma_i$ and $t_i$ be as assigned in the execution of Algorithm 3.1.
Note that for each $p_i \in P$, $|X(s, \Gamma, \{p_i\}, Z)| = 1$, thus by Definition 2.31 $s$ is not
overbooked. Note also that for each task $\gamma_i \in \Gamma$ there is a tuple $(\gamma_i, p_i, t_i) \in s$, so,
by Definition 2.30, $s$ is complete.
Let us assume as a hypothesis to be contradicted that $s$ is temporally compromised.
This implies there exists a task $\gamma_a \in \Gamma$ and an edge $(\gamma_a, \gamma_b) \in \Delta$ such
that there exists a pair of tuples, say $(\gamma_a, p_a, t_a), (\gamma_b, p_b, t_b) \in s$ where $t_b < t_a + f(\gamma_a)$.
Let $\gamma_{\eta_1}, \dots, \gamma_{\eta_y}, \gamma_a$ be the directed path in $\Gamma$ referred to at Line 6. Thus
$t_a = \sum_{j=1}^{y} f(\gamma_{\eta_j})$. Thus $t_b < \sum_{j=1}^{y} f(\gamma_{\eta_j}) + f(\gamma_a)$. Thus
$$t_b < \max_{\delta_1, \dots, \delta_x, \gamma_b \text{ is a directed path in } \Gamma} \sum_{j=1}^{x} f(\delta_j)$$
which leads to a contradiction.
Let us assume as a hypothesis to be proven contradictory that there exists
a valid schedule $s'$ such that $M(s') < M(s)$. Let $(\gamma_a, p_a, t_a) \in s'$ be a tuple such
that $t_a + f(\gamma_a) = M(s')$. Let $\gamma_{\eta_1}, \dots, \gamma_{\eta_y}$ be a directed path in $\Gamma$ such that
$t_a = \sum_{j=1}^{y} f(\gamma_{\eta_j})$. For each $j = 1, \dots, y$ let $t'_{\eta_j}$ be the earliest time at which
$\gamma_{\eta_j}$ is executed in $s'$. Note that, by Definitions 2.20 and 2.35 we have that for




which, by Definition 2.17, leads to a contradiction. □
3.2 	Bounds on Profitable use of Processors 
In this section we consider the way in which, as a result of the differences 
between the underlying communications costs, the choice by the multicomputer 
programmer of a No Communication model, a No Delay model, a Fixed Delay 
model, or a Uniform Delay model for performance prediction, could lead to very 
different conclusions as to how many processors he or she might usefully use. 
3.2.1 Results for Models Without Communications Delay 
Our first result simply states that in models without communication delay there 
is never any benefit to be gained from repeated execution of the same task. 
Theorem 3.2 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a No Delay scheduling
model. Let $s$ be a fully pruned schedule for $\Lambda$. For all $\gamma \in \Gamma$, $|X(s, \{\gamma\}, P, Z)| = 1$.
Proof
Let us assume, as a hypothesis to be contradicted, that there exists some task
$\gamma$ such that $|X(s, \{\gamma\}, P, Z)| \ne 1$. We let $x$ denote the set $X(s, \{\gamma\}, P, Z_0^+)$. By
Definitions 2.31 and 2.39, $|x| \ge 1$ and thus we may assume that $|x| > 1$. Let
$x_1, x_2, \dots, x_n$ be an enumeration of $x$ such that for $i = 1, 2, \dots, n$, $x_i = (\gamma, p_i, t_i)$ and
for $i = 2, 3, \dots, n$, $t_{i-1} \le t_i$. Let $s' = (s - x) \cup \{x_1\}$. By Definition 2.39 and by
hypothesis, $s'$ is invalid.
Since $s' \supseteq X(s, \Gamma - \{\gamma\}, P, Z_0^+)$ and $s$ is valid, for all $\beta \in \Gamma - \{\gamma\}$, we have
that $X(s', \{\beta\}, P, Z_0^+) \ne \{\}$. Since $s' \supseteq \{x_1\}$, we have that $X(s', \{\gamma\}, P, Z_0^+) \ne \{\}$.
Thus we conclude that for all $\beta \in \Gamma$, $X(s', \{\beta\}, P, Z_0^+) \ne \{\}$, so by Definition 2.30,
$s'$ is complete.
Let us assume $s'$ is temporally compromised. Since $s$ is valid, $s \cup s' = s$ and
$s - s'$ contains only tuples instancing $\gamma$ and by Definitions 2.35 and 2.20
$$\min_{(\gamma, p, t) \in X(s, \{\gamma\}, P, Z)} t + f(\gamma).$$
However, $\min_{(\gamma, p, t) \in X(s, \{\gamma\}, P, Z)} t = t_1$ and so we conclude $s'$ is not temporally
compromised.
By Lemma 3.2, $s' \subset s$ cannot be overbooked.
Thus we must conclude that $s'$ is valid since it is complete and neither temporally
compromised nor overbooked, which leads to a contradiction, and so the
theorem is proved. □
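The construction used in this proof, which keeps only the earliest instance of a replicated task, can be illustrated as follows. This is a hedged sketch and not part of the thesis: it assumes a schedule is a set of (task, processor, start time) tuples, the function name is hypothetical, and it applies the construction to every task at once. That the result remains valid is what the theorem argues for No Delay models; the code does not check validity itself.

```python
from typing import Dict, Hashable, Set, Tuple

ScheduleTuple = Tuple[Hashable, Hashable, int]


def keep_earliest_instances(s: Set[ScheduleTuple]) -> Set[ScheduleTuple]:
    """For every task keep one tuple with the smallest start time (the x_1 of the proof)."""
    earliest: Dict[Hashable, ScheduleTuple] = {}
    # Visit tuples in order of increasing start time; the first one seen
    # for each task is an earliest instance, later ones are replicas.
    for task, proc, t in sorted(s, key=lambda tup: tup[2]):
        if task not in earliest:
            earliest[task] = (task, proc, t)
    return set(earliest.values())
```

When two instances of a task start at the same time, either may be kept; the choice does not affect the start time retained for the task.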
Next, as an intermediate result which is of use in proving a number of 
theorems, we show that in any schedule without repeated execution of the 
same task, at all time steps the active task set of Definition 2.29 is an antichain. 
Lemma 3.7 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling
model. Let $s$ be a valid schedule for $\Lambda$. If for all $\gamma \in \Gamma$, $|X(s, \{\gamma\}, P, Z)| = 1$, then for
all $t = 0, 1, \dots, M(s) - 1$, $\{\gamma : (\gamma, p, t) \in \mathcal{A}(s, t)\}$ is an antichain.
Proof
Let us assume, as a hypothesis to be contradicted, that there exists some time
$t^* \in \{0, 1, \dots, M(s) - 1\}$ such that $\{\gamma : (\gamma, p, t^*) \in \mathcal{A}(s, t^*)\}$ is not an antichain.
Thus there exists some pair of tuples, say $(\delta, q, t), (\gamma, p, \tau) \in \mathcal{A}(s, t^*)$, such that
$\gamma \in pred(\delta, A)$. Thus $\tau \le t^* < \tau + f(\gamma)$ and $t \le t^* < t + f(\delta)$, in which case
$t < \tau + f(\gamma)$.
By Lemma 3.5, in order for $s$ to be valid there would have to be a tuple
$(\gamma, p', \tau^*) \in s$ such that $\tau^* + f(\gamma) \le t$, that is $\tau^* \ne \tau$. Thus $|X(s, \{\gamma\}, P, Z)| > 1$,
which leads to a contradiction and the lemma is proved. □
This leads us on to proving that when scheduling any dag in a model in 
which there are no communication delays there is never any point in using 
more processors than the width of the dag. Consider the following algorithm 
which acts upon schedule s. The algorithm enumerates the set of processors in 
the instance of the scheduling problem and remaps all computation to a subset 
of that set of processors. It simply moves through the schedule considering the 
tuples in order of increasing time of execution. Let us say it is considering a 
tuple (-y, p, t) and p is outwith the chosen subset of processors, and there is some 
processor q within the subset which is idle at time t. Informally, the algorithm 
swaps the computation performed by processors p and q from timestep t on-
wards. The algorithm remaps all tasks executed on processor p, in timesteps t 
and after, so that they are executed on processor q at the same time steps, and 
similarly remaps q's computation to p. 
Algorithm 3.2 General-Purpose Remapper
1 	Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a General Delay Scheduling
Model;
2 	Let $s$ be a valid schedule for $\Lambda$;
3 	Let $p_1, \dots, p_{|P|}$ be an enumeration of $P$;
4 	Let $r_1, \dots, r_{|s|}$ be an enumeration of $s$ such that for each $i = 2, \dots, |s|$,
where say $r_{i-1} = (\gamma, p, t)$ and $r_i = (\delta, q, t')$, $t' \ge t$;
5 	Let $x = \max_{t=0}^{M(s)} |\mathcal{A}(s, t)|$;
6 	For $i = 1, \dots, |s|$
7 	Let $(\gamma, p_j, t) = r_i$;
8 	if $j > x$
9 	let $h$ be the minimum integer such that $X(\mathcal{A}(s, t), \Gamma, \{p_h\}, Z) = \{\}$;
10 	let $a = X(\{r_i, \dots, r_{|s|}\}, \Gamma, \{p_j\}, Z_0^+)$;
11 	let $b = X(\{r_i, \dots, r_{|s|}\}, \Gamma, \{p_h\}, Z_0^+)$;
12 	for all $r_k = (\gamma, p_j, t) \in a$ let $r_k = (\gamma, p_h, t)$;




Note that in the above algorithm r and s are being treated like program 
variables. At any one instant they form a set, but as the elements of the set are 
modified, the set changes. 
In the following lemma we prove certain properties of schedules resulting 
from the application of this algorithm to valid schedules of General Delay sched-
uling models. We do not go so far as to show they are valid, but merely show 
they are complete, not temporally compromised, and use at most as many pro-
cessors as the maximum number of processors in use at any one time in the 
input schedule. In later theorems we will show that the output schedules are 
valid in three special cases. One of these cases is where the input schedule is 
a valid schedule for a No Delay scheduling model. Note that this means that 
application of Algorithm 3.2 to the result of applying Algorithm 3.1 to a No 
Delay scheduling model will result in a schedule with optimal makespan. 
Lemma 3.8 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling
model. Let $s$ be a valid schedule for $\Lambda$. Let $s'$ be the output of Algorithm 3.2 on $s$.




Let $r_1, \dots, r_{|s|}$ and $p_1, \dots, p_{|P|}$ be the enumeration of $s$ and $P$ respectively chosen
by the algorithm. Let $x = \max_t |\mathcal{A}(s, t)|$. We use variables indexed by a
superscript, e.g. $s^m$, to refer to the state of the variable at Line 6 when $i = m$.
Note that Lines 12 and 13 ensure that for each tuple $r_a^i$, $a = 1, \dots, |s|$, $i = 1, \dots, |s|$,
where say $r_a^i = (\gamma, p, t)$, there exists some processor $p'$ such that
$r_a^1 = (\gamma, p', t)$. Thus for each $i = 1, \dots, |s|$, $s^i$ is a remapping of $s$. Thus
$|\mathcal{A}(s^i, t)| = |\mathcal{A}(s', t)| = |\mathcal{A}(s, t)|$, and, by Lemma 3.4, $s'$ is complete.
Let us assume as a hypothesis to be proven contradictory that there exists
$(\gamma, p_j, t) \in s'$ such that $j > x$. Since each tuple $r_i \in s$ is considered at Line 6
in order of increasing $i$ and no tuple that has been considered is subsequently
remapped, this implies that at some value of $i$, where say $(\gamma, p_j, t) = r_i^i$, for
$m = 1, \dots, x$, $X(\mathcal{A}(s^i, t), \Gamma, \{p_m\}, Z) \ne \{\}$ and thus
$$|X(\mathcal{A}(s^i, t), \Gamma, \{p_1, \dots, p_x\}, Z)| \ge x.$$
But $(\gamma, p_j, t) \in \mathcal{A}(s^i, t)$ and $(\gamma, p_j, t) \notin X(\mathcal{A}(s^i, t), \Gamma, \{p_1, \dots, p_x\}, Z)$, thus
$|\mathcal{A}(s^i, t)| > x$
which leads to a contradiction. Thus we conclude that no processor outside $p_1, \dots, p_x$
is used in $s'$ and thus $\mathcal{P}(s') \le \max_t |\mathcal{A}(s, t)|$.
Let us assume as a hypothesis to be proven contradictory that $s'$ is overbooked.
Since $s$ is not overbooked, this would imply there were two intermediate
schedules, say $s^i$ and $s^{i+1}$, such that $s^i$ was not overbooked and $s^{i+1}$
was overbooked. We shall denote the value of $a^i \cup b^i$ immediately after Lines
12 and 13 as $c$. Since $s^{i+1} = (s^i - (a^i \cup b^i)) \cup c$ there must exist two tuples,
say $(\gamma, p, t), (\delta, p, t') \in c$ such that $t < t' + f(\gamma)$.
This implies there exist two
tuples, say $(\gamma, p', t), (\delta, p', t') \in (a^i \cup b^i)$ such that $t < t' + f(\gamma)$, which implies
$s^i$ is overbooked, which leads to a contradiction. □
We are now in a position to prove, by stringing together a series of lem-
mas, that a schedule in a No Delay scheduling model can usefully use no more 
processors than the width of the dag being scheduled. 
Theorem 3.3 Let $\Lambda = (P, A, f, d)$ be an instance of a No Delay scheduling model. Let
$s$ be a valid schedule for $\Lambda$. There exists a valid schedule $s'$ such that $M(s') \le M(s)$
and $\mathcal{P}(s') \le W(A)$.
Proof
Let $s^* \subseteq s$ be a fully pruned schedule for $\Lambda$. Let $s'$ be the output of Algorithm 3.2
on $s^*$. By Lemma 3.1 $M(s^*) \le M(s)$. By Lemmas 3.3 and 3.8 $M(s') = M(s^*)$.
By Lemma 3.7 and Theorem 3.2, for $t = 0, 1, \dots, M(s^*) - 1$, $|\mathcal{A}(s^*, t)| \le W(A)$.
By Lemma 3.8 $\mathcal{P}(s') \le \max_t |\mathcal{A}(s^*, t)|$ and $s'$ is complete and not overbooked.
By Lemma 3.6 $s'$ is not temporally compromised. Thus $s'$ is valid for
$\Lambda$ and is
our schedule as required. □
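For small dags the width bound can be computed directly. The sketch below is illustrative only: it assumes that W(A) denotes the size of the largest antichain of the dag (a set of tasks with no directed path between any two of them), which is how the width is used in the discussion of dags with an empty edge set, and it uses a brute-force search that is exponential in the number of tasks. The function name dag_width is hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, List, Set, Tuple


def dag_width(tasks: List[Hashable], edges: List[Tuple[Hashable, Hashable]]) -> int:
    """Size of the largest antichain, found by exhaustive search (small dags only)."""
    succ: Dict[Hashable, Set[Hashable]] = {g: set() for g in tasks}
    for u, v in edges:
        succ[u].add(v)

    def descendants(g: Hashable) -> Set[Hashable]:
        # Every task reachable from g by a directed path of at least one edge.
        seen: Set[Hashable] = set()
        stack = [g]
        while stack:
            for v in succ[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    desc = {g: descendants(g) for g in tasks}

    def is_antichain(subset) -> bool:
        return all(b not in desc[a] and a not in desc[b]
                   for a, b in combinations(subset, 2))

    # Try the largest candidate sizes first and stop at the first antichain found.
    for k in range(len(tasks), 0, -1):
        if any(is_antichain(c) for c in combinations(tasks, k)):
            return k
    return 0
```

A polynomial-time alternative would compute a minimum chain cover by bipartite matching on the transitive closure; the exhaustive version is used here only because it mirrors the definition.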
Corollary 3.1 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a No Communication
scheduling model. Let $s$ be a valid schedule for $\Lambda$. There exists a valid schedule $s'$ such
that $M(s') \le M(s)$ and $\mathcal{P}(s') \le |\Gamma|$.
For a graph with an empty edge set, $\Gamma$ is an antichain, since there is no path
between any of the tasks, and thus $W(A) = |\Gamma|$. Note also that the following
theorem establishes that the $W(A)$ bound is tight.
Theorem 3.4 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a No Communication
UET scheduling model. Let $s$ be a valid schedule for $\Lambda$ such that $M(s) = 1$. $\mathcal{P}(s) \ge |\Gamma|$.
Proof
Let us assume there exists a valid schedule $s$ for $\Lambda$ such that $\mathcal{P}(s) < |\Gamma|$. Since
$s$ is complete, for some processor $p \in P$ this implies $|X(s, \Gamma, \{p\}, \{0\})| \ge 2$, which
implies $s$ is overbooked, which leads to a contradiction. □
3.2.2 Precedence With Delay but Without Replication 
We now consider the introduction of communications delay into the scheduling 
problem. Our first result refers again to Algorithm 3.2. We prove that if the 
input schedule to Algorithm 3.2 is a valid schedule of an instance of a sched-
uling model with certain properties then the output schedule is valid. These 
properties are either that the scheduling model is a Unit Delay model, or that 
the scheduling model is a Fixed Delay model, and in the schedule all tuples that 
are executed by a processor are executed in a single continuous period. The 
latter pair of properties are not important in this section, but become relevant 
in Section 4.2.4 below. 
Lemma 3.9 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a Fixed Delay scheduling
model where the range of $d$ is $\{0, \tau\}$. If $\tau = 1$ let $s$ be an arbitrary valid schedule for $\Lambda$.
If $\tau > 1$ let $s$ be a schedule such that there does not exist a pair of tuples $(\gamma, p, t), (\delta, p, t')$
such that there exists a time $t^*$, $t < t^* < t'$, such that $X(\mathcal{A}(s, t^*), \Gamma, \{p\}, Z) = \{\}$. Let
$s'$ be the result of applying Algorithm 3.2 to $s$. $s'$ is not temporally compromised for $\Lambda$.
Proof 
Let $r_a^b$, $a = 1, \dots, |s|$, $b = 1, \dots, |s|$ refer to the state of $r_a$ at step 6 of Algorithm 3.2
when $i = b$. Let $r'_a$ refer to the state of $r_a$ at the termination of the algorithm.
Let us assume $s'$ is temporally compromised. This implies that there exists an
edge $(\gamma, \delta) \in \Delta$ and a pair of tuples, say $r'_u = (\gamma, p_l, t)$ and $r'_v = (\delta, p_m, t')$
such that
$$t + f(\gamma) + d((\gamma, \delta), p_l, p_m) > t'.$$
Now $s$ is valid, and thus not temporally compromised, and so there exist two
tuples $r_u = (\gamma, p, t)$ and $r_v = (\delta, p', t')$ such that
$$t + f(\gamma) + d((\gamma, \delta), p, p') \le t'.$$
These inequalities imply $d((\gamma, \delta), p, p') < d((\gamma, \delta), p_l, p_m)$, that is $d((\gamma, \delta), p, p') = 0$
and $d((\gamma, \delta), p_l, p_m) = \tau$. Thus $p = p'$ and $p_l \ne p_m$. They also imply $t' > t$, which
in turn implies $r_v$ comes later in the enumeration than $r_u$, that is $u < v$.
Let us consider the states of the variables at the beginning of the loop when
$i = u$ and when $i = v$. Note that $r_u^u = (\gamma, p_l, t)$ and thus $r_v^u = (\delta, p_l, t')$. Note that
at some value of $i$, $u < i \le v$, $r_v^i = (\delta, p_l, t')$ and $r_v^{i+1} = (\delta, p_m, t')$. We have two
cases whereby this could have occurred.
• $l > x$ at Line 8, in which case, since $(\gamma, p_l, t) \in s'$, we are led to a
contradiction of Lemma 3.8.
• There exists a time $t^*$, $t < t^* < t'$, such that $X(\mathcal{A}(s, t^*), \Gamma, \{p_m\}, Z) = \{\}$ at
Line 9. Thus, by our restrictions on $s$ and $\Lambda$, $\tau = 1$ and $t + f(\gamma) + 1 > t'$.
Now for $i = t, \dots, t' - 1$, $X(\mathcal{A}(s, i), \Gamma, \{p_l\}, Z) = \{(\gamma, p_l, t)\}$, and thus
$t' \ge t + f(\gamma) + 1$, which leads to a contradiction.
Thus we conclude that our hypothesis is contradictory and the lemma is
proved. □
Theorem 3.5 Let $\Lambda = (P, A, f, d)$ be an instance of a Unit Delay scheduling model.
Let $s$ be a valid schedule for $\Lambda$ such that for all $\gamma \in \Gamma$, $|X(s, \{\gamma\}, P, Z)| = 1$. There
exists a valid schedule $s'$ such that $M(s') \le M(s)$ and $\mathcal{P}(s') \le W(A)$.
Proof
Let $s'$ be the output of Algorithm 3.2 on $s$. By Lemmas 3.3 and 3.8 $M(s') = M(s)$.
By Lemma 3.7, for $t = 0, 1, \dots, M(s) - 1$, $|\mathcal{A}(s', t)| \le W(A)$. By Lemma 3.8
$\mathcal{P}(s') \le \max_t |\mathcal{A}(s', t)|$ and $s'$ is complete and not overbooked. By Lemma 3.9
$s'$ is not temporally compromised. Thus $s'$ is valid for $\Lambda$ and is our schedule as
required. □
3.2.3 Precedence with Delay and Replication 
The proof of Theorem 3.5 relies on the fact that task replication is disallowed. 
Consider the dag shown in Figure 3-1 in which each of the tasks has unit 
execution time and there is a unit communication delay. 
In order for the makespan of a schedule of the dag to be 4, tasks A, B, C, D 
must be executed on the one processor, tasks A, B, C, E on another and F, H, J, K
on a third, since these chains are of length 4 and so there is no way that an 
interprocessor communication delay can be introduced into them. We are then 
required to compute task C at time 1 so that it can be ready as input to task E 
at time 3, and this must be done on a fourth processor as the three processors 
executing the chains are each busy for the duration of the computation. Thus a 
schedule with optimal makespan requires to use at least 4 processors, whereas 
the width of the dag is 3 by inspection. 
We now consider arbitrary execution time and Uniform Delay scheduling 
models. Recall that, by Definition 2.23, Uniform Delay models are those where 
the communication delay is dependent upon the precedence edge but not upon 
the processors to which the tasks incident to the edge have been mapped. 
We shall need the following lemma which states that in such models the last 
task executed by a processor can be pruned if it is computed by some other 
processor at the same or at an earlier time. 
Lemma 3.10 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a Uniform Delay scheduling
model. Let $s$ be a valid schedule for $\Lambda$. Let $(\gamma, p, t) \in s$ be a tuple such that
$X(s, \Gamma, p, \{t+1, \dots, M(s) - 1\}) = \{\}$ and $X(s, \{\gamma\}, P - \{p\}, \{0, 1, \dots, t\}) \ne \{\}$.
Let $s' = s - \{(\gamma, p, t)\}$. $s'$ is valid for $\Lambda$.
Proof 
Let us assume as a hypothesis to be proved contradictory that s' is incomplete. 
[Figure 3-1: Counter-example for W(A) Bound in UET Unit Delay Scheduling models. Panels: UET UCT task graph with width 3; pruned schedule on 4 processors.]
Now $s$ is complete, thus there exists a task $\delta \in \Gamma$ such that $X(s', \{\delta\}, P, Z) = \{\}$
whereas $X(s, \{\delta\}, P, Z) \ne \{\}$. We have two cases:
• $\delta = \gamma$, in which case $X(s, \{\gamma\}, P - \{p\}, \{0, 1, \dots, t\}) = \{\}$, which leads to a
contradiction.
• $\delta \ne \gamma$, in which case, since $s' = s - \{(\gamma, p, t)\}$, $X(s', \{\delta\}, P, Z) \ne \{\}$, which
leads to a contradiction.
Thus we conclude that $s'$ is complete.
Since $s$ is valid, it is not temporally compromised. Let us assume as a hypothesis
to be contradicted that $s'$ is temporally compromised. Thus there exists some edge,
say $\rho = (\alpha, \beta) \in \Delta$, such that there exists a tuple $(\beta, q, t') \in s'$
which has a
valid tuple dependence edge, say $((\alpha, q^*, t^*), (\beta, q, t'))$ in $s$ but no such edge in
$s'$. Since $s - s' = \{(\gamma, p, t)\}$, this implies that
$(\alpha, q^*, t^*) = (\gamma, p, t)$. Furthermore,
by Definition 2.34 $(t + f(\gamma) + d(\rho, p, q)) \le t'$, whereas
$\min_{(\gamma, p, t^*) \in s'} (t^* + f(\gamma) + d(\rho, p, q)) > t'$. That is
$$(t + f(\gamma) + d(\rho, p, q)) < \min_{(\gamma, p, t^*) \in s'} (t^* + f(\gamma) + d(\rho, p, q)).$$
Note that by Definition 2.23, there exists some non-negative integer, say $\eta$,
such that for all $a, b \in P$, $a \ne b$, $d(\rho, a, b) = \eta$. Note also that by Definition
2.17
$f(\gamma) > 0$. Since $X(s, \Gamma, p, \{t+1, \dots, M(s) - 1\}) = \{\}$, $q \ne p$,
which implies that
$d(\rho, p, q) = \eta$. This in turn implies
$$t < \min_{(\gamma, p, t^*) \in s'} t^*$$
but since $X(s', \{\gamma\}, P - \{p\}, \{0, 1, \dots, t\}) \ne \{\}$, $\min_{(\gamma, p, t^*) \in s'} t^* \le t$, which leads
to a contradiction.
By Lemma 3.2 pruning cannot make a schedule overbooked. Now, $s' \subset s$, that
is, it has been derived by pruning the valid (i.e. not overbooked) schedule $s$, so we
can conclude that $s'$ is not overbooked.
Since $s'$ is complete and neither temporally compromised nor overbooked, it is
valid. □
The above lemma leads us to consider a pruning technique whereby the last
tuple executed by a processor is pruned if there is another tuple instancing the
same task which is executed on some other processor at the same time or earlier.
Lemma 3.11 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a Uniform Delay scheduling
model. Let $s$ be a valid schedule for $\Lambda$ with $\mathcal{P}(s) > |\Gamma|$. $s$ is not fully pruned.
Proof
Let $s$ be a valid schedule such that $\mathcal{P}(s) > l = |\Gamma|$. For each task $\gamma$ we define a
sequence of tuples $\{x_1^\gamma, x_2^\gamma, \dots, x_{n_\gamma}^\gamma\}$ such that for $i = 1, 2, \dots, n_\gamma$, $x_i^\gamma = (\gamma, p_i, t_i) \in s$.
Since there are only $l$ distinct tasks and $\mathcal{P}(s) > l$ distinct non-idle processors, by
the pigeon-hole principle there must exist some task, say $\gamma$, such that $n_\gamma > 1$. We
then construct a schedule $s'$ by pruning $s$ of $x_i^\gamma$ for each $2 \le i \le n_\gamma$.
By Lemma 3.10 $s'$ is valid for any instance of a Uniform Delay scheduling
model because at each pruning of $x_i^\gamma$, $X(s, \{\gamma\}, P - \{p_i\}, \{0, 1, \dots, t_i\}) \supseteq \{x_1^\gamma\} \ne \{\}$. □
The following corollary follows from the fact that $s$ can be iteratively pruned
until $\mathcal{P}(s) \le l$, at which point the pigeon-hole principle no longer applies.
Corollary 3.2 Let $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a Uniform Delay
scheduling model. Let $s$ be a valid schedule for $\Lambda$ with $\mathcal{P}(s) > |\Gamma|$. There exists a valid
schedule $s'$ of $\Lambda$ such that $\mathcal{P}(s') \le |\Gamma|$.
Thus we have that in Uniform Delay models no performance gains can be
achieved by using more than $|\Gamma|$ processors. Finally we show that this bound
in turn relies on the Uniform Delay property, and that it is possible usefully to
schedule to more processors than there are tasks in the dag once this restriction
is relaxed.
Theorem 3.6 There exists an instance $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ of a General Delay
scheduling model and an integer $M$ such that for any valid schedule, say $s$ of $\Lambda$, with
$M(s) \le M$, we have that $\mathcal{P}(s) > |\Gamma|$.
Proof 
of tasks. Let z = {(c,)} U {(/9,'y) : i = 1,...,31 U {('y,) : i = 1,...,37 j = 
1,...,4}. 
	
Let  = (F,). A is shown in Figure 3-2. Let 	pil 	4,j = 
1,:.. 1 31 be a set of processors. For all p E A for all p'i E P let d(p,p,p) = 0. 
For all p e A for all p  e P for all pl E P j 1, let d(p, p, p) = 1. For all p E L 
for all p E P for all 1k  E P k i, let d(p,p,p'k) = 2. For all E F let 1(C) 
= 1. 
Let M = 4. Let C, = (c,. . , c) = (cr,fll , 'yr , Sj, i = 1,.. . , 4. Note that each 
C1,...,C4 isa critical path ofF,thus s is valid only if for each i = 1, ... ,4,for 
each j = 1, ... 1 4 there exists a tuple (c, p, j) e .s. Let us assume as a hypothesis xi  
to be proved contradictory that there exists some tuple (6, p', 3) E .s such that Xi 
for some task 'y E {'Yi, Y2}' 
X(s,{},{p. : y 	{1,...,3} - {y}}, {l}) = {}. 
Now since lpred(-r, A) I = 1, it follows that 	 t = 1 thus 
min 	t+f()+d((,),p) >3. 
(y,p,t)EX(s,{y},P,Z) 	
% 
which implies s is temporally compromised, which leads to a contradiction. 
Thus we conclude that .s is valid only if for each 6j, i = 1,.. . 1 4 for all 
processors p,j = 1,.. .,3, X(s,F,{p},2) 	{}. This in turn implies that s is 
valid only if all I P processors are active insat time 2, and since I P I = 12>11 = 
Fl, the theorem is proved. 	 EM 
Figure 3-2: Counter-example for $|\Gamma|$ Bound in UET General Delay Scheduling
3.3 	Complexity Results 
In this section we briefly review the complexity of scheduling in the various 
delay-variants of scheduling models. In addition we prove a complexity result 
of our own. 
3.3.1 The Types of Complexity Results 
We can identify three different types of complexity results, namely those where 
the number of processors is unbounded, those where it is part of the instance 
and those where it is part of the problem. 
The unbounded processor decision problem is stated below. 
Decision Problem 3.1 Given an instance $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ of a scheduling
model where $P$ is an infinite set, and a positive integer $k$, does there exist a valid schedule
$s$ for $\Lambda$ such that $M(s) \le k$?
As we showed in Section 3.2 in many types of scheduling model there is a 
bound on the number of processors beyond which no performance gains can be 
achieved. In this case we can replace the infinite set of processors in Decision 
Problem 3.1 with a finite set of the corresponding cardinality. 
The second type of decision problem is more straightforward, and can be 
stated as follows. 
Decision Problem 3.2 Given an instance $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ of a scheduling
model and a positive integer $k$, does there exist a valid schedule $s$ for $\Lambda$ such that
$M(s) \le k$?
The final types of scheduling problem are known as 2-processor scheduling 
problems, 3-processor scheduling problems etc. The x-processor scheduling 
problem is stated below. 
Decision Problem 3.3 Given an instance $\Lambda = (P, A = (\Gamma, \Delta), f, d)$ of a scheduling
model where $|P| = x$ and a positive integer $k$, does there exist a valid schedule $s$ for $\Lambda$
such that $M(s) \le k$?
It should be clear that for any particular delay-variant of the scheduling 
model, the NP hardness of Decision Problem 3.1 and Decision Problem 3.2 
follow from the NP completeness of Decision Problem 3.3. 
3.3.2 No-Communication Scheduling Models 
If $\Lambda$ is restricted to be a No Communication scheduling model, Decision Problem 3.3
is NP complete (Bruno et al. [1974]) in the case of two or more processors.
It is related to the bin packing problem (see Coffman et al. [1984b] and Garey
and Johnson [1982]).
Decision Problem 3.1 is polynomial. Indeed in these models Algorithm 3.1
above is a polynomial time algorithm for optimally scheduling to as many
processors as there are tasks. Algorithm 3.2 is a polynomial time algorithm
which will remap such schedules to $|\Gamma|$ processors.
There is also a set of results for preemptive scheduling, which we mention
in passing. McNaughton [1959] produced a simple exact polynomial time
algorithm and Martel [1988] shows that this version of the problem is in the
complexity class NC defined by Pippenger [1979], which implies that it can be
solved on a concurrent read concurrent write (CRCW) PRAM with a number of
processors polynomial in the size of the problem in time which is a polylog of
the problem size. On the other hand Rayward-Smith [1987a] shows that if preemption
incurs a unit delay (in a way analogous to the communication delays
that are present in the General Delay scheduling model) the problem becomes
NP complete.
3.3.3 No-Delay Scheduling Models 
The case where A is restricted to be a No Delay scheduling model has been the 
subject of considerable analysis. 
If $\Lambda$ is restricted to be a No Delay UET scheduling model, Decision Problem 3.2
is NP complete. Decision Problem 3.3 has a polynomial time algorithm for $x = 2$
[Coffman and Graham 1972], and indeed is in NC [Helmbold and Mayr 1987],
but it is NP complete in the case of $x = 2$ if the range of $f$ is $\{1, 2\}$ [Ullman
1975]. However, if $x$ has a value of 3 or more (as is often the case in a real
multicomputer), and $\Lambda$ is an instance of a UET scheduling model, it is not
known whether the problem is NP complete [Garey and Johnson 1979].
In these models, Decision Problem 3.1 is polynomial. Again we may use 
Algorithm 3.1 above to optimally schedule to as many processors as there are 
tasks. Algorithm 3.2 is a polynomial time algorithm which will remap such a 
schedule to W(A) processors. 
3.3.4 Unit, Uniform and General Delay Models 
Complexity results in this area are even now appearing in the literature. At first 
sight these results may seem redundant or contradictory, but it is important to 
take the following into consideration. 
• If there is a makespan minimisation problem that is NP complete when
task replication is allowed, the corresponding problem where no task
replication is allowed is also NP complete. The reverse is not necessarily
true.
• The NP completeness of a No Delay makespan minimisation problem implies
the NP completeness of an equivalent problem in a Uniform Delay,
Fixed Delay or General Delay scheduling model, but does not imply the NP
completeness of the corresponding problem in a Unit Delay scheduling
model.
• Decision Problem 3.1 is not trivial in General Delay, Uniform Delay or Fixed
Delay scheduling models.
By a reduction which disallows recomputation, Rayward-Smith [1987b] 
shows that the UET Unit Delay scheduling problem is NP complete in the ver-
sion of Decision Problem 3.2. Although he does not state the fact explicitly his 
encoding produces a perfect schedule, and thus shows the NP completeness of 
Decision Problem 3.2 even if replication is allowed. Decision Problem 3.3 in 
UET scheduling models is open for Unit Delay, Fixed Delay, Uniform Delay and 
General Delay models, for any x> 1. It is perhaps surprising that the polynomial 
time algorithm for x = 2 in No Delay models has not been generalised to Unit 
Delay models. 
Papadimitriou and Yannakakis [1990] show that Decision Problem 3.1 is NP
complete for UET Fixed Delay models. Inspired by an earlier version of Papadimitriou
and Yannakakis's paper, Jung et al. [1989] noted that, in Papadimitriou
and Yannakakis's complexity proof, the encoding could produce large values
for $\tau$. Jung et al. present an algorithm for UET Fixed Delay models that finds
an optimal schedule onto $|\Gamma|$ processors and proved that its time complexity is
polynomial in $|\Gamma|$ for any fixed $\tau$. The time complexity of the algorithm is
$O(|\Gamma|^{\tau+1})$. A similar
result by Chrétienne [1989] shows a polynomial time algorithm for unbounded
numbers of processors where the execution time of all tasks is less than $\tau$.
The result of Jung et al. [1989] implies that, for example, UET Unit Delay 
scheduling to an unbounded number of processors has a polynomial time solu-
tion if task replication is allowed. On the other hand, Picouleau [1992] shows 
that UET Unit Delay scheduling to an unbounded number of processors is NP 
complete if task replication is not allowed. 
In the case of non UET General Delay models, Decision Problem 3.2 is NP 
complete because the No Delay problem is NP complete [Ullman 1975] even 
for 2 processors with execution time of 1 or 2. Sarkar [1989] showed Decision 
Problem 3.1 is NP complete for non-UET General Delay scheduling models. 
Papadimitriou and Yannakakis's later result for Uniform Delay models is stronger 
because it is for a model which is both Uniform Delay and UET. 
Below, by a modification of Ullman's proof, we show the NP completeness
of Unit Delay two processor scheduling with execution times of 1 or 2. In our
notation, as a partial result Ullman [1975] shows the NP completeness of the
following problem:
Decision Problem 3.4 Let n be a positive integer. Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be
an instance of a UET No Delay scheduling model where $|P| = k$ and $|\Gamma| = kn$. Does
there exist a perfect No-Recomputation Schedule s for A such that $M(s) = n$?
We consider the following problem:
Decision Problem 3.5 Let $n'$ be a positive integer. Let $B = (\{p_1, p_2\}, \Lambda', f', d)$ be an
instance of a Fixed Delay scheduling model where the range of $f'$ is $\{1, 2\}$. Does there
exist a perfect No-Recomputation Schedule $s'$ for B such that $M(s') = n'$?
Theorem 3.7 Decision Problem 3.5 is NP complete. 
Proof 
Suppose we are given an instance of Decision Problem 3.4 as above. We con-
struct an instance of Decision Problem 3.5 as follows. 











Figure 3-3: Partial Gantt Chart for the Encoding in Decision Problem 3.5 
n'=(4k+1)n 
A'= (F', ') where 
. F/ =U T U F U F* where 
—={ 1 :i=O,...,n'—l} 
—T={T,:i=O,...,n-1,j=O,...,k} 
_F* = { : i = O,...,nk_1} 
• 	f'is given byf'(a)=2 for all aEF and f'(13)=l for all tasks /3EF'—F. 
' = U U U where 
- 	= {(z, i+1) : i = 0,... ,n' - 21, 
- z4 = 	 : i = O,...,n - 1, j = O,...,k, u = 
(4k + 1)i + 2j - 21 except where u = —2, we have only the edge 
(T0,0, 2) 
- 	={(i)I° < i < nk}, 
- 	= 	 E Al. 
First observe that $\sum_{\gamma \in \Gamma'} f'(\gamma) = (4k+1)2n$, and since our deadline is $(4k+1)n$
and we have two processors, any valid schedule for B is a perfect schedule with a
single tuple instancing each task in $\Gamma'$. Next observe that, because of the chain of
edges $(\xi_i, \xi_{i+1})$, one processor must be devoted to processing an element of $\Xi$ at each time
unit if the time limit is to be met; thus the tasks of $\Xi$, in the order indicated by
the precedence relation, form the critical path of the computation. Moreover,
the same processor must process all elements of $\Xi$, since no communication
latency can be allowed to occur between the execution of the elements of the
critical path if the computation is to finish by the deadline. Without loss of
generality let us assume that it is processor $p_1$ that is computing the tasks in $\Xi$,
and therefore we have that in any valid schedule of our instance of Decision
Problem 3.5, for all $\xi_i \in \Xi$, $X(s, \{\xi_i\}, P, Z) = \{(\xi_i, p_1, i)\}$, and for all $\gamma \in \Gamma' - \Xi$,
$X(s, \{\gamma\}, P, Z) = \{(\gamma, p_2, t)\}$ for some t.
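The work total used at the start of this argument can be checked term by term from the sizes of the four task sets in the construction. The following is a hedged worked count: $\Xi$ is used here as an assumed name for the chain of $n'$ unit tasks, and T for the $(k+1)n$ unit tasks indexed by $i = 0, \ldots, n-1$ and $j = 0, \ldots, k$.

```latex
\begin{align*}
\sum_{\gamma \in \Gamma'} f'(\gamma)
  &= \underbrace{(4k+1)n}_{\Xi,\ f' = 1}
   + \underbrace{(k+1)n}_{T,\ f' = 1}
   + \underbrace{2kn}_{\Gamma,\ f' = 2}
   + \underbrace{kn}_{\Gamma^{*},\ f' = 1} \\
  &= (8k+2)n \;=\; 2(4k+1)n \;=\; 2n'.
\end{align*}
```

Two processors working for $n'$ time units supply exactly $2n'$ units of processor time, so a schedule that meets the deadline has no idle time and no task executed twice.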
The proof then follows in a very similar way to the proof described
by Ullman [Ullman, 1975], pages 391-392. The tasks in T must be executed at
very specific times as shown in Figure 3-3. Progressing in time we have an
alternation of breaks, in which there is one time unit available on processor 2
every other time unit, and bands, in which 2k consecutive time units are available
on processor 2. Since the elements of $\Gamma$ require two time units each, it is clear
that they must be executed in the bands only. As there are kn such jobs, they
must completely fill the bands, which means that the elements of $\Gamma^{*}$ must be
executed exclusively during the breaks.
As a consequence, if tasks $\gamma, \delta \in \Gamma$ are both executed in the same band it is
not possible for there to exist an edge $(\gamma, \delta)$ in $\Delta$. For if so, by our construction
there would exist a task $\alpha \in \Gamma^{*}$ such that there exist edges $(\gamma, \alpha)$ and $(\alpha, \delta)$ in
$\Delta'$, and $\alpha$ would have to be executed in that same band, violating what we have
just concluded. Thus if our instance of Decision Problem 3.5 has a solution, we
can find a solution to the original instance of Decision Problem 3.4 by executing
at time unit i exactly those jobs executed in the ith band.
Conversely, if we have a solution to the given instance of Decision Problem 3.4
we can find a solution of the constructed instance of Decision Problem 3.5 by
executing $\gamma_i^{*}$ in break t and $\gamma_i$ in band t whenever $\gamma_i$ is executed at time unit t in
the solution to Decision Problem 3.4. □
Note that the above proof establishes that the problem remains NP complete 
even if recomputation is allowed since s' is a perfect schedule without recompu-
tation and thus if any valid schedule s* included recomputation, M(s*)  would 
exceed n'. 
3.4 Comparison with the Framework 
For any valid schedule s of a General Delay scheduling model we can define the 
times that processors spend in each of the states referred to in the framework of 
Chapter 1.3. It is instructive to compare the values for the No Communication, 
No Delay and General Delay scheduling models. For the purposes of this section 
we assume the schedule is of a particular form: it is fully pruned and fully 
squashed according to the following definition. 
Definition 3.1 Fully Squashed Schedule. 
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a General Delay scheduling model.
Let s be a valid schedule for A. s is fully squashed iff $\forall (\gamma, p, t) \in s$,
$s \cup \{(\gamma, p, t-1)\} - \{(\gamma, p, t)\}$
is not a valid schedule for A.
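Definition 3.1 can be read operationally: move each tuple one time unit earlier in turn and ask whether the result is still valid. The Python sketch below does this for a schedule held as a set of (task, processor, start) tuples. The names `is_fully_squashed` and `valid_no_comm` are illustrative assumptions, and the validity predicate is a simplified No Communication stand-in (completeness, non-negative starts, no overlap), not the general definition used in the thesis.

```python
def valid_no_comm(schedule, tasks, f):
    """Toy validity test for a No Communication model (illustrative assumption):
    every task appears, no start time is negative, and no processor runs
    two tuples during the same time unit."""
    if {g for g, _, _ in schedule} != set(tasks):
        return False
    busy = set()
    for g, p, t in schedule:
        if t < 0:
            return False
        for u in range(t, t + f[g]):
            if (p, u) in busy:
                return False
            busy.add((p, u))
    return True


def is_fully_squashed(schedule, is_valid):
    """Definition 3.1: a valid schedule is fully squashed iff moving any single
    tuple one time unit earlier never yields a valid schedule."""
    for g, p, t in schedule:
        moved = (schedule - {(g, p, t)}) | {(g, p, t - 1)}
        if is_valid(moved):
            return False
    return True


# Hypothetical two-task example on one processor.
tasks = ["a", "b"]
f = {"a": 2, "b": 1}
check = lambda s: valid_no_comm(s, tasks, f)
tight = {("a", "p1", 0), ("b", "p1", 2)}
loose = {("a", "p1", 0), ("b", "p1", 3)}
```

Here `tight` is fully squashed, while `loose` is not, because task b can start one unit earlier without conflict.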
The following theorem simply states that in No Communication models no 
processor need be idle until it has finished all computation that has been as-
signed to it. 
Theorem 3.8 Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a No Communication
scheduling model. Let s be a valid fully squashed schedule for A. For each $p \in P$,
$\sum_{(\gamma,p,t) \in X(s,\Gamma,\{p\},Z)} f(\gamma) = M(X(s, \Gamma, \{p\}, Z))$.
Proof 
Let us assume as a hypothesis to be proven contradictory that there exists a
processor $p \in P$ such that $\sum_{(\gamma,p,t) \in X(s,\Gamma,\{p\},Z)} f(\gamma) \neq M(X(s, \Gamma, \{p\}, Z))$. We
have two cases:
• $\sum_{(\gamma,p,t) \in X(s,\Gamma,\{p\},Z)} f(\gamma) > M(X(s, \Gamma, \{p\}, Z))$, in which case, by the
pigeonhole principle, s is overbooked, which leads to a contradiction.
• $\sum_{(\gamma,p,t) \in X(s,\Gamma,\{p\},Z)} f(\gamma) < M(X(s, \Gamma, \{p\}, Z))$, in which case there exists
some tuple $(\gamma, p, t)$ such that there exists no tuple $(\delta, p, t')$ such that
$t - 1 < t' + f(\delta) \le t$, in which case $s' = s - \{(\gamma, p, t)\} \cup \{(\gamma, p, t-1)\}$ is
not overbooked. Note $s'$ is complete and, since $\Delta$ is empty, $s'$ cannot be
temporally compromised. Thus we conclude that $s'$ is valid, which leads
to a contradiction.
□
Let $\Lambda = (\Gamma, \Delta)$ be a dag. Let $\Lambda_A = (\Gamma, \{\})$. Let $A = (P, \Lambda_A, f, d_A)$ be an
instance of a No Communication scheduling model. Let $B = (P, \Lambda, f, d_B)$ be an
instance of a No Delay scheduling model. Let $C = (P, \Lambda, f, d_C)$ be an instance of
a Constant Delay scheduling model where the range of $d_C$ is $\{0, \tau\}$.
Let a be a fully pruned fully squashed valid schedule for A. Let b be a
fully pruned fully squashed valid schedule for B. Let c be a fully pruned
fully squashed valid schedule for C. Note that by Theorem 3.2, for all $\gamma \in \Gamma$,
$|X(a, \{\gamma\}, P, Z)| = |X(b, \{\gamma\}, P, Z)| = 1$. Note also that by Theorem 3.8
$\sum_{(\gamma,p,t) \in X(a,\Gamma,\{p\},Z)} f(\gamma) = M(X(a, \Gamma, \{p\}, Z))$. In the following tables, p refers
to some arbitrary processor $p \in P$ which is busy in the relevant schedule.
Table 3-3, requires one more definition. For each p e F, let (-y, p, t),.. . , 	p, P) 
be the enumeration of X(s, F, {p}, Z) such that for each j = 2,. . . , xi,, t 1 + 






THose  0 




Table 3-1: A No Communication Scheduling Model in Terms of our Framework 







M(X(b, F, {p}, Z)) - Tcaic(p 
M(b) - M(X(b,F,{p},Z)) 
Table 3-2: A No Delay Scheduling Model in Terms of our Framework 
Table 3-3: A Uniform Delay Scheduling Model in Terms of our Framework 
Chapter 4 
Analysis of Scheduling Algorithms 
4.1 Approximation Algorithms for Scheduling Prob-
lems 
As a result of the complexity of scheduling problems, it is natural to consider
polynomial time algorithms which, although they are not guaranteed to find the
best solution, are guaranteed to find solutions which are close to the optimum.
Such algorithms are known as approximation algorithms, or in the case where
there exists some positive constant $\epsilon$ such that the algorithm's solutions are
guaranteed to be at most $1 + \epsilon$ times the optimal solution, as $\epsilon$-approximation
algorithms. An early partial review of this area can be found in Garey et al. [1977].
More recent work is described in recent reviews of scheduling: namely Cheng
and Sin [1990] and Veltman et al. [1990].
4.1.1 No-Communication Models 
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a No Communication scheduling
model where $|P| = n$ and $|\Gamma| = l$. Graham's Longest Processing Time (LPT)
algorithm [Graham, 1969] guarantees to find a schedule s such that
$M(s) \le (4/3 - 1/(3n)) M(s_{opt})$
where $s_{opt}$ is a valid schedule for A with the shortest possible makespan.
Coffman et al. [1978] give an algorithm based on techniques from bin packing and
improve this to $1.22 M(s_{opt})$. Sahni [1976] produced a family of approximation
algorithms for any guaranteed performance $\epsilon$ whose running time was
polynomial in l but exponential in n. More recently, Hochbaum and Shmoys [1988a,
1988b] have developed a family of $\epsilon$-approximation algorithms for which the
complexity is polynomial in both l and n.
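The LPT rule itself is short enough to state as code. The sketch below is a minimal Python rendering of the standard rule (sort by non-increasing duration, then always assign to the processor that becomes free first). The function name and the example instance are illustrative assumptions, not taken from Graham's paper.

```python
import heapq


def lpt_schedule(durations, n):
    """Longest Processing Time rule for independent tasks on n identical
    processors. Returns ({task: (processor, start)}, makespan)."""
    loads = [(0, p) for p in range(n)]        # (time processor becomes free, id)
    heapq.heapify(loads)
    assignment = {}
    for task in sorted(durations, key=durations.get, reverse=True):
        free_at, p = heapq.heappop(loads)     # earliest available processor
        assignment[task] = (p, free_at)
        heapq.heappush(loads, (free_at + durations[task], p))
    makespan = max(free_at for free_at, _ in loads)
    return assignment, makespan


# Hypothetical instance: the optimum on 2 processors is 6 (3+3 and 2+2+2).
durations = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 2}
assignment, makespan = lpt_schedule(durations, 2)
```

On this instance LPT pairs the two long tasks on different processors and finishes at time 7, which meets the bound $(4/3 - 1/(3n)) \cdot 6 = 7$ for $n = 2$ with equality.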
As an extension to Graham's work, Coffman et al. [1984] consider the expected
makespan for LPT scheduling under the assumption that tasks' execution
times are independent, identically-distributed, random variables. They show
that LPT makespans converge stochastically to optimal makespans for a range
of distributions of task execution times.
4.1.2 No-Delay Models 
Algorithms for solving or approximately solving the mapping problem for No 
Delay scheduling models have been appearing in the literature with remarkable 
frequency. In the cases where the problem is NP complete, heuristic approaches 
have often been used. This version of the mapping problem has been the subject 
of previous reviews (see Chen and Liu [1975] for a discussion of various similar 
heuristic approaches). Indeed it is often considered in the same reviews as No 
Communication models, and as a result we only describe a selection of the results 
and heuristics. 
Hu [1961] showed that for UET models where the task graph is a tree, a
schedule based upon sequential processing of layers of the tree is optimal.
Kaufmann [1974] extended Hu's algorithm to the case of non-unit length tasks,
and showed bounds on its performance which allowed him to consider it
"almost optimal". Graham [1966, 1969] extended this work to propose a general
technique for scheduling known as list scheduling. Coffman and Graham [1972]
showed an algorithm which generates optimal schedules for any UET, 2 processor
No Delay scheduling model. Their algorithm may be generalised to
models with an arbitrary number of processors but it loses its exactness. Lam
and Sethi [1977] showed that Coffman and Graham's algorithm will generate
a schedule s where $M(s)/M(s_{opt}) \le 2 - 2/n$. Gabow [1988] showed a linear
time algorithm for scheduling on two uniform processors (recall our definition
in Section 1.2.2), again with unit length tasks, which generates optimal results
with certain fixed ratios of processor speeds and nearly optimal results
otherwise. Cho and Sahni [1980] give bounds for list schedules on No Delay
uniform-processor models. Cole and Vishkin [1988] show logarithmic time parallel
implementations of list scheduling on an EREW PRAM. There are also a
number of results for preemptive scheduling of precedence constrained jobs
(e.g. Muntz and Coffman [1969] for 2-processor systems), and in this context
we refer the reader to the reviews referred to in Section 3.3 and to Lawler [1982].
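To make the level-based idea concrete, here is a small Python sketch of list scheduling for a UET No Delay model in which ready tasks are prioritised by their level (the length of the longest path to a sink), in the spirit of Hu's rule. It is a sketch under assumptions: the dag is given as a hypothetical successor map `succ`, and the optimality result quoted above applies only when the task graph is a tree; on general dags the same code is merely a heuristic.

```python
def hu_levels(succ):
    """Level of a task: number of tasks on the longest path from it to a sink."""
    level = {}

    def visit(v):
        if v not in level:
            level[v] = 1 + max((visit(w) for w in succ.get(v, ())), default=0)
        return level[v]

    for v in succ:
        visit(v)
    return level


def list_schedule_uet(succ, n):
    """Unit-time, zero-delay list scheduling on n processors, highest level
    first. Returns ({task: start time}, makespan)."""
    level = hu_levels(succ)
    unmet = {v: 0 for v in level}             # unfinished direct predecessors
    for v in succ:
        for w in succ[v]:
            unmet[w] += 1
    ready = [v for v in level if unmet[v] == 0]
    start, t = {}, 0
    while ready:
        ready.sort(key=lambda v: -level[v])   # stable: ties keep their order
        running, ready = ready[:n], ready[n:]
        for v in running:
            start[v] = t
        for v in running:                     # release successors for step t+1
            for w in succ.get(v, ()):
                unmet[w] -= 1
                if unmet[w] == 0:
                    ready.append(w)
        t += 1
    return start, t


# Hypothetical in-tree: a and b feed c; c and d feed e.
tree = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
start, makespan = list_schedule_uet(tree, 2)
```

With two processors the five tasks complete in three time units, which is optimal here since the longest path a, c, e has three tasks.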
4.1.3 General, Uniform and Fixed Delay Models 
In the later sections of this chapter we concentrate on two algorithms which 
were proposed by Papadimitriou and Yannakakis [1990] on the one hand and 
by Hwang et al. [1989] on the other. The latter algorithm is known as ETF. We 
refer to the former algorithm as PNY for the sake of brevity. 
PNY works with a UET Uniform Delay scheduling model¹; with what they
refer to as unbounded numbers of processors; and performs recomputation. We
have shown above that the number of processors that can usefully be used by any
such algorithm is bounded by $|\Gamma|$, and as we show below there is another, often
tighter, processor bound for PNY. On the positive side, PNY is
shown by Papadimitriou and Yannakakis to produce schedules which have a
makespan within a factor of 2 of optimal with respect to their model. We give
detailed consideration to this bound in later sections.
The unbounded processor assumption is also made by Sarkar (in the
Internalisation phase of his work), by Yang and Gerasoulis [1992] [1993] and by
Kim and Browne [1988]. However none of the algorithms that these authors
propose allows task replication. Jung et al. [1989] showed for the inverse binary
tree that recomputation can gain performance which is a function of the
interprocessor communications delay, which implies that there can be no fixed
bound on the ratio by which these algorithms exceed the optimal makespan of
the computation. As we will see, determining how to remove recomputation
from a Papadimitriou and Yannakakis schedule so as to minimise the number
of processors being used is itself NP complete.
¹ There is another algorithm in [Papadimitriou and Yannakakis 1990] which relaxes
these two constraints.
ETF is similar in flavour to Graham's List scheduling algorithm [Graham
1969]. It works with arbitrary length tasks; general communication delays²; a
fixed number of processors and forbids recomputation. The result of Jung et
al. implies that there can be no fixed bound on the ratio by which ETF exceeds
the optimal makespan of the computation, however Hwang et al. show their
algorithm satisfies a dag-dependent bound which is presented and discussed
below. Scheduling heuristics for this model can also be found in Williams [1983]
and Baxter and Patel [1989].
4.2 Papadimitriou and Yannakakis' Algorithm (PNY) 
Given a dag $\Lambda = (\Gamma, \Delta)$ and an integer communications delay $\tau$, Papadimitriou
and Yannakakis [1990] compute a function $e: \Gamma \to Z_0^{+}$ inductively on the depth
of a node in the dag according to the following algorithm.
Algorithm 4.1 e-value
1   let $\Lambda = (\Gamma, \Delta)$ be a directed acyclic graph and $\tau$ be some integer.
2   let $\gamma_1, \gamma_2, \ldots, \gamma_m$ be an enumeration of $\Gamma$ such that if $1 \le a < b \le m$, then
    $\gamma_b \notin pred(\gamma_a, \Lambda)$;
3   for i := 1 to m
    do
4     if $|pred(\gamma_i, \Lambda)| = 0$ then let $e(\gamma_i) = 0$;
      else do
5       let $(\delta_1, \delta_2, \ldots, \delta_p)$ be an enumeration of $pred(\gamma_i, \Lambda)$ such that
        $e(\delta_1) \ge e(\delta_2) \ge \ldots \ge e(\delta_p)$;
6       let $k = \min\{\tau + 1, p\}$;
² As discussed in later sections it is not presented in this way.
Papadimitriou and Yannakakis prove the following properties relating to
their function e and the class of schedules.
Lemma 4.1 There is no valid schedule in which vertex $\gamma_i$ is scheduled before time $e(\gamma_i)$.
Lemma 4.2 For each vertex $\gamma_i$ there is a valid schedule in which vertex $\gamma_i$ is scheduled
at start time $2e(\gamma_i)$.
The proofs of both Lemma 4.1 and Lemma 4.2 are found in [Papadimitriou
and Yannakakis 1990].
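The e-value computation can be sketched in Python as follows. Two points are assumptions rather than statements from the listing: `pred` is taken to mean the set of all (transitive) predecessors of a task, and the final assignment is taken to be $e(\gamma_i) = e(\delta_k) + k$. Both choices reproduce the e-values quoted later in this chapter (for example, a task with three source predecessors and $\tau = 3$ receives e-value 3), but the function and variable names are hypothetical.

```python
def ancestors(direct_pred):
    """Map every task to the set of all its transitive predecessors."""
    memo = {}

    def visit(v):
        if v not in memo:
            acc = set()
            for u in direct_pred.get(v, ()):
                acc.add(u)
                acc |= visit(u)
            memo[v] = acc
        return memo[v]

    for v in direct_pred:
        visit(v)
    return memo


def e_values(direct_pred, tau):
    """Assumed rule: e = 0 for sources; otherwise rank all predecessors by
    non-increasing e-value, let k = min(tau + 1, p), and set e = e(delta_k) + k."""
    anc = ancestors(direct_pred)
    e = {}

    def value(v):
        if v not in e:
            ranked = sorted((value(u) for u in anc[v]), reverse=True)
            if not ranked:
                e[v] = 0
            else:
                k = min(tau + 1, len(ranked))
                e[v] = ranked[k - 1] + k
        return e[v]

    for v in anc:
        value(v)
    return e


# Hypothetical dags: a task C with three source predecessors, and a chain.
fan_in = {"S1": [], "S2": [], "S3": [], "C": ["S1", "S2", "S3"]}
chain = {"a": [], "b": ["a"], "c": ["b"]}
```

Under these assumptions `e_values(fan_in, 3)` gives C the value 3, and `e_values(chain, 1)` gives the values 0, 1 and 2 along the chain.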
4.2.1 A PNY schedule using $|\Gamma|$ processors
Using the proof of Lemma 4.2 Papadimitriou and Yannakakis suggest the
following simple algorithm for scheduling the vertices of a dag, so that every
vertex $\gamma_i$ is executed by time $2e(\gamma_i)$. The algorithm thus first computes all e-values
then creates a schedule so that $\gamma_i$ is computed at start time equal to twice its
e-value.
In the algorithm below each vertex $\gamma_i$ is computed on a separate processor
together with the $\tau$ highest (in e-value) predecessors of $\gamma_i$ and receives the
rest of its inputs in the form of communications from other processors. If
$|pred(\gamma_i, \Lambda)| < \tau$, then all predecessors are computed on the processor.
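Algorithm 4.2 Papadimitriou and Yannakakis Mark 1
1   let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a Fixed Delay UET Scheduling
    Model where $|\Gamma| = |P| = l$; and the range of d is $\{\tau, 0\}$;
2   let $\gamma_1, \gamma_2, \ldots, \gamma_l$ be an enumeration of $\Gamma$;
3   let $p_1, p_2, \ldots, p_l$ be an enumeration of P;
4   let s = {};
5   for a := 1 to l
    do
6     let $R_a = (\delta^a_1, \delta^a_2, \ldots, \delta^a_{\sigma_a})$ be an enumeration of $pred(\gamma_a, \Lambda) \cup \{\gamma_a\}$
      such that $\forall\, 1 \le i < \sigma_a$, $e(\delta^a_i) \le e(\delta^a_{i+1})$ and for all $1 \le i < j \le \sigma_a$,
      $\delta^a_j \notin pred(\delta^a_i, \Lambda)$;
7     let $\rho_a = \min(\tau, |pred(\gamma_a, \Lambda)|)$;
8     for $h = \sigma_a - \rho_a$ to $\sigma_a$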
Combining Lemmas 4.1 and 4.2 we have the following theorem from
Papadimitriou and Yannakakis.
Theorem 4.1 Algorithm 4.2 is an approximation algorithm for minimising the
makespan with a worst case ratio of two.
The use of l processors by Algorithm 4.2 is consistent with the bound of
Corollary 3.2, and the correspondence can be explained by the fact that for all
$1 \le i \le l$, each processor $p_i$ ends by computing a distinct task $\gamma_i$. However, it is
important to note that the schedule of Algorithm 4.2 can often be transformed
to use fewer processors than l.
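The schedule described in this section can be sketched in Python. This is a hedged reading of the prose, not a transcription of Algorithm 4.2: each task gets its own processor, which runs the task at time $2e$ immediately after recomputing its $\min(\tau, |pred|)$ highest-ranked predecessors in consecutive unit slots. The inputs `anc` (all predecessors of each task) and `e` (e-values) are supplied by the caller, and ties in e-value are broken by predecessor count and name, which is an assumption made to keep the order deterministic and consistent with precedence.

```python
def mark1_schedule(anc, e, tau):
    """Return a set of (task, processor, start) tuples, one processor per task.

    anc: {task: set of all its predecessors}; e: {task: e-value}; tau: delay.
    """
    schedule = set()
    for proc, task in enumerate(sorted(anc)):
        # Non-decreasing e-value; a predecessor always has fewer predecessors
        # than its successor, so the second key keeps the order precedence-safe.
        ranked = sorted(anc[task], key=lambda u: (e[u], len(anc[u]), u))
        rho = min(tau, len(ranked))
        local = ranked[len(ranked) - rho:] + [task]   # recomputed preds, then task
        first = 2 * e[task] - rho
        for offset, g in enumerate(local):
            schedule.add((g, proc, first + offset))
    return schedule


# Hypothetical instance: task C with three source predecessors, tau = 3.
anc = {"S1": set(), "S2": set(), "S3": set(), "C": {"S1", "S2", "S3"}}
e = {"S1": 0, "S2": 0, "S3": 0, "C": 3}
s = mark1_schedule(anc, e, 3)
```

In this example the processor assigned to C recomputes S1, S2 and S3 at times 3, 4 and 5 and runs C at time 6, twice its e-value, while each source also runs at time 0 on its own processor.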
4.2.2 The Problem of PNY Pruning 
There is an implicit assumption in the model underlying Papadimitriou and 
Yannakakis' mapping problem that processors are plentiful. Like schedules 
produced by Algorithm 3.1, PNY schedules use exactly as many processors as 
there are tasks to be scheduled. We now consider the problem of pruning a 
PNY schedule so that it uses a number of processors which is different from the 
number of tasks. We show the following problem is NP complete. 
Decision Problem 4.1 PNY Pruning 
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a UET Fixed Delay scheduling model
where $|\Gamma| = |P| = l$ and the range of d is $\{0, \tau\}$. Let s be a schedule for A that was
generated by Algorithm 4.2. Let J be an integer. Does there exist a schedule $s' \subseteq s$
which is valid for A and such that $P(s') \le J$?
We reduce from a general instance of Three Minimum Cover (TMC)—Decision 
Problem 4.2 below, which is known to be NP complete [Garey and Johnson 
1979]. 
Decision Problem 4.2 TMC 
Let $S = \{S_1, \ldots, S_a\}$ be a set of items. Let $C = \{C_1, \ldots, C_b\}$ be a collection of subsets
of S such that $|C_i| = 3$, $\forall i = 1, \ldots, b$ and $\bigcup_{i=1}^{b} C_i = S$. Let $K \le b$ be a positive
integer. Does there exist a cover of S, $C'$, such that $|C'| \le K$? That is, does there
exist a subset $C' \subseteq C$ such that $\bigcup_{C_i \in C'} C_i = S$?
We shall need the following lemma. 
Lemma 4.3 Let S, C, K be respectively a set of items, a collection of subsets and a
positive integer forming an instance of TMC. Let $C'$ be a collection of subsets of S. Let
$K' = |C'|$ and let $S'$ be a subset of S such that $(\bigcup_{C_i \in C'} C_i) \cup S' = S$. Let $J' = |S'|$. S
has a cover of size no more than $K' + J'$.
Proof
Let $S_1, S_2, \ldots, S_d$ be an enumeration of $S'$. For each $S_i \in S'$ choose some $C_i^{*} \in C$
such that $S_i \in C_i^{*}$. Let $C^{*} = (\bigcup_{i=1}^{d} \{C_i^{*}\}) \cup C'$. $C^{*}$ is the cover as required. □
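The construction in this proof is easy to exercise on a small instance. The Python sketch below extends a partial collection to a full cover by adding one subset per uncovered item; it takes $S'$ to be exactly the set of uncovered items, which is one admissible choice under the lemma. The function name and the example data are illustrative assumptions.

```python
def extend_to_cover(items, subsets, partial):
    """Add, for every item not covered by `partial`, some subset containing it.
    The result covers `items` and has at most len(partial) + (number of
    uncovered items) members, as in the proof of Lemma 4.3."""
    covered = set().union(*partial)
    cover = list(partial)
    for item in sorted(items - covered):
        chosen = next(c for c in subsets if item in c)
        if chosen not in cover:
            cover.append(chosen)
    return cover


# Hypothetical TMC-style instance with three-element subsets.
items = {1, 2, 3, 4, 5, 6}
subsets = [frozenset({1, 2, 3}), frozenset({3, 4, 5}), frozenset({4, 5, 6})]
partial = [frozenset({1, 2, 3})]            # K' = 1, leaves {4, 5, 6} uncovered
cover = extend_to_cover(items, subsets, partial)
```

Here $K' = 1$ and $J' = 3$, and the resulting cover has three members, within the bound $K' + J' = 4$.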
We now define a polynomial time encoding from TMC to PNY pruning. 
Definition 4.1 TMC to PNY Pruning encoding. 
Let S, C, K respectively be a set of items, a collection of subsets and a positive
integer forming an instance of TMC. Let $S_1, S_2, \ldots, S_a$ be an enumeration of
S and let $C_1, C_2, \ldots, C_b$ be an enumeration of C. We construct our instance
of Decision Problem 4.1 in the following way. The resulting dag is shown in
Figure 4-1.
T= 	Iv, :j=1,...,b,i=1, ... ,3} 
= 	{j=1, ... ,b,i=1, ... ,3} 
b, Z'  
11= 	 1,..., b} 
SuCuTUUEUI' 
= 	{(,v):i=1 ... 3,j=1 ... b} U 
Figure 4-1: The Constructed dag Obtained in the Polynomial Time Reduction 
from Three Minimum Cover. 
... 3,j1 ... b} U 
{(v 	... 3,j1 ... b} U 
U 




We note the following properties of our encoded instance. Papadimitriou 
and Yannakakis assign the following e-values to the vertices in F: e(S) = 0 for 
all Si e S;e(q)=0 for all E ;e()=O for all e;e(v)=2 for all v E 
e(C2)= 3for all C1 e C; e(ib)= 6 for all & E T. 
Let 	1'2•• •,Yi be the enumeration of Fand (PI ,p2,.. .,p1)° be the enumer- 
ation of P chosen by Algorithm 4.2, Application of Algorithm 4.2 results in 
the schedule shown in the Gantt chart of Figure 4-2. Processor p2 is assigned 
task and its p, = min{i-, Ipred(72 , A)} highest in e-value predecessors. In the 
following we shall make much use of the correspondence between the indices 
in the enumerations of P and F. Below, we identify the predecessors that are 
executed with each task. Where there is a non-deterministic choice to be made 
by Algorithm 4.2 we express that non-determinism. 
For all 'y E SUUE,pred(y,A)={} and so 
Vy 	SUU, X(s,F,{p},Z)= {(yj,p2,O)}. 
For all -yj = C3 e C, pred(C3 , A) = Is E S: .s E C}. Thus lpred(C3 , A)I = 
3 = r. Let (S, S, S) be the order in which pred(C, A) were arranged by 
Algorithm 4.2, then 
Vy = C3 C, X(s,F,{p},Z) = U {(S,p,6—k)}U{(C,p,6)}. 
k=1,2,3 
For all -yj = v E T, pred(vJc,A) = 	Thus pred(v,A)I = 2 < T. Let 
t,k be chosen nondeterministically from {2, 3}. Let 	= 5 - t ok 
Vy = v E T, X(s,F,{p},Z) = 
kk For all = 	'I', pred(, A) = Uk=1,2,3{v}UUk=1,2,3{c}UUk=1,2,3{}UC. 
Note that for each C2, e(C2 ) = 3, for each v, e(v) = 2, and for all , 	0 
thus 
V'y2 = 	E 'I', X(s,F,{p},Z) D {(C3,p2,1l),(,p2 ,l2)}. 
Finally, before our NP completeness proof, we will require to prove the 
following lemma which states that at least 4b processors are required to compute 
tasks which are not in C U S. Moreover, these processors all correspond (in our 
pairing of indices) to tasks which are not in C U S. 
Figure 4-2: The Schedule Obtained from Algorithm 4.2 on a dag Encoded as in
Definition 4.1 (Gantt chart with axes Processors and Time; processor groups
labelled 3b, 3b, a, 3b, b, b)
Lemma 4.4 Let A = (P, A = (F, z), f, d), .s and J be an instance of PNY pruning 
which has been constructed from an instance S, C, K of TMC as described in Defini-
tion 4.1. Let F = F - S. Let z = {(-y, 8) E A : -y,S E F}. If  schedules is valid 
for B = (P, (F, zr), f, d) then P(X(s, F, 1pi e P : -y E F - (S U C)}, Z)) ~j 4b. 
Proof 
Let us assume as a hypothesis to be proven contradictory that there exists 
a schedule s which is valid for B and such that P(X(s*, F, jPi e P -fi e 
F - (S U C)}, Z)) <4b. Since 	TI = 3b and I'I'I = b this implies one or 
other of the following. 
. There exists some task = 	e 'I' such that p, is idle in s. In this case Oj 
is never executed which leads to a contradiction. 
There exists some pair of tasks -y. = v E T and Yh = E such that both 
pi and Ph are idle in s. In this case Oj is never executed which leads to a 
contradiction. 
ii 
Theorem 4.2 The problem of PNY pruning is NP complete. 
Proof 
Let s be a schedule produced by Algorithm 4.2 on an instance $A = (P, \Lambda =
(\Gamma, \Delta), f, d)$ of a Unit Delay scheduling model. We can check if a given schedule
$s' \subseteq s$ is valid by checking that it is complete and neither temporally compromised
nor overbooked.
To check that it is complete we need to check that for all $\gamma \in \Gamma$,
$X(s', \{\gamma\}, P, Z) \neq \{\}$,
which can be achieved by searching $|\Gamma|$ times through $|s'|$ tuples.
To check that it is not overbooked we need to check that there exists no tuple
$(\gamma, p, t) \in s'$ such that $X(s', \Gamma - \{\gamma\}, \{p\}, \{t, t+1, \ldots, t + f(\gamma) - 1\}) \neq \{\}$. This
can be achieved by searching $|s'|$ times through $|s'|$ tuples.
To check that it is not temporally compromised we need to show that there
exists no edge $(\gamma, \delta) \in \Delta$ such that there exists a tuple $(\delta, p, t) \in s'$ such
that $\min_{(\gamma, q, t') \in s'} (t' + f(\gamma) + d((\gamma, \delta), p, q)) > t$. This can be achieved by searching $|s'|$
times through $|s'|$ tuples for each of at most $|\Gamma|^2$ edges.
Since $s' \subseteq s$, we can conclude that the time complexity of checking the
validity of the schedule is at most $O(|s|^2 |\Gamma|^2)$, which is a polynomial in the size
of our instance. Since we can just guess non-deterministically a pruning $s'$ of s
such that $P(s') \le J$ and test that it is valid, we have established that the PNY
pruning problem is in NP.
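The three checks above translate directly into a polynomial-time verifier. The following Python sketch assumes a Fixed Delay model in which the delay is 0 between tasks on the same processor and `tau` between distinct processors; the function name, argument layout and example schedules are illustrative assumptions.

```python
def is_valid(schedule, tasks, edges, f, tau):
    """schedule: set of (task, processor, start); edges: set of (pred, succ);
    f: {task: duration}. Returns True iff the schedule is complete, not
    overbooked and not temporally compromised."""
    by_task = {g: [] for g in tasks}
    for g, p, t in schedule:
        by_task[g].append((p, t))
    # Complete: every task is instanced by at least one tuple.
    if any(not instances for instances in by_task.values()):
        return False
    # Not overbooked: no processor runs two tuples in the same time unit.
    busy = set()
    for g, p, t in schedule:
        for u in range(t, t + f[g]):
            if (p, u) in busy:
                return False
            busy.add((p, u))
    # Not temporally compromised: each instance of a successor starts no
    # earlier than the earliest arrival of its predecessor's result.
    for a, b in edges:
        for p, t in by_task[b]:
            arrival = min(t2 + f[a] + (0 if q == p else tau)
                          for q, t2 in by_task[a])
            if arrival > t:
                return False
    return True


# Hypothetical two-task instance with unit durations and tau = 2.
tasks, edges, f = ["a", "b"], {("a", "b")}, {"a": 1, "b": 1}
local = {("a", "p1", 0), ("b", "p1", 1)}
remote = {("a", "p1", 0), ("b", "p2", 1)}
recomputed = {("a", "p1", 0), ("a", "p2", 0), ("b", "p2", 1)}
```

The schedule `remote` fails because the result of a cannot reach p2 before time 3, whereas `recomputed` is valid because a is executed a second time on p2.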
Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$, s and J be an instance of PNY pruning which has
been constructed from an instance S, C, K of TMC as described in Definition 4.1.
We now show that s can be pruned to a valid schedule $s'$ with at most J
non-idle processors if and only if there is a subset $C' \subseteq C$, $|C'| \le K$, which
covers S.
it 1 Let C' c C be a cover of size K of the set S. Let
be an 
enumeration of C'. We prune our schedule as follows: 
S' = X(s,F,{pEP:yi ET},Z)U 
X(s,C UT, 	E P: E 
X(s,S,{p E P :'y j E C'},Z) 
= 	U 
U 	{(C3,p,11)7 (tI,P,12)} U U 	U {(S,p,6 - k)}. -y,=CjEC k1,2,3 
Note P(s') = TI + I''I + IC'I = 3b + b + K = J. Note that for all tasks y e 
T U 1 U U 'I! U C, we have that X(s', {-y}, F, Z) {}. Moreover since C' covers 
S, 
U 	U {(S,p,6—k)} 
yC€C k=1,2,3 
contains a triple instancing every Si e S. Thus we conclude s' is complete. 
Note also that s' c .s and .s is valid, and thus by Lemma 3.2 s' is not over-
booked 
Let us assume as a hypothesis to be proven contradictory that s' is temporally 
compromised. This implies there exists a tuple (-y, p, t) such that one of the 
following is true. 
Syj= v E T and is not executed on processor p, before time t = 3, which 
J. 
leads to a contradiction. 
-y v E T and is not executed on processor p, before time t = 3, which 
leads to a contradiction. 
= 	E 'P and there exists some v E 
T which is not executed on any 
processor before time t' = t - ,r = 12-3 = 9, which leads to a contradiction. 
yj= 	e 'I' and C is not executed on processor 
p, before time t = 12, 
which leads to a contradiction. 
= C3 E C and there exists a task Sk  e S such that Sk  is not executed 
before time t' = t - r = 11 - 3 = 8, which leads to a contradiction. 
Thus we conclude that .s' is complete and neither temporally compromised 
nor overbooked and thus valid for A and we further conclude there is a valid 
pruning of s with P(s') < J if there exists a K cover for S. 
Let s*  C s be a valid schedule with p(s*) < J = K + 4b. We shall 
establish there must exist a K cover for S. Let s' = X(s*, 5, F, Zr ). Let C' 
{y e C: X(s', 5, {p} Z) 	{}}. Let 5' = {'y E S: X(s', 5, {p}, Z) 	{}}. 
Note X (s, S, 1pi e P : 	E F - (S U C)},Z) = {} and by Lemma 4.4 
2(X(s*,F,{p e p:  -y E F—(SUC)},Z)) > 4b. Thus P(s') KandIC'I+IS'I 
K. Let us assume that there exists some task S, E S such that S V Ucizc,Ci U 5'. 
Now s* is complete thus 53  must be executed. We have two cases 
S e 5' which leads immediately to a contradiction. 
There exists a task = Ch e C' such that X(s', {S}, {p1 }, Z) I  which 
implies 53 E pred(Ch, A) and so by construction 53 E Ch, which leads to a 
contradiction. 
Thus 	U 5' = S and so, by Lemma 4.3, S has a cover of at most size 
IS'I+ I C'I = K and the theorem is proved. 
Note that the above argument holds irrespective of the non-determinism in
Algorithm 4.2. That is, we are not concerned with the ordering of the
recomputation of the predecessors of any task, nor with the subset of the predecessors of
each $\psi_j$ which get computed with it.
4.2.3 A Slightly Less Processor Wasteful Algorithm 
We now know it is an NP complete problem to determine whether it is possible
to prune a PNY schedule so as to use at most a given number of processors.
Consider the algorithm below, a modification of Algorithm 4.2 which
leaves out the computation on certain processors. Specifically, we prune all
computation mapped to a processor $p_i$ where, for the corresponding task $\gamma_i$,
there exists some successor $\gamma_j$ such that $e(\gamma_i) = e(\gamma_j)$. The rather complex proof
of Theorem 4.3 establishes that this computation is redundant.
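The pruning rule can be stated as a one-line filter. In the Python sketch below the successor map `succ` and the e-values `e` are supplied by the caller; the names and the small example are hypothetical, and the example e-values were worked out by hand for $\tau = 1$ under the assumption that a task's e-value is the e-value of its k-th ranked predecessor plus k, with $k = \min\{\tau + 1, p\}$.

```python
def processors_to_skip(succ, e):
    """Tasks whose dedicated processor is left idle by the modified algorithm:
    those with at least one successor of equal e-value."""
    return {g for g in succ if any(e[w] == e[g] for w in succ[g])}


# Hypothetical dag: sources a and b feed u, and u feeds v.
succ = {"a": {"u", "v"}, "b": {"u", "v"}, "u": {"v"}, "v": set()}
e = {"a": 0, "b": 0, "u": 2, "v": 2}        # assumed e-values for tau = 1
skipped = processors_to_skip(succ, e)
```

Only u is skipped here: the processor dedicated to v already recomputes u immediately before v, so the separate processor for u does no work that is needed elsewhere.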
Schedules generated by this algorithm are used in later sections where we 
show bounds on the number of active processors and indicate how they can be 
transformed to schedules with fewer processors, in some cases with the same 
performance guarantee, or in general with a weaker performance guarantee. 
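Algorithm 4.3 Papadimitriou and Yannakakis Mark 2
1   let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a Fixed Delay UET Scheduling
    Model where $|\Gamma| = |P| = l$; and the range of d is $\{\tau, 0\}$;
2   let $\gamma_1, \gamma_2, \ldots, \gamma_l$ be an enumeration of $\Gamma$;
3   let $p_1, p_2, \ldots, p_l$ be an enumeration of P;
4   let s = {};
5   for a := 1 to l
    do
6     if $\{\alpha \in succ(\gamma_a) : e(\alpha) = e(\gamma_a)\} = \{\}$
      do
7       let $R_a = (\delta^a_1, \delta^a_2, \ldots, \delta^a_{\sigma_a})$ be an enumeration of $pred(\gamma_a, \Lambda) \cup \{\gamma_a\}$
        such that $\forall\, 1 \le i < \sigma_a$, $e(\delta^a_i) \le e(\delta^a_{i+1})$ and for all $1 \le i < j \le \sigma_a$,
        $\delta^a_j \notin pred(\delta^a_i, \Lambda)$;
8       let $\rho_a = \min(\tau, |pred(\gamma_a, \Lambda)|)$;
9       for $h = \sigma_a - \rho_a$ to $\sigma_a$
        do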
Before we can show the validity of schedules produced by the algorithm, we 
first need to show that, in its schedules, every task is executed at least once by a 
time which is twice its e-value. 
Lemma 4.5 Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a UET Fixed Delay
scheduling model. Let s be a schedule generated by Algorithm 4.3 for A. For all $\gamma \in \Gamma$,
$X(s, \{\gamma\}, P, \{0, 1, \ldots, 2e(\gamma)\}) \neq \{\}$.
Proof
Let us assume as a hypothesis to be proven contradictory that there exists some
task $\gamma_i$ such that $X(s, \{\gamma_i\}, P, \{0, 1, \ldots, 2e(\gamma_i)\}) = \{\}$. If the condition at Line 6 of
Algorithm 4.3 holds for $\gamma_i$ then $X(s, \{\gamma_i\}, P, \{0, 1, \ldots, 2e(\gamma_i)\}) \supseteq \{(\gamma_i, p_i, 2e(\gamma_i))\}$,
so where
$S = \{\alpha \in succ(\gamma_i) : e(\alpha) = e(\gamma_i)\}$
we shall assume that $S \neq \{\}$.
Let $\gamma_j \in S$ be a task such that
$\{\beta \in succ(\gamma_j) : e(\beta) = e(\gamma_j)\} = \{\}$.
We note that $\gamma_j$ must exist because S is finite and $\Lambda$ is acyclic. Note that the
condition at Line 6 of Algorithm 4.3 holds for $\gamma_j$ and thus by Line 10, for each
$\gamma \in \{\delta^j_{\sigma_j - \rho_j}, \ldots, \delta^j_{\sigma_j}\}$, $X(s, \{\gamma\}, P, \{0, 1, \ldots, 2e(\gamma)\}) \neq \{\}$. Thus we may assume that,
notwithstanding $\gamma_i \in pred(\gamma_j, \Lambda)$,
$\gamma_i \notin \{\delta^j_{\sigma_j - \rho_j}, \ldots, \delta^j_{\sigma_j}\}$.
We have two cases:
• $|pred(\gamma_j, \Lambda)| \le \tau$, in which case since $\gamma_i \in pred(\gamma_j, \Lambda)$, we know that
$|pred(\gamma_i, \Lambda)| < |pred(\gamma_j, \Lambda)|$ and thus $e(\gamma_i) = |pred(\gamma_i, \Lambda)| < |pred(\gamma_j, \Lambda)| =
e(\gamma_j)$, which is contradictory to our construction of S.
• $|pred(\gamma_j, \Lambda)| > \tau$, thus by Algorithm 4.1, $e(\gamma_j) > e(\gamma_i) + \tau$, which is again
contradictory to our construction of S.
Thus we conclude that in all cases our hypothesis is contradictory, and
the lemma is proved. □
Since every task is executed by a time equal to twice its e-value, every task 
is executed at least once, and so we have the following simple corollary to 
Lemma 4.5. 
Corollary 4.1 Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a UET Fixed Delay
scheduling model. Let s be a schedule generated by Algorithm 4.3 for A. s is complete.
Next we show that Algorithm 4.3 generates valid schedules.
Theorem 4.3 Let $A = (P, \Lambda = (\Gamma, \Delta), f, d)$ be an instance of a UET Fixed Delay
scheduling model. Let s be a schedule generated by Algorithm 4.3 for A. s is valid.
Proof 
By Corollary 4.1, s is complete. Line 10 adds tuples of the form $(\gamma_i, p_a, 2e(\gamma_a) -
\sigma_a + h)$ where $\gamma_i, \gamma_a \in \Gamma$. Note that a and h uniquely identify an execution of
Line 10 of Algorithm 4.3, and thus it is not possible for two tuples to be inserted
into s with both an identical processor and an identical execution time. Thus
we conclude s is not overbooked.
Let $\gamma_1, \gamma_2, \ldots, \gamma_l$ be the enumeration of $\Gamma$ and $p_1, p_2, \ldots, p_l$ be the enumeration of $P$ chosen by Algorithm 4.3. Let us assume as a hypothesis to be proven contradictory that $s$ is temporally compromised. Note that $A$ is an instance of a UET scheduling model. This would imply there existed some edge, say $(\gamma_k, \gamma_i) \in \Lambda$, such that there existed some tuple, say $(\gamma_i, p_j, t) \in s$, where $p_j$ and $p_i$ (the processor corresponding to $\gamma_i$ in the pair of enumerations) are not necessarily distinct, such that, defining
$$\hat{t} = \min_{(\gamma_k, p', t') \in X(s, \{\gamma_k\}, P, Z)} t' + d((\gamma_k, \gamma_i), p', p_j),$$
we have that $\hat{t} > t$.
Note $\gamma_k \in \mathrm{pred}(\gamma_i, \Delta)$ and, since $\gamma_i$ is executed on $p_j$, by Line 7 of Algorithm 4.3, $\gamma_i \in \mathrm{pred}(\gamma_j, \Delta) \cup \{\gamma_j\}$; thus $\gamma_k \in \mathrm{pred}(\gamma_j, \Delta)$. Note also that by Line 10 of Algorithm 4.3, $t \ge 2e(\gamma_j) - \tau$.
We have two cases:
• $X(s, \{\gamma_k\}, \{p_j\}, Z) \ne \{\}$. Thus there must be a tuple, say $(\gamma_k, p_j, t^*) \in X(s, \{\gamma_k\}, \{p_j\}, Z)$. Since $\gamma_k \in \mathrm{pred}(\gamma_i, \Delta)$, by Line 10 of Algorithm 4.3, $t^* < t$. But by Definition 2.22, $d((\gamma_k, \gamma_i), p_j, p_j) = 0$. Thus $\hat{t} \le t^* < t$, which leads to a contradiction.
• $X(s, \{\gamma_k\}, \{p_j\}, Z) = \{\}$, in which case, since $\gamma_k \in \mathrm{pred}(\gamma_j, \Delta)$, $e(\gamma_k) \le e(\gamma_j) - \tau - 1$, and by Lemma 4.5,
$$X(s, \{\gamma_k\}, P, \{0, 1, \ldots, 2(e(\gamma_j) - \tau - 1)\}) \ne \{\}.$$
Furthermore the range of $d$ is upper bounded by $\tau$, thus $\hat{t} \le 2(e(\gamma_j) - \tau - 1) + \tau$, which means $\hat{t} \le t - 2$, which leads to a contradiction.
Thus we conclude that in all cases $s$ is not temporally compromised. Since we have concluded that $s$ is complete and neither temporally compromised nor overbooked, we conclude it is valid and the theorem is proved. □
4.2.4 A PNY Schedule using $W(\Delta)\lceil(\tau+1)/2\rceil$ Processors
We have established that Algorithm 4.3 generates valid schedules. We now 
consider the way in which these schedules can be made less processor-greedy 
by remapping. First we consider how many processors are in use at any one
time in the schedules produced by Algorithm 4.3. Next we consider putting 
these schedules through Algorithm 3.2, making use of various lemmas proved 
in Section 3.2. 
We will need the following lemmas. 
Lemma 4.6 For Papadimitriou and Yannakakis' function $e$, computed on a task $\gamma$ in a dag $\Delta$ such that $|\mathrm{pred}(\gamma, \Delta)| \le \tau$, $e(\gamma) = |\mathrm{pred}(\gamma, \Delta)|$.
Proof
We consider the state of the variables in Algorithm 4.1 at the assignment of some task $\gamma$ such that $|\mathrm{pred}(\gamma, \Delta)| \le \tau$. At Line 5, $p = |\mathrm{pred}(\gamma)|$, and by the enumeration at Line 5, the last task in the enumeration, $\delta_p$, has $|\mathrm{pred}(\delta_p)| = 0$. By the enumeration at Line 2 and the assignment at Line 4, the value $e(\delta_p) = 0$ has already been assigned. By Line 6, $k = p$ and thus by Line 7, $e(\gamma) = e(\delta_p) + p = 0 + |\mathrm{pred}(\gamma, \Delta)|$. □
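Algorithm 4.1 itself is stated earlier in the thesis, so the following is only a minimal sketch of the e-function as it is used in the proofs of Lemmas 4.6 and 4.7: the ancestors of a task are ranked by non-increasing e-value, $k$ is taken as the smaller of the ancestor count and $\tau + 1$, and $e(\gamma)$ is the e-value of the $k$-th ancestor plus $k$. The names `e_values`, `preds` and `tau` are ours, and the dictionary-of-predecessors input format is an assumption.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def e_values(preds, tau):
    """Return {task: e-value}. preds maps a task to its immediate predecessors."""
    ancestors, e = {}, {}
    # static_order() yields every task after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        anc = set()
        for u in preds.get(v, ()):
            anc |= {u} | ancestors[u]
        ancestors[v] = anc
        # Rank ancestors by non-increasing e-value and look at the k-th one.
        ranked = sorted((e[u] for u in anc), reverse=True)
        k = min(len(ranked), tau + 1)
        e[v] = ranked[k - 1] + k if k else 0
    return e

# A task with four independent predecessors.
fan_in = {"z": ["a", "b", "c", "d"]}
print(e_values(fan_in, tau=5)["z"])  # 4 ancestors <= tau, so e equals the ancestor count
print(e_values(fan_in, tau=2)["z"])  # more than tau ancestors, so e = e(3rd ancestor) + 3
```

With at most $\tau$ ancestors the $k$-th ranked ancestor is a source with e-value 0, which is exactly the situation Lemma 4.6 describes.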
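Lemma 4.7 Papadimitriou and Yannakakis' function $e$, computed on a dag $\Delta$, is non-decreasing along any path in $\Delta$.
Proof
Let $\Delta = (\Gamma, \Lambda)$ be a dag.
Let us assume as a hypothesis to be proven contradictory that there exists a path, say $\gamma_1, \gamma_2, \ldots, \gamma_n$, in $\Delta$ such that the function $e$ is decreasing along the path. Thus there exists an edge, say $(\gamma_{i-1}, \gamma_i) \in \Lambda$, such that $e(\gamma_{i-1}) > e(\gamma_i)$. We have two cases:
• $|\mathrm{pred}(\gamma_i, \Delta)| \le \tau$, in which case by Lemma 4.6,
$$e(\gamma_{i-1}) = |\mathrm{pred}(\gamma_{i-1}, \Delta)| < |\mathrm{pred}(\gamma_i, \Delta)| = e(\gamma_i),$$
which leads to a contradiction.
• $|\mathrm{pred}(\gamma_i, \Delta)| > \tau$, in which case let $\delta_1, \delta_2, \ldots, \delta_m$ be an enumeration of $\mathrm{pred}(\gamma_i)$ such that for $j = 2, \ldots, m$, $e(\delta_j) \le e(\delta_{j-1})$. Now $e(\gamma_i) = e(\delta_{\tau+1}) + \tau + 1$ and, since $(\gamma_{i-1}, \gamma_i) \in \Lambda$, $\mathrm{pred}(\gamma_{i-1}, \Delta) \subset \mathrm{pred}(\gamma_i, \Delta)$. We have two subcases:
- $|\mathrm{pred}(\gamma_{i-1}, \Delta)| \le \tau$. Thus by Lemma 4.6, $e(\gamma_{i-1}) < \tau + 1 \le e(\gamma_i)$, which leads to a contradiction.
- $|\mathrm{pred}(\gamma_{i-1}, \Delta)| > \tau$. Thus $e(\gamma_{i-1}) \le \tau + 1 + e(\delta_{\tau+2}) \le \tau + 1 + e(\delta_{\tau+1}) = e(\gamma_i)$, which leads to a contradiction.
□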
This leads us on to considering the set of tasks $\gamma_j$ such that processor $p_j$ is allocated computation by Algorithm 4.3. In the following lemma we state that the number of such tasks sharing the same e-value is at most the width of the dag.
Lemma 4.8 Let $\Delta = (\Gamma, \Lambda)$ be a directed graph. For each $\gamma \in \Gamma$ we use Papadimitriou and Yannakakis' definition of $e(\gamma)$. Let $\Gamma' = \{\gamma \in \Gamma : \forall \delta \in \mathrm{succ}(\gamma),\ e(\delta) \ne e(\gamma)\}$. Let $\Phi$ be a subset of $\Gamma'$ such that for all $\gamma, \delta \in \Phi$, $e(\gamma) = e(\delta)$. Then $|\Phi| \le W(\Delta)$.
Proof
Let us assume that $|\Phi| > W(\Delta)$. Thus $\Phi$ is not an antichain and thus there exists some pair of vertices, say $\gamma, \delta \in \Phi$, such that there is a directed path from $\gamma$ to $\delta$. Let this path be denoted $\gamma = \delta_1, \delta_2, \ldots, \delta_m = \delta$ for some $m \ge 2$. We know $e(\gamma) = e(\delta)$ and so by Lemma 4.7, $e(\gamma) = e(\delta_1) = \ldots = e(\delta_m) = e(\delta)$. In particular the successor $\delta_2$ of $\gamma$ has the same e-value as $\gamma$, so $\gamma \notin \Gamma'$, which leads to a contradiction. Hence $|\Phi| \le W(\Delta)$ as required. □
We are now in a position to state how many processors are active at any one time in $s$.
Theorem 4.4 Let $A = (P, \Delta = (\Gamma, \Lambda), f, d)$ be an instance of a UET Fixed Delay scheduling model. Let $s$ be a schedule generated by Algorithm 4.3. Then
$$\max_{t=0}^{M(s)-1} |\mathcal{A}(s,t)| \le W(\Delta)\lceil(\tau+1)/2\rceil.$$
Proof
Let us assume as a hypothesis to be proven contradictory that there exists some time $t$ such that $|\mathcal{A}(s,t)| > W(\Delta)\lceil(\tau+1)/2\rceil$. Let $p_1, \ldots, p_l$ and $\gamma_1, \ldots, \gamma_l$ be the enumerations of $P$ and $\Gamma$ chosen by Algorithm 4.3. For $a = 0, \ldots, M(s) - 1$ let us define $P_a$ as the set of processors which, in $s$, finish being active at time $a$. That is,
$$P_a = \{p_i : 2e(\gamma_i) = a \text{ and } \forall \delta \in \mathrm{succ}(\gamma_i),\ e(\delta) \ne e(\gamma_i)\}.$$
Note that for $a = 1, 3, \ldots, M(s) - 2$, $|P_a| = 0$ and, by Lemma 4.8, for $a = 0, 2, \ldots, M(s) - 1$, $|P_a| \le W(\Delta)$. Note also that since $s$ is not overbooked and since processors in $P_a$ are only active from at earliest time $a - \tau$ to at latest time $a$, we have that
a=max(O,t--r) 
Thus
$$|\mathcal{A}(s,t)| \le W(\Delta)\lceil(\tau+1)/2\rceil$$
which leads to a contradiction. Thus we conclude that
$$\max_{t=0}^{M(s)-1} |\mathcal{A}(s,t)| \le W(\Delta)\lceil(\tau+1)/2\rceil$$
and the theorem is proved. □
Finally we consider the use of the General Purpose Remapper Algorithm 
defined earlier (Algorithm 3.2). 
Theorem 4.5 Let $A = (P, \Delta, f, d)$ be an instance of a Fixed Delay UET scheduling model. Let $s$ be the result of applying Algorithm 4.3 to $A$. Let $s'$ be the result of applying Algorithm 3.2 to $s$. Then $s'$ is a valid schedule with $P(s') \le W(\Delta)\lceil(\tau+1)/2\rceil$.
Proof 
Note that in Algorithm 4.3 task 7i is allocated to processor p, with pi of its an-
cestors executed in previous time steps in a continuous sequence. This means that, 
by Lemma 3.9, $s'$ is not temporally compromised. Moreover, by Lemma 3.8, $s'$ is complete and not overbooked, and thus $s'$ is valid. Furthermore, by Lemma 3.8 and Theorem 4.4,
$$P(s') \le \max_{t=0}^{M(s)-1} |\mathcal{A}(s,t)| \le W(\Delta)\lceil(\tau+1)/2\rceil.$$
□
4.2.5 A PNY Schedule Using an Arbitrary Number of Processors 
Finally in this chapter, we consider mapping a PNY schedule to some arbitrary number of processors, $a$. We use a simple round-robin scheduler [Brent 1974] as defined below.
Definition 4.2 Round Robin Reschedule.
Let $A = (P = \{p_0, \ldots, p_{n-1}\}, \Delta, f, d)$ be an instance of a Fixed Delay UET scheduling model and $a < n$ be a positive integer. Let $s$ be a valid schedule for $A$. For $i = 0, \ldots, n-1$, let $p'_i = p_{i \bmod a}$. Let $b = \lceil n/a \rceil$. We define $s'$, the round robin reschedule of $s$ to $a$ processors, as follows.
$$s' = \{(\gamma, p'_i, bt + \lfloor i/a \rfloor) : (\gamma, p_i, t) \in s\}$$
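The definition can be sketched directly in code. This is a minimal illustration under our own representation, not part of the thesis: a schedule is a set of `(task, processor_index, time)` tuples with processor indices running from 0 to n-1, and the function name `round_robin_reschedule` is hypothetical.

```python
def round_robin_reschedule(s, n, a):
    """Fold a schedule on n processors onto a processors (Definition 4.2).

    Each original time step t is stretched into b = ceil(n / a) slots;
    processor i runs on processor i mod a in slot floor(i / a) of that step.
    """
    b = (n + a - 1) // a  # integer ceiling of n / a
    return {(task, i % a, b * t + i // a) for (task, i, t) in s}

s = {("g0", 0, 0), ("g1", 1, 0), ("g2", 2, 0), ("g3", 3, 0), ("g4", 0, 1)}
s_prime = round_robin_reschedule(s, n=4, a=2)
for entry in sorted(s_prime, key=lambda e: (e[2], e[1])):
    print(entry)
```

In this small case the four tasks that ran together at time 0 are spread over slots 0 and 1 on two processors, and the task from time 1 lands in slot 2, so no processor is asked to run two tasks at once.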
Theorem 4.6 Let $A = (P = \{p_0, \ldots, p_{n-1}\}, \Delta, f, d)$ be an instance of a Fixed Delay UET scheduling model and $a < n$ be a positive integer. Let $s$ be a valid schedule for $A$. Let $s'$ be the round robin reschedule of $s$ to $a$ processors. Then $s'$ is a valid schedule for $A$ with at most $a$ busy processors and $M(s') \le \lceil n/a \rceil M(s)$.
Proof 
Let us assume as a hypothesis to be proven contradictory that for some processor $p_i$, $i \ge a$, $X(s', \Gamma, \{p_i\}, Z) \ne \{\}$. Thus, by Definition 4.2, there exists a processor $p_j$ such that $j \bmod a \ge a$, which leads to a contradiction. Thus we conclude that for any processor $p_i$ which is active in $s'$, $0 \le i < a$ and thus $P(s') \le a$.
Let $(\gamma, p'_i, t') \in s'$ be a tuple such that $t' + 1 = M(s')$. Now there exists a tuple $(\gamma, p_i, t) \in s$ such that $t' = bt + \lfloor i/a \rfloor$, and so $M(s) - 1 \ge t$. Thus $b(M(s) - 1) \ge t' - \lfloor i/a \rfloor$, and since $\lfloor i/a \rfloor \le b - 1$, $b(M(s) - 1) \ge t' - b + 1$. Thus $bM(s) \ge t' + 1 = M(s')$.
Note that for every tuple $(\gamma, p_i, t) \in s$ there is a tuple $(\gamma, p'_i, t') \in s'$ and thus, since $s$ is complete, $s'$ is complete.
Let us assume as a hypothesis to be proven contradictory that $s'$ is temporally compromised. Since $s$ is valid this means there exists some tuple, say $(\gamma, p_i, x) \in s$, for which there exists an edge, say $\rho = (\delta, \gamma) \in \Lambda$, and a tuple $(\delta, p_j, t) \in s$ such that
$$t + 1 + d(\rho, p_j, p_i) \le x$$
whereas, for the corresponding pair of tuples, $(\gamma, p'_i, bx + \lfloor i/a \rfloor), (\delta, p'_j, bt + \lfloor j/a \rfloor) \in s'$,
$$bt + \lfloor j/a \rfloor + 1 + d(\rho, p'_j, p'_i) > bx + \lfloor i/a \rfloor.$$
Thus
$$bt + \lfloor j/a \rfloor + 1 + d(\rho, p'_j, p'_i) > bt + b + b\,d(\rho, p_j, p_i) + \lfloor i/a \rfloor$$
and, since $\lfloor i/a \rfloor \ge 0$,
$$\lfloor j/a \rfloor + d(\rho, p'_j, p'_i) \ge b + b\,d(\rho, p_j, p_i).$$
Note that if $p_j = p_i$, $p'_j = p'_i$ and so $d(\rho, p'_j, p'_i) \le d(\rho, p_j, p_i)$ and, since $b \ge 1$,
$$\lfloor j/a \rfloor \ge b = \lceil n/a \rceil$$
and thus $j \ge n$, which leads to a contradiction.
Let us assume as a hypothesis to be proven contradictory that $s'$ is overbooked. Since $s$ is valid this means there exists some pair of tuples, say $(\gamma, p_i, t), (\delta, p_j, t') \in s$, such that $i \ne j$ and $i \bmod a = j \bmod a$ and $bt + \lfloor i/a \rfloor = bt' + \lfloor j/a \rfloor$. Since both $\lfloor i/a \rfloor$ and $\lfloor j/a \rfloor$ are less than $b$, this forces $t = t'$ and $\lfloor i/a \rfloor = \lfloor j/a \rfloor$. Thus $|i - j| < a$, which leads to a contradiction. □
We can now take the performance guarantee for Algorithm 4.2, use Al-
gorithm 4.3, remap to fewer processors using Algorithm 3.2, and perform a 
round-robin schedule of the result. We finally end up with a schedule to some 
fixed number of processors for which we have the performance guarantee stated 
in the corollary below. 
Corollary 4.2 Let $A = (P = \{p_0, \ldots, p_{n-1}\}, \Delta, f, d)$ be an instance of a Fixed Delay UET scheduling model and $a < n$ be a positive integer. Let $s$ be a schedule produced by Algorithm 4.3 for $A$. Let $s'$ be the result of applying Algorithm 3.2 to $s$. Let $s^*$ be the Round Robin Reschedule of $s'$ to $a$ processors. Let $s_{opt}$ be a valid schedule of $A$ of minimal makespan. Then
$$M(s^*) \le 2\left\lceil \frac{W(\Delta)\lceil(\tau+1)/2\rceil}{a} \right\rceil M(s_{opt}).$$
4.3 ETF 
4.3.1 The ETF Algorithm 
Unlike PNY, ETF does not allow recomputation, so the execution of each task 
is performed only once and no pruning of the schedule can result in a valid 
schedule. ETF is presented below in our notation and in a slightly different way to that given by Hwang et al. [1989]. Hwang et al. use a General Delay scheduling model where $d$ can be expressed as the product of two functions: one with domain $\Lambda$ and the other with domain $P \times P$. We relax this restriction.
In order to simplify the presentation of the algorithm we now define a little more notation.
Definition 4.3 First Event Time.
The first event time of a schedule $s \ne \{\}$, denoted $T(s)$, is
$$T(s) = \min_{(\gamma,p,t) \in s} t.$$
We also define $T(\{\}) = \infty$. We note that in general, $|X(s, \Gamma, P, T(s))|$ need not be 1.
Definition 4.4 First Event.
Given an instance $(P, \Delta = (\Gamma, \Lambda), f, d)$ of a scheduling model and a schedule $s \ne \{\}$, we let $\mathcal{F}$ be a function such that $\mathcal{F}(s)$ returns a single tuple chosen nondeterministically from $X(s, \Gamma, P, T(s))$.
ETF works by stepping through time, keeping track of the tasks that are 
available for execution and the processors that are idle, and allocating tasks 
to processors whenever there are idle tasks and available processors. When 
there is a choice at any one moment about which task in the ready set is to be 
assigned to which processor in the idle set, task-processor pairings are made in 
the order in which they became possible, or rather would have become possible 
had the relevant processor not been busy. The algorithm is made slightly more 
complex because it does not move forward in steps of unit time. It only allocates 
tasks at time zero and at subsequent times when a processor has just finished 
executing a task, because only at these points will there be both idle tasks and 
idle processors. 
In the algorithm below there are three schedules s, a, and r. s is the schedule 
being produced by ETF, and may be thought of as the output of the algorithm. 
a is a subset of s, and may be thought of as the set of tasks active at a particular
value of CM - the current moment. r is not really a schedule at all, at least 
not a valid schedule. It is used to record the times at which tasks will become 
available to processors, and may be thought of as the current set of available 
start times. 
Tuples are present in r for every processor irrespective of whether or not 
they are idle, and as tasks are allocated to processors they are removed from 
the ready sets of all processors. The set of currently idle processors in the algorithm is I. It is initially set to all the processors in the instance of the scheduling model.
The outermost loop runs until all tasks have been allocated. The top inner 
while-loop allocates tasks to idle processors in the order in which such alloca-
tions became possible. It continues either until there are no processors left, or 
until the task would be allocated at a time after some processor finishes execut-
ing a task and becomes idle. When the top loop finishes, the current moment 
moves on to the next time a processor finishes executing a task, and the bottom 
inner while loop starts considering all the tasks that have just finished executing. 
If any of these tasks is the last to finish of the predecessors of some task
then that task is added to every processor's ready list. The time at which it will 
become available to each processor is set to the time at which the last of the 
results of its predecessor tasks will become known to that processor. Note that 
because of non-uniform communication delays the last predecessor known to a 
processor is not necessarily the last one to complete execution. 
The ETF algorithm is, in terms of our notation, as follows. 
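Algorithm 4.4 ETF
1   let $(P, \Delta = (\Gamma, \Lambda), f, d)$ be an instance of a General Delay scheduling model.
2   for all $\gamma \in \Gamma$ let $Q(\gamma) = |\{\delta \in \Gamma : (\delta, \gamma) \in \Lambda\}|$;
3   let $s = a = \{\}$;
4   let $I = P$;
5   let $r = \{(\gamma, p, 0) : p \in P, \gamma \in \Gamma, Q(\gamma) = 0\}$;
6   let $CM = 0$;
7   while $|s| < |\Gamma|$
    do
8       while $T(X(r, \Gamma, I, Z)) < T(a)$
        do
9           let $(\gamma, p, t) = \mathcal{F}(X(r, \Gamma, I, Z))$;
10          let $e = \max\{CM, t\}$;
11          let $r = r - X(r, \{\gamma\}, P, Z)$;
12          let $s = s \cup \{(\gamma, p, e)\}$;
13          let $a = a \cup \{(\gamma, p, e + f(\gamma))\}$;
14          let $I = I - \{p\}$;
        enddo
15      let $CM = T(a)$;
16      while $T(a) = CM$
        do
17          let $(\gamma, p, t) = \mathcal{F}(a)$; let $a = a - \{(\gamma, p, t)\}$;
18          let $I = I \cup \{p\}$;
19          for each $(\gamma, \delta) \in \Lambda$
            do
20              let $Q(\delta) = Q(\delta) - 1$;
21              if $Q(\delta) = 0$
                do
22                  for each $q \in P$
                    do
23                      let $t_{max} = \max \{t' + f(\delta') + d((\delta', \delta), p', q) : (\delta', \delta) \in \Lambda, (\delta', p', t') \in s\}$;
24                      let $r = r \cup \{(\delta, q, t_{max})\}$;
                    enddo
                enddo
            enddo
        enddo
    enddo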
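To make the selection rule concrete, here is a deliberately simplified sketch. It is not Algorithm 4.4: it drops the event-driven bookkeeping (the sets s, a, r and I) and instead repeatedly picks, among all ready tasks and all processors, the pairing that can start earliest. The function name `etf`, the dictionary inputs and the tie-breaking order (task name, then processor order) are our assumptions, and the task graph is assumed to be acyclic.

```python
def etf(preds, f, d, procs):
    """Greedy earliest-start scheduler in the spirit of ETF.

    preds: {task: [immediate predecessors]}; every task must be a key
    f:     {task: execution time}
    d:     d(u, v, p, q) -> delay on edge (u, v) when u ran on p and v runs on q
    Returns {task: (processor, start_time)}.
    """
    placed = {}
    free_at = {p: 0 for p in procs}
    while len(placed) < len(preds):
        best = None
        for v in sorted(preds):
            if v in placed or any(u not in placed for u in preds[v]):
                continue  # already scheduled, or not yet ready
            for p in procs:
                # Time at which the last predecessor result reaches p.
                arrive = max((placed[u][1] + f[u] + d(u, v, placed[u][0], p)
                              for u in preds[v]), default=0)
                start = max(arrive, free_at[p])
                if best is None or start < best[0]:
                    best = (start, v, p)
        start, v, p = best
        placed[v] = (p, start)
        free_at[p] = start + f[v]
    return placed

# Fork-join graph with unit tasks and a fixed delay tau between distinct processors.
preds = {"a": [], "b": ["a"], "c": ["a"], "e": ["b", "c"]}
f = {v: 1 for v in preds}
tau = 3
print(etf(preds, f, lambda u, v, p, q: 0 if p == q else tau, procs=[0, 1]))
```

With a delay of 3 the sketch keeps every task on the first processor, because waiting for the local processor is always quicker than paying the communication delay; with zero delay it spreads the two middle tasks over both processors.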





4.3.2 The ETF bound 
Let $A = (P, \Delta = (\Gamma, \Lambda), f, d)$ be an instance of a General Delay scheduling model, where $|P| = n$. Let $B = (P, \Delta = (\Gamma, \Lambda), f, d')$ be a corresponding instance of a No Delay scheduling model.
Let $s^A_{ETF}$ be a schedule produced by Algorithm 4.4 for $A$. The performance bound on $M(s^A_{ETF})$ is given in terms of the makespan of an optimal schedule, say $s_{opt}$, of $B$. Hwang et al. prove a bound on $M(s^A_{ETF})$ which can be generalised to the following.
$$M(s^A_{ETF}) \le (2 - 1/n)M(s_{opt}) + C$$
where, the outer maximum being taken over paths $\gamma_1, \ldots, \gamma_r$ in $\Delta$,
$$C = \max \sum_{i=1}^{r-1} \max_{p,q \in P} d((\gamma_i, \gamma_{i+1}), p, q).$$
4.4 Comparing ETF and PNY 
As a result of the differences between the underlying models, there are dif-
ficulties in comparing PNY with ETF directly. Instead we compare schedules 
produced by Algorithm 4.3 that have been remapped by Algorithm 3.2 and then 
round robin rescheduled according to Definition 4.2. Furthermore we restrict 
our attention to UET Fixed Delay scheduling models. 
4.4.1 A Comparison of Bounds 
In the following analysis we will be referring to related instances of four sched-
uling models. 
• $F_l$ is an instance of a Fixed Delay scheduling model with $l$ processors.
• $N_l$ is an instance of a No Delay scheduling model with $l$ processors.
• $F_a$ is an instance of a Fixed Delay scheduling model with $a$ processors.
• $N_a$ is an instance of a No Delay scheduling model with $a$ processors.
Let $P$ be a set of $l$ processors. Let $F_l = (P, \Delta = (\Gamma, \Lambda), f, d_{F_l})$ be an instance of a Fixed Delay UET scheduling model where $|P| = l$ and the range of $d_{F_l}$ is $\{0, \tau\}$. Let $N_l = (P, \Delta = (\Gamma, \Lambda), f, d_{N_l})$ be a corresponding instance of a No Delay scheduling model. Let $a \in \mathbb{N}$ be a number of processors such that $a < W(\Delta)\lceil(\tau+1)/2\rceil$.
Let $p_1, \ldots, p_l$ be an enumeration of $P$. Let $P' = \{p_1, \ldots, p_a\}$. Let $d_{F_a}$ be the restriction of $d_{F_l}$ to the domain $\Lambda \times P' \times P'$. Let $d_{N_a}$ be the restriction of $d_{N_l}$ to the domain $\Lambda \times P' \times P'$. Let $F_a = (P', \Delta, f, d_{F_a})$. Let $N_a = (P', \Delta, f, d_{N_a})$.
Let $s^{F_l}_{opt}$ be an optimal schedule of $F_l$. Let $s^{N_l}_{opt}$ be an optimal schedule of $N_l$. Let $s^{N_a}_{opt}$ be an optimal schedule of $N_a$.
Let $s^{F_l}_{PNY}$ be a schedule generated by Algorithm 4.3 for $F_l$. Let $s'$ be the result of applying Algorithm 3.2 to $s^{F_l}_{PNY}$. Let $s^{F_a}_{PNY*}$ be the round-robin remapping of $s'$ to $a$ processors. By Corollary 4.2
$$M(s^{F_a}_{PNY*}) \le 2\left\lceil \frac{W(\Delta)\lceil(\tau+1)/2\rceil}{a} \right\rceil M(s^{F_l}_{opt}).$$
Let $s^{F_a}_{ETF}$ be a schedule produced by ETF for $F_a$. In the special case of the UET Fixed Delay scheduling model the bound of Section 4.3.2 can be expressed as
$$M(s^{F_a}_{ETF}) \le (2 - 1/a)M(s^{N_a}_{opt}) + \tau H(\Delta).$$
These bounds are not easily comparable since the PNY bound is expressed in terms of $M(s^{F_l}_{opt})$ and the ETF bound in terms of $M(s^{N_a}_{opt})$. We now give a new bound on ETF which is in terms of $M(s^{F_l}_{opt})$.
As an intermediate result Hwang et al. [1987], when restated for the restricted













since in any schedule the tasks of a path of length $H(\Delta)$ must be processed in sequence, and finally note
$$W(\Delta)H(\Delta) \ge |\Gamma|.$$






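$$\frac{1}{a}\Bigl(|\Gamma| + \sum_{t} \bigl(a - |\mathcal{A}(s,t)|\bigr)\Bigr)$$
$$\le \frac{1}{a}\bigl(W(\Delta)H(\Delta) + (a - 1)H(\Delta) + a\tau(H(\Delta) - 1)\bigr)$$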




- 	 a 	 opt 
In comparing the bounds it must be noted that ETF is a more general-purpose algorithm than PNY. ETF might not be expected to perform as well as PNY on the special case of Fixed Delay UET scheduling models to which PNY uniquely applies.
4.4.2 Exemplar Schedules 
We can consider the use of ETF and PNY on different scheduling models. Again 
we restrict our consideration to the set of scheduling models to which they may 
both be applied. We consider two such models. The first is one on which PNY 
performs spectacularly badly. The second is one on which ETF performs almost 
as spectacularly badly. 
The Lumpy dag 
Let r be a positive integer. Let us define task graph A = (F, z) and a set of 
processors P as follows. 
= 
r 	a' th
y ,y)I1 <x < r,I< y  r,r + I< a< 2,r +11 U 
{(-yy)I1< x < r, 7- 1 ~ y ~ 2T + l} 
P = 	 . 
Figure 4-3 shows $\Delta$ for $\tau = 3$. Vertices are labelled on the figure simply with their two indices. Let $A = (P, \Delta, f, d)$ be an instance of a Fixed Delay UET scheduling model where the range of $d$ is $\{0, \tau\}$. Let $P' = \{p_1, p_2, \ldots, p_\tau\}$. Let $d'$ be the restriction of $d$ to the domain $P'$. Let $B = (P', \Delta, f, d')$.
Let us consider some schedule $s^B_{ETF}$ of $B$ generated by Algorithm 4.4. First note that only tasks $\gamma_x^1$, $1 \le x \le \tau$, have no predecessors so these form the initial ready set, and without loss of generality task $\gamma_x^1$ is assigned to processor $p_x$ to execute at time 0. When each of these tasks finishes, task $\gamma_x^2$ is assigned to processor $p_x$ since it becomes ready to execute on that processor (but not on any other processor, for which a communication delay would be required). The allocation continues until each task $\gamma_x^\tau$, $1 \le x \le \tau$, completes its execution on processor $p_x$ at time $\tau$. At this time tasks $\gamma_x^y$, $1 \le x \le \tau$, $\tau + 1 \le y \le 2\tau + 1$, become ready for execution on their respective processors $p_x$. ETF is free to schedule these tasks on their respective processors in any order and, since their connections are symmetric, without loss of generality we may assume that it schedules tasks $\gamma_x^y$, $1 \le x \le \tau$, $\tau + 1 \le y \le 2\tau + 1$, at time $y - 1$. This means that each processor $p_x$ finishes computing task $\gamma_x^{2\tau+1}$ at time $2\tau + 1$, and then finally task $\gamma_0$ becomes available to all processors at time $3\tau + 1$ and some processor, say $p_1$, executes it and finishes at time $3\tau + 2$. That is
$$M(s^B_{ETF}) = 3\tau + 2.$$
Figure 4-3: The Lumpy dag for $\tau = 3$
Now consider applying Algorithm 4.1 to $\Delta$. The e-value of each task $\gamma_x^y$, $1 \le x \le \tau$, $1 \le y \le \tau$, is $y - 1$, since they have $y - 1 < \tau$ predecessors. The e-value of tasks $\gamma_x^y$, $1 \le x \le \tau$, $\tau + 1 \le y \le 2\tau + 1$, is $\tau$ since they all have $\tau$ predecessors. Finally the e-value of task $\gamma_0$ is $2\tau + 1$ since its $\tau$ highest in e-value predecessors (indeed its $\tau^2 + \tau$ immediate predecessors) all have e-value $\tau$, and therefore we must add $\tau + 1$ to the e-value of one of these predecessors.
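These e-values can be checked mechanically. The sketch below is an illustration under stated assumptions rather than the thesis's construction: the Lumpy dag is rebuilt from the prose description only (each of the $\tau$ chains has $\tau$ tasks, followed by $\tau + 1$ "lump" tasks, all feeding a single final task), only a transitive reduction of the edges is stored because the e-function depends on ancestor sets alone, and the e-function follows the rule used in the proofs of Lemmas 4.6 and 4.7. The names `e_values` and `lumpy_preds` are ours.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def e_values(preds, tau):
    """e(v) = e(k-th highest-ranked ancestor) + k, where k = min(#ancestors, tau + 1)."""
    ancestors, e = {}, {}
    for v in TopologicalSorter(preds).static_order():
        anc = set()
        for u in preds.get(v, ()):
            anc |= {u} | ancestors[u]
        ancestors[v] = anc
        ranked = sorted((e[u] for u in anc), reverse=True)
        k = min(len(ranked), tau + 1)
        e[v] = ranked[k - 1] + k if k else 0
    return e

def lumpy_preds(tau):
    """Task (x, y) is the y-th task of chain x; task 0 is the final task."""
    preds = {}
    for x in range(1, tau + 1):
        for y in range(2, tau + 1):
            preds[(x, y)] = [(x, y - 1)]   # chain within column x
        for y in range(tau + 1, 2 * tau + 2):
            preds[(x, y)] = [(x, tau)]     # lump tasks follow the end of the chain
    preds[0] = [(x, y) for x in range(1, tau + 1)
                for y in range(tau + 1, 2 * tau + 2)]
    return preds

tau = 3
e = e_values(lumpy_preds(tau), tau)
print(e[(1, 1)], e[(1, tau)], e[(1, tau + 1)], e[0])
```

For $\tau = 3$ the four printed values are the e-values of the first chain task, the last chain task, a lump task and the final task, which the paragraph above gives as $0$, $\tau - 1$, $\tau$ and $2\tau + 1$.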
Let us consider some schedule $s^A_{PNY}$ of $A$ generated by Algorithm 4.3. In this schedule the last task to be executed is $\gamma_0$ at time $2e(\gamma_0)$ so
$$M(s^A_{PNY}) = 4\tau + 3.$$
Unfortunately such a schedule would use $\tau^2 + 2\tau + 1$ processors, where at least $(\tau + 1)\tau$ processors were active for $\tau + 1$ time units. Let us consider some schedule $s^B_{PNY*}$ of $B$ generated by rescheduling $s^A_{PNY}$. No matter how clever an algorithm was used to reschedule to the $\tau$ processors, this means that at least one processor would be executing for at least $(\tau + 1)^2$ time units. Thus
$$M(s^B_{PNY*}) \ge (\tau + 1)^2.$$
The Inverse Binary Tree 
Our second example is the inverse binary tree of Jung et al. [1989]. Given an integer value $\tau$ we construct our instance $C = (P, \Delta = (\Gamma, \Lambda), f, d)$ of a Fixed Delay UET scheduling model where
$$\Gamma = \{\gamma_i : 1 \le i \le 2^\tau - 1\},$$
$$\Lambda = \{(\gamma_i, \gamma_{2i}), (\gamma_i, \gamma_{2i+1}) : 1 \le i \le \lfloor |\Gamma|/2 \rfloor\},$$
$$P = \{p_1, p_2, \ldots, p_{2^\tau}\}.$$
Consider applying Algorithm 4.1 to $\Delta$. The e-value of each task $\gamma_i$, $1 \le i < 2^\tau$, is $\lfloor \log i \rfloor$ since all tasks have that number of predecessors, which is fewer than $\tau$. Let us consider some schedule $s^C_{PNY}$ of $C$ generated by Algorithm 4.3. Thus
$$M(s^C_{PNY}) = 2(\tau - 1) + 1 = 2\tau - 1.$$
Let us consider some schedule $s^C_{ETF}$ of $C$ generated by Algorithm 4.4. Some processor, say $p_1$, is allocated task $\gamma_1$. After executing task $\gamma_1$ the two children of that task become available for execution on $p_1$ at time 1, and on all other processors at time $\tau + 1$. Processor $p_1$ continues executing tasks of the tree in a breadth first traversal of the tree. Task $\gamma_i$ is executed at time $i - 1$ and for









Figure 4-4: The Inverse Binary Tree for $\tau = 6$ and its e Values
for execution on processor $p_1$. Eventually, at time $2\tau$, task $\gamma_{2\tau+1}$, which has been available for execution on processor $p_1$ since time $\tau$, has not been executed and becomes available for execution on all other processors. Note that from this time step on, processor $p_1$ cannot execute tasks as fast as its execution of their parents makes them available to it. For each $2\tau$ tasks available to it, it executes $\tau$. Together these make $2\tau$ successors available to it, of which it can execute $\tau$, so at each subsequent level of the tree it executes $\tau$ tasks. Since there are at least $\tau - \lfloor \log(2\tau) \rfloor$ subsequent layers to the tree after the first $2\tau$ tasks,
$$M(s^C_{ETF}) \ge 2\tau + \tau(\tau - \lfloor \log(2\tau) \rfloor)$$
and thus
$$M(s^C_{ETF}) \ge \tau(\tau + 1 - \log \tau).$$





5.1 	An Overview of Process Based Models 
The scheduling models of Chapters 2, 3 and 4 contrast with another approach 
to modelling computations on multicomputers whereby modules (which in this 
context we refer to as processes) are arranged in undirected graphs, and edges 
between processes represent two-way communication rather than precedence 
and one-way communication. These models are often used for modelling com-
putations at a coarse granularity where modules are persistent processes which 
exist for the duration of a computation, and there are stable communication 
patterns between them. 
In many cases, there is an equivalence between a process based model of 
computation and a probabilistic, and possibly cyclic, graph of task dependencies 
and branching probabilities. The former can be derived from the latter by 
applying a Markov chain analysis. An example of the approach is to be found 
in Chu et al. [1984]. 
In this chapter we describe a pair of models, the first by Stone [1977a], and 
the second by Bokhari [1981b]. The two models differ in the way in which 
communication is handled and both are limited in application. In Chapter 6 
we consider the complexity of mapping in a version of Bokhari's model. In 
Chapter 7 we describe various approaches to extending the applicability of such 
models, and for integrating them with scheduling models described earlier. 
It is worth bearing in mind that a number of the models discussed in this and 
later chapters were presented in the context of distributed systems rather than 
multicomputer systems. We include them in our review, firstly because there 
is a grey area between the two types of system and, secondly, because several 
papers can be viewed as attempts to adapt these research results to mapping 
problems for multicomputers. Unfortunately, there are significant differences in the abstract models of computation being used by researchers in the two fields; these are summarised in Table 5-1. We are not saying that all researchers make these assumptions; rather, we are conveying an impression of the differences in emphasis between the two fields.

Feature                      Distributed Systems            Multicomputers
Work Profile                 Multiple Job                   Single Job
Processor Type               Heterogeneous                  Homogeneous
Fault Tolerance              Important                      Ignored
Optimisation                 Maximise Throughput by         Minimise Time to Completion
                             Minimising System Resource
                             Consumption
Ratio of Message Latency     Relatively High                Relatively Low
to Instruction Cycle Time
Individual Job Execution     Can be Required to be          Parallel
                             Sequential

Table 5-1: Relevant Differences Between Models of Distributed Systems and Models of Multicomputer Systems
5.2 Definitions 
Below we develop a notation based on graphs (rather than directed graphs) 
which is used in this chapter and in Chapter 6. 
Definition 5.1 Graph. 
A graph (rather than a directed graph), say $G = (V, E)$, is an ordered pair comprising a set, $V$, of vertices and a set, $E$, of edges. An edge in the set $E$ is an unordered pair of distinct vertices from the set $V$.
Definition 5.2 Degree of Vertex and of Graph.
Given a graph $G = (V, E)$, the degree of a vertex $v \in V$ is the cardinality of the set containing all edges $\{u, v\} \in E$ where $u \in V$. The degree of $G$ is the degree of a vertex in $V$ of maximal degree.
Definition 5.3 Path.
A path in a graph $G = (V, E)$ between a pair of vertices $\{u, v\}$ is a sequence of vertices $\{\theta_1, \theta_2, \ldots, \theta_k\} \subseteq V - \{u, v\}$, $k \ge 0$, such that
$$\{u, \theta_1\}, \{\theta_1, \theta_2\}, \ldots, \{\theta_{k-1}, \theta_k\}, \{\theta_k, v\} \in E.$$
Definition 5.4 Path Length, Shortest Path Length.
Given a graph $G = (V, E)$, we define the length of such a path $\{\theta_1, \theta_2, \ldots, \theta_k\} \subseteq V - \{u, v\}$ to be $k$. We also define a function $SH_G(\{u, v\})$ to be the length of the (possibly non-unique) shortest path between $u$ and $v$ in $G$. Note that our definition of path length is slightly different from the usual graph theoretic definition: it counts the intermediate vertices rather than the edges. So we have $SH_G(\{u, v\}) = 0$ if and only if $\{u, v\} \in E$.
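The convention is easy to get wrong by one, so a small sketch may help. It assumes the reading under which path length is the number of intermediate vertices, which is what makes $SH_G$ equal to 0 for adjacent vertices. The function name `sh` and the adjacency-dictionary input are ours, and $u$ and $v$ are assumed to be distinct.

```python
from collections import deque

def sh(adj, u, v):
    """Shortest path length between distinct u and v, counted in intermediate
    vertices (edge count minus one). Returns None if v is unreachable from u."""
    dist = {u: 0}            # dist holds the usual edge-count distance from u
    queue = deque([u])
    while queue:
        w = queue.popleft()
        if w == v:
            return dist[w] - 1
        for nxt in adj.get(w, ()):
            if nxt not in dist:
                dist[nxt] = dist[w] + 1
                queue.append(nxt)
    return None

# The path graph a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(sh(adj, "a", "b"), sh(adj, "a", "d"))
```

On this graph adjacent vertices are at distance 0 and the two end vertices, which are separated by two intermediate vertices, are at distance 2.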
Definition 5.5 Shortest Path. 
Given a graph $G = (V, E)$, and a pair of vertices, $u$ and $v$ in $V$, we define a function $PA_G$ such that if there is a unique shortest path between $u$ and $v$, $PA_G(\{u, v\})$ returns that path. Otherwise $PA_G(\{u, v\})$ is undefined.
Definition 5.6 Routing Load.
Given a graph $G = (V, E)$, and a pair of vertices, $x$ and $y$ in $V$, we define $RO_G(\{x, y\}, v)$ to be a function which indicates if a vertex $v$ is in the unique shortest path between $x$ and $y$. That is, it returns 1 if $v \in PA_G(\{x, y\})$, or 0 otherwise.
Note that, by our definition, if a unique shortest path exists then
$$\sum_{v \in V} RO_G(\{x, y\}, v) = SH_G(\{x, y\})$$
and $\sum_{x, y \in V} RO_G(\{x, y\}, v)$ is equal to the number of unique shortest paths on which the vertex $v$ is located.
Definition 5.7 Process Based Model. 
An instance of a process based model is a 4-tuple $(P, G, f, c)$. $P$ is a set of $n$ processors, $G = (V, E)$ is a graph, where $V$ is a set of vertices corresponding to processes, and $E$ is a set of edges corresponding to communication between processes. $f : V \times P \to \mathbb{N}$ is a function such that $f(v, p)$ returns the cost of computing process $v \in V$ on processor $p \in P$; $c : E \times P \times P \to \mathbb{N}$ is a function returning the cost associated with communication between processes if they are mapped to different processors.
Definition 5.8 Process Mapping Function. 
A process mapping function for an instance $(P, G = (V, E), f, c)$ of a process based model is a function
$$g : V \to P$$
which maps each process to a processor. We define a corresponding function
$$\bar{g} : P \to 2^V$$
which returns the set of processes mapped to a given processor by mapping function $g$.
Definition 5.9 Computation Cost of a Mapping. 
Given an instance $(P, G = (V, E), f, c)$ of a process based model, the global cost of computation, $U(g)$, associated with mapping function $g$, is given by:
$$U(g) = \sum_{v \in V} f(v, g(v)).$$
Definition 5.10 Communication Cost of a Mapping.
Given an instance $(P, G = (V, E), f, c)$ of a process based model, the global cost of communication, $V(g)$, associated with mapping function $g$, is given by:
$$V(g) = \sum_{\{v,w\} \in E,\ g(v) \ne g(w)} c(\{v, w\}, g(v), g(w)).$$
Definition 5.11 Cost of a Mapping.
Given an instance $(P, G = (V, E), f, c)$ of a process based model, the global cost $C(g)$ of a mapping is given by:
$$C(g) = U(g) + V(g).$$
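Definitions 5.9 to 5.11 translate almost line for line into code. The following is a minimal sketch with a representation of our own choosing: processes and processors are arbitrary hashable labels, edges are pairs, the mapping $g$ is a dictionary, and `f` and `c` are ordinary functions. The name `mapping_cost` is hypothetical.

```python
def mapping_cost(V, E, f, c, g):
    """Return (U(g), V(g), C(g)) for a mapping g given as {process: processor}."""
    # U(g): every process pays its computation cost on the processor it is mapped to.
    U = sum(f(v, g[v]) for v in V)
    # V(g): only edges whose endpoints sit on different processors pay communication.
    Vg = sum(c(frozenset((v, w)), g[v], g[w]) for (v, w) in E if g[v] != g[w])
    return U, Vg, U + Vg

V = ["a", "b", "c"]
E = [("a", "b"), ("b", "c")]
f = lambda v, p: 2           # every process costs 2 on every processor
c = lambda edge, p, q: 5     # every cut edge costs 5
print(mapping_cost(V, E, f, c, {"a": 0, "b": 0, "c": 1}))
```

Here only the edge between b and c crosses processors, so the computation cost is 6, the communication cost is 5 and the total is 11.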
5.3 Stone's 1977 Model 
The basis of Stone's model [Stone 1977a] is that there is a computation cost as-
sociated with executing each process and a communication cost associated with 
sending each inter-process message. The total cost of a mapping is the sum 
of all the computation costs and communication costs. For reasons discussed 
later, Stone's model is formulated for non-identical processors, where the com-
putation cost of executing a process depends on the processor upon which it is 
executed. 
Decision Problem 5.1 Given an instance $A$ of a process based model and an integer $k$: does there exist a process mapping function, $g$, for $A$ such that $C(g) \le k$?
5.3.1 Understanding Stone's model 
We can rationalise Stone's model in terms of our framework as outlined in 
Table 5-2. Here we are associating C(g) with the time to completion of the 
T.10(p) >ivE(p) f(v,p) 
T omm 0 
THOS 0 
TIdle 	TWaitD  
Tait, + TF 
V(g) 
(n - 1)C(g) 
Table 5-2: A Process Based Model in Terms of our Framework 
program: as Bokhari points out [Bokhari 1981b], there is nothing in the model to enforce this; $C(g)$ could be measured, for example, in dollars. In our rationalisation $T_{Calc}(p)$ is simply the execution time of the processes that have been assigned to processor $p$. The total time processors spend waiting for a message to be delivered, $T_{WaitD}$, is simply the total communication cost. All other time is
spent in T ait, or 
The rationalisation stems from the fact that Stone's work considers only 
the case of sequentially executing processes. That is at any one time exactly 
one process is executing on one of n processors, and the other processors are 
idle for the duration of its execution. The cost of communication between 
processes corresponds to a delay between a process terminating and another 
process starting. The sum of the costs corresponds to the time to completion of 
the program since the costs are incurred sequentially. Stone clearly states that 
he is dealing with these serial programs, but it is an issue that could easily be 
overlooked when considering other papers referencing his model. 
As a result of the sequential nature of execution in this cost model, in the 
case of machines with identical processors such as most multicomputers, it will 
always be optimal to execute tasks on a single processor. To do otherwise will 
never reduce computation cost, and can only ever incur extra communications 
cost. 
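The observation can be checked by exhaustive search on a toy instance. The sketch below is only an illustration: it enumerates every mapping, which is feasible only for a handful of processes, and the function name `best_mapping`, the instance data and the tie-breaking (the first cheapest mapping in enumeration order wins) are our assumptions.

```python
from itertools import product

def best_mapping(V, E, f, c, procs):
    """Exhaustively minimise C(g) = U(g) + V(g); returns (cost, mapping)."""
    best = None
    for assignment in product(procs, repeat=len(V)):
        g = dict(zip(V, assignment))
        cost = sum(f(v, g[v]) for v in V) + sum(
            c((v, w), g[v], g[w]) for (v, w) in E if g[v] != g[w])
        if best is None or cost < best[0]:
            best = (cost, g)
    return best

V = ["a", "b", "c"]
E = [("a", "b"), ("b", "c")]
weight = {"a": 3, "b": 1, "c": 2}

# Identical processors: the cost of a process does not depend on where it runs.
cost, g = best_mapping(V, E, lambda v, p: weight[v], lambda e, p, q: 1, [0, 1, 2])
print(cost, g)

# Non-identical processors: splitting the work can now pay for the communication.
special = {("a", 0): 1, ("a", 1): 10, ("b", 0): 10, ("b", 1): 1}
cost2, g2 = best_mapping(["a", "b"], [("a", "b")],
                         lambda v, p: special[(v, p)], lambda e, p, q: 1, [0, 1])
print(cost2, g2)
```

With identical processors the search returns a mapping that places everything on one processor at a cost equal to the total computation cost; in the second instance, where each process is cheap on a different processor, the cheapest mapping separates the two processes and pays one unit of communication.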
5.3.2 Results and Algorithms
There are a number of results for this model which relate to two-processor systems. These stem from work by Stone [1977a, 1978] based upon the use of network flow diagrams. Stone shows an optimal algorithm for a process based model with $n = 2$. An extension of Stone's work to three processors was performed by Stone himself [1977b], and Bokhari [1981b] cites an unpublished result of Gursky that the four or more processor versions are NP complete. Fernández-Baca [1989] shows that Decision Problem 5.1 is NP complete even if all of the following restrictions hold:
• the range of $f$ is $\{0\}$,
• the range of $c$ is $\{1\}$,
• $n = 3$,
• $G$ is both planar and bipartite.
Furthermore, Fernández-Baca [1989] showed that there can exist no c ap-
proximation algorithms for the problem constrained in the above fashion and 
no exact local search algorithm that takes polynomial time per iteration. This 
rather negative result is balanced by the claim by Sinclair [1987] that the average 
time-complexity is manageable in this problem for a version of exact branch 
and bound search. 
There are also positive results for special classes of process graphs. Bokhari
[1981b] shows an algorithm that, if the process graph is a tree, is guaranteed
to find an exact solution in $O(m^2)$ time. Towsley [1986] applies a dynamic
programming approach to the problem with series-parallel graphs, and shows
an algorithm with time complexity $O(m^3)$. Fernández-Baca [1989] extended the
results to other tree-like graphs. Rao et al. [1979] consider the case of Stone's
original model but where one of the processors is constrained in memory, and 
show two techniques which can reduce the complexity of the problem in some 
cases but not in general. Gusfield [1983] solves the problem with a parametric 
computing technique, for the costs of mapping processes to the two processors 
varying as a function of two independent parameters. 
5.3.3 Adding Constraints to Stone's Model 
There have been a number of direct extensions to Stone's work whereby the 
underlying model has been extended to allow constraints upon the solution, for 
example memory resource constraints. Given the NP completeness of the model 
without constraints, heuristic approaches to the problem of determining optimal 
mappings have been proposed for these extended models. Chu et al. [1980], in 
an early review of the field, consider integer programming approaches in the 
presence of constraints. Gylys and Edwards [1976] describe process clustering 
algorithms which satisfy constraints. Ma et al. [1982] use a branch and bound 
technique to solve a model which includes constraints, including a redundancy 
constraint: certain processes must be allocated to more than one processor. 
Other authors have used constraints and costs to attempt to encourage the 
potential of parallelism between processes in the resulting mapping. Lo [1988] 
considers an alternative to our process based model where if task v is mapped 
to processor p, c(v, p) is dependent upon the elements of (p): that is, there 
is a cost associated with interference between processes. Houstis [1990] (in
the context of real-time systems) adds an explicit parallel processing constraint 
to the model: if two processes can be executed in parallel then they must be 
executed on different processors. 
5.4 Bokhari's 1981 Model 
In cases where computation and communication costs are not incurred sequentially
(i.e. in parallel rather than serial programs) Stone's model is not directly
applicable since communication costs and computation costs are no longer strictly
additive. As an alternative, we might consider algorithms which attempt to 
find minimum communication cost mappings irrespective of computation costs. 
Models of this nature often include considerations of some underlying processor 
architecture. The problem becomes that of mapping an undirected graph of processes
into an undirected graph of processors so that one process is mapped to
each processor and so as to minimise the communication overhead.
To return to our example graph, we now consider the graph to be undirected, 
and the edges to indicate volumes of communications between processes. We 
consider the problem of mapping such processes to a processor graph which, 
for the purposes of our example, we shall assume is a seven processor ring. 
Clearly it is not possible to place all communicating processes on adjacent 
processors: each processor has exactly two adjacent processors whereas process 
D communicates with three processes. The mapping we show in Figure 5-1 is 
optimal in the sense that no interprocess communication is extended by more 
than one interprocessor link - that is it has a dilation of 1. 
To describe these approaches more thoroughly we use a new type of model 
which we refer to as a graph matching model. For a given graph matching model 
we can construct a corresponding process based model. 
Definition 5.12 Graph Matching Model. 
A graph matching model consists of a 4-tuple, say, $(Q, G, e, c')$ where $Q = (P, L)$
is a graph of $n$ processors (the edges correspond to interprocessor links);
$G = (V, E)$ is a graph of $n$ processes (there are as many
processes as there are processors) and an edge $\{u, v\} \in E$ implies processes
$u$ and $v$ communicate with each other. $c' : E \to \mathbb{Z}$ is a function such that
$c'(\{u, v\})$ returns an integer corresponding to the amount of communication that
$u$ performs with $v$ (bidirectionally) during their execution and $e$ is an integer
execution time which is the same for each process.

Figure 5-1: An Example of Bokhari's Mapping Problem and its Solution
Definition 5.13 Process Based Model Corresponding to Graph Matching Model. 
Given an instance (Q = (P, L), G = (V, E), e, c') of a Graph Matching Model 
we can construct a corresponding instance (P, G, f, c) of a process based model 
where for all $\{u, v\} \in E$, for all $p, q \in P$, $p \ne q$,
\[ c(\{u, v\}, p, q) = SH_Q(\{p, q\}) \times c'(\{u, v\}) \]
and for all $u \in V$, $f(u) = e$.
We have the following decision problem for mapping in a graph matching 
model. Note the mapping function is restricted to be a bijection. 
Decision Problem 5.2 Let $k$ be an integer and $A = (Q, G, e, c')$ be an instance of a
graph matching model. Let $B$ be the instance of a process based model which corresponds
to $A$. Does there exist a one-to-one mapping function $g$ for $B$ such that $C(g) \le k$?
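As an illustration of the decision problem, the following minimal sketch evaluates the communication part of the cost of a bijective mapping and searches all bijections by brute force. It assumes that $SH_Q(\{p, q\})$ counts the processors lying strictly between $p$ and $q$ on a shortest route (so adjacent processors contribute zero), that $Q$ is connected, and that only the communication term is counted. The function names and input format are hypothetical, and the exhaustive search is only practical for very small instances.

```python
from collections import deque
from itertools import permutations


def through_routes(processors, links):
    """sh[p][q] = number of intermediate processors on a shortest p-q route."""
    adj = {p: set() for p in processors}
    for p, q in links:
        adj[p].add(q)
        adj[q].add(p)
    sh = {}
    for src in processors:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        sh[src] = {q: max(d - 1, 0) for q, d in dist.items()}
    return sh


def mapping_cost(g, comm, sh):
    """Sum of SH_Q({g(u), g(v)}) * c'({u, v}) over communicating pairs."""
    return sum(sh[g[u]][g[v]] * w for (u, v), w in comm.items())


def find_mapping(processors, links, processes, comm, k):
    """Return a bijection with communication cost <= k, or None."""
    sh = through_routes(processors, links)
    for perm in permutations(processors):
        g = dict(zip(processes, perm))
        if mapping_cost(g, comm, sh) <= k:
            return g
    return None
```

On a four-processor ring, a process graph that is itself a four-cycle can be mapped with cost 0, whereas a star whose centre communicates with three other processes needs cost at least 1, since each processor has only two neighbours.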
Decision Problem 5.2 corresponds to the question: can I allocate processes 
to processors in a given processor topology so as to minimise dilation? Bok-
hari [1981a] points out that in the case where k = 0, Decision Problem 5.2 is a 
notational variant of the graph isomorphism problem. As Bokhari noted the 
complexity status of graph isomorphism is an open problem (see Garey and 
Johnson [1979])¹. It is possible that graph isomorphism affords a polynomial
time algorithm in the types of regular graphs of bounded degree that were 
under consideration by Bokhari. 
5.4.1 Bokhari's Model in Terms of the Framework 
At first sight it is not clear how a model that fails to take into consideration computation
overheads can relate well to the mapping problem for multicomputers.
However, as Fox observed in a survey of parallel applications [Fox 1989], 76% 
of real parallel programs being run were what he considered to be synchronous 
or loosely synchronous. 
These programs would be executed on a multicomputer by regular data 
decomposition with one process per processor; roughly equal amounts of com-
putation per process; and periodic interprocess communications. Moreover, as 
a result of global or pairwise synchronisations between processes the computa-
tion would execute in a roughly lock-step manner. 
¹ The incorrect notion that graph isomorphism is NP complete is endemic in literature
on the mapping problem; see, for example, Kramer and Mühlenbein [1989] and
Pountain [1989].
Table 5-3: A Graph Matching Model in Terms of our Framework
Given the inherent load-balance of such problems the problem of mapping 
such computations to a multicomputer is simply that of allocating processes 
to processors to minimise communications overhead, and it is likely that the 
dilation of the process graph will adequately model the loss in performance due 
to non-local communication. We discuss this issue again in Chapter 6 below. 
It is instructive to consider a graph matching model in terms of our frame-
work. If we assume that all processes start waiting for communications at 
exactly the time they are sent, then the cost of the message, which is dependent 
upon interprocessor distance, makes sense as a communication latency, which 
that process must wait upon, and therefore shows up in $T_{Idle}$ in our framework
as outlined in Table 5-3. 
Another possible way of reconciling the cost of communication in terms of 
our framework is to consider it to be contributing to $T_{Route}$. Unfortunately, this
does not bear up to close scrutiny since in Bokhari's model it is the extension of 
individual messages that is being considered harmful whereas through-routing 
costs are additive on a per-processor basis. The function c does not appor-
tion through-routing computation to a particular processor along the shortest 
route(s), and so the relationship between the through-routing costs and the time 
to completion of the program is undefinable. 
Given the data-parallel rationalisation above, it is perhaps not surprising that 
formulations such as Bokhari's are used in the analysis of algorithms for SIMD 
architectures. In these machines the mapping problem is more constrained and 
has been the subject of significant analysis. An introduction and a significant 
bibliography is to be found in [Johnsson 1990] and also in [Leighton 1992]. The
multicomputer programmer may find such approaches useful in analysing the 
performance of regular data parallel computations, but should be wary of the 





6.1 	Contention in the Multicomputer 
In many multicomputer applications, in particular in those where the structure 
of the computation and the structure of the multicomputer are not similar, it 
may not be appropriate to treat each communication as if it were sent down an 
independent piece of hardware, rather it may be assumed that communications 
sometimes interact with one another to delay each other. This interaction we 
shall refer to as message contention. In this chapter we briefly review the literature
on the subject, and consider the complexity of two problems which, in
very different ways, may be thought of as mapping problems where message 
contention is taken into account. 
First we review the literature which concerns contention in the multicom-
puter rather than the impact of contention on mapping. There are different 
issues in packet switched systems and circuit switched or wormhole systems. 
6.1.1 Packet Switched Systems 
There is a substantial literature on contention in packet switched systems which 
is eloquently summarised in [Leighton 1992] and [Johnsson 1990]. The focus
is the packet routing problem where a number of processors arranged in some 
topology are required to send at most one message, each to a distinct destination, 
and the objective is to minimise the amount of time that this operation takes, or 
the amount of queuing resource that is required. 
As an alternative to the above, it is possible to consider the routing tech-
niques where the route that messages take is chosen at random, in the sense 
that messages are sent to a random processor before being sent to their final 
destination. See, for example [Valiant and Brebner 1981], [Valiant 1982] and
Upfal [1984]. Now, the expected time to completion of a set of messages can 
be shown to be probabilistically bounded and independent of the message set. 
Both best case performance and worst case performance are expected to be 
around half the average case performance for deterministic strategies, and loc-
ality in the processor network is lost. To most multicomputer programmers 
the loss in performance may appear quixotic, but this approach may prove
useful in mapping, for example in the models outlined in Section 2.3. In prac-
tice, however, random routing is not implemented in hardware in commercially 
available multicomputers' routing systems; the overheads of random routing 
implemented in software (which we would associate with $T_{Route}$) could often
outweigh any advantage. 
6.1.2 More Complex Interconnection Systems 
Most of the above is rather academic because the routing strategy and the
hardware topology are rarely under the control of the multicomputer programmer,
and the majority of existing multicomputer architectures based on hypercubes,
for example, use the simple deterministic e-cube routing strategy, which
is known to have very poor worst-case performance [Dally 1990]. In addition,
the majority of currently available multicomputers have wormhole or circuit 
switched interconnects which behave very differently from the simple packet 
switched systems described above. Moreover some newer systems have in-
herent flow control mechanisms which make analysis of contention extremely 
difficult. 
Queueing theory based approaches have been successful to some degree in 
analysing the performance of more complex interconnects. Good examples are 
Dally's analysis of the k-ary n-cube [Dally 1990b], Chittor and Enbody's work on
the iPSC/860 [Chittor and Enbody 1992] and Scott et al. [1992] on the proposed
IEEE standard Scalable Coherent Interconnect [IEEE 1991].
6.2 Contention in a Process Based Model 
In this section we consider the complexity of a process based mapping problem 
in the presence of contention. Recall the graph matching model outlined in 
Chapter 5. The formulation associates cost with the total number of message-
link transfers that are implied by a mapping. Clearly in these models, which do 
not incorporate temporal aspects, we are unable to consider the detailed aspects 
of how messages interact, but since the total number of links in the processor 
graph is fixed, our cost function may be thought of as measuring the total level 
of message traffic contending for interprocessor message links. 
There are some obvious extensions to the graph matching model. First it is 
possible to consider the cost of communication to be the maximum of the sum 
of the message link utilisation levels along the path taken by any message. This
approach is taken by Lee and Aggarwal [1987] and Berman and Snyder [1987].
The second extension is to consider the problem when there are more processes
than processors. Berman and Snyder [1987] extend the analysis to consider
multiprocessing, having contracted the process graph to have the same number
of vertices as the processor graph.
Alternatively, approaches to modelling contention in process based models 
are often couched directly in terms of the corresponding probabilistic model 
(recall Section 5.1), since it can be integrated with probabilistic queuing the-
ory models, and then solved with Markov chain analysis in the same way as 
the process based model is derived. Examples of these approaches are given 
by Houstis [1990] and Cvetanovic [1987] who consider the communications
medium as a saturable bus, whose performance for any communication is determined
by its rate of utilisation, and Kapelnikov et al. [1989] who model multicomputer
networks. The models are complicated and are usually presented
as modelling methodologies rather than being proposed as objective functions 
in mapping problems. 
Below we consider a graph matching model, from the point of view of the 
problem of building a multicomputer whose structure matches the computation. As 
stated in Chapter 5, Bokhari formulated a graph matching model and showed 
the problem reduces to graph isomorphism of which the complexity is open. In 
this section we consider a graph matching based mapping problem whereby the 
processor graph is the variable to be optimized. The aim is to build a multicom-
puter which minimises total message traffic for a given computation modelled 
as a process graph. 
6.3 The Multicomputer Configuration Problem 
Before we state our decision problem, we will first establish two lemmas. Note 
that we are using the definitions of Chapter 5 for the routing load and the shortest 
path in an (undirected) graph. 
Lemma 6.1 Let $G = (V, E)$ be a graph. Let $V' \subseteq V$ be such that $|V'| = a$. Let
$E' \subseteq \{\{t_i, t_j\} : t_i, t_j \in V'\}$. Then, the following inequality holds.
\[ \sum_{v \in V'} \sum_{e \in E'} RO_G(e, v) \le \frac{a^3 - 3a^2 + 2a}{2} \]
Proof 
We note that adding an edge to E' cannot decrease the sum. Thus the left-hand 
side of the inequality will be maximised when G' = (V', E') is a complete graph. 
In this case each vertex $v \in V'$ will have edges to at most $(a - 1)$ vertices in $V'$.
That is there will be at most 	edges to be through-routed. Each such edge 




Figure 6-1: Carbuncle for Degree 3
Figure 6-2: Carbuncle for Degree 4
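We now define a function which, given a degree $\mathcal{D}$, an index $k$ and a vertex
$t$ returns a graph structure with all vertices except $t$ connected by $\mathcal{D}$ edges and
$t$ connected by $\mathcal{D} - 2$ edges. We refer to this graph structure as a Carbuncle.
Given a degree $\mathcal{D}$, a vertex $t$ and an index $k$, let $\chi(t, k, \mathcal{D})$ denote the pair
$(T^k, F^k)$, where $T^k = W^k \cup X^k$ defined as follows. For $\mathcal{D} = 2$, $W^k = X^k = F^k = \{\}$.
For $\mathcal{D} = 3$, $X^k = \{x_1^k, x_2^k, x_3^k, x_4^k\}$, $W^k = \{w_1^k\}$, and
\[ F^k = \left( \{\{x_i^k, x_j^k\} : x_i^k, x_j^k \in X^k, i \ne j\} - \{\{x_1^k, x_2^k\}\} \right) \cup \{\{x_1^k, w_1^k\}, \{x_2^k, w_1^k\}, \{w_1^k, t\}\}. \]
For $\mathcal{D} \ge 4$, $W^k = \{w_1^k, w_2^k, \ldots, w_{\mathcal{D}-2}^k\}$, $X^k = \{x_1^k, x_2^k, \ldots, x_{\mathcal{D}-1}^k\}$ and
\[ F^k = \{\{t, w_i^k\} : w_i^k \in W^k\} \cup \{\{w_i^k, x_j^k\} : w_i^k \in W^k, x_j^k \in X^k\} \cup \{\{x_i^k, x_{i+1}^k\} : i = 1, 2, \ldots, \mathcal{D} - 2\} \cup \{\{x_{\mathcal{D}-1}^k, x_1^k\}\}. \]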
Figures 6-1 and 6-2 respectively show the graphs corresponding to this
definition when $\mathcal{D} = 3$ and $\mathcal{D} = 4$. In the case where $\mathcal{D} = 2$, $W^k$ and $F^k$ are
empty sets. In the case where $\mathcal{D} = 3$, we note that the elements of $X^k \cup W^k$ are
arranged in a cycle, and $w_1^k$ is adjacent to $t$. In the case where $\mathcal{D} \ge 4$, we note
that the elements of $X^k \cup W^k \cup \{t\}$ are arranged in a cycle. Thus we conclude
that for all $\mathcal{D} \ge 2$, the graph $\chi(t, k, \mathcal{D}) = (T^k \cup \{t\}, F^k)$ is connected.
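The degree property of the carbuncle can be checked mechanically. The sketch below is one reading of the definition, under the following assumptions: for degree 3 the carbuncle is the complete graph on four x-vertices with the edge between the first two removed, both of which are joined to a single w-vertex adjacent to t; for degree D of 4 or more there are D - 2 w-vertices, each adjacent to t and to every one of D - 1 x-vertices, and the x-vertices form a cycle. The vertex labels and function names are hypothetical.

```python
def carbuncle(t, k, degree):
    """Return (vertices, edges) of a carbuncle attached to vertex t.

    The returned vertex set excludes t itself; edges are frozensets of
    two endpoints. Labels such as ("x", k, i) are illustrative only.
    """
    if degree < 2:
        raise ValueError("degree must be at least 2")
    if degree == 2:
        return set(), set()
    if degree == 3:
        xs = [("x", k, i) for i in range(1, 5)]
        w = ("w", k, 1)
        # Complete graph on the four x-vertices, minus the edge {x1, x2}.
        edges = {frozenset((a, b)) for i, a in enumerate(xs) for b in xs[i + 1:]}
        edges.discard(frozenset((xs[0], xs[1])))
        edges |= {frozenset((xs[0], w)), frozenset((xs[1], w)), frozenset((w, t))}
        return set(xs) | {w}, edges
    ws = [("w", k, i) for i in range(1, degree - 1)]  # degree - 2 vertices
    xs = [("x", k, i) for i in range(1, degree)]      # degree - 1 vertices
    edges = {frozenset((t, w)) for w in ws}
    edges |= {frozenset((w, x)) for w in ws for x in xs}
    edges |= {frozenset((xs[i], xs[(i + 1) % len(xs)])) for i in range(len(xs))}
    return set(ws) | set(xs), edges


def degrees(edges):
    """Count the number of edges incident to each vertex."""
    deg = {}
    for edge in edges:
        for v in edge:
            deg[v] = deg.get(v, 0) + 1
    return deg
```

Under this reading every carbuncle vertex has exactly D incident edges and t has D - 2, which leaves t two spare links for the ring or chain used in the lemmas that follow.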
We now show that if a graph consisting of carbuncles of degree V is to have 
degree V then the carbuncles must be connected in a ring or a chain. 
Lemma 6.2 Let $\mathcal{D} \ge 2$. Let $T$ be a set of vertices. Let $t_1, \ldots, t_{|T|}$ be an enumeration of
$T$ and for each $t_i \in T$, let $\chi(t_i, i, \mathcal{D}) = (T^i, F^i)$. Any graph $G = (V, L)$ with degree
$\le \mathcal{D}$ where $V = \bigcup_i T^i \cup T$ and $\bigcup_i F^i \subseteq L$ is connected if and only if the subgraph $G'$ of $G$
with vertex set $T$ and edge set $L' = L - \bigcup_i F^i$ is a chain or a ring.
Proof 
[IF] We noted above that each carbuncle $(T^i \cup \{t_i\}, F^i)$ is a connected subgraph
of degree $\mathcal{D}$. Thus the graph $G$ is connected if it contains a connected subgraph
with vertex set $T$. We note that each vertex $t_i \in T$ is adjacent to $\mathcal{D} - 2$ carbuncle
edges. If these vertices are connected in a subgraph $L'$ which is a ring or a
chain, then every vertex in $T$ is adjacent to at most 2 edges in $L'$, and thus the
degree of $G$ is $\mathcal{D}$. Furthermore all carbuncles are connected, and we conclude $G$
is connected.
[ONLY IF] Now for any carbuncle $(T^i \cup \{t_i\}, F^i)$, all vertices in $T^i$ have degree $\mathcal{D}$
and therefore in any connected graph $G = (V, L)$ with degree $\le \mathcal{D}$ there cannot
be an edge in $L' = L - \bigcup_{i=1}^{|T|} F^i$ which is incident to a vertex in $T^i$. Thus for all the
carbuncles to be in a single connected component, the subgraph $G'$ of $G$ with
vertex set $T$ and edge set $L' = L - \bigcup_{i=1}^{|T|} F^i$ must be connected. In this case, since
for each $t_i \in T$ the degree of $t_i$ is bounded by $\mathcal{D}$ and $t_i$ is incident to $\mathcal{D} - 2$ edges
in $F^i$, $t_i$ can be adjacent to at most 2 other vertices in $T$. Thus we conclude that
$G'$ is either a ring or a chain. □
Lemma 6.3 Let $\mathcal{D} \ge 2$. Let $T$ be a set of vertices. Let $t_1, \ldots, t_{|T|}$ be an enumeration
of $T$ and for each $t_i \in T$, let $\chi(t_i, i, \mathcal{D}) = (T^i, F^i)$. Let $G = (V, L)$ be a graph with
degree $\le \mathcal{D}$ where $V = \bigcup_i T^i \cup T$ and $\bigcup_i F^i \subseteq L$. For all $t_i, t_j \in T$, $i \ne j$, for all
$v \in \bigcup_{i=1}^{|T|} T^i$, $RO_G(\{t_i, t_j\}, v) = 0$.
Proof 
Let us assume that for some pair of vertices, say $t_i, t_j \in T$, $i \ne j$, for some $t_l \in T$,
for some vertex, say $v \in T^l$, $RO_G(\{t_i, t_j\}, v) \ne 0$. Thus $v \in SH_G(\{t_i, t_j\})$. Now,
by Lemma 6.2, the vertices of $T$ are arranged in a ring, thus $t_l \in SH_G(\{t_i, v\})$
and $t_l \in SH_G(\{v, t_j\})$. This implies $t_l$ is traversed twice in the shortest path
from $t_i$ to $t_j$ which is in contradiction to our definition of shortest path. Thus
we conclude $v \notin SH_G(\{t_i, t_j\})$ and the lemma is proved. □
The following facilities location problem is known to be NP complete [Garey
et al. 1976].
Decision Problem 6.1 SOLA 
Let $G = (V, E)$ be a graph and $\mathcal{K}$ be a positive integer. Does there exist a one-to-one
function, $f : V \to \{1, 2, \ldots, |V|\}$, such that $\sum_{\{u,v\} \in E} |f(u) - f(v)| \le \mathcal{K}$?
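To make the SOLA problem concrete, the following minimal sketch computes the cost of a linear arrangement and decides a small instance by exhaustive search over all one-to-one functions. The function names are hypothetical, and the search is exponential in the number of vertices, so it serves only to illustrate the problem statement.

```python
from itertools import permutations


def arrangement_cost(edges, f):
    """Sum of |f(u) - f(v)| over all edges {u, v}."""
    return sum(abs(f[u] - f[v]) for u, v in edges)


def sola(vertices, edges, bound):
    """Return a one-to-one f: V -> {1, ..., |V|} with cost <= bound, or None."""
    for perm in permutations(range(1, len(vertices) + 1)):
        f = dict(zip(vertices, perm))
        if arrangement_cost(edges, f) <= bound:
            return f
    return None
```

For a path on four vertices the smallest achievable cost is 3 (each edge has length 1), while for a cycle on four vertices it is 6.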
We wish to show the NP completeness of the following problem. 
Decision Problem 6.2 MCP 
Let $J = (T, F)$ be a graph, $\mathcal{D} \ge 2$ be an integer degree, $\mathcal{B}$ be a non-negative integer
bound and $c : F \to \mathbb{Z}$, representing the volume of communication between a pair of
tasks, say $t_i$ and $t_j$, for each $\{t_i, t_j\} \in F$. Does there exist a graph, say $H = (T, L)$,
such that each vertex in $T$ has degree $\mathcal{D}$ and such that $\sum_{e \in F} SH_H(e) \times c(e) \le \mathcal{B}$?
Fixing V = 4, we have the TCP: Transputer Configuration Problem, since 
transputers have just 4 physical links per processor [INMOS Limited, 1988].
We define the following polynomial time transformation from SOLA to TCP. 
Definition 6.1 SOLA-to-TCP Encoding.
Let $G = (V, E)$ and $\mathcal{K}$ be an instance of SOLA. We construct an instance, $J = (T, F)$,
$\mathcal{B}$ and $c$, of the problem TCP as follows.
Let $d = |V|$. Let $\mathcal{B} = \mathcal{K} - |E| + d^2 - d$. Let $T_{pad}$ be a set of $d^3$ new vertices and $s$
be a new vertex. Let $\lambda = d^3 + d$. Let $T' = V \cup T_{pad} \cup \{s\}$. Let $\{t_0, t_1, \ldots, t_i, \ldots, t_\lambda\}$
be an enumeration of $T'$ such that $t_0 = s$; for $i = 1, 2, \ldots, d$, $t_i \in V$, and for
$i = d + 1, d + 2, \ldots, \lambda$, $t_i \in T_{pad}$. We note that $|T'| = d^3 + d + 1$, which is always
an odd number.
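Let $F' = F_{pad} \cup F_\alpha \cup F_\beta \cup E$ where $F_{pad} = \{\{t_i, t_{i+1}\} : i = d, \ldots, \lambda - 1\}$,
$F_\alpha = \{\{t_0, t_i\} : i = 1, \ldots, d\}$ and $F_\beta = \{\{t_0, t_i\} : i = d^3 + 1, \ldots, \lambda\}$.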
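For $i = 0, \ldots, \lambda$ let $(T^i, F^i) = \chi(t_i, i, 4)$. Let
$T^* = \bigcup_{i=0}^{\lambda} T^i$. Let $T = T^* \cup T'$,
$F^* = \bigcup_{i=0}^{\lambda} F^i$ and $F = F^* \cup F'$.
The function, $c$, is defined as follows: for each $e \in F'$, $c(e) = 1$, and for each $e \in F^*$,
$c(e) = \mathcal{B} + 1$.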
We now prove the following theorem. 
Theorem 6.1 The Transputer Configuration Problem is NP complete.
Proof 
First, we prove that TCP is in NP. In order to compute the sum, the non-deterministic
Turing machine must find a minimum-length route between each
pair of vertices, say $u$ and $v$, such that $\{u, v\} \in F$. This can be done in $O(|T|^3)$
time [Aho et al., 1983]. The other operations are simple arithmetic. Thus TCP
is in NP.
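The polynomial-time check of a proposed processor graph can be sketched as follows. We assume, consistently with the use of SH later in the proof, that $SH_H(e)$ is the number of intermediate vertices on a shortest route (shortest-path length minus one), and we compute all shortest paths with the Floyd-Warshall algorithm, which is one standard way of meeting the cubic bound. The function name and argument format are hypothetical, and the links are assumed to form a simple graph.

```python
def tcp_certificate_ok(tasks, task_edges, c, links, degree, bound):
    """Check a candidate processor graph H = (tasks, links).

    task_edges: iterable of (u, v) pairs, the edge set F
    c:          {(u, v): communication volume} for each pair in task_edges
    Returns True if every vertex has the given degree and
    sum(SH_H(e) * c(e)) <= bound.
    """
    tasks = list(tasks)
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in tasks} for u in tasks}
    deg = {u: 0 for u in tasks}
    for u, v in links:
        dist[u][v] = dist[v][u] = 1
        deg[u] += 1
        deg[v] += 1
    if any(d != degree for d in deg.values()):
        return False
    # Floyd-Warshall all-pairs shortest paths, O(|T|^3).
    for m in tasks:
        for u in tasks:
            for v in tasks:
                if dist[u][m] + dist[m][v] < dist[u][v]:
                    dist[u][v] = dist[u][m] + dist[m][v]
    total = 0
    for u, v in task_edges:
        if dist[u][v] == inf:
            return False  # the pair cannot communicate at all
        total += (dist[u][v] - 1) * c[(u, v)]
    return total <= bound
```

For instance, on a five-vertex ring with degree 2, a task edge between neighbours contributes nothing, while a task edge of volume 3 between vertices two links apart contributes 3.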
Let the graph $G = (V, E)$ and the positive integer, $\mathcal{K}$, be an instance of SOLA.
The SOLA-to-TCP encoding method, described above, produces an instance, 
J = (T, F), B and c, of TCP in an amount of time that is polynomial in the size 
of the instance of SOLA. 





if and only if there exists a graph $H = (T, L)$ such that each $t \in T$ has degree 4
in $H$ and
\[ \sum_{e \in F} SH_H(e) \times c(e) \le \mathcal{B}. \]
[ONLY IF] Suppose in our instance $G = (V, E)$ of SOLA there exists a one-to-one
function $f$ such that,
\[ \sum_{\{u,v\} \in E} |f(u) - f(v)| \le \mathcal{K}. \]
Our solution to the constructed instance $J = (T, F)$, $\mathcal{B}$, $c$ of TCP is the graph
$H = (T, L)$ with $L$ constructed as follows. We add an edge for each edge in
$F^*$ and $F_{pad}$, one between $t_0$ and $t_1$, one between $t_\lambda$ and $t_0$ and, recalling that
$t_1, t_2, \ldots, t_d$ is an enumeration of $V$, one between each pair of tasks $t_i, t_j \in V$
where $f(t_i) - f(t_j) = 1$. That is
\[ L = \{\{f^{-1}(i), f^{-1}(i + 1)\} : i = 1, \ldots, d - 1\} \cup \{\{t_0, t_1\}, \{t_\lambda, t_0\}\} \cup F^* \cup F_{pad}. \]
We observe that the inclusion of all edges in $F^*$ and the edges that form the
vertices in $T'$ into a ring, means that each of the vertices in $H$ has degree 4.
For all edges $\{v, w\} \in F^*$ the edge $\{v, w\} \in L$ forms the unique shortest path
between $v$ and $w$ in $H$. For all edges $\{v, w\} \in F - F^*$, $v, w \in T'$; that is the
vertices of $T'$ are in the ring. By Lemma 6.3, any shortest route between the
elements of $T'$ would include only vertices in $T'$. Now since $|T'|$ is odd there is
a unique shortest path around the ring between any pair of vertices in $T'$. Thus
we conclude there is a unique shortest path in $H$ between the vertices of any
edge in $F$.
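For every edge $e \in F^* \cup F_{pad}$, we have $e \in L$ and so, $SH_H(e) = 0$. Thus
\[ \sum_{e \in F^* \cup F_{pad}} SH_H(e) \times c(e) = 0. \]
The edges in $F_\alpha$ and $F_\beta$ are through-routed in the edges of $L$ that form a ring,
thus
\[ \sum_{e \in F_\alpha} SH_H(e) \times c(e) = \sum_{e \in F_\beta} SH_H(e) \times c(e) = (d^2 - d)/2. \]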
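The value $(d^2 - d)/2$ can be confirmed by direct counting. The sketch below assumes the ring visits the vertices of $T'$ in enumeration order, so that the $d$ vertices of $V$ occupy positions $1, \ldots, d$ clockwise from $t_0$ and the last $d$ padding vertices occupy the $d$ positions immediately anticlockwise from $t_0$; it then sums the number of intermediate ring vertices for the edges joining $t_0$ to each group. The function names are hypothetical.

```python
def ring_through_routes(n, i, j):
    """Intermediate vertices on the shorter way round an n-ring from i to j."""
    gap = abs(i - j) % n
    return min(gap, n - gap) - 1


def star_costs(d):
    """Through-routing totals for t_0's edges to V and to the last d pad vertices."""
    n = d ** 3 + d + 1  # |T'|
    to_v = sum(ring_through_routes(n, 0, i) for i in range(1, d + 1))
    to_pad = sum(ring_through_routes(n, 0, i)
                 for i in range(d ** 3 + 1, d ** 3 + d + 1))
    return to_v, to_pad
```

In each group the vertices lie at ring distances $1, 2, \ldots, d$ from $t_0$, so the through-routing counts are $0, 1, \ldots, d - 1$, which sum to $(d^2 - d)/2$.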
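For all edges $\{u, v\} \in E$, since $SH_H(\{u, v\}) \le |f(u) - f(v)| - 1$ and $c(\{u, v\}) = 1$, we
have
\[ \sum_{\{u,v\} \in E} SH_H(\{u, v\}) \times c(\{u, v\}) \le \mathcal{K} - |E|. \]
Thus
\[ \sum_{e \in F} SH_H(e) \times c(e) \le \mathcal{K} - |E| + d^2 - d = \mathcal{B}. \]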
[IF] Suppose there exists a graph $H = (T, L)$ such that each vertex in $T$ has
degree 4 and
\[ \sum_{\{u,v\} \in F} SH_H(\{u, v\}) \times c(\{u, v\}) \le \mathcal{B}. \]
Note that, in any solution to TCP which achieves the bound, if there existed an
edge in $F^*$ which was not in $L$, it would have to be routed through at least one
intermediate processor, thus exceeding the bound. Thus we may assume that
all edges $F^* \subseteq L$, and that each edge $\{u, v\} \in F^*$ forms the unique shortest route
between $u$ and $v$. Thus
\[ \sum_{v \in T, e \in F^*} RO_H(e, v) = 0. \qquad (6.1) \]
Note that $J$ is connected since the vertices in $T_{pad} \cup \{t_0\}$ are connected in $J$
in a chain; and every vertex in $V$ is connected in $J$ to $t_0$. Thus in order for $H$
to successfully route the messages in $J$, $H$ must be connected. By Lemma 6.2,
in order for $H$ to be connected, $L - F^*$ must be a ring or a chain. The chain
may be connected into a ring by the addition of an edge and that addition can
never extend routes. We conclude, therefore, if there exists a solution to our
constructed instance of TCP which achieves the bound, there exists a solution
which achieves the bound where the vertices in $T'$ are connected in a ring.
Thus we shall assume that in $H$ the vertices of $T'$ are connected in a ring.
Note that since there are an odd number of vertices in $T'$ there is a unique
shortest route between any two vertices in $T'$. Furthermore by Lemma 6.3, the
shortest route between any pair of vertices in $T'$ contains no vertices in $T^*$. Thus
\[ \sum_{v \in T^*, e \in F'} RO_H(e, v) = 0. \qquad (6.2) \]
By combining Equations (6.2) and (6.1) we have that in any graph $H$ which
achieves the bound where the vertices of $T'$ are connected in a ring,
\[ \sum_{v \in T, e \in F} RO_H(e, v) = \sum_{v \in T', e \in F'} RO_H(e, v). \qquad (6.3) \]
Now let $\langle t_{a_1}, \ldots \rangle$ denote the sequence of tasks in $V$ clockwise in
the ring $H$ from $t_0$. Consider a graph $H' = (T', L')$, where
$L' = \{\{t_0, t_{a_1}\}\} \cup \{\{t_{a_i}, \ldots\} : i = 1, \ldots\} \cup \{\{t_{a_\lambda}, t_0\}\} \cup \ldots$
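We show that if $H$ achieves the bound then so does $H'$. That is
\[ \sum_{v \in T', e \in F'} RO_{H'}(e, v) \le \sum_{v \in T', e \in F'} RO_H(e, v). \qquad (6.4) \]
We note the following.
\[ \sum_{v \in T', e \in F_{pad}} RO_{H'}(e, v) = 0 \le \sum_{v \in T', e \in F_{pad}} RO_H(e, v). \qquad (6.5) \]
\[ \sum_{v \in T', e \in F_\alpha} RO_{H'}(e, v) = (d^2 - d)/2 \le \sum_{v \in T', e \in F_\alpha} RO_H(e, v). \qquad (6.6) \]
\[ \sum_{v \in T', e \in F_\beta} RO_{H'}(e, v) = (d^2 - d)/2 \le \sum_{v \in T', e \in F_\beta} RO_H(e, v). \qquad (6.7) \]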
We divide this proof into two cases. 
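CASE I: Suppose there exists $w \in T_{pad}$, such that,
\[ \forall e \in E, \; RO_H(e, w) = 0. \qquad (6.8) \]
By combining Equations (6.5), (6.6) and (6.7),
\[ \sum_{v \in T', e \in F' - E} RO_{H'}(e, v) \le \sum_{v \in T', e \in F' - E} RO_H(e, v). \]
Thus it remains to be proven that:
\[ \sum_{v \in T', e \in E} RO_{H'}(e, v) \le \sum_{v \in T', e \in E} RO_H(e, v). \qquad (6.9) \]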
Suppose, as a hypothesis to be proven contradictory, that Inequality (6.9)
above does not hold. Thus, for some edge, say $\{t_{a_l}, t_{a_m}\} \in E$, there exists
$t_{a_n} \in T'$, such that $RO_H(\{t_{a_l}, t_{a_m}\}, t_{a_n}) = 0$; that is, $\{t_{a_l}, t_{a_m}\}$ is not routed through
$t_{a_n}$ in $H$, although it is routed through $t_{a_n}$ in $H'$. Without loss of generality, let
$l < m$. We sub-divide this case into two sub-cases, in each sub-case obtaining a
contradiction.
SUB-CASE 1.1: $l < n < m$. Since $\{t_{a_l}, t_{a_m}\}$ is not routed through $t_{a_n}$
in $H$, it must be routed anticlockwise from $t_{a_l}$ to $t_{a_m}$. But in that case, $w$ lies on
the shorter route between $t_{a_l}$ and $t_{a_m}$, so $RO_H(\{t_{a_l}, t_{a_m}\}, w) = 1$, contradicting
Equation (6.8). Thus Inequality (6.9) holds.
SUB-CASE 1.2: $n < l$ or $m < n$. It is given that in $H'$ $t_{a_n}$ is on the unique
shortest route between $t_{a_l}$ and $t_{a_m}$. That route is anticlockwise from $t_{a_l}$ to $t_{a_m}$,
so it has a length no less than $d^3 - 1$. But the route clockwise from $t_{a_l}$ to $t_{a_m}$
has a length no greater than $d$ and would therefore be shorter, which leads to a
contradiction. Thus Inequality (6.9) holds.
Thus we conclude that in either subcase Inequality (6.9) holds and so In-
equality (6.4) holds in general in case I. 
CASE II: Suppose that, for all $w \in T_{pad}$, $\exists e \in E$, such that $RO_H(e, w) = 1$.
Thus
\[ \sum_{v \in T_{pad}, e \in E} RO_H(e, v) \ge |T_{pad}| = d^3. \]
By Lemma 6.1, we have that
\[ \sum_{v \in V, e \in E} RO_{H'}(e, v) \le (d^3 - 3d^2 + 2d)/2, \]
and by combining Equations (6.5), (6.6) and (6.7),
\[ \sum_{v \in T', e \in F'} RO_{H'}(e, v) \le (d^3 - d^2 + 2d)/2 \le d^3 \le \sum_{v \in T', e \in F'} RO_H(e, v) \]
and Inequality (6.4) holds.
Thus we conclude that in general Inequality (6.4) holds. This means that for
our constructed graph $H'$, by Equation (6.3)
\[ \sum_{v \in T, e \in F} RO_{H'}(e, v) \le \mathcal{B} = \mathcal{K} - |E| + d^2 - d. \]
That is, by Equations (6.10) and (6.3), and by the definition of RO and SH,
\[ \sum_{e \in E} SH_{H'}(e) = \sum_{v \in V, e \in E} RO_{H'}(e, v) \le \mathcal{K} - |E|. \]
Let the function $f$ be defined as: $f(t_{a_i}) = i$, $i = 1, \ldots, d$. Now
\[ |f(t_{a_i}) - f(t_{a_j})| = |i - j| \le SH_{H'}(\{t_{a_i}, t_{a_j}\}) + 1, \]
so
\[ \sum_{\{u,v\} \in E} |f(u) - f(v)| \le \sum_{e \in E} SH_{H'}(e) + |E|. \]
Thus $f$ is the required allocation function for the instance of SOLA. □
Note that in our reduction in any instance of the constructed TCP problem 
that achieves the bound, there is a unique shortest route in the processor graph 
between any pair of vertices that communicate in the task graph. It should be 
clear that our result generalises to a version of the problem where a unique path 
is chosen among non-unique routes, or where message traffic is split between 
non-unique routes. Note that Lemma 6.2 applies for any $\mathcal{D} \ge 2$, and thus
in general the bounded-degree MCP problem is NP complete. Note also that
in the case where $\mathcal{D} = 2$, the carbuncles are empty and so every edge in the
constructed graph has a label of 1. We refer to the above problem as "simple"
MCP in the spirit of [Garey et al., 1976]. We have therefore shown that simple
degree-2 MCP is NP complete. The complexity of simple general-degree MCP 
is open. 
6.4 Contention in a Scheduling Model 
The above complexity result refers to a process based model where the cost of a 
message transfer is assumed to be independent of the exact timing of commu-
nication events. A similar approach was taken by El-Rewini and Lewis [1990] 
for scheduling models, but in these models we are able to capture the timing of 
communication events and the way in which they interact. 
One example of this approach is the work of Mak and Lundstrom [1990]. 
They consider the timing of communications events that results from a partic-
ular schedule and the corresponding demands that are made upon an inter-
processor communications network. These demands are fed into a queuing 
network model which is then solved to give the expected latencies of the vari-
ous communications which are fed back into the scheduling model to alter the 
timings of communications events. The algorithm is applied iteratively until a 
predefined convergence property is satisfied. 
In this section we present complexity results for an explicit model of com-
munication contention which is not based upon queuing theory. We consider 
that the interprocessor communications medium is a "processor" and that pre-
cedence arcs give rise to communications "tasks" that must be scheduled on the 
interprocessor communications medium. We shall, however, make an import-
ant assumption, that the interprocessor communications cannot be buffered. 
That is, if a task executed on one processor has an outgoing precedence arc to 
a task executed on another processor, the first processor must remain idle fol-
lowing the first task's execution until a task corresponding to the interprocessor 
communication is executed by the communications medium. 
Intuitively our model corresponds to a communication system with no buffering
at the sender, such as Meiko's CS-Tools [Meiko 1992] in synchronous
transfer mode with blocking send and non-blocking receive.
6.4.1 Relevant Complexity Results 
Our complexity result is for UET tasks in a two processor system with a single 
bidirectional communications link between the processors. As mentioned in 
Section 4.1, the complexity of two processor scheduling is open in UET models 
with General Delay, Uniform Delay, Fixed Delay and Unit Delay. There is a polynomial
time algorithm for the problem in UET No Delay models [Coffman and
Graham 1972].
Since we are effectively introducing an extra processor, corresponding to 
the interprocessor communications medium, three processor scheduling com-
plexity results are also of interest. The complexity of scheduling is open in 
all delay-variants upon 3 processor UET scheduling models. Our final related 
result is that by Berger and Cowen [1991] who show the NP completeness of
scheduling in 3 processor models where UET tasks may have to be executed 
simultaneously on two processors. 
6.4.2 The Complexity of Scheduling with Contention 
Definition 6.2 Two-Processor UET Contention Model and Schedule. 
Given an instance of a unit delay UET scheduling model, $(P, (\Gamma, \Delta), f, d)$, where
$|P| = 2$, an interprocessor communications link $\pi$
and a communication event
$\kappa$, we can construct a corresponding instance, say $B$, of a 2 processor UET
contention model. Let $B = (P \cup \{\pi\}, (\Gamma \cup \{\kappa\}, \Delta), \pi, \kappa)$. For all $\gamma \in \Gamma \cup \{\kappa\}$, let
$f'(\gamma) = 1$. A schedule for $B$ is a schedule of $(P \cup \{\pi\}, (\Gamma \cup \{\kappa\}, \Delta), f', d)$.
The definition of what constitutes a valid schedule in such a model is slightly 
different from that in scheduling models outlined earlier. We refer to it as
contentionally valid to avoid confusion, and define it more formally below. Note that
one of the conditions of contentional validity is that the subset of the schedule 
involving only processors (rather than interconnect) is valid with respect to a 
corresponding unit delay scheduling model. The other conditions of validity 
are that the schedule as a whole is not overbooked, that computational tasks 
are not scheduled for execution on the interconnect, and that for every preced-
ence arc between tasks mapped to different processors there is a corresponding 
interprocessor communication event in the schedule. 
Definition 6.3 Contentionally Valid Schedule. 
Let B = (F', (F', ), ir, ic), be a 2 processor UET contention model. Let I 
be a 
function with domain F' - {K} x P - {ir} and range {1}. Let d be a function 
with domain Y -  {ir} and range {1}. Let A = (F' - fr}, (F' - {K}, z), f, d). 
Let 
s be a schedule for B. Let s' = X(s, IF, P, Z). Let s = X(s, F, 
17r 1, Z) 
$s$ is contentionally valid if all of the following are true.
• $s'$ is valid for $A$.
• $s$ is not overbooked.
• there exists a valid tuple dependence graph $(s', C)$ for $s'$ such that for
all interprocessor dependences $((\gamma, p, t), (\delta, q, t')) \in C$ there exists a tuple
$(\kappa, \pi, t+1) \in s$.
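The third condition can be illustrated with a minimal sketch, which is not part of the original formulation. It assumes a schedule is a set of (task, processor, time) tuples and a tuple dependence is a pair of such tuples; the names `LINK` and `EVENT` are hypothetical stand-ins for the link and the communication event.

```python
LINK = "pi"      # hypothetical name for the interprocessor link
EVENT = "kappa"  # hypothetical name for the communication event

def link_events_present(schedule, dependences):
    """Check that every interprocessor dependence ((g, p, t), (h, q, t2))
    is matched by a tuple (EVENT, LINK, t + 1) in the schedule."""
    link_times = {t for (task, proc, t) in schedule
                  if task == EVENT and proc == LINK}
    return all(t + 1 in link_times
               for (g, p, t), (h, q, t2) in dependences
               if p != q)

# Two UET tasks on different processors, with one link event between them.
schedule = {("a", "p1", 0), (EVENT, LINK, 1), ("b", "p2", 2)}
dependences = [(("a", "p1", 0), ("b", "p2", 2))]
ok = link_events_present(schedule, dependences)
missing = link_events_present(schedule - {(EVENT, LINK, 1)}, dependences)
```

The sketch checks only the link-event condition; validity of the processor-only schedule and the overbooking condition would need separate checks.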
We show the NP completeness of the following problem by reduction from 
problem (P5) in [Ullman 1975] which is stated above as Decision Problem 3.4. 
Our complexity proof, like that of Theorem 3.7 above, follows the outline of 
Ullman's proof of the complexity of his problem (P3). 
Decision Problem 6.3 Let $B = (P', A' = (\Gamma', \Delta'), \pi, \kappa)$ be a 2 processor UET contentional
model. Let $n'$ be a positive integer. Does there exist a valid contentional schedule
$s'$ for $B$ such that $M(s') = n'$?
Theorem 6.2 Decision Problem 6.3 is NP complete. 
Proof 
Suppose we are given an instance of Decision Problem 3.4 as above. We con-
struct an instance of Decision Problem 6.3 in the way outlined in the formulae 
below. To clarify the notation, the tasks ofEE and the edges of L4 form one chain, 
while the tasks of T and the edges of Lh form another. 	is a set of edges which 
connect the chains as indicated in Figure 6-3. In addition we have the tasks of 
F, and for each task in F an edge in z to a corresponding task in F. In A', for 
every edge (of, 9) e A we have an edge from c to the vertex corresponding to 
in F. Finally, every vertex in F*  has an edge to the last vertex in the chain 
formed by the elements of 
, 
P/ = tPi,P, 7r I 
. n' = 8nk 
. A' = (F', z') where 
. FF =U T U F U F* U {,c}where 
_F* = {y: i = O,...,nk_1}.  
z'= 
- 	= {(h,i,j, a) : h,i,j e - {E3,2k_1,fl_1}, if h < 3 then a = 
else if i < 2k - 1 then a = ' O,i+1,j else a = O,O,j+1} 
- L4 = {(T h,I, , a) : 	E T— {T2,2k _l, _l }, if h <2 then a = 
else if i < 2k - 1 then a = T0, +1,3  else a = 
- LV3 = 	 : i = 0,... 1 2k - 1,j = 0,... ,n - 1}U 
i =0,...,2k— 1,j =0,...,n-2, 
if j <2k - 1 then a = 	3 = 1,i+1,j 
else if j < n - 1 then a 0,O,j+1' = 
else we have no edge} 
- 	{(yy) i =O,...,nk— 1}. 
- 	={('y: (-y,'yj) e 
- 	{(6. 	y, 3,2k-1,n-1) : i = 01 ... , nk - 11. 
First observe that $|\Gamma'| = 16kn$, and since our deadline is $8kn$ and we have two
processors, any valid schedule is a perfect schedule with a single tuple instancing
each task in $\Gamma'$.
Next note that because of the edges z4 one processor must be devoted to 
processing an element of at each time unit if the time limit is to be met, thus 
the tasks of in the order indicated by the precedence relation A4 form the 
critical path of the computation. Moreover, the same processor must process all 
elements of E since no communication latency can be allowed to occur between 
the execution of the elements of the critical path if the computation is to finish by 
the deadline. Without loss of generality let us assume that it is processor Pi  that is 
computing the tasks in E, and therefore in any valid schedule s of our instance of 
Decision Problem 6.3, for all Eh,ij E , X(s, 	P, Z) = {(h,1,,Pi, th, 7 )}, 
whereth,,3 = h+4i+8kj. and for all-y e F'—;X(s,{-y},P,Z)  
Figure 6-3: Partial Gantt Chart for the Encoding in Decision Problem 6.3
The tasks in $T$ must be executed at very specific times as shown in Figure 6-3.
Note that tasks $T_{0,i,j}$, $i = 0, \ldots, 2k-1$, $j = 0, \ldots, n-1$ may, if we only
consider the precedence relation between tasks in $E \cup T$, be executed in either
the time step shown or the subsequent time step. Note that, progressing in
time, we have on processor $p_2$ an alternation of breaks in which there are $k$
available time slots where it is not possible for a task to communicate in the
subsequent step, and bands in which there are $k$ available time slots where it
is possible for a task to communicate in the subsequent step. Note that the
$kn$ tasks in $\Gamma^*$ must communicate to task $E_{3,2k-1,n-1}$ which is scheduled on
processor $p_1$. Now since we must never have an idle time slot on any processor,
the tasks of $\Gamma^*$ must be executed in the $kn$ available slots in bands, and tasks
$T_{0,i,j}$, $i = k, \ldots, 2k-1$, $j = 0, \ldots, n-1$, must be executed at the times shown.
This also means the $kn$ tasks in $\Gamma$ must be executed in the $kn$ available slots in
breaks.
As a consequence, if tasks $\gamma, \delta \in \Gamma$ are both executed in the same break it is
not possible for there to exist an edge $(\gamma, \delta)$ in $\Delta$. For if so, by our construction
there would exist a task $\alpha \in \Gamma^*$ such that there exist edges $(\gamma, \alpha)$ and $(\alpha, \delta)$ in $\Delta'$
and $\alpha$ would have to be executed in that same break, violating what we have
just concluded. Thus if our instance of Decision Problem 6.3 has a solution, we
can find a solution to the original instance of Decision Problem 3.4 by executing
at time unit $i$ exactly those jobs executed in the $i$th break.
Conversely if we have a solution to the given instance of Decision Problem 3.4
we can find a solution of the constructed instance of Decision Problem 6.3 by
executing $\gamma_i^*$ in band $t$ and $\gamma_i$ in break $t$ whenever $\gamma_i$ is executed at time unit $t$ in
the solution to Decision Problem 3.4. □
Note that the above proof establishes that the problem remains NP complete 
even if recomputation is allowed, and even if a single interconnect event is 
sufficient for any number of outgoing edges from a task. Note that the reduction
no longer works in a case corresponding to non-blocking communication where 





In this chapter we consider models in which communication costs and
delays are integrated, at least to some extent. As well as surveying the field, 
one of the aims of this chapter is to present models that will be used later to 
analyse a particular mapping problem. We describe models by Papadimitriou 
and Ullman in which considerations of communication costs are included into 
a No Delay scheduling model and we propose a new model which integrates 
communications costs into a General Delay scheduling model. We also consider 
general purpose scheduling algorithms for this new model. 
7.1 	Integrating Computation and Communication 
In Bokhari's model, the computation cost of each process is assumed to be equal 
and one process is mapped to each processor. This leads to an optimal load 
balance, but it may well be that the best mapping, once communications costs 
are taken into consideration, is not amongst those which optimally balance 
load. Considerable research effort has gone into extending the two process 
based models discussed in Chapter 5, namely Stone's approach of assuming se-
quential execution and Bokhari's approach of considering only communication 
overheads. 
The basic problem of adding communication costs to computation costs is 
twofold. First, as Efe points out, they may be measured in different units [Efe
1982]¹. Second, the cost of a communication is attributable to a particular
processor or pair of processors and not to the calculation as a whole. 
¹ Recall that in the case of scheduling models with delays we were able to solve
this problem because message lengths could be turned into times by considerations of
communications bandwidth.
7.1.1 Mini-Max formulations 
There is a class of models which may be thought of as versions of Stone's model 
where costs are not incurred sequentially, or of Bokhari's models where more 
than one process may be assigned to the same processor. Indurkhya et al. [1986],
who consider randomly generated programs, simply add all communications
costs to the maximum of the processors' execution costs - defined as the sum
of the costs of the computations assigned to each processor. See Nicol [1989]
for the limitations of the results in this paper. A variation on this is the work
of Efe [1982] who does not explicitly propose this model but whose mapping
algorithm iteratively improves this quantity from an initial heuristic assignment 
of processes. 
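The quantity described above can be sketched as follows. This is an illustrative reading rather than the formulation of any of the cited papers: it assumes that only communication between processes mapped to different processors is charged, and the names `assignment`, `exec_cost` and `comm_cost` are hypothetical.

```python
from collections import defaultdict

def minimax_cost(assignment, exec_cost, comm_cost):
    """Maximum per-processor execution load plus the sum of all
    communication costs between processes on different processors."""
    load = defaultdict(int)
    for process, processor in assignment.items():
        load[processor] += exec_cost[process]
    comm = sum(cost for (a, b), cost in comm_cost.items()
               if assignment[a] != assignment[b])
    return max(load.values()) + comm

exec_cost = {"a": 4, "b": 3, "c": 2}
comm_cost = {("a", "b"): 5, ("b", "c"): 1}
together = minimax_cost({"a": 0, "b": 0, "c": 0}, exec_cost, comm_cost)
split = minimax_cost({"a": 0, "b": 0, "c": 1}, exec_cost, comm_cost)
```

In this small instance, moving the lightly communicating process `c` to a second processor lowers the cost, while separating `a` from `b` would not.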
Bokhari [1988] describes polynomial time algorithms for solving the mini-max
problem in the case of chains of processes and chains of processors with
the constraint that communicating processes must be mapped to adjacent pro-
cessors, and also in various other constrained process formulations for host-
satellite processor systems. 
Shen and Tsai [1985] use a graph matching algorithm to minimise costs in 
a model which assigns communication costs to both the sending and receiving 
processor. They state that they are considering computations where "little or no" 
precedence relation exists between processes. As a result they assume negligible
processor idleness ($T_{Wait}$ in our framework), and their communication costs can
be associated directly with $T_{Init}$ and $T_{Term}$, which are added respectively to the
computation overhead of the sending and receiving processors. This model
may be appealing to the multicomputer programmer, but it is important to
remember that on a multicomputer $T_{Init}$ and $T_{Term}$ are rarely dependent upon
the distance in the processor network over which a particular message is to be
sent, and so it may be that the processor dependence of communication cost is
being inadequately modelled by Shen and Tsai.
Chu and Lan [1987] use a very similar model to Shen and Tsai, but work with 
a probabilistic task based model rather than the corresponding process based 
model. They consider the effect of the probabilistic precedence relation upon 
the time to completion and conclude that, where two processes are connected 
by a precedence relation, if the cost of execution of the second process is much 
larger than that of the first process, then they should be mapped to the same 
processor, whereas if the cost of execution of the second is much smaller than 
the first, they should be mapped to different processors. They use this heuristic, 
and one which tends to group heavily communicating processes, to form grains 
which are mapped to the same processor, and then map the grains to processors 
by an exhaustive search which minimises costs ignoring precedence. Both Shen 
and Tsai [1985] and Chu and Lan [1987] allow the possibility of re-computation
of processes on processors so as to minimise the overall execution time. 
7.1.2 Papadimitriou and Ullman's 1987 Models
Papadimitriou and Ullman [1987] propose two models of parallel computation. 
Each model contains two components: the processing time of the computation, 
and the communication requirement of the computation. Papadimitriou and 
Ullman are careful never to add the time and communication quantities to-
gether, merely claiming that in situations where communications are expensive 
one would use a schedule where the communications cost is minimal (such as 
one where all tasks are assigned to the same processor), and that in situations 
where it is cheap it would be more appropriate to use a schedule with many 
communications in order to minimise the computation time of the schedule. 
The two models differ in their notion of communication cost. In the event 
model the communication corresponding to every directed arc from a task 
allocated to one processor to a task allocated to another processor incurs a cost, 
analogous to paying the price of a postage stamp. 
In the model they refer to as delay, but which we refer to as the chain model 
to avoid confusion with the delay based scheduling models, it is the number of 
dependent communications that causes the problem. 
As discussed in Chapter 8 below, Papadimitriou and Ullman consider prop-
erties of something referred to as partitions rather than of schedules. For our 
purposes, we can define Papadimitriou and Ullman's costs with reference to a 
tuple dependence graph of Definition 2.34. 
Definition 7.1 Event Cost, Chain Cost. 
Let $D = (s, C)$ be a tuple dependence graph. Let $C'$ be the subset of $C$ containing
all interprocessor tuple dependences. The event cost of $D$, denoted $E(D)$, is given
by
$$E(D) = |C'|$$
The chain cost of $D$, denoted $C(D)$, is given by
$$C(D) = \max_{x \text{ is a path in } D} |x \cap C'|$$
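Both costs can be computed directly from a tuple dependence graph. The following is a minimal sketch under assumed representations: tuples are (task, processor, time) triples, dependences are pairs of tuples, and the function name is hypothetical.

```python
from collections import defaultdict

def event_and_chain_cost(tuples, dependences):
    """Return (event cost, chain cost) of an acyclic tuple dependence graph."""
    # Interprocessor dependences: the two tuples sit on different processors.
    inter = {(a, b) for (a, b) in dependences if a[1] != b[1]}
    successors = defaultdict(list)
    for a, b in dependences:
        successors[a].append(b)

    memo = {}
    def crossings_from(u):
        # Largest number of interprocessor dependences on any path from u.
        if u not in memo:
            memo[u] = max(((1 if (u, v) in inter else 0) + crossings_from(v)
                           for v in successors[u]), default=0)
        return memo[u]

    event_cost = len(inter)
    chain_cost = max((crossings_from(u) for u in tuples), default=0)
    return event_cost, chain_cost

a, b, c, e = ("a", "p0", 0), ("b", "p1", 1), ("c", "p0", 2), ("e", "p1", 1)
costs = event_and_chain_cost([a, b, c, e], [(a, b), (b, c), (a, e)])
```

Here three dependences cross between processors, but no single path contains more than two of them, so the two costs differ.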
7.2 Claud 
In this section we propose a new hybrid model whereby communication costs 
and delays are treated in a consistent way. The basis of the model is that 
communication is subject to a delay, and also incurs a cost in the form of two 
computational tasks referred to as communications events. These must be sched-
uled, one by each of the sending and the receiving processor. The tasks involve 
respectively setting up the transfer of a message and setting up the receipt of a 
message. 
Claud schedules allow processors to be in one of four states: computing, 
sending a message, receiving a message and idle. The schedule corresponds to 
a process time graph (Lamport [1979]), as annotated by Lo [1992], where each
processor is executing a single process. Related approaches have been described 
in the partitioning literature. Agrawal and Jagadish [1988] use a model of sched-
uled overheads of both send and receive type, but deal only with independent 
tasks (i.e. those without precedence relations). McCreary and Gill consider minimising
communications costs by partitioning communications into messages,
but they do not consider these costs as schedulable computations—rather they 
simply add computation costs and communications costs to get the overall cost 
of the partition. Finally, Kruatrachue and Lewis [1988] describe an approach 
that they refer to as Grain Packing, whereby communications costs are aggreg-
ated within a grain, but give insufficient details to determine the relationship 
with our approach. 
A Claud schedule consists of tuples corresponding to the execution of tasks 
in the original dag and also to the special communications tasks. The conditions 
of validity of the schedule are similar to those outlined in Section 2.2 for sched-
uling models, but with the added complication that communications must be 
packaged into messages. The three conditions are:
• All tasks in the original dag are executed.
• No processor is executing more than one task at a time, where that task
may be either in the dag or a communications task.
• The tasks that are defined to precede a given task have finished executing
before that task is started, and if they are executed on different processors,
a message has been passed between the two processors, into which it was
possible to fit the data corresponding to the precedence arc.
In the following we formally define a message as a set of tuple dependences 
where all the start tuples have been mapped to the same processor and all the 
finish tuples have been mapped to the same processor. 
Definition 7.2 (Valid) Message. 
Let A = (P, A = (F, ), f, d) be an instance of a general delay scheduling model, 
and s be a valid schedule for A. Let D = (s, C) be a valid tuple dependence 
graph for $s$. A message, say $m$, is a non-empty subset of $C$ such that for each
pair of tuples $((\gamma,p,t), (\delta,q,r)), ((\gamma',p',t'), (\delta',q',r')) \in m$, $p = p'$ and $q = q'$. We
define the length, denoted $\mathcal{L}(m)$, the release time, denoted $\mathcal{R}(m)$, and the deadline,
denoted $\mathcal{D}(m)$ of $m$ in the following way.
$$\mathcal{R}(m) = \max_{((\gamma,p,t),(\delta,q,r)) \in m} \bigl(t + f(\gamma)\bigr)$$
$$\mathcal{D}(m) = \min_{((\gamma,p,t),(\delta,q,r)) \in m} r$$
$$\mathcal{L}(m) = \sum_{((\gamma,p,t),(\delta,q,r)) \in m} d((\gamma,\delta),p,q)$$
$m$ is a valid message for $D$ iff $\mathcal{R}(m) + \mathcal{L}(m) \le \mathcal{D}(m)$.
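The three quantities and the validity test can be sketched as follows. The representation is an assumption made for illustration: a message is a list of tuple dependences, `f` is a dictionary of task execution times, and `d` is a function of an edge and two processors. The comparison is taken as non-strict here, which is also an assumption.

```python
def message_times(message, f, d):
    """Return (release time, deadline, length) of a message."""
    release = max(t + f[g] for (g, p, t), _ in message)
    deadline = min(r for _, (h, q, r) in message)
    length = sum(d((g, h), p, q) for (g, p, t), (h, q, r) in message)
    return release, deadline, length

def is_valid_message(message, f, d):
    """A message is non-empty, has one sender and one receiver,
    and its data fits between release time and deadline."""
    senders = {p for (_, p, _), _ in message}
    receivers = {q for _, (_, q, _) in message}
    if not message or len(senders) != 1 or len(receivers) != 1:
        return False
    release, deadline, length = message_times(message, f, d)
    return release + length <= deadline

f = {"a": 1, "b": 1, "c": 1, "e": 1}
d = lambda edge, p, q: 0 if p == q else 2   # fixed delay of 2 between processors
on_time = [(("a", "p0", 0), ("c", "p1", 6)), (("b", "p0", 1), ("e", "p1", 7))]
too_late = [(("a", "p0", 0), ("c", "p1", 5)), (("b", "p0", 1), ("e", "p1", 7))]
```

Packing both dependences into one message delays the data for `c` until `b` has finished, which is why the second message fails although each dependence alone would be satisfied.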
Definition 7.3 Valid Message Partition. 
Let A = (P, A = (F, ), f, d) be an instance of a general delay scheduling model, 
and .s be a valid schedule for A. Let D = (s, C) be a tuple dependence graph for 
A. Let C' be the subset of C containing all interprocessor tuple dependences. 
Let m = {m1,... , m,} be a partition of C. m is a valid message partition for A iff 
each m1,. . . , m, is a valid message for D. 
Definition 7.4 Claud Model. 
Given an instance A = (P, A = (F, ), f, d) of a General Delay scheduling model 
and two tasks, a and p which are involved in respectively sending and receiving 
a message and which respectively require time .A,7 and A,, for execution, we 
construct an instance B of a general delay Claud scheduling model as follows. 
For all -y E F, let f'(-y) = f(-y). Let f'(o-) = A. Let f'(p) = A. Let B = 
(F, A, f', d, A9, 'r)  B is a Unit Delay Claud model if A is a Unit Delay scheduling 
model. B is a No Cost Claud model if A, = A,, = 0. 
Definition 7.5 Claud Schedule. 
Let B = (F, A = (F, L\), f', d, A,. A,) be a Claud model. A Claud schedule of B is a 
schedule of (F, (F U {A,, A, 1, ), f', d). 
Definition 7.6 Support for a Message Partition. 
Let A = (F, A = (F, ), f', d, a, p) be an instance of a Claud scheduling model. 
Let s be a Claud schedule for A. Let f be the restriction of f' to the domain 
F. Let s' = X(s, F, f, d). Let D be a tuple dependence graph for s and m = 
{ m1 , m2,. . . , m} be a valid message partition for D. let S = X(s, {a}, F, 
Let R= X(s,{p},P,Z). 
.s is said to support m if there exist two one to one functions T : m -* S and 
m - R such that for each mi e m, where say (7, p, t) = f(m), 1?(rn) < t 
and V(m) > t. 
Definition 7.7 Valid Claud Schedule. 
Let A = (P, A = (F, ), f', d, a, p) be an instance of a Claud scheduling model. 
Let f be the restriction of f' to the domain F. Let B = (F, A, f, d). Let .s be a 
Claud schedule for A. Let s' = X(s, F, P, Z). .s is valid iff all of the following 
hold. 
• $s$ is not overbooked.
• $s'$ is valid for $B$.
• There exists a valid tuple dependence graph $(s', C)$ for $s'$ for which there
exists a valid message partition $M$ such that $s$ supports $M$.
7.2.1 Claud and the Mapping Problem Framework 
Let A = (P, A = (F, ), f', d, a, p) be an instance of a Claud scheduling model. 
Let s be a Claud schedule for A. We can present .s in terms of the framework 
outlined in Section 1.3 in the way shown in table 7-1. 
Here all the non-communication tasks mapped to processor $p$ contribute
to $T_{Calc}(p)$ or $T_{Recalc}(p)$. Communication send tasks contribute to $T_{Init}(p)$ and
communication receive tasks contribute to $T_{Term}(p)$.
T alc >1-yErf(2') 
T omm T1 (p) f'(a)X(s, {a}, {p} Z)' 
TTerm (P) f'(p)X(s, {p}, {p}, Z) 
TR0t€  0 
THouse TRecaic ('y,p,t)eX(s,r,P,Z) f('Y) - T alc 
Tinter 0 
Tshd 0 
Tidle T ait(p) M(X(s,F, {p}, Z)) 	(p,t)EX(s,{y},{p},Z) f(Y) 
T 2 (p) M(s) - M(X(s, F, {p}, Z)) 
Table 7-1: A Claud Model in Terms of our Framework 
7.2.2 Complexity Results for Claud 
The general case decision problem for scheduling in a Claud model is as follows: 
Decision Problem 7.1 Given an instance $A = (P, A, f, d, \sigma, \rho)$ of a Claud model and
some integer $k$, does there exist a valid Claud schedule $s$ of $A$ such that $M(s) < k$?
In the case where $f(\sigma) = f(\rho) = 0$, Decision Problem 7.1 reverts to the
decision problems outlined in Chapter 3 since zero-length send and receive 
events may precede and succeed any tuple dependence arcs. 
7.2.3 Performance Guarantees for Claud Scheduling Algorithms 
In this section we consider the way in which performance guarantees may be 
attained by scheduling algorithms for Claud models. We consider algorithms 
for UET No Delay Unit Cost Claud models which have $l$ processors and $l$ tasks,
with the performance guarantees given by ETF in a No Delay scheduling model.
Let $A = (P, A = (\Gamma, \Delta), f, d)$ be an instance of a No Delay UET scheduling
model constructed as follows.
$$P = \{p_1, \ldots, p_l\}$$
$$\Gamma = \{\gamma_1, \ldots, \gamma_l\}$$
$$\Delta = \{(\gamma_i, \gamma_l) : i = 1, \ldots, l-1\}.$$
Let $s_{ETF}$ be the schedule produced by ETF on $A$ whereby nondeterminism is
resolved by allocating the task of lowest index to the processor of lowest index.
That is,
$$s_{ETF} = \{(\gamma_i, p_i, 0) : i = 1, \ldots, l-1\} \cup \{(\gamma_l, p_1, 2)\}.$$
Note that $M(s_{ETF}) = 2$.
Note that, in $s_{ETF}$, data satisfying all but one of the $l - 1$ incident edges to
$\gamma_l$ are being simultaneously received by processor $p_1$. In Claud (and we would
venture to suggest in all existing multicomputers) each of those communication
events must incur an overhead which is modelled as a task which is scheduled
on processor $p_1$.
Let $B = (P, A = (\Gamma, \Delta), f', d, \sigma, \rho)$ be an instance of a UET No Delay Unit Cost
Claud scheduling model with $A$ as given above. Let $s$ be a valid Claud schedule
for $B$ whereby $k$ processors are involved in computing the tasks in $\gamma_1, \ldots, \gamma_{l-1}$.
At least one of the $k$ processors must compute at least $(l - 2)/k$ tasks and send
a message and the processor computing $\gamma_l$ must receive at least $k - 1$ messages.
Thus
$$M(s) \ge (l-2)/k + \lambda_\sigma + \lambda_\rho (k-1) + 1$$
and thus $M(s) \ge (l-2)/k + 1 + k$.
Differentiating with respect to $k$ we find that $(l-2)/k + 1 + k$ has a minimum
at $k = \sqrt{l-2}$. Thus
$$M(s) \ge (l-2)/\sqrt{l-2} + 1 + \sqrt{l-2} = 2\sqrt{l-2} + 1.$$
That is, although the ETF algorithm for mapping in scheduling models has a 
performance guarantee that is independent of the width of the dag, it is not pos-
sible to achieve an equivalent bound in a Claud model. Returning to the theme 
of the thesis, we wish to stress that Claud models' relatively poor algorithmic 
guarantees result from their modelling extra features of the multicomputer, 
rather than from defects in the models. As we will see in Chapter 9 this added 
verisimilitude leads to increased predictive power. 
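The growth of the lower bound $(l-2)/k + 1 + k$ from Section 7.2.3 can be checked numerically. The sketch below is an illustration only, with hypothetical function names. It searches over integer numbers of processors $k$ and compares the result with the value $2\sqrt{l-2} + 1$ obtained by setting the derivative with respect to $k$ to zero.

```python
import math

def claud_lower_bound(l, k):
    """Lower bound on the makespan when k processors share l - 1 tasks."""
    return (l - 2) / k + 1 + k

def best_k(l):
    """Integer k in 1..l-1 that minimises the lower bound."""
    return min(range(1, l), key=lambda k: claud_lower_bound(l, k))

l = 102
k_star = best_k(l)
bound = claud_lower_bound(l, k_star)
continuous = 2 * math.sqrt(l - 2) + 1
```

For this value of $l$ the integer and continuous minima coincide. In general the integer minimum is at least the continuous one, and both grow with the square root of the width of the dag.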
Chapter 8 
Applying the models 
This chapter and Chapter 9 consider the way in which the models of computa-
tion outlined in previous chapters can be used to predict the performance of a 
computation. In this chapter we show that the different models produce qualit-
atively different predictions of performance. In Chapter 9 we show the degree 
to which the models predict the performance of a real computation running on 
a real multicomputer. 
8.1 The Diamond dag 
As we saw in, for example, Chapters 2 and 7, some parallel computations can be
represented as a dag in which the nodes represent sequential subcomputations 
and there is an arc from a node v to a node w if the output from the operation 
performed at v is needed as one of the inputs to the operation at w. In this 
chapter and the subsequent one we consider computations that can be expressed 
as indexed families of dags, whereby the indices refer to some aspect of "problem 
size". 
Examples of indexed families of dag computations are the fast Fourier trans-
form graph, the binary tree for arithmetic expression evaluation and the diamond 
for computing the longest common subsequence [Aho et al. 1976] or for the
numerical computation which is described in Section 9.1. 
The n x n diamond is most simply described if we imagine vertices occupying 
points in cartesian space from (0, 0) in the lower left hand corner. We shall refer 
to the vertices as $\Gamma = \{\gamma_j^i : i = 0, \ldots, n-1,\ j = 0, \ldots, n-1\}$, where $n$ is the index
of the diamond as referred to above. The 6 x 6 diamond is shown in Fig. 8-1¹.
The arcs in the diamond are as follows:
• the vertex $\gamma_0^0$ is the start vertex and has no incoming edges.
• vertices $\gamma_0^i$, $1 \le i < n$, each have a single predecessor, namely the vertex
$\gamma_0^{i-1}$.
• vertices $\gamma_j^0$, $1 \le j \le n-1$, each have a single predecessor, namely $\gamma_{j-1}^0$.
• all other vertices, $\gamma_j^i$, $1 \le i < n$, $1 \le j < n$, have two predecessors, $\gamma_j^{i-1}$ and
$\gamma_{j-1}^i$.
Figure 8-1: The 6 x 6 Diamond dag
¹ We choose to break with the convention that in diagrams of the diamond, the start
node is at the bottom and the end is at the top. That is, we rotate the cartesian grid 45
degrees clockwise.
We examine five approaches to analysing the performance of the diamond
dag computation. All five models are variants of models that have been described
in previous chapters.
The first two models are No Delay scheduling models according to Definition
2.20, but with communication costs included, and were described by
Papadimitriou and Ullman [1987], and outlined in Section 7.1.2. The third is a
Fixed Delay scheduling model as, for example, was described by Papadimitriou
and Yannakakis [1992]. The aim of the analysis is to show that the three models 
produce quantitatively different predictions, rather than to allow comparison 
with a real multicomputer. 
The fourth model used to analyse performance of the diamond dag compu-
tation is a Fixed Delay No Cost Claud model partitioning, and the final approach 
is a Fixed Delay Claud model. 
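The diamond dag of Section 8.1 can be sketched in a few lines. The representation is an assumption made for illustration: a vertex is identified by its grid coordinates `(i, j)`, and its predecessors are its lower neighbours in each coordinate. The function names are hypothetical.

```python
def diamond_predecessors(n):
    """Map each vertex (i, j) of the n x n diamond to its predecessors."""
    preds = {}
    for i in range(n):
        for j in range(n):
            p = []
            if i > 0:
                p.append((i - 1, j))
            if j > 0:
                p.append((i, j - 1))
            preds[(i, j)] = p
    return preds

def critical_path_length(n):
    """Number of UET tasks on the longest chain from start to end vertex."""
    preds = diamond_predecessors(n)
    depth = {}
    for i in range(n):          # predecessors are always visited first
        for j in range(n):
            depth[(i, j)] = 1 + max((depth[q] for q in preds[(i, j)]), default=0)
    return depth[(n - 1, n - 1)]

preds = diamond_predecessors(6)
edge_count = sum(len(p) for p in preds.values())
chain = critical_path_length(6)
```

The critical path contains $2n - 1$ of the $n^2$ tasks, which limits the speedup available to any schedule regardless of the communication model.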
8.2 	Partitions of the Diamond dag 
Although, in some sections of their analysis, Papadimitriou and Ullman use 
a tuple-based notation similar to ours, most of their work is framed in terms 
of partitions of the diamond dag, whereby each task is mapped to a single 
processor. They consider two partitions: stripes and boxes. In addition, we 
consider a third: lines. 
8.2.1 Stripes 
The stripes approach is shown in Figure 8-2. Informally, if there are k processors, 
the diamond is divided into k stripes of width n/k. The first processor computes 
nodes $\gamma_0^0$ to $\gamma_0^{n/k-1}$. After $n/k$ time units, it reaches the stripe boundary and, after
a delay of $\tau$ time steps, the second processor can begin. Meanwhile, the first
processor works on the second row of its stripe, and so on up the stripe. For 
the purposes of Section 8.4 below, we define the stripes schedule in terms of 
a slightly more general type of schedule which we refer to as the holey stripes 
schedule. 
The holey stripes schedule differs from the stripes approach in that each
processor, on a number of occasions during the execution of its stripe, halts
for a while before continuing with the rest of the rows. In a later section we
will introduce schedules where it is executing something else during these idle
periods. These periods are referred to as holes in the schedule. A given holey
stripes schedule will have a hole count, referred to as $m$ below, and a hole width,
referred to as $l$ below.
The holey stripes schedule of an $n \times n$ diamond dag is the same as the
stripes schedule outlined above, except each processor halts for $l$ time steps after
processing each $n/m$ rows of its stripe. In addition, for all processors except
the first, the time at which the processor starts its stripe is delayed. Instead
of starting $\tau$ time steps after the first row of the previous stripe is executed, a
processor must wait $\tau n/m$ time units after the first hole in the schedule of the
processor computing the previous stripe. Since we allow $l = 0$ and $n = m$, a
stripes schedule is simply a special case of a holey stripes schedule.
Definition 8.1 Holey Stripes Schedule. 
Let $l$ be a non-negative integer hole width and $m$ be a positive integer hole
count. Let $A = (P = \{p_0, \ldots, p_{k-1}\}, A = (\Gamma, \Delta), f, d)$ be an instance of a Fixed
Delay UET scheduling model, where $A$ is an $n \times n$ diamond dag, and the range
of $d$ is $\{0, \tau\}$. Let $Y = (l + (n/m)\tau + n^2/(mk))$. Let $Z = (l + n^2/(mk))$. For each
$i = 0, \ldots, n-1$, let $p^i = p_{\lfloor ik/n \rfloor}$. For each $i = 0, \ldots, n-1$, for each $j = 0, \ldots, n-1$,
let
$$t_{i,j} = i \bmod n/k + (j \bmod n/m)\,n/k + \lfloor ik/n \rfloor Y + \lfloor jm/n \rfloor Z.$$
We define $s$, the holey stripes schedule of $A$ with hole width $l$ and hole count $m$, in
the following way.
$$s = \{(\gamma_j^i, p^i, t_{i,j}) : i = 0, \ldots, n-1,\ j = 0, \ldots, n-1\}.$$
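The construction in Definition 8.1 can be sketched and checked on small instances. The code below is an illustration under stated assumptions: vertices are grid coordinates `(i, j)`, processors are numbered from 0, `tau` is the fixed delay between distinct processors, and the function names are hypothetical. The validity check covers completeness, overbooking and the precedence delays of a Fixed Delay UET model.

```python
def holey_stripes(n, k, m, l, tau):
    """Map each vertex (i, j) to (processor, start time)."""
    if n % k or n % m:
        raise ValueError("n/k and n/m must be integers")
    w, h = n // k, n // m        # stripe width, rows between holes
    Y = l + h * tau + h * w      # start offset between adjacent stripes
    Z = l + h * w                # time for h rows of a stripe plus one hole
    return {(i, j): (i // w, i % w + (j % h) * w + (i // w) * Y + (j // h) * Z)
            for i in range(n) for j in range(n)}

def is_valid(schedule, n, tau):
    """Complete, not overbooked, and every arc waits for its delay."""
    if len(schedule) != n * n or len(set(schedule.values())) != n * n:
        return False
    for (i, j), (p, t) in schedule.items():
        for succ in ((i + 1, j), (i, j + 1)):
            if succ in schedule:
                q, t2 = schedule[succ]
                if t2 < t + 1 + (0 if p == q else tau):
                    return False
    return True

holey = holey_stripes(4, 2, 2, 1, 3)      # hole count 2, hole width 1
stripes = holey_stripes(4, 2, 4, 0, 3)    # hole count n, hole width 0
makespan = max(t for _, t in stripes.values()) + 1
```

Building a schedule with a smaller delay than the one used in the check, as in the test below, shows the check rejecting a schedule whose stripes start too early.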
Definition 8.2 Stripes Schedule. 
Let $A = (P = \{p_0, \ldots, p_{k-1}\}, A = (\Gamma, \Delta), f, d)$ be an instance of a Fixed Delay UET
scheduling model, where $A$ is an $n \times n$ diamond dag, the range of $d$ is $\{0, \tau\}$ and
$n/k \in \mathbb{Z}$. Let $s$ be the holey stripes schedule of $A$ with hole count $n$ and hole
width 0. $s$ is a stripes schedule for $A$.
Figure 8-2: Stripes Schedule in k Processors
Theorem 8.1 Let $m$ be a positive integer. Let $l$ be a non-negative integer. Let $A =
(P = \{p_0, \ldots, p_{k-1}\}, A = (\Gamma, \Delta), f, d)$ be an instance of a Fixed Delay scheduling
model, where $A$ is an $n \times n$ diamond dag, the range of $d$ is $\{0, \tau\}$, $n/k \in \mathbb{Z}$ and
$n/m \in \mathbb{Z}$. Let $s$ be a holey stripes schedule of $A$ with hole count $m$ and hole width $l$.
$s$ is valid.
Proof 
$s$ is complete since, by construction, there is a tuple $(\gamma_j^i, p^i, t_{i,j}) \in s$ for each
$i = 0, \ldots, n-1$, $j = 0, \ldots, n-1$. Let us assume as a hypothesis to be proved
contradictory that $s$ is overbooked. Thus there exist two tuples, say $(\gamma_j^i, p, t) \in s$
and $(\gamma_v^u, p', t') \in s$ such that $p = p'$ and $t = t'$ and either one or both of $i \ne u$,
$j \ne v$.
Let $Y = (l + (n/m)\tau + n^2/(mk))$. Let $Z = (l + n^2/(mk))$. Note by the construction
of $s$, $p = p_{\lfloor ik/n \rfloor}$ and $p' = p_{\lfloor uk/n \rfloor}$, thus $\lfloor ik/n \rfloor = \lfloor uk/n \rfloor$ and thus $|i - u| =
|i \bmod n/k - u \bmod n/k| < n/k$. Note, also that
$$t = i \bmod n/k + (j \bmod n/m)\,n/k + \lfloor ik/n \rfloor Y + \lfloor jm/n \rfloor Z$$
$$t' = u \bmod n/k + (v \bmod n/m)\,n/k + \lfloor uk/n \rfloor Y + \lfloor vm/n \rfloor Z$$
and thus, since $t' = t$,
$$i - u = (v \bmod n/m - j \bmod n/m)\,n/k + (\lfloor vm/n \rfloor - \lfloor jm/n \rfloor) Z.$$
We have two cases.
• $\lfloor vm/n \rfloor = \lfloor jm/n \rfloor$, thus $v \bmod n/m - j \bmod n/m = v - j$, thus $0 \le |v - j| =
|i - u|\,k/n < 1$, that is $v = j$, and thus $i = u$, which leads to a contradiction.
• $\lfloor vm/n \rfloor \ne \lfloor jm/n \rfloor$ and thus
$$|i - u| \ge (l + n^2/(mk)) - (n/m - 1)\,n/k \ge l + n/k \ge n/k$$
which leads to a contradiction.
Let us assume as a hypothesis to be proved contradictory that .s is temporally 
compromised. Thus there exists an edge, say ii = 	y) E A such that for 
the unique tuples involving those tasks in .s, say ('y,p',t'), (y,p,t), t' < t + 
d(r,p,p') + 1, where 
t = i mod n/k + (j mod n/m)n/k + [ik/n]Y + [jrn/n]Z 
= u mod n/k + (v mod n/m)n/k + [uk/njY + [vrn/njZ 
We have three cases. 
i = u and v = j + 1, in which case by the construction of .s', p = p' which 
implies t' - t < 1. However, t' = i mod n/k + ((j + 1) mod n/m)n/k + 
[ik/n]Y+[(j+l)rn/n]Z. Thust'—t =((J+ 1)  mod n/rn—j mod n/rn)n/k+ 
([(i + 1)m/n] - [jrn/n])Z and since Z > n 2  /rnk, t'-  t > n/k > 1, which 
leads to a contradiction. 
= v and u = i+ 1 and [ik/n] = [uk/n], that is (u mod n/k) - (imod 
n/k) = 1 and by the construction of s, p = p' which implies t' - t < 1. 
However, t' = u mod n/k + (J mod n/m)n/k + [uk/n]Y + [im/n] Z. Thus 
t' - t = 1, which leads to a contradiction. 
i = v and u = i + 1 and [ik/n] = [uk/n] + 1. Note that since the range 
of d is upper bounded by r, t' - t < r + 1. However, t' = u mod n/k + 
(i mod n/rn)n/k+([ik/n] + 1)Y+ [jm/n]Z thus t'—t = ((i+ 1) mod n/k —
i mod n/k)n/rn + Z and since Z > r + n2 /rnk), t' - t > n/rn + r> 1 + r 
which leads to yet another contradiction. 
Thus we conclude that in all cases .s is not temporally compromised and since 
it is complete and not overbooked it is valid and the theorem is proved. 	0 
Corollary 8.1 Let $A = (P = \{p_0, \ldots, p_{k-1}\}, A = (\Gamma, \Delta), f, d)$ be an instance of a
Fixed Delay UET scheduling model, where $A$ is an $n \times n$ diamond dag and the range of
$d$ is $\{0, \tau\}$. Let $s$ be a stripes schedule of $A$. $s$ is valid for $A$.
8.2.2 Lines 
The lines approach is shown in Figure 8-3. Informally, each processor computes 
nodes a line at a time, that is, once it has computed a node in a line it continues 
computing nodes further up the line. When it finishes its line it waits until all k 
processors have started a line before starting the $(k+1)$th line. Line $j$ is assigned
to processor j mod k. There are similar approaches to lines which can deliver 




Figure 8-3: Lines Schedule in k Processors 
Definition 8.3 Lines Schedule. 
Let $A = (P = \{p_0, \ldots, p_{k-1}\}, A = (\Gamma, \Delta), f, d)$ be an instance of a Fixed Delay UET
scheduling model, where $A$ is an $n \times n$ diamond dag, the range of $d$ is $\{0, \tau\}$ and
$n \ge k(1 + \tau)$.
For each $i = 0, \ldots, n-1$ let $p^i = p_{i \bmod k}$. For each $i = 0, \ldots, n-1$, for each
$j = 0, \ldots, n-1$ let
$$t_{i,j} = (i \bmod k)(1 + \tau) + j + n\lfloor i/k \rfloor$$
We define $s$, the lines schedule of $A$ as follows.
$$s = \{(\gamma_j^i, p^i, t_{i,j}) : i = 0, \ldots, n-1,\ j = 0, \ldots, n-1\}$$
Theorem 8.2 Let A = (P = $\{p_0, \ldots, p_{k-1}\}$, A = (F, ), f, d) be an instance of a
Fixed Delay UET scheduling model, where A is an n x n diamond dag and the range of d
is $\{0, \tau\}$. Let s be a lines schedule of A. Then s is valid for A.
Proof
s is complete since, by construction, there is a tuple $(\gamma_{ij}, p, t) \in s$ for each
$i = 0, \ldots, n-1$, $j = 0, \ldots, n-1$. Let us assume as a hypothesis to be proved
contradictory that s is overbooked. Thus there exist two tuples, say $(\gamma_{ij}, p, t) \in s$
and $(\gamma_{lm}, p', t') \in s$, such that $p = p'$ and $t = t'$ and either one or both of $l \neq i$,
$m \neq j$.
Note that by the construction of s, the following hold.
$$(i \bmod k)(1 + \tau) + j + n\lfloor i/k \rfloor = (l \bmod k)(1 + \tau) + m + n\lfloor l/k \rfloor$$
$$i \bmod k = l \bmod k$$
Thus
$$j + n\lfloor i/k \rfloor = m + n\lfloor l/k \rfloor.$$
We have two cases: $l = i$ or $|l - i| \geq k$.
• $l = i$, thus $n\lfloor i/k \rfloor = n\lfloor l/k \rfloor$, thus $m = j$, which leads to a contradiction.
• $|l - i| \geq k$, that is $|\lfloor l/k \rfloor - \lfloor i/k \rfloor| \geq 1$, thus $|j - m| \geq n$, which leads to a
contradiction.
Let us assume as a hypothesis to be proved contradictory that s is temporally
compromised. This would imply there existed an edge of the dag, say $\eta = (\gamma_{ij}, \gamma_{lm})$,
such that for the unique tuples involving those tasks in s, say
$(\gamma_{lm}, p', t')$, $(\gamma_{ij}, p, t)$,
$t' < t + d(\eta, p, p') + 1$, where
$$t = (i \bmod k)(1 + \tau) + j + n\lfloor i/k \rfloor$$
$$t' = (l \bmod k)(1 + \tau) + m + n\lfloor l/k \rfloor$$
We have three cases.
• $l = i$ and $m = j + 1$, in which case by the construction of s, $p = p'$, which
implies $t' - t < 1$. However, $t' = (i \bmod k)(1 + \tau) + j + 1 + n\lfloor i/k \rfloor$, thus
$t' - t = 1$, which leads to a contradiction.
• $j = m$ and $l = i + 1$ and $\lfloor l/k \rfloor = \lfloor i/k \rfloor$. Note that since the range of d is upper
bounded by $\tau$, $t' - t < \tau + 1$. However, $t' = (i \bmod k + 1)(1 + \tau) + j + n\lfloor i/k \rfloor$,
thus $t' - t = 1 + \tau$, which leads to another contradiction.
• $j = m$ and $l = i + 1$ and $\lfloor l/k \rfloor = \lfloor i/k \rfloor + 1$, thus $l \bmod k = 0$ and $i \bmod k =
k - 1$. Note that since the range of d is upper bounded by $\tau$, $t' - t < \tau + 1$.
However, $t' = 0 \cdot (1 + \tau) + j + n(\lfloor i/k \rfloor + 1)$, thus $t' - t = n - (k - 1)(1 + \tau) =
n - k(1 + \tau) + 1 + \tau$, but by the definition of the lines schedule $n > k(1 + \tau)$,
which leads to yet another contradiction.
Thus we conclude that in all cases s is not temporally compromised, and
since it is complete and not overbooked it is valid and the theorem is proved. □
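The argument of Theorem 8.2 can also be checked mechanically on small instances. The following is a minimal sketch, assuming that the delay d is 0 between tasks on the same processor and τ otherwise, and that the diamond dag has edges from node (i, j) to (i, j + 1) and (i + 1, j). The function names are ours and purely illustrative.

```python
def lines_schedule(n, k, tau):
    """Map each node (i, j) to (processor index, start time) as in Definition 8.3."""
    if n <= k * (1 + tau):
        raise ValueError("Definition 8.3 requires n > k(1 + tau)")
    return {
        (i, j): (i % k, (i % k) * (1 + tau) + j + n * (i // k))
        for i in range(n)
        for j in range(n)
    }


def is_valid(schedule, tau):
    """Check that the schedule is neither overbooked nor temporally compromised."""
    # Not overbooked: no two unit-time tasks share a (processor, time) slot.
    if len(set(schedule.values())) != len(schedule):
        return False
    # Not temporally compromised: each successor starts late enough.
    for (i, j), (p, t) in schedule.items():
        for succ in ((i, j + 1), (i + 1, j)):
            if succ in schedule:
                p2, t2 = schedule[succ]
                delay = 0 if p2 == p else tau  # assumed form of d
                if t2 < t + delay + 1:
                    return False
    return True


def makespan(schedule):
    """Finishing time of the last unit-time task."""
    return max(t for _, t in schedule.values()) + 1
```

For example, `is_valid(lines_schedule(12, 3, 2), 2)` exercises all three cases of the proof, including the wrap-around from processor k - 1 back to processor 0.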
8.2.3 Boxes 
Papadimitriou and Ullman describe how a lower bound on event cost (for those
partitions which evenly distribute work between processors) can be met by
partitioning the diamond dag into boxes (Fig. 8-4), each with a length of side of
$n/\sqrt{k}$. Each box thus contains $n/\sqrt{k} \times n/\sqrt{k}$ nodes. We consider only the case
where k has an integer square root which divides n exactly. The boxes schedule
described below is not the only schedule to correspond to Papadimitriou and
Ullman's boxes partition, but it has optimal makespan amongst them.
Definition 8.4 Boxes Schedule. 
Let A = (P = $\{p_0, \ldots, p_{k-1}\}$, A = (F, ), f, d) be an instance of a Fixed Delay UET
scheduling model, where A is an n x n diamond dag, the range of d is $\{0, \tau\}$,
$\sqrt{k} \in \mathbb{Z}$ and $n/\sqrt{k} \in \mathbb{Z}$.¹
¹These two conditions serve merely to avoid notational excesses.
Figure 8-4: Boxes Schedule in k Processors
For each i=O,...,n-1,j=O,...,n—1 let us define the following 
P 	= Pi mod n/k+(j  mod m//k)/k 
j + [jk/nj + in /k + [ik/n](1 - n/k + r) 
We define s, the boxes schedule of A as follows. 
s = {(,p,t) : i = 0,.. .,n - 1,j = 1,... ,n 
Theorem 8.3 Let A = (P = {po, . . . 	A = (F, ), f, d) be an instance of a 
Fixed Delay UET scheduling model, where A is an n x n diamond dag, the range of d 
is 10, T}. Let .s be a boxes schedule of A. s is valid for A. 
Proof 
is complete since, by construction, there is a tuple (-y, p, t?) e .s for each 
i = 0,... I n - 1'j  = 0,.. . I n - 1. Let us assume as a hypothesis to be 
proved contradictory that .s is overbooked. Thus there exist two tuples, say 
(-y,p,t), ('yr,p',t')  E .s such that p = p' and t = t' and one or both of 1 i and 
mj. 
Note that by the construction of s, 
i mod n/v/k + (j mod n/\/k)/k = 1 mod n/s/k + (m mod n//k)/k. 
Thus i mod n/s/k = 1 mod n//k and j mod n/,/k = m mod n/i/k, which im-
plies [is/k/nj = [l/k/nj, [j/k/nj = [ms/k/nj, and j - ml < n/,/k. Note also, 
by the construction of .s 
t 	= J+ [j/k/njr + in/\/k + [i/k/nj(1 - n/s/k + r) 
= m + [m.Jk/njr + ln//k + Ll/k/n](1 - n/./k + r) 
Thus since t = t', j - m = (1 - i)rt//k which implies 1 = i and thus j = m 
which leads to a contradiction. 
Let us assume as a hypothesis to be proved contradictory that .s is temporally 
compromised. This would imply there existed an edge, say rj = ( -y, r) E L 
such that for the unique tuples involving those tasks in s, say (fl,  p', t'), (-y, p, t), 
t' <t + d(i,p,p') + 1, where 
t = j + j/k/m]r + in/Jk + Li/kIn] (1— n/i/k + r) 
= m + [m./k/njr + mR/k + [l.Jk/nj(1 - n//k +7-) 
We have four cases. 
i = 1 and m = j + 1, and [j./k/ri] = Lmv"k/ni in which case by the 
construction of s, p = p' which implies t' - t < 1. However, t' = j + 1 + 
[j/k/n]r + in/s/k + [i%/k/nj(1 - n/s/k + r), thus t' - t = 1 which leads 
to a contradiction. 
j = m and 1 = i + 1, and [i \/k/nj = [l/k/n] in which case by the 
construction of s, p = p' which implies t' - t < 1. However, t' = j + 
[j/k/n]r + (i + 1)n/\/k + [iiJk/nj(1 - n/.Jk + ,r), thus t' -  t = n//k > 1 
which leads to another contradiction. 
. j = m and 1 = i + 1 and [i,/k/n] = Llv"k/ni + 1. Note that since the range 
of d is upper bounded by r, - t < r + 1 However, t = J + [j/k/n]r + 
(i + 1)n/,/k + ([i\/k/n] + 1)(1 - n/jk + T), thus t' -  t = 1 + ,r which leads 
to yet another contradiction. 
. i = 1 and m = j + 1 and [mJk!n] = [j \/k/nj + 1. Note that since 
the range of d is upper bounded by -r, - t < 'r + 1 However, t' = 
i+l+([i\/k/nj +1)T+in/../k+[i./k/n](1  —n//k+T), thus t'—t = 1+T 
which leads to our final contradiction. 
Thus we conclude that in all cases .s is not temporally compromised and 
since it is complete and not overbooked it is valid and the theorem is proved. 0 
8.3 	The Predictions of the Models 
Let A = (P = $\{p_0, \ldots, p_{k-1}\}$, A = (F, ), f, d) be an instance of a Fixed Delay UET
scheduling model, where A is an n x n diamond dag and the range of d is $\{0, \tau\}$.
Let $s_A^{stripes}$ be a stripes schedule of A. Let $s_A^{lines}$ be a lines schedule of A. Let $s_A^{boxes}$
be a boxes schedule of A.
We can obtain the makespans of the schedules by considering the execution
of $\gamma_{n-1,n-1}$, which depends upon every other task. Since all tasks are executed
exactly once in the schedules we have defined, it is the last task to be executed,
and by adding 1 to its execution time we obtain the makespan of the schedule.
$$\begin{aligned}
M(s_A^{stripes}) &= (n-1) \bmod n/k + ((n-1) \bmod n/n)\,n/k + \lfloor (n-1)k/n \rfloor Y + \lfloor (n-1)n/n \rfloor Z + 1 \\
&= n/k - 1 + (k-1)(\tau + n/k) + (n-1)(n/k) + 1 \\
&= n^2/k + (k-1)n/k + (k-1)\tau.
\end{aligned}$$
$$\begin{aligned}
M(s_A^{lines}) &= ((n-1) \bmod k)(1+\tau) + n - 1 + n\lfloor (n-1)/k \rfloor + 1 \\
&= (k-1)(1+\tau) + n - 1 + n^2/k - n + 1 \\
&= n^2/k + (k-1) + (k-1)\tau.
\end{aligned}$$
$$\begin{aligned}
M(s_A^{boxes}) &= n - 1 + \lfloor (n-1)\sqrt{k}/n \rfloor (2\tau + 1 - n/\sqrt{k}) + (n-1)\,n/\sqrt{k} + 1 \\
&= n^2/\sqrt{k} + \sqrt{k} - 1 + 2(\sqrt{k}-1)\tau.
\end{aligned}$$
Note $n/k > 1$, thus for all $\tau \geq 1$, $M(s_A^{lines}) \leq M(s_A^{stripes})$.
We observe that M (s 4°xes ) has a smaller multiplicative factor of -r than either 
M Wines) or M 	Note that 	 t) = 2/k. Both the stripes 
and the lines schedules would have an identical multiplicative factor if they 
were restricted to 2,/k processors. 
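The closed forms above are easy to tabulate for particular n, k and τ. The sketch below takes them at face value, reading the boxes makespan as n²/√k + √k - 1 + 2(√k - 1)τ, and assumes that k divides n and, for boxes, that √k is an integer dividing n. The function names are illustrative only.

```python
import math


def makespan_stripes(n, k, tau):
    """n^2/k + (k - 1)n/k + (k - 1)tau, assuming k divides n."""
    if n % k:
        raise ValueError("k must divide n")
    return n * n // k + (k - 1) * (n // k) + (k - 1) * tau


def makespan_lines(n, k, tau):
    """n^2/k + (k - 1) + (k - 1)tau, assuming k divides n."""
    if n % k:
        raise ValueError("k must divide n")
    return n * n // k + (k - 1) + (k - 1) * tau


def makespan_boxes(n, k, tau):
    """n^2/sqrt(k) + sqrt(k) - 1 + 2(sqrt(k) - 1)tau, for square k."""
    r = math.isqrt(k)
    if r * r != k or n % r:
        raise ValueError("sqrt(k) must be an integer dividing n")
    return n * n // r + r - 1 + 2 * (r - 1) * tau


def tau_coefficients(k):
    """Multiplicative factor of tau in each makespan: (stripes, lines, boxes)."""
    r = math.isqrt(k)
    return k - 1, k - 1, 2 * (r - 1)
```

With n = 12, k = 4 and τ = 1 the lines value is below the stripes value, as noted above, while `tau_coefficients(4)` shows the smaller factor of τ for boxes.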
Let B = (P = $\{p_0, \ldots, p_{k-1}\}$, A = (F, ), f, d) be an instance of a No Delay
UET scheduling model, where A is an n x n diamond dag. Let $s_B^{stripes}$ be a stripes
schedule of B. Let $s_B^{lines}$ be a lines schedule of B. Let $s_B^{boxes}$ be a boxes schedule
of B.
We have the following values for makespan.
$$M(s_B^{stripes}) = n^2/k + (k-1)n/k.$$
$$M(s_B^{lines}) = n^2/k + (k-1).$$
$$M(s_B^{boxes}) = n^2/\sqrt{k} + \sqrt{k} - 1.$$
Note that each of our three schedules of B contains a single tuple for each
task in F. It thus has a unique tuple dependence graph, of which we shall refer
to the subset containing all interprocessor dependences as, for example, $D_B^{stripes}$. For
all k > 1 we have the following values for event cost.
$$E(D_B^{stripes}) = n(k-1).$$
$$E(D_B^{lines}) = n^2 - n.$$
$$E(D_B^{boxes}) = 2n(\sqrt{k} - 1).$$
Note n > k, thus $E(D_B^{lines}) > E(D_B^{stripes})$.
For all k > 1 we have the following values for chain cost.
$$C(D_B^{stripes}) = k - 1.$$
$$C(D_B^{lines}) = n - 1.$$
$$C(D_B^{boxes}) = 2(\sqrt{k} - 1).$$
Note n > k, thus $C(D_B^{lines}) > C(D_B^{stripes})$.
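These counts can be reproduced by enumeration. The sketch below assumes that the event cost is the number of interprocessor edges and that the chain cost is the largest number of interprocessor edges on any single path through the dag; both definitions are given earlier in the thesis and are only paraphrased here. The processor maps for stripes, lines and boxes are our reading of the three partitions.

```python
import math


def interprocessor_costs(n, proc):
    """Return (event cost, chain cost) for an n x n diamond dag.

    proc(i, j) gives the processor of node (i, j); edges run from
    (i - 1, j) and (i, j - 1) into (i, j).
    """
    event = 0
    # chain[i][j]: most interprocessor edges on any path ending at (i, j)
    chain = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            best = 0
            for pi, pj in ((i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0:
                    cross = 1 if proc(pi, pj) != proc(i, j) else 0
                    event += cross
                    best = max(best, chain[pi][pj] + cross)
            chain[i][j] = best
    return event, max(max(row) for row in chain)


def stripes_map(n, k):
    return lambda i, j: i * k // n


def lines_map(k):
    return lambda i, j: i % k


def boxes_map(n, k):
    r = math.isqrt(k)
    return lambda i, j: (i * r // n, j * r // n)
```

For n = 12 and k = 4 the three maps give (36, 3), (132, 11) and (24, 2), in agreement with the closed forms.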
It is clear that although the lines schedule fares particularly badly in having 
a larger event cost and chain cost than either of the other schedules, it performs 
the best out of the three schedules for any value of n/k > 1, and for any value of 
r if it is restricted to a reasonable number of processors. The programmer who 
is faced with the problem of explicitly scheduling a computation corresponding 
to the diamond dag, should therefore make different qualitative decisions on 
the schedule, and write different programs, according to which cost model best 
approximates his or her architecture. 
8.4 The Claud Stripes Schedule 
In this section we consider two versions of the Claud model: the Fixed Delay No 
Cost Claud model and the Fixed Delay Fixed Cost Claud model. We consider the 
performance of schedules corresponding to Papadimitriou and Ullman's stripes 
partitions, with a restricted class of message partitions which are indexed by 
a single positive integer parameter m, where km is the number of fixed length 
messages into which the set of interprocessor tuple dependences is partitioned 
when the diamond dag is scheduled on k processors. 
As the Fixed Delay No Cost Claud model is a special case of the Fixed Delay 
Claud model we present analysis only for the Fixed Delay Claud model. The 
Stripes Claud Schedule defined below is a holey stripes schedule with extra send 
and receive tasks scheduled to occur during the holes. The execution times of 
the send and receive tasks add up to form the hole width, and the hole count 
corresponds to the number of messages that are used to send the precedence 
arcs from tasks that have been mapped to a given processor. The schedule is 
shown diagrammatically in Figure 8-5. 
Definition 8.5 Stripes Claud Schedule. 
Let rn be a positive integer. Let A = (P = {po,. . . , Pk- i}, A = (F', ), f', d, o-, p) 
be an instance of a Fixed Delay Claud model, where A is an n x n diamond dag, 
the range of d is {O, r}, n/k E Z, and f'(a) = ), f'(p) = 
Let f be the restriction off' to the domain F. Let B = (P, (IF' -  {u, p}, ), f, d). 
Note that B is an instance of a fixed delay scheduling model. Let .s' be a holey 
stripes schedule of B with hole width 1 = i + 1, and hole count rn such that 
n/rnE Z. Let Y = (1 + n/mT .1- n2/mk). Let Z = (1 + n2/mk). For each 
X 	1,. . . , k - 1, y = 1, . . . , rn let us define the following 
v=Yx+Zy-(l+n/rnr) 
= Yx + Zy - lr 
s, the stripes Claud schedule of A with m messages per processor, is defined as 
follows. 
s = 
{(p,p,r) : x = 1, ... ,k— 11  = 1, ... ,rn} 
The above schedule specifies the timing of communication events, but not 
the exact set of precedence arcs which are sent in each message. We send arcs 
in messages of equal length. The exact message partition is defined below. 
Definition 8.6 Equal Message Partition for Stripes. 






Figure 8-5: Claud Stripes Schedule of the n x n Diamond dag 
Claud model, where A is an n x n diamond dag and n/k E Z. Let .s be a stripes 
Claud schedule of A. Let s' be the set of tuples instancing tasks in the diamond 
itself (rather than send and receive tasks). More formally, let 
.5 = X(s,F,P,Z) 
1,j = 0,...,n - 11. 
We define M, the equal message partition of .s with m messages per processor 
as follows. Let 
M={M:x=1,...,k-1,y=1,...,m} 
where, for x = 1,. . . , k - 1, for y = 1,. . . , m, M1 is the yth message sent to 
processor x, and 
M = {((_1,p_1,t_1), (,p,t)) : i = xn/k,j = (x - 1)n/m,..., xn/m - 11. 
In order to later show the validity of the Claud schedule we need to show 
that it includes a set of communication events which can support a message 
partition of all the implied interprocessor communication. We now show that 
the schedule does indeed support the equal message partition described above. 
Figure 8-6: Equal Message Partition of the Stripes Claud Schedule of the 6 x 6 
Diamond with n/rn = 2 and n/k = 2 Showing Corresponding Send and Receive 
Events. 
Theorem 8.4 Let m be a positive integer. Let A = (P = {p,. . . ,Pk -1}, A = 
(F, 	), f', d, a, p) be an instance of a Fixed Delay Claud model, where A is an n x n 
diamond dag and ri/k E Z. Let s be a stripes Claud schedule of A with m messages. 
Let 	= {M : x = 1,...,k— 1,j = 1, ... ,m} be an equal message partition for s. 
supports M. 
Proof 
Let f'(a) = i and f(p) = 1,.. Let 1 = ir +l.. Let D be the unique tuple dependence 
graph for s. Let C be the set of all interprocessor tuple dependences in D. Note 
that M partitions D. Let Y = (1 + n/mr  + n2/mk). Let Z = (1 + n2/mk). Note 




= 	xY + yZ - (1+ n/mr + 1). yn/k-1 
V(M) - xn/m = 	xY+yZ. 
£(M)= 1+n/rnr+1. 
That is for each M,x=1,...k-1,y=1,...,m,L(M)>n/mr=Mr,that 
is, by Definition 7.2, My is valid for D. Let us define the following: 
S 	=X(s,{a},P,Z) ={(a,p,v),x=1 .... k -1,y=1, ... ,MI. 
R 	=X(s,{p},P,Z) ={(p,p_1,r),x=1 .... k-1,y=1,...,MI. 
Let J 3 : M — S and J : M —+ R be defined such that for each M, 
X =1,...k — l,y = 1,...,rn,.7 8(M) = (a,p_1,v) and .F(M) = (p,p,r). 
Note that for each x = 1,... k — 1, y = 1,..., m, 
1(M)= xY+yZ—(l+n/mr+1)<v_1. 
Thus,by Definition 7.6, .s supports M. 	 0 
Now we are in a position to prove the validity of our stripes Claud schedule. 
Theorem 8.5 Let m be a positive integer. Let A = (P = {p,,. . . ,Pk-1}, A = 
(F, 	), f', d, a, p) be an instance of a Fixed Delay Claud model, where A is an n x n 
diamond dag, n/k E Z and n/rn e Z. Let .s be a stripes Claud schedule of A with 
m messages. s is valid. 
Proof 
Let f'(a) = i, f'(p) = ir. Let f be the restriction of f' to the range F. Let 
= X(s, F, P, Zr ). Note .s' is a holey stripes schedule of (P, A, f, d) with hole 
count rn and hole size 1,, + l / thus .s' is valid. 
Let D be the unique tuple dependence graph for s. Let C be the set of all 
interprocessor tuple dependences in D. Let M = {M : x = 1, . . . , k - 1, y = 
1, . . . , rn - I  be an equal message partition for .s. M partitions C and s supports 
M. 
Finally consider whether s is overbooked. Note .s' is valid and therefore not 
overbooked and thus no two tasks in F can share the same time step. Thus we 
have five cases. 
• There exist two tuples (a,p_1,v),(a,p_1,v) E F' such that 0 <v  —v' < 
i. Now v xY + yZ - (1+ nT/rn) and v' = xY + Zy' - (1+ n7- /m) and 
thus 0 - v ' = Z(y - y') which, since Z > is  leads to a contradiction. 
There exist two tuples (p, p, r), (p,p, r') E F' such that 0 <r -ryl  < 1,.. 
Now r = xY + yZ - ir  and = xY + yZ - ir  and thus r - = Z(y V 
which, since Z > 1, leads to a contradiction. 
There exists two tuples (a, p, v 1), 	 ' r) E F' 	h that either r 
V 	< r' -j-- 1 or v 	< r?' < v 	+ i. Thus —i < v' - ri" < r 	x-1 - 	 x-1 
However, v_1 = (x - 1)Y + yZ - (1+ nT/rn) and r = xY + y'Z - 1, and 
thus v ' - r' = Z(v - y' + 1) - i. We have two subcases x x-1 
I 	 y 
- y - y —1 thus v5 - 	 < —is which leads to a contradiction. 
- y - y'> —1 thus 0 - 	 > i 15 = i which leads to a contradiction. 
e There exist two tuples (p, p5, r), (-y,p5, t) e F' such that 0 < v - < i. 
Now r' = XY+yZlr 
= i mod n/k + (j mod n/m)n/k + [ik/n]Y + [jm/n]Z. 
Thus, since x = [ik/n] 
r 	P - = yZ - i mod n/k + (j mod n/rn)n/k H- [imln]Z - ir , 
which implies 
(y - [jm/n])Z - ir <r - t <(y - [jm/n] + 1)Z - i - 
We have two subcases 
- 	- Lim/n] > 0 thus r - t ~ Z - lr > i which leads to a contradiction. 
- y — Lirn/n] <O thus r' —t3 < —1—i <O which leads toacontradiction. 
. There exist two tuples (a-, p, v_1), 	p5 , t) e F' such that 0 <v_1 - t < 
i. Now v_1 = (x - 1)Y + yZ - (1 + n7- /m) and 
t = i mod n/k + (i mod n/rn)n/k + Lik/niY + Lim/niz. 
Thus, since x = [ik/n] 
- 	= yZ - i mod n/k + (i mod n/rn)n/k + Lirn/n]Z + Y - (1+ nT/rn), 
which, since Y - (1+ nT/rn) = Z implies 
(y+l — [im/n] +1)Zv—t <(y+l - [im/n])Z—i 
We have two subcases 
- 
y + 1 -  [jm/nj > 0 thus v' - t3 ~ Z - 1,. ~ i which leads to a 
contradiction. 
- y + 1 - [jm/n] < 0 thus v' - t2 < —i - 1,. < 0 which leads to a 
contradiction. 
0 
8.4.1 Predicted Performance for the Claud Models 
In this section we consider the performance of a stripes Claud schedule. Again
we can obtain the makespans of the schedules by considering the execution of
$\gamma_{n-1,n-1}$, which depends upon every other task, and by adding 1 to its execution
time we obtain the makespan of the schedule.
Let m be a positive integer. Let A = (P = $\{p_0, \ldots, p_{k-1}\}$, A = (F, ), f', d, σ, ρ)
be an instance of a Fixed Delay Claud model, where A is an n x n diamond dag,
$n/k \in \mathbb{Z}$ and $n/m \in \mathbb{Z}$. Let s be a stripes Claud schedule of A with m messages.
Let $X = (\lambda + (n/m)\tau + n^2/(mk))$. Let $Y = (\lambda + n^2/(mk))$. Let $\lambda = f'(\sigma) + f'(\rho)$.
$$M(s) = (k - 1)(\lambda + (n/m)\tau + n^2/(mk)) + m(\lambda + n^2/(mk)) - \lambda.$$
In order to model fixed (but not unit) execution time computations we
introduce a node computation time w and consider λ and τ to be normalised with
respect to it, that is, each is the corresponding measured parameter of our
multicomputer architecture divided by w. The measured parameters may be
obtained as in Chapter 9 below.
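To make the trade-off between k and m concrete, the expression for M(s) can be evaluated over all admissible pairs. This is a minimal sketch, assuming the makespan is read as (k - 1)X + mY - λ with λ and τ already normalised by w; `claud_makespan` and `best_configuration` are illustrative names, not part of the thesis.

```python
def claud_makespan(n, k, m, lam, tau):
    """Stripes Claud makespan in units of the node time w.

    lam is the normalised hole width (send plus receive overhead) and
    tau the normalised time per unit of message length.
    """
    if n % k or n % m:
        raise ValueError("k and m must both divide n")
    work = n * n / (m * k)          # nodes computed between two holes
    x = lam + (n / m) * tau + work  # offset between adjacent stripes
    y = lam + work                  # offset between adjacent blocks
    return (k - 1) * x + m * y - lam


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def best_configuration(n, lam, tau):
    """Exhaustive search for the (makespan, k, m) triple of least makespan."""
    return min(
        (claud_makespan(n, k, m, lam, tau), k, m)
        for k in divisors(n)
        for m in divisors(n)
    )
```

With k = 1 and m = 1 the function returns n², the sequential time, whatever the values of λ and τ.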
In Figure 8-7, M(s) is shown as a three-dimensional surface for the values of
λ, τ and w derived in Section 9.2.1. There are points in k and m where M(s) has a
minimum. A slow linear increase of M(s) is observed when k or m are above
the minimal point and a rapid nonlinear growth takes place for values of k and






[Surface plot; horizontal axes: m and k]
Figure 8-7: Makespan M as a Function of two Variables k and m, for
n = 240, λ = 475, τ = 0.775, w = 43.30.
If we take the liberty of considering M(s) as a continuous function of con-
tinuous variables, we may find minima by differentiation. When k is fixed we 




we note the second derivative of M(s) with respect to m is positive in the interval (0, ∞) and
thus the extremum is a minimum. An analogous consideration is true when
we consider M(s) as a function of k with fixed m.
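As a worked illustration of this step, differentiating M(s) = (k - 1)(λ + (n/m)τ + n²/(mk)) + m(λ + n²/(mk)) - λ with respect to m and setting the result to zero gives m² = (k - 1)(nτ + n²/k)/λ. This closed form is our own derivation from the expression for M(s), not a formula quoted from the thesis, and the sketch below simply checks it numerically.

```python
import math


def claud_makespan(n, k, m, lam, tau):
    """M(s) treated as a continuous function of k and m (normalised units)."""
    work = n * n / (m * k)
    return (k - 1) * (lam + (n / m) * tau + work) + m * (lam + work) - lam


def optimal_m(n, k, lam, tau):
    """Stationary point of M(s) in m for fixed k (derived here, see lead-in)."""
    return math.sqrt((k - 1) * (n * tau + n * n / k) / lam)


def slope_in_m(n, k, m, lam, tau, h=1e-4):
    """Central-difference estimate of dM/dm, used to check the stationary point."""
    up = claud_makespan(n, k, m + h, lam, tau)
    down = claud_makespan(n, k, m - h, lam, tau)
    return (up - down) / (2 * h)
```

Because the second derivative in m is positive, the stationary point is a minimum; in practice m must also divide n, so the nearest admissible divisors would be compared.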
Let s0 be a schedule of optimal makespan. For the purposes of our analysis 
we shall assume that 	is close to 1, so the approximate formula for M (s) 
is 
k1 
A4, Pt 	1 + r— (1 + O(—)). n 	72 
Chapter 9 
Validation of the models 
In this chapter we test the quantitative predictive power of the Claud model 
on a computation which can be represented as the diamond dag. This we 
do by performing the computation in a manner corresponding to the stripes 
schedule outlined in Chapter 8, on different numbers of processors and with 
different message partitions. We compare the measured performance of the 
computation with that predicted by the model. We measure the parameters 
of the models directly from benchmarking the computation. Also, by using 
regression analysis, we attempt to find sets of parameters which optimally fit 
the data points. Finally we discuss the predictive power of our models in terms 
of features of our multicomputer and our software. 
9.1 	A Numerical Application 
Given a set of partial differential equations that describe a (continuous) physical 
process, a solution can be obtained using a discrete approximation. 
The discretisation procedures we consider here are such that the unknowns 
are coupled together within a regular n x n rectangular mesh. This can be 
achieved by the 5-point finite difference discretisation (e.g. see [Wachpress 
1966]) which leads to sparse positive definite linear systems. 
In order to express a linear system with a matrix form Ax = b of size n x n, 
there must exist a correspondence between equations and unknowns, estab-
lished by choosing an ordering; we consider here the natural or lexicographical 
ordering (e.g. see [Hageman and Young 1981] or [Beauwens 1989]).
The method we consider to solve the matrix equation is an iterative method 
where the preconditioning matrix B is obtained from an incomplete symmetric 
factorisation of A without fill-in (see [Beauwens 1989] for full details). B is of 
the form LU (several normalisations are possible) where U is upper triangular 
and L lower triangular and the non-zero element pattern of (L + U) is the same 
as the non-zero element pattern of A. 
Whatever the iterative method used it will require at each iteration (at least) 
the inversion of B. This means solving a matrix equation of the form By = k 
which is performed in two steps: 
Lz = k, yielding z (this is the forward solve) and then Uy = z yielding y (this
is the backward solve).
These two equations being similar, we limit our discussion to the first one
within which the general expression of the unknown z[i, j] at the point [i, j] of
the mesh is of the form:
k[i,j] = q[i,j]z[i,j] + q[i - 1,j]z[i - 1,j] + q[i,j - 1]z[i,j - 1]
From that expression it can be seen clearly that the computation of each un-
known of the mesh requires the results of the south and west neighbours. 
To represent this computation in the form of a dag, we have one node for 
each mesh element and arcs entering the node from nodes corresponding to 
these neighbouring south and west mesh elements (where such neighbours 
exist). The result is a diamond dag. 
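The dependence pattern can be seen directly in a sequential version of the forward solve. The following is a minimal sketch that takes the expression above literally, with a single coefficient array q and with terms for missing neighbours omitted at the mesh boundary; the function name and the list-of-lists representation are our own choices.

```python
def forward_solve(q, rhs):
    """Solve for z on an n x n mesh, given coefficients q and right-hand side rhs.

    Implements rhs[i][j] = q[i][j] z[i][j] + q[i-1][j] z[i-1][j]
                           + q[i][j-1] z[i][j-1],
    dropping the terms whose neighbour lies outside the mesh.
    """
    n = len(rhs)
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = rhs[i][j]
            if i > 0:  # neighbour (i - 1, j): already computed
                acc -= q[i - 1][j] * z[i - 1][j]
            if j > 0:  # neighbour (i, j - 1): already computed
                acc -= q[i][j - 1] * z[i][j - 1]
            z[i][j] = acc / q[i][j]
    return z
```

Each assignment to z[i][j] reads only z[i - 1][j] and z[i][j - 1], which are exactly the two arcs entering node (i, j) of the diamond dag.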
The parallel execution of the computation consists of the execution of the 
computation of different unknowns on different processors, and the sending 
of messages between processors. The unknowns were executed on processors 
to which corresponding tasks had been mapped in the stripes Claud schedule, 
and messages were sent containing data corresponding to edges as in the equal 
message partition. We are unable to specify the timing of computations on each 
processor (indeed to do so would mean that we were not testing the predictive 
power of the model). Instead our program specified an order of task execution 
and message send and receive calls which was consistent with the stripes Claud 
schedule. 
9.2 A Validation Experiment 
Given the expression derived above for makespan in the Fixed Cost Fixed Delay
Claud model, we wish to compare the predictive power of the No Cost Fixed Delay
Claud model and the Fixed Cost Fixed Delay Claud model for a computation on k
processors corresponding to the n x n diamond dag and sending m equal
sized messages per processor.
The computation associated with a node in our example involves three 
additions, two multiplications and a division, all at double precision floating 
point. It also involves computations to index into arrays during the loop. The 
computation was implemented in C on a Meiko Computing Surface [Meiko
1992] containing 20MHz Inmos T800 transputers [inmos 1988] with 20 Mbit/s
links through Meiko switch chips.
In order to match the computation to the Claud stripes schedule, it was
coded as an outer loop over the m blocks that make up a stripe, a centre loop over
the n/m strips of a block, and an inner loop over the n/k nodes of a strip. Each stripe
was assigned to a different processor. Processors assigned adjacent stripes were
connected by a single physical transputer link. The arrays were arranged so
that the last elements of each strip were contiguous in memory. These elements
were sent in messages of length n/m using communication calls placed in the
outer loop.
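The loop structure just described can be sketched as a single-process simulation. This is an illustration only: the original program was written in C with CS-tools calls, whereas here the k processors are visited one after another and a per-processor list stands in for the transputer link. All names are hypothetical, and the node computation is the forward-solve expression of Section 9.1.

```python
def striped_forward_solve(q, rhs, k_procs, m_msgs):
    """Forward solve organised as stripes (over i) and blocks (over j)."""
    n = len(rhs)
    if n % k_procs or n % m_msgs:
        raise ValueError("k and m must both divide n")
    width, block = n // k_procs, n // m_msgs
    z = [[0.0] * n for _ in range(n)]
    inbox = [[] for _ in range(k_procs)]  # stand-in for one link per processor
    for x in range(k_procs):              # sequential stand-in for processor x
        lo = x * width
        for y in range(m_msgs):           # outer loop: blocks of the stripe
            # "Receive" the n/m boundary values for this block, if any.
            boundary = inbox[x].pop(0) if x > 0 else None
            for s in range(block):        # centre loop: strips of the block
                j = y * block + s
                for i in range(lo, lo + width):  # inner loop: nodes of a strip
                    acc = rhs[i][j]
                    if i > 0:
                        prev = boundary[s] if i == lo else z[i - 1][j]
                        acc -= q[i - 1][j] * prev
                    if j > 0:
                        acc -= q[i][j - 1] * z[i][j - 1]
                    z[i][j] = acc / q[i][j]
            if x < k_procs - 1:
                # "Send" the last element of each strip: one message of length n/m.
                last = lo + width - 1
                inbox[x + 1].append([z[last][y * block + s] for s in range(block)])
    return z
```

Each processor sends exactly m messages of length n/m, matching the equal message partition. The simulation fixes only the order of computation and communication, not their timing.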
The communication was effected by csn calls in Meiko's CS-tools version
1.19. Messages were sent by non-blocking¹ csn_txnb calls immediately preceded
(in all but the first from each processor) by a csn_test test for completion. A
final csn_test was made by each processor. Messages were received by blocking
csn_rx calls. Each of these send and receive calls was surrounded by a small
"wrapper" to test its return status.
¹This corresponds to the asynchronous call of some other message passing systems.
See [Meiko 1992] for an explanation.
9.2.1 Deriving Parameters of the Model 
The computation is parameterised by n, the index of the diamond, w, the node
execution time, $\lambda = \lambda_s + \lambda_r$, the time required to set up message transfers, k, the
number of processors used, m, the number of messages each processor sends,
and τ, the time required to send a unit length message between processors. m, n
and k are well-defined. $\lambda_s$, $\lambda_r$, w and τ were measured as below.
Message send and receive overheads were measured by timing a large loop,
and measuring the extension to that time when messages were being repeatedly
sent to or received from adjacent processors during the loop. The values include
a small overhead (about 25 μs) associated with the "wrappers".
= 27Ojts 
1:9 = 205,us 
τ was considered to be the reciprocal of the bandwidth available between
adjacent processors for asymptotically large messages. This was computed by
measuring the time taken to echo a one megabyte message between adjacent
processors and subtracting the time taken to echo a one byte message. The
result was divided by $2 \times (2^{20} - 1)$. This gave a value which (owing to vagaries
of the Meiko hardware) depended slightly upon the pair of physical processors
to which the processes were mapped.
τ = 0.775 μs/byte
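The arithmetic of this measurement is simple enough to state as code. The sketch below is ours and assumes only what the paragraph above says: an echo crosses the link twice, so the difference between the two echo times is divided by twice the difference in message length. The second helper shows the normalisation by w used in Chapter 8.

```python
def per_byte_time(echo_large_us, echo_small_us, large_bytes=2**20, small_bytes=1):
    """Estimate the per-byte transfer time (microseconds) from two echo timings."""
    return (echo_large_us - echo_small_us) / (2 * (large_bytes - small_bytes))


def normalise(value_us, w_us):
    """Express a measured time in units of the node computation time w."""
    return value_us / w_us
```

Subtracting the one-byte echo removes the fixed per-message overheads, so that only the length-dependent part of the transfer time remains.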
It is possible to "optimise" the computation by using, at all levels of the 
loop, pointers instead of indices. If this is done then the computation time for a 
node is strongly dependent upon the block size and stripe (or strip) width. This 
effect is taken into account in [Manneback et al. 1992], but we chose to analyse
results for a naïve version of the software where arrays were always accessed 
with explicit array indexing on loop counters. Fortunately, from the point of 
view of our experiments, the Meiko compiler chose not to optimise this aspect 
of our code and the w values are relatively consistent for different block sizes 
and stripe widths. As a point of interest, the asymptotic "optimised" value is
13.96 μs.
w = 43.30 μs
9.2.2 Predicted Versus Achieved Performance 
Figures 9-1 and 9-2 show two cross-sections through the basket shape shown
in Figure 8-7 for n = 240, λ = 475, τ = 0.775, w = 43.30 for the performance of
the numerical application on a Meiko Computing Surface, with parameters as
derived in Section 9.2.1. Figure 9-1 shows the performance for m = 80 and
where k is a factor of 240 in the range 1 to 80 inclusive. Figure 9-2 shows
the performance for k = 80 and where m is a factor of 240 in the range 1 to
240 inclusive. The data points indicated by crosses were measured times of
execution. The predicted performance is also shown. In each case, the lower of
the two lines corresponds to the performance prediction of the No Cost Fixed Delay Claud
model, and the upper line corresponds to the Fixed Cost Fixed Delay Claud model.
It is clear that the Fixed Cost Fixed Delay Claud model is a good predictor of the
performance of our computation. Indeed, in no case do the prediction and the








[Figure 9-1: plot; horizontal axis: Processors (0 to 80)]
[Plot; horizontal axis: buffers (0 to 240)]
Figure 9-2: Performance with Constant Number of Processors: Observed
(points) vs Predicted (line)
9.2.3 Fitting the Models to the Observed Data 
The analysis still, however, raises the question as to whether it is possible to
predict the performance of the computation as effectively with a No
Cost Fixed Delay Claud model. It is clear that we had the Fixed Cost Fixed Delay
Claud model in mind when we were deriving the parameters of the architecture,
and therefore it is not sufficient simply to use in the No Cost Fixed Delay Claud
model the value of τ we derived earlier. A more rigorous statistical treatment
is required.
We choose to analyse the respective efficacies of the models by deriving
best-fit parameters by regression analysis and then analysing the goodness of fit of
the curves that are predicted, in terms of its statistical significance by analysis of
variance. In the case of the No Cost Fixed Delay Claud model we use regression
analysis to find the single parameter τ. In the case of the Fixed Cost Fixed Delay
Claud model we use the analysis to find both λ and τ. Then, for the Fixed Cost
Fixed Delay Claud model we consider the deviations between the regression
parameters and the measured parameters, in terms of confidence limits that we
derive for the regression parameters.
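Because the makespan expression of Section 8.4.1 is linear in λ and τ, a best fit of these two parameters can be obtained by ordinary least squares. The sketch below is one way this might be done and is not necessarily the procedure used here: it assumes w is held at its measured value, that times are in real units (so the model is T = λ(k + m - 2) + τ(k - 1)n/m + w(n²/k + (k - 1)n²/(mk))), and that message length is counted in the units of τ. The function name is illustrative.

```python
def fit_overhead_and_delay(samples, n, w):
    """Least-squares estimate of (lam, tau) from (k, m, measured_time) samples."""
    saa = sab = sbb = say = sby = 0.0
    for k, m, t in samples:
        a = k + m - 2          # multiplier of lam in the model
        b = (k - 1) * n / m    # multiplier of tau in the model
        # Remove the pure computation term, which involves only w.
        y = t - w * (n * n / k + (k - 1) * n * n / (m * k))
        saa += a * a
        sab += a * b
        sbb += b * b
        say += a * y
        sby += b * y
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("samples do not determine both parameters")
    # Solve the 2 x 2 normal equations by Cramer's rule.
    lam = (say * sbb - sab * sby) / det
    tau = (saa * sby - sab * say) / det
    return lam, tau
```

Setting the multiplier of λ to zero and fitting τ alone gives the corresponding one-parameter fit for the No Cost Fixed Delay Claud model.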
There is an assumption in the analysis that the deviation of the experimental 
values from the values predicted by the expression is normally distributed. It 
is less than clear how one could set about verifying this assumption, but it is a 
standard assumption in this type of analysis. 
Numerical solution of the regression coefficients for the Fixed Cost Fixed Delay
Claud model gives
λ = 531 μs, τ = 0.7871 μs/byte
which values are remarkably close to those derived by our independent
benchmarking of the multicomputer.
Analysis of variance gives a variance ratio which, according to tables of the 
significance, indicates that the probability that a fit as good as that which is 
predicted by a Claud model (with the regression parameters) could have arisen 
by chance is very much less than 0.1%. 
Numerical solution of the regression coefficients for the No Cost Fixed Delay
Claud model gives
τ = 2.02 μs/byte
Analysis of variance gives a variance ratio which again indicates that the 
probability that a fit as good as that which is predicted by a Claud model (with 
the regression parameters) could have arisen by chance is very much less than 
0.1%. 
The curves derived from parameters predicted by the regression analyses 
are shown in Figures 9-3 and 9-4. 
The Fixed Cost Fixed Delay Claud model correctly predicts the increase in
computation time seen with large numbers of processors and with large
numbers of messages, whereas in the No Cost Fixed Delay Claud model the optimal
number of processors is one for each line. In addition, the optimal message size
is 1, since if the transfer of the data associated with a precedence arc is delayed
by putting it into a message with another arc, this can never cause a reduction
in the computation time of the schedule.
9.3 Parameter Dependence in the Claud model 
Given that we have some confidence in the predictive power of Fixed Cost 
Fixed Delay Claud model for the Diamond dag computation, it is interesting 
to consider how the performance of the computation would be improved by 










[Plot; horizontal axis: Processors (0 to 80)]
Figure 9-3: Performance with Constant Message Size: Observed (points) vs that
Predicted by Regression Parameters (line)
[Plot; horizontal axis: buffers (0 to 240)]
Figure 9-4: Performance with Constant Number of Processors: Observed
(points) vs that Predicted by Regression Parameters (line)
[Surface plot; vertical axis: M(s) in ms]
Figure 9-5: Makespan Against λ and τ for n = 240, k = 80, w = 43.3, $m = m_{opt}$
the relationship between M, λ and τ at optimal message size, $m_{opt}$, which we





The result is plotted in Figure 9-5 for k = 80, n = 240, in the range 0 < λ < 800
and 0 < τ < 2, which includes the values λ = 475, τ = 0.775 as measured for
CS-tools at the intersection of the two cross-hair-like lines.
Consider a version of CS-tools with infinite bandwidth, that is with τ = 0, by
moving the cross-hairs along the line parallel to the τ axis to the point where it
intersects the (λ, M) plane. It is clear that the line is almost horizontal: there is
very little to be gained from increasing bandwidth.
Now move back to the point marked by the cross-hairs for CS-tools and
move along the line parallel to the λ axis to the point where it intersects the
(τ, M) plane. This corresponds to CS-tools with the same bandwidth as normal
but zero overhead. This line is not horizontal: it dips steeply down towards the end
with zero overhead, indicating that in the case of this numerical computation
substantial gains in performance can be achieved by moving to a message
passing system with lower overhead.
The figure shows values expressed in real units, i.e. not normalised with
respect to node calculation time. If the processor were ten times as fast, we
could use the approximation that it would compute all sequential computations
at ten times the speed, and divide the numbers on each axis by ten to get
the performance curve for message passing systems on multicomputers whose
processors run ten times faster, say for the Intel iPSC/860 [Intel 1991].
9.4 The Claud Model and our Multicomputer Computation
It is instructive to consider how the Claud model corresponds to the way com-
putation and communication are effected in a CS-tools program, and thus the 
properties of the diamond dag and of our implementation which may lead to 
our assumptions being accurate in this special case but not necessarily so in the 
case of a general computation. 
CS-tools on transputer-based communications hardware is basically a store 
and forward network, but one where the item which is stored and forwarded is 
not a packet but a subcomponent of a packet known as a flit. The technique is 
sometimes known as virtual cut-through [Kermani and Kleinrock, 1979]. The 
actual transfer of data between processors is accomplished by direct memory
access units which do not interrupt computation on the processor (except by
competing for memory bandwidth). Software overheads are, however, incurred
at source and destination processors, and also at intermediate processors in
the processor network through which messages may be forwarded. We can
assume that in the case of synchronous transfer, communication is effected in
the following way.
The message is transferred in packets fragmented into 32 byte flits, the first of 
which contains routing information. These packets are by default 1Kbyte long. 
The head packet of a message is sent at the point at which the csn_tx call is made. 
If there is message space available for storage of the packet at the destination 
an acknowledgement is sent back to the source processor, which then sends the 
next packet. In quiet networks across small numbers of interprocessor links the 
acknowledgement can be received at the source before it has transmitted the 
last flit of the packet to which it refers, and so message transfer can continue as 
an uninterrupted stream of flits (from consecutive packets) until the last packet 
has been transferred, or until no message space is available at the destination. 
Message space is always available if the receiving processor has made the
csn_rx call at the time at which the flit arrives: the message is simply copied
directly to the address specified in the csn_rx call. If the message is too large,
the tail is discarded and an error condition is raised. If, however, the receive
call has not yet been made when the first flit arrives, the first few packets of the
message can be buffered at the destination in CS-tools' internal buffering, and then
copied out to the address specified in the csn_rx call. Only if CS-tools' buffering
(which is of an unspecified size) is exhausted will the flow of flits be staunched,
because in this case a failure token is returned to the sending processor instead
of an acknowledgement. At some unspecified time after receiving the failure
token the sending processor will retry sending the failed packet.
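The behaviour at the destination described above can be summarised as a small decision procedure. The following Python sketch is illustrative only: the function name and the arguments receive_posted and free_buffer_packets are hypothetical and are not part of CS-tools, the size of the internal buffering is unspecified in practice, and the timing of retries is not modelled.

```python
from enum import Enum


class Action(Enum):
    COPY_TO_USER = "copy directly to the address given in csn_rx"
    BUFFER = "hold in internal buffering until csn_rx is called"
    FAILURE_TOKEN = "return a failure token; the sender retries later"


def on_packet_arrival(receive_posted: bool, free_buffer_packets: int) -> Action:
    """Decide what the destination does with an arriving packet."""
    if receive_posted:
        # Message space is always available once the receive call has been made.
        return Action.COPY_TO_USER
    if free_buffer_packets > 0:
        # Receive not yet made: absorb the packet in internal buffering.
        return Action.BUFFER
    # Buffering exhausted: refuse the packet instead of acknowledging it.
    return Action.FAILURE_TOKEN
```

Only the last case interrupts the stream of flits; in the first two the packet is acknowledged and the source may send the next one.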
9.4.1 Matching CS-tools to our Model 
It is less than clear that our three parameters, l_s and l_r (the lengths of atomic
computations required to send and to receive a message of arbitrary length) and
a latency per unit length of message being sent, are adequate to model such
complex behaviour. The following properties of our diamond dag software
may be improving the fit of our model.
1. Messages are only sent between adjacent processors along interprocessor
links which are not used for any other messages for the duration of the
computation.
2. Non-blocking send calls are used.
3. Messages are short: at most of length O(n) which, in the case of our test
example, works out at 240 units of length 8 bytes, or 2 packets.
As a result of 1, we see neither the computation overheads nor the latency
associated with the store and forward of flits. In addition there is no contention
between messages for interprocessor communications links.
As a result of 2, the sending process does not wait until a message has
arrived before it continues computing other nodes. This means that it is indeed
appropriate to consider the message latency as a delay associated with satisfying
the precedence relation rather than a cost at the sending processor.
As a result of 3, CS-tools' internal buffering will render it unlikely that failure
tokens and retries will occur even if there is some mismatch between the timing
of messages being sent and csn_rx calls being made.
There still remain two glaring problems with our model. First, l_s and l_r
contain no per-flit and per-message computation overhead. This means that they
are underestimates of the cost of sending anything other than the unit length
messages for which they were measured. Their underestimation grows as
message size grows. Second, the latency per unit length contains no component
for the flit of each packet that contains its header, so it is an underestimate for
anything other than the asymptotically large messages for which it was measured.
Even if this is taken into account, the fixed flit size means that it is only accurate
for packets of an exact number of flits, and messages of an exact number of packets.
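The effect of this quantisation can be illustrated with a short sketch. It uses the 32 byte flits and 1 Kbyte default packets quoted earlier, and additionally assumes (the accounting for the header is not stated here, so this is a hypothetical choice) that each packet carries one header flit on top of its payload flits. The function names are ours and are not part of CS-tools.

```python
import math

FLIT_BYTES = 32       # flit size quoted in the text
PACKET_BYTES = 1024   # default packet size quoted in the text

# Flits moved per payload byte for a very long message of full packets,
# assuming one header flit per packet (hypothetical accounting).
ASYMPTOTIC_FLITS_PER_BYTE = (PACKET_BYTES // FLIT_BYTES + 1) / PACKET_BYTES


def fragment(message_bytes):
    """Return (packets, payload_flits) needed to carry the message."""
    packets = math.ceil(message_bytes / PACKET_BYTES)
    payload_flits = math.ceil(message_bytes / FLIT_BYTES)
    return packets, payload_flits


def wire_flits(message_bytes):
    """Payload flits plus one assumed header flit per packet."""
    packets, payload_flits = fragment(message_bytes)
    return payload_flits + packets


def relative_cost(message_bytes):
    """Flits actually moved divided by what a linear per-byte model predicts."""
    return wire_flits(message_bytes) / (message_bytes * ASYMPTOTIC_FLITS_PER_BYTE)


# The test example: 240 units of 8 bytes.
print(fragment(240 * 8))        # (2, 60): two packets, as stated in the text
print(relative_cost(240 * 8))   # slightly above 1
print(relative_cost(8))         # far above 1 for a single short unit
```

Under these assumptions the ratio is exactly 1 only for messages that fill a whole number of packets; for the 1920 byte messages of our test example it is within a fraction of a percent of 1, which is consistent with the linear model fitting well in this special case.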
9.4.2 Atomicity of Communications Events 
A third problem is more subtle. It is important to notice that the diamond
dag software does not specify a schedule; it specifies a partition of the node
set, a message set, and a complete order of the node computations and csn_rx,
csn_txnb and csn_test calls at any given processor. Computation is performed
according to the order specified, as and when preceding computations are completed
and messages are received. The exact timing of events is the same as
computed by our schedule if the measured parameters are accurate, and if the
order of computation events is preserved. The problem lies in the way in which
csn_txnb and csn_rx calls correspond to computation events at the destination
processor, even if we assume there are no per-flit overheads. The problem is
outlined below for the receiving processor. There are analogous problems for
the sending processor when the computation associated with message protocol
and csn_test calls is taken into consideration.
If we consider the Gantt charts in Figure 9-6, processor X sends a message
of a given length to processor Y at time t. Processor Y has received it at time t
plus the message delay. According to our model this engenders a computation event
of length l_r at the time at which the message has been received. In practice, as
shown in the Gantt charts in Figure 9-6, there may well be
some component of overhead incurred when the csn_rx call is made (A), some
when the message arrives (B), and some when the csn_rx call completes (C).
The sequence of events on processor Y could be as shown in scenario 1. It
makes a csn_rx call (overhead A) at time s < t. It waits. The first flit of the
message arrives and overhead B is incurred. When all flits have arrived the csn_rx
call completes and overhead C occurs. Alternatively, as shown in scenario 2,
the message arrives first and overhead B is incurred. At time u > t the csn_rx
call is initiated and overhead A is incurred.
Then immediately afterwards it completes and overhead C is incurred.
Figure 9-6: Gantt Charts Illustrating Overheads that may be Attributable to l_r
(panels: Scenario 1, Scenario 2)
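The two scenarios can be compared with a small sketch of the receiving processor. It is a simplification under stated assumptions: the magnitudes of A, B and C are held fixed (whereas in practice they may depend on the relative timing), the receiving processor does one thing at a time, and first_flit_delay is a hypothetical parameter for the time at which the first flit arrives. None of the names below come from CS-tools.

```python
from dataclasses import dataclass


@dataclass
class Overheads:
    a: float  # incurred when the receive call is made
    b: float  # incurred when the first flit arrives
    c: float  # incurred when the receive call completes


def receive_timeline(call_time, send_time, first_flit_delay, delay, ov):
    """Return [(label, start, end)] for overheads A, B and C at the receiver."""
    first_flit = send_time + first_flit_delay
    last_flit = send_time + delay
    # A and B are triggered by independent events; serve them in time order.
    triggers = sorted([(call_time, "A", ov.a), (first_flit, "B", ov.b)])
    timeline = []
    busy_until = float("-inf")
    for ready, label, cost in triggers:
        start = max(ready, busy_until)
        busy_until = start + cost
        timeline.append((label, start, busy_until))
    # C needs the call to have been made and the whole message to have arrived.
    start_c = max(last_flit, busy_until)
    timeline.append(("C", start_c, start_c + ov.c))
    return timeline


ov = Overheads(a=2.0, b=1.0, c=3.0)
# Scenario 1: the receive call is made before the message is sent.
print(receive_timeline(0.0, 10.0, 1.0, 5.0, ov))
# Scenario 2: the receive call is made after the message has arrived.
print(receive_timeline(20.0, 10.0, 1.0, 5.0, ov))
```

In both runs the total overhead is the same, but its placement in time differs from the single computation event that the model places at the moment the message has been received.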
Overhead B is guaranteed to occur at a time greater than t. Overhead C is
guaranteed to occur at a time greater than t plus the message delay. Nothing can
be said about the timing of overhead A, except with reference to the local task
schedule at processor Y. These three overheads may vary in magnitude according to
the relative timings of the message receipt and the csn_rx call. The value of l_r that
was measured was the sum of all three overheads for the case when the message
was received before the csn_rx call was made. There are clearly a number of
approximations involved in considering l_r as a single overhead incurred at the
time the message has been received. Some of these may be being masked by the
lock-step nature of the





10.1 A Summary of Results 
Below is a summary of the results in the thesis. It is a selective summary. For 
example, some theorems and lemmas are not mentioned, or are mentioned in a 
specific context when they have more general utility. 
10.1.1 Models and the Framework 
Throughout the thesis there have been attempts to describe the models underlying
formulations of the mapping problem in terms of the unifying framework
of Chapter 1.3. At this point we wish to highlight the way in which the models
differ in terms of the components of the framework they incorporate. Table 10-1
gives a summary of the non-zero terms in Tables 3-1, 3-2, 3-3, 5-2, 5-3 and 7-1.
10.1.2 Complexity 
There are four complexity results in the thesis. Two of them refer to scheduling 
problems, a third to mapping in process based models, and the fourth to the 
remapping of schedules. 
Scheduling problems tend to be NP complete if the underlying models are
non-UET. We showed that scheduling in the 2-processor 1&2 unit execution time
model is NP complete with unit interprocessor communication delay
(Theorem 3.7). This is not altogether surprising since it is NP complete with no
communication delay. More interesting, perhaps, is the result (Theorem 6.2)
that two processor UET scheduling is NP complete if communications contention
is introduced. It is not known whether there is any fixed value of x
such that x processor scheduling is NP complete in models with contention-free
communication delay or without communication delay.
Model 
State Scheduling Models 
No Graph 
No 	General Process 
Commu- Match- 	Claud 
Delay 	Delay Based 
nication    mg  
T alc  
T omm - Tinit 
TTerm Vt 
TRoute 
THouse 	TRecalc Vt Vt 
Tinter 
TSched 
Tidle 	T a  it 5 Vt Vt Vt Vt 
TWaitD  Vt Vt Vt Vt 
TF2  Vt Vt Vt I 	Vt  
Table 10-1: A Summary of the Presence of Components of our Framework in 
Models 
The complexity of bijective mapping in graph matching models is open. It 
is related to graph isomorphism. We showed the NP completeness of bijective 
mapping in graph matching models where the processor graph is the variable 
to be optimised (Theorem 6.3). 
Papadimitriou and Yannakakis give an approximation algorithm for scheduling
in models with communication delay which is guaranteed to find a mapping
with a performance within a factor of two of optimal, at least in terms
of the model. We showed that there still remains an NP complete problem if a
multicomputer programmer wishes to use such a schedule on a multicomputer
with a fixed number of processors (Theorem 4.2).
10.1.3 Bounds on Processor Usage 
We derived a number of bounds beyond which there are no performance gains
associated with introducing more processors into a scheduling model. We can
refer to the number of processors beyond which no performance benefits can be
achieved as P_max. In Table 10-2 we give the results we have shown for P_max in
various scheduling models. We refer to a dag A = (1', ). 
10.1.4 Performance Guarantees of Algorithms 
We considered remapping a PNY schedule to a fixed number of processors for
comparison with ETF. The performance bounds for the algorithms are similar,
although the PNY bound is better for low values of τ and when the number
of processors is high in comparison to the width of the dag. The ETF bound
is better for high values of τ and when the number of processors is low in
comparison to the width of the dag. We showed that on some dags an ETF
schedule can be a factor of τ worse than a remapped PNY schedule, and on
some dags it can be a factor of τ better.
[Model Result Reference 
No Communication UET 
________________________  
VA, Pma = F Theorem 3.4 and Corol-
lary 3.1 
No Communication VA, Pmax 	Fl Corollary 3.1 
No Delay VA, max < W 	I Theorem 3.3 
Unit Delay No Replication VA, Pmas ~ W 	I Theorem 3.5 
Unit Delay UET A: Pmax > WIAI Section 3.2.3 
Uniform Delay VA, 1max 	Fl Corollary 3.2 
General Delay UET 3A : Pm,,x > Fl Theorem 3.6 
Table 10-2: A Summary of Bounds Beyond Which There is No Performance to
be Gained by Introducing Extra Processors
Now we consider mapping a UET Fixed Delay scheduling model with a
communication delay of at most τ. When we refer to "optimal makespan"
below, we mean the optimal makespan of schedules on a number of processors
equal to the number of tasks. In the case of both PNY and ETF we can generate
performance bounds which are independent of the width of the dag on a number
of processors which is a linear function of the width of the dag. In the case of
PNY we simply use Algorithm 4.3, apply Algorithm 3.2 and, by Theorem 4.5,
we end up with a schedule with makespan at most twice optimal for which the
processor usage is at most (τ + 1)/2 times the width of the dag. In the case of
ETF our performance guarantee of Section 4.4.1 states that we will get within a
factor of τ + 2 of optimal makespan if we use as many processors as the width
of the dag. In Section 7.2.3 we showed such performance guarantees are not
available for Claud models, even for those that are No Delay Fixed Cost and UET.
10.1.5 Nature of Predictions 
There are a number of different results in Chapter 8. An interesting way to
summarise some of them is to consider the plight of a multicomputer programmer
who is writing a longest common subsequence program and trying to decide 
whether to write a program corresponding to the lines schedule or to the stripes 
schedule. If she believed that event cost and chain cost had predictive power for 
her multicomputer, she would use the stripes schedule in preference to the lines 
schedule. If she believed that the Fixed Delay scheduling model had predictive 
power then she would use a lines schedule. 
Another interesting prediction which appears in Chapters 8 and 9 is that in 
Fixed Delay scheduling models and in Fixed Delay No Cost Claud models, adding
processors to the model can never extend the makespans of stripes schedules of 
the diamond dag. In Fixed Delay Fixed Cost Claud models there is an optimum 
number of processors. 
10.1.6 Predictive Power 
It appears that for a program corresponding to our diamond dag and parameters 
derived independently from the target multicomputer, the predictive power of 
a Claud model is very high. Indeed, with a high degree of accuracy, we are 
able to use the performance of the computation to predict the parameters of 
the multicomputer. Moreover, there is indeed an optimum number of processors
for a given size of a computation which corresponds to the diamond dag, and 
that optimum is roughly where we predict it to be. 
10.2 Verisimilitude, Complexity and Predictive Power 
The word verisimilitude appears once previously in this thesis. It means "appar-
ent increase in believability" and embodies a useful concept for discussing the 
models underlying the mapping problem. One can formulate mapping problems
in terms of models which have features corresponding to more and more of the 
properties of the multicomputer. This may lead to an increase in the number of 
non-zero terms when the models are related to our framework. Alternatively, 
it may lead to an apparent modelling of more features of the interprocessor 
communications medium. In short it may increase the verisimilitude of the 
model. 
The problem with added verisimilitude is that it leads to added complexity: 
added computational complexity, non-approximability and excessively com-
plicated notation. 
On the other hand, models with different levels of verisimilitude give differ-
ent predictions and lead the multicomputer programmer in different directions. 
If progress is to be achieved in the mapping problem for multicomputers, mod-
els must be developed with the lowest level of verisimilitude that can achieve 
predictive power. Fusion of Claud with a model incorporating message con-
tention may well be a good way forward. 
Bibliography 
[Agrawal and Jagadish, 1988] Agrawal, R. and Jagadish, H. (1988). Partitioning
techniques for large grain parallelism. IEEE Trans. Comput., C-37(12):1627-1634.
[Aho et al., 1976] Aho, A., Hirschberg, D., and Ullman, J. (1976). Bounds on
the complexity of the longest common subsequence problem. Journal of the
Association for Computing Machinery, 23(1):1-12.
[Aho et al., 1983] Aho, A., Hopcroft, J., and Ullman, J. (1983). Data Structures
and Algorithms. Addison Wesley.
[Al-Mouhammed, 1990] Al-Mouhammed, M. (1990). Lower bound on the
number of processors and time for scheduling precedence graphs with
communication costs. IEEE Trans. Software Engrg., SE-16(12):1390-1401.
[Baccelli and Liu, 1990] Baccelli, F. and Liu, Z. (1990). On the execution of
parallel programs on multiprocessor systems - a queueing theory approach.
J. ACM, 37(2):373-414.
[Bal et al., 1989] Bal, H., Steiner, J., and Tanenbaum, A. (1989). Programming
languages for distributed computing systems. Computing Surveys, 21(3):261-322.
[Baxter and Patel, 1989] Baxter, J. and Patel, J. (1989). The LAST algorithm: A
heuristic based static task allocation algorithm. In Proc. Intl. Conf. Parallel
Comput., volume 2, pages 217-222.
[Beauwens, 1989] Beauwens, R. (1989). Approximate factorisations with
S.P.C.O M-factors. Bit, 29:658-681.
[Berger and Cowen, 1991] Berger, B. and Cowen, L. (1991). Complexity results
for {<, ≤, =}-constrained scheduling. In Proceedings of the Second Annual
ACM-SIAM Symposium on Discrete Algorithms, pages 137-147.
[Berman and Snyder, 1987] Berman, F. and Snyder, L. (1987). On mapping parallel
algorithms into parallel architectures. J. Parallel Dist. Comput., 4:439-458.
[Blazewicz et al., 1986] Blazewicz, J., Drabowski, M., and Weglarz, J. (1986).
Scheduling multiprocessor tasks to minimize schedule length. IEEE Trans.
Comput., C-35(5):389-393.
[Blazewicz et al., 1991] Blazewicz, J., Dror, M., and Weglarz, J. (1991).
Mathematical programming formulations for machine scheduling: A survey.
European Journal of Operational Research, 51:283-300.
[Blazewicz et al., 1984] Blazewicz, J., Weglarz, J., and Drabowski, M. (1984).
Scheduling independent 2-processor tasks to minimise schedule length.
Inform. Process. Lett., 18(5):267-273.
[Bokhari, 1981a] Bokhari, S. (1981a). On the mapping problem. IEEE Trans.
Comput., C-30(3):207-214.
[Bokhari, 1981b] Bokhari, S. (1981b). A shortest tree algorithm for optimal
assignments across space and time in a distributed computer system. IEEE
Trans. Software Engrg., SE-7(6):583-589.
[Bokhari, 1988] Bokhari, S. (1988). Partitioning problems in parallel, pipelined
and distributed computing. IEEE Trans. Comput., C-37(1):48-57.
[Bokhari, 1990] Bokhari, S. (1990). Communication overhead on the Intel iPSC-860
Hypercube. Technical Report 10, ICASE, NASA, Langley Research Center,
Virginia. ICASE Interim Report.
[Bomans and Roose, 1989] Bomans, L. and Roose, D. (1989). Benchmarking
the iPSC/2 Hypercube multiprocessor. Concurrency Practice and Experience,
1(1):3-18.
[Brent, 1974] Brent, R. (1974). The parallel evaluation of general arithmetic
expressions. J. ACM, 21:201-206.
[Bruno et al., 1974] Bruno, J., Coffman Jr., E., and Sethi, R. (1974). Scheduling
independent tasks to reduce mean finishing time. Comm. ACM, 17(7):382-387.
[Casavant and Kuhl, 1988] Casavant, T. and Kuhl, J. (1988). A taxonomy of
scheduling in general purpose distributed computing systems. IEEE Trans.
Software Engrg., SE-14(2):141-154.
[Chen and Lai, 1988] Chen, G.-I. and Lai, T.-H. (1988). Scheduling independent
jobs on hypercubes. In Cori, R. and Wirsing, M., editors, Proc. Conf. Theoretical
Aspects of Computer Science, pages 273-280.
[Chen and Shin, 1987] Chen, M.-S. and Shin, K. (1987). Processor allocation in an
n-cube multiprocessor using gray codes. IEEE Trans. Comput., C-36(12):1396-1407.
[Chen and Liu, 1975] Chen, N. and Liu, C. (1975). On a class of scheduling
algorithms for multiprocessor computing systems. In Feng, T.-Y., editor,
Lecture Notes in Computer Science. Springer, New York.
[Cheng and Sin, 1990] Cheng, T. and Sin, C. (1990). A state-of-the-art review
of parallel-machine scheduling research. European Journal of Operational
Research, 47:61-63.
[Chittor and Enbody, 1992] Chittor, S. and Enbody, R. (1992). Minimising
Contention: A New Mapping Objective for Second Generation Multicomputers.
Technical report, Michigan State University, East Lansing, MI
48824-1027, USA. Submitted to IEEE Transactions on Parallel and
Distributed Systems.
[Cho and Sahni, 1980] Cho, Y. and Sahni, S. (1980). Bounds for list schedules
on uniform processors. SIAM J. Comput., 9(1):91-103.
[Chrétienne, 1989] Chrétienne, P. (1989). A polynomial algorithm to optimally
schedule tasks on a virtual distributed system under tree-like precedence
constraints. European J. Oper. Res., 43:225-230.
[Chrétienne and Picouleau, 1992] Chrétienne, P. and Picouleau, C. (1992). The
basic scheduling problem with interprocessor communication delays. In
Proceedings of the Summer School on Scheduling Theory and its Applications,
Chateau de Bonas, France.
[Chu et al., 1984] Chu, L., Lan, M.-T., and Hellerstein, J. (1984). Estimation of
intermodule communication (IMC) and its applications in distributed
processing systems. IEEE Trans. Comput., C-33(8):691-699.
[Chu et al., 1980] Chu, W., Holloway, L., Lan, M.-T., and Efe, K. (1980). Task
allocation in distributed data processing. Computer, 13(11).
[Chu and Lan, 1987] Chu, W. and Lan, M.-T. (1987). Task allocation and
precedence relations for distributed real-time systems. IEEE Trans. Comput.,
C-36(6):667-679.
[Coffman Jr., 1976] Coffman Jr., E., editor (1976). Computer and Job Shop
Scheduling Theory. John Wiley, New York.
[Coffman Jr. et al., 1984a] Coffman Jr., E., Flatto, L., and Lueker, G. (1984a).
Expected makespans for largest-first multiprocessor scheduling. In Gelenbe,
E., editor, Performance '84, pages 491-506. Elsevier Science Publishers B.V.
(North Holland).
[Coffman Jr. et al., 1978] Coffman Jr., E., Garey, M., and Johnson, D. (1978). An
application of bin-packing to multiprocessor scheduling. SIAM J. Comput.,
7(1):1-17.
[Coffman Jr. et al., 1984b] Coffman Jr., E., Garey, M., and Johnson, D. (1984b).
Approximation algorithms for bin-packing - an updated survey. In Ausiello,
G., Lucertini, M., and Serafini, P., editors, Algorithm Design for Computer
System Design, Berlin. Springer-Verlag.
[Coffman Jr. and Graham, 1972] Coffman Jr., E. and Graham, R. (1972). Optimal
scheduling for two processor systems. Acta Informatica, 1:200-213.
[Cole and Vishkin, 1988] Cole, R. and Vishkin, U. (1988). Approximate parallel
scheduling, part 1: The basic technique with applications to optimal parallel
list ranking in logarithmic time. SIAM J. Comput., 17(1):128-142.
[Conway et al., 1967] Conway, R., Maxwell, W., and Miller, L. (1967). Theory of
scheduling. Addison-Wesley, Reading, Mass.
[Cvetanovic, 1987] Cvetanovic, Z. (1987). The effects of problem partitioning,
allocation and granularity on the performance of multiple-processor systems.
IEEE Trans. Comput., C-36(4):421-432.
[Dally, 1990a] Dally, W. (1990a). Network and processor architecture for
message-driven computers. In Suaya, R. and Birtwhistle, G., editors, VLSI
and Parallel Computation, pages 140-222. Morgan Kaufmann, Palo Alto, CA.
[Dally, 1990b] Dally, W. (1990b). Performance analysis of k-ary n-cube
interconnection networks. IEEE Transactions on Computers, 39:775-785.
[Dally and Seitz, 1987] Dally, W. and Seitz, C. (1987). Deadlock-free message
routing in multiprocessor interconnection networks. IEEE Trans. Comput.,
C-36(5).
[Dietz et al., 1992] Dietz, H., Zaafrani, A., and O'Keefe, M. (1992). Static
scheduling for barrier MIMD architectures. The Journal of Supercomputing,
5,4:263-289.
[Du and Leung, 1989] Du, J. and Leung, J.-T. (1989). Complexity of scheduling
parallel task systems. SIAM J. Disc. Math, 2(4):473-487.
[Duato, 1992] Duato, J. (1992). Impact of locality on the performance of some
adaptive routing algorithms for the hypercube. In Joosen, W. and Milgrom,
E., editors, Parallel Computing: from Theory to Sound Practice. IOS Press.
[Efe, 1982] Efe, K. (1982). Heuristic models of task assignment scheduling in
distributed systems. IEEE Computer, June 1982:50-56.
[El-Rewini and Lewis, 1990] El-Rewini, H. and Lewis, T. (1990). Scheduling
parallel program tasks onto arbitrary target machines. J. Parallel Dist. Comput.,
9:138-153.
[Fellows and Langston, 1988] Fellows, M. and Langston, M. (1988). Processor
utilisation in a linearly connected parallel processing system. IEEE Trans.
Comput., C-37(5):594-603.
[Fernández and Bussell, 1973] Fernández, E. and Bussell, B. (1973). Bounds on
the number of processors and time for multiprocessor optimal schedules.
IEEE Trans. Comput., C-22(8):745-751.
[Fernández-Baca, 1989] Fernández-Baca, D. (1989). Allocating modules to
processors in a distributed system. IEEE Trans. Software Engrg.,
SE-15(11):1427-1436.
[Fox, 1989] Fox, G. (1989). Parallel computing comes of age. Concurrency
Practice and Experience, 1(1):63-103.
[Fox et al., 1988] Fox, G., Johnson, M., Lyzenga, G., Otto, S., Salmon, J., and
Walker, D. (1988). Solving Scientific Problems on Concurrent Processors. Prentice
Hall, New Jersey.
[Gabow, 1988] Gabow, H. (1988). Scheduling UET systems on two uniform
processors and length two pipelines. SIAM J. Comput., 17(4):810-811.
[Garey et al., 1977] Garey, M., Graham, R., and Johnson, D. (1977).
Performance guarantees for scheduling algorithms. Operations Research, 26(1).
[Garey and Johnson, 1975] Garey, M. and Johnson, D. (1975). Complexity
results for multiprocessor scheduling with resource constraints. SIAM J.
Comput., 4(4):396-411.
[Garey and Johnson, 1979] Garey, M. and Johnson, D. (1979). Computers and
Intractability. W.H. Freeman and Co., San Francisco.
[Garey et al., 1976] Garey, M., Johnson, D., and Stockmeyer, L. (1976). Some
simplified NP-complete graph problems. Theor. Comput. Sci., 1:237-267.
[Garey and Johnson, 1982] Garey, M. and Johnson, R. (1982). Approximation
algorithms for bin packing problems - a survey. In G. Ausiello, M.,
editor, Analysis and Design of Combinatorial optimisation, pages 147-172. Springer
Verlag, Vienna, Austria.
[Gaudiot et al., 1988] Gaudiot, J., Pi, J., and Campbell, M. (1988). Program
graph allocation in distributed multicomputers. Parallel Computing, 7:227-247.
[Gonzalez, 1977] Gonzalez, M. (1977). Deterministic processor scheduling.
Computing Surveys, 9(3).
[Graham, 1966] Graham, R. (1966). Bounds for certain multiprocessing timing
anomalies. Bell System Technical J., 45:1563-1581.
[Graham, 1969] Graham, R. (1969). Bounds on multiprocessing timing
anomalies. SIAM J. Appl. Math., 17(2):416-429.
[Graham et al., 1979] Graham, R., Lawler, E., Lenstra, J., and Rinnooy Kan, A.
(1979). Optimisation and approximation in deterministic sequencing and
scheduling: a survey. Annals of Discrete Mathematics, 5:287-326.
[Gusfield, 1983] Gusfield, D. (1983). Parametric combinatorial computing and
a problem of program module distribution. J. ACM, 30(3):551-563.
[Gylys and Edwards, 1976] Gylys, V. and Edwards, J. (1976). Optimal
partitioning of workload for distributed systems. In Proc. Compcon Fall 76.
[Hageman and Young, 1981] Hageman, L. and Young, D. (1981). Applied
iterative methods. Academic Press.
[Helmbold and Mayr, 1987] Helmbold, D. and Mayr, E. (1987). Two processor
scheduling is in NC. SIAM J. Comput., 16(4):747-759.
[Hochbaum and Shmoys, 1988a] Hochbaum, D. and Shmoys, D. (1988a). A
polynomial approximation scheme for scheduling on uniform processors
using the dual approximation approach. SIAM J. Comput., 17(3):539-551.
[Hochbaum and Shmoys, 1988b] Hochbaum, D. and Shmoys, D. (1988b). Using
dual approximation algorithms for scheduling problems: Theoretical and
practical results. J. ACM, 34(1):144-162.
[Houstis, 1990] Houstis, C. (1990). Module allocation of real-time applications
to distributed systems. IEEE Trans. Software Engrg., SE-16(7):699-709.
[Hu, 1961] Hu, T. (1961). Parallel sequencing and assembly line problems. Oper.
Res., 9:841-848.
[Hwang et al., 1989] Hwang, J.-J., Chow, Y.-C., Anger, F., and Lee, C.-Y. (1989).
Scheduling precedence graphs in systems with interprocessor communication
times. SIAM J. Comput., 18(2):244-257.
[IEEE, 1991] IEEE (1991). SCI, scalable coherent interconnect. Technical Report
P1596/D1.98, IEEE Standards Department, 345 East 47th Street, New York,
N.Y. 1007 USA. Draft for Review by Negative Ballot Review Committee.
[Indurkhya and Stone, 1986] Indurkhya, B. and Stone, H. (1986). Optimal
partitioning of randomly generated parallel programs. IEEE Trans. Software
Engrg., SE-12(3):483-495.
[INMOS Limited, 1988] INMOS Limited (1988). The Transputer Reference
Manual. Prentice Hall.
[Intel, 1991] Intel (1991). ipsc/860 parallel supercomputer product overview.
Technical report, Intel Corporation Supercomputer Systems Division.
[Johnsson, 1990] Johnsson, S. (1990). Communication in network architectures.
In Suaya, R. and Birtwhistle, G., editors, VLSI and Parallel Computation, pages
223-378. Morgan Kaufmann, Palo Alto, CA.
[Jung et al., 1989] Jung, H., Kirousis, L., and Spirakis, P. (1989). Lower bounds
and efficient algorithms for multiprocessor scheduling of dags with
communication delays. In Proc. ACM Symposium on Parallel Algorithms and
Architectures, pages 254-264.
[Kafura and Shen, 1977] Kafura, D. and Shen, V. (1977). Task scheduling on
a multiprocessor system with independent memories. SIAM J. Comput.,
6(1):167-187.
[Kapelnikov et al., 1989] Kapelnikov, A., Muntz, R., and Ercegovac, M. (1989).
A modelling methodology for the analysis of concurrent systems and
computations. J. Parallel Dist. Comput., 6:568-597.
[Kaufmann, 1974] Kaufmann, M. (1974). An almost-optimal algorithm for the
assembly line scheduling problem. IEEE Trans. Comput., C-23(11):1169-1174.
[Kermani and Kleinrock, 1979] Kermani, P. and Kleinrock, L. (1979). Virtual
cut-through: A new computer communication switching technique.
Computer Networks, 3:267-286.
[Kim and Browne, 1988] Kim, S. and Browne, J. (1988). A general approach
to mapping of parallel computations upon multiprocessor architectures. In
International Conference on Parallel Processing 3, pages 1-8.
[Kramer and Mühlenbein, 1989] Kramer, O. and Mühlenbein, H. (1989).
Mapping strategies in message-based multiprocessor systems. Parallel Computing,
9:213-225.
[Krishnamurthy, 19901 Krishnamurthy, S. (1990). A brief survey of papers on 
scheduling for pipelined processors. Sigplan Notices, 25(7):97-106. 
[Kruatrachue and Lewis, 1988] Kruatrachue, B. and Lewis, T. (January 1988). 
Gram size determination for parallel programming. IEEE Software, pages 
23-32. 
[Lam and Sethi, 19771 Lam, S. and Sethi, R. (1977). Worst case analysis of two 
scheduling algorithms. SIAM I. Corn put., 6(3):518-536. 
[Lamport, 19791 Lamport, L. (1979). How to Make a Multiprocessor Computer 
That Correctly Executes Multiprocess Programs. IEEE Transactions on Com-
puters, C-28:690-691. 
[Lawler, 19821 Lawler, E. (1982). Preemptive scehduling of precedence con-
strained jobs on parallel machines. In Dempster, M., editor, Deterministic and 
Stochastic Scheduling. D.Reidel Publishing Co. 
[Lee and Aggarwal, 1987] Lee, S. and Aggarwal, J. (1987). A mapping strategy 
for parallel computing. IEEE Trans. Corn put., C-36(4):433-442. 
[Leighton, 19921 Leighton, F. (1992). Introduction to Parallel Algorithms and Ar-
chitectures. Morgan Kaufmann, San Mateo, CA. 
[Lo, 19881 Lo, V. (1988). Heuristic algorithms for task assignment in distributed 
systems. IEEE Trans. Comput., 37(11):1384-1397. 
[Lo, 1992] Lo, V. (1992). Temporal Communication Graphs: Lamport's Process-
Time Graphs Augmented for the Purpose of Mapping and Scheduling. Tech-
nical Report CIS-TR-92-05, University of Oregon, Eugene, Oregon 97403. To
appear in Journal of Parallel and Distributed Computing. 
[Ma et al., 1982] Ma, P.-Y., Lee, E., and Tsuchiya, M. (1982). A task allocation
model for distributed computing systems. IEEE Trans. Comput.,
C-31(1):246-252.
[Mak and Lundstrom, 1990] Mak, V. and Lundstrom, S. (1990). Predicting per-
formance of parallel computations. IEEE Trans. Paral. Distr. Comput., 1(3):257-
270. 
[Manneback et al., 1992] Manneback, P., Quin, J., and Libert, G. (1992). Per-
formance models of MICCG algorithm on distributed memory MIMD com-
puters. In Joosen, W. and Milgrom, E., editors, Parallel Computing: from Theory 
to Sound Practice, pages 52-59. IOS Press.
[Martel, 1988] Martel, C. (1988). A parallel algorithm for preemptive sched-
uling of uniform machines. J. Parallel Dist. Comput., 5:700-715.
[McDowell and Appelbe, 1986] McDowell, C. and Appelbe, W. (1986). Pro-
cessor scheduling for linearly connected parallel processors. IEEE Trans. 
Comput., C-35(7):632-638. 
[McGreary and Gill, 1989] McGreary, C. and Gill, H. (1989). Automatic determ-
ination of grain size for efficient parallel programming. Comm. ACM, 32(9).
[McNaughton, 1959] McNaughton, R. (1959). Scheduling with deadlines and 
loss functions. Management Science, 6:1-12. 
[Meiko, 1992] Meiko (1992). So how does the CSN work anyway? Technical
report, Meiko Scientific Limited. 
[Muntz and Coffman Jr., 1969] Muntz, R. and Coffman Jr., E. (1969). Optimal
preemptive scheduling on two-processor systems. IEEE Trans. Comput., C-
18(11):1014-1020. 
[Nicol, 1989] Nicol, D. (1989). Optimal partitioning of random programs across 
two processors. IEEE Trans. Software Engrg., SE-15(2):134-141. 
[Nicol et al., 1992] Nicol, D., Simha, R., Choudhury, A., and Narahari, B. (1992).
Optimal Processor Assignment for Pipeline Computations. Technical Report 
91-79, ICASE, NASA, Langley Research Center, Virginia. NASA Contractor 
Report. 
[Norman, 1990] Norman, M. (1990). Multicomputer applications and how to
model the mapping problem. In Malek, M., editor, Parallel Computing for the
Physical Sciences, pages 37-42, Rocquencourt, France. ONR Europe / Inria.
[Norman et al., 1990] Norman, M., Boeres, C., and Thanisch, P. (1990). Minim-
ising message path lengths in multicomputers. Technical Report EPCC-TR-
90-17, Edinburgh Parallel Computing Centre. Under Review by Journal of 
Parallel and Distributed Computing. 
[Norman and Thanisch, 1991] Norman, M. and Thanisch, P. (1991). Models 
of machines and computations for mapping in multicomputers. Technical 
Report EPCC-TR-92-15, Edinburgh Parallel Computing Centre. Accepted by 
ACM Computing Surveys. 
[Norman and Thanisch, 1993] Norman, M. and Thanisch, P. (1993). Bounds 
beyond which there are no performance gains from adding processors to 
scheduling models. Technical Report EPCC-TR-93-06, Edinburgh Parallel 
Computing Centre. Submitted to Operations Research. 
[Norman et al., 1992] Norman, M., Thanisch, P., and Chang, K.-F. (1992). Par-
titioning dag computations: A cautionary note. In Joosen, W. and Milgrom,
E., editors, Parallel Computing: from Theory to Sound Practice, pages 360-364.
IOS Press.
[Papadimitriou and Ullman, 1987] Papadimitriou, C. and Ullman, J. (1987). A
communication-time tradeoff. SIAM J. Comput., 16(4):639-646.
[Papadimitriou and Yannakakis, 1990] Papadimitriou, C. and Yannakakis, M.
(1990). Towards an architecture-independent analysis of parallel algorithms.
SIAM J. Comput., 19:322-328.
[Picouleau, 1992] Picouleau, C. (1992). New complexity results on the UET-
UCT scheduling problem. In Proceedings of the Summer School on Scheduling
Theory and its Applications, Chateau de Bonas, France. 
[Pippenger, 1979] Pippenger, N. (1979). On simultaneous resource bounds (pre-
liminary version). In Proc. 20th IEEE FOCS, pages 307-311. 
[Pountain, 1989] Pountain, R. (1989). Configuring parallel programs. Part 1.
Byte, (December):349-352. 
[Pritchard et al., 1987] Pritchard, D., Askew, C., Carpenter, D., Glendinning, I.,
Hey, A., and Nicole, D. (1987). Practical parallelism using transputer arrays.
In de Bakker, J., Nijman, A., and Treleaven, P., editors, Parallel Architectures
and Languages, Lecture Notes in Computer Science, page 278. Springer Verlag, 
Berlin. 
[Ramamritham et al., 1990] Ramamritham, K., Stankovic, J., and Shiah, P.-F.
(1990). Efficient scheduling algorithms for multiprocessor systems. IEEE 
Trans. Paral. Distr. Comput., 1(2):184-194. 
[Rao et al., 1979] Rao, G., Stone, H., and Hu, T. (1979). Assignment of tasks in
a distributed processor system with limited memory. IEEE Trans. Comput., 
C-28(4):291-298. 
[Rayward-Smith, 1987a] Rayward-Smith, V. (1987a). The complexity of pre-
emptive scheduling given interprocessor communication delays. Inform. Pro-
cess. Lett., 25(2):123-125.
[Rayward-Smith, 1987b] Rayward-Smith, V. (1987b). UET scheduling with unit 
interprocessor communication delays. Discrete Appl. Math., 18:55-71.
[Reed and Fujimoto, 1987] Reed, D. and Fujimoto, R. (1987). Multicomputer
network operating systems. In Multicomputer networks: message-based parallel 
processing, pages 177-238. MIT Press, Cambridge Mass. 
[Sahni, 1976] Sahni, S. (1976). Algorithms for scheduling independent tasks. J.
ACM, 23(1):116-127.
[Sarkar, 1989] Sarkar, V. (1989). Partitioning and Scheduling Parallel Programs for 
Multiprocessors. Pitman, London. 
[Scott et al., 1992] Scott, S., Goodman, J., and Vernon, M. (1992). Performance
of the SCI ring. In Proceedings of the 19th International Symposium on Computer
Architecture, pages 403-414. IEEE Computer Society Press. 
[Seitz, 1990] Seitz, C. (1990). Concurrent architectures. In Suaya, R. and
Birtwhistle, G., editors, VLSI and Parallel Computation, pages 1-84. Morgan 
Kaufmann, Palo Alto, CA. 
[Sethi, 1976] Sethi, R. (1976). Algorithms for minimal length schedules. In
Coffman Jr., E., editor, Computer and Job Shop Scheduling Theory. John Wiley, 
New York. 
[Shen and Tsai, 1985] Shen, C.-C. and Tsai, W.-H. (1985). A graph matching
approach to optimal task assignment in distributed computing systems using 
a minimax criterion. IEEE Trans. Comput., C-34(3):197-203. 
[Sinclair, 1987] Sinclair, J. (1987). Efficient computation of optimal assignments
for distributed tasks. J. Parallel Dist. Comput., 4:342-362.
[Stone, 1977a] Stone, H. (1977a). Multiprocessor scheduling with the aid of 
network flow algorithms. IEEE Trans. Software Engrg., SE-3:85-93. 
[Stone, 1977b] Stone, H. (1977b). Program assignment in three processor sys-
tems and tricutset partitioning of graphs. Technical Report ECE-CS-77-7, 
Univ. Massachusetts, Amherst. 
[Stone, 1978] Stone, H. (1978). Critical load factors in distributed systems. IEEE
Trans. Software Engrg., SE-4:254-258. 
[Swamy and Thulasiram, 1981] Swamy, M. and Thulasiram, K. (1981). Graphs,
Networks and Algorithms. John Wiley and Sons. 
[Towsley, 1986] Towsley, D. (1986). Allocating programs containing branches
and loops within a multiple processor system. IEEE Trans. Software Engrg., 
SE-12(10):1018-1024. 
[Ullman, 1975] Ullman, J. (1975). NP-complete scheduling problems. J. of Com-
puter and System Sciences, 10:384-393.
[Upfal, 1984] Upfal, E. (1984). Efficient schemes for parallel communication. J.
ACM, 31(4).
[Valiant, 1982] Valiant, L. (1982). A scheme for fast parallel communication.
SIAM J. Comput., 11:350-361. 
[Valiant and Brebner, 1981] Valiant, L. and Brebner, G. (1981). Universal 
schemes for parallel communication. In Proceedings of the 13th Annual ACM 
Symposium on Theory of Computing, pages 263-267. ACM, New York. 
[Veltman et al., 1990] Veltman, B., Lageweg, B., and Lenstra, J. (1990). Multipro-
cessor scheduling with communication delays. Parallel Computing, 16:173-
182. 
[Wachspress, 1966] Wachspress, E. (1966). Iterative Solution of Elliptic Systems.
Prentice Hall Int.
[Wang and Cheng, 1992] Wang, Q. and Cheng, K. (1992). A heuristic of sched-
uling parallel tasks and its analysis. SIAM J. Comput., 21(2):281-294.
[Williams, 1983] Williams, E. (1983). Assigning processes to processors in dis-
tributed systems. In Proceedings IEEE Conference on Parallel Processing, pages 
404-406. 
[Yang and Gerasoulis, 1993] Yang, T. and Gerasoulis, A. (1993). List scheduling with
and without communication delay. Parallel Computing. To Appear. 
[Yang and Gerasoulis, 1992] Yang, T. and Gerasoulis, A. (1992). DSC: Sched-
uling parallel tasks on an unbounded number of processors. Technical report,
Rutgers University, Department of Computer Science. 
