Efficient parallel architecture for highly coupled real-time linear system applications by Homaifar, Abdollah et al.
Grant Number NAG8-093
Efficient Parallel Architecture for 
Highly Coupled Real-Time Linear System Applications 
Chester C. Carroll 
Cudworth Professor of Computer Architecture 
Abdollah Homaifar 
Temporary Visiting Assistant Professor 
of Electrical Engineering 
and 
Soumavo Barua 
Graduate Research Assistant 
Prepared for 
The National Aeronautics and Space Administration 
Bureau of Engineering Research 
The University of Alabama 
January 1988 
BER Report No. 419-17 
https://ntrs.nasa.gov/search.jsp?R=19880010631 2020-03-20T08:11:39+00:00Z
ACKNOWLEDGEMENT 
This research was supported by NASA, George C. Marshall Space 
Flight Center, Huntsville, Alabama, under Grant Number NAG8-093 and 
conducted in the Computer Architecture Research Laboratory in the 
College of Engineering at The University of Alabama. 
LIST OF ABBREVIATIONS 
ABPC     Adams Bashforth Predictor Corrector
GaAs     Gallium Arsenide
PE       Processing Element
PIA      Parallel Integration Algorithm
RISC     Reduced Instruction Set Computer
TUs      Time Units
WSI      Wafer Scale Integration
FAST     Flexible Architecture Simulation Task
REMPS    Reconfigurable Multiprocessor for Scientific Supercomputing
Xpn      Predicted value of variable X at the nth computing interval
Xcn      Corrected value of variable X at the nth computing interval
x        State variable
R        Control Weighting Matrix
Q        State Weighting Matrix
H        Terminal State Weighting Matrix
TABLE OF CONTENTS 
ACKNOWLEDGEMENTS
LIST OF ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1: INTRODUCTION
    1.1 Background
    1.2 Objective
    1.3 Research Phases
CHAPTER 2: APPLICATION AND MODEL DEVELOPMENT
    2.1 Problem Identification
    2.2 Solution Methods
    2.3 Parallel Integration Algorithms
    2.4 The Prototype Problem
CHAPTER 3: PARALLEL IMPLEMENTATION
    3.1 Task Graph Attributes
    3.2 Task Graph Development
    3.3 Task Matrix
    3.4 Scheduling Problem
    3.5 Scheduling Classification
    3.6 Approaches in Scheduling
    3.7 Assumptions in the Scheduling Algorithm
    3.8 Scheduling Algorithm
CHAPTER 4: SIMULATION AND PERFORMANCE EVALUATION
    4.1 Performance Evaluation Criterion
    4.2 Assumptions in Simulation
    4.3 Results of Simulation
CHAPTER 5: ARCHITECTURE AND HARDWARE DESIGNS
    5.1 Architectural Requirements
    5.2 PE System Design
    5.3 Technology Selection
    5.4 Interconnection and System Layout
    5.5 Future Directions
REFERENCES
APPENDIX A: SOLUTION OF OPTIMAL CONTROL LAW USING MATRIX RICCATI EQUATIONS
APPENDIX B: TASK GRAPH ATTRIBUTES FOR HIGHLY-COUPLED LINEAR SYSTEM EQUATIONS
APPENDIX C: FLOWCHART FOR SCHEDULING ALGORITHM
APPENDIX D: SCHEDULER ROUTINE IN PASCAL
LIST OF TABLES 
TABLE
1. Node Description for Task Graph
2. Task Matrix for Task Graph
3. Scheduling Techniques
4. Task Graph and Task Matrix
5. Elementary Operation on Task Matrix
6. Elementary Operation on Task Matrix
7. Elementary Operation on Task Matrix
8. Elementary Operation on Task Matrix
LIST OF FIGURES 
1.1 Overview of Research Project
2.1 Overall Problem Representation
2.2 Serial Computation Sequence
2.3 Parallel Computation Sequence
2.4 Reverse Parallel Computation Sequence
3.1 Example of a Task Graph
3.2 Task Graph Development
3.3 Function Task Block
3.4 Task Graph for a Single System Equation
4.1 Processor Execution Time
4.2 Processor Efficiency
4.3 Processor Speed Up
5.1 PE Design Schemata
5.2 System Architecture Layout
ABSTRACT 
A systematic procedure has been developed for exploiting 
the parallel constructs of computation in a highly coupled, 
linear system application. An overall top down design approach 
is adopted. 
Differential equations governing the application under 
consideration are partitioned into subtasks on the basis of a 
data flow analysis. The interconnected task units constitute a 
task graph which has to be computed in every update interval. 
Multiprocessing concepts utilizing parallel integration 
algorithms are then applied for efficient task graph execution. 
A simple scheduling routine has been developed to handle task 
allocation while in the multiprocessor mode. 
Results of simulation and scheduling are compared on the 
basis of standard performance indices. Processor timing diagrams 
have been developed on the basis of program output accruing to 
an optimal set of processors. 
Basic architectural attributes for implementing the system
are discussed together with suggestions for processing element
design. Emphasis has been placed on flexible architectures that
are capable of accommodating widely varying application
specifics.
CHAPTER 1
INTRODUCTION 
1.1 Background: 
Real-time application algorithms are characterized by complex
and time-consuming computations suitable for processing on large
mainframes and associated machines. However, cost and space
constraints would favor the development of small multiprocessor
machines that are capable of exploiting the inherent parallel
constructs of computation [1]. With decreasing hardware costs, a
large number of processors may be grouped together to form
specialized processing clusters or modules [2]. Flexible
customization methodology may serve to utilize these specialized
hardware modules to achieve computational speeds that are beyond
the limits of uniprocessor sequential methods. The vast increase
in computing power, accompanied by the drastic reduction in cost,
makes parallel processing in a multiprocessor environment a
viable option for the critical timing constraints of real-time
applications.
1.2 Objective: 
The objective of this research is to develop a systematic
procedure for evolving a computational model that is
particularly amenable to parallel processing in a
multiprocessor environment. An overall top-down approach (see
Figure 1.1) is adopted. Any real-time system may be represented
in general by a set of differential equations which govern the
dynamic behavior of the system. As a specific example, a
prototype real-time control problem is modeled as a set of
differential equations. These are mapped onto a task graph which
is then allocated to a set of processors in accordance with an
allocation algorithm. This is followed by a verification and
comparison stage wherein the results of such a mapping are
compared with those of traditional uniprocessor methods in terms
of speed-up ratio, efficiency and average processor utilization.
Finally, hardware schemata are included for processors and their
design.
1.3 Research Phases: 
Research was conducted in the following phases: 
1) Problem Identification 
2) Task Graph Development 
3)  Scheduling and Simulation 
4) Hardware and software issues 
A few simplifying assumptions were made throughout the
overall research. Interprocessor communication time was
neglected in all cases. Although the author acknowledges that
this is not a very practical assumption, the overall performance
improvement would not be greatly undermined even if such delays
were taken into account. Finally, an inexhaustible supply of
hardware resources has been assumed. The number of available
processors has been treated as a variable parameter which may be
optimized to obtain maximum speed of execution. It is this
singular fact that makes a flexible architecture the best
hardware support for this project.
Figure 1.1 Overview of Research Project
(Flow: application, a real-time tracking problem; task graph;
scheduling and allocation; software issues and hardware
structures.)
CHAPTER 2
APPLICATION AND MODEL DEVELOPMENT
A vast majority of real-time control problems can be
represented by a stochastic system of equations and an
associated cost function or performance index. The dynamic
behavior of the system is modeled by a set of linear state
equations of the form:
\[ \dot{x}(t) = A(t)\,x(t) + B(t)\,u(t) \]
The major objective in such a system model is to obtain the
optimal control law by minimizing the overall cost function [3].
2.1 Problem Identification 
A typical class of optimal control problems is of the
tracking type. These are primarily concerned with constraining
the motion of a body to a defined trajectory and are widely used
in attitude control of rockets, missile guidance, aircraft
landing analysis, etc. The cost function to be minimized for
optimal control is commonly represented as:
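A sketch of the usual form, assuming the standard linear-quadratic tracking formulation with a reference trajectory r(t) and the weighting matrices H, Q and R named in the list of abbreviations. The limits t_0 and t_f and the factor of one half are conventions assumed here, not taken from this copy of the report:

```latex
J = \tfrac{1}{2}\,[x(t_f) - r(t_f)]^{T} H \,[x(t_f) - r(t_f)]
  + \tfrac{1}{2} \int_{t_0}^{t_f} \Big\{ [x(t) - r(t)]^{T} Q(t)\,[x(t) - r(t)]
  + u^{T}(t)\, R(t)\, u(t) \Big\}\, dt
```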
Modern control theory suggests two principal ways of
solving such problems (Appendix A). One convenient technique is
the generation of a set of first order differential equations
known as the Matrix Riccati Differential Equations (see Figure
2.1) having the form:
\[ \dot{K}(t) = -K(t)A(t) - A^{T}(t)K(t) - Q(t) + K(t)B(t)R^{-1}(t)B^{T}(t)K(t) \]
\[ \dot{s}(t) = -[A^{T}(t) - K(t)B(t)R^{-1}(t)B^{T}(t)]\,s(t) + Q(t)\,r(t) \]
It may be easily proved that if K is an "n by n" symmetric matrix
and s is an "n by 1" vector, then the above matrix equations
reduce to a set of "n(n+1)/2+n" first order differential
equations which have to be solved in real time. With large values
of "n", as is true for most practical systems, an inconveniently
large set of equations is obtained. Even with currently available
technology, it requires a mini supercomputer to perform the
necessary computations.
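To give a feel for how quickly the equation count grows, the short sketch below evaluates n(n+1)/2 + n for a few system orders. The function name and the sample values of n are illustrative choices, not taken from the report.

```python
def riccati_equation_count(n):
    """Scalar first-order ODEs for a symmetric n-by-n K plus an n-by-1 s."""
    # A symmetric K has n(n+1)/2 independent entries; s contributes n more.
    return n * (n + 1) // 2 + n


for order in (2, 10, 50):
    print(order, riccati_equation_count(order))
```

For the second-order prototype considered later (n = 2) this gives five equations.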
2.2 Solution Methods 
Several standard software routines, using methods such as the
Runge-Kutta method and the Adams-Bashforth method, are available
for solving differential equations and may be applied to the
solution of the Matrix Riccati Equation. However, these are
sequential techniques with a set limitation on execution speed.
By employing parallel integration algorithms (PIA) it is possible
to obtain a greater throughput while maintaining the same level
of accuracy [4]. The method presented here is a modified version
of that proposed by Willard L. Miranker and Werner Liniger [5].
Figure 2.1 Overall Problem Representation
(Flow: stochastic system equations and cost function; algebraic
manipulations relevant to optimal control theory; Matrix Riccati
differential equations; solve for the K and s matrices using
initial conditions; substitution; optimal control law as a
function of K, s and x. Annotations: K is a symmetric matrix; s
is an n by 1 vector; a set of n(n+1)/2 + n first order
differential equations has to be solved; R = control weighting
matrix, H = terminal state weighting matrix, Q = state weighting
matrix.)
A modification is necessary because the aforementioned authors
developed their algorithm for standard differential equations,
which are typically initial value problems, as opposed to the
Matrix Riccati Equations, where the integration has to be carried
out backwards in time.
2.3 Parallel Integration Algorithm 
A widely used technique for solving differential
equations is the Adams-Bashforth Predictor Corrector (ABPC)
method. For a general problem of the type
\[ y' = f(t, y) \]
the difference equations for a two step ABPC method are given by
\[ y^{p}_{n} = y^{c}_{n-1} + \frac{h}{2}\,[\,3 f^{c}_{n-1} - f^{c}_{n-2}\,] \]
\[ y^{c}_{n} = y^{c}_{n-1} + \frac{h}{2}\,[\,f^{p}_{n} + f^{c}_{n-1}\,] \]
where h is the step increment, that is, the length of the
integration interval divided by (n-1).
It is apparent that the predicted value at the (n)th step
is used in the next stage of the calculation to compute the
corrected value at the (n)th step. The sequence of computation is
schematized (see Figure 2.2). The "P" and "C" lines denote the
predicted and corrected values of the function. A hypothetical
computation front is indicated by means of a dotted line. The
directed line segments show that at the (n)th mesh point, results
flow in from both sides of the computation front, thereby
precluding any chance of simultaneous prediction and correction.
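The serial dependence can be seen in a minimal sketch of the textbook two step ABPC pair (second-order Adams-Bashforth predictor, trapezoidal corrector). The function name, the Euler start-up step and the test problem y' = -y are assumptions made for illustration only.

```python
import math


def abpc_serial(f, t0, y0, h, steps):
    """Integrate y' = f(t, y) with the two step ABPC pair; returns y at t0 + steps*h."""
    f_older = f(t0, y0)              # f at mesh point n-2
    y_prev = y0 + h * f_older        # Euler start-up value at mesh point n-1 (assumption)
    t_prev = t0 + h
    for _ in range(steps - 1):
        f_prev = f(t_prev, y_prev)                               # f at mesh point n-1
        y_pred = y_prev + 0.5 * h * (3.0 * f_prev - f_older)     # predictor for step n
        t_now = t_prev + h
        # The corrector cannot start until y_pred for the same step is known.
        y_corr = y_prev + 0.5 * h * (f(t_now, y_pred) + f_prev)
        f_older, y_prev, t_prev = f_prev, y_corr, t_now
    return y_prev


approx = abpc_serial(lambda t, y: -y, 0.0, 1.0, 0.01, 100)
print(approx, math.exp(-1.0))
```

Within each pass of the loop the corrector consumes the predictor output of the same step, which is why the two stages cannot overlap.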
Figure 2.2 Serial Computation Sequence
A suitable modification converts this sequential technique 
into an effective PIA. The modified equations are: 
The computation front and associated sequence of
computation are shown (see Figure 2.3). The arrows indicate that
calculation at any step depends only on information at previous
mesh points. This implies that the parallel implementation
simultaneously accommodates prediction at the (n)th step and
correction at the (n-1)th mesh point, and thus may be executed in
parallel on two arithmetic processors.
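As a rough sketch of the idea, the second-order parallel predictor-corrector pair published by Miranker and Liniger can be coded as below. This is the textbook form and is not necessarily identical to the modified equations of this report. The function name, the Euler start-up value and the test problem are assumptions. The predictor runs one mesh point ahead of the corrector, so both updates in the loop body use only values that are already available.

```python
import math


def parallel_pc(f, t0, y0, h, steps):
    """Miranker-Liniger style pair: predict at n+1 while correcting at n."""
    yc_prev = y0                      # corrected value at mesh point n-1
    yp = y0 + h * f(t0, y0)           # predicted value at mesh point n (Euler start-up)
    t = t0 + h                        # time of mesh point n
    for _ in range(steps):
        fp = f(t, yp)                 # derivative from the predicted value at n
        fc_prev = f(t - h, yc_prev)   # derivative from the corrected value at n-1
        # The next two updates do not depend on each other, so each
        # could be assigned to its own arithmetic processor.
        yp_next = yc_prev + 2.0 * h * fp             # predictor for point n+1
        yc = yc_prev + 0.5 * h * (fp + fc_prev)      # corrector for point n
        yc_prev, yp, t = yc, yp_next, t + h
    return yc_prev                    # corrected value at t0 + steps*h


approx = parallel_pc(lambda t, y: -y, 0.0, 1.0, 0.01, 100)
print(approx, math.exp(-1.0))
```

In this sketch the two updates are simply written one after the other; a real two-processor implementation would issue them concurrently.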
Application of this technique to the solution of Matrix
Riccati equations requires the computation front to proceed
backward in time. For this purpose the aforementioned parallel
difference equations are modified to yield:
[equations illegible in the source]
The corresponding computation front is also shown (see
Figure 2.4).
2.4 The Prototype Problem
A prototype reflects an actual problem area with all its
attributes but in smaller dimensions. It provides the researcher
with a congenial environment in which to experiment with novel
schemes. In this thesis, a prototype tracking problem has been
considered so as to illustrate the basic concepts and ideas that
were developed in the course of the research.
Figure 2.3 Parallel Computation Sequence
Figure 2.4 Reverse Parallel Computation Sequence
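The system to be controlled is assumed to be represented by
two state equations:
\[ \dot{x}_1(t) = x_2(t) \]
\[ \dot{x}_2(t) = 2x_1(t) - x_2(t) + u(t) \]
The performance index to be minimized is
\[ J(u) = \int_{t_0}^{t_f} \{\, [x_1(t) - 0.2t]^2 + 0.025\,u^2(t) \,\}\, dt \]
In this problem the major objective is to maintain the
state x1 close to the ramp function r1(t) = 0.2t. The Matrix
Riccati equations for such a system are:
\[ \dot{k}_{11}(t) = 2\,[\,10 k_{12}^2(t) - 2 k_{12}(t) - 1\,] \]
\[ \dot{k}_{12}(t) = 20 k_{12}(t) k_{22}(t) - k_{11}(t) + k_{12}(t) - 2 k_{22}(t) \]
\[ \dot{k}_{22}(t) = 2\,[\,10 k_{22}^2(t) - k_{12}(t) + k_{22}(t)\,] \]
\[ \dot{s}_1(t) = 2\,[\,10 k_{12}(t) - 1\,]\, s_2(t) + 0.4t \]
\[ \dot{s}_2(t) = -s_1(t) + [\,20 k_{22}(t) + 1\,]\, s_2(t) \]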
All the equations in the above set are cross coupled. 
However, the computational parallelism inherent in the equations 
may be exploited to obtain a higher throughput. This is 
discussed in the next chapter. 
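The cross coupling can be made concrete with a minimal sketch of the five right-hand sides. It assumes the prototype plant is x1' = x2, x2' = 2 x1 - x2 + u with the quadratic cost quoted above (so that, with the customary factor of one half, Q = diag(2, 0) and R = 0.05). The expressions are derived here from the general Matrix Riccati formulas of Section 2.1 under that assumption, and the function name riccati_rhs is hypothetical.

```python
def riccati_rhs(t, k11, k12, k22, s1, s2):
    """Right-hand sides of the five scalar Riccati equations (assumed prototype)."""
    dk11 = 2.0 * (10.0 * k12 ** 2 - 2.0 * k12 - 1.0)
    dk12 = 20.0 * k12 * k22 - k11 + k12 - 2.0 * k22   # couples all three K entries
    dk22 = 2.0 * (10.0 * k22 ** 2 - k12 + k22)
    ds1 = 2.0 * (10.0 * k12 - 1.0) * s2 + 0.4 * t
    ds2 = -s1 + (20.0 * k22 + 1.0) * s2
    return dk11, dk12, dk22, ds1, ds2


# All five expressions read only values from the previous step, so within
# one update interval they can be evaluated independently of one another.
print(riccati_rhs(1.0, 1.0, 0.5, 0.25, 1.0, 2.0))
```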
CHAPTER 3
PARALLEL IMPLEMENTATION
One of the important potentials of multiprocessor systems
is the ability to speed up computation by concurrently
processing independent portions of a given assignment [1, 11].
Extensive research is being carried out to develop mathematical
models that can be solved efficiently on parallel processors
[6]. The first step in developing such multiprocessor models is
to identify the parallelism within the mathematical formulation
of the problem. This necessitates a data flow analysis of the
problem with a subsequent evolution of a "task graph". This is
then allocated to a set of processors by means of a scheduling
algorithm so as to obtain the minimum achievable execution time.
3.1 Task Graph Attributes 
A task graph represents a set of "jobs" or "computation
units" arranged in accordance with certain precedence
constraints. Such a set is generally described by a "finite
directed acyclic graph" [7] and is assumed to have single entry
and terminal nodes through which all other nodes may be
accessed. Task execution times are represented by node weights
[8]. An example of a task graph is shown (see Figure 3.1).
Figure 3.1 Example of a Task Graph
In most practical problems, the mathematical nature of the
model yields a set of closely coupled equations, as is also true
for the prototype problem under consideration. Hence it becomes
a difficult task not only to identify the areas of mathematical
parallelism [6] but also to integrate these with the solution
techniques (like ABPC) under consideration.
A few important notions must be explicitly stated before 
any attempt is made to outline a systematic procedure for task 
graph development. 
A "data flow graph" is very similar to a task graph except 
that the latter precludes all logical constructs of an incumbent 
program. In its simplest form, a task graph reflects an attempt 
to partition computation tasks in an optimum manner without any 
reference to logic statements which may have a representation in 
an equivalent data flow graph. 
Being very closely related to the mathematical model of the 
system, a task graph is unique and specific to a particular 
application. The same system under different functional 
operations may require an entirely different task layout. 
Even when the system model is partitioned into several
independent paths which may be computed in parallel, there
exists a "critical path" which sets a lower limit on the
achievable execution time. No amount of task decentralization in
the form of a well balanced task graph, or of processor computing
power, can overcome the timing constraint set by the critical
path. It is imperative that the update interval of data is no
shorter than the calculation time of the critical path.
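A minimal sketch of how the critical path length might be computed for a weighted task graph is given below. The dictionary layout, the function name and the four-task example graph are hypothetical; node weights stand for task execution times and communication time is ignored, as elsewhere in this report.

```python
def critical_path_length(weights, successors):
    """Longest weighted path through a directed acyclic task graph."""
    memo = {}

    def longest_from(node):
        # Weight of this node plus the heaviest chain that follows it.
        if node not in memo:
            tail = max((longest_from(s) for s in successors.get(node, ())), default=0)
            memo[node] = weights[node] + tail
        return memo[node]

    return max(longest_from(node) for node in weights)


# Hypothetical graph: tasks 1 and 2 both precede task 4; task 3 stands alone.
weights = {1: 10, 2: 20, 3: 30, 4: 10}
successors = {1: [4], 2: [4]}
print(critical_path_length(weights, successors))
```

However many processors are used, the schedule for this example cannot finish in fewer time units than the value printed.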
3.2 Task Graph Development 
A top-down design strategy is adopted in task graph
development (see Figure 3.2). The system differential equations
are partitioned and combined with standard integration
techniques (ABPC in this case) to yield a set of difference
equations. Subsequently, a data flow analysis is made wherein
each difference equation is further broken up into simpler
computation units in consonance with the mathematical attributes
of the system. This procedure of task fragmentation is
repeated until elementary computer operations
(addition, subtraction, multiplication and division) or basic
task units result. These are all interconnected and yield a
complex mesh which is collectively called the "task graph" for
the application under consideration. An attempt is made to keep
the overall task graph reasonably balanced so as to preclude
unduly long critical paths.
To illustrate the above concepts, let us consider one of
the differential equations having a high degree of cross
coupling:
\[ \dot{k}_{12}(t) = 20\,k_{12}(t)\,k_{22}(t) - k_{11}(t) + k_{12}(t) - 2\,k_{22}(t) \]
The first step is to make a data flow analysis for the equation
above. This is done by constructing a function task block "f12"
(see Figure 3.3). The nodes in the first level are either data
constants or values of "k12" and "k22" at the previous update
interval. The subsequent levels keep a numerical count of the
elementary operations involved, with "1*" within a node
indicating one multiplication. Similarly, "1+,2-" indicates a
total of three operations comprising one addition and two
subtractions. Task time is counted on the basis of "time units"
or "TUs". Multiplication and division are assigned a weight of
3 TUs, compared to addition and subtraction, which take 2 TUs.
The function task block has a total count of 6 operations
equaling at least 15 TUs.
Figure 3.2 Task Graph Development
(Design strategy: top-down approach. Flow: system differential
equations combined with a standard technique of numerical
integration; difference equations; fragmentation of computation
steps; mesh of elementary operations; task graph; task matrix;
input to allocation algorithm.)
Figure 3.3 Function task block
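The time unit bookkeeping can be sketched as follows. The weights of 3 TUs and 2 TUs are those quoted above; the operation mix of three multiplications, one addition and two subtractions for the "f12" block is an inference from the stated totals (6 operations, 15 TUs), and the names used are illustrative.

```python
# Cost in time units (TUs) of each elementary operation, as quoted in the text.
TU_COST = {"*": 3, "/": 3, "+": 2, "-": 2}


def task_time(op_counts):
    """Total TUs for a block, given a mapping of operation symbol to count."""
    return sum(TU_COST[op] * count for op, count in op_counts.items())


# Assumed operation mix for the function task block f12.
f12_ops = {"*": 3, "+": 1, "-": 2}
print(sum(f12_ops.values()), "operations,", task_time(f12_ops), "TUs")
```

The figure of 15 TUs is the serial total; operations that sit on independent branches of the block could overlap when more than one PE is available.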
The given equation along with the function task block must 
be integrated with the ABPC method. The difference equation to 
be solved becomes: 
Again on the basis of data flow, a track of the flow of
computation is maintained, and the resulting interconnected mesh
of simple operations constitutes the task graph for the
equation in question (see Figure 3.4). An interesting feature of
this task graph is that it is non-terminating in nature. Apart
from the data constants, the parameter values are updated in
every sampling interval. The systematic node description for the
task graph under consideration is shown in Table 1. Each
differential equation of the original set is thus fragmented to
yield a sub-task graph; these are then interlinked to yield the
overall task graph for the system. This has been shown in
Appendix B.
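Figure 3.4 Task Graph for a Single System Equation

TABLE 1
NODE DESCRIPTION FOR TASK GRAPH IN FIGURE 3.4
Node No.     Parameter     Operation
(Row alignment is not recoverable; the surviving entries of each
column are listed in order.)
Node No.:    2, 3, 4, ..., 18
Parameter:   h = constant; 2 = data constant; 20 = data constant;
             f(k12)cn; 0.0
Operation:   NOP, NOP, Load, Load, Load, NOP, Load, /, *, Load, *,
             +, *, Load, Load, ...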
3.3 Task Matrix
A task graph for a practical problem is quite imposing in
its complexity. A "Task Matrix" offers a convenient and concise
technique for representing a task graph while maintaining all
precedence constraints. For a faithful representation, a task
matrix should have the following fields:
1) Task Field (T): It indicates the task number.
2) Task Enable Field (E): It can assume only two
values, a "HI" indicated by binary "1" and a "LO" indicated by
binary "0". Whenever E=1, the corresponding task is enabled.
3) Pending Task Queue Field (Q): It represents the
number of tasks pending at each node. It provides a count of the
immediate predecessor tasks that have to be executed before the
task itself can execute. A task unit at a particular level in the
task graph may be enabled only if the corresponding value of Q = 0.
4) Successor Field (S): This is an array field
which keeps track of the immediate successor tasks at
each node.
5) Weight Field (W): It shows the time taken for the
task defined by the node under consideration to execute. The
weight field is assigned arbitrarily, as the speed of execution
tends to vary with the hardware features of the selected
processor. However, reasonable assumptions are made while
assigning weights, e.g., a task unit defining multiplication must
have a larger execution time than one that defines addition.
The task matrix for the task graph in Figure 3.1 is
shown (see Table 2). The tasks are numbered from "1" to "8" with
weights being "don't care" denoted by "X". "0" represents the
input node whereas "*" denotes the terminal node. At the start
of execution any one of the tasks 1, 2 and 3 may be executed, and
this is indicated by E = 1 and Q = 0 in the corresponding fields.
Task 4 has Q = 2 because it has two immediate pending or
predecessor tasks in tasks 1 and 2. Tasks 5 and 6 are the
successors of task 3 as shown in the S field. Tasks 6, 7 and 8
terminate in the output node indicated by "*".

TABLE 2
TASK MATRIX FOR TASK GRAPH IN FIGURE 3.1
T    E    Q    S      W
1    1    0    4      X
2    1    0    4      X
3    1    0    5,6    X
4    0    2    7      X
5    0    1    8      X
6    0    1    *      X
7    0    1    *      X
8    0    1    *      X
T = TASK NUMBER FIELD.
E = TASK ENABLE FIELD.
Q = PENDING TASK QUEUE FIELD.
S = SUCCESSOR TASK FIELD.
W = WEIGHT FIELD.
X = DON'T CARE.
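A minimal sketch of how the E and Q fields follow from the successor lists alone is given below. The dictionary-based layout and the function name are illustrative and are not the data structure of the Pascal routine in Appendix D; an empty successor list stands for the terminal marker "*".

```python
def build_task_matrix(successors, weights=None):
    """Derive the E and Q fields of a task matrix from successor lists."""
    pending = {task: 0 for task in successors}       # Q field
    for succs in successors.values():
        for s in succs:
            pending[s] += 1                          # one more immediate predecessor
    matrix = {}
    for task in sorted(successors):
        matrix[task] = {
            "E": 1 if pending[task] == 0 else 0,     # enabled when nothing is pending
            "Q": pending[task],
            "S": list(successors[task]),
            "W": weights[task] if weights else "X",  # "X" = don't care
        }
    return matrix


# Successor lists read off Table 2 for the task graph of Figure 3.1.
fig_3_1 = {1: [4], 2: [4], 3: [5, 6], 4: [7], 5: [8], 6: [], 7: [], 8: []}
for task, row in build_task_matrix(fig_3_1).items():
    print(task, row["E"], row["Q"], row["S"], row["W"])
```

The printed E and Q columns can be compared row by row with Table 2.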
3.4 Scheduling Problem 
The scheduling problem primarily deals with resource
optimization. Stated simply, it reduces to: "Given a set of tasks
or computations along with a set of operational precedence
relationships that exist between certain of these tasks, and
given a set of 'k' identical processors, how does one sequence
or schedule these tasks on the 'k' processors so that they
execute in minimum time?" [8]. By definition a "scheduler" is an
algorithm that uniquely specifies which job unit is to be
serviced next by a resource [10]. To this end, an efficient
scheduling algorithm needs to be developed which undertakes
task allocation and sequencing. Problems of this type
are commonly referred to as the "minimum execution time
multiprocessor scheduling problem" [7].
3.5 Scheduling Classification 
Task scheduling by itself forms an interesting area of
research and draws heavily on concepts of graph theory and
operations research. A number of scheduling strategies are in
vogue (see Table 3), each being suitable for a specific
application. The major classes of schedulers are categorized as
pre-emptive or non pre-emptive.
A pre-emptive scheduler is capable of selecting and
assigning a job to a server at any time irrespective of job
completion; that is, a pre-emptive scheduler assumes that jobs
are interruptible and will interrupt a job if another job of
higher priority needs service. The overall flexibility of the
schedule increases due to pre-emption, but at the cost of
hardware overhead and job "set-up" time. On the contrary, a non
pre-emptive scheduler allows no job-switching; that is, once a
job is assigned to a resource it has to be completed before
another job can be accommodated, even though the latter may have
a higher priority.
3.6 Approaches to the Scheduling Algorithm
The scheduling problem may be approached from two different
angles.
(1) Given a task graph and a set of 'k' processors, a
task assignment routine has to be developed that yields a
description of the tasks done by each processor as a function of
time. It ensures an optimum processor packing of task units so
as to yield maximum resource utilization and at the same time
attain a maximum speed of execution.
(2) Given a task graph, the scheduler keeps the
option of available hardware open and selects an optimum number
of processors for executing the task graph in minimum time. The
number of available processors in this case is a variable
parameter which is optimally selected by the scheduling
algorithm. This approach pre-supposes a flexible architecture
for its realization, since it needs a variable number of
processors and sacrifices hardware utilization to get a higher
throughput.
The scheduling algorithm that is developed is primarily
based on the second approach.

TABLE 3
SCHEDULING TECHNIQUES
Scheduler Name    Type of Operation
FCFS              First-come-first-served
SJFS              Shortest-job-first
LCFS              Least-completed-first
EDFS              Earliest-due-time-first
HSFS              Highest-static-priority-first
RR                Round robin
3.7 Assumptions in developing the Scheduling Algorithm 
The scheduling algorithm developed is based on the
following assumptions:
1) Scheduling is non pre-emptive and all task
allocation is static.
2) Execution time of each task is known a priori.
3) Interprocessor and intraprocessor communication
times are negligible.
4) Task weights are assigned arbitrarily but
uniformity is maintained between comparable tasks. Tasks
requiring longer CPU time (like multiplication) have been
assigned larger weights compared to tasks requiring lower CPU
time (like register move, addition, etc.). Such arbitrariness is
primarily due to the lack of well defined execution-time
standards on account of the widely varying processor types
available currently. Moreover, conceptually the algorithmic
implementation is independent of the weights assigned to the
task units.
3.8 Scheduling Algorithm 
The scheduling algorithm (originally credited to Oschner)
maps the task graph onto a task matrix and seeks to obtain an
optimum schedule by means of elementary operations on the task
matrix. The step by step detail for the algorithm is as follows:
1) A task matrix is defined by five fields T, E, Q, S, W.
2) A task is enabled only when E=1 and Q=0.
3) An enabled task can be allocated to a free PE
only.
4) A task unit assigned to a PE has its E field
decremented to zero, that is, E=0 for an assigned task unit.
5) After task completion, the successor or S field of
the task is examined so as to decrement the Q field of each
successor.
6) All successor tasks having Q=0 as a result of the
decrement are enabled.
7) Execution is repeated whenever a PE becomes idle.
8) Scheduling is complete when all tasks have E=0 and
Q=0.
As a specific example, a simple task graph and its associated
task matrix are considered (see Table 4). Initially any of
tasks 1, 2 and 3 may be allocated, depending on the number of
processors available. Assuming that all three tasks are assigned,
execution (timeprocessing in the Pascal routine, Appendix D)
begins and the respective "E" fields are reduced to zero (see
Table 5). Task 1, having minimum weight, is completed first, so
that the PE to which it is assigned is the first to become idle.

TABLE 4
TASK GRAPH AND TASK MATRIX
T    E    Q    S    W
1    1    0    4    10
2    1    0    4    20
3    1    0    *    30
4    0    2    *    10

TABLE 5
ELEMENTARY OPERATION ON TASK MATRIX
T    E    Q    S    W
1    0    0    4    10
2    0    0    4    20
3    0    0    *    30
4    0    2    *    10

TABLE 6
ELEMENTARY OPERATION ON TASK MATRIX
T    E    Q    S    W
1    0    0    4    10
2    0    0    4    20
3    0    0    *    30
4    0    1    *    10
When this stage is reached, the scheduling process takes over.
The successor field of task 1 is examined, which points to task
4. The scheduler now decrements the Q field of task 4, thereby
making it equal to 1 (see Table 6).
Even though task 1 is complete, task 4 cannot be assigned
until task 2 ends. So task execution starts again, with the PE to
which task 1 was assigned remaining idle. When task 2 is
completed, the scheduler looks at the corresponding S field,
which again points to task 4. The Q field of task 4 is
decremented to zero as a result. The scheduler now sets the E
field of task 4, thereby enabling it (see Table 7). Task 4 is
assigned to an available PE and its E field is reduced to zero.
When all tasks have been assigned and execution is complete, the
E and Q fields of all tasks equal zero and the resulting task
matrix is shown in Table 8.
From this example, it becomes clear that by elementary
operations (like look up, decrement, etc.) it is possible to
keep a dynamic track of a variable number of tasks and PEs. The
resulting information is adequate to set up a timing diagram or
"Gantt Chart" schedule for each PE, which is of considerable help
in calculating the overall time necessary to execute the task
graph. By varying the number of processors used,
considerable insight into overall performance is obtained. These
factors are discussed subsequently.
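The walk-through above can be reproduced with a small event-driven sketch of the eight steps. This is not the Pascal routine of Appendix D; the function name, the data layout and the first-enabled, lowest-numbered-PE tie-breaking rule are assumptions. Communication time is neglected, as in the rest of this chapter.

```python
import heapq


def schedule(tasks, num_pes):
    """Non pre-emptive list scheduling; tasks maps id -> (weight, successor list).

    Returns (total execution time, [(task, pe, start, finish), ...]).
    """
    if num_pes == 0:
        raise ValueError("at least one PE is required")
    pending = {t: 0 for t in tasks}                        # Q field
    for _, succs in tasks.values():
        for s in succs:
            pending[s] += 1
    enabled = sorted(t for t in tasks if pending[t] == 0)  # E = 1 and Q = 0
    free_pes = list(range(num_pes))
    running = []                                           # heap of (finish, task, pe)
    gantt, now = [], 0
    while enabled or running:
        while enabled and free_pes:                        # steps 3 and 4: allocate
            task, pe = enabled.pop(0), free_pes.pop(0)
            finish = now + tasks[task][0]
            heapq.heappush(running, (finish, task, pe))
            gantt.append((task, pe, now, finish))
        now, done, pe = heapq.heappop(running)             # next PE to become idle
        free_pes.append(pe)
        free_pes.sort()
        for s in tasks[done][1]:                           # steps 5 and 6
            pending[s] -= 1
            if pending[s] == 0:
                enabled.append(s)
    return now, gantt


# Task graph of Table 4: tasks 1 and 2 precede task 4; task 3 is independent.
table_4 = {1: (10, [4]), 2: (20, [4]), 3: (30, []), 4: (10, [])}
for pes in (1, 2, 3):
    print(pes, "PE(s):", schedule(table_4, pes)[0], "time units")
```

The second element of the returned tuple lists a start and finish time for every task on every PE, which is the information needed to draw the Gantt chart mentioned above.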
TABLE 7
ELEMENTARY OPERATION ON TASK MATRIX
T    E    Q    S    W
1    0    0    4    10
2    0    0    4    20
3    0    0    *    30
4    1    0    *    10

TABLE 8
ELEMENTARY OPERATION ON TASK MATRIX
T    E    Q    S    W
1    0    0    4    10
2    0    0    4    20
3    0    0    *    30
4    0    0    *    10
CHAPTER 4
SIMULATION AND PERFORMANCE EVALUATION
The evaluation of a computer system generally involves the 
following classes of considerations: 
1) Performance 
2) Cost
3)  User convenience 
4) Reliability 
An attempt is made here to provide a critical appraisal of 
overall performance improvement when the system under 
consideration is subjected to the previously described parallel 
model of implementation. 
4.1 Performance Evaluation Criterion 
The primary requirements for performance evaluation are: 
1) Analysis 
2)  Simulation 
3)  Measurements 
Analysis and simulation is accomplished by partitioning the 
system differential equations into task units which are then 
allocated to a variable set of processors. The merit of the 
scheme is judged on the basis of the following performance 
indices : 
33 
34 
. 1) Execution time 
2)  Percentage speed-up 
3)  Percentage efficiency 
Execution time may be defined as the time required by a
given set of processors to execute the task graph in question.
For a real-time control problem, the execution time is of great
significance and must be less than the periodic update time.
The increase in speed of computation with a larger number
of processors compared to that of a uniprocessor is generally
denoted by the percentage speed-up factor. If "t" is the time
required to execute a task graph using a set of "p" processors
and "m" equals the time to do the same using a single processor,
then the speed-up factor [9] is given by:
speed-up = (m / t)
The percentage efficiency shows the overall resource
utilization for a parallel implementation. Mathematically,
% efficiency = (m / (t * p)) * 100
Percentage efficiency reflects the idle time of the PEs: the
more time the PEs spend idle, the lower its value. It has a
value of 100% for a uniprocessor system (p = 1 and t = m), as
can be verified from the mathematical expression.
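As a quick check of the two indices, the short sketch below evaluates them for made-up timings. The function names and the numbers are assumptions chosen for illustration and are not results from this study.

```python
def speed_up(m, t):
    """Speed-up: single-processor time m divided by parallel time t."""
    return m / t


def percent_efficiency(m, t, p):
    """Percentage efficiency: speed-up per processor, times 100."""
    return m / (t * p) * 100.0


# Hypothetical timings: 120 time units on one PE, 30 time units on 5 PEs.
m, t, p = 120, 30, 5
s = speed_up(m, t)                   # ratio of serial to parallel time
eff = percent_efficiency(m, t, p)    # share of PE time spent on useful work

# Uniprocessor case: p = 1 and t = m give 100 percent by construction.
eff_uni = percent_efficiency(m, m, 1)
```

With these figures the five PEs finish four times faster than one PE, so one fifth of the available PE time is idle.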
4.2 Assumptions in Simulation 
To facilitate and simplify analysis, the following model
for a parallel implementation is adopted:
1) an unlimited number of processors is available.
2) each PE is capable of evaluating any of the four
fundamental arithmetic operations (+, -, *, /).
3) data and memory alignment times are neglected.
Although assumptions 1) and 3) appear unrealistic,
decreasing hardware costs are giving rise to large
multiprocessor systems which have an almost unlimited number of
processors, e.g., the Hypercube and the Butterfly Computer, which
has 256 PEs with scope for further expansion. Similarly, data
and memory time penalties simply offset the computation results
by a fixed factor and therefore do not form a barrier to the
conceptual implementation of a parallel model.
4.3 Results of Simulation
The task flow pattern for the linear system is simulated
using a variable number of PEs, and at each stage the
aforementioned performance indices are recorded. A graphical
representation of these indices shows several interesting features.
The execution time curve (see Figure 4.1) drops sharply
as the number of processors increases, showing that the task
completion time decreases rapidly as PEs are added.
The curve has a characteristic hump in the vicinity of ten PEs.
Any further attempt to boost computing power by increasing the
number of PEs has negligible effect, indicating that the time
corresponding to the critical path has been reached.
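The saturation at the critical path can be reproduced on a toy problem. The sketch below is an event-driven greedy list scheduler written for illustration; it is not the tick-by-tick routine of Appendix D, and the seven-task graph, its weights and the name schedule_length are all hypothetical.

```python
def schedule_length(weights, successors, num_pes):
    """Greedy list scheduling; returns the time to finish every task.

    weights[t] is the execution time of task t; successors[t] lists
    the tasks that must wait for t. The graph is assumed acyclic.
    """
    q = {t: 0 for t in weights}              # unfinished predecessors
    for succs in successors.values():
        for s in succs:
            q[s] += 1
    ready = sorted(t for t in weights if q[t] == 0)
    running = []                             # (finish_time, task) pairs
    clock = 0
    done = 0
    while done < len(weights):
        # Give ready tasks to free PEs, lowest task number first.
        while ready and len(running) < num_pes:
            task = ready.pop(0)
            running.append((clock + weights[task], task))
        # Advance the clock to the next task completion.
        running.sort()
        clock, finished = running.pop(0)
        done += 1
        for s in successors.get(finished, []):
            q[s] -= 1
            if q[s] == 0:
                ready.append(s)
        ready.sort()
    return clock


# Hypothetical graph: 1,2 -> 5; 3,4 -> 6; 5,6 -> 7.
# Longest path is 3 + 2 + 1 = 6 time units; serial time is 17.
weights = {1: 3, 2: 3, 3: 3, 4: 3, 5: 2, 6: 2, 7: 1}
successors = {1: [5], 2: [5], 3: [6], 4: [6], 5: [7], 6: [7]}
times = [schedule_length(weights, successors, p) for p in range(1, 6)]
```

For this graph the execution time falls from the serial total and then settles at the critical-path length of 6 time units once four PEs are available; a fifth PE changes nothing.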
The percentage efficiency curve (see Figure 4.2) initially
remains at a high value, which implies that available tasks are
adequate to keep the set of processors occupied throughout the
update interval. However, for more than five PEs it rapidly
decreases owing to the idle time generated. This trend continues
until, for ten PEs, the curve has a local maximum corresponding to
a percentage efficiency of approximately 85%. Beyond this, the
efficiency curve falls again. The logical inference drawn
is that for a set of ten PEs a compromise is effected between
idle time and speed of execution, whereby resource efficiency is
sacrificed to obtain a greater speed advantage. This is also
corroborated by the speed-up curve (see Figure 4.3), which
indicates that beyond ten PEs the speed-up ratio remains
unaltered. The performance indices therefore point to ten PEs as
an optimum selection for the task graph under consideration. The
task allocation scheme for the optimum number of PEs is
generated as output by the scheduling program. A Gantt chart or
a processor timing diagram can be set up from the results. It
may be noted that a close packing of tasks exists and
overall processor idle time is negligible. The task graph, task
matrix, program output and Gantt chart are listed in Appendix B.
[Figure 4.1 Processor Execution Time: execution time versus number of processors (1 to 15)]
[Figure 4.2 Processor Efficiency: percentage efficiency versus number of processors (1 to 15)]
[Figure 4.3 Processor Speed-Up: speed-up versus number of processors (1 to 15)]
ARCHITECTURE AND HARDWARE DESIGN
Conventional computers solve problems one step at a time. 
Advanced parallel computers are able to execute independent 
parts of the problem concurrently thereby reducing overall 
execution time [13]. The success of a parallel implementation
depends entirely on the hardware support and to this end an 
efficient architecture is proposed. 
5.1 Architectural Requirements 
Computer architecture encompasses a very wide area of
knowledge bounded by ever-changing innovations. It is extremely
difficult to define all attributes necessary to justify a
particular architecture. In this thesis research, a
multiprocessor parallel algorithmic implementation has been
proposed, which in turn needs truly parallel hardware support.
Flexibility is one of the most desirable features for such an
architecture. A task graph corresponds uniquely to an
application. Any change in the application demands a new task
graph, which in turn requires altered hardware support.
Hence, a truly parallel machine must have hardware upgradability
and reconfigurability. Popular parallel machines like the
Butterfly Computer, the Hypercube and REMPS [14, 15] incorporate
this philosophy. Current research on the FAST at the
University of Alabama also re-emphasizes this point.
The PE system architecture must have a high degree of
pipelining to reduce intermediate idle time. It is also
imperative for each PE to have on-chip memory in addition to
global memory. This reduces the conventional "Von Neumann"
bottleneck and increases computing power.
5.2 PE System Design
A large number of PEs with excellent functional features
are currently available [16, 17]. However, a futuristic PE
design is proposed here (see Figure 5.1). A gallium arsenide
RISC engine is coupled with a floating point coprocessor unit
and constitutes the core of the processing element [18, 19].
These are connected by instruction and data buses to their
respective caches, which virtually eliminates global memory
accesses except perhaps at the pre-processing stage [20]. Separate
instruction and data caches reduce cache contention and internal
bus traffic. The PE interfaces with the system bus using a bus
controller.
5.3 Technology Selection 
An ambitious proposition using WSI GaAs is recommended.
Although the great majority of integrated circuits are
fabricated with silicon, GaAs technology offers several
advantages [20]:
1) GaAs chips are five to ten times faster than the
fastest silicon chips.
2) It is radiation "hard" and operates over a wide
temperature range (-200°C to +200°C).
3) It is also better suited for efficient
integration with electronic and optical components.
Although high cost and low levels of integration are major
drawbacks, these are expected to be eliminated as the technology
matures.
[Figure 5.1 PE Design Schemata: block diagram of the PE and its connection to the system bus]
FPU = Floating Point Unit
CC = Cache Controller
MMU = Memory Management Unit
BC = Bus Controller
CAMMU = Cache and Memory Management Unit
Wafer-scale integration denotes the level of integration
attained when an entire wafer is used to fabricate a
circuit. Currently WSI is the highest level of integration for
monolithic circuits [21]. The technology is still plagued by
problems of heat dissipation and low production yield. However,
higher attainable density levels and fewer off-chip connections
are major factors in proposing this futuristic technology, which
has already started making inroads in the chip market [22].
5.4 Interconnection and System Layout 
A hierarchical fiber optic star (see Figure 5.2) is
proposed as a suitable system layout and corresponds to the FAST
architecture [23]. Such a structure is easily expandable, so
computing power can be added as the application requires. Each
tentacle of the star ends in individual processing modules which
may be specialized to perform functions like error checking,
I/O, communication, numeric processing etc. Such a system has
the option of having heterogeneous modules or homogeneous
modules depending upon the application. Each fiber optic star
cluster may be configured to form specialized hardware modules
for efficient task execution. Optical fiber communication links
are well matched to GaAs WSI technology and are
sufficient to meet the highest transfer rates [24].
[Figure 5.2: hierarchical fiber optic star system layout]
5.5 Future Directions 
Although futuristic hardware support is proposed,
architectural innovations may still be implemented to attain
higher modularity and efficiency. Considerable work needs to be
done in the development of parallel software bases, which still
happen to be inherently sequential [25]. Setting up a task
graph by hand for each application is wasteful of man-hours.
Automated software packages need to be developed for performing
domain and functional decomposition. The future will undoubtedly
be affected by improvements in semiconductor technology.
However, any drastic performance improvement would need a
technological breakthrough, such as the development of
high-temperature superconductors, but the basic tenets of
parallel processing are going to hold good for some time to
come.
REFERENCES 
[1] G. C. Fox and P. C. Messina, "Advanced Computer
Architectures", Scientific American, vol. 257, pp. 67-75,
October 1987.
[2] D. Peng and K. G. Shin, "Modelling of Concurrent Task
Execution in a Distributed System for Real-Time Control",
IEEE Transactions on Computers, vol. C-36, no. 4, pp. 500-516,
April 1987.
[3] D. E. Kirk, Optimal Control Theory - An Introduction,
Prentice-Hall, 1970.
[4] L. G. Birta and O. Abou Rabia, "Parallel Block Predictor-
Corrector Methods for ODEs", IEEE Transactions on
Computers, vol. C-36, no. 4, pp. 299-311, March 1987.
[5] W. L. Miranker and W. Liniger, "Parallel Methods for
the Numerical Integration of Ordinary Differential
Equations", Math. Comput., vol. 21, pp. 303-320, Nov. 1967.
[6] D. J. Arpasi and E. J. Milner, "Mathematical Model
Partitioning and Packing for Parallel Computer
Calculation", pp. 67-74, NASA TM-87170.
[7] H. Kasahara and S. Narita, "Practical Multiprocessor
Scheduling Algorithms for Efficient Parallel
Processing", IEEE Transactions on Computers, vol. C-33,
no. 11, pp. 1023-1029, Nov. 1984.
[8] R. R. Muntz and E. G. Coffman, Jr., "Optimal Preemptive
Scheduling on Two-Processor Systems", IEEE
Transactions on Computers, vol. C-18, no. 11, pp. 1014-1020,
Nov. 1969.
[9] C. V. Ramamoorthy, K. M. Chandy, and M. J.
Gonzalez, "Optimal Scheduling Strategies in a
Multiprocessor System", IEEE Transactions on
Computers, vol. C-21, no. 2, pp. 137-146, Feb. 1972.
[10] S. E. Goodman and S. T. Hedetniemi, Introduction to the
Design and Analysis of Algorithms, McGraw-Hill Book
Company, 1977.
[11] H. Hellerman and T. F. Conroy, Computer System
Performance, McGraw-Hill Book Company, 1975.
[12] A. H. Sameh, "Numerical Parallel Algorithms - A
Survey", High Speed Computer and Organization, pp. 207-228,
1977.
[13] R. Cytron, "Useful Parallelism in Multiprocessing
Environment", Proceedings of the 1985 International
Conference on Parallel Processing, pp. 450-457.
[14] K. Hwang and Z. Xu, "REMPS: A Reconfigurable
Multiprocessor for Scientific Supercomputing", Proceedings
of the 1985 International Conference on Parallel
Processing, pp. 102-111.
[15] J. C. Peterson, J. O. Tuazon, D. Lieberman and M. Pniel,
"The Mark III Hypercube-Ensemble Concurrent Computer",
Proceedings of the 1985 International Conference on
Parallel Processing, pp. 71-73.
[16] R. P. Bianchini and J. P. Shen, "Interprocessor Traffic
Scheduling Algorithms for Multiple-Processor Networks",
IEEE Transactions on Computers, vol. C-36, no. 4, pp. 396-409,
April 1987.
[17] T. L. Johnson, "The RISC/CISC Melting Pot", BYTE, pp. 153-160,
April 1987.
[18] J. F. McDonald, H. J. Greub, R. H. Steinworth, B. J.
Donlan and A. S. Bergendahl, "Wafer Scale Interconnections
for GaAs Packaging - Application to RISC Architecture",
IEEE Computer, pp. 21-34, April 1987.
[19] V. Milutinovic, "An Introduction to GaAs Microprocessor
Architecture for VLSI", IEEE Computer, pp. 30-42, March
1986.
[20] V. Milutinovic, "GaAs Microprocessor Technology", IEEE
Computer, pp. 10-13, October 1986.
[21] J. F. McDonald, "The Trials of Wafer-Scale Integration",
IEEE Spectrum, pp. 32-39, October 1984.
[22] R. O. Carlson, "Future Trends in Wafer-Scale Integration",
Proceedings of the IEEE, pp. 1741-1752, December 1986.
[23] L. D. Hutcheson, "Optical Interconnects Replace Hardwire",
IEEE Spectrum, pp. 30-35, March 1987.
[24] D. H. Hartman, "An Effective Lateral Fiber-Optic
Electronic Coupling and Packaging Technique Suitable for
VHSIC Applications", Journal of Lightwave Technology, pp.
73-81, Jan. 1986.
[25] A. H. Karp, "Programming for Parallelism", IEEE Computer,
pp. 43-56, May 1987.
APPENDIX A 
SOLUTION METHOD FOR OPTIMAL CONTROL PROBLEMS USING
MATRIX RICCATI EQUATIONS
Several techniques are available for the solution of
optimal control problems. A widely used method involves the
setting up of Matrix Riccati equations.
The state equations are:
$$\dot{x}(t) = A(t)x(t) + B(t)u(t)$$
and the performance measure to be minimised is
$$J = 0.5\,\|x(t_f) - r(t_f)\|_{H}^{2} + 0.5\int_{t_0}^{t_f}\left[\|x(t) - r(t)\|_{Q(t)}^{2} + \|u(t)\|_{R(t)}^{2}\right]dt$$
where r(t) is the desired value of the state vector. H and Q are
positive semidefinite matrices, and R is real symmetric and
positive definite. The final time "tf" is fixed.
The Hamiltonian is given by
$$h(x(t),u(t),p(t),t) = 0.5\,\|x(t) - r(t)\|_{Q(t)}^{2} + 0.5\,\|u(t)\|_{R(t)}^{2} + p^{T}(t)A(t)x(t) + p^{T}(t)B(t)u(t)$$
The costate equations are
$$\dot{p}^{*}(t) = -Q(t)x^{*}(t) - A^{T}(t)p^{*}(t) + Q(t)r(t)$$
and the algebraic relations to be satisfied are
$$0 = R(t)u^{*}(t) + B^{T}(t)p^{*}(t)$$
This yields the optimal control law in terms of the costate
as
$$u^{*}(t) = -R^{-1}(t)B^{T}(t)p^{*}(t)$$
Instead of computing the state transition matrix (STM), an
easier computational alternative is to express
$$p^{*}(t) = K(t)x^{*}(t) + s(t)$$
Differentiating both sides with respect to "t", we get
$$\dot{p}^{*}(t) = \dot{K}(t)x^{*}(t) + K(t)\dot{x}^{*}(t) + \dot{s}(t)$$
Substituting for $\dot{x}^{*}(t)$ and $\dot{p}^{*}(t)$ and then
eliminating $p^{*}(t)$, the following equations, commonly referred
to as the Matrix Riccati equations, are obtained
$$\dot{K}(t) = -K(t)A(t) - A^{T}(t)K(t) - Q(t) + K(t)B(t)R^{-1}(t)B^{T}(t)K(t)$$
and
$$\dot{s}(t) = -\left[A^{T}(t) - K(t)B(t)R^{-1}(t)B^{T}(t)\right]s(t) + Q(t)r(t)$$
"K" is a symmetric matrix of order "n" by "n" and "s" is an
"n" by 1 vector. Hence a set of "[n(n+1)/2]+n" first-order
differential equations needs to be solved. The boundary
conditions are
$$p^{*}(t_f) = Hx^{*}(t_f) - Hr(t_f) = K(t_f)x^{*}(t_f) + s(t_f)$$
As these equations must hold for all x*(tf) and r(tf), the
boundary conditions are
$$K(t_f) = H$$
and
$$s(t_f) = -Hr(t_f)$$
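As an illustration of the integration step, the sketch below integrates the scalar form of the two Riccati equations backward from the final time with a fourth-order Runge-Kutta rule. It is a minimal sketch under stated assumptions: the system is scalar and time-invariant, the terminal values K(tf) = H and s(tf) = -H r(tf) are taken as the boundary conditions, and the function name, argument names and test values are hypothetical.

```python
import math


def riccati_backward(a, b, q, r_w, h, ref, t_f, steps=1000):
    """Integrate scalar K(t) and s(t) from t = t_f back to t = 0 (RK4).

    a, b: scalar state and input coefficients; q, r_w, h: scalar
    weights Q, R and H; ref: constant desired state r. Returns
    (K(0), s(0)).
    """
    def deriv(k, s):
        # Scalar forms of the K-dot and s-dot Riccati equations.
        k_dot = -2.0 * a * k - q + (b * b / r_w) * k * k
        s_dot = -(a - k * b * b / r_w) * s + q * ref
        return k_dot, s_dot

    dt = -t_f / steps            # negative step: march backward in time
    k, s = h, -h * ref           # assumed terminal conditions
    for _ in range(steps):
        k1, s1 = deriv(k, s)
        k2, s2 = deriv(k + 0.5 * dt * k1, s + 0.5 * dt * s1)
        k3, s3 = deriv(k + 0.5 * dt * k2, s + 0.5 * dt * s2)
        k4, s4 = deriv(k + dt * k3, s + dt * s3)
        k += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        s += dt * (s1 + 2.0 * s2 + 2.0 * s3 + s4) / 6.0
    return k, s


# Test case with a known closed form: a = 0, b = q = R = 1, H = 0
# gives K(t) = tanh(t_f - t) and s(t) = -ref * tanh(t_f - t).
k0, s0 = riccati_backward(a=0.0, b=1.0, q=1.0, r_w=1.0, h=0.0,
                          ref=1.5, t_f=2.0)
k_exact = math.tanh(2.0)
```

In this test case the control law reduces to u*(t) = -K(t)[x(t) - ref], so the gain obtained by integration can be compared directly with the closed-form value k_exact.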
The optimal control law may be computed from the values of 
"K" and "s" by means of standard integration techniques. 
APPENDIX B 
TASK GRAPH ATTRIBUTES FOR HIGHLY-COUPLED
LINEAR SYSTEM EQUATIONS
Node No. Parameter Operation
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Load
Load
Load
Load
Load
Load
Load
Load
Load
Load
function
subtask
"
"
"
"
"
Node Description for System Task Graph
Node no. Parameter Operation 
21 
22 
23 
24 
25 
26 
27
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 
41 
42 
43 
44 
45 
+ 
* 
+ 
* 
+ 
* 
+ 
* 
+ 
* 
* 
- 
* 
- 
* 
- 
* 
Node Description for System Task Graph 
[Task graph and task matrix for the 45-task system: figure and table entries not legible]
TASK ALLOCATION
THE NUMBER OF PROCESSORS USED=10
THE NUMBER OF DEFINED TASKS=45
processor [1] assigned task [1]
processor [2] assigned task [2]
processor [3] assigned task [3]
processor [4] assigned task [4]
processor [5] assigned task [5]
processor [6] assigned task [6]
processor [7] assigned task [7]
processor [8] assigned task [8]
processor [9] assigned task [9]
processor [10] assigned task [10]
processor [1] assigned task [11]
processor [2] assigned task [12]
processor [3] assigned task [13]
processor [4] assigned task [14]
processor [5] assigned task [15]
processor [6] assigned task [16]
processor [7] assigned task [17]
processor [8] assigned task [18]
processor [9] assigned task [19]
processor [10] assigned task [20]
processor [7] assigned task [27]
processor [9] assigned task [28]
processor [7] assigned task [37]
processor [1] assigned task [21]
processor [2] assigned task [22]
processor [9] assigned task [38]
processor [1] assigned task [25]
processor [4] assigned task [26]
processor [6] assigned task [31]
processor [7] assigned task [44]
processor number [9] idle for 1 TUS
processor [2] assigned task [29]
processor [8] assigned task [30]
processor [9] assigned task [32]
processor number [10] idle for 1 TUS
processor [1] assigned task [35]
processor number [7] idle for 1 TUS
processor number [10] idle for 2 TUS
processor [2] assigned task [23]
processor [3] assigned task [24]
processor [4] assigned task [36]
processor [5] assigned task [39]
processor [6] assigned task [41]
processor number [7] idle for 2 TUS
processor number [9] idle for 1 TUS
processor number [10] idle for 3 TUS
processor [7] assigned task [40]
processor number [8] idle for 1 TUS
processor number [9] idle for 2 TUS
processor number [10] idle for 4 TUS
processor [1] assigned task [33]
processor [2] assigned task [43]
processor number [4] idle for 1 TUS
processor number [6] idle for 1 TUS
processor number [8] idle for 2 TUS
processor number [9] idle for 3 TUS
processor number [10] idle for 5 TUS
processor [3] assigned task [34]
processor [4] assigned task [45]
processor number [5] idle for 1 TUS
processor number [6] idle for 2 TUS
processor number [7] idle for 1 TUS
processor number [8] idle for 3 TUS
processor number [9] idle for 4 TUS
processor number [10] idle for 6 TUS
processor number [2] idle for 1 TUS
processor number [5] idle for 2 TUS
processor number [6] idle for 3 TUS
processor number [7] idle for 2 TUS
processor number [8] idle for 4 TUS
processor number [9] idle for 5 TUS
processor number [10] idle for 7 TUS
processor [1] assigned task [42]
Schedule Complete
PE 1: tasks 1, 11, 21, 25, 35, 33, 42
PE 2: tasks 2, 12, 22, 29, 23, 43
PE 3: tasks 3, 13, 24, 34
PE 4: tasks 4, 14, 26, 36, 45
PE 5: tasks 5, 15, 39
PE 6: tasks 6, 16, 31, 41
PE 7: tasks 7, 17, 27, 37, 44, 40
PE 8: tasks 8, 18, 30
PE 9: tasks 9, 19, 28, 38, 32
PE 10: tasks 10, 20
Gantt Chart of Optimal Schedule (task sequence on each PE, in order of execution)

[Flow Chart for Procedure 'Schedule']
[Flow Chart for Procedure 'Check Schedule']
[Flowchart for Procedure 'Time-Processing']
[Flowchart for Procedure 'Reallocate']
APPENDIX D 
SCHEDULER ROUTINE IN PASCAL 
{**************************************************************
 The following Pascal routine allocates tasks to a set of
 processors in accordance with the scheduling algorithm
 already outlined in Chapter 3. The number of processing
 elements is treated as a variable parameter. The program
 requires as input the following:
   1) The number of available PEs denoted by "n"
   2) The number of defined tasks denoted by "tn"
   3) The task matrix which is read from an input
      data file
 The program outputs the delay time of each processor and
 also the task numbers which are assigned to a particular
 processor. It keeps track of the time schedule of each
 processor by providing relevant information.
**************************************************************}
program processor_scheduling;
const
  max_succ = 7;
  { max_succ is the maximum number of successors that can
    be present at each node of the task graph. It can be
    predefined to assume any value. In this case it has been
    defined to be equal to seven as this is adequate for the
    task graph under consideration. }
type
  processor = record
    time: integer;     { Each processor is defined as a record; }
    task: integer;     { the boolean field denotes whether a }
    active: boolean;   { processor is active / inactive }
  end;
  proc = array[1..20] of processor;   { maximum number of PEs }
  arraytype = array[1..50] of integer;
  successorarray = array[1..50, 1..50] of integer;
var
  ii, tn, n, inp, z, is: integer;
  e, q, w, t: arraytype;
  suc: successorarray;
  p: proc;
  filvar1, filvar2: text;
  f1, f2: string[12];
procedure INITIALISE;
{**************************************************************
 This procedure initialises all the PEs by making the
 active field false and setting task time and number = 0.
 It provides the scheduler with a set of PEs that are
 ready to be assigned to incumbent tasks.
**************************************************************}
var
  ki: integer;
begin
  for ki := 1 to n do
  begin
    p[ki].time := 0;
    p[ki].task := 0;
    p[ki].active := false;
  end;
end;
procedure SCHEDULE;
{**************************************************************
 This procedure allocates a set of available tasks to a
 set of processors that are inactive or available. After
 initial assignment, it checks whether all tasks have been
 scheduled by invoking the procedure check_schedule.
**************************************************************}
label
  start, mark;
var
  { Declarations not legible in the source listing; the names
    i, j and kk are the ones used in the body of SCHEDULE. }
  i, j, kk: integer;

  procedure TIME_PROCESSING;
  {************************************************************
   This procedure decrements the time field of each
   processor and after each decrement makes a self check
   to ascertain whether any processor is idle. If all
   processors are active then it continues decrementing.
   If any processor is idle, it invokes the procedure
   reallocate for reallocation of any available tasks to
   the idle processor / processors.
  ************************************************************}
  label
    s1, s2;
  var
    k, l, temp1, temp, jkk, no_succ, max_it: integer;

    procedure REALLOCATE;
    {**********************************************************
     This procedure handles situations when some processors
     become free due to task completion while some are still
     active. The idle processors are assigned to incumbent
     tasks. If no tasks are available, then idle time
     is recorded for the inactive processors. After possible
     reallocation, the main scheduling program is again invoked.
    **********************************************************}
    label
      f1;   { label declaration not legible in the source listing }
    var
      ll, delay: integer;
    begin { of REALLOCATE }
      ll := 1;
      f1: if p[ll].time <= 0 then
      begin
        if p[ll].time < 0 then
        begin
          { a negative time field counts the TUs spent idle }
          delay := -(p[ll].time);
          writeln(filvar2, ' processor number [', ll, '] idle for ', delay, ' TUS');
        end;
        ll := ll + 1;
        if ll > n then
          SCHEDULE
        else
          goto f1;
      end
      else
      begin
        ll := ll + 1;
        if ll > n then
        begin
          SCHEDULE;
        end
        else
          goto f1;
      end;
    end; { of REALLOCATE }
  begin { of TIME_PROCESSING }
    k := 1;
    s1: p[k].time := p[k].time - 1;
    k := k + 1;
    if k > n then
    begin
      l := 1;
      s2: if p[l].time = 0 then
      begin
        { task on PE l has just finished: release its successors }
        p[l].active := false;
        temp1 := p[l].task;
        no_succ := suc[temp1, 1];
        max_it := no_succ + 1;
        for jkk := 2 to max_it do
        begin
          temp := suc[temp1, jkk];
          if temp <> 0 then
          begin
            q[temp] := q[temp] - 1;
            if q[temp] = 0 then e[temp] := 1;
          end;
        end;
        l := l + 1;
        if l > n then
          REALLOCATE
        else
          goto s2;
      end
      else
      begin
        l := l + 1;
        if l > n then
        begin
          REALLOCATE;
        end
        else
          goto s2;
      end;
    end
    else
    begin
      goto s1;
    end;
  end; { of TIME_PROCESSING }
  procedure CHECK_SCHEDULE;
  {************************************************************
   This procedure examines the task matrix to ensure that
   scheduling is complete, that is, the task graph has been
   completely executed. If not, it invokes procedure
   time_processing to begin task execution once again.
   If allocation is complete, it indicates this by displaying
   "Schedule Complete."
  ************************************************************}
  label
    l1;
  var
    jj: integer;
  begin
    jj := 1;
    l1: if (e[jj] = 0) and (q[jj] = 0) then
    begin
      jj := jj + 1;
      if jj > tn then
      begin
        writeln(filvar2, 'Schedule Complete');
      end
      else
      begin
        goto l1;
      end;
    end
    else
    begin
      TIME_PROCESSING;
    end;
  end;
begin { of SCHEDULE }
  i := 1;
  j := 1;
  start: if j > n then
  begin
    CHECK_SCHEDULE;
  end
  else
  begin
    if p[j].active = false then
    begin
      if e[i] = 1 then
      begin
        p[j].time := w[i];
        p[j].active := true;
        p[j].task := t[i];
        e[i] := 0;
        i := i + 1;
        { print the task just stored on PE j; i has already moved on }
        writeln(filvar2, ' processor [', j, '] assigned task [', p[j].task, ']');
        j := j + 1;
        goto start;
      end
      else
      begin
        kk := i;
        mark: if e[kk] = 1 then
        begin
          p[j].time := w[kk];
          p[j].active := true;
          p[j].task := t[kk];
          e[kk] := 0;
          writeln(filvar2, ' processor [', j, '] assigned task [', kk, ']');
          j := j + 1;
          goto start;
        end
        else
        begin
          if q[kk] = 0 then
          begin
            kk := kk + 1;
            if kk > tn then
            begin
              CHECK_SCHEDULE;
            end
            else
              goto mark;
          end
          else
          begin
            kk := kk + 1;
            if kk > tn then
            begin
              CHECK_SCHEDULE;
            end
            else
              goto mark;
          end;
        end;
      end;
    end
    else
    begin
      j := j + 1;
      goto start;
    end;
  end;
end; { of SCHEDULE }
begin { of MAIN }
  writeln('input number of processors');
  readln(n);
  writeln(' SELECT INPUT DATA FILE ');
  writeln(' OPTIONS-T1.DAT/T2.DAT ');
  readln(f1);
  assign(filvar1, f1);
  reset(filvar1);
  writeln(' SELECT OUTPUT DATA FILE ');
  writeln(' OPTIONS- R1.DAT/R2.DAT ');
  readln(f2);
  assign(filvar2, f2);
  rewrite(filvar2);
  writeln(filvar2, ' TASK ALLOCATION ');
  writeln(filvar2, 'THE NUMBER OF PROCESSORS USED=', n);
  readln(filvar1, tn);
  writeln(filvar2, 'THE NUMBER OF DEFINED TASKS=', tn);
  for ii := 1 to tn do
  begin
    t[ii] := ii;
  end;
  for ii := 1 to tn do
  begin
    readln(filvar1, e[ii]);
  end;
  for ii := 1 to tn do
  begin
    readln(filvar1, q[ii]);
  end;
  for ii := 1 to tn do
  begin
    for is := 1 to max_succ do
    begin
      readln(filvar1, suc[ii, is]);
    end;
  end;
  for ii := 1 to tn do
  begin
    readln(filvar1, w[ii]);
  end;
  INITIALISE;
  SCHEDULE;
  close(filvar2);
end.
