Robot Control Computation in Microprocessor systems with Multiple Arithmetic Processors by Li, Bo & Ahmad, Shaheen
Purdue University 
Purdue e-Pubs 
Department of Electrical and Computer Engineering Technical Reports
Department of Electrical and Computer Engineering
12-1-1987 
Robot Control Computation in Microprocessor systems with 
Multiple Arithmetic Processors 
Bo Li 
Luoyang Institute of Tracking and Telecommunications Technology, China 
Shaheen Ahmad 
Purdue University 
Li, Bo and Ahmad, Shaheen, "Robot Control Computation in Microprocessor systems with Multiple 
Arithmetic Processors" (1987). Department of Electrical and Computer Engineering Technical Reports. 
Paper 586. 
https://docs.lib.purdue.edu/ecetr/586 






School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907
ROBOT CONTROL COMPUTATION IN MICROPROCESSOR 
SYSTEMS WITH MULTIPLE ARITHMETIC PROCESSORS
Bo Li*, Shaheen Ahmad**
*Luoyang Institute of Tracking and Telecommunications Technology
Henan, People's Republic of China
**School of Electrical Engineering
Purdue University 
West Lafayette, IN 47907
ABSTRACT
In this paper we address the problem of designing a high-performance robot controller with multiple arithmetic processing units (APU's). One attractive feature of this controller is that a minimum number of special purpose hardware components are needed; in fact, off-the-shelf components can be used. In the controller described in this paper, one main processor (MPU) schedules a number of APU's to provide the required computational throughput. In this design an efficient scheduling algorithm plays the most important role in the system performance.
The DF/IHS* algorithm [8] is an efficient algorithm for the "strong" NP-hard problem of scheduling a set of partially ordered computational tasks onto a multiprocessor system. However, when interprocessor communication overheads are appreciable, it is not very effective in providing a practical near-optimum schedule, and it fails to consider the problem of contention for shared resources.
-----------------
*DF/IHS = Depth First/Implicit Heuristic Search; this is a derivative of the CP/MISF (Critical Path/Most Immediate Successors First) scheduling algorithm, see [8].
In this paper we present a new multiprocessor scheduling algorithm which minimizes the effect of overhead and, by doing so, reduces the effect of contention.
We used this scheduling algorithm to derive the operational instructions of the APU's and the MPU for our multiple-APU-based robot controller. Simulations show that six Motorola MC 68881 APU's can be used to generate the robot control computations in approximately 2.5 milliseconds. The control computations involve inverse dynamic calculations, forward kinematics, inverse kinematics, and trajectory computations.
/ 1/schultzm/Ahmad/microprocessor - 2 - December 10, 1987
1. INTRODUCTION
One of the bottlenecks in the control of industrial robots is that fast computers are necessary and they are not cheaply available. Many calculations must be performed within one control loop time. These calculations include trajectory generation, i.e. the calculation of the position of the hand at the next sample time; inverse kinematic operations to generate the positions of the joints; and the feedback control of the joints. Often linear feedback control is not adequate for trajectory tracking; a feedforward control signal (derived from inverse dynamic computations) is then added to the joint drive signal for improved trajectory performance. In many control schemes the control signal is based on the manipulator hand position in Cartesian space (as opposed to joint positions); in such cases the forward kinematic or Jacobian computations are also needed every sample time. The order of the computational demand each of these tasks makes on the control computer is as listed:





The order above may change depending on the number of joints and the type of computational scheme. Generally the inverse dynamics computation is by far the most intensive; the other four computations are equally important, although they demand less of the robot controller.
Simpler computation algorithms for inverse dynamics (task one) have received much attention [13] [17] [23] [24]. Likewise, the design of special purpose computers for task one has also received much attention, and a number of researchers have proposed the development of special purpose computers with parallel architectures.
Luh and Lin [12] were the first to consider the parallel computation of the inverse dynamics via the Newton-Euler† [13] technique. No specific details of their computer system are given, apart from the fact that a branch-and-bound algorithm is used for scheduling the processors. Nigam and Lee [16] considered the same problem and proposed nominal architectures for various commercial microprocessors; they assumed the processors can be appropriately interconnected for the particular computation. Kasahara and Narita [10] utilized their DF/IHS algorithm [8] to schedule a number of microprocessors connected by a common bus to compute the inverse dynamics. Similar problems have been addressed by Watanalee et al. [18] and by Zheng and Hemami [19]. Lathrop [20] has shown that systolic architectures and recursive doubling may be used to exploit the parallelism of inverse dynamics, so that the computation may be performed in $O(\log_2 n)$ time steps, where n is the number of robot joints. Nash [21] has designed a processor to perform linear matrix computations; such a processor is useful in kinematics and in many inverse dynamic operations. Several systolic algorithms and pipelined architectures were proposed by Orin et al. for Jacobian and inverse dynamic computations [22]. Lee and Chang [25] have shown that by using a parallel pipelined single instruction multiple data stream (SIMD) machine, they are able to perform the inverse dynamic computation in $O(K_1\lceil n/p\rceil + K_2 \ldots)$ time steps, where p is the number of processors and n is the number of joints of the robot. A VLSI implementation of their algorithm was also proposed.

† The Newton-Euler computation scheme allows the inverse dynamic computation to be performed in $O(n)$ steps, where n is the number of robot joints. It is the least computationally intensive scheme on sequential (non-parallel) machines.

In this paper we also address the problem of robot control computer design. Such a control computer should be able to perform the inverse and forward kinematics, inverse dynamic and trajectory computations in approximately one to five milliseconds. It should be architecturally simple, with few components, and easy to program, and as new control methodologies evolve we should be able to implement them without altering the existing hardware system. Preferably, such a system should be constructed from existing off-the-shelf hardware.

Recently a number of 32-bit microprocessors and coprocessors for arithmetic processing (APU's) have appeared on the market. APU's are generally only able to execute arithmetic operations and have very limited storage, e.g. Motorola's MC 68881 [15]. Such an APU may be loaded by a host with an instruction and its operands. Usually the APU will interrupt the host once the operation is complete; at that time the host is required to offload the results.

In our proposed robot controller we connect a number of APU's to one host processor through a 32-bit bus (see Figure 1). In this design we select an optimal number of APU's to exploit the task parallelism. One simplicity of this architecture is that on a double-size Eurocard board (measuring approximately 9 in × 8 in) it is possible to accommodate a host (e.g. a Motorola 68000 or 68020 microprocessor) with all necessary peripherals and ten or more MC 68881 APU's. The system is also upgradable: as APU's become significantly faster, we may exploit this simply by replacing the APU's in our controller.
In this paper we will address the issues involved in generating the instructions to run concurrently on the APU's for the basic robotic tasks one through five. We will also address the question of how many APU's we need and how fast we can carry out this computation with a particular APU (the MC 68881).
In order to take full advantage of parallel processing, an efficient scheduling algorithm must be developed to obtain a minimum computation time with a minimum number of processors. Numerous scheduling algorithms have been developed [3] [4] [8] [12] [13], etc.‡ Among them DF/IHS [8] [9] [10] is one of the most efficient. However, when the data transfer time among tasks is not negligible and other overheads exist, even the DF/IHS algorithm becomes inefficient. This is because it assumes that data transfer times are negligible in comparison with the processor computation time, so that, if all the processors are identical, a task can be assigned to any one of them without increasing or decreasing the execution time. In fact, when interprocessor communication overheads are considered, a task may have different execution times on two separate processors (for example, a task placed on the processor that produced its operands avoids a data fetch). This is why DF/IHS and some branch-and-bound methods become inefficient.
Another drawback of DF/IHS is that it fails to consider the possible contention problem. Contention for shared resources often cannot be neglected; it has to be reduced as much as possible in a multiprocessor system, so that resources are efficiently utilized and maximum parallel processing is obtained. The contention problem has been analyzed as a Markovian process in [7] [14], etc. However, the effect of contention on a schedule of a set of known tasks has not been extensively analyzed.

‡ We do not cite all references, only a few relevant to this paper.
In order to obtain maximum throughput from our parallel processing system we developed a new scheduling algorithm, DF/MOHS. The algorithm assumes that (1) interprocessor communication overheads (including data transfer) and other necessary overheads are not insignificant, and (2) contention for host processor (MPU) service exists and has to be considered. As a result, our scheduling algorithm considers not only the relations among tasks but also the assignment relation between tasks and processors during the scheduling process.
2. THE SCHEDULING ALGORITHM
In order to allocate tasks to processors efficiently, some assumptions are essential. Every APU is assumed to be identical, i.e. all have the same processing capability. The time needed to transfer a given data packet between two processors is also the same for every pair of processors, and both data and instructions are transferred between the main processor and the coprocessors through a shared bus (see Figure 1). Hence the execution time $t_i$ of a task† $i$ can be viewed as a computational time $t_{ai}$ plus an overhead time $t_{oi}$. The computational time is the time needed for an APU to compute the task, whereas the overhead time may include the times to fetch the task operation code and operands, retrieve the results from the APU, and store the results appropriately. Therefore the overhead time can be further decomposed into an initiation overhead¹ $t_{bi}$, a data and operand fetch time $t_{fi}$, a task termination overhead² $t_{ei}$, and a data storage time $t_{si}$. The overhead times $t_{bi}$ and $t_{fi}$ are accumulated before a task is executed in an APU, and $t_{ei}$ and $t_{si}$ are the overhead times accumulated after the task has been executed in the APU. Four possible situations may arise:

DF/MOHS = Depth First/Minimized Overhead Heuristic Search.
† A task is a mathematical operation executed in an APU.
1: Initiation times may include effective address computation, etc.
2: Termination overhead will include such operations as interrupt processing or coprocessor polling.
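$$t^{1}_{oi} = t_{bi} + t_{ei} \qquad \text{(if data transfer is unnecessary)} \tag{1}$$
otherwise,
$$t^{2}_{oi} = t_{bi} + t_{fi} + t_{ei} \qquad \text{(with data fetch only)} \tag{2}$$
$$t^{3}_{oi} = t_{bi} + t_{ei} + t_{si} \qquad \text{(with data storage only)} \tag{3}$$
$$t^{4}_{oi} = t_{bi} + t_{fi} + t_{ei} + t_{si} \qquad \text{(with full data transfer)} \tag{4}$$
and the total execution time of task i, $t_i$, is then accumulated as:
$$t_i = t^{k}_{oi} + t_{ai}, \qquad i = 1,\ldots,n,\quad k = 1,2,3,4. \tag{5}$$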
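Equations (1)-(5) reduce to a few additions selected by whether a task fetches operands and whether it stores its result. The sketch below is illustrative only: the `TaskTimes` container and the `overhead`/`execution_time` helpers are hypothetical names, and the numbers come from the multiplication row of Table 3, assuming its first (unlabeled) column is the computation time $t_{ai}$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTimes:
    """Per-task timing components (hypothetical container, symbols from the text)."""
    t_a: float  # APU computation time
    t_b: float  # initiation overhead
    t_f: float  # data and operand fetch time
    t_e: float  # termination overhead (interrupt processing or polling)
    t_s: float  # data storage time


def overhead(t: TaskTimes, fetch: bool, store: bool) -> float:
    """Overhead t_oi^k; the (fetch, store) flags select case k of Eqs. (1)-(4)."""
    total = t.t_b + t.t_e      # Eq. (1): no data transfer
    if fetch:
        total += t.t_f         # Eqs. (2) and (4)
    if store:
        total += t.t_s         # Eqs. (3) and (4)
    return total


def execution_time(t: TaskTimes, fetch: bool, store: bool) -> float:
    """Eq. (5): t_i = t_oi^k + t_ai."""
    return overhead(t, fetch, store) + t.t_a


# Multiplication row of Table 3 (MC68881), in that table's units.
mul = TaskTimes(t_a=5.87, t_b=0.24, t_f=0.96, t_e=0.18, t_s=0.48)
print(round(overhead(mul, fetch=True, store=True), 2))        # 1.86
print(round(execution_time(mul, fetch=True, store=True), 2))  # 7.73
```

The full-transfer case $t^{4}_{oi}$ is the worst case and is what STEP 1 uses when computing task levels.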
If the processing system has one APU, then we may find the overall computation time by adding the prefetch¹ and termination² times into the task execution time. In a multi-APU system, if more than one APU is being serviced by the MPU (main processor), then the MPU is required to perform the appropriate prefetch and termination operations, interleaved with other APU operations so as to minimize the effect of the overhead on the overall execution time. Kasahara and Narita's optimal scheduling algorithm DF/IHS neglected the fact that task initiation and termination times may be comparable to the actual APU execution time; e.g. a fast floating point APU may take approximately 500 ns to perform an arithmetic operation, whereas the prefetch and the termination may require more than 100 ns each. In this particular case Kasahara and Narita's scheduling algorithm would not select an appropriately efficient solution, as the overhead processing is not addressed. Additionally, the host processor might be requested to service multiple APU's simultaneously, and as only one bus exists to service the APU's (a restriction of our problem), contention for MPU service would exist. If the prefetch and termination operations can be interleaved at instants when no APU is requesting service, then the optimal schedule may be obtained by the DF/IHS method. If this is not possible, an additional delay time will be inserted into the overall computation time. We wish to minimize the effect of this delay time.

1: Prefetch will now be taken to mean operations related to times $t_{bi}$ and $t_{fi}$.
2: Termination will now be taken to mean operations related to $t_{ei}$ and $t_{si}$. These are loose terms used for easy explanation of the problem.
Task Representation
Given a set of n computational tasks $T = \{T_1, \ldots, T_n\}$, the relationship between the tasks may be represented by a finite acyclic task graph G. In general, data transfer only occurs between tasks and their immediate successors. The graph G (see Figure 2) is multiply weighted, as multiple packets of data may be transferred between a particular parent and its different children. In G, task i, $T_i$, is represented as a node; two extra nodes are included in G, one for the beginning of the computation (the entry node) and one for its termination (the exit node). Both of these nodes have zero processing time, and every node lies on a path between the entry and exit nodes.
We now describe the scheduling algorithm. It is based on the DF/IHS algorithm, except that additional steps are included to minimize the time delay due to overhead operations that cause contention for MPU services. The algorithm is divided into eight steps, each of which is explained below.
STEP 1: Determine the level of each task in G. The level $l_i$ of task $T_i$ is defined as the longest path from the exit node to the node of $T_i$:
$$l_i = \max_k \sum_{j \in \pi_k} \left(t^{4}_{oj} + t_{aj}\right) \tag{8}$$
where $\pi_k$ is the kth path from the exit node to task node $T_i$. The time $(t^{4}_{oj} + t_{aj})$ is the maximum execution time of task $T_j$ in the worst case without contention. If contention exists, i.e. other APU's request the services of the MPU, this time might increase further. The increase depends on the selected schedule, and therefore $l_i$ is an approximation.
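Because G is acyclic, Eq. (8) can be evaluated with a memoized depth-first traversal: a node's level is its own worst-case time plus the largest level among its immediate successors. The following is a minimal sketch under that reading; `task_levels`, the dictionary encoding of G and the four-task graph are hypothetical, and each weight stands for $t^{4}_{oi} + t_{ai}$.

```python
from functools import lru_cache


def task_levels(succ: dict[str, list[str]], weight: dict[str, float]) -> dict[str, float]:
    """Level l_i of every node: the heaviest path from the node to the exit node.

    succ maps a node to its immediate successors; weight[node] stands for
    t_oi^4 + t_ai (zero for the entry and exit nodes).
    """
    @lru_cache(maxsize=None)
    def level(node: str) -> float:
        # Own worst-case time plus the heaviest continuation towards the exit.
        return weight[node] + max((level(c) for c in succ.get(node, [])), default=0.0)

    return {node: level(node) for node in weight}


# Hypothetical graph: entry -> T1, T2; T1 -> T3; T2 -> T3; T3 -> exit.
succ = {"entry": ["T1", "T2"], "T1": ["T3"], "T2": ["T3"], "T3": ["exit"]}
weight = {"entry": 0.0, "T1": 2.0, "T2": 3.0, "T3": 1.0, "exit": 0.0}
print(task_levels(succ, weight))
# {'entry': 4.0, 'T1': 3.0, 'T2': 4.0, 'T3': 1.0, 'exit': 0.0}
```

The recursion is fine for task graphs of a few hundred nodes; a very deep graph would need an iterative reverse topological pass instead.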
STEP 2: We next form a list $\{l_i, n_i, t^{4}_{oi}\}$ for each task $T_i$, where $n_i$ is the number of immediate successors. From these lists we form a priority table. Tasks with higher priority are those with larger $l_i$ and then larger $n_i$, in that order. That is, if $l_i < l_j$, task j has higher priority; if $l_i = l_j$, the one with the larger number of children has higher priority. If $l_i = l_j$ and $n_i = n_j$, then the task having the smaller overhead time $t^{4}_{oi}$ has the higher priority. This is chosen because a smaller overhead implies the MPU may begin servicing other APU's at an earlier time. Here we are assuming that $t^{4}_{oi}$ is composed mainly of the task initiation, e.g. prefetch operations, as opposed to termination operations. If this is not the case, then the tasks with the smallest initiation time should be considered first, before other tasks are scheduled. Note that in robotic computations dyadic and monadic operations are usual, with one or two data fetches and one operation code fetch, and one resultant word is output; hence in robotic computations the initiation time is almost equal for all tasks.
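The STEP 2 priority rule is a lexicographic ordering: larger level first, then more immediate successors, then smaller overhead. A minimal sketch of that ordering as a sort key follows; the `TaskInfo` record, `priority_order` and the four example tasks are hypothetical.

```python
from typing import NamedTuple


class TaskInfo(NamedTuple):
    name: str
    level: float     # l_i from STEP 1
    n_succ: int      # n_i, number of immediate successors
    overhead: float  # t_oi^4, worst-case overhead


def priority_order(tasks: list[TaskInfo]) -> list[str]:
    """Larger level first, then more successors, then smaller overhead."""
    ranked = sorted(tasks, key=lambda t: (-t.level, -t.n_succ, t.overhead))
    return [t.name for t in ranked]


tasks = [
    TaskInfo("A", 10.0, 1, 1.86),
    TaskInfo("B", 12.0, 1, 1.86),
    TaskInfo("C", 10.0, 2, 1.86),
    TaskInfo("D", 10.0, 2, 1.38),
]
print(priority_order(tasks))  # ['B', 'D', 'C', 'A']
```

Negating the first two fields lets a single ascending sort express "larger is better" for them and "smaller is better" for the overhead.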
STEP 3: At each scheduling step a list of tasks available for immediate execution, {afe(t)}, is formed. A task is assigned "afe" status if all its parents have been executed:
$$\text{afe}(t) = \{T_{r1}, \ldots, T_{rn_r}\}$$
where $n_r$ is the number of tasks in afe(t).
STEP 4: If $m_a$ is the number of processors available for computation at this scheduling stage, select as many tasks as possible from {afe(t)}, in order of the priority established in STEP 2, to form an execution list {fe(t)}. This forms a branching node, for example $\text{fe}(t) = \{T_{r1}, T_{r2}, \ldots, T_{rl}\}$, with $l = \min(m_a, n_r)$.
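STEPS 3 and 4 amount to a readiness filter followed by a truncated priority sort. The sketch below assumes the task graph is given as a map from each task to the set of its parents and the priority list comes from STEP 2; `ready_tasks`, `execution_list` and the five-task example are hypothetical.

```python
def ready_tasks(preds: dict[str, set[str]], done: set[str], started: set[str]) -> list[str]:
    """afe(t): tasks not yet started whose parents have all been executed."""
    return [t for t, parents in preds.items()
            if t not in done and t not in started and parents <= done]


def execution_list(afe: list[str], priority: list[str], m_a: int) -> list[str]:
    """fe(t): the l = min(m_a, n_r) highest-priority tasks of afe(t)."""
    rank = {name: i for i, name in enumerate(priority)}
    ordered = sorted(afe, key=lambda name: rank[name])
    return ordered[:min(m_a, len(ordered))]


preds = {"T1": set(), "T2": set(), "T3": {"T1"}, "T4": {"T1", "T2"}, "T5": {"T3"}}
afe = ready_tasks(preds, done={"T1", "T2"}, started=set())
print(afe)                                             # ['T3', 'T4']
print(execution_list(afe, ["T4", "T3", "T5"], m_a=1))  # ['T4']
```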
STEP 5: Assign the tasks in {fe(t)} to the available processors and compute the delay time as follows:
<a> Set an incremental variable $t_d$, denoting the time delay caused by contention for MPU services, equal to zero.
<b> For every available processor, check whether any of the tasks in {fe(t)} are children of the task $T_{ej}$ (a task already executed on that processor). If task $T_{ri}$ is a child of $T_{ej}$, assign the processor on which $T_{ej}$ was executed to task $T_{ri}$, and remove the task from the {fe(t)} list and the processor from the {ma} list¹.
If the finished task on the processor has children other than $T_{ri}$, the MPU schedule is organized to store its data first so that the other tasks may access them. If no children other than $T_{ri}$ exist, this is unnecessary.
The time delay caused by the transfer operations can now be accumulated as:
$$t_d \leftarrow t_d + t_{sk} + t_{bi} + t_{fi}$$
Suppose the executed task j has a child, task i, which is in the {fe(t)} list and is chosen to execute on the same processor x (see Figure 3). Then $t_{fi}$ is the time taken to fetch all the data for task i from the parent tasks l, m and n; $t_{bi}$ is the initiation time for task i, which may include the time needed to fetch the instruction code and perform the address computations; and $t_{sk}$ represents a storage operation for a task k not in the {fe(t)} list. Note that this time delay $t_d$ is not the delay of the entire computation, but the delay in this computation step with respect to the no-overhead case. This delay is used to calculate the real computation time.

1: The exact method of processor and task assignment is more complex and is described below, but it may be thought of as being this simple.
Note: l is the number of tasks in the {fe(t)} list, and l = min(m_a, n_r).
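The accumulation in STEP 5 <b> can be illustrated with a small cost model in which every operand produced on a different APU costs one fetch, while an operand produced on the APU that receives the task costs nothing. This is a hypothetical model, not the report's code: `step_transfer_delay`, its parameters and the per-operand fetch cost of 0.48 (inferred from Table 3, where dyadic operations show $t_f = 0.96$ and monadic ones 0.48) are all assumptions.

```python
def step_transfer_delay(assigned: dict[str, str],
                        parents: dict[str, list[str]],
                        ran_on: dict[str, str],
                        t_b: dict[str, float],
                        t_f_operand: float,
                        to_store: list[str],
                        t_s: dict[str, float]) -> float:
    """Accumulate t_d for one scheduling step (hypothetical model).

    assigned:    task i -> processor chosen for it in this step
    ran_on:      finished parent task -> processor it executed on
    t_f_operand: assumed cost of fetching one operand not already on the APU
    to_store:    finished tasks k whose results must be written out (t_sk)
    """
    t_d = 0.0
    for k in to_store:
        t_d += t_s[k]                                # t_sk terms
    for task, proc in assigned.items():
        # Operands produced on the same APU need no fetch; the others do.
        remote = sum(1 for p in parents[task] if ran_on[p] != proc)
        t_d += t_b[task] + remote * t_f_operand      # t_bi + t_fi
    return t_d


parents = {"r1": ["a", "b"]}
ran_on = {"a": "P1", "b": "P2"}
common = dict(parents=parents, ran_on=ran_on, t_b={"r1": 0.24},
              t_f_operand=0.48, to_store=["b"], t_s={"b": 0.48})
same = step_transfer_delay({"r1": "P1"}, **common)    # r1 placed with parent a
other = step_transfer_delay({"r1": "P3"}, **common)   # r1 placed elsewhere
print(round(same, 2), round(other, 2))                # 1.2 1.68
```

Placing r1 on the APU that produced one of its operands saves one fetch, which is the effect the processor assignment map below exploits.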
We wish to minimize the data transfer from APU's to storage. To achieve this we may assign tasks from the {fe(t)} list to processors (APU's) in an appropriate manner: if a task $r_i$ has a parent in APU $P_j$, then APU $P_j$ is assigned task $r_i$. This simplistic algorithm does not produce optimal results, as the following example illustrates. Assume {fe(t)} = {r1, r3, r4, r5, r2} and {ma} = {P1, P2, P3, P4, P5}; the part of the task graph which is currently being executed is shown in Figure 4. If we assign APU's on a priority basis, we may select the assignment r4 → P1, r3 → P2, and we then need to transfer data from P2 to P3 to execute r4. Clearly, from the task graph, this is nonoptimal.
If a task has m parents, we may reduce the data transfer to (m − 1) data fetches by assigning it an appropriate APU. This is easily accomplished by the following assignment steps. First we form a table to illustrate the relationship between processors and tasks in terms of their parents. For the example in Figure 4, the table may be formed as below:
              r1    r2    r3    r4    r5
Σy(·) \ Σx(·)  3     0     2     1     0
P1     2       X           X
A cross is placed in the table to indicate that a processor generated a parent, e.g. P1 contains the parents of tasks r1 and r3. Row Σx(·) indicates the number of parents of a task generated by the processors in the {ma} list, e.g. Σx(r1) = 3. Column Σy(·) indicates the number of children of the task generated on a particular processor, e.g. Σy(P3) = 1. Clearly, tasks with only one link to the processor list (Σx(·) = 1), and processors with only one link to the task list (Σy(·) = 1), must be assigned first, before other tasks, since such a task has only one parent, as is the case with r4, or such a parent has only one child, as is the case with r4 and P3. Therefore we schedule r4 → P2 and … → P3; such an assignment reduces data transfer operations. Following these assignments, the assigned tasks and processors are removed from the assignment map and the map is updated accordingly. The next processor assignment is similar: as Σx(r3) = 1, we must assign r3 → P1. These assignment steps reduce the data transfer. The remaining tasks are arbitrarily assigned to processors on a priority basis, as Σx(·) = 0 and Σy(·) = 0, which is the case with r5 and r2.
Consider a second example, with the task graph shown in Figure 5. The resulting processor assignment map is given by Table 2:

              r2    r1    r3    r4
Σy(·) \ Σx(·)  3     3     3     1
P1     2       X X
P2     3       X X X
P3     2       X X
P4     3       X X X

Table 2
Processor Assignment Map for Example 2
In this example we select r4 → P4, as Σx(r4) = 1. Next we remove the P4 row and the r4 column from the assignment map, and adjust the Σx(·) and Σy(·) values accordingly. We now consider r2: if the time to transfer data for the execution of r2 on processor P1 or on processor P2 is exactly the same, the task may be chosen to execute on either of the two processors; otherwise the processor is chosen on the basis of minimum data transfer time. If, for example, P1 is chosen, then P1 and r2 must be removed from the assignment map, and Σx(·), Σy(·) must be adjusted accordingly. We continue to assign processors in this manner; if at any step a Σx(·) or Σy(·) value becomes equal to one, the task is assigned to the corresponding processor on which one of its parents was generated. The process is repeated down the priority list until all the processors are assigned. These assignment rules can be programmed recursively in a compact manner.
Reflecting on the above algorithm, the processor assignment method will yield optimal results if all the data transfer operations have approximately the same transfer time. In robot dynamics, kinematics and trajectory computations most of the tasks require the same data transfer time, so in such cases the algorithm will yield optimal results. If, however, the operations have varying unit data transfer times, the algorithm must be adjusted to select assignments which minimize the time of the transfers, as opposed to minimizing the number of transfers.
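The assignment-map rules can be written as a short loop that rebuilds the Σx/Σy counts after every assignment. The sketch below is one possible reading under the equal-transfer-time assumption the text states: a task with Σx = 1 goes to its single parent processor, a processor with Σy = 1 takes its single child, a task with several parent processors takes the first of them, and unlinked tasks go to free processors in priority order. `assign_processors` and the parent sets in the example are hypothetical; the sets were chosen only to reproduce the counts quoted for Figure 4 (Σx(r1) = 3, Σx(r4) = 1, P1 holding parents of r1 and r3, Σy(P3) = 1).

```python
def assign_processors(fe: list[str], procs: list[str],
                      parent_procs: dict[str, set[str]]) -> dict[str, str]:
    """Greedy processor assignment driven by the assignment-map counts.

    fe:           tasks of {fe(t)} in priority order
    procs:        available processors, the {ma} list
    parent_procs: task -> processors (among procs) that generated its parents
    Assumes equal unit transfer times, so ties are broken by list order.
    """
    tasks, free = list(fe), list(procs)
    result: dict[str, str] = {}

    while tasks and free:
        # Crosses of the current map: free processors holding parents of each task.
        links = {t: [p for p in free if p in parent_procs.get(t, set())] for t in tasks}
        sigma_x = {t: len(links[t]) for t in tasks}
        sigma_y = {p: sum(p in links[t] for t in tasks) for p in free}

        task = next((t for t in tasks if sigma_x[t] == 1), None)
        if task is not None:                      # single parent processor
            proc = links[task][0]
        else:
            proc = next((p for p in free if sigma_y[p] == 1), None)
            if proc is not None:                  # processor with a single child
                task = next(t for t in tasks if proc in links[t])
            else:
                task = next((t for t in tasks if sigma_x[t] > 1), None)
                if task is not None:              # several candidates: take the first
                    proc = links[task][0]
                else:                             # no links left: priority order
                    task, proc = tasks[0], free[0]

        result[task] = proc
        tasks.remove(task)
        free.remove(proc)
    return result


fe = ["r1", "r3", "r4", "r5", "r2"]
procs = ["P1", "P2", "P3", "P4", "P5"]
parent_procs = {"r1": {"P1", "P2", "P3"}, "r3": {"P1", "P2"}, "r4": {"P2"}}
print(assign_processors(fe, procs, parent_procs))
# {'r4': 'P2', 'r3': 'P1', 'r1': 'P3', 'r5': 'P4', 'r2': 'P5'}
```

With unequal unit transfer times, the "take the first" branch is where a minimum-transfer-time choice would go.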
<c> In the above we assigned all the tasks that have parents which executed on the processors in the {ma} list. We now assign the remaining tasks to the available processors in order of their priority, until all the tasks in the {fe(t)} list are assigned. The time delay $t_d$ is accumulated appropriately for each assignment.
Note that if two tasks in the {fe(t)} list are of the same level, the task with the lower overhead may be assigned first.
STEP 6: The time to poll the processors to see whether their tasks have finished may now be accumulated; this time is:
$$t_d \leftarrow t_d + \sum_{i \in \psi} t_{ei}$$
where ψ is the set of tasks which have completed execution.
Accumulation of Time-Delay into Computation Time
The computation time is determined on an incremental basis, as illustrated in Figure 6. Suppose that at time $t_0$ the MPU schedules task j in APU1 and task k in APU2; time delays $t_{d1}$ and $t_{d2}$ are associated with these operations respectively. Assume also that task m is in execution in APU X. After APU1 is scheduled the time will read $t_0 + t_{d1}$, and after APU2 is scheduled it will read $t_0 + t_{d1} + t_{d2}$. The next time the processor will be required to service an APU will then be
$$t_0' = (t_0 + t_{d1} + t_{d2}) + \min\{\,t_{aj} - t_{d2},\; t_{ak},\; t_{am} - t_{d1} - t_{d2}\,\}.$$
The computation time can be further determined in this stepwise manner.
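The min-expression above is simply the earliest finishing time among the busy APUs, measured from the MPU clock after both services. The sketch below computes it both ways, assuming that task j starts as soon as its own service delay $t_{d1}$ ends and that $t_{am}$ is the remaining time of task m at the original $t_0$. `service_timeline` and the numeric values are hypothetical.

```python
def service_timeline(t0: float, services: list[tuple[float, float]],
                     running: list[float]) -> tuple[float, float]:
    """Return (MPU clock after servicing, earliest next service request).

    services: (t_d, t_a) for each APU serviced in turn at t0
    running:  remaining execution times, measured from t0, of tasks already running
    """
    clock = t0
    finish = [t0 + r for r in running]
    for t_d, t_a in services:
        clock += t_d                 # MPU busy with this APU's overhead
        finish.append(clock + t_a)   # the task starts when its service ends
    # A request earlier than `clock` would have to wait, i.e. contention.
    return clock, min(finish)


clock, next_request = service_timeline(0.0, [(1.2, 5.87), (1.5, 4.66)], [3.0])
# Same value via the closed form: clock + min(t_aj - t_d2, t_ak, t_am - t_d1 - t_d2)
closed_form = clock + min(5.87 - 1.5, 4.66, 3.0 - 1.2 - 1.5)
print(round(clock, 2), round(next_request, 2), round(closed_form, 2))  # 2.7 3.0 3.0
```

Keeping explicit finish times generalizes the formula to any number of serviced and running APU's.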
STEP 7: If the entire task graph has been executed, i.e. the exit node is reached, go on to STEP 8; otherwise go to STEP 3.
STEP 8: At this point one schedule has been obtained. If the task graph is complicated or large, we can stop here, assuming the solution is satisfactory. When the data transfer and contention problems are negligible, this solution is similar to that of CP/MISF. If interprocessor communication is lengthy and the contention problem is appreciable, a better solution than CP/MISF's can be obtained by the above steps, because the above method of assigning tasks to processors reduces the delay due to contention. If the generated solution is not satisfactory, or is not very close to the optimal one, backtracking may be employed to search for a better solution. In cases where the optimal solution time is unknown, an estimate of the ideal lower bound on the computation time can be compared with the present solution, and if a better solution is desired, the present solution can be used as an initial solution. Next, utilizing a branch-and-bound method, we can backtrack and search for other possible solutions that have a shorter length; for that purpose, the present solution is used as the initial upper bound. Thus, by continually backtracking from the terminal node to other possible branches closer to the entry node, a desirable computation time may be obtained.
The procedure for backtracking and the determination of new branching nodes is as follows. Assume there are m coprocessors altogether. If, at a certain branching stage, there are $n_r$ ready tasks and $m_a$ idle processors available for execution, with $n_r \ge m_a$ and $m_a < m$, then the number of branching nodes (possible choices of local schedules) is:
$$n_b = \sum_{k=0}^{m'} {}_{n_r}C_k = {}_{n_r}C_0 + {}_{n_r}C_1 + \cdots + {}_{n_r}C_{m'}$$
where $m' = m_a$, and ${}_{n_r}C_0$ corresponds to leaving all $m_a$ idle processors idle while the other $(m - m_a)$ processors remain active. If $m_a = m$, the sum is carried over the range $k = 1, \ldots, m'$ with $m' = m$; this corresponds to having at least one processor active. If $m = m_a > n_r$, the summation is carried over the range $k = 1, \ldots, n_r$, with $m' = n_r = \min(m_a, n_r)$. If $m > m_a > n_r$, then $k = 0, \ldots, n_r$, as this corresponds to having $(m - m_a)$ processors active.
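All four cases of the branching count collapse into one rule: the upper limit is $m' = \min(m_a, n_r)$, and the lower limit is 0 only when some processors are still busy ($m_a < m$), otherwise 1. A minimal sketch of that reading, using Python's `math.comb`; the helper name `branching_nodes` and the example values are hypothetical.

```python
from math import comb


def branching_nodes(n_r: int, m_a: int, m: int) -> int:
    """n_b = sum of C(n_r, k) over the admissible k.

    k may start at 0 only if some of the m processors are still busy (m_a < m);
    if all processors are idle, at least one task must be started (k >= 1).
    The upper limit is m' = min(m_a, n_r).
    """
    lower = 0 if m_a < m else 1
    upper = min(m_a, n_r)
    return sum(comb(n_r, k) for k in range(lower, upper + 1))


print(branching_nodes(n_r=5, m_a=3, m=6))  # 1 + 5 + 10 + 10 = 26
print(branching_nodes(n_r=3, m_a=4, m=4))  # 3 + 3 + 1 = 7
```

The count grows quickly with $n_r$, which is why the elimination rule below is needed to keep the search tractable.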
We seek to eliminate those nodes which will not yield a better solution than the one selected by the present schedule, using the following elimination rule: if the selected node will not lead to a solution better than the current solution within an approximation of the lower bound, we delete it from the search list. The lower bound on the computation time of a new node is given below:
$$t_{lbound} = \max(t_{l1}, t_{l2})\,(1 + \varepsilon)$$
where
$$t_{l1} = \max_i\{l_i\} + t_0$$
in which $\max_i\{l_i\}$ is the critical path length from the current node to the terminal node in the task graph, and
$$t_{l2} = \frac{1}{m}\sum_{i \in U}\left(t_{ai} + t_{oi}\right) + t_0$$
where U is the set of tasks still unassigned at the current node. The time $t_0$ is the time taken to arrive at the current branch node from the entry node using the current schedule. The constant ε is an arbitrary number chosen such that 0 < ε < 1.
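The bound combines a critical-path term and a load-balancing term (total remaining work spread over m processors), each offset by the time $t_0$ already spent, and then relaxes the larger of the two by $(1+\varepsilon)$. A minimal sketch; `lower_bound` and all numeric inputs are hypothetical, with the overheads borrowed from the multiplication row of Table 3 purely for illustration.

```python
def lower_bound(remaining_levels: list[float],
                unassigned: list[tuple[float, float]],
                m: int, t0: float, eps: float) -> float:
    """t_lbound = max(t_l1, t_l2) * (1 + eps).

    remaining_levels: levels l_i measured from the current node
    unassigned:       (t_ai, t_oi) for every task in U
    """
    t_l1 = max(remaining_levels) + t0                           # critical-path bound
    t_l2 = sum(t_a + t_o for t_a, t_o in unassigned) / m + t0   # load-balance bound
    return max(t_l1, t_l2) * (1 + eps)


bound = lower_bound(remaining_levels=[7.73, 5.0],
                    unassigned=[(5.87, 1.86), (4.66, 1.86), (7.78, 1.86)],
                    m=2, t0=2.0, eps=0.05)
print(round(bound, 2))  # 14.64
```

Here the load-balance term (13.945) dominates the critical-path term (9.73), so adding processors would lower the bound.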
After the initial solution is obtained, if it is not within $t_{lbound}$ for the graph, we backtrack from the lower-level branching nodes to higher-level nodes, using the selection rule and the elimination rule to select a new branching node which appears to yield a better schedule. Next we branch to STEP 5 to generate another schedule and check whether it is within $t_{lbound}$ of the graph. If it is not, we select another node to backtrack to, in search of a solution within $t_{lbound}$. We repeat this process until the entire tree is searched or the desired schedule is obtained.
The scheduling algorithm is represented in the flow charts shown in Figure 7.
A Simple Example of the Scheduling Algorithm
An example task graph is shown in Figure 8. It consists of nine tasks, including an entry node and an exit node. Two schedules were generated for varying numbers of processors, one by our algorithm (DF/MOHS) and one by DF/IHS. The results for DF/MOHS are summarized in Figure 10: the computation can be carried out in 17.33 time units on one processor. The results for the DF/IHS algorithm are summarized in Figure 9: its computation time is 19.00 units of time on one processor. This is because DF/IHS does not minimize data transfers, e.g. in the steps involving Task 2 → Task 5, Task 5 → Task 7 and Task 4 → Task 6.
Note that the overhead transfer time has been added to the task execution time in the DF/IHS algorithm. The effect of contention and data transfer is also clear if we consider the schedules with two processors: DF/MOHS has a processing time of 10.08, whereas DF/IHS has a processing time of 12. With three processors DF/MOHS results in a processing time of 8.08, and DF/IHS in 10.5.
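To put these figures on a common scale, the relative reduction of the DF/MOHS time with respect to DF/IHS can be computed directly from the times quoted above; the helper `reduction_percent` is a hypothetical name.

```python
def reduction_percent(mohs: float, ihs: float) -> float:
    """Relative reduction of the DF/MOHS time with respect to DF/IHS, in percent."""
    return 100.0 * (ihs - mohs) / ihs


# (DF/MOHS, DF/IHS) computation times quoted above, keyed by processor count.
times = {1: (17.33, 19.00), 2: (10.08, 12.00), 3: (8.08, 10.50)}
for p, (mohs, ihs) in times.items():
    print(f"{p} processor(s): {reduction_percent(mohs, ihs):.1f}% shorter")
# 1 processor(s): 8.8% shorter
# 2 processor(s): 16.0% shorter
# 3 processor(s): 23.0% shorter
```

The advantage widens as processors are added, consistent with the argument that more APU's mean more competing MPU service requests.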
On examination of the task graph it is seen that at most three parallel execution paths exist, for T1, T2 and T3; for the remaining part of the graph two such paths exist. Because of such small parallelism the contention problem is not very significant. However, it does exist, and it contributes to the relative increase in the computation time of the DF/IHS schedule. By minimizing the data transfers in DF/MOHS we have reduced the MPU service requests, and therefore reduced the effect of contention. This will be further borne out by the simulations of the robot control tasks.
Summary of Algorithm Advantages
It is difficult to rigorously prove that our algorithm DF/MOHS yields a near-optimal schedule in the presence of overhead and contention for MPU services. However, extensive simulations have shown the following:
(i) It can be executed in approximately the same time as the DF/IHS and CP/MISF algorithms.
(ii) Our algorithm considers the overhead involved with data transfer and reduces it by generating a schedule which minimizes it.
(iii) By minimizing the data transfer operations we reduce the contention for MPU services, and if contention occurs we accumulate its effect in the overall schedule.
3. The Number of APU’s Needed for a Robot Controller
In order to determine the optimal number of APU's needed to perform the forward kinematics, the inverse kinematics, the inverse dynamics and the trajectory computations, we used the DF/MOHS algorithm to generate the computation time for a varying number of APU's. From this set of times we were able to determine the optimal number of APU's and the corresponding computation time; these times include overhead and the effects of contention. In our calculations we used the data for the Motorola MC68881 APU; the computation times are summarized in Table 3.
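The text does not spell out how the "optimal" number of APU's is read off the time-versus-APU curve. One plausible rule, sketched below purely as an assumption, is to take the smallest APU count whose computation time lies within a chosen tolerance of the best time observed. The function name and the tolerance values are hypothetical; the sample times are the DF/MOHS results for one to three processors on the example task graph (17.33, 10.08 and 8.08).

```python
def smallest_sufficient_apus(times, tolerance):
    """Smallest APU count whose time is within `tolerance` (a fraction)
    of the best time in `times`.

    times: dict mapping number of APUs -> computation time.
    """
    best = min(times.values())  # raises ValueError if times is empty
    return min(n for n, t in times.items() if t <= (1.0 + tolerance) * best)


example = {1: 17.33, 2: 10.08, 3: 8.08}
print(smallest_sufficient_apus(example, 0.05))  # 3
print(smallest_sufficient_apus(example, 0.30))  # 2
```

A tighter tolerance pushes the choice toward more APU's; a looser one trades a little speed for fewer processors.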







Operation        CT      th     tf     te     ts
subtraction      4.86    0.24   0.96   0.18   0.48
addition         4.66    0.24   0.96   0.18   0.48
multiplication   5.87    0.24   0.96   0.18   0.48
division         7.78    0.24   0.96   0.18   0.48
sqrt             7.90    0.24   0.48   0.18   0.48
sincos          28.47    0.24   0.48   0.18   0.96
atan2           33.38    0.24   0.48   0.18   0.48
negate           3.59    0.24   0.48   0.18   0.48
Table 3
Computation Times for the Motorola MC68881 APU Used in Our Simulations
Computation Tasks:
For our computations we used the dynamic and kinematic models of the PUMA manipulator. The forward kinematics of the PUMA arm is summarized in Appendix A1, and the inverse kinematics is summarized in the task graph of Appendix A2. Seventy-four APU operations are necessary to compute the forward kinematics, and 104 operations are necessary to compute the inverse kinematics. The Newton-Euler inverse dynamics computations of the PUMA arm are summarized in Appendix A3; 154 steps, or approximately 400 APU operations, are necessary to compute the joint feedforward torques given the joint positions, velocities and accelerations. The Cartesian trajectory computation, described in Paul's book [17] in terms of the drive matrix, is summarized in Appendix A4. Two hundred and fifty-five APU operations are necessary to compute the drive matrix.
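The appendix task tables list each task as TN FN NC LC (task number, function number, number of children, list of children). The Python sketch below, whose helper names are hypothetical, turns such rows into a task graph, checks that NC agrees with the length of LC, and computes each task's level as the longest path of computation times down to the exit task. It uses a few rows and the corresponding CT values copied from Appendix A4 (FN 0 = 4.66, FN 1 = 5.87, FN 7 = exit). Treating "level" as this critical-path length is our assumption about the "Determine level" step in the scheduling flowchart.

```python
# A few rows from Appendix A4: TN FN NC LC
ROWS = """\
215 1 2 220, 222
217 0 1 218
218 0 1 225
219 0 1 220
220 0 1 225
221 0 1 222
222 0 1 225
225 7 0
"""

# Computation time (CT) per function number, from the Appendix A4 table.
CT = {0: 4.66, 1: 5.87, 7: 0.0}


def parse_task_table(text):
    """Return ({task: function_number}, {task: [children]})."""
    function, children = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(maxsplit=3)
        tn, fn, nc = int(fields[0]), int(fields[1]), int(fields[2])
        lc_text = fields[3] if len(fields) > 3 else ""
        lc = [int(x) for x in lc_text.replace(",", " ").split()]
        if len(lc) != nc:  # NC must match the number of listed children
            raise ValueError(f"task {tn}: NC={nc} but {len(lc)} children listed")
        function[tn], children[tn] = fn, lc
    return function, children


def task_levels(function, children, ct):
    """Level = longest path of computation times from a task to the exit."""
    level = {}

    def visit(t):
        if t not in level:
            below = max((visit(c) for c in children[t]), default=0.0)
            level[t] = ct[function[t]] + below
        return level[t]

    for t in function:
        visit(t)
    return level


fn, ch = parse_task_table(ROWS)
lv = task_levels(fn, ch, CT)
print(max(lv, key=lv.get), round(max(lv.values()), 2))  # 215 10.53
```

The NC check is also a cheap consistency test on the tables themselves: the Appendix A4 row `122 2 2 125` lists one child while declaring two, and `parse_task_table` rejects it.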
Summary of Simulation Results
The computation times and schedules were generated by DF/IHS and by the DF/MOHS algorithm. Overhead is included in all simulations. In order to show the effect of contention, one set of simulations for DF/IHS and DF/MOHS was performed without accumulating the effect of contention, and another set accumulated the effect of contention.
The factor e was set to 0.05 for the simulations not accumulating the effect of contention, and to 0.09 for those accumulating it. A computer time limit was imposed on the simulations by the UNIX 4.3 operating system running on a VAX 11/780; in the simulations in which contention was accumulated, this time limit was exceeded.
Figure 11 shows the simulation results for the forward kinematics. As expected, DF/MOHS without contention produces better results than DF/IHS, since DF/MOHS minimizes data transfer. With contention, DF/MOHS again produces a better solution than DF/IHS with contention, because minimizing data transfer reduces the effect of contention. Note also that a 'kink' is present in the DF/IHS simulation with contention; this is because it is difficult to find a near-optimal solution in the presence of contention. No such kink is present in the DF/MOHS simulation, mainly for the reason indicated earlier: with minimized data transfer less contention exists, and therefore an acceptable solution is found quite quickly. From Figure 11 it is apparent that approximately six APU's lead to an optimal processing time of 147 µs for the forward kinematics.
Simulations for the inverse kinematic operations are shown in Figure 12. The optimal APU utilization occurs for five APU's, with a processing time of 416 µs.
Simulation results for the inverse dynamics are shown in Figure 13. Here we note that the optimal processing time of 1213 µs occurs for six APU's.
The drive matrix computations can be carried out in approximately 400 µs by six APU's.
From our simulations it is seen that with six MC68881 APU's we may perform the forward kinematics, inverse dynamics, inverse kinematics and drive matrix computations using floating-point arithmetic in approximately 2500 µs, using a very simple parallel processing architecture.
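As a quick check of the overall figure, the Python snippet below simply adds the four per-task processing times reported above. The inverse kinematics time is the five-APU optimum, so treating the sum as a six-APU total is an approximation.

```python
# Optimal processing times reported above, in microseconds.
times_us = {
    "forward kinematics": 147,
    "inverse kinematics": 416,   # optimum occurs at five APUs
    "inverse dynamics": 1213,
    "drive matrix": 400,
}
total_us = sum(times_us.values())
print(total_us, "us")  # 2176 us
```

The sum, about 2.2 ms, is consistent with the approximate 2.5 ms figure quoted for the complete set of computations.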
It is interesting to note that each task exhibits a different degree of parallelism, and this is reflected in the way its computation time changes with an increasing number of APU's.
CONCLUSION
In this paper we presented an algorithm which extends the DF/IHS method to include overhead and contention. The algorithm seeks to minimize overhead by reducing the number of data transfer operations between the processors, and in this way reduces the effect of contention for MPU services. This result is confirmed by simulations of robot control tasks for a varying number of APU's.
We have also presented a simple multi-coprocessor (APU) robot controller which may be constructed using the Motorola MC68881. Such a controller has optimal performance with six APU's, and is able to perform the kinematics, inverse dynamics and trajectory computations using floating-point arithmetic in approximately 2.5 ms. It is sufficiently modular to allow adaptation for other computational purposes. A fundamental component of the design of this robot controller is an accurate scheduling algorithm, which not only produces an accurate estimate of the computation time but also produces an MPU schedule with a minimum number of operations, allowing easy programming and implementation.
REFERENCES
[1] Ahmad, S. and Li, B., "Optimal Design of Multiple Arithmetic Processor-Based Robot Controllers," Proc. 1987 IEEE Robotics and Automation Conf., Raleigh, N.C.
[2] Ahmad, S., "On the Design of Special-Purpose Computational Structures for Robot Control: Design Constraints," Proc. 1986 Applied Motion Control Conf., Minneapolis, Minnesota.
[3] Arya, S., "An Optimal Instruction-Scheduling Model for a Class of Vector Processors," IEEE Trans. Comp., Vol. C-34, No. 11, Nov. 1985, pp. 981-994.
[4] Coffman, E. G. et al., Computer and Job Shop Scheduling Theory, New York, Wiley, 1976.
[5] Craig, J. J., Introduction to Robotics, Addison-Wesley Publishing Company, 1986.
[6] Fernandez, E. G. and Bussel, B., "Bounds on the Number of Processors and Time for Multiprocessor Optimal Schedules," IEEE Trans. Comp., Vol. C-22, No. 8, Aug. 1973, pp. 745-751.
[7] Fung, K. T. and Torng, H. C., "On the Analysis of Memory Conflicts and Bus Contentions in a Multiple-Microprocessor System," IEEE Trans. Comp., Vol. C-27, No. 1, Jan. 1979, pp. 28-37.
[8] Kasahara, H. and Narita, S., "Practical Multiprocessor Scheduling Algorithms for Efficient Parallel Processing," IEEE Trans. Comput., Vol. C-33, No. 11, 1984.
[9] Kasahara, H. and Narita, S., "Load Distribution Among Real-Time Central Computers Connected via Communication Media," Proc. 9th IFAC World Cong., Oxford: Pergamon, 1984.
[10] Kasahara, H. and Narita, S., "Parallel Processing of Robot-Arm Control Computation on a Multimicroprocessor System," IEEE Journal of Robotics and Automation, Vol. RA-1, No. 2, June 1985, pp. 104-113.
[11] Li, C. J. and Wah, B. W., "Computational Efficiency of Parallel Processing Approximate Branch-and-Bound Algorithm," Proc. Int. Conf. Parallel Processing, 1984, pp. 473-480.
[12] Luh, J. Y. S. and Lin, C. S., "Scheduling of Parallel Computation for a Computer-Controlled Mechanical Manipulator," IEEE Trans. Sys., Man and Cyber., Vol. SMC-12, No. 2, March/April 1982, pp. 214-234.
[13] Luh, J. Y. S., Walker, M. W. and Paul, R. P. C., "On-Line Computational Scheme for Mechanical Manipulators," Journal of Dynamic Systems, Measurement and Control, Trans. ASME, Vol. 102, June 1980, pp. 69-76.
[14] Marsan, M. A., Balbo, G. and Conte, G., "Comparative Performance Analysis of Single Bus Multiprocessor Architectures," IEEE Trans. Comp., Vol. C-31, No. 12, Dec. 1982, pp. 1179-1191.
[15] MC68881 Floating-Point Coprocessor User's Manual, Motorola, Inc., 1985.
[16] Nigam, R. and Lee, C. S. G., "A Multiprocessor-Based Controller for the Control of Mechanical Manipulators," IEEE J. Robotics Automat., Vol. RA-1, No. 4, Dec. 1985, pp. 173-182.
[17] Paul, R. P., Robot Manipulators: Mathematics, Programming, and Control, MIT Press, 1981.
[18] Watanabe, T., et al., "Improvement in the Computing Time of Robot Manipulators Using a Multimicroprocessor," Trans. ASME, Vol. 108, Sept. 1986, pp. 190-197.
[19] Zheng, Y. F. and Hemami, H., "Computation of Multibody System Dynamics by a Multiprocessor Scheme," IEEE Trans. Sys., Man and Cyber., Vol. SMC-16, No. 1, Jan./Feb. 1986, pp. 102-110.
[20] Lathrop, R. H., "Parallelism in Manipulator Dynamics," Int. Journal of Robotics Research, MIT Press, Vol. 4, 1985.
[21] Nash, G., "A Systolic/Cellular Computer Architecture for Linear Algebraic Operations," Proceedings of the 1985 IEEE Conference on Robotics and Automation, St. Louis, MO, April 1985.
[22] Orin, D. E., Chao, H. H., Olson, K. W. and Schrader, W. W., "Pipeline/Parallel Algorithms for Jacobian and Inverse Dynamic Computations," Proceedings of the IEEE Conference on Robotics and Automation, St. Louis, MO, April 1985.
[23] Bejczy, A. K., "Robot Arm Dynamics and Control," JPL, Pasadena, CA, Memo 33-669, Feb. 1974.
[24] Hollerbach, J. M., "A Recursive Lagrangian Formulation of Manipulator Dynamics and a Comparative Study of Dynamics Formulation Complexity," IEEE Trans. Systems, Man, Cybernetics, Vol. SMC-10, No. 11, Nov. 1980, pp. 730-736.
[25] Lee, C. S. G. and Chang, P. R., "Efficient Parallel Algorithm for Robot Inverse Dynamics Computation," IEEE Trans. Systems, Man, Cybernetics, Vol. SMC-16, No. 4, July 1986, pp. 532-542.




Function Type and Computation Time
FN = Function number
FT = Function type
CT = Computation time
FN  FT          CT     th     tf     te     ts
0   tfadd       4.66   0.24   0.96   0.18   0.48
1   tfsub       4.66   0.24   0.96   0.18   0.48
2   tfmul       5.87   0.24   0.96   0.18   0.48
3   tfdiv       7.78   0.24   0.96   0.18   0.48
4   tfsqrt      7.90   0.24   0.48   0.18   0.48
5   tfsincos   28.47   0.24   0.48   0.18   0.96
6   tfatan2    33.38   0.24   0.48   0.18   0.48
7   tfneg       3.59   0.24   0.48   0.18   0.48
8   exit        0.0    0.0    0.0    0.0    0.0
Function type key
0 = Floating addition
1 = Floating subtraction
2 = Floating multiplication
3 = Floating division
4 = Floating square root
5 = Floating sine cosine
6 = Floating arc-tangent
7 = Floating negate
8 = Null function




Function Type and Computation Time
FN = Function number
FT = Function type
CT = Computation time
FN  FT          CT     th     tf     te     ts
0   tfadd       4.66   0.24   0.96   0.18   0.48
1   tfsub       4.66   0.24   0.96   0.18   0.48
2   tfmul       5.87   0.24   0.96   0.18   0.48
3   tfdiv       7.78   0.24   0.96   0.18   0.48
4   tfsqrt      7.90   0.24   0.48   0.18   0.48
5   tfsincos   28.47   0.24   0.48   0.18   0.96
6   tfatan2    33.38   0.24   0.48   0.18   0.48
7   tfneg       3.59   0.24   0.48   0.18   0.48
8   exit        0.0    0.0    0.0    0.0    0.0
Function type key
0 = Floating addition
1 = Floating subtraction
2 = Floating multiplication
3 = Floating division
4 = Floating square root
5 = Floating sine cosine
6 = Floating arc-tangent
7 = Floating negate
8 = Null function
APPENDIX A3
TN: Task number 
FN: Function number 
NC: Number of children 
LC: List of children
TN FN NC LC
1 10 7 4, 5, 10, 11, 14, 19, 20
2 10 4 6, 12, 16, 21
3 10 2 7, 22
4 5 1 5
5 5 1 8
6 5 1 7
7 10 1 8
8 2 1 9
9 6 2 146, 148
10 6 1 11
11 5 1 13
12 5 1 13
13 2 1 150
14 5 2 15, 17
15 10 7 25, 26, 31, 32, 35, 40, 41
16 5 1 18
17 11 1 18
18 2 4 27, 33, 37, 42
19 5 1 20
20 5 1 23
21 5 1 22
22 10 1 23
23 2 1 24
24 7 2 28, 43
25 8 1 26
26 8 1 29
27 8 1 28
28 3 1 29
29 3 1 30
30 6 2 137, 139
31 9 1 32
32 8 1 34
33 9 1 34
34 3 1 141
35 7 2 36, 38
36 1 7 46, 47, 52, 53, 56, 61, 62
37 7 1 39
38 11 1 39
39 3 4 48, 54, 58, 63
40 8 1 41
41 8 1 44
42 8 1 43
43 3 1 44
44 3 1 45
45 7 2 49, 64
46 ? 1 47
47 8 1 50
48 8 1 49
49 3 1 50
50 3 1 51
51 6 2 128, 130
52 9 1 53
53 8 1 55
54 9 1 55
55 3 1 132
56 7 2 57, 59
57 1 5 67, 68, 73, 74, 77
58 7 1 60
59 11 1 60
60 3 3 69, 75, 79
61 8 1 62
62 8 1 65
63 8 1 64
64 3 1 65
65 3 1 66
66 7 2 70, 82
67 8 1 68
68 8 1 71
69 8 1 70
70 3 1 71
71 3 1 72
72 6 2 121, 123
73 9 1 74
74 8 1 76
75 1 76
76 3 1 124
77 7 2 78, 80
78 1 5 83, 84, 89, 90, 93
79 7 1 81
80 11 1 81
81 3 3 85, 91, 95
82 7 2 86, 98
84 8 1 87
85 8 1 86
86 3 1 87
87 3 1 88
88 6 2 114, 116
89 9 1 90
90 8 1 92
91 9 1 92
92 3 1 117
93 7 2 94, 96
94 1 4 99, 100, 105, 106
95 7 1 97
96 11 1 97
97 3 2 101, 107
98 7 1 102
99 8 1 100
100 8 1 103
101 8 1 102
102 3 1 103
103 3 1 104
104 6 2 109, 110
105 9 1 106
106 8 1 108
107 9 1 108
108 3 1 111
109 12 1 113
110 8 1 111
111 3 2 112, 115
112 1 1 154
113 7 1 114
114 3 1 120
115 7 1 117
116 8 1 118
117 3 1 118
118 3 2 119, 122
119 1 1 154
120 7 1 121
121 3 1 127
122 7 1 124
123 8 1 125
124 3 1 125
125 3 2 126, 129
126 1 154
Function Type and Computation Time
FN = Function number
FT = Function type
CT = Computation time
FN  FT         CT     th     tf     te     ts
0   exit       0.0    0.0    0.0    0.0    0.0
1   tfadd      4.66   0.24   0.96   0.18   0.48
2   tf2add     9.32   0.48   1.92   0.36   0.96
3   tf3add    13.98   0.72   2.88   0.54   1.44
4   tfmul      5.87   0.24   0.96   0.18   0.48
5   tf2mul    11.74   0.48   1.92   0.36   0.96
6   tf3mul    17.61   0.72   2.88   0.54   1.44
7   tf4m2a    32.80   1.44   5.76   1.08   2.88
8   tf6m3a    49.20   2.16   8.64   1.62   4.32
9   tf9m6a    80.80   3.6   14.4    2.7    7.2
10  tfload     4.66   0.24   0.96   0.18   0.48
11  tf2m1a    16.40   0.72   2.88   0.54   1.44
12  tf3load   13.98   0.72   2.88   0.54   1.44
Function type key
0 = Null function
1 = Floating add
2 = Floating (2x1) vector addition
3 = Floating (3x1) vector addition
4 = Floating multiplication
5 = Floating vector x scalar (one element = 0)
6 = Floating vector x scalar
7 = Floating (3x3) matrix x vector (three elements = 0)
8 = Floating (3x3) matrix x vector (four elements = 0)
9 = Floating (3x3) matrix x vector
11 = Floating 3x3 matrix (four elements = 0) x (3x1) vector
12 = Floating (3x1) vector addition
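Judging by their names and values, the compound functions in the table above appear to be built from the primitive floating-point add and multiply: tf4m2a, for example, reads as four multiplications and two additions, and 4 × 5.87 + 2 × 4.66 = 32.80. The Python sketch below checks that reading, which is our interpretation rather than something the table states, by recomputing every column (CT, th, tf, te, ts) of each compound row from the tfadd and tfmul rows. All rows agree to within rounding; the largest deviation is 0.01 for tf9m6a, whose sum 80.79 appears to have been rounded to 80.80.

```python
# Primitive rows from the table: (CT, th, tf, te, ts)
TFADD = (4.66, 0.24, 0.96, 0.18, 0.48)
TFMUL = (5.87, 0.24, 0.96, 0.18, 0.48)

# Compound rows as printed, keyed by (multiplications, additions)
# under our reading of the function names (an assumption).
TABLE = {
    "tf2add": ((0, 2), (9.32, 0.48, 1.92, 0.36, 0.96)),
    "tf3add": ((0, 3), (13.98, 0.72, 2.88, 0.54, 1.44)),
    "tf2mul": ((2, 0), (11.74, 0.48, 1.92, 0.36, 0.96)),
    "tf3mul": ((3, 0), (17.61, 0.72, 2.88, 0.54, 1.44)),
    "tf4m2a": ((4, 2), (32.80, 1.44, 5.76, 1.08, 2.88)),
    "tf6m3a": ((6, 3), (49.20, 2.16, 8.64, 1.62, 4.32)),
    "tf9m6a": ((9, 6), (80.80, 3.6, 14.4, 2.7, 7.2)),
    "tf2m1a": ((2, 1), (16.40, 0.72, 2.88, 0.54, 1.44)),  # FN 11, scanned as "tf2mil"
}


def compound_cost(n_mul, n_add):
    """Each column is n_mul * (tfmul value) + n_add * (tfadd value)."""
    return tuple(n_mul * m + n_add * a for m, a in zip(TFMUL, TFADD))


for name, ((n_mul, n_add), printed) in TABLE.items():
    computed = compound_cost(n_mul, n_add)
    worst = max(abs(c - p) for c, p in zip(computed, printed))
    print(f"{name}: max deviation {worst:.2f}")
```

The same rule also reproduces tfvmul in the Appendix A4 table as three multiplications and two additions (26.93).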
APPENDIX A4
TN: Task number 
FN: Function number 
NC: Number of children 
LC: List of children
TN FN NC LC
84 2 2 87, 93
85 1 1 88
86 1 1 88
87 1 1 89
88 0 1 89
89 0 1 90
90 3 3 94, 95, 96
91 1 1 94
92 1 1 95
93 1 1 96
94 2 2 138, 142
95 2 2 151, 155
96 2 2 164,168
97 0 1 100
98 0 1 101
99 0 1 102
100 1 1 103
101 1 1 103
102 1 1 104
103 0 1 104
104 0 1 105
105 3 1 109
106 0 1 108
107 0 1 108
108 0 2 109, 111
109 2 1 110
110 4 3 129, 130, 131
111 1 2 112, 113
112 0 2 114, 118
113 0 1 114
114 2 1 115
115 3 1 116
116 6 3 117, 123, 129
117 0 1 118
118 1 2 120, 122
119 0 1 120
120 2 2 124, 130
121 0 1 122
122 2 2 125
123 1 1 126
124 1 1 126
125 1 1 127
126 0 1 127
127 0 1 128
TN: Task number
FN: Function number
NC: Number of children 
LC: List of children
TN FN NC LC
215 1 2 220,222
216 1 1 224
217 0 1 218
218 0 1 225
219 0 1 220
220 0 1 225
221 0 1 222
222 0 1 225
223 0 6 136, 149, 162, 175, 188, 201
224 0 6 143,156,169,182,195,208
225 7 0
Function Type and Computation Time
FN = Function number
FT = Function type
CT = Computation time
FN  FT        CT     th     tf     te     ts
0   tfadd      4.66  0.24   0.96   0.18   0.48
1   tfmul      5.87  0.24   0.96   0.18   0.48
2   tfdiv      7.78  0.24   0.96   0.18   0.48
3   tfsqrt     7.90  0.24   0.48   0.18   0.48
4   tfatan    25.6   0.24   0.48   0.18   0.48
5   tfvmul    26.93  1.2    4.8    0.9    2.4
6   tfneg      3.59  0.24   0.48   0.18   0.48
7   exit       0.0   0.0    0.0    0.0    0.0
Function type key
0 = Floating addition
1 = Floating subtraction
2 = Floating multiplication
3 = Floating division
4 = Floating square root
5 = Floating sine cosine
6 = Floating arc-tangent
7 = Floating negate
8 = Null function




Multiple APU-Based Robot Controller



























An Acyclic Graph used for Representing a Computation







Computing time delay td
/l/schultzm/Ahmad/microprocessor - 35 October 15, 1987
Figure 4
Example Computation with Processors and Priority Lists
Figure 6
Computation Time with Accumulation of Time Delay
Form the processor-task assignment map
Determine the level l_i of the tasks in G (the task graph)
Set the processing time
Form the priority list
All parents of Tq have been executed?
Save {afe(t)} on the stack
Form the {afe(t)} list
Form a branching node {fe(t)} with l = min(n_i, m_a) tasks of {afe(t)}
Save the {fe(t)} list on the stack
Set tri = 0
Figure 7a
Flowchart of Scheduling Algorithm
Allocate tasks to the APU's according to the assignment map recursively, and compute the real time t0, accumulating the time delay of this processing stage
Accumulate the overhead of the finished tasks in the next computational stage
Any more tasks to be scheduled in graph G?
Is this solution satisfactory (t < t_bound)?
Figure 7b
Flowchart of Scheduling Algorithm
Are there any further branching nodes in the stack? (If not: unsatisfactory solution)
Pop a branching node from the stack, form the {afe(t)} list, and reverse the time to that of the "popped" stage
Are there any branching nodes in {afe(t)} which have not been searched?
Select a new {fe(t)} list, suppressing the choice of the set that has already been examined
Check whether this branching node will bring about a better solution
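The depth-first branch-and-bound loop of the flowchart in Figures 7a and 7b can be illustrated with the Python sketch below. It is a simplified sketch in the spirit of DF/IHS, not the DF/MOHS implementation: it ignores the data-transfer overheads (th, tf, te, ts) and the MPU contention that DF/MOHS accounts for, always fills every free processor, uses task level (longest execution-time path to an exit task) as the priority, and uses the simple bound max(critical path, total work / m). The function name `df_schedule` is hypothetical, and the parameter `eps` is only assumed to play a role similar to the factor e used in the simulations.

```python
from itertools import combinations


def df_schedule(duration, children, m, eps=0.0):
    """Depth-first branch-and-bound list scheduling of a task DAG on m
    identical processors (no communication overhead or contention).

    duration: {task: execution time}; children: {task: [successors]}.
    Returns (best makespan found, lower bound). The search stops early once
    a schedule within a factor (1 + eps) of the lower bound is found.
    """
    parents = {t: set() for t in duration}
    for t, cs in children.items():
        for c in cs:
            parents[c].add(t)

    level = {}  # longest execution-time path from a task to an exit task

    def lvl(t):
        if t not in level:
            level[t] = duration[t] + max((lvl(c) for c in children[t]), default=0.0)
        return level[t]

    for t in duration:
        lvl(t)
    lower = max(max(level.values()), sum(duration.values()) / m)
    best = [float("inf")]
    stop = [False]

    def search(now, finished, running):
        # running: tuple of (finish_time, task) for tasks on processors
        started = finished | {t for _, t in running}
        ready = sorted((t for t in duration
                        if t not in started and parents[t] <= finished),
                       key=lambda t: -level[t])          # priority list
        if not ready and not running:                     # schedule complete
            if now < best[0]:
                best[0] = now
                stop[0] = best[0] <= (1 + eps) * lower
            return
        # Every remaining path ends no earlier than its level allows.
        bound = max([now + level[t] for t in ready] +
                    [f - duration[t] + level[t] for f, t in running])
        if bound >= best[0]:
            return                                        # prune this branch
        k = min(m - len(running), len(ready))
        for chosen in (combinations(ready, k) if k else [()]):
            batch = running + tuple((now + duration[t], t) for t in chosen)
            t_next = min(f for f, _ in batch)             # next completion
            search(t_next,
                   finished | {t for f, t in batch if f <= t_next},
                   tuple((f, t) for f, t in batch if f > t_next))
            if stop[0]:
                return

    search(0.0, frozenset(), ())
    return best[0], lower


# Hypothetical 4-task graph: a and b both feed c; d is independent.
dur = {"a": 2.0, "b": 2.0, "c": 3.0, "d": 1.0}
succ = {"a": ["c"], "b": ["c"], "c": [], "d": []}
print(df_schedule(dur, succ, m=2))  # (5.0, 5.0)
print(df_schedule(dur, succ, m=1))  # (8.0, 8.0)
```

Branches are explored in priority order, so the first complete schedule is the plain highest-level-first list schedule; later branches are only expanded while their bound can still beat the best schedule found, which mirrors the "Check whether this branching node will bring about a better solution" step.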










(i) The quadruplet weight is (th, tf, te, ts).
(ii) If a task has more than one parent, ts is replaced with (-), meaning that only one storage to main memory is necessary, and it is counted only once for all the children.
Terminal node
Figure 8
An Example Schedule with Data Transfer Overhead
/m/chris/ahmad/micropro.tables -39- October 15, 1987
Number of processors (APU)=1









Number of processors (APU)=2













Number of processors (APU)=3
Time m Task in execution a
0.00 3 1 3
4.50 3 5 4
6.50 2 7 4
8.00 2 7 6













Figure 9. Simulation Results for DF/IHS
Number of processors (APU)=1









Number of processors (APU)=2

















Number of processors (APU) =3
Time m Task in execution a
0.00 3 1 3 2
2.50 1 4 3 2
3.00 1 4 0 2
3.50 2 4 0 5
4.83 2 4 0 7
5.58 2 6 0 7
6.83 2 0 0 7
7.58 3 0 0 8
Number of processors   Processing time   Lower bound time   APU idle time
1                      17.33             19.00              0.00
2                      10.08              9.50              1.33
3                       8.08              8.00              6.33










YYY = Data transfer, no contention (the algorithm of this paper)
AAA = DF/IHS, no contention
Dashed line = Data transfer and contention with this algorithm (labelled DF/MOHS)
Dotted line = DF/IHS with contention (labelled DF/IHS)
Horizontal axis: Number of APU's
Figure 10










Y = Data transfer, no contention (the algorithm of this paper)
A = DF/IHS, no contention
Dashed line = Data transfer and contention with this algorithm (labelled DF/MOHS)
DF/IHS with contention
Horizontal axis: Number of APU's
Figure 11











YYY = Data transfer, no contention (the algorithm of this paper)
AAA = DF/IHS, no contention
Dashed line = Data transfer and contention with this algorithm (labelled DF/MOHS)














Data transfer, no contention (the algorithm of this paper)
DF/IHS, no contention
Data transfer and contention with this algorithm (labelled DF/MOHS)
DF/IHS with contention
Horizontal axis: Number of APU's
Figure 13
Cartesian Space Path Planning Computation Time
