Preliminary Report on High-Performance Computational Structures for Robot Control by Rahman, Mahibur & Meyer, David G.
Purdue University
Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports
Department of Electrical and Computer
Engineering
10-1-1987
Preliminary Report on High-Performance





Follow this and additional works at: https://docs.lib.purdue.edu/ecetr
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for
additional information.
Rahman, Mahibur and Meyer, David G., "Preliminary Report on High-Performance Computational Structures for Robot Control"










School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907
TABLE OF CONTENTS
Page
LIST OF TABLES ..................... ...............................................................................iv
LIST OF FIGURES................................. ............................... ................................ vi
ABSTRACT.......... ..................... ................ .............. .........................................viii
CHAPTER 1 - INTRODUCTION AND BACKGROUND....;.........................!
1.1 Introduction......... ................................................. ................................1
1.2 Previous Work........... ...................... .............. ...... .....................................4
1.3 Organization of Preliminary Report................. .....................................5
CHAPTER 2 - PARALLEL PROCESSING OF ROBOT INVERSE 
DYNAMICS, FORWARD AND INVERSE
KINEMATICS COMPUTATIONS.......... ................ ...... .............. ...... .7
2.1 Introduction ............................................................................. ....................3'
2.2 Parallel Processing of the Inverse Dynamics Problem ...;.................... 8
2.2.1 Ideal Lower Bounds on the Number of Processors and 
Execution Time of the Inverse Dynamics Problem 
while running on a Multiprocessor System using
an Optimal Scheduling Algorithm.................... ..............................8
2.2.2 SIMD Multiprocessor Architectural Model...... ............................11
2.2.3 Proposed Instruction Scheduling Algorithm for
SIMD Architecture with Crossbar Interprocessor 
Communication Network.......................... .......;.............................12
2.2.4 Simulation Results and Comparisons to previous work ...........20
2.3 Parallel Processing of PUMA Forward and Inverse Kinematics.......?®
9.9.1 Multiprocessor Architectural Model........................29
2.3.2 Simulation Results.................... .............. ...... ..............................33
2.4 Conclusions................ ....................... .......................................... ............... 37
CHAPTER 3 - A COST-EFFICIENT BIT-SERIAL ARCHITECTURE
FOR ROBOT INVERSE DYNAMICS COMPUTATION.......... 38
3.1 Introduction................................................. ....... ...... ....................... ....... ..38
Page
3.2 Choice of a Bit-serial Architecture....... ...... ............................................38
3.3 Time Lower Bound To Compute the Inverse
Dynamics Problem on a Bit-Serial Architecture...... ............................39
3.4 Bit-Serial System Architecture........ ...................... ...... ..41
3.5 Bit-Serial Array Processor Organization
and Performance............. .................... .............. .,43
3.6 Multi-Functional Bit-Serial Leaf Cell
Architecture and Operation........ ............. ................................... ..... ...... 47
3.7 Implementation of Bit-Serial Cell in
CMOS Technology....... ................................ ....... .................................50
3.8 Conclusions.............................................. ...................................................56
CHAPTER 4 - SYSTOLIC ARCHITECTURE FOR ROBOT INVERSE
DYNAMICS COMPUTATION.......................................................61
4.1 Introduction................. ..........,..... .................................. .
4.2 Systolic Array Design Methodology.......... ...... .
4.3 Time Lower Bound to Compute the Inverse Dynamics
Problem on A Systolic Array...................... .............. .
4.4 Reformulation of Newton Euler Equations of
Motion for Direct Mapping onto a Fixed 
Systolic Architecture....................... ............ ..... ............. .
4.5 Systematic Design of a Basic Set of Systolic
Processors for the Inverse Dynamics 
Computational Problem................................... .................
4.5.1 Systolic Processor for Unidirectional Tasks.......
; 4.5.2 Systolic Architecture of a Type 4 Task ..............
4.5.3 Systolic Architecture of a Type 5 Task ..............
4.5.4 Systolic Architecture for Task Types 6(a)-(c).....
4.5.5 Systolic Array for a Type 7 Task.............. ............
4.6 Optimal Buffer Assignment for the Formulation
of a Balanced Systolic Architecture to Evaluate 
the Inverse Dynamics Computational Problem............
4.7 Conclusions........................................ .
CHAPTER 5 - SUMMARY OF CONCLUSIONS........................................106
LIST OF REFERENCES................. .................................................... ...............109
APPENDICES
Appendix A: N-E Equations of Motion for a Manipulator
with Rotary Joints......... ............................................................... .117
Appendix B: 3-D Matrix/Vector Arithmetic Operations 
















Appendix C: PUMA Forward Kinematics Solution....... ...... ........ ...........121
Appendix D: PUMA Inverse Kinematics Solution.................................... 122





1.1 Computational Complexity of Robot Control Algorithms.....................,.2
2.1 initial SIMD Task Assignment for WG of Fig. 2.1.......................... ........18
2.2 Final STAT for WG of Fig. 2.1............... ...................... ............................... 18
2.3 DPAT for WG of Fig. 2.1.......  .................18
2.4 STAT for a Single Iteration of the N-E Dynamics Algorithm ...............19
2.5 DPAT for a Single Iteration of the N-E Algorithm............ ....................21
2.6 MC 68020 Operation Execution Times.............    ................23
2.7 MC 68881 Floating-Point Operation Execution Times................. 31
3.1 Bit-Serial Cell Performance Measures.................... .............. ......................46
3.2 Parallelism in the Execution of Tasks for Three
Forward Iterations....................         .................48
3.3 Parallelism in the Execution of Tasks for Three
Backward Iterations.....................        ......49
4.1 GVT of a Type 4 Task...................         78
4.3 LryT of a Type 4 Task........ .........     „.,,,,,,7S
4.3 IT of a Type 4 Task............................... ........................ ............................. ..78
4.4 GVT of a Type 5 Task ............................................... ................................81
4.5 UVT Of a Type 5 Task.................. *.......................................... ....................81
4.6 IT of a Type 5 Task............... .............. ...81
Table
4.7 GVT of a Type 6a Task .................................................. .
4.8 UVT of a Type 6a Task...................................... .
4.9 IT of a Type 6a Task.........................................................
4.10 GYT of a Type 7 Task.......... ............... ..................................
4.11 UVT of a Type 7 Task........... ...... ....................................... .
4.12 IT of a Type 7 Task...... ...................... .........................................
4.13 Computational Delays for Basic Set of Systolic
Processors....... ................ ................ ............................ ..................
5.1 Execution Times and Hardware Overhead for Computation
of Robot Control Algorithms in a MC 68020-Based 
Multiprocessor System..................... ....................... ................. .
5.2 Execution Times and Hardware Overhead for Computation













2.1 SlMD Architectural Model for Robot Inverse Dynamics
Computation............................................................................ ......................13
2.2 A Given WG........ ................      ..............................17
2.3 Simulation Results for Robot Inverse Dynamics Computation
on Proposed Architectural Model....... ....... ....... ..................>...................24
2.4 Performance Comparison of Proposed Architecture to that
of [19] for Robot Inverse Dynamics Computation........................... ...........27
2.5 Multiprocessor Architectural Model for PUMA Forward and
Inverse Kinematics Computations..................... ..................... .....................30
2.6 Simulation Results for Forward Kinematics...»...........   ............32
2.7 Simulation Results for Inverse Kinematics........ .................. ....................35
3.1 Bit-Serial System Architecture................ ........... .....42
3.2 Pipelined Organization of Bit-Serial Cells in Processor 1.......................44
3.3 Pipelined Organization of Bit-Serial Cells in Processor 2 ........ .....45
3.4 Architecture of A Multi-Functional Bit-Serial Leaf Cell .......................51
3.5 Interconnections Between Neighbouring Bit-Serial Cells ,{j>2
3.6 Flowchart of Asynchronus Communication Protocol............................. ..53
3.7 Data Routing Paths for Alternate Vector Arithmetic
Operating Modes...... ........................................................................................54
3.8 Zipper CMOS Driver Circuit...................   57
LIST OF FIGURES
Figure
3.9 Zipper CMOS Full Adder/Subtractor..............
3.10 Zipper CMOS Bit-Serial Multiplier Module....
4.1 Systolic Processors for Unidirectional Tasks...
4.2 Systolic Architecture for a Type 4 Task.........
4.3 Systolic Architecture for a Type 5 Task........ .
4.4 Systolic Architecture for a Type 6a Task.......
4.5 Processing Element for a Type 6b Task.........
4.6 Processing Element for a Type 6c Task..........
4.7 Systolic Architecture for a Type 7 Task...... .
4.8 WG for Forward Iterations of N-E Algorithm















In this report we present some initial results of our work completed thus 
far on "Computational Structures for Robot Control". A SIMD architecture 
with the crossbar interprocessor network which achieves the parallel process­
ing execution time lower bound of o( [ajn ]), where a1 is a constant and n is
the number of manipulator joints,for the computation of the inverse dynam­
ics problem, is discussed. A novel SIMD task scheduling algorithm that 
optimizes the parallel processing performance on the indicated architecture 
is also delineated. Simulations performed on this architecture show speedup 
factor of 3.4 over previous related work completed for the evaluation of the 
specified problem, is achieved. Parallel processing of PUMA forward and 
inverse kinematics solutions is next investigated using a particular schedul­
ing algorithm. In addition, a custom bit-serial array architecture is designed 
for the computation of the inverse dynamics problem within the bit-serial 
execution time lower bound of o(c1k + c2kn), where c1 and c2 are specified 
constants, k is the word length, and n is the number of manipulator joints. 
Finally, mapping of the Newton-Euler equations onto a fixed systolic array is 
investigated. A balanced architecture for the inverse dynamics problem 
which achieves the systolic execution time lower bound for the specified 
problem is depicted. Please note again that these results are only prelim­






Inverse Dynamics, Forward and Inverse Kinematics are particular robot 
control algorithms which need to be computed during regular robot servo 
control loops. Unfortunately, tbe computational complexity of these 
algorithms (see Table 1.1) degrades the servo control loop time of present- 
day robot control systems. This preliminary report therefore investigates 
high-performance computational structures to evaluate the specified 
algorithms in minimal time. The architecture, operation, performance, and 
derivation of lower bounds for the execution times and number of processors 
on specific computational structures to evaluate the specified problems is 
discussed.
The forward and inverse kinematics algorithms deal with the analytical 
study of the geometry of motion of a robot with respect to a fixed reference 
coordinate system. The former algorithm determines the position and 
orientation of the end-effector of the manipulator with respect to a fixed 
reference coordinate system, given the joint angles and geometric link 
parameters. Conversely, the latter algorithmic formulation determines the 
joint angles, given the position and orientation of the manipulator with 
respect to a fixed coordinate system. The second scheme is thus used 
primarily to determine whether the manipulator can reach the desired hand 
position and orientation, and if it can, find how many different manipulator 
configurations the given manipulator position and orientation may satisfy.
independent variables in a robot arm are the joint vari§j}les? apd ft 
task is usually stated in terms of the reference coordinate system, the inverse 
kinematics problem is used more frequently.
















Multiplications 678 100 74
Additions 597 77 66
Transcendental - 137 62
Square Root - 3
Division - 10
The inverse kinematics solution may be obtained by various methods 
such as inverse transform, screw algebra, dual matrices, dual quaternian, 
iterative, and geometric approach. The iterative solution often requires a 
very high number of computations and does not guarantee convergence to 
the correct solution. The inverse transform method yields a set of explicit, 
non-iterative joint angle equations. The dataflow structure of this technique 
will therefore be used in this thesis. The architecture, operation, and 
performance of a parallel processing architecture to compute the specified 
kinematics algorithms for the PUMA robot is described.
The inverse dynamics problem is that of efficiently determining the 
motor torques required to drive a manipulator arm in free motion. These 
torques must be evaluated repeatedly during servo control loops to avoid 
undesirable motion deviations of the robot arm from the desired trajectory 
path. The computation of the torques is, however, a very mathematically 
intense task which degrades the servo loop time of present-day robot control 
systems. Many researchers have thus concentrated on simplifying the 
dynamics equations [30,35,57], or developing new computational architectures 
[5,19,26,32,34,35,43,45,58,60] to reduce the servo sampling period. In this 
paper, the latter approach is employed.
There are a number of ways to formulate the robot arm dynamics 
equations of motion. They include the Lagrange-Euler [4], recursive 
Langrange-Euler [18], Newton-Euler [36] and the generalized d’Alembert 
principle functions [30]. Among these methods, the Newton-Euler (N-E) 
dynamics equations is the most efficient and has been shown to possess the 
computational time lower bound of O(n) [36], where n is the number of 
degrees-of-freedom of the manipulator. The N-E algorithm will therefore be 
used to implement the various computational structures discussed in this 
thesis. It should, however, be noted that the proposed architectures may be 
customized to meet the computational model of other formulations of the 
inverse dynamics problem.
The N-E formulation uses two types of iterative recursions, namely, the 
forward and the backward recursions, which are applied to the robot links 
sequentially. The forward recursion propagates kinematics information — 
Ipela ag Unear velocities, angular accelerations, and linear accelerations at 
the center of mass of each link — from the inertial coordinate frame to the 
hand coordinate frame. The backward recursion propagates the forces and 
moments exerted on each link from the end-effector of the manipulator to the 
base reference frame. The parallel, pipelined, and recursive nature of the N-
4
E dynamics algorithm suggests that it is amenable to parallel processing 
structures using off-the-shelf processors, as well as parallel and pipelined 
custom VLSI array architectures. This thesis therefore also presents such 
computational schemes.
1.2 Previous Work
Several parallel architectures have been proposed to solve the inverse 
dynamics problem. Luh and Lin [35] proposed a modification of the 
traditional "branch and bound" scheduling technique to process the equations 
in parallel on a set of n CPUs. Kasahara and Narita [19] continued along 
the same path by proposing to use the DFIHS (Depth-First-Implicit- 
Heuristic-Search) scheduling algorithm for this purpose. However, in both 
the above studies, the important problem of interprocessor communication 
which degrades the performance of their particular multiprocessor structure 
is ignored.
Orin, Chao and Schrader [45], recognizing the pipelined nature of the 
N-E formulation, proposed assigning two pipelined CPUs per link to compute 
the forward and backward recursions in each processor, respectively. Their 
structure eliminates some of the performance degradation problems 
associated with interprocessor communications that exist in the computation 
of the N-E algorithm for parallel processing environments. The task of 
computing forward or backward iterations of a single stage of the N-E 
recursive algorithm at a single CPU location does not, however, produce 
sufficient speedup over a single CPU solution.
Liao and Chern [32] suggest using the CBAP (Cross-Bus Array 
Processor), which uses a large set of bit-parallel processors arranged in an 
array format with two sets of busses crossing over the array in two 
directions. The primary disadvantage of this system is the cost-inefficiency 
of using a large number of bit-parallel array processors which are not fully 
utilized. Additional problems include complexity in the operand data 
alignment process and the directional data-shifting mechanisms used in the 
control of the array, causing the system to be susceptible to 
hardware/software faults.
Lathrop [26] proposed two parallel algorithms on the inverse dynamics 
problem using a group of special purpose processors. One is a linear-parallel 
algorithm and the other is a logrithmic-parallel algorithm based on the
5
partial sum technique. The main concern with both approaches is the 
massive buffering between forward and backward recursions, which 
deteriorates the performance. The second problem is that they both involve 
complex interprocessor communication structures which frequently cause 
data to be fetched and, as a result, data for operand pairs are not properly 
aligned for parallel computations.
Lee and Chang [29] recently proposed using the recursive doubling 
algorithm with a modified inverse perfect shuffle interconnection scheme 
between a set of parallel processors. Their processor interconnection 
structure improves pipelining and eliminates some of the problems associated 
with interprocessor communication. They did not, however, incorporate any 
operand access mechanisms into their structure to allow proper 
synchronization of the parallel processors. Their implementation requires a 
total of 530 modular processors, where the processor complexity required to 
perform 3-D parallel vector dot product and/or vector addition along with 
the expensive interconnection structure among processors indicate this 
approach may not be a cost-efficient and fault-tolerant practical approach.
For the computation of the inverse kinematics algorithm, which is an 
equally important problem, Lee and Chang have proposed a pipelined VLSI 
architecture using CORDIC processors as building blocks of the structure. 
Delay buffers were inserted along data paths of the system to balance 
operand arrival time of the CORDIC computational nodes. The author has 
no valid criticism of this custom VLSI architecture.
1.3 Organization of Preliminary Report
Parallel processing using off-the-shelf processing elements to evaluate 
the inverse dynamics, PUMA forward and inverse kinematics algorithms is 
discussed in chapter 2. Ideal lower bounds on the number of parallel 
processors and execution time for the computation of the inverse dynamics 
problem in the specified computational model is derived here. A novel 
scheduling algorithm for the parallel processing of N-E equations on an SIMD 
pif/qbinp ydth a particular interprocessor communication network is prqgeiitgd 
next. Simulation results performed on this architecture are then delineated 
and compared to the performance of previous related work completed on 
this problem. The second part of this chapter describes the performance of 
both types of kinematics algorithms in a multiprocessor environment with a
6
shared memory interprocessor communication mechanism while using a 
particular instruction scheduling algorithm.
In chapter 3, a custom bit-serial array architecture for the computation 
of the inverse dynamics problem is presented. The organization, operation 
and performance of the proposed system is described. In addition, the 
architecture and intercell communication protocol of an individual bit-serial 
cell (Which is used as the building block of the overall array itMiethrS) is 
delineated.
Mapping of the Newton-Euler equations of motion onto a fixed systolic 
architecture is depicted in Chapter 4. The systolic design methodology for 
this mapping process is discussed here. This design procedure is used to find 
a basic set of systolic processor architectures which are used to build the 
complete systolic system. Integer linear programming is applied for the 
optimal buffer assignment problem to obtain a "balanced" systolic array for 
the computation of the specified problem. The performance, operation, 
design and lower bound task latency costs of such a systolic system are also 
delineated here. Finally, Chapter 5 provides a summary of conclusions 
drawn from this report.
CHAPTER 2
PARALLEL PROCESSING OF ROBOT INVERSE DYNAMICS, 
FORWARD AND INVERSE KINEMATICS COMPUTATIONS
2.1 Introduction
In the first part of this chapter, we discuss parallel processing of the inverse 
dynamics computational problem. First, the ideal lower bounds for the number of 
parallel processors and execution time to evaluate the specified problem in a parallel 
processing environment using an optimal scheduling algorithm is derived. The 
proposed multiprocessor architectural model is then presented. Next, a scheduling 
algorithm customized for optimal parallel processing on the proposed system model is 
described. Further, simulation results for computing the desired algorithm using the 
proposed architecture and scheduling methodology is delineated and compared to 
previous work completed on this topic.
In the second part of the chapter, the parallel processing system model for the 
computation of PUMA forward and inverse kinematics algorithms is presented. This 
architecural model for the evaluation of the specified problems is simulated by using 
the DFIHS (Depth-First-Implicit-Heuristic-Search) algorithm [19] as the primary task 
scheduling tool.
2.2 Parallel Processing of the Inverse Dynamics Problem
2.2.1 Ideal Lower Bounds on the Number of Processors and Execution Time of 
the Inverse Dynamics Computational Problem while running on a 
Multiprocessor System using an Optimal Scheduling Algorithm.
The ideal minimum number of processors required to compute the ihvei^e 
dynamics problem in minimal time while running in a parallel processing 
environment using an optimal scheduling algorithm is discussed here. In addition, the 
limitation on speeding up the specified problem in the indicated computational 
architecture is also investigated. Before deriving these ideal lower bounds, the 
following notation, definitions, and lemmas must be established.
Notation:
nip = Minimum number of parallel processors needed to compute the inverse 
dynamics problem in minimal time using an optimal scheduling algorithm.
Tp = Lower bound on the execution time of the inverse dynamics problem while 
running in a parallel processing system with nip processors, using an optimal 
scheduling algorithm.
E</> denotes a linear arithmetic expression of / distinct atoms, where an atom is a 
Constant or variable, e.g. E<4> = a + b - c/d.
Tqj = Minimum time to perform the set of computations in the critical path of a task 
graph G.




for te [Xj - tj, tj] 
otherwise.
where f(tj,t) indicates the activity (or computational delay) of 
vertex j (or task Tj) along time, according to the restrictions 
imposed by the task graph, and Tj is the earliest completion time 
of task Tj.
Definition 2.2: had density function [10] is definedby
F(x,t) = 2 f(Xj,t) (2.1)
■ j=i
where F(x,t) depicts the total activity of the task graph as a 
function of time, and / is the total number of tasks in task graph 
. G
Definition 2.3: The load density function within the interval [0l5 02] c [0, tcP]
after all tasks have been shifted to yield minimum overlap with 




Lemma 2.1: A lower bound on the minimum number of processors [10] to




where the maximum is taken over all integer time segments
[0i,02l.
Theorem 2.1: The minimum number of parallel processors needed to compute
the inverse dynamics problem within time tcp of the Newton- 
Euler task graph using an optimal scheduling algorithm is n, 
where n is the number of manipulator joints.
Proof: Using Eq. 2.1, the load density function for the inverse dynamics problem
' is’
bjn
F(x)t) = Xf(xj,t) = b1b2n (2.3)
where bj is the number of tasks in a single iteration of the N-E algorithm, 
and b2 is the number of tasks in the complete N-E task graph which have an 
unit activity at a set vertices T}. The second load density function within
ny= max [01, 02]
10
time interval [0j, ©2] c [0» <cp] to yield minimum overlap for the inverse 
dynamics problem is found by Eq. 2.2 to be,
R(el5 02, t) = b1b2n(02 - 0i)
Thus, by lemma 2.1, trip must be of the form,
1 02
max[0l5 02] J b1b2n(02 - 0t) dttrip =
®2-®l 0!
[ max [bjb2(02 - O^njl
I [0i, oj I
(2.4)
It is evident from Eq. 2.4 that the minimum number of parallel processors to
compute the inverse dynamics algorithm using an optimal scheduling
algorithm is n. Note that this lower bound occurs when
max bib2(02-0i) is unity.
[6i. ej I
Lemma 2.2: The set of computations of a graph G cannot be completed with m
processors in a time less than tL [10],
ti =trp+ max
where tK is a discrete point in time and % is the latest completion 
time of a specific task.
Lemma 2.3: The critical path time, tcP, of the inverse dynamics task graph S is





Proof: Let X! be the computational delay in the critical path of a single iteration of
the N-E dynamics algorithm. Thus, the critical path time for n iterations is 
tcp<n> = Xjn. The inverse dynamics problem may be considered as 
computing a set of joint torques which result in obtaining the joint torques. 
Each joint torque Xj of joint i can be expressed as an arithmetic expression 
containing at last 3n atoms: n joint positions {qj}£i, n joint velocities {qj},
11
and n joint accelerations {ql^, implying, tcP<3n> ='3xjn. The previous 
expression may be rewritten without changing the order of the lower bound 
as tcptXj,.. .,xn] = am, where aj is specified constant.
Theorem 2.2: The minimum time to compute the inverse dynamics problem in
a parallel processing system with mp processors using aft optimal 
scheduling algorithm is bounded below by o|" arnj , where ft is 
the number of manipulator links, and aj is a specified constant;
Proof: By substituting F(t,t) from Eq. 2.3, tcP from Lemma 2.3, and m from
Theorem 2.1 into the expression for tL in Lemma 2.2, we obtain a 
formulation for TP,
Tp = am + max
ti—tjc—tcp
-tK +
1 K - J b^ndt
n
— am + tjj(b]b2 — 1) (2.5)
Since a1? bl5 b2 and tK are specified constants, the lower bound for TP, 
without changing the order of the lower bound of Eq. 2.5, may be expressed 
as,
TP[E<3n>] > oCajn)
where E<3n> represents the computational complexity of evaluating n joint 
positions {qi}£i n joint velocities {qi)i=i, and n joint accelerations {qi}^.
2.2.2 SIMD Multiprocessor Architectural Model
The N-E Equations of motion may be decomposed into a set of homogeneous 
linear unidirectional and recurrence tasks (see section 4.4). This suggests that the 
inverse dynamics problem is amenable to a SIMD multiprocessor-based system. The 
basic architectural model of the proposed architecture to compute the desired problem 
is presented in this section.
It has been stated that "the most critical system control mechanism in a 
multiprocessor-based computer structure are clearly those involved with interprocess
12
and interprocessor communication"[17]. The interprocessor communication network 
for the SIMD model is therefore a crucial choice. In Theorem 2.1, we proved the 
ideal lower bound on the number of parallel processors, m, to compute the inverse 
dynamics problem to be n, where n is the number of manipulator joints. For n>6, 
industrial robots have undesired redundant degrees of freedom. It is therefore safe to 
assume n<8 for worst-case applications. Thus, for n=m<8, a "complete" 
iriterprocessor communication strategy such as the crossbar network is desired for this 
purpose due to the commercial availability of 8x8 crossbar switches. In most image 
processing applications, crossbar interprocessor communication networks are 
generally not used for SIMD processing due to excessive network overhead costs 
since m»8. This is however not the case for our application, as discussed above.
Fig. 2.1 illustrates the proposed SIMD system model. The configuration is 
Structured with m synchronized processing elements with a crossbar interprocessor 
communication network. The microcode ROM (Read-Only-Memory) contains a 
Static instruction schedule for optimal SIMD processing of N-E Equations of motion 
on the proposed system model. Each processing element PEj has its own local 
memory PEMj. Interprocessor communication data is thus directed to single or 
multiple destination PEM; from a single source PEj at the end of every SIMD 
instruction cycle. Memory conflicts, however, occur when multiple Source PEj 
attempt to send data to the same destination PEMj through the crossbar network. In 
addition, an unsystematic SIMD instruction schedule in the microcode ROM may 
cause PEjS to be idle during SIMD processing cycles, thus causing a longer overall 
parallel execution time. It is therefore clear that an algorithm which optimally 
schedules SIMD tasks to the PEs, in addition to scheduling intermediate results 
through the crossbar network without resource conflicts, is desired. The next section 
presents such an algorithm.
2.2.3 Proposed Instruction Scheduling Algorithm for SIMD Architecture with 
Crossbar Interprocessor Communication Network
In this section, a novel algorithm for the scheduling of SIMD-mode processes 
and destination processor data through the crossbar interprocessor network is 
presented. It should be stressed here that the technique for scheduling of destination 
data embedded in the algorithm is valid only for the case of a crossbar interprocessor 













(1) TR(t) = {Tj,.. .,T/}, Is z11, is a set of ready tasks (tasks whose predecessors have 
been executed) at time t which perform the same type of operation.
(2) TRF(t) = {TR1,.. .,TRd}, ke zn, is a set of valid TR’s at time t.
(3) NRi specifies the number of ready tasks in the task set TRie TRS(t).
(4) Nj(t + A) denotes the number of interprocessor data transfers required to be 
performed by the crossbar network at time, t + A, where A is the computational 
delay of the operation initiated at time t.
(5) Ncj(t + A) is the number of unique destination processors to which the result of 
task Ti executed in processor Pj is to be sent at time t + A.
(6) PDi(t + A) = (Pl5.. .,Pr}, re zn, specifies the set destination processors to which 
the result of task Tj executed in processor Pj is to be sent at time t + A.
(7) WG is a weighted precedence graph for a computational problem G, where 
specifications of node task number, node computational delay and node task 
type must be included in the task graph.
Procedure ISSMACIN (Instruction Scheduling for a SIMD Multiprocessor
Architecture with Crossbar Interconnection Network).
Input: Weighted Graph WG and number of available parallel processors m.
Output: Set of tables T = {STAT, DP AT} where,
. STAT is a SIMD Task Allocation Table which specifies the assignment of 
SIMD processes to parallel processors at time t.
DP AT is a Destination Processor Assignment Table which specifies the 
destination processor tags for the proper routing of data through the 
crossbar network at time t + A.
STEP 1: By the CP/MISF method [19], determine the “level” of each task in WG,
where the level /j of task Tj is defined as the longest path from die exist 
node to node Nj corresponding to Tj. Mathematically,
/:= max T tk . J je%
where Jtk is the kth path from the exit node to node Ni? and tj is the
15
computational delay of node j.
STEP 2: Construct the priority list L in the descending order of /j and number of
immediately successive tasks n;. Order tasks of identical l{ and n; in 
lexicographic fashion.
STEP 3: Initial assignment process of tasks in WG to parallel processors for SIMD
program execution:
REPEAT
(a) Determine TRS(t) = {TR1,.. ..Tri,}
(b) Find TRie TRS(t) with max[NRi],
(c) Select max[x] tasks from TRi where xe zn and 1 < x < m. Let Tx be 
the chosen.
(d) Assign the tasks Tx to x parallel processors in lexicographic order.
(e) Direct the corresponding SIMD command to these x processors and 
mask the other m-x processing elements.
(f) Perform TRi = TRi\Tx.
UNTIL NRi = 0 for all possible TRie TRS and t.
STEP 4: Rearrange the task assignment of STEP 3 at each SIMD processing state t
so that Nj(t + A) is maximized. This may be achieved by using the 
generalized backtracking algorithm [15]. The explanation of this 
algorithm is beyond the scope of this thesis. We may however note that 
the time complexity of the specified algorithm is 0(m-s), where m is the 
number of PEs, and s is the number of SIMD processing states t.
STEP 5: The scheduling of destination processor data through the crossbar
interconnection network at time t + A is as follows:
REPEAT
(a) Find task Tje Tx(t + A) with max[Nci] where Tx is the set of x tasks 
computed at time t.
(b) Broadcast result of task T, executed in processor P/, 1 < / < m, to the 
set of destination processors Ppi(t + A).
(c) At the same time, send/broadcast results of set of tasks Tj, where 
TjcTx and T^Tj, to set of destination processors PDj, where 
PDjcPDi(t + A).
16
(d) Tx(t + A) = Tx(t + A)\(TiUTj).
UNTIL Nq = 0 for all possible T^ Tx, t and A.
Let us now illustrate procedure ISSMACIN by a simple example. Fig. 2.2 Shows 
a given WG. The task numbers Tj are specified within the nodes Nj. The 
nomenclature ty j next to each node specifies the type of task to be computed in the 
vertex, where j denotes the type of task. For purpose of simplicity, we shall assume 
the operational delay of task type j is j time units. Applying ISSMACIN to Fig. 2.2 
when two parallel processors with a crossbar interprocessor communication network 
is presumed to be available:
STEP 1: Applying CP/MISF method to Fig. 2.2:
/ j 5, /2 “ 6, / j 5, / ^ 3, / j 4, 
l § = 5, lj = 4, / g = 4, l $ = 4, / jo = 3,
/u = 2, /12 = 3, /13 = 3, /j4 = 2, /15 = 2.
STEP 2: The priority list L of the task set is:
L = {T2, Tl, T3, T4, T5, T6, T7, T8, T9, T10, T12, T13, Til, tl4, T15}
STEP 3: Results for the first-run through the loop which specifies the process for
assigning tasks to processors, we get:
(a) TRS(tl) = {TR1, TR2) where,
TR1 = {Tl, T3, T4, T5} and
: tR2={t2}. ■
(b) NR1 = 4, and NR2 = 1. Thus, TR1 is selected for SIMD processing.
(c) TX={T1,T3,T4}.
(d) Assign Tl to PEI, T3 to PE2, and T4 to PE3.
(e) Direct type 1 SIMD command to PEI-3.
(f) TR1 = TR1\TX = T5.
By repeating the loop, we may arrive at the initial task assignment table of 
Table 2.1 for SIMD processing states tl-t5. Note that the last column of 
the table specifies the type of SIMD task performed at state t/.
STEP 4: By apply the generalized backtracking algorithm to Table 2.1 to minimize
the number of interprocessor data transfers at time ti + A, we may arrive at 
the more efficient task assignment (STAT) of Table 2.2.
17




Fig. 2.2: A given WG
18
Table 2.1: Initial SIMD Task Assignment for WG of Fig. 2.1
Time PE 1 PE 2 PE 3 Task Type
tl Tl T3 T4 tyl
t2 T2 T6 T9 ty2
t3 T5 T7 T8 tyl
t4 T10 T12 T13 tyl
t5 Til T14 T15 ty2
Table 2.2: Final STAT for WG of Fig- 2.1.
Time PE 1 PE 2 PE 3 Task Type
tl T3 Tl T4 tyl
t2 T9 T6 T2 ty2
t3 T5 T7 T.8 tyl
t4 T12 T13 T10 tyl
5 Til T14 T15 TY2
Table 2.3: DP AT for WG of Fig. 2.1
Time PE 1 PE 2 PE 3
tl+Al d3 d2 dl
t2+Al dl d2,d3
t2+A2 d2
t3+Al d3 dl d2
■t4+Al d3 di d2
19
Table 2.4: STAT for a Single Iteration of the N-E Dynamics Algorithm
Time PE 1 PE 2 PE 3 PE 4 PE 5 PE 6 PE 7 PE 8 Task Type
tl 1 - 2 4 5 6 7 9 10 M
t2 '■ 3 8 20 23 25 27 28 I ' A
t3 ’ 21 22 24 26 29 30 37 38 M
t4 39 40 41 42 11 12 13 14 M
t5 34 35 36 46 47 48 31 I A
t6 49 50 51 52 53 54 55 56 M
t7 57 58 59 60 15 16 17 18 M
t8 70 71 72 73 74 75 76 77 . ■ A
t9 19 82 83 84 85 86 87 88 M
tio 89 90 91 92 93 94 95 96 M
til 97 98 99 100 101 102 103 104 M
tl2 105 61 62 63 64 65 66 67 M
tl3 32 33 121 122 123 124 125 126 A
tl4 127 128 129 43 44 45 79 80 A
tl5 81 118 119 120 139 140 141 142 A
tie 109 110 111 112 113 114 68 69 M
tl7 106 107 108 143 144 151 152 153 A
tl8 130 131 132 133 134 135 115 116 M
tl9 136 137 138 145 146 147 157 158 A
t20 148 149 150 159 178 179 180 I A
t21 160 161 162 172 173 174 175 176 M
t22 154 155 156 184 185 186 I I A
t23 177 163 164 187 188 189 190 191 M
t24 206 196 197 198 199 200 201 193 A
t25 205 206 207 208 209 210 211 212 M
t26 192 165 166 213 167 168 169 170 M :
t27 217 218 219 214 215 216 182 183 A
t28 220 221 222 165 166 I I I A
t29 223 224 225 202 203 204 I I A
t30 226 227 228 I I I I I A
/ t31 v 229 I I I I I I I '■ A
t32 230 I I I I I I i A
20
STEP 5: Applying the loop at t2 + A1:
(a) NC2 = 2, Nc6 = 1, NC9 = 1. Thus, task T2 is thus selected for 
primaiy data broadcasting.
(b) Results of T2 is therefore sent to processing elements 2 and 3.
(c) Since PD6 c PD2 and Pj^ cPD2, product of T6 is thus also selected 
for sending its results to PEI at this time.
(d) Tx(t2 + Al) = Tx(t2 + A1)\(T2UT6) = T9
At t2 + A2, result of T9 is naturally routed to PE2. By repeating the 
specified loop, we may arrive at the DP AT in Table 2.3. The nomenclature 
di in the table columns specify the destination processor memories PEMj to 
which respective source processor sends its result at time tx + A.
When procedure ISSMACIN is applied to parallel process a single iteration of 
the N-E algorithm, the resulting ST AT for this application is illustrated in Table 2.4. 
In this Table, the initials M and A in the "task type" column specifies a multiplication 
and addition SIMD operation respectively. Notice the high density of SIMD task 
processing when using the proposed scheduling algorithm to evaluate the 230 tasks in 
a single iteration of the N-E algorithm. The idle processors prevalent from t28 
through t32 will not occur when the number of iterations is more than one (which is 
the case for any n-link manipulator), since tasks of successive iterations are computed 
in SIMD processing fashion at the indicated idle processors at specified time intervals. 
Accordingly, the DP AT generated by procedure ISSMACIN for a single iteration of 
the N-E algorithm is shown in Table 2.5. This table verifies that an average of 3 
interprocessor data transfers through the crossbar network are required at the end of 
every SIMD instruction processing cycle. From these two tables, it may be deduced 
that the worst-case execution time to compute the inverse dynamics problem of an n- 
link manipulator on the proposed architecture using procedure ISSMACIN as primary 
design tool for SIMD processing is (7ta + 15tm + 32tf+ 65ts)n, where, ta is the time to 
perform an add operation, tm is the time to compute a multiplication operation, tf is the 
time to fetch operands from local PEMj, and ts is the time to send the result to the 
destination PEMj(s) through the crossbar network.
2.2.4 Simulation Results and Comparison to Previous work
In this section, we present simulation results for computing the inverse dynamics 
problem using the proposed SIMD architectural model and procedure ISSMACIN;
Table 2.5: DPAT for a Single Iteration of the N-E Algorithm


























































d4-d d3 d8 dl,d2
d3












































da d7 dl 62
\
Table 2.5: DPAT for a Single Iteration of the N-E Algorithm (continued)
Time PEI PE2 PE3 PE4 PE5 PE8 PE7 PE8
tl5+Al d3 dS dd d8
615+A2 da d7 d3
;tis+A3 d7
tl8+Al dl d2 d3
tl8+A2 dl d2 d3
tl8+A3 dl d2
tl'T+dl d4td5 dl d2,d3 d8 d7
tl7+A2 d5 d4 d8 d4
tl8+Al d4 d5 da d3
tl8+A2 d4 d5. da d3
tl9+Al dl d2 d3
tl9+A2 dl d2 d3
tl9+A3 dl d2
t20+Al dl d2 d3 d7,d8 d4 d5,da
t20+A2 d3 dl
t21+Al d4 d5 dd
t22+A2 d4 d5 da
t22+A3 d4 d5
t23+ilX d3 dl d5 da d7
t23+A2 dl dS da
|: t244-AX dl,d4,d7 d2,d5,d8 d3,d8
: t24+A2 dd d4 d2 d5 da d3
j c25+AX dl d2 d3 d4
|: C25+A2 dl d2 d3 d4
j:t28+Al d7 d4 d3 d7 d4 d8
t28+A2 d7 d8
j C27+A1 dl d.2 d3 d4 d5
| t27-rA2 dl d2 d3
| t28+Al dl d2 d3 d5 da
1: t29+dX dl d2 d3
| t30+Al dl
| C30+A2 dl
! ‘.314-Al | dl 1
23











Fetch Operands from 




Store Result in Shared M 
or Destination PEM^ MOVE Dy,<EA>
5 0.3
Add ADD Dx,Dy 3 0.18




















3* ’4' 5' *
Number of Processors
(a): Execution Time versus Number of Processors
3.tTOO
Number of Processors
(b): Speedup versus Number of Processors
Fig. 2.3: Simulation Results for Robot Inverse Dynamics
Computation on Proposed Architectural Model
25
The MC 68020 microprocessor is used as the basis 5f simulation, i.e. each PEi in Fig.
2.1 is represented by the specified processor type. Table 2.6 specifies particular 
instruction execution times of a 16.7MHz MC 68020 processor relevant to the 
processing of the specified problem. This data is used to simulate the performance of 
the N-E dynamics algorithm on the proposed architecture, and also for the purpose of 
comparison with previous work.
Fig. 2.3(a) shows a plot of the execution time for computing the inverse 
dynamics problem of a six link robot versus the number of parallel processing 
elements when using the proposed architecture and scheduling algorithm. An average 
lower bound processing time of 513.84|is is achieved for 7 to 9 PEs. Note that the 
time lower bound is not achieved at exactly six parallel processors as required by the 
ideal case (see Theorem 2.1) when a manipulator with six degrees of freedom is used. 
In fact, the actual lower bound cost is achieved for the case of nine PEs where the 
execution time is 100|is lower than that of the ideal case at 6 parallel processors. This 
slight distortion from the ideal case is due to the irregular number of propagation 
delays when passing intermediate results through the interprocessor network. As seen 
in Fig. 2.3(b), speedup>5.7 over the uniprocessor solution is accomplished for m>8, 
where m is the number of parallel processors. Also, Fig. 2.3(c) illustrates the 
variation of architectural efficiency with the number of parallel processors. The 
efficiency is defined as Sm/m, where Sm is the speedup for m parallel processors. 
Intuitively, the efficiency simply gives a measure of how the achieved speedup 
compares with the ideal speedup. Thus, a high performance evaluation weight should 
not be given to this plot. From these figures, it is evident that lower bound solutions 
may be achieved by using m=8, which is also the desired number of PEs if 
commercially available 8x8 crossbar switches are to be used in the architecture.
We shall now compare the performance of the proposed mutiprocessor 
architecture and scheduling algorithm for the computation of the inverse dynamics 
problem to that of Kasahara and Narita [19]. Note that the simulations for comparison 
purposes will use common MC 68020 processing times of Table 2.6 to maintain 
fairness. The specified authors use the DFIHS (Depth-First-Implicit- Heuristic- 
Search) scheduling algorithm (see appendix E) to parallel process the N-E dynamics 
Equations on a multiprocessor system with a shared single bus interprocessor 
communication mechanism. Fig. 2.4(a) compares the execution time for the 
computation of the specified algorithm on the two architectural models when m is 
varied. A performance improvement of more than 1.75ms for m>8 is achieved for 
the case of the proposed system over that of Kasahara and Narita. In addition, Fig. 






































Kasahara and Narita [19]
2031.i
Number of Processors






(b): Relative Speedup versus Number of Processors
Fig. 2.4: Performance Comparison of Proposed Architecture to that






















(c): Percentage of Total Execution Time spent on Interprocessor 




3.4 for m>8 over the previous architectural model. Fig. 2.4(c) suggests a valid cause 
for the much lower performance of Kasahara and Narita’s parallel processing 
environment. It is seen that interprocessor communication is a major bottleneck in 
their shared bus multiprocessor system. In fact, over 90% of the algorithm execution 
time in their architecture is spent by performing dialogue between PEs for m>8. 
Latencies due to arbitration costs for access to the shared bus is thus a crucial 
performance degradation factor when multiple PEs attempt to perform interprocessor 
communication via mailboxes at the end of instruction processing cycles. Note the 
moderately lower percentage of total execution time that the proposed system spends 
for interprocessor communication (approximately 60% for m>8). The crossbar 
interprocessor communication network is naturally one of the most important reasons 
for the higher perforamnce of the proposed architecture. In addition to optimal 
scheduling of data through the crossbar network to destination PEs in minimal number 
of interprocessor transfer latencies, simultaneous operand fetches by the m parallel 
PEs from respective PEMj is performed to minimize overall algorithm execution time.
2.3 Parallel Processing of PUMA Forward and Inverse Kinematics
2.3.1 Multiprocessor Architectural Model
The non-linear, non-recursive and irregular form of PUMA forward and inverse 
kinematics Equations (see Appendix C and D respectively) suggest that these 
algorithms are not amenable for a SIMD parallel processing environment. In fact, if 
such a type: of architectural model is used for this application, severe performance 
degradation will occur due to multiple idle/masked processors at each parallel 
execution stage. The masking of SIMD PEs occur since insufficient number of of 
identical tasks will be active at any processing level due to the non-recursive nature of 
the specified algorithms. An alternate architectural approach must therefore be 
investigated. A multiprocessor system with shared memory for interprocessor 
communication is studied for this application. This architectural model is illustrated 
in Fig. 2.5. The memory arbitration mechanism allows access to the mailboxes in 
shared memory for interprocessor communication on a first-come first-serve basis. 
Th@ DFJHS (Depth-First-Implicit-Heuristic-Search) scheduling algorithm (gee 
Appendix E) is used to create the static schedule for parallel processing of forward 
and inverse kinematics equations. This static schedule of parallel tasks is stored in the 
microcode ROMj at run time.
Shared Memory
ROM,
Fig. 2.5: Multiprocessor Architectural Model for PUMA Forward
and Inverse Kinematics Computations
31





Fetch Single Operand 
from Shared M FMOVE <EA>,FPy 2.28





Store Result in Shared M FMOVE FPy,<EA> 2.28
Add FADD FPx,FPy 3.05
Subtract FSUB FPx,FPy 3.05
Multiply FMUL FPx,FPy 4.25
Divide FDIV FPx,FPy 6.17
Square Root FSQRTFPy 6.41
Arc Tangent FATAN FPy 24.1
Sine FSIN FPy 23.4
































(b) Speedup versus Number of Processors
Fig. 2.6: Simulation Results for Forward Kinematics
Performance degradation occurs in this architecture due the operational delay 
caused by multiple PEs waiting for access to shared resources. Note that overall 
execution speed may not be improved for computation of the specified problems by 
using an iterprocessor network with local PEMjS since these types of networks are 
generally only efficient for SIMD machines.
2.3.2 Simulation Results
Simulation results for the computation of PUMA forward and inverse kinematics 
algorithms using the shared memory interprocessor communication strategy and 
DFIHS task scheduling algorithm (see Appendix E) is presented here. The MC 68020 
with its MC 68881 floating point math coprocessor is used as the basis of performance 
evaluation. In other words, each PEj in Fig. 2.5 is represented by this dual processor 
combination. Table 2.7 specifies the execution time data of MC 68881 operations 
needed for the computation of the specified problems. These processing times are 
used to generate the simulation results presented hereafter.
The variation of execution time for the forward kinematics algorithm with the 
number of parallel processors m is depicted in Fig. 2.6(a). Minimum execution time 
is achieved at approximately 392(0.s for m>8. As seen from Fig. 2.6(b), speedup close 
to 2 is reached for m>8 over the uniprocessor solution. Also, Fig. 2.6(c) specifies how 
the architectural efficiency varies with the number of parallel processors. As a 
reminder, high importance should not be given to this plot since it simply shows how 
the achieved speedup compares to the ideal speedup. Finally, Fig. 2.6(d) shows the 
percentage of total execution time that the parallel processors spend for interprocessor 
communication to evaluate the specified algorithm. Approximately 85% of the 
overall execution time is spent performing interprocessor communication. This high 
rate is primarily due to arbitration costs for access to shared memory. Note again that 
performance will not be improved by using an interprocessor network with SIMD 
instruction processing, although the interprocessor communication latencies may be 
lower for this case.
Simulation results for parallel processing of the inverse kinematics algorithm 
will now be discussed. Fig. 2.7(a) verifies that minimum execution time of 
approximately 750|is is achieved when five parallel processors are used. Further, 
speedup of 2.3 occurs for m>5 over the uniprocessor solution as illustrated in Fig. 
2.7(b). For consistency, the variation of efficiency with m is delineated in Fig. 2.7(c). 
Finally, Fig. 2.7(d) confirms that over 61% of total algorithm execution time is spent 
for interprocessor communication through shared memory. Notice that this last result 
























(d): Percentage of Total Execution Time spent on Interprocessor 




























(b): Speedup versus Number of Processors






(c): Efficiency versus Number of Processors
Humber of Processors
■(d): Percentage of Total Execution Time spent on Interprocessor 
Communication versus Number of Processors
Fig. 2.7 (Continued)
37
spends for interprocessor communication.
2.4 Conclusions
The ideal lower bounds on the number of parallel processors and execution time 
to evaluate the inverse dynamics problem of an n-link manipulator in a parallel 
processing architecture using an optimal scheduling algorithm was derived to be n and
o( ajn ) respectively, where is a specified constant. Next, a novel SIMD
scheduling algorithm to perform parallel processing on a SIMD multiprocessor-based 
architectural model with a crossbar interprocessor network to compute the specified 
problem close to the ideal lower bound is presented. The performance of the N-E 
dynamics algorithm on the proposed architecture using the specified scheduling 
technique is next simulated with the MC 68020 representing each processing element. 
Speedup factor of greater than 3.4 over previous related work [19] for the computation 
of the defined problem on a parallel processing system is achieved. The SIMD 
architectural model is then not used again to simulate the performance of PUMA 
forward and inverse kinematics algorithms because of the non-recursive and non­
linear nature of these algorithms. A multiprocessor model with a shared memory 
interprocessor communication strategy is instead investigated for this purpose using 
the DFIHS scheduling algorithm. The MC 68020 with its MC 68881 math 
coprocessor is used as the basis of the next few simulations to determine the 
performance of the kinematics algorithms on the specified architecture. Results show 
speedup of approximately 2 over the uniprocessor solution when the number of 
parallel processors is greater than eight and five for the cases of forward and inverse 
kinematics algorithms respectively. Very large speedups is not achieved in this latter 




A COST EFFICIENT BIT-SERIAL ARCHITECTURE FOR ROBOT 
INVERSE DYNAMICS COMPUTATION
3.1 Introduction
The parallel, pipelined, and recursive nature of the N-E dynamics algorithm 
suggests that it is amenable to a custom bit-serial array architecture. This chapter 
presents such an architecture to compute the desired algorithm within the bit-serial
execution time lower bound of o( c^ + c2kn ), where Cj and c2 are specified
constants, k is the system word length, and n is the number of manipulator links. A 
multi-functional bit-serial cell is developed as a building block for the recursive array 
unit proposed. The cell’s fault-tolerant external asynchronous communication 
protocol and its high-performance design using Zipper CMOS [10] circuit structures is 
of particular interest. This cell may easily be used as a "standard cell" to design the 
two array chips which realize the forward and backward recursions, respectively, of 
the N-E dynamics formulation. In addition to these two chips, a minimum number of 
external FIFO register files are required to implement the proposed system.
3.2 Choice of a Bit-Serial Architecture
The Newton-Euler dynamics formulation is a highly pipelined recursive process. 
A bit-serial architecture is therefore a natural candidate for its efficient computation in 
a closely pipelined cost-efficient manner. The single wire input and output of bit- 
serial devices allows tight pipelining of operational cells, leading to efficient 
qppaippnication within and between chips. This is an important advantage sinpe 
communication issues are a dominant factor in the computation of the inverse 
dynamics problem. In addition, when comparing n-bit word length bit-serial and bit- 
parallel arithmetic operators, we can expect the bit-serial part to contain 1/nth of the 
hardware of the bit-parallel device [8]. Minimum chip count solutions are therefore
feasible in bit-serial systems without the problems of on-chip routing of data lines. In 
fact, the N-E dynamics algorithm may easily be computed in the proposed 
architecture with only two bit-serial chips in addition to a minimum number of 
external FIFO register files. Preliminary studies show the two chips require a 
combined die area of approximately 70mm2 when fabricated in a 1.2|i CMOS 
technology, using the proposed bit-serial cell architecture as the standard building 
Kleek. If bur proposed architecture is implemented using a bit-parallel architecture; a 
total of 32 matrix processors capable of performing matrix-vector multiplications and 
vector cross-products operations would be required. In addition, a more complex 
control network is needed for the bit-parallel case. Furthermore, the parallel wiring 
interconnections between bit-parallel processors suggests that it is a cost-inefficient 
approach.
39
3.3 Time Lower Bound to Compute the Inverse Dynamics Problem on a Bit 
Serial Architecture
In this section we will discuss the limitation of speeding up the inverse dynamics
computational problem when running on a bit-serial system. Before deriving the time
lower bound, the following notation and lemma must be established.
Notation:
(1) E<1> denotes a linear arithmetic expression E of 1 distinct atoms, where an atom 
is a constant or a variable, e.g. E<4>=a+b-c/d.
(2) Tk = Minimum time to evaluate an expression E<1> in a non-recursive bit-serial 
system.
(3) T1 = Shortest time to compute the first iteration of the N-E algorithm in a bit- 
serial system.
(4) = Shortest time to evaluate each additional iteration when computing the 
remaining n-1 iterations of the N-E algorithm in a recursive bit-serial system.
(5) Tn = Minimum time to compute the inverse dynamics problem in a recursive 
bit-serial system.
...... 40 '
Lemma 1: The time lower bound of Tk[E<l>] [8]. The shortest bit-serial
processing time to evaluate an arithmetic expression E<1> is bounded 
below by and equal to o([kj), in other words,
Tk[E<l>] > o( L kj )
Theorem1: The minimum time to compute the inverse dynamics problemofan/i- 
link manipulator in a recursive bit-serial architecture is bounded below 
by o^Lkj + c2tknj ), where k is the system word length, Ci and c2 are 
specified constants.
Proof: Addition, subtraction and multiplication are the primitive arithmetic
operations prevalent in any iteration of the inverse dynamics problem. By 
beginning a subsequent operation before finishing the previous one, a bit- 
serial arithmetic cell requires k clock cycles to evaluate any one of the three 
specified operations. The minimum time to compute the first iteration of the 
N-E algorithm may therefore be expressed as a function containing at least 
3k atoms. Using Lemma 1, we thus have,
T![E<3k>] > 0(|_ 3kJ ). (1)
Let fyk (where ^ is a constant) be an uniform clock period count needed to 
evaluate each additional iteration of the N-E algorithm in a recursive bit- 
serial system. To better conceptualize, bjk clock periods is the constant 
additional time required to compute those operations in the ith iteration 
which are not evaluated in parallel to the (i-l)th computational cycle. 
Therefore, the additional time to calculate the remaining (n-1) iterations 
may be expressed as follows:
Vi-bikCn-l).
Clearly, the minimum time to determine Tn-1 is when bx is one. Thus, the 
lower bound of T^ is defined as,
Tn l[<n-1>] ^ o(Lk(n - 1)J ). (2)
By combining equations (1) and (2), we obtain
Tn[E<n>]2i O(L2k + knJ ).
The inverse dynamics problem may be considered as computing a set of 
arithmetic operations which result in obtaining the joint torques. Each joint 
torque Xj of joint i can be expressed as an arithmetic operation containing at
41
least 3n atoms: n joint positions n joint velocities {qj^, and n joint
accelerations {q^^, implying,
Tn[E<3n>] > o(|_2k + 3knJ ). (3)
Rewriting equation (3) without changing the order of the lower bound to 
evaluate a set of n torques,
Tn['Ci,x2v..,xn]>o( qk + c2knj ).
where cx and c2 are specified constants. Q.E.D.
By establishing a time lower bound in Theorem 1, we can now compare and 
contrast bit-serial computational structures by simply comparing the coefficients cx 
and C2. It should be noted here that such comparisons will only be valid if the 
competing systems use the same word length and manipulator link count.
3.4 Bit-Serial System Architecture
The high-level view of the bit-serial system architecture is illustrated in Fig. 3.1. 
Basically, the core of the system consists of two bit-serial array processors: Processor 
1 computes the forward recursions and Processor 2 evaluates the backward iterations 
of the N-E algorithm.
Operand feeding of the bit-serial arrays is an asynchronous mechanism. Initially, 
the host processor loads the rotation ( -+1R) and inertia (%) matrices, relative position 
(’Pi+i), joint velocity (0) and acceleration (0) vectors, and the scalar link masses (mj) 
into the PISO FIFO (Parallel-In-Serial-Out First-In-First-Out) register blocks in the 
order shown. The asynchronous handshaking protocol used by the bit-serial cells to 
communicate with external devices allows computations to be initiated whenever both 
operands of any bit-serial cell are available. The philosophy employed is similar to 
that of data flow computer architectures. This eliminates the "initial delay time" (time 
required to set up all the operands) which degrades the performance of many custom 
algorithmic processors.
Whenever the host processor loads any operand(s) into the FIFO, it also informs 
the respective bit-serial cell whose operands are currently the load has been 
performed. The notified cell subsequendy provides the clock pulses to its operand 
FIFOs to load its input data synchronously in a bit-serial fashion. Pulses delivered by 




















Fig. 3.1: Bit-Serial System Architecture
possibility of misalignment of multi-operand data bits.
Array processor 1 computes the angular velocity (‘coj) and acceleration (‘coj), and 
the linear acceleration ('v;) vectors which are fed back multiplexed with their 
initialization vectors as inputs to cells earlier in the processor’s pipeline. Outputs of 
the processor are total external forces (*Fj) and moments (‘Nj) exerted at the center of 
mass of link i. These computed results are then loaded into array processor 2 through 
the §tS6 LIFO (Serial-In-Serial-Out Last-In-First-Out) register files. A LIPt) 
structure is used since the total forces and moments computed in processor l’s last 
iteration are needed as operands of processor 2’s first iteration.
Processor 2 initiates execution at the end of processor l’s nth iteration, where n 
is the number of manipulator links. Operands are loaded into this processor in a 
manner identical to Processor 1. Intermediate results of computations in the processor 
include forces (Tj) and moments exerted on link i by link (i-1) with respect to 
the base frame. These results are subsequently fed back multiplexed into the array 
with their initialization vectors. A complete set of joint torques and forces (Xj) are 
obtained at the end of the nth backward iteration and loaded into an SIPO FIFO 
(Serial-In-Parallel-Out First-In-First-Out) register file as system outputs.
43
3.5 Bit-Serial Array Processor Organization and Performance
At this stage in the paper, we shall assume that an individual bit-serial cell is 
capable of computing the 3-D vector and matrix arithmetic operations depicted in 
Appendix B. Details on how the cell implements these functions is discussed in 
Section 3.7.
Fig. 3.2 illustrates the parallel and pipelined organization of bit-serial cells in 
Processor 1 to evaluate the forward iterations of the N-E algorithm. The initials 
within each node denote the type of operation that the particular cell is required to 
perform. Also, the number next to each cell serves as a means of individual cell 
identification. Interconnections between cells represent "single-wire" bit-serial data 
paths, the cells are organized in eight stages of a pipeline with some intermediate 
results being fed back as inputs to cells earlier in the pipeline in a recursive fashion.
Table 3.1 summarizes the computation times of the various matrix and vector 
arithmetic functions which may be evaluated by any individual bit-serial cell. The 
operational delay data from this table is used to construct the pipeline schedule of 
tasks to compute the first three iterations of the N-E dynamics algorithm in Processor 







VC-VECTCR CROSS PRODUCT 
SV-SCALAR-VECTOR PRODUCT








VC-VECTOR CROSS PRODUCT 
IP-INNER PRODUCT
Fig. 3.3: Pipelined Organization of Bit-Serial Cells in Processor 2
46









(In# of Clock . 
Cycles)
Operation Delay Using 
16-Bit Wordlength and 
Clock Frequency of 20-MHz
Vector-Add 1 3 Adders k 300 ns
Scalar-Vector





















. 1.6 ms I
. |
47
the system word length. Also, the numbers in the iteration columns indicate which 
bit-serial cell of Fig. 3.2 is executing a particular operation during the specified time 
interval. Notice that bit-serial cells begin execution of tasks in successive iterations 
as soon as they complete their designated operations in the previous iteration, 
provided process precedence relations still hold true. Thus, the parallelism between 
successive iterations allows high cell utilization. This higher performance operational 
featurb iS Supported in a fault-tolerant manner by the asynchronous hahiishaking 
protocol (described in Section 3.7) used by the bit-serial cells for external 
communication. A total of 1 lk clock periods are needed to compute the first forward 
iteration. The massive amount of parallelism between tasks of successive iterations 
allow each additional iteration to be processed in only 4k clock cycles.
Similarly, Fig. 3.3 illustrates the organization of bit-serial cells in array processor 
2 to compute the backward iterations. Accordingly, Table 3.3 depicts its pipeline 
schedule to process three backward iterations. The first backward iteration requires 8k 
clock periods, whereas each additional iteration needs 2k clock cycles for completion.
The total time to compute the inverse dynamics problem on the proposed 
architecture may, therefore, be expressed as [19k + 6kn] clock periods, where k is the 
system word length and n is the number of manipulator links. Thus, for comparison 
purposes, the coefficients C! and C2 (defined earlier in Section 3.4) are 19 and 6, 
respectively, to compute the inverse dynamics problem in our particular recursive bit- 
serial system architecture. The total time to compute the N-E dynamics algorithm is, 
therefore, 44|is when a 16-bit system word length, 20MHz operating frequency, and a 
6-link manipulator is used.
3.6 Multi-Functional Bit-Serial Leaf Cell Architecture and Operation
In this section, we shall describe the architecture and operation of a multi­
functional bit-serial cell capable of performing the various 3-D matrix/vector 
arithmetic operations depicted in Appendix B. A block diagram of this cell is shown 
in Fig. 3.4. The 3-bit mode register determines the type of vector operation that the 
particular cell is required to perform. This register is set by direct hardware control. 
A PLA-based control unit is used for the control of the external asynchronous 
communication mechanism and internal operation mode. The counter provides load 
pulses to operand cells to obtain the cell’s input data in a synchronous fashion. Shift 
registers, "previous" and "new", store up to two intermediate results temporarily when 
the cell to which these results are to be sent is not ready to receive new operands. The 
three voters determine bit-serial data routing paths to support alternate arithmetic




Iteration- #1 Iteration #2 Iteration #3
k ■ .
1 2 3 4 !
2k 2 4 5 I
3k 5 6 ! oi . - ■
i 'i • *
4k 7 3 9 10 ! 1 2
5k 7 3 9 10 5 2
6k 11 12 13 14 15 16 3 5 2
7k 11 12 13 14 15 16 8 7 3 9 1
8k 17 18 19 7 8 9 10 . i
■
I 5
'9k ! 20 1 10 11 12 13 i 5
10k ! - 21 . 4 11 12 13 14 15 16 ■ \i 3 7 8 9
Ilk • i 22 j 4 14 15 .16 6 7 3 9
12k ; i | 17 18 19 10 11 12 13
13k | . i 20 10 11 12 13
14k I i 21 4 14 15 16
15k | 1l 22 1 4 14 15 16 .
16k ! 17 18 19
17k 1 | 20
18k i i 21
19k 22





Iteration =£1 Iteration Iteration =£3
k 1 2 3
2k 12 6 3
3k 4 5 6 2 3
4k 7 8 12 6
5k 7 . 1 5 6 2
6k 9 4 8 1 2 6
I ' 7k ' 10 7 13 6
3k . 10 9 4 . 8





operating modes. The function of the voters is illustrated in Fig. 3.7(a)-(d). It should 
be noted here that a matrix-vector multiplication task (not shown in Fig. 3.7) is 
realized by simply combining three inner product cells. Individual cell utilization and 
operation delay within the cells to evaluate the matrix/vector operations is 
summarized in Table 3.1.
Fig, 3.5 shows the control and data line interconnections between neighboring 
bit-serial cells. Asynchronous handshake control signals, RTRIN (Ready-To- 
Receive-IN), RTROUT (Ready-To-Receive-OUT), RTS IN (Ready-To-Send-IN), 
RTSOUT (Ready-To-Send-OUT), LOADCLKIN (Load-Clock-IN) and 
LOADCLKOUT (Load-Clock-OUT) use an inter-cell communication protocol to 
achieve higher performance and fault-tolerance in the system. The flow chart 
illustrated in Fig. 3.6 provides details of this asynchronous communication protocol.
When a cell is ready to receive new operands, it indicates this by asserting the 
RTROUT signal, and if the preceding cells are ready to send the operands, they 
activate the RTSIN lines. In response, the current cell generates LOADCLKOUT 
pulses to load its operands in a bit-serial fashion. This particular feature allows 
synchronized loading of multi-operand data bits. Near the end of an operation, the 
succeeding cell may have the RTRIN signal asserted, which indicates it is ready to 
receive its operands. In that case, the current cell activates its RTSOUT line, thus 
indicating that it is ready to send the operands. Subsequently, the next cell provides 
the LOADCLKIN pulses to load its input data bit-serially. However, if the 
succeeding cell has the RTRIN signal negated at the end of any operation, the result 
will be loaded into temporary "previous/new" shift registers. If both the "previous" 
and "new" shift registers are full at any point in time, the given cell is not allowed to 
load any further operands for execution until one of these registers transfers its 
contents into the succeeding cell.
3.7 Implementation of Bit-Serial Cell In CMOS Technology
Our design strategy utilizes Zipper CMOS [28] circuit structures. It is basically 
an improved version of Domino [20] and NORA [14] dynamic CMOS circuits, which 
are characteristically immune to the problems of instability and charge-sharing 
prevalent ip the latter two techniques. Also, area utilization is considerably better and 
Zipper CMOS circuits can operate at least two times faster than static CMOS circuits. 
In addition, its high density allows it to be clocked at very high frequencies. It is thus 









^ O - QDC (/) 0HR HR





































BIT SERIAL BIT-SERIAL 
MULTIPLIERS ADDERS/
SUBTRACTORS















Fig. 3.5: Interconnections Between Neighbouring Bit-Serial Cells
53




(c); Inner Product AGO
Y







(d); Vector Cross Multiply
Fig. 3.7 (Continued)
56
The basic structure of a Zipper CMOS circuit uses a single Zipper driver circuit 
which generates four strobe signals to drive subsequent N-P CMOS logic blocks. 
During precharge, the output of every N block is high and that of every P block is low. 
This ensures that the transistors of each dynamic block are off. Dining evaluation, the 
output of each N stage can undergo only one transition, from high to low, and the 
output of each P stage can undergo only one transition, from low to high. Signals 
therefore propagate down each stage of the circuit in a "staggered" fashion.
In the Zipper driver circuit of Fig. 3.8, strobe signals ST and ST simply act as a 
two phase clock to drive the logic stages. Signals ST1 and ST1 which feed the N 
block’s precharge pullup and P block’s precharge pulldown, respectively, produce a 
residual voltage which keep the precharge transistors slightly on to overcome leakage 
currents. It is therefore clear that this driver plays a major role in insuring stability 
and minimizing charge-sharing effects. It should be noted that only one driver circuit 
is needed for each bit-serial multiplier and adder/subtractor unit used in the cell.
Fig. 3.9 illustrates our implementation of the bit-serial full-adder/subtractor using 
Zipper CMOS circuits and dynamic latches. It is designed as a reduced majority 
function of its inputs. Simulation results have shown the propagation delay of this 
module to be 37ns. The bit serial adder is also the most significant circuit in the bit- 
serial multiplier. This suggests that an individual cell may operate at frequencies of 
up to 25MHz.
The design of a single module of Lyon’s 2’s complement bit-serial pipelined 
multiplier in Zipper CMOS is shown in Fig. 3.10. The module is implemented using a 
Zipper CMOS full-adder, pass-transistors, and dynamic registers. For a k-bit system 
word length, a bit-serial multiplier consists of k such modules, with serial 
interconnections of the data words (Aj and Bj), the partial product sum (PPSi), and the 
control signal Rj. This latter control signal is used to truncate the low order bit of the 
partial product sum and to clear the carry flip-flop after each multiplication. The last 
module of such a multiplier employs a full subtractor instead of a full-adder, as is 
required by the two’s complement algorithm. The bit-serial multiplier takes k clock 
periods to form a k-bit product of two k-bit operands, by beginning a subsequent 
operation before finishing the previous one.
3.8 Conclusions
We have described a cost-efficient parallel and pipelined bit-serial system for the 
inverse dynamics computational problem to achieve the bit-serial execution time
n-TYPE MOSFET













Fig. 3.10: Zipper CMOS Bit-Serial Multiplier Module
60
lower bound of o(|^c1k + C2knj ). A high performance multi-functional bit-serial cell
architecture is described as a building block for the array configurations of the system. 
Zipper CMOS circuit design strategy is proposed for the cell’s implementation to 
minimize propagation delays and maximize operating frequencies. For the case of a 
16-t)it system word length and 20MHz operating frequency, the total time to compute 
the inverse dynamics problem of a 6-link manipulator on the proposed bit-serial 
architecture is only 44|is.
61
CHAPTER 4
SYSTOLIC ARCHITECTURE FOR ROBOT INVERSE DYNAMICS
COMPUTATION
4.1 Introduction
A novel systolic architecture to compute the inverse dynamics computational 
problem of an n-link manipulator within the systolic execution time lower bound of 
o| dihj, where dj is a specified constant, is proposed. A modified form of the systolic
design methodology of Moldovan and Fortes [40] is used as the primary design tool 
for the architecture. Modifications were required to provide allowances for unequal 
systolic execution times of processing cells, and to eliminate broadcasting of any 
input or generated variable within the systolic array. The inverse dynamics problem is 
decomposed into a set of directional and linear recurrence tasks amenable for direct 
mapping onto a fixed systolic system. This task set is composed of seven "basic" 
types of tasks. Thus, a basic set of systolic processors onto which these basic types of 
tasks may be mapped is discussed. Next, the specified systolic processor set is used as 
building blocks for realization of the overall systolic system. Due to alternate paths 
and operational delays of different systolic processors, operands may appear at muti- 
input modules at unequal arrival times, causing a longer pipeline time. Delay buffers 
must therefore be inserted to balance the pipeline. The optimal buffer assignment 
problem is reduced to an integer linear optimization problem. The resulting systolic 
architecture is thus realized in a maximally pipelined manner with a minimum number 
of delay buffers inserted along various computational paths.
4.2 Systolic Array Design Methodology
Various systolic array design procedures [22,23,31,40,48,59] are described in current 
literature. The systolic design algorithm of Moldovan and Fortes [40] is found to be 
the efficient since it has been proven that their design methodology produces solutions 
which have performance features close to those of dataflow machines. A modified
form of their systolic array design methodology will thus be used as the primary 
design tool for realization of systolic processors to compute the inverse dynamics 
problem. As mentioned earlier, some modifications to their methodology are essential 
to develop our particular systolic system for allowance of different computational 
delays of alternate systolic processors, and to eliminate broadcasting of any input or 
generated variable. These ideas will become more apparent by the end of section 4.6. 
The modified design methodology is presented below.
Procedure: SADLRE (Systolic Array Design of Linear Recurrence Equations)
Input: Algorithm A over an algebraic structure S as a 5 tuple A = (Jn,C,D,X,Y) 
where,
J" is a finite index set of A, .Tel11;
C is a set of computations indexed by J";
D is a set of data dependencies;
X is a set of input variables;
Y is a set of generated variables.
Output: New dependence matrix A of the developed algorithm and set of tables
T = {GVT, UVT, IT}, where,
GVT is a Generated Variable Table which specifies the time and cell of 
execution of yi€ Y at index point j;
UVT is a Used Variable Table that specifies the times and cell location 
from which yj€ Y are available for use at index point j;
IT is an Input Table that specifies the times and cell locations from 
which xjeX are available for use at index point j.
STEP 1: Determine a set of generated variables {yBi} with linear indexing matrices 
IBi that are susceptible for broadcasting. Variable yBi is broadcasted if,
rank(IBi) <n
where n is the dimensionality of the algorithm index set J11, and ieln.
STEP 2: When {yBi} has more than one variable, determine a set of indices {jBi}€jn 
corresponding to linearly dependent columns in Im- ■■■'
STEP 3: Select a transformation II that will provide a valid linear time scheduler, 
fldi, dj€ D. Heuristically, II may be found by following these steps:
62
63
(3.1) When {yBi} has more than one variable, then nBidBi must be of the 
form,
ndBi = Ci+ X Jicji
ji£jBi
and the following conditions must be satisfied,
Q- 2 tji * Sji) dBJ + £ tji * SjJ dBJ (4.1)
Cji > tji * SJ where j; e jBi (4.2)
where, dBie D provides data dependence vectors of variables yBi, Q 
are the constant parts of nBidBj, Cj; are the coefficients of variables 
jiejBi in nBidBi, and t^ is the maximum time needed for the variable 
yBi to travel a single unit distance in the direction of ji. This time 
variable may easily be deduced from the original recurrence 
equation. It is equal to the interprocessor communication time, if the 
Output y| in the original recurrence equation is constant for any jj. 
On the other hand, t^ is the sum of the interprocessor communication 
time and cell computational delay in the ease of a varying output yj 
when index jj is varied. Sjj represents a row of the transformation 
matrix S (see STEP 5) corresponding to the transformation of the 
jBith index in dBi; dBie dj is the constant part of dependence vector 
dBi; d^e dj is the coefficient of the jBith index in dBi.
(3.2) Time schedulers of input variables, xje X, must satisfy the condition,
l^xi^xi ^ tc
where IIxien transform columns dxieD that provide data
dependence vectors for input variables xi? and tc is the interprocessor 
communication time.
(3.3) Time schedulers of generated variables, y^eY, must meet the 
condition,
Hyjdyi > (tf +1<^)
where 1^611 transform columns dyieD that provide data
dependence vectors for variables y-v and tf is the computational delay 
within a cell.
64
(3.4) The final II must be chosen so as to minimize the algorithm 
execution time,
maxll(j1 - j2) + tf 
minlldi
(4.3)
for any jl5 j2e J” and d,e D.
STEP 4: Select the columns of the matrix of interconnection primitives, P, as
Pi = di\dBi
where P}€ P, and d[, dBie D; D corresponds to the last one and two rows of 
dependence matrix D for one- and two-dimensional systolic arrays 
respectively; dj corresponds to columns of D; dBi corresponds to columns of 
D that provide n-1 elements of dependence vectors for variable yBr
STEP 5: Select a second transformation S of size (n-l)xn, where n is the dimension 
of the algorithm index set J”, such that:
(5.1) When {yBi} has more than one variable, elements of S must satisfy 
the condition,
SjBidBi = Ql + )BiQ2
where C^eZn and Cj2eIn, Q2*0 and SjBi is a row of S which 
determines the spatial interconnect of variable yBi in the direction of
jfii*
(5.2) Diophantine equation SD = PK may be solved for S, where matrix K 
which indicates the utilization of primitive interconnections in 
matrix P must satisfy the conditions of Kjj > 0, and < Ildj.
j
(5.3) Matrix transformation T =
n
S is non-singular.
For the chosen transformation, T, the partitioning hyperplanes nPi 









maxIIpKO1 - j2) + 1
mK
for any j , j e Jn, and mK is the width of the band. 
STEP 6: Determine new dependence matrix A = TD.
STEP 7: Determine GVT (Generated Variable Table). Significant columns of this 
table specify the initiation time (tj), completion time (tp) and location of the 
cell (/) where index point j is processed.
ti = nj
tE = ti + tf
T=s]
where % in the computational delay of a single cell.
STEP 8: Determine UVT (Used Variable Table). Significant columns of this table 
specify the times {ty} and location of cells {/y} from which the set of 
generated variables y(j - dy) are available for use at index point j.
ty=no-dy) + tf 
ry=sa-dy)
where dy is the dependence vector corresponding to generated variable y.
STEP 9: Determine IT (Input Table). Significant columns of this table include the 
times {tx} and location of cells {/x} from which input variables x(j) are 
available for use at index point j.
tx = n(f-dx)
K = Sj-d,)
where dx is the dependence vector corresponding to input variable x, and tc
66 ■
is, as defined above, the interprocessor communication time.
STEP 10: Use the new dependence matrix A along with the set of tables 
T = {GVT, UVT, IT) to map the algorithm onto a fixed systolic 
architecture. The amount of time delay for buffers inserted along the paths 
of generated and input variables within an individual cell for 
synchronization purpose may be determined by,
lDx = ndx - tc 
lDy = ndy - (tf + tc)
where, tDx and t^ are the time intervals required for delay buffers inserted 
along the paths of an input (x) and generated variable (y) respectively. 
dx, dy, tc and tf are as defined before.
STEP 11: Partition the developed algorithm onto a set of r bands (r may be
determined by equation (4.4). The mapping of the indices to processors 
of the partitioned algorithm is as follows: each index point j is 
processed within band Bj in the processor whose ith coordinate is,
jPi = nPij moduif
where uij is the width of the ith coordinate inside the band. It should be 
noted here that this step usually results in a more cost-efficient systolic 
architecture for algorithms with a large number of recursive iterations.
The modifications to the design methodology of [40] are included in design 
steps: 1, 2, 3.1-3.3, 4, 5.1, and 7-10. Step 1 determines whether any generated 
variable in the given linear recurrence problem may be broadcasted, which is 
undesired in any systolic architecture. Step 2 simply selects these variables which are 
amenable to broadcasting so that their respective dependence vectors may be used to 
construct special conditions on the selection of transformation II in step 3.1. As 
described in step 3.1, IIdBi must be of the form depicted to ensure variable(s) yBi is 
sequentially propagated through the systolic array instead of being viable to 
broadcasting. Special conditions are established on the choice of Q and Cj; so as to 
set lower bounds on propagation delays of jBi in the specified direction of data flow. 
The conditions on the linear time scheduler in steps 3.2 and 3.3 ensure II is selected 
appropriately when the communication time, tc, and computation time, tf, are 
different. Step 4 provides a systematic method for the selection of the matrix of 
interconnection primitives, P, to eliminate any type of enumeration for this choice, as
delineated in [40].
Conditions set on the selection of II alone in step 3.1 is insufficient for the 
elimination of broadcasting since the space scheduler S must provide the 
interconnections support for such data communications specified by II. Thus, step 5.1 
discusses die conditions on the choice of S to avoid broadcasts/ Finally^ steps 7-10 
incorporate desired modifications to [40] for appropriate determination of GVT, UVT, 
and IT to the given linear recurrence problem when tc arid tf are different.
4.3 Time Lower Bouiid to Compute The Inverse Dynamics Problem on a 
Systolic Array
The limitatiori on speedirig up the inverse dynahfics computational problem 
while running on a systolic array will now be discussed. Before deriving the time 
lower bound, the following notation must be established.
Notation:
(1) E</ > denotes a linear arithmetic expression of / distinct atoms, where an atom 
is a constant or variable, e.g. E<4> = a + b - c/d.
(2) Ts= Minimum time to compute the inverse dynamics problem of an n-link 
manipulator on a fixed systolic array.
Theorem 4.1: The minimum time to compute the inverse dynamics
problem of an n-link manipulator on a fixed systolic 
architecture is bounded below by o^djrij, where dj is a
specified constant.
Proof: Using equation (4.3) in procedure SADLRE, the systolic processing











max n^n-1) + £ max IT^max aj - min aj) + tf 
_ i=2________ , . ■ . -
min Ildj
where n is the number of manipulator joints, max [a2,.. .,am] ^ 
min [a2,.. .^j are upper and lower bounds of the task’s index set
_ _ m
{j2,.. respectively. Variables max ITj (max % — min a*), tf,
i=2
and minlldi in the above equation are constants independent of n. 
In the case of IIj = f(jc) where is the first term in the vector 
transformation and jG is the outermost index of the iterative
algorithm, maxll^n-l) will be o(n2). Thus, for minimal t, 
IIj * f(j0). Under this condition for n to achieve the lower bound for 





where b1? b2, 1>3, Cj, and c2 are specified constants. Thus, the lower 
bound to evaluate Ts for an arithmetic expression E<n> may be 
defined as,
Ts[E<n>] > o(c2 T nl)
The inverse dynamics problem may be considered as computing a 
set of arithmetic operations which result in obtaining the joint 
torques. Each joint torque Xj of joint i can be expressed as an 
arithmetic expression containing at least 3n atoms: n joint positions 
{qj}-^, n joint velocities {qj-^i and n joint accelerations {qj}£i, 
implying,
Ts[E<3n>] > 0(30! N) (4.5)
Rewriting equation (4.5) without changing the order of the lower 
bound to evaluate a set of n torques,
69
'■ ■Ts[T1,...,Tj£'o(dirnl)
where dx is a specified constant. Q.E.D.
By establishing a lower bound in Theorem 4.1, alternate systolic architectures for the 
computation of the inverse dynamics problem may be compared and contrasted by 
simply comparing coefficient d2 of the competing systolic structures.
4.4 Reformulation of Newton Euler Equations of Motion for Direct Mapping 
onto a Fixed Systolic Architecture
In this section we shall reconstruct the Newton-Euler dynamics algorithm in 
Appendix A in a form amenable for direct mapping onto a fixed systolic array. This 
reformulation is achieved by decomposing the algorithm into a set of unidirectional 
and linear recurrence computational tasks. The algorithm is as follows:
Algorithm NEWTON_EULER
Forward Iterations:
for h = 0 to (n-1) do 
begin h
for i = 0 to 2 do 
begin i
Forj=0to2do 
For k = 0 to 1 do 
Start Task 1 
begin j
co[h,i,j] = co[h,i,j-l] + R[i j](o>[h-U2] + 0z[h,i]) 
end j
Finish Task 1 
Start Task 2 
if i = 2 then
xi[h,i] = co[h,2]*0 z[h,0] - O)[h,0]*9 z[h,2] 
else




x2[h,i] = x1[h,i] + 0z[h,i]
Finish Task 3 
Start Task 4 
begin j
o>[h,i,j] = co[h,i j-1] + R[i,j](co[h-l,i,2] + x2[h,i]> 
end j
Finish Task 4 
Start Task 5 
if i = 2 then
x3[h,i] = <o[h,2]*P[h,0] - o>[h,0]*P[h,2] 
else
x3[h,i] = 0)[h,i]*P[h,i+l] - G)[h,i+l]*P[h,i]
Finish Task 5 
Start Task 6 
begin k
x4[h,i,k=0] = co[h,i] 
if i = 2 then
x4[h,ijk] =x4[h,2>k-l]*P[h,0,k] -X4[h,0Jc-l]*P[h,2Jc] 
else
x4[h,ijc] = x4[h,i,k-1 ]*P[h,i+1 Jc] - x4[h,i+l,k-l]*P[h,iJc] 
endk
Finish Task 6 
Start Task 7
x5[h,i] = x3[h,i] + x4[h,i]
Finish Task 7 
Start Task 8 
begin j
v[h,i,j] = v[h,i,j—1] + R[i,j]*v[h—l,i,2] + x5[h,i] 
endj
Finish Task 8 
Start Task 9 
if i = 2 then
x6[h,i] = o)[h,2]*PG[h,0] - a)[h,0]*Pc[h,2]
■ else
x6[h,i] = 0)[h,i]*Pc[h,i+l] - G)[h,i+l]*Pc[h,i]
Finish Task 9
71
Start Task 10 
begin k
x7[h,i,k=0] = C0[h,i] 
if i = 2 then
x7[h,ijc] = x7[h,2Jc-l]*Pc[h,0,k] - x7[h,0,k-l]*Pc[h,2,kj| 
else
x7[h,ijc] =x7[h,i,k-l]*Pc[h,i+l Jc] -x7[h,i+l,k-l]*PG[h,i,k] 
endk
Finish Task 10 
Start Task 11
x8[h,i] = x6[h,i] + x7[h,i]
Finish Task 11 
Start Task 12
vc[h,i] = v[h,i] + x8[h,i]
Finish Task 12 
Start Task 13
F[h,i] = m[h]*vc[h,i]
Finish Task 13 
Start Task 14 
begin j
x9[h,ij] = x9[h,i,j-l] + I[i,j]*co[h,i] 
end j
Finish Task 14 
Start Task 15 
if i = 2 then
x10[h,i] = co[h)2]*x9[h,0] - co[h,0]*x9[h,2] 
else
x10[h,i] = co[h,i]*x9[h,i+l] - 0)[h,i+l]*x9[h,i]
Finish Task 15 
Start Task 16 
begin j
xuthdj] = xn[h,i,j-l] + I[i j]*co[h,i] 
endj
Finish Task 16 
Start Task 17






for h = n to 1 do 
begin h
fori = 0to2do 
begin i
forj — 0 to 2 do 
Start Task 18 
begin j
f[h,i j] = f[h,i,j-l] + R[i,j]*f[h+l,i] + F[h,i] 
end j
Finish Task 18 
Start Task 19
';^i2[h4]=P[h,i] + PcM 
Finish Task 19 
Start Task 20 
if i = 2 then
x13[h,i] = x12[h,2]*F[h,0] - x12[h,0]*F[h,2] 
else
x13[h,i] = x12[h,i]*F[h,i+l] - x12[h,i+l]*F[h,i] 
Finish Task 20 
Start Task 21
X14[h,i] = N[h,i] + xi3[h,i]
Finish Task 21 
Start Task 22 
if i = 2 then
x15[h,i] = P[h,2]*f[h+1,0] - P[h,0]*f[h+1,2] 
else
x15[h,i] = P[h,i]*f[h+l,i+l] - P[h,i+l]*f[h+l,i] 
Finish Task 22 
Start Task 23
n[h,i,j] = n[h,i,j-l] + R[i,j](n[h+l,i,2] + x15[h,i]) + x14[h,i] 
end j
73
Finish Task 23 
Start Task 24
T[h,i] = t[h,i-l] + Rz[h,i] * n[h,i] 
Finish Task 24 
end i 
end h
4.5 Systematic Design of a Basic Set of Systolic Processors for the Inverse 
Dynamics Computational Problem
From the algorithm of Section 4.4, it is seen that a “basic” set of types of tasks 
exist from which the complete systolic architecture for the computation of the inverse 
dynamics problem may be derived. Thus, by designing a set of systolic processors 
onto which this “basic” set of task types may be mapped, we can arrive at the 
complete systolic architecture for the computation of the specified problem by simply 
using these processors as building blocks for the realization of the overall systolic 
system. The basic set of tasks may be partitioned into a set of unidirectional and 
linear recurrence tasks as follows: ^
Unidirectional tasks:
Type 1: Vector-Add (VA) 
fori = 0 to 2 do 
y[i] = x[i] + z [i] 
endi
Type 2: Scalar-Vector Product (SVP) 
for i = 0 to 2 do 
y[i]=K*x[i] 
endi
Type 3: Vector Cross-Product (VCP) 
for i = 0 to 2 do 
if i = 2 then
y[i] = x[2] * z[0J - x[0] * z[2]
: else




Type 4: Inner Product (IP) 
for i = 0 to 2 do
y[i] = y[i-l] + x[i] * z[i] 
endi
•Type 5l Matrix-Vector Product 
for i = 0 to 2 do 
for j = 0 to 2 do
yUj] = y[ij-1] + A[i,j] * x[i] 
endj 
endi
Type 6: Recursive Matrix-Vector Product (RMVP)
Type 6a:
for h = 0 to n-1 do 
for i = 0 to 2 do 
forj =0 to 2 do





Replace line (1) in Type 6a by,
y[h,ij] =y[h,ij-l] + A[i,j] * y[h-l,i,2] + x[h,i]
Type 6c:
Replace line (1) in Type 6a by,
y[h,ij] = y[h,i j-1] + A[ij] * <y[h-l,U] + x[h,i])-i-z[hdl 
Type 7: Recursive Vector Cross-Product (RVCP) 
for i = 0 to n
forj =0 to 2 do 
if j = 2 then
y[h,i] = y[i—1,2] * x[i,0] - y[i—1,0] * x[i,2]
', else ■




Now that the basic types of tasks prevalent in the inverse dynamics 
computational problem have been defined, we can develop systolic architectures onto 
which these processes may be mapped. The modified systolic mapping procedure Of 
Section 4.2 will be used to develop the systolic architectures for linear recurrence 
tasks 4-7.
For the purpose of generality while deriving systolic architectures using 
procedure SADLRE, we shall assume that the interprocessor communication time, tc, 
is of unit time latency and the operational delay of primitive tasks such as addition, 
subtraction and multiplication is also an unit of time.
Before deriving the specified systolic processor architectures, two terms which 
will be used hereafter must be defined. The first one is, Systolic Operational Delay, 
which is defined as the latency required to compute the first element of a vector 
operation at the end of the first iteration in the given linear recurrence problem on the 
specified systolic array. Next, the Systolic Execution Time is defined as the latency 
to evaluate all elements of the vector operation at the end of n iterations in the linear 
recurrence problem on the specified systolic array.
4.5.1 Systolic Processors for Unidirectional Tasks
Task types 1 and 2 are represented by primitive adder and multipler cells (Figs. 
4.1(a), (b)). Paired vector operands enter these cells in synchronous fashion. A 
systolic vector cross-product module (type 3 task) is illustrated in Fig. 4.1(c). Unit 
delays are inserted along some input variable paths for operation synchronization. 
The operation delay of this particular module is 3 time units.
4.5.2 Systolic Architecture for a Type 4 Task
The Algol-like program for an inner product operation is,
for i = 0 to 2 do
j=i
x[i j] = x[i,j—1] 
z[i,j] = z[i,j-l] 
y[i] = y[i-i] + x[i,j] * z[ij] 
endi
The data dependency matrix for the above algorithm is D =
0 0 1 
1 1 0 , where the





(a): Type 1 Processor
X,i|-----
(y->y|i|
(b): Type 2 Processor




Fig. 4.1: Systolic Processors for Unidirectional Tasks
77
By following procedure SADLRE, fl is found to be [3 1], and S to be
1 0 
0 1 •
The new transformation matrix, A, is thus
11 3 
0 0 1 
1 1 0
Tables 4.1-4.3 show the GYT,
UVT, and IT for this type of task. The systolic architecture and its partitioned 
structure may be derived from these tables to be as shown in Tig. 4.2(a) and (b) 
respectively. Also, the architecture of an individual processing element of Fig. 4.2(a) 
is illustrated in Fig. 4.2(c). Information regarding scheduling of operands and
generation of intermediate variables in Fig. 4.2(a) can be obtained from task type 4’s 
tables {T}. The operational delay for the Type 4 task is eight time units.
4.5.3 Systolic Architecture for a Type 5 Task
The Algol-like program of a matrix-vector product (Task 4) operation will now 
be presented in a slightly modified form,
fori = 0 to 2 do 
for j = 0 to 2 do
x[i]=x[i-l] (1)
A[i,j] = A[i,j-1] (2)
y[ij] * y[ij—1] + A[ij] * x[i] 
endj 
endi
The purpose of statements (I) and (2) is to ensure the inputs are propagated 
sequentially along the ith and jth directions, respectively, of the systolic array. These 
statements, therefore, eliminate any possibility of broadcasting input variables. The 
data dependency matrix, D, of iteration (i,j) for the above algorithm may easily be 
derived as,
10 0 
D= 0 1 1
where columns 1, 2 and 3 represent dependence vectors for variables x[i-l], A[i,j-1] 
and y[i,j—1] respectively.
Linear indexing matrices, ^Bi’ for this problem were found to have the same rank 
as the dimensionality of the algorithm index set. Thus, it may be assumed that 
broadcasting of any input or generated variable will not occur. We can, therefore, 
jump to step 3.2 in procedure SADLRE.









6 0 2 0 m
1 3 . 5 1 7(1)
2 6 8 2 *2)









0 7(-l) -1 -1
1 7(0) 0 2
2 ■ ■ y(i) 1 . 5 .
Table 4.3: IT of a Type 4 Task
ITERATION USING INPUTS FROM CELL AT
. 0)______ x(i), z(i) (i.j ~ TIMES
0 x(0), z(0) 0,-1 -1
X x(l), z(l) 1,-1 2
2 x(2), z(2) 2,-1 5
Systolic Architecture
(b): Partitioned Systolic Architecture
(c): An Individual Processing Element
Fig. 4.2: Systolic Architecture for a Type 4 Task
80
From the above Algol-like program specification, it is easily derived that the 
computational delay of an individual systolic cell will be two time units since two 
consecutive primitive operations are required to evaluate y[i,j]. The linear time 
scheduler, FID, must, therefore satisfy the conditions of nd1, nd2 ^ 1 and IId3>3, 
where the subscripts of dependence vector d indicate column indexing in matrix D. 
The transformation vector, II, which meets the above specifications and minimizes the 
algorithm execution time, t, in step 3.4, is found to be [1 3].
Using step 4, we may obtain a two-dimensional systolic array architecture by
selecting the matrix of interconnection primitives, P, as
1 0 
0 1 Further, the second
transformation S is chosen to allow the diophantine equation SD = PK to be solvable 
for S, and minimize the number of bands, r. The derived S-transformation and 
interconnection utilization matrix, K, are as follows:
1 o 100
s = 0 1 for K = 0 1 1
We thus arrive at the new dependence matrix, A,
’l 3 1 0 0
A = TD = 1 0 o 1 1
.0 1.
’l 3 3
= 1 0 0 •
0 1 1
The first row of the transformed matrix, A, indicates the number of time units allowed 
for the respective variable to travel from the processor where it is generated to the 
processor where it is used. The next two rows specify the direction of data flow for 
the respective variable. It is therefore easily Verified that input x[i] needs one time 
unit to propagate one space in the ith direction, input A[i,j] and generated variable 
y[i,j-l] requires three time units to travel one space in the jth direction.
Tables 4.4 through 4.6 delineate the GVT, UVT and IT, respectively, of the 
matrix-vector product operation. They are derived using Steps 7 through 9 of 
procedure SADLRE. The systolic array and its partitioned form inferred from these 
tables are illustrated in Fig. 4.3(a) and (b) respectively. The architecture for an 
individual cell of the systolic architecture using step 10 is shown in Fig. 4.3(c). The 
operational delay of the systolic module is 8 time units to perform the designated











0,0 0 2 (0,0) 7(0,0)
0,1 3 5 (0,1) 7(0,1)
0,2 6 ' 8 (0,2) 7(0,2)
1,0 1 1 (1,0) 7(1,0)
1,1 4 8 (1,1) 7(1,1)
1,2 7 9 (1,2) 7(1,2)
2,0 2 4 (2,0) 7(2,0)
2,1 5 7 (2,1) 7(2,1)
2,2 8 10 (2,2) y(2>2)









0,0 {0.-1) (o,-i) -3
0,1 (0,0) (0,0) 0
0,2 (0.1) (0,1) 3
1,0 (1,-1) (1,-1) -2
1,1 (1,0) (1,0) 1
1,2 (1,1) (1,1) 4
2,0 (2.-1) (2,-1) -1
2,1 (2,0) (2,0) 2
2,2 (2,1) (2,1) 5









0,0 x[0], A[0,0] .(-1,0), (0,-1) -1,-3
0,1 x[0], A[0,1] (-1,1), (0,0) 2,0
0,2 x[0j, A[0,2] (-1,2), (0,1) 5,3
1,0 x[l],A[l,0] (0,0), (1,-1) 0,-2
1,1 x[l), A[l,l] (0,1), (1,0) 3,1
1,2 x(l},A[l,2] (0,2), (1,1) 6,4
2,0 x[2], A[2,0] (1,0), (2,-1) 1,-1
2,1 x[2],A[2,l] (1,1), (2,0) 4,2




(b): Partitioned Systolic Architecture
y[U|
(c): An Individual Processing Element
Fig. 4.3: Systolic Architecture for a Type 5 Task
computation.
4.5.4 Systolic Architecture for Task Types 6(a)-(c)
Type6(a)Task
The recursive matrix-vector product formulation of a Type 6(a) task is 
represented by the following modified Algol-like program:
for h = 0 to n-1 do 
for i = 0 to 2 do 
for j = 0 to 2 do 
x[h,i] =x[h,i-l]
y[h,i,j] = y[h,i,j—1] + A[i,j] (y[h—1 ,i,2] + x[h,i])
end i 
endh
The data dependency matrix for the above algorithm specification may be 
derived to be,
0 0 0 ' 1 
D = 10 0 0 
.0 1 1 j-2j
where the columns from left to right specify data dependence vectors for variables 
x[h,i-l], A[i,j-1], y[h,i,j-l] and y[h-l,i,2] respectively.
The linear indexing matrix for the generated variable y[h-l,i,2] is,
Vfr-i.U]
1 0 0 
0 1 0 
.0 0 0
Since the rank of the above linear indexing matrix is less than the dimensionality of 
the algorithm index set, it is therefore clear that the generated variable y[h-l,i,2] may 
be broadcasted. By using Step 2 and 3.1 of procedure SADLRE, the linear time 
scheduler for variable y[h-l,i,2] must be of the form,
ndBy[h-i,i,2]-Ci+jC2 (4*6)
The conditions on constants, Cj and C2, may only be determined after the selection of 
the S transformation matrix. By Step 4, we find the matrix of interconnection
84
primitives, P, to be,
P
1 0 0 
Oil •
Using Step 5.1, broadcasting of variable y[h-l,i,2] is avoided by also setting the 
following condition on the space scheduler transformation matrix S,
SBy[h-l,i,2]dBy[h-l,i,2] ~ P3 + JQ (4.7)
where C3e Zn, C4e I" and C4 *0. A S-matrix is thus selected which minimizes the 
number of bands, r, satisfies equation (4.7), and allows the diophantine equation SD = 
PK to be solvable for S,
0 1 0 T 0 0 0
s = 3 0 1 for K = 0 11 1
0 0 0 0
From this choice of S, constants C3 and C4 in Eq. (4.7) are both one.
We may now return to the problem of setting conditions on Cj and C2 in Eq. 
(4.6). By Eqs. (4.6) and (4.7) in Step 3, and C2 must both be greater or equal to 4 
since tj > tg + tc(=4) and Sy[h_li>2J dBy[h-i.i,2j > Sylh-t.tfJ dBy[h-i,i,2i ^ both one. 
Using Steps 3.2 and 3.3, we impose further conditions on the selection of 
II t.ndxj^i-i], ndA[ij_1] > l, and ndyfh.ij-q ^ 4. Finally, the transformation II is 
chosen to satisfy all the above conditions and minimize the algorithm execution time 
t,n = [12 1 4].
The new transformation matrix, A, is thus,
A = TD
1 4 4 4+4j 
1 0 0 0 
.0 1 1 1+j.
The information gathered from this A matrix regarding communication latency for the 
directional flow of variables in the systolic array is as follows:
(1) x[h,i-l] requires 1 time unit to travel 1 space unit in the ith direction.
(2) A[i,j-1] needs 4 units of time to move 1 space unit in the jjth direction.
(3) Generated variable, y[h,ij-l], travels 1 space unit in the jth direction in 4 time 
units. ■
(4) Generated variable, y[h-l,i,2], requires 4+4j units of time to travel 1+j space 
units in the jth direction.
85
The set of tables {GVT, UVT, IT} for the Type 6a operation is depicted in 
Tables 4.7 through 4.9, respectively. The systolic array architecture derived from 
these tables is shown in Fig, 4.4(a). The number of systolic cells required for this 
architecture is 3n, where n is the number of manipulator joints, and the operational 
delay of the array is 11 units of time.
The architecture of an individual cell in the systolic array of Fig. 4.4(a) is 
illustrated in Fig. 4.4(c). Note that buffer delays derived using Step 10 of SADLRE 
generated insufficient latencies along the paths of A[i,j] and y|h,i,j-l] due to the 
asynchronous sequence in which these variables are operated upon in the cell. This 
problem is solved by adding extra latencies along their paths for the purpose of 
operation synchronization.
The algorithm may be mapped onto a more cost-efficient architecture of a 3x3 
VLSI array by following the partitioning technique in Step 11. This partitioned 
architecture onto a set of 3 bands is illustrated in Fig. 4.4(b). The scheduling of these 
bands is performed by assigning the processing of y(j) in band Bj, 1 < i < 3, at a 
processor whose ith coordinate is nPij mod 3, where nPi corresponds to the ith row in 
the S matrix.
Type 6(b) Task
The systolic architecture for this type of task is identical to that of a Type 6(a) 
task. The only architectural modification exists within an individual processing cell. 
Fig. 4.5 specifies the processing cell architecture for a Type 6(b) task. The shorter 
computational latency of this cell compared to that of a Type 6(a) task, thus, requires 
the systolic array for a Type 6(b) task to have a shorter operation delay of 8 time units.
Type 6(c) Task
The systolic architecture of this task type is also the same as that of a Type 6(a) 
task. An individual processing cell architecture for a Type 6(c) task is shown in Fig. 
4.6. The operational delay for the type type 6(c) task is indifferent from that of a 
Type 6(a) task, i.e. a latency of 11 units.
4.5.5 Systolic Array for a Type 7 Task
The Algol-like program for a type 7 recursive vector cross-product operation is 
as follows,
for i = 0 to n-1 do 
forj = 0 to 2 do
Table 4.7: GYT of a Type 6a Task
ITERATION BEGIN EXECUTION FINISH EXECUTION IN CELL GENERATING
(h,i,j) AT TIME AT TIME Lm_. VARIABLE
0,0,0 0 3 0,0 7(0,0,0)
0,0,1 4 7 0,1 y(0,0,l)
0,0,2 8 11 0,2 7(0,0,2)
0,1,0 1 4 1,0 7(0,1,0)
0,1,1 : s 8 1,1 7(0,1,1)
0,1,2 9 12 1,2 7(0,1,2)
0,2,0 2 5 2,0 7(0,2,0)
0,2,1 3 9 2,1 7(0,2,1)
0,2,2 10 13 2,2 7(0,2,2)
1,0,0 12 15 0,3 7(1,0,0)
1,0,1 16 19 0,4 7(1,0,1)
1,0,2 20 23 0,5 7(1,0,2)
1,1,0 13 16 1,3 7(1,1,0)
1,1,1 17 20 1,4 7(1,1,1)
1,1,2 21 24 1,5 7(1,1,2)
1,2,0 14 17 2,3 7(1,2,0)
1,2,1 18 21 2,4 7(1,2,1)
1,2,2 22 25 2,5 7(1,2,2)









0,0,0 7(0,0,-1), y(-l,0,2) (0,1), (0,-1) -1,-1
0,0,1 y(0,0,0), y(-l,0,2) (0,0), (0,-1) 3,-1
0,0,2 y(0,0,l), y(-l,0,2) (0,1), (0,-1) 7,-1
0,1,0 y(0,l,-l), y(-l,l,2) (1,-1), (1,-1) 0,0
0,1,1 7(0,1,0), y(-l,l,2) (1,0) (1,-1) 4,0
0,1,2 ■7(0,1,1), 7(-1,1,2) (1,1), (1,-1) 8,0
0,2,0 y(0,2,-l), y(-l,2,2) (2,-1), (2,-1) 1,1
0,2,1 y(0,2,0), y(-l,2,2) (2,0), (2,-1) 5,1
0,2,2 y(0,2,l), y(-l,2,2) (2,1), (2,-1) 9,1
1,0,0 y(l,0,-l), y(0,0,2) (0,2), (0,2) 11,11
1,0,1 y( 1,0,0), y(0,0,2) (0,3), (0,2) 15,11
1,0,2 y(i,o,i), y(o,o,2) (0,4), (0,2) 19,11
1,1,0 y(l,l,-l), y(0,l,2) (1,2), (1,2) 12,12
1,1,1 7(1,1,0), y(0,i,2) (1,3), (1,2) 16,12
1,1,2 7(1,1,1), y(o,i,2) (1,4), (1,2) 20,12
1,2,0 y(l,2,-l), y(0,2,2) (2,2), (2,2) 13,13
1,2,1 7(1,2,0), y(0,2,2) (2,3), (2,2) 17,13
1,2,2 y(l,2,l), y(0,2,2) (2,4), (2,2) 21,13









0,0,0 x(0,0), A(0,0) (-1,0), (0,-1) -1,-4
0,0,1 x(0,l), A(0,1) (-1,1), (0,0) 3,0
0,0,2 x(0,2), A(0,2) (-1,2), (0,1) 7,4
0,1,0 x(0,0), A(1,0) (0,0), (1,-1) 0,-3
0,1,1 x(0,l), A(l,l) (0,1), (1,0) 4,1
0,1,2 x(0,2), A(l,2) (0,2), (1,1) 8,5
0,2,0 x(0,0), A(2,0) (1,0), ((2,-1) 1,-2
0,2,1 x(0,l), A(2,l) (1,1,), (2,0) 5,2
0,2,2 x(0,2), A(2,2) (1,2), (2,1) 9,6
1,0,0 x(l,0), A(0,0) (-1,3), (0,2) 11,8
1,0,1 x(l,l), A(0,1) (-1,4), (0,3) 15,12
1,0,2 x(l,2), A(0,2) (-1,5), (0,4) 19,16
1,1,0 x(l,0), A(l,0) (0,3), (1,2) 12,9
1,1,1 x(l,l), A(l,l) (0,4), (1,3) 16,13
1,1,2 x(l,2), A(l,2) (0,5), (1,4) 20,17
1,2,0 x(l,0), A(2,0) (1,3), (2,2) 13,10
1,2,1 - x(l,l), A(2,l) (1,4), (2,3) 17,14




y(—1,0,21 yf—1,1,2) y(—1,2,2} 
yih,0,—1| yjh.l,—ij yjh,2,-l| 
Afh,0,j| Ajh,l.j| Ajh,2.j|
y(n—1.0,2| yjn—1,1,2| yirt-1,2,2)
(b): Partitioned Systolic Architecture
y|h,i,j|
(c): An Individual Processing Element
Fig. 4.4: Systolic Architecture for a Type 6a Task
90
y|h.Ul
Fig. 4.5: Processing Element for a Type 6b Task
x|H,i|
y|h-l,i,2| A|ij| jr|h,ij—1|
Fig. 4.6: Processing Element for a Type 6c Task
91
x[i] = x[i—1]
y[ij] = CQ] • (y[i-l,j] * x[i+l] - y[i-l,j+l] * x[i]) 
+ C[j] • (y[i—1,2] * x[0] - y[i-l,0] * x[2]>
end i
where C[jJ. is a control signal entering the systolic array vertically. The data 
dependence matrix for the above program is,
D
011 1 1 1 
10 0 -1 0-2) j
where the columns from left to right specify data dependence vectors for C[j—1], 
x[i-l], y[i-l,j], y[i-l,j+l], y[i—1,2] and y[i—1,0] respectively.




The rank of the above linear indexing matrix is less than the dimensionality of the 
algorithm index set. The variables y[i—1,2] and y[i—1,0] are, therefore, amenable to 
broadcasting within the systolic array.
By Step 4, the matrix of interconnection primitives, P, is
P =
0 1 1 
10-1 •
Conditions on the selection of the S matrix to avoid broadcasting of specified 
variables include,
^By[i—l,2]^By[i—1,2] “ Q + jC2
and.
SBy[i-l,0]dBy[i-l,0] “C3+jC4
where Ci, C3e Zn; C2, C4eln and C2, C4 * 0. An S-matrix is selected which satisfies 
these conditions, minimizes the number of bands, and allows the diophantine equation 






1 1 0 0 0 0
0 0 1 0 0 0 
0 0 0 1 1 1
Now that the S matrix has been chosen, conditions on the selection of the II 
transformation may be summarized as follows:
(i) ndBy[i_12] - C, + jC^;
(ii) ndBy[i_10] = C3 +jC4, where Cj > 2, C3 > 0 and C2, C4 > 1;
(iii) ITdc, ndx > 1;
(iv) ridy[i_ij], ndy[i_1j+1] ^ 3.
The above conditions are derived using Steps 3.1-3.3 of SADLRE. A II 
transformation is selected to satisfy these conditions in addition to minimizing the 
algorithm execution time t, II = [4 1].
The new transformation matrix, A, is found be,
A = TD =
1 4 4 3 (2+j) (4+j) 
0 111 1 1 
100-1(3-2) j.
Information regarding communication latency for the directional flow of 
variables in the systolic array may be deduced from the A matrix. This can be done in 
a manner similar to that previously described.
The {GVT, UVT, IT} for the recursive vector cross-product operation is shown 
in Tables 4.10-4.12. The resulting systolic array for this type of operation is 
delineated in Fig. 4.7(a), and the architecture of an individual processing cell is 
illustrated in Fig. 4.7(c). The control signal, C(j), operates as an input selection line 
for the multiplexers. Typically, it will allow y(i—1 J) and y(i-l,j+l) to pass through 
the multiplexers for the first two rows of the systolic array in Fig. 4.7(a). C(j) will, 
however, permit y(i-l,0) and y(i-l,j) to propagate through the multiplexers due to the 
operational specification of a vector cross-product operation. The partitioned 
architecture for a Type 7 operation is delineated in Fig. 4.7(b). The operational delay 
for the proposed architecture is 3 time units for imax= l in the algorithm index set,











0,0 0 2 0,0 7( 0,0)
0,1 1 3 0,1 7(0,1)
0,2 2 4 0,2 7(0,2)
. ■' 1,0 4 6 1,0 7(1,0)
1,1 5 7 1,1 7(1,1)
1,2 6 8 1,2 7(1,2)
2,0 8 10 2,0 7(2,0)
2yl 9 n 2,1 7(2,1)
2,2 10 12 2,2 7(2,2)









0,0 y(-i,o), y(-i,i) (-1,0), (-1,1) -2,-1
0,1 y(-i,i), y(-i,2) ,(-1,1), (-1,2) -1,0
0,2 y(-i,o), y(-i,2) (-1,0), (-1,2) -2,0
1,0 y(o,o), y(o,i) (0,0), (0,1) 2,3
1,1 y(o,i), y(o,2) (0,1), (0,2) 3,4
1,2 y(o,o), y(0,2) (0,0), (0,2) 2,4
2,0 y(i,o), y(i,i) (1,0), (1,1) 6,7
2,1 y(i,i), y(i,2) (1,1), (1,2) 7,8
2,2 y(i,o), y(i,2) (1,0), (1,2) 6,8









0,0 x(0), x(l), C(0,0) (-1,0), (-1,0), (0,-1) -4,-4,-1
0,1 x(l), x(2), C(0,1) (-1,1), (-1,1), (0,0) -3,-3,0
0,2 x(0), x(2), C(0,2) (-1,2), (-1,2), (0,1) -2,-2,1
1,0 x(0), x(l), C(1,0) (0,0), (0,0), (1,-1) 0,0,3
1,1 x(l), x(2), C(l,l) (0,1), (0,1), (1,0) 1,1,4
1,2 x(0), x(2), C(l,2) (0,2), (0,2), (1,1) 2,2,5
2,0 x(0), x(l), C(2,0) (1,0), (1,0), (2,-1) 4,4,7
2,1 x(l), x(2), C(2,l) (1,1), (1,1), (2,0) 5,5,8






which is also the maximum outermost index limit specification for a recursive vector 
cross-product operation in the inverse dynamics computational problem. Note that an 
unit latency for input data synchronization is also included in the operational delay.
4.6 Optimal Buffer Assignment for the Formulation of a Balanced Systolic
Architecture to Evaluate the Inverse Dynamics Computational Problem
The computational flow of tasks in the forward iterations of algorithm 
NEWTON_EULER (Section 4.4) as a directed task graph is shown in Fig. 4.8. Each 
node in this diagram represents a systolic processor of Section 4.5. The nomenclature, 
ty i, next to each node denotes the type of systolic cell, where the variable i identifies 
the type of cell. Also, the variable x in the field Tx inside each node specifies the 
actual task number. ,
The bottleneck in achieving maximum throughput in the architecture of Fig. 4.8 
i$ due to the different arrival time of data into the multi-input systolic processing cells. 
Since computations may not be initiated in cells until all input data are available 
simultaneously, the pipelined execution time is therefore naturally lengthened. We 
must therefore balance the task graph to achieve maximum pipelining. This problem 
may be solved by assigning appropriate amounts of delay (using buffer stages) along 
some paths of the graph.
The problem of balancing a directed graph by inserting buffers along appropriate 
paths has been solved by the cut-set theorem [21,22], local correctness criterion [22], 
and the graph-theoretic approach [5,7]. However, we shall use a more systematic 
approach by formulating it as an integer linear programming problem [47].
Before formally presenting the procedure which may be used to generate a set of 
integer linear constraint equations for the balancing problem, the following 
notation/definitions must be introduced.
Definition 1: A Weighted Graph, WG, corresponds to a directed task graph using a 
given set of processing units which provides a weight variable CDj, ie Zn, along the 




T17y ty 1 T7 J tyl
T8)ty6b
T12)tyl
WG for Forward Iterations of N-E AlgorithmFig. 4.8
97 '
Definition 2: A weight, £ co(P(nj)), is defined as the sum of weights, or total
P(ni)ePk(jii)
cost, along the kth path, pk, from the input node to multi-input node, ni5 in the 
Weighted Graph, WG.
Definition 3: The critical path, pc(nj), denotes the path from the input node to multi­
input node, nj which constitutes the largest weight variable, i.e. 
max £ co(p(n;)), in the Weighted Graph, WG.
p(fti)epk(ni)
Definition 4: The Buffered Graph, BG, corresponds to a directed task graph which 
provides a set of buffers {Bj,... ,Bj+n}, jeZn, along the (n-j) output edges of multi­
output node n|, ie Zn, in WG that do not belong to output edges in the critical path 
pc(nQ), where nQ is the output node of WG.
Note that an input node, nr, that supplies operand data for nodes in the directed 
task graph must be included in BG.
Definition 5: The variable, £ | B(p(nj))|, is defined as the sum of buffer
pOOepkfri)
variables, or total buffer cost, along the kth path, pk, from input node, nj, to multi­
input node, nj, in the Buffered Graph, BG. Now that the nomenclature to be used has 
been discussed, we may proceed further by presenting a procedure which will 
generate an optimal buffer assignment for a given weighted graph.
Procedure OBAP (Optimal Buffer Assignment Problem)
Input: Weighted Graph, WG.
Output: Optimal numerical buffer assignment of buffers Bj,jeZn, which
minimizes the total number of buffers stages in Buffered Graph, BG.
STEP 1: Find critical path, pc(nQ), to output node n0 from the input node of WG.
98
STEP 2: Obtain BG by assigning buffer stage variables to the output edges of
multi-output nodes in WG, except for the output edges which belong to 
the critical path, pc(n0), of the output node n0.
STEP 3: For every multi-input node, ni? except output node nc, acquire a set of
m-1 integer linear constraint equations, where m is the number of inputs 
into node nj. These equations may be found by,
X [| B(p(nj))| + ©(pCnj)) ]
p(ni)epk(ni)/pc(ni)
> E [| B(p(nj))| +co(p(ni))] (4.8)
P(ni)epc(ni)




subject to the constraints of STEP 3, where n is the number of buffers.
The above procedure may now be applied to the weighted graph, WG, for the 
forward iterations of the Newton-Euler algorithm (in Fig. 4.8). The weights assigned 
to the output edges of each systolic node represents the computational delay within 
the node. These delays for the various types of systolic processors developed in 
Section 4.5 is summarized in Table 4.13.
STEP 1: The critical path pc(n0) for the WG of Fig. 4.8 is dong the nodes: T1-T2-
T3-T4-T5-T7 T8-T12-T13-n0.
STEP 2: By applying buffer assignment rules to the WG of Fig. 4.8, we obtain the
normalized Buffered Graph BG as in Fig. 4.9,
STEP 3: The integer linear constraint equations on the buffering variables in BG
may be found by applying Eq. (4.8) to each node:
nodeTl:| B6| =0 
node T2: j B4| =11 
node T3:1 B5| =14
99
Fig. 4.9: BG for Forward Iterations of N-E Algorithm
100
T20 Jty3 T22 I ty3
T21 ) tyl
Fig. 4.10: WG for Backward Iterations of N-E Algorithm
101
nodeT4:|B6| +|B12| >15
node T5: j B3| + | Bl5| = 26
nodeT6: j B3| =j Bllj +11
nodeT7: j Bll| =11
nodeT8: j B6j +| B12| + | B18| =30
node T9: j B2| + j B14| = j B 17| + 26
hodeT10:| B2| =| B10| +| Bll| +11 ?
node Til: j B10| +1 B11| =|B17| +11
node T12: j B17| =8
node T13: j B7| =39
V node T14: j Blj =| B9| +| BlOj +| Bll| + 
node T15: j B8| =8
nodeT16:| Blj +| B13| =|B16| +|B17| +26 
■■'■■'.■’■.■'.'■nodeT17:|B9|+|B10|+|Blli=|B16|+|B.17|- + 12.-
STEP 4: Apply integer linear programming to minimize ]£ | Bj subject to the
i=l
constraints of STEP 3. The optimization process generates:
| Bl| =31,| B2| =30,| B3| =22,|B4| =11, 
j B5| =14,| B6| =0,| B7| = 39,| B8| =8, 
j B9| = 1,| B10| = 8,j Bll| = 11,| B12| = 15, 
j B13| =3, |B14| =4,| B15| =4, j B16| = 0, 
j B17| = 8, j B18| =15
Procedure OB AP may now be applied to the WG (in Fig. 4.10) for the backward 
iterations in program NEWTON-EULER. Intermediate results of steps in the 
procedure are as follows:
STEP 1: The critical path pc(n„) for the WG in Fig. 4.10 is along the nodes: T19-
T20-T21-T23-T24-no.
STEP 2: Applying buffer assignment rules to the WG of Fig. 4.10, the normalized
buffered graph in Fig. 4.11 may be obtained.
102
Fig. 4.11: BG for Backward Iterations of N-E Algorithm




















STEP 3: The integer linear constraint equation on buffering variables in Fig. 4.11
are as follows:
node T18: | B22| = 0
node T19: j B20| = | B21|
node T20: j B20| +1 = B25
node T21: j B19( = | B20| + 4
node T22: j B21| + j B24j =8




Applying integer linear programming to minimize £ Bj subject to the
*=19
constraints of STEP 3, we get:
| B19( = 10,| B20| =6,1 B21| = | 
j B23| = 22, j B24| = 2, | B25| = 7, | B26| =11.
A balanced architecture for the computation of the inverse dynamics problem is thus 
found by pipelining the weighted systolic architecture of Fig. 4.8 to that of Fig. 4.10 
through an intermediate LIFO (Last-In-First-Out) intermediate register file structure. 
The total time to compute the given problem on the developed systolic machine is 
(tcrf + tcrb)n+2, where tCTf is the operational delay of the critical path of Fig. 4.8 
(forward iterations), tcrb is the operational delay of the critical path in Fig. 4.10 
(backward iterations), and n is the number of manipulator joints. By using the 
operational latency data (Table 4.13) for all types of systolic processors designed in 
this chapter, the total execution time to evaluate the inverse dynamics problem of an 
n-link manipulator is 70n+2. Note that the coefficient dx (see Theorem 4.1) for our 
particular systolic machine for the computation of the desired problem is thus 70.
4,7 Conclusions
A novel systolic architecture to compute the inverse dynamics computational 
problem within the systolic execution time lower bound of o^djnj , where db is a
specified constant, is discussed. A modified form of the systolic design methodology 
of Moldovan and Fortes [40] is used as the primary design tool for the architecture. 
Modifications were required to provide allowances for unequal execution times of 
processing cells, and to eliminate broadcasting of any input or generated variable
105
within the systolic array. The inverse dynamics problem is decomposed into a set of 
unidirectional and linear recurrence tasks amenable for direct mapping onto a fixed 
systolic system. A set of seven basic types of systolic processor architectures are 
developed to represent nodes in an acyclic systolic directed task graph to compute the 
Newton-Euler dynamics algorithm. The balancing of this acyclic task graph is 
reduced to an integer linear optimization problem for the assignment of delay buffers 
along alternate data routes. The final systolic architecture achieves maximal 





In chapter 2, the ideal lower bounds on the number of parallel processors and 
execution time to evaluate the inverse dynamics problem of an n-link manipulator in a 
parallel processing architecture using an optimal scheduling algorithm was proven to
be n and o( axn ) respectively, where aj is a specified constant. Next, a novel SIMD
scheduling algorithm to perform parallel processing on a SIMD multiprocessor-based 
architectural model with a crossbar interprocessor network which performs close to 
the ideal lower bound is presented.. The performance of the N-E dynamics algorithm 
on the proposed architecture using the specified SIMD scheduling technique is next 
evaluated. Speedup factor of greater than 3.4 over previous related work [19] for the 
computation of the defined problem on a parallel processing system is achieved. The 
SIMD architectural model is then not used again to simulate the performance of 
PUMA forward and inverse kinematics algorithms because of the non-recursive and 
non-linear nature of these algorithms. A multiprocessor model with a shared memory 
interprocessor communication strategy is instead investigated for this purpose using 
the DFTHS scheduling algorithm. Simulation results show speedup of approximately 
2 over the uniprocessor solution when the number of parallel processors is greater 
than eight and five for the cases of forward and inverse kinematics respectively. 
Simulation results showed very large speedups is not achieved in this latter 
multiprocessor architectural model due to high interprocessor communication latency 
costs. The important results of this chapter may be summarized as in Table 5.1. Note 
that the hardware overhead due to the crossbar interprocessor network must be 
included when evaluating the overall system cost for the computation of the inverse 
dynamics problem.
In chapter 3, we described a cost-efficient parallel and pipelined bit-serial system 
for the inverse dynamics computational problem to achieve the bit-serial execution




Execution Times and Hardware Overhead for Computation of Robot 

































Bit-Serial 19k+6kn 32 Bit-Serial 
Cells
Systolic 70n+2 24 Systolic 
Processors
108
cell architecture is described as a building block for the array configurations of the 
system. Zipper CMOS circuit design strategy is proposed for the cell’s 
implementation to minimize propagation delays and maximize operating frequencies. 
For the case of a 16-bit system word length and 20MHz operating frequency, the total 
time to compute the inverse dynamics problem of a 6-link manipulator on the 
proposed bit-serial architecture is only 44|i.s.
Finally, in chapter 4, a novel systolic architecture to compute the inverse 
dynamics computational problem within the systolic execution time lower bound of
din , where di is a specified constant, is discussed. A modified form of the systolic
design methodology of Moldovan and Fortes [40] is used as the primary design fool 
for the architecture. Modifications were required to provide allowances for unequal 
execution times of processing cells, and to eliminate broadcasting of any input or 
generated variable within the systolic array. The inverse dynamics problem is 
decomposed into a set of unidirectional and linear recurrence tasks amenable for 
direct mapping onto a fixed systolic system. A set of seven basic types of systolic 
processor architectures are developed to represent nodes in an acyclic systolic 
directed task graph to compute the Newton-Euler dynamics algorithm. The balancing 
of this acyclic task graph is reduced to an integer linear optimization problem for the 
assignment of delay buffers along alternate data routes. The final systolic architecture 
achieves maximal pipelining with a minimal number of delay buffers assigned along 
its various computational paths.
The performance and hardware complexity of alternate computational structures 
for the evaluation of the inverse dynamics problem is compared in table 5.1 The 
variables "n" and "k" in this table specify the manipulator link count and bit-serial 
system word length respectively. Clearly, the systolic approach achieves superior 
performance (assuming the word length for the bit-serial system is greater than 16). 
The hardware complexity for this approach is however highest due to the overhead 
associated with the control of the systolic arrays and the number of cells in the arrays 
themselves. The parallel processing approach achieves the next best execution time, 
where the crossbar network is the primary hardware overhead. Finally, the specified 
bit-serial architecture provides acceptable performance at the cost of moderate 
hardware. Lowest amount of hardware complexity is required for this case because of 
minimum circuitry needed for the implementation of bit-serial cells.
109
LIST OF REFERENCES
[1] Acosta, R. and Kjelstrup, H.C., "An Instruction Issuing Approach to 
Enhancing Performance in Multiple Functional Unit Processors," IEEE 
Trans. Comp., Vol. C-35, No. 9, Sept. 1986, pp. 815-828.
[2] Ahmad, S., Meyer, D., and Rahman, M. "On the Design of Special-
Purpose Computational Structures for Robot Control: Design
Constraints," Proc. 1986 Applied Motion Control Conf., Minneapolis, 
Minnesota.
[3] Ahmed, H. M., Delosme, J. M., and Morf, M., "Highly Concurrent 
Computing Structures for Matrix Arithmetic and Signal Processing," 
Proc. IEEE, Jan. 1982; pp. 65-82.
[4] Bejczy, A., "Robot Arm Dynamics and Control," Technical Memo 33-669, 
Jet Propulsion Lab., Pasadena, Calif., 1974.
[5] Brock, J. D., "Translation and Optimization of Dataflow Programs," 
Proc. 1979 Int. Conf. Parallel Proc. , Aug. 1979, pp. 46-54.
[6] Buric, M. and Carver, M., "Bit-Serial Inner Product Processors in VLSI," 
CALTECH Int. Conf. on VLSI, January 1981, pp. 155-164.
110
[7] Dennis, J. B., and Gao, R. G., "Maximum Pipelining of Array Operations 
on Static Dataflow Machine," Proc. 1983 Int. Conf. Parallel Proc. , Aug. 
1983, pp. 331-334.
[8] Denyer, P. and Renshaw, D., VLSI Signal Processing: A Bit-Serial 
Approach, Addison-Wesley, Reading, Mass., 1985.
[9] Du, H. C., "On the Performance of Synchronous Multiprocessors," TEER 
Trans. Comp., Vol. C-34, No.5, May 1985, pp. 462-466.
[10] Fernandez, E. B. and Bussel, B., "Bounds on the Number of Processors 
and Time for Multiprocessor Optimal Schedules," IEEE Trans. Comp., 
Vol. C-22, No. 8, Aug. 1973, pp. 745-751.
[11] Fortes, J. A. B. and Moldovan, D. I., "Parallelism Detection and 
Algorithm Transformation Techniques useful for VLSI Architecture 
Design," J. Parallel Distrib. Comp., May 1985.
[12] Fortes, J. A. B., and Moldovan, D. I., "Data Broadcasting in Linearly 
Scheduled Array Processors," Proc. 11th Int. Symp. on Comp. Arch. , 
June 1984, pp. 84-98.
[13] Fu, K. S., Gonzalelez, R. C., Lee, C. S. G., Robotics: Control, Sensing, 
Vision, and Intelligence, Me Graw Hill, 1987.
[14] Goncalves, N.F. and De Man, H.J., "NORA: A Racefree Dynamic CMOS 
Technique for Pipelined Logic Structures," IEEE J. Solid-State Circuits, 
Vol. SC-18, No. 3, June 1983, pp. 261-266.
[15] Horowitz, E. and. Salmi, S., Fundamentals of Computer Algorithms, 
Computer Science Press, Inc., Rockville, Maryland, 1978.
[16] Hwang, K. and Briggs, F. A., Computer Architecture and Parallel 
Processing, Me Graw Hill, 1984.
[17] Jensen, E., "The Honeywell experimental distributed processor—An 
overview," Computer , Vol. 11, pp. 28-38, Jan. 1978.
[18] Hollerbach, J., "A Recursive Lagrange Formulation of Manipulator 
Dynamics and a Comparative Study of Dynamics Formulation 
Complexity," IEEE Trans. Sys., Man and Cyber., Vol. SMC-10, No. 11, 
November 1980, pp. 730-736.
[19] Kasahara, H. and Narita, S., "Parallel Processing of Robot-Arm Control 
Computation on a Multimicroprocessor System," IEEE Robotics 
Automat., Vol. RA-1, No. 2, June 1985, pp. 104-113.
[20] Krambeck, R.H., Lee, C.M., and Law, H.F.S., "High-Speed Compact 
Circuits with CMOS," IEEE J. Solid-State Circuits, Vol. SC-17, No. 3, 
June 1982, pp. 614-619.
[21] Kung, S. Y., "VLSI Array Processors," IEEE ASSP, July 1985, pp. 4-22.
[22] Kung, S. Y., "Wafer-Scale Integration and Two-Level Pipelined 
Implementation of Systolic Arrays," J. Parallel Distrib. Comp., Vol. 1, 
No. 1, Sept. 1984, pp. 36-63.
112
[23] Kung, S. Y. et. al., "Wavefront Array Processor: Language, Architecture 
and Applications," IEEE Trans. Comp., Vol. C-31, No. 11, Nov. 1982.
[24] Kung, H. T., "Why Systolic Arrays?," Proc. IEEE , January 1982, pp. 
37-46.
[25] Lang, T. et. al., "Bandwidth of Crossbar and Multiple Bus Connections 
for Multiprocessors," IEEE Trans. Comp., Vol. C-31, No. 12, Dec. 1982.
[26] Lathrop, L., "Parallelism in Manipulator Dynamics," M.I.T. Artificial 
Intelligence Lab., Cambridge, Mass., Tech. Rep. No. 754, Dec. 1983.
[27] Lawler et. al., The Traveling Salesman, John Wiley & Sons, Great 
Britain, 1985.
[28] Lee, C.M. and Szeto, E.W., "Zipper CMOS," IEEE Circuits and Devices, 
May 1986, pp. 10-16.
[29] Lee, C.S.G. and Chang, P.R., "Efficient Parallel Algorithm for Robot 
Inverse Dynamics Computation," IEEE Trans. Sys., Man and Cyber., 
Vol. SMC-16, No.4, July/August 1986, pp. 532-542.
[30] Lee, C.S.G. et. al., "Development of the Generalized d’Alembert 
Equations of Motion for Mechanical Manipulators,:" Proc. 1982 Pattern 
Recognition and Image Processing Conf., Las Vegas, Nev., 1982, pp. 
634-640.
[31] Li, G. J. and Wall, B. W., "The Design of Optimal Systolic Arrays," 
IEEE Trans. Comp. , Vol. C-34, No. 1, Jan. 1885, pp. 66-77.
113
[32] Liao, F.Y. and Chern, M.Y., "Robot Manipulator Dynamics Computation 
on a VLSI Array Processor," Proc. 1985 Int. Conf. on Supercomputers, 
1985, pp. 116-121.
[33] Lint, B. and Agerwala, T., "Communication Issues in the Design and 
Analysis of Parallel Algorithms", IEEE Trans. Soft. Eng., Vol. SE-7, No. 
2, March 1981, pp. 174-188.
[34] Liu, C.H. and Chen, Y.M., "Multi-Microprocessor Based Cartesian-Space 
Control Techniques for a Mechanical Manipulator," IEEE J. Robotics 
Automat., Vol. RA-2, No. 2, June 1986, pp. 110-115.
[35] Luh, J.Y.S. and Lin, C. S., "Scheduling of Parallel Computation for a 
Computer-Controlled Mechanical Manipulator," IEEE Trans. Sys., Man 
and Cyber., Vol. SMC-12, No. 2, March/April 1982, pp. 214-234.
[36] Luh, J.Y.S., Walker, M.W. and Paul, R.P.C., "On-Line Manipulator 
Scheme for Mechanical Robots," IEEE Trans. Automatic Control, Vol. 
AC-25, No. 3, pp. 468-474.
[37] Lyon, R.F., "Two’s Complement Pipeline Multipliers," IEEE Trans. 
Commun., Apr. 1976, pp. 418-425.
[38] Masson, G. M. et. al., "A Sampler of Circuit Switching Networks", Proc. 
IEEE , June 1979, pp. 145-160.
114
[39] Mead, C. and Conway, L., Introduction to VLSI Systems, Addison- 
Wesley, Reading, Mass., 1980.
[40] Moldovan, D. I. and Fortes, J. A. B., "Partitioning and Mapping 
Algorithms into Fixed Systolic Arrays " IEEE Trans. Comp. , Vol. C-35, 
No. 1, Jan. 1986, pp. 1-12.
[41] Moldovan, D.I., "On the Analysis and Synthesis of VLSI Algorithms," 
IEEE Trans. Comp., Vol. C-31, No. 11, Nov. 1982, pp. 1121-1126.
[42] Murty, K. G., Linear Programming, John Wiley & sons, Inc.,1983.
[43] Nigam, R. and Lee, C.S.G., "A Multiprocessor-Based Controller for the 
Control of Mechanical Manipulators," IEEE J. Robotics Automat., Vol. 
RA-1, No. 4, Dec. 1985, pp. 173-182.
[44] Ohwada, N. et. al., "LSI’s for Digital Signal Processing," IEEE J. Solid- 
State Circuits, Vol. SC-14, No. 2, Apr. 1979, pp. 214-220.
[45] Orin, D.E., Chao, H.H. and Schrader, W.W., "Pipeline/Parallel 
Algorithms for the Jacobian and Inverse Dynamics Computations," 
Proc. 1985 IEEE Int. Conf. Robotics, pp. 785-789.
[46] Orin, D.E., "Kinematic and Kinetic Analysis of Open-Chain Linkages 
Utilizing Newton-Euler Methods," Math. Rio sc. , Vol. 43, 1979, pp. 107- 
■ 130.
"115 ; ...
[47] Papadimitrious, C. H. and Steiglitz, K., "Combinatorial Optimization: 
Algorithms and Complexity," Prentice-Hall, Inc., Englewood Cliffs, New 
Jersey, 1982.
[48] Quinton, P., "Automatic Synthesis of Systolic Arrays from Uniform 
Recurrent Equations," IEEE Conf. Par. Proc. , 1984, pp. 208-214.
[49] Rahman, M. and Meyer, D., "Case Study—A MC 68020 Based I/O 
Controller", in Topics in High-Level Computer Architecture , 
Milutinovic, V. ed., Computer Science Press, Rockville, Maryland, to 
appear, Aug. 1987.
[50] Rahman, M. and Meyer, D., "High-Performance Parallel and Pipelined 
Bit-Serial Architecture for Robot Inverse Dynamics Computations," 
Submitted to IEEE Trans. Sys., Man, Cyber., January 1987.
[51] Siegel, L. et. al., "Performance Measures for Evaluating Algorithms for 
SIMD Machines," IEEE Trans. Soft. Eng., Vol. SE-8, No. 4, July 1982, 
pp. 319-330.
[52] Siegel, H. J., "Interconnection Networks for SIMD Machines," Proc. IEEE 
, June 1979, pp. 57-65.
[53] Siegel, H. J., "A Model of SIMD Machines and a Comparison of Various 
Interconnection Networks," IEEE Trans. Comp. , Vol. C-28, No.12, Dec. 
1979, pp. 907-917.
[54] Siegel, H. J., "The Theory Underlying the Partitioning of Permutation 
Networks," IEEE Trans. Comp. , Vol. C-29, No. 9, Sept. 1980, pp. 791- 
800.
116
[55] Tjaden, G. S. and Flynn, M. J., "Detection and Parallel Execution of 
Independent Instructions," IEEE Trans. Comp. , Yol. C-19, No. 10, Oct. 
1970.
[56] Tomasulo, R. M., "An Efficient Algorithm for Exploiting Multiple 
Arithmetic Units," IBM J. , Jan. 1967, pp. 25-33.
[57] Walker, M.W. and Orin, D.E., "Efficient Dynamics Computer Simulation 
of Robotic Mechanisms," Robot Motion, Brady, M., ed., MIT Press, 
1982, pp. 107-125.
[58] Watanabe, T. et. .ah, "Improvement in the Computing Time of Robot 
Manipulators Using a Multimicroprocessor," IEEE Trans. ASME, Vol. 
108, Sept. 1986, pp. 190-197.
[59] Wong, Y. and Delosme, J. M., "Optimal Systematic Implementations of 
N-Dimensional Recurrences," Int. Symp. Comp. Arch. , June 1985, pp. 
618-621.
[60] Zheng, Y.F. and Hemami, H., "Computation of Multibody System 
Dynamics by a Multiprocessor Scheme," IEEE Trans. Sys., Man, and 
Cyber., SMC-16, No. 1, Jan./Feb. 1986, pp. 102-110.
Appendix A: N-E Equations of Motion for a Manipulator with Rotary joints
Forward Iterations:
FOR i = 0 TO (n-1) DO 
BEGIN
i+1<»i+i = + Bi+i +
= r+lRiVi+ (i+1coi+i X {Pi+l) + i+1(0M x C^cPi+i X Hu)




FORi = n TO 1 DO 
BEGIN
'fi=‘Fi+iViR‘%1




mass of link i
Bi+i zth component of the joint velocity of link i+1
0^ zth component of the joint acceleration of link i+1
1+10)i+] angular velocity of link i+1 with respect to the i+lth coordinate frame
1+1°>i+i angular acceleration of link i+1 with repect to the i+lth coordinate frame
1+1vi+1 linear acceleration of link i+1 with respect to the i+lth coordinate frame









rotation matrix which maps position vectors from i+lth frame to frame i
origin of link i+1 with respect to the ith coordinate frajne
location Of the center of mass of link i+1 with respect the i+lth coordinate fram
total inertial force exerted on link i+1 at the center of mass
total moment exerted on link i+1 at the center of mass
force exerted on link i by link i-1
moment exerted on link i by link i-1
torque exerted by the actuator on link i
119
Appendix B: 3-D Matrix/Vector Arithmetic Operations Used In The N-E
Dynamics Computational Problem
(1) Vector-Add (VA)
The vector sum Yj of vectors and bj is defined as,
Yj = (aj + bj) fori =1,2,3.
(2) Scalar-Vector (SV) product
The scalar-vector product Yj of vector \ by scalar K is defined as,
Yj = Kaii fori = 1,2,3.
(3) Inner product (IP)
The inner product Y of vectors al and bj is defined as,
Y = |aibi.
i=l
(4) Matrix-vector (MV) product
The matrix-vector product Yn of matrix a^ and vector bj is defined as,
3
Yn=Xanibi for n= 1,2,3. 
i=l
(5) Vector cross (VC) product
The vector cross product, Y of vectors aj and bj, is defined as,
'i i k 1 I1 J K |
Y = Yi+Yj+Yk = det |aj &2 a3 I
Jbi b2 b3|
120
Appendix C: PUMA Forward Kinematics Solution
The position and orientation of the end-effector of the PUMA arm with respect to a 
fixed reference coordinate system, given the joint angles and and geometric link 
parameters, may be found as follows:
The aim matrix for the PUMA robot arm is
T — Tj T2 — °Af * AvAj *A5 =
^.r flr
ny fv 0r /’v
n. J; a. />:
0 0 a l
where,
/ij — Cj [ C23 (C4 C5 C$ — S4S$) — S23S5 ] — 5^ ( Sa Ci C*y t C4S$)
ny = SilCaiC4C5C* - 54^) -S3S5CJ + C<(S<C<C\ - C\S*)
/U 38 — Sr3 [C4 C5 "" S4S« I — C23S5.
^ = CI[-C23(C4C5S6 4-S4C«) +StjS5S61 -S1C-SiC5S6 + C4Q)
= Sf [ — Cl3 ( C4 C5S* 4- S4 C$ ) 4* S33S5S$ J 4- C| ( — Sj,G S4 4- C4 C$ )
J, 38 S23(C4QS^ 4- S4Q) 4- C12S1S4 
flj3C| (C23C4S5 4* S23C5 ) — S1S4S5
a7 = S| (C4S5 4- S^ C5) 4- Cf S4S5
a, — — S23 C4S5 -4 C23 Cj
px ~ C\ C23 C4S5 4* S^3 C5 ) 4- S23d4 4- C23 4- C2 j — Sf iO*S4S5 4- ^ )
:p7 =5 5j (<i$ ( C23 C4S5 4- S23 Cj ) 4- S33 J4 4- 03 C23 4- Ct ] 4- Cj (d$S4S5 ■ 4" ^2 ).
/7r * d^iCr^Cs — Sp C4S5 ) 4- C23J4 *" ^3*^3 — ^3^2
121
Appendix D: PUMA Inverse Kinematics Solution
The inverse kinematics problem can he stated as: Given the position/orientation of 
the manipulator hand and the link/joint parameters, determine the joint angles so that 
the manipulator can be positioned as desired. That is, given
<V *»■ P, 
ny °y <h
»i a. pz
.0 0 0 1
n o a p
0 0 0 1





flip ~ ?*C 1 + Y
f XU = 0*C\ + °y$l 
flip = ~Pz $ 1 + PpC X 
f 13* = + °p^X
fit* = az^l ■*" 1
/13* = + S07!
= flip + /l2> “ <*4 “ ai







W? / lip «1 ?:
wlfllp + wiPt
where t»j — <*2^3 + a3> ^2 = ^4 +■ «2*^3> <?,* * cos 0f» ^
So -* ^53*^3








^23/ 1U ~ ^23as 
C* U C’csi ^1 a8 *^* ^,lgy) ~ ^>23<j;i *** *^~
5~&C ta, + 5ia7) + Costf, 
^•t( ^23/lU “ ^23ar)' "** S*f \Z*
^23/ Il<« ^ ^23ai 
ii^2zf H« ~ S*30:) + ~-»/l3«! ~ $s(S*zf 11# ^
•?4( a ^23°:) “ 13«
d ^ d ^ J 3
where . .(—) , (—) , (—) are constants, C- = cos (0t* + 5;), and
ao . 7 .J
% s sin (0,- + 0..),
Appendix E: DFIHS (Depth-First-Implicit-Heuristic-Search) Algorithm
The DF/IHS is a kind of depth first/heuristic search. The main feature of 
DF/IHS lies in the fact that, unlike the conventional DF/H method, it does not 
require the computation of the values of heuristic function for all active nodes with 
the largest depth in order to find, as the next branching node, the node having the 
smallest value. Prior to the search procedure, priorities are assigned to those nodes 
that may be generated during search by using the priority list of the CP/MISF 
method. In this manner, the memory requirements and the average computing time 
required for search can be reduced markedly, since the choice of the proper next 
branching node can be made without computing the heuristic values. The DF/IHS 
method is broken down into two parts, the preprocessing part to assign priorities 
heuristically to the nodes generated during search, and the depth-first search part.
1) Preprocessing (Task Renumbering): All tasks are renumbered in the same 
order as in the priority list of the CP/MISF method. This can be performed by 
the use of an 0(n ) algorithm involving the solution of a longest-path problem 
and sorting.
2) Depth-First Search: In order to describe the features of DF/IHS, use is made 
of Kohler’s general representation of the branch-and-bound method, BB(Bp, 
S, E, F, D, L, U, e, RB). (The dominant relation D and the characteristic 
function F are not used in DF/IHS).
a) Branching Rule Bp:
1) The original problem is decomposed along the time axis (an 
example of decomposition to subproblems).
2) Since optimal solutions may not be obtained by simply assigning 
the ready task(s) to the available processor(s), a fictitious task 
that forces a processor to be idle is introduced.
3) A total of
^branch — (nready + *Mdle) C mav
nodes are generated from each node in the search tree, where 
nready *s the number of ready tasks at time t: nidol is the number 
of idle tasks to be considered at time t (if mav = m then nidIe =
124
mav - 1; if 1 <midle = mav); mav is the number of processors 
available at time t; C means combination.
b) Selection is made in the form ofDF/FIFO.
1) Selection is made in the form of DF/FIFO.
2) A special table R (Ready task table) and pointer SP (next 
branching node selection pointer) for the memory of active nodes 
and selection of a next branching node are used to facilitate node 
selection by the computation of the order O(m). At the same 
time, the storage requirement, which often becomes the 
bottleneck in using the branch-and-bound method, can be 
retained in the order 0(mrn).
3) As the initial search solution, the CP/MISF solution obtained has 
an error bound, relative to the optimal solution, that is 
guaranteed. Thus any intermediate solution which has been 
obtained when the search procedure is terminated is superior to 
the CP/MISF solution in its accuracy.
c) Lower Bound Function L:
1) Femadez’extension of Hu’s lower bound
W^a) = + [q(TCa)]
q(7ta)= max " tk + (1/m) J F(x, t) dt 
o
where the load density function F(Xj, t) is given by
f(Tj, t)<
tcr(^a) tj to




is used in the light of the accuracy and the computing load 
involved. Here 7ta is the partial problem decomposed by 
branching operations; to is the allocation time represented by the 
current branching node; and [x] is the minimum integer greater
125
than x.
2) Since the calculation of the lower-bound function in 1) requires 
the use of a pseudo-polynomial time algorithm, two much 
simpler lower bounds with the complexity of at most O(n) are 
used jointly so that the computing load for the bounding 
operation is minimized. Specifically, the computing load is much 
reduced by applying the bounding operations by the lower bound 
of 1) only to those nodes for which bounding by using the simple 
bounds has failed.
d) Upper Bound Cost U: Since a very accurate intermediate solution, i.e., 
U, can be obtained as the initial solution, the total search time can be 
reduced remarkably.
e) Elimination Rule E: E = U/DB AS (upper bound tested for dominance 
of descendants of branching node and members of Currently active set);
f) Acceptable Approximation Error e;
1) While Kohler could only evaluate the relative error of the 
intermediate solution U from the lower bound as “Bracket,” BR
Ak
((U - L)/USBR). DF/IHS can obtain an approximate solution 
whose relative error from the optimal solution does not exceed e.
(U-topl)/topt<e .
2) Useless search computation can be avoided since the search 
process is terminated as soon as optimality is attained or a 
predetermined approximation error limit is reached by comparing 
the lower bound with the intermediate solution U or the
A A A
approximate intermediate solution Ue, Ue = U/(l + e).
g) Resource Bound RB: In the conventional scheduling algorithms on the 
basis of the branch-and-bound method, the number of active nodes 
increases exponentially with the number of tasks n, and because of the 
limit of memory capacity, only small problems of about 40 tasks could 
be scheduled. By the use of special lists and pointers, however, 
DF/IHS can successfully reduce the number of active nodes for each of 
which it is required to memorize the state. The search process only 
requires the memory area of the order O(mm). Thus the resource 
bound, i.e., storage limit, it rarely violated.
