# Preliminary Report on High-Performance Computational Structures for Robot Control 

Mahibur Rahman<br>Purdue University<br>David G. Meyer<br>Purdue University

Follow this and additional works at: https://docs.lib.purdue.edu/ecetr

[^0]
# Preliminary Report on High-Performance Computational Structures for Robot Control 

Mahibur Rahman
David G. Meyer

TR-EE 87-38
October 1987

School of Electrical Engineering<br>Purdue University<br>West Lafayette, Indiana 47907

## TABLE OF CONTENTS

Page
LIST OF TABLES ..... iv
LIST OF FIGURES ..... vi
ABSTRACT ..... viii
CHAPTER 1 - INTRODUCTION AND BACKGROUND ..... 1
1.1 Introduction ..... 1
1.2 Previous Work ..... 4
1.3 Organization of Preliminary Report ..... 5
CHAPTER 2 - PARALLEL PROCESSING OF ROBOT INVERSE DYNAMICS, FORWARD AND INVERSE KINEMATICS COMPUTATIONS ..... 7
2.1 Introduction ..... 7
2.2 Parallel Processing of the Inverse Dynamics Problem ..... 8
2.2.1 Ideal Lower Bounds on the Number of Processors and Execution Time of the Inverse Dynamics Problem while running on a Multiprocessor System using an Optimal Scheduling Algorithm ..... 8
2.2.2 SIMD Multiprocessor Architectural Model ..... 11
2.2.3 Proposed Instruction Scheduling Algorithm for SIMD Architecture with Crossbar Interprocessor Communication Network ..... 12
2.2.4 Simulation Results and Comparisons to previous work ..... 20
2.3 Parallel Processing of PUMA Forward and Inverse Kinematics ..... 29
2.3.1 Multiprocessor Architectural Model ..... 29
2.3.2 Simulation Results ..... 33
2.4 Conclusions ..... 37
CHAPTER 3-A COST-EFFICIENT BIT-SERIAL ARCHITECTURE FOR ROBOT INVERSE DYNAMICS COMPUTATION ..... 38
3.1 Introduction ..... 38
Page
3.2 Choice of a Bit-serial Architecture ..... 38
3.3 Time Lower Bound To Compute the Inverse Dynamics Problem on a Bit-Serial Architecture ..... 38
3.4 Bit-Serial System Architecture ..... 41
3.5 Bit-Serial Array Processor Organization and Performance ..... 43
3.6 Multi-Functional Bit-Serial Leaf Cell Architecture and Operation ..... 47
3.7 Implementation of Bit-Serial Cell in CMOS Technology ..... 50
3.8 Conclusions ..... 56
CHAPTER 4 - SYSTOLIC ARCHITECTURE FOR ROBOT INVERSE DYNAMICS COMPUTATION ..... 61
4.1 Introduction. ..... 61
4.2 Systolic Array Design Methodology ..... 61
4.3 Time Lower Bound to Compute the Inverse Dynamics Problem on A Systolic Array ..... 67
4.4 Reformulation of Newton Euler Equations of Motion for Direct Mapping onto a Fixed Systolic Architecture ..... 69
4.5 Systematic Design of a Basic Set of Systolic Processors for the Inverse Dynamics Computational Problem ..... 73
4.5.1 Systolic Processor for Unidirectional Tasks ..... 75
4.5.2 Systolic Architecture of a Type 4 Task ..... 75
4.5.3 Systolic Architecture of a Type 5 Task ..... 77
4.5.4 Systolic Architecture for Task Types 6(a)-(c) ..... 83
4.5.5 Systolic Array for a Type 7 Task ..... 85
4.6 Optimal Buffer Assignment for the Formulation of a Balanced Systolic Architecture to Evaluate the Inverse Dynamics Computational Problem. ..... 95
4.7 Conclusions ..... 104
CHAPTER 5 - SUMMARY OF CONCLUSIONS ..... 106
LIST OF REFERENCES ..... 109
APPENDICES
Appendix A: N-E Equations of Motion for a Manipulator with Rotary Joints ..... 117
Appendix B: 3-D Matrix/Vector Arithmetic Operations Used In the N-E Dynamics Computational Problem ..... 119
Appendix C: PUMA Forward Kinematics Solution ..... 121
Appendix D: PUMA Inverse Kinematics Solution ..... 122
Appendix E: DFIHS (Depth-First-Implicit-Heuristic-Search)
Algorithm ..... 124

## LIST OF TABLES

Table Page
1.1 Computational Complexity of Robot Control Algorithms ..... 2
2.1 Initial SIMD Task Assignment for WG of Fig. 2.1 ..... 18
2.2 Final STAT for WG of Fig. 2.1 ..... 18
2.3 DPAT for WG of Fig. 2.1. ..... 18
2.4 STAT for a Single Iteration of the N-E Dynamics Algorithm ..... 19
2.5 DPAT for a Single Iteration of the N-E Algorithm ..... 21
2.6 MC 68020 Operation Execution Times ..... 23
2.7 MC 68881 Floating-Point Operation Execution Times ..... 31
3.1 Bit-Serial Cell Performance Measures ..... 46
3.2 Parallelism in the Execution of Tasks for Three Forward Iterations ..... 48
3.3 Parallelism in the Execution of Tasks for Three Backward Iterations ..... 49
4.1 GVT of a Type 4 Task ..... 78
4.2 UVT of a Type 4 Task ..... 78
4.3 IT of a Type 4 Task ..... 78
4.4 GVT of a Type 5 Task ..... 81
4.5 UVT of a Type 5 Task ..... 81
4.6 IT of a Type 5 Task ..... 81
Table Page
4.7 GVT of a Type 6a Task ..... 86
4.8 UVT of a Type 6a Task ..... 87
4.9 IT of a Type 6a Task ..... 88
4.10 GVT of a Type 7 Task ..... 93
4.11 UVT of a Type 7 Task ..... 93
4.12 IT of a Type 7 Task ..... 93
4.13 Computational Delays for Basic Set of Systolic Processors ..... 103
5.1 Execution Times and Hardware Overhead for Computation of Robot Control Algorithms in a MC 68020-Based Multiprocessor System. ..... 107
5.2 Execution Times and Hardware Overhead for Computation of Inverse Dynamics Problem ..... 107

## LIST OF FIGURES

Figure Page
2.1 SIMD Architectural Model for Robot Inverse Dynamics Computation ..... 13
2.2 A Given WG ..... 17
2.3 Simulation Results for Robot Inverse Dynamics Computation on Proposed Architectural Model ..... 24
2.4 Performance Comparison of Proposed Architecture to that of [19] for Robot Inverse Dynamics Computation ..... 27
2.5 Multiprocessor Architectural Model for PUMA Forward and Inverse Kinematics Computations ..... 30
2.6 Simulation Results for Forward Kinematics ..... 32
2.7 Simulation Results for Inverse Kinematics ..... 35
3.1 Bit-Serial System Architecture ..... 42
3.2 Pipelined Organization of Bit-Serial Cells in Processor 1 ..... 44
3.3 Pipelined Organization of Bit-Serial Cells in Processor 2 ..... 45
3.4 Architecture of A Multi-Functional Bit-Serial Leaf Cell ..... 51
3.5 Interconnections Between Neighbouring Bit-Serial Cells ..... 52
3.6 Flowchart of Asynchronus Communication Protocol ..... 53
3.7 Data Routing Paths for Alternate Vector Arithmetic Operating Modes ..... 54
3.8 Zipper CMOS Driver Circuit ..... 57
Figure Page
3.9 Zipper CMOS Full Adder/Subtractor ..... 58
3.10 Zipper CMOS Bit-Serial Multiplier Module ..... 59
4.1 Systolic Processors for Unidirectional Tasks. ..... 76
4.2 Systolic Architecture for a Type 4 Task ..... 79
4.3 Systolic Architecture for a Type 5 Task ..... 82
4.4 Systolic Architecture for a Type 6a Task ..... 89
4.5 Processing Element for a Type 6b Task ..... 90
4.6 Processing Element for a Type 6c Task ..... 90
4.7 Systolic Architecture for a Type 7 Task ..... 94
4.8 WG for Forward Iterations of N-E Algorithm ..... 96
4.9 BG for Forward Iterations of N-E Algorithm ..... 89
4.10 WG for Backward Iterations of N-E Algorithm ..... 100
4.11 BG for Backward Iterations of N-E Algorithm ..... 102

## ABSTRACT

In this report we present some initial results of our work completed thus far on "Computational Structures for Robot Control". A SIMD architecture with the crossbar interprocessor network which achieves the parallel processing execution time lower bound of $o\left(\left\lceil a_{1} n\right\rceil\right.$, where $a_{1}$ is a constant and $n$ is the number of manipulator joints,for the computation of the inverse dynamics problem, is discussed. A novel SIMD task scheduling algorithm that optimizes the parallel processing performance on the indicated architecture is also delineated. Simulations performed on this architecture show speedup factor of 3.4 over previous related work completed for the evaluation of the specified problem, is achieved. Parallel processing of PUMA forward and inverse kinematics solutions is next investigated using a particular scheduling algorithm. In addition, a custom bit-serial array architecture is designed for the computation of the inverse dynamics problem within the bit-serial execution time lower bound of o( $\left.c_{1} k+c_{2} k n\right)$, where $c_{1}$ and $c_{2}$ are specified constants, $k$ is the word length, and $n$ is the number of manipulator joints. Finally, mapping of the Newton-Euler equations onto a fixed systolic array is investigated. A balanced architecture for the inverse dynamics problem which achieves the systolic execution time lower bound for the specified problem is depicted. Please note again that these results are only preliminary and improvements to our algorithms and architectures are currently still being made.

## CHAPTER 1 <br> INTRODUCTION AND BACKGROUND

### 1.1 Introduction

Inverse Dynamics, Forward and Inverse Kinematics are particular robot control algorithms which need to be computed during regular robot servo control loops. Unfortunately, the computational complexity of these algorithms (see Table 1.1) degrades the servo control loop time of presentday robot control systems. This preliminary report therefore investigates high-performance computational structures to evaluate the specified algorithms in minimal time. The architecture, operation, performance, and derivation of lower bounds for the execution times and number of processors on specific computational structures to evaluate the specified problems is discussed.

The forward and inverse kinematics algorithms deal with the analytical study of the geometry of motion of a robot with respect to a fixed reference coordinate system. The former algorithm determines the position and orientation of the end-effector of the manipulator with respect to a fixed reference coordinate system, given the joint angles and geometric link parameters. Conversely, the latter algorithmic formulation determines the joint angles, given the position and orientation of the manipulator with respect to a fixed coordinate system. The second scheme is thus used primarily to determine whether the manipulator can reach the desired hand position and orientation, and if it can, find how many different manipulator configurations the given manipulator position and orientation may satisfy. Since the independent variables in a robot arm are the joint variables, and a task is usually stated in terms of the reference coordinate system, the inverse kinematics problem is used more frequently.

Table 1.1 Computational Compexity of Robot Control Algorithms

| Operation | Inverse Dynamics <br> (Newton-Euler) | PUMA <br> Forward <br> Kinematics | PUMA <br> Inverse <br> Kinematics |
| :---: | :---: | :---: | :---: |
| Type of <br> Equations | Linear Recurrence | Non-Recursive <br> Non-Linear | Non-Recursive <br> Non-Linear |
| Multiplications | 678 | 100 | 74 |
| Additions | 597 | 77 | 66 |
| Transcendental | - | 137 | 62 |
| Square Root | - | - | 3 |
| Division | - | - | 10 |

The inverse kinematics solution may be obtained by various methods such as inverse transform, screw algebra, dual matrices, dual quaternian, iterative, and geometric approach. The iterative solution often requires a very high number of computations and does not guarantee convergence to the correct solution. The inverse transform method yields a set of explicit, non-iterative joint angle equations. The dataflow structure of this technique will therefore be used in this thesis. The architecture, operation, and performance of a parallel processing architecture to compute the specified kinematics algorithms for the PUMA robot is described.

The inverse dynamics problem is that of efficiently determining the motor torques required to drive a manipulator arm in free motion. These torques must be evaluated repeatedly during servo control loops to avoid undesirable motion deviations of the robot arm from the desired trajectory path. The computation of the torques is, however, a very mathematically intense task which degrades the servo loop time of present-day robot control systems. Many researchers have thus concentrated on simplifying the dynamics equations $[30,35,57]$, or developing new computational architectures $[5,19,26,32,34,35,43,45,58,60]$ to reduce the servo sampling period. In this paper, the latter approach is employed.

There are a number of ways to formulate the robot arm dynamics equations of motion. They include the Lagrange-Euler [4], recursive Langrange-Euler [18], Newton-Euler [36] and the generalized d'Alembert principle functions [30]. Among these methods, the Newton-Euler (N-E) dynamics equations is the most efficient and has been shown to possess the computational time lower bound of $O(n)$ [36], where $n$ is the number of degrees-of-freedom of the manipulator. The N-E algorithm will therefore be used to implement the various computational structures discussed in this thesis. It should, however, be noted that the proposed architectures may be customized to meet the computational model of other formulations of the inverse dynamics problem.

The N-E formulation uses two types of iterative recursions, namely, the forward and the backward recursions, which are applied to the robot links sequentially. The forward recursion propagates kinematics information such as linear velocities, angular accelerations, and linear accelerations at the center of mass of each link - from the inertial coordinate frame to the hand coordinate frame. The backward recursion propagates the forces and moments exerted on each link from the end-effector of the manipulator to the base reference frame. The parallel, pipelined, and recursive nature of the N -

E dynamics algorithm suggests that it is amenable to parallel processing structures using off-the-shelf processors, as well as parallel and pipelined custom VLSI array architectures. This thesis therefore also presents such computational schemes.

### 1.2 Previous Work

Several parallel architectures have been proposed to solve the inverse dynamics problem. Luh and Lin [35] proposed a modification of the traditional "branch and bound" scheduling technique to process the equations in parallel on a set of $n$ CPUs. Kasahara and Narita [19] continued along the same path by proposing to use the DFIHS (Depth-First-Implicit-Heuristic-Search) scheduling algorithm for this purpose. However, in both the above studies, the important problem of interprocessor communication which degrades the performance of their particular multiprocessor structure is ignored.

Orin, Chao and Schrader [45], recognizing the pipelined nature of the N-E formulation, proposed assigning two pipelined CPUs per link to compute the forward and backward recursions in each processor, respectively. Their structure eliminates some of the performance degradation problems associated with interprocessor communications that exist in the computation of the N-E algorithm for parallel processing environments. The task of computing forward or backward iterations of a single stage of the N-E recursive algorithm at a single CPU location does not, however, produce sufficient speedup over a single CPU solution.

Liao and Chern [32] suggest using the CBAP (Cross-Bus Array Processor), which uses a large set of bit-parallel processors arranged in an array format with two sets of busses crossing over the array in two directions. The primary disadvantage of this system is the cost-inefficiency of using a large number of bit-parallel array processors which are not fully utilized. Additional problems include complexity in the operand data alignment process and the directional data-shifting mechanisms used in the control of the array, causing the system to be susceptible to hardware/software faults.

Lathrop [26] proposed two parallel algorithms on the inverse dynamics problem using a group of special purpose processors. One is a linear-parallel algorithm and the other is a logrithmic-parallel algorithm based on the
partial sum technique. The main concern with both approaches is the massive buffering between forward and backward recursions, which deteriorates the performance. The second problem is that they both involve complex interprocessor communication structures which frequently cause data to be fetched and, as a result, data for operand pairs are not properly aligned for parallel computations.

Lee and Chang [29] recently proposed using the recursive doubling algorithm with a modified inverse perfect shuffle interconnection scheme between a set of parallel processors. Their processor interconnection structure improves pipelining and eliminates some of the problems associated with interprocessor communication. They did not, however, incorporate any operand access mechanisms into their structure to allow proper synchronization of the parallel processors. Their implementation requires a total of 530 modular processors, where the processor complexity required to perform 3-D parallel vector dot product and/or vector addition along with the expensive interconnection structure among processors indicate this approach may not be a cost-efficient and fault-tolerant practical approach.

For the computation of the inverse kinematics algorithm, which is an equally important problem, Lee and Chang have proposed a pipelined VLSI architecture using CORDIC processors as building blocks of the structure. Delay buffers were inserted along data paths of the system to balance operand arrival time of the CORDIC computational nodes. The author has no valid criticism of this custom VLSI architecture.

### 1.3 Organization of Preliminary Report

Parallel processing using off-the-shelf processing elements to evaluate the inverse dynamics, PUMA forward and inverse kinematics algorithms is discussed in chapter 2. Ideal lower bounds on the number of parallel processors and execution time for the computation of the inverse dynamics problem in the specified computational model is derived here. A novel scheduling algorithm for the parallel processing of N-E equations on an SIMD machine with a particular interprocessor communication network is presented next. Simulation results performed on this architecture are then delineated and compared to the performance of previous related work completed on this problem. The second part of this chapter describes the performance of both types of kinematics algorithms in a multiprocessor environment with a
shared memory interprocessor communication mechanism while using a particular instruction scheduling algorithm.

In chapter 3, a custom bit-serial array architecture for the computation of the inverse dynamics problem is presented. The organization, operation and performance of the proposed system is described. In addition, the architecture and intercell communication protocol of an individual bit-serial cell (which is used as the building block of the overall array structure) is delineated.

Mapping of the Newton-Euler equations of motion onto a fixed systolic architecture is depicted in Chapter 4. The systolic design methodology for this mapping process is discussed here. This design procedure is used to find a basic set of systolic processor architectures which are used to build the complete systolic system. Integer linear programming is applied for the optimal buffer assignment problem to obtain a "balanced" systolic array for the computation of the specified problem. The performance, operation, design and lower bound task latency costs of such a systolic system are also delineated here. Finally, Chapter 5 provides a summary of conclusions drawn from this report.

## CHAPTER 2 <br> PARALLEL PROCESSING OF ROBOT INVERSE DYNAMICS, FORWARD AND INVERSE KINEMATICS COMPUTATIONS

### 2.1 Introduction

In the first part of this chapter, we discuss parallel processing of the inverse dynamics computational problem. First, the ideal lower bounds for the number of parallel processors and execution time to evaluate the specified problem in a parallel processing environment using an optimal scheduling algorithm is derived. The proposed multiprocessor architectural model is then presented. Next, a scheduling algorithm customized for optimal parallel processing on the proposed system model is described. Further, simulation results for computing the desired algorithm using the proposed architecture and scheduling methodology is delineated and compared to previous work completed on this topic.

In the second part of the chapter, the parallel processing system model for the computation of PUMA forward and inverse kinematics algorithms is presented. This architecural model for the evaluation of the specified problems is simulated by using the DFIHS (Depth-First-Implicit-Heuristic-Search) algorithm [19] as the primary task scheduling tool.

### 2.2 Parallel Processing of the Inverse Dynamics Problem

### 2.2.1 Ideal Lower Bounds on the Number of Processors and Execution Time of the Inverse Dynamics Computational Problem while running on a Multiprocessor System using an Optimal Scheduling Algorithm.

The ideal minimum number of processors required to compute the inverse dynamics problem in minimal time while running in a parallel processing environment using an optimal scheduling algorithm is discussed here. In addition, the limitation on speeding up the specified problem in the indicated computational architecture is also investigated. Before deriving these ideal lower bounds, the following notation, definitions, and lemmas must be established.

## Notation:

$m_{p}=$ Minimum number of parallel processors needed to compute the inverse dynamics problem in minimal time using an optimal scheduling algorithm.
$\mathrm{T}_{\mathrm{p}}=$ Lower bound on the execution time of the inverse dynamics problem while running in a parallel processing system with $\mathrm{m}_{\mathrm{p}}$ processors, using an optimal scheduling algorithm.
$\mathrm{E}<l>$ denotes a linear arithmetic expression of $l$ distinct atoms, where an atom is a constant or variable, e.g. $\mathrm{E}<4>=\mathrm{a}+\mathrm{b}-\mathrm{c} / \mathrm{d}$.
$\mathrm{T}_{\mathrm{cp}}=$ Minimum time to perform the set of computations in the critical path of a task graph $G$.

Definition 2.1: The activity of vertex $j$ [10] in a task graph is defined as,

$$
f\left(\tau_{\mathrm{j}}, \mathrm{t}\right)= \begin{cases}1, & \text { for } t \in\left[\tau_{\mathrm{j}}-\mathrm{t}_{\mathrm{j}}, \tau_{\mathrm{j}}\right] \\ 0, & \text { otherwise }\end{cases}
$$

where $f\left(\tau_{j}, t\right)$ indicates the activity (or computational delay) of vertex j (or task $\mathrm{T}_{\mathrm{j}}$ ) along time, according to the restrictions imposed by the task graph, and $\tau_{\mathrm{j}}$ is the earliest completion time of task $\mathrm{T}_{\mathrm{j}}$.

Definition 2.2: The load density function [10] is defined by

$$
\begin{equation*}
F(\tau, t)=\sum_{j=1}^{l} f\left(\tau_{j}, t\right) \tag{2.1}
\end{equation*}
$$

where $F(\tau, t)$ depicts the total activity of the task graph as a function of time, and $l$ is the total number of tasks in task graph G.

Definition 2.3: The load density function within the interval $\left[\theta_{1}, \theta_{2}\right] \subset\left[0, \mathrm{t}_{\mathrm{CP}}\right]$ after all tasks have been shifted to yield minimum overlap with this interval [10] is defined as,

$$
\begin{equation*}
\mathrm{R}\left(\theta_{1}, \theta_{2}, \mathrm{t}\right)=\int_{\theta_{1}}^{\theta_{2}} \mathrm{~F}(\tau, \mathrm{t}) \mathrm{dt} \tag{2.2}
\end{equation*}
$$

Lemma 2.1: A lower bound on the minimum number of processors [10] to execute a task graph $G$ within time $t_{C P}$ is given by,

$$
m_{p}=\left[\max _{\left[\theta_{1}, \theta_{2}\right]}\left[\frac{1}{\theta_{2}-\theta_{1}} \int_{\theta_{1}}^{\theta_{2}} R\left(\theta_{1}, \theta_{2}, t\right) d t\right]\right]
$$

where the maximum is taken over all integer time segments $\left[\theta_{1}, \theta_{2}\right]$.

Theorem 2.1: The minimum number of parallel processors needed to compute the inverse dynamics problem within time $t_{C P}$ of the NewtonEuler task graph using an optimal scheduling algorithm is $n$, where $n$ is the number of manipulator joints.

Proof: Using Eq. 2.1, the load density function for the inverse dynamics problem is,

$$
\begin{equation*}
F(\tau, t)=\sum_{j=1}^{b_{1} n} f\left(\tau_{j}, t\right)=b_{1} b_{2} n \tag{2.3}
\end{equation*}
$$

where $b_{1}$ is the number of tasks in a single iteration of the N-E algorithm, and $b_{2}$ is the number of tasks in the complete N-E task graph which have an unit activity at a set vertices $\mathrm{T}_{\mathrm{i}}$. The second load density function within
time interval $\left[\theta_{1}, \theta_{2}\right] \subset\left[0, \mathrm{t}_{\mathrm{CP}}\right]$ to yield minimum overlap for the inverse dynamics problem is found by Eq. 2.2 to be,

$$
\mathrm{R}\left(\theta_{1}, \theta_{2}, \mathrm{t}\right)=\mathrm{b}_{1} \mathrm{~b}_{2} \mathrm{n}\left(\theta_{2}-\theta_{1}\right)
$$

Thus, by lemma 2.1, $\mathrm{m}_{\mathrm{p}}$ must be of the form,

$$
\begin{align*}
m_{p} & =\left\lceil\left.\max \left[\theta_{1}, \theta_{2}\right]\left[\frac{1}{\theta_{2}-\theta_{1}} \int_{\theta_{1}}^{\theta_{2}} b_{1} b_{2} n\left(\theta_{2}-\theta_{1}\right) d t\right] \right\rvert\,\right. \\
& =\left\lceil\max _{\left[\theta_{1}, \theta_{2}\right]}\left[b_{1} b_{2}\left(\theta_{2}-\theta_{1}\right) n\right]\right\rceil \tag{2.4}
\end{align*}
$$

It is evident from Eq. 2.4 that the minimum number of parallel processors to compute the inverse dynamics algorithm using an optimal scheduling algorithm is $n$. Note that this lower bound occurs when $\max _{\left[\theta_{1}, \theta_{2}\right]}\left[b_{1} b_{2}\left(\theta_{2}-\theta_{1}\right)\right]$ is unity.

Lemma 2.2: The set of computations of a graph $G$ cannot be completed with $m$ processors in a time less than $\mathrm{t}_{\mathrm{L}}$ [10],

$$
t_{L}=t_{C P}+\max _{\mathrm{t}_{5} \leq_{\mathrm{K}} \leq_{\mathrm{t}_{\mathrm{CP}}}}\left[-\mathrm{t}_{\mathrm{K}}+\frac{1}{\mathrm{~m}} \int_{0}^{t_{\mathrm{K}}} \mathrm{~F}(\bar{\tau}, \mathrm{t}) \mathrm{dt}\right]
$$

where $\mathrm{t}_{\mathrm{K}}$ is a discrete point in time and $\bar{\tau}$ is the latest completion time of a specific task.

Lemma 2.3: The critical path time, $\mathrm{t}_{\mathrm{CP}}$, of the inverse dynamics task graph S is of the form $a_{1} n$, where $a_{1}$ is a specified constant and $n$ is the number of manipulator joints.

Proof: Let $\mathrm{x}_{1}$ be the computational delay in the critical path of a single iteration of the N-E dynamics algorithm. Thus, the critical path time for $n$ iterations is $\mathrm{t}_{\mathrm{CP}}<\mathrm{n}>=\mathrm{x}_{1} \mathrm{n}$. The inverse dynamics problem may be considered as computing a set of joint torques which result in obtaining the joint torques. Each joint torque $\tau_{\mathrm{i}}$ of joint i can be expressed as an arithmetic expression containing at last 3 n atoms: n joint positions $\left\{\mathrm{q}_{\mathrm{i}}\right\}_{\mathrm{i}=1}^{\mathrm{n}}, \mathrm{n}$ joint velocities $\left\{\dot{\mathrm{q}}_{\mathrm{i}}\right\}$,
and $n$ joint accelerations $\{\ddot{q}\}_{i=1}^{n}$, implying, $t_{C P}<3 n>=3 x_{1} n$. The previous expression may be rewritten without changing the order of the lower bound as $t_{C P}\left[\tau_{1}, \ldots, \tau_{n}\right]=a_{1} n$, where $a_{1}$ is specified constant.

Theorem 2.2: $\quad$ The minimum time to compute the inverse dynamics problem in a parallel processing system with $\mathrm{m}_{\mathrm{p}}$ processors using an optimal scheduling algorithm is bounded below by o $\left\lceil a_{1} n\right\rceil$, where $n$ is the number of manipulator links, and $a_{1}$ is a specified constant.

Proof: By substituting $F(\tau, t)$ from Eq. 2.3, $t_{C P}$ from Lemma 2.3, and $m$ from Theorem 2.1 into the expression for $t_{L}$ in Lemma 2.2, we obtain a formulation for $T_{P}$,

$$
\begin{align*}
T_{P} & =a_{1} n+\max _{t_{1} \leq t_{K} \leq t_{C P}}\left[-t_{K}+\frac{1}{n} \int_{0}^{t_{K}} b_{1} b_{2} n d t\right] \\
& =a_{1} n+t_{K}\left(b_{1} b_{2}-1\right) \tag{2.5}
\end{align*}
$$

Since $a_{1}, b_{1}, b_{2}$ and $t_{K}$ are specified constants, the lower bound for $T_{P}$, without changing the order of the lower bound of Eq. 2.5 , may be expressed as,

$$
\mathrm{T}_{\mathrm{P}}[\mathrm{E}<3 \mathrm{n}>] \geq \mathrm{o}\left(\mathrm{a}_{1} \mathrm{n}\right)
$$

where $\mathrm{E}<3 \mathrm{n}>$ represents the computational complexity of evaluating $n$ joint positions $\left\{q_{i}\right\}_{i=1}^{n} n$ joint velocities $\left\{\dot{q}_{i}\right\}_{i=1}^{n}$, and $n$ joint accelerations $\left\{\ddot{q}_{i}\right\}_{i=1}^{n}$.

### 2.2.2 SIMD Multiprocessor Architectural Model

The N-E Equations of motion may be decomposed into a set of homogeneous linear unidirectional and recurrence tasks (see section 4.4). This suggests that the inverse dynamics problem is amenable to a SIMD multiprocessor-based system. The basic architectural model of the proposed architecture to compute the desired problem is presented in this section.

It has been stated that "the most critical system control mechanism in a multiprocessor-based computer structure are clearly those involved with interprocess
and interprocessor communication"[17]. The interprocessor communication network for the SIMD model is therefore a crucial choice. In Theorem 2.1, we proved the ideal lower bound on the number of parallel processors, $m$, to compute the inverse dynamics problem to be $n$, where $n$ is the number of manipulator joints. For $n>6$, industrial robots have undesired redundant degrees of freedom. It is therefore safe to assume $n \leq 8$ for worst-case applications. Thus, for $n=m \leq 8$, a "complete" interprocessor communication strategy such as the crossbar network is desired for this purpose due to the commercial availability of $8 \times 8$ crossbar switches. In most image processing applications, crossbar interprocessor communication networks are generally not used for SIMD processing due to excessive network overhead costs since $m \gg 8$. This is however not the case for our application, as discussed above.

Fig. 2.1 illustrates the proposed SIMD system model. The configuration is structured with m synchronized processing elements with a crossbar interprocessor communication network. The microcode ROM (Read-Only-Memory) contains a static instruction schedule for optimal SIMD processing of N-E Equations of motion on the proposed system model. Each processing element $\mathrm{PE}_{\mathrm{i}}$ has its own local memory $\mathrm{PEM}_{\mathrm{i}}$. Interprocessor communication data is thus directed to single or multiple destination $\mathrm{PEM}_{\mathrm{i}}$ from a single source $\mathrm{PE}_{\mathrm{i}}$ at the end of every SIMD instruction cycle. Memory conflicts, however, occur when multiple source $\mathrm{PE}_{\mathrm{i}}$ attempt to send data to the same destination PEM $_{\mathrm{i}}$ through the crossbar network. In addition, an unsystematic SIMD instruction schedule in the microcode ROM may cause $\mathrm{PE}_{\mathrm{i}} \mathrm{s}$ to be idle during SIMD processing cycles, thus causing a longer overall parallel execution time. It is therefore clear that an algorithm which optimally schedules SIMD tasks to the PEs, in addition to scheduling intermediate results through the crossbar network without resource conflicts, is desired. The next section presents such an algorithm.

### 2.2.3 Proposed Instruction Scheduling Algorithm for SIMD Architecture with Crossbar Interprocessor Communication Network

In this section, a novel algorithm for the scheduling of SIMD-mode processes and destination processor data through the crossbar interprocessor network is presented. It should be stressed here that the technique for scheduling of destination data embedded in the algorithm is valid only for the case of a crossbar interprocessor network. Before a formal presentation of this algorithm, the following notation must be established.


Fig. 2.1: SIMD Architectural Model for Robot Inverse Dynamics Computation

## Notation:

(1) $\mathrm{T}_{\mathrm{R}}(\mathrm{t})=\left\{\mathrm{T}_{1}, \ldots, \mathrm{~T}_{l}\right\}, l \in \mathrm{z}^{\mathrm{n}}$, is a set of ready tasks (tasks whose predecessors have been executed) at time $t$ which perform the same type of operation.
(2) $T_{R F}(t)=\left\{T_{R 1}, \ldots, T_{R d}\right\}, k \in z^{n}$, is a set of valid $T_{R}$ 's at time $t$.
(3) $\quad N_{R i}$ specifies the number of ready tasks in the task set $T_{R i} \in T_{R S}(t)$.
(4) $\quad \mathrm{N}_{\mathrm{I}}(\mathrm{t}+\Delta)$ denotes the number of interprocessor data transfers required to be performed by the crossbar network at time, $t+\Delta$, where $\Delta$ is the computational delay of the operation initiated at time $t$.
(5) $\quad N_{C_{i}}(t+\Delta)$ is the number of unique destination processors to which the result of task $T_{i}$ executed in processor $P_{j}$ is to be sent at time $t+\Delta$.
(6) $\quad P_{D i}(t+\Delta)=\left\{P_{1}, \ldots, P_{r}\right\}, r \in z^{n}$, specifies the set destination processors to which the result of task $T_{i}$ executed in processor $P_{j}$ is to be sent at time $t+\Delta$.
(7) WG is a weighted precedence graph for a computational problem $C$, where specifications of node task number, node computational delay and node task type must be included in the task graph.

Procedure ISSMACIN (Instruction Scheduling for a SIMD Multiprocessor Architecture with Crossbar Interconnection Network).

Input: Weighted Graph WG and number of available parallel processors m.
Output: Set of tables $T=\{S T A T, ~ D P A T\}$ where,
STAT is a SIMD Task Allocation Table which specifies the assignment of SIMD processes to parallel processors at time $t$.
DPAT is a Destination Processor Assignment Table which specifies the destination processor tags for the proper routing of data through the crossbar network at time $t+\Delta$.

STEP 1: By the CP/MISF method [19], determine the "level"' of each task in WG, where the level $l_{\mathrm{i}}$ of task $\mathrm{T}_{\mathrm{i}}$ is defined as the longest path from the exist node to node $N_{i}$ corresponding to $T_{i}$. Mathematically,

$$
l_{\mathrm{i}}=\max _{\mathrm{k}} \sum_{\mathrm{j} \in \pi_{\mathrm{k}}} \mathrm{t}_{\mathrm{j}}
$$

where $\pi_{k}$ is the kth path from the exit node to node $N_{i}$, and $t_{j}$ is the
computational delay of node j .
STEP 2: Construct the priority list L in the descending order of $l_{\mathrm{i}}$ and number of immediately successive tasks $n_{i}$. Order tasks of identical $l_{\mathrm{i}}$ and $\mathrm{n}_{\mathrm{i}}$ in lexicographic fashion.

STEP 3: Initial assignment process of tasks in WG to parallel processors for SIMD program execution:
REPEAT
(a) Determine $\mathrm{T}_{\mathrm{RS}}(\mathrm{t})=\left\{\mathrm{T}_{\mathrm{R} 1}, \ldots, \mathrm{~T}_{\mathrm{Rk}}\right\}$
(b) Find $\mathrm{T}_{\mathrm{Ri}} \in \mathrm{T}_{\mathrm{RS}}(\mathrm{t})$ with $\max \left[\mathrm{N}_{\mathrm{Ri}}\right]$.
(c) Select $\max [\mathrm{x}]$ tasks from $\mathrm{T}_{\mathrm{Ri}}$ where $\mathrm{x} \in \mathrm{z}^{\mathrm{n}}$ and $1 \leq \mathrm{x} \leq \mathrm{m}$. Let $\mathrm{T}_{\mathrm{x}}$ be the chosen.
(d) Assign the tasks $\mathrm{T}_{\mathrm{x}}$ to x parallel processors in lexicographic order.
(e) Direct the corresponding SIMD command to these $x$ processors and mask the other $\mathrm{m}-\mathrm{x}$ processing elements.
(f) Perform $T_{R i}=T_{R i} \backslash T_{x}$.

UNTIL $\mathrm{N}_{\mathrm{Ri}}=0$ for all possible $\mathrm{T}_{\mathrm{Ri}} \in \mathrm{T}_{\mathrm{RS}}$ and t .
STEP 4: Rearrange the task assignment of STEP 3 at each SIMD processing state $t$ so that $\mathrm{N}_{\mathrm{I}}(\mathrm{t}+\Delta)$ is maximized. This may be achieved by using the generalized backtracking algorithm [15]. The explanation of this algorithm is beyond the scope of this thesis. We may however note that the time complexity of the specified algorithm is $\mathrm{O}(\mathrm{m} \cdot \mathrm{s})$, where m is the number of PEs, and $s$ is the number of SIMD processing states $t$.

STEP 5: The scheduling of destination processor data through the crossbar interconnection network at time $t+\Delta$ is as follows:

## REPEAT

(a) Find task $T_{i} \in T_{x}(t+\Delta)$ with $\max \left[N_{C i}\right]$ where $T_{x}$ is the set of $x$ tasks computed at time $t$.
(b) Broadcast result of task $\mathrm{T}_{\mathrm{i}}$ executed in processor $\mathrm{P}_{l}, 1 \leq l \leq \mathrm{m}$, to the set of destination processors $\mathrm{P}_{\mathrm{Di}}(\mathrm{t}+\Delta)$.
(c) At the same time, send/broadcast results of set of tasks $T_{j}$, where $T_{j} \subset T_{x}$ and $T_{i} \notin T_{j}$, to set of destination processors $P_{D j}$, where $P_{D j} \subset P_{D i}(t+\Delta)$.
(d) $\quad T_{x}(t+\Delta)=T_{x}(t+\Delta)\left(T_{i} \cup T_{j}\right)$.

UNTIL $\mathrm{N}_{\mathrm{Ci}}=0$ for all possible $\mathrm{T}_{\mathrm{i}} \in \mathrm{T}_{\mathrm{x}}, \mathrm{t}$ and $\Delta$.

Let us now illustrate procedure ISSMACIN by a simple example. Fig. 2.2 shows a given WG. The task numbers $\mathrm{T}_{\mathrm{i}}$ are specified within the nodes $\mathrm{N}_{\mathrm{i}}$. The nomenclature $\mathrm{t}_{\mathrm{y}} j$ next to each node specifies the type of task to be computed in the vertex, where $j$ denotes the type of task. For purpose of simplicity, we shall assume the operational delay of task type $j$ is $j$ time units. Applying ISSMACIN to Fig. 2.2 when two parallel processors with a crossbar interprocessor communication network is presumed to be available:

STEP 1: Applying CP/MISF method to Fig. 2.2:
$l_{1}=5, l_{2}=6, l_{3}=5, l_{4}=5, l_{5}=4$,
$l_{6}=5, l_{7}=4, l_{8}=4, l_{9}=4, l_{10}=3$,
$l_{11}=2, l_{12}=3, l_{13}=3, l_{14}=2, l_{15}=2$.
STEP 2: The priority list $L$ of the task set is:
$\mathrm{L}=\{\mathrm{T} 2, \mathrm{~T} 1, \mathrm{~T} 3, \mathrm{~T} 4, \mathrm{~T}, \mathrm{~T} 6, \mathrm{~T}, \mathrm{~T} 8, \mathrm{~T} 9, \mathrm{~T} 10, \mathrm{~T} 12, \mathrm{~T} 13, \mathrm{~T} 11, \mathrm{~T} 14, \mathrm{~T} 15\}$
STEP 3: Results for the first-run through the loop which specifies the process for assigning tasks to processors, we get:
(a) $\mathrm{T}_{\mathrm{RS}}\left(\mathrm{t}_{1}\right)=\left\{\mathrm{T}_{\mathrm{R} 1}, \mathrm{~T}_{\mathrm{R} 2}\right\}$ where, $T_{R 1}=\{T 1, T 3, T 4, T 5\}$ and $\mathrm{T}_{\mathrm{R} 2}=\{\mathrm{T} 2\}$.
(b) $\mathrm{N}_{\mathrm{R} 1}=4$, and $\mathrm{N}_{\mathrm{R} 2}=1$. Thus, $\mathrm{T}_{\mathrm{R} 1}$ is selected for SIMD processing.
(c) $\mathrm{T}_{\mathrm{x}}=\{\mathrm{T} 1, \mathrm{~T} 3, \mathrm{~T} 4\}$.
(d) Assign T1 to PE1, T3 to PE2, and T4 to PE3.
(e) Direct type 1 SIMD command to PE1-3.
(f) $\quad T_{R 1}=T_{R 1} \backslash T_{x}=T 5$.

By repeating the loop, we may arrive at the initial task assignment table of Table 2.1 for SIMD processing states t1-t5. Note that the last column of the table specifies the type of SIMD task performed at state $t i$.

STEP 4: By apply the generalized backtracking algorithm to Table 2.1 to minimize the number of interprocessor data transfers at time $t i+\Delta$, we may arrive at the more efficient task assignment (STAT) of Table 2.2.


Fig. 2.2: A given WG

Table 2.1: Initial SIMD Task Assignment for WG of Fig. 2.1.

| Time | PE 1 | PE 2 | PE 3 | Task Type |
| :---: | :---: | :---: | :---: | :---: |
| t 1 | T 1 | T 3 | T 4 | ty 1 |
| t 2 | T 2 | T 6 | T 9 | ty 2 |
| t 3 | T 5 | T 7 | T 8 | $\mathrm{ty1}$ |
| t 4 | T 10 | T 12 | T 13 | $\mathrm{ty1}$ |
| t 5 | T 11 | T 14 | T 15 | ty 2 |

Table 2.2: Final STAT for WG of Fig. 2.1.

| Time | PE 1 | PE 2 | PE 3 | Task Type |
| :---: | :---: | :---: | :---: | :---: |
| t 1 | T 3 | T 1 | T 4 | ty1 |
| t 2 | T 9 | T 6 | T 2 | ty2 |
| t 3 | T 5 | T 7 | T 8 | ty1 |
| t 4 | T 12 | T 13 | T 10 | ty 1 |
| 5 | T 11 | T 14 | T 15 | TY 2 |

Table 2.3: DPAT for WG of Fig. 2.1

| Time | PE 1 | PE 2 | PE 3 |
| :---: | :---: | :---: | :---: |
| $t 1+\Delta 1$ | d 3 | d 2 | d 1 |
| $\mathrm{t} 2+\Delta 1$ |  | d 1 | $\mathrm{~d} 2, \mathrm{~d} 3$ |
| $\mathrm{t} 2+\Delta 2$ | d 2 |  |  |
| $\mathrm{t} 3+\Delta 1$ | d 3 | d 1 | d 2 |
| $\mathrm{t} 4+\Delta 1$ | d 3 | d 1 | d 2 |

Table 2.4: STAT for a Single Iteration of the N-E Dynamics Algorithm

| Time | PE 1 | PE 2 | PE 3 | PE 4 | PE 5 | PE 6 | PE 7 | PE 8 | Task Type |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| t1 | 1 | 2 | 4 | 5 | 6 | 7 | 9 | 10 | M |
| t 2 | 3 | 8 | 20 | 23 | 25 | 27 | 28 | I | A |
| t 3 | 21 | 22 | 24 | 26 | 29 | 30 | 37 | 38 | M |
| t 4 | 39 | 40 | 41 | 42 | 11 | 12 | 13 | 14 | M |
| t 5 | 34 | 35 | 36 | 46 | 47 | 48 | 31 | I | A |
| t 6 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | M |
| t 7 | 57 | 58 | 59 | 60 | 15 | 16 | 17 | 18 | M |
| t 8 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | A |
| t 9 | 19 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | M |
| t 10 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | M |
| t 11 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | M |
| t 12 | 105 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | M |
| t 13 | 32 | 33 | 121 | 122 | 123 | 124 | 125 | 126 | A |
| t 14 | 127 | 128 | 129 | 43 | 44 | 45 | 79 | 80 | A |
| t 15 | 81 | 118 | 119 | 120 | 139 | 140 | 141 | 142 | A |
| t 16 | 109 | 110 | 111 | 112 | 113 | 114 | 68 | 69 | M |
| t 17 | 106 | 107 | 108 | 143 | 144 | 151 | 152 | 153 | A |
| t 18 | 130 | 131 | 132 | 133 | 134 | 135 | 115 | 116 | M |
| t 19 | 136 | 137 | 138 | 145 | 146 | 147 | 157 | 158 | A |
| t 20 | 148 | 149 | 150 | 159 | 178 | 179 | 180 | I | A |
| t 21 | 160 | 161 | 162 | 172 | 173 | 174 | 175 | 176 | M |
| t 22 | 154 | 155 | 156 | 184 | 185 | 186 | I | I | A |
| t 23 | 177 | 163 | 164 | 187 | 188 | 189 | 190 | 191 | M |
| t 24 | 206 | 196 | 197 | 198 | 199 | 200 | 201 | 193 | A |
| t 25 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | M |
| t 26 | 192 | 165 | 166 | 213 | 167 | 168 | 169 | 170 | M |
| t 27 | 217 | 218 | 219 | 214 | 215 | 216 | 182 | 183 | A |
| t 28 | 220 | 221 | 222 | 165 | 166 | I | I | I | A |
| t 29 | 223 | 224 | 225 | 202 | 203 | 204 | I | I | A |
| t 30 | 226 | 227 | 228 | I | I | I | I | I | A |
| t 31 | 229 | I | I | I | I | I | I | I | A |
| t 32 | 230 | I | I | I | I | I | I | I | A |
|  |  |  |  |  |  |  |  |  |  |

STEP 5: Applying the loop at $\mathrm{t} 2+\Delta 1$ :
(a) $\mathrm{N}_{\mathrm{C} 2}=2, \mathrm{~N}_{\mathrm{C} 6}=1, \mathrm{~N}_{\mathrm{C} 9}=1$. Thus, task T 2 is thus selected for primary data broadcasting.
(b) Results of T2 is therefore sent to processing elements 2 and 3.
(c) Since $\mathrm{P}_{\mathrm{D} 6} \subset \mathrm{P}_{\mathrm{D} 2}$ and $\mathrm{P}_{\mathrm{D} 9} \subset \mathrm{P}_{\mathrm{D} 2}$, product of T 6 is thus also selected for sending its results to PE 1 at this time.
(d) $\quad \mathrm{T}_{\mathrm{x}}(\mathrm{t} 2+\Delta 1)=\mathrm{T}_{\mathrm{x}}(\mathrm{t} 2+\Delta 1) \backslash(\mathrm{T} 2 \cup \mathrm{~T} 6)=\mathrm{T} 9$

At $\mathrm{t} 2+\Delta 2$, result of T 9 is naturally routed to PE 2 . By repeating the specified loop, we may arrive at the DPAT in Table 2.3. The nomenclature $\mathrm{d} i$ in the table columns specify the destination processor memories $\mathrm{PEM}_{\mathrm{i}}$ to which respective source processor sends its result at time $t x+\Delta$.

When procedure ISSMACIN is applied to parallel process a single iteration of the N-E algorithm, the resulting STAT for this application is illustrated in Table 2.4. In this Table, the initials M and A in the "task type" column specifies a multiplication and addition SIMD operation respectively. Notice the high density of SIMD task processing when using the proposed scheduling algorithm to evaluate the 230 tasks in a single iteration of the $\mathrm{N}-\mathrm{E}$ algorithm. The idle processors prevalent from t 28 through t32 will not occur when the number of iterations is more than one (which is the case for any $n$-link manipulator), since tasks of successive iterations are computed in SIMD processing fashion at the indicated idle processors at specified time intervals. Accordingly, the DPAT generated by procedure ISSMACIN for a single iteration of the $\mathrm{N}-\mathrm{E}$ algorithm is shown in Table 2.5. This table verifies that an average of 3 interprocessor data transfers through the crossbar network are required at the end of every SIMD instruction processing cycle. From these two tables, it may be deduced that the worst-case execution time to compute the inverse dynamics problem of an n link manipulator on the proposed architecture using procedure ISSMACIN as primary design tool for SIMD processing is $\left(7 \mathrm{t}_{\mathrm{a}}+15 \mathrm{t}_{\mathrm{m}}+32 \mathrm{t}_{\mathrm{f}}+65 \mathrm{t}_{\mathrm{s}}\right) \mathrm{n}$, where, $\mathrm{t}_{\mathrm{a}}$ is the time to perform an add operation, $\mathrm{t}_{\mathrm{m}}$ is the time to compute a multiplication operation, $\mathrm{t}_{\mathrm{f}}$ is the time to fetch operands from local $\mathrm{PEM}_{\mathrm{i}}$, and $\mathrm{t}_{\mathrm{s}}$ is the time to send the result to the destination $\mathrm{PEM}_{\mathrm{i}}(\mathrm{s})$ through the crossbar network.

### 2.2.4 Simulation Results and Comparison to Previous work

In this section, we present simulation results for computing the inverse dynamics problem using the proposed SIMD architectural model and procedure ISSMACIN.

Table 2.5: DPAT for a Single Iteration of the N-E Algorithm

| Time | PE1 | PE2 | PE3 | PE4 | PE5 | PE6 | PE7 | PE8 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $t 1+\Delta 1$ | d3 |  | d4 |  | d5 |  | d8 | d7 |
| $t 1+\Delta 2$ |  | d3 |  | d4 |  | d5 |  |  |
| $t 2+\Delta 1$ | d1,d2,d3 | d4,d5 |  |  |  | d7 | d8 |  |
| t2+ ${ }^{2} 2$ |  |  | d1 | d2 | d3 |  |  |  |
| t3+ ${ }^{\text {d }}$ | d1 | d2 | d3 |  |  |  | d4 |  |
| t3+ 42 |  |  |  | d1 | d2 | d3 |  | d4 |
| $14+\Delta 1$ | d5 |  | d8 |  | d7 |  | d4 | d1 |
| t4 $+\Delta 2$ |  | d5 |  | d8 |  | d7 |  |  |
| t5 $+\Delta 1$ |  | d6 | d1-5, d7, d8 |  |  |  |  |  |
| t5- 52 | d2-5,8 | d1,d7 |  |  |  |  |  |  |
| t5+ 43 |  | d3,d4 |  | d1 | d2 |  |  |  |
| t5 $+\Delta 4$ |  |  |  |  |  | d3 | d4 |  |
| $t 8+\Delta 1$ | d4 |  | d5 |  | d8 |  | d7 |  |
| $t 8+\Delta 2$ |  | d4 |  | d5 |  | d8 |  | d7 |
| $t 7+\Delta 1$ | d8 |  | d7 |  | d1 | d5 | d2 |  |
| $17+\Delta 2$ |  | d8 |  | d7 |  |  |  | d2 |
| t8+ 41 | d1,d4-8 | d2 | d3 |  |  |  | $\therefore$ |  |
| 18+ 22 |  | d1,d4,d5,d7,d8 |  |  |  |  |  | d3 |
| $18+33$ |  |  | d4-8 | d3 | d8 | d1,d2 |  |  |
| t8 $+\Delta 4$ |  |  |  | d4 | d5 |  | d1,d2 | d6 |
| $19+\Delta 1$ | d6 | d2 |  | d3 |  | d4 |  | d1 |
| t9+ +2 |  |  | d2 |  | d3 |  | d4 |  |
| t10+ 51 | d3 | d4 |  | d5 | d5 | d 8 |  | d7 |
| t10+ 10 |  |  | d4 |  |  |  | d6 |  |
| $111+\Delta 1$ | d7 | d8 |  | dl |  | d2 |  | d3 |
| 111- -2 |  |  | d8 |  | d1 |  | d2 |  |
| t12+ $\triangle 1$ | d3 |  | d4 | d1 | d8 |  | d2 |  |
| $112+\Delta 2$ |  | d3 |  |  |  | d8 |  | d1 |
| $113+\Delta 1$ | d5 | d6 |  |  | d 7 | d8 | d4 | d3 |
| t13+ 42 |  |  | d5 | d6 |  |  |  |  |
| t14- -1 | d8 | d4 | d5 |  | d6 | d7 | d1 | d2 |
| +14-32 |  |  |  | d5 |  |  |  |  |

Table 2.5: DPAT for a Single Iteration of the N-E Algorithm (continued)

| Time | PE1 | PE2 | PE3 | PE4 | PE5 | PE8 | PE7 | PE8 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $t 15+\Delta 1$ | d3 | d6 | d6 | d8 |  |  |  |  |
| t15+ 22 |  |  |  |  | ds | d7 | d8 |  |
| $615+53$ |  |  |  |  |  |  |  | d7 |
| $t 16+\Delta 1$ | d1 |  |  | d2 |  |  | d3 |  |
| $t 16+52$ |  | d1 |  |  | d2 |  |  | d3 |
| t18+13 |  |  | d1 |  |  | d2 |  |  |
| $t 17+\Delta 1$ | d4,d5 | d1 | d2,d3 | d8 |  | d7 |  |  |
| t17+ 12 |  | d5 |  |  | d4 |  | d8 | d4 |
| t18+ -11 | d4 |  | d5 |  | dB |  | d3 |  |
| t18+ 12 |  | d4 |  | d5 |  | d8 | $\because$ | d3 |
| $t 19+\Delta 1$ | d1 | d2 | d3 |  |  |  | $\cdots$ |  |
| t19+42 |  |  |  | d1 | d2 | d3 |  |  |
| t19+-33 |  |  |  |  |  |  | d1 | d2 |
| $t 20+31$ | dl | d2 | d3 |  | d7,d8 | d4 | d5,dB |  |
| t20+ 2 |  |  |  | d3 |  | d1 |  |  |
| t21+ +1 | d4 | d5 | d 6 |  |  |  |  |  |
| t22+ 2 |  |  |  | d4 |  | d5 |  | d 6 |
| t22+ 23 |  |  |  |  | d4 |  | d5 |  |
| t23+ 51 | d3 | dl |  | d5 |  | d6 |  | d7 |
| t23+42 |  |  | d1 |  | d5 |  | d6 |  |
| 124-31 |  | d1,d4,d7 | d2,d5,d8 |  |  |  |  |  |
| 124+12 | d8 |  |  | d4 | d2 | d5 | d6 | d3 |
| $125+11$ | d1 |  | d2 $\quad$ |  | d3 |  | d4 |  |
| $\mathrm{t} 25+32$ |  | d1 |  | d2 |  | d3 |  | d4 |
| $t 28+-1$ | d 7 |  | d4 | d3 | d7 | d4 | d8 |  |
| $t 28+ \pm 2$ |  | d7 |  |  |  |  |  | d8 |
| $127+\Delta 1$ | d1 | d2 | d3 |  |  |  | d4 | d5 |
| t27- 22 |  |  |  |  |  | d3 |  |  |
| 128+ 21 | d1 | d2 | d3 | d5 | d 6 |  | - |  |
| $129+31$ | d1 | d2 | d3 |  |  |  |  |  |
| $t 30+11$ | d1 |  |  |  |  |  |  |  |
| $130+22$ |  | d1 |  |  |  |  |  |  |
| $431+ \pm 1$ | d1 |  |  |  |  |  |  |  |

Table 2.6: MC 88020 Operation Execution Times

| Function | Assembler Code | Operational Delay (in \# of Clock Cycles) | Operational <br> Delay <br> (in $\mu \mathrm{s}$ ) |
| :---: | :---: | :---: | :---: |
| Fetch Operands from Shared M or PEM | $\begin{aligned} & \text { MOVE }\langle E A>\text {,Dx } \\ & \text { MOVE }<E A>\text {,Dy } \end{aligned}$ | 14 | 0.84 |
| Store Result in Stared M or Destination PEM: | MOVE Dy, <EA> | 5 | 0.3 |
| Add | ADD Dx, Dy | 3 | 0.18 |
| Multiply | MUL Dx, Dy | 28 | 1.68 |


(b): Speedup versus Number of Processors

Fig. 2.3: Simulation Results for Robot Inverse Dynamics Computation on Proposed Architectural Model

The MC 68020 microprocessor is used as the basis of simulation, i.e. each $\mathrm{PE}_{\mathrm{i}}$ in Fig. 2.1 is represented by the specified processor type. Table 2.6 specifies particular instruction execution times of a 16.7 MHz MC 68020 processor relevant to the processing of the specified problem. This data is used to simulate the performance of the N -E dynamics algorithm on the proposed architecture, and also for the purpose of comparison with previous work.

Fig. 2.3(a) shows a plot of the execution time for computing the inverse dynamics problem of a six link robot versus the number of parallel processing elements when using the proposed architecture and scheduling algorithm. An average lower bound processing time of $513.84 \mu$ s is achieved for 7 to 9 PEs. Note that the time lower bound is not achieved at exactly six parallel processors as required by the ideal case (see Theorem 2.1) when a manipulator with six degrees of freedom is used. In fact, the actual lower bound cost is achieved for the case of nine PEs where the execution time is $100 \mu$ s lower than that of the ideal case at 6 parallel processors. This slight distortion from the ideal case is due to the irregular number of propagation delays when passing intermediate results through the interprocessor network. As seen in Fig. 2.3(b), speedup $>5.7$ over the uniprocessor solution is accomplished for $\mathrm{m} \geq 8$, where $m$ is the number of parallel processors. Also, Fig. 2.3(c) illustrates the variation of architectural efficiency with the number of parallel processors. The efficiency is defined as $S_{m} / \mathrm{m}$, where $S_{m}$ is the speedup for $m$ parallel processors. Intuitively, the efficiency simply gives a measure of how the achieved speedup compares with the ideal speedup. Thus, a high performance evaluation weight should not be given to this plot. From these figures, it is evident that lower bound solutions may be achieved by using $\mathrm{m}=8$, which is also the desired number of PEs if commercially available $8 \times 8$ crossbar switches are to be used in the architecture.

We shall now compare the performance of the proposed mutiprocessor architecture and scheduling algorithm for the computation of the inverse dynamics problem to that of Kasahara and Narita [19]. Note that the simulations for comparison purposes will use common MC" 68020 processing times of Table 2.6 to maintain fairness. The specified authors use the DFIHS (Depth-First-Implicit- HeuristicSearch) scheduling algorithm (see appendix E) to parallel process the N-E dynamics Equations on a multiprocessor system with a shared single bus interprocessor communication mechanism. Fig. 2.4(a) compares the execution time for the computation of the specified algorithm on the two architectural models when $m$ is varied. A performance improvement of more than 1.75 ms for $\mathrm{m} \geq 8$ is achieved for the case of the proposed system over that of Kasahara and Narita. In addition, Fig. 2.4(b) verifies that the proposed architecture performs at speedup factors greater than

(c): Efficiency versus Number of Processors

Fig. 2.3 (Continued)

(a): Execution Time versus Number of Processors

(b): Relative Speedup versus Number of Processors

Fig. 2.4: Performance Comparison of Proposed Architecture to that of [19] for Robot Inverse Dynamics Computation


Fig. 2.4 (Continued)
3.4 for $\mathrm{m}>8$ over the previous architectural model. Fig. 2.4(c) suggests a valid cause for the much lower performance of Kasahara and Narita's parallel processing environment. It is seen that interprocessor communication is a major bottleneck in their shared bus multiprocessor system. In fact, over $90 \%$ of the algorithm execution time in their architecture is spent by performing dialogue between PEs for $m \geq 8$. Latencies due to arbitration costs for access to the shared bus is thus a crucial performance degradation factor when multiple PEs attempt to perform interprocessor communication via mailboxes at the end of instruction processing cycles. Note the moderately lower percentage of total execution time that the proposed system spends for interprocessor communication (approximately $60 \%$ for $\mathrm{m} \geq 8$ ). The crossbar interprocessor communication network is naturally one of the most important reasons for the higher perforamnce of the proposed architecture. In addition to optimal scheduling of data through the crossbar network to destination PEs in minimal number of interprocessor transfer latencies, simultaneous operand fetches by the m parallel PEs from respective $\mathrm{PEM}_{\mathrm{i}}$ is performed to minimize overall algorithm execution time.

### 2.3 Parallel Processing of PUMA Forward and Inverse Kinematics

### 2.3.1 Multiprocessor Architectural Model

The non-linear, non-recursive and irregular form of PUMA forward and inverse kinematics Equations (see Appendix C and D respectively) suggest that these algorithms are not amenable for a SIMD parallel processing environment. In fact, if such a type of architectural model is used for this application, severe performance degradation will occur due to multiple idle/masked processors at each parallel execution stage. The masking of SIMD PEs occur since insufficient number of of identical tasks will be active at any processing level due to the non-recursive nature of the specified algorithms. An alternate architectural approach must therefore be investigated. A multiprocessor system with shared memory for interprocessor communication is studied for this application. This architectural model is illustrated in Fig. 2.5. The memory arbitration mechanism allows access to the mailboxes in shared memory for interprocessor communication on a first-come first-serve basis. The DFIHS (Depth-First-Implicit-Heuristic-Search) scheduling algorithm (see Appendix E) is used to create the static schedule for parallel processing of forward and inverse kinematics equations. This static schedule of parallel tasks is stored in the microcode $\mathrm{ROM}_{\mathrm{i}}$ at run time.


Fig. 2.5: Multiprocessor Architectural Model for PUMA Forward and Inverse Kinematics Computations

Table 2.7: MC 68881 Floating-Point Operation Execution Times

| Function | Assembler <br> Code | Operational <br> Delay (in $\mu \mathrm{s})$ |
| :---: | :---: | :---: |
| Fetch Single Operand <br> from Shared M | FMOVE <EA>,FPy | 2.28 |
| Fetch Pair of Operands <br> from Shared M | FMOVE <EA $>, F P x$ <br> FMOVE <EA $>, F P y ~$ | 4.55 |
| Store Result in Shared M | FMOVE FPy, <EA> | 2.28 |
| Add | FADD FPx,FPy | 3.05 |
| Subtract | FSUB FPx,FPy | 3.05 |
| Multiply | FMUL FPx,FPy | 4.25 |
| Divide | FDIV FPx,FPy | 6.17 |
| Square Root | FSQRT FPy | 6.41 |
| Arc Tangent | FATAN FPy | 24.1 |
| Sine | FSIN FPy | 23.4 |
| Cosine | FCOS FPy | 23.4 |


(a) Execution Time versus Number of Processors

(b) Speedup versus Number of Processors

Fig. 2.6: Simulation Results for Forward Kinematics

Performance degradation occurs in this architecture due the operational delay caused by multiple PEs waiting for access to shared resources. Note that overall execution speed may not be improved for computation of the specified problems by using an iterprocessor network with local $\mathrm{PEM}_{\mathrm{i}} \mathrm{s}$ since these types of networks are generally only efficient for SIMD machines.

### 2.3.2 Simulation Results

Simulation results for the computation of PUMA forward and inverse kinematics algorithms using the shared memory interprocessor communication strategy and DFIHS task scheduling algorithm (see Appendix E) is presented here. The MC 68020 with its MC 68881 floating point math coprocessor is used as the basis of performance evaluation. In other words, each $\mathrm{PE}_{\mathrm{i}}$ in Fig. 2.5 is represented by this dual processor combination. Table 2.7 specifies the execution time data of MC 68881 operations needed for the computation of the specified problems. These processing times are used to generate the simulation results presented hereafter.

The variation of execution time for the forward kinematics algorithm with the number of parallel processors $m$ is depicted in Fig. 2.6(a). Minimum execution time is achieved at approximately $392 \mu$ s for $\mathrm{m} \geq 8$. As seen from Fig. 2.6(b), speedup close to 2 is reached for $\mathrm{m} \geq 8$ over the uniprocessor solution. Also, Fig. 2.6(c) specifies how the architectural efficiency varies with the number of parallel processors. As a reminder, high importance should not be given to this plot since it simply shows how the achieved speedup compares to the ideal speedup. Finally, Fig. 2.6(d) shows the percentage of total execution time that the parallel processors spend for interprocessor communication to evaluate the specified algorithm. Approximately $85 \%$ of the overall execution time is spent performing interprocessor communication. This high rate is primarily due to arbitration costs for access to shared memory. Note again that performance will not be improved by using an interprocessor network with SIMD instruction processing, although the interprocessor communication latencies may be lower for this case.

Simulation results for parallel processing of the inverse kinematics algorithm will now be discussed. Fig. 2.7(a) verifies that minimum execution time of approximately $750 \mu \mathrm{~s}$ is achieved when five parallel processors are used. Further, speedup of 2.3 occurs for $m \geq 5$ over the uniprocessor solution as illustrated in Fig. 2.7(b). For consistency, the variation of efficiency with $m$ is delineated in Fig. 2.7(c). Finally, Fig. 2.7 (d) confirms that over $61 \%$ of total algorithm execution time is spent for interprocessor communication through shared memory. Notice that this last result is considerably lower than the percentage of time the forward kinematics algorithm

(c) Efficiency versus Number of Processors

(d): Percentage of Total Execution Time spent on Interprocessor Communication versus Number of Processors

Fig. 2.6 (Continued)

(a): Execution Time versus Number of Processors

(b): Speedup versus Number of Processors

Fig. 2.7: Simulation Results for Inverse Kinematics

(c): Efficiency versus Number of Processors

(d): Percentage of Total Execution Time spent on Interprocessor Communication versus Number of Processors

Fig. 2.7 (Continued)
spends for interprocessor communication.

### 2.4 Conclusions

The ideal lower bounds on the number of parallel processors and execution time to evaluate the inverse dynamics problem of an $n$-link manipulator in a parallel processing architecture using an optimal scheduling algorithm was derived to be $n$ and $o\left(\left[a_{1} n\right]\right)$ respectively, where $a_{1}$ is a specified constant. Next, a novel SIMD scheduling algorithm to perform parallel processing on a SIMD multiprocessor-based architectural model with a crossbar interprocessor network to compute the specified problem close to the ideal lower bound is presented. The performance of the N-E dynamics algorithm on the proposed architecture using the specified scheduling technique is next simulated with the MC 68020 representing each processing element. Speedup factor of greater than 3.4 over previous related work [19] for the computation of the defined problem on a parallel processing system is achieved. The SIMD architectural model is then not used again to simulate the performance of PUMA forward and inverse kinematics algorithms because of the non-recursive and nonlinear nature of these algorithms. A multiprocessor model with a shared memory interprocessor communication strategy is instead investigated for this purpose using the DFIHS scheduling algorithm. The MC 68020 with its MC 68881 math coprocessor is used as the basis of the next few simulations to determine the performance of the kinematics algorithms on the specified architecture. Results show speedup of approximately 2 over the uniprocessor solution when the number of parallel processors is greater than eight and five for the cases of forward and inverse kinematics algorithms respectively. Very large speedups is not achieved in this latter multiprocessor architectural model due to high interprocessor communication latency costs.

## CHAPTER 3 <br> A COST EFFICIENT BIT-SERIAL ARCHITECTURE FOR ROBOT INVERSE DYNAMICS COMPUTATION

### 3.1 Introduction

The parallel, pipelined, and recursive nature of the N-E dynamics algorithm suggests that it is amenable to a custom bit-serial array architecture. This chapter presents such an architecture to compute the desired algorithm within the bit-serial execution time lower bound of o( $\left[c_{1} k+c_{2} k n\right]$, where $c_{1}$ and $c_{2}$ are specified constants, k is the system word length, and n is the number of manipulator links. A multi-functional bit-serial cell is developed as a building block for the recursive array unit proposed. The cell's fault-tolerant external asynchronous communication protocol and its high-performance design using Zipper CMOS [10] circuit structures is of particular interest. This cell may easily be used as a "standard cell" to design the two array chips which realize the forward and backward recursions, respectively, of the N-E dynamics formulation. In addition to these two chips, a minimum number of external FIFO register files are required to implement the proposed system.

### 3.2 Choice of a Bit-Serial Architecture

The Newton-Euler dynamics formulation is a highly pipelined recursive process. A bit-serial architecture is therefore a natural candidate for its efficient computation in a closely pipelined cost-efficient manner. The single wire input and output of bitserial devices allows tight pipelining of operational cells, leading to efficient communication within and between chips. This is an important advantage since communication issues are a dominant factor in the computation of the inverse dynamics problem. In addition, when comparing $n$-bit word length bit-serial and bitparallel arithmetic operators, we can expect the bit-serial part to contain $1 / n$th of the hardware of the bit-parallel device [8]. Minimum chip count solutions are therefore
feasible in bit-serial systems without the problems of on-chip routing of data lines. In fact, the N-E dynamics algorithm may easily be computed in the proposed architecture with only two bit-serial chips in addition to a minimum number of external FIFO register files. Preliminary studies show the two chips require a combined die area of approximately $70 \mathrm{~mm}^{2}$ when fabricated in a $1.2 \mu$ CMOS technology, using the proposed bit-serial cell architecture as the standard building block. If our proposed architecture is implemented using a bit-parallel architecture, a total of 32 matrix processors capable of performing matrix-vector multiplications and vector cross-products operations would be required. In addition, a more complex control network is needed for the bit-parallel case. Furthermore, the parallel wiring interconnections between bit-parallel processors suggests that it is a cost-inefficient approach.

### 3.3 Time Lower Bound to Compute the Inverse Dynamics Problem on a BitSerial Architecture

In this section we will discuss the limitation of speeding up the inverse dynamics computational problem when running on a bit-serial system. Before deriving the time lower bound, the following notation and lemma must be established.

## Notation:

(1) $E<1>$ denotes a linear arithmetic expression $E$ of 1 distinct atoms, where an atom is a constant or a variable, e.g. $\mathrm{E}<4>=\mathrm{a}+\mathrm{b}-\mathrm{c} / \mathrm{d}$.
(2) $\mathrm{T}_{\mathrm{k}}=$ Minimum time to evaluate an expression $E<l>$ in a non-recursive bit-serial system.
(3) $\mathrm{T}_{1}=$ Shortest time to compute the first iteration of the $\mathrm{N}-\mathrm{E}$ algorithm in a bitserial system.
(4) $\mathrm{T}_{\mathrm{n}-1}=$ Shortest time to evaluate each additional iteration when computing the remaining $\mathrm{n}-1$ iterations of the $\mathrm{N}-\mathrm{E}$ algorithm in a recursive bit-serial system.
(5) $\mathrm{T}_{\mathrm{n}}=$ Minimum time to compute the inverse dynamics problem in a recursive bit-serial system.

Lemma 1: The time lower bound of $\mathrm{T}_{\mathrm{k}}[\mathrm{E}<1>$ ] [8]. The shortest bit-serial processing time to evaluate an arithmetic expression $\mathrm{E}<\mathrm{l}>$ is bounded below by and equal to $o(\lfloor k\rfloor)$, in other words,

$$
\mathrm{T}_{\mathbf{k}}[\mathrm{E}<l>] \geq \mathrm{o}(\lfloor\mathrm{k}])
$$

Theorem 1: The minimum time to compute the inverse dynamics problem of an $n$ link manipulator in a recursive bit-serial architecture is bounded below by $o\left(c_{1}\lfloor k\rfloor+c_{2}\lfloor k n\rfloor\right)$, where $k$ is the system word length, $c_{1}$ and $c_{2}$ are specified constants.

Proof: Addition, subtraction and multiplication are the primitive arithmetic operations prevalent in any iteration of the inverse dynamics problem. By beginning a subsequent operation before finishing the previous one, a bitserial arithmetic cell requires k clock cycles to evaluate any one of the three specified operations. The minimum time to compute the first iteration of the N-E algorithm may therefore be expressed as a function containing at least $3 k$ atoms. Using Lemma 1 , we thus have,

$$
\begin{equation*}
\mathrm{T}_{1}[\mathrm{E}<3 \mathrm{k}>] \geq \mathrm{O}(\lfloor 3 \mathrm{k}\rfloor) . \tag{1}
\end{equation*}
$$

Let $b_{1} k$ (where $b_{1}$ is a constant) be an uniform clock period count needed to evaluate each additional iteration of the N-E algorithm in a recursive bitserial system. To better conceptualize, $b_{1} k$ clock periods is the constant additional time required to compute those operations in the ith iteration which are not evaluated in parallel to the ( $\mathrm{i}-1$ )th computational cycle. Therefore, the additional time to calculate the remaining ( $n-1$ ) iterations may be expressed as follows:

$$
T_{n-1}=b_{1} k(n-1)
$$

Clearly, the minimum time to determine $T_{n-1}$ is when $b_{1}$ is one. Thus, the lower bound of $T_{n-1}$ is defined as,

$$
\begin{equation*}
T_{n-1}[<n-1>] \geq o(\lfloor k(n-1)\rfloor) . \tag{2}
\end{equation*}
$$

By combining equations (1) and (2), we obtain

$$
T_{n}[E<n>] \geq o(\lfloor 2 k+k n\rfloor) .
$$

The inverse dynamics problem may be considered as computing a set of arithmetic operations which result in obtaining the joint torques. Each joint torque $\tau_{i}$ of joint $i$ can be expressed as an arithmetic operation containing at
least $3 n$ atoms: $n$ joint positions $\left\{q_{i}\right\}_{i=1}^{n}, n$ joint velocities $\left\{\dot{q}_{i}\right\}_{i=1}^{n}$, and $n$ joint accelerations $\left\{\ddot{q}_{i}\right\}_{i=1}^{n}$, implying,

$$
\begin{equation*}
T_{n}[E<3 n>] \geq o(\lfloor 2 k+3 k n\rfloor) . \tag{3}
\end{equation*}
$$

Rewriting equation (3) without changing the order of the lower bound to evaluate a set of $n$ torques,

$$
\mathrm{T}_{\mathrm{n}}\left[\tau_{1}, \tau_{2}, \ldots, \tau_{n}\right] \geq o\left(\left[c_{1} \mathrm{k}+\mathrm{c}_{2} \mathrm{kn}\right]\right) .
$$

where $c_{1}$ and $c_{2}$ are specified constants. Q.E.D.

By establishing a time lower bound in Theorem 1, we can now compare and contrast bit-serial computational structures by simply comparing the coefficients $\mathrm{c}_{1}$ and $c_{2}$. It should be noted here that such comparisons will only be valid if the competing systems use the same word length and manipulator link count.

### 3.4 Bit-Serial System Architecture

The high-level view of the bit-serial system architecture is illustrated in Fig. 3.1. Basically, the core of the system consists of two bit-serial array processors: Processor 1 computes the forward recursions and Processor 2 evaluates the backward iterations of the N-E algorithm.

Operand feeding of the bit-serial arrays is an asynchronous mechanism. Initially, the host processor loads the rotation ( ${ }_{i}^{i+1} \mathrm{R}$ ) and inertia ( ${ }^{\mathrm{i}} \mathrm{I}_{\mathrm{i}}$ ) matrices, relative position ( ${ }^{i}{ }_{i+1}$ ), joint velocity $(\theta)$ and acceleration $(\theta)$ vectors, and the scalar link masses ( $\mathrm{m}_{\mathrm{i}}$ ) into the PISO FIFO (Parallel-In-Serial-Out First-In-First-Out) register blocks in the order shown. The asynchronous handshaking protocol used by the bit-serial cells to communicate with external devices allows computations to be initiated whenever both operands of any bit-serial cell are available. The philosophy employed is similar to that of data flow computer architectures. This eliminates the "initial delay time" (time required to set up all the operands) which degrades the performance of many custom algorithmic processors.

Whenever the host processor loads any operand(s) into the FIFO, it also informs the respective bit-serial cell whose operands are currently the load has been performed. The notified cell subsequently provides the clock pulses to its operand FIFOs to load its input data synchronously in a bit-serial fashion. Pulses delivered by individual cells for operand loading enhances fault tolerance by eliminating any


Fig. 3.1: Bit-Serial System Architecture
possibility of misalignment of multi-operand data bits.
Array processor 1 computes the angular velocity ( ${ }^{\mathrm{i}} \omega_{\mathrm{i}}$ ) and acceleration ( ${ }^{\mathrm{i}} \omega_{\mathrm{i}}$ ), and the linear acceleration ( ${ }_{\mathbf{i}}^{\mathbf{v}} \dot{\mathbf{j}}_{\mathbf{j}}$ ) vectors which are fed back multiplexed with their initialization vectors as inputs to cells earlier in the processor's pipeline. Outputs of the processor are total external forces ( ${ }^{\mathrm{i}} \mathrm{F}_{\mathrm{i}}$ ) and moments ( ${ }^{\mathrm{i}} \mathrm{N}_{\mathrm{i}}$ ) exerted at the center of mass of link i. These computed results are then loaded into array processor 2 through the SISO LIFO (Serial-In-Serial-Out Last-In-First-Out) register files. A LIFO structure is used since the total forces and moments computed in processor 1's last iteration are needed as operands of processor 2 's first iteration.

Processor 2 initiates execution at the end of processor 1's nth iteration, where $n$ is the number of manipulator links. Operands are loaded into this processor in a manner identical to Processor 1. Intermediate results of computations in the processor include forces ( $\mathrm{i}_{\mathrm{i}}$ ) and moments ( $\mathrm{i}_{\mathrm{i}}$ ) exerted on link i by link ( $\mathrm{i}-1$ ) with respect to the base frame. These results are subsequently fed back multiplexed into the array with their initialization vectors. A complete set of joint torques and forces $\left(\tau_{\mathrm{i}}\right)$ are obtained at the end of the nth backward iteration and loaded into an SIPO FIFO (Serial-In-Parallel-Out First-In-First-Out) register file as system outputs.

### 3.5 Bit-Serial Array Processor Organization and Performance

At this stage in the paper, we shall assume that an individual bit-serial cell is capable of computing the 3-D vector and matrix arithmetic operations depicted in Appendix B. Details on how the cell implements these functions is discussed in Section 3.7.

Fig. 3.2 illustrates the parallel and pipelined organization of bit-serial cells in Processor 1 to evaluate the forward iterations of the N-E algorithm. The initials within each node denote the type of operation that the particular cell is required to perform. Also, the number next to each cell serves as a means of individual cell identification. Interconnections between cells represent "single-wire" bit-serial data paths. The cells are organized in eight stages of a pipeline with some intermediate results being fed back as inputs to cells earlier in the pipeline in a recursive fashion.

Table 3.1 summarizes the computation times of the various matrix and vector arithmetic functions which may be evaluated by any individual bit-serial cell. The operational delay data from this table is used to construct the pipeline schedule of tasks to compute the first three iterations of the N-E dynamics algorithm in Processor 1 (Table 3.2). The unit on the time scale of Table 3.2 is k clock periods, where k is


Fig. 3.2: Pipelined Organization of Bit-Serial Cells in Processor 1


Fig. 3.3: Pipelined Organization of Bit-Serial Cells in Processor 2

Table 3.1: Bit-Serial Cell Performance Measures

| Operation | No. of Cells Required | Individual Cell Ütilization | Operation Delay (In $\neq$ of Clock Cycles) | Operation Delay Using 16-Bit Wordlength and Clock Frequency of $20-\mathrm{MHz}^{2}$ |
| :---: | :---: | :---: | :---: | :---: |
| Vector-Add | 1 | 3 Adders | k | 800 |
| Scalar-Vector Product | 1 | 3 Multipliers | $!$ | 800 ns |
| Inner Product | 1 | 3 Multipliers <br> 2 Adders | 2k | 1.6 \% |
| Matrix-Vector Product | 3 | 3 Multipliers <br> 2 Adders | 2k | $1.6 \mu \mathrm{~s}$ |
| Vector <br> Cross-Product | 3 | 3 Multipiiers/ <br> 3 Subtractors | 2x | $1.6 \mu \mathrm{~s}$ |

the system word length. Also, the numbers in the iteration columns indicate which bit-serial cell of Fig. 3.2 is executing a particular operation during the specified time interval. Notice that bit-serial cells begin execution of tasks in successive iterations as soon as they complete their designated operations in the previous iteration, provided process precedence relations still hold true. Thus, the parallelism between successive iterations allows high cell utilization. This higher performance operational feature is supported in a fault-tolerant manner by the asynchronous handshaking protocol (described in Section 3.7) used by the bit-serial cells for external communication. A total of 11 k clock periods are needed to compute the first forward iteration. The massive amount of parallelism between tasks of successive iterations allow each additional iteration to be processed in only 4 k clock cycles.

Similarly, Fig. 3.3 illustrates the organization of bit-serial cells in array processor 2 to compute the backward iterations. Accordingly, Table 3.3 depicts its pipeline schedule to process three backward iterations. The first backward iteration requires 8 k clock periods, whereas each additional iteration needs 2 k clock cycles for completion.

The total time to compute the inverse dynamics problem on the proposed architecture may, therefore, be expressed as [ $19 \mathrm{k}+6 \mathrm{kn}$ ] clock periods, where $k$ is the system word length and n is the number of manipulator links. Thus, for comparison purposes, the coefficients $c_{1}$ and $c_{2}$ (defined earlier in Section 3.4) are 19 and 6 , respectively, to compute the inverse dynamics problem in our particular recursive bitserial system architecture. The total time to compute the N-E dynamics algorithm is, therefore, $44 \mu$ s when a 16 -bit system word length, 20 MHz operating frequency, and a 6-link manipulator is used.

### 3.6 Multi-Functional Bit-Serial Leaf Cell Architecture and Operation

In this section, we shall describe the architecture and operation of a multifunctional bit-serial cell capable of performing the various 3-D matrix/vector arithmetic operations depicted in Appendix B. A block diagram of this cell is shown in Fig. 3.4. The 3-bit mode register determines the type of vector operation that the particular cell is required to perform. This register is set by direct hardware control. A PLA-based control unit is used for the control of the external asynchronous communication mechanism and internal operation mode. The counter provides load pulses to operand cells to obtain the cell's input data in a synchronous fashion. Shift registers, "previous" and "new", store up to two intermediate results temporarily when the cell to which these results are to be sent is not ready to receive new operands. The three voters determine bit-serial data routing paths to support alternate arithmetic

Table 3.2: Parallelism in the Execution of Tasks for Three Forward Iterations

| Time (In <br> Number oi <br> Clocik Periocs) | Iteratioz $\# 1$ | Iteratiom \#2 | Iteration \#3 |
| :---: | :---: | :---: | :---: |


| $k$ | $1 \begin{array}{llll}1 & 2 & 3 & 4\end{array}$ |  |  |
| :---: | :---: | :---: | :---: |
| 2 k | 243 |  |  |
| 3 k | 58 | 2 |  |
| 4k | $\begin{array}{llll}7 & 8 & 9 & 10\end{array}$ | 12 |  |
| 5ik | $\begin{array}{llll}7 & 8 & 9 & 10\end{array}$ | 5 | 2 |
| 3 k | $\begin{array}{lllllll}11 & 12 & 13 & 14 & 15 & 18\end{array}$ | 35 | 2 |
| 7 k | $\begin{array}{lllllll}11 & 12 & 13 & 14 & 15 & 18\end{array}$ | $6 \quad 7 \quad 8 \quad 9$ | 1 |
| 8k | $\begin{array}{llll}17 & 18 & 19\end{array}$ | $\begin{array}{lllll}7 & 8 & 9 & 10\end{array}$ | 5 |
| 9 k | 20 | $\begin{array}{llllll}10 & 11 & 12 & 13\end{array}$ | 5 |
| 10k | 21 | $\begin{array}{llllll}+11 & 12 \quad 13 \quad 14 \quad 15 \quad 16\end{array}$ | $3 \quad 7 \quad 8 \quad 9$ |
| 11k | 22 | 4141516 | 6.789 |
| 12k |  | 171818 | $\begin{array}{lllll}10 & 11 & 12 & 13\end{array}$ |
| 13k |  | 20 | $\begin{array}{lllll}10 & 11 & 12 & 13\end{array}$ |
| 14 k |  | 21 | $\pm 141516$ |
| 15k |  | 22 | 4141518 |
| 18k |  |  | $17 \quad 18 \quad 19$ |
| 17 k |  |  | 20 |
| 18k |  |  | 21 |
| 19k |  |  | 22 |

Table 3.3: Parallelism in the Execution of Tasks for Three Backward Iterations

operating modes. The function of the voters is illustrated in Fig. 3.7(a)-(d). It should be noted here that a matrix-vector multiplication task (not shown in Fig. 3.7) is realized by simply combining three inner product cells. Individual cell utilization and operation delay within the cells to evaluate the matrix/vector operations is summarized in Table 3.1.

Fig. 3.5 shows the control and data line interconnections between neighboring bit-serial cells. Asynchronous handshake control signals, RTRIN (Ready-To-Receive-IN), RTROUT (Ready-To-Receive-OUT), RTSIN (Ready-To-Send-IN), RTSOUT (Ready-To-Send-OUT), LOADCLKIN (Load-Clock-IN) and LOADCLKOUT (Load-Clock-OUT) use an inter-cell communication protocol to achieve higher performance and fault-tolerance in the system. The flow chart illustrated in Fig. 3.6 provides details of this asynchronous communication protocol.

When a cell is ready to receive new operands, it indicates this by asserting the RTROUT signal, and if the preceding cells are ready to send the operands, they activate the RTSIN lines. In response, the current cell generates LOADCLKOUT pulses to load its operands in a bit-serial fashion. This particular feature allows synchronized loading of multi-operand data bits. Near the end of an operation, the succeeding cell may have the RTRIN signal asserted, which indicates it is ready to receive its operands. In that case, the current cell activates its RTSOUT line, thus indicating that it is ready to send the operands. Subsequently, the next cell provides the LOADCLKIN pulses to load its input data bit-serially. However, if the succeeding cell has the RTRIN signal negated at the end of any operation, the result will be loaded into temporary "previous/new" shift registers. If both the "previous" and "new" shift registers are full at any point in time, the given cell is not allowed to load any further operands for execution until one of these registers transfers its contents into the succeeding cell.

### 3.7 Implementation of Bit-Serial Cell In CMOS Technology

Our design strategy utilizes Zipper CMOS [28] circuit structures. It is basically an improved version of Domino [20] and NORA [14] dynamic CMOS circuits, which are characteristically immune to the problems of instability and charge-sharing prevalent in the latter two techniques. Also, area utilization is considerably better and Zipper CMOS circuits can operate at least two times faster than static CMOS circuits. In addition, its high density allows it to be clocked at very high frequencies. It is thus very advantageous to apply this type of circuit design methodology to bit-serial signal processing.


Fig. 3.4: Architecture of a Multi-Functional Bit-Serial Leaf Cell


Fig. 3.5: Interconnections Between Neighbouring Bit-Serial Cells


Fig. 3.6: Flowchart of Asynchronous Communication Protocol


Fig. 3.7: Data Routing Paths for Alternate Vector Arithmetic Operating Modes

(d): Vector Cross Multiply

Fig. 3.7 (Continued)

The basic structure of a Zipper CMOS circuit uses a single Zipper driver circuit which generates four strobe signals to drive subsequent N-P CMOS logic blocks. During precharge, the output of every N block is high and that of every P block is low. This ensures that the transistors of each dynamic block are off. During evaluation, the output of each N stage can undergo only one transition, from high to low, and the output of each $P$ stage can undergo only one transition, from low to high. Signals therefore propagate down each stage of the circuit in a "staggered" fashion.

In the Zipper driver circuit of Fig. 3.8, strobe signals ST and ST simply act as a two phase clock to drive the logic stages. Signals $\mathrm{ST}^{1}$ and $\overline{\mathrm{ST}}^{1}$ which feed the N block's precharge pullup and P block's precharge pulldown, respectively, produce a residual voltage which keep the precharge transistors slightly on to overcome leakage currents. It is therefore clear that this driver plays a major role in insuring stability and minimizing charge-sharing effects. It should be noted that only one driver circuit is needed for each bit-serial multiplier and adder/subtractor unit used in the cell.

Fig. 3.9 illustrates our implementation of the bit-serial full-adder/subtractor using Zipper CMOS circuits and dynamic latches. It is designed as a reduced majority function of its inputs. Simulation results have shown the propagation delay of this module to be 37 ns . The bit serial adder is also the most significant circuit in the bitserial multiplier. This suggests that an individual cell may operate at frequencies of up to 25 MHz .

The design of a single module of Lyon's 2's complement bit-serial pipelined multiplier in Zipper CMOS is shown in Fig. 3.10. The module is implemented using a Zipper CMOS full-adder, pass-transistors, and dynamic registers. For a k-bit system word length, a bit-serial multiplier consists of $k$ such modules, with serial interconnections of the data words ( $\mathrm{A}_{\mathrm{i}}$ and $\mathrm{B}_{\mathrm{i}}$ ), the partial product sum ( $\mathrm{PPS}_{\mathrm{i}}$ ), and the control signal $\mathrm{R}_{\mathrm{i}}$. This latter control signal is used to truncate the low order bit of the partial product sum and to clear the carry flip-flop after each multiplication. The last module of such a multiplier employs a full subtractor instead of a full-adder, as is required by the two's complement algorithm. The bit-serial multiplier takes k clock periods to form a k -bit product of two k -bit operands, by beginning a subsequent operation before finishing the previous one.

### 3.8 Conclusions

We have described a cost-efficient parallel and pipelined bit-serial system for the inverse dynamics computational problem to achieve the bit-serial execution time


Fig. 3.8: Zipper CMOS Driver Circuit


Fig. 3.9: Zipper CMOS Full Adder/Subtractor


Fig. 3.10: Zipper CMOS Bit-Serial Multiplier Module
lower bound of o( $\left.\left[c_{1} k+c_{2} k n\right]\right)$. A high performance multi-functional bit-serial cell architecture is described as a building block for the array configurations of the system. Zipper CMOS circuit design strategy is proposed for the cell's implementation to minimize propagation delays and maximize operating frequencies. For the case of a 16 -bit system word length and 20 MHz operating frequency, the total time to compute the inverse dynamics problem of a 6 -link manipulator on the proposed bit-serial architecture is only $44 \mu$ s.

## CHAPTER 4 SYSTOLIC ARCHITECTURE FOR ROBOT INVERSE DYNAMICS COMPUTATION

### 4.1 Introduction

A novel systolic architecture to compute the inverse dynamics computational problem of an $n$-link manipulator within the systolic execution time lower bound of $o\left[d_{1} n\right]$, where $d_{1}$ is a specified constant, is proposed. A modified form of the systolic design methodology of Moldovan and Fortes [40] is used as the primary design tool for the architecture. Modifications were required to provide allowances for unequal systolic execution times of processing cells, and to eliminate broadcasting of any input or generated variable within the systolic array. The inverse dynamics problem is decomposed into a set of directional and linear recurrence tasks amenable for direct mapping onto a fixed systolic system. This task set is composed of seven "basic" types of tasks. Thus, a basic set of systolic processors onto which these basic types of tasks may be mapped is discussed. Next, the specified systolic processor set is used as building blocks for realization of the overall systolic system. Due to alternate paths and operational delays of different systolic processors, operands may appear at mutiinput modules at unequal arrival times, causing a longer pipeline time. Delay buffers must therefore be inserted to balance the pipeline. The optimal buffer assignment problem is reduced to an integer linear optimization problem. The resulting systolic architecture is thus realized in a maximally pipelined manner with a minimum number of delay buffers inserted along various computational paths.

### 4.2 Systolic Array Design Methodology

Various systolic array design procedures [22,23,31,40,48,59] are described in current literature. The systolic design algorithm of Moldovan and Fortes [40] is found to be the efficient since it has been proven that their design methodology produces solutions which have performance features close to those of dataflow machines. A modified
form of their systolic array design methodology will thus be used as the primary design tool for realization of systolic processors to compute the inverse dynamics problem. As mentioned earlier, some modifications to their methodology are essential to develop our particular systolic system for allowance of different computational delays of alternate systolic processors, and to eliminate broadcasting of any input or generated variable. These ideas will become more apparent by the end of section 4.6. The modified design methodology is presented below.

## Procedure: SADLRE (Systolic Array Design of Linear Recurrence Equations)

Input: Algorithm $A$ over an algebraic structure $S$ as a 5 tuple $A=\left(\mathrm{J}^{\mathrm{n}}, \mathrm{C}, \mathrm{D}, \mathrm{X}, \mathrm{Y}\right)$ where,
$\mathrm{J}^{\mathrm{n}}$ is a finite index set of $\mathrm{A}, \mathrm{J}^{\mathrm{n}} \subset \mathrm{I}^{\mathrm{n}}$;
C is a set of computations indexed by $\mathrm{J}^{\mathrm{n}}$;
D is a set of data dependencies;
X is a set of input variables;
Y is a set of generated variables.
Output: New dependence matrix $\Delta$ of the developed algorithm and set of tables T = \{GVT, UVT, IT \}, where,
GVT is a Generated Variable Table which specifies the time and cell of execution of $y_{i} \in Y$ at index point $\bar{j}$,

UVT is a Used Variable Table that specifies the times and cell location from which $y_{i} \in Y$ are available for use at index point $\bar{j}$;

IT is an Input Table that specifies the times and cell locations from which $\mathrm{x}_{\mathrm{i}} \in \mathrm{X}$ are available for use at index point $\overline{\mathrm{j}}$.

STEP 1: Determine a set of generated variables $\left\{y_{B_{i}}\right\}$ with linear indexing matrices $\mathrm{I}_{\mathrm{Bi}}$ that are susceptible for broadcasting. Variable $\mathrm{y}_{\mathrm{Bi}}$ is broadcasted if,

$$
\operatorname{rank}\left(\mathrm{I}_{\mathrm{Bi}}\right)<\mathrm{n}
$$

where n is the dimensionality of the algorithm index set $\mathrm{J}^{\mathrm{n}}$, and $\mathrm{i} \in \mathrm{I}^{\mathrm{n}}$.
STEP 2: When $\left\{\mathrm{y}_{\mathrm{Bi}}\right\}$ has more than one variable, determine a set of indices $\left\{\mathrm{j}_{\mathrm{Bi}}\right\} \in \mathrm{J}^{\mathrm{n}}$ corresponding to linearly dependent columns in $\mathrm{I}_{\mathrm{Bi}}$.

STEP 3. Select a transformation $\Pi$ that will provide a valid linear time scheduler, $\Pi \overline{\mathrm{d}}_{\mathrm{i}}, \overline{\mathrm{d}}_{\mathrm{i}} \in \mathrm{D}$. Heuristically, $\Pi$ may be found by following these steps:
(3.1) When $\left\{y_{B i}\right\}$ has more than one variable, then $\Pi_{B i} \bar{d}_{B i}$ must be of the form,

$$
\Pi \overline{\mathrm{d}}_{\mathrm{Bi}}=\mathrm{C}_{\mathrm{i}}+\sum_{\mathrm{j} \in \mathrm{j}_{\mathrm{Bi}}} \mathrm{j}_{\mathrm{i}} \mathrm{C}_{\mathrm{ji}}
$$

and the following conditions must be satisfied,

$$
\begin{align*}
& C_{i} \geq \sum_{\mathrm{ji}_{\mathrm{i}} \in \mathrm{~J} / /_{\mathrm{Bi}}} \mathfrak{t}_{\mathrm{ji}} * S_{\mathrm{ji}} l \mathrm{~d}_{\mathrm{B}}^{\mathrm{C}}+\sum_{\mathrm{j}_{\mathrm{i}} \in \mathrm{j}_{\mathrm{Bi}}} \mathrm{t}_{\mathrm{ji}} * \mathrm{~S}_{\mathrm{j}} \int \mathrm{~d}_{\mathrm{Bi}}^{\mathrm{C}}  \tag{4.1}\\
& \mathrm{C}_{\mathrm{ji}} \geq \mathrm{t}_{\mathrm{ji}} * \mathrm{~S}_{\mathrm{ji}} \mid \mathrm{d}_{\mathrm{Bi}}^{\mathrm{V}} \quad \text { where } \mathrm{j}_{\mathrm{i}} \in \mathrm{j}_{\mathrm{Bi}} \tag{4.2}
\end{align*}
$$

where, $\mathrm{d}_{\mathrm{Bi}} \in \mathrm{D}$ provides data dependence vectors of variables $\mathrm{y}_{\mathrm{Bi}}, \mathrm{C}_{\mathrm{i}}$ are the constant parts of $\Pi_{\mathrm{Bi}} \overline{\mathrm{d}}_{\mathrm{Bi}}, \mathrm{C}_{\mathrm{ji}}$ are the coefficients of variables $\mathrm{j}_{\mathrm{i}} \in \mathrm{j}_{\mathrm{Bi}}$ in $\Pi_{\mathrm{Bi}} \overline{\mathrm{d}}_{\mathrm{Bi}}$, and $\mathrm{t}_{\mathrm{ji}}$ is the maximum time needed for the variable $y_{B i}$ to travel a single unit distance in the direction of $j_{i}$. This time variable may easily be deduced from the original recurrence equation. It is equal to the interprocessor communication time, if the output $y_{i}$ in the original recurrence equation is constant for any $j_{i}$. On the other hand, $\mathrm{t}_{\mathrm{ji}}$ is the sum of the interprocessor communication time and cell computational delay in the case of a varying output $y_{i}$ when index $\mathrm{j}_{\mathrm{i}}$ is varied. $\mathrm{S}_{\mathrm{ji}}$ represents a row of the transformation matrix $S$ (see STEP 5) corresponding to the transformation of the $\mathrm{j}_{\mathrm{Bi}}$ th index in $\mathrm{d}_{\mathrm{Bi}} ; \mathrm{d}_{\mathrm{Bi}}^{\mathrm{C}} \in \overline{\mathrm{d}}_{\mathrm{i}}$ is the constant part of dependence vector $d_{B i} ; d_{B i}^{V} \in \bar{d}_{i}$ is the coefficient of the $j_{B i}$ th index in $d_{B i}$.
(3.2) Time schedulers of input variables, $x_{i} \in X$, must satisfy the condition,

$$
\Pi_{\mathrm{xi}} \mathrm{~d}_{\mathrm{xi}} \geq \mathrm{t}_{\mathrm{C}}
$$

where $\Pi_{\mathrm{xi}} \in \Pi$ transform columns $\overline{\mathrm{d}}_{\mathrm{xi}} \in \mathrm{D}$ that provide data dependence vectors for input variables $\mathrm{x}_{\mathrm{i}}$, and $\mathrm{t}_{\mathrm{C}}$ is the interprocessor communication time.
(3.3) Time schedulers of generated variables, $y_{i} \in Y$, must meet the condition,

$$
\Pi_{\mathrm{yi}} \overline{\mathrm{a}}_{\mathrm{yi}} \geq\left(\mathrm{t}_{\mathrm{f}}+\mathrm{t}_{\mathrm{C}}\right)
$$

where $\Pi_{\mathrm{yi}} \in \Pi$ transform columns $\overline{\mathrm{d}}_{\mathrm{yi}} \in \mathrm{D}$ that provide data dependence vectors for variables $y_{i}$, and $t_{f}$ is the computational delay within a cell.
(3.4) The final $\Pi$ must be chosen so as to minimize the algorithm execution time,

$$
\begin{equation*}
\mathrm{t}=\left\lceil\frac{\max \Pi\left(\overline{\mathrm{j}}_{1}-\overline{\mathrm{j}}_{2}\right)+\mathrm{t}_{\mathrm{f}}}{\min \Pi \overline{\mathrm{~d}}_{\mathrm{i}}}\right\rceil \tag{4.3}
\end{equation*}
$$

for any $\overline{\mathrm{j}}_{1}, \overline{\mathrm{j}}_{2} \in \mathrm{~J}^{\mathrm{n}}$, and $\overline{\mathrm{d}}_{\mathrm{i}} \in \mathrm{D}$.
STEP 4: Select the columns of the matrix of interconnection primitives, $P$, as

$$
\mathrm{P}_{\mathrm{i}}=\mathrm{d}_{\mathrm{i}}^{\prime} \mathrm{d}_{\mathrm{Bi}}^{\prime}
$$

where $P_{i} \in P$, and $d_{i}^{\prime}, d_{B i}^{\prime} \in D^{\prime} ; D^{\prime}$ corresponds to the last one and two rows of dependence matrix D for one- and two-dimensional systolic arrays respectively; $d_{i}$ corresponds to columns of $D^{\prime} ; d_{B i}^{\prime}$ corresponds to columns of $D^{\prime}$ that provide $n-1$ elements of dependence vectors for variable $y_{B i}$

STEP 5: Select a second transformation $S$ of size $(n-1) \times n$, where $n$ is the dimension of the algorithm index set $\mathrm{J}^{\mathrm{n}}$, such that:
(5.1) When $\left\{y_{B i}\right\}$ has more than one variable, elements of $S$ must satisfy the condition,

$$
S_{j B i} \bar{d}_{B i}=C_{i 1}+j_{B i} C_{i 2}
$$

where $C_{i 1} \in Z^{n}$ and $C_{i 2} \in I^{n}, C_{i 2} \neq 0$ and $S_{j B i}$ is a row of $S$ which determines the spatial interconnect of variable $\mathrm{y}_{\mathrm{Bi}}$ in the direction of $\mathrm{j}_{\mathrm{Bi}}$.
(5.2) Diophantine equation $S D=P K$ may be solved for $S$, where matrix $K$ which indicates the utilization of primitive interconnections in matrix P must satisfy the conditions of $\mathrm{K}_{\mathrm{ji}} \geq 0$, and $\sum_{\mathrm{j}} \mathrm{K}_{\mathrm{ji}} \leq \Pi \bar{d}_{\mathrm{i}}$.
(5.3) Matrix transformation $T=\left[\begin{array}{l}\Pi \\ S\end{array}\right]$ is non-singular.

For the chosen transformation, $T$, the partitioning hyperplanes $\Pi_{P i}$ are given by the rows of matrix $S$, i.e.,

$$
S=\left[\begin{array}{c}
\Pi_{P 1} \\
\Pi_{P 2} \\
\vdots \\
\Pi_{P(n-1)}
\end{array}\right]
$$

(5.4) Finally, $S$ must be chosen to minimize the total number of bands, $r$,

$$
\begin{equation*}
\mathrm{r}=\prod_{\mathrm{K}=1}^{\mathrm{n}-1}\left\lceil\frac{\max \Pi_{\mathrm{PK}}\left(\mathrm{j}^{1}-\overline{\mathrm{j}}^{2}\right)+1}{\mathrm{~m}_{\mathrm{K}}}\right\rceil \tag{4.4}
\end{equation*}
$$

for any $\vec{j}, \vec{j}^{2} \in J^{n}$, and $m_{K}$ is the width of the band.
STEP 6: Determine new dependence matrix $\Delta=T D$.
STEP 7: Determine GVT (Generated Variable Table). Significant columns of this table specify the initiation time $\left(\mathrm{t}_{\mathrm{I}}\right)$, completion time $\left(\mathrm{t}_{\mathrm{E}}\right)$ and location of the cell $(\bar{l})$ where index point $\overline{\mathrm{j}}$ is processed.

$$
\begin{aligned}
& \mathrm{t}_{\mathrm{I}}=\Pi \overline{\mathrm{j}} \\
& \mathrm{t}_{\mathrm{E}}=\mathrm{t}_{\mathrm{I}}+\mathrm{t}_{\mathrm{f}} \\
& \bar{l}=\overline{\mathrm{S}}
\end{aligned}
$$

where $t_{f}$ in the computational delay of a single cell.
STEP 8: Determine UVT (Used Variable Table). Significant columns of this table specify the times $\left\{\mathrm{t}_{\mathrm{y}}\right\}$ and location of cells $\left\{\bar{l}_{\mathrm{y}}\right\}$ from which the set of generated variables $y\left(\overline{\mathrm{j}}^{-} \overline{\mathrm{d}}_{\mathrm{y}}\right)$ are available for use at index point $\overline{\mathrm{j}}$.

$$
\begin{aligned}
& \mathrm{t}_{\mathrm{y}}=\Pi\left(\overline{\mathrm{j}}-\overline{\mathrm{d}}_{\mathrm{y}}\right)+\mathrm{t}_{\mathrm{f}} \\
& \bar{l}_{\mathrm{y}}=\mathrm{S}\left(\overline{\mathrm{j}}-\overline{\mathrm{d}}_{\mathrm{y}}\right)
\end{aligned}
$$

where $d_{y}$ is the dependence vector corresponding to generated variable $y$.
STEP 9: Determine IT (Input Table). Significant columns of this table include the times $\left\{\mathrm{t}_{\mathrm{x}}\right\}$ and location of cells $\left\{\bar{l}_{\mathrm{x}}\right\}$ from which input variables $\mathrm{x}(\overline{\mathrm{j}})$ are available for use at index point $\overline{\mathrm{j}}$.

$$
\begin{aligned}
& \mathrm{t}_{\mathrm{x}}=\Pi\left(\overline{\mathrm{j}}-\overline{\mathrm{d}}_{\mathrm{x}}\right) \\
& \bar{l}_{\mathrm{x}}=\mathrm{S}\left(\overline{\mathrm{j}}-\overline{\mathrm{d}}_{\mathrm{x}}\right)
\end{aligned}
$$

where $\overline{\mathrm{d}}_{\mathrm{x}}$ is the dependence vector corresponding to input variable x , and $\mathrm{t}_{\mathrm{C}}$
is, as defined above, the interprocessor communication time.
STEP 10:Use the new dependence matrix $\Delta$ along with the set of tables $T=\{G V T, U V T, I T\}$ to map the algorithm onto a fixed systolic architecture. The amount of time delay for buffers inserted along the paths of generated and input variables within an individual cell for synchronization purpose may be determined by,

$$
\begin{aligned}
& t_{D x}=\Pi \bar{d}_{x}-t_{\mathrm{C}} \\
& t_{D y}=\Pi \bar{d}_{y}-\left(t_{\mathrm{f}}+t_{\mathrm{t}}\right)
\end{aligned}
$$

where, $t_{D x}$ and $t_{D y}$ are the time intervals required for delay buffers inserted along the paths of an input ( $x$ ) and generated variable ( $y$ ) respectively. $\overline{\mathrm{d}}_{\mathrm{x}}, \overline{\mathrm{d}}_{\mathrm{y}}, \mathrm{t}_{\mathrm{C}}$ and $\mathrm{t}_{\mathrm{f}}$ are as defined before.

STEP 11: Partition the developed algorithm onto a set of $r$ bands ( $r$ may be determined by equation (4.4). The mapping of the indices to processors of the partitioned algorithm is as follows: each index point $j$ is processed within band $\mathrm{B}_{\mathrm{i}}$ in the processor whose ith coordinate is,

$$
j_{\mathrm{P}_{\mathrm{i}}}=\Pi_{\mathrm{Pi}} \overline{\mathrm{j}} \bmod \mathrm{~m}_{\mathrm{i}}
$$

where $m_{i}$ is the width of the ith coordinate inside the band. It should be noted here that this step usually results in a more cost-efficient systolic architecture for algorithms with a large number of recursive iterations.

The modifications to the design methodology of [40] are included in design steps: 1, 2, 3.1-3.3, 4, 5.1, and 7-10. Step 1 determines whether any generated variable in the given linear recurrence problem may be broadcasted, which is undesired in any systolic architecture. Step 2 simply selects these variables which are amenable to broadcasting so that their respective dependence vectors may be used to construct special conditions on the selection of transformation $\Pi$ in step 3.1. As described in step $3.1, \Pi_{\mathrm{Bi}}$ must be of the form depicted to ensure variable(s) $\mathrm{y}_{\mathrm{Bi}}$ is sequentially propagated through the systolic array instead of being viable to broadcasting. Special conditions are established on the choice of $\mathrm{C}_{\mathrm{i}}$ and $\mathrm{C}_{\mathrm{ji}}$ so as to set lower bounds on propagation delays of $\mathrm{j}_{\mathrm{Bi}}$ in the specified direction of data flow. The conditions on the linear time scheduler in steps 3.2 and 3.3 ensure $\Pi$ is selected appropriately when the communication time, $\mathrm{t}_{\mathrm{C}}$, and computation time, $\mathrm{t}_{\mathrm{f}}$, are different. Step 4 provides a systematic method for the selection of the matrix of interconnection primitives, $P$, to eliminate any type of enumeration for this choice, as
delineated in [40].
Conditions set on the selection of $\Pi$ alone in step 3.1 is insufficient for the elimination of broadcasting since the space scheduler $S$ must provide the interconnections support for such data communications specified by $\Pi$. Thus, step 5.1 discusses the conditions on the choice of $S$ to avoid broadcasts. Finally, steps 7-10 incorporate desired modifications to [40] for appropriate determination of GVT, UVT, and IT to the given linear recurrence problem when $\mathrm{t}_{\mathrm{c}}$ and $\mathrm{t}_{\mathrm{f}}$ are different.

### 4.3 Time Lower Bound to Compute The Inverse Dynamics Problem on a Systolic Array

The limitation on speeding up the inverse dynamics computational problem while running on a systolic array will now be discussed. Before deriving the time lower bound, the following notation must be established.

## Notation:

(1) $\mathrm{E}<l>$ denotes a linear arithmetic expression of $l$ distinct atoms, where an atom is a constant or variable, e.g. $E<4>=a+b-c / d$.
(2) $\mathrm{T}_{\mathrm{S}}=$ Minimum time to compute the inverse dynamics problem of an n -link manipulator on a fixed systolic array.

Theorem 4.1: The minimum time to compute the inverse dynamics problem of an $n$-link manipulator on a fixed systolic architecture is bounded below by o $\left\lceil d_{1} n\right\rceil$, where $d_{1}$ is a specified constant.

Proof: Using equation (4.3) in procedure SADLRE, the systolic processing time, $t$, for any task in program Newton-Euler may be expressed in the general form,

$$
\mathrm{t}=\left[\max \Pi \frac{\left[\left[\begin{array}{c}
n \\
\max \mathrm{a}_{2} \\
\vdots \\
\max \mathrm{a}_{\mathrm{m}}
\end{array}\right]-\left[\begin{array}{c}
1 \\
\min a_{2} \\
\vdots \\
\min a_{m}
\end{array}\right]\right]+\mathrm{t}_{\mathrm{f}}}{\min \Pi \bar{d}_{\mathrm{i}}}\right]
$$

$$
=\left[\frac{\max \Pi_{1}(n-1)+\sum_{i=2}^{m} \max \Pi_{i}\left(\max a_{i}-\min a_{i}\right)+t_{f}}{\min \Pi \bar{d}_{i}}\right]
$$

where $n$ is the number of manipulator joints, $\max \left[a_{2}, \ldots, a_{m}\right]$ and $\min \left[a_{2}, \ldots, a_{m}\right]$ are upper and lower bounds of the task's index set $\left\{\overline{\mathrm{j}}_{2}, \ldots, \overline{\mathrm{j}}_{\mathrm{m}}\right\}$, respectively. Variables $\sum_{\mathrm{i}=2}^{m} \max \Pi_{\mathrm{i}}\left(\max \mathrm{a}_{\mathrm{i}}-\min \mathrm{a}_{\mathrm{i}}\right), \mathrm{t}_{\mathrm{f}}$, and $\min \Pi_{i}$ in the above equation are constants independent of $n$. In the case of $\Pi_{1}=\mathrm{f}\left(\mathrm{j}_{\mathrm{o}}\right)$ where $\Pi_{1}$ is the first term in the vector transformation and $j_{o}$ is the outermost index of the iterative algorithm, $\max \Pi_{1}(n-1)$ will be $o\left(n^{2}\right)$. Thus, for minimal $t$, $\Pi_{1} \neq f\left(j_{0}\right)$. Under this condition for $\Pi$ to achieve the lower bound for $t$, we may rewrite the above above formulation of $t$ as,
$t=\left\lceil\frac{b_{1}(n-1)+b_{2}}{b_{3}}\right\rceil$

$$
=c_{1} n+c_{2}
$$

where $b_{1}, b_{2}, b_{3}, c_{1}$, and $c_{2}$ are specified constants. Thus, the lower bound to evaluate $T_{S}$ for an arithmetic expression $\mathrm{E}<\mathrm{n}>$ may be defined as,

$$
\mathrm{T}_{\mathrm{S}}[\mathrm{E}<\mathrm{n}>] \geq \mathrm{o}\left(\mathrm{c}_{1}\lceil\mathrm{n}\rceil\right)
$$

The inverse dynamics problem may be considered as computing a set of arithmetic operations which result in obtaining the joint torques. Each joint torque $\tau_{i}$ of joint $i$ can be expressed as an arithmetic expression containing at least $3 n$ atoms: $n$ joint positions $\left\{q_{i}\right\}_{i=1}^{n}, n$ joint velocities $\left\{\dot{q}_{i}\right\}_{i=1}^{n}$ and $n$ joint accelerations $\left\{\ddot{q}_{i}\right\}_{i=1}^{n}$, implying,

$$
\begin{equation*}
T_{S}[E<3 n>] \geq o\left(3 c_{1}[n]\right) \tag{4.5}
\end{equation*}
$$

Rewriting equation (4.5) without changing the order of the lower bound to evaluate a set of $n$ torques,

$$
\begin{gathered}
T_{\mathrm{S}}\left[\tau_{1}, \ldots, \tau_{\mathrm{n}}\right] \geq o\left(d_{1}\lceil\mathrm{n}]\right) \\
\text { where } \mathrm{d}_{1} \text { is a specified constant. Q.E.D. }
\end{gathered}
$$

By establishing a lower bound in Theorem 4.1, alternate systolic architectures for the computation of the inverse dynamics problem may be compared and contrasted by simply comparing coefficient $\mathrm{d}_{1}$ of the competing systolic structures.

### 4.4 Reformulation of Newton Euler Equations of Motion for Direct Mapping onto a Fixed Systolic Architecture

In this section we shall reconstruct the Newton-Euler dynamics algorithm in Appendix A in a form amenable for direct mapping onto a fixed systolic array. This reformulation is achieved by decomposing the algorithm into a set of unidirectional and linear recurrence computational tasks. The algorithm is as follows:

## Algorithm NEWTON_EULER

## Forward Iterations:

```
for h = 0 to (n-1) do.
    begin h
        for i=0 to 2 do
            begin i
            For j = 0 to 2 do
                Fork=0 to 1 do
    Start Task 1
            begin j
            \omega[h,i,j]=\omega[h,i,j-1]+R[i,j](\omega[h-1,i,2]+\dot{0}
            end j
    Finish Task 1
    Start Task 2
        if i=2 then
            \mp@subsup{x}{1}{}}[\textrm{h},\textrm{i}]=\omega[\textrm{h},2]*\mp@subsup{\dot{0}}{}{\textrm{z}}[\textrm{h},0]-\omega[\textrm{h},0]*\mp@subsup{\dot{0}}{}{\textrm{z}}[\textrm{h},2
            else
            \mp@subsup{x}{1}{}[h,i]=\omega[h,i]*\mp@subsup{\dot{0}}{}{\textrm{z}}[h,i+1]-\omega[h,i+1]*\mp@subsup{\dot{0}}{}{z}[h,i]
```

    Finish Task 2
    Start Task 3

$$
x_{2}[h, i]=x_{1}[h, i]+\ddot{\theta}^{2}[h, i]
$$

Finish Task 3
Start Task 4
begin j
$\dot{\omega}[h, i, j]=\dot{\omega}[h, i, j-1]+R[i, j]\left(\dot{\omega}[h-1, i, 2]+x_{2}[h, i]\right)$
end $j$
Finish Task 4
Start Task 5

$$
\text { if } \mathrm{i}=2 \text { then }
$$

$$
\mathrm{x}_{3}[\mathrm{~h}, \mathrm{i}]=\dot{\omega}[\mathrm{h}, 2] * \mathrm{P}[\mathrm{~h}, 0]-\dot{\omega}[\mathrm{h}, 0] * \mathrm{P}[\mathrm{~h}, 2]
$$

else

$$
\mathrm{x}_{3}[\mathrm{~h}, \mathrm{i}]=\dot{\omega}[\mathrm{h}, \mathrm{i}] * \mathrm{P}[\mathrm{~h}, \mathrm{i}+1]-\dot{\omega}[\mathrm{h}, \mathrm{i}+1] * \mathrm{P}[\mathrm{~h}, \mathrm{i}]
$$

Finish Task 5
Start Task 6 begin $k$

$$
\mathrm{x}_{4}[\mathrm{~h}, \mathrm{i}, \mathrm{k}=0]=\omega[\mathrm{h}, \mathrm{i}]
$$

$$
\text { if } i=2 \text { then }
$$

$$
x_{4}[h, i, k]=x_{4}[h, 2, k-1] * P[h, 0, k]-x_{4}[h, 0, k-1] * P[h, 2, k]
$$

else

$$
\mathrm{x}_{4}[\mathrm{~h}, \mathrm{i}, \mathrm{k}]=\mathrm{x}_{4}[\mathrm{~h}, \mathrm{i}, \mathrm{k}-1] * \mathrm{P}[\mathrm{~h}, \mathrm{i}+1, \mathrm{k}]-\mathrm{x}_{4}[\mathrm{~h}, \mathrm{i}+1, \mathrm{k}-1] * \mathrm{P}[\mathrm{~h}, \mathrm{i}, \mathrm{k}]
$$ end $k$

Finish Task 6
Start Task 7

$$
\mathrm{x}_{5}[\mathrm{~h}, \mathrm{i}]=\mathrm{x}_{3}[\mathrm{~h}, \mathrm{i}]+\mathrm{x}_{4}[\mathrm{~h}, \mathrm{i}]
$$

Finish Task 7
Start Task 8 begin j
$\dot{\mathrm{v}}[\mathrm{h}, \mathrm{i}, \mathrm{j}]=\dot{\mathrm{v}}[\mathrm{h}, \mathrm{i}, \mathrm{j}-1]+\mathrm{R}[\mathrm{i}, \mathrm{j}] * \dot{\mathrm{v}}[\mathrm{h}-1, \mathrm{i}, 2]+\mathrm{x}_{5}[\mathrm{~h}, \mathrm{i}]$ end $j$
Finish Task 8
Start Task 9

$$
\text { if } \mathrm{i}=2 \text { then }
$$

$$
x_{6}[h, i]=\dot{\omega}[h, 2] * P_{C}[h, 0]-\dot{\omega}[h, 0] * P_{C}[h, 2]
$$

else

$$
\mathrm{x}_{6}[\mathrm{~h}, \mathrm{i}]=\dot{\omega}[\mathrm{h}, \mathrm{i}] * \mathrm{P}_{\mathrm{C}}[\mathrm{~h}, \mathrm{i}+1]-\dot{\omega}[\mathrm{h}, \mathrm{i}+1] * \mathrm{P}_{\mathrm{C}}[\mathrm{~h}, \mathrm{i}]
$$

Finish Task 9

Start Task 10

$$
\text { begin } k
$$

$$
\mathrm{x}_{7}[\mathrm{~h}, \mathrm{i}, \mathrm{k}=0]=\omega[\mathrm{h}, \mathrm{i}]
$$

if $\mathrm{i}=2$ then

$$
\mathrm{x}_{7}[\mathrm{~h}, \mathrm{i}, \mathrm{k}]=\mathrm{x}_{7}[\mathrm{~h}, 2, \mathrm{k}-1] * \mathrm{P}_{\mathrm{C}}[\mathrm{~h}, 0, \mathrm{k}]-\mathrm{x}_{7}[\mathrm{~h}, 0, \mathrm{k}-1] * \mathrm{P}_{\mathrm{C}}[\mathrm{~h}, 2, \mathrm{k}]
$$

else

$$
\mathrm{x}_{7}[\mathrm{~h}, \mathrm{i}, \mathrm{k}]=\mathrm{x}_{7}[\mathrm{~h}, \mathrm{i}, \mathrm{k}-1] * \mathrm{P}_{\mathrm{C}}[\mathrm{~h}, \mathrm{i}+1, \mathrm{k}]-\mathrm{x}_{7}[\mathrm{~h}, \mathrm{i}+1, \mathrm{k}-1] * \mathrm{P}_{C}[\mathrm{~h}, \mathrm{i}, \mathrm{k}]
$$

end $k$
Finish Task 10
Start Task 11

$$
\mathrm{x}_{8}[\mathrm{~h}, \mathrm{i}]=\mathrm{x}_{6}[\mathrm{~h}, \mathrm{i}]+\mathrm{x}_{7}[\mathrm{~h}, \mathrm{i}]
$$

Finish Task 11
Start Task 12

$$
\dot{\mathrm{v}}_{\mathrm{c}}[\mathrm{~h}, \mathrm{i}]=\dot{\mathrm{v}}[\mathrm{~h}, \mathrm{i}]+\mathrm{x}_{8}[\mathrm{~h}, \mathrm{i}]
$$

Finish Task 12
Start Task 13

$$
\mathrm{F}[\mathrm{~h}, \mathrm{i}]=\mathrm{m}[\mathrm{~h}] * \dot{\mathrm{v}}_{\mathrm{C}}[\mathrm{~h}, \mathrm{i}]
$$

Finish Task 13
Start Task 14
begin j
$\mathrm{x}_{9}[\mathrm{~h}, \mathrm{i}, \mathrm{j}]=\mathrm{x}_{9}[\mathrm{~h}, \mathrm{i}, \mathrm{j}-1]+\mathrm{I}[\mathrm{i}, \mathrm{j}] * \omega[\mathrm{~h}, \mathrm{i}]$
end $j$
Finish Task 14
Start Task 15
if $i=2$ then

$$
\mathrm{x}_{10}[\mathrm{~h}, \mathrm{i}]=\omega[\mathrm{h}, 2] * \mathrm{x}_{9}[\mathrm{~h}, 0]-\omega[\mathrm{h}, 0] * \mathrm{x}_{9}[\mathrm{~h}, 2]
$$

else
$\mathrm{x}_{10}[\mathrm{~h}, \mathrm{i}]=\omega[\mathrm{h}, \mathrm{i}] * \mathrm{x}_{9}[\mathrm{~h}, \mathrm{i}+1]-\omega[\mathrm{h}, \mathrm{i}+1] * \mathrm{x}_{9}[\mathrm{~h}, \mathrm{i}]$
Finish Task 15
Start Task 16
begin j

$$
\begin{aligned}
& \mathrm{x}_{11}[\mathrm{~h}, \mathrm{i}, \mathrm{j}]=\mathrm{x}_{11}[\mathrm{~h}, \mathrm{i}, \mathrm{j}-1]+\mathrm{I}[\mathrm{i}, \mathrm{j}] * \dot{\omega}[\mathrm{~h}, \mathrm{i}] \\
& \text { end } \mathrm{j}
\end{aligned}
$$

Finish Task 16
Start Task 17
$N[\mathrm{~h}, \mathrm{i}]=\mathrm{x}_{10}[\mathrm{~h}, \mathrm{i}]+\mathrm{x}_{11}[\mathrm{~h}, \mathrm{i}]$
Finish Task 17
end $i$
end $h$

## Backward Iterations:

$$
\begin{aligned}
& \text { for } h=n \text { to } 1 \text { do } \\
& \text { begin } h \\
& \text { for } i=0 \text { to } 2 \text { do } \\
& \text { begin } i \\
& \text { for } j=0 \text { to } 2 \text { do } \\
& \text { Start Task } 18 \\
& \text { begin } j \\
& \quad f[h, i, j]=f[h, i, j-1]+R[i, j] * f[h+1, i]+F[h, i] \\
& \quad \text { end } j \\
& \text { Finish Task } 18 \\
& \text { Start Task } 19 \\
& x_{12}[h, i]=P[h, i]+P_{C}[h, i] \\
& \text { Finish Task } 19 \\
& \text { Start Task } 20 \\
& \text { if } i=2 \text { then } \\
& x_{13}[h, i]=x_{12}[h, 2] * F[h, 0]-x_{12}[h, 0] * F[h, 2] \\
& \text { else } \\
& x_{13}[h, i]=x_{12}[h, i] * F[h, i+1]-x_{12}[h, i+1] * F[h, i]
\end{aligned}
$$

Finish Task 20
Start Task 21

$$
\mathrm{x}_{14}[\mathrm{~h}, \mathrm{i}]=\mathrm{N}[\mathrm{~h}, \mathrm{i}]+\mathrm{x}_{13}[\mathrm{~h}, \mathrm{i}]
$$

Finish Task 21
Start Task 22

$$
\text { if } i=2 \text { then }
$$

$$
\mathrm{x}_{15}[\mathrm{~h}, \mathrm{i}]=\mathrm{P}[\mathrm{~h}, 2] * \mathrm{f}[\mathrm{~h}+1,0]-\mathrm{P}[\mathrm{~h}, 0] * \mathrm{f}[\mathrm{~h}+1,2]
$$

else

$$
\mathrm{x}_{15}[\mathrm{~h}, \mathrm{i}]=\mathrm{P}[\mathrm{~h}, \mathrm{i}] * \mathrm{f}[\mathrm{~h}+1, \mathrm{i}+1]-\mathrm{P}[\mathrm{~h}, \mathrm{i}+1] * \mathrm{f}[\mathrm{~h}+1, \mathrm{i}]
$$

Finish Task 22
Start Task 23 begin j
$n[h, i, j]=n[h, i, j-1]+R[i, j]\left(n[h+1, i, 2]+x_{15}[h, i]\right)+x_{14}[h, i]$ end $j$

Finish Task 23
Start Task 24
$\tau[\mathrm{h}, \mathrm{i}]=\tau[\mathrm{h}, \mathrm{i}-1]+\mathrm{R}^{\mathrm{Z}}[\mathrm{h}, \mathrm{i}] * \mathrm{n}[\mathrm{h}, \mathrm{i}]$
Finish Task 24
end $i$
end $h$

### 4.5 Systematic Design of a Basic Set of Systolic Processors for the Inverse Dynamics Computational Problem

From the algorithm of Section 4.4, it is seen that a "basic" set of types of tasks exist from which the complete systolic architecture for the computation of the inverse dynamics problem may be derived. Thus, by designing a set of systolic processors onto which this "basic" set of task types may be mapped, we can arrive at the complete systolic architecture for the computation of the specified problem by simply using these processors as building blocks for the realization of the overall systolic system. The basic set of tasks may be partitioned into a set of unidirectional and linear recurrence tasks as follows:

## Unidirectional tasks:

Type 1: Vector-Add (VA)

$$
\text { for } i=0 \text { to } 2 \text { do }
$$

$y[i]=x[i]+z[i]$
end $i$
Type 2: Scalar-Vector Product (SVP)
for $i=0$ to 2 do
$y[i]=K * x[i]$
end $i$
Type 3: Vector Cross-Product (VCP)
for $\mathrm{i}=0$ to 2 do
if $\mathrm{i}=2$ then

$$
y[i]=x[2] * z[0]-x[0] * z[2]
$$

else

$$
y[i]=x[i] * z[i+1]-x[i+1] * z[i]
$$

end $i$

## Linear Recurrence tasks:

Type 4: Inner Product (IP)
for $\mathrm{i}=0$ to 2 do
$y[i]=y[i-1]+x[i] * z[i]$
end $i$
Type 5: Matrix-Vector Product
for $\mathrm{i}=0$ to 2 do
for $\mathrm{j}=0$ to 2 do
$y[i, j]=y[i, j-1]+A[i, j] * x[i]$
end $j$
end $i$
Type 6: Recursive Matrix-Vector Product (RMVP)
Type 6a:
for $h=0$ to $\mathrm{n}-1$ do
for $i=0$ to 2 do
for $\mathrm{j}=0$ to 2 do
$y[h, i, j]=y[h, i, j-1]+A[i, j](y[h-1, i, 2]+x[h, i])$
end $j$
end $i$
end $h$
Type 6b:
Replace line (1) in Type 6 a by,

$$
y[h, i, j]=y[h, i, j-1]+A[i, j] * y[h-1, i, 2]+x[h, i]
$$

Type 6c:
Replace line (1) in Type 6a by,

$$
y[h, i, j]=y[h, i, j-1]+A[i, j] *(y[h-1, i, 2]+x[h, i])+z[h, i]
$$

Type 7: Recursive Vector Cross-Product (RVCP)
for $\mathrm{i}=0$ to n
for $j=0$ to 2 do
if $\mathrm{j}=2$ then
$y[h, i]=y[i-1,2] * x[i, 0]-y[i-1,0] * x[i, 2]$
else
$y[h, i]=y[i-1, j] * x[i, j+1]-y[i-1, j+1] * x[i, j]$
end j
end $i$

Now that the basic types of tasks prevalent in the inverse dynamics computational problem have been defined, we can develop systolic architectures onto which these processes may be mapped. The modified systolic mapping procedure of Section 4.2 will be used to develop the systolic architectures for linear recurrence tasks 4-7.

For the purpose of generality while deriving systolic architectures using procedure SADLRE, we shall assume that the interprocessor communication time, $\mathrm{t}_{\mathrm{c}}$, is of unit time latency and the operational delay of primitive tasks such as addition, subtraction and multiplication is also an unit of time.

Before deriving the specified systolic processor architectures, two terms which will be used hereafter must be defined. The first one is, Systolic Operational Delay, which is defined as the latency required to compute the first element of a vector operation at the end of the first iteration in the given linear recurrence problem on the specified systolic array. Next, the Systolic Execution Time is defined as the latency to evaluate all elements of the vector operation at the end of $n$ iterations in the linear recurrence problem on the specified systolic array.

### 4.5.1 Systolic Processors for Unidirectional Tasks

Task types 1 and 2 are represented by primitive adder and multipler cells (Figs. 4.1(a), (b)). Paired vector operands enter these cells in synchronous fashion. A systolic vector cross-product module (type 3 task) is illustrated in Fig. 4.1(c). Unit delays are inserted along some input variable paths for operation synchronization. The operation delay of this particular module is 3 time units.

### 4.5.2 Systolic Architecture for a Type 4 Task

The Algol-like program for an inner product operation is,

$$
\begin{aligned}
& \text { for } i=0 \text { to } 2 \text { do } \\
& \begin{array}{l}
j=i \\
x[i, j]=x[i, j-1] \\
z[i, j]=z[i, j-1] \\
y[i]=y[i-1]+x[i, j] * z[i, j]
\end{array} \\
& \text { end } i
\end{aligned}
$$

The data dependency matrix for the above algorithm is $\mathrm{D}=\left[\begin{array}{lll}0 & 0 & 1 \\ 1 & 1 & 0\end{array}\right]$, where the columns from left to right denote data dependencies for $\mathrm{x}[\mathrm{i}], \mathrm{z}[\mathrm{i}]$ and $\mathrm{y}[\mathrm{i}]$ respectively.


Fig. 4.1: Systolic Processors for Unidirectional Tasks

By following procedure SADLRE, $\Pi$ is found to be $\left[\begin{array}{ll}3 & 1\end{array}\right]$, and $S$ to be $\left[\begin{array}{ll}1 & 0 \\ 0 & 1\end{array}\right]$. The new transformation matrix, $\Delta$, is thus $\left[\begin{array}{lll}1 & 1 & 3 \\ 0 & 0 & 1 \\ 1 & 1 & 0\end{array}\right]$. Tables 4.1-4.3 show the GVT, UVT, and IT for this type of task. The systolic architecture and its partitioned structure may be derived from these tables to be as shown in Fig. 4.2(a) and (b) respectively: Also, the architecture of an individual processing element of Fig. 4.2(a) is illustrated in Fig. 4.2(c). Information regarding scheduling of operands and generation of intermediate variables in Fig. 4.2(a) can be obtained from task type 4's tables $\{T\}$. The operational delay for the Type 4 task is eight time units.

### 4.5.3 Systolic Architecture for a Type 5 Task

The Algol-like program of a matrix-vector product (Task 4) operation will now be presented in a slightly modified form,

```
for i = 0 to 2 do
    for j = 0 to 2 do
        x[i] = x[i-1]
        A[i,j]=A[i,j-1]
        y[i,j]=y[i,j-1]+A[i,j] * x[i]
    end j
end i
```

The purpose of statements (1) and (2) is to ensure the inputs are propagated sequentially along the ith and jth directions, respectively, of the systolic array. These statements, therefore, eliminate any possibility of broadcasting input variables. The data dependency matrix, $D$, of iteration ( $\mathrm{i}, \mathrm{j}$ ) for the above algorithm may easily be derived as,

$$
D=\left[\begin{array}{lll}
1 & 0 & 0 \\
0 & 1 & 1
\end{array}\right]
$$

where columns 1,2 and 3 represent dependence vectors for variables $x[i-1], A[i, j-1]$ and $y[i, j-1]$ respectively.

Linear indexing matrices, $\mathrm{I}_{\mathrm{Bi}}$, for this problem were found to have the same rank as the dimensionality of the algorithm index set. Thus, it may be assumed that broadcasting of any input or generated variable will not occur. We can, therefore, jump to step 3.2 in procedure SADLRE.

Table 4.1: GVT of a Type 4 Task

| ITERATION | BEGIN EXECUTION | FINISH EXECUTION | IN CEIL | GENERATING |
| :---: | :---: | :---: | :---: | :---: |
| (i) | AT TIME | AT TIME | (i) | VARIABLE |
| 0 | 0 | 2 | 0 | $y(0)$ |
| 1 | 3 | 5 | 1 | $y(1)$ |
| 2 | 8 | 8 | 2 | $y(2)$ |

Table 4.2: UVT of a Type 4 Task

| ITERATION <br> $(i)$ | USING <br> VARIABLES | FROM CELL <br> $\left(\mathrm{i}-\mathrm{d}_{y}\right.$ | AT <br> TIMES |
| :---: | :---: | :---: | :---: |
| 0 | $\mathrm{y}(-1)$ | -1 | -1 |
| 1 | $\mathrm{y}(0)$ | 0 | 2 |
| 2 | $\mathrm{y}(1)$ | 1 | 5 |

Table 4.3: IT of a Type 4 Task

| ITERATION <br> $(i)$ | USING INPUTS <br> $x(i), z(i)$ | FROM CELL <br> $\left(i, j-\bar{d}_{x}\right.$ | AT |
| :---: | :---: | :---: | :---: |
| 0 | $x(0), z(0)$ | $0,-1$ | -1 |
| 1 | $x(1), z(1)$ | $1,-1$ | 2 |
| 2 | $x(2), z(2)$ | $2,-1$ | 5 |


(a): Systolic Architecture

(b): Partitioned Systolic Architecture

(c): An Individual Processing Element

Fig. 4.2: Systolic Architecture for a Type 4 Task

From the above Algol-like program specification, it is easily derived that the computational delay of an individual systolic cell will be two time units since two consecutive primitive operations are required to evaluate $y[i, j]$. The linear time scheduler, $\Pi D$, must, therefore satisfy the conditions of $\Pi \bar{d}_{1}, \Pi \bar{d}_{2} \geq 1$ and $\Pi \bar{d}_{3} \geq 3$, where the subscripts of dependence vector $d$ indicate column indexing in matrix $D$. The transformation vector, $\Pi$, which meets the above specifications and minimizes the algorithm execution time, t , in step 3.4, is found to be [13].

Using step 4, we may obtain a two-dimensional systolic array architecture by selecting the matrix of interconnection primitives, $P$, as $\left[\begin{array}{ll}1 & 0 \\ 0 & 1\end{array}\right]$. Further, the second transformation S is chosen to allow the diophantine equation $\mathrm{SD}=\mathrm{PK}$ to be solvable for $S$, and minimize the number of bands, $r$. The derived $S$-transformation and interconnection utilization matrix, K , are as follows:

$$
S=\left[\begin{array}{ll}
1 & 0 \\
0 & 1
\end{array}\right] \text { for } K=\left[\begin{array}{lll}
1 & 0 & 0 \\
0 & 1 & 1
\end{array}\right] \text {. }
$$

We thus arrive at the new dependence matrix, $\Delta$,

$$
\begin{aligned}
\Delta=\mathrm{TD} & =\left[\begin{array}{ll}
1 & 3 \\
1 & 0 \\
0 & 1
\end{array}\right]\left[\begin{array}{lll}
1 & 0 & 0 \\
0 & 1 & 1
\end{array}\right] \\
& =\left[\begin{array}{lll}
1 & 3 & 3 \\
1 & 0 & 0 \\
0 & 1 & 1
\end{array}\right]
\end{aligned}
$$

The first row of the transformed matrix, $\Delta$, indicates the number of time units allowed for the respective variable to travel from the processor where it is generated to the processor where it is used. The next two rows specify the direction of data flow for the respective variable. It is therefore easily verified that input $x[i]$ needs one time unit to propagate one space in the ith direction, input $A[i, j]$ and generated variable $y[i, j-1]$ requires three time units to travel one space in the jth direction.

Tables 4.4 through 4.6 delineate the GVT, UVT and IT, respectively, of the matrix-vector product operation. They are derived using Steps 7 through 9 of procedure SADLRE. The systolic array and its partitioned form inferred from these tables are illustrated in Fig. 4.3(a) and (b) respectively. The architecture for an individual cell of the systolic architecture using step 10 is shown in Fig. 4.3(c). The operational delay of the systolic module is 8 time units to perform the designated

Table 4.4: GVT of a Type 5 Task

| ITERATION <br> (i,j) | BEGIN EXECUTION AT TIME | FINISH EXECUTION AT TMME | $\begin{gathered} \text { IN CELL } \\ (i, j) \end{gathered}$ | GENERATING <br> VARIABLE |
| :---: | :---: | :---: | :---: | :---: |
| 0,0 | 0 | 2 | $(0,0)$ | $\boldsymbol{j}(0,0)$ |
| 0,1 | 3 \% | 5 | $(0,1)$ | $\overline{7}(0,1)$ |
| 0,2 | 6 | 8 | $(0,2)$ | $y(0,2)$ |
| 1,0 | 1 | 1 | $(1,0)$ | $y(1,0)$ |
| 1,1 | 4 | 6 | $(1,1)$ | $y(1,1)$ |
| 1,2 | 7 | 9 | $(1,2)$ | $y(1,2)$ |
| 2,0 | 2 | 4 | $(2,0)$ | $y(2,0)$ |
| 2,1 | 5 | 7 | $(2,1)$ | $y(2,1)$ |
| 2,2 | 8 | 10 | $(2,2)$ | $y(2,2)$ |

Table 4.5: UVT of a Type 5 Task

| ITERATION <br> $(\mathrm{i}, \mathrm{j})$ | USING <br> VARIABLE | FROM <br> CELL | AT |
| :---: | :---: | :---: | :---: |
| 0,0 | $(0,-1)$ | $(0,-1)$ | -3 |
| 0,1 | $(0,0)$ | $(0,0)$ | 0 |
| 0,2 | $(0,1)$ | $(0,1)$ | 3 |
| 1,0 | $(1,-1)$ | $(1,-1)$ | -2 |
| 1,1 | $(1,0)$ | $(1,0)$ | 1 |
| 1,2 | $(1,1)$ | $(1,1)$ | 4 |
| 2,0 | $(2,-1)$ | $(2,-1)$ | -1 |
| 2,1 | $(2,0)$ | $(2,0)$ | 2 |
| 2,2 | $(2,1)$ | $(2,1)$ | 5 |

Table 4.6: IT of a Type 5 Task

| ITERATION <br> ( $\mathrm{i}, \mathrm{j}$ ) | USING <br> INPUTS | FROM CELLS | AT TIMES |
| :---: | :---: | :---: | :---: |
| 0,0 | $\mathbf{x}[0], \mathbf{A}[0,0]$ | $(-1,0),(0,-1)$ | -1,-3 |
| 0,1 | $\mathbf{x}[0], \mathbf{A}[0,1]$ | $(-1,1),(0,0)$ | 2,0 |
| 0,2 | $\mathrm{x}[0], \mathrm{A}[0,2]$ | $(-1,2),(0,1)$ | 5,3 |
| 1,0 | $\mathrm{x}[1], \mathrm{A}[1,0]$ | $(0,0),(1,-1)$ | 0,-2 |
| 1,1 | $\mathbf{x}[1], \mathbf{A}[1,1]$ | $(0,1),(1,0)$ | 3,1 |
| 1,2 | $\mathrm{x}[1], \mathrm{A}[1,2]$ | $(0,2),(1,1)$ | 6,4 |
| 2,0 | $\mathrm{x}[2], \mathrm{A}[2,0]$ | $(1,0),(2,-1)$ | 1,-1 |
| 2,1 | $\mathrm{x}[2], \mathrm{A}[2,1]$ | $(1,1),(2,0)$ | 4,2 |
| 2,2 | $\mathrm{x}[2], \mathrm{A}[2,2]$ | $(1,2),(2,1)$ | 7,5 |


(a): Systolic Architecture
$x[2]$

(b): Partitioned Systolic Architecture


$$
y|i, j|
$$

(c): An Individual Processing Element

Fig. 4.3: Systolic Architecture for a Type 5 Task
computation.

### 4.5.4 Systolic Architecture for Task Types 6(a)-(c)

## Type 6(a) Task

The recursive matrix-vector product formulation of a Type 6(a) task is represented by the following modified Algol-like program:
for $h=0$ to $n-1$ do
for $i=0$ to 2 do
for $\mathrm{j}=0$ to 2 do
$x[h, i]=x[h, i-1]$
$\mathrm{A}[\mathrm{i}, \mathrm{j}]=\mathrm{A}[\mathrm{i}, \mathrm{j}-1]$
$y[h, i, j]=y[h, i, j-1]+A[i, j](y[h-1, i, 2]+x[h, i])$
end $j$
end $i$
end $h$

The data dependency matrix for the above algorithm specification may be derived to be,

$$
D=\left[\begin{array}{cccc}
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 1 & j-2
\end{array}\right]
$$

where the columns from left to right specify data dependence vectors for variables $\mathrm{x}[\mathrm{h}, \mathrm{i}-1], \mathrm{A}[\mathrm{i}, \mathrm{j}-1], \mathrm{y}[\mathrm{h}, \mathrm{i}, \mathrm{j}-1]$ and $\mathrm{y}[\mathrm{h}-1, \mathrm{i}, 2]$ respectively.

The linear indexing matrix for the generated variable $y[h-1, i, 2]$ is,

$$
\mathrm{I}_{\mathrm{By}[h-1, i, 2]}=\left[\begin{array}{lll}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 0
\end{array}\right]
$$

Since the rank of the above linear indexing matrix is less than the dimensionality of the algorithm index set, it is therefore clear that the generated variable $y[h-1, i, 2]$ may be broadcasted. By using Step 2 and 3.1 of procedure SADLRE, the linear time scheduler for variable $y[h-1, i, 2]$ must be of the form,

$$
\begin{equation*}
\Pi \overline{\mathrm{d}}_{\mathrm{By}[\mathrm{~h}-1, \mathrm{i}, 2]}=\mathrm{C}_{1}+\mathrm{jC} \mathrm{C}_{2} \tag{4.6}
\end{equation*}
$$

The conditions on constants, $\mathrm{C}_{1}$ and $\mathrm{C}_{2}$, may only be determined after the selection of the $S$ transformation matrix. By Step 4, we find the matrix of interconnection
primitives, P , to be,

$$
P=\left[\begin{array}{lll}
1 & 0 & 0 \\
0 & 1 & 1
\end{array}\right] .
$$

Using Step 5.1, broadcasting of variable $y[h-1, i, 2]$ is avoided by also setting the following condition on the space scheduler transformation matrix S ,

$$
\begin{equation*}
\mathrm{S}_{\mathrm{By}[\mathrm{~h}-1, \mathrm{i}, 2]} \overline{\mathrm{d}}_{\mathrm{By}[\mathrm{~h}-1, \mathrm{i}, 2]}=\mathrm{C}_{3}+\mathrm{jC}_{4} \tag{4.7}
\end{equation*}
$$

where $\mathrm{C}_{3} \in \mathrm{Z}^{\mathrm{n}}, \mathrm{C}_{4} \in \mathrm{I}^{\mathrm{n}}$ and $\mathrm{C}_{4} \neq 0$. A $S$-matrix is thus selected which minimizes the number of bands, r , satisfies equation (4.7), and allows the diophantine equation $\mathrm{SD}=$ PK to be solvable for $S$,

$$
S=\left[\begin{array}{lll}
0 & 1 & 0 \\
3 & 0 & 1
\end{array}\right] \text { for } K=\left[\begin{array}{llll}
1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0
\end{array}\right]
$$

From this choice of S , constants $\mathrm{C}_{3}$ and $\mathrm{C}_{4}$ in Eq. (4.7) are both one.
We may now return to the problem of setting conditions on $\mathrm{C}_{1}$ and $\mathrm{C}_{2}$ in Eq . (4.6). By Eqs. (4.6) and (4.7) in Step 3, $\mathrm{C}_{1}$ and $\mathrm{C}_{2}$ must both be greater or equal to 4 since $\mathrm{t}_{\mathrm{j}} \geq \mathrm{t}_{\mathrm{fj}}+\mathrm{t}_{\mathrm{C}}(=4)$ and $\left.\left.\mathrm{S}_{\mathrm{y}[\mathrm{h}-1, i, 2]} \mathrm{d}_{\mathrm{By}[\mathrm{h}-1, \mathrm{i}, 2]}^{\mathrm{C}}\right], \mathrm{S}_{\mathrm{y}[\mathrm{h}-1, \mathrm{i}, 2]} \mathrm{d}_{\mathrm{By}[\mathrm{h}-1, \mathrm{i}, 2]}^{\mathrm{V}}\right]$ are both one. Using Steps 3.2 and 3.3, we impose further conditions on the selection of $\Pi: \Pi d_{\mathrm{x}[\mathrm{h}, \mathrm{i}-1]}, \Pi_{\mathrm{A}[\mathrm{i},-1]} \geq 1$, and $\Pi \bar{d}_{\mathrm{y}[\mathrm{h}, \mathrm{i} j-1]} \geq 4$. Finally, the transformation $\Pi$ is chosen to satisfy all the above conditions and minimize the algorithm execution time $\mathrm{t}, \Pi=\left[\begin{array}{lll}12 & 1 & 4\end{array}\right]$.

The new transformation matrix, $\Delta$, is thus,

$$
\Delta=T D=\left[\begin{array}{cccc}
1 & 4 & 4 & 4+4 \mathrm{j} \\
1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1+\mathrm{j}
\end{array}\right]
$$

The information gathered from this $\Delta$ matrix regarding communication latency for the directional flow of variables in the systolic array is as follows:
(1) $\mathbf{x}[h, i-1]$ requires 1 time unit to travel 1 space unit in the ith direction.
(2) $\mathrm{A}[\mathrm{i}, \mathrm{j}-1]$ needs 4 units of time to move 1 space unit in the jth direction.
(3) Generated variable, y[h,i,j-1], travels 1 space unit in the jth direction in 4 time units.
(4) Generated variable, y[h-1,i,2], requires $4+4 \mathrm{j}$ units of time to travel $1+\mathrm{j}$ space units in the jth direction.

The set of tables \{GVT, UVT, IT\} for the Type 6a operation is depicted in Tables 4.7 through 4.9 , respectively. The systolic array architecture derived from these tables is shown in Fig. 4.4(a). The number of systolic cells required for this architecture is 3 n , where n is the number of manipulator joints, and the operational delay of the array is 11 units of time.

The architecture of an individual cell in the systolic array of Fig. 4.4(a) is illustrated in Fig. 4.4(c). Note that buffer delays derived using Step 10 of SADLRE generated insufficient latencies along the paths of $A[i, j]$ and $y[h, i, j-1]$ due to the asynchronous sequence in which these variables are operated upon in the cell. This problem is solved by adding extra latencies along their paths for the purpose of operation synchronization.

The algorithm may be mapped onto a more cost-efficient architecture of a $3 \times 3$ VLSI array by following the partitioning technique in Step 11. This partitioned architecture onto a set of 3 bands is illustrated in Fig. 4.4(b). The scheduling of these bands is performed by assigning the processing of $y(\overline{\mathrm{j}})$ in band $B_{i}, 1 \leq i \leq 3$, at a processor whose ith coordinate is $\Pi_{P j} \bar{j} \bmod 3$, where $\Pi_{P i}$ corresponds to the ith row in the $S$ matrix.

## Type 6(b) Task

The systolic architecture for this type of task is identical to that of a Type 6(a) task. The only architectural modification exists within an individual processing cell. Fig. 4.5 specifies the processing cell architecture for a Type $6(\mathrm{~b})$ task. The shorter computational latency of this cell compared to that of a Type 6(a) task, thus, requires the systolic array for a Type 6(b) task to have a shorter operation delay of 8 time units.

## Type 6(c) Task

The systolic architecture of this task type is also the same as that of a Type 6(a) task. An individual processing cell architecture for a Type 6(c) task is shown in Fig. 4.6. The operational delay for the type type 6(c) task is indifferent from that of a Type 6(a) task, i.e. a latency of 11 units.

### 4.5.5 Systolic Array for a Type 7 Task

The Algol-like program for a type 7 recursive vector cross-product operation is as follows,

$$
\begin{aligned}
& \text { for } i=0 \text { to } n-1 \text { do } \\
& \text { for } j=0 \text { to } 2 \text { do }
\end{aligned}
$$

Table 4.7: GVT of a Type 6a Task

| ITERATION <br> $(\mathbf{h , i , j})$ | BEGIN EXECUTION <br> AT TIME | FINISH EXECUTION <br> AT TIME | IN CELL <br> $(\mathrm{i}, \mathrm{j})$ | GENERATING <br> VARIABLE |
| :---: | :---: | :---: | :---: | :---: |
| $\mathbf{0 , 0 , 0}$ | 0 | 3 | 0,0 | $y(0,0,0)$ |
| $0,0,1$ | 4 | 7 | 0,1 | $y(0,0,1)$ |
| $0,0,2$ | 8 | 11 | 0,2 | $y(0,0,2)$ |
| $0,1,0$ | 1 | 4 | 1,0 | $y(0,1,0)$ |
| $0,1,1$ | 5 | 12 | 1,1 | $y(0,1,1)$ |
| $0,1,2$ | 9 | 5 | 1,2 | $y(0,1,2)$ |
| $0,2,0$ | 2 | 9 | 2,0 | $y(0,2,0)$ |
| $0,2,1$ | 6 | 13 | 2,1 | $y(0,2,1)$ |
| $0,2,2$ | 10 | 15 | 2,2 | $y(0,2,2)$ |
| $1,0,0$ | 12 | 19 | 0,3 | $y(1,0,0)$ |
| $1,0,1$ | 16 | 23 | 0,4 | $y(1,0,1)$ |
| $1,0,2$ | 20 | 16 | 0,5 | $y(1,0,2)$ |
| $1,1,0$ | 13 | 20 | 1,3 | $y(1,1,0)$ |
| $1,1,1$ | 17 | 24 | 1,4 | $y(1,1,1)$ |
| $1,1,2$ | 21 | 17 | 1,5 | $y(1,1,2)$ |
| $1,2,0$ | 14 | 21 | 2,3 | $y(1,2,0)$ |
| $1,2,1$ | 18 | 25 | 2,4 | $y(1,2,1)$ |
| $1,2,2$ | 22 |  | 2,5 | $y(1,2,2)$ |

Table 4.8: UVT of a Type 6a Task

| ITERATION (h,i,j) | USING <br> VARIABLES | FROM CELLS | AT TIMES |
| :---: | :---: | :---: | :---: |
| 0,0,0 | $\mathrm{y}(0,0,-1), \mathrm{y}(-1,0,2)$ | $(0,1),(0,-1)$ | -1,-1 |
| 0,0,1 | $y(0,0,0), \mathrm{y}(-1,0,2)$ | $(0,0),(0,-1)$ | 3,-1 |
| 0,0,2 | $\mathrm{y}(0,0,1), \mathrm{y}(-1,0,2)$ | $(0,1),(0,-1)$ | 7,-1 |
| 0,1,0 | $\mathrm{y}(0,1,-1), \mathrm{y}(-1,1,2)$ | $(1,-1),(1,-1)$ | 0,0 |
| 0,1,1 | $\mathrm{y}(0,1,0), \mathrm{y}(-1,1,2)$ | $(1,0)(1,-1)$ | 4,0 |
| 0,1,2 | $\mathrm{y}(0,1,1), \mathrm{y}(-1,1,2)$ | $(1,1),(1,-1)$ | 8,0 |
| 0,2,0 | $y(0,2,-1), y(-1,2,2)$ | $(2,-1),(2,-1)$ | 1,1 |
| 0,2,1 | $y(0,2,0), y(-1,2,2)$ | $(2,0),(2,-1)$ | 5,1 |
| 0,2,2 | $y(0,2,1), y(-1,2,2)$ | $(2,1),(2,-1)$ | 9,1 |
| 1,0,0 | $\mathrm{y}(1,0,-1), \mathrm{y}(0,0,2)$ | $(0,2),(0,2)$ | 11,11 |
| 1,0,1 | $\mathrm{y}(1,0,0), \mathrm{y}(0,0,2)$ | $(0,3),(0,2)$ | 15,11 |
| 1,0,2 | $\mathrm{y}(1,0,1), \mathrm{y}(0,0,2)$ | $(0,4),(0,2)$ | 18,11 |
| 1,1,0 | $\mathrm{y}(1,1,-1), \mathrm{y}(0,1,2)$ | (1,2), (1,2) | 12,12 |
| 1,1,1 | $\mathrm{y}(1,1,0), \mathrm{y}(0,1,2)$ | $(1,3),(1,2)$ | 16,12 |
| 1,1,2 | $\mathrm{y}(1,1,1), \mathrm{y}(0,1,2)$ | $(1,4),(1,2)$ | 20,12 |
| 1,2,0 | $\mathrm{y}(1,2,-1), \mathrm{y}(0,2,2)$ | (2,2), (2,2) | 13,13 |
| 1,2,1 | $\mathrm{y}(1,2,0), \mathrm{y}(0,2,2)$ | $(2,3),(2,2)$ | 17,13 |
| 1,2,2 | $\mathrm{y}(1,2,1), \mathrm{y}(0,2,2)$ | (2,4), (2,2) | 21,13 |

Table 4.9: IT of a Type 6a Task

| ITERATION (h,i,j) | USING INPUT $\mathbf{x}(\mathbf{h}, \mathbf{j}), \mathbf{A}(\mathrm{i}, \mathbf{j})$ | FROM CELLS | $\begin{gathered} \text { AT } \\ \text { TIMES } \end{gathered}$ |
| :---: | :---: | :---: | :---: |
| 0,0,0 | $\mathbf{x}(0,0), \mathrm{A}(0,0)$ | $(-1,0),(0,-1)$ | -1,-4 |
| 0,0,1 | $\mathbf{x}(0,1), \mathbf{A}(0,1)$ | $(-1,1),(0,0)$ | 3,0 |
| 0,0,2 | $\mathrm{x}(0,2), \mathrm{A}(0,2)$ | $(-1,2),(0,1)$ | 7,4 |
| 0,1,0 | $\mathbf{x}(0,0), \mathbf{A}(1,0)$ | $(0,0),(1,-1)$ | 0,-3 |
| 0,1,1 | $\mathbf{x}(0,1), \mathbf{A}(1,1)$ | $(0,1),(1,0)$ | 4,1 |
| 0,1,2 | $\mathbf{x}(0,2), \mathbf{A}(1,2)$ | $(0,2),(1,1)$ | 8,5 |
| 0,2,0 | $\mathbf{x}(0,0), \mathrm{A}(2,0)$ | $(1,0),((2,-1)$ | 1,-2 |
| 0,2,1 | $\mathbf{x}(0,1), \mathbf{A}(2,1)$ | $(1,1),,(2,0)$ | 5,2 |
| 0,2,2 | $\mathrm{x}(0,2), \mathrm{A}(2,2)$ | $(1,2),(2,1)$ | 9,6 |
| 1,0,0 | $\mathbf{x}(1,0), \mathrm{A}(0,0)$ | $(-1,3),(0,2)$ | 11,8 |
| 1,0,1 | $\mathbf{x}(1,1), \mathrm{A}(0,1)$ | $(-1,4),(0,3)$ | 15,12 |
| 1,0,2 | $\mathrm{x}(1,2), \mathrm{A}(0,2)$ | $(-1,5),(0,4)$ | 19,16 |
| 1,1,0 | $\mathrm{x}(1,0), \mathrm{A}(1,0)$ | $(0,3),(1,2)$ | 12,8 |
| 1,1,1 | $\mathbf{x}(1,1), \mathbf{A}(1,1)$ | $(0,4),(1,3)$ | 16,13 |
| 1,1,2 | $\mathbf{x}(1,2), \mathrm{A}(1,2)$ | $(0,5),(1,4)$ | 20,17 |
| 1,2,0 | $\mathrm{x}(1,0), \mathrm{A}(2,0)$ | $(1,3),(2,2)$ | 13,10 |
| 1,2,1 | $\mathbf{x}(1,1), \mathbf{A}(2,1)$ | $(1,4),(2,3)$ | 17,14 |
| 1,2,2 | $\mathbf{x}(1,2), \mathbf{A}(2,2)$ | $(1,5),(2,4)$ | 21,18 |


(a): Systolic Architecture

(b): Partitioned Systolic Architecture

(c): An Individual Processing Element

Fig. 4.4: Systolic Architecture for a Type 6 a Task


Fig. 4.5: Processing Element for a Type 6b Task


Fig. 4.6: Processing Element for a Type 6c Task

```
        \(\mathrm{C}[j]=\mathrm{C}[\mathrm{j}-1]\)
        \(x[i]=x[i-1]\)
        \(y[i, j]=C[j] \cdot(y[i-1, j] * x[i+1]-y[i-1, j+1] * x[i])\)
        \(+C[j] \cdot(y[i-1,2] * x[0]-y[i-1,0] * x[2])\)
    end j
end \(i\)
```

where $C[j]$ is a control signal entering the systolic array vertically. The data dependence matrix for the above program is,

$$
D=\left[\begin{array}{cccccc}
0 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & -1 & (j-2) & j
\end{array}\right]
$$

where the columns from left to right specify data dependence vectors for $\mathrm{C}[\mathrm{j}-1]$, $x[i-1], y[i-1, j], y[i-1, j+1], y[i-1,2]$ and $y[i-1,0]$ respectively.

The linear indexing matrices for generated variables $y[i-1,2]$ and $y[i-1,0]$ are,

$$
I_{B y[i-1,2]}=I_{B y[i-1,0]}=\left[\begin{array}{ll}
1 & 0 \\
0 & 0
\end{array}\right]
$$

The rank of the above linear indexing matrix is less than the dimensionality of the algorithm index set. The variables $y[i-1,2]$ and $y[i-1,0]$ are, therefore, amenable to broadcasting within the systolic array.

By Step 4, the matrix of interconnection primitives, $P$, is

$$
P=\left[\begin{array}{rrr}
0 & 1 & 1 \\
1 & 0 & -1
\end{array}\right]
$$

Conditions on the selection of the $S$ matrix to avoid broadcasting of specified variables include,

$$
S_{B y[i-1,2]} \overline{\mathrm{d}}_{\mathrm{By}[i-1,2]}=\mathrm{C}_{1}+\mathrm{jC} \mathrm{C}_{2}
$$

and,

$$
S_{\mathrm{By}[i-1,0]} \overline{\mathrm{d}}_{\mathrm{By}[\mathrm{i}-1,0]}=\mathrm{C}_{3}+\mathrm{jC} \mathrm{C}_{4}
$$

where $C_{1}, C_{3} \in Z^{n} ; C_{2}, C_{4} \in I^{n}$ and $C_{2}, C_{4} \neq 0$. An S-matrix is selected which satisfies these conditions, minimizes the number of bands, and allows the diophantine equation $\mathrm{SD}=\mathrm{PK}$ to be solvable for S ,

$$
S=\left[\begin{array}{ll}
1 & 0 \\
0 & 1
\end{array}\right]
$$

for,

$$
K=\left[\begin{array}{llllll}
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1
\end{array}\right]
$$

Now that the $S$ matrix has been chosen, conditions on the selection of the $\Pi$ transformation may be summarized as follows:
(i) $\quad \overline{\Pi d}_{B y[i-1,2]}=\mathrm{C}_{1}+\mathrm{jC}_{2}$;
(ii) $\Pi_{\mathrm{By}[\mathrm{i}-1,0]}=\mathrm{C}_{3}+\mathrm{j} \mathrm{C}_{4}$, where $\mathrm{C}_{1} \geq 2, \mathrm{C}_{3} \geq 0$ and $\mathrm{C}_{2}, \mathrm{C}_{4} \geq 1$;
(iii) $\Pi \bar{d}_{\mathrm{C}}, \Pi \overline{\mathrm{d}}_{\mathrm{x}} \geq 1$;
(iv) $\Pi \bar{d}_{y[i-1, j]}, \Pi \bar{d}_{y[i-1, j+1]} \geq 3$.

The above conditions are derived using Steps 3.1-3.3 of SADLRE. A $\Pi$ transformation is selected to satisfy these conditions in addition to minimizing the algorithm execution time $t, \Pi=\left[\begin{array}{ll}4 & 1\end{array}\right]$.

The new transformation matrix, $\Delta$, is found be,

$$
\Delta=T D=\left[\begin{array}{cccccc}
1 & 4 & 4 & 3 & (2+\mathrm{j}) & (4+\mathrm{j}) \\
0 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & -1 & (\mathrm{j}-2) & \mathrm{j}
\end{array}\right]
$$

Information regarding communication latency for the directional flow of variables in the systolic array may be deduced from the $\Delta$ matrix. This can be done in a manner similar to that previously described.

The \{GVT', UVT, IT\} for the recursive vector cross-product operation is shown in Tables 4.10-4.12. The resulting systolic array for this type of operation is delineated in Fig. 4.7(a), and the architecture of an individual processing cell is illustrated in Fig. 4.7(c). The control signal, $\mathrm{C}(\mathrm{j})$, operates as an input selection line for the multiplexers. Typically, it will allow $y(i-1, j)$ and $y(i-1, j+1)$ to pass through the multiplexers for the first two rows of the systolic array in Fig. 4.7(a). $\mathrm{C}(\mathrm{j})$ will, however, permit $y(i-1,0)$ and $y(i-1, j)$ to propagate through the multiplexers due to the operational specification of a vector cross-product operation. The partitioned architecture for a Type 7 operation is delineated in Fig. 4.7(b). The operational delay for the proposed architecture is 3 time units for $i_{\max }=1$ in the algorithm index set,

Table 4.10: GVT of a Type 7 Task

| ITERATION <br> (i,j) | BEGIN EXECUTION AT TIME | FINISH EXECUTION AT TIME | $\begin{array}{\|c\|} \hline \text { IN CELL } \\ (\mathrm{i}, \mathrm{j}) \end{array}$ | GENERATING VARIABLE |
| :---: | :---: | :---: | :---: | :---: |
| 0,0 | 0 | 2 | 0,0 | $\mathbf{~}(\mathbf{0}, \mathbf{0})$ |
| 0,1 | 1 | 3 | 0,1 | $y(0,1)$ |
| 0,2 | 2 | 4 | 0,2 | $\mathrm{y}(0,2)$ |
| 1,0 | 4 | 8 | 1,0 | $\mathrm{y}(1,0)$ |
| 1,1 | 5 | 7 | 1,1 | $\mathrm{y}(1,1)$ |
| 1,2 | 6 | 8 | 1,2 | y $(1,2)$ |
| 2,0 | 8 | 10 | 2,0 | y( 2,0 ) |
| 2,1 | 9 | 11 | 2,1 | $\mathrm{y}(2,1)$ |
| 2,2 | 10 | 12 | 2,2 | $\mathrm{y}(2,2)$ |

Table 4.11: UVT of a Type 7 Task

| ITERATION <br> $(\mathrm{i}, \mathrm{j})$ | USING | FROM | AT |
| :---: | :---: | :---: | :---: |
| VARIABLES | CELLS | TIMES |  |
| 0,0 | $\mathrm{y}(-1,0), \mathrm{y}(-1,1)$ | $(-1,0),(-1,1)$ | $-2,-1$ |
| 0,1 | $\mathrm{y}(-1,1), \mathrm{y}(-1,2)$ | $(-1,1),(-1,2)$ | $-1,0$ |
| 0,2 | $\mathrm{y}(-1,0), \mathrm{y}(-1,2)$ | $(-1,0),(-1,2)$ | $-2,0$ |
| 1,0 | $\mathrm{y}(0,0), \mathrm{y}(0,1)$ | $(0,0),(0,1)$ | 2,3 |
| 1,1 | $\mathrm{y}(0,1), \mathrm{y}(0,2)$ | $(0,1),(0,2)$ | 3,4 |
| 1,2 | $\mathrm{y}(0,0), \mathrm{y}(0,2)$ | $(0,0),(0,2)$ | 2,4 |
| 2,0 | $\mathrm{y}(1,0), \mathrm{y}(1,1)$ | $(1,0),(1,1)$ | 6,7 |
| 2,1 | $\mathrm{y}(1,1), \mathrm{y}(1,2)$ | $(1,1),(1,2)$ | 7,8 |
| 2,2 | $\mathrm{y}(1,0), \mathrm{y}(1,2)$ | $(1,0),(1,2)$ | 6,8 |

Table 4.12: IT of a Type 7 Task

| ITERATION <br> $(\mathrm{i}, \mathrm{j})$ | USING <br> INPUTS | FROM <br> CELLS | AT <br> TIMES |
| :---: | :---: | :---: | :---: |
| 0,0 | $\mathbf{x}(0), \mathbf{x}(1), \mathrm{C}(0,0)$ | $(-1,0),(-1,0),(0,-1)$ | $-4,-4,-1$ |
| 0,1 | $\mathbf{x}(1), \mathbf{x}(2), \mathrm{C}(0,1)$ | $(-1,1),(-1,1),(0,0)$ | $-3,-3,0$ |
| 0,2 | $\mathbf{x}(0), \mathbf{x}(2), \mathrm{C}(0,2)$ | $(-1,2),(-1,2),(0,1)$ | $-2,-2,1$ |
| 1,0 | $\mathbf{x ( 0 ) , \mathbf { x } ( 1 ) , \mathrm { C } ( 1 , 0 )}$ | $(0,0),(0,0),(1,-1)$ | $0,0,3$ |
| 1,1 | $\mathbf{x}(1), \mathbf{x}(2), \mathrm{C}(1,1)$ | $(0,1),(0,1),(1,0)$ | $1,1,4$ |
| 1,2 | $\mathbf{x}(0), \mathbf{x}(2), \mathrm{C}(1,2)$ | $(0,2),(0,2),(1,1)$ | $2,2,5$ |
| 2,0 | $\mathbf{x}(0), \mathbf{x}(1), \mathrm{C}(2,0)$ | $(1,0),(1,0),(2,-1)$ | $4,4,7$ |
| 2,1 | $\mathbf{x}(1), \mathbf{x}(2), \mathrm{C}(2,1)$ | $(1,1),(1,1),(2,0)$ | $5,5,8$ |
| 2,2 | $\mathbf{x}(0), \mathbf{x}(2), \mathrm{C}(2,2)$ | $(1,2),(1,2),(2,1)$ | $6,8,8$ |


(a): Systolic Architecture

(b): Partitioned Systolic Architecture

(c): An Individual Processing Element

Fig. 4.7: Systolic Architecture for a Type 7 Task
which is also the maximum outermost index limit specification for a recursive vector cross-product operation in the inverse dynamics computational problem. Note that an unit latency for input data synchronization is also included in the operational delay.

### 4.6 Optimal Buffer Assignment for the Formulation of a Balanced Systolic Architecture to Evaluate the Inverse Dynamics Computational Problem

The computational flow of tasks in the forward iterations of algorithm NEWTON_EULER (Section 4.4) as a directed task graph is shown in Fig. 4.8. Each node in this diagram represents a systolic processor of Section 4.5. The nomenclature, $\mathrm{t}_{\mathrm{y}} i$, next to each node denotes the type of systolic cell, where the variable $i$ identifies the type of cell. Also, the variable $x$ in the field $\mathrm{T}_{\mathrm{x}}$ inside each node specifies the actual task number.

The bottleneck in achieving maximum throughput in the architecture of Fig. 4.8 is due to the different arrival time of data into the multi-input systolic processing cells. Since computations may not be initiated in cells until all input data are available simultaneously, the pipelined execution time is therefore naturally lengthened. We must therefore balance the task graph to achieve maximum pipelining. This problem may be solved by assigning appropriate amounts of delay (using buffer stages) along some paths of the graph.

The problem of balancing a directed graph by inserting buffers along appropriate paths has been solved by the cut-set theorem [21,22], local correctness criterion [22], and the graph-theoretic approach [5,7]. However, we shall use a more systematic approach by formulating it as an integer linear programming problem [47].

Before formally presenting the procedure which may be used to generate a set of integer linear constraint equations for the balancing problem, the following notation/definitions must be introduced.

Definition 1: A Weighted Graph, WG, corresponds to a directed task graph using a given set of processing units which provides a weight variable $\omega_{\mathrm{i}}, \mathrm{i} \in \mathrm{Z}^{\mathrm{n}}$, along the output edges of every node $n_{i}$. The weight, $\omega_{i}$, constitutes the computational delay within node $n_{i}$.


Fig. 4.8: WG for Forward Iterations of N-E Algorithm

Definition 2: A weight, $\sum_{P\left(n_{i}\right) \in P_{k}\left(n_{i}\right)} \omega\left(P\left(n_{i}\right)\right)$, is defined as the sum of weights, or total cost, along the kth path, $\mathrm{p}_{\mathrm{k}}$, from the input node to multi-input node, $\mathrm{n}_{\mathrm{i}}$, in the Weighted Graph, WG.

Definition 3: The critical path, $p_{c}\left(n_{i}\right)$, denotes the path from the input node to multiinput node, $n_{i}$ which constitutes the largest weight variable, i.e. $\max \sum_{p\left(n_{i}\right) \in p_{k}\left(n_{i}\right)} \omega\left(p\left(n_{i}\right)\right)$, in the Weighted Graph, WG.

Definition 4: The Buffered Graph, BG, corresponds to a directed task graph which provides a set of buffers $\left\{B_{j}, \ldots, B_{j+n}\right\}, j \in Z^{n}$, along the ( $n-j$ ) output edges of multioutput node $n_{i}, i \in Z^{n}$, in WG that do not belong to output edges in the critical path $p_{c}\left(n_{0}\right)$, where $n_{0}$ is the output node of WG.

Note that an input node, $n_{\mathrm{I}}$, that supplies operand data for nodes in the directed task graph must be included in BG.

Definition 5: The variable, $\sum_{p\left(n_{i}\right) \in p_{k}\left(n_{i}\right)} \mid B\left(p\left(n_{i}\right) \mid\right.$, is defined as the sum of buffer variables, or total buffer cost, along the $k$ th path, $\mathrm{p}_{\mathrm{k}}$, from input node, $\mathrm{n}_{\mathrm{I}}$, to multiinput node, $\mathrm{n}_{\mathrm{i}}$, in the Buffered Graph, BG. Now that the nomenclature to be used has been discussed, we may proceed further by presenting a procedure which will generate an optimal buffer assignment for a given weighted graph.

## Procedure OBAP (Optimal Buffer Assignment Problem)

Input: Weighted Graph, WG.

Output: Optimal numerical buffer assignment of buffers $\mathbf{B}_{\mathbf{j}}, \mathbf{j} \in \mathrm{Z}^{\mathbf{n}}$, which minimizes the total number of buffers stages in Buffered Graph, BG.

STEP 1: Find critical path, $p_{c}\left(n_{o}\right)$, to output node $n_{o}$ from the input node of WG.

STEP 2: Obtain BG by assigning buffer stage variables to the output edges of multi-output nodes in WG, except for the output edges which belong to the critical path, $\mathrm{p}_{\mathrm{c}}\left(\mathrm{n}_{\mathrm{o}}\right)$, of the output node $\mathrm{n}_{\mathrm{o}}$.

STEP 3: For every multi-input node, $n_{i}$, except output node $n_{0}$, acquire a set of $m-1$ integer linear constraint equations, where $m$ is the number of inputs into node $n_{i}$. These equations may be found by,

$$
\begin{align*}
& \sum_{p\left(n_{i}\right) \in p_{k}\left(n_{i}\right) / p_{c}\left(n_{i}\right)}\left[\mid B\left(p\left(n_{i}\right)\right)+\omega\left(p\left(n_{i}\right)\right)\right] \\
& =\sum_{p\left(n_{i}\right) \in p_{c}\left(n_{i}\right)}\left[\mid B\left(p\left(n_{i}\right)\right)+\omega\left(p\left(n_{i}\right)\right)\right] \tag{4.8}
\end{align*}
$$

STEP 4: Apply integer linear programming [47] to minimize total number of buffer stages, $\mathbf{B}_{\mathbf{i}}$,

$$
\text { Minimize } \sum_{i=1}^{n}|B|
$$

subject to the constraints of STEP 3, where $n$ is the number of buffers.
The above procedure may now be applied to the weighted graph, WG, for the forward iterations of the Newton-Euler algorithm (in Fig. 4.8). The weights assigned to the output edges of each systolic node represents the computational delay within the node. These delays for the various types of systolic processors developed in Section 4.5 is summarized in Table 4.13.

STEP 1: The critical path $\mathrm{p}_{\mathrm{c}}\left(\mathrm{n}_{\mathrm{o}}\right)$ for the WG of Fig. 4.8 is along the nodes: T1-T2-T3-T4-T5-T7-T8-T12-T13-n.

STEP 2: By applying buffer assignment rules to the WG of Fig. 4.8, we obtain the normalized Buffered Graph BG as in Fig. 4.9.

STEP 3: The integer linear constraint equations on the buffering variables in BG may be found by applying Eq, (4.8) to each node:
node $\mathrm{T} 1:|\mathrm{B} 6|=0$
node T2: $|\mathrm{B} 4|=11$
node T3: $|\mathrm{B} 5|=14$


Fig. 4.9: BG for Forward Iterations of N-E Algorithm


Fig. 4.10: WG for Backward Iterations of N-E Algorithm

```
node T4: \(|\mathrm{B} 6|+|\mathrm{B} 12|=15\)
node T5: \(|\mathrm{B} 3|+|\mathrm{B} 15|=26\)
node T6: \(|\mathrm{B} 3|=|\mathrm{B} 11|+11\)
node \(\mathrm{T} 7:|\mathrm{B} 11|=11\)
node T8: \(|\mathrm{B} 6|+|\mathrm{B} 12|+|\mathrm{B} 18|=30\)
node T9: \(\mid\) B \(2|+|\mathrm{B} 14|=|\mathrm{B} 17|+26\)
node T10: \(|\mathrm{B} 2|=|\mathrm{B} 10|+|\mathrm{B} 11|+11\)
node T11:|B10 \(+|\mathrm{B} 11|=|\mathrm{B} 17|+11\)
node T12: \(\mid\) B17 \(\mid=8\)
node T13:|B7| \(=39\)
node T14:|B1| \(=|\mathrm{B} 9|+|\mathrm{B} 10|+|\mathrm{B} 11|+11\)
node T15: \(|\mathrm{B} 8|=8\)
node T16:|B1|+|B13| \(=|\mathrm{B} 16|+|\mathrm{B} 17|+26\)
node T17: \(\mid\) B9 \(|+|\) B10 \(|+|\) B11 \(|=|\) B16 \(|+|\) B17 \(\mid+12\)
```

STEP 4: Apply integer linear programming to minimize $\sum_{i=1}^{17}\left|B_{j}\right|$ subject to the constraints of STEP 3. The optimization process generates:

$$
\begin{aligned}
& |\mathrm{B} 1|=31,|\mathrm{~B} 2|=30,|\mathrm{~B} 3|=22,|\mathrm{~B} 4|=11, \\
& |\mathrm{~B} 5|=14,|\mathrm{~B}|=0,|\mathrm{~B} 7|=39,|\mathrm{~B}|=8, \\
& |\mathrm{~B} 9|=1,|\mathrm{~B} 10|=8,|\mathrm{~B} 11|=11,|\mathrm{~B} 12|=15, \\
& |\mathrm{~B} 13|=3,|\mathrm{~B} 14|=4,|\mathrm{~B} 15|=4,|\mathrm{~B} 16|=0, \\
& |\mathrm{~B} 17|=8,|\mathrm{~B} 18|=15
\end{aligned}
$$

Procedure OBAP may now be applied to the WG (in Fig. 4.10) for the backward iterations in program NEWTON-EULER. Intermediate results of steps in the procedure are as follows:

STEP 1: The critical path $p_{c}\left(n_{0}\right)$ for the WG in Fig. 4.10 is along the nodes: T19-T20-T21-T23-T24-n.

STEP 2: Applying buffer assignment rules to the WG of Fig. 4.10, the normalized buffered graph in Fig. 4.11 may be obtained.


Fig. 4.11: BG for Backward Iterations of N-E Algorithm

Table 4.13: Computational Delays for Basic Set of Systolic Processors

| Task | Systolic <br> Type <br> Opational <br> Delay | Systolic <br> Execution <br> Time |
| :---: | :---: | :---: |
| 1 | 1 | $3(n+1)$ |
| 2 | 1 | $3(n+1)$ |
| 3 | 3 | $4(n+1)$ |
| 4 | 8 | $8(n+1)$ |
| 5 | 8 | $8 n+10$ |
| $6 a$ | 11 | $11 n+13$ |
| $6 b$ | 8 | $8 n+10$ |
| $6 c$ | 11 | $11 n+13$ |
| 7 | 3 | $4(n+1)$ |

STEP 3: The integer linear constraint equation on buffering variables in Fig. 4.11 are as follows:
node T18: $|\mathrm{B} 22|=0$
node T19: $\mid$ B20 $=|\mathrm{B} 21|$
node T20:|B20 $+1=\mathrm{B} 25$
node T21: $|\mathrm{B} 19|=|\mathrm{B} 20|+4$
node T22: $|\mathrm{B} 21|+|\mathrm{B} 24|=8$
node T23: $|\mathrm{B} 20|=6,|\mathrm{~B} 22|+|\mathrm{B} 26|=11$
node T24:|B23| $=22$

STEP 4: Applying integer linear programming to minimize $\sum_{i=19}^{23} B_{i}$ subject to the constraints of STEP 3, we get:

$$
\begin{aligned}
& |\mathrm{B} 19|=10,|\mathrm{~B} 20|=6,|\mathrm{~B} 21|=6,|\mathrm{~B} 22|=0, \\
& |\mathrm{~B} 23|=22,|\mathrm{~B} 24|=2,|\mathrm{~B} 25|=7,|\mathrm{~B} 26|=11 .
\end{aligned}
$$

A balanced architecture for the computation of the inverse dynamics problem is thus found by pipelining the weighted systolic architecture of Fig. 4.8 to that of Fig. 4.10 through an intermediate LIFO (Last-In-First-Out) intermediate register file structure. The total time to compute the given problem on the developed systolic machine is $\left(\mathrm{t}_{\mathrm{crf}}+\mathrm{t}_{\mathrm{crb}}\right) \mathrm{n}+2$, where $\mathrm{t}_{\mathrm{crf}}$ is the operational delay of the critical path of Fig. 4.8 (forward iterations), $\mathrm{t}_{\mathrm{crb}}$ is the operational delay of the critical path in Fig. 4.10 (backward iterations), and $n$ is the number of manipulator joints. By using the operational latency data (Table 4.13) for all types of systolic processors designed in this chapter, the total execution time to evaluate the inverse dynamics problem of an $n$-link manipulator is $70 n+2$. Note that the coefficient $d_{1}$ (see Theorem 4.1) for our particular systolic machine for the computation of the desired problem is thus 70.

### 4.7 Conclusions

A novel systolic architecture to compute the inverse dynamics computational problem within the systolic execution time lower bound of o $\left[d_{1} n\right\rceil$, where $d_{1}$ is a specified constant, is discussed. A modified form of the systolic design methodology of Moldovan and Fortes [40] is used as the primary design tool for the architecture. Modifications were required to provide allowances for unequal execution times of processing cells, and to eliminate broadcasting of any input or generated variable
within the systolic array. The inverse dynamics problem is decomposed into a set of unidirectional and linear recurrence tasks amenable for direct mapping onto a fixed systolic system. A set of seven basic types of systolic processor architectures are developed to represent nodes in an acyclic systolic directed task graph to compute the Newton-Euler dynamics algorithm. The balancing of this acyclic task graph is reduced to an integer linear optimization problem for the assignment of delay buffers along alternate data routes. The final systolic architecture achieves maximal pipelining with a minimal number of delay buffers assigned along its various computational paths.

## CHAPTER 5 SUMMARY OF CONCLUSIONS

In chapter 2, the ideal lower bounds on the number of parallel processors and execution time to evaluate the inverse dynamics problem of an $n$-link manipulator in a parallel processing architecture using an optimal scheduling algorithm was proven to be $n$ and $o\left(\left[a_{1} n\right]\right)$ respectively, where $a_{1}$ is a specified constant. Next, a novel SIMD scheduling algorithm to perform parallel processing on a SIMD multiprocessor-based architectural model with a crossbar interprocessor network which performs close to the ideal lower bound is presented.. The performance of the N-E dynamics algorithm on the proposed architecture using the specified SIMD scheduling technique is next evaluated. Speedup factor of greater than 3.4 over previous related work [19] for the computation of the defined problem on a parallel processing system is achieved. The SIMD architectural model is then not used again to simulate the performance of PUMA forward and inverse kinematics algorithms because of the non-recursive and non-linear nature of these algorithms. A multiprocessor model with a shared memory interprocessor communication strategy is instead investigated for this purpose using the DFIHS scheduling algorithm. Simulation results show speedup of approximately 2 over the uniprocessor solution when the number of parallel processors is greater than eight and five for the cases of forward and inverse kinematics respectively. Simulation results showed very large speedups is not achieved in this latter multiprocessor architectural model due to high interprocessor communication latency costs. The important results of this chapter may be summarized as in Table 5.1. Note that the hardware overhead due to the crossbar interprocessor network must be included when evaluating the overall system cost for the computation of the inverse dynamics problem.

In chapter 3, we described a cost-efficient parallel and pipelined bit-serial system for the inverse dynamics computational problem to achieve the bit-serial execution time lower bound of $o\left(\left[c_{1} k+c_{2} k n\right\rfloor\right.$. A high performance multi-functional bit-serial

Table 5.1 Execution Times and Hardware Overhead for Computation of Robot Control Algorithms in a MC 68020-Based Multiprocessor System

| Robot <br> Control <br> Algorithm | Minimum <br> Execution <br> Time (in $\mu \mathrm{s})$ | Number <br> of <br> MC 68020's |
| :---: | :---: | :---: |
| Inverse <br> Dynamics | 513.84 | 8 |
| PUMA Forward <br> Kinematics | 392.60 | 8 |
| PUMA Inverse <br> Kinematics | 750.45 | 5 |

Table 5.2 Execution Times and Hardware Overhead for Computation of Inverse Dynamics Problem

| Computational <br> Approach | Minimum <br> Execution <br> Time | Hardware <br> Overhead |
| :---: | :---: | :---: |
| Parallel | 119 n | 8 PEs, <br> Crossbar <br> Interprocessor <br> Network |
| Bit-Serial | $19 \mathrm{k}+6 \mathrm{kn}$ | 32 Bit-Serial <br> Cells |
| Systolic | $70 \mathrm{n}+2$ | 24 Systolic <br> Processors |

cell architecture is described as a building block for the array configurations of the system. Zipper CMOS circuit design strategy is proposed for the cell's implementation to minimize propagation delays and maximize operating frequencies. For the case of a 16 -bit system word length and 20 MHz operating frequency, the total time to compute the inverse dynamics problem of a 6 -link manipulator on the proposed bit-serial architecture is only $44 \mu \mathrm{~s}$.

Finally, in chapter 4, a novel systolic architecture to compute the inverse dynamics computational problem within the systolic execution time lower bound of $o\left\lceil d_{1} n\right]$, where $d_{1}$ is a specified constant, is discussed. A modified form of the systolic design methodology of Moldovan and Fortes [40] is used as the primary design tool for the architecture. Modifications were required to provide allowances for unequal execution times of processing cells, and to eliminate broadcasting of any input or generated variable within the systolic array. The inverse dynamics problem is decomposed into a set of unidirectional and linear recurrence tasks amenable for direct mapping onto a fixed systolic system. A set of seven basic types of systolic processor architectures are developed to represent nodes in an acyclic systolic directed task graph to compute the Newton-Euler dynamics algorithm. The balancing of this acyclic task graph is reduced to an integer linear optimization problem for the assignment of delay buffers along alternate data routes. The final systolic architecture achieves maximal pipelining with a minimal number of delay buffers assigned along its various computational paths.

The performance and hardware complexity of alternate computational structures for the evaluation of the inverse dynamics problem is compared in table 5.1 The variables " n " and " k " in this table specify the manipulator link count and bit-serial system word length respectively. Clearly, the systolic approach achieves superior performance (assuming the word length for the bit-serial system is greater than 16). The hardware complexity for this approach is however highest due to the overhead associated with the control of the systolic arrays and the number of cells in the arrays themselves. The parallel processing approach achieves the next best execution time, where the crossbar network is the primary hardware overhead. Finally, the specified bit-serial architecture provides acceptable performance at the cost of moderate hardware. Lowest amount of hardware complexity is required for this case because of minimum circuitry needed for the implementation of bit-serial cells.

## LIST OF REFERENCES

[1] Acosta, R. and Kjelstrup, H.C., "An Instruction Issuing Approach to Enhancing Performance in Multiple Functional Unit Processors," IEEE Trans. Comp., Vol. C-35, No. 9, Sept. 1986, pp. 815-828.
[2] Ahmad, S., Meyer, D., and Rahman, M. "On the Design of SpecialPurpose Computational Structures for Robot Control: Design Constraints," Proc. 1986 Applied Motion Control Conf., Minneapolis, Minnesota.
[3] Ahmed, H. M., Delosme, J. M., and Morf, M., "Highly Concurrent Computing Structures for Matrix Arithmetic and Signal Processing," Proc. IEEE, Jan. 1982, pp. 65-82.
[4] Bejczy, A., "Robot Arm Dynamics and Control," Technical Memo 33-669, Jet Propulsion Lab., Pasadena, Calif., 1974.
[5] Brock, J. D., "Translation and Optimization of Dataflow Programs," Proc. 1979 Int. Conf. Parallel Proc., Aug. 1979, pp. 46-54.
[6] Buric, M. and Carver, M., "Bit-Serial Inner Product Processors in VLSI," CALTECH Int. Conf. on VLSI, January 1981, pp. 155-164.
[7] Dennis, J. B., and Gao, R. G., 'Maximum Pipelining of Array Operations on Static Dataflow Machine," Proc. 1983 Int. Conf. Parallel Proc., Aug. 1983, pp. 331-334.
[8] Denyer, P. and Renshaw, D., VLSI Signal Processing: A Bit-Serial Approach, Addison-Wesley, Reading, Mass., 1985.
[9] Du, H. C., "On the Performance of Synchronous Multiprocessors," IEEE Trans. Comp., Vol. C-34, No.5, May 1985, pp. 462-466.
[10] Fernandez, E. B. and Bussel, B., "Bounds on the Number of Processors and Time for Multiprocessor Optimal Schedules," IEEE Trans. Comp., Vol. C-22, No. 8, Aug. 1973, pp. 745-751.
[11] Fortes, J. A. B. and Moldovan, D. I., "Parallelism Detection and Algorithm Transformation Techniques useful for VLSI Architecture Design," J. Parallel Distrib. Comp., May 1985.
[12] Fortes, J. A. B., and Moldovan, D. I., "Data Broadcasting in Linearly Scheduled Array Processors," Proc. 11th Int. Symp. on Comp. Arch. , June 1984, pp. 84-98.
[13] Fu, K. S., Gonzalelez, R. C., Lee, C. S. G., Robotics: Control, Sensing, Vision, and Intelligence, Mc Graw Hill, 1987.
[14] Goncalves, N.F. and De Man, H.J., "NORA: A Racefree Dynamic CMOS Technique for Pipelined Logic Structures," IEEE J. Solid-State Circuits, Vol. SC-18, No. 3, June 1983, pp. 261-266.
[15] Horowitz, E. and Sahni, S., Fundamentals of Computer Algorithms, Computer Science Press, Inc., Rockville, Maryland, 1978.
[16] Hwang, K. and Briggs, F. A., Computer Architecture and Parallel Processing, Mc Graw Hill, 1984.
[17] Jensen, E., "The Honeywell experimental distributed processor-An overview," Computer , Vol. 11, pp. 28-38, Jan. 1978.
[18] Hollerbach, J., "A Recursive Lagrange Formulation of Manipulator Dynamics and a Comparative Study of Dynamics Formulation Complexity," IEEE Trans. Sys., Man and Cyber., Vol. SMC-10, No. 11, November 1980, pp. 730-736.
[19] Kasahara, H. and Narita, S., "Parallel Processing of Robot-Arm Control Computation on a Multimicroprocessor System," IEEE Robotics Automat., Vol. RA-1, No. 2, June 1985, pp. 104-113.
[20] Krambeck, R.H., Lee, C.M., and Law, H.F.S., "High-Speed Compact Circuits with CMOS," IEEE J. Solid-State Circuits, Vol. SC-17, No. 3, June 1982, pp. 614-619.
[21] Kung, S. Y., "VLSI Array Processors," IEEE ASSP, July 1985, pp. 4-22.
[22] Kung, S. Y., "Wafer-Scale Integration and Two-Level Pipelined Implementation of Systolic Arrays," J. Parallel Distrib. Comp., Vol. 1, No. 1, Sept. 1984, pp. 36-63.
[23] Kung, S. Y. et. al., "Wavefront Array Processor: Language, Architecture and Applications," IEEE Trans. Comp., Vol. C-31, No. 11, Nov. 1982.
[24] Kung, H. T., "Why Systolic Arrays?," Proc. IEEE , January 1982, pp. 37-46.
[25] Lang, T. et. al., "Bandwidth of Crossbar and Multiple Bus Connections for Multiprocessors," IEEE Trans. Comp., Vol. C-31, No. 12, Dec. 1982.
[26] Lathrop, L., "Parallelism in Manipulator Dynamics," M.I.T. Artificial Intelligence Lab., Cambridge, Mass., Tech. Rep. No. 754, Dec. 1983.
[27] Lawler et. al., The Traveling Salesman, John Wiley \& Sons, Great Britain, 1985.
[28] Lee, C.M. and Szeto, E.W., "Zipper CMOS," IEEE Circuits and Devices, May 1986, pp. 10-16.
[29] Lee, C.S.G. and Chang, P.R., "Efficient Parallel Algorithm for Robot Inverse Dynamics Computation," IEEE Trans. Sys., Man and Cyber., Vol. SMC-16, No.4, July/August 1986, pp. 532-542.
[30] Lee, C.S.G. et. al., "Development of the Generalized d'Alembert Equations of Motion for Mechanical Manipulators," Proc. 1982 Pattern Recognition and Image Processing Conf., Las Vegas, Nev., 1982, pp. 634-640.
[31] Li, G. J. and Wah, B. W., "The Design of Optimal Systolic Arrays," IEEE Trans. Comp., Vol. C-34, No. 1, Jan. 1885, pp. 66-77.
[32] Liao, F.Y. and Chern, M.Y., "Robot Manipulator Dynamics Computation on a VLSI Array Processor," Proc. 1985 Int. Conf. on Supercomputers, 1985, pp. 116-121.
[33] Lint, B. and Agerwala, T., "Communication Issues in the Design and Analysis of Parallel Algorithms", IEEE Trans. Soft. Eng., Vol. SE-7, No. 2, March 1981, pp. 174-188.
[34] Liu, C.H. and Chen, Y.M., "Multi-Microprocessor Based Cartesian-Space Control Techniques for a Mechanical Manipulator," IEEE J. Robotics Automat., Vol. RA-2, No. 2, June 1986, pp. 110-115.
[35] Luh, J.Y.S. and Lin, C. S., "Scheduling of Parallel Computation for a Computer-Controlled Mechanical Manipulator," IEEE Trans. Sys., Man and Cyber., Vol. SMC-12, No. 2, March/April 1982, pp. 214-234.
[36] Luh, J.Y.S., Walker, M.W. and Paul, R.P.C., "On-Line Manipulator Scheme for Mechanical Robots," IEEE Trans. Automatic Control, Vol. AC-25, No. 3, pp. 468-474.
[37] Lyon, R.F., "Two's Complement Pipeline Multipliers," IEEE Trans. Commun., Apr. 1976, pp. 418-425.
[38] Masson, G. M. et. al., "A Sampler of Circuit Switching Networks", Proc. IEEE , June 1979, pp. 145-160.
[39] Mead, C. and Conway, L., Introduction to VLSI Systems, AddisonWesley, Reading, Mass., 1980.
[40] Moldovan, D. I. and Fortes, J. A. B., "Partitioning and Mapping Algorithms into Fixed Systolic Arrays," IEEE Trans. Comp., Vol. C-35, No. 1, Jan. 1986, pp. 1-12.
[41] Moldovan, D.I., "On the Analysis and Synthesis of VLSI Algorithms," IEEE Trans. Comp., Vol. C-31, No. 11, Nov. 1982, pp. 1121-1126.
[42] Murty, K. G., Linear Programming, John Wiley \& sons, Inc.,1983.
[43] Nigam, R. and Lee, C.S.G., "A Multiprocessor-Based Controller for the Control of Mechanical Manipulators," IEEE J. Robotics Automat., Vol. RA-1, No. 4, Dec. 1985, pp. 173-182.
[44] Ohwada, N. et. al., "LSI's for Digital Signal Processing," IEEE J. SolidState Circuits, Vol. SC-14, No. 2, Apr. 1979, pp. 214-220.
[45] Orin, D.E., Chao, H.H. and Schrader, W.W., "Pipeline/Parallel Algorithms for the Jacobian and Inverse Dynamics Computations," Proc. 1985 IEEE Int. Conf. Robotics, pp. 785-789.
[46] Orin, D.E., "Kinematic and Kinetic Analysis of Open-Chain Linkages Utilizing Newton-Euler Methods," Math. Biosc., Vol. 43, 1979, pp. $107-$ 130.
[47] Papadimitrious, C. H. and Steiglitz, K., "Combinatorial Optimization: Algorithms and Complexity," Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1982.
[48] Quinton, P., "Automatic Synthesis of Systolic Arrays from Uniform Recurrent Equations," IEEE Conf. Par. Proc. , 1984, pp. 208-214.
[49] Rahman, M. and Meyer, D., "Case Study-A MC 68020 Based I/O Controller", in Topics in High-Level Computer Architecture, Milutinovic, V. ed., Computer Science Press, Rockville, Maryland, to appear, Aug. 1987.
[50] Rahman, M. and Meyer, D., "High-Performance Parallel and Pipelined Bit-Serial Architecture for Robot Inverse Dynamics Computations," Submitted to IEEE Trans. Sys.; Man, Cyber., January 1987.
[51] Siegel, L. et. al., "Performance Measures for Evaluating Algorithms for SIMD Machines," IEEE Trans. Soft. Eng., Vol. SE-8, No. 4, July 1982, pp. 319-330.
[52] Siegel, H. J., "Interconnection Networks for SIMD Machines," Proc. IEEE , June 1979, pp. 57-65.
[53] Siegel, H. J., "A Model of SIMD Machines and a Comparison of Various Interconnection Networks," IEEE Trans. Comp., Vol. C-28, No.12, Dec. 1979, pp. 907-917.
[54] Siegel, H. J., "The Theory Underlying the Partitioning of Permutation Networks," IEEE Trans. Comp., Vol. C-29, No. 9, Sept. 1980, pp. 791800.
[55] Tjaden, G. S. and Flynn, M. J., "Detection and Parallel Execution of Independent Instructions," IEEE Trans. Comp. , Vol. C-19, No. 10, Oct. 1970.
[56] Tomasulo, R. M., "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM J. , Jan. 1967, pp. 25-33.
[57] Walker, M.W. and Orin, D.E., "Efficient Dynamics Computer Simulation of Robotic Mechanisms," Robot Motion, Brady, M., ed., MIT Press, 1982, pp. 107-125.
[58] Watanabe, T. et. al., "Improvement in the Computing Time of Robot Manipulators Using a Multimicroprocessor," IEEE Trans. ASME, Vol. 108, Sept. 1986, pp. 190-197.
[59] Wong, Y. and Delosme, J. M., "Optimal Systematic Implementations of N-Dimensional Recurrences," Int. Symp. Comp. Arch. , June 1985, pp. 618-621.
[60] Zheng, Y.F. and Hemami, H., "Computation of Multibody System Dynamics by a Multiprocessor Scheme," IEEE Trans. Sys., Man, and Cyber., SMC-16, No. 1, Jan./Feb. 1986, pp. 102-110.

## Appendix A: N-E Equations of Motion for a Manipulator with Rotary Joints

Forward Iterations:

$$
\begin{aligned}
& \text { FOR i=0 TO (n-1) DO } \\
& \text { BEGIN } \\
& { }^{i+1} \omega_{i+1}={ }_{i}^{i+1} R\left({ }^{i} \omega_{i}+\dot{\theta}_{i+1}^{z}\right) \\
& { }^{i+1} \dot{\omega}_{i+1}={ }_{i}^{i+1} R\left({ }^{i} \omega_{i}+\ddot{\theta}_{i+1}^{z}+{ }^{i} \omega_{i} \times \dot{\theta}_{i+1}^{z}\right) \\
& { }^{i+1} \dot{v}_{i+1}={ }_{i}^{i+1} R^{i} \dot{v}_{i}+\left({ }^{i+1} \omega_{i+1} \times{ }^{i} P_{i+1}\right)+{ }^{i+1} \omega_{i+1} \times\left({ }^{i+1} \omega_{i+1} \times{ }^{i} P_{i+1}\right) \\
& { }^{i+1} \dot{v}_{c_{i+1}}=\left({ }^{i+1} \dot{\omega}_{i+1} \times{ }^{i+1} P_{c_{i+1}}\right)+{ }^{i+1} \omega_{i+1} \times\left({ }^{i+1} \omega_{i+1} \times{ }^{i+1} P_{c_{i+1}}\right)+{ }^{i+1} \dot{v}_{i+1} \\
& { }^{i+1} \mathrm{~F}_{\mathrm{i}+1}=\mathrm{m}_{\mathrm{i}+1}{ }^{\mathrm{i}+1} \dot{\mathbf{v}}_{\mathrm{c}_{\mathrm{i}+1}} \\
& { }^{i+1} N_{i+1}={ }^{i+1} I_{i+1}{ }^{i+1} \omega_{i+1}+{ }^{i+1} \omega_{i+1} \times\left({ }^{i+1} I_{i+1}{ }^{i+1} \omega_{i+1}\right)
\end{aligned}
$$

END

Backward Iterations:
FOR $\mathrm{i}=\mathrm{n}$ TO 1 DO
BEGIN

$$
\begin{aligned}
& { }^{i} f_{i}={ }^{i} F_{i}+{ }_{i+1}^{i} R{ }^{i+1}{ }_{f_{i+1}} \\
& { }^{i} n_{i}={ }^{i} N_{i}+\left({ }^{i} P_{i+1}+{ }^{i} P_{c_{i+1}}\right) \times{ }^{i} F_{i}+{ }_{i+1}^{i} R\left({ }^{i+1} n_{i+1}+{ }^{i} P_{i+1} \times{ }^{i+1} f_{i+1}\right) \\
& \tau_{i}={ }_{i-1}^{i} R^{z} n_{n_{i}}
\end{aligned}
$$

END
where,
$m_{i} \quad$ mass of link $i$
$\dot{\theta}_{i+1}^{z} \quad$ zth component of the joint velocity of link $i+1$
$\ddot{\theta}_{i+1}^{z} \quad$ zth component of the joint acceleration of link $i+1$
${ }^{i+1} \omega_{i+1} \quad$ angular velocity of link $i+1$ with respect to the $i+1$ th coordinate frame
${ }^{i+1} \dot{\omega}_{i+1} \quad$ angular acceleration of link $i+1$ with repect to the $i+1$ th coordinate frame
${ }^{i+1} \dot{v}_{i+1} \quad$ linear acceleration of link $i+1$ with respect to the $i+1$ th coordinate frame
${ }^{i+1} \dot{\mathbf{v}}_{\mathrm{c}_{\mathrm{i}+1}} \quad$ linear acceleration of the center of mass of link $i+1$ with respect to the $i+1$ th fra

| ${ }_{i}^{i+1} R$ | rotation matrix which maps position vectors from $i+1$ th frame to frame $i$ |
| :---: | :---: |
| ${ }^{1} \mathrm{P}_{\mathrm{i}+1}$ | origin of link $i+1$ with respect to the ith coordinate frame |
| ${ }^{i+1} P_{c_{i+1}}$ | location of the center of mass of link $i+1$ with respect the $i+1$ th coordinate fram |
| ${ }^{\mathrm{i}+1} \mathrm{~F}_{\mathrm{i}+1}$ | total inertial force exerted on link $i+1$ at the center of mass |
| ${ }^{\mathbf{i}+1} \mathrm{~N}_{\mathrm{i}+1}$ | total moment exerted on link $i+1$ at the center of mass |
| ${ }^{1}{ }_{i}$ | force exerted on link i by link i-1 |
| ${ }^{1} \mathrm{n}_{\mathrm{i}}$ | moment exerted on link i by link i-1 |
| $\tau_{i}$ | torque exerted by the actuator on link i |

## Appendix B: 3-D Matrix/Vector Arithmetic Operations Used In The N-E Dynamics Computational Problem

(1) Vector-Add (VA)

The vector sum $Y_{i}$ of vectors $a_{i}$ and $b_{i}$ is defined as,

$$
Y_{i}=\left(a_{i}+b_{i}\right) \quad \text { for } i=1,2,3
$$

(2) Scalar-Vector (SV) product

The scalar-vector product $Y_{i}$ of vector $a_{i}$ by scalar $K$ is defined as,

$$
Y_{i}=K a_{i} \quad \text { for } i=1,2,3 .
$$

(3) Inner product (IP)

The inner product $Y$ of vectors $a_{i}$ and $b_{i}$ is defined as,

$$
\mathrm{Y}=\sum_{\mathrm{i}=1}^{3} \mathrm{a}_{\mathrm{i}} \mathrm{~b}_{\mathrm{i}}
$$

(4) Matrix-vector (MV) product

The matrix-vector product $Y_{n}$ of matrix $a_{n i}$ and vector $b_{i}$ is defined as,

$$
Y_{n}=\sum_{i=1}^{3} a_{n i} b_{i} \quad \text { for } n=1,2,3
$$

(5) Vector cross (VC) product

The vector cross product, $Y$ of vectors $a_{i}$ and $b_{i}$, is defined as,

$$
Y=Y_{i}+Y_{j}+Y_{k}=\operatorname{det}\left|\begin{array}{ccc}
i & j & k
\end{array}\right|
$$

## Appendix C: PUMA Forward Kinematics Solution

The position and orientation of the end-effector of the PUMA arm with respect to a fixed reference coordinate system, given the joint angles and and geometric link parameters, may be found as follows:

The arm matrix for the PUMA robot arm is

$$
\mathbf{T}=\mathbf{T}_{1} \mathbf{T}_{\mathbf{2}}={ }^{0} \boldsymbol{A}_{f}{ }^{1} \mathbf{A}_{2}{ }^{2} \boldsymbol{A}_{3}{ }^{3} \mathbf{A}_{4}{ }^{4} \boldsymbol{A}_{9}{ }^{5} \boldsymbol{A}_{3}=\left[\begin{array}{cccc}
n_{r} & s_{x} & a_{r} & p_{r} \\
n_{y} & s_{y} & a_{v} & p_{y} \\
n_{7} & s_{z} & a_{y} & p_{z} \\
0 & 0 & 0 & 1
\end{array}\right]
$$

where,

$$
\begin{aligned}
& n_{2}=C_{1}\left[C_{23}\left(C_{4} C_{9} C_{9}-S_{4} S_{5}\right)-S_{23} S_{9} C_{6}\right]-S_{1}\left(S_{1} C_{9} C_{n}-C_{4} S_{5}\right) \\
& n_{y}=S_{1}\left[C_{23}\left(C_{4} C_{9} C_{9}-S_{4} S_{9}\right)-S_{73} S_{9} C_{6}\right]+C_{1}\left(S_{1} C_{9} C_{n}-C_{3} S_{9}\right) \\
& n_{3}=-S_{33}\left[C_{4} C_{5} C_{6}-S_{4} S_{5}\right]-C_{33} S_{5} C_{6} \\
& S_{x}=C_{1}\left[-C_{23}\left(C_{4} C_{5} S_{5}+S_{4} C_{6}\right)+S_{33} S_{9} S_{5}\right]-S_{1}\left(-S_{5} C_{5} S_{6}+C_{4} C_{6}\right) \\
& s_{y}=S_{1}\left[-C_{23}\left(C_{4} C_{9} S_{5}+S_{4} C_{6}\right)+S_{33} S_{5} S_{5}\right]+C_{1}\left(-S_{1} C_{4} S_{9}+C_{4} C_{6}\right) \\
& S_{7}=S_{77}\left(C_{4} C_{9} S_{6}+S_{4} C_{6}\right)^{\circ}+C_{23} S_{9} S_{6} \\
& a_{x}=C_{1}\left(C_{33} C_{4} S_{5}+S_{23} C_{9}\right)-S_{1} S_{4} S_{9} \\
& a_{7}=S_{1}\left(C_{23} C_{4} S_{3}+S_{23} C_{5}\right)+C_{1} S_{4} S_{9} \\
& a_{:}=-S_{33} C_{4} S_{5}+C_{23} C_{9} \\
& p_{x}=C_{1}\left(d_{3}\left(C_{23} C_{3} S_{9}+S_{23} C_{5}\right)+S_{23} d_{4}+a_{3} C_{23}+a_{2} C_{2} j-S_{1}\left(d_{4} S_{4} S_{9}+d_{2}\right)\right. \\
& p_{y}=S_{1}\left[d_{3}\left(C_{23} C_{4} S_{9}+S_{23} C_{5}\right)+S_{23} d_{4}+a_{3} C_{23}+u_{2} C_{2}\right]+C_{1}\left(d_{6} S_{4} S_{9}+d_{2}\right) \\
& p_{z}=d_{5}\left(C_{23} C_{3}-S_{23} C_{4} S_{5}\right)+C_{23} d_{4}-a_{3} S_{33}-a_{2} S_{2}
\end{aligned}
$$

## Appendix D: PUMA Inverse Kinematics Solution

The inverse kinematics problem can be stated as: Given the position/orientation of the manipulator hand and the link/joint parameters, determine the joint angles so that the manipulator can be positioned as desired. That is, given

$$
T=\left[\begin{array}{llll}
n_{z} & a_{z} & a_{y} & p_{y} \\
n_{y} & o_{y} & a_{y} & p_{y} \\
n_{z} & 0_{z} & a_{z} & p_{z} \\
0 & 0 & 0 & 1
\end{array}\right]=\left[\begin{array}{llll}
n & 0 & a & p \\
0 & 0 & 0 & 1
\end{array}\right]
$$

the joint angle equations are [2]:

$$
\begin{aligned}
& r=\left(p_{y}^{2}+p_{y}^{3}\right)^{1 / 2} \\
& \theta_{1}=\tan ^{-1}\left[\frac{p_{y}}{p_{z}}\right]-\tan ^{-1}\left[\frac{d_{3}}{=\sqrt{p^{2}-d_{3}^{2}}}\right] \\
& f_{11 p}=p_{z} C_{1}+p_{y} S_{1} \\
& f_{110}=o_{z} C_{1}+o_{y} S_{1} \\
& f_{13 p}=-p_{y} S_{1}+p_{y} C_{1} \\
& f_{13}=-o_{z} S_{1}+o_{y} C_{1} \\
& f_{114}=a_{z} C_{1}+a_{y} S_{1} \\
& f_{136}=-a_{2} S_{1}+a_{y} C_{1} \\
& d=f_{11 p}^{2}+f_{12 p}^{2}-d_{4}^{2}-a_{3}^{2}-a_{2}^{2} \\
& e=4 a_{2}^{2} a_{3}^{2}+4 a_{2}^{2} a_{4}^{2}
\end{aligned}
$$

$$
\begin{aligned}
& \theta_{3}=\tan ^{-1}\left[\frac{a_{3}}{-d_{4}}\right]-\tan ^{-1}\left[\frac{d}{ \pm \sqrt{e-d^{2}}}\right] \\
& \theta_{23}=\tan ^{-1}\left[\frac{w_{2} f_{11 p}-w_{1} p_{z}}{w_{1} f_{11 p}+w_{2} p_{z}}\right] \\
& \text { where } w_{1}=a_{2} C_{3}+a_{3} w_{2}=d_{4}+a_{2} S_{3}, C_{i} \equiv \cos \theta_{i}, \text { and } S_{i} \equiv \sin \theta_{i} . \\
& \theta_{2}=\theta_{n 3}-\theta_{3} \\
& \theta_{4}=\tan ^{-1}\left[\frac{-S_{1} a_{2}+C_{1} a_{y}}{C_{23}\left(C_{1} a_{3}+S_{1} a_{y}\right)-S_{33} a_{z}}\right] \\
& .=\tan ^{-1}\left[\frac{f_{134}}{C_{33} f_{114}-S_{33} a_{3}}\right] \\
& \theta_{5}=\tan ^{-1}\left[\frac{C_{4}\left(C_{23}\left(C_{1} a_{2}+S_{1} a_{y}\right)-S_{23} a_{1}\right)+S_{4}\left(-S_{1} a_{3}+C_{1} a_{y}\right)}{S_{23}\left(C_{1} a_{3}+S_{1} a_{y}\right)+C_{23} a_{3}}\right] \\
& =\tan ^{-1}\left[\frac{C_{4}\left(C_{23} f_{11 a}-S_{23} a_{z}\right)+S_{4} f_{13 a}}{S_{33} f_{11 a}+C_{33} a}\right] \\
& g_{3}=\tan ^{-1}\left[\frac{-C_{5}\left(C_{1}\left(C_{23} f_{110}-S_{23} 0_{2}\right)+S_{4} f_{130}\right)+S_{5}\left(S_{23} f_{110} \div C_{33} 0_{2}\right)}{-S_{4}\left(C_{33} f_{110}-S_{230}\right) \div C_{4} f_{130}}\right]
\end{aligned}
$$

where $\left(-\frac{d_{4}}{a_{2}}\right),\left(\frac{d_{4}}{a_{2}}\right),\left(\frac{d_{3}}{a_{2}}\right) \quad$ are constants, $\quad C_{i j} \equiv \cos \left(\theta_{i}+\theta_{j}\right), \quad$ and

$$
S_{i j} \equiv \sin \left(\theta_{i}+\theta_{j}\right)
$$

## Appendix E: DFIHS (Depth-First-Implicit-Heuristic-Search) Algorithm

The DF/IHS is a kind of depth first/heuristic search. The main feature of DF/IHS lies in the fact that, unlike the conventional DF/H method, it does not require the computation of the values of heuristic function for all active nodes with the largest depth in order to find, as the next branching node, the node having the smallest value. Prior to the search procedure, priorities are assigned to those nodes that may be generated during search by using the priority list of the CP/MISF method. In this manner, the memory requirements and the average computing time required for search can be reduced markedly, since the choice of the proper next branching node can be made without computing the heuristic values. The DF/IHS method is broken down into two parts, the preprocessing part to assign priorities heuristically to the nodes generated during search, and the depth-first search part.

1) Preprocessing (Task Renumbering): All tasks are renumbered in the same order as in the priority list of the CP/MISF method. This can be performed by the use of an $\mathrm{O}\left(\mathrm{n}^{2}\right)$ algorithm involving the solution of a longest-path problem and sorting.
2) Depth-First Search: In order to describe the features of DF/IHS, use is made of Kohler's general representation of the branch-and-bound method, $B B(B p$, $S, E, F, D, L, U, \in, R B$ ). (The dominant relation $D$ and the characteristic function $F$ are not used in DF/IHS).
a) Branching Rule Bp :
3) The original problem is decomposed along the time axis (an example of decomposition to subproblems).
4) Since optimal solutions may not be obtained by simply assigning the ready task(s) to the available processor(s), a fictitious task that forces a processor to be idle is introduced.
5) A total of

$$
n_{\text {branch }}=\left(n_{\text {ready }}+n_{\text {idle }}\right) C m_{\text {av }}
$$

nodes are generated from each node in the search tree, where $\mathrm{n}_{\text {ready }}$ is the number of ready tasks at time $t: \mathrm{n}_{\text {idol }}$ is the number of idle tasks to be considered at time $t$ (if $\mathrm{m}_{\mathrm{av}}=\mathrm{m}$ then $\mathrm{n}_{\mathrm{idle}}=$
$\mathrm{m}_{\mathrm{av}}-1$; if $1 \leq \mathrm{m}_{\mathrm{idle}}=\mathrm{m}_{\mathrm{av}}$ ); $\mathrm{m}_{\mathrm{av}}$ is the number of processors available at time $t$; C means combination.
b) Selection is made in the form of DF/FIFO.

1) Selection is made in the form of DF/FIFO.
2) A special table $R$ (Ready task table) and pointer $S P$ (next branching node selection pointer) for the memory of active nodes and selection of a next branching node are used to facilitate node selection by the computation of the order $O(m)$. At the same time, the storage requirement, which often becomes the bottleneck in using the branch-and-bound method, can be retained in the order $O(m \cdot n)$.
3) As the initial search solution, the CP/MISF solution obtained has an error bound, relative to the optimal solution, that is guaranteed. Thus any intermediate solution which has been obtained when the search procedure is terminated is superior to the CP/MISF solution in its accuracy.
c) Lower Bound Function L:
4) Fernadez' extension of Hu's lower bound

$$
\begin{gathered}
\mathrm{t}_{\mathrm{hu}}\left(\pi_{\mathrm{a}}\right)=\mathrm{t}_{\mathrm{cr}}\left(\pi_{\mathrm{a}}\right)+\left[\mathrm{q}\left(\pi_{\mathrm{a}}\right)\right] \\
\mathrm{q}\left(\pi_{\mathrm{a}}\right)=\max _{0 \leq \mathrm{t}_{\mathrm{hu}} \leq t_{\mathrm{r}} \leq\left(\pi_{\mathrm{a}}\right) \mathrm{t}_{0}}\left\{-\mathrm{t}_{\mathrm{k}}+(1 / \mathrm{m}) \int_{0}^{\mathrm{t}_{\mathrm{k}}} \mathrm{~F}(\vec{\tau}, \mathrm{t}) \mathrm{dt}\right\}
\end{gathered}
$$

where the load density function $F\left(\bar{\tau}_{j}, t\right)$ is given by

$$
\begin{gathered}
\bar{\tau}_{j}=t_{c r}\left(\pi_{\mathrm{a}}\right)-\mathrm{t}_{\mathrm{j}}-\mathrm{t}_{0} \\
\mathrm{f}\left(\bar{\tau}_{\mathrm{j}}, \mathrm{t}\right)\left\{\begin{array}{l}
=1, \text { for } \mathrm{t} \in\left(\bar{\tau}_{\mathrm{\tau}}, \bar{\tau}_{\mathrm{j}}+\mathrm{t}_{\mathrm{j}}\right) \\
=0, \text { otherwise }
\end{array}\right. \\
\mathrm{F}(\bar{\tau}, \mathrm{t})=\sum_{\mathrm{j} \in\left(\pi_{\mathrm{a}}\right)} \mathrm{f}\left(\bar{\tau}_{\mathrm{j}}, \mathrm{t}\right),
\end{gathered}
$$

is used in the light of the accuracy and the computing load involved. Here $\pi_{\mathrm{a}}$ is the partial problem decomposed by branching operations; $t_{0}$ is the allocation time represented by the current branching node; and [ x ] is the minimum integer greater
than x .
2) Since the calculation of the lower-bound function in 1) requires the use of a pseudo-polynomial time algorithm, two much simpler lower bounds with the complexity of at most $O(n)$ are used jointly so that the computing load for the bounding operation is minimized. Specifically, the computing load is much reduced by applying the bounding operations by the lower bound of 1 ) only to those nodes for which bounding by using the simple bounds has failed.
d) Upper Bound Cost U: Since a very accurate intermediate solution, i.e., $U$, can be obtained as the initial solution, the total search time can be reduced remarkably.
e) Elimination Rule $E: \mathrm{E}=\mathrm{U} / \mathrm{DBAS}$ (upper bound tested for dominance of descendants of branching node and members of currently active set).
f) Acceptable Approximation Error $\in$ :

1) While Kohler could only evaluate the relative error of the intermediate solution U from the lower bound as "Bracket," $B R$ $((\hat{U}-L) / \hat{U} S B R)$. DF/IHS can obtain an approximate solution whose relative error from the optimal solution does not exceed $\in$.

$$
\left(\hat{U}-t_{\mathrm{opt}}\right) / \mathrm{t}_{\mathrm{opt}} \leq \epsilon .
$$

2) Useless search computation can be avoided since the search process is terminated as soon as optimality is attained or a predetermined approximation error limit is reached by comparing the lower bound with the intermediate solution $\hat{U}$ or the approximate intermediate solution $\hat{\mathrm{U}}_{\epsilon}, \hat{\mathrm{U}}_{\epsilon}=\hat{\mathrm{U}} /(\mathrm{l}+\epsilon)$.
g) Resource Bound RB: In the conventional scheduling algorithms on the basis of the branch-and-bound method, the number of active nodes increases exponentially with the number of tasks $n$, and because of the limit of memory capacity, only small problems of about 40 tasks could be scheduled. By the use of special lists and pointers, however, DF/IHS can successfully reduce the number of active nodes for each of which it is required to memorize the state. The search process only requires the memory area of the order $O(m \cdot n)$. Thus the resource bound, i.e., storage limit, it rarely violated.

[^0]:    Rahman, Mahibur and Meyer, David G., "Preliminary Report on High-Performance Computational Structures for Robot Control" (1987). Department of Electrical and Computer Engineering Technical Reports. Paper 579.
    https://docs.lib.purdue.edu/ecetr/579

