Purdue University

Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports

Department of Electrical and Computer
Engineering

11-1-1987

Residue Arithmetic VLSI Array Architecture for
Manipulator Pseudo-Inverse Jacobian
Computation
P. R. Chang
Purdue University

C. S. G. Lee
Purdue University

Follow this and additional works at: https://docs.lib.purdue.edu/ecetr
Chang, P. R. and Lee, C. S. G., "Residue Arithmetic VLSI Array Architecture for Manipulator Pseudo-Inverse Jacobian Computation"
(1987). Department of Electrical and Computer Engineering Technical Reports. Paper 581.
https://docs.lib.purdue.edu/ecetr/581

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for
additional information.

Manipulator Pseudo-Inverse

Residue Arithmetic
Array Architecture
VLSI
fo
Jacobian Computation
P. R. Chang
C. S. G. Lee

TR-EE 87-42
November 1987

School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907

Residue Arithmetic VLSI Array Architecture for
Manipulator Pseudo-Inverse Jacobian Computation

P. R. Chang and C. S. G. Lee

Ransburg Robotics Laboratory
School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907

TR-EE 87-42

November 1987

- 2 -

ABSTRACT
Most Cartesian-based control strategies require the computation of the manipula
tor inverse Jacobian in real time at every sampling period. In some cases, the Jacobian
matrix is not of full column or row rank due to singularity or redundant robot
configuration. This requires the computation of the manipulator pseudo-inverse Jaco
bian in real time. The calculation of the pseudo-inverse Jacobian may become
extremely sensitive to small perturbation in the data and numerical instabilities, when
the Jacobian matrix is not of full column or row rank. Even if the Jacobian matrix is of
full rank, the ill-conditioned problem may still plague the computation of the pseudo
inverse Jacobian. This paper presents the use of residue arithmetic for the exact compu
tation of the manipulator pseudo-inverse Jacobian to obviate the roundoff errors nor
mally associated with the computations. A two-level macro-pipelined residue arithmetic
array architecture implementing the Decell’s pseudo-inverse algorithm has been
developed to overcome the ill-conditioned problem of the pseudo-inverse computation.
Furthermore, the Decell algorithm is quite suitable for VLSI array implementation to
achieve the real-time computation requirement. The first-level arrays are data-driven,
wavefront-like arrays and perform the matrix multiplications, matrix diagonal addi
tions, and trace computations. A pool or sequence of the first-level arrays are then
configured into a second-level macro-pipeline with outputs of one array acting as inputs
to another array in the pipe. The proposed architecture can calculate the pseudo
inverse Jacobian with a pipelined time in the same computational complexity order as
evaluating a matrix product in a wavefront array.

This work was supported in part by the NSF-Purdue Engineering Research Center under Grant
CDR8500022. Any opinions, findings, and conclusions or recommendations expressed in this article are
those of the authors and do not necessarily reflect the views of the funding agency.

- 31. Introduction

Robot manipulators are highly nonlinear systems and their control involves actuat
ing appropriate joint motors due to small changes in the pose (position and orientation)
of the manipulator hand along a planned Cartesian trajectory. This requires the com
putation of the manipulator inverse Jacobian at every sampling period in real time. The
manipulator Jacobian is a linear mapping which relates the derivatives of the jointvariables to the derivatives of the manipulator hand coordinates. The pose of the mani
pulator hand relative to a fixed inertial coordinate system can be described by a set of 6
algebraic equations containing the joint variables q1, q2, ' " ' , q.^ as
x=f(q)

;

(l)

where x = (px , py ,pz , 4>x , 4>y , <j>z )T is a 6-dimensional vector describing the pose of the
manipulator hand with respect to the inertial coordinate frame, q is the n-dimensional
joint-variable vector for an n-link manipulator, and the superscript "7’" indicates
matrix/vector transpose. Differentiating Eq. (1) with respect to time yields
v(t)
w{t) = J(q)q(t)

(2)

where v;(f) and cj(t) are, respectively, the linear and angular velocities of the manipula
tor hand, q = (, q2 > ' ‘ ‘ , qn )T is the joint-velocity vector of the manipulator, and
J(q) is the 6Xn Jacobian matrix (usually n > 6) and is a function of joint variables.
In general, the manipulator Jacobian (or forward Jacobian) is quite easy to deter
mine than the inverse Jacobian. However, most Cartesian-based control techniques
require the computation of the inverse Jacobian in real time at every control servo
period. In particular, the use of the inverse Jacobian in resolved motion rate control [1],
and the model-based control techniques in Cartesian space [2,3], has been a computa
tional bottleneck for these control schemes. In order to use these control schemes
effectively, efficient computation of the inverse Jacobian in real time must be developed.
Most of the existing methods in eompiuting the inverse Jacobian are formulated to
be computed by uniprocessor computers [4-6]. One of the most widely used methods in
computing the inverse Jacobian is first by determining the forward Jacobian and then
taking its inverse. Several efficient parallel algorithms have been developed for comput
ing the forward Jacobian [7-9]. If the manipulator has six degrees-of-freedom and is
nonsingular at q(i), then the Jacobian matrix may be inverted using normal matrix
inversion techniques for square matrices. For example, in the case of resolved motion
rate control, the joint-velocity vector can be computed from the manipulator hand velo
cities as
k(t)=r'mt)

(3)

where J *(q) is the 6X6 inverse Jacobian matrix. If the manipulator Jacobian is singu
lar at q(Z), then the inverse of the Jacobian is not defined because the Jacobian matrix

-4-

is not of full column or row rank. In the case of redundant robots, the inverse of the
Jacobian is also not defined because the Jacobian matrix is rectangular. Useful solu
tions to the above cases, however, can be obtained through the use of pseudo-inverse
[10-13]. The best approximate solution of the inverse Jacobian to Eq. (3) is given by the
pseudo-inverse [12,13] and is denoted by J+(q). Using the pseudo-inverse, Eq. (3) can be
written as [12,13]
q(t) = J+(q)x(t)-f-(I-J+(q) J(q))z

(4)

where I is an rtXn identity matrix and % is an arbitrary vector in the joint-velocity
space. Equation (4) indicates that the resultant joint velocities can be decomposed into
a combination of the least-squares solution of minimum norm plus a homogeneous solu
tion created by the action of a projection operator (I — J+J)f, which describes the
redundancy of the manipulator system. If the Jacobian is a square matrix and non
singular at q(i), then the projection operator is equal to the null operator and Eq. (4)
reduces to Eq. (3).
Many numerical algorithms, such as the Decell algorithm, the Grevill algorithm, the
Hermite algorithm, and the Gram-Schmidt orthogonalization algorithm, have been
developed to compute the pseudo-inverse Jf [12,13]. However, the calculations involved
in these four algorithms are quite sensitive to ill-conditioning in which the exact solution
is extremely sensitive to small perturbations in the data. Some researchers [14-16] con
sidered the possibility of using the residue arithmetic for the exact computation of
pseudo-inverses in order to obviate the roundoff errors normally associated with their
computations. These four algorithms have been implemented in the residue number sys
tem (RNS) for this purpose. Unfortunately, [15] showed that the Grevill algorithm is not
suitable for implementation in the residue arithmetic system. This is because the algo
rithm requires the greatest common divisor and the least common multiple in each itera
tion, otherwise the moduli choice may change very much. The Hermite algorithm and
the Gram-Schmidt orthogonalization algorithm require a lot of data broadcasting opera
tions which are not feasible for implementation in VLSI array architectures. The Decell
algorithm [14,15], based on the Cayley-Hamilton theorem, is an iterative procedure with
a sequence of matrix operations and can compute the pseudo-inverse J+ quite efficiently.
This sequence of matrix operations can be easily implemented in the residue arithmetic
and is quite suitable for VLSI implementation [17]. This paper presents the design of a
two-level macro-pipelined VLSI array architecture for the real-time computation of the
exact solution of the pseudo-inverse Jacobian using the Decell algorithm in the residue
arithmetic system.
The first-level array of the proposed architecture is an asynchronous data-driven,
wavefront-like array [17] and performs the matrix multiplication, matrix diagonal addi
tion, and the trace computation required in the Decell algorithm. The topological
f For brevity, we shall drop the function dependency of

q on

all Jacobian matrices and their inverse.

-5-

configuration of the first-level array is a two-dimensional processor array consisting of
three basic components: (1) an ordinary matrix multiplication wavefront array, (2) an
adder tree, and (3) I/O memory modules. The adder tree, connected to the diagonal
processing elements (PEs) of the ordinary wavefront array, is used to evaluate the trace
computation, and the calculation of the matrix multiplication and the trace computa
tion can be executed in parallel. The above executions are in an asynchronous datadriven manner, and the data transfers between a PE and its immediate neighbor are by
mutual convenience or handshaking protocol. A pool or sequence of the first-level arrays
are then configured into the second-level macro-pipeline with the outputs of one array
acting as inputs to another array in the pipe. For computing the pseudo-inverse of a
general p Xn {n > p) matrix J, the second-level macro-pipelined linear array is a pipe of
p stages and can evaluate a.total of p iterations of the Decell algorithm with the firstlevel array in each stage. The second-level linear array is a synchronous systolic array
whose data movements in the array are controlled by global timing-reference clock sig
nal. The results of each stage are stored in its local memory and waiting for activated
global synchronous signals. The pipelined time of the proposed two-level pipelined array
architecture has a computational order of O (n + 2p — 1) which is the same computa
tional complexity order as evaluating a matrix product in an ordinary wavefront array.
Furthermore, the primitive processing elements of the architecture perform computa
tions in the residue number system. Recently, [18] proposed attractive wafer-scale
integration (WSI) techniques for implementing the residue number system designs. WSI
technology is based on the concept of placing one or more functional processing units on
a semiconductor wafer. The processing units are then interconnected on the wafer as
opposed to individual packaging. The wafer-scale integration compresses a large
microelectronics representing a complete digital system onto a single wafer, and its pri
mary advantages are an improvement in total system density and a built-in faulttolerance capability. Thus, the WSI has the potential to provide a very cost-effective
and reliable way of implementing the proposed RNS two-level macro-pipelined array
architecture for the computation of manipulator pseudo-inverse Jacobian in real time.
2, Unpleasant Fact for Calculating the Pseudo-Inverse
If the manipulator Jacobian is not of full column or row rank, an unpleasant fact
exists for calculating the pseudo-inverse Jacobian J+. The rank of J which cannot be
determined exactly because it can be easily altered by an arbitrarily small perturbation.
This problem is commonly called the ill-conditioned problem for which the exact solution
is extremely sensitive to small perturbations in the data. In solving such problems, the
introduction of roundoff errors in the computations can be disastrous. This illconditioned problem can be explained in the following lemma and the proof of the
lemma can be found in [13].
Unpleasant Fact Lemma [13]. Suppose a matrix J £RpXn is of neither full
column nor row rank, then for any real number A; and any e > 0, there exists a matrix E,

-6-

| |E 11 < e, such that | |(J + E)+ - J+ 11 > k.
Even if the Jacobian matrix is of full rank, the ill-conditioned problem may still plague
the computation of the pseudo-inverse Jacobian. For example, consider the problem of
evaluating the determinant of the matrix
J

-73 78 24
92 66 25
-80 37 10

This problem is extremely ill-conditioned as shown by the fact that if the roundoff error
matrix is
0 0
E= 0 0
0 0

0
0
10 -2

then we have det(J) = 1, whereas det(J + E) = —118.94. In other words, if an accumu
lation of roundoff errors in the computations were to correspond to the introduction of
the perturbation matrix E, then the computed value of the determinant would be
—118.94, whereas the exact value is 1, even though the matrix J is of full rank.
The above lemma and example show that roundoff-error-free computations are
necessary for evaluating the pseudo-inverse Jacobian matrix. It is known that the resi
due number system (RNS) is the most popular technique for error-free computations in
signal processing and filter design [19]. We shall apply the residue arithmetic computa
tion technique for computing the “error-free” pseudo-inverse Jacobian.
3. Overview of Residue Number System
The residue number system (RNS) arithmetic is a familiar concept in the number
theory. In this section, we shall review some definitions and properties of the residue
number system that will lead us to the computation of the error-free pseudo-inverse
Jacobian.
Definition 1. Given any integer x and any modulus m, if r = x (mod m) and
0 < r < m, then we write r = \x |TO and say r is a residue of x modulo m.
With this definition, computations in single-modulus residue arithmetic is simple in the
sense that no negative integers are involved in the RNS. However, if we wish to solve
problems involving negative integers, we must be able to handle negative integers as
well as positive integers. One way to accomplish this is to introduce the system of sym
metric residues modulo m.
Definition 2. Given any integer a; and any modulus ra, if r = x (mod m) and
— m < r < — m, then we write r = fxfm and say r is a symmetric residue of x
2

2

-7-

modulo m.
It can be shown that both residue representations of x are unique. Furthermore, there
exists a conversion between | x | m and/x/OT,
A/m

>

A/m + m

and

if 0 < A/m < y
(5)

otherwise

/
\X

/ •^ / m

|

m

»

|x Im -m

ifO< |x \m < y
(«)

otherwise

Some important properties of the RNS are listed below:
0)

\x±y |m = I \x | m db | y |m | m = |x± |y |m |m .

(ii)

k y lm ^ I Aim Iy lm Im •

(iii)

\k x\m= \ky |m and gcd(A;,m) = 1, then
gcd(fc,m) = greatest common divisor of k and m.

\x |TO = \y \m,

where

Next, we introduce the concept of multiplicative inverse modulo m.
De finition 3. Assume x, y, and m > 1 are integers, and if 0 < y < m and
Ixy \m ;= \yx \m — 1, then we write y = s_1(m) and say y is a multiplicative inverse of
x modulo m.
Notice that regardless of the sign of x, the multiplicative inverse of x modulo m is
defined to be a positive integer in the range of 0 < x~x{m) < m. Of course, there are
questions of existence and uniqueness of the multiplicative inverse modulo m, and these
questions are addressed by the following theorem.
Theorem 1 [20]. If x and m > 1 are integers, then x_1(m) exists and is unique, if
and only if \x\m =£ 0 and gcd(z,m) = 1.
The choice of the modulo m is quite important. If m is not a prime number, then no
convenient explicit expression for x_1(m) has been discovered, whereas if m is a prime
number, one can find a convenient explicit expression for it.
Theorem 2 [Fermat] [20]. If x and m > 1 are integers, and if m is a prime number
and x-1(m) exists, then x_1(m) = |xm-2 |m = |( \ x |m)m_2 |m •
This theorem indicates that we can perform division to a limited extent. For exam
ple, if the quotient of two integers is again an integer, we can compute this quotient as
follows": If x, y, and m > .1 are integers, and if x_1(m) exists, then
x

= |yx-1(m) |m, even in those cases when x does not divide y. For example,

8

-

-

y
[— |n = 17 *9—1(ll) |n = 17 * 5)12 = 2. Finally, it should be mentioned that the above
9
properties are also suitable for the symmetric residue representation.
For any integer x, there are two different fixed-radix weighted number representa
tions: the signed number representation and the unsigned number representation. For
the signed number system, both positive and negative number numbers can appear.
Hence all integer values are assumed to be within the following range depending on
whithdr fit i§ bdd or even
m

x6
x£

2

m
2

’

(m —T)
2

1

(m — l)
’

2

, for m is even
for m is odd .

However, it is more convenient to use what appears to be only positive numbers by
mapping the negative range above the positive upper limit. This can be accomplished
by the following rule: If x is a negative integer that falls within the above negative
range, it can be represented by the positive number as,
rn — |x | = m + x, x < 0
which exceeds the positive upper limit but falls below m. Thus, any integer value
greater than the upper positive range limit represents a negative number. For example,
assume m = 60 and x^^ = —2, then xunsigned =58. It should be noted that these two
representations are identical, usually the signed number system is used in the symmetric
residue system and the unsigned number system is used in the common residue system.
3.1. Arithmetic Decomposition
Arithmetic representations can be decomposed linearly into an equivalent one con
taining many components, each with a relatively small arithmetic range [18,20]. This
decomposition leads us to a number of parallel and independent pseudo-inverse compu
tations, each using the residue arithmetic which allows us to use short registers and sim
ple arithmetic operations. Furthermore, since the decomposition is linear, the final
pseudo-inverse results for each parallel computation can be linearly recombined back
together, yielding the desired pseudo-inverse result. To achieve this objective, we need
to introduce the concept of multiple-modulus residue arithmetic.
Definition 4- Let ml5 ra2, * • • ,mL be the bases for a residue number system, where
v v".'
•
l
gcd(m.j-, nij) = 1, for i ^ j, and let M — m1 m2 • • • mL — JJ ny. The unique L-tuple
i=i
residue arithmetic representation of x is given by

l®lrn2.

>

\x\rn,) or x ~ {rj, r2 , •

••

, rL} .

-9-

where ?-j =:\x |m. Notice that the representation also holds true for the symmetric resi
due system. Similar to the RNS, there are some important properties for the multiplemodulus residue arithmetic:
(i)

Assumes ~{|® |mi, • • • ,\x |mJ and y ~ {\y |mj., • • • , |y |mJ
then \x ±y\M~{\x ±y\mi, ••• ,\x±y\m&}
and |s ^
~ { |s y |TOi, \xy \m^, • • • ,\xy jmt} .

(ii)

Multiplicative inverse

(iii)

(a)

x~l(M) ~{i"1(m1),i"I(m2), • • • ,af ^m,,)} .

(b)
r

If
ml , m2 , • • • , mi
are
all
prime
x1 (M) ~ {\xm 1-2 |mi, Is—2 L, ..., Is—2|mJ.

numbers,
;■

then

Quotient of two integers
(a)

l~ Im~{ I^/_1(^i) \mi, Ixy [m2) L? • • •, \xy~l{rnL) |^} .

(b)

If, in addition, the bases are all prime numbers, then
' If U - {k,\*ym^2 W- • •, I**”*-2 u>.

As an example for illustrating the idea of the multiple-modulus residue arithmetic, let
M — 60 —3X4X5 and xunsigned = 58, then, mx= 3, m2 = 4, m3 = 5 and |s |mi = l,
I3' lm2 = I3' Im3= 3* Thus, we haye s ~ {i, 2, 3}.
3.2. Arithmetic Recomposition (or Reconstruction)
Assume that the L-tuple residue representation of s, s ~ { |s |m , |s \m,2, ■ • • ,
|s |mi }, is given, one wants to find the corresponding value of s. Two techniques are
available to achieve this purpose: the Chinese remainder theorem (CRT) and the
mixed-radix recomposition (MRR).
3.2.1. The Chinese Remainder Theorem
Let ml5 m2 , • • • , fnL be the bases for a residue system where gcd (mt-, rrtj) = 1, for
L
i^j, and M =
m,-. Also, let ihj = M/mj, 1 < j< L. If x has the residue represen2=1

tation

a: ~ {rx, • • • , rt-, • •• , rL},
" L

where

rt- = | a: | m ,

1 < i < L,

then

'

I S
\rj
\m \m- Furthermore, x is expressed in the unsigned
i-i
number system and falls within the range [0,M— 1]. Thus, x = \x\M. Although the
CRT method is fairly simple in principle [20], it would not be able in its given form to be
implemented in a residue computer, since the residue computer is equipped to perform
arithmetic operations modulo m^, not modulo M operation as required by the CRT. In
contrast, the mixed-radix recomposition procedure can be implemented in a residue

10

-

-

maehine because it requires modulo m,- operations only.
3.2.2. The Mixed-Radix Recomposition
Szabo and Tanaka [20] proposed the mixed-radix recomposition (MRR) method to
avoid the modulo M addition operation in the CRT. The MRR is of great importance in
the residue arithmetic computation and the VLSI implementation for the following two
reasons:
(i) The mixed-radix system is a weighted number system and magnitude comparison
can be easily performed. Also, it can be shown that its VLSI implementation is
quite simple.
(ii) Conversion from the residue to a certain mixed-radix system is relatively fast in
residue computers.
The mixed-radix number system can be described as follows: Let ml, ra2, • • ■ , mL
be the pairwise relatively prime bases (or radices) for the mixed-radix system. A number
x may be expressed in the mixed-radix form as
2,-1
x = ai II mi + ' • ■ + a3m1m2 + a2ml + a1
in
i=1

where the are the mixed-radix digits and 0 < at- < m,. For a given set of radices, the
where the digits
mixed-radix representation of x is denoted by <a^ , aL-i> ' ' ’ ,
are listed in order of decreasing significance. It can be shown that
x G [0, ]"[ rrii —l] = [0, M—1] and has a unique representation. Notice that if x is in
. i==1
the signed number system, the digits ai should be within the interval (—— m,- , — mt ),
2
2
and then x G (——M, —M). For convenience, we will only discuss the case for the
2
2
unsigned number system.
Assume that the L-tuple residue representation of x, {r1,r2, • • • ,rL}, is given,
one wants to find the corresponding mixed-radix digits ot-, in terms of riy 1 < i < L.
The MRR method involves the following steps:
(i)

To obtain av Eq. (7) is first taken as modulo my Since all terms except the last
are multiples of m1? we have,
«i = I* lmi = ri .

(ii)

(8)

To obtain a2, we first form x — ax in its residue code, then take modulo m2 on
both sides of the following equation,
■■

'

■

2,-1

(x — ax) mod m2 =. {(aL
- i =3

+ • • • + a3) m2ml +

mxa2 mod m2 .

mod m2
(9.a)

-11 -

Because gcd (m.j, m2) = 1 and from Theorem 1, mj '1 (m2) exists, so we have
a2 mod m2 = | mf1 (m2)(\x | W2 - ax) |m2 - | mf1 (m2)(r2 - r2) |m2 .

(9.b)

Furthermore, since 0 < a2 < ra2, we have
a2 = a2mod m2 = | C712(r2-r1)|TO2

(M

where C l2 = m]-1 (m2).
(iii) For any y, and 1 < j <L, rj is given and a1, a2, • • • , ay_y have been evaluated
already. Thus, by induction, we have
aj = I ( • • • (((ry - «i)"»fX («»y) - a2) m2x (my)) • • • ) - ay-Omy.^ (my) |ro.
= !(' • • (((ry — ai)^iy — a2)^2j)''') ~ ay-i)

Im,

(410)* * * * *

where Qij;= mt_1(my).

It should be noted that if we are interested in the signed number system, then the pro
cedure is the same as the above equations except the symmetric residue j x jrn replaces
the \x |m.
The above residue arithmetic for integers can be extended to integral matrices. We
shall define the residue matrix modulo m and most of the results that we have discussed
for integers can be extended to integral matrices.
De finition 5. Let X = [ x,-y ] be a p Xn integral matrix and m > 1 be an integer. If
R — [r^y] is the matrix with elements defined by r,y = |x>y |m for all i and j, then we
write R = |X |m and say R is a residue of X modulo m.
We shall use this residue matrix modulo m concept in the residue number system to
compute the error-free pseudo-inverse Jacobian.
4. Computation of Pseudo-Inverse Jacobian Using RNS

The Decell algorithm for residue computation consists of a sequence of matrix mul
tiplications, matrix diagonal additions, and trace computations. These operations are
suitable for VLSI array implementation [17]. Stallings and Boullion [14] proposed using
the residue arithmetic to compute the Decell algorithm. They used the Chinese
remainder theorem (CRT) to recover the final solution. However, using the CRT for
recomposition involves the modulo M addition, which has the potential of resulting in
long registers and complicated arithmetic hardware. A better solution is to use the
mixed^radix recomposition (MRR) technique to avoid the modulo M addition by using
normal binary adders.
We shall first describe the concept of the Decell algorithm and then propose a VLSI
array implementation for it. Throughout this paper, we assume that A is a pXn matrix
and n > p, otherwise we could compute (Ar)+ using the relation A+ = [(A3V)+]r. The
Decell algorithm for computing the pseudo-inverse A+ is based on the Cayley-Hamilton

-12

-

theorem and consists of computing a sequence of matrices Aq , Aj , • • • , A.%, as follow:
B0=I,

Aq - 0 ,

?o = -1 >

Ax = A AT ,

<7i = tf'(Ai),

A2 - Ai Bi ,

?2 = “-tr(A2),

:

b2

:

-Q21

>

:

A^_! = Aj B^_2
AK = Aj BK_i,

~ A2

(11)

, qj( -l = (AiC-i) > Bff-1 = -^K-l ~ QK-J- >
qK =

tr {AK) ,

BK = AK — qKI,

The pseudo-inverse A+ is given by
A+ = —ArBK-i

(12)

■ qK
where tr (A, ) ^ trace of A,- , 1 < i < K, I is a p Xp identity matrix, and K is the rank
of A. Since the rank of A is not known apriori, the iteration is continued until the
matrix product Aj^B,- becomes a zero matrix for some i, i = 1,2, • • •' ,K. The termina
tion of the iteration will determine A as well as therank of A. The critical step in the
Decell algorithm involves the determination of whether or not A1Bt- = 0 for some i,
i =1,2, • • v , K. Obviously, the error-free computation is useful in overcoming the
numerical instability in this iteration as well as the determination of the zero matrix in
the matrix product. Thus, we shall use the residue arithmetic for the computation of
the Decell algorithm in an effort to obviate the round-off errors that may cause difficulty
in determining the zero matrix from the matrix product AjBj.
The use of residue arithmetic presumes that each element a- of the matrix A is an
integer. Although fixed word-length computers store only rational numbers, they can be
converted to integers by an appropriate scaling. For example, in the radix u system, o,y
can

be

evaluated

by

affiu* = u~s(
ajk~s^uk) = u~s aij, where
k=—s
k—0
v
=Y) alj~s^uk is the normalized integer and u~s is the constant scaling factor.
■ ' k=0 '
_
Thus, we can obtain. the normalized matrix A = us A. Since it is known that
A+ = (?/_'s-A)+ =—3—A+ = us A+, one may apply the proposed residue arithmetic
u s

.

_

Decell algorithm to obtain the pseudo-inverse A+ from which the desired pseudo-inverse
A+ may.'he obtained by multiplying the scaling factor us. For convenience, we assume
that the elements of A are integers in the following discussion.
Re-examining the equations (11) and (12) carefully again, one may find that the
determination of whether or not A1B,- = 0 for some i is a very critical step. If the ordi
nary fixed-point arithmetic is used, it will be extremely difficult, in the presence of

- 13
roundoff errors, ;o determine whether or not the elements of each iteration in Ajfi, are
exactly zero. Hence, the algorithm may suffer from the numerical instability. So, the
error-free computation must be used in Eq. (11) for overcoming this difficulty. However,
the evaluation of Eq. (12) can be done in the ordinary fixed-point arithmetic. Thus, the
residue arithmetic Decell algorithm can be divided into two parts: Eq. (11) is evaluated
in the residue arithmetic and Eq. (12) is evaluated in the ordinary fixed-point arith
metic.
Before discussing the residue arithmetic Decell algorithm, one has to consider the
proper choice of the range M and the selection of the bases (or rriodtili)
1 < i < L,
which are critical to the success of the method for calculating A+. Stallings and Boullion [14] suggested the following criterion for selecting M
M >max{X?

(13)

where X = min {tr (AAr) , | |AAt || } and K — rank (A). For simplicity, two practi
cal criteria are given as follows:
a)

u> 2 n (£ <$*
■■ i-i j=i

(i4)

where at-y is the {i , j) element of A.
(ii)

M > pp'/2 X?2

(15)

where X — the maximal absolute value of an element in the matrix AA . Stallings and
Boullion [14] also suggested that the bases (or moduli) m*, 1 < * < L, should be chosen
,

: ,, '; ..

l

■ ■;

to be large prime numbers greater than p and must satisfy M = J f rn^ then rank
(| A |m ) • rank (|AAr |TO ) = rank (AA3' ) = rank (A) — K, where 1 < «; < L. In
other words, the rank of | A |m is equal to rank (A) regardless of the modulus choice m,-.
This indicates that the residue arithmetic Decell iteration will terminate at the same
(if+l)th iteration for any different mt-, where K is the rank of A. Thus, we can use the
multiple-modulus arithmetic to calculate the pseudo-inverse. However, the last
(p +l)th iteration is not necessary when the rank of A is equal to p. Since
K < min(p ,n) —p, if the matrix multiplications Ay = A1By_1 are not equal to zero
matrices for the previous p iterations, where 1 < j <p, then Ap+1 of the last (p + l)th
iteration must be a zero matrix, and K is determined automatically and is equal to p,
otherwise K would be greater than p which is impossible. Figure 1 indicates that the
Decell algorithm may be decomposed into L parallel Decell algorithms, each of which
can be implemented using arithmetic modulo the respective component of M. The
recombination of the outputs from the L parallel systems may be achieved by using the
mixed-radix recomposition technique. The computational procedure of the residue
arithmetic Decell algorithm can be described in the following Residue Arithmetic Decell
Algorithm:

-14 Algorithm RADA (Residue Arithmetic Deceit Algorithm). Given a pXn matrix A
with normalized integral elements, p < n, the range M, and the moduli mt , 1 <i <L,
this algorithm uses the residue arithmetic to compute the pseudo-inverse A"1 based on
the Decell algorithm as in Eqs. (11) and (12).
R1

[Initialization.] Let A^ = 0, q*0 — —1, Bq = I, 1 < i '■< L.

R2

[Arithmetic Decomposition.]
FDfti -1 STEP 1 UNTIL L DO
(2.a)

Obtain JA |m,

(2.b) Perform A[ == ||A |m. jAr |m. |m_.
R3

[Deceit’s Iteration.]
FOR j = 1 STEP 1 UNTIL p DO

R4

Perform Aj = |A4 By_x |m.

R5

[Termination.]
(i) If Ay = 0, then stop and let
= j—1 and go to Step R2. Otherwise, con
tinue.
(ii) If j = p, then stop and let
= p and go to Step R2. Otherwise, continue.
R6

Perform q) = 11 j-1 |m, |tr(A}) |m> |m_
(notice that \j~1 |m are pre-computed integer values)

R7

.

Perform By = |Ay — <7yl| m.
End DoJBlock of Step R3
End Do_BIock of Step R2
'; k = kx= •• ■ =Ki • • • =kl.

R8

[Arithmetic Recomposition.] Reconstruct qK and B^_1 from q*K and B^:_1,
1 < i < L , by using the MRR technique (notice that K is the rank of A)

R9

[Compute the pseudo-inverse A+.] Compute A+ =
AT Bif.., in the ordi■■ ■■ vk
:v-v:.V
nary fixed-point system.

RIO [Output and terminate.] Output the pseudo-inverse A+ and terminate.
END RADA.
Let us give an example to illustrate the residue arithmetic Decell algorithm.
Assume
1 0 50 0
A= 0 2 0 1
1 0 0 0
and

choose

m2 — 11,

m3 = 17,

and

m4 = 23,

thus

M — 7 X 11 X 17 X 23 = 30107 satisfies the criteria of Eq. (14) or Eq. (15). '['he follow
ing discussion is in the signed number system. First, we compute
and
2

A/ =

0

1

0 -2 0

7/7

10 1
il = mad/, -1 ,Bi = /a;

1 0 1
«;i/v = 0 -3 0
1 0 0

3 0 2
A2 — /aI~bI/7 — 0 -1 0
201
#2

~ /9

*r (^2)/7 — //2 V7Ar (Ag)/7/7 ~ 7(—3)

-2 0 2

®2 — /Aj — 02

0 1 0

—

2

A,1 =

11-p
*1.91 /? —

03 =/^tr

-

0 3

-2

0

0

0

-2

0

0

0

-2

— —2 ,

A4 = A1B3 = 0. Terminate the RADA algorithm and obtain the resultant /g,3A and
/B2A- Similarly, repeat the above computations with moduli 11, 17, and 23, we obtain,
respectively, /A3A1 - 4, /g3A7 = /03A3 =
and
5 () -5
5 0 -5
5 0 -5
? foil17 i 0 1 0
, /B2/23 — 0 -7 0 .
n=; 0 1i 0
-5 (> -2.
-5 0 -7.
-5 0 -7;
So, the multiple-modulus residue representations are
03 ~ {“2, 4, 5,

{-2,5,5,5}
{0,0,0,0} {2,-5,-5,-5}
, B2~ {0,0,0,0}
{1,3,1,-7} {0,0,0,0}
.{2,-5,-5,-5} {0,0,0,0} {3,—2,—7,-7} J

One may use the MRR technique to reconstruct the qz and B2 with the pre
computed weighting factors, i.e., C12 = —3, C13 = 5, C14 = 10, C2z = —3, C24 = —2,
and C34 = —4, where
= mj-1(my). For example, the residue representation of qz is
given, find the associated mixed-radix digits <a4,a3,a2,a1> where the mixed-radix
expression is qz = a4(7XHX17) + a3(7Xll) + a2X(7) + av The associated mixed-radix
digits may be found by using Eqs. (8)-(l0) in the signed number system, then we have

- 16 al = —2, a2 = 4, a3 = —8 and a4 = 10. Furthermore, q3 can be evaluated as q3 = 12500.
Similarly, using the same procedures, the elements of B2 can be evaluated as
612 — b 2i = 5 23 = 5 22 = 0, b ii = 5, b j3 = —5> b 3i = —5, b 22 = 2500, b 33 ■== 125 05, where
bij is the (j , j) element of B2. Finally, we can compute A'! in the ordinary fixed-point
system as follows:

A+

—AtB2 =
9s

1

12500

1
0
50
0

0
2
0
1

1
0
0
0

5 . 0 ■ —5 ■
0 2500 0
-5 0 12505

0.
0.
0.02
0.

0.
1.
0.4 0.
0. —0.02
0.2 0.

5. Hardware Implementation of Residue Arithmetic
The implementation of the multiple-modulus residue arithmetic requires the con
sideration of three hardware components: (i) basic residue arithmetic processing Cells,
(ii) arithmetic decomposition cell, and (iii) arithmetic recomposition cell. It should be
noted that the implementations have been previously realized by VLSI/WSI techniques
[18].
5.1. Basic Residue Arithmetic Processing Cells
Two basic residue arithmetic processing cells are residue adder and multiplier. As
shown in Figure 2, one may use two A:-bit binary adders to realize a modulo rn residue
adder, where 2fe < m. Assuming that x and y are inputs to the first binary adder of the
residue adder, and if the result of the first binary adder overflows, it can be corrected for
the modulo m operation by adding (—m) to the result of the first addition. If the first
addition does not overflow, it may fall in the range [ m , 2k— 1 ], in which case it can also
be corrected by adding (—m) to the first addition. The carry bits generated from the
first binary adder or the second binary adder indicate whether or not (x + y) is greater
than m. A multiplexor controlled by the carry selects the correct output.
One familiar method of residue multiplication is “log transform-residue addition antilog transform” [18]. The concept of logarithms can be used for the multiplication of
field elements. The theory guarantees that there is at least one primitive element which
generates all nonzero field elements [20]. It is known that when m is a prime number
[20], the nonzero elements of GF(m) will be GF (m) = {l,2, • • • , m—l}. For each
i G GF (m), there is a corresponding element e, where 0 < e < m—1, such that i; = .//,
where fx is a primitive element of GF(m). The exponent is the logarithm of the element
i = //

and

log? = e .

(16)

The multiplication of two field elements is equivalent to the addition modulo (m—1) of
the corresponding exponents. If i1 = ix1 and i2 = fJ?2, then
=M|ei+e'iltm 0

^d

\el + e2|(m_i) = log(i1-i2)

(17)

Thus, multiplications can be implemented by adding the appropriate exponents as
determined from a logarithm table [21]- The procedure requires three steps: (1) find the
exponents e4- from the logarithm table, (2) add indices and take modulo (m—1), and (3)
find antilogarithm in the logarithm table to yield the result. This procedure is shown in
Figure 3.
5.2. Arithmetic Decomposition Cell

The decomposition of an input number consists of finding its residues for its various
moduli m4-. In general, this does not create problems because the input number can be
applied simultaneously to separate decomposition blocks as suggested in Figure 4.a.
Each decomposition block can be realized by a custom VLSI design based on a sequen
tial restoring division algorithm [18,22]. This yields a compact design that is easily
modified to simplify the 2’s complement to the RNS conversion process. The division
algorithm simplifies considerably when only the remainder is of interest. The resulting
architecture can be a pipeline requiring only subtractions. Assuming that the input is a
k-bit positive number, we can subtract from it the binary representation Of the modulus
m?- such that the highest order “1” bit of the representation is aligned with the highest
order input bit. If the result is negative, the original input is passed on (i.e. subtract 0).
If the result is positive, the subtracted version is passed on. In either case, a (A: —l)-bit
number will result. This number is now treated as the input, rat- is shifted one less posi
tion to the left, ahd the above process repeats until m,- has been totally shifted. The
remainder is the result after that step. A floor plan of a decomposition block to find the
modulo 5 residue is shown in Figure 4.b. With 2’s complement input Values, a positive
input is passed directly to the residue decomposers, and 0 is added to their outputs. For
a negative input, the input is first complemented (without adding one) and then passed
to the residue decomposer.
5.3. Arithmetic Recomposition Cell

Recomposition using the MRR process is shown in Figure 5 for a four-modulus sys
tem. The need for the modulo M addition required for the CRT can be eliminated.
Pipelined operation is possible so that a new value may enter every clock cycle. In the
recomposition architecture, the values
and W; are pre-computed and stored in a
read-only memory (ROM), where wk —

k—1

mi and ml — 1. Once, the residue inputs r,-,

o

1 < { <4, enter into the architecture, the residue adders and multiplier ate executed
from top to down. The buffers are used to synchronize the timing of the pipelines.
Thus, the resulting mixed-radix digits <a4,a3,a2, aj > are carried Out at the same
time. Later on, a tree-structured architecture With normal binary adders and multi
pliers can evaluate the resulting binary output from the mixed-radix digits.

-

18

-

Let us examine how the MRR process, of Eqs. (8)-(10) can be realized by the pro
posed pipelined implementation. From Eq. (8), the resulting al is equal to the input r1;
so rl would be shifted down through the buffers. Observing Eqs. (9.c) and (10), the
resulting r1 must also be sent to the right three pipes for evaluating a 2, a3, and a4.
These pipes are executed in their corresponding modulus residue number systems. For
example, the resulting a2 may he carried out and present at the second stage of the pipe
after a modulo m2 residue subtraction, x = |r2 — r.j| m£, and a modulo m2 residue multiplication, o2 = |C12x | mo. At the same time, the pipes for a3 and a4 are also executed in
the similar operations of modulo m3 and m4, respectively. These operations correspond
to the computations within the most inside parenthesis of Eqs. (9.c) and (10). Next, the
resulting a2 will be stored in the buffers of its pipe and also broadcasted to the other two
pipes. This process will continue to make the computations within the parentheses of
Eq. (10) for a3 and a4 to be executed from inside to outside. Because of the multiplepipe structure, the calculations of a2, a3, and a4 would be evaluated in parallel. Exa
mining the MRR architecture, one notes that multiplications are only required by the
moduli mit i =2, ••• ,L (in this case, L = 4). Thus, the best implementation is to
assign m1 the largest valued modulus and mL the smallest valued modulus. This allows
for short register (or buffers) and simple arithmetic operations with relatively small
arithmetic ranges, which can be implemented on the proposed pipelined architecture.
6. VLSI Architecture for Residue Arithmetic Decell Algorithm
From the first part of the Decell algorithm (i.e. Eq. (ll)), one finds that there are
three different matrix operations involved in the iteration of the algorithm: matrix mul
tiplication, trace computation, and matrix diagonal addition. Based on the discussion in
section 4, it is suggested that these operations in Eq. (ll) should be done in the residue
number system so that the exact determination of AjBt- = 0 for some i can be easily
detected. The determination of AjB,- = 0 terminates further iterations, and a matrix
multiplication and a scalar division in Eq. (12) are then executed in the ordinary fixedpoint arithmetic.
Using the multiple-modulus arithmetic to implement the Decell algorithm requires
the algorithm be decomposed into L identical parallel algorithms, each of which is
.

'-A-

■ ' -'A

:

L

:

implemented using arithmetic modulo mt- and L[mt = M. The results from the L parali=1

lel algorithms can be reconstructed by the MRR technique to obtain the desired result.
In this section, we shall discuss one of the L identical parallel algorithms.
There are at most (K+l) iterations involved in Eq. (ll). However, K is not known
until Ax Bt- becomes a zero matrix for some i. Thus, K or the rank of A will be deter
mined at the end of the iterations. In fact, the rank of A is. always less than or equal to
min (p ,n), that is, K < min (p ,n). If p < n, then A < p, and we have at most (p +1)
iterations. However, the algorithm RADA is section 4 indicates that the last (p T l)th

- 19 -

iteration is not necessary. Based on this fact, it is possible to implement Eq. (11) on a
linear array of p processing architectures as shown in Figure 6. Each processing archi
tecture is an asynchronous data-driven, wavefront-like, two-dimensional array. Each
performs the matrix multiplication, matrix diagonal addition, and trace computation
involved in one iteration of Eq. (11). Furthermore, the architecture also has to deter
mine whether or not A1 By is a zero matrix which is used to determine K and terminate
the iterations as well. Once K is known, no more operations will be executed. In this
case, inhibited tags are assigned to those output data qK , 1; By+1, and By of the
(/'C+l)th processing architecture with q%+\ — #y, BK = ®2T-1» and they are also used to
disable the computations of the next (p—K—1) architectures, where
and By_x are
the input data. When a processing architecture is disable, no operations will be exe
cuted, and the input data are stored in its local memory and it is waiting for the
activated global synchronous signal. The above design of Figure 6 is regarded as a twolevel macro-pipelined array: the first-level consists of data-driven, wavefront-like, twodimensional arrays (or the processing architectures), and the second-level consists of
cascading p of these arrays into a linear macro-pipelined array. In the second-level
pipelined array, the outputs of one array are acting as inputs to another array in the
pipe. Furthermore, it should be noted that the first-level array is in an asynchronous
data-driven manner, whereas the second-level linear array is a synchronized multipro
cessing in which the data movements in the array are controlled by global timingreference clock signal.
After (p—K—1) clock periods, the desired output data qK, By+1, and By_1 from
the (R-tT)th first-level array will be pumped out of the last pth first-level array of the
second-level pipe. Then, we use the gy and By_x to calculate the pseudo-inverse
_ 1 ATBy_iA+
Since the proposed architecture is a two-level macro-pipeline, the
<lK
desired results are pumped out in succession and at every clock period coming in* Next,
we shall discuss our design of the first-level, asynchronous data-driven, wavefront array.
Rung [17] proposed an asynchronous, data-driven, wavefront array to implement
matrix operations, in particular the matrix multiplication. The wavefront array com
bines the systolic pipelining principle and the data-flow computing concept. In fact, it
should be noted that the major difference between a wavefront array and a systolic
array is the data-driven property. There is no global timing reference in a wavefront
array, and yet the order of task sequencing is correctly followed. In the wavefront archi
tecture, the data transfer between a processing element (PE) and its immediate neigh
bors is by mutual convenience. Whenever data are available, the transmitting PE
informs the receivers, and the receivers accept the data whenever required. They then
communicate with the sender to acknowledge that the data have been received. This
scheme can be implemented by means of a simple handshaking protocol [17] which
ensures that the computational wavefronts propagate in an orderly manner instead of
crashing into one another. Since there is no need to synchronize the entire array, a

20

-

-

wavefront array is architecturally scalable. Kung [17] demonstrated how a square
matrix multiplication algorithm can be executed on a square, orthogonal wavefront
array. This concept can be extended to a more general case to perform the multiplica
tion of non-square matrices. If A is an mXn matrix and B is an nXp matrix, the wavefront array size needed is mXp, thus the computations are completed at time
n + (m — 1) T (p — l) — n+m +p —2. Extending this idea to our non-square matrix
multiplications for Ay = AAr of Eq. (11) and AT BJf_1 of Eq. (12), where A is a pXn
matrix and Bif_1 is a p Xp matrix, and n > p, one may use the wavefront arrays Of size
p Xp and size nXp to evaluate the AAr and ATB j, respectively, with the same com
putational time of (n +2p — 2).
With some modifications of the wavefront array, it is possible to evaluate the
matrix multiplication, trace computation, and matrix diagonal addition for One iteration
of Eq. (11) by the proposed array in the residue number system as shown in Figure 7.
Furthermore, the proposed array also includes the determination of whether or not
AiB; is a zero matrix. To explain the functions of the proposed array clearly, we shall
first discuss the main-frame of the processor array which does not include the adder tree
and dotted connected lines. This main-frame of the processor array is mainly used for
matrix multiplications only. It is quite important to illustrate how the matrix multipli
cation is executed on a square, orthogonal p Xp wavefront array. Let A=[aiy],
B = [6t-y ], and C = AB = [ Cty ] be all p Xp matrices. The matrix A can be decomposed
into columns Aj- and the matrix B into rows By, and we have
C=A1B1+A2B2+ ••• + Ap Bp

(18)

where the product A* B^ is the “outer product.” The matrix multiplication can then be
carried out in p sets of wavefronts (recursions), each executing one outer product:
c{k) = c(k-i) + Afc Bk
(18)
or equivalently,
(2°)
where

— aik and bjk\— bkj, for k = 1,2, • • • , p.

With reference to Figure 7, let us examine the computational wavefront for the first
recursion in the matrix multiplication. Suppose that the internal registers of all the PEs
are initially set to zero, that is, Cffl = 0, for all z, j. The entries of A are stored in the
memory modules in the left (in columns), and those of B in the memory modules on the
top (in rows). The process starts with PE(1,1), where cIP —
+ a 11X611 . The com
putational activity then propagates to the neighboring PEs, (1,2) and (2,1). Since they
both have input operands available (data-driven property), they will, respectively, exe
cute <7$ =■€$■■ + anX612 and C$ = C$ + a2iX6 n in parallel. The next wavefront
of activity will be at PEs (3,1), (2,2), and (1,3), thus creating a computational wavefront
traveling down the processor array. When the first wavefront sweeps through all the

- 21 -

PEs or all the available input operands of PE (1,1) are used up, the first recursion is over
and the result is stored in the internal register of PE (1,1). Similarly, when the second
wavefront sweeps through all the PEs, the computations of PEs (1,2) and (2,1) are com
pleted. Next, the computations will be completed at PEs (3,1), (2,2), and (1,3). Thus,
the completion of PEs will keep going and traveling down the processor array. The
above concept shows that the results of the diagonal PEs (circular cells) of the processor
array will be carried out at every two wavefronts after the computation of PE (1,1) is
completed.
For calculating the trace computation tr(AB), we shall use the above idea to
achieve this purpose. An adder tree of height f log2p ], connected to the diagonal PEs of
the processor array, computes the trace function in parallel with the calculation of the
matrix multiplication of AB. This tree-structured adder architecture is also in an asyn
chronous data-driven manner. For example, the first adder is activated after both com
putations of PE (l,l) and PE (2,2) are completed. This also implies that the adders
start their computations at every two wavefronts sweeping through the square processor
array after the PE (l,l) is completed. The last adder will start its computation immedi
ately after the PE (p ,p) has been completed. Thus, the result of the trace computa
tion, tr(AB), is carried out just after the completion of the matrix operation AB about
one addition time late. Using the above concept, we can calculate the As = A1BS_1 and
tr(As) =tr(A1Bs_1) of the sth iteration of Eq. (ll) by letting A = Ax and B = Bs_j
almost simultaneously. Note that these computations are in the RNS by using the resi
due arithmetic implementation. Furthermore, one may evaluate |gs |TOi = IN-1 |m,
|tr (A1BS_1)'| m. | m. in the residue arithmetic modulus ra,- manner by connecting a residue
multipier to the (residue) adder tree, where |s"1 | m is a precomputed constant. The
resulting |gs |m. is sent back to the diagonal PEs (circular cells) of the processor array in
order to calculate the matrix diagonal addition, |BS |ot = |AS — <2sl|m , of the sth itera
tion of Eq. (11). Before performing this matrix operation, the determination of whether
or not K L “
I is a zero matrix is performed. To achieve this purpose,
zero-value testing flags
generated by the input memory modules to test the input
data of the architecture. If any PE (i , j) has the result which is stored in the internal
register
and is not equal to zero, or either of the two inputs is the non-zero flag "A",
then both outputs of the PE will be assigned with the non-zero flags "A". Otherwise, the
outputs will remain to have the zero-value testing flags "*". It could be shown that if
the outputs of PE (p ,p) are
then all the entries of A1BS_1 shall be zero, otherwise,
at least one entry of A1BS_1 is non-zero. This resulting flags will be carried out immedi
ately after the computations of all the entries of A1BS_1 are completed. In addition, the
resulting flags also make the processor array and the output memory modules perform
the following functions:
(1) For the processor array, if the non-zero flag "A" is present at each PE of the proces
sor array, then the array performs the matrix diagonal addition, that is,

22-

-

|Bs|m, = |As- qsl| m.. For instance; the diagonal PEs (» , i) (circular cells) execute

the operation, R{i = \Hi{ — qs | m., while the off-diagonal PEs (square cells) have
their same internal register values. Otherwise, all the internal registers of the PEs
will be set to the initial state and the inhibited tags will be added to the output
data Bs. Once the global timing clock comes in, each column of Bs which is stored
in the internal registers of the vertical PEs will be pumped out upward along the
dotted output data line. This will make the input data and output data in the
same sequencing and data format.
(2) For the output memory modules of Bg_1, the entries of Bg_2 coming down from the
previous array are stored in the right internal buffers of the output memory
modules, and the entries of Bs_1 coming down from the current processor array are
stored in the left internal buffers. If the resulting flag
is present at each memory
module, then the inhibited tags will be assigned to the stored Bg_2 and they indi
cate that those data are the desired results. Next, the data in the right internal
buffers will be pumped out. Otherwise, the flag "A" controls the switches inside the
output memory modules and forces the data stored in the left internal buffers, i.e.
Bs_li to be pumped out. However, if the input Bs_2 have the inhibited tags, then
the above functions are disabled. The Bs_2 stored in the output memory modules
are in the waiting situation, before the global timing clock is present at this array.
(3) For tlie output memory module of qs, its functions are very similar to the output
memory modules of Bg _2. Before the resulting flag comes in, the qs_1 from the pre
vious array and the resulting qs have been stored in the right and left internal
buffers,; respectively. If qs_y has the inhibited tag, then the qs_x is waited for being
pumped out. Otherwise, the resulting flag
is present at the memory module,
making the qs_x being tagged. This also shows that the q8_x is the desired result.
Meanwhile, if the resulting flag is "A", then the qs will be pumped out and the com
putations will continue to find the desired result. Note that the data in the above
output memory modules of Bg_1 and qs are in the Waiting situation, before the glo
bal timing clock is present at this array.
If we let f* be the time for performing modulo mi residue scalar addition, tlm be the
time for performing modulo
residue scalar multiplication,
be the time for data
transfer, and to be the time for pumping out the results. It can be shown that the resi
due matrix multiplication |A;B, .,|
can be completed in (3p—2) (t^ + t}n +1^) time
units and the resulting residue trace computation |tr (Ag)| OT can also be completed with
(ta+tj) time delays after the residue matrix multiplication completes its operation.
Next, the residue scalar multiplication, k lm, - I k 11 rm ltr (AS) I m, I m,> and the |qs \
can be evaluated at [(3p—2)((* +tlm +t\) + [tla
+ {tlm +^)] time units. One can
see that the time for performing the three residue matrix functions exceeds the time for
evaluating the residue matrix multiplication in a normal wavefront array by only

- 23+ 2i| +3iJ) tixile delays. At almost the same time as the calculation of |gs | m is com
pleted, the resulting flags are also carried out and present at the output of PE (p ,p).
The resulting flags will pass through the dotted data lines, and then go back to the PEs
in the array. If the resulting flag is "A", then a residue matrix diagonal addition, i.e.,
|B, | m = |AS—
, will be evaluated. Otherwise, the internal registers of the PEs
will be set to the initial zero-value state. However, Some modifications will be required
in the first-level processor array of Figure 7 for evaluating |AX | m ,
| m, and
|m
because |Aj | m. ••= |AAr | m is a residue non-square matrix multiplication, where A is a
p Xn matrix, and it Mil take (n+2 p —2) (t*
+ td) time units to complete in a wavefront array of size pXp ■ Using the same idea of the above discussion, the architecture
can evaluate krlm, at tithe [(n+2p-l)(t* +'<&+*«) + *! + 2td ]. Mehntvhiie, wfe need
both resulting data Aj and Bf so that there are two internal registers and two output
data lines of each PE for having the Ax and Bj be stored in and ptimped out, respec
tively. The other modifications for the architecture are that the q0 = —1 and B0 = I
have been stored in the right internal buffers of the output memory modules already.
Otherwise, the functions of these output memory modules are similar to the output
memory modules of Figure 7. Furthermore, another non-square matrix product compu
tation Y — AT'BK_1 in Eq. (12) can be completed in [ (n +2p —2)(ta + tm + td)\ time
units by an ordinary fixed-point wavefront array of size nXp, where ta, tm, and td are,
respectively, the counterparts of t\, t^, and td in the ordinary fixed-point system. It
should be noted that the entries of Y are pumped out column by column. In other
words, n entries of Y are pumped out at the same time. So, the evaluation of the
pseudo-inverse in Eq. (12), A+ = ——Y, may be executed in parallel by using n fixed9k

point division processing elements which are connected to their corresponding rows of Y.
Therefore, the parallel evaluations can be completed in p (tdiv -Mo) Urhe units, where
tdiv is the time for performing a fixed-point division and t0 is the time for pumping out
the results.
In general, the performance (or throughput) of a VLSI array architecture can be
evaluated by the pipelined time, denoted by T, which is the time interval between two
successive computations for the architecture. In our case, the pipelined time of the pro
posed two-level macro-pipelined array is equal to the global timing clock period.
Because of achieving the synchronization df the pipe, the clock period is always equal to
the largest processing time of the first-level processor arrays plus the time for pumping
out the results. Since the Decell algorithm is implemented in the L-modulus arithmetic
system, so L two-level macro-pipes are required for the realization of the L identical
parallel Decell algorithms with different moduli m4- , i = 1, • ■ • , L. Thus, the maximum
clock period of the * th parallel macro-pipe, 1 < i <L, will synchronize the multiplepipe structure

- 24T - max { max { [ (3p — 1)(tza + tlm +1\ ).+ <*' + 2t'd +1*0 ] ,
1<i <L

[(n +2p — 1)(C +

■)■+■*<! +2^ +*0 1} >

[(n + 2p — 2)(£a + im + td )] , p [ (tdiv + io)]
As an example, for a 12-link redundant robot, the associated Jacobian is a 6X12 matrix,
and its pipelined time is
T = max{ max (24t* + 23^ -f 25tld + 4 ), 22(ta +tm +td ), p (tMv + t0)}
1<i <L

Next, we shall give some numerical data on evaluating the performance of the pro
posed architecture for computing the pseudo-inverse Jacobian for a 12-link redundant
robot. If the input data size to the proposed architecture is 16 bits, then from Table 1,
the suggested number of parallel pipes, L , is equal to 4, the largest modulus bit size, g,
is equal to 5, and the dynamic range M becomes 392863. Though it is not required to
use moduli of the same bit length, the choice of moduli rests on the desired dynamic
range M and the number of parallel pipes L. For example, from Table 2, a four moduli
{31, 29, 23, 19 } is chosen to be greater than p(=6) which yields a dynamic range of
392863 when a 16-bit input number is used. Chiang and Johnsson [23] estimated the
computational speeds of their residue adder and residue multiplier designs (in 4 /im
NMOS) and showed that the k-bit residue adder and the fc-bit residue multiplier,
respectively, require 10A; ~ 15k ns and 30k ~ 45k ns. Using these data, the upper com
putational time bounds for a 5-bit residue adder and a 5-bit residue multiplier are,
respectively, equal to 75ns and 225ns. On the other hand, if we were to use a 16-bit
fixed-point adder and multiplier implemented in a 1.5 /xm CMOS integrated circuit, a
46.8 ns addition time and a 64 ns multiplication time were reported [24]. For simplicity,
if we do not consider the time for data transfer and I/O interface and if we let
tdiv =
= 64 ns, the pipelined time for the proposed array architecture becomes
T = max(6975,2437.6,384) ns =6.975 /xs. Another desired performance measure to
evaluate the proposed array architecture is the initial delay time (or set-up time) which
is defined as the required time interval for obtaining the first resulting pseudo-inverse
Jacobian from the architecture. There are (p +4) processing stages in the proposed
array architecture: p stages for evaluating Eq. (11), two stages for evaluating Eq. (12),
and two stages for dealing with arithmetic decomposition and recomposition. Hence,
the initial delay time is equal to (p +4)XT = 10X6.975 /xs = 69.75 /xs, when p — 6. The
above performance evaluations indicate that the real-time computation of a general
pseudo-inverse Jacobian matrix is achievable with current VLSI custom or semi-custom
design technology.

- 25
7. Conclusion
A two-level macro-pipelined VLSI array architecture has been designed, based on
the iterative Decell algorithm, for the real-time computation of the exact solution of a
general p Y.,n ( n > p ) pseudo-inverse Jacobian matrix. Due to the ill-conditioned prob
lem in determining the rank of the manipulator Jacobian, the residue arithmetic is
employed to obviate the round-off error occurred in the Decell algorithm computations
in order to obtain an ezacLsolution. This is achieved by using the residue arithmetic in
determining whether or not the matrix product Ax
is exactly a zero matrix in the
(jPC-fl)th iteration of the Decell algorithm. The first-level arrays of the proposed archi
tecture are asynchronous data-driven, wavefront-like two-dimensional arrays which per
form the matrix multiplications, matrix diagonal additions, and trace computations in
parallel. A pool of the first-level arrays (including the arithmetic decomposition and
recomposition cells) are then configured into a second-level macro-pipeline of (p +4)
stages with outputs of one array acting as inputs to another array in the pipe. The
second-level linear array is a synchronous systolic array whose data movements in the
array are controlled by global timing-reference clock pulses. The pipelined time of the
proposed two-level pipelined array architecture has a computational order of
0(n + 2p — l) which is the same computational complexity order as evaluating a
matrix product in an ordinary wavefront array. For a 12-link redundant robot, a pipe
lined time of 6.975 p,s is achievable with current VLSI custom design technology.

- 26 -

8. References
[1]

D. E Whitney, “Resolved Motion Rate Control of Manipulators and Human
Protheses,,” IEEE Transactions on Man-Machine Systems, Vol. MMS-10, No. 2, pp.
47-53, June 1969.

[2]

J. Y. S. Luh, M. W. Walker, and R. P. Paul, “Resolved-Acceleration Control of
Mechanical Manipulators,” IEEE Trans, on Automatic Control, Vol. AC-25, No. 3,
pp. 468-474, Nov. 1980.

[3]

C. S. G. Lee and B. H. Lee, “Resolved Motion Adaptive Control for Mechanical
Manipulators,” Trans, of ASME, J. Dynamic Syst., Meas., Contr., Yol. 106, No- 2,
pp. 134-142, June 1984.
" '

[4]

R. P. Paul, B. E. Shimano, and G. Mayer, “Differential Kinematic Control Equa
tions for Simple Manipulators,” IEEE Transactions of Systems, Man, arid, Cyber
netics, Vol. SMC-11, no. 6, 1981.

[5]

R. Featherstone, “Position and Velocity Transformations between Robot EndEffector Coordinates and Joint Angles,” Int. J. Robotics Research, Vol. 2, No, 2,
Summer 1983.

[6]

M. B. Leahy et. al, “Efficient PUMA Manipulator Jacobian Calculation and Inver
sion,” J. Robotic Systems, Vol. 4, No. 2, 1987.

[7]

D. E. Grin and W. W. Schrader, “Efficient Computation of the Jacobian for Robot
Manipulator,” Int’l J. Robotics Research, Vol. 3, No. 4, Winter 1984.

[8]

D. E. Orin et al, “Systolic Architectures for Computation of the Jacobian for
Robot Manipulator,” Computer Architectures for Robotics and Automation, Gor
don and Breach, NY, 1987.

[9]

T. B. Yeung and C. S. G. Lee, “Efficient Parallel Algorithms and VLSI Architec
tures for Manipulator Jacobian Computation,” Technical Report TR-EE 87-16,
School of Electrical Engineering, Purdue University, May 1987,
[10] C. A. Klein and C. H. Huang, “Review of Pseudo-inverse Control for Use with
Kinematically Redundant Manipulators,” IEEE Trans, on System, Man and Cyber
netics, Vol. SMC-13, No. 13, pp. 245-280, March/April 1982.
[11] K. C. Suh and J. M. Hollerbach, “Local versus Global Torque Optimization of
Redundant Manipulators,” IEEE J. Robotics and Automation,, Vol. RA-3, No. 4,
Aug, 1987,
[12]

C. R. Rao and S. K. Mitra, Generalized Inverse: Theory and Applications, Wiley,
NY, 1974.

[13]

S. L. Campbell and C. D. Meyer, Jr., Generalized Inverse of Linear Transforma
tions, pp. 242-248, Pitman, London, 1979.

-27[14] W. T. Stallings and T. L. Boullion, “Computation of Pseudo-Inverse Matrices
Using Residue Arithmetic,” SIAM Review, Vol. 14, pp. 152-163, 1972.
[15] T. M. Rao et al., “Residue Arithmetic Algorithms for Exact Computation of
generalized-inverse of Matrices,” SIAM J. Numer. Anal., Vol. 13, pp. 155-171,
1976. .
[16] R. T. Gregory and E. V. Krishnamurthy, Methods and Applications of Error-Free
Computation, Springer-Verlag, NY, 1984.
[17] S. Y. Kung et ah, “Wavefront Array Processor: Language, Architecture, and
Applications,” IEEE Trans, on Computer, Vol. C-31, No. 11, pp. 1054-1066, Nov.
. 1932.
[18] B. W. Lamacchia and G. R. Redin bo, “RNS Digital Filtering Structures for
Wafer-Scale Integration,” IEEE J. on Selected Area in Communications, Vol.
SAC-4, No. 1, Jan. 1986.
[19] M. A. Soderstrand, Residue Number System Arithmetic: Modern Applications in
Digital Signal Processing, The IEEE Inc., 1986.
[20] N S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer
Technology, McGraw-IIill, NY, 1967.
[21]

G. A. Jullien, “Implementation of Multiplication Modulo A Prime Number, with
Applications to Number Theoretic Transforms,” IEEE Trans. Comput., Vol. C-29,
No. 10, pp. 899-905, Oct. 1980.
[22] F. J. Mowle, A Systematic Approach to Digital Logic Design, Reading, AddisonWesley, MA, 1983.
[23] C. L. Chiang and L. Johnsson, “Residue Arithmetic and VLSI,” Proc. of 1983 Inti
Conf. Comput. Design, pp. 80-83, 1983.
[24] D. A. Heniin et ah, “A 16 bit X 16 bit Pipelined Multiplier Macrocell,” IEEE J. of
Solid-State Circuits, Vol. SC-20, No. 2, pp. 542-547, April 1985.

- 28-

Table 1 L, g, and M for various input bit size

n

L

9

M

8
16
32
64

2
4
5
10

5
5
7
7

899
392863
17.2X109
9.98X1019

n is the input bit size.
L is the number of parallel pipes.
g is the largest modulus bit size.
M is the dynamic range.
Table 2. Prime Numbers from 1 to 128
Bits
3
4
5
6
7

Primes
2, 3, 5, 7
11, 13
17, 19, 23, 29,
37, 41, 43, 47,
67, 71, 73, 79,
101, 103, 107,

31
53, 59, 61
83, 89, 97,
109, 113, 127

-29-

Arithmetic decomposition

Processing
block mod m.
forEq. (11)

Processing
block mod m.
forEq. (11)

Processing
block for
Eq. (12) in fixedpoint arith.

Processing
block mod m{
for Eq. (11)

Figure 1. General residue arithmetic array architecture for the Decell algorithm.

>• Output

- 30 -

(-m) I k bits
Carry out

Carry out

select

kbit
binary
adder

adder

Figure 2. Implementation of residue adder with two k-bit binary adders.

X

Index
gen.
mod m
Mod m-1
adder

Inverse
index
mod m

Index

Y
mod m

Figure 3. Implementation of residue multiplier using index a,ddition.

- 31 -

Decomposition
for mod m„

Decomposition
for mod m.
RNS
output

Decomposition
for mod m.

to'

8 bit binary input

KEY:

iuuui
c

1

0

1

p

p

p

p

p

c

1

0

1

p

p

p

p

c

1

0

1

p

p

p

c

1

0

1

p

p

c

1

0

1

p

c

1

0

1

Control cell. If result of subtraction
is positive then allows result of the
subtraction to pass to the next level
If subtraction result is negative
(borrow requested from left end of
subtractor chain), non-subtracted
input was passed.
Subtract 1 cell
j Q |

(b)

Figure 4.

MOD 5 residue

Substract 0 cell

0

Pass cell. Does not include borrow
chain of the substract cells.

(a) General decomposition architecture.
(b) Residue decomposer (modulo 5) floor plan.

- 32Rasidue input

x) ni

X) m

XI m

Binary output
+ 1 Binary adder
7~1 Binary multiplier
©m Modmmultiplier
Mod m subtracter

©m

k-1

W,

nm
W3

W|«1

Buffer

Figure 5* Mixed-radix recomposition pipelined architecture.

- 33-

i

A

First-level
processor array

VBi’Bo
First-level
processor array
buffer

| q2 ’ B2 , B1

j

\

At

^q(P-i),B<

First-level
processor array

q_ • B ,
P
P

1r
Note that: the first-level processor array will be
implemented in Fig. 7.

Figure 6. The second-level p -stage linear pipelined array for equation (11).

- 34 -

Input memory modules

m

h

m

%13 ali all H*

~ “T “T
I

T

7

~'

~

*33 *32 a3I *“4*

rJL
output memory modules for Bs_

output memory module for q,

Figure 7.

(a) An asynchronous data-driven ■wnyefront-like processor array
for the sth. iteration of equation (11).

(aiz ^ out

B

(1)

If input data (bd)in have inhibited tags, then the PE (i,i) is disabled.

(2)

Initial: Internal register R# = 0.

The PE (t, i) is activated, when the following conditions are satisfied.
(1) Both {aiz)in and (bd)in are available at the PE (t‘, *), thus performing
(i)

Rx +- Ra, if either of (a^, (bd )fn = * or A or null

(ii) Otherwise R$ «— Rd
(2)

{aiz)inX(bd)in

Both d, and e are available at PE (t,t), thus performing
■^tt
.

C

e » if d — A
0,

if d = *

After PE (»',*) has been executed p times, the output c„- will be activated and

.

equal to the accumulated value in the internal register Rd ( =

p

' i=l ■
D

I/O data token operations:
(1)
= & andibd)out =
(2)

E,

if

or Min = A or {bd)in == A.

Otherwise Mora = Mim and {bd)owt = (V)i»Notice that
is a zero value testing flag and “A” is a non-zero flag.

Once the global timing clock comes in, the data stored in the current and previ
ous PEs will be pumped out along the dotted output data line.
Figure 7. (b) The diagonal PE (i,t ) (circular cells).

If input data. (6ZJr)tn have inhibited tags, then the PE (i,j) is disabled.
Initial: Internal register

= 0.

The PE (t',j) is activated, when the following conditions are satisfied.
(1) Both (a,-*),-* and (bZJ)in are available at PE
(i)

thus performing

Rij *—Rij, if either of (aiz)in and (bZJ)jn = * or A or null.

(ii) Otherwise Rij 1- Ri} + {aiz)inX{bzj)in
or
(2)

when d is present at PE

thus performing

Rij > if d = A
^ij

C.

D.

, 0,

if d = *

I/O data token operations:
(1)

Maid = A and (6^)0tt< - A, if % # Oor {aiz)in = A or

(2)

Otherwise

=A

= {aiz)in, and (bZJ)otd = {bzj)in.

Once the global timing clock comes in, the data stored in the current and previ
ous PEs will be pumped out along the dotted output data line.

Figure 7. (c) The off-diagonal PE (t ,jf) (square cells).

