Most Cartesian-based control strategies require the computation of the manipulator inverse Jacobian in real time a t every sampling period. In some cases, the Jacobian matrix is not of full column or row rank due to singularity or redundant robot configuration. This requires the computation of the manipulator pseudo-inverse Jacobian in real time. The calculation of the pseudo-inverse Jacobian may become extremely sensitive t o small perturbation in the data and numerical instabilities, when the Jacobian matrix is not of full column or row rank. Even if the Jacobian matrix is of full rank, the ill-conditioned problem may still plague the computation of the pseudo-inverse Jacobian. This paper presents the use of residue arithmetic for the emct computation of the manipulator pseudoinverse Jacobian to obviate the roundoff errors normally associated with the computations. A two-level macro-pipelined residue arithmetic array architecture implementing Decell's pseudo-inverse algorithm has been developed to overcome the ill-conditioned problem of the pseudo-inverse computation. Furthermore, the Decell algorithm is quite suitable for VLSI array implementation to achieve the realtime computation requirement. The first-level arrays are datadriven, wavefront-like arrays and perform the matrix multiplications, matrix diagonal additions, and trace computations. A pool of the first-level arrays are then configured into a second-level macropipeline with outputs of one array acting as inputs to another array in the pipe. The proposed architecture can calculate the pseudoinverse Jacobian with a pipelined time in the same computational complexity order as evaluating a matrix product in a wavefront array.
I. Introduction
Robot manipulators are highly nonlinear systems and their control involves actuating appropriate joint motors due to small changes in the pose (position and orientation) of the manipulator hand along a planned Cartesian trajectory. This requires the computation of the manipulator inverse Jacobian J-'(q) a t every sampling period in real time. Most of the existing methods in computing the inverse Jacobian are formulated to be computed by uniprocessor computers (1-3. If the manipulator Jacobian is singular a t q t), then the inverse o I the Jacobian is not defined because the Jacobian matrix is not of full column or row rank. In the case of redundant robots, the inverse of the Jacobian is also not defined because the Jacobian matrix is rectangular. For these cases, the 'oint rate can be obtained through the use of pseudo-inverse J+(q) I4-71, i ( t ) = J'(q);(t) + (1 -J+(q) J(q))s (1) where I is an n x n identity matrix and s is an arbitrary vector in the joint-velocity space. Equation (1) indicates that the resultant joint velocities can be decomposed into a combination of the least-squares solution of minimum norm plus a homogeneous solution created by the action of a projection operator (I -J+J t, which describes the redundancy of the manipulator system. If t l e Jacobian is a square matrix and nonsingular a t q t , then the projection operator is equal to the null operator and Eq. (llreduces to { ( t ) = J-'(q)k(t).
Many numerical algorithms, such as the Decell algorithm, the Grevill algorithm, the Hermite algorithm, and the Gram-Schmidt orthogonalization algorithm, have been developed to compute the pseudo-inverse J+ [8, 7] .. However, the calculations involved in these four algorithms are quite sensitive to ill-conditioning in which the emct solution is extremely sensitive to small perturbations in the data. Some researchers [S-lo] residue arithmetic in order to obviate the roundoff errors normally associated with their computations. These four a1 orithms have been implemented in the residue number system [ R N q for this purpose. Unfortunately, Rao [9] showed that the Grevill algorithm is not suitable for implementation in the residue arithmetic system. The Hermite algorithm and the Gram-Schmidt orthogonalization algorithm require a lot of data broadcasting operations which are not feasible for im lementation in VLSI array architectures. The Decell algorithm !8,9], based on the Cayley-Hamilton theorem, is an iterative procedure with a sequence of matrix operations and can compute the pseudo-inverse J+ quite efficiently. This sequence of matrix operations can be easily implemented in the residue arithmetic and is quite suitable for VLSI implementation [ll] . This paper presents the design of a two-level macro-pipelined VLSI array architecture for the realtime computation of the emct solution of the pseudo-inverse Jacobian using the Decell algorithm in the residue arithmetic system.
The first-level array of the proposed architecture is an asynchronous, data-driven, wavefront-like array (11) and performs the matrix multiplication, matrix diagonal addition, and the trace computation required in the Decell algorithm. The topological configuration of the first-level array is a two-dimensional processor array consisting of three basic components: (1) an ordinary matrix multiplication wavefront array, (2) an adder tree, and (3) 1 / 0 memory modules. The adder tree, connected to the diagonal processing elements (PES) of the ordinary wavefront array, is used to evaluate the trace computation, and the calculation of the matrix multiplication and the trace computation can be executed in parallel. The above executions are in an asynchronous data-driven manner, and the data transfers between a PE and its immediate neighbor are by mutual convenience or handshaking protocol. A sequence of the first-level arrays are then configured into the second-level macropipeline with the outputs of one array acting as inputs to another array in the pipe. For computing the pseudo-inverse of a general p X n (n 2 p ) matrix J, the second-level macro-pipelined linear array is a pipe of p stages and can evaluate a total of p iterations of the Decell algorithm with the first-level array in each stage. The secondlevel linear array is a synchronous systolic array whose data movements in the array are controlled by global timing-reference clock signal. The results of each stage are stored in its local memory and wait for activated global synchronous signals. The pipelined time of the proposed two-level pipelined array architecture has a computational order of O(n + 2p -1) which is the same computational complexity order as evaluating a matrix product in an ordinary wavefront array. Furthermore, the primitive processing elements of the architecture perform computations in the residue number system. This problem is extremely ill-conditioned as shown by the fact that if the roundoff error matrix is -80 37 10 then we have det(J) = 1, whereas det(J + E) = -118.94. In other words, if an accumulation of roundoff errors in the computations were to correspond to the introduction of the perturbation matrix E, then the computed value of the determinant would be -118.94, whereas the exact value is 1, even though the matrix J is of full rank.
JI. Unpleasant Fact
The above lemma and example show that roundoff-error-free computations are necessary for evaluating the pseudo-inverse Jacobian matrix. It is known that the residue number system (RNS) is the most popular techni ue for error-free computations in signal processing and filter design(fl21. We shall apply the residue arithmetic computation technique for computing the "error-free" manipulator pseudo-inverse Jacobian.
m. Overview of Residue N u m b e r System
In this section, we shall review some definitions and properties of the residue number system (RNS) that will lead us t o the computation of the error-free pseudo-inverse Jacobian. Definition 1. Given any integer z and an modulus m, if r = z (mod m) and 0 < r < m, then we write r = fz 1 and say r is a residue of z modulo 6.
With this definition, computations in single-modulus residue arithmetic are simple in the sense that no negative integers are involved in the RNS. However, if we wish to solve problems involving negative integers, we must be able to handle negative integers as well as positive integers. One way to accomplish this is to introduce the system of symmetric residues modulo m.
Definition 2. Given any integer z and any modulus m, if r = z (mod m) and -m 2 < r < m/2, then we write r = / z /~ and say r is a symmetric residueGf zmodulo m.
It can be shown that both residue representations of z are unique. where gcd (k,m\ = greatest common divisor of k and m.
Next, we introduce the concept of multiplicative inverse modulo m. Definition 8. Assume z , y , and m > 1 are integers, and if O < y < m and Iqlm= l y z I , = l , t h e n w e w r i t e y = z -' ( m ) a n d say y is a multiplicative inverse of z modulo m.
Notice that regardless of the sign of z, the multiplicative inverse of z modulo m is defined t o be a positive integer in the range of 0 < z-'(rn) < m. Of course, there are questions of existence and uniqueness of the multiplicative inverse modulo m, and these questions are addressed by the following theorem.
T h e o r e m 1 [13] . If z and m > 1 are integers, then z-'(m) exists and is unique, if and only if I z I
The choice of the modulo m is quite important. If m is not a prime number, then no convenient explicit expression for z-'(m) has been discovered, whereas if m is a prime number, one can find a convenient explicit expression for it.
For any integer 2, there are two different fixed-radi weighted number representations: the signed and the unsigned number representations. It should be noted that these two re resentations are identical, usually the signed number system is u s e l in the symmetric residue system and the unsigned number system is used in the common residue system.
III.1. Arithmetic Decomposition
Arithmetic representations can be decomposed linearly into an equivalent one containing many components, each with a relatively small arithmetic range 1131. This decomposition leads us to a number of parallel and independent pseudo-inverse computations, each using the residue arithmetic which allows us t o use short registers and simple arithmetic operations. Furthermore, since the decomposition is linear, the final pseudo-inverse results for each parallel computation can be linearly recombined back together, yielding the desired pseudo-inverse result. To achieve this objective, we need to introduce the concept of multiple-modulus residue arithmetic. where ri = 1 z I mi. Notice that the representation also holds true for the symmetric residue system. Important pro erties for the multiple-modulus residue arithmetic can be found in b3].
III.2. Arithmetic Recomposition (or Reconstruction)
Assume that the L-tuple residue representation of z, z -{ I z l m l , lzlm2, e * -, I~(~~} , i s g i v e n , o n e w a n t s t o f i n d t h e corresponding value of z. Two techniques are available to achieve this purpose: the Chinese remainder theorem CRT and the mixedradix recomposition (MRR . Although the C R ' ! method is fairly rimmented in a residue computer, since the residue computer is equipped to perform arithmetic operations modulo q, not modulo M operation a8 required by the CRT. In contrast, the mixed-radix recomposition procedure can be implemented in a residue machine because it requires modulo q operations only.
Ssabo and Tanaka (131 proposed the mixed-radix recomposition method to avoid the modulo M addition operation in the ple in principle 1131, it wou ) d not be able in its given form t o be imple-MRR is of great importance in the residue arithmetic computation and the VLSI implementation for the following two reasons:
(i) The mixed-radix system is a weighted number system and magnitude comparison can be easily performed. Also, it can be shown that its VLSI implementation is quite simple.
Conversion from the residue to a certain mixed-radix system is relatively fast in residue computers. The mixed-radix number system can be described as follows:
Let m, , m2, * . , mL be the pairwise relatively prime bases (or radices) for the mixed-radix system. A number z may be expressed in the mixed-radix form as
(ii)
,=I
where the ai are the mixed-radix digits and 0 < ai < mi. For a given set of radices, the mixed-radix representation of z is denoted by <aL , aL+, . * * , a,> where the digits are listeq in order of decreasing significance. It can be shown that
and has a unique representation. Notice that i) z is in the signed number system, the digits ai should be within the interval ( -q / 2 , q / 2 ) , and then z E ( -M / 2 , M / 2 ) . For convenience, we will only discuss the case for the unsigned number system. Assume that the L-tuple residue representation of 2, {rl ,r2, --,rL}, is given, one wants to find the corresponding mixed-radii digits ai, in terms of ri, 1 5 i 5 L . The MRR method involves the following steps:
To obtain a,, Eq. (3) is first taken as modulo m,. Since all terms except the last are multiples of ml, we have, a , = ~~~~~= r , .
To obtain a2, we first form z -a, in its residue code, then take modulo m2 on both sides of the following equation,
Furthermore, since 0 5 a 2 < m2, we have a 2 = a2 mod m2 = 1 C12 (r2
where Cl, = m;' (mz). (i) For any j, and 15 j < L , r . is given and a , , a z , * * . a.-1 have been evaluated alryady. khus, by induction, we have ' '
It should be n&ed that itwe are interested in the signed number system, then the procedure is the same as the above equations except the symmetric residue / z /,,, replaces the I z I m.
The above residue arithmetic for integers can be extended t o integral matrices. We shall define the residue matrix modulo m and most of the results that we have discussed for integers can be extended t o integral matrices.
Definition 5. Let X = [z, 1 be a pxn integral matrix and m > 1 be an integer. If R = Ir, 1 is the matrix with elements defined by r = 1% 1 for all i and 2 , then we write R = IX 1 , and say R is a reiidue o t Xmmodulo m .
We shall use this residue matrix modulo m concept in the residue number system to compute the error-free pseudo-inverse Jacobian.
IV. Computation of Pseudo-Inverse Jacobian Using RNS
The Decell algorithm for residue computation consists of a sequence of matrix multiplications, matrix diagonal additions, and trace computations. These operations are suitable for VLSI array implementation 1111. Stallings and Boullion [8J proposed using the residue arithmetic t o compute the Decell algorithm. They used the Chinese remainder theorem t o recover the final solution. However, using the CRT for recomposition involves the modulo M addition, which has the potential of resulting in long registers and complicated arithmetic hardware. A better solution is t o use the mixed-radii recomposition technique to avoid the modulo M addition by using normal binary adders.
Throughout this paper, we assume that A is a p ~n matrix and n > p otherwise we could compute,,$A')+ using the relation and K is the rank of A. Since the rank of A is not known apriori, the iteration is continued until the matrix product AiBi becomes a sero matrix for some i, i = 1,2, * . * ,K. The termination of the iteration will determine K and the rank of A. This is the critical step in the Decell algorithm. If the ordinary fixed-point arithmetic is used in the determination of whether or not AIBi = 0 for some i , it will be extremely d s c u l t , in the presence of roundoff errors, t o determine whether or not the elements of each iteration in AIBi are exactly sero. Hence, the algorithm may suffer from the numerical instability. So, the error-free computation must be used in Eq. (9) for overcoming this difficulty. However, the evaluation of Eq. (10) can be done in the ordinary fixed-point arithmetic. Thus, the residue arithmetic Decell algorithm can be divided into two parts: Eq. (9) is evaluated in the residue arithmetic and Eq. (10) is evaluated in the ordinary fixedpoint arithmetic.
The use of residue arithmetic presumes that each element ai. of the matrix A is an integer. Although fixed word-length compders store only rational numbers, they can be converted to integers by an appropriate scaling. For example, in t$e radix U system, aij can be evaluated by aij = a//-')uk) = U-' Cij, where Cij = a//-")uk is the normaliied integer and u -~ is the constant
k---r k=O h d 3
scaling factor. Thus, we can obtain the normalked matrix A = U ' A.
Since it is known that where X = min {tr (AAT), I lAAT I I} and K = rank (A). For simplicity, two practical criteria are given as follows:
where aij is the (i , j ) element of A.
M > p p 2 1 2 X P 2
AAT. Arithmetic Decomposition. 
{3,-2,-7,-7)
One may use the MRR technique t o reconstruct the q3 and B, with the pre-computed weighting factors, i.e., Cl2 = -3, C-y = 5, C,, = 10, C, = -3, C,, = -2, and C,, = -4, where C.. = m; (m.).
For example, the residue representation of q3 is given, fiXd the assokated mixed-radix digits <a,,a3 a., a > where the mixed-radix expression is q3 = a4(7Xllx17 + a3(7g1i) + a,X 7) + a The associated mixed-radix digits may i e found by using kqs. (4)(8) in the signed number system, then we have a, = -2, az = 4, a3 = -8 and a, = 10. Furthermore, q3 can be evaluated as q3 = 12500. Similarly, using the same procedures, the elements of B2 can be evaluated as b33= 12505 where 6,. is the (i , J ) element of B,. Finally, we can compute A I in the ordinary fixed-point system as follows: There are a t most (K+1) iterations involved in Eq. (9). However, K is not known until A, Bi becomes a zero matrix for some i. Thus, K or the rank of A will be determined a t the end of the iterations. In fact, the rank of A is always less than or equal to min (p ,n), that is, K < min (p ,n). If p 5 n, then K < p , and we have a t most (p + 1) iterations. However, the algorithm-MA in section VI indicates that the last (p + 1)th iteration is not necessary. Based on this fact, it is possible to implement Eq. (9 on a linear array of p processing architectures as shown in Fig. 2 . kach processing architecture is an asynchronous data-driven, wavefront-like, two-dimensional array. Each performs the matrix multiplication, matrix diagonal addition, and trace computation involved in one iteration of Eq. (9). Furthermore, the architecture also has to determine whether or not A, BK is a sero matrix which is used t o determine K and terminate the iterations as well. Once K is known, no more operations will be executed.
In this case, inhibited tags are assigned to those output data qK+,, BK+,, and B, of the (K+l)th processing architecture with qK+, = q, , BK = BK-,, and they are also used to disable the computations of the next (p-K-1) architectures, where qK and BK-, are the input data. When a processing architecture is disabled, no operations will be executed, and the input data are stored in its local memory and it waits for the activated global synchronous signal. The above design of Fig.  2 is regarded as a two-level macro-pipelined array: the first-lefel consists of data-driven, wavefront-lie, two-dimensional arrays (or the processing architectures , and the second-level consists of cascading p of these arrays into a k e a r macro-pipelined array. In the secondlevel pipelined array, the outputs of one array are acting as inputs t o another array in the pipe. Furthermore, it should be noted that the first-level array is in an asynchronous data-driven manner, whereas the second-level linear array is a synchronised multiprocessing in which the data movements in the array are controlled by global timing-reference clock signal. After (p-K-1) clock periods, the desired output data q K , BK+,, and BK-] from the K + l ) t h first-level array will be pumped out of the last p t h first-level array of the second-level pipe. Then, we use the qK and BK-l t o calculate the pseudo-inverse A + = LATBK-I. Since the proposed architecture is a two-level macro-pipeline, the desired results are pumped out in succession and a t every clock period coming in. Next, we shall discuss our design of the first-level, asynchronous data-driven, wavefront array.
94
demonstrated how a square matrix multiplication executed on a square, orthogonal wavefront array. be extended to a more general case to perform the multiplication of non-square matrices. If A is an m x n matrix and B is an nxp matrix, the wavefront array sire needed is m x p , and the a t time are completed com utations n + pm -1) + (p -1) = n + m + p -2. Extending this idea to our non-square matrix multiplications for A, = AAT of Eq. (9) and A T B K -~ of Eq. (lo), where A is a pxn matrix and BK-, is a pxp matrlx, and n 2 p , one may use the wavefront arrays of size p Xp and sise nxp t o evaluate the AAT and ATBK-,, respectively, with the same computational time of (n + 2p -2).
With some modifications of the wavefront array, it is possible to evaluate the matrix multiplication, trace computation, and matrix diagonal addition for one iteration of Eq. (9) by the proposed array in the residue number system as shown in Fig. 3 . Furthermore, the proposed array also includes the determination of whether or not A,B; is a zero matrix. To explain the functions of the proposed array clearly, we shall first discuss the main-frame of the processor array which does not include the adder tree and dotted connected lines. This main-frame of the processor array is mainly used for matrix multiplications only. It is quite important to illustrate how the matrix multiplication is executed on a square, orthogonal xp wavefront array. With reference to Fig. 3 , let us examine the computational wavefront for the first recursion in the matrix multiplication. Suppose that the internal registers of all the PES are initially set to sero, that is, CP) = 0, for all i , J . The entries of A are stored in the memory modules in the left (in columns), and those of B in the memor modules on the top (in rows4 The process starts with PE(1,lt where C[:) = Cfil + aIlXb1,. he computational activity then propagates to the neighboring PES, 1,2) and (2,l). Since they and res ectivel CA$ = CA$; a2,xbll in parallel. The next wavefront of activity will both have input operands available data 6 riven property), they will, execute hi;) C @ + allxbI2 be a t PES 3,1), (2,2), and (1,3), thus creating a computational wavesweeps through all the PES or all the available input operands of PE (1,l) are used, the first recursion is over and the result is stored in the internal register of PE (1,l). Similarly, when the second wavefront sweeps through all the PES, the Computations of PES (1,2) and 2,l) are completed. Next, the computations will be completed a t LEs (3,1), [2, 2) , and (1,3). Thus, the completion of PES will keep going and traveling down the processor array. The above concept shows that the results of the diagonal PES (circular cells) of the processor array will be carried out a t every two wavefronts after the computation of PE (1,l) is completed.
For calculating the trace computation t r (AB), we shall use the above idea to achieve this purpose. An adder tree of height [ log9 1, connected to the diagonal PES of the processor array, computes the trace function in parallel with the calculation of the matrix multiplication of AB. This tree-structured adder architecture is also in an asynchronous data-driven manner. This implies that the adders start their computations a t every two wavefronts sweeping through the square processor array after the PE (1,l is completed. The last adder will start its computation immediately after the PE (p , p ) has been completed. Thus, the result of the trace computation, t r (AB is carried out just after the completion of the matrix operation AA about one addition time late. Using the above concept, we can calculate the A, = AIB,-, and t r A ) t r (AlB8-l) of the 8 t h iteration of Before performing this matrix operation, the determination of whether or not IA, I , , = IA,B,-, I m, is a zero matrix is performed. To achieve this purpose, zero-value testing flags "*" are generated by the input memory modules to verify the result of the architecture. If any P E (i , j ) has the result which is stored in the internal register R and is not equal to eero, or either of the two inputs is the non-eero f a g "A", then both outputs of the P E will be assigned with the non-zero flags "A". Oth-;:wise, the outputs will remain to have the zero-value t es+i ; g flags , then
all the entries of A,B,-, shall be zero, otherwise, a t least one entry of A,B,-, is non-zero. This resulting flags will be carried out immediately after the computations of all the entries of A,B,-, are completed. In addition, the resulting flags also make the processor array and the output memory modules perform the following functions: the inhibited tags will be added to the output data B,. Once the global timing clock comes in, each column of B, which is stored in the internal registers of the vertical PES will be pumped out upward along the dotted output data line. This will make the input data and output data in the same sequencing and data format. (2) For the output memory modules of Be-,, the entries of Bs-* coming down from the previous array are stored in the right internal buffers of the output memory modules, and the entries of B,-, coming down from the current processor array are stored in the left internal buffers. If the resulting flag "*" is present a t each memory module, then the inhibited tags will be assigned to the stored B,, and they indicate that those data are the desired results. Next, the data in the right internal buffers will be pumped out. Otherwise, the flag "A" controls the switches inside the output memory modules and forces the data stored in the left internal buffers, i.e., Be-,, to be pumped out. However, if the input B,, have the inhibited tags, then the above functions are disabled. The B,-2 stored in the output memory modules are in the waiting situation, before the global timing clock is present a t this array. . Meanwhile, we need both resulting data A, and B, so that there are two internal registers and two output data lines of each P E for having the Al and B, be stored in and pumped out, respectively. The other modifications for the architecture are that the qo = -1 and Bo = I have been stored in the right internal buffers of the output memory modules already. Otherwise, the functions of these output memory modules are similar to the output memory modules of Fig. 3 The pipelined time T of the proposed two-level macro-pipelined array is equal to the global timing clock period. Because of achieving the synchronization of the pipe, the clock period is always equal to the largest processing time of the first-level processor arrays plus the time for pumping out the results. Since the Decell algorithm is implemented in the L-modulus arithmetic system, so L two-level macropipes are required for the realisation of the L identical parallel Decell algorithms with different moduli mi , i = 1, * * . ,L . Thus, the maximum clock period of the i t h parallel macro-pipe, 1 2 i 5 L, will synchronise the multiple-pipe structure qK T = max { max { [( 3p -I ) ( tf + t i + t j ) + tf + 2 t j + t & 1,
I S i s L [ ( n + P p -l ) ( t ; + t i + t ; ) + t f + 2 t 6 + t b ] } ,
I( n + 2~ -2 t a + tm + t d )I P 1 ( tdtu + t o 11 1 .
As an example, for a 12-link redundant robot, the associated Jacobian is a 6x12 matrix, and its pipelined time is T=max( l~~~L ( 2 4 t : + 2 3 t a + 2 5 t i + t b ) , 2 2 ( t . + t ,
+ t i ) , 6 ( t i , , + t o ) ) .
If the input data sise t o the proposed architecture is 16 bits, a pipelined time of 6.975 f i s is achievable with current VLSI custom design technology. Details can be found in 1141.
VI. Coneluston
A two-level macro-pipelined VLSI array architecture has been designed, based on the iterative Decell algorithm, for the real-time computation of the ezclet solution of a general pxn ( n 2 p ) pseudoinverse Jacobian matrix. Due to the ill-conditioned problem in determining the rank of the manipulator Jacobian, the residue arithmetic is employed to obviate the round-off error occurred in the Decell algorithm computations in order to obtain an cmet solution. This is achieved by using the residue arithmetic in determining whether or not the matrix product A B is emetly a Eero matrix in the (K+l)th iteration of the Decell a&ogthm. The first-level arrays of the proposed architecture are asynchronous data-driven, wavefront-like two-dimensional arrays which perform the matrix multiplications, matrix diagonal additions, and trace computations in parallel. A pool of the first-level arrays (including the arithmetic decomposition and recomposition cells) are then configured into a second-level macro-pipeline of (p + 4) stages with outputs of one array acting as inputs to another array in the pipe. The second-level linear array is a synchronous systolic array whose data movements in the array are controlled by global timing-reference clock pulses. The pipelined time of the proposed two-level pipeliied array architecture has a computational order of 0 (n + 2p -1) which is the same computational complexity order as evaluating a matrix product in an ordinary wavefront array. For a 12-liik redundant robot, a pipelined time of 8.975
pts is achievable with current VLSI custom design technology. 
