Most Cartesian-based control strategies require the computation of the manipula tor inverse Jacobian in real time at every sampling period. In some cases, the Jacobian matrix is not of full column or row rank due to singularity or redundant robot configuration. This requires the computation of the manipulator pseudo-inverse Jaco bian in real time. The calculation of the pseudo-inverse Jacobian may become extremely sensitive to small perturbation in the data and numerical instabilities, when the Jacobian matrix is not of full column or row rank. Even if the Jacobian matrix is of full rank, the ill-conditioned problem may still plague the computation of the pseudo inverse Jacobian. This paper presents the use of residue arithmetic for the exact compu tation of the manipulator pseudo-inverse Jacobian to obviate the roundoff errors nor mally associated with the computations. A two-level macro-pipelined residue arithmetic array architecture implementing the Decell's pseudo-inverse algorithm has been developed to overcome the ill-conditioned problem of the pseudo-inverse computation. Furthermore, the Decell algorithm is quite suitable for VLSI array implementation to achieve the real-time computation requirement. The first-level arrays are data-driven, wavefront-like arrays and perform the matrix multiplications, matrix diagonal addi tions, and trace computations. A pool or sequence of the first-level arrays are then configured into a second-level macro-pipeline with outputs of one array acting as inputs to another array in the pipe. The proposed architecture can calculate the pseudo inverse Jacobian with a pipelined time in the same computational complexity order as evaluating a matrix product in a wavefront array.
Introduction
Robot manipulators are highly nonlinear systems and their control involves actuat ing appropriate joint motors due to small changes in the pose (position and orientation) of the manipulator hand along a planned Cartesian trajectory. This requires the com putation of the manipulator inverse Jacobian at every sampling period in real time. The manipulator Jacobian is a linear mapping which relates the derivatives of the jointvariables to the derivatives of the manipulator hand coordinates. The pose of the mani pulator hand relative to a fixed inertial coordinate system can be described by a set of 6 algebraic equations containing the joint variables q1, q2, ' " ' , q.^ as x=f(q) ;
where x = (px , py ,pz , 4>x , 4>y , <j >z )T is a 6-dimensional vector describing the pose of the manipulator hand with respect to the inertial coordinate frame, q is the n-dimensional joint-variable vector for an n-link manipulator, and the superscript "7'" indicates matrix/vector transpose. Differentiating Eq. (1) with respect to time yields v(t) w{t) = J(q)q(t)
where v; (f) and cj(t) are, respectively, the linear and angular velocities of the manipula tor hand, q = (, q2 > ' ' ' , qn )T is the joint-velocity vector of the manipulator, and J(q) is the 6Xn Jacobian matrix (usually n > 6) and is a function of joint variables.
In general, the manipulator Jacobian (or forward Jacobian) is quite easy to deter mine than the inverse Jacobian. However, most Cartesian-based control techniques require the computation of the inverse Jacobian in real time at every control servo period. In particular, the use of the inverse Jacobian in resolved motion rate control [1] , and the model-based control techniques in Cartesian space [2, 3] , has been a computa tional bottleneck for these control schemes. In order to use these control schemes effectively, efficient computation of the inverse Jacobian in real time must be developed.
Most of the existing methods in eompiuting the inverse Jacobian are formulated to be computed by uniprocessor computers [4] [5] [6] . One of the most widely used methods in computing the inverse Jacobian is first by determining the forward Jacobian and then taking its inverse. Several efficient parallel algorithms have been developed for comput ing the forward Jacobian [7] [8] [9] . If the manipulator has six degrees-of-freedom and is nonsingular at q(i), then the Jacobian matrix may be inverted using normal matrix inversion techniques for square matrices. For example, in the case of resolved motion rate control, the joint-velocity vector can be computed from the manipulator hand velo cities as
where J *(q) is the 6X6 inverse Jacobian matrix. If the manipulator Jacobian is singu lar at q(Z), then the inverse of the Jacobian is not defined because the Jacobian matrix is not of full column or row rank. In the case of redundant robots, the inverse of the Jacobian is also not defined because the Jacobian matrix is rectangular. Useful solu tions to the above cases, however, can be obtained through the use of pseudo-inverse [10] [11] [12] [13] . The best approximate solution of the inverse Jacobian to Eq. (3) is given by the pseudo-inverse [12, 13] and is denoted by J+(q). Using the pseudo-inverse, Eq. (3) can be written as [12, 13] 
q(t) = J+(q)x(t)-f-(I-J+(q) J(q))z (4)
where I is an rtXn identity matrix and % is an arbitrary vector in the joint-velocity space. Equation (4) indicates that the resultant joint velocities can be decomposed into a combination of the least-squares solution of minimum norm plus a homogeneous solu tion created by the action of a projection operator (I -J+J)f, which describes the redundancy of the manipulator system. If the Jacobian is a square matrix and non singular at q(i), then the projection operator is equal to the null operator and Eq. (4) reduces to Eq. (3). Many numerical algorithms, such as the Decell algorithm, the Grevill algorithm, the Hermite algorithm, and the Gram-Schmidt orthogonalization algorithm, have been developed to compute the pseudo-inverse Jf [12, 13] . However, the calculations involved in these four algorithms are quite sensitive to ill-conditioning in which the exact solution is extremely sensitive to small perturbations in the data. Some researchers [14] [15] [16] con sidered the possibility of using the residue arithmetic for the exact computation of pseudo-inverses in order to obviate the roundoff errors normally associated with their computations. These four algorithms have been implemented in the residue number sys tem (RNS) for this purpose. Unfortunately, [15] showed that the Grevill algorithm is not suitable for implementation in the residue arithmetic system. This is because the algo rithm requires the greatest common divisor and the least common multiple in each itera tion, otherwise the moduli choice may change very much. The Hermite algorithm and the Gram-Schmidt orthogonalization algorithm require a lot of data broadcasting opera tions which are not feasible for implementation in VLSI array architectures. The Decell algorithm [14, 15] , based on the Cayley-Hamilton theorem, is an iterative procedure with a sequence of matrix operations and can compute the pseudo-inverse J+ quite efficiently. This sequence of matrix operations can be easily implemented in the residue arithmetic and is quite suitable for VLSI implementation [17] . This paper presents the design of a two-level macro-pipelined VLSI array architecture for the real-time computation of the exact solution of the pseudo-inverse Jacobian using the Decell algorithm in the residue arithmetic system.
The first-level array of the proposed architecture is an asynchronous data-driven, wavefront-like array [17] and performs the matrix multiplication, matrix diagonal addi tion, and the trace computation required in the Decell algorithm. The topological f For brevity, we shall drop the function dependency of q on all Jacobian matrices and their inverse.
-5 -configuration of the first-level array is a two-dimensional processor array consisting of three basic components: (1) an ordinary matrix multiplication wavefront array, (2) an adder tree, and (3) I/O memory modules. The adder tree, connected to the diagonal processing elements (PEs) of the ordinary wavefront array, is used to evaluate the trace computation, and the calculation of the matrix multiplication and the trace computa tion can be executed in parallel. The above executions are in an asynchronous datadriven manner, and the data transfers between a PE and its immediate neighbor are by mutual convenience or handshaking protocol. A pool or sequence of the first-level arrays are then configured into the second-level macro-pipeline with the outputs of one array acting as inputs to another array in the pipe. For computing the pseudo-inverse of a general p Xn {n > p) matrix J, the second-level macro-pipelined linear array is a pipe of p stages and can evaluate a.total of p iterations of the Decell algorithm with the firstlevel array in each stage. The second-level linear array is a synchronous systolic array whose data movements in the array are controlled by global timing-reference clock sig nal. The results of each stage are stored in its local memory and waiting for activated global synchronous signals. The pipelined time of the proposed two-level pipelined array architecture has a computational order of O (n + 2p -1) which is the same computa tional complexity order as evaluating a matrix product in an ordinary wavefront array. Furthermore, the primitive processing elements of the architecture perform computa tions in the residue number system. Recently, [18] proposed attractive wafer-scale integration (WSI) techniques for implementing the residue number system designs. WSI technology is based on the concept of placing one or more functional processing units on a semiconductor wafer. The processing units are then interconnected on the wafer as opposed to individual packaging. The wafer-scale integration compresses a large microelectronics representing a complete digital system onto a single wafer, and its pri mary advantages are an improvement in total system density and a built-in faulttolerance capability. Thus, the WSI has the potential to provide a very cost-effective and reliable way of implementing the proposed RNS two-level macro-pipelined array architecture for the computation of manipulator pseudo-inverse Jacobian in real time.
2, Unpleasant Fact for Calculating the Pseudo-Inverse
If the manipulator Jacobian is not of full column or row rank, an unpleasant fact exists for calculating the pseudo-inverse Jacobian J+. The rank of J which cannot be determined exactly because it can be easily altered by an arbitrarily small perturbation. This problem is commonly called the ill-conditioned problem for which the exact solution is extremely sensitive to small perturbations in the data. In solving such problems, the introduction of roundoff errors in the computations can be disastrous. This illconditioned problem can be explained in the following lemma and the proof of the lemma can be found in [13] . [13] . Suppose a matrix J £RpXn is of neither full column nor row rank, then for any real number A; and any e > 0, there exists a matrix E, -6 -| |E 11 < e, such that | |(J + E)+ -J+ 11 > k.
Unpleasant Fact Lemma
Even if the Jacobian matrix is of full rank, the ill-conditioned problem may still plague the computation of the pseudo-inverse Jacobian. then we have det(J) = 1, whereas det(J + E) = -118.94. In other words, if an accumu lation of roundoff errors in the computations were to correspond to the introduction of the perturbation matrix E, then the computed value of the determinant would be -118.94, whereas the exact value is 1, even though the matrix J is of full rank.
The above lemma and example show that roundoff-error-free computations are necessary for evaluating the pseudo-inverse Jacobian matrix. It is known that the resi due number system (RNS) is the most popular technique for error-free computations in signal processing and filter design [19] . We shall apply the residue arithmetic computa tion technique for computing the "error-free" pseudo-inverse Jacobian.
Overview of Residue Number System
The residue number system (RNS) arithmetic is a familiar concept in the number theory. In this section, we shall review some definitions and properties of the residue number system that will lead us to the computation of the error-free pseudo-in verse Jacobian. Definition 1. Given any integer x and any modulus m, if r = x (mod m) and 0 < r < m, then we write r = \x |TO and say r is a residue of x modulo m.
With this definition, computations in single-modulus residue arithmetic is simple in the sense that no negative integers are involved in the RNS. However, if we wish to solve problems involving negative integers, we must be able to handle negative integers as well as positive integers. One way to accomplish this is to introduce the system of sym metric residues modulo m. Notice that regardless of the sign of x, the multiplicative inverse of x modulo m is defined to be a positive integer in the range of 0 < x~x{m) < m. Of course, there are questions of existence and uniqueness of the multiplicative inverse modulo m, and these questions are addressed by the following theorem.
Theorem 1 [20] . If x and m > 1 are integers, then x_1(m) exists and is unique, if and only if \x\m =£ 0 and gcd(z,m) = 1.
The choice of the modulo m is quite important. If m is not a prime number, then no convenient explicit expression for x_1(m) has been discovered, whereas if m is a prime number, one can find a convenient explicit expression for it. For any integer x, there are two different fixed-radix weighted number representa tions: the signed number representation and the unsigned number representation. For the signed number system, both positive and negative number numbers can appear. Hence all integer values are assumed to be within the following range depending on whithdr fit i § bdd or even
for m is odd .
However, it is more convenient to use what appears to be only positive numbers by mapping the negative range above the positive upper limit. This can be accomplished by the following rule: If x is a negative integer that falls within the above negative range, it can be represented by the positive number as, rn -|x | = m + x, x < 0 which exceeds the positive upper limit but falls below m. Thus, any integer value greater than the upper positive range limit represents a negative number. For example, assume m = 60 and x^^ = -2, then xunsigned =58. It should be noted that these two representations are identical, usually the signed number system is used in the symmetric residue system and the unsigned number system is used in the common residue system.
Arithmetic Decomposition
Arithmetic representations can be decomposed linearly into an equivalent one con taining many components, each with a relatively small arithmetic range [18, 20] . This decomposition leads us to a number of parallel and independent pseudo-inverse compu tations, each using the residue arithmetic which allows us to use short registers and sim ple arithmetic operations. Furthermore, since the decomposition is linear, the final pseudo-inverse results for each parallel computation can be linearly recombined back together, yielding the desired pseudo-inverse result. To achieve this objective, we need to introduce the concept of multiple-modulus residue arithmetic. (ii) Multiplicative inverse
(b) If, in addition, the bases are all prime numbers, then ' If U -{k,\*ym^2 W-• •, I**"*-2 u>.
As an example for illustrating the idea of the multiple-modulus residue arithmetic, let M -60 -3X4X5 and xunsigned = 58, then, mx= 3, m2 = 4, m3 = 5 and |s |mi = l, I3' lm2 = I3' Im3= 3* Thus, we haye s ~ {i, 2, 3}.
Arithmetic Recomposition (or Reconstruction)
Assume that the L-tuple residue representation of s, s ~ { |s |m , |s \m, 2, ■ • • , |s |mi }, is given, one wants to find the corresponding value of s. Two techniques are available to achieve this purpose: the Chinese remainder theorem (CRT) and the mixed-radix recomposition (MRR).
The Chinese Remainder Theorem
Let ml5 m2 , • • • , fnL be the bases for a residue system where gcd (mt-,
I S \rj \m \m-Furthermore, x is expressed in the unsigned i-i number system and falls within the range [0,M-1]. Thus, x = \x\M. Although the CRT method is fairly simple in principle [20] , it would not be able in its given form to be implemented in a residue computer, since the residue computer is equipped to perform arithmetic operations modulo m^, not modulo M operation as required by the CRT. In contrast, the mixed-radix recomposition procedure can be implemented in a residue -10- maehine because it requires modulo m,-operations only.
The Mixed-Radix Recomposition
Szabo and Tanaka [20] proposed the mixed-radix recomposition (MRR) method to avoid the modulo M addition operation in the CRT. The MRR is of great importance in the residue arithmetic computation and the VLSI implementation for the following two reasons: (i) The mixed-radix system is a weighted number system and magnitude comparison can be easily performed. Also, it can be shown that its VLSI implementation is quite simple. (ii) Conversion from the residue to a certain mixed-radix system is relatively fast in residue computers. The mixed-radix number system can be described as follows: Let ml, ra2, • • ■ , mL be the pairwise relatively prime bases (or radices) for the mixed-radix system. A number x may be expressed in the mixed-radix form as 2,-1
where the are the mixed-radix digits and 0 < at-< m,. For a given set of radices, the mixed-radix representation of x is denoted by <a^ , aL-i> ' ' ' , where the digits are listed in order of decreasing significance. It can be shown that
and has a unique representation. Notice that if x is in . i==1 the signed number system, the digits ai should be within the interval (--m,-, -mt ), 2 2 and then x G (--M, -M). For convenience, we will only discuss the case for the 2 2 unsigned number system.
Assume that the L-tuple residue representation of x, {r1,r2, • • • ,rL}, is given, one wants to find the corresponding mixed-radix digits ot-, in terms of riy 1 < i < L. The MRR method involves the following steps:
(i) To obtain av Eq. (7) is first taken as modulo my Since all terms except the last are multiples of m1? we have,
(ii) To obtain a2, we first form x -ax in its residue code, then take modulo m2 on both sides of the following equation,
Because gcd (m.j, m2) = 1 and from Theorem 1, mj '1 (m2) exists, so we have
Furthermore, since 0 < a2 < ra2, we have
(iii) For any y, and 1 < j <L, rj is given and a1, a2, • • • , ay_y have been evaluated already. Thus, by induction, we have
where Qij;= mt_1(my).
It should be noted that if we are interested in the signed number system, then the pro cedure is the same as the above equations except the symmetric residue j x jrn replaces the \x |m.
The above residue arithmetic for integers can be extended to integral matrices. We shall define the residue matrix modulo m and most of the results that we have discussed for integers can be extended to integral matrices.
De finition 5. Let X = [ x,-y ] be a p Xn integral matrix and m > 1 be an integer. If R -[r^y] is the matrix with elements defined by r,y = |x>y |m for all i and j, then we write R = |X |m and say R is a residue of X modulo m.
We shall use this residue matrix modulo m concept in the residue number system to compute the error-free pseudo-in verse Jacobian.
Computation of Pseudo-Inverse Jacobian Using RNS
The Decell algorithm for residue computation consists of a sequence of matrix mul tiplications, matrix diagonal additions, and trace computations. These operations are suitable for VLSI array implementation [17] . Stallings and Boullion [14] proposed using the residue arithmetic to compute the Decell algorithm. They used the Chinese remainder theorem (CRT) to recover the final solution. However, using the CRT for recomposition involves the modulo M addition, which has the potential of resulting in long registers and complicated arithmetic hardware. A better solution is to use the mixed^radix recomposition (MRR) technique to avoid the modulo M addition by using normal binary adders.
We shall first describe the concept of the Decell algorithm and then propose a VLSI array implementation for it. Throughout this paper, we assume that A is a pXn matrix and n > p, otherwise we could compute (Ar)+ using the relation A+ = [(A3V)+]r. The Decell algorithm for computing the pseudo-inverse A+ is based on the Cayley-Hamilton 
The pseudo-in verse A+ is given by
■ qK where tr (A, ) ^ trace of A,-, 1 < i < K, I is a p Xp identity matrix, and K is the rank of A. Since the rank of A is not known apriori, the iteration is continued until the matrix product Aj^B,-becomes a zero matrix for some i, i = 1,2, • • •' ,K. The termina tion of the iteration will determine A as well as the rank of A. The critical step in the Decell algorithm involves the determination of whether or not A1Bt-= 0 for some i, i =1,2, • • v , K. Obviously, the error-free computation is useful in overcoming the numerical instability in this iteration as well as the determination of the zero matrix in the matrix product. Thus, we shall use the residue arithmetic for the computation of the Decell algorithm in an effort to obviate the round-off errors that may cause difficulty in determining the zero matrix from the matrix product AjBj.
The use of residue arithmetic presumes that each element a-of the matrix A is an integer. Although fixed word-length computers store only rational numbers, they can be converted to integers by an appropriate scaling. For example, in the radix u system, o,y can be evaluated by affiu* = u~s( ajk~s^uk) = u~s aij, where k=-s k-0 v =Y) alj~s^uk is the normalized integer and u~s is the constant scaling factor.
■ ' k=0 ' _ Thus, we can obtain. the normalized matrix A = us A. Since it is known that A+ = (?/_' s-A)+ =-3-A+ = us A+, one may apply the proposed residue arithmetic Decell algorithm to obtain the pseudo-inverse A+ from which the desired pseudo-inverse A+ may.'he obtained by multiplying the scaling factor us. For convenience, we assume that the elements of A are integers in the following discussion.
Re-examining the equations (11) and (12) carefully again, one may find that the determination of whether or not A1B,-= 0 for some i is a very critical step. If the ordi nary fixed-point arithmetic is used, it will be extremely difficult, in the presence of -13 roundoff errors, ;o determine whether or not the elements of each iteration in Ajfi, are exactly zero. Hence, the algorithm may suffer from the numerical instability. So, the error-free computation must be used in Eq. (11) for overcoming this difficulty. However, the evaluation of Eq. (12) can be done in the ordinary fixed-point arithmetic. Thus, the residue arithmetic Decell algorithm can be divided into two parts: Eq. (11) is evaluated in the residue arithmetic and Eq. (12) is evaluated in the ordinary fixed-point arith metic.
Before discussing the residue arithmetic Decell algorithm, one has to consider the proper choice of the range M and the selection of the bases (or rriodtili) 1 < i < L, which are critical to the success of the method for calculating A+. Stallings and Boullion [14] suggested the following criterion for selecting M M >max{X? (13) where X = min {tr (AAr) , | |AAt || } and K -rank (A). For simplicity, two practi cal criteria are given as follows:
where at-y is the {i , j) element of A.
where X -the maximal absolute value of an element in the matrix AA . Stallings and Boullion [14] also suggested that the bases (or moduli) m*, 1 < * < L, should be chosen to be large prime numbers greater than p and must satisfy M = J f rn^ then rank (| A |m ) • rank (|AAr |TO ) = rank (AA3' ) = rank (A) -K, where 1 < «; < L. In other words, the rank of | A |m is equal to rank (A) regardless of the modulus choice m,-. This indicates that the residue arithmetic Decell iteration will terminate at the same (if+l)th iteration for any different mt-, where K is the rank of A. Thus, we can use the multiple-modulus arithmetic to calculate the pseudo-inverse. However, the last (p +l)th iteration is not necessary when the rank of A is equal to p. Since K < min(p ,n) -p, if the matrix multiplications Ay = A1By_1 are not equal to zero matrices for the previous p iterations, where 1 < j <p, then Ap+1 of the last (p + l)th iteration must be a zero matrix, and K is determined automatically and is equal to p, otherwise K would be greater than p which is impossible. Figure 1 indicates that the Decell algorithm may be decomposed into L parallel Decell algorithms, each of which can be implemented using arithmetic modulo the respective component of M. The recombination of the outputs from the L parallel systems may be achieved by using the mixed-radix recomposition technique. The computational procedure of the residue arithmetic Decell algorithm can be described in the following Residue Arithmetic Decell Algorithm: -14 -Algorithm RADA (Residue Arithmetic Deceit Algorithm). Given a pXn matrix A with normalized integral elements, p < n, the range M, and the moduli mt , 1 <i <L, this algorithm uses the residue arithmetic to compute the pseudo-inverse A"1 based on the Decell algorithm as in Eqs. (11) One may use the MRR technique to reconstruct the qz and B2 with the pre computed weighting factors, i.e., C12 = -3, C13 = 5, C14 = 10, C2z = -3, C24 = -2, and C34 = -4, where = mj-1(my). For example, the residue representation of qz is given, find the associated mixed-radix digits <a4,a3,a2,a1> where the mixed-radix expression is qz = a4(7XHX17) + a3(7Xll) + a2X(7) + av The associated mixed-radix digits may be found by using Eqs. (8)-(l0) in the signed number system, then we have al = -2, a2 = 4, a3 = -8 and a4 = 10. Furthermore, q3 can be evaluated as q3 = 12500. Similarly, using the same procedures, the elements of B2 can be evaluated as 612 -b 2i = 5 23 
Hardware Implementation of Residue Arithmetic
The implementation of the multiple-modulus residue arithmetic requires the con sideration of three hardware components: (i) basic residue arithmetic processing Cells, (ii) arithmetic decomposition cell, and (iii) arithmetic recomposition cell. It should be noted that the implementations have been previously realized by VLSI/WSI techniques [18] .
Basic Residue Arithmetic Processing Cells
Two basic residue arithmetic processing cells are residue adder and multiplier. As shown in Figure 2 , one may use two A:-bit binary adders to realize a modulo rn residue adder, where 2fe < m. Assuming that x and y are inputs to the first binary adder of the residue adder, and if the result of the first binary adder overflows, it can be corrected for the modulo m operation by adding (-m) to the result of the first addition. If the first addition does not overflow, it may fall in the range [ m , 2k-1 ], in which case it can also be corrected by adding (-m) to the first addition. The carry bits generated from the first binary adder or the second binary adder indicate whether or not (x + y) is greater than m. A multiplexor controlled by the carry selects the correct output.
One familiar method of residue multiplication is "log transform-residue additionantilog transform" [18] . The concept of logarithms can be used for the multiplication of field elements. The theory guarantees that there is at least one primitive element which generates all nonzero field elements [20] . It is known that when m is a prime number [20] , the nonzero elements of GF(m) will be GF (m) = {l,2, • • • , m-l}. For each i G GF (m), there is a corresponding element e, where 0 < e < m-1, such that i; = .//, where fx is a primitive element of GF(m). The exponent is the logarithm of the element i = // and log? = e .
The multiplication of two field elements is equivalent to the addition modulo (m-1) of the corresponding exponents. If i1 = ix1 and i2 = fJ?2, then =M|ei+e' iltm 0 ^d \el + e2|(m_i) = log(i1-i2)
Thus, multiplications can be implemented by adding the appropriate exponents as determined from a logarithm table [21] -The procedure requires three steps: (1) find the exponents e4-from the logarithm table, (2) add indices and take modulo (m-1), and (3) find antilogarithm in the logarithm table to yield the result. This procedure is shown in Figure 3 .
Arithmetic Decomposition Cell
The decomposition of an input number consists of finding its residues for its various moduli m4-. In general, this does not create problems because the input number can be applied simultaneously to separate decomposition blocks as suggested in Figure 4 .a. Each decomposition block can be realized by a custom VLSI design based on a sequen tial restoring division algorithm [18, 22] . This yields a compact design that is easily modified to simplify the 2's complement to the RNS conversion process. The division algorithm simplifies considerably when only the remainder is of interest. The resulting architecture can be a pipeline requiring only subtractions. Assuming that the input is a k-bit positive number, we can subtract from it the binary representation Of the modulus m?-such that the highest order "1" bit of the representation is aligned with the highest order input bit. If the result is negative, the original input is passed on (i.e. subtract 0). If the result is positive, the subtracted version is passed on. In either case, a (A: -l)-bit number will result. This number is now treated as the input, rat-is shifted one less posi tion to the left, ahd the above process repeats until m,-has been totally shifted. The remainder is the result after that step. A floor plan of a decomposition block to find the modulo 5 residue is shown in Figure 4 .b. With 2's complement input Values, a positive input is passed directly to the residue decomposers, and 0 is added to their outputs. For a negative input, the input is first complemented (without adding one) and then passed to the residue decomposer.
Arithmetic Recomposition Cell
Recomposition using the MRR process is shown in Figure 5 for a four-modulus sys tem. The need for the modulo M addition required for the CRT can be eliminated. Pipelined operation is possible so that a new value may enter every clock cycle. In the recomposition architecture, the values and W; are pre-computed and stored in a k-1 read-only memory (ROM), where wkmi and ml -1. Once, the residue inputs r,-, o 1 < { <4, enter into the architecture, the residue adders and multiplier ate executed from top to down. The buffers are used to synchronize the timing of the pipelines. Thus, the resulting mixed-radix digits <a4,a3,a2, aj > are carried Out at the same time. Later on, a tree-structured architecture With normal binary adders and multi pliers can evaluate the resulting binary output from the mixed-radix digits.
-18 -Let us examine how the MRR process, of Eqs. (8)- (10) can be realized by the pro posed pipelined implementation. From Eq. (8), the resulting al is equal to the input r1; so rl would be shifted down through the buffers. Observing Eqs. (9.c) and (10) , the resulting r1 must also be sent to the right three pipes for evaluating a 2, a3, and a4. These pipes are executed in their corresponding modulus residue number systems. For example, the resulting a2 may he carried out and present at the second stage of the pipe after a modulo m2 residue subtraction, x = |r2 -r.j| m£, and a modulo m2 residue multiplication, o2 = |C12x | mo. At the same time, the pipes for a3 and a4 are also executed in the similar operations of modulo m3 and m4, respectively. These operations correspond to the computations within the most inside parenthesis of Eqs. (9.c) and (10) . Next, the resulting a2 will be stored in the buffers of its pipe and also broadcasted to the other two pipes. This process will continue to make the computations within the parentheses of Eq. (10) for a3 and a4 to be executed from inside to outside. Because of the multiplepipe structure, the calculations of a2, a3, and a4 would be evaluated in parallel. Exa mining the MRR architecture, one notes that multiplications are only required by the moduli mit i =2, ••• ,L (in this case, L = 4). Thus, the best implementation is to assign m1 the largest valued modulus and mL the smallest valued modulus. This allows for short register (or buffers) and simple arithmetic operations with relatively small arithmetic ranges, which can be implemented on the proposed pipelined architecture.
VLSI Architecture for Residue Arithmetic Decell Algorithm
From the first part of the Decell algorithm (i.e. Eq. (ll)), one finds that there are three different matrix operations involved in the iteration of the algorithm: matrix mul tiplication, trace computation, and matrix diagonal addition. Based on the discussion in section 4, it is suggested that these operations in Eq. (ll) should be done in the residue number system so that the exact determination of AjBt-= 0 for some i can be easily detected. The determination of AjB,-= 0 terminates further iterations, and a matrix multiplication and a scalar division in Eq. (12) are then executed in the ordinary fixedpoint arithmetic.
Using the multiple-modulus arithmetic to implement the Decell algorithm requires the algorithm be decomposed into L identical parallel algorithms, each of which is
implemented using arithmetic modulo mt-and L[mt = M. The results from the L parali=1 lel algorithms can be reconstructed by the MRR technique to obtain the desired result.
In this section, we shall discuss one of the L identical parallel algorithms.
There are at most (K+l) iterations involved in Eq. (ll). However, K is not known until Ax Bt-becomes a zero matrix for some i. Thus, K or the rank of A will be deter mined at the end of the iterations. In fact, the rank of A is. always less than or equal to min (p ,n), that is, K < min (p ,n). If p < n, then A < p, and we have at most (p +1) iterations. However, the algorithm RADA is section 4 indicates that the last (p T l)th -19 -iteration is not necessary. Based on this fact, it is possible to implement Eq. (11) on a linear array of p processing architectures as shown in Figure 6 . Each processing archi tecture is an asynchronous data-driven, wavefront-like, two-dimensional array. Each performs the matrix multiplication, matrix diagonal addition, and trace computation involved in one iteration of Eq. (11) . Furthermore, the architecture also has to deter mine whether or not A1 By is a zero matrix which is used to determine K and terminate the iterations as well. Once K is known, no more operations will be executed. In this case, inhibited tags are assigned to those output data qK , 1; By+1, and By of the (/'C+l)th processing architecture with q%+\ -#y, B K = ®2T-1» and they are also used to disable the computations of the next (p-K-1) architectures, where and By_x are the input data. When a processing architecture is disable, no operations will be exe cuted, and the input data are stored in its local memory and it is waiting for the activated global synchronous signal. The above design of Figure 6 is regarded as a twolevel macro-pipelined array: the first-level consists of data-driven, wavefront-like, twodimensional arrays (or the processing architectures), and the second-level consists of cascading p of these arrays into a linear macro-pipelined array. In the second-level pipelined array, the outputs of one array are acting as inputs to another array in the pipe. Furthermore, it should be noted that the first-level array is in an asynchronous data-driven manner, whereas the second-level linear array is a synchronized multipro cessing in which the data movements in the array are controlled by global timingreference clock signal.
After (p-K-1) clock periods, the desired output data qK, By+1, and By_1 from the (R-tT)th first-level array will be pumped out of the last pth first-level array of the second-level pipe. Then, we use the gy and By_x to calculate the pseudo-inverse A+ _ 1 <lK ATB y_i-Since the proposed architecture is a two-level macro-pipeline, the desired results are pumped out in succession and at every clock period coming in* Next, we shall discuss our design of the first-level, asynchronous data-driven, wavefront array. Rung [17] proposed an asynchronous, data-driven, wavefront array to implement matrix operations, in particular the matrix multiplication. The wavefront array com bines the systolic pipelining principle and the data-flow computing concept. In fact, it should be noted that the major difference between a wavefront array and a systolic array is the data-driven property. There is no global timing reference in a wavefront array, and yet the order of task sequencing is correctly followed. In the wavefront archi tecture, the data transfer between a processing element (PE) and its immediate neigh bors is by mutual convenience. Whenever data are available, the transmitting PE informs the receivers, and the receivers accept the data whenever required. They then communicate with the sender to acknowledge that the data have been received. This scheme can be implemented by means of a simple handshaking protocol [17] which ensures that the computational wavefronts propagate in an orderly manner instead of crashing into one another. Since there is no need to synchronize the entire array, a -20- wavefront array is architecturally scalable. Kung [17] demonstrated how a square matrix multiplication algorithm can be executed on a square, orthogonal wavefront array. This concept can be extended to a more general case to perform the multiplica tion of non-square matrices. If A is an mXn matrix and B is an nXp matrix, the wavefront array size needed is mXp, thus the computations are completed at time n + (m -1) T (p -l) -n+m +p -2. Extending this idea to our non-square matrix multiplications for Ay = AAr of Eq. (11) and AT BJf_1 of Eq. (12), where A is a pXn matrix and Bif_1 is a p Xp matrix, and n > p, one may use the wavefront arrays Of size p Xp and size nXp to evaluate the AAr and ATB j, respectively, with the same com putational time of (n +2p -2).
With some modifications of the wavefront array, it is possible to evaluate the matrix multiplication, trace computation, and matrix diagonal addition for One iteration of Eq. (11) by the proposed array in the residue number system as shown in Figure 7 . Furthermore, the proposed array also includes the determination of whether or not AiB; is a zero matrix. To explain the functions of the proposed array clearly, we shall first discuss the main-frame of the processor array which does not include the adder tree and dotted connected lines. This main-frame of the processor array is mainly used for matrix multiplications only. It is quite important to illustrate how the matrix multipli cation is executed on a square, orthogonal p Xp wavefront array. Let A=[aiy], B = [6t-y ], and C = AB = [ Cty ] be all p Xp matrices. The matrix A can be decomposed into columns Aj-and the matrix B into rows By, and we have
where the product A* B^ is the "outer product." The matrix multiplication can then be carried out in p sets of wavefronts (recursions), each executing one outer product:
or equivalently,
where -aik and bjk\-bkj,
With reference to Figure 7 , let us examine the computational wavefront for the first recursion in the matrix multiplication. Suppose that the internal registers of all the PEs are initially set to zero, that is, Cffl = 0, for all z, j. The entries of A are stored in the memory modules in the left (in columns), and those of B in the memory modules on the top (in rows). The process starts with PE (1,1) , where cIP -+ a 11X611 . The com putational activity then propagates to the neighboring PEs, (1,2) and (2,1). Since they both have input operands available (data-driven property), they will, respectively, exe cute <7$ =■€$■■ + anX612 and C$ = C$ + a2iX6 n in parallel. The next wavefront of activity will be at PEs (3,1), (2, 2) , and (1,3), thus creating a computational wavefront traveling down the processor array. When the first wavefront sweeps through all the -21 -PEs or all the available input operands of PE (1,1) are used up, the first recursion is over and the result is stored in the internal register of PE (1,1) . Similarly, when the second wavefront sweeps through all the PEs, the computations of PEs (1,2) and (2,1) are com pleted. Next, the computations will be completed at PEs (3,1), (2,2), and (1,3) . Thus, the completion of PEs will keep going and traveling down the processor array. The above concept shows that the results of the diagonal PEs (circular cells) of the processor array will be carried out at every two wavefronts after the computation of PE (1,1) is completed.
For calculating the trace computation tr(AB), we shall use the above idea to achieve this purpose. An adder tree of height f log2p ], connected to the diagonal PEs of the processor array, computes the trace function in parallel with the calculation of the matrix multiplication of AB. This tree-structured adder architecture is also in an asyn chronous data-driven manner. For example, the first adder is activated after both com putations of PE (l,l) and PE (2,2) are completed. This also implies that the adders start their computations at every two wavefronts sweeping through the square processor array after the PE (l,l) is completed. The last adder will start its computation immedi ately after the PE (p ,p) has been completed. Thus, the result of the trace computa tion, tr(AB), is carried out just after the completion of the matrix operation AB about one addition time late. Using the above concept, we can calculate the As = A1BS_1 and tr(As) =tr(A1Bs_1) of the sth iteration of Eq. (ll) by letting A = Ax and B = Bs_j almost simultaneously. Note that these computations are in the RNS by using the resi due arithmetic implementation. Furthermore, one may evaluate |gs |TOi = IN-1 |m, |tr (A1BS_1)'| m. | m. in the residue arithmetic modulus ra,-manner by connecting a residue multipier to the (residue) adder tree, where |s"1 | m is a precomputed constant. The resulting |gs |m. is sent back to the diagonal PEs (circular cells) of the processor array in order to calculate the matrix diagonal addition, |BS |ot = |AS -<2sl|m , of the sth itera tion of Eq. (11) . Before performing this matrix operation, the determination of whether or not K L " I is a zero matrix is performed. To achieve this purpose, zero-value testing flags generated by the input memory modules to test the input data of the architecture. If any PE (i , j) has the result which is stored in the internal register and is not equal to zero, or either of the two inputs is the non-zero flag "A", then both outputs of the PE will be assigned with the non-zero flags "A". Otherwise, the outputs will remain to have the zero-value testing flags "*". It could be shown that if the outputs of PE (p ,p) are then all the entries of A1BS_1 shall be zero, otherwise, at least one entry of A1BS_1 is non-zero. This resulting flags will be carried out immedi ately after the computations of all the entries of A1BS_1 are completed. In addition, the resulting flags also make the processor array and the output memory modules perform the following functions: (1) For the processor array, if the non-zero flag "A" is present at each PE of the proces sor array, then the array performs the matrix diagonal addition, that is, Otherwise, all the internal registers of the PEs will be set to the initial state and the inhibited tags will be added to the output data Bs. Once the global timing clock comes in, each column of Bs which is stored in the internal registers of the vertical PEs will be pumped out upward along the dotted output data line. This will make the input data and output data in the same sequencing and data format. (2) For the output memory modules of Bg_1, the entries of Bg_2 coming down from the previous array are stored in the right internal buffers of the output memory modules, and the entries of Bs_1 coming down from the current processor array are stored in the left internal buffers. If the resulting flag is present at each memory module, then the inhibited tags will be assigned to the stored Bg_2 and they indi cate that those data are the desired results. Next, the data in the right internal buffers will be pumped out. Otherwise, the flag "A" controls the switches inside the output memory modules and forces the data stored in the left internal buffers, i.e. Bs_li to be pumped out. However, if the input Bs_2 have the inhibited tags, then the above functions are disabled. The Bs_2 stored in the output memory modules are in the waiting situation, before the global timing clock is present at this array.
(3) For tlie output memory module of qs, its functions are very similar to the output memory modules of Bg _2. Before the resulting flag comes in, the qs_1 from the pre vious array and the resulting qs have been stored in the right and left internal buffers,; respectively. If qs_y has the inhibited tag, then the qs_x is waited for being pumped out. Otherwise, the resulting flag is present at the memory module, making the qs_x being tagged. This also shows that the q8_x is the desired result. Meanwhile, if the resulting flag is "A", then the qs will be pumped out and the com putations will continue to find the desired result. Note that the data in the above output memory modules of Bg_1 and qs are in the Waiting situation, before the glo bal timing clock is present at this array.
If we let f* be the time for performing modulo mi residue scalar addition, tl m be the time for performing modulo residue scalar multiplication, be the time for data transfer, and to be the time for pumping out the results. It can be shown that the resi due matrix multiplication |A;B, .,| can be completed in (3p-2) (t^ + t}n +1^) time units and the resulting residue trace computation |tr (Ag)| OT can also be completed with (ta+tj) time delays after the residue matrix multiplication completes its operation. Next, the residue scalar multiplication, k lm, -I k 11 rm ltr (AS) I m, I m,> and the |qs \ can be evaluated at [(3p-2)((* +tl m +t\) + [tl a + {tl m +^)] time units. One can see that the time for performing the three residue matrix functions exceeds the time for evaluating the residue matrix multiplication in a normal wavefront array by only -23-+ 2i| +3iJ) tixile delays. At almost the same time as the calculation of |gs | m is com pleted, the resulting flags are also carried out and present at the output of PE (p ,p). The resulting flags will pass through the dotted data lines, and then go back to the PEs in the array. If the resulting flag is "A", then a residue matrix diagonal addition, i.e., |B, | m = |AS-, will be evaluated. Otherwise, the internal registers of the PEs will be set to the initial zero-value state. However, Some modifications will be required in the first-level processor array of Figure 7 for evaluating |AX | m , | m, and | m because |Aj | m. ••= |AAr | m is a residue non-square matrix multiplication, where A is a p Xn matrix, and it Mil take (n+2 p -2) (t* + td) time units to complete in a wavefront array of size pXp ■ Using the same idea of the above discussion, the architecture can evaluate krlm, at tithe [(n+2p-l)(t* +'<&+*«) + *! + 2td ]. Mehntvhiie, wfe need both resulting data Aj and Bf so that there are two internal registers and two output data lines of each PE for having the Ax and Bj be stored in and ptimped out, respec tively. The other modifications for the architecture are that the q0 = -1 and B0 = I have been stored in the right internal buffers of the output memory modules already. Otherwise, the functions of these output memory modules are similar to the output memory modules of Figure 7 . Furthermore, another non-square matrix product compu tation Y -AT'BK_1 in Eq. (12) can be completed in [ (n +2p -2)(ta + tm + td)\ time units by an ordinary fixed-point wavefront array of size nXp, where ta, tm, and td are, respectively, the counterparts of t\, t^, and td in the ordinary fixed-point system. It should be noted that the entries of Y are pumped out column by column. In other words, n entries of Y are pumped out at the same time. So, the evaluation of the pseudo-inverse in Eq. (12), A+ = --Y, may be executed in parallel by using n fixed9k point division processing elements which are connected to their corresponding rows of Y. Therefore, the parallel evaluations can be completed in p (tdiv -Mo) Urhe units, where tdiv is the time for performing a fixed-point division and t0 is the time for pumping out the results.
In general, the performance (or throughput) of a VLSI array architecture can be evaluated by the pipelined time, denoted by T, which is the time interval between two successive computations for the architecture. In our case, the pipelined time of the pro posed two-level macro-pipelined array is equal to the global timing clock period. Because of achieving the synchronization df the pipe, the clock period is always equal to the largest processing time of the first-level processor arrays plus the time for pumping out the results. Since the Decell algorithm is implemented in the L-modulus arithmetic system, so L two-level macro-pipes are required for the realization of the L identical parallel Decell algorithms with different moduli m4-, i = 1, • ■ • , L. Thus, the maximum clock period of the * th parallel macro-pipe, 1 < i <L, will synchronize the multiplepipe structure -24-
As an example, for a 12-link redundant robot, the associated Jacobian is a 6X12 matrix, and its pipelined time is
Next, we shall give some numerical data on evaluating the performance of the pro posed architecture for computing the pseudo-inverse Jacobian for a 12-link redundant robot. If the input data size to the proposed architecture is 16 bits, then from Table 1 , the suggested number of parallel pipes, L , is equal to 4, the largest modulus bit size, g, is equal to 5, and the dynamic range M becomes 392863. Though it is not required to use moduli of the same bit length, the choice of moduli rests on the desired dynamic range M and the number of parallel pipes L. For example, from Table 2 , a four moduli {31, 29, 23, 19 } is chosen to be greater than p(=6) which yields a dynamic range of 392863 when a 16-bit input number is used. Chiang and Johnsson [23] estimated the computational speeds of their residue adder and residue multiplier designs (in 4 /im NMOS) and showed that the k-bit residue adder and the fc-bit residue multiplier, respectively, require 10A; ~ 15k ns and 30k ~ 45k ns. Using these data, the upper com putational time bounds for a 5-bit residue adder and a 5-bit residue multiplier are, respectively, equal to 75ns and 225ns. On the other hand, if we were to use a 16-bit fixed-point adder and multiplier implemented in a 1.5 /xm CMOS integrated circuit, a 46.8 ns addition time and a 64 ns multiplication time were reported [24] . For simplicity, if we do not consider the time for data transfer and I/O interface and if we let tdiv = = 64 ns, the pipelined time for the proposed array architecture becomes T = max(6975,2437.6,384) ns =6.975 /xs. Another desired performance measure to evaluate the proposed array architecture is the initial delay time (or set-up time) which is defined as the required time interval for obtaining the first resulting pseudo-inverse Jacobian from the architecture. There are (p +4) processing stages in the proposed array architecture: p stages for evaluating Eq. (11), two stages for evaluating Eq. (12), and two stages for dealing with arithmetic decomposition and recomposition. Hence, the initial delay time is equal to (p +4)XT = 10X6.975 /xs = 69.75 /xs, when p -6. The above performance evaluations indicate that the real-time computation of a general pseudo-inverse Jacobian matrix is achievable with current VLSI custom or semi-custom design technology.
-25

Conclusion
A two-level macro-pipelined VLSI array architecture has been designed, based on the iterative Decell algorithm, for the real-time computation of the exact solution of a general p Y.,n ( n > p ) pseudo-inverse Jacobian matrix. Due to the ill-conditioned prob lem in determining the rank of the manipulator Jacobian, the residue arithmetic is employed to obviate the round-off error occurred in the Decell algorithm computations in order to obtain an ezacLsolution. This is achieved by using the residue arithmetic in determining whether or not the matrix product Ax is exactly a zero matrix in the (jPC-fl)th iteration of the Decell algorithm. The first-level arrays of the proposed archi tecture are asynchronous data-driven, wavefront-like two-dimensional arrays which per form the matrix multiplications, matrix diagonal additions, and trace computations in parallel. A pool of the first-level arrays (including the arithmetic decomposition and recomposition cells) are then configured into a second-level macro-pipeline of (p +4) stages with outputs of one array acting as inputs to another array in the pipe. The second-level linear array is a synchronous systolic array whose data movements in the array are controlled by global timing-reference clock pulses. The pipelined time of the proposed two-level pipelined array architecture has a computational order of 0(n + 2p -l) which is the same computational complexity order as evaluating a matrix product in an ordinary wavefront array. For a 12-link redundant robot, a pipe lined time of 6.975 p,s is achievable with current VLSI custom design technology.
-28- 
