Arithmetic issu es in the calcul ation of the Singular Value Decomposition C8VD) are discu�d.. Traditional algo rithms using hardware division and square roo t are replaced with the special purpose CORDIC algorithms for computing vector rotations and inverse tangents. The CORDIC 2 X 2 SVD processor can be twice as fast as one assembled from tradi-' tional hardware units. A prototype VLSI implementation of a CORDIC SVD processor array is planned for use in real-time signal processing applications.
Introduction
Recent advances in parallel architectures and VLSI have encouraged the use of special-purpose arithmetic techniques for the implementation of computationally complex scientific algorithms. The Singular Value Decomposition CSVD) is an important algorithm for image processing [1] and is especially well-suited for analyzing data matrices from sensor arrays [21
Traditional SVD algorithms for use on unip rocess or sys tems rely heavily upon costly division and square roo t opera tions to compute the required rotation parameters. Special purpose parallel architectures and numerical algorithms can increase the efficiency of the SVD by more effectively map ping the algorithm to hard ware [3. 41 The systolic array structure of Brent. Luk. and Van Loan [5] uses an expandable square array of simple 2 X 2 processors to compute the SVD of a large matrix. In this paper, a novel architecture for a CORDIC 2 X 2 processor is presented..
In addition. a new scheme which simplifies CORDIC scale factor correction for the two-sided vector rota tion is introduced. The reduction in the area and time com plexity of the SVD processor through the use of the coo rdi nate rotation algorithms (CORDI C) is also analyzed. The replacement of explicit multiplication. division, and square root units by CORDIC modules produces a processo r that is twice as fast and also allows for a regular structure suitable for VLSI implementation. where U and V are orthogonal matrices and :t is a diagonal matrix of singular values.
SVD -Jacobi Method
The Jacobi method see ks to systematically reduce the off-diagonal elements to zero. This is done by applying a sequence of plane rotations to M which transforms Minto L. Several sweeps over the entire matrix M may be neces sary to complete the SVD. Within each sweep. the matrix elements need to be paired and appropriate rotations need to be ll cllatet Tre p X P matrix is distributed over an arr ay of � x I � simple 2 X 2 processors where the basic operation is the two-sided rotation of each 2 X 2 matrix.
2.1. Basic Methods for a 2 X 2 Matrix. respectively. The rotation matrix is R ( e ) = [_:: :: I. (3) and the input matrix is (4) The efficient computation of the rotation parameters is essential. Several methods are poss ible to solve this problem.
The two-step method first applies esY M = (9z -er) to sym metrize M and then utilizes er to diagonalize M. The direct two ang le method [5] calculates 9z and er by comput ing the inverse tangents of the data elements of M. Given M as defined in (4) . eSUM and eDIFF are e SUM = (el+ er)= tan-1 !C+ b j. (5) d-a
The two angles, I1r and er• can be separated from the sum and difference results and applied to the two-sided rotation module as in (2) to dia g onalize M.
In a typical serial computer, the calculation of the rota tion angles for the SVD is expensive and can be avoided by finding the sines and cosines directly. Matrix-vector multi plication can then be used to apply the rotations to the 2 x 2 submatrix. However, these operations still involve costly multiplication, division, and square root.
With the CORDIC al g orithms, the inverse tangent func tion is a primitive operation and the angles can be found Use CORDIC angle-solver modUle to find rotation angles;
Use CORDIC rotation module to transform the 2 x 2 matrix;
CORDlC Algorithms
The Coordinate Rotation Digital Computer algorithms (CORDIC) were first presented in 1959 by J. VoIder [6] .
Further theoretical work was done by J. Walther [7] in 1971 to show the applicability of CORDIC to various tran scendental and hyperbolic functions. The CORDlC algorithms allow fast iterative hard ware calculation of sin, cos, arctan, sinh, cosh, arctanh, product. quotient. and square roo t.
Recently. there has been renewed interest in the use of CORDlC algorithms for real-time signal processing [8] . pri marily due to the poss ibility of VLSI implementation [9] . The applicability of CORDIC to the basic operations in the SVD will be presented along with the limltations of the algorithm. The CORDIC SVD processor described here may be used as a math co-processor or within a special purpose sys tolic array.
CORDIC Recurrence Equations
The CORDIC aJgorithms are based upon defining a vector Cx 0' y 0) in tile 2-p lane. and then applying a rotational transformation. That is, the vector (x Q. Y 0) is rotated through an angle e, in the clockwise direction, to ex � , Y �) .
The CORDIC equations describe a rotation in one of three modes: circular, Ime.ar. or hyperbolic. For the SVD, the rota tions are in the circular mode and the eq uations are:
114
The CORDIC algorithms decompose the rotation angle into a sequence of n known smal ler angles. such that
where ej > 0 and 8/ == ± 1. From the geometry of rota tions. it is clear that the result of n rotations usin g the sequence of e/ 's is equivalent to that of one rotation using e.
The number of known angles in the sequence determines the accuracy of the CORDIC algorithms. In order to achieve n bits of accuracy, at least n rotations must be performed.
From the rotation equations. the recurrence equations describing these rotations can be found. If the recurrence equations are divided by cos e/, t.hen
The key contribution of VoIder [6] and Walther [7] was to set tan ei = {3-/ where {3 is the machine radix. In most applications, binary arithmetic is used, so (3 = 2, and there fore multiplication by tan e/ becomes a simple arithmetic shift operation. 
CORDIC Scale Factor Correction
The CORDIC formulation is not yet complete since the vector is not only rotated but also scaled at each iteration.
This scaling is only by a constant. and can be factored from the recurrence equations. If k; == cos e/. th e n the CORDIC equations are:
If the multiplication by k, is postponed until after the com pletion of n iterations. then th e scale factor, K n' can be defined as:
,,-1
IT cos ei = IT 1.
The final CORDIC iteration equations which are to be impJe ICltmted in hardware are:
US)
The desired I1nal values of the CORDIC operation x.. and , J inal Y / iTl"!' are not readily obtained by these equations. Instead.
after after n iterations the values
are determ ined. As a last step. a scale factor correction needs to be pcrfllrmed to yield.
(17)
where the scale factor correction constant is K SFC � C K ".
The constant C is either 1 or a power of the machine radix, f3, so that it can be easily cleared by a simple shift to yield
For the SVD processor, a CORDIC scale factor correction is required after the two-sided vector rotation. While a sin gle CORDIC vector rotation module requires correction by K", a two-sided vector rotation module requires correction by K ,,2. The time complexity of the existing schemes will be described and a novel method which offers a factor of two speedup for the two-sided vector rotation will be presented.
The most direct scheme involves a multiplication by After approximate] y n I 2 special iterations, the scale factor would then be corrected and the final values determined.
Haviland and Tuszynski [9] proposed a method whereby the special scale factor correction iterations,
are performed for both the x and y variables. This scheme also causes a multiplication of xn and Yn by KSFC and pro duces X final. and Y fina l as in (17) . The scale factor correc tion constant will be
where j E J and J = {2. Ahrnoo [8] see ks to make the constant, C, a power of the machine radix by repeating certain full CORDIC itera tions. A final shift will then clear the remaining scale c.on "�tanto This scheme operates differently from the previous met.hods since it relies on using extra CORDIC rotation itera tIons instf'..a d of special multiplicative scaling iterations.
[{ec11l1 from (9) and (10) (1 ± 2-J) when special scale factor iterations are applied.
The scale factor constant,
where J = {O, 1,3,5,6,8, 1 4 } and C = 2, also requires a final shift to yield x final an d Y final as in (23).
Each of the above schemes requires numerical methods to determine the appropriate sequence for a particular word length. The number of iterations is chosen to reduce the approximation error to less than 2-n• Unfortunately, these schemes lack a systematiC approach, and are diffi cult to extend to the two-sided rotation required by the SVD. A novel method is now presented for the two-sided rotation.
In the SVD processor, one CORDIC scale factor correction can be performed for the complet e two-sided vector rotation instead of for each single vector rotation. Thus, the scale fac tor will be the square of the single rotation factor. From (13),
and the novel observation is made that each term resembles the special scale factor iterations shown in (2(J). If the CORDIC processor performs special iterations of the form (26) similar to those of Haviland and Tuszynski, then the scale factor correction constant will be 
CORDIC Operation Modes
The CORDIC algorithms can be generalized to provide the calculation of several functions. In order to facilitate these operations, a third equation is added to the two rota tion equations to accumUlate the choice of angle used at each iteration:
The variable, Zi' contains the total rotation angle applied, G i is the current rotation angle increment, and 8! = ± 1.
Through the appropriate selection of each 81, either the ini tial Z 0 value can be reduced to zero (z -reduction) or the ini tial Yo value can be reduced to zero Cy -reduction).
Inverse Tan g ent
In the circular mode, the y -reduction will yield the quantity tan-1 (Yol x 0). This can be shown as follows. Con sider the CORDIC equations:
If, after n iterations, Y n = 0, and tan e = (y 01 x 0) and
Note that the scale factor K n cancels from the calculation.
Vector Rotation
In the circular mode, the Z -reduction will yield a vec tor rotation or the sine and cosine of the original angle.
Again consider the CORDIC equations. If, after n iterations, zn = 0, then the angle e = Z 0 and
This represents rotating (x 0' Yo) by the angle z o' The appli cation of vector rotations is an important step in the SVD. Note, however, that the scale factor Kn does remain in this calculation.
Convergence Iss ues
Walther [7] has shown that the domain of convergence lib of the CORDIC algorithms is limited by the sum of the series of the n known rotation angles. Therefore, since Bl > 0, the maximum angular rotation, O!o' is given by
If a non-repetitive sequence of angles (i = 0, ... ,n -0 is used for the circular mode, then O!o :::: :: : 99". Once the angle a satisfies IO! I > no, the CORDIC algorithms no longer converge. The result remains the same as that for sign(O!) 00-
The CORDIC convergence properties are related to the behavior of the tangent function. For any i = 0,···. n -I, it is required that n-l e! -r. Gh < e n-1 • (36)
In the circular mode, the inverse tangent function is used and this relation holds since
A new extension to the inverse tangent algorithm is proposed here to enable the CORDIC processor to determine the principal value of the inverse tangent function. In the SVD algorithm, it is desirable to limit rotations to ±90" based upon (5) and ( 6 ) . However. the CORDIC processor con siders the entire unit circle in computing the inverse tangent, although circular mode convergence is limited to only ±99". Therefore complexity of the proposed CORDIC SVD architectures will be com pared and presented in terms of the area, A c, and time, T C , complexity of a basic fully parallel CORDIC proces sor.
The area complexity of an n bit CORDIC processor which performs n iterations can be determined from the fully parallel CORDIC processor design [ 8 ] shown in Figure 1 .
The main substructures will be a programmable logic array (PLA) for finite state control, a ROM for storage of the angles used by the CORDIC algorithm, and hardware for the x ,y, and z variables, such as barrel shifters (sH), adders (ADD), and registers (REG). Therefore, the total area of a CORDIC processor. Ac. is Ac = ApLA + AROM + 2 ASH
For a fued-point implementation. the largest area in this design will be used by the barrel shifters which have bee n se lected to multiply by 2-i in the least amount of time. Since a constant time shift is desired, the area complexity of an n -bit barrel shifter will be 0 en 2), Although other schemes exist which require less area, a constant time shift would no longer be poss ible [11] , Therefore, the area com plexity of an entire CORDIC module will be
The internal structure of a parallel fixed point CORDIC processor is based upon the form of a CORDIC rotation equa tion:
Therefore, the time for one CORDIC iteration, T Ct, is
where T ADD' T SH ' T ST are. respectively, the time for addi-.
tion, shifting, and the sign test that determines °1 = ± 1-
The total time for a complete CORDIC operation. T c' is
where n is the number of bits in the operandS. 
If vector rotation is to be performed, then additional scale factor correction iterations need to be performed. For a one-sided vector rotation, an additional CORDIC scale factor correction time, T SFC � 1/ 4T c' is required using Delosme's method [1O� For the two-sided rotation, the method intro duced here can be used and the scale factor correction time is T SFC = l/ 8T c per rotation. Therefore the total time for a CORDIC two-sided rotation, T T -S' is T T -s = 2 (T c + T SFC) 
1. Parallel Diagonaliza.tion Method
The fourth method, the Parallel Diagonalizatlon Method [121 which is based upon determining e S U M and e D ! F F directly, results in a reduction in the time and area necessary for the 2 X 2 processor. In this architecture, shown in Figure 2 , the calculation of esy M is replaced by the calculation of eD1FF• Additionally, the entire sym metrization rotation is eliminated.. These modifications allow the area of the processor to be reduced while preserving the time needed for computation. The algorithm can be sum marized as follows: The time complexity of the complete CORDIC 2 X 2 SVD processor can be determined from the longest path. Ini tially, the sums and differences of the matrix elements of M need to be determined. These four additions can be done in parallel. Therefore, the preprocess time is T PRE = T ADD'
The angles eSUM and eD1FF are computed in parallel in TAT AN = T c = n CT ADD + T SH + T S T ) by two CORDIC modules. The separation of 9z and er can be com puted in parallel using an adder followed by a shifter, T SEP = (T ADD + T SH)' Finally, the two-sided CORDIC rotation with the new scale factor correction method can be performed i n T T -s = 225T C . The total time for a CORDIe 2 X 2 SVD, TCSVD' is
This expression can be simplified to yield:
The area required by this architecture is approximately twice that of a single CORDIC processor. The calculation of eSUM and eD1FF uses two CORDIe modules. Also, these two modules can perform the additions and shifts that are required to prepare flz and Gr' Finally, these modules will be available and can be reconfigured to compute the diagonali zation and scale factor correction of the 2 X 2 submatrix.
Therefore, this architecture reqUires an area
118 5.
Comparison with Traditional Arithmetic Tech niques
The SVD can also be computed using traditional multi plication, division, and square root An algorithm to calculate the rotation parameters is given in Brent, Luk, and Van Loan [5] . This algorithm produces the sine and cosine pairs for the diagonalization. An analysis of this algorithm shows that the execution time, T TCS' is T TCS = 5 T ADD + 4T MU LT + 3T DIV + zr SQRT ' (48) and the area, ATCS ' is ATCS = 4AADD+4AMULT+ADIV+AsQRT ' ( 4 9 ) if the maximum amount of parallelism is exploited.
A two-sided multiplication then follows to diagonalize the matrix. With parallel hardware, this can be done in time, T TT -s, equal to
with area, ATT -S
The total execution time for a 2 X 2 SVD on "traditional"
The total area required to support the parallelism in the 
CORDIC SVD Diagonalization Module
In a prototype system, the CORDIC Parallel Diagonaliza tion Method would be used since the least time and area are needed and the structure is regular. The basic floor-plan of a VLSI imp lementation is shown in Figure 3 . Three major sec tions are visible: two CORDIC processors, and an interconnec tion network.
1. Module Organization
The CORDIC processors are based upon the design shown in Figure 1 . The intra-module interconnection network will allow the same chip to function as both an angle solver and a rotation modu le, and will permit flexibility in designing, constructing, and reconfiguring a large array.
Finite state control for the interconnection network and for the SVD algorithm will be provided by a PLA. The array will be connected in a mesh confi.guration [5] . Each module will poss ess the necess ary control for systolic I/O. A basic layout for an SVD array composed of CORDIC modules is given in Figure 4. 
Internal Data Representation
In a fixed-point implementation, the number of bits in t h e internal data representation must be chosen to prevent loss of signiftcance due to rounding and overflow. In a proto type implementation, all quantities wil l be considered to be fractions.
In order to determine the number of bits necessary to prevent overflow, the following analysis is presented. The internal registers in an SVD processor must be able to store the largest calculated singular value. Therefore, worst-case bounds on the largest singular value of the input matrix are necessary.
If the entire p X P matrix, M, is divided by the larg est value, m1 j, then all of the elements of M will be frac tions. In the worst case, all elements of M will remain equal to unity, and rank eM ) = 1. However, the external interprocessor data path will only need to contain N ext == n + log Z p bits to prevent overflow.
In order to reduce the execution time, attention must be paid to addition techniques, since each CORDIC processor will perform 0 en ) additions for each 2 X 2 diagonalization.
Several alternative addition algorithms can be utilized including ripple-carry, carry look-ahead, signed-digit [161 and on-line addition [171 Efficient methods for a d dition which ensure that the time for addition, T ADD ' can be minimized wil l be important for system implementation.
Additionally. the CORDlC data paths could be modified to provide for a floating-point representation. Finally. methods for fault detection and reconftguration will become impor tant for large arrays of processors. All of these factors will have an effect upon the integration density achievable in VLSI.
Summary
The SVD is a computationally complex algorithm that can benefit from spe cial purpose arithmetic algorithms. The CORDlC algorithms have been shown to efficiently produce a 2 x 2 SVD processor architecture which can be implemented in VLSL The CORDlC processor has been enhanced through improvements in the scale factor correction scheme for the tW(}-Si d ed vector rotation. The novel architecture of the CORDIC Paralle l Diagonalization Method, which has bee n presented, has area, Acsv D = 2Ae , and time. T C SV D = 3.25T c' where Ac and Te are the area and time for one CORDIC operation, respectively. The structure is sim ple and more regular than a design using standard mu ltipli cation, division, and square root cells and resu lts in a factor of two speec;lup. A VLSI implementation of this architecture is planned as part of a prototype system.
