Architectures for systolic array processor elements for calculating the singular value decomposition (SVD) are proposed. These special purpose VLSI structures incorporate the coordinate rotation (CORDIC) algorithms to diagonalize 2×2 submatrices of a large array. The area-time complexity of the proposed architectures is analyzed along with topics related to a prototype implementation.
Introduction
Real-time signal processing is undergoing rapid development due to recent advances in parallel architectures and VLSI. Many important algorithms that were once considered too computationally complex are now being reinvestigated.
One such algorithm is the singular value decomposition. The SVD provides a reliable way to detect and correct the ill-conditioning that can occur in data matrices received from sensor arrays.1 Digital image processing also uses this algorithm for image enhancement.2 Parallel architectures, in particular systolic arrays, show great potential for improving the performance of the SVD.3 Recent research has shown that special purpose VLSI structures are possible for the SVD.4,5 It has been suggested that novel architectures which "map" the algorithm more closely to hardware are desirable.6 The coordinate rotation algorithms, CORDIC, have been shown to have enormous potential in this application4,7 due to their ability to compute inverse tangents and vector rotations.
With the systolic array organization of Brent, Luk, and Van Loan,8 a high degree of parallelism can be obtained. In this paper, several novel architectures for a CORDIC 2×2 SVD processor are proposed. These modular 2×2 submatrix units can be combined to build a larger Brent-Luk-Van Loan array for use in real-time signal processing applications. The array contains diagonal processors that compute the two-sided rotations, and off-diagonal processors that apply them.
In this paper, the area and time complexity of the various architectures will be analyzed. The goal is a CORDIC architecture that computes the SVD in the minimum amount of time and area. A computer simulation has been performed to compare these architectures, and a prototype system is planned which is based upon the architecture with the smallest execution time. As part of this work, the domain of convergence of the CORDIC inverse tangent function has been extended through an enhancement to the algorithm. Additionally, the results which are presented here for a fixed-point implementation can be extended to floating-point arithmetic through modifications to the data paths of the CORDIC processors.
SVD-Jacobi method
The singular value decomposition9 of a p×p matrix M is
$$M = U \Sigma V^T, \qquad (1)$$
where U and V are orthogonal matrices and $\Sigma$ is a diagonal matrix of singular values.
The Jacobi method seeks to systematically reduce the off-diagonal elements to zero. This is done by applying a sequence of plane rotations to M which transforms M into $\Sigma$. Several sweeps over the entire matrix M may be necessary to complete the SVD. Within each sweep, the matrix elements need to be paired and appropriate rotations need to be calculated. The p×p matrix is distributed over an array of simple 2×2 processors where the basic operation is the two-sided rotation of each 2×2 matrix.
Basic methods for a 2×2 matrix
A 2×2 SVD can be described as a two-sided rotation that reduces M to diagonal form, where $\theta_l$ and $\theta_r$ are the left and right rotation angles, respectively. The rotation matrix is
$$R(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$$
and the input matrix is
$$M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}.$$
The goal is to compute the rotations efficiently. Several methods are possible to solve this problem. One method uses $\theta_{SYM} = \theta_l - \theta_r$ to symmetrize M and $\theta_r$ to diagonalize M, while a second method uses a parallel calculation of $\theta_l$ and $\theta_r$ to diagonalize M.
The rotation parameters can be calculated from the inverse tangents of the elements of M. Also, the diagonalization of M can be performed by treating M as a pair of vectors and using the rotation angles to transform M. The computation of these vector rotations and inverse tangents can be performed efficiently by the CORDIC algorithms.
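The first of the two methods above can be sketched in a few lines of ordinary floating-point code. This is only an illustration under stated assumptions, not the paper's hardware: it assumes the rotation convention R(θ) = [[cos θ, sin θ], [−sin θ, cos θ]], applies the left rotation as a transpose, and uses the library `atan2` in place of a CORDIC module. The helper names (`rot`, `matmul`, `transpose`, `svd2x2_two_step`) are hypothetical.

```python
import math

def rot(theta):
    """Plane rotation R(theta) = [[cos, sin], [-sin, cos]] (assumed convention)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def matmul(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

def svd2x2_two_step(M):
    """Symmetrize M with theta_sym, then diagonalize with theta_r."""
    (a, b), (c, d) = M
    # Step 1: angle that makes R(theta_sym)^T M symmetric.
    theta_sym = math.atan2(b - c, a + d)
    S = matmul(transpose(rot(theta_sym)), M)
    e, f, g = S[0][0], S[0][1], S[1][1]
    # Step 2: half of the inverse tangent gives the diagonalizing angle.
    theta_r = 0.5 * math.atan2(2.0 * f, g - e)
    theta_l = theta_sym + theta_r  # so that theta_sym = theta_l - theta_r
    D = matmul(matmul(transpose(rot(theta_l)), M), rot(theta_r))
    return theta_l, theta_r, D

M = [[1.0, 2.0], [3.0, 4.0]]
theta_l, theta_r, D = svd2x2_two_step(M)
```

The diagonal entries of `D` are the singular values of `M` up to sign; a final sign correction and ordering step is omitted from this sketch.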
CORDIC algorithms
The CORDIC algorithms were first presented in 1959 by J. Volder.10 CORDIC is an acronym for Coordinate Rotation Digital Computer. Further theoretical work was done by J. Walther11 in 1971 to show the applicability of CORDIC to various functions. During the last several years, there has been renewed interest in CORDIC algorithms, principally due to the possibility of VLSI implementation12 and the application to real-time signal processing.4,6 In this section, the CORDIC algorithms will be described. Additionally, the applicability of CORDIC to the basic operations in the SVD will be presented along with the limitations of the algorithm. The CORDIC algorithms provide an iterative VLSI method to calculate the transcendental and hyperbolic functions. The goals have been fast hardware calculation of sin, cos, arctan, sinh, cosh, arctanh, product, quotient, and square root. The target application given here is a math processor within a special purpose systolic array.
In a typical serial computer, the calculation of the rotation angles for the SVD is expensive and can be avoided by finding the sines and cosines directly. Matrix-vector multiplication can then be used to apply the rotations to the 2×2 submatrix. With the CORDIC algorithms, the inverse tangent function is a primitive operation and the angles can be found explicitly, without penalty. Also, vector rotations are primitive CORDIC operations and can replace traditional matrix-vector multiplication.
CORDIC equation set
The CORDIC algorithms are based upon defining a vector $(x_0, y_0)$ in the plane, and then applying a rotational transformation. That is, the vector $(x_0, y_0)$ is rotated through an angle $\theta$, in the clockwise direction, to $(x_0', y_0')$. The CORDIC equations describe a rotation in one of three modes: circular, linear, or hyperbolic. For the SVD, the rotations are in the circular mode and the equations are:
$$x_0' = x_0 \cos\theta + y_0 \sin\theta, \qquad (4)$$
$$y_0' = -x_0 \sin\theta + y_0 \cos\theta. \qquad (5)$$
The CORDIC algorithms decompose the rotation angle into a sequence of n known smaller angles, such that
$$\theta = \pm\theta_0 \pm \theta_1 \pm \cdots \pm \theta_{n-1} = \sum_{i=0}^{n-1} \delta_i \theta_i, \qquad (6)$$
where $\theta_i > 0$ and $\delta_i = \pm 1$. From the geometry of rotations, it is clear that the result of n rotations using the sequence of $\theta_i$'s is equivalent to that of one rotation using $\theta$. The number of known angles in the sequence determines the accuracy of the CORDIC algorithms. In order to achieve n bits of accuracy, at least n rotations must be performed.
From the rotation equations, the recurrence equations describing these rotations can be found. If the recurrence equations are divided by $\cos\theta_i$, then
$$x_{i+1} = x_i + \delta_i y_i \tan\theta_i, \qquad (7)$$
$$y_{i+1} = y_i - \delta_i x_i \tan\theta_i. \qquad (8)$$
The key contribution of Volder10 and Walther11 was to set $\tan\theta_i = \beta^{-i}$, where $\beta$ is the machine radix. In most applications, binary arithmetic is used, so $\beta = 2$, and therefore multiplication by $\tan\theta_i$ becomes a simple arithmetic shift operation. For example, when i = 0, then $2^{-i} = 1$ and $\theta_i = \tan^{-1}(2^{-i}) = 45°$. Again, for i = 1, $\theta_i \approx 26.6°$. As i increases, $\theta_i$ decreases toward 0.
The CORDIC formulation is not yet complete since the vector is not only rotated but also scaled at each iteration. This scaling is only by a constant, and can be factored from the recurrence equations. If $k_i = \cos\theta_i$, then the CORDIC equations are:
$$x_{i+1} = k_i \,(x_i + \delta_i y_i 2^{-i}), \qquad (9)$$
$$y_{i+1} = k_i \,(y_i - \delta_i x_i 2^{-i}). \qquad (10)$$
If the multiplication by $k_i$ is postponed until after the completion of n iterations, then the scale factor, $K_n$, can be defined as:
$$K_n = \prod_{i=0}^{n-1} k_i = \prod_{i=0}^{n-1} \cos\theta_i. \qquad (11)$$
The final CORDIC equations are:
$$x_{i+1} = x_i + \delta_i y_i 2^{-i}, \qquad (12)$$
$$y_{i+1} = y_i - \delta_i x_i 2^{-i}. \qquad (13)$$
These equations are executed for n iterations and then a final scale factor multiplication by $K_n$ is performed.
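The shift-and-add recurrence and the postponed scale factor can be checked numerically. The sketch below is an illustration only: it uses floating-point multiplication by 2^-i where hardware would shift, the iteration count of 16 and the sequence of signs `deltas` are arbitrary choices, and the function names are hypothetical.

```python
import math

def cordic_tables(n):
    """Angles theta_i = atan(2^-i) and scale factor K_n = prod(cos(theta_i))."""
    thetas = [math.atan(2.0 ** -i) for i in range(n)]
    k_n = 1.0
    for t in thetas:
        k_n *= math.cos(t)
    return thetas, k_n

def cordic_rotate_unscaled(x, y, deltas):
    """Apply x <- x + d*y*2^-i, y <- y - d*x*2^-i for each sign d in deltas."""
    for i, d in enumerate(deltas):
        x, y = x + d * y * 2.0 ** -i, y - d * x * 2.0 ** -i
    return x, y

thetas, K_n = cordic_tables(16)
deltas = [1, -1, 1, 1] + [1] * 12                   # arbitrary signs, one per iteration
theta = sum(d * t for d, t in zip(deltas, thetas))  # total clockwise angle
x, y = cordic_rotate_unscaled(1.0, 0.0, deltas)
x, y = K_n * x, K_n * y                             # the single postponed scaling
# (x, y) now equals (1, 0) rotated clockwise by theta: (cos(theta), -sin(theta))
```

With 16 iterations the computed `K_n` is about 0.60725, and the first two table entries are 45° and about 26.57°.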
Scale factor considerations
The major limitations of the CORDIC algorithms are the treatment of the scale factors and the narrow domain of convergence. Many proposals6,12,13,14 have been made to cope with the scale factor problems.
In most CORDIC designs, the number of iterations, n, is fixed. This permits $K_n$ to be fixed, and for a traditional sequence of angles without repetitions, $K_n \approx 0.607$. In many cases, fewer than n iterations are necessary. However, the storage of $K_n$ for all possible values of n is necessary if early termination is permitted. This limitation of the CORDIC algorithms is presented in the discussion by Bridge et al.15 of techniques for latency reduction.
Two classes of techniques are described in the literature to cope with the scale factor dilemma. They include either special scale factor compensation iterations12,14 or modified repetitive angle sequences.6,13 These methods have been developed to eliminate the final costly multiplication by $K_n$ that would otherwise be necessary with Walther's11 original algorithm.
Despain described a "compensated" CORDIC algorithm.14 This method applies a correction factor at selected iterations to try to force the scale factor, $K_n$, to unity. In this way, the magnitude correction iterations can be combined with the rotation iterations by using an additional parameter.
Haviland and Tuszynski12 have implemented an approach similar to that of Despain. Their CORDIC processor is capable of performing either a CORDIC rotation step or a special scale factor reduction step. If these extra steps are performed for certain iterations, then the scale factor is reduced to unity.
The special scale factor compensation iteration technique has several drawbacks. First, the area complexity of the control structure is increased to allow for the special cycles and extra data handling. Second, although the extra iterations do not add to the time complexity as much as an explicit multiplication, these iterations do not extend the domain of convergence.
Ahmed6 suggested a method that is based upon repeating certain iterations. Since $K_n$ is based upon the product of the individual $\cos\theta_i$, it is possible to repeat certain iterations so that $K_n$ will become a power of the machine radix. Ahmed proposed an example angle sequence that yields $K_n \approx 0.50$. Therefore, a simple arithmetic shift after the last iteration will correct for scaling. As a benefit, the extra iterations increase the domain of convergence at the expense of extra time required for a complete operation.
Delosme13 extended Ahmed's work and combined it with that of Haviland and Tuszynski. He described an optimization procedure to eliminate the scale constant that uses the least number of scale factor compensation iterations and repeated iterations. However, this method produces an implementation that requires increased complexity in the control structure.
CORDIC operation modes
The CORDIC algorithms can be generalized to provide the calculation of several functions. In order to facilitate these operations, a third equation is added to the two rotation equations to accumulate the choice of angle used at each iteration:
$$z_{i+1} = z_i + \delta_i \theta_i. \qquad (14)$$
The variable, $z_i$, contains the total rotation angle used, $\theta_i$ is the current rotation angle increment, and $\delta_i = \pm 1$ indicates whether to add or subtract this angle increment.
In a full CORDIC processor, either the initial $z_0$ value can be reduced to zero (z-reduction) or the initial $y_0$ value can be reduced to zero (y-reduction). Through the appropriate selection of operating mode, the CORDIC processor can yield various elementary functions.
Vector rotation
In the circular mode, the z-reduction will yield a vector rotation or the sine and cosine of the original angle. Again consider the CORDIC equations. If, after n iterations, $z_n = 0$, then the angle $\theta = z_0$ and
$$x_n = K_n \,(x_0 + y_0 \tan(z_0)), \qquad (15)$$
$$y_n = K_n \,(y_0 - x_0 \tan(z_0)). \qquad (16)$$
This represents rotating $(x_0, y_0)$ by the angle $z_0$. The application of vector rotations is an important step in the SVD. Note, however, that the scale factor $K_n$ does remain in this calculation. This requires the use of one of the scale factor correction techniques, such as the repetitive angle sequences proposed by Ahmed.
Inverse tangent
In the circular mode, the y-reduction will yield the quantity $\tan^{-1}(y_0/x_0)$. This can be shown as follows. Consider the CORDIC equations:
$$x_n = K_n \,(x_0 + y_0 \tan\theta), \qquad (17)$$
$$y_n = K_n \,(y_0 - x_0 \tan\theta), \qquad (18)$$
$$z_n = z_0 + \theta. \qquad (19)$$
If, after n iterations, $y_n = 0$, then
$$\frac{y_0}{x_0} = \tan\theta. \qquad (20)$$
Thus, $\theta = \tan^{-1}(y_0/x_0)$ and if $z_0 = 0$, then
$$z_n = \tan^{-1}(y_0/x_0). \qquad (21)$$
Note that the scale factor $K_n$ cancels from the calculation.
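The two operating modes can be sketched with one loop. This is an illustration under assumptions, not the paper's processor: it uses floating point instead of fixed point, an arbitrary iteration count `N`, and it picks the sign of the z update per mode so that z-reduction rotates clockwise by $z_0$ while y-reduction accumulates the angle in z (sign conventions for z differ between CORDIC references). The function name `cordic_circular` and the mode strings are hypothetical.

```python
import math

N = 24  # number of iterations (assumed for this sketch)
THETAS = [math.atan(2.0 ** -i) for i in range(N)]
K_N = 1.0
for t in THETAS:
    K_N *= math.cos(t)  # scale factor: product of the cos(theta_i)

def cordic_circular(x, y, z, mode):
    """Circular-mode CORDIC. mode 'z' drives z to 0 (vector rotation);
    mode 'y' drives y to 0 (inverse tangent, assumes x > 0)."""
    for i, theta in enumerate(THETAS):
        if mode == "z":
            d = 1 if z >= 0 else -1
            z_next = z - d * theta  # z holds the angle still to be applied
        else:
            d = 1 if y >= 0 else -1
            z_next = z + d * theta  # z accumulates the angle applied so far
        # Shift-and-add rotation step; both updates use the old x and y.
        x, y = x + d * y * 2.0 ** -i, y - d * x * 2.0 ** -i
        z = z_next
    return x, y, z

# Vector rotation: rotate (1, 2) clockwise by 0.7 rad, then rescale by K_N.
rx, ry, rz = cordic_circular(1.0, 2.0, 0.7, "z")
# Inverse tangent: z converges to atan(4/3); no scale correction is needed for z.
ax, ay, az = cordic_circular(3.0, 4.0, 0.0, "y")
```

In the inverse tangent case the scaled value `K_N * ax` also delivers the vector length (5 for this input), while `az` is unaffected by the scale factor.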
Convergence issues
Walther11 has shown that the domain of convergence of the CORDIC algorithms is limited by the sum of the series of the n known rotation angles. Therefore, since $\theta_i > 0$, the maximum angular rotation, $\alpha_0$, is given by
$$\alpha_0 = \sum_{i=0}^{n-1} \theta_i. \qquad (22)$$
If a non-repetitive sequence of angles (i = 0, ..., n−1) is used for the circular mode, then $\alpha_0 \approx 99°$. Once the angle $\alpha$ satisfies $|\alpha| > \alpha_0$, the CORDIC algorithms no longer converge. The result remains the same as that for $\mathrm{sign}(\alpha)\,\alpha_0$.
The CORDIC convergence properties are related to the behavior of the tangent function. For any i = 0, ..., n−1, it is required that
$$\theta_i \le \sum_{j=i+1}^{n-1} \theta_j + \theta_{n-1}. \qquad (23)$$
In the circular mode, the inverse tangent function is used and this relation holds since
$$\tan^{-1}(2^{-i}) < 2\tan^{-1}(2^{-(i+1)}). \qquad (24)$$
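The convergence limit can be observed numerically. The sketch below is an assumption-laden illustration (floating point, 16 iterations, hypothetical function name `arctan_y_reduction`): it sums the angle table, checks the halving relation between consecutive angles, and shows that an input outside the domain saturates at the limit instead of converging.

```python
import math

n = 16  # iteration count (assumed)
thetas = [math.atan(2.0 ** -i) for i in range(n)]

# Maximum rotation: the sum of all angles in the non-repetitive sequence.
alpha_0 = math.degrees(sum(thetas))

# Each angle is less than twice the next one, so later iterations can
# always make up for an overshoot.
halving_ok = all(thetas[i] < 2.0 * thetas[i + 1] for i in range(n - 1))

def arctan_y_reduction(x, y):
    """Drive y toward zero and return the accumulated angle z in radians."""
    z = 0.0
    for i, theta in enumerate(thetas):
        d = 1 if y >= 0 else -1
        x, y, z = x + d * y * 2.0 ** -i, y - d * x * 2.0 ** -i, z + d * theta
    return z

inside = math.degrees(arctan_y_reduction(1.0, 1.0))  # 45 degrees: converges
outside = math.degrees(arctan_y_reduction(
    math.cos(math.radians(120.0)), math.sin(math.radians(120.0))))  # saturates
```

For this table the sum is roughly 99.9°, so a vector at 120° yields `outside` equal to `alpha_0` rather than 120°.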
Since the circular mode convergence is limited to ±99°, which covers the first and fourth quadrants of the unit circle, a new extension to the algorithm is proposed here to allow for angles in the second and third quadrants. An initial test is performed to check the signs of $x_0$ and $y_0$. If both $x_0$ and $y_0$ are negative (third quadrant), then the signs of both $x_0$ and $y_0$ are changed in order to move the angle into the first quadrant. Similarly, if $x_0$ is negative and $y_0$ is positive (second quadrant), then the signs of both $x_0$ and $y_0$ are changed in order to move the angle into the fourth quadrant. These modifications to the CORDIC algorithm allow the computation of $\tan^{-1}(y_0/x_0)$ for all $x_0$ and $y_0$ except $x_0 = y_0 = 0$. This property makes the CORDIC module an excellent choice for finding the rotation parameters for the SVD.
Area and time complexity
The VLSI model of computation concerns both the area and the time needed to perform an operation. The best VLSI architecture for the solution of a given problem has the least area and time.16 The area, $A_{CSVD}$, and time, $T_{CSVD}$, complexity of the proposed CORDIC SVD architectures will be compared and presented in terms of the area, $A_C$, and time, $T_C$, complexity of a basic fully parallel CORDIC processor.
The area complexity of an n-bit CORDIC processor which performs n iterations can be determined from the fully parallel CORDIC processor design6 shown in Figure 1. The main substructures will be a programmable logic array (PLA) for finite state control, a ROM for storage of the angles used by the CORDIC algorithm, and hardware for the x, y, and z variables, such as barrel shifters (SH), adders (ADD), and registers (REG). Therefore, the total area of a CORDIC processor, $A_C$, is
$$A_C = A_{PLA} + A_{ROM} + 2A_{SH} + 3A_{ADD} + 3A_{REG}. \qquad (25)$$
For a fixed-point implementation, the largest area in this design will be used by the barrel shifters, which have been selected to multiply by $2^{-i}$ in the least amount of time. Since a constant time shift is desired, the area complexity of an n-bit barrel shifter will be $O(n^2)$. Therefore, the area complexity of an entire CORDIC module will be
$$A_C \approx 2A_{SH} = O(n^2). \qquad (26)$$
The internal structure of a parallel fixed-point CORDIC processor is based upon the form of a CORDIC rotation equation:
$$x_{i+1} = x_i + \delta_i y_i 2^{-i}. \qquad (27)$$
Therefore, the time for one CORDIC iteration, $T_{C_i}$, is
$$T_{C_i} = T_{ADD} + T_{SH} + T_{ST}, \qquad (28)$$
where $T_{ADD}$, $T_{SH}$, $T_{ST}$ are, respectively, the time for addition, shifting, and the sign test that determines $\delta_i = \pm 1$. The total time for a complete CORDIC operation, $T_C$, is
$$T_C = n\,(T_{ADD} + T_{SH} + T_{ST}), \qquad (29)$$
where n is the number of bits in the operands. For example, the time to compute an inverse tangent, $T_{ATAN}$, is $T_C$.
The relative complexity of the primitive operations can be compared by making the following assumptions. First, if a barrel shifter design is used for the shifter implementation, then all distance shifts occur in equal time, and the approximation can be made that $T_{SH} \ll T_{ADD}$. In a two's complement fixed-point implementation, the sign test will determine whether addition or subtraction is to be performed, and $T_{ST} \ll T_{ADD}$. From these assumptions, the limiting factor in a CORDIC SVD processor is the time needed to perform an addition, $T_{ADD}$. The time for a CORDIC operation depends linearly on the number of bits in the operands and
$$T_C \approx T_{ADD}\, n = O(T_{ADD}\, n). \qquad (30)$$
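As a small arithmetic illustration of the time model, the sketch below evaluates the per-operation time for hypothetical delays. The numeric values (an adder ten times slower than the shifter, a 16-bit word) are assumptions chosen only to show the effect of the approximation; they are not measurements from the paper.

```python
def cordic_time(n, t_add, t_sh, t_st):
    """Time for n iterations, each costing one add, one shift, one sign test."""
    return n * (t_add + t_sh + t_st)

n = 16                                                  # word length in bits (assumed)
exact = cordic_time(n, t_add=10.0, t_sh=1.0, t_st=0.5)  # arbitrary time units
approx = n * 10.0                                       # keeps only the adder term

def barrel_shifter_area(n, unit=1.0):
    """Relative area of a constant-time n-bit barrel shifter, O(n^2)."""
    return unit * n * n

area_ratio = barrel_shifter_area(32) / barrel_shifter_area(16)  # doubling n
```

With these assumed delays the adder-only estimate is 160 units against 184 for the full sum, and doubling the word length quadruples the shifter area.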
CORDIC SVD processor architectures
The Jacobi method for the SVD has been shown to rely heavily upon two basic functional blocks, the inverse tangent and the vector rotation. The CORDIC algorithms have been described and have been shown to be precisely capable of performing these functions. Thus, a general algorithm for a 2×2 CORDIC SVD processor would be:
begin
    Use CORDIC angle-solver module to find rotation angles;
    Use CORDIC rotation module to transform the 2×2 matrix;
end.
The four novel CORDIC architectures to be discussed perform variations on this algorithm with different time and area costs. A computer simulation has been developed to analyze the methods. Each method which is introduced possesses an advantage in area or time. The first method is the most basic while the fourth method exploits the maximum amount of parallelism to obtain the minimum area-time complexity.
CORDIC two step architectures
The two step approach8 first uses a rotation to symmetrize M. The angle, $\theta_{SYM}$, is defined as
$$\theta_{SYM} = \theta_l - \theta_r = \tan^{-1}\left(\frac{b-c}{a+d}\right) \qquad (31)$$
and is applied to M to yield the symmetric matrix
$$\begin{pmatrix} e & f \\ f & g \end{pmatrix}. \qquad (32)$$
After the matrix is symmetrized, the diagonalization angle can be found from:
$$2\theta_r = \tan^{-1}\left(\frac{2f}{g-e}\right). \qquad (33)$$
A two-sided rotation using $\theta_r$ will diagonalize the matrix.
Symmetrization, diagonalization method
This method is a direct mapping of the equations to CORDIC inverse tangent and vector rotation modules. The architecture is sequential and will require the largest CORDIC SVD execution time, $T_{CSVD}$, to diagonalize a 2×2 matrix. Figure 2 illustrates the implementation of the following algorithm:
Algorithm CORDIC SVD Sym-Diag():
begin
    parallel do { b − c, a + d };
    Use CORDIC module to find θ_SYM;
    Use CORDIC rotation module to apply θ_SYM;
    parallel do { g − e, SHIFT(f) };
    Use CORDIC module to find 2θ_r;
    θ_r ← SHIFT(2θ_r);
    Use CORDIC rotation module to apply θ_r;
end.
The total execution time is
$$T_{CSVD} = T_{SYM} + T_{DIAG}, \qquad (35)$$
where $T_{SYM}$ and $T_{DIAG}$ are, respectively, the time for symmetrizing and diagonalizing the matrix M. An initial addition, $T_{ADD}$, is needed to prepare the operands for the calculation of $\theta_{SYM}$. The inverse tangent computation time, $T_{ATAN}$, and the symmetrization rotation time, $T_{ROT}$, both equal the time for a CORDIC vector rotation, $T_C$. The total symmetrization time is
$$T_{SYM} = T_{ADD} + T_{ATAN} + T_{ROT} = T_{ADD} + 2T_C.$$
The calculation of the diagonalization angle, $\theta_r$, requires a shift, $T_{SH}$, after the inverse tangent computation and the initial additions. The two-sided rotation time, $T_{T\text{-}S}$, is equivalent to the time for two CORDIC rotations, or $2T_C$. Therefore, the total diagonalization time is
$$T_{DIAG} = T_{ADD} + T_{ATAN} + T_{SH} + T_{T\text{-}S} = T_{ADD} + T_{SH} + 3T_C.$$
These results can be combined to yield
$$T_{CSVD} = 2T_{ADD} + T_{SH} + 5T_C \approx 5T_C.$$
A reduction in latency can be achieved by pipelining the symmetrization angle inverse tangent module with the symmetrization rotation module. Since both modules utilize the same CORDIC angles, the intermediate results can be processed by the rotation module after an initial sign test time, $T_{ST}$. Pipelining reduces $T_{SYM}$ by approximately $T_C$ and the final execution time for this method is
$$T_{CSVD} \approx 4T_C.$$
The area used by this architecture can be expressed in terms of the area of a basic CORDIC processor, $A_C$.
A reduction in the total area can be achieved by increasing the utilization of the hardware. For example, the shift required to produce $\theta_r$ can be performed by the inverse tangent module. Also, two inverse tangent modules may not be needed since the module that finds $\theta_{SYM}$ will be available to calculate $\theta_r$. Similarly, the two-sided rotation can be performed by the symmetrization rotation module. The minimum hardware configuration is limited by the pipelined symmetrization section. The final total area for this architecture is $A_{CSVD} = A_{SYM} = 3A_C$.
The reduction in area due to increased utilization is offset, however, by the increased complexity of the internal interconnection and control structures.
Approximation method
In the previous architecture, the diagonalization proceeded sequentially since the angle, 29,,, was first calculated, then multiplied by 1 /z, and finally applied to rotate the matrix. This new architecture, shown in Figure 3 , seeks to reduce TCSVD by pipelining the second or diagonalization rotation. A simple one -half approximation is performed between the arctan and rotation modules. Since the CORDIC algorithm angles decrease by approximately one -half, the rotation module can choose the next angle in the sequence in order to perform a rotation by 9r.. This technique will allow pipelining which will reduce TCSVD by approximately 7'c and the execution time will be TCSVD = 3Tc
The area requirements remain the same for this architecture The area requirement increases for this architecture since an additional inverse tangent module is needed for the simultaneous calculation of 9sYM and esUM The total area is ACSVD = 4Ac .
(50)
Although this architecture has reduced TCSVD while preserving numerical accuracy, the area has enlarged. Since area is an important parameter for VLSI systems, a design which minimizes both area and time is desired.
CORDIC direct two angle architectures
The direct two angle method 8 calculates et and Or by computing the inverse tangents of the data elements of M.
The area used by this architecture can be expressed in terms of the area of a basic CORDIC processor, Ac . Each inverse tangent module has area, AATAN , equal to a CORDIC processor. The rotation modules require twice the area since they operate on two vectors. A first estimate of the total area is 
(42)
A reduction in the total area can be achieved by increasing the utilization of the hardware. For example, the shift required to produce 0r can be performed by the inverse tangent module. Also, two inverse tangent modules may not be needed since the module that finds Q$YM w*^ ^ available to calculate Or . Similarly, the two-sided rotation can be performed by the symmetrization rotation module. The minimum hardware configuration is limited by the pipelined symmetrization section. The final total area for this architecture is
ACSVD = A SYM (44)
Approximation method
In the previous architecture, the diagonalization proceeded sequentially since the angle, 20r , was first calculated, then multiplied by x/2, and finally applied to rotate the matrix. This new architecture, shown in Figure 3 , seeks to reduce TCSVD by pipelining the second or diagonalization rotation. A simple one-half approximation is performed between the arctan and rotation modules. Since the CORDic algorithm angles decrease by approximately one-half, the rotation module can choose the next angle in the sequence in order to perform a rotation by 0r . This technique will allow pipelining which will reduce TCSVD by approximately Tc and the execution time will be
The area requirements remain the same for this architecture and
Although TCSVD is reduced, convergence of the SVD -Jacobi method may have been slowed due to the inexact computation of 0,. . 
Semi-parallel method
This architecture, shown hi Figure 4 , seeks to exploit parallelism in the computation of the rotation angles. In parallel with the computation of the symmetrization angle, > a second CORDIC inverse tangent module computeŝ
Once these two angles have been found, 0r can be computed by a subtraction and a shift, since
The parallel computation of the rotation angles overlaps the symmetrization and diagonalization processes and reduces the total time to
The area requirement increases for this architecture since an additional inverse tangent module is needed for the simultaneous calculation of QSYM and ®SUM Tne total area k
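The semi-parallel angle computation can be checked numerically with the sketch below. It makes several assumptions that are not taken from the source: M is [[a, b], [c, d]], the rotation is [[cos, sin], [-sin, cos]], `math.atan2` replaces the two CORDIC inverse tangent modules, and the two angle formulas in the comments are the ones implied by that sign convention. Under these assumptions the subtraction and shift take the form $\theta_r = (\theta_{SUM} - \theta_{SYM})/2$. All names are hypothetical.

```python
import math

def rot(theta):
    """Plane rotation [[c, s], [-s, c]]; this sign convention is an assumption."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(x):
    return [[x[0][0], x[1][0]], [x[0][1], x[1][1]]]

def semi_parallel_2x2(m):
    (a, b), (c, d) = m
    # The two inverse tangents are independent, so two modules can run at once.
    theta_sym = math.atan2(b - c, a + d)      # assumed symmetrization angle
    theta_sum = math.atan2(b + c, d - a)      # assumed theta_l + theta_r
    # One subtraction and one shift give the diagonalization angle.
    theta_r = 0.5 * (theta_sum - theta_sym)
    s = matmul(transpose(rot(theta_sym)), m)  # symmetrization rotation
    rr = rot(theta_r)
    return theta_r, matmul(matmul(transpose(rr), s), rr)
```

Unlike the sequential version, `theta_r` here does not depend on the symmetrized matrix, which is what allows the angle computation to overlap the symmetrization rotation.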
CORDIC direct two angle architectures
The direct two angle method 8 calculates $\theta_l$ and $\theta_r$ by computing the inverse tangents of the data elements of M. Given M, with $\theta_{SUM}$ and $\theta_{DIFF}$ as defined in (4), the two angles, $\theta_l$ and $\theta_r$, can then be separated from the sum and difference results and applied to the two-sided rotation module as in (2).
Parallel diagonalization method
In this architecture, shown in Figure 5, the calculation of $\theta_{SYM}$ is replaced by the calculation of $\theta_{DIFF}$. Additionally, the entire symmetrization rotation is eliminated. These modifications allow the area of the processor to be reduced while preserving the time needed for computation. The algorithm can be summarized as follows:
begin
  parallel do { form the sums and differences of the elements of M };
  parallel do { use CORDIC modules to find $\theta_{SUM}$ and $\theta_{DIFF}$ };
  parallel do { separate $\theta_l$ and $\theta_r$ };
  Use CORDIC rotation module to apply $\theta_l$, $\theta_r$;
end.
The time complexity of the complete CORDIC 2x2 SVD processor can be determined from the longest path. Initially, the sums and differences of the matrix elements of M need to be determined. These four additions can be done in parallel. Therefore, the preprocess time is $T_{PRE} = T_{ADD}$.
The angles $\theta_{SUM}$ and $\theta_{DIFF}$ are computed in parallel in $T_{ATAN} = T_C = n(T_{ADD} + T_{SH} + T_{ST})$ by two CORDIC modules. The separation of $\theta_l$ and $\theta_r$ can be computed in parallel using an adder followed by a shifter, $T_{SEP} = (T_{ADD} + T_{SH})$. Finally, the two-sided CORDIC rotation can be performed in $T_{T-S} = 2T_C$. The total time for a CORDIC 2x2 SVD, $T_{CSVD}$, is
$$T_{CSVD} = T_{PRE} + T_{ATAN} + T_{SEP} + T_{T-S} .$$
This expression can be simplified to yield:
$$T_{CSVD} = 2T_{ADD} + T_{SH} + 3T_C \approx 3T_C .$$
The area required by this architecture is approximately twice that of a single CORDIC processor. The calculation of $\theta_{SUM}$ and $\theta_{DIFF}$ uses two CORDIC modules. Also, these two modules can perform the additions and shifts that are required to prepare $\theta_l$ and $\theta_r$. Finally, these modules will be available and can be reconfigured to compute the diagonalization of the 2x2 submatrix. Therefore, this architecture requires an area
$$A_{CSVD} = 2A_C .$$
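A minimal floating-point sketch of the parallel diagonalization steps is given below. The definitions of $\theta_{SUM}$ and $\theta_{DIFF}$ are not reproduced in this section, so the forms used here, tan(θ_SUM) = (c + b)/(d − a) and tan(θ_DIFF) = (c − b)/(d + a) with θ_SUM = θ_r + θ_l and θ_DIFF = θ_r − θ_l, are assumptions tied to the rotation convention [[cos, sin], [-sin, cos]]. `math.atan2` stands in for the CORDIC modules and the function name is hypothetical.

```python
import math

def parallel_diagonalization_2x2(m):
    (a, b), (c, d) = m
    # Preprocess: four sums and differences, independent of one another.
    s_num, s_den = c + b, d - a
    d_num, d_den = c - b, d + a
    # Two inverse tangents, also independent of one another.
    theta_sum = math.atan2(s_num, s_den)     # assumed theta_r + theta_l
    theta_diff = math.atan2(d_num, d_den)    # assumed theta_r - theta_l
    # Separation: one add or subtract and one shift per angle.
    theta_r = 0.5 * (theta_sum + theta_diff)
    theta_l = 0.5 * (theta_sum - theta_diff)
    cl, sl = math.cos(theta_l), math.sin(theta_l)
    cr, sr = math.cos(theta_r), math.sin(theta_r)
    # Two-sided rotation R(theta_l)^T M R(theta_r): right side first.
    n00, n01 = a * cr - b * sr, a * sr + b * cr
    n10, n11 = c * cr - d * sr, c * sr + d * cr
    diag = [[cl * n00 - sl * n10, cl * n01 - sl * n11],
            [sl * n00 + cl * n10, sl * n01 + cl * n11]]
    return theta_l, theta_r, diag
```

No symmetrization rotation appears anywhere in this version. The only rotation work left is the final two-sided rotation, which matches the $2T_C$ term in the timing expression.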

CORDIC SVD diagonalization module
In a prototype system, the CORDIC Parallel Diagonalization Method would be used since the least time and area are needed. The basic floor-plan of a VLSI implementation is shown in Figure 6. Three major sections are visible: two CORDIC processors, and an interconnection network. The CORDIC processors are based upon the design shown in Figure 1. The intra-module interconnection network will allow the same chip to function as both an angle solver and a rotation module, and will permit flexibility in designing, constructing, and reconfiguring a large array.
Finite state control for the interconnection network and for the SVD algorithm will be provided by a PLA. The array will be connected in a mesh configuration.8 Each module will possess the necessary control for systolic I/O. A basic layout for an SVD array composed of CORDIC modules is given in Figure 7.
In order to reduce the execution time, attention must be paid to addition techniques, since each CORDIC processor will perform O(n) additions for each 2x2 diagonalization. Several alternative addition algorithms can be utilized including ripple-carry, carry look-ahead, signed-digit, and on-line addition. Efficient methods for addition which ensure that the time for addition, $T_{ADD}$, can be minimized will be important for system implementation. Additionally, the CORDIC data paths could be modified to provide for a floating-point representation. Finally, methods for fault detection and reconfiguration will become important for large arrays of processors. All of these factors will have an effect upon the integration density achievable in VLSI.
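The dependence on addition time can be made concrete with a small fixed-point sketch of a CORDIC rotation that counts its additions. It uses integer shifts and adds only inside the loop. The 24-bit fraction, the function names, and the choice to count three add or subtract operations per iteration (one each for x, y, and the angle accumulator) are assumptions of this sketch, not details from the paper.

```python
import math

FRAC_BITS = 24  # hypothetical fixed-point format: 24 fractional bits

def to_fixed(value):
    """Convert a float to the assumed fixed-point integer format."""
    return int(round(value * (1 << FRAC_BITS)))

def cordic_rotate_fixed(x, y, z, n=24):
    """Rotate fixed-point (x, y) by the fixed-point angle z, in radians.

    Returns the rotated vector, still scaled by the CORDIC gain, and the
    number of additions or subtractions that were performed.
    """
    atan_table = [to_fixed(math.atan(2.0 ** -i)) for i in range(n)]
    adds = 0
    for i in range(n):
        dx, dy = y >> i, x >> i            # arithmetic right shifts
        if z >= 0:                         # sign test selects the direction
            x, y, z = x - dx, y + dy, z - atan_table[i]
        else:
            x, y, z = x + dx, y - dy, z + atan_table[i]
        adds += 3                          # one add/subtract each for x, y, z
    return x, y, adds
```

Under this count an n-iteration rotation performs 3n additions, with the x, y, and angle updates able to proceed side by side, so a faster adder shortens every iteration of every CORDIC operation in the array.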
Summary
The CORDIC 2x2 SVD processor achieves a simple structure requiring a small number of functional units that are interconnected in a manner suitable for VLSI implementation. The Jacobi method for the SVD has been reviewed and the applicability of the CORDIC algorithms has been discussed. Novel architectures for a special purpose systolic processor have been presented and analyzed. The design objective has been the minimization of both area and time. The Parallel Diagonalization Method has area, $A_{CSVD} = 2A_C$, and time, $T_{CSVD} = 3T_C$, where $A_C$ and $T_C$ are the area and time for one CORDIC operation, respectively. A VLSI implementation of this architecture is planned as part of a prototype system.
