The design and custom CMOS VLSI implementation of a CORDIC SVD processor are presented. Special-purpose parallel processor arrays have many important applications in real-time signal processing. The processor architecture is reviewed and the current CORDIC Z Control and X,Y Data Path chips are described. The hierarchical design methodology will lead next to a full CORDIC processor followed by a complete CORDIC SVD processor and array.
Introduction
The Singular Value Decomposition (SVD) of a matrix is an important, computationally complex algorithm which can benefit from the recent advances in parallel architectures and VLSI. The array structure of Brent, Luk, and Van Loan [2] uses an expandable square systolic array of simple 2 x 2 processors to compute the SVD of a large matrix. A VLSI processor array for the SVD would have applications in real-time signal processing and image processing.
Algorithms for use on uniprocessor systems require many division and square root calculations to compute the necessary sine and cosine rotation parameters. The Coordinate Rotation (CORDIC) algorithms [11] provide efficient hardware calculation of vector rotations and inverse tangents. We have previously presented the architecture of a CORDIC 2 x 2 SVD processor [5]. In this paper, we describe the VLSI implementation of this architecture.
As part of the first author's thesis research at Cornell, the design of a prototype custom CMOS VLSI CORDIC X,Y Data Path Chip was begun. This chip consists of a 20-bit register, barrel shifter, and adder. There is also a data router that provides for both normal CORDIC and scale factor correction operation. The chip contains approximately 2100 transistors in addition to those contained in the pad frame. It will be submitted for fabrication by MOSIS.
At Rice, other design units are being used as projects for the VLSI Design course. These projects emphasize a conservative design approach and final product functionality as opposed to area and time minimization. A 10-bit CORDIC Z Control Chip, in which angle data is represented with 10 bits, has been completed and fabricated through the MOSIS service in a 3-μm CMOS p-well process. This chip contains approximately 1200 transistors in addition to the pads and is fully functional up to a clock speed of 8 MHz. A CORDIC processor can be implemented using the CORDIC Z Control Chip interconnected with two CORDIC X,Y Data Path Chips. Current plans include implementing the complete CORDIC SVD processor on a single chip and minimizing critical timing paths in the processor design.
Singular Value Decomposition
The singular value decomposition of a p x p matrix M is $M = U \Sigma V^T$,
where U and V are orthogonal matrices and $\Sigma$ is a diagonal matrix of singular values. In the Brent, Luk, Van Loan array, the matrix is divided into 2 x 2 submatrices. Each processor element contains a 2 x 2 submatrix. The array architecture is scalable. There are two types of data flowing in this array. Rotation angles generated by the diagonal processors flow systolically along the rows and columns of the array. Matrix data elements are exchanged diagonally, after the diagonal neighbor has received and applied the necessary rotation angles. This leads to "waves" of activity moving diagonally away from the main array diagonal.
A 2 x 2 SVD can be described as a two-sided rotation of M by a left angle $\theta_l$ and a right angle $\theta_r$. The sum and difference angles yield the two angles, $\theta_l$ and $\theta_r$, which can be applied to the two-sided rotation module to diagonalize M.
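For reference, one common way to write the two-sided rotation and its sum and difference angles is sketched below. The entry names a, b, c, d, the sign convention chosen for the rotation matrix, and the symbols $\psi_1$, $\psi_2$ for the diagonal entries are assumptions made for illustration and may differ from the authors' notation.

```latex
\begin{aligned}
R(\theta) &= \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},
\qquad
M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \\
R(\theta_l)^T \, M \, R(\theta_r) &= \begin{pmatrix} \psi_1 & 0 \\ 0 & \psi_2 \end{pmatrix}, \\
\theta_r + \theta_l &= \tan^{-1}\!\left(\frac{c+b}{d-a}\right),
\qquad
\theta_r - \theta_l = \tan^{-1}\!\left(\frac{c-b}{d+a}\right).
\end{aligned}
```

Under this convention the two inverse tangents give the sum and difference directly, and halving their sum and difference recovers $\theta_r$ and $\theta_l$.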
CORDIC Arithmetic for an SVD Processor
In a typical computer algorithm for the SVD, the sines and cosines of the rotation angles are computed through formulas that require division and square root operations. The explicit angles are not required and only the sines and cosines are computed. The rotations are then applied to the 2 x 2 submatrix using standard matrix multiplication techniques. However, time-consuming operations such as multiplication, division, and square root are needed.
The SVD is more closely mapped onto hardware through the use of the CORDIC algorithms [11]. The CORDIC algorithms can provide the calculation of vector rotation and inverse tangent. The CORDIC iteration equations which are to be implemented in hardware are:
The variable $z_i$ contains the total rotation angle applied, $\alpha_i$ is the current rotation angle increment, and $\delta_i = \pm 1$. Through the appropriate selection of each $\delta_i$, either the initial $z_0$ value can be reduced to zero (vector rotation) or the initial $y_0$ value can be reduced to zero (inverse tangent). These equations can be implemented with simple structures: registers, shifters, adders, and a small ROM.
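The iteration described above can be sketched in Python as a floating-point model. The sign convention, the function name `cordic`, the `mode` argument, and the default iteration count are assumptions made for illustration; the hardware described here uses fixed-point registers, a barrel shifter, and a ROM of angle increments instead of the library calls shown.

```python
import math

def cordic(x, y, z, n_iter=16, mode="rotation"):
    """Run n_iter circular CORDIC iterations (floating-point model).

    mode="rotation":  drive z toward 0; (x, y) is rotated counterclockwise
                      by the initial z (valid for roughly |z| <= 1.74 rad).
    mode="vectoring": drive y toward 0; z accumulates atan(y/x) (needs x > 0).
    The outputs x and y are scaled by the CORDIC gain, which is also returned.
    """
    gain = 1.0
    for i in range(n_iter):
        shift = 2.0 ** -i                  # models a right shift by i bits
        alpha = math.atan(shift)           # angle increment (ROM entry in hardware)
        if mode == "rotation":
            delta = -1 if z >= 0 else 1    # pick the sign that shrinks |z|
        else:
            delta = 1 if y >= 0 else -1    # pick the sign that shrinks |y|
        # Assumed convention: x' = x + d*y*2^-i, y' = y - d*x*2^-i, z' = z + d*alpha
        x, y = x + delta * y * shift, y - delta * x * shift
        z = z + delta * alpha
        gain *= math.sqrt(1.0 + shift * shift)
    return x, y, z, gain
```

Dividing the returned x and y by the gain gives the unscaled result; in hardware this is the role of the scale factor correction step.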
If CORDIC processors are used, then the rotation parameters can be calculated from the inverse tangents of the elements of M. The rotation angles are found explicitly, without penalty, since the inverse tangent function is a primitive CORDIC operation. The traditional matrix-vector multiplication can also be replaced since vector rotations are primitive CORDIC operations. The diagonalization of M can be performed by treating M as a pair of vectors and using the rotation angles to transform M. The computation of these vector rotations and inverse tangents can be performed efficiently by the CORDIC algorithms. Thus, a general algorithm for a 2 x 2 CORDIC SVD processor would be:
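The steps described in this paragraph can be sketched as follows. This is an illustrative outline, not the authors' listing: `math.atan2`, `math.cos`, and `math.sin` stand in for the CORDIC inverse tangent and vector rotation primitives, and the names `rotate` and `svd2x2`, the entry names a, b, c, d, and the rotation sign convention are all assumptions.

```python
import math

def rotate(x, y, theta):
    """Stand-in for a CORDIC vector rotation: the row vector (x, y)
    times R(theta), with R(theta) = [[cos, sin], [-sin, cos]]."""
    c, s = math.cos(theta), math.sin(theta)
    return x * c - y * s, x * s + y * c

def svd2x2(a, b, c, d):
    """Diagonalize M = [[a, b], [c, d]] by a two-sided rotation."""
    # Step 1: two inverse tangents give the sum and difference angles.
    theta_sum = math.atan2(c + b, d - a)
    theta_diff = math.atan2(c - b, d + a)
    # Step 2: halve the sum and difference to separate the two angles.
    theta_r = 0.5 * (theta_sum + theta_diff)
    theta_l = 0.5 * (theta_sum - theta_diff)
    # Step 3: treat the rows of M as vectors and rotate them by theta_r.
    a1, b1 = rotate(a, b, theta_r)
    c1, d1 = rotate(c, d, theta_r)
    # Step 4: treat the columns as vectors and rotate them by theta_l.
    p, r = rotate(a1, c1, theta_l)   # first column
    q, s = rotate(b1, d1, theta_l)   # second column
    return (p, q, r, s), theta_l, theta_r   # q and r should be near zero
```

For example, `svd2x2(1.0, 2.0, 3.0, 4.0)` returns a result whose off-diagonal entries vanish to rounding error and whose diagonal entries match the singular values of the input in magnitude; the diagonal entries may carry a sign.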
CORDIC SVD Processor VLSI Implementation
In the prototype fixed-point arithmetic system, the CORDIC Parallel Diagonalization Method would be used since the least time and area are needed and the structure is regular. The basic floor-plan of a VLSI implementation, shown in Figure 2, contains three major sections: two CORDIC processors and an interconnection network.
A 16-bit fixed-point implementation of the CORDIC SVD architecture is described. In a fixed-point implementation, the number of bits in the internal data representation must be chosen to prevent loss of significance due to rounding and overflow. Therefore, a 16-bit fixed-point implementation of the CORDIC SVD architecture requires 20-bit internal arithmetic. A prototype custom CMOS VLSI chip is currently being designed and will be fabricated through the MOSIS fabrication service.
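One contributor to the extra internal bits is magnitude growth during the iterations. The sketch below computes the CORDIC gain and the overflow headroom it implies for a vector whose two components are both at full scale. The function names are hypothetical, and how the four extra bits are actually divided between overflow headroom and rounding guard bits is not stated in this section, so the sketch only illustrates the growth bound.

```python
import math

def cordic_gain(n_iter):
    """Magnitude gain accumulated over n_iter circular CORDIC iterations."""
    gain = 1.0
    for i in range(n_iter):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain

def headroom_bits(n_iter):
    """Extra integer bits needed so a full-scale (x, y) pair cannot overflow.

    Assumes the worst case is |x| = |y| = full scale, so the vector length
    is sqrt(2) times full scale before the CORDIC gain is applied.
    """
    worst_growth = cordic_gain(n_iter) * math.sqrt(2.0)
    return math.ceil(math.log2(worst_growth))
```

With 16 iterations the gain is about 1.6468, so the worst-case growth is a little over 2.3 and two bits of headroom are needed under these assumptions.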
The intra-module interconnection network will allow the same chip to function as both an angle solver and a rotation module, and will permit flexibility in designing, constructing, and reconfiguring a large array. Finite state control for the interconnection network and for the SVD algorithm will be provided by a PLA. The array will be connected in a mesh configuration. Each module will possess the necessary control for systolic I/O.
The systolic control network schedules the communication among the processor elements of the array. Each processor communicates with its eight nearest neighbors. This communication requirement increases the complexity of both the VLSI layout and the packaging technology. However, the pin count of the chip can be reduced by restricting the external data paths to nibbles (4 bits). With eight neighbors and 4-bit paths, 32 bi-directional pads may be used for data transfer, although this multiplexing approach increases the time required for communication. Various control signals are also necessary. Each processor has a particular address which affects the data communication algorithm, especially at the edges of the array. The horizontal and vertical coordinates can be input serially during the initial array configuration phase. Signals to indicate start, stop, setup, and done are required in addition to the standard two-phase clock signals ($\phi_1$, $\phi_2$), power (Vdd), and ground (GND). A sixty-four pin package will be sufficient for a prototype CORDIC SVD chip. Internally, a programmable logic array (PLA) will control the data movement between the two CORDIC modules and the I/O pads. This PLA will also control the CORDIC modules.
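The nibble multiplexing described above can be illustrated with a short sketch. The function names, the least-significant-nibble-first ordering, and the 16-bit external word width are assumptions for illustration; the actual transfer protocol is not specified in this section.

```python
NEIGHBORS = 8          # each processor talks to its eight nearest neighbors
NIBBLE_BITS = 4        # external data paths are restricted to 4 bits
DATA_PADS = NEIGHBORS * NIBBLE_BITS   # bi-directional data pads per chip

def to_nibbles(word, width=16):
    """Split an unsigned word into 4-bit nibbles, least significant first."""
    word &= (1 << width) - 1
    return [(word >> shift) & 0xF for shift in range(0, width, NIBBLE_BITS)]

def from_nibbles(nibbles):
    """Reassemble a word from nibbles received least significant first."""
    word = 0
    for k, nibble in enumerate(nibbles):
        word |= (nibble & 0xF) << (NIBBLE_BITS * k)
    return word
```

Under these assumptions a 16-bit value takes four transfers per neighbor link, which is the communication time cost traded for the reduced pin count.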
The internal structure of a fixed-point CORDIC module contains three data paths. The x and y data paths implement the CORDIC rotation equations. The z data path is used to accumulate angle information. Each module is composed of three registers and adders, two barrel shifters, and a local PLA control unit. The ROM which stores the angles can be shared between the two CORDIC modules on the chip. In order to begin the implementation of the processor array, the floorplan was first separated into smaller design units.
Design Partitioning
In order to approach the design of a moderately complex special-purpose processor in an academic setting, the CORDIC SVD processor was partitioned into "TinyChip" projects for initial design study. The full processor design is divided into the design of a CORDIC processor and a systolic interconnection network. The CORDIC processor is further divided into a CORDIC Z Control Chip and a CORDIC X,Y Data Path Chip, as shown in Figure 3. Experience gained from the TinyChip projects is valuable to the design of larger chips. Each subsection is appropriate for a VLSI Design Course project, while the larger chips are appropriate for funded research projects.
CORDIC Z Control Chip
A 10-bit CORDIC Z Control Chip has been completed and was fabricated through the MOSIS service. A block diagram of the chip is shown in Figure 4. The major structures are: control PLA, adder, ROM, ROM selector, and registers. CMOS transmission gates are used to control data flow within the chip. This chip contains approximately 1200 transistors in addition to the pads, and a photograph is shown in Figure 5. The chip has been tested and is fully functional up to a clock speed of 8 MHz.
CORDIC X,Y Data Path Chip
The design of the Data Path chip was started at Cornell using the 3-μm double-metal, single polysilicon, p-well process. The x and y data paths implement the CORDIC iteration equations. A plot of the current 20-bit Data Path chip design, which could implement either the x or y data path, is shown in Figure 6. The major cells are: register, input to barrel shifter for arithmetic shift, barrel shifter array, control for adder two-phase clocking, switch for normal CORDIC and CORDIC scale factor correction data routing, and adder array. In this project, basic ripple-carry addition has been assumed for simplicity of implementation.
When completed, the 20-bit Data Path chip will contain approximately 2100 transistors in addition to those contained in the twenty-eight I/O pads. The area is 1146 μm x 2582 μm. This design will fit into the MOSIS TinyChip padframe, which is 2300 μm x 3400 μm. Currently, a CORDIC processor would be composed of two Data Path chips and a CORDIC Z Control Chip, as indicated in Figure 3.
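A bit-level behavioral sketch of such a data path, covering the arithmetic shift and the ripple-carry add or subtract, is given below. It is a software model under assumed names (`arithmetic_shift_right`, `ripple_carry_add`, `datapath_step`) and does not describe the actual cell design, clocking, or the scale factor correction routing.

```python
WIDTH = 20                       # internal word length of the data path
MASK = (1 << WIDTH) - 1

def to_signed(u):
    """Interpret a WIDTH-bit pattern as a two's-complement integer."""
    return u - (1 << WIDTH) if u & (1 << (WIDTH - 1)) else u

def arithmetic_shift_right(u, k):
    """Barrel shifter model: shift right by k bits, replicating the sign bit."""
    return (to_signed(u) >> k) & MASK

def ripple_carry_add(a, b, subtract=False):
    """Add (or subtract) two WIDTH-bit patterns one full adder at a time."""
    if subtract:
        b = ~b & MASK            # invert b; a carry-in of 1 completes the negation
    carry = 1 if subtract else 0
    result = 0
    for bit in range(WIDTH):
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        result |= (x ^ y ^ carry) << bit
        carry = (x & y) | (carry & (x ^ y))   # carry ripples to the next stage
    return result

def datapath_step(own, other, shift, subtract):
    """One iteration of a single data path: own +/- (other >> shift)."""
    return ripple_carry_add(own, arithmetic_shift_right(other, shift), subtract)
```

The loop in `ripple_carry_add` makes the serial carry dependence explicit, which is why the adder delay grows with the word length in this simple scheme.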
Design Methodology
Structured VLSI design is used following the Mead and Conway design methodology [9, 12]. In order to simplify design requirements, static complementary logic is used. In addition, a formal two-phase clocking strategy is used.
Current Work and Summary
The goal of the CORDIC X,Y Data Path and CORDIC Z Control chips has been to give insight into the design of a full CORDIC SVD processor chip. Current work includes the expansion of the 10-bit CORDIC Z Control chip to a 20-bit design to complement the CORDIC X,Y Data Path design. The two designs will then be combined and replicated to yield two full CORDIC processors, as shown in Figure 2. The systolic control structure and internal data routing will then be added to complete the SVD processor element. Eventual design goals include a floating-point implementation based upon a hybrid floating-point architecture [4]. Initial timing results from the CORDIC Z Control Chip indicate that a 16-bit fixed-point array of 10x10 processors should compute the SVD of a 20x20 matrix in 1 millisecond. It is expected that greater speed can be achieved through the use of carry look-ahead adder designs.
An array of 100 chips to compute the SVD of a 20x20 matrix could be built using either conventional packaging or surface mount technology. Larger CORDIC SVD processor arrays would greatly benefit from the emerging Wafer Scale Integration (WSI) technologies. Advances in fault tolerance techniques at both the architectural and manufacturing levels will improve the yield of these systems.
The design of special-purpose processor chips is always challenging. As general-purpose microprocessors continue to advance, special-purpose processors remain useful for custom applications. The CORDIC SVD processor array would be important for specialized, portable, real-time applications where the benefits of VLSI are great.
