Abstract-This brief presents a hardware design to achieve high-throughput QR decomposition, using the Givens rotation method. It utilizes a new 2-D systolic array architecture with pipelined processing elements, which are based on the COordinate Rotation DIgital Computer (CORDIC) algorithm. CORDIC computes vector rotations through shifts and additions. This approach allows a continuous computation of QR factorizations with simple hardware. A fixed-point field-programmable gate array (FPGA) architecture for 4 × 4 matrices has been optimized by balancing the number of CORDIC iterations with the final error. As a result, compared with other previous proposals for FPGA, our design achieves at least 50% more throughput, as well as much less resource utilization.
I. INTRODUCTION

M
OST of the advanced signal processing algorithms are based on algebraic matrix operations. Many examples of this are found in wireless communication, such as multipleinput-multiple-output (MIMO), beamforming, and multiuser detection and cancellation [1] . One useful operator for these matrix operations is QR factorization, particularly for MIMO technologies [2] , [3] and adaptive filtering [4] . Some of these applications require high-throughput QR decomposition but are for small matrix sizes. Thus, many works have addressed the parallel hardware implementation of this operation for either application-specific integrated circuit or field-programmable gate array (FPGA) technologies. In this brief, we focus on highthroughput computation for small matrices on FPGAs.
The Givens rotation method (and its variations) is probably the most widely used to implement QR decomposition by hardware due to its robust numerical properties and its easy parallelization [5] . In the literature, there are several papers in which QR factorization has been implemented on FPGA by using this method. Although serial approaches or linear systolic arrays may be used [6] , to achieve high throughput, the most common hardware implementation is through twodimensional (2-D) systolic arrays, such as in [2] and [7] - [11] . A 2-D systolic array is a parallel grid structure where processing elements (PEs) work in parallel and are locally interconnected. This systolic architecture allows the exploitation of different grades of parallelism inherent to the Givens rotation algorithm. Thus, these approaches have high throughput and relatively low latency, at the cost of considerable area consumption. In this brief, through combining several ideas, we have designed a new architecture, which improves previous highthroughput FPGA implementations. It is based on the COordinate Rotation DIgital Computer (CORDIC) algorithm to simplify hardware, pipelining the PEs to obtain better throughput, along with a different schedule for performing the Givens rotations to reduce latency. As a result, the proposed architecture has very high throughput and low latency, with a relatively reduced area consumption. They also have a very simple control and communication logic.
The next sections of this brief are organized as follows. Section II reviews some important aspects of the QR decomposition using Givens rotations, along with a brief review of some previous works proposed in the literature. Section III presents the proposed architecture to achieve high throughput. In Section IV, the results of the FPGA implementation are studied and compared with other previous works. Finally, Section V provides the conclusions of this work.
II. GIVENS ALGORITHM AND PREVIOUS FPGA IMPLEMENTATIONS
Given a matrix A m×n , this is equivalent to the product of two factors, i.e., A = Q · R, in which matrix Q m×m is orthogonal and R m×n is an upper triangular matrix [5] . The computation of these two factors is called QR decomposition or factorization.
The Givens method achieves a QR factorization through unitary transformations, which are called Givens rotations, which selectively allow the introduction of a zero element [5] . A Givens rotation matrix has rank-two corrections about identity matrix, where the rank (i, j) is replaced by orthogonal values based on sines and cosines, i.e.,
As an example, a Givens rotation is represented in (1) for a 2 × 1 matrix, where the resultant matrix has a new inserted zero; this can be extrapolated to any other matrix size. The rotation angle θ must be computed beforehand by the 1549-7747 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. formula arctan(a 2 /a 1 ). Alternatively, these values can be also calculated by
Accordingly, the Givens method algorithm starts zeroing the lower elements, from the first column to the last one, and, on each column, starting from the bottommost element to the diagonal element. The upper triangular matrix R is achieved by accumulating the Givens rotations on the initial matrix. Similarly, Q is obtained when the same rotations are applied to the identity matrix.
As an example, Fig. 1 illustrates the application of the Givens method on a 4 × 4 matrix to achieve an upper triangular matrix R in six stages. Each arrow represents a Givens rotation, where G k (i, j) specifies the involved rows (i, j) and the column k where a zero will be inserted. The circular areas indicate the elements selected to calculate the rotation angle, whereas the squared areas delimit those elements that will be rotated using the said angle.
It is clearly seen that this algorithm has different levels of parallelism that could be exploited depending on the selected architecture. Several works, such as in [7] and [10] , have proposed a 2-D systolic array similar to the one shown in Fig. 2 . In this architecture, each PE always works over elements on the same column. On each row of the architecture, the Givens rotations may be performed in parallel using as many PEs as nonzero elements are within the row of the matrix.
In addition to this parallel computation, this configuration has the advantage of the two different types of PEs used, one (V) to compute the rotation angle, which, at first, requires much more complicated operations, and another (R) to perform the effective rotations, which is much simpler. Each row of PEs only needs one type-V PE and the rest as type R. Thus, although they need more PEs, the number of type-V PEs (much more complex) is reduced, and then, the overall area may be also reduced.
This architecture is used in [7] , where standard arithmetic operations are utilized to implement the PEs. In [10] , iterative CORDIC circuits are used instead, which reduces area consumption. However, in these approaches, due to data dependence between consecutive rotations, the PEs of the last rows are idle most of the time, which means a significant waste of resources. In addition to this, the same data dependence prevents the use of pipelining internally in the PEs, which limits the achievable throughput.
A different and older approach is the one used in [8] and [12] , which is shown in Fig. 3 . On this scheme, a PE completely performs a Givens rotation for all elements of the two rows. Thus, the two operations involved in a Givens rotation have to be combined in one PE, making it more complex, although much fewer PEs are required. The main advantage of this approach is the fact that the only data dependence, which prevents the pipelining within the PEs, is the one between the computation of the rotation angle and the rotation itself. In [12] , they propose to interleave columns of different input matrices to overcome this dependence, but this is unpractical for many applications, particularly for deep pipelines. On the other hand, square root and division free Givens rotations [13] are utilized in [8] , where this dependence is eliminated by means of a preprocessing and a postprocessing, which allows pipelining of the PEs. Thus, this architecture achieves a high throughput, but the complexity of the operations involved also requires a high utilization of resource.
III. PROPOSED ARCHITECTURE
Similarly to the work in [8] , we propose to use a 2-D array architecture where each PE works with all the elements of the same row, and these PEs are pipelining, to achieve high throughput. However, at the same time, we propose different connections for the PEs within the 2-D array to reduce latency. Moreover, the PEs are designed based on the CORDIC algorithm to implement this pipeline in a simpler way, which produces a system with lower area and higher throughput. Next, we present some details of this architecture.
A. Givens Rotations Schedule
The classic schedule to implement the Givens algorithm, as it is previously described in Fig. 1 , starts zeroing the bottommost element of the first column and serially continues up in the same column until this column is finished. Then, the same procedure is performed over the next column and so on, until the matrix is triangular. The 2-D systolic array improves this schedule by performing several rotations at the same time. Concretely, as it is indicated in Fig. 3 , all PEs in the same diagonal work in parallel, significantly reducing the number of steps required for one matrix computation. However, this schedule can be performed with a higher level of parallelism, if as many rows as possible were rotated simultaneously. Thereby, the algorithm decreases latency by reducing its amount of steps. We should note that the number of Givens rotations remains the same, but the number of Givens rotations by step is increased.
To reduce the number of steps as much as possible, on each step, all suitable pairs of rows (i.e., two unselected rows that contain the same number of zeros to the left) are selected, and they are rotated in parallel. This is repeated until an upper triangular matrix is achieved. Fig. 4 illustrates how a 4 × 4 matrix is factorized by using this schedule. In this figure, two types of lines are described, i.e., dotted and continuous; each one represents a Givens rotation made simultaneously.
In the first stage, two Givens rotations are concurrently performed; it takes the adjacent rows (1, 2) and (3, 4) . This means inserting two zeros in rows 2 and 4, as shown in the second stage. Next, rotations G 1 (1, 3) and G 2 (2, 4) are calculated, finishing the computation on the first column. Then, the first row will not be used again, restricting the algorithm to only one rotation by stage for the two last stages. Therefore, only four stages are required using this schedule.
B. CORDIC-Based PEs
CORDIC is an iterative algorithm based on shifts and additions, which allows calculating many elementary functions with a very simple hardware [14] . The same CORDIC circuit may operate in two modes, i.e., vectoring or rotation mode. The former rotates an input vector (X, Y ) until its Y coordinate reaches a zero value, returning angle θ, and the X coordinate has rotated. The latter rotates an input vector (X, Y ) with a determined angle θ. Therefore, a CORDIC unit could be used to compute the angle for a Givens rotation (vectoring mode) and, then, performs the rotation through the rest of the row using the said angle (rotation mode).
Many different circuits based on CORDIC have been proposed in the literature to perform Givens rotations. To achieve our goal, we have selected the one used to implement a linear systolic array in [15] . It is a pipeline architecture that performs both vectoring and rotation mode. Due to the data dependence of the linear array, the circuit presented in [15] needs matrix interleaving to take advantage of the pipeline design. However, within our row-based 2-D systolic array, the pipeline is used in a natural way. This CORDIC approach replaces the computation of the rotation angle θ by the direction of each microrotation. This direction is indicated by the sign of Y on each CORDIC stage, and it is stored in a register, i.e., σ, to be used in subsequent rotations. Thus, this circuit allows overlapping the computation of the angle with the rotation of the corresponding rows. Fig. 5 shows this circuit divided into two sections. The right section is the operation section that contains the typical CORDIC x-y data path. The left section represents the control hardware, which, in vectoring mode, selects the rotation direction and updates the σ registers. These registers configure the adds/subs in the rotation mode. An active nda signal indicates a new angle computation (vectoring mode) on this stage. Then, sign(Y ) is used to control the add/sub , and it is stored in the σ register. While the active nda signal goes through the pipeline, the rest of the elements of the corresponding rows are introduced into the circuit to be rotated using the said stored directions (rotation mode). Therefore, both computations are overlapped; furthermore, it is clearly seen that a new Givens rotation may be introduced in the pipeline before the actual one was completely finished.
The constant scale factor introduced by the CORDIC algorithm [14] is compensated by multiplying the output values by its inverse. As we will see in the next section, constant multipliers or embedded multipliers could be utilized for this operation.
C. Proposed Circuit
Using the schedule described in Section III-A, the proposed architecture is derived by assigning a PE to each Givens rotation. Fig. 6 illustrates the architecture for 4 × 4 matrices. There are four stages: the two first ones with two PEs, since two Givens rotations are performed in parallel, and only one in the two others. Input and output buses are directly connected from one stage to the next. Only a first-in-first-out register is required on stage 3, since one of the rows computed in stage 2 is used in stage 4. Not much logic is required for synchronization of this architecture, due to its pipelined structure. The nda signals of the PEs on the first stage may be set externally, or using a counter if the flow of input matrices is regular. In the next stages, the nda inputs are connected to the nda outputs of the previous stage. In some PEs, the nda signal has to be delayed one extra cycle to compensate for the zero elements. In the first stage, all rows are introduced simultaneously, element by element (input matrix followed by identity matrix). Furthermore, due to its fully pipelined architecture, a new matrix computation could start right after the last element is introduced. Therefore, a very high throughput is achieved (for this example, one matrix computation each eight cycles).
IV. PERFORMANCE ANALYSIS AND COMPARISON
Using the proposed architecture, a VHDL fixed-point QR decomposition core for 4 × 4 matrices has been designed. The said core allows us to configure both bit width and number of CORDIC iterations. This core has been simulated and synthesized using Xilinx ISE 14.3 software and implemented and evaluated using a hardware Virtex-6 XV6VLX240T speed-2 FPGA platform. To confirm the correctness of the proposed core, first, it has been tested with a wide range of random matrices, and the results have been checked using MATLAB.
Second, to improve the area and the latency of the proposed circuit, we have experimentally studied how the number of CORDIC iterations influences the error of its results. To do this, different circuits, using three word lengths (16, 24, and 32 bits), have been implemented on our hardware platform for several numbers of CORDIC iterations. Using each one of these circuits, the QR decomposition has been calculated for 50 000 random matrices whose results have been checked by computing Q t * R and comparing it with the original matrix A. In Table I , the maximum error detected on these comparisons is presented for each tested configuration. It is clearly observed that, at first, the maximum error decreases when the number of iterations increases, due to the better approximation achieved for the rotation angle. However, at a certain point, the error starts to slightly increase due to the accumulated rounding error. Thus, to obtain minimum error while reducing the area and the latency, the best configurations for 16-, 24-, and 32-bit widths are 10, 18, and 26 CORDIC iterations, respectively. Table II shows the implementation results for the three analyzed word lengths, each one with three different approaches for the scale factor compensation required by the CORDIC algorithm (see Section III-B). All designs use the optimum number of iterations previously computed. Two approaches use the embedded multipliers (DSP48E), which typically exist in FPGAs, either nonpipelined or pipelined (Multiplier A and Multiplier B, respectively), whereas the third approach uses pipelined constant coefficient (Multiplier C) designed with Xilinx Core Generator. There are no great area differences between the approaches using DSP48, but the one pipelined allows much higher clock frequency and, consequently, much better throughput at the cost of a moderate latency increase. On the other hand, the number of slices used to implement the constantcoefficient multipliers is relatively high compared with the rest of the circuit. Then, although this approach achieves the same Finally, to study the effectiveness of our proposal, from the literature, we have selected some representative works, which provide enough data to perform a reasonable comparison. Table III shows the mean results of these works along with the ones for our 16-bit circuit using Multiplier A. To provide a fair comparison, we have synthesized our architecture on equivalent FPGAs as those works, concretely Virtex4 (XC4VFX60-11) and Virtex5 (XC5VTX150T-2).
With regard to the performance, the only design with a throughput relatively close to ours is the one in [8] . Similarly to our proposal, it uses pipelined PEs and has practically the same latency in clock cycles. However, its lower maximum frequency provides that our proposal presents about 35% less latency (seconds) and 50% more throughput than the design in [8] . The better critical path of our design is explained mainly by the simplicity of the CORDIC architecture. On the other hand, the circuits presented in [7] and [10] have a throughput that is one order of magnitude lower than ours, since their PEs are iterative.
With regard to the area, our design clearly requires several times less resources than the others, considering all kinds of resources. The closest one is the circuit in [10] , which also uses a CORDIC-based architecture. Although it requires practically the same number of lookup tables (LUTs) and no multipliers, the number of registers is more than three times greater. This is mainly explained by the much greater number of PEs presented in the architecture and the use of carry-save arithmetic.
Taking all these results into account, we can conclude that the architecture proposed herein presents much better throughput, as well as much lower resource utilization, than previously proposed works. Moreover, this throughput could be doubled by using the pipelined version of DSP48 at the cost of a moderate latency increase.
V. CONCLUSION
This brief has presented a fixed-point systolic architecture to achieve high-throughput QR decomposition for small matrices. This is achieved by performing as many Givens rotations as possible in parallel in an unrolled architecture, as well as using a pipelined CORDIC circuit, which allows completely overlapping the angle computation and the rows rotation. Thus, this highly pipelined circuit performs an n × n matrix decomposition each 2n clock cycles. The FPGA implementation of this architecture for 4 × 4 matrices has been optimized for different word lengths by selecting the appropriate number of CORDIC iterations. Comparing with previous FPGA approaches, our proposal highly improves both performance and resources utilization.
