This article proposes a CORDIC-based QR Decomposition (CQRD) for the MIMO signal detector module that offers low resource usage and low latency. The design contains four stages with six CORDIC modules, and its hardware architecture employs both the vectoring-mode and rotation-mode equations. The evaluation results show that the proposed hardware design achieves high performance, low resource usage, and low latency. These advantages make CQRD suitable for the signal detector in MIMO systems.
Introduction
One of the most significant innovations in wireless communications is Multiple-Input Multiple-Output (MIMO), which utilizes multiple transmitting and receiving antennas [1]. The MIMO technique exploits the spatial dimension, the space and time dimensions, or the time and frequency dimensions to achieve high throughput and high accuracy. Since the receiver of a MIMO system has to detect the signals transmitted from many antennas, its hardware architecture is more complex than that of a single-antenna system. Zero-Forcing, Mean Square Error, Sphere Decoding, and Maximum Likelihood are some MIMO signal detection methods, listed in ascending order of complexity.
To reduce the complexity while retaining the high accuracy of the Maximum Likelihood method, Sphere Decoding deploys the QR Decomposition (QRD) in its architecture [2]. The QRD is used for estimating the transmission channel of a MIMO system [3, 4] by decomposing the channel matrix into a unitary matrix Q and an upper triangular matrix R. The upper triangular matrix R makes the signal easier to detect. Gram-Schmidt [5], the Householder transform, and Givens Rotation (GR) [3] are some methods for performing the QRD. Among them, the GR method is the most widely used in hardware implementations [4] because it can be implemented efficiently using a coordinate rotation digital computer (CORDIC) scheme [3, 6]. Several papers in the literature show that the QRD can be performed with the CORDIC scheme [3, 4, 7, 8].
In this paper, a low-latency CORDIC-based QRD (CQRD) is proposed and implemented on a Stratix FPGA. CQRD combines several previous ideas to improve resource utilization and performance. Its general structure uses the four-stage QR structure in [4], while the CORDIC module in each stage develops and improves the low-latency pipelined CORDIC design proposed in [9]. The latency of CQRD is 1.31× and 2.81× lower than that of the designs in [4] and [10], respectively. Moreover, the suggested hardware architecture contains no DSP blocks but still achieves a high operating frequency, leading to a low delay time. Furthermore, the throughput of CQRD is 1.01× and 15.13× higher than that of the designs in [4] and [10], respectively, while its error rate is nearly the same as that of the design in [10].
The remainder of this paper comprises four parts. Section II describes previous work related to the suggested architecture. The idea and details of the hardware architecture are described in Section III, while Section IV discusses its resources and performance. Finally, the conclusion is presented in Section V.
Related work
CORDIC algorithm
The Coordinate Rotation Digital Computer (CORDIC) was first introduced in 1959 by Volder [11] and later extended by Walther [12] in 1971. The CORDIC hardware architecture requires only additions, subtractions, and shift operations; hence it is suitable for Very-Large-Scale Integration (VLSI) system design. The CORDIC algorithm has two operation modes: vectoring mode and rotation mode.
CORDIC Rotation Mode: Assume that the initial vector $V_0$ has the coordinates $(x_0, y_0)$ and that $\Phi_r$ is the angle between the vector $V_{nr}(x_{nr}, y_{nr})$ and the initial vector $V_0$. The new coordinates $(x_{nr}, y_{nr})$ can then be calculated in CORDIC rotation mode by (1).
The CORDIC rotation mode uses a set of $n$ angle constants $\alpha_i$, each of which satisfies $\tan\alpha_i = 2^{-i}$, and $d_{ir} \in \{-1, 1\}$ is the direction of each constant angle $\alpha_i$. The residual angle $z$ and the coordinate values $(x, y)$ are then calculated by (2).
CORDIC Vectoring Mode: Unlike the rotation mode, which rotates the input angle to zero, in the vectoring mode the initial input angle is zero, and the algorithm rotates the vector until its $y$ value reaches zero [13]. For $0 \le i \le n - 1$, the equations of the CORDIC vectoring mode are the same as those of the CORDIC rotation mode (2), except that the direction of each angle constant $\alpha_i$ is obtained by $d_{iv} = \mathrm{sign}(y_{iv})$. Finally, the new vector $V_{nv}(x_{nv}, y_{nv})$ is calculated by (3), and the value of $z_{nv}$ is the angle through which the vector has been rotated.
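As an illustration of the two modes, the following Python sketch implements the shift-and-add micro-rotations in floating point. Equations (1)-(3) are not reproduced in this text, so the sign conventions are assumptions: each micro-rotation turns the vector clockwise for $d = +1$, which is consistent with $d_{iv} = \mathrm{sign}(y_{iv})$, and the vectoring mode accumulates $z$ so that it ends near $\tan^{-1}(y_0 / x_0)$. The function names and the default of 16 iterations are hypothetical.

```python
import math

def cordic_gain(n):
    """Amplitude growth of n micro-rotations (removed once at the end)."""
    gain = 1.0
    for i in range(n):
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    return gain

def micro_rotate(x, y, d, i):
    """One shift-and-add step; d = +1 turns the vector clockwise (assumed convention)."""
    s = 2.0 ** -i
    return x + d * y * s, y - d * x * s

def rotation_mode(x, y, z, n=16):
    """Rotate (x, y) clockwise by z by driving the residual angle z toward zero."""
    for i in range(n):
        d = 1 if z >= 0 else -1           # direction follows the sign of z
        x, y = micro_rotate(x, y, d, i)
        z -= d * math.atan(2.0 ** -i)
    k = cordic_gain(n)
    return x / k, y / k, z

def vectoring_mode(x, y, n=16):
    """Drive y toward zero (assumes x > 0); return magnitude, residual y, angle, directions."""
    z = 0.0
    directions = []
    for i in range(n):
        d = 1 if y >= 0 else -1           # d_iv = sign(y_iv)
        x, y = micro_rotate(x, y, d, i)
        z += d * math.atan(2.0 ** -i)     # assumed sign so that z approaches atan(y0 / x0)
        directions.append(d)
    k = cordic_gain(n)
    return x / k, y / k, z, directions
```

With these conventions, `vectoring_mode(3.0, 4.0)` returns a magnitude close to 5 and an angle close to `math.atan2(4, 3)`; the list of directions is what a rotation stage could reuse.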
Givens rotation QRD
Assume that the input matrix contains two rows and one column. A rotation matrix $Q$ is capable of rotating the input matrix as in (4) if $Q$ satisfies the condition in (5).
$$Q = \begin{bmatrix} \cos\Phi & \sin\Phi \\ -\sin\Phi & \cos\Phi \end{bmatrix}$$
When the input matrix is $H_{m \times m}$, the Givens Rotation (GR) QRD rotates all the lower elements of the input matrix to zero by applying a sequence of rotation matrices $Q_H$.
For instance, to force the first element of the second row of a channel matrix $H_{4 \times 4}$ to zero, the matrix $Q_H$ shown in (6) is used,
where the value of $\Phi_H$ is calculated by $\tan^{-1}(H_{21} / H_{11})$. The other lower elements of the input matrix are forced to zero by other matrices $Q_H$ until the upper triangular matrix is obtained.
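The elimination described above can be sketched in plain Python as follows. This is a minimal illustration that assumes a real-valued matrix stored as a list of rows (a practical channel matrix is complex), and it uses `math.atan2` in place of a plain arctangent so that a zero or negative pivot is handled. The function names are hypothetical.

```python
import math

def givens_rotate_rows(H, p, q, col):
    """Rotate rows p and q of H in place so that H[q][col] becomes zero."""
    phi = math.atan2(H[q][col], H[p][col])    # Phi_H = atan(H[q][col] / H[p][col])
    c, s = math.cos(phi), math.sin(phi)
    new_p = [c * a + s * b for a, b in zip(H[p], H[q])]     # first row of Q
    new_q = [-s * a + c * b for a, b in zip(H[p], H[q])]    # second row of Q
    H[p], H[q] = new_p, new_q
    return phi

def givens_triangularize(H):
    """Return an upper triangular R by zeroing every element below the diagonal."""
    R = [row[:] for row in H]
    m = len(R)
    for col in range(m - 1):
        for row in range(col + 1, m):
            givens_rotate_rows(R, col, row, col)
    return R
```

Calling `givens_rotate_rows(H, 0, 1, 0)` on a 4 × 4 matrix corresponds to the $H_{21}$ example in the text; `givens_triangularize` simply repeats the step for the remaining lower elements in column order.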
CORDIC-based QRD
Based on the GR QRD and the CORDIC equations, the angle $\Phi_H$ of the GR QRD can be calculated by the CORDIC vectoring mode, whose equation is described in (3). The operation of the GR QRD in (4) can then be performed by the CORDIC rotation mode with the equation expressed in (1). In the literature, the CORDIC-based QRD hardware architecture contains two recursive CORDIC circuits operating in vectoring mode and rotation mode, respectively [4].
Proposed hardware architecture
The CORDIC vectoring mode and CORDIC rotation mode used in the CQRD architecture are summarized in Table I. Based on the equations of the residual angle $z$ at the first and final steps, the direction $d_{iv}$ of the vectoring mode can also be used for the rotation mode. Therefore, in this paper, the proposed CORDIC-based QR Decomposition (CQRD) uses only one CORDIC circuit instead of two recursive CORDIC circuits. First, the CORDIC vectoring mode is used to calculate the residual angle $z_{iv+1}$ and the direction $d_{iv}$. The CORDIC rotation mode then uses the direction $d_{iv}$ to calculate the coordinate vector $(x_{ir+1}, y_{ir+1})$.
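The idea of sharing the direction between the two modes can be sketched as follows: at every iteration the direction is taken from the column being zeroed, and the same micro-rotation is applied to all columns of the two rows. This is a floating-point illustration under assumed conventions (clockwise micro-rotation for $d = +1$, gain removed once at the end, positive leading element in the upper row), not the pipelined hardware; the function name is hypothetical.

```python
import math

def cordic_givens_pair(row_p, row_q, col, n=16):
    """Zero row_q[col] against row_p[col] with one set of shared directions.

    Assumes row_p[col] > 0 so that the vectoring iterations converge.
    """
    xs, ys = list(row_p), list(row_q)
    for i in range(n):
        d = 1 if ys[col] >= 0 else -1     # d_iv = sign(y_iv), from the vectoring column
        s = 2.0 ** -i
        # the same direction drives the rotation of every column pair
        xs, ys = ([x + d * y * s for x, y in zip(xs, ys)],
                  [y - d * x * s for x, y in zip(xs, ys)])
    gain = 1.0
    for i in range(n):
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    return [x / gain for x in xs], [y / gain for y in ys]
```

Because no angle is passed from one mode to the other, the rotation of the remaining columns does not have to wait for the vectoring result, which is the property the single-circuit design relies on.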
The general CQRD architecture, which follows [4] and calculates the QR Decomposition, is shown in Fig. 1. The CQRD consists of four stages with six CORDIC modules named COR_QR, which are based on the idea in [9] because of its low latency. First, the channel matrix is sent to the first stage, which consists of two COR_QR modules operating in parallel. This stage zeroes the first elements of the second and fourth rows. The second stage also contains two parallel CORDIC modules and zeroes the first element of the third row and the second element of the fourth row. The third stage has only one CORDIC module, which zeroes the second element of the third row. Finally, the last stage returns the upper triangular matrix, which is the estimated channel matrix.
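The stage schedule described above can be checked with a short Python sketch that uses exact Givens rotations in place of the COR_QR modules. The row pairings in each stage and the element zeroed by the fourth stage (third element of the fourth row) are assumptions inferred from the description, since Fig. 1 is not reproduced here; the names `SCHEDULE`, `rotate_pair`, and `four_stage_qrd` are hypothetical.

```python
import math

# Each entry: ((upper_row, lower_row), column_to_zero), 0-based indices.
SCHEDULE = [
    [((0, 1), 0), ((2, 3), 0)],   # stage 1: two modules in parallel
    [((0, 2), 0), ((1, 3), 1)],   # stage 2: two modules in parallel
    [((1, 2), 1)],                # stage 3: one module
    [((2, 3), 2)],                # stage 4: one module (assumed target element)
]

def rotate_pair(H, p, q, col):
    """Exact Givens rotation of rows p and q that zeroes H[q][col]."""
    r = math.hypot(H[p][col], H[q][col])
    if r == 0.0:
        return                    # nothing to rotate
    c, s = H[p][col] / r, H[q][col] / r
    H[p], H[q] = ([c * a + s * b for a, b in zip(H[p], H[q])],
                  [-s * a + c * b for a, b in zip(H[p], H[q])])

def four_stage_qrd(H):
    """Apply the four-stage schedule to a 4 x 4 real matrix and return R."""
    R = [row[:] for row in H]
    for stage in SCHEDULE:
        for (p, q), col in stage:   # pairs within a stage touch disjoint rows
            rotate_pair(R, p, q, col)
    return R
```

Within each stage the row pairs are disjoint, which is what allows the two modules of stages 1 and 2 to work in parallel.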
COR_QR module: The architecture of COR_QR, based on the idea in [9], is shown in Fig. 2. This design contains $n/2$ stages and one final-stage module. As mentioned above, the first $n/2$ stages use the vectoring mode to calculate the residual angle and direction, and the rotation mode to calculate the coordinate values. The final stage returns the values of the final coordinate vector $(x_n, y_n)$; its theory and architecture are described below. The reasons for including the REC and MULT modules are also explained in the final-stage part.
[Table I column headings: Step, CORDIC Vectoring Mode, CORDIC Rotation Mode]
The architecture of stage 1 to stage $n/2 + 1$ is shown in Fig. 3(a). The input and output data of one COR_QR module are given in (7), where the vectoring mode uses only the column whose element is to be forced to zero, while the rotation mode uses all elements of the matrix. Therefore, in addition to the ADD/SUB modules, which apply the vectoring equation for $z$ and the rotation equations for $x$ and $y$, the architecture of stage $i + 1$ requires the COUNTER, MUX, and REG modules to select the correct value of the direction $d_{iv}$ for the rotation-mode operation.
Stage final:
The final values of COR_QR are calculated by the final stage, in which the coordinate vector $(x_n, y_n)$ is calculated by (8) [9]:
$$x_n = x_m + y_m z_{mr}, \qquad y_n = y_m - x_m z_{mr} \tag{8}$$
where $x_m$, $y_m$, and $z_{mr}$ are the coordinate vector and the residual angle of the CORDIC rotation mode after $n/2$ stages, respectively. However, only the residual angle of the vectoring mode, $z_{iv}$, is calculated in the stage 1 to stage $n/2 + 1$ modules. Therefore, the value of $z_{mr}$ must be interpolated from the final residual angle of the vectoring mode, $z_{nv}$, and the residual angle of the vectoring mode at stage $n/2 + 1$, $z_{mv}$, by (9).
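The role of (8) can be illustrated with a small Python sketch: after $n/2$ micro-rotations the residual angle is small, so the remaining rotation is approximated by one multiply-and-add step instead of another $n/2$ iterations. The sketch assumes the clockwise convention for $d = +1$, takes the residual angle directly from the rotation-mode recursion rather than interpolating it as in (9), and removes the gain before the final step; these choices and the function name are assumptions.

```python
import math

def rotate_with_final_stage(x, y, z, n=16):
    """Rotate (x, y) clockwise by z using n/2 micro-rotations plus equation (8)."""
    m = n // 2
    for i in range(m):
        d = 1 if z >= 0 else -1
        s = 2.0 ** -i
        x, y = x + d * y * s, y - d * x * s
        z -= d * math.atan(s)
    gain = 1.0
    for i in range(m):
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    x_m, y_m, z_mr = x / gain, y / gain, z    # state after n/2 stages
    # equation (8): small-angle completion (cos z ~ 1, sin z ~ z)
    x_n = x_m + y_m * z_mr
    y_n = y_m - x_m * z_mr
    return x_n, y_n
```

The approximation error of the last step grows with the square of the residual angle, which is why it is only applied after the first half of the iterations has already reduced that angle.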
Furthermore, the author of [9] proved that the value of $1/x_m$ can be estimated from the value of $x_k$ with $k \to n/4$. The reciprocal of $x_k$ is then calculated by the REC module using the Newton-Raphson equation, and the value of $y_m / x_k$ is obtained by the MULT module and sent to the final stage to calculate the values of $x_n$ and $y_n$. The architecture of the final-stage module is illustrated in Fig. 3(b).
Results and discussion
The CORDIC-based QRD (CQRD) was synthesized with Altera Quartus 13.0 for a Stratix FPGA. Its resources and performance are listed in Table II. To make a fair comparison, the suggested CQRD was implemented on different Altera device families that use the same process as the previous designs. In [4], the proposed architecture was implemented with three different approaches for the scale factor: Dedicated Non-Pipelined DSP48E (DNP), Dedicated Pipelined DSP48E (DP), and Constant-Coefficient Pipelined without DSP48E (CCP). Among the three approaches, DP and CCP obtained a higher frequency and higher throughput than DNP, but at the cost of higher latency and more resource utilization than DNP. In particular, the number of slices needed to implement the constant-coefficient multipliers of CCP is high compared with the rest of the circuit, so the authors recommended selecting this approach only if the application needs to save DSP48 blocks for other computations [4]. For these reasons, the DNP approach in [4] was selected for comparison with our design in Table II.
The suggested CQRD requires no DSP blocks, while the designs in [4] and [10] contain 24 and 28 DSP blocks, respectively. Moreover, the CQRD takes only 64 clock cycles, which is 1.31 and 2.81 times lower than the latency of the architectures in [4] and [10], respectively. Thanks to this low latency, although the CQRD module operates at a lower frequency than the design in [10], its delay time is 1.87× and 1.32× lower than that of [4] and [10], respectively, while its error rate remains nearly the same as that of the architecture in [10]. The input of CQRD is a 4 × 4 matrix, and a triangular output matrix R is produced every eight clock cycles. Therefore, the throughput of CQRD is calculated by (10), and it is 1.01 and 15.13 times higher than that of the implementations in [4] and [10], respectively.
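Equation (10) and the clock frequency from Table II are not reproduced in this text, so the following is only a sketch of the arithmetic. It assumes that the delay time is the latency in cycles divided by the clock frequency and that one R matrix every eight cycles corresponds to a throughput of $f_{clk} / 8$ matrices per second, which may differ from the unit used in (10). The 100 MHz clock is a hypothetical placeholder, not a measured value.

```python
def delay_time_s(latency_cycles, f_clk_hz):
    """Time from input to first output, in seconds."""
    return latency_cycles / f_clk_hz

def throughput_matrices_per_s(f_clk_hz, cycles_per_matrix=8):
    """Number of R matrices produced per second once the pipeline is full."""
    return f_clk_hz / cycles_per_matrix

F_CLK_HZ = 100e6   # hypothetical clock frequency, for illustration only

print(delay_time_s(64, F_CLK_HZ))            # 64-cycle latency from the text
print(throughput_matrices_per_s(F_CLK_HZ))   # one matrix every eight cycles
```

The two quantities are independent: the latency fixes the delay time, while the eight-cycle output interval fixes the throughput.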
Conclusion
In this paper, a low-resource, low-latency CORDIC-based QR Decomposition (CQRD) hardware architecture was proposed and implemented on a Stratix FPGA. The CQRD consists of four stages with six CORDIC modules, each of which uses both the vectoring and rotation modes. The output of CQRD is the upper triangular matrix, which is the estimated transmission channel of the communication system. The results show that the hardware architecture achieves high performance, low resource usage, and low latency. Thanks to these advantages, the CQRD is suitable for the MIMO signal detector.
