A wide variety of digital communication systems are encountered with high computational tasks. QR decomposition is one of such algorithms that can be implemented on FPGAs as a solution to large complex matrix inversion problems. A flexible vector processing architecture for the fixed and floating point implementations of the QR decomposition is presented. The design is implemented on the StratixIV device with 230K logic elements and verified with the SignalTap II built-in logic analyzer. Throughputs of 2.4M and 2.11M decompositions per second with maximum clock frequency of 340 MHz and 360 MHz are achieved for 4×4 matrices with the fixed and floating point designs respectively. The FPGA resource utilizations of the two data type implementations are also compared for different matrix sizes for the StratixIV and Arria10 devices.
I. INTRODUCTION
The advent of new advanced digital communication systems and military radar applications such as adaptive beamforming [1] , Space-Time Adaptive Processing, MIMO systems, etc. has imposed complex and computationally demanding algorithms [2] . These algorithms involve matrix inversions or least square minimizations at large sizes and high data rates. Matrix factorization techniques such as LU factorization, Cholesky factorization and QR decomposition are crucial solutions to these problems and many attempts have been made on their hardware realization in the last decades [3, 4] . QR decomposition is widely applied nearly in every field where a numerically stable and robust inversion of large and dense matrices is needed [5] . It involves two prerequisites of least square estimates namely, triangularization and orthogonalization, means that it factorize a matrix A into the product of a unitary quadrature matrix, Q, and an upper triangular matrix, R.
There are three main algorithms for QR decomposition including Gram -Schmidt and modified Gram-Schmidt, Householder reflections and Givens rotations [6] . Householder transformations need to process in columns and rows so there must be a parallel access across memories. This can be a disadvantage in high speed processing [6] . Previous attempts aimed at Givens Rotation implementation suffer from high latency due to more complex operations [7] . Gram-Schmidt has the same level of performance in terms of stability and accuracy but requires fewer operations which make it more attractive for both fixed and floating point Siavash Amin-Nejad is with University of Guilan, (e-mail: Siavash.Amin-Nejad@liverpool.ac.uk).
Katayoon Basharkhah was with University of Guilan, Rasht, Iran. She is now with the Department of Electrical Engineering, University of Tehran, Iran (e-mail: ktbasharkhah@gmail.com).
calculations. Applying slight modifications to the Gram-Schmidt algorithm gives the modified Gram-Schmidt (MGS) algorithm which has better accuracy than the classic one [8] .
Due to the extensive computational burden of the QR decomposition, ASICs and FPGAs are two candidates for the hardware implementation. ASICs serves best in the low power and high speed computation requirements in the large volumes of production. On the other hand, FPGAs with powerful and abundant DSP Blocks, RAM Blocks and adaptive LUTs offer flexibility, parallelism and high throughput at much lower cost [9] [10] . FPGAs with short time to market are suitable for mid volume productions but suffer from more power consumption and less clock frequency.
Most of the time FPGAs are used for fixed point rather than floating point designs because the latter suffers from long data path latencies, routing congestion and reduced Fmax. Fused data path can be an alternative method for floating point implementation that provides hardware friendly representation and also reduced redundancies [6] , [11] . In this paper we propose a flexible architecture designed for both fixed and floating point implementation using fused data path. In section II the QR Decomposition algorithm as a parameterizable core for different matrix sizes and different vector sizes using vector processing method is described. The proposed architecture is described in section II. Section IV devotes to explaining fixed and floating point representations. The hardware implementation of the design on the Altera's DE4 development board with StratixIV FPGA is covered in section V. Performance analysis and comparisons are presented in section VI. Conclusion and discussions are given in section VII.
There are many ways to solve the linear system equation = to find the unknown vector [12] . In analytic approach which uses determinants, matrix inversion can be completely a bottleneck as it needs too many complicated operations and results in non-scalable architectures [13] . This necessitates the usage of decomposition methods for the solution of the large size matrices. QR decomposition is one of these techniques that can be used for any general m by n matrix unlike other methods such as Cholesky solver which are applied only on square and symmetric matrices [6] . 
II. QR DECOMPOSITION USING MODIFIED GRAM-SCHMIDT
Projection of vector (a) over vector (q) in an m dimensional space is defined as:
Where <q.a> represents the inner product. In the following equations u1, u2 … un are orthogonal and q1, q2 … qn are their corresponding orthonormal set.
After rearranging the equations:
Since qi are unit vectors, equation (3) can be rewritten as:
And in matrix form:
This algorithm can be realized in Matlab by the code in Appendix 2. This algorithm is composed of the processing elements to do norm, which requires n square root operations and nm multiplications and (nm-1) additions. It also requires n(n-1)/2 inner products and n(n-1)/2 divisions of vector by scalar. Updates of matrix A requires mn(n-1)/2 multiplications and additions [6] , [14] .
The overall goal is to design a low-latency high-throughput hardware for this iterative algorithm. Thus any part of the calculation that doesn't need to be repeated for each new piece of data is removed from the inner loop. As it is evident from the algorithm the diagonal elements of the R matrix must be computed first and then the rest of the R and Q matrices' elements are calculated. The disadvantage of this code is that the elements of Q cannot be calculated unless the update of matrix A is completed first. This can cause long latencies and the system will be halted for some time. The total number of inactive clock cycles is where and are latency of the calculation respectively [14] . In order to remove these latencies the algorithm is modified as shown in Appendix 1:
III. ARCHITECTURE OF THE PROPOSED IMPLEMENTATION
Based on the modified Matlab code in the previous section, the calculations are done in the two main for loops and are column-wise. In the first for loop the square of the diagonal element of each column of matrix R is calculated first and then the non-diagonal elements of its corresponding row are calculated. These row elements are divided to their corresponding diagonal element in next main for loop to generate the correct row elements of matrix R. Before repeating this procedure for the next column the matrix A elements at the right side of the current column must be updated according to the first for loop code. The Q matrix elements are calculated in second main for loop based on the modified matrix A elements. Fig. 1 shows the block diagram of the proposed design. The main units are control unit, memory blocks and vector processing unit. The core memory block initially stores matrix A for decomposition. The process is done column by column. Once the process for one column is completed and it is no longer used, the updated version of matrix A will be overwritten in the memory. This provides an optimized design in terms of memory usage. Since the column to be processed A(:,k) will be used several times for different operations such as dot product and update calculations it is stored in another small and fast memory called circular buffer. Vector processing engine can be considered as the main unit of the design which consists of three subunits: Dot product, Mult/Add and Div/sqrt. Dot product is the major arithmetic operation in this algorithm. The parallel implementation of this operation for large matrix sizes, which is requisite for many digital communication systems, needs a large number of DSP blocks. In order to reduce the number of the DSP blocks in the design, the dot product operation is divided into smaller sizes dot operations which are done sequentially. A compilation design parameter named vector size defines the dot operation size. For the dot operation of two columns of size m and vector size of v, v multipliers are implemented in each clock cycle and m/v clock cycles are needed. As shown in Fig. 2 , the Dot product unit calculates the vector size dot operation and a single accumulator in Mult/Add unit performs the summation of the results of m/v dot operations. (1)
As shown in the Matlab code, the second main for loop includes square root and division functions. These functions are calculated in the Div/Sqrt unit which is shown in Fig. 3 . The square root function is not available in the DSP Builder library, instead the inverse square root is available which can be implemented by using 3-4 times the required resources for a floating point adder or multiplier and produces one result per clock cycle [6] .
The timing and latencies for these iterations will be realized by four nested loop in control unit which serves as a single finite state machine (FSM). The nested loops are attributed to four state in calculations, square calculation, dot product, delay to estimate R(k,k) and new updates. The control unit will produce 8 control signal to provide the required latencies for each step calculations. Signal "Mag" will be active m/v cycles to calculate the square of R. In the last cycle signal "Last Mag" will be active to determine when the dot product (Rn) calculation should begin as it comes exactly after R2 and also Div/sqrt unit can calculate the square root and it's inverse. When column k is under processing, the dot product must be operated on the remained columns so (n-k)*m/v cycle is needed after R2. After dot product, Rn/r_kk and Rn/r_kk2 values are prepared and sub operation cannot be complemented unless this task is done. So a control signal named "check_rkk" alerts to wait for these values and once it is completed signal rkk_valid will have a token. This is the time that sub operation begins. It also needs (n-k)*m/v cycles. During this time updated value of A are estimated and Q values can be calculated after A values are ready. This figure also shows the memory blocks' read or write mode. 
IV. FIXED AND FLOATING POINT ARITHMETIC
Real numbers are presented in two representation system, fixed point and floating point. The two's complement fixed point representation consists of three part, sign bit, integer bits and fractional bits. The number of integer bits defines the highest absolute number and the number of the fraction bits defines precision [15] . The dynamic range of a fixed point number is determined by its word length [2word length -1]. This shows a dynamic range of approximately 6 times the word length which is considered so small for some digital processing systems like military applications. The finite word nature of fixed point numbers introduces truncation error and there may be rounding errors too. However the hardware efficient implementation and lower latency of fixed point systems make them an attractive option for QR decomposition. On the other hand, floating point numbers can provide much more dynamic range and higher accuracy at the expense of long processing. As it is evident, the tradeoff between hardware efficiency, latency and accuracy will affect the choice between fixed or floating point designs.
According to IEEE754 standard the 32-bit single precision floating point format is represented using one sign bit, 8 exponent bits and 23 mantissa bits as follows:
Elementary operations like addition and multiplication need normalization and de-normalization functions which use 48 bit barrel shifter for single precision floating point. This can cause poor fitting, slow clock rate and excessive routing. Hence floating point becomes a bottleneck in hardware implementations [6] . An alternative method that addresses this issue is the fused datapath methodology. Fused datapath integrates the basic operations into one single function and eliminate as many normalization and de-normalization step as possible [16] [17] [18] .
V. SIMULATION RESULTS
The design environment chosen in this paper is Simulink, which provides a high level of design description. It easily switches between fixed-point floating-point processing. The testbench used in the system-level simulation is later used to verify the FPGA-based implementation. The automated back-end synthesis tool running under Simulink is DSP Builder from Altera. Advanced Blockset (DSPBA) can provide a model based design based on fused datapath method. It also supports fixed point implementation by applying a set of libraries including fixed and floating point together with data type propagation. The proposed design is implemented in both fixed and floating point formats to compare the results. The two main parameters are the matrix size and vector size. The first one is a design parameter defined according to the device resources and the latter is a user variable and may be of any size. The performance and resource usage of the QR decomposition have been evaluated for different matrix and vector sizes implemented on Stratix IV EP4SGX230KF40C2 and Arria 10AX115U5F45I3SP devices. The StratixIV device can accommodate vector sizes of up to 60 complex elements whereas the Arria10 device can accommodate vector sizes of up to 90 complex elements [14] . To demonstrate the effect of vector size on the QR decomposition performance, a 30 by 10 matrix is decomposed with different vector sizes for both fixed and floating point data types. Fig. 5a shows that the DSP blocks of StratixIV increases linearly with the vector size for both the fixed and floating point representations but the logic utilization increases significantly for the floating point architecture. Fig. 5b shows that by increasing the vector size latency decreases whereby the number of matrices processed per second, throughput, increases. Fig. 6a shows the effect of matrix size on the resource utilization and Fig. 6b shows its effect on performance for two FPGA implementations with the fixed vector size of 10. As it is evident, large matrix sizes impose longer latencies and therefore lower throughput.
Arria10 is the first FPGA with hardened floating point DSP blocks which makes it suitable for floating point arithmetic. Fig. 7a shows comparison between StratixIV and Arria10 devices for different size matrices and different vector sizes for floating point data type. Arria10 uses five times less DSP blocks than StratixIV for the four combinations. Logic elements and memory bits usage are also decreased with nearly the same rate in Arria10. To show the number of DSP blocks, Memory bits and Logic elements in one graph for comparison they are normalized. Fig. 7b shows that in the fixed point implementations the Arria10 uses more DSP blocks as it should use the hardened floating point DSP blocks for fixed point data type. StratixIV uses nearly the same amount of the DSP blocks for both data types. Logic elements usage for floating point is 3-4 times higher than the fixed point for all matrix sizes and the two devices. Memory bits usage is not affected by the data type because both uses 32-bit representation. As expected the number of memory bits and Logic elements are increase by increasing the matrix size. To investigate the accuracy of the design, the calculated Q and R elements are compared to the double precision results obtained from the QR decomposition function in Matlab. Table1 shows the error performance. Frobenius Norm was used to measure the overall error magnitude in the resultant matrix: The error analysis has been done on 3 different matrix sizes. The standard deviations are also included to show how far the elements of matrix R are spread out in each case. Results show that the fixed point implementation imposes more error than the floating point which was expected.
VI. HARDWARE IMPLEMENTATION AND COMPARISON
In order to evaluate the performance of the proposed design in hardware, fixed and floating point QR core for a 4×4 matrix was implemented on the StratixIV FPGA. Table 2 shows the results in terms of latency, throughput, maximum clock frequency and the number of the clock cycles per each calculation alongside with the proposed architectures in [19] [20] [21] [22] [23] . As can be seen, the throughput of the proposed design for both fixed and floating point implementations are better than the other architectures. With the higher word length the data accuracy of this design would be much higher although the data for other works are not available. 
VII. CONCLUSION
In this paper an efficient FPGA implementation of QR decomposition is presented. The decomposition core was first designed in DSP Builder Advanced Blockset and then the VHDL code was synthetized in Quartus II software and finally it was implemented on Stratix Ep4SGX230KF40C2 device. For a 4×4 matrix, throughputs of 2.4M and 2.11M matrix per second are achieved for fixed and floating point respectively. The results shows that although fixed point design utilizes less FPGA resources but it suffers from less accuracy compared to floating point probably due to quantization and round off errors. It was also shown that there must be a trade-off between vector and matrix size in order to achieve the best performance and less area.
