The hierarchical multi-function matrix operation (MFMO) circuit modules are designed using the coordinate rotation digital computer (CORDIC) algorithm to realize the intensive computation of matrix operations. The paper emphasizes that the designed hierarchical MFMO circuit modules can be used to develop a power-efficient software-defined radio (SDR) digital beamformer (DBF). The formulas of the processing time for the scalable MFMO circuit modules implemented in a field programmable gate array (FPGA) are derived in order to allocate the proper logic resources for hardware reconfiguration. The hierarchical MFMO circuit modules scale with the changing number of array branches employed by the SDR DBF, which saves power. The efficient reuse of the common MFMO circuit modules in the SDR DBF also leads to energy reduction. Finally, the power dissipation and the reconfiguration function in the different modes of the SDR DBF are observed from the experimental results.
Introduction
Future wireless communication systems tend to use software-defined radio (SDR) [1], [2] and cognitive radio (CR) technologies [3], [4] to improve spectrum efficiency. An SDR with the intelligence to adapt to environmental changes is called a CR. The reconfigurable hardware of the SDR provides the flexibility, performance, and efficiency to enable the implementation of a multi-mode handset terminal operated in any given wireless standard [5]. It would also be possible to implement arbitrary baseband signal processing functionalities of the receiver by mapping the application code onto the configured processing modules. An example of this kind of application is the multimode digital beamformer (DBF) [6]. The work in [7] implements a multi-input-multi-output (MIMO) decoder accelerator on a field programmable gate array (FPGA) and shows a very high performance-cost metric compared with general-purpose DSP and application-specific IC implementations. The FPGA is increasingly being considered as a high-performance, low-cost reconfigurable device for implementing SDR and CR applications [8]. The multi-context FPGA is one of the typical dynamically programmable gate arrays, which can efficiently reuse limited hardware resources in time without hierarchical circuit modules. In [9], a small part of the array branches can be reconfigured to save energy and increase the lifetime. The diverse algorithms of wireless communications need intensive computation of matrix operations, which include eigen-value and eigen-vector decomposition (EVD), inverse eigen-vector matrix EVD, singular-EVD (S-EVD), Hessenberg factorization (HF), linear system solver (LSS), matrix inverse, upper matrix LSS (U-LSS), QR factorization (QRF), QR iteration (QRI) and matrix multiplier (MM).
In this study, in order to utilize the hardware resources efficiently and to reduce power consumption significantly, the hierarchical multi-function matrix operation (MFMO) circuit modules are designed in an FPGA using the coordinate rotation digital computer (CORDIC) algorithm [10]. In addition, the least squares (LS) unitary estimation of signal parameters via rotational invariance techniques (ESPRIT) subspace-based DBF, realized with a set of software-driven MFMO circuit modules in an FPGA chip, is chosen as an example to demonstrate the hardware reconfiguration capability and the scalability with respect to the number of array branches. The required processing time and logic resources of the LS unitary-ESPRIT SDR DBF are evaluated with the derived formulas in order to reconfigure the hierarchical MFMO circuit modules optimally for energy reduction. The ESPRIT subspace-based SDR DBF [11], [12] will become an essential smart antenna technology in the SDR system [13], [14].
Part of the preliminary work has been presented in [15]. The present paper extends it to emphasize that the designed hierarchical MFMO circuit modules can be used to develop an energy-efficient SDR DBF. The hierarchical MFMO circuit modules scale with the changing number of array branches employed by the SDR DBF in order to save energy. The reuse of the common MFMO circuit modules in the SDR DBF also reduces power consumption. More experiments are included in the paper to observe the energy dissipation for different numbers of antenna branches and the reconfiguration function in the different modes of the SDR DBF.
The paper is structured as follows. In Sect. 2, the hierarchical MFMO circuit modules are designed in an FPGA chip. The formulas of the required processing time for each circuit module are derived in Sect. 3, which also demonstrates that an SDR two-EVD system using common MFMO circuit modules can save power. In Sect. 4, the SDR DBF realized with hierarchical MFMO circuit modules is presented. In Sect. 5, the experimental results for both direction of arrival (DOA) estimation and null-steering demonstrate the reconfiguration function of a subspace-based LS unitary-ESPRIT SDR DBF and the scalable function of the designed hierarchical MFMO circuit modules. Section 6 concludes the paper. In addition, the detailed derivation of the required processing time for the N × N HF circuit module is presented in the Appendix. A list of acronyms is provided in Table 1 to aid reading.
Copyright © 2012 The Institute of Electronics, Information and Communication Engineers
Hierarchical MFMO Circuit Structure
The hierarchical MFMO circuit modules implemented in the FPGA are shown in Fig. 1. They are divided into the matrix operation module (MOM) layer and the processing element (PE) layer to perform the matrix operations. The different MFMO circuit modules, such as the EVD, S-EVD, HF, QRF, QRI, U-LSS, real Schur form (RSF) matrix LSS (RSF-LSS), and MM circuit modules, are located in the MOM layer, and their main PEs, such as the CORDIC divider, the CORDIC vectoring mode (CVM) and CORDIC rotation mode (CRM), multiplier and accumulator circuit modules, are located in the PE layer. The branches among the circuit modules shown in Fig. 1 represent their hierarchical relationship. The higher-layer modules, which consist of several subordinate matrix operation modules, perform more complicated matrix operations. For example, the EVD circuit module consists of three subordinate modules: the S-EVD, LSS, and MM circuit modules. The S-EVD circuit module consists of two subordinate circuit modules, HF and QRI, which can share the CVM and CRM circuit modules. Through the hierarchical circuit structure, common circuit modules can be reused to realize high-level MFMO circuit modules in order to reduce power consumption.
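The hierarchical relationship described in this section can be illustrated with a small sketch. The dictionary layout and the function names below are hypothetical and are not taken from the paper. The sketch also assumes that the LSS stage inside the EVD module is the RSF-LSS module, as the EVD procedure of this section and the EVD example of Sect. 3 suggest. Expanding a high-level module down to sublayer 3 shows which fundamental modules two operations could share.

```python
# Hypothetical encoding of the MOM-layer hierarchy; an illustration only,
# not the paper's implementation.
SUBORDINATES = {
    "EVD": ["S-EVD", "RSF-LSS", "MM"],  # assumption: the EVD's LSS stage is RSF-LSS
    "S-EVD": ["HF", "QRI"],
    "LSS": ["QRF", "U-LSS"],
}

# The six fundamental modules of sublayer 3 named in the text.
FUNDAMENTAL = {"HF", "QRF", "QRI", "U-LSS", "RSF-LSS", "MM"}


def fundamental_modules(module):
    """Return the set of sublayer-3 modules that a MOM-layer module expands to."""
    if module in FUNDAMENTAL:
        return {module}
    expanded = set()
    for child in SUBORDINATES[module]:
        expanded |= fundamental_modules(child)
    return expanded


def shared_modules(first, second):
    """Fundamental modules that two higher-level modules could reuse."""
    return fundamental_modules(first) & fundamental_modules(second)


if __name__ == "__main__":
    print(sorted(fundamental_modules("EVD")))
    print(sorted(shared_modules("EVD", "S-EVD")))
```

Under these assumptions the EVD module expands to the HF, QRI, RSF-LSS and MM modules, which is the same set that Sect. 3 lists for the N × N EVD circuit module.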
For the EVD matrix operation, the $N \times N$ matrix $A$ is first similarity-transformed to a Hessenberg matrix $H$, $A = P^{T}HP$, by the HF circuit module in order to reduce the computing load of the QR iteration. Secondly, the QRI circuit module is used to generate the RSF matrix $R_s = Q_s H Q_s^{T}$. The diagonal of $R_s$, which contains the $N$ eigenvalues of $A$, forms the diagonal eigenvalue matrix $D$. Thirdly, the eigenvector matrix of $R_s$ is obtained from $R_s$ and $Q_s$ through the RSF-LSS circuit module. Finally, the MM circuit module computes the eigenvector matrix $V$ of $A$ by multiplying $P^{T}Q_s^{T}$ with the eigenvector matrix of $R_s$. Following a similar procedure, the inverse of $V$ can be generated directly from $R_s$ and $Q_s$ through the RSF-LSS and U-LSS circuit modules. Both the EVD and the inverse eigenvector matrix EVD circuit modules in sublayer 1 of the MOM layer satisfy $A = VDV^{-1}$. Sublayer 2 of the MOM layer consists of the S-EVD and LSS circuit modules. For the S-EVD matrix operation, $A$ is replaced with a symmetric matrix $S$. The RSF matrix $R_s$ is then the diagonal eigenvalue matrix of $S$ and the orthogonal matrix $Q_s P$ is the eigenvector matrix of $S$. The solution $X$ of the linear system $AX = B$ can be obtained through the QRF and U-LSS circuit modules. For the LSS matrix operation, the QRF circuit module first transforms $A$ and $B$ into $R$ and $Q^{T}B$, respectively. The U-LSS circuit module then solves $RX = Q^{T}B$ to generate the solution
$$X = R^{-1}Q^{T}B.$$
If $B$ is the $N \times N$ identity matrix $I_N$, the LSS operation generates the inverse of the matrix $A$. The MM circuit module is employed to compute the product of the $N \times M$ matrix $A$ and the $M \times P$ matrix $B$, that is, $AB = C$, where $C$ is an $N \times P$ matrix. The six fundamental MFMO circuit modules, namely the HF, QRF, QRI, U-LSS, RSF-LSS and MM circuit modules, are constructed in sublayer 3 of the MOM layer.
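The LSS operation described above (QRF followed by U-LSS) can be sketched in software. The following is a minimal floating-point illustration, assuming a non-singular square $A$ and using textbook Givens rotations. The function name `givens_qr_solve` is hypothetical, and the sketch does not model the fixed-point CORDIC hardware or its timing.

```python
import math


def givens_qr_solve(A, B):
    """Solve A X = B: reduce A to upper-triangular R with Givens rotations
    (QRF step), apply the same rotations to B to get Q^T B, then
    back-substitute (U-LSS step). Assumes A is square and non-singular."""
    n = len(A)
    R = [row[:] for row in A]      # becomes the upper-triangular R
    QtB = [row[:] for row in B]    # becomes Q^T B
    for j in range(n - 1):
        for i in range(j + 1, n):
            r = math.hypot(R[j][j], R[i][j])
            if r == 0.0:
                continue           # entry is already zero, nothing to rotate
            c, s = R[j][j] / r, R[i][j] / r
            for M in (R, QtB):     # rotate rows j and i of both matrices
                top, bot = M[j], M[i]
                M[j] = [c * t + s * b for t, b in zip(top, bot)]
                M[i] = [-s * t + c * b for t, b in zip(top, bot)]
    # Back-substitution on R X = Q^T B, one right-hand-side column at a time.
    p = len(B[0])
    X = [[0.0] * p for _ in range(n)]
    for k in range(p):
        for i in range(n - 1, -1, -1):
            acc = QtB[i][k] - sum(R[i][m] * X[m][k] for m in range(i + 1, n))
            X[i][k] = acc / R[i][i]
    return X


if __name__ == "__main__":
    A = [[4.0, 3.0], [6.0, 3.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # With B = I_N the solver returns the inverse of A, as noted in the text.
    print(givens_qr_solve(A, identity))
```

Passing the identity matrix as `B` reproduces the matrix-inverse use of the LSS module mentioned in the text.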
HF, QRF, and QRI matrix operations can be realized through a sequence of Givens matrix multiplications [16]. In the Xilinx FPGA, the Givens matrix multiplication is constructed with the CVM and CRM circuit modules. The CVM circuit module takes the inputs $(x, y)$ and produces the outputs $(R, \theta)$, which are its magnitude and arctangent outputs. The CRM circuit module takes the inputs $(x, y, \theta)$ and produces the outputs $(X, Y)$, the input pair rotated by the angle $\theta$.
The PE layer includes two sublayers. The CORDIC divider, CVM, CRM, multiplier and accumulator circuit modules are located in sublayer 1 of the PE layer. The CORDIC PEs are located in sublayer 2 of the PE layer. The CRM and CVM circuit modules consist of $D_c$ CORDIC PEs. The $i$th CORDIC PE, which includes three adders, two bit shifters, one inverter and one look-up table (LUT) of $\arctan(2^{-i})$, computes one CORDIC micro-rotation.
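To make the roles of the CVM and CRM modules concrete, the sketch below chains textbook CORDIC micro-rotations in floating point. It is an illustration under stated assumptions and not the paper's circuit. The sign convention of the direction `d`, the explicit gain compensation, the restriction to $x > 0$, and all function names are assumptions. The default of 11 stages mirrors the value $D_c = 11$ used in the paper's examples.

```python
import math


def cordic_gain(stages):
    """Accumulated gain of `stages` micro-rotations."""
    gain = 1.0
    for i in range(stages):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


def cordic_pe(x, y, z, d, i):
    """One micro-rotation (stage i). The hardware uses bit shifts and an
    arctan(2**-i) look-up table; here they are modelled with floats."""
    shift = 2.0 ** -i
    return x - d * y * shift, y + d * x * shift, z - d * math.atan(shift)


def cvm(x, y, stages=11):
    """Vectoring mode: drive y towards zero, return (magnitude, angle).
    Assumes x > 0 so that the angle lies in the convergence range."""
    z = 0.0
    for i in range(stages):
        d = -1.0 if y > 0 else 1.0
        x, y, z = cordic_pe(x, y, z, d, i)
    return x / cordic_gain(stages), z


def crm(x, y, theta, stages=11):
    """Rotation mode: rotate (x, y) counter-clockwise by theta (radians)."""
    z = theta
    for i in range(stages):
        d = 1.0 if z >= 0 else -1.0
        x, y, z = cordic_pe(x, y, z, d, i)
    gain = cordic_gain(stages)
    return x / gain, y / gain


if __name__ == "__main__":
    magnitude, angle = cvm(3.0, 4.0)
    print(magnitude, angle)
    # A Givens step: rotating by -angle moves the vector onto the x axis.
    print(crm(3.0, 4.0, -angle))
```

In this convention a Givens step that zeroes the second component feeds the CVM angle, negated, into the CRM. More stages give better accuracy at the cost of latency, which is the tradeoff on $D_c$ discussed in Sect. 3.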
Processing Time of MFMO Circuit Modules
The required processing clocks of the six fundamental MFMO circuit modules are formulated in terms of the matrix size $N$ and the PE stage number $D_c$ in order to determine the system hardware configuration implemented in the FPGA. As shown in Fig. 2, the proposed $N \times N$ HF circuit module is able to adjust $D_c$ of the CVM and CRM circuit modules based upon the tradeoff between the processing accuracy and the processing time.
The processing time of the $N \times N$ HF circuit module is derived in the Appendix.
When all sub-diagonal entries of the Hessenberg matrix $H$ are zeros, $H$ becomes an upper triangular matrix, so that $P$ and $H$ become $Q$ and $R$. The processing of the QRF circuit module is almost the same as that of the HF circuit module, except that the processing $P_{N-1}A^{(N-2)}$ is added to transform the sub-diagonal into zeros and the right multiplication of $P_{i+1}A^{(i)}$ by $P_{i+1}^{T}$ is not necessary for any $i$. Therefore, the processing time of the QRF circuit module is derived as
for N = 3 (5)
Figure 3 shows the functional diagram of the $N \times N$ QRI circuit module, and the processing time for the QRI circuit module is derived as
The U-LSS circuit module uses a single divider based on the CORDIC algorithm for all division operations of the backward elimination (BE) processing [16]. A pipelined 32-bit CORDIC divider is assumed to require a latency of more than 47 clocks to achieve sufficient accuracy for the U-LSS circuit module. The BE processing for a $4 \times 4$ system $Rx = b$ then needs more than $4 \times 47 = 188$ clocks for its four division operations. To exploit the pipeline processing, a row pre-division is designed to replace the division operations within the BE processing. The row pre-division divides the $i$th rows of $R$ and $B$ by $R_{i,i}$ for $i = 1, 2, \ldots, N$ before the BE operation starts. The 4 × 4 R x = b which is row pre-division of 4
Utilizing the pipelined structure of the CORDIC divider, the row pre-division of the $4 \times 4$ system $Rx = b$ reduces its processing time to $4(3+1)/2 + 4 + 47 = 59$ clocks. The row pre-division saves more processing time of the U-LSS circuit module as the matrix size grows. As shown in Fig. 4, the proposed U-LSS circuit module is able to adjust the pipeline length $D_d$ of the CORDIC divider based upon the tradeoff between the processing accuracy and the processing time. The U-LSS circuit module includes a CORDIC divider and an inner-product unit, which consists of a multiplier and $N$ accumulators (ACCs). The processing time of the $N \times N$ U-LSS circuit module is derived as
where $D_d$ is the latency of the CORDIC divider. The proposed algorithm for the RSF-LSS circuit module is modified from the method presented in [16]. The eigenvectors $x_n$ of $R_s$, $\forall n = 1, 2, \ldots, N-1$, satisfy $T_n x_n = 0$, where $T_n = R_s - R_s(n,n)I_N$. The $n \times n$ matrix that denotes the part of $T_n$ from row 1 to $n$ and from column 1 to $n$ can be represented as
The equivalent polynomial form is
The solutions of (9), $x_n^{(a)}$, form the eigenvector matrix of $R_s$:
As shown in Fig. 5 , the RSF-LSS circuit module is designed to compute (10) efficiently and its processing time is derived as
The MM circuit module consists of a multiplier and an ACC. The processing time of MM circuit module is given by
where the $D_M$ clock delay is the latency of the multiplier. Equations (4), (5), (6), (8), (12) and (13) show that the processing time of the MFMO circuit modules increases with the matrix size. The scalable property of the MFMO circuit modules enables the reconfiguration function in SDR applications. As listed in Table 2, the processing clock delay of the EVD circuit modules in the hierarchical circuit structure is the sum of the processing clocks of their subordinate modules. For example, the $N \times N$ EVD circuit module includes the HF, QRI, RSF-LSS and MM subordinate modules. Therefore, the predicted processing time of the $N \times N$ EVD circuit module can be calculated with Eqs. (4), (6), (12) and (13). Figures 6 and 7 show an SDR system implemented with two EVD circuit modules using two different architectures, in which the common modules HF, QRI, RSF-LSS and MM are reused. In the SDR architecture of Fig. 6, the required number of processing clocks is 176,674 for $N = 15$ and $D_c = 11$. In the pipelined SDR architecture of Fig. 7, the required number of processing clocks is 156,345 for $N = 15$ and $D_c = 11$. Because the same common circuit modules are reused, both SDR implementations can save half of the power compared with a two-EVD system that does not share the common circuit modules.
Table 2 The processing clocks of EVD circuit modules.
Fig. 6 The SDR architecture of two-EVD system.
Fig. 7 The pipelined SDR architecture of two-EVD system.
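The two triangular solvers of this section can be illustrated in software. The sketch below is a textbook floating-point version under stated assumptions and not the circuits of Figs. 4 and 5. It assumes an upper-triangular matrix with non-zero and, for the eigenvector routine, distinct diagonal entries, so 2 × 2 blocks of a general real Schur form are not handled. All function names are hypothetical. The two clock counts are the figures quoted in the text for the 4 × 4 example with a 47-clock divider.

```python
def back_substitute(R, b):
    """Backward elimination with one division per step."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x


def back_substitute_prediv(R, b):
    """Row pre-division: every row of R and b is divided by its diagonal
    entry first, so the elimination loop itself contains no division."""
    n = len(b)
    Rn = [[R[i][j] / R[i][i] for j in range(n)] for i in range(n)]
    bn = [b[i] / R[i][i] for i in range(n)]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = bn[i] - sum(Rn[i][j] * x[j] for j in range(i + 1, n))
    return x


def triangular_eigenvectors(R):
    """Column k solves (R - R[k][k] I) x = 0 with x[k] = 1 by back-substitution."""
    n = len(R)
    V = [[0.0] * n for _ in range(n)]
    for k in range(n):
        x = [0.0] * n
        x[k] = 1.0
        for i in range(k - 1, -1, -1):
            acc = sum(R[i][j] * x[j] for j in range(i + 1, k + 1))
            x[i] = -acc / (R[i][i] - R[k][k])
        for i in range(n):
            V[i][k] = x[i]
    return V


# Clock counts quoted in the text for the 4 x 4 example.
SERIAL_DIVISION_CLOCKS = 4 * 47                   # four dependent divisions
PREDIVISION_CLOCKS = 4 * (3 + 1) // 2 + 4 + 47    # pipelined row pre-division

if __name__ == "__main__":
    R = [[2.0, 1.0, 1.0], [0.0, 4.0, 2.0], [0.0, 0.0, 5.0]]
    print(back_substitute(R, [9.0, 16.0, 15.0]))
    print(back_substitute_prediv(R, [9.0, 16.0, 15.0]))
    print(SERIAL_DIVISION_CLOCKS, PREDIVISION_CLOCKS)
```

Both back-substitution variants return the same solution. The pre-division variant only moves the divisions ahead of the elimination loop, which is what lets a pipelined divider process them back to back.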
Implementation of SDR DBF with MFMO Circuit Modules
The unitary-ESPRIT algorithm is summarized in [12]. By using the forward-backward average (FBA), it can provide the DOA estimation of the coherent signals generated in multipath fading channels. In the DOA estimation mode, it can estimate at most (M-1) high-resolution DOAs when (M-1) DOA signals arrive at an M-element array simultaneously. Moreover, it is able to compute the weights of the DBF which steers one of the (M-1) DOA
where the selection matrices $K_1$ and $K_2$ are defined in [12]. For the total least squares (TLS) version, the singular value decomposition (SVD) $E_{12} = U\Sigma V^{T}$ is computed to find the $2m \times m$ signal subspace $V$, which consists of the $m$ right singular vectors corresponding to the $m$ minimum singular values. The SVD can be computed by the EVD module. Let $R_E = E_{12}^{T}E_{12}$. The square roots of the eigenvalues of $R_E$ are the singular values of $E_{12}$ and the eigenvector matrix of $R_E$ is $V$. The upper half $V_1$ and the lower half $V_2$ of $V$ are used to calculate the TLS solution matrix $\Psi_{TLS} = -V_1 V_2^{-1}$. For the LS version, the LS solution matrix is denoted $\Psi_{LS}$. If the distance between adjacent ULA elements is half the wavelength of the DOA signal carrier, the estimated DOAs are
where $\phi_1, \phi_2, \ldots, \phi_m$ are the eigenvalues of $\Psi_{TLS}$ or $\Psi_{LS}$. Here, the CORDIC arcsine circuit module is similar to the CVM module, and its $d_i$ is 1 if $y_i$ is less than the input value and $-1$ otherwise. The null-steering weight matrix, which consists of $m$ weight vectors for the null-steering beamforming of the $m$ DOAs, is given by
where $T$ is the eigenvector matrix of $\Psi_{TLS}$ or $\Psi_{LS}$. Figure 8 shows the flow chart of the LS unitary-ESPRIT DOA estimation and null-steering weight matrix computations that are realized with the hierarchical MFMO circuit modules. The configured circuit modules of the LS unitary-ESPRIT algorithm are allocated from the MOM layer and the PE layer of Fig. 1. The EVD, LSS and MM circuit modules are mapped to sublayer 1, sublayer 2 and sublayer 3 of the MOM layer, respectively. The CORDIC-arcsine and CVM circuit modules are mapped to sublayer 1 of the PE layer. As soon as the sequence controller receives the reconfiguration command, the selected circuit modules are allocated to perform the LS unitary-ESPRIT algorithm according to the enable signals generated by the sequence controller. After completing the processing, the allocated MFMO circuit modules are reset and send their output enable signals to the sequence controller. Two EVD circuit modules are included in the configured MFMO circuit modules of the SDR DBF. As stated earlier, the reuse of the EVD circuit module in the SDR DBF can save the power consumption of one EVD circuit module.
Figure 9 shows the SDR architecture of the unitary-ESPRIT SDR DBF realized with hierarchical MFMO circuit modules. The software part includes the software-code library unit (SLU), the interface processing unit (IPU), and the module implementation unit (MIU). The SLU, IPU and MIU are implemented in the RC. The processing module unit (PMU) is implemented in reconfigurable FPGA hardware. The application program code consists of the high-level design tool, such as MATLAB or the PMU driver, and the functionality code modules [17]. The functionality code programs, including TLS UESPRIT{M, Mode}, LS UESPRIT{M, Mode} and so on, are stored in the SLU on the hard disc (HD) of the PC (see Table 3). The MIU employs the application program code stored in the SLU to configure the MFMO circuit modules in the PMU.
The MIU converts the generic application program code into a bit-stream and downloads it to the PMU to execute the declared MFMO functions on the configured MFMO circuit modules. First, the MIU configures the maximally reused MFMO circuit modules in the FPGA through a Peripheral Component Interconnect (PCI) interface and predicts their total processing time $T_{pt}$ and gate count $N_{gc}$. Next, it determines whether $T_{pt}$ and $N_{gc}$ satisfy the requirements of the acceptable DBF processing time $T_{pt\_req}$ and the currently utilizable hardware gate count $N_{gc\_max}$. If the reconfiguration requirements are satisfied and the bit-stream of the configured MFMO circuit modules is verified, the MIU downloads the bit-streams into the configured MFMO circuit modules in the PMU. If $T_{pt} > T_{pt\_req}$, the processing time of the configured MFMO circuit modules in the PMU needs to be reduced by a factor of $N_p$, which is achieved by increasing the gate count $N_p$ times through $N_p$-branch parallel processing.
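The reconfiguration check performed by the MIU can be sketched as a small decision routine. This is a hypothetical illustration, not the MIU software. It assumes ideal scaling, in which $N_p$ parallel branches divide the processing time by $N_p$ and multiply the gate count by $N_p$, and it picks the smallest $N_p$ that meets the time requirement. The names `Plan` and `plan_reconfiguration` and all numbers in the usage example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    feasible: bool    # True if both time and gate-count requirements are met
    branches: int     # N_p, number of parallel processing branches
    time: float       # predicted processing time with N_p branches (clocks)
    gates: int        # predicted gate count with N_p branches


def plan_reconfiguration(t_pt, n_gc, t_pt_req, n_gc_max):
    """Choose the smallest N_p with t_pt / N_p <= t_pt_req, then check
    whether the N_p-fold gate count still fits within n_gc_max.
    t_pt and t_pt_req are positive integer clock counts."""
    n_p = max(1, -(-t_pt // t_pt_req))   # integer ceiling of t_pt / t_pt_req
    gates = n_p * n_gc
    return Plan(gates <= n_gc_max, n_p, t_pt / n_p, gates)


if __name__ == "__main__":
    # Hypothetical numbers: 300 clocks predicted, 100 clocks allowed.
    print(plan_reconfiguration(300, 1000, 100, 5000))
    # Same timing, but too few gates for three parallel branches.
    print(plan_reconfiguration(250, 1000, 100, 2000))
```

When the plan is infeasible, the remaining option discussed in Sect. 5 is to reduce the number of array branches, which lowers both the processing time and the gate count.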
Experiments
The hierarchical MFMO circuit modules of the proposed SDR DBF are implemented in a Xilinx Virtex II V3000 FPGA associated with the Quad DSP-FPGA signal quad kit provided by Lyrtech [18], and they are tested using real-time or simulation inputs. There are 28,672 slice flip-flops, 28,672 LUTs, 96 Block RAMs and 96 18x18 multipliers in the Xilinx Virtex II V3000 FPGA. The DOA estimation mode and the DOA estimation plus null steering mode of the LS unitary-ESPRIT SDR DBF are used to show the flexibility and scalability of the proposed hierarchical architecture of the MFMO circuit modules. For the DOA estimation mode, the covariance matrix of six DOA signals at angles −47°, −23°, −8°, 2°, 11°, and 37° is used to test the correctness of the DOA function of the 7 × 7 LS unitary-ESPRIT SDR DBF implemented with the designed hierarchical MFMO circuit modules. Figure 10 shows that the six target DOA signals are exactly estimated. Figure 11 demonstrates that one target DOA signal at 2° is estimated and that the five interference signals, including the one at −47°, are exactly nulled. The experiment of the DOA estimation plus null steering mode is also used to test the reconfiguration function of the 7 × 7 LS unitary-ESPRIT SDR DBF implemented with the hierarchical MFMO circuit modules.
The utilized hardware, the total energy dissipation and the processing time of the reconfigured MFMO circuit modules for the number of array branches M = 7, 8, 9, 10, 11, 12, 13, 14 and 15 are listed in Tables 4 and 5 for the DOA estimation mode and the DOA estimation plus null steering mode, respectively. The DOA estimation plus null steering mode in Table 5 needs a larger gate count, a longer processing time and more total energy than the DOA estimation mode in Table 4. Both tables show that the processing time, the gate count and the total energy dissipation increase with the number of array branches. Therefore, if the number of users in a given cell of a mobile wireless system is less than the maximum number of array branches, the SDR DBF only needs to operate a part of the array branches in order to save energy. If the processing time of the SDR DBF is longer than the channel coherence time of a mobile wireless system, the processing time can be decreased to meet the flat fading channel operation condition [19], either by reducing the number of array branches of the SDR DBF or by using parallel processing if double the gate count is available in the FPGA. A previous SDR implementation without MFMO circuit modules, such as [8], cannot scalably adapt its hardware configuration to the changing number of antenna branches to save energy.
Conclusions
Fig. 11 Test results of 7 × 7 LS unitary-ESPRIT DBF for one estimated DOA and five nulling.
Table 4 Resource allocation and processing time of the LS unitary-ESPRIT DBF for DOA estimation.
Table 5 Resource allocation and processing time of the LS unitary-ESPRIT DBF for DOA estimation and null-steering weighting matrix calculation.
The hierarchical MFMO circuit modules based on the SDR architecture are designed to implement the subspace-based DBF. The formulas of the processing time for the MFMO circuit modules were derived. The LS unitary-ESPRIT DBF is used as an example to demonstrate the reconfiguration function of the hierarchical MFMO circuit modules implemented in the FPGA. The matrix size of the hierarchical MFMO circuit modules scales with the number of array branches employed by the LS unitary-ESPRIT SDR DBF. Based on the tradeoff between the available logic resources and the required processing time, the appropriate MFMO circuit modules are reconfigured. The test results demonstrate that the proposed hierarchical MFMO circuit modules are able to implement the multi-mode subspace-based SDR DBF with accurate DOA estimation and nulling performance. The experiments show that the hierarchical MFMO circuit modules scale with the number of array branches. The hardware configuration of the SDR DBF implemented with the designed hierarchical MFMO circuit modules can adapt to the varying DBF mode and the number of users in a given cell in order to save power. It is therefore not necessary to operate an SDR DBF with a fixed number of array branches and the maximum hardware gate count under all environmental conditions.
Future research on further reducing the processing time and power consumption of the SDR DBF will cover the switching control design of the power supply in the array branches. The RC will command the power switching circuit to deactivate an active array branch and to activate an inactive array branch. The power supply to an array branch that is not in use is halted, so that more energy is saved and battery operating periods can be extended further. Additionally, the designed hierarchical MFMO circuit modules are programmable and reconfigurable, so they are also applicable to future SDR and CR systems.
Appendix: Derivation of (4) for N× N HF Circuit Module
The HF circuit module shown in Fig. 2, with $N = 6$ and $D_c = 11$, is given as an example to illustrate the derivation procedure of the processing time formula for the MFMO circuit modules. Since no buffer is designed in the CRM and CVM modules, the PE stage number $D_c$ cannot be less than the processing matrix size $N$. Table A·1 shows the timing diagram of the input and output sequences of the CVM and CRM modules. As shown in Table A·1, CVin1, CVin2, CVmag and CVtheta denote the x input, y input, mag output, and atan output, respectively, of the CVM module in Fig. 2. Here, × denotes a "don't care" item, $a_i$ means the $i$th row vector of matrix $A$ and $a_{ij}$ means the element in the $i$th row and $j$th column of matrix $A$. Updated values reuse the symbol of the entry or row that they replace. It is assumed that (CVin1, CVin2) = $(a_{21}, a_{31})$ occurs at the time of clk = 0, and (CVmag, CVtheta) = $(a_{21}, \theta_1)$ = CVM$(a_{21}, a_{31})$ occurs at the time of clk = 11. Through the feedback between CVmag and CVin1, (CVin1, CVin2) = $(a_{21}, a_{41})$ occurs at the time of clk = 11 and (CVmag, CVtheta) = $(a_{21}, \theta_2)$ = CVM$(a_{21}, a_{41})$ results at the time of clk = 22. As shown in Table A·1, CRin1, CRin2, CRtheta, CRout1, and CRout2 denote the x input, y input, atan input, X output, and Y output, respectively, of the CRM in Fig. 2. CRin1 and CRin2 are equal to the 2nd row $a_2$ and the 3rd row $a_3$ of $A$, respectively, in the time period of clks = 11-16, and CRtheta = $\theta_1$ is stored in a register within this time period, so that (CRin1, CRin2, CRtheta) = $(a_2, a_3, \theta_1)$ holds in the time period of clks = 11-16. (CRout1, CRout2) = $(a_2, a_3)$ = CRM$(a_2, a_3, \theta_1)$ results in the time period of clks = 22-27. Through the feedback between CRout1 and CRin1, (CRin1, CRin2, CRtheta) = $(a_2, a_4, \theta_2)$ holds in the time period of clks = 22-27 and (CRout1, CRout2) = $(a_2, a_4)$ = CRM$(a_2, a_4, \theta_2)$.
Consequently, (CRin1, CRin2, CRtheta) = $(a_2, a_3, \theta_1)$, $(a_2, a_4, \theta_2)$, $(a_2, a_5, \theta_3)$, $(a_2, a_6, \theta_4)$ in the time periods of clks = 11-16, 22-27, 33-38, 44-49, and 55-60, respectively, and (CRout1, CRout2) = $(a_2, a_3)$, $(a_2, a_4)$, $(a_2, a_5)$, $(a_2, a_6)$ in the time periods of clks = 22-27, 33-38, 44-49, 55-60, and 66-71, respectively. $a_2$, $a_3$, $a_4$, $a_5$, and $a_6$ take the positions of the 2nd to 6th rows of the matrix $A$ that are stored in the random access memory (RAM), respectively. The matrix of the RAM is transformed as
The RAM used in the modules is a single input single output (SISO) component. Since the reading and writing operations of the SISO RAM cannot be carried out in the same clock, the five output rows, $a_3$, $a_4$, $a_5$, $a_6$, and $a_2$, cannot be written into the RAM until the five input rows, $a_2$, $a_3$, $a_4$, $a_5$, and $a_6$, have been read out from the RAM completely at the time of clk = 44. Hence, $a_3$, $a_4$, $a_5$, $a_6$, and $a_2$ are written into the RAM in the time periods of clks = 45-50, 56-61, 67-72, 78-83 and 84-89, respectively. Consequently, $P_1A$ will be written into the RAM after 90 clocks. Without loss of generality, suppose $P_1A$ is stored in the RAM at the time of clk = 0. The 2nd to 6th columns of $P_1A$ are input in the time periods of clks = 0-5, 6-11, 17-22, 28-33, and 39-44 and are processed in the same way as for $P_1A$, where $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ result from the prior processing of $P_1A$. Hence, $A^{(1)} = P_1AP_1^{T}$ will be calculated and stored in the RAM after 90 clocks. The required processing time for the $A^{(1)}$ calculation is $(90 + 1) \times 2 = 182$ clocks, where one clock is added due to the latency of each RAM access.
Suppose the pre-processing result $A^{(1)}$ is written in the RAM at the time of clk = 0. $P_2A^{(1)}P_2^{T}$ is calculated with the same procedure as $P_1AP_1^{T}$. The four input rows, $a^{(1)}_4$, $a^{(1)}_5$, $a^{(1)}_6$, and $a^{(1)}_3$, are read completely at the time of clk = 33, and the four output rows, $a^{(1)}_4$, $a^{(1)}_5$, $a^{(1)}_6$, and $a^{(1)}_3$, are written into the RAM in the time periods of clks = 34-39, 45-50, 56-61, and 62-67, so that the processing time for calculating $P_2A^{(1)}$ is 68 clocks. Similarly, the processing time for calculating $A^{(2)}$ is $(68 + 1) \times 2 = 138$ clocks. The processing time for calculating $A^{(3)}$ is $(51 + 1) \times 2 = 104$ clocks and the processing time for calculating $A^{(4)}$ is $(40 + 1) \times 2 = 82$ clocks. $A^{(4)}$ is a $6 \times 6$ Hessenberg matrix $H$, so that the processing time of the $6 \times 6$ HF module is $182 + 138 + 104 + 82 = 506$ clocks. As shown in Fig. 2, the HF module can compute the Hessenberg matrix $H$ and its similarity matrix $P$ at the same time if a CRM and an $N \times N$ RAM are added for $I_N$ access.
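The arithmetic of this worked example can be checked with a few lines of code. The sketch below only re-adds the per-stage clock counts quoted above for $N = 6$ and $D_c = 11$. It is not an implementation of Eq. (4), and the function names are hypothetical.

```python
# Clocks until the left-multiplication pass of each stage is fully written
# to RAM, as quoted in the text for N = 6 and D_c = 11.
PASS_CLOCKS = [90, 68, 51, 40]
RAM_ACCESS_LATENCY = 1   # one extra clock per RAM access, as stated in the text


def stage_clocks(pass_clocks):
    """Each stage runs the pass twice: left multiplication P_i A, then the
    right multiplication by P_i^T, each with one clock of RAM latency."""
    return (pass_clocks + RAM_ACCESS_LATENCY) * 2


def hf_total_clocks(passes):
    """Total processing clocks of the HF module over all stages."""
    return sum(stage_clocks(c) for c in passes)


if __name__ == "__main__":
    print([stage_clocks(c) for c in PASS_CLOCKS])
    print(hf_total_clocks(PASS_CLOCKS))
```

The per-stage results match the 182, 138, 104 and 82 clocks of the derivation, and their sum matches the stated total of 506 clocks for the 6 × 6 HF module.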
Chi-Cheng Kuo
is currently a Ph.D. graduate student in the Department of Communications Engineering, Yuan-Ze University. His current research interests include Cognitive radio and SDR baseband signal processing.
Shin-Ru Wu
is currently a Ph.D. graduate student in the Department of Communications Engineering, Yuan-Ze University. His current research interests include smart antenna and signal processing.
You-Rong Lin
is currently a Ph.D. graduate student in the Department of Communications Engineering, Yuan-Ze University. His current research interests include UWB vehicular radar and baseband signal processing.
