Abstract-In this paper, we propose a parallel and pipelined VLSI architecture for a circulant approximated equalizer for MIMO-CDMA systems. The FFT-based tap solver reduces the Direct-Matrix-Inverse of size (NF × NF) to the inverse of O(N) sub-matrices of size (N × N). Hermitian optimization and tree pruning are proposed to reduce the number and complexity of the FFTs. A divide-and-conquer method partitions the 4 × 4 sub-matrices into 2 × 2 sub-matrices and simplifies the inverse of the sub-matrices. A generic VLSI architecture is derived to eliminate the redundancies in the complex operations. Multiple-level parallelism and pipelining are investigated with a Catapult C High-Level-Synthesis (HLS) methodology. This leads to efficient VLSI architectures with 3× further complexity reduction. The scalable VLSI architectures are prototyped with Xilinx FPGAs and achieve area/time efficiency.
I. INTRODUCTION
The growing demand for broadband multimedia services and ubiquitous networking via mobile devices pushes the development of advanced modem technology. Multiple Input Multiple Output (MIMO) technology [1] [2], using multiple antennas at both the transmitter and receiver, has recently emerged as one of the most significant technical breakthroughs in modern communications. On the other hand, UMTS and CDMA2000 extensions optimized for data services have led to the standardization of multi-code CDMA systems such as High-Speed-Downlink-Packet-Access (HSDPA) and its equivalent 1X EV-DV (Evolution Data and Voice) [3]. Recently, MIMO technology has been proposed for CDMA downlink systems to achieve much higher data rates than the current 3rd-generation cellular systems.
The original MIMO technology is known as D-BLAST [1], and a more realistic strategy is V-BLAST [2], which uses nulling and cancelling with a reasonable tradeoff between complexity and performance. However, the original MIMO spatial multiplexing was proposed for narrow-band and flat-fading channels. In a multipath fading channel, the orthogonality of the spreading codes is destroyed, and Multiple-Access-Interference (MAI) along with Inter-Symbol-Interference (ISI) is introduced. With a very short spreading gain, the conventional Rake receiver cannot provide acceptable performance. The Linear-Minimum-Mean-Square-Error (LMMSE) chip equalizer is promising for restoring the orthogonality of the spreading codes, so as to suppress both the ISI and MAI [4]. However, it requires inverting a large correlation matrix with O((NF)^3) complexity, where N is the number of Rx antennas and F is the channel length. This is very expensive for hardware implementation. To avoid the Direct-Matrix-Inverse (DMI), adaptive stochastic gradient algorithms such as LMS have been proposed. However, they suffer from stability problems because the convergence depends on the choice of a good step size. On the other hand, non-adaptive block-based algorithms have been proposed, e.g., a Conjugate Gradient-based tap solver in [5] with a complexity on the order of O((NF)^2). However, it is shown that this algorithm still has a high complexity for real-time implementation.
To further reduce the complexity, an FFT-based fast algorithm is proposed in [7] that approximates the block Toeplitz structure of the correlation matrix with a block circulant matrix to avoid the DMI. Although this FFT-based algorithm avoids the DMI of the original correlation matrix with the dimension of (NF × NF), the matrix inverse of some smaller sub-matrices of size (N × N) is inevitable for a MIMO receiver. For a MIMO receiver with a high-order antenna configuration, the complexity increases dramatically with the number of antennas. The fact that the receiver must be embedded into a portable device makes the design of low-complexity and low-cost products critical for widespread commercial deployment. It is necessary to determine which range of possible architectures is most suitable for VLSI implementation [6].
In this paper, a reduced-complexity MIMO receiver architecture for the FFT-based chip-level equalizer is proposed. Hermitian optimization is proposed by utilizing the structures of the correlation coefficients and the FFT algorithm to further reduce the complexity. A reduced-state FFT module is proposed to avoid duplicate computation of the symmetric coefficients and the zero coefficients. The number and complexity of conventional FFT design modules are reduced. The Hermitian feature is then applied to reduce the complexity of the inverse of the sub-matrices. Of particular interest is the non-trivial inverse of many 4 × 4 sub-matrices. We apply a divide-and-conquer method to partition the 4 × 4 sub-matrices into four 2 × 2 sub-matrices. The 4 × 4 matrix inverse is then dramatically simplified by exploiting the commonality. A generic VLSI design architecture is derived from the special design blocks to eliminate the redundancies in the complex operations. The simplified model facilitates the design of efficient parallel VLSI modules such as "Complex-Hermitian-Multiplication", "Hermitian Inverse" and "Diagonal Transform". This leads to efficient architectures with 3× further complexity reduction and a more parallel and pipelined VLSI schematic, which is verified in a Xilinx FPGA prototyping platform using a Catapult C based HLS design methodology [8].
II. SYSTEM MODEL AND CHIP EQUALIZER

A. System Model
The system model of the MIMO multi-code CDMA downlink using M Tx antennas and N Rx antennas is described as follows. Multiple spreading codes are assigned to a single user to achieve a high data rate in the multi-code CDMA downlink. First, the high-data-rate symbols are demultiplexed into KM lower-rate substreams, where K is the number of spreading codes used in the system for data transmission. The substreams are divided into M groups, where each substream in the group is spread with a spreading code of spreading gain G. Each group is then combined and scrambled with a long scrambling code and transmitted through the m-th Tx antenna. The chip-level signal at the m-th transmit antenna is given by
where hm,n(l) and Lm,n are the l-th path channel coefficient and the delay spread between the m-th Tx antenna and the n-th Rx antenna, respectively. zn(i) is the additive Gaussian noise at the n-th receive antenna.
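By packing the received chips from all the receive antennas in a vector $\mathbf{r}(i) = [r_1(i), \ldots, r_n(i), \ldots, r_N(i)]^T$ and collecting the LF = 2F + 1 consecutive chips centered at the i-th chip from all the N Rx antennas, we form a signal vector as $\mathbf{r}_A(i) = [\mathbf{r}(i+F)^T, \ldots, \mathbf{r}(i)^T, \ldots, \mathbf{r}(i-F)^T]^T$.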
In the vector form, the received signal is given by
where Hm(i) is the channel matrix with a block Toeplitz structure. The transmitted chip vector for the m-th transmit antenna is given by
B. LMMSE Tap Solver with Circulant Approximation
The linear MMSE based chip equalizer estimates the transmitted chip samples with a set of linear FIR filter coefficients as $\hat{d}_m(i) = \hat{\mathbf{w}}_m^H(i)\,\mathbf{r}_A(i)$. It is well known that the LMMSE chip equalizer coefficients are given by minimizing the MSE between the transmitted and recovered chip samples as
where
is the transmitted chip power. R̂rr(i) and ĥm(i) are the covariance estimation and channel estimation, respectively. Here the covariance matrix is estimated by the time average under an ergodicity assumption as
where NB is the length of the time average. The channel coefficients are estimated as ĥm
using the pilot symbols. In the HSDPA standard, about 10% of the total transmit power is dedicated to the Common Pilot Channel (CPICH). This provides good channel estimation. By assuming that the channel is stationary over the observation window length, we can use a block-based operation by omitting the chip index in R̂rr(i), ĥm(i) and ŵm(i).
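The time-average covariance estimate described above can be illustrated with a minimal sketch. It assumes the common form in which the outer products x x^H of the stacked received vectors are averaged over a window of NB vectors; the exact indexing of the paper's equation is not recoverable here, and the function names and sample values are hypothetical.

```python
def outer_conj(x):
    """Return the outer product x x^H as a nested list."""
    return [[a * b.conjugate() for b in x] for a in x]


def time_average_covariance(samples):
    """Average x x^H over the window; len(samples) plays the role of NB."""
    nb = len(samples)
    dim = len(samples[0])
    acc = [[0j] * dim for _ in range(dim)]
    for x in samples:
        p = outer_conj(x)
        for r in range(dim):
            for c in range(dim):
                acc[r][c] += p[r][c]
    return [[v / nb for v in row] for row in acc]


# Hypothetical window of NB = 3 received vectors of dimension 2.
window = [[1 + 1j, 2 - 1j], [0.5j, -1 + 0j], [2 + 0j, 1 + 1j]]
R = time_average_covariance(window)
```

The estimate is Hermitian by construction, which is the property the later Hermitian optimizations rely on.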
Using the stationarity of the channel and the convolution property, it is shown that the correlation matrix Rrr is a banded block Toeplitz matrix, which is approximated by a block-circulant matrix [7] after we add two corner matrices as
Here E[l] is an N × N block matrix constructed with the correlation coefficients. Using the extension of the diagonalization theorem, the block-circulant matrix can be decomposed as
is the phase factor coefficient for the DFT computation. ⊗ denotes the Kronecker product and D is the DFT matrix. Finally the MIMO equalizer taps are computed from
is a block-diagonal matrix with elements taken from the element-wise FFT of the first block column of the circulant matrix. For an (M × N) MIMO system, this reduces the inverse of an (NLF × NLF) matrix to the inverse of sub-block matrices of size (N × N).
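The reduction from one large inverse to many small ones can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a plain O(n^2) DFT in place of a hardware FFT, N = 2 receive antennas, four blocks, and made-up coefficient values. All function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Plain DFT standing in for an FFT; the inverse includes the 1/n factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[l] * cmath.exp(sign * 2j * cmath.pi * k * l / n) for l in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def inv2(m):
    """Direct inverse of a 2 x 2 complex matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def solve_block_circulant(col, h):
    """Solve C w = h, where C is block circulant with first block column `col`."""
    size = len(col)
    # Element-wise DFT of the first block column: fel[r][c][k].
    fel = [[dft([col[l][r][c] for l in range(size)]) for c in range(2)] for r in range(2)]
    # Dimension-wise DFT of the right-hand side.
    hf = [dft([h[l][r] for l in range(size)]) for r in range(2)]
    wf = [[0j] * size for _ in range(2)]
    for k in range(size):
        g = inv2([[fel[r][c][k] for c in range(2)] for r in range(2)])
        for r in range(2):
            wf[r][k] = g[r][0] * hf[0][k] + g[r][1] * hf[1][k]
    w = [dft(wf[r], inverse=True) for r in range(2)]
    return [[w[0][l], w[1][l]] for l in range(size)]


def block_circulant_apply(col, w):
    """Compute C w directly: (C w)[i] = sum_j col[(i - j) mod size] w[j]."""
    size = len(col)
    out = []
    for i in range(size):
        acc = [0j, 0j]
        for j in range(size):
            b = col[(i - j) % size]
            for r in range(2):
                acc[r] += b[r][0] * w[j][0] + b[r][1] * w[j][1]
        out.append(acc)
    return out


# Hypothetical banded Hermitian example: col = [E0, E1, 0, E1^H].
E0 = [[4 + 0j, 1 + 0.5j], [1 - 0.5j, 3 + 0j]]
E1 = [[0.5 + 0j, 0.2j], [0.1 + 0j, 0.3 + 0j]]
E1H = [[0.5 + 0j, 0.1 + 0j], [-0.2j, 0.3 + 0j]]
Z = [[0j, 0j], [0j, 0j]]
col = [E0, E1, Z, E1H]
h = [[1 + 1j, 0.5 + 0j], [0.2j, -1 + 0j], [0j, 0j], [0.3 + 0j, 0.1 - 0.2j]]
w = solve_block_circulant(col, h)
residual = max(abs(y - x)
               for yr, xr in zip(block_circulant_apply(col, w), h)
               for y, x in zip(yr, xr))
```

Only 2 × 2 inverses appear in the solver, one per frequency bin, and the residual against the direct block-circulant product confirms that the result solves the full system.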
III. SCALABLE PIPELINED VLSI ARCHITECTURE
To achieve a real-time implementation, either DSP processors or VLSI architectures can be applied. The limited hardware resources and power supply in mobile handsets make the hardware design more challenging, especially for the MIMO system. Many optimizations are needed to reduce the redundant computation and make it suitable for real-time implementation. We emphasize the interaction between architecture, system partitioning and pipelining with these objectives: 1) propose further optimization schemes to reduce the computation complexity for efficient VLSI implementation; 2) design parallel and pipelined architectures for the critical computation blocks.
A. HLS Architecture Scheduling
A Field Programmable Gate Array (FPGA) can behave like a number of different ASICs. This makes the FPGA a good platform to build, verify and prototype System-on-Chip (SoC) designs quickly. We applied an efficient Catapult C based High-Level-Synthesis (HLS) design methodology to investigate various pipelined architectures and different levels of parallelism. The Register-Transfer-Level (RTL) design is generated directly from C/C++ code and imported into the HDL Designer tool for high-level integration. Seamless verification and software/hardware co-design are achieved. Catapult C provides architecture scheduling for both block mode and throughput mode [8] to generate efficient RTL for different resource/timing requirements. Configurable parallelism is achieved by assigning the number of Functional Units (FU) according to area/timing constraints. The best solution is the smallest design that meets the real-time requirements.
B. System-level Partitioning
With a timing and data dependency analysis, the top-level design blocks for the MIMO equalizer are shown in Fig. 1. The system-level pipelining is designed for better modularity. The overall equalizer receiver includes the tasks in the following procedure:
1) Compute the correlation coefficients and form the first block column of the circulant matrix C by adding the corner elements.
2) Take the element-wise FFT of the first block column $C_{rr}^{(1)}$ to obtain the element vectors $F_{n_1,n_2}$.

A correlation estimation block takes the multiple input samples for each chip to compute the correlation coefficients of the first column of Rrr. It is made circulant by adding the corner elements. The complete coefficients are then written to DPRAMs.

To reflect the correct timing, the correlation and channel estimation modules at the front-end work in a throughput mode on the streaming input samples. The FFT-inverse-IFFT modules in the dotted-line block form the post-processing of the tap solver. They are suitable to work in a block mode using dual-port RAM blocks to communicate the data. The MIMO FIR filtering also works in throughput mode on the buffered streaming input data.
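Forming the first block column of the circulant matrix from the correlation blocks can be sketched as follows. The layout [E[0], ..., E[L], 0, ..., 0, E[L]^H, ..., E[1]^H] is an assumed convention for the corner elements; the function names, FFT size and coefficient values are hypothetical.

```python
def herm(block):
    """Conjugate transpose of a square block given as nested lists."""
    n = len(block)
    return [[block[c][r].conjugate() for c in range(n)] for r in range(n)]


def circulant_first_column(E, fft_size):
    """Build the first block column from E = [E[0], E[1], ..., E[L]].

    The leading entries are E[0..L], the middle is zero padding, and the
    tail holds the corner elements E[L]^H, ..., E[1]^H.
    """
    n = len(E[0])
    L = len(E) - 1
    if fft_size < 2 * L + 1:
        raise ValueError("fft_size must be at least 2L + 1")
    col = [[[0j] * n for _ in range(n)] for _ in range(fft_size)]
    for l in range(L + 1):
        col[l] = E[l]
    for l in range(1, L + 1):
        col[fft_size - l] = herm(E[l])
    return col


# Hypothetical correlation blocks for N = 2 and L = 1.
E0 = [[4 + 0j, 1 + 0.5j], [1 - 0.5j, 3 + 0j]]
E1 = [[0.5 + 0j, 0.2j], [0.1 + 0j, 0.3 + 0j]]
col = circulant_first_column([E0, E1], fft_size=8)
symmetric = all(col[k] == herm(col[-k % 8]) for k in range(8))
```

The symmetry check at the end expresses that block k equals the conjugate transpose of block -k, which is what makes the element-wise FFT outputs Hermitian.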
C. Hermitian Optimization and Reduced-state FFT
From the circulant feature of the correlation matrix, we can reduce the complexity of the FFT computation with the following Lemmas for the MIMO receiver.

Fig. 2. Reduced-state FFT butterfly tree.

Lemma 1 (Hermitian): Fi,j = conj(Fj,i). Thus the computation of Fj,i is redundant for j < i.
Lemma 2 (Hermitian Complexity):
Because the imaginary part of Fi,i equals 0, the computation of Fi,i can be reduced to only L/LF of the full DFT. The computations related to Fi,i also reduce to real computations, saving 50% of the complexity.
Because the FFT algorithm applies the features of the rotation coefficients, the application of the Hermitian feature in Lemma 2 is not straightforward. Thus, we derived the hardware-oriented optimization for the Reduced-State FFT (RS-FFT) with pruning operations based on the standard Decimation-In-Time (DIT) FFT algorithm. We differentiate the different types of butterfly units based on the feature of the output coefficients and prune unnecessary computation branches in the butterfly tree. This is shown in Fig. 2.
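The two Hermitian properties can be verified numerically with a small sketch. It uses a plain DFT rather than the pruned DIT butterfly tree of the RS-FFT, so it only illustrates which outputs are redundant, not how the hardware prunes them. The sequences and the helper `diag_dft_reduced` are hypothetical.

```python
import cmath


def dft(x):
    """Plain O(n^2) DFT used as a reference."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
            for k in range(n)]


def diag_dft_reduced(half, n):
    """Real-valued DFT of a diagonal element sequence.

    `half` holds only c[0..L]; the mirrored taps c[n-l] = conj(c[l]) and the
    zero taps are never touched: F[k] = c[0] + 2 * Re(sum_l c[l] W^(k l)).
    """
    out = []
    for k in range(n):
        acc = half[0].real
        for l in range(1, len(half)):
            acc += 2 * (half[l] * cmath.exp(-2j * cmath.pi * k * l / n)).real
        out.append(acc)
    return out


n = 8
# Diagonal sequence: c00[l] = conj(c00[-l mod n]), with zeros in the middle.
c00 = [4 + 0j, 0.5 + 0.2j, 0.1 - 0.3j, 0j, 0j, 0j, 0.1 + 0.3j, 0.5 - 0.2j]
# Off-diagonal pair: c01[l] = conj(c10[-l mod n]).
c10 = [1 - 0.5j, 0.3 + 0.1j, 0.2j, 0j, 0j, 0j, -0.1 + 0.4j, 0.2 - 0.2j]
c01 = [c10[-l % n].conjugate() for l in range(n)]

F00, F10, F01 = dft(c00), dft(c10), dft(c01)
lemma1_err = max(abs(F01[k] - F10[k].conjugate()) for k in range(n))
imag_err = max(abs(v.imag) for v in F00)
reduced = diag_dft_reduced(c00[:3], n)
reduced_err = max(abs(F00[k].real - reduced[k]) for k in range(n))
```

Here `lemma1_err` checks that one off-diagonal FFT is the conjugate of the other, and `reduced_err` checks that the diagonal FFT can be obtained with real outputs from only the first L + 1 taps.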
IV. HERMITIAN MATRIX INVERSE ARCHITECTURES
In this section, we utilize the Hermitian feature and focus on the optimization of the matrix inverse and multiplication module that follows the element-wise FFT modules in the block tap solver. Although the FFT-based tap solver avoids the direct matrix inverse of the original correlation matrix with the dimension of (NF × NF), the inverse of the block-diagonal matrix F is inevitable. For a MIMO receiver with a high receive dimension, the matrix inverse and multiplication in F^{-1} ĥm is not trivial. Because of the block-diagonal feature of the F matrix, the inverse of F can be divided into the inverse of LF sub-matrices of size (N × N) as in
Gaussian elimination or Cholesky decomposition can be applied to invert these matrices with O(N^3) complex operations. However, this requires arithmetic square root operations, which are preferably avoided in hardware due to their complexity. Considering the fact that it is unlikely to have more than four Rx antennas in a mobile terminal, we consider the two special cases individually, i.e., 2 and 4 Rx antennas, with the Hermitian optimization. We propose complexity reduction schemes and efficient architectures suitable for VLSI implementation based on the exploration of block partitioning. The commonality of the partitioned block matrix inverse is extracted to design generic RTL modules for reusable modularity. We then build the 4 × 4 receiver by reusing the 2 × 2 block partitioning.

Fig. 3. The data path of merged 2 × 2 inverse and multiplication.
A. Dual-antenna MIMO Receivers
From equation (7), a straightforward partitioning is the matrix inversion of F, followed by the matrix multiplication F^{-1}(D ⊗ I)ĥm and the dimension-wise FFT of the channel coefficients. In this partitioning, we would first compute the inverse of every sub-block matrix in F and then carry out a matrix multiplication. However, this partitioning involves two separate loop structures. Since the two steps have the same loop structure, it is more desirable to merge them and reduce the overhead. If the inverse of a 2 × 2 sub-matrix is given by
with the k-th element of the matrix W as
With the Hermitian features of F00 and F11, we can reduce the number of real operations. In the equation, "a · b" means a "real × real" multiplication, "a • b" means a "real × complex" multiplication, and "a * b" means a "complex × complex" multiplication. The complex division is replaced by a real division. From this, we derived the simplified data path with the Hermitian optimization as in Fig. 3. In this figure, f00(k) and f11(k) are real numbers. A plain multiplier denotes a real multiplication, a multiplier with a circle denotes a "real × complex" multiplication, and a multiplier with a rectangle denotes a "complex × complex" multiplication. The data path is significantly simplified, which facilitates the scaling in the fixed-point implementation and increases the numerical stability. Notice that the storage for the interface from the element-wise FFTs is also reduced. We save four distributed DPRAMs for the following real and imaginary parts: ℑ(f00), ℑ(f11), ℜ(f10), ℑ(f10).
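A minimal sketch of the merged 2 × 2 inverse and multiplication is given below. It assumes the sub-matrix has the Hermitian form [[f00, conj(f10)], [f10, f11]] with real f00 and f11, and applies the standard 2 × 2 adjugate formula; it is not a transcription of the data path in Fig. 3, and the function name and values are hypothetical.

```python
def merged_inverse_multiply_2x2(f00, f11, f10, h0, h1):
    """Return w = F^{-1} h for the Hermitian F = [[f00, conj(f10)], [f10, f11]].

    f00 and f11 are real, so the determinant is real and the only division
    is a real division.
    """
    det = f00 * f11 - (f10.real ** 2 + f10.imag ** 2)  # real x real only
    w0 = (f11 * h0 - f10.conjugate() * h1) / det  # real x complex, complex x complex
    w1 = (f00 * h1 - f10 * h0) / det
    return w0, w1


# Hypothetical values for one frequency bin k.
f00, f11, f10 = 4.0, 3.0, 1 - 0.5j
h0, h1 = 1 + 2j, -0.5 + 1j
w0, w1 = merged_inverse_multiply_2x2(f00, f11, f10, h0, h1)

# Check by multiplying back with F.
back0 = f00 * w0 + f10.conjugate() * w1
back1 = f10 * w0 + f11 * w1
```

The inverse is never stored as a matrix: the adjugate entries are consumed directly in the products with h0 and h1, which mirrors the motivation for merging the two loop structures.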
B. Receiver with 4 Rx Antennas
This includes the 1 × 4, 2 × 4 and 4 × 4 SIMO and MIMO scenarios. Note that receiver diversity by over-sampling has the same mathematical form, so this may also be the case of two receive antennas with an over-sampling factor of 2.

Fig. 4. The data dependency path of the partitioned 4 × 4 matrix inverse.

The principal operation of interest is the inverse of the 4 × 4 matrices, so it is necessary to determine which range of possible matrix architectures is most suitable for this application. In addition to minimizing the circuit area used, the design needs to work within a short time budget. We need to derive an efficient computing architecture for this part to save area and time resources. We partition the 4 × 4 sub-matrices in F[i] and their inverses into four 2 × 2 block sub-matrices as
Then we apply a partitioned inverse of the 4 × 4 matrix built from the inverses of 2 × 2 sub-matrices. It can be shown that the sub-blocks are given by the following equations:
Without looking into the data dependency, a straightforward computation will have 8 complex matrix multiplications, 2 complex matrix inverses and 2 complex matrix subtractions, all of size 2 × 2. But this is not very efficient. By examining the data dependency, we find some duplicate operations in the data path. We now utilize the Hermitian feature of the F matrix to derive a more parallel and optimized computing architecture. Since the inverse of a Hermitian matrix is Hermitian, this leads to a data path in which the duplicate computation blocks that have the Hermitian relationship are removed.
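The partitioned inverse can be illustrated with the standard Schur-complement block formulas, which may differ in arrangement from the equations in the paper. The sketch assumes F = [[A, B^H], [B, D]] with Hermitian A and D, and computes only three of the four output blocks because the fourth follows from the Hermitian property. All names and values are hypothetical.

```python
def mat_mul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)] for r in range(2)]


def mat_sub(a, b):
    return [[a[r][c] - b[r][c] for c in range(2)] for r in range(2)]


def herm(a):
    return [[a[c][r].conjugate() for c in range(2)] for r in range(2)]


def inv2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def partitioned_hermitian_inverse(A, B, D):
    """Invert F = [[A, B^H], [B, D]]; returns (G11, G21, G22), with G12 = G21^H."""
    Ai = inv2(A)
    BAi = mat_mul(B, Ai)                      # B A^{-1}
    S = mat_sub(D, mat_mul(BAi, herm(B)))     # Schur complement D - B A^{-1} B^H
    G22 = inv2(S)
    G21 = [[-v for v in row] for row in mat_mul(G22, BAi)]
    G11 = mat_sub(Ai, mat_mul(herm(BAi), G21))  # A^{-1} + A^{-1} B^H S^{-1} B A^{-1}
    return G11, G21, G22


def assemble(p11, p12, p21, p22):
    """Stack four 2 x 2 blocks into a 4 x 4 nested list."""
    return [p11[0] + p12[0], p11[1] + p12[1], p21[0] + p22[0], p21[1] + p22[1]]


# Hypothetical Hermitian 4 x 4 matrix in block form.
A = [[4 + 0j, 1 + 0.5j], [1 - 0.5j, 3 + 0j]]
B = [[0.5 + 0j, 0.2j], [0.1 - 0.1j, 0.3 + 0j]]
D = [[5 + 0j, -0.4j], [0.4j, 4 + 0j]]
G11, G21, G22 = partitioned_hermitian_inverse(A, B, D)

F = assemble(A, herm(B), B, D)
G = assemble(G11, herm(G21), G21, G22)
err = max(abs(sum(F[r][k] * G[k][c] for k in range(4)) - (1 if r == c else 0))
          for r in range(4) for c in range(4))
```

Reusing the product B A^{-1} and skipping the upper-right block is one simple way to see where duplicate operations in the straightforward computation can be removed.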
However, the straightforward treatment still does not lead to the most efficient computing architecture. The data path is still constructed with a very long dependency path. To fully extract the commonality and regulate the design blocks in VLSI, we define the following special operators on the 2 × 2 matrices for the different type of complex number computations. These special operators will be mapped to VLSI Processing Units (PU) to deal with the special features of the Hermitian matrix. High-level modularity is achieved by extracting the commonality among the data path. 
Definition 4 (Diagonal Transform): Given the 4 × 4 Hermitian matrix A, which is divided into four sub-blocks as
the Diagonal Transform (DT) of A is defined as,
(13)
With these definitions, we regulate the inverse of the 4 × 4 Hermitian matrix F = F^H into simplified operations on 2 × 2 matrices. After some manipulation, the partitioned sub-block computation equations can be mapped to the following procedure using the defined operators.
Finally, this leads to the much simplified hardware mapping using the generic Processing Units in Fig. 5. From the figure, it is clear that the overall computation complexity is 2 HInv operations, 2 DTs and 1 extra CHM block. Because the sign inverter and the Hermitian formatter [ ]^H consume no hardware resources at all, the computation complexity is determined by the three generic blocks. The data path of the computation shows the timing relationship between the different design modules.
C. Parallel Architecture Modules
This regulated block diagram facilitates the design of efficient parallel VLSI modules. To extract the commonality and reduce the redundancy, we need to explore the timing relationship of the basic computations involved in the generic operations. Because the operation M is also embedded in the T transform, we design the interface in a way that the duplicate computations are removed and the efficient computing architecture is reused. The grouping of computations and the smart usage of interim registers eliminate the redundancy and give a simple and generic interface to the design modules. Finally, the simplified parallel M(A, B) RTL module can be designed as
values do not need to be complex multiplications. We also only need to compute d21 and d22 to get the G elements. The redundant computations in {tmp4, tmp8, d11, d12} are eliminated from the M(A, B) operation. Built from the simplified M(A, B) module, the data path RTL module of the transform T(A11, A21, A22) of the 4 × 4 Hermitian matrix is given by Fig. 7, with the simplified Functional Components {pPow(a, b), (a, b)} as defined. The output ports of T(A11, A21, A22) include the independent elements {t11, t21, t22}.
V. EXPERIMENTAL FPGA IMPLEMENTATION
The performance is evaluated in an HSDPA simulation chain for different MIMO antenna configurations. Readers are referred to [7] for the algorithmic Bit-Error-Rate performance. Based on the above algorithmic and architectural optimizations, we have designed the VLSI architecture and prototyped the RTL design on the Nallatech FPGA platform. The chip rate is in accordance with the WCDMA chip rate of 3.84 MHz. We applied a clock rate of 38.4 MHz for the Xilinx Virtex-II V6000 FPGA. The correlation window is set to 10 chips for all 4 receive antennas. The FFT size is 32 points. In the following, we give the specification of the major design blocks: the throughput-mode correlation calculation, the multiple FFT/IFFT modules and the LF inverses of the 4 × 4 sub-matrices. We utilize a Catapult C based design methodology to study many area/time tradeoffs of the VLSI architecture design. For example, for the 16-FFT/IFFT modules, at one extreme, we can design a fully parallel and pipelined architecture with parallel butterfly units and complex multipliers laid out in the pattern of the butterfly tree. Since it is economically desirable to reduce the area, this is not practical. We design the merged multiple-input multiple-output FFT modules to utilize the commonality in control logic and phase coefficient loading. Overall, we need only 4 multipliers to achieve an area/time efficient design for this module. For the LF inverses of the 4 × 4 Hermitian matrices, the latency is 38 µs with 6 multipliers. This benefits from the aforementioned architectural optimization. Moreover, to achieve power saving, scalability using different numbers of functional units is also explored extensively for different types of architectures, as shown in Fig. 8 for the MIMO-FFT modules with different latencies. Although such detail is beyond the scope of this paper due to the limited space, we demonstrate the architectural scalability using the Catapult C scheduling methodology.
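The timing figures quoted above can be put into clock-cycle terms with a short calculation. The sketch uses only the stated chip rate, clock rate and 38 µs latency; the variable names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

CHIP_RATE_HZ = Fraction(3_840_000)     # WCDMA chip rate, 3.84 MHz
CLOCK_HZ = Fraction(38_400_000)        # FPGA clock rate, 38.4 MHz
LATENCY_S = Fraction(38, 1_000_000)    # latency of the LF 4 x 4 inverses, 38 us

# Clock cycles available per received chip for throughput-mode blocks.
cycles_per_chip = CLOCK_HZ / CHIP_RATE_HZ

# The block-mode inverse latency expressed in clock cycles and in chip periods.
latency_cycles = LATENCY_S * CLOCK_HZ
latency_chips = LATENCY_S * CHIP_RATE_HZ

print(cycles_per_chip, float(latency_cycles), float(latency_chips))
```

Under these numbers the throughput-mode blocks have 10 clock cycles per chip, and the block-mode inverse stage occupies about 1459 cycles, or roughly 146 chip periods.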
VI. CONCLUSION
In this paper, we propose a MIMO receiver architecture for a circulant LMMSE chip equalizer for CDMA downlink systems. Hermitian optimization and a partitioned sub-matrix inverse are proposed to construct a parallel architecture for the 4 × 4 MIMO receiver by exploiting the commonality. The much simplified parallel and pipelined RTL is designed using the Catapult C HLS flow and verified in an FPGA prototyping platform.
