In this paper, we present an efficient circulant-approximation-based MIMO equalizer architecture for the CDMA downlink. This reduces the Direct-Matrix-Inverse (DMI) of size (NF × NF) with O((NF)^3) complexity to some FFT operations with O(NF log_2(F)) complexity and the inverse of some (N × N) sub-matrices. We then propose parallel and pipelined VLSI architectures with Hermitian optimization and reduced-state FFT for further complexity optimization. Generic VLSI architectures are derived for the (4 × 4) high-order receiver from partitioned (2 × 2) sub-matrices. This leads to a more parallel VLSI design with a further 3× complexity reduction. A comparative study with both the Conjugate-Gradient and DMI algorithms shows a very promising performance/complexity tradeoff. The VLSI design space in terms of area/time efficiency is explored extensively for layered parallelism and pipelining with a Catapult C High-Level-Synthesis methodology.
A comparative study with the Conjugate-Gradient and DMI equalizers demonstrates the promising performance/complexity tradeoff of the proposed equalizer.
As far as real-time implementation is concerned, a System-on-Chip (SoC) architecture offers more parallelism, a more compact size and lower power consumption than general-purpose DSP processors. However, research on SoC architectures for the MIMO HSDPA mobile receiver remains a relatively new and hot topic. Recently, Nokia successfully demonstrated a single-antenna HSDPA real-time system at the CTIA'03 wireless trade show [16] [17].
Although MIMO VLSI implementations have been reported for Lucent's BLAST ASIC chip [18] and some MIMO detection algorithms [19] , the VLSI architecture design of MIMO CDMA equalizers remains a new research topic. To support the MIMO CDMA downlink in a multipath fading channel, it is necessary to explore the efficient VLSI design architecture [14] for the complex equalizer.
In the second part, we focus on the VLSI-oriented optimizations of the architecture complexity. Hermitian optimization is proposed by utilizing the structures of the correlation coefficients and the FFT algorithm. A reduced-state FFT module is proposed to avoid redundant computation of the symmetric coefficients and the zero coefficients. These optimizations reduce both the number and the complexity of the conventional FFT modules. On the other hand, the matrix inverse of some smaller sub-matrices of size (N × N) is inevitable for the MIMO receiver although the (NF × NF) inverse is avoided. For a high-order MIMO receiver, the complexity still increases dramatically with the number of antennas. Therefore, the Hermitian feature is applied to reduce the sub-matrix inverse complexity. Of particular interest is the non-trivial (4 × 4) MIMO configuration. We apply a divide-and-conquer method to partition the (4 × 4) sub-matrices into four (2 × 2) sub-matrices. The (4 × 4) matrix inverse is then dramatically simplified by exploiting the commonality in a partitioned matrix inverse lemma. Generic VLSI architectures are derived from the special design blocks to eliminate the redundancies in the complex operations. The regulated model facilitates the design of efficient parallel VLSI modules such as "Complex-Hermitian-Multiplication", "Hermitian Inverse" and "Diagonal Transform". This leads to efficient architectures with a further 3× complexity reduction and a more parallel and pipelined schematic.
In addition to minimizing the circuit area used, the design needs to work within a time budget. There are many area/time tradeoffs in the VLSI architectures. Extensive architecture tradeoff study provides critical insights into implementation issues that may arise during the product development process. However, this type of SoC design space exploration is extremely time consuming because the standard trial-and-optimize approaches today are usually tied to hand-coded VHDL/Verilog-based methodology [23] [24] . In this paper, we present a Catapult C based [16] High-Level-Synthesis (HLS) methodology which integrates several key technologies to explore the VLSI architecture tradeoffs extensively.
Extensive design space exploration is enabled by allocating different architecture/resource constraints in a Catapult C architecture scheduler [16] . Synthesizable Register Transfer Level (RTL) design is generated from an algorithmic C/C++ fixed-point design, integrated in other downstream flows and validated in a Xilinx FPGA prototyping platform.
The rest of the paper is organized as follows. Section II gives the MIMO-CDMA downlink system model. The FFT-based circulant chip equalizer is presented in section III. Section IV presents the system level partitioning and the VLSI level complexity optimization. The comparative performance and complexity analysis is presented in section V. Finally, section VI presents the HLS-based design space exploration and an experimental implementation on FPGA.
System Model for MIMO-CDMA Downlink
The system model of the MIMO multi-code CDMA downlink with M Tx antennas and N Rx antennas is described in Fig. 1. In a multi-code CDMA downlink, multiple spreading codes are assigned to a single user to achieve a high data rate. By using spatial multiplexing, the high-data-rate symbols are demultiplexed into KM lower-rate substreams, where K is the number of spreading codes for data transmission. The substreams are divided into M groups, where each substream in the group is spread with a spreading code of spreading gain G.
Figure 1: The system model of the MIMO multi-code CDMA downlink.
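The demultiplexing and spreading step described above can be sketched as follows. This is a minimal illustration only: the use of Walsh-Hadamard codes, the ordering of substreams into groups, and the names `walsh_codes`, `spread_multicode` and `despread` are assumptions, and scrambling and pilot insertion are left out.

```python
def walsh_codes(gain):
    """Return `gain` orthogonal +/-1 Walsh-Hadamard codes (gain must be a power of two)."""
    codes = [[1]]
    while len(codes) < gain:
        codes = [row + row for row in codes] + [row + [-c for c in row] for row in codes]
    return codes


def spread_multicode(symbols, num_tx, num_codes, gain):
    """Demultiplex K*M symbols into M groups of K substreams and spread each group.

    Returns one chip sequence of length `gain` per Tx antenna, plus the codes used.
    """
    assert len(symbols) == num_tx * num_codes
    codes = walsh_codes(gain)[:num_codes]
    chips = []
    for m in range(num_tx):
        # Assumed grouping: consecutive runs of K symbols go to Tx antenna m.
        group = symbols[m * num_codes:(m + 1) * num_codes]
        # Sum the K spread substreams chip by chip.
        chips.append([sum(s * c[g] for s, c in zip(group, codes)) for g in range(gain)])
    return chips, codes


def despread(chips_m, code):
    """Correlate one antenna's chips with one code to recover its symbol (ideal channel)."""
    return sum(x * c for x, c in zip(chips_m, code)) / len(code)
```

With M = 2, K = 3 and G = 4, six QPSK symbols produce two chip sequences of length four, and correlating with a code returns the symbol that was spread with it, because the codes are orthogonal.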
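By packing the received chips from all the receive antennas in a vector r(i) = [r_1(i), …, r_N(i)]^T and collecting these vectors over the observation window, we form the received signal vector r_A(i).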
Here F is the observation window length corresponding to the channel length. In the vector form, the received signal can be given by
where H_m is a block Toeplitz matrix constructed from the channel coefficients as shown in [15]. The multiple receive antennas' channel vector is defined as
The transmitted chip vector for the m-th transmit antenna is given by
3 LMMSE Tap Solver with Circulant Approximation 
It is well known that the LMMSE chip equalizer coefficients are given by minimizing the MSE between the transmitted and recovered chip samples as
where σ_d^2(i) is the transmitted chip power. R̂_rr(i) and ĥ_m(i) are the covariance estimate and the channel estimate, respectively. Here the covariance matrix is estimated by the time average under an ergodicity assumption as
where N_B is the length of the time average. The channel coefficients are estimated as
using the pilot symbols. In the HSDPA standard, about 10% of the total transmit power is dedicated to the Common Pilot Channel (CPICH). This will provide accurate channel estimation. By assuming that the channel is stationary over the observation window length, we can have a block-based operation by omitting the chip index
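For orientation, the quantities named in this section fit the textbook LMMSE form sketched below. This is a reconstruction from the surrounding definitions rather than a quotation of the original equations; the stacked received vector r_A(i), the summation index j and the summation limits are assumed notation.

```latex
\hat{\mathbf{w}}_m(i) = \sigma_d^2(i)\, \hat{\mathbf{R}}_{rr}^{-1}(i)\, \hat{\mathbf{h}}_m(i),
\qquad
\hat{\mathbf{R}}_{rr}(i) = \frac{1}{N_B} \sum_{j=0}^{N_B-1} \mathbf{r}_A(i-j)\, \mathbf{r}_A^H(i-j)
```

Under the block-stationarity assumption the chip index drops out, so that each block requires one solve of the form ŵ_m = σ_d^2 R̂_rr^{-1} ĥ_m per transmit antenna.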
FFT-based Circulant Approximation Tap Solver
Using the stationarity of the channel and the convolution property, it is easy to show that the covariance matrix is a banded block Toeplitz matrix as
is an (N × N) block matrix with the cross-antenna covariance coefficients. The
where L_F is determined by the channel length L.
In an outdoor environment, L_F could be up to 32. The direct inverse of the matrix is very expensive for hardware implementation.
To reduce the computation complexity, an FFT-based fast algorithm is presented in this section. It is known that a circulant matrix S can be diagonalized by the FFT operation
where D is the FFT phase coefficient matrix and Λ is a diagonal matrix whose diagonal elements are the FFT result of the first column of the circulant matrix S.
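The diagonalization property can be checked numerically with a small sketch. The code below uses a plain O(n^2) DFT from the standard library for clarity; the function names and the example values are illustrative assumptions, and a real design would use an FFT.

```python
import cmath


def dft(x):
    """Plain DFT of a sequence (O(n^2), for illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def circulant(first_col):
    """Build the circulant matrix whose first column is `first_col`."""
    n = len(first_col)
    return [[first_col[(r - c) % n] for c in range(n)] for r in range(n)]


def circulant_solve(first_col, b):
    """Solve S x = b for circulant S with three DFTs instead of a matrix inverse."""
    n = len(b)
    lam = dft(first_col)                      # diagonal of Lambda: DFT of the first column
    y = dft(b)                                # transform the right-hand side
    z = [yk / lk for yk, lk in zip(y, lam)]   # divide by the eigenvalues
    # Inverse DFT computed through conjugation of a forward DFT.
    return [v.conjugate() / n for v in dft([zk.conjugate() for zk in z])]
```

For the first column [4, 1, 0, 1], the DFT gives the eigenvalues 6, 4, 2, 4, and `circulant_solve` returns a vector that satisfies S x = b up to rounding error.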
This known lemma is applied to simplify the MIMO equalizer computation dramatically. It is shown that the covariance matrix R rr can be approximated by a block-circulant matrix after we add two corner matrices as
Using the extension of the diagonalization lemma and the features of Kronecker product, the block-circulant matrix can be decomposed as
where
is the phase factor coefficient for the DFT computation. ⊗ denotes the Kronecker product. By denoting
it can be shown that the final MIMO equalizer taps are computed as the following equation 
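The block-circulant tap solver can be sketched for N = 2 as follows: one DFT per matrix position, one small (2 × 2) solve per frequency bin, and one inverse DFT per antenna. This is an illustration under assumptions: the function names, the data layout (a list of L blocks forming the first block column) and the plain O(L^2) DFT are not taken from the original design.

```python
import cmath


def dft(x, inverse=False):
    """Plain DFT or inverse DFT of a sequence (O(n^2), for illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def solve_2x2(a, b):
    """Solve a (2 x 2) complex system with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def block_circulant_solve(blocks, rhs):
    """Solve C w = rhs, where blocks[l] is block l of the first block column of C."""
    size = len(blocks)
    # Element-wise FFT: one length-L transform per (n1, n2) matrix position.
    f = [[dft([blocks[l][r][c] for l in range(size)]) for c in range(2)] for r in range(2)]
    h = [dft([rhs[l][r] for l in range(size)]) for r in range(2)]
    # One small (2 x 2) solve per frequency bin replaces the large inverse.
    w_freq = [solve_2x2([[f[0][0][k], f[0][1][k]], [f[1][0][k], f[1][1][k]]],
                        [h[0][k], h[1][k]]) for k in range(size)]
    # Inverse transform per antenna.
    w = [dft([w_freq[k][r] for k in range(size)], inverse=True) for r in range(2)]
    return [[w[0][l], w[1][l]] for l in range(size)]


def block_circulant_apply(blocks, x):
    """Multiply the block-circulant matrix by x directly (used only to check the solver)."""
    size = len(blocks)
    out = []
    for r in range(size):
        acc = [0j, 0j]
        for c in range(size):
            e = blocks[(r - c) % size]
            acc[0] += e[0][0] * x[c][0] + e[0][1] * x[c][1]
            acc[1] += e[1][0] * x[c][0] + e[1][1] * x[c][1]
        out.append(acc)
    return out
```

Applying `block_circulant_apply` to the output of `block_circulant_solve` reproduces the right-hand side, which confirms that no (2L × 2L) inverse is needed.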
System-on-Chip (SoC) Architecture Partitioning
To achieve real-time implementation, either DSP processors or VLSI architectures could be applied. For example, a multiple-processor architecture using TI's DSP processors has been reported in [8] for the 3G base station implementation. However, the requirement for low power consumption and compact size makes it difficult to use multiple DSPs in a mobile handset to achieve the real-time processing power for the chip-level physical layer design, especially for MIMO systems. SoC architecture is a major revolution for integrated circuits due to the unprecedented levels of integration and its many advantages in power consumption and size. However, the straightforward implementation of the proposed equalizer has many redundancies in computation. Many optimizations are needed to make it more suitable for real-time implementation. We emphasize the interaction between architecture, system partitioning and pipelining in this section with these objectives: (1) propose VLSI-oriented optimizations to further reduce the computation complexity; (2) implement the equalizer with the minimum hardware resources that meet the real-time requirement; (3) obtain an efficient architecture with optimal parallelism and pipelining for the critical computation parts. To explore the efficient architectures, we elaborate the tasks in the following procedure:
Each element is an (N × N ) sub block matrix.
2. Take the element-wise FFT of C (1) rr , where the element vectors 
4. Compute the inverse of the block diagonal matrix F.
With a timing and data dependency analysis, the top-level block diagram for the MIMO equalizer is shown in Fig. 2. The system-level pipeline is designed for better modularity. In the front-end, a correlation estimation block takes the multiple input samples for each chip to compute the covariance coefficients of the first column of R_rr. It is made circulant by adding the corner matrices. The complete coefficients are written to DPRAMs and the (N × N) element-wise FFT module computes
VLSI Level Complexity Optimization

Hermitian Optimization
In this section, more emphasis is given to the VLSI-oriented implementation aspects.
Lemma 1 (Hermitian): F_{n1,n2} = conj(F_{n2,n1}), where the vector is formed from the covariance element vector between antennas n_1 and n_2. F_{n2,n1} is redundant for n_2 < n_1.
Proof: For the Rx antennas n_1, n_2, it can be shown that the elements in the circulant column have the following relations, where N_B is the covariance time average window length:
Using the features of the FFT, it can be proven that the element-wise FFT results have the relation that
Thus, instead of having N × N complex FFT computations, we only need to compute the N(N + 1)/2 element-wise FFTs with n_1 ≤ n_2.
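The Hermitian relation can be illustrated numerically. The sketch below assumes the symmetry that follows from a Hermitian covariance matrix, namely that the circulant column for the antenna pair (n_2, n_1) is the index-reversed conjugate of the column for (n_1, n_2); the exact relation in the original proof is not reproduced here, and the function names and sample values are hypothetical.

```python
import cmath


def dft(x):
    """Plain DFT of a sequence (O(n^2), for illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def hermitian_partner(column):
    """Column for antenna pair (n2, n1), assumed to be conj(column[-l mod L])."""
    n = len(column)
    return [column[(-l) % n].conjugate() for l in range(n)]


def required_fft_count(num_rx):
    """Number of element-wise FFTs left when only pairs with n1 <= n2 are computed."""
    return num_rx * (num_rx + 1) // 2
```

Under this assumption the DFT of the partner column is the conjugate of the original DFT, so a 4-antenna receiver needs 10 element-wise FFTs instead of 16. A column that equals its own partner, as for n_1 = n_2, has a purely real DFT.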
by defining the input sequence to the FFT module as
we only need to compute the real-part FFT of x(i) to get f_mm(k). From the butterfly decomposition, we have the recursion for the real-part FFT computation as
From the recursion, it can be shown that we can prune the redundant computations by replacing the complex multiplication in the butterfly units for some portion of the FFT BFU tree. Before considering the many zeros in the input coefficients, the total number of
, we can further truncate the computations related to the zero values.
After pruning all the unnecessary BFU branches, the FBFUs and PBFUs only take effect from stage 3. The number of FBFUs is reduced to (L_F/2) log_2 L_F − 2L_F + 6. This also reduces the number of memory accesses and register files for stage 1 and stage 2 as well as in the partial BFUs. The final data flow is shown as the BFU tree in Fig. 3. In the figure, only the shaded portion has full BFUs.
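As a worked example of the FBFU count, the sketch below evaluates the expression above next to the conventional radix-2 butterfly count (L_F/2) log_2 L_F. The baseline is the standard textbook figure, taken here as an assumption about what the pruned design is compared against, and the function names are illustrative.

```python
def full_bfu_count(fft_len):
    """Butterfly units in a conventional radix-2 FFT: (L/2) * log2(L)."""
    stages = fft_len.bit_length() - 1   # exact log2 for a power of two
    return (fft_len // 2) * stages


def pruned_fbfu_count(fft_len):
    """Full BFUs left after pruning, per the expression (L/2) log2(L) - 2L + 6."""
    return full_bfu_count(fft_len) - 2 * fft_len + 6
```

For L_F = 32 this gives 22 full BFUs against 80 butterflies in the conventional FFT. The pruned design still contains partial BFUs, so the ratio of these two numbers is not the overall multiplication saving.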
the Reduced-State FFT and ZP means Zero-Pruning. Although the saving diminishes when the length of FFT increases to a very large number, the RS-FFT with ZP saves roughly 50% of the real multiplications because the FFT length within 64 points will suffice for most realistic equalizer applications.
Hermitian Matrix Inverse Architectures
In this section, we utilize the Hermitian feature and focus on the optimization of the sub-matrix inverse and multiplication module following the element-wise FFT modules in the tap solver. Although the FFT-based tap solver avoids the direct matrix inverse of the original covariance matrix with the dimension of (NF × NF), the inverse of the block diagonal matrix F is inevitable. For a MIMO receiver with a high receive dimension, the matrix inverse and multiplication in F^{-1} ĥ_m is not trivial. Because F is a block diagonal matrix, its inverse can be decomposed into the inverses of L_F sub-matrices of size (N × N) as in F^{-1} = diag(F[0]^{-1}, F[1]^{-1}, …, F[L_F − 1]^{-1}).
A traditional (N × N) matrix inverse using Gaussian elimination has a complexity of O(N^3) complex operations. Cholesky decomposition can be applied to facilitate the inverse of these matrices. However, this method requires square root operations, which are expensive for hardware implementation. Considering the fact that it is unlikely to have more than four Rx antennas in a mobile terminal, we consider the two special cases individually, i.e., 2 and 4 Rx antennas. We propose complexity reduction schemes and efficient architectures suitable for VLSI implementation based on the exploration of block partitioning. The commonality of the partitioned block matrix inverse is extracted to design generic RTL modules for reusable modularity. We then build the (4 × 4) receiver by reusing the (2 × 2) block partitioning.
Dual-antenna MIMO Receiver
From equation (8), a straightforward partitioning is the matrix inversion of F followed by the matrix multiplication with the dimension-wise FFT of the channel coefficients, as in F^{-1}(D ⊗ I)ĥ_m.
In this partitioning, we would first compute the inverse of the entire sub-block matrix in F and then carry out a matrix multiplication. However, this partitioning involves two separate loop structures. In the VLSI circuit design, this will introduce some overhead for memory access and finite state machine logic. Since the two steps have the same loop structure, it is more desirable to merge the two steps and reduce the overhead, as shown in the following. The inverse of a (2 × 2) submatrix is given by
$$\mathbf{F}[k]^{-1} = \frac{1}{f_{00}(k) f_{11}(k) - f_{01}(k) f_{10}(k)} \begin{bmatrix} f_{11}(k) & -f_{01}(k) \\ -f_{10}(k) & f_{00}(k) \end{bmatrix}.$$
Thus we can use a single merged loop to compute the final result of W instead of using separate loops. Moreover, with the Hermitian features of F_00 and F_11, we can reduce the number of real operations in the matrix inverse and multiplication module. It leads to a simplified equation for the k-th element of the matrix W as
where "a·b", "a•b" and "a * b" indicate a "real × real", "real × complex" and "complex × complex" multiplication, respectively. The complex division is replaced by a real division. From this, we derive the simplified data path with the Hermitian optimization as in Fig. 4. In this figure, f_00(k) and f_11(k) are real numbers. The single multiplier symbol denotes a real multiplication, the multiplier with a circle denotes a "real × complex" multiplication and the multiplier with a rectangle denotes a "complex × complex" multiplication. The simplified data path facilitates the scaling and thus increases the stability of the fixed-point implementation.
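The merged inverse-and-multiply step for one frequency bin can be sketched as below. It assumes a Hermitian (2 × 2) block with real diagonal entries f00, f11 and off-diagonal entries f01 and conj(f01); the function name and argument order are illustrative, and the exact operation grouping of Fig. 4 is not reproduced.

```python
def hermitian_2x2_solve(f00, f11, f01, h0, h1):
    """Return w = F^{-1} h for F = [[f00, f01], [conj(f01), f11]] with real f00, f11."""
    # The determinant of a Hermitian (2 x 2) matrix is real: f00*f11 - |f01|^2.
    det = f00 * f11 - (f01.real ** 2 + f01.imag ** 2)
    inv_det = 1.0 / det                                # a single real division
    # "real x complex" products for the diagonal, "complex x complex" for f01.
    w0 = inv_det * (f11 * h0 - f01 * h1)
    w1 = inv_det * (f00 * h1 - f01.conjugate() * h0)
    return w0, w1
```

Because the determinant is real, the complex division disappears and only one real reciprocal per bin is needed, which is the property the simplified data path relies on.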
Receiver with 4 Rx antennas
The principal operation of interest is the inverse of the (4 × 4) sub-matrices. To achieve a scalable design, we first partition the (4 × 4) sub-matrices in F[i] into four (2 × 2) block sub-matrices. We also partition the inverse of the (4 × 4) element matrix as
It can be shown that the subblocks are given by the following equations from the matrix inverse lemma [27] :
Without looking into the data dependency, a straightforward computation will have 8 complex matrix multiplications, 2 complex matrix inverses and 2 complex matrix subtractions, all of the size (2 × 2). By examining the data dependency, we will find some duplicate operations in the data path. For a general case before considering the Hermitian structure of the F[i] matrix, a sequential computation has the data dependency path given by Fig. 5 .
The raw complexity is given by 6 matrix multiplications, 2 inverses and 2 subtractions. From the data path flow, the critical path can be identified.
Figure 5: The data path of the partitioned 4 × 4 matrix inverse for each subcarrier.
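A sequential computation with 6 block multiplications, 2 block inverses and 2 block subtractions can be sketched with the standard Schur-complement form of the partitioned inverse. The block names B and C, the ordering of the steps and the helper functions are assumptions for illustration; the original sub-block equations are not reproduced here.

```python
def mul(a, b):
    """(2 x 2) complex matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def sub(a, b):
    """(2 x 2) matrix difference."""
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]


def neg(a):
    """Sign inversion of a (2 x 2) matrix."""
    return [[-a[i][j] for j in range(2)] for i in range(2)]


def herm(a):
    """Conjugate transpose of a (2 x 2) matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def inv(a):
    """Inverse of a (2 x 2) matrix through its adjugate."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def partitioned_inverse(b11, b12, b21, b22):
    """Invert [[b11, b12], [b21, b22]] using only (2 x 2) block operations."""
    t = inv(b11)                   # inverse 1
    x = mul(b21, t)                # multiplication 1
    s = sub(b22, mul(x, b12))      # multiplication 2, subtraction 1 (Schur complement)
    c22 = inv(s)                   # inverse 2
    c21 = neg(mul(c22, x))         # multiplication 3
    y = mul(t, b12)                # multiplication 4
    c12 = neg(mul(y, c22))         # multiplication 5
    c11 = sub(t, mul(c12, x))      # multiplication 6, subtraction 2
    return c11, c12, c21, c22


def assemble(a11, a12, a21, a22):
    """Join four (2 x 2) blocks into one (4 x 4) matrix."""
    return [a11[0] + a12[0], a11[1] + a12[1], a21[0] + a22[0], a21[1] + a22[1]]
```

For a Hermitian input, where b21 is the conjugate transpose of b12, the result satisfies c21 = herm(c12) and y = herm(x), which is the kind of redundancy the Hermitian optimization can remove.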
can be shown that 
Note that all the numbers are complex except {b_11, b_22} ∈ R.
where there are only real multiplications and divisions.
Definition 4 (Diagonal Transform) Given the (4 × 4) Hermitian matrix A which is partitioned into four subblocks as
A is defined as
With these definitions, we regulate the inverse of the (4 × 4) Hermitian matrix F = F H into simplified operations on (2 × 2) matrices. After some manipulation, the partitioned subblock computation equations can be mapped to the following procedure using the defined
The overall computation complexity is reduced to 2 HInv operations, 2 DTs and 1 extra CHM block. Because the sign inverter and the Hermitian formatter [ ]^H require no hardware resources at all, the computation complexity is determined by these three generic blocks.
The data path of the computation shows the timing relationship between different design modules. This regulated procedure facilitates the design of efficient parallel VLSI modules, whose details are given in the following.
Parallel Architecture Modules
Now we derive the efficient VLSI modules for the generic M and T operations. Because the operation M is also embedded in the T transform, we need to design the interface so that the computing architecture is reused efficiently. The grouping of computations and the smart usage of interim registers will eliminate the redundancy and give simple and generic interface to the design modules. For a single M (A, B) module, we define
To extract the commonality in the M and T operations, we have the following lemma for Hermitian matrices. We can further simplify the top-level RTL schematic by extracting the commonality of the M and T module designs as in Fig. 8 to eliminate the extra individual M module.
Thus, the results of C 11 , C 12 and C 21 are generated together from the second T module.
Compared with the design in Fig. 5 , the architecture demonstrates better parallelism and reduced redundancy. The data path is much better balanced and facilitates the pipelining in multiple subcarriers for high-speed design.
If we use a standard computing architecture for the partitioned (4 × 4) matrix inverse, we need 308 real multiplications before dependency optimization (DO). With a straightforward DO, the complexity is still 244 real multiplications. Traditionally, a complex multiplication is computed as (a + jb)(c + jd) = (ac − bd) + j(ad + bc). This has 4 Real Multiplications (RM) and 2 Real Additions (RA). By rearranging the computation order, we can reduce the number of real multiplications; the resulting complexity is summarized in Table 2. Note that the critical data path is also dramatically shortened with better modularity and pipelining.
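One well-known rearrangement of the complex product trades one real multiplication for three extra real additions. The sketch below shows it next to the direct form; whether this is the exact ordering used in the original design is an assumption, and the function names are illustrative.

```python
def cmul_4rm(a, b, c, d):
    """(a + jb)(c + jd) with 4 real multiplications and 2 real additions."""
    return a * c - b * d, a * d + b * c


def cmul_3rm(a, b, c, d):
    """Same product with 3 real multiplications and 5 real additions."""
    p1 = c * (a + b)
    p2 = a * (d - c)
    p3 = b * (c + d)
    # p1 - p3 = ac - bd and p1 + p2 = ad + bc.
    return p1 - p3, p1 + p2
```

Both functions return the pair (real part, imaginary part); for (3 + 4j)(5 − 2j) each gives (23, 14).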
Comparative Performance and Complexity Analysis
BER Performance
The performance is evaluated in an MIMO-HSDPA simulation chain for different antenna configurations. We compare the performance of four different schemes: the LMS adaptive algorithm, the CG algorithm, the FFT-based algorithm and the DMI using Cholesky decomposition. We simulated the Pedestrian-A and Pedestrian-B channels following the I-METRA channel model [21] , which are typical for high speed downlink application. The chip rate for the transmit signal is 3.84 Mcps, which is in compliance with the 3GPP standard. The channel state information is estimated from the CPICH at the receiver. 10% of the total transmit power is dedicated to the pilot training symbols.
We provide the simulation results for QPSK modulation with the antenna configuration in the form of (M × N). In the figures, L_h is the channel delay spread. Fig. 9 and Fig. 10 show the fully loaded system for Pedestrian-A and Pedestrian-B channels with the (2 × 2) configuration, while Fig. 11 shows a highly loaded system with 10 codes for the (2 × 2) Pedestrian-B channel. Fig. 12 shows the simulation results for Pedestrian-A with the (4 × 4) configuration.
It can be seen in Fig. 9 that the FFT-based algorithm overlaps very closely with both the DMI and the CG at 5 iterations. In the (2 × 2) case for the Pedestrian-B channel, both the CG and the FFT-based algorithm show a very small divergence from the DMI in the very high SNR range in Fig. 10. For a fully loaded system, CG with 5 iterations seems to be slightly better than the FFT-based algorithm. But in the case with 10 codes, the FFT-based algorithm outperforms the CG for both 3 iterations and 5 iterations. In the (4 × 4) case shown in Fig. 12, the FFT-based algorithm also outperforms the CG with 5 iterations. However, because a realistic system is unlikely to work in the very high SNR range, the small difference in the BER performance is negligible. In all cases, the DMI, CG and FFT-based algorithms significantly outperform the LMS adaptive algorithm. It should be pointed out that the performance of the LMMSE-based chip equalizer is limited for the fast fading channel because its block-based operation cannot track fast fading channel environments very well. To deal with this, a Kalman-filter-based equalizer has been proposed in another paper by one of the authors [22], at much higher complexity.
The discussion of the related architecture is out of the scope of this paper.
Complexity
Complexity is a very important consideration for real-time implementation. Although the complete equalizer system consists of the correlation/channel estimation, the tap solver and the FIR filtering, we focus on the complexity of the three tap solvers with similar performance, i.e., the DMI, the CG and the FFT-based algorithm. The other two parts are common to the algorithms presented here. Cholesky decomposition is assumed for the DMI. The complexity is compared in terms of the number of equivalent complex multiplications and additions.
For the DMI, the complexity is of the order O((N(F + 1))^3) for the inverse of the covariance matrix R_rr.
With the Hermitian optimization, the complexity
. For the FFT-based algorithm, we usually require L_F ≥ 2F + 1. The complexity is summarized in Table 3. For simplicity, we only list the most significant part of the equivalent number of complex multiplications. An example is given for the (4 × 4) case with F = 10, J = 5. The FFT length L_F = 32 will suffice for both Pedestrian-A and Pedestrian-B channels. In Fig. 13, we show the complexity trend for different J and different L_F vs. the channel length for a (4 × 4) system. Although the Conjugate Gradient algorithm has reduced complexity compared with the DMI, the complexity reduction in the FFT-based algorithm is much more significant.
6 VLSI Design Architecture Exploration
quickly. It has been well accepted as a powerful rapid prototyping platform for the SoC architectures in the literature [16] [20]. A detailed discussion on the tradeoffs using FPGA and DSP for real-time implementation is presented in [16].
However, this type of SoC design space exploration is very time consuming because the current standard trial-and-optimize approaches apply handcoded VHDL/Verilog or graphical schematic tools. In this section, we present a Catapult C [25] based HLS methodology to explore the VLSI architecture space extensively in terms of the area/time tradeoff. This is enabled with high-level architecture and resource constraints. Synthesizable RTL is generated from a fixed-point C/C++ level design and imported to the graphical tools for module binding. The proposed procedure for implementing an algorithm to the SoC hardware is shown in Fig. 14 . The number of FUs is assigned according to the time/area constraints.
Software resources such as registers and arrays are mapped to hardware components and required Finite State Machines (FSM) necessary for accessing these resources are generated.
In this way, we can study several architecture solutions efficiently. In the next step of the design flow, the generated RTL is imported in the HDL environment and integrated with other modules of the system, which are either another Catapult C design or a legacy IP core. Leonardo Spectrum is invoked for gate-level synthesis. Xilinx ISE place & route tool is used to generate gate-level bit-stream file. Raising the language level may lead to concerns on the architecture efficiency, which highly depends on the design tool's capability. To address these concerns, we have compared both the architecture area/time efficiency and the achieved productivity in [16] with the conventional design flow. In most cases, the manual tradeoff study of a complex design with hundreds of multipliers could be extremely time-consuming and difficult. However, we can almost achieve the most efficient design architecture for a given specification using the architecture scheduling in Catapult C, especially for the computation intensive algorithms. Compared with the conventional hand-code and schematic based design methodologies, the Catapult C based methodology demonstrates not only improved productivity, but also a capability to study the architecture tradeoffs extensively in a short design cycle.
Real-time VLSI Architecture Exploration
The complete equalizer includes two major steps: the computation of the equalizer coefficients ŵ and the actual FIR filtering using the updated equalizer taps as in ŵ^H r_A(i). The update of the equalizer coefficients is a block-based operation depending on the channel varying speed. The FIR filtering depends on the chip rate. Thus, we need to compute the L-tap convolution for each input chip from the N receive antennas for the FIR filtering within f_clk/f_chip cycles, where f_clk and f_chip are the system clock rate and chip rate, respectively.
The WCDMA chip rate is 3.84 Mcps. We applied a clock rate of 38.4 MHz for the Xilinx Virtex-II V6000-4 FPGA, which gives a time constraint of 10 cycles per input chip. For the tap solver, the experiment shows that 2 updates per slot are sufficient to provide acceptable performance for slow and medium fading channels. Since there are 1920 chips per slot, the latency requirement for each update is 250 µs.
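The timing budget follows from simple arithmetic on the figures quoted above, as the sketch below shows. The function names are illustrative, and the slot length of 1920 chips is taken as stated in the text.

```python
from fractions import Fraction


def cycles_per_chip(f_clk_hz, f_chip_hz):
    """Clock cycles available to process one input chip."""
    return f_clk_hz // f_chip_hz


def update_latency_us(chips_per_slot, updates_per_slot, f_chip_hz):
    """Latency budget in microseconds for one tap-solver update."""
    return Fraction(chips_per_slot, updates_per_slot) / f_chip_hz * 1_000_000
```

With a 38.4 MHz clock and a 3.84 Mcps chip rate there are 10 cycles per chip; with 1920 chips per slot, 2 updates per slot leave 250 µs per update and 4 updates per slot leave 125 µs.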
We schedule architectures in two basic modes according to the real-time behavior of the sub-system in Catapult C: the throughput mode or the block mode. Throughput mode assumes that there is a top-level main loop for each incoming sample, which is processed immediately in the computation period. The module processes each input sample periodically, so there is a strict limit on the processing time. Block mode processes once after a block of data is ready. Because the Finite State Machine (FSM) usually depends on complex logic and extensive memory access, the computation pattern is more like a processor architecture in loading data to the functional units. In the following, we use two typical design modules to demonstrate these different working modes.
Scalable Pipelined Multiplexing Scheduler
The covariance estimate is computed as the time average R_rr = (1/N_B) Σ_i r_A(i) r_A^H(i). Memory stalls are avoided and scalability is achieved because it can stop at any chip to adjust to different update rates.
In the PMS, the number of FUs is assigned according to the time/area constraints. As an example for a (2 × 2) case with L = 10, the VLSI area/time tradeoff is shown in Table 4. The complexity is 176 multiplications and 136 additions in each computation period.
A typical manual design will lay out 176 multipliers and 136 adders all in parallel. This will take 4 cycles to complete the computation. However, the multipliers are then in the IDLE state for 9 cycles and wasted. At the other extreme, an area-constrained solution will reuse one multiplier and one adder, but has to take more than 176 cycles. The most area/time efficient architecture in 10 cycles is to reuse 22 multipliers and 15 adders as pipelined operations.
The multiplexing of so many multipliers in manual RTL layout could be very difficult and time consuming. Moreover, for a changed specification such as the chip rate or clock rate, we can rapidly reschedule the design to meet the real-time requirement by using the minimum hardware resource. The similar design method is applied for the FIR and channel estimation.
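The resource-sharing tradeoff can be made concrete with a lower bound on the number of functional units and a utilization figure. The sketch below is an idealized model with illustrative function names: it ignores pipeline latency and data dependencies, which is presumably why the scheduled design reported above (22 multipliers, 15 adders) sits somewhat above the bound.

```python
from fractions import Fraction


def min_functional_units(num_ops, cycle_budget):
    """Lower bound on units needed to issue num_ops operations within cycle_budget cycles."""
    return -(-num_ops // cycle_budget)   # ceiling division


def utilization(num_ops, num_units, cycle_budget):
    """Fraction of unit-cycles that perform useful work."""
    return Fraction(num_ops, num_units * cycle_budget)
```

For 176 multiplications in a 10-cycle period the bound is 18 multipliers, and for 136 additions it is 14 adders. A fully parallel layout with 176 multipliers has a utilization of 1/10, consistent with each multiplier idling for 9 of the 10 cycles, while 22 shared multipliers reach 4/5.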
Block-based MIMO-FFT IP Cores
For the multiple FFTs in the tap solver, the keys for optimization of the area/speed are loop unrolling, pipelining and resource multiplexing. Although Xilinx provides FFT IP cores, they are considerably large and much faster than required. For example, a single v32FFT
For the MIMO-FFT/IFFT modules, we can reuse the control logic inside the FFT module and schedule the number of FUs more efficiently in the merged mode.
Prototyping Implementation
Based on the above algorithmic and architectural optimizations, we have prototyped the VLSI architecture of a (4 × 4) MIMO equalizer on the Aptix FPGA platform [26]. The correlation window is set to 10 chips for all 4 receive antennas. Fixed-point simulation shows that 8-bit input chips give a negligible performance loss. To give a safe range, the input chip samples to both the correlator and the channel estimator have 10-bit precision. The 32-point MIMO-FFT module has a 16-bit input word length for both the covariance and channel coefficients. To support even faster fading speeds, we design the prototyping system for up to 4 updates per slot with an overall tap solving latency requirement of 125 µs. In Table 6, we give the specification of the major design blocks. Overall, we utilize only 4 multipliers to achieve an area/time efficient design for 16 merged FFT/IFFT modules. For the L_F inverses of the (4 × 4) Hermitian sub-matrices, the latency is 38 µs with 6 multipliers. It is also noticed that the different modules have very similar latency, which provides very balanced pipelining in multiple stages. The overall latency of 124 µs meets the real-time requirement very closely and gives area efficiency. This efficiency benefits not only from the aforementioned algorithmic and architectural optimizations, but also from the extensive design space exploration to find the most compact design that meets the real-time requirement. The integration of the MIMO equalizer into the complete HSDPA transceiver system following the same methodology as in [16] is also being considered.
Conclusion
In this paper, we propose an efficient circulant MIMO chip equalizer for multi-code CDMA downlink by using FFT-based operations to avoid the direct matrix inverse. A comparative study demonstrates very promising performance/complexity tradeoff. VLSI-oriented optimizations are proposed to reduce the number and complexity of FFTs. The inverse of (4 × 4) submatrices is solved by partitioned (2 × 2) submatrices, which leads to dramatically simplified VLSI modules. The VLSI design space is explored extensively for area/time efficiency by a Catapult C based HLS methodology. The VLSI design is validated in a real-time FPGA prototyping system.
