Abstract - In this paper, scalable FPGA architectures for the LMMSE-based chip-level equalizer in HSDPA downlink receivers are studied. An FFT-based algorithm is applied to avoid the direct matrix inverse by utilizing the block-Toeplitz structure of the correlation matrix. A Pipelined-Multiplexing-Scheduler (PMS) is designed in the front-end to achieve scalable computation of the correlation coefficients. Very efficient VLSI architectures are designed by investigating the multiple-level parallelism and pipelining with a Precision-C based High-Level-Synthesis (HLS) design methodology. A 1x2 Single-Input-Multiple-Output (SIMO) downlink receiver is designed and integrated in the HSDPA prototype system with Xilinx Virtex-II XC2V6000 FPGAs. The design demonstrates more area/time efficiency by achieving the best tradeoffs between the usage of functional units and real-time requirements.
INTRODUCTION
With the emerging multimedia services over wireless cellular systems, the downlink capacity in 3G cellular networks is expected to be more crucial than the uplink capacity because of the asymmetric data rate requirements. High Speed Downlink Packet Access (HSDPA) is an evolutionary mode of the Wideband CDMA (WCDMA) packet data service for data rates up to 10 Mbps over a 5 MHz bandwidth. The higher data rate is achieved by assigning multiple spreading codes with very short spreading gain to a single user. Although Orthogonal-Variable-Spreading-Factor (OVSF) codes are applied in the HSDPA downlink, the orthogonality of the codes is destroyed by the multi-path fading channel. This results in Multiple-Access-Interference (MAI) at the receiver in addition to the Inter-Symbol-Interference (ISI) caused by the multi-path propagation. The optimal maximum likelihood multi-user detector was shown to have unrealistic computational complexity that increases exponentially with the number of users. The Rake receiver is shown to provide unacceptable performance in systems such as HSDPA because of the short spreading gain. To achieve a good tradeoff between performance and complexity, Linear-Minimum-Mean-Square-Error (LMMSE) based chip-level equalization has been proposed as one of the most promising sub-optimal receivers [1,2,3]. The capacity of the HSDPA system is significantly increased because the orthogonality of the spreading codes is partially restored by the equalization. Recently, receiver diversity techniques using multiple receive antennas or over-sampling have been proposed to further increase the system performance. The equalizer has been one of the most complex blocks in typical communication systems because it usually requires the inverse of the correlation matrix, which is very expensive in hardware implementation. The diversity technique makes the design of an equalizer in the downlink even more challenging.
To make it feasible for practical implementation, an FFT-based fast algorithm is proposed in [3] by using the banded-Toeplitz structure of the correlation matrix to avoid the matrix inverse and reduce the complexity. The study of efficient VLSI architectures for this algorithm is the focus of this paper, which will be especially interesting for prototyping in a real system and for commercial products.
The complexity of the equalizer is dominant compared with other parts of the HSDPA downlink receiver. There exist many area/time architecture tradeoffs because the algorithm demands different usage of hardware resources and parallelism in different situations. On the other hand, scalability is an important feature that makes the design flexible to different channel environments. To achieve this scalability, a Pipelined-Multiplexing-Scheduler is designed at the front-end for the computation of correlation coefficients in a streaming mode. A Precision-C based High-Level-Synthesis (HLS) design methodology [4] is then applied to study the multi-level parallelism and pipelining in the algorithm. A 1 transmit x 2 receive antenna SIMO downlink receiver is designed and integrated in the HSDPA prototype system with more area/time efficiency by achieving the best tradeoffs between the usage of functional units and real-time requirements.
SYSTEM MODEL
The HSDPA downlink receiver considers a chip-level transmitted signal with K active codes, in which i and m are the chip and symbol indices, respectively. Meanwhile, $c(i)$ is the scrambling code, $\alpha_k$ is the amplitude, $b_k(m)$ is the modulated symbol, $s_k(i)$ is the spreading sequence for code k, and G is the spreading factor of the OVSF codes. In a SIMO system where the multi-channel diversity factor is M (note that the multi-channel diversity may take the form of either over-sampling or multiple receive antennas), the received signal assumes a matrix-vector form, where h is the (F+1)th column of H and $R = E[\mathbf{r}\mathbf{r}^H]$ is the correlation matrix. It is shown in [3] that the correlation matrix can be made circulant by adding two corners. FFT operators are applied to make the circulant matrix diagonal and reduce the complexity dramatically. After some elegant numerical operations, it is shown that the equalizer filter taps can be computed with the following approximation,

$$ w = R^{-1} h \approx (D^H \otimes I)\, F^{-1}\, (D \otimes I)\, h \qquad (3) $$

where $\otimes$ denotes the Kronecker product and D is the DFT matrix. F is a block-diagonal matrix, $F = \mathrm{diag}(F_0, F_1, \ldots, F_{L_F-1})$, whose elements are given by the element-wise FFT of the first column of the circulant matrix. $L_F$ is the length of the DFT. For the details of the algorithm and performance simulation, please refer to [3].
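The diagonalization step behind (3) can be illustrated with a small numerical sketch. The Python below is not the implementation from [3]: it assumes the scalar case M = 1 (so the Kronecker factors drop out and each diagonal block of F is a single number), it uses a naive O(n^2) DFT instead of an FFT for brevity, and the helper names `dft`, `idft`, `circulant_solve` and `circulant_matvec` are illustrative only.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (stands in for the FFT operator D)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT (stands in for D^H, including the 1/n scaling)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circulant_solve(first_col, h):
    """Solve C w = h for a circulant C given by its first column.

    The DFT of the first column yields the eigenvalues of C, so the
    matrix inverse reduces to one division per frequency bin.
    """
    F = dft(first_col)          # diagonal entries (scalar case, assumption M = 1)
    Hf = dft(h)                 # channel vector in the frequency domain
    return idft([Hf[k] / F[k] for k in range(len(h))])

def circulant_matvec(first_col, w):
    """Reference product C w, with C[i][j] = first_col[(i - j) mod n]."""
    n = len(w)
    return [sum(first_col[(i - j) % n] * w[j] for j in range(n)) for i in range(n)]
```

Multiplying the result of `circulant_solve` back through `circulant_matvec` should reproduce `h` up to rounding, which is a convenient sanity check. For M > 1 each bin holds an M x M block instead of a scalar, and the per-bin division becomes a small matrix inverse.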
SCALABLE FPGA ARCHITECTURE
The focus of this paper is to study the system design and VLSI architectures of the algorithm in [3]. The limited hardware resources and power supply in mobile handsets make the hardware design more challenging, especially with multiple channel diversity. We emphasize the interaction between architecture, system partitioning and the requirements for the system design flow with three objectives: 1) implement the equalizer with the minimum hardware resources; 2) achieve high-speed update of the equalizer filters; 3) obtain a scalable architecture to accommodate various mobile speeds.
Precision-C Design Flow
As the design progresses from concept to manufacturing, it goes through several levels of abstraction. At each abstraction level, different design views are created in different design representations. In this section, we present common design representations and examine how they vary from one abstraction level to another. We used the efficient Precision-C based architecture scheduling methodology from Mentor Graphics [4]. The design was conducted using traditional "bottom-up" schematic entry with the Mentor Graphics HDL Designer tools. RTL is generated directly from the C/C++ level and imported into HDL Designer for high-level integration. The HDL design is then synthesized using LeonardoSpectrum. The hardware netlist is generated with the Xilinx ISE place-and-route tools and verified on Nallatech FPGA development platforms.
System Level Pipeline
We illustrate our design blocks for the M = 2 case, which includes 6 FFT + 2 IFFT operations and the inverse of $L_F$ sub-block complex matrices of size 2x2. The system-level pipeline is designed for better modularity, as in Fig. 1.
ARCHITECTURE SCHEDULING
We use Precision-C to perform architecture scheduling under different resource/time requirements. From the system-level diagram in Fig. 1, we can schedule the tradeoff between latency T and area A for each stage to find the most efficient design. Speed does not always scale linearly with the number of FUs, because both data dependencies and structural hazards stall the pipeline, while logic blocks and MUXs determine the clock rate and cycle count. Precision-C can schedule the order of operations to remove pipeline stalls to some extent. Configurable parallelism is achieved by assigning the number of FUs according to the area/time constraints. Multiple Processing-Element (PE) architectures are generated as in Fig. 2 with configurable parallelism at different levels. The best solution is the smallest design that meets the real-time requirements.

In the process of implementation, we work at several different modeling levels. A behavioral representation treats the design simply as a black box, while specifying its behavior as a function of its input values and elapsed time. In other words, a behavioral representation describes the system's functionality but tells us nothing about its implementation. A structural representation begins to address the implementation, as it defines the black box in terms of a set of components and their connections. It focuses on specifying the product's implementation, and even though the functionality of the black box can be derived from its interconnected components, the structural representation does not describe the functionality explicitly. An architectural representation carries the implementation of the design one step further by specifying the physical characteristics of the components described in the structural representation. Because of the limited space in this paper, we only highlight the important features of some major design blocks.
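The selection rule stated above, the smallest design that still meets the real-time requirement, can be sketched in a few lines of Python. This is only an illustration of the decision rule, not of how Precision-C explores the design space; the `Candidate` type, the `smallest_feasible` helper and all figures in the usage example are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence

class Candidate(NamedTuple):
    """One scheduled architecture (all fields are illustrative assumptions)."""
    name: str
    multipliers: int      # number of multiplier functional units
    slices: int           # area estimate
    latency_cycles: int   # cycles needed to finish one update

def smallest_feasible(candidates: Sequence[Candidate],
                      deadline_cycles: int) -> Optional[Candidate]:
    """Return the smallest-area candidate whose latency meets the deadline."""
    feasible = [c for c in candidates if c.latency_cycles <= deadline_cycles]
    if not feasible:
        return None  # no schedule meets the real-time requirement
    # Area first, then the number of multipliers as a tie-breaker.
    return min(feasible, key=lambda c: (c.slices, c.multipliers))

# Hypothetical design points, NOT measurements from the paper.
designs = [
    Candidate("parallel", multipliers=8, slices=900, latency_cycles=100),
    Candidate("balanced", multipliers=4, slices=600, latency_cycles=220),
    Candidate("serial", multipliers=1, slices=400, latency_cycles=700),
]
choice = smallest_feasible(designs, deadline_cycles=300)
```

With a 300-cycle budget the serial design is excluded by latency, so the rule falls back to the next smallest design rather than the fastest one.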
I. Scalability
By assuming ergodicity, the ensemble average of the correlation matrix, $R = E[\mathbf{r}\mathbf{r}^H]$, can be realized by the time average over one block of N samples. The complexity of computing the full correlation matrix is high. However, because of its block-Toeplitz structure, the correlation matrix is fully defined by its first block column $E = [E_0, \ldots, E_L]$.
Thus only the first block column of the correlation matrix, i.e., E, needs to be computed from the chip vectors r[i], where r[i] contains the M samples at chip index i from all receive antennas. The block size N is determined by the mobile speed. Scalability requires that the design be flexible to different mobile speeds with no dramatic change in resource usage.
The independent elements of $E_l$ ($l = 1, \ldots, L$) are computed as in (5), where $r_j[n]$ is the j-th receiver sample of chip n. By composing the algorithm from different structural blocks, we can schedule two different architectures: block mode in Fig. 3 (a) and throughput mode in Fig. 3 (b). The block-mode architecture first collects all N samples in the observation window and then starts the computation from the RAM blocks. FSMs for the WE, ADDR and DATA buses and control logic for the RAMs are designed to make it work in a similar way to a processor. It has extensive memory access and requires large ping-pong buffers for N samples. Memory stalls break the pipeline in the computation. The best RTL scheduled from Precision-C has more than 12 ms latency and could not provide enough parallelism to meet the real-time requirement.
To approach real-time computation, a scalable architecture is designed in throughput mode. It completes the processing for each input sample within the available time resource, i.e., the number of clock cycles per chip (e.g., 10 cycles for a 38.4 MHz clock rate and a 3.84 MHz chip rate).

Fig. 3 (a). Block mode with ping-pong buffer. CM: complex multiplier; CA: complex adder.
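The difference between the two modes can be made concrete with a short Python sketch. It assumes the usual time-average estimator in which lag l accumulates r[i] r[i-l]^H over the block (the exact lag and conjugation convention of the original equations is not reproduced here), and the names `StreamingCorrelator`, `block_correlation` and `outer_conj` are illustrative. The streaming class updates every lag as each chip vector arrives and keeps only the last L+1 chip vectors, whereas the block version needs the whole N-sample buffer first.

```python
from collections import deque

CYCLES_PER_CHIP = 38_400 // 3_840  # clock rate / chip rate, both in kHz

def outer_conj(a, b):
    """Return the M x M matrix a * b^H for two length-M vectors."""
    return [[x * y.conjugate() for y in b] for x in a]

class StreamingCorrelator:
    """Throughput-mode sketch: accumulate E_0..E_L one chip vector at a time."""

    def __init__(self, M, L):
        self.M, self.L = M, L
        self.window = deque(maxlen=L + 1)  # holds r[i], r[i-1], ..., r[i-L]
        self.acc = [[[0j] * M for _ in range(M)] for _ in range(L + 1)]
        self.count = 0

    def push(self, r):
        self.window.appendleft(list(r))    # window[l] is now r[i - l]
        for l, past in enumerate(self.window):
            prod = outer_conj(r, past)
            for a in range(self.M):
                for b in range(self.M):
                    self.acc[l][a][b] += prod[a][b]
        self.count += 1

    def result(self):
        """Time average over all chips pushed so far."""
        return [[[v / self.count for v in row] for row in E] for E in self.acc]

def block_correlation(samples, M, L):
    """Block-mode reference: needs all N chip vectors before it can start."""
    N = len(samples)
    out = []
    for l in range(L + 1):
        E = [[0j] * M for _ in range(M)]
        for i in range(l, N):
            prod = outer_conj(samples[i], samples[i - l])
            for a in range(M):
                for b in range(M):
                    E[a][b] += prod[a][b]
        out.append([[v / N for v in row] for row in E])
    return out
```

Because the streaming version stores only L+1 chip vectors, its memory does not grow with the block size N, which is the property the scalability requirement asks for. In hardware the work inside `push` would have to fit in the per-chip cycle budget.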
Multi-level Pipeline/Parallelism
Since there are multiple FFTs in the design, the keys to optimizing the area/speed are loop unrolling, pipelining and resource multiplexing. Although Xilinx IP cores for FFTs could be applied for integration, they are much faster than required and the resulting design is considerably large. For example, a single v32FFT core in the Xilinx CoreGen library utilizes 12 multipliers and 2066 slices. Since we need 6 FFTs plus 2 IFFTs in one computation period, it is not easy to exploit the commonality in the algorithm by using the IP core.
We designed the Radix-2 Decimation-In-Time (DIT) FFT algorithm with customized specifications. The initial cos/sin phase coefficients for each stage are stored in ROM blocks. Parallelism and pipelining in the parallel FFTs are studied extensively at multiple levels: the Butterfly-Unit (BFU) level, the stage level, and the FFT-processor level. By using a Merged-Butterfly-Unit for the M-FFT, we exploit the commonality and the different levels of parallelism and achieve much more efficient resource utilization while still meeting the speed requirement. For a single 32-point FFT with 16-bit precision, the Precision-C scheduled RTL is compared with the Xilinx v32FFT core in Tab. II and demonstrates a much smaller size for different solutions, e.g., from solution 1 with 8 multipliers and 535 slices to solution 3 with only one multiplier and 551 slices. Overall, solution 3 represents the smallest design, with slower but acceptable speed for a single FFT. For the four element-wise FFTs shown in Fig. 1, we can either lay out duplicate FFT blocks or reuse one FFT module in serial computation. In a parallel layout, all the computations are localized and the latency is the same as that of a single FFT; however, the resource usage is 4x that of a single FFT module.
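As a software reference for the structure described above, the following Python sketch implements an iterative radix-2 DIT FFT whose twiddle factors are read from a precomputed table. It is a floating-point model under simplifying assumptions, not the scheduled RTL: the table holds all n/2 twiddles rather than only per-stage initial coefficients, there is no 16-bit quantization, and the function names are illustrative. The three nested loops correspond to the stage, group and butterfly levels at which unrolling or multiplexing could be applied.

```python
import cmath

def make_twiddle_rom(n):
    """Precompute W_n^k = exp(-2*pi*j*k/n), k < n/2 (stand-in for the cos/sin ROM)."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

def bit_reverse(x):
    """Reorder the input into bit-reversed index order, as radix-2 DIT requires."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, "0{}b".format(bits))[::-1], 2)
        out[rev] = x[i]
    return out

def fft_radix2_dit(x, rom=None):
    """Iterative radix-2 decimation-in-time FFT for power-of-two lengths."""
    n = len(x)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two and at least 2")
    if rom is None:
        rom = make_twiddle_rom(n)
    data = bit_reverse(x)
    half = 1
    while half < n:                      # stage level: log2(n) passes
        stride = n // (2 * half)         # ROM address step for this stage
        for start in range(0, n, 2 * half):   # group level
            for k in range(half):             # butterfly level
                w = rom[k * stride]
                top = data[start + k]
                bot = data[start + k + half] * w
                data[start + k] = top + bot
                data[start + k + half] = top - bot
        half *= 2
    return data
```

Passing the same `rom` to several calls mirrors the idea of sharing coefficient storage and control between merged FFTs; running four transforms through one such routine in sequence corresponds to the reused-module option, at roughly four times the single-FFT latency.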
For a reused module, extra control logic needs to be designed for the multiplexing. The time is equal to or larger than 4x that of the single FFT computation. However, since we merge the FFTs to exploit the commonality, we can reuse the control logic inside the FFT module and schedule the number of FUs more efficiently in the integrated mode. The specifications for the 4 merged FFTs are listed in Table III.
Merged Submatrix Inverse and Multiplication
Another major design block is the matrix inverse and multiplication, as in $F^{-1}(D \otimes I)h$. From equation (3), a straightforward partitioning places the matrix inversion of F first, followed by the matrix multiplication of $F^{-1}$ with the dimension-wise FFT of the channel coefficients. In this partitioning, we would first compute the inverses of all the sub-block matrices in F and then carry out a matrix multiplication. However, this partitioning involves two separate loop structures. Since the two steps have the same loop structure, it is more desirable to merge them and reduce the overhead. To take a closer look and derive the expressions for a better partitioning, we expand all the related computations. Because of the block-diagonal structure of the F matrix, its inverse can be separated into the inverses of $L_F$ 2x2 sub-matrices.
Thus we can use a single merged loop to compute the final result of G instead of using separate loops. The computation data path structure for W(k) is shown in Fig. 4. In the figure, the multipliers in the boxes denote complex multiplications. For the division, we can first compute the scale using a single real divider instead of using separate dividers, as the dotted line indicates. The divider can be pipelined and operated concurrently with the multiplication path to reduce the latency. Moreover, the multipliers in the loop can be multiplexed to trade off area against speed.
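A minimal Python sketch of such a merged loop is given below, assuming M = 2 so that every diagonal block of F is a 2x2 complex matrix. It uses the closed-form 2x2 inverse and folds the multiplication by the FFT-domain channel vector into the same iteration. The single real division appears as the reciprocal of the squared magnitude of the determinant; the function name and the data layout (nested tuples per frequency bin) are assumptions made for illustration, not the data path of Fig. 4.

```python
def merged_inverse_multiply(F_blocks, v_blocks):
    """Return F_k^{-1} v_k for every frequency bin k in one loop.

    F_blocks[k] is ((a, b), (c, d)), a 2x2 complex block of F.
    v_blocks[k] is (v0, v1), the matching slice of the transformed channel vector.
    """
    out = []
    for ((a, b), (c, d)), (v0, v1) in zip(F_blocks, v_blocks):
        det = complex(a * d - b * c)
        # One real division per bin: 1/det = conj(det) / |det|^2.
        scale = 1.0 / (det.real * det.real + det.imag * det.imag)
        inv_det = det.conjugate() * scale
        # Adjugate applied directly to v; no explicit inverse matrix is stored.
        w0 = (d * v0 - b * v1) * inv_det
        w1 = (a * v1 - c * v0) * inv_det
        out.append((w0, w1))
    return out
```

Since `scale` depends only on the determinant, it could be computed while the adjugate products are still in flight, which is the kind of overlap between divider and multiplier paths described above.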
Fig. 4. Data path of the merged 2x2 sub-matrix inverse and multiplication with dimension-wise FFT output.
5. PERFORMANCE & SPECIFICATION
The overall specifications for the FPGA implementation are shown in Table IV. As the most complex block in the HSDPA receiver, the scalable FPGA implementation consumes about 32% of the Xilinx XC2V6000 CLBs and 22% of the dedicated multipliers. A straightforward implementation is also listed in italics for comparison. Although a straightforward implementation may consume a similar number of slices [...]

We have studied the efficient FPGA implementation of a SIMO chip equalizer in the HSDPA downlink with an architecture scheduling methodology. Several strategies have made the design scalable to the channel variation speed and efficient in terms of area/time tradeoffs.
ACKNOWLEDGEMENTS

