Abstract-In this paper, a scalable architecture for a multicarrier CDMA system using Multiple-Input-Multiple-Output (MIMO) technology is designed in a programmable logic array. The system-level partitioning across different design entries is described, together with the overall computing architecture for the complex signal processing blocks, e.g., channel estimation, frequency-domain equalization, and demodulation. The MIMO architecture is easily extended from a single-antenna SISO system. This scalable architecture demonstrates efficient resource utilization and easy extension to MIMO configurations.
I. INTRODUCTION
Emerging beyond-3G (B3G) and 4G wireless communication technologies must deliver much higher data rates than are offered today to support multimedia services and ubiquitous networking on mobile devices. Because of its excellent performance over hostile frequency-selective wireless channels, Orthogonal Frequency Division Multiplexing (OFDM) has been researched extensively for many different standards. Multi-carrier CDMA (MC-CDMA) [1] is a key technology that combines OFDM and CDMA and offers advantages such as larger capacity and support for high data rates. In addition, Multiple-Input-Multiple-Output (MIMO) technology [2] is becoming more and more important because of its capability to enhance spectral efficiency significantly. In recent years, the combination of MIMO with multi-carrier CDMA has become an important candidate for the emerging B3G and future 4G wireless systems [4].
However, these technologies also involve many complex signal processing algorithms, which demand tremendous processing power to achieve real-time performance with low-cost silicon usage and low power consumption. Thanks to advances in silicon technology, many very complex signal processing algorithms can now be implemented in dedicated DSP processors. However, the processing power of most commercial DSP processors still grows more slowly than the complexity of the demanding signal processing algorithms, especially for MIMO MC-CDMA systems. Even with the emerging multi-core System-on-Chip (SoC) DSP processors, many baseband signal processing functions are still more suitable for hardware acceleration with VLSI circuits [3]. To reduce redundancy, commonalities in different configurations must be exploited.
In this paper, a scalable architecture for the complete PHY layer of a MIMO MC-CDMA system is proposed. We start with the baseline implementation of the SISO system and design efficient architectures for the dominant signal processing algorithms. The commonality with a MIMO system is then studied. The partitioning strategy is described, both between software and hardware and among design entries with different flows. The tradeoff between data throughput and design area is exploited to derive the scalability from a SISO system to MIMO systems. We also use different design flows for design entries in different domains, based on the features of the algorithms. A Catapult C flow [5] is applied to synthesize the design architecture to RTL. The architecture is implemented in a small form factor demonstrator based on Nallatech design boards with multiple Xilinx Virtex-II FPGAs. The architecture not only demonstrates efficiency in resource utilization, but also scales easily to MIMO configurations.
II. MIMO MC-CDMA SYSTEM MODEL
A simple model of the basic SISO MC-CDMA system is described here, as depicted in Fig. 1. Based on Fig. 1, the scalability from SISO to MIMO can be viewed as simple duplication of processing elements. However, to derive a more efficient architecture, the processing power in each major block needs to be balanced and re-partitioned, as we will show later.
The extension from SISO to MIMO at the receiver is much more challenging, because more advanced detection algorithms are needed to perform joint detection across multiple antennas. As this paper focuses on a scalable architecture for implementation, we do not elaborate on the mathematical details. Instead, we focus on the architecture design of the dominant signal processing blocks by addressing the design entry partitioning, commonality analysis, and reusability. Major equations are explained as needed to aid understanding.
III. SCALABLE ARCHITECTURE FROM SISO TO MIMO
A. Architecture for SISO Transceiver
The practical implementation of the prototyping system is much more complex than the data flow block diagram shown in Fig. 1. The partitioning of the SISO transmitter on the FPGA is shown in Fig. 2. The partitioning is based on the signal processing requirements, the data flow, and the computing architecture. The boundary between the DSP and the FPGA lies between the encoder and the baseband transmitter. First, the DSP transfers a frame of data to the FPGA buffer. The first block in the FPGA forms an OFDM symbol packet based on the MC-CDMA frame structure. For example, for the first OFDM symbol of each slot, the preamble is inserted at the head. The spreader and the interleaver are merged into a single module because their loop structures for each OFDM symbol are similar. To support a configurable interleaver, an interleaver table initialization block generates the interleaving indices and writes them to a memory block. The pilot locations are likewise stored in a ROM block to make them configurable. The pilot insertion and some over-sampling functions are merged to prepare the OFDM symbol data for the IFFT. These functions are well suited to a design flow based on high-level synthesis. They are designed with the Mentor Graphics Catapult C design tools [5].
The time-domain transmitter functions include the IFFT, the FIR filtering, and digital up-conversion. These functions require more signal processing power, but are more or less common standard modules. It is more efficient to integrate off-the-shelf high-performance IP cores. Thus, IP cores from third-party providers are integrated in the HDL designer environment.
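The merged spreader and interleaver loop can be illustrated with a minimal Python sketch. It is not the Catapult C source of this design: the block-interleaver pattern, the length-4 spreading code, and all function names are assumptions chosen only to show how one loop can spread a symbol and scatter its chips through a precomputed index table held in memory.

```python
# Hypothetical sketch: spreader and interleaver merged into one loop.
# The interleaving pattern, spreading code, and names are illustrative assumptions.

def build_interleaver_table(rows, cols):
    """Table initialization: block-interleaver indices (written once to memory)."""
    return [r * cols + c for c in range(cols) for r in range(rows)]

def spread_and_interleave(symbols, code, table):
    """Spread each symbol by `code` and write each chip to its interleaved slot."""
    sf = len(code)  # spreading factor
    out = [0] * (len(symbols) * sf)
    for i, s in enumerate(symbols):
        for k, chip in enumerate(code):
            out[table[i * sf + k]] = s * chip
    return out

def deinterleave_and_despread(chips, code, table):
    """Receiver-side mirror: read chips through the same table and correlate."""
    sf = len(code)
    n = len(chips) // sf
    return [
        sum(chips[table[i * sf + k]] * code[k] for k in range(sf)) / sf
        for i in range(n)
    ]

if __name__ == "__main__":
    code = [1, -1, 1, -1]
    table = build_interleaver_table(rows=2, cols=4)
    tx = spread_and_interleave([1, -1], code, table)
    print(tx)
    print(deinterleave_and_despread(tx, code, table))
```

Because the table is just data, a different interleaving pattern only changes the table initialization and leaves the loop itself untouched.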
The architecture partitioning of the SISO receiver is shown in Fig. 6. It is also partitioned into three design entry domains. First, at the interface of the ADC input, digital down-converter and low-pass (LP) filter modules are integrated with other modules such as synchronization and frequency offset correction. The low-pass filter is used to suppress the out-of-band noise and interference. Together with the cyclic prefix removal and FFT module, they form the time-domain processing sub-system. Because many of these blocks require high-throughput processing, they are more suitable for optimized IP cores.
... Fig. 4. In this option, the frequency-domain (FD) TX module is scheduled by Catapult C to be a very area-efficient design that just meets the throughput requirement of a SISO system. Because of its small design area, it is easy to duplicate the design entities for a MIMO system. The IFFT module, however, is designed by the third party for very high throughput, so it is a relatively large design that is already heavily pipelined and parallelized. The throughput of the IFFT alone is sufficient for the targeted MIMO configuration. Thus, we can first split the data with spatial multiplexing to multiple parallel SISO ...
where $W$ is the DFT matrix. Because all the matrices needed to construct $H$, except the data vector $Y_p$, are known a priori, the matrix $L = W V S^{-1} U^H$ is computed off-line and stored in ROM blocks. The channel estimation in the frequency domain is then essentially a matrix-vector multiplication of the predefined matrix in the ROM blocks and the frequency-domain input data.
Note that we need to estimate the channel coefficients for each subcarrier, and these subcarriers are considered independent of each other. We can exploit this subcarrier-level parallelism to speed up the computation. If all the predefined coefficients were stored in a single ROM block, contention for that memory would serialize the accesses and stall the intrinsic parallelism of the algorithm. Thus, we split the coefficients into sub-block memories and create multiple processing elements for channel estimation. This is shown as the multiple parallel processing elements for channel estimation in Fig. 6.
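The banked matrix-vector multiplication can be sketched in Python as follows. This is a behavioral illustration under stated assumptions, not the implemented RTL: the round-robin assignment of rows to banks, the number of processing elements, and all names are hypothetical. Each row of the precomputed matrix corresponds to one subcarrier, and each processing element reads only its own bank.

```python
# Hypothetical sketch: channel estimation as a banked matrix-vector product.
# The round-robin bank assignment and all names are illustrative assumptions.

def split_into_banks(L, num_pe):
    """Distribute the rows of L (one per subcarrier) over num_pe ROM banks."""
    banks = [[] for _ in range(num_pe)]
    for row_idx, row in enumerate(L):
        banks[row_idx % num_pe].append((row_idx, row))
    return banks

def processing_element(bank, y_p):
    """One PE: dot product of each row in its private bank with the pilot vector."""
    return [(idx, sum(a * b for a, b in zip(row, y_p))) for idx, row in bank]

def estimate_channel(L, y_p, num_pe):
    """Gather the per-subcarrier estimates produced by all PEs."""
    h = [0j] * len(L)
    for bank in split_into_banks(L, num_pe):  # the PEs run concurrently in hardware
        for idx, value in processing_element(bank, y_p):
            h[idx] = value
    return h

if __name__ == "__main__":
    L = [[1, 0], [0, 1], [1, 1], [1, -1]]
    y_p = [2 + 1j, 1 - 1j]
    print(estimate_channel(L, y_p, num_pe=2))
```

The result does not depend on the number of banks; only the number of rows that each processing element must handle sequentially changes, which is the area versus throughput knob.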
When the frequency-domain channel coefficients are estimated, the channel is equalized by using an LMMSE algorithm as

$$X_{LMMSE} = H^H \left( H H^H + \frac{\sigma^2}{P} I \right)^{-1} Y,$$

where $H$ is the frequency-domain channel matrix for each subcarrier, and $X$ and $Y$ are the detected symbol vector and the received signal vector, respectively. $\sigma^2$ is the noise variance and $P$ is the transmit power. Thus, we need another block to estimate the SNR, as shown in Fig. 6. Note that the equalization is done for each subcarrier independently. Thus, the equalizer is essentially a computation loop over all subcarriers, where the loop body performs the equalization computation. For the SISO case, the channel coefficient is a scalar for each subcarrier. Similar to the transmitter, it is straightforward to merge the de-interleaver and the de-spreader into a single processing module because of the similarity of their structures.
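For the SISO case, the per-subcarrier loop can be sketched in Python as below. The sketch assumes the standard scalar form of the LMMSE equalizer, in which the estimate for one subcarrier is the conjugate channel coefficient times the received sample, divided by the squared channel magnitude plus the noise-to-power ratio. The function and parameter names are hypothetical.

```python
# Hypothetical sketch: scalar LMMSE equalization, one iteration per subcarrier.
# Assumes the standard scalar LMMSE form; names are illustrative.

def lmmse_equalize_siso(h, y, noise_var, tx_power):
    """Return the equalized symbol for every subcarrier."""
    reg = noise_var / tx_power  # regularization term supplied by the SNR estimator
    x_hat = []
    for hk, yk in zip(h, y):  # loop body = equalization of one subcarrier
        x_hat.append(hk.conjugate() * yk / (abs(hk) ** 2 + reg))
    return x_hat

if __name__ == "__main__":
    h = [1 + 1j, 0.5j]
    x = [1 + 0j, -1 + 0j]
    y = [hk * xk for hk, xk in zip(h, x)]
    print(lmmse_equalize_siso(h, y, noise_var=0.1, tx_power=1.0))
```

With zero noise variance the loop reduces to a plain division by the channel coefficient, while a larger noise variance pulls the estimates toward zero.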
To identify the scalability on the receiver side, we need to analyze the differences between the SISO and MIMO receivers. When the LMMSE equalizer is applied, the MIMO architecture turns out to have many similarities with the SISO receiver. The extension from the SISO receiver to the MIMO receiver architecture is shown in Fig. 7.
The major difference is that the channel becomes a MIMO channel matrix rather than a scalar as in the SISO case. However, the front-end processing for each antenna is independent. Because of the high sampling rate of the front-end filtering, we simply duplicate the front-end path for each antenna. The FFT is multiplexed in the same way as the IFFT in the transmitter. Because the pilot separation and pulse-shaping filter module can easily achieve high throughput with relatively small resource utilization, we can also multiplex a single SISO processing element, as with the FFT. More channel coefficients must be estimated for each subcarrier than in the SISO case, since the channel coefficients form a matrix for each independent subcarrier. Even so, these coefficients are decoupled from each other in the estimation. Thus, we can duplicate the SISO channel estimator processing components to meet the increased throughput requirement, which gives both subcarrier-level and antenna-level parallelism.
The LMMSE MIMO detector now becomes a joint detection of multiple streams, as in the LMMSE detection equation. For each subcarrier, this requires matrix multiplications and a matrix inversion, compared with the scalar multiplication and division of the SISO case. The complexity therefore does not grow linearly with the number of streams. However, for the de-spreader, de-interleaver, and demodulator functions, the SISO design modules can be reused and scaled to support the MIMO configurations.
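The change inside the loop body can be illustrated with a Python sketch for a 2x2 configuration. The antenna configuration, the closed-form 2x2 inverse, and all names are assumptions made for illustration; the sketch evaluates the standard LMMSE form, i.e., the Hermitian transpose of the channel matrix times the inverse of the regularized Gram matrix times the received vector. The outer loop over subcarriers is the same as in the SISO case; only the per-subcarrier computation is replaced.

```python
# Hypothetical sketch: joint LMMSE detection for a 2x2 MIMO channel per subcarrier.
# The 2x2 size and all helper names are illustrative assumptions.

def herm(m):
    """Hermitian (conjugate) transpose of a 2x2 matrix."""
    return [[m[j][i].conjugate() for j in range(2)] for i in range(2)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(2)) for i in range(2)]

def inv2(m):
    """Closed-form inverse of a 2x2 matrix (replaces the scalar division of SISO)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def lmmse_detect_2x2(H, y, noise_var, tx_power):
    """Loop body for one subcarrier: H^H (H H^H + (noise_var/tx_power) I)^-1 y."""
    Hh = herm(H)
    G = matmul(H, Hh)
    reg = noise_var / tx_power
    G[0][0] += reg
    G[1][1] += reg
    return matvec(matmul(Hh, inv2(G)), y)

def equalize_all(H_per_sc, y_per_sc, noise_var, tx_power):
    """Same subcarrier loop as in the SISO case; only the body differs."""
    return [lmmse_detect_2x2(H, y, noise_var, tx_power)
            for H, y in zip(H_per_sc, y_per_sc)]

if __name__ == "__main__":
    H = [[1 + 0j, 0.5j], [0.2 + 0j, 1 - 1j]]
    y = matvec(H, [1 + 0j, -1 + 0j])
    print(equalize_all([H], [y], noise_var=0.1, tx_power=1.0))
```

For a diagonal channel matrix the sketch reduces to two independent scalar equalizations, which is one way to check it against the SISO design.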
IV. DESIGN SUMMARY
A. FPGA Resource Utilization
This section summarizes the FPGA resource utilization for the major building blocks. The front-end of the time-domain receiver is similar to the time-domain transmitter processing. The complete frequency-domain receiver design alone, however, requires 74 multipliers and 101 RAM16 blocks, and occupies 67% of the slices of the Virtex-II V4000 device. The breakdown of the different modules in the frequency-domain equalizer is shown in Table II. The channel estimation module alone consumes the largest number of multipliers. After the channel coefficients are obtained for each subcarrier, the equalizer itself is relatively simple. Because the channel coefficients are independent of each other, scalability to MIMO channel estimation can be achieved easily by increasing the number of processing elements to meet the throughput requirement. As joint MIMO detection only changes the body of the loop structure compared with the SISO system, the extension to the MIMO system keeps the design size of the equalizer well balanced.
V. CONCLUSION
In this paper, we present a scalable architecture for the MIMO multi-carrier CDMA system. The commonality between the SISO and MIMO systems is exploited to reuse the major design modules. The design is prototyped on an FPGA platform, which demonstrates the efficiency and scalability of the architecture.
