Abstract
Introduction
Achieving power-efficient architectures will be a major goal in the design of next-generation mobile communication receivers such as laptops, cell phones, and PDAs. Future portable receivers will need to handle various multimedia data traffic irrespective of mobility, provide guaranteed Quality-of-Service (QoS), and integrate multiple functions (GPS, World Wide Web, e-commerce, etc.) simultaneously. The high bandwidth required by these applications implies that this functionality would come at the expense of a heavy drain on the available battery power. For example, the IMT Standard [8] for a mobile terminal specifies a target data rate of 384 kbps at bit error rates of the order of $10^{-3}$ to $10^{-7}$ in an urban outdoor terrain. Achieving these levels of performance, as well as the required data rates, will call for the implementation of advanced algorithms in the design of such receivers. With rapidly improving integrated-circuit (IC) technology and the decreasing cost of silicon area, there have been great advances in the ability to integrate the entire receiver chain on a single chip (system-on-chip design). What has not been addressed in these designs is system integration with power minimization as a key constraint. This work explores the techniques and trade-offs involved in the design of power-efficient architectures for next-generation DS-CDMA mobile communication receivers. Figure 1 shows the high-level description of the front end in a wireless communication receiver. The architectures implemented in this paper are represented by the solid-line blocks, while the dashed-line blocks are assumed to input the sampled wide-band signal and the estimated multi-path delays into the receiver. Further details regarding the implementation of the individual blocks can be found in [3]. The RAKE receiver unit forms an important constituent of a DS-CDMA mobile receiver for performing single-user detection.
The RAKE algorithm is conceptually simple; however, its computational complexity increases linearly with the number of multi-path components being processed. Although there has been considerable research on improving the performance of DS-CDMA RAKE receivers in fading multi-path channels, there has been comparatively little research on methodologies for minimizing the power dissipation of the receiver architectures. A strength reduction technique has been described in [1] for reducing the on-line power dissipation in the complex RAKE multipliers by up to 25%. Power reduction techniques for a spread-spectrum-based correlator have been described in [5], using a modified adder-tree structure and employing bus-invert coding. Low-power correlator architectures have been described in [11] that employ a partial correlation approach for reducing on-line power dissipation during code acquisition in WCDMA-based systems. To the best of our knowledge, there has been very little work on developing a framework that analyzes the performance vs. power dissipation trade-offs in the context of mobile DS-CDMA RAKE receivers.
Contributions
The work presented in this paper has two principal aims. First, we analyze the impact of reduced precision and arithmetic complexity on the algorithm performance and power dissipation in the DS-CDMA mobile RAKE receiver. Next, we explore the architectural design-space for reducing the on-line power dissipation. Starting with a conventional implementation of the RAKE receiver, we demonstrate design methodologies for achieving power reduction at the algorithm level and the architectural level. This "proof of concept" architecture has been targeted towards a Xilinx Virtex-II FPGA and achieves the targeted data rate of 384 kbps. The resulting power-performance profiles have been obtained after passing synthesized complex receiver data simulating an urban 3 path fading channel through the targeted architectures.
• Algorithm level: We show that reduction of sampling rate of the input complex multi-path receiver data to the DS-CDMA RAKE correlator during de-spreading results in favorable trade-offs in power consumption versus the corresponding receiver performance. Significant power savings are achieved through reduction in arithmetic complexity by decreasing the number of arithmetic operations during the RAKE correlation per symbol demodulation. For a 16 bit data-path, we have observed a 24.65% reduction in dynamic power dissipation in the reduced complexity RAKE receiver compared to the reference RAKE receiver implementation, with an acceptable performance loss of less than 2 dB.
• Architectural level: Starting with a 16 bit data-path, and reducing precision down to 10 bits, we study the variation in the RAKE receiver performance with decreasing fixed-point precision. Word-length reduction alone results in power reduction of up to 25.6% in the original reference RAKE receiver architecture, and 16.96% further in the reduced complexity RAKE receiver architecture mentioned above.
System Description
We consider a $K$-user DS-CDMA downlink system employing Binary Phase Shift Keying (BPSK) symbol modulation during transmission. The $k$th user's information sequence $b_k \in \{-1, 1\}$ is multiplied by an $N$-chip pseudonoise (PN) sequence, so that the bit duration equals $T_{bit} = N T_{chip}$. For purposes of estimating the complex channel coefficients [4], a common code-multiplexed pilot signal is broadcast by the base station to all mobile users. The sampled complex receiver data $r(n)$ at the DS-CDMA mobile receiver can be written in vector-matrix notation [3, 6] as $\mathbf{r}_i = \mathbf{S} \mathbf{H}_i \mathbf{A} \mathbf{b}_i + \mathbf{w}_i$, where
• $\mathbf{r}_i$ is the received sampled data ($S$ samples/chip) corresponding to the $i$th information symbol represented by
• $\mathbf{S}$ describes the signature matrix for all $K$ active users and the pilot channel given by
samples) signature waveform of the $k$th user and $p$th multi-path. Therefore,
where $s_k(t)$ represents the $k$th user's continuous-time spreading waveform given by the convolution of the user's spreading sequence $\{c_k(n)\}$ and the transmitted chip waveform $g_T(t)$.
• $\mathbf{H}_i$ denotes the complex channel impulse response coefficient matrix for the $i$th information symbol given by
• $\mathbf{A}$ is the user/pilot amplitude matrix given by diag{A
• $\mathbf{b}_i$ is the symbol vector for all $K$ users and pilot corresponding to the $i$th transmission given by
DS-CDMA RAKE receiver
The DS-CDMA RAKE receiver attempts to collect the signal energy from all the received signal paths that fall within the delay line and carry the same information [9]. Assuming that user 1 is the user of interest, define the signature matrix,
the RAKE receiver computes the decision statistic given by:
where $\hat{\mathbf{h}}_i \in \mathbb{C}^{P \times 1}$ is the complex channel coefficient estimate obtained from the output of a channel estimator. An all-ones pilot symbol sequence (assumed to be known at the mobile receiver) is used for the purpose of channel estimation. Define
as the pilot code signature matrix. Then, the channel estimate $\hat{\mathbf{h}}_i$ is given by the
where $L$ is the length of the averaging filter.
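The detection steps described in this section can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names `estimate_channel` and `rake_decision` are hypothetical, the averaging filter is assumed to be a plain mean over the last $L$ pilot correlator outputs, and the test channel is noise-free and constant.

```python
def estimate_channel(pilot_outputs, L):
    """Average the last L pilot correlator outputs of each finger.

    pilot_outputs[p] holds complex pilot correlator outputs for finger p.
    With an all-ones pilot, each output is a (noisy) copy of the path gain.
    Assumption: the averaging filter is a plain mean of length L.
    """
    estimates = []
    for finger in pilot_outputs:
        window = finger[-L:]
        estimates.append(sum(window) / len(window))
    return estimates


def rake_decision(finger_outputs, h_hat):
    """Maximal ratio combining followed by a BPSK hard decision."""
    statistic = sum(h.conjugate() * x for h, x in zip(h_hat, finger_outputs))
    bit = 1 if statistic.real >= 0 else -1
    return bit, statistic


# Hypothetical 3-path example: constant channel, no noise, transmitted bit = -1.
channel = [0.8 + 0.2j, -0.3 + 0.5j, 0.1 - 0.4j]
pilot_outputs = [[h] * 16 for h in channel]
h_hat = estimate_channel(pilot_outputs, L=16)
bit, statistic = rake_decision([-h for h in channel], h_hat)
print(bit)  # -1
```

With perfect channel estimates the combiner weights each finger by the conjugate of its path gain, so the decision statistic is real and its magnitude equals the total path energy.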
Power-efficient DS-CDMA RAKE receiver architectures
The dynamic power consumption $P_{dyn}$ at any node in a CMOS-based design is a function of the node capacitance $C$, the switching activity $\alpha$ of the node (defined as the average number of node transitions per clock cycle), the clocking frequency $f_{clock}$, and the supply voltage $V_{cc}$ employed in the design, and is given by $P_{dyn} = \frac{1}{2} \alpha C V_{cc}^{2} f_{clock}$. Since $P_{dyn}$ is quadratically related to $V_{cc}$, voltage reduction yields the biggest savings in power consumption. In addition, optimizations such as reduced algorithmic complexity, re-ordering of arithmetic expressions, and word-length reduction can markedly reduce the overall capacitance and node switching activity in the design, thereby reducing the power dissipation (a detailed description is provided in [2, 10]).
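The scaling behavior of the dynamic power expression can be checked numerically. The sketch below is illustrative only: the function name `dynamic_power` and the node values (activity, capacitance, supply voltage) are assumptions chosen for the example, not measurements from this design.

```python
def dynamic_power(alpha, capacitance, vcc, f_clock):
    """Dynamic power in watts: 0.5 * alpha * C * Vcc^2 * f_clock."""
    return 0.5 * alpha * capacitance * vcc ** 2 * f_clock


# Hypothetical node values, for illustration only.
base = dynamic_power(alpha=0.25, capacitance=10e-12, vcc=1.5, f_clock=24.576e6)
half_voltage = dynamic_power(alpha=0.25, capacitance=10e-12, vcc=0.75, f_clock=24.576e6)
half_activity = dynamic_power(alpha=0.125, capacitance=10e-12, vcc=1.5, f_clock=24.576e6)

print(half_voltage / base)   # 0.25: quadratic in supply voltage
print(half_activity / base)  # 0.5: linear in switching activity
```

Halving the supply voltage quarters the dynamic power, whereas halving the switching activity (or the capacitance, or the clock frequency) only halves it.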
Reduction in arithmetic complexity
The most computationally intensive operation in the RAKE receiver is the correlation operation, where the sampled complex multi-path receiver data is correlated with the spreading waveform vector for the user and pilot channels. For the $p$th finger, the correlation output $X^{p}_{cor}(i)$ corresponding to the $i$th signaling interval can be represented by,
where
When implementing the correlation operation as a digital matched filter, the complexity of the correlation operation is governed by the length of the signature waveform vector $N_{corr}$ and the number of active fingers $P$. The signature waveform vector $\mathbf{s}_{1,p}$ is represented by the discrete-time convolution of the length-$N$ spreading sequence $\{c_1(n)\}$ and the square-root raised cosine (pulse-shaping) filter with impulse response $\{g_T(n)\}$. The length of the pulse-shaping filter equals $M = 2DS + 1$ taps (being linear phase), where $D$ is the group delay of the filter and $S$ is the upsampling rate at the filter input. The length of the convolution output is given by $N_{conv} = M + NS - 1$ samples. Assuming values of $D = 10$ samples and $S = 2$ samples/chip, we obtain $M = 41$ and $N_{conv} = 2N + 40$ samples; hence the overall correlator length is specified by $N_{corr} = N_{conv}$. For typical values such as a spreading code of length $N = 32$, a $P = 3$ path channel, and an $L = 16$ tap channel estimator, the arithmetic complexity of the RAKE receiver with ideal correlation equals $16NP + 318P + 2LP - 1 = 2585$ flops/symbol. We explore two schemes for reducing the correlator length as a means for achieving reduction in arithmetic complexity (Table 1).
• Sampling at 2 samples/chip: The starting and ending $DS = 20$ samples of the spreading waveform at the convolution output occur due to the group delay of the filter $g_T(n)$. By discarding these $2DS = 40$ samples and retaining the steady-state response, the correlator length reduces to $N_{corr} = N_{conv} - 40 = 2N$ samples/symbol, which translates into savings in arithmetic complexity. Thus the number of correlation operations involved in the pilot correlators (for channel estimation) and RAKE correlators (for despreading and detection) is reduced by $320P = 960$ flops/symbol to 1625 flops/symbol. In the Results section, the performance of the resulting receiver (with truncated correlation waveform) is shown to be almost identical to that obtained with perfect correlation. We call this the reference RAKE receiver.
• Sampling at 1 sample/chip: To achieve a further reduction in the arithmetic complexity, we reduce the sampling rate for the despreading operation in the RAKE correlators to 1 sample/chip, and investigate the resulting complexity vs. performance trade-offs. This halves the length of the correlator for the RAKE despreading operation to $N_{corr} = N$ samples/symbol, with a corresponding reduction in the overall operation count by $4NP = 384$ flops/symbol to 1241 flops/symbol. As the performance of detection is heavily influenced by the accuracy of channel estimates, the pilot channel correlation is still performed at 2 samples/chip. The complexity reduction comes at the cost of reduced correlator output energy owing to the halved correlation length. The results demonstrate a significant power reduction with acceptable detection performance due to this optimization. We call this the reduced complexity RAKE receiver.
[Table fragment. Rows recovered: complex receiver input data, 1; user signature matrix, 1; $\mathbf{S}_{pilot} \in \mathbb{R}^{2NS \times P}$, pilot signature matrix; Maximal Ratio Combiner output, 6.]
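The operation counts quoted in this subsection can be reproduced with a few lines of Python. This is a bookkeeping sketch only: the function names are hypothetical, and the constants $318P$ and $320P$ are taken directly from the text, so they hold for $D = 10$ and $S = 2$ rather than for arbitrary filter parameters.

```python
def filter_length(D, S):
    """Taps in the linear-phase pulse-shaping filter: M = 2DS + 1."""
    return 2 * D * S + 1


def conv_length(N, D, S):
    """Length of the spreading-code / pulse-shape convolution: M + NS - 1."""
    return filter_length(D, S) + N * S - 1


def flops_ideal(N, P, L):
    """Ideal (untruncated) correlation, as quoted for D = 10, S = 2."""
    return 16 * N * P + 318 * P + 2 * L * P - 1


def flops_reference(N, P, L):
    """Truncated correlation at 2 samples/chip: saves 320P flops/symbol."""
    return flops_ideal(N, P, L) - 320 * P


def flops_reduced(N, P, L):
    """RAKE despreading at 1 sample/chip: saves a further 4NP flops/symbol."""
    return flops_reference(N, P, L) - 4 * N * P


N, P, L, D, S = 32, 3, 16, 10, 2
print(filter_length(D, S))       # 41
print(conv_length(N, D, S))      # 104, i.e. 2N + 40
print(flops_ideal(N, P, L))      # 2585
print(flops_reference(N, P, L))  # 1625
print(flops_reduced(N, P, L))    # 1241
```

Expanding the last two functions gives $16NP + 2LP - 2P - 1$ for the reference receiver and $12NP + 2LP - 2P - 1$ for the reduced complexity receiver.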
Reduction in fixed-point Precision
All the DS-CDMA architectures presented in this paper are based on a fixed-point implementation. A quantization analysis tool developed at the University of Texas, Dallas [7] was used for determining the dynamic range and precision requirements of the RAKE receiver. Table 2 shows the fixed-point integer requirements of the individual RAKE receiver variables after quantization analysis. From the obtained fixed-point formats, extensive simulations were carried out using MATLAB/C, with C++ classes in SystemC providing the fixed-point arithmetic support. A minimum word-length of 10 bits was required for the RAKE receiver to achieve acceptable performance (within 1 dB of the equivalent floating-point version of the algorithm).
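The effect of a fixed-point format on a single value can be sketched as follows. This is not the quantization analysis tool of [7] nor the SystemC fixed-point classes; it is a minimal Python illustration that assumes signed two's-complement words with saturation and round-to-nearest, and the helper name `quantize` is hypothetical. An 8-bit word with 7 fractional bits is used as one example format.

```python
def quantize(x, word_length, frac_bits):
    """Round x to a signed fixed-point value with saturation.

    word_length: total bits including sign; frac_bits: fractional bits.
    Assumption: round-to-nearest (Python's round, ties to even) and
    saturation at the two's-complement limits.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (word_length - 1))
    highest = (1 << (word_length - 1)) - 1
    code = max(lowest, min(highest, round(x * scale)))
    return code / scale


print(quantize(0.3, 8, 7))    # 0.296875 (38 / 128)
print(quantize(1.5, 8, 7))    # 0.9921875 (saturates at 127 / 128)
print(quantize(-1.0, 8, 7))   # -1.0
print(quantize(0.3, 10, 9))   # 0.30078125 (154 / 512)
```

Each additional fractional bit halves the quantization step, which is why shortening the word-length trades precision (and eventually bit-error rate) for reduced switching capacitance.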
Architecture description
Two distinct architectures incorporating the aforementioned power saving techniques were implemented on a Virtex-II FPGA.
• Reference architecture: Figure 2 shows the reference architecture of the RAKE receiver. This implementation employs a uniform input sampling rate of 2 samples/chip for both the PILOT and RAKE correlator matched filtering operations. The external clock is passed through a delay-locked loop to derive the global clock buffer CLK running at the input sample frequency of $f_{samp} = 24.576$ MHz.
• Reduced complexity architecture: To explore the effects of reduced arithmetic complexity on the resulting power consumption of the RAKE receiver, the wide-band signal was input at the rate of 2 samples/chip to the PILOT correlator and 1 sample/chip to the RAKE correlator. Figure 3 shows the architecture of the resulting reduced complexity RAKE receiver with two separate clocking domains, namely CLK (shown by the solid box) and CLK DV (shown by the dashed box), running at $f_{samp} = 24.576$ MHz and $f_{samp}/2 = 12.288$ MHz respectively. While the global clock buffer distribution CLK was used to clock the PILOT matched filtering operation, the second clock buffer CLK DV was used to clock the RAKE matched filtering, channel estimation and Maximal Ratio Combining blocks. The presence of two independent clocking domains required the use of additional synchronizing logic to transfer signals (such as the pilot soft matched filter output) from the CLK domain to the CLK DV domain.
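The two clock rates are consistent with the system parameters quoted elsewhere in the paper. The sketch below is an inference rather than a statement from the source: it assumes one BPSK symbol per bit, so that the sample rate is the product of the 384 kbps data rate, the $N = 32$ chips per bit, and the 2 samples/chip over-sampling. The function name `sample_clock_hz` is hypothetical.

```python
def sample_clock_hz(bit_rate, chips_per_bit, samples_per_chip):
    """Input sample rate, assuming one BPSK symbol per information bit."""
    return bit_rate * chips_per_bit * samples_per_chip


f_samp = sample_clock_hz(bit_rate=384_000, chips_per_bit=32, samples_per_chip=2)
f_div = f_samp // 2  # divided clock for the 1 sample/chip RAKE domain

print(f_samp)  # 24576000 Hz = 24.576 MHz
print(f_div)   # 12288000 Hz = 12.288 MHz
```

Under this assumption the divided clock corresponds exactly to the 1 sample/chip rate used by the RAKE correlator in the reduced complexity architecture.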
Results
To study the impact of precision reduction on the resulting algorithm performance, the mobile receivers were simulated with 10-, 12-, 14-, and 16-bit fixed-point word-lengths and compared with a floating-point implementation. For each word-length format, the average received SNR = $10 \log_{10}(\;)$ was varied to study the effect on the bit-error rate performance of the algorithm. In the computer simulations, 5 equal-power users employing length-32 extended Gold sequences were considered. The scenario in consideration was a 5-user, 3-path correlated Rayleigh fading channel based on the Jakes mobility model. For each data point, 40 random test cases of 5000 transmitted bits were tested. The multi-path delays were fixed for each simulation and varied from one simulation to the next. All the users were assigned unit transmit amplitudes. An additional code-multiplexed pilot channel with a 3 dB higher power was employed for channel estimation at the mobile receiver. The over-sampling rate at the transmitter and receiver front end was chosen to be 2 samples/chip in order to account for fractional multi-path delays. The A/D converter at the receiver front end was chosen to have an 8-bit width (S8Q7 format). We consider the performance of the following DS-CDMA RAKE receivers:
• Reference RAKE receiver performing truncated correlation sampled at 2 samples/chip (complexity = $16NP + 2LP - 2P - 1$ operations/symbol).
• Reduced arithmetic complexity RAKE receiver performing truncated correlation sampled at 1 sample/chip for detection and 2 samples/chip for channel estimation (complexity = $12NP + 2LP - 2P - 1$ operations/symbol).
The performance of these receivers was compared against a DS-CDMA RAKE receiver employing perfect correlation (highest complexity of $16NP + 318P + 2LP - 1$ operations/symbol). Figure 4 shows the performance of the reference DS-CDMA RAKE receiver for the above scenario. We notice that the receiver performance in fixed-point is close to the ideal floating-point performance, with negligible performance degradation for the 10-bit precision (less than 1 dB loss) up to an SNR of 10 dB. Figure 5 shows the performance of the reduced complexity DS-CDMA RAKE receiver. The reduction in complexity, introduced to reduce the dynamic power consumption, causes a performance degradation of 2 dB compared to the DS-CDMA RAKE receiver employing ideal correlation (shown by the dashed black line), owing to the reduced energy at the output of the RAKE correlator. We note that the receiver performance in fixed-point is almost identical to the floating-point performance down to a 10-bit precision.
Multi-user, Multi-path fading channel

Results of FPGA implementation: Timing simulation
The RAKE receiver architectures were targeted for a 2 million gate Virtex-II (XC2V2000 series) FPGA. Synthesized complex receiver data for an urban 3-path Rayleigh multi-path channel was passed through each receiver implementation, and symbol detection was carried out. The same synthesized receiver data was used to determine the dynamic power consumption of each design. An external clock running at 50 MHz was used to clock the receiver. The analysis was carried out following the synthesis, translation, mapping, netlist extraction, and post-placement and routing phases. Extensive timing simulations were carried out in the Modelsim simulator to model true-device behavior. All internal node transitions occurring during the course of the simulations were dumped into a ".vcd" (Value-Change-Dump) file. The .vcd files were then analyzed by the power analysis tool XPower provided by Xilinx. The dynamic power consumption was obtained as the difference between the overall design power consumption and the quiescent power (225 mW) of the FPGA.
Table 3 reports the results of implementing the reference and reduced complexity architectures. The area shown in the table is given in FPGA slices as well as the percentage occupancy of the FPGA, with the available area being 10752 slices in the Virtex-II FPGA. Considering only the effect of reduced precision, the reference architecture family shows a power reduction of 25.6% for the 10-bit data-path compared to the 16-bit data-path. Within the reduced complexity architecture family, we observe power savings of 16.96% for the 10-bit data-path compared to the 16-bit data-path. These power savings are significant considering that the 10-bit data-path achieves close to the equivalent floating-point performance for both the reference and reduced complexity receivers (performance loss being less than 1 dB).
Next, we consider the additional effect of complexity reduction on the resulting power savings. The 16-bit reduced complexity RAKE receiver achieves a power saving of 24.65% compared to the 16-bit reference RAKE receiver implementation. The combined effect of reduced precision and arithmetic complexity results in a 37.4% reduction in dynamic power consumption for the 10-bit RAKE receiver, with a 3 dB degradation in performance (Figure 5). The trade-off of dynamic baseband power consumption with receiver performance is important for battery-operated mobile wireless terminals. In scenarios where there is a strong received signal, adaptive methods to reduce the dynamic digital baseband processing, as proposed in this paper, will greatly increase battery life.
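The combined figure can be cross-checked from the two individual savings. The sketch below assumes the savings compose multiplicatively, that is, the 24.65% saving (16-bit reference to 16-bit reduced complexity) is followed by the 16.96% saving (16-bit to 10-bit within the reduced complexity family). The helper name `combine_savings` is hypothetical, and the small difference from the reported value is attributed here to rounding of the percentages.

```python
def combine_savings(*fractions):
    """Overall fractional saving when successive savings are applied in turn."""
    remaining = 1.0
    for fraction in fractions:
        remaining *= 1.0 - fraction
    return 1.0 - remaining


combined = combine_savings(0.2465, 0.1696)
print(round(100 * combined, 1))  # 37.4 (percent)
```

Note that the two percentages do not simply add: each saving applies to the power remaining after the previous optimization.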
Conclusion
We have examined design methodologies and performance trade-offs for reducing the online power dissipation in a DS-CDMA mobile RAKE receiver. At the algorithm level, reduction in arithmetic complexity has been investigated for obtaining savings in the dynamic power dissipation. At the architectural level, precision reduction and activity rate reduction have been exploited for additional savings.
Reduction in precision shows that a 10-bit data-path achieves near floating-point performance with minimal performance degradation for the reference RAKE receiver. Power-efficient architectures based on a Xilinx Virtex-II FPGA have been proposed for implementing both the conventional and reduced complexity DS-CDMA RAKE receivers. For a 16-bit data-path, we have observed a 24.65% reduction in dynamic power dissipation in the reduced complexity RAKE receiver compared to the reference RAKE receiver implementation, with a performance loss of less than 2 dB. The combined effect of reduced precision and complexity reduction leads to a 37.44% saving in digital baseband power consumption, which will extend the operating time of mobile wireless terminals.
