Abstract-Direct Sequence Spread Spectrum (DSSS) transmissions require a despreading stage within the standard receiver block to recover the spread spectrum signal. For long spread spectrum codes, the correlation block can be a large portion of the receiver size, and hence a considerable portion of the power consumption. This paper looks at two power reduction alternatives for a parallel spread spectrum correlator. After analyzing the algorithm and designing a baseline correlator, we investigate how to streamline the arithmetic operations in one case and how to optimize the sample storage in the other. The correlator designs are compared with a mix of analytical techniques and simulation data to determine the optimal correlator alternative for the DSSS application. The final analysis shows that the register file based correlator can reduce the power by over 30% for bus widths greater than 6 by using a structure which maintains the multi-bit data samples in a static area and rotates the single-bit coefficients around the data with a circular shift register.
INTRODUCTION
A widely used type of spread-spectrum encoding is Direct Sequence Spread Spectrum (DSSS). DSSS takes a message stream and adds a wideband spreading code before the signal is modulated onto a carrier frequency. For binary coefficients, each symbol (bit) is represented by a pseudo-random pattern of chips. The despreading output is the convolution of the discrete-time sample signal with the original code coefficients, as seen in (1), where L is the length of the DSSS code and k is the discrete-time index. For every new sample, the correlator must calculate L multiplications and additions. Depending on the application, DSSS systems can use design alternatives with different code lengths, bits per sample, samples per symbol, and acquisition times [1]-[10].
This paper deals with a parallel correlator, where the correlations are performed at the chip rate and the DSSS code can be acquired in every symbol period. There are structures that sacrifice the acquisition time for a reduction in the hardware size, but their correlations cannot be performed every cycle [10]-[12].
The structure of the paper is as follows: in Section 2, we present three designs: a shift register implementation, a bypass-adder tree structure, and finally a register file approach which maintains the incoming samples in a static storage area. Section 3 presents the power analysis of the three correlator choices. Section 4 presents the comparison of the three correlators, and ultimately shows that the register file correlator has the lowest power dissipation of the three choices.
CORRELATOR STRUCTURES

Shift Register Sample Storage
A conceptually simple structure to handle the sample streams and provide the L samples to an adder tree is a shift register (Fig. 1). The structure size is determined by the sample resolution, n, and the length of the DSSS codes, $2^m - 1$ for maximal length sequences. As a new sample enters the top, the oldest sample will shift out of the storage area as it is no longer required, thus providing additional flexibility for chaining several correlation blocks together in order to handle larger codes.
In the arithmetic unit, the samples are first multiplied by the code coefficients and then are added together. Fig. 2 shows the block diagram of the arithmetic unit taking the samples from the registers (the control logic for the registers has been removed for simplicity). In the case of binary coefficients (+1 and -1), the multiplications can be reduced to selecting the original sample for a +1 coefficient, or the 2's complement of a sample for a -1 coefficient. In order to provide a fast summation, the adder tree can be organized in a binary tree structure, reducing the calculation time for (1) to the order of $\log_2(2^m - 1)$, or simply m.
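As a minimal sketch of the arithmetic unit described above, the following Python fragment models the multiplication by a +1 or -1 coefficient as a select-or-negate step and sums the terms with a pairwise (binary-tree) reduction. The function names `adder_tree_sum` and `correlate` are illustrative and do not come from the design itself, and Python integers stand in for the n-bit 2's complement words.

```python
def adder_tree_sum(terms):
    """Pairwise (binary-tree) reduction; returns (total, number_of_levels)."""
    level = list(terms)
    depth = 0
    while len(level) > 1:
        # Add neighbouring pairs; an unpaired last term passes through.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


def correlate(samples, coeffs):
    """Select the sample for a +1 coefficient, its negation for a -1."""
    terms = [s if c == 1 else -s for s, c in zip(samples, coeffs)]
    return adder_tree_sum(terms)
```

For a 7-chip code (m = 3), the reduction in this sketch takes three levels, in line with the order-m depth noted above.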
Reducing Power in the Adder Tree
Given the shift register storage, there are potential power savings that can be realized in the adder tree by looking at the data statistics. In our case the DSSS code is based on maximal length pseudo-random sequences that can be generated with a linear-feedback shift register (LFSR), as in Fig. 3 [16]. These sequences have unique properties that define how the coefficients change between successive calculations of the matched filter equation.
The run property of a maximal length sequence defines the number of runs (streams of consistent 1s or 0s) to be dependent solely on the code length [16] . The run property of maximal-length codes allows us to considerably reduce the number of transitions in the adder tree as only half of the terms in the correlation sum will have a different coefficient from one calculation to the next (and thus change their contributions to the overall sum in each cycle). As the data is shifted by one position, the previous coefficient and the new coefficient will remain the same for half the number of samples (in runs of length 2 or greater). In order to capture this behavior we define "bypass bits" (see Fig. 4 ) which tell the adder stages if a term is not changing and thus it has zero contribution to the difference between the present and next correlation sum.
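The run property can be checked numerically. The sketch below is an illustration under stated assumptions, not the generator of Fig. 3: it produces one period of a maximal length sequence from a two-tap recurrence (the tap choice `(4, 3)` for m = 4 is an assumption chosen because it yields a 15-chip period) and marks each coefficient with a bypass bit when it equals its neighbor, so that the term would not change after a one-position shift. Chips are represented as 0/1 and would map to the -1/+1 coefficients; the comparison is cyclic for simplicity.

```python
def m_sequence(m, taps):
    """One period (2**m - 1 chips) of the recurrence
    a[n] = a[n - taps[0]] XOR a[n - taps[1]], seeded with all ones."""
    chips = [1] * m
    while len(chips) < 2 ** m - 1:
        chips.append(chips[-taps[0]] ^ chips[-taps[1]])
    return chips


def bypass_bits(chips):
    """1 where a coefficient equals its (cyclic) neighbour, so the term is
    unchanged by a one-position shift; 0 where the coefficient changes."""
    return [int(chips[i] == chips[i - 1]) for i in range(len(chips))]


code = m_sequence(4, (4, 3))             # 15-chip sequence, assumed taps
unchanged = sum(bypass_bits(code))       # terms that can be bypassed
changed = len(code) - unchanged          # terms that must be recomputed
```

With these assumed taps, roughly half of the 15 terms change per shift, which is the behavior the bypass bits are meant to capture.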
Similar work has been reported for differential coding [17], but the emphasis was not on reducing power consumption.
By storing the previous sum and identifying the factors that are changing, we can streamline the arithmetic operation to reduce the number of terms. The overall number of adders cannot be reduced, as different codes change the locations of inactive adders, but we can shut down unused adders, and limit their

If the coefficient for a sample has not changed from the previous calculation, then $h^*_t$ is 0 in (2); otherwise $h^*_t$ will reflect the new polarity (+1 or -1). When the coefficient changes, the original sample value must be removed from the sum, and then the sample with the new polarity must be added. This can be handled in one step by adding the sample with the new polarity twice, which explains the factor of 2 before the summation symbol in (2). Also, in each cycle, the newest sample that enters the chain must be added and the outgoing sample must be subtracted from the overall correlation sum.
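The update described above can be sketched in Python and checked against a direct recomputation. This is an illustration under assumptions rather than a transcription of (2): the window is indexed so that position 0 holds the newest sample, the newest and outgoing samples are weighted by the first and last coefficients respectively, and the names `direct_sum`, `changed_coefficients`, and `incremental_sum` are hypothetical.

```python
def direct_sum(window, h):
    """Full correlation: every term is recomputed."""
    return sum(c * x for c, x in zip(h, window))


def changed_coefficients(h):
    """h*[i] is the new polarity h[i] when it differs from h[i-1], else 0.
    Position 0 is handled separately as the incoming sample."""
    return [0] + [h[i] if h[i] != h[i - 1] else 0 for i in range(1, len(h))]


def incremental_sum(prev_sum, new_window, outgoing, h):
    """Update the previous sum using only the terms whose coefficient changed.

    new_window[0] is the newest sample; outgoing is the sample shifted out.
    Each changed term is added twice: once to cancel its old contribution
    and once to add the contribution with the new polarity.
    """
    h_star = changed_coefficients(h)
    changed = sum(c * x for c, x in zip(h_star, new_window))
    return prev_sum + 2 * changed + h[0] * new_window[0] - h[-1] * outgoing
```

Only the nonzero entries of `h_star` require an active adder in this sketch; the remaining terms correspond to bypassed inputs.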
In order to take advantage of the new method of calculating the correlation sum, a specialized adder cell was developed, as shown in Fig. 5. In the case where a coefficient has not changed as a sample is shifted, its particular contribution to the overall sum should be zero. When a term is bypassed, the adder can be configured to ignore its value (using cs), and only pass the other input as the result (using ca or cb). Fig. 5 shows a full-adder surrounded by passgates, and the truth table for the control bits Za and Zb, which are set when an input is zeroed in the calculation.
Fig. 6 shows a slice of the first level of the adder tree, and how the regular adder tree differs from the adder with the bypass cells. In Fig. 6(a), regardless of the coefficients, the data is recomputed on every cycle in every adder. In Fig. 6(b), on the other hand, the arithmetic only performs computations when the coefficients are different. In the case where both inputs to an adder are zero (as determined in Fig. 4), the bypass adder propagates a zero control line into the next level of the tree (in this case the adder has an invalid result, but with a zero signal into the next level of the tree, the sum is not used).
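The propagation of the zero control lines through the tree can be sketched as follows. This is a behavioral illustration only, with hypothetical names: each input carries an "active" flag (the complement of its zero control bit), an adder is counted as on when both inputs are active, as bypassed when exactly one is, and as off when neither is, and an adder's output is flagged active if either input is.

```python
def adder_modes(active):
    """Count on / bypass / off adders per tree level.

    `active` is a list of flags, True where the input term contributes
    to the sum (i.e. its zero control bit is not set).
    """
    levels = []
    flags = list(active)
    while len(flags) > 1:
        counts = {"on": 0, "bypass": 0, "off": 0}
        nxt = []
        for i in range(0, len(flags) - 1, 2):
            a, b = bool(flags[i]), bool(flags[i + 1])
            if a and b:
                counts["on"] += 1        # real addition needed
            elif a or b:
                counts["bypass"] += 1    # pass the single active input
            else:
                counts["off"] += 1       # zero line propagates upward
            nxt.append(a or b)
        if len(flags) % 2:
            nxt.append(bool(flags[-1]))  # unpaired input passes through
        levels.append(counts)
        flags = nxt
    return levels
```

For example, `adder_modes([1, 1, 0, 0, 1, 0, 0, 0])` reports one on, one bypassed, and two off adders in the first level.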
Register File Storage
Another approach to reduce power dissipation is to reduce the activity in the storage area. One of the biggest causes of transition activity in the shift register implementation is that every sample is moved on every clock cycle. A possible approach to reduce the unnecessary activity on the datalines is to use a register file (with pointer) implementation instead of the n-bit wide shift register [18], as seen in Fig. 7. With this scheme, only one register out of the total of $2^m - 1$ will experience clock and output transitions for each new sample. However, a global bus must be connected to each register, increasing the load due to the inputs of all the registers. To create the behavior of a FIFO structure, we use a one-hot address register of length $2^m - 1$, which acts as the clock pulse to the single active destination. In addition, whereas the DSSS coefficients are static for the shift register correlator, the register file structure requires that the coefficients be shifted in a one-bit shift register ring. Different adders are inactive in each clock cycle as runs of coefficients pass the multipliers.
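A behavioral sketch of this scheme is given below, under simplifying assumptions: an integer write pointer stands in for the one-hot address register, the coefficient ring is a Python list rotated by one position per sample, and the class and function names are hypothetical. The reference function models the shift register correlator so that the two outputs can be compared.

```python
class RegisterFileCorrelator:
    """Samples stay in place; the single-bit coefficients rotate instead."""

    def __init__(self, h):
        length = len(h)
        self.regs = [0] * length
        self.ptr = 0                      # stands in for the one-hot address
        # Coefficient seen by register j before the first sample arrives.
        self.coef = [h[(-j) % length] for j in range(length)]

    def step(self, sample):
        self.regs[self.ptr] = sample      # only one register is written
        y = sum(c * r for c, r in zip(self.coef, self.regs))
        self.ptr = (self.ptr + 1) % len(self.regs)
        self.coef = [self.coef[-1]] + self.coef[:-1]   # rotate the ring
        return y


def shift_register_outputs(samples, h):
    """Reference model: every stored sample moves on every cycle."""
    window = [0] * len(h)
    out = []
    for s in samples:
        window = [s] + window[:-1]
        out.append(sum(c * x for c, x in zip(h, window)))
    return out
```

In this sketch each call to `step` writes a single register and moves only one-bit coefficients, while the reference model rewrites the whole window.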
Because the global bus feeds every register in the register file, minimizing the transitions on this bus is important, and the Bus Invert technique [19] is used for this purpose. A large portion of the decoding overhead for Bus Invert can be incorporated into the binary coefficient multiplication units.
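For readers unfamiliar with Bus Invert, the sketch below shows the basic form of the coding as it is commonly described: if more than half of the data lines would toggle, the inverted word is driven together with an asserted invert line. This is an illustration with hypothetical function names and is not taken from the circuit used in this work; the handling of the invert line itself is simplified.

```python
def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_flag) pairs for `words`."""
    mask = (1 << width) - 1
    bus = 0                               # assumed initial bus state
    encoded = []
    for w in words:
        toggles = bin((bus ^ w) & mask).count("1")
        if 2 * toggles > width:           # more than half the lines flip
            bus, inv = (~w) & mask, 1     # send the complement instead
        else:
            bus, inv = w & mask, 0
        encoded.append((bus, inv))
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(b ^ mask) if inv else b for b, inv in encoded]
```

With this rule, no transfer toggles more than half of the data lines, at the cost of one extra invert line.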
CORRELATOR POWER ANALYSIS
Shift Register Correlator
The dynamic power consumed by CMOS circuits is directly proportional to the number of transitions over time, with the energy of a single transition defined as $\frac{1}{2} C_L V_{DD}^2$. Lowering the supply voltage will decrease power consumption quadratically [13], but at a cost of reducing the maximum chip clock rate. The total switching energy can then be determined by summing all switching events (which multiplied by the system frequency will provide power dissipation):
In order to simplify the analysis, we assume a constant logic voltage swing throughout the design, ignore the routing overhead, and consider that each gate provides a unit load to the corresponding driver.
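Under these simplifying assumptions, the power estimate reduces to counting transitions. The sketch below illustrates the arithmetic only; the function names are hypothetical and any load capacitance passed in is a placeholder rather than a value from this work.

```python
def switching_energy(c_load, vdd):
    """Energy of a single transition, (1/2) * C_L * V_DD**2, in joules."""
    return 0.5 * c_load * vdd ** 2


def dynamic_power(transitions_per_cycle, c_load, vdd, freq_hz):
    """Average dynamic power in watts for a uniform unit load per gate."""
    return transitions_per_cycle * switching_energy(c_load, vdd) * freq_hz
```

Because the supply voltage enters squared, halving `vdd` in this model cuts the estimated power by a factor of four for the same transition count and clock frequency.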
The power consumed by the shift register will then have two components proportional to:
• transitions on the register inputs and outputs, 
• clock transitions.
In order to provide an absolute power number, one section of a 6-bit wide shift register chain with the appropriate fanout load was simulated using a transistor level model for the HP 0.5 µm process available through MOSIS. For a 10 MHz clock and a 3.3 V supply, the average power dissipation per 6-bit register for uncorrelated data was 41 µW (average power per bit was 6.83 µW). For our system, the majority of the time for the correlator will be spent looking for a correlation, thus the samples will look like noise. Using (4), the average power dissipation for a shift register of length $2^m - 1$, where $P_{bit}$ is the measured parameter of 6.83 µW for the 1-bit register, n is the sample size, and the filter is running at 10 MHz, becomes:
To find the power dissipation in the adder tree, a 6-bit ripple-carry adder was simulated in the HP 0.5 µm process with uncorrelated data inputs. At a 10MHz data rate, the 6-bit adder consumed 19 µW of power. For an estimate of the entire tree, the 6-bit adder power measurement was bit-sliced, and applied to all the adders in the entire binary tree. Each successive level has half the adders of the last, with the number of bits increased by one (i.e. two 6-bit additions generate a 7-bit result). The overall power dissipation for the adder tree can be approximated with the following relationship:
The total power dissipation of the shift register correlator as a function of filter size is shown in Fig. 8.
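A rough numerical sketch of this estimate is given below. It is an approximation under assumptions stated here and does not reproduce the expressions used for Fig. 8: register power is taken to scale linearly with the number of stored bits, the adder measurement is bit-sliced to a per-bit figure, and the tree is modeled with m levels whose adder counts halve from $2^{m-1}$ while the width grows by one bit per level. All function names are hypothetical.

```python
P_BIT_UW = 41.0 / 6         # measured register power per bit (about 6.83 uW)
P_ADDER_BIT_UW = 19.0 / 6   # 6-bit adder measurement, bit-sliced (assumption)


def shift_register_power_uw(m, n):
    """Storage estimate: (2**m - 1) registers of n bits each."""
    return (2 ** m - 1) * n * P_BIT_UW


def adder_tree_power_uw(m, n):
    """Tree estimate: each level has half the adders, one more bit."""
    total = 0.0
    adders = 2 ** (m - 1)   # approximate first-level adder count
    width = n
    for _ in range(m):
        total += adders * width * P_ADDER_BIT_UW
        adders //= 2
        width += 1
    return total


def correlator_power_uw(m, n):
    """Sum of the storage and arithmetic estimates, in microwatts."""
    return shift_register_power_uw(m, n) + adder_tree_power_uw(m, n)
```

The model ignores routing, control logic, and the slightly irregular tree shape for an odd number of inputs, so it is only useful for comparing trends across filter sizes.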
Bypass Adder Correlator Power Analysis
$P_{SR} \propto \text{bitshifts} + \text{clocktrans} + \text{fanout}$
To estimate the correlator power for the bypass adder, we must first consider the performance of the bypass adder as compared to the simple adder block. When both inputs are active, the bypass adder suffers from the overhead of the passgates. On average, a 6-bit adder in the HP 0.5 µm process has a 5.8% power dissipation increase as measured in SPICE simulations for uncorrelated data. The bypass adder has the overhead of the passgates in the case where it is adding two numbers, but it uses two orders of magnitude less power in cases of bypass or shutdown (the adder is disconnected, and the power is only consumed in charging the bypass lines).
The run property statistics determine how many bypass bits are set, and the particular spread spectrum code dictates where the bypass bits are located. Using these observations and the run property for maximal length sequences, we can generate the expected number of adders in each row that will be in each of the three modes: off, bypassed, or on. Table I summarizes the expected number of bypass adders in each of the three modes (on, off, and bypass) for maximal length codes. In order to provide a fair comparison with the first correlator structure, we must factor in the additional overhead of using the bypass adder trees:
• additional register which holds the previous correlation results,
• additional adder which includes the previous result in the sum.
A plot of the power dissipation for the bypass adder correlator is shown in Fig. 9. 
Register File Correlator Power Analysis
As with the shift register, the transition activity for the register file can be modeled as a function of the code lengths and bus widths. The switching activity in the register file comes from six main components:
• transitions due to the global bus (w/ Bus Invert),
• the fanout of the registers into the correlation block,
• the clocked registers,
• the clocks on the address bit registers,
• the hot-bit shifting through the address register bits,
• the filter coefficients that must now be rotated.
The total number of transitions per cycle becomes: 
When the ratio of the transition activity for the register file (8) to that of the shift register (4) is plotted (Fig. 10) for various filter sizes and bus widths, the register file has a clear advantage over the shift register in reducing the transitions in the sample storage area. As both the filter size and bus width increase, the register file approaches less than one quarter of the transitions in the shift register.
By calculating the power dissipation for a given size shift register (5) and then applying the ratio of transition activity (9), we can generate an estimate of the power dissipation for the register file storage. The register file also significantly changes the data transition statistics of the inputs on the first level of the adder tree. Instead of each adder seeing pseudo-random data, with on average only half the bits changing values, the register file forces every input bit to the adder to change when its corresponding sample changes polarity.
OVERALL COMPARISON
In all, we have presented three variants to realize the correlation block for fast acquisition of an incoming DSSS code. Fig. 12 shows a graph with the normalized power dissipations of each of the variants:
• a shift register storage area with regular binary adder tree,
• a shift register storage with the bypass adder tree configuration,
• the register file implementation with the Bus Invert technique.
The first correlator, the shift register design, was chosen as the baseline case, and the curves in Fig. 12 were all normalized with respect to this correlator. The bypass-adder correlator asymptotically approached a 12% reduction in power over the shift register correlator for large filter sizes. The relationship between the power in the shift register and the bypass adder correlator remained constant across all bus widths. In contrast, the register file correlator has a large power savings over both the shift register and bypass adder correlators. The register file correlator power dissipation shows a strong dependence on the bus width, and at 16 bits it has a 40% reduction relative to the baseline correlator. Even with a 6-bit sample width, the register file has a 30% power reduction.
CONCLUSIONS
The starting point for this research was the shift register implementation; its structure is easy to understand, and follows most directly from the algorithm. The bypass adder technique was explored to try to remove some of the computational expense per cycle. For the first row of the binary adder tree, the bypass technique was effective in shutting down over half of the adders. While large gains were achieved in the first row of the binary adder tree, the lower levels of the adder tree are active in almost every clock cycle. While the bypass adder tree did achieve a significant power reduction in the arithmetic unit, the storage area was still consuming most of the overall power. The most substantial power reduction was achieved by the register file implementation, regardless of the filter size. Coupling the register file correlator with Bus Invert data encoding produced a correlator with over 30% power reduction over the shift register case on bus sizes of 6 or greater.
We have presented several power minimization techniques for a direct sequence spread spectrum correlator working at the chip rate. Depending on the FIFO implementation (shift register or register file), different adder tree solutions are optimal for low power design. When samples are shifted each cycle (as for the shift register FIFO), an adder tree with bypass reduces the overall power by 12%. When the samples are static and only the coefficients are shifted (as for the register file FIFO), a regular adder tree gives the best results, with more than 30% power reduction for most bus widths. The most effective technique found to reduce the dynamic power caused by switching activity was simply moving the data that would cause fewer transitions (the 1-bit coefficients) and keeping the multi-bit samples static.
