Abstract-This paper presents a low-power 128-tap dual-channel direct-sequence spread-spectrum (DSSS) digital matched-filter chip. Design techniques used to reduce the power consumption of the system include a latch-based register-file filter structure, a high-rate compression scheme, optimized compressor cells, and semicustom layout design. To further reduce the power consumption and the hardware requirement of the clock tree, a double-edge-triggered clocking scheme is adopted. The proposed chip is fabricated using a 0.8-µm standard CMOS process. As the experimental results indicate, the matched filter can operate at 50 MHz and dissipates 184 mW at a 5-V supply voltage. The supply voltage can be scaled down to 2 V for lower speed applications. As a consequence, the proposed design has low power consumption and can be used for code acquisition of DSSS signals in portable systems.
I. INTRODUCTION

Spread spectrum is a class of digital modulation techniques that have been widely used in various wireless communication systems, such as the global positioning system (GPS), personal communication systems (PCS), wireless LAN, and second/third-generation cellular phone systems such as IS-95, WCDMA, and CDMA-2000 systems. The wide use of spread-spectrum techniques is due to their outstanding features, such as resistance to jamming or multipath fading environments, high multiple-access capability, low power-spectrum density, and low probability of interception [1], [2].
Manuscript received July 5, 2000; revised February 28, 2001. This work was supported in part by the National Science Council, Taiwan, R.O.C., under Grant NSC89-2219-E-002-019, and by the Ministry of Education, Taiwan, R.O.C., under Grant 89-E-FA06-2-4.
The authors are with the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan 10617 R.O.C. (e-mail: chiueh@cc.ee.ntu.edu.tw).
Publisher Item Identifier S 0018-9200(01)04129-4.
In a direct-sequence spread-spectrum (DSSS) system, the code synchronization problem is typically divided into two processes: code acquisition (coarse or initial synchronization) and code tracking (fine synchronization). During the code acquisition process, the system sweeps all the possible code phase/frequency combinations and picks the one that best matches the incoming signal. This is done by a correlation device performing chip-by-chip coherent integration over a symbol duration [3]. Since there are many phase/frequency pairs to examine, most receivers adopt parallel correlation by using matched filters/convolvers/correlator banks and other smart searching strategies [1]-[3] to reduce the mean acquisition time, which highly affects the performance of packet-switching networks. Noncorrelation-type techniques, such as rapid acquisition by sequential estimation (RASE), recursion-aided RASE (RARASE) [4], threshold decoding with hard/soft decision [5], and the transform domain processing techniques [6], achieve a much shorter mean acquisition time and less hardware complexity. However, their applications are substantially limited due to the lack of multiple-access capability and several rigid operating requirements (moderate to high signal-to-noise ratio (SNR), perfect carrier/bit synchronization before code acquisition).
Recent advances in CMOS technologies make possible various implementation methods of code acquisition. Most attention has been on the implementation of DSSS code acquisition matched filters [7]-[10], through digital, analog, or mixed-mode circuits. Besides ASIC, the digital signal processor (DSP) and the software radio solutions are also possible [11], [12]. Due to increasing demands of portable/battery-operated systems, the issue becomes how to reduce the power consumption of these code acquisition circuits.
In this paper, we propose a digital ASIC design of the DSSS matched filter for code acquisition. We have reduced the power consumption at the architecture, circuit, and layout levels, and thus provide a low-power implementation for DSSS code acquisition. The paper is organized as follows. Section II describes the functionality and the system parameters of the proposed matched filter. Section III focuses on the architecture of the memory part of the filter. Section IV discusses several high-rate compression schemes and the low-power compressors used in the tree additions. A gated pullup mechanism is proposed to reduce the voltage drop across the pass-transistor network in compressor cells. Section V describes the circuit design in detail. Section VI shows the experimental results of the test chip and a comparison with various digital matched-filter implementations, and Section VII gives the conclusions.
II. SYSTEM DESCRIPTION AND IMPORTANT DESIGN PARAMETERS
The proposed DSSS matched filter is composed of two finite impulse response (FIR) filters followed by a post processor (Fig. 1). The two filters evaluate the partial correlation of the DSSS code sequence and the complex incoming signal. Then, the post processor combines the I- and Q-channel filter outputs. When the receiver code sequence and the incoming signal are perfectly aligned, the matched filter outputs a maximum correlation value that indicates a "hit" of the code-phase test. Since the code sequence is used as the tap coefficients, the filter functionality is identical to a bank of correlators that search the consecutive code phases using overlapped samples, and dump the correlation values into an output stream through time multiplexing. Accordingly, the acquisition time speedup factor, compared with the serial search schemes [1], is approximately the tap number divided by 2. On the other hand, the matched filter is not merely useful for code phase test and signal despreading. In multipath fading channels, the magnitude of the peaks within a symbol period reflects the strength of each significant path. This feature can be applied to channel estimation for RAKE reception [1] and pilot detection of the incoming signal [2].
Before discussing the system architecture, three design parameters that affect the system performance and circuit complexity should be determined.
• Input word length.
In an all-digital design, the amplitude quantization leads to a performance loss and a change in the receiver operating characteristics [3] , [13] . The loss can be eliminated almost completely if five or more bits quantization and a proper quantization threshold are used [3] . Taking circuit complexity into account, we adopt 4-bit quantization, which induces about 0.1-dB performance loss in a scalar Gaussian channel.
• Oversampling factor.
The choice of the oversampling factor is not just a tradeoff between performance and circuit complexity. The code phase estimation should be accurate enough to ensure correct operation of the code tracking process and the good multipath resolvability. Moreover, the choice of the oversampling factor dictates the phase resolution in the code tracking loop. Therefore, we choose an oversampling factor of four, which is conservative yet accurate enough for most applications [3] .
• Tap number.
The matched-filter acquisition scheme is sensitive to the carrier/timing frequency offset and the amplitude/phase fluctuation caused by Doppler/flat fading [1] , [14] , [15] . When these channel impairments are present, one should noncoherently combine the output values of several matched filters instead of using long coherent integration time [14] - [16] . Hence, a moderately large tap number of a matched filter is enough. The proposed matched filter has 128 taps, which amounts to a coherent integration over a 32-chip interval. The processing gain of the digital matched filter is about 15 dB. 
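As a quick arithmetic check of the figures quoted above, the following minimal sketch computes the coherent integration length and the corresponding processing gain. It assumes the usual definition of processing gain as 10 log10 of the number of chips integrated coherently; the function names are illustrative and not taken from the paper.

```python
import math


def coherent_chips(taps: int, oversampling: int) -> int:
    """Number of code chips spanned by the filter (taps per chip = oversampling)."""
    return taps // oversampling


def processing_gain_db(chips: int) -> float:
    """Processing gain of coherent integration over `chips` chips, in dB."""
    return 10.0 * math.log10(chips)


chips = coherent_chips(taps=128, oversampling=4)
gain = processing_gain_db(chips)
print(chips, round(gain, 2))  # 32 15.05
```

With 128 taps and an oversampling factor of four, the result matches the 32-chip interval and the roughly 15-dB gain stated in the text.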
III. LOW-POWER FILTER ARCHITECTURE
Filter design is a classical topic in the digital signal processing field. Much literature has focused on high throughput, low latency, and hardware efficient filter structures [17] . It is shown that several structures, such as the transposed form, the hybrid, or the bit-plane structure, either have lower latency than the direct form structure or require shorter word length in the accumulator part. However, in these structures, the pipeline registers and the multiply-and-accumulate (MAC) part are often merged. The whole circuit is operated at the sample rate, and little switching activity reduction can be made. Furthermore, in the transposed form filter, the pipeline registers required are doubled if carry-save additions are adopted. These factors often increase the total power consumption and loading on the clock buffers. To achieve low power consumption while maintaining the same throughput, the direct form structure is more promising and studied in more detail.
A. Register-File-Type Filter Design
In DSSS systems with binary spreading codes, the coefficients of the matched filter are one-bit numbers (+1 or -1), while the inputs are multibit quantized. From this point of view, one can "rotate" the coefficients instead of keeping the tapped delay line always running. This leads to the register-file-based matched-filter approach. The register-file-type filter is a direct-form filter that stores the input samples into a register file while it shifts the coefficients to perform convolutions. Referring to Fig. 2, in the register-file-type filter the coefficients are rotated all the time. A register that matches the location of the first filter coefficient is selected to store the input sample. The register-file-type filter clearly has the same functionality as the tapped-delay-line counterpart. Since only one register is loaded with a new input sample, the main power dissipation is contributed by overhead circuitry (e.g., the global buses and the address decoders of the register file) and the coefficient shift register. It has been shown that the register-file-based matched filters consume less power when the input word length is more than 4 bits and the filter coefficients are binary-valued [18].
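The functional equivalence claimed above can be checked with a small behavioral model. The sketch below is an illustration under stated assumptions, not the chip's implementation: registers start at zero, the write pointer simply counts modulo the tap number, and all names (`tapped_delay_line`, `register_file_filter`) are hypothetical.

```python
from collections import deque


def tapped_delay_line(samples, coeffs):
    """Conventional direct form: every register changes on every sample."""
    n_taps = len(coeffs)
    line = deque([0] * n_taps, maxlen=n_taps)
    out = []
    for x in samples:
        line.appendleft(x)  # shift the whole delay line by one position
        out.append(sum(c * v for c, v in zip(coeffs, line)))
    return out


def register_file_filter(samples, coeffs):
    """Register-file form: one register is written, the coefficients rotate."""
    n_taps = len(coeffs)
    regs = [0] * n_taps
    # rot[j] is the coefficient currently facing register j; initially the
    # first coefficient faces register 0, where the first sample is written.
    rot = deque(coeffs[(-j) % n_taps] for j in range(n_taps))
    out = []
    for n, x in enumerate(samples):
        regs[n % n_taps] = x  # only this register is loaded
        out.append(sum(c * v for c, v in zip(rot, regs)))
        rot.rotate(1)  # the one-bit coefficients move instead of the data
    return out


coeffs = [1, -1, -1, 1, 1, 1, -1, 1]
samples = [3, -2, 7, 0, -8, 5, 1, -4, 6, 2, -1, -7]
print(register_file_filter(samples, coeffs) == tapped_delay_line(samples, coeffs))  # True
```

In the model, the multibit samples stay in place after being written, and only the one-bit coefficient ring toggles on every sample, which is the source of the power saving described in the text.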
It is noteworthy that in the case of oversampling, the register file can be further simplified. Referring to the register designs depicted in Fig. 3 , the conventional positive-edge-triggered flip-flop [ Fig. 3(a) ] is of a master-slave type, while the double-edge-triggered flip-flop [ Fig. 3(b) ] is composed of two latches followed by a 2-to-1 multiplexer. Only one latch is selected at a time to load the input. The multiplexer passing the value of the other latch acts as a "common slave," hence the flip-flop is also a master-slave type design. Similarly, in the register file [ Fig. 3(c) ], the multiplexers are used to select the output registers. If each register element is a master-slave flip-flop, the multiplexer then acts as a "secondary slave," which is redundant and lengthens the propagation delay of the datapath. By proper arrangement of the master clock (generated by the address decoder of the register file) and the slave clock (generated by the sample selector), the slave latch in each register can be removed. Then the master latches and the corresponding multiplexer can be merged to form a new multiphase flip-flop. This yields further reduction in active area and power consumption.
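The double-edge-triggered flip-flop of Fig. 3(b) can be illustrated with a simple behavioral model. This is a sketch under assumptions: the two latches are taken to be transparent on opposite clock levels, the multiplexer always passes the latch that is currently holding, and timing and glitches are ignored. The class name is hypothetical.

```python
class DoubleEdgeFlipFlop:
    """Two level-sensitive latches followed by a 2-to-1 multiplexer."""

    def __init__(self):
        self.latch_low = 0   # transparent while clk == 0
        self.latch_high = 0  # transparent while clk == 1

    def step(self, clk: int, d: int) -> int:
        if clk == 0:
            self.latch_low = d
        else:
            self.latch_high = d
        # The multiplexer selects the latch that is NOT loading the input.
        return self.latch_low if clk == 1 else self.latch_high


ff = DoubleEdgeFlipFlop()
data = [5, 3, 9, 1, 7, 2]
clock = [0, 1, 0, 1, 0, 1]  # the clock toggles once per data sample
outputs = [ff.step(c, d) for c, d in zip(clock, data)]
print(outputs)  # [0, 5, 3, 9, 1, 7]
```

Each clock edge, rising or falling, transfers one sample to the output, so the clock runs at half the sample rate.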
B. Prefiltering and Differential Filter Structures
Consider a $K$-tap FIR filter with $M$-times oversampled input. Let $r(n)$ denote the complex incoming signal, $c(j)$ the $(K/M)$-chip code sequence, and $h(k)$ the filter coefficients. Then the output of the FIR code matched filter is given by

$$y(n) = \sum_{k=0}^{K-1} h(k)\, r(n-k). \qquad (1)$$

Because each chip is held for $M$ samples, $h(k) = c(\lfloor k/M \rfloor)$, and (1) can be regrouped as

$$y(n) = \sum_{j=0}^{K/M-1} c(j) \sum_{m=0}^{M-1} r(n - jM - m). \qquad (2)$$

Alternatively, differencing successive outputs and defining $d(k) = h(k) - h(k-1)$ with $h(-1) = h(K) = 0$ gives

$$y(n) = y(n-1) + \sum_{k=0}^{K} d(k)\, r(n-k). \qquad (5)$$

Equations (2) and (5) imply two quite different structures. In the prefiltering structure corresponding to (2), a short sliding window filter that evaluates the inner summation of (2) is added before the main filter. Using this two-stage summation, the number of multipliers and adders required in the FIR filter is reduced by a factor of $M$, with a $\log_2 M$-bit increase in the tapped-delay-line register width. In the differential structure [10] corresponding to (5), the incoming signal is first convolved with the derivative of the code sequence, then the results are accumulated to generate the filter output. By differentiating the code sequence, the multipliers/adders required are also reduced by a factor of $M$, with a one-bit increase in the filter coefficient resolution. With an oversampling ratio of four, the differential structure requires less hardware than the prefiltering structure. However, in a register-file-type filter, the coefficients are always shifting while the input samples are not. Thus an increase in the coefficient word length implies a higher power dissipation, while an increase in the tapped-delay-line width does not. We therefore adopt the prefiltering structure in the register-file-type matched filter.
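The prefiltering and differential structures described above can be checked numerically against the direct form. The following is a minimal sketch under stated assumptions: the input is zero before time 0, the coefficients are the code sequence with each chip repeated `m` times, and the function names are illustrative rather than the paper's.

```python
def sample(seq, n):
    """Return seq[n], treating everything before time 0 as zero."""
    return seq[n] if n >= 0 else 0


def direct_form(r, code, m):
    h = [c for c in code for _ in range(m)]  # oversampled code as coefficients
    return [sum(h[k] * sample(r, n - k) for k in range(len(h)))
            for n in range(len(r))]


def prefiltering(r, code, m):
    # Stage 1: short sliding-window sum over the m samples of one chip.
    s = [sum(sample(r, n - i) for i in range(m)) for n in range(len(r))]
    # Stage 2: one tap per chip, spaced m samples apart.
    return [sum(c * sample(s, n - j * m) for j, c in enumerate(code))
            for n in range(len(r))]


def differential(r, code, m):
    h = [c for c in code for _ in range(m)]
    padded = [0] + h + [0]
    d = [padded[k + 1] - padded[k] for k in range(len(h) + 1)]  # code derivative
    out, acc = [], 0
    for n in range(len(r)):
        acc += sum(d[k] * sample(r, n - k) for k in range(len(d)) if d[k])
        out.append(acc)  # accumulate the differentiated correlation
    return out, d


code = [1, -1, -1, 1, 1, -1, 1, 1]
r = [((7 * n) % 16) - 8 for n in range(50)]  # arbitrary 4-bit test samples
reference = direct_form(r, code, 4)
diff_out, d = differential(r, code, 4)
print(prefiltering(r, code, 4) == reference, diff_out == reference)  # True True
```

In this model the prefiltered samples are wider than the raw inputs, whereas the differentiated coefficients take values in {0, ±1, ±2} and are nonzero only at chip boundaries, which mirrors the word-length tradeoff discussed in the text.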
IV. ARCHITECTURE AND CIRCUIT DESIGN OF THE TREE ADDITION
A. High-Rate Compression Schemes
Tree-type addition (or compression) schemes are often used to sum a number of binary summands in a carry-save fashion. A common disadvantage of this approach is that many spurious transitions occur due to disparate propagation delays in different signal paths. To solve this problem, one may adopt compressors with higher compression ratios, and match the path delay among compressors [19], [20]. In this scheme, the inputs of each compressor change simultaneously, which implies that the propagation delay of the interconnections between compressors needs to be controlled carefully. Due to the lack of regularity, such a requirement is not realistic in a compressor tree architecture. Nevertheless, the high-rate compression scheme is good for reducing the number of interconnections and the total latency. For a tree adder, the asymptotic reduction factors of the interconnections in the 6-3/4-2 compression schemes, compared to the full-adder-based scheme, are given by (6) and (7) in terms of the number of summands and the number of bits per summand. In our design, the 6-3 compressor is about four times as large as a full adder, so the active area of the 6-3 compression tree is about the same as that of the full-adder-based design. According to Rent's rule and Donath statistics [21], [22], the average length of an interconnection in both architectures will be roughly equal. This implies that up to one-third reduction in wiring capacitance can be achieved. Moreover, the switching activity of the full-adder outputs is computed as the expectation of the symbol transition probability multiplied by the number of changed bits. Assume the input symbols are independent and equally probable, and no spurious transitions are introduced for any input symbol. Then the switching activity of the full-adder outputs is 0.4583, and that of the 6-3 and 4-2 compressors is 0.4955 and 0.4375, respectively.
Taking into account the switching activity and the interconnection capacitance computed in (6) and (7), the 4-2/6-3 compression schemes achieve up to 26%-28% less dynamic power consumption on interconnections than the full-adder compression scheme under the same supply voltage. This part of power consumption is the dominant factor in today's deep-submicron technologies.
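To make the carry-save tree addition of this subsection concrete, the sketch below reduces a set of summands column by column with 6-3 counters until at most two bits remain per column. It is an illustrative model under assumptions: the summands are unsigned, leftover groups of three bits use a full adder, smaller leftovers are passed through, and the arrangement is not claimed to match the compressor tree of the chip. All names are hypothetical.

```python
def bits_to_columns(summands, width):
    """Split unsigned summands into bit columns; cols[w] holds bits of weight 2**w."""
    cols = [[] for _ in range(width)]
    for s in summands:
        for w in range(width):
            cols[w].append((s >> w) & 1)
    return cols


def columns_value(cols):
    """Numeric value represented by the bit columns."""
    return sum(sum(col) << w for w, col in enumerate(cols))


def compress_stage(cols):
    """One carry-save stage: count groups of up to six bits in each column."""
    nxt = [[] for _ in range(len(cols) + 2)]
    for w, col in enumerate(cols):
        for i in range(0, len(col), 6):
            chunk = col[i:i + 6]
            if len(chunk) < 3:
                nxt[w].extend(chunk)  # too few bits to compress, pass through
                continue
            count = sum(chunk)
            nxt[w].append(count & 1)
            nxt[w + 1].append((count >> 1) & 1)
            if len(chunk) > 3:  # a 3-bit group is a full adder (two outputs)
                nxt[w + 2].append((count >> 2) & 1)
    return nxt


def carry_save_sum(summands, width):
    cols = bits_to_columns(summands, width)
    stages = 0
    while max(len(c) for c in cols) > 2:
        cols = compress_stage(cols)
        stages += 1
    # The remaining two rows would go to a carry-propagate adder.
    return columns_value(cols), stages


summands = [(37 * i + 11) % 64 for i in range(33)]
total, stages = carry_save_sum(summands, width=6)
print(total == sum(summands))  # True
```

No carry propagates horizontally inside a stage; each counter only forwards its outputs to the next stage, which is why the delay of a stage does not depend on the word length.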
B. Compressors Based on the Window Detector
Compressor design is the crucial element in the success of the high-rate compression scheme. Most literature focuses on the 4-2 compressor design, which is shown to be more power-saving than cascaded full-adder-based compressors [19] , [20] . The design of higher rate compressors is more challenging because the logic complexity grows exponentially with the number of input bits. In the proposed filter, the high-rate compressors (Fig. 4) are realized by a window-detector-based architecture [23] . Conceptually, the "window detector" is a multiplexer array that converts the input pattern into a one-hot symbol. Then the binary translator controlled by the window detector looks up a binary value corresponding to the one-hot symbol. Owing to the two-dimensional structure of the window detector, the circuit complexity is just a square function of the input bits, thus making the high-rate compressor design feasible [23] .
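The window-detector idea can be sketched behaviorally as follows. This model assumes that each input bit either passes a one-hot vector unchanged or shifts its hot position by one, so that the final hot position equals the number of ones; it is only meant to illustrate the one-hot-then-translate principle and does not reproduce the transistor-level network of [23]. Function names are hypothetical.

```python
from itertools import product


def window_detector(bits):
    """Convert the input pattern into a one-hot symbol of length len(bits) + 1."""
    hot = [1] + [0] * len(bits)  # hot position = number of ones seen so far
    for b in bits:
        if b:
            hot = [0] + hot[:-1]  # this row of multiplexers shifts by one
        # otherwise the row passes the vector straight through
    return hot


def binary_translator(hot):
    """Look up the binary code (LSB first) selected by the one-hot symbol."""
    count = hot.index(1)
    width = (len(hot) - 1).bit_length()
    return tuple((count >> i) & 1 for i in range(width))


def compressor_6_3(bits):
    return binary_translator(window_detector(bits))


print(compressor_6_3([1, 0, 1, 1, 0, 1]))  # (0, 0, 1), i.e., four ones
```

With n inputs, the model uses n rows of n + 1 positions, in line with the square-law complexity growth mentioned in the text.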
C. Gated Pullup Circuit
In Fig. 4, the window detector of the compressor is an n-type pass-transistor network, which introduces voltage drops as in conventional pass-transistor circuits. Due to the body effect, the voltage drops are larger than the intrinsic threshold voltage of the process. Thus the p-transistors in the buffers between the window detector and the binary translator cannot be turned off, which leads to static power dissipation. Moreover, the voltage drops degrade the noise margin and lower the driving capability of the n-transistors in the buffer stage, especially in the case of low supply voltage. To solve the problem, pullup transistors can be attached to the pass-transistor network outputs. Fig. 5(a) and (b) shows the conventional pullup circuit for the lean integration with pass-transistor (LEAP) [24] and the complementary pass-transistor logic (CPL) families. Fig. 5(c) shows the driver stage of the swing-restored pass-transistor logic (SRPL) circuit, which is also a pullup configuration. In each case, the pullup transistor and the buffer stage form a static latch that tends to keep the logic value unchanged. Furthermore, the driving capability of the p-transistors in both the pullup path and the buffer stage should be much smaller than that of the n-type pass-transistor network. However, in the window-detector-based design, the pass-transistor chains are long, and their driving capability is rather small. Thus the transistor aspect ratio has very little margin, making the whole design sensitive to process variation.
In this paper, we propose a "gated" pullup circuit for the long n-type pass-transistor chain in a window detector. This mechanism enables the pullup path only at the positive state transitions. Circuits using gated pullup in the LEAP/CPL family are depicted in Fig. 5(d) and (e). Simulation results of the gated pullup circuits are shown in Fig. 6. Referring to Fig. 6(a), the pullup path is composed of cascaded p-transistors controlled by the complementary outputs of the buffer stage. When the pass-transistor chain delivers a "HIGH" value, one of these p-transistors is turned on before the other is turned off. Then the pullup path charges the pass-transistor chain in a short interval and turns the p-transistors in the buffer stage off (or at least puts them in the deep subthreshold region). Hence the quiescent current (leakage) of the gated pullup buffer is roughly equal to that of the conventional (unconditional) pullup buffer, as shown in Fig. 6(c). On the contrary, when the pass-transistor chain delivers a "LOW" value, one p-transistor is turned off before the other is turned on, and the pullup path is always open. Since current-path competition is unlikely in the gated pullup circuits, the proposed circuit is more robust to process variation. Referring to Fig. 6(b), the falling edge of the output waveform of the pass-transistor chain in the gated pullup design is sharper than that in the conventional pullup design.
Fig. 7 shows the layout design of the window-detector-based 6-3 compressor with gated pullup buffers. The window detector part, realized by a compact n-type pass-transistor network, occupies about 42% of the active area. The gated pullup buffers, composed of 14 inverters and 14 pullup transistors, occupy about 35% of the active area. The 6-3 compressor is three to four times larger than the conventional transmission-gate (TG) adder design. Hence the area of the window-detector-based compressors and that of the cascaded full-adder-based compressors are roughly equal.
D. Comparison of Low-Power Logic Families
The most frequently used components in the matched-filter design are the full adders and the 6-3 compressors. In the circuit design phase, we compare the power-delay characteristics of the two components realized by various logic styles [23]-[26] to find a low-power solution. Fig. 8 shows the schematic of the full adders. In the simulation, a 0.8-µm single-poly double-metal (SPDM) technology is selected. The loading of each full-adder/compressor output is assumed to be the input capacitance of one full adder/compressor. Transistors in each design are sized to minimize the power-delay product at a 2.5-V supply voltage, which is about three times the intrinsic threshold voltage (about 0.85 V). Post-layout simulation results are illustrated in Fig. 9. Among these full-adder cells, the complementary pass-transistor logic (CPL) [24] design with the conventional pullup has the best power-delay characteristic over a wide range of supply voltage (2.0-5.0 V). The outstanding features of CPL mainly come from its short logic chain and good delay matching. The TG full adder is also low power and area efficient, yet is somewhat slower than the CPL adder. The double pass-transistor logic (DPL) [25] full adder is not power efficient due to its large wiring overhead and transistor count. The SRPL [26] full adder has the same pass-transistor network as the CPL full adder, but it possesses a poorer power-delay characteristic. This can be explained by the lack of isolation. Specifically, in the compressors (or tree adders) composed of SRPL adders, the capacitive load in the successive stages is coupled to the pass-transistor network. To solve this problem, one should isolate the pass-transistor network from its load by buffer insertion, although this increases the total power dissipation. As a consequence, for the 6-3 compressor designs, the power-delay characteristics of the window-detector-based designs are superior to those of the designs implemented by cascaded full adders.
The window-detector-based compressors with the gated pullup buffers possess the minimum power-delay products around 2.5-V supply voltage, as shown in Fig. 9 .
V. CIRCUIT DESIGN AND IMPLEMENTATION
A. System Architecture
A more detailed architecture of the digital matched filter is shown in Fig. 10. The tapped delay line of the conventional FIR filter is replaced by a register file. The summation part is composed of the high-rate compressor tree followed by carry-select adders. The signal sample processed by the sliding window filter replaces the oldest value in the register file. Then the values in the register file are selected, multiplied by the filter coefficients (oversampled code sequence), and summed to produce a single channel output. The multipliers, composed of exclusive-OR gates, perform one's complement multiplication of the register file values and the filter coefficients. The high-rate compressor tree performs summation of the 32 one's complement partial products and a constant that compensates for the difference between one's complement values and two's complement values. The computing steps of the compressor trees are shown in Fig. 11. Note that two extra pipeline stages are inserted in the compressor tree to both increase its throughput and balance the path delay. Seven-bit carry-select adders are used to do the final summing. Finally, the two channel outputs are combined by the post processor. A multiplexer is added at the matched filter output for bypassing the post processor unit. When the code phase is determined, the coefficient shifter can be frozen, and the matched filter can be used as a correlator to demodulate the I- and Q-channel bits.
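The exclusive-OR multiplication and the compensating constant can be illustrated with a short sketch. It relies on the two's complement identity -x = ~x + 1, so the correction used here equals the number of -1 coefficients. That choice is an assumption made for illustration: the text only states that a constant is added. Since the coefficients merely rotate, this count stays fixed for a given code. The 6-bit word length and all names are likewise assumed.

```python
WIDTH = 6                  # assumed word length of a register-file value
MASK = (1 << WIDTH) - 1


def encode(x: int) -> int:
    """Two's complement encoding of x in WIDTH bits."""
    return x & MASK


def decode(word: int) -> int:
    """Signed value of a WIDTH-bit two's complement word."""
    return word - (1 << WIDTH) if word & (1 << (WIDTH - 1)) else word


def xor_product(x: int, coeff: int) -> int:
    """One's complement 'multiplication': flip every bit when coeff is -1."""
    flip = MASK if coeff == -1 else 0
    return decode(encode(x) ^ flip)


def correlate(values, coeffs) -> int:
    correction = sum(1 for c in coeffs if c == -1)  # fixed for a given code
    return sum(xor_product(x, c) for x, c in zip(values, coeffs)) + correction


values = [((5 * i + 3) % 64) - 32 for i in range(32)]       # arbitrary 6-bit data
coeffs = [1 if (i * i + i // 3) % 3 else -1 for i in range(32)]
exact = sum(c * x for c, x in zip(coeffs, values))
print(correlate(values, coeffs) == exact)  # True
```

Folding the correction into the tree as one extra summand avoids a separate increment for every negated value, which is consistent with the 32 partial products plus one constant described above.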
The overhead of the filter architecture is an address decoder and a sample selector. Since the hardware is shared by the I and Q channels, this design is fairly efficient. The address decoder is realized by a double-edge-triggered flip-flop ring that propagates the hot bit to generate the "master clock" of the register file (see Fig. 3). Since the data activity of the ring is close to zero, the decoder dissipates little power. Likewise, the sample selector generates the "slave clock" of the register file (see Fig. 3). Since only one register is changed for each incoming signal sample, the input terminals of the other registers can be floated. To minimize the loading of the input signal, we break the input bus into several segments, and use a second decoder to enable one of these segments.
Fig. 11. The 33-summand tree addition using the 6-3 compression scheme.
B. Post Processor
The post processor combines the two filter outputs so that the next stage can perform peak detection. Theoretically, root-mean-square (RMS) or mean-square combining should be used under a Gaussian noise channel. However, since the post processor operates at the sample rate, we adopt an approximation to lower its power dissipation. Equations (8)-(10) can be used to approximate the RMS combining, up to constant factors that can be removed. Specifically, (8) is simple but poor in estimation accuracy. Equation (9) is an alternative form of Robertson's fast amplitude approximation formula [7] that is frequently used as an RMS estimator. Equation (10) is a nested realization of (9) that possesses a fairly flat response over a wide range. The approximation error of the three equations is depicted in Fig. 12. Taking into account hardware cost, power consumption, and the approximation error, we choose to implement the formula in (9).
C. System Simulation Result
Fig. 13 shows the simulation result of the power consumption of different matched-filter designs. In each tapped-delay-line-based design [Fig. 13(a)-(d), (f)], the delay line consumes a significant amount of power. In the register-file-based design [Fig. 13(e)], this part of the power consumption is greatly reduced. With the register file architecture adopted, the power consumption of the tree adder becomes quite significant, necessitating the high-rate compression schemes. According to the simulation results, the power consumption of the 6-3 compression tree adder is 35% less than that of the static CMOS full-adder tree adder. Furthermore, the power consumption of the register file is less than 8% of the power consumed by the tapped delay line, about a 12-fold saving.
D. Chip Implementation
A prototype test chip is implemented to evaluate the power dissipation of the matched-filter design. The chip is fabricated using a 0.8-µm SPDM process and is packaged in a 68-pin package. The core size is m and the transistor count is 52 661. As the die photograph in Fig. 14 shows, the layout of the chip is rectangular. The register file and the multipliers (array of exclusive-OR gates) lie in the center of the chip. The coefficient shifter and the decoders shared by the two channels are tied to the register file and multipliers. The compression trees and the carry-select adders are located beside the coefficient shifter and the decoders. Besides the filter part, the post processor, the sample selector, and the clock buffers are relatively small and are placed toward the right end of the chip. As a whole, the data stream flows from left to right, and the system clock/reset signal runs in the opposite direction to avoid the race problem. In this design, the register file, the multipliers, the address decoder, and the sample selector are fully custom designed, while the high-rate compression trees are placed and routed with tools. The routing channels of the tree modules occupy about half of their active area. This is mainly caused by the intensive use of the second metal layer in the window-detector-based compressor cells, which reduces the wiring resource and the module porosity in a double-metal process.
E. Clock Signal Design
The design of the clock trees in a large synchronous network is critical. Skew and signal integrity of the clock signal affect the correctness of system functionality. Besides, the clock tree usually consumes a significant amount of power, especially in tapped-delay-line filters. In the low-power filter architecture described above, the latch-based register file is substituted for the conventional tapped delay line to reduce the power consumption of the system and the capacitive load of the clock buffer. Consequently, the sample-rate clock buffers need not provide as large a driving capability as those in the tapped-delay-line-based filter design. In order to further reduce the power consumption of the clock tree, pipeline registers in the system are double-edge-triggered [Fig. 3(b)], and the clock buffers operate at half the sample rate. Furthermore, because the address decoder and the sample selector act as the clock sources of the register file, these two modules should be carefully designed. In this chip, the two modules and the register file are fully custom designed to control the wiring delay. Moreover, transistors of the output buffers of all clock sources are sized to provide multiphase nonoverlapping clock signals.
VI. EXPERIMENTAL RESULTS
Table I shows the measured result of the digital matched filter. The power consumption of the pad ring is excluded in this measurement. The functionality of the test chip is verified using a logic analyzer and a pattern generator that provides input patterns with different clock periods (10, 20, 25, 40, 50, 100 ns). At a 5-V supply voltage, the test chip can operate up to 50 MHz. Due to the operating requirements of the standard cells using pass-transistor logic, the power supply voltage of the chip needs to be several times higher than the intrinsic threshold voltage. The experimental results show that the supply voltage must be greater than 2 V to guarantee the correct operation of the chip.
At 2-V supply voltage, the maximum operating frequency of the chip is 10 MHz.
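For a rough feel of what supply scaling buys, the sketch below applies the first-order dynamic power model P = C·V²·f to the 184 mW at 5 V and 50 MHz quoted in the abstract. This is only an extrapolation under the assumption that dynamic switching dominates and that the effective switched capacitance is independent of voltage; it is not a measured value and does not replace the data of Table I. Function names are illustrative.

```python
def switched_capacitance(power_w: float, vdd: float, freq_hz: float) -> float:
    """Effective switched capacitance implied by P = C * V^2 * f."""
    return power_w / (vdd ** 2 * freq_hz)


def dynamic_power(cap_f: float, vdd: float, freq_hz: float) -> float:
    """First-order dynamic power for a given supply voltage and clock rate."""
    return cap_f * vdd ** 2 * freq_hz


c_eff = switched_capacitance(0.184, vdd=5.0, freq_hz=50e6)
p_est = dynamic_power(c_eff, vdd=2.0, freq_hz=10e6)
print(f"{c_eff * 1e12:.1f} pF, {p_est * 1e3:.2f} mW")  # 147.2 pF, 5.89 mW
```

Under these assumptions the power at 2 V and 10 MHz would be about 3% of the 5-V, 50-MHz figure; static current in the pass-transistor buffers and short-circuit current are ignored by the model.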
Next we examine the power consumption contributed by the various modules of the digital matched filter. The result should be quite different from that of the tapped-delay-line-based filter design. As the example in [18] shows, the 8-bit tapped delay line of the filter consumes over 60% of the total power. Using the register-file-based design and a low-power encoding technique, the power consumption contributed by the memory part can be reduced to 44%. In the proposed digital matched filter, we have improved the register file design, and the power consumption should be further reduced. Here the power supply current of each portion of the chip, under different supply voltages and clock rates, is individually characterized, and the power consumption is plotted in Figs. 15 and 16. The memory part of the digital matched filter (i.e., latch-based register file, column decoder, sample selector, and coefficient shift registers) and the post processor consume 30%-40% of the total power. The clock buffers consume about 10% of the total power. The rest of the chip (sliding window filters, tree adders, and carry-select adders) operates at the sample rate, and thus makes up most of the total power consumption.
Table II lists the specifications and design features of four CMOS digital matched-filter chips. In [7], a 64-tap dual-channel matched filter was proposed using a transposed-form systolic array architecture and was fabricated in a 0.7-µm CMOS process. Instead of carry-save addition, binary addition is adopted in each systole. The basic building block is optimized using static cascade voltage switch logic (CVSL) and the differential CVSL with pass-gate (DCVSPG) logic to meet low-power and high-speed constraints in a small silicon area. In [8], a 44-tap dual-channel matched filter is designed using the prefiltering tapped-delay-line filter architecture and is fabricated in a 0.8-µm process. It is a cell-based design with a single clock and conventional positive-edge-triggered flip-flops. The standard cells are designed using static CMOS logic. In [10], a 64-tap single-channel matched filter is realized using the transposed form and the differential filter structure described previously. It is also designed through the cell-based methodology and is fabricated in a 0.8-µm process. For comparison, we normalize the power consumption and the active area by the code length, i.e., the tap number divided by the input oversampling factor. The result shows that under a given supply voltage, the proposed chip can achieve a normalized power saving of 59% and 79% when compared to the chips in [8] and [7], respectively.
VII. CONCLUSION
In this paper, we proposed a low-power DSSS digital matched-filter design. The low-power design techniques include the architecture design, circuit optimization, double-phase clocking strategy, and layout design. The latch-based register file is proposed to reduce both the power and the active-area requirement of the filter part. For partial product summation, the 6-3 compression scheme and the window-detector-based compressor are adopted. A robust gated pullup mechanism is proposed to improve the power-delay characteristic of the window-detector-based compressor. To further reduce the power consumption of the clock tree and clock buffers, double-edge-triggered flip-flops are used in all pipeline registers. Finally, a test chip is implemented using a 0.8-µm SPDM process. The measurement results show that the test chip can operate up to 50 MHz under a 5-V supply voltage, and the supply voltage can be scaled down to 2 V. Compared with the previous works [7], [8], a reduction of about 59%-79% in normalized power is achieved.
