Multiband orthogonal frequency-division multiplexing (MB-OFDM) systems employ frequency hopping to achieve multiple access and frequency diversity. However, frequency hopping also complicates the packet detector (PD), which then requires high hardware complexity. In this paper, we propose several low-cost design schemes for the PD, such as Walsh-Hadamard decomposition, buffered summation, and sign-bit-remaining methods. The estimated gate count of the resulting PD is less than half that of existing solutions.
1. INTRODUCTION

Multiband orthogonal frequency-division multiplexing (MB-OFDM) technology (also called time-frequency interleaved OFDM) has recently been applied to wireless personal area networks [1] [2]. This technique increases both the traffic capacity and the frequency diversity. Furthermore, MB-OFDM employs several types of time-frequency code (TFC) in the preamble so that the packet properties and frequency-hopping sequences can be identified easily. However, even though MB-OFDM can leverage the successful experience of other OFDM systems, such as wireless local area networks (WLANs) [3] [4], there are still many differences between MB-OFDM and conventional OFDM systems, and these differences are often the bottlenecks of system performance. As illustrated in the dashed box of Fig. 1, the potential bottlenecks of the receiver front-end are as follows:

1) Feedback control to the radio frequency circuit. Controlling the hopping of the radio frequency (RF) circuit requires a feedback path from baseband to RF. The building blocks of the analog/RF circuit and the baseband front-end are shown in Fig. 1, where $z^{-D}$ and $z^{-P}$ denote the latencies of the analog-to-digital converter (ADC) and the packet detector (PD), and $z^{-R}$ and $z^{-C}$ are the processing delays of the controller and the RF circuit. The processing delay of the feedback loop results in an iteration bound [7] that limits the system operation speed and performance.

2) Large hardware complexity. The receiver should identify different TFCs by detecting the preamble symbols of received packets. Hence, the order of the hardware complexity is proportional to both the frequency diversity and the types of preamble.

For [1] with a 1.89-ns sampling period), and P=4 (by the unfolding 4 method [7]). Thus, the OLA delay T is less than 22, which does not meet the required OLA period T of channels CM3 and CM4 [4]. Therefore, the long latency of the feedback loop could degrade the performance for long-duration channels.
In Section 3, we derive a PD algorithm that alleviates the above problems. The proposed PD structure is shown in Fig. 2. The upper block in the figure is the matched filter, an FIR filter whose coefficients are the preamble coefficients, and the lower block is the power meter, which provides the power sum of the tap data. Before deriving the algorithm, we explain the basic properties of the preamble coefficients of the MB-OFDM system.
We define the coefficient vector as
where the $c_i$ are the preamble coefficients. The inner product of the coefficient vector is normalized as [1]

$C^H C = N$. (3)
Because zero-padding signaling is adopted in the MB-OFDM system, the transmitted OFDM symbol can be expressed as the following vector form:
where $((n))_W$ denotes the remainder of $n$ divided by $W$. From the quasi-orthogonal properties of the preamble coefficients, the cross-correlation between the coefficient vector and the transmitted symbol vector is
and K7 =_ 0))w (7) where $\delta(\cdot)$ is the delta function, $0_L$ is the zero vector of length $L$, and $v(\cdot)$ is the small sidelobe signal, which can be neglected because
To simplify the derivation, we assume that the carrier frequency offset (CFO) is small, so the received signal of the preamble part can be expressed in the following matrix form [5]:

5 bits to fully meet the precision of the original specification. This long word length increases the complexity of both the transmitter and the receiver. We reduce the word lengths (and hence the hardware requirements) using a canonical signed digit (CSD) representation on the transmitter side and a word-length-truncation method on the receiver side.
Assume that the ideal received signal is the vector $C$ defined in Eq. (2) and that the truncated coefficient vector of the matched filter is $C_{trun}$. Table 2 lists the matched-filter output $C^H C_{trun}$ for different truncated word lengths of the preamble coefficients. For the full word length, the value of $C^H C_{trun}$ is $N$ (128), as shown by Eq. (3). The table shows that if only one bit (the sign bit) is kept for each coefficient, the orthogonality property still holds and the loss at the matched-filter output is only 0.778 dB.

Table 2 The minimal matched-filter output for all TFCs for different truncated word lengths of the preamble coefficients.
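The effect of keeping only the sign bit can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not a reproduction of Table 2: the coefficients are random real placeholders (not the actual preamble coefficients), the loss is assumed to be the ratio of the full-precision output to the sign-only output expressed as 20 log10, and all function names are hypothetical.

```python
import math
import random


def normalize(coeffs, n):
    """Scale real coefficients so that their inner product equals n, as in Eq. (3)."""
    energy = sum(c * c for c in coeffs)
    scale = math.sqrt(n / energy)
    return [c * scale for c in coeffs]


def sign_only(coeffs):
    """Keep only the sign bit: every coefficient becomes +1 or -1."""
    return [1.0 if c >= 0 else -1.0 for c in coeffs]


def matched_output(c, c_trun):
    """Matched-filter peak for an ideal received vector c (real-valued inner product)."""
    return sum(a * b for a, b in zip(c, c_trun))


def loss_db(c, c_trun):
    """Assumed loss definition: full-precision peak over truncated peak, in dB."""
    return 20.0 * math.log10(matched_output(c, c) / matched_output(c, c_trun))


N = 128
rng = random.Random(0)
# Hypothetical stand-in for one preamble coefficient vector.
c = normalize([rng.gauss(0.0, 1.0) for _ in range(N)], N)

print("full-precision peak:", matched_output(c, c))
print("sign-only peak:     ", matched_output(c, sign_only(c)))
print("loss in dB:         ", loss_db(c, sign_only(c)))
```

With sign-only coefficients the peak equals the sum of the coefficient magnitudes, which cannot exceed $N$ under the normalization of Eq. (3), so the loss computed this way is never negative. The value obtained for random placeholders will generally differ from the 0.778 dB reported for the actual preamble.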
4.2 W-H decomposition and matched-filter retiming method
Because only the sign information of the preamble coefficients needs to be kept in the matched filter with slight degradation, we can decompose each coefficient into two W-H subsequences [6] as follows:
$C_i = A_i \otimes B_i, \quad i = 1, \ldots, 6$ (26)

where $\otimes$ denotes the W-H product, $C_i$ is the coefficient vector of the $i$th of the 6 TFCs, and $A_i$ and $B_i$ are the factorized coefficients with 16 and 8 taps, respectively. After applying W-H decomposition, the total order of the matched filter is reduced from 128 to 16+8 (i.e., a W-H hardware reduction factor of 128/(16+8)). The hardware circuit of the decomposed matched filter is shown in Fig. 3(a). We retime the FIR matched filter into the data-broadcasting type [7] for consistency with the operating speed of 528 MHz. In addition, we precalculate the 1's complement of the input signal to be used in each processing unit (PU), as shown in Fig. 3(b), where $S_k$ is the $k$th coefficient (only one bit) of $A_i$ or $B_i$. To further reduce the hardware complexity and avoid an adder in the 2's complement converter (for multiplying by -1), we map the multiplier coefficients +1 and -1 onto 0 and 1, respectively. The adder in a traditional 2's complement converter (the 2's complement is the 1's complement plus 1) is combined with the adder in the branch summation, as shown in Fig. 3(c), where the carry-in of the adder performs the "plus 1". With the retiming method, each PU requires only one adder and no multiplier.

4.3 Buffered summation to reduce the squaring circuit

Fig. 2 shows the 128 squaring circuits that are utilized to calculate the input signal power.
However, we use the buffered-summation method to reduce the hardware complexity in a real implementation, as shown in Fig. 4 . When new data arrive, we add the new squared data into the accumulator and subtract the oldest squared data from the accumulator. This implementation requires only two squaring circuits and two adders (one adder for adding new data and the other for subtracting the oldest data). 
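The two simplifications described above, the W-H-decomposed matched filter and the buffered-summation power meter, can each be checked against its direct form. The following Python sketch is only an illustration under stated assumptions: the W-H product is taken to be a Kronecker product with the 16-tap sequence as the outer factor, the +1/-1 sequences `a` and `b` and the input samples are random placeholders rather than the actual preamble coefficients, and all function names are hypothetical.

```python
import random


def kron(a, b):
    """Kronecker product of two sequences: out[len(b) * i + j] = a[i] * b[j]."""
    return [ai * bj for ai in a for bj in b]


def correlate(x, taps, stride=1):
    """Sliding correlation y[n] = sum_k taps[k] * x[n + k * stride] (full overlap only)."""
    span = (len(taps) - 1) * stride
    return [sum(t * x[n + k * stride] for k, t in enumerate(taps))
            for n in range(len(x) - span)]


def matched_direct(x, a, b):
    """Direct matched filter with all len(a) * len(b) taps."""
    return correlate(x, kron(a, b))


def matched_decomposed(x, a, b):
    """Cascade of a len(b)-tap stage and a len(a)-tap stage with taps spaced len(b) apart."""
    return correlate(correlate(x, b), a, stride=len(b))


def power_direct(x, window):
    """Power meter that squares every sample in the window for each output."""
    return [sum(v * v for v in x[n:n + window])
            for n in range(len(x) - window + 1)]


def power_buffered(x, window):
    """Buffered summation: add the newest square, subtract the oldest square."""
    acc = sum(v * v for v in x[:window])  # initial fill of the window
    out = [acc]
    for n in range(window, len(x)):
        acc += x[n] * x[n]                    # squaring circuit for the new sample
        acc -= x[n - window] * x[n - window]  # squaring circuit for the oldest sample
        out.append(acc)
    return out


rng = random.Random(1)
a = [rng.choice((-1, 1)) for _ in range(16)]   # placeholder 16-tap sequence
b = [rng.choice((-1, 1)) for _ in range(8)]    # placeholder 8-tap sequence
x = [rng.randint(-15, 15) for _ in range(400)]  # placeholder 5-bit-like input samples

print(matched_direct(x, a, b) == matched_decomposed(x, a, b))
print(power_direct(x, 128) == power_buffered(x, 128))
```

Both comparisons use integer arithmetic, so the direct and simplified forms agree exactly. The decomposed filter evaluates 16+8 taps per output instead of 128, and the buffered power meter performs two squarings per output instead of 128.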
To relax the RF-baseband iteration bound as shown in Eq. (1), we approximate the matched-filter output as follows:
where $Q$ is the relaxation factor and $k$ is the preamble symbol index. The principle of operation is demonstrated in Fig. 5. Fig. 5(a) shows the output waveform of the received data and the original output of the matched filter. If a preamble sequence is received, the matched filter of the PD will not detect the preamble symbol until the full FFT window of length $N$ is filled. In contrast, Fig. 5

Table 3(a), where $N$ is the number of points in the FFT, $F_D$ is the W-H hardware reduction ratio, $N_C$ is the original word length of the preamble coefficients in the matched filter, $F_T$ is the word-length-reduction ratio of the preamble coefficients, and $N_X$ is the input data word length. We assume that the complexity of an adder depends on its input width and that the complexity of a multiplier is the product of the word lengths of its two inputs. Substituting numerical values into Table 3(a) gives Table 3(b), where $F_D = 128/(8+16)$, $F_T = 5$, $N = 128$, $N_X = 5$, and $N_C = 5$. This indicates that the final gate count of the combination circuit is only 2.28% that of the direct implementation. In addition, the hardware complexity is only half that in [8], where the cost-down approach reduces the number of taps by a factor of 4, which degrades the performance by 6 dB (i.e., $10 \log_{10} 4$ dB).
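Two of the figures quoted above follow from simple arithmetic and can be reproduced in a few lines of Python. This sketch covers only the W-H reduction ratio and the loss caused by a fourfold tap reduction. It does not attempt to reproduce the 2.28% gate-count ratio, which depends on the entries of Table 3. The variable names are hypothetical.

```python
import math

N_TAPS = 128               # matched-filter length before decomposition
A_TAPS, B_TAPS = 16, 8     # lengths of the two factorized sequences

# W-H hardware reduction ratio F_D = 128 / (8 + 16)
f_d = N_TAPS / (A_TAPS + B_TAPS)

# Loss from cutting the number of taps by a factor of 4: 10 * log10(4) dB
TAP_REDUCTION = 4
tap_loss_db = 10.0 * math.log10(TAP_REDUCTION)

print(f"F_D = {f_d:.2f}")
print(f"tap-reduction loss = {tap_loss_db:.2f} dB")
```

The script gives a reduction ratio of about 5.33 and a loss of about 6.02 dB, consistent with the 6 dB stated for the approach in [8].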
We have introduced several low-cost implementation methods to reduce the area and power requirements. To clarify the performance degradation caused by each low-cost method, we define PDs I-IV to enable a fair comparison. The methods used in PDs I-IV are listed in Table 4, which also gives the related MCRP and TR. The table indicates that the TR is degraded by 8.1 (i.e., 18.2-10.1) after the fixed-point implementation. In addition, there is almost no degradation for the W-H decomposition (PD III) relative to the fixed-point version (PD II). The TR is further decreased by 2.7 (i.e., 9.8-7.1) after the implementation of the PNCR in PD IV. However, all of the PDs meet the stringent specification that the MCRPs are all larger than $1-10^{-1}$.

5. CONCLUSIONS

We have described the difficulties in both the algorithm and the hardware implementation. We propose a PD that requires only 2.28% of the gates relative to the direct implementation method and 50% of the hardware complexity relative to the existing solution described in [8]. Moreover

denote using/not using the method.

Table 4 PD hardware schemes and related implementation losses.
