In this paper, a low-complexity demodulation scheme is proposed for IEEE 802.15.4 LR-WPAN systems. In the proposed scheme, the multiple active correlators used for demodulation are replaced with a matched-filter-based cross-correlator shared with the synchronization unit, which requires only a few additional control units. Consequently, with the proposed demodulation scheme, the total hardware complexity of the LR-WPAN baseband modem is 54% lower than that of the existing architecture, which has separate demodulation and synchronization units.
Introduction
IEEE 802.15.4 is a newly developed standard dedicated to low-rate (≤ 250 kbit/s) wireless personal area networks (LR-WPANs) used for low-power applications such as Zigbee or wireless sensor networks [1, 2] . Since the synchronization unit and the demodulation unit are the most complex parts of the baseband modem, and IEEE 802.15.4 LR-WPAN systems should have low complexity, these units should be implemented as efficiently as possible. However, because the two units have different structures, they cannot share any hardware components, and the complexity of the receiver increases significantly.
In this paper, a new demodulation scheme is proposed in which the active correlators of the demodulation unit are replaced with the matched filter used for the synchronization process. In addition, an efficient shared architecture for the synchronization and demodulation units is proposed for low-complexity implementation of an IEEE 802.15.4 LR-WPAN baseband modem.
Synchronization and demodulation schemes for IEEE 802.15.4 LR-WPAN systems
The IEEE 802.15.4 standard defines three blocks for a transmitter working in the 2.45 GHz band [1] . The three blocks are shown in Fig. 1 . A bit stream from the medium access control (MAC) layer is passed to a bit-to-symbol unit (BSU), which forms 4-bit symbols at the bit rate of 250 kHz (f B ).
A symbol-to-chip unit (SCU) spreads each 4-bit symbol into one of 16 nearly orthogonal pseudo-random noise (PN) sequences (i.e. direct sequence spread spectrum (DSSS)), each of which consists of 32 PN code chips that are fed serially into an offset quadrature phase-shift keying (O-QPSK) modulator at the chip rate of 2 MHz (f C ). The chip sequence is modulated onto the carrier using O-QPSK modulation. The O-QPSK modulator oversamples the PN chip sequence for pulse shaping and for accurate synchronization in the receiver. Therefore, it operates at the over-sampling rate, f O = R O × f C (R O is the over-sampling factor, and is set to two in this paper).
In a receiver, a synchronization unit (SU) is necessary in order to align the received signal with a locally generated PN sequence; it works at f O for accurate synchronization. Therefore, there are N O samples (R O samples per PN chip) in the symbol time interval, T S , for synchronization. Since the cross-correlation for synchronization must be carried out within a very short time interval, T O = 1/f O , the SU should employ a passive correlator, called a matched filter, which has high complexity. The matched filter consists of a tapped delay line and coefficient registers that contain the 32 code chip values of PN sequence 0 [3] . The reference signal in the coefficient registers is cross-correlated with the received signal in the tapped delay line, which shifts the received signal samples right at every T O .
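The matched-filter operation described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the samples are real-valued and bipolar, O-QPSK pulse shaping and the double correlation of [4] are ignored, and the chip values of PN sequence 0 are quoted from the symbol-to-chip table of the standard and should be checked against [1]. The function names are hypothetical.

```python
from collections import deque

# Chip values of PN sequence 0 (assumed from the standard's chip table; verify against [1]).
PN0 = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
R_O = 2  # over-sampling factor used in this paper


def oversampled_reference(chips, r_o=R_O):
    """Map chips to +1/-1 and repeat each chip r_o times (coefficient registers)."""
    return [1 if c else -1 for c in chips for _ in range(r_o)]


def matched_filter_outputs(samples, reference):
    """Shift one sample into the tapped delay line per T_O and correlate every time."""
    delay_line = deque([0] * len(reference), maxlen=len(reference))
    outputs = []
    for x in samples:
        delay_line.append(x)  # the oldest sample drops out, the newest enters
        outputs.append(sum(d * c for d, c in zip(delay_line, reference)))
    return outputs


def find_sync_index(samples, reference):
    """Return the sample index at which the correlation magnitude peaks."""
    outputs = matched_filter_outputs(samples, reference)
    return max(range(len(outputs)), key=lambda i: abs(outputs[i]))
```

Because a full correlation over all taps is produced at every sample instant, the filter needs one multiplier-adder path per tap, which is the source of the high complexity mentioned above.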
After synchronization, a demodulation unit (DU) detects the 4-bit symbol from the received signal. Since non-oversampled signals are sufficient for demodulation, the DU operates at f C rather than f O , which means that N C samples (one sample per PN chip) are used to despread one symbol. The DU carries out the cross-correlation between the 16 PN sequences and the aligned received signal at every T S in order to decide the transmitted symbol among the 16 candidates. Therefore, unlike the synchronization process, 16 correlators are needed. Since the time interval available for demodulation, T S , is sufficiently long, each correlator can be implemented as an active correlator with low hardware cost. Finally, a symbol-to-bit unit (SBU) serializes the demodulated 4-bit symbol and passes it to the MAC layer. In Fig. 1 , most multipliers in the receiver are trivial ones, which can be implemented with several adders and shifters because the reference signal is constant. However, the multipliers that calculate the magnitude value are non-trivial ones that require complex hardware resources.
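The rates and sample counts used in this section follow from the figures quoted above (f B = 250 kHz, 4 bits per symbol, 32 chips per symbol, R O = 2). The short sketch below simply tabulates them; the function name and dictionary keys are illustrative, not taken from the standard.

```python
def lr_wpan_rates(f_b=250e3, bits_per_symbol=4, chips_per_symbol=32, r_o=2):
    """Derive the chip rate, over-sampling rate and per-symbol sample counts."""
    symbol_rate = f_b / bits_per_symbol       # symbols per second
    f_c = symbol_rate * chips_per_symbol      # chip rate
    f_o = r_o * f_c                           # over-sampling rate
    return {
        "symbol_rate": symbol_rate,
        "f_c": f_c,
        "f_o": f_o,
        "t_s": 1.0 / symbol_rate,             # symbol time interval T_S
        "t_o": 1.0 / f_o,                     # sample interval T_O
        "n_c": chips_per_symbol,              # samples per symbol for demodulation
        "n_o": r_o * chips_per_symbol,        # samples per symbol for synchronization
    }


if __name__ == "__main__":
    for name, value in lr_wpan_rates().items():
        print(f"{name}: {value}")
```

With the default arguments this gives a 2 MHz chip rate, a 4 MHz over-sampling rate, N C = 32 and N O = 64.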
In this paper, we assume a non-coherent receiver with the double correlation technique [4] , which is known as an efficient method for cancelling the effect of a frequency offset. The equation for double correlation, which is used for both synchronization and demodulation, is given by
where r k (n) = r(kN + n) denotes the n-th sample of the k-th received signal, and s m (n) denotes the n-th sample of the m-th reference signal. N and D denote the number of samples to correlate and the delay for double correlation, respectively. For better performance and lower complexity, N and D in Eq. (1) are set to different values for the two operations, synchronization and demodulation, as follows:
where N T denotes the number of samples in a 2T C time interval for synchronization.
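To illustrate why a double correlation cancels a frequency offset, the sketch below assumes one common differential form: each despread sample r(n)s*(n) is multiplied by the conjugate of the despread sample D positions earlier, and the magnitude of the sum is taken. This form is an assumption made for illustration and may differ in detail from Eq. (1) and from [4]; the function names are hypothetical.

```python
import cmath


def double_correlation(r, s, delay):
    """Assumed differential form: |sum_n [r(n)s*(n)] [r(n-D)s*(n-D)]*|."""
    acc = 0j
    for n in range(delay, len(s)):
        current = r[n] * s[n].conjugate()
        delayed = r[n - delay] * s[n - delay].conjugate()
        acc += current * delayed.conjugate()
    return abs(acc)


def plain_correlation(r, s):
    """Ordinary correlation magnitude, for comparison."""
    return abs(sum(x * y.conjugate() for x, y in zip(r, s)))


def apply_frequency_offset(s, omega):
    """Rotate sample n by omega*n radians to model a carrier frequency offset."""
    return [x * cmath.exp(1j * omega * n) for n, x in enumerate(s)]
```

With a frequency offset, each product in the double correlation carries the same constant phase (omega times D), so the terms add coherently, whereas the terms of the plain correlation rotate against each other and partly cancel.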
Proposed demodulation scheme
The matched filter for demodulation completes a cross-correlation within T C , unlike the active correlator, which produces its result only at every T S , and the DU needs 16 cross-correlation results at every T S . Therefore, the DU can be implemented with one matched filter if the matched filter is reused 16 times during T S to calculate the cross-correlations (i.e. T S = 16 × T C ). In short, the proposed demodulation process finds the index with the maximum correlation value among the 16 correlation results calculated successively by one matched filter in T S . For this, the matched filter should calculate the cross-correlation with a different reference at every T C , which means that the reference in the matched filter must be changed successively. The PN sequences in the IEEE 802.15.4 standard are related to one another through cyclic shifts and/or conjugation (i.e. inversion of odd-indexed chip values) [1] . Using this property, the PN sequence of the next index can be generated by cyclically shifting the PN sequence of the previous index by four chips, except in the case of PN sequence 8. The generation of PN sequence 8 requires the inversion of the odd-indexed chip values in addition to the shift of PN sequence 7. Consequently, all of the PN sequences can be generated with shift registers and inverters if PN sequence 0 is known.
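The shift-and-invert rule above can be checked with a short sketch that derives all 16 PN sequences from PN sequence 0. The chip values of sequence 0 are quoted from the symbol-to-chip table of the standard and should be verified against [1]; the function names are illustrative.

```python
# Chip values of PN sequence 0 (assumed from the standard's chip table; verify against [1]).
PN0 = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]


def next_pn(seq, index):
    """Generate PN sequence index+1 from PN sequence index."""
    shifted = seq[-4:] + seq[:-4]  # cyclic shift by four chips
    if index == 7:
        # Going from sequence 7 to sequence 8 also inverts the odd-indexed chips.
        shifted = [c ^ 1 if i % 2 else c for i, c in enumerate(shifted)]
    return shifted


def all_pn_sequences():
    """Return the 16 PN sequences, starting from PN sequence 0."""
    seqs = [list(PN0)]
    for index in range(15):
        seqs.append(next_pn(seqs[-1], index))
    return seqs


if __name__ == "__main__":
    for m, seq in enumerate(all_pn_sequences()):
        print(m, "".join(str(c) for c in seq))
```

Eight shifts of four chips return the register to PN sequence 0, so the step from sequence 7 to sequence 8 amounts to PN sequence 0 with its odd-indexed chips inverted.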
Architecture for unified synchronization and demodulation unit
With the proposed demodulation scheme that is based on the matched filter, the synchronization and demodulation units can be unified into one hardware unit, called the unified synchronization and demodulation unit (USDU). Fig. 2 illustrates a block diagram of the proposed receiver architecture.
As shown in Fig. 2 , the proposed receiver contains the USDU and the SBU. The SBU has a single operating frequency, f B , whereas the USDU has two operating frequencies, f O and f C . In the synchronization mode, the USDU performs the general operation of the matched filter at f O . After the synchronization process ends, the unit works for demodulation at f C (the mode is chosen by the SYNC/DEMOD select signal). In this mode, its operation is somewhat different from the general operation of the matched filter. The matched filter is divided into two parts: part I, containing the delay line D1 and the coefficient registers S1, and part II, containing D2 and S2. The two parts alternate between a storing mode and a calculating mode over two time intervals, T 1 and T 2.
For T 1, part I is in a storing mode. d in , the received signal multiplied by r k (n − N C ) and s * (n − N C ), is inserted into d[N O − 1], and D1 is shifted by one chip at every T C . The T 1/T 2 select signal makes each register in D1 take the value of the previous register and makes the registers in S1 hold PN sequence 0. Meanwhile, the cross-correlation is calculated in part II, i.e. part II is in a calculating mode. The T 1/T 2 select signal also makes D2 hold the values stored during the previous T 2 (when part II was in a storing mode) and makes S2 be cyclically shifted by four chips and/or conjugated at every T C . The 7T C signal is set only at the time interval 7T C , for the conversion from PN sequence 7 to PN sequence 8. The MAG block in Fig. 2 uses the result of A2 as its input.
For T 2, the roles are exchanged. The T 1/T 2 select signal is set to T 2, so part I is in a calculating mode and part II is in a storing mode. In this case, the cross-correlation is calculated in part I. The MAG block in Fig. 2 uses the result of A1 as its input instead of A2, which is used for T 1, and the cyclic shift and/or inversion of S1 is carried out while D1 is held fixed. In part II, d in is inserted into d[N O /2 − 1], D2 is shifted by one chip at every T C , and S2 is fixed to PN sequence 0. Fig. 3 illustrates the operations of S1 and S2 in a calculating mode.
Fig. 3 . Operation of S1 and S2 in a calculating mode
As shown in Fig. 3 , S1 or S2 in a calculating mode is cyclically shifted by four chips at every T C . However, an additional inversion is required at 7T C because the conversion from PN sequence 7 to PN sequence 8 needs the conjugation of the shifted PN sequence 7.
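The calculating mode can be illustrated with a behavioural sketch in which one correlator is reused 16 times on a stored symbol while the reference register is updated between uses. This is a simplified model, not the hardware of Fig. 2 : the chips are real-valued and bipolar, a plain correlation magnitude stands in for the double correlation and the MAG block, and the chip values of PN sequence 0 are quoted from the standard's table and should be verified against [1]. All function names are hypothetical.

```python
# Chip values of PN sequence 0 (assumed from the standard's chip table; verify against [1]).
PN0 = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]


def update_reference(ref, step):
    """Cyclically shift by four chips; at step 7 also negate the odd-indexed chips."""
    ref = ref[-4:] + ref[:-4]
    if step == 7:
        ref = [-c if i % 2 else c for i, c in enumerate(ref)]
    return ref


def spread(symbol):
    """Return the bipolar chip sequence of the given symbol index (0 to 15)."""
    ref = [1 if c else -1 for c in PN0]
    for step in range(symbol):
        ref = update_reference(ref, step)
    return ref


def demodulate(stored_chips):
    """Reuse one correlator 16 times and return the index of the largest magnitude."""
    ref = [1 if c else -1 for c in PN0]  # the reference register starts at PN sequence 0
    best_index, best_mag = 0, -1
    for step in range(16):
        mag = abs(sum(d * c for d, c in zip(stored_chips, ref)))
        if mag > best_mag:
            best_index, best_mag = step, mag
        ref = update_reference(ref, step)  # prepare the next reference
    return best_index
```

For example, `demodulate(spread(11))` returns 11; in this model the decision also survives a small number of chip errors, because the correlation with the correct sequence stays well above the others.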
The USDU has almost the same hardware architecture as the SU in the existing system shown in Fig. 1 . A few trivial functional blocks for multiplexing, cyclic shifts, and inversions are added in the USDU compared with the SU in the existing system. Consequently, it is possible to make the synchronization block perform both synchronization and demodulation processes with only a few additional hardware resources.
Comparison results
As shown in Sections 2 and 3, since there is no difference in the functional equation between the existing receiver and the proposed one, they show exactly the same performance. However, because the proposed architecture uses the USDU, it achieves a significant improvement in hardware complexity.
The proposed architecture reduces the number of non-trivial multipliers by 83% (from 36 to 6) and the number of adders by 23% (from 338 to 259); consequently, the logic gate count is reduced by 54% (from 59 K to 27 K) relative to the existing hardware architecture. The results are obtained from implementations with a 0.18 µm CMOS standard cell library.
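The quoted percentages follow directly from the resource counts given above, as the short check below shows. The helper name is illustrative.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Resource counts quoted in the text: (existing architecture, proposed architecture).
FIGURES = {
    "non-trivial multipliers": (36, 6),
    "adders": (338, 259),
    "gate count (K)": (59, 27),
}

if __name__ == "__main__":
    for name, (before, after) in FIGURES.items():
        print(f"{name}: {reduction_percent(before, after):.1f}% reduction")
```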
Conclusion
In this paper, a new demodulation scheme was proposed in order to reduce hardware complexity through the unification of the SU and the DU. In the proposed demodulation scheme, a matched filter whose coefficients are changed at every T C was used instead of 16 active correlators. As a result, the DU could be implemented by reusing the hardware of the SU with only trivial additional functional blocks, and the hardware complexity was decreased by 54% compared with the conventional scheme without performance degradation. Therefore, the proposed demodulation scheme is well suited to IEEE 802.15.4 LR-WPAN systems, for which a low-complexity design is important.
