Abstract-The wide unlicensed bandwidth of a 60 GHz channel presents an attractive opportunity for high-data-rate, low-power personal area networks (PANs). The use of single-carrier modulation can yield energy-efficient transmitter and receiver implementations, but equalization of the long channel response in non-line-of-sight (NLOS) conditions presents a significant challenge. A digital equalizer for 60 GHz channels has been designed for both line-of-sight (LOS) and NLOS channel conditions to meet the IEEE WPAN standard. Power consumption is minimized by using a parallelized distributed arithmetic (DA) architecture. A 2 mm × 2 mm test chip in 65 nm CMOS implements a 6-tap feedforward and 32-tap feedback equalizer that can be configured to cancel the response of up to 72 symbols, and consumes 5.6 mW at 2 Gb/s throughput. The chip also includes a channel estimator based on a Golay correlator for setting the equalizer coefficients and estimating frequency and timing error.
I. INTRODUCTION
THE 7 GHz of unlicensed bandwidth in the 60 GHz band is drawing attention for high-rate video and data transfer applications. Current developments are focused on uncompressed high-definition video transfer indoors between a set-top box and a display, using 1.75 GHz channels in the 60 GHz band. These systems mitigate the long channel impulse response by using OFDM modulation and phased arrays [1]-[3]. While effective for indoor high-definition video distribution, they consume relatively high power, which makes them suitable mainly for wall-plugged devices. Simultaneously, several emerging standards are being developed that explore the use of the wide bandwidth in the 60 GHz band to implement 1.5 Gb/s-6 Gb/s wireless links between portable devices [4], [5], where power consumption is a critical design limitation. To improve the power amplifier efficiency, such systems are expected to use a single-carrier constant-envelope modulation scheme with a low peak-to-average power ratio [6].
The primary challenge for a high-data-rate receiver in the 60 GHz band is the need to equalize a long channel response, which can cause inter-symbol interference (ISI) spanning several tens of symbols [7]. Fig. 1 shows examples of impulse responses generated from the IEEE channel model for LOS and NLOS conditions [5]. The high data rate of the system, combined with the long channel response, makes it challenging to build an energy-efficient receiver that can estimate and equalize such channels.
Phased-array transmitters and receivers with beam-forming capability have been developed for reducing the multipath propagation, but they increase the power consumption and the system overhead in terms of the adaptation time and signal processing [2] , [3] . Although they reduce the length of the effective channel response, they cannot eliminate the need for the receive equalizer. In order to achieve a low power consumption of the receiver, an analog equalizer has been considered [8] . The analog equalizer is a popular solution in the high-speed wired link for back-plane communication [9] - [12] . The limitation of the analog solutions reported so far is that they implement a limited number of taps, making them suitable only for LOS propagation conditions in the channel of interest.
Digital implementations of an equalizer have advantages over their analog counterparts such as the ease of design, flexibility, and noise immunity. Channel equalizers for high delay spread DTV, WiFi and cellular systems have been implemented using digital signal processing techniques with up to several hundreds of Mb/s data rate [13] . However, the relatively high power consumption of high-speed digital circuits with high-speed, power hungry analog-to-digital converters is a challenge that needs to be addressed for this application. A straightforward implementation of a digital equalizer would involve, for example, an implementation of a direct-form FIR filter. Such a filter with 30 taps, operating at 2 Gb/s in 90 nm CMOS would consume over 100 mW [14] , [15] .
In this paper, we present a fully digital baseband receiver for a single-carrier 60 GHz transceiver that can operate in both LOS and NLOS conditions with low power consumption. The resulting chip implements an all-digital equalizer and channel estimator (CE). The key idea that enables low power consumption with high data rate is the use of distributed arithmetic (DA) in the equalizer. To simplify the test procedure, a transmitter with a channel emulator and noise generator is embedded on the chip.
In Section II, the high-level architecture of the receiver is presented. The equalizer structure and link-level simulation results are described in Section III, followed by the channel estimation in Section IV. The test environment and measurement results from the implemented chip are given in Section V, with the conclusion given in Section VI.
0018-9200/$26.00 © 2011 IEEE
II. SYSTEM

A. Modulation Scheme
This work targets the single-carrier modulation option from the IEEE WPAN standard [5] , [16] . This standard, in general, includes both the single-carrier and orthogonal frequency-division multiplex (OFDM) modulation options. While OFDM has its advantages in relative immunity to multipath propagation, it results in higher system power. On the transmit side, the high peak-to-average ratio (PAR) of the OFDM signal requires power-amplifier back-off to maintain linearity, making it less efficient. On the receive side, high-resolution ADC and FFT blocks must be operated regardless of the channel conditions, resulting in a non-power-scalable architecture. Finally, it is generally hard to adaptively turn off the channel coding even in a mild multipath condition in an OFDM system. Alternatively, the single-carrier receiver lends itself to reconfigurability and its power consumption can scale with the actual channel conditions, which is highly desirable in an adaptive high-speed communication system with a stringent power budget.
Binary phase-shift keying (BPSK) modulation is chosen in our implementation to demonstrate the key concepts with a relatively simple design. The design can be readily extended to quadrature phase-shift keying (QPSK) by duplicating and slightly modifying the data path.
B. Frame Structure
The single-carrier modulation options of the emerging standards share the similar frame structure shown in Fig. 2. The preamble consists of the SYNC pattern, with short-period pilots for initial synchronization of frequency and timing, and the channel estimation sequence (CES) for initial channel estimation. To track the time variance of the channel and maintain the frequency and timing synchronization, short pilot patterns (TS) are inserted into the data [5]. An inter-frame spacing (IFS) period is specified between the preamble and the data bursts. It stretches several microseconds (several thousand symbols) and is used to accommodate the latency of the initial, coarse estimators. The IFS can also be used to pre-calculate parameters that are used during the data bursts, and thereby save power. This is the motivation for using distributed arithmetic (DA) [17] in the implementation of the equalizers in this work.
C. Receiver Structure
Fig. 3 shows a general digital baseband for a single-carrier system, consisting of an ADC, a channel estimator, timing and frequency estimators with correction paths, and an equalizer block. The implemented test chip contains the blocks shown in Fig. 4. The receiver illustrated in Fig. 4(b) consists of an equalizer and a channel estimator. The channel estimator estimates the channel impulse response using a pilot sequence, and its output can be used for frequency and timing error detection as well as for calculating the equalizer coefficients. For testing purposes, the transmitter, the channel emulator, and the noise generator are implemented on chip as well (Fig. 4(a)). A configuration scan chain, debug logic, and memory initialization logic are included to facilitate chip verification from an external test FPGA.
III. EQUALIZER
The main objective is to design an equalizer suitable for a wide range of propagation conditions in a 60 GHz channel. It must support 2 Gb/s throughput with minimum power consumption. An additional goal is to have the power consumption scalable with the number of taps.
A. Reduced Complexity Equalizer Structure
There are numerous options for implementing an equalizer. Generally, equalization can be implemented in either the time or the frequency domain. The frequency-domain equalizer has been excluded from consideration because its power consumption is high, similar to that of the OFDM receiver [18], [19]. Equalizers in the time domain are typically implemented as linear or decision-feedback equalizers. The implemented time-domain equalizer in Fig. 4(b) is a hybrid, consisting of a linear equalizer (LE) and two decision-feedback equalizers (DFEs): the main DFE (M-DFE) and the sub-DFE (S-DFE). The LE is implemented with A taps, the M-DFE has B taps, and the S-DFE consists of L taps. The precursor part of the channel response is essentially ISI that cannot be equalized by the DFEs. Although the LE can equalize this interference, it has fundamental issues of noise enhancement and implementation complexity [20]. To minimize these shortcomings, the LE is implemented with a small number of taps. The DFE has been a popular solution for single-carrier systems because of its low hardware complexity, although it has a possible issue with error propagation [9]-[12]. Unlike the conventional LE-DFE structure, in this design the DFE output is fed back to the input of the LE (Fig. 4(b)). As a result, the channel estimation output can be used directly as the DFE coefficients, which significantly reduces the number of operations [21], [22]. Also, with this structure, the DFE taps can be implemented in either the analog or the digital domain if further power optimization is desired.
The S-DFE is added to compensate for the latency of the M-DFE loop by limiting the feedback delay to one symbol period. Fig. 5 shows the tap assignment of each equalizer element for a representative impulse response.
B. Link-Level Simulation
The required number of equalizer taps is initially determined by a link outage probability analysis over statistically generated NLOS channel profiles. The BER performance target is defined at the equalizer output, since the errors remaining there are substantially corrected by error correction codes such as low-density parity-check (LDPC) codes. LDPC coding is a part of the proposed standards in the 60 GHz band. The channel decoder is assumed to be adaptively turned off if the channel conditions are good, such as under LOS conditions. An outage is defined to be a case when the performance target is not achieved. In the case of the BPSK modulation under consideration, an outage occurs when the SNR of the signal after the equalizer is less than 4.2 dB. Therefore, the outage probability of the BPSK signal can be expressed as

$$P_{\mathrm{out}} = \Pr\left(\mathrm{SNR} < 4.2\ \mathrm{dB}\right). \tag{1}$$

The noise term of the SNR after an ideal DFE is a summation of additive white Gaussian noise (AWGN) and the residual ISI terms that are not cancelled by the available DFE taps:

$$\mathrm{SNR} = \frac{|h_0|^2}{\sigma_n^2 + \sum_{k > N_{\mathrm{DFE}}} |h_k|^2} \tag{2}$$

where $h_0$ represents the main tap, $k$ is a time index for excess delay, $N_{\mathrm{DFE}}$ is the number of available DFE taps, and $\sigma_n^2$ is the AWGN noise power. Fig. 6 illustrates the outage probability calculated for 100 channel profiles generated from the NLOS statistical channel model (IEEE residential channel model, CM2.3 [5]), which shows that a DFE of approximately 30 taps is enough to keep the outage probability below 10%. The actual number of implemented filter taps is 6 in the linear equalizer, 24 in the DFE, and 8 in the sub-DFE, to meet the latency requirement of the feedback loop inside the equalizer. Fig. 7 shows the impulse responses (IR2-6, and AWGN) generated from the IEEE statistical channel model and the corresponding BER performance of the equalizer with the implemented number of filter taps, which shows that the target BER is achievable with the equalizer in a reasonable SNR range.
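The outage analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulation: the function names, the toy channel generator, and the noise power are assumptions made for the example, and the channel profiles are synthetic rather than drawn from the IEEE CM2.3 model. The main tap is assumed to sit at index 0, so precursor ISI is ignored.

```python
import math
import random

def post_dfe_snr_db(h, main_idx, n_dfe_taps, noise_power):
    """SNR after an ideal DFE: main-tap power over AWGN plus uncancelled ISI."""
    signal = abs(h[main_idx]) ** 2
    # Postcursor taps beyond the DFE span are not cancelled and act as noise.
    residual = sum(abs(x) ** 2 for x in h[main_idx + 1 + n_dfe_taps:])
    return 10.0 * math.log10(signal / (noise_power + residual))

def outage_probability(channels, n_dfe_taps, noise_power, threshold_db=4.2):
    """Fraction of channel profiles whose post-DFE SNR is below the threshold."""
    fails = sum(
        1 for h in channels
        if post_dfe_snr_db(h, 0, n_dfe_taps, noise_power) < threshold_db
    )
    return fails / len(channels)

def toy_channel(rng, length=72, decay=0.9):
    """Hypothetical profile: unit main tap followed by a decaying random tail."""
    return [1.0] + [rng.gauss(0.0, 0.3) * decay ** k for k in range(1, length)]

rng = random.Random(0)
channels = [toy_channel(rng) for _ in range(100)]
# Outage probability as a function of the number of DFE taps.
curve = [outage_probability(channels, n, noise_power=0.1) for n in (0, 8, 16, 32, 64)]
```

Because adding DFE taps can only remove terms from the residual ISI sum, the resulting curve is non-increasing in the number of taps, which is the qualitative behavior used to choose the tap count.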
The digital signal wordlengths are also determined by the link-level simulation, in accordance with the chosen number of filter taps. The wordlength is minimized to reduce the hardware size, the power consumption, and the LUT size needed for the DA implementation, while keeping the fixed-point loss in the BER performance below 1 dB. The wordlength used for the equalizer is shown in Fig. 10. Baud-rate sampling is employed in this work to avoid the power consumption associated with oversampling.
C. Hardware Implementation
The equalizers are divided into four parallel data-paths, each operating at 500 MHz, to meet the throughput requirement with low power consumption. Otherwise, the equalizer has to be implemented with power-hungry dynamic logic style, given the high data rate and the targeted CMOS process.
The DFE is implemented as an FIR filter that calculates a convolution of the estimated channel coefficients, $\hat{h}_k$, and the slicer output, $\hat{d}[n]$:

$$y[n] = \sum_{k=1}^{N} \hat{h}_k\, \hat{d}[n-k] \tag{3}$$

where $N$ is the number of DFE taps.
Although the transpose form of an FIR filter is often preferred for high-speed, low-latency applications [23]-[25], the structure is difficult to parallelize because it needs to perform multiple multiply-and-add operations within a clock cycle by using a multiphase clock. On the other hand, the direct-form FIR is parallelized by simply repeating the same structure with time-shifted inputs, which can be expressed as

$$y[n+j] = \sum_{k=1}^{N} \hat{h}_k\, \hat{d}[n+j-k], \quad j = 0, 1, 2, \ldots \tag{4}$$

From (4), it is easy to see that the filter can be parallelized by implementing it with four identical blocks and time-shifted inputs, expressed as

$$y[4m+j] = \sum_{k=1}^{N} \hat{h}_k\, \hat{d}[4m+j-k], \quad j \in \{0, 1, 2, 3\}. \tag{5}$$

For the FIR filters of the LE and M-DFE, the look-up table (LUT) based distributed arithmetic (DA) architecture is chosen for each of the parallelized blocks to reduce the latency and implement the filter with very low power consumption. In the DA architecture, intermediate results of multiply-and-add operations are pre-computed and stored in LUTs [17]. The pre-computation can be done during the IFS period and, in our application, is needed only when the channel condition changes. In designing the DA architecture, there is a trade-off between the memory size and latency. The size of the LUT depends on the number of coefficients, their wordlengths, and the structure [14]. In this particular implementation, the emphasis is put on meeting the timing requirement to close the feedback loop while shortening the latency. Fig. 8 illustrates the M-DFE implementing the 24-tap FIR, which uses 4 LUTs. Using only one LUT would minimize the latency of the filter, but would require a LUT with a prohibitive $2^{24}$ entries for the binary input of the BPSK signal. On the other hand, breaking down the LUT reduces the memory requirement while increasing the latency [14]. In this work, we used four LUT instances, each with $2^6 = 64$ entries. Each parallelized branch is marked as 0, 1, 2, 3.
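The LUT-based DA filter described above can be illustrated with a short Python sketch, assuming a 24-tap filter with ±1 (BPSK) inputs split into four groups of six taps. The function names and the address bit ordering are choices made for this example and are not taken from the chip. The `parallel_step` helper shows four identical blocks fed with time-shifted windows; it treats the decision sequence as already known, so it does not model the feedback timing that the S-DFE resolves.

```python
import random

TAPS_PER_LUT = 6   # 24 taps split across 4 LUTs, 2**6 = 64 entries each
NUM_LUTS = 4

def build_luts(coeffs):
    """Pre-compute partial sums for every +/-1 pattern of each 6-tap group."""
    luts = []
    for g in range(NUM_LUTS):
        group = coeffs[g * TAPS_PER_LUT:(g + 1) * TAPS_PER_LUT]
        table = []
        for addr in range(1 << TAPS_PER_LUT):
            # Address bit i set means decision +1 on tap i of the group, else -1.
            table.append(sum(c if (addr >> i) & 1 else -c
                             for i, c in enumerate(group)))
        luts.append(table)
    return luts

def da_fir(luts, history):
    """history[k] is the decision made k+1 symbols ago (+1 or -1)."""
    total = 0.0
    for g, table in enumerate(luts):
        addr = 0
        for i in range(TAPS_PER_LUT):
            if history[g * TAPS_PER_LUT + i] > 0:
                addr |= 1 << i
        total += table[addr]   # one look-up replaces six multiply-adds
    return total

def direct_fir(coeffs, history):
    """Reference multiply-and-add implementation."""
    return sum(c * d for c, d in zip(coeffs, history))

def parallel_step(luts, decisions, m):
    """Outputs y[4m] .. y[4m+3] from identical blocks with time-shifted inputs."""
    n_taps = NUM_LUTS * TAPS_PER_LUT
    outputs = []
    for j in range(4):
        n = 4 * m + j
        window = [decisions[n - k] for k in range(1, n_taps + 1)]
        outputs.append(da_fir(luts, window))
    return outputs

rng = random.Random(1)
coeffs = [rng.uniform(-1.0, 1.0) for _ in range(24)]
luts = build_luts(coeffs)
```

The tables are rebuilt only when the coefficients change, which is what allows the per-symbol work to reduce to four look-ups and three additions per output in this sketch.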
With the DA structure, sixteen different memory instances would be required to directly implement the filter in (5) with the parallelization factor of four. However, because the LUTs share the same contents, they can be implemented with four multi-ported memories. In this implementation of the M-DFE, the four $2^6$-word LUTs are instantiated with D-FFs and MUXs as shown in Fig. 8. The LE and the channel emulator also share the same architecture, with 6 LUTs (6 taps, 6-bit input, $2^6 = 64$ words each) and 12 LUTs (72 taps, $2^6 = 64$ words each), respectively, implementing the following convolutions:

$$y_{\mathrm{LE}}[n] = \sum_{k=0}^{5} w_k\, x[n-k] \tag{6}$$

$$r[n] = \sum_{k=0}^{71} g_k\, s[n-k] \tag{7}$$

where $w_k$ are the LE coefficients, $x[n]$ is the LE input, $g_k$ are the channel emulator taps, and $s[n]$ is the transmitted symbol sequence.
The channel emulator filter uses a larger number of LUT instances than the other filters because it has no latency restriction and is used for testing purposes only. The LE differs slightly in its input wordlength. On the other hand, the S-DFE has to be implemented differently to support single-cycle feedback. Therefore, the S-DFE is first combined with the slicers and loop-unrolled [11], and then implemented in a DA architecture. Although the loop-unrolling requires additional combinational logic, the 8-tap filter needs only one LUT with 256 entries, because the eight previous slicer outputs act as the address of the LUT and the slicer output has only two levels in a BPSK system (Fig. 9). Fig. 10 shows the implementation details of the equalizer with its bit widths and pipeline register allocation; its block diagram is shown in Fig. 4(b). As shown in the figure, the feedback loop has two clock cycles of latency, resulting from the logic delay from register#1 to register#2 and again from register#2 to register#1. This latency is handled by the S-DFE.
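The idea behind loop unrolling can be shown with a generic one-tap speculation example in Python. This is a sketch of the general technique only: the chip combines unrolling with a 256-entry DA LUT and a four-way parallel data path, which is not modeled here, and the function names and the `init` history argument are hypothetical.

```python
def slicer(v):
    """BPSK decision."""
    return 1 if v >= 0 else -1

def dfe_direct(x, c, init):
    """Reference DFE: every decision waits for the full feedback sum."""
    past = list(init)              # past[0] is the most recent decision
    out = []
    for sample in x:
        isi = sum(ck * dk for ck, dk in zip(c, past))
        d = slicer(sample - isi)
        out.append(d)
        past = [d] + past[:-1]
    return out

def dfe_unrolled(x, c, init):
    """Same decisions, with the most recent feedback tap speculated."""
    past = list(init)
    out = []
    for sample in x:
        # Contribution of older decisions, which are available early.
        older = sum(ck * dk for ck, dk in zip(c[1:], past[1:]))
        # Pre-compute the decision for both values of the latest decision.
        cand_plus = slicer(sample - older - c[0])
        cand_minus = slicer(sample - older + c[0])
        # A 2:1 multiplexer selects once the latest decision is resolved.
        d = cand_plus if past[0] > 0 else cand_minus
        out.append(d)
        past = [d] + past[:-1]
    return out

c = [0.5, 0.25, -0.125]
x = [0.75, -0.375, 0.125, -1.0, 0.625, 0.0, -0.25, 1.5]
init = [1, -1, 1]
decisions = dfe_unrolled(x, c, init)
```

In hardware terms, the critical path of the unrolled version contains only the final multiplexer instead of the adder and slicer, at the cost of duplicated combinational logic.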
Multiplexers are added to the delay line to enable adjustable tap allocation, which makes it possible to configure the equalizer to cancel ISI up to 72 taps long. Fig. 11 illustrates a structure and an example that shows this dynamic tap assignment that allocates four-tap delay line groups to the major multi-path clusters with large amplitude, which can be determined by the channel estimator output. This accommodates typical observed channel responses where the non-zero taps are observed to be clustered.
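One possible way to choose the group offsets from a channel estimate is sketched below in Python. The selection rule (aligned four-tap slots ranked by energy) is an assumption made for illustration; the paper does not specify the allocation algorithm, and the function name and parameters are hypothetical.

```python
def assign_tap_groups(h_post, group_size=4, num_groups=6):
    """Return start offsets of the four-tap groups with the largest energy."""
    n_slots = len(h_post) // group_size
    energy = []
    for s in range(n_slots):
        seg = h_post[s * group_size:(s + 1) * group_size]
        energy.append((sum(abs(v) ** 2 for v in seg), s))
    # Largest energy first; ties go to the earlier slot.
    best = sorted(energy, key=lambda e: (-e[0], e[1]))[:num_groups]
    return sorted(s * group_size for _, s in best)

# Toy 72-tap postcursor response with three clusters.
h_post = [0.0] * 72
h_post[0:4] = [0.6, 0.4, 0.2, 0.1]
h_post[20:24] = [0.3, 0.3, 0.1, 0.0]
h_post[40:44] = [0.2, 0.1, 0.1, 0.1]
offsets = assign_tap_groups(h_post, num_groups=3)
```

With 24 M-DFE taps and four-tap groups, six such offsets would be written to the delay-line multiplexers, for example through the configuration scan chain.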
The coefficients of the equalizer filters are calculated based on the channel estimation results. The coefficients of the feedforward linear equalizer can be calculated from the precursor part of the impulse response using a minimum mean square error (MMSE) criterion, which can be expressed in the frequency domain [18]:

$$W(f) = \frac{H^{*}(f)}{|H(f)|^{2} + \sigma_n^{2}/\sigma_s^{2}} \tag{8}$$

where $W(f)$ and $H(f)$ are the discrete Fourier transforms of the LE coefficients $w[n]$ and the channel response $h[n]$, respectively, and $\sigma_n^{2}/\sigma_s^{2}$ is the noise-to-signal power ratio. The complexity of this operation is low because the number of taps in the linear equalizer is minimized to six and can be reduced further depending on the channel profile. Also, this operation only needs to be performed sporadically, when there is a change in the channel condition. In addition, in a real system, the calculation can easily be done with a general-purpose DSP or CPU, which is commonly present in communication systems to perform analog calibrations and medium access control (MAC) operations. The exact estimation of $\sigma_n^{2}/\sigma_s^{2}$ is known to be less critical for the BER performance [18]. For the coefficients of the two DFEs, the channel estimator results, $\hat{h}$, can be used directly. In the case of the M-DFE (Fig. 8), the entries of LUT0 are calculated by

$$\mathrm{LUT0}[m] = \sum_{i=0}^{5} \hat{h}_{0,i}\, b_i \tag{9}$$

where $\hat{h}_{0,i}$ is the $i$-th channel coefficient assigned to LUT0 and $(b_5, \ldots, b_0)$, with $b_i \in \{-1, +1\}$, is a binary number representing the possible combinations of the slicer outputs, which has the following relationship with an integer $m$:

$$m = \sum_{i=0}^{5} \frac{1 + b_i}{2}\, 2^{i}. \tag{10}$$

The entries of the other LUTs are calculated in a similar way. Although this LUT entry calculation is not implemented on chip, the operation can be hard-wired via low-power adder trees because its throughput requirement is as low as several MHz (in IFS length).
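A frequency-domain MMSE coefficient calculation of this kind can be sketched in pure Python. The sketch assumes the textbook form in which each frequency bin of the equalizer equals the conjugate channel response divided by the squared channel magnitude plus a noise-to-signal ratio; the DFT size, the function names, and the `noise_to_signal` parameter are assumptions for the example. The result is a circular impulse response that a real design would truncate to the six available LE taps.

```python
import cmath

def dft(x, n):
    """n-point DFT of x (implicitly zero-padded), written out for clarity."""
    return [sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))
            for k in range(n)]

def idft(X):
    """Inverse DFT matching dft() above."""
    n = len(X)
    return [sum(v * cmath.exp(2j * cmath.pi * k * t / n) for k, v in enumerate(X)) / n
            for t in range(n)]

def mmse_le(h, noise_to_signal, n_fft=64):
    """Per-bin MMSE equalizer, returned as a time-domain (circular) response."""
    H = dft(h, n_fft)
    W = [Hk.conjugate() / (abs(Hk) ** 2 + noise_to_signal) for Hk in H]
    return idft(W)

# Two-tap toy channel; the equalizer taps decay away from index 0.
w = mmse_le([1.0, 0.5], noise_to_signal=0.1)
```

Because the noise-to-signal term only regularizes the division, a coarse estimate of it changes the taps smoothly, which is consistent with the remark that its exact value is less critical.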
IV. CHANNEL ESTIMATOR
The IEEE WPAN standard specifies a channel estimation sequence based on Golay codes, both in the preamble (CES) and within data bursts (TS) [5]. The sequence is used to estimate the channel impulse response, which can be used to calculate the equalizer coefficients. The estimate can also be used for the synchronization of frequency and timing. The code is a binary complementary sequence pair consisting of $a$ and $b$, each of $N$ elements, that have the following autocorrelation property [27]:

$$R_a[k] + R_b[k] = 2N\,\delta[k]$$

where $R_a[k]$ and $R_b[k]$ are the aperiodic autocorrelations of $a$ and $b$, and $\delta[k]$ is the Kronecker delta function.
The channel, $h$, can be reconstructed by the following recursive equations, which consist of shift, add, and subtract operations between two sequences:

$$a_0[k] = b_0[k] = \delta[k] \tag{14}$$

$$a_n[k] = a_{n-1}[k] + W_n\, b_{n-1}[k - D_n], \qquad b_n[k] = a_{n-1}[k] - W_n\, b_{n-1}[k - D_n] \tag{15}$$

where $\delta[k]$ is the Kronecker delta function, $n$ is the iteration index ($n = 1, \ldots, \log_2 N$ for a length-$N$ sequence), $W_n \in \{+1, -1\}$ are the binary coefficients, and $D_n$ is a circular delay. Since the number of operations required for a Golay correlator is on the order of $\log_2 N$, as opposed to $N$ in a PN sequence correlator, it is more suitable for power-constrained high-speed communication systems [26]. Fig. 12 shows the timing diagram of the implemented channel estimator working on the CES based on a 128-symbol Golay sequence. Only the center portions of the received sequence are buffered and correlated, in order to estimate the impulse response without the influence of ISI from irrelevant signals.
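The delay-and-add recursion and the complementary autocorrelation property can be checked numerically with a short Python sketch. It uses the standard Golay construction, seeded with a unit impulse and using delays that are distinct powers of two; the particular delay order and the all-ones weights are illustrative choices and are not the values specified by the standard.

```python
def golay_pair(delays, weights):
    """Build a complementary pair of length 2**len(delays) by delay-and-add."""
    n = 1 << len(delays)
    a = [0] * n
    b = [0] * n
    a[0] = b[0] = 1                      # Kronecker delta seed
    for d, w in zip(delays, weights):
        shifted = [b[(k - d) % n] for k in range(n)]   # circular delay of b
        a, b = ([x + w * y for x, y in zip(a, shifted)],
                [x - w * y for x, y in zip(a, shifted)])
    return a, b

def acorr(x, lag):
    """Aperiodic autocorrelation at a non-negative lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

# Length-128 pair built in 7 stages (illustrative delays and weights).
a, b = golay_pair([1, 2, 4, 8, 16, 32, 64], [1] * 7)
sidelobes = [acorr(a, k) + acorr(b, k) for k in range(1, 128)]
peak = acorr(a, 0) + acorr(b, 0)
```

The seven stages for a length-128 sequence illustrate the logarithmic stage count that makes the Golay correlator attractive compared with a direct 128-tap correlator.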
The estimator operation consists of adding and subtracting the two intermediate sequences element by element after delaying one of them by the delay of the current iteration. When the delay value is a multiple of the parallelization factor, the delay operation required in the parallel scheme is easily implemented by adding or subtracting offsets to the read address. However, when the delay value is a fraction of the parallelization factor, the delay operation is implemented by a swap-and-partial-shift operation on the buffer. Fig. 13 illustrates the operations to obtain a delay by 2 and a delay by 1 when the parallelization factor is four. In the figure, the leftmost column of boxes shows the original buffer, in which the data order is represented by the number inside. In a clock cycle, the four data items in a row are processed simultaneously. In order to increase the delay by four, it is sufficient to increment the read pointer of the memory by one. To implement a fractional delay operation, a swap operation is performed, which essentially swaps the grey and white portions of the buffer. Next, the partial shift operation performs a rotational shift of the white portion of the box, which eventually moves the deep grey boxes from the top to the bottom. It can be seen that, by sequentially reading out the re-ordered buffer, the delay operation is completed. All of these operations can be implemented by pointer management, without any physical movement of the data. Although the data path is shared with the equalizer, which is parallelized by a factor of four in this work, it is desirable to make the structure easy to reconfigure because the parallelization factor needs to be tuned depending on the system latency and power requirements. The control path of the channel estimator consists of controllers and correlators A and B. Each correlator is composed of identical cells built out of static memory (SRAM) elements. In this way, a different parallelization factor can be easily accommodated with a slight modification of the design, depending on the system requirements.
Fig. 14 shows an impulse response generated by the on-chip channel emulator overlaid with the channel estimator output measured from the chip, which demonstrates proper operation of the block. Although the block consumes 33 mW at 2 Gb/s, the channel estimation is performed only for a fraction of the time during the connection, as given by its activity factor (Fig. 2). In the IEEE WPAN standard, the periodicity of the 768-symbol CES field can be set to 8192, 16384, or 32768 symbols, or to infinite [5]. Therefore, the activity factor can be as low as 1% through MAC layer adjustment if the channel is stationary.
While the channel estimation error is not negligible, particularly in the low SNR range, it has been verified through the link-level simulation that the performance degradation caused by this error is less than 1 dB under nominal operating conditions.
V. TEST SETUP AND MEASUREMENT RESULTS
The chip was synthesized using a customized design flow [28] and fabricated by TSMC in 65 nm CMOS. Although the core size of the chip is 1.53 mm by 1.53 mm, the design is pad-limited with a utilization factor of 15.1%. The chip photo is shown in Fig. 15 .
The on-chip test blocks eliminate the need for high-speed interconnections. The operating mode (data or channel estimation mode) and the delay line offset are configured by a scan chain. The filter coefficients for the equalizers and the channel emulator are initialized through a separate data bus. The chip has simple input control signals: start and reset. Debugging pins are provided to monitor internal operations when using a low-frequency clock. In the full-speed test, the BERT_done indicator is the only control signal that needs to be read out. The resulting numbers of bits and errors are read from the debugging interface once a BERT measurement is available. Fig. 16 shows the measured BER performance for both single-path AWGN and multipath channels, verifying correct operation. The deviation from the theoretical performance shows the effect of fixed-point loss and the error propagation of the DFE. Fig. 17 shows the measured total power consumption with varying throughput. The total power consumption of the chip, measured in a multipath condition with four multipath components and an SNR of 4 dB, is 60.7 mW at 2 Gb/s. Although the power consumption varies with the multipath conditions and SNR, the chosen response represents a typical operating scenario that achieves the target BER. The power consumption of each block is derived by scaling the total measured power consumption based on the simulated power breakdown of the chip. The equalizer consumes 5.6 mW, which is 9.2% of the total power. The channel estimator consumes 33 mW, and a relatively high activity factor of 10% is assumed in the summary shown in Table II.
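The headline figures quoted above can be cross-checked with simple arithmetic. The Python sketch below uses only the numbers stated in the text; the variable names are illustrative, and treating the average channel estimator power as active power times activity factor is a simplifying assumption that ignores idle and leakage power.

```python
total_mw = 60.7          # measured total power at 2 Gb/s
equalizer_mw = 5.6       # equalizer share from the scaled breakdown
estimator_mw = 33.0      # channel estimator power while active
throughput_bps = 2e9
activity_factor = 0.10   # value assumed for the summary table

# Equalizer power as a fraction of the total chip power.
equalizer_share = equalizer_mw / total_mw
# Equalizer energy per received bit, in picojoules.
equalizer_pj_per_bit = equalizer_mw * 1e-3 / throughput_bps * 1e12
# Time-averaged estimator power under the assumed activity factor.
estimator_avg_mw = estimator_mw * activity_factor
```

Under these assumptions the equalizer share evaluates to about 9.2%, the equalizer energy to about 2.8 pJ per bit, and the time-averaged channel estimator power to about 3.3 mW.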
The comparison to prior equalizers working at a similar throughput is made in Table I , which shows that the presented equalizer compares favorably by implementing a larger number of taps with lower power consumption.
VI. CONCLUSION
A digital signal processing chip that can equalize the high-delay-spread channel seen in NLOS conditions in the 60 GHz band is presented. The equalizer minimizes power consumption by using a parallelized DA architecture. A configurable 38-tap equalizer has been implemented that consumes 5.6 mW at 2 Gb/s throughput. A channel estimator for calculating the equalizer coefficients and measuring the synchronization error has also been developed. The power consumption of the proposed architecture might be further reduced by implementing a part of it in the mixed-signal domain.
