An up to 36Gbps analog baseband equalizer and demodulator for mm-wave wireless communication in 28nm CMOS by Mattia, O.E. et al.
An up to 36Gbps Analog Baseband Equalizer and
Demodulator for mm-Wave Wireless Communication
in 28nm CMOS
Oscar Elisio Mattia∗†, Davide Guermandi†, Guy Torfs†‡, Piet Wambacq∗†
∗ETRO Department
Vrije Universiteit Brussel, Pleinlaan 2, Brussel
Email: oscar.elisio.mattia@imec.be
†imec, Leuven, Belgium
‡imec - Ghent University, Ghent, Belgium
Abstract—Future mm-Wave wireless links with datarates of
20Gbps and more will result in prohibitive power consumption
at the front end of the DSPs. The use of analog or mixed-signal
baseband processing, however, can significantly relax the receiver
power budget. The most critical block for such a baseband is the
decision feedback equalizer (DFE), which compensates for the line-of-sight
multi-path components and demodulates the signal. In this paper
we present, to the authors’ best knowledge, the first DFE capable
of handling 16QAM data at 9GHz RF bandwidth, aggregating
all 4 channels of the 60GHz IEEE802.11ad band and resulting
in a maximum datarate of 36Gbps. It is able to compensate
for 0.7x cursor amplitude of inter-symbol interference spread
over 5 complex taps, while the minimum input SNR is 26dB. It
consumes 138mW from a 0.9V supply, achieving 3.8mW/Gbps
power efficiency including clock distribution.
I. INTRODUCTION
Future mm-Wave links will require data rates of 20Gbps
and more, over wide bandwidths, unavoidably leading to a
large power consumption in the analog baseband section and
the digital front-end. Analog demodulation significantly
relaxes the requirements for the oversampling ADC and the
subsequent digital circuitry. Analog-domain implementations of
different baseband functions, such as carrier synchronization,
baseband clock recovery, equalization and demodulation, have
been proposed in [1], [2]. The most critical block is the time-domain equalizer,
in the form of a decision feedback equalizer (DFE), that
compensates for the channel multi-path components (MPC)
under practical line-of-sight (LOS) condition. The number of
taps depends on the delay spread, which can be as small as
0.5ns for very directional and short distance links. In a rough
approximation, at least one tap is needed for every symbol
period of delay spread. In the case of a 9GS/s baseband symbol
rate, it means 1 tap for every 110ps of delay spread.
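The rule of thumb above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it uses the exact symbol period of 1/9GS/s (about 111ps, which the text rounds to 110ps).

```python
import math

def symbol_period_ps(symbol_rate_gsps):
    """Symbol period in picoseconds for a symbol rate given in GS/s."""
    return 1000.0 / symbol_rate_gsps

def min_taps(delay_spread_ns, symbol_rate_gsps):
    """Rough approximation from the text: one tap per symbol period of delay spread."""
    period_ns = 1.0 / symbol_rate_gsps
    return math.ceil(delay_spread_ns / period_ns)

print(round(symbol_period_ps(9.0), 1))  # 111.1
print(min_taps(0.5, 9.0))               # 5
```

Under this approximation, a 0.5ns delay spread at 9GS/s calls for about 5 taps.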
In this paper we present the first DFE capable of equal-
izing 5 complex taps of inter-symbol interference (ISI) on
QPSK/16QAM data, at a maximum data rate of 18/36Gbps
respectively, corresponding to an RF bandwidth of 9GHz that
aggregates all 4 channels of the IEEE802.11ad standard. This
is realized with I and Q signal paths that can each handle
4PAM signals, leading to 16 possible constellation points to
demodulate. To form a complete RX, the circuit must be preceded
by a mm-Wave LNA and IQ direct downconverter, as well as
carrier synchronization and baseband clock recovery [2], [4].
These blocks are less critical for meeting the bandwidths and data
rates mentioned above.
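The data rates quoted above follow directly from the symbol rate and the constellation size. As a small sketch (function and variable names are hypothetical), the 16 constellation points can also be enumerated as the product of two 4PAM alphabets, one per signal path:

```python
import itertools
import math

PAM4_LEVELS = (-3, -1, 1, 3)  # per-path 4PAM alphabet

def data_rate_gbps(symbol_rate_gsps, constellation_points):
    """Data rate = symbol rate x bits per symbol."""
    bits_per_symbol = int(math.log2(constellation_points))
    return symbol_rate_gsps * bits_per_symbol

# 16QAM as the Cartesian product of 4PAM on I and 4PAM on Q
qam16_points = [complex(i, q) for i, q in itertools.product(PAM4_LEVELS, repeat=2)]

print(data_rate_gbps(9, 4))    # 18 (QPSK)
print(data_rate_gbps(9, 16))   # 36 (16QAM)
print(len(qam16_points))       # 16
```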
II. ARCHITECTURE AND CIRCUIT SCHEMATICS
Before introducing the chosen architecture, it is important
to emphasize some of the challenges in designing a DFE for
a complex, multi-level modulation scheme, compared with
the more traditional real, binary DFEs found in wireline
systems. First, and perhaps most obvious, are the noise
and linearity requirements for the analog part, as well as 3x
more logic gates per coefficient. Second, traditional techniques
used to relax the timing budget of the most critical first tap,
such as time interleaving or half-rate sampling, require further
doubling the number of comparators and digital logic. Third,
the coefficient design is simpler in wireline, where the first
coefficient always has a positive sign due to the RC nature of
the channel. There, an extra XOR gate in series with the
comparator is not needed, which makes the first-tap feedback
loop shorter and often allows it to be combined into the
comparator structure itself [3]. Finally, cascode isolation [2],
which could allow a larger number of taps, is not available
because of the 0.9V supply of the 28nm CMOS process. All
of these factors result in greater power consumption, making
it unrealistic to achieve the 1mW/Gbps efficiency of
state-of-the-art simple wireline equalizers. Solutions
with more advanced equalization, including ADC/DSP-based
designs, are closer to and often above a budget of 10mW/Gbps.
A. 16QAM 5-tap DFE Core
Given the reasons mentioned above, the DFE is implemented
as a direct full-rate architecture with 5 complex taps
whose currents are summed in a resistive load, as shown in
Figure 1 (I-path only).
The incoming baseband signal VIN containing all the
ISI is converted into current IMAIN by the differential pair.
Three comparators, each with a different offset, are placed
at the summing node, detecting the four levels of a 4PAM
modulation. The nominal ISI-free input amplitude is 200mVpp
for the ±3 levels. The outputs of the comparators feed a delay
line, which is tapped by the DFE coefficients. The total tap
current ITAPS is subtracted from IMAIN to generate an ISI-free
eye diagram. Note that, as an IQ receiver, the in-phase
(quadrature) DFE also contains cross coefficients, which come
from the Q(I)-path delay line and are used to compensate for
the complex characteristic of the MPCs. Every coefficient is
made of 3 unit cells, each containing a 7b current-steering DAC
driven by the data bits and coefficient sign selection logic. The
sum of the currents is either ±3ITAP or ±1ITAP, according
to the previous incoming bits. All DFE logic is built in CML
style for a maximum clock frequency of 9GHz.
Fig. 1. Schematics of I-path 5-tap complex DFE.
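The feedback loop described above can be illustrated with a behavioural model. The following is a minimal Python sketch, not the authors' circuit: it assumes ideal comparators with thresholds at 0 and ±2 (for levels ±1, ±3), no noise, offset or hysteresis, and DFE coefficients that match the channel exactly. The tap values, the symbol count and all function names are hypothetical. Complex multiplication captures both coefficient types: the real part of each tap acts as a direct coefficient and the imaginary part as a cross coefficient fed from the other path.

```python
import random

THRESHOLDS = (-2.0, 0.0, 2.0)  # three comparator offsets separating -3, -1, +1, +3

def slice_4pam(v):
    """Three comparators give a thermometer count, mapped to a 4PAM level."""
    count = sum(v > t for t in THRESHOLDS)
    return -3 + 2 * count

def slice_16qam(z):
    """Independent 4PAM decisions on the I and Q paths."""
    return complex(slice_4pam(z.real), slice_4pam(z.imag))

def apply_channel(symbols, taps):
    """Unit cursor plus post-cursor ISI: y[n] = d[n] + sum_k c[k] * d[n-k]."""
    out = []
    for n, d in enumerate(symbols):
        isi = sum(c * symbols[n - k]
                  for k, c in enumerate(taps, start=1) if n - k >= 0)
        out.append(d + isi)
    return out

def dfe(received, taps):
    """Subtract the ISI predicted from past decisions, then slice."""
    decisions = []
    for n, y in enumerate(received):
        feedback = sum(c * decisions[n - k]
                       for k, c in enumerate(taps, start=1) if n - k >= 0)
        decisions.append(slice_16qam(y - feedback))
    return decisions

rng = random.Random(0)
levels = (-3, -1, 1, 3)
tx = [complex(rng.choice(levels), rng.choice(levels)) for _ in range(2000)]
taps = [0.2 + 0.1j, -0.15 + 0.05j, 0.1 - 0.1j, 0.05 + 0.05j, -0.05j]  # hypothetical
rx = apply_channel(tx, taps)

errors_raw = sum(slice_16qam(y) != d for y, d in zip(rx, tx))
errors_dfe = sum(dh != d for dh, d in zip(dfe(rx, taps), tx))
print("symbol errors without DFE:", errors_raw)
print("symbol errors with DFE:   ", errors_dfe)
```

In this idealized setting the DFE recovers every symbol, while slicing the unequalized signal produces errors. A real implementation additionally has to close the first-tap loop within one symbol period.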
B. 1st Tap Timing Closure
To ensure timing closure of the first tap of the DFE, the
sum of the settling time of the analog node and the digital
propagation time has to be smaller than one clock period, 110ps.
Assuming that the analog settling takes 75% of that time (more than
30GHz of bandwidth for 3τ settling), there is only 27ps left
for the digital gates, including the delays of both comparator
and XOR sign-select gate, as well as the comparator setup
time. This is equivalent to a flash ADC comparator
with a sampling clock of more than 40GHz, and therefore
requires careful design.
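The split of the clock period can be written out explicitly. This sketch only reproduces the budget arithmetic. It uses the exact 9GHz period (about 111ps rather than the rounded 110ps), and the individual gate delays passed to the check are hypothetical placeholders, not measured values.

```python
def first_tap_budget_ps(clock_ghz, analog_fraction):
    """Split one clock period into analog settling and digital loop time."""
    period = 1000.0 / clock_ghz
    analog = analog_fraction * period
    digital = period - analog
    return period, analog, digital

def loop_closes(digital_budget_ps, comparator_ps, xor_ps, setup_ps):
    """True if comparator delay + XOR delay + setup time fit the digital budget."""
    return comparator_ps + xor_ps + setup_ps <= digital_budget_ps

period, analog, digital = first_tap_budget_ps(9.0, 0.75)
print(f"{period:.1f} {analog:.1f} {digital:.1f}")  # 111.1 83.3 27.8

# Hypothetical delays, for illustration only
print(loop_closes(digital, comparator_ps=12.0, xor_ps=8.0, setup_ps=6.0))  # True
```

The roughly 28ps left for the digital part is in line with the 27ps quoted in the text.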
Figure 2 shows the comparator schematics, composed of
two static CML latches in master-slave configuration. Fighting
offset across PVT with sizing is very inefficient for these high-
speed applications. The traditional approach is to use a current-
DAC at the output of the first latch to calibrate this offset after
production. The problem with this compensation scheme is
that it unbalances the first latch output, leading to asymmetric
rise/fall times, depending on the sign of the input signal. This
will not only make the comparator slower for the unfavored
transition, but also affect the settling of the tap currents. To
mitigate this imbalance effect we propose the use of different
threshold devices on the sensing pair of comparators C1 and
C2 to embed an initial offset VOFF±, bringing it on average
closer to the desired level of ±2 (despite PVT variations),
considerably speeding up the transition of the comparator with
respect to the current-DAC only topology.
Fig. 2. CML Latch and Comparator schematics, including different VTH
devices and offset compensating current-DAC.
Shown in Figure 3 is a post-layout extraction simulation
result at the input of the DFE and at the summing node in front
of the comparators, while canceling 0.3x of cursor ISI on the
1st direct coefficient of the DFE, clocked at 9GHz. While on
the amplifier input no distinguishable 4PAM signal is visible, at
the comparators’ input the tap currents have already subtracted
the ISI, leading to a clear open eye. The extra levels around the
nominal ±1/3 are the result of a first direct coefficient greater
than the applied ISI, which compensates for the approximately
10mV hysteresis found in this type of static latch.
Fig. 3. DFE summing amplifier input and output nodes, showing the effect
of equalization and hysteresis compensation for the first direct tap of the DFE.
C. Chip Architecture and Building Blocks
The chip is designed for three test modes (Figure 4). Test mode (1)
is for external testing with an Arbitrary Waveform Generator
(AWG) and a Logic State Analyzer (LSA). It has as inputs
the I and Q downconverted signals, which are terminated on
chip at 50Ω and AC coupled, followed by an input buffer. The
output bits of the comparators are decoded into MSB/LSB,
demuxed by 4, converted from CML (white) to CMOS (grey)
and buffered off-chip. Test mode (2) is fully integrated, needing
only TX and RX clocks, and contains two 4PAM PRBS9 generators
and a BER tester (see Figure 5). Finally, test mode (3)
outputs the PRBS9 generator signals for verification purposes.
The 4PAM PRBS9 generator produces uncorrelated MSB
and LSB bits, which are Gray coded and feed a transmitter cell
with two 7b DACs. To compose the 16QAM signal they are
randomly time-shifted between I and Q paths, resulting again
in uncorrelated sequences from the point of view of the RX. On
the other side, the BER tester receives the DFE MSB/LSB I/Q
signals, which are fed into the same PRBS generator as on the TX
side, operating in open loop. Their output is masked by a clock
and divided by counters, converted to CMOS and buffered
off-chip. This way, a transition of the output represents a
programmable number of errors of the DFE.
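The open-loop checking principle can be sketched in software. The text does not give the PRBS9 polynomial, so the code below assumes the common arrangement with feedback from stages 5 and 9, i.e. b[n] = b[n-5] XOR b[n-9]. The function names and the divider value are hypothetical. In open loop, each received bit is compared with the value predicted from the received history itself, so no synchronization to the transmitter is needed.

```python
def prbs9(n_bits, seed=0b111111111):
    """9-stage LFSR sequence with b[n] = b[n-5] XOR b[n-9] (assumed taps)."""
    history = [(seed >> i) & 1 for i in range(9)]  # any non-zero seed works
    out = []
    for _ in range(n_bits):
        new_bit = history[-5] ^ history[-9]
        history.append(new_bit)
        out.append(new_bit)
    return out

def open_loop_error_flags(received):
    """Flag every bit that disagrees with the prediction from received history."""
    return [int(received[n] != (received[n - 5] ^ received[n - 9]))
            for n in range(9, len(received))]

def output_transitions(flags, errors_per_transition):
    """Counter-style divider: one output transition per N flagged errors."""
    return sum(flags) // errors_per_transition

tx_bits = prbs9(1022)
rx_bits = list(tx_bits)
rx_bits[100] ^= 1  # inject a single bit error

print(tx_bits[:511] == tx_bits[511:])            # True: period of 511 bits
print(sum(open_loop_error_flags(tx_bits)))       # 0
print(sum(open_loop_error_flags(rx_bits)))       # 3
```

Note that in this sketch a single bit error raises three flags, once when the wrong bit arrives and once for each tap it later passes through, which is a general property of this kind of self-synchronizing checker.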
Fig. 4. DFE16QAM chip. Test modes (1) DFE AWG signal and LSA; (2)
Integrated PRBS generator and BER tester; (3) Output of PRBS generator.
Fig. 5. Auxiliary blocks: 4PAM PRBS9 generator, 1:4 output demux, CLK
buffers, integrated BER tester.
III. MEASUREMENT RESULTS
Figure 6 shows the die micrograph, fabricated in 28nm
CMOS technology. The 16QAM 5-tap complex DFE occupies
a total active area of 410x450 µm².
Figure 7 shows experimental BER results vs. sampling
instant for the integrated test mode (2) of Figure 4, before
and after offset calibration, as well as hysteresis compensation
(where the first direct coefficient is made to compensate for
roughly 10mV of hysteresis). The receiver makes no errors
over 1E9 bits across 0.86/0.32 unit interval (UI) at 18/36Gbps, for
QPSK/16QAM modulations, respectively, successfully validating
the most critical 1st tap closure and all of the proposed
test structures. The fact that the receiver needs hysteresis
compensation in 16QAM mode may indicate the presence of
residual ISI due to the finite bandwidth of the PRBS9 generators,
or a higher hysteresis value than predicted by simulation.
Fig. 6. Die photo in 28nm CMOS. 16QAM 5-tap DFE area is 410x450µm².
Fig. 7. BER vs. Sampling Instant curves for 18Gbps QPSK and 36Gbps
16QAM data, including comparators’ offset and hysteresis calibration.
To include emulated SNR and ISI conditions, the
AWG+LSA setup (1) of Figure 4 is needed. In this scenario
we are constrained to 16Gbps (due to measurement setup
limitations) for both modulation schemes. As shown in Figure
8 (top), the proposed DFE is able to compensate for 0.7x
cursor amplitude of ISI while making no errors for 1E6
bits, with 0.6/0.36UI aperture, for QPSK/16QAM respectively.
The noise performance (without any added ISI) is shown in
Figure 8 (bottom), indicating a minimum required SNR of 15/26dB for
QPSK/16QAM data, respectively.
The channel impulse response and resulting constellation
diagrams, as loaded into the AWG memory, are shown in
Figure 9. The dotted lines represent the decision thresholds for
an ideal detector. As can be seen, 0.7x cursor ISI is not enough
to produce errors for a QPSK signal, while it is impossible to
demodulate the 16QAM signal under the same conditions. We
therefore propose a different metric, the ratio of the total ISI
to the constellation decision margin (1 for QPSK
and 0.33 for 16QAM). An ideal detector would, therefore, be
able to demodulate any signal with a relative ISI < 1. For the
case of Figure 9 this gives a relative ISI of 0.7x/2.3x for the
QPSK/16QAM case, respectively, which is a more meaningful
basis for comparison.
Fig. 8. Measured BER vs. sampling instant, with 0.7ISI (top) and SNR with
no ISI (bottom). QPSK/16QAM data, both at 16Gbps (8GS/s and 4GS/s, left
and right, respectively).
Fig. 9. Channel impulse response containing 0.7x cursor ISI and resulting
constellations for QPSK and 16QAM data.
A summary of the DFE core power consumption is shown
in Table I. As stated previously, the digital gates and clock
buffering power consumption increases significantly when
going from a binary modulation to a multi-level one. Still,
the circuit is able to achieve reasonable power efficiency of
3.8mW/Gbps for the 16QAM case. A comparison against the
wireless state of the art is presented in Table II. With respect
to previous implementations, we have doubled the clock fre-
quency and modulation complexity, resulting in a total datarate
that is almost 4x higher. The penalty on the ISI cancellation is
mainly due to a smaller number of taps, since relative to the
constellation margin, it is essentially the same. Maintaining
noise and linearity requirements while operating with a 0.9V
supply is challenging, and obviously impacts the overall power
consumption. Still, the more advanced and faster node of 28nm
certainly helps, and a complete receiver baseband, including
variable gain amplifier, carrier and baseband clock recovery,
should consume no more than 10mW/Gbps.
TABLE I. POWER CONSUMPTION BREAKDOWN.
Spec / Block    DFE Amp   Comparator   FF      XOR     CLK Buffer   Total   Unit
Avg Power       11.7      4.5          1.44    0.72    36           -       mW
Multiplier
  QPSK          2x        2x           8x      8x      0.33x        -       -
  16QAM         2x        6x           24x     24x     1x           -       -
Total Power
  QPSK          23.4      9            11.5    5.8     11.9         62      mW
  16QAM         23.4      27           34.6    17.3    36           138     mW
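The totals in Table I are the per-block average power multiplied by the number of block instances in each mode, and the efficiency is the total divided by the data rate. The short sketch below reproduces this arithmetic from the table values. The dictionary layout and names are illustrative only, and the QPSK efficiency is derived here from the 62mW total and the 18Gbps data rate.

```python
AVG_POWER_MW = {"DFE Amp": 11.7, "Comparator": 4.5, "FF": 1.44,
                "XOR": 0.72, "CLK Buffer": 36.0}

MULTIPLIER = {
    "QPSK":  {"DFE Amp": 2, "Comparator": 2, "FF": 8,  "XOR": 8,  "CLK Buffer": 0.33},
    "16QAM": {"DFE Amp": 2, "Comparator": 6, "FF": 24, "XOR": 24, "CLK Buffer": 1},
}

DATA_RATE_GBPS = {"QPSK": 18, "16QAM": 36}

def total_power_mw(mode):
    """Sum of average block power times the instance multiplier for a mode."""
    return sum(AVG_POWER_MW[block] * MULTIPLIER[mode][block]
               for block in AVG_POWER_MW)

for mode in ("QPSK", "16QAM"):
    total = total_power_mw(mode)
    print(mode, round(total), round(total / DATA_RATE_GBPS[mode], 1))
# QPSK 62 3.4
# 16QAM 138 3.8
```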
TABLE II. WIRELESS RX STATE OF THE ART COMPARISON.
Specification          [4]                  [1]                  [2]                  Proposed
Technology             90nm 1.2V            65nm 1.2V            65nm LP 1.2V         28nm 0.9V
Symbol Rate            5 GS/s               5 GS/s               4 GS/s               9 GS/s
Modulation             QPSK                 QPSK                 QPSK                 16QAM
Data Rate              10 Gbps              10 Gbps              8 Gbps               36 Gbps
DFE Architecture       Unrolled half-rate   Unrolled half-rate   Unrolled half-rate   Direct full-rate
Number of Coeffs       10                   40                   100                  10
Coeff Resolution       6 bits               7 bits               8 bits               7 bits
ISI (wrt cursor)       -                    2.5X                 2X                   0.7X*
ISI (relative)         -                    2.5X                 2X                   2.3X*
Power (mW)             12                   14                   42                   138
Efficiency (mW/Gbps)   1.2                  1.4                  5.25                 3.8
1 Includes 32 FFE coefficients; 2 Includes phase rotator; * Measured at 16Gbps.
IV. CONCLUSION
The chip demonstrates, for the first time, analog demod-
ulation and equalization of 16QAM signals at a total data
rate up to 36Gbps, doubling both the clock frequency and
the modulation complexity when compared to previous im-
plementations, while maintaining comparable power efficiency
of 3.8mW/Gbps. It implements a complex 5-tap DFE with
resistive summing and full-rate architecture, as well as 1:4
demuxing and integrated test structures. It is a first step
towards a complete baseband for future power-efficient mm-
Wave receivers.
ACKNOWLEDGMENT
The authors would like to thank Luc Pauwels and Hans
Suys for lab support.
REFERENCES
[1] C. Thakkar, L. Kong, K. Jung, A. Frappe, and E. Alon, “A 10 Gb/s
45 mW adaptive 60 GHz baseband in 65 nm CMOS,” IEEE Journal of
Solid-State Circuits, vol. 47, no. 4, pp. 952–968, 2012.
[2] C. Thakkar, N. Narevsky, C. D. Hull, and E. Alon, “Design Techniques
for a Mixed-Signal I/Q 32-Coefficient Rx-Feedforward Equalizer, 100-
Coefficient Decision Feedback Equalizer in an 8 Gb/s 60 GHz 65 nm LP
CMOS Receiver,” IEEE Journal of Solid-State Circuits, vol. 49, no. 11,
pp. 2588–2607, 2014.
[3] J. Lee, P.-c. Chiang, P.-j. Peng, L.-Y. Chen, and C.-C. Weng, “Design of
56 Gb/s NRZ and PAM4 SerDes Transceivers in CMOS Technologies,”
IEEE Journal of Solid-State Circuits, vol. 50, no. 9, pp. 2061–2073, Sep.
2015.
[4] C. Marcu, D. Chowdhury, C. Thakkar, J. D. Park, L. K. Kong, M. Tabesh,
Y. Wang, B. Afshar, A. Gupta, A. Arbabian, S. Gambini, R. Zamani,
E. Alon, and A. M. Niknejad, “A 90 nm CMOS low-power 60 GHz
transceiver with integrated baseband circuitry,” IEEE Journal of Solid-
State Circuits, vol. 44, no. 12, pp. 3434–3447, 2009.
