A 60-Gb/s PAM4 Wireline Receiver With 2-Tap Direct Decision Feedback Equalization Employing Track-and-Regenerate Slicers in 28-nm CMOS by Chen, Kuan-Chang et al.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE JOURNAL OF SOLID-STATE CIRCUITS 1
A 60-Gb/s PAM4 Wireline Receiver With 2-Tap
Direct Decision Feedback Equalization Employing
Track-and-Regenerate Slicers in 28-nm CMOS
Kuan-Chang Chen, Member, IEEE, William Wei-Ting Kuo, Graduate Student Member, IEEE,
and Azita Emami, Senior Member, IEEE
Abstract— This article describes a 4-level pulse amplitude
modulation (PAM4) receiver incorporating continuous time lin-
ear equalizers (CTLEs) and a 2-tap direct decision feedback
equalizer (DFE) for applications in wireline communication.
A CMOS track-and-regenerate slicer is proposed and employed
in the PAM4 receiver. The proposed slicer is designed for the
purposes of improving the clock-to-Q delay as well as the output
signal swing. A direct DFE in a PAM4 receiver is made possible
with the proposed slicer by having rail-to-rail digital feedback
signals available with reduced delay, and accordingly relaxing the
settling time constraint of the summer. With the 2-tap direct DFE
enabled by the proposed slicer, loop-unrolling and inductor-based
bandwidth enhancement techniques, which can be area/power
intensive, are not necessary at high data rates. The PAM4 receiver
fabricated in 28-nm CMOS technology achieves bit-error-rate
(BER) better than 1E-12, and energy efficiency of 1.1 pJ/b at
60 Gb/s, measured over a channel with 8.2-dB loss at Nyquist.
Index Terms— Comparator, decision-feedback equalizer
(DFE), equalization, 4-level pulse amplitude modulation
(PAM4), receiver, slicer, wireline.
I. INTRODUCTION
FOUR-LEVEL pulse amplitude modulation (PAM4) signaling has become an attractive option for high-speed
data communication links where the channels suffer from
severe bandwidth limitation, by virtue of its halved Nyquist
frequency in comparison with that of non-return-to-zero (NRZ)
modulation. In other words, the PAM4 signaling improves
the spectral efficiency over that of NRZ, by encoding two
bits of information, often referred to as the most significant
bit (MSB) and the least significant bit (LSB), into one symbol.
The consequent advantages of using PAM4 as a substitution
for NRZ include the following: the bandwidth requirements for
the channel and the front-end circuits are both reduced, and
the circuits for clock generation and distribution can operate
at the halved frequency. These advantages can potentially
lead to higher data rates and/or lower power consumptions.
However, there are new challenges stemming from the nature
of multilevel signaling when designing PAM4 transceivers.
Manuscript received June 2, 2020; revised August 8, 2020; accepted
September 6, 2020. This article was approved by Guest Editor Qun Jane
Gu. (Corresponding author: Kuan-Chang Chen.)
The authors are with the California Institute of Technology, Pasadena,
CA 91125 USA (e-mail: kcxchen@caltech.edu).
Color versions of one or more of the figures in this article are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2020.3025285
Specifically, with a fixed transmitter swing that is divided
into the multiple levels, the receiver needs to resolve the
transmitted bits from signals that have lower strength. This
implies two important design challenges, which
this work aims to address. One is the more demanding
sensitivity of the decision circuitry, as will be elaborated in
later paragraphs. The other is the necessity of canceling the
inter-symbol interference (ISI), since the ISI resulting from
strong symbols, for example, (MSB, LSB) = (+1, +1), can
impose detrimental interference on the nearby weak symbols,
for example, (MSB, LSB) = (−1, +1), and cause undesirable
data eye closure as a result. The same proportion of ISI
level can be, on the contrary, tolerable in the cases of NRZ
modulation in that the bipolar symbols, or bits, have nominally
identical magnitude of signal swings.
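As a rough numerical illustration of the two points above, the following sketch computes the Nyquist frequency of NRZ and PAM4 at the same bit rate, and the nominal amplitude available to each eye when a fixed swing is divided evenly among the levels. The function names and the 1.0-V example swing are illustrative assumptions and are not taken from this work.

```python
import math


def nyquist_hz(bit_rate_bps: float, levels: int) -> float:
    """Nyquist frequency, i.e., half of the symbol rate.

    Each symbol carries log2(levels) bits, so PAM4 (levels=4) halves the
    symbol rate relative to NRZ (levels=2) at the same bit rate.
    """
    bits_per_symbol = math.log2(levels)
    symbol_rate = bit_rate_bps / bits_per_symbol
    return symbol_rate / 2.0


def nominal_eye_amplitude(swing: float, levels: int) -> float:
    """Spacing between adjacent levels for evenly spaced signaling.

    Assumes an ideal, linear transmitter with a fixed peak-to-peak swing.
    """
    return swing / (levels - 1)


if __name__ == "__main__":
    rate = 60e9  # bit rate used in this work
    print("NRZ  Nyquist (GHz):", nyquist_hz(rate, 2) / 1e9)
    print("PAM4 Nyquist (GHz):", nyquist_hz(rate, 4) / 1e9)
    swing = 1.0  # hypothetical swing in volts, for illustration only
    print("NRZ  eye amplitude (V):", nominal_eye_amplitude(swing, 2))
    print("PAM4 eye amplitude (V):", nominal_eye_amplitude(swing, 4))
```

Under these idealized assumptions, each PAM4 eye is nominally one third of the NRZ eye for the same swing, which is why the same proportion of ISI is more harmful in PAM4.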
Depending on the architecture of the receiver, analog-based
equalization and/or analog-to-digital converter (ADC)-based
equalization can be employed. In both scenarios, the incor-
poration of a decision feedback equalizer (DFE) is often an
appealing option, as a DFE can succeed in compensating
post-cursor ISI without amplifying crosstalk and noise. Recent
examples include an ADC-based PAM4 receiver [1], designed
in 16-nm FinFET CMOS, utilizing analog CTLE together with
24-tap feedforward equalizer (FFE) and 1-tap DFE imple-
mented in digital domain, and the transceiver achieves bit-
error-rate (BER) less than 1E-8 at 56 Gb/s over a channel
with 31-dB loss at 14 GHz (Nyquist frequency). With the
feasibility of integrating hybrid analog and digital equalization
including long-tap FFE, ADC-based receiver architectures
have been designed for longer reach or channels with loss
greater than 30 dB at Nyquist [1]–[4]. On the other hand,
an analog-based 40–56-Gb/s PAM4 receiver in 16-nm FinFET
CMOS [5], targeting chip-to-module and board-to-board cable
interconnects, mitigates the channel loss of 10 dB at 14 GHz
and reflections, by incorporating CTLE and direct 10-tap DFE
in analog domain. Compared to [1], which equalizes >30-dB
loss at Nyquist with an ADC-based architecture, this analog-
based receiver [5], designed for 10-dB loss at Nyquist, achieves
BER of less than 1E-12 at 56 Gb/s while consuming ∼40% less
power [5]. These previous designs suggest that for short reach
applications where channel losses can be less than 10 dB
at Nyquist, an ADC-based receiver may not be the optimal
solution in consideration of both the hardware and power that
need to be invested.
0018-9200 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: CALIFORNIA INSTITUTE OF TECHNOLOGY. Downloaded on October 08,2020 at 14:00:37 UTC from IEEE Xplore.  Restrictions apply. 
Fig. 1. Hardware implementation of PAM4-DFE in half-rate designs. Only
the even data-path is shown for clarity, where THH, TH0, THL are the
three distinct threshold levels, and h1 corresponds to the first post-cursor ISI.
(a) Direct 1-tap PAM4-DFE. (b) Loop-unrolling 1-tap PAM4-DFE.
Despite the usefulness of including a DFE as part of
a PAM4 receiver in the analog fashion, as demonstrated
in [5]–[7], improving the energy efficiency of an analog-
based PAM4-DFE at high data rates remains challenging.
First, compared with NRZ receivers, the reduced eye-height
in PAM4 receivers (by a factor of ∼3 in the absence of
nonlinearity and with fixed transmitter swing) sets a more
stringent limit for the sensitivity of the slicer used for resolving
the symbols and making decisions. Furthermore, the sensitivity
requirement generally becomes more difficult to meet, given
tighter timing constraints, such as at higher data rates and/or
with lower decision latency requirements in feedback loops.
Second, at least three slicers are required with respect to the
three distinct thresholds, and therefore, the power consumed
by the slicers and the loading presented by the slicers are of
much greater concern in designing PAM4 receivers. Moreover,
as can be seen in Fig. 1, which compares the implementation
of direct 1-tap PAM4-DFE with that of 1-tap loop-unrolling
PAM4-DFE, the loop-unrolling technique demands signifi-
cantly more hardware. Even if only one tap is unrolled, it needs
12 slicers, three multiplexers, and one thermometer-to-binary
(T2B) decoder for each deserialized branch (e.g., 24 slicers,
six multiplexers, and two T2B decoders in total for a half-rate
design). Since the number of slicers increases exponentially
with the number of taps unrolled, the loop-unrolling technique
is prohibitively costly in hardware and power consumption for
a high data rate PAM4 receiver, suggesting that the speed or
the delay performance of the slicers is critical. As illustrated
in Fig. 2, a stringent timing constraint that requires all the
operations to be finished within 1 UI is set, when attempting
to directly close the decision feedback loop for the first tap.
Although the signal propagation and settling happen concur-
rently in reality, it is informative and useful to conceptually
distinguish them into the setup time of the slicer, clock-to-
Q delay of the slicer, the propagation delay of the DFE tap,
and the settling time of the summer. Details and interpreta-
tions of these timing constraints are presented in Section IV.
In particular, since the clock-to-Q delay of the slicer takes
up a considerable portion in the 1-UI constraint, as will be
Fig. 2. Timing constraint for a direct DFE design for Nth post-cursor ISI
compensation.
shown in Section III, the improvement in slicer delay helps
to close the loop at higher data rates or to relax the summer
design such that no excess power-bandwidth tradeoffs or area-
consuming inductors are required for reducing the summer
settling time. Therefore, this work aims to demonstrate the
idea of implementing an energy-efficient PAM4 receiver with
direct DFE loops by improving the slicer performance.
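The hardware and timing arguments above can be sketched numerically. The first function reproduces the slicer counts quoted in the text (12 per branch and 24 in a half-rate design for one unrolled tap) under the assumption of full loop-unrolling, in which each of the three thresholds is replicated for every possible symbol history. The second function evaluates the first-tap budget as the sum of the four conceptual delays against 1 UI. All delay values in the example are hypothetical placeholders and are not measured or simulated results from this work.

```python
def unrolled_slicer_count(levels: int, unrolled_taps: int, branches: int = 1) -> int:
    """Data slicers needed when `unrolled_taps` DFE taps are loop-unrolled.

    Assumes full unrolling: (levels - 1) thresholds are replicated for each
    of the levels**unrolled_taps possible histories, per deserialized branch.
    unrolled_taps = 0 corresponds to a direct DFE.
    """
    thresholds = levels - 1
    histories = levels ** unrolled_taps
    return thresholds * histories * branches


def first_tap_slack_ps(ui_ps: float, setup_ps: float, clk_to_q_ps: float,
                       tap_prop_ps: float, settle_ps: float) -> float:
    """Remaining margin of the first-tap loop; negative means the loop fails."""
    return ui_ps - (setup_ps + clk_to_q_ps + tap_prop_ps + settle_ps)


if __name__ == "__main__":
    for taps in range(3):
        print(taps, "unrolled tap(s):",
              unrolled_slicer_count(4, taps, branches=2), "slicers (half-rate)")

    ui_ps = 1e12 / 30e9  # 60-Gb/s PAM4 corresponds to 30 GBd
    # Hypothetical delay breakdown, for illustration only.
    slack = first_tap_slack_ps(ui_ps, setup_ps=3.0, clk_to_q_ps=18.0,
                               tap_prop_ps=4.0, settle_ps=6.0)
    print("UI (ps):", round(ui_ps, 2), "slack (ps):", round(slack, 2))
```

Because the budget is a simple sum, every picosecond removed from the clock-to-Q delay can be reallocated to the summer settling time, which is the tradeoff this work exploits.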
This article, which expands upon [22], is organized as follows.
Section II presents the overall PAM4 receiver architecture,
where each subsection describes the circuits that serve as key
building blocks in the analog front-end (AFE) and in the
clock path. Section III reviews the operations and features
of prevalent slicer topologies, and describes the proposed
slicer in detail. Section IV elaborates the timing constraint
for completing the DFE loops. Experimental results of this
PAM4 receiver are shown in Section V, and finally, Section VI
summarizes this work with performance comparisons and
conclusions.
II. RECEIVER ARCHITECTURE
A. Overall Architecture
The overall architecture of the PAM4 receiver is shown
in Fig. 3. The AFE is composed of two stages of continuous
time linear equalizer (CTLE) and two half-rate summers. The
outputs of each summer are connected to four proposed CMOS
track-and-regenerate slicers, among which one is responsible
for the eye monitor (EM), and the other three slicers are
dedicated to recovering the analog summer outputs to the
corresponding 3-bit thermometer-coded digital levels. With the
proposed slicers, direct 2-tap DFE is implemented. The 3-bit
thermometer-coded outputs are first directly fed back to the
summer in the other data path for the first tap of DFE, and
then with 1-UI delay, fed back to the summer in the same
data path for the second tap of DFE. The digital-level slicer
outputs are further demultiplexed (1-to-32) for external and
on-chip EM and BER counters (BERCs) to evaluate the eye-
opening and BER performance, respectively. The clock path
CHEN et al.: 60-Gb/s PAM4 WIRELINE RECEIVER WITH 2-TAP DIRECT DECISION FEEDBACK EQUALIZATION 3
Fig. 3. Overall architecture of the PAM4 receiver.
Fig. 4. (a) Schematic of the source-degenerated CTLE. (b) Simulated
frequency response of the CTLE (single stage) with different settings of VCAP.
takes in an external pair of half-rate differential clock signals
and amplifies them to rail-to-rail levels with on-chip duty cycle
correction (DCC). Clock buffers (CKBUFs) and a digitally
adjustable delay line (DL) are included on the chip, serving
as the interfaces with the clocked slicers to provide rail-to-rail
clock signals for data recovery as well as the required clock
phases for eye monitoring.
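The data-path behavior described above can be illustrated with a minimal behavioral model of a 2-tap direct PAM4 DFE. This is a symbol-rate sketch under stated assumptions, not the circuit implementation: levels are normalized to −3, −1, +1, +3, the thresholds are ideal, the channel has only two post-cursor taps (h1 and h2, hypothetical values), and there is no noise. In the half-rate hardware, the first-tap decision comes from the other data path and the second-tap decision from the same data path; here both are simply the previous two decisions in sequence.

```python
import random

# Ideal thresholds (THL, TH0, THH) for normalized levels -3, -1, +1, +3.
THRESHOLDS = (-2.0, 0.0, 2.0)


def slice_thermometer(v):
    """Three data slicers producing a 3-bit thermometer code."""
    return tuple(1 if v > th else 0 for th in THRESHOLDS)


def thermometer_to_level(code):
    """Map the thermometer code back to a normalized PAM4 level."""
    return 2 * sum(code) - 3


def channel(symbols, h1, h2):
    """Toy channel: unit main cursor plus two post-cursor ISI taps."""
    out = []
    for n, x in enumerate(symbols):
        x1 = symbols[n - 1] if n >= 1 else 0
        x2 = symbols[n - 2] if n >= 2 else 0
        out.append(x + h1 * x1 + h2 * x2)
    return out


def direct_dfe(received, h1, h2):
    """Subtract ISI estimated from the two previous decisions, then slice."""
    d1 = d2 = 0  # previous and second-previous decisions
    decisions = []
    for y in received:
        summed = y - h1 * d1 - h2 * d2  # summer output
        d = thermometer_to_level(slice_thermometer(summed))
        decisions.append(d)
        d2, d1 = d1, d
    return decisions


def slice_only(received):
    """Reference case without any equalization."""
    return [thermometer_to_level(slice_thermometer(y)) for y in received]


if __name__ == "__main__":
    rng = random.Random(0)
    tx = [rng.choice([-3, -1, 1, 3]) for _ in range(1000)]
    rx = channel(tx, h1=0.3, h2=0.1)  # hypothetical tap values
    errors_dfe = sum(a != b for a, b in zip(direct_dfe(rx, 0.3, 0.1), tx))
    errors_raw = sum(a != b for a, b in zip(slice_only(rx), tx))
    print("symbol errors with DFE:", errors_dfe, "without:", errors_raw)
```

The model also shows why strong symbols are harmful to weak ones: a +1 symbol preceded by two +3 symbols is pushed above the upper threshold unless the feedback taps remove the ISI.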
The following sections describe the details associated with
the design of CTLE in Section II-B, half rate summers in Sec-
tion II-C, linearity characterizations in Section II-D, current
mode logic (CML)-to-CMOS clock converter in Section II-E,
and DCC circuits in Section II-F. The details of the proposed
slicer are presented in Section III.
B. CTLE
CTLE is included in the receiver to mitigate both pre-cursor
ISI and post-cursor ISI, as the coverage of the direct 2-tap
DFE design is limited to the first and second post-cursor ISI.
Fig. 4(a) shows the schematic of the CTLE, which adopts
the conventional topology of RC source-degenerated differen-
tial amplifier with digital controllability. The high-frequency
peaking can be enabled or disabled by setting VCTRL to be
Fig. 5. Architecture and performance of the summer for 2-tap DFE.
logic low or logic high, respectively. As shown in Fig. 4(b),
the peaking frequency is digitally adjustable by varying the
voltage level of VCAP. Since the source-degenerated resistance
remains unchanged, the dc gain of the CTLE is approximately
0.9 (V/V), independent of the setting of VCAP. Without the
inclusion of inductors, the frequency boost at 15 GHz is
simulated to be 2.1 dB for a single CTLE stage. The voltage
level of VCAP is set by an 8-bit on-chip voltage digital-to-
analog converter (DAC), the implementation of which
follows the conventional R-2R resistor-ladder architecture as
presented in [1]. The voltage DAC therefore provides a dc
voltage with 8-bit resolution between the ground (0 V) and
a reference voltage, VHIGH, where the value of VHIGH can be
changed via a pad connected to an external voltage source.
In this prototype, an on-chip voltage DAC bank consisting
of duplications of the aforementioned 8-bit voltage DAC is
responsible for generating the digitally adjustable voltage lev-
els. For further reduction in the area overhead, the resolution of
each voltage DAC can be individually optimized with respect
to the associated circuit blocks.
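For reference, the code-to-voltage mapping of such a DAC can be sketched as follows, assuming an ideal voltage-mode R-2R ladder with no resistor mismatch or loading. The function name and the 0.9-V example value for VHIGH are illustrative assumptions; VHIGH is set externally in this prototype.

```python
def r2r_dac_voltage(code: int, v_high: float, bits: int = 8) -> float:
    """Ideal R-2R ladder output: v_high * code / 2**bits.

    Assumes perfectly matched resistors and an unloaded output, so the
    output spans 0 V to v_high * (2**bits - 1) / 2**bits.
    """
    if not 0 <= code < 2 ** bits:
        raise ValueError("code out of range for the given resolution")
    return v_high * code / 2 ** bits


if __name__ == "__main__":
    v_high = 0.9  # hypothetical reference voltage in volts
    lsb = r2r_dac_voltage(1, v_high)
    print("LSB (mV):", lsb * 1e3)
    print("mid-code (V):", r2r_dac_voltage(128, v_high))
```

Since the step size scales with VHIGH, lowering the external reference gives finer control over a narrower range, which is one way the resolution can be traded against range for a given circuit block.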
C. Summer
The summers used in the PAM4 receiver fall in the category
of resistively loaded CML summer, and the architecture incor-
porating 2-tap DFE summation is shown in Fig. 5. Resistive
source-degeneration is employed for linearity improvement.
Depending on the previous two symbols resolved by the three
data slicers; that is, the corresponding six thermometer-coded
digital signals in differential fashion, the six tail currents are
respectively steered to one of the two load resistors to perform
DFE summation. To maintain the common-mode voltage level
at the summer outputs irrespective of the DFE setting, all these
tail currents are summed and mirrored to a common-mode
restoration block which injects the currents evenly from the
supply into the summing nodes (OUTPSUM and OUTNSUM).
The common-mode restoration allows the threshold setting and
delay performance of the slicers to be independent of the DFE
setting.
Fig. 6. (a) Schematic of the common-mode restoration circuits. (b) Simulated performance of the common-mode restoration circuits, showing the deviation
from the target common mode with and without the common-mode restoration circuits.
Fig. 7. (a) Nomenclature for PAM4 eye diagrams and the definition for PAM4 EL. (b) Simulated linearity performance of the summer. (c) Simulated linearity
performance of the CTLE.
The schematic of the common-mode restoration circuits is
shown in Fig. 6(a). It is similar to that in prior art [5], with
an additional function included in this work for offset
compensation. The common-mode restoration currents, ICMP
and ICMN, are nominally half of the sum of all DFE currents,
that is, (3IDFE1 + 3IDFE2)/2. The offset cancellation currents,
IOSP and IOSN, are individually adjustable to compensate the
accumulated dc offset of the CTLE stages and the summer.
In this prototype, a closed offset-cancellation loop is not
implemented, and the values of IOSP and IOSN are adjusted
with on-chip voltage DACs. Due to the finite output resistance
of the current sources and current mirrors, larger errors can
be introduced when the currents to be copied become larger.
As shown in Fig. 6(b), simulations have been carried out to
study the deviations from the target common-mode voltage
level, with distinct settings of DFE currents. It can be seen
that without the common-mode restoration circuits, the
output common-mode level of the summer drops roughly
linearly with the increase of DFE currents; the voltage drop
of output common-mode level is approximately 70 mV
when (IDFE1 + IDFE2) = 500 μA. By contrast, when the
common-mode restoration circuits are connected, the voltage
drop of output common-mode level is less than 6 mV when
(IDFE1 + IDFE2) = 500 μA, and a relatively constant output
common-mode level is sustained across the range shown
in Fig. 6(b).
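The role of the restoration currents can be illustrated with a first-order calculation. The sketch below assumes ideal resistive loads and models the restoration as a fraction of the nominal current that is actually delivered; the 100-Ω load and the 95% copy accuracy are hypothetical values chosen only for illustration and are not parameters reported in this work.

```python
def restoration_current_ua(i_dfe1_ua: float, i_dfe2_ua: float,
                           slicers: int = 3) -> float:
    """Nominal ICMP = ICMN: half of the total DFE tail current.

    With three data slicers per tap, the total is 3*IDFE1 + 3*IDFE2.
    """
    return slicers * (i_dfe1_ua + i_dfe2_ua) / 2.0


def common_mode_drop_mv(i_dfe1_ua: float, i_dfe2_ua: float,
                        r_load_ohm: float, copy_accuracy: float = 0.0) -> float:
    """First-order drop of the summer output common mode due to DFE currents.

    The two outputs share the total DFE current regardless of how each tap
    is steered, so the common mode sees half of it through each load.
    copy_accuracy = 0 models no restoration; 1 models an ideal current copy.
    """
    uncompensated_ua = restoration_current_ua(i_dfe1_ua, i_dfe2_ua) * (1.0 - copy_accuracy)
    return uncompensated_ua * r_load_ohm / 1000.0  # uA * ohm = uV, then to mV


if __name__ == "__main__":
    r_load = 100.0  # hypothetical load resistance in ohms
    print("ICMP = ICMN (uA):", restoration_current_ua(250.0, 250.0))
    print("drop without restoration (mV):",
          common_mode_drop_mv(250.0, 250.0, r_load))
    print("drop with 95% accurate copy (mV):",
          common_mode_drop_mv(250.0, 250.0, r_load, copy_accuracy=0.95))
```

In this simplified view the uncompensated drop grows linearly with the DFE currents, in line with the trend described above, while the residual drop with restoration is set by the accuracy of the current mirrors.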
D. Linearity Characterizations
Since the receiver front-end linearity performance is crucial
for multilevel signaling, the linearity of the summer and that of
the CTLE are respectively examined via the evaluations on the
output eye linearity (EL) versus the input amplitude. Fig. 7(a)
shows the nomenclature for PAM4 eye diagrams along with
the definition used for PAM4 EL, where VAmp denotes the
peak-to-peak input amplitude; EHH, EHM, and EHL measure
the eye heights of the upper eye, the middle eye, and the lower
eye, respectively. Clean PAM4 signals without level mismatch
(i.e., EL = 1) are applied to the input of the summer, and
the output EL of the summer is recorded for each given input
amplitude. Similarly, to test the CTLE linearity, PAM4 signals
of different amplitudes with EL = 1 are generated, whereas
these signals go through a channel with 4-dB loss at 15 GHz
before being applied to the input of the CTLE. The two-stage
CTLE is correspondingly configured to provide ∼4-dB boost
at 15 GHz, which is the target amount of peaking, as described
in Section II-B. The simulated results at different process
corners are shown in Fig. 7(b) for the summer, and Fig. 7(c)
Fig. 8. (a) Schematic of the CML-to-CMOS clock converter. (b) Simulated minimum required input peak-to-peak amplitude with different input clock
frequencies for the CML-to-CMOS clock converter.
for the CTLE. As variable gain amplifiers (VGAs) are not
included in this prototype, the EL remains above 90% only
when VAmp is not greater than ∼450 mV.
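The exact EL definition used in this work is given in Fig. 7(a) and is not reproduced in the text. As an illustration only, the sketch below assumes one commonly used definition, namely the ratio of the smallest to the largest of the three eye heights, which equals 1 for a signal without level mismatch. The function name and the example eye heights are hypothetical.

```python
def eye_linearity(eh_high: float, eh_mid: float, eh_low: float) -> float:
    """Assumed EL definition: min(eye heights) / max(eye heights).

    This is an assumption for illustration; the definition in Fig. 7(a)
    of the article may differ.
    """
    heights = (eh_high, eh_mid, eh_low)
    if min(heights) <= 0:
        raise ValueError("all eye heights must be positive")
    return min(heights) / max(heights)


if __name__ == "__main__":
    # Hypothetical eye heights in mV showing compression of the outer eyes.
    print("ideal:", eye_linearity(100.0, 100.0, 100.0))
    print("compressed:", eye_linearity(90.0, 100.0, 95.0))
```

With a definition of this kind, front-end compression at large input amplitudes shrinks the outer eyes relative to the middle eye and lowers the EL.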
E. CML-to-CMOS Clock Converter
Fig. 8(a) shows the schematic of the CML-to-CMOS clock
converter. It consists of a differential amplifier and two stages
of ac-coupled inverter-based clock amplifier. The use of ac
coupling capacitor and inverter with the input node connected
to the output node via a resistor ensures that the dc level of the
clock signals is biased to around half of the supply voltage.
The CML-to-CMOS clock converter is able to amplify incom-
ing sinusoidal clock signals to rail-to-rail (i.e., CMOS levels)
at various clock frequencies, provided that the amplitude of
the input sinusoidal signals is sufficiently large. Fig. 8(b)
summarizes the minimum required peak-to-peak amplitudes
at different frequencies such that the clock
signals at the converter output swing at least from 50 to 850 mV.
In particular, for 15-GHz clock signals, 24-mVpp input
amplitude is needed for an output swing of at least 50–850 mV,
and 40-mVpp input amplitude further extends the output
swing to span approximately from ground (0 V) to the supply voltage
(900 mV). By providing larger input amplitudes, this CML-
to-CMOS clock converter can work at higher frequencies.
F. DCC Circuits
Duty-cycle distortion effectively induces unequal time
frames for the operations (e.g., sampling, or data recovery) in
different data paths, and therefore duty-cycle distortion can be
highly undesirable for high data-rate designs where the per-
formance such as BER is sensitive to the unwanted reduction
or imbalance of the timing allocation. In light of the negative
effects of the duty-cycle distortion, DCC circuits are designed
and implemented on the chip. Fig. 9(a) presents the schematic
of the DCC circuits. The duty-cycle is adjusted by varying
the amounts of the currents, IUP and IDN, which are digitally
programmable by 10 bits, b<9:0>. In addition, to be capable
of accommodating both large duty-cycle distortion and fine-
tuning, the value of VBIAS, which sets the current level of the
current sources, is designed to be also digitally adjustable with
an on-chip DAC. Simulation results of the DCC at 15 GHz are
shown in Fig. 9(b). With the simultaneous programmability
of the values of IUP, IDN, and VBIAS, the DCC is able to correct
the input clock signal with duty-cycle of 25%–75% such that
the duty-cycle of the output clock signal is very close to 50%
with errors not greater than 0.1%. Provided that the duty-cycle
of the input clock source to the receiver chip is 50%, Monte
Carlo simulations show that the resultant duty-cycle at the
outputs of the on-chip clock path varies from 48.46% to
52.48%. Accordingly, the presented DCC well covers the
range due to process variations, and also has the competence
to accommodate an input clock source whose duty-cycle
deviates from 50%. In this work, an on-chip adaptive closed
loop for setting the DCC is not implemented, but the setting is
swept with the aim of optimizing the measured BER instead.
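The impact of duty-cycle distortion on a half-rate design, and the open-loop sweep described above, can be sketched as follows. The first function converts a duty cycle into the time windows available to the two data paths, assuming that one path operates during the clock-high phase and the other during the clock-low phase. The second function is a generic sweep; `measure_ber` is a hypothetical callback standing in for the bench measurement and is not an interface from this work.

```python
def half_rate_windows_ps(clock_hz: float, duty_cycle: float):
    """Time windows (ps) for the clock-high and clock-low data paths."""
    if not 0.0 < duty_cycle < 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    period_ps = 1e12 / clock_hz
    high_ps = duty_cycle * period_ps
    return high_ps, period_ps - high_ps


def sweep_dcc(codes, measure_ber):
    """Return the DCC code with the lowest measured BER (open-loop sweep)."""
    return min(codes, key=measure_ber)


if __name__ == "__main__":
    # Extremes of the Monte Carlo duty-cycle range reported in the text.
    for duty in (0.4846, 0.50, 0.5248):
        high, low = half_rate_windows_ps(15e9, duty)
        print(f"duty {duty:.4f}: high {high:.2f} ps, low {low:.2f} ps")

    # Toy BER model with its optimum at code 600, for illustration only.
    best = sweep_dcc(range(1024), lambda code: abs(code - 600) * 1e-15)
    print("best DCC code:", best)
```

At 15 GHz, a deviation of about 1.5% from the ideal duty cycle already shifts roughly 1 ps from one path to the other, which is significant compared with the 33.3-ps UI.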
III. SLICER DESIGN
A. Overview
Voltage comparators, also known as slicers, or sense ampli-
fiers in some contexts, have served widely in mixed-signal
circuits and systems, including ADCs, adaptive configuration
loops, memory access circuitry, and data receivers. A variety of
slicer topologies with their practical utility have been demon-
strated. In [8], CML slicers appeared in the implementation of
a 6-bit ADC, and later the CML slicer topology has been fre-
quently employed in data receivers [6], [7], [9], [10]. A CMOS
latch-type comparator, known as the StrongArm and originally
studied in memory circuits [11], has become popular due to its
often negligible static power consumption and its ability
to generate rail-to-rail output swings. The StrongArm has
found broad applications in both low-power architectures and
high-speed receivers, and its mechanism appears to have inspired
variants of CMOS latch-type slicers. A 2-stage
topology called double-tail latch-type voltage sense amplifier
is presented in [12], which enhances the capability of operating
at lower supply voltages and input common-mode voltage
levels, by having fewer stacked transistors and separate tail
currents for the input stage and the latch stage. The slicers
Fig. 9. (a) Schematic of the DCC circuits. (b) Simulated performance of DCC with 15-GHz clock signals.
Fig. 10. Prevalent slicer topologies. (a) StrongArm slicer. (b) CML slicer.
used in [13] and [14] are both essentially variants of the
double-tail latch-type slicer [12], with an augmented function
to incorporate 1-tap DFE summation. Another 2-stage slicer
is reported in [15], which is stated to allow the common mode
to be increased for the same clock-to-Q delay.
Compared with the StrongArm, the aforementioned latch-type
slicers [12]–[15] attempt to maintain the delay performance
across an extended range of supply voltage or input common-
mode levels, without much emphasis on considerably reducing
the achievable delay.
As illustrated in Fig. 2, the clock-to-Q delay performance
of the slicers plays a critical role in closing the DFE loops.
The next subsection describes the features of two particularly
prevalent slicer circuits, namely, the StrongArm and the
CML slicer, and discusses the potential improvements.
B. Prevalent Slicer Topologies
The schematic of the StrongArm is shown in Fig. 10(a).
It is designed in a dynamic CMOS latch fashion and its typical
operation is illustrated in Fig. 11. When the clock (CK) is logic
low, the outputs are both being charged to the supply value
such that the differential output is reset to approximately zero.
When the clock becomes logic high, the StrongArm samples
the differential input and then the differential output is regen-
erated toward rail-to-rail with the help of the positive feedback
offered by the cross-coupled pairs. A few observations can be
made after closely examining the simulated waveforms. First,
Fig. 11. Simulated waveforms showing the typical operations including the
reset, sample, and regenerate phases of the StrongArm slicer.
owing to the reset mechanism, there is always a certain time
that needs to be spent for the differential output signal to grow
from approximately zero. Second, since the time allocated for
regeneration is limited by the data rate, starting the regeneration
from a higher level is very beneficial in that at the end of
the regeneration phase, the differential output swing can be
considerably larger and the delay to reach digital levels is also
significantly shorter. In other words, for high data-rate operations,
the time available for the output signals of the StrongArm
to grow from approximately zero to the level that can be
identified as digital outputs may not be sufficient. The above
observations motivate the idea to design a slicer which instead
of resetting, tracks the polarity of the differential input signal
such that the regeneration can proceed with a higher signal
level. Another prevalent slicer topology shown in Fig. 10(b),
known as the CML slicer, leverages the idea of tracking the
inputs. However, as the output swing magnitude of the CML
slicer cannot exceed the product of the tail current and the
load resistance, (ITAIL × RL), a number of drawbacks are
associated with the CML slicer when implementing a PAM4-
DFE with it. For one thing, since the output swing is not
rail-to-rail, the CML slicer may not be directly compatible
with the relatively energy-efficient CMOS gates for delaying
or buffering the resolved data, and the potential solution of
inserting CML-to-CMOS amplifiers would increase the total
delay. For another thing, the smaller output swing offers less
strength to steer the DFE currents, and therefore, the sizes of
the differential pairs of the DFE current branches cannot be
minimized, which equivalently adds restrictions in minimizing
the load capacitance at the summer outputs. Furthermore,
referring to Fig. 10(b), because M1 and M2 are directly connected
to M3 and M4, when the CML slicer is designed for larger
output swing with large tail current, it leads to relatively high
power consumption, and at the same time presents large input
and output capacitances.
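The benefit of starting the regeneration from a higher level can be quantified with the standard first-order latch model, in which the differential output grows exponentially with a regeneration time constant. This is a simplified sketch and not a simulation of the slicers in this work; the time constant and the starting levels in the example are hypothetical.

```python
import math


def regeneration_time_ps(v_start_mv: float, v_target_mv: float,
                         tau_ps: float) -> float:
    """Time for a regenerative latch to grow from v_start to v_target.

    First-order model: v(t) = v_start * exp(t / tau), which gives
    t = tau * ln(v_target / v_start).
    """
    if v_start_mv <= 0 or v_target_mv <= 0 or tau_ps <= 0:
        raise ValueError("all arguments must be positive")
    return tau_ps * math.log(v_target_mv / v_start_mv)


if __name__ == "__main__":
    tau = 5.0       # hypothetical regeneration time constant in ps
    target = 450.0  # differential level treated as a digital output, in mV
    for start in (1.0, 10.0, 50.0):  # hypothetical starting levels in mV
        t = regeneration_time_ps(start, target, tau)
        print(f"start {start:5.1f} mV -> {t:5.2f} ps")
```

Because the dependence on the starting level is logarithmic, each tenfold increase in the initial differential voltage saves a fixed amount of time, about 2.3 time constants, which matters when the whole regeneration phase lasts only a fraction of a clock period.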
C. CMOS Track-and-Regenerate Slicer
In view of the above, a CMOS track-and-regenerate
slicer is proposed and designed, aiming to improve the
clock-to-Q delay as well as the output swing. When the DFE
is implemented with the proposed slicer, digital-level outputs
are directly available and the settling time specification
of the summer is relaxed in consequence of the reduced
slicer delay, enabling an energy-efficient DFE design that
operates at high data rates. The overall circuit schematic of
the proposed CMOS track-and-regenerate slicer is shown in
Fig. 12(a). The proposed slicer tracks the differential input
instead of being reset, and it regenerates the differential
output to rail-to-rail levels. Designed in CMOS dynamic latch
fashion, the proposed slicer is suitable for technology scaling
and can be viewed as having three-stage configuration. The
first stage, consisting of M1−M10, works as a dynamic
differential amplifier. M11−M14 form the second stage,
which serves as a buffer to provide some isolation between
the first and the third stage. The third stage, M15−M22,
is essentially dynamically controlled cross-coupled pairs
that are responsible for regenerating the signal with positive
feedback. Fig. 12(b) and (c) illustrate, respectively, the
operation of the proposed slicer during the two complementary
clock phases. When CK is logic low and CKB is logic high,
as in Fig. 12(b), M1−M8 and M11−M14 perform the tracking
function with M9, M10, M17, and M18 turned off, and they
overwrite the latch outputs (OUTP and OUTN). M15 and
M16 are always kept on and conduct relatively weak currents
so that the cross-coupled pairs (M19−M22) do not have to recover from
being completely off, while still allowing the outputs to be easily
overwritten. In the other half of clock cycle, that is, when
CK is logic high and CKB is logic low, as in Fig. 12(c),
the tracking function is stopped with the outputs of the first
stage being cleared and the second stage disabled. The outputs
of the first stage are cleared by M9 and M10 which discharge
the output node voltages toward zero, and hence eventually
turn off M11 and M12. With M11 and M12 turned off by
M9 and M10, respectively, and M13 and M14 turned off with the
rise of CK, the second stage is quickly disabled, isolating the
continuously changing inputs from the latch outputs which
shall be regenerated toward rail-to-rail levels with respect
to the polarity that has been tracked. At the same time,
the cross-coupled pairs conduct significantly more currents
by turning on M17 and M18, empowering strong positive
Fig. 12. Proposed CMOS track-and-regenerate slicer. (a) Overall circuit
schematic. (b) Proposed slicer in track mode. (c) Proposed slicer in regenerate
mode.
feedback for the regeneration. It is noteworthy that clearing
the outputs of the first stage with M9 and M10 during the
regenerate-mode also helps with tracking the inputs for the
next tracking phase, because the first stage itself does
not retain the results from the previous tracking phase.
In this work, the threshold level of the slicer is determined by
the gate voltages THP and THN, and each slicer has its own
individually adjustable threshold generator. The threshold
levels are programmable from an external field-programmable
gate array (FPGA) and set by on-chip voltage DACs that are
described in Section II-B. The slicer offset can be compensated
by setting THP and THN correspondingly for a given threshold.
D. Simulation Results
For the purpose of demonstrating the features of the pro-
posed CMOS track-and-regenerate slicer and comparing the
Fig. 13. Simulations and comparisons of the large-signal performance
between the reset-and-regenerate StrongArm and the proposed CMOS track-
and-regenerate slicer. (a) Input signals to the slicers. (b) Optimal clock
signals and the resulting output waveforms of the slicers with 900-mV supply.
(c) Optimal clock signals and the resulting output waveforms of the slicers
with 850-mV supply. (d) Faster reaction to strong symbols with the proposed
slicer.
proposed slicer with the StrongArm which is also compact
in size and suitable for technology scaling, extensive simu-
lations have been carried out and the results are presented
as follows. Fig. 13 illustrates the large-signal behavior and
clock-to-Q delay performance of the proposed CMOS “track-
and-regenerate” slicer along with the “reset-and-regenerate”
StrongArm slicer. The input signals to the slicers are shown
in Fig. 13(a), representing a worst-case pattern when a weak
negative symbol, that is, (MSB, LSB) = (−1, +1), is between
a long sequence of strong positive symbols, i.e., (MSB,
LSB) = (+1, +1). Using this input pattern, the worst-case
delay performance for PAM4 signaling can be evaluated and
the memory effect in the proposed slicer with non-resetting
mechanism is examined. As shown in Fig. 13(b), in contrast
to the conventional CML slicer, the proposed slicer offers rail-
to-rail output swings and thus direct availability of digital-level
outputs. Meanwhile, unlike the StrongArm slicer, which resets
the latch, the proposed slicer tracks the
input signals as the CML slicer does, which helps to reduce
the required regeneration time. As a result, the proposed slicer
improves the clock-to-Q delay as well as the output swing
over the StrongArm. With the sizes of the input transistors
and the output cross-coupled pairs designed to be identical,
the worst-case clock-to-Q delay (with respect to the switching
points defined as ±450 mV) is simulated to be 30.96 ps
for the StrongArm, whereas the delay reduces to 15.34 ps
for the proposed slicer. As has already been shown in [21],
when not operating with low supply voltages or low input
common-mode levels, the delay performance of the double-tail
slicer [12] is similar to that of the StrongArm. This is
also observed when a double-tail slicer is tested with the same data
pattern shown in Fig. 13(a). The double-tail slicer is designed
to have the same input stage and output cross-coupled pairs
as the StrongArm, and its worst-case clock-to-Q delay is 31.2 ps
with the input common-mode level set to 750 mV and 29.58 ps
with the input common-mode level reduced to 600 mV.
Fig. 13(c) furthermore shows the robustness of the proposed slicer
to a change in the power supply. A voltage drop of 50 mV
from a 900-mV supply prevents the StrongArm from resolving
the weak negative symbol to a digital level, as its output swing
is less than 450 mV, while the output swing of the proposed slicer
is still approximately rail-to-rail and the penalty in resolving
the weak symbol is a 2.36-ps increase in delay. Fig. 13(d)
emphasizes another desirable feature offered by the proposed
slicer; namely, the fast reaction to strong symbols. Since the
strong symbols tend to cause relatively strong ISI for the
next symbols, it is beneficial to react quickly and thus
decide quickly on them. The fast reaction ensures that the DFE
summation is completely settled so as to minimize the negative
impact of the residual ISI caused by the strong symbols.
In addition to the improved clock-to-Q delay performance,
the proposed slicer also offers better output swing and
input sensitivity than the StrongArm, as can be seen below.
Fig. 14(a) shows the input pattern under test, which is similar
to Fig. 13(a), but with the magnitude of V swept from
10 to 100 mV instead. The results on the right of Fig. 14(a)
suggest that the proposed slicer outperforms the StrongArm
in output swing by regenerating the input signal to a
stronger output. Next, to investigate the slicer’s capability
of resolving a relatively weak input to a level that can be
identified and further easily processed as a digital signal,
the input sensitivity is defined as the minimum required
differential input swing, V∗, such that the output swing
of the slicer is larger than the digital level of 650 mV. The
input pattern is depicted on the left of Fig. 14(b), where
the baud rate is swept from 10 to 40 GBaud/s, and the
value of V∗ at each baud rate is searched to meet the
target output swing. On the right of Fig. 14(b), the simulated
sensitivity performance at different baud rates is plotted.
The proposed CMOS track-and-regenerate slicer achieves
better input sensitivity than the StrongArm, especially for
the higher data rates shown in Fig. 14(b). To investigate
the effects of input common-mode level and supply voltage
variations, simulations used for Fig. 13 are further extended
with distinct settings. The results are shown in Fig. 15, where
the output swings and delay performance for strong symbols
and weak symbols are individually characterized. It is
worth reiterating that the input differential pairs and the
output cross-coupled pairs in both slicers are designed to be
Fig. 14. (a) Slicer input signals, and the simulated slicer output swings with
distinct input swings at 30 GBaud/s. (b) Slicer input signals, and the simulated
slicer input sensitivity at different baud rates.
identical for fair comparisons, and therefore the two slicers
present similar area and loading to the summer circuitries.
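The sensitivity search described above can be sketched in a few lines of Python. The transient simulation is replaced here by a hypothetical placeholder model, `toy_output_swing`, whose tanh shape and gain are assumptions made only so that the example runs; the 650-mV target comes from the text, and bisection is one possible search method, not necessarily the one used for Fig. 14(b).

```python
import math

TARGET_SWING_V = 0.650  # digital-level output swing quoted in the text

def toy_output_swing(vin_v, vdd_v=0.9, gain=40.0):
    """Hypothetical stand-in for a transient simulation (NOT the real slicer).

    Returns an output swing that grows monotonically with the differential
    input swing and saturates at the supply voltage.
    """
    return vdd_v * math.tanh(gain * vin_v)

def min_input_swing(output_swing, target_v=TARGET_SWING_V,
                    lo_v=0.0, hi_v=0.2, tol_v=1e-6):
    """Bisection for the smallest input swing whose output meets target_v.

    Assumes output_swing is monotonically increasing on [lo_v, hi_v].
    """
    if output_swing(hi_v) < target_v:
        raise ValueError("target not reachable within the search range")
    while hi_v - lo_v > tol_v:
        mid_v = 0.5 * (lo_v + hi_v)
        if output_swing(mid_v) >= target_v:
            hi_v = mid_v   # mid already meets the target: search lower
        else:
            lo_v = mid_v   # mid is too weak: search higher
    return hi_v

sensitivity_v = min_input_swing(toy_output_swing)
print(f"minimum input swing: {sensitivity_v * 1e3:.2f} mV")
```

In practice the placeholder would be swapped for a call that runs the slicer simulation at a given baud rate, and the search would be repeated for each baud rate of interest.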
Techniques to simulate and examine the noise performance
of periodically clocked slicers have been well studied
in [16], which treats periodically clocked slicers as
linear periodically time-varying (LPTV) systems. As shown
in [16], the procedure to obtain the dominant signal-to-noise
ratio (SNR) involves both periodic steady-state (PSS) and
periodic noise (PNOISE) simulations in the time domain, which
determine the large-signal response of the slicer and the noise
power at any specified observation point, respectively. After
running the PSS and PNOISE simulations with respect to
the differential output of the slicer under test, the output
SNR in voltage can be derived by dividing the large-signal
response by the root-mean-squared (rms) noise voltage at
each observation time step. Fig. 16(a)–(c) respectively show
the simulated differential output signal, differential rms noise
voltage, and the resultant differential output SNR, of the
proposed slicer. Since it is the SNR before rapid regeneration
that dominates the probability of decision errors [16],
Fig. 15. Simulated output swing and clock-to-Q delay performance of
the StrongArm (SA) and the proposed track-and-regenerate slicer (T/R) for
typical strong symbols and weak symbols. (a) Output swing versus input
common-mode level, VDD = 0.9 V. (b) Output swing versus supply voltage,
VCM = 0.75 V. (c) Clock-to-Q delay versus input common-mode level,
VDD = 0.9 V. (d) Clock-to-Q delay versus supply voltage, VCM = 0.75 V.
Fig. 16. Simulated results from the PSS and PNOISE simulations at
30-GBaud/s operation; the clock frequency is 15 GHz. (a) Clock signal
and the proposed slicer’s differential output signal from PSS simulation.
(b) Simulated differential output integrated noise of the proposed slicer from
PNOISE simulations in time domain. (c) Resultant differential output SNR of
the proposed slicer.
the input-referred noise should be evaluated accordingly.
As labeled in Fig. 16(c), the simulated differential output
SNR before rapid regeneration is 31 dB (i.e., 35.5 V/V).
Therefore, the differential input-referred noise is derived to be
1.69 mVrms, given the differential input signal of 60 mV. The
overall noise is investigated by referring the noise from other
stages to the input of the slicer. The two-stage CTLE con-
tributes 0.80 mVrms, and the summer contributes 0.58 mVrms,
resulting in overall noise of 1.96 mVrms at the slicer input.
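The arithmetic behind these numbers can be checked with a short script. It is a sketch that assumes the three noise contributions are uncorrelated and therefore combine as a root-sum-square, which is consistent with the quoted total; the variable names are illustrative.

```python
import math

snr_db = 31.0            # differential output SNR before rapid regeneration
vin_diff_mv = 60.0       # differential input signal
ctle_noise_mv = 0.80     # two-stage CTLE, referred to the slicer input
summer_noise_mv = 0.58   # summer, referred to the slicer input

# Voltage ratio from decibels: SNR[V/V] = 10^(SNR[dB] / 20)
snr_vv = 10.0 ** (snr_db / 20.0)

# Input-referred slicer noise: input signal divided by the voltage SNR
slicer_noise_mv = vin_diff_mv / snr_vv

# Assumption: uncorrelated sources, so rms values combine as root-sum-square
total_noise_mv = math.sqrt(
    slicer_noise_mv**2 + ctle_noise_mv**2 + summer_noise_mv**2
)

print(f"SNR = {snr_vv:.1f} V/V")
print(f"slicer input-referred noise = {slicer_noise_mv:.2f} mVrms")
print(f"overall noise at slicer input = {total_noise_mv:.2f} mVrms")
```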
In summary, the proposed CMOS track-and-regenerate
slicer offers lower delay and higher gain when the allocated
regeneration time becomes stringent, thanks to its
non-resetting mechanism. Because of its multistage architecture and the
need to conduct current continuously while performing
the tracking function, the proposed slicer consumes more
power. The noise, power, and offset comparisons of the
slicers having identical input and output capacitances are
summarized in Table I.
IV. DFE LOOPS
In this section, the most stringent timing constraint
for completing the DFE loops with the proposed slicer
Fig. 17. (a) Simulated SSF of the proposed track-and-regenerate slicer at
30 GBaud/s. (b) Simulated ISF of the proposed track-and-regenerate slicer at
30 GBaud/s.
TABLE I
SLICER COMPARISON
is examined. Referring back to the timing constraint diagram
shown in Fig. 2, it can be inferred that for direct DFE loops,
the tightest timing constraint lies in the first-tap loop, where
TCKQ + Tdh1 + Tsettle + Tsetup < 1 UI (1)
and 1 UI is 33.33 ps for 60-Gb/s PAM4 signaling, or
30-GBaud/s operation. With the StrongArm slicer presented
in Section III, aside from the undesirable smaller swing, it is
nearly impossible to close the loop for the first tap of DFE
at 30 GBaud/s, since its worst-case clock-to-Q delay (TCKQ)
is 30.96 ps, leaving only about 2.4 ps of the UI for the remaining terms in (1).
By contrast, with the improved TCKQ, the delay of the proposed
slicer is small for strong symbols and is no more
than 0.5 UI in the worst case shown in Fig. 13(b), leaving
considerably more time for the other operations to finish. For
Tdh1 and Tsettle, as mentioned previously, since the operations
take place concurrently, it is more appropriate to view them
as the additional delay with respect to TCKQ during which the
DFE tap currents and the summer have already started to settle
toward their steady states. The setup time (Tsetup) is commonly
used for digital gates or digital circuits to characterize the
required time of arrival of digital inputs prior to the change
Fig. 18. Timing diagram for the first tap DFE in a half-rate design, and the
simulated numbers for the timing constraints at 30 GBaud/s.
of state that is triggered, for example, by the rising/falling edge
of the clock. In the context of analog-based DFE design,
the idea of Tsetup remains useful, although it is no longer directly
associated with digital inputs but is instead linked to the sampling
aperture of the slicer. Specifically, the width of the sampling
aperture of a slicer is characterized through an equivalent
setup time. For instance, a wider sampling aperture means
that a signal must arrive earlier in order to have a strong
influence at the end of the sampling phase, which equivalently
implies a larger value of Tsetup. Just as the impulse response
describes a linear time-invariant (LTI) system, the impulse sensitivity
function (ISF) describes the time-dependent sensitivity of the
output at a certain observation time to an impulse input
with a specific arrival time. The ISF of a slicer can be
interpreted as a weighted time-average sampling function,
and the sampling aperture corresponds to the shape of the ISF.
Further fundamentals and details of ISF and LPTV systems can
be found in [9], [16], and [17]. An approach to simulating
the ISF of a clocked slicer has been developed and presented
in [17]. First, a step function and a fixed offset are applied
as inputs, where the height of the step function (the step
height) is adjusted automatically by a feedback loop. The step sensitivity
function (SSF) is obtained by searching for the step height that
makes the slicer metastable at each time step. The ISF is
then derived by taking the derivative of the SSF.
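The final step of this procedure, differentiating the SSF to obtain the ISF, can be illustrated numerically. The sketch below uses a synthetic SSF (a smooth error-function step with an assumed center of −5 ps and an assumed 4-ps spread) purely as a placeholder for simulated data. Normalizing the ISF to unit area is also an assumption; only the differentiation step reflects the procedure described above.

```python
import math
import numpy as np

# Time axis in picoseconds relative to the clock edge (assumed range)
t_ps = np.linspace(-30.0, 10.0, 4001)

# Synthetic SSF: placeholder for the simulated metastability step heights
center_ps, sigma_ps = -5.0, 4.0   # assumed values, not taken from the paper
ssf = np.array([
    0.5 * (1.0 + math.erf((t - center_ps) / (sigma_ps * math.sqrt(2.0))))
    for t in t_ps
])

# The ISF is the time derivative of the SSF
isf = np.gradient(ssf, t_ps)

# Normalize to unit area (trapezoidal rule) so the ISF acts as a weighting
area = np.sum(0.5 * (isf[1:] + isf[:-1]) * np.diff(t_ps))
isf_norm = isf / area

peak_time_ps = t_ps[np.argmax(isf_norm)]
print(f"ISF peaks at {peak_time_ps:.2f} ps")
```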
Fig. 17(a) shows the simulated SSF of the proposed
track-and-regenerate slicer, and its normalized ISF is shown
in Fig. 17(b), both at 30-GBaud/s operation. The sampling
aperture can be defined as the time window between TLEFT and
TRIGHT over which the area under the
ISF is 80% of the total area under the ISF.
The width of the sampling aperture, that is, (TRIGHT − TLEFT),
indicates the sampling bandwidth [17], and furthermore the
values of TLEFT and TRIGHT specify an effective timing window
within which applied inputs have a significant influence on the
output. For the purpose of studying the DFE timing
Fig. 19. Pulse responses with considerable post-cursor ISI. (a) Simulated nor-
malized pulse response at the input of the summer. (b) Simulated normalized
pulse response at the output of the summer.
Fig. 20. Simulated differential output of the summer with distinct DFE
settings. (a) First-tap DFE and second-tap DFE are both disabled. (b) First-
tap DFE is disabled, while the second-tap DFE is enabled. (c) First-tap DFE
is enabled, while the second-tap DFE is disabled. (d) First-tap DFE and the
second-tap DFE are both enabled.
Fig. 21. Block diagram of the experiment setup.
constraint, we conveniently set TRIGHT to 0, that is, aligned
with the rising/falling edge of the clock signals, and define the
analog-style Tsetup as 90% of the sampling aperture width,
namely
Tsetup = 0.9 × (TRIGHT − TLEFT). (2)
As labeled in Fig. 17(b), TRIGHT is 0 and TLEFT is about
−11 ps from simulations, resulting in a Tsetup of 9.9 ps
for the proposed slicer. With the simulated values of Tsetup
Fig. 22. (a) Measured 30-GBaud/s pulse response at the input of the receiver
chip. (b) Measured (single-ended) 60-Gb/s PAM4 data eyes at the input of
the receiver chip.
Fig. 23. (a) Measured bathtub curves at 60-Gb/s PAM4, with DFE loops
disabled/enabled. (b) Measured eye contour color map at 60-Gb/s PAM4 after
equalization.
Fig. 24. (a) Chip micrograph with key building blocks highlighted. (b) Mea-
sured receiver data-path power consumption at 60-Gb/s PAM4.
and TCKQ, (1) can be used to calculate the allowable
additional delay from Tdh1 and Tsettle.
For example, from the simulation shown in Fig. 13(b),
the worst-case TCKQ is 15.34 ps, and thus (Tdh1 + Tsettle) <
(33.33 − 15.34 − 9.9) = 8.09 ps guarantees the effectiveness
of the first-tap DFE loop. However, it is worth pointing
out that when track-and-regenerate slicers are employed,
even if (Tdh1 + Tsettle) does not completely satisfy
the specification calculated from (1), the first-tap DFE loop
can still be closed as long as the feedback signal arrives within
the sampling aperture, at the cost of degraded
accuracy of the DFE summation. In other words, depending on
the tolerable accuracy of the DFE summation, the factor of
TABLE II
PERFORMANCE SUMMARY AND COMPARISON
0.9 appearing in (2) can be revised. The timing diagram for the
first-tap DFE in a half-rate design is shown in Fig. 18, along
with the simulated numbers for timing constraints, where the
values of (Tdh1 + Tsettle) are measured as the additional delay
relative to the TCKQ.
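The first-tap budget worked through above reduces to a few subtractions. The sketch below simply restates (1) and (2) with the simulated numbers quoted in the text; the function and variable names are illustrative.

```python
def setup_time_ps(t_left_ps, t_right_ps, factor=0.9):
    """Equivalent setup time from (2): factor x sampling-aperture width."""
    return factor * (t_right_ps - t_left_ps)

def feedback_margin_ps(ui_ps, t_ckq_ps, t_setup_ps):
    """Upper bound on (Tdh1 + Tsettle) implied by (1)."""
    return ui_ps - t_ckq_ps - t_setup_ps

baud_rate_gbd = 30.0
ui_ps = round(1e3 / baud_rate_gbd, 2)   # 1 UI in ps, rounded as in the text

# Proposed slicer: TLEFT = -11 ps, TRIGHT = 0 ps, worst-case TCKQ = 15.34 ps
t_setup_ps = setup_time_ps(t_left_ps=-11.0, t_right_ps=0.0)
margin_ps = feedback_margin_ps(ui_ps, t_ckq_ps=15.34, t_setup_ps=t_setup_ps)

# StrongArm: time left in the UI after its worst-case TCKQ alone
strongarm_left_ps = ui_ps - 30.96

print(f"1 UI = {ui_ps:.2f} ps, Tsetup = {t_setup_ps:.1f} ps")
print(f"proposed slicer: Tdh1 + Tsettle < {margin_ps:.2f} ps")
print(f"StrongArm: {strongarm_left_ps:.2f} ps left after TCKQ")
```

Changing `factor` in `setup_time_ps` shows how relaxing the 0.9 factor in (2) trades DFE summation accuracy for additional feedback margin.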
To verify that the direct DFE loops can be closed with the
proposed slicers and thus successfully expand the eye opening,
simulations with only the equalization offered by the direct
2-tap DFE loops have been carried out, excluding the benefits
of the CTLE. The differential pulse responses at the input and
the output of the summer are shown in Fig. 19(a) and (b),
respectively. These pulse responses correspond to channel loss
of ∼6 dB at 15 GHz. When simulating the pulse responses
and the DFE loops, in addition to the loading presented by the
slicers, a capacitive load of 20 fF is added to each
of the summer outputs to represent the input capacitance
of clock and data recovery (CDR) circuits. Fig. 20(a)–(d)
shows the simulated eye diagrams at 60-Gb/s PAM4 with
distinct DFE settings. The input MSB and LSB patterns used
in the simulations are both pseudorandom binary sequence-7
(PRBS-7), with the LSB pattern delayed by 5 bits relative to
the MSB. The simulation results of the DFE are consistent with those
of the pulse responses. The first-tap DFE plays a critical role
in opening the eyes, and the simultaneous inclusion of the
second-tap DFE further expands the eye opening.
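The stimulus described above can be reproduced with a short generator. This sketch assumes the common PRBS-7 polynomial x^7 + x^6 + 1, an arbitrary all-ones seed, and a linear mapping level = 2·MSB + LSB with bits mapped to ±1, which is consistent with the earlier description of (MSB, LSB) = (−1, +1) as a weak negative symbol. None of these details are specified for the simulations themselves.

```python
def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of PRBS-7 (x^7 + x^6 + 1) from a 7-bit LFSR."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1   # taps at stages 7 and 6
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits

PERIOD = 2**7 - 1        # 127-bit repetition period of PRBS-7
LSB_DELAY = 5            # LSB stream lags the MSB stream by 5 bits

msb_bits = prbs7(PERIOD)
# Cyclic delay: LSB at index k equals MSB at index k - 5 (mod 127)
lsb_bits = [msb_bits[(k - LSB_DELAY) % PERIOD] for k in range(PERIOD)]

def to_pm1(bit):
    """Map a binary bit to a signed symbol: 0 -> -1, 1 -> +1."""
    return 2 * bit - 1

# Assumed linear PAM4 mapping: levels in {-3, -1, +1, +3}
pam4 = [2 * to_pm1(m) + to_pm1(l) for m, l in zip(msb_bits, lsb_bits)]
print(pam4[:16])
```

Delaying the LSB stream decorrelates it from the MSB stream, so that all four PAM4 levels and their transitions are exercised within one 127-symbol period.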
V. EXPERIMENTAL RESULTS
The PAM4 receiver chip was fabricated in 28-nm CMOS
technology, and Fig. 21 shows the experiment setup. The
receiver chip is wire-bonded to a printed circuit board
(PCB). A high-speed pattern and clock generator transmits
the PAM4 data and the half-rate differential clock signals to
the chip via cables and PCB traces. The channel loss for
the transmitted PAM4 data mainly consists of the loss of the
cables (48 in long) and the PCB trace (∼0.8 in, FR4), which
is measured to be 8.2 dB at 15 GHz excluding the bond
wire. The associated 30-GBaud/s pulse response derived from
S21 measurement and the measured 60-Gb/s PAM4 eyes
at the input of the receiver chip are shown in Fig. 22(a)
and (b), respectively. An oscilloscope is set up to measure
the aforementioned input data eyes and to monitor the
recovered output data signals, which are driven by on-chip
CML drivers. In addition to the on-chip BERC, an external
commercial BER tester is connected to measure the BER and
verify the function of the on-chip BERC.
To verify the effectiveness of the direct DFE loops implemented
with the proposed slicer, PRBS-7, PRBS-9, and PRBS-31 patterns have
been fully tested and the bathtub curves with DFE loops
disabled and enabled are measured at 60 Gb/s. As shown
in Fig. 23(a), with DFE loops disabled, the measured BER
is not better than 1E-6, while with DFE loops enabled,
the measured bathtub curve shows 0.15-UI horizontal opening
for BER = 1E-12, when tested with the PRBS-31 pattern. The eye
contour map at 60 Gb/s, which is captured by the on-chip 2-D
eye-monitoring circuits and shown in Fig. 23(b), confirms the
open eyes after equalization. The 2-tap DFE coefficients are
estimated to be (−0.212, −0.0311), given that IDFE1
and IDFE2, previously defined in Fig. 6(a), are set to 205 and
30 μA, respectively.
The chip micrograph is shown in Fig. 24(a), with its key
building blocks highlighted, including the CTLEs, the half-rate
CML summers (Sum), the proposed slicers along with 2-tap
DFE logics, the 1-to-32 data demultiplexer (DMUX), the syn-
thesized BERC, the DCC circuits, CKBUFs, the digitally
controlled DL, and the on-chip voltage DAC (VDAC) banks.
The total chip area measures 900 μm × 745 μm, or 0.67 mm².
The power consumption of the receiver data-path
circuitries together with its breakdown is shown in Fig. 24(b).
At 60 Gb/s, the two stages of CTLE consume 13 mW, the two
half-rate summers consume 13.4 mW, the slicers and latches
consume 12.6 mW, and the CKBUF and data buffers consume
27 mW. In total, 66 mW is consumed by the receiver data-path,
and an energy efficiency of 1.1 pJ/bit is achieved at 60 Gb/s.
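The power breakdown, energy per bit, and chip area quoted above can be cross-checked with a few lines of arithmetic. The dictionary keys below are illustrative labels for the blocks named in the text.

```python
power_mw = {
    "CTLE (two stages)": 13.0,
    "summers (two half-rate)": 13.4,
    "slicers and latches": 12.6,
    "CKBUF and data buffers": 27.0,
}
data_rate_gbps = 60.0

total_mw = sum(power_mw.values())

# mW / (Gb/s) = 1e-3 J/s / 1e9 b/s = 1e-12 J/b, i.e. pJ/b
energy_pj_per_bit = total_mw / data_rate_gbps

# 900 um x 745 um, with both sides expressed in mm
area_mm2 = 900e-3 * 745e-3

print(f"total power: {total_mw:.1f} mW")
print(f"energy per bit: {energy_pj_per_bit:.2f} pJ/b")
print(f"chip area: {area_mm2:.2f} mm^2")
for block, p_mw in power_mw.items():
    print(f"  {block}: {100.0 * p_mw / total_mw:.1f}%")
```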
VI. CONCLUSION
A CMOS track-and-regenerate slicer is proposed and
designed with the aim of improving the power efficiency,
output swing, and technological scalability over the conventional
CML slicer, as well as the clock-to-Q delay and
output swing over the conventional StrongArm slicer.
A PAM4 receiver, employing the proposed CMOS
track-and-regenerate slicer, benefits from the relaxed settling
time constraint thanks to the reduced slicer delay, and from
the direct availability of rail-to-rail digital signals offered by
the proposed slicer. The prototype fabricated in 28-nm CMOS
technology achieves an energy efficiency of 1.1 pJ/bit at 60 Gb/s
over a channel with 8.2-dB loss at Nyquist, demonstrating
an energy-efficient PAM4-DFE design. The performance
and comparisons with the state of the art are summarized
in Table II.
ACKNOWLEDGMENT
The authors thank D. A. Nelson of Rockley Photonics,
Caltech MICS Lab members, with special thanks to Arian
Hashemi Talkhooncheh, Saransh Sharma, Fatemeh Aghlmand,
and Caltech CHIC Lab for sharing testing resources.
REFERENCES
[1] Y. Frans et al., “A 56-Gb/s PAM4 wireline transceiver using a 32-
way time-interleaved SAR ADC in 16-nm FinFET,” IEEE J. Solid-State
Circuits, vol. 52, no. 4, pp. 1101–1110, Apr. 2017.
[2] T. Ali et al., “6.4 a 180 mW 56Gb/s DSP-based transceiver for high
density IOs in data center switches in 7nm FinFET technology,” in IEEE
Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco,
CA, USA, Feb. 2019, pp. 118–120.
[3] D. Pfaff et al., “A 56Gb/s long reach fully adaptive wireline PAM-4
transceiver in 7nm FinFET,” in Proc. Symp. VLSI Circuits, Kyoto, Japan,
Jun. 2019, pp. 270–271.
[4] M. Pisati et al., “A 243-mW 1.25–56-Gb/s continuous range PAM-
4 42.5-dB IL ADC/DAC-based transceiver in 7-nm FinFET,” IEEE
J. Solid-State Circuits, vol. 55, no. 1, pp. 6–18, Jan. 2020.
[5] J. Im et al., “A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct
decision-feedback equalization in 16-nm FinFET,” IEEE J. Solid-State
Circuits, vol. 52, no. 12, pp. 3486–3502, Dec. 2017.
[6] A. Roshan-Zamir et al., “A 56-Gb/s PAM4 receiver with low-overhead
techniques for threshold and edge-based DFE FIR- and IIR-tap adap-
tation in 65-nm CMOS,” IEEE J. Solid-State Circuits, vol. 54, no. 3,
pp. 672–684, Mar. 2019.
[7] P.-J. Peng, J.-F. Li, L.-Y. Chen, and J. Lee, “6.1 a 56Gb/s
PAM-4/NRZ transceiver in 40nm CMOS,” in IEEE Int. Solid-State
Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA,
Feb. 2017, pp. 110–111.
[8] M. Choi and A. A. Abidi, “A 6b 1.3 GSample/s A/D converter in 0.35-μm
CMOS,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers,
Feb. 2001, pp. 126–127.
[9] T. Toifl et al., “A 22-Gb/s PAM-4 receiver in 90-nm CMOS SOI
technology,” IEEE J. Solid-State Circuits, vol. 41, no. 4, pp. 954–965,
Apr. 2006.
[10] K.-L.-J. Wong, A. Rylyakov, and C.-K.-K. Yang, “A 5-mW 6-Gb/s
quarter-rate sampling receiver with a 2-Tap DFE using soft decisions,”
IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 881–888, Apr. 2007.
[11] T. Kobayashi, K. Nogami, T. Shirotori, Y. Fujimoto, and O. Watanabe,
“A current-mode latch sense amplifier and a static power saving input
buffer for low-power architecture,” in Proc. Symp. VLSI Circuits Dig.
Tech. Papers, Seattle, WA, USA, 1992, pp. 28–29.
[12] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, and B. Nauta,
“A double-tail latch-type voltage sense amplifier with 18ps setup+hold
time,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers,
San Francisco, CA, USA, Feb. 2007, pp. 314–605.
[13] P. A. Francese et al., “23.6 a 30Gb/s 0.8pJ/b 14nm FinFET receiver
data-path,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech.
Papers, San Francisco, CA, USA, Jan. 2016, pp. 408–409.
[14] K. C. Chen and A. Emami, “A 25-Gb/s avalanche photodetector-based
burst-mode optical receiver with 2.24-ns reconfiguration time in 28-nm
CMOS,” IEEE J. Solid-State Circuits, vol. 54, no. 6, pp. 1682–1693,
Jun. 2019.
[15] I. Ozkaya et al., “A 64-Gb/s 1.4-pJ/b NRZ optical receiver data-path in
14-nm CMOS FinFET,” IEEE J. Solid-State Circuits, vol. 52, no. 12,
pp. 3458–3473, Dec. 2017.
[16] J. Kim, B. S. Leibowitz, J. Ren, and C. J. Madden, “Simulation and
analysis of random decision errors in clocked comparators,” IEEE Trans.
Circuits Syst. I, Reg. Papers, vol. 56, no. 8, pp. 1844–1857, Aug. 2009.
[17] M. Jeeradit et al., “Characterizing sampling aperture of clocked com-
parators,” in Proc. IEEE Symp. VLSI Circuits, Honolulu, HI, USA,
Jun. 2008, pp. 68–69.
[18] J. Han, N. Sutardja, Y. Lu, and E. Alon, “Design techniques for a 60-
Gb/s 288-mW NRZ transceiver with adaptive equalization and baud-rate
clock and data recovery in 65-nm CMOS technology,” IEEE J. Solid-
State Circuits, vol. 52, no. 12, pp. 3474–3485, Dec. 2017.
[19] P. Upadhyaya et al., “A fully adaptive 19-to-56Gb/s PAM-4 wireline
transceiver with a configurable ADC in 16nm FinFET,” in IEEE Int. Solid-
State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA,
USA, Oct. 2018, pp. 108–110.
[20] C. Wang, G. Zhu, Z. Zhang, and C. P. Yue, “A 52-Gb/s sub-1pJ/bit
PAM4 receiver in 40-nm CMOS for low-power interconnects,” in Proc.
Symp. VLSI Circuits, Kyoto, Japan, Jun. 2019, pp. 274–275.
[21] S. Babayan-Mashhadi and R. Lotfi, “Analysis and design of a low-
voltage low-power double-tail comparator,” IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 22, no. 2, pp. 343–352, Feb. 2014,
doi: 10.1109/TVLSI.2013.2241799.
[22] K.-C. Chen, W. W.-T. Kuo, and A. Emami, “A 60-Gb/s PAM4 wireline
receiver with 2-Tap direct decision feedback equalization employing
track-and-regenerate slicers in 28-nm CMOS,” in Proc. IEEE Custom
Integr. Circuits Conf. (CICC), Boston, MA, USA, Mar. 2020, pp. 1–4,
doi: 10.1109/CICC48029.2020.9075948.
[23] A. Roshan-Zamir, O. Elhadidy, H.-W. Yang, and S. Palermo, “A recon-
figurable 16/32 Gb/s dual-mode NRZ/PAM4 SerDes in 65-nm CMOS,”
IEEE J. Solid-State Circuits, vol. 52, no. 9, pp. 2430–2447, Sep. 2017,
doi: 10.1109/JSSC.2017.2705070.
Kuan-Chang Chen (Member, IEEE) received the
B.S. degree in electrical engineering from National
Taiwan University (NTU), Taipei, Taiwan, in 2011,
and the M.S. degree in electrical engineering from
Stanford University, Stanford, CA, USA, in 2014.
He is currently pursuing the Ph.D. degree in electri-
cal engineering with Caltech, Pasadena, CA, USA,
with a special emphasis on analog and mixed-signal
circuits and systems.
Mr. Chen was a recipient of the 2015 Henry
Ford II Scholar Award at Caltech.
William Wei-Ting Kuo (Graduate Student Mem-
ber, IEEE) received the B.S. degree in electronics
engineering from National Chiao Tung University
(NCTU), Hsinchu, Taiwan, in 2016, and the M.S.
degree in electrical engineering from the Califor-
nia Institute of Technology, Pasadena, CA, USA,
in 2018, where he is currently a graduate student
in electrical engineering.
From 2016 to 2017, he served as a Substitute
Military Service Draftee in the Ministry of Justice,
Taiwan.
Azita Emami (Senior Member, IEEE) received the
B.S. degree from the Sharif University of Tech-
nology, Tehran, Iran, in 1996, and the M.S. and
Ph.D. degrees in electrical engineering from Stan-
ford University, Stanford, CA, USA, in 1999 and
2004, respectively.
From 2004 to 2006, she was with the IBM
T. J. Watson Research Center before joining
Caltech, Pasadena, CA, USA, in 2007. She is cur-
rently the Andrew and Peggy Cherng Professor of
electrical engineering and medical engineering with
the California Institute of Technology (Caltech), and also a Heritage Medical
Research Institute Investigator. She also serves as the Executive Officer
(Department Head) of electrical engineering. Her current research interests
include integrated circuits and systems, integrated photonics, wearable and
implantable devices for neural recording, neural stimulation, sensing, and drug
delivery.
Dr. Emami is currently an Associate Editor of the IEEE JOURNAL OF
SOLID-STATE CIRCUITS (JSSC). She was an IEEE SSCS Distinguished Lecturer
in 2017–2018.
