A 2-40 Gb/s PAM4/NRZ dual-mode wireline transmitter with 4:1 MUX in 65-nm CMOS by Lv, F et al.
JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.2, APRIL, 2018 ISSN(Print)  1598-1657 
https://doi.org/10.5573/JSTS.2018.18.2.270 ISSN(Online) 2233-4866  
 
Manuscript received Jul. 27, 2017; accepted Oct. 26, 2017 
1 Air Force Engineering University, Xi’an 710051, China 
2 Institute of Microelectronics, Tsinghua University, Beijing 100084, 
China 
3 Department of Computer Science, Liverpool John Moores University, 
Byrom Street, Liverpool, L3 3AF, United Kingdom 
E-mail : wangziq@tsinghua.edu.cn; zhengxuqiang@mail.tsinghua.edu.cn 
 
 
A 2-40 Gb/s PAM4/NRZ Dual-mode Wireline 
Transmitter with 4:1 MUX in 65-nm CMOS     
 
Fangxu Lv1,2, Xuqiang Zheng2, Feng Zhao3, Jianye Wang1, Ziqiang Wang2*,                         
Shuai Yuan2, Yajun He2, Chun Zhang2, and Zhihua Wang2   
 
 
 
 
Abstract—This paper presents a 2-40 Gb/s dual-mode 
wireline transmitter supporting the four-level pulse 
amplitude modulation (PAM4) and non-return-to-
zero (NRZ) modulation with a multiplexer (MUX)-
based two-tap feed-forward equalizer (FFE). An edge-
acceleration technique is proposed for the 4:1 MUX to 
increase the bandwidth. By utilizing a dedicated 
cascode current source, the output swing can achieve 
900 mV with a level deviation of only 0.12% for 
PAM4. Fabricated in a 65-nm CMOS process, the 
transmitter consumes 117 mW in PAM4 mode and 89 mW 
in NRZ mode at 40 Gb/s from a 1.2 V supply.    
 
Index Terms—Four-level pulse amplitude modulation 
(PAM4), non-return-to-zero (NRZ), edge-acceleration 
technique, linearity optimization, dual-mode wireline 
transmitter    
I. INTRODUCTION 
The ever-increasing bandwidth demand for data 
communications has pushed wireline 
connections towards data rates of 40 Gb/s and beyond [1], 
[2]. When the data rate reaches 40 Gb/s and above, two 
methods are used to further improve the wireline data 
throughput. One is to proceed with non-return-to-zero 
(NRZ) signaling by increasing the clock speed [1-3]. The 
other employs high-order modulations (HOM) such as 
four-level pulse amplitude modulation (PAM4) [4-6] and 
eight-level pulse amplitude modulation (PAM8) [7], [8], 
which are attracting more and more attention owing to 
their bandwidth efficiency. Since PAM4 exhibits 
excellent balance among performance, cost, power and 
complexity, it is currently considered the best HOM 
for the upcoming Ethernet 400GE [9]. To maintain 
compatibility with the existing NRZ components and 
avoid developing multiple IPs, wireline transceivers 
are required to support both multiple modulation formats 
and a wide operation range [10], [11]. 
The major challenge in designing an NRZ transmitter 
(TX) is the final-stage serialization. For the widely used 
half-rate architecture shown in Fig. 1, the delay 
difference between the data and the clock tree buffer 
paths (t1-t2) may fluctuate beyond 1 unit interval (UI) at 
high-speed rate under different PVT corners [12], thus 
causing setup/hold timing violations for the high-speed 
latches. To solve this problem, reference [13] uses an 
 
Fig. 1. Conventional half-rate transmitter. 
 
 
additional calibration loop to guarantee the timing 
constraint between the data and the clock. Nonetheless, 
the calibration loop significantly increases the 
complexity and consumes additional power. On the other 
hand, the quarter-rate architecture has proven to be 
a promising solution for high-speed applications [1], [2], 
[14], [15] because it not only relaxes the critical path 
timing to 3 UI, but also halves the maximum clock speed. 
Additionally, power can be saved by 
replacing the last two-stage 2:1 multiplexers (MUXs) and 
the retiming latches with one 4:1 MUX. However, the 4:1 
MUX suffers from limited bandwidth due to its doubled 
self-load. 
The main difficulty in designing the PAM4 TX lies 
in the output nonlinearity associated with the desired 
large swing. Compared with NRZ modulation, PAM4 
uses four voltage levels to transmit two bits per 
symbol, which improves the channel efficiency. Nevertheless, 
the vertical eye opening is reduced to one third of the full 
swing, which means that the signal-to-noise ratio (SNR) 
is reduced by about 9.5 dB, leaving a smaller margin for 
the receiver-side data recovery. Therefore, a high output 
swing is crucial for the PAM4 TX. However, this usually 
leads to a nonlinearity problem that is particularly severe 
in the current mode logic (CML)-based topology, 
because the high output swing inevitably compresses the 
VDS of the tail current source. 
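The 9.5 dB figure follows from the eye-height ratio alone. The sketch below assumes equal peak-to-peak swing and equal noise for both modulations; the function name is illustrative and not part of the original design.

```python
import math


def pam_eye_penalty_db(levels: int) -> float:
    """Eye-height penalty of PAM-N relative to NRZ at equal peak-to-peak swing.

    Assumption: the noise is the same in both cases, so the SNR penalty is
    set only by the ratio of eye heights.
    """
    if levels < 2:
        raise ValueError("need at least two levels")
    # N levels split the full swing into (N - 1) stacked eyes.
    return 20.0 * math.log10(levels - 1)


print(f"PAM4 penalty: {pam_eye_penalty_db(4):.2f} dB")
```

For four levels this evaluates to 20·log10(3), which rounds to the 9.5 dB quoted in the text.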
Another challenge in implementing the feed-forward 
equalizer (FFE) for both NRZ and PAM4 TX is the 
generation of several accurate 1 UI-delayed sequences. 
Fig. 2(a) shows a typical D flip-flop (DFF)-based FFE, 
which sums the delayed sequences using different weight 
coefficients to compensate for the channel loss. In this 
structure, the delayed versions of the sequences are 
generated by the DFFs, so it can operate over a wide 
data-rate range. However, limited by the large CK-to-Q 
delay, the flip-flop-based FFE structure [16] is impractical 
for designs above 24 Gb/s in a 65-nm CMOS process. An 
alternative solution is to use analog delay lines to produce 
the 1 UI-delayed sequences, as displayed in Fig. 2(b). These 
analog delay lines are usually based on cascaded CML 
buffers [17], [18] or LC cells [2], [19]-[21], which have 
been proven in ultra-high-speed applications such as 64 
Gb/s [2]. Nonetheless, they cannot support a wide 
operation range because of their limited tuning range. 
Besides, the FFE based on multiple CML buffers is power 
hungry because several stages are needed to realize a 1 UI 
delay, and the LC-cell-based FFE occupies a large chip 
area because it requires many inductors.  
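As a behavioral illustration of what a two-tap FFE computes, the following sketch subtracts a weighted, 1 UI-delayed copy of the symbol stream from the main stream. The weight normalization (main + post = 1) and the function name are assumptions made for this example, not details taken from the paper.

```python
def two_tap_ffe(symbols, post_weight):
    """Apply y[n] = (1 - a) * x[n] - a * x[n-1], with x[-1] taken as 0.

    Assumption: tap weights are normalized so that the peak output
    amplitude stays within the unequalized swing.
    """
    if not 0.0 <= post_weight < 0.5:
        raise ValueError("post_weight must be in [0, 0.5)")
    main_weight = 1.0 - post_weight
    out = []
    prev = 0.0  # the 1 UI-delayed symbol
    for x in symbols:
        out.append(main_weight * x - post_weight * prev)
        prev = x
    return out


nrz = [+1, +1, +1, -1, +1, -1, -1]
print(two_tap_ffe(nrz, 0.25))
```

Transitions are emphasized to full amplitude while repeated symbols are attenuated, which is the pre-emphasis behavior that compensates for channel loss. The accuracy of the 1 UI delay is exactly what the circuit-level discussion above is concerned with.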
To address these issues, a quarter-rate NRZ/PAM4 
dual-mode transmitter supporting a wide frequency range 
with a MUX-based two-tap FFE is proposed. To further 
increase the bandwidth of the MUX with high power 
efficiency, an improved 4:1 MUX with edge-acceleration 
technique is designed. Additionally, a dedicated cascode 
current source is utilized to guarantee large output swing, 
high output linearity, and small parasitic capacitance. 
The remainder of this paper is organized as follows. 
Section II describes the proposed NRZ/PAM4 dual-mode 
transmitter. Section III presents the crucial circuit blocks 
of the transmitter, and Section IV gives the measurement 
results of the chip. The conclusion is drawn in Section V. 
II. PROPOSED NRZ/PAM4 DUAL-MODE 
TRANSMITTER 
Fig. 3 shows the block diagram of the NRZ/PAM4 
dual-mode transmitter. It consists of two quarter-rate 

Fig. 2. FFE structure (a) DFF-based FFE, (b) delay line-based 
FFE. 
 
 
 
serializers for MSB and LSB, one output driver (DRV), 
and one multi-clock generator (MCG). Each quarter-rate 
serializer contains one 4-bit parallel PRBS generator, two 
4:1 MUXs, and two latch arrays. The operation mode of 
the transmitter is controlled by the external signal EN. 
When EN is logic high, both the MSB and LSB quarter-
rate serializers along with the combiners are activated 
and then the transmitter works in the PAM4 mode. Here, 
the combiners use tail current sources with a 2:1 ratio to 
generate the four voltage levels. When EN is logic low, 
the LSB quarter-rate serializer and the LSB combiner are 
disabled, so the transmitter works in the NRZ mode. To 
compensate for the lost drive strength (the LSB combiner 
is disabled) in the NRZ mode, the currents in the MSB 
combiner are correspondingly increased. 
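The mode switching can be summarized with a small behavioral model. It assumes that the MSB and LSB tail currents are weighted 2:1 and that, in NRZ mode, the MSB current is raised so that the full swing is preserved; the exact current values are not given in the text, so the factor of 3 below is an assumption.

```python
def combiner_output(msb, lsb, en, unit_current=1.0):
    """Differential output level in units of the LSB tail current.

    en=True  -> PAM4 mode: MSB weighted 2, LSB weighted 1.
    en=False -> NRZ mode: LSB path disabled, MSB current assumed to be
                increased to 3 units so the swing matches PAM4.
    """
    m = 1 if msb else -1
    if en:
        l = 1 if lsb else -1
        return unit_current * (2 * m + l)
    return unit_current * 3 * m


pam4_levels = sorted(combiner_output(m, l, en=True) for m in (0, 1) for l in (0, 1))
nrz_levels = sorted({combiner_output(m, 0, en=False) for m in (0, 1)})
print(pam4_levels)
print(nrz_levels)
```

In PAM4 mode the four combinations give equally spaced levels, and in NRZ mode only the two outer levels remain.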
In each quarter-rate serializer, an independent PRBS7 
generator is employed to produce the 4-bit parallel data, 
which are applied to two interleaved latch arrays to 
generate the 1 UI-delayed data sequences. These 
sequences are then serialized into two full-rate data 
streams for the main tap and post tap in the two 4:1 
MUXs. The two serialized data streams are finally fed 
into the combiners to produce the pre-emphasis signal. 
Fig. 4(a) describes the details of the latch array, where 
the interleaved latching technique presented in [22] is 
adopted. The latch array and the S2D circuits, triggered by 
the four orthogonal clocks, provide properly delayed data for 
the main and post 4:1 MUXs. The latches and S2D circuits are 
implemented with dynamic latches, which are power 
efficient and compact because they consist of only 
four transistors. The four orthogonal clocks are generated 
by the MCG shown in Fig. 4(b). The MCG receives a 
single-ended half-rate clock and transforms it into 
differential CML clocks in the CLK conditioner. The four 
orthogonal CML clocks are produced by the CML 
differential flip-flop-based divider and are then 
converted into CMOS clocks in the CML2CMOS circuits. 
These clocks are first buffered to enhance their driving 
capabilities and then applied to the latch array and 4:1 
MUXs for the parallel data retiming and serializing. 
Compared with traditional DFF and LC-cells-based FFE, 
the multi-MUX-based FFE together with the latch array 
not only supports a wide data-rate range, but also 
improves the power efficiency. 
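The data path described above can be sketched behaviorally: a PRBS7 source produces 4-bit words, and two 4:1 serializations yield a main stream and a copy delayed by exactly 1 UI. The PRBS7 polynomial (x^7 + x^6 + 1), the seed, and the function names are assumptions for this sketch, since the text does not specify them, and the latch-level timing is not modeled.

```python
def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of PRBS7 (assumed polynomial x^7 + x^6 + 1)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F
    return bits


def serialize_main_and_post(words):
    """Serialize 4-bit words into a main stream and a 1 UI-delayed post stream.

    The post-tap MUX sees each word rotated by one lane, with lane 3
    taken from the previous word; this models the role of the latch array.
    """
    main, post = [], []
    prev_d3 = 0  # assumed idle value before the first word
    for d0, d1, d2, d3 in words:
        main.extend([d0, d1, d2, d3])
        post.extend([prev_d3, d0, d1, d2])
        prev_d3 = d3
    return main, post


bits = prbs7(32)
words = [tuple(bits[i:i + 4]) for i in range(0, len(bits), 4)]
main, post = serialize_main_and_post(words)
print(main[:8])
print(post[:8])
```

Because the delay is created by re-ordering quarter-rate lanes rather than by an analog delay element, it stays at exactly 1 UI at any data rate, which is the property that gives the multi-MUX-based FFE its wide operating range.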
III. CRUCIAL CIRCUIT BLOCKS 
1. 4:1 MUX  
 
The block diagram of a typical 4:1 MUX is depicted in 
Fig. 5(a). It includes two peaking inductors, two resistors, 
and four equivalent unit cells. Each unit cell takes two 
clocks with 90-degree phase skew and one quarter-speed 
data as the inputs and generates a 1 UI output pulse. Thus, 
the four combined unit cells serialize four quarter-speed 
data streams into a single full-speed output signal, as 
described in Fig. 5(b). The unit cell, which is the critical 
block of the 4:1 MUX, usually has three basic structures 

Fig. 3. NRZ/PAM4 dual-mode transmitter architecture. 

Fig. 4. (a) Multi-MUX-based FFE, (b) MCG. 
 
 
for implementation, as shown in Fig. 6. 
The first unit cell structure and its timing are given in 
Fig. 6(a). It stacks three transistors to generate the 25% 
duty cycle output pulse. The timing shows that the two 
clocks (CK1, CK2) with 90-degree phase skew and the 
input data (Din) are combined in one stage. The main 
drawback is its small current driving ability. To overcome 
this difficulty, large-size transistors need to be adopted 
[14], which increases the output capacitance and the 
load on the preceding stage. This structure is therefore 
bandwidth limited and has low power efficiency. 
The second structure of the unit cell and its timing are 
shown in Fig. 6(b), which only consists of two transistors 
at the output stage. The 25% duty cycle clock pulse 
generated in the preceding stage samples the quarter-rate 
input data at the output stage. To guarantee the serialized 
data with low jitter performance, the 25% duty cycle 
clock pulse must have sharp edges to drive the output 
transistor in every clock period. In [17], [22], [23], a 
large-size inverter is employed to provide steep transition 
edges, which is power hungry. 
Fig. 6(c) illustrates the third structure and its timing. It 
also utilizes two transistors at the final stage to generate 
the output pulse. In comparison with the second structure, 
it replaces the 25% duty cycle pulse with 50% duty cycle 
clocks that sample the data at two stages, which is power 
efficient. As depicted in its timing diagram, at the output 
stage, the rising edge of CK1 and the falling edge of CK2, 
which depends on the input data, are combined 
to control the output pulse width. In the implementation 
of this unit cell, another structure can be obtained by 
placing CK1 ahead of CK2 with 90-degree skew. 
Furthermore, switching the input signals at the final stage 
yields another two structures. Thus, the method 
of two-stage sampling with quarter-rate clocks has four 
structures to realize this unit cell, three of which are 
introduced in [1], [2], [17], respectively. 
In order to further extend the bandwidth and improve 
the power efficiency, we propose an improved 4:1 MUX 
based on the work in [2], which is a variation of the third 
structure illustrated in Fig. 6(c). Fig. 7(a) depicts the 
block of the proposed 4:1 MUX and the details of the 
unit cell. Each unit cell contains a differential pair of 
pulse generators. The pulse generator circuit is illustrated 
in Fig. 7(b), which utilizes two logic gates (INV_A and 
INV_B) to output Va and Vb to drive the stacked 
transistors M6 and M7, respectively. By inserting the 
data-controlling transistor M4 into the INV_A, transistor 
M7 only opens when Din and CKa are both logic low, thus 

Fig. 5. Typical 4:1 MUX (a) Block diagram, (b) timing diagram. 

Fig. 6. Three different structures of the unit cell and their timing 
diagrams. 
 
 
the data can be transmitted to the output. The inserted 
transistor NM8 is used to accelerate the rising edge of the 
output pulse to further extend the bandwidth and improve 
the signal quality. The timing of the input data (Din), 
input clocks (CKa, CKb), intermediate nodes (Va, Vb, Vc), 
and output data (Vout) is shown in Fig. 7(c). Here, Va 
depends on Din and CKa, Vb is the inverting version of 
CKb, and Vc is the connecting node of M6 and M7. The 
output of the stacked M6 and M7 is Vout. The logic 
function of the pulse generator is as follows, 
$$V_{out} = \overline{\overline{CK_a} \cdot \overline{CK_b} \cdot \overline{D_{in}}} = CK_a + CK_b + D_{in} \qquad (1)$$
 
As shown in Fig. 7(c), the input data Din can be 
transmitted to the output only when CKa and CKb are 
both low. If Din is low, the pulse generator produces a 
negative pulse; otherwise its output stays 
high. Here, the falling edge of the negative output pulse 
is determined by the rising edge of Vb, while its rising 
edge is determined by the falling edge of Va. Both 
edges of the negative pulse are also 
affected by the load capacitance at Vout (Cp1 and CL). 
Therefore, the bandwidth can be increased by sharpening 
the edges of the negative pulse. This can be 
achieved by improving the driving ability so that 
Va’s falling edge and Vb’s rising edge become sharper. 
Alternatively, it can be achieved by reducing the output 
parasitic capacitance. 
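The logic function of the pulse generator and the way four unit cells form a 4:1 MUX can be checked with a purely logical model. The clock-pair-to-data assignment and the idealized 50% duty clocks below are assumptions chosen so that each cell owns one UI; the shared output node is modeled as a wired AND, because any active cell pulls it low. Analog effects such as edge rates and parasitic capacitance are not modeled.

```python
def pulse_generator(cka, ckb, din):
    """Eq. (1): the output is low only when CKa, CKb and Din are all low."""
    return cka | ckb | din


def quarter_rate_clock(phase_deg, ui):
    """Ideal 50% duty clock with a 4 UI period, high for 2 UI from its phase."""
    shift = phase_deg // 90
    return 1 if (ui - shift) % 4 < 2 else 0


# Assumed pairing: each cell is transparent in the single UI where both of
# its clocks are low, giving the output order D0, D1, D2, D3.
CLOCK_PAIRS = [(90, 180), (180, 270), (270, 0), (0, 90)]


def mux4to1(words):
    """Serialize 4-bit quarter-rate words into a full-rate bit stream."""
    out = []
    for word in words:
        for ui in range(4):
            level = 1  # pulled high unless a unit cell pulls the node low
            for (pa, pb), d in zip(CLOCK_PAIRS, word):
                cka = quarter_rate_clock(pa, ui)
                ckb = quarter_rate_clock(pb, ui)
                level &= pulse_generator(cka, ckb, d)
            out.append(level)
    return out


print(mux4to1([(1, 0, 1, 1), (0, 0, 1, 0)]))
```

In each UI exactly one cell has both clocks low, so the shared node follows that cell's data bit while the other three cells stay high.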
In the first method, the two transition edges are 
sharpened by skewing the P-to-N ratio of INV_A and 
INV_B and by enlarging M1 and M5. This also shortens 
the fall time of Va and the rise time of Vb, 
which saves power [2]. Fig. 8 shows 
the simulated waveforms of Va and Vb. 
In the second method, reducing the output parasitic 
capacitance is not easy in the traditional structure because 
the two stacked transistors must be large to provide a large 
current drive. It is also difficult to reduce the self-load 
contributed by the parasitic capacitances at Vout and 
Vc. Fig. 7(b) indicates the parasitic capacitances on the 
critical path, where CL denotes the next-stage input load, 
Cp1 represents the output parasitic capacitance, which is 
mainly contributed by the transistors of the four unit cells, 
and Cp2 stands for the parasitic capacitance at Vc. In the 
MUX operation, CL and Cp1 influence both the rise 
time and the fall time of the negative output pulse. 
However, Cp2 only affects its rise time: when 
M7 is driven by Va’s falling edge and starts to turn 
off, Cp2 draws charge from Vout at the same time, whereas 
when M6 is driven by Vb’s rising edge and starts to turn 
on, Cp2 has already been discharged through M7 
[see Fig. 7(c)]. 
In fact, this improved structure can further reduce the 
output parasitic capacitance by introducing an 
acceleration transistor. In Fig. 7(b), after adding the 
acceleration transistor NM8, the parasitic capacitance of 
Vc (Cp2) has no influence on the rising and falling time of 
 
Fig. 7. (a) Proposed 4:1 MUX, (b) details of the pulse 
generator, (c) timing diagram of the unit cell operation. 
 
 
the negative output pulse, which extends its bandwidth. 
When M7 is driven by Va’s falling edge and starts to 
turn off, NM8 is already on and charges Cp2, since 
CKa’s rising edge is ahead of Va’s falling edge. Cp2 
therefore does not draw charge from Vout, which 
shortens the rise time of the negative output pulse. 
When M6 is driven by Vb’s rising edge and starts to 
turn on, NM8 has been turned off and Cp2 has already 
been discharged through M7, so there is no 
impact on the fall time of the negative output pulse. 
Consequently, with the acceleration transistor NM8 
added, the self-load contributed by Cp2 is 
eliminated. Fig. 8 shows the simulated waveforms of the 
pulse generator with and without NM8. The result 
indicates that the rising edges of Vc and Vout are both 
accelerated by NM8 when M7 is switched off, while 
the falling edge of Vout is unchanged compared with the 
traditional structure. 
Besides, we can extend the bandwidth by 
shrinking M6, which becomes possible after 
introducing the acceleration transistor NM8. In the 
traditional design, when the pulse generator has a fixed 
output swing with a specific pull-up resistor, the series 
on-resistance of M6 and M7 is determined. The 
parasitic capacitances of these two transistors limit 
the bandwidth, so resizing them while keeping the 
series on-resistance constant does not noticeably 
extend the bandwidth. With the acceleration transistor, 
the situation is different. While keeping the series 
on-resistance of M6 and M7 the same as in the traditional 
design, we can shrink M6 and enlarge M7 appropriately 
to greatly reduce Cp1 at the cost of a slightly larger Cp2. 
This is acceptable because the parasitic capacitance at Vc 
(Cp2) can be ignored in this structure, as explained earlier. 
So, shrinking M6 and enlarging M7 appropriately 
further extends the bandwidth. Fig. 9 shows the simulated 
eye-diagram of the proposed MUX, whose rising edge is 
faster than that of the traditional design under 
the same load. 
The simulation result depicted in Fig. 9 also shows 
that the glitch of the traditional structure is suppressed by 
introducing the acceleration transistor. In the traditional 
design, when the MUX outputs two consecutive logic-low 
bits, a negative glitch occurs when the second unit cell 
turns on: after Vout, pulled down by the second unit 
cell, has already reached its required low voltage, Cp2 in the 
first unit cell still draws charge from Vout at the falling 
edge of Va. In the proposed structure, the 
acceleration transistor prevents Cp2 from 
drawing charge from Vout when the MUX 
outputs two consecutive logic-low bits. 
 
2. Linearity Optimization on the Output Driver 
 
The output DRV is one of the most important circuits 
in the PAM4 transmitter design. The PAM4 eye height 
shrinks to one third of that of NRZ signaling, hence there is 
little margin left for the receiver-side data recovery. 
Moreover, compression of the lower eye makes this 
effect more severe, as the bit error rate (BER) 
is limited by the worst eye. Consequently, a large output 
swing with uniform level spacing is crucial for the PAM4 
DRV to improve the SNR and relax the complexity of the 
receiver design. 
The transmitter in [11] cannot operate at ultra-high-
speed since it incorporates the source-series terminated 

Fig. 8. Simulated waveform of the pulse generator with and 
without NM8. 

Fig. 9. Simulated eye-diagram of the proposed 4:1 MUX and 
the traditional 4:1 MUX (annotated maximum glitch: 112 mV). 
 
 
(SST) driver, which is a voltage-mode driver. Reference 
[4] adopts an 8-bit DAC-based driver, which employs a 
1.5 V supply to increase the output swing for improving 
the linearity. Nevertheless, the heavy drain-loading 
significantly limits its maximum operation speed. 
Additionally, the uniformity of the voltage level spacing 
can be deteriorated by the nonlinearity of the DAC. This 
work applies a CML-based driver with dedicated cascode 
current sources to enlarge the output swing and improve 
the output linearity. 
Fig. 10(a) illustrates the details of the output driver. It 
comprises four current-mode differential pairs and a pair 
of shunt-peaked loads. In the CML-based driver, the 
linearity is mainly constrained by the channel-length 
modulation effect of the tail current sources. Although 
the internal resistance of the traditional current source 
[see Fig. 10(c)] can be improved by adopting a long-channel 
device, the increased device size inevitably 
enlarges the parasitic capacitance at the output. This 
enlarged parasitic capacitance in turn lowers the internal 
impedance at high frequencies, thus degrading the output 
linearity. In addition, the large parasitic capacitance is 
also detrimental to the impedance matching at high 
frequencies. 
To address these issues, a low-voltage cascode current 
source is utilized, as shown in Fig. 10(b). To satisfy the 
requirements of both high output resistance and low 
parasitic capacitance at Vx, a large-size M2 is used to 
improve the output resistance while a minimum-size M1 
is cascoded to isolate the large drain capacitance of M2. 
Fig. 11 compares the single-ended output linearity of 
the traditional current source-based driver and 
the proposed cascode current source-based driver. The 
simulated results show that the level deviation is 
reduced from 3.3% to 0.1% for an output swing of 450 
mV with a 1.2 V power supply. The level deviation is 
defined as,  
 
 
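$$\text{Level deviation} = \frac{1}{3} \sum_{ij = 10,\,21,\,32} \frac{\left| d_{ij} - \frac{1}{3} V_{Pk\text{-}Pk} \right|}{\frac{1}{3} V_{Pk\text{-}Pk}} \times 100\%, \qquad (2)$$

where $d_{ij}$ is the vertical spacing between level i and level j, 
and $V_{Pk\text{-}Pk}$ is the full eye height of the output. 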
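A small numerical sketch may help in applying this metric. It assumes that the level deviation is the mean absolute deviation of the three adjacent-level spacings from the ideal spacing (one third of the full eye height), normalized to that ideal spacing; the function name and the example voltages are hypothetical.

```python
def level_deviation_percent(levels):
    """Level deviation of a PAM4 output, in percent.

    levels: the four output voltages, in any order and any consistent unit.
    Assumed definition: mean of |d_ij - Vpp/3| / (Vpp/3) over the three
    adjacent-level spacings d_10, d_21, d_32.
    """
    if len(levels) != 4:
        raise ValueError("PAM4 requires exactly four levels")
    v = sorted(levels)
    vpp = v[3] - v[0]
    if vpp <= 0:
        raise ValueError("levels must not all be equal")
    ideal = vpp / 3.0
    spacings = [v[i + 1] - v[i] for i in range(3)]
    return 100.0 * sum(abs(d - ideal) for d in spacings) / (3.0 * ideal)


# Hypothetical levels in mV: the lowest eye is compressed by 10 mV.
print(f"{level_deviation_percent([0, 290, 600, 900]):.2f} %")
```

Perfectly uniform spacing yields 0%, and any compression of one eye is shared with its neighbors because the full eye height is fixed.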
IV. MEASUREMENT RESULTS 
The chip is fabricated in a 65-nm CMOS technology 
and its die photo is given in Fig. 12(a). The whole 
transmitter occupies 1.1 × 0.98 mm2, of which the core 
circuit occupies 0.48 mm2. Fig. 12(b) presents the 
photograph of the test setup. The chip is packaged 
chip-on-board, and the half-rate clock is generated by an 
Agilent E8257D. After a 15 mm PCB trace, the differential 
signal is measured through a center-mounted 40 GHz 
connector and a 1 m cable. The overall channel loss is 4.5 
dB and 11 dB at 10 GHz and 20 GHz, respectively. The 
eye-diagrams are measured by an Agilent Digital Signal 
Analyzer 93204A (33 GHz). When the transmitter works 
in the PAM4 mode, it consumes 117 mW with a 1.2 V 
supply, which corresponds to an energy efficiency of 
2.93 pJ/bit. The power breakdown of the transmitter is 
shown in Fig. 12(c). The power consumption of the 
PRBS, MCG, latch array, MUX, and DRV with FFE is 
9%, 47%, 5%, 24%, and 14%, respectively. When the 
transmitter operates at the NRZ mode, the LSB quarter-

Fig. 10. (a) Output driver, (b) dedicated cascode current source, 
(c) traditional current source. 

Fig. 11. Single-ended output waveforms (a) Traditional driver, (b) 
low-power cascode driver. 
 
 
rate serializer and LSB combiner consume less and the 
total power consumption is 89 mW at 40 Gb/s. 
Fig. 13 presents the measured PAM4 output, where Fig. 
13(a) and (b) display the output eye-diagrams at 2 Gb/s 
without equalization and with over-equalization, 
respectively. The output swing is 900 mV and the level 
deviation is 0.12%. Fig. 13(c) and (d) give the properly 
equalized eye diagrams at 35 and 40 Gb/s, where the minimum 
vertical eye opening is about 100 mV and 82 mV and the 
minimum horizontal eye width is about 34 ps and 30 ps, 
respectively. Fig. 14 depicts the measured NRZ output 
eye-diagrams, where the output eye-diagrams at 10 Gb/s 
without equalization and with over-equalization are 
displayed in Fig. 14(a) and (b). Fig. 14(c) and (d) show the 
properly equalized eye-diagrams at 40 and 50 Gb/s, where the 
eye height and total jitter are 300 mV and 9.6 ps at 40 Gb/s 
and 130 mV and 16.8 ps at 50 Gb/s. 
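The reported energy efficiency and the eye widths in unit intervals can be reproduced from the measured numbers with a few lines. The helper names are illustrative; the only assumption is that PAM4 carries two bits per symbol and NRZ one.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ: 1 mW / (1 Gb/s) equals 1 pJ/bit."""
    return power_mw / data_rate_gbps


def unit_interval_ps(data_rate_gbps, bits_per_symbol):
    """Symbol period in ps for a given bit rate and modulation order."""
    return 1000.0 * bits_per_symbol / data_rate_gbps


print(f"PAM4: {energy_per_bit_pj(117, 40):.3f} pJ/bit")
print(f"NRZ:  {energy_per_bit_pj(89, 40):.3f} pJ/bit")

ui_pam4 = unit_interval_ps(40, bits_per_symbol=2)
print(f"PAM4 UI at 40 Gb/s: {ui_pam4:.1f} ps, eye width {30 / ui_pam4:.2f} UI")
```

The results, 2.925 pJ/bit and 2.225 pJ/bit, match the reported 2.93 and 2.22 pJ/bit to within rounding, and the 30 ps PAM4 eye width at 40 Gb/s corresponds to 0.6 of the 50 ps symbol period.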
In Table 1, the performance of the designed 2-40 Gb/s 
wireline transmitter is summarized and compared with 
other state-of-the-art works with similar data rates. The 
results indicate that our transmitter achieves excellent 
power efficiency and high-quality eye diagrams in both 
NRZ and PAM4 modes, owing to the proposed quarter-
rate architecture and edge-acceleration 4:1 MUXs. 
Fig. 12. (a) Die photograph, (b) photograph of the testing 
board, (c) power breakdown (total power is 117 mW). 

Fig. 13. PAM4 eye diagrams (a) Without equalization at 2 
Gb/s, (b) over-equalized at 2 Gb/s, (c) properly-equalized at 35 
Gb/s, (d) properly-equalized at 40 Gb/s. 

Fig. 14. NRZ output eye diagrams (a) Without equalization at 
10 Gb/s, (b) over-equalized at 10 Gb/s, (c) properly-equalized 
at 40 Gb/s, (d) properly-equalized at 50 Gb/s. 

V. CONCLUSION 
A 2-40 Gb/s PAM4/NRZ dual-mode wireline 
transmitter with a two-tap multi-MUX-based FFE is 
implemented in 65-nm CMOS technology. The 
multi-MUX-based FFE has been shown to support 
a wide data-rate range from 1 to 50 Gb/s, while 
the proposed edge-acceleration technique in the 4:1 
MUX extends its bandwidth and suppresses the glitch. In 
addition, the linearity of the PAM4 output is improved by 
utilizing the dedicated cascode current sources, and the 
level deviation reaches 0.12%. With a 1.2 V supply, the 
transmitter consumes 117 mW in the PAM4 mode and 89 
mW in the NRZ mode, both at 40 Gb/s. 
ACKNOWLEDGMENT 
This work was supported by Beijing Engineering 
Research Center under grant BG0149. 
REFERENCES 
[1] A. A. Hafez et al., “A 32-48 Gb/s serializing 
transmitter using multiphase serialization in 65 nm 
CMOS technology,” IEEE J. Solid-State Circuits, 
vol. 50, no. 3, pp. 763-775, Mar. 2015. 
[2] M.-S. Chen and C.-K. K. Yang, “A 50-64 Gb/s 
serializing transmitter with a 4-tap, LC-ladder-
filter-based FFE in 65 nm CMOS technology,” 
IEEE J. Solid-State Circuits, vol. 50, no. 8, pp. 
1903-1916, Aug. 2015. 
[3] Y. Frans et al., “A 40-to-64 Gbps NRZ transmitter 
with supply-regulated front-end in 16nm FinFET,” 
IEEE J. Solid-State Circuits, vol. 51, no. 12, pp. 
3167-3177, Dec. 2016. 
[4] A. Nazemi et al., “A 36Gb/s PAM4 transmitter 
using an 8b 18GS/s DAC in 28nm CMOS,” in Proc. 
IEEE Int. Solid-State Circuits Conf. Dig. Tech. 
Papers, pp. 58-59, 2015. 
[5] K. Gopalakrishnan et al., “A 40/50/100Gb/s PAM-
4 Ethernet transceiver in 28nm CMOS,” in Proc. 
IEEE Int. Solid-State Circuits Conf. Dig. Tech. 
Papers, pp. 62-63, 2016.  
[6] M. Bassi et al., “A 45 Gb/s PAM-4 transmitter 
delivering 1.3Vppd output swing with 1V supply in 
28nm CMOS FDSOI,” in Proc. IEEE Int. Solid-
State Circuits Conf. Dig. Tech. Papers, pp. 66-67, 
2016. 
[7] D. J. Foley and M. P. Flynn, “A low-power 8-PAM 
serial transceiver in 0.5-μm digital CMOS,” IEEE J. 
Solid-State Circuits, vol. 37, no. 3, pp. 310-316, 
Mar. 2002. 
[8] K. Szczerba et al., “70 Gbps 4-PAM and 56 Gbps 
8-PAM using an 850 nm VCSEL,” IEEE J. 
Lightwave Technology, vol. 33, no. 7, pp. 1395-
1401, Apr. 2015. 
[9] Optical Internetworking Forum (OIF), “CEI-56G-
LR-PAM4 long reach implementation agreement 
draft text,” Opt. Internetworking Forum Contrib., 
Tech. Rep. 2014.380.03, 2016. 
[10] A. Roshan-Zamir et al., “A reconfigurable 16/32 
Gb/s dual-mode NRZ/PAM4 SerDes in 65nm 
CMOS,” IEEE J. Solid-State Circuits, vol. 52, no. 9, 
pp. 2430-2447, Sep. 2017. 
[11] J. Kim et al., “A 16-to-40Gb/s quarter-rate 
NRZ/PAM4 dual-mode transmitter in 14nm 
CMOS,” in Proc. IEEE Int. Solid-State Circuits 
Conf. Dig. Tech. Papers, pp. 60-61, 2015. 
[12] K. Kanda et al., “A single-40 Gb/s dual-20 Gb/s 
serializer IC with SFI-5.2 interface in 65 nm 
CMOS,” IEEE J. Solid-State Circuits, vol. 44, no. 
12, pp. 3580-3589, Dec. 2009. 
[13] J. Lee et al., “Design of 56 Gb/s NRZ and PAM4 
SerDes transceivers in CMOS technologies,” IEEE 
J. Solid-State Circuits, vol. 50, no. 9, pp. 2061-
2073, Sep. 2015. 
Table 1. Performance summary and comparison 

Parameter                    This work    This work    [10]         [10]         [11]         [11]         [13]
Modulation                   NRZ          PAM4         NRZ          PAM4         NRZ          PAM4         PAM4
Technology                   65 nm        65 nm        65 nm        65 nm        14 nm        14 nm        65 nm
Supply (V)                   1.2          1.2          1.2          1.2          N/A          N/A          1.3
Chip area (mm2)              1.08         1.08         0.06 (core)  0.06 (core)  0.0279       0.0279       1.14
Data rate (Gb/s)             40           40           16           32           40           40           60
Data range (Gb/s)            1-50         2-40         N/A          N/A          16-40        16-40        56-62
FFE (taps)                   2            2            4            2            4            N/A          3
Output swing Vpp (V)         0.9          0.9          1.2          1.2          0.6          0.6          0.25
Vertical eye opening (mV)    300          82           100          100          200          80 (a)       50
Horizontal eye opening (UI)  >0.6         >0.5         >0.5         >0.4         >0.5         >0.4 (a)     >0.6
Power (mW)                   89           117          N/A          159          518 (b)      167 (b)      290
Energy efficiency (pJ/bit)   2.22         2.93         N/A          4.97         12.9 (b)     4.19 (b)     4.84 (b)

(a) Including software-based CTLE at scope. (b) Including PLL consumption. 

[14] P. Chiang et al., “A 20-Gbps 0.13-μm CMOS serial 
link transmitter using an LC-PLL to directly drive 
the output multiplexer,” IEEE J. Solid-State 
Circuits, vol. 40, no. 4, pp. 1004-1011, Apr. 2005. 
[15] X. Zheng et al., “A 40-Gb/s quarter-rate SerDes 
transmitter and receiver chipset in 65-nm CMOS,” 
IEEE J. Solid-State Circuits, vol. 52, no. 11, pp. 
2963-2978, Nov. 2017. 
[16] M.-S. Chen et al., “A fully-integrated 40-Gb/s 
transceiver in 65-nm CMOS technology,” IEEE J. 
Solid-State Circuits, vol. 47, no. 3, pp. 627-640, 
Mar. 2012. 
[17] D. Cui et al., “A dual-channel 23-Gbps CMOS 
transmitter/receiver chipset for 40-Gbps RZ-
DQPSK and CS-RZ-DQPSK optical transmission,” 
IEEE J. Solid-State Circuits, vol. 47, no. 12, pp. 
3249-3260, Dec. 2012. 
[18] B. Raghavan et al., “A sub-2 W 39.8-44.6 Gb/s 
transmitter and receiver chipset with SFI-5.2 
interface in 40 nm CMOS,” IEEE J. Solid-State 
Circuits, vol. 48, no. 12, pp. 3219-3228, Dec. 2013. 
[19] J.-Y. Jiang et al., “100 Gb/s Ethernet chipsets in 65 
nm CMOS technology,” in Proc. IEEE Int. Solid-
State Circuits Conf. Dig. Tech. Papers, pp. 120-
121, 2013. 
[20] R. Navid et al., “A 40-Gb/s serial link transceiver 
in 28-nm CMOS technology,” IEEE J. Solid-State 
Circuits, vol. 50, no. 4, pp. 814-827, Apr. 2015.  
[21] J. Sewter and A. C. Carusone, “A CMOS finite 
impulse response filter with a crossover traveling 
wave topology for equalization up to 30 Gb/s,” 
IEEE J. Solid-State Circuits, vol. 41, no. 4, pp. 
909-917, Apr. 2006. 
[22] J. Sewter and A. C. Carusone, “A 3-tap FIR filter 
with cascaded distributed tap amplifiers for 
equalization up to 40 Gb/s in 0.18-µm CMOS,” 
IEEE J. Solid-State Circuits, vol. 41, no. 8, pp. 
1919-1929, Aug. 2006. 
[23] X. Zheng et al., “A 5-50 Gb/s quarter rate 
transmitter with a 4-tap multiple-MUX based FFE 
in 65 nm CMOS,” in Proc. 42nd European Solid-
State Circuits Conf., pp. 305-308, 2016.
[24] C.-K. K. Yang and M. A. Horowitz, “A 0.8-µm 
CMOS 2.5 Gb/s oversampling receiver and 
transmitter for serial links,” IEEE J. Solid-State 
Circuits, vol. 31, no. 12, pp. 2015-2023, Dec. 1996. 
 
 
 
Fangxu Lv received the B.S. and 
M.S. degrees from Air Force 
Engineering University, Xi’an, China, 
in 2011 and 2014, respectively. He is 
currently pursuing the Ph.D. degree 
at Tsinghua University, Beijing, 
China. His current research interests 
include high-speed wireline system design. 
 
Xuqiang Zheng received the B.S. 
and M.S. degrees from the School of 
Physics and Electronics, Central 
South University, Hunan, China, in 
2006 and 2009, respectively. He is 
currently pursuing the Ph.D. degree 
with the University of Lincoln, 
Lincoln, U.K. Since 2010, he has been a Mixed Signal 
Engineer with the Institute of Microelectronics, Tsinghua 
University, Beijing, China. His current research interests 
include high-performance A/D converters and high-speed 
wireline communication systems. 
 
Feng Zhao received the B.Eng. 
degree in electronic engineering from 
the University of Science and 
Technology of China, Hefei, China, 
in 2000, and the M.Phil. and Ph.D. 
degrees in computer vision from The 
Chinese University of Hong Kong, 
Hong Kong, in 2002 and 2006, respectively. From 2006 
to 2007, he was a Post-Doctoral Fellow with the 
Department of Information Engineering, The Chinese 
University of Hong Kong. From 2007 to 2010, he was a 
Research Fellow with the School of Computer 
Engineering, Nanyang Technological University, 
Singapore. He was then a Post-Doctoral Research 
Associate with the Intelligent Systems Research Centre, 
University of Ulster, Londonderry, U.K. From 2011 to 
2015, he was a Workshop Developer and a Post-Doctoral 
Research Fellow with the Department of Computer 
Science, Swansea University, Swansea, U.K. From 2015 
to 2017, he was a Post-Doctoral Research Fellow with 
the School of Computer Science, University of Lincoln, 
Lincoln, U.K. Since 2017, he has been with the 
Department of Computer Science, Liverpool John 
Moores University, Liverpool, U.K., where he is 
currently a Senior Lecturer. His research interests include 
image processing, biomedical image analysis, computer 
vision, pattern recognition, machine learning, artificial 
intelligence, and robotics. 
 
Jianye Wang received the B.S. and 
M.S. degrees from Nanjing University
of Science and Technology,
Nanjing, China. He received the
Ph.D. degree from Air Force
Engineering University, Xi’an, China.
He is currently with Air Force
Engineering University, where he is a Professor in the
Air and Missile Defense College. His research interests 
include high-speed wireline system design. 
 
Ziqiang Wang received the B.S. and 
Ph.D. degrees from the Department 
of Electronic Engineering, Tsinghua 
University, Beijing, China, in 1999 
and 2006, respectively. After the 
Ph.D. degree, he was a Research 
Assistant with the Institute of 
Microelectronics, Tsinghua University, where he has 
been an Associate Professor since 2015. His current 
research interests include analog circuit design. 
 
Shuai Yuan received the B.S. and 
Ph.D. degrees from the Institute of 
Microelectronics, Tsinghua University,
Beijing, China, in 2011 and
2016, respectively. He is now doing
postdoctoral research at the Institute
of Microelectronics, Tsinghua University.
His current research interests mainly focus on high-
speed wireline transceivers and low-power equalizers.
 
Yajun He received the B.S. degree 
from the School of Microelectronics 
and Solid-State Electronics, University
of Electronic Science and
Technology of China, Chengdu,
China, in 2015. She is now working
toward the M.S. degree at Tsinghua
University, Beijing, China. Her research interests include
high-speed wireline transmitters and PLLs.
 
Chun Zhang (M’03) received the 
B.S. and Ph.D. degrees from the 
Department of Electronic Engineering,
Tsinghua University, Beijing,
China, in 1995 and 2000, respectively.
Since 2000, he has been with
Tsinghua University, where he was 
with the Department of Electronic Engineering from 
2000 to 2004 and he has been an Associate Professor 
with the Institute of Microelectronics since 2005. His 
current research interests include mixed signal integrated 
circuits and systems, embedded microprocessor design, 
digital signal processing, and radio frequency identification. 
 
Zhihua Wang (SM’04–F’17) 
received the B.S., M.S., and Ph.D. 
degrees in electronic engineering 
from Tsinghua University, Beijing, 
China, in 1983, 1985, and 1990, 
respectively. In 1983, he joined the 
faculty at Tsinghua University, where 
he has been a Full Professor since 1997 and the Deputy 
Director of the Institute of Microelectronics since 2000. 
From 1992 to 1993, he was a Visiting Scholar with 
Carnegie Mellon University, Pittsburgh, USA. From 
1993 to 1994, he was a Visiting Researcher with KU 
Leuven, Leuven, Belgium. He is the co-author of ten 
books and book chapters, over 90 papers in international 
journals, and over 300 papers in international 
conferences. He holds 58 Chinese patents and four U.S. 
patents. His current research interests include CMOS 
radio frequency integrated circuit (RFIC), biomedical 
applications, radio frequency identification, phase locked 
loop, low-power wireless transceivers, and smart clinic 
equipment with combination of leading edge CMOS 
RFIC and digital imaging processing techniques. Prof. 
Wang was an Official Member of the China Committee 
for the Union Radio-Scientifique Internationale from 
2000 to 2010. He served as a Technologies Program 
Committee Member of the IEEE International Solid-State 
Circuit Conference from 2005 to 2011. He has been a 
Steering Committee Member of the IEEE Asian Solid-
State Circuit Conference since 2005. He has served as the 
Deputy Chairman of the Beijing Semiconductor 
Industries Association and the ASIC Society of Chinese 
Institute of Communication, as well as the Deputy 
Secretary General of the Integrated Circuit Society in the 
China Semiconductor Industries Association. He was one
of the chief scientists of the China Ministry of Science
and Technology and served on the Expert Committee of the
National High Technology Research and Development
Program of China (863 Program) in the area of information
science and technologies from 2007 to 2011. He was the
Chairman of the IEEE Solid-State Circuit Society Beijing 
Chapter from 1999 to 2009. He has served as the Technical 
Program Chair of the 2013 A-SSCC. He served as the Guest 
Editor of the IEEE JOURNAL OF SOLID-STATE
CIRCUITS Special Issue in 2006 and 2009. He is an Associate
Editor of the IEEE TRANSACTIONS ON BIOMEDICAL
CIRCUITS AND SYSTEMS and the IEEE TRANSACTIONS
ON CIRCUITS AND SYSTEMS-PART II:
EXPRESS BRIEFS.
