A 5Gb/s 7.1fJ/b/mm 8× Multi-Drop On-Chip 10mm Data Link in 14nm FinFET CMOS SOI at 0.5V by Sacco, Elisa et al.
 
Elisa Sacco1,2, Pier Andrea Francese1, Matthias Brändli1, Christian Menolfi1, Thomas Morf1, 
Alessandro Cevrero1, Ilter Ozkaya1, Marcel Kossel1, Lukas Kull1, Danny Luu1, Hazar Yueksel1, 
Georges Gielen2 and Thomas Toifl1 
1IBM Research, Zurich, Switzerland, 2KU Leuven – ESAT MICAS, Belgium 
 
Abstract 
We report a 5Gb/s data link implemented in 14nm FinFET 
CMOS SOI technology in which a single transmitter (TX) 
broadcasts NRZ data to eight receivers (RXs) distributed along 
an on-chip RC-dominated 10mm-long channel. The TX 
comprises a full-rate AC-coupled 2-tap FIR driver with a 
quarter-rate pre-driver and aligner. Each RX is equipped with 
a novel decision-gated 1-tap speculative DFE optimized for 
low power. The RX architecture is half-rate, and sliced data are 
de-multiplexed and aligned at quarter-rate so that all the bus 
interfaces run at 1.25Gb/s. A PRBS generator and checker are 
available on-chip. Correct operation was verified with PRBS31 
data transmitted at 5Gb/s and concurrently received error-free 
at each drop with >40% horizontal margin (BER < 10^-12) at the 
RX connected at the end of the channel. At this data-rate the 
efficiency is 7.1fJ/b/mm, which is, to the best of our 
knowledge, the best efficiency among multi-drop on-chip data 
links published so far. The TX consumes 0.62mW and the eight 
RXs together consume 0.98mW from a 0.5V power supply. 
Introduction 
Neuromorphic systems, where one artificial neuron 
communicates with many neurons through synapses, and 
data-centric systems, where microprocessors and parallel 
accelerators concurrently operate on the same centrally 
distributed data, require parallel processing of the same data. 
These applications drive the interest in the energy-efficient 
implementation of multi-drop on-chip data links built with 
wires interconnecting digital blocks placed a few mm apart. 
On-Chip Interconnect Channel 
In this work, we used a 10mm-long differential channel built 
with 0.5m thick wires routed through a 5.12×5.12 m2 power 
mesh acting as a shield. The channel was modeled in HFSS. 
The attenuation is 1.8dB/mm at 2.5GHz and is RC-dominated 
with R=37/mm and C=340fF/mm at 25˚C. Figure 1 shows 
the channel cross section, its insertion loss and the layout 
placement of the circuits connected to it. 
AC-Coupled TX 
The TX employs AC coupling, which intrinsically provides 
pre-emphasis, with selectable unit metal capacitors of 10fF. 
The additional FFE tap, in which the signal is inverted and 
delayed by one UI, further extends the data rate. In Fig. 2 the 
implementation is shown together with the quarter-rate 
pre-driver and timing diagrams. Both the signal amplitude and 
the equalization strength can be controlled separately by means 
of 32 units allocated to the cursor and 16 to the post-cursor. 
Figure 3 presents the eye diagram traces at two points along the 
channel with the TX FFE turned off and on. Our equalization 
strategy is to select the best TX FFE setting for the farthest RX 
and use the DFE at each RX to refine the sensitivity. 
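
As a rough behavioral illustration of the 2-tap FFE described above, the following Python sketch applies a cursor tap and an inverted post-cursor tap delayed by one UI to an NRZ sequence. The function name `ffe_2tap`, the linear mapping from enabled units to tap weights, and the normalization to 48 units are assumptions made for illustration only. The real driver is a capacitively coupled circuit whose output also depends on the channel.

```python
def ffe_2tap(bits, cursor_units=32, post_units=16, total_units=48):
    """Behavioral 2-tap FIR: y[n] = c0*x[n] - c1*x[n-1], with x in {-1, +1}.

    Assumption: every enabled unit contributes equally, so the tap weights
    are the enabled unit counts normalized to the 48 available units.
    """
    if not (0 <= cursor_units <= 32 and 0 <= post_units <= 16):
        raise ValueError("at most 32 cursor and 16 post-cursor units exist")
    c0 = cursor_units / total_units
    c1 = post_units / total_units
    symbols = [1 if b else -1 for b in bits]
    out = []
    # Assumption: before the sequence starts, the line held the first symbol.
    prev = symbols[0] if symbols else 0
    for x in symbols:
        out.append(c0 * x - c1 * prev)  # inverted, one-UI-delayed post-cursor
        prev = x
    return out


levels = ffe_2tap([0, 0, 1, 1, 1, 0, 1])
print([round(v, 3) for v in levels])
```

In this idealized model a transition is driven at full amplitude, while a repeated bit is de-emphasized to (c0 - c1), which is the high-pass behavior that counteracts the RC-dominated channel.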
Multi-Drop RX 
In a multi-drop link the cumulative loading of all the 
connected RXs is of concern. With AC coupling each RX 
should present a sufficiently high differential input impedance 
Zdiff to minimize baseline wander and, at the same time, 
a sufficiently low common-mode Zcm to ensure a short start-up 
settling time to the desired Vcm (2/3 of Vdd in our design). 
The circuit shown in Fig. 4 meets both requirements. In common 
mode the supply voltage is divided equally across the three 
stacked diode-connected NMOS transistors. In 
differential mode the cross-coupled NMOS transistors 
generate a negative transconductance gm that cancels the gm of 
the NMOS diodes above them, thus setting a high Zdiff equal to 
their ro. The small difference between the PRBS7 and PRBS31 
horizontal margins of the measured BER bathtubs down to 
10^-12 (Fig. 5) confirms the correct operation of the adopted 
circuit. 
A benefit of a multi-drop system is that the TX power is 
amortized over the number of served RXs. The efficiency η is 
calculated as: 
$$\eta = \frac{P_{TX} + n_{bits} \cdot P_{RX}}{f_{TX} \cdot L_{chan} \cdot n_{bits}} \quad [\mathrm{J/b/mm}]$$
The efficiency saturates as the number of concurrently received 
bits nbits increases; nbits is eight in our implementation. 
Lchan is the average channel length, equal to 5.625mm. PTX and 
PRX are the average powers consumed by the TX and by each RX, 
respectively, and fTX is the transmission rate. 
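
The efficiency figure can be cross-checked against the numbers quoted in the abstract with a few lines of Python. This is only a sketch under two assumptions that the text does not state explicitly: that the 0.98mW is the total for the eight RXs (so PRX is 0.98mW/8), and that the drops are evenly spaced every 1.25mm, which is one placement that reproduces the 5.625mm average length. The function and variable names are illustrative.

```python
def link_efficiency(p_tx_w, p_rx_w, n_bits, f_tx_bps, l_chan_mm):
    """eta = (P_TX + n_bits * P_RX) / (f_TX * L_chan * n_bits), in J/b/mm."""
    return (p_tx_w + n_bits * p_rx_w) / (f_tx_bps * l_chan_mm * n_bits)


N_RX = 8
P_TX = 0.62e-3          # W, TX power from the abstract
P_RX = 0.98e-3 / N_RX   # W per RX; assumes 0.98 mW is the total for eight RXs
F_TX = 5e9              # b/s

# Assumption: drops evenly spaced along the 10 mm channel (1.25 mm pitch).
drop_positions_mm = [10.0 * (k + 1) / N_RX for k in range(N_RX)]
l_avg_mm = sum(drop_positions_mm) / N_RX

eta = link_efficiency(P_TX, P_RX, N_RX, F_TX, l_avg_mm)
print(f"average length: {l_avg_mm:.3f} mm")     # 5.625 mm
print(f"efficiency: {eta * 1e15:.1f} fJ/b/mm")  # 7.1 fJ/b/mm
```

For a fixed Lchan, the TX contribution PTX/(fTX·Lchan·nbits) shrinks as nbits grows, so the efficiency approaches PRX/(fTX·Lchan), which is the saturation mentioned above.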
Decision-Gated DFE 
Each RX operates at half-rate and features a novel 
low-power 1-tap speculative DFE architecture that reduces the 
power consumption and the kick-back effect by powering up 
only the speculative path that is going to take the next decision 
based on the previous decision. Compared to the usual 
speculative DFE implementation [1], power is saved because 
the path not selected by the DFE multiplexer is kept in reset 
state by gating its clock with the decision taken in the previous 
UI. As an example, Fig. 6 shows the operation when the even 
slice detects the current bit while the odd slice is holding a logic 
one as the previously detected bit. 
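
The gating idea can be illustrated with a purely behavioral Python model. It is a sketch, not the circuit of Fig. 6: it ignores the half-rate even/odd slicing, noise and timing, and the names `speculative_dfe`, `H1` and the test channel (one post-cursor of relative weight 0.4) are hypothetical. It compares a conventional speculative DFE, where both comparators evaluate and a multiplexer selects one result, with a gated version where only the comparator selected by the previous decision is evaluated.

```python
import random


def sign(v):
    return 1 if v >= 0 else -1


def speculative_dfe(samples, h1, gated, d_init=-1):
    """1-tap speculative DFE; returns (decisions, comparator evaluations)."""
    decisions, evaluations = [], 0
    prev = d_init
    for y in samples:
        if gated:
            # Only the comparator matching the previous decision is clocked;
            # the other one stays in reset.
            threshold = h1 if prev > 0 else -h1
            d = sign(y - threshold)
            evaluations += 1
        else:
            # Both comparators evaluate, then a multiplexer picks one result.
            d_if_prev_one = sign(y - h1)
            d_if_prev_zero = sign(y + h1)
            d = d_if_prev_one if prev > 0 else d_if_prev_zero
            evaluations += 2
        decisions.append(d)
        prev = d
    return decisions, evaluations


random.seed(0)
H1 = 0.4  # hypothetical post-cursor ISI relative to the main cursor
tx = [random.choice((-1, 1)) for _ in range(1000)]
# Received sample = current symbol + H1 * previous symbol (first one preceded by -1).
rx = [x + H1 * xp for x, xp in zip(tx, [-1] + tx[:-1])]

ref, n_ref = speculative_dfe(rx, H1, gated=False)
out, n_gated = speculative_dfe(rx, H1, gated=True)
print(out == ref, n_ref, n_gated)
```

In this model both variants produce the same decisions, while the gated one performs half as many comparator evaluations, which is the source of the power saving described above. A negative `h1` models the over-equalized case.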
The slicers are low-noise two-stage dynamic comparators in 
which the second regenerating stage is self-timed and shared 
between the first stages of the two speculative comparators to 
further reduce power. The threshold voltages are generated 
with 8-bit monotonic resistor-ladder DACs built with highly 
ohmic and compact poly stripes and switches connected from 
(2/3-1/6) to (2/3+1/6) of the supply rail. The measured 
input-referred noise of the comparators is 3.1mVrms. 
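
To give a feeling for the threshold resolution, the sketch below computes the ladder range and step size at the 0.5V supply. It assumes an ideal ladder with 256 equally spaced taps and both end points included; the actual tap arrangement is not given in the text, and the function name is illustrative.

```python
def threshold_voltage(code, vdd=0.5, n_bits=8):
    """Ideal ladder tap voltage between (2/3 - 1/6) and (2/3 + 1/6) of vdd.

    Assumption: 2**n_bits equally spaced taps, both end points included.
    """
    n_codes = 2 ** n_bits
    if not 0 <= code < n_codes:
        raise ValueError("code out of range")
    v_low = (2 / 3 - 1 / 6) * vdd
    v_high = (2 / 3 + 1 / 6) * vdd
    return v_low + (v_high - v_low) * code / (n_codes - 1)


lsb_mv = (threshold_voltage(1) - threshold_voltage(0)) * 1e3
print(f"range: {threshold_voltage(0):.3f} V to {threshold_voltage(255):.3f} V")
print(f"LSB: {lsb_mv:.2f} mV")
```

Under these assumptions the step is roughly 0.65mV, several times smaller than the measured 3.1mVrms input-referred noise, so the DAC resolution would not be the limiting factor for the threshold setting.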
The DFE is used in each RX to increase the vertical BER 
margin whether the signal is under- or over-equalized. The TX 
FFE setting optimized for the RX at the end of the channel is 
necessarily shared by all the other RXs, so the RXs closer to 
the TX are over-equalized, as shown in Fig. 3. However, thanks 
to the negative range of our DFE, their vertical margin can be 
restored. For example, in the BER contour plots (time vs. DFE 
H1 level) in Fig. 3, the vertical margin at BER 10^-8 improves 
from 18 to 31mVppd in the last RX located at the end of the 
channel, and from 38 to >60mVppd in the RX connected 2.5mm 
away from the TX. 
Measurement Setup 
The prototype circuit was measured with wafer needle 
probing. The BER tests were performed with an on-chip PRBS 
generator and checker physically placed below the channel and 
interfaced at quarter-rate with the TX pre-driver and with each 
RX de-multiplexer/aligner. The horizontal openings of the BER 
bathtubs are shown in Fig. 5 together with the delay of the 
bathtub centers with respect to the input CLK. The measured 
latency is 21ps/mm. 
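
As a small worked example, the measured latency can be converted into a propagation delay per drop and expressed in unit intervals at 5Gb/s. The even 1.25mm drop spacing used here is an assumption (it is consistent with the 5.625mm average channel length), and the variable names are illustrative.

```python
LATENCY_PS_PER_MM = 21.0   # measured latency quoted in the text
UI_PS = 1e12 / 5e9         # 200 ps unit interval at 5 Gb/s

# Assumption: eight drops evenly spaced along the 10 mm channel.
positions_mm = [10.0 * k / 8 for k in range(1, 9)]
delays_ps = [LATENCY_PS_PER_MM * pos for pos in positions_mm]

for k, (pos, delay) in enumerate(zip(positions_mm, delays_ps), start=1):
    print(f"RX{k}: {pos:5.2f} mm  {delay:6.2f} ps  {delay / UI_PS:.2f} UI")
```

With these numbers the farthest RX sees the data about 210ps after launch, slightly more than one UI.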
Conclusion 
This work improves significantly on previously published 
on-chip multi-drop data links [2, 3] in both the reported 
efficiency, which is more than eight times better, and the 
channel dimensions, i.e., length and pitch density. Two 
point-to-point links [4, 5] are also included in the 
comparison table shown in Fig. 7. In that case, 
the comparison is made with only the RX at the end of our 
channel turned on, thus nbits and Lchan are equal to 1 and 10mm, 
respectively. We underline that PTX and PRX reported in this 
work also include clock generation, serializer, de-serializer and 
voltage reference generation. 
References 
[1] D. Turker et al., “A 19Gb/s 38mW 1-Tap Speculative DFE 
Receiver in 90nm CMOS,” pp. 216-217, VLSI 2009. 
[2] H. Ito et al., “A 8-Gbps Low-Latency Multi-Drop On-Chip 
Transmission Line Interconnect with 1.2-mW Two-Way 
Transceivers,” pp. 136-137, VLSI 2007. 
[3] H. Wu et al., “A 60GHz On-Chip RF-Interconnect with λ/4 
Coupler for 5Gbps Bi-Directional Communication and Multi-Drop 
Arbitration,” CICC 2012. 
[4] S. Lee et al., “A 95 fJ/b current-mode transceiver for 10 mm on-
chip interconnect,” pp. 262–263, ISSCC 2013. 
[5] Y. Liu et al., “A 0.1 pJ/b 5-to-10 Gb/s charge-recycling stacked 
low power I/O for on-chip signaling in 45 nm CMOS SOI,” 
pp. 400–401, ISSCC 2013. 
 
 
 
                  
Fig. 1. On-chip channel and test circuit implementation overview. 
 
Fig. 2. TX schematic and timing diagram. 
 
 
Fig. 3. Eye diagrams FFE off/on and BER vs. H1 at 2.5 and 10mm. 
 
Fig. 4. Circuits setting CM voltage at the RX inputs. 
 
Fig. 5. BER bathtub at the farthest RX and center/opening vs. RX. 
 
 
Fig. 6. Decision-gated DFE block diagram and circuit implementation. 
 
Fig. 7. Comparison table of on-chip interconnect data links. 
