This paper describes a 16-Gb/s differential bidirectional I/O transceiver cell in an emulated 40nm DRAM process with a fan-out-of-four inverter delay (FO4) of 45ps, so that one bit time is only 1.4 FO4 delays long. The transceiver implements several techniques to achieve low jitter despite the slow process and constrained power consumption, including quad-rate clocking with closed-loop quadrature correction, a shared LC-PLL with an octagonal inductor in a three-metal process, and a data-dependent regulator. The transceiver has a measured random jitter of 380fs rms at the transmitter output and a BER < 10^-14 while consuming 8mW/Gb/s.
Introduction
The top-level transceiver architecture is shown in Fig. 1, with a companion controller [1] implementing equalization and per-bit-slice skew calibration. The DRAM transceiver uses a quad-rate clocking scheme to circumvent the high gate delay (FO4), a quadrature corrector to minimize phase error in the I & Q clocks, an LC-PLL to achieve low random jitter (RJ), a data-dependent regulator to minimize power-supply-induced jitter (PSI-J) in the data path, and CML signaling to minimize PSI-J in the clock path. This I/O cell is implemented in a 65nm process with design restrictions to emulate a projected 40nm DRAM process, expected to be available in 2010 [2].
Process Emulation
Critical parameters in the projected 40nm DRAM process node are shown in the first two columns of Table 1 [2]. Note that periphery I/O transistors have a larger minimum channel length than core transistors and that only three metal layers are available. The parameters emulated in a 65nm process are shown in the third column of Table 1. The worst-case FO4 delay is emulated using a high-Vt device with a +10nm gate-length bias (L_drawn = 70nm versus 60nm). The DRAM thick metal layer is emulated by stacking M3-M5.
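The timing pressure that motivates quad-rate clocking can be checked with a few lines of arithmetic. The sketch below is illustrative only: the data rate and FO4 delay are taken from the text, while the variable names and the assumption that "quad rate" means exactly four bits per clock period are ours.

```python
# Timing budget implied by the quoted numbers (illustrative sketch).
DATA_RATE_GBPS = 16.0   # per-pin data rate, from the text
FO4_PS = 45.0           # worst-case FO4 delay of the emulated process, from the text
PHASES = 4              # assumption: quad-rate clocking carries four bits per clock period


def unit_interval_ps(rate_gbps: float) -> float:
    """Bit time (UI) in picoseconds for a data rate given in Gb/s."""
    return 1000.0 / rate_gbps


ui_ps = unit_interval_ps(DATA_RATE_GBPS)      # one bit time
bit_time_fo4 = ui_ps / FO4_PS                 # bit time expressed in FO4 delays
clock_period_ps = PHASES * ui_ps              # period of the quad-rate clock
clock_freq_ghz = 1000.0 / clock_period_ps     # quad-rate clock frequency
clock_period_fo4 = clock_period_ps / FO4_PS   # clock period expressed in FO4 delays

print(f"UI = {ui_ps} ps = {bit_time_fo4:.2f} FO4")
print(f"quad-rate clock = {clock_freq_ghz} GHz, period = {clock_period_fo4:.2f} FO4")
```

Under these assumptions a full-rate clock would have to toggle within roughly 1.4 gate delays, whereas the quad-rate clock period spans about 5.6 FO4 delays.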
Clocking and Quadrature Corrector
Low jitter, critical for 16-Gb/s operation (UI = 62.5ps), is achieved with a supply-regulated LC-PLL operating from a 500-MHz reference. The PLL contains an 8-GHz LC-VCO followed by a quadrature divide-by-two circuit to generate 4-GHz quadrature clocks (Fig. 2). Despite a limited inductor Q of ~3.5 due to high metal resistance, the LC-VCO achieves superior phase noise performance (proportional to Q^2) relative to a ring-based VCO for a given power budget. The 0.75-nH inductor is constructed with an octagonal differential structure for maximum Q-to-area ratio and has a self-resonant frequency of 30GHz. The inherently low K_VCO of the LC-based design suppresses reference spurs and reduces loop-filter phase noise. Supply regulation for the LC-VCO reduces PSI-J caused by the high multiplication ratio. To relax the headroom requirement for supply regulation, an NMOS-only cross-coupled pair is used as the negative-transconductance element. The higher g_m/C of the NMOS design (relative to PMOS) also reduces fixed tank capacitance and improves tuning range. The VCO frequency is set by a fine analog varactor control and a coarse digitally switched MOM capacitor array for extended tuning range, while keeping K_VCO low [3]. Fig. 3 shows the measured clock jitter.
To minimize power and area, one PLL provides the 4-GHz I & Q clocks to eight data bit slices (DQ) and one command/address slice (RQ). CML clock distribution is chosen to minimize PSI-J. Closed-loop quadrature correction in each bit slice (Fig. 4) corrects the quadrature error accumulated in the global and bit-slice clock distributions, up to ±0.07UI (±4.5ps). Quadrature error is detected by a symmetric CML XOR transconductor, a low-pass filter, and a digital sampler. The sampled error feeds a state machine that corrects the quadrature by adjusting switched-capacitor clock loads. Using this scheme, quadrature error is limited only by local device mismatches in the quadrature detector.
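The behavior of such a closed-loop corrector can be illustrated with a simple bang-bang model. This is a behavioral sketch under stated assumptions, not the circuit's actual state machine: the 0.25-ps capacitor step, the code range, the iteration count, and all function names are hypothetical, and the XOR detector, filter, and sampler are reduced to a single sign decision.

```python
# Behavioral sketch of a bang-bang quadrature correction loop.
# All parameter values and names below are illustrative assumptions.

def xor_detector_sign(phase_error_ps: float, detector_offset_ps: float = 0.0) -> int:
    """1-bit decision: sign of the filtered XOR output relative to ideal quadrature.

    detector_offset_ps models device mismatch inside the detector itself.
    """
    return 1 if (phase_error_ps + detector_offset_ps) > 0 else -1


def correct_quadrature(initial_error_ps: float,
                       step_ps: float = 0.25,      # assumed delay per capacitor step
                       max_code: int = 18,         # 18 * 0.25 ps = +/-4.5 ps range
                       iterations: int = 64,
                       detector_offset_ps: float = 0.0):
    """Step the switched-capacitor code by one LSB per sample until it dithers."""
    code = 0
    history = []
    for _ in range(iterations):
        residual = initial_error_ps - code * step_ps   # remaining I/Q error
        history.append(residual)
        decision = xor_detector_sign(residual, detector_offset_ps)
        # Saturate the code at the ends of the capacitor array.
        code = max(-max_code, min(max_code, code + decision))
    return code, history


code, history = correct_quadrature(3.1)
print(code, round(history[-1], 3))
```

In this model an ideal detector leaves a residual that dithers within one capacitor step of zero, while a nonzero `detector_offset_ps` shifts the settling point by the same amount, which is consistent with the statement that the residual error is set by mismatch in the detector.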
Front-End and Data-Dependent Regulator
Achieving 16-Gb/s operation in the front-end data-path circuits is challenging even with interleaved quadrature-rate operation. Because CML multiplexers suffer from device stacking and headroom limitations, the TX pulser in Fig. 1 is implemented with CMOS NAND gates, which achieve faster performance and lower data-dependent jitter (DDJ). Similarly, CMOS-style StrongArm sense amplifiers are used in the RX path to achieve faster regeneration and reset than CML latches. While these CMOS front-end circuits achieve 1.4 FO4 operation at lower power consumption than their CML counterparts, they have higher PSI-J. This supply-noise sensitivity can be addressed by supply regulation while still consuming less power than a CML implementation. However, such a regulator must tolerate the transient load currents of these data-path circuits.
A voltage regulator with a replica load has been shown to provide power- and area-efficient suppression of high-frequency transient noise from the external supply [4]. We extend the idea by using a data-dependent replica (Fig. 5) that has a transient load current proportional to that of the front-end, providing fast feed-forward regulation of transient load currents. For a given capacitance area, this technique provides a 4× reduction in self-induced supply noise, and thus in PSI-J. The voltage regulator regulates the 1.2V VTT supply down to a 1.0V supply (vreg) for critical front-end circuits.
RX sampler sizing is limited by area and power constraints, resulting in a random offset voltage distribution with 3σ ≅ 30mV. Active offset correction based on current summation at the sampler inputs achieves a ±45mV correction range with 3mV nominal resolution. The linear equalizer does not fully equalize the channel, but provides 3dB of gain at high frequencies to mitigate the random noise contribution of later RX stages and to relax the tap-weight requirements of the swing-constrained TX-FIR filter on the controller (Fig. 1).
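The offset-correction numbers imply a small trim DAC: ±45mV in 3mV steps gives 31 codes, and the range covers 4.5σ of the quoted offset distribution. The sketch below shows one plausible way such a code could be found. The linear sweep, the input-referred voltage model of the current-summation correction, and every name in it are our assumptions; the calibration procedure itself is not described in the text.

```python
# Illustrative offset-trim search; the sweep procedure is an assumption.
RANGE_MV = 45.0                        # correction range, from the text
STEP_MV = 3.0                          # nominal resolution, from the text
SIGMA_MV = 30.0 / 3.0                  # 3-sigma offset of 30 mV, from the text
MAX_CODE = int(RANGE_MV / STEP_MV)     # codes run from -15 to +15
NUM_CODES = 2 * MAX_CODE + 1           # 31 distinct settings


def sampler_decision(input_mv: float, offset_mv: float, correction_mv: float) -> int:
    """Ideal sampler with an input-referred offset and an applied correction."""
    return 1 if (input_mv + offset_mv - correction_mv) > 0 else 0


def calibrate_offset(offset_mv: float) -> int:
    """Sweep the trim code with a zero differential input.

    Returns the first code at which the sampler decision flips to 0,
    leaving a residual offset smaller than one step when the offset is in range.
    """
    for code in range(-MAX_CODE, MAX_CODE + 1):
        if sampler_decision(0.0, offset_mv, code * STEP_MV) == 0:
            return code
    return MAX_CODE  # offset beyond the positive end of the range


code = calibrate_offset(10.0)
residual_mv = 10.0 - code * STEP_MV
print(NUM_CODES, RANGE_MV / SIGMA_MV, code, residual_mv)
```

A sign-only sweep like this can only guarantee a residual below one 3mV step; a real calibration would likely average many decisions per code to suppress noise.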
Measurement Results
The transceiver cell is implemented as part of a testchip (Fig. 6) that contains two DQ bytes and one RQ slice in an area of 4mm by 1.1mm. The DQ transceiver cell and PLL cell are both 0.2mm wide. The RQ slice is 0.4mm wide. The testchip uses two rows of I/O pads with a pad pitch of 50µm. The testchip is housed in a low-cost 2-layer wirebond µBGA package on an FR4 test board. Fig. 7 shows the superposed quadrature transmit eyes. With the quadrature corrector turned on, the eye opening is improved by 40%. Fig. 8
