Abstract-This paper describes a bidirectional, differential, 16 Gb/s per link memory interface that includes a Controller and an emulated DRAM physical interface (PHY) designed in 65 nm CMOS. To achieve a high data rate, the interface employs the following technology ingredients: asymmetric equalization, asymmetric timing calibration, asymmetric link margining, inductor-based (LC) PLLs, multi-phase error correction, and a data-dependent regulator. At 16 Gb/s, this interface achieves a unit-interval to inverter FO4 ratio of 2.8 (Controller) and 1.4 (DRAM) and operates in a channel with 15 dB loss at Nyquist. Under such bandwidth limitations on and off chip, the Controller and DRAM PHYs consume 13 mW/Gb/s and 8 mW/Gb/s, respectively. Using a PRBS 2^11 - 1 pattern, the link achieves a timing margin of 0.19 UI at a BER of 1e-12 for both read and write operations.
I. INTRODUCTION

SYSTEM memory bandwidth is one of the key limitations to high performance computing. Total memory bandwidth can be increased either by increasing the number of links or by increasing the per link data rate. However, there is a cost advantage to increasing the per link data rate to achieve this goal as it reduces the package size as well as the number of DRAMs and eases routing congestion that would otherwise require more metal layers in the package and PCB. Thus, the primary objective of this work is to build the fastest memory transceiver possible while constraining power efficiency (mW/Gb/s) such that thermal issues do not negate the cost savings achieved by reducing the number of links.
J.-H. Chun was with Rambus Inc., Los Altos, CA 94022 USA, and is now with the Department of Semiconductor Systems Engineering, SungKyunKwan University, Suwon, Korea.
R. Navid was with Rambus Inc., Los Altos, CA 94022 USA, and is now with True Circuits Inc., Los Altos, CA 94022 USA.
It is estimated that the total memory bandwidth required by graphics processors and game consoles will approach 1 TB/s in 2012 [1]. The interface in this work operates at 16 Gb/s/link in order to achieve this aggregate bandwidth with 512 links, a count that has already been demonstrated in state-of-the-art GPUs. Fig. 1 illustrates such a 1 TB/s memory system [2]. It shows 16 DRAMs, each of which has 32 data links (DQ) and two Command/Address (C/A) links all operating at 16 Gb/s. The 32 DQs provide a combined data bandwidth of 64 GB/s to each DRAM. Both C/A and DQ are point-to-point differential links. Point-to-point signaling provides good signal integrity by minimizing impedance discontinuities, while differential signaling is robust against common-mode noise sources such as supply noise and ground bounce. Furthermore, differential signaling results in less crosstalk/EMI and simultaneous switching output noise than single-ended signaling. One difference between the C/A and DQ links is that the C/A is unidirectional whereas the DQ is bidirectional.
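The bandwidth figures in this paragraph follow from simple arithmetic. The short sketch below reproduces them; the function and variable names are illustrative and do not come from the paper.

```python
# Bandwidth arithmetic for the memory system described in the text.
LINK_RATE_GBPS = 16   # per-link data rate in Gb/s
DQ_PER_DRAM = 32      # data links per DRAM
NUM_DRAMS = 16        # DRAMs in the system of Fig. 1


def bandwidth_gbytes_per_s(num_links: int, rate_gbps: float = LINK_RATE_GBPS) -> float:
    """Aggregate bandwidth in GB/s for num_links links (8 bits per byte)."""
    return num_links * rate_gbps / 8


per_dram = bandwidth_gbytes_per_s(DQ_PER_DRAM)    # bandwidth of one DRAM
total_links = DQ_PER_DRAM * NUM_DRAMS             # DQ links in the whole system
system = bandwidth_gbytes_per_s(total_links)      # aggregate system bandwidth

print(f"{per_dram:.0f} GB/s per DRAM, {total_links} links, {system:.0f} GB/s total")
```

With 512 links the aggregate comes out at 1024 GB/s, which is the 1 TB/s target quoted above.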
The key technical challenges to designing 16 Gb/s memory links discussed in [3] and [4] are: bandwidth limitation in the channel, bandwidth limitation on the silicon, jitter, skew, and power efficiency. To overcome the bandwidth limitation of the channel, an asymmetric equalization architecture that places all the equalization capability on the Controller is used. Since the transistors in the Controller are about twice as fast as the periphery devices in the DRAM in any given process generation, the power of the overall system can be reduced by placing equalization circuits for both read and write directions on the Controller. As a reference, the inverter fanout-of-4 (FO4) gate delay is 22 ps for the Controller and 45 ps for the emulated DRAM in this work. To further alleviate the speed limitation of the silicon, half rate and quadrature clocking are employed on the Controller and DRAM, respectively, to reduce the on-chip clock rates. Since high frequency jitter is amplified when passing through a low pass channel [5], duty cycle correction and quadrature correction are employed in the Controller and DRAM, respectively, to reduce deterministic jitter (DJ). The speed limitation of the silicon exacerbates another type of jitter, namely power-supply induced jitter (PSIJ), which cannot be alleviated with multi-phase clocking. A data-dependent regulator powers the datapath and clock tree of the DRAM, which would otherwise have excessive PSIJ owing to the slower periphery transistors. Skew from trace length, clock distribution, and random transistor mismatches becomes problematic as the data rate increases since the bit time is reduced (62.5 ps at 16 Gb/s). Phase mixers with active calibration correct lane-to-lane skew in this system. Once again, to reduce the overall system power, an asymmetric clocking architecture [12] is employed where all timing adjustment circuits are implemented on the Controller. Finally, LC PLLs are used on both devices to optimize the random jitter (RJ) performance of the system.
To motivate the architecture and circuits, the on and off chip bandwidth limitations are discussed in Section II. Section III discusses the asymmetric equalization architecture and the transmitter (TX) and receiver (RX) circuits. The clocking architecture and circuits are described in Section IV. The experimental results are given in Section V and the conclusions are made in Section VI.
II. ON AND OFF CHIP BANDWIDTH LIMITATIONS
The interconnect between the Controller and DRAM consists of a flip-chip thin-core FBGA package (Controller) with substrate traces up to 20 mm long, a 3-inch FR4 PCB trace, and a wirebond FBGA package (DRAM) with a trace length of 8 mm. However, the channel attenuation is dominated by the capacitance of the Controller and DRAM devices, which total about 1 pF each. This capacitance is due to the primary ESD diodes, pad metallization, metal routing between the pad and TX, and the diffusion of the TX. The combined RX gate and secondary ESD device capacitance of about 100 fF is shielded from the pad by a 100 Ω resistor, but adds an additional pole below 16 GHz at the receiver input that further attenuates the signal. Fig. 2 shows the simulated channel response for the DQ links. The first graph shows the magnitude response of the channel with 15 dB attenuation at the Nyquist frequency of 8 GHz. The second graph shows the single bit response (SBR) with 4 main ISI components. The SBR shows the need for equalization as the sum of the absolute values of the ISI components is almost equal to the magnitude of the main cursor, indicating a closed eye.
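Two of the claims above can be checked numerically: the location of the receiver input pole and the closed-eye criterion based on the SBR. The sketch below is illustrative only. The pole calculation assumes a 100 ohm series resistance with the quoted 100 fF, and the SBR samples are hypothetical values chosen to mimic a nearly closed eye, not data from Fig. 2.

```python
import math


def rc_pole_hz(r_ohm: float, c_farad: float) -> float:
    """-3 dB frequency of a single-pole RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


def worst_case_eye(main_cursor: float, isi_taps: list) -> float:
    """Peak-distortion eye height: main cursor minus the sum of |ISI|."""
    return main_cursor - sum(abs(t) for t in isi_taps)


# Assumed 100 ohm series resistor with about 100 fF at the RX input.
f_pole = rc_pole_hz(100.0, 100e-15)
print(f"RX input pole: {f_pole / 1e9:.1f} GHz")

# HYPOTHETICAL normalized single-bit-response samples (not measured data).
main = 0.40
isi = [0.05, 0.18, 0.10, 0.06]   # four ISI components
print(f"worst-case eye height: {worst_case_eye(main, isi):+.3f}")
```

A worst-case eye height near zero (or negative) means that some data pattern can drive the sampled value to the decision threshold, which is what "closed eye" refers to.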
At 16 Gb/s, the on-chip bandwidth is also a limitation. In the CMOS process that the Controller and DRAM were designed in, the FO4 inverter delays are 22 ps and 45 ps, respectively, in the slow device, low voltage, high temperature corner. This translates into a low unit-interval to inverter FO4 ratio (UI/FO4) of 2.8 (Controller) and 1.4 (DRAM). Prior research showed that the minimum clock period to pass through an FO4 chain of inverters without amplitude reduction is 8 FO4 [6]. This means that the high speed portions of this design will require logic with effective FO less than 4. In fact, the required FO of the clock trees approaches 2 for double data rate at the Controller and quad data rate at the DRAM. The primary cost of low FO design is added power consumption. A related observation is that designing a given circuit in the Controller is more power efficient than designing it in the DRAM. This is true for analog circuits as well as logic, as the lower and high thresholds in the DRAM process degrade amplifier power efficiency. This is the motivation behind the asymmetric memory link architecture, shown in Fig. 3, where the Controller contains more of the timing, equalization, and link diagnostic circuitry. The details will be covered in subsequent sections.
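The UI/FO4 ratios, and the reason the clock trees need an effective fanout below 4, can be reproduced with a few lines of arithmetic. This is a sketch using the FO4 delays and clock rates quoted in the paper (8 GHz half-rate on the Controller, 4 GHz quadrature on the DRAM); the helper names are illustrative.

```python
FO4_PS = {"Controller": 22.0, "DRAM": 45.0}    # slow-corner FO4 delays
CLOCK_GHZ = {"Controller": 8.0, "DRAM": 4.0}   # on-chip clock rates


def unit_interval_ps(rate_gbps: float) -> float:
    """Bit time in picoseconds for a given data rate in Gb/s."""
    return 1000.0 / rate_gbps


def clock_period_in_fo4(clock_ghz: float, fo4_ps: float) -> float:
    """Clock period expressed in FO4 delays."""
    return (1000.0 / clock_ghz) / fo4_ps


ui_ps = unit_interval_ps(16.0)
ui_over_fo4 = {name: ui_ps / fo4 for name, fo4 in FO4_PS.items()}
period_fo4 = {name: clock_period_in_fo4(CLOCK_GHZ[name], FO4_PS[name])
              for name in FO4_PS}

for name in FO4_PS:
    print(f"{name}: UI/FO4 = {ui_over_fo4[name]:.1f}, "
          f"clock period = {period_fo4[name]:.1f} FO4")
```

Both clock periods come out below the 8 FO4 limit cited from [6], which is consistent with the statement that the high speed paths need logic with an effective fanout smaller than 4.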
In addition to the slower devices, DRAM processes also have fewer metal layers (at most 3) which have higher resistivity and larger capacitance to the substrate than those for ASICs. This limits the options for on-chip inductors as well as global clock and power grid distribution.
On a final note, we want to make clear that the DRAM presented in this paper is an emulated DRAM. Since this memory link is designed for production in the 2010-2012 timeframe, the process does not exist yet. Hence, to validate the architecture and ideas set forth, a 65 nm CMOS process was degraded (by design constraints) to match the best estimates of the future DRAM process. For example, non-minimum-length and high threshold devices were exclusively used to match the threshold voltage and the FO4 of the future DRAM process, and metal usage was restricted to three routing layers as described in [4]. In contrast, the ASIC process in the 2010-2012 timeframe will be faster than the 65 nm CMOS process used for the Controller. Hence, the power efficiency (mW/Gb/s) of the Controller will improve in the future process.
III. EQUALIZATION ARCHITECTURE AND CIRCUITS
As shown in Fig. 3, the Controller provides equalization for both write and read operations. In the write direction, the channel is equalized using a 5-tap TX FIR. In the read direction, an analog linear equalizer provides about 6 dB of emphasis at the Nyquist frequency of 8 GHz. As the write and read directions leverage different equalizers, some performance mismatch between them is expected, but they are designed to offer similar performance to prevent one direction from limiting the overall system margin. Based on signal integrity simulations, the nominal swing of the Controller (DC swing without equalization) is made 50% larger than that of the DRAM for this purpose. The DRAM TX output swing is set to 540 mV (11 mA) so that its high threshold transistors have sufficient saturation margin. The Controller swing is thus 800 mV (16 mA).
A. Transmit FIR and Link Diagnostic Circuits
A block diagram of the Controller TX is shown in Fig. 4. It has 5 FIR taps (1 pre, 1 main, 3 post) to equalize the SBR of Fig. 2. Each segment has a 2:1 multiplexing pre-driver that serializes the 8 Gb/s input into a full rate 16 Gb/s stream. The output drivers are open-drain differential pairs with digitally controlled tail current sources for tap weight adjustment.
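To illustrate what a 5-tap transmit FIR with 1 pre-cursor, 1 main, and 3 post-cursor taps does to a bit stream, the following behavioral sketch applies symbol-spaced tap weights to a sequence of bits. The tap weights are hypothetical (chosen so that their absolute values sum to 1, mimicking a fixed total driver current) and are not the coefficients used on the testchip.

```python
def tx_fir(bits, taps, n_pre=1):
    """Apply a symbol-spaced FIR to bits mapped to +/-1 symbols.

    taps are ordered [pre..., main, post...]. Output sample k is
    sum_i taps[i] * symbol[k + n_pre - i]; symbols outside the
    sequence are treated as zero.
    """
    symbols = [1.0 if b else -1.0 for b in bits]
    out = []
    for k in range(len(symbols)):
        acc = 0.0
        for i, w in enumerate(taps):
            j = k + n_pre - i
            if 0 <= j < len(symbols):
                acc += w * symbols[j]
        out.append(acc)
    return out


# HYPOTHETICAL tap weights: 1 pre, main, 3 post; sum of |w| equals 1.
TAPS = [-0.05, 0.65, -0.20, -0.07, -0.03]
bits = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
levels = tx_fir(bits, TAPS)

print(f"level at the 0->1 transition: {levels[5]:+.2f}")
print(f"level after a long run of 1s: {levels[9]:+.2f}")
```

The transition bit is driven at a much larger amplitude than bits in a long run, which is how the FIR pre-compensates the low-pass behavior of the channel.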
The FIR circuits necessary for equalization are also used to evaluate the performance of the link. With minor reconfiguration enabled by digital registers, the Controller TX can add a DC differential voltage offset into the channel so that the voltage margin can be measured. This is important because while the write eye at the pad of the DRAM is relatively wide open due to the Controller TX-FIR, the read eye at the pad of the Controller will be severely degraded since the effect of equalization is only visible on chip at the output of the Controller RX. This is illustrated by the eye diagrams in Fig. 3. Hence, it is not possible to verify the functionality of the equalizers by observing eye diagrams on an external scope. The ability to add a voltage offset combined with timing control from the phase mixers enables the Controller to map the effective BER eye diagram in voltage and time for the entire link, including the RX circuits themselves. There are two distinct features in our implementation compared to [7] and [8]: the first is that the voltage offset is added by the transmitter rather than the receiver, and the second is that the Controller transmitter adds the voltage offset to margin the link for both the write and read operations. These features leverage both the asymmetry and the bidirectionality of the link.
Fig. 4(b)-(d) shows how the Controller TX is reconfigured for the various modes of operation. Conceptually, each TX tap driver consists of 3 separate segments. The first segment consumes half the current and the sum of the other 2 segments makes up the other half. During normal write operation, all three segments are driven with data by their respective pre-drivers. When the write direction is margined, only the first segment ('Data Segment') is connected to its pre-driver and transmits data. The second segment ('Offset Segment') is hard switched to add a differential offset to the channel. The polarity of the offset is set by which transistor of the differential pair is connected to supply and which is connected to ground. The magnitude of the offset is adjusted with the digital code to the tail current source. The inputs of the third differential pair ('CM Segment') are both tied to the supply to pull both output nodes down. The current of the CM Segment is set according to constraints (1) and (2) so that the output common-mode level is kept constant even as the offset is varied. Essentially, the Controller TX is adding a differential offset to its own signal on the channel. When margining the read direction, the DRAM TX is put in half swing mode and the Data Segment of the Controller TX is turned off. The Offset and CM Segments are still enabled. As mentioned before, the swing of the DRAM is less than that of the Controller, so the currents of the Offset and CM Segments are scaled by the same proportion to maintain the common mode of the signal at the input of the Controller RX. In this manner, the Controller TX adds a differential offset to the signal from the DRAM.
(1) (2)
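As a hedged illustration of how such common-mode constraints can arise (not necessarily the exact form of (1) and (2)), assume each open-drain segment steers its tail current into a termination with effective resistance $R$ per output node. The symbols $I_D$, $I_O$, $I_{CM}$, $I_{tot}$, and $R$ are introduced here for illustration: they denote the Data, Offset, and CM Segment tail currents, the total tap current, and the effective load. The Data and Offset Segments each place their full tail current on one output, while the CM Segment splits its current equally between both outputs.

```latex
\begin{align*}
V_{cm} &= V_{DD} - \frac{R}{2}\,\bigl(I_{D} + I_{O} + I_{CM}\bigr), \\
V_{os} &= \pm R\, I_{O}, \\
I_{D} &= \tfrac{1}{2} I_{tot}, \qquad I_{O} + I_{CM} = \tfrac{1}{2} I_{tot}
\;\;\Longrightarrow\;\; V_{cm} = V_{DD} - \tfrac{R}{2}\, I_{tot}.
\end{align*}
```

Under these assumptions the differential offset $V_{os}$ scales with $I_O$ alone, while the common-mode level depends only on the sum $I_O + I_{CM}$, so trading current between the two segments changes the offset without moving the common mode.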
B. Controller Receive Equalizer
The Controller RX shown in Fig. 5(a) starts with a continuous time equalizer (i.e., a linear equalizer). The RX then splits into two sampling paths (even and odd) to relieve the maximum on-chip clock frequency. The circuits are kept as small as possible to minimize power consumption, and the resulting random mismatch components are corrected via current summing at the sampler inputs. The simulated mismatch of the complete RX has a standard deviation of 12 mV referred to the input of the samplers where the correction takes place. The offset correction range is 40 mV with 3 mV resolution.
A linear equalizer is chosen over the more versatile decision feedback equalizer (DFE) to avoid having a feedback path of 3 FO4. Unlike the more prevalent source degenerated and resistor loaded topology [9], the linear equalizer in Fig. 5(b) uses active inductor loads [10] to provide emphasis in its transfer function. Hence, the DC gain is not changed when varying the peaking in the transfer function, as seen in Fig. 5(c). The active inductor topology was chosen as simulations showed that it provides 3 dB more emphasis than the source degenerated topology when the input gate size, current consumption, and DC gain are kept equal.
The active inductor equalizer is not without its challenges. For Vdd referenced signaling, the active inductor loads need to be biased with a gate voltage higher than Vdd (Vbh) in order to provide significant output swing. A good choice for Vbh is such that the load device remains saturated and still acts as an inductor for all output voltages. Vbh is generated by a voltage boosting circuit similar to [11] with an added feedback loop to compensate for various losses from PVT and input clock conditions.
The transfer function of the equalizer can be expressed as (3).

\[
H(s) = \frac{g_{m1}\,\bigl(1 + s R_G C_{gs}\bigr)}{s^{2} R_G C_{gs} C_L + s\,\bigl(C_{gs} + C_L\bigr) + g_{m2}} \tag{3}
\]

$C_L$ is the load capacitance, $g_{m1}$ is the transconductance of the input device, and $g_{m2}$ is the transconductance of the load; $R_G$ is the resistor in series with the gate of the active inductor load and $C_{gs}$ is the gate-to-source capacitance of that load. To simplify the analysis, the finite output impedance of the devices as well as any parasitic capacitance between the gate of the active inductor and (AC) ground were ignored. However, care must be taken to minimize this parasitic capacitance as it reduces the effective inductance. The frequency of the zero can be lowered, and the peak gain increased, by increasing either $R_G$ or $C_{gs}$. Digitally controlled resistors are used for $R_G$ to provide adjustable peaking of the AC response. NMOS accumulation mode capacitors were added to supplement the inherent $C_{gs}$ of the load device to increase the peaking.

Fig. 5(c) shows a family of transfer function curves simulated in the nominal corner for the complete receiver. The DC gain is 2.5 dB at all settings, and the AC gain is controllable from 6 dB to 9 dB to provide up to 6.5 dB of emphasis.
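The qualitative behavior described here (DC gain independent of the peaking setting, peaking that grows with the gate resistor) can be explored with a small numerical sketch. It assumes a first-order model of a common-source stage with an active-inductor load, H(s) = gm_in (1 + s R Cgs) / (s^2 R Cgs CL + s (Cgs + CL) + gm_load), which ignores device output impedance and gate parasitics. All element values below are hypothetical and are not taken from the design.

```python
import math


def eq_gain(f_hz, gm_in, gm_load, r_gate, c_gs, c_load):
    """|H(j*2*pi*f)| for the assumed active-inductor-load model."""
    s = 1j * 2.0 * math.pi * f_hz
    num = gm_in * (1.0 + s * r_gate * c_gs)
    den = s * s * r_gate * c_gs * c_load + s * (c_gs + c_load) + gm_load
    return abs(num / den)


def db(x):
    """Convert a voltage ratio to decibels."""
    return 20.0 * math.log10(x)


# HYPOTHETICAL element values, for illustration only.
GM_IN, GM_LOAD = 4e-3, 3e-3                 # siemens
C_GS, C_LOAD = 50e-15, 30e-15               # farads
FREQS = [0.1e9 * k for k in range(1, 201)]  # 0.1 GHz to 20 GHz sweep


def dc_gain_db(r_gate):
    return db(eq_gain(0.0, GM_IN, GM_LOAD, r_gate, C_GS, C_LOAD))


def peaking_db(r_gate):
    """Peak gain over the sweep relative to the DC gain, in dB."""
    dc = eq_gain(0.0, GM_IN, GM_LOAD, r_gate, C_GS, C_LOAD)
    peak = max(eq_gain(f, GM_IN, GM_LOAD, r_gate, C_GS, C_LOAD) for f in FREQS)
    return db(peak / dc)


for r in (500.0, 1000.0, 2000.0):
    print(f"R = {r:6.0f} ohm: DC gain {dc_gain_db(r):.2f} dB, "
          f"peaking {peaking_db(r):.2f} dB")
```

In this model the DC gain is gm_in / gm_load for every resistor setting, while a larger gate resistor lowers the zero frequency and raises the peak.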
IV. CLOCKING ARCHITECTURE AND CIRCUITS
To ensure sufficient timing margin at 16 Gb/s, high performance clocking circuits are necessary on both the Controller and DRAM. Therefore, LC-VCO based PLLs, which offer superior noise performance compared to ring-VCO based PLLs, are employed on both PHYs. Fewer and more resistive metal layers present challenges to designing an LC-VCO in a DRAM process. Solutions to these challenges are discussed in [4]. A low-cost 500 MHz reference clock from an external clock part is distributed in a controlled impedance fashion to the Controller and then to the DRAM.
An asymmetric clocking architecture is employed (Fig. 3). The burden of phase adjustment, skip, and levelization is placed entirely on the Controller. The phase mixer adjusts the TX and RX clock for optimal BER performance and also compensates the skew between bit slices to ease routing on the board. The skip block handles the arbitrary phase clock domain crossing between sclk and dclk (phase mixer output). The purpose of levelization is to address skew greater than 1 UI by equalizing the latency of the bit slices. Together, the levelization and phase mixer enable an effective phase adjustment range of 64 UI for both the TX and RX. These features are also used to compensate C/A and DQ flight time differences as well as reference clock skew across the multiple DRAMs.
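The division of labor between levelization and the phase mixer can be pictured as splitting a required delay into a coarse and a fine part. The sketch below is a conceptual model only: it assumes that levelization acts in whole-UI steps and that the phase mixer has 32 steps per UI, neither of which is stated in the paper.

```python
PHASE_STEPS_PER_UI = 32   # HYPOTHETICAL phase-mixer resolution
MAX_RANGE_UI = 64         # effective adjustment range quoted in the text


def split_delay(delay_ui: float):
    """Split a delay in UI into (whole-UI levelization, phase-mixer code)."""
    if not 0.0 <= delay_ui < MAX_RANGE_UI:
        raise ValueError("delay outside the adjustable range")
    total_steps = round(delay_ui * PHASE_STEPS_PER_UI)
    # Keep the rounded value inside the 64 UI range.
    total_steps = min(total_steps, MAX_RANGE_UI * PHASE_STEPS_PER_UI - 1)
    return divmod(total_steps, PHASE_STEPS_PER_UI)


# Example: a bit slice that needs 3.25 UI of delay.
whole_ui, phase_code = split_delay(3.25)
print(f"levelization: {whole_ui} UI, phase code: {phase_code}")
```

In this picture the phase mixer never needs more than one UI of range, and skew beyond that is absorbed by the latency adjustment.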
A. Controller Clocking Architecture and Circuits
The Controller clocking architecture is shown in Fig. 6 . On the top of each byte, an LC-VCO based PLL generates an 8 GHz differential clock signal from the 500 MHz reference. This 8 GHz clock is divided down to 2 GHz before distribution to reduce the power consumption. Although this LC PLL results in a clean clock output, there are two challenges. First, the phase mixer that adjusts dclk requires quadrature 8 GHz phases to be interpolated. Second, a wide frequency range is desired to test the interface at different data rates. To address these, each byte level LC PLL is followed by local M/N ring-VCO based PLLs which generate multiple phases over a wide frequency range.
To reduce power consumption, each ring PLL is shared by two neighboring bit slices (DQx2). The DQx2 clocking circuits generate a dclk0 for bit0, a dclk1 for bit1, and a common sclk. To avoid the area and power consumption of two phase mixers per bit slice (a total of five for a DQx2 block), one phase mixer is time-multiplexed between the TX and RX of the same bit slice, since the two operate at different times in the same bidirectional I/O. Two phase codes are obtained from separate read and write operation timing calibrations and provided to the shared phase mixer through a multiplexer which is appropriately selected according to the data transaction taking place.
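The time-multiplexing of one phase mixer between TX and RX amounts to selecting one of two stored calibration codes depending on the transaction type. A minimal behavioral sketch follows; the class name, field names, and code values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SharedPhaseMixer:
    """One phase mixer per bit slice, shared by TX (write) and RX (read)."""
    write_code: int   # result of write-timing calibration
    read_code: int    # result of read-timing calibration

    def select(self, transaction: str) -> int:
        """Return the phase code driven to the mixer for this transaction."""
        if transaction == "write":
            return self.write_code
        if transaction == "read":
            return self.read_code
        raise ValueError(f"unknown transaction type: {transaction!r}")


# HYPOTHETICAL calibration results for one bit slice.
mixer = SharedPhaseMixer(write_code=10, read_code=42)
for op in ("write", "read", "write"):
    print(op, mixer.select(op))
```

Because the TX and RX of a bidirectional I/O are never active at the same time, the two codes never need to be applied simultaneously.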
B. DRAM Clocking Architecture and Circuits
The clocking architecture of the DRAM is shown in Fig. 7 . One LC-PLL is used in each DQ byte to multiply the 500 MHz reference clock to 8 GHz. Because of the aforementioned transistor and global metal routing limitations, the 8 GHz PLL output is divided and distributed as quadrature 4 GHz clocks (two pairs of differential 4 GHz clocks I/Ib and Q/Qb) to the individual bit slices. These 4 GHz clocks are distributed in CML to minimize PSIJ. In each bit slice, these quadrature 4 GHz clocks are locally converted to CMOS levels to drive the front-end circuits of the TX and RX.
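The clock frequencies quoted for the two devices are tied together by simple ratios. The sketch below checks that an 8 GHz half-rate clock and a 4 GHz quadrature clock both yield 16 Gb/s; the function name is illustrative.

```python
REF_CLOCK_GHZ = 0.5   # 500 MHz reference clock


def serial_rate_gbps(clock_ghz: float, samples_per_cycle: int) -> float:
    """Data rate when each clock cycle launches or samples several bits."""
    return clock_ghz * samples_per_cycle


pll_multiplier = 8.0 / REF_CLOCK_GHZ        # reference to 8 GHz PLL output
controller_rate = serial_rate_gbps(8.0, 2)  # half rate: two bits per cycle
dram_rate = serial_rate_gbps(4.0, 4)        # quadrature: I, Q, Ib, Qb

print(pll_multiplier, controller_rate, dram_rate)
```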
16 Gb/s operation in the DRAM is challenging even with quadrature rate operation. Of primary concern is PSIJ which is proportional to the delay of the circuit, and is not improved by multiphase clocking. Since the relatively slow process leads to a large circuit delay, PSIJ is more severe on the DRAM. It is not just the clock tree that is of concern, but also the clocked TX and RX circuits that are closest to the pad. Variations in the clk-q delay of the pulser as well as the aperture time of the RX sampler equally hurt the performance of the link. These blocks are implemented in CMOS due to headroom limitations and better power efficiency than their CML counterparts. However, this comes at the cost of higher PSIJ, which is addressed by supply regulation of the front-end datapath circuits and local (bit slice level) clocking circuits, as shown in Fig. 7 . This approach maintains a net power advantage over a full CML datapath implementation, even with the regulator overhead.
The primary challenge of the voltage regulator design is maintaining a stable output voltage despite the data-dependent transient load currents of the CMOS datapath circuits. A voltage regulator with a replica load has been shown to provide power and area efficient suppression of high frequency transient noise from the external supply [13]. We extend the idea by using a data-dependent replica (Fig. 8) that has a transient load current proportional to that of the front-end, providing fast feed-forward regulation of transient load currents. When a transient load step occurs, the small capacitance of the replica circuit allows the change in VREP to be detected very quickly. The amplifier can then quickly adjust the PMOS gate voltage to match the new load condition before any significant change occurs in the output voltage VREG. The regulated load includes local clock buffers, receive samplers, transmit pulser, and several flip-flops. The replica load is implemented with simple CMOS inverters which receive the same clock and data signals as the actual load, but are scaled to consume one quarter of the load current. The regulator is designed to power a 36 mA maximum load current with 300 MHz closed loop bandwidth and an output capacitance of 100 pF. This regulator reduces the capacitance required to achieve a given transient load ripple by a factor of 4. The simulated worst case PSRR over all frequencies is 16 dB.
C. Duty-Cycle and Quadrature Correction
High frequency jitter can be severely amplified by lossy channels [5]. Hence, both duty cycle error on the Controller and quadrature phase error on the DRAM must be minimized. These are achieved by using a duty cycle corrector (DCC) and a quadrature error corrector (QEC), respectively, as shown in Fig. 9(a) and (b).
Similar to the phase mixer on the Controller, a single digitally controlled DCC is time-multiplexed between the TX and the RX clocks. The clock duty cycle is adjusted toward its desired 50% value by using a duty cycle correction loop. This feedback loop, depicted in Fig. 9(a), consists of the DCC to adjust clock duty cycle, a MUX to select the clock signal to be calibrated, an integrator followed by a comparator to detect the duty cycle error polarity, and an FSM that accordingly increments or decrements the DCC control code.
On the DRAM PHY, the accumulated quadrature phase errors from mismatches in the global distribution, CML-to-CMOS conversion, and local clock buffers are corrected by the feedback loop shown in Fig. 9(b). Quadrature error is detected near the front-end circuits by a symmetric CML XOR transconductor followed by an integrator. The integrator output is then sensed by a comparator whose output is accumulated by an FSM to adjust switched capacitor loads. The resolution is 20 mUI (1.25 ps at 16 Gb/s) and the range is 0.07 UI (4.5 ps). Using this scheme, quadrature error is limited by local device mismatches in the quadrature detector itself.
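Both correction loops described above are bang-bang loops: a comparator reports only the sign of the error, and an FSM steps a digital code up or down by one. The sketch below models the duty cycle loop in that spirit. The code range, the step size, and the linear duty-versus-code model are hypothetical and stand in for the real DCC and integrator.

```python
def calibrate_dcc(duty_at_code, code=0, code_min=-16, code_max=15, iterations=64):
    """Bang-bang calibration: step the code by one per comparator decision."""
    history = []
    for _ in range(iterations):
        duty_too_high = duty_at_code(code) > 0.5   # comparator output
        code += -1 if duty_too_high else 1         # FSM decrement or increment
        code = max(code_min, min(code_max, code))  # saturate at the range limits
        history.append(code)
    return code, history


def example_duty(code: int) -> float:
    """HYPOTHETICAL DCC model: 46.2% duty at code 0, 0.5% per code step."""
    return 0.462 + 0.005 * code


final_code, trace = calibrate_dcc(example_duty)
print(f"final code {final_code}, duty {example_duty(final_code):.3f}")
print("last codes:", trace[-4:])
```

Once locked, such a loop dithers between the two codes that bracket 50%, so the residual error is bounded by the step size of the corrector.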
V. MEASUREMENTS
The testchip was fabricated in TSMC 65 nm G+ technology. Fig. 10 shows the cell photomicrographs of a Controller interface with 4 C/A and 32 DQ links as well as a DRAM interface with 1 C/A and 16 DQ links. The extra C/A links on the Controller were routed off the board via SMP connectors to provide visibility of various analog/mixed signal components (e.g., TX-FIR, PLL). Each Controller byte has one LC PLL, eight bit slices, and a dedicated place and route (PnR) logic block that includes pattern generators and checkers, timing margin measurement logic, termination calibration, phase mixer control, etc. On the DRAM interface, the PLL on the left provides high speed clocks to 8 neighboring DQ slices, while the PLL on the right provides clocks to the other 8 DQ slices as well as to the receiver in the C/A cell for command requests. The DRAM C/A cell also includes associated logic for command processing. Memory reads and writes are targeted to 512 kbits of on-chip SRAM to allow interface testing with various read and write activity patterns. Note that the C/A slice is placed to the side to emulate the signal routing and timing requirements for a 32-bit interface with C/A in the center, which would be approximately 8 mm wide.
The measured jitter performance of the Controller LC PLL, Controller TX output, and DRAM TX output is shown in Fig. 11. The measured random jitter (RJ) of the Controller LC-PLL output routed directly to pads is 318 fs rms and total jitter (TJ) is 7.7 ps at a BER of 1e-12. When transmitting a PRBS pattern at 16 Gb/s, the Controller TX waveform has RJ of 810 fs and TJ of 24.36 ps. Apart from the RJ contribution, the TJ increases because of power supply induced jitter (PSIJ) in the PLL/clock tree and transmitter, crosstalk from adjacent bit slices, residual ISI, and reference feed-through. The DRAM TX output has a measured RJ of 378 fs and a TJ of 8.2 ps when transmitting a clock pattern. The DRAM TX has less jitter than the Controller TX since the LC PLL drives the TX directly rather than passing through another ring-based PLL. The ring-based PLL provides flexibility to the Controller but increases the RJ and PSIJ.
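One common way to relate RJ and TJ figures like those above is the dual-Dirac model, in which TJ at a given BER equals a deterministic part plus 2Q times the rms random jitter. The paper does not state how its TJ values were obtained, so the sketch below is only an illustrative decomposition under that assumed model, with Q defined by BER = 0.5 erfc(Q / sqrt(2)).

```python
import math


def q_for_ber(ber: float) -> float:
    """Solve 0.5 * erfc(Q / sqrt(2)) = ber for Q by bisection."""
    lo, hi = 0.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * math.erfc(mid / math.sqrt(2.0)) > ber:
            lo = mid   # tail probability still too large: Q must grow
        else:
            hi = mid
    return 0.5 * (lo + hi)


def dj_from_tj(tj_ps: float, rj_rms_ps: float, ber: float = 1e-12) -> float:
    """Deterministic jitter implied by TJ = DJ + 2 * Q * RJ (dual-Dirac)."""
    return tj_ps - 2.0 * q_for_ber(ber) * rj_rms_ps


# Measured values quoted in the text (ps): (TJ, RJ rms).
measurements = {
    "Controller LC PLL": (7.7, 0.318),
    "Controller TX": (24.36, 0.810),
    "DRAM TX": (8.2, 0.378),
}
for name, (tj, rj) in measurements.items():
    print(f"{name}: implied DJ = {dj_from_tj(tj, rj):.2f} ps")
```

Under this model, a large share of the Controller TX total jitter is attributed to deterministic sources, in line with the contributors listed in the text (PSIJ, crosstalk, residual ISI, and reference feed-through).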
The effect of jitter amplification and the need for duty cycle and quadrature (IQ) correction is shown in Fig. 12. The timing margin of the DRAM TX eye with a PRBS pattern increases by 10 ps with only a 3.75 ps clock correction. Thus, the loss in this channel, as evidenced by the significant amount of ISI, more than doubles the impact of the quadrature error on the eye size.
Fig. 13 shows measured BER bathtub curves for both write and read directions at 16 Gb/s with a PRBS 2^11 - 1 data pattern. The measurement was taken with optimized TX/RX equalization coefficients and calibrated settings for RX voltage offset (both devices), for DCC (Controller), and for IQ correction (DRAM). Positive eye openings of 0.19 UI are observed at a BER of 1e-12 in both directions.
Fig. 14 shows the 16 Gb/s on-chip eyes measured using the asymmetric link diagnostic feature for both the write and read directions. This in situ measurement shows the effective eye of the system including the complete effects of the channel, equalization, jitter, and all TX and RX circuit limitations. Although the transmit swings are cut in half for these tests as mentioned earlier, both directions show timing margin greater than 0.3 UI and voltage margin of 25 mV at a BER of 1e-9.
To test the robustness of the memory interface, the timing margin is measured across a byte. As shown in Fig. 15, both the write and read operations have timing margin greater than 0.5 UI at a BER of 1e-3. The effect of bus turnaround (the act of powering on and off the TX or RX between transactions), which results in the worst case supply noise condition, is included in these measurements.
Fig. 16 shows the breakdown of the power for both devices. The power efficiency of the interface cells is 13 mW/Gb/s for the Controller and 8 mW/Gb/s for the DRAM. In both cases, clock related power is greater than 40% and is the most significant contributor, while the TX and RX combined account for about 25% of the power. Finally, Table I summarizes the achieved performance of the Controller and DRAM interfaces.
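Two of the numbers in this section can be turned into derived quantities with one line of arithmetic each: the power per link implied by the mW/Gb/s figures, and the jitter amplification factor implied by the IQ correction result. The sketch below computes both; the function names are illustrative.

```python
LINK_RATE_GBPS = 16.0


def link_power_mw(efficiency_mw_per_gbps: float,
                  rate_gbps: float = LINK_RATE_GBPS) -> float:
    """Power per link implied by an efficiency figure in mW/Gb/s."""
    return efficiency_mw_per_gbps * rate_gbps


def jitter_amplification(margin_gain_ps: float, correction_ps: float) -> float:
    """Timing margin recovered per unit of clock phase correction."""
    return margin_gain_ps / correction_ps


controller_mw = link_power_mw(13.0)
dram_mw = link_power_mw(8.0)
amplification = jitter_amplification(10.0, 3.75)

print(f"Controller: {controller_mw:.0f} mW/link, DRAM: {dram_mw:.0f} mW/link")
print(f"jitter amplification: {amplification:.2f}x")
```

An amplification factor above 2 matches the statement that the channel loss more than doubles the impact of quadrature error on the eye.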
VI. CONCLUSION
A 16 Gb/s/link memory interface has been demonstrated. An asymmetric memory architecture and circuit innovations allow the testchip to overcome the primary challenges of on and off chip bandwidth limitations while achieving 13 mW/Gb/s and 8 mW/Gb/s at the Controller and emulated DRAM, respectively. The jitter in the system was optimized using LC PLLs, duty cycle and quadrature error correction circuits, and a data-dependent regulator to reduce PSIJ. The technologies demonstrate the feasibility of a 1 TB/s memory system in the 2010-2012 timeframe.
He is currently a Principal Engineer at Rambus Inc. Since joining Rambus in 2004, he has worked on equalization and mixed-signal circuit design for a variety of high-speed serial links and memory interfaces.
In the past, he was with IBM, Hewlett-Packard Company, and Agilent Technologies, where he worked on design and electrical characterization of advanced multilayer packages and analog and RF circuit simulation tools. He is currently with Rambus as a Senior Principal Engineer responsible for signal integrity of high-speed systems. His current interests include signals and systems in general and efficient simulation and optimization of deterministic and stochastic systems in high-speed links in particular.
Nhat Nguyen
Simon Li received the B.S. degree in electrical engineering from the University of California at Berkeley in 1992.
He is a Principal Design Engineer at Rambus Inc. His current interests include high-speed and low-power digital design. 
Reza Navid
