A Highly Expandable Forwarded-Clock Receiver with
Ultra-Slim Data Lane using Skew Calibration by Multi-Phase Edge Monitoring
Byoung-Joo Yoo, Ho-Young Song, Han-Kyu Chi, Woo-Rham Bae, and Deog-Kyoon Jeong
Abstract-A source-synchronous receiver based on a delay-locked loop is presented. It employs a shared global calibration control between channels, yet achieves channel expandability for high aggregate I/O bandwidth. The global calibration control accomplishes skew calibration, equalizer adaptation, and phase lock of all the channels in a calibration period, resulting in reduced hardware overhead and area for each data lane. In addition, the weight-adjusted dual-interpolating delay cell, which is used in the multiphase DLL, guarantees sufficient phase linearity without using dummy delay cells, while offering high-frequency operation. The proposed receiver is designed in a 90-nm CMOS technology, and achieves error-free eye openings of more than 0.5 UI across 9-28 inch Nelco4000-6 microstrips at 4-7 Gb/s and more than 0.42 UI at data rates of up to 9 Gb/s. The data lane occupies only 0.152 mm² and consumes 69.8 mW, while the rest of the receiver occupies 0.297 mm² and consumes 56.0 mW at a 7-Gb/s data rate and a supply voltage of 1.35 V.
Index Terms-Serial link, source-synchronous link, receiver (RX), delay-locked loop (DLL), continuous-time linear equalizer (CTLE), dual-input interpolating delay cell
I. INTRODUCTION
Over the past decade, dramatic progress in CMOS scaling has increased system complexity and has consequently necessitated high-speed, low-cost I/O circuits. The explosive growth in demand for higher aggregate I/O bandwidth in chip-to-chip communication has forced designers to increase the per-pin data rate and to expand the number of data channels.
Various techniques have been adopted to increase the per-pin data rate in a system, such as multi-level signaling [1], per-pin skew compensation [2], channel impedance matching [3], channel equalization [4], and so forth. Increasing the number of channels in parallel interfaces [5] or wide I/Os [6] is a straightforward way to achieve high data-transmission rates. However, those methods usually involve a trade-off between performance and cost. A simple architecture in each data lane is desired, while accurate clock recovery in the receiver is required for a low bit error rate. To meet these demands, many transceiver architectures for serial-data communication have been reported in the literature, including source-synchronous (forwarded-clock) transceivers [7] and embedded-clock transceivers [8]. A source-synchronous clocking architecture is a simple solution to increase aggregate I/O bandwidth while reducing channel complexity. When the paths and circuits of its shared clock are ideally matched to those of the data, the impact of transmit-induced jitter in the receiver can be minimized. Moreover, the cost and power overhead are amortized across multiple links in the system [9]. On the other hand, an embedded-clock link, although it requires no dedicated clock channel, incurs high hardware complexity and is unsuitable for multi-channel links. Since clock recovery must be done in each channel independently, samplers, multiphase generators, duty-cycle correctors, and many other functions must be included in every channel. In addition, any skew between channel propagation delays requires local de-skewing circuits.
In this paper, a source-synchronous receiver that employs a global calibration control logic for minimizing channel complexity is presented. The presented receiver has a simple data link channel, which consists of only a continuous-time linear equalizer, two half-rate samplers, and a 64-to-2 deserializer. With this simple data lane, only two half-rate clocks to sample at the center of incoming data are generated by a multiphase clock generator, and are distributed to each data lane. All of the functions to adjust the linear equalizer and to find the center of data are performed in the global calibration control logic implemented externally with an FPGA in the prototype chip. This low channel complexity meets the needs of good channel scalability and cost-power optimization.
The design considerations and motivations of this work are described in Section II. The overall receiver architecture and details of the circuit implementation are presented in Sections III and IV. Section V shows measurement results for the test chip, and finally, Section VI concludes this paper.
II. DESIGN CONSIDERATION
To understand its advantages and disadvantages and to motivate a new architecture, a conventional source-synchronous architecture is shown in Fig. 1. The receiver is divided into a forwarded-clock lane and a set of identical data lanes. In contrast to the simple forwarded-clock lane, each data lane consists of an equalizer with its adaptation logic, edge and data slicers for 2x-binary phase detection, deskewing logic to calibrate the channel skew, a multiphase clock generator, a duty-cycle corrector, and a deserializer for serial-to-parallel conversion of the recovered data. The forwarded-clock lane either uses a delay-locked loop (DLL) or phase-locked loop (PLL) to generate and send the global clock to the data lanes, or sends the forwarded clock directly to the data lanes through the clock distribution circuit. Since multiphase clock distribution requires many highly capacitive interconnection lines and their buffers, it consumes much power and occupies a large area. Thus, most source-synchronous architectures use a single global clock in either single-ended or differential form. In each data lane, a DLL or PLL converts the distributed clock into multiphase clocks, one of which is picked for canceling the channel skew and sampling the received data [10, 11]. In this architecture, each data lane requires a multiphase generator, a duty-cycle corrector, and a de-skewing circuit.
With so many functions and such high structural complexity in each data channel, expanding the number of channels is impractical because of the high power and large area. If a receiver adopts a sub-rate sampling method to alleviate the speed budget, the number of multiphase clocks and samplers grows in inverse proportion to the sampling rate. Therefore, a multichannel receiver calls for an alternative method that reduces the hardware complexity in each channel, probably at the cost of increased complexity in the central forwarded-clock channel.
The design of a simple and effective equalizer is important for performance. A continuous-time linear equalizer (CTLE) and/or a decision feedback equalizer (DFE) is commonly used to compensate for the channel loss in the receiver [12]. However, there is a trade-off between accurate equalization and power consumption: a complex adaptation algorithm for the DFE and CTLE coefficients extends the bandwidth and thereby reduces the bit error rate (BER), but increases the power consumption and chip area. Therefore, a simple and fast equalizer is essential to achieve the required bandwidth. In addition to equalization, the method of channel skew calibration can affect the system performance represented by BER. Background calibration is desirable because it responds to changes in the channel characteristics due to PVT variations. However, it requires a multiphase clock generator or variable delay line (VDL) in each channel. Also, the additional de-skewing loop should have a very low bandwidth to avoid interaction with a global clock generator or a duty-cycle corrector. Therefore, the calibration time depends not only on the bandwidth of the skew compensation loop, but also on the bandwidths of the other loops. If a receiver is allowed a periodic idle time when data are not sent, a training sequence can be used to compensate for the channel skew during the idle time. A fixed training pattern sent during the idle time can compensate for the channel skew more effectively and also speeds up adaptation of the equalizer coefficient.
III. ARCHITECTURE
Fig. 2 shows the architecture of the proposed source-synchronous receiver. The receiver can be divided into four parts: data lanes, a clock lane, a multiphase generator, and a global calibration control logic. Each data lane compensates for the channel loss of the incoming serial data stream and converts it to parallel data.
The clock lane receives the incoming half-rate forwarded clock, and the DLL generates multiphase clocks and distributes them as a global clock. With the globally distributed multiphase clocks, a phase rotator provides the samplers with two half-rate clocks, which are aligned to the center of the data eye. The equalization and phase rotation in each data lane are digitally processed by the global calibration control logic. Note that the clock lane has the same equalizer and phase rotator as the data lane, and they are used to track the PVT variations. As shown in Fig. 3, each data lane consists of a CTLE, two strong-arm type samplers, and a serial-to-parallel converter to communicate with an external host. The chip includes a high-speed driver to monitor the waveform after equalization with an oscilloscope. This driver uses an inductive-peaking technique to enhance the bandwidth. In SPICE simulation, the inductively-peaked current-mode driver offers more than 10-GHz bandwidth under all PVT variations. The data stream from the differential transmission lines is fed to the CTLE, which consists of a two-stage amplifier. The CTLE boosts high-frequency components of the incoming data using source-degeneration resistors and their bypass capacitors, and adapts to the channel attenuation with digitally-controlled capacitance. After equalization by the CTLE, its output is buffered and sampled for decision by two slicers. Finally, the recovered 2-bit data are converted to 64-bit parallel data by a deserializer, and are fed to the global calibration logic for processing as well as being forwarded to the external host. To alleviate the requirements of the high-bandwidth buffer and DLL, a half-rate clocking scheme is used in the receiver. In that scheme, the two slicers are used for sampling data and edge, respectively, in the calibration period, whereas both clocks are used for sampling even and odd data in the normal mode, as shown in Fig. 4. The calibration-mode operation resembles a full-rate 2x-oversampling CDR, but with a much reduced timing requirement. During the calibration mode, a full-rate scheme is used for a half-rate training pattern as shown in Fig. 4(a) and, when switched to the normal operation mode, a half-rate scheme is used on the full-rate input as shown in Fig. 4(b). The correct sampling time is determined by estimating and cancelling two kinds of skew: the on-chip skew due to the equalizer, clock generator, and clock distribution path, and the skew among transmission lines. The two are cancelled in different ways.
The clock lane delivers the incoming half-rate clock to the multiphase clock generator, and compensates for the delay variation of the clock path that is caused by the PVT variations. Except for the finite-state machine (FSM), the clock lane is a replica of the data lane, to reduce the mismatch between the data path and clock path. The FSM generates compensation codes for the delay variation, which is caused by the clock distribution from the multiphase DLL to the sampler of individual channels. The distributed clock is compared with the incoming clock at the samplers of the clock lane. Then, the FSM increases or decreases the phase control code of the phase rotator (PR) with hysteresis, according to integrated up and down signals.
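The hysteresis behavior of the clock-lane FSM can be illustrated with a minimal sketch. It assumes that the up and down signals are integrated in a signed accumulator and that the PR code moves by one step only when the accumulator reaches a threshold; the threshold value, the function name `update_phase_code`, and the wrap-around at 80 steps are illustrative assumptions, not details given in the text.

```python
def update_phase_code(code, votes, threshold=4, steps=80):
    """Return the PR phase code after integrating a stream of UP/DOWN votes.

    code      -- starting phase control code (0 .. steps-1)
    votes     -- iterable of +1 (UP) or -1 (DOWN) decisions from the samplers
    threshold -- assumed hysteresis width; the code moves only when the
                 integrated votes reach +threshold or -threshold
    steps     -- number of phase steps; the code wraps around (assumption)
    """
    acc = 0
    for v in votes:
        acc += v
        if acc >= threshold:        # enough net UP votes: advance one step
            code = (code + 1) % steps
            acc = 0
        elif acc <= -threshold:     # enough net DOWN votes: retreat one step
            code = (code - 1) % steps
            acc = 0
    return code


# Alternating votes never reach the threshold, so the code stays put.
print(update_phase_code(10, [1, -1] * 10))
# Eight consecutive UP votes move the code by two steps.
print(update_phase_code(10, [1] * 8))
```

With this kind of integration, sampler decisions that dither around the lock point do not toggle the phase code, which is the purpose of the hysteresis.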
The clock received by the clock lane is distributed to each data lane by a multiphase clock generator. It has a dual-loop architecture that consists of an analog DLL to generate multiphase clocks, and a digital DLL to compensate for channel skew. The analog DLL generates 10 equally spaced clock phases, and the PR, which is a sub-block of the digital DLL, selects the data sampling clock by phase interpolation between two adjacent incoming clocks. Then, the two selected half-rate clocks are delivered to the samplers of each channel through the clock distribution circuit with buffers. With the analog and digital DLLs, the total number of controllable phase steps is 80. The CTLEs, PRs, and DLL of all lanes are controlled by the global calibration control logic, which monitors the recovered data and clock in the data lanes. The global calibration control logic consists of registers for phase and equalizer control, and a digital calibration logic that performs the following three functions: phase lock, channel skew calibration, and equalizer adaptation. These functions are performed by iteration and memorization within an appointed calibration period. When the equalizer coefficient and phase control are changed, the sampled and deserialized data in the data lane are monitored and stored in the digital calibration control logic, and the logic builds a table to find the optimized results. The changes are repeated until the calibration logic finds the optimized equalizer coefficients and the accurate phase of the edge crossing, which are then stored in the CTLE register and the PR register, respectively. In the global calibration control logic, the accuracy of skew compensation and equalizer adaptation depends on the phase resolution of the sampling clocks.
Since most of the equalizer adaptation results and the skew compensation results are memorized and processed in units of the minimum phase step, the receiver needs clocks with fine phase resolution. To achieve high resolution, a large number of delay cells can be used in the design of a DLL. However, a large number of delay cells, including dummy cells, consumes too much power and occupies a large area in the DLL. In the proposed DLL, a weight-adjusted dual-input interpolating delay cell is used to reduce the number of dummy delay cells [13].
With this global calibration control logic, the proposed architecture benefits from reduced hardware overhead. Because the global calibration logic finds the edge of the data and programs the PR register during the calibration period, the number of channel slicers is only half that of the conventional architecture. In addition, the PRs are located close to the global DLL, so that just two half-rate clocks are distributed to each channel's samplers, reducing the buffer power and area.
The proposed receiver was designed for high-speed multi-channel links, such as wide-I/O memory interfaces, that require low channel complexity and small clock distribution circuits. The designed CTLE achieves a gain in the range of 9 dB to 25 dB at 3.5 GHz, the Nyquist frequency at 7 Gb/s, for various channels, such as Nelco4000-6 microstrips and RG58 coaxial cables. The multiphase generator, which consists of the DLL and PRs, provides 80-step multiphase clocks for two UIs.
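As a quick check of what 80 steps over two UIs means in time, the ideal phase step can be computed from the data rate. The helper below is a sketch; the function name is hypothetical, and the only inputs taken from the text are the 80 steps and the two-UI span.

```python
def phase_step_ps(data_rate_gbps, steps=80, span_ui=2):
    """Ideal phase step in picoseconds for `steps` steps spanning `span_ui` UIs."""
    ui_ps = 1000.0 / data_rate_gbps   # one unit interval in ps
    return span_ui * ui_ps / steps


for rate in (4.0, 5.0, 7.0, 9.0):
    print(f"{rate:.0f} Gb/s: {phase_step_ps(rate):.2f} ps per step")
```

At 7 Gb/s the two-UI span equals one period of the 3.5-GHz half-rate clock, so the ideal step is that period divided by 80.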
IV. CIRCUIT DESCRIPTION
1. Analog Multi-Phase DLL
As described in the introduction, the forwarded-clock architecture keeps the same transmit-induced jitter among channels. The correlated jitter should be tracked by a clock recovery circuit in the receiver. There are two general clock generation schemes in forwarded-clock architectures: the PLL and the DLL. Since the PLL has a low-pass jitter transfer characteristic, high-frequency jitter beyond the cut-off frequency is filtered out. Therefore, its recovered clock has superior jitter performance when compared to the DLL. However, an excessively low loop bandwidth degrades the tracking of data-correlated jitter, and the bit error rate (BER) can eventually worsen. Unlike the PLL, the DLL has an all-pass jitter transfer characteristic, so the internal phase error is not accumulated and the sampling clock can be aligned more precisely at the center of the data eye. Therefore, the DLL-based source-synchronous architecture is a better solution for achieving a high aggregate bandwidth.
The operating frequency of the DLL is often limited by the speed of the phase detector (PD) and the minimum delay of the voltage-controlled delay line (VCDL). In the proposed DLL, the VCDL consists of dual-input interpolating delay cells (DIDCs) to achieve a delay time shorter than an inverter delay. In addition, a divide-by-4 clock is used in the PD to alleviate its speed budget [13]. Fig. 5(a) shows the high-speed analog multiphase DLL that employs the 10-stage DIDC. The forwarded clock from the CTLE is transmitted to both the VCDL and the frequency divider. Then the divided clock is compared with the delayed and divided VCDL clock by the bang-bang phase detector (BBPD). Its comparison result (UP or DOWN) changes the control voltage Vc through the charge pump and the filter capacitor, and thereby controls the delay of the VCDL.
To prevent the harmonic-lock problem, a CKF control block is used to select the phase of CKF, as shown in Fig. 5(b). After frequency division, the phase difference between CKR and CKF could be more than one period of the forwarded clock, resulting in a harmonic lock. When the DLL is in the reset state, the control voltage Vc of the VCDL is pulled up to VDD to minimize the delay.
Then, the CKF control block samples all of the 4-phase clocks CKF<0:3> with CKR<0>, which has a fixed phase. Using the sampled pattern S<0:3>, the CKF control block selects the one of the 4-phase clocks whose phase is closest to and leading that of CKR<0>. For example, if the value of CKF<0:3> sampled with CKR<0> is '1100', as in case 2, the selected CKF is CKF<1>. The selected clock guarantees that the phase difference is less than a single period of the forwarded clock (T_ref) and that the startup direction of phase detection is down; consequently, harmonic lock is prevented.
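The selection rule can be sketched in a few lines. The only data point given in the text is that the sampled pattern '1100' selects CKF<1>; the rule used below, picking the last phase that samples high before the pattern falls to '0', is an assumption that reproduces this example and presumes four evenly spaced phases with roughly 50% duty cycle. The function name `select_ckf` is hypothetical.

```python
def select_ckf(sampled):
    """Pick the CKF index from the sampled pattern S<0:3>, e.g. '1100'.

    Assumed rule: a phase that samples '1' has already risen (it leads
    CKR<0>), so the closest leading phase is the '1' that is immediately
    followed, cyclically, by a '0'.
    """
    n = len(sampled)
    for i in range(n):
        if sampled[i] == "1" and sampled[(i + 1) % n] == "0":
            return i
    raise ValueError("no 1-to-0 boundary in sampled pattern")


for pattern in ("1100", "0110", "0011", "1001"):
    print(pattern, "-> CKF<%d>" % select_ckf(pattern))
```

The wrap-around case '1001' is handled by treating the four phases as a ring, which is also an assumption about how the control block is built.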
Dual-Input Interpolating Delay Cells
Under an operating frequency of a few gigahertz, the multiphase resolution of a clock generator is limited by the minimum delay of its cells. To achieve a small delay, several techniques are employed in voltage-controlled oscillators (VCOs), such as the negative skewed delay scheme and the dual-path delay cell [15] [16] [17]. However, those methods cannot be easily applied to the VCDL because of delay mismatches among the delay cells [18]. To reduce the delay difference between the early stages and the following stages in the chain of delay cells, a few dummy delay cells are typically used until the operating conditions are equalized. However, the delay cell is the most power-hungry component in the multiphase DLL, so a large number of dummy delay cells presents a critical problem. Fig. 6 shows the basic principle and scheme of the proposed weight-adjusted DIDC chain, which effectively reduces the number of dummy delay cells in the VCDL. The delay mechanism of the dual-input interpolating delay cells can be explained by phase interpolation, as illustrated in Fig. 6(a). If the phase interpolator (PI) has an intrinsic delay of $D_0$ and the weighting ratio of its two inputs, $CLK_{n-2}$ and $CLK_{n-1}$, is set to M:1, the delay difference $\Delta D_n$ between the interpolated output and $CLK_{n-1}$ is given as

$$\Delta D_n = \frac{D_{n-1}}{M+1}, \tag{1}$$

where $D_{n-1}$ is the time difference between the two inputs, which equals the delay of the previous stage in the DIDC. Then, the effective delay $D_n$ between $CLK_{n-1}$ and $CLK_n$ can be written as

$$D_n = D_0 - \Delta D_n = D_0 - \frac{D_{n-1}}{M+1}. \tag{2}$$

This recursive equation shows that the delay of each cell keeps changing as the clock propagates through the delay line, and that the delay of the DIDC converges to a specific value as the number of delay cells approaches infinity:

$$D_\infty = \lim_{n \to \infty} D_n = \frac{M+1}{M+2} D_0. \tag{3}$$
Many dummy cells are required for all cells to have an equal delay time. For example, if the weighting ratio M is 2, at least 5 dummy cells are required to keep the delay difference among the cells within less than 1%.
If the interpolation weight of a few delay cells is adjusted to a value different from that of the other delay cells, the number of required dummy delay cells can be drastically reduced. Fig. 6(b) shows an example of the weight-adjusted DIDC in which the interpolating ratio of the second delay cell is M+1:1, unlike the M:1 ratio of the others. As shown in (2), if the delay of the first stage is fixed at $D_0$, the delay of the second stage can be expressed by

$$D_2 = D_0 - \frac{D_0}{M+2} = \frac{M+1}{M+2} D_0. \tag{4}$$

This is identical to the converged value of (3). With the recursive relation shown in (2), the delays of all subsequent cells are the same as that of the second delay cell. Therefore, the delay of the weight-adjusted DIDC settles at $\frac{M+1}{M+2} D_0$ from the second delay cell onward, regardless of the interpolating ratio M, and only one dummy cell is required, as shown in Fig. 6(c).
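The convergence argument can be checked numerically. The sketch below assumes the recursion $D_n = D_0 - D_{n-1}/(M+1)$ with the first stage fixed at $D_0$, which is the form consistent with the converged value $(M+1)/(M+2) \cdot D_0$ stated in the text; the function name and the normalization $D_0 = 1$ are illustrative choices.

```python
def didc_delays(m, stages, d0=1.0, second_ratio=None):
    """Per-stage delays of a DIDC chain, normalized to the intrinsic delay d0.

    m            -- interpolation weight ratio M:1 of a regular cell
    stages       -- number of delay cells to evaluate
    second_ratio -- if given, the weight ratio used only in the second cell
                    (M+1 for the weight-adjusted chain)
    Assumed recursion: D_n = d0 - D_{n-1} / (ratio + 1), with D_1 = d0.
    """
    delays = [d0]
    for n in range(2, stages + 1):
        ratio = second_ratio if (n == 2 and second_ratio is not None) else m
        delays.append(d0 - delays[-1] / (ratio + 1))
    return delays


M = 2
plain = didc_delays(M, 8)
adjusted = didc_delays(M, 8, second_ratio=M + 1)
target = (M + 1) / (M + 2)

print("target  :", target)
print("plain   :", [round(d, 4) for d in plain])
print("adjusted:", [round(d, 4) for d in adjusted])
```

In the plain chain the deviation from the target shrinks by a factor of M+1 per stage, so several leading cells are unusable. In the weight-adjusted chain every cell after the first already sits at the target value.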
Interleaved Data Sampling without DCC
To alleviate the speed budget of the logic in the data lane, a half-rate clocking scheme is commonly used. The half-rate clocking scheme needs two sampling circuits to alternately sample the data at the rising and falling edges. However, interleaved data sampling, which uses the rising and falling edges of a clock, often requires a duty-cycle corrector (DCC), thus increasing the design complexity of each lane.
To remove the DCC from the data lane, an alternative interleaved data sampling scheme is implemented, as shown in Fig. 7. Each sampler is triggered by a different clock; the phases of these clocks are controlled by individual PRs and are ideally spaced 180 degrees apart in the data sampling mode. If the outputs of the equalizer or the transmitted global clocks have duty-cycle distortion, the individual skew calibration codes correct the error. That is, CLK_even and CLK_odd can each be located at the center of the data eye without any DCC, even when the duty cycle of the global clock is distorted. In addition, these separate clock control circuits can be used to detect the edge of the incoming data during the calibration period at power-up. If CLK_even and CLK_odd are placed at adjacent phases, the results of sampling with those clocks differ around the data edge. In particular, when the data pattern is pre-programmed to be free of inter-symbol interference (ISI), such as '101010', accurate edge information can be detected. In the proposed receiver, this edge detection method is used to calibrate the channel skew, to adapt the equalizer, and to align the sampling clocks to the center of the data.
Cherry-Hooper Continuous-time Linear Equalizer
A continuous-time linear equalizer is a very simple and effective solution to boost the high-frequency signal of the incoming data. However, its limited bandwidth is a bottleneck for the performance of the whole system. There are several methods to increase the bandwidth of a linear equalizer, such as inductive peaking and a negative capacitance technique [12]. However, those techniques are not suitable for a multi-channel receiver, because the passive devices they require occupy a large area. Fig. 8(a) shows the implemented linear equalizer exploiting the Cherry-Hooper broadband technique. The Cherry-Hooper technique incorporates local feedback in the drain-to-gate network to improve the speed [19]. The implemented linear equalizer consists of two cascaded common-source amplifiers. The first stage boosts the high-frequency signal of the incoming data, using a source degeneration resistor $R_S$ and capacitor $C_D$. These degeneration devices are controlled by the 4-bit digital code generated by the global calibration logic. Each capacitor bank has a symmetric architecture to remove the mismatch of differential outputs, and the use of parallel connection reduces the area by 1/4 in comparison with a series connection [20]. The first-stage amplifier is connected to the second common-source amplifier by a feedback resistor $R_F$. In contrast to the first stage, the second stage amplifies all frequency components within its wide bandwidth. With a high resistance $R_F$, the dominant pole of the linear equalizer is $g_{m2}/C_X$, where $g_{m2}$ is the transconductance of transistor M2, and $C_X$ is the capacitance of node X. This dominant pole is much higher than the pole $1/(R_{D1} C_X)$ of the first stage alone, thus extending the overall bandwidth of the linear equalizer.
Fig. 7. A data lane employing the interleaved data sampling architecture without DCC.
A single linear equalizer stage has a maximum gain of 13.5 dB at the Nyquist frequency of 7-Gb/s data and a 9-dB gain at DC, as shown in Fig. 8(b). To compensate for channel loss of more than 16.5 dB, each lane in the proposed receiver has a two-stage Cherry-Hooper linear equalizer that can boost from 9 dB to 25 dB at 3.5 GHz.
Equalizer Adaptation and Phase Lock Scheme
Since many widely used high-speed cables are minimum-phase-like systems [21] , the magnitude of frequency response is related to its phase response. Therefore, an optimized result of amplitude equalization almost coincides with an optimized result of phase equalization. In other words, the well-equalized phase of received data guarantees fairly large vertical eye opening. In this paper, the CTLE for group delay or phase delay minimization, which also has a minimum-phase or minimum-phase-like characteristic, is used to compensate for the loss of the minimum-phase channel.
For phase equalization, group-delay variations under various frequency conditions of the training sequence can be minimized. If the signal is transmitted through the channel with a limited bandwidth, ISI will show up, due to the fact that different signal frequency components are delayed by different amounts. Fig. 9(a) shows the channel characteristic, and the results of ideal constant group delay equalization in the frequency domain. The linear equalization at high frequency is commonly described by the gain boosting, as shown in the left side of Fig. 9(a) . However, in the minimum-phase system, channel loss and equalization can be explained by the phase distortion, as shown in the right side of Fig. 9(a) . The loss of real channel causes phase distortion above its cutoff frequency. On the other hand, the phase below the cutoff frequency is relatively linear, similar to the phase characteristic of an ideal lossless channel. Since group delay is a derivative form of phase, a relatively linear phase means constant group delay, and phase distortion means group delay variations. If ripples or variations of group delay under various frequencies can be detected, the phase after equalization can be linear, close to that of an ideal lossless channel. In the proposed EQ adaptation, we use the segmented multiphase and pre-programmed calibration sequence, the frequency of which is generated by dividing the Nyquist rate by various integers, to detect the minimum variation of group delay under the various frequency conditions. The results of group delay (or phase delay) variation, according to signal frequency in time-domain, are shown in Fig. 9(b) . When the loss of received data is not compensated for, crossing points generated from different data patterns are located in different phases, and its convolution with random noise reduces the lateral and vertical eye opening, as shown in the left side of Fig. 9(b) . 
In contrast, the crossings of gain-boosted data after adaptation using a CTLE are centered where the group delay variation is minimized, as shown in the right side of Fig. 9(b). Then, not only the lateral eye opening but also the vertical eye opening is optimized.
The whole process of equalizer adaptation and phase lock is divided into two states: a coarse-lock state and a fine-lock state, as shown in Fig. 10 . In the coarse-lock state, a CTLE control bit and a PR control bit for the group delay minimization are roughly detected using a pre-programmed training sequence such as '0101' or '00001111'. When the calibration enable signal is turned on, the DLL in the receiver is locked to the forwarded clock, and generates 80 phases for 2 UIs. Then, the global calibration logic changes the CTLE control bits from minimum to maximum under various frequency conditions of the pre-programmed pattern, as shown in Fig. 11(a) . At every control of the CTLE, positions of data transition are detected by the shift of the multiphase clock, and stored in a table, as illustrated in Fig. 11(b) . After detecting and storing the results from all parameters, the global calibration logic calculates the CTLE control bit that minimizes the phase variation of the training sequence after equalization. Then, this CTLE control bit and the phase control bit are set to coarse lock parameters.
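The coarse-lock search can be sketched as a table lookup. The example below assumes that the table maps each CTLE code to the edge phase (in PR steps) detected for each training pattern, and that "minimizing the phase variation" means minimizing the spread of those edge phases; the table contents, the spread metric, and the function name `coarse_lock` are illustrative assumptions, and phase wrap-around is ignored for simplicity.

```python
def coarse_lock(edge_table):
    """Return (ctle_code, coarse_phase) from a table of detected edge phases.

    edge_table -- {ctle_code: [edge phase per training pattern]}, where the
                  patterns have different frequencies (e.g. '0101', '00001111')
    The chosen code is the one whose edge phases are closest together,
    which is used here as a stand-in for minimum group-delay variation.
    """
    def spread(code):
        edges = edge_table[code]
        return max(edges) - min(edges)

    best_code = min(edge_table, key=spread)
    edges = edge_table[best_code]
    coarse_phase = round(sum(edges) / len(edges))
    return best_code, coarse_phase


# Hypothetical sweep result: under-equalized (code 0) and over-equalized
# (code 10) settings put the two patterns' crossings far apart.
table = {0: [10, 18], 5: [12, 14], 10: [16, 11]}
print(coarse_lock(table))
```

The returned pair corresponds to the coarse CTLE control and coarse phase control that the fine-lock state then refines over a narrow range.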
After the coarse lock, the training sequence is changed to a PRBS, and then a fine CTLE control and a fine phase control are determined and stored in the registers. Unlike the coarse-lock operation, only a few control bits around the coarse-lock values are varied to find the optimum parameters, which reduces the calibration time. In addition, the number of differing sampling results between the even and odd samplers, such as '10' or '01', is counted to scan the amount of jitter after equalization. First, the even and odd sampling clocks (CLKE and CLKO) are placed near the data edge found in the coarse-lock state, with a spacing set by the targeted amount of jitter. Then, the counting results at every fine CTLE control are stored in a table. When the counted number is maximized, the corresponding control bit is written to the CTLE register. With this fixed CTLE control, the jitter histogram of the eye diagram is statistically monitored by scanning the XOR outputs of the sampling results at CLKE and CLKO, as shown in Fig. 11(c). Then, the phase with the maximum count is selected as the accurate edge location of the equalized data. Finally, these fine CTLE control bits and fine phase control bits are stored in the CTLE and PR registers, respectively. After the entire calibration process, the two sampling clocks are located at the center of the even and odd data, respectively, without any channel skew.
Fig. 9. Comparison between the output of the channel and the result of constant group delay equalization by (a) frequency responses, and (b) eye diagrams.
Fig. 12 shows behavioral simulation results of the entire process. As shown in Fig. 12(a), the training sequence can be divided into three sections: a high-frequency training sequence, a low-frequency training sequence, and a PRBS. During the high- and low-frequency training sequences, the global calibration logic finds the coarse CTLE parameters and the location of the edge crossing using an equalizer sweep and a phase sweep. In the coarse-lock state, the global calibration logic increases and decreases the phase control by two bits at a time. When the coarse lock is done, the global calibration logic reduces the phase spacing between the two sampling clocks to monitor the eye diagram after equalization. In this fine-lock state, both the equalizer sweep and the phase sweep are finely trimmed, by one bit, over a narrow range. In the simulation results, the CTLE is finally locked to the 5th control bit out of a total of 15 steps, and the phase is locked to the 6th control bit out of a total of 80 steps. The 5-Gb/s eye diagrams after the channel (8-m RG58 coaxial cable) and after the equalizer in the fine-lock state are shown in Fig. 12(b).
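The fine-lock edge search by XOR counting can be illustrated as follows. The sketch assumes that, for each candidate phase code, a batch of bits sampled at CLKE and at CLKO is available, and that the edge is the phase where the two samplers disagree most often; the data, the function names, and the half-UI offset of 20 steps (80 steps per two UIs) used to derive the sampling point are assumptions for illustration.

```python
def find_edge_phase(samples):
    """Return (edge_phase, counts) from per-phase CLKE/CLKO sample batches.

    samples -- {phase_code: (bits_at_clke, bits_at_clko)}
    counts  -- number of XOR mismatches ('10' or '01') at each phase code
    """
    counts = {
        phase: sum(e ^ o for e, o in zip(bits_e, bits_o))
        for phase, (bits_e, bits_o) in samples.items()
    }
    edge_phase = max(counts, key=counts.get)
    return edge_phase, counts


def center_phase(edge_phase, steps=80, span_ui=2):
    """Sampling point half a UI after the edge, wrapped to the step range."""
    half_ui_steps = steps // (2 * span_ui)
    return (edge_phase + half_ui_steps) % steps


# Hypothetical sample batches around the coarse-lock phase.
samples = {
    4: ([0, 1, 0, 1], [0, 1, 0, 1]),
    5: ([0, 1, 0, 1], [1, 1, 0, 0]),
    6: ([0, 1, 0, 1], [1, 0, 1, 0]),
    7: ([1, 0, 1, 0], [1, 0, 1, 1]),
}
edge, counts = find_edge_phase(samples)
print(counts, "edge:", edge, "center:", center_phase(edge))
```

The per-phase counts play the role of the jitter histogram: its peak marks the mean crossing position, and its width indicates how much jitter remains after equalization.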
The total calibration time is about 40 µs for one channel. If a single global logic block calibrates all channels in rotation, the time increases in proportion to the number of channels. In addition, this calibration process should be repeated in every idle state or special calibration period to adapt to time-varying conditions.
V. MEASUREMENT RESULTS
The proposed receiver was implemented using the 90-nm CMOS process. To confirm the function and performance of the receiver, the BER is measured by the Agilent J-BERT N4903A serial bit-error-ratio tester. The transmitted data and clock through the channel are recovered and parallelized by the fabricated chip. Then, the parallelized data is fed to the FPGA, which performs the function of global calibration logic. Using this FPGA, the equalizer is adapted to various channel conditions, and the sampling clock is locked to the center of incoming data while compensating the channel skew. The result of equalization and multiphase clocks can be monitored by an oscilloscope. The test environment for the fabricated chip is shown in Fig. 13 .
Digital and analog buffers are used to monitor the phase resolution of the multiphase generator and the eye opening after the CTLE. The clock of the DLL was divided by 144 to alleviate the bandwidth requirement of the digital buffer. In contrast, the analog buffer uses inductive peaking to enhance the bandwidth by 10 GHz. The recovered and divided clock with the 3.5-GHz forwarded clock is shown in Fig. 14. The RMS jitter of the clock is 2.05 ps, and the peak-to-peak jitter is 16.4 ps, as shown in Fig. 14(a). However, the RMS jitter of the forwarded clock is 1.50 ps, so the jitter added by the DLL is just 0.55 ps. Fig. 14(b) shows the phase spacing of the DLL exploiting the weight-adjusted DIDC. With just 2 dummy delay cells, the DLL shows a small differential nonlinearity of less than ±0.18 LSB between its 10 phases at 3.5 GHz. However, the measured static phase offset caused by the mismatches of the BBPD and charge pump was 7.14 ps when the DLL was locked at 4.5 GHz.
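The differential nonlinearity quoted above is a standard figure that can be computed from measured phase positions. The following Python sketch shows one common way to do so, taking 1 LSB as the clock period divided by the number of phases. The function name `dnl_inl` and the 2-ps error in the usage lines are hypothetical and are not taken from the measurement.

```python
def dnl_inl(phases_ps, period_ps):
    """Return (DNL, INL) in LSB for evenly spaced multiphase clock edges.

    phases_ps: measured edge positions of the N phases, in ascending order.
    period_ps: clock period; the ideal step (1 LSB) is period_ps / N.
    """
    n = len(phases_ps)
    lsb = period_ps / n
    # DNL: deviation of each adjacent step from the ideal step.
    dnl = [(phases_ps[i + 1] - phases_ps[i]) / lsb - 1.0 for i in range(n - 1)]
    # INL: deviation of each phase from its ideal position, relative to phase 0.
    inl = [(phases_ps[i] - phases_ps[0]) / lsb - i for i in range(n)]
    return dnl, inl

# Hypothetical example: 10 phases of a 3.5-GHz clock, one edge 2 ps late.
period = 1000.0 / 3.5                  # about 285.7 ps
ideal = [i * period / 10 for i in range(10)]
measured = list(ideal)
measured[4] += 2.0                     # assumed 2-ps error on phase 4
dnl, inl = dnl_inl(measured, period)
worst_dnl = max(abs(d) for d in dnl)   # 2 ps / 28.57 ps = 0.07 LSB
```

In this made-up case a single late edge produces a positive DNL on the step before it and an equal negative DNL on the step after it, while the INL is nonzero only at the displaced phase.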
The total phase nonlinearity of the recovered clock, including the DLL and PR, is shown in Fig. 15. The measured results show a tendency for the linearity to degrade as the data rate decreases. Under various frequency conditions, the differential nonlinearity (DNL) and integral nonlinearity (INL) are within ±0.8 LSB and 0/-5 LSB, respectively. The relatively large phase-interpolation mismatch in the PR degrades the overall multiphase linearity. However, the phase resolution is fine enough to obtain the targeted BER, so this is not a serious problem. In the ideal case, the phase resolution of the 3.5-GHz clock is

Various channels, such as Nelco4000-6 microstrips and RG58 coaxial cables, are used to verify the functions of the chip. The Nelco4000-6 microstrips are contained in the Agilent J-BERT N4903A serial bit-error-ratio tester. Fig. 16 shows the measured frequency responses of those channels, whose S-parameters are measured with an Agilent E5071C ENA network analyzer. The 9-inch Nelco4000-6 microstrip has the lowest loss, about 10 dB at 4.5 GHz. At the same frequency, the 8-m RG58 cable and the 28-inch Nelco4000-6 microstrip show a loss of 22 dB. The proposed linear equalizer achieves over 20 dB of gain at 7 Gb/s. However, the loss of the SMA connectors and package requires more than 20 dB of gain at 9 Gb/s. Therefore, an eye opening at 9 Gb/s can only be achieved with the 9-inch Nelco4000-6 microstrip.

Fig. 17 shows the measured differential waveforms under various data rates and channels when a 2^12-1 PRBS pattern is injected. The left side of each figure is the eye diagram of the uncompensated channel output, and the right side is the eye diagram after compensation by the adapted CTLE. All equalization results are measured at the maximum data rate that achieves an eye opening of more than 0.5 UI, at a supply voltage of 1.35 V. As shown in Fig. 17(a), the maximum data rate with the 9-inch Nelco4000-6 microstrip, which has the lowest loss, is 9 Gb/s. In the case of the 20-inch and 28-inch microstrips, the maximum data rates are 7 Gb/s and 5 Gb/s, respectively, as shown in Fig. 17(b) and (c). With those channels, the eye diagram of the channel output is completely closed due to the channel loss. The eye openings with microstrip channels are smaller than those with coaxial cables under the same loss, as shown in Fig. 17(d) and (e), because the impedance discontinuities caused by the many stubs on the PCB distort the signal waveforms.

Since the waveforms in Fig. 17 are taken at the output of the driver, which exploits inductive peaking, the eye openings are not exactly the same as those at the equalizer output. Therefore, the BER is measured at every 0.01-UI step of the sampling phase over 1 UI. Fig. 18 shows the bathtub curves measured under various channel conditions. The eye opening of the waveform in Fig. 17(b) is consistent with the bathtub curve of the 20-inch Nelco4000-6 microstrip at 7 Gb/s. As previously mentioned, the eye openings are more than 0.5 UI in most conditions. However, the error-free span at 9 Gb/s is about 0.42 UI, unlike the waveform in Fig. 17(a). This is due to the nonlinearity of the multiphase clocks and the limited timing margin of the logic gates. Also, the phase spacing between the two half-rate clocks deviates considerably from 1 UI at 9 Gb/s, so that the BER in the vicinity of the data transitions is lower than in the other cases.
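The error-free span read from a bathtub curve is the widest contiguous range of sampling phases at which no bit errors are observed. A minimal Python sketch of that reading follows, assuming the scan is available as a list of error counts, one per 0.01-UI phase step. The function name `error_free_span` and the sample data are hypothetical, and a span that wraps around the ends of the scan is not handled.

```python
def error_free_span(error_counts, step_ui=0.01):
    """Return the widest run of consecutive zero-error phases, in UI."""
    best = run = 0
    for errors in error_counts:
        # Extend the current run on an error-free phase, otherwise reset it.
        run = run + 1 if errors == 0 else 0
        best = max(best, run)
    return best * step_ui

# Hypothetical scan over 1 UI in 0.01-UI steps: errors near both data
# transitions, no errors across 42 steps in the middle.
scan = [5] * 29 + [0] * 42 + [7] * 29
opening = error_free_span(scan)   # 42 steps * 0.01 UI
```

In practice the result also depends on how many bits are observed at each phase, since a phase can only be called error-free down to the BER floor that the observation time allows.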
The measured performance is summarized in Table 1. The highest data rate that achieves an eye opening of more than 0.42 UI is 9 Gb/s. The total power consumption of the fabricated chip at a data rate of 7 Gb/s is 125.8 mW at a supply voltage of 1.35 V. The data lane consumes 69.8 mW and the clock lane consumes 56.0 mW. Most of the power in the data lane is consumed by a CML-type buffer and a driver used to monitor the equalized data. In the clock lane, the digital buffers for signal monitoring and the clock distribution circuits are the dominant sources of power consumption. The simulated power breakdown for the 7-Gb/s data rate is given in Table 2.
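For reference, the reported power figures can be restated as energy per bit. The values below are simple arithmetic on the numbers quoted above at 7 Gb/s; the energy-per-bit figures themselves are derived here and are not stated in the measurement summary. The symbols are introduced only for this calculation.

```latex
\begin{align}
P_{\mathrm{total}} &= P_{\mathrm{data}} + P_{\mathrm{clock}}
  = 69.8~\mathrm{mW} + 56.0~\mathrm{mW} = 125.8~\mathrm{mW} \\
\frac{P_{\mathrm{data}}}{f_{\mathrm{bit}}} &= \frac{69.8~\mathrm{mW}}{7~\mathrm{Gb/s}}
  \approx 9.97~\mathrm{pJ/bit} \\
\frac{P_{\mathrm{total}}}{f_{\mathrm{bit}}} &= \frac{125.8~\mathrm{mW}}{7~\mathrm{Gb/s}}
  \approx 17.97~\mathrm{pJ/bit}
\end{align}
```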
Although the whole chip area is 670 × 670 µm², the active area of the data lane is just 330 × 460 µm². The core components of the data lane, which consist of a CTLE, samplers, and a CML buffer, amount to only 256 × 304 µm². A comparison of this work with other RX architectures is shown in Table 3. Our work outperforms the others in terms of power and area per data lane.
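As a consistency check, the dimensions above can be converted to mm² and compared with the areas quoted in the abstract (0.152 mm² for the data lane and 0.297 mm² for the rest of the receiver). The subscripts are labels introduced here, and the calculation assumes that "the rest of the receiver" means the whole chip area minus the data lane.

```latex
\begin{align}
A_{\mathrm{lane}} &= 330 \times 460~\mu\mathrm{m}^2 = 151\,800~\mu\mathrm{m}^2 \approx 0.152~\mathrm{mm}^2 \\
A_{\mathrm{chip}} &= 670 \times 670~\mu\mathrm{m}^2 = 448\,900~\mu\mathrm{m}^2 \approx 0.449~\mathrm{mm}^2 \\
A_{\mathrm{chip}} - A_{\mathrm{lane}} &\approx 0.449 - 0.152 = 0.297~\mathrm{mm}^2 \\
A_{\mathrm{core}} &= 256 \times 304~\mu\mathrm{m}^2 = 77\,824~\mu\mathrm{m}^2 \approx 0.078~\mathrm{mm}^2
\end{align}
```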
The chip microphotograph is shown in Fig. 19.

Fig. 18. Measured bathtub curve.
VI. CONCLUSIONS
A DLL-based forwarded-clock receiver for high aggregate I/O bandwidth was implemented with global calibration logic to reduce the complexity of the data lane. With the global calibration logic, the data lane of the proposed architecture consists of only a CTLE, a sampler, and a deserializer, thereby improving the channel expandability. To achieve a high DLL bandwidth, a dual-interpolating delay cell is used, and the weight-adjustment technique is applied to the delay line in order to reduce the number of dummy delay cells in the DLL. The chip was fabricated in a 90-nm CMOS process, and its functions were verified under various channel conditions. A maximum data rate of 9 Gb/s is achieved with an eye opening of more than 0.42 UI over the 9-inch Nelco4000-6 microstrip at a supply voltage of 1.35 V.
