1/8-rate clock.
I. INTRODUCTION
THE COMPUTING performance of a single chip has increased exponentially with advances in semiconductor technology. Accordingly, a matching improvement in I/O bandwidth is indispensable. High-speed serial data links provide multigigabit bandwidth with reduced system complexity and cost [1]-[3]. In particular, optical fiber links provide an efficient solution for gigabit data rates, while traditional copper links can hardly sustain increasing data rates due to their physical limitations [4], [5]. Yet the optical fiber link is still costly, and therefore the physical-layer designs of optical fiber links (e.g., optical receivers) must be based on low-cost and low-power strategies [6]-[10].
The clock and data recovery (CDR) circuit is a crucial element in optical receivers. It must extract a clean clock from the corrupted input data and regenerate clean output data with the extracted clock. For these purposes, CDR circuits have been predominantly implemented in III-V materials, Si bipolar, or SiGe HBT technologies, which provide high-speed and inherently low-noise characteristics. However, these technologies are still costly and dissipate high power [11]-[14]. Thus, CMOS technologies have become very attractive due to their low-cost, low-power, and high-integration-level characteristics [15]-[17]. These characteristics are very desirable for short-haul applications such as backplane interconnections or chip-to-chip interconnects [18]. This paper presents a 4-Gb/s CDR circuit implemented in a 0.25-μm standard CMOS technology. The proposed CDR architecture incorporates a number of circuit techniques: the 1/8-rate clock technique, a voltage-controlled oscillator (VCO) with active inductor loads and duty-cycle correction buffers, and a 1/8-rate linear phase detector (PD) that also functions as a 1:4 demultiplexer (DEMUX). The CDR circuit eliminates the need for a separate 1:4 DEMUX block, thereby achieving lower power consumption and smaller active area than the configurations in [13]-[15]. It also facilitates the integration of the high-speed CDR circuit with other digital circuitry on a chip, and thus significantly reduces the chip cost. Hence, the proposed 1/8-rate CDR circuit is suitable for short-haul optical communication applications or low-cost optical interconnects.
The architecture of the 1/8-rate CDR circuit is described in Section II. Sections III and IV explain the operation of the main building blocks, the VCO and the 1/8-rate linear PD, respectively. Measurement results are presented in Section V. Finally, the conclusion follows in Section VI.
II. 1/8-RATE CDR ARCHITECTURE
Fig. 1 shows the conventional full-rate CDR circuit, consisting of four blocks: a clock recovery circuit, a decision circuit, a frequency divider, and a 1:4 DEMUX. Typically, the clock recovery circuit employs a phase-locked loop (PLL) configuration that comprises a PD, a charge pump (CP), a low-pass filter (LPF), and a VCO. In realizing high-speed CDR circuits with a standard submicron CMOS technology, a number of design challenges arise. In particular, the VCO is the bottleneck in achieving the desired high-speed operation [17], [19]. Fig. 2 illustrates the simulated maximum oscillation frequency of a VCO in a 0.25-μm CMOS technology, where the VCO is designed as a simple differential ring oscillator with resistive loads and isolation output buffers. The maximum oscillation frequency of the three-stage VCO reaches 2 GHz and then falls drastically as the number of delay stages increases. Since the VCO must also drive the following decision circuit and the frequency divider, its oscillation frequency can be less than 2 GHz even with three delay stages. Similar results were reported in [20]. In conclusion, the conventional full-rate CDR architecture can hardly overcome such physical limitations of a submicron CMOS technology and achieve high-speed (Gb/s) operation.
0018-9200/03$17.00 © 2003 IEEE
Novel circuit techniques have been suggested, such as the half-rate clock [16], [17] or oversampling [19]. The half-rate-clock technique reduces the clock frequency by a factor of two. Yet the resulting clock frequency of 2 GHz is very close to the physical limit of the 0.25-μm CMOS technology. Therefore, 4-Gb/s operation with low jitter is difficult to achieve. Also, the reliability of the VCO cannot be guaranteed under process, voltage, and temperature (PVT) variations as the data rate increases and the supply voltage decreases. According to [17], a VCO realized in a 0.18-μm CMOS technology demonstrated more reliable operation at a 2.5-V supply voltage than at the nominal 1.8 V.
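The clock frequencies implied by each architecture can be checked with a short calculation. The sketch below is illustrative only: the function name `clock_frequency` is hypothetical, and the 2-GHz ceiling is taken from the simulated three-stage VCO limit quoted for Fig. 2.

```python
# Clock frequency required by each CDR architecture at a given data rate.
DATA_RATE_HZ = 4e9        # 4-Gb/s NRZ input, from the text
VCO_LIMIT_HZ = 2e9        # simulated three-stage ring VCO limit (Fig. 2)

def clock_frequency(data_rate_hz: float, rate_divisor: int) -> float:
    """Return the VCO frequency when the clock runs at 1/rate_divisor of the data rate."""
    return data_rate_hz / rate_divisor

for name, divisor in [("full-rate", 1), ("half-rate", 2), ("1/8-rate", 8)]:
    f = clock_frequency(DATA_RATE_HZ, divisor)
    margin = VCO_LIMIT_HZ / f  # >1 means the clock sits below the simulated limit
    print(f"{name:>9}: {f / 1e9:.2f} GHz, limit/clock = {margin:.1f}")
```

Only the 1/8-rate case leaves a clear margin below the simulated limit; the half-rate clock sits exactly at it.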
Meanwhile, the oversampling technique [19] is very attractive for designing a reliable VCO because the clock frequency can be reduced by a factor of four or more. However, sampling at fixed points produces considerable quantization jitter in the data eyes, which leads to high output jitter, and the technique mandates highly precise clock phase control. Also, extra decision logic is required for post-processing, leading to a large active area and high power consumption.
To alleviate the above tradeoffs, the 1/8-rate clock technique is proposed in this paper. Fig. 3 shows the architecture of the proposed 1/8-rate CDR circuit, consisting of a 1/8-rate linear PD, a CP, a second-order LPF, and a VCO. All building blocks are designed to be fully differential in order to minimize crosstalk and common-mode noise. The clock and data signals have low voltage swings of about 600 mVp-p, not only to increase their speed, but also to reduce power consumption. Output buffers (not shown in Fig. 3) employing the open-drain differential pair configuration drive the off-chip 50-Ω terminations and facilitate the measurement of the recovered half-quadrature clocks and demultiplexed data. In this CDR, the clock frequency is reduced by a factor of eight. Thus, the reliability of the VCO is considerably improved, even at the 4-Gb/s data rate. Also, the CDR circuit merges the four functional blocks (a clock recovery circuit, a decision circuit, a frequency divider, and a 1:4 DEMUX) into a single block.
When the full-rate 4-Gb/s incoming NRZ data stream enters the 1/8-rate PD, the phase of the input data is compared with those of the four half-quadrature clocks of the VCO. Then, a data transition (DT) signal and a clock transition (CT) signal are generated in the PD. Simultaneously, the input data is demultiplexed into four 1-Gb/s outputs. Each data output differs from the next by a half-quadrature phase (45°) because the outputs are retimed by the four half-quadrature clocks, respectively.
Since the DT and CT signals pass through a CP and an LPF, the difference between the average values of the DT and CT signals is linearly converted to the control voltage of the VCO. When the difference of the average values becomes zero, the loop enters the locked state.
The 1/8-rate PD is designed to be a linear type so as to obtain smaller output jitter than a bang-bang type. The oscillation frequency of the VCO is eight times lower (500 MHz) than the data rate, so that the VCO can provide sufficient tuning range and tolerate temperature and process variations efficiently [16]. It is known that the switching noise of a VCO propagates through the common substrate of a single chip and becomes detrimental to noise-sensitive analog blocks, e.g., transimpedance amplifiers in optical receivers [6]. However, it is reported in [21] that the substrate noise voltage scales down with the square root of frequency. Thus, the substrate noise effect can be significantly suppressed by lowering the operating frequency.
With the proposed 1/8-rate technique, the CDR circuit reduces the substrate noise effect, thereby providing reliable operation for optical receiver applications.
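As a rough illustration of the square-root scaling cited from [21], the following sketch estimates the relative substrate noise voltage at the 500-MHz clock. It assumes an ideal proportionality between noise voltage and the square root of frequency, with all other factors held constant; the function name is hypothetical and the result is not a measured quantity.

```python
import math

def substrate_noise_ratio(f_new_hz: float, f_ref_hz: float) -> float:
    """Noise-voltage ratio under the assumed V_noise ~ sqrt(f) scaling."""
    return math.sqrt(f_new_hz / f_ref_hz)

F_EIGHTH_RATE = 500e6   # proposed 1/8-rate clock
F_HALF_RATE = 2e9       # half-rate clock at 4 Gb/s
F_FULL_RATE = 4e9       # full-rate clock at 4 Gb/s

ratio_vs_half = substrate_noise_ratio(F_EIGHTH_RATE, F_HALF_RATE)
ratio_vs_full = substrate_noise_ratio(F_EIGHTH_RATE, F_FULL_RATE)

for label, ratio in [("vs half-rate", ratio_vs_half), ("vs full-rate", ratio_vs_full)]:
    print(f"{label}: x{ratio:.3f} ({20 * math.log10(ratio):.1f} dB)")
```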
III. VCO
For CDR applications, two configurations of VCOs are widely exploited: ring oscillators and LC-tank oscillators [22]. The ring oscillator provides a wide tuning range and can generate a number of clocks with different phases. However, it produces relatively high phase noise and operates only at low frequencies due to its low quality factor. Meanwhile, the LC-tank oscillator achieves stable operation at higher frequencies with less phase noise. However, its tuning range is very narrow. Also, on-chip spiral inductors have a low Q factor (typically 3-5) and consume a large active area. Hence, we employ active inductor loads to acquire a wide tuning range, a moderately high Q factor, and a small area.
The proposed ring oscillator employs the half-quadrature technique to generate four clock phases. Fig. 4 shows the block diagram of the half-quadrature differential ring oscillator, consisting of four delay stages and four isolation buffers with duty-cycle correction (DCC). Each delay stage produces a half-quadrature 1/8-rate clock with a 45° phase difference. The DCC function is necessary because utilizing both clock edges incurs duty-cycle distortion in the retimed data [17].
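The timing implied by the half-quadrature scheme can be sketched numerically. The example below assumes an ideal four-stage differential ring oscillator whose period equals twice the number of stages times the per-stage delay, with perfect 50% duty cycle; the variable names are illustrative.

```python
CLOCK_HZ = 500e6      # 1/8-rate clock for 4-Gb/s data
DATA_RATE = 4e9
N_STAGES = 4          # four differential delay stages (Fig. 4)

period = 1 / CLOCK_HZ
# Ideal differential ring oscillator: T = 2 * N * t_d
stage_delay = period / (2 * N_STAGES)

# Phase of each stage output in degrees: 0, 45, 90, 135
phases_deg = [k * 180 / N_STAGES for k in range(N_STAGES)]

# Using both edges of the four clocks gives eight sampling instants per period.
rising = [k * stage_delay for k in range(N_STAGES)]
falling = [t + period / 2 for t in rising]
sampling_instants = sorted(rising + falling)
spacing = [b - a for a, b in zip(sampling_instants, sampling_instants[1:])]

print(f"stage delay = {stage_delay * 1e12:.0f} ps, bit period = {1e12 / DATA_RATE:.0f} ps")
print("phases (deg):", phases_deg)
```

Under these assumptions the eight edges are spaced by one bit period, which is why any duty-cycle error in the clocks translates directly into sampling-point error.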
Generally, there are two approaches to obtain the DCC [23]. One is to add a divide-by-two circuit at the output of the VCO. However, this requires the VCO to operate at twice the clock frequency. The other is to exploit an extra feedback loop. Since loop stabilization is difficult to obtain, the DCC may become incorrect at high frequencies. If the VCO operates at low frequencies, the feedback isolation buffer can achieve stable auxiliary feedback operation and thus reliable DCC. The latter approach is feasible with the proposed VCO due to the low (1/8-rate) clock frequency.
Fig. 5 shows the block diagram and the implementation of the feedback isolation buffer with the DCC function. It comprises two amplifier stages, i.e., an input differential amplifier and a resistive feedback amplifier, in parallel with an auxiliary high common-mode rejection ratio (CMRR) feedback amplifier. The DCC function is incorporated by exploiting the high-CMRR feedback loop. The resistive feedback loop extends the bandwidth so that large loads can be driven easily. According to HSPICE simulations, the feedback isolation buffer achieves almost 99% of DCC.
Fig. 6 shows the schematic diagram of a single delay stage of the VCO, consisting of a coarse tuning stage with a programmable 6-bit digital control word and a fine-tuning stage with a folded differential pair with source degeneration. The externally programmed 6-bit coarse control word widens the capture range, which is otherwise limited by the narrow loop bandwidth that the CDR uses to suppress the effect of input noise. The coarse tuning is digitally controlled by varying the tail currents, as illustrated in Fig. 6.
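To give a feel for the granularity of the 6-bit coarse control, the sketch below divides the 150-MHz coarse tuning range reported in Section V evenly across the available codes. The uniform step is an assumption made only for illustration; the source does not state how the tail currents are weighted, and `coarse_offset_hz` is a hypothetical helper.

```python
COARSE_BITS = 6
COARSE_RANGE_HZ = 150e6   # measured coarse tuning range (Section V)
FINE_RANGE_HZ = 70e6      # measured fine tuning range (Section V)

codes = 2 ** COARSE_BITS                  # number of coarse settings
step_hz = COARSE_RANGE_HZ / (codes - 1)   # ASSUMPTION: uniform spacing between codes

def coarse_offset_hz(code: int) -> float:
    """Frequency offset from the lowest band for a given coarse code (assumed linear)."""
    if not 0 <= code < codes:
        raise ValueError(f"code must be in 0..{codes - 1}")
    return code * step_hz

print(f"{codes} codes, about {step_hz / 1e6:.2f} MHz per step")
print(f"fine range covers about {FINE_RANGE_HZ / step_hz:.0f} coarse steps")
```

If the steps were roughly uniform, the fine tuning range would span many coarse steps, so adjacent bands would overlap generously.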
Each delay stage exploits a pair of active inductor loads. Since the active inductor loads result in a larger voltage drop than passive inductors and mandate a higher supply voltage, a folded NMOS differential pair is adopted in the fine-tuning stage to alleviate the voltage headroom problem. Source degeneration is used in the differential pair to achieve a low VCO gain with a wide linear range.
IV. 1/8-RATE LINEAR PD
Fig. 7 shows the block diagram of the proposed 1/8-rate linear PD. It consists of eight data sampling latches, a data and clock transition (DCT) detector, and a DCT generator. The 1/8-rate linear PD accomplishes three tasks with no systematic offset: data transition detection, linear phase error detection, and data regeneration. In the latch stage, every bit of the incoming NRZ data stream is sampled at the rising and falling edges of the four half-quadrature clocks. Then, the DCT detector generates the four DCT signals and simultaneously provides the retimed data outputs, which are the 1:4 demultiplexed data. From the four DCT signals, the DCT generator produces the DT and CT signals that determine the phase error between the data and the clock.
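The 1:4 demultiplexing performed by the PD can be illustrated with a purely behavioral, bit-level model. This sketch ignores all analog timing and the latch/MUX structure of Fig. 7; the function names and the sample bit pattern are hypothetical.

```python
def demux_1_to_4(bits):
    """Split a serial bit stream into four interleaved quarter-rate lanes."""
    return [bits[i::4] for i in range(4)]

def remux_4_to_1(lanes):
    """Re-interleave four equal-length lanes (inverse of demux_1_to_4)."""
    return [bit for group in zip(*lanes) for bit in group]

stream = [1, 0, 0, 1, 1, 1, 0, 1]   # eight bits = one 500-MHz clock period at 4 Gb/s
lanes = demux_1_to_4(stream)

print(lanes)                  # four lanes, each carrying every fourth bit
print(remux_4_to_1(lanes))    # round trip recovers the original order
```

Each lane carries one quarter of the bits, which corresponds to the four 1-Gb/s outputs for a 4-Gb/s input.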
The PD employs the differential configuration to achieve stable operation and wide output swings even at low supply voltages. It also exploits the current-mode logic (CML) configuration for high-speed operation. By making the input and output common-mode (CM) levels equal, it acquires a wide input CM range with no level shifters. In each folded CML gate, the input and output CM levels are both set to half the supply voltage. The schematic diagrams of the folded CML family, including a folded D-latch, a folded MUX, and a folded XOR, are shown in Fig. 8.
Fig. 9 illustrates the operation of the 1/8-rate linear PD. Whenever a transition of the input NRZ data occurs, all DCT signals go high. The falling edges of the individual DCT signals are triggered by the rising and falling edges of the corresponding half-quadrature clocks. Consequently, the DCT signals appear at every data transition, and each DCT signal is produced by XORing two consecutive bits of the input data, each bit being the output of a data sampling latch in Fig. 7.
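The XOR-based transition detection can be sketched at the bit level as follows. This is a behavioral illustration only: it models the logical XOR of consecutive samples, not the pulse widths or clock-edge timing of the actual DCT signals, and the function name is hypothetical.

```python
def transition_flags(samples):
    """XOR each sampled bit with the next one; 1 marks a data transition."""
    return [a ^ b for a, b in zip(samples, samples[1:])]

samples = [1, 0, 0, 1, 1, 1, 0, 1]   # outputs of consecutive sampling latches
flags = transition_flags(samples)

print(flags)        # 1 wherever two consecutive bits differ
print(sum(flags))   # number of transitions in the window
```

Runs of identical bits produce no flag, so in this model the loop receives phase information only when the data actually toggles.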
The DT and CT signals are generated by toggling the four DCT signals at every consecutive rising and falling edge. That is, the DT and CT signals appear at every data transition. The pulse width of the CT signal is linearly proportional to the frequency of the half-quadrature clock, whereas that of the DT signal is unchanged if the input data rate is fixed. Therefore, phase error detection is obtained by comparing the average value of the DT signal with that of the CT signal.
In the locked condition, the pulse widths of the DT and CT signals are equal to that of the input data. The phase error between the DT and CT signals is converted to a differential fine control voltage of the VCO through the CP. Fig. 10 shows the transistor-level implementation of the fully differential CP. The CP employs common-mode feedback (CMFB) to fix the CM level of the VCO control line and to achieve stable frequency acquisition. Also, large transistors are used to minimize the effect of mismatch between the output currents.
Fig. 11 shows the simulation results of the 1/8-rate linear PD characteristic, indicating the difference of the average values of the DT and CT signals as a function of the delay between a half-quadrature clock and the input data. It is seen that the 1/8-rate linear PD achieves a linear range of about 160 ps. The half-rate linear PD [17] and the Hogge PD [24] cannot avoid an inherent systematic phase offset between the average values of the error and the reference signals, which may be attributed to the asymmetric paths in the generation of the two signals. Thus, there is a significant static phase offset between the clock and the input data even in the locked condition. The proposed linear PD, however, has symmetric paths and the same propagation delay in the generation of both the DT and CT signals. Therefore, no systematic phase offset occurs between the clock and the data. In addition, because the proposed PD operates at a frequency considerably lower than the data rate, it can improve the linearity and therefore operate at sufficiently higher speed.
V. MEASUREMENT RESULTS
Test chips were fabricated in a 0.25-μm standard CMOS technology with four metal layers and a single poly layer. Fig. 12 shows the chip microphotograph, where the core occupies an area of 0.9 × 1.0 mm². To facilitate the measurements, a test chip was mounted on an FR-4 PC board by using bondwires. Fig. 13 shows the measured eye diagrams of the four 1:4 demultiplexed 1-Gb/s data outputs for 4-Gb/s pseudorandom bit sequence (PRBS) input data. Each data output exhibits almost a half-quadrature phase difference. Fig. 14 shows the recovered half-quadrature clocks, where a slight phase offset occurs between adjacent waveforms. It may be attributed to mismatch in the output buffer stage and possibly also to the parasitics on the test board. Fig. 15 shows the spectrum of the recovered clock, where the phase noise is measured to be −112 dBc/Hz at 1-MHz offset. Also, the jitter histogram of the recovered clock is shown in Fig. 16, indicating the clock jitter to be 6.5 ps rms and 47 ps peak-to-peak for PRBS input data. Considering the test equipment jitter of 3.9 ps rms, the pure clock jitter is 5.2 ps rms. In addition, the bit error rate (BER) of the CDR circuit is measured to be less than for PRBS. The fine-tuning gain of the VCO is measured to be 75 MHz/V. The fine and coarse tuning ranges are 70 and 150 MHz, respectively. The capture range is measured to be 16 MHz, which is four times wider than the loop bandwidth.
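The jitter and bandwidth figures above can be cross-checked with simple arithmetic. The sketch below assumes that the 6.5-ps and 3.9-ps values are rms quantities from independent sources, so that they combine as a root-sum-square; it also derives the loop bandwidth and the fine control-voltage span implied by the quoted numbers. The derived values are inferences, not figures stated in the source.

```python
import math

measured_rms_ps = 6.5     # recovered clock jitter including equipment
equipment_rms_ps = 3.9    # test equipment jitter

# ASSUMPTION: independent jitter sources add in quadrature.
intrinsic_rms_ps = math.sqrt(measured_rms_ps**2 - equipment_rms_ps**2)

capture_range_hz = 16e6
loop_bandwidth_hz = capture_range_hz / 4   # capture range is four times the loop bandwidth

fine_range_hz = 70e6
fine_gain_hz_per_v = 75e6
control_span_v = fine_range_hz / fine_gain_hz_per_v   # implied fine control-voltage span

print(f"intrinsic clock jitter = {intrinsic_rms_ps:.1f} ps rms")
print(f"implied loop bandwidth = {loop_bandwidth_hz / 1e6:.0f} MHz")
print(f"implied control span   = {control_span_v:.2f} V")
```

The quadrature subtraction reproduces the 5.2-ps figure quoted in the text.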
The chip, excluding the output buffers, dissipates a total power of 70 mW from a single 2.5-V supply. Hence, the proposed 1/8-rate CDR circuit achieves lower power dissipation and lower jitter for the same 4-Gb/s operation than the oversampling CDR circuit in [19]. Table I summarizes the performance of the proposed CDR circuit.
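For comparison with other designs, the reported power can be expressed as energy per bit. This is a derived figure of merit, not a value stated in the source, and it excludes the output buffers as the 70-mW number does.

```python
POWER_W = 70e-3        # core power, excluding output buffers
DATA_RATE_BPS = 4e9    # 4-Gb/s operation
SUPPLY_V = 2.5

energy_per_bit_pj = POWER_W / DATA_RATE_BPS * 1e12   # joules per bit, in picojoules
supply_current_ma = POWER_W / SUPPLY_V * 1e3         # average supply current in mA

print(f"energy per bit = {energy_per_bit_pj:.1f} pJ/bit")
print(f"supply current = {supply_current_ma:.0f} mA")
```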
