I. INTRODUCTION
As data rates continue to increase, the transmit signals in wireline communications are subjected to higher attenuation by legacy channels. This requires more sophisticated equalization schemes than what analog equalization is able to provide in binary receivers [see Fig. 1(a)]. In contrast, ADC-based receivers have an analog-to-digital converter (ADC) that allows additional equalization to be performed in the digital domain [e.g., Fig. 1(b)]. Digital blocks are advantageous compared with their analog counterparts because they are more robust to PVT variations, can be designed through HDL code, and are more easily ported to newer, more advanced technologies.
As shown in Fig. 1(b) , an ADC-based receiver consists of an ADC, one or more equalizers, and a digital clock and data recovery (CDR). This paper focuses on a novel architecture for a digital CDR. Our work does not include channel equalization and, therefore, recovers data from low-loss channels. However, in our simulated results, we show that the digital CDR can recover data from a high-loss channel when combined with appropriate equalization.
There are two types of ADC-based CDRs: phase-tracking [1] - [5] and blind [6] . In the phase-tracking architecture illustrated in Fig. 2(a) , the ADC samples the received signal at the center of the data eye using digital-to-analog feedback. This is time-consuming to design because the analog and digital blocks must be simulated together to ensure the feedback loop works well. In the blind architecture shown in Fig. 2(b) , the ADC samples the received signal with a local plesiochronous clock and the digital CDR extracts data from the blind samples. This eliminates the feedback loop between digital and analog domains, and the associated design complexity so that the ADC and the digital CDR can be designed and simulated independently. The digital CDR may have internal feedback, but no feedback goes to the analog blocks.
In this work, we focus on blind ADC-based CDRs. Previous works [6], [7] sampled the incoming data at 2 samples per UI and 1.45 samples per UI to achieve 5 and 6.875 Gb/s, respectively. In an attempt to further increase the data rate to 10 Gb/s, we eliminate oversampling and sample at baud rate (1 sample per UI). Existing baud-rate architectures [1]-[5] rely on a phase-tracking clock to sample at the middle of the data eye. In contrast, this paper presents a blind baud-rate CDR [8] fabricated in 65-nm CMOS.
This paper is organized as follows. Section II provides the background for ADC-based sampling. Section III introduces the receiver architecture and describes how the CDR handles frequency offset. Section IV discusses the implementation of each block. Section V presents the simulation and measurement results. Section VI summarizes the main concepts and results in the paper.
II. BACKGROUND
An example of a 2× blind ADC-based CDR [6], [9] is shown in Fig. 3. A 5-Gb/s input is sampled by a 5-bit ADC and is passed to a feed-forward equalizer (FFE) in the digital CDR. After the FFE, the blind samples are processed by the phase detector (PD). If two adjacent blind samples are opposite in sign, a zero-crossing is detected, which corresponds to the edge sample in a phase-tracking system. The phase of this zero-crossing is approximated by the linear interpolation shown in Fig. 3. The instantaneous phase is then low-pass filtered by the digital filter into an average phase. The data decision block adds 0.5UI to the average phase to find the center of the eye and uses this estimate to recover the data from the blind samples. This system uses 2× sampling, where the blind samples are 0.5UI apart. However, if the oversampling ratio can be decreased, then the data rate can be increased without increasing the frequency of the blind clock.
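The linear interpolation used by such a PD can be sketched in a few lines. This is a minimal illustration, not the implementation of [6]: the function name, the argument names, and the convention of measuring the phase from the first sample are assumptions made here for clarity.

```python
def zero_crossing_phase(s0, s1, spacing_ui=0.5):
    """Estimate the zero-crossing position between two adjacent blind samples.

    s0, s1     : consecutive sample values (hypothetical names)
    spacing_ui : distance between the samples in UI (0.5UI for 2x sampling)
    Returns the offset of the crossing from s0 in UI, or None when the
    samples do not have strictly opposite signs (no transition detected).
    """
    if s0 * s1 >= 0:
        return None
    # Straight line through (0, s0) and (spacing_ui, s1); solve for value 0.
    return spacing_ui * s0 / (s0 - s1)
```

For example, samples of -3 and 1 place the crossing three quarters of the way between them, i.e., 0.375UI after the first sample. The estimate degrades as the samples move farther apart or as the transition becomes sharper, which is the limitation discussed next.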
A subsequent work [7], illustrated in Fig. 4, reduces the oversampling ratio to 1.45×; the receiver takes 16 samples for every 11UI to achieve 6.875 Gb/s. Its architecture is similar to the one presented in [6], but now the samples are farther apart than 0.5UI and the linear interpolation used in the PD to estimate zero-crossings is less accurate. To solve this problem, the PD filters out some of the less accurate results based on sample amplitude. With this architecture, 1.45× seems to provide a good compromise where the oversampling ratio can be reduced without much loss in jitter tolerance.
In order to eliminate oversampling altogether, a different CDR architecture is required. The PDs in the 2× and 1.45× blind CDRs interpolate between the blind samples in order to detect the phase of the zero crossings; they require a finite slope in order to calculate phase. Given a low-loss channel, the data transitions become too sharp and, as a result, the interpolation cannot accurately estimate phase. Unlike phase-tracking CDRs, blind ADC-based CDRs perform poorly with low-loss channels. Since a blind ADC-based CDR should work with a range of channels, we focus most of our analysis on low-loss channels. In Section V, we show how the proposed CDR can recover data from a high-loss channel when combined with additional equalization.
Fig. 5 compares eye diagrams with different sampling rates given a low-loss channel. The worst-case sampling position occurs when adjacent samples are equally far from the center of the eye. For 2× blind sampling, the worst case is where adjacent samples are both 0.25UI from the edge, which leads to a high-frequency jitter tolerance of 0.5UI. When the oversampling ratio is decreased to 1.45×, jitter tolerance decreases to 0.31UI. At 1×, the samples may occur on the edges. If jitter shifts samples away from each other, then the CDR will not capture the bit at all, which results in zero jitter tolerance. In the following paragraph, we will use the channel's pulse response to elaborate on this issue and to arrive at our proposed solution.
Fig. 6 shows the pulse response of an ideal channel. The best sampling position occurs when the main cursor is at the center of the ideal pulse response. In a clocked phase-tracking system, the sampling would remain at this position. However, with 1× blind sampling, any frequency offset between the data and receiver clock will cause the sampling phase to shift continuously across a 1UI window. When the sampling occurs near the UI boundary, any high-frequency jitter may shift the sampling outside the 1UI phase range, resulting in the loss of data bits (i.e., zero jitter tolerance).
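The jitter-tolerance figures quoted for the three sampling rates follow from simple geometry, which the sketch below reproduces. It assumes an ideal rectangular eye that is exactly 1UI wide and takes the tolerance to be twice the distance from the worst-case sample to the nearest edge; the function name and the exact-fraction arithmetic are choices made here for illustration.

```python
from fractions import Fraction

def worst_case_jitter_tolerance(samples, ui):
    """High-frequency jitter tolerance in UI for `samples` blind samples per `ui` UI."""
    spacing = Fraction(ui, samples)          # distance between adjacent blind samples
    # Worst case: two adjacent samples sit symmetrically around the eye center,
    # so each one is spacing/2 from the center and 0.5 - spacing/2 from an edge.
    margin_to_edge = Fraction(1, 2) - spacing / 2
    return max(Fraction(0), 2 * margin_to_edge)

for samples, ui in [(2, 1), (16, 11), (1, 1)]:
    print(samples, ui, float(worst_case_jitter_tolerance(samples, ui)))
```

Under these assumptions the tolerance reduces to 1 - 1/r for an oversampling ratio r, giving 0.5UI at 2×, 0.3125UI at 16 samples per 11UI, and zero at baud rate.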
In order to increase the jitter tolerance at baud-rate sampling, we extend the pulse response beyond 1UI by introducing a controlled amount of ISI in the data using a rectangular filter, which we implement via an integrate-and-dump (I&D) circuit [10] in the receiver front end. A rectangular filter is suitable in this case since its response has a finite length of ISI and requires fewer equalization taps compared to the exponentially decaying response of an RC filter. A 1UI rectangular filter, convolved with the ideal channel, spreads the pulse response across 2UI. If we have a perfect decision feedback equalizer (DFE) to cancel all post-cursor ISI, then the eye would be open for a range of 1.5UI (this would have been 2UI if we could cancel pre-cursor ISI). If the blind samples shift beyond the 1UI window, there is still a remaining jitter margin of 0.5UI. A 2UI rectangular filter increases this margin to 1UI and results in a symmetric eye opening with respect to the blind sampling window. For these reasons, we choose a 2UI I&D circuit in our proposed design.
III. PROPOSED 1× BLIND RECEIVER ARCHITECTURE
Fig. 7 shows the system diagram of the receiver including an analog front-end and digital CDR. The analog front-end consists of four interleaved I&D and ADC blocks, each operating at 2.5 GS/s. Had we added the I&D samples directly in the analog domain and fed them to the ADC, we would have lost the additional 1 bit of resolution. Simulations showed that the system needed an ADC with a minimum ENOB of 4 bits; hence we chose a 5-bit ADC with a known ENOB of 4.2 bits [9] for our design. The proposed design does not include ADC calibration; the addition of digital calibration for gain, offset, and timing mismatches [11]-[13] would further improve the receiver performance.
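The eye-opening ranges quoted in Section II for the 1UI and 2UI rectangular filters can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the channel pulse is an ideal 1UI rectangle, the DFE removes all post-cursor ISI, and the eye counts as open whenever the main cursor exceeds the sum of the pre-cursors. The function names and the 0.01UI scan step are hypothetical.

```python
def pulse(t, filt_ui):
    """Ideal 1UI pulse convolved with a filt_ui-wide rectangular filter, peak scaled to 1.

    The result is a triangle (filt_ui == 1) or a trapezoid spanning 0..1+filt_ui UI.
    """
    a, b = sorted((1.0, float(filt_ui)))
    if t <= 0.0 or t >= a + b:
        return 0.0
    if t < a:
        return t / a              # rising edge
    if t <= b:
        return 1.0                # flat top
    return (a + b - t) / a        # falling edge

def eye_open_range(filt_ui, cancel_precursor=False, steps_per_ui=100):
    """Width in UI of the sampling phases for which the eye is open."""
    open_points = 0
    for i in range(1, (1 + filt_ui) * steps_per_ui):
        tau = i / steps_per_ui
        main = pulse(tau, filt_ui)
        # Pre-cursors are the pulse values 1UI, 2UI, ... before the main cursor.
        pre = 0.0 if cancel_precursor else sum(
            pulse(tau - k, filt_ui) for k in range(1, filt_ui + 1))
        if main > pre:
            open_points += 1
    return open_points / steps_per_ui

print(eye_open_range(1), eye_open_range(1, cancel_precursor=True), eye_open_range(2))
```

Up to the scan resolution, this gives about 1.5UI for the 1UI filter (2UI if pre-cursor ISI could also be cancelled) and about 2UI for the 2UI filter. Subtracting the 1UI blind sampling window leaves margins of 0.5UI and 1UI, respectively.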
The samples in the digital CDR are processed by the data interpolator, which estimates the samples at the center of the eye using the recovered phase. The digital data interpolator allows us to use a more sophisticated interpolation algorithm compared to an analog interpolator [14]. A Mueller-Muller PD and loop filter form a feedback loop with the data interpolator. Loop latency is critical in this design because it degrades the stability of the feedback loop. Since the digital CDR operates on a 625-MHz divided clock, each cycle in the loop adds significant delay. Our implementation has a loop latency of seven cycles. A 2-tap DFE recovers the binary data from the interpolated samples.
The data interpolator compensates for frequency offset. As shown in Fig. 9(a), we define negative frequency offset to mean that the transmitter clock is slower than the blind receiver clock. When this occurs, an interpolated sample is skipped each time the phase completes a 1UI rotation. Similarly, Fig. 9(b) shows a positive frequency offset where the transmitter clock is faster than the receiver clock. A positive frequency offset would result in cases where no blind sample exists between two desired samples; the interpolator resolves these cases by interpolating twice between the closest two blind samples when the decreasing phase rolls over from 0UI to 1UI. The range of frequency offset supported by the loop filter is sufficiently low that we can assume the extra interpolated sample is very close to the blind sample at 1UI. Hence, our implementation directly uses the blind sample as the extra interpolated sample.
The data path in the digital CDR is sized for 17 parallel samples. Most of the time, only 16 paths are active. If there is frequency offset and the recovered phase rolls over, then the number of active paths is temporarily reduced to 15 or increased to 17 for one cycle.
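How often the 15- and 17-path cycles occur can be illustrated with an idealized phase accumulator. The sketch below is not the implemented data path. It assumes that the recovered phase tracks the frequency offset exactly (no loop dynamics), that the phase increases for a negative offset (the text only states that it decreases for a positive one), and that the phase starts at mid-UI. The phase is kept as an integer with 10^6 units per UI so that 1 ppm equals one unit per sample; all names are hypothetical.

```python
PHASE_ONE_UI = 1_000_000   # integer phase units per UI (assumed resolution)

def active_paths_per_cycle(offset_ppm, n_cycles, samples_per_cycle=16):
    """Return the number of active data paths in each 625-MHz cycle."""
    phase = PHASE_ONE_UI // 2              # assumed starting phase (mid-UI)
    counts = []
    for _ in range(n_cycles):
        active = samples_per_cycle
        for _ in range(samples_per_cycle):
            phase -= offset_ppm            # positive offset: phase decreases
            if phase < 0:                  # rolled over from 0UI to 1UI
                phase += PHASE_ONE_UI
                active += 1                # one extra interpolated sample
            elif phase >= PHASE_ONE_UI:    # rolled over from 1UI to 0UI
                phase -= PHASE_ONE_UI
                active -= 1                # one interpolated sample skipped
        counts.append(active)
    return counts

counts = active_paths_per_cycle(300, 10_000)
print(counts.count(17), counts.count(16), counts.count(15))
```

With 300 ppm of offset the phase drifts by 1UI roughly every 3333 samples, so under these assumptions a 17-path cycle appears about once every 208 cycles, and every other cycle uses 16 paths.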
IV. RECEIVER IMPLEMENTATION

A. I&D Filter
The output from the channel drives the input of the I&D filter. The I&D circuit in Fig. 10 introduces controlled ISI into the ADC input and operates as a frequency-scalable anti-aliasing filter [10]. The circuit consists of a single source-degenerated transconductance stage that converts the input voltage to current and integrates the signal on the input capacitance of the four interleaved ADCs, as labeled in Fig. 10. As shown in Fig. 11, each interleaved I&D block operates in three phases: integrate, hold (during which the ADC samples the value), and reset. The clock pulses (SC0-SC3) and inverted pulses (SC0x-SC3x) reset the outputs (V0-V3) and redirect the current to each of the interleaved ADCs. Each clock pulse is 1UI wide.
B. Clock Generator
Fig. 12 shows the clock generator which drives the ADC and I&D. A CML toggle flip-flop divides a 5-GHz input clock into four phases, each at 2.5 GHz. The outputs are then converted into single-ended CMOS signals and buffered. The clock pulse generator [10] uses logic gates to generate 1UI-wide pulses from the four clock phases. Fig. 13(a) shows an example of the clock pulses when skew exists between the four phases. First, we note that any skew changes the integration periods, because the pulses control the I&D operation; this results in gain mismatch between the four interleaved I&D blocks. Second, when we sample high-speed signals, the clock skew appears effectively as high-frequency periodic or duty-cycle-distortion (DCD) jitter. Both the gain mismatch and the high-frequency jitter degrade the receiver's jitter tolerance. In simulation, the CDR's high-frequency jitter tolerance is reduced by approximately 0.2UI when the clock pulse widths are 0.95UI, 1.05UI, 0.95UI, and 1.05UI, respectively.
As shown in Fig. 13(b) , we compensate for skew by adjusting the clock phase through deskew circuits. In this design, the skews are manually adjusted by observing the ADC outputs (shown in Section V). Fig. 14 shows the deskew circuitry implemented in each of the CML-to-CMOS converters as a 4-bit phase interpolator. The differential clock signal connects to the and inputs and a 20-ps delayed clock connects 
C. Data Interpolator
Given the ADC's blind samples and the CDR's recovered phase, the data interpolator estimates the value of the data at the center of the eye (i.e., the desired sample). Fig. 15 shows four consecutive blind samples that are separated by 1UI. The desired sample is offset from one of these samples by the recovered phase. For simplicity, the expression in Fig. 15 assumes that the phase is a floating point value between 0 and 1UI. In our implementation, the phase is represented by a 5-bit value. The desired sample is estimated first by linearly interpolating between the two blind samples that surround it. This estimate has a large error because these samples are separated by 1UI. To improve accuracy, extrapolation is performed using two slopes computed from the four samples. We scale the piecewise linear shape in Fig. 15 by the average of the two slopes and superimpose it on the linear interpolation. Hence, the accuracy of the estimate is improved by using four instead of two blind samples.
D. Mueller-Muller Phase Detector (MMPD)
The MMPD is defined by a function we will denote as the MM function, which should be chosen based on the pulse response of the channel. The MM function is also the transfer characteristic of the MMPD. When placed in a CDR feedback loop, the feedback forces the MM function to zero. Fig. 16 shows an example that Mueller and Muller presented in their 1976 paper [15]. The MM function demonstrated in [15] was the difference between the pre-cursor and the post-cursor. Given the example pulse response shape, when the samples shift to the left, the post-cursor becomes greater than the pre-cursor and the MM function is negative. Conversely, if the samples shift to the right, the MM function becomes positive. When the CDR locks, the feedback forces the MM function to zero and the pre-cursor and post-cursor are equal, such that the main cursor is near the optimal sampling position close to the peak of the pulse response.
In this work, the 2UI I&D provides a wider pulse response such that the MM function in Fig. 16 would not provide the optimal sampling phase. If the receiver includes a DFE to cancel post-cursor ISI, the maximum vertical eye opening occurs at the sampling time marked in Fig. 17, where the main cursor is the maximum value of the pulse response and the pre-cursor is zero. Setting the pre-cursor to zero allows us to fully benefit from the DFE and eliminates the need for an FFE. This sampling position occurs when the first post-cursor is equal to the main cursor. To identify this desired phase location, we choose the MM function to be the difference between the main cursor and the first post-cursor [16] and force it to zero through the feedback loop. Since our actual sampling phase is blind, we force the desired phase on the interpolating phase. It can be shown [15] that the pulse response can be estimated using the samples and the recovered data. For convenience, we include the derivation in Appendix A. From (9) and (7), the two cursors can be estimated by expected values of products of the samples and the recovered data. We substitute the expected values into the MM function to transform the MM function into the MMPD. The loop filter in the next block performs the expected value operation by averaging the MMPD output.
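The difference between the two lock points can be illustrated on an idealized pulse. The sketch below assumes a trapezoidal pulse response (an ideal 1UI pulse through a 2UI rectangular filter) rather than the actual response of Fig. 17, and it assumes sign conventions of pre-cursor minus post-cursor for the MM function of [15] and main cursor minus post-cursor for the modified one. The function names and the bisection search are illustrative only.

```python
def trapezoid(t):
    """Assumed pulse response: rises over 0..1UI, flat over 1..2UI, falls over 2..3UI."""
    if t <= 0.0 or t >= 3.0:
        return 0.0
    if t < 1.0:
        return t
    if t <= 2.0:
        return 1.0
    return 3.0 - t

def cursors(tau):
    """Pre-cursor, main cursor, and first post-cursor for sampling phase tau (in UI)."""
    return trapezoid(tau - 1.0), trapezoid(tau), trapezoid(tau + 1.0)

def mm_classic(tau):
    pre, _, post = cursors(tau)
    return pre - post                     # zero when pre-cursor equals post-cursor

def mm_modified(tau):
    _, main, post = cursors(tau)
    return main - post                    # zero when main cursor equals post-cursor

def lock_phase(mm, lo=0.5, hi=2.5, iterations=60):
    """Bisection for the phase where mm crosses zero, assuming mm(lo) < 0 <= mm(hi)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mm(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for name, mm in [("classic", mm_classic), ("modified", mm_modified)]:
    tau = lock_phase(mm)
    print(name, round(tau, 3), [round(c, 3) for c in cursors(tau)])
```

On this idealized pulse the first function settles at 1.5UI, where the pre-cursor is still half of the main cursor, while the second settles at 1UI, where the pre-cursor is zero and the post-cursor equals the main cursor, leaving only ISI that a DFE can remove.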
Note that the above expressions for pulse response are not unique. For example, according to (10), the same cursor is also equal to the expected value of a different pairing of sample and data bit. In the implementation illustrated in Fig. 18, we can therefore choose the pairing so that the same recovered bit can be factored out of the expressions for the two cursors. The DFE has some latency before it recovers this bit; factoring it out allows the subtraction to be performed before the bit becomes available. Since the bit takes on only two values, it only affects the sign of the MMPD. In the PD implementation, subtraction is performed first and speculation is used for the sign of the bit. The DFE's recovered data and the PD output are ready at the same time, thereby reducing latency in the CDR feedback loop and improving loop stability.
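The factoring described above can be illustrated with a small Monte Carlo experiment. The sketch assumes the standard Mueller-Muller relations, in which the main cursor is estimated by the average of sample times current bit and the first post-cursor by the average of next sample times current bit; whether these match the paper's numbered equations exactly is an assumption. The three-tap pulse model, the variable names, and the seed are hypothetical.

```python
import random

def simulate_mmpd(h_pre, h_main, h_post, n_bits=100_000, seed=1):
    """Estimate the main cursor, the post-cursor, and the MMPD output from random data."""
    rng = random.Random(seed)
    d = [rng.choice((-1, 1)) for _ in range(n_bits)]     # independent +/-1 data
    x = [0.0] * n_bits
    for k in range(1, n_bits - 1):
        # Sample k sees the next bit (pre-cursor), its own bit, and the previous bit.
        x[k] = h_pre * d[k + 1] + h_main * d[k] + h_post * d[k - 1]
    ks = range(2, n_bits - 2)
    n = len(ks)
    est_main = sum(x[k] * d[k] for k in ks) / n
    est_post = sum(x[k + 1] * d[k] for k in ks) / n
    # Factored form: subtract the samples first, then apply the sign of the bit.
    mmpd = sum((x[k] - x[k + 1]) * d[k] for k in ks) / n
    return est_main, est_post, mmpd

print(simulate_mmpd(0.0, 1.0, 0.6))
```

Because the bit d[k] multiplies both terms, the difference of the two samples can be formed before the bit is known, and the bit only selects the sign of the result. The averaged output approaches the difference between the main cursor and the post-cursor.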
E. Decision-Feedback Equalizer
The DFE compensates for post-cursor ISI from the channel and the I&D filter. As can be seen from the pulse response in Fig. 17, recovering data from an ideal channel and 2UI I&D filter would require one DFE tap to equalize the first post-cursor, while a more attenuative channel may require more taps. Three pipelined stages, operating at 625 MHz, resolve 16 bits in parallel (actually 15 to 17 bits, to handle cases of frequency offset as discussed in Section III). DFE adaptation was not included in this design.
To recover 16 bits per clock cycle, 16 parallel DFE sum blocks are required. Speculation is used extensively to reduce latency in the CDR feedback loop. In each DFE summation block shown in Fig. 19(a), the two DFE taps are manually set and speculation is performed by subtracting the four possible ISI levels from the interpolated sample. When the previous two bits have been recovered, the mux selects the correct result. This speculation removes the adder from the critical path. However, the muxes remain on the critical path since, in order to resolve all 16 bits, data must propagate through 16 muxes, and at 625 MHz the data can only propagate through 8 muxes per cycle. Fig. 19(b) shows eight DFE summation blocks that resolve 8 bits in one clock cycle. To resolve all 16 bits, we created another stage of speculation.
In the next stage, we speculate on the two previous-bit inputs to the DFE Sum x8 blocks. As shown in Fig. 20, the two previous bits drive the first four parallel DFE Sum x8 blocks in a speculative structure, which resolves the first 8 bits. The last two bits of this first stage then drive a second set of four DFE Sum x8 blocks, which resolves the remaining 8 bits. In the end, the complete DFE has a latency of three cycles.
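The first level of speculation can be checked against a straightforward DFE in software. The sketch below is a behavioral illustration only, not the pipelined hardware: it assumes bits valued +1 and -1, a decision threshold of zero, and initial previous bits of +1, and the function and tap names are hypothetical.

```python
import random

def dfe_direct(samples, tap1, tap2, prev1=1, prev2=1):
    """Reference 2-tap DFE: subtract the ISI of the two previous bits, then slice."""
    bits = []
    for s in samples:
        corrected = s - tap1 * prev1 - tap2 * prev2
        bit = 1 if corrected >= 0 else -1
        bits.append(bit)
        prev2, prev1 = prev1, bit
    return bits

def dfe_speculative(samples, tap1, tap2, prev1=1, prev2=1):
    """Same decisions, with the subtraction moved out of the feedback path."""
    # Stage 1: for every sample, precompute the decision for all four
    # combinations of the two previous bits (no dependence on earlier decisions).
    candidates = [
        {(p1, p2): (1 if s - tap1 * p1 - tap2 * p2 >= 0 else -1)
         for p1 in (-1, 1) for p2 in (-1, 1)}
        for s in samples
    ]
    # Stage 2: a chain of selections (the muxes) picks the valid candidate.
    bits = []
    for cand in candidates:
        bit = cand[(prev1, prev2)]
        bits.append(bit)
        prev2, prev1 = prev1, bit
    return bits

rng = random.Random(0)
samples = [rng.uniform(-2.0, 2.0) for _ in range(1000)]
print(dfe_direct(samples, 0.9, 0.2) == dfe_speculative(samples, 0.9, 0.2))
```

Only the selection chain in the second stage depends on earlier decisions, which mirrors why the muxes rather than the adders limit how many bits can be resolved per cycle.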
F. Loop Filter
The loop filter is a conventional proportional-integral controller as shown in Fig. 21 . The parallel PD outputs are summed together and the result is scaled by configurable proportional and integral gains. The saturating counter is sized to handle up to 1900 ppm of frequency offset. At the output, the 5-bit phase counter produces the recovered CDR phase as discrete values ranging from 0 to 31 which are fed back to the data interpolator block, closing the CDR feedback loop.
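A behavioral sketch of such a proportional-integral filter is given below. It is only a generic illustration: the gain values, the saturation limit, the number of fractional phase bits, and the way the two paths are added into the phase accumulator are all assumptions and are not taken from Fig. 21.

```python
class LoopFilter:
    """Proportional-integral loop filter with a saturating integrator and a wrapping phase."""

    def __init__(self, kp, ki, int_limit, phase_bits=5, frac_bits=10):
        self.kp = kp                      # proportional gain (hypothetical value)
        self.ki = ki                      # integral gain (hypothetical value)
        self.int_limit = int_limit        # saturation limit of the integral path
        self.integrator = 0               # saturating counter
        self.frac_bits = frac_bits        # assumed fractional resolution of the phase
        self.modulus = 1 << (phase_bits + frac_bits)
        self.acc = 0                      # phase accumulator, wraps at 1UI

    def update(self, pd_outputs):
        """Process one cycle of parallel PD outputs and return the 5-bit phase."""
        err = sum(pd_outputs)
        self.integrator = max(-self.int_limit,
                              min(self.int_limit, self.integrator + self.ki * err))
        self.acc = (self.acc + self.kp * err + self.integrator) % self.modulus
        return self.acc >> self.frac_bits

lf = LoopFilter(kp=4, ki=1, int_limit=50)
print([lf.update([1] * 16) for _ in range(5)], lf.integrator)
```

The integrator holds the frequency-offset estimate, so once it settles the phase keeps ramping and wrapping even when the PD output averages to zero; the saturation limit bounds the offset that the loop can follow.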
V. SIMULATION AND MEASUREMENT RESULTS
Here, we will show, through simulation, that the feedback loop converges correctly, how the system can be modified for a more attenuative channel, and how the system tolerates jitter. Next, we will show the measured eye diagrams and measured jitter tolerance of the proposed CDR.
Fig. 22 illustrates the loop dynamics by showing the transient signals in the loop filter. When the system in Fig. 7 starts up, it appears that the MMPD relies on correctly recovered data to estimate phase and, at the same time, the DFE requires a correct phase to recover the data. To verify that the feedback loop does not enter into a deadlock, we have applied an input with 1000 ppm of frequency offset so as to start the loop with both phase and data errors. The proportional gain and saturating counter outputs are, respectively, the outputs of the proportional and integral paths in the loop filter. The cycle-slipping causes the saturating counter to temporarily decrease at times, but the saturating counter settles to a value corresponding to 1000 ppm within 4 μs. The up/down signal increments or decrements the recovered phase. In steady state, the phase ramps from 0 to 31 and wraps around in order to track the frequency offset. After 3 μs, the phase is sufficiently close to the center of the eye to recover the data correctly (i.e., no more bit errors).
In simulation, the digital CDR has a CID tolerance of approximately 1600UIs when the input has SSC modulation with 1000 ppm of frequency offset at 32 kHz. The CID tolerance is mainly limited by low-frequency jitter from the SSC modulation and the error at the output of the saturating counter in Fig. 21 (which can be caused, for example, by noise from the MMPD).
As discussed in Section II, the receiver relies on ISI which spreads the pulse response beyond 1UI. We demonstrate through simulation that the 1× blind CDR can work in two cases. In cases where the channel attenuation is low (i.e., there is not enough ISI produced by the channel), we rely on the 2UI I&D to produce the ISI. This situation is demonstrated in Fig. 23, which shows the combined frequency response of a low-attenuation Channel A followed by its associated 2UI I&D filter. In contrast, where the channel is attenuative by itself (i.e., there is enough ISI produced by the channel), we no longer need the 2UI I&D to produce extra ISI; in fact, we require equalization to reduce ISI. This situation is demonstrated by Channel B in Fig. 23. Simulations show that the 1× blind CDR works in both of these cases. If the CDR will be used in applications with a wide variety of channels, then, ideally, the front-end filter should be adaptive such that it decreases the amount of post-cursor ISI generated when the channel has more high-frequency loss and, therefore, reduces the required equalization. However, an adaptive filter is beyond the scope of this work. Our test chip, which we describe later, demonstrates only the first case (i.e., low-attenuation channel with 2UI I&D).
Figs. 24 and 25 show the eye diagrams from simulations done in Simulink using event-driven models [17]. The data source is 10 Gb/s and has 0.17UI of random jitter. Similarly, the blind receiver clock is simulated with 0.23UI of random jitter. The two leftmost eye diagrams in Fig. 24 show the data eye after Channel A (low attenuation) and I&D. The 5-bit ADC quantizes the samples into discrete values from 0 to 31. The eyes are still open because the analog 1UI I&D does not add much attenuation. The filter adds further ISI and closes the eye. In order to obtain the eye diagrams in the digital CDR, we break the feedback loop and set the phase to 0.5UI. This forces the desired sample halfway between the blind samples, and the data interpolator produces the worst-case interpolation error in this condition. The open eye after the DFE adder shows that the data can be successfully recovered.
Fig. 25 demonstrates that the system can recover the data with Channel B without the I&D filter; however, it requires a 20-tap DFE. This large number of taps is necessary for Channel B because it introduces a long tail of ISI. This is not the case for Channel A with the 2UI I&D because it produces far less ISI. Alternatively, an FFE could also be used to suppress the long-tail ISI and reduce the number of DFE taps required for Channel B.
Fig. 26 compares the simulated jitter tolerance for each of the two channels. The simulation assumes a bit error rate (BER) of . The high-frequency jitter tolerance of the system in Fig. 25 (Channel B) is slightly below that of the system in Fig. 24 (Channel A + 2UI I&D). We also note that the former has a lower CDR bandwidth compared to the latter, which is caused by a lower PD gain. Compared to Channel A, Channel B further spreads out the pulse response, which reduces the PD gain (i.e., the slope of the MM function).
We implemented the proposed receiver in Fujitsu's 65-nm CMOS process. Fig. 27 is a photograph of the test chip. The I&D, clock generator, and ADC are custom-designed analog blocks. The digital CDR was designed using Verilog RTL and implemented with standard cell gates. For jitter tolerance measurements, we apply sinusoidal jitter to the transmitter clock. Fig. 29 shows the average ADC output when the I&D is given a DC input. On one test chip, we observed that one of the interleaved front end blocks had a lower gain compared with the other blocks as we varied the DC input. As discussed in Section IV, the gain error is mostly caused by systematic clock skew. If left uncompensated, the skew will reduce the CDR's jitter tolerance. Hence, we manually adjusted the delays in the clock generator. Fig. 29(b) shows that the gain at the output of ADC 3 matches more closely with gain of the other interleaved blocks after skew correction.
Our measurements were performed with a 48-in SMA cable as the channel; its frequency response is plotted in Fig. 30. Fig. 31(a) shows the data eye at the output of the channel. Fig. 31(b) shows the eye diagrams taken from the outputs of the interleaved ADCs. The signal has been partially attenuated by the analog 1UI I&D. There is some mismatch between the four interleaved analog front ends, but the digital CDR is able to tolerate this, as demonstrated in the jitter tolerance measurement.
We measured jitter tolerance after skew correction and with a maximum BER of 10 at 10 Gb/s. In Fig. 32, we show the results given -300, 0, 300, and 1000 ppm of frequency offset. A negative frequency offset means that the transmitter is slower than the blind receiver clock (i.e., above baud-rate sampling). A positive frequency offset means that the transmitter is faster than the blind receiver clock; this case is worse for jitter tolerance since we are actually sampling slightly below baud rate. During measurement, we were able to push the frequency offset to 1000 ppm with only a slight degradation in jitter tolerance. Fig. 32 also compares the measurement results against a simulation using the measured channel response (Fig. 30) with 300 ppm of frequency offset. Due to simulation time constraints, the simulation assumes a maximum BER of 10 . For this reason, the simulated jitter tolerance is higher compared with the measured results. We also show the jitter tolerance mask for XL-Attachment-Unit-Interface (XLAUI) in Fig. 32. Although we did not specifically target Ethernet applications in the proposed design, we provide the mask as a reference. Table I compares the proposed CDR with other baud-rate ADC-based CDRs published in [3]-[5].
VI. CONCLUSION
We have presented a 1× blind ADC-based CDR. In the proposed architecture, we recover data by extending the channel pulse response so that the pulse amplitude is greater than zero, no matter where the blind samples occur within a 1UI window. The receiver adds controlled ISI to the pulse response through the use of an I&D block in the receiver front end. The baud-rate design allows the CDR to operate at 10 Gb/s given a 10-GS/s sampling rate.
We fabricated the proposed design in a 65-nm CMOS process. The test chip successfully recovers 10-Gb/s data with BER below 10 . Jitter tolerance measurements show that the CDR implementation can recover data with below-baud-rate sampling; the CDR operates with 300 ppm of frequency offset and a high-frequency jitter tolerance of 0.19UI.
APPENDIX DERIVATION OF PULSE RESPONSE SAMPLES
Let be the received signal, be the combined pulse response of the transmitter, channel, and receiver, be the sampled signal, and be the resolved bit. The data is assumed to be binary , independent, and equiprobable :
(1) (2) Substitute (2) into (1) Since the data bits are independent and uncorrelated, we have if if (6) Now, substitute (6) into (5) to obtain
Similarly (8)
(10)
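For reference, the standard Mueller-Muller argument behind these relations can be sketched as follows. The notation is an assumption made here and may differ from the symbols and equation numbers of the original: r(t) is the received signal, h(t) the combined pulse response, T the bit period, τ the sampling phase, x_k the sampled signal, and d_k ∈ {+1, -1} the data.

```latex
\begin{align*}
r(t) &= \sum_{n} d_n \, h(t - nT), \\
x_k &= r(kT + \tau) = \sum_{n} d_n \, h_{k-n},
      \qquad h_m \triangleq h(mT + \tau), \\
E[d_n d_k] &=
  \begin{cases}
    1, & n = k \\
    0, & n \neq k
  \end{cases} \\
E[x_k d_k] &= \sum_{n} h_{k-n} \, E[d_n d_k] = h_0, \\
E[x_k d_{k-1}] &= \sum_{n} h_{k-n} \, E[d_n d_{k-1}] = h_1, \\
E[x_{k+1} d_k] &= \sum_{n} h_{k+1-n} \, E[d_n d_k] = h_1 .
\end{align*}
```

The last two lines show why the estimate of the first post-cursor is not unique: pairing the next sample with the current bit gives the same expected value as pairing the current sample with the previous bit.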
