Abstract-Recently, there has been a strong drive to replace established analog circuits for multi-gigabit clock and data recovery (CDR) by more digital solutions. We focused on phase locked loop-based all-digital CDR (AD-CDR) techniques which contain a digital loop filter (DLF) and a digital controlled oscillator (DCO) and pushed the digital integration up to a level where our DLF is entirely synthesized. To enable this, we found that extensive subsampling can be used to decrease the speed of the DLF while maintaining good operation. Additionally, an Inverse Alexander phase detector and a 5.5-bit resolution DCO complete the AD-CDR architecture. As a result of the low complexity and digital architecture, the AD-CDR occupies a compact active chip area of 0.050 mm² and consumes only 46 mW at 25 Gb/s. This is the smallest area and the lowest power consumption compared with the state-of-the-art. In addition, our implementation is highly tunable due to the synthesized logic, and supports a wide operating range (12.5-25 Gb/s), which is a significantly larger range than in previous work. Finally, thanks to our digital architecture, the power dissipation decreases linearly when moving to the lower speeds of our operating range. This is in contrast with most prior work, making our design truly adaptive.
Index Terms-All-digital clock and data recovery (AD-CDR), digital controlled oscillator (DCO), digital loop filter (DLF),
Inverse Alexander phase detector (PD), subsampling, synthesis.
I. INTRODUCTION

IN MULTI-GIGABIT data communication links, the data is serially transmitted to the receiver without any accompanying clock. This clock has to be recovered at the receiver side in order to sample and process the received data. Therefore, a clock and data recovery (CDR) circuit is an essential component in such a high-speed receiver, and the design and the performance of the CDR has a significant influence on the overall operation of the link [1].
The need for low cost and high integration mandates that the CDR should be implemented in a deep-submicrometer technology. However, it is hard to achieve high performance for classical analog CDRs in today's modern technologies [2] . Therefore, digital CDRs have become increasingly important for high-speed data communication. A digital CDR eliminates the need for a large loop filter capacitor used in classical analog CDRs. Instead, a digital CDR uses a compact digital loop filter (DLF) which can realize large time-constants without any additional cost in area. Additionally, a DLF is tolerant to process, voltage, and temperature variations and is noise insensitive. The filter is also easily scalable, portable across CMOS technologies, and highly adaptable. Therefore, a digital CDR is the optimal choice for a high speed receiver implemented in a deep-submicrometer technology and has been a major area of research interest in recent years [2] - [12] .
We focus on a subset of these digital CDRs, i.e., so-called all-digital clock and data recovery (AD-CDR) circuits. AD-CDRs are derived from the first all-digital phase locked loop (AD-PLL) introduced in [13] , and comprise a phase detector (PD) and a digital controlled oscillator (DCO) in addition to a DLF [2] - [4] , [14] - [18] . PLL-based CDR circuits have the advantage over alternative digital-friendly CDRs that they have intrinsically a wide frequency capture range due to the ability to adapt both phase and frequency [19] . Additionally, they benefit from a wide bandwidth and have the ability to reject input jitter [20] .
The only problem is that the DLF, which consists of a proportional and an integral path, typically cannot operate at the tens of Gb/s data rate. In prior work, the speed of the integral path of the DLF is reduced by using demultiplexing [2] , [18] or subsampling [3] , [21] . However, the proportional path still runs at a high speed and due to this, these blocks had to be designed and laid out by hand, largely counteracting the advantages of a digital design which ultimately should allow automatic synthesis.
There is only one very recent related work [4] where the digital block is entirely synthesized. To accommodate this synthesis, the input of the digital loop is heavily demultiplexed into many parallel lanes, but this has disadvantages: a large number of parallel samplers are needed to process the high-speed data input, and this in turn requires a considerable clock distribution network. Moreover, the huge number of samples has to be processed by a complex signal processing block. This increases the power consumption and chip area: the work in [4], which includes a CTLE and a DFE, has an area which is 10 times larger than our work. Additionally, the power consumption per bit is more than 75% higher than our work.
0018-9200 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
In this paper, we use extensive subsampling [22] instead of demultiplexing to reduce the operating speed of the entire DLF. This enables us to push the digital integration up to a level where our DLF is entirely synthesized without requiring complex signal processing. To demonstrate the correct operation, we implemented a 25-Gb/s PLL-based AD-CDR circuit. This AD-CDR features an Inverse Alexander PD which is low power, simple, fast, and accurate. In particular, this PD shows improved performance in simulations over the conventional Alexander PD when subsampling is used [23] . In this paper, we complement the earlier theoretical work by presenting the first experimental verification of this Inverse Alexander PD. The last building block of the AD-CDR is a low-resolution digital controlled ring oscillator. We demonstrate that a resolution as low as 5.5 bit can be used without degrading the performance of the AD-CDR.
Thanks to the highly digital architecture, the active die area is very compact and only occupies 0.050 mm², which is significantly smaller than competing work [2] - [11]. Moreover, the power efficiency of the CDR core is 1.8 pJ/b, which is also better than the state-of-the-art [2] - [11]. Additionally, the AD-CDR is highly adaptable, i.e., the characteristics of the loop filter can be tuned to satisfy multiple jitter tolerance (JTOL) specifications. Moreover, the operating range can be varied from 12.5 to 25 Gb/s, which is the broadest operating range of any digital CDR that does not use a high-quality, multi-gigahertz reference clock. Due to the truly digital, frequency-adaptable nature of the design, the power consumption decreases linearly with the operating data rate. This means that when the data rate is reduced, the power consumption goes down accordingly and, hence, an excellent power efficiency is maintained over the entire operating range: e.g., at 25 Gb/s, the power consumption is 46 mW, while at 12.5 Gb/s, it is 23 mW.
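The quoted power efficiency can be checked by dividing the power consumption by the data rate. The short Python sketch below does this for the two operating points given above; the function name is illustrative only and not part of the original work.

```python
def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """Energy per bit in pJ/b. Units: mW / (Gb/s) = 1e-3 W / 1e9 b/s = pJ/b."""
    return power_mw / rate_gbps


# Operating points taken from the text: (data rate in Gb/s, power in mW).
for rate_gbps, power_mw in [(25.0, 46.0), (12.5, 23.0)]:
    print(f"{rate_gbps} Gb/s: {energy_per_bit_pj(power_mw, rate_gbps):.2f} pJ/b")
```

Both operating points give the same energy per bit, which is what a power consumption that scales linearly with data rate implies.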
The remainder of this paper is organized as follows. Section II presents the AD-CDR architecture. In Section III, the detailed circuit implementation in a 40-nm low power CMOS process is discussed. The experimental results of our 12.5-to-25-Gb/s AD-CDR circuit are summarized in Section IV, and Section V concludes this paper.
II. ALL-DIGITAL CLOCK AND DATA RECOVERY ARCHITECTURE
The overall architecture of our AD-CDR is shown in Fig. 1. It consists of a bang-bang PD (BB-PD), subsampler, DLF, and DCO. The BB-PD determines the phase difference between edges in the input data stream (D in ) and the recovered clock (Clk) signal. When the clock is leading the input data, an Early signal is generated to decrease the frequency of the recovered clock. Alternatively, when the clock is lagging, the BB-PD outputs a Late signal to increase the frequency of the recovered clock. These Early and Late signals are subsampled by a factor of N and then filtered by the DLF. The resulting signal controls the DCO such that the phase error is reduced. Note that if no data transition occurs, the BB-PD cannot determine if the clock leads or lags the data and, therefore, does not generate any signal. Consequently, the DCO is not adjusted.
A. Bang-Bang Phase Detector
Alexander bang-bang phase detectors are typically used in high-speed CDR circuits because they provide simplicity in design, good phase adjustment, and can work at high speeds [24] . Additionally, these BB-PDs have the advantage that the output is already digital, making this type of PD very suitable to drive the DLF.
Recently, the Inverse Alexander PD was proposed as an improvement over this established and well-known circuit [23] . An elaborate comparison between the conventional and Inverse Alexander PD is given in the following sections.
1) Comparison of Alexander and Inverse Alexander PD:
The conventional Alexander phase detection is based on three successive data samples which are sampled at twice the data clock frequency. In the basic block diagram illustrated by Fig. 2, this is done by sampling the data both on the rising and the falling edges of the recovered clock Clk. By monitoring the differences between the three sampled values, it can be detected whether a data edge has occurred and if this data edge occurs before or after the corresponding clock edge. For the actual phase detection, the three successive samples, available at nodes S 0 , S 1 , and S 2 are used. To understand the operation, three possible waveforms are considered in Fig. 3. First, the ideal locking condition is shown in Fig. 3(a). In this case, the value of sample S 1 is undefined and in practice due to noise the PD will randomly produce an Early or a Late pulse. Fig. 3(b) shows the case where the clock edge leads the data edge (Early), and Fig. 3(c) shows the case where the clock edge lags the data edge (Late). In the absence of data transitions (not shown in the figure), all three samples S 0 , S 1 , and S 2 are equal and the XOR gates (Fig. 2) will set both the Early and the Late signals to zero. These relations are summarized as [25]
$$\text{Early} = S_1 \oplus S_2, \qquad \text{Late} = S_0 \oplus S_1.$$
Fig. 3(a) shows that once the CDR has settled, the samples S 0 and S 2 correspond to two successive data output (D out ) samples, while sample S 1 occurs at the transition of the data.
The proposed Inverse Alexander PD is also shown in Fig. 2 and obviously has the same schematic as the Alexander PD, but the Early and the Late signal are interchanged, which leads to an inversion of the sign in the CDR loop.
The inversion of the sign in the CDR loop causes the CDR to settle to a different equilibrium point. As shown in Fig. 4(a) , the Inverse Alexander PD will align the rising edges of the clock signal with the data edges. If the rising edge of the clock leads (is Early), the first sample, S 0 , is unequal to the last two and the clock frequency must decrease [ Fig. 4(b) ]. Vice versa, if the rising edge of the clock lags (is Late), the last sample, S 2 differs from the first two and the clock frequency must increase [ Fig. 4(c) ]. In lock, the middle sample, S 1 , corresponds with the data sample D out while the other sample moments S 0 and S 2 occur at the data transitions.
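The decision logic of both detectors can be sketched in a few lines of Python. This is a behavioral illustration only, assuming ideal 1-bit samples and two XOR gates whose outputs are interchanged between the two variants; the function name and the `inverse` flag are hypothetical.

```python
def alexander_pd(s0: int, s1: int, s2: int, inverse: bool = False):
    """Return (early, late) for three consecutive 1-bit samples S0, S1, S2."""
    first_differs = s0 ^ s1  # a transition lies between S0 and S1
    last_differs = s1 ^ s2   # a transition lies between S1 and S2
    if inverse:
        # Inverse Alexander: S0 unequal to the last two samples means Early,
        # S2 unequal to the first two samples means Late.
        return first_differs, last_differs
    # Conventional Alexander: the Early and Late outputs are interchanged.
    return last_differs, first_differs


print(alexander_pd(0, 1, 1, inverse=True))  # Early for the Inverse Alexander PD
print(alexander_pd(0, 1, 1))                # the same samples read as Late conventionally
print(alexander_pd(1, 1, 1))                # no transition: neither output is set
```

When all three samples differ pairwise in sequence (for example 0, 1, 0), both outputs are active at once, which is the simultaneous Early and Late state discussed later for duty-cycle distortion.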
2) PD Characteristics Comparison-Full-Rate Operation: The output characteristics of both the conventional and the Inverse Alexander PD are shown in Fig. 5 (a) and (b). Here it is assumed that all waveforms are ideal (as in Figs. 3 and 4). If an edge occurs, either a 1-bit Early or Late pulse will be generated, which for both phase detectors results in the well-known bang-bang action. For both PDs, there is only one stable locking point, which corresponds to a phase shift of half a UI (unit interval) for the conventional and to zero phase shift for the Inverse Alexander PD (also indicated on the figure).
However, in practice the waveforms are not ideal and several imperfections occur, such as phase noise on the recovered clock and nonideal input data waveforms that exhibit pulse-width jitter and unequal rise and fall times, which translates to duty-cycle distortion (DCD). All these effects affect the behavior of both PDs. At full rate, the difference between the conventional and Inverse Alexander PD is negligible, but when the PD is subsampled, the difference becomes pronounced. To illustrate the phenomenon, we will discuss the case of DCD.
DCD means that the duration of a logic-0 differs from the duration of a logic-1 [26]. The notations T 0 and T 1 are used to represent, respectively, the duration of an occurrence of a single logic-0 and a single logic-1 affected by DCD, where the sum of T 0 and T 1 always equals 2 UI. Note that when T 1 equals 1 UI, there is no DCD and when T 1 < 0.5 UI, the DCD is too large to have any useful operation of the CDR. The reciprocal case, when T 1 > T 0 , is analogous. To examine the influence of DCD, the output characteristics of both PDs are determined and shown in Fig. 5 (c) and (d) for the artificial case of a data stream with a single logic-1 data pulse. This means that there are two consecutive data transitions. When examining this case, it turns out that apart from the normal Early and Late cases, two anomalous states occur. The first anomalous case, shown in Fig. 6 (a), occurs around the locking point of the conventional PD. Here an Early pulse is immediately followed by a Late pulse. If the PD is operated at full speed, this will be filtered by the low-pass loop filter and essentially translate into a net null action. The second anomalous case is shown in Fig. 6 (b) and is most relevant for the Inverse Alexander variant. Both the Early and Late signals are simultaneously active. This is normally an illegal state, but in practice most loop filters (e.g., the popular charge pump [25] and also the DLF used in our prototype) deal with this situation by interpreting this as a net null action.
Both these anomalous cases occur for phase errors near the equilibrium locking point and broaden the locking point into a locking region, which is illustrated in Fig. 5 (c) and (d). For the conventional Alexander PD, the locking range corresponds to the Early immediately followed by Late case, whereas for the Inverse Alexander, the locking range corresponds to the simultaneous both Early and Late case. Despite this difference, both cases are almost equivalent when the PDs are operated at full rate.
3) PD Characteristics Comparison-Subsampled Operation: When the PD is subsampled only one out of N of the PD output values is used. When we study the PD characteristics for the case of ideal waveforms, we still obtain the same result as Fig. 5 (a) and (b). However, in the case of DCD, the simultaneous both Early and Late case remains unchanged but the Early immediately followed by Late case is altered: since one of the two successive samples will be lost and since the data are not correlated with the subsampling process, either Early or Late will be randomly selected as shown in Fig. 7 . This means that a significant amount of excess random jitter is injected in the loop which will increase the probability of bit errors. This problem occurs in the locking region of the conventional Alexander PD and not for the Inverse Alexander PD. For this reason, the Inverse Alexander PD is expected to have a greatly improved performance when the PD is subsampled [23] . Therefore, the Inverse Alexander PD topology, proposed in [23] , is chosen to implement the BB-PD. 
B. Digital Controlled Oscillator
For the implementation of the DCO in our AD-CDR, a quarter-rate architecture [27] is used. This means that the DCO operates at one-fourth of the data speed, and provides the required sample-time resolution in the form of eight uniformly phase-shifted clock phases. This can conveniently be realized by a 4-stage differential ring oscillator (see Section III-C) and significantly relaxes the requirements on the clock buffers and BB-PD circuitry. For a 25-Gb/s data input, this means that the DCO frequency will be 6.25 GHz. To illustrate the quarter-rate operation, the waveforms of a "1010" data sequence and the eight clock phases are shown in Fig. 8 for the case of an Early clock.
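The quarter-rate numbers can be reproduced with simple arithmetic. The Python sketch below (function and variable names are illustrative) computes the DCO frequency, the spacing between adjacent clock phases, and the unit interval for a given data rate.

```python
def quarter_rate_clock(rate_gbps: float, n_phases: int = 8):
    """Return (DCO frequency in GHz, phase spacing in ps, unit interval in ps)."""
    f_dco_ghz = rate_gbps / 4.0          # quarter-rate: one clock period spans 4 bits
    period_ps = 1e3 / f_dco_ghz          # DCO period in picoseconds
    spacing_ps = period_ps / n_phases    # uniform spacing between adjacent phases
    ui_ps = 1e3 / rate_gbps              # duration of one bit
    return f_dco_ghz, spacing_ps, ui_ps


f_dco, spacing, ui = quarter_rate_clock(25.0)
print(f"DCO: {f_dco} GHz, phase spacing: {spacing} ps, UI: {ui} ps")
```

At 25 Gb/s the phase spacing equals half a unit interval, so eight phases provide one edge sample and one data sample for each of the four bits in a clock period.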
In the ideal locking condition, the even clock phases are perfectly aligned with the data edges, while the odd clock phases are in the middle of the data symbol, which is the ideal sample moment. Per clock period, there are four sets of three consecutive samples and each set of three consecutive samples can be used by the Inverse Alexander PD operation to generate an Early/Late signal. In our design, only one out of these four Early/Late signals is used, i.e., we only use clock phases Clk 0 , Clk 1 , and Clk 2 to gather the phase information (Early/Late). Of course, still all the data need to be recovered, which can be done using the odd clock phases to sample the input data. The net result is that clock phases Clk 4 and Clk 6 are not used and that the phase information is already subsampled by a factor of 4 in the PD.
C. Digital Loop Filter
A typical DLF consists of a proportional and integral path and can be described by the discrete-time transfer function H DLF (z) given by
$$H_{DLF}(z) = K_p z^{-D_{K_p}} + \frac{K_i z^{-D_{K_i}}}{1 - z^{-1}}$$
where K p and K i are the respective gains of the proportional path and integral path, and D K p and D K i are the corresponding delays. In our implemented DLF, we can adapt both the proportional and integral gain setting, while the delays are hard wired. The delays in the proportional path and in the integral path are, respectively, D K p = 2 and D K i = 9 digital clock cycles. Especially the delay in the proportional path should be limited in order to avoid stability issues, but with the expected jitter in the CDR loop, this delay (D K p = 2) is low enough to ensure its stability [28]. Note that this DLF is connected directly following the subsampling block (Fig. 1) to allow automatic synthesis of the entire DLF. Consequently, the proportional and integral path are equally affected by the subsampling.
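A behavioral model of such a proportional-integral filter can be sketched as follows. The sketch assumes the common form in which the output is the delayed input scaled by K p plus an accumulator fed by the delayed input scaled by K i ; the class name, the input encoding (+1 for Late, -1 for Early, 0 for no pulse), and the floating-point arithmetic are assumptions for illustration and do not describe the synthesized implementation.

```python
class DigitalLoopFilter:
    """Behavioral PI loop filter with separate delays in both paths."""

    def __init__(self, kp: float, ki: float, d_kp: int = 2, d_ki: int = 9):
        self.kp, self.ki = kp, ki
        self.d_kp, self.d_ki = d_kp, d_ki
        self.history = []  # all past inputs, newest last
        self.acc = 0.0     # integral-path accumulator

    def _delayed(self, delay: int) -> float:
        # Input from `delay` cycles ago; zero before the first sample.
        idx = len(self.history) - 1 - delay
        return self.history[idx] if idx >= 0 else 0

    def step(self, x: float) -> float:
        """Advance one DLF clock cycle with input x and return the output."""
        self.history.append(x)
        self.acc += self.ki * self._delayed(self.d_ki)
        return self.kp * self._delayed(self.d_kp) + self.acc


# Impulse response: a single Late pulse at cycle 0.
dlf = DigitalLoopFilter(kp=4, ki=0.25)
response = [dlf.step(1 if n == 0 else 0) for n in range(12)]
print(response)
```

In the impulse response the proportional contribution appears after two cycles and the integral contribution persists from cycle nine onward, matching the hard-wired delays quoted in the text.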
D. Subsampling
In the 40-nm low power CMOS process used in this paper, the maximal clock speed should not exceed 1.75 GHz to enable an automated design (synthesis, place, and route) of this DLF. This means that, even with the subsampling by a factor of 4 that already occurs in our PD implementation, the operating frequency at the output of the BB-PD is still too high, e.g., if the CDR operates at 25 Gb/s, the output of the BB-PD operates at 6.25 GHz. Hence, this operating frequency has to be further reduced to facilitate the implementation of the DLF. Therefore, the output of the BB-PD is additionally subsampled by a factor of 4. Overall, this means that the DLF will only receive an output signal of the PD once out of every N (= 16) data periods. In Fig. 1 , the subsampling corresponds to the block "↓ N."
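The clock-rate budget can be checked numerically. The Python sketch below uses the 1.75-GHz limit and the two subsampling factors of 4 from the text; restricting the search to power-of-two factors is an assumption made here because each divider stage halves the clock, and the function names are illustrative.

```python
MAX_SYNTH_CLOCK_GHZ = 1.75  # synthesis limit quoted for the 40-nm LP process


def dlf_clock_ghz(rate_gbps: float, pd_factor: int = 4, extra_factor: int = 4) -> float:
    """DLF clock after subsampling in the PD and in the extra subsampling block."""
    return rate_gbps / (pd_factor * extra_factor)


def smallest_power_of_two_factor(rate_gbps: float,
                                 limit_ghz: float = MAX_SYNTH_CLOCK_GHZ) -> int:
    """Smallest power-of-two N for which rate / N does not exceed the limit."""
    n = 1
    while rate_gbps / n > limit_ghz:
        n *= 2
    return n


print(dlf_clock_ghz(25.0))                 # DLF clock in GHz at 25 Gb/s
print(smallest_power_of_two_factor(25.0))  # smallest power-of-two N meeting the limit
```

At 25 Gb/s, a factor of 8 would leave the DLF clock at 3.125 GHz, above the limit, so 16 is the smallest power-of-two factor that allows automated synthesis.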
Although a higher level of subsampling would further reduce the area and the power of the DLF, a higher subsample factor will not lead to an overall optimal power efficiency. This is because the CDR should be able to deal with data sequences where the BB-PD does not receive data edges (and, therefore, does not generate any Early or Late signal).
In the case without subsampling, this occurs if the input data contains a long sequence of consecutive identical digits (CIDs). If this happens, the output of the PD is stuck at zero and the feedback is broken such that the CDR operates temporarily in open loop. This means that the oscillator runs freely and any frequency difference between the input data rate and the recovered clock frequency will cause a linear increase or decrease of the phase difference over time. In a prolonged open-loop situation, this phase drift will exceed a unit interval, causing the AD-CDR to lose its lock, which means that the CDR operation is disrupted. For an input sequence of k CIDs, the idle time of the CDR without subsampling is given by
$$T_{idle} = \frac{k}{f_{data}}$$
where f data corresponds to the input data rate.
To tolerate a long idle sequence, the DCO must have a sufficiently high resolution such that quantization error is small. This way, the DCO frequency will be closer to the desired input data frequency. And hence, when the loop temporarily opens due to an idle sequence, the corresponding phase drift will remain acceptable.
Another effect that lowers the maximum tolerable idle sequence is given by the random walk process of the phase of the recovered clock during open-loop operation [29] . Lowering the phase noise of the DCO will reduce this random walk process.
In the case of subsampling, the loop filter operates at lower frequency and the total idle time T idle corresponding to a CID input sequence of length k will be
$$T_{idle} = \left\lceil \frac{k}{N} \right\rceil \cdot \frac{N}{f_{data}}$$
This means that the idle time due to k CID input bits for the case with subsampling is almost equal to the case without subsampling.
However, regardless of the CIDs in the full-rate input data, it can happen that after subsampling the phase detection output consists of a long idle sequence of length l (without any Early or Late pulse). For example, for the popular PRBS31 test sequence, it can be shown that for a subsample factor that is a power of 2, there will always be an idle sequence in the subsampled PD output of length l = 31. Now, the corresponding idle time is proportional to the subsampling factor N
$$T_{idle} = \frac{l \cdot N}{f_{data}}$$
This means that the tolerance to a long idle sequence will become worse for increasing value of the subsampling factor N. To maintain adequate robustness to long idle sequences for an increasing value of the subsampling factor N, the DCO phase noise and resolution should be improved accordingly. This indicates that there is a trade-off for the subsample factor N in the sense that increasing N will decrease the power consumption of the DLF but increase the required power consumption in the DCO. From behavioral simulations, we found that choosing N = 16 is an adequate compromise. According to simulation, with this setting, our circuit should be able to tolerate input data streams which after PD subsampling have an idle (subsampled) sequence length l of over 100.
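To give a feeling for the time scales involved, the Python sketch below evaluates the idle time under the assumption that each idle PD output lasts N bit periods, so that an idle run of a given length lasts length times N divided by the data rate. The function name is illustrative; the sequence lengths and rates are taken from the text.

```python
def idle_time_ns(idle_len: int, rate_gbps: float, n: int = 1) -> float:
    """Open-loop time in ns for idle_len consecutive idle PD outputs,
    where the PD output is used once every n data bits."""
    return idle_len * n / rate_gbps


# 31 consecutive identical digits at full rate (no subsampling), 25 Gb/s.
print(idle_time_ns(31, 25.0))
# Idle run of l = 31 in the subsampled PD output with N = 16 (PRBS31 case).
print(idle_time_ns(31, 25.0, n=16))
# Idle run of l = 100 with N = 16, the simulated tolerance target.
print(idle_time_ns(100, 25.0, n=16))
```

The same run length costs N times more open-loop time once it appears after subsampling, which is why the DCO requirements tighten as N grows.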
III. CIRCUIT IMPLEMENTATION
The top-level implementation of our AD-CDR is shown in Fig. 9 . In our physical partitioning, we tried to maximally exploit automated digital tools. Therefore, we pushed part of the BB-PD after the subsampling such that it could also be automatically synthesized. The result is that the BB-PD and the subsampling block are intertwined. The implementation consists of six high-speed samplers followed by a retiming block, a subsampling block, and the (automatically synthesized) "phase detection logic." Additionally, the AD-CDR comprises an automatically synthesized DLF, a clock divider, and a DCO.
The six high-speed samplers are driven each by their own 6.25 GHz clock phase coming from the DCO. Four samplers out of six are used to sample the data, while the other two samplers are used to sample the edges. As mentioned above, two out of eight uniformly phase-shifted DCO clock phases are not used.
In the retiming block, all the collected samples (i.e., four data samples and two edge samples) are aligned to one clock phase. The retimed samples of the data constitute the recovered data (the actual CDR output), while the phase information, which adjusts the CDR to reduce the phase error, is subsampled to 1.56 Gb/s. This phase information is sent to the synthesized digital block (running at 1.56 GHz) where first, the phase detection logic calculates the Early and Late signals. These are then further processed by the DLF which controls the quarter-rate DCO.
A. BB-PD and Subsampling
The implementation of the BB-PD and subsampling comprises two parts: a full custom designed block and the automatically synthesized phase detection logic. A more detailed view of the full custom block consisting of the high-speed samplers, the retiming block, and the subsampling block is given in Fig. 10.
Fig. 9. Block diagram of AD-CDR implementation (speeds are indicated for 25 Gb/s operation). Red is used for edge-related samples and black for data-related samples (as in Fig. 4).
Fig. 10. Detail of the full custom part of the BB-PD and Subsampling, which contains six samplers, a retiming block, and a subsampling block (speeds are indicated for 25 Gb/s operation).
1) Sampler:
First, the incoming data is sampled with a high-speed sampler which is implemented as a sense amplifier-based flip-flop [30] - [35]. The sense amplifier-based flip-flop has a fast sense amplifier input with a short capture window followed by a slower regenerative latch (Fig. 11). This makes it an ideal choice for a subsampling stage, which needs to capture the high-speed input data very quickly, but has relaxed requirements on the clock-to-output delay. The device sizes of the sense amplifier-based flip-flop shown in Fig. 11 are summarized in Table I.
2) Retiming: The six sampled data signals (four corresponding to actual data samples and two corresponding to edge samples) are sent to the retiming block, which aligns the samples to one clock phase. For this, two types of dynamic flip-flops clocked with the opposite clock edge are used. The sampled input data from clock phases zero to three is retimed by an array of positive edge triggered dynamic flip-flops of type I (Fig. 12). This is a standard dynamic flip-flop, shown in Fig. 13(a). Three of these retimed samples, which contain the information of two edges (Edge 0 and Edge 1 ) and one intermediate data symbol (D out0 ), are used for the phase alignment, but they first have to be subsampled (see Section III-A-3).
To relax the timing requirements of the flip-flops, the sampled input data from clock phases five and seven is retimed by an array of type II (negative edge triggered) dynamic flip-flops (Fig. 12). This type is clocked with the opposite clock edge compared with type I, but an additional half-clock cycle delay is incorporated [Fig. 13(b)] such that all samples are retimed to the same clock edge. The device sizes of the dynamic flip-flops shown in Fig. 13 are summarized in Table II.
3) Subsampling: Before the phase alignment information can be sent to the digital block, this information has to be subsampled by a factor of 4 (Fig. 10). The subsampling is performed in two steps (Fig. 14), where for each step the clock frequency is first divided by two and then applied as clock signal to an array of three type I dynamic flip-flops. Because the input data of the flip-flops is twice the speed of the corresponding clock input, the data is subsampled by a factor of 2. Overall, the input data is thus subsampled by a factor of 4 and the clock signal is divided by four. This divided clock is used as clock signal for the digital block.
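The two-step subsampling can be mimicked on a list of samples. In the Python sketch below, each stage keeps every second sample, as a flip-flop clocked at half its input rate would; which of the two alternating samples survives depends on the divider phase and is chosen arbitrarily here, and the function names are illustrative.

```python
def divide_by_two_stage(samples):
    """One subsampling stage: a flip-flop clocked at half the input rate
    captures every second input sample (phase chosen arbitrarily here)."""
    return samples[::2]


def subsample_by_four(samples):
    """Two cascaded divide-by-two stages give an overall factor of 4."""
    return divide_by_two_stage(divide_by_two_stage(samples))


# Sample indices at the 6.25-GHz rate; the survivors arrive at 1.5625 GHz.
print(subsample_by_four(list(range(16))))
```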
4) Digital Phase Detection Logic:
Next to the full custom blocks, the BB-PD and subsampler comprises the synthesized digital phase detection logic. This part is automatically generated from a Verilog description, which corresponds to the schematic shown in Fig. 15 . It compares the consecutive samples and determines whether the clock leads or lags the data, according to the Inverse Alexander operation [23] .
B. Digital Loop Filter
The implementation of the automatically generated DLF is shown in Fig. 16 . The DLF receives an Early/Late signal from the phase detection logic and this signal is then processed by a proportional and an integral path. The proportional path directly amplifies the Early/Late signals with −K p and K p , respectively. To maintain the stability of the AD-CDR, the delay in this path is minimized and the implementation is made as simple as possible. To achieve this, K p is always an integer and the output is a 7-bit thermometer code. Now, the proportional path can simply be implemented by selecting or deselecting K p of the thermometer-coded output bits. These bits directly drive the fine tuning input of the DCO (see section III-C). This configuration allows the gain K p to be set between 0 and 7.
The integral path of the DLF is implemented as a multirate architecture. That is, a Clk/2-domain is created to reduce the clock speed, which facilitates the implementation of the accumulator. Therefore, the Early/Late signal is demuxed by a factor of 2. The internal accumulator has a high resolution of 16 bits. This allows the use of a broad range of integral gains K i , which can be set to integer powers of 2. However, to avoid a bulky DCO design, only the 5 most significant bits of this 16-bit word are converted to a 31-bit thermometer-coded word which drives the DCO. In contrast to a binary-weighted coding, this thermometer coding increases the robustness against parasitic effects and reduces glitches when switching between states. In total, the DCO is controlled (in standard operation) by 45 (= 7+7+31) bits, each driving a unit varactor, which corresponds to a resolution of 5.5 binary-weighted bits. Furthermore, there are some signals shown in Fig. 16 that are not used in normal operation: first, there is a from FD signal, which is used in the calibration process of the DCO (see Section III-D) and which can be activated by the control signal Calibration. Second, there is also a fixed DCO setting signal which is only used for debug purposes and gives the ability to characterize the DCO separately. This signal is activated by the control signal DCO Characterization.
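The thermometer conversion and the quoted resolution can be illustrated with a short Python sketch. It assumes, purely for illustration, that the 16-bit accumulator is read as an unsigned word; the function names are hypothetical and the bit ordering of the thermometer code is arbitrary.

```python
import math


def thermometer(value: int, width: int) -> list:
    """Thermometer code: `value` ones followed by zeros, `width` bits in total."""
    if not 0 <= value <= width:
        raise ValueError("value out of range for thermometer width")
    return [1] * value + [0] * (width - value)


def integral_to_thermometer(acc16: int) -> list:
    """Keep the 5 MSBs of a 16-bit accumulator (assumed unsigned here)
    and convert them to a 31-bit thermometer word."""
    top5 = (acc16 >> 11) & 0x1F
    return thermometer(top5, 31)


# 45 unit varactors correspond to log2(45) binary-weighted bits.
resolution_bits = math.log2(7 + 7 + 31)
print(thermometer(3, 7))
print(sum(integral_to_thermometer(0x8000)))
print(round(resolution_bits, 2))
```

Because adjacent thermometer codes differ in exactly one bit, a one-step change in the control word switches a single unit varactor, which is the property that limits glitches.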
C. Digital Controlled Oscillator
To generate the eight uniformly phase-shifted clock phases for the aggregated 25 Gb/s PD operation, the DCO is implemented as a 4-stage ring oscillator with differential delay cells (Fig. 17) [29] .
The delay cell is shown in Fig. 17(b) . It can be tuned by tuning the tail bias current or by tuning the load network. For the load, we distinguish a coarse tuning and a fine tuning. The coarse tuning has 6-bit resolution and is only used during calibration of the DCO (see section III-D) and is implemented by switching binary-weighted resistors ON or OFF.
The fine tuning is done by tuning the load varactors. During normal CDR operation only this fine tuning is used. It is implemented as follows: the thermometer-coded words from the DLF (see Fig. 16 ) switch unit varactors ON/OFF. To reduce the area of the ring oscillator and achieve a good resolution, the varactor units are distributed equally over the four delay cells. Per LSB of the fine tuning word, only one varactor is switched. However, the clock phases of the DCO have to be kept equally spaced as much as possible. Therefore, the ON/OFF switching of the varactors is sequenced across the different delay cells: 1) toggle a varactor in the first delay cell; 2) toggle a varactor in the third delay cell, 3) toggle a varactor in the second delay cell, and 4) toggle a varactor in the fourth delay cell, and so on.
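The sequencing across delay cells can be written down compactly. The Python sketch below follows the order given above (first, third, second, fourth delay cell, then repeating) and counts how many unit varactors are switched on in each cell for a given fine tuning code; the names are illustrative.

```python
CELL_ORDER = (1, 3, 2, 4)  # delay cell toggled for successive LSB steps


def varactors_on_per_cell(code: int) -> dict:
    """Number of unit varactors switched on in each delay cell for a
    thermometer code with `code` active bits."""
    counts = {cell: 0 for cell in (1, 2, 3, 4)}
    for step in range(code):
        counts[CELL_ORDER[step % len(CELL_ORDER)]] += 1
    return counts


print(varactors_on_per_cell(6))
```

With this ordering the load of any two delay cells never differs by more than one unit varactor, which keeps the clock phases close to equally spaced.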
The tuning mechanism through the tail bias current is in principle not needed, because according to simulation the entire operating range can be covered with the load tuning alone. However, this tuning was added to improve the robustness against process variations, such that the entire intended frequency range is covered even under unforeseen process conditions. For this purpose, a 4-bit current control was implemented on the chip.
D. Calibration of the DCO
Before normal AD-CDR operation, where only the fine tuning of the DCO is adapted, the DCO frequency should first be adjusted to within about ±30 MHz of the correct quarter-rate frequency for the data rate (e.g., 6.25 GHz for 25 Gb/s input data). For this, a coarse tuning of the DCO is performed in a calibration cycle at startup. This is done through an automatic frequency control loop which is based on an external reference clock and counters [2]. The frequency control loop counts the number of clock cycles of the digital clock and of the external reference clock. These numbers are compared with SPI-configured registers and the coarse settings are then gradually adjusted. This procedure is repeated until the DCO lies within about ±30 MHz of the desired frequency.
The circuit is incorporated in the synthesized digital block. The power overhead of this calibration procedure is negligible: the synthesized circuit is only based on simple counters and comparators and consumes almost no power (approximately 0.75 mW).
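The counter-based calibration described above can be sketched behaviorally. The Python model below is only an illustration under stated assumptions: the linear DCO model, the 100 MHz reference, the 1000-cycle counting window, and all function names are hypothetical and not taken from the chip, and the sketch counts DCO cycles directly rather than a divided digital clock.

```python
def dco_freq_hz(coarse):
    """Hypothetical monotonic DCO model: 64 coarse codes, 50 MHz per step."""
    return 5.0e9 + 50e6 * coarse

def count_dco_cycles(coarse, ref_hz, ref_cycles):
    """DCO cycles counted during `ref_cycles` periods of the reference."""
    return int(dco_freq_hz(coarse) * ref_cycles / ref_hz)

def calibrate(target_hz, ref_hz=100e6, ref_cycles=1000, tol_hz=30e6):
    """Step the coarse code until the count matches the target within tolerance."""
    # Values that would be written to configuration registers (e.g., over SPI).
    target_count = int(target_hz * ref_cycles / ref_hz)
    tol_count = int(tol_hz * ref_cycles / ref_hz)
    coarse = 0
    for _ in range(64):
        error = count_dco_cycles(coarse, ref_hz, ref_cycles) - target_count
        if abs(error) <= tol_count:
            break                              # within about +/-30 MHz
        coarse += 1 if error < 0 else -1       # gradual, one step at a time
        coarse = max(0, min(63, coarse))
    return coarse

code = calibrate(6.25e9)                       # quarter rate for 25 Gb/s
print(code, dco_freq_hz(code) / 1e9)
```

The loop only needs two counters and a comparator, which is consistent with the small power overhead reported for the calibration logic.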
IV. EXPERIMENTAL RESULTS
The AD-CDR is fabricated in a 40-nm low-power CMOS technology. The low-power flavor is not favorable for a high-speed circuit, but was selected based on the available tape-outs. Unfortunately, the received samples (all from the same wafer) were apparently from a slow process corner. This forced us to increase the DCO supply voltage to 1.15 V (instead of the nominal value of 1.1 V). For the BB-PD and synthesized logic, we had to increase the voltage to 1.25 V. All the measurements reported in this section were done with these increased supply voltages.
A photo of the fabricated chip, together with an annotated layout view, is shown in Fig. 18. The chip area of the CDR core is only 0.050 mm^2. To test the fabricated CDR, it was wire bonded on a high-speed PCB. The input buffers of the CDR and the transmission lines on the PCB are designed for an input impedance of 50 Ω. The measurements were performed by directly connecting the measurement equipment through this PCB to the ESD-protected I/O pads of the CDR.
A. Functional Tests
First, basic functional tests were performed on our prototype at three different operating data rates: 25, 20, and 12.5 Gb/s. For this, a 2^31 − 1 pseudorandom bit sequence (PRBS31) was applied to the input of our AD-CDR. Note that with this PRBS31 test sequence, the PD output, after the 16 times subsampling that we have in our circuit, will contain idle patterns with a length l equal to 31 (see Section II-D).
At 25 Gb/s, the CDR core without input and output buffers has a power consumption of 46 mW of which 11 mW is dissipated by the samplers, retiming block, and subsampling block, 4 mW is consumed by the digital block and 31 mW is used for the DCO. The power dissipation at 20 and 12.5 Gb/s is, respectively, 38 and 23 mW.
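As a quick check of these numbers, the power breakdown and the resulting energy per bit can be computed directly. The short Python sketch below uses only the values quoted above; the variable and function names are ours.

```python
# Measured core power (mW) at each data rate (Gb/s), taken from the text.
power_mw = {25.0: 46.0, 20.0: 38.0, 12.5: 23.0}

# Power breakdown at 25 Gb/s (mW), taken from the text.
breakdown_mw = {"samplers_retiming_subsampling": 11.0, "digital": 4.0, "dco": 31.0}

def energy_per_bit_pj(rate_gbps, p_mw):
    """mW divided by Gb/s gives pJ/bit."""
    return p_mw / rate_gbps

total_mw = sum(breakdown_mw.values())
print(total_mw)
for rate, p in power_mw.items():
    print(rate, round(energy_per_bit_pj(rate, p), 2))
```

The energy per bit stays between roughly 1.84 and 1.9 pJ/bit over the operating range, which reflects the near-linear scaling of power with data rate.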
Next, a batch of bit error rate (BER) measurements was performed. The full data stream is available as four parallel channels at quarter rate, but due to equipment limitations, we could only perform the BER measurement on one of the four channels at a time. All the measurements reported below were done in this configuration.
In a typical measurement, the AD-CDR was operated over a time span of 15 min and the bit errors over this time frame were collected. These measurements consistently resulted in error-free operation of the AD-CDR at 20 and 12.5 Gb/s. At 25 Gb/s, a BER of 3.5 × 10^−13 was measured, well below the error correction capabilities of most applications [36].
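To put the 15-min measurement window in perspective, the number of observed bits and the BER bound implied by an error-free run can be estimated. The Python sketch below assumes that one quarter-rate channel is observed for the full 15 min and uses the standard zero-error confidence bound; the 95% confidence level and the function names are our choices, not values from the measurement setup.

```python
import math

def bits_observed(channel_rate_bps, seconds):
    """Number of bits seen on one channel during the measurement."""
    return channel_rate_bps * seconds

def ber_upper_bound(n_bits, confidence=0.95):
    """Upper bound on the BER when zero errors are seen in n_bits:
    -ln(1 - CL) / n_bits."""
    return -math.log(1.0 - confidence) / n_bits

window_s = 15 * 60
for line_rate in (25e9, 20e9, 12.5e9):
    n = bits_observed(line_rate / 4, window_s)   # one quarter-rate channel
    print(line_rate, n, ber_upper_bound(n))

# Expected number of errors for the BER reported at 25 Gb/s.
expected_errors = 3.5e-13 * bits_observed(25e9 / 4, window_s)
print(expected_errors)
```

Under these assumptions, a 15-min window at 25 Gb/s covers about 5.6 × 10^12 bits per channel, so the reported BER corresponds to roughly two bit errors per run.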
In the remainder of this section, the performance of the DCO, the PD (including the experimental comparison of the conventional and Inverse Alexander PD), and the AD-CDR is discussed in more detail.
B. Digital Controlled Oscillator Operation
The DCO can be driven independently of the other blocks. This allows the DCO to be characterized for different current, coarse tuning, and fine tuning settings.
In Fig. 19, the DCO frequency characteristic is shown. The x-axis represents the 6-bit resistor coarse tuning word concatenated with the 5-bit integral path fine tuning word, which results in 2048 possible configurations. The measurement was repeated over multiple current settings, ranging from current setting "2" to "15" (for the lowest current settings the results were not meaningful). Fig. 19 demonstrates that the DCO covers a frequency range from 2.73 to 8.95 GHz, which corresponds to a data rate range from 10.92 to 35.8 Gb/s.
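The relation between the measured DCO range and the supported data rates follows from the quarter-rate clocking. A minimal Python sketch of this bookkeeping is given below; the constant and function names are ours.

```python
F_MIN_GHZ, F_MAX_GHZ = 2.73, 8.95   # measured DCO range from the text
RATE_DIVIDER = 4                     # quarter-rate clocking
N_CONFIGS = 2 ** (6 + 5)             # 6 coarse bits + 5 fine bits

def dco_freq_for_rate(rate_gbps):
    """Quarter-rate DCO frequency (GHz) needed for a given data rate (Gb/s)."""
    return rate_gbps / RATE_DIVIDER

def rate_supported(rate_gbps):
    """True if the required DCO frequency lies inside the measured range."""
    return F_MIN_GHZ <= dco_freq_for_rate(rate_gbps) <= F_MAX_GHZ

print(N_CONFIGS)
for rate in (12.5, 20.0, 25.0):
    print(rate, dco_freq_for_rate(rate), rate_supported(rate))
```

The three tested data rates require 3.125, 5, and 6.25 GHz, all of which lie inside the measured DCO range.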
A detail of the characteristic around 6.25 GHz, which is the quarter-rate oscillation frequency for 25 Gb/s input data, is shown in Fig. 19(b). In this figure, the influence of the different settings is more visible: each color/symbol corresponds to a different current setting. The different line segments of the same color have a different coarse tuning value, and all frequency points within a single line segment have a different fine tuning value.
The DCO was designed such that for every coarse transition, the output frequency ranges of the two adjacent settings would overlap. If we focus on, e.g., the rightmost (dark blue) current setting, we note that this is the case for some coarse transitions. However, for other coarse transitions, there is an undesired frequency gap. This means that for a fixed current setting, some oscillation frequencies cannot be generated by changing only the coarse and fine tuning settings. This issue arises from underestimated parasitics. Fortunately, this problem was anticipated and can be circumvented by using the coarse current tuning. In this way, the desired frequency range is still completely covered.
The measured DCO gain K_DCO at 6.25 GHz for the different current settings is shown in Fig. 20. The figure shows that K_DCO is about 1.7 MHz/LSB for high current settings and that K_DCO increases to 2.3 MHz/LSB for lower current settings. Clearly, this means that the DCO quantization step is rather coarse. The measurements reported below were performed with a current setting equal to 12.
The DCO supply sensitivity at 6.25 GHz is shown in Fig. 21. Here, the supply sensitivity equals 3.3 GHz/V. Due to this high supply sensitivity, the phase noise of the DCO is degraded: at a frequency offset of 10 MHz from the carrier, the measured phase noise is equal to −95 dBc/Hz [see Fig. 25 (dotted line)]. In post-layout simulation, however, the corresponding phase noise was −110 dBc/Hz at 10 MHz from the carrier. We attribute this deterioration to supply noise, which leads to excessive phase noise due to the poor supply sensitivity.
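To give a feeling for the impact of the supply sensitivity, the frequency shift caused by a small supply disturbance can be compared with the DCO tuning step. The Python sketch below uses the measured 3.3 GHz/V and the 1.7 MHz/LSB value reported for high current settings; the 1 mV disturbance is an arbitrary example value, not a measured noise level.

```python
SUPPLY_SENS_HZ_PER_V = 3.3e9   # measured supply sensitivity at 6.25 GHz
K_DCO_HZ_PER_LSB = 1.7e6       # measured DCO gain at high current settings

def freq_shift_hz(delta_v):
    """Frequency shift caused by a supply change of delta_v volts."""
    return SUPPLY_SENS_HZ_PER_V * delta_v

def shift_in_lsb(delta_v):
    """The same shift expressed in DCO tuning steps."""
    return freq_shift_hz(delta_v) / K_DCO_HZ_PER_LSB

delta_v = 1e-3                 # example disturbance of 1 mV (assumed)
print(freq_shift_hz(delta_v) / 1e6, "MHz")
print(round(shift_in_lsb(delta_v), 2), "LSB")
```

In this example, a supply disturbance of only 1 mV already shifts the frequency by about two tuning steps, which illustrates why supply noise can dominate the DCO phase noise.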
C. Phase Detector Operation
To determine the performance of the PD, the sensitivity of the samplers is measured. This sensitivity is defined as the time span in which the input data is sampled correctly by the samplers. The measurement is performed by applying an external quarter-rate clock signal together with the input data to the AD-CDR. For this measurement, a 2^7 − 1 pseudorandom bit sequence (PRBS7) at 25 Gb/s with a rise time of 0.25 UI is applied. The internal DCO is bypassed such that the data is sampled by the external clock. By sweeping the time difference between the external clock and the input data, we determined the BER for each time difference; the resulting bathtub curve is shown in Fig. 22. The bathtub curve indicates that a time span of 18.8 ps out of a data period of 40 ps gives a BER below 10^−12.
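The way an opening is read from a bathtub sweep can be sketched as follows. The Python example is only an illustration: the sweep points are hypothetical values chosen to reproduce an 18.8 ps opening and are not the measured data of Fig. 22, and the function name is ours.

```python
UI_PS = 40.0   # data period at 25 Gb/s

def eye_opening_ps(sweep, threshold=1e-12):
    """Widest contiguous span (ps) between sweep points whose BER is below
    `threshold`. `sweep` is a list of (time_offset_ps, ber) sorted by offset.
    The result is limited by the spacing of the sweep points."""
    best = 0.0
    start = None
    for t, ber in sweep:
        if ber < threshold:
            if start is None:
                start = t          # first passing point of a new span
            best = max(best, t - start)
        else:
            start = None           # span interrupted by a failing point
    return best

# Hypothetical sweep points (offset in ps, BER), for illustration only.
sweep = [(0.0, 1e-3), (8.0, 1e-9), (10.0, 1e-13), (20.0, 1e-15),
         (28.8, 1e-13), (32.0, 1e-8), (40.0, 1e-3)]

opening_ps = eye_opening_ps(sweep)
opening_ui = opening_ps / UI_PS
print(opening_ps, opening_ui)
```

Expressed in unit intervals, an opening of 18.8 ps out of 40 ps corresponds to about 0.47 UI.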
D. Experimental Comparison of the Conventional and Inverse Alexander PD
To facilitate the experimental comparison between the conventional and Inverse Alexander PD, our prototype circuit was designed such that it can be configured to operate with either of them. This is done by switching the sign of the control loop of the CDR in the DLF. Furthermore, the subsample factor N can be set to 16 (which is the nominal case) or to 32 (which is a test mode). For these cases, comparative BER measurements were performed. A 25-Gb/s PRBS7 was applied to the CDR and jitter was intentionally applied to the input data stream. For the jitter, Gaussian pseudo-white noise with a bandwidth of 80 MHz (the equipment limit) was used. The jitter level was varied and the CDR was operated over a long time until a sufficient number of bit errors were collected to obtain a reasonably accurate estimate of the BER. The results are summarized in Fig. 23. For the interpretation of the curves, it should be noted that, at a high jitter level, the CDR starts to occasionally lose synchronism (due to cycle slips). This happened in each of the considered configurations but, as the figure shows, much earlier for the conventional PD than for the Inverse Alexander PD.
[Figure: Persistence plots of (a) recovered (differential) clock (jitter < 1.5 ps rms) and (b) recovered data (jitter ≈ 3.71 ps rms).]
From Fig. 23, we can conclude that the BER performance of both the conventional and the Inverse Alexander PD degrades when the subsample factor increases from N = 16 (nominal value) to N = 32 (test case). For N = 32, the conventional PD was in fact not functional at all. It is also obvious from the figure that, due to subsampling and nonidealities, the Inverse Alexander PD greatly outperforms the conventional Alexander PD: if we compare the BER at the same jitter level, the improvement is too large to be measured, but is definitely above a factor of 10^5. If we compare the jitter levels at which a certain BER occurs, the improvement is about a factor of 1.9.
Moreover, the phase noise of the recovered clock is compared between the conventional and Inverse Alexander PD for different subsample factors (Fig. 24). In all cases, a PRBS31 data sequence at 25 Gb/s was applied to the input of the CDR and the DLF parameters were held constant. As predicted in Section II-A, the Inverse Alexander PD introduces less noise, which leads to lower phase noise compared with the conventional Alexander PD for the same subsample factor. However, when the subsample factor is doubled, additional aliasing effects occur, which increase the in-band phase noise by approximately 3 dB for both the conventional and the Inverse Alexander PD.
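The 3 dB figure is what a simple noise-folding argument would predict. The Python sketch below is a back-of-the-envelope model, not the analysis used in this work: it assumes white PD noise whose total power is unchanged by subsampling and therefore folds into a Nyquist band that shrinks in proportion to 1/N. The function name is ours.

```python
import math

def inband_noise_increase_db(n_old, n_new):
    """Increase in in-band noise density (dB) when the subsample factor
    changes from n_old to n_new, assuming white noise with constant total
    power folded into a band proportional to 1/N."""
    return 10.0 * math.log10(n_new / n_old)

print(round(inband_noise_increase_db(16, 32), 2))
```

Doubling the subsample factor halves the effective sampling rate of the loop, so under this assumption the same noise power is concentrated in half the bandwidth.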
E. All-Digital Clock and Data Recovery Operation
For the final AD-CDR operation measurements, the standard operation mode (with Inverse Alexander PD and subsample factor N = 16) was again selected.
The closed-loop phase noise of the recovered clock for different gain settings is shown in Fig. 25, next to the phase noise of the free-running oscillator. Here, a PRBS31 data sequence at 25 Gb/s is applied to the input of the AD-CDR and the phase noise of the quarter-rate recovered clock is captured. The figure shows that increasing the proportional gain K_p increases the bandwidth of the AD-CDR. As the ratio of the proportional gain K_p to the integral gain K_i decreases, peaking starts to occur. Furthermore, the figure also shows that outside the loop bandwidth, the phase noise of the closed-loop system approximates the phase noise of the free-running clock. In the time domain, the closed-loop phase noise was measured as 1.455 ps rms jitter on the recovered clock, as shown in Fig. 26(a). Additionally, the corresponding measured eye diagram of the recovered data is depicted in Fig. 26(b). Its rms jitter is approximately 3.71 ps.
The capture range of the AD-CDR was also measured and is equal to 248 MHz. This corresponds to the tuning range in normal operation and is sufficiently large to allow correct operation from an initial calibration that aligns the DCO frequency within ±30 MHz of the desired quarter-rate frequency.
Moreover, the jitter tolerance (JTOL) of the AD-CDR is shown in Fig. 27(a) and (b) for different proportional gains K_p and integral gains K_i, respectively. In both figures, the SDH STM-256 JTOL mask and the JTOL of [2] and [4] are added for comparison. These JTOL curves are measured by applying a PRBS7 input data sequence at 25 Gb/s with sinusoidal jitter. Each measurement is obtained by increasing the jitter level until the BER exceeds 10^−12. As shown in the figure, the JTOL curves can be widely tuned by adapting the digital loop parameters. For example, the JTOL can easily be set such that it satisfies the STM-256 mask and exceeds the JTOL of [2] and [4]. Please note that for the lower jitter frequencies, the JTOL is better than indicated in the figure, since the highest jitter level that our equipment can generate still leads to a BER that is better than 10^−12.
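The JTOL measurement procedure at a single jitter frequency can be sketched as a simple search. The Python example below is a hypothetical illustration: `ber_at` stands for a BER measurement at a given sinusoidal jitter amplitude, and the toy BER model and amplitude lists are invented for the example and do not represent measured data.

```python
def jtol_point(ber_at, amplitudes_ui, ber_limit=1e-12):
    """Increase the jitter amplitude until the BER exceeds `ber_limit`.
    Returns (largest tolerated amplitude, limited_by_equipment), where the
    flag is True if even the largest available amplitude was tolerated."""
    tolerated = None
    for amp in sorted(amplitudes_ui):
        if ber_at(amp) > ber_limit:
            return tolerated, False
        tolerated = amp
    return tolerated, True

def toy_ber(amp_ui):
    """Hypothetical BER model: error-free up to 0.5 UI, failing above."""
    return 1e-15 if amp_ui <= 0.5 else 1e-6

print(jtol_point(toy_ber, [0.1, 0.3, 0.5, 0.7, 0.9]))
print(jtol_point(toy_ber, [0.1, 0.3]))
```

The second call models the situation at low jitter frequencies, where the largest amplitude the equipment can generate is still tolerated, so the reported point is only a lower bound on the true JTOL.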
Finally, a comparison with the state-of-the-art of digital CDRs is shown in Table III. The table shows that our design occupies the smallest area and has the highest power efficiency. Although the performance of the DCO is modest and the phase noise and the jitter of the recovered clock are higher than in prior work, only our work and [4] satisfy the STM-256 JTOL mask, as shown in Fig. 27. Finally, apart from [9] and [11], which have the unattractive requirement of a tunable, high-quality, multi-gigahertz frequency reference clock, our design has the highest relative frequency range for digital CDRs.
V. CONCLUSION
We have presented an AD-CDR in 40-nm low-power CMOS technology. It can operate over a very wide range of data rates (from 12.5 to 25 Gb/s). The CDR takes in the high-speed data, recovers a quarter-rate clock, and demultiplexes the recovered data into four parallel data streams. A ring oscillator generates eight equally spaced quarter-rate clock phases and provides the necessary timing resolution for an Inverse Alexander PD, which captures the recovered data and sends an Early/Late signal to the automatically synthesized DLF.
A key enabling element of the presented design is the use of extensive subsampling together with the Inverse Alexander PD to reduce the operating speed of the synthesized digital logic while still guaranteeing good operation of the CDR. Because parallel structures are avoided, this simplifies the design, reduces the active die area, and decreases the power consumption. The resulting AD-CDR core has an area of 0.050 mm^2 and consumes only 46 mW at 25 Gb/s and 23 mW at 12.5 Gb/s. The implemented CDR is highly tunable and satisfies the JTOL specifications for SDH STM-256.
