Abstract: This paper presents a new quantization noise suppression method for a time-to-digital converter (TDC) and proposes an all-digital phase-locked loop (ADPLL) architecture that uses only standard cell logic gates. The new multiple input multiple output (MIMO) quantization noise suppression method improves the TDC resolution by a factor on the order of $\sqrt{2N}$ with N parallel TDC channels. The suppressed TDC noise allows the ADPLL to achieve superior jitter performance in both theoretical calculations and simulation results. To allow fast portability between process nodes, a short design cycle, ease of modification, and flexibility, the ADPLL architecture is designed entirely in register transfer level (RTL)-intensive Verilog code, and the implementation is synthesized to obtain the final microelectronic design schematics. Postlayout simulation results, compared against similar work in the literature, show that the designed ADPLL achieves a period jitter of 1.78 ps rms with a layout area of 0.09 mm² in a 65 nm CMOS process and a power consumption of 17.5 mW at 800 MHz.
Introduction
A fractional-N phase-locked loop (PLL) is one of the fundamental components in wired serial communication systems. It allows generation of any desired clock frequency from a given reference clock source. In some high-data-rate baseband systems such as SAS, SATA, PCI Express, and DisplayPort, the serial communication link bitrate over the cable is constant. In such systems, data or video throughput is variable and consumes only a portion of the available bandwidth. This requires the regeneration of the video clock on the receiver side from the recovered link clock using a fractional-N PLL [1]. In other applications where the serial link rate is variable, these types of PLLs are still useful when they act as fractional clock multipliers. Fractional clock multiplication helps process a certain fraction of the data extracted from the received data stream.
Bufferless video sinks, such as display panels or video timing controllers, require low jitter from fractional-N PLLs because the video flow is continuous and the tolerance for throughput variation is low: absorbing such variation in an uncompressed video flow would require a large amount of storage [2]. Such low jitter requirements have traditionally been met with analogue or digitally assisted PLLs, as most ADPLLs have relatively poor jitter performance due to the discrete steps in their oscillators. However, charge-pump-based PLLs have become increasingly hard to implement in nanoscale technology due to low supply voltages, the poor $g_{ds}$ of MOS transistors, and nonideal current sources and capacitors [3]. Therefore, ADPLL architectures must be implemented in deep submicron processes.
Additionally, there is an increasing need to pack more digital processing functions into such video-processing transmitters and receivers, which creates the need to move to finer process nodes. Furthermore, various video interface standards, such as HDMI, CSI, DisplayPort, DSI, LVDS, and OLDI, need various PLL configurations. These two motivations call for ADPLLs that are easily configurable and register transfer level (RTL)-intensive. Previous work in the literature includes designs with digital operation. However, [4, 5] contain custom gates and use methods that introduce extra analogue behavior in addition to ring oscillators within their designs, [6, 7] are digital only at the block interface, and [8] is not synthesizable.
This paper presents a novel synthesizable ADPLL architecture (Figure 1) that has superior jitter performance compared to similar ADPLLs and that satisfies the needs of the continuously evolving video transmission industry in terms of migration to new process nodes and fast IP reuse. This is achieved with two novel subblocks. First, a standard cell digitally controlled oscillator (DCO) with fine frequency steps and low noise was designed. Second, a new phase detection method allowing reduced quantization noise and finer resolution was implemented in the TDC.
Fundamentally, the ADPLL tracks the phase of the highly precise reference clock by comparing it to the DCO output clock, whose phase is variable. The feedback divider divides the DCO output clock by the desired fractional value NF and generates a feedback clock for comparison with the reference clock. This comparison gives the digital phase error (Err[k]), which allows the loop to adjust the phase and the frequency of the DCO clock to achieve the desired clock frequency multiplication value NF. In other words, NF is a user-specified value that determines the desired DCO output frequency.
Err[k] is filtered by the digital loop filter. The output of the loop filter adjusts the frequency of the DCO in a negative feedback manner, and the loop achieves phase/frequency tracking. In such an architecture, the TDC replaces the conventional phase/frequency detector and the charge pump, the DCO replaces the voltage-controlled oscillator (VCO), and the digital loop filter replaces its analogue equivalent. Even if only standard cells are used in an ADPLL, the internal nature of the ring oscillators in the DCO and TDC is still analogue. Consequently, in order to maintain loop characteristics, the process, voltage, and temperature (PVT) variation effects on these blocks need to be tracked and compensated.
This paper is organized as follows. Section 2 analyzes the proposed phase-detection method, called MIMO quantization noise suppression, and its implementation in the TDC. Implementation of the standard cell DCO is described in Section 3. An ADPLL example with the proposed subblocks, a digital loop filter, and a sigma-delta modulated feedback divider is illustrated in Section 4. Results and discussion are presented in Section 5.
MIMO parallel channel TDC
The time-to-digital converter (TDC) is one of the main blocks in an ADPLL. It measures the time from the reference clock edge to the feedback clock edge and gives a digital output as shown in Figure 2 . In this section, a novel quantization noise suppression method is presented. First, background information about the prior method is given in Section 2.1 and the proposed MIMO quantization noise suppression method is analyzed in Section 2.2. As shown in Figure 2 , the reference clock, the feedback clock, and the delayed clone of the time input are processed in multiple parallel TDCs and the results are combined in order to get superior TDC resolution with this new method by reducing the sampling jitter component of quantization noise. 
Single input multiple output (SIMO) quantization noise suppression
In order to obtain a better effective TDC resolution, a quantization noise suppression method was initially presented in [9]. The technique requires digitization of the time input by multiple independent observers. Parallel TDC paths with unique conversion resolutions are utilized in order to achieve an effective measurement accuracy better than that of each individual observer. As each TDC has a unique resolution, independent quantization noise profiles and independent observation results are obtained. Analogously to the multiple receiver antennas in a SIMO phased-array antenna grid, parallel TDCs can provide receiver diversity. This diversity is provided by the principle of superposition and the fact that one can benefit from the covariance of the correlated and uncorrelated signals. Effective TDC resolution is improved using the weighted gain combining method: coherent absolute time measurements for combining are created by multiplying the outputs of the TDCs back with their individual estimated resolutions, i.e. weights, to create quantized versions of the time input. Finally, these products are averaged to achieve superposition. The rest of the section explains how the RMS quantization noise standard deviation $\sigma$ is suppressed by an order of $\sqrt{N}$ relative to the signal level via digital postprocessing.
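The weighted gain combining idea can be illustrated with a small behavioural simulation. The following is a minimal sketch, not the implementation from [9]: it assumes ideal rounding quantizers, perfectly known resolutions (weights), and time inputs drawn uniformly from ±2 ns. The four resolutions are the typical first-conversion values quoted later in the ring oscillator description; the function names are hypothetical.

```python
import math
import random


def tdc_convert(t_ps, resolution_ps):
    """Ideal TDC model (assumption): rounding quantizer returning an integer code."""
    return round(t_ps / resolution_ps)


def combine(t_ps, resolutions_ps):
    """Weighted gain combining: scale each code back to time, then average."""
    estimates = [tdc_convert(t_ps, r) * r for r in resolutions_ps]
    return sum(estimates) / len(estimates)


def rms_error(resolutions_ps, samples):
    """RMS difference between the combined estimate and the true time input."""
    squared = [(combine(t, resolutions_ps) - t) ** 2 for t in samples]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.uniform(-2000.0, 2000.0) for _ in range(20000)]

resolutions = [17.0, 20.0, 23.0, 26.0]  # ps/LSB, one unique value per channel
rms_single = rms_error(resolutions[:1], samples)
rms_simo = rms_error(resolutions, samples)

# A uniform quantizer with step T has an RMS error of T / sqrt(12), so the
# combined RMS error can be mapped back to an equivalent single-TDC resolution.
effective_resolution = rms_simo * math.sqrt(12)

print(f"single 17 ps channel RMS error: {rms_single:.2f} ps")
print(f"1 x 4 combined RMS error:       {rms_simo:.2f} ps")
print(f"equivalent resolution:          {effective_resolution:.1f} ps/LSB")
```

If the four quantization errors were exactly uncorrelated, the combined RMS error would be $\frac{1}{4}\sqrt{\sum_i T_i^2/12} \approx 3.1$ ps for these values; the simulation checks how closely deterministic quantizers with distinct step sizes approach that figure.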
The SIMO case is observed when only time input 1 in Figure 2 is processed. The equivalent baseband model is as follows. The individual branch signals are
where $S_1$ is time input 1 (the signal), $T_{i1}$ is the TDC channel gain, and $n_{i1}$ is the uniformly distributed quantization noise with variance $\sigma_{i1}^2$. The output of the combiner is
where $T'_{i1}$ are the combining weights, which estimate $T_{i1}$ as explained in Section 2.3. In Eq. (2), the signal and noise components are given by the first and second terms, respectively. The signal and noise power at the output are
where
12 is the branch noise power. Output SNR is
The index $i$ ranges from 1 to $N$, and a comparison of the SNR at $N = 1$ and $N = 4$ for unique but close $T_{i1} = \{T_{11}, T_{21}, T_{31}, T_{41}\}$ with an average of $T_{avg}$ shows
which indicates that the employed weighted-gain-combining method provides quantization noise suppression and allows the 1 × N TDC to act as if it were a single-channel TDC with a resolution of $T_{avg}/\sqrt{N}$.
Proposed MIMO quantization noise suppression method
Using the same number of TDCs, MIMO quantization noise suppression achieves improved resolution compared to the SIMO configuration. A transmitter diversity similar to that of antenna arrays with multiple transmitters is obtained by creating a delayed clone of the time input and feeding it back to the locked loop's TDCs for reconversion with another resolution setting. In order to obtain an independent second observation from the same channel, the TDC resolution is changed after the first measurement; hence, the delayed clone of the same time input is observed with a different quantization noise. Transmitter diversity for the same receiver is obtained because the system acts as if a second time input source were feeding through a different sampling mechanism. For the same TDC to be used for the time input and its delayed clone, these pulses need to be nonoverlapping, which can only be achieved in the locked state of the PLL. The MIMO case is observed when time inputs 1 and 2 in Figure 2 are processed. The equivalent baseband model is as follows. The individual branch signals are
where $S_j$ is time input $j$, $T_{ij}$ is the TDC channel gain, and $n_{ij}$ is the uniformly distributed quantization noise with variance $\sigma_{ij}^2$. The output of the combiner is
where $T'_{ij}$ are the combining weights, which estimate $T_{ij}$ as explained in Section 2.3. In Eq. (8), the signal and noise components are given by the first and second terms, respectively. As signals $S_1$ and $S_2$ are ideally the same, the signal and noise power at the output are
The index pair $(j, i)$ ranges over $(M, N)$, and a comparison of the SNR at $(M, N) = (1, 1)$ and $(2, 4)$ for unique but close
Eq. (12) shows that utilizing four parallel 2 × 1 TDCs to create a 2 × 4 TDC, as proposed in Figure 2, suppresses quantization noise as much as a 1 × 8 SIMO TDC, but with half the number of TDC channels used in the SIMO configuration. That is, the 2 × N MIMO configuration acts as if there were a single TDC with a resolution of $T_{avg}/\sqrt{2N}$, which is an improvement of $\sqrt{2}$ over the SIMO case.
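The $\sqrt{2}$ relation between the two configurations can be sketched with a simplified averaging argument. This is a hedged illustration rather than a restatement of the paper's numbered equations: the symbols $\hat{S}_k$, $e_k$, $K$, and $T_{eff}$ are introduced here, the combining weights are assumed to be estimated perfectly, and the $K$ quantization errors are assumed to be zero-mean, uniformly distributed, and mutually uncorrelated.

```latex
\begin{align}
  \hat{S}_k &= S + e_k, \qquad
  \mathrm{E}[e_k] = 0, \qquad
  \mathrm{Var}[e_k] = \frac{T_k^2}{12}, \qquad k = 1, \dots, K \\
  \hat{S} &= \frac{1}{K} \sum_{k=1}^{K} \hat{S}_k
  \quad\Rightarrow\quad
  \mathrm{Var}[\hat{S} - S] = \frac{1}{K^2} \sum_{k=1}^{K} \frac{T_k^2}{12}
  \approx \frac{T_{avg}^2}{12K} \\
  T_{eff} &= \sqrt{12\,\mathrm{Var}[\hat{S} - S]} \approx \frac{T_{avg}}{\sqrt{K}}
\end{align}
```

With $K = N$ observations (SIMO) this gives $T_{eff} \approx T_{avg}/\sqrt{N}$, and with $K = 2N$ observations from the same $N$ channels (MIMO) it gives $T_{eff} \approx T_{avg}/\sqrt{2N}$, so the ratio between the two is $\sqrt{2}$. The approximation holds only when the individual resolutions $T_k$ are close to their average.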
Online TDC resolution estimation
Digital postprocessing requires online estimation of the TDC resolutions. In the targeted 2 × 4 MIMO TDC application, resolutions within 1 ps of each other need to be distinguished. The method presented in [9] is used for the required online estimation. The known output sequence of the sigma-delta modulator creates a known phase error at the input and the output of the TDC, which allows the estimation of the TDC resolutions. Starting from the typical resolution values and digitally filtering each estimation sample, a stable resolution estimate is obtained.
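The estimation loop described above can be sketched as follows. This is only an illustration of the general idea, not the method of [9], whose details are not reproduced here: it assumes that each known time input and the matching TDC code are available as a pair, that a raw sample is formed as their ratio, and that the digital filter is a first-order IIR filter. All names, the filter coefficient, and the numeric values are hypothetical.

```python
import random


def estimate_resolution(known_inputs_ps, codes, typical_ps, alpha=1.0 / 64):
    """Track a TDC resolution from (known time input, TDC code) pairs.

    Each pair yields a raw sample t / code; a first-order IIR filter smooths
    the samples, starting from the typical resolution value.
    """
    estimate = typical_ps
    for t_ps, code in zip(known_inputs_ps, codes):
        if code == 0:
            continue  # a zero code carries no resolution information
        sample = t_ps / code
        estimate += alpha * (sample - estimate)
    return estimate


rng = random.Random(1)  # fixed seed so the run is repeatable
true_resolution_ps = 20.6     # hypothetical PVT-shifted resolution
typical_resolution_ps = 20.0  # starting point of the estimator

# Known phase errors of either sign, 0.5 ns to 2 ns in magnitude (assumption).
known_inputs = [rng.choice((-1, 1)) * rng.uniform(500.0, 2000.0) for _ in range(5000)]
codes = [round(t / true_resolution_ps) for t in known_inputs]

estimated = estimate_resolution(known_inputs, codes, typical_resolution_ps)
print(f"estimated resolution: {estimated:.2f} ps/LSB")
```

A small filter coefficient trades convergence speed for a steadier estimate, which matters here because resolutions that differ by about 1 ps must remain distinguishable.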
Architecture of proposed TDC
The proposed design is composed of a phase detector, a delay line, two gear ring oscillators with counters, and digital postprocessing. During phase and frequency acquisition, the loop is in SIMO mode with a maximum supported input range of 40 ns using a regular phase detector. After the loop is locked, the maximum time input at the TDC reduces to ±2 ns with sigma-delta modulation dithering, which allows the system to start MIMO operation. The delayed clone of the input up/down pulse is multiplexed to the phase detector during the silent phase after the falling edge of the up or down pulse. When the MIMO mode is enabled, the system performs the conversion and the reconversion for the same positive up or negative down pulse before it accepts another trigger for phase detection. The phase detector works from the rising edge to the rising edge of the reference clock (REFCLK) or feedback clock (FBCLK) signals and creates a positive output if REFCLK leads FBCLK.
Matched delay and inverter cells are used to delay the phase detector output for use in the 2nd conversion. This delay does not need to equal a specific value, and the delays in the channels do not need to match each other. The design only needs to ensure that the clone signal falls within the idle window of the phase detector. The silent window for reconversion is at its minimum when the feedback divider value is at its minimum and the ΣΔ modulator disposition is maximally negative. To ensure that a minimum-length idle window is available in the phase detector, the reference clock period needs to be constrained. For a reference clock frequency constrained to a maximum of 100 MHz, the minimum silent window extends from 4 ns to 8 ns after the rising edge of the up/down signal. While the exact delay for the time input clone is flexible, it has to be in this range in all PVT corners so that the 1st conversion does not overlap with the 2nd conversion. In the fast-cold corner, the delay should be greater than 4 ns, while in the slow-hot corner it should be less than 8 ns.
Two gear ring oscillator with counters
A seven-stage NAND gate ring oscillator is implemented with an enable input in one of the stages, as shown in Figure 2. Seven stages are used in order to keep the counter width at each node small while having enough dynamic range to cover the reference clock range with the minimum TDC resolution. To give each TDC a unique resolution, the oscillation frequency is adjusted for each TDC by attaching dangling inverters to each node of the ring. The inverter sizes are configured per TDC replica in order to provide the desired frequency offset. To select a slightly different resolution during the 2nd conversion using the same TDC, standard cell tristate buffers are connected between each NAND gate output and input. When enabled, these tristate buffers decrease the period of oscillation and increase the single-channel TDC resolution. Oscillation is enabled only during the up/down pulses and their delayed clones. Eight-bit-wide asynchronous counters at the output of each ring stage are summed to obtain the $N_{i1}$ and $N_{i2}$ outputs, as shown in Figure 2. The first conversion output is latched on the falling edge of the up/down pulse, and the same hardware is used for the second conversion. To provide a dynamic range spanning the specified reference clock range at the given TDC resolution, the 1st and 2nd conversion outputs are passed to the postprocessing block in eleven-bit two's complement format. The tristate buffer strengths and dangling inverter sizes are fine-tuned to obtain typical TDC resolutions of <17, 18.5>, <20, 21.5>, <23, 24.5>, and <26, 27.5> ps/LSB.
Digital postprocessing
Both outputs of the 2 × 1 TDCs are multiplied with their corresponding five-bit-wide estimated resolution ($T_i$), and these products are averaged as shown in Figure 2. While both outputs of each TDC channel are used during MIMO operation, the second output is omitted from postprocessing in SIMO mode. The result is a twenty-bit-wide output for use in the loop filter. The quantization noise has components due to mismatch, jitter, and sampling error. In order to demonstrate that sampling error suppression is obtained, the TDC was simulated in transient simulation, and the phase detection results were compared to the actual input signal to generate histograms that converge to the resulting quantization noise profile. When the MIMO mode is enabled, the TDC works with an effective resolution of 7 ps/LSB (Figure 3A). On the other hand, while the loop is locking, this value is 11 ps/LSB (Figure 3B) in the SIMO operation mode.
DCO
The DCO in Figure 4 incorporates N ring oscillators. Every ring oscillator uses the same number of programmable delay cells (three, five, or seven) rather than basic inverters to create an oscillator loop. An offline calibration algorithm is run before the ADPLL starts using the DCO. During calibration, an externally provided clock source is used to measure the free-running oscillation frequency with five delay cells in a ring while using the center frequency control word. If the oscillation frequency is too slow due to PVT, the delay cell count in the rings is reduced to three. Similarly, if the oscillation frequency is initially too fast, the rings are programmed to use seven delay cells. Depending on the calibration result, unwanted delay cells are bypassed using multiplexers and the desired number of delay cells is connected to create a ring.
Each ring has tristate buffers at each delay element output, and all of the rings are connected in parallel at the outputs of the delay elements. Each ring has a unique one-bit drive enable signal that enables all of its delay elements. The nodes driven by multiple drivers create the main time constant for each delay stage, as the capacitance from every active or inactive ring's driver and next-stage input is summed. Frequency tuning is achieved by changing the effective resistance at each high-time-constant node by enabling more or fewer rings while the capacitance stays the same. Tristated rings act as a capacitive load; when their drivers are active, they instead increase the driving strength, thereby decreasing the time constant at the output node of every delay element and increasing the output frequency of the loop. Coarse frequency tuning is obtained by adjusting how many of the rings are active.
Additionally, each delay cell has a unique fine frequency control (FCW[3:0]) that allows the delay of each delay cell to be adjusted in fine steps. There are seven FCW signals, one connected to each delay element in a ring, and these signals are shared by all rings. All FCW combinations except "0000" can be used to provide slightly tuned delay variations using the inherent propagation delay between the inputs of the gates. The LSBs of the linear $F_{ctrl}$ binary vector are mapped to the nonlinear FCW signals separately for each delay cell in the ring in order to provide monotonic frequency tuning with a high dynamic range [10]. Combining the coarse and fine frequency control mechanisms provides the tuning control for the DCO.
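The offline calibration decision described above can be sketched as a simple three-way choice. The sketch below is an assumption-laden illustration: the source does not state how the frequency is measured or which thresholds separate "too slow" from "too fast", so the cycle-counting measurement, the ±15% tolerance, the 800 MHz target, and the function names are all hypothetical.

```python
def measure_frequency_hz(dco_cycle_count, ref_cycles, ref_clock_hz):
    """Estimate the DCO frequency from a cycle count over a reference window."""
    return dco_cycle_count * ref_clock_hz / ref_cycles


def select_delay_cells(measured_hz, target_hz, tolerance=0.15):
    """Pick 3, 5, or 7 delay cells per ring from a measurement taken with 5 cells."""
    if measured_hz < target_hz * (1.0 - tolerance):
        return 3  # oscillator too slow: shorten the rings
    if measured_hz > target_hz * (1.0 + tolerance):
        return 7  # oscillator too fast: lengthen the rings
    return 5      # within tolerance: keep the five-cell configuration


# Hypothetical calibration runs: 100 MHz external clock, 1000-cycle window.
target_hz = 800e6
for count in (6400, 8000, 9600):
    measured = measure_frequency_hz(count, ref_cycles=1000, ref_clock_hz=100e6)
    print(f"{measured / 1e6:.0f} MHz -> {select_delay_cells(measured, target_hz)} cells")
```

In hardware, the selected value would drive the bypass multiplexers that connect the desired number of delay cells into each ring.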
Implementation of all-digital PLL
To compare the performance of the MIMO quantization noise suppression with the conventional SIMO method and also create a synthesizable standard cell ADPLL, the design is implemented and simulated in 65 nm CMOS technology with the specifications given in Table 1 . Design of the remaining subblocks and top level PLL control are presented in Sections 4.1 and 4.2.
Digital loop filter
The loop filter is implemented digitally, as shown in Figure 5A. With the help of digital scaling, accumulation, and IIR-filtering operations, the proportional and integral paths of the loop are created, and the structure digitally imitates a type-2 second-order PLL analogue loop filter (Figure 5B). The IIR loop filter is a first-order circuit similar to the one in Figure 5C, with the characteristics given in Eq. (13).
In order to calculate the loop filter parameters, the analogue PLL design assistant in [11] is used with the specifications, resulting in the open loop parameters $K$, $f_p$, and $f_z$ for use with an analogue filter. $K$ is the open loop gain, $f_p$ is the pole frequency, and $f_z$ is the zero frequency. For the specified system, the analogue equivalent transfer function and its calculated parameters are given in Eqs. (14) and (15). The corresponding loop filter has a gain of $K_{lf}$ and pole and zero frequencies $f_p$ and $f_z$.
The frequency response of the loop filter is shown in Figure 5D. This analogue filter transfer function can be approximated in the digital domain using the discrete domain transfer function in Eq. (16) [11].
When Eqs. (17) and (18) 
The digital transfer function approximation is realized with the proposed digital loop filter circuit. Transfer function of the circuit is given in Eq. (23) and solved in the same format as the desired digital filter response in Eq. (24).
The desired transfer function is matched to the transfer function of the actual implementation to give the following parameters for use during the implementation, as given in Eqs. (25) and (26):
Initially, $K_3$ is set to four in order to increase the loop bandwidth and reduce the settling time. It is reduced to one after coarse lock detection. This puts the loop back into the desired lower-bandwidth closed-loop operation, and the system continues with fine tuning. The lock signal is asserted when the counted feedback clocks and reference clocks are within 0.1% of each other for $2^{12}$ reference clocks. Err[k], $K_1$, $K_2$, and $\alpha$ are scaled up by powers of two in order to approximate real numbers and implement the multiplication, IIR-filtering, and accumulation operations. Finally, at the output of the loop filter, the results are scaled down and the DCO control word is created.
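The gear-shifting behaviour and the lock criterion can be illustrated with a behavioural model. This is a sketch under stated assumptions, not the circuit of Figure 5A: the placement of the IIR stage ahead of the proportional and integral paths, the IIR form y[k] = y[k-1] + α(x[k] - y[k-1]), the use of floating point instead of power-of-two scaled integers, and all coefficient values are assumptions made for illustration.

```python
class LoopFilter:
    """Behavioural proportional-integral loop filter with a first-order IIR stage."""

    def __init__(self, k1, k2, alpha, k3=4):
        self.k1 = k1        # proportional gain (hypothetical value)
        self.k2 = k2        # integral gain (hypothetical value)
        self.alpha = alpha  # IIR coefficient (hypothetical value)
        self.k3 = k3        # bandwidth gear: 4 while acquiring, 1 after coarse lock
        self.iir_state = 0.0
        self.integrator = 0.0

    def step(self, err):
        """Process one phase error sample and return the DCO control value."""
        scaled = self.k3 * err
        # Assumed IIR form: y[k] = y[k-1] + alpha * (x[k] - y[k-1])
        self.iir_state += self.alpha * (scaled - self.iir_state)
        self.integrator += self.k2 * self.iir_state
        return self.k1 * self.iir_state + self.integrator


def is_locked(feedback_count, reference_count=1 << 12, tolerance=0.001):
    """Lock criterion: the two clock counts agree within 0.1% over 2^12 reference clocks."""
    return abs(feedback_count - reference_count) <= tolerance * reference_count


loop_filter = LoopFilter(k1=0.5, k2=0.01, alpha=0.25)
first_output = loop_filter.step(1.0)  # wide-bandwidth gear (k3 = 4)
if is_locked(feedback_count=4098):
    loop_filter.k3 = 1                # drop to the narrow-bandwidth gear
print(first_output, loop_filter.k3)
```

A fixed-point version would replace the floating-point gains with integers scaled by powers of two and shift the result down at the output, as the text describes.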
MASH111 digital ΣΔ modulator
A 3rd-order MASH111 digital sigma-delta modulator topology similar to the one in [12] is implemented by cascading the first-order digital ΣΔ blocks given in Figure 6A. The ADPLL uses the sigma-delta modulator shown in Figure 6B. The 1st-order ΣΔ cores with 8-bit input and 1-bit output are implemented using delay, compare-to-zero, and add operations. The output of the modulator is a 4-bit signed vector that varies between −4 and 3 depending on the fractional part F of the clock multiplication value.
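For readers unfamiliar with the topology, the following is a minimal sketch of a textbook MASH 1-1-1 modulator built from three cascaded accumulators with carry outputs. It is not the delay, compare-to-zero, and add implementation of Figure 6: the accumulator form, the function name, and the test value F = 77 are assumptions chosen for illustration.

```python
def mash111(fraction, n_samples, bits=8):
    """Textbook accumulator-based MASH 1-1-1 (assumed form, not the paper's circuit).

    fraction is the numerator of the fractional part, i.e. F / 2**bits.
    Returns the per-cycle divider offsets.
    """
    modulus = 1 << bits
    acc1 = acc2 = acc3 = 0
    c2_d1 = 0          # second-stage carry delayed by one cycle
    c3_d1 = c3_d2 = 0  # third-stage carry delayed by one and two cycles
    outputs = []
    for _ in range(n_samples):
        # Each stage accumulates the residue of the previous stage.
        c1, acc1 = divmod(acc1 + fraction, modulus)
        c2, acc2 = divmod(acc2 + acc1, modulus)
        c3, acc3 = divmod(acc3 + acc2, modulus)
        # Noise cancellation network: c1 + (1 - z^-1) c2 + (1 - z^-1)^2 c3
        outputs.append(c1 + (c2 - c2_d1) + (c3 - 2 * c3_d1 + c3_d2))
        c2_d1 = c2
        c3_d2, c3_d1 = c3_d1, c3
    return outputs


sequence = mash111(fraction=77, n_samples=4096)
average = sum(sequence) / len(sequence)
print(f"mean output {average:.4f}, target {77 / 256:.4f}")
print(f"output range {min(sequence)}..{max(sequence)}")
```

The long-run mean of the output approaches F/256, which is what lets the feedback divider realize a fractional division ratio. In this accumulator form the output stays between −3 and 4; the −4 to 3 range stated in the text belongs to the paper's own implementation and is not reproduced by this sketch.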
Results and discussion
The ADPLL example was implemented (Figure 7), and the RC-extracted postlayout simulation results are presented in Table 2. Simulations were performed using the Spectre simulator's transient analysis, and the results were postprocessed to generate a phase noise profile. Among similar ADPLL designs, the implemented design is significant in two aspects.
The TDC uses the proposed MIMO quantization noise-suppression method and reduces the quantization noise by a factor of √ 2 compared to the previously presented SIMO case while using the same number of gates and power. Compared to similar cell-based designs [4, 6, 13] , better jitter performance was obtained with the help of improved TDC resolution as shown in the simulated phase noise profile in Figure 8 . Additionally, compared to mixed-signal designs [7, 8, 14] , similar or marginally better jitter performance was observed, but with a flexible cell-based all-digital design.
The area and power consumption of the proposed design are in the same ballpark as those of recent designs in the literature, though slightly worse. Although synthesis of the RTL portions of the HDL code provides logic, cell strength, and buffer optimization, there is room for improvement in area and power consumption when compared to [4, 6-8, 13, 14]. As the DCO is the dominant consumer, with 78% of the area, power and area reduction is possible by reducing the number of rings in the DCO. However, such an improvement would limit the frequency-tuning range. In general, our design promises better jitter performance, flexibility, and IP reuse across processes at the expense of slightly worse area and power performance. This brings an advantage when increasing the operating frequency of data transmitters and when moving designs to smaller process nodes.
Conclusion
While analogue or digitally assisted PLLs can achieve high performance, standard cell ADPLLs can still offer a competitive trade-off between jitter, power, and area. Compared to similar designs, the proposed design achieves superior jitter performance with the proposed MIMO quantization noise-suppression method while staying in the same ballpark for power and area consumption as other ADPLLs. Thanks to the use of standard cells, capacitors in the LF, DCO, and TDC are eliminated. The synthesized standard cell digital design flow meets the design's flexibility and portability targets, and the postlayout simulation results indicate that the proposed design improves the jitter performance of ADPLLs.
