In this work, we present a 200 MHz to 1.6 GHz digital delay-locked loop (DLL) for per-pin deskew applications.
Introduction
High-performance systems have been increasingly developed for various applications as CMOS process technology scales. Adequate system throughput requires extensive parallelism and a high-speed data interface. Parallel interfaces such as Wide I/O have been proposed to enhance the throughput. The double data rate (DDR) interface, which transmits data on both the rising and falling clock edges, is widely applied to increase transmission speeds to 3.2 Gbps/pin [1].
With increasing operating frequency, the duty cycle distortion, the reference clock jitter, and the pin-to-pin delay mismatch adversely affect the valid sampling window. A delay-locked loop (DLL) is often used to achieve a high-quality phase shift or phase alignment for the sampling clock [1-14]. In practice, the routing difference and the different path characteristics of the parallel pins cause considerable pin-to-pin delay mismatch [7]. Exacerbating the problem, the read delay and write delay of the bidirectional pins can be very different. Shifting the phase of the sampling clock alone is ineffective because an overlapped valid sampling window may not exist [1]. In parallel data-link transmission accompanied by a clock signal, appropriate data eye training is performed for all I/O pins to ensure that the best sampling position is robust [15, 16]. In addition, the synchronization of the skewed pins can save data-conflict power [7]. Therefore, reference clock period tracking and area-efficient per-pin deskew are demanded for the high-speed parallel interface [7, 16]. Figure 1 shows a general master-slave architecture of the DLL-based reference clock period tracking circuit and the phase shifters for per-pin deskew applications.
Conventional multiphase DLLs for phase-shifter applications [3, 4, 6] apply n equal delay units in the main delay line to lock the reference period. The phase-shift step of 2π/n can be easily obtained by duplicating the delay unit. However, this architecture suffers some disadvantages. The intrinsic delay of the main delay line in the DLL becomes very large. Achieving high-frequency operation is difficult. In addition, the adjustable minimum phase shift is limited. A small phase shift cannot be obtained via directly downscaling the control code of the main delay line in the DLL. Thus, a multiplying DLL-based structure [13] has been proposed to overcome the mismatch among the equal delay units, and the clock signal at twice the operating frequency is divided by a frequency divider to provide a 90° phase shift. However, achieving high-frequency operation is difficult because of the requirement for frequency multiplication. In addition, conventional DLLs [2-4, 6, 8, 10-14] only consider the design of the 90° phase shifter; per-pin deskew is not supported in these DLLs.
* Correspondence: wildwolf@cs.ccu.edu.tw
In this work, we propose a 200 MHz to 1.6 GHz digital DLL for the phase shifters and per-pin deskew applications. The proposed architecture of the digital DLL and the digitally controlled phase shifter (DCPS) exhibits less intrinsic delay. Compared with related works, the design achieves both the highest operating frequency and the widest frequency range. The DLL locks the reference clock and continuously provides its internal one-period delay-locked code as a reference code for the DCPS. Accordingly, the DCPS can cover the delay range with the scalable architecture for the best area efficiency. To mitigate the jitter effect from the reference clock, the DLL applies a phase detector with a detection window and the proposed consecutive phase decision method. As a result, the phase variation of the phase shifter is reduced. With the timing calibration function defined in the DDR memories, a demonstrative test chip is implemented to verify the performance.
The rest of this paper is organized as follows. The architectures of the DLL and the DCPS for the phase shifters and the per-pin deskew application are introduced in Section 2. Section 3 presents the circuit designs of both the DLL and the DCPS. Section 4 shows the simulation results of the test chip and compares its performance with results reported in previous works. Finally, Section 5 concludes this work.
DLL architecture
Figure 2 shows the architecture of the proposed digital DLL. The main delay unit used in this work is the scalable DCPS. In the DLL, the DCPS-360 and the DCPS-0 are used for one clock period delay and intrinsic delay compensation, respectively. The control procedure for the proposed DLL is shown in Figure 3. After reset, the initial CODE_DCPS-360 is set to 0 to generate the minimum delay. The initial step size is set to S_INIT (in this case, S_INIT is a coarse tuning step, which will be mentioned later). The phase acquisition begins from the minimum delay; it therefore avoids false lock to the harmonics. The CODE_DCPS-360 is forced to increase while the phase detector (PD) output progresses through a series of UP for the phase from 0° to 180° and a series of DN for the phase from 180° to 360°. The controller then reduces the step size and changes CODE_DCPS-360 against the PD direction for the lock-in convergence. The step size is reduced by half when the direction of the PD output changes. After the step size is reduced to 1, the DLL is locked.
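The control procedure described above can be sketched in Python. This is a minimal behavioral model under stated assumptions, not the controller implemented on the test chip: it assumes an ideal linear delay line (delay = code × step_ps, with the intrinsic delay already cancelled by the DCPS-0), an ideal PD, and that near lock an UP output means the delay exceeds one period. The names phase_detector, lock_in, step_ps, s_init, and max_code are hypothetical.

```python
def phase_detector(delay_ps, period_ps):
    """Idealized PD (assumption): UP for 0-180 degrees, DN for 180-360 degrees."""
    phase_deg = (delay_ps % period_ps) / period_ps * 360.0
    return "UP" if phase_deg < 180.0 else "DN"


def lock_in(period_ps, step_ps=3.0, s_init=32, max_code=4095):
    """Return a delay code whose delay is close to one reference period."""
    code, step = 0, s_init  # start from the minimum delay to avoid harmonic lock

    # Forced increase: pass through the UP series, then the DN series,
    # and stop when the PD returns to UP (the delay just passed one period).
    seen_dn = False
    while True:
        out = phase_detector(code * step_ps, period_ps)
        if out == "DN":
            seen_dn = True
        elif seen_dn:
            break
        code += step
        if code > max_code:
            raise ValueError("delay range too short for this period")

    # Convergence: halve the step on every PD direction change and move
    # against the overshoot until the step size reaches 1.
    prev = out
    step //= 2
    while step > 1:
        code += -step if prev == "UP" else step
        out = phase_detector(code * step_ps, period_ps)
        if out != prev:
            step //= 2
            prev = out
    return code


if __name__ == "__main__":
    for period in (625.0, 5000.0):  # 1.6 GHz and 200 MHz reference clocks
        code = lock_in(period)
        print(period, code, code * 3.0 - period)
```

In this model the residual error when the step size reaches 1 is below two code steps; a real loop would presumably keep tracking with unit steps after lock.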
Circuit design
The k-bit DCPS (k ≥ 6) is composed of coarse delay units (CDUs) and a fine delay unit (FDU), as shown in Figure 4a. The CDU applies the ladder-shaped NAND gate delay line [17], which is very suitable for delay time extension. The monotonicity and the linearity of the CDU guarantee good fractional phase scalability of the distributed slaves from the master. The step size of the CDU is two NAND propagation delays. The FDU interpolates the phase difference of two adjacent CDUs by 32 steps. The binary-to-thermometer decoder and the FDU decoder are used in the DCPS. Figure 4b shows the DCPS-0, which applies only three CDUs and one FDU to generate the intrinsic delay of the DCPSs with the minimum code. The code to the CDU and FDU of the DCPS-0 is fixed, and the decoders are thereby eliminated.
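As an illustration of how a k-bit code might be divided between the CDUs and the FDU, the following Python sketch splits a code into a coarse index and a fine index and thermometer-decodes the coarse part. The 32 fine steps per CDU come from the text; the bit ordering (upper bits coarse, lower five bits fine), the default k = 8, and the names split_code and thermometer are assumptions made for this example.

```python
FINE_STEPS = 32  # FDU interpolation steps between two adjacent CDUs (from the text)


def split_code(code, k=8):
    """Split a k-bit DCPS code into (coarse, fine) indices.

    Assumption: the lower 5 bits select the FDU step and the upper
    k-5 bits select the number of active CDUs.
    """
    if k < 6:
        raise ValueError("the DCPS is described for k >= 6")
    if not 0 <= code < (1 << k):
        raise ValueError("code out of range for a k-bit DCPS")
    return divmod(code, FINE_STEPS)


def thermometer(value, width):
    """Binary-to-thermometer decode: the first `value` outputs are 1."""
    if not 0 <= value <= width:
        raise ValueError("value does not fit the thermometer width")
    return [1 if i < value else 0 for i in range(width)]


if __name__ == "__main__":
    coarse, fine = split_code(70)      # 70 = 2 * 32 + 6
    print(coarse, fine)                # prints: 2 6
    print(thermometer(coarse, 7))      # prints: [1, 1, 0, 0, 0, 0, 0]
```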
The FDU applies a 32-step interpolator to generate the monotonic delay output with finer resolution. Figure 5a shows the circuit architecture of the complementary driving interpolator [18]. The number of "on" tri-state drivers controls the driving strength of the paths. The structure is simple and the number of interpolated output phases is easily expanded. However, the delay step of the interpolator is not uniform when the input phase difference between IN_a and IN_b is large. The phase difference of the FDU inputs is one CDU delay, which is two NAND-gate propagation times. Figure 5b shows the waveform and the delay of the interpolated phase. When the numbers of the "on" tri-state drivers in two paths are similar, even a slight change in the number of the "on" tri-state drivers can substantially affect the driving strength. Therefore, such a change results in a much higher step difference of the middle codes and leads to poor linearity. In addition, the common node of the drivers' output has high capacitance and therefore slows the transition time of the output.
Given the aforementioned effects, the proposed interpolator partitions the large input phase difference into two smaller phase differences, as shown in Figure 5c. When an additional buffer and two multiplexers are applied, the partitioned phase differences can be selected via the F_Sel signal. The waveform and the delay of the interpolated phase are shown in Figure 5d. The partitioned interpolator not only improves the uniformity of the output phase but also reduces the transition time, with nearly half of the gates removed.
Figure 6a shows the architecture of the PD. In the first part, two different PDs are used for the selection of the detection window range. The sense-amplifier-based PD (SAPD) shown in Figure 6b provides high-resolution detection before the DLL is locked. The sample-based PD (SPPD) shown in Figure 6c preserves a detection window to reduce the reference clock jitter sensitivity after the DLL is locked. The SAPD based on [19] is used for tiny dead-zone detection. In this design, both up/down paths are pre-charged to high when CLK_INT is low. Therefore, the detection result is kept for less than one-half of a cycle. Two registers with the pulse generator are introduced to keep the detection result without affecting the signal level of the sense amplifier.
The SPPD uses two registers with some delay cells. By combining the register setup time and hold time, the SPPD outputs zeros for both UP1 and DN1 in an approximately 60-ps region around the rising edge of CLK_INT in a 90-nm CMOS process. Therefore, small jitter is ignored by the phase detector.
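The effect of the detection window can be illustrated with a small behavioral model in Python. It is only a sketch: it assumes the roughly 60-ps region is centered on the clock edge (±30 ps) and that a positive phase error maps to UP1, neither of which is stated in the text. The names sppd, phase_error_ps, and window_ps are hypothetical.

```python
WINDOW_PS = 60.0  # approximate dead region around the rising edge (from the text)


def sppd(phase_error_ps, window_ps=WINDOW_PS):
    """Return (UP1, DN1) for a given phase error in picoseconds.

    Assumptions: the window is symmetric around the clock edge, and a
    positive phase error asserts UP1 while a negative one asserts DN1.
    """
    if abs(phase_error_ps) <= window_ps / 2.0:
        return (0, 0)  # inside the window: the error is ignored
    return (1, 0) if phase_error_ps > 0 else (0, 1)


if __name__ == "__main__":
    for err in (-45.0, -10.0, 10.0, 45.0):
        print(err, sppd(err))
```

With this model, a phase error of 10 ps produces no output at all, while an error of 45 ps still drives the loop.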
The second part of the PD is the configurable consecutive phase decision. The windowed up/down signals (UPW/DNW) are detected and sent to the judgment block. The accumulator increases if UPW is 1 and decreases if DNW is 1. Two parameters, NUM0 and NUM1, are assigned to the block and selected by the DLL_LOCK signal. If UPW/DNW appears consecutively a given number of times, the PD outputs UP/DN. This consecutive phase decision deals with zero-mean random jitter effectively. When the jitter phases appear back-and-forth, conventional DLLs follow each decision and make a reverse delay adjustment. Therefore, the delay code is changed back-and-forth. In this design, such back-and-forth phase judgments cancel in the accumulator and do not propagate to the delay code.
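One possible reading of the consecutive phase decision is an up/down accumulator with a selectable threshold, sketched below in Python. The sketch makes several assumptions that the text does not confirm: NUM0 is used before lock and NUM1 after lock, the accumulator is cleared whenever UP or DN is issued, and the default thresholds are arbitrary. The class name ConsecutiveDecision and its method step are hypothetical.

```python
class ConsecutiveDecision:
    """Behavioral sketch of the consecutive phase decision block."""

    def __init__(self, num0=2, num1=4):
        self.num0 = num0  # threshold before lock (assumed mapping)
        self.num1 = num1  # threshold after lock (assumed mapping)
        self.acc = 0

    def step(self, upw, dnw, dll_lock):
        """Consume one windowed PD sample; return 'UP', 'DN', or None."""
        threshold = self.num1 if dll_lock else self.num0
        self.acc += upw - dnw  # +1 on UPW, -1 on DNW, 0 inside the window
        if self.acc >= threshold:
            self.acc = 0  # assumed: clear after issuing a decision
            return "UP"
        if self.acc <= -threshold:
            self.acc = 0
            return "DN"
        return None


if __name__ == "__main__":
    block = ConsecutiveDecision(num0=2, num1=4)
    # Back-and-forth jitter cancels in the accumulator: no code update.
    jitter = [(1, 0), (0, 1)] * 5
    print([block.step(u, d, True) for u, d in jitter])
    # A persistent phase error still produces a decision.
    print([block.step(1, 0, True) for _ in range(4)])
```

In this model, alternating UPW/DNW samples never reach the threshold, so the delay code stays put, while four successive UPW samples after lock yield a single UP.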
Simulation results
A test chip including the DLL and two phase shifters (DCPS-90s) was built using 90-nm CMOS technology. Figure 8a. The delay of the first 64 codes, which covers the full range of two CDUs, is shown in Figure 8b. The linearity and monotonicity of the 32 interpolated phases are kept not only inside the FDU but also across two CDUs. The simulated process-voltage-temperature (PVT) conditions and the step resolution of the DCPS are listed in Table 1. The DLL controller operates at a maximum frequency of 200 MHz derived from the 1/8 reference clock frequency. The power consumption of the DLL and the DCPS-90 is 3.4 mW and 0.31 mW, respectively, in the typical PVT condition at a reference clock frequency of 1.6 GHz. Figure 9 shows the 90° output phase error of the DCPS-90 versus the reference clock frequency. The CODE_DCPS-90 is set to one-fourth of the CODE_DCPS-360. The phase error is defined as the mean difference between the generated delayed phase and the target phase. The worst phase error was 2.4° at a reference clock frequency of 1.6 GHz.
Another important parameter is the phase variation, which is defined as the maximum absolute difference between the delayed phase and the target phase. Figure 10a shows the phase variation at a reference clock frequency of 200 MHz. The consecutive phase numbers, NUM0 and NUM1, in the PD were set to verify the jitter sensitivity. The performance improves as a larger consecutive number is set. Therefore, the back-and-forth phase judgment due to the reference clock jitter is successfully eliminated. Figure 10b shows the phase variation at a reference frequency of 1.6 GHz. The performance also improves as a larger consecutive number is set. Because both the delay scale of the 90° phase shift and the tolerable cycle-to-cycle jitter at 1.6 GHz are smaller, the performance saturates at 3.1°, which is mainly contributed by the phase error shown in Figure 9. The performances among several DLLs for per-pin deskew applications and multiphase DLLs are compared in Table 2. The DLL developed in this work can operate at frequencies as high as 1.6 GHz and has the widest operation range with the least power consumption. The DCPS-90 supports an adjustable resolution of 2.99 ps from 0° to 90° phase shift, with a small area per channel.
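To put the reported numbers in perspective, the short Python sketch below converts delays to phase at a given reference frequency and derives a 90° code from the locked one-period code. It is an illustrative calculation, not part of the reported simulation: it treats the 2.99 ps resolution as a uniform step and assumes the divide-by-four is a truncating shift. The names ps_to_deg and dcps90_code are hypothetical.

```python
def ps_to_deg(delay_ps, f_ref_hz):
    """Convert a delay in picoseconds to phase in degrees at f_ref_hz."""
    return delay_ps * 1e-12 * f_ref_hz * 360.0


def dcps90_code(code_360):
    """One-fourth of the locked one-period code (assumed truncating shift)."""
    return code_360 >> 2


if __name__ == "__main__":
    # A quarter period at 1.6 GHz is 156.25 ps, i.e. a 90-degree shift.
    print(ps_to_deg(156.25, 1.6e9))
    # One 2.99 ps step is about 1.72 degrees at 1.6 GHz ...
    print(round(ps_to_deg(2.99, 1.6e9), 3))
    # ... and about 0.215 degrees at 200 MHz.
    print(round(ps_to_deg(2.99, 200e6), 3))
    print(dcps90_code(210))
```

Under these assumptions, a single resolution step at 1.6 GHz is already of the same order as the 2.4° worst-case phase error, which is consistent with the phase variation saturating at the highest frequency.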
Conclusion
We presented a 200 MHz to 1.6 GHz digital DLL for phase shifters and per-pin deskew applications. The scalable architecture design of the DCPS supports high-frequency and wide-range operation with low phase error. It can be configured depending on the required delay to minimize the area occupation. The control procedure for the proposed DLL avoids the false-lock problem. The proposed PD reduces the sensitivity to reference clock jitter. As a result, the phase variation of the phase shifter can be greatly reduced. Therefore, the proposed DLL is well suited to the design of area-efficient phase shifters and to per-pin deskew applications. 
