Abstract-This paper presents a novel architecture for on-chip clock generation employing a network of oscillators synchronized by distributed all-digital PLLs (ADPLLs). The implemented prototype has 16 clocking domains operating synchronously in a frequency range of 1.1-2.4 GHz. The synchronization error between neighboring clock domains is less than 60 ps. The fully digital architecture of the generator offers flexibility and efficient synchronization control suitable for use in synchronous SoCs.
I. INTRODUCTION
Clock generation and distribution are among the main challenges in the design of modern large-scale SoCs [1]. The increase in the relative dimensions of digital SoCs, together with power limitations, makes centralized clocking techniques prohibitive. Long transmission lines are needed for chip-wide clock distribution, and the associated delays must be tightly controlled, which is very difficult to achieve at an acceptable power cost. This is the main reason for the popularity of the globally asynchronous locally synchronous (GALS) SoC architecture, which allows the use of many small clocking domains with asynchronous communication between them.
However, GALS presents a number of fundamental drawbacks related to reliability and verification, as well as reduced communication speed. The solution proposed in this paper guarantees synchronization between the clocks of neighboring local zones and in this way makes synchronous communication possible between neighboring zones, and even between zones situated at some distance from each other on the chip. This solution, called a "distributed clock generator", uses a network of local clock oscillators distributed over the chip area, similarly to the GALS architecture. However, the local clock generators are mutually coupled with their immediate neighbors in the phase domain, so that all local clocks have the same phase and frequency. Here, the long clock distribution lines employed by conventional architectures are replaced by short local network links that connect small local clock trees. The network of local clock generators is sufficiently dense that: (i) the geometric distance between each pair of neighboring oscillators is small and the delays associated with the links between oscillators are negligible, (ii) distribution of the clock signal inside each clocking domain is done by conventional techniques and (iii) synchronous communication between neighboring zones is possible as long as the corresponding local oscillators are synchronous in phase.
In the past, a few analog implementations of such clocking systems were presented [2], [5]. However, since the clocking circuits are tightly integrated into the core digital system on chip, the analog nature of these solutions made them poorly suited to practical use. This paper presents the first fully digital implementation of an array of 16 oscillators coupled through a network of all-digital phase-locked loops (ADPLLs) intended for distributed clock generation. The proof-of-concept chip, generating a 1.1-2.4 GHz clock, is implemented in 65 nm CMOS technology.
The theoretical basis for this system was provided by the studies [2], [3], [4], while the only prior silicon implementation of this concept employed a network of analog PLLs [5].
II. DISTRIBUTED CLOCKING ARCHITECTURE

A. System description
The structure of the clocking network is presented in Fig. 1. The paramount question of the stability of such a complex dynamical system was addressed in theoretical studies using control theory tools [3], [4]. A formal proof of stability, together with an algorithm for choosing the block parameters, was proposed.
B. Desirable mode selection
Unlike single PLLs, PLL networks have multiple synchronized modes in which all oscillators have the same frequency and a fixed (zero or nonzero) phase error. Only the mode with zero phase error is suitable for the clocking application. However, the actual synchronization mode depends on the initial conditions, over which the system in most cases has no control. Several methods have been proposed in the past for selecting the desirable mode [2].
Our method exploits the ability of the digital PLL network to modify its topology (connectivity) on the fly [8]. The idea is to artificially place the initial conditions of the network inside the basin of attraction of the desired synchronization mode. The start-up procedure consists of two steps:
Step 1. The clocking network is powered up and programmed into a unidirectional configuration. This is achieved by enabling or disabling the feedback links between the nodes. For example, each node receives error information only from its upper and left neighbors (Fig. 2(a)). This mode excludes cycles in the propagation of information and hence eliminates the possibility of undesired locking. However, in this operating mode the suppression of perturbations is weak [8], so the mode is not suitable for reliable clocking.
Step 2. Once the network is synchronized with small timing errors, it is re-programmed into a bidirectional configuration (Fig. 2(b)). In this mode the reverse links are activated, and the network operates in a fully synchronous mode with distributed feedback (coupling) that maintains synchronization in the desired mode with near-zero phase errors.
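The two start-up configurations can be illustrated with a small connectivity model. The following is only a sketch under stated assumptions: the 4×4 mesh indexing, the function names `build_topology` and `has_cycle`, and the treatment of the corner node (which has no inputs in Step 1 here) are illustrative choices, not details taken from the prototype. It checks that the unidirectional configuration contains no cycles of information propagation, whereas the bidirectional one does.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (row, column) position in the mesh (assumed indexing)


def build_topology(size: int, bidirectional: bool) -> Dict[Node, List[Node]]:
    """Map each node to the neighbors whose error information it receives."""
    topology: Dict[Node, List[Node]] = {}
    for r in range(size):
        for c in range(size):
            # Step 1: only the upper and left neighbors feed the node.
            sources = [(r - 1, c), (r, c - 1)]
            if bidirectional:
                # Step 2: the reverse links (lower and right) are enabled too.
                sources += [(r + 1, c), (r, c + 1)]
            topology[(r, c)] = [
                (i, j) for i, j in sources if 0 <= i < size and 0 <= j < size
            ]
    return topology


def has_cycle(topology: Dict[Node, List[Node]]) -> bool:
    """Depth-first search for a directed cycle in the information flow."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in topology}

    def visit(node: Node) -> bool:
        color[node] = GREY
        for source in topology[node]:
            if color[source] == GREY:
                return True  # back edge: information can circulate
            if color[source] == WHITE and visit(source):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in topology)


step1 = build_topology(4, bidirectional=False)
step2 = build_topology(4, bidirectional=True)
print(has_cycle(step1), has_cycle(step2))
```

In this model the Step 1 links always point toward smaller row or column indices, which is why no loop can form; enabling the reverse links in Step 2 immediately creates two-node loops between every pair of neighbors.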
III. NODE ARCHITECTURE
A. Phase-frequency detector overview
The PFD is an analog-to-digital converter that quantizes the synchronization error into a 5-bit signed digital number. According to its transfer function, shown in Fig. 3(b), its range is limited by the boundaries ±∆φ r , which are derived from the constraints of precision and hardware complexity. The detailed block diagram of the PFD is shown in Fig. 3(a). The PFD consists of a bang-bang phase detector (BB), which measures the sign of the phase error [7], and a time-to-digital converter (TDC), which quantizes the absolute time error between two clocks. The arithmetic block combines the signals from these blocks and produces two signed binary signals (straight and inverted), which are then used by the local and neighboring nodes.
The TDC is based on a tapped delay line followed by the sampling register. Its resolution is 32 ps.
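The behavior described above can be approximated by a simple sign-and-magnitude quantizer. This is a minimal sketch, not the transfer function of Fig. 3(b): the symmetric saturation at ±15 codes and the truncating (floor) quantization are assumptions, and `pfd_code` is a hypothetical name. Only the 32 ps step and the 5-bit signed output come from the text.

```python
TDC_STEP_PS = 32.0  # TDC resolution stated in the text
MAX_CODE = 15       # assumed symmetric saturation of the 5-bit signed word


def pfd_code(time_error_ps: float) -> int:
    """Quantize a signed time error (in ps) into a saturating signed code."""
    # Bang-bang detector: only the sign of the error.
    sign = 1 if time_error_ps >= 0 else -1
    # TDC: absolute error expressed in whole delay-line steps (assumed floor).
    magnitude = int(abs(time_error_ps) // TDC_STEP_PS)
    # The detector range is limited; errors beyond it saturate.
    magnitude = min(magnitude, MAX_CODE)
    return sign * magnitude


for error in (0.0, 40.0, -70.0, 10_000.0):
    print(error, pfd_code(error))
```

Under these assumptions, any error smaller than one TDC step maps to code 0, which is consistent with a residual error on the order of the detector resolution once the network is locked.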
B. Digitally controlled oscillators
The implemented DCOs are CMOS ring oscillators that use a width-modulation technique for digital frequency tuning [6], [7]. Their structure is based on a 7-stage ring oscillator (Fig. 4) with tuning inverters (Fig. 4, CTI0-CTI6 and FTI0-FTI2) connected in parallel to each stage of the oscillator. The main inverters (Fig. 4, MI0-MI6) are always active and define the lowest oscillation frequency. The tuning inverters are distributed over all 7 stages of the oscillator and divided into two arrays: 256 coarse tuning (CTI) and 3 fine tuning (FTI) inverters. They provide frequency tuning steps of 6 MHz and 1.5 MHz, respectively, for a total of 256×4=1024 steps. The cells are controlled by three thermometer codes obtained from binary-to-thermometer decoders. The oscillator, designed in 65 nm technology, has a frequency tuning range of 1-2.5 GHz.
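The tuning arithmetic above can be made concrete with an idealized linear model. This is a sketch only: the 10-bit tuning word and its split into 8 coarse and 2 fine bits, the 1000 MHz floor, the perfectly linear steps, and the names `split_word`, `thermometer` and `dco_frequency_mhz` are assumptions for illustration. The 6 MHz and 1.5 MHz step sizes and the 256×4 step count are taken from the text.

```python
from typing import List, Tuple

F_MIN_MHZ = 1000.0     # assumed floor set by the always-on main inverters
COARSE_STEP_MHZ = 6.0  # coarse tuning step from the text
FINE_STEP_MHZ = 1.5    # fine tuning step from the text
N_STEPS = 256 * 4      # total number of tuning steps from the text


def split_word(word: int) -> Tuple[int, int]:
    """Split an assumed 10-bit tuning word into (coarse, fine) fields."""
    if not 0 <= word < N_STEPS:
        raise ValueError("tuning word out of range")
    return word >> 2, word & 0b11


def thermometer(value: int, width: int) -> List[int]:
    """Binary-to-thermometer decoding: the first `value` cells are enabled."""
    return [1 if i < value else 0 for i in range(width)]


def dco_frequency_mhz(word: int) -> float:
    """Idealized linear frequency for a given tuning word."""
    coarse, fine = split_word(word)
    return F_MIN_MHZ + coarse * COARSE_STEP_MHZ + fine * FINE_STEP_MHZ


print(dco_frequency_mhz(0), dco_frequency_mhz(N_STEPS - 1))
print(thermometer(2, 3))
```

With these assumed values the model spans roughly 1.0 to 2.53 GHz, in line with the stated 1-2.5 GHz design range; a real ring oscillator would not be exactly linear in the number of enabled inverters.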
C. Error processing
The error processing in each node is performed in two steps, by an error combining block and a loop filter (Fig. 5).
The first block receives up to four 5-bit two's-complement coded errors. They are passed through four variable gain blocks and then summed using a four-input adder. The weighting coefficients of the variable gain blocks, Kw 1 − Kw 4, are programmable. Each gain can independently take a value from the set {0,1,2,4} and is implemented as a binary shift, which introduces only a very small delay. By programming these coefficients, we can control the connectivity between the nodes of the network. The four-input adder operates on 7-bit operands and produces a 9-bit sum. The output of the adder is buffered by a register. Note that each node is a self-sampled system: the filter is clocked by the generated local clock divided by 8, and the PFDs compare the clocks at this rate.
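The error combining block described above can be sketched in a few lines. This is an illustrative model rather than the implemented logic: the function name `combine_errors` and the range checks are assumptions. It shows why gains from {0,1,2,4} reduce to shifts, and that four weighted 5-bit errors always fit in a 9-bit signed sum.

```python
from typing import Sequence

ALLOWED_GAINS = (0, 1, 2, 4)  # programmable weights stated in the text


def combine_errors(errors: Sequence[int], gains: Sequence[int]) -> int:
    """Weight four 5-bit signed errors by shifts and sum them."""
    if len(errors) != 4 or len(gains) != 4:
        raise ValueError("expected four errors and four gains")
    total = 0
    for error, gain in zip(errors, gains):
        if not -16 <= error <= 15:
            raise ValueError("error does not fit in 5-bit two's complement")
        if gain not in ALLOWED_GAINS:
            raise ValueError("gain must be 0, 1, 2 or 4")
        if gain == 0:
            continue  # a zero weight disables the link to that neighbor
        # Gains 1, 2, 4 are left shifts by 0, 1, 2 bits (7-bit operands).
        total += error << (gain.bit_length() - 1)
    # The sum of four 7-bit signed operands fits in 9 bits.
    assert -256 <= total <= 255
    return total


print(combine_errors([1, -2, 3, 0], [1, 2, 4, 0]))
```

Setting a gain to 0 is what removes a link from the network, so the same block that weights the errors also implements the topology reconfiguration used at start-up.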
The PI filter processes the 9-bit sum of the errors. Its coefficients K p and K i are programmed by 5-bit and 12-bit words, respectively. The programmability of the filter is intended mainly for testing purposes and for validation of the theory.
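A discrete-time proportional-integral filter of the kind described above can be sketched as follows. The fixed-point format of K p and K i is not given in the text, so this sketch uses floating-point gains; the class name `PIFilter`, the update order, and the absence of output saturation are assumptions. One call to `step` corresponds to one filter sample, that is, one period of the local clock divided by 8.

```python
class PIFilter:
    """Minimal discrete-time PI filter: y[n] = Kp*e[n] + sum(Ki*e[k], k<=n)."""

    def __init__(self, kp: float, ki: float, initial_integral: float = 0.0):
        self.kp = kp
        self.ki = ki
        self.integral = initial_integral

    def step(self, error: float) -> float:
        """Process one combined error sample and return the control value."""
        # Integral path: accumulates the error and sets the steady frequency.
        self.integral += self.ki * error
        # Proportional path: reacts immediately to the current phase error.
        return self.kp * error + self.integral


pi_filter = PIFilter(kp=2.0, ki=0.5)
for sample in (4, 0, -4):
    print(pi_filter.step(sample))
```

The integral term retains its value when the error returns to zero, which is what lets the oscillator hold the synchronized frequency with a near-zero phase error.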
IV. TEST CIRCUIT DESIGN
A prototype of the distributed clock generator with 16 nodes has been designed and manufactured in 65 nm CMOS technology. It has an area of ≈2 mm 2, of which the clock network itself occupies 0.8×0.9 mm 2 (Fig. 6). Besides the clocking network, the on-chip digital circuitry includes a design-for-test block, as well as a PFD and a bang-bang detector for their characterization. The microphotograph of the fabricated silicon prototype is presented in Fig. 6.
V. MEASUREMENT RESULTS
The goals of the experiments were to characterize the phase synchronization between the DCOs and to investigate the sensitivity of the network to different perturbations. The initial frequencies of the 16 DCOs of the fabricated network are distributed within a 47 MHz range, which gives good conditions for a fast start-up of the network. This spread is explained mainly by the sensitivity of the oscillators to the supply voltage, which is measured to be ≈900 MHz/V. However, even with this mismatch, the frequency adjustment range of the network, 1100-2380 MHz, is guaranteed for a ±10 % supply voltage variation.
Fig. 7 presents the captured waveforms of the local clocks divided by 16 when the network is synchronized. The observed timing errors between neighboring clocks were in the range of 30-60 ps for a 1.6 GHz local frequency. This corresponds to 2 steps of the PFD resolution and is in good agreement with the theory and simulation. The obtained phase error is less than 10% of the clock period. This result can be improved by increasing the PFD resolution. As predicted by the theory, the error is a zero-centered random process, i.e. the static skew is zero.
Fig. 8 shows the transient process in one of the nodes. A perturbation was introduced into the network (at t = 8 µs) in order to study its robustness. After this perturbation, the clocking network resynchronizes within 17 µs. The frequency acquisition speed can be increased by employing special techniques more efficient than simple PI filtering.
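The timing figures quoted above can be cross-checked with a short calculation that uses only numbers stated in the text (1.6 GHz local frequency, 60 ps worst-case error, 32 ps TDC resolution):

```latex
T_{\mathrm{clk}} = \frac{1}{1.6\ \mathrm{GHz}} = 625\ \mathrm{ps}, \qquad
\frac{60\ \mathrm{ps}}{625\ \mathrm{ps}} = 0.096 \approx 9.6\,\% < 10\,\%, \qquad
2 \times 32\ \mathrm{ps} = 64\ \mathrm{ps} \approx 60\ \mathrm{ps}.
```

The worst-case error therefore stays just under one tenth of the clock period and matches about two TDC steps.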
In order to study synchronization in the undesired modes, we repeated a cycle of global perturbation of the network (all nodes affected) 500 times. Mode-locking was not observed in any of these test cases. In fact, this result contradicts the modeling and the theory: the possibility of mode-locking is a particular property of PLL networks that is always mentioned in theoretical studies [2] and has been reproduced in a prototype [9]. Therefore, in order to verify the proposed technique of desirable mode selection, we repeated the experiment with a reduced (2×2) network configuration, in which mode-locking should theoretically occur with high probability. In this configuration, over 500 cycles of global perturbation we observed mode-locking 4 times, and the proposed method proved effective.
In order to check the robustness of the network operation in the presence of variations in the block parameters, several experiments were carried out. In particular, the network was tested under a 10% variation of the filter coefficients: no degradation in the quality of the oscillator synchronization was observed.
The power consumption of the clocking network has been measured at a 1.6 GHz oscillation frequency under a 1.2 V supply voltage. The PFDs and PI filters consume 32 mW (≈2 mW per node). The DCO consumption is 9.8 mA/node (≈6.15 mW/GHz). We note that power optimization of the DCO was not an objective of this prototype and that better results can be obtained with a more involved design. Table I summarizes the measured results and compares them with the existing implementation of the distributed clock generator.
VI. CONCLUSION
A distributed clock generator for synchronous SoCs based on a network of phase-coupled oscillators has been demonstrated. The synchronization of the oscillators is achieved by the ADPLL network. The problem of undesirable synchronization modes is solved by a dynamic reconfiguration of the network interconnection topology at the start-up stage. The advantages of the proposed system are its compatibility with the digital environment, its flexibility of reconfiguration and the possibility of advanced control over the clock generation. The fabricated prototype has demonstrated the reliability of the proposed clock generation methodology. It has 16 nodes and operates in a frequency range of 1.1-2.4 GHz. The measured timing error between neighboring clocking domains of the circuit is less than 60 ps. This result can be improved by a more involved design of the phase-frequency detector.
