This brief presents an active distributed clock generator for manycore systems-on-chip consisting of a 10×10 network of coupled all-digital phase-locked loops, achieving less than 38 ps phase error between neighboring oscillators over a frequency range of 700-840 MHz at VDD = 1.1 V. The network is highly robust against VDD variations. Its energy cost of 2.7 µW/MHz per node is 7 times lower than that of analog implementations of similar architectures and half that of conventional H-tree architectures. This is the largest on-chip all-digital phase-locked loop network ever implemented. With clock generation nodes linked only locally, this solution is proven to be scalable. The presented clock generation network does not require any external reference, except for the startup frequency selection: it generates a synchronized signal in fully autonomous mode and maintains frequency stability within 0.09% over 1700 seconds. Such a network of frequency- and phase-synchronized oscillators can be used as a source for local clocking areas.
and radiocommunications [4], [5], distributed frequency generation [6]-[8] and others.
Clock generation [9] remains a key challenge in the implementation of high-performance, reliable Systems-on-Chip (SoCs). Indeed, the saturation of clock frequency growth is strongly related to the difficulty of distributing a clock signal over a large chip and to its energy cost. As the power consumption increases nonlinearly with the frequency of a clock generator, it dramatically affects the synthesis of gigahertz frequencies [10]. Centralized frequency distribution requires chip-wide feedback links for the control of the generated clock. This limits scalability as the size and functionality of SoCs increase.
In active distributed clocking, clock signals are re-generated for each clock domain, whose size is typically 200-300 thousand gates. Inside each clock domain, the clock signal is distributed by conventional clock tree networks of a moderate size. Global synchronization between clocking domains is achieved by coupling local clock sources with a network of phase-locked loops (PLLs). Previous implementations of active distributed clocking, such as a 4×4 network of analog coupled PLLs [11], resonant clocking [12] or oscillators coupled by injection through magnetic links [13], were based on analog techniques. Sensitivity to PVT variations, low compatibility with the digital design flow, lack of scalability and difficulties of reconfiguration are typical issues of such techniques. For this reason, all-digital coupled oscillators appear to be a promising solution. The on-chip network reported in [6], [14], having a size of 4×4 nodes, validated the feasibility of such an approach.
This brief presents the design, implementation and measurements of a large network of all-digital PLLs (ADPLLs) for the synchronization of digitally controlled oscillators and the generation of distributed clock signals for SoCs. We describe the implementation and measurement results of 10×10 synchronized oscillators in STMicroelectronics 65 nm CMOS technology. The goal of the study is to prove the scalability of this clocking solution with an increased number of oscillators, to verify the feasibility of a large all-digital globally synchronized ADPLL network and to test its performance. The operation of the network in autonomous mode, without an external reference driving the network, is also investigated.

Fig. 2. Architecture of one node (a single ADPLL) of the implemented network. It contains up to four phase-frequency detectors (PFDs), a proportional-integral (PI) controller and a digitally controlled oscillator (DCO).
II. SYSTEM ARCHITECTURE
The implemented clock generator is a Cartesian 10×10 network of distributed oscillators (Fig. 1). 180 phase-frequency detectors (PFDs) measure the phase error between each pair of neighboring oscillators. The node 'N1.1' is also connected through a PFD to an external reference signal, which sets the frequency of the whole network. All the connections of the network are programmable, so the topology of the network may be reconfigured dynamically. The external reference may be disconnected, in which case the network operates in autonomous mode. Figure 2 presents the structure of one node of the implemented network. The digitally controlled oscillator (DCO) is a 7-stage ring oscillator with CMOS inverters, whose frequency is controlled by a matrix of 7×9 three-state inverters, providing 256 frequency steps and occupying a total area of 50×50 µm². The chosen DCO architecture is highly regular and suitable for integration using EDA tools [15]. The choice of the DCO output frequency range of 700-840 MHz is a compromise between the frequency practically useful for applications (1-2 GHz typical clock frequency in IoT electronics) and the cost of implementation and testing of a laboratory prototype.
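The PFD count quoted above follows directly from the grid topology. The short Python sketch below enumerates the neighbor links of a Cartesian grid and adds the single reference link at node N1.1. The function names and the 0-based node indexing are illustrative assumptions and are not taken from the chip design.

```python
def grid_edges(rows, cols):
    """Return all neighbor pairs (links) of a rows x cols Cartesian grid."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:                     # link to the right-hand neighbor
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < rows:                     # link to the neighbor below
                edges.append(((r, c), (r + 1, c)))
    return edges


def neighbor_count(edges, node):
    """Number of links attached to a node (2, 3 or 4 in a grid)."""
    return sum(node in edge for edge in edges)


edges = grid_edges(10, 10)
n_neighbor_pfds = len(edges)          # one PFD per pair of neighboring oscillators
n_total_pfds = n_neighbor_pfds + 1    # plus the PFD linking N1.1 to the reference

print(n_neighbor_pfds, n_total_pfds)  # 180 181
```

Corner nodes have two neighbors, other peripheral nodes three and interior nodes four, which is why a node needs at most four PFD inputs.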
The distributed synchronization of the oscillators is achieved by an array of digital phase-frequency detectors implemented as a combination of a bang-bang (BB) phase detector and a time-to-digital converter (TDC) [6], [14]. The control of the DCOs is implemented through digital proportional-integral (PI) controllers. There are 100 DCOs, 100 PI controllers (correctors) and 181 PFDs in this design, with areas of 50×50 µm², 100×70 µm² and 55×30 µm² each, respectively.
Each PFD is composed of a bang-bang detector, measuring the sign of the phase error [15], and a 3-bit TDC, measuring the magnitude of the phase error [14]. Overall, the PFD provides a 4-bit signed phase error signal, ranging within ±80 ps at VDD = 1.1 V. Figure 3 shows transistor-level simulations of the input-output characteristics of the implemented PFD for different VDD voltages. In order to improve the accuracy of synchronization, the TDC employs the Vernier architecture, where the time step is defined by the difference between the delays of two cells. Compared to [6], this allows a time resolution of 16 to 20 ps at VDD = 1.1 V for small phase errors, which is less than the smallest buffer delay (30 ps in 65 nm CMOS). This design is also very robust with regard to VDD variations within a 20% range (i.e., VDD = 1.0-1.2 V). Such robustness is explained by the closed-loop architecture inherent in PLL design.
The proportional-integral controller (corrector) is a conventional controller receiving a weighted sum of the errors arriving from the neighbor DCOs [6], but with a reduced length of registers and hence a decreased area (100×70 µm² each). Each controller has four inputs with several programmable features. Firstly, the weights of the inputs (W1-W4 in Fig. 2) can be set to 0, 1, 2 or 4. The zero weight corresponds to the case when a node is disconnected or the connection does not exist (for example, the peripheral nodes have only 2 or 3 neighbors). Secondly, the PI controller has programmable gains of the integral and proportional paths. These gains can be optimized and selected to ensure synchronization. The PI controller is clocked by the local DCO signal with the local output frequency divided by 4. The same divided signal is applied to the inputs of the PFDs, thus achieving the coupling between clocking domains. The relatively large size of the PI controller (100×70 µm²) is due to these programmable functions implemented for testing purposes, and its area may be further reduced by 30-50%.
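The behavior of such a controller can be sketched in a few lines of Python. This is a minimal textbook PI update under stated assumptions, not the register-level circuit of the chip: the class name, the floating-point arithmetic and the gain values `kp` and `ki` are hypothetical, and only the four inputs and the allowed weights 0, 1, 2 and 4 come from the text.

```python
from dataclasses import dataclass

ALLOWED_WEIGHTS = (0, 1, 2, 4)   # programmable input weights named in the text


@dataclass
class PIController:
    kp: float                     # proportional gain (hypothetical value)
    ki: float                     # integral gain (hypothetical value)
    weights: tuple = (1, 1, 1, 1)  # W1..W4; 0 disables a missing neighbor
    integral: float = 0.0         # accumulated integral term

    def __post_init__(self):
        if len(self.weights) != 4:
            raise ValueError("a node has exactly four PFD inputs")
        if any(w not in ALLOWED_WEIGHTS for w in self.weights):
            raise ValueError("weights must be 0, 1, 2 or 4")

    def update(self, errors):
        """One controller step: weighted sum of PFD errors, then PI law."""
        e = sum(w * x for w, x in zip(self.weights, errors))
        self.integral += self.ki * e
        return self.kp * e + self.integral   # correction applied to the DCO


# A node whose third neighbor does not exist (weight 0).
pi = PIController(kp=2.0, ki=0.5, weights=(1, 2, 0, 4))
first = pi.update((1, -1, 3, 1))    # weighted error = 1 - 2 + 0 + 4 = 3
second = pi.update((0, 0, 0, 0))    # zero error: only the integral term remains
```

In this sketch the integral term keeps the DCO correction once the phase errors vanish, which is the usual reason for including an integral path in a PLL loop filter.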
In a real application when the proposed network is used for clock generation, the distance between two nodes will lead to some delays associated with PFDs. Since the PI corrector is a standard synchronous digital circuit, it is possible to account for delays by designing a proper timing of the register transfer level circuit. In addition, the loss of stability in the network due to delays may be compensated by a proper choice of the PI controller coefficients, as discussed in Section III.
III. FOUNDATIONS OF PLL NETWORK SYNCHRONIZATION AND PERFORMANCE
The distributed synchronous clocking approach was suggested in [16]. That work demonstrated that a network of coupled PLLs was able to provide clock signals to physically distant parts of a computing system. However, due to particular features of the phase-frequency detector used, the design suffered from "mode-locks" (multiple coexisting stable modes with synchronicity in frequency but not in phase). The first proof-of-concept of PLL networks was carried out in [11]. The network in that study was made of 16 distributed oscillators operating at 1.3 GHz, fabricated in 0.35 µm CMOS technology. The new wave of distributed frequency generation has been based on all-digital PLLs, with successful designs demonstrated in [6], [17].
There is a deep and rigorous theory underlying the synchronization process in PLL/ADPLL networks. This theory has been developed substantially over recent years, so it has become possible to treat such complex systems analytically, despite their mixed analog-digital nature and self-sampling operation. The first advancement in this regard was presented in [18], where an equivalent linear time-independent discrete-time system was proposed. In [19], a design methodology using a convex optimization approach and involving simple linear matrix inequality constraints was developed. Study [20] introduced a novel nonlinear event-driven discrete-time ADPLL model that is not based on any simplifications typical for ADPLL modelling. The proposed model was then used in [5], [21] to demonstrate the global stability and synchronization of ADPLL networks. Summarizing the recent research, we outline the following:
The worst-case synchronization error between two neighbors in a network is equal to or less than the sum of the first two resolution steps of the PFD. According to the characteristic shown in Fig. 3, it corresponds to 38 ps at VDD = 1.1 V. Several studies have emphasized the possibility of undesirable synchronization modes (mode-locks) in analog PLL networks. The implemented 10×10 oscillator network showed no mode-locks in hundreds of test runs with different configurations and different initial conditions. This is a clear advantage of a digital ADPLL network over its analog counterpart.
IV. CHIP MEASUREMENT
The photograph of the chip fabricated in STMicroelectronics 65 nm CMOS technology is shown in Fig. 4. The chip has a single supply for all the blocks of the network. For testing purposes, some signals are routed off-chip, as indicated in Fig. 1. The power consumption of the system as a function of the frequency of the input reference signal at different supply voltages is given in Fig. 5, highlighting the frequency lock-in range of the network for different VDD. The DCO power consumption dominates the overall node consumption, with ≈2.7 µW/MHz per node at VDD = 1.1 V.
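To give a feel for the ≈2.7 µW/MHz figure, the Python sketch below converts it into a per-node and a whole-network power estimate at the two ends of the 700-840 MHz range. It assumes that the energy cost is constant over that range and identical for all 100 nodes, which the text does not state; the pad ring and any off-chip drivers are not included, and the function name is illustrative.

```python
ENERGY_COST_UW_PER_MHZ = 2.7   # per node at VDD = 1.1 V, from the measurements
N_NODES = 100                  # 10 x 10 network


def node_power_mw(f_mhz, cost_uw_per_mhz=ENERGY_COST_UW_PER_MHZ):
    """Estimated power of one node in mW, assuming a constant uW/MHz cost."""
    return cost_uw_per_mhz * f_mhz / 1000.0


for f_mhz in (700.0, 840.0):
    p_node = node_power_mw(f_mhz)
    p_network = p_node * N_NODES
    print(f"{f_mhz:.0f} MHz: {p_node:.2f} mW per node, "
          f"{p_network:.0f} mW for {N_NODES} nodes")
```

Under these assumptions a node draws about 1.9 mW at 700 MHz and about 2.3 mW at 840 MHz.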
The synchronization of the network has been characterized by two methods. The first method is the observation of the digital output of three (out of a total of 181) PFDs; see Fig. 1. Since the PFDs are implemented on-chip, this provides a precise measurement of the phase error between two neighboring nodes, free of parasitics. Figure 6(a) presents the mean and the root mean square (RMS) of the digital output of the three PFDs versus the supply voltage. The mean error is under 0.3 PFD resolution steps (≈6 ps). The RMS of the phase error is close to unity. As shown by the example of the PFD between nodes N1.6 and N1.7 (Fig. 6(b) and (c)), the output of the PFD is ±1 for 94% of the time and ±2 for the remaining time. A PFD output of ±1 means that the error is below 38 ps at VDD = 1.1 V and below 42 ps at VDD = 1.0 V. Figure 6(d) presents the spectrum of the phase error noise directly measured on the oscillator outputs routed off-chip and the spectrum of the phase noise of oscillator N1.6. The precision of synchronization can be improved by increasing the PFD resolution.
The second method is an off-chip measurement of the outputs of 27 (out of a total of 100) DCOs (see Fig. 1) after the frequency is divided by 8. The large and variable length of the bonding wires (5 to 7 mm) and PCB routing, combined with the high power consumption of the pad ring, makes this method less reliable for the characterization of the real on-chip phase error. Figure 7 presents the off-chip characterization of the phase error statistics between the clocks generated on the main diagonal (between nodes (1, 1) and (i, i), where i = 1, ..., 10), obtained when the network is fully autonomous, at an output frequency of 720.8 MHz at VDD = 1.1 V. A 'proportional-to-the-distance' trend is clearly observed, with a maximal mean error of 943 ps, which is 68% of the clock period. In most cases, the almost constant standard deviation of the error, above 300 ps, is mainly attributed to the jitter due to the noise of the pad supply.
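The 68% figure can be checked with a short calculation in Python. The only inputs are the output frequency and the maximal mean error quoted above; the variable names are illustrative.

```python
F_OUT_HZ = 720.8e6          # output frequency in autonomous mode
MAX_MEAN_ERROR_PS = 943.0   # maximal mean error measured off-chip

period_ps = 1e12 / F_OUT_HZ                 # clock period, about 1387 ps
fraction = MAX_MEAN_ERROR_PS / period_ps    # error as a fraction of the period

print(f"period = {period_ps:.1f} ps, error = {100 * fraction:.0f}% of the period")
```

The period at 720.8 MHz is about 1387 ps, so 943 ps corresponds to roughly 68% of one clock period, in agreement with the text.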
A linear scaling of the absolute phase error with the distance between the nodes may be explained as follows. In synchronous mode, a PLL/ADPLL network acts as a linear control system with a transfer function in the z-domain and a constant delay. When synchronized, we deal with a network of linear systems where the phase error propagates linearly. The total phase error we observe depends on how many nodes there are in the path under observation.

[Figure caption fragment, referring to Fig. 1: the phase error is measured on the DCO signals routed off-chip, with statistics calculated over 1800 periods.]
The stability of network operation in autonomous mode over time is demonstrated in Fig. 8. The network is first driven by an external reference signal in order to set the frequency at a desired value (at t = 0). The external signal is then disconnected by reprogramming the network topology. After that, the network is observed for 1700 seconds. The plot demonstrates excellent frequency stability (with less than 0.09% max-to-min noise-like fluctuations) and high robustness of synchronization between neighboring nodes (with a mean value of less than 3 ps obtained through the off-chip measurement described above).
The proposed network design is compared with existing analog active clocking techniques [11], [13], conventional H-tree networks [22] and the authors' previous studies [6] in Table I. The size of the implemented network and the power consumed per node are among the best, except for [13], where a sophisticated inductive coupling is used. Comparing to the authors' previous … ) is mainly due to a compromise with regard to the complexity of design and test of the chip in laboratory conditions. A migration of the design to a more recent CMOS technology will naturally lead to an increased clock speed, matching the specifications of modern Systems-on-Chip. High scalability and complete compatibility with the conventional digital design flow are achieved at the price of a larger peak error between neighboring nodes (38 ps) than that in state-of-the-art analog solutions (10 ps).
V. CONCLUSION
This brief presented a first implementation of a very large network of coupled all-digital PLLs integrated on a chip using 65 nm CMOS technology. Compared to the performance of existing implementations of smaller networks, the quality of the neighbor-to-neighbor synchronization in the presented network is maintained at the same level, under 2 phase-frequency detector steps. This proves the scalability of the proposed architecture and its suitability for global clock generation in large systems-on-chip. The analysis of transistor-level simulations and chip characterization reinforces the idea that the design of the ADPLL blocks may be optimized to improve the performance of such a network drastically. The accuracy of synchronization may approach that of analog solutions if the resolution of the time-to-digital converter is increased. The size and the power consumption of a node may be improved by using an alternative DCO design, for instance, a current-controlled voltage-controlled oscillator. The ability of the proposed network to operate in autonomous mode, while providing a fully digital control of the network configuration, is an advantage of the proposed clocking technique over previously reported analog architectures.
