Abstract-This paper presents an FPGA platform for the design and study of networks of coupled All-Digital Phase-Locked Loops (ADPLLs), intended for clock generation in large synchronous Systems on Chip (SoCs). An implementation of a programmable and reconfigurable 4 × 4 ADPLL network is described. The paper emphasizes the differences between FPGA-based and ASIC-based implementations of such a system, in particular the implementation of the digitally controlled oscillators and the phase-frequency detectors. The FPGA-implemented network makes it possible to study complex phenomena related to coupled ADPLL operation and to explore stability issues and nonlinear behavior. A dynamic setup mechanism is proposed for the network, which allows the desired synchronized state to be selected. Experimental results demonstrate the global synchronization of the network and its performance for different configurations.
I. INTRODUCTION
Clocking is one of the main issues in the design of large SoCs such as multi-core processors. Synchronous communication is still widely used in SoCs, especially for applications requiring high reliability. However, practical implementation of global clock distribution is very difficult in advanced deep-submicron CMOS technologies. Our study explores an alternative clocking technique, which consists of distributed clock generation through a network of all-digital PLLs [1]. Such a network is composed of several oscillators distributed over the chip and coupled in the phase domain. Each oscillator generates a local clock. A properly designed network guarantees that all oscillators generate signals at the same frequency and with the same phase. The originality of our study lies in the fact that the PLL network is fully digital, contrary to previous implementations [1].
A network of coupled digital PLLs is a very complex, nonlinear, high-order dynamic system with several operating modes, only some of which are desirable. The mode is selected through an appropriate choice of the network parameters (the ADPLL filter coefficients, the initial conditions, etc.), which requires a solid underlying theory and a prototyping platform. This work presents an implementation of such a prototype on a single FPGA chip, which speeds up the evaluation of different modes compared with a Verilog/VHDL modeling approach. The goal of the prototype is to reproduce with high fidelity the functional behavior of an ASIC-based ADPLL network, which is being designed in the LIP6 analog/mixed circuit group. The FPGA emulator must have exactly the same architecture and the same parameter values, scaled down proportionally in frequency because of the maximum frequency limit of the FPGA board.
The FPGA implementation differs from its ASIC counterpart. In particular, blocks whose operation is based on controllable pure delays, e.g. the Time-to-Digital Converter (TDC) and the Digitally Controlled Oscillator (DCO) [2], cannot be implemented in a standard FPGA flow. The paper describes how to prototype these elements on an FPGA so that they model the ASIC-based blocks as closely as possible.
The validation of the prototype is done through testing the network synchronization with different filter coefficients, and by observing nonlinear phenomena predicted theoretically for a PLL network. The method of selection of the synchronized mode presented in [2] has been implemented and tested.
Section II describes the network architecture and its functional blocks, together with the procedure used to scale (by homothety) the dimensions of the FPGA prototype with respect to the ASIC prototype. Section III presents experimental results.
II. NETWORK ARCHITECTURE
The topology of the network is presented in Fig. 1 [2]. It is composed of Phase-Frequency Detectors (PFDs) and 16 Filter/Oscillator (FO) blocks. PFDs are placed on each border between two synchronous clock areas (SCAs) and measure the phase error between each pair of neighboring oscillators. The PFD placed in the upper-left corner compares the phase of the input reference with that of the first oscillator in the network. Such a network, if properly designed, is synchronized with the reference clock both in frequency and in phase.
The structure of a typical network node is presented in Fig. 2 [3]. Each node contains 2 to 4 PFDs: each PFD detects [4] . The duration of the signal MODE represents the absolute value of the phase error. It is applied to the input of the TDC, which converts the duration of the input signal to an unsigned binary code (Dout). This code is then combined with the SIGN signal by the arithmetic block to form a signed binary code (ERROR). The transfer function is presented in Fig. 4.
The bang-bang PFD (BBPFD) can be seen as a finite-state automaton driven by events (rising edges) at its inputs. Inspired by [4], it is implemented similarly on the ASIC and FPGA platforms.
In the ASIC-based TDC, the signal MODE is delayed by a tapped delay line [5]. At the falling edge of MODE, the thermometer code produced by the delay line is stored in a register and then converted to an unsigned binary code. The PFD sensitivity (ΔT_ASIC) is defined as one stage
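The formation of the ERROR code described above can be sketched in a few lines of Python. This is an illustrative behavioral model, not the authors' RTL: the function names, the saturation at the TDC full scale, the bit width, and the convention that SIGN = 1 denotes a negative error are all assumptions.

```python
def tdc_code(mode_duration, resolution, n_bits):
    """Quantize the MODE pulse duration into an unsigned code (Dout).

    mode_duration and resolution are integers in the same time unit.
    """
    code = mode_duration // resolution
    # Assumed behavior: the code saturates at the TDC full scale.
    return min(code, (1 << n_bits) - 1)


def signed_error(dout, sign):
    """Combine Dout with SIGN; sign=1 is assumed to mean a negative error."""
    return -dout if sign else dout


# Durations in picoseconds; 149_880 ps is the PFD resolution quoted in Section III.
dout = tdc_code(1_800_000, 149_880, n_bits=5)
error = signed_error(dout, sign=1)
```

With a 1.8 µs pulse and the 149.88 ns resolution, the model yields a magnitude of 12 units, which matches the "12 times the resolution" figure quoted in Section III.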
B. Loop Filter
The loop filter is a proportional-integral (PI) filter. Its transfer function is the following:
where K_p and K_i are the gain coefficients of the proportional and integral paths, respectively. Their values are programmable. The filter has up to four inputs (Fig. 5), which are weighted with programmable coefficients K_wi. The number of active inputs is set by programming the coefficients K_wi to one or zero. The programming of the filter coefficients is designed so that all the programmable values of the network can be changed in parallel, at the same time. This is achieved with a simple serial-to-parallel register used to load the programming values. This makes it possible to test scenarios of dynamic reconfiguration of the network.
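A minimal behavioral sketch of such a weighted PI filter is given below. It assumes the common discrete form H(z) = K_p + K_i / (1 − z⁻¹) applied to the weighted sum of the input errors; the exact form used by the authors is not reproduced here, and the class and method names are hypothetical.

```python
class WeightedPIFilter:
    """Behavioral model of a PI loop filter with up to four weighted inputs."""

    def __init__(self, kp, ki, weights):
        self.kp = kp
        self.ki = ki
        self.weights = list(weights)  # K_wi, each 0 (link off) or 1 (link on)
        self.integral = 0

    def step(self, errors):
        """Consume one set of phase errors and return the DCO control word."""
        total = sum(w * e for w, e in zip(self.weights, errors))
        # Assumed form: the integrator includes the current sample.
        self.integral += self.ki * total
        return self.kp * total + self.integral


# Only the first two inputs are enabled through K_wi.
pi_filter = WeightedPIFilter(kp=2, ki=1, weights=[1, 1, 0, 0])
first = pi_filter.step([3, 1, 5, 5])
second = pi_filter.step([0, 0, 0, 0])
```

Setting a weight to zero removes the corresponding neighbor from the sum without changing the filter structure, which is the property used for dynamic reconfiguration.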
C. DCO
Because of the limitations of FPGAs, a ring oscillator with programmable delay [2] cannot be implemented in a standard FPGA flow. Instead, the DCO is implemented as a pre-loaded N_c-bit counter whose overflow provides the DCO output signal. The counter-based DCO is driven by an external high-frequency clock (f_DCO_clk). When the counter overflows, an output event is generated and, at the same time, the counter is reloaded with the DCO input code. Hence, the period of the clock generated by the DCO (T_DCO_FPGA) for a given input code C_in is defined as
where T_DCO_clk is the period of the external clock. The frequency step of the DCO in the FPGA prototype (Δf_DCO_FPGA) is defined as
where T_oFPGA is the nominal period of the FPGA-based DCO. The DCO specification (Δf_DCO and the central frequency) defines N_c and f_DCO_clk.
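The counter-based DCO can be illustrated with a small cycle-level model. The sketch assumes an up-counter that is reloaded with C_in and produces an output event when it reaches 2^N_c, so that one output period lasts 2^N_c − C_in clock cycles; the exact count (for instance a possible off-by-one) depends on the actual RTL, and the function names are hypothetical.

```python
def dco_period_cycles(c_in: int, n_c: int) -> int:
    """Clock cycles between two overflow events of a preloaded up-counter."""
    modulus = 1 << n_c
    if not 0 <= c_in < modulus:
        raise ValueError("c_in must fit in n_c bits")
    counter, cycles = c_in, 0
    while counter < modulus:  # count up until the overflow event
        counter += 1
        cycles += 1
    return cycles


def dco_frequency(c_in: int, n_c: int, f_dco_clk: float) -> float:
    """Output frequency of the modeled DCO for input code c_in."""
    return f_dco_clk / dco_period_cycles(c_in, n_c)


# Example with illustrative values (not taken from TABLE I).
cycles = dco_period_cycles(250, 8)
freq = dco_frequency(250, 8, 60e6)
```

Under this model, adjacent input codes change the output period by exactly one period of the external clock, so the frequency step near the nominal period T_oFPGA is approximately T_DCO_clk / T_oFPGA².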
D. Scaling of FPGA prototype parameters
Because of the maximum frequency limit of the FPGA device, all the temporal parameters of the FPGA network have to be scaled down proportionally, respecting the following relation
where Δf_DCO_ASIC is the frequency step of the ASIC-based DCO, and f_oASIC and f_oFPGA are the nominal frequencies of the ASIC-based and FPGA-based DCOs, respectively. M_1, M_2, and M_3 are three constants used to simplify the derivation.
From Eq. (3) and Eq. (4), the ratio between the external clock frequency of the FPGA-based TDC (f_TDC_clk) and that of the FPGA-based DCO (f_DCO_clk) can be derived (Eq. (5)).
f_TDC_clk and f_DCO_clk are the inverse of the PFD sensitivity (ΔT_FPGA) and the inverse of T_DCO_clk, respectively. The two clocks are generated by two PLLs located in two corners of the device and are thus uncorrelated with each other.
Using the ASIC parameters given in TABLE I, the ratio can be calculated (f_TDC_clk / f_DCO_clk ≃ 0.1068). If f_DCO_clk is set to the largest frequency available on the device, the other temporal parameters can be calculated using the ratio equation (Eq. (4)). The result is shown in TABLE I. In this way, the FPGA emulator is designed as a proportionally scaled-down prototype of the ASIC system. It can operate with the same filter coefficients as those designed for the ASIC.
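As a quick consistency check, the two clock frequencies can be estimated from figures quoted in the text: the ratio of 0.1068 above and the PFD resolution ΔT = 149.88 ns reported in Section III. The values computed below are derived estimates, not entries copied from TABLE I, and the variable names are illustrative.

```python
DELTA_T_FPGA = 149.88e-9  # PFD resolution from Section III, in seconds
RATIO = 0.1068            # f_TDC_clk / f_DCO_clk quoted in the text

# f_TDC_clk is the inverse of the PFD sensitivity.
f_tdc_clk = 1.0 / DELTA_T_FPGA
# The DCO clock then follows from the ratio.
f_dco_clk = f_tdc_clk / RATIO

print(f"f_TDC_clk ~ {f_tdc_clk / 1e6:.2f} MHz")
print(f"f_DCO_clk ~ {f_dco_clk / 1e6:.2f} MHz")
```

The estimate gives a TDC clock near 6.67 MHz and a DCO clock near 62.5 MHz.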
III. EXPERIMENTAL RESULTS

A. System stability and performance
An FPGA emulator for a network with 16 clock generators is implemented on an Altera Cyclone II EP2C70 platform. With the help of this FPGA prototype, designers can choose the coefficients according to the system specification and observe the results easily. Given the low working frequency of the FPGA, as shown in TABLE I, the interconnection delay is negligible.
Various tests were carried out to compare the stability and performance of the system with different filter coefficients. In particular, the phase error between the first and second clock generators in the network is observed. Fig. 6 presents the results of three tests with different groups of coefficients. The upper plot shows a system with good performance: the maximum error is only ±2 units after convergence (one unit corresponds to the PFD resolution, ΔT = 149.88 ns). The middle plot shows a system that converges rapidly but performs worse once it is stable, with a maximum error of about ±4 units. With some other coefficient values, the frequencies oscillate wildly and the system is no longer stable, as shown in the lower plot.
B. Prevention of undesired stable states
Owing to the cyclic (modular) nature of phase, random initial conditions, and the large number of degrees of freedom in this complex system, the system can have more than one stable state [1]. In some of these states, all oscillators have the same frequency but keep a fixed non-zero phase difference with respect to their neighbors. The state that is reached depends on the initial conditions and is therefore not controllable. A configuration mechanism is thus necessary to guarantee that the network settles in the state in which all the oscillators are synchronized both in frequency and in phase.
Fig. 7 shows the clock generated by one node (node 11) and those generated by its four neighbors (nodes 7, 10, 12, 15). In this case, once the network is stable, the rising edge of clock11 is about 1.8 µs ahead of those of clock7 and clock10 but about 1.8 µs behind those of clock12 and clock15. An offset of 1.8 µs is 12 times the PFD resolution, which is large. However, because the average error is nearly zero (one-fourth of Total_Err, shown in the lowest plot of Fig. 7), the frequency of clock11 remains unchanged, and a similar phenomenon occurs at the other nodes. The network thus falls into an undesired stable state.
It is known that these undesired modes can be eliminated if the network is unidirectional [1], which means that each node receives phase error information only from its upper and left neighbors. The information is thus transmitted in one direction, from the upper-left corner of the network to the lower-right corner. However, this has a drawback: just as in a traditional clock tree, any perturbation appearing in early nodes propagates through the entire network. The clock in an area far from the reference clock has a relatively poor quality compared with clocks near the reference. This drawback can be critical and can degrade the system performance if the network has a high order.
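The undesired state observed at node 11 can be checked numerically. The sketch below uses the figures quoted above (an offset of 12 PFD units with respect to each neighbor); the sign convention, positive when clock11 leads, and the variable names are assumptions made for illustration.

```python
PFD_RESOLUTION_NS = 149.88  # one PFD unit, from Section III

# Phase errors seen by node 11, in PFD units (assumed sign convention:
# positive when clock11 leads the neighbor).
errors_units = {7: +12, 10: +12, 12: -12, 15: -12}

# All four links active: the errors cancel, so the filter sees no net error.
total_all_links = sum(errors_units.values())

# Only the upper (7) and left (10) links active: the error no longer cancels.
total_upper_left = errors_units[7] + errors_units[10]

# Size of each individual offset in microseconds.
offset_us = 12 * PFD_RESOLUTION_NS / 1000
```

With all links active the net error is zero although each individual offset is about 1.8 µs, which is why the node frequency stays unchanged. With only the upper and left links, the net error is non-zero and the loop would act on it.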
A solution has been presented in [2]. The network is configured dynamically and works in two modes. In the first mode, the network is configured unidirectionally in order to avoid undesirable stable states. Once the phase errors are corrected and all the nodes are synchronized in phase, the network is quickly reconfigured into the second mode. In this mode, the network is configured bidirectionally: all the links are activated and each clock is coupled with its four neighboring clocks, so that perturbations are suppressed. The switching of the connectivity is implemented by dynamically programming the coefficients K_wi in the loop filters presented in Fig. 5. A link can be activated or deactivated simply by assigning 1 or 0 to the corresponding coefficient K_wi. Fig. 8 shows the four phase errors between node 11 and its neighboring nodes after multiplication by the weight coefficients (K_wi). Between 0 ms and 30 ms, the network is configured unidirectionally, and only the phase differences between the node and its left neighbor (node 10) and upper neighbor (node 7) are considered. After 30 ms, the network switches to bidirectional mode and all four errors are taken into consideration. Compared with the previous simulation with the same initial state, all phase errors in this simulation oscillate around zero after 30 ms, which corresponds to the desired synchronization.
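The two-mode configuration can be sketched as a function that returns the link weights K_wi of a node as a function of time. The sketch assumes row-by-row node numbering from 1 to 16, which is consistent with node 11 having neighbors 7, 10, 12 and 15, and a fixed switching instant of 30 ms as in Fig. 8. The function names are hypothetical and the reference input of the first node is not modeled.

```python
N = 4  # 4 x 4 network, nodes numbered 1..16 row by row (assumed numbering)


def neighbors(node):
    """Return the existing neighbors of a node, keyed by direction."""
    row, col = divmod(node - 1, N)
    found = {}
    if row > 0:
        found["up"] = node - N
    if col > 0:
        found["left"] = node - 1
    if col < N - 1:
        found["right"] = node + 1
    if row < N - 1:
        found["down"] = node + N
    return found


def link_weights(node, t_ms, switch_ms=30.0):
    """K_wi per direction: unidirectional before switch_ms, bidirectional after."""
    bidirectional = t_ms >= switch_ms
    return {
        direction: 1 if bidirectional or direction in ("up", "left") else 0
        for direction in neighbors(node)
    }


early = link_weights(11, t_ms=10.0)  # only the upper and left links are active
late = link_weights(11, t_ms=40.0)   # all four links are active
```

Because the reconfiguration only changes weight values, it can be applied to all nodes at once through the parallel coefficient loading described in Section II.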
