Abstract-Synchronization is an issue of significant importance in large-scale, distributed and high-speed systems. The traditional globally synchronous approach is no longer viable because of severe wire delay. Solutions such as "Globally Asynchronous, Locally Synchronous" (GALS) approaches suffer from a metastability risk that limits their use in many-core SoCs for critical applications, such as aerospace, military or medical equipment. This paper presents a distributed clock generator based on a network of oscillators. A great advantage of this architecture is its high stability and immunity to perturbations. This architecture also makes it possible to design large, fully synchronous SoCs. A 10×10 network supplying clock sources for 100 clock domains has been modeled in VHDL and is under design in silicon. Simulation results show ±40 ps peak-to-peak phase error between two neighboring clock signals and ±50 ps between two distant clocks.
I. INTRODUCTION
Advances in CMOS technology have led to an exponential increase in chip complexity and transistor count (Moore's law). Modern SoCs can be regarded as micro-networks. Synchronization between their different areas has become a research subject of utmost importance.
There are many methods for synchronous operation. Traditional clock distribution in complex circuits uses tree or grid structures [1] [2]. These centralized distribution techniques are very expensive, mainly in terms of energy consumption and area. They also suffer from uncontrollable skew and jitter.
To solve these problems, large digital chips are partitioned into local clock areas [3] [4]. Each zone has its own local clock. These zones are small enough that the clock distribution inside them can be achieved by conventional techniques. The communication between blocks in different zones can be asynchronous (GALS: Globally Asynchronous and Locally Synchronous) or synchronous (GSLS: Globally Synchronous and Locally Synchronous). In asynchronous circuits, reliability is difficult to guarantee at the design stage because of the large state space of the designed system: time is continuous and not quantized as in synchronous circuits. For this reason, there is a motivation for pursuing research on alternative synchronization techniques for large SoCs.
This work focuses on the study and design of distributed clock generators, whose basic idea was presented in [5] [6]. Each local clock area has its own clock generator, coupled in phase with the clock generators of the neighboring areas. The main advantage of this kind of synchronous architecture is that the communication between two adjacent areas obeys the same rules as communication inside a single clocking area. Compared with a centralized clock generator, the distributed clock generator requires shorter paths to carry clock signals, reducing the accumulation of delay and jitter on distribution lines.
Several clock generator prototypes have been implemented with this technique. In particular, recent work using a network of all-digital PLLs (ADPLLs) for on-chip local clock synchronization proved the feasibility of this approach and its compatibility with the digital environment of SoCs [7].
In a fabric of coupled and spatially distributed oscillators, questions about global stability and robustness to local perturbations are of paramount importance [8] [9]. This paper presents the results of a study of an original solution that aims to limit wave propagation in a coupled PLL network and to prevent the appearance of standing waves. The proposed technique has been validated by behavioral simulation. This paper is organized as follows. In Section II, we start by studying one local clock generator in a large unlimited network. Then, we introduce an analogy between the surface of the phase error in a network of ADPLLs and a water surface. In Section III, we study the phenomena of error wave propagation and reflection in an ADPLL network with a limited surface, and we explain the advantage of the proposed "Swimming pool"-like architecture in preventing error wave reflection. Simulation results are presented in Section IV to demonstrate the performance of the proposed architecture.
II. STRUCTURE AND BEHAVIOR OF A SINGLE ADPLL
A large many-core SoC is partitioned into multiple local clocking areas. Each of these areas has its own clock generator (represented by the squares in Fig. 1(a)). The locally generated clock is synchronized in phase with its four neighbors by coupling links (represented by the lines in Fig. 1(a)). The goal of the distributed PLL network is to synchronize all oscillators both in frequency and in phase. In steady state, such a network is the source of fully synchronous local clocks. Fig. 1(b) presents the structure of one local clock generator (ADPLL) in the network [7]. It is composed of phase-frequency detectors (PFDs) placed between two synchronous clock areas (SCAs), each measuring the phase/frequency difference between the locally generated clock and a neighboring clock. Each PFD generates a 5-bit signed binary code. The errors with respect to the neighbors are then added and processed by the loop filter (LF) to generate a control word for the digitally controlled oscillator (DCO). The loop filter is a digital proportional-integral (PI) filter. The output of the DCO is divided by 4 before being used in the feedback path of the PLL.
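The signal path described above (PFD codes, summation, PI filter, control word) can be illustrated with a minimal behavioral sketch. It is not the authors' VHDL model: the function and class names, the sign convention (neighbor minus local), the truncating quantizer, the saturation behavior and the gain values are all assumptions made for illustration. Only the 5-bit signed code and the 20 ps resolution (quoted in Section IV) come from the text.

```python
# Hypothetical behavioral sketch of one ADPLL node (not the authors' VHDL model).

PFD_RESOLUTION_PS = 20.0     # PFD timing resolution quoted in Section IV
PFD_MIN, PFD_MAX = -16, 15   # range of a 5-bit signed (two's complement) code


def pfd_code(local_phase_ps, neighbor_phase_ps):
    """Quantize a phase difference into a saturating 5-bit signed code.

    Assumed sign convention: positive when the neighbor leads the local clock.
    """
    steps = int((neighbor_phase_ps - local_phase_ps) / PFD_RESOLUTION_PS)
    return max(PFD_MIN, min(PFD_MAX, steps))


class PIFilter:
    """Digital proportional-integral loop filter (illustrative gains)."""

    def __init__(self, kp, ki):
        self.kp = kp
        self.ki = ki
        self.integral = 0.0

    def update(self, error):
        # Accumulate the integral path, then add the proportional path.
        self.integral += self.ki * error
        return self.kp * error + self.integral


def control_word(local_phase_ps, neighbor_phases_ps, loop_filter):
    """Add the PFD codes of all neighbors and filter the sum."""
    total = sum(pfd_code(local_phase_ps, p) for p in neighbor_phases_ps)
    return loop_filter.update(total)
```

For example, with neighbors at +40, +40, -20 and 0 ps relative to the local clock, the PFD codes are 2, 2, -1 and 0, so the filter input is 3.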
A. From a discrete network to a continuous surface
Each ADPLL has a closed-loop behavior described by a differential equation. A network of such nodes can be seen as a discretization of a liquid surface or of an elastic membrane.
The transfer function of each block in an ADPLL can be expressed in Laplace domain:
where $K_{PFD}$ and $K_{DCO}$ are the gains of the PFD and the DCO, $\delta_{PFD}$ is the timing resolution of the PFD and $f_s$ is the sampling frequency. $K_p$ and $K_i$ are the gains of the proportional and integral paths of the PI filter, respectively, and $\tau$ is the delay in the loop filter. The closed-loop transfer function of the feedback system can be expressed as:
where
The phase comparison part of the ADPLL consists of four PFDs and generates the average value of the phase errors (Fig. 1(b)). The input of the feedback loop for the node $(x,y)$ can be regarded as $(\varphi_{x+1,y} + \varphi_{x-1,y} + \varphi_{x,y+1} + \varphi_{x,y-1})/4$. Each ADPLL satisfies the following equation:
The sum of phase error is:
This equation coincides with the discretization (Eq. 6) of the Laplacian (Eq. 5) of a scalar field Φ in 2-dimensional space.
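As an illustration of this correspondence, the sketch below evaluates the sum of the four neighbor-minus-local phase errors on a sampled scalar field and compares it with the analytic Laplacian. It uses the standard five-point stencil on a unit-spaced grid; the function name and the test field are assumptions chosen for the example, not taken from the paper.

```python
def summed_phase_error(phi, x, y):
    """Sum of the four neighbor-minus-local phase errors at node (x, y).

    This is the standard five-point stencil, i.e. the discrete Laplacian
    on a grid with unit spacing.
    """
    return (phi[x + 1][y] + phi[x - 1][y] + phi[x][y + 1] + phi[x][y - 1]
            - 4 * phi[x][y])


# Sample field Phi(x, y) = x^2 + 3*y^2, whose Laplacian is 2 + 6 = 8 everywhere.
quadratic = [[x * x + 3 * y * y for y in range(5)] for x in range(5)]

# A planar field Phi(x, y) = 2*x + y has zero Laplacian (a "flat" slope).
planar = [[2 * x + y for y in range(5)] for x in range(5)]

print(summed_phase_error(quadratic, 2, 2))  # 8
print(summed_phase_error(planar, 2, 2))     # 0
```

The stencil is exact for quadratic fields, which is why the first value matches the analytic Laplacian with no discretization error.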
Hence, from Eq. (4) and (6), Eq. (3) becomes
By performing an inverse Laplace transform on Eq. (7), we arrive at the following differential equation:
To simplify the equation, we make two approximations. First, during a very small variation of time $t$, $\Delta\Phi(t-\tau) \simeq \Delta\Phi(t) - \frac{\partial \Delta\Phi(t)}{\partial t}\,\tau$. Second, a variation of the local clock phase $\Phi$ introduces a change in the phase differences $\Delta\Phi$ with the clocks of its neighboring nodes. Since the loop filter processes the mean value of four phase errors, according to Eq. (6), during a small variation of time, 
B. An analogy with damped wave equation
Up to this point, we have passed in the reverse direction of a discretization, from the ADPLL mesh φ to a continuous surface of phase errors Φ. Eq. (9) has the same form as the damped wave equation describing water surface movement with dissipation [10]:
Here h is the height of the water, c is the wave speed and k is the damping constant. We can draw an analogy between the water level and the phase error of the synchronous network. By comparing Eq. (9) and Eq. (10), we obtain the parameters k and c of the synchronization error surface.
The transient process of an unlimited ADPLL network in the phase domain can be seen as analogous to wave movement in a vast expanse of water. In equilibrium, the whole water surface is flat. Similarly, when the ADPLL network is synchronized, all the locally generated clocks are in phase. However, if there is a local perturbation, a wave may appear and propagate through the network. This is an undesirable phenomenon. A solution that aims to limit it is proposed in the next section.
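To make the water-surface analogy concrete, the following sketch integrates a damped wave equation numerically and shows a local perturbation dying out. It assumes the commonly used form h_tt + k·h_t = c²·h_xx, which may differ in normalization from the paper's Eq. (10); it is one-dimensional for brevity, and the values of k, c, the grid size and the time step are illustrative choices, not the parameters given by Eq. (11).

```python
def simulate_damped_wave(k, c=1.0, n=101, dx=1.0, dt=0.5, steps=400):
    """Explicit finite-difference solution of h_tt + k*h_t = c^2 * h_xx.

    A unit perturbation is applied at the center with zero initial velocity,
    and both ends are held at zero. Returns max |h| after `steps` time steps.
    """
    r2 = (c * dt / dx) ** 2      # squared Courant number, must be <= 1 for stability
    a = k * dt / 2.0             # damping term from the centered difference of h_t
    h = [0.0] * n
    h[n // 2] = 1.0              # local perturbation
    h_old = h[:]                 # previous time step equal to current: zero velocity
    for _ in range(steps):
        h_new = [0.0] * n
        for i in range(1, n - 1):
            lap = h[i + 1] - 2.0 * h[i] + h[i - 1]
            h_new[i] = (2.0 * h[i] - (1.0 - a) * h_old[i] + r2 * lap) / (1.0 + a)
        h_old, h = h, h_new
    return max(abs(v) for v in h)


residual = simulate_damped_wave(k=0.05)
print(f"largest remaining displacement: {residual:.4f}")
```

With a positive damping constant the surface returns toward the flat equilibrium, which mirrors the network converging to the in-phase state after a perturbation.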
III. ADPLL NETWORK WITH LIMITED SURFACE
A large synchronous surface satisfies a continuity condition: in a local micro-region, the level difference between the nodes $(x+1,y)$ and $(x,y)$ is approximately the opposite of the level difference between the nodes $(x-1,y)$ and $(x,y)$. The Laplacian of the local phase therefore approaches zero, which means that the local region $\varphi_{x+1,y}$, $\varphi_{x,y+1}$, $\varphi_{x-1,y}$, $\varphi_{x,y-1}$ can be regarded as flat.
However, on the boundary of a limited network, the error wave reflects off the border and increases the perturbation inside the network.
To suppress this reflection, we can make the surface around a border node $(x,y)$ appear flat in order to emulate an infinite PLL network. In the case of Fig. 2(a), we want it to be flat in the x direction: $\varphi_{x+1,y} - \varphi_{x,y} \simeq \varphi_{x,y} - \varphi_{x-1,y} \Rightarrow \varphi_{x+1,y} - 2\varphi_{x,y} + \varphi_{x-1,y} \simeq 0$. This means that the variation in the x direction is not taken into consideration on the vertical border, and $\Delta\varphi_{x,y}$ reduces to $\frac{\partial^2 \Phi_{x,y}}{\partial y^2}$. Similarly, Fig. 2(b) shows the case of a corner node. These anti-reflection considerations lead us to isolate a border that only distributes its clock to a kernel surface, in which the nodes are connected as in an unlimited network (Fig. 3).
All ADPLLs (border and kernel) are used to generate local clocks. The network border can be regarded as an independent, synchronous ring that excites the inner kernel like a membrane. This ring, with the reference clock signals at its four corners, provides a reference for the ADPLLs in the kernel and absorbs the error waves. In the "Swimming pool" analogy, the ring of the ADPLL network acts as the overflow channels of a pool.
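One possible reading of this topology is sketched below as a function that lists, for each node of an n×n network, the clocks its PFDs compare against. Kernel nodes listen to their four neighbors, while border nodes listen only along the border, so the kernel never drives the ring. The function name, the 1-based coordinates and in particular the handling of corner nodes (assumed here to lock only to the reference clock) are assumptions; the exact connections are those of Fig. 2 and Fig. 3, which are not reproduced in this text.

```python
def pfd_inputs(x, y, n):
    """Return the clock sources compared against by node (x, y), 1-indexed.

    Hypothetical reading of the "Swimming pool" topology:
    - corner nodes follow the external reference (assumption),
    - border nodes only follow their neighbors along the border,
    - kernel nodes follow their four neighbors, border nodes included.
    """
    on_x_edge = x in (1, n)
    on_y_edge = y in (1, n)
    if on_x_edge and on_y_edge:
        return ["REF"]
    if on_x_edge:                      # vertical border: x variation ignored
        return [(x, y - 1), (x, y + 1)]
    if on_y_edge:                      # horizontal border: y variation ignored
        return [(x - 1, y), (x + 1, y)]
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]


# In a 10x10 network, kernel node (2, 5) listens to border node (1, 5),
# but (1, 5) does not listen back: the coupling is one-way at the border.
print(pfd_inputs(2, 5, 10))
print(pfd_inputs(1, 5, 10))
```

The one-way coupling is what lets the ring act as a stable reference: a perturbation in the kernel has no path by which to disturb the border clocks.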
IV. SIMULATION RESULTS
A 10×10 network as shown in Fig. 3 is modeled in VHDL. A PFD with a resolution of 20 ps and a DCO with a nominal frequency of 1 GHz and a mean frequency step of 2.26 MHz are used in this work. The simulations make it possible to study the behavior of the network with different parameters and to validate the theoretical analysis. To observe the transient process of error attenuation, the phase errors of the local clocks with respect to the reference clock are sampled every cycle and used to create a 3-D animation. Fig. 4 shows the phase error surface just after a large perturbation at the center. We observe that the border is very stable, with a relatively low error amplitude, while the kernel acts like a membrane fluctuating up and down with decreasing amplitude until the whole network is in phase. When the whole network is in phase, the phase difference between two neighboring clocks is within ±40 ps, which is two steps of the PFD resolution. We measure the phase error of each clock with respect to the reference clock REF in order to obtain the clock error distribution histogram of the nodes in the kernel (Fig. 5(a)) and that of the nodes in the ring border (Fig. 5(b)). The border clocks clearly have smaller errors than the kernel clocks, which agrees with our previous analysis.
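The figures quoted above can be cross-checked with a short calculation. The sketch converts the ±40 ps steady-state error into PFD steps and estimates how much one DCO frequency step changes the clock period at the nominal frequency. The function names are illustrative, and treating the mean frequency step as a single step around exactly 1 GHz is a simplifying assumption.

```python
PFD_RESOLUTION_PS = 20.0   # PFD resolution from the simulation setup
F_NOMINAL_HZ = 1.0e9       # DCO nominal frequency
F_STEP_HZ = 2.26e6         # DCO mean frequency step


def pfd_steps(error_ps):
    """Express a phase error as a number of PFD resolution steps."""
    return error_ps / PFD_RESOLUTION_PS


def period_change_ps(f_hz, df_hz):
    """Change of the clock period, in ps, when the frequency rises by df_hz."""
    return (1.0 / f_hz - 1.0 / (f_hz + df_hz)) * 1e12


print(pfd_steps(40.0))                             # 2.0 PFD steps
print(period_change_ps(F_NOMINAL_HZ, F_STEP_HZ))   # roughly 2.25 ps per DCO step
```

Under these assumptions, one DCO step shifts the period by roughly a tenth of the PFD resolution, so the oscillator can be tuned much more finely than the detector can measure.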
A significant advantage of the proposed circuit is its good perturbation attenuation. To demonstrate it, we compare the proposed architecture shown in Fig. 3 with a conventional circuit. When the network is in phase, we add an artificial perturbation on CLK35 at the node (3,5), and observe the transient response on the nodes (3,5), (2,5) and the nearest border node (1,5).
In the conventional circuit, CLK15 is clearly affected by the perturbation applied to CLK35. The reflection produces wavelets on node (1,5) (Fig. 6(a)). In the proposed "Swimming pool"-like topology, we observe that, thanks to the strong ring border, CLK15 is not affected and there are no more wavelets on CLK35 (Fig. 6(b)). According to Eq. (11), the wave speed and the damping constant depend on design parameters (gains of the PFD and the DCO, filter coefficients, etc.). The reconfigurability of the loop filter allows the features of the control system to be modified according to the specification [7]. Fig. 7 shows the phase errors of the clock signals on the principal diagonal of the proposed network for two different parameter sets. We observe that if the system is overdamped, it takes a shorter time to acquire the reference frequency, but the phase error in steady state is relatively larger (±100 ps in Fig. 7(a)) than in an underdamped system (±50 ps in Fig. 7(b)). Designers can choose the appropriate parameters to meet their requirements on convergence speed and maximum error.
V. CONCLUSION
An ADPLL network with "Swimming pool"-like topology is proposed for synchronization in large many-core SoC. This 
