for Globally-Asynchronous Locally-Synchronous (GALS) 
INTRODUCTION AND RELATED WORKS
The continuous increase in clock frequency together with technology scaling has made the distribution of a single global clock over a large digital chip tremendously difficult. Globally Asynchronous Locally Synchronous (GALS) design alleviates the problem of clock distribution by having multiple clocks, each one being distributed over a small area of the chip. An integrated circuit with different clock frequency domains appears as a natural enabler for fine-grain power-aware architectures. Indeed, power consumption is a limiting factor in VLSI integration, especially for mobile applications. Dynamic Voltage and Frequency Scaling (DVFS) [1] has proven to be highly effective in reducing the power consumption of the chip while meeting the performance requirements [2]. The key idea behind local DVFS is to control, at fine grain and at runtime, the supply voltage and the frequency of an island so as to minimize its power consumption while satisfying the computation/throughput constraints [3].
The DVFS techniques mainly rely on two 'actuators', namely voltage and frequency actuators.
These actuators need to be dynamically controlled in order to reduce the power consumption while maintaining the required performance. More precisely, the control policy must be carefully designed in order to achieve high power efficiency at low area cost. The voltage actuator fixes the supply voltage of the Voltage and Frequency Island (VFI). It can be a classical buck converter [4] or a digital Vdd-hopping converter [5] , [6] . The frequency actuator is a Clock Generator. Its frequency control is related to the supply voltage control in order to avoid timing faults [7] . This Clock Generator is classically based on a Phase-Locked Loop (PLL) or a Frequency-Locked Loop (FLL).
Another consequence of technology scaling is the in-die and die-to-die process variability (P-variability). From a practical viewpoint, it is becoming increasingly difficult to manufacture integrated circuits with tight parametric values [6]. In other words, the circuit performance is becoming more and more unpredictable, and the optimum functional frequency can differ from one IP to another on the same chip, not only due to Process variation but also due to Temperature and Voltage changes (PVT) over time. As a consequence, in-die process variation means that the optimum functional and energetic point of the whole circuit can be found if VFI number $i$ has its operating frequency in the range $[F_{min,i}, F_{max,i}]$ [8]. If the clock is generated for the whole circuit and distributed to each VFI, the maximum acceptable frequency (i.e. the one that ensures no timing fault for any VFI) will be $\min_i F_{max,i}$, leading to suboptimal circuit operation, some VFIs being under-clocked. Therefore, in order to obtain the best possible circuit performance, the clock must be locally generated and controlled according to Process, Voltage and Temperature (PVT) variations.
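The cost of a single shared clock can be illustrated with a few lines of Python. This is only a sketch: the island names and the per-VFI maximum frequencies are hypothetical values chosen for illustration, not figures from this work.

```python
# Hypothetical per-VFI maximum safe frequencies in GHz (illustrative only).
f_max = {"vfi0": 1.8, "vfi1": 2.4, "vfi2": 2.1}


def global_clock(f_max_by_vfi):
    """A single shared clock must respect the slowest island."""
    return min(f_max_by_vfi.values())


def underclock_ratio(f_max_by_vfi):
    """Fraction of each island's attainable frequency lost under a global clock."""
    f_glob = global_clock(f_max_by_vfi)
    return {name: 1.0 - f_glob / f for name, f in f_max_by_vfi.items()}


print(global_clock(f_max))
print(underclock_ratio(f_max))
```

With locally generated clocks, each island could instead run at its own maximum, which is the motivation for replicating a small clock generator in every VFI.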
Recently, control techniques were applied to the problem of DVFS (for instance, see [5], [9]). These works only address the closed-loop control of the voltage actuator, the latter implementing a Vdd-hopping technique.
Structure of the closed-loop system and main objectives
In the context of the industrial French project LoCoMoTiV, a DFLL is selected as the second actuator (i.e. frequency actuator) due to the area constraint: in a fine-grain GALS context, the DFLL can indeed be replicated in each VFI of the size of a processor in a manycore architecture. This paper is not dedicated to LoCoMoTiV but to the design of the control law embedded in the DFLL, which must be robust to PVT variability. The general structure of the DFLL (see Figure 1) is composed of three main blocks, namely, a Digitally-Controlled Oscillator (DCO) that provides at its output a signal with frequency clk_dco, a sensor to measure the frequency at the output of the closed-loop system, and a controller that first compares the targeted reference and the measured frequency and then applies some "intelligent control". The controller design strongly depends on the DCO and sensor models.
Due to PVT variability, the characteristics of the DCO cannot be considered identical from one VFI of the chip to another, nor from one chip to another. Moreover, they evolve with temperature and power-supply voltage changes (VT variability). Thus the closed-loop mechanism at least mitigates the performance dispersion. It is worth noting that the whole architecture is digital. The second and main objective of the work presented in this paper is to design a controller for the DFLL taking into account the following requirements:
• closed-loop stability;
• suitable performance (no overshoot, no static error, short transient period, see Figure 2);
• robustness with respect to PVT variations. The control law that will be implemented within the circuit must ensure the "correct" functioning of the DFLL whatever the underlying process parameters, temperature and supply voltage are (within a given range);
• low area cost;
• exogenous perturbation rejection in the frequency output.
Therefore, the designed controller must not only guarantee the set-point stabilization, but also satisfy other criteria. Accurate Spice simulations show that the DCO can be described by a linear model. Moreover, the sensor introduces a delay that must be taken into account. The system characteristic can change due to PVT effects. A simple integral controller, which requires a minimum implementation area, is enough to fulfill all the requirements given above. To tune the control gain, a robust and optimal control problem is formulated, for which a functional must be minimized. In order to solve this problem, some Linear Matrix Inequalities (LMIs) are defined [11]. When these LMIs are satisfied within the optimal problem, all the requirements above are fulfilled by the closed-loop system.
Consequently, an optimal and robust control law for the DFLL is reached.
Some simulations in the Matlab/Simulink environment show the effectiveness of the proposed controller. Moreover, the closed-loop system was implemented in RTL, yielding simulation results similar to the ones obtained in Matlab/Simulink. The resulting layout was implemented in the LoCoMoTiV circuit in 32 nm CMOS.
Related Works
PLL or FLL circuits can be considered good candidates for frequency generation within integrated circuits. Both circuits are widely used building blocks. However, new or improved architectures still continue to appear in order to meet today's constraints induced by technology scaling. PLLs are usually considered area consuming [12], which clearly becomes a disadvantage when the PLL has to be replicated in each VFI. Note that the stability of a PLL is also usually much more difficult to obtain than with an FLL. This is due to the "integrator" that naturally appears in the PLL structure.
A fully integrated PLL for frequency synthesis in wireless applications with 45nm CMOS technology is proposed in [13] . The analog PLL is made of a top-biased VCO, a divider in the feedback loop, a Phase/Frequency detector (PFD) and a charge pump. The output frequency ranges from 2 to 2.6 GHz. The loop filter is not explicitly reported. The area cost (0.042 mm²) of this analog PLL is slightly larger than the one (0.028 mm²) of an all-digital PLL developed in the same technology [8] . This latter digital PLL contains a DCO made with tri-state inverters, a digital Proportional-Integral (PI) controller and a divider in the feedback loop. The comparison between the reference frequency and the divided output frequency is achieved with a bang-bang phase/frequency detector (see [15] for a high level architecture scheme of the digital PLL). The output frequency range is from 0.84 to 13.3 GHz.
[16] describes a PLL with leakage current and power supply noise compensation, designed for
The FLL in [12] is made of two Frequency-to-Voltage Converters (FVCs), an operational amplifier (equivalent to a subtractor and a simple proportional filter), a VCO (ring oscillator of five delay cells) and two frequency dividers. Note that both FVCs must be carefully matched to reduce the static error. However, due to the control scheme chosen, the static error is unavoidable. Therefore, this scheme will not be able to fulfill the requirements given above. With 0.35 µm CMOS technology, the total active area of the circuit is 0.22 mm². The response time to switch the output frequency from 171 to 230 MHz is 2 µs. The VCO output frequency ranges from 161 MHz to 256 MHz.
A digital FLL for low power operation in multicore architecture is described in [17]. The targeted application is quite similar to the one of the present work. A tapped ring oscillator is implemented. A digital counter senses the FLL output frequency. A compare-subtract block computes the discrepancy between the targeted set point and the frequency measurement. The input of the tapped ring oscillator is changed through a shift register when this discrepancy is lower/higher than a given threshold. The range of frequencies is between 1.62 and 10.71 GHz. The estimated size is 0.001225 mm². Note that the correlation between the frequency discrepancy and the shift is not indicated, and the control cannot strictly speaking be considered as a classical control scheme. Results are obtained in simulation with IBM soi12s0 technology (45 nm).
To our knowledge, none of the previously published systems fully satisfies the requirements that have been fixed for this circuit design. Therefore, a fully digital variability-aware DFLL is developed. Table I summarizes the characteristics of the frequency generation circuits discussed above.
The rest of the paper is organized as follows. Section 2 provides the architecture of the blocks that form the DFLL. The analytical models of the DFLL blocks are presented in Section 3. Section 4 is dedicated to the control structure that is selected here, and an optimal and robust control problem is also formulated. In Section 5, this problem is solved by providing an approach to tune the controller gain. The results obtained together with a comparison with state-of-the-art solutions are provided in Section 6. The paper ends with conclusions and future work.
Notation
For a given S, the notation Co(S) denotes the convex hull of set S. The variation of ξ in two consecutive sampling times is given by:
Finally, $L_2$ is the space of sequences $x_k$ with the norm:
DFLL ARCHITECTURE
In order to model and develop the DFLL control, the architecture that implements the DFLL is analyzed in this section. A classic closed-loop DFLL is composed of three main blocks: a DCO, a sensor and a controller (see Figure 1 ). However, for implementation issues, the whole DFLL is split in five main elements (see Figure 3 ):
• the Digitally-Controlled Oscillator (DCO) is composed of a Digital-to-Analog Converter (DAC) and a Voltage-Controlled Oscillator (VCO);
• the DFLL Control implements the controller and handles the configuration from the host;
• the Clock Counter acts as sensor. It measures the clock generated by the DCO;
• the Clk-ref Counter generates the time reference signals;
• the Clk Divider & Selector builds various divider clocks and selects the appropriate one to obtain the output clock clk_out. 
Digitally-Controlled Oscillator
The DCO is the only part of the design that is implemented in custom cells. The VCO (Figure 4) is based on a ring oscillator composed of four Voltage Controlled Delay cells (VCD) [19]. The propagation delay through these delay cells is controlled by two bias voltages, namely, an upper bias and a lower bias. To obtain the DCO, a binary code ($F_{req}$) is transformed by two R-2R DACs into an upper and a lower bias voltage applied to the VCO. The two DACs are composed of driving buffers (simple digital standard cells) and a resistance ladder following an R-2R pattern. The DAC output impedance R is set to drive the VCO input.
Sensor
The feedback sensor is implemented as a synchronous counter. This device counts the number of generated clock pulses during a given time period. This reference time is fixed and synchronized with an external low-frequency clock (100 MHz). In the proposed architecture (see Figure 3), the sensor is
The Clock Counter acts as the real sensor, counting the number of pulses generated by the DCO. It is implemented as an asynchronous ripple counter controlled by the count and update signals generated by the Clk-ref Counter. Once the update phase starts, the counter value is registered to be used by the DFLL Control engine, and the counter is cleared to start the next count phase. Figure 6 shows the
The Clock Counter is fully implemented in the clk_dco domain. Since this counter needs to be very fast, it is partially designed as an asynchronous ripple counter. The first two bits are implemented as a ripple counter; this decreases the maximum input frequency clk_dco from 4 GHz (in the nominal case) down to 2 GHz (for bit 0) and down to 1 GHz (for bit 1). Then a standard incrementer is used, at 1 GHz, instead of a full carry ripple adder, avoiding a large skew in the output bits (see Figure 7). The two input control signals, count and update (generated from the clk_ref domain), need to be properly synchronized with clk_dco. A schematic view of the Clock Counter is presented in Figure 7.
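The counting principle described above can be sketched as a small behavioural model. This is an idealized illustration, not the RTL: the length of the count window (expressed here in reference-clock cycles) is a hypothetical parameter, and synchronization effects between clk_ref and clk_dco are ignored.

```python
F_REF_HZ = 100_000_000  # external reference clock quoted in the text (100 MHz)


def count_pulses(f_dco_hz: int, n_ref_cycles: int) -> int:
    """Whole clk_dco periods seen during a window of n_ref_cycles reference cycles.

    Integer arithmetic is used so that the result is exact.
    """
    return (f_dco_hz * n_ref_cycles) // F_REF_HZ


def resolution_hz(n_ref_cycles: int) -> int:
    """Frequency step represented by one count for a given window length."""
    return F_REF_HZ // n_ref_cycles


def ripple_stage_rates(f_dco_hz: int, n_ripple_bits: int = 2) -> list:
    """Output toggle rate of each ripple bit; each bit halves the rate."""
    return [f_dco_hz // (2 ** (i + 1)) for i in range(n_ripple_bits)]


# Nominal 4 GHz DCO clock, hypothetical window of 4 reference cycles (40 ns).
print(count_pulses(4_000_000_000, 4))
print(ripple_stage_rates(4_000_000_000))
```

The second helper reproduces the 2 GHz and 1 GHz figures quoted for bit 0 and bit 1: after two ripple bits, the remaining incrementer only has to run at a quarter of the DCO frequency.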
Controller
The controller implemented in this architecture will ensure the proper functioning of the circuit. It is designed not only for the closed-loop system to reach the set point, but also to fulfill the requirements given in Section 1.1. The proposed controller is described in detail in Section 4, while the method used to tune its parameter is given in Section 5.
This controller must be developed taking into account its hardware implementation and the area constraint.
Clock Divider and Selector
The frequency of the DCO output signal is in the range 1 to 4 GHz. This high frequency cannot be directly used by digital synchronous circuits for the applications targeted. It is thus required to downscale the generated frequency to the MHz range. As a consequence, the following functions are provided:
• a clock division by a ratio from $2^1$ to $2^{16}$. This is simply implemented by chained flip-flops. The first flip-flop ensures a clean 50% duty cycle at the DFLL output. The generated DFLL clock can therefore range from 2 GHz down to 100 kHz;
• a clock selector, which allows dynamically selecting between two clock division factors without any glitches. This mechanism can be used to switch very rapidly between two frequencies, for instance for DVFS in coordination with Vdd-hopping [6].
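The divider chain can be illustrated numerically. The sketch below assumes that the division ratios are powers of two from 2^1 to 2^16 (one divide-by-two per chained flip-flop); the helper names are hypothetical and the glitch-free selection logic is not modelled.

```python
def divided_clock_hz(f_dco_hz: float, n_stages: int) -> float:
    """Output frequency after n_stages chained divide-by-two flip-flops."""
    if not 1 <= n_stages <= 16:
        raise ValueError("n_stages must be between 1 and 16")
    return f_dco_hz / (2 ** n_stages)


def closest_stage(f_dco_hz: float, f_target_hz: float) -> int:
    """Number of stages whose output is nearest to the requested frequency."""
    return min(
        range(1, 17),
        key=lambda n: abs(divided_clock_hz(f_dco_hz, n) - f_target_hz),
    )


# With a 4 GHz DCO clock, a 500 MHz target is reached after three stages.
print(closest_stage(4e9, 500e6))
print(divided_clock_hz(4e9, 1))
```

Since only power-of-two ratios are available in this sketch, intermediate output frequencies have to be obtained by moving the DCO frequency itself, which is the role of the closed loop.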
ANALYTICAL MODELS
The analytical models for the DFLL blocks (DCO and sensor, as shown in Figure 1) are derived in this section. These models will be used in order to choose the controller structure, taking into account the requirements given in Section 1.1.
Digitally-Controlled Oscillator
From accurate Spice simulations, it can be assumed that the DCO has a linear model that evolves with respect to Process variation but also to Temperature and Voltage changes (PVT) over time.
The DCO model is assumed to be
where the left-hand side is the analog frequency output, $u_k \in \mathbb{N}$ is coded over 8 bits, between 0 and 255, $b$ is the DC offset and $K_{DCO}$ is a gain. $w_k$ is an energy-bounded signal that accounts for perturbations, and $B_w$ is a constant that defines the perturbation magnitude. In order to consider the PVT variation effects, it is assumed that parameters $K_{DCO}$, $b$ and $B_w$ can change in the intervals
Sensor model
The sensor, which is a counter, measures the frequency of the DCO output signal. This sensor introduces a delay of one sampling period
where $K_s$ is a positive constant that represents the sensor gain. Note that the delay is present in the feedback loop (see Figure 1).
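The two models of this section can be sketched in open loop as follows. The functional forms are assumptions consistent with the symbol definitions given here (a linear DCO with gain, offset and additive perturbation, and a sensor that scales the previous output sample); the function names and the numerical values are hypothetical.

```python
def dco_frequency(u, k_dco, b, b_w=0.0, w=0.0):
    """Assumed linear DCO model: gain times input code, plus offset and perturbation."""
    return k_dco * u + b + b_w * w


def sensor_readings(frequencies, k_s, initial=0.0):
    """Assumed sensor model: each reading is K_s times the previous output sample."""
    delayed = [initial] + list(frequencies[:-1])
    return [k_s * f for f in delayed]


# Hypothetical characteristic: 10 MHz per code, 1000 MHz offset.
freqs = [dco_frequency(u, k_dco=10.0, b=1000.0) for u in (0, 100, 255)]
print(freqs)
print(sensor_readings(freqs, k_s=0.5))
```

The readings lag the frequencies by one sample, which is the delay the controller design has to account for.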
CONTROLLER STRUCTURE AND CONTROL PROBLEM STATEMENT

Structure of the controller
From the requirements provided in Section 1 and the models of the DCO and the sensor, the DFLL control engine can be selected as a simple digital integral filter:
where K is the controller gain to be tuned, u is the input of the DCO (see Figure 1) and ε is the difference between the Set_point (i.e. the desired output, coded on a byte) and the measurement $M_k$ given in (4)
Then, (5) yields
Note that the choice of (7) for the controller structure also limits the silicon area.
The structure used to implement the controller is made of three arithmetic operators and a command register, as shown in Figure 8. The whole data-path logic is implemented using only combinational logic. This logic clearly cannot be executed in only one cycle with a 1 GHz clock. Thus a multi-cycle path and its associated control logic are used. Note that they are not shown in Figure 8 for the sake of clarity. Finally, $u_k$ is registered, to generate a stable value, before being sent out to the DCO.
The controller gain K must be selected in such a way that the closed-loop system satisfies the whole set of requirements.
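A single update of an integral controller of this kind can be sketched with integer arithmetic. Several points here are assumptions rather than details taken from the design: the mapping of the three arithmetic operators to a subtractor, a gain stage and an adder; the representation of the gain as a multiply followed by a right shift (`k_num`, `k_shift`); and the saturation of the command to the 8-bit range.

```python
def controller_step(u, set_point, measurement, k_num=1, k_shift=2):
    """One integral update: accumulate the scaled error into the command u."""
    error = set_point - measurement          # assumed operator 1: subtractor
    increment = (error * k_num) >> k_shift   # assumed operator 2: gain K = k_num / 2**k_shift
    u_next = u + increment                   # assumed operator 3: adder
    return max(0, min(255, u_next))          # assumed clamp to the 8-bit command range


print(controller_step(100, 120, 100))  # positive error raises the command
print(controller_step(250, 255, 0))    # large error saturates at 255
print(controller_step(2, 0, 100))      # negative error saturates at 0
```

Note that Python's `>>` on a negative integer rounds toward minus infinity; a hardware implementation would have to choose its own rounding rule.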
Closed-loop system
Define the output error signal with
Then, from (7), it follows that
An analytical closed-loop system is obtained. From (3) and (8), the error equation is
Now, from (9) it follows that
Applying (10) 
where
and
Note that b does not influence the system response.
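The remark on $b$ can be checked under one plausible reading of the models. The forms below are assumptions reconstructed from the symbol definitions of Section 3 and from the integral structure of Section 4: the output symbol $F_k$, the abbreviation $S$ for the Set_point and the exact time indices are not taken from the source, so this is a sketch rather than the derivation of the paper.

```latex
\begin{align*}
F_k &= K_{DCO}\,u_k + b + B_w\,w_k, \\
M_k &= K_s\,F_{k-1}, \\
u_{k+1} &= u_k + K\,(S - M_k), \\
F_{k+1} - F_k &= K_{DCO}\,(u_{k+1} - u_k) + B_w\,(w_{k+1} - w_k) \\
              &= K\,K_{DCO}\,(S - K_s\,F_{k-1}) + B_w\,(w_{k+1} - w_k).
\end{align*}
```

Under these assumptions the constant $b$ cancels in the difference $F_{k+1} - F_k$: it shifts the steady-state command but not the transient, and the perturbation enters only through its variation between two consecutive samples.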
Control problem statement
Equation (12) can be rewritten in the following explicit closed-loop form, in such a way that an $H_\infty$ control problem can be formulated:
Problem 1: The problem is to find the optimal gain K such that the controller (7) is robust and the system response is as short as possible without producing an overshoot. Besides, there exists a
The solution to this problem guarantees suitable performance as well as robust stability and robust disturbance rejection for system (16)-(17). Section 5 solves Problem 1 with an optimal $H_\infty$ design of the controller.
OPTIMAL $H_\infty$ CONTROL DESIGN
In order to cope with Problem 1, a mathematical manipulation of (16) is performed via a variable change. This allows obtaining feasible LMIs for a robustness problem [20] .
Model transformation
Then, (16) is rewritten in the form [21] :
This system can be compactly written as: 
Control design
Problem 1 will be formulated in terms of Linear Matrix Inequalities (LMIs) [22] .
Assumption 1: There exists a Lyapunov function $V_k$ with condition (18) and a $\gamma$ such that
Proof: The goal is to satisfy (24) for both disturbance rejection and asymptotic stability of the equilibrium for system (16)
The expression of (24) is replaced by (27) in such a way that the LMIs (26) are obtained.
Robust control
Now, the uncertain parameters given in Section 3 are taken into account in order to guarantee robustness together with closed-loop stability and disturbance rejection for the DFLL system. This means that a robust control under parameter uncertainties satisfies those properties. For this reason, Theorem 1 is extended to the case of polytopic uncertainties.
TQ K
Then, for the vertices j, the equilibrium is asymptotically stable and the disturbances are rejected in the entire polytope.
Proof: This is an extension of Theorem 1 to polytopic uncertainties with some mathematical manipulations. Therefore, the proof of this theorem is straightforward.
Optimal and robust control
In order to satisfy the whole of Problem 1, further assumptions and a lemma are introduced.
If $Z > 0$ is chosen, overshoots are avoided. In addition, if K is maximized, the response time is the shortest possible [23]. Note that,
Remark 1: From Theorem 2, it is ensured that Z<1, that is, the closed-loop system is stable. 
The sum of both sides is 
Optimal and robust control result
Now, an optimal and robust control is computed for the DFLL by employing the approach presented above. 
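As a rough cross-check of the idea of tuning one gain for a whole range of DCO characteristics, the sketch below uses plain vertex enumeration on an assumed second-order loop. It is not the LMI procedure of this section: it assumes the closed loop has the characteristic polynomial z^2 - z + a with a = K * K_DCO * K_s (an integral controller around a linear DCO and a one-sample sensor delay), and all numerical values are hypothetical.

```python
import cmath


def closed_loop_poles(k, k_dco, k_s):
    """Roots of z**2 - z + a, with a = k * k_dco * k_s (assumed loop model)."""
    a = k * k_dco * k_s
    root = cmath.sqrt(1 - 4 * a)
    return ((1 + root) / 2, (1 - root) / 2)


def is_stable_no_overshoot(k, k_dco, k_s, tol=1e-6):
    """True if both poles are real, non-negative and strictly inside the unit circle."""
    return all(
        abs(p.imag) <= tol and 0.0 <= p.real < 1.0
        for p in closed_loop_poles(k, k_dco, k_s)
    )


def max_gain(k_dco_vertices, k_s):
    """Largest gain keeping the poles real at the worst-case (largest) DCO gain."""
    return 1.0 / (4.0 * k_s * max(k_dco_vertices))


# Hypothetical vertices of the DCO gain interval and hypothetical sensor gain.
vertices = (10.0, 15.0, 20.0)
k_s = 0.01
k = max_gain(vertices, k_s)
print(k, [is_stable_no_overshoot(k, v, k_s) for v in vertices])
```

In this simplified setting the real-pole condition a <= 1/4 plays the role of the no-overshoot requirement, and taking the largest admissible gain at the worst-case vertex mirrors the idea of maximizing K to shorten the response.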
DFLL IMPLEMENTATION
This section deals with implementation issues of the DFLL. Firstly, the design and validation flow is detailed. Then, Matlab/Simulink simulation results are discussed. Finally, the RTL design and experimental results are presented.
Design and validation flow
At the first stage of the design flow, a full-custom design was carried out for the DCO and validated at Spice level. It was characterized at various PVT corners, obtaining the results shown in
Recall that the DFLL characteristic curve can change due to PVT variations, as shown in Figure 5.
In order to validate the system robustness with respect to these changes, three different models are considered (see Figure 9 ):
• syst 1: 0315 . 0 and 10 19.8287
• syst 2: 5785 . 4 and 10 14.25
• syst 3: 0785 . 2 and 10 25.5
The optimal and robust control gain has been fixed with the methodology presented in Section 5. Therefore, whatever the characteristic of the DCO is, the closed-loop system will behave as expected. Moreover, exogenous perturbations at the output of the DCO will be rejected. Figure 10 shows the closed-loop response of "syst 1", "syst 2" and "syst 3" to a change in the Set_point. Note that the offset and the gain of the DCO change, which can happen due to PVT variability. These tests show that the equilibrium point is robust with respect to the uncertainty in the characteristic curve. Note that the response time at 5% is achieved before the 7th sampling time. Figure 11 shows the frequency output when the characteristic curve changes ("syst 1", "syst 2" and "syst 3" respectively) and when there are some exogenous perturbations at the output of the system. This example shows the robustness of the system when the optimal robust control tuning is employed.
Figure 11. Evolution of the output frequency with perturbation and for three different systems.
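The kind of experiment described above can be mimicked with a few lines of simulation. This is a sketch under stated assumptions, not a reproduction of Figures 10 and 11: the loop is taken to be an integral controller around a linear DCO with a one-sample sensor delay, the command is not quantized or saturated, and the three characteristics, the sensor gain, the controller gain and the disturbance are hypothetical values.

```python
def simulate(k, k_dco, b, k_s, set_point, n_steps=80, disturbance=None):
    """Return the DCO output frequency at each sampling time (arbitrary MHz units)."""
    u = 0.0
    f_prev = b  # assumed output before the first sample (command at zero)
    outputs = []
    for step in range(n_steps):
        w = disturbance(step) if disturbance else 0.0
        f = k_dco * u + b + w            # assumed linear DCO with additive perturbation
        outputs.append(f)
        m = k_s * f_prev                 # measurement delayed by one sample
        u = u + k * (set_point - m)      # integral control law
        f_prev = f
    return outputs


# Hypothetical (gain, offset) pairs standing in for three PVT corners.
SYSTEMS = {"A": (15.0, 500.0), "B": (10.0, 900.0), "C": (20.0, 300.0)}
K_S = 0.05         # hypothetical sensor gain (counts per MHz)
SET_POINT = 100    # target count, i.e. 2000 MHz with the gain above
K = 0.25           # hypothetical gain: real poles at the largest DCO gain

for name, (k_dco, b) in SYSTEMS.items():
    trace = simulate(K, k_dco, b, K_S, SET_POINT)
    print(name, round(trace[-1], 3), round(max(trace), 3))
```

With these values all three traces settle at the same frequency without exceeding it, although the slowest characteristic converges more slowly than the others; a constant perturbation added through the `disturbance` argument is removed by the integral action after a transient.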
RTL implementation
Following the various steps of the design and verification flow discussed above, the DFLL control developed in the present paper has been implemented in RTL. It must be stated that the DFLL together with its controller is fully compatible with standard cell methodology. Figure 12 
CONCLUSION
In this paper, a small-area Digital Frequency-Locked Loop (DFLL) engine is employed to implement DVFS in a GALS architecture. The use of a simple controller has allowed a fully digital implementation in standard cells, attaining a small area. Implemented in 32 nm technology, the proposed design occupies 0.0016 mm², i.e. from 4 to 20 times less than classical techniques such as a Phase-Locked Loop (PLL) in the same technology. Likewise, this controller is optimal with respect to system performance (short transient response and no overshoot) and perturbation attenuation.
Another desirable property offered by the controller is its robustness with respect to PVT variations.
Moreover, the closed-loop system stability is guaranteed whatever the characteristic of the DCO is in a given range. Some simulations under Matlab/Simulink show the closed-loop system robustness.
The DFLL with its controller was implemented in RTL in order to obtain the implementation layout.
The first version of the DFLL (including the controller proposed in this paper) has been implemented in 32 nm technology. The circuit is currently in fabrication, and the performance attained on the real chip will be included in the final paper.
ACKNOWLEDGEMENT
