Abstract-This work presents an area-efficient, voltage- and frequency-scalable clock generator for low-power digital SoC clocking. Named Direct Digital Sampling and Synthesis (DDSS), the open-loop generator, implemented in 28 nm FD-SOI, operates from 0.45 V to 1.1 V with measured jitter from 1.7% to 5.1% UI. Its low power consumption of 0.40 pJ/cycle at 57 MHz and 0.5 V, combined with the ability to perform fast frequency changes, makes this circuit an alternative to PLLs for fast Dynamic Voltage and Frequency Scaling (DVFS) strategies in low-power SoCs.
I. INTRODUCTION
The last decade has seen a trend of taking advantage of digital logic downscaling to port analog building blocks into digital designs. This has enabled area savings and voltage downscaling for elements such as power monitors, temperature sensors and Phase Locked Loops (PLLs) [1]-[5]. All-digital PLLs now replace the traditional LC oscillators and charge pumps with ring oscillators and digital loop filters.
These circuits offer better area and power performance, at the expense of slightly higher jitter values, which is less critical for clocking than for RF applications. However, due to their closed-loop nature, the digital PLLs face the same lock time restrictions as their analog counterparts. Some instant switching strategies added to the PLLs come at the price of added area and power consumption [5].
Moreover, power management strategies rely heavily on fine-grained frequency scaling [6]. This fine granularity applies both in space and in time, requiring a clock generator with low area and instant switching capability respectively. To limit its power overhead, the clock generator also needs a low leakage current in clock gating mode, and a wide voltage scalability with an output frequency matching that of the digital logic it clocks.
The open-loop principle of Direct Digital Sampling and Synthesis (DDSS) [7] offers an alternative to PLLs, trading off the phase locking for instant switching. The previously published DDSS, however, suffered from limited voltage and frequency scalability (0.6 V minimum voltage, 574 MHz maximum frequency at 0.9 V), as well as a complicated calibration mechanism, limiting its practical use.
The proposed design, implemented in 28 nm FD-SOI, improves on the DDSS principle by using a phase selection approach for the fractional division unit, rather than delay lines. Compared to [7], this method allows for a simpler, calibration-free design, offering a 14x reduction in area down to 981 μm², as well as an extended voltage operating range (down to 0.45 V), a 6.5x reduction in power consumption at Vmin and a maximum frequency on par with digital clocking requirements (879 MHz vs 574 MHz at 0.9 V). This makes the phase selection based DDSS a good candidate for low power voltage scalable Glob-

Then, in the synthesis stage, a phase selector performs the fractional frequency division of the ring oscillator (RO) by a programmable factor proportional to the first stage output $W$, i.e. $T_{out} = T_{RO} \cdot W = T_{ref}/N$. By using the same RO reference for the sampling and synthesis stages, this feed-forward design guarantees that the output frequency is N times that of the reference, independently of the exact RO frequency.
The feed-forward topology also updates $W$ and $T_{out}$ after only one reference cycle when N or $T_{ref}$ is changed, compared to several cycles of re-locking for PLLs [1]-[4].
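The relation $T_{out} = T_{RO} \cdot W = T_{ref}/N$ can be checked numerically with a minimal sketch. It assumes an ideal, unquantized sampling stage that returns $W = T_{ref}/(N \cdot T_{RO})$; the function names, the 1 MHz reference and the RO periods are hypothetical values chosen for illustration and do not come from the design.

```python
import math


def sampling_stage(t_ref, n, t_ro):
    """Ideal sampling (assumption): target period T_ref / N in units of T_RO."""
    return t_ref / (n * t_ro)


def synthesis_stage(t_ro, w):
    """Ideal synthesis (assumption): fractional division of the RO period by W."""
    return t_ro * w


T_REF = 1e-6  # hypothetical 1 MHz reference
N = 57        # hypothetical multiplication factor

# The same RO period feeds both stages, so it cancels out of T_out.
for t_ro in (0.8e-9, 1.0e-9, 1.3e-9):  # hypothetical RO periods
    w = sampling_stage(T_REF, N, t_ro)
    t_out = synthesis_stage(t_ro, w)
    assert math.isclose(t_out, T_REF / N)
    print(f"T_RO = {t_ro * 1e9:.1f} ns, W = {w:.3f}, f_out = {1e-6 / t_out:.1f} MHz")
```

In this idealized model a slower or faster RO only changes $W$, not the output frequency, which is the property the feed-forward topology relies on.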
Contrary to delay-line types of fractional division [5], [7], the phase selection method does not require any specific calibration, as the sum of the 32 phase delays is by construction equal to one period $T_{RO}$ of the ring oscillator. The only timing constraint is that $T_{RO}$ must be larger than the setup time of the synchronous logic stage. As this logic is very simple, the setup constraint is low (0.61 ns at 0.9 V) and can be safely margined. The calibration-free operation offers a drastic reduction in area compared to [7], which required 2720 latches and logic for PVT-dependent configuration.
Last, thanks to the selection of both the rising and falling edges of the generated clock, the width of the output pulse, and hence its duty cycle, can be controlled.
B. Oscillator design
inverter pair topology. The transistor-level design is optimized: the inverter pair is reduced to minimum-sized NMOS only, to reduce area and power. An enable command is also added to allow ring gating for power savings in idle mode. This command is added on each of the 16 stages to avoid phase imbalance. Last, the design is laid out to allow abutment between stages and with standard cell logic without area overhead.
C. Phase selection principle
Fig. 3 illustrates the general principle, with the commands cycle and delay sent to the phase selection block. On each cycle the rising and falling edge delay values are incremented by one step (20/32 in the example). When the increment overflows, the cycle command is set to 0 for one cycle and no pulse is processed.
Fig. 4 presents the details of the phase selection operation. The general principle, illustrated in sub-figure 4.a), consists of using a flip-flop to propagate a selected phase at its rising edge. The conceptual timing of the selection, presented in sub-figure 4.b), is first to set the phase multiplexer selection, then to enable the edge capture by setting the "window" D input of the flip-flop to 1. However, as illustrated in the timing diagram, some margins must be respected around the selection: (1) the multiplexer must settle before the window is enabled, (2) and (3) the window must be enabled for some time before and after the desired edge is selected, and (4) the window must be disabled before the multiplexer command is changed.
Because of these constraints, the phase selection cannot be performed in a single cycle: the full window has to cover the phases $\Phi_0$ to $\Phi_{31}$ plus the margins (1)-(4), as illustrated in Fig. 4.b). For this reason, two Phase Selection Units (PSUs) run in parallel, each operating over 2 cycles. The first half of the first cycle is used to guarantee constraints (1) and (2), one cycle is used for the actual phase selection, and another half cycle covers constraints (3) and (4). Sub-figure 4.c) presents the gate-level details of the implementation. The first two flip-flops, FF1 and FF2, guarantee margins (1) and (2), while FF3 selects the rising edge and FF4 the falling edge. Moreover, the only cells affecting the output jitter are the 32:1 MUX and FF3, which limits the impact of mismatch between the two parallel PSUs. This design is very compact and easy to implement, requiring only 44 standard cells per PSU, compared to over 200 per delay element in [7].
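The accumulate-and-overflow behaviour described for Fig. 3 can be illustrated with a small behavioural sketch. It is not the gate-level PSU: it assumes a division ratio between 1 and 2 (one RO cycle plus a fractional step), ignores the timing margins (1)-(4) and the two interleaved PSUs, and the function name is hypothetical.

```python
def phase_select_schedule(frac_step, n_ro_cycles, n_phases=32):
    """Return (ro_cycle, phase_index) pairs at which a rising edge is emitted.

    Behavioural model (assumption): the phase index advances by `frac_step`
    after each pulse; on overflow it wraps and the next RO cycle is skipped,
    mimicking the cycle command being set to 0 for one cycle.
    """
    if not 0 <= frac_step < n_phases:
        raise ValueError("frac_step must be in [0, n_phases)")
    events = []
    phase = 0
    skip_next = False
    for ro_cycle in range(n_ro_cycles):
        if skip_next:          # cycle command = 0: no pulse in this RO cycle
            skip_next = False
            continue
        events.append((ro_cycle, phase))
        phase += frac_step
        if phase >= n_phases:  # accumulator overflow
            phase -= n_phases
            skip_next = True
    return events


# Example with the 20/32 step quoted in the text.
events = phase_select_schedule(frac_step=20, n_ro_cycles=9)
edge_times = [cycle * 32 + phase for cycle, phase in events]  # in units of T_RO/32
periods = [b - a for a, b in zip(edge_times, edge_times[1:])]
print(events)
print(periods)
```

In this model every output period equals 32 + 20 phase steps, i.e. 1.625 RO periods, even though the pulses land in irregularly spaced RO cycles.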
D. Digital flow and simulations
The highly digital and very compact nature of the DDSS takes advantage of the digital flow for quick design iterations, to explore different design strategies and cell sizings, as well as for timing verification and simulations. The trade-off is that the automated P&R does not ideally match timing between paths, causing added deterministic jitter.
The size (13k transistors in total) and digital behavior of the circuit make full SPICE simulations on the extracted netlist possible. This allows simulation of the estimated deterministic output jitter due to delay mismatch between the gates. Fig. 5 shows the results of the simulation. The green crosses show the nominal run across three corners, which captures the effect of P&R mismatch only, while the box plot shows the spread of 25 Monte Carlo runs. This simulation predicts the expected level of jitter and demonstrates that at low voltage its dominant contributor is random variation rather than P&R mismatch, which validates the digital flow approach.
III. MEASURED PERFORMANCES
A. Testchip implementation
Fig. 6 presents the full test vehicle view, with details of the DDSS layout and test harness. Sixteen chips have been fabricated, packaged and measured. The testchip integrates frequency dividers to allow validation of functionality even in the GHz frequency range, where standard digital IOs cannot transmit the generated clock off chip directly. The circuit is designed in a 28 nm FD-SOI Regular Voltage Threshold (RVT) process to minimize leakage in idle mode for low power applications. The total DDSS area is 981 μm² and the block can be placed inside digital logic with no guard area overhead.

IEEE Asian Solid-State Circuits Conference November 6-8, 2017/Seoul, Korea
B. Maximum frequency and power
Fig. 7 shows the measured maximum generated frequency and the power consumption at Fmax of the 16 chips across the 0.45 V to 1.1 V supply range. The circuit achieves an energy efficiency of 0.45 pJ/cycle at 57 MHz and 0.5 V, and 1.53 pJ/cycle at 879 MHz and 0.9 V. Table I (median values measured across 16 dice) shows that the DDSS can be Reverse Body Biased (RBB) at the same time as the core when it is power gated, for a 6x to 11x leakage reduction, down to 10 nW at 0.5 V with 1.5 V RBB, enabling Internet of Things type duty-cycled operation.
C. Duty cycle control
As described previously, the phase selection approach makes it possible to control the pulse width. In practice this is implemented with 7 settings: the first 4 control the width from 1/8 to 4/8 of the period, while the last 3 invert the output of the first 3. This gives 7 duty-cycle settings from 1/8 to 7/8, as illustrated by the measured data of Fig. 8. This feature is useful to offset clock tree imbalance at lower voltages and can be used in low power pulse-based latch logic [8].
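The seven duty-cycle settings can be enumerated with a short sketch. The setting indices 0 to 6 and their ordering are assumptions made for illustration; the text only states that four settings give widths of 1/8 to 4/8 and that three more are inverted copies of the first three.

```python
from fractions import Fraction


def duty_cycle(setting):
    """Map a hypothetical setting index (0..6) to the output duty cycle."""
    if not 0 <= setting <= 6:
        raise ValueError("setting must be between 0 and 6")
    if setting < 4:
        # Direct settings: pulse width of 1/8, 2/8, 3/8 or 4/8 of the period.
        return Fraction(setting + 1, 8)
    # Inverted settings: complement of the first three widths (7/8, 6/8, 5/8).
    return 1 - Fraction(setting - 3, 8)


for s in range(7):
    d = duty_cycle(s)
    print(f"setting {s}: duty cycle {d.numerator * 8 // d.denominator}/8")
```

Inverting the 4/8 setting would give 4/8 again, which is consistent with only three inverted settings being needed to cover 1/8 to 7/8.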
D. Jitter measurement
First, it is important to note that the jitter constraints are different for SoC clocking than for RF or data recovery applications. The relative peak-to-peak period jitter is the main metric and directly translates to a frequency penalty, or extra margining, in the digital logic the DDSS clocks. For example, a 6% UI jitter corresponds to only a 3% Fmax degradation in the clocked logic. Fig. 9 presents the RMS and peak-to-peak jitter measurements at 0.5 V across the 16 dice, as well as a capture of a jitter histogram, illustrating the superposition of the deterministic component, from mismatch between phase selection paths, with the random jitter from supply and component noise. As the deterministic jitter depends on the phase increment values, the jitter is measured for the 16 dice at two different output frequencies (20 MHz and 35 MHz) to demonstrate that the measurements do not represent a best case. The median measured peak-to-peak jitter is 1.47 ns at 35 MHz and 2.01 ns at 20 MHz, i.e. 5.1% and 4.0% UI respectively.
Due to the limited IO bandwidth, the jitter at 0.9 V is measured at 100 MHz. The median peak-to-peak and RMS jitter values are 167 ps and 20.7 ps, i.e. 1.7% UI and 0.21% UI respectively.
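The UI percentages quoted in this section follow from dividing the absolute jitter by the output clock period. The sketch below reproduces that arithmetic from the measured values given in the text. The helper names are hypothetical, and the factor of one half in the Fmax penalty is inferred from the 6% UI to 3% Fmax example rather than stated as a general rule.

```python
def jitter_percent_ui(jitter_s, f_out_hz):
    """Jitter as a percentage of the unit interval (one output clock period)."""
    return 100.0 * jitter_s * f_out_hz


def fmax_penalty_percent(pkpk_ui_percent):
    """Assumption: the shortest period shrinks by half the pk-pk jitter."""
    return pkpk_ui_percent / 2.0


# (label, jitter in seconds, output frequency in Hz), values taken from the text
measurements = [
    ("pk-pk, 0.5 V, 35 MHz", 1.47e-9, 35e6),
    ("pk-pk, 0.5 V, 20 MHz", 2.01e-9, 20e6),
    ("pk-pk, 0.9 V, 100 MHz", 167e-12, 100e6),
    ("RMS,   0.9 V, 100 MHz", 20.7e-12, 100e6),
]

for label, jitter_s, f_hz in measurements:
    print(f"{label}: {jitter_percent_ui(jitter_s, f_hz):.2f}% UI")

print(f"Fmax penalty for 6% UI pk-pk: {fmax_penalty_percent(6.0):.1f}%")
```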
E. SoC level performance
This jitter performance has to be put in perspective with the full SoC power budget. As an illustration, the test chip also includes a low power ARM M0+ core [9] operating at the same voltage and clocked by the DDSS. On a Dhrystone testbench the M0+ consumes 0.94 pJ/cycle at 0.5 V. When compared to an ideal clock, the DDSS with a 5.1% UI jitter requires an increase in frequency margin of 2.6%. From measured data, this corresponds to a 1.6 mV increase in core voltage for margining, which in turn increases the core energy by only 1.1%, i.e. by 0.01 pJ/cycle. So, for low power cores, the benefit of the energy efficiency improvement in the clock generator far outweighs the penalty in jitter performance compared to some conventional PLLs [4], [5].
F. Comparison with the state of the art
Table II summarizes the performance of the proposed DDSS clock generator compared with the previous DDSS and state-of-the-art PLLs. This work combines a wide voltage range with frequencies compatible with digital logic clocking (unlike [2]), the best reported area and an excellent maximum energy efficiency of 0.40 pJ/cycle.
IV. CONCLUSION
This paper proposes a novel implementation of the Direct Digital Sampling and Synthesis (DDSS) frequency generation circuit. This all-digital approach offers instant frequency scaling, unlike conventional PLLs. By using a phase selection approach, the circuit offers an extremely compact implementation (981 μm²) and low power across the full 0.45 V to 1.1 V operating range, with 0.40 pJ/cycle at 0.5 V. Combined with its satisfactory jitter performance (5.1% UI at 35 MHz and 0.5 V), this demonstrates the possibility of significant system-level energy savings, from single clock domain ultra-low-power systems to large GALS SoCs.
