Abstract: Low-power dual-supply clock networks based on novel level-converting clock-gating cells are presented. The proposed clock networks achieve substantial power savings while relaxing the timing constraints on gated clocks. They also allow pulse-based flip-flops at the leaf clock nodes to operate without pulse generators, which saves further power and area. The proposed dual-supply clock networks were designed in a 32 nm CMOS technology. The evaluation results indicate that the proposed clock-gating cells consume up to 24.8% less power, with latency reduced by up to 74.3% and area by up to 17.5%. They also indicate that the power consumption of the proposed clock networks is reduced by up to 30.3%.
Keywords: dual supply, clock gating, level converting, clock distribution
Classification: Integrated circuits
Introduction
As design rules scale down and architectures become more complex, the number of circuit blocks driven by a global clock in a system-on-a-chip (SoC) increases substantially. This results in a large system-level clock load, which requires more clock buffers in the clock network. The demand for multi-GHz clock operation also requires clock networks to be built with large clock buffers to keep clock edges sharp. Combined with the inherently high toggling rate of clock nets, these trends significantly increase the power consumption of high-speed clock networks. For example, clock networks can dissipate 20% to 50% of the total power of a design [1]. The dynamic power consumption of a clock network can be represented as $P = \alpha \cdot C_L \cdot V_{DD}^2 \cdot f$, where $\alpha$, $C_L$, $V_{DD}$, and $f$ are the switching activity, the capacitive load, the supply voltage, and the clock frequency, respectively. Since it is not easy to reduce $C_L$, applying a reduced supply voltage is a viable choice for reducing clock network power consumption. If the clock frequency can also be scaled down, more power can be saved. Proper clock gating can further reduce power consumption by lowering the switching activity of the clock nets [2, 3].
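As a rough illustration of the dynamic power expression above, the following sketch evaluates it at the two ends of the 0.8 V to 1.1 V supply range used later in this work. The load capacitance, clock frequency, and switching activity are hypothetical values chosen only for illustration, and the function name `dynamic_power` is not from the original design flow. The result is an idealized bound for the part of the network that runs at the lower supply; it ignores level-converter and clock-gating overhead, so it should not be read as a prediction of the measured savings.

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """Dynamic power P = alpha * C_L * V_DD^2 * f, in watts."""
    return alpha * c_load * vdd ** 2 * freq


# Hypothetical operating point (assumed values, not taken from the paper).
ALPHA = 1.0        # switching activity of an ungated clock net
C_LOAD = 100e-12   # total clock load in farads (assumed)
FREQ = 2e9         # clock frequency in hertz (assumed)

p_high = dynamic_power(ALPHA, C_LOAD, 1.1, FREQ)  # buffers at 1.1 V
p_low = dynamic_power(ALPHA, C_LOAD, 0.8, FREQ)   # buffers at 0.8 V

# Relative saving depends only on the voltage ratio: 1 - (0.8 / 1.1)^2.
saving_voltage = 1.0 - p_low / p_high

# Halving the buffer-tree frequency as well halves the remaining power.
p_low_half = dynamic_power(ALPHA, C_LOAD, 0.8, FREQ / 2)
saving_both = 1.0 - p_low_half / p_high

print(f"saving from voltage scaling alone: {saving_voltage:.1%}")
print(f"saving with halved frequency too:  {saving_both:.1%}")
```

Because the power depends quadratically on the supply voltage, the relative saving from voltage scaling is independent of the assumed load and frequency; only the absolute numbers change with those assumptions.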
In this work, we present a novel low-power clock network design that adopts a frequency-scaled dual-supply scheme with cost-effective clock gating while minimizing the performance penalty. Section 2 discusses previous clock network power reduction techniques. The proposed clock network designs are described in Section 3. Comparison results and conclusions are given in Section 4.
Previous works
To reduce the power consumption of a clock network, single-supply schemes with reduced-swing clock buffers [1, 4] and dual-supply schemes with level converters [1, 5] have been proposed. A single-supply scheme with source-follower-based reduced-swing buffers [1] has problems in sizing the MOS transistors and in applying gated clocks: if the clock stops, the output of the reduced-swing buffer can be distorted by leakage current. Another single-supply scheme, in which the desired reduced voltage swing is obtained by controlling MOS pairs with a delayed input [4], has a large power penalty because of its delay chain. This reduced-swing buffer is inefficient under light loading and must be designed carefully because its swing range is sensitive to the delay time and the threshold voltage. A dual-supply scheme with feedback-based level converters [1] suffers from high short-circuit current and large propagation delay because the level converters respond slowly at low $V_{DDL}$. A dual-supply clocking scheme with multi-$V_{th}$ level converters [5] overcomes these drawbacks of the conventional level converters. However, when it is used with dynamic voltage scaling, a $V_{th}$ that is optimal for one $V_{DDL}$ may be inefficient at another. The increased variability of multi-$V_{th}$ devices in nanometer CMOS technology also makes the design difficult. On top of these drawbacks, the conventional single- and dual-supply clock networks described above do not adopt clock gating [2, 3], which is an efficient power-saving method. Unfortunately, incorporating a clock-gating capability into these conventional clock networks causes a large design overhead, as shown in Section 3.
Proposed work
As discussed in Section 1, a power-efficient clock network needs to adopt low-power techniques such as a dual-supply scheme, frequency doubling, and clock gating. To incorporate clock gating into the conventional dual-supply clock network, a clock-gating cell (CGC) can be placed in front of the level converter (LC) at each leaf clock node. In that arrangement, the LC adds to the timing burden on the enable signal of the CGC. In particular, if the conventional CGC with a built-in latch for glitch prevention [2] is attached at the input of an LC as shown in Fig. 1-(a), the setup and hold constraints on the E-to-ECK path become worse because of the level-sensitive operation of the conventional CGC. The enable signal (E) for the CGC must precede the output clock (ECK) by the latency of not only the CGC but also the LC. This becomes worse at lower $V_{DDL}$ because the latency of the LC increases substantially as $V_{DDL}$ is lowered. Moreover, the power consumption of the LCs and CGCs offsets the power reduction obtained by the $V_{DDL}$ clock buffers. Fig. 1-(b) shows the conventional CGC with multi-$V_{th}$ level conversion/frequency doubling (LCFD), which saves more power by halving the clock frequency in the clock buffers [4]. However, it has a larger area overhead than Fig. 1-(a) because its delay circuitry must be tuned to produce a frequency-doubled clock with a 50% duty cycle. Moreover, the amount of delay is sensitive to both process and $V_{DDL}$ fluctuations.
To relieve the timing burden and reduce the power consumption, a novel level-converting clock-gating cell (LCCGC), shown in Fig. 1-(c), is proposed. It operates in an implicitly pulsed manner and has an embedded level-converting capability. The pulsed sampling of the LCCGC reduces the E-to-ECK delay through soft clock-edge operation, which relaxes the timing constraints on the enable signal. Since halving the clock frequency in the clock buffers enables further power saving, a novel LCCGC with frequency doubling (LCCGC-FD), shown in Fig. 1-(d), is also proposed. It performs frequency doubling, level conversion, and clock gating simultaneously. In the LCCGC-FD, the doubled clock frequency is obtained simply by replacing the AND gate in the LCCGC with an XNOR gate, which minimizes the area and power penalties. As a result, the proposed level-converting clock-gating cells mitigate the timing issue and reduce the area and power penalties of the conventional CGC-LC pair in Fig. 1-(a).
The generic structure of the proposed dual-supply clock network using the proposed level-converting clock-gating cells is shown in Fig. 1-(e). It consists of a $V_{DDL}$ clock buffer tree that distributes a small-swing clock, and LCCGCs or LCCGC-FDs that perform voltage conversion to $V_{DDH}$, clock gating, and optional frequency doubling. A group of flip-flops is attached at each leaf clock node. By adopting the proposed level-converting clock-gating cells, the proposed clock distribution network not only incorporates clock gating at a reduced cost for embedding level conversion and frequency doubling, but also mitigates the timing burden of the conventional clock-gating cell.
Moreover, the pulsed nature of the proposed LCCGC and LCCGC-FD gives additional advantages when they are used with pulse-based flip-flops (PFFs), which have proved competitive in high-performance designs. One is that the overhead of generating a pulsed clock in the PFFs is eliminated, since the pulsed outputs of the LCCGC and LCCGC-FD can drive the PFFs directly, which further reduces power consumption and area. Another is that, while the conventional clock network with frequency doubling requires the clock pulse width to be one quarter of the input clock period in order to produce a frequency-doubled clock with a 50% duty cycle [5], the proposed clock network does not require such a precise half-duty clock because its output clock is pulsed. The required minimum duty cycle is small enough to be comparable to the delay of the simple delay chain in the LCCGCs.
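The gate-level idea described above (an AND gate for pulsed gating, swapped for an XNOR gate to double the frequency) can be illustrated with a small behavioral model. The sketch below is only an assumption about the logic: it models the pulse as the clock combined with a delayed, inverted copy of itself, treats the enable as a plain logical AND, and ignores level conversion, transistor-level structure, and glitch behavior. The function names (`delayed_inverse`, `gated_pulse`, `rising_edges`) and the sample counts are hypothetical.

```python
def delayed_inverse(clk, delay):
    """Inverted copy of a periodic waveform, delayed by `delay` samples."""
    n = len(clk)
    return [1 - clk[(i - delay) % n] for i in range(n)]


def gated_pulse(clk, enable, delay, double=False):
    """Behavioral pulsed clock gate.

    double=False: AND of clk and its delayed inverse, one pulse per rising edge.
    double=True:  XNOR of the same two signals, one pulse per clock edge.
    """
    clkb_d = delayed_inverse(clk, delay)
    out = []
    for c, d, e in zip(clk, clkb_d, enable):
        pulse = (1 if c == d else 0) if double else (c & d)
        out.append(pulse & e)
    return out


def rising_edges(wave):
    """Count 0-to-1 transitions, treating the waveform as periodic."""
    return sum(1 for i in range(len(wave)) if wave[i - 1] == 0 and wave[i] == 1)


PERIOD = 16   # samples per input clock period (assumed resolution)
CYCLES = 4
clk = [1 if (i % PERIOD) < PERIOD // 2 else 0 for i in range(PERIOD * CYCLES)]
on = [1] * len(clk)
off = [0] * len(clk)

and_out = gated_pulse(clk, on, delay=PERIOD // 4)
xnor_out = gated_pulse(clk, on, delay=PERIOD // 4, double=True)
xnor_narrow = gated_pulse(clk, on, delay=2, double=True)
gated_off = gated_pulse(clk, off, delay=PERIOD // 4, double=True)

and_edges = rising_edges(and_out)        # one pulse per input cycle
xnor_edges = rising_edges(xnor_out)      # two pulses per input cycle
xnor_duty = sum(xnor_out) / len(xnor_out)
narrow_edges = rising_edges(xnor_narrow)
narrow_duty = sum(xnor_narrow) / len(xnor_narrow)

print(and_edges, xnor_edges, xnor_duty, narrow_edges, narrow_duty)
```

In this model a delay of one quarter of the input period gives a frequency-doubled output with a 50% duty cycle, whereas a shorter delay still doubles the pulse rate but with a narrower pulse. That is the property a pulse-based flip-flop can exploit, since it needs only a sufficiently wide pulse rather than a precise half-duty clock.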
Simulation comparison
To evaluate the performance of the proposed scheme, the conventional and proposed clock networks were designed in a 32 nm CMOS technology. The proposed clock networks adopt either the LCCGC or the LCCGC-FD and drive a group of pulse-based flip-flops that have no pulse generators. In our dual-supply test bench, $V_{DDL}$ ranges from 0.8 V to 1.1 V for a $V_{DDH}$ of 1.1 V. Fig. 2-(a) compares the CGC circuits in the enable delay (E-to-ECK) versus power space. The conventional CGC-LC (Fig. 1-(a)) dissipates more power than the proposed LCCGCs because it suffers from a large short-circuit current in the cross-coupled pMOS transistors during transitions. It also has a large enable propagation delay ($t_{\text{E-ECK}}$) because the CGC operates at $V_{DDL}$ and the design has two stages. The conventional CGC-LCFD (Fig. 1-(b)) has the largest power dissipation above $V_{DDL}$ = 0.9 V, since the CGC operates at $V_{DDH}$ and the LCFD with its large delay chain is always running. For these reasons, the combination of a CGC and any explicit LC is not acceptable for high-frequency design. In contrast, the proposed LCCGCs show lower power consumption and a smaller enable propagation delay with minimum area because the clock-gating function is implemented efficiently by NMOS stacking.
Fig. 2-(b) compares the normalized area, enable latency ($t_{\text{E-ECK}}$), and power consumption of the clock-gating cells. To investigate how the power consumption of the CGCs depends on the switching activity of the enable signal (E), the power consumption is compared for enable switching activities of 0%, 50%, and 100%. By embedding a level converter into the CGC, the proposed LCCGCs achieve a 15.7% to 17.5% area reduction and a 61.8% to 74.3% reduction in $t_{\text{E-ECK}}$. Although the power consumption of the proposed LCCGCs is somewhat larger than that of the conventional CGCs at 0% switching activity, it is up to 23.3% and 24.8% smaller at 50% and 100% switching activity, respectively.
Fig. 2-(c) and (d) compare the power consumption of clock networks designed using the conventional and proposed CGCs. The clock network power includes all the power dissipated by the clock buffer tree, the CGCs, and the flip-flops. Fig. 2-(c) compares the power consumption of the clock networks for various switching activities of the enable signal. Depending on the switching activity, the clock networks with the proposed LCCGC and LCCGC-FD consume up to 28.3% and 30.3% less power, respectively, than those with the conventional CGCs. Fig. 2-(d) compares the power consumption of the clock networks for $V_{DDL}$ ranging from 0.8 V to 1.1 V in a real application with an average switching activity of 26.5%. As expected, the power consumption of the clock networks decreases at lower $V_{DDL}$, and the networks with the proposed LCCGCs consume less power over the entire $V_{DDL}$ range. Moreover, since the enable propagation delay and its variation remain small for the proposed CGCs, as seen in Fig. 2-(a), the proposed clock networks can be driven at various $V_{DDL}$ values with almost no clock frequency degradation. Numerically, the proposed clock networks have only 2.7% frequency degradation over the entire range of $V_{DDL}$, whereas the conventional clock networks cause up to 14.0% frequency degradation.
