Abstract-Large digital chips use a significant amount of energy to broadcast a low-skew, multigigahertz clock to millions of latches located throughout the chip. Every clock cycle, the large aggregate capacitance of the clock network is charged from the supply and then discharged to ground. Instead of wasting this stored energy, it is possible to recycle it by controlling its delivery to another part of the chip using an on-chip dc-dc converter. The clock driver and switching converter circuits share many compatible characteristics that allow them to be merged into a single design and fully integrated on-chip. Our buck converter prototype, manufactured in 90-nm CMOS, provides a proof-of-concept that clock network energy can be recycled to other parts of the chip, thus lowering overall energy consumption. It also confirms that monolithic multigigahertz switching converters utilizing zero-voltage switching can be implemented in deep-submicrometer CMOS. With multigigahertz operation, fully integrated inductors and capacitors use a small amount of chip area with low losses. Merging the clock driver with the power converter allows the two to share the large MOSFET drivers that both require, which saves both energy and area. We present an analysis of the losses, which we confirm by experimentally comparing the merged circuit with a conventional clock driver.
I. INTRODUCTION
THE rapid increase in energy consumption of large digital circuits has been predominantly due to an increase in total gate capacitance, with thinner gate oxides, and an increase in operating frequency. In particular, the high-frequency clock uses a large fraction of the total energy budget. In the 4-GHz IBM POWER7 processor, for example, the clock load consists of over 2 million latches that are distributed across its 567-mm² die [1] . Although some energy is required to distribute the clock across the chip, the final (or local) drive stage uses the majority of the energy [2] . For example, the 5-GHz IBM POWER6 processor dissipates 100 W at 1 V in its 341-mm² die; of the total power, 8% is used for global clock distribution and 14% is used for local clock drivers [3] . For modeling purposes, this POWER6 clock network is roughly equivalent to an average distributed capacitance of 13 pF/mm². There are several methods used to reduce clock energy, such as clock gating, low-swing signaling, double-edge triggered flipflops, adiabatic switching [4] , and resonant clocking [5] , [6] . Among these techniques, resonant clocking shows significant promise, reducing clock energy by up to 80%. It produces a sinusoidal clock waveform that improves EMI, but it also has low slew rates near the logic switching threshold of V DD /2, which increases both skew and jitter. Steep clock edges are preferred instead.
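As a rough check on the capacitance density quoted above, the following sketch assumes the clock power obeys the usual dynamic relation P = C V DD² f and that both the global (8%) and local (14%) clock shares are counted. Both are assumptions made for illustration; the function name and parameters are hypothetical.

```python
def clock_cap_density_pf_per_mm2(total_power_w, clock_fraction, vdd_v, freq_hz, die_area_mm2):
    """Equivalent switched clock capacitance per unit area.

    Assumes all clock power is dynamic: P = C * VDD^2 * f.
    """
    clock_power_w = total_power_w * clock_fraction
    cap_f = clock_power_w / (vdd_v ** 2 * freq_hz)
    return cap_f * 1e12 / die_area_mm2


# POWER6 figures quoted in the text: 100 W, 1 V, 5 GHz, 341 mm^2.
# Assumption: global (8%) plus local (14%) clock power are both included.
density = clock_cap_density_pf_per_mm2(100.0, 0.08 + 0.14, 1.0, 5e9, 341.0)
print(f"{density:.1f} pF/mm^2")
```

Under these assumptions the result lands close to the 13 pF/mm² used for modeling.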
Instead of reducing the clock energy directly, this paper adopts a new method, shown in Fig. 1 , by recovering the energy stored in the clock load capacitance and redeploying the energy to another local circuit in a regulated fashion. This reduces the overall current drawn from the primary supply. We call this concept energy recycling [7] . To recover and redeploy this capacitively stored energy, we develop a merged clock driver and buck converter circuit. Using a fully integrated zero-voltage switching (ZVS) buck converter running at 3 GHz, our experimental prototype implemented in 90-nm CMOS technology recovers energy from the final clock driver stage with roughly 50% efficiency. In this paper, we present comprehensive measurements on the prototype and follow this with a mathematical analysis of the losses and the energy recovered.
One of the advantages of energy recycling is the efficient generation of a localized on-chip voltage supply that differs from the level offered by the primary supply. This can be leveraged into additional power savings by using it to provide mixed-voltage islands, adaptive body biasing [8] , as well as dynamic voltage and frequency scaling (DVFS) [9] , all of which require additional supply voltages at modest power levels. Since the on-chip dc-dc converter is small, many can be deployed across the chip to produce independent, regional power supplies, without the proliferation of voltage buses.
II. COMPATIBILITY OF POWER CONVERTER AND CLOCK DRIVER CIRCUITS
One contribution of this paper is recognizing the compatibility and overlap between high-speed clocking in large digital chips and dc-dc switching converters. To achieve the ultimate goal of fully integrated converters, switching frequencies have been steadily increasing to shrink the size of the required output filter passives. The primary concern in such high-frequency converters is to mitigate the increase in associated switching losses. In this paper, our approach is to directly use the clock node, since it is high frequency, high capacitance, and its power overhead is already justified.
0885-8993/$31.00 © 2012 IEEE
The needs of a clock driver network and a converter are similar. Both involve tapered inverter chains to drive a large, final inverter stage. The main difference is that clocks typically maintain a fixed duty cycle, while converters typically vary the duty cycle to vary the output voltage. Also, the clock output is a high-quality square waveform, while converters attach a large LC output filter to produce a dc output waveform with low ripple. This paper shows that the drive stages can be shared between these two circuits, and the LC output filter does not unduly interfere with the quality of the clock signal. However, if the clock duty cycle must be modulated, then logic designers will need to use only the leading clock edge as the primary timing reference. This is compatible with single-edge triggered flipflop and pulsed latch design styles. However, from a clocking perspective, the lack of a reliable falling edge does become a constraint in the digital design.
The next concern is whether the clock distribution network lends itself to driving a large LC filter. Modern processors distribute multiple clock buffers throughout the chip rather than using a single large buffer [10] . For example, the POWER7 clock distribution network is shown in Fig. 2 , where the vertical axis represents total delay from the clock source [1] . To reach the eight processor cores, a binary tree of distributed buffers is used. Within each core, the clock is subdivided into tens to hundreds of regions, each with independent clock gating. Gating the clock saves power, but requires special local clock buffers known as gaters. Each gater covers around 1 mm of clock wire length [11] . We believe that dc-dc converters can be integrated with these local gated regions and cover roughly 1 mm² of area. Merging the two circuits requires that the dc-dc switching converters operate in the multigigahertz region, at frequencies that are 10× to 100× higher than those of typical converters. This shrinks the size of the LC output filter by up to 99% and allows full integration on-chip. However, on-chip components may be of lower quality than off-chip components. For example, some on-chip capacitors are built using the thin gate oxide of transistors, and gate contacts are made using the relatively high resistance silicided polysilicon layer. Similarly, the inductor is formed from thin metal layers subject to the complex metal slotting and fill rules necessary for manufacturability; it may not handle the large current density and magnetic flux required and/or may have too much stray capacitance. Hence, we take an experimental approach and successfully demonstrate a working prototype to address these concerns.
III. MERGED CLOCK DRIVER/BUCK CONVERTER
This section describes a merged circuit that is used to drive a large clock capacitance and to provide an output voltage lower than V DD . Basic performance results for a 3 GHz realization of this circuit were presented in [7] . Fig. 3 gives a simplified version of the merged buck converter and clock driver. Due to the large amount of capacitance in C clk , a delay element is introduced to provide ZVS during the high-to-low transition. This nonresonant mode of operation is implemented by creating a short "dead-time" delay before turning ON M n . ZVS saves energy if M n is turned ON only after the source-drain voltage reaches 0 V. Ideally, the stored energy in the clock capacitance is efficiently removed to the inductor and the load rather than shorted to ground via M n each cycle.
A. Simplified Circuit
The buck converter operates by averaging a pulse-width-modulated (PWM) voltage through a low-pass filter. The output voltage of an ideal converter is the dc value of its input multiplied by the duty ratio
$$V_{out} = D \, V_{DD} \qquad (1)$$
where D is the duty ratio of the PWM voltage. This is independent of the frequency. In Fig. 3 , a chain of cascaded inverters (not shown) drives node V clk−in . Capacitance C clk is the overall capacitance at the clock node and includes the stray capacitances of M n and M p , the gate capacitance of all the clock gates connected to this node, and the stray capacitance of the distributed clock wiring at this node. The operation of this circuit is summarized by the idealized timing diagram in Fig. 4 , for the boundary condition of continuous inductor current, where D, T sw , and T delay represent clock duty cycle, switching (clock) period, and ZVS dead-time, respectively. There are three phases of operation. ZVS operation occurs when M n is turned ON, while its source-drain voltage is close to zero, thereby reducing dynamic power loss. Theoretically, in Phase 3, if the falling inductor current were allowed to cross zero, M n could be turned OFF to allow charging C clk with the negative inductor current. Then, at the beginning of the next switching cycle, M p would be turned ON with 0 V across it (ZVS operation for M p ). However, this increases the output voltage ripple, as C F must provide the required charge for the large C clk . This will also cause additional power loss in the inductor resistance. By design, at full load, the minimum inductor current in this circuit will be close to zero; therefore, no ZVS operation is implemented for M p . Since at less than full load the inductor current does not stop at zero, the converter always operates in continuous conduction mode.
B. Full Circuit
Measuring the performance of the proposed circuit presents some difficulties, as the circuit combines two functions. Hence, we construct a benchmark consisting of a reference clock driver circuit (see Fig. 5 ), as well as the merged clock driver/buck converter itself shown in Fig. 6 . The reference clock driver has the same transistor sizes and C clk load as the merged design.
In both circuits, PMOS transistors are three times wider than NMOS transistors, except for the last inverter stage in which the PMOS is four times wider to reduce the voltage drop across M p while V clk is high and the current is building up in the inductor L F (Phase 1). A tapering factor equivalent to a fan-out of four is used for the inverter chain, which minimizes clock latency from the source [12] . NMOS transistor gate capacitance is used to implement the converter filter capacitor, C F , while the gate capacitance of a simple large inverter is used to represent the parasitic and load capacitance at the clock node, C clk . The extracted value of C clk is estimated to be 12 pF, roughly equivalent to a 1-mm² region of the IBM POWER6 processor [3] . We designed our prototype to recover energy in a 1-mm² region due to manufacturing cost constraints. This ultimately defined the values we chose for L F , C F , and C clk . Using (2) and (3) as guidelines, and considering maximum I LF = 2I out (we have assumed that the converter is operating in the continuous region, and I LF changes between 0 and 2I out ), initial L F and C F values are chosen and then optimized using the simulation tools
Here, T sw , f sw , and ΔV out,pp are switching period, switching frequency, and peak-to-peak output voltage ripple, respectively.
The final values of L F and C F are 320 pH and 350 pF, respectively, to operate at a switching frequency of 3 GHz with a voltage ripple of less than 5% at 100-mA load. The value of the clock capacitor C clk , that is located at the inverting node of the buck converter, does not have a direct impact on the value of the filter components L F and C F but it does slightly modify the duty cycle D as will be discussed later.
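For orientation, initial filter values of this kind can be estimated with the standard continuous-conduction buck relations. The sketch below assumes those textbook expressions (which may differ in detail from (2) and (3)), a 50% duty ratio, an inductor current swinging between 0 and 2I out , and a 5% ripple target. The function name and arguments are hypothetical.

```python
def buck_filter_estimates(v_dd, duty, f_sw, i_out, ripple_fraction):
    """First-pass L_F and C_F for an ideal buck converter (textbook relations).

    Assumes the inductor current ramps between 0 and 2*i_out, so the
    peak-to-peak inductor ripple is delta_i = 2*i_out.
    """
    v_out = duty * v_dd
    t_sw = 1.0 / f_sw
    delta_i = 2.0 * i_out
    # During the on-time D*T_sw the inductor sees (V_DD - V_out).
    l_f = (v_dd - v_out) * duty * t_sw / delta_i
    # Triangular ripple current charging C_F: dV = delta_i / (8 * f_sw * C_F).
    delta_v = ripple_fraction * v_out
    c_f = delta_i / (8.0 * f_sw * delta_v)
    return l_f, c_f


l_f, c_f = buck_filter_estimates(v_dd=1.0, duty=0.5, f_sw=3e9, i_out=0.1, ripple_fraction=0.05)
print(f"L_F ~ {l_f * 1e12:.0f} pH, C_F ~ {c_f * 1e12:.0f} pF")
```

These first-pass numbers are of the same order as the final optimized values of 320 pH and 350 pF, which is what one would expect from a starting point later refined in simulation.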
The ripple of the output voltage can be calculated using (4), where f c is the corner frequency of the output filter. As the duty cycle D is increased, the output voltage V out increases and its peak-to-peak ripple (ΔV out,pp /V out ) decreases
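The dependence of ripple on duty cycle and corner frequency can be illustrated with the common textbook expression for an ideal buck converter, ΔV out,pp /V out = (π²/2)(1 − D)(f c /f sw )². This form is assumed here for illustration and neglects capacitor ESR and the effect of C clk ; the helper names are hypothetical.

```python
import math


def corner_frequency(l_f, c_f):
    """Corner frequency of the LC output filter in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_f * c_f))


def ripple_ratio(duty, f_c, f_sw):
    """Peak-to-peak output ripple relative to V_out (ideal textbook form)."""
    return (math.pi ** 2 / 2.0) * (1.0 - duty) * (f_c / f_sw) ** 2


f_c = corner_frequency(320e-12, 350e-12)
ripples = {d: ripple_ratio(d, f_c, 3e9) for d in (0.3, 0.5, 0.7)}
for d, r in ripples.items():
    print(f"D = {d:.1f}: ripple ~ {100 * r:.1f}%")
```

With the final filter values the corner frequency is a few hundred megahertz, well below the 3 GHz switching frequency, and the computed ripple falls as D rises, consistent with the trend stated above.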
To obtain ZVS of M n , the exact ON/OFF timing of M n and M p is necessary. As shown in Fig. 6 
Here, T delay is defined by the voltage on C clk reaching zero, neglecting other effects. T delay can be calculated using a simplified circuit model, C clk capacitor in parallel with a current source of I Lmax . At the time t = 0 when
The time it takes for V clk to reach zero is
C. Timing Uncertainty
Timing uncertainty is an important issue in a clock distribution network. In the merged circuit shown in Fig. 6 , C clk is charged and discharged through nonsimilar circuit routes, so the rising and falling edges of V clk are not similar. Two distinctive cases are considered.
1) Constant load current: In steady state, the converter generates constant output voltage with a periodic ripple. As all the waveforms in the circuit are periodic with the same frequency, the rise time and fall time of each clock edge remain constant.
2) Dynamic load current, i.e., digital loads: When the clock is rising, different load currents result in different voltage drops across the M p on-resistance, R on-PMOS . As a result, the rising edge of the clock is slightly modified based on the load current. For the falling edge, the problem is more severe, as the load current solely determines the falling slope of the clock signal before the ZVS delay circuit is triggered. Based on (6), the sensitivity of V clk switching threshold (V DD /2) cross-over time to the load current is
− R on−PMOS ). This introduces jitter into the falling edge.
If the circuit is used with positive-edge-triggered digital circuits, then the jitter on the falling edge is not crucial. Looking again at the jitter on the rising edge, it can be modeled with an RC circuit, which represents C clk and R on-PMOS , in parallel with a current source that represents the inductor current. The current in the inductor changes with the load current. However, in the vicinity of the rising edge of the clock, it is near its lowest value, I L min . While the inductor current can be assumed constant during the charging up of C clk , it has a small magnitude (close to zero by design) and does not have a significant effect. Thus, the circuit can be designed such that the jitter added by the converter on the clock rising edge is negligible. Furthermore, no oscillation happens at the V clk node: as V clk approaches 0 V, M n turns ON and prevents the oscillation by keeping V clk at 0 V until the next M p turn-on phase. Therefore, the proposed design does not have a resonant converter structure.
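The load dependence of the falling edge discussed in this subsection can be illustrated numerically with the simplified model of Section III-B, in which C clk is discharged by a constant current equal to the peak inductor current. The sketch below additionally assumes that the voltage drop across R on-PMOS is ignored, that the inductor ripple amplitude does not change with load, and that the ripple is 240 mA peak to peak (taken from the simulated 190 mA peak and -50 mA valley at 70-mA load reported in Section IV). The function name is hypothetical.

```python
def half_vdd_crossing_time(c_clk, v_dd, i_out, ripple_pp):
    """Time for V_clk to fall from V_DD to V_DD/2 under constant-current discharge.

    Assumes the discharge current equals the peak inductor current,
    i_out + ripple_pp / 2, and neglects the initial droop across M_p.
    """
    i_l_max = i_out + ripple_pp / 2.0
    return c_clk * (v_dd / 2.0) / i_l_max


C_CLK = 12e-12     # clock node capacitance from the text
V_DD = 1.0         # supply voltage from the text
RIPPLE_PP = 0.240  # assumed load-independent inductor ripple (A)

t_70 = half_vdd_crossing_time(C_CLK, V_DD, 0.070, RIPPLE_PP)
t_100 = half_vdd_crossing_time(C_CLK, V_DD, 0.100, RIPPLE_PP)
shift = t_70 - t_100
print(f"70 mA: {t_70 * 1e12:.1f} ps, 100 mA: {t_100 * 1e12:.1f} ps, shift: {shift * 1e12:.1f} ps")
```

Under these assumptions, a load step from 70 to 100 mA moves the falling-edge threshold crossing by a few picoseconds, which is the kind of load-induced falling-edge jitter described above.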
D. Voltage Regulation
A full-voltage converter requires regulation based on load. Due to the unknowns of operating converters at 3 GHz, and the lack of observability in our test chip, we decided to implement a very simple PI controller on-chip. It modifies the duty cycle of the input clock by providing an analog control voltage to a PWM generator circuit. These are shown in the block diagram of Fig. 7 .
Due to the high frequency of operation, traditional methods of PWM generation using counters are not practical. Hence, we adopted a method based on programmable delay lines [7] and verified that it works at speed using simulation.
Unfortunately, our PI controller did not work properly.
Instead, we tested our chip in an open-loop configuration by forcing the PI controller input voltage, V err , to 0 V, which leaves the PWM generator operating in a nominal mode that does not alter the duty cycle of the input waveform. We then adjusted the pulse width sent to the regulator by directly controlling the pulse width of the off-chip clock source.
IV. SIMULATION RESULTS
The integrated clock-driver/switching converter was designed and simulated in a 90-nm CMOS technology, with C clk designed to be 12 pF. All transistors are standard-V t (standard-threshold) type. The circuit-level simulations, layout design, and postlayout simulations are performed using Cadence Design tools including Virtuoso Schematic Editor, Virtuoso Analog Design Environment, and Virtuoso Layout Suite. For layout extraction (extracting parasitic resistors and capacitors due to layout routing), the Calibre extraction tool is used. Simulations are performed at frequencies lower than 3.5 GHz and, therefore, are well within the valid frequency range of foundry provided models for transistors in the 90-nm CMOS technology.
To further refine the passive device models, in the capacitor model, equivalent series resistance (ESR) is extracted with a postlayout simulation using the Calibre extraction tool. For inductor modeling, design, and optimization, we have used ASITIC [13] . Simulated waveforms for typical process corners (TT25) are shown in Fig. 8 . The circuit is simulated with a 50% duty cycle, 70-mA load current, and a 1.0-V supply. The clock waveform has fast rise and fall times as required. The current in the inductor L F in Fig. 8(b) exhibits a triangular shape as expected, with a maximum value of around 190 mA. The minimum value is less than zero at -50 mA. M p source current (convention of current into the node always) provides the energy to charge up C clk when the inductor current is negative. The inductor also serves to charge up C clk (noting that M n is OFF). Once V clk has risen, the current in M p drops back to the inductor current level, continuing the rise from around zero to its next peak value. Because of the high peak current, during a steady on period, there is a maximum voltage drop of ∼0.1 V across M p as the current rises, suggested by the droop of V clk to ∼0.9 V in Fig. 8(a) . In this figure, the reference clock circuit output is shown as V clk-ref . Both clocks have similar rise and fall times. In the second half cycle of the clock, inductor current discharges C clk . In Fig. 8(b) , M n source current is always positive, which means that all the charge in C clk is delivered to the load instead of the ground. Simulation results confirm that the output ripple is around 5%.
The simulated output voltage of the buck converter at different duty cycles is given in Fig. 9 . The output voltage increases as D is increased and the output voltage drops with increasing load current for a fixed duty ratio, as expected. For each duty ratio, the efficiency is fairly constant from 40 to 100 mA, but has different values depending on the output voltage. For example, at 70-mA output current, varying the duty cycle from 30% to 60% increases the efficiency from 38% to 60% (figure of the efficiency is not shown). The detailed effect of the fixed losses in the tapered clock driver chain and the energy recycling of C clk is discussed later.
V. PROTOTYPE IMPLEMENTATION

A. Implementation of the Inductor L F
This study adopts a conventional CMOS process with coreless inductors. However, for manufacturability reasons, layout design rules in deep-submicrometer processes require slotting (the removal of metal) of wide metal traces and insertion of isolated dummy metal fill patterns in otherwise blank areas inside the inductor core.
The inductor uses a single-turn octagon, placing copper layers 6 and 7 in parallel to reduce series resistance. ASITIC [13] produced the extracted inductor characteristics shown in Fig. 10(a) and the simplified π model of Fig. 10(b) . The inductor layout area is 0.1 mm². Such integrated inductors induce eddy currents in the substrate, as well as in any conducting layers. A patterned-ground shield (PGS) in metal between the inductor coil and the substrate has been shown to help the fabricated inductor achieve its expected values of inductance and capacitance [14] , [15] . The PGS is implemented in metal 1 to keep it as far as possible from the inductor and minimize parasitic capacitance.
B. Implementation of the Bulk Capacitor C F
In CMOS technology, MOS gate capacitors have the highest capacitance per unit area and can be used as a normal capacitor. The nonlinear behavior of gate capacitance [16] is not significant in power converter applications such as this study, because the output voltages should be fairly constant. An array of hundreds of NMOS devices in parallel is used to produce the high capacitance needed. The ESR of each individual MOS gate capacitor is reduced by using a transistor W /L ratio of 10 [17] . A low ESR is essential to decrease power dissipation in the capacitor and also to lower the voltage ripple across it.
C. Implementation of the Merged Clock Buffer/Buck Converter
A block diagram of the circuit and its chip micrograph are shown in Fig. 7 . The area of the design is 0.27 mm², and the total die area, including probing pads, is 1 mm². A reference clock driver is also included for comparisons of clock waveform quality and power losses in the clock.
The layout is organized for probe station testing. To maintain the integrity of the input clock signal, a 50-Ω on-chip termination is used to prevent signal reflection. Paths that carry high currents are wide, slotted, and use many metal layers in parallel with multiple vias. We are unable to monitor the internal waveforms of the chip without adding significant load because of the 50 Ω input impedance of the measurement equipment; thus, such internal waveform measurements are not available.
VI. PROTOTYPE MEASUREMENT RESULTS
To verify the circuit concept, the buck converter was manufactured and tested [7] . For precise power measurements, all the parasitic resistances in the test setup of the buck circuit were accounted for through measurement and calibration. As a result, a supply voltage of 1.0 V was applied to the chip via the probe pads. The test bench setup is shown in Fig. 11 . An external signal generator provides a clock signal to the chip under test with an adjustable duty ratio of 33%, 50%, or 66%. The circuit was tested successfully with D = 50% at frequencies of 2, 2.5, and 3 GHz. Using frequencies of 3.5 GHz and higher did not yield sensible results, suggesting that the circuit does not operate properly at such a frequency, because the inverters do not have sufficient time to achieve full rail-to-rail swing. A similar situation happens for D = 66% at 3 GHz.
Measurements and simulations are compared in Fig. 12 . Fig. 12(a) shows that the duty ratio at 2 GHz changes the output voltage as expected. It also shows that the output voltage drops with increasing load current. The droop is consistent for both duty cycles (i.e., 50% and 66%) at 2 GHz. At 3 GHz and a duty ratio of 50%, Fig. 12(b) shows the voltage magnitude and droop with current is consistent with that of Fig. 12(a) . The correct functioning of the circuit is apparent.

[Caption: Evaluating power values of P in1 , P in2 , and P out1 in simulation, calculation, and measurement, at (3 GHz, 50%).]
Measured P in and P out are presented together with simulation results in Fig. 13 (at 3 GHz and D = 50%). P out fails to rise linearly with I out as a result of the output voltage droop. P in rises steadily, from a fixed loss of around 25 mW. The efficiency stays flat at around 50% (as shown in Fig. 14) .
VII. DISCUSSION
Early comparisons between the simulation and measurement results suggested that the actual duty cycle seen by the on-chip buck converter is different from the duty cycle of the input clock signal. This could be due to the reaction of the PWM generation circuit or the tapered driver chain to the magnitude and offset of the clock signal, to stray parasitic elements, or to an additional delay in the ZVS delay circuit. Being unable to probe the internal operation means that an absolute diagnosis is impossible.
Therefore, the strategy used here is to find a new duty ratio in simulation to match the measured V out and then compare I in1 to see if the simulation matches the experiment. For the tests done at D = 50% and 66%, the simulations match quite well with modified values of D sim = 70% and 78%, respectively. As a result of adjusting the input duty ratio in simulation, the duty ratio seen at the buck converter can be considered equivalent in test and simulation.
Output voltage V out is shown in three different cases of (F sw , D) = (2 GHz, 50%), (2 GHz, 66%), and (3 GHz, 50%) in Fig. 12 . At 3.5 GHz, the loss of output voltage indicated that the chip was not functioning at such a high frequency. As expected, the output voltage does not vary much with frequency. The output voltage decreases as the output current increases because of D eff and the resistive voltage drops in the circuit (see later). The measured and simulated efficiency is plotted in Fig. 15 .
A. Separation of Losses
The input power can be written as P in1 = P out + (P swloss + P D ) + P lossOn , where P swloss is the power loss in M p from switching the clock node capacitance, P D is the driver chain loss, and P lossOn is the power loss in R p , the total parasitic on resistance of either the PMOS transistor or NMOS transistor and the inductor, caused by the inductor current (when the respective transistor is ON, noting M n conducts in reverse). Therefore, R p = R on−mos + R ind and
$$P_{lossOn} = R_p I_{LF,rms}^2 = \tfrac{4}{3} R_p I_{load}^2$$
where the inductor current I LF has an ideal triangular waveform with an average value equal to I load . Defining a recycling factor γ, the recycled power can be estimated as
In the case of short equal rise and fall times in an inverter chain, the power dissipation is mainly due to the dynamic loss, and only a small part (<10%) is due to the short-circuit crowbar currents. Therefore, the driver chain loss is given by
where C clk-chain is the sum of all the internal node capacitances of the clock chain.
Since there is no ZVS on M p , there is a switching loss in M p related to charging C clk that is given by
Thus, efficiency η can be written as
Substituting the known values into (9) gives an efficiency η ≈ 50%, as found in the experiment shown in Fig. 14 .
To assess the power required to drive the gates of M n and M p , and the effectiveness of the ZVS, a comparison can be made with the reference clock, where the clock capacitance is charged and discharged through M p and M n , with no energy recovery and therefore no ZVS. Noting that the drive chain losses will be P D as given above, then
Here, P in2 was experimentally measured to be 42 mW. This agrees well with simulation, which estimates P in2 to be 41 mW. By ratios, this gives the drive chain losses, P D , as 10.5 mW and the maximum clock power that can be recovered, P recycled , as 15.8 mW. In comparison, the drive chain and clock power loss can be extrapolated as 26 mW from Fig. 13 by following the measured input power for the merged clock driver and buck converter back to an output current of zero. The measured P in2 is shown dotted on the graph and is roughly 15.5 mW higher at the intercept on the y-axis. This agreement shows, as expected, that the recovered energy is efficiently removed to the load (γ = 1). Again referring to Fig. 13 , the drive chain losses are a fairly small proportion of the overall consumption (P in1 of 117 mW at I out = 100 mA). This shows that the concept of a tapered driver chain is reasonably efficient at high load currents and of course essential to obtain the switching speed necessary for operation at 3 GHz. Since the driver chain and the M n , M p inverter also perform the clock driver function, the concept of fixed losses should be modified. There is a total energy saving consisting of the recycled energy and the removal of one driver chain. These amount to fixed losses of 26 mW. Returning to Fig. 13 , at an output of 40 mA, P out1 = 28 mW and P in1 = 60 mW. Taking into account the 26 mW "saved" by merging the circuits, the "efficiency" of the additional dc-dc converter within the overall design becomes 90%. Even at 100 mA, the efficiency of the design becomes 67%. Clearly, obtaining a high-efficiency clock driver is important in high-performance digital ICs where around 30% of the total heat generated is due to the lost clock energy.
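One plausible reading of the "by ratios" step is sketched below. It assumes that the reference driver power is purely dynamic, that a fan-out-of-four taper makes the internal chain capacitance roughly C clk (1/4 + 1/16 + ...) = C clk /3, and that only the half of the clock-node power stored on C clk each cycle is available for recovery. These are illustrative assumptions, and the function name is hypothetical.

```python
def split_reference_power(p_in2_mw, fanout):
    """Split the reference driver power into chain loss and recoverable clock power.

    Assumes C_chain / C_clk = 1/fanout + 1/fanout^2 + ... = 1 / (fanout - 1),
    and that half of the clock-node power (the energy stored on C_clk)
    can be recovered.
    """
    chain_to_clk = 1.0 / (fanout - 1)
    p_clk_mw = p_in2_mw / (1.0 + chain_to_clk)
    p_d_mw = p_in2_mw - p_clk_mw
    p_recycled_max_mw = 0.5 * p_clk_mw
    return p_d_mw, p_recycled_max_mw


p_d, p_recycled_max = split_reference_power(42.0, 4)
fixed_loss = p_d + p_recycled_max  # chain loss plus the unrecovered charging half
print(f"P_D ~ {p_d:.1f} mW, P_recycled(max) ~ {p_recycled_max:.2f} mW, fixed ~ {fixed_loss:.1f} mW")
```

Under these assumptions the split lands on the 10.5 mW and roughly 15.8 mW quoted above, and their sum is close to the 26 mW intercept read from Fig. 13.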
From Fig. 13 , the variable losses accounted for by R p in (8) need consideration. Starting from P in1 = 117 mW at 100 mA and subtracting away the other known values, we find these losses to be around 36 mW, or 31% of the total input power. The droop in the output voltage in Fig. 12 gives an effective output resistance of 3.4 Ω. As this includes changes in the switching edges and thus D eff , it is not a true resistance. From the simulation waveforms in Fig. 8 , the voltage droop with current is around 100 mV at a peak current of 120 mA. This implies an R on-PMOS of around 0.83 Ω. In the same way, R on-NMOS is around 1 Ω. Taking R p as 1.2 Ω (averaging 0.83 and 1, then adding the series resistance of the inductor) and applying the 4/3 factor from above, this term accounts for about 16 mW of the 36 mW.
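The 4/3 factor and the resulting conduction losses can be checked numerically. The sketch below assumes an ideal triangular inductor current ramping between 0 and twice the load current, and integrates its square over one ramp (the falling ramp mirrors the rising one, so a single ramp gives the same mean square). Function names are hypothetical.

```python
def triangular_mean_square(i_avg, samples=10000):
    """Mean of i(t)^2 for a ramp from 0 to 2*i_avg (midpoint rule)."""
    total = 0.0
    for k in range(samples):
        x = (k + 0.5) / samples      # normalized time within the ramp
        i = 2.0 * i_avg * x          # instantaneous inductor current
        total += i * i
    return total / samples


def conduction_loss_mw(r_ohm, i_avg):
    """Resistive loss in mW for a series resistance carrying the triangular current."""
    return r_ohm * triangular_mean_square(i_avg) * 1e3


factor = triangular_mean_square(0.1) / 0.1 ** 2   # expected to approach 4/3
r_on_pmos = 0.100 / 0.120                         # 100 mV droop at 120 mA
loss_rp = conduction_loss_mw(1.2, 0.1)            # R_p = 1.2 ohm at 100 mA load
loss_track = conduction_loss_mw(0.75, 0.1)        # 0.75 ohm resistance at 100 mA load
print(f"factor ~ {factor:.4f}, R_on-PMOS ~ {r_on_pmos:.2f} ohm")
print(f"R_p loss ~ {loss_rp:.1f} mW, 0.75-ohm loss ~ {loss_track:.1f} mW")
```

The same helper applied to a 0.75 Ω series resistance at the same load gives about 10 mW.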
Noting that the detailed prefabrication simulation is fairly close to the measured results, there must be a loss not accounted for in (8) . Here, the V DD supply track resistance was in fact estimated in the simulation at 0.75 Ω (the ground track resistance was much lower). The current in the V DD track is given by the current in M p (see Fig. 8 ) and the drive chain switching current. Adding 0.75 Ω to R p accounts for a further 10 mW. This supply track resistance also accounts for a significant part of the droop in Fig. 12 .
Overall, we have accounted for a total of 107 mW of the input power. With less than 10% of the input power not accounted for in our measurements and breakdown, we are satisfied that the most significant losses have been covered. It is interesting to note that the losses due to the inductor are small, even if one were to attribute all the error to the inductor loss calculation and the assumption of an idealized triangular waveform. In summary, the losses can be apportioned properly using (8) , and tie up well with the theory and simulation. The way in which the efficiency is flat and unchanged with frequency in Fig. 14 is also a result of the effective energy recycling (and ZVS). An improved design would have to consider all the contributions to the losses.
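The bookkeeping behind the 107 mW total can be laid out explicitly. In the sketch below, the output power at 100 mA is not quoted directly in the text; it is inferred here, as an assumption, from P in1 minus the 26 mW of fixed losses and the 36 mW of resistive losses. Variable names are hypothetical.

```python
P_IN1_MW = 117.0            # measured input power at 100 mA
FIXED_MW = 26.0             # driver chain plus clock switching loss
RESISTIVE_TOTAL_MW = 36.0   # remaining losses attributed to resistances
CONDUCTION_MW = 16.0        # explained by R_p = 1.2 ohm
TRACK_MW = 10.0             # explained by the 0.75-ohm supply track

# Inferred (not quoted) output power at 100 mA.
p_out_mw = P_IN1_MW - FIXED_MW - RESISTIVE_TOTAL_MW

accounted_mw = p_out_mw + FIXED_MW + CONDUCTION_MW + TRACK_MW
unaccounted_mw = P_IN1_MW - accounted_mw
unaccounted_fraction = unaccounted_mw / P_IN1_MW
print(f"accounted: {accounted_mw:.0f} mW, unaccounted: {unaccounted_mw:.0f} mW "
      f"({100 * unaccounted_fraction:.1f}%)")
```

With these inputs the unexplained remainder stays below 10% of the input power, in line with the statement above.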
B. Review of the Implementation
A practical concern is the area overhead imposed by the filter passive components of the power converter, when integrated into a real processor. In the merged buck design, a clock load capacitance of 12 pF is assumed, corresponding to a 0.92-mm² region of the IBM POWER6 die. The area needed to implement the filter passives is ∼0.14 mm² (0.09 mm² for L F and 0.05 mm² for C F ) which represents an increase of about 15% in the chip area. It is also possible to stack the passive filter components, i.e., putting the capacitor under the inductor to save area. The area under the inductor has not been used before due to concerns of negative impact on inductance and/or eddy current losses. Recently, these concerns have been studied in [18] with positive results. Table I compares the performance of this work with that of two other published buck converters. The output voltage ripple is part of the design specification. A common approach to reduce ripple is to add inductor paths. This has been performed at 480 MHz, with on-package inductors [19] . In contrast, a fully on-chip buck converter in 0.13-μm technology that was 58% efficient is described in [20] . Operating at 75-225 MHz, it uses a total area of 3.76 mm² to fit the large passive components. Such an approach is difficult to reproduce here, as the clock is merged with the converter. However, by using the clock frequency, the effectiveness of the filter is vastly improved, so the filter components are small and the ripple will be low.
The designs of this paper can be applied to the inverter chains that are located after the clock gaters across the chip. The gater circuitry is not included in these circuits as here we only deal with a local region after the gaters. The gater presents an added constraint to circuits employing these converters: if the clock is gated, power is no longer delivered to the subcircuit that is supplied by the converter.
Consideration of the losses shows that one possible optimization is to operate the initial stages of the driver chain at a reduced voltage by supply stacking [21], [22], while keeping the very last inverter stage at the full input voltage. An alternative approach is to apply the ZVS technique and energy recovery with a second inductor at the last stage of the tapered driver, similar to the approach found in [23].
Equation (9) accounts for the resistances of the inductor and transistors, and the parasitic losses associated with the capacitances. However, if this circuit were integrated into an actual high-performance processor design, the layout parasitics related to the tracks would need to be considered, and these would also introduce interactions between circuit units containing these circuits. Future work that attempts to integrate such a design into a real processor will have to pay close attention to these details.
The synchronization between the converter and the clock means that it is necessary to give consideration to the EMI created by the switched current. The filter capacitor adds negligible amounts of EMI. By employing a PGS under the inductor, the substrate itself is shielded from high capacitive currents [14]. The EMI contribution of the inductor merits more investigation. If the time-varying flux generated by the inductor current passes through a neighboring loop in close proximity to the inductor, it could induce an unwanted voltage in that loop (EMI). Thus, following customary layout techniques, there should be a certain guard space around the inductor. The switched current in the layout parasitics of the tracks also needs further consideration in a full design, and may require local decoupling to avoid crosstalk.
VIII. CONCLUSION
Energy recovery from a clock load has been shown to be possible by merging the functionality of the switching dc-dc converter with the clock driver. The merged and fully integrated clock driver/converter circuit recycles the clock energy by employing the clock capacitance as a ZVS capacitor for the main inverter leg.
Adopting the multigigahertz system clock as part of the power switching means that the full integration of the output filter components is feasible, and the integrated inductor scales effectively, with low losses. With the output filter taking only a small area of the chip, estimated to be 15% of the functional island area, such merged clock and dc-dc converters may prove to be viable and economical.
By arranging the clock driver and dc-dc converter to share the tapered driver chain, significant energy is saved. The tapered driver used here gives sufficient drive to the main MOSFETs, retaining the fast switching edges necessary for a high-quality square wave clock waveform.
By deploying ZVS in one direction only, it is clear that a significant saving can be made without the necessity of a reversing current in the output filter inductor. The self-timed ZVS delay circuit implemented here is simple, effective, and fully compatible with digital circuits.
In this study, the buck converter is, out of necessity, the simplest implementation required to verify basic functional operation. No doubt this contributed to the success of the 90-nm CMOS test chip. Measurements indicate the chip operates with an efficiency of around 50%, while providing the clock function at zero energy cost. The energy recovered from the clock capacitance is constant, so the constant fixed losses in the tapered driver are more than compensated for across the output current range. Clearly, the PI controller and its interactions with the PWM generator circuit need further development, as the integrated version did not work as expected.
There are a number of concerns that still need to be addressed before this work can be made practical. First, clock jitter may be increased as a result of the PWM required for voltage regulation. Second, integration of the proposed converter layout with the layout of a high-speed microprocessor is needed to verify practical issues such as resistive parasitics, layout of the extra supply voltage grids, electromagnetic interference, and the low-skew clock distribution. Third, the new clock waveform is only suitable for edge-triggered digital flip-flops or pulsed latches that are timed from one clock edge only.
