Abstract-The clock distribution and generation circuitry forms a critical component of current synchronous digital systems and is known to consume at least a quarter of the power budget of existing microprocessors. We propose and validate a high-level model for evaluating the energy dissipation of the clock generation and distribution circuitry, including both the dynamic and leakage power components. The validation results show that the model is reasonably accurate, with the average deviation being within 10% of SPICE simulations. Access to this model can enable further research at high-level design stages in optimizing the system clock power. To illustrate this, a few architectural modifications are considered and their effect on the clock subsystem and the total system power budget is assessed.
Index Terms-Clock power-consumption modeling, digital CMOS, phase-locked loop, VLSI low-power design.
I. INTRODUCTION
LOW-ENERGY dissipation is not only of interest for portable devices, where maximizing battery life is a design rule, but also for nonmobile systems, where chip-level issues such as power delivery and packaging, and system-level issues such as integration, cooling and case design are important.
The task of lowering energy dissipation has been attacked at all stages of the design process (i.e., transistor, gate, logic, and architectural levels), with the practical limits already being reached at the lower levels. Thus, the responsibility of achieving power savings now also falls to the system architect and, from this perspective, the design of the clock distribution subsystem remains one of the main challenges, since it consumes up to 40% of the total power budget of current high-performance microprocessors [1].
Most of the previous work on clock distribution networks focused on reducing clock skew. This was pursued at multiple levels [2], including circuit and layout techniques, automated synthesis tools and the modeling of the timing characteristics of the clock network. Many researchers have focused on proposing techniques and methods that allow the generation of virtually skew-free clock subsystems, but reducing power consumption is now becoming another important design directive as well. Thus, various techniques for clock power reduction have also been proposed, including the addition of constraints to the clock network routing for wire length minimization [3], power reduction by buffer insertion [4] and the reduction of switching activity (clock gating) [5]. From another perspective, clock energy dissipation models have also been suggested [6], [7], so that static optimizations can be evaluated. These models, however, were mostly oriented to the distribution grid and did not consider in detail the influence of the generation circuitry and of buffer sizing.
The authors, understanding the importance of a complete, accurate and flexible power model that captures all components of the network, proposed a model whose initial formulation was presented in [8]. The work presented here goes further by analyzing in more detail those components that contribute strongly to the total clock power and by reformulating some equations so that the causes of power consumption are better understood; these aspects are presented in Section II. The effects of leakage power and studies with deep submicron technologies, not previously considered, are also included in that section. In Section III, the applicability of the model is discussed by analyzing the influence of architectural optimizations on clock power. Further, as frameworks for architectural-level decisions on total power budget have become widely used [9], some experiments using our system-level simulator [10] allow an assessment of the possible savings when various clocking strategies are applied, within the context of a real CPU design. Finally, we state the conclusions and the direction of our future work.
II. A CLOCK POWER MODEL
Fig. 1 shows some of the usual loads present in a clock distribution network. Others, such as the precharge load of dynamic gates, the TLBs, the state registers in the control logic, etc., are not shown for simplicity. Thus, the proposed task is to determine the term , which can be estimated as presented below. Due to clock gating (i.e., not all units are used each clock cycle), a weight constant is required to appropriately model the effective load. Therefore, we can state that (1)
The memory structures include units such as data and instruction caches, data and instruction TLBs, the register file, the branch history table in the branch predictor, the instruction issue window, the load/store queue and other similar constructs. It is assumed that all components are built using static CMOS gates and that a single-phase clocking strategy is used for all sequential logic registers. The memory structures use the classic 6-transistor cell and a single precharge transistor per bit line. Given that the load of the distribution network wiring is affected heavily by its topology, this paper only considers the popular H-tree clock distribution style [6], [7].
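As a rough illustration of the weighted sum that (1) expresses, the following sketch combines per-unit clock loads with gating weights. The exact form of (1) is assumed here, and the unit names, capacitance values and weights are hypothetical placeholders rather than values from this work.

```python
def effective_clock_load(loads_pf, weights):
    """Sum each unit's clock load scaled by the fraction of cycles it is clocked.

    loads_pf: dict of unit name -> clock load in pF (hypothetical values)
    weights:  dict of unit name -> weight in [0, 1]; 1.0 means never gated
    """
    total = 0.0
    for unit, cap in loads_pf.items():
        w = weights.get(unit, 1.0)  # assume the unit is ungated if no weight is given
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"weight for {unit} must be in [0, 1]")
        total += w * cap
    return total


# Hypothetical example: the data cache is clocked half of the time.
loads = {"icache": 40.0, "dcache": 40.0, "regfile": 10.0, "pipeline": 30.0}
weights = {"icache": 1.0, "dcache": 0.5, "regfile": 0.8}
total = effective_clock_load(loads, weights)
```

A weight of 1.0 recovers the ungated load, so the same function covers both gated and ungated designs.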

A. Buffers and Drivers
Buffers in the terminal points of the distribution network are usually built as a chain of variable size inverters and optimized for speed, as described in [11] . If is the ratio of the output to the input capacitance in the chain (i.e., ), the required number of inverters is (2) This implies that there are inverters in the chain or , if happens to be odd and an extra minimum-size device is added. The first inverter is built from minimum-size devices whereas the last one, of size , drives a capacitance equivalent to that of an inverter of size . For every transition in the input, only half of the inverters are actually demanding current from the power supply, whereas the other half are discharging to ground. But since in every clock cycle there are both rising and falling edges, each inverter in the chain charges its associated capacitive load once per cycle. Equation (3) gives the estimation of the average sizing factor, which accounts for the situation described, whereas (4) gives the effective capacitance for a single driver, where and are the normalized widths of the and devices in the inverter (3) (4) Using (5), the technology influence was calculated (5) In (5), is an estimate of the average length of the wire attached to a particular node, which increases, as the width of the inverter increases and more parallel transistors must be connected. By considering the gate , drain (diffusion) , and interconnect capacitance , the effect of technology scaling is captured. Equation (6) estimates , where and are the width and effective length of the minimum-size transistor, is the dielectric constant of the oxide, and is the oxide thickness [12] (6) Similarly, is calculated as given in (7), where is the diffusion zone length, is the junction capacitance and is the sidewall junction capacitance [11] (7)
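The stage count and the capacitance switched by a tapered inverter chain can be sketched as follows. This is a minimal sketch under stated assumptions: the symbols `f` (sizing factor) and `load_ratio` (load capacitance over the input capacitance of a minimum inverter) are our own names, the stage count is taken as the classic ceil(ln(load_ratio)/ln(f)), and the technology-dependent terms of (3)-(7) are not modeled.

```python
import math


def chain_stages(load_ratio, f=3.0, keep_polarity=True):
    """Return (tapered stages, extra minimum-size inverters) for a buffer chain.

    load_ratio is C_load / C_in of a minimum inverter (assumed definition).
    An extra minimum-size inverter is added when the stage count is odd,
    so that the chain does not invert the clock.
    """
    if load_ratio <= 1 or f <= 1:
        raise ValueError("load_ratio and f must both exceed 1")
    # The small epsilon guards against floating-point noise on exact powers of f.
    n = max(1, math.ceil(math.log(load_ratio) / math.log(f) - 1e-9))
    extra = 1 if (keep_polarity and n % 2 == 1) else 0
    return n, extra


def switched_capacitance_units(n, f=3.0, extra=0):
    """Capacitance charged once per clock cycle, in minimum-inverter input units.

    Stage i (sizes 1, f, ..., f**(n-1)) drives a load of f**(i+1) units;
    each extra minimum-size inverter drives one unit.
    """
    tapered = sum(f ** (i + 1) for i in range(n))
    return tapered + extra


n, extra = chain_stages(81, f=3.0)
cap = switched_capacitance_units(n, 3.0, extra)
```

Multiplying the returned value by the input capacitance of a minimum-size inverter gives an absolute estimate for one driver.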
The interconnect capacitance per unit length is calculated using the area and fringe values obtained from MOSIS technology description files, such that Width . Fig. 2 shows how the components of vary as the feature size decreases. Various buffers of different lengths were implemented using Berkeley's MAGIC CAD tool and simulated with SPICE. Table I lists the technologies used for evaluation and corresponding fabrication parameters.
For the validation of the model, the number of stages and the technology used are the factors whose influence we verify. For the simulations, was fixed to 3 and the results are presented in Fig. 3 (the curves of estimated values are so close that they are not shown for the sake of neatness). It was found that the average error across all technologies and design options (i.e., all values of ) was 5.29%.
Two aspects of technology scaling deserve mention. First, the continuous reduction in threshold voltage influences the transistor's current-driving capability, since its magnitude is a function of the difference . The classic dependence on the square of this difference holds for long-channel devices [11], but as the feature size shrinks below 1.0 µm, velocity saturation effects under high electric fields become significant. This forces the current to be a linear function of the voltage [13], as given in (8). In (8), the increment in and the decrease in combine with the variations in , such that, if the appropriate scaling is applied, the magnitude slightly decreases with technology scaling. Otherwise, the average gate current may increase, which should be accounted for when estimating power consumption. The second aspect is the behavior of the short-circuit current, which is estimated using (9), where and are the rise/fall time of the signal and its period (average time between transitions), respectively [13].
In a perfect scaling scenario, where the supply and threshold voltage difference scales down and both and decrease as the operating frequency increases, we could state that short-circuit power also reduces or, in the worst case, remains proportionally constant. Now, by matching the rise/fall times of the input and output signals, the power dissipation of short-circuit currents is minimized and confined to within 10% of the switching power dissipation [11]. Equation (10) estimates the rise/fall time for the inverter in position of the buffer chain, where and are the input capacitance and drive current of a minimum-size inverter, respectively. Since the expression in (10) is independent of , the input rise/fall times are identical to the output ones at all points in the chain, as long as the PMOS transistors are sized appropriately to compensate for their lower mobility. Thus, by maintaining the sizing factor across the chain, the contribution of short-circuit currents to the total dynamic power is bounded and could account for the error in Table II.
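To make the scaling argument concrete, the sketch below evaluates a commonly used textbook estimate of short-circuit power for an unloaded symmetric inverter (Veendrick's approximation). We assume, without confirmation from the text above, that (9) has this general form; the parameter names and the numerical values are illustrative only.

```python
def short_circuit_power(beta, vdd, vt, tau, period):
    """Veendrick-style estimate: (beta / 12) * (vdd - 2*vt)**3 * tau / period.

    beta   : transistor gain factor in A/V^2
    vdd, vt: supply and threshold voltage in V
    tau    : input rise/fall time in s
    period : average time between transitions in s
    Returns power in W.
    """
    headroom = vdd - 2.0 * vt
    if headroom <= 0:
        return 0.0  # the two devices never conduct at the same time
    return beta / 12.0 * headroom ** 3 * tau / period


# Illustrative values only (not taken from Table I).
p_sc = short_circuit_power(beta=1.2e-4, vdd=3.0, vt=0.5, tau=1e-10, period=1e-8)
```

Because the estimate depends on the cube of the supply and threshold difference and on the ratio of rise time to period, it stays constant or shrinks when both scale together, which is the behavior described above.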
We expect buffers to have the largest influence on the network's leakage due to their usually large size (width) and the absence of stacked devices in the inverter chain. Since it has been shown that the effects of subthreshold currents dominate the diode junction (or reverse-bias) leakage contribution [14], we only consider the influence of the former. Following a similar reasoning to that presented for the derivation of the buffer's dynamic power, we now model the leakage power as a function of the number of stages in the buffer. Our model calculates the leakage current as given by (11), where the first factor is the average subthreshold current for and devices (in A/µm).
When the input is high, there is leakage through certain and devices, whereas when the input is low the situation is reversed. As before, is the width of a minimum-size transistor and the sum gives an average across all stages. The above current simply needs to be multiplied by the supply voltage to obtain the leakage power. Fig. 4 shows the data obtained from SPICE for three different technologies (the curves of estimated values are so close that they are not shown for the sake of neatness). As expected, leakage increases with decreasing feature size and the data follow the trend predicted by our model for a variable number of stages. The average error was 2% for 0.25 µm, 6% for 0.35 µm and 9% for 0.8 µm. This decreasing error for decreasing feature size can be explained by the fact that the magnitude of the leakage currents increases, making them easier to differentiate from noise and transient fluctuations. These results clearly justify the formulation of a technique for minimizing buffer leakage power, whenever possible.
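The leakage estimate described above can be sketched as follows. The exact form of (11) is not reproduced here, so this is an assumed simplification: one device per stage is taken to be off at any time, its width grows by the sizing factor from stage to stage, and the relative sizing of the two device types is ignored. All parameter names and values are hypothetical.

```python
def buffer_leakage_power(i_sub_avg, w_min_um, n_stages, f, vdd):
    """Leakage power (W) of a tapered inverter chain.

    i_sub_avg: average subthreshold current per unit width, in A/um
    w_min_um : width of a minimum-size transistor, in um
    n_stages : number of inverters in the chain
    f        : sizing factor between consecutive stages
    vdd      : supply voltage, in V
    """
    # Width of the leaking device in each stage (assumed to scale as f**i).
    widths = [w_min_um * f ** i for i in range(n_stages)]
    i_leak = i_sub_avg * sum(widths)
    return i_leak * vdd


# Illustrative values only.
p_leak = buffer_leakage_power(i_sub_avg=1e-9, w_min_um=0.5, n_stages=4, f=3.0, vdd=2.5)
```

Since the summed width grows geometrically with the number of stages, the last stages dominate the leakage of the chain.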
B. Memory Structures
A lot of effort has been devoted to the study of memory structures and, in particular, to the development of cache, register file and translation lookaside buffer (TLB) power models, since they also constitute a large portion of the total microprocessor power. Here, we assume a very general model in order to determine the capacitive load of the precharge circuitry on the clock network. Let us suppose an address word with bits of tag, bits for the set index and bits for the word index. Thus, there are rows and columns. We assume the precharge transistor load to be proportional to the bit line capacitance but scaled down by a factor . This factor is normally chosen to be around 30, as circuit-level simulation shows that a bitline of 32 cells can be driven effectively with a minimum-size PMOS transistor. Note that during precharge there is no rail-to-rail swing, as bitline isolation is assumed (in fact, it is widely used to reduce cache power consumption). For faster memory structures, like register files, this factor can be on the order of 10-15. The bit line capacitance per way is given by (12), where the first term represents the interconnect capacitance along the cell height and the second represents the capacitance due to the pass transistor, whereas (13) gives the total precharge capacitance per way. In (13), the first part captures the interconnect capacitance along the cell width and the second estimates the gate capacitance of the precharge transistors (one per bit line, for a total of 2). Equation (14) gives the clock load for the whole array, where the factor accounts for the additional power overhead if intermediate buffers are included between ways, which are required if a technique like subbanking is to be implemented. The last term in (14) is considered only if the mentioned buffers are present ( is the width of a bank or "way"). According to the rules presented in Section II-A, if the buffer is sized to be times smaller than the load it is driving, then . 
The second term in (14) is the additional load due to the storage required for the status bits that support the replacement policy, if any. The effective capacitance per memory read can simply be expressed as the factor if the influence of the control logic is discarded, given that the size of the memory is usually large in current cache designs. Since a write usually requires a read to verify data status (only the tag might actually be read, but the whole array might be precharged), the model can also be used in this case. The models can be easily tailored to specific cases of caches, register files and TLBs [8]. Fig. 5 shows both the estimated and simulated bitline capacitance for the indicated number of rows. The estimated values were calculated from the equation given earlier for , taking , as given in Table I, as determined for a 0.35 µm process and as measured from the actual layout. The simulated values were obtained using SPICE. The average (maximum) deviation is 5% (12%) with respect to the simulated values. The error decreases as the number of rows increases (1.31% for 32 rows and 0.96% in the case of 64 rows). For the other dimension of the memory array, due to the similarity between the expression for and the one validated for the bitline capacitance, the results are omitted here, as the accuracy is similar to that reported earlier.
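The structure of the bitline and precharge terms can be sketched as below. This follows the verbal description of (12) and (13) only; the function and parameter names are our own, the buffer and status-bit terms of (14) are omitted, and the numbers in the example are hypothetical (capacitances in fF, lengths in µm).

```python
def bitline_capacitance(rows, c_int_per_um, cell_height_um, c_pass):
    """Bitline capacitance: per-cell wire segment plus pass-transistor load."""
    return rows * (c_int_per_um * cell_height_um + c_pass)


def precharge_clock_load(rows, cols, c_int_per_um, cell_height_um,
                         cell_width_um, c_pass, k=30.0):
    """Clock load of the precharge circuitry for one way.

    Each column contributes a clock-wire segment across the cell width and
    two precharge transistors, each assumed to load the clock with the
    bitline capacitance divided by the scaling factor k.
    """
    c_bl = bitline_capacitance(rows, c_int_per_um, cell_height_um, c_pass)
    per_column = c_int_per_um * cell_width_um + 2.0 * c_bl / k
    return cols * per_column


# Hypothetical 32 x 64 array with k = 32 chosen to keep the arithmetic simple.
c_bl = bitline_capacitance(32, 0.2, 5.0, 1.0)
c_pre = precharge_clock_load(32, 64, 0.2, 5.0, 10.0, 1.0, k=32.0)
```

Lowering `k` models a larger precharge transistor, which shortens precharge time at the cost of a proportionally larger clock load.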
Regarding leakage power, the only active device present (besides the buffers, which were already addressed) is the precharge transistor. In general, this device is not large, but there are many of them. The precharge transistors are, however, part of the actual memory and not of the clock network, for which only their capacitive load is of interest. We therefore assume the leakage contribution due to the memory structures load to be negligible.
C. Pipeline Registers
Assuming there are flip-flops in register , (15) gives the effective clock load capacitance per register. Here, is the flip-flop clock input capacitance and is the average height of the flip-flop (which is used to determine the interconnect capacitance). In general, , where is the number of clock-driven transistors per flip-flop and it also accounts for any transistor sizing. There is one buffer per register, which is also in charge of providing gating functionality. Equation (16) gives the total clock load presented by all registers in the pipeline, where represents the additional power consumed by the buffers, is the gating factor (i.e., the fraction of flip-flops being gated, , ) and accounts for the width of stage , so that the interconnect capacitance is also included. The validation is performed for the case of . The estimated effective capacitance as given in (16) and the simulated effective capacitance from SPICE power readings are shown in Fig. 6. The independent variable is the size of the register , with estimated as 2 times and the height of the flip-flop (in lambdas) taken from the layout. The maximum deviation from the measured values was 2.21%, with an average error of 1.21%. The influence of the driver can be estimated in a similar way to that presented earlier, where . In terms of leakage, the situation is similar to that found with the memory structures: the clock inputs of the flip-flops in the registers are used to estimate the load, but the flip-flops themselves are not really part of the clock network, so no leakage is attributed to them.
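A possible reading of (15) and (16) is sketched below. The way the buffer overhead and the gating factor enter the sum is an assumption on our part, and all names and values are hypothetical (capacitances in fF, lengths in µm).

```python
def register_clock_load(n_ff, c_ff, c_int_per_um, ff_height_um):
    """Clock load of one register: flip-flop clock inputs plus local wiring."""
    return n_ff * (c_ff + c_int_per_um * ff_height_um)


def pipeline_clock_load(stages, c_ff, c_int_per_um, ff_height_um,
                        buffer_overhead=0.0):
    """Total clock load of all pipeline registers.

    stages: iterable of (n_ff, gated_fraction, stage_width_um) tuples.
    The gated fraction of each register is assumed not to load the clock,
    and the buffer overhead is assumed to scale the whole stage load.
    """
    total = 0.0
    for n_ff, gated_fraction, stage_width_um in stages:
        full = register_clock_load(n_ff, c_ff, c_int_per_um, ff_height_um)
        active = full * (1.0 - gated_fraction)
        wire = c_int_per_um * stage_width_um
        total += (1.0 + buffer_overhead) * (active + wire)
    return total


# Two hypothetical stages: one ungated, one with half of its flip-flops gated.
stages = [(32, 0.0, 100.0), (64, 0.5, 200.0)]
c_pipe = pipeline_clock_load(stages, c_ff=4.0, c_int_per_um=0.1, ff_height_um=10.0)
```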
D. Clock Generator (Phase-Locked Loop)
Clock generators have been widely implemented using phase-locked loops (PLLs) [15], [16]. A component-by-component power consumption analysis, assuming that the PLL is in the locked condition, follows (see Fig. 7).
First, we consider the voltage controlled oscillator (VCO). The usual implementation is based on a ring oscillator, where the current drawn by each stage is controlled so that the output frequency is determined accordingly. The stages on the ring oscillator can be either differential or nondifferential, with the former having a better noise immunity. For low power, the number of stages should be small but at least four stages should be used [16] . Fig. 8 shows the schematics of differential and nondifferential cells. For the former, independent of the input values, there is always a branch that conducts current from the supply rail to ground whereas for the latter, at any given time, just one cell is switching providing an excursion equal to the supply voltage. For cells, the average current is given for both cases by (17) , where is the voltage that causes switching in the next cell (normally, , with between 0 and 1), is the input capacitance of the next stage and is the cell delay.
(17) Knowing that and after some manipulation, the effective capacitances for each case are given by (18) (19) Since the initial validation attempts yielded relatively large error margins [8] because the devices functioned outside their dynamic linear range, the validation presented here makes sure that the control voltage remains in the valid linear region, which is shown in Fig. 9 for the single-ended case. This region is narrower in differential designs due to their higher gain. This condition is also important for stable operation of the PLL. Fig. 10 shows average supply current readings given by SPICE, for different values of the control voltage. It can be seen that, for both cases, the measured supply currents follow the expected trends. In particular, the average (maximum) error was below 2% (5%) with respect to the constant behavior for the single-ended VCOs and below 0.10% (0.25%) with respect to the linear behavior of differential VCOs. The other remaining factor that influences the total effective capacitance is the effect of technology, which was addressed earlier.
The widely popular phase/frequency detector (PFD) design used [15] is shown in Fig. 11. After carefully analyzing the behavior of the circuit under all possible input combinations, it was found that all nodes in the design switch at a rate determined by the slowest input signal. This analysis made it possible to estimate the effective capacitance using (20). By calculating an activity factor as the ratio of to the total capacitance, calculated as the product of the total number of minimum-size transistors in the design , and (i.e., . ), the effective capacitance can be expressed more simply as given in (21), where is the same expression given in Section II-A, with equal to (since connections are only to neighboring cells), which makes the interconnect contribution virtually negligible for the technologies considered. The calculations described above yielded for . The average error of the estimated values with respect to SPICE data is presented in Table III.
The frequency divider (FDIV) is basically a ripple counter, built using four cascaded TSPC flip-flops [11] connected in toggle configuration, plus a transmission-gate-based multiplexer and a few inverters. Since each successive flip-flop runs at half the frequency of the preceding one, (22) estimates the effective capacitance, where is the multiplication factor that determines the switching factor of the output inverters (which can be taken as 4 on average). We assume , whereas (23) gives the flip-flop effective capacitance, where the 1/2 factor captures the fact that this total capacitance is only charged once every two cycles. The average error across technologies is presented in Table III. Since the influence of the design style is difficult to generalize, it must be considered on a case-by-case basis. We have tried, however, to be as general as possible by modeling circuits that are widely used in practical designs. Signal glitching will also affect power estimates, particularly in deep-submicron technologies where signals travel faster. Furthermore, as mentioned earlier, if the appropriate supply voltage is not used, short-circuit power can increase significantly as technology scales.
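The halving of the switching rate along the ripple counter can be illustrated with a short sketch. It covers only the flip-flop contribution referred to the input clock; the multiplexer and output inverter terms of (22) and the per-flip-flop expression in (23) are not modeled, and the capacitance value is hypothetical.

```python
def ripple_divider_capacitance(c_ff, n_flops=4):
    """Effective flip-flop capacitance of a ripple counter, referred to the input clock.

    Flip-flop i toggles at 1 / 2**i of the input rate, so its capacitance
    is weighted accordingly.
    """
    return sum(c_ff / 2 ** i for i in range(n_flops))


# Hypothetical flip-flop capacitance of 8 fF: 8 + 4 + 2 + 1 = 15 fF.
c_fdiv = ripple_divider_capacitance(8.0)
```

The sum converges to twice the capacitance of a single flip-flop, so adding further divider stages has a rapidly diminishing effect on power.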
The last component of the PLL is the low-pass filter (LPF), which determines the capacitive load that the charge pump circuit has to drive. Since the charge pump circuit is directly controlled by the PFD, as long as the PLL is in the locked condition, the output pattern of the PFD does not change and virtually no power is consumed in the LPF or the charge pump circuit. Then, assuming that a differential VCO is used, (25) gives the total PLL effective capacitance during lock. For now, we will disregard any leakage contributions to the PLL power for the technologies considered, as initial simulations revealed leakage power numbers more than 1000 times smaller than the dynamic power measurements, even for the 0.18 µm case. This can be explained by the large amount of stacking present.
E. Clock Load Due to the Distribution Grid (Wiring)
Since a distributed buffer clock distribution scheme has been assumed, the width of the main wiring can be taken as almost constant, yet some increase in the main drivers section might be appropriate. Following a floorplan where only the first level of the H-tree is implemented, the total wire capacitance can be determined by simply adding the contributions of individual branches [7] as given in (26), where represents the diagonal dimension of the chip (26) Our approach closely follows the models presented in [6] and [7] , which were validated. The value of used here is different from that employed earlier, as higher metal layers are usually used for routing the top-level clock distribution network. The parameters and are different but can still be obtained from the process description files. Advanced and more detailed studies on the network's topology influence in clock energy can be found in [19] , where interwire effects have also been considered.
III. EFFECTIVENESS OF THE MODEL
Given the absence of reported data, it is difficult to verify the model as a whole. The limitation lies basically in knowing accurately all the clocked loads in a design, so that the estimated clock power can be related to the reported value. Since these details are not usually disclosed due to proprietary concerns, exact verification of the model through this approach becomes too complex. However, a validation with respect to a naive model is presented later.
The first clear use of the model is the determination of the energy hotspots in the clock subsystem, which is performed for three different architectures: a superscalar RISC processor, a DSP (which can be considered a VLIW machine), and a network processor (NP). We use relevant architectural parameters of each design, given in Table IV, and substitute them into the model. Fig. 12 captures how the different loads contribute across the three designs. We assumed that all designs were implemented in a 0.18 µm technology and used an H-tree topology. Structures particular to each design, such as branch history tables, issue windows, L2 caches, and similar structures are not captured, partially due to the reasons given earlier.
This very simple experiment provides several insights. The first thing to notice is that in all cases, even though the configurations were different, the main contributors to the clock power budget were the buffers, the caches and the pipeline registers, with the exact percentages varying from one architecture to the other. All other loads are almost negligible (except for the TLBs in the RISC and the register file in the DSP) with respect to the main contributors, but it is important to keep in mind that the wiring component can become a key player, especially for grid implementations in more aggressive technologies. It is interesting to see that in all cases, the global clock distribution (wiring and main driver) consumes at least 10% of the total clock power, which is consistent with observations made in [18].
For the three designs presented, we performed a comparison with respect to a crude model, which is basically an equivalent of the simplest model used in transmission line theory. In this naive model, all the clocked loads are grouped together and a single buffer is designed to drive the lumped load. It was found that the basic model underestimates the clock load by 19.5% on average (13.5%, 19.4%, and 25.7% for the RISC, DSP, and NP implementations, respectively). This emphasizes the importance of being as accurate as possible when modeling a complicated structure such as the clock distribution network.
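The reported average follows directly from the three per-design figures, as the short check below shows; only the numbers quoted above are used.

```python
# Underestimation of the clock load by the lumped model, in percent.
underestimates = {"RISC": 13.5, "DSP": 19.4, "NP": 25.7}

average = sum(underestimates.values()) / len(underestimates)
worst_design = max(underestimates, key=underestimates.get)
```

The average evaluates to about 19.5%, and the network processor shows the largest deviation.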
Process technology is another important factor that affects the clock energy behavior. As interconnect capacitance will increase significantly for deep submicron technologies, it is worthwhile to be aware of the implications that this will bring to the complete clock energy picture. Fig. 13 shows the expected clock energy distribution when the single-issue five-stage datapath design of our power simulator [10] is scaled down from a 0.35 µm to a 0.18 µm technology. The first thing to notice is the increase in the relative contributions of the precharge load, the wiring and the main driver. A significant capacitance portion of these components is provided by interconnect, which increases for smaller technologies mostly due to the reduction of distance between wires. The datapath load (pipeline registers) is a mixed load of transistors and interconnect capacitance and, in the end, its contribution remains almost constant. The power from buffers reduces drastically since they are basically transistors, which scale down with technology.
IV. EVALUATING OPTIMIZATIONS
A. Design-Time Choices: Timing Versus Power
The results from Section III helped in focusing attention on the loads that cause the largest power consumption, so that more effective power reduction strategies and optimizations can be developed. As usual, power reduction is exchanged for a performance impact and Section IV-B evaluates this tradeoff as power savings are attempted for two of the most significant loads: the memory structures and the buffers.
Let us consider first the effect of the factor on memory-dependent clock energy and observe how the model provides a means to evaluate the real performance-power tradeoffs incurred by choosing a given sizing factor. Fig. 14 shows the percentage decrease in delay with respect to an implementation using , as obtained from SPICE signal traces (dark bars). Clearly, a larger means a reduction in the precharge time at the expense of increased clock load as calculated using the model (see line plot). However, we can use the actual variation in , which is presented by the light bars, to estimate the changes in delay with respect to the case (i.e., if , there is a reduction of 50%). The plot shows that the factors ratio provides a good estimate of the performance change, since the maximum error of this metric with respect to the actual delay measurements is 2.5%.
As another example of the usefulness of the model, Fig. 15 shows the impact of different values of the sizing factor on the buffer capacitance. In this experiment, the number of inverters in the chain must be changed as changes, so that the signal is fed to the load without incurring a very high performance penalty. The interesting result of this simple experiment is that, although a value of 3 provides the optimum behavior in terms of timing (according to SPICE simulations), the optimum value for power consumption is around 3.66. The increase in delay incurred by using is 14.8% with respect to the case where is 3. This is not a very big price to pay for a reduction of about 37% in buffer power. The figure also shows an estimated variation in short-circuit power, magnified 15 times, which was discussed in Section II. The short-circuit power variation across the possible values of is confined within the percentage variation of the factor (about 6%). Since this translates to less than 1% of the overall dynamic energy, the model estimates are virtually unaffected. A decision on whether the value of the should be modified depends heavily on the influence that this change would have on the performance characteristics of the driver. One possibility is to use in drivers that control sections where larger clock skews are allowed.
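One way to weigh the two figures quoted above is to combine them into a relative power-delay product. This metric is our own illustration and is not part of the reported experiments; it assumes the 37% power reduction and the 14.8% delay increase apply to the same buffer.

```python
def relative_power_delay_product(power_reduction, delay_increase):
    """Power-delay product relative to the baseline (1.0 means unchanged)."""
    return (1.0 - power_reduction) * (1.0 + delay_increase)


# Figures quoted in the text: 37% less buffer power, 14.8% more delay.
pdp = relative_power_delay_product(0.37, 0.148)
```

A value below 1.0 indicates that, under this metric, the power saving outweighs the delay penalty.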
B. Run-Time Clock Power Reduction
We now perform a brief study of the effect of run-time optimizations on clock energy. We incorporated the model into our existing cycle-accurate datapath and memory energy simulator [10] , whose estimates were shown to be within a 10% error margin of the circuit simulation values. We chose for evaluation a set of array-dominated benchmarks, which resemble various signal processing applications and are listed in Table V . The base architecture has 8 kB, 2-way set We assess the possible savings obtained when applying clock gating at various levels. By default, we assume that the cache load is disabled during cache misses, which we have called mid-level clock gating. We target additional clock power savings through low-level gating, where certain portions of a register are disabled depending on whether the information they hold is required by the next stage. The overhead incurred in implementing this technique (selective gating of pipeline registers) is minimal [17] and we look at the influence of such fine-grain gating on clock load. We also consider gating of the entire distribution network (high-level gating), which eliminates unnecessary switching of the main driver and the top-level wiring during cache misses. Note that this only applies to an architecture where a cache miss causes a full stall of the pipeline, which is the case here.
For these experiments, we added an additional performance parameter: the latency of the off-chip (main) memory. We considered three cases: an average speed memory (AM), a fast memory (FM, which has a latency of half of the AM), and the same fast memory with the addition of a write buffer (FM WB). Fig. 16 shows the resulting clock energy savings for high-level clock gating when associativity is changed, but experiments were also performed when cache size was increased. From the figure, it is clear that the larger the main memory latency and/or the miss ratio are, the larger the improvement on clock energy savings obtained by high-level gating. Table VI summarizes the results obtained. Note that the maximum savings are obtained when there is no associativity (direct mapped cache) and the cache size is the smallest. These represent the cases where the cache miss ratio is maximized, which is clearly undesirable from the performance point of view but exemplifies the classical power-performance tradeoff.
Table VII presents the average results across all benchmarks when low-level gating is enabled and mid-level gating is already in use. We observe that, since the reduction in clock load at this finer level of granularity is obtained in every cycle where the system is doing useful work (which is the dominant case), its contribution to reducing clock (and system) energy is larger than that obtained through a higher level of gating, which only yields savings during the time that the units are not used.
V. CONCLUSION
In this paper, we have presented a complete energy model that covers both the clock distribution and clock generation circuitry. It includes parametric equations for the most common types of loads present in the clock network, including memory structures, registers, buffers and the clock generation device (i.e., a PLL). The validation results indicate that the model is fairly accurate, with the average deviation across all components lower than 10%. Additionally, by evaluating the impact on clock energy of design-time choices and run-time optimizations, the effectiveness of the model was corroborated. It was shown that when the clock timing requirements are relaxed (i.e., a slightly slower memory or buffer is possible), significant savings in clock power can be obtained. Similarly, the value of clock gating was quantified at various levels of granularity, yielding a tradeoff between the size of the load and how often it is gated. As directions for future research, we plan to enhance the flexibility of our model by adding features that are commonly encountered in the latest designs. Moreover, further studies on design-time architectural impact are currently being performed, as well as additional experiments with our execution-driven simulator, which will allow us to assess the effect of optimized software on the clock subsystem.
