Abstract-A circuit design methodology minimizing total power drain of a static complementary metal-oxide-semiconductor (CMOS) random logic network for a prescribed performance, operating temperature range, and short channel threshold voltage rolloff is investigated. Physical, continuous, smooth, and compact "Transregional" MOSFET drain current models that consider high-field effects in scaled devices and permit tradeoffs between saturation drive current and subthreshold leakage current are employed to model CMOS circuit performance and power dissipation at low voltages. Transregional models are used in conjunction with physical short channel MOSFET threshold voltage rolloff models and stochastic interconnect distributions to project optimal supply voltages, threshold voltages, and device channel widths 
I. INTRODUCTION
REDUCTIONS in total power dissipation of complementary metal-oxide-semiconductor (CMOS) circuit designs for ASIC's, microprocessors, and semiconductor memories have emerged as a key design constraint over the last few years [1]. This is motivated not only by high-performance requirements in a portable environment where the size, weight and lifetime of batteries are critical, but also by heat dissipation and packaging issues in larger desktops and parallel machines as well [1]-[6]. Scaling the supply voltage for logic and memory circuits has historically been the most effective way to lower power dissipation as this reduces all components of power and is felt globally across the entire system. The 1997 National Technology Roadmap for Semiconductors (NTRS) [7] projects the supply voltage of future gigascale integrated systems to scale from 2.5 V in 1997 to 0.5 V in 2012 primarily to reduce power dissipation and power density (Fig. 1), increases of which are projected to be driven by higher clock rates, higher overall capacitance, and larger chip sizes. A key challenge in the design of logic circuits will be to meet the projected performances given the competing requirements of high performance and low standby power at low voltages [1], [3], [7] in the presence of short channel threshold voltage rolloff [8].
Manuscript received March 7, 1999; revised November 22, 1999. This work was supported by the Defense Advanced Research Project Agency under Contract F3361595C1623 and the Semiconductor Research Corporation under Contract SJ-374-002.
The authors are with the Microelectronics Research Center, Georgia Institute of Technology, Atlanta, GA 30332 USA.
Publisher Item Identifier S 1063-8210(00)04353-5.
In Section II, the physical Transregional MOSFET drain current model, verified by HSPICE simulations, is described and employed to calculate the drain current of a MOSFET in all regions of operation. In Section III, a generic CMOS datapath is modeled as a chain of critical path gates. Wiring loads at the outputs of each gate in the critical path are estimated using a stochastic interconnect distribution [9] based on Rent's rule. These distributions have been verified for actual microprocessors. Analytical expressions for propagation delay for scaled series-connected MOSFET circuits are derived using the Transregional drain current model and verified with HSPICE simulations. Analytical expressions for the supply voltage are derived using the above model for propagation delay. Section IV introduces a simplified analysis of total power minimization. This section calculates the optimal supply voltage, minimizing total power per gate for a given performance, and elucidates the minimum power methodology implemented rigorously in Section V. A simple analytical expression for optimal supply voltage gives a rough quantitative estimate of the dependence of the optimal supply voltage on material, device, circuit, and system parameters. Section V rigorously minimizes total power drain for a single datapath using a numerical methodology. The calculations performed to determine minimum total power are consistent with the performance, technology, device count, and chip size forecasts of the NTRS. Significant threshold voltage rolloff due to short channel effects is calculated conjointly using physical models, which predict rolloff dependence on device geometry, doping profile, and supply voltage [8] . Heat removal constraints imposed (by the cost of packaging) at levels of integration and clock rates projected by the NTRS may not permit critical path gates to drive average wire lengths. 
High local clock frequencies are assumed to apply only within a zone of synchrony, that is, a macrocell of a short-wire cellular array architecture whose cell size is calculated using the stochastic interconnect distribution by imposing heat removal limits on average wire length. Section VI extends the minimum power methodology to parallel datapaths by employing a simple, generic parallel datapath model to calculate the increase in overhead power dissipation with each additional parallel datapath.
II. A TRANSREGIONAL MOSFET MODEL
Compact, analytical, physical Transregional models describe MOSFET behavior in the subthreshold, saturation, and linear regions of operation including high field transport effects. These new models provide smooth current-voltage characteristics across all regional boundaries to enable accurate calculation of propagation delay and total power dissipation per gate. The principal reason for engaging the Transregional model is that its primarily physical rather than empirical origin enables greater insight into the MOSFET parameters that are most critical to the performance of future generations of CMOS logic circuits.
In weak inversion, where the areal inversion layer mobile carrier density is much less than the depletion region charge, the subthreshold drain current (1) is dominated by diffusion
The subthreshold slope factor is $n = 1 + C_d/C_{ox}$, where $C_d$ is the channel depletion capacitance and $C_{ox}$ is the gate oxide capacitance per unit area. The device channel width-to-length ratio is given by $W/L$, and $\mu_0$ is the low field carrier mobility, given by models reported in [10] for n- and p-channel devices. Degradation of mobility due to vertical fields in (2) and (3) is modeled with , where [11] . The reciprocal of the thermal voltage is $\beta = q/kT$.
In the linear region described by (2), the gate voltage is large enough for mobile charge density to be much greater than the depletion charge along the entire channel length, and the drain current (2) is primarily determined by its drift component. In the saturation region, the gate voltage is large enough to strongly invert the channel at the source end, but the drain voltage is also large enough to cause a weak inversion region. The drain current in the saturation region (3) is therefore given by the sum of its drift and diffusion components.
The gate voltage at which the device transits from the subthreshold region (weak inversion only) to the saturation region is given by (6) and (7). This voltage is determined by imposing the requirement of continuity and differentiability of the product of the field-dependent mobility and the areal mobile charge density at the boundary between these two regions
The areal charge density of the immobile bulk charge is (5)
The transition gate voltage from weak to strong inversion is
The drain saturation voltage (9) is determined either by channel pinch-off or by carrier velocity saturation, whichever occurs earlier, at the point corresponding to
The critical electric field [12] is the lateral electric field in the device when carrier velocity reaches its saturated value [11] 
The factor in the denominator of the expressions for drain current in the linear and the saturation regions models the effects of mobility degradation due to high lateral fields. Fig. 2(a) and (b) compares the drain and gate characteristics predicted by the Transregional model with HSPICE simulations of a 0.25-μm CMOS technology. Appendix A lists the HSPICE level-3 parameters used in this simulation.
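The role of the subthreshold slope factor and the thermal voltage in setting leakage can be illustrated with a short numerical sketch. It uses only the textbook relations $n = 1 + C_d/C_{ox}$ and a swing of $n\,(kT/q)\ln 10$ per decade of drain current; it does not implement the Transregional equations (1)-(9). The capacitance values and function names are hypothetical and chosen only for illustration.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def thermal_voltage(temp_k):
    """Thermal voltage kT/q in volts (the reciprocal of beta)."""
    return K_B * temp_k / Q_E


def slope_factor(c_dep, c_ox):
    """Subthreshold slope factor n = 1 + C_d / C_ox (per-unit-area capacitances)."""
    return 1.0 + c_dep / c_ox


def subthreshold_swing(c_dep, c_ox, temp_k):
    """Gate voltage swing per decade of subthreshold current, in V/decade."""
    return slope_factor(c_dep, c_ox) * thermal_voltage(temp_k) * math.log(10.0)


# Hypothetical example values: depletion capacitance is 30% of C_ox.
C_OX = 6.9e-7           # F/cm^2 (assumed, illustration only)
C_DEP = 0.3 * C_OX

for temp in (300.0, 400.0):
    swing_mv = 1e3 * subthreshold_swing(C_DEP, C_OX, temp)
    print(f"T = {temp:.0f} K: swing = {swing_mv:.1f} mV/decade")
```

A larger swing at the higher temperature means that a fixed threshold voltage suppresses leakage by fewer decades, which is one reason static power is evaluated at the top of the operating temperature range.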
III. A SINGLE DATA-PATH MODEL
In this section, the performance of a generic CMOS processor is modeled using a simplified cycle time model, assuming a [9] .
The need for a progressively higher clock frequency associated with increasing average chip sizes prompted the NTRS to project "global" as well as "local" clock frequencies, where the difference between the two becomes increasingly larger across the roadmap due to degradation of signal delays for long interconnects. In the analysis described in this section, critical path gates clocked at global clock frequencies drive wire lengths averaged across the entire chip, given chip sizes and transistor counts forecast by the NTRS. Critical path gates clocked at higher local frequencies drive wire lengths averaged within a macrocell of a "short-wire" cellular array architecture. The cell size is calculated using the stochastic interconnect distribution by imposing a maximum heat removal coefficient of 50 W/cm² on the average wire length of the cell. The two-way NAND gate, with an average fan-out of three, as a basic circuit building block in the critical path, has a performance that parallels that of other circuits actually used in processor critical paths in reflecting technology improvements [13]. Propagation delay models for scaled series-connected MOSFET circuits, derived using the Transregional model, are employed to calculate the dependence of supply voltage on cycle time, logic depth, range of operating temperatures, and wiring capacitance.
A. Total Capacitance Driven by a Critical Path Gate
In logic-intensive CMOS chips, packing densities are interconnect limited [14] where the effective size of a gate is determined by its wireability [15] . The gate pitch is estimated from NTRS projections for ASIC chip size and transistor count assuming an average gate has six transistors. The gate pitch is used in calculating the average wire length in microns, for global critical paths, for each NTRS generation (Table I) . Assuming equal interconnect cross-sectional dimensions and that neighboring wiring levels in a multilevel network provide an approximate ground plane, total capacitance per unit length, including fringing effects, is estimated using analytical models reported in [16] .
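As a rough illustration of the gate pitch estimate described above, the sketch below divides the chip area evenly among gates of six transistors each and takes the square root of the area per gate. The square-array layout, the function names, and all numerical inputs are assumptions made for this example; they are not NTRS values and do not reproduce Table I.

```python
import math


def gate_pitch_um(chip_area_cm2, transistor_count, transistors_per_gate=6):
    """Gate pitch in microns, assuming gates tile the chip as a square array."""
    n_gates = transistor_count / transistors_per_gate
    area_per_gate_cm2 = chip_area_cm2 / n_gates
    return math.sqrt(area_per_gate_cm2) * 1e4   # cm -> um


def wire_capacitance_f(avg_length_pitches, pitch_um, cap_per_cm):
    """Wiring capacitance in farads for a wire length given in gate pitches."""
    length_cm = avg_length_pitches * pitch_um * 1e-4   # um -> cm
    return cap_per_cm * length_cm


# Hypothetical inputs: 4 cm^2 chip, 24 million transistors,
# average wire of 5 gate pitches, 2 pF/cm wiring capacitance.
pitch = gate_pitch_um(4.0, 24e6)
c_wire = wire_capacitance_f(5.0, pitch, 2e-12)
print(f"gate pitch = {pitch:.2f} um, wiring capacitance = {c_wire * 1e15:.2f} fF")
```

In the paper the average wire length in gate pitches comes from the stochastic interconnect distribution [9]; here it is simply passed in as a number.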
The interconnect density function [9] predicts the number of point-to-point interconnects. Real designs, however, use wiring (10), given at the bottom of the page, where (11) Capacitance per unit length:
(F/cm) [16] . Gate Pitch ( m):
Average Fan-Out: Thus
Chip size, gate pitch, average interconnection length, and the wiring capacitance are listed in Table I for each of the 1997 NTRS technology generations.
With simple scaling of dimensions, power density increases with increasing integration: capacitances scale only linearly with feature size, whereas device count per unit area increases as the reciprocal square of feature size, so the energy dissipated per unit area in charging or discharging capacitances increases. The constraints imposed by the cost of packaging may thus prevent the levels of integration projected by the NTRS from achieving high local clock rates. Imposing a heat removal coefficient of 50 W/cm², the maximum load capacitance driven by an average local critical path gate is calculated using and (
For a given wiring load, the performance of a static CMOS gate increases asymptotically with increasing ratios, with gate delays reaching within 20% [ Fig. 6 (b)] of the intrinsic unloaded gate delay for (15) Substituting (15) into (14), we can solve for the average wiring capacitance or (16) Substituting (16) into (13), yields the power density limited average wire length within a macrocell. This average wire length substituted in (10) yields the size and number of cells. Wiring capacitance, wire length, and cell size are tabulated in Table II above. The total capacitance driven by each gate is calculated as the sum of its three components (17) where and are the gate and drain capacitances, respectively.
The gate capacitance is calculated as (18) with given by (19) The drain capacitance seen at the output of an unloaded static gate has three components: 1) the gate-drain overlap capacitance for the NFET and PFET devices, and , respectively; 2) the drain junction bottom capacitance ; and 3) the junction side-wall capacitance Therefore From [17] we have (21) where is the junction bottom capacitance per unit area and is the junction side-wall capacitance per unit length. Trench isolation is assumed between active regions when calculating the sidewall capacitance. The junction bottom capacitance is calculated assuming a doping concentration that is an order of magnitude lower than that seen in the channel region [18] .
B. Propagation Delay of a Static CMOS Gate
Analytical expressions for propagation delay for an inverter or a generic $n$-input gate are derived as the time required for a 50%-50% transition [19] between the input and the output waveforms. In short channel devices where the drain saturation voltage is limited by carrier velocity saturation, the drain saturation voltage is typically less than half the supply voltage.
The driving transistor thus remains in the saturation region while driving its load through the first half of the transition at the output of a static CMOS gate. For the same reason, the slope of the output waveform at half the transition is, to a good approximation, independent of the slope of the input transition. Fig. 3 shows the dependence on using (9) for threshold voltages in the neighborhood of optimal values calculated rigorously in Table III (see Section V) .
The slope of the output transition at half the transition in a chain of identical symmetrical gates is thus approximated as (22) Integrating (22), we get the inverter propagation delay in response to a step (23) The gate delay of a stage, with a finite rise time at its input, as in a chain of symmetric inverters with equal rise and fall times, is derived as the time taken for its output waveform to transit by less the time taken for its input waveform to move through half its complete swing (24) where is the time taken for the output waveform to transit to and is the time taken for the input to reach The delay corresponds to the time taken for the input waveform to complete its entire transition and the delay equals the time taken for the output waveform to reach once the input waveform has completed its transition.
From Appendix B, the closed-form expression for (24) is (25) where the device threshold voltage is modified to include reductions due to temperature and short channel effects (SCE) using (26) Other terms in (25) , derived in detail in Appendix B, are defined as (27) (28) (29) The above propagation delay models (23) and (25) are compared with HSPICE simulations in Fig. 4 for the device parameters given in Appendix A. Figs. 2-6 assume no reductions in threshold voltage due to short channel effects. Threshold rolloff using (26) is considered only in Sections V and VI.
For long channel devices, delay of a series-connected MOSFET circuit increases linearly with fan-in. This follows from a simple model where the resistance to the flow of current through a series-connected structure increases linearly with the number of identically sized devices. At short channel lengths, the improved delay dependence on fan-in [20], [21] brought about by velocity saturation is due to a smaller reduction in the drain saturation current with a rise in the source voltage of the topmost series-connected MOSFET. This effect on delay in a series-connected MOSFET circuit is modeled physically using (30)-(32) by calculating the fractional reduction of the normalized saturation drain current for the series-connected structure. The reciprocal of this quantity yields an "effective fan-in." The PFET channel width that yields identical rise and fall times is calculated by equating the saturation drain currents of the two-way NAND gate that charge and discharge the load [see (36), given at the bottom of the next page].
With a rise in temperature, the two competing effects [22] of threshold voltage reduction and carrier mobility degradation determine the worst case delay and consequently the minimum supply voltage necessary to maintain cycle time requirements over an entire range of operating temperatures. Slow cycle times permit low supply to threshold voltage ratios where performance improves with temperature as reductions in the threshold voltage with temperature dominate degradation of carrier mobility. Fast cycle times translate into larger supply-to-threshold voltage ratios, and, consequently, reductions in the threshold voltage due to temperature rise do not affect the performance as much as the reduction in carrier mobility does, causing performance to degrade with temperature.
The temperature dependencies of threshold voltage are assumed to be [23] (37)
The temperature coefficients of threshold voltage are obtained by differentiating the threshold voltage with respect to temperature [23] and range from 1.2 to 0.6 mV/K for technology generations listed in the NTRS.
Stochastic interconnect models and physical Transregional drain current models are used in this section to derive a cycle time model that predicts the dependence of supply voltage on device, circuit, and system parameters. The stochastic interconnect length distribution is used to calculate the average wire length driven by a critical path gate switched at global clock frequencies. Heat removal constraints are used to calculate the average wire length driven at local clock rates with the size and number of macrocells calculated using the stochastic distribution given this requirement on average wire length. This cycle time model is used in Section V conjointly with short channel threshold voltage rolloff models and NTRS projections of cycle time, parameter tolerances, chip size, and transistor count to estimate the total power dissipated by a critical path gate.
IV. A SIMPLIFIED ANALYSIS OF POWER MINIMIZATION
A simplified methodology to minimize the total power dissipated by a critical path of static CMOS logic gates is presented in this section to provide physical insight into the calculation of optimal design parameters for a specified technology generation and performance. In essence, optimal values of the supply voltage, long-channel threshold voltage, NFET channel width, and PFET channel width are determined by scaling down the supply and threshold voltages while increasing transistor widths until the rate of change of static power is equal and opposite to that of dynamic power. NTRS projections [7] for feature size and gate oxide are assumed to provide a technology guideline.
Simple power dissipation and cycle time models are employed to elucidate the methodology of power minimization. In this simplified analysis, the total power dissipation of a static CMOS logic gate is assumed to be equal to the sum of its dynamic and static components (38)
The activity factor equals the average switching rate, i.e., the total number of logic transitions that occur in $N$ clock cycles divided by $N$, for large enough $N$ [24]. For random logic networks, the activity factor is typically 10% [3]. This value is used throughout the analysis presented in this paper. The load capacitance equals the sum of the wiring capacitance, the gate output capacitance, and the input capacitance of the next stage, as calculated in (13)-(21). Short-circuit power dissipation is typically 5-10% of the total power [4] and is neglected in this section, but is calculated numerically in Section V.
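The decomposition of total power into dynamic and static components can be sketched numerically. Because the exact form of (38) is not legible here, the sketch assumes the standard expressions: dynamic power as activity factor times load capacitance times supply voltage squared times clock frequency, and static power as supply voltage times leakage current. The function names and the example values are hypothetical, and short-circuit power is neglected as in this section.

```python
def dynamic_power(activity, c_load, v_dd, f_clk):
    """Switching power in watts: a * C_L * V_DD^2 * f (assumed standard form)."""
    return activity * c_load * v_dd ** 2 * f_clk


def static_power(v_dd, i_off):
    """Leakage power in watts: V_DD * I_off (assumed standard form)."""
    return v_dd * i_off


def total_power(activity, c_load, v_dd, f_clk, i_off):
    """Total power per gate, neglecting the short-circuit component."""
    return dynamic_power(activity, c_load, v_dd, f_clk) + static_power(v_dd, i_off)


# Hypothetical example: 10% activity, 10 fF load, 1 V supply, 1 GHz clock, 10 nA leakage.
p_dyn = dynamic_power(0.1, 10e-15, 1.0, 1e9)
p_stat = static_power(1.0, 10e-9)
print(f"dynamic = {p_dyn * 1e6:.3f} uW, static = {p_stat * 1e6:.3f} uW")
print(f"static fraction = {p_stat / (p_dyn + p_stat):.3%}")
```

Lowering the threshold voltage raises the leakage current exponentially while allowing a lower supply voltage for the same delay, which is the tradeoff the minimization exploits.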
The clock frequency equals the clock rate that is a consequence of imposing (39) and (40) and is the subthreshold leakage current given by (1) . The cycle time equation (33) used in the analysis presented in this section assumes a gate propagation delay given by (23) . The logic depth gates [25] , effective fan-in and clock skew factor The simplified minimum power methodology incorporates results from two revealing graphs, Fig. 6 (a) and (b), to establish two valuable approximations that will enable a lucid interpretation of minimized total power. Fig. 6 (a) depicts the normalized propagation delay versus the supply to threshold voltage ratio with fixed and to illustrate the saturation at a (39)
The optimum supply-to-threshold voltage ratio is chosen at the approximate saturation point of the curve, which occurs somewhat beyond a slope of negative one. Increasing the supply voltage beyond five times the threshold voltage yields diminishing improvements in performance. Fig. 6(b) examines the propagation delay versus the load-to-wiring capacitance ratio with fixed and and depicts saturation at (40)
Increasing the transistor widths such that the load-to-wiring capacitance ratio is greater than three leads to only marginal performance improvements.
P-channel transistor widths are calculated such that rise and fall times are equal (41) where and are the effective carrier mobilities [11].
(36) 
Since is implicitly defined, an initial value for the iterative calculation of can be solved explicitly by substituting [12] , assuming and such that (46)
The four constraints of: 1) the supply-to-threshold voltage ratio; 2) the load-to-wiring capacitance ratio, both corresponding to cycle time saturation; 3) equal rise and fall times; and 4) power minimization, as seen in (38)-(41), provide insights into the optimal supply voltage dependencies on system, circuit, device, and material parameters.
This simplified qualitative analysis is extended to a more complete and rigorous set of calculations in the next section and is engaged to project performance-constrained limits on CMOS energy dissipation. 
V. A COMPLETE MINIMUM POWER METHODOLOGY
The simplified methodology in the previous section provides physical insight into the minimum power methodology and is extended in this section to a complete numerical analysis. Several refinements are made to the simplified analysis. These encompass: 1) short channel effects of threshold reduction as well as drain-induced barrier lowering (DIBL) [8] ; 2) the effects of NTRS-projected parameter tolerances on threshold rolloff; 3) temperature effects on device parameters; 4) coupling the calculation of optimal NFET and PFET channel widths simultaneously with that of the optimal supply and threshold voltages; 5) finite rise-time effects on propagation delay and total power; and 6) iterating between calculations of interdependent circuit and device parameters until convergence is reached.
Given the benefits of performance improvement, increases in level of integration, and reductions of switching energy that accompany the scaling of transistor dimensions, the minimum feature size for a technology generation is pushed to the very edge of its physical limits [26]. The physical limit on minimum feature size is defined by the exponential threshold voltage rolloff characteristics of a MOSFET for that generation [27]. Variations in lateral and vertical device dimensions and in the supply voltage are thus bound to impact the two-dimensional (2-D) electrostatic charge coupling between the gate and source/drain regions, reducing threshold voltage and consequently increasing static power dissipation substantially [28]. Of these variations, channel length and gate oxide thickness have the dominant effect on the reduction of threshold voltage due to the exponential dependence of threshold rolloff on these parameters [8]. The solid lines in Fig. 7 show the MOSFET rolloff characteristics calculated at NTRS-projected and junction depths for the 50- and 250-nm generations, using physical rolloff models reported in [8]. Long channel threshold voltages necessary to yield the NTRS-projected NFET drain saturation currents of 600 μA/μm are assumed in Fig. 7. NTRS-projected variations of 6.7% in produce the dotted lines in Fig. 7. This deterioration also assumes a variation of 5% in and. The rolloff curves in Fig. 7 demonstrate a total reduction in threshold voltage of 60-100 mV from long channel values, potentially increasing standby power by about an order of magnitude and thus requiring short channel threshold voltage rolloff to be considered when calculating the limits on total CMOS power.
The majority of the transistors on a chip are assumed to have minimum feature size channel lengths. The reduction in threshold voltage at this channel length due to rolloff is assumed in calculating gate delays. This calculation assumes no tolerances on device parameters. Because parameter variations of and have a larger impact on static power than on performance, their effect on threshold voltage rolloff is considered only when calculating static power.
Between the competing effects of long channel threshold voltage reduction and mobility degradation on performance due to temperature increases, as described in Section III, mobility degradation dominates (Fig. 8), requiring the worst case delay to be met at the highest temperature for any given operating range. Static power increases exponentially with temperature, as shown in Fig. 8, due to both a decreasing long-channel threshold voltage and a higher thermal voltage, as seen in (1) and (42). Meeting delay and minimizing power at the highest temperature permit the methodology described below to guarantee that the critical path will meet cycle time for any temperature within the operating range.
The numerical calculations in the minimum power methodology sweep through a 2-D grid of long-channel threshold voltages and NFET channel width-to-length ratios for a given generation as specified by its minimum feature size and gate oxide thickness. At each point on the grid, the substrate doping concentrations, the electron and hole saturation velocities, and the low field electron and hole mobilities are calculated using models in [10]. Models that permit calculation of the temperature and/or doping concentration dependence [23] of bandgap energy, intrinsic carrier concentration, and conduction and valence band densities of states are used simultaneously while calculating the substrate doping concentration.
After the substrate doping concentration, carrier saturation velocities, and the low-field carrier mobilities have been calculated for a given long channel threshold voltage, the supply voltage necessary to meet the cycle time requirement, calculated using (35), is simultaneously solved for with the threshold voltage rolloff [8] , the PFET channel width (36), effective fan-in (31), and total load capacitance (13)-(21). This simultaneous solution is necessary given the interdependencies of threshold voltage rolloff, PFET channel width, junction capacitance, and effective fan-in on supply voltage. The calculation iterates for a given long channel threshold voltage and NFET channel width-to-length ratio until supply voltage threshold rolloff effective fan-in and load capacitance converge to within a margin specified at the outset of the calculation. This calculation proceeds for a given cycle time and at a temperature of 100 K above room temperature. Fig. 9 describes the complete algorithm.
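The structure of the iteration described above, in which the supply voltage and the threshold rolloff depend on each other and are re-solved until they stop changing, can be sketched as a fixed-point loop. This is a toy illustration only: the delay expression is an alpha-power-law stand-in rather than the Transregional delay model, the rolloff is a hypothetical linear dependence on supply voltage rather than the physical model of [8], and every name and number below is an assumption introduced for the example.

```python
def gate_delay(v_dd, v_t, c_load, k_drive, alpha=1.3):
    """Toy alpha-power-law gate delay (a stand-in, not the Transregional model)."""
    return c_load * v_dd / (k_drive * (v_dd - v_t) ** alpha)


def supply_for_delay(target, v_t, c_load, k_drive, v_max=5.0, tol=1e-9):
    """Bisection for the lowest supply voltage whose delay meets the target."""
    lo, hi = v_t + 1e-6, v_max
    if gate_delay(hi, v_t, c_load, k_drive) > target:
        raise ValueError("target delay not reachable below v_max")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gate_delay(mid, v_t, c_load, k_drive) > target:
            lo = mid          # too slow: need a higher supply voltage
        else:
            hi = mid          # fast enough: try a lower supply voltage
    return hi


def solve_operating_point(target, v_t0, c_load, k_drive,
                          dibl=0.05, tol=1e-6, max_iter=100):
    """Iterate supply voltage and rolled-off threshold until both converge."""
    v_dd = 1.0                                  # initial guess
    for _ in range(max_iter):
        v_t = v_t0 - dibl * v_dd                # hypothetical linear rolloff
        v_new = supply_for_delay(target, v_t, c_load, k_drive)
        if abs(v_new - v_dd) < tol:
            return v_new, v_t0 - dibl * v_new
        v_dd = v_new
    raise RuntimeError("operating point did not converge")


# Hypothetical inputs: 50 ps target delay, 0.35 V long-channel threshold,
# 10 fF load, and an assumed drive coefficient.
v_dd, v_t = solve_operating_point(50e-12, 0.35, 10e-15, 3e-4)
print(f"V_DD = {v_dd:.4f} V, effective V_T = {v_t:.4f} V")
```

In the full methodology the same loop would also update the PFET channel width, the effective fan-in, and the load capacitance on each pass, and it would be repeated for every point on the threshold voltage and channel width grid.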
Total power dissipation is calculated as the sum of its dynamic, static, and short-circuit components for each value of and and is plotted in Figs. 10 and 11, with and as independent variables -
where is given by (1) averaged over its value for NFET's and PFET's (48) When calculating static power, threshold voltage rolloff is determined assuming a deviation in of 6.7% as projected by the NTRS. A 5% increase in and is also assumed in this calculation. The short-circuit power component during a clock cycle is calculated by numerically integrating the PFET and NFET drain currents (2) and (3) as shown in (49) and (50) during a rising transition at the input (49) during a falling transition at the input (50)
From Fig. 10 it can be seen that a straightforward scaling of supply and threshold voltage yields reductions in total power dissipation for a given cycle time until switching and leakage energies become comparable. Increasing the channel width independently for a given threshold voltage and cycle time, as seen in Fig. 11 , opens the window to further scaling of the supply voltage, until device capacitances overwhelm wiring capacitance. Beyond this point, further increases in device channel width permit only asymptotically decreasing reductions in supply voltage as shown in Fig. 11 , with larger than optimal channel widths translating into higher power dissipation due to larger gate sizes. Here we assume that if the critical path gates are small enough to be wireable, devices can be made large enough not to be dominated by wiring loads.
Thus, with threshold voltage scaling limited by static power and channel width decreases limited by wiring capacitance, the absolute minimum in total power dissipated by a CMOS gate, for a given cycle time, logic depth, percentage switching activity, and operating temperature range, corresponds to the optimal supply voltage, optimal threshold voltage, and optimal NFET and PFET channel width calculated in this analysis.
For the minimum in total power calculated above to exist, the rates of change of static and dynamic power must be equal and opposite at some point within the range of supply and threshold voltages that correspond to a given cycle time. Static power does not increase indefinitely as supply and threshold voltages are scaled and peaks at a supply voltage calculated below and shown in Fig. 12. This value of supply voltage is obtained by differentiating static power given in (42) with respect to the supply voltage and equating this first partial derivative to zero (52) applied to (52) yields the supply voltage corresponding to peak static power as (53)
Fig. 12 also shows that optimal supply voltages approaching this value, either due to fewer gates in the critical path or a higher activity factor, do not show a minimum in total power. Instead, total power monotonically decreases with supply voltage until the delay requirement can no longer be met. This turns out to be the case described below in Table IV for a local critical path with for the 50-nm generation operating at a local clock rate of 10 GHz. The logic depth is dependent on design and generation; however, this dependence is very weak, and the logic depth asymptotically approaches a limit that cannot go below five to six. This is so because the latch delays at the start and at the end of the critical path count toward the logic depth as well, and a minimum number of gates (three to four) are required for basic arithmetic and Boolean computations. The complete methodology described above is applied to several of the 1997 NTRS technology generations, and the optimal supply voltages, threshold voltages, and NFET and PFET channel widths calculated by the methodology are listed in Tables III and IV. The optimal supply voltages will not permit NTRS-projected clock rates to be achieved for a conventional process with thresholds around 0.7-0.9 V. However, for processes with much lower thresholds, corresponding to optimal device threshold voltages (0.3 V), gate overdrive does not deteriorate as much at lower supply voltages, making the NTRS projections on clock rates achievable.
Average wire lengths, which track increases in chip size and transistor count, as calculated in Table I for global critical paths, or in Table II for local critical paths, are assumed driven by each critical path gate. The global critical path assumes 15 [25] two-way NAND gates with a fan-out of three. The local critical path assumes short pipeline stages with seven two-way NAND gates and a fan-out of two. Of the seven gate delays assumed in the local critical path, two correspond to gate delays due to latches. Table III shows that while NTRS projections of global clock frequency increase by a factor of four across the roadmap, minimum feature size and gate oxide thickness are scaled more aggressively, permitting operation at lower supply voltages. From Table I, it can be seen that the average wire length, in gate pitches, increases due to larger chip sizes and higher transistor counts, permitting the average wiring capacitance to scale much less aggressively than minimum feature size. Lower optimal supply voltages accompanied by aggressive scaling of gate oxide thickness and junction depth translate into decreasing threshold rolloff, permitting the optimal threshold voltage to scale to smaller values. Tables II and IV show how rapidly the average wire capacitance of a macrocell must scale to cope with a maximum heat removal rate of 50 W/cm² using (16). Fig. 13 shows optimal supply voltage for global and local critical path gates, operating at minimum power for NTRS-projected local and global clock frequencies. Static power becomes an increasingly larger fraction of minimum total power with scaling, as seen in Tables III and IV. The impact of increasing static power can also be seen in the increase in power density for local critical path gates in Table IV.
For a given technology generation, increasing the clock frequency raises the optimal supply voltage needed to meet the higher performance requirement. This in turn worsens the threshold voltage rolloff and causes static power to increase as well. Fig. 14 shows the sharp exponential rise in total minimum power as global clock rates are increased. The increasing optimal supply voltages and deteriorating threshold voltage rolloff are also plotted for the 50- and 250-nm NTRS generations in Fig. 14.
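The trend described above can be illustrated with a minimal numerical sketch. It is not the methodology of this paper: it swaps in an alpha-power-law gate delay and a simple exponential subthreshold leakage expression in place of the Transregional and threshold rolloff models, and it ignores rolloff entirely. All names and numerical values (`ALPHA`, `K_DELAY`, `C_LOAD`, `ACTIVITY`, `I0`, `SWING`) are illustrative assumptions; only the 15-gate critical path is taken from the text.

```python
# Illustrative sketch only: alpha-power-law delay and exponential leakage are
# assumed stand-ins for the paper's Transregional and rolloff models.
ALPHA = 1.3        # assumed velocity-saturation exponent
K_DELAY = 20e-12   # assumed lumped delay coefficient (s, voltages in volts)
N_GATES = 15       # gates in the global critical path (from the text)
C_LOAD = 10e-15    # assumed load capacitance per gate (F)
ACTIVITY = 0.1     # assumed switching activity
I0 = 1e-5          # assumed leakage current extrapolated to V_t = 0 (A)
SWING = 0.1        # assumed subthreshold swing (V/decade)


def threshold_for_timing(vdd, f_clk):
    """Threshold voltage at which the critical path just meets 1/f_clk.

    Solves N_GATES * K_DELAY * vdd / (vdd - vt)**ALPHA = 1 / f_clk for vt.
    """
    overdrive = (K_DELAY * vdd * N_GATES * f_clk) ** (1.0 / ALPHA)
    return vdd - overdrive


def gate_power(vdd, vt, f_clk):
    """Dynamic plus static power of one gate (W)."""
    dynamic = ACTIVITY * C_LOAD * vdd ** 2 * f_clk
    static = vdd * I0 * 10.0 ** (-vt / SWING)
    return dynamic + static


def minimum_power(f_clk):
    """Sweep the supply voltage and return (power, vdd, vt) at the minimum."""
    best = None
    for i in range(141):                 # 0.10 V to 1.50 V in 10-mV steps
        vdd = 0.10 + 0.01 * i
        vt = threshold_for_timing(vdd, f_clk)
        if vt <= 0.0:                    # only positive thresholds considered
            continue
        power = gate_power(vdd, vt, f_clk)
        if best is None or power < best[0]:
            best = (power, vdd, vt)
    return best


if __name__ == "__main__":
    for f_clk in (500e6, 1e9, 2e9):
        power, vdd, vt = minimum_power(f_clk)
        print(f"{f_clk / 1e6:6.0f} MHz: Vdd={vdd:.2f} V, Vt={vt:.3f} V, P={power:.2e} W")
```

Under these assumptions, raising the clock frequency moves the minimum-power point to a higher supply voltage and a higher total power, which is the qualitative behavior attributed to Fig. 14.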
VI. PARALLEL DATAPATHS
In this section, the opportunity to reduce power drain by exploiting concurrency-driven voltage scaling [2] is investigated at low supply voltages and deep submicron feature sizes. The supply voltage is scaled, and the resulting performance loss is compensated by adding datapaths in parallel so that the total number of logic operations per second, or system throughput, remains constant. This permits the total power drain from all of the datapaths to be reduced due to a lower operating voltage [2]. However, increasing the number of parallel datapaths increases the complexity and size of the overhead circuitry required for routing, multiplexing, and control of each of the parallel datapaths, resulting in the dissipation of an additional component of overhead power. Also, latency increases by a factor equal to the number of parallel datapaths over the single-datapath case. Below, a generic model for the dependence of the capacitance of the overhead circuitry on the number of datapaths is described and used to compare the power drain of an optimized single datapath with that of parallel datapaths.
The clock requirement for each parallel datapath is reduced to the rate given by (54). The power drain of the datapaths, each operating at this reduced clock rate, is given by (55), which combines the total switching capacitance along the critical path of a datapath with the total static power dissipated by each datapath. The increase in the switching capacitance of the overhead circuitry with additional datapaths is calculated relative to the datapath capacitance using the generic model of (56).
The parameter in (56) models the complexity of the control circuitry and/or any other component of the overhead that does not increase with each additional datapath. The exponent applied to the number of datapaths models the rate at which the routing and multiplexing requirements increase with each additional datapath and is specified by the size and complexity of the datapath. For an 8-b adder/comparator datapath [29], data from layouts showed an approximately quadratic dependence in (56) on the number of datapaths. Examples in [29] indicate a range on the parameter from 0.1 to 0.7. The dynamic power dissipated by the overhead circuitry and the datapaths is given by (57). Static power dissipated by the overhead circuitry is assumed to increase linearly with the overhead capacitance. The overhead circuitry capacitance is assumed to be dominated by device capacitances due to the highly local nature of their placement.

Decreasing ratios projected by the roadmap (Table V) increase the speed penalty for a given reduction in supply voltage, significantly increasing the number of parallel processors required to compensate for the loss in performance.
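The tradeoff in this section can be sketched numerically as follows. This is a hedged illustration, not the model of (54)-(57): the overhead form in `overhead_ratio` (a constant term plus a power-law term in the number of datapaths, with hypothetical coefficients `m`, `c`, and `x`) is an assumed stand-in, the delay uses an assumed alpha-power law at a fixed threshold voltage, and every numerical value is illustrative. Static overhead power is taken as proportional to overhead capacitance, as stated in the text.

```python
# Illustrative sketch only: the overhead form and all values are assumptions.
ALPHA = 1.3        # assumed velocity-saturation exponent
K_DELAY = 20e-12   # assumed lumped delay coefficient (s, voltages in volts)
N_GATES = 15       # gates in the critical path
V_T = 0.3          # assumed fixed threshold voltage (V)
C_DP = 1e-12       # assumed switched capacitance of one datapath (F)
I_LEAK_DP = 1e-5   # assumed leakage current of one datapath (A)


def path_delay(vdd):
    """Critical path delay (s) from an assumed alpha-power-law model."""
    return N_GATES * K_DELAY * vdd / (vdd - V_T) ** ALPHA


def supply_for_cycle(t_cycle, v_max=5.0):
    """Lowest supply voltage meeting t_cycle, found by bisection.

    path_delay decreases monotonically with vdd above V_T for ALPHA >= 1.
    """
    lo, hi = V_T + 1e-6, v_max
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if path_delay(mid) > t_cycle:
            lo = mid
        else:
            hi = mid
    return hi


def overhead_ratio(n, m=0.1, c=0.0125, x=2.0):
    """Assumed overhead capacitance relative to one datapath (hypothetical form)."""
    return 0.0 if n == 1 else m + c * n ** x


def total_power(n, f_throughput):
    """Return (vdd, power) for n parallel datapaths at constant throughput."""
    vdd = supply_for_cycle(n / f_throughput)   # each datapath clocked at f/n
    cap_units = n + overhead_ratio(n)          # datapaths plus overhead
    dynamic = cap_units * C_DP * vdd ** 2 * (f_throughput / n)
    static = cap_units * vdd * I_LEAK_DP       # static scales with capacitance
    return vdd, dynamic + static


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        vdd, power = total_power(n, 1e9)
        print(f"N={n:2d}: Vdd={vdd:.3f} V, P={power:.3e} W")
```

With these assumed values, the first few added datapaths lower the total power, but the growing overhead and static terms eventually reverse the trend, so the percentage savings shrink as the number of datapaths increases.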
VII. CONCLUSION
The limits on CMOS energy dissipation imposed by static power and by wiring capacitance are investigated using a methodology that conjointly employs physical short-channel MOSFET drain current and threshold voltage rolloff models in tandem with stochastic wiring distributions. This methodology permits a complete evaluation of tradeoffs between saturation drive current and subthreshold leakage current for a prescribed cycle time performance and operating temperature range. Constraints imposed by NTRS-projected package heat removal coefficients permit local clock rates to apply only within a macrocell whose size and total number are calculated using the stochastic distribution. Limits on the performance of CMOS logic circuits are shown to be imposed by total power dissipation, which increases exponentially with clock frequency. Optimum supply voltages, device threshold voltages, and device channel widths corresponding to minimum total power are calculated out to the year 2012 for local and global critical paths. These projections are consistent with technology and cycle time forecasts by the NTRS. Limits on the cycle time performance imposed by power dissipation are projected for the same period. Concurrency-driven voltage scaling is projected to yield decreasing percentage reductions in total power.
APPENDIX A
See Table VI.
APPENDIX B
DERIVATION OF (25)
This derivation makes two key assumptions (Fig. 16) in addition to the requirement of equal rise and fall times.
1) The output waveform in response to a step input can be approximated as a straight line whose slope equals the slope of the waveform at half the transition.
2) The waveform at each node in the chain of inverters has a slope equal to that of this straight line.
The time delay (24) between the input and output waveforms reaching the half-transition point is given as the sum of the following components:
The slope of the output waveform at the half-transition point in Fig. 16 is calculated using (B2), and the waveform approximated as a straight line with this slope has a base of width given by (B3).
The time delay defined as the time taken for the output waveform to reach the half-transition point once the input waveform has completed its transition requires calculation of the output voltage at the time when the input completes its transition.
To calculate this output voltage as the input moves from zero to its final value, (B4) is integrated to give (B5).
The drain current of a MOSFET in the saturation region has a dependence on the gate voltage that lies between linear and quadratic because of velocity saturation, so integrating the right-hand side (RHS) of (B5) exactly would increase the complexity of the resulting expression substantially. Instead, the integral in (B5) can be approximated as the sum of the areas of a triangle and two trapezoids whose bases are equal, as shown in Fig. 17. The area under the drain current curve in Fig. 17 is then given by the sum of the areas of the solid figures that approximate it, each expressed in terms of the drain saturation current at a particular gate input voltage:
1) the area of the triangle;
2) the area of the middle trapezoid;
3) the area of the right trapezoid.
The total area is the sum of these three areas.
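The accuracy of a triangle-plus-two-trapezoids approximation of this kind can be checked with a short sketch. It rests on assumptions that are not taken from the derivation above: the three figures are assumed to span the gate input range from the threshold voltage to the supply voltage in three equal bases, and the saturation current is assumed to follow a power law in gate overdrive with a hypothetical coefficient `beta` and exponent `alpha` between 1 (linear) and 2 (quadratic).

```python
# Illustrative sketch only: the power-law current and the integration limits
# (threshold voltage to supply voltage) are assumptions.
def drain_current(v_gate, v_t=0.3, beta=1e-3, alpha=1.3):
    """Assumed saturation current: beta * (v_gate - v_t)**alpha above threshold."""
    overdrive = max(v_gate - v_t, 0.0)
    return beta * overdrive ** alpha


def three_piece_area(v_dd, v_t=0.3, beta=1e-3, alpha=1.3):
    """Area under the current curve from one triangle and two trapezoids."""
    base = (v_dd - v_t) / 3.0                      # three equal bases
    i1 = drain_current(v_t + base, v_t, beta, alpha)
    i2 = drain_current(v_t + 2.0 * base, v_t, beta, alpha)
    i3 = drain_current(v_dd, v_t, beta, alpha)
    triangle = 0.5 * base * i1                     # current rises from zero
    middle = 0.5 * base * (i1 + i2)                # middle trapezoid
    right = 0.5 * base * (i2 + i3)                 # right trapezoid
    return triangle + middle + right


def exact_area(v_dd, v_t=0.3, beta=1e-3, alpha=1.3):
    """Closed-form integral of the assumed power-law current."""
    return beta * (v_dd - v_t) ** (alpha + 1.0) / (alpha + 1.0)


if __name__ == "__main__":
    for alpha in (1.0, 1.3, 2.0):
        approx = three_piece_area(1.0, alpha=alpha)
        exact = exact_area(1.0, alpha=alpha)
        print(f"alpha={alpha}: relative error = {(approx - exact) / exact:.4f}")
```

Under these assumptions the approximation is exact for a linear current, overestimates the area by a few percent for intermediate exponents, and is worst for the purely quadratic case, since trapezoids lie above a convex curve.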
