I. INTRODUCTION
Because of increased circuit speed and density, power consumption in CMOS VLSI chips is becoming increasingly important. It is therefore necessary to understand the power consumption behavior of a chip in detail: What controls the power consumption in a digital CMOS VLSI chip? Which part of the power consumption is dominant? How can the total power of a chip, or the power consumed in its different parts, be estimated? Finally, how can power consumption be reduced during the early design phase?
Many power simulators have been developed at the gate level to analyze power consumption with statistical methods or with specific stimulus vectors. Because they are based on a netlist, gate-level simulation tools are not suitable for a feasibility study or other early estimations. Other tools can be used for a feasibility study or for estimation before circuit design. Powell and Chau, for example, have developed a power consumption model for a class of digital signal processors [1]. Other tools, such as those given in [2], [3], are based on cell and gate counts. Yet the power consumption has not been analyzed separately for different parts of the chip, such as clocking power or the power consumed by on-chip memory.
Therefore, the motivation of this paper is to get a general estimation method for feasibility studies and to investigate the distribution of power between different parts of a chip, to be used as guidelines for early design.
We address these problems in this paper. We divide the power consumption into five parts: logic circuit, clock distribution, memory, interconnection, and off-chip driving, and we compare the power consumption of cell library design, gate array design, and full custom design. In the logic circuit part, we discuss static and domino logic latched by different latches, and how the power consumption is influenced by logic depth and clock distribution. In the memory part, we discuss the power consumed in different parts of the memory block. In the off-chip driving part, we discuss the maximum off-chip driving power under certain off-chip technologies.
II. ESTIMATION MODELING

A. Model of Logic
We will discuss three kinds of logic circuit styles. The first is buffered static logic latched by static master-slave D flip-flops. The second is domino logic latched by dynamic master-slave D flip-flops. The third is domino logic latched by simple dynamic latches. We define the average logic gate function as a three-input AND (buffered NAND) gate connected to three identical AND gates at the output node. Thus, the average fan-in and fan-out is 3. We simulate random logic circuits with a logic depth of fld, as in Fig. 1(a). Logic gates are shown in Figs. 1(b) and (c), and latches are shown in Fig. 2. All NMOST's in logic gates are of minimum gate width, and the PMOST's in logic gates are twice as wide as the NMOST's. Ctr is the gate capacitance of a minimum size NMOST, and we assume that Ctr is also the drain diffusion capacitance of a minimum size NMOST. Therefore, the gate capacitance and the diffusion capacitance of a PMOST are both 2Ctr. When estimating the power consumption of logic gates, we analyze each gate in the environment of the cell of Fig. 1(a). The cell includes combinational logic followed by a latch. The power consumption of the latch is divided into fld pieces, and one piece is added to each gate in the cell.
Because the average node duty factor fd (node activity ratio) differs between clocked and unclocked nodes, different duty factors are used to sum up the equivalent capacitances in logic gates for power consumption estimation. For every logic gate in Fig. 1(a), the total equivalent capacitance for power estimation is divided into three parts. The second and third parts are discussed later; the first part, in (1), is the logic gate capacitance excluding clock-driven nodes, where fg is the average fan-in and fan-out and fd is the duty factor.
IC1, the input circuit structure factor, is the number of minimum size gate capacitances on one input node, and IC2, the inverter buffer node factor, is the number of minimum size gate or diffusion capacitances on the buffer input node. The third term of (1) is the equivalent capacitance at the output node of a logic gate. Because the loads of a gate are input nodes of other gates, their power consumption is counted in the following gate. Therefore, the loads of the output nodes are not included. kg is the factor used to calculate the equivalent capacitance on the non-output nodes 1 to 5.
The second part of the equivalent logic gate capacitance is the capacitance on nodes driven by the clock. To decide the global clock buffer size, this is classified as a load of the clock distribution. The third part of the equivalent logic gate capacitance is 1/fld of the unclocked capacitances in a latch following the logic gates:
where, k 4 = 6/fld for a dynamic master slave D latch, (3)
Finally, the upper limit of the global clock frequency, fcmax, is given by (4) [3]. fcmax is also the system clock frequency used when measuring the maximum system power consumption. Tg is the gate delay and Dc is the chip dimension without memories; the local and average wire lengths will be discussed later. The gate power, excluding clocking, is Pgates = 0.5 fcmax Ctotalg Vdd^2.
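As a minimal sketch of the gate power expression Pgates = 0.5 fcmax Ctotalg Vdd^2, the snippet below evaluates it for illustrative numbers. The function name and the numerical values are assumptions made only for this example and are not taken from the paper.

```python
def gate_power(f_cmax_hz, c_total_g_farad, vdd_volt):
    """Pgates = 0.5 * fcmax * Ctotalg * Vdd^2 (clock-driven nodes excluded)."""
    return 0.5 * f_cmax_hz * c_total_g_farad * vdd_volt ** 2


# Hypothetical operating point: 100 MHz, 1 nF total equivalent gate capacitance.
p_5v = gate_power(100e6, 1e-9, 5.0)
p_3v3 = gate_power(100e6, 1e-9, 3.3)

print(f"Pgates at 5.0 V: {p_5v:.2f} W")
print(f"Pgates at 3.3 V: {p_3v3:.2f} W")
print(f"ratio: {p_3v3 / p_5v:.4f}")  # quadratic dependence on Vdd
```

The quadratic dependence on Vdd is why a supply reduction has a stronger effect than an equal relative reduction of frequency or capacitance.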
B. Model of On-Chip Memory
The purpose of this paper is to give an overview of the power consumption in an ASIC system. Therefore, we do not need a deep description of a specific memory structure or a certain cache in a special system. Instead, we just use a well-accepted memory structure and its power consumption as a typical case.
A typical memory [5] is divided into four parts: the memory cell, the row decoder, the column selection, and the read/write circuits. The memory arrangement is shown in Fig. 3. To compare the power consumption for different row widths, we define the storage array as a 2^n = 2^(n-k) × 2^k matrix with 2^n memory cells, 2^(n-k) rows, and 2^k columns. A typical control circuit of a bit line and a sense amplifier is shown in Fig. 4 [5]. The address-to-row-select decoder and the address-to-column selector are shown in Fig. 5. Just before an operation, is high, both bit and bit-bar are precharged high. During an operation cycle, is on and precharge is off. All memory cells on one row select line are connected to their bit and bit-bar lines because this row line is selected high. But only one sense amplifier is on and sends out its detected low signal through the column select circuit.
Now we begin to model the power consumption of the memory. We model the power consumption of one read or write operation. The first step is to model the power consumed by the 2^k memory cells on a row during one precharge or one evaluation. It is defined as Pmemcell in (6). The power consumption in (6) is the dominant part in memory.
If we define the memory cell as a dm × dm square, the row and column interconnection lengths of the memory matrix are lrow = 2^k dm and lcolumn = 2^(n-k) dm. 2^(n-k) Ctr is the total drain diffusion capacitance on this bit line.
The second step is to model the power consumption of the row decoding part in Fig. 5(a). It is the power consumed in the row decoding matrix. It includes the power from address buffers, row decoders, vertical row decoding lines, and all gates of NMOST's connected to the row decoding line. There are 2(n-k) buffers driving all Ai and Ai-bar lines with 2^(n-k-1) NMOST decoding gates. Thus, the total power consumption for one row decoding is:
(e 2 Ctr + cint lcolumn) Vdd^2, (7)
where the first 1/2 represents a 1/2 probability for a logic value to change on an address line. We define 0.3 as the total address buffer chain power consumption ratio. It means that the total capacitance in the buffer chain is about 0.3 of its total load, because the capacitance ratios along the inverter chain sum to 1/4 + 1/16 + 1/64 + ... ≈ 0.3. n-k means that there are (n-k) Ai and (n-k) Ai-bar address lines. The wire capacitance is cint lcolumn.
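The buffer chain ratio of about 0.3 follows from a geometric series. The short sketch below sums the series for a taper factor of 4 (the width decrease ratio stated for the inverter chain); the function name and the `stages` parameter are assumptions introduced only for illustration.

```python
def buffer_chain_ratio(taper=4, stages=None):
    """Total capacitance of a tapered inverter chain relative to its final load.

    Each stage closer to the input is `taper` times smaller than the next one,
    so the chain adds 1/taper + 1/taper^2 + ... of the load capacitance.
    """
    if stages is None:
        # Closed form of the infinite geometric series.
        return 1.0 / (taper - 1)
    return sum(taper ** -i for i in range(1, stages + 1))


for s in (1, 2, 3):
    print(s, "stages:", buffer_chain_ratio(stages=s))
print("infinite chain:", buffer_chain_ratio())
```

Three stages already give 0.328, and the infinite series converges to 1/3, so 0.3 is a slightly rounded-down value of the limit.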
The third step is to model the power from row driving. It is the power consumed in all loads on one horizontal row line. Because only one row line is active in one operation, the total row driving power is:
where 2^k (2Ctr) is the load from the memory cells on one row line, 2(n-k) Ctr is the load from drain capacitances in the row decoding matrix, and 8(n-k) wint + lrow is the total row select line length.
The fourth step is to model the power consumption of all column select parts. It includes all power used for column selection and the power from one sense amplifier. The power for one column selection includes: 1) power from the address buffer, 2) power from half of all NMOST gates in the column selection matrix, and 3) power from the interconnections in this part. It is modeled as:
where 1.3 is the buffer chain power ratio, and the sum term is the power from half of all gates in the column selection matrix. The power consumed by the sense amplifier and the readout inverter includes static power and dynamic power. The static power is from the differential amplifier, and the dynamic part is from the readout inverter. The power of one memory operation in this part is:
where the first part is static power and the second part is dynamic power. Isens is twice the maximum drain current of a minimum size NMOST. fmem-clock is the memory clock frequency. The second part includes the power from the inverter and the power from all interconnections after the readout inverter.
The memory power in one operation is the total power from the structure in Fig. 3 . It is the sum from (6) to (10).
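The array organisation used throughout this section (2^n cells arranged as 2^(n-k) rows by 2^k columns, with square cells of side dm) can be tabulated with a few lines of code. This is only a sketch of the geometry, not of the power equations (6) to (10); the function name is an assumption, and the example combines the 32 kbit memory size of the ASIC example with a 10 μm cell side purely for illustration.

```python
def memory_array(n, k, d_m):
    """Geometry of a 2^n-cell array with 2^(n-k) rows and 2^k columns.

    d_m is the side of one square memory cell; the lengths are returned
    in the same unit as d_m.
    """
    rows = 2 ** (n - k)
    cols = 2 ** k
    l_row = cols * d_m      # lrow = 2^k * dm
    l_column = rows * d_m   # lcolumn = 2^(n-k) * dm
    return rows, cols, l_row, l_column


# 32 kbit = 2^15 cells; compare three row/column splits (cell side in micrometers).
for k in (6, 7, 8):
    rows, cols, l_row, l_column = memory_array(15, k, 10.0)
    print(f"k={k}: {rows} rows x {cols} columns, "
          f"lrow={l_row} um, lcolumn={l_column} um")
```

Increasing k shortens the bit lines (lcolumn) but lengthens the row line and puts more cells on the selected row.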
C. Model of the Local and Intermediate Interconnections
We define two categories of interconnections in CMOS VLSI: local and intermediate interconnections, and global buses. The local interconnections are the interconnections within a logic gate. The intermediate interconnections are used for connections between gates or subsystems. The global bus includes data, control, and address buses. The interconnection width is assumed to be the minimum wire width, excluding the clock distribution, as the wire RC constant does not change with wire width. Rent's rule gives Np = Kp Ng^β, where Np is the number of external signal connections to a logic block, Ng is the number of logic gates in the block, Kp is a constant, and β is the Rent's rule constant. The local and intermediate interconnection length of a logic gate is:
where x is the average gate pitch ratio and dg is the average gate dimension. Following [3], x is derived from Rent's rule and the assumption of hierarchical layout placement shown in Fig. 6(a) [8].
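For readers unfamiliar with Rent's rule, the sketch below evaluates its usual form, Np = Kp Ng^β, which is assumed here to be the relation the symbols Np, Ng, Kp, and β refer to. The values of Kp and β in the example are hypothetical and are not taken from the paper.

```python
def rent_pins(n_gates, k_p, beta):
    """Rent's rule: external signal connections Np = Kp * Ng^beta."""
    return k_p * n_gates ** beta


# Hypothetical constants for illustration only.
K_P = 2.5
BETA = 0.5

for n_gates in (100, 10_000, 1_000_000):
    print(n_gates, "gates ->", rent_pins(n_gates, K_P, BETA), "connections")
```

Because β is below 1, the number of external connections grows more slowly than the gate count, which is what allows the average wire length to be derived from a hierarchical placement.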
For an interconnection limited chip, the gate dimension is:
where, T, is the average transistor dimension weight factor in a gate. It is determined by the area and number of minimum size NMOST's and PMOST's as well as the metal connection area on the drain/source diffusion in a gate. We is the interconnection factor of a logic gate. When the sum of We and T, is 1, fg x pw/(ew nw) is the gate dimension determined by wiring ability. pw is the wire pitch, ew is the utilization efficiency of the chip interconnections, and nw is the number of wire layers in the chip. This model is suitable for cell library and gate array design. In full custom design, the gate dimension in (15) is limited by transistor area. It is a transistor packing density limited gate.
where, F is feature size and A, is the unit gate area extracted from experimental layouts. The average interconnection length of a gate with a fan out of fg is represented as:
The total average interconnection capacitances are:
where cint is the unit wire capacitance per unit length with a minimum wire width of wint. cint is 2 pF/cm when wint is less than 3 μm.
D. Model of the Global Clock Distribution
Different systems may have different clock distributions. One example of a clock arrangement with low clock skew is an H-tree [3], [9], shown in Fig. 6(b). The H-tree and the clock driver are matched at the source end. Therefore, the width of a clock wire is half its incoming width after each branching point. If the far end clock wire has minimum width, the global clock wire capacitance is:
where, the chip dimension is Dt = , / - where, kdriver is the clock driver ratio. It means that the total clock driver capacitance is kdriver times the clock-driven load. It could be 0.3 for a conventional system. If the system needs very high speed, kdriver could be larger. Lclk is the number of clock-driven transistor gates in a logic gate (in Fig. 1(b), (c)) and in a latch (in Fig. 2). Lclk = 3 + 6/fld for domino logic with dynamic MS D flip-flops, Lclk = 3 + 3/fld for domino logic with simple latch, and Lclk = 12/fld for static logic. The second term is from memory precharge and control. In the term 2^(k+1+1+2), the first 1 is the PMOST's sizing factor, the second 1 means that there are two PMOST's for bit and bit-bar, and 2 means that there are four clocked transistors in the control circuit. The total power consumption from clock distribution is:
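The three expressions for Lclk can be compared directly as a function of logic depth. The sketch below does so, taking the flip-flop term for domino logic with dynamic MS D flip-flops as 6/fld (consistent with the 6/fld factor in (3)); the function and style names are assumptions introduced for illustration.

```python
def clocked_gates_per_logic_gate(style, f_ld):
    """Lclk: clock-driven transistor gates charged to one logic gate.

    The latch contribution is shared by the f_ld gates in front of it.
    """
    if style == "domino_ms_dff":
        return 3 + 6 / f_ld
    if style == "domino_simple_latch":
        return 3 + 3 / f_ld
    if style == "static":
        return 12 / f_ld
    raise ValueError(f"unknown logic style: {style}")


for f_ld in (2, 4, 6, 10, 25):
    row = [clocked_gates_per_logic_gate(s, f_ld)
           for s in ("static", "domino_simple_latch", "domino_ms_dff")]
    print(f_ld, [round(v, 2) for v in row])
```

For static logic the clocked load per gate falls toward zero as the logic depth grows, whereas for domino logic it never drops below 3, since every domino gate is itself clocked.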
E. Model of the Global Bus
The bus consumes power in three parts: the power from the bus wire capacitance, the power from the bus loads, and the power consumed in the bus drivers. The total bus wire capacitance is: Cbuswire = (kbus + wbus)(Dc + Dm) cint, (22) where wbus is the bus width, the number of parallel bits. It may include the width of data and address. kbus is the equivalent number of global control buses. An example is kbus = 3: one for read/write control, one for data/address control, and one for other controls. The bus loads include ALU's, memory access ports, register blocks, inputs of bus drivers, and others. where Ntotal-b-load is the total number of loads on a one-bit bus. 3Ctr is an inverter input capacitance. Though there might be many groups of bus drivers on one common bus, only one group is active at any one time. Thus, the total capacitances of bus drivers are:
where, 0.3 was shown after (7). Finally, the total bus capacitances are the sum of (22), (23) and (24). The total power consumed by bus is given by (25), and the total chip interconnection power Pwire is the sum of (18) 
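A small numerical sketch of the bus wire capacitance in (22) is given below. The parameter names `k_bus` and `d_m` are one reading of the symbols in (22) (the number of control buses and a second chip dimension term, taken here as the memory dimension), which is an assumption; the dimensions in the example are hypothetical, while the 32 bit bus width, the three control buses, and the 2 pF/cm wire capacitance come from the text.

```python
def bus_wire_capacitance(k_bus, w_bus, d_c_cm, d_m_cm, c_int_f_per_cm):
    """Cbuswire = (k_bus + w_bus) * (Dc + Dm) * cint, in farads."""
    return (k_bus + w_bus) * (d_c_cm + d_m_cm) * c_int_f_per_cm


# 32 data/address bits plus 3 control lines, 2 pF/cm wires,
# hypothetical dimensions of 0.5 cm (logic) and 0.2 cm (memory).
c_bus = bus_wire_capacitance(3, 32, 0.5, 0.2, 2e-12)
print(f"Cbuswire = {c_bus * 1e12:.1f} pF")
```

Each bus line is assumed to span the full sum of the two dimensions, so the wire capacitance scales linearly with both the bus width and the chip size.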
F. Model of Off Chip Driving
Off-chip driving power is consumed in two parts. One is the power used to drive the off-chip capacitance, the bonding wires, and the pad capacitance. The other is the power consumed by the driver itself, an inverter driving chain. The first part is not determined by the silicon chip technology. It is determined by the package technology and the printed circuit or multichip technology. We define three kinds of off-chip technologies [4]. One is a traditional package with a traditional printed circuit board (PCB), for which the total off-chip capacitance, Coffchip, is 50 pF. The second is an advanced package with an advanced PCB, for which the total off-chip capacitance is 30 pF. The third is multichip module technology, with a total off-chip capacitance of 10 pF. The width decrease ratio in the inverter chain is 4. Therefore, the total inverter chain capacitance is Coffchipdriver = 0.3Coffchip. The power consumption of off-chip driving is:
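To give a feeling for the three off-chip technologies, the sketch below adds the 0.3 Coffchip driver chain to each load and converts the result to a per-pin power per MHz. The power expression used, 0.5 f C Vdd^2 with the pin switching at the given frequency, is an assumption chosen by analogy with the gate power formula and is not necessarily the paper's off-chip equation; the function names and the 5 V supply in the example are likewise illustrative.

```python
OFF_CHIP_LOAD_PF = {
    "traditional package and PCB": 50.0,
    "advanced package and PCB": 30.0,
    "multichip module": 10.0,
}
DRIVER_RATIO = 0.3  # Coffchipdriver = 0.3 * Coffchip


def switched_capacitance_pf(c_offchip_pf):
    """Off-chip load plus the inverter chain that drives it, in pF."""
    return (1.0 + DRIVER_RATIO) * c_offchip_pf


def pin_power_mw_per_mhz(c_offchip_pf, vdd_volt):
    """Assumed form 0.5 * f * C * Vdd^2, evaluated at f = 1 MHz, in mW."""
    c_farad = switched_capacitance_pf(c_offchip_pf) * 1e-12
    return 0.5 * 1e6 * c_farad * vdd_volt ** 2 * 1e3


for name, c_pf in OFF_CHIP_LOAD_PF.items():
    print(f"{name}: {switched_capacitance_pf(c_pf):.0f} pF per pin, "
          f"{pin_power_mw_per_mhz(c_pf, 5.0):.4f} mW/MHz per pin at 5 V")
```

Under these assumptions the per-pin power scales directly with Coffchip, so moving from a 50 pF board to a 10 pF multichip module cuts it by a factor of five.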
G. Model of Three Kinds of Layout Design Strategies
There are three kinds of layout design strategies: full custom design, cell library design, and gate array design. Because of the difference of the design tools and designer's skill, the same circuit function could be designed much differently by different designers. Under this restriction, we can only define some average situations. For example, we define the full custom design as being designed by an experienced designer, so that the local and intermediate interconnections are minimized, the silicon area is best compacted, and logic gates are reasonably organized. Thus, the gate dimension is limited by transistor dimension and the gate pitch ratio is minimized to a. That 
III. POWER ESTIMATIONS AND DISCUSSIONS
This part is divided into four sections. In the first section, we will verify all estimation models in this paper by comparisons of two real designs. In the second section we will define a conventional ASIC system as an example chip and describe its parameters. In section three, we will give a total power discussion of the example chip to show how the power consumption is distributed in different parts of the chip and to compare the power consumed in full custom design, cell library design, and gate array design. In section four, we will discuss in detail power consumptions of interconnections, clock distribution, on-chip memory, off-chip driving and power versus logic depth based on different system parameters.
A. Verification of Estimation Models
The estimation models developed above are verified in this section by comparison with two sets of published data. The first is the Alpha 21064 microprocessor [14], [15], and the second is the Intel 80386 microprocessor [3]. Comparisons between the parameters from the reference papers and our estimations are given in Table I. In our estimation of the Alpha 21064, the process technology parameters are a feature size of 0.75 μm, a gate oxide thickness of 10.5 nm, a minimum interconnection width of 0.75 μm, an average wire pitch of 2.625 μm, and a memory cell area of 10 × 10 μm. The total clock load is 3.2 nF, driven by a driver chain with a ratio kdriver of 0.37. The system parameters are a supply voltage of 3.3 V, threshold voltages of 0.5 V, an average logic depth of 7, and an average node activity ratio of 0.3. In the estimation of the Intel 80386, the process technology parameters are a feature size of 1.5 μm, a gate oxide thickness of 30 nm, a minimum wire width of 3 μm, and a wire pitch of 6 μm. The system parameters are a supply voltage of 5 V, threshold voltages of 0.7 V, an average logic depth of 25, and an average node activity ratio of 0.3. The estimated system parameters (for example, the maximum system clock frequency, the total power consumption, and the chip dimension) are close to the parameters given in the references. This shows that the models proposed in this paper are reasonably accurate.
B. Parameters
In the following sections, we will estimate power consumption based on an ASIC system example. This example contains 10 000 logic gates and 32 kbits (4 kbytes) of memory. The logic circuit styles are buffered static logic, domino logic latched by dynamic MS D flip-flops, or domino logic latched by simple dynamic latches. The logic depth is 10. Two layers of wires are used in layout. We define the bus width as 32 bits, and the
Estimation results based on the parameters in Section 2 are shown in Fig. 7(a). As fcmax increases with downscaling, the total power will increase with scaling. In Fig. 7(b) we give the system power in mW/MHz. The thick lines give data for a system designed with static logic in cell library design. They are used as references in the following.
We have then studied the power distribution between different parts of the system (logic circuit, interconnections, clock driving, and off-chip driving) in Figs. 8 and 9. In Fig. 8 we compare the power distribution of static logic and dynamic logic based on cell library design. The total power is in Fig. 8(a) and the power without off-chip driving is in Fig. 8(b). In a similar way, we compare the power distribution of three styles (cell library design, gate array design, and full custom design) based on static logic in Fig. 9. Again, data are given both for a complete system and for a system without off-chip drivers. As the off-chip driving power cannot scale down, the off-chip driving power could be up to 70% of the total chip power. The total wire power consumption in a static logic system is larger than that in dynamic logic. This is because of the larger gate dimension and the larger number of PMOST's in static logic. The total wire power consumption in a gate array designed system is about 30% larger than that in a cell library designed system. Using full custom design, the total wire power consumption is just about 1/3 of that in a cell library designed system.
Although not included in the figures, we also found the following:
-Compared to cell library design, the maximum system clock frequency is about 25% lower in gate array design and about 100% larger in full custom design.
-Chip size increased about 19% with gate array design and decreased about 35% with full custom design compared to cell library design.
D. Detail Discussions on Power Consumptions
If we add the logic power, clock power, and total wire power, we find that the logic depth strongly influences the power consumption; see Fig. 10. It is well accepted that static logic consumes more power than dynamic logic [10]. Yet we find in this paper that this is true only when the logic depth is small. When the logic depth is larger than a certain value, dynamic logic consumes more power than static logic. This is because the relative number of clock-driven transistors decreases in static logic as the logic depth increases. The critical logic depth in Fig. 10 for domino logic with simple latches is about 6. If dynamic master-slave D flip-flops are used in dynamic logic, the critical logic depth is only 4. The critical logic depth means that when the logic depth is larger than this value, the dynamic logic will consume more power than static logic. When the logic depth is large, there is not much difference in power consumption between different latches in domino logic. The reason is that the power used for precharge and evaluation in
Based on the same logic style (for example, domino logic with simple latch) and the same logic depth, the difference in clock power consumption between cell library design, gate array design, and full custom design is very small. The reason is that the dominant clock power is consumed in silicon gate loads.
When the memory size is more than 4 kbytes, 80-70% of the memory power is consumed in a row of memory cells driving the bit/bit-bar lines. The power consumption does not scale linearly with the feature size. This is mainly because the unit interconnection capacitances in the memory block do not decrease when the wire width is less than 3 μm. When we use a different number of rows for the same memory size, the power consumption changes by less than 6% of the total memory power. This is because the dominant power, used to drive and precharge the bit/bit-bar lines, is not changed.
We estimate the maximum power consumed by total off-chip driving in Table III. Because this power does not depend on silicon technology but on off-chip technologies, it is not scalable. In Table III, 50 pF, 30 pF, and 10 pF are the total off-chip load capacitances of one bit, % is the percentage of the off-chip power in a chip, mpm is the total off-chip power in mW/MHz, and W is the maximum off-chip power in watts under fcmax. If we use a multichip module, the off-chip power consumption could be greatly reduced, from 62.5% to 25% of the total chip power, when the feature size is 0.25 μm.
IV. CONCLUSION AND RECOMMENDATIONS
We have developed a method and a tool for power modeling of CMOS VLSI chips in this paper. The method makes it possible to estimate the power consumption of a chip based on gate count, memory size, logic, and layout styles. The tool is not as accurate as gate level simulators, but it gives a fast estimation far before circuit and layout design. A few verifications with known chips indicate that the models give reasonable results.
We have further used these models to describe the power consumption of a schematic example, with the goal to find
-The total power increased with downscaling. This is caused by the increased clock frequency and indicates that high-end designs will have more severe power problems in the future. The only way to handle this problem is to reduce supply voltage [6], [11], [MI.
-The power used for off-chip driving is dominant and becomes more dominant with scaling (Table III, Figs. 8(a), 9(d)). Up to 70% of the power may be due to off-chip driving. To reduce power, the most important thing is therefore to reduce the power used for off-chip driving. This can be done by using a more advanced off-chip technology [17], by reducing the off-chip swing [12], [13], or by reducing the external bandwidth of the chip by proper high level partitioning (e.g., by using single chip solutions [1], [3]).
-If off-chip power is excluded, the power related to wires could be up to 46% of the chip power. This share increases with downscaling. A considerable reduction (70%) of wire power can be achieved by using full custom design style.
-Comparing design styles, gate array style uses about 10% more power than cell library style and full custom style uses about 15% less than cell library style (at fixed speed) (Fig. 9). Full custom design utilizes power best as it has a low share of interconnection power (Fig. 9(c)).
-Comparing logic forms we need to consider the sum of logic power, clock power and wire power (Fig. 10) . Static logic uses more power than domino logic for small logical depth. However, for logical depths larger than six static logic uses less power than domino logic.
-The clock power is about twice the logic power for static logic and about three times the logic power for dynamic logic. Therefore, when using domino logic a larger share of the total logic power must be supplied through the clock generator.
