The research community will have to support the development of the above as well as the following:
• Develop low power circuit, logic and architectural techniques and proliferate them to the industry and CAD vendors.
Similarly, the on-package decoupling capacitors can supply a few Amps per nanosecond, requiring only perhaps tenths of Amps per nanosecond to flow through the package pins. The process continues to the printed circuit board power supply, which is now only required to respond in the range of Amps per hundreds of microseconds.
The cost per nanofarad of capacitance has to be managed carefully. Small amounts of capacitance are very cheap to build on the die or on the package, but large amounts can get very expensive. An optimum design uses just the right amount of decoupling at each stage to meet noise target goals. Capacitance on the die and in the package costs money, so we need to design with the right amount and not much more. If enough decoupling capacitance is not used, the on die voltage supply levels will vary too much and there will be yield loss. If design is done with excessive amounts, the die may need to grow or the package may again become so complex that there will be yield loss.
Managing dI/dt is a complex task which will become more and more critical as technology advances because it is scaling in the wrong direction. By developing the tools to manage dI/dt now, it can be prevented from becoming a major factor limiting further advances in circuit integration.
Summary & Recommendations
The need for lower power systems is being driven by many market segments, as outlined in Section 1. There are several approaches to reducing power; however, the highest ROI approach is designing for low power. Unfortunately, designing for low power adds another dimension to the already complex design problem: the design has to be optimized for Power as well as Performance and Area.
Optimizing the three axes necessitates a new class of power conscious CAD tools. The problem is further complicated by the need to optimize the design for power at all design phases, and the power savings at the different design phases are not additive. For example, a 10% power savings from layout techniques together with a 15% power savings from logic synthesis techniques will, more often than not, yield a total power savings of less than 25%, say 20%. In addition, all the evidence points to the fact that the biggest power savings will be derived from transformations and decisions made early in the design process, i.e. from the high level design phase; however, high level design tools are not as mature as the logic and layout level tools. At all phases of the design process the tools can be broadly described as either:
• Analysis tools,
• Optimization tools, or
• Libraries and library management tools.
Initially (in the absence of good optimization tools) the analysis tools will have the highest ROI; that is, they will allow the designer to identify areas of the design that need to be optimized for power. In addition, the analysis and power estimation tools will enjoy a large market because of their applicability to different design domains such as CPU, DSP, etc. To ensure designer acceptance and minimum learning time (on the tool), the analysis and optimization tools should, when possible, support power capabilities in a value-added way.
The tools' problem will have to be attacked on two fronts. On one front, existing and mature CAD packages (commercial and non commercial) should be quickly modified to provide basic support for analysis and optimization, e.g. logic simulation and logic synthesis.
On the second front, bodies such as Universities, the Semiconductor Research Corporation, DARPA and the European Community Esprit consortium should support longer term, basic and applied, pre-competitive research, to reduce power at all levels of the design, but with a special emphasis on higher level design as described in sections 2 and 3. Furthermore it is imperative that the research activity necessarily include work on design and architectures, as well as CAD tools and methodologies, because good high level CAD tools are domain specific. Domain specific design requires tools targeted at specific architectures such as those used in CPU and DSP designs.
The successful development of new power conscious tools and methodologies requires a clear and measurable goal. In this context the research work should strive to reduce power by 3x in three years and 5x in five years through design and tool development. That is, any power reduction through process scaling or voltage scaling should be above and beyond the 3x and 5x goals. To achieve these goals the commercial CAD companies will have to (in one and three year time frames) add to their product portfolio the following tools:
One Year Goals
• LOGIC level power estimation tools, (circuit level power estimation is currently available).
• RTL power estimation
• low power logic synthesis.
• Reliability Verification and Packaging modeling.
• Standard cell libraries characterized for Power, Performance and Area.
may contribute more decoupling capacitance if it is idle than if it is active. The blocks further away from the switching block will have higher parasitic resistance and inductance in the power supplies connecting the blocks. Hence, the effective decoupling capacitance will be lower for these blocks. When all of this is added up, using circuit activity factors to temper the results, one should get a number for the decoupling capacitance which is lower than if neighboring blocks were ignored.
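The bookkeeping described here can be illustrated with a small sketch. It is not a tool from this paper: the function names, the single per-block "derating" factor (a stand-in for the supply resistance and inductance between a neighboring block and the switching block), and the sample numbers are all assumptions made for illustration.

```python
def effective_decap_nf(blocks):
    """Sum the idle capacitance of each block, derated by distance.

    blocks: list of (total_cap_nf, activity_factor, derating) tuples.
    'derating' is a hypothetical factor in [0, 1]; it is 1.0 for the
    switching block itself and smaller for blocks further away.
    """
    total = 0.0
    for cap_nf, activity, derating in blocks:
        idle_cap = cap_nf * (1.0 - activity)  # only non-switching gates decouple
        total += idle_cap * derating
    return total


def extra_decap_needed_nf(required_nf, blocks):
    """Decoupling capacitance that still has to be added explicitly."""
    return max(0.0, required_nf - effective_decap_nf(blocks))


own_block = (2.0, 0.5, 1.0)    # half of the block's gates are switching
neighbor = (4.0, 0.25, 0.5)    # mostly idle, but seen through extra R and L
print(extra_decap_needed_nf(5.0, [own_block]))            # neighbors ignored
print(extra_decap_needed_nf(5.0, [own_block, neighbor]))  # neighbors included
```

Including the neighboring block lowers the amount of capacitance that must be added, which is the effect the text describes.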
Cell Libraries for Decoupling Capacitance Design
Taking this a step further, a set of standard cells or layout design guidelines needs to be accessible to circuit designers and the place and route tools. Standard cells for decoupling capacitors will ensure that the intrinsic resistance of the capacitor is low enough to function properly. A set of guidelines for interleaving the power supply lines, Vcc and Vss, needs to be incorporated into the local and global routing tools. This will ensure that resistance, rather than inductance, will be the dominant component of the impedance of the power supply lines.
The step from dealing with I Average to dealing with dI(t)/dt is big and poses a real challenge for CAD tool developers.
Design for Reliability and Noise Minimization
Preventing reliability and noise effects from impacting the performance of the chip requires contributions from the architecture, circuit design, physical layout, packaging, and board design areas. Placing undue burden on any one of them will result in grotesque solutions, if any.
Architecture
Architectural techniques can be used to tame the sudden changes in current demand. When the device goes into a standby mode, the circuit blocks can be shut down sequentially over two or three clock periods. Similarly, when the device comes out of standby mode, one or two clock periods could be used to start drawing moderate amounts of current again.
When instructions are pipelined, there is advance warning of when various circuit blocks will switch. A central dI/dt bookkeeping unit on the die could estimate in advance the current activities of the die. When the unit predicts a large surge in current, it could partially power up the appropriate circuit blocks in advance to lower dI/dt.
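The benefit of staggering block wake-up over a few clock periods can be seen in a rough sketch. The round-robin grouping, the function name, and the sample currents below are assumptions chosen for illustration; a real bookkeeping unit would work from predicted block activity.

```python
def max_di_dt(block_currents_a, wake_cycles, cycle_ns):
    """Worst-case current step per nanosecond when blocks wake in groups.

    Blocks are assigned round-robin to 'wake_cycles' consecutive clock
    periods (an assumed policy); each group turns on within one period.
    """
    per_cycle = [0.0] * wake_cycles
    for i, amps in enumerate(block_currents_a):
        per_cycle[i % wake_cycles] += amps
    return max(per_cycle) / cycle_ns


blocks = [1.0] * 6  # six blocks drawing 1 A each (illustrative values)
print(max_di_dt(blocks, wake_cycles=1, cycle_ns=10.0))  # all at once
print(max_di_dt(blocks, wake_cycles=3, cycle_ns=10.0))  # spread over 3 periods
```

Spreading the same total current over three periods cuts the largest per-period step to one third in this example.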
Logic Design and Physical Layout
For reliability design, signal and power lines need appropriate sizing to meet EM constraints. Power distribution can be improved to curb IR voltage drops. This can be done with an interweaved comb-like power distribution network. Using different metal layers for global and local power distribution also helps. The global network is best laid out in the topmost low-resistivity metal layer, which should be thicker than the lower metal layers used for local power distribution. The effective resistance can be further reduced by increasing the number of power connections.
To minimize cross-talk, the mutual coupling between signal lines can be minimized by ensuring that no two lines are parallel to each other for longer than a maximum length. Simultaneous switching noise due to voltage fluctuation can be reduced by controlling the turn-on characteristics of drivers, making them turn on slowly rather than sharply.
To control power supply noise in addition to decoupling capacitance, the addition of power and ground pins to minimize the effect of Vcc and Vss inductances has already been discussed. The on die decoupling capacitance should be distributed across the die, and be placed very close to the circuits needing fast charge. If possible, the capacitors should have a clean Vcc/Vss power supply connection. This is to ensure that the capacitor is able to maintain its own nominal value of Vcc-Vss. Circuit block placement could also be optimized to share decoupling capacitance, as long as chip routing is not made worse.
Die, Package and P.C. Board Decoupling Capacitance
Decoupling capacitance should be added on the die, in the package, and on the printed circuit board.
The decoupling capacitance is a good local source for fast charge because high frequency current can come from the local capacitor and not have to come from a more inductive path. The decoupling capacitors are functioning as high frequency filters.
Referring to Figure 14, we see how the high dI/dt demands on the die can be met despite a relatively high inductive Vcc-Vss connection to the printed circuit board power supply. The on die decoupling capacitors have to be high performance capacitors. This means that their own intrinsic inductance and resistance must be small so as not to limit the demand for high frequency charge. They will have to supply charge for current demands in the tens of Amps per nanosecond. With this high frequency current shunted through the capacitor, the bondwires now only have to deal with one or two Amps per nanosecond.
For power networks, often a separate analysis of global and local power is more appropriate. Reliability verification of the local power network can be done as above, assuming constant supply voltages at the block terminals. The global power network RV then becomes an iterative process that begins by using an estimated local power network that is later refined with information from local power RV. Signal reliability verification requires the calculation of trunk and branch currents in addition to node voltages, to check if signal widths are meeting EM limits. In both cases the simulation must be driven by worst-case block activity.
The output calculated in power net and signal RV analysis contains a large volume of data, making it tedious to browse the current density and voltage drop at each location of the power network. For reliability analysis, a graphical display is important for the designer to quickly redesign the power bus layout to meet reliability requirements. It should show the violations, where the current density or voltage drop exceeds the user-specified threshold, and the potential violations, where the current density or voltage drop is near the threshold.
In order to calculate the fluctuations of the power supply voltage levels on the die and its effect on noise, the following is needed:
• the RLC network of Vcc/Vss from the die to the PC board power supply,
• the switching activity of the device,
• the passive Vcc-Vss capacitance of the device and the associated interconnect parasitics,
• and the voltage noise specifications for the device under analysis.
Inductive effects caused by fast switching core logic circuits have not been a problem in the past for digital CMOS circuits. The only inductive effects the designer had to deal with were in the periphery (I/O circuits) and they were easy to deal with because they were very localized, had their own power supply, and the designer knew exactly when they switched.
Today, high dI/dt problems are distributed all over the surface of the die. The designer cannot manually keep track of which circuits in the block switch with every input vector to the block. The on chip inductance is beginning to become significant, further complicating matters. Clearly sophisticated CAD tools are needed to give the designer a chance at managing dI/dt. There are two major parts to this:
dI/dt Extraction Capabilities
While evaluating dI/dt noise, the damping effect of resistance or inductance in the power supply lines is needed to model realistic effects. Existing circuit simulators can provide current waveforms depicting dI/dt, but they do not include resistance or inductance in the power supply lines. While resistance extraction tools have been around for a while, parasitic inductance extraction tools are non-existent. Extracting the inductance requires knowledge of the current flow direction and current flow in neighboring lines, which in turn needs knowledge of the circuit block's operation.
Once parasitic values are extracted, simulations can be run. Unfortunately, when a simulation is performed with Vcc and Vss as variable circuit nodes rather than fixed source voltages, it takes up to 50 times longer to run. Fast simulators that give up some accuracy are therefore needed to handle the cases with parasitic resistance and inductance in the power supply loop.
dI/dt helps determine on-chip die and local decoupling capacitance estimates, and determining this estimate is rapidly gaining importance at higher levels of abstraction in the design stage. Much work is being done today to estimate Iavg at the architecture level, the RTL level, the schematic level, and even the silicon testing level. Iavg is not easy to calculate, and dI/dt is even more difficult, since a profile over time is needed. One of the ways to do this at the RTL/schematic level is to generate a switching activity/toggle profile over time.
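As a rough illustration of how a toggle profile leads to a dI/dt estimate, the sketch below converts per-cycle toggle counts into an average current per cycle and then takes the largest cycle-to-cycle change. The lumped capacitance per toggle, the function names, and the sample numbers are assumptions, and the one-cycle resolution is much coarser than a circuit-level waveform.

```python
def current_profile(toggles_per_cycle, cap_per_toggle_f, vdd, cycle_s):
    """Average supply current in each clock cycle.

    The charge moved in a cycle is taken as N * C * V, where C is an
    assumed lumped capacitance per toggle; dividing by the cycle time
    gives the average current for that cycle.
    """
    return [n * cap_per_toggle_f * vdd / cycle_s for n in toggles_per_cycle]


def worst_di_dt(currents_a, cycle_s):
    """Largest cycle-to-cycle current change, in Amps per second."""
    steps = [abs(b - a) / cycle_s for a, b in zip(currents_a, currents_a[1:])]
    return max(steps) if steps else 0.0


# Idle, idle, burst of activity, then light load (illustrative toggle counts).
profile = current_profile([100, 100, 5000, 200], 10e-15, 3.3, 10e-9)
print(profile)
print(worst_di_dt(profile, 10e-9))
```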
Effective Decoupling Capacitance Calculation
In addition to dI/dt, other information is needed: such as the effective capacitance of the block under consideration and the percentage of the gates that are not switching. Basically, we need to know how much capacitance is being charged and discharged and how much is just sitting there idle, able to serve as decoupling capacitance. Just as power estimates become more accurate as we move from the architectural level to the circuit level, and eventually the silicon level, we expect our decoupling capacitance estimates to improve in accuracy also.
Just as the total power of a device is not equal to the sum of the worst case power of all of the circuit blocks, so it is with decoupling capacitance. Tools that calculate the effective decoupling capacitance of a circuit block must also be able to calculate the effective decoupling capacitance of neighboring blocks. A neighboring block
The peaks and valleys of Vcc-Vss can have performance and reliability implications. Timing slowdown may occur when Vcc-Vss is at a minimum. Timing skews may arise from some circuits speeding up at high Vcc-Vss and others, which switch at a different time in the clock period, may see a low Vcc-Vss and slow down. Hot electron operating limits or gate oxide stress limits may be exceeded during the Vcc-Vss peaks, leading to reliability failures.
The magnitude of the oscillations is a function of the power supply inductance, LVcc and LVss; the Vcc-Vss die capacitance, CDie; the power supply resistance; and the severity of the current demand from switching circuits, Max(dI/dt). In order to reduce the peaks, the package is supplied with as many Vcc and Vss pins as possible to reduce LVcc-Vss. Decoupling capacitance is added to the die and on the package so that the highest frequency components of dI/dt do not need to be supplied by a highly inductive off package path. Various architectural techniques to limit dI/dt can be attempted, but the circuits cannot be slowed down, since performance will be affected.
Low power design introduces its own set of problems. An ideal low power design would result in low values of Iavg and dI/dt. All units on the die would use small currents when active and very little current when inactive. Low power designs for microprocessors typically result in
• reducing the maximum current peaks moderately,
• reducing the time spent at peak levels greatly,
• causing very low values of current when the device is carrying out "easy" tasks or is in standby mode.
That is, the current delivered tracks the MIPS required on a real time basis. This means that the dI/dt is going to get worse. Power conscious designs in general will go through periods of inactivity, such as standby mode or sleep mode, followed by intense periods of activity, followed again by periods of inactivity. Figure 13 shows the differences in current demand between using and not using low power design techniques. Notice how the current differences between peak operation and idle operation are larger in the design using power savings techniques.
One other difficulty posed by low power design is that current activity may be very localized in space. There is naturally occurring decoupling capacitance spread uniformly across the die in the form of N-well to substrate capacitance, non-switching circuit blocks, and other capacitive parasitics. When the current activity is concentrated in one area of the chip, only the naturally occurring decoupling capacitance close to that circuit will function efficiently; the switching circuit will not be able to "see" the capacitance at the far end of the chip.
One final area of concern is in the use of low voltage to achieve low power. Although low power supply voltages help lower the power consumed, higher transistor counts and higher frequency rates usually keep ICC relatively high. Lower Vcc usually means maintaining a lower absolute value of voltage noise. Considering I*R drop across the die, power supply guardbands and tester guardbands, very little margin is left for the on die power supply oscillations. Since the dI/dt usually remains fairly high, large values of decoupling capacitance are needed.
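A first-order feel for why a lower Vcc calls for more decoupling capacitance comes from the charge balance Q = I·t = C·ΔV. The sketch below assumes the local capacitor alone supplies a constant current step for a short interval; the function name, the 5% droop budget, and the sample values are illustrative assumptions, not figures from this paper.

```python
def required_decap_farads(current_step_a, duration_s, allowed_droop_v):
    """Capacitance needed so that supplying I for t drops Vcc-Vss by at most dV.

    Uses C = I * t / dV, i.e. the capacitor is assumed to be the only
    source of charge during the interval.
    """
    if allowed_droop_v <= 0:
        raise ValueError("allowed droop must be positive")
    return current_step_a * duration_s / allowed_droop_v


# Same 10 A, 1 ns current step; droop budget assumed to be 5% of Vcc.
for vcc in (5.0, 3.3):
    cap = required_decap_farads(10.0, 1e-9, 0.05 * vcc)
    print(vcc, "V supply:", cap * 1e9, "nF")
```

With the current step held constant, the smaller absolute noise budget at 3.3 V requires more capacitance than at 5 V.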
Reliability Verification and Noise Analysis
In order to meet EM constraints and IR drop limits, current density and voltage drop information needs to be extracted from the physical layout. The following is needed to accomplish this task:
1. Simulate the transistor circuit to estimate the current flowing into and out of each power bus via.
2. Accurately extract and model the physical layout of power buses into an RLC network.
3. Simulate the power network to find the current density and voltage at each location of the power network.
4. Display the current density and voltage drop information in the physical layout.
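The flow above can be sketched for the simplest possible case: a single power rail fed from one end, modeled with resistance only and driven by average tap currents. This is an assumed toy model (no inductance, no capacitance, no mesh), and the function names and limits are hypothetical; it only shows how voltage drop and current density are obtained and then flagged against thresholds.

```python
def rail_analysis(tap_currents_a, seg_res_ohm, vdd, wire_area_um2):
    """Node voltages and segment current densities along a one-ended rail.

    Segment k carries the current of tap k and every tap beyond it.
    """
    n = len(tap_currents_a)
    seg_currents = [sum(tap_currents_a[k:]) for k in range(n)]
    voltages = []
    v = vdd
    for i_seg in seg_currents:
        v -= i_seg * seg_res_ohm  # IR drop across this segment
        voltages.append(v)
    densities = [i / wire_area_um2 for i in seg_currents]  # A per um^2
    return voltages, densities


def violations(voltages, densities, vdd, max_drop_v, max_density):
    """Indices of rail locations exceeding the IR drop or EM threshold."""
    return [k for k, (v, j) in enumerate(zip(voltages, densities))
            if vdd - v > max_drop_v or j > max_density]


volts, dens = rail_analysis([0.01, 0.01, 0.01], 1.0, 3.3, 1.0)
print(volts)
print(violations(volts, dens, 3.3, max_drop_v=0.04, max_density=0.05))
```

A full-chip tool replaces the running sum with a matrix solve over the extracted network, but the reporting step has the same shape.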
Techniques for circuit simulation (to extract Iavg and Ipeak) have been introduced in earlier sections. The simulation of an RLC network is a classic problem in circuit analysis that requires circuit formulation and matrix computation [Chua]. It may, however, be the bottleneck in the RV process due to the size of the power network, where millions of parasitic elements are common for today's VLSI chips. Moreover, a matrix computation is needed for each timestep in the time-domain transient simulation. Simplified methods can speed up the RLC network simulation by (1) reducing the massive RLC network to a smaller size; or (2) limiting the number of power network analyses. In the former case, parasitic elements are lumped into a small number of elements. For example, a wire modeled by a distributed RC is lumped into a simplified π-structure. The second method performs the power network analysis only once for a pre-specified time interval, such that the average, rms, or peak current of each transistor over this interval is applied to the power network. In the extreme case, only one matrix computation is needed to estimate the average current density and voltage drop. The speedup from these simplifications, however, is achieved at the expense of degraded accuracy.
design or optimization for low power, it is essential to know where the power is being dissipated and how much of it is dissipated. The sources of power dissipation in a design are as follows:
where $P_t$ is the activity factor (i.e., the probability of a logic transition), $V$ is the voltage swing ($= V_{dd}$), and $I_{tsc}$ is the transient short circuit current, which is the direct current through the series P and N transistors in a CMOS network during a logic transition (i.e., transition time $= t_t$); $I_{bias}$ is the D.C. current through sense amps and ratioed loads, and $I_{leakage}$ is the circuit leakage current.
During search, the value of these equations is computed at run-time and compared with the given constraints. Special algorithms using set theory to optimize the query time under multiple constraints have been developed. Such a cell library and database are used during low power design to meet the given budgets on area, performance and power consumption. As compared to a relational database, the proposed object-oriented cell library manager reduces the search time for an appropriate cell, with m constraints among n cells, from $O(n^m)$ to $O(m \log n)$. A prototype implementation with 5000 library cells, with 92 attributes per cell, took only 8.6 seconds to satisfy power constraints with a matching cell.
Reliability and Packaging Issues in Low Power Design
The primary focus in the earlier sections has been on analysis and optimization of full-chip power and average current across different stages of the design hierarchy. Orthogonal to these issues is the task of reliability verification and on-chip noise management for package design, which are impacted by dI(t)/dt in addition to Iavg. Reliability verification addresses electromigration (EM) and the IR voltage drops on signal and power supply lines, while noise management tries to understand the effect of crosstalk and power supply fluctuations on the Vcc/Vss noise levels and the signal noise margins. Many factors have aggravated this RV and noise problem: faster transistors, higher current levels, shorter clock cycles, lower supply voltages and power savings techniques. One might have the initial impression that low power design techniques must inherently improve the stability of the on die power supply levels, Vcc and Vss. As we will see shortly, this is not always the case. In this section, we begin by discussing the reliability and noise problems caused by high performance/high frequency design and low power design techniques. Then, staying with the theme followed throughout this paper, we discuss analysis capabilities to better estimate these effects and optimization techniques to do better design for reliability and on-chip noise management.
The Problem
High performance and high frequency design with higher levels of on-chip integration has made EM an important problem. EM is an interconnection failure mechanism that can cause opens in signal and power lines due to large current densities coming from faster circuits, thinner wires and increased device count. Another problem caused by high average current (Iavg) is the higher IR voltage drops along the power lines. This can degrade the noise margin and can cause glitches. With neighboring signal lines lying closer to each other on the die, the mutual coupling capacitance between them increases. This along with rapid switching on these lines can cause cross-talk leading to timing slowdowns and inadvertent logic transition faults.
Power supply noise levels due to increased dI/dt in high performance circuits are also becoming a growing concern. Given an initial stimulus, Vcc and Vss will try to oscillate 180 degrees out of phase at their natural ringing frequency, $f = 1/(2\pi\sqrt{LC})$, where C is the Vcc-Vss capacitance and L is the total power supply loop inductance.
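For a sense of scale, the ringing frequency can be evaluated with the standard LC resonance relation f = 1/(2π√(LC)). The inductance and capacitance values below are illustrative assumptions, not measurements from this paper.

```python
import math


def ringing_frequency_hz(loop_inductance_h, die_capacitance_f):
    """Natural frequency of the supply loop: f = 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(loop_inductance_h * die_capacitance_f))


# Assumed values: 1 nH loop inductance with 1 nF and then 10 nF of Vcc-Vss capacitance.
print(ringing_frequency_hz(1e-9, 1e-9) / 1e6, "MHz")
print(ringing_frequency_hz(1e-9, 10e-9) / 1e6, "MHz")
```

Adding capacitance or removing inductance moves the resonance, which is one reason the pin count and the decoupling capacitance have to be chosen together.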
presents a combined wiresizing and driver sizing approach which reduces the interconnect delay with only a small increase in the power dissipation. Experimental results show that for the same delay constraint, this approach reduces the power by about 10% when compared to the conventional method of driver sizing only. Alternatively, this approach produces delay values which are up to 40% lower when compared to the conventional method (at the cost of increasing the power dissipation by 25%).
Clock Tree Generation
Clock is the fastest and most heavily loaded net in a digital system. Power dissipation of the clock net contributes a large fraction of the total power consumption.
[104] describes a two-level clock distribution scheme based on area pad technology for MCMs. The first level of the tree is routed on the MCM substrate connecting the clock source to the clock area pads, while the second level tree lies inside each die with the area pads as the source. The objective is to minimize the load on the clock drivers subject to meeting a tolerable clock skew. A significant power reduction (70% for one benchmark circuit) over the method with one clock pad per die is reported for this scheme.
Libraries

Scope of the Problem:
As VLSI layout design dimensions continue to shrink, device sizes tend to shrink faster than the routing distances. For future designs, this means that a significant part of the system power will be due to layout parasitics that are not well modelled at the logic and circuit levels. The conflicting needs for high design productivity, low power consumption, high layout density and higher performance goals are difficult to meet without a pre-characterized library. Existing design methodologies map high level requirements to the cells in a library early in the design cycle. This causes some selection decisions to be made before all the constraints are fully known, e.g., the pin positions on the neighboring cells. The result is mismatched cell views during the layout phase and consequently longer interconnect wires. Another issue is the growing size of the library, as the search time for a cell, with m constraints among n cells in a relational database using separately sorted tables of attribute values, is $O(n^m)$.
Solution:
One possible solution is to overlap the design steps so that the estimates done at the higher levels are more realistic. The trade-off is in the increased complexity of the estimation models and the effort to complete part of the layout design before the logic mapping is fully done. An approach is to use a library of cells with views at every design abstraction level to aid the decision making process. The cell information is stored in a database which is queried at each design phase. At the lowest level of the hierarchy, various cells are stored as objects with attributes, and at higher levels they are grouped by functionality. The proposed solution uses the concept of delayed binding in selecting the layout view of a library cell identified during technology mapping. Each layout view is designed without any pins on the boundary of the cell, and multiple connection flexibility is available after final placement of cell instances during the layout phase. A cell in the datapath library may have multiple levels of abstraction (i.e., behavioral, logic, circuit and layout), and multiple views at each level, as shown in Fig. 9. For example, an adder at the behavioral level is a mere "+", but at the circuit level the selection can be made between carry look-ahead, ripple carry, and other styles.
After an individual library cell design is completed, layout parasitics are extracted with simulated routing placed over the cell, and this information is used to derive characteristic model equations for current and power consumption in terms of the actual load being driven, voltage, toggle rate, as well as the slope of the input signal. These electrical equations, along with behavioral models and attributes such as area and pin-to-pin delays, are stored in an object-oriented database. The database implements search for equivalent functionality cells, and a designer can specify constraints on cell performance to limit the search space. Since all the performance constraints may not be known at the beginning (e.g., device strengths are not determined until the circuit phase), a group of cells is identified during the logic phase for each component. As more details of a design become known, the size of the mapped group for each design component becomes smaller. If no corresponding cell is available as a new constraint is applied, that constraint is relaxed and a neighboring cell's constraint is tightened.
If more than one cell in the library matches the final set of constraints, an objective function is computed over each selected cell (e.g., minimize power/area), and the cells are then sorted accordingly to select the most desirable candidates. If no cell meets the given constraints, then the closest cell is found by relaxing the design constraints. This is done by evaluating the objective function over each cell that meets the boolean function criterion and then ordering all the cells. In order to do any kind of
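The narrowing of a candidate group as constraints become known can be sketched with a plain linear scan. This is not the object-oriented library manager described here and does not reproduce its search complexity; the cell records, attribute names, and helper functions are hypothetical and exist only to show the selection logic.

```python
def select_cells(cells, function, constraints):
    """Keep cells with the requested function that meet every upper bound.

    constraints: dict mapping an attribute name to its maximum allowed value.
    """
    return [c for c in cells
            if c["function"] == function
            and all(c[attr] <= limit for attr, limit in constraints.items())]


def best_cell(candidates, objective):
    """Pick the candidate that minimizes the objective, or None if empty."""
    return min(candidates, key=objective) if candidates else None


library = [  # hypothetical cells and attribute values
    {"name": "add_ripple", "function": "add", "delay_ns": 8.0, "power_mw": 1.0, "area": 100},
    {"name": "add_cla", "function": "add", "delay_ns": 3.0, "power_mw": 2.5, "area": 220},
    {"name": "add_csel", "function": "add", "delay_ns": 4.0, "power_mw": 2.0, "area": 180},
    {"name": "mul_array", "function": "mul", "delay_ns": 12.0, "power_mw": 9.0, "area": 900},
]

group = select_cells(library, "add", {})                  # logic phase: function only
group = select_cells(group, "add", {"delay_ns": 5.0})     # later: timing is known
print(best_cell(group, lambda c: c["power_mw"])["name"])  # rank survivors by power
```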
Physical Level
Physical design fits between the netlist of gates specification and the geometric (mask) representation known as the layout. It provides the automatic layout of circuits minimizing some objective function subject to given constraints. Depending on the target design style (full-custom, standard-cell, gate arrays, FPGAs), the packaging technology (printed circuit boards, multi-chip modules, wafer-scale integration) and the objective function (area, delay, power, reliability), various optimization techniques are used to partition, place, resize and route gates.
Under a zero-delay model, the switching activity of gates remains unchanged during layout optimization, and hence, the only way to reduce power dissipation is to decrease the load on high switching activity gates by proper netlist partitioning and gate placement, gate and wire sizing, transistor reordering, and routing. At the same time, if a real-delay model is used, various layout optimization operations influence the hazard activity in the circuit. This is however a very difficult analysis and optimization problem and requires further research.
Layout Optimization Techniques Circuit Partitioning
Netlist partitioning is key in breaking a complex design into pieces which are subsequently optimized and implemented as separate blocks. In general, the off-block capacitances are much higher than the on-block capacitances (one to two orders of magnitude). It is therefore essential to develop partitioning schemes that keep the high switching activity nets entirely within the same block as much as possible. Techniques based on local neighborhood search (e.g., [63] ) can be easily adapted to do this. In particular, it is adequate to assign net weights based on the switching activity values of the driver gates and then find a minimum cost partitioning solution.
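The cost function implied by this weighting can be written down directly. In the sketch below each net carries the switching activity of its driver, and a partition is charged that weight whenever the net spans more than one block; the data layout and names are assumptions, and the search for the minimum cost partition itself (e.g., by local neighborhood moves) is not shown.

```python
def cut_cost(nets, block_of):
    """Total switching activity of nets that cross block boundaries.

    nets: list of (driver_activity, [gate names on the net]).
    block_of: dict mapping each gate name to its block id.
    """
    cost = 0.0
    for activity, gates in nets:
        if len({block_of[g] for g in gates}) > 1:  # net leaves its block
            cost += activity
    return cost


nets = [(0.9, ["a", "b"]), (0.1, ["b", "c"]), (0.5, ["c", "d"])]
keeps_hot_net_inside = {"a": 0, "b": 0, "c": 1, "d": 1}
cuts_hot_net = {"a": 0, "b": 1, "c": 1, "d": 1}
print(cut_cost(nets, keeps_hot_net_inside))
print(cut_cost(nets, cuts_hot_net))
```

Both partitions cut exactly one net, so an unweighted cut count cannot tell them apart, while the activity-weighted cost prefers the one that keeps the busy net on-block.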
Floorplanning
Floorplanning plays an important role during layout optimization as it determines the interface characteristics (shape, size, I/O locations) and positions of custom or semi-custom blocks in a hierarchical design environment.
[57] describes a floorplanner that considers power management. The idea is to generate a set of power-indexed shape functions and then use implementations for each flexible module that satisfy the timing constraints while minimizing the dynamic power dissipation. In addition, this work considers constraints to mitigate power line noise and thermal reliability problems. Results show an 18% reduction in power and more smoothly distributed power dissipation over the floorplan area compared to conventional floorplanners with the same delay constraint. There is, however, a small area penalty.
Placement and Routing
[101] describes a performance driven placement algorithm for minimizing the power consumption. The problem is formulated as a constrained programming problem and is solved in two phases: global optimization and slot assignment. The objective function used during either phase is the total weighted net length where net weights are calculated as the expected switching activities of gates driving the nets. Constraints on total path delays are also accounted for. On average, this procedure reduces power consumption by about 10% at the expense of 2% increase in circuit delay compared to a placement program minimizing the total interconnection length.
Routing for low power can be performed by net weighting where again the net weights are derived from the switching activity values of the driver gates. The nets with higher weights are more critical and should be given priority during routing.
Gate Sizing
The treatment of gate sizing problem is closely related to finding a macro-model which captures the main features of a complex gate (area, delay, power consumption) through a small number of parameters (typically, width or transconductance). The first major contribution to transistor sizing problem was the work done in TILOS [64] . Their optimization technique is greedy in the sense that they pick a path which fails to meet the timing requirements, and resize some transistor on the path so as to meet the constraint. The procedure is iterated until all timing constraints are satisfied or no further optimization is possible.
Berkelaar [51] proposed a linear programming solution which performs global gate sizing subject to a set of timing constraints. He adopted a simple model for gates in which the delay can be piecewise linearized as a function of the gate sizing parameter. The drawbacks of this model are the omission of the slope factor (input ramp time) of the input waveforms from the delay model and the use of a simple power dissipation model which ignores short-circuit currents. Berkelaar's work should be extended to account for the slope factor and the short-circuit currents.
Wire Sizing
Wiresizing and/or driver sizing are often needed to reduce the interconnect delay on time-critical nets. Wiresizing however tends to increase the load on the driver and hence increase the power dissipation. [59]
true for power estimation under a zero delay model, but not for that under a real delay model.
The extension to a real delay model is considered in [96] . Every point on the power-delay curve of a given node uniquely defines a mapped subnetwork from the circuit inputs up to the node. Again, the idea is to annotate each such point with the probability waveform for the node in the corresponding mapped subnetwork. Using this information, the total power cost (due to steady-state transitions and hazards) of a candidate match can be calculated from the annotated power-delay curves at the inputs of the gate and the power-delay values of the gate itself. The spatial correlations among the input waveforms are captured using the tagging mechanism described previously.
The concept of power-delay curve has been extended to include the area trade-off. Instead of generating a set of (power, delay) values, the trade-off curve consists of a set 
Signal-to-Pin Assignment
It is necessary to construct a standard cell library where several variable-sized versions of a gate are available. These gates are sized with area minimization and input ordering in mind such that they give good delay response without high power dissipation.
In general, library gates have pins that are functionally equivalent which means that inputs can be permuted on those pins without changing function of the gate output. These equivalent pins may have different input pin loads and pin dependent delays. It is well known that the signal to pin assignment in a CMOS logic gate has a sizable impact on the propagation delay through the gate [56] . In particular, under the assumption that the input ramp time is small, the latest arriving signal should be assigned to the transistor near the output terminal of the gate. It is desirable to develop an assignment algorithm which minimizes dynamic power consumption on non-critical paths by signal reordering.
If we ignore the power dissipation due to charging and discharging of internal capacitances, it becomes obvious that high switching activity inputs should be matched with pins that have low input capacitance. However, the internal power dissipation also varies as a function of the switching activities and the pin assignment of the input signals. To find the minimum power pin assignment for a gate g, one must solve a difficult optimization problem [98]. As the number of functionally equivalent pins in a typical semi-custom library is not greater than six, it is feasible to exhaustively enumerate all pin permutations to find the minimum power pin assignment. Alternatively, one can use heuristics. For example, a reasonable heuristic assigns the signal with the largest probability of assuming a controlling value (zero for NMOS and one for PMOS) to the transistor near the output terminal of the gate. Another option is to assign the signal with the earliest transition to a controlling value to the transistor near the output terminal. The rationale is that this transistor will be switched off as often as (or as early as) possible, thus blocking the internal nodes from non-productive charge and discharge events.
Another heuristic [75] assigns the signal with highest switching activity to the input pin with the least capacitance. This is not very effective as in the semi-custom libraries, the difference in pin capacitances for logically equivalent pins is small.
The pin permutation for low power should take place on non-critical gates as it is in general different from the pin permutation for minimum delay.
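Because a gate has at most about six functionally equivalent pins, exhaustive enumeration of pin permutations is cheap. The following is a minimal sketch under a deliberately simplified cost model: it counts only the input-capacitance term (switching activity times pin capacitance) and ignores internal-node power, which the text notes is also significant. The function names and numeric values are hypothetical.

```python
from itertools import permutations

def pin_assignment_cost(signals, pin_caps, activity):
    """Cost of placing signals[k] on pin k: sum of activity * pin capacitance."""
    return sum(activity[s] * pin_caps[k] for k, s in enumerate(signals))

def min_power_pin_assignment(signals, pin_caps, activity):
    """Try every permutation of signals over the equivalent pins."""
    return min(
        permutations(signals),
        key=lambda perm: pin_assignment_cost(perm, pin_caps, activity),
    )

# Hypothetical 3-input gate: pin 0 has the smallest input capacitance.
pin_caps = [1.0, 1.2, 1.5]
activity = {"a": 0.5, "b": 0.1, "c": 0.3}
best = min_power_pin_assignment(["a", "b", "c"], pin_caps, activity)
```

A more faithful cost function would add a term for the internal capacitances as a function of the signal order; only `pin_assignment_cost` would need to change.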
Path Balancing
Balancing path delays reduces hazards/glitches in the circuit which in turn reduces the average power dissipation in the circuit. This can be achieved before technology mapping by selective collapsing and logic decomposition or after technology mapping by delay insertion and pin reordering.
The rationale behind selective collapsing is that by collapsing the fanins of a node into that node, the arrival time at the output of the node can be increased. This is however valid only if the node delay is determined by assuming an AND-OR implementation with the delay of AND and OR gates being a function of the number of inputs to the gate. Logic decomposition can be performed so as to minimize the level difference between the inputs of nodes which are driving high capacitive nodes. The key issue in delay insertion is to use the minimum number of delay elements to achieve the maximum reduction in spurious switching activity. This is a difficult task as delay insertion at some node will not only change the spurious activity at the immediate output of the gate (hopefully, will reduce it), but will affect the spurious activity in the transitive fanout of the node (unfortunately, due to change in output waveforms and sensitizability conditions, the activity may be increased).
Path delays may sometimes be balanced by appropriate signal to pin assignment. This is possible as the delay characteristics of CMOS gates vary as a function of the input pin which is causing a transition at the output.
Technology Decomposition
It is difficult to come up with a decomposed network which will lead to a minimum power implementation after technology mapping, since gate loading and mapping information are unknown at this stage. Nevertheless, it has been observed that a decomposition scheme which minimizes the sum of the switching activities at the internal nodes of the network is a good starting point for power-efficient technology mapping.
Given the switching activity value at each input of a complex node, a procedure for AND decomposition of the node is described in [97] which minimizes the total switching activity in the resulting two-input AND tree under a zero-delay model. The decomposition procedure (which is similar to Huffman's algorithm [69] for constructing a binary tree with minimum average weighted path length) is optimal for dynamic CMOS circuits and produces very good results for static CMOS circuits. The performance-oriented version of the above problem requires that the increase in the height of the decomposed network (compared to the undecomposed network) be bounded. The solution here is similar to Larmore/Hirschberg's algorithm [74] for solving the tree decomposition problem with minimum average weighted path length subject to a height constraint. It is shown that the low power technology decomposition reduces the total switching activity in the networks by 5% over the conventional balanced tree decomposition method. This translates to a 3% reduction in power consumption after technology mapping.
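The Huffman-like construction can be sketched as follows. This is an illustration rather than the procedure of [97]: it assumes independent inputs, a zero-delay model, and an activity measure at each internal AND node that is proportional to the node's signal probability (a simplification in the spirit of the dynamic CMOS case). The function name is hypothetical.

```python
import heapq

def and_decomposition_activity(probs):
    """Greedily build a two-input AND tree, always merging the two
    lowest-probability signals, and return the summed probability
    (used here as a proxy for switching activity) of all internal nodes."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        merged = a * b            # probability that the AND output is one
        total += merged
        heapq.heappush(heap, merged)
    return total

# Hypothetical input signal probabilities for a 4-input AND node.
greedy = and_decomposition_activity([0.9, 0.8, 0.1, 0.2])

# Balanced tree for comparison: (0.9 AND 0.8) AND (0.1 AND 0.2).
balanced = 0.9 * 0.8 + 0.1 * 0.2 + (0.9 * 0.8) * (0.1 * 0.2)
```

For these values the greedy tree keeps every internal node at a low probability, while the balanced tree creates one internal node with probability 0.72.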
Technology Mapping
A successful and efficient solution to the minimum area mapping problem was suggested in [72] and implemented in programs such as DAGON and MIS. The idea is to reduce technology mapping to DAG covering and to approximate DAG covering by a sequence of tree coverings which can be performed optimally using dynamic programming.
The problem of minimizing the average power consumption during the technology dependent phase of logic synthesis is addressed in [97] . This approach consists of two steps. In the first step, power-delay curves (that capture power consumption versus arrival time tradeoffs) at all nodes in the network are computed. In the second step, the mapping solution is generated based on the computed power-delay curves and the required times at the primary outputs. For a NAND-decomposed tree, subject to load calculation errors, this two step approach finds the minimum area mapping satisfying any delay constraint if such a solution exists.
The algorithm is optimal for trees and has polynomial run time on a node-balanced tree. It is easily extended to mapping a network modeled by a directed acyclic graph. Compared to a technology mapper that minimizes the circuit delay, this procedure leads to an average 18% reduction in power consumption at the expense of a 16% increase in area, without any degradation in performance. Figure 9 and Figure 10 compare the results of this power-delay mapper with the area-delay mapper of [58] for the s832 benchmark circuit. From Figure 9, we can see that the power-delay mapper reduces the number of high switching activity nets at the expense of increasing the number of low switching activity nets. From Figure 10, we learn that for the remaining high switching activity nets, the power-delay mapper reduces the average load on the nets. By taking these two steps, this mapper minimizes the total weighted switching activity and hence the total power consumption in the circuit.
Under a real delay model, the dynamic programming based tree mapping algorithm is not guaranteed to find an optimum solution even for a tree. The dynamic programming approach was adopted based on the assumption that the current best solution is derived from the best solutions stored at the fanin nodes of the matching gate. This is
present at their outputs and the probability that this glitching propagates through their transitive fanouts. The power dissipated by the 3-stage pipelined circuits obtained by retiming for low power with a delay constraint is about 8% less than that obtained by retiming for minimum number of flip-flops given a delay constraint.
Pre-Computation Logic
A sequential logic optimization method is described in [48] which is based on selectively precomputing the output logic values of the circuit one clock cycle before they are required, and using the precomputed values to reduce internal switching activity in the succeeding clock cycle. The authors present an automatic method of synthesizing precomputation logic so as to achieve maximal reductions in power dissipation. Reductions of up to 62% in average switching activity and power dissipation are reported, with marginal increases in circuit area and delay.
State Assignment
It is well-known that the state assignment of a finite state machine (FSM) has a significant impact on the area, delay and power consumption of its final logic implementation. In the past, many researchers have addressed the encoding problem for minimum area of two-level or multi-level logic implementations (e.g., [
This problem is equivalent to embedding a weighted graph on a hypercube of minimum (or given) dimensionality, which is an NP-hard problem. In [87], the authors use simulated annealing to solve this problem. The shortcomings of the above approach are: (1) it minimizes the switching on the present state bits without any consideration for the loading on the state bits; (2) it does not account for the power consumption in the resulting two- or multi-level logic realization of the next state logic of the FSM. A state assignment technique that accounts not only for the power consumption at the state bit lines, but also for the power consumption in the combinational logic that implements the next state logic function, is presented in [95]. Experimental results on a large number of benchmark circuits show 10% and 17% power reductions for two-level and multi-level logic implementations, respectively.
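The hypercube-embedding view can be made concrete with a small sketch. It assumes the cost being minimized is the sum, over state transitions, of the transition probability times the Hamming distance between the two state codes (that is, switching on the present state bits only, which is exactly the limitation noted above). Exhaustive search is used here instead of the simulated annealing of [87], and is feasible only for very small machines. All names and transition probabilities are hypothetical.

```python
from itertools import permutations

def hamming(a, b):
    """Number of bit positions in which two integer codes differ."""
    return bin(a ^ b).count("1")

def encoding_cost(trans_prob, code):
    """Expected number of state-bit toggles per transition."""
    return sum(p * hamming(code[s], code[t]) for (s, t), p in trans_prob.items())

def best_encoding(states, trans_prob, n_bits):
    """Exhaustively assign distinct n_bits-wide codes to the states."""
    best_code, best_cost = None, float("inf")
    for codes in permutations(range(2 ** n_bits), len(states)):
        code = dict(zip(states, codes))
        cost = encoding_cost(trans_prob, code)
        if cost < best_cost:
            best_code, best_cost = code, cost
    return best_code, best_cost

# Hypothetical 4-state machine that cycles A -> B -> C -> D -> A.
trans_prob = {("A", "B"): 0.4, ("B", "C"): 0.4, ("C", "D"): 0.1, ("D", "A"): 0.1}
naive_cost = encoding_cost(trans_prob, {"A": 0, "B": 1, "C": 2, "D": 3})
code, cost = best_encoding(["A", "B", "C", "D"], trans_prob, n_bits=2)
```

For this cyclic machine the search finds a Gray-code-like assignment in which every transition flips exactly one state bit.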
Multi-Level Network Optimization
Network don't cares can be used for minimization of nodes in a Boolean network [49]. Once the compatible don't cares are computed for nodes in a network, each node can be optimized for area without any concern that changes in the function of this node might affect the function of primary outputs of the network. In [89], a procedure is introduced where observability don't cares and image projection techniques are used to compute the compatible local don't cares for each node in the network. The compatible local don't care for node n_i is then used to minimize the number of literals in the logic expression for n_i. A multi-level network optimization technique for low power is described in [92]. The difference between their procedure and the procedure in [91] is in the cost function used during the two-level logic minimization. Their cost function minimizes a linear combination of the number of product terms and the weighted switching activity. A major shortcoming of this approach is that it does not consider how changes in the global function of an internal node affect the signal probability (and thus, the switching activity) of nodes in its transitive fanout. In general, changing the global function of an internal node may change the signal probability of all nodes in its transitive fanout such that the increase in power consumption due to these fanout nodes may exceed any local power reduction.
In this context, two types of local don't care conditions can be identified: (1) Function Preserving Don't Care (FPDC) which consists of all points in the local space of the node which never occur and is simply the complement of the range of the primary input space into the space of the local fanins of the node and (2) Function Modifying Don't Care (FMDC) which consists of all points in the local space which do not produce observable outputs. FPDC does not affect the global function of the node while FMDC does. A greedy network optimization procedure based on the above concepts is introduced in [70] which reduces the power dissipation in the combinational circuits by about 8%.
Common Subexpression Extraction
Extraction based on algebraic division (using either kernels [52] or two-literal cubes and double-cube divisors [83]) has proven to be very successful in creating an area-optimized multi-level Boolean network. The kernel extraction procedure is modified in [87] to generate multi-level circuits with low power consumption. The main idea is to calculate a power saving factor for a candidate kernel based on its effect on the change in loading on the circuit lines and the degree of logic sharing. It is assumed that the load capacitance due to each fanout edge is equal to some constant value. Results show 12% reduction in power compared to a minimum-literal network.
probability waveforms at the output of the gate. The correlation between probability waveforms at the inputs is approximated by the correlation between the steady state values of these lines. This approach requires significantly less memory and runs much faster than symbolic simulation, yet achieves very high accuracy, e.g., the average error in aggregate power consumption is about 2%.
Symbolic simulation provides the exact switching activity values under a real delay model. It is, however, very inefficient and impractical except for small circuits. Probabilistic simulation and its tagged variant constitute the best choice for switching activity estimation at the gate level.
Estimation in Sequential Circuits
Recently developed methods for power estimation have primarily focused on combinational logic. The estimates produced by purely combinational methods can greatly differ from those produced by the exact state probability method. Accurate average switching activity estimation for sequential circuits is considerably more difficult than for combinational circuits, because the probability of the circuit being in each of its possible states has to be calculated.
A first attempt at estimating switching activity in logic-level sequential circuits has been presented in [67]. This method can accurately model the correlation between the applied vector pairs, but assumes that the state probabilities are uniform.
Each state S_i in the state transition graph (STG) is associated with a variable prob(S_i) corresponding to the steady-state probability of the machine being in state S_i at t = ∞. Given static probabilities for the primary inputs to the machine, we can compute prob(S_j | S_i), the conditional probability of going from S_i to S_j. For each state S_j, we can write an equation:
$$\mathrm{prob}(S_j) = \sum_{S_i \in FI(S_j)} \mathrm{prob}(S_i)\,\mathrm{prob}(S_j \mid S_i)$$
where FI(S_j) is the set of fanin states of S_j in the STG. Given K states, we obtain K equations, any one of which can be derived from the remaining K-1 equations. We have a final equation:
$$\sum_{j=1}^{K} \mathrm{prob}(S_j) = 1$$
This linear set of K equations can be solved to obtain the different prob(S_j)'s. The Chapman-Kolmogorov method requires the solution of a linear system of equations of size 2^N, where N is the number of flip-flops in the machine. Thus, this method is limited to circuits with ≤ 15 flip-flops, since it requires the explicit consideration of each state in the circuit. A framework for exact and approximate calculation of switching activities in sequential circuits is also described in [99] and [79]. The basic computation step is the solution of a non-linear system of equations of the form:
$$ps_l = f_l(ps_1, ps_2, \ldots, ps_N), \qquad l = 1, \ldots, N$$
where ps_j denotes the state bit probability of the j-th present state bit at the input of the FSM and the f_l's are nonlinear algebraic functions representing the next state functions. The fixed point (or zero) of this system of equations can be found using the Picard-Peano (or Newton-Raphson) iteration. Increasing the number of variables or the number of equations in the above system results in increased accuracy [80]. For a wide variety of examples, it is shown that the approximation scheme is within 1-3% of the exact method, but is orders of magnitude faster for large circuits. Previous sequential switching activity estimation methods can have significantly greater inaccuracies.
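As an illustration of the exact state probability computation, the sketch below solves the steady-state linear system for a small machine. It assumes the conditional probabilities prob(S_j | S_i) are already available as a matrix `P` with `P[i][j] = prob(S_j | S_i)`, and that the state transition graph is strongly connected so that a unique solution exists. One balance equation is replaced by the normalization condition, and exact rational arithmetic is used to avoid round-off. The function name is hypothetical.

```python
from fractions import Fraction

def steady_state(P):
    """Solve prob(S_j) = sum_i prob(S_i) * P[i][j] together with sum_j prob(S_j) = 1."""
    k = len(P)
    # Balance equations for states 0..k-2: prob(S_j) - sum_i prob(S_i) P[i][j] = 0.
    A = [[(1 if i == j else 0) - Fraction(P[i][j]) for i in range(k)]
         for j in range(k - 1)]
    b = [Fraction(0)] * (k - 1)
    # The last (redundant) balance equation is replaced by normalization.
    A.append([Fraction(1)] * k)
    b.append(Fraction(1))
    # Gauss-Jordan elimination with row swaps.
    for col in range(k):
        pivot = next(r for r in range(col, k) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        inv = 1 / A[col][col]
        A[col] = [v * inv for v in A[col]]
        b[col] *= inv
        for r in range(k):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return b

# Hypothetical two-state machine.
P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]
probs = steady_state(P)
```

The system has one unknown per state, which is why the explicit method does not scale beyond a small number of flip-flops.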
Logic Optimization Techniques
Both the switching activity and the capacitive loading can be optimized during logic synthesis. It therefore has more potential for reducing the power dissipation than physical design. On the other hand, less information is available during logic synthesis, and hence, factors such as slew rates, short circuit currents, etc. cannot be captured properly. In the following, we present a number of techniques for power reduction during sequential and combinational logic synthesis which essentially target dynamic power dissipation under a zero-delay or a simple library delay model.
Retiming
In [78] , it is noted that the flip-flop output may make at most one transition when the clock is asserted. Based on this observation, the authors then describe a circuit retiming technique targeting low power dissipation. The technique does not produce the optimal retiming solution as the retiming of a single node can dramatically change the switching activity in a circuit and it is very difficult to predict what this change will be.
The technique heuristically selects a set of nodes with the property that if flip-flops are placed at their outputs, switching activity in the network will be reduced. Nodes are selected based on the amount of glitching that is
in terms of the circuit inputs. A mechanism for propagating the transition probabilities through the circuit is described in [77] which is more efficient, as there is no need to build the global function of each node in terms of the circuit inputs, but is less accurate. The loss is often small while the computational saving is significant.
This work is then extended in [77] to account for spatio-temporal correlations (i.e., spatial correlations between temporally-dependent events). Table 2 and Table 3 give various error measures (compared to exact values obtained from exhaustive binary simulation) for pseudo-random input sequences versus biased input sequences for the f51m benchmark circuit. It can be seen that (1) the error is higher under the biased input sequences; (2) for pseudo-random inputs, the only way to improve the accuracy is by accounting for both spatial and temporal correlations; (3) for biased inputs, accounting for either spatial or temporal correlations improves the accuracy, while the most accurate results are still obtained by considering both spatial and temporal correlations.
The OBDD-based approach is the best choice for signal probability calculation if the OBDD representation of the entire circuit can be constructed. Otherwise, a circuit partitioning scheme which breaks the circuit into blocks for which OBDD representations can be built is recommended. In this case, the correlation coefficients must be calculated and propagated from the circuit inputs toward the circuit outputs in order to improve the accuracy.
These methods only account for steady-state behavior of the circuit and thus ignore hazards and glitches. The next section reviews some techniques that examine the dynamic behavior of the circuit and can thus calculate the power dissipation due to hazards and glitches.
Table 3: Effect of spatio-temporal correlations on switching activity estimation for biased inputs
Estimation under a Real Delay Model
In [67], the exact power estimation of a given combinational logic circuit is carried out by creating a set of symbolic functions such that summing the signal probabilities of the functions corresponds to the average switching activity at a circuit line x in the original combinational circuit. The inputs to the created symbolic functions are the circuit input lines at time instances 0⁻ and ∞. Each function is the exclusive-or of the characteristic functions describing the logic values of x at two consecutive instances. The major disadvantage of this estimation method is its exponential complexity. However, for the circuits that this method is applicable to, the estimates provided by the method can serve as a basis for comparison among different approximation schemes. The concept of a probability waveform is introduced in [54]. This waveform consists of a sequence of transition edges or events over time from the initial steady state (time 0⁻) to the final steady state (time ∞), where each event is annotated with an occurrence probability. The probability waveform of a node is a compact representation of the set of all possible logical waveforms at that node. Given these waveforms, it is straightforward to calculate the switching activity of x, which includes the contribution of hazards and glitches, as the sum of the occurrence probabilities prob(c) of its events c, i.e.:
$$E_x(sw) = \sum_{c \in EL(x)} \mathrm{prob}(c)$$
where EL(x) denotes the list of events on the probability waveform of x. Given such waveforms at the circuit inputs and with some convenient partitioning of the circuit, the authors examine every sub-circuit and derive the corresponding waveforms at the internal circuit nodes. An efficient technique is described in [82] to propagate transition waveforms at the circuit primary inputs up in the circuit and thus estimate the total power consumption (ignoring signal correlations due to reconvergent fanout).
The probabilistic simulation approach is improved in [93] by calculating the statistics of the waveforms and delays more accurately and by considering the signal correlations using the method of [62]. An efficient tagged probabilistic simulation approach is described in [96] that correctly accounts for reconvergent fanout and glitches. The key idea is to break the set of possible logical waveforms at a node n into four groups, each group being characterized by its steady state values (i.e., values at time instances 0⁻ and ∞). Next, each group is combined into a probability waveform with the appropriate steady-state tag. Given the tagged probability waveforms at the input of a simple gate, it is then possible to compute the tagged
cuit. Methods of estimating the (switching) activity factor E_n(sw) at a circuit node n involve estimation of the signal probability prob(n), which is the probability that the signal value at the node is one. Under the assumption that the values applied to each circuit input are temporally independent, we can write:
$$E_n(sw) = 2\,\mathrm{prob}(n)\,\bigl(1 - \mathrm{prob}(n)\bigr)$$
Computing signal probabilities has attracted much attention. Some of the earliest work on computing the signal probabilities in a combinational network is presented in [84]. The authors associate variable names with each of the circuit inputs, representing the signal probabilities of these inputs. Then, for each internal circuit line, they compute algebraic expressions involving these variables. These expressions represent the signal probabilities for these lines. While the algorithm is simple and general, its worst-case time complexity is exponential.
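In [68], a simple algorithm, known as the tree algorithm, is described that produces the exact signal probabilities for a tree network by a post-order evaluation of the following equations on the network, where o is the gate output and i ranges over the gate inputs:
not gate: $$\mathrm{prob}(o) = 1 - \mathrm{prob}(i)$$
and gate: $$\mathrm{prob}(o) = \prod_{i \in \mathrm{inputs}} \mathrm{prob}(i)$$
or gate: $$\mathrm{prob}(o) = 1 - \prod_{i \in \mathrm{inputs}} \bigl(1 - \mathrm{prob}(i)\bigr)$$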
For networks with reconvergent fanout, the tree algorithm however yields approximate values for the signal probabilities.
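A minimal sketch of the tree algorithm is given below. It assumes the network is a tree represented as nested tuples `(op, child, ...)` with input names as leaves, and it uses the standard gate rules for independent inputs (complement for NOT, product for AND, complement of the product of complements for OR). The helper `activity` applies the zero-delay, temporally independent relation E(sw) = 2p(1 - p). The representation and names are hypothetical.

```python
from math import prod

def signal_prob(node, input_probs):
    """Post-order evaluation of signal probabilities on a tree network."""
    if isinstance(node, str):                 # leaf: a circuit input
        return input_probs[node]
    op, *children = node
    p = [signal_prob(c, input_probs) for c in children]
    if op == "not":
        return 1.0 - p[0]
    if op == "and":
        return prod(p)
    if op == "or":
        return 1.0 - prod(1.0 - q for q in p)
    raise ValueError(f"unknown gate type: {op}")

def activity(p):
    """Switching activity under temporal independence."""
    return 2.0 * p * (1.0 - p)

# Hypothetical tree: (a AND b) OR (NOT c), all inputs with probability 0.5.
tree = ("or", ("and", "a", "b"), ("not", "c"))
p_out = signal_prob(tree, {"a": 0.5, "b": 0.5, "c": 0.5})
e_out = activity(p_out)
```

If the same input name appeared under two different branches (reconvergent fanout), the function would still run but would treat the two occurrences as independent, which is the source of the approximation error noted above.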
In [91], a graph-based algorithm is proposed to compute the exact signal probabilities using Shannon's expansion. This algorithm relies on the notion of the supergate of a node n, which is defined as the smallest sub-circuit of the (transitive fanin) cone of n whose inputs are independent, that is, the cones of the inputs are mutually disjoint. The procedure, called the supergate algorithm, essentially requires three steps: 1) identification of the reconvergent nodes, 2) determination of the maximal supergates, and 3) calculation of signal probabilities at the supergate outputs based on Shannon's expansion with respect to all multiple-fanout inputs of the supergate. In the worst case, a supergate may include all the circuit inputs as multiple-fanout inputs. In such cases, the supergate algorithm becomes equivalent to an exhaustive true-value simulation of the supergate.
Approximate algorithms for computing the signal probability or bounding this probability in the presence of reconvergent fanouts are presented in [73] and [88] .
An exact procedure based on Ordered Binary-Decision Diagrams (OBDDs) [53] is given in [81] which is linear in the size of the corresponding function graph (the size of the graph, of course, may be exponential in the number of circuit inputs). The signal probability at the output of a node is calculated by first building an OBDD corresponding to the global function of the node and then performing a postorder traversal of the OBDD using the equation:
$$\mathrm{prob}(f) = \mathrm{prob}(x)\,\mathrm{prob}(f_x) + \mathrm{prob}(\bar{x})\,\mathrm{prob}(f_{\bar{x}})$$
where f_x and f_{\bar{x}} are the cofactors of f with respect to the variable x.
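A procedure is presented in [62] for propagating signal probabilities from the circuit inputs toward the circuit outputs using only pairwise correlations between circuit lines and ignoring higher order correlation terms as follows. The correlation coefficient of i and j is defined as:
$$C(i, j) = \frac{\mathrm{prob}(i \wedge j)}{\mathrm{prob}(i)\,\mathrm{prob}(j)}$$
Let g be a gate with inputs i and j and let the correlation coefficient of i and j be given as C(i, j); then:
and gate: $$\mathrm{prob}(g) = \mathrm{prob}(i)\,\mathrm{prob}(j)\,C(i, j)$$
or gate: $$\mathrm{prob}(g) = \mathrm{prob}(i) + \mathrm{prob}(j) - \mathrm{prob}(i)\,\mathrm{prob}(j)\,C(i, j)$$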
Equations for calculating the correlation coefficient of g and some line m using C(i, m) and C(j, m) are also provided in [62] .
The signal probability of a product term is estimated by breaking down the implicant into a tree of 2-input and gates and then using the above formula to calculate the correlation coefficients of the internal nodes and hence the signal probability at the output. Similarly, the signal probability of a sum term is estimated by breaking down the implicate into a tree of 2-input or gates.
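The effect of a pairwise correlation coefficient can be seen in a small sketch. It assumes the coefficient is defined as the ratio of the joint probability prob(i AND j) to the product prob(i) prob(j), so that a value of one means the two lines are independent. The function names are hypothetical, and the sketch does not reproduce the propagation rules of [62] for the coefficient between a gate output and a third line.

```python
def correlation(p_i, p_j, p_ij):
    """Pairwise correlation coefficient: joint probability over product of marginals."""
    return p_ij / (p_i * p_j)

def and_prob(p_i, p_j, c_ij):
    """Signal probability at the output of a 2-input AND gate."""
    return p_i * p_j * c_ij

def or_prob(p_i, p_j, c_ij):
    """Signal probability at the output of a 2-input OR gate."""
    return p_i + p_j - p_i * p_j * c_ij

# Extreme reconvergent case: both gate inputs carry the same signal a,
# with prob(a) = 0.5, so prob(i AND j) = prob(a) = 0.5.
c = correlation(0.5, 0.5, 0.5)
p_and_corr = and_prob(0.5, 0.5, c)       # correlation-aware result
p_and_indep = and_prob(0.5, 0.5, 1.0)    # what an independence assumption gives
p_or_corr = or_prob(0.5, 0.5, c)
```

With the correlation term, a AND a correctly evaluates to prob(a); assuming independence would underestimate it by a factor of two in this example.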
The temporal correlation between values of some signal x in two successive clock cycles is modeled in [77] and [90] by a time-homogeneous Markov chain which has two states, 0 and 1, and four edges, where each edge ij (i, j = 0, 1) is annotated with the conditional probability prob^{ij}_x that x will go to state j at time t+1 if it is in state i at time t. The transition probability prob(x_{i→j}) is the probability that x is in state i at time t and in state j at time t+1, i.e., prob(x = i) prob^{ij}_x.
The activity factor of line x can be expressed in terms of these transition probabilities as follows:
$$E_x(sw) = \mathrm{prob}(x_{0 \to 1}) + \mathrm{prob}(x_{1 \to 0})$$
The transition probabilities can be computed exactly using the OBDD representation of the logic function of x
Based on the delay model used, the power estimation techniques could account for steady-state transitions (which consume power, but are necessary to perform a computational task) and/or hazards and glitches (which dissipate power without doing any useful computation). It is shown in [50] that although the mean value of the ratio of the hazardous component to the total power dissipation varies significantly with the considered circuits (from 9% to 38%), the hazard/glitch power dissipation cannot be neglected in static CMOS circuits. Indeed, an average of 15-20% of the total power is dissipated in glitching. The glitch power problem is likely to become even more important in future scaled technologies.
In real networks, statistical perturbations of circuit parameters may change the propagation delays and produce changes in the number of transitions because of the appearance or disappearance of hazards. It is therefore useful to determine the change in the signal transition count as a function of these statistical perturbations. Variation of gate delay parameters may change the number of hazards occurring during a transition as well as their duration. For this reason, it is expected that the hazardous component of power dissipation is more sensitive to IC parameter fluctuations than the power strictly required to perform the transition between the initial and final state of each node.
The major difficulty in computing the signal probabilities is the reconvergent nodes. Indeed, if a network consists of simple gates and has no reconvergent fanout stems (or nodes), then the exact signal probabilities can be computed during a single post-order traversal of the network. For networks with reconvergent fanout, the problem is much more difficult.
Circuit-and Switch-Level Simulation
Circuit simulation based techniques ([71], [100]) simulate the circuit with a representative set of input vectors. They are accurate and capable of handling various device models, different circuit design styles, dynamic/precharged logic, tristate drives, latches, flip-flops, etc. Although circuit-level simulators are accurate, flexible and easy to use, they suffer from memory and execution time constraints and are not suitable for large, cell-based designs. In general, it is difficult to generate a compact stimulus vector set to calculate accurate activity factors at the circuit nodes. The size of such a vector set is dependent on the application and the system environment [85].
A Monte Carlo approach for power estimation which alleviates this problem has been proposed in [55] . The convergence time for this approach is quite good when estimating the total power consumption of the circuit. However, when signal probability (or power consumption) values on individual lines of the circuit are required, the convergence rate is not as good.
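The general shape of a Monte Carlo estimator can be sketched as follows. This is an illustration, not the procedure of [55]: it assumes a hypothetical callback `sample(rng)` that simulates one random input-vector pair and returns the observed power (here, simply whether a gate output toggles), and it stops when the half-width of a normal-approximation confidence interval on the mean of the batch means falls below a chosen fraction of the estimate. All parameter names and default values are assumptions.

```python
import random
import statistics

def monte_carlo_mean(sample, batch=30, rel_err=0.05, z=1.96,
                     min_batches=30, max_batches=10000, seed=0):
    """Average `sample` over random batches until the confidence
    interval half-width is within rel_err of the running mean."""
    rng = random.Random(seed)
    batch_means = []
    while len(batch_means) < max_batches:
        batch_means.append(statistics.fmean(sample(rng) for _ in range(batch)))
        n = len(batch_means)
        if n >= min_batches:
            mean = statistics.fmean(batch_means)
            half_width = z * statistics.stdev(batch_means) / n ** 0.5
            if mean > 0 and half_width <= rel_err * mean:
                break
    return statistics.fmean(batch_means), len(batch_means)

def and_gate_toggle(rng):
    """Toy 'power' sample: 1.0 if a 2-input AND output toggles between
    two independent, uniformly random input vectors, else 0.0."""
    a0, b0, a1, b1 = (rng.random() < 0.5 for _ in range(4))
    return float((a0 and b0) != (a1 and b1))

estimate, batches_used = monte_carlo_mean(and_gate_toggle)
```

For the toy gate the estimate should approach 2 x 0.25 x 0.75 = 0.375. Estimating a rarely toggling individual line to the same relative accuracy needs many more batches, which reflects the slower convergence for per-line values mentioned above.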
Switch-level simulation techniques are in general orders of magnitude faster than circuit-level simulation techniques, but are not as accurate or versatile.
PowerMill [60] is a transistor-level power simulator and analyzer which applies an event-driven timing simulation algorithm (based on simplified table-driven device models, circuit partitioning and single-step nonlinear iteration) to increase the speed by two to three orders of magnitude over SPICE. PowerMill gives detailed power information (instantaneous, average and rms current values) as well as the total power consumption (due to steady-state transitions, hazards and glitches, transient short circuit currents, and leakage currents). It also provides diagnostic information (by using both static check and dynamic diagnosis) to identify the hot spots (which consume large amount of power) and the trouble spots (which exhibit excessive leakage or transient short-circuit currents). Finally, it tracks the current density and voltage drop in the power net and identifies reliability problems caused by EM failures, ground bounce and excessive voltage drops.
Entice-Aspen [66] is a power analysis system which raises the level of abstraction for power estimation from the transistor level to the gate level. Aspen computes the circuit activity information using the Entice power characterization data. For each cell in the library, Entice requires a transistor level netlist which is generally obtained from the cell layout using a parameter extraction tool. In addition, a stimulus file is to be supplied where power and timing delay vectors are specified. The set of power vectors discretizes all possible events in which power can be dissipated by the cell. With the relevant parameters set according to the user's specs, a SPICE circuit simulation is invoked to accurately obtain the power dissipation of each vector. During logic simulation, Aspen monitors the transition count of each cell and computes the total power consumption as the sum of the power dissipation for all cells in the power vector path.
Accuracy and efficiency are the key requirements for any power analysis and prediction tool. PowerMill and Entice-Aspen are steps in the right direction as they provide intermediate level simulation that bridges the gap between the circuit-level and switch-level simulation paradigms.
Estimation in Combinational Circuits
Estimation under a Zero Delay Model
Most of the power in CMOS circuits is consumed during charging and discharging of the load capacitance. To estimate the power consumption, one has to calculate the (switching) activity factors of the internal nodes of the cir-
Logic Level
Logic synthesis fits between the register transfer level and the netlist of gates specification. It provides the automatic synthesis of netlists minimizing some objective function subject to various constraints. Example inputs to a logic synthesis system include two-level logic representation, multi-level Boolean networks, finite state machines and technology mapped circuits. Depending on the input specification (combinational versus sequential, synchronous versus asynchronous), the target implementation (two-level versus multi-level, unmapped versus mapped, ASICs versus FPGAs), the objective function (area, delay, power, testability) and the delay models used (zero-delay, unit-delay, unit-fanout delay, or library delay models), different techniques are applied to transform and optimize the original RTL description.
Once various system level, architectural and technological choices are made, it is the switching activity of the logic (weighted by the capacitive loading) that determines the power consumption of a circuit. In this section, a number of techniques for power estimation and minimization during logic synthesis will be presented. The emphasis during power estimation will be on pattern-independent simulation techniques, while the strategy for synthesizing circuits for low power consumption will be to restructure/optimize the circuit to obtain low switching activity factors at nodes which drive large capacitive loads.
Power Estimation Techniques
Sources of Power Dissipation
Power dissipation in CMOS circuits is caused by three sources: the (subthreshold) leakage current which arises from the inversion charge that exists at the gate voltages below the threshold voltage, the short-circuit current which is due to the DC path between the supply rails during output transitions, and the charging and discharging of capacitive loads during logic changes.
The subthreshold current for long channel devices increases linearly with the ratio of the channel width over channel length and decreases nearly exponentially with decreasing $V_{GT} = V_{GS} - V_T$, where $V_{GS}$ is the gate bias and $V_T$ is the threshold voltage. Several hundred millivolts of "off bias" (say, 300-400 mV) typically reduces the subthreshold current to negligible values. With reduced power supply and device threshold voltages, the subthreshold current will however become more pronounced. In addition, at short channel lengths, the subthreshold current also becomes exponentially dependent on drain voltage instead of being independent of $V_{DS}$ (see [65] for a recent analysis). The subthreshold current will remain $10^2$ to $10^5$ times smaller than the "on current" even at submicron device sizes.
The short-circuit power consumption for an inverter gate is proportional to the gain of the inverter, the cubic power of supply voltage minus device threshold, the input rise/fall time, and the operating frequency [102]. The maximum short circuit current flows when there is no load; this current decreases with the load. If gate sizes are selected so that the input and output rise/fall times are about equal, the short-circuit power consumption will be less than 15% of the dynamic power consumption. If, however, design for high performance is taken to the extreme where large gates are used to drive relatively small loads, then there will be a stiff penalty in terms of short-circuit power consumption.
It is widely accepted that the short-circuit and subthreshold currents in CMOS circuits can be made small with proper circuit and device design techniques. The dominant source of power dissipation is thus the charging and discharging of the node capacitances (also referred to as the dynamic power dissipation). For a gate g it is given by:
$$P_g = \frac{1}{2}\,\frac{V_{dd}^2}{T_{cycle}}\left(C^g_{wire} + C^g_{SD} + \sum_{j \in \mathrm{fanout}(g)} C^j_G\right) E_g(sw)$$
where $V_{dd}$ is the supply voltage, $T_{cycle}$ is the clock cycle time, $C^g_{wire}$ is the wiring capacitance of the net driven by g, $C^g_{SD}$ is the source-drain diffusion capacitance of g, $C^j_G$ is the gate capacitance of j, and $E_g(sw)$ is the expected number of transitions at the output of g per clock cycle.
Calculation of E(sw) is difficult as it depends on (1) the input patterns and the sequence in which they are applied, (2) the delay model used and (3) the circuit structure.
Switching activity at the output of a gate depends not only on the switching activities at the inputs and the logic function of the gate, but also on the spatial and temporal dependencies among the gate inputs. For example, consider a two-input AND gate g with independent inputs i and j whose signal probabilities are 1/2; then $E_g(sw) = 3/8$. Now suppose it is known that only patterns 00 and 11 can be applied to the gate inputs and that both patterns are equally likely; then $E_g(sw) = 1/2$. Alternatively, assume that it is known that every 0 applied to input i is immediately followed by a 1 while every 1 applied to input j is immediately followed by a 0; then $E_g(sw) = 4/9$. The second scenario is an example of spatial correlations between gate inputs, while the third illustrates temporal correlations on spatially independent gate inputs.
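The three values quoted for the AND gate can be checked by enumerating all pairs of consecutive input patterns. The sketch below is one way to do so with exact fractions. The helper names are hypothetical, and the third scenario assumes that every transition not forced by the stated constraint is equally likely (probability 1/2), which fixes the stationary probabilities of the two input streams.

```python
from fractions import Fraction as F
from itertools import product

HALF = F(1, 2)


def and_gate_activity(pair_prob):
    """Expected output transitions per cycle of a two-input AND gate.

    pair_prob(i0, j0, i1, j1) is the joint probability of applying
    pattern (i0, j0) in one cycle and (i1, j1) in the next.
    """
    total = F(0)
    for i0, j0, i1, j1 in product((0, 1), repeat=4):
        if (i0 & j0) != (i1 & j1):
            total += pair_prob(i0, j0, i1, j1)
    return total


def independent(i0, j0, i1, j1):
    # Inputs independent in space and time, signal probability 1/2.
    return F(1, 16)


def spatially_correlated(i0, j0, i1, j1):
    # Only patterns 00 and 11 occur, equally likely, independent over time.
    return F(1, 4) if (i0 == j0 and i1 == j1) else F(0)


def markov(stationary, transition):
    """Joint probability of two consecutive values of a two-state chain."""
    return lambda a, b: stationary[a] * transition[a][b]


# Input i: every 0 is followed by a 1 (stationary P(1) = 2/3).
i_chain = markov({0: F(1, 3), 1: F(2, 3)},
                 {0: {0: F(0), 1: F(1)}, 1: {0: HALF, 1: HALF}})
# Input j: every 1 is followed by a 0 (stationary P(1) = 1/3).
j_chain = markov({0: F(2, 3), 1: F(1, 3)},
                 {0: {0: HALF, 1: HALF}, 1: {0: F(1), 1: F(0)}})


def temporally_correlated(i0, j0, i1, j1):
    # Spatially independent inputs, each with its own temporal constraint.
    return i_chain(i0, i1) * j_chain(j0, j1)
```

Calling `and_gate_activity` on the three functions gives 3/8, 1/2 and 4/9 respectively, matching the figures in the text.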
Whenever possible, accessing global, centralized resources (such as memories, busses or ALU's) should be avoided. How some of these goals can be accomplished at the different levels of the architectural synthesis process is discussed below.
Architecture Selection and Partitioning
The choice of the architecture model has a dramatic impact on power dissipation as it determines the amount of concurrency which can be sustained, the amount of multiplexing of a hardware resource, the required clock frequency, the capacitance per instruction, etc. There is a widespread belief that the fully parallel, non-multiplexed architecture yields the lowest possible dissipation with the minimum overhead [26], [27] . Unfortunately, such an architecture precludes programming or requires an unwieldy large area. Deciding on the right architecture for a given task or function is therefore a trade-off between power, area and flexibility.
Similarly, the partitioning of a task over a number of resources can have a substantial impact on the dissipation. For instance, in [38] it was shown that partitioning of the codebook memory in a vector encoder for video compression reduced the power consumption in the memories (which was the dominant factor in this design) by a factor of 8. In [39], it was demonstrated how the memory hierarchy, the caching approach and the cache sizes impact the power cost of accessing the main memory in a microprocessor. A similar analysis was performed in [23].
Due to a lack of meaningful tools, the only recourse available to a designer at present is to perform back-of-the-envelope calculations. The emerging architectural level power estimators will help to make this task more accurate and meaningful by identifying the power bottlenecks in an architecture. For instance, the signal statistics produced by these tools can help to determine if and when power-down of a hardware module is desirable.
Instruction Set and Hardware Selection
Optimizing the instruction set is another means of reducing the power consumption in a processor. As an example, providing a special datapath for often executed instructions reduces the capacitance switched for each execution of that instruction (compared to executing that instruction on a general purpose ALU). This approach is popular in application specific processors, such as modem and voice coder processors [40] .
Another dimension of the power minimization equation is the choice of the hardware module to execute a given instruction. A detailed gate level study of adder and multiplication modules, for instance, revealed that structures such as carry-lookahead adders and Wallace multipliers operate at lower energy levels than the more area effective ripple adder or carry save multiplier [41]. This study was later confirmed by Nagendra et al. [42], where the power-delay products of various adder structures were analyzed at the circuit level. The picture becomes more complex when additional optimizations such as operator pipelining are introduced.
Dealing with all these extra factors requires a profound revision of the hardware selection process in high level or architectural synthesis. The availability of functional level power models is a first step in that direction [43] . A vision on how cell libraries can interface and communicate with synthesis tools is presented in [44] , which proposes an object oriented cell library manager. Based on extensive area, time and power modeling, the library manager helps to select the best fitting cell for a required functionality.
Architecture Synthesis
The architectural synthesis process traditionally consists of three phases: allocation, assignment and scheduling. In a few words, these processes determine how many instances of each resource are needed (allocation), on which resource a computational operation will be performed (assignment) and when it will be executed (scheduling). Traditionally, architectural synthesis attempts to minimize the number of resources needed to perform a task in a given time, or tries to minimize the execution time for a given set of resources. The underlying concept is to optimize resource utilization, under the assumption that an architecture where all resources are kept busy for most of the time is probably also the smallest solution.
This picture is not valid anymore when power minimization is the primary target. The smallest solution is not necessarily the one with the smallest dissipation. Quite the contrary is true. Indeed, it often makes sense to add extra units, if this contributes to minimizing the effective capacitance.
This difference has an important effect on the architectural synthesis techniques. The prime driving function for assignment now becomes the quest towards regularity and locality. Both help to reduce the overhead of interconnect, while simultaneously keeping the signals more correlated. Scheduling can have a similar impact, as it influences the correlation between the signals on the busses, memories and logic operators. This has been illustrated in [23] , where operations were reordered over conditional boundaries. In [30], a "cold scheduling" approach for traditional microprocessor architectures was proposed,
While not specifically targeting power, a number of researchers have already explored some of these issues, and some of the proposed approaches could become of particular interest in power sensitive architectural synthesis ([45], [46], [47]).
as a function of the correlation factor ρ between consecutive samples. ρ=0 means that the consecutive data samples are totally uncorrelated and, hence, correspond to white noise data. In that case, no difference exists between LSB and MSB bits. When a meaningful correlation is present between samples, an important deviation in transition activity between LSB's and MSB's can be observed. The DDT model categorizes a hardware module for both those regions. A power model is automatically generated by applying a predefined set of test patterns (correlated and uncorrelated) to the module using a simulator of choice and extracting the switching capacitance for each of the regions of interest. The resulting power models are accurate to within 15% [35], as illustrated in Figure 7, which compares the power consumption of a logarithmic shifter, obtained using both switch level simulations and the architectural model.
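The dependence of bit-level transition activity on the correlation factor ρ can be reproduced with a small experiment. The sketch below is only an illustration under stated assumptions: the data source is a first-order autoregressive Gaussian process (a modeling choice made here, not taken from the text), samples are quantized to 16-bit two's complement words, and the function name, word width and signal scale are hypothetical.

```python
import math
import random


def bit_transition_probabilities(rho, n_samples=20000, width=16,
                                 sigma=4096.0, seed=0):
    """Per-bit transition probability for a correlated data stream.

    Samples follow x[t] = rho * x[t-1] + sqrt(1 - rho^2) * noise, are
    scaled by sigma, clipped, and stored as two's complement words.
    Index 0 of the returned list is the LSB, index width-1 the sign bit.
    """
    rng = random.Random(seed)
    mask = (1 << width) - 1
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    innovation = math.sqrt(1.0 - rho * rho)
    x = rng.gauss(0.0, 1.0)
    toggles = [0] * width
    prev_word = None
    for _ in range(n_samples):
        value = max(lo, min(hi, int(round(sigma * x))))
        word = value & mask  # two's complement encoding
        if prev_word is not None:
            diff = word ^ prev_word
            for b in range(width):
                toggles[b] += (diff >> b) & 1
        prev_word = word
        x = rho * x + innovation * rng.gauss(0.0, 1.0)
    return [t / (n_samples - 1) for t in toggles]
```

With ρ=0 every bit position toggles with probability close to 1/2, whereas with ρ close to 1 the low-order bits stay near 1/2 while the sign-extension bits toggle far less often, which is the two-region behavior the DDT model exploits.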
Once appropriate models for the modules are available, performing power prediction at the architectural or register transfer level becomes feasible by combining signal statistics, interconnect capacitance and macromodule power models. An example of a tool doing just that is the SPA architectural power analysis tool from U.C. Berkeley [34]. Figure 6 illustrates the potential impact of architectural level power estimation and analysis tools. It compares the projected power for the various hardware resources of two architectures implementing the same Quadrature Mirror Filter (QMF). It is easily observed that the parallel version is far more power effective than the multiplexed one, as it virtually eliminates the bus and multiplexer effective capacitance [26]. This type of data is instrumental when performing power trade-offs and is hard to obtain from lower design hierarchy levels or "intuition".
Power optimization
Similar to the behavioral level, architectural level optimizations can have a dramatic impact on power dissipation. Researchers have reported orders of magnitude in dissipation reduction by picking the proper architectural choice [3], [26], [27].
Although architectural level synthesis has received considerable attention in the last decade [37], virtually none of the efforts in this domain has addressed the power dimension. Some initial results have been reported in [7], [9], [22] and [23]. In lieu of analyzing existing approaches, it is worth discussing where and how architectural synthesis can affect dissipation. Similar to the behavioral level optimization, architectural level power minimization targets a dual goal: 1) reduce the supply voltage to the minimal allowable level; and 2) minimize the effective capacitance. While the former is obtained at the architecture level by exploiting parallelism and pipelining [7], one of the main architectural means of addressing the latter is the use of locality of reference:
this level can have a dramatic impact on the power budget of a design. The availability of efficient, yet accurate tools at this level of abstraction is therefore of utmost importance.
Power analysis
Consider an architecture described at the structural level, consisting of an assembly of (parameterizable) modules and interconnecting buses and wires. Performing an accurate power analysis at this level of abstraction requires the following information:
• power models for the composing modules (library information)
• physical capacitance for the interconnect
• switching statistics for the interconnect signals
The latter can be obtained from functional or register transfer level simulation. Given a functional model of the architecture, it might even be possible to derive the signal statistics analytically using, for instance, power spectral analysis techniques, long known in signal processing. This approach turns out to be hopelessly complex, however, and is only useful for linear, time-invariant systems (which are rare). Simulation driven by an actual test-bench is therefore the most appropriate technique to gather signal statistics.
As interconnect in general has a significant impact on the overall power budget, getting a meaningful estimate of its capacitance is important. Meaningful estimates can be obtained from stochastic models, as described in the behavioral section, or from the initial floorplan. More accurate data can be back-annotated into the model once a more definitive floorplan or layout has been produced.
While the abstract modeling of architectural macromodules with regards to area and performance is a well understood craft, providing high level power models has only recently attracted major attention. In a first attempt in that direction, Vermassen [31] derived power models for adders and multipliers, based on complexity theory [32] and a probabilistic analysis of the underlying gate structure.
Another parametric model was described by Svensson and Liu [36], where the power dissipation of the various components of a typical processor architecture was expressed as a function of a set of primary parameters. For instance, the capacitance of the clock wire, distributed in an H-tree structure, equals $24 D_t c_{int}$, with $D_t$ the chip dimension and $c_{int}$ the interconnect capacitance per unit length. Models were developed for memory, logic, interconnect and clocks and were used to predict the global power consumption of a number of processors and ASIC chips. This modeling approach has the advantage of giving a global picture, suitable for driving optimizations and design trade-offs. On the other hand, when accuracy is an issue, the technique suffers from an abundance of parameters and is sensitive to mismatches in the modeling assumptions. At that point, empirical models become more attractive.
A first modeling effort in that direction was performed by Powell and Chau [33], who proposed a parameterized power model for macro-modules based on the following considerations. The power model for an array multiplier might, for instance, be represented as $P = C_U N^2 V^2 f$, where N is the width of the two inputs. Here, the quadratic dependence on N accounts for the $N^2$ adders of which a typical array multiplier is composed. Given this model, the characterization process consists of simulating or measuring the power dissipation of a multiplier, and fitting the capacitive coefficient, $C_U$, to these results. The problem with this approach is that the module's power consumption depends on the inputs applied. It being impossible, however, to characterize the module for all possible input statistics, purely random inputs (that is to say, independent uniform white-noise (UWN) inputs) are typically applied when deriving $C_U$.
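The characterization step described above amounts to a one-parameter fit. A minimal sketch, assuming an ordinary least-squares fit through the origin (the fitting criterion and the function names are assumptions made here, not details given in [33]), could look as follows.

```python
def fit_capacitive_coefficient(samples):
    """Least-squares fit of C_U in P = C_U * N^2 * V^2 * f.

    samples is an iterable of (n_bits, vdd, freq, power) tuples obtained
    from simulation or measurement of the module.
    """
    num = 0.0
    den = 0.0
    for n_bits, vdd, freq, power in samples:
        x = n_bits ** 2 * vdd ** 2 * freq  # regressor for this sample
        num += x * power
        den += x * x
    return num / den


def predict_power(c_u, n_bits, vdd, freq):
    """Evaluate the fitted model for a given width, supply and frequency."""
    return c_u * n_bits ** 2 * vdd ** 2 * freq
```

Because the fit uses whatever input statistics were applied during characterization, a coefficient derived from white-noise inputs inherits the limitation discussed in the text.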
This leads to an important source of error as illustrated by Figure 6, which displays the estimation error (relative to switch-level simulations) for a 16x16 multiplier. Clearly, when the dynamic range of the inputs doesn't fully occupy the bit width of the multiplier, the UWN model becomes extremely inaccurate. Errors in the range of 60-100% are not uncommon. A more precise model was proposed by Landman and Rabaey [34]. In the so called DDT (Dual Data Type) model, it is projected that typical data in a digital system can be divided into two regions: the LSB's, which act as uncorrelated white noise, and the MSB's, which correspond to sign bits. The latter are generally correlated between consecutive data values and are far from random. This is illustrated in Figure 6, which plots the transition probabilities for the different bit-positions in a data word
as long as the effective capacitance factor does not increase faster. For instance, at low supply voltages, the amount of concurrency needed for compensating the fast increase in propagation delay is so huge that the overhead becomes dominant and power dissipation increases.
A good overview of the use of optimizing transformations for supply voltage reduction is given in [7] , [8] . 
Reduction in Effective Capacitance
There are many means of reducing the potential effective capacitance of an algorithm. The most obvious way is to reduce the number of operations by choosing either the right algorithm for a given function or by eliminating redundant operators (for instance, using dead code and common sub-expression elimination). For instance, the computational requirements for the popular Discrete Cosine Transform (DCT), used extensively in video compression, range over a factor of 10 [1]. Automated techniques have been reported that globally reduce the number of operations, weighted with an area/power cost factor, over a range of algebraic manipulations and other transformations. This has resulted in a near specification invariance for linear applications such as DCT and video filters.
Another option is to reduce the "power strength" of an operation by replacing it with less demanding operations (this term is coined from the traditional strength reduction transformation in optimizing compilers [18]). An example of the latter is the expansion of a constant multiplication into a sequence of add/shift operations [7]. In the degenerate case, where no time multiplexing is performed, these shift operations can even be replaced by simple wiring.
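The expansion of a constant multiplication into add/shift operations can be sketched as follows. The decomposition shown is the plain binary one (one shift and add per set bit of the constant); the function names are hypothetical, and more economical recodings such as canonical signed digit are deliberately left out.

```python
def shift_add_terms(constant):
    """Return the shift amounts k such that constant == sum(1 << k)."""
    if constant < 0:
        raise ValueError("sketch handles non-negative constants only")
    return [k for k in range(constant.bit_length()) if (constant >> k) & 1]


def multiply_by_constant(x, constant):
    """Multiply x by a constant using only shifts and additions."""
    return sum(x << k for k in shift_add_terms(constant))
```

For example, `shift_add_terms(10)` yields the shifts 1 and 3, so a multiplication by 10 becomes `(x << 1) + (x << 3)`. In hardware with a fixed constant, each shift is a fixed rewiring of the operand bits, so only the additions cost switched capacitance.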
Locality of reference is a prime feature to be pursued when optimizing an algorithm for power consumption [26] . Local operations tend to involve less capacitance and are, consequently, cheaper from a power perspective. Obviously, this locality is only meaningful when exploited properly at the architectural level. For instance, operations in the same sub-function of an algorithm should be allocated to the same processor or the same data path partition. This eliminates expensive data transfers across high capacitance busses. Transformations, improving the locality of reference, are abundant, for instance, in the domain of memory management, although most of the work in that domain has been oriented towards either area or performance optimization ([24] , [25] ) until recently.
Memory accesses often contribute a substantial part of the dissipation in both computational and signal processing applications. Replacing expensive accesses to background (secondary) memory by foreground memory references, or using distributed memory instead of a single centralized memory can cause a substantial power reduction [38] . An extensive study into the impact of memory management on system and architecture-level power consumption has been performed at IMEC [23] . This has shown that the external communication and memory access for e.g. a table-based telecom network subsystem dominates the power budget even when compared to the sum of all the other contributions (clock, data-path and control). The combined results for this realistic test-vehicle taken from an ATM network, shown in Table 1 , illustrate the extent of the improvements which can be obtained.
Finally, selecting the correct data representation or encoding can reduce the switching activity [27]. For instance, the almost universally used two's complement notation has the disadvantage that all bits of the representation are toggled for a transition from 0 to -1, which occurs rather often. This is not the case in the sign magnitude representation, where the same transition toggles only the sign bit and the least significant bit. Choosing the correct data encoding can impact the dissipation in data signals with distinct properties, such as signal processing data paths ([28], [29]), address counters [23] and state machines. In [30], it was demonstrated that choosing a Gray-coded instruction addressing scheme results in an average reduction in switching activity of 37% over a range of benchmarks.
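The effect of the encoding on switching activity can be made concrete by counting toggled bits. The sketch below uses hypothetical helper names and a 16-bit word; the Gray-code comparison uses a purely sequential address stream, which is an idealized assumption and not the benchmark set of [30].

```python
def twos_complement(value, width):
    """Encode a signed integer as a width-bit two's complement word."""
    return value & ((1 << width) - 1)


def sign_magnitude(value, width):
    """Encode a signed integer as sign bit plus (width-1)-bit magnitude.

    The magnitude is assumed to fit in width-1 bits.
    """
    sign = 1 if value < 0 else 0
    return (sign << (width - 1)) | abs(value)


def toggles(a, b):
    """Number of bit positions that differ between two words."""
    return bin(a ^ b).count("1")


def gray(n):
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def sequence_toggles(codes):
    """Total bit toggles over a sequence of consecutive code words."""
    return sum(toggles(a, b) for a, b in zip(codes, codes[1:]))
```

With these helpers, a 0 to -1 transition toggles all 16 bits in two's complement, while a sign change at constant magnitude (for example +1 to -1) toggles a single bit in sign magnitude. Stepping through addresses 0 to 255 toggles one bit per step in Gray code, against roughly two per step on average in plain binary.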
The Architectural Level
Once the architecture is defined and specified (e.g. using a functional or register-transfer level description), a more refined power profile can be constructed, which opens the way for more detailed optimizations. For instance, the impact of resource multiplexing can be taken into account. The architectural level is the design entry point for the large majority of digital designs and design decisions at
obviously true that the model parameters will vary over methodologies and technologies, a more important result of these modeling attempts is the establishment of fundamental dependencies and relationships. The understanding of these dependencies can eventually open the door to more advanced models.
Applications of Behavioral Power Modeling
The resulting power model turns out to be particularly useful in the algorithmic optimization process and for design guidance. This is illustrated with the example of a vector quantization encoder for video compression [9], [38]. A small variation in the formulation of the vector encoding algorithm can result in dramatic power reductions while maintaining the same performance. This is demonstrated in Figure 5, where the energy/iteration is plotted as a function of the supply voltage for two formulations of the algorithm. The second variant has the advantage of a reduced critical path (which enables lower voltage operation) and a smaller number of operations (which reduces the overall switching capacitance). Behavioral power estimators can be extremely useful in studying this type of trade-off. This is also demonstrated in [43] and [23].
Behavioral Level Power Minimization
Prior to a detailed design of the architecture, the only means to affect power is to modify or transform the specification, such that the resulting description is more amenable to low power solutions. Assuming that the repetition frequency (sample, execution or instruction rate) is a given design constraint, the only means to reduce the projected dissipation of an application is by either reducing the supply voltage or the effective capacitance.
Supply Voltage Reduction
Lowering the supply voltage is the most effective means of reducing power consumption as it yields a quadratic improvement in the power-delay product of a logic family. Unfortunately, this simple solution comes at a cost. First of all, finding commodity components that operate at different voltage levels (below 3.3V) is hard, while multiple power supplies are to be avoided. Some of these issues can be addressed by the introduction of power-efficient D.C.-D.C. converters [17] . Reducing the supply voltage degrades the performance of the logic operators [2] . While the performance penalty is initially small (between 3 and 5V), a rapid decrease can be observed once V DD approaches 2 times the transistor threshold. To maintain performance, this loss has to be compensated by other means. At the architectural level, this means using either faster components and operators or exploiting concurrency. Replacing a single fast operator, operated at high voltage, by a multitude of slow operators, running at low V DD , does not impact the effective capacitance (ignoring the overhead of the parallel architecture). But it allows for a lowering of the supply voltage, which results in substantial power savings. This approach is generally known as trading area for power.
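The area-for-power trade can be quantified with a first-order sketch. The delay model used here, delay proportional to V/(V - V_T)^2, is a common long-channel approximation and is an assumption of this example, as are the reference supply of 5 V, the threshold of 0.7 V, the function names and the lumped `overhead` factor for the extra capacitance of the parallel architecture.

```python
def relative_delay(vdd, vt):
    """First-order gate delay, up to a constant factor (assumed model)."""
    return vdd / (vdd - vt) ** 2


def scaled_supply(parallelism, v_ref=5.0, vt=0.7, tol=1e-9):
    """Lowest supply at which `parallelism` slow units match the throughput
    of one unit running at v_ref, found by bisection."""
    target = parallelism * relative_delay(v_ref, vt)
    lo, hi = vt + 1e-6, v_ref
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if relative_delay(mid, vt) > target:
            lo = mid  # too slow at this voltage, search higher
        else:
            hi = mid  # fast enough, try a lower voltage
    return hi


def power_ratio(parallelism, overhead=0.0, v_ref=5.0, vt=0.7):
    """Power relative to the single-unit reference at equal throughput.

    The capacitance switched per operation is assumed unchanged except
    for the fractional overhead, so power scales with V^2.
    """
    v = scaled_supply(parallelism, v_ref, vt)
    return (1.0 + overhead) * (v / v_ref) ** 2
```

Under these assumptions, two parallel units allow the supply to drop from 5 V to roughly 3.1 V, cutting power to under 40% of the reference before overhead. Increasing `overhead` with the degree of parallelism shows how the gain erodes once the supply approaches the threshold voltage.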
Design automation is instrumental when exploring this area-power trade-off. To make an application amenable for this class of power optimization, it is essential that the algorithm contains sufficient concurrency and that its critical path is smaller than the maximum execution time (under the voltage range of interest). Optimizing transformations to achieve both those goals have been studied extensively in the compiler and high level synthesis communities ([18] , [20] ), mostly for speed optimization purposes. The power minimization goal adds another twist to the problem, however: increasing the concurrency or reducing the critical path only makes sense 
Determining the switching activity
The activity factor is what distinguishes power prediction from, for instance, area or performance analysis and makes it considerably harder. Activity is a function of both the algorithm structure and the data applied. The latter factor in particular is hard to determine deterministically, and most of the activity analysis approaches therefore require extensive simulation or use empirical data.
As an example of the latter, Masaki ([5], [6]) determined the global switching activity factor for a variety of computing systems by analyzing their power consumption using either simulation or measurement. For mini-computers, α-factors between 0.01 and 0.005 are obtained, while micro-computers display a higher activity ranging from 0.05 to 0.01.
While the experimental approach is useful when performing a macro-analysis, its accuracy is insufficient to be of any help in algorithm selection and optimization. Most of the high level power prediction tools use deterministic algorithm analysis, combined with profiling and simulation to address data dependencies. Important statistics include the number of instructions of a given type, the number of bus, register and memory accesses and the number of I/O operations ([8], [9]) executed within a given period. The advantage is that this analysis can be executed at a high abstraction level and can therefore be run on a large data set without incurring a dramatic time penalty. Instruction level simulation or behavioral DSP simulators are easily adapted to produce this information (e.g., [10]). An example of such an approach is documented in [23], where the switching activity related to transitions in data and address signals for off-chip storage is tabulated by applying random excitations to a board level system model.
Determining the capacitance
To translate the activity factors into actual energy measurements, they have to be weighted with the capacitance switched per access of a given resource. It is important to observe here that, to a first degree, the actual number of instances of a resource does not impact the dissipation. For instance, performing N additions sequentially on a single adder consumes just as much power as performing the additions concurrently on N adders (to a first degree, when ignoring overhead and signal correlations). The architectural composition, however, has an impact on the size of the chip and, consequently, influences the cost of a bus access, an input/output operation or the clock network.
The power cost of accessing a resource is determined by its type. For computational units (adders, multipliers, registers) or for memories, it is possible to get measured or estimated data from the design libraries. At this point in the design process, it is hard to determine the actual switching properties of the data presented to the modules. Consequently, a white noise data model is generally adopted. While presenting a pessimistic view, this simplification is acceptable at this level of abstraction as it does not dramatically distort the correlation with the actual dissipation.
Given these modeling assumptions, the effective capacitance of the computational resources is then estimated by the following expression:
$$C_{eff} = \sum_{res} N_{res}\, C_{res}$$
with $C_{res}$ the capacitance switched per access to a given resource and $N_{res}$ the number of accesses to that resource over a given period.
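The weighting of access counts by capacitance per access can be sketched in a few lines. The resource names, access counts and capacitance values used with this sketch are hypothetical placeholders; in practice the counts would come from profiling and the capacitances from the design library, under the white noise data model mentioned above.

```python
def effective_capacitance(access_counts, cap_per_access):
    """Sum over resources of (number of accesses) x (capacitance per access).

    access_counts:  resource name -> accesses within the period of interest
    cap_per_access: resource name -> capacitance switched per access
    """
    return sum(count * cap_per_access[res]
               for res, count in access_counts.items())


def energy_per_period(c_eff, vdd):
    """Energy drawn from the supply to switch c_eff once at voltage vdd."""
    return c_eff * vdd ** 2
```

Note that only the number of accesses enters the sum, not the number of resource instances, which reflects the first-order observation that N sequential additions on one adder cost as much as N concurrent additions on N adders.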
Predicting the dissipation of other architectural components, such as I/O, busses, clocks and controllers, is more troublesome, as their capacitance is strongly influenced by the subsequent design phases (such as logic synthesis, floorplanning and place and route). While just ignoring their contribution results in unacceptable results, the best that can be produced at this point is a reasonable estimate. In the high level synthesis community, there have been considerable efforts in estimating the wiring cost, given the active area of a design and other estimated parameters, such as the number of modules, the number of busses and their average bus-width and fanout and the number of control signals. The most popular approach is to generate the layouts for a large number of benchmark examples, collect the routing data and construct a stochastic model ( [11] , [12] ).
A similar approach can be adapted to predict the average capacitance per bus. This is demonstrated in Figure 3, which plots the measured bus capacitance as a function of the chip area. The data is obtained from 50 benchmark examples generated using the HYPER synthesis [13] and the LAGER-IV layout generation [14] tools. The close correlation between area and bus capacitance is easily captured in a simple piecewise linear model. The chip area, in turn, can be estimated from the algorithm and the performance constraints ([15], [16]).
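A piecewise linear model of this kind can be sketched in a few lines of Python. The breakpoints below are hypothetical placeholders; in practice they would be fitted to the measured benchmark layouts. Clamping outside the fitted range is an assumption made here for simplicity and is not taken from the paper.

```python
from bisect import bisect_right

# Hypothetical breakpoints (chip area in mm^2, bus capacitance in pF).
# Real values would come from fitting benchmark layout data.
BREAKPOINTS = [(0.0, 0.0), (10.0, 0.5), (50.0, 1.5), (100.0, 2.0)]


def bus_capacitance_pf(area_mm2, breakpoints=BREAKPOINTS):
    """Linearly interpolate bus capacitance between neighbouring breakpoints.

    Areas outside the fitted range are clamped to the nearest breakpoint.
    """
    areas = [a for a, _ in breakpoints]
    if area_mm2 <= areas[0]:
        return breakpoints[0][1]
    if area_mm2 >= areas[-1]:
        return breakpoints[-1][1]
    hi = bisect_right(areas, area_mm2)  # index of the first breakpoint above the area
    (a0, c0), (a1, c1) = breakpoints[hi - 1], breakpoints[hi]
    return c0 + (c1 - c0) * (area_mm2 - a0) / (a1 - a0)
```

Given an early area estimate, a call such as `bus_capacitance_pf(30.0)` returns the interpolated capacitance per bus, which can then be multiplied by the number of bus accesses in the same way as for the computational resources.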
Similar approaches can be used to produce early guesses of control and input/output capacitance [9]. It can be argued that the resulting models are only useful for a given set of design tools and methodologies. While it is
The next section (section 4) of the paper describes the estimation and optimization techniques required to optimize the physical design. This section discusses partitioning, floorplanning, placement, routing and circuit sizing. It also discusses the role of the object-oriented library. The library is key to the successful implementation and support of the CAD tools and activities at all levels of the design data hierarchy described in this paper. The library does not naturally fall into any one point of the design flow because it is an enabling technology that supports all phases of the design process, and it is often taken for granted. The discussion of libraries is placed in this section to highlight the high value-added nature of this technology; that is, if the library manager/system were implemented to support only one phase of the design process, it would generate the highest ROI at the layout phase.
Section 5 discusses reliability and packaging issues relating to low power and, in particular, the acute problems resulting from low power design. For example, good low power design requires that units of the processor/chip dynamically turn themselves off when idle and turn themselves on when needed. This results in large current spikes being generated on the power lines/rails.
Each section of the paper follows a common theme. The scope of the problem is first established. This is then followed by discussions of the low power analysis techniques and the low power optimization techniques.
The paper concludes with a summary and recommendations for low power tool development together with their priorities.
High Level Design
The Behavioral Level
While this is the least explored dimension of the power minimization process, it is potentially the one with the most dramatic impact. Selecting the right algorithm can reduce the power consumption of an application by orders of magnitude. Similarly, a single application can be represented in many different ways, one being more amenable than the other to low power implementation. These observations have been experimentally verified and demonstrated by a number of researchers ([1], [2], [3]) for various domains such as audio/speech, image/video, telecommunications and networking. Similarly, finding the right system partitioning can have a profound impact on the overall power budget. For instance, when designing a wireless communication system, it makes sense to transfer the brunt of the power intensive functions to the immobile side of the system, relieving the power consumption at the battery powered portable end [1]. All these decisions can result in orders of magnitude of power reduction. This comes, however, at the price of decreased performance, increased latency or area, such that trade-offs are necessary in practice.
In summary, the problem with power minimization at the behavioral level is that a designer has to trade off multiple contradictory design parameters, the impact of which is mostly unknown in the early phases of the design process. The same holds for architectural level design as well. As a result, the algorithmic and architectural specifications are often frozen early in the design process, based on "back of the envelope" computations. This reduces the power minimization space to the circuit or logic levels, where generally only limited reductions can be achieved. Offering the designer a set of tools that help to explore, evaluate, compare and optimize the power dissipation alternatives early in the design process is a must to break this deadlock.
In the following sections, we discuss the existing and on-going efforts in design methodology and tool development at the behavioral and architectural levels. As is common throughout this tutorial, design tools are partitioned into two categories: analysis/prediction and synthesis/optimization.
Power Prediction/Estimation
A meaningful minimization process requires a reasonable estimate of the optimization target, in this case power consumption. Trying to predict the power consumption of an application in the early phases of the design process is a non-trivial task. Dissipation is a function of a number of parameters, as demonstrated in (1), the well known expression for dynamic power consumption in CMOS circuits.
$$P_{dyn} = \alpha \, C \, V_{DD} \, V_{swing} \, f \qquad (1)$$
where $C$ represents the physical capacitance of the design, $V_{DD}$ the supply voltage, $V_{swing}$ the logical voltage swing (which most often equals the supply voltage in CMOS design), $\alpha$ the switching activity and $f$ the periodicity of the computation, which is most often the sample, instruction or clock rate. Of those factors, the capacitance and the activity are especially hard to predict in the early design phases. The former requires a detailed knowledge of the implementation, while the latter is a strong function of the signal statistics and is, hence, nondeterministic. The two factors are often lumped together in the effective capacitance $C_{eff} = \alpha C$.
minicomputers, and printers, and the non power-managed PCs consume some 6.5 times the electricity consumption of the power managed systems.
In rising to the challenge to reduce power, the semiconductor industry has adopted a multifaceted approach, attacking the problem on three fronts:
• Reducing chip capacitance through process scaling. This approach to reducing the power is very expensive and has a very low Return On Investment.
• Reducing voltage. This approach is complex, difficult to achieve and requires moving the DRAM and systems industries to a new voltage standard. The Return On Investment is low.
• Employing better architectural and circuit design techniques. This approach promises to be the most successful because the investment to reduce power by design is relatively small in comparison to the other two approaches.
The last approach (design for low power) does, however, have one drawback: the design problem has essentially changed from an optimization in two dimensions (Performance and Area) to an optimization in three dimensions (Performance, Area and Power); see Figure 2.
Design for low power cannot be achieved without good CAD tools. The remainder of this paper describes the CAD tools and methodologies required for efficient low power design. Because low power design is a relatively new field, the paper is targeted at a wide audience to achieve the following:
• Convey an understanding of the breadth of the problem.
• Explain the state of the art of CAD tools and methodologies, and provide references to more in-depth technical information in specific fields.
• Highlight the areas that need considerably more research.
• Assist the commercial CAD vendors in understanding the needs and time frames for new CAD tools to support low power design.
Figure 2. Optimization changes from 2 dimensions to 3 dimensions
Although it is well accepted that design is not a purely top down process, a top down view is a convenient way of describing the low power activities at different levels of the design data hierarchy, as well as the flow of data and control. The different sections of the paper discuss the power issues in the context of a CAD system to support design for low power.
Figure [12] depicts the flow of data through the CAD system. Complex systems tend to be specified with a mix of architectural and behavioral constructs, primarily because some parts of the system can be described by an algorithm and others cannot. It is assumed that the design starts off with a behavioral/architectural description of the product specification; DSP systems are more amenable to behavioral descriptions, whereas general purpose microprocessor designs are more amenable to microarchitectural specifications. In the case of DSP systems, the behavioral description is transformed through a series of behavior-preserving transformations to generate microarchitectural specifications with timing. The microarchitectural description is then transformed and mapped to Functional Building Blocks (FBBs) or templates representing execution units, controllers, memory elements, etc. The elements of the functional building blocks are stored in an object-oriented library. Microarchitectural synthesis further refines the FBBs into hardware building blocks (again stored in the library) to meet performance, power, timing and area constraints. The result of this phase is an RTL description describing datapaths, FSMs and memories (registers together with the necessary busses). The low power estimation and optimization tools relating to this phase of the high level design are described in section 2 of the paper.
Next, the RTL description is processed by logic level tools, such as datapath compilers and logic synthesis tools, to further optimize the design. Again the design is optimized for performance, power, timing and area, and the tools make use of the design entities residing in the object-oriented library. The low power estimation and optimization tools relating to this phase of the design are described in section 3 of the paper.
