A semicustom ASIC design methodology is used to develop a low power DSP core for mobile (battery powered) applications. Different low power design techniques are used, including dual voltage, low power library elements, accurate power reporting, pseudomicrocode, transition-once logic, clock gating, and others.
I. INTRODUCTION
Low power design is mainly driven by the need to contain power dissipation (and hence reduce packaging and cooling costs) for high-performance systems at one end of the applications spectrum, and by the desire to reduce power consumption (and hence reduce size and weight and increase battery life) for portable applications at the other end. This paper presents several low power techniques used in the design of an ASIC DSP core for portable applications. Both dynamic power (in active mode) and static power (in standby mode) are critical and need to be addressed for battery-operated devices.
Portable Applications
The portable, battery-operated marketplace is dominated by embedded applications with customer-specific software run from a Read-Only Memory (ROM). The challenge for the designers is to balance the performance requirements of current compute-intensive applications against the need to reduce cost and form factor and to increase battery life. Dynamic power is strongly dependent on the switching activity, hence the power can be reduced by tailoring the processor speed to a variable computation load. Because of this possible tradeoff, the power requirements for portable applications are typically specified in mA/MHz, not in mW, with current specifications at < 0.5 mA/MHz. In order to achieve such aggressive power targets, attention needs to be paid to all aspects of chip design at all levels of abstraction, from process and library technology, to physical design, circuit and logic, to register transfer (RT) level and behavior.
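A minimal sketch of why a mA/MHz figure is convenient: it turns a chosen clock rate directly into supply current and, from there, into battery life. The 0.5 mA/MHz value is the bound quoted above; the battery capacity and the clock frequencies are hypothetical, the function names are illustrative only, and standby current is ignored.

```python
def supply_current_ma(spec_ma_per_mhz: float, clock_mhz: float) -> float:
    """Active supply current, assuming it scales linearly with clock frequency."""
    return spec_ma_per_mhz * clock_mhz


def battery_life_hours(capacity_mah: float, current_ma: float) -> float:
    """Idealized battery life: capacity divided by a constant current draw."""
    return capacity_mah / current_ma


SPEC_MA_PER_MHZ = 0.5   # upper bound quoted in the text
CAPACITY_MAH = 600.0    # hypothetical battery capacity

for clock_mhz in (10.0, 20.0, 40.0):  # hypothetical clock rates
    current = supply_current_ma(SPEC_MA_PER_MHZ, clock_mhz)
    hours = battery_life_hours(CAPACITY_MAH, current)
    print(f"{clock_mhz:5.1f} MHz -> {current:5.1f} mA, {hours:6.1f} h")
```

Under these assumptions, halving the clock rate for a lighter computation load halves the active current, which is the tradeoff the mA/MHz specification captures.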
The Power Wheel
The power wheel [1] shown in the figure was used in this project to focus the low power design activities. The wheel depicts the major components needed for low power design and is inspired by Gajski's well-known Y-chart [2]. At the center of the wheel are the low power technology and supporting cell models, with the outer rings representing higher levels of abstraction, from transistor to gate to RT and behavior. The components of the wheel are closely integrated through the low power design methodology used for the project.
Low Power Design Methodology
Low-power products have traditionally been custom designed, with the help of accurate power estimators and power reporting tools, in order to deliver first-pass silicon that meets the target power specification. Typically, a semicustom, cell-based design methodology was considered less attractive for low power products because less control is available for the designer to achieve aggressive design specifications. In this paper, we present the process and techniques by which we achieved the low power design of a DSP core for mobile (battery powered) applications using a semicustom methodology. These techniques can be used for other semicustom designed cores and have been incorporated into the low power design methodology shown in Figure 2 [1].
FIGURE The power wheel represents the components of a low power design methodology (from [1]).
The expected accuracy of the power calculation is intrinsically reduced at higher levels of abstraction, but a basic correlation must exist to ensure that power reduction at the RT level results in corresponding reductions at the logic and transistor levels. Statistical, probabilistic or random switching models are used to estimate power at different levels of abstraction. The power consumption can also be calculated accurately if switching factors are available from logic simulation, but finding a set of representative applications that will yield power consumption close to the average power of the hardware is generally difficult. The advantage for battery-powered embedded applications is that the microprocessor is typically limited to the code in the ROM. The range of applications is therefore bounded, and embedded processors spend most of their powered-up cycles in certain segments of code. This bounded code can confidently be used to characterize average dynamic operating power and to identify instructions with high usage frequency, which is the method used in this project.
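The characterization described above can be illustrated with a small sketch that weights a per-instruction power figure by how often each instruction executes in the bounded ROM code. The instruction names, execution counts, and per-instruction power values are hypothetical placeholders, not data from this project.

```python
def average_power_mw(profile: dict, power_mw: dict) -> float:
    """Usage-weighted average of per-instruction dynamic power."""
    total_executions = sum(profile.values())
    weighted = sum(count * power_mw[op] for op, count in profile.items())
    return weighted / total_executions


def rank_by_usage(profile: dict) -> list:
    """Instructions ordered from most to least frequently executed."""
    return sorted(profile, key=profile.get, reverse=True)


# Hypothetical execution counts gathered from simulating the ROM code.
profile = {"MAC": 500, "LOAD": 300, "STORE": 150, "BRANCH": 50}
# Hypothetical per-instruction dynamic power in mW.
power_mw = {"MAC": 12.0, "LOAD": 8.0, "STORE": 8.0, "BRANCH": 4.0}

print(rank_by_usage(profile))
print(average_power_mw(profile, power_mw))
```

The ranking points to the instructions where a power reduction pays off most, since they dominate the weighted average.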
IV.I. Power Consumption Reporting
The power consumption report is a key component of the power reduction process.
To accomplish these four rules, two selects are defined for every data port: the fast select and the slow select. The fast select is timed to always arrive earlier than the fastest data bit for that input port. The slow select is timed to arrive later than the slowest bit of the same data port. The two selects are fed into an exclusive NOR (XNOR). Both selects must have the same value (either both 0 or both 1) for the multiplexer input value to be propagated to the output stage. Between the input stage and the output stage is a soft latch, which holds the output state when needed, as seen in Figure 9B. If the same input port is to be selected from one cycle to the next, then both selects need to be toggled (if both are 0, then both become 1, one after the other, in the next cycle). By toggling both selects, the input port is deselected during the input data transition and no glitches are propagated. The fast select changes state before the slow select, temporarily disconnecting the output port from the input during the transition of the input port. Once the slow select toggles to the same value as the fast select, the data is propagated to the output stage (Figure 10). This circuit assumes that the select value is known one cycle before it is actually needed, which is typically true in a pipelined processor.
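The select behavior described above can be sketched as a small behavioral model for a single one-bit input port. This is an illustrative simulation only, not the circuit of Figure 9B: the sampled trace, the function names, and the one-bit data width are assumptions.

```python
def xnor(a: int, b: int) -> int:
    """Exclusive NOR: 1 when both inputs are equal."""
    return 1 if a == b else 0


def simulate_port(trace, initial_output=0):
    """Model one port of a transition-once multiplexer.

    trace: time-ordered samples of (fast_select, slow_select, data).
    The port is transparent only while XNOR(fast, slow) is 1;
    otherwise the soft latch holds the previous output value.
    """
    out = initial_output
    outputs = []
    for fast, slow, data in trace:
        if xnor(fast, slow):
            out = data          # selects agree: input propagates
        outputs.append(out)     # selects differ: latch holds
    return outputs


def count_transitions(values):
    """Number of value changes between consecutive samples."""
    return sum(1 for a, b in zip(values, values[1:]) if a != b)


# Hypothetical cycle: the fast select toggles first, the data glitches
# while the port is disconnected, then the slow select toggles.
trace = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 0, 0), (1, 0, 1), (1, 1, 1)]
data = [d for _, _, d in trace]
outputs = simulate_port(trace)

print(count_transitions(data), count_transitions(outputs))
```

In this trace the input toggles three times while the output toggles once, because the glitches fall between the fast-select and slow-select edges.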
A variation of the transition-once multiplexer is a transition-once buffer which can be used within random logic. When a pinch-point exists in the design, a transition-once buffer can isolate the downstream logic cone from the upstream cone, reducing glitches. 
VII. CLOCK GATING METHODOLOGY
Since the clock distribution network typically consumes a large percentage of the processor power, clock gating, in which the clocks are turned off to portions of the network that do not require them, can save a significant amount of active power. Our ASIC clock distribution methodology uses a single clock phase which is distributed through a re-driven network to clock splitters, which create two non-overlapping, out-of-phase clocks. These clocks are then used to drive master/slave latches.
The procedure of Figure 11 iteratively picks different groupings of signals to be typed together, analyzes opcode coverage, and picks the type groupings that maximize opcode coverage with the minimum number of groupings. Figure 12 shows the clock distribution network for an LSSD-based design.
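One way to picture the search for type groupings is a greedy covering heuristic, sketched below. This is a stand-in for illustration and is not claimed to be the procedure of Figure 11: the opcode names, the register groups, the candidate groupings, and the rule that an opcode is "covered" when every register group it clocks lies inside one grouping are all assumptions.

```python
def covered_opcodes(grouping, opcode_regs):
    """Opcodes whose clocked register groups all fall inside the grouping."""
    return {op for op, regs in opcode_regs.items() if regs <= grouping}


def pick_groupings(candidates, opcode_regs):
    """Greedily pick groupings until every opcode is covered or no gain remains."""
    uncovered = set(opcode_regs)
    chosen = []
    while uncovered:
        # Prefer the most newly covered opcodes, then the smaller grouping.
        best = max(
            candidates,
            key=lambda g: (len(covered_opcodes(g, opcode_regs) & uncovered), -len(g)),
        )
        gain = covered_opcodes(best, opcode_regs) & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical opcodes and the register groups each one needs clocked.
opcode_regs = {
    "ADD": frozenset({"acc"}),
    "MAC": frozenset({"acc", "mul"}),
    "LOAD": frozenset({"addr", "data"}),
    "STORE": frozenset({"addr"}),
}
# Hypothetical candidate groupings of register groups.
candidates = [
    frozenset({"acc"}),
    frozenset({"acc", "mul"}),
    frozenset({"addr", "data"}),
    frozenset({"addr"}),
]

chosen, uncovered = pick_groupings(candidates, opcode_regs)
print(chosen, uncovered)
```

With these inputs two groupings cover all four opcodes. A realistic cost function would also weigh how many latches each grouping leaves clocked unnecessarily.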
Once optimum clock gating groups are defined and opcodes have been assigned to types, typed clock gating can begin. Early in the cycle, opcodes are pre-decoded to generate a type field corresponding to the groups of registers.
Figure 15 shows a simple example of the ungate algorithm, while Figure 16 shows the results of using ungate on two real-life example core designs. The clock gating optimization flow is shown in Figure 17; more details on the clock gating aspects can be found in [3, 4].
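The pre-decode step of typed clock gating can be sketched as two lookups: opcode to type, then type to the register groups whose clocks stay enabled. The tables, the latch counts, and the sample program below are hypothetical and serve only to show the mechanism.

```python
# Hypothetical pre-decode table: opcode -> type field.
OPCODE_TYPE = {"ADD": "alu", "MAC": "alu", "LOAD": "mem", "STORE": "mem", "NOP": "idle"}
# Hypothetical mapping from type to the register groups that must be clocked.
TYPE_GROUPS = {"alu": {"acc", "mul"}, "mem": {"addr", "data"}, "idle": set()}
# Hypothetical number of latches in each register group.
GROUP_LATCHES = {"acc": 40, "mul": 32, "addr": 16, "data": 16}


def clock_enables(opcode: str) -> dict:
    """Per-group clock enable derived from the pre-decoded type field."""
    active = TYPE_GROUPS[OPCODE_TYPE[opcode]]
    return {group: group in active for group in GROUP_LATCHES}


def clocked_latches(opcode: str) -> int:
    """Latches that receive a clock while this opcode executes."""
    enables = clock_enables(opcode)
    return sum(n for group, n in GROUP_LATCHES.items() if enables[group])


def gated_fraction(program: list) -> float:
    """Fraction of latch clock events suppressed over the program."""
    total = len(program) * sum(GROUP_LATCHES.values())
    clocked = sum(clocked_latches(op) for op in program)
    return 1 - clocked / total


program = ["LOAD", "MAC", "MAC", "STORE", "NOP"]
print(gated_fraction(program))
```

The gated fraction counts suppressed latch clock events only; it does not model the clock network capacitance or the overhead of the gating logic itself.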
Another important finding is that hierarchical gating can be quite effective for binary trees with fine-grain distributed drivers and balanced wiring (e.g. 20% savings), but 
