Abstract
Introduction
Power-related issues are a primary concern for battery-operated devices such as communication and multimedia systems. A portable device relies on long battery life to deliver the convenience afforded by mobility. Processor design is therefore especially important in low-power, low-energy computing. Well-known power reduction techniques such as resource scaling [2] and clock gating have been adopted in low-power architectures, while in recent years voltage scaling and multiple supply voltages [5] have been extensively studied as ways to eliminate unnecessary energy dissipation. To optimize architectures for a power budget while meeting performance requirements, designers have to investigate the trade-offs among various low-power architectures, and in doing so they rely on tools that provide fast but sufficiently accurate estimates of both performance and power consumption. Frequent iterations among RTL design, synthesis, gate-level simulation and power analysis lengthen the design cycle and may violate time-to-market deadlines. In addition to architecture exploration, compiler designers need a fast, cycle-accurate framework for compile-time optimizations. For a VLIW digital signal processor (DSP), which runs statically scheduled instructions, the scheduler needs to perform instruction ordering and resource scheduling, and to insert specialized power-down instructions to reduce power consumption. A framework with fast power estimation is therefore needed to support such compiler optimizations. Finally, user evaluation tools are essential for any processor, and evaluating the power consumption of battery-operated applications is especially important.
In this paper, we present an architecture-level power estimation simulator, which could be employed in exploring architecture tradeoffs, compiler optimizations as well as user evaluations of a given architecture. The power estimation engine of this simulator is based on parameterized power models and cycle accurate simulation, while power reduction strategies such as voltage scaling and clock gating are considered to provide various design alternatives. Furthermore, this simulator is also configurable to deal with various instruction set architectures (ISAs), while scalable power models can be employed to estimate the power consumption of soft cores, which could be synthesized according to specific applications.
Prior work
In recent years, several simulation-based power analysis engines at the instruction level, RT level and architecture level have been proposed. The recent power modeling techniques that most affect simulator accuracy are described below. Power analysis at the RT level based on an application-driven methodology has been proposed in [7]. Instead of using macro modeling, that work relies on program profiling to extract parameters from applications. At the instruction level, the power model is derived through the simulation of various instruction sets [11, 14]. However, in general, if the full spectrum of transitions between instructions is considered, the complexity of an instruction-level power model increases exponentially with the number of instructions. Moreover, power models derived for a specific instruction set architecture do not offer the flexibility to vary architecture parameters.
More recently, several approaches to architecture-level power estimation have been presented. For example, Wattch [4] and SimplePower [12] provide cycle-accurate frameworks that collect switching activities and employ unit power models for power estimation. Their power models are configured by width and input transitions, but this may not be sufficient for various microarchitectures, since the load capacitance may need to be adjusted according to the delay budget used when optimizing individual modules. For power modeling and pre-characterization, several techniques have been introduced, such as RT-level power modeling [8, 1, 17], a structure-oriented technique [13] and the Dual Bit Type model [9]. Rather than assuming random inputs, the structure-oriented method considers all possible input transitions and then reduces the complexity of the state transition graph (STG) by finding the minimum set of compatible patterns. The Dual Bit Type model divides an operand into a sign region and a uniform region because of their different transition behavior.
Framework
The proposed simulator is a cycle-accurate design exploration environment based on a VLIW infrastructure and cycle-by-cycle simulation. To support multiple-instruction issue, we treat each instruction way as an independent cluster, which provides the flexibility to model architectures with different numbers of instruction ways. Moreover, the architecture parameters of the simulator can be adjusted according to the architecture specification. Examples of parameterized units include the processing elements (PEs) or functional units, the memory organization, and the number of pipeline stages. This flexible infrastructure enables designers to explore various architectures.
The organization of the simulator is illustrated in Figure 1. The core of the VLIW simulation engine is the pipeline operation, composed of seven stages (each of which can be skipped or extended to multiple cycles). The pipeline operation is implemented in software, which carries out the work of each stage. Within each stage, the ways are executed sequentially, but the simulator emulates the parallel execution that takes place in hardware. Hazards in shared register files and data memory are therefore detected correctly. For instance, when two instructions issued in the same cycle load from and store to the same memory address, the load is completed before the store. During the pipeline operation, the simulator calculates the transition probabilities of operands and uses pre-characterized power models to obtain cumulative power numbers. The switching activity of pipeline registers is also accounted for. To simplify use of the simulator and to offer complete cycle-accurate information for analysis and debugging, the loader is responsible for program loading, while the debugger allows the user to trace the state of operation through a user interface. To launch the simulator, ISA parameters and benchmarks must be specified in order to configure the memory organization, instruction path and corresponding power models. The operating frequency, voltage and synthesis frequency can also be selected independently for further exploration. In addition, the simulator can be used to find the optimal operating frequency/voltage point for a given performance constraint. The simulator is more than 1000 times faster than commercial tools. This speed-up makes it practical to evaluate various ISAs, and the resulting architecture analysis covers the most useful power reduction strategies.
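The search for an operating point under a performance constraint can be pictured with a small sketch. It is only an illustration under stated assumptions: the candidate (frequency, voltage) pairs are hypothetical, the cycle count is taken to be independent of frequency, and dynamic energy per cycle is assumed to scale with the square of the supply voltage. The function name `pick_operating_point` is not part of the simulator.

```python
def pick_operating_point(points, cycles, deadline_s):
    """Pick the (frequency_hz, voltage) pair with the lowest relative
    dynamic energy among those that finish `cycles` within the deadline.

    Assumption: energy per cycle is proportional to voltage squared,
    so relative energy = voltage**2 * cycles.
    """
    feasible = [(f, v) for f, v in points if cycles / f <= deadline_s]
    if not feasible:
        return None  # no candidate meets the performance constraint
    return min(feasible, key=lambda fv: fv[1] ** 2 * cycles)


# Hypothetical candidates: (frequency in Hz, supply voltage in V).
CANDIDATES = [(66e6, 1.0), (83e6, 1.1), (100e6, 1.2)]

best = pick_operating_point(CANDIDATES, cycles=1_500_000, deadline_s=0.02)
```

With these placeholder numbers the 66MHz point misses the deadline, so the slowest feasible point is selected because it also has the lowest voltage.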
Power modeling
Our power modeling methodology is based on using the Hamming distance between two consecutive input operands to determine switching activity, and on the pre-characterization of the main RT-level modules. Hamming distance has been applied extensively in dynamic power analysis [6], but on its own it captures only part of the behavior of the input vectors: sign extension produces many identical top bits, and multi-function units increase the complexity of modeling [15]. We therefore analyze processing elements further to establish practical, parameterized power models. The parameters of the proposed models are the bitwidth of the PE, the transition probability of the input operands, the operating voltage, and the delay budget allocated to each PE. To truly support various architectures, the power models must be configurable for different timing constraints. Since most soft cores can be synthesized according to the characteristics of the target application, the simulator can also be used for soft processor evaluation, as described next.
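As a minimal sketch of the switching-activity measure described above, the Hamming distance between consecutive operands can be computed and normalized into a transition probability. The function names are illustrative and not taken from the simulator; operands are treated as two's-complement values truncated to the given width.

```python
def hamming_distance(prev, curr, width):
    """Number of bit positions that differ between two width-bit operands."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")


def transition_probability(operands, width):
    """Average fraction of bits that toggle between consecutive operands."""
    if len(operands) < 2:
        return 0.0
    toggles = sum(
        hamming_distance(a, b, width) for a, b in zip(operands, operands[1:])
    )
    return toggles / (width * (len(operands) - 1))
```

Masking before counting makes negative Python integers behave like fixed-width two's-complement words, so sign-extended operands are handled consistently.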
Parameterized power models
We sort RT-level modules into two classes: combinational components and array-based components. Combinational components include the PEs, fetch unit, dispatcher and decoder. We extract their power models through input transition characterization. Figure 2 illustrates our analysis methodology for a DesignWare adder [18]. The curves represent different adder widths, and each point is derived by simulating 1000 input patterns with a specific transition probability. Each curve can serve as a look-up table characterizing the power cost of the adder at a given bitwidth. For components that can have variable bitwidths (such as adders), we employ one curve together with the ratio of average power (Figure 3) to determine the power tables for the other bitwidths. To make these power models reusable, we consider the following factors. In an integrated system, the same type of component may be used to implement units with different timing constraints. Furthermore, different architectures and user-defined synthesis speed targets for soft cores may also change the average load capacitance, depending on the imposed timing constraints. We therefore analyze the influence of timing constraints on load capacitance. Figure 4 shows the relation between average load capacitance and the synthesis-driven timing constraint. The analysis reveals not only the effect of the time budget on a component but also the minimum achievable time budget. With this information, the simulator can find the maximum operating frequency for a given ISA specification, and our power models can be scaled for various delay budgets.
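The look-up-table approach can be sketched as follows. This is an illustration only: the reference curve, the width ratios, and the use of linear interpolation between characterized points are all assumptions made for the example, and none of the numbers come from Figures 2 or 3.

```python
from bisect import bisect_right

# Hypothetical reference curve for a 32-bit adder:
# (transition probability, average power in mW).
REFERENCE_CURVE = [(0.0, 0.10), (0.25, 0.40), (0.5, 0.70), (0.75, 0.90), (1.0, 1.00)]

# Hypothetical ratio of average power relative to the 32-bit reference.
WIDTH_RATIO = {8: 0.25, 16: 0.5, 32: 1.0, 40: 1.25}


def lookup_power(curve, probability):
    """Linearly interpolate power from a sorted (probability, power) table."""
    xs = [p for p, _ in curve]
    if probability <= xs[0]:
        return curve[0][1]
    if probability >= xs[-1]:
        return curve[-1][1]
    hi = bisect_right(xs, probability)
    (x0, y0), (x1, y1) = curve[hi - 1], curve[hi]
    return y0 + (y1 - y0) * (probability - x0) / (x1 - x0)


def adder_power(width, probability):
    """Scale the reference curve by the average-power ratio for this width."""
    return WIDTH_RATIO[width] * lookup_power(REFERENCE_CURVE, probability)
```

Keeping a single characterized curve plus one ratio per bitwidth keeps the stored tables small; a further scale factor for the delay budget could be applied in the same way.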
Most processors provide specific registers for accumulating results. For example, the 40-bit registers in a 32-bit processor allow 256 accumulation operations without loss of precision. However, a 40-bit adder must be employed to perform the operation. Because the extra bits are seldom fully utilized, the extra 8 bits generally hold the sign extension (all zeros or all ones). For a two-operand component, we consider the switching activity of the most significant bits (MSBs) in three situations: one sign-bit transition, two sign-bit transitions, and no sign-bit transition. In other words, we separate the operands into the MSBs and a uniform part, and establish a power table for each situation.
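The three sign-bit situations can be illustrated with a short sketch. It assumes that a sign-bit transition means the sign of an operand differs from its value in the previous cycle, and that the uniform part is the low-order bits below the sign-extension region; both the function names and this interpretation are assumptions for the example.

```python
def sign_bit(value, width):
    """Sign bit of a two's-complement value of the given width."""
    return (value >> (width - 1)) & 1


def sign_transition_case(prev_ops, curr_ops, width):
    """How many of the two operands flip their sign bit: 0, 1 or 2."""
    return sum(
        sign_bit(p, width) != sign_bit(c, width)
        for p, c in zip(prev_ops, curr_ops)
    )


def uniform_hamming(prev, curr, uniform_bits):
    """Hamming distance restricted to the low-order uniform region."""
    mask = (1 << uniform_bits) - 1
    return bin((prev ^ curr) & mask).count("1")


# Two operands of a 40-bit adder in consecutive cycles.
case = sign_transition_case((5, -3), (-5, -7), width=40)
```

The returned case selects one of the three power tables, while the Hamming distance of the uniform part indexes into that table.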
The array-based components, which are triggered by the clock signal, include pipeline registers, memory and register files. We include the power contributed by the clock signal in each array-based power model. For instance, a single register with a synchronous reset signal is analyzed to estimate the power consumption of pipeline registers. The power model of a single register is composed of three coefficients: leakage power, dynamic power and idle power. The dynamic power is the average power consumed when the register latches a different value on every rising clock edge. In contrast, the idle power accounts only for the clock swing, with no switching of the latched bit. The leakage power is the power consumed when there is no switching on either the input or the clock signal. For register files, the models are established in terms of data switching on cells and read ports. The power consumption of a register file can be modeled by the following equation:
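$$P_{RF} = P_{idle} + \sum_{i} \left( P_{wd} + N_{cell,i} \, P_{cb} \right) + \sum_{i} \left( P_{rd} + N_{r,i} \, P_{rb} \right)$$

where the first sum runs over the active write ports and the second over the active read ports, $P_{idle}$ is the idle power, $P_{wd}$ is the average power of the write decoder while one write port is active, $N_{cell,i}$ denotes the number of bits changed by write port $i$, $P_{cb}$ is the average power associated with one changed cell bit, $P_{rd}$ is the average power of the read decoder while one read port is active, $N_{r,i}$ is the Hamming distance between two successive vectors on read port $i$, and $P_{rb}$ is the average power associated with one changed bit on a read port.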
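One plausible reading of the coefficients defined above is an additive model: the idle power, plus a decoder term and a per-bit term for each active write port, plus the same for each active read port. The sketch below follows that reading; the class name, the coefficient values and the additive form itself are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class RegisterFileModel:
    p_idle: float  # idle power
    p_wd: float    # write-decoder power per active write port
    p_cb: float    # power per changed cell bit
    p_rd: float    # read-decoder power per active read port
    p_rb: float    # power per changed bit on a read port

    def power(self, changed_cells_per_write_port, hamming_per_read_port):
        """Each list holds one entry per port that is active this cycle."""
        write = sum(self.p_wd + n * self.p_cb for n in changed_cells_per_write_port)
        read = sum(self.p_rd + n * self.p_rb for n in hamming_per_read_port)
        return self.p_idle + write + read


# Hypothetical coefficients (arbitrary power units).
model = RegisterFileModel(p_idle=1.0, p_wd=0.5, p_cb=0.1, p_rd=0.4, p_rb=0.05)
cycle_power = model.power([4], [10, 0])  # one write port, two read ports
```

A cycle with no active ports falls back to the idle power alone, which is the behavior expected from a clocked but unused register file.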
Multi-function components
Because they rely on resource sharing, existing processors typically use multi-function processing elements to perform various types of operations while targeting full hardware utilization. In a multi-function processing element, each operation mode may be characterized by different power values, especially when the control signals have transitions. Therefore, a single set of switched-capacitance models may not be sufficient to model a multi-function component accurately. We analyze the control signals as well as the input operands. A change in the control signals switches a multi-function component from the function of the previous instruction to that of the next one. For the purpose of computing switching activities, we treat both directions of transition between two types of function as one transition. Figure 5 illustrates the average power of the DesignWare Adder-Subtractor. The curves represent successive additions, successive subtractions, and switching between addition and subtraction. The average power numbers for addition and subtraction are quite close, but the add/sub curve, which represents switching between addition and subtraction, shows higher power consumption. In this case, we establish two power tables for the Adder-Subtractor: one is the mean of the add and sub curves, and the other is the add/sub curve itself. More generally, functions of a multi-function component with similar power behavior can be merged into one type, which significantly decreases the size of the power look-up tables.
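The two-table scheme for the adder-subtractor can be sketched as below. The curve values are hypothetical placeholders rather than data from Figure 5, and nearest-point lookup is used only to keep the example short.

```python
# Hypothetical curves: transition probability -> average power (mW).
ADD_CURVE = {0.0: 0.10, 0.5: 0.60, 1.0: 1.00}
SUB_CURVE = {0.0: 0.12, 0.5: 0.64, 1.0: 1.04}
SWITCH_CURVE = {0.0: 0.30, 0.5: 0.90, 1.0: 1.40}

# Table 1: mean of the add and sub curves, used when the function is unchanged.
STEADY_CURVE = {p: (ADD_CURVE[p] + SUB_CURVE[p]) / 2 for p in ADD_CURVE}


def adder_subtractor_power(prev_op, curr_op, probability):
    """Select the table by control transition, then look up the nearest point.

    add->sub and sub->add are treated as the same transition.
    """
    table = STEADY_CURVE if prev_op == curr_op else SWITCH_CURVE
    nearest = min(table, key=lambda p: abs(p - probability))
    return table[nearest]
```

Because the lookup depends only on whether the operation changed, the number of tables stays at two regardless of the direction of the switch.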
Experimental results
In this section, we describe how the simulator has been validated. Our experimental architecture is the PAC VLIW DSP v1.0 (Table 1) developed by the Industrial Technology Research Institute, and the benchmark suite consists of a series of DSP algorithm kernels, including the BDTI benchmarks [3]. Validation has been performed for all modules and for the entire processor in terms of detailed power distribution, as well as for the influence of synthesis timing constraints. The experiments are performed on a SparcII 500MHz machine running the SunOS 5.9 operating system. ModelSim SE 5.8b and Synopsys PrimePower 2002.05 are used for gate-level simulation and power analysis, respectively. Our simulator is more than 1000 times faster than the combination of PrimePower and ModelSim. Table 2 compares the average power reported by our simulator with that of the gate-level power analysis flow for the entire DSP core. The DSP core is synthesized at 100MHz. The error is within 10% in all cases. In addition, we extract the power numbers for each component type. Figure 6 illustrates the share of power consumed by the register files, which consist of accumulation registers, address registers and data registers. Their average power consumption is about 28% of that of the entire DSP core. Figure 7 shows the processing elements, whose power consumption depends on the active elements and input vectors. Figure 8 shows the power consumption of the pipeline registers.
For a soft core, the timing constraints provided for synthesis can be specified according to the target applications. Moreover, the effects of a tight timing constraint are reflected on the critical path. Table 3 demonstrates the modeling of the MAC element, which lies on the most critical path of the PAC VLIW DSP, under different synthesis frequencies. The target synthesis frequency used in the simulator is set to 66MHz, 83MHz and 100MHz. We validate the benchmarks that contain MAC-related instructions. As shown in Table 3, the results show reasonable accuracy across the different timing constraints.
Case study of clock gating
We have also performed a detailed analysis of the power distribution across various datapath modules (Figures 6-8). As seen in Figure 8, approximately 35% of the total power consumption is contributed by pipeline registers, while register files consume 30% of the total power budget (Figure 6).
For benchmarks that have fewer parallel instructions because of longer data dependency chains, many resources of a multi-way processor are idle yet still consume power. Clock gating is a well-known technique for pipeline power reduction [16, 10], especially in multi-issue processors. We employ this technique in our simulator to investigate the influence of clock gating on pipeline registers and register files (Figure 9). We consider three levels: cluster-based (CB), register file (RF) and pipeline register (PR). CB is coarse-grained clock gating: the entire data path is partitioned into clusters and one scalar unit, and the clock signal of a cluster is gated while no valid instruction is traveling through it. RF focuses on register files. Each cluster has one address register file, one accumulation register file and two data register files. We examine the instructions in the read-operand and write-back stages to determine whether a register file is idle. For PR, each instruction in the data path is analyzed to determine whether any valid instruction will be transferred through the pipeline registers. For instance, the clock signal of the pipeline register between the execution stage and the memory-access stage is gated while the instruction in the execution stage is invalid. The corresponding leakage power is taken into account when clock gating is used. The simulator provides these data through fast simulation. The estimates show the percentage of power saved and allow the different strategies to be compared. Because they are obtained so efficiently, they can guide the choice among architecture optimizations without frequent runs of the gate-level analysis flow.
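The pipeline-register (PR) level of gating can be illustrated with a small sketch. It assumes per-bit dynamic, idle and leakage coefficients for a register, and that a gated register consumes only leakage power in cycles where the incoming instruction is invalid. All coefficient values and the function name are hypothetical.

```python
def average_register_power(valid, changed_bits, width,
                           p_dynamic, p_idle, p_leak, gated):
    """Average power of one pipeline register over a trace.

    valid[i]        -- whether the instruction entering in cycle i is valid
    changed_bits[i] -- number of latched bits that change in cycle i
    """
    total = 0.0
    for is_valid, changed in zip(valid, changed_bits):
        if gated and not is_valid:
            # Clock is gated: only leakage remains.
            total += width * p_leak
        else:
            # Changed bits pay dynamic power, the rest pay idle (clock) power.
            total += changed * p_dynamic + (width - changed) * p_idle
    return total / len(valid)


# Hypothetical 16-bit register trace with two bubble cycles.
VALID = [True, False, False, True]
CHANGED = [8, 0, 0, 4]
ungated = average_register_power(VALID, CHANGED, 16, 1.0, 0.4, 0.01, gated=False)
gated = average_register_power(VALID, CHANGED, 16, 1.0, 0.4, 0.01, gated=True)
saving = 1.0 - gated / ungated
```

The same loop structure would apply to the CB and RF levels, with the validity test replaced by the idleness condition of the cluster or register file.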
Conclusion and future work
Processor design is becoming more complex, and energy consumption is now the primary design issue for portable devices. Iterating between synthesis, gate-level simulation and power analysis is very expensive and may lead to time-to-market violations. We have presented a simulator that is at the heart of an architecture-level power/performance estimation framework providing fast simulation and reasonable accuracy. For user evaluation, our simulator relies on parameterized power models to enable power/performance evaluation of both hard cores and soft cores.
Future work for this simulator will include power analysis for the memory subsystem. In general, the memories in processors are macro blocks, so our memory models will be established according to the characteristics of memory macros. Moreover, our power models will be augmented to include information about more aggressive technologies (90 and 65nm).
