Power consumption is currently a critical design constraint in embedded applications, and application software can have a substantial impact on it.
1 Data-intensive computation applications typically employ digital signal processors (DSPs) to improve the application's support for real-time processing.
DSPs are characterized by complex architectures: a deep pipeline, very long instruction word (VLIW) instructions, memory caches, and sometimes, superscalar architectures. This complexity makes it difficult to develop suitable modeling methodologies for analyzing power consumption in these architectures.
At present, it's possible to conduct power consumption estimation at the cycle or instruction level. Cycle-level simulation-based methods can require lengthy computation times. Tools implementing these methods, such as SimplePower and Wattch, 2,3 also require detailed information about the microarchitecture. Such details, however, are often unavailable for off-the-shelf processors.
Another classical approach is instruction-level power analysis (ILPA), which consists of current measurements for each instruction and each interinstruction effect (a pair of successive instructions). 4 This method achieves a low margin of error (typically 2 to 4 percent) for simple processors, but it presents several drawbacks for complex architectures.
First, the number of measurements required relates directly to the complexity of the processor architecture (for example, ILPA for a DSP 56K requires 1,176 measurements). 5 Moreover, the instruction-level model requires improvement to account for new characteristics in the latest generation of processors: pipelines, complex VLIW instruction sets, and memory caches. Recent work has added a functional approach 6 or a generic memory model 7 to ILPA, but very few studies of VLIW processors consider pipeline stalls. 8 Furthermore, most of these methods use a low-level (detailed) representation of the processor, resulting in a complex model.
We propose an original approach that estimates the power and the energy consumption from software parameters characterizing the algorithm activity. Developing the processor's power model relies on a minimum knowledge of the architecture derived from a functional analysis of the processor dissipation, combined with a reduced set of physical measurements that identify the model parameters. We chose the Texas Instruments TMS320C6201 to validate our approach because its complex architecture prohibits the use of the ILPA method: even without considering the addressing modes, ILPA for this processor requires 71 8 measurements. Here, we propose the complete model of this processor, including pipeline stalls, on- and off-chip memory transfers, and cache issues.
Power consumption estimation methodology
The aim is to estimate the power and energy consumption during application execution on the target processor. Therefore, before making this estimation, the complete processor model must be available. The estimation methodology thus has two linked parts: model definition and the estimation process.
We perform model definition once for each processor. It is first based on a functional-level power analysis (FLPA) of the target processor. This analysis must permit us to discern which parameters have a significant impact on the global power consumption. Then, in the characterization step, we determine consumption laws describing the average supply current's evolution relative to these parameters, either by simulation or by measurements. This model definition results in the processor's power model, which uses the selected parameter values as inputs and accounts for all processor power and energy consumption sources such as parallel execution, cache misses, the pipeline, computing units, and internal memories.
The estimation process computes the power consumption of an application. The designer sets the architectural parameter values; the estimation tool directly computes the algorithmic parameters from the compiled code through simple profiling. It is then easy to obtain the final power and energy estimates, thanks to the consumption laws included in the power model.
Model definition
Instead of the classical instruction characterization, our FLPA model relies on activity parameters. Of these parameters, algorithmic parameters depend on the code's execution on the target processor; this group includes the parallelism and cache miss rates. Parameters independent of code execution, called architectural parameters, include clock frequency and memory mode; the designers define these parameters together with the application.
As represented in Figure 1, the FLPA first divides the targeted architecture into blocks and subblocks, defining them relative to their common activation when the processor operates. The algorithmic parameters represent the activity rate of these blocks and their interactions. After this initial division, designers write various elementary programs to stimulate each block relative to the parameters. They measure supply current I total and determine the consumption laws by curve fitting.

SEPTEMBER-OCTOBER 2003
[Figure 1: the processor divided into Block 1, Block 2, and Block 3. Step 1: functional-level power analysis. Step 3: consumption laws determination.]
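The curve-fitting step can be illustrated with a small least-squares sketch. It is not the authors' tooling: the function name `fit_bilinear_law` is hypothetical, the bilinear form a·α·F + b·F + c·α + d is assumed because it matches the shape of the cache-mode law reported later in this article, and the samples are synthetic points generated from that published law rather than real measurements.

```python
def fit_bilinear_law(samples):
    """Least-squares fit of I = a*alpha*F + b*F + c*alpha + d.

    samples: iterable of (alpha, f_mhz, current_ma) tuples.
    Returns the coefficients [a, b, c, d].
    """
    rows = [([alpha * f, f, alpha, 1.0], cur) for alpha, f, cur in samples]
    n = 4
    # Normal equations (A^T A) x = A^T y, stored as an augmented matrix.
    m = [[sum(r[i] * r[j] for r, _ in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in rows)]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


# Synthetic "measurements" generated from the published cache-mode law
# (gamma = 0); a real characterization would use bench measurements instead.
PUBLISHED = (4.36, 4.09, 187.83, 53.45)
samples = [
    (alpha, f, PUBLISHED[0] * alpha * f + PUBLISHED[1] * f
     + PUBLISHED[2] * alpha + PUBLISHED[3])
    for alpha in (0.125, 0.25, 0.5, 1.0)
    for f in (50.0, 100.0, 150.0, 200.0)
]
coeffs = fit_bilinear_law(samples)
```

With noise-free synthetic data the fit simply recovers the coefficients it was generated from; with real measurements the residuals would indicate whether the assumed bilinear form is adequate.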
FLPA on the TMS320C6201
We applied FLPA to the TMS320C6201 and developed a complete power model. This processor has a deep pipeline, a VLIW instruction set, and parallelism capabilities (issuing up to eight instructions in parallel). Figure 2a shows three of the architecture's four blocks: the instruction management unit (IMU), processing unit (PU), and memory management unit (MMU). Each is composed of several subblocks, and together they contain every configuration device in the DSP.
We do not show the fourth block, the control unit (CU), because this analysis considers only blocks with significant activity and therefore discards the CU for signal processing applications. However, even for signal processing applications, the analysis must account for power dissipation by the pipeline control and sequencer.
A program can use internal program memory in different modes. In mapped mode, all the instructions are in internal memory. In bypass mode, all the instructions are in external memory. In cache mode, the processor uses internal program memory as a direct-mapped cache. The freeze mode is similar to the cache mode but doesn't permit any writing to memory. The processor also contains an external-memory interface (EMIF), used to load data and programs from the external memory. The processor's clock frequency F can reach 200 MHz. 9
The C6x pipeline in Figure 2b consists of 11 elementary stages usually clustered into three parts. The fetch stage loads the instructions from program memory. In the decode stage, the IMU first dispatches instructions to the correct unit (DP step), then the receiving processing unit decodes them (DC step). Finally, five execution steps process the instruction.
The FLPA first groups architectural components with concurrent activity into functional blocks. For example, when the pipeline executes a no-op instruction, it activates only the first five pipeline steps (PG to DP). In this case, the decode stage's DP step is actually associated with the fetch stage in the IMU. In contrast, the DC step is included in the PU.
Then we identify the diverse power consumption sources, representing them by links in the functional diagram. We associate each source with an algorithmic parameter, expressing the activity rate of the block and its interactions. For a complex architecture such as the TMS320C6201, we defined a set of five algorithmic parameters, representing the most significant impacts on the final power consumption. Parallelism rate α assesses the flow between the fetch stages and the program memory controller inside the IMU. Processing rate β between the IMU and PU represents the utilization rate of the processing units (the arithmetic logic unit and the multiplier). Program cache miss rate γ expresses the activity rate between the IMU and MMU. Parameter τ corresponds to the external data memory access rate. Direct memory access (DMA) utilization rate ε represents the activity level between the data memory controller and the DMA. All these parameters are representative metrics of the code.
The architectural parameters set by designers are clock frequency F; memory mode MM, defining the mode of use for the internal program memory; data mapping DM, indicating whether the data is in internal memory and in which bank; and data width W, regulating the size of the data transferred by the DMA.
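As an illustration of how these two parameter groups could be carried through an estimation tool, here is a minimal sketch. The class and field names are hypothetical and do not come from SoftExplorer; the sketch also assumes that every algorithmic rate is normalized to the range [0, 1], which the text implies for α and β but does not state for the others.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryMode(Enum):
    """Modes of use for the internal program memory (parameter MM)."""
    MAPPED = "mapped"
    BYPASS = "bypass"
    CACHE = "cache"
    FREEZE = "freeze"


@dataclass(frozen=True)
class AlgorithmicParams:
    alpha: float    # parallelism rate
    beta: float     # processing rate
    gamma: float    # program cache miss rate
    tau: float      # external data memory access rate
    epsilon: float  # DMA utilization rate

    def __post_init__(self):
        # Assumption: all rates are normalized to [0, 1].
        for name, value in vars(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")


@dataclass(frozen=True)
class ArchitecturalParams:
    f_mhz: float    # clock frequency F, up to 200 MHz on this DSP
    mm: MemoryMode  # memory mode MM
    dm: str         # data mapping DM (free-form label in this sketch)
    w_bits: int     # data width W moved by the DMA

    def __post_init__(self):
        if not 0.0 < self.f_mhz <= 200.0:
            raise ValueError("F must be in (0, 200] MHz")
        if self.w_bits not in (8, 16, 32):
            raise ValueError("W must be 8, 16, or 32 bits")


algo = AlgorithmicParams(alpha=0.5, beta=0.25, gamma=0.0, tau=0.0, epsilon=0.0)
arch = ArchitecturalParams(f_mhz=200.0, mm=MemoryMode.MAPPED,
                           dm="internal", w_bits=32)
```

Validating the inputs at construction time keeps out-of-range values from silently propagating into the consumption laws.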
TMS320C6201 characterization
After performing this functional analysis, we must precisely determine the consumption laws characterizing the processor's power dissipation to obtain the complete power model. To do so, we measure the processor core's average supply current, I total, in relation to the variations of each algorithmic and architectural parameter. We used small programs called scenarios (unbounded loops written in assembly language) to stimulate each functional block or subblock separately, and we measured the absolute or relative part of this element in the total dissipation. Curve fitting this data yields the consumption laws. We take these measurements on the core supply pad (with supply voltage V DD = 2.5 V) and do not include external memory; we obtain these measurements from an evaluation board and average them over 10,000 values. Moreover, on this processor, the branch impact on dissipation is negligible when the loop size is at least 500. Using the program's execution time, T exe, we also compute the average energy.
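The energy expression did not survive in this copy of the article. The sketch below assumes the standard relation between supply voltage, average current, and execution time (E = V DD × I total × T exe), with current in milliamperes as elsewhere in the text; the function names are illustrative.

```python
V_DD = 2.5  # core supply voltage in volts, as stated in the text


def average_power_w(i_total_ma, vdd_v=V_DD):
    """Average core power in watts from the average supply current in mA."""
    return vdd_v * i_total_ma / 1000.0


def energy_j(i_total_ma, t_exe_s, vdd_v=V_DD):
    """Energy in joules, assuming E = V_DD * I_total * T_exe."""
    return average_power_w(i_total_ma, vdd_v) * t_exe_s


# Example with round, purely illustrative inputs: 1,000 mA drawn for 2 s.
example_energy = energy_j(1000.0, 2.0)
```

Because the model predicts an average current, power follows from one multiplication, and energy additionally needs only the execution time obtained from profiling or simulation.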
Instruction management unit
Because the TMS320C6201 uses VLIW instructions, the consumption of its fetch stages is significant and depends on both clock frequency F and parallelism rate α. To evaluate the variation of IMU consumption with α, we ran a scenario on the DSP in the mapped memory mode (only the IMU is activated when executing no-ops). Because the current varies linearly with both F and α, the highest value occurs at the maximum frequency (200 MHz) together with the highest parallelism (α = 1). Measured current I mapped includes the current used by the fetch stage, clock tree, and program memory. The measured current is then
where I mapped is in milliamperes and F is in megahertz.
The term 4.19 F corresponds to the clock frequency dissipation and agrees with the value given in the Texas Instruments documentation (b = 4.21 mA/MHz) to within 1 percent. 9 Moreover, as Table 1 shows, for a given parallelism rate, the energy per iteration is independent of F. Thus, execution becomes more energy efficient as the parallelism rate increases.
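The agreement between the fitted frequency term and the documented value can be checked with a few lines of arithmetic. The variable names are illustrative; the two slopes are the ones quoted in the text.

```python
measured_slope = 4.19   # mA/MHz, frequency term of the fitted law
datasheet_slope = 4.21  # mA/MHz, value b quoted from the TI documentation

# Relative gap between the fitted and the documented slope.
relative_gap = abs(measured_slope - datasheet_slope) / datasheet_slope
print(f"{relative_gap:.2%}")  # prints 0.48%
```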
Processing unit
To determine the PU consumption variations, we employed the same method. In this scenario, we tuned the number of processing units used per cycle, β, by varying the number of no-ops in a set of eight instructions executed in parallel. We also took measurements for several values of F. No significant difference appeared among the various types of operations: for this processor, an addition or a multiplication dissipates nearly the same amount of power, and the same holds for a read or write to internal memory. Moreover, data correlation has just a 1.5 percent effect on the global energy consumption.
Of course, we limit such remarks to this processor; earlier studies demonstrated that, for other processors like the ARM7, the power behavior might be different. 7 However, other researchers have already chosen to average the effect of data correlation. 10 It seems that the architectural complexity of the C6x hides many activity variations: the fetch stage and the pipeline stalls dominate the consumption of such a processor.
Finally, the IMU consumption is simply
Pipeline stalls
Pipeline stalls have a very strong impact on the energy consumption. Indeed, a stall stops both the instruction fetch and execution stages. If you neglect control hazards in signal processing applications, the various causes of pipeline stall are
• a delayed program memory access, in the case of, for example, a cache miss (quantified by γ);
• a delayed data memory access, which can occur, for example, when the processor must fetch data from external memory (quantified by τ); or
• an inadequate data placement in internal memory (related to the DM parameter).
Every time the pipeline stalls, it drastically reduces the parallelism and processing rates. Hence, we define effective values α′ and β′
where PSR is the pipeline stall rate. We then integrate these effective values into the equations described earlier for I mapped and I PU .
Because γ and τ directly affect PSR, PSR already accounts for the τ and DM parameters. The power model therefore requires only five algorithmic parameters: α, β, γ, ε, and PSR.
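The defining equations for α′ and β′ are missing from this copy. One reading consistent with the surrounding text, offered here only as an assumption, is that each rate is scaled by the fraction of cycles in which the pipeline is not stalled: α′ = α (1 − PSR) and β′ = β (1 − PSR). The function name below is illustrative.

```python
def effective_rate(rate, psr):
    """Scale a nominal rate by the fraction of non-stalled cycles.

    Assumed form (not confirmed by the source): rate' = rate * (1 - PSR).
    """
    if not 0.0 <= psr <= 1.0:
        raise ValueError("PSR must lie in [0, 1]")
    return rate * (1.0 - psr)


# Example: full parallelism with the pipeline stalled a quarter of the time.
alpha_eff = effective_rate(1.0, 0.25)
beta_eff = effective_rate(0.5, 0.25)
```

Under this assumption, a program with no stalls keeps its nominal rates, and a permanently stalled pipeline drives both effective rates to zero.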
Memory modes
To determine how the dissipation evolves as the memory mode varies, we took measurements for a scenario composed of only no-ops, which particularly stimulates the IMU. This scenario uses, if necessary, the external synchronous-burst static RAM: the SBSRAM connects directly to the core so that I/O buffers are idle. We discussed the mapped mode earlier. In bypass mode, all the instructions are in external memory; parallelism is then useless, and the dissipation is linear with frequency, according to the following expression:
The use of the cache memory implies a dissipation overhead for the cache itself and for the external accesses when a cache miss occurs (quantified by γ). Then, for a scenario with no cache miss, we obtain similar results by comparing the freeze and cache modes. The evolution of consumption in cache mode depends on several parameters, such as parallelism rate α, frequency F, and cache miss rate γ. Table 2 shows this variation.
When no cache miss occurs, γ = 0, and the relation is linear:
I cache = 4.36 α′ F + 4.09 F + 187.83 α′ + 53.45
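The no-miss cache-mode law can be evaluated directly. The sketch below uses the published coefficients; it assumes the same units as the mapped-mode law (current in milliamperes, F in megahertz), and the function name is illustrative.

```python
def i_cache_no_miss_ma(alpha_eff, f_mhz):
    """Cache-mode IMU current in mA when gamma = 0 (published coefficients).

    alpha_eff is the effective parallelism rate; f_mhz is the clock in MHz.
    """
    return (4.36 * alpha_eff * f_mhz + 4.09 * f_mhz
            + 187.83 * alpha_eff + 53.45)


# Highest-consumption corner of the law: maximum frequency and parallelism.
i_max = i_cache_no_miss_ma(1.0, 200.0)
# Corresponding core power at the 2.5 V supply, in watts.
p_max = 2.5 * i_max / 1000.0
```

The F-only term (4.09 F) stays close to the 4.19 F clock term found in mapped mode, which suggests that the frequency-dependent baseline changes little between the two memory modes.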
Otherwise, the results imply a complex relation among the parameters, as the dependence on α in Table 2 indicates. In fact, each cache miss causes pipeline stalls, so the effective parallelism rate differs from the initial α value. The shaded cells in Table 2 indicate where parallelism no longer affects the power dissipation. The following expression gives these variations: is independent of the frequency. The cache mode's overhead cost relative to the freeze mode never exceeds 2.7 percent of the global power consumption; the impact of writing in the cache is therefore negligible even when the cache miss rate is high.
DMA
The dissipation due to the DMA, I DMA, is an overhead on the global consumption; it depends only on DMA utilization rate ε, transferred data width W (8, 16, or 32 bits), and clock frequency F.
To sum up, the required architectural parameters are MM, F, and W. The algorithmic parameters are α, β, γ, ε, and PSR. The following expression yields I total, where I IMU represents one of the currents expressed previously (I mapped, I bypass, I freeze, or I cache):
The expressions thus obtained are more complex than those derived from a linear regression analysis because not all the parameters are independent. This could explain why earlier work 11 shows significant errors.
Power characterization of the design complexity
If we consider the C6x pipeline depth (with N the number of pipeline stages), a no-op involves five steps (the four fetch steps and DP); an addition, seven steps (four fetch steps, two decode steps, and one execution step); a multiply, eight steps (with two execution steps); and a load instruction, 11 steps (with five execution steps). Table 4 gives the corresponding dissipated current in milliamperes. The evolutions are linear with F and α. Only two multiply or load instructions can execute in parallel (then α ≤ 0.25). ∆N), where ∆I and ∆N are the differences in current and in step number between consecutive columns of Table 4. I step represents the average power cost of each pipeline step relative to the number of instructions involved.
These results confirm that there is no significant dissipation difference between an addition and a multiply instruction because the second execution step (for multiply) consumes only 4 mA per execution against around 60 mA per execution for the addition. Moreover, the average dissipation of the last three execution steps (E3 to E5) is around 95 mA per execution, which shows that the load instruction consumes a great deal of power. These results also confirm that the PU's dissipation is linear with β. Increasing the number of registers will probably reduce the number of load instructions and, in turn, the global power consumption.
On the other hand, the first five steps (the fetch stage and the DP, defined as the IMU) have an average dissipation that depends on α, which explains why it is energy efficient to obtain the maximum parallelism. Moreover, when the entire pipeline is activated, the IMU part represents 57 to 71 percent of the consumption. The second part (DC and first execution step) represents around 10 percent, and the second execution step is less than 0.5 percent. The last stage (including the three last execution steps) represents from 20 to 30 percent of the global dissipation.
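The step counts quoted above follow from how far each instruction class travels through the pipeline. The sketch below only restates that arithmetic; the table and function names are illustrative, and the bound on α assumes eight instructions per fetch packet, as stated elsewhere in the article.

```python
FETCH_STEPS = 4  # the four fetch steps

# (decode steps, execution steps) traversed by each instruction class.
STAGE_USE = {
    "nop": (1, 0),       # DP only, no DC and no execution step
    "add": (2, 1),       # DP + DC, then one execution step
    "multiply": (2, 2),  # DP + DC, then two execution steps
    "load": (2, 5),      # DP + DC, then five execution steps
}


def pipeline_steps(kind):
    """Number of pipeline steps activated by one instruction of this class."""
    decode, execute = STAGE_USE[kind]
    return FETCH_STEPS + decode + execute


def alpha_upper_bound(max_parallel, fetch_packet_size=8):
    """Largest parallelism rate when at most max_parallel instructions
    of a fetch packet can execute together."""
    return max_parallel / fetch_packet_size


steps = {kind: pipeline_steps(kind) for kind in STAGE_USE}
```

With only two multiply or load instructions executing in parallel, `alpha_upper_bound(2)` gives the α ≤ 0.25 limit mentioned in the text.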
Based on these results, we conclude that the superscalar architecture, although it increases power consumption, is an energy-efficient choice. In contrast, it is difficult to draw the same conclusion about a superpipeline architecture: power dissipation in the execution steps is highly variable. The integration of a complex arithmetic operator (like the Viterbi decoder in the C54), although it increases the number of execution steps, might imply no significant power dissipation overhead if it corresponds to the second execution step (E2). Finally, the IMU makes a major contribution to global power consumption, so more complexity in this part (a decoder for multimedia instructions, for example) might imply more power consumption. In this case, power optimization efforts must focus on this functional part.
Model validation
To validate our power model, we compared the estimated consumption with measurements for several signal-processing applications.
Estimation process
The estimation process consists of computing, from the source code, the parameter values that become the inputs to the processor model. Actually, we must only determine the algorithmic parameters because the architectural parameters are part of the application. The TMS320C6201 fetches eight instructions at the same time, forming a fetch packet (FP). Within this fetch packet, operations are further divided into execution packets (EP), depending on the available resources and the parallelism capability. 9 We then compute parallelism rate α and processing rate β as follows:
NFP and NEP are the average numbers of FPs and EPs. NPU is the average number of processing units used, and NPU max is the maximum number of processing units; here, NPU max = 8. In the example of Figure 3, NFP = 1, NEP = 3, and NPU = 7 (the load and store instructions also involve a processing unit). Parameters α and β are then easily computed from the static profiling of the assembly code, which extracts NFP, NEP, and NPU.
Parameter extraction is necessary for each part of the program (loop, subroutine, and so on) for which you desire local values. We compute global parameter values, those for the complete source, by averaging all the local values. Such an approach permits us to spot hot spots in the program.
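The expressions for α and β are missing from this copy. The sketch below assumes α = NFP / NEP and β = NPU / (NPU max × NEP). These forms are inferred, not quoted: they give α = 1 when every fetch packet executes as a single execution packet, and α = 0.25 when a fetch packet splits into four execution packets, matching the bounds mentioned in the article. The function names are illustrative.

```python
NPU_MAX = 8  # maximum number of processing units on the TMS320C6201


def parallelism_rate(nfp, nep):
    """Assumed form: alpha = NFP / NEP (fetch packets per execution packet)."""
    return nfp / nep


def processing_rate(npu, nep, npu_max=NPU_MAX):
    """Assumed form: beta = NPU / (NPU_max * NEP), units used per cycle
    normalized by the number of available units."""
    return npu / (npu_max * nep)


# Values quoted for the Figure 3 example: NFP = 1, NEP = 3, NPU = 7.
alpha = parallelism_rate(1, 3)
beta = processing_rate(7, 3)
```

Under these assumed forms, the Figure 3 example would give α of about 0.33 and β of about 0.29.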
Results
We applied this power estimation method to classical digital signal processing algorithms: a finite impulse response (FIR) filter, a least-mean-square (LMS) filter, a discrete wavelet transform with two image sizes: 64 × 64-pixel (DWT1) or 512 × 512-pixel (DWT2), an enhanced full-rate (EFR) vocoder based on the global system for mobile communication (GSM) standard, and an MPEG application. Table 6 presents the results for these applications, using different memory modes-mapped, bypass, cache, and freeze-and different data placement (whether in internal or external memory). Nominal clock frequency F is 200 MHz.
The average error between the estimates and the measurements was 2.5 percent; the maximum error was 4 percent. These results validate our approach. We have used this model to make an estimation directly at the C-level. 12 
Extension to other processors
From this generic power model for a processor, our high-level approach is easily extended to other targets, and we conduct the FLPA in the same way. The characterization step allows us to select the relevant parameters and to determine new consumption laws. Table 7 presents all the generic parameters defined in our power model, the selected parameters for various targets, and the maximum error between estimates and measurements. The time required to characterize a new processor ranged from 15 days for the ARM7 to around 30 days for the C67.
The C67 is a floating-point processor with an architecture close to that of the C62. These two DSPs therefore have the same parameters for the processor model but different consumption laws. The C55 is a low-power fixed-point DSP. Even if it can execute two instructions in parallel, it only fetches one instruction in each clock cycle, so we do not include the α parameter in its model. Furthermore, its architecture never leads to pipeline stalls. Its internal program memory can also use one of four modes like that of the TI C6x, and it also contains a DMA. Because the C55 architecture is less complex than that of the C6x, its power model has fewer parameters. The ARM7TDMI has a scalar architecture, and its internal program memory can use one of three modes (mapped, cache, or bypass). In our measurements, the global consumption never varies by more than 8 percent with the executed software, corroborating previous work on the StrongARM. 10 For these reasons, its model depends only on architectural parameters. For digital signal processing applications, the estimation accuracy for all these target processors is similar to the results presented for the C62.
In this article, we have described the complete power model for the TMS320C6201 and shown that our methodology can easily characterize VLIW processors (the C62 and C67), low-power processors (C55), and also general processors (ARM7TDMI) with similar results. As part of this work, we developed an automatic estimation tool, SoftExplorer, which is available at http://lester.univ-ubs.fr:8080/.
We have complemented our approach by developing a power estimation methodology that works directly from the C code, so current work concerns the extension of the SoftExplorer tool to this C-level estimation. We also want to add other processor models, to increase the number of target choices and provide reliable power and energy comparisons for a given application. Future work will apply the functional approach presented here to model the power consumption of complete systems, including processors, DSPs, field-programmable gate arrays, the memory, bus, and so on.
Nathalie Julien is an associate professor at the University of South Brittany in Lorient, France; she also works at the Lester Laboratory in high-level design methods applied to low-power constraints for dedicated circuits, FPGAs, and DSPs. Her research interests include power estimation for complex processors and high-level synthesis that integrates power optimization and memory issues. Julien has a PhD in electronics from the University of Limoges, France. She is a member of the ACM, SIGDA, and SIGARCH.
Johann Laurent is an associate professor at the University of South Brittany and works at the Lester Laboratory. His research interests include software consumption estimation and power characterization for complex processors. Laurent has a PhD in electronics from South Brittany University.
Eric Senn is an associate professor at the University of South Brittany and a member of the Lester Laboratory. His research interests include low-power design, architecture synthesis, and asynchronous circuits. Senn has a PhD in electronics from Paris XI University.
Eric Martin is a full professor at the University of South Brittany in Lorient and director of the Lester Laboratory. His research interests focus on advanced electronic design automation dedicated to real-time signal processing applications, including system specification, high-level synthesis, intellectual-property reuse, low-power design, systems on a chip, and platform prototyping. Martin has a PhD in electronics from the University of Paris XI, France. He is a member of the IEEE, the IEEE Computer Society, and the IEEE Circuits and Systems Society.
Direct questions and comments about this article to Nathalie Julien, Lester Centre de Recherche, rue de Saint Maudé BP 92116, 56321 Lorient Cedex, France; nathalie.julien@univ-ubs.fr.
