This paper presents a technique used to do power analysis of a real processor at the architectural level. The target processor integrates a 16-bit DSP and a 32-bit RISC on a single chip. Our power estimator provides power consumption data of the architecture based on the instruction/data flow stream. We demonstrate the accuracy of the estimator by comparing the power values it produces against measurements made by a gate level power simulator for the same benchmark set. Our estimation approach has been shown to provide very efficient, accurate power analysis at the architectural level.
1 Introduction
As power dissipation has become a critical issue in many VLSI systems, power analysis at the architectural level has become more important because of its efficiency. It also allows the designer to experiment with design tradeoffs to lower power consumption. Most of the research in this area falls in the category of empirical methods, which "measure" the power consumption of existing implementations and produce models based on those measurements. This macromodeling technique can be subdivided into three subcategories.
The first approach, introduced in [7], is a fixed-activity macromodeling strategy called the Power Factor Approximation (PFA) method. The energy models are parameterized in terms of complexity parameters and a PFA proportionality constant. Thus, the intrinsic internal activity is captured through this PFA constant. This approach implicitly assumes that the inputs do not affect the switching activity of the hardware block.
To remedy this weakness of the fixed-activity approach, activity-sensitive empirical energy models have been developed. They are based on predictable input signal statistics, such as those used in the SPA method [3, 4, 5]. Although the individual models built in this way are relatively accurate (the error rate is 10% to 15%), overall accuracy may be sacrificed because correct input statistics are unavailable or because the interactions cannot be modeled correctly.
The third empirical approach, transition-sensitive energy models, is based on input transitions rather than input statistics. The method presented in [2] assumes an energy model is provided for each functional unit: a table containing the power consumed for each input transition. The authors give a scheme for collapsing closely related input transition vectors and energy patterns into clusters, thereby reducing the size of the table for each functional unit. A significant reduction in the number of clusters, and also in the size of the tables and the effort necessary to generate them, can be obtained while keeping the maximum error within 30% and the root mean square error within 10% to 15%. After the energy models are built, it is not necessary to use any knowledge of the unit's functionality or to have prior knowledge of any input statistics during the analysis. However, this approach has never been used to analyze real commercial processors. Thus, it remains to be validated as a viable technique. Our work follows the third approach. The accuracy of our power estimator will validate the correctness of this methodology.
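As an illustration of the transition-sensitive idea, the following minimal Python sketch looks up an energy value for each (previous input, present input) pair and accumulates it over an input stream. The names `build_toy_table` and `stream_energy` are hypothetical, and the table contents (one energy unit per toggled input bit) are a placeholder assumption, not characterization data from [2] or from our estimator.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical energy table: (previous input, present input) -> energy.
EnergyTable = Dict[Tuple[int, int], float]


def build_toy_table(width: int) -> EnergyTable:
    """Fill a table with a placeholder cost: one unit per toggled input bit.

    A real table would hold values characterized by circuit simulation.
    """
    size = 1 << width
    return {
        (prev, cur): float(bin(prev ^ cur).count("1"))
        for prev in range(size)
        for cur in range(size)
    }


def stream_energy(table: EnergyTable, inputs: Iterable[int], initial: int = 0) -> float:
    """Sum the table energy over consecutive input transitions."""
    total = 0.0
    prev = initial
    for cur in inputs:
        total += table[(prev, cur)]
        prev = cur
    return total


table = build_toy_table(2)
total = stream_energy(table, [0b01, 0b11, 0b00])
```

Note that no input statistics and no knowledge of the unit's function are needed at analysis time; only the sequence of input vectors is used.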
Our simulator imitates the behavior of the processor in each clock cycle as it executes a set of benchmark programs. It also collects power consumption data on the functional units (ALU, MAC, etc.) exercised by the instruction and its data. The results of our power estimator are compared with the power consumption data provided by [6]. This comparison confirms the accuracy of our power estimation technique.
The rest of this paper consists of four sections. Section 2 overviews the processor architecture and describes the simulation strategy. Section 3 presents the power estimation approach. Section 4 covers the power benchmarks and discusses the validation results. Finally, Section 5 draws the conclusion and suggests related future work.
2 Architectural Simulation of the Processor
2.1 Processor Overview
The processor considered in this research integrates a 32-bit RISC processor and a 16-bit DSP on a single chip. Figure 1 shows its main block diagram. The processor core includes the X-memory, the Y-memory, the buses, the CPU engine and the DSP engine. The CPU engine includes the instruction fetch/decode unit, the 32-bit ALU, the 16-bit Pointer Arithmetic Unit (PAU), the 32-bit Addition Unit (AU) for PC increment, and 16 general purpose 32-bit registers. As the ALU calculates the X-memory address, the PAU can calculate the Y-memory address. Additional registers in the CPU support hardware looping and modulo addressing. The DSP engine contains the MAC, the ALU, the shifter, and 8 registers (six 32 bits wide and two 40 bits wide).

The first part of the simulation effort was to build an assembler which generates machine code from benchmark programs written in the assembly language of the processor.

The second part is the controller, which generates control signals for each functional unit when instructions are running on the simulator. These signals are used to control the behavior of the functional units. For example, when a load instruction puts data from the memory onto the IDB bus, the "MEMtoIDB" control bit is set.

The third part of the simulator specifies each functional unit in the datapath. As the control signals are set, the functional units are activated. For each unit, there is a function routine which gathers all event activities.
Figure 3 shows an example of these specification functions. In this example, when a control signal is set, the IDB will receive data from one of its sources. This input data to the functional unit will be used to determine its power consumption. Finally, the simulator will provide the precise behavior and functional unit state information. The behavior information describes which instruction is executed in each pipeline stage. The functional unit state information shows the content of each register or buffer, the data transferred on each active bus, and the function to be performed by each arithmetic unit (ALU, PAU, AU, MAC, etc.).
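The interplay between control signals and a functional unit routine can be sketched as follows. This is only an illustrative Python sketch, not the specification function of Figure 3: the `Bus` class, the `idb_routine` function, and the "REGtoIDB" signal are hypothetical names introduced here; only "MEMtoIDB" appears in the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Bus:
    """A bus that records (previous value, new value) pairs for energy lookup."""
    name: str
    value: int = 0
    events: List[Tuple[int, int]] = field(default_factory=list)

    def drive(self, new_value: int) -> None:
        self.events.append((self.value, new_value))
        self.value = new_value


def idb_routine(idb: Bus, control: Dict[str, bool], sources: Dict[str, int]) -> None:
    """Update the IDB from whichever source has its control bit set.

    Assumption: at most one source drives the bus in a given cycle.
    "REGtoIDB" is a hypothetical second control bit used for illustration.
    """
    for signal, source in (("MEMtoIDB", "MEM"), ("REGtoIDB", "REG")):
        if control.get(signal, False):
            idb.drive(sources[source])
            return


idb = Bus("IDB")
sources = {"MEM": 0x1234, "REG": 0x00FF}
idb_routine(idb, {"MEMtoIDB": True}, sources)   # cycle 1: load from memory
idb_routine(idb, {"REGtoIDB": True}, sources)   # cycle 2: register transfer
```

The recorded `events` list is the kind of per-cycle input data that a power model could later consume.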
During the design of the simulator, the implementation of the pipeline was complicated by the processor's unusual architecture. It is necessary to use two program counters and special PC increment units, because the instruction stream for the processor is a mixture of 16-bit and 32-bit instructions. The simulator also implements instruction loop buffers in the CPU engine.
3 Power Estimation of the Processor
Because dynamic power consumption is strongly dependent on the input switching of the functional units, using only the average power definitely causes a loss of accuracy. Our analysis technique overcomes this loss of accuracy by taking data and instruction streams into account.
The processor is separated into many functional units. Here, the terms energy and power are used interchangeably.
In the bit-dependent functional units, the switching of one bit affects the operations of other bit slices. Typical bit-dependent functional units are adders, multipliers, decoders, multiplexers, etc. Their energy characterization is based on a lookup table consisting of a full energy transition matrix where the row address is the previous input vector, the column address is the present input vector, and the matrix value is the switch capacitance. All combinations of previous and present input vectors are contained in one table. A major problem is that the size of this table grows exponentially with the size of the inputs. A clustering algorithm solves this problem and the sub-problems associated with it by compressing similar energy patterns. The details of this algorithm can be found in [2]. The capacitance data in the energy characterization table is obtained from a switch level simulation of the functional unit. Switch level analysis is not as accurate as circuit level simulation, but it is much faster.
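To make the table-size problem concrete, the sketch below counts the entries of a full transition matrix and then compresses it with a deliberately simple stand-in for clustering: transitions are grouped by the Hamming distance between previous and present input vectors, and each group stores its mean energy. This grouping rule is an assumption chosen for illustration and is not the clustering algorithm of [2]; `toy_energy` is likewise a placeholder for switch level characterization data.

```python
from collections import defaultdict
from typing import Callable, Dict


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def full_table_entries(width: int) -> int:
    """Entries in a full transition matrix for a unit with `width` input bits."""
    return (1 << width) ** 2


def cluster_by_hamming(width: int, energy: Callable[[int, int], float]) -> Dict[int, float]:
    """Group all transitions by Hamming distance and keep the mean energy per group."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    size = 1 << width
    for prev in range(size):
        for cur in range(size):
            d = hamming(prev, cur)
            sums[d] += energy(prev, cur)
            counts[d] += 1
    return {d: sums[d] / counts[d] for d in sums}


def toy_energy(prev: int, cur: int) -> float:
    """Placeholder energy function, not measured data."""
    return 1.5 * hamming(prev, cur)


clusters = cluster_by_hamming(4, toy_energy)
```

For a 4-bit unit the full matrix has 256 entries, while this toy grouping keeps only 5 values; a real clustering scheme trades such compression against the error bounds it can guarantee.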
In the bit-independent functional units, the switching of one bit does not affect the operations of other bit slices; examples are registers, logic operations in the ALU, memories, etc. The total energy consumption of the functional unit can be calculated by summing the energy dissipation of the individual bits. To model the energy dissipation of a bus, the load capacitance of each bit is used as an independent bit characterization. Furthermore, each bit is assumed to have the same load capacitance. The product of the load capacitance and the number of bits is the Cm in equation (1).
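A per-bit bus model of this kind can be sketched in a few lines of Python. The sketch assumes the common dynamic-energy form of one half times capacitance times supply voltage squared per toggled bit; this form, the function names, and the numeric values (arbitrary units) are assumptions for illustration, since equation (1) is not reproduced here.

```python
def toggled_bits(prev: int, cur: int, width: int) -> int:
    """Count the bus lines that change value between two transfers."""
    mask = (1 << width) - 1
    return bin((prev ^ cur) & mask).count("1")


def bus_transfer_energy(prev: int, cur: int, width: int,
                        cap_per_bit: float, vdd: float) -> float:
    """Energy of one transfer, summing identical per-bit contributions.

    Assumes 0.5 * C * Vdd^2 per toggled line (a standard approximation,
    not necessarily the exact form of equation (1)).
    """
    return 0.5 * cap_per_bit * vdd ** 2 * toggled_bits(prev, cur, width)


def bus_max_energy(width: int, cap_per_bit: float, vdd: float) -> float:
    """Worst case: every line toggles, so the total capacitance is cap_per_bit * width."""
    return 0.5 * (cap_per_bit * width) * vdd ** 2


# Illustrative values in arbitrary units, not characterization data.
energy = bus_transfer_energy(0x0000, 0x00FF, width=16, cap_per_bit=1.0, vdd=2.0)
worst = bus_max_energy(width=16, cap_per_bit=1.0, vdd=2.0)
```

Because every bit is assumed to carry the same load capacitance, only the number of toggled lines matters, not which lines toggle.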
Some energy models are not built from the complete functional unit but from smaller subcells of these units. For example, because a 4:1 multiplexer can be made from three identical 2:1 multiplexers, its energy dissipation can be calculated by using the energy model of a 2:1 multiplexer. This reduces $2^{6 \times 2} = 4096$ table entries to $2^{3 \times 2} = 64$ entries. After building the subcell energy model, energy routines implement the operations needed to calculate the functional unit power dissipation from the subcells.
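The multiplexer example can be sketched as follows. The entry counts follow from the input widths (a 4:1 multiplexer has four data and two select inputs, a 2:1 multiplexer has two data and one select input). The tree structure, the helper names, and the placeholder 2:1 table (one unit per toggled input bit) are assumptions for illustration rather than our actual energy routines.

```python
from typing import Dict, Tuple

Mux4Inputs = Tuple[Tuple[int, int, int, int], int, int]  # (data bits, s0, s1)


def table_entries(n_inputs: int) -> int:
    """Entries in a full (previous, present) transition table."""
    return 2 ** (2 * n_inputs)


def mux2(a: int, b: int, sel: int) -> int:
    return b if sel else a


def mux2_vector(a: int, b: int, sel: int) -> int:
    """Pack the three inputs of a 2:1 multiplexer into one 3-bit vector."""
    return (sel << 2) | (b << 1) | a


def mux4_subcell_vectors(data: Tuple[int, int, int, int], s0: int, s1: int):
    """Input vectors seen by the three 2:1 multiplexers of a 4:1 tree."""
    low = mux2(data[0], data[1], s0)
    high = mux2(data[2], data[3], s0)
    return (
        mux2_vector(data[0], data[1], s0),
        mux2_vector(data[2], data[3], s0),
        mux2_vector(low, high, s1),
    )


def mux4_energy(prev: Mux4Inputs, cur: Mux4Inputs,
                mux2_table: Dict[Tuple[int, int], float]) -> float:
    """Sum the 2:1 subcell energies for one 4:1 input transition."""
    pairs = zip(mux4_subcell_vectors(*prev), mux4_subcell_vectors(*cur))
    return sum(mux2_table[(p, c)] for p, c in pairs)


# Placeholder 2:1 model: one energy unit per toggled input bit.
mux2_table = {(p, c): float(bin(p ^ c).count("1")) for p in range(8) for c in range(8)}
e = mux4_energy(((0, 0, 0, 0), 0, 0), ((1, 0, 0, 0), 0, 0), mux2_table)
```

In this example a change on the first data input toggles one first-stage multiplexer input and, through its output, one second-stage input, so two subcell lookups contribute.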
Because we did not have access to the design of the control unit of the processor, average data is used to characterize its power consumption. Although the accuracy is not as good as if we were able to build the transition dependent energy model, this average data reduces simulation time since there is no table lookup involved.
4 Validation Results

4.1 The Power Benchmarks
A set of simple synthetic programs, listed in Table 1, is used as the benchmarks to validate our power analyzer. These are the same benchmarks used to test the processor in [6]. Power consumption of the instruction loop buffer (ILB), memories, buses, ALU, the multiplier and other functional units is collected as the benchmarks are run. Data values used were those which maximized the switching in the datapath. In Table 1, "padd pmuls movx+ movy+" contains four separate operations executed simultaneously. They are an addition, a multiplication, two loads and two address increments. "padd pmuls movx movy" is the same as the previous instruction except that it contains no address increments.
4.2 Validation Results
The power simulation results generated by our architectural level power analyzer are compared with the data gathered by [6]. Overall, the average error rate of power analysis by our simulator is 8.98%. Figure 4 compares the power dissipation of the processor core (including the CPU engine and the DSP engine) estimated by our simulator with that reported by [6]. The power consumption data of each program is normalized to that of the pow034 benchmark. For most of the benchmarks, our simulator produces results very close to the data reported in [6]. For both the pow035d benchmark and the pow035c benchmark, the power consumption is underestimated by our simulator. Possible reasons are that our power estimator has not considered clock power and that the power consumed by the control unit has not been accurately estimated. These two benchmarks use the least power overall, so clock and control unit power account for a higher percentage of the total power consumption.
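The normalization and averaging steps described above can be sketched as follows. The exact error definition used for the reported average is not spelled out here, so the sketch assumes a mean absolute relative error over normalized values; the benchmark names and power numbers are made-up placeholders and are not results from this work or from [6].

```python
from typing import Dict


def normalize(power: Dict[str, float], reference: str) -> Dict[str, float]:
    """Express each benchmark's power relative to a reference benchmark."""
    ref = power[reference]
    return {name: value / ref for name, value in power.items()}


def average_error_rate(estimated: Dict[str, float], reported: Dict[str, float]) -> float:
    """Mean absolute relative error in percent (assumed definition)."""
    errors = [abs(estimated[name] - reported[name]) / reported[name] for name in reported]
    return 100.0 * sum(errors) / len(errors)


# Placeholder values for illustration only.
reported = normalize({"ref": 2.0, "bench_a": 1.0, "bench_b": 4.0}, "ref")
estimated = normalize({"ref": 2.0, "bench_a": 0.9, "bench_b": 4.4}, "ref")
rate = average_error_rate(estimated, reported)
```

With these placeholder inputs the reference benchmark contributes zero error by construction, which is one reason the choice of reference matters when reading normalized comparisons.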
In order to verify the above suspicion, the simulation accuracy of each major functional unit (the DSP engine, the CPU engine, the memory and the top-level buses) was also analyzed and is shown in Figure 5. The estimated power distribution rate is compared to that reported by [6]. The distance from a symbol to the 1:1 line indicates the accuracy. The power simulation of the top-level buses is the least accurate of the four units: the average error rate is 10.2%. One of the reasons for this is that the "top level interconnect" in [6] consists not only of the buses we considered but also of the control and clock lines. The random control logic is also a large part of the total power consumption, but we were not able to build the energy model for the control unit since we did not have access to the design. Thus, the accuracy of the CPU engine power estimation is not as good as that of the DSP engine and memory, though the average error rate is only 5.7%.

In this work, we implemented an accurate, high level power estimator for a real processor. Our estimation technique needs to do energy characterization for each functional unit (ALU, MAC, etc.) only once. When the operations of the functional units are specified by the instruction stream, the power consumed in each functional unit is calculated from its energy model. The simulation results clearly verified the correctness of our power analysis methodology. Without loss of accuracy, the running time of power estimation for each benchmark program was less than 90 CPU seconds. This time saving feature of our power estimator is beneficial for reducing the design cycle time. The structural simulator is also applicable to power optimization research at the architecture, system software, and application software levels.
The following are some of the factors which have not been taken into account during the power estimation:
- Clock power
- Transition dependent control logic power
- I/O pad power
- Varying signal arrival times
- The effects of functional unit placement and routing

Finding a better interconnect energy model is also a valuable research goal. Our future work to improve our power estimator will include all of the above. We are also investigating, running and testing more complicated benchmarks.
