Abstract-In this paper, we present an efficient and accurate methodology for estimating the energy consumption of application programs running on extensible processors. Extensible processors, which are getting increasingly popular in embedded system design, allow a designer to customize a base processor core through instruction set extensions. Existing processor energy macromodeling techniques are not applicable to extensible processors, since they assume that the instruction set architecture as well as the underlying structural description of the micro-architecture remain fixed. Our solution to the above problem is a hybrid energy macromodel suitably parameterized to estimate the energy consumption of an application running on the corresponding application-specific extended processor instance, which incorporates any custom instruction extension. Such a characterization is facilitated by careful selection of macromodel parameters/variables that can capture both the functional and structural aspects of the execution of a program on an extensible processor. Another feature of the proposed energy characterization flow is the use of regression analysis to build the macromodel. Regression analysis allows for in-situ characterization, thus allowing arbitrary test programs to be used during macromodel construction. We validated the proposed methodology by characterizing the energy consumption of a state-of-the-art extensible processor (Tensilica's Xtensa). We used the macromodel to analyze the energy consumption of several benchmark applications with custom instructions. The mean absolute error in the macromodel estimates is only 3.3%, when compared to the energy values obtained by a commercial tool operating on the synthesized register-transfer level (RTL) description of the custom processor. Our approach achieves an average speedup of three orders of magnitude over the commercial RTL energy estimator. 
Our experiments show that the proposed methodology also achieves good relative accuracy, which is essential in energy optimization studies. Hence, our technique is both efficient and accurate.
I. INTRODUCTION

In embedded system design, high silicon efficiency is required to meet tight cost, area, timing, and power constraints. At the same time, programmability or customizability is also desired to augment or enhance system design in response to user specification, market change, or other rapidly evolving requirements. Various implementation options for system design, ranging from software running on a general-purpose programmable processor to custom hardware tuned for a specific application, exist and provide differing degrees of flexibility and efficiency. The recent availability of successful extensible processors promises a favorable tradeoff between high efficiency [as seen in application-specific integrated circuits (ASICs)] and high flexibility (as seen in general-purpose processors), while keeping design turnaround times short. An extensible processor allows the designer to extend the instruction set of a base processor core through application-specific (custom) instructions. Thus, extensible processors can benefit embedded system design with their ability to simultaneously tune both the underlying hardware and the application software to meet diverse design requirements. While commercial vendors of extensible processors, such as [1]-[4], do offer design tools to take extensible processors from specification to hardware implementation, a large number of issues remain unresolved. One such open problem is energy estimation for extensible processors, which needs to be efficiently addressed since low power dissipation is a prerequisite for most embedded systems. Recent work [5] has focused on providing the infrastructure to automatically select parts of an application that are best implemented using custom instructions based on performance metrics. Using the energy consumption of an application as the design metric in this scenario is challenging, since it requires fast and accurate energy estimation for each candidate configuration.
Note that if the extensions to the base processor instruction set architecture (ISA) have been decided already, a new energy estimation technique is not required, since any existing processor energy estimation/power analysis framework can be used to characterize the extended processor and estimate the energy consumed by an application. However, such an approach is impractical for use in energy optimization studies done in an application-specific instruction set processor (ASIP) design cycle, since energy characterization has to be performed for each configuration/extension, and the number of configurations/extensions is very large. In [6], a mathematical decomposition model is proposed for estimation of speed, area, and power of parameterizable soft intellectual property (IP). It decomposes the power function with respect to technology-dependent variables, presence or absence of clock gating (a power-management technique), and architectural configuration variables, while accounting for the impact of architectural variables by making power dissipation proportional to area. In our paper, we focus on deriving the energy consumption of an extensible processor with respect to the technology-independent variables, such as architectural and microarchitectural configurations.
0278-0070/04$20.00 © 2004 IEEE
Although existing processor energy estimation techniques are not directly applicable to the problem of efficient energy estimation for extensible processors, they offer valuable insights that can aid in the development of a good solution. Macromodeling, which we adopt in this work, is a commonly used technique in processor energy estimation. It formulates the energy consumed by the processor in terms of parameters that are easily observable (say, during instruction-set simulation). Two categories of processor macromodeling techniques have been successfully employed. Structural macromodeling approaches express the overall energy consumption in terms of the energy consumption of its constituent hardware blocks, and use the activity statistics of the hardware blocks for a given program trace to estimate energy. Instruction-level macromodeling approaches, on the other hand, characterize the energy consumption of processor instructions using carefully constructed test programs and can use fast instruction-set simulation to yield efficient energy estimates. Thus, structural approaches offer the benefits of high accuracy especially if they model the structure of the processor at a fine granularity, while instruction-level approaches facilitate fast energy estimation since they do not have to be cycle-accurate or structure-aware. In the case of extensible processors, which have a fixed-base processor core and customizable components (due to instruction set extensions), we hypothesize that a hybrid approach that combines the efficiency of instruction-level approaches with the accuracy of structural approaches is best suited.
A. Paper Overview and Contributions
Our methodology involves deriving a composite energy macromodel by characterizing the energy consumed by applications with custom instructions using i) instruction-level parameters that capture the interplay of the dynamic execution trace of a program and the base processor micro-architecture (inclusive of processor pipeline stalls and other effects, cache misses, etc.) and ii) structural parameters that account for the energy effect of each instruction (base/custom) on the custom hardware. By using both instruction-level and structural parameters in the macromodel, and using instruction-set simulation of an application with custom instructions to capture both instruction-level statistics as well as custom hardware usage data, both efficiency and accuracy can be simultaneously targeted.
A significant feature of our macromodeling flow is that characterization is performed using regression macromodeling, which has the following advantages.
• Variables in a regression macromodel can be chosen from instruction-level or structural domains, or both. Thus, regression macromodeling is naturally applicable to our hybrid formulation.
• Regression macromodeling significantly simplifies the process of constructing test applications or programs used in characterization. Conventional instruction-level approaches perform bottom-up macromodeling, which requires test programs that contain isolated instructions, selected instruction sequences, etc., wrapped in loops, in order to infer the average energy consumption of a given instruction under various scenarios. However, test program construction becomes cumbersome in most cases (for example, if test programs need to target instructions such as branch). Regression macromodeling, through its in-situ characterization, only requires that the test programs have diversity in their instruction statistics and custom hardware instantiation so as to cover the instruction space and custom hardware library. Thus, arbitrary test programs can be used for regression macromodeling.
• Construction and use of regression models are efficient, and the tools for building a regression model are widely available.
With the energy macromodel of the extensible processor built in the above manner, the energy consumption of an application incorporating any custom instructions can be determined simply by instruction-set simulation, which captures instruction-level execution statistics, and dynamic resource usage analysis, which derives the custom hardware activation data needed by the macromodel. Note that energy estimation with this energy macromodel only needs the custom instruction descriptions; it does not require the custom processor to be synthesized. Thus, our methodology is easily usable for evaluating energy/performance versus area tradeoffs among different candidate custom instructions at the early design stage. To the best of our knowledge, this is the first and only work on energy estimation for extensible processors that can be embedded in the design cycle. We applied the proposed methodology to characterize the Xtensa extensible processor core from Tensilica Inc. [1]. We then used the energy macromodel of the Xtensa processor to evaluate the energy consumption of several applications with custom instructions. Our experimental results show that the mean absolute error in the macromodel estimates, when compared to the energy values computed by a commercial tool operating on the actual hardware description of extended processors, is only 3.3%, while the average speedup is three orders of magnitude.
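As an illustration of the accuracy metric quoted above, the following minimal sketch computes a mean absolute error in percent. It assumes that the error of each benchmark is taken relative to the reference (RTL-level) energy and then averaged; the function name and all energy values are hypothetical and are not measurements from this work.

```python
def mean_absolute_error_percent(estimates, references):
    """Average of |estimate - reference| / reference, expressed in percent."""
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need one estimate per reference value")
    errors = [abs(est - ref) / ref * 100.0
              for est, ref in zip(estimates, references)]
    return sum(errors) / len(errors)

# Hypothetical energies (arbitrary units) for three benchmark applications.
reference = [100.0, 200.0, 400.0]   # e.g. from an RTL energy estimator
estimate = [104.0, 194.0, 400.0]    # e.g. from the macromodel
mae = mean_absolute_error_percent(estimate, reference)
```

The same per-benchmark pairing can be reused to check relative accuracy, i.e., whether the macromodel ranks candidate configurations in the same order as the reference tool.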
B. Related Work
Various techniques have been developed to estimate and optimize power or energy consumption of hardware throughout its design hierarchy [7] , [8] . Recently, attempts have also been made for energy estimation and optimization of software running on embedded processors. As mentioned before, these approaches can be classified, based on the macromodeling employed, into structural and instruction-level techniques.
Structural techniques for energy estimation of software utilize the architectural description of the processor to collect the dynamic activity information for each architectural block using simulation, calculate the energy consumption for each component and, finally, sum them up to compute the overall energy consumption. Early work [9] characterized the power consumption of each architectural block as a single number. The power profiler in [10] calculates the energy consumption of functional units based on the switching activity between consecutive cycles. Wattch [11] and SimplePower [12] estimate the energy consumption at each cycle at the architecture level. Commercial tools such as WattWatcher from Sente [13] can also be used for energy estimation, once the register-transfer level (RTL) hardware description of the processor and the binary image of the program become available. However, RTL simulation of a processor is extremely slow for even small programs and methods for reducing the simulated trace become necessary [14] .
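The structural style of estimation described above can be sketched as follows. This is only an illustration under simplifying assumptions: each architectural block is assigned a single energy value per activation (as in the early work cited above), and the block names, energy values, and activity counts are hypothetical.

```python
# Hypothetical energy per activation (nJ) for each architectural block.
BLOCK_ENERGY_NJ = {"alu": 0.5, "regfile": 0.25, "icache": 1.0, "dcache": 1.5}

def structural_energy(activity, block_energy=BLOCK_ENERGY_NJ):
    """Sum over blocks of (number of activations) x (energy per activation)."""
    return sum(block_energy[block] * count for block, count in activity.items())

# Activity counts as a simulator might report them for one program trace.
activity = {"alu": 400, "regfile": 800, "icache": 1000, "dcache": 200}
total_nj = structural_energy(activity)
```

Finer-grained structural models replace the single per-block constant with a value that depends on switching activity between consecutive cycles, at the cost of a slower simulation.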
Instruction-level macromodeling techniques compute the energy consumption of a program based on its instruction profile. They primarily rely on the energy consumption characterization of each instruction of the processor and also estimate the energy consumption of special cases (such as cache misses) that can occur during the execution of a program. Characterization of each instruction can be performed by actual current measurements for a processor chip executing carefully created test programs [15]. The techniques in [16] measure the instantaneous processor power to build a software energy estimation model. The accuracy of instruction-level modeling is improved further by the techniques in [17]-[20], which are cognizant of variations due to instruction encoding, addressing mode, register fields, operand values, bit toggling on internal and external busses, etc. Since the added accuracy comes at the cost of additional CPU time, efficiency is targeted in [18] and [21], which perform measurements only on a subset of the instructions (for base energy) and instruction sequences (for interinstruction effects). Measurement-based approaches are accurate because data are acquired from an actual chip implementation. The same reason, however, makes measurement-based techniques infeasible for power tradeoff studies early in the design cycle, especially if the hardware architecture is not fixed.
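A minimal sketch of an instruction-level estimate of this kind is given below. It assumes a base energy per instruction, an overhead for selected pairs of consecutive instructions (interinstruction effects), and a fixed penalty per cache miss; all names and values are hypothetical and are not taken from the cited works.

```python
# Hypothetical characterization data (nJ).
BASE_NJ = {"add": 1.0, "load": 2.0, "store": 2.0, "branch": 1.5}
PAIR_OVERHEAD_NJ = {("add", "load"): 0.25, ("load", "add"): 0.25}
CACHE_MISS_NJ = 10.0

def instruction_level_energy(trace, cache_misses):
    """Base energy + interinstruction overhead + cache-miss penalty."""
    base = sum(BASE_NJ[op] for op in trace)
    # Overhead is looked up for every pair of consecutive instructions.
    inter = sum(PAIR_OVERHEAD_NJ.get(pair, 0.0)
                for pair in zip(trace, trace[1:]))
    return base + inter + CACHE_MISS_NJ * cache_misses

trace = ["load", "add", "add", "store", "branch"]
energy_nj = instruction_level_energy(trace, cache_misses=1)
```

Because only the instruction stream and event counts are needed, such a model can be driven by fast instruction-set simulation rather than cycle-accurate structural simulation.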
Recently, statistical analysis has also been used to build energy/power prediction models for very long instruction word (VLIW) and reduced instruction set computer (RISC) processors. The energy coefficients in the macromodel are not obtained through simulation or measurement; instead, they are calculated with a priori knowledge of energy characteristics of a set of instructions and the relevant statistical execution information of these instructions. Instruction-level functional approaches, such as [21]-[23], decompose an instruction into its constituent pipeline functions (for example, fetching and decoding, execution, load and store, write back, etc.), and calculate the energy coefficients for these functions. Structural models, such as [24], have variables corresponding to the fraction of total instructions executed by each architectural block. The technique proposed in [23] examines instructions with a finer granularity by considering, for example, instruction fetch address, instruction-bit encoding, register numbers and immediates, data values, etc. However, these approaches do not provide an overall energy estimation for an application.
There has been only one previous work in the literature targeting the power estimation for ASIPs [25] . It requires the hardware description language (HDL) description of the processor and, hence, is expensive to use in the design cycle. It first performs instruction-set simulation to extract data-type information, then connects it with the components used by each instruction and characterizes the power consumption of each component with various data types. After program profiling, power consumption for the application program is obtained. It aims for accuracy at the RTL and efficiency at the instruction level. Our approach has the same purpose. However, we target extensibility in the early design cycle of custom processor design through a hybrid approach of structural and instruction-level macromodeling.
The rest of this paper is organized as follows. Section II briefly examines the Xtensa processor core used in this work. Section III examines with a motivational example the energy macromodeling requirements for an extensible processor. Section IV describes the proposed energy estimation methodology and details the salient steps. Section V presents the results of applying the proposed methodology to build the energy macromodel of the Xtensa processor core and using it to evaluate the energy consumption of applications with different custom instructions. Finally, Section VI concludes.
II. EXTENSIBLE XTENSA PROCESSOR
We use the extensible Xtensa processor from Tensilica [1] as the target processor for macromodeling and energy estimation. Extensible processors try to combine advantages of a general-purpose processor's flexible control logic and an ASIC's efficient computation part. Xtensa's ISA consists of a basic set of instructions, which exists in all Xtensa implementations, plus a set of configurable and extensible options [26] .
The base ISA defines approximately 80 instructions, and the basic hardware implementation of the Xtensa core is built around a traditional five-stage RISC pipeline, with a 32-bit address space. The configurable options include a wide range of architectural settings. For example, the designer can configure the base processor to include generic instructions (e.g., multiply-accumulate) or floating point coprocessors, customize the memory/cache architecture and register file, and set up interruption/exception mechanisms and levels. Extensibility is achieved by specifying application-specific functionality through custom instructions (also called Tensilica instruction extension or TIE).
The behavior of the custom instructions is described using a subset of the Verilog HDL. Custom instructions can be used to perform complex computations, which can take multiple clock cycles to complete. In instruction encoding, there are at most two input and one output operand fields in the instruction. If the custom computation needs additional inputs and outputs, custom state registers are added to allow instructions to have more sources and destinations. They can also be used as dedicated registers, holding values for some temporary variables during program execution. Custom instructions can access both the general-purpose register file of the base processor and additional custom registers/register files for their computations. The TIE compiler processes the custom instruction specification and facilitates seamless integration of the added custom hardware with the base processor configuration. The custom hardware extension consists of designer-specified functional units (corresponding to Verilog-like operators and custom built-in modules), storage elements (register files and custom state registers), pipeline flip-flops, and control signals. Control logic, such as the TIE instruction decoder, bypass logic, interlock detection, and immediate-generation logic required by the custom instructions, is automatically generated. After the custom instructions are incorporated, a processor generator automatically generates the enhanced processor and the corresponding GNU-based software development kit for the configuration, which includes an ANSI C/C++ compiler, linker, assembler, debugger, code profiler, cycle-accurate instruction-set simulator (ISS), diagnostics, test benches, and standard libraries. Instead of invoking custom instructions at the assembly level, calls to custom instructions can be directly inserted into a high-level language (e.g., C, C++) description of the application program. In this way, both the hardware and software are tuned for specific applications.
III. EXTENSIBLE PROCESSOR ENERGY MACROMODEL REQUIREMENTS
In this section, we illustrate with an example the different factors which must be considered in building an energy macromodel for an extensible processor, and identify the macromodel components.
A. Motivational Example
Example 1: Fig. 1(a) shows a portion of the architecture of an extended processor, wherein the base processor datapath has been augmented with custom hardware needed to implement three custom instructions: , , and . Base processor arithmetic instructions execute on the datapath portion shown with a generic register file, an arithmetic-logic unit (ALU), two operand buses, and one result bus. Custom instructions and perform their respective functionality (multiply and multiply-accumulate) on data values off the operand buses using shared custom hardware-a multiplier , a multiplexer , and an adder ( 1). The multiplexer is used to select whether an immediate 0 (for operation) or a real addend from custom register (for operation) is fed to adder 1. Custom instruction accesses custom registers for its operands, which is independent of base processor operand buses. The computation results, either from base processor ALU or custom hardware modules, are fed into a multiplexer . The result of execution is then selected and loaded onto the result bus, and stored into either generic registers or custom registers.
The description of the three custom instructions is shown in Fig. 2 . In this figure, custom instruction keywords are shown in bold, and predefined names for the base processor core are underlined. The user_register statement specifies the custom state registers and their indices. The iclass statement defines a new instruction class with one or multiple custom instructions. The input and output of this instruction class are also specified. The semantic statement describes the behavior of the instruction class in a single block. For some complex instructions which require multiple cycles, the schedule block gives the schedule for the operation sequence of the custom instruction [26] , [27] . As seen from Fig. 2 , state registers and are both input and output for custom instructions. and instructions are in one iclass, and both take two cycles. The instruction timing in the schedule block shows that the and operations need ars and art at the beginning of the first cycle, use the value at the beginning of the second cycle and produce a new value at the end of the second cycle. The TIE compiler automatically derives the hardware implementation of custom instructions during the custom processor generation phase. It also offers dynamic linked libraries for instruction-set simulation [27] .
A snapshot of the dynamic execution of an application is captured in Fig. 1(b) . Four instructions are shown in the trace, which correspond to base processor instruction add and custom instructions , and , respectively. For each instruction, the top horizontal bar lists the sequence of processor events dictated by its execution. For example, instruction , which is an add instruction, executes by first reading data off the generic register file onto the operand buses , then performing the add operation on the ALU and writing onto the result bus ( , ), and writing back to the register file after a latency of one cycle due to pipeline effects . Stalls, if any, are also indicated in the figure. Since the execution of an instruction can activate other portions of the processor (side effects), the bottom bar for each instruction depicts the side effects in either the base processor or the custom hardware. For example, the execution of the base processor instruction add activates custom hardware ( , , 1) in the second cycle. This occurs because the custom hardware and the ALU of the base processor share the same operand buses. On the other hand, the execution of custom instructions ( and ) can activate the base processor hardware, and the ALU is also activated when these custom instructions are running. Since the execution of the other custom instruction may be independent of the base processor hardware, there is no side effect in the datapath.
B. Macromodel Identification
We analyze Fig. 1(b) further for energy consumption component identification. Along the horizontal time axis, processor events in boxes are time-disjoint, since they occur in different cycles [21] . At the second stage of the pipeline (execution stage), as shown in the figure, the top box and bottom box are space-disjoint, which means that their functionalities involve different structural blocks [21] . For example, when base-processor instruction is running, the top box represents base processor datapath events executed on and , and the bottom box represents the possible side effect on custom hardware 1, , and ; while if custom instruction is running, the top box shows the custom hardware activated, i.e., , 1, , and , and the bottom box shows spurious activation of base processor . Thus, since the activities in each box are either time-disjoint or space-disjoint to each other, their energy consumptions are additive.
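The additivity argument above can be illustrated with a small sketch. It assumes that each processor event is recorded as a (cycle, block, energy) tuple, and it treats two events as disjoint when they differ in cycle (time-disjoint) or in structural block (space-disjoint). The block names and energy values are hypothetical.

```python
from collections import Counter

def total_energy(events):
    """Sum event energies after checking that no two events share
    both the same cycle and the same structural block."""
    slots = Counter((cycle, block) for cycle, block, _ in events)
    overlapping = [slot for slot, n in slots.items() if n > 1]
    if overlapping:
        raise ValueError(f"events overlap in time and space: {overlapping}")
    return sum(energy for _, _, energy in events)

# A base add instruction: register read, ALU operation, write-back,
# plus a spurious activation of custom hardware in the execution cycle.
events = [
    (1, "regfile", 0.5),
    (2, "alu", 1.0),
    (2, "custom_multiplier", 0.25),  # same cycle, different block
    (3, "regfile", 0.5),             # same block, different cycle
]
energy = total_energy(events)
```

If two events did share both cycle and block, adding their energies would count the same activity twice, which is exactly what the disjointness property rules out.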
To compute the energy consumption of the instruction trace in Fig. 1(b), we identify several macromodel components, and build an energy macromodel for extensible processors which accounts for the following factors:
1) the energy consumed by base processor instructions on the base processor core, for example, the top horizontal bar for the execution of the add instruction in Fig. 1(b), which occurs throughout the five pipeline stages;
2) the energy consumed by a custom instruction on the custom hardware, for example, the second box in the top bar of each of the three custom instructions, which is just computation energy;
3) the energy dependency on inter-instruction correlation, pipeline effects, and other nonideal features which are manifested as processor stalls, cache misses, etc.;
4) the interplay between the base processor and custom hardware:
• on the one hand, it contains the activation energy of custom hardware owing to base-processor instructions, which is purely a computation side effect [the bottom bar of the add instruction's execution sequence in Fig. 1(b)];
• on the other hand, it contains the activation energy of base processor hardware owing to custom instructions, which consists of both the computation side effect in the execution stage [the bottom bar of the two custom instructions that share the operand buses in Fig. 1(b)], and the involvement of the base processor in other pipeline stages (the events outside the execution stage in the top bar of the three custom instructions).
Our proposed energy macromodel aims at capturing the most significant properties of extensible processors to define an accurate model efficiently. Since an extensible processor is augmented by a designer through custom instructions, the previous energy estimation work for microprocessors with a fixed ISA is not applicable. The adopted macromodel is based on a decomposition of the activities carried out by the custom processor. Although the custom hardware is integrated with the base processor seamlessly, we can still decompose energy consumption into two space-disjoint parts: 1) base processor and 2) custom hardware extensions. We take a hybrid approach which utilizes both instruction-level macromodeling for the base processor and a structural approach for the custom hardware extensions.
IV. MACROMODELING AND ESTIMATION METHODOLOGY
In this section, we present the proposed methodology for estimating the energy consumption of an application with (any) custom instruction enhancements running on an extensible processor obtained for the application. Section IV-A presents an overview of our methodology, while Section IV-B details the constituent steps.
A. Overview
Fig. 3 shows the different steps involved in performing energy estimation for an extensible processor. The basic flow involves: 1) characterization flow, which characterizes the energy consumption of the extensible processor through regression macromodeling on a set of test programs (Steps 1-8), and 2) estimation flow, which profiles a real application with custom instruction extensions dynamically to determine the statistics associated with the macromodel variables, and thus, computes the energy consumption (Steps 9-11).
Building an energy macromodel involves first selecting the extensible processor parameters on which the energy consumption of an instruction trace depends and constructing a template that expresses the energy consumption (dependent variable) as a function of those parameters (independent variables) (Step 1). The energy consumption is additive, and the processor events for the parameters in the macromodel are either time-disjoint or space-disjoint from each other and, hence, there is no duplicate energy accounting during macromodeling. We use a linear macromodel template in our analysis, since construction and use of linear macromodels are efficient. The linear macromodel template expresses the energy consumed, $E$, as a linear function of $x_1, x_2, \ldots, x_n$, which are characteristic variables of a program running on the extensible processor. In other words,

$$E = \sum_{i=1}^{n} c_i x_i \tag{1}$$

where $c_1, c_2, \ldots, c_n$ are constants called the energy coefficients. Variables $x_i$ are chosen from both instruction-level and structural domains. Instruction-level parameters are employed for characterizing instruction effects on the fixed base processor core, while structural parameters are used to characterize instruction effects on custom hardware due to any base or custom instructions (see Section IV-B).
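As a minimal sketch of how such a linear template is evaluated, the following code multiplies each macromodel variable by its energy coefficient and sums the products. The variable names, coefficient values, and counts are hypothetical and serve only to show the mechanics.

```python
def macromodel_energy(coefficients, variables):
    """Evaluate a linear macromodel: sum of coefficient x variable value."""
    missing = set(variables) - set(coefficients)
    if missing:
        raise KeyError(f"no energy coefficient for: {sorted(missing)}")
    return sum(coefficients[name] * value for name, value in variables.items())

# Hypothetical coefficients (energy per cycle) mixing both domains:
# two instruction-level variables and one structural variable.
coefficients = {
    "arith_cycles": 0.5,
    "load_cycles": 0.75,
    "multiplier_active_cycles": 0.25,
}
# Hypothetical variable values gathered for one program.
variables = {
    "arith_cycles": 1000,
    "load_cycles": 200,
    "multiplier_active_cycles": 400,
}
energy = macromodel_energy(coefficients, variables)
```

Evaluating the model is a single pass over a handful of counters, which is why the estimation cost is dominated by instruction-set simulation rather than by the macromodel itself.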
The energy coefficients in the macromodel are determined using regression analysis (Step 8). Since any test program can be used for building the regression model, we merely ensure that the different instructions in the base processor ISA are covered (Step 2). The input space of custom instructions that can be added to an extensible processor ISA is, however, exponential for a given choice of custom hardware library components. Therefore, the test program suite also incorporates custom instructions so as to cover all the custom hardware library components (Step 2). Regression analysis requires knowledge of both the dependent variable (energy consumption of each test program) and the independent variables (macromodel parameter values). For each test program, instruction-set simulation (Step 6) and dynamic resource usage analysis (Step 7) on the execution trace are used to determine the values of the parameters in the macromodel template. Energy estimation of the test program executing on the RTL description of the processor (Step 5) is used to measure the value of the dependent variable. If the test program incorporates custom instructions, the processor used in the above step corresponds to a custom processor that includes the custom instructions (through Step 4). Note that, while custom processors are generated during characterization, they are not needed for using the macromodel to compute application energy consumption. The macromodeling is just a one-time effort. Steps 3-7 are repeated for all the test programs to gather the data needed by regression macromodeling to find estimates of the energy coefficients, thus completing characterization.
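The fitting step can be sketched as follows, assuming ordinary least squares as the regression method (the text above specifies only "regression analysis"). The solver below uses the normal equations with Gauss-Jordan elimination so that it has no external dependencies; in practice a statistics package would normally be used. All test-program data are synthetic.

```python
def fit_energy_coefficients(rows, energies):
    """Ordinary least squares via the normal equations (X^T X) c = X^T e.

    rows: one list of macromodel variable values per test program.
    energies: reference energy per test program (e.g. from an RTL estimator).
    """
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T e].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * e for r, e in zip(rows, energies))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("test programs do not exercise all variables "
                             "independently")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

# Synthetic characterization data: 4 test programs, 3 macromodel variables.
rows = [[100, 10, 0], [50, 40, 20], [80, 5, 60], [20, 30, 10]]
true_coefficients = [0.5, 2.0, 0.25]          # used only to generate data
energies = [sum(c * x for c, x in zip(true_coefficients, r)) for r in rows]
fitted = fit_energy_coefficients(rows, energies)
```

The singular-matrix check reflects the coverage requirement stated above: if the test programs never vary some variable independently of the others, its coefficient cannot be determined.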
When the energy consumption of an application with custom instructions has to be estimated (Steps 9-11), instruction-set simulation (Step 9) is first performed to gather execution statistics (values of instruction-level macromodel variables) as well as to generate the dynamic execution trace for value extraction of structural variables. Since the integration of custom hardware (due to the custom instructions) with the processor architecture is defined in an extensible processor design flow, we analyze the resource usage (Step 10) of each instruction in the execution trace to determine the activation (if any) of custom hardware. This analysis yields the values of the structural macromodel variables. The parameter values are fed to the energy macromodel to yield the energy consumed by the application.
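The estimation flow just described can be sketched in a few lines. The sketch assumes that the execution trace lists, for each instruction, its class, its cycle count, and the custom hardware blocks it activates, and that a block stays active for all cycles of the instruction. These are simplifications for illustration; the names and coefficient values are hypothetical.

```python
from collections import Counter

# Hypothetical energy coefficients, as would result from characterization.
COEFF = {"arith": 0.5, "load": 0.75, "custom": 0.25, "mult": 1.0, "adder": 0.5}

def estimate_energy(trace, coeff=COEFF):
    """trace: list of (instruction_class, cycles, active_custom_blocks)."""
    counts = Counter()
    for instr_class, cycles, active_blocks in trace:
        counts[instr_class] += cycles      # instruction-level variable
        for block in active_blocks:        # structural variables
            counts[block] += cycles
    return sum(coeff[name] * value for name, value in counts.items())

trace = [
    ("arith", 1, ["mult", "adder"]),   # base add toggling custom hardware
    ("load", 1, []),
    ("custom", 2, ["mult", "adder"]),  # two-cycle custom instruction
]
energy = estimate_energy(trace)
```

Note that the base-processor instruction in the first trace entry contributes to the structural variables as well, which models the side-effect activation of custom hardware through shared operand buses.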
B. Details
In this section, we focus on the implementation of salient steps of our methodology. Section IV-B1 discusses energy macromodel template generation, while Section IV-B2 examines macromodel fitting using regression analysis.
1) Energy Macromodel Template Generation:
The energy macromodel template used in our analysis is linear, with two components as shown below. (2) where and are linear functions of instruction-level parameters and structural parameters , respectively. They are space-disjoint, as depicts energy consumption on the base processor due to program execution, and on custom hardware due to all the instructions in program.
Instruction-level macromodel variables:
The instructionlevel macromodel variables are chosen to reflect the usage of the base processor core due to either base processor or custom instructions. In the process, we also select parameters to consider the effects of nonideal processor cases caused both by a baseprocessor instruction and custom instruction, such as instruction cache miss, data cache miss, uncached instruction fetching, data or control dependent interlocks, etc., that occur during the execution of a program.
Equation (3) illustrates the use of instruction-level macromodel variables to compute the instruction-level energy component, which is expressed as a linear sum of the following energy components:
• Energy due to the base processor functionality exercised by an instruction belonging to the base processor ISA (energy component item (1) in Section III-B). Experimental studies of energy profiles of processor instructions in the literature suggest that instructions in the base processor ISA can be clustered into arithmetic, load, store, jump, branch taken, and branch untaken classes [15]. We also ran a set of test programs for each base ISA instruction and classified them into these categories according to their average energy consumption. Macromodel variables represent the number of cycles taken by each instruction class in the dynamic execution trace of the program. Such a clustering is convenient (and later seen to be accurate) since macromodel variables do not have to be separately present for individual instructions in the base processor ISA. Thus, efficiency is targeted by instruction clustering.
• Energy due to dynamic effects manifested as instruction-cache misses, data-cache misses, uncached instruction fetches, and processor interlocks [15] (item (3) in Section III-B). Macromodel variables denote the number of times each nonideal case occurs during program execution.
• Energy consumption in the base processor imposed by custom instructions. This corresponds to the second energy component of item (4) in Section III-B. It actually contains two components: 1) one for all the custom instructions, which is the energy consumption in the four pipeline stages other than the execution stage and 2) one for those custom instructions that access the generic register file, which is the spurious energy consumption in the ALU activated in the execution stage. Since the latter computation energy is very small, we only account for the former. The macromodel variable accounts for the number of cycles taken by all the custom instructions.
Structural macromodel variables:
Structural macromodel variables reflect the usage of custom hardware extensions due to the execution of either the base processor or custom instructions. The variables are chosen to account for the number of cycles for which each custom hardware component is active during the execution of a program. All the components present in the custom hardware library should, therefore, be covered by these macromodel variables. The custom hardware extensions can be designed to be large enough to match or even exceed the size of the base processor core according to the requirements of specific applications. For example, the present generation of Xtensa III cores ranges in complexity from a base configuration with approximately 25 000 gates to configurations with several hundred thousand gates [6]. Hence, accurate modeling of custom hardware plays an important role in macromodeling of the whole extended processor. The following considerations are made in the process.
Custom Hardware Module Categorization: For efficiency, the components in the custom hardware library are classified into several categories based on empirical studies of their average energy consumption, even though their functionalities differ from each other. In the context of the custom hardware library used by TIE instructions, we classified the basic primitives into five categories: 1) multiplier; 2) adder, subtractor, and comparators; 3) bit-wise logic, reduction logic, and multiplexers (? :); 4) shifter; and 5) custom registers. Additional categories correspond to specialized modules available for TIE instructions, namely, custom built-in modules: 6) TIE_mult; 7) TIE_mac; 8) TIE_add; 9) TIE_csa; and 10) table, which is a hardware implementation that stores constants to be used in computational expressions.
Energy Complexity Scaling for Hardware Modules: The energy consumption of a hardware component depends significantly on its bit-width (or the number of entries and bit-width of each entry in the case of a table). We use $C$ to represent the energy complexity of a hardware module and $W$ for the bit-width. The dependence on bit-width is linear ($C \propto W$) in the case of hardware components such as ripple-carry adders, multiplexers, etc., while the dependence is quadratic ($C \propto W^{2}$) in the case of an array multiplier. The linear categories include (2)-(5), (8), and (9), and the quadratic categories include (1), (6), and (7). Energy consumption of the static table is linearly dependent on both the bit-width and the number of entries in the table.
We expect accuracy to be gained through this energy complexity scaling for custom hardware modules.
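The scaling rules above can be sketched as a small Python function. This is a sketch under stated assumptions: the category numbers follow the list given under Custom Hardware Module Categorization, the proportionality constant is taken as 1 (it would be absorbed into the fitted energy coefficient), and a single bit-width per module is assumed even for the quadratic categories. The function name and signature are illustrative.

```python
# Category numbers follow the categorization in the text
# (1 = multiplier, ..., 9 = TIE_csa, 10 = table).
QUADRATIC_CATEGORIES = {1, 6, 7}
LINEAR_CATEGORIES = {2, 3, 4, 5, 8, 9}
TABLE_CATEGORY = 10


def energy_complexity(category: int, bit_width: int, entries: int = 1) -> int:
    """Return the relative energy complexity of one custom hardware module."""
    if bit_width <= 0 or entries <= 0:
        raise ValueError("bit_width and entries must be positive")
    if category in QUADRATIC_CATEGORIES:
        # Array-multiplier-like modules: quadratic in the bit-width.
        return bit_width ** 2
    if category in LINEAR_CATEGORIES:
        # Adders, logic, shifters, registers, etc.: linear in the bit-width.
        return bit_width
    if category == TABLE_CATEGORY:
        # Tables: linear in both the bit-width and the number of entries.
        return bit_width * entries
    raise ValueError(f"unknown category: {category}")
```

For instance, a 16-bit multiplier receives complexity 256, whereas a 32-bit adder receives complexity 32, so the multiplier weighs eight times more per active cycle before the per-category coefficient is applied.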
Dynamic Resource Usage Analysis: Since several factors decide whether each custom hardware module is in an active or idle state, we need to analyze the resource usage pattern dynamically (Steps 7 and 10 in Fig. 3). A custom functional block is activated when any custom instruction belonging to the corresponding custom instruction class is executing. For example, the shared hardware modules in Fig. 1(a) are activated when any of the custom instructions that share them is executing. In some cases, the custom functional block can also be activated when base processor instructions are running. If the functional block has inputs from generic registers, or it is in the datapath chain of generic register value propagation, the activities on the operand buses still affect the custom hardware. Hence, we also account for the usage of processor extensions due to base processor instructions. The execution traces obtained from instruction-set simulation (Step 6 in Fig. 3) provide us with the information on sequential instruction execution. Since the TIE compiler implements the custom instructions automatically by mapping each operation to the corresponding operator one-by-one, we employ a data-flow graph (DFG) for each custom instruction, with nodes (functional blocks) corresponding to operations and edges corresponding to data flow, as shown in the custom hardware extensions in Fig. 1(a). Thus, we identify the activated custom hardware modules for each instruction in the execution trace, and count the activation cycles for each module.
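The counting step at the end of this analysis can be sketched in Python as follows. The trace format (one opcode and cycle count per executed instruction), the instruction names, and the module names are all hypothetical; in the actual flow, the mapping from an instruction to the modules it activates would be derived from the DFG of each custom instruction and from the datapath connections of the base processor.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_active_cycles(
    trace: Iterable[Tuple[str, int]],
    modules_used: Dict[str, List[str]],
) -> Counter:
    """Accumulate, per custom hardware module, the cycles in which it is active."""
    active = Counter()
    for opcode, cycles in trace:
        # Instructions absent from the mapping activate no custom hardware.
        for module in modules_used.get(opcode, ()):
            active[module] += cycles
    return active


# Hypothetical mapping: two custom instructions share adder_0, and one base
# instruction ("add") also drives adder_0 through the operand buses.
modules_used = {
    "cust_mac": ["mult_0", "adder_0"],
    "cust_add": ["adder_0"],
    "add": ["adder_0"],
}
# Hypothetical execution trace: (opcode, cycles taken).
trace = [("cust_mac", 2), ("l32i", 1), ("cust_add", 1), ("add", 1), ("cust_mac", 2)]

active_cycles = count_active_cycles(trace, modules_used)
```

In this example, the shared adder accumulates cycles from both custom instructions and from the base instruction, while the multiplier is charged only when the first custom instruction executes.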
Equation (4) expresses the custom hardware energy consumption, $E_{\mathrm{struct}}$, where the macromodel variable $n_{ij}$ denotes the number of cycles in which the $j$th functional block belonging to component category $i$ is active, and $C_{ij}$ represents the energy complexity of this functional block. The coefficient $\epsilon_{i}$ represents the average energy consumption per bit (per entry for table) per cycle for each kind of resource module category.
$$E_{\mathrm{struct}} = \sum_{i} \epsilon_{i} \sum_{j} n_{ij}\, C_{ij} \tag{4}$$
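As a worked illustration of the structural term, the following Python sketch sums, for each component category, the product of active cycles and energy complexity over all functional blocks, and then weighs each category total by its unit energy coefficient. The data layout (tuples of category, active cycles, and complexity) and all numerical values are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def structural_variables(blocks: Iterable[Tuple[int, int, int]]) -> Dict[int, int]:
    """Per-category sum of (active cycles * energy complexity) over all blocks."""
    totals: Dict[int, int] = defaultdict(int)
    for category, active_cycles, complexity in blocks:
        totals[category] += active_cycles * complexity
    return dict(totals)


def structural_energy(
    unit_energy: Dict[int, float],
    blocks: Iterable[Tuple[int, int, int]],
) -> float:
    """Custom hardware energy: unit energy of each category times its total."""
    totals = structural_variables(blocks)
    return sum(unit_energy[category] * total for category, total in totals.items())


# Hypothetical functional blocks: (category, active cycles, energy complexity).
blocks = [
    (1, 10, 256),  # one 16-bit multiplier, active for 10 cycles
    (2, 5, 32),    # one 32-bit adder, active for 5 cycles
    (2, 3, 16),    # one 16-bit comparator, active for 3 cycles
]
# Hypothetical unit energy coefficients per category.
unit_energy = {1: 0.5, 2: 2.0}

energy = structural_energy(unit_energy, blocks)
```

The per-category totals returned by `structural_variables` play the role of the structural macromodel variables during regression, so one coefficient per category is fitted regardless of how many functional blocks a custom processor instantiates.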
2) Macromodel Fitting Through Regression Analysis: Regression analysis is used to determine the energy coefficients in the macromodel template shown in (2) (with the instruction-level and structural terms as given in (3) and (4), respectively). Test programs are used to gather both energy consumption values and execution statistics that correspond to the different macromodel variables in the equation. For a set of $n$ test programs, the energy consumption data are grouped into an $n \times 1$ column vector, $\mathbf{E}$, while the values corresponding to the $m$ macromodel variables are grouped into an $n \times m$ matrix, $\mathbf{X}$. In such a case, model fitting through regression analysis involves solving the linear-matrix equation, $\mathbf{E} = \mathbf{X}\mathbf{C}$, where $\mathbf{C}$ is the $m \times 1$ energy coefficient vector corresponding to $\mathbf{X}$. In the above formulation, let $\hat{\mathbf{E}} = \mathbf{X}\hat{\mathbf{C}}$ represent the estimate of $\mathbf{E}$, and $\hat{\mathbf{C}}$ represent the estimate of $\mathbf{C}$. A solution of the matrix equation using the pseudo-inverse method [28] yields the values for the energy coefficient vector $\hat{\mathbf{C}}$, as shown in (5), such that the mean square error, as shown in (6), is minimized.
$$\hat{\mathbf{C}} = \left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{E} \tag{5}$$
$$\mathrm{MSE} = \frac{1}{n}\left\lVert \mathbf{E} - \hat{\mathbf{E}} \right\rVert^{2} \tag{6}$$
The macromodel with the above energy coefficient values best fits the data acquired using the test programs. To estimate the model coefficients $\hat{\mathbf{C}}$, the columns of matrix $\mathbf{X}$ (one per macromodel variable) must be linearly independent; otherwise, the problem has infinitely many solutions and the macromodel is not identifiable from the energy consumption values of the test programs [21]. Because matrix $\mathbf{X}^{T}\mathbf{X}$ is singular in that case, i.e., $\det(\mathbf{X}^{T}\mathbf{X}) = 0$, the pseudo-inverse in (5) cannot be calculated. To avoid such linear dependency in matrix $\mathbf{X}$, we should induce enough diversity in the test programs, so that they exercise the macromodel parameters in different proportions, and the number of test programs should be no smaller than the number of model parameters ($n \geq m$).
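The fitting step can be illustrated with a short, dependency-free Python sketch. It is not the tool flow used in this work: it solves the normal equations by Gauss-Jordan elimination on exact rational numbers, which gives the same coefficients as the pseudo-inverse solution whenever the data matrix has full column rank. The function name and the toy data are assumptions made for illustration.

```python
from fractions import Fraction
from typing import List, Sequence


def fit_coefficients(X: Sequence[Sequence[float]], E: Sequence[float]) -> List[Fraction]:
    """Least-squares fit of E ~ X c via the normal equations (X^T X) c = X^T E."""
    n, m = len(X), len(X[0])
    if n != len(E):
        raise ValueError("X and E must describe the same number of test programs")
    if n < m:
        raise ValueError("need at least as many test programs as macromodel variables")
    # Augmented system [X^T X | X^T E], built with exact rational arithmetic.
    A = [
        [sum(Fraction(X[k][i]) * Fraction(X[k][j]) for k in range(n)) for j in range(m)]
        + [sum(Fraction(X[k][i]) * Fraction(E[k]) for k in range(n))]
        for i in range(m)
    ]
    for col in range(m):
        # Find a row with a nonzero pivot; none exists if X^T X is singular.
        pivot = next((r for r in range(col, m) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("X^T X is singular: test programs are not diverse enough")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [value / pivot_value for value in A[col]]
        # Eliminate this column from every other row.
        for r in range(m):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][m] for i in range(m)]


# Toy data: three hypothetical test programs, two macromodel variables.
X = [[1, 0], [0, 1], [1, 1]]
E = [2, 3, 5]
coefficients = fit_coefficients(X, E)
```

When two macromodel variables always appear in the same proportion across the test programs, for example `X = [[1, 2], [2, 4], [3, 6]]`, the function raises an error instead of returning arbitrary coefficients, which corresponds to the identifiability condition discussed above.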
V. EXPERIMENTAL RESULTS
We have implemented the flow described in Section IV using commercial tools and scripts to perform several key tasks in the methodology. The target extensible processor used in our experiments is the Xtensa processor from Tensilica [1]. The base processor is a T1040.0 version of the processor running at 187 MHz (0.18-µm technology), and the configuration includes a 32-bit multiplication instruction, four-way 16-KB instruction and data caches, a 32-bit wide system bus, and a generic register file with 64 32-bit registers. The processor includes clock-gating features.
Characterization in our set-up proceeds as follows. The GNU-based cross-compiler of the Xtensa software development kit is used to cross-compile test programs to get their memory images (Step 3 in Fig. 3). An ISS is used to simulate test programs to gather execution statistics (cycle count, cache misses, interlocks, etc.) needed by the macromodel (Steps 6 and 9). Test programs are Tensilica benchmarks written in C, while custom instructions are written in the TIE language [1]. Test programs instantiate TIE instructions intrinsic to their description, which are cross-compiled to generate executables for instruction-set simulation. The Xtensa processor generator is used to generate the RTL description of the custom processors needed during characterization. The RTL description is [31] to perform regression and derive the energy macromodel. The results of characterization are presented in Section V-A, while the results of applying the macromodel to evaluate the energy consumption of different applications are presented in Section V-B.
A. Energy Macromodel for the Xtensa Processor
The energy macromodel template for the Xtensa processor is shown in (2), with the instruction-level and structural macromodel terms given in (3) and (4), respectively. There are 21 macromodel variables in the template whose coefficients are determined through regression analysis. Table I presents these coefficients, which indicate the per-cycle estimates of the base processor energy consumption for each base processor instruction category, the per-cycle estimates of the energy consumption of the side-effects of custom instructions on the base processor, and the per-miss/per-fetch/per-interlock estimates of the energy consumption for execution-time effects, as well as the unit energy consumption (per-cycle, per-bit) for the different custom hardware library components. All the test programs used for characterization are benchmarks used by Tensilica to evaluate the performance of their Xtensa-series embedded processors. The test program suite includes programs from a wide range of applications to: 1) cover the various instructions in the instruction set; 2) exercise the various architectural manifestations such as instruction and data cache misses, processor interlocks, etc.; and 3) cover scenarios so that all the custom hardware modules are used. The test programs include applications such as rgb2cmyk, alphablend, color from image processing and graphics, rand from wireless communication, checksum, byteswap from networking, audio from voice compression, add4, bubsort, gcd from digital signal processing, solomon from network security, etc. Note that any test program suite can be used for characterization as long as the above-mentioned requirements are met, and the set of programs is diverse enough not to bias characterization. Fig. 4 reports the fitting errors corresponding to each test program. The errors are small, with the maximum error being 8.9% and the root mean square error being 3.8%.
We observed that the fitting errors of the test programs solomon and color are higher than the errors of the other test programs. Note that the overall macromodel is built using a linear macromodel template based on the statistics corresponding to all the test programs. Therefore, it is possible for regression analysis to be unable to fit some of the statistics corresponding to solomon and color to a linear macromodel template, while attempting to minimize the fitting error corresponding to all the test programs. However, the overall accuracy of the macromodel will be evaluated by using it for any test application as shown in the next section.
B. Applying the Energy Macromodel: Accuracy Results
In this section, we present the results of applying the energy macromodel described in Section V-A. We performed two experiments to validate our methodology and the macromodel.
1) Absolute Accuracy Examination: Our first experiment involves determining the energy consumption of several arbitrary applications (incorporating different custom instructions) in two ways: i) using the derived energy macromodel and ii) using the RTL power estimation tool WattWatcher [13] on the synthesized RTL description of the corresponding extended processor. Table II summarizes these results. For the ten application benchmarks (different from the test programs used in the characterization flow) shown in the first column, the maximum estimation error is 8.5%, while the average absolute error is only 3.3%. The description of each application is given in the second column. Fig. 5 shows the breakdown of energy consumption of some applications into two parts: base processor core and custom hardware extension. For each application, the left bar represents the energy estimation by our macromodel for both the base processor core and custom hardware extensions, and the right bar represents the corresponding simulation results obtained from WattWatcher. The results show that: 1) the energy consumption due to custom hardware can be significant (for example, for application Multi_accumulate, the ratio of the energy consumption of the custom hardware extension to that of the base processor core is 34.0%, and the average ratio for these applications is 27.3%) and 2) the macromodel performs well for both the base processor and the custom hardware: as seen from the graph, its estimates track the results reported by WattWatcher in both cases.
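Error figures of the kind quoted in this paper (maximum error, average absolute error, and root mean square error) can be computed as in the following Python sketch. It assumes that each error is expressed as a percentage of the reference value; the exact normalization is an assumption, and the sample numbers are hypothetical rather than taken from Table II.

```python
import math
from typing import Dict, Sequence


def accuracy_metrics(
    estimated: Sequence[float],
    reference: Sequence[float],
) -> Dict[str, float]:
    """Percentage errors of estimates with respect to reference energy values."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    # Signed percentage error of each estimate relative to its reference.
    errors = [100.0 * (est - ref) / ref for est, ref in zip(estimated, reference)]
    abs_errors = [abs(error) for error in errors]
    return {
        "max_abs_pct": max(abs_errors),
        "mean_abs_pct": sum(abs_errors) / len(abs_errors),
        "rms_pct": math.sqrt(sum(error * error for error in errors) / len(errors)),
    }


# Hypothetical macromodel estimates and reference values (same energy unit).
metrics = accuracy_metrics(estimated=[110.0, 95.0], reference=[100.0, 100.0])
```

The mean absolute error treats over- and under-estimation alike, whereas the root mean square error penalizes a few large deviations more heavily, which is why both are worth reporting.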
The proposed energy estimation methodology is very fast. It takes only a few seconds for application-energy estimation using our approach, and it does not need RTL generation of the custom processor. The average time taken by WattWatcher to determine the energy consumption of a small application is several hours, and it needs the RTL description of the custom processor, which takes several more hours to generate. Thus, we achieve an average speedup of three orders of magnitude.
2) Relative Accuracy Examination: We also performed an additional experiment to study the relative accuracy of the macromodel, when used in energy optimization studies for an application with multiple custom instruction choices (note that each choice may contain a group of custom instructions). For this experiment, we implemented the Reed-Solomon decoding/encoding algorithm with four different custom instruction choices. Fig. 6 shows the energy estimates obtained by using our macromodel for each choice (this required merely instruction-set simulation to generate the statistics required by our macromodel). For each choice, we also obtained the energy estimates using WattWatcher (this required the actual synthesis of the custom processor in each case, followed by RTL simulation and power estimation using WattWatcher). The two profiles track each other, demonstrating good relative accuracy of our macromodel. The high relative accuracy and low effort characteristics (no custom processor generation, no RTL simulation) associated with the usage of our macromodel make our approach highly suitable for such architectural exploration studies.
Fig. 6. Energy-consumption estimates of an application for various custom instruction choices derived using our macromodel and WattWatcher.
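One simple way to quantify how well two energy profiles track each other is to check, for every pair of design choices, whether both estimators order the pair the same way. The Python sketch below computes this fraction of concordant pairs. This metric is an illustration and is not the one used in the experiment above; the energy values are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def concordant_fraction(model: Sequence[float], reference: Sequence[float]) -> float:
    """Fraction of design-choice pairs that both estimators order the same way."""
    if len(model) != len(reference) or len(model) < 2:
        raise ValueError("need two equally long profiles with at least two choices")

    def sign(x: float) -> int:
        return (x > 0) - (x < 0)

    pairs = list(combinations(range(len(model)), 2))
    agree = sum(
        1
        for i, j in pairs
        if sign(model[i] - model[j]) == sign(reference[i] - reference[j])
    )
    return agree / len(pairs)


# Hypothetical energy estimates for four custom instruction choices.
macromodel_profile = [4.0, 3.0, 2.5, 2.0]
reference_profile = [4.2, 3.1, 2.4, 2.1]
fraction = concordant_fraction(macromodel_profile, reference_profile)
```

A fraction of 1.0 means that the macromodel would lead a designer to the same ranking of custom instruction choices as the reference tool, even if the absolute estimates differ.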
VI. CONCLUSION
In this work, we presented an efficient framework for characterizing the energy consumption of an extensible processor. Our characterization flow facilitates the construction of an efficient macromodel by using parameters drawn from both instruction-level and structural domains, and by leveraging the benefits of regression analysis. Since application of the macromodel only requires instruction-set simulation based analysis of the application, energy estimation using our approach is very fast. We characterized the energy consumption of a state-of-the-art extensible processor and used the macromodel so obtained to derive highly accurate energy estimates for several applications.
Since our goal was to estimate the energy consumption of an application running on an extensible processor efficiently and accurately, we made some conscious choices while developing our energy macromodel characterization and usage flows. High efficiency comes from choosing parameters in the macromodel template that are available from instruction-level simulation. Some of the instruction-level parameters account for the dynamic characteristics of a processor (such as processor stalls) and memory hierarchy (such as cache misses and uncached instruction fetch overheads). Moreover, we performed a dynamic analysis of the custom hardware usage pattern with the instruction execution trace, which captures the computational functionality of custom processor extensions, thereby improving the overall accuracy. Further accuracy improvements can be made by explicitly considering other inter-instruction effects and data value dependencies. In addition, some processors may require consideration of effects such as deeper pipeline stages, longer instruction words, instruction encoding, addressing modes, activity on the data bus, etc., for better accuracy (with possibly slower estimation times). We believe that the proposed energy estimation technique provides a fundamental framework that can be extended to tackle the characterization requirements of other complex processors.
Yunsi Fei (S'01) received the B. Eng. and M. Eng. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1997 and 1999, respectively, and the M.A. degree in electrical engineering in 2001 from Princeton University, Princeton, NJ, where she is currently pursuing the Ph.D. degree in electrical engineering. Her interests include system-level cosynthesis, high-level synthesis, power analysis and synthesis of application-specific instruction set processor, embedded-software and operating-systems design for low power, and power management for portable devices. 
Srivaths Ravi
