I. INTRODUCTION
The increasing popularity of low-power applications drives the need to analyze and optimize power consumption in all parts of a micro-processing system [1, 4, 5]. Software now constitutes a major part of systems in which power is a constraint, and it contributes significantly to the overall power consumption [2]. All these systems operate at a constant power supply voltage; consequently, accurate measurement and analysis of the power supply current is needed in order to evaluate the software-related power consumption of the processing system [3, 5]. There are several reasons for seeking information about the instruction-level power consumption of a low-power processing system: a) estimation of the power cost related to the software, b) verification of the overall system's power budget, and c) provision of useful information for high-level design decisions such as hardware-software partitioning.
Other teams have also worked on current measurements for power consumption estimation [2, 3, 5, 8], either by measuring the voltage drop across a small-value resistor or by measuring only the average current by means of an accurate milliammeter. Both methods, however, impose significant restrictions: averaging suppresses valuable information, while resistor-based measurements influence the actual level of the voltage applied to the chip and thus create an offset noise on the current values. A different measurement approach has recently been presented in [7]. A technique based on measuring the charge transfer of switched capacitors (placed in the power supply path) is employed to provide information about energy consumption per clock cycle. A switched pair of capacitors is charged with the power supply voltage during each clock cycle and is discharged during the next cycle, thus powering the processor. The change in the voltage level across the capacitors is proportional to the square of the consumed energy and this value is then used for the calculation of energy in a clock cycle. This is the only known method with per-cycle measurement resolution that does not affect the processor's supply voltage. However, this complicated method cannot provide detailed information on the shape of the current waveform, which may be very useful in many applications and also when high-quality power models (including the architectural characteristics of the processor) are required.
This work presents the methodology and the instrumentation setup of a novel approach for deriving an instruction-level power model. The method is based on continuous measurement of the instantaneous current drawn by the processor. A high-performance instrumentation setup has been established for the accurate measurement of the variations of the power-supply current. It is a fully automated setup that includes a current sensing circuit, a Digital Storage Oscilloscope (DSO), and a PC with data processing software, to which the current waveform is transferred and where the power consumption is calculated.
As regards system modeling, most of the instruction-level power models reported in the literature base their approach on the measurement of average values of the processor current. The accuracy of these averaging methods for the estimation of power consumption is low, but is still acceptable in some cases. They cannot, however, provide any information about consumption at each clock cycle and consequently cannot exploit the various operational characteristics of the processors. Thus, they can only model the energy consumption in a macroscopic way, which limits these models.
The present method for instruction-level power consumption is based on actual, continuous monitoring and measurement of the instantaneous current of the processor. In this way, we can observe details of the current variations during each clock cycle, watch closely the processing operation and calculate the energy consumption in each clock cycle. The energy amount and the power consumption per cycle are estimated by integrating the drawn current. This is a first, important, step towards a high-quality instruction-level power modeling as this approach permits us to capture the dependency of energy on second order effects like the address spaces, operand values, etc.
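The per-cycle integration described above can be sketched as follows. This is a minimal illustration, not the routine used in this work: it assumes uniformly sampled current values, a constant supply voltage, an integer number of samples per clock cycle, and the trapezoidal rule as the integration scheme. The function and parameter names are hypothetical.

```python
def energy_per_cycle(samples, dt, vdd, samples_per_cycle):
    """Return the energy (J) drawn in each complete clock cycle.

    samples           -- instantaneous current values in amperes (uniform sampling)
    dt                -- sampling interval in seconds
    vdd               -- supply voltage in volts, assumed constant
    samples_per_cycle -- number of sampling intervals per clock cycle (assumed integer)
    """
    energies = []
    n_cycles = (len(samples) - 1) // samples_per_cycle
    for c in range(n_cycles):
        start = c * samples_per_cycle
        # One extra sample so that the window closes on the cycle boundary.
        window = samples[start:start + samples_per_cycle + 1]
        # Trapezoidal rule: charge = integral of i(t) dt over the cycle.
        charge = sum(0.5 * (a + b) * dt for a, b in zip(window, window[1:]))
        energies.append(vdd * charge)
    return energies


# Hypothetical data: a constant 10 mA sampled every 1 ns, 10 intervals per cycle.
cycle_energy = energy_per_cycle([0.010] * 21, 1e-9, 5.0, 10)
```

Dividing each energy value by the cycle duration (samples_per_cycle * dt) gives the average power in that cycle.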
II. POWER CONSUMPTION MODELING
Power analysis and modeling techniques can be categorized into measurement-based and simulation-based ones. In simulation-based methods [4, 6], the energy consumed by software is estimated by calculating the energy consumption of various components in the target processor through simulations, which can be performed at different levels of abstraction.
Simulation-based methodologies can be subdivided into a) circuit-level, b) gate-level and c) register transfer-level (RT-level) techniques. A common drawback of these simulation-based techniques is that they do not provide a mechanism able to calculate the energy consumption of software directly from the instruction sequence.
In measurement-based approaches, the energy consumption of software is characterized by examining data obtained from the hardware. Possible variants of this approach are: a) to measure current averages [5, 8], and b) to obtain instantaneous current or direct energy measurements [7, 11]. The advantage of the measurement-based approaches is that the resulting energy model is very close to the actual energy behavior of the processor, because the information is acquired from the hardware itself.
For measurement techniques, the most widely used concept is to associate the instructions running on the processor with their corresponding energy cost. These measurement-based instruction-level power analysis techniques require no prior knowledge of the architectural details of the processor under study. For each instruction, a set of power factors is assigned, considering two main cases: a) a base power cost, proportional to the processor current consumption and to the corresponding execution time, and b) the effect of circuit state changes as various processor resources are activated or deactivated in order to perform the functional operations of the specific instruction. Other costs can be assigned to encompass factors that have a substantial impact on the overall power figures. These depend on the microarchitectural characteristics of the specific processor. In a micro-processing system, probable causes of increased power consumption include cache misses, pipeline stalls, branch prediction overheads, and wait states (inserted due to the lower operating frequency when instructions are accessed from external memory blocks).
Power analysis techniques for embedded processors that employ physical measurements were first suggested in the mid-1990s. Software optimization for power minimization received its first boost when Tiwari et al. [2, 3] proposed a technique based on physical measurements. This technique rests on the assumption that, by measuring the current drawn by the processor as it repeatedly executes certain instructions or short instruction sequences, it is possible to obtain most of the information needed to evaluate the power cost of a program executed on that processor.
If a given instruction is executed repeatedly, then the power consumed by the processor can be thought of as the power cost of that instruction. In a given program, certain inter-instruction effects also occur, such as the effect of circuit state, pipeline stalls and cache misses. Repeated execution of instruction sequences in which these effects occur may provide a way to isolate their power cost. Thus, the sum of the power costs of each instruction executed in a program, augmented by the power cost of the inter-instruction effects, gives a fairly good estimate of the power cost of the program.
In the previously published methods, the current drawn by the processor while executing these instruction sequences was measured with a standard off-the-shelf digital ammeter. The instructions under investigation were placed in an infinite loop in order to overcome the short execution time of a single program run, and thus to obtain a stable current reading. The instruction power costs are then classified in the following way:
1. Base instruction costs: The base cost for an instruction is determined by constructing a loop with several instances of the same instruction. The average current being drawn is then measured. This current multiplied by the number of cycles taken by each instance of the instruction gives an estimate directly proportional to the total energy.
2. Inter-instruction effects (circuit state overhead): When sequences of instructions in a program are considered, certain inter-instruction effects come into play, which are not reflected in the cost computed solely from base costs. The switching activity in a circuit is a function of the present inputs and the previous state of the circuit. Thus, it can be expected that the actual energy cost of executing an instruction in a program may be different from the instruction's base cost. This is because the previous instruction in the given program and in the program used for base cost determination may be different.
3. Other costs (cache misses, pipeline stalls): Resource constraints in the CPU can lead to additional energy dissipation, mostly due to the increase of execution time for a given program. Costs due to cache misses and stalls (pipeline stalls or prefetch buffer stalls) can be modeled as another kind of inter-instruction effect. A cache miss adds a cycle penalty to the instruction execution time, and these extra cycles lead to additional energy cost. The average current consumption penalty for cache miss cycles is multiplied by the average number of miss penalty cycles to obtain the average energy penalty for one miss. The energy cost of each kind of stall is determined through experiments that isolate that particular kind of stall, and in this way an average cost value is computed for each stall type. The extracted costs due to resource constraints and cache misses are then added to the base and inter-instruction effect costs to give the overall energy consumption.
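The conversion from an average current reading to an energy cost, used in items 1 and 3 above, can be illustrated with a short sketch. It assumes a constant supply voltage and a fixed clock frequency. The function name and all numerical values are hypothetical and serve only to show the arithmetic.

```python
def cycles_to_energy(i_avg_amps, n_cycles, f_clk_hz, vdd_volts):
    """Energy (J) drawn over n_cycles clock cycles at an average current.

    The product (current x cycles) is proportional to energy; multiplying by
    the supply voltage and the clock period turns it into joules.
    """
    t_exec = n_cycles / f_clk_hz          # execution time in seconds
    return vdd_volts * i_avg_amps * t_exec


# Hypothetical readings: a 2-cycle instruction drawing 20 mA on average,
# and a 3-cycle cache miss penalty with a 25 mA average current penalty,
# both at 4 MHz and 5 V.
base_cost = cycles_to_energy(0.020, 2, 4e6, 5.0)
miss_penalty = cycles_to_energy(0.025, 3, 4e6, 5.0)
```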
For any given program P, its overall energy cost E_P can be calculated as follows:
$$E_P = \sum_i (B_i \cdot N_i) + \sum_{i,j} (O_{i,j} \cdot N_{i,j}) + \sum_h E_h$$
This expression is the sum of three parts, which correspond to the different phenomena taking place. The base cost B_i of each instruction i, weighted by the number of times N_i that it is executed, provides the total base cost of the program. Then the circuit state overhead O_{i,j} for each pair of consecutive instructions (i, j), weighted by the number of times N_{i,j} that the pair is executed, is added. Finally, a third term is added which represents the energy contribution E_h of the other inter-instruction effects h (stalls and cache misses) that would occur during the execution of the program.
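The three-term sum described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not code from the cited works: the program is given as an executed instruction trace, pairs with no measured overhead are treated as contributing zero, and the instruction names and cost values are hypothetical.

```python
from collections import Counter


def program_energy(trace, base_cost, overhead, other_effects=()):
    """Estimate program energy from an executed instruction trace.

    trace         -- sequence of executed instruction mnemonics
    base_cost     -- dict: instruction -> base energy cost
    overhead      -- dict: (previous, next) instruction pair -> circuit state overhead
    other_effects -- energy contributions of stalls and cache misses
    """
    n_i = Counter(trace)                    # executions of each instruction
    n_ij = Counter(zip(trace, trace[1:]))   # executions of each consecutive pair
    base = sum(base_cost[i] * n for i, n in n_i.items())
    # Assumption: pairs without a measured overhead contribute nothing.
    state = sum(overhead.get(pair, 0.0) * n for pair, n in n_ij.items())
    return base + state + sum(other_effects)


# Hypothetical costs in arbitrary energy units.
total = program_energy(
    ["NOP", "ADD", "NOP", "ADD"],
    base_cost={"NOP": 1.0, "ADD": 2.0},
    overhead={("NOP", "ADD"): 0.5, ("ADD", "NOP"): 0.25},
    other_effects=[0.75],
)
```

In this example the base term is 6.0, the circuit state term is 1.25 and the remaining term is 0.75, so the estimate is 8.0 units.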
III. THE PROPOSED MEASURING APPROACH
The proposed method is based on the measurement of the instantaneous current drawn by the processor during the execution of the instructions. Measuring the instantaneous current values offers a direct image of the processor's operation, as the waveform of the current drawn continuously illustrates the operation of the processor or of the circuit under investigation. The proposed current measurement approach aims at overcoming the shortcomings of the previous methods by continuously monitoring the "instantaneous" current variations with high-accuracy, high-speed measurement circuitry and an automated data acquisition setup. The main task is performed by an analog current-mirroring configuration and a high-frequency digital storage oscilloscope, which form a measurement setup capable of monitoring, recording and analyzing the instantaneous power supply current variations, as shown in Figure 1. Such a configuration offers continuous, high-quality monitoring of the instantaneous current drawn by the circuit under test (the processor) without causing any fluctuations of its supply voltage. This measurement approach is similar to the Built-In Self-Test (BIST) techniques [9, 10] used for testing analog circuits, in which monitoring the current drawn by the circuit under test allows its different operating conditions to be evaluated. By applying proper timing and signature analysis techniques to these measurements, the power consumption of each instruction sequence used in the software may be estimated. This means that the power consumption due to a specific instruction may be analyzed and measured on the basis of the shape, the duration and other characteristics of the recorded current variations. The proposed current measurement approach is based on a current mirroring configuration with Bipolar Junction Transistors (for high frequency operation and negligible power-supply voltage fluctuation).
A current mirroring circuit capable of providing a precise copy of the instantaneous current drawn by the processor core is used (Fig. 2). The output current (copy) is then monitored by a precision Digital Storage Oscilloscope (DSO) and transferred to a PC via a GPIB bus connection. Custom in-house automation software controls DSO monitoring and waveform acquisition, and also performs the appropriate calculations for the estimation of the power consumption. In particular, a discrete-time integration routine is executed, which provides the integral over a user-defined time interval. The circuit shown in Figure 2 is used in this case. This simple circuit has proven to offer remarkable performance in terms of copying accuracy and time (frequency) response, which are of major importance for the specific application. The first of these characteristics is obviously important for the accurate measurement of the instantaneous current value. The second is also important, since the current variations in each clock cycle are short, pulse-like waveforms. A key point in this measurement problem is to monitor accurately the shape of the current variations, since this characteristic strongly affects the energy consumption value. The fastest edge of these pulse-like current waveforms reported in the literature for off-the-shelf processors is estimated (not measured) to be on the order of a few nanoseconds. Taking this figure as a rough estimate of the required monitoring speed of the current sensing configuration, a rough specification for the upper frequency limit is around 100MHz. Note that processing circuits operating at higher clock frequencies usually offer the user the possibility of reducing the clock frequency below 100MHz without affecting any other operating parameter.
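The step from an edge time of a few nanoseconds to an upper frequency limit of about 100MHz is not derived in the text. One common way to relate the two, shown here only as an assumption, is the rule of thumb for a single-pole response, in which the bandwidth is approximately 0.35 divided by the 10-90% rise time. The function name and the 3.5ns example value are hypothetical.

```python
def bandwidth_from_rise_time(t_rise_s, k=0.35):
    """Approximate -3 dB bandwidth (Hz) needed to follow an edge.

    Uses the single-pole rule of thumb BW ~= k / t_rise with k = 0.35,
    where t_rise is the 10-90% rise time in seconds. This is an
    approximation, not a specification taken from the measurement setup.
    """
    return k / t_rise_s


# Hypothetical edge of 3.5 ns, i.e. "a few nanoseconds".
required_bw_hz = bandwidth_from_rise_time(3.5e-9)
```

Under this assumption a 3.5ns edge corresponds to roughly 100MHz, which is consistent with the order of magnitude quoted above.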
The experimental current mirror circuit and the other components needed for the proper operation of the system are placed on a specially designed printed circuit board. The bipolar transistors used for this configuration have a typical bandwidth of 280MHz, a typical maximum continuous collector current of 150mA, and a typical h_FE of 120. The discrete transistors used on the actual PCB were carefully chosen to have similar (almost identical) V_BE characteristics. Multiple experimental tests were then performed to ensure the proper operation of this setup for the specific application. These tests include current copying accuracy, operation range (min-max values of the current), frequency response, phase difference measurements (between input and output current waveforms), etc. Note that the input current I_i is the current drawn by the processor, or generally by the circuit or device under test, as shown in Fig. 2, and the output current I_o is the current copy generated by the current mirror, which is measured as a voltage across the output resistor R. For these tests, the input current I_i was set by a precision current sink for the DC and error characteristics, and by a high-frequency generator for the frequency response measurements. A set of high-performance instruments was used for the experimental evaluation of the specifications of this measurement configuration. The set includes waveform generators (models HP3325B and IFR2025), a precision current source (Keithley's 224), the HP34401 multimeter, and digital storage oscilloscopes (HP54601 and Hameg's HM-1507-3).
The experimental measurement diagrams shown in Figures 3-6 present the remarkable performance characteristics of this instrumentation system in terms of the different specifications considered for this application. These diagrams present typical cases from the multiple measurement tests which were repeatedly performed in the lab. The main characteristics illustrating the performance of the measurement system include an operation range of 2-100mA with a relative deviation of less than 2.5%, which remains below 1% over an operating range large enough to monitor different variations (Fig. 3). Note that if a lower current value is to be monitored, the constant current value (offset) I_b in the input mirroring section may be changed. This simple modification shifts the operating range back into the useful low-error region of Figure 3.
The current copying capability (gain) remains practically constant, with equal input and output current values, up to the desired limit of 100MHz. As shown in Figure 4, the gain fluctuation does not exceed 0.5dB over the entire frequency range of interest. Phase difference is another point of interest for these systems, and the corresponding diagram is shown in Figure 5. Note that, as shown in the diagram, there is actually a small time delay in the system, which in the upper frequency range is on the order of 2ns. Expressed as a fraction of the signal period, this time delay appears as an increasing phase difference in the diagram of Figure 5. Yet this delay does not affect the shape of the waveforms under study and therefore does not cause any problem for this application.
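The relation between a fixed time delay and a phase difference that grows with frequency can be made concrete with a short calculation. The sketch below assumes that the delay is constant at 2ns over the whole range (the text quotes this value only for the upper frequency range) and that the gain in dB follows the 20*log10 amplitude convention. The function names are hypothetical.

```python
def phase_lag_degrees(delay_s, freq_hz):
    """Phase difference (degrees) produced by a pure time delay.

    The delay divided by the signal period (1 / freq) is the fraction of a
    full 360-degree cycle by which the output lags the input.
    """
    return 360.0 * delay_s * freq_hz


def db_to_amplitude_ratio(gain_db):
    """Convert a gain in dB to an amplitude ratio (20*log10 convention assumed)."""
    return 10.0 ** (gain_db / 20.0)


lag_at_10mhz = phase_lag_degrees(2e-9, 10e6)     # about 7.2 degrees
lag_at_100mhz = phase_lag_degrees(2e-9, 100e6)   # about 72 degrees
ripple_ratio = db_to_amplitude_ratio(0.5)        # about 1.06, i.e. roughly 6%
```

The same 2ns delay therefore appears as a tenfold larger phase difference at 100MHz than at 10MHz, although the waveform is only shifted in time and not distorted.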
In addition to these diagrams, a typical oscilloscope recording is also shown here, to illustrate the efficiency of the proposed solution for accurate monitoring and measurement of the current waveforms expected. The example shown in Figure 6 presents the accurate monitoring of a 10MHz square wave, showing the details of the comparison between input and output current variations.
IV. EXPERIMENTAL RESULTS
The proposed current measuring approach provides more insight into the power consumption, and consequently into the internal operation, of any digital circuit. As an initial test of the capabilities of the measurement setup, we monitored the current drawn by a simple 4-bit up-down binary counter (CMOS technology, type MC14516 from "ON Semiconductors"). The waveform of the counter's power supply current, as recorded and shown in Figure 7, presents a train of spikes with amplitude directly proportional to the number of bit transitions occurring as the content of the counter is increased. Between any two of these spikes appears a low-level, constant-amplitude spike that corresponds to the other edge (half-period) of the clock pulses. As may be seen in this figure, starting from the first low-level spike, all odd-numbered spikes (1, 3, 5, …) are caused by the clock's half-period. The second spike is a high-level one, which corresponds to a reset of the counter, and all other even-numbered spikes, with different amplitude levels, correspond to the sequence of bit transitions as the count increases. Starting from the fourth spike, which corresponds to a one-bit transition (from 0000 to 0001), then spike number 6, corresponding to a two-bit change (from 0001 to 0010), then number 8 with one bit again (0010 to 0011), the sequence goes on with 3, 1, 2, 1, 4, 1, etc. It is clear that the operation of this simple digital block (binary counter) can be analyzed by this kind of monitoring.
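The expected sequence of spike amplitudes can be reproduced by counting how many bits change on each increment of the counter. The sketch below is only an illustration of that counting, with a hypothetical function name. It models an ideal up-counter and says nothing about the absolute current levels.

```python
def bit_transitions(width=4):
    """Number of bits that toggle on each increment of a binary up-counter.

    Entry n of the result is the number of differing bits between the count
    n and the next count (n + 1, wrapping around at 2**width).
    """
    mask = (1 << width) - 1
    toggles = []
    for n in range(1 << width):
        nxt = (n + 1) & mask
        # XOR marks the bits that differ; count the ones.
        toggles.append(bin(n ^ nxt).count("1"))
    return toggles


transitions = bit_transitions(4)
```

For a 4-bit counter the list starts with 1, 2, 1, 3, 1, 2, 1, 4, 1, which matches the order of spike amplitudes described for Figure 7.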
Moving on to another test case, we examined the capability of monitoring a well-known, general-purpose, single-chip microcontroller, namely Motorola's HC05C8. Figure 8 shows a NOP instruction followed by a 5-cycle BCLEAR instruction which, at the end of this execution sequence, causes the voltage at the corresponding pin of the microcontroller port to drop to zero (upper trace of Fig. 8). Note the presence of the low-level, short-duration spikes, which are identical and appear every half-cycle of the clock in all cases. The shape, the duration and other characteristics of the current variations corresponding to the specific instruction may then be analyzed and measured by processing the recorded waveform. The resolution of the instrumentation setup permits a detailed investigation of these characteristics by offering a closer view (zoom) of each one of the pulse-like spikes.
Figure 9 shows the recording of a similar case, focused on one of the two phases
V. CONCLUDING COMMENTS
The successful use of power-supply current monitoring in both analog and digital integrated circuits has prompted this research team to investigate the feasibility of monitoring the instantaneous supply current as a method for high-performance evaluation of power consumption in digital systems (mainly small microprocessors) for low-power applications. The established setup can also be used for measuring the power consumption of the other components at the board level. In this way, the power cost of memory accesses and the power consumption of DC/DC converters and various microprocessor peripherals can be accurately measured in each clock cycle, leading to analogous per-cycle power models for these components. Also, by exploiting the capability of performing differential measurements, accurate power models for the busses could be developed. With such models available in their toolbox, design engineers may accurately estimate power consumption at the system level and effectively explore different system implementations at a very early design stage of a low-power system.
A number of appropriate experiments have to be carried out for the characterization of the power components of each instruction. Multiple sets of current variation recordings (current transients) under specific test situations (in terms of the software module that is executed at that time) are needed. These recordings should also be synchronized with the processing circuit clock pulses for proper timing identification. The basic parameters that define the instruction-level power consumption model of a processor should be considered [5, 11] . These are actually the different operating situations of the circuitry, which are imposed by the software (instructions). The effect of each one of these situations on the power consumption should be analyzed. Special programs have to be derived for measuring the effect of each one of these parameters in the total power budget of any instruction.
The present measurement technique permits the accurate implementation of the modeling method presented by Tiwari et al. [2, 3], which has been proved suitable for embedded core processors and has also been validated for commercial microprocessors. Power reductions of up to 40%, obtained by rewriting the code using the information provided by the instruction-level power model, are mentioned in the literature, and considerable energy reductions have been reported as verified by physical measurements.
