Abstract: This paper presents an infrastructure that uses program phase behaviour and brings together the advantages of simulation and physical measurement to capture detailed program power behaviour. Based on the direct correlation between instructions-per-cycle and power dissipation for the same piece of code, we refine the phase classification used in SimPoint to improve the similarity among the intervals in the same phase in terms of time-dependent power behaviour. Experimental results show that this classification method significantly decreases the average relative standard deviation across the intervals within each phase by 68% with respect to the average power dissipation of the intervals.
Introduction
Process technology increases transistor density by about 35% per year. The effect of increasing die size and transistor density causes a growth rate in transistor count of about 55% per year with a commensurate improvement in transistor performance. High transistor density has supported the emergence of 64-bit microprocessors and many innovations in pipelining and caches (Hennessy and Patterson, 2002). At the same time, power management has emerged as a major challenge for device scaling. Most of the power dissipation of CMOS microprocessors comes from the switching (dynamic) power of transistors, which can be calculated as
$$P = f \times C \times V_{dd}^{2}$$
Here, f is the switching frequency, C is the load capacitance of the transistor and $V_{dd}$ is the supply voltage. A decrease in the feature size of a transistor means a decrease in its load capacitance. However, the increases in the number of transistors and in frequency dominate the decreases in load capacitance and voltage, resulting in power dissipation growth. Power is likely to become the major limitation in the development of computer architecture (Hennessy and Patterson, 2002).
Simulators are often used to evaluate the power and performance of programs. By simulating the execution of a program on some platform, we can obtain a very detailed program profile of both power and performance. Simulation is very important in the early stages of computer architecture design, when no concrete architecture is available. However, simulators are very slow and inaccurate compared with physical measurement. In the OS and compiler communities, benefit analysis typically relies on physical measurement in order to get objective power/energy behaviour at a time cost proportional to program execution time. However, physical measurement results alone are often unable to explain the observed power/energy behaviour, i.e., they do not provide semantic feedback.
It is our goal to bring together the advantages of simulation and physical measurement to build an evaluation infrastructure for OS/compiler power/energy optimisations. By combining the results of physical measurement and simulation, the user can get an objective power/energy measurement with a semantic connection to the evaluated program.
For long-running programs it is hard to get very detailed power behaviour of the whole program through physical measurement because of hardware limitations. Even with energy simulation, it is time-consuming to simulate the whole program. We plan to use the concept of SimPoints (Sherwood et al., 2001) in our infrastructure to overcome hardware limitations and simplify the power/energy measurement/simulation. That is, we find representative slices of a program and perform measurement/simulation only on these slices to get the power/energy behaviour of the whole program.
Although the power consumption of a program can be estimated from the power consumption of its representative slices with a very low error rate, the intervals within a single cluster sometimes have widely different power consumption. Under the assumption that two intervals from the same cluster with different power dissipation have different power behaviour, the SimPoint selected for such a cluster is not representative of its power behaviour.
Based on the observation of the direct relation between instructions-per-cycle (IPC) and power dissipation, we combine the phase classification based on Basic Block Vectors (BBV) from SimPoint (Sherwood et al., 2002) and a phase classification based on IPC to obtain better representative slices of program execution, called ppoints. Through detailed simulation of the selected slices, our infrastructure can characterise the dynamic power behaviour of the whole program. Experimental results show that the number of ppoints from this new classification increases only moderately compared to the number of SimPoints selected by SimPoint, while the average relative standard deviation (RSD) of power consumption in each phase is reduced by 68%.
The rest of the paper is organised as follows: Section 2 describes power/energy simulators and the SimPoint idea. Section 3 discusses our current evaluation infrastructure and preliminary results. Section 4 describes the validation of the feasibility of the SimPoint idea in power/energy evaluation through simulations. Section 5 presents the refined phase classification method and its benefits. Section 6 gives directions for future work, and Section 7 concludes the paper.
Related work

Power evaluation techniques
In simulation-based power evaluation methods, the system is abstracted into various components. The energy consumption of a program is estimated as the sum of the energy consumption of all these components. Simulators can be classified according to their levels of system and component abstraction. They target different levels of detail and make different trade-offs between simulation time and accuracy. Most simulators are parameterised, so they can be used to estimate the energy consumption of different system configurations. Simulators are very important in the early stages of architecture design and evaluation. Furthermore, many simulators can give details at a very fine level of semantic granularity.
Transistor-level
These simulators characterise models of transistors and estimate voltage and current behaviour over time. Power dissipation of transistors comes from three sources: switching power, short-circuit power, and leakage power. Such simulation is time-consuming but useful in integrated circuit design. Transistor-level simulators are not suitable for evaluating the power consumption of large programs on complex systems.
Cycle-accurate microarchitecture-level
A microarchitecture-level simulator simulates execution at the level of individual cycles, which makes it possible to track changes in power behaviour across cycles. Such simulators are often used to simulate modern superscalar processors. Three examples of cycle-level simulators are Wattch (Brooks et al., 2000), SimplePower (Vijaykrishnan et al., 2000) and Sim-Panalyzer (Sim-panalyzer). We use Sim-Panalyzer in our research to estimate the power dissipation of the benchmarks.
Instruction-level
Instruction-level simulators provide coarser power behaviour than the above two. The simulation is based on the instruction-level energy profiling of the instruction set of the target processors and the assumption that the energy consumption of an instruction is mostly independent of the addressing mode or operands. Instruction-level simulators are normally faster than cycle-level simulators. One instruction-level simulator is Joule-Track (Sinha et al., 2001 ).
System-level
Hardware component system-level simulators characterise the energy consumption of each system component in different states. The simulator records the transitions between states and the time each component spends in each state during the simulation, and from these calculates the energy consumption of the whole program. Such simulators do not provide detailed power behaviour of a program, but they are useful for component selection and system partitioning during system design. An example is the simulator from Duke University (Cignetti et al., 2000). It is an extension of POSE, the Palm OS Emulator.
There are also some software component system-level evaluators. PowerScope (Flinn and Satyanarayanan, 1999 ) is a time-driven statistical sampler that uses samples from a digital multimeter. An energy-driven statistical sampler from Compaq is similar to PowerScope except that the sampling period is determined by energy quanta.
SoftWatt (Gurumurthi et al., 2002) is the first simulator to target the complete system power profile of a high-end system. It extends SimOS (Rosenblum et al., 1995) with validated analytical energy models for hardware components. This simulator identifies power hotspots in system components, captures the relative contributions of user and kernel code to the power profile, and identifies power-hungry OS services.
ECOSystem (Zeng et al., 2002 ) is a modified Linux that manages energy as an OS resource. Parameters of its 'currentcy model' can be changed to support different platforms. Isci and Martonosi (2003) propose a coordinated measurement approach that combines real total power measurement with performance-counter-based per-unit power estimation. This is useful for dynamic power/energy management. Our work is different from this approach because our objective is an evaluation infrastructure for OS/compiler optimisations. Our infrastructure can get the power measurement of any small region of a program as well as the power behaviour of a long program. Furthermore, even though (Isci and Martonosi, 2003) provides power break-down for CPU components, there is no semantic connection between the measurement result and the measured program, which is important for observing power/energy optimisation opportunities and will be an important contribution of our infrastructure.
Disadvantages of energy simulators
The above simulators have some common features. Energy models for the various components are characterised before the evaluation, and the simulator evaluates energy consumption by looking up values in many tables. The higher the precision, the larger the tables, so speed usually decreases as precision increases. Simulators are valuable for power and energy estimation of unavailable architectures. For OS and compiler level power optimisation on available architectures, physical measurement can be used for evaluation.
Performance modelling is subject to many sources of error (Black and Shen, 1998) . Modelling errors are from the incorrect coding of the desired functionality. Desikan et al. (2001) measured the experimental error in microprocessor simulation and showed that the error in common simulators is often larger than the performance gains yielded by new architecture ideas reported in the literature. From the construction of power simulators, we can see that power simulators are also subject to the errors mentioned in Black and Shen (1998) . Some tables are usually simplified to accelerate simulation. There may be mismatches between reality and the simulation of the program execution. The effect of the OS is not considered in many simulators. All of these issues make accuracy a problem of simulators. Ghiasi and Grunwald (2000) compared two architectural power models, the Cai-Lim power model and Wattch, and found that these models disagree on the efficacy of the design choices in each experiment and do not always produce statistically significant results.
The disadvantages of power simulation and physical measurement show that we need a faster, more precise power and energy evaluation infrastructure to correctly reflect the power behaviour of a program and evaluate the benefit of an optimisation. This is the motivation of our research.
The SimPoint idea: offline phase clustering
Execution of a program falls into repeating behaviours called phases. If we can identify these phases then we can find segments of the program execution with similar behaviour and focus our simulation or measurement only on several representative segments to get the behaviour of the whole program execution. Sherwood et al. (2002) proposed off-line Phase Clustering Analysis and used it in finding simulation points of programs in their SimPoint research.
In SimPoint, a program execution is partitioned into intervals with a fixed number of instructions. Intervals with similar behaviour are clustered into a phase. BBV profiling is used to collect a sort of fingerprint of each interval for off-line classification. This fingerprint is the number of instructions executed for each basic block during the interval.
The phase behaviour can be found by examining the ratios in which different regions of code are executed over time (Sherwood et al., 2001) . One cluster corresponds to a phase. One interval for each phase is chosen as a simulation point and the program behaviour can be estimated from the simulation of all the SimPoints. They validated this method on the SPEC2000 benchmarks by estimating their IPC, cache miss rate, branch misprediction rate etc.
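As an illustration of the fingerprint described above, the following minimal sketch builds one normalised BBV per fixed-size interval from a trace of executed basic blocks. The trace format, the function names and the use of Manhattan distance to compare two vectors are assumptions made for this example; it is not the SimPoint implementation, which clusters the vectors rather than comparing them pairwise.

```python
from collections import Counter

def build_bbvs(trace, interval_size):
    """Split a trace of (block_id, n_instructions) records into intervals of
    roughly `interval_size` instructions and return one normalised BBV
    (a dict mapping block_id to its fraction of instructions) per interval."""
    bbvs, current, executed = [], Counter(), 0
    for block_id, n_instr in trace:
        current[block_id] += n_instr      # instructions executed in this block
        executed += n_instr
        if executed >= interval_size:     # interval is full: emit its BBV
            total = sum(current.values())
            bbvs.append({b: c / total for b, c in current.items()})
            current, executed = Counter(), 0
    if current:                           # trailing partial interval
        total = sum(current.values())
        bbvs.append({b: c / total for b, c in current.items()})
    return bbvs

def manhattan(u, v):
    """Manhattan distance between two BBVs (illustrative similarity measure)."""
    return sum(abs(u.get(b, 0.0) - v.get(b, 0.0)) for b in set(u) | set(v))

# Hypothetical trace: two intervals dominated by blocks A and B, then two by C.
trace = [("A", 4), ("B", 6), ("A", 4), ("B", 6), ("C", 10), ("C", 10)]
bbvs = build_bbvs(trace, interval_size=10)
print(len(bbvs), manhattan(bbvs[0], bbvs[1]), manhattan(bbvs[0], bbvs[2]))
```

Intervals that execute the same blocks in the same proportions have a distance of zero, while intervals that share no blocks reach the maximum distance of 2 for normalised vectors.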
We adapt the SimPoint idea to simplify our physical measurement and enable the physical measurement of programs with long execution times in spite of the hardware limitations. We have validated the feasibility of this idea in power/energy evaluation through simulation on some benchmarks.
Correlation between IPC and power dissipation
From the results of profiling both IPC and power dissipation for each interval, we find a direct correlation between the two. Chen et al. (2003), Valluri and John (2001) and Li and John (2003) also reported a similar correlation between IPC and power. Power dissipation in each cycle depends on the work done in that cycle. The dynamic power behaviour of an interval depends on its execution cycles and its power per cycle. Both depend on IPC: for a fixed number of instructions, the number of execution cycles is inversely proportional to IPC, while power per cycle depends on the work done in that cycle. If two intervals with similar executed basic blocks have different IPC, we can say that they have different power behaviour, since both power per cycle and total execution cycles are different. Figure 1 shows this correlation for jpegencode.
There are 247 intervals in total, each containing 1 million instructions. The x axis is the cluster that each interval belongs to. The primary y axis on the left is the power dissipation of each interval. The secondary y axis on the right is the profiled IPC for each interval. Intervals are shown in the order of their cluster number and their appearance in program execution, so that we can see the differences in power dissipation and IPC among all the intervals from the same cluster generated by SimPoint. Within each cluster, the correlation between power and IPC is obvious. For instance, in cluster 2, where the same basic blocks are executed, high IPC means high power dissipation and vice versa. A similar correlation exists in the other benchmarks in our experiments. If we classify the intervals from the same BBV-based cluster based on IPC, intervals from a new cluster will have similar power behaviour. In our infrastructure, we use a refined phase classification method, which combines BBV and IPC to get the representative intervals for power behaviour characterisation. With this method, intervals with similar BBVs but different IPCs are classified into different phases. The level of difference between IPCs is adjustable to balance accuracy against time cost.
Evaluation infrastructure
Current state of the infrastructure
As depicted in Figure 2 , our evaluation infrastructure has three components: a Skiff board (a Compaq Personal Server PCB with a Strong-ARM SA110 CPU and 32 MB of SDRAM), a Tektronix TDS3014 DPO oscilloscope with a TDS3TRG advanced trigger module, and a dedicated data-acquisition Linux machine. The Skiff board has separate power planes and current measurement points for CPU and memory. The measured program runs on the Skiff board. The oscilloscope measures the current or voltage of the components (CPU, memory) of the Skiff board or the whole board. The data-collecting machine communicates with the oscilloscope to gather data and does offline analysis. Sampling is done by the oscilloscope and the data-acquisition machine only communicates with the oscilloscope. So there is no interference between the measured program, sampling and data collecting. The accuracy of the result is high. 
Usage and measured parameters of oscilloscope
The TDS3014 oscilloscope has four channels. Users can control its acquisition mode, record length, trigger mode, data encoding and other configurations through our user-level API. TDS3014 has two sample acquisition modes:
• normal: record length = 10,000 points
• fast trigger: record length = 500 points.
Record length is the number of points that comprise a complete waveform record. It determines the amount of data that can be captured with each channel. In our initial work, we use fast trigger mode to collect the power behaviour of the whole measured program. There are two problems:
• The oscilloscope keeps acquiring samples all the time. It is hard to determine the beginning and the end of the measured program.
• The communication cost to collect 500 samples from the oscilloscope to the data-acquisition machine is much higher than the cost to generate these samples when high resolution is used.
In our experiments, the communication cost is usually about 135 ms. This means that the data-acquisition machine can retrieve all data from the buffer before it is overwritten only if the oscilloscope is set to generate 500 samples over a period of 135 ms or longer. In this case, for a 233 MHz machine such as the Skiff board, each sample covers at least 62,910 cycles. This resolution may not be good enough if we want a closer look at the power behaviour of the measured program. However, if we use a higher sampling rate, some samples will be lost due to the overwriting mechanism of the sample buffer.
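The resolution figure above follows from simple arithmetic, sketched here. The function and parameter names are ours, chosen for illustration; the input values are the ones quoted in the text.

```python
def cycles_per_sample(window_us, samples, cpu_mhz):
    """CPU cycles covered by one sample when `samples` points span
    `window_us` microseconds on a CPU clocked at `cpu_mhz` MHz."""
    # microseconds * (cycles per microsecond) / number of samples
    return window_us * cpu_mhz / samples

# 500 samples spread over the 135 ms communication cost, 233 MHz CPU
print(cycles_per_sample(135_000, 500, 233))  # 62910.0
```

Halving the sampling window would halve the cycles per sample, but the buffer would then be overwritten before the data-acquisition machine finished reading it.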
To solve the first problem we use the trigger module of the oscilloscope. In this mode, the oscilloscope starts sample acquisition only after a trigger event happens. The acquisition stops after an entire record is obtained. The trigger that starts the measurement is generated by setting a pin on the Skiff board to a high voltage. We call this pin the trigger pin. After the measured program part terminates, the trigger pin is set to a low voltage to indicate the end of the measurement. By running some specially designed micro-benchmarks, we determined that the delays of setting the high voltage and the low voltage are 20 ns and 47 ns, respectively. Our current prototype user interface allows the specification of program regions by inserting compiler pragmas to mark the start and end of the region to be measured. Trigger generation code is inserted into the source code automatically. The execution of the measured program region is not affected, allowing non-intrusive power/energy observations.
Although the normal mode can help us identify the beginning and the end of the measured program, it is still hard to get the power behaviour in high precision for programs with long execution time due to the high communication cost. A larger memory for the oscilloscope can help, but the problem persists if we want higher precision. This is why we plan to use the SimPoint idea to find representative slices of a program and get power evaluation for the whole program based on the physical measurement of the selected slices.
Measurement example
Figure 3 shows a very simple program with a loop. Unrolling the loop eight times and compiling it with GCC yields a new version, Version A, which has a basic block with 16 loads followed by 16 additions. We reschedule the instructions of the assembly of Version A by hand to get two other versions, Versions B and C. Then we use GCC to compile them into executables. Table 1 shows the instruction schedule for each version. Figure 4 shows the measured CPU current for the Strong-ARM SA110 processor (2.0 V, 233 MHz). The line marked 'trigger' represents the trigger signal for the oscilloscope. At the beginning and end of the program region of interest, the trigger pin is set to high and low voltage, respectively.
Table 1 Instruction schedule for each version
Version A    16 loads, followed by 16 additions
Version B    Two loads, followed by two additions, followed by 14 loads, followed by 14 additions
Version C    Alternating groups of two loads, followed by two additions
In Figure 4, we see what we expect for the current behaviour of the program: one cache miss every eight iterations of the original version. Version C, the alternating schedule of memory and CPU instructions, leads to the shortest execution time, the lowest energy consumption and the smoothest power dissipation profile. Choosing this schedule over the alternatives will lead to a fast program with low peak CPU power dissipation and small variations in power dissipation.
For comparison, we also simulate the same programs using Sim-Panalyzer, a cycle-accurate architecture-level ARM power simulator. Since it is non-trivial to identify loops within an executable, we simulate the power dissipation of the whole program. Only the loop differs between simulations, so the differences among the simulation results come from our modification to the loop. Most of the architectural configuration values used in our simulation experiments are from the Strong-ARM v4 data-sheet. Others are from the default configuration provided by Sim-Panalyzer. We use the power configuration file provided by Sim-Panalyzer but change the frequencies to 233 MHz. Table 2 gives the simulation results for the four versions of the program. Both the power dissipation and the simulated cycles are normalised to the results of the original version.
From Table 2, we can see that Versions A-C all give better results than the original version in terms of both power dissipation and execution cycles. However, based on the simulation result, Version A is the best of the four versions. This differs from the observation based on the measurement result. Furthermore, even though we simulate the whole program, the comparison of the simulation results of the original version and Version A shows that the loop accounts for a large part of the power dissipation of the whole program. Yet we cannot see a significant difference between the simulation results of Versions B and C, which also disagrees with the physical measurement result. It is therefore hard to know when to trust the obtained simulation results.
Use of SimPoint in phase classification
Sometimes it is necessary to measure the power of the whole program in fine precision. Even though we can run the program many times to measure the power behaviour of one small slice in each execution and combine the results from the slices to get the final answer, it is time consuming and dealing with the overlap between two slices is non-trivial. In order to simplify the measurement work, we use offline phase classification from SimPoint (Sherwood et al., 2002) to find representative intervals to simulate/measure.
Offline phase clustering analysis
The selection of SimPoints includes the following steps: 
Feasibility validation of the SimPoint idea
In order to validate the suitability of the selected points by the above method, we choose nine benchmarks from (MediaBench) and compile them into ARM executables. We extend sim-outorder of SimpleScalar to perform BBV profiling on ARM benchmarks. The original SimPoint algorithm is also extended to support a fixed number of SimPoints instead of choosing the best one among several numbers so that we can see the impact of the number of SimPoints on the error rate of the power/energy estimation. We then explore the error rate of each benchmark in a 3-dimensional space: interval size, number of SimPoints and error rate. The following steps were performed on each benchmark:
1 compile the benchmark locally on the Skiff board with the extended options -static and -msoft-float
2 run the modified sim-outorder on the benchmark to get BBVs
3 run the BBV analysis program on the BBVs from step 2 to get the SimPoints and their corresponding weights
4 run Sim-Panalyzer on the SimPoints from step 3
5 calculate the whole-program power estimation based on the power values from step 4 and the weights from step 3
6 calculate the error rate.
Figure 5 shows the power behaviour of jpegencode and the clusters obtained from the off-line phase clustering.
The curve is the power behaviour of the benchmark. Each power value is the simulated power dissipation of a fixed number (10,000) of consecutive instructions, so the x axis cannot represent execution time precisely. The numbers on the primary x axis are phase numbers for the intervals of jpegencode when five phases are identified by the off-line phase classification algorithm. Intervals with the same phase number are classified into the same cluster and one of them is chosen as the representative of the phase. The numbers on the secondary x axis are phase numbers when ten phases are identified.
We can see that intervals from the same phase have similar power behaviour. When ten SimPoints are simulated or measured, we get more representative slices of the program than when five phases are identified. This is evident especially for the second half of the program. When there are five phases, the intervals in the second half are clustered into the same phase, in spite of the power peaks in this part. When there are ten phases, intervals with power peaks are clustered into another phase and the peaks can be found through the simulation or measurement of the SimPoints. Phases 10 and 3 have very close power behaviour, which means ten SimPoints are too many in this case.
Figure 5 Power behaviour and phase classification for jpegencode
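Steps 5 and 6 of the procedure above amount to a weighted average and a relative error. The sketch below assumes that each weight is the fraction of intervals belonging to the corresponding phase and that the reference value comes from simulating the whole program; the numbers are hypothetical and only illustrate the calculation.

```python
def estimate_power(simpoint_power, weights):
    """Whole-program power estimate: weighted average of the power
    simulated (or measured) for each SimPoint."""
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(simpoint_power, weights)) / total_weight

def error_rate(estimated, reference):
    """Relative error of the estimate against the full-program value."""
    return abs(estimated - reference) / reference

# Hypothetical values: three SimPoints, power in watts, weights sum to 1.
power = [0.50, 0.80, 0.65]
weights = [0.5, 0.3, 0.2]
estimate = estimate_power(power, weights)        # about 0.62 W
print(estimate, error_rate(estimate, reference=0.60))
```

Dividing by the sum of the weights keeps the estimate correct even when the weights are given as interval counts rather than fractions.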
We also explored the effects of different numbers of SimPoints and different interval sizes. For most benchmarks, when the number of SimPoints is smaller than ten, the error rate decreases as the number of SimPoints increases. When the number of SimPoints is larger than ten, the error rate does not change much. An interval size of 1 million instructions is good enough for the benchmarks. For some benchmarks, we get the same error rate when a smaller interval size is used, and the simulation time is greatly reduced. Figures are not available here due to space limitations. Figure 6 shows the IPC and power error rates of the simulated benchmarks and the average error rate. For each benchmark, the error rates of IPC and power are similar and the average error rate is below 2%. Figures 5 and 6 show the simulation-validated feasibility of the SimPoint idea in power and energy evaluation for large programs. To validate the feasibility of the SimPoint method through physical measurement, we need to identify each SimPoint and trigger the physical measurement during the execution of the program. This work is in progress.
Variable power behaviour in the same phase
Figure 6 shows that we can estimate the total energy consumption of a program with a very low error rate by using representative intervals found by phase classification based on the BBVs of intervals. However, there is sometimes a large variance in the energy consumption and time-dependent power behaviour of the intervals classified into the same phase. In this case, the selected SimPoint for a phase is not representative in terms of detailed power behaviour. Figure 7 shows the power behaviour of two intervals from a cluster of jpegencode. Due to the large number of cycles in an interval, we only show a segment of the interval power behaviour. We can see the difference between the two intervals in terms of power behaviour. Interval 1 has a larger power range and more dramatic fluctuation in power dissipation than interval 2. Both have repeating power behaviour, but interval 1 has a longer period. Furthermore, the two intervals have different execution cycles and total power dissipation, which is not shown in Figure 7.
No matter which interval is selected as the SimPoint for this cluster, it cannot represent both intervals in power behaviour. SimPoint generates no more than ten phases for each benchmark. If there are fewer phases in the benchmark execution, fewer than ten phases are formed. SimPoint generates seven phases for jpegencode, which means the difference is not due to the small number of phases.
Refined phase classification method
To efficiently characterise program power behaviour, we need to refine the phase classification currently in use to find the representative intervals for power behaviour characterisation. Figure 8 illustrates the power behaviour characterisation process using our two-step phase classification to get the ppoints. Phase classifications 1 and 2 are independent. Phase classification 1 generates clusters based on the similarity between BBVs as in Section 4. The intervals clustered into the same phase execute similar binary code. Phase classification 2 generates clusters based on the similarity between IPCs. Intervals from the same phase have similar IPC values. Intervals from different phases must have different power behaviour whether they execute similar basic blocks or not, since they have either different BBVs or different IPC.
The assumption of this power phase classification is that, if different executions of the same basic blocks have different power dissipation, they have different power behaviour. Our tool combines the clusters from these two phase classifications to form the final clusters. The correlation between IPC and power dissipation ensures that intervals in the same cluster of classification 1, but with different power dissipation, are partitioned into different clusters in step 3 of Figure 8.
In the current implementation of our tool, the number of phases identified in classification 1 is no more than 10. It is exactly the same phase classification as in SimPoint. The best number between 1 and 10 is selected during the classification procedure. The number of phases identified in classification 2 is 10.
In step 3 of Figure 8 , the final clusters and ppoints are generated.
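One way to picture the combination step is as an intersection of the two labelings: two intervals end up in the same final cluster only if they share both a BBV-based phase and an IPC-based phase. The sketch below is a minimal illustration under that reading. The function names are ours, and the rule for choosing a ppoint (the interval whose IPC is closest to the cluster mean) is a hypothetical choice, since the text does not specify how the representative is selected.

```python
from collections import defaultdict

def combine_classifications(bbv_labels, ipc_labels):
    """Group interval indices by their (BBV phase, IPC phase) pair."""
    clusters = defaultdict(list)
    for interval, key in enumerate(zip(bbv_labels, ipc_labels)):
        clusters[key].append(interval)
    return dict(clusters)

def pick_ppoints(clusters, ipc):
    """Hypothetical selection rule: per final cluster, take the interval
    whose IPC is closest to the mean IPC of that cluster."""
    ppoints = {}
    for key, members in clusters.items():
        mean_ipc = sum(ipc[i] for i in members) / len(members)
        ppoints[key] = min(members, key=lambda i: abs(ipc[i] - mean_ipc))
    return ppoints

# Hypothetical labels for seven intervals and their profiled IPC values.
bbv_labels = [0, 0, 0, 0, 1, 1, 1]
ipc_labels = [0, 0, 0, 1, 1, 1, 1]
ipc = [0.8, 1.0, 1.1, 1.6, 1.5, 1.7, 1.55]

clusters = combine_classifications(bbv_labels, ipc_labels)
print(clusters)                     # three final clusters from two BBV phases
print(pick_ppoints(clusters, ipc))
```

In this example interval 3 executes the same code as intervals 0 to 2 but at a much higher IPC, so it is split into its own final cluster.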
We use the same benchmarks and the same power simulator as in Section 4. Each benchmark is compiled with the -static and -msoft-float options by GCC 2.95.2 on the Skiff board. The binary code, a script file and a granularity of 10 are then input to the process in Figure 8, which generates the ppoints and other output files.
Interval variance reduction
To illustrate the improvement of our tool in finding representative points for power behaviour, we simulate the power dissipation of each interval for each benchmark and use the average RSD in power dissipation of each benchmark to evaluate the benefit from our classification method. A higher average RSD means that the classification method is worse. We divide the standard deviation of each cluster by the average interval power dissipation of the cluster to get the RSD of each cluster. Then the average of the RSDs of all of the clusters is the average RSD for the benchmark. We compare three methods of getting representative intervals.
• BBV: the original SimPoint method. No more than ten clusters are generated.
• BBV-k: the same classification method as in SimPoint, but given the same number of clusters as generated by the BBV + IPC method as its upper limit. The best number is selected by the classification procedure.
• BBV + IPC: our new phase classification.
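The average RSD used to compare these three methods can be computed as in the following sketch. It assumes the population standard deviation, which the text does not specify, and uses hypothetical power values; the function name is ours.

```python
from collections import defaultdict
from statistics import mean, pstdev

def average_rsd(power, labels):
    """Average over clusters of (standard deviation / mean) of the
    per-interval power dissipation within each cluster."""
    groups = defaultdict(list)
    for p, cluster in zip(power, labels):
        groups[cluster].append(p)
    return mean(pstdev(values) / mean(values) for values in groups.values())

# Hypothetical per-interval power values and two alternative classifications.
power = [1.0, 1.0, 2.0, 4.0]
coarse = [0, 0, 1, 1]   # the last two intervals share a cluster
fine = [0, 0, 1, 2]     # the last two intervals are separated

print(average_rsd(power, coarse))  # one cluster has RSD 1/3, the other 0
print(average_rsd(power, fine))    # every cluster is uniform
```

A finer classification can only help when it separates intervals whose power actually differs, which is why the comparison against BBV-k with the same cluster budget matters.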
In the following figures, there are three columns for each benchmark, corresponding to the three methods, left to right. In Figure 9(a), our BBV + IPC method substantially decreases the RSD for most of the benchmarks relative to the RSD of the BBV method. The average relative decrease is 68%. The average number of ppoints to simulate is 20.55, about twice the number generated by the BBV method. For the BBV + IPC method, a finer classification can be obtained by increasing the number of clusters based on IPC, but there is a trade-off between granularity and simulation time.
For some benchmarks, such as adpcmencode, adpcmdecode, g721encode and g721decode, although Figure 9(a) shows a big relative decrease in RSD, Figure 9(b) shows that the absolute decrease in RSD is very small. For example, BBV + IPC decreases the average RSD of g721encode by almost 90%, but the change in absolute RSD is from 0.13% to 0.01%. The benefit cannot offset the extra time and space consumption of the unnecessary ppoints. A restriction on the number of IPC-based clusters is needed to control unnecessary finer classifications. For the two intervals in Figure 7, our BBV + IPC method classifies them into two different phases, so they are identified as having different power behaviour. Figure 10 shows the power and IPC for each interval of jpegdecode. We can see that the power line for each cluster is smoother than in Figure 1. The decrease in average RSD is 67% for jpegdecode.
BBV-based classification with larger K
Figure 11 shows the power and IPC for the intervals of the clusters generated by the BBV-k method. This is different from Figure 1 because of the larger number of initial centers. This k is the same as the number of clusters in Figure 10. Even when a k value of 19 is given to BBV-k, it generates only 12 clusters. That is, only 12 phases are identified. The difference in power and IPC among intervals from the same cluster is still large, e.g., in clusters 1 and 11. The two intervals in Figure 7 are still in the same cluster. Error rates for the estimation of total power dissipation are not shown here. Since our BBV + IPC method only refines the clusters generated by SimPoint, the error rate should not be higher than the values in Figure 6.
Future work
Our focus in ongoing work is the identification of a SimPoint during program execution. The identification must have low overhead, or it will affect the accuracy of the measurement result. Perhaps some fingerprint of an interval other than its BBV could be used for phase clustering, making run-time identification of SimPoints easier.
From the comparison of physical measurement and simulation of the small program in Section 3, we can see that simulators sometimes give misleading results. Since our infrastructure can precisely measure a region of a program, we can design specific micro-benchmarks to validate an energy simulator, which helps provide a semantic connection between the physical measurement and the measured program.
IPC profiling through simulation is time-consuming. Profiling through program execution on a real system would significantly reduce the time cost. A metric other than IPC that can be profiled at low cost and used to refine the phase classification would also help.
After the above steps, we will have an integrated evaluation infrastructure for OS/compiler power and energy optimisations that will support future power and energy optimisation research.
Our current work is concentrated on the ARM architecture. We will build a general-purpose power/energy evaluation infrastructure that can be applied to more architectures.
Conclusion
This paper describes an evaluation infrastructure for OS/compiler power and energy optimisations that brings together the advantages of simulation and physical measurement. It can provide an objective evaluation of an optimisation and, if needed, a semantic connection between measured power/energy and source code. In order to overcome hardware limitations and measure long programs, we plan to use the SimPoint idea to find representative slices of a program and perform physical measurement on these slices. The preliminary results show that our current infrastructure can perform non-intrusive physical measurement and capture the precise power/energy behaviour of the measured region of a program. Through simulation, we also validated the feasibility of the SimPoint idea in power/energy evaluation. By refining the BBV-based phase classification of intervals with IPC, we reduced the variance in power behaviour among the intervals in the same phase, which makes this infrastructure feasible for characterising program power behaviour from the detailed power behaviour of selected intervals. Validation of our approach through physical measurement is ongoing. We expect that the final infrastructure will help researchers find more optimisation opportunities and avoid the effects of misleading evaluation results.
