Abstract: Many recent dynamic voltage-scaling (DVS) algorithms use hardware events (such as cache misses, memory bus transactions, or instruction execution rates) as the basis for deciding how much a program region can be slowed down with acceptable performance loss. Although these approaches result in power savings, the hardware events measured are at best indirectly related to execution time and clock frequency. We propose a new metric for evaluating the performance loss caused by DVS, a metric that is logically related to clock frequency and execution time, namely the percentage drop in cycles. Further, we show that we can predict with high accuracy the execution time of a code region at any clock frequency after measuring the total number of cycles spent in that region at two clock frequencies: the maximum and the second-highest clock frequency. Measurements using several real-world applications show that this "two-point" model predicts execution times with an accuracy that is greater than 95% in many cases. This result can be used to develop low-overhead DVS algorithms that are more system-aware than many of the current algorithms, which rely on measuring indirect effects.
Introduction
Dynamic voltage scaling (DVS) remains one of the most popular and effective techniques for reducing the power consumption of microprocessors. This technique, which can be controlled at the hardware, operating-system, or compiler level, works by reducing a processor's clock frequency and voltage in lockstep (Venkatachalam and Franz, 2005). Because CPU power dissipation is quadratic with respect to supply voltage and linear with respect to clock frequency, DVS should, under ideal conditions, reduce a processor's instantaneous power dissipation cubically.
However, when the processor is slowed down, program execution slows down as well. DVS is therefore only effective in code regions that can be slowed down without unacceptably increasing a program's total execution time (or a user's experience of it). The correct question to ask is thus: "What is the relationship between a program region's execution time and the clock frequency at which it is run?"
Due to the effects of processor stalls, memory- or disk-intensive programs tend to suffer less performance loss than compute-intensive programs when run at a lower clock frequency. This makes them ideal candidates for DVS algorithms. Thus several recent DVS approaches (Choi et al., 2004a,b,c,d; Li et al., 2003; Marculescu, 2000; Poellabauer et al., 2005; Singleton et al., 2005; Weissel and Bellosa, 2002; Wu et al., 2005) aim to detect memory boundedness, or how often the processor is stalling due to memory requests. Similarly, some approaches (Hsu and Feng, 2004, 2005) aim to infer memory boundedness from CPU-boundedness. Although each of these approaches may have good results, the key challenge they face is how to detect processor stalls. Many widely used architectures (e.g., Pentium M, Pentium III) lack hardware counter events for measuring cycles spent during processor stalls. Thus these approaches rely on indirect information provided by the available hardware event counters, such as cache misses (Choi et al., 2004b,c,d; Kondo and Nakamura, 2004; Li et al., 2003; Marculescu, 2000; Poellabauer et al., 2005; Singleton et al., 2005), memory bus transactions (Wu et al., 2005), memory requests per cycle (Weissel and Bellosa, 2002), or instruction execution rates (Hsu and Feng, 2004, 2005).
The problem is that the relationship between the events measured and the execution time of a program at a given clock frequency is at best indirect. Although the measurements may be statistically related to processor stalls, they do not imply processor stalls, and it is difficult to correlate them with program execution time, which depends on everything that is happening in a system. Moreover, not all processors provide the means to measure these events, and devices lacking this ability cannot be supported by the approaches mentioned above.
Copyright © 200x Inderscience Enterprises Ltd.
To overcome these difficulties, we present a new methodology to understand the relationship between performance loss and clock frequency. The key observation is that the execution time of a code region under a fixed clock frequency is the ratio of (a) the total number of elapsed clock cycles during the region's execution, and (b) the clock frequency (cycles/s) at which the region was executed. Thus if a region is run at a lower clock frequency and its execution time does not increase as much as one would expect (which is the case for memory-bounded code regions), then executing the region takes fewer clock cycles at the lower frequency than at the higher clock frequency.
Thus we propose to use the percentage decrease in clock cycles as the measure of how compute intensive a code region is. Further, we show that just by knowing the total cycles it takes to run a program at the maximum clock frequency, and the total cycles it takes to run it at a single notch below the maximum clock frequency, we can predict with high accuracy the program's execution time at any lower clock frequency. This result can be used to develop low-overhead DVS algorithms, which do not depend on platform-specific event counters.
In sum, we make the following contributions:
• We introduce a new measure of compute-boundedness which is based on logical foundations, namely the percentage decrease in cycles.
• We show that we can estimate, with high accuracy, the execution time of a code region at an arbitrary clock frequency simply by running the region at the maximum clock frequency and at one notch below the maximum clock frequency, and extrapolating from the difference in execution cycles.
• We show how this result can be used to develop low-overhead DVS algorithms that do not rely on hardware event counters. We evaluate our technique on a wide variety of real-world applications and show what percentage of their total runtime is spent in processor stalls. This can, among other things, be used to focus the development of power-management techniques toward those benchmarks that are most amenable to DVS.
The rest of this paper is structured as follows. Section 2 provides the theoretical framework of our model by formalising the decrease in clock cycles for an arbitrary code region that is slowed down. Sections 3, 4, and 5 describe, respectively, the implementation of our model, the methodology used to validate it, and results attesting to its accuracy for a wide variety of applications. Section 6 discusses how this model can be applied to dynamic voltage scaling. Section 7 discusses the difference between our model and related work, and Section 8 summarises our conclusions and describes avenues for future work.
Theoretical Foundation
In this section, we provide a formal model for understanding how much a program's total clock cycles decrease when the program is run at a lower clock frequency. Using this result, we show how we can estimate the CPU (or memory or I/O) boundedness of a program region by simply slowing the region down to any lower clock frequency and extrapolating from the difference in total execution cycles.
To put all these ideas in context, we first describe two theoretical extremes, an ideal, CPU-bound program that contains only computations (with no memory accesses), and an ideal, memory-bound program that contains only memory accesses (no computations). As Figure 1 shows, slowing down an ideal, CPU-bound program (top curve) does not change the total number of clock cycles needed for executing the program. Instead, the total cycles will remain the same because computations (e.g., addition or multiplication) generally require a fixed number of clock cycles. As a result, the execution time of the CPU-bound program increases inversely with respect to the decreasing clock frequency. (Cutting the clock frequency in half causes the program to run twice as long.)
In contrast, when an ideal, memory-bound program (bottom curve) is slowed down, the total number of cycles needed for executing the program decreases in direct proportion to the decrease in clock frequency. This is because the CPU is not performing any useful work during this ideal memory-bounded program, but waits. As a result, slowing down the CPU does not affect the amount of time it takes to run the program. This is why the ideal memory-bounded program does not suffer any performance loss when slowed down, in contrast to the ideal, CPU-bounded program, which does suffer performance loss.
Figure 1: Execution cycles as a function of the clock frequency for three programs: an ideal memory-bounded program, an ideal CPU-bounded program, and a typical program in between these two theoretical extremes.
Most programs (middle curve) will exhibit behaviour in between these two theoretical extremes. They may contain regions where the processor is performing computations as well as regions where the processor is idle, waiting for memory, disk, or network accesses to complete. When such a program is run at a lower clock frequency, its total clock cycles will decrease because the regions where the processor is idle will need fewer clock cycles for execution. This percentage decrease in total clock cycles is the primary (and most accurate) indicator of how CPU (memory) bounded a program region is.
To quantify this decrease for an arbitrary program region, we consider two scenarios of program region execution that exhaust all possible cases. In the first scenario, computations do not overlap with memory accesses, while in the second scenario, they do.
Scenario 1: No Overlap Between Computations and Memory Accesses
Scenario 1 assumes that either the processor or the main memory is executing at any given time. Figure 2 shows a simplified version of this scenario. The processor first performs some computations, then requests data from main memory, idles while waiting for the memory transaction to complete, and then continues performing more computations. Let C1 mark the first computation phase, I1 mark the idle phase, and C2 mark the second computation phase. Then the total processor clock cycles for the execution of this program region is:
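Written out from the three phases just defined (a reconstruction; the exact typeset form is an assumption), the sum is:

```latex
\[
\mathit{Cycles}_{total} = \mathit{Cycles}(C1) + \mathit{Cycles}(I1) + \mathit{Cycles}(C2)
\]
```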
Moreover, the total clock cycles that the processor spends idling (during I1) is the product of the clock frequency and the duration of phase I1. That is,
Because memory is asynchronous with respect to the processor (under our assumptions), the duration of the idle period, $T_{I1}$, is constant across all clock frequencies. Thus the total idle cycles (Cycles(I1)) decrease as the frequency $f$ is lowered. On the other hand, the total computation cycles (Cycles(C1) + Cycles(C2)) do not decrease since C1 and C2 are computational phases. As a consequence, the decrease in total execution cycles needed for executing the program region at a lower clock frequency $f_{new}$ instead of the current clock frequency $f_{old}$ is:
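With $f$ denoting the clock frequency and $T_{I1}$ the duration of phase I1 (symbols as used in the following paragraph; the layout is a reconstruction), this product reads:

```latex
\[
\mathit{Cycles}(I1) = f \cdot T_{I1}
\]
```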
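Since only the idle cycles depend on the frequency, the decrease follows directly from the product above. A reconstructed form, with $\Delta\mathit{Cycles}$ as an assumed name for the difference, is:

```latex
\[
\Delta\mathit{Cycles}
  = f_{old} \cdot T_{I1} - f_{new} \cdot T_{I1}
  = (f_{old} - f_{new}) \cdot T_{I1}
\]
```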
Scenario 2: Possible Overlap Between Computations And Memory Accesses
We now consider a slightly more complex scenario (Figure 3). As before, the processor first performs some computations and then issues a memory request. But this time, instead of idling during the entire memory transaction, the processor performs some computations and then idles. Finally, when the memory transaction is complete, the processor continues performing more computations. As before, let C1 stand for the first computation phase (which includes computations performed during the memory transaction), I1 stand for the phase in which the processor is idle, and C2 stand for the second computation phase (which is dependent on the memory transaction completing). Now suppose the code region is run at a lower clock frequency than the current clock frequency. Then, just as in scenario 1, the total clock cycles spent in computation (Cycles(C1) + Cycles(C2)) will remain the same since computations require a fixed number of cycles independent of the clock frequency. And just as before, the total clock cycles spent idling will decrease. However, the amount by which the idle cycles decrease will be greater in this scenario than it was in the previous scenario. This is because the time that the processor spends idling ($T_{I1}$) is not fixed, but rather decreases as the clock frequency is lowered, since the processor spends more time executing the (slowed-down) computations in C1 while the memory is accessed. To formalise this reduction in idle time, let $T^{old}_{I1}$ be the processor's idle time at clock frequency $f_{old}$ and $T^{new}_{I1}$ be the processor's idle time at clock frequency $f_{new}$. Then the difference in total clock cycles for executing the code region at the lower frequency $f_{new}$ instead of the frequency $f_{old}$ is:
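In symbols, using the two idle times just introduced (a reconstruction from these definitions; $\Delta\mathit{Cycles}$ is an assumed name for the difference):

```latex
\[
\Delta\mathit{Cycles} = f_{old} \cdot T^{old}_{I1} - f_{new} \cdot T^{new}_{I1}
\]
```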
Notice that the only difference between this and the previous scenario is that $T^{new}_{idle}$ is less than $T^{old}_{idle}$. If these times were equal, we would be back in scenario 1.
Applying this scenario to architectures such as the Pentium 6 is problematic, since they are pipelined and allow multiple loads and stores to overlap. The performance counters on these architectures do not provide sufficient information to quantify exactly how large this overlap between CPU and memory activity is; that is, there is insufficient information to determine exactly how much $T^{new}_{idle}$ differs from $T^{old}_{idle}$. Therefore, we will spend the rest of this section discussing the implications of scenario 1 (Figure 2). We will show that scenario 1 implies a very powerful result for DVS algorithms. This result, and the assumptions underlying scenario 1, will be substantiated by our measurements in Section 5.
The Implications of Scenario 1
In scenario 1 (Section 2.1), processor and memory are asynchronous, and memory stall time is roughly independent of the processor clock frequency.
Let $T_{idle}$ be the total amount of time that the processor stalls, waiting for memory requests to complete. (This is a summation of all the idle periods that are interspersed with computational periods in the entire program region.) Then according to equations 4, the total decrease in cycles when executing the code region at a lower clock frequency $f_{new}$ instead of the current clock frequency $f_{old}$ is:
As a result, we can deduce the idle period $T_{idle}$ to be:
This means that the time that the processor spends idling can be estimated by running the entire program region at a slower clock frequency and recording the difference in execution cycles. Since the idle period $T_{idle}$ is constant across all clock frequencies, it only needs to be computed once, and it does not matter how much we slowed down the original program region in order to compute it. Now we use $T_{idle}$ to estimate the execution time of the program at any clock frequency. As before, let $f_{old}$ be the old clock frequency at which we ran the program region and $f_{new}$ be the new, slowed-down clock frequency. Let $C_{old}$ and $C_{new}$ be the total execution cycles at the old frequency $f_{old}$ and the new frequency $f_{new}$, respectively. Then, by equations 6 and 5, the execution time of the program region under the new clock frequency $f_{new}$ is:
This equation implies that the execution time under the new clock frequency $f_{new}$ is a function of the cycles $C_{old}$ under the old clock frequency and the idle time $T_{idle}$, which we computed knowing merely the total cycles $C_{old}$ under the old clock frequency $f_{old}$ and the total cycles $C_{new}$ under the new clock frequency $f_{new}$. In other words, the total number of cycles it takes to run a program at two arbitrarily chosen clock frequencies is enough information to estimate the program's execution time for any other clock frequency.
In particular, running a program region at a clock frequency that is "one notch" lower than the clock frequency at which the region originally ran can provide enough information (i.e., the cycle counts) to estimate the execution time of the region at any clock frequency.
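Summing the scenario 1 relation over all idle periods gives the following form. It is reconstructed from the text, and the number 5 is inferred from the later references to equations 5 and 6:

```latex
\[
\Delta\mathit{Cycles} = (f_{old} - f_{new}) \cdot T_{idle} \tag{5}
\]
```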
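Rearranging the previous relation, and writing the cycle difference in terms of the totals $C_{old}$ and $C_{new}$ measured at the two frequencies, gives (reconstructed; the typeset form is an assumption):

```latex
\[
T_{idle} = \frac{\Delta\mathit{Cycles}}{f_{old} - f_{new}}
         = \frac{C_{old} - C_{new}}{f_{old} - f_{new}} \tag{6}
\]
```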
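One form consistent with the definitions above divides the cycle count at $f_{new}$ by $f_{new}$ and then substitutes equation 5 for the cycle difference. The original may have arranged the terms differently; $T_{new}$ is an assumed name for the execution time at $f_{new}$:

```latex
\[
T_{new} = \frac{C_{new}}{f_{new}}
        = \frac{C_{old} - (f_{old} - f_{new}) \cdot T_{idle}}{f_{new}} \tag{7}
\]
```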
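The two-point procedure of this section can be sketched in a few lines of Python. This is a minimal illustration under the scenario 1 assumption, not the implementation evaluated in this paper. The function names and the synthetic region (2.0e9 compute cycles plus 0.5 s of stalls) are hypothetical.

```python
def estimate_idle_time(cycles_old, cycles_new, f_old, f_new):
    """Stall time in seconds, from cycle counts at two frequencies (in Hz)."""
    if f_new >= f_old:
        raise ValueError("f_new must be lower than f_old")
    return (cycles_old - cycles_new) / (f_old - f_new)


def predict_time(cycles_old, f_old, t_idle, f_target):
    """Predicted execution time at f_target, assuming constant stall time."""
    compute_cycles = cycles_old - f_old * t_idle  # frequency-independent part
    return compute_cycles / f_target + t_idle


# Hypothetical region used only to exercise the formulas.
COMPUTE_CYCLES, STALL_SECONDS = 2.0e9, 0.5


def cycles_at(f_hz):
    """Synthetic 'measurement': compute cycles plus cycles spent stalled."""
    return COMPUTE_CYCLES + f_hz * STALL_SECONDS


f_max, f_next = 1.6e9, 1.4e9  # maximum frequency and one notch below
t_idle = estimate_idle_time(cycles_at(f_max), cycles_at(f_next), f_max, f_next)
t_600 = predict_time(cycles_at(f_max), f_max, t_idle, 0.6e9)
print(f"T_idle = {t_idle:.3f} s, predicted time at 0.6 GHz = {t_600:.3f} s")
```

Because the synthetic region obeys scenario 1 exactly, the prediction matches the true time at 0.6 GHz. On real measurements the two would differ by the estimation error studied in Section 5.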
Implementation
This section gives an overview of the experimental platform we used to validate our model.
Experimental Platform
Our experimental platform is a Dell Latitude D600 laptop featuring a 1.6 GHz Pentium M processor. The processor supports Intel Enhanced SpeedStep Technology, which allows switching between multiple frequency settings on the fly and automatically adjusts the voltage with respect to the frequency. The supported frequencies and their accompanying voltage levels are listed in Table 1. We use the Linux CPUFreq driver to adjust the frequency settings. However, we have written a system call that directly invokes this driver's internal frequency-setting method so that we can avoid the overhead of its default mechanism, a communication interface via the proc file system.
Changes to Linux Kernel
We have extended the Linux 2.6 kernel with a high-level interface for monitoring hardware events, including clock cycles, with minimal invasiveness. The interface distinguishes events occurring in user space from events occurring in kernel space, and it retrieves process-specific event counts.
To distinguish kernel mode events from user mode events, we have modified the Linux kernel to save and restore the event counts whenever a task switches to and from kernel mode, and to also update the total number of kernel events, context switches, and interrupts for the current process. Similarly, to retrieve process-specific event counts, we have modified the scheduler so that it saves the event counts for each task being switched out and restores the original event counts for each task being switched in.
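The save-and-restore scheme for process-specific counts can be illustrated with a toy model. The sketch below is not the kernel modification described above. `Counter`, `context_switch`, and the task names are hypothetical, and a writable Python object stands in for a hardware event counter.

```python
class Counter:
    """Stand-in for one writable hardware event counter (hypothetical)."""

    def __init__(self):
        self.value = 0

    def tick(self, events):
        # Simulate the hardware counting `events` while some task runs.
        self.value += events


def context_switch(counter, saved, prev_task, next_task):
    """Save the outgoing task's count, then restore the incoming task's count."""
    saved[prev_task] = counter.value
    counter.value = saved.get(next_task, 0)  # a task not seen before starts at 0


counter, saved = Counter(), {}
counter.tick(100)                          # task "A" runs for 100 events
context_switch(counter, saved, "A", "B")
counter.tick(30)                           # task "B" runs for 30 events
context_switch(counter, saved, "B", "A")
counter.tick(5)                            # task "A" runs again
context_switch(counter, saved, "A", "B")
print(saved["A"], counter.value)           # per-task totals for A and B
```

With this scheme each task sees only the events that occurred while it was scheduled, regardless of how its execution was interleaved with other tasks.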
Changes to the Java Virtual Machine
We decided to implement our model on top of the Java virtual machine because mobile code such as Java and .NET bytecode is already ubiquitous in a variety of devices including laptops, PDAs, and set-top boxes.
We have implemented our model inside version 1.6.0 of the Sun Hotspot Client Virtual Machine (SUN HotSpot, 2006). We previously implemented the same mechanism inside the Jikes Research Virtual Machine (Arnold et al., 2000), but migrated it to Hotspot for two reasons. First, Hotspot is a production-quality, industrial VM, whereas Jikes is a prototype, research VM with many missing features. By using Hotspot, we obtain results that are more realistic and have broader industrial applicability. Second, Hotspot can run programs using any of the Sun Java libraries, allowing us to experiment with a broader range of real-world applications than any other VM. This again enhances the applicability of our approach.
We have extended the Hotspot VM, allowing the interpreter and just-in-time compiler to instrument the entries and exits of methods. Our instrumentation can be used to start and stop hardware performance counters, sample any hardware counter events (including clock cycles), and switch the processor clock frequencies for different method invocations. All of the instrumentation can be turned on and off dynamically for specific methods, independent of any other methods.
We ran all benchmarks in Hotspot's default mixed-mode execution framework, because the interpreter-only option is prohibitively slow. We can safely run our experiments this way because Hotspot's re-compilation heuristic is not sensitive to the processor clock frequency, but only depends on method invocation counts and backward branch counts, meaning that the heuristic makes the same decisions regardless of the clock frequency.
Experiment Methodology

The Two Point Hypothesis
Recall our model of execution time in equation 7 from Section 2.3:
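For reference, with the symbols of Section 2.3, the model can be written as follows (a reconstructed form; the original layout is assumed):

```latex
\[
T_{new} = \frac{C_{old} - (f_{old} - f_{new}) \cdot T_{idle}}{f_{new}},
\qquad
T_{idle} = \frac{C_{old} - C_{new}}{f_{old} - f_{new}}
\]
```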
Our hypothesis is that if we know the execution cycles for a program at the maximum clock frequency and at the clock frequency one notch below the maximum clock frequency, this information allows us to predict, with reasonable accuracy, the execution time of the program at every other clock frequency.
Table 2: A description of the benchmarks used in this study.
(benchmark name not preserved): Solves an N x N linear system using LU factorisation followed by a triangular solve.
SOR: Performs 100 iterations of successive over-relaxation on an N x N grid.
HeapSort: Sorts an array of N integers using heap sort.
Crypt: Performs IDEA encryption and decryption on an array of N bytes.
FFT: Performs a one-dimensional forward transform of N complex numbers.
Sparse: Sparse matrix multiplication, using an unstructured sparse matrix.
We tested this "two point hypothesis" on 26 real-world Java applications from the SpecJVM '98, Dacapo, and JavaGrande benchmark suites, taking as our two points 1.6 GHz (the maximum clock frequency) and 1.4 GHz (one notch below the maximum clock frequency). The benchmarks used are described in Table 2. We ran each benchmark to completion ten times over every supported clock frequency on our platform, flushing the hardware caches in between each run to ensure that all the runs begin from the same state. For each benchmark we obtained, for each clock frequency:
• The total execution cycles for that clock frequency.
• The percentage decrease in clock cycles compared to running at the maximum clock frequency.
• The actual execution time for that clock frequency.
• The estimated execution time for that clock frequency, according to our model under the two-point hypothesis. This estimated time was computed by plugging into equations 6 and 7 the total cycles for 1.6 GHz ($\mathit{Cycles}_{1.6GHz}$) and for 1.4 GHz ($\mathit{Cycles}_{1.4GHz}$).
• The estimation error, given as a percentage. The errors were originally positive or negative depending on whether our model overestimated or underestimated the actual execution time. However, we have displayed here the absolute values of those errors.
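Two of the derived quantities in this list are simple ratios, sketched below. The sketch assumes the error is taken relative to the measured execution time, which the text does not state explicitly. The function names and sample numbers are hypothetical.

```python
def percent_cycle_drop(cycles_at_max, cycles_at_f):
    """Percentage decrease in cycles relative to the run at maximum frequency."""
    return 100.0 * (cycles_at_max - cycles_at_f) / cycles_at_max


def estimation_error_percent(estimated_seconds, actual_seconds):
    """Absolute error as a percentage of the measured time (assumed base)."""
    return 100.0 * abs(estimated_seconds - actual_seconds) / actual_seconds


# Hypothetical measurements for one benchmark at one clock frequency.
drop = percent_cycle_drop(cycles_at_max=2.8e9, cycles_at_f=2.1e9)
error = estimation_error_percent(estimated_seconds=3.9, actual_seconds=4.0)
print(f"cycle drop: {drop:.1f}%, estimation error: {error:.1f}%")
```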
Results
We illustrate our results with two different types of plots. First, Figure 4 is a histogram of the estimation errors for all benchmarks over all clock frequencies. On the x-axis is the clock frequency in GHz, and on the y-axis is the estimation error, given as a percentage. The datapoints correspond to the estimation errors for different benchmarks. The errors displayed for the reference points 1.4 GHz and 1.6 GHz will always be zero since these two points were initially chosen to extrapolate the execution times for all other clock frequencies. We are only interested in the datapoints plotted for the other clock frequencies (0.6 GHz up to 1.2 GHz). Of these 104 points, the majority (all but 8) lie below the 10% error mark. In fact, most of the points (all but 13) lie at or below the 5% error mark, and many of those are close to 1% error. A small number of points are higher on the error axis: two are between 10% and 15% error, three are between 15% and 20% error, and two are slightly above 20% error, but none of the points exceeds 25% error. This suggests that our model has an estimation accuracy above 90-95% for most of the datapoints, and 75-80% accuracy for a small number of datapoints. To see more closely what gives rise to these results, we present a second set of plots. In each figure, the x-axis is the clock frequency in GHz, the y-axis on the left denotes the execution time in seconds, and the y-axis on the right denotes the percentage decrease in clock cycles. Each figure displays three different plots as a function of the clock frequency, namely the actual execution time, the estimated execution time, and the percentage drop in cycles.
There are several conclusions that can be drawn from the benchmarks. First, there is a wide variation in how much performance loss different benchmarks exhibit when run at lower clock frequencies, and there is also a correlation between this performance loss and the decrease in total cycles. On the one hand, the benchmarks jython, ps, mpegaudio, crypto, and moldyn are highly CPU-intensive. The percentage drop in clock cycles for these benchmarks is very small (less than 5%) and the overhead of slowing these benchmarks down is very large. For example, for ps, the execution time increases by 163% when the benchmark is run at the lowest clock frequency. Thus these benchmarks should not be considered for DVS. On the other hand, the benchmarks antlr, db, jack, mtrt, euler, fft, flufact, heapsort, montecarlo, and sparsematmult fall into the category of being memory or disk bound. They exhibit significantly less performance loss and a higher decrease in cycles when run at lower clock frequencies. For example, the sparsematmult benchmark has roughly a 3.5% performance loss and a 61% drop in cycles when run at the lowest clock frequency.
Second, in most of the graphs, the percentage drop in cycles increases almost linearly as the clock frequency is lowered from 1.6 GHz to 0.6 GHz. This suggests that the total processor stall time is roughly the same regardless of clock frequency, implying that scenario 1 (Figure 2), rather than scenario 2 (Figure 3), holds for most of these benchmarks. This supports our model. For two of the graphs (DaCapo jython and Javagrande moldyn) the drop in clock cycles appears erratic, but the drop is so small (less than 5%) that we can safely ignore it. (The fluctuations are most likely caused by measurement noise.)
Third, for most of the benchmarks, our estimate of execution time is very close to the actual execution time for the different clock frequencies. However, there are a few cases where the error is 15-20%, namely for the benchmarks flufact, montecarlo, sor, fft, and mtrt. For the first four of these benchmarks there is a pattern, which is most easily observable in the graph of sor. The drop in cycles first increases linearly as the clock frequency is lowered, and our estimated execution time remains close to the actual execution time at these points. However, at a certain point, the slope of this curve abruptly changes, becoming less steep, and exactly at that point, our estimated execution time diverges noticeably from the actual execution time. This happens because the number of clock cycles spent in memory stalls decreases as the clock frequency is decreased. However, there will be a point of diminishing returns where the clock cycle length is so long that decreasing the clock frequency will not decrease the total cycles any further because all of the memory stall slack has been used up. Although we may not be able to observe this point (there are limits to how low we can set the clock frequency), the trend is for applications to gradually converge to that point. We see this in the graph by noticing the "drop in cycles" curve becoming less and less steep. However, because our model is based on a simple linear extrapolation, it is not sensitive to this. This explains the larger error numbers (15%-20%) we see for the four benchmarks above (flufact, sor, montecarlo, fft).
The mtrt benchmark displays a different anomaly: the drop in clock cycles is initially very low as the processor clock frequency is switched from 1.6 GHz to 1.4 GHz, but below 1.4 GHz it suddenly becomes higher, causing our model to overestimate the execution time by around 20%. Nevertheless, the curve remains linear for frequencies below 1.4 GHz. We are currently studying this benchmark further to see what could be causing this early change in slope.
Appendix A contains a table of the numbers used to generate these graphs, as well as the numbers for the benchmarks not shown as graphs.
Applications Of Our Model
There are many possible applications for this mechanism. For example, embedded systems can use this model to speed up the offline experiments used to predict the execution times of specialised programs. Another application would be to use this model at runtime in a DVS algorithm. One of the goals of a DVS algorithm is to find the optimal clock frequency $f_{new}$ that keeps the execution time of the program region from increasing by more than, say, N% of the original execution time. Using equation 7, we can model this constraint as:
Solving this equation for $f_{new}$, we get:
Then the goal of a DVS heuristic would be to find the lowest clock frequency $f_{new}$ that meets the above constraint. The different parameters in equation 9 can be estimated at runtime (by an OS or dynamic compiler) using the equations just provided. For example, at method level one would measure a method's execution cycles $C_{old}$ under the current clock frequency $f_{old}$. Once a stable reading has been obtained, the clock frequency for the method is changed for the next invocations. To avoid performance loss, the frequency is reduced by just one notch below the previous clock frequency. After the method has run enough times to obtain stable readings, we measure the difference in execution cycles with respect to the previous invocation, and estimate $T_{idle}$ using equation 6. This provides all the parameters ($C_{old}$, $f_{old}$, $T_{idle}$) needed to use equation 9 to determine the appropriate clock frequency for the method. Note that this approach is not restricted to method-level or compiler-based DVS. It can be applied at other granularities, e.g., at the OS level, where entire tasks could be slowed down using this approach. The advantages of this approach are:
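One way to write this constraint is to require the predicted time at $f_{new}$ to stay within $(1 + N/100)$ times the original time $C_{old}/f_{old}$. This is a reconstruction, and treating N as a percentage in this way is an assumption:

```latex
\[
\frac{C_{old} - (f_{old} - f_{new}) \cdot T_{idle}}{f_{new}}
  \le \left(1 + \frac{N}{100}\right) \frac{C_{old}}{f_{old}} \tag{8}
\]
```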
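Under the same assumptions, and noting that the denominator below is positive whenever the region contains any computation, the bound takes the form (reconstructed):

```latex
\[
f_{new} \ge
  \frac{C_{old} - f_{old} \cdot T_{idle}}
       {\left(1 + \frac{N}{100}\right) \frac{C_{old}}{f_{old}} - T_{idle}} \tag{9}
\]
```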
• There is a direct relationship between clock cycles, clock frequency, and execution time. This allows for a more accurate measure of how compute-intensive a code region is, and a more accurate computation of the correct clock frequency.
• The cycle counts encapsulate all the "hard-to-measure" system effects (e.g., disk accesses, network accesses, multiple cache misses) that could give rise to processor stalls, thus providing more complete information for making DVS decisions than any single hardware counter event.
• Because the clock frequency is only lowered a single notch below the top clock frequency, and this is only done once for the sake of measurement, there will be little performance loss.
• Clock cycles can be easily measured on most devices, allowing for portability even if hardware event counters fall out of fashion.
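The frequency-selection step described in this section can be sketched as follows. This is a hedged illustration rather than the heuristic as implemented. The list of supported frequencies is inferred from the range discussed in Section 5 and is an assumption, as are the function names and the sample inputs.

```python
# Assumed frequency steps in Hz (0.6 GHz to 1.6 GHz in 0.2 GHz notches).
SUPPORTED_HZ = [0.6e9, 0.8e9, 1.0e9, 1.2e9, 1.4e9, 1.6e9]


def frequency_lower_bound(c_old, f_old, t_idle, max_loss_percent):
    """Smallest frequency (Hz) whose predicted time stays within the loss budget."""
    budget = (1.0 + max_loss_percent / 100.0) * (c_old / f_old)
    compute_cycles = c_old - f_old * t_idle
    if compute_cycles <= 0:
        return 0.0  # region is entirely stall time: any frequency satisfies it
    return compute_cycles / (budget - t_idle)


def choose_frequency(c_old, f_old, t_idle, max_loss_percent,
                     supported=SUPPORTED_HZ):
    """Lowest supported frequency that satisfies the bound."""
    bound = frequency_lower_bound(c_old, f_old, t_idle, max_loss_percent)
    candidates = [f for f in supported if f >= bound]
    return min(candidates) if candidates else max(supported)


# Hypothetical memory-bound region: 1.75 s at 1.6 GHz, of which 1.5 s is stalls.
chosen = choose_frequency(c_old=2.8e9, f_old=1.6e9, t_idle=1.5,
                          max_loss_percent=10)
print(f"chosen frequency: {chosen / 1e9:.1f} GHz")
```

For a region with no stall time the same call returns the maximum frequency, which matches the observation in Section 5 that CPU-intensive benchmarks should not be considered for DVS.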
Related Work
There is a wealth of literature on DVS algorithms (Azevedo et al., 2002; Choi et al., 2004b; Dudani et al., 2002; Flautner et al., 2001; Govil et al., 1995; Gruian, 2001; Hsu and Feng, 2004; Hsu and Kremer, 2003; Kondo and Nakamura, 2004; Pillai and Shin, 2001; Saputra et al., 2002; Stanley-Marbell et al., 2002; Weiser et al., 1994). Due to limited space, we will only give a few representative examples of the kinds of approaches that our mechanism can be used to extend, namely those approaches that attempt to extrapolate the memory boundedness of programs from information provided by hardware event counters (Choi et al., 2004b,c,d; Kondo and Nakamura, 2004; Li et al., 2003; Marculescu, 2000; Poellabauer et al., 2005; Singleton et al., 2005; Stanley-Marbell et al., 2002; Wu et al., 2005).
Marculescu (2000) was among the first to propose using cache misses to drive dynamic voltage scaling. The main idea is that between the time when a cache miss is detected and the time it is resolved, the CPU activity can be divided into an independent and a dependent phase. The independent phase consists of instructions that can be executed while the miss is still being resolved. The dependent phase consists of instructions that have to wait until the miss is resolved. The CPU is slowed down immediately after the miss is detected, so that its workload during the independent phase finishes exactly when the miss is resolved.
Kondo and Nakamura (2004) propose an interval-based approach driven by cache misses. The heuristic periodically calculates the number of outstanding cache misses and increments one of three counters depending on whether the number of outstanding misses is zero, one, or greater than one. The cost function expresses the memory boundedness of the code as a weighted sum of the three counters. In particular, the third counter, which is incremented each time there are multiple outstanding misses, receives the heaviest weight. At fixed-length intervals, the heuristic compares the computed memory boundedness against an upper and a lower threshold. If the memory boundedness is greater than the upper threshold, the heuristic decreases the frequency and voltage by one setting; if it is below the lower threshold, the heuristic increases the frequency and voltage by one setting.
Weissel and Bellosa (2002) propose a heuristic that monitors the rates of different hardware events (e.g., cache misses) and attributes these rates to the different processes that are executing. At each context switch of the OS scheduler, this heuristic sets the clock frequency for the process being switched in based on its previously measured event rates. To do this, it refers to a table that assigns clock frequencies based on event rates and performance loss thresholds. Weissel and Bellosa construct this table using exhaustive offline experiments in which they measure the lowest clock frequency that allows a program to satisfy a performance loss threshold under different combinations of event rates.
Wu et al. (2005) have developed a memory-aware DVS algorithm that can be used inside a dynamic compiler. Their cost model is based on the analytical model described in Xie et al. (2003, 2004). The main idea is to estimate the scaling factor based on the fraction of time that the processor is stalled. The model extrapolates this information from three hardware counter events: memory-bus transactions, FP/INT instructions retired, and micro-ops retired. The values of the three main terms in their cost model (the optimal scaling factor, the total time that the memory is busy, and the total time that CPU and memory are both busy) depend on platform-specific coefficients that are estimated using offline simulations.
Poellabauer et al. (2005) attempt to divide a program's execution time into two parts: the time spent in computations and the time spent in memory accesses. To estimate memory boundedness they propose a new metric, which they call memory access rate, that quantifies the average number of cache misses per instruction executed. They use performance counters to measure cache misses and instructions executed. To determine how to slow down the processor on the basis of these measurements, they construct a table that maps these memory access rates to matrices of scaling factors. Their heuristic consults this precomputed table at runtime.
Choi et al. (2004b) use the Intel XScale processor's performance counters to determine how much of a program's execution time is spent on-chip versus off-chip. Under their cost model, this reduces to estimating the average number of cycles per on-chip instruction. For specific benchmarks, they find that this quantity is linear with respect to the average CPU stall cycles per instruction. Reasoning that stall cycles increase with cache misses, they construct a table that associates cache miss ranges with stall cycles. Their DVS heuristic consults this table to choose the clock frequency.
Hsu and Feng (2004, 2005) propose an interval-based DVS algorithm that measures CPU-boundedness. According to their model, the execution time for running a program at a given clock frequency is proportional to a constant β, which is a measure of how computation-intensive the program is. Their DVS heuristic estimates β at runtime, using a regression method over past instruction execution rates.
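To make the interval-based, threshold-driven style of heuristic concrete, the following is a minimal Python sketch of a policy in the spirit of the Kondo and Nakamura (2004) description above. It is not their implementation: the class name, the frequency levels, the counter weights, and the two thresholds are all hypothetical values chosen for illustration, and the thresholds assume a fixed number of samples per interval.

```python
# Hypothetical sketch of an interval-based, cache-miss-driven DVS policy.
# All constants below are illustrative assumptions, not published values.

FREQ_LEVELS_MHZ = [600, 800, 1000, 1200, 1400]  # assumed settings, low to high
WEIGHTS = (0.0, 1.0, 2.0)   # assumed weights for 0, 1, and >1 outstanding misses
UPPER_THRESHOLD = 12.0      # assumed; tuned for 10 samples per interval
LOWER_THRESHOLD = 4.0       # assumed; tuned for 10 samples per interval


class MissDrivenPolicy:
    def __init__(self):
        self.level = len(FREQ_LEVELS_MHZ) - 1  # start at the highest setting
        self.counters = [0, 0, 0]

    def sample(self, outstanding_misses):
        """Record one periodic sample of the number of outstanding misses."""
        # Index 0: no misses, 1: exactly one miss, 2: more than one miss.
        self.counters[min(outstanding_misses, 2)] += 1

    def memory_boundedness(self):
        """Weighted sum of the three counters; the third weighs the most."""
        return sum(w * c for w, c in zip(WEIGHTS, self.counters))

    def end_interval(self):
        """Step the setting by at most one level and reset the counters."""
        m = self.memory_boundedness()
        if m > UPPER_THRESHOLD:
            self.level = max(self.level - 1, 0)
        elif m < LOWER_THRESHOLD:
            self.level = min(self.level + 1, len(FREQ_LEVELS_MHZ) - 1)
        self.counters = [0, 0, 0]
        return FREQ_LEVELS_MHZ[self.level]
```

In this sketch, a caller would invoke `sample` periodically and `end_interval` once per fixed-length interval, applying the returned frequency through whatever platform interface is available.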
All of the above works attempt to use specific hardware events as indicators for estimating a program's execution time at a given clock frequency. In the case of Kondo and Nakamura (2004), Poellabauer et al. (2005), Marculescu (2000), and Choi et al. (2004b), the events are cache misses. In Weissel and Bellosa (2002), the events are memory requests per cycle and instructions per cycle. In Wu et al. (2005), the events are memory-bus transactions and micro-ops retired. In Hsu and Feng (2004, 2005), the events are instruction execution rates. These approaches involve a level of indirection, because they are based on statistical correlations between the events in question (e.g., cache misses, instruction execution rates) and execution time and clock frequency. By contrast, execution time, clock frequency, and clock cycles are logically related. Thus our metric, the percentage drop in cycles, is a more appropriate metric for assessing how much a code region will slow down when it is run at a lower clock frequency.
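The logical relationship between the three quantities can be written out explicitly. In the block below, $C(f)$ denotes the number of cycles a region takes at clock frequency $f$ and $T(f)$ its execution time. The definition of $d(f)$ is an assumed formalization of "percentage drop in cycles" relative to the maximum frequency $f_{\max}$, which this section does not spell out; the last line then follows by substitution.

```latex
\begin{align*}
T(f) &= \frac{C(f)}{f}, \\
d(f) &= 100 \cdot \frac{C(f_{\max}) - C(f)}{C(f_{\max})}, \\
\frac{T(f)}{T(f_{\max})}
  &= \frac{C(f)}{C(f_{\max})} \cdot \frac{f_{\max}}{f}
   = \left(1 - \frac{d(f)}{100}\right) \frac{f_{\max}}{f}.
\end{align*}
```

Under this reading, a region with $d(f) = 0$ slows down by the full factor $f_{\max}/f$, whereas a region whose cycle count drops in proportion to the frequency, $d(f) = 100\,(1 - f/f_{\max})$, does not slow down at all.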
Conclusions and Future Work
We have presented a new way of understanding and predicting how compute-intensive a code region is for the purposes of DVS. Specifically, the percentage decrease in cycles is the primary indicator of how much the execution time will increase when a code region is run at a lower clock frequency. It is the most fundamental measure of compute-boundedness because of the logical relationship between clock cycles, clock frequency, and execution time. We can estimate with high precision the execution time of a code region at any clock frequency just by running the region at the two highest clock frequencies and extrapolating from the decrease in cycles. This result can be used to develop low-overhead DVS algorithms that are more system-aware than current approaches (Choi et al., 2004b,c,d; Hsu and Feng, 2004, 2005; Kondo and Nakamura, 2004; Li et al., 2003; Marculescu, 2000; Poellabauer et al., 2005; Singleton et al., 2005; Weissel and Bellosa, 2002; Wu et al., 2005), which choose clock frequencies based on hardware events that are at best indirectly related to execution time.
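As an illustration of the two-point extrapolation summarized above, the following Python sketch fits a line through the cycle counts measured at the two highest frequencies and uses it to predict execution time at a lower frequency. The linear form cycles(f) = a + b·f is an assumption made for this sketch (it corresponds to a fixed amount of on-chip work plus a fixed off-chip time) and is not taken from this section; the function names and the numbers in the usage note are likewise hypothetical.

```python
def fit_two_point(f_max, cycles_max, f_next, cycles_next):
    """Fit cycles(f) = a + b * f through two (frequency, cycles) measurements.

    Assumed interpretation: `a` is the frequency-independent cycle count
    and `b` is the frequency-independent time in seconds.
    """
    if f_max == f_next:
        raise ValueError("the two measurement frequencies must differ")
    b = (cycles_max - cycles_next) / (f_max - f_next)
    a = cycles_max - b * f_max
    return a, b


def predict_time(a, b, f):
    """Predicted execution time in seconds at clock frequency f (in Hz)."""
    return (a + b * f) / f
```

For example, with hypothetical measurements of 3.0e9 cycles at 2.0 GHz and 2.9e9 cycles at 1.8 GHz, `fit_two_point(2.0e9, 3.0e9, 1.8e9, 2.9e9)` yields a = 2.0e9 cycles and b = 0.5 s, and `predict_time(a, b, 1.0e9)` gives 2.5 s, compared with 1.5 s at the maximum frequency.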
We are integrating our model into a DVS algorithm for the Sun Hotspot VM. So far we have collected detailed method-level execution traces of the benchmarks in this paper. While many of these benchmarks spend a significant amount of time in memory stalls, we find that most of these stalls occur at the loop granularity rather than the method granularity. Moreover, in the three or four benchmarks that contain opportunities for DVS at the method level, we find that switching the clock frequency on every method entry and exit incurs substantial overhead. We are still investigating the right granularity at which to apply DVS.
Another avenue for future work involves extending this model. Although we can predict execution times with reasonable accuracy using a "two-point" approach, we expect that accuracy can be increased by measuring cycles at more than two clock frequencies. This would allow us to better handle special cases.
