Abstract - An adaptive approach for dynamic voltage scheduling on processors is presented, based on workload prediction by filtering a trace history. The effects of update frequency and filtering strategy on the energy savings are analyzed. A performance hit metric is defined and techniques to minimize energy under a given performance requirement are outlined. Our results demonstrate that up to two orders of magnitude energy savings are possible with dynamic voltage scheduling, depending on workload statistics.
I. INTRODUCTION
Dynamic Voltage Scheduling (DVS) is a very effective technique for reducing CPU energy [1] [2]. Most microprocessor systems are characterized by a time-varying computational load. Simply reducing the operating frequency during periods of reduced activity results in a linear decrease in power consumption but does not affect the total energy consumed per task. Reducing the operating voltage implies greater critical path delays, which in turn means that the peak performance is compromised. Significant energy benefits can be achieved by recognizing that peak performance is not always required, and therefore the operating voltage and frequency of the processor can be dynamically adapted based on the instantaneous processing requirement.
Figure 1 shows a 1 minute snapshot of the workload trace for three processors being used for three different types of applications: (i) a dialup server (characterized by numerous users logging in and out independently), (ii) a workstation (characterized by an interactive single user) and (iii) a UNIX file server (characterized by intermittent requests). The varying workload requirements are at once apparent. The goal of DVS is to adapt the power supply and operating frequency to match the workload such that the visible performance loss is negligible.
The crux of the problem lies in the fact that future workloads are often non-deterministic. The rate at which DVS is done also has a significant bearing on performance and energy. A low update rate implies greater workload averaging, which results in lower energy (as we will show). The update energy and performance costs are also amortized over a larger time frame. On the other hand, a low update rate also implies a greater performance hit, since the system will not respond to a sudden increase in workload.
In this paper we propose a workload prediction strategy based on adaptive filtering of the past workload profile. Several filtering schemes are analyzed. We also define a performance hit metric which is used to judge the efficacy of the schemes. The evaluation of some DVS algorithms on portable benchmarks was done in [3] and [4]. In [5] the scheduling of hard real-time tasks on variable voltage processors is presented. Our approach is more general and illustrates important tradeoffs between energy savings and performance.
The energy savings from DVS can be substantial depending on workload statistics. Although this is very desirable in portable battery-constrained systems, energy savings are becoming equally important in desktops and servers too. A server farm with 10000 units, each equipped with 100 W processors, would require 1 MW of power!
II. VARIABLE VOLTAGE PROCESSING

A. Energy Workload Model
Using simple first-order CMOS delay models, it has been shown in [1] that the energy consumption per sample is given by
\[ E(r) = C V_0^2 T_s f_{ref}\, r \left[ \frac{V_t}{V_0} + \frac{r}{2} + \sqrt{\frac{r V_t}{V_0} + \left(\frac{r}{2}\right)^2} \right]^2 \qquad (1) \]
where C is the average switched capacitance per cycle, \(T_s\) is the sample period, \(f_{ref}\) is the operating frequency at \(V_{ref}\), r is the normalized processing rate, i.e. \(r = f / f_{ref}\), and \(V_0 = (V_{ref} - V_t)^2 / V_{ref}\), with \(V_t\) being the threshold voltage. The normalized workload in a system is equivalent to the processor utilization.
The operating system scheduler allocates a time-slice and resources to various processes based on their priorities and state. Often no process is ready to run and the processor simply idles. The normalized workload, w, over an interval is simply the ratio of the non-idle cycles to the total cycles, i.e. w = (total_cycles - idle_cycles) / total_cycles. The workload is always in reference to the fixed maximum supply and maximum processing rate. In an ideal DVS system the processing rate is matched to the workload so that there are no idle cycles and utilization is maximum. Figure 2(a) shows the plot of normalized energy versus workload as described by Equation 1, for an ideal DVS system. The important conclusions from the graph are: (i) Averaging the workload and processing at the mean workload is more energy efficient because of the convexity of the E(r) graph and Jensen's inequality [6]: \(\overline{E(r)} \geq E(\bar{r})\).
(ii) A small number of discrete processing rate levels (i.e. supply voltage, \(V_{dd}\), and operating frequency, f) can give energy savings very close to the savings obtained from arbitrary precision DVS.
0-7695-0831-6/00 $10.00 © 2000 IEEE
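[Figure 2: (a) Normalized energy versus workload (r) for an ideal DVS system; (b) DC/DC converter efficiency versus load.]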
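As a rough numerical illustration of conclusion (i), the sketch below evaluates the first-order energy model and compares processing a fluctuating workload slot by slot against processing at its mean. The closed form for the supply voltage follows from the delay model f ∝ (V - V_t)^2 / V that underlies Equation 1. The values V_ref = 1.5 V and V_t = 0.5 V, the two-slot workload, and all function names are assumptions chosen only for illustration.

```python
import math


def supply_voltage(r, v_ref=1.5, v_t=0.5):
    """Supply voltage needed to sustain normalized rate r (first-order delay model).

    v_ref and v_t are illustrative assumptions, not values from the paper.
    """
    v0 = (v_ref - v_t) ** 2 / v_ref
    return v0 * (v_t / v0 + r / 2 + math.sqrt(r * v_t / v0 + (r / 2) ** 2))


def normalized_energy(r, v_ref=1.5, v_t=0.5):
    """Energy per sample at rate r, normalized to the energy at r = 1."""
    return r * (supply_voltage(r, v_ref, v_t) / v_ref) ** 2


# Hypothetical workload that alternates between two equally long slots.
workload = [0.1, 0.9]
mean_workload = sum(workload) / len(workload)

# Tracking the workload exactly in every slot (mean of E(r)).
fluctuating = sum(normalized_energy(w) for w in workload) / len(workload)
# Processing at the mean workload in every slot (E of mean r).
averaged = normalized_energy(mean_workload)

print(f"mean of E(r): {fluctuating:.3f}")
print(f"E(mean r):    {averaged:.3f}")
```

With these assumed parameters, processing at the mean rate uses less energy than tracking the fluctuation, which is what the convexity argument predicts.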
B. Variable Power Supply
A variable power supply can be generated using a DC/DC converter, which takes a fixed supply and generates a variable voltage output based on a pulse-width modulated signal. It essentially consists of a power switch and a second-order LC filter and is characterized by an efficiency which drops off as the load decreases, as shown in Figure 2(b). At a lower current load, most of the power drawn from the supply gets dissipated in the switch and therefore the energy gains from DVS are proportionately reduced. Using a technique similar to the one used in the derivation of Equation 1, a first-order current consumption equation can be expressed as
where \(I_{ref}\) is the current drawn at \(V_{ref}\). Using the DC/DC converter efficiency graph and the relative load current I(r), we can predict the efficiency, \(\eta(r)\).
III. WORKLOAD PREDICTION
A. System Model
Figure 3 shows a generic block diagram of the variable voltage processing system. The 'Task Queue' models the various event sources for the processor, e.g. I/O, disk drives, network links, internal interrupts, etc. Each of the n sources produces events at an average rate of \(\lambda_k\) (k = 1, 2, ..., n). An operating system scheduler manages all these tasks and decides which process gets to run on the processor. The average rate at which events arrive at the processor is \(\lambda = \sum_k \lambda_k\). The processor in turn offers a time-varying processing rate \(\mu(r)\). The operating system kernel measures the idle cycles and computes the normalized workload w over some observation frame. The workload monitor sets the processing rate, r, based on the current workload, w, and a history of workloads from previous observation frames. This rate r in turn decides the operating voltage V(r) and operating frequency f(r), which are set for the next observation slot. The problems that we address in this paper are: (i) What kind of future workload prediction strategy should be used? (ii) What is the duration of the observation slot, i.e. how frequently should the processing rate be updated? The overall objective of a DVS system is to minimize energy consumption under a given performance requirement constraint.
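The bookkeeping done by the workload monitor can be sketched as follows. This is a minimal illustration, assuming the kernel exposes total and idle cycle counts per observation frame. The class and function names are hypothetical, and the cycle counts are made up.

```python
from collections import deque


def normalized_workload(total_cycles, idle_cycles):
    """w = (total_cycles - idle_cycles) / total_cycles for one observation frame."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return (total_cycles - idle_cycles) / total_cycles


class WorkloadMonitor:
    """Keeps the workloads of the most recent `taps` observation frames."""

    def __init__(self, taps):
        # history[0] is the newest frame; older frames fall off the end.
        self.history = deque(maxlen=taps)

    def observe(self, total_cycles, idle_cycles):
        w = normalized_workload(total_cycles, idle_cycles)
        self.history.appendleft(w)
        return w


# Four hypothetical frames of 1000 cycles each.
monitor = WorkloadMonitor(taps=3)
for idle in (750, 500, 0, 1000):
    monitor.observe(total_cycles=1000, idle_cycles=idle)

print(list(monitor.history))
```

Only the last three frames are retained, newest first, which is the form a fixed-length prediction filter would consume.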
[Figure 3: Block diagram of the variable voltage processing system.]
i.e. the present N values contain all the information about the past evolution of the process that is needed to determine the future distribution of the process.
Markov processes have been used in the context of Dynamic Power Management (DPM). In [9] a continuous-time, controllable Markov process model for a power managed system is introduced and DPM is formulated as a policy optimization problem. We propose to use Markov processes in the context of workload prediction i.e. we propose to predict the workload for the next observation interval based on workload statistics of the previous N intervals.
C. Prediction Algorithm
Let the observation period be T. Let w[n] denote the average normalized workload in the interval \((n-1)T \leq t < nT\). At time t = nT, we must decide what processing rate to set for the next slot, i.e. r[n+1], based on the workload profile history. Our workload prediction for the (n+1)th interval is given by
\[ w_p[n+1] = \sum_{k=0}^{N-1} h_n[k]\, w[n-k] \]
where \(h_n[k]\) is an N-tap, adaptable FIR filter whose coefficients are updated in every observation interval based on the error between the processing rate (which is set using the workload prediction) and the actual value of the workload.
Most processor systems will have a discrete set of operating frequencies, which implies that the processing rate levels are quantized. The StrongARM SA-1100 microprocessor, for instance, can run at 10 discrete frequencies in the range of 59 MHz to 206 MHz [10]. As we shall show later, discretization of the processing rate does not significantly degrade the energy savings from DVS. Let us assume that there are L discrete processing levels available such that
\[ r \in \{ \Delta, 2\Delta, \ldots, L\Delta \} \]
where we have assumed a uniform quantization interval, \(\Delta = 1/L\). We have also assumed that the minimum processing rate is 1/L since r = 0 corresponds to the complete off state. Based on the workload prediction \(w_p[n+1]\), the processing rate r[n+1] is set such that
\[ r[n+1] = \left\lceil \frac{w_p[n+1]}{\Delta} \right\rceil \Delta \]
i.e. the processing rate is set to a level just above the predicted workload.
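The prediction and quantization steps can be sketched together. The code below is a minimal illustration, assuming the history is stored newest first and the prediction is a plain N-tap FIR sum over it. Clamping the result to the range [1/L, 1] and the small rounding tolerance are assumptions added here so that the function always returns a valid level. The function names and sample values are hypothetical.

```python
import math


def predict_workload(history, coeffs):
    """N-tap FIR prediction; history[k] is w[n-k], coeffs[k] is the k-th tap."""
    if len(history) != len(coeffs):
        raise ValueError("history and coeffs must have the same length")
    return sum(h * w for h, w in zip(coeffs, history))


def quantize_rate(w_pred, levels):
    """Round the predicted workload up to the next of `levels` uniform rates."""
    # The tolerance keeps values such as 0.30000000000000004 on level 3 of 10.
    k = math.ceil(w_pred * levels - 1e-9)
    # Assumed clamping: never below 1/levels (r = 0 is the off state), never above 1.
    k = min(max(k, 1), levels)
    return k / levels


history = [0.42, 0.30, 0.36]   # hypothetical workloads, newest first
coeffs = [1 / 3] * 3           # equal weights, purely as an example
w_pred = predict_workload(history, coeffs)
rate = quantize_rate(w_pred, levels=10)
print(f"predicted workload {w_pred:.2f} -> processing rate {rate}")
```

Rounding up rather than to the nearest level trades a little energy for a lower chance of under-provisioning the next slot.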
D. Type of Filter
We have explored four types of filters. In this section we outline the basic motivation behind each of the filters and later present results showing the prediction performance of each of the filters.
Moving Average Workload (MAW) -
The simplest filter is a time-invariant moving average filter, \(h_n[k] = 1/N\) for all n and k. This filter predicts the workload in the next slot as the average of the workload in the previous N slots. The basic motivation is that if the workload is truly an Nth-order Markov process, averaging will result in workload noise being removed by low-pass filtering. However, this scheme might be too simplistic and may not work with time-varying workload statistics. Also, averaging results in high-frequency workload changes being removed, and as a result instantaneous performance hits are high.
Exponential Weighted Averaging (EWA) - This filter is based on the idea that the effect of the workload k slots before the current slot lessens as k increases, i.e. it gives maximum weight to the previous slot, lesser weight to the one before, and so on.
where \(\mu\) is the step size. Use of adaptive filters has its advantages and disadvantages. On one hand, since they are self-designing, we do not have to worry about individual traces. The filters can 'learn' from the workload history. The obvious problems involve convergence and stability. Choosing the wrong number of coefficients or an inappropriate step size may have very undesirable consequences. RLS adaptive filters differ from LMS adaptive filters in that they do not employ gradient descent. Instead they employ a clever result from linear algebra. In practice they tend to converge much faster but they have higher computational complexity.
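For readers unfamiliar with LMS adaptation, the sketch below shows the textbook LMS coefficient update applied to workload prediction. It is not necessarily the exact form used in this paper. In particular, the error here is taken between the actual workload and the raw prediction, which is a simplifying assumption. The function name, step size, and constant test workload are all hypothetical.

```python
def lms_step(coeffs, history, actual, mu):
    """One textbook LMS update.

    coeffs[k] is the current k-th tap, history[k] is the workload k slots ago,
    actual is the workload that was really observed, mu is the step size.
    Returns (new_coeffs, predicted, error).
    """
    predicted = sum(h * w for h, w in zip(coeffs, history))
    error = actual - predicted
    # Move each tap along the negative gradient of the squared error.
    new_coeffs = [h + mu * error * w for h, w in zip(coeffs, history)]
    return new_coeffs, predicted, error


# Hypothetical constant workload of 0.5; the filter starts knowing nothing.
coeffs = [0.0, 0.0, 0.0]
history = [0.5, 0.5, 0.5]
for _ in range(200):
    coeffs, predicted, error = lms_step(coeffs, history, actual=0.5, mu=0.5)
    history = [0.5] + history[:-1]   # shift in the newest observation

print(f"prediction after adaptation: {predicted:.4f}")
print(f"taps: {[round(h, 4) for h in coeffs]}")
```

On this trivially stationary input the taps settle so that the prediction matches the workload. On real traces the step size governs the tradeoff between convergence speed and stability mentioned above.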
Expected Workload State (EWS)
If \(w_{\Delta t} < r_{\Delta t}\) then the performance penalty is negative. The way to interpret this is that it is a slack or idle time. Using this basic definition of performance penalty we define two different metrics, \(\phi_{max}(\Delta t)\) and \(\phi_{avg}(\Delta t)\), which are respectively the maximum and average performance hits measured over \(\Delta t\) time slots spread over an observation period T. Figure 7 shows the average and maximum performance hit as a function of the update time T, for a moving average prediction using N = 2, 6 and 10 taps. The time slots used were \(\Delta t\) = 1 s and the workload trace was that of the dialup server. The results have been averaged over 1 hour. While the maximum performance hit increases as T increases, the average performance hit decreases. This is because as T increases the excess cycles from one time slot spill over to the next one, and if that slot has a negative performance penalty (i.e. slack / idle cycles) then the average performance hit over the two slots decreases, and so on. On the other hand, as T increases, the chance of an increased disparity between the workload and processing rate in a time slot grows and the maximum performance hit increases.
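The two metrics can be computed from per-slot workloads and processing rates, as in the sketch below. The per-slot penalty is assumed here to be the relative shortfall (w - r) / r, which is negative when the slot has slack. This form is an assumption made for illustration, as are the function name and the sample values.

```python
def performance_hits(workloads, rates):
    """Return (maximum, average) performance hit over a sequence of time slots.

    Assumed per-slot penalty: (w - r) / r, i.e. the excess work relative to
    what the chosen processing rate can absorb in that slot.
    """
    if len(workloads) != len(rates):
        raise ValueError("workloads and rates must have the same length")
    penalties = [(w - r) / r for w, r in zip(workloads, rates)]
    return max(penalties), sum(penalties) / len(penalties)


# Hypothetical slots: the rate is held at 0.4 for two slots, then raised to 0.5.
workloads = [0.5, 0.2, 0.6]
rates = [0.4, 0.4, 0.5]
hit_max, hit_avg = performance_hits(workloads, rates)
print(f"max hit {hit_max:.2f}, average hit {hit_avg:.3f}")
```

In this example the slack in the second slot pulls the average hit below zero even though the worst slot is about 25% short, mirroring the opposite trends of the two metrics.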
This leads to a fundamental energy-performance tradeoff in DVS. Because of the convexity of the E(r) relationship and Jensen's inequality, we would always like to work at the overall average workload. Therefore, over a 1 hour period for example, the most energy efficient DVS solution is one where we set the processing rate equal to the overall average workload over the 1 hour period. In other words, increasing T leads to increased energy efficiency. On the other hand, increasing T also increases the maximum performance hit. In other words, the system might be sluggish in moments of high workload. Maximum energy savings for a given performance hit involves choosing the maximum update time T such that the maximum performance hit is within bounds, as shown in Figure 7.
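The selection rule in the last sentence can be written down directly. The sketch below assumes that the maximum performance hit has already been measured for a handful of candidate update times. The measurements shown are invented for illustration and are not taken from Figure 7, and the function name is hypothetical.

```python
def largest_update_time(max_hit_by_update_time, bound):
    """Pick the largest update time whose measured maximum hit is within bound.

    Returns None when no candidate satisfies the bound.
    """
    feasible = [t for t, hit in max_hit_by_update_time.items() if hit <= bound]
    return max(feasible) if feasible else None


# Hypothetical measurements: update time in seconds -> maximum performance hit.
measured = {1: 0.10, 5: 0.25, 10: 0.40, 30: 0.80}

print(largest_update_time(measured, bound=0.40))
print(largest_update_time(measured, bound=0.05))
```

The first call selects the 10 s update time. The second returns None because no candidate meets the tighter bound.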
In most DVS processors, there is a latency overhead involved in a processing rate update. This is because there is a finite feedback bandwidth associated with the DC/DC converter. Normally a good voltage regulator can switch between voltage output levels in a few tens of microseconds. Changing the processor clock frequency also involves a latency overhead during which the PLL circuits lock. In general, to be on the safe side, voltage and clock frequency changes should not be done in parallel. While switching to a lower processing rate, the frequency should first be decreased and subsequently the voltage should be lowered to the appropriate value. Conversely, switching to a higher processing rate requires the voltage to be increased first, followed by the frequency update. This ensures that the voltage supply to the processor is never lower than the minimum required for the current operating frequency and avoids data corruption due to circuit failure. However, in [13] the update is done in parallel because the converter and the clock update latencies are comparable (approximately 100 µs), and it still works.
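The safe ordering described above amounts to a small decision rule, sketched below. The step names are hypothetical labels rather than calls into any real driver or hardware interface.

```python
def transition_steps(old_rate, new_rate):
    """Order voltage and frequency changes so the supply never undershoots.

    Returns the list of steps to perform, in order.
    """
    if new_rate > old_rate:
        # Speeding up: the higher voltage must be in place before the clock rises.
        return ["raise_voltage", "raise_frequency"]
    if new_rate < old_rate:
        # Slowing down: drop the clock first, then it is safe to lower the voltage.
        return ["lower_frequency", "lower_voltage"]
    return []  # no change requested


print(transition_steps(0.4, 0.7))
print(transition_steps(0.7, 0.4))
```

In both directions the voltage stays at or above what the current frequency needs at every intermediate point.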
In our experiments, the time resolution for workload measurement was 1 second. Since we want to work at the averaged workload, this is not a problem unless there are very stringent real-time requirements. The other advantage of using a lower time resolution is that the workload measurement subroutine does not itself add substantial overhead to the workload if the measurement duty cycle is small. The update latency is of the order of 100 µs, and since this is insignificant compared to our minimum update time we have used Equation 9 instead of Equation 10.
B. Optimizing Update Time and Taps
The above conclusion that increasing the update time T results in the most energy savings is not completely true. This would be the case with a perfect prediction strategy. In reality, if the update time is large, the cost of an overestimated rate is more substantial and the energy savings decrease. Since we are using discrete processing rates (in all our simulations the number of processing rate levels is set to 10 unless otherwise stated), and we round the rate up to the next higher quantum, using a larger update time results in a higher overestimate cost. A similar argument holds for the number of taps N. A very small N implies that the workload prediction is very noisy and the energy cost is high because of widely fluctuating processing rates. A very large N, on the other hand, implies that the prediction is heavily low-pass filtered and therefore sluggish to rapid workload changes. This leads to a higher performance penalty. Figure 8 shows the relative energy plot (normalized to the no-DVS case) for the dialup server trace. The period of observation was 1 hour. The energy savings showed a 13% variation depending on which N and T were chosen. The filter was once again the moving average type. The implications of the above discussion are at once apparent.
V. RESULTS
Table I summarizes our key results. We have used 1 hour workload traces from three different types of machines over different times of the day. Their typical workload profiles were shown in Figure 1. The Energy Savings Ratio (ESR) is defined as the ratio of the energy consumption with no DVS to the energy consumption with DVS. Maximum savings occur when we set the processing rate equal to the average workload over the entire period. This is shown in the 'Max' column of ESR, and we can see that energy savings from a factor of 2 to a few hundred are possible depending on workload statistics. Maximum savings are not achievable for two reasons: (i) The maximum performance hit increases as the averaging duration is increased, and (ii) It is impossible to know the average workload over the stipulated period a priori. The filters have N = 3 taps and an update time T = 5 s, based on our previous discussion and the experiments performed. The 'Perfect' column shows the ESR for the case where we had a perfect predictor for the next observation slot. \(ESR_{Max} / ESR_{Perfect}\) reflects the factor by which energy savings are reduced because of the update every T seconds.
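To make the 'Max' and 'Perfect' definitions concrete, the sketch below computes both ratios for a short invented trace. It deliberately uses a simplified stand-in energy model in which energy per cycle scales as the square of the normalized rate (zero threshold voltage) instead of Equation 1. It also assumes that idle cycles cost nothing in the no-DVS case. All names and numbers are assumptions for illustration only.

```python
def energy_per_cycle(rate):
    """Stand-in model (assumption): energy per cycle grows as rate squared."""
    return rate ** 2


def energy_savings_ratio(energy_no_dvs, energy_dvs):
    """ESR: energy consumed with no DVS divided by energy consumed with DVS."""
    return energy_no_dvs / energy_dvs


# Hypothetical normalized workload per observation slot.
trace = [0.1, 0.3, 0.2, 0.6]
mean_workload = sum(trace) / len(trace)

# No DVS: every busy cycle runs at full rate (idle cycles assumed free).
e_no_dvs = sum(w * energy_per_cycle(1.0) for w in trace)
# Perfect prediction: each slot runs exactly at its own workload.
e_perfect = sum(w * energy_per_cycle(w) for w in trace)
# Maximum savings: the whole period runs at the overall mean workload.
e_max = sum(w * energy_per_cycle(mean_workload) for w in trace)

esr_perfect = energy_savings_ratio(e_no_dvs, e_perfect)
esr_max = energy_savings_ratio(e_no_dvs, e_max)
print(f"ESR perfect: {esr_perfect:.2f}, ESR max: {esr_max:.2f}")
```

Even with a perfect per-slot predictor the ratio stays below the 'Max' value, because only the latter benefits from averaging over the entire period.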
The 'Actual' column shows the ESR obtained by the various filters. In almost all our experiments the LMS filter gave the best energy savings. The last two columns are the average and maximum performance hits. The average performance hit is around 10% while the maximum performance hit is about 40%. Finally, the effect of processing level quantization is shown in Figure 9. As the number of discrete levels, L, is increased, the ESR gets closer to the perfect prediction case. For L = 10 (as available in the StrongARM SA-1100) the ESR degradation due to quantization noise is less than 10%.
VI. CONCLUSIONS
Dynamic Voltage Scaling is a very effective technique to reduce processor energy consumption without causing significant performance degradation. Up to two orders of magnitude energy savings are possible on low-workload processors. Maximum energy savings occur if the processing rate is set to the overall average workload. This, however, is generally infeasible a priori and, even if possible, leads to high performance penalties. Frequent processing rate updates ensure that the performance penalty is limited. The faster the update rate, the lower the energy savings and the lower the performance penalty. Workload prediction is required to set the processing rate for each update slot. Adaptive LMS filtering can be used to predict workloads. Normally a filter with 3-5 taps is good.
