Abstract: In this paper, we present Chameleon, an application-level power management approach for reducing energy consumption in mobile processors. By using application-domain knowledge, as opposed to knowledge inferred at the OS or hardware level, Chameleon can substantially reduce CPU energy consumption. Exporting energy management to user space lets designers build more flexible and more easily portable algorithms and systems and use multiple energy management policies simultaneously. Our experiments show that, compared to traditional systemwide CPU voltage scaling approaches, Chameleon can achieve 32 percent to 50 percent energy savings while delivering comparable or better performance to applications. Similarly, Chameleon saves 9 percent to 41 percent more energy than Grace OS, which uses some application knowledge but operates within the kernel.

INTRODUCTION
Recent technological advances have led to a proliferation of mobile devices such as laptops, personal digital assistants (PDAs), and smartphones with rich multimedia capabilities. Although the processing, storage, and communication capabilities of these devices have improved as predicted by Moore's law, these advances have significantly outpaced the improvements in battery capabilities. Consequently, energy continues to be a scarce resource in such devices. The situation is exacerbated by the resource-hungry nature of many applications such as movie players and batch compilation.
Modern mobile devices conserve energy by incorporating a number of power management features. For instance, modern processors such as Intel's XScale and Pentium-M and Transmeta's Crusoe incorporate dynamic voltage and frequency scaling (DVFS) capabilities. DVFS enables the CPU speed to be dynamically varied based on the workload and reduces energy consumption during periods of low utilization [1], [2], [3]. In general, such techniques must be carefully designed to prevent the processor slowdown from degrading the responsiveness of the application.
Essentially, all approaches to controlling DVFS minimize power usage by taking advantage of the nonlinear relationship between the rate at which the CPU performs work and the power that the CPU consumes. Although tasks take longer to complete at a lower speed, the greater reduction in power leads to an overall decrease in the amount of energy used. Power consumption is minimized when the service rate of the processor equals the arrival rate of new work. However, at that point, the waiting time tends to infinity, and short bursts of CPU load can create long waiting times for users. Thus, the system must not only balance DVFS states with demand but also predict future demand to avoid short-term backlogs. In general, the more knowledge the system has about the workload demand and the acceptable service level, the less energy it will consume.
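As a rough illustration of why slowing down can save energy, the sketch below assumes a simple model that is not taken from this paper: supply voltage scales linearly with frequency, so dynamic power grows with the cube of the frequency, and idle and static power are ignored. The function name and all constants are hypothetical.

```python
def energy(work_cycles, freq_hz, f_max_hz, p_max_watts):
    """Energy in joules to execute `work_cycles` at one fixed frequency.

    Assumption (not from the paper): voltage scales linearly with
    frequency, so dynamic power scales with the cube of the frequency.
    Static and idle power are ignored.
    """
    power = p_max_watts * (freq_hz / f_max_hz) ** 3
    seconds = work_cycles / freq_hz
    return power * seconds


F_MAX = 1.0e9   # hypothetical 1 GHz processor
P_MAX = 2.0     # hypothetical 2 W drawn at full speed
WORK = 0.5e9    # work that takes 0.5 s at full speed

full = energy(WORK, F_MAX, F_MAX, P_MAX)      # 0.5 s at 2 W
half = energy(WORK, F_MAX / 2, F_MAX, P_MAX)  # 1.0 s at 0.25 W
```

Under this model the task takes twice as long at half speed but uses a quarter of the energy, which is the trade-off that every DVFS policy in this paper exploits.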
Thus far, DVFS techniques proposed in the literature fall into three categories: hardware, pure operating system (OS), and cooperative OS, listed in order of increasing knowledge and increasing performance.
Hardware approaches such as LongRun [4] measure processor utilization at the hardware level and vary the CPU speed based on the measured systemwide utilization. LongRun is a purely reactive memoryless system: It measures CPU utilization for an interval, and if there is idle time, it reduces the CPU speed and voltage to bring the utilization back to 100 percent. Hardware approaches have no knowledge of any future demand, do not track patterns in the demand, and have no knowledge of the application needs. They are, however, the simplest for OS and application writers to use, as no modifications to software are necessary.
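The reactive, memoryless behavior described above can be sketched as a single update step. This is only an illustration of the idea, not LongRun's actual algorithm: the function name, the minimum speed, and the choice to jump back to full speed when no idle time is observed are all assumptions.

```python
def next_speed(current_speed, utilization, min_speed=0.25):
    """One step of a memoryless interval-based governor.

    current_speed: fraction of full speed used during the last interval.
    utilization:   fraction of that interval the CPU was busy (0..1).
    Returns the speed fraction at which the same work would have filled
    the whole interval, clamped to [min_speed, 1.0].
    """
    if utilization >= 1.0:
        # No idle time was seen, so demand may exceed the current speed.
        # Assumption: this sketch simply returns to full speed.
        return 1.0
    wanted = current_speed * utilization
    return max(min_speed, min(1.0, wanted))
```

Because the step looks only at the last interval, it cannot anticipate a burst, which is the limitation that the software approaches below try to address.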
Software approaches can be broken into two types: pure OS and cooperative OS. In the first case, the OS measures the current processor demand itself and determines an appropriate processor speed setting [3], [2], [5], [6], [7], [8]. The first of these approaches are incremental improvements over the memoryless hardware approaches, allowing software designers to examine different policies and workloads [3], [2]. However, OS approaches that take advantage of the knowledge gained from the OS scheduler, including queue lengths and the pattern of blocking I/O requests from individual applications, outperform more naive approaches.
Cooperative OS approaches increase the amount of available knowledge even further by consulting with the applications themselves [9] , [10] . For instance, Grace OS permits applications to inform the OS of the beginning and end of periodic bursts of CPU activity, as opposed to the pure OS techniques that must infer them. By tracking the statistics of the bursts, Grace OS can match the DVFS state such that the OS can slow the processing rate, completing each of the bursts before the next burst arrives.
However, such an approach ignores two extra pieces of knowledge that can further improve performance: the deadline of the task and application-specific knowledge that gives a better estimate of demand. First, Grace OS was tailored to periodic multimedia applications and assumes that the interval between task arrivals matches the deadline for each of the tasks. This is appropriate for video players, which must finish decoding a frame right before decoding the next frame. However, it is inappropriate for other applications whose task arrival rate is variable and does not match the deadline. For instance, in an interactive application, if the user presses a key once every second, that does not imply that it should take 1 s to process the key press. Second, Grace OS only notes the beginning and end of a burst of activity and takes 95 percent of the size of the burst as the worst case prediction for the demand of the next burst. However, applications have much more knowledge about the demand of an event before it even starts. For instance, as we show in this paper, a video-decoding application can use the size and type of the frame to give a much better estimate of the demand.
This paper explores a fourth approach that exploits this knowledge: Chameleon, an application-level power management system. Rather than controlling DVFS settings from within the OS kernel, we argue that applications know best what their resource and energy needs are, and consequently, applications can implement better power management policies than the OS and the processor hardware. Using application-specific knowledge, Chameleon can set the DVFS state of the processor closer to the optimal setting, thus saving energy and meeting more application deadlines.
In Chameleon, applications are given complete control over their CPU power settings: an application is allowed to specify its CPU power setting independent of other applications, and the OS isolates an application from the settings used by other applications. Our approach resembles the philosophy of the Exokernel, where the OS grants complete control of various resources to the applications and only enforces protection to prevent applications from harming one another [11]. The Exokernel project successfully demonstrates the benefits of application-level networking, application-level memory management, application-level file systems, and CPU scheduling [12], whereas Chameleon extends this notion to application-level power management.
To show the benefits of incorporating more knowledge in a DVFS algorithm, we use video decoding as an example. As a point of comparison, consider an optimal algorithm with complete knowledge of deadlines and CPU demands. Such an algorithm would set the frequency state of the CPU such that all deadlines for decoding are met exactly, no sooner and no later. The results of such an experiment, comparing a hardware algorithm (LongRun [4]), a cooperative OS approach (Grace OS [10]), and an application-level algorithm (Chameleon) over several movie resolutions, are shown in Fig. 1. The extra knowledge that Grace OS uses improves its performance over LongRun, and Chameleon's application-level knowledge improves its performance over Grace OS.
Contributions. Application-level power management in Chameleon opens up a realm of possibilities that are infeasible with existing approaches.
• Performance. Our approach enables each application to make local power management decisions based on its processor demand and processor availability. We experimentally show that local decisions by individual applications can globally optimize systemwide energy consumption and are better than choosing a single systemwide power setting for all applications.
• Flexibility. Such an approach enables each application to implement a power management policy that closely matches its energy and performance requirements. Different applications can choose different policies and yet coexist with one another. Legacy applications or those applications that do not wish to implement their own strategy can delegate this task to a user-level power manager that chooses appropriate settings based on observed behavior.
• Generality. Our approach is general and, unlike some existing approaches, does not make specific assumptions about the nature of applications. Any application can make use of the power management interface to manage its energy needs, and we demonstrate such strategies for several different applications.
• Modest implementation costs. We show that user-level power management policies can be implemented at modest cost. As a rough measure of the implementation effort, each application required 40-239 additional lines of code, which is a relatively minor modification to applications that contain tens or hundreds of thousands of lines of code.
The rest of this paper is organized as follows: Section 2 presents an overview of the Chameleon architecture. Sections 3 and 4 present the user-level power management strategies for various applications and the design of a user-level power manager, respectively. Section 5 discusses implementation issues. Section 6 presents our experimental results. Finally, Sections 7 and 8 present related work and our conclusions.
CHAMELEON ARCHITECTURE
Chameleon consists of three key components, as shown in Fig. 2 . First, Chameleon provides an OS interface that enables applications to query the kernel for resource usage statistics and to convey their desired power settings to the kernel. The details of the interface are presented in Section 5. In general, a user-level power management strategy combines OS-level resource usage statistics with application-domain knowledge to determine a desirable CPU power setting. This can be achieved in two ways. An application can use the Chameleon interface to directly modify its own power settings. Alternatively, an application can delegate the task of power management to a user-level power manager. Such a power manager can use resource usage statistics and any application-supplied information to adjust the application's power settings on its behalf.
Second, Chameleon includes a modified CPU scheduler that supports per-process CPU power settings and application isolation. The scheduler maintains the current power settings for each process and conveys these settings to the underlying processor whenever the process is scheduled for execution (that is, at context switch time). The application's power settings can be modified at any time via system calls either by the application itself or by a user-level power manager that acts on its behalf.
One concern is that if one application misuses the speed setting either maliciously or inadvertently, it may degrade the performance of other applications. In Chameleon, an application's power settings take effect only when it is scheduled. This is the only mechanism that is needed to provide isolation, whereas matters of policy such as CPU shares, energy allocations, or management of misbehaving applications should be implemented separately. For instance, if an application slows the CPU and uses more CPU time, this is no different from a process that misbehaves by entering an endless loop, which is something to be managed by a scheduling policy. We experimentally demonstrate Chameleon with Linux time-sharing, which is a best effort scheduler, and with start time fair queuing, which is a QoS-aware proportional-share scheduler. In the time-sharing scheduler, each application will be scheduled for a quantum, regardless of the behavior of other applications, whereas the proportional-share scheduler provides greater guarantees on the frequency and duration of those quanta. In fact, Chameleon's architecture does not require any direct modifications to the CPU scheduling algorithm itself, and as a result, Chameleon is compatible with any scheduling algorithm.
Third, Chameleon implements a speed adapter that maps application-specified power settings to the nearest CPU speed actually supported by the hardware. In particular, an application specifies the desired CPU speed as a fraction $f_i$ of the maximum processor speed. The speed adapter maps this fraction to the nearest supported CPU speed. Since different hardware processors support different discrete speeds, such an approach ensures portability across hardware.
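The mapping performed by the speed adapter can be sketched as a lookup in a sorted table of supported speeds. The table below is hypothetical, as is the function name; the sketch rounds up to the closest supported speed that is not less than the request, which is the rule this paper states for the video decoder in Section 3.1.

```python
import bisect

# Hypothetical table of supported speeds, as fractions of full speed.
SUPPORTED = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def adapt_speed(fraction, supported=SUPPORTED):
    """Map a requested speed fraction to a supported speed.

    Returns the smallest supported speed that is not less than the
    request, or the highest supported speed if the request exceeds it.
    """
    levels = sorted(supported)
    idx = bisect.bisect_left(levels, fraction)
    if idx == len(levels):
        return levels[-1]
    return levels[idx]
```

Keeping the table outside the applications is what makes the fraction-based interface portable: only the table changes from one processor to another.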
APPLICATION-LEVEL POWER MANAGEMENT
Independent of the particular application, a user-level power management policy consists of three key steps:
1. Estimate processor demand. In this step, a combination of application-domain knowledge and past CPU usage statistics is used to estimate processor demand in the near future.
2. Estimate processor availability. This step explicitly accounts for the impact of other concurrent applications. In this step, the amount of CPU time that will be available to the application in the presence of other applications is estimated.
3. Determine processor speed setting. This step chooses a speed setting that attempts to "match" the processor demand to the processor availability. For instance, if the actual demand is only half of the available CPU time, then the application can run the processor at half speed and spread its CPU demand over the available time. In contrast, if the processor demand and the processor availability are roughly equal, the application may choose to run the processor at full speed.
In the rest of this section, we show how these ideas can be instantiated for four specific applications that belong to three different application classes: soft real time, interactive best effort, and batch.
Moving Picture Experts Group Video Decoder
An MPEG video decoder is an example of a soft real-time application. Many multimedia applications such as DVD playback, audio players, music synthesizers, video capture, and editors belong to this category. A common characteristic of these applications is that data needs to be processed with timeliness constraints. For instance, in a video decoder, frames need to be decoded and rendered at the playback rate. For a 30-fps video, a frame needs to be decoded once every 33 ms. The inability to meet timeliness constraints impacts application correctness. Playback glitches will be observed in a video decoder, for example.
A soft real-time application can use the following general strategy for user-level power management: Assume that the application executes a sequence of tasks. The decoding of a single frame is an example of a task. Let c denote the amount of CPU time needed to execute this task at full processor speed. Let d denote the deadline of this task and let t denote the task begin time. Furthermore, let e denote the amount of CPU time that will actually be allocated to the application for this task before its deadline. The parameter c captures processor demand, whereas e captures processor availability by accounting for the presence of other concurrent tasks in the system. In a time-sharing scheduler, for instance, the larger the number of runnable tasks, the smaller the value of e. In a QoS-aware scheduler that allows a fixed fraction of the CPU to be reserved for an application, the value of e will be independent of other tasks in the system.
Given the processor demand c, processor availability e, and deadline d, the processor speed can be chosen as follows:
Case 1. If $t + c > d$, then it is impossible to meet the task deadline (see Fig. 3a). Essentially, the task started "too late," and neither the CPU scheduler nor the power management strategy can rectify the situation. In such a scenario, the appropriate policy is to run the processor at full speed to mitigate the effects of the missed deadline.
Case 2. If $e < c$, then the processor demand exceeds the processor availability for this task (see Fig. 3b). Although it is feasible to meet the deadline by allocating sufficient CPU time to the task, the CPU scheduler is unable to do so due to the presence of other concurrent applications. Since the application performance will suffer due to insufficient processor availability, the power management strategy should not further worsen the situation. Thus, the application should run at full processor speed for this task.
The final scenario assumes that neither case 1 nor case 2 holds.
Case 3. If $t + c \le d$, then the task can finish before its deadline at full processor speed (see Fig. 3c). In this case, the policy should slow down the CPU such that the demand c is spread over the amount of time that the task will execute on the CPU while still meeting the deadline. The CPU frequency f should be chosen as

\[ f = \frac{c}{\min(e,\, d - t)} \cdot f_{\max}, \]

where $f_{\max}$ is the maximum processor speed (frequency). This strategy is applicable to a variety of soft real-time applications, so long as the notion of a task is appropriately defined. In a video decoder,
1. the decoding of each frame represents a task,
2. c denotes the time to decode the next frame at full speed,
3. e denotes the estimated duration for which the decoder will be scheduled on the CPU until the frame deadline, and
4. d denotes the playback instant of the frame (as determined by the playback rate of the video).
Although d is known, parameters c and e need to be estimated for each frame.
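The three cases above can be collected into one function. This is a minimal sketch rather than the code used in mplayer: the function name is hypothetical, speeds are expressed as fractions of the maximum so that `f_max` defaults to 1.0, and the guard against a zero-length window is an added assumption.

```python
def choose_speed(c, e, d, t, f_max=1.0):
    """Pick a CPU speed for one soft real-time task.

    c: CPU time the task needs at full speed
    e: CPU time expected to be allocated to the task before its deadline
    d: absolute deadline of the task
    t: time at which the task begins
    All times use the same unit (for example, milliseconds).
    """
    if t + c > d:
        # Case 1: the deadline cannot be met even at full speed.
        return f_max
    if e < c:
        # Case 2: demand exceeds availability; do not make it worse.
        return f_max
    window = min(e, d - t)
    if window <= 0:
        # Assumption: degenerate input (no time to spread over).
        return f_max
    # Case 3: spread the demand over the usable time before the deadline.
    return min(c * f_max / window, f_max)
```

For a 30-fps video, a frame that needs 10 ms at full speed and can expect 20 ms of CPU time before its 33-ms deadline yields `choose_speed(10, 20, 33, 0) == 0.5`, that is, half speed.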
Estimating the processor demand. Processor demand is determined by estimating frame decode times. We use mplayer, an open source video decoder that supports both MPEG-2 and MPEG-4 playback. Note that MPEG-2 is widely used for DVD playback, whereas MPEG-4 is used by commercial streaming systems such as QuickTime and Windows Media. Using mplayer, we encoded a number of MPEG-2 and MPEG-4 video clips at different bit rates and different spatial resolutions. These video clips were decoded by an instrumented mplayer that measured and logged the decode time of each frame at full processor speed. We analyzed the resulting traces by studying the first-order and second-order statistics of the decode times and frame sizes for each frame type (that is, I, P, and B). In our analysis, we seek to 1) show that the frame size and the frame-decoding time are likely to be correlated and 2) model this relationship by using a linear function. Due to space constraints, we demonstrate 1) in a technical report [13] and 2) here.
We have constructed a predictor that uses the type and size of each frame to compute its decode time. A key feature of our predictor is that the prediction model is parameterized at runtime to determine the slope and intercept of the linear function. To do so, the video decoder stores the observed decode times of the previous n frames, scales these values to the full-speed decode time (since the observed decode times may be at slower CPU speeds), and uses these values to periodically recompute the slopes and the intercepts of the linear predictor by using a least squares fit. This not only enables the predictor to account for differences across video clips (for example, different bit rates require different linear predictors), but also accounts for variations within a video (for example, slow moving scenes versus fast moving scenes in a video). The parameterized predictor is then used to estimate the decode time of each frame at full processor speed.
For instance, given window size n, suppose we have the last n I frame sizes and decoding times. Then, we start decoding a new I frame, and we already know the size of this new frame. Let $s_i$ and $e_i$ denote the frame size and the full-speed decoding time of the ith frame, respectively, let $s_{n+1}$ denote the frame size of the new I frame, and let $\hat{e}_{n+1}$ denote its predicted full-speed decoding time. Thus, $\hat{e}_{n+1}$ is given by a least squares fit:

\[ \hat{e}_{n+1} = a\, s_{n+1} + b, \qquad a = \frac{n \sum_{i=1}^{n} s_i e_i - \sum_{i=1}^{n} s_i \sum_{i=1}^{n} e_i}{n \sum_{i=1}^{n} s_i^2 - \left( \sum_{i=1}^{n} s_i \right)^2}, \qquad b = \frac{\sum_{i=1}^{n} e_i - a \sum_{i=1}^{n} s_i}{n}. \qquad (1) \]

In the predictor shown in (1), the window size n has a great impact on the performance of the predictor; thus, choosing an appropriate n is an important issue in the design of such a linear regression predictor. To do this, we applied the linear regression predictor to our collected traces by varying the window size n from 5 to 50 and then measured the absolute accuracy of the linear regression predictor with different window sizes. In particular, we determined how often the predictor was within 1 ms of the actual decode time. At a frame interval of 33 ms, an accuracy better than 1 ms makes providing on-time frame decoding straightforward.
As shown for the two sample videos in Fig. 4, the linear regression predictor achieves the best accuracy in most cases when the window size n is less than 10, and the accuracy level has small variation in that area. We found similar results for other frame sizes and videos. Therefore, we choose the window size 8 for our predictor, as the division operations of (1) can then be replaced by shift operations, thus reducing the computational cost.
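A sliding-window least squares predictor of this kind can be sketched as follows. The class and method names are hypothetical, one instance would be kept per frame type, and the scaling step assumes that decode time is inversely proportional to CPU speed, which is a simplification. The sketch uses floating-point division rather than the shift operations mentioned above.

```python
from collections import deque


class DecodeTimePredictor:
    """Linear predictor of full-speed decode time from frame size."""

    def __init__(self, window=8):
        # Only the most recent `window` observations are kept.
        self.samples = deque(maxlen=window)

    def observe(self, size, observed_time, speed_fraction=1.0):
        # Assumption: time measured at a fraction f of full speed
        # corresponds to observed_time * f at full speed.
        self.samples.append((size, observed_time * speed_fraction))

    def predict(self, size):
        n = len(self.samples)
        if n == 0:
            return None  # no history yet
        sum_s = sum(s for s, _ in self.samples)
        sum_e = sum(e for _, e in self.samples)
        sum_se = sum(s * e for s, e in self.samples)
        sum_ss = sum(s * s for s, _ in self.samples)
        denom = n * sum_ss - sum_s * sum_s
        if denom == 0:
            # All frames had the same size: fall back to the mean time.
            return sum_e / n
        slope = (n * sum_se - sum_s * sum_e) / denom
        intercept = (sum_e - slope * sum_s) / n
        return slope * size + intercept
```

A decoder would call `observe` after each frame and `predict` before decoding the next frame of the same type, so the slope and intercept track changes in bit rate and scene content.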
After choosing a window size of 8, it is important to know the distribution of the absolute error rate. Fig. 5 presents the accuracy of our predictors for all three different frame types (that is, I, P, and B) with window size 8 on two sample videos. Our experiments show that our predictor can almost always estimate frame-decoding times within 5 ms: Even a frame that is 5 ms late can be masked by a small amount of buffering in a video player.
Estimating the processor availability. Using the Chameleon interface, the application can obtain the start times and the end times of the previous k instances where the application was scheduled on the CPU. This history of quantum durations and the start times of the quanta provide an estimate of how much CPU time was allocated to the application in the recent past. Chameleon uses an exponential moving average of these values to determine the amount of CPU time that is likely to be allocated to the application per unit time, and this yields the processor availability over the next $d - t$ time units.
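One way to turn a history of scheduling instances into an availability estimate is sketched below. The paper does not spell out the exact computation, so the details here are assumptions: each sample is the CPU share between two consecutive quantum start times, the smoothing weight `alpha` is arbitrary, and with no usable history the sketch optimistically assumes the whole horizon is available.

```python
def estimate_availability(quanta, horizon, alpha=0.5):
    """Estimate CPU time available over the next `horizon` time units.

    quanta:  list of (start, end) times of past scheduling instances,
             oldest first.
    horizon: length of the look-ahead window, for example d - t.
    alpha:   weight of the newest sample in the moving average.
    """
    share = None
    for (start, end), (next_start, _) in zip(quanta, quanta[1:]):
        period = next_start - start
        if period <= 0:
            continue  # skip malformed history entries
        sample = (end - start) / period  # CPU share in this period
        if share is None:
            share = sample
        else:
            share = alpha * sample + (1 - alpha) * share
    if share is None:
        # Assumption: with no history, expect the full horizon.
        return horizon
    return min(share, 1.0) * horizon
```

For example, an application that ran for 10 ms out of every 20 ms in the recent past would be credited with about 15 ms of CPU time over a 30-ms horizon.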
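Determining the processor speed. Given an estimate $\hat{c}$ of the frame decode time and $\hat{e}$ of the processor availability, the actual CPU frequency f is chosen in mplayer as follows:

\[ f = \begin{cases} f_{\max} & \text{if } t + \hat{c} > d, \\ f_{\max} & \text{if } \hat{e} < \hat{c}, \\ \min\left( \dfrac{\hat{c} \cdot f_{\max}}{\min(\hat{e},\, d - t)},\; f_{\max} \right) & \text{otherwise.} \end{cases} \qquad (2) \]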
The Chameleon speed adapter then maps the computed f to the closest supported CPU speed that is not less than the requested speed.
Implementation. We augmented mplayer with the frame-decoding time predictor and the speed-setting strategy. Our modifications were primarily restricted to the beginning and end of the frame-decoding method in mplayer. We used gettimeofday to measure the frame-decoding time and the Chameleon interface to estimate the processor availability. Other modifications involved using the Chameleon interface to set the CPU speed by using (2). In all, the implementation of the frame-decoding time predictor involved 221 lines of C code, and the implementation of the speed-setting strategy involved 18 lines of C code. This indicates that the user-level power management strategy can be implemented with relatively modest effort.
Videoconferencing Tool
Videoconferencing is another popular soft real-time application. It is often based on the H.26x family of compression standards (specifically H.261, H.263, and H.264). Videoconferencing exhibits a slightly different soft real-time property than mplayer: The H.26x compression is specially designed to support low-latency streaming. As parts of the video are encoded, they are streamed over the network to the client, which decompresses them as they are received. This is in contrast with mplayer and MPEG-4, which retrieves whole frames from disk and decodes them in bulk. Thus, in a streaming application, Chameleon does not have one deadline per frame to meet, but rather a number of deadlines per frame to be met as compressed data arrives. In the case of an Internet streaming application, the client decodes individual IP packets as they arrive. Each packet can be decoded independently at the receiver without waiting for other packets of that frame. We construct a Chameleon-aware videoconferencing application as follows.
Estimating the processor demand. We use gnomemeeting, an open source videoconferencing tool that supports H.261. Note that H.261 supports two resolutions: QCIF (176 × 144) and CIF (352 × 288). Using gnomemeeting, we ran a number of videoconferences at different spatial resolutions, and we measured and logged the decode time of each frame at full processor speed. The low-level compression mechanisms in H.261 share many common ideas with MPEG. Not surprisingly, we observed a similar linear relationship between the packet size and the packet-decoding time. Consequently, we use a similar predictor to that in Section 3.1 to estimate the decoding time of a packet.
As shown for the two sample videoconferences in Fig. 6, the linear regression predictor achieves the best accuracy in most cases when the window size is more than 8, and the accuracy level has small variation in that area. Therefore, for the same reason as in Section 3.1, we choose window size 8 for our predictor to reduce the computational cost. Fig. 7 shows the accuracy of our predictors with window size 8 on two sample conferences. Our experiments show that our predictor can almost always estimate packet-decoding times within 1.5 ms.
Estimating the deadline of each packet. The deadline of each frame can be determined from the time stamp of a frame and the frame rate. From this deadline, our client must create a deadline for decoding each packet as it arrives. We adopt a simple policy of evenly dividing the deadline by the number of packets that the client expects to see in a frame. In the case of H.261, the number of packets in a frame is unknown until the entire frame has arrived; thus, we estimate the number of packets from the number of packets in the previous frame.
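As the client knows the deadline D of a frame from its time stamp, it can estimate the deadline $\hat{d}$ of the ith packet in this frame as

\[ \hat{d} = t + \frac{D - t}{\hat{n} - i + 1}, \]

in which $\hat{n}$ is the estimated number of packets in this frame and t is the task begin time of decoding the ith packet. Then, the deadline $\hat{d}$ of the ith packet in this frame is given by

\[ \hat{d} = \begin{cases} \ldots & \text{if } i > \hat{n} \text{ and the } i\text{th packet is not the last packet,} \\ t + \dfrac{D - t}{\hat{n} - i + 1} & \text{if } i \le \hat{n}. \end{cases} \qquad (3) \]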
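The even-division policy for the case in which the packet index does not exceed the estimated packet count can be sketched as below. The function name is hypothetical, and the sketch deliberately does not cover packets beyond the estimated count; it raises an error for them instead of guessing a rule.

```python
def packet_deadline(t, frame_deadline, i, n_est):
    """Deadline for decoding the i-th packet (1-based) of a frame.

    t:              time at which decoding of this packet begins
    frame_deadline: deadline D of the whole frame
    n_est:          estimated number of packets in the frame
    The time left until D is split evenly over the packets that are
    still expected, including this one.
    """
    if i > n_est:
        # Not covered by this sketch: more packets than estimated.
        raise ValueError("packet index exceeds the estimated count")
    return t + (frame_deadline - t) / (n_est - i + 1)
```

With a frame deadline of 30 ms and three expected packets, a client that starts each packet exactly at the previous packet's deadline obtains deadlines of 10, 20, and 30 ms, so the last expected packet is due at the frame deadline itself.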
Our analysis of the gnomemeeting videoconferencing traces showed that the frame size (and, thus, the number of packets in a frame) is governed by the amount of human motion in each frame. Due to the continuous nature of human motion, we used the number of packets in the previous frame (denoted as last) and the mean number of packets in the previous k frames (denoted as mean) for predictions, and we also applied a number of time series models such as AR(1), AR(2), AR(3), and MA(1) to our resulting traces for predictions [14].
As shown in Fig. 8 , we found the number of packets in the current frame to be the best predictor of the number of packets in the next frame. Consequently, we use a simple predictor that sets the estimated number of packets in the next frame to that in the current frame.
Estimating the processor availability. We use the same technique as that in Section 3.1 to estimate the processor availability.
Determining the processor speed. Let $\hat{n}$ denote the estimated number of packets in the current frame, let t denote the task begin time of decoding the jth packet, and let D denote the deadline of the frame. Then, the CPU speed f for the decoding of the jth packet in the current frame is determined by scaling its full-speed decode time $\hat{c}$ until the deadline $\hat{d}$. That is,

\[ f = \begin{cases} f_{\max} & \text{if } t + \hat{c} > \hat{d}, \\ f_{\max} & \text{if } \hat{e} < \hat{c}, \\ \min\left( \dfrac{\hat{c} \cdot f_{\max}}{\min(\hat{e},\, \hat{d} - t)},\; f_{\max} \right) & \text{otherwise,} \end{cases} \qquad (4) \]

where $f_{\max}$ denotes the maximum CPU speed and $\hat{d}$ is the estimated deadline given by (3).
In an actual implementation, the computed f is mapped by the speed adapter to the closest available speed that is not smaller than the requested speed.
Implementation. We modified gnomemeeting to implement the packet-decoding time predictor, the packet number predictor, and the speed-setting strategy. As in the case of the video decoder, our modifications were restricted to the beginning and end of the packet-decoding method in gnomemeeting, and we used gettimeofday to measure the packet-decoding time and the Chameleon interface to estimate the processor availability. Other modifications involved using the Chameleon interface to set the CPU speed using (4). In all, the implementation of the packet-decoding time predictor involved 221 lines of C code, the implementation of the packet number predictor involved only one line of C code, and the implementation of the speed-setting strategy involved 32 lines of C code. This indicates that the user-level power management strategy can be implemented with relatively modest effort.
Word Processor
A word processor from an Office suite is an example of an interactive best effort application. Many applications such as editors, shell terminals, Web browsers, and games fall into this category. For instance, a word processor is an event-driven application that works as follows: Upon an event such as a mouse click or keystroke, the word processor needs to do some work to process the event. For example, when the user clicks on a menu item, the application must display a drop-down menu of choices. When the user types a sentence, each character that represents a keystroke needs to be displayed on the screen. The window needs to be redrawn when the draw event arrives. The speed at which these events are processed by the word processor greatly impacts the user's experience.
Studies have shown that there exists a human perception threshold, under which events appear to happen instantaneously [15]. Thus, completing these events any faster would not have any perceptible impact on the user. Although the exact value of the perception threshold is dependent on the user and the type of task being accomplished, a value of 50 ms is commonly used [15], [5], [6], [7], [8], [16]. We also use this perception threshold in our work.
An event-driven interactive application should choose CPU speed settings such that each event is processed not later than the human perception threshold. One possible strategy for doing so is to 1) estimate the processor demand of an event, 2) estimate the processor availability in the next 50 ms, and 3) choose a speed such that the demand is spread over the available CPU time while still meeting the 50-ms perception threshold. Since an event-based application may process many different types of events, estimating the processor demand for each event requires the approach to be explicitly aware of different event types and their computational needs. Such a strategy can be quite complex for applications such as browsers or word processors that support a large number of event types. Another strategy is to estimate the think time of the user and proactively change the processor speed ahead of time [16]. We have found this to be unnecessary in our system, as the transition times are relatively small when compared to the perception threshold.
Instead, we propose a different technique that can meet the human perception threshold without requiring explicit knowledge of the various event types or proactive wakeup. Our technique, referred to as gradual processor acceleration (GPA), implicitly accounts for the processor demand and the processor availability.
Upon the arrival of any event, the word processor is configured to run at the slowest CPU speed, and a timer is set (the timer value is less than the perception threshold). If the processing of the event finishes before the timer expires, then the application simply waits for the next event.
Otherwise, it increases the CPU speed by some amount and sets another timer. If the event processing continues beyond the timer expiration, the CPU speed is increased yet again, and a new timer is set. Thus, the processor is gradually accelerated until either the event processing is complete or the maximum CPU speed is reached. In order to ensure adequate interactive performance, the maximum CPU speed is always used when the event processing time exceeds the perception threshold.
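The acceleration loop described above can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the AbiWord implementation: the function name `gpa_run`, its parameters, and the convention that `speeds[i]` is the speed in effect during the i-th timer interval are all illustrative.

```python
def gpa_run(work_ms, timers_ms, speeds):
    """Simulate gradual processor acceleration (GPA) for one event.

    work_ms:      CPU time the event would need at full speed (assumed known
                  here only for simulation; GPA itself never estimates it).
    timers_ms[i]: length of the i-th timer interval.
    speeds[i]:    CPU speed, as a fraction of the maximum, in effect during
                  that interval; the first entry is the slowest speed.
    Returns the elapsed time until the event completes.
    """
    elapsed = 0.0
    remaining = work_ms
    for timer, speed in zip(timers_ms, speeds):
        done = timer * speed  # full-speed work completed in this interval
        if done >= remaining:
            # Event finishes before this timer expires.
            return elapsed + remaining / speed
        remaining -= done
        elapsed += timer
    # All timers have expired: finish the event at maximum speed.
    return elapsed + remaining
```

With timers of 30, 5, 5, 5, and 5 ms and speeds of 0.45, 0.6, 0.8, 0.9, and 1.0, a short event finishes inside the first interval at the slowest speed, while a long event is pushed to full speed once the timers are exhausted.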
To understand how this policy works in practice, suppose that the event arrives at time $t$ and the application is actually scheduled on the CPU at time $t_0$ (although the application becomes runnable as soon as the event arrives, other concurrent applications can delay the scheduling of this application). From the perspective of the user, a response is desirable from the application not later than $t + 50$ ms. Since the application actually starts executing at time $t_0$, it needs to process the event within the remaining $50 - \delta$ ms, where $\delta = t_0 - t$ (see Fig. 9). To do so, we choose $n$ timers, which have values $t_1, t_2, \ldots, t_n$, and $\sum_{i=1}^{n} t_i = 50 - \delta$. After the expiration of the $i$th timer, the processor speed is increased to $f_i$, where $f_i$ denotes a fraction of the maximum speed. The values of $f_i$ are chosen such that the processor speed increases progressively, and $f_n = f_{max} = 1$. Thus, the application runs at full processor speed if the event processing continues beyond $50 - \delta$ ms. Observe that, rather than explicitly estimating the processor demand of the event, the GPA technique monitors the progress of the event processing and accordingly adjusts the processor speed. Furthermore, it implicitly captures the impact of other concurrent applications in the system.

Analysis. It is possible to bound the maximum slowdown incurred by an application in the GPA technique by carefully choosing timer values and CPU speeds. To see how, observe that the work done in the interval $[t_0, t_0 + \sum_{i=1}^{n} t_i]$ would take only $\sum_{i=1}^{n} f_i t_i$ at full processor speed. If the actual full-speed processing time of the event is smaller than this value, then the event finishes before the $(50 - \delta)$-ms perception threshold in the GPA technique, and thus, the user does not perceive any performance degradation.
For any event that requires more than this amount of full-speed execution time, the maximum possible performance degradation under our strategy is given by

$$\sum_{i=1}^{n} t_i - \sum_{i=1}^{n} f_i t_i = \sum_{i=1}^{n} (1 - f_i)\, t_i, \qquad (5)$$

since the processor will run at full speed once the execution time exceeds the perception threshold. To illustrate, suppose that an event in the GPA technique should not take more than 20 ms longer than it would take at full processor speed. Let $\delta = 0$ for simplicity. If we choose five timers with values 30, 5, 5, 5, and 5 ms and the processor speeds during these timer intervals are 45 percent, 60 percent, 80 percent, 90 percent, and 100 percent, respectively, then, from (5), the maximum possible user-perceived degradation for any event is $50 - (0.45 \cdot 30 + 0.6 \cdot 5 + 0.8 \cdot 5 + 0.9 \cdot 5 + 1.0 \cdot 5) = 50 - 30 = 20$ ms. This is the maximum slowdown for any event that requires more than 50 ms of processing time.
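The arithmetic in this example can be checked with a few lines of code. The sketch below assumes that each speed applies during the corresponding timer interval, as in the example; the function name is illustrative, and speeds are passed as integer percentages to keep the arithmetic exact.

```python
def max_degradation_ms(timers_ms, speed_percents):
    """Worst-case extra latency relative to full-speed execution.

    Each interval of length t run at p percent of full speed loses
    t * (100 - p) / 100 ms of full-speed work.
    """
    lost = sum(t * (100 - p) for t, p in zip(timers_ms, speed_percents))
    return lost / 100


print(max_degradation_ms([30, 5, 5, 5, 5], [45, 60, 80, 90, 100]))  # 20.0
```

Most of the 20-ms bound comes from the first 30-ms interval at 45 percent speed (16.5 ms), so shortening that interval is the most direct way to tighten the bound, at the cost of less time spent at the lowest speed.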
Implementation. We implemented GPA in AbiWord, a sophisticated word processor with a code base of hundreds of thousands of lines of C code. We added code at the beginning of the AbiWord event handler to implement the GPA technique. The X11 server assigns a time stamp to each new user event such as a mouse click or keystroke. We extracted this time stamp $t$ and used gettimeofday to determine the execution start time $t_0$. The parameter $\delta$ is computed as the difference between $t_0$ and $t$. This took only 17 lines of C code. The rest of the modifications involved setting timers and invoking the Chameleon interface to modify the CPU speed when each timer expires, which took 23 lines of C code. In all, the implementation of GPA took only 40 lines of C code, which is a fairly modest change.
Web Browser
A Web browser is another example of an event-driven interactive application that needs to process various events such as a mouse click or a keystroke. When the user types a URL or data into a Web form, the keystrokes need to be displayed on the screen. When the user clicks on a JavaScript menu on a Web page, the menu needs to be expanded. When the mouse is positioned over a hyperlink, visual feedback needs to be provided by changing the shape of the mouse cursor. When the user clicks on a link, the browser needs to construct and send out an HTTP request. When data arrives from the remote server, it needs to parse and display the incoming data. Although the network delay is beyond the control of the browser, all other "local" events should be processed within the human perception threshold for good interactive performance. The GPA technique can directly be used for power management in such a browser.
We added our GPA technique to Dillo, a compact portable open source browser that runs on desktops, laptops, and PDAs. As in the case of the word processor, our modifications were restricted to the event handler in Dillo. First, we extracted the event arrival time and the execution start time in the event handler to compute the scheduling delay $\delta$. We then added code to set timers and increase the processor speed upon timer expiration. In all, the implementation of GPA in Dillo involved 46 lines of C code, again demonstrating the modest nature of our modifications.
Batch Compilations
Compilation using a utility such as make is an example of a batch application. Unlike interactive applications, where the response time is important, the completion time (or throughput) is important for batch applications. Typically, make spawns a sequence of compilation tasks, one for each source code file. One possible user-level power management strategy is to estimate the processor demand for each compilation task and to choose an appropriate speed setting. However, since each compilation task is a separate process that is relatively short lived, gathering CPU usage statistics in order to make reasonable decisions for each process is difficult. Instead, we believe the correct strategy is to allow the user to specify the desired speed setting. System defaults can be used when the user does not specify a setting.
We implemented a utility called pnice that enables the user to specify a particular CPU speed setting for a new process. For instance, the user can invoke the command pnice -n N make to specify that make and all compilations spawned by it should run at a fixed CPU speed setting N. A lower speed setting enables energy savings at the expense of increasing the completion time, whereas a higher one lowers the completion time at the expense of higher energy consumption.
A USER-LEVEL POWER MANAGER
The previous section demonstrated how many commonly used applications can implement their own power management strategy. However, implementing a user-level power management strategy requires modification to the source code, which may not be feasible for legacy applications. Such applications can delegate the task of power management to a user-level power manager. The power manager can use CPU usage statistics and any application-supplied knowledge to modify CPU speed settings on behalf of the applications. A simple user-level power manager may choose a single speed setting for all applications based on the current utilization. The speed setting is varied with observed changes in system utilization. A more complex strategy is to choose a different speed setting for each individual application based on its observed behavior. Doing so requires usage statistics to be maintained for each application. Multiple user-level power managers can coexist in the system, so long as each manages a mutually exclusive subset of the applications. Thus, it is feasible to implement a different power manager for each class of application.
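The simplest policy mentioned above, a single speed setting chosen from the current utilization, might look like the following. This is only a sketch under stated assumptions: the function name, the `target` utilization of 0.8, and the list of speed levels are hypothetical choices for illustration, not parameters of Chameleon.

```python
def choose_speed(utilization, current_speed, levels, target=0.8):
    """Pick the lowest speed level that keeps utilization at or below target.

    utilization:   busy fraction (0..1) observed while running at current_speed.
    current_speed: speed in effect during the observation, as a fraction of
                   the maximum.
    levels:        available speed fractions in ascending order, ending at 1.0.
    target:        assumed utilization ceiling that leaves headroom, so that a
                   saturated CPU always triggers a speedup.
    """
    demand = utilization * current_speed  # load normalized to full speed
    for level in levels:
        if level * target >= demand:
            return level
    return levels[-1]  # demand exceeds every level: run at full speed
```

A manager built this way would call `choose_speed` once per sampling interval and apply the result to all of the applications it manages; a per-application variant would keep one utilization estimate per process instead.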
The Chameleon interface enables the entire range of these possibilities. To demonstrate the flexibility of our approach, we take a recently proposed DVFS approach, Grace OS [10] , [17] , and show how the proposed technique can be implemented as a user-level power manager by using Chameleon. Our objective is twofold. First, we show that many recently proposed approaches such as Grace OS that employ an in-kernel implementation can be implemented as user-level power managers. Second, Grace OS advocates a cooperative application-OS approach, where applications periodically supply information to the OS, and the OS chooses the processor speed setting based on this information and usage statistics. We show that such interactions between the application and the CPU scheduler are feasible by using the interface provided by Chameleon.
Implementation. We begin with a brief overview of Grace OS [10]. Grace OS is designed for periodic multimedia applications that belong to the soft real-time class. Grace OS treats such applications differently from traditional best effort applications. Best effort applications are scheduled using the Linux time-sharing scheduler and do not benefit from DVFS, whereas soft real-time applications are scheduled using a QoS-aware soft real-time scheduler and benefit from DVFS.
To handle soft real-time applications, Grace OS employs two key components: 1) a real-time scheduler and 2) a DVFS algorithm. The CPU scheduler is vanilla earliest deadline first (EDF). The standard EDF theory is used to perform admission control of soft real-time tasks based on their worst case CPU demands. Admitted soft real-time tasks have strict priority over best effort tasks. Deadlines derived from the application-specified periods are used for the EDF scheduling of these tasks. Three system calls-EnterSRT, ExitSRT, and FinishJob-are used to convey the start and finish times of tasks (e.g., frame decode) to the scheduler.
The DVFS algorithm maintains a histogram of CPU usage and derives a probability distribution of processor demand. The processor demand and the application-specified periods are used in a dynamic programming algorithm to derive a list of speed scaling points. Each point $(x, y)$ specifies that a job should run at the speed $y$ when it has used $x$ cycles. The DVFS algorithm monitors the cycle usage of the task. If the usage increases beyond $x$, the next speed setting $y$ is chosen. Observe that this technique has similarities with our GPA technique, where the progress of a task is monitored, and the speed is increased gradually. The key difference is that the duration $x$ and speed $y$ are computed at runtime by using dynamic programming, whereas in GPA, they are statically chosen.
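How a list of speed scaling points is consulted at runtime can be illustrated with a short lookup routine. The sketch below covers only the lookup, not the dynamic programming that produces the points; the function name and the example schedule values are hypothetical.

```python
from bisect import bisect_right


def speed_for_cycles(schedule, used_cycles):
    """Return the speed a job should run at after consuming used_cycles.

    schedule: list of (x, y) points sorted by x, with the first x equal to 0;
              each point means "run at speed y once x cycles have been used".
    """
    thresholds = [x for x, _ in schedule]
    index = bisect_right(thresholds, used_cycles) - 1
    return schedule[max(index, 0)][1]


# Hypothetical schedule: cycle thresholds paired with speeds in MHz.
example = [(0, 300), (2_000_000, 400), (5_000_000, 667)]
print(speed_for_cycles(example, 3_000_000))  # 400
```

In GPA the corresponding table is indexed by elapsed time rather than by consumed cycles, and its entries are fixed in advance.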
To implement Grace OS as a user-level power manager, we must distinguish between the DVFS component and the CPU scheduler. The DVFS algorithm is fully implemented in the user space and uses the Chameleon interface to query usage statistics and monitor the progress. The CPU scheduler and any interactions between the application and the scheduler must be implemented separately from Chameleon. Since Chameleon does not make any specific assumptions about the underlying scheduler, it is compatible with any CPU scheduling algorithm, including EDF.
Consequently, our implementation of Grace OS includes three components: 1) a user-level daemon to calculate the soft real-time task's demand distribution, cycle budget, and speed schedule by using dynamic programming (300 lines of C code), 2) the use of Chameleon's /dev/syscpu interface driver to query the actual usage of each soft real-time task (109 lines of C code), and 3) three system calls EnterSRT, ExitSRT, and FinishJob that allow an application to convey the beginning and end of each soft real-time task (23 lines of C code). Observe that the first two components relate to the DVFS algorithm, whereas the third component is used by the CPU scheduler in Grace OS. The Grace OS user-level power manager runs at the highest CPU priority in our system. All soft real-time applications run at the next highest CPU priority, and best effort jobs run at lower priorities. EDF scheduling is emulated by manipulating priorities of tasks: The task with the earliest deadline is elevated in priority.
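The priority manipulation used to emulate EDF can be sketched as follows. This is an assumption-laden illustration rather than the actual implementation: the function name, the dictionary representation of tasks, and the two priority values are hypothetical, and the text does not say how far the earliest-deadline task is elevated.

```python
def assign_priorities(deadlines, srt_priority, boosted_priority):
    """Emulate EDF among runnable soft real-time tasks via priorities.

    deadlines:        dict mapping task name to its absolute deadline.
    srt_priority:     priority shared by all soft real-time tasks (assumed).
    boosted_priority: priority given to the earliest-deadline task (assumed
                      to lie between srt_priority and the power manager's).
    Returns a dict mapping task name to the priority it should receive.
    """
    if not deadlines:
        return {}
    earliest = min(deadlines, key=deadlines.get)
    return {
        name: boosted_priority if name == earliest else srt_priority
        for name in deadlines
    }
```

Re-running this assignment whenever a job starts or finishes (the points signaled by EnterSRT, ExitSRT, and FinishJob) keeps the earliest-deadline task ahead of the others under a fixed-priority scheduler.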
IMPLEMENTATION
Our prototype of Chameleon is implemented as a set of modules and patches in the Linux kernel 2.4.20-9.
New system calls. We added four new system calls to implement the Chameleon OS interface:
1. get-speed. This returns the current CPU speed of the specified process or process group.
2. set-speed. This sets the CPU speed of the specified process or process group.
3. get-speed-schedule. This returns the processor budget and speed schedule of the specified process.
4. set-speed-schedule. This sets the processor budget and speed schedule of the specified task.

The latter two system calls enable sophisticated speed setting strategies, where an application can specify an a priori schedule for changing the speed as it executes.
Chameleon-enhanced /proc interface. We enhanced the /proc interface by adding a /proc/Chameleon subtree. This directory holds one file for each Chameleon-driven process and allows applications to query their CPU quantum allocations in the recent past.
Chameleon /dev interfaces. To support user-level power managers, we added two new /dev interfaces: /dev/sysdvfs and /dev/syscpu. The systemwide utilization is reported via /dev/sysdvfs, whereas the CPU cycles consumed by individual tasks are reported via /dev/syscpu.
Process control block enhancements. In order to allow Chameleon to implement techniques such as PACE [7], [8] and Grace OS [10], [17] as user-level power managers, we borrowed several process control block attributes from the Grace OS implementation: 1) cycle counter, which measures the CPU cycles used by a task, 2) cycle budget, which stores the number of allocated cycles, and 3) speed schedule, which stores a list and schedule of speed scaling points. Because these three attributes are meaningful only for Chameleon processes managed by user-level power managers, we also added three more attributes that are applicable to all processes in the system: 1) Chameleon-driven-flag, which indicates whether the process directly modifies its speed settings, 2) current-speed, which specifies the current CPU speed setting of the process, and 3) inheritable-flag, which indicates whether the speed setting is inheritable by its children.
DVS kernel module. The DVS kernel module is responsible for interfacing with the hardware in order to modify the processor speed. This is done by writing the frequency and voltage to two model-specific registers (MSRs) [10], [17]. Chameleon can be applied to any DVFS-enabled processor by implementing a DVS kernel module specific to that processor.
Linux scheduler enhancements. We modified the standard scheduler to add per-process speed settings and cycle charging. Similar to our process control block enhancements, cycle charging is only necessary for implementing other techniques as user-level power managers and is directly inspired by the Grace OS implementation [10] , [17] . Whenever the schedule() function is invoked, the modified scheduler will do the following: 1) in the case of no context switch, it may change the speed of the current task according to its speed schedule, 2) in the case of a context switch, the scheduler performs some bookkeeping only for the previous task with a speed schedule (e.g., update its cycle counter, decrement cycle budget, advance speed schedule, etc.), and 3) then, the scheduler sets the CPU speed for the new task based on its current-speed attribute.
Our implementation of Chameleon runs on a Sony Vaio PCG-V1CPK laptop with a Transmeta Crusoe TM5600-667 processor [18]. The Transmeta TM5600 processor supports five discrete frequency and voltage levels. The frequencies and power consumption of the CPU and system are shown in Table 1. The power consumption of the CPU is taken from a data sheet, and we measured the total power of the system while playing movie 6 in mplayer at each of those DVFS settings. The measurements were taken with the typical method: We removed the battery and measured the voltage across a low resistance placed in series with the DC power cable. In all of the experiments in this paper, we use the data sheet values to give an isolated view of the CPU power consumption. In its current implementation, Chameleon only manages the power consumption of the CPU, which accounts for approximately 20 percent of the overall power consumption of the Sony Vaio. Furthermore, when the laptop is idle, Chameleon cannot reduce the power consumption of the CPU. However, as hardware manufacturers begin to address idle power consumption and open new interfaces for controlling the power-performance tradeoff, Chameleon will have an even greater effect on the overall battery lifetime of the device.
The CPU implements the LongRun [4] technology in hardware to dynamically vary the CPU frequency based on the observed systemwide CPU utilization. LongRun varies the CPU frequency between user-specified maximum and minimum values: These values can be set by users by writing to two MSRs. By default, these values are set to 300 and 667 MHz, enabling the full range of voltage scaling. LongRun can be disabled by setting the minimum and maximum register values to the same frequency (e.g., setting both to 533 MHz does not allow any leeway in changing the CPU frequency, effectively disabling LongRun). This feature can be used to implement voltage scaling in software: The power-aware application can determine the desired frequency and set the two registers to this value. Table 2 shows the mapping from CPU speed percentages to the corresponding CPU frequencies for the Transmeta processor used in our prototype implementation.
EXPERIMENTAL EVALUATION
We evaluated Chameleon on a Sony Vaio PCG-V1CPK laptop equipped with a Transmeta Crusoe processor and 128 Mbytes of RAM. The OS was Red Hat Linux 9.0 with a modified version of Linux kernel 2.4.20-9. To compare Chameleon with other DVFS approaches, we implemented three OS-based DVFS techniques proposed in the literature: 1) PAST [3], 2) PEAK [2], and 3) AVG_n [1]. All of these are interval-based systemwide DVFS techniques. Our experiments involve running several applications under six different configurations:
1. with DVFS disabled (the CPU always runs at the maximum speed, denoted as FULL),
2. using the hardwired LongRun technology,
3. using PAST,
4. using PEAK,
5. using AVG_n, and
6. using Chameleon (where LongRun is disabled for power-aware applications but enabled for legacy applications).

We also provide a comparison of Chameleon to Grace OS by using a soft real-time application. Grace OS is applicable only to periodic multimedia applications and, hence, it is not feasible to compare it to other Chameleon applications.
The energy consumption of the processor during an interval $T$ is computed as

$$E = \sum_{i=1}^{n} p_i t_i, \qquad (6)$$

where $n$ is the number of available frequency/voltage combinations on the processor, $p_i$ denotes the power consumption of the processor when running at the $i$th frequency/voltage combination, and $t_i$ represents the time spent at the $i$th frequency/voltage combination during the interval $T$. We modified the Linux kernel to record the energy consumption of the TM5600 processor by using (6) and Table 1. Given the energy consumption of the processor during an interval $T$, the average power consumption of the processor during this interval is computed as

$$P = \frac{E}{T}. \qquad (7)$$
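The energy accounting described here amounts to a weighted sum of per-level power by time spent at each level. The following sketch illustrates it; the function names are illustrative, the power values in the usage note are made up rather than taken from Table 1, and the interval length is assumed to equal the total time spent across all levels.

```python
def energy_joules(power_watts, time_seconds):
    """Energy over an interval: sum of p_i * t_i across frequency levels."""
    return sum(p * t for p, t in zip(power_watts, time_seconds))


def average_power_watts(power_watts, time_seconds):
    """Average power: energy divided by the interval length.

    Assumes the interval is fully covered by the per-level residency times.
    """
    interval = sum(time_seconds)
    return energy_joules(power_watts, time_seconds) / interval
```

For instance, with hypothetical per-level powers of 1, 2, and 4 W and residency times of 6, 3, and 1 s, the energy is 16 J and the average power over the 10-s interval is 1.6 W.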
In our experiments, we observed that PEAK always consumed the least processor energy among all the DVFS techniques. However, it achieves these energy savings at the cost of an unacceptably high performance degradation for time-sensitive multimedia and interactive applications. For example, video decoding of a 30-minute clip took an extra 16.6 minutes, resulting in poor performance. Therefore, we omit the results of PEAK in the rest of this paper and refer the readers to [13] for these results.
Chameleon-Aware Applications
We first demonstrate the effectiveness of our four Chameleon-aware applications. Our experiments assume a lightly loaded system that runs a single application with the typical background system processes.
Video Decoder
We encoded several DVD movies at different bit rates and resolutions by using the DivX MPEG-2/MPEG-4 video codec and an MP3 audio codec. The characteristics of six such movies are listed in Table 3. The bit rates are depicted in the form $(a + b)$ Kbps, where $a$ is the video bit rate and $b$ is the audio bit rate. We recorded the energy consumed by the processor during the playback of these movies under five configurations: FULL, LongRun, Chameleon, PAST, and AVG_n.
Our experiments show that all five configurations handle movie playback very well. The same playback quality is observed under these five configurations: identical execution times, which equal the length of the movies, identical frame rates, no dropped frames, and no user-noticeable delays. However, the average CPU power consumption significantly differs across the various configurations (see Fig. 10a). Fig. 10a shows that 1) neither PAST nor AVG_n can outperform LongRun, 2) LongRun can achieve significant energy savings (from 27.36 percent to 57.26 percent) when compared to FULL, and 3) the Chameleon-aware mplayer can achieve an additional 20.52 percent to 31.99 percent energy savings when compared to LongRun.

Table 1 note: The CPU characteristics are taken from data sheets (*), and the total power was measured for the machine playing movie 6.
Although there are no user-perceived playback problems (in terms of dropped frames or playback freezes) under the five configurations, we observe jitter in the playback quality at the frame level. Such small interframe jitter is inevitable in a time-sharing CPU scheduler, although its effects are not perceptible at the user level. mplayer provides statistical measurements of late frames, that is, the number of frames that are behind their deadline by more than 20 percent of the frame interval. As shown in Fig. 10b, the number of late frames in Chameleon is mostly comparable to PAST and AVG_n and is typically lower than in LongRun (while consuming the least energy). FULL has the fewest late frames, although not zero, at the expense of the highest energy consumption. The number of late frames is small (0.2 percent to 2.3 percent) in all configurations.
Videoconference Tool
To ensure repeatable and comparable experiments with the videoconferencing tool, we encoded several video clips with varying degrees of motion, and we replay those videos through remote sending applications. The sender encodes these videos and transmits them to our Chameleon-aware client over a lightly loaded network. This ensures a fair comparison across the various DVFS techniques and enables us to carefully control the amount of motion in each session.
We ran our videoconference experiments under two resolutions, QCIF (176 × 144) and CIF (352 × 288), for all five configurations. In our experiments, all five configurations handle the videoconference very well. The same quality is observed under all configurations: identical execution times and no deadline misses (i.e., the decoding of each packet completes before the arrival of the next packet). Our results, as shown in Fig. 11, show that LongRun achieves significant energy savings (from 20.75 percent to 69.25 percent) when compared to FULL. Chameleon-aware gnomemeeting achieves an additional 11 percent to 34 percent energy savings when compared to LongRun, whereas PAST and AVG_n are worse than LongRun.
Web Browser and Word Processor
We ran the Web browser and the word processor and measured their average power consumption, the average response time, and the percentage of late events (where the event processing time exceeds the 50-ms threshold).
To eliminate the impact of variable network delays, our experiments with the Web browser consisted of a client that requests a sequence of Web pages from a Web server on a local area network. The requested Web pages consist of actual Web content that was saved from a variety of popular Web sites. Each experiment consists of a sequence of requests to these Web pages with a uniformly distributed "think time" between successive requests. The experiments differ in the requested Web pages and the chosen think times. Each experiment is repeated under the five configurations, and we measure the mean for each experiment.
The workload for the word processor emulates a user that edits a sequence of documents. Each experiment contains a script that makes a sequence of editing requests to these documents with a uniformly distributed "think time" between successive requests. The experiments differ in the edited documents and the chosen think times. Each experiment is repeated under the five configurations, and we measure the mean for each experiment.
Our results, as depicted in Fig. 12a, show that LongRun consumes a factor of three less power than FULL. Chameleon is able to extract an additional 10.27 percent energy savings when compared to LongRun, whereas PAST is worse than LongRun. We also note that the average power consumption under Chameleon is only 0.03 W and 0.06 W higher than the power consumption at the slowest CPU speed (300 MHz) for the browser and the word processor, respectively. Furthermore, most events finish in Chameleon without any performance degradation. The percentage of late events is only 0.24 percent and 0.22 percent in the word processor and the browser, respectively, and is comparable to other approaches. Finally, the increase in the processing times of late events is not more than 20 ms (obtained by substituting the chosen timer values and CPU speeds in (5)).
Batch Compilations
We compiled a portion of the ns-2 network simulator by using make and our pnice utility. We chose different values of the CPU speed in pnice and measured the power consumption and completion times of make. As expected, our results, as depicted in Table 4 , show that the power consumption can be traded for the completion time by appropriately choosing a speed setting. A higher speed lowers the completion time at the expense of using more energy.
Impact of Concurrent Workloads
To demonstrate that applications can make locally and globally optimal power management decisions in the presence of concurrent applications, we considered four application mixes:
1. video decoder and Web browser (mix M1),
2. video decoder and word processor (mix M2),
3. video decoder and batch compilations (mix M3), and
4. batch compilations and word processor (mix M4).

Note that from the perspective of the video decoder, the background load increases progressively from mix M1 to mix M3. Table 5 and Fig. 13 show the average power consumption and the performance of these applications under various power management strategies. Table 5 shows that Chameleon always consumes the least energy among the five configurations. The energy savings range from 19.81 percent to 31.19 percent when compared to LongRun, which itself extracts up to a 41.89 percent reduction when compared to FULL. The performance degradation, as depicted in Fig. 13a, shows that interactive application performance in Chameleon is comparable to the other techniques. For instance, the average event processing time of the word processor under mix M2 increases from 4.4 ms in LongRun to 5.96 ms in Chameleon and is well under the human perception threshold of 50 ms. A similar result is seen for the Web browser under mix M1. The percentage of late events remains well under 1 percent under all mixes (see Fig. 13b). Fig. 13c plots the percentage of late frames in the video decoder for different mixes. The figure shows that the percentage of late frames in Chameleon is comparable to other approaches. As the background load increases from mix M1 to mix M3, we see that the percentage of late frames increases from around 0.4 percent to more than 22 percent. For mix M3, all techniques, including FULL, incur 22 percent deadline misses. The decoding of the 10-minute clip takes an extra 20 s under all techniques, resulting in poor performance. This is primarily due to insufficient processor availability at higher loads, as opposed to deficiencies in the power management technique.
Due to the background load imposed by the batch compilations in mix M3, the time-sharing scheduler is unable to allocate sufficient CPU time to the video decoder. Fig. 14 shows the fraction of time spent by the video decoder at different CPU speed settings. In the absence of any background load, the decoder is able to lower its speed setting to the lowest speed for more than 90 percent of the time. As the load increases, the fraction of the time spent at higher speeds increases. For mix M3, more than 80 percent of the time is spent at the highest speed (recall that insufficient processor availability causes the video decoder to run at full speed, as in case 2 in Section 3.1). Under mix M3, the only possible solution is to use a QoS-aware scheduler that guarantees a fixed fraction of the CPU to the video decoder, regardless of the background load. We ran mix M3 with Chameleon on a proportional-share scheduler, namely, the Hierarchical Start-time Fair Queuing (HSFQ) CPU scheduler [19]. In this experiment, we assigned 1/14 of the CPU time to the batch compilations, 12/14 of the CPU time to the video decoder and the X server, and the remaining 1/14 to the other tasks. As expected, the percentage of late frames in the video decoder fell to a very small value. Furthermore, since processor availability is guaranteed in HSFQ, as shown in Fig. 14, the video decoder was able to spend 73.73 percent of its execution time at the lowest frequency, 300 MHz, as compared to 7.74 percent under the time-sharing CPU scheduler. This causes the mean power consumption to fall to 2.1 W, which is a 44.8 percent reduction when compared to the time-sharing scheduler.
Isolation in Chameleon
We claim that Chameleon isolates an application from the power settings of other applications. To demonstrate the effects of such isolation, we ran mplayer with a misbehaving background application by using the Linux time-sharing scheduler. The background application rapidly switches its CPU speed from one setting to another every few milliseconds. We ran mplayer with this application when it was well behaved (it used a fixed CPU speed throughout) and then with the misbehaving version of the application. We measured its impact on the progress of mplayer. As shown in Fig. 15, the progress made by mplayer is unaffected by the rapid changes in CPU speed by the misbehaving application. Any change in the CPU speed by an application only impacts its own progress and has no impact on the CPU shares received by other applications.
User-Level Power Manager
We modified mplayer to use the Grace OS system calls and used it to decode the movies in Table 3. The Grace OS user-level power manager was used to make power management decisions on behalf of mplayer. We measured the energy consumed by mplayer and plot it in Fig. 16. Our results show that Grace OS can achieve 3.50 percent to 18.44 percent energy savings when compared to LongRun. However, Chameleon outperforms Grace OS by 9 percent to 41 percent. This is because the Chameleon-enhanced mplayer is able to estimate the decode times of individual frames based on domain knowledge, whereas Grace OS relies on external observations of the CPU usage of mplayer. This domain knowledge yields an extra 9 percent to 41 percent energy saving in Chameleon.
To further demonstrate the value of application-domain knowledge in power management decisions, we also compared Chameleon, LongRun, and Grace OS with the optimal frequency settings for video playback. For each frame, the optimal decoding frequency is the lowest CPU frequency at which the frame can be decoded before its deadline expires. To collect it, we played the movies under different fixed frequency settings, recorded the decoding time of each frame under each setting, and chose the lowest frequency at which the frame-decoding time is less than the frame interval. Our results show that, compared to Optimal: 1) for movie 2, Chameleon, Grace OS, and LongRun consume 3.18 percent, 12.06 percent, and 29.83 percent more energy, respectively, and 2) for movie 4, they consume 20 percent, 69.85 percent, and 76 percent more energy, respectively. Fig. 17 shows the fraction of time spent by the video decoder at different CPU speed settings under these four configurations, which are, in decreasing order of application-domain knowledge, Optimal, Chameleon, Grace OS, and LongRun. For movie 2, the decoder runs at the lowest speed for 99 percent, 97.69 percent, 88.53 percent, and 42.37 percent of the time, respectively. We observed a similar trend for movie 4, with the exception of Grace OS, which cannot run at the lowest speed at all. This exception arises because, by default, Grace OS chooses a CPU speed schedule that guarantees that 95 percent of the frames complete on time. However, Optimal shows that 300 MHz suffices for only around 92 percent of the frames. As a result, Grace OS chooses a speed schedule that starts from a 400-MHz CPU frequency. In summary, our results demonstrate that the more application-domain knowledge the power manager has, the more energy it can save.
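The per-frame selection used to build the Optimal baseline can be sketched as follows. This is a minimal illustration of the rule stated above (lowest fixed frequency whose measured decode time is below the frame interval), not the measurement harness used in the experiments. The function name, the list of frequency settings, and all timing values are hypothetical.

```python
from typing import Dict, Optional


def optimal_frequency(decode_time_ms: Dict[int, float],
                      frame_interval_ms: float) -> Optional[int]:
    """Return the lowest frequency (MHz) at which the frame meets its deadline.

    decode_time_ms maps a fixed CPU frequency in MHz to the decode time
    measured for one frame at that frequency. Returns None if the frame
    misses its deadline at every measured frequency.
    """
    for freq in sorted(decode_time_ms):
        if decode_time_ms[freq] < frame_interval_ms:
            return freq
    return None


# Hypothetical measurements for two frames of a 25-frames-per-second movie.
FRAME_INTERVAL_MS = 1000.0 / 25  # 40 ms between frames
easy_frame = {300: 31.0, 400: 23.5, 533: 17.8, 600: 15.9, 667: 14.2}
hard_frame = {300: 52.0, 400: 39.0, 533: 29.5, 600: 26.0, 667: 23.4}

print(optimal_frequency(easy_frame, FRAME_INTERVAL_MS))  # 300
print(optimal_frequency(hard_frame, FRAME_INTERVAL_MS))  # 400
```

Repeating this selection for every frame of a movie yields the per-frame frequency assignment against which the other configurations are compared.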
System Overhead
An important consideration is the overhead caused by frequent changes in the CPU speed setting. Using the CPU cycle counter register, we measured the cost of a speed change as 1,125 cycles (about 3.75 μs at 300 MHz and 1.69 μs at 667 MHz; see Table 7). Due to better DVFS support in the Transmeta processor, this is considerably lower than the 8,000-16,000 cycles reported for an HP laptop used in the Grace OS experiments [10], [17]; however, both incur minimal overhead. Finally, the overheads of the video decoder, GPA, and pnice strategies are 2,738, 1,149, and 127 CPU cycles, respectively, which are on the order of a few microseconds (see Table 6).
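The cycle counts above convert to time by dividing by the clock rate: a processor running at f MHz completes f cycles per microsecond. The short sketch below reproduces that arithmetic for the reported cycle counts at the two frequencies mentioned in the text. The helper name is hypothetical, and evaluating the policy overheads at both 300 MHz and 667 MHz is an illustrative choice rather than something the measurements specify.

```python
def cycles_to_microseconds(cycles: int, freq_mhz: float) -> float:
    """Convert a cycle count to microseconds at the given clock rate in MHz."""
    # A clock at f MHz completes f cycles per microsecond.
    return cycles / freq_mhz


# Cycle counts as reported in the text.
overheads = [
    ("speed change", 1125),
    ("video decoder policy", 2738),
    ("GPA policy", 1149),
    ("pnice policy", 127),
]

for label, cycles in overheads:
    for freq_mhz in (300, 667):
        micros = cycles_to_microseconds(cycles, freq_mhz)
        print(f"{label}: {cycles} cycles = {micros:.2f} us at {freq_mhz} MHz")
```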
RELATED WORK
Power management techniques for mobile devices have received considerable research attention. Most of the proposed techniques use either DVFS [20], [21], [22], [23], [24], [10], [17] or application-based/middleware-based adaptations [25], [26], [27], [28] for energy savings. DVFS approaches extract energy savings by varying the processor speed. These techniques do not affect the amount of processing performed by the application; the processing is merely spread over longer time periods by lowering CPU speeds. In contrast, middleware-based adaptation approaches vary the quality or data fidelity and, thus, the amount of processing performed by the application to extract energy savings. We review related work in both categories.
Application-based or middleware-based adaptation techniques trade application quality for reduced computational overhead. Energy savings are extracted by reducing the video quality [27], [28], document quality [25], or data fidelity [26] and, thus, the processing overhead. Proxy-based adaptations that reduce streaming video quality have been explored in [27] and [28]. Puppeteer adapts document quality (that is, picture resolution, color depth, and animation) to save energy in office applications [29], [25]. The impact of adapting data fidelity on the energy savings of several applications has also been demonstrated in the Odyssey system [26], [30].
In contrast, DVFS techniques do not reduce the amount of processing imposed by an application. Instead, they vary the CPU speed to match the CPU load and extract energy savings [20], [21], [22], [23], [24], [10], [17]. DVFS techniques fall into four categories: hardware-based, OS-based, cooperative application-OS-based, and application-directed methods. Hardware-based approaches such as LongRun [4] measure system utilization in hardware and choose a systemwide speed setting based on the current utilization. An online hardware approach for voltage and frequency control in multiple clock-domain microprocessors has been proposed in [31].
OS-based approaches determine a systemwide CPU setting based on the processor demands of the currently active tasks [5], [6], [7], [8], [32]. In these approaches, individual applications have no direct control over the CPU power settings. A single systemwide CPU setting is determined, typically based on the needs of the most resource-hungry application, even when a mix of applications is executing on the processor. Furthermore, the OS must infer the processing needs of the applications from online measurements and can incur estimation errors.
In cooperative application-OS approaches, the application provides some domain-specific information to the kernel. The OS kernel and the CPU scheduler use this information for CPU speed setting and/or scheduling. The Grace OS project [10] , [17] proposes a cooperative application/OS approach to save energy for periodic multimedia applications. It uses probability distributions of CPU usage of periodic applications and knowledge of application periods (which is supplied by the application) for choosing CPU speeds. Aperiodic or non-real-time applications are currently not handled by the approach.
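As a rough illustration of how a statistical on-time guarantee interacts with discrete speed settings, the sketch below picks the lowest single frequency at which a target fraction of jobs finishes within one period. This is a deliberate simplification and not the Grace OS algorithm, which derives a speed schedule from the probability distribution of CPU demand. The function name, the frequency list, and the sample demand values are all hypothetical.

```python
from typing import Optional, Sequence


def lowest_frequency_meeting_target(
    cycles_per_job: Sequence[float],
    period_s: float,
    freqs_mhz: Sequence[int],
    target_fraction: float = 0.95,
) -> Optional[int]:
    """Return the lowest frequency (MHz) at which at least target_fraction
    of the observed jobs fit within one period, or None if none qualifies."""
    total = len(cycles_per_job)
    if total == 0:
        return None
    for freq in sorted(freqs_mhz):
        budget = freq * 1e6 * period_s  # cycles available in one period
        on_time = sum(1 for cycles in cycles_per_job if cycles <= budget)
        if on_time / total >= target_fraction:
            return freq
    return None


# Hypothetical demand profile: 92 percent of frames fit at 300 MHz,
# the remaining 8 percent need more cycles than 300 MHz can supply.
FREQS_MHZ = [300, 400, 533, 600, 667]  # assumed speed settings
PERIOD_S = 0.04                        # 25 frames per second
demand = [10e6] * 92 + [14e6] * 8      # cycles per frame

print(lowest_frequency_meeting_target(demand, PERIOD_S, FREQS_MHZ, 0.95))  # 400
print(lowest_frequency_meeting_target(demand, PERIOD_S, FREQS_MHZ, 0.90))  # 300
```

With a 95 percent target, a workload in which only 92 percent of jobs fit at the lowest setting is pushed to the next setting for every job, which is the kind of gap that per-job domain knowledge can close.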
Similarly, the Milly Watt project [9] explores the design of a power-based API that allows a partnership between applications and the system in setting energy use policy. Two contributions have been proposed in the context of this project [33]: the Currentcy model, which unifies energy accounting over diverse hardware components and enables fair allocation of available energy among applications, and ECOSystem, a prototype energy-centric OS that implements explicit energy management techniques from the system point of view. Their goal is to extend the battery lifetime by limiting the average discharge rate and to share this limited resource among competing tasks according to user preferences.
A cooperative power management approach was proposed in [34] to unify low-level architectural optimizations (CPU, memory, and register), OS power-saving mechanisms (DVFS), and adaptive middleware techniques (admission control, optimal transcoding, and network traffic regulation). In this technique, interaction parameters between the different levels are identified and optimized to significantly reduce the power consumption.
Rather than a partnership between the OS and the applications, our Chameleon approach exports the entire burden of power management to the user level.
Finally, there has been some work on application-level power management. Researchers have proposed several different application-controlled DVFS techniques for video decoding [20] , [21] , [22] , [23] , [24] . Although some require offline estimation of CPU demands for decoding [22] , others can estimate the CPU demands online [20] , [21] , [23] , [24] .
However, all of these techniques implicitly assume that only a single application is executing on the CPU and grant complete control of the processor settings to the video decoder. Chameleon considers general-purpose systems: applications must consider the impact of other load in the system, whereas the OS provides isolation.
CONCLUSIONS
This paper proposes Chameleon, which is a new approach for power management in mobile processors. We argue that applications know best what their energy needs are and propose an approach that allows them to make decisions on power management. The OS only enforces protection and isolates applications from the power settings of other applications.
Our integration of application-level power management policies into four applications demonstrates that such policies impose a modest cost of tens of lines of code. Our results show that Chameleon can extract up to 32 percent of energy savings when compared to LongRun and up to 50 percent of savings when compared to the recently proposed OS-based DVFS techniques while delivering comparable performance to time-sensitive and interactive applications. Chameleon imposes negligible overheads and is very effective at scheduling concurrent applications with diverse energy needs. More broadly, our results demonstrate the feasibility and benefits of power management at the application level.
