Abstract. This paper proposes a new hardware-based energy management technique for future embedded multithreaded processors with integrated Earliest Deadline First (EDF) real-time scheduling. Our energy management technique controls frequency reduction and dynamic voltage scaling depending on the deadlines, the Worst Case Execution Times (WCET), and the real execution times. Hard real-time capability can be guaranteed for aperiodic threads and for threads with deadlines shorter than their period. Our evaluations show that energy consumption can be reduced to about 2/3 of that of a comparable software-based algorithm.
Introduction
The reduction of energy consumption is an important research field because of the rapidly growing number of battery-powered mobile and embedded devices. Hard real-time is often an essential requirement for such systems. This paper focuses on energy management in embedded processor cores in combination with real-time applications. The aim is to reduce the total energy consumption by optimizing power consumption without delaying the completion of the real-time threads.
In CMOS devices, the power consumption is proportional to the square of the supply voltage and linear in the frequency:
$$P = a \cdot C_L \cdot V_{DD}^2 \cdot f$$
where $a$ is the activity of the circuit, $C_L$ is the output load capacitance, $V_{DD}$ the supply voltage, and $f$ the frequency. Obviously, power consumption can be reduced dynamically by lowering supply voltage and clock frequency. Unfortunately, the required supply voltage depends on the clock frequency, and a lower frequency also reduces the processor's performance. Hence, in real-time systems, we have to control the frequency in a way that does not harm the real-time behavior of the system.
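To make the effect of combined voltage and frequency scaling concrete, the following minimal Python sketch evaluates the usual dynamic power relation $P = a \cdot C_L \cdot V_{DD}^2 \cdot f$ relative to full speed. The function name and the operating points are illustrative assumptions and are not taken from any of the processors discussed in this paper.

```python
def relative_power(v_dd, f, v_max, f_max):
    """Dynamic CMOS power at (v_dd, f) relative to full speed (v_max, f_max).

    The activity a and the load capacitance C_L cancel out because they
    are assumed to be the same at both operating points.
    """
    return (v_dd / v_max) ** 2 * (f / f_max)


# Hypothetical operating points, not taken from any real processor.
full_speed = relative_power(v_dd=1.0, f=1.0, v_max=1.0, f_max=1.0)
half_speed = relative_power(v_dd=0.7, f=0.5, v_max=1.0, f_max=1.0)
print(full_speed, half_speed)
```

In this hypothetical setting, halving the frequency alone would halve the power, whereas the additional voltage reduction brings it down to roughly a quarter.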
We developed a multithreaded Java microcontroller, called the Komodo microcontroller, with hardware-integrated real-time scheduling schemes [1, 2] for application in embedded real-time systems and ubiquitous devices. The Komodo microcontroller is able to perform a thread switch without any overhead. Thus, instructions of active threads are executed in an overlapped fashion inside the core pipeline; the EDF scheduler hardware ensures that the thread with the earliest deadline is the thread with the highest priority. Due to hardware multithreading, instructions of other threads are executed within latency cycles of the thread with the highest priority without interfering with its execution (latency bridging).
We investigate mechanisms to minimize energy consumption using hardware-based energy management techniques that are made possible by a multithreaded processor core with integrated EDF scheduling. In particular, we show that energy saving techniques like frequency reduction and voltage scaling can be controlled more efficiently by the integrated EDF energy management than by conventional operating system methods. In each processor cycle, our hardware-integrated energy management algorithm automatically chooses the frequency and voltage level currently required to perform a real-time application without missing any deadline.
The next two sections show state-of-the-art energy saving mechanisms and related work. Section 4 presents the extensions for hardware-based energy management within the processor-integrated EDF scheduler and in section 5 we evaluate our approach. Section 6 concludes the paper.
State-of-the-Art Energy Saving Mechanisms
Commercial processors use a number of techniques for saving energy, such as pipeline gating, several suspend or sleep modes, and reduction of frequency and supply voltage. Intel's XScale [3], Transmeta's Crusoe [4], and the MSP430 [5] from Texas Instruments work with software-controlled techniques of frequency reduction and voltage scaling.
We briefly describe the energy saving features of the XScale and the Crusoe processors, because we use their electrical properties (voltages and frequency rates) for simulating our hardware-based energy management. Both processors are able to run at several frequencies using different supply voltages. A change of frequency requires, among other tasks, completing all outstanding memory accesses, setting the external SDRAM to self-refresh mode, and disabling the interrupt controller. Most of these tasks are done automatically, but they nevertheless take time to execute. The whole process of changing the frequency requires up to 500µs in the case of the XScale. For the Crusoe processor, the time required for a supply voltage change depends on the distance between the two voltage levels. The maximum value is about 896µs in the default configuration.
Pipeline Gating [6] is a technique for selectively disconnecting parts of the processor, especially pipeline stages. The energy consumption can thus be reduced by uncoupling unneeded parts of the pipeline without affecting any other component. In contrast, frequency and voltage scaling affect the whole circuit.
Related Work on Real-Time Energy Management
Research on energy management for real-time applications follows several directions: energy management controlled by the application, by the operating system, or by the hardware itself. Application-based power management requires special power control sequences within the application's program code. Shin et al. [7] present a technique for automatic insertion of power controlling code based on a WCET analysis before runtime. The suggested mechanism is feasible for hard real-time systems.
In contrast to application-based techniques, other approaches focus on frequency and voltage reduction controlled by the operating system, especially by its thread scheduler. Pillai et al. [8] present several energy-aware scheduling schemes similar to the EDF scheduling scheme for low-power embedded real-time operating systems. Jejurikar et al. [9] focus on the problem of task synchronization in combination with energy-aware task scheduling. Pouwelse et al. [10, 11] describe a hybrid approach, which is based on an extended Linux OS with a so-called energy priority scheduling. The parameters for the scheduler are given by the application.
A theoretical approach for an energy saving technique using EDF scheduling is presented by Krishna et al. [12, 13]. Their energy management is based on an offline thread schedule, the online schedule, and an offline and an online function that describe the amount of work to do. Aydin et al. [14] additionally use a speculative speed adjustment for periodic real-time tasks.
All presented techniques are based on a single-threaded processor core and a software-based energy management. Energy management investigations concerning multithreaded processors pertain to simultaneous multithreading [15, 16]. Energy management of a multithreaded single-issue processor with integrated Guaranteed Percentage (GP) hardware real-time scheduling was evaluated in our earlier work [17, 18].
All existing processors and research approaches (except our GP energy management) suffer from the inefficiency of software control: Calculating the optimal frequency and the supply voltage by software requires a software overhead. Additionally, most control techniques assume a continuous frequency control which is not realistic. In real processors, frequency is selected by binary clock multipliers and dividers, i.e. only discrete frequency levels are possible. A more efficient solution is a hardware-based energy management, i.e. the processor core decides to run at the optimal frequency and voltage level by itself and is able to readjust frequency and voltage during thread execution.
Another drawback of existing energy management techniques in combination with real-time scheduling is the frequently used assumption that the deadline of each thread is equal to its period. Krishna et al. and Aydin et al. additionally require an offline thread "execution" for determining the amount-of-work function, as well as the offline schedule itself, for the energy management.
Hardware-Based Energy Management Mechanism

Thread Model
For our energy management technique we permit arbitrary activation of threads with the constraints that all threads are independent and that a thread will only be restarted after its completion, i.e. at most one instance of each thread is active at a time. In the case of periodic threads, we do not make the assumption that their deadlines are equal to their periods.
For the realization of our proposed energy management technique, several characteristics of the execution of a thread are necessary. Fig. 1 illustrates the required values, which are measured in execution cycles. The figure is divided into two scheduling areas: the upper area describes the regular thread scheduling, which is similar to Krishna's offline scheduling, with the difference that it is generated online from the WCETs and the deadlines of the already completed and all currently active threads. The lower area mirrors the scheduling depending on the real runtime behavior of the threads, i.e. the runtime scheduling. In addition to these two schedulers, a third scheduler, called the execution scheduler and not shown in the figure, is present. It is responsible for the selection of the thread executed within the multithreaded processor pipeline in the current clock cycle. Because of the latency bridging, the scheduling decision temporarily alternates between different threads.
The deadline and the WCET are given by the application and stored as constants within the energy management unit. The surplus is the number of cycles remaining from the thread's completion to its regular (planned) completion, assuming that all previous threads have exhausted their WCETs too. The runtime($t_0$) represents the number of execution cycles the current thread has executed up to time $t_0$. In general, due to the multithreaded execution and the surplus of the previous thread, an early thread execution takes place and thus the runtime($t_{start}$) is greater than zero at the regular start of execution. At thread completion, the runtime($t_{completion}$) is equal to the real execution time (RET). The remaining runtime($t_0$) is the number of cycles the thread will run from time $t_0$ (assuming its WCET), i.e., the difference between the WCET and the runtime($t_0$).
The surplus is the sum of the surplus by early thread completion and the surplus by early thread execution (surplus of the previous thread).
Methodology
The idea behind the hardware-based energy management mechanism is that the active threads rarely need the time calculated as WCET for their actual execution, as reported in [19]. Thus the frequency can be reduced such that all threads terminate as late as possible but not later than the time predicted by the schedulability analysis (depending on the WCETs). As a consequence, the supply voltage can be adapted to a level corresponding to the throttled frequency, which may lead to a tremendous energy saving. Because of the direct relationship between the selected clock frequency and the required supply voltage, determining the optimal clock frequency is the real challenge.
Using a software-based solution, frequency and voltage selection is only possible at the time of a thread suspend or activation (intertask DVS) or at dedicated points during thread execution (intratask DVS). In contrast to a software-based version, our hardware-based energy management is able to observe the progression (in execution cycles) of all active threads continuously. Thus, clock frequency and supply voltage can be adapted dynamically during the thread's execution to approximate the optimal execution speed.
At the time of a thread suspend, the presented energy saving mechanism registers the number of execution cycles remaining until the regular thread suspend, i.e. the surplus. Due to the surplus of the just suspended thread, the execution of the directly following thread can be slowed down. The optimal frequency $f_{reduced}$ can be calculated by the formula
$$f_{reduced} = f_{max} \cdot \frac{WCET}{WCET + surplus}$$
where $f_{max}$ is the maximum frequency of the processor, $WCET$ is the WCET of the new thread, and $surplus$ is the surplus of the just suspended thread. If the processor is working at the calculated optimal frequency $f_{reduced}$ and the new thread requires its complete WCET, its execution completes exactly at the time planned by the schedulability analysis. If the new thread does not need its WCET for execution, it offers a surplus to the following thread. Usually only fixed frequency levels are provided by the processor. So the optimal frequency cannot be selected and a frequency higher than the optimal one has to be chosen. As a result, the energy actually required is higher than the theoretically necessary energy.
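The calculation can be illustrated with a short Python sketch. It assumes that the optimal frequency is $f_{max} \cdot WCET / (WCET + surplus)$, which follows from requiring that WCET execution cycles fit into the time of WCET + surplus full-speed cycles. The function names, the 600 MHz core, and the list of frequency levels are hypothetical and serve only as an example.

```python
def optimal_frequency(f_max, wcet, surplus):
    """Lowest frequency at which `wcet` cycles still fit into the time
    of wcet + surplus cycles at full speed."""
    return f_max * wcet / (wcet + surplus)


def select_level(f_opt, levels):
    """Slowest available level that is not below the optimal frequency."""
    return min(f for f in levels if f >= f_opt)


# Hypothetical example: a 600 MHz core runs a thread with a WCET of
# 3000 cycles that inherits a surplus of 1000 cycles.
f_opt = optimal_frequency(600.0, wcet=3000, surplus=1000)
f_sel = select_level(f_opt, levels=[600.0, 400.0, 300.0, 200.0])
print(f_opt, f_sel)
```

With these assumed values the optimal frequency lies between two levels, so the faster one has to be selected, which is exactly the inefficiency caused by discrete frequency levels.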
Implementation
To realize the EDF energy management, the following set of five hardware registers is required for each hardware thread slot:
$WCET_{reload}$: This register is addressable by the software. It contains the reload value of the WCET.
$WCET_{surplus}$: The $WCET_{surplus}$ register is an internal register within the energy management unit. At every thread activation it is automatically reloaded with the value stored in the $WCET_{reload}$ register. During runtime it is decreased according to the algorithm described below.
$WCET_{remain}$: This register is very similar to the $WCET_{surplus}$ register. The difference between these two registers is the way they are decreased, which is also described below.
$DL_{reload}$: The $DL_{reload}$ register holds the deadline of the corresponding thread. It is software addressable.
$DL_{count}$: At the time of a thread activation, this register is initialized with the value of the $DL_{reload}$ register. It is decremented in every clock cycle and is responsible for the thread scheduling.
Both deadline registers are required for the thread scheduling and are already available within the priority manager.
Depending on the thread scheduling, selected registers are updated by hardware in every execution cycle. Both reload registers have to be set by the application with the help of special instructions.
Register Update: The $WCET_{surplus}$ and the $WCET_{remain}$ registers have to be updated according to the actual thread execution. That is, the $WCET_{remain}$ register of a thread is decremented iff an instruction of this thread is executed in the current execution cycle, i.e. it reflects the execution cycles remaining until the thread reaches its maximum number of execution cycles. The $WCET_{surplus}$ register, in contrast, is decremented iff the corresponding thread is currently the regular thread, i.e. the thread that would be running if all previously executed threads had used their full WCET. At the time of a thread suspend, the $WCET_{surplus}$ register mirrors the surplus that is available for the execution of other threads.
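The two update rules can be sketched in Python as follows. This is a simplified software model, not the hardware implementation: the class and function names are hypothetical, both WCET registers are assumed to be reloaded at thread activation, and the processor is assumed to run at full speed so that one execution cycle equals one base clock cycle.

```python
from dataclasses import dataclass


@dataclass
class ThreadSlot:
    wcet_reload: int        # set by software
    wcet_surplus: int = 0   # counts down while the thread is the regular thread
    wcet_remain: int = 0    # counts down while the thread really executes

    def activate(self):
        # Assumption: both internal registers are reloaded on activation.
        self.wcet_surplus = self.wcet_reload
        self.wcet_remain = self.wcet_reload


def update_registers(slots, regular, executed):
    """Advance one cycle. `regular` is the slot chosen by the regular
    scheduler, `executed` the slot whose instruction was executed;
    either may be None."""
    if regular is not None and slots[regular].wcet_surplus > 0:
        slots[regular].wcet_surplus -= 1
    if executed is not None and slots[executed].wcet_remain > 0:
        slots[executed].wcet_remain -= 1


slots = [ThreadSlot(10), ThreadSlot(8)]
for s in slots:
    s.activate()
# Slot 0 is the regular thread; slot 1 fills one latency cycle of slot 0.
for executed in [0, 1, 0, 0]:
    update_registers(slots, regular=0, executed=executed)
print(slots)
```

In this example the regular thread has used four cycles of its planned slot but executed only three instructions, while the second thread has already executed one cycle ahead of its regular start.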
The scheduling decision of the regular scheduler depends only on the deadlines and the WCETs of all active threads. The execution scheduling additionally evaluates the fill level of the instruction windows, possible latencies, and the real completion of the threads. Fig. 2 demonstrates the correlation of the scheduling parameters, the scheduling decisions, and the decrease of the WCET registers. The scheduling parameters deadline, latencies, IW (instruction window) fill level, and the active flags are required for the execution scheduling. The WCET register sets are only required for the energy management. The $WCET_{surplus}$ and the $WCET_{remain}$ registers are updated depending on the regular and the execution scheduling, respectively. A set of all registers is available for each hardware thread slot.
Frequency and Voltage Control: For frequency and voltage control, an additional third scheduler, the runtime scheduler, is required, which is not shown in figure 2. Its task is to determine the thread with the highest priority in execution. In contrast to the execution scheduler, the runtime scheduler ignores the fill levels of the instruction windows and occurring latencies. Thus, the runtime scheduler designates the currently active thread with the highest priority regardless of whether it can actually issue an instruction.
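The difference between the runtime scheduler and the execution scheduler can be sketched as follows. The data structure and the single `ready` flag, which stands in for the instruction window fill level and pending latencies, are simplifying assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    deadline: int   # cycles left until the deadline (the DL count value)
    active: bool    # thread is activated and has not completed yet
    ready: bool     # assumed flag: instruction available, no pending latency


def runtime_scheduler(slots):
    """Active slot with the earliest deadline, or None if none is active."""
    candidates = [i for i, s in enumerate(slots) if s.active]
    return min(candidates, key=lambda i: slots[i].deadline, default=None)


def execution_scheduler(slots):
    """Like the runtime scheduler, but only slots that can issue count."""
    candidates = [i for i, s in enumerate(slots) if s.active and s.ready]
    return min(candidates, key=lambda i: slots[i].deadline, default=None)


slots = [Slot(100, True, False),   # highest priority, but stalled
         Slot(200, True, True),    # lower priority, able to issue
         Slot(50, False, True)]    # not active, ignored by both
print(runtime_scheduler(slots), execution_scheduler(slots))
```

Here the runtime scheduler still designates slot 0, whereas the execution scheduler issues an instruction of slot 1 during the latency of slot 0, which corresponds to latency bridging.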
For the selection of the execution frequency, the energy management unit has to distinguish between three cases:
1. The decision of the runtime scheduler is invalid. In this case, no active thread is available. Frequency and supply voltage can be reduced to the minimum level.
2. The decisions of the runtime scheduler and the regular scheduler are identical. The maximum number of cycles the thread will still execute is held in the register $WCET_{remain}$, and the number of cycles available until the regular completion of the thread is stored in the register $WCET_{surplus}$. The execution frequency can be reduced or has to be increased to
$$f_{reduced} = f_{max} \cdot \frac{WCET_{remain}}{WCET_{surplus}}$$
3. In the last case, the regular thread is not the same as the thread determined by the runtime scheduler. This means that a previous thread completed before its WCET and its surplus is available for the execution of the thread selected by the runtime scheduler. The execution frequency has to be set to
$$f_{reduced} = f_{max} \cdot \frac{WCET_{remain}(runtime)}{WCET_{surplus}(regular) + WCET_{surplus}(runtime)}$$
We assume that the clock generator works with a clock divider without any settling time. In our simulations we used the following divisors: 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 10, and 14. How we determine the optimal execution frequency is now shown using case 3 as an example. The following condition must be fulfilled by the minimal possible frequency:
$$f_{reduced} \geq f_{max} \cdot \frac{WCET_{remain}(runtime)}{WCET_{surplus}(regular) + WCET_{surplus}(runtime)}$$
$f_{reduced}$ is derived from $f_{max}$ by a clock divider. $Div_{num}$ is the numerator and $Div_{denom}$ the denominator of the clock divider:
$$f_{reduced} = f_{max} \cdot \frac{Div_{denom}}{Div_{num}}$$
Combining both formulas leads to the following inequality:
$$Div_{denom} \cdot \left(WCET_{surplus}(regular) + WCET_{surplus}(runtime)\right) \geq Div_{num} \cdot WCET_{remain}(runtime)$$
Using the mentioned clock divider, all multiplications can be mapped to shift operations and at most one addition. In parallel to the frequency selection, the supply voltage is chosen using a lookup table with the calculated frequency divider as index. In contrast to the voltage, which is set immediately, the frequency is set after a delay iff an increased voltage is required (see 4.5). In the meantime, the processor continues working at the lower frequency.
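The selection of a divider with integer arithmetic only might look like the following Python sketch. It rests on two assumptions that should be checked against the original equations: a divider is admissible if the denominator times the available full-speed cycles is at least the numerator times the remaining cycles, and in case 3 the available cycles are the sum of the surplus registers of the regular and the runtime thread. All names are hypothetical.

```python
# (numerator, denominator) pairs for the divisors 14, 10, 5, 4.5, ..., 1.5, 1,
# ordered from the slowest to the fastest clock.
DIVIDERS = [(28, 2), (20, 2), (10, 2), (9, 2), (8, 2), (7, 2),
            (6, 2), (5, 2), (4, 2), (3, 2), (2, 2)]


def select_divider(remain, available):
    """Slowest divider that lets `remain` execution cycles finish within
    `available` full-speed cycles; only integer products are compared."""
    for num, denom in DIVIDERS:
        if denom * available >= num * remain:
            return num, denom
    return DIVIDERS[-1]   # nothing fits: run at full speed


def choose(runtime, regular, remain, surplus):
    """`remain` and `surplus` map a thread id to its register value."""
    if runtime is None:                       # case 1: no active thread
        return DIVIDERS[0]
    if runtime == regular:                    # case 2: same thread
        return select_divider(remain[runtime], surplus[runtime])
    # Case 3 (assumed): the rest of the regular slot is available as well.
    return select_divider(remain[runtime], surplus[regular] + surplus[runtime])


print(choose(None, None, {}, {}))
print(choose(1, 1, {1: 1000}, {1: 1000}))
print(choose(1, 0, {1: 1000}, {0: 600, 1: 1000}))
```

In hardware, the products with the small numerators would be formed by shifts and at most one addition or subtraction instead of a general multiplier.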
Readjusting Frequency
In most cases, the optimal frequency cannot be selected exactly. Therefore, the energy management technique has to choose a frequency higher than the optimal one, because otherwise the currently executing thread could terminate after its regular termination. While the thread is executed at a frequency higher than the optimal one, its progress is also faster than required.
As soon as the thread's progress reaches a level at which a frequency below the originally optimal one is sufficient, the energy management slows down the processor to this frequency. Additionally, the supply voltage can be decreased. The dynamic readjustment of frequency and voltage at any time during thread execution can only be afforded by a hardware-based solution that monitors the thread's progress continuously.
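The effect of readjusting can be illustrated with a small simulation. The sketch below is an idealized model under stated assumptions: the thread needs its full WCET, divider changes take effect instantly, time is counted in half base clock cycles so that the divisors 1, 1.5, 2, ... stay integral, and a divider is admissible if the remaining cycles still fit into the remaining time. The function name is hypothetical.

```python
# Numerators n of the divisors n / 2, ordered from slowest to fastest.
NUMERATORS = [28, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2]


def run_thread(wcet, available):
    """Execute `wcet` cycles within `available` base clock cycles and
    re-select the divider before every execution cycle.

    Returns the switch points as (cycles_remaining, divisor) pairs and
    the number of base cycles left over at completion."""
    remain, avail_half = wcet, 2 * available   # time in half base cycles
    switches, current = [], None
    while remain > 0:
        num = next((n for n in NUMERATORS if avail_half >= n * remain), 2)
        if num != current:
            switches.append((remain, num / 2))
            current = num
        remain -= 1
        avail_half -= num   # one execution cycle lasts num / 2 base cycles
    return switches, avail_half / 2


switches, left_over = run_thread(wcet=1000, available=1300)
print(switches, left_over)
```

For this hypothetical thread the optimal divisor of 1.3 is not available, so execution starts at full speed; after 400 cycles enough progress has been made to continue with the divisor 1.5, and the thread completes exactly at its regular completion time.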
Impact on WCET
Using the policy described in section 4.4, an increase of the execution frequency may be necessary. In this case, the supply voltage has to be adapted first (because of the capacitance of the circuit) before the execution frequency can be increased. We call this delay the frequency increase delay; it is the only impact of the energy management on the timing behavior of the system. The WCET of each thread has to be increased by the frequency increase delay. The necessity of this delay can be demonstrated by the following situation: The processor is running at a low frequency and a low supply voltage. Now, a new thread with the highest priority is activated. Because of the unknown runtime behavior of the new thread, the processor has to run at the highest frequency and voltage. Thus, the voltage has to be increased first, and only after the voltage has reached the required level, i.e. after the frequency increase delay, can the frequency be increased too.
Another case in which the frequency increase delay is important is the simultaneous change of the regular thread and the thread with the highest priority. Due to the unknown runtime behavior of the second thread, the processor has to run at the highest frequency. To allow running at high frequency immediately, the supply voltage has to be set to the highest level before the first thread completes regularly, i.e. when the $WCET_{surplus}$ register of the first thread is less than the frequency increase delay.
Drawback During Switching
Within all software energy management techniques known to us, voltage and frequency switching is done in one step. Hence, this step takes at least as long as the voltage needs to reach the required level (assuming a voltage increase), and no useful work can be done in the meantime. Our hardware-based energy management controls frequency and voltage in two steps without halting the processor. The processor keeps running at the lower frequency until the voltage reaches the upper level. The time the processor runs at the lower level is taken into account in the selection of the target frequency and voltage. Therefore, a high number of voltage and frequency changes is an advantage rather than a disadvantage.
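The two-step switching can be modeled with a small state machine. This is a sketch under assumptions: the voltage table, the delay of three cycles, and all names are hypothetical, and the delay is counted in cycles rather than derived from a real voltage regulator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clocking:
    divisor: float                    # clock divisor currently in use
    voltage: float                    # supply voltage currently requested
    pending: Optional[float] = None   # divisor waiting for the voltage
    wait: int = 0                     # cycles left of the increase delay


def request(c, divisor, voltage_for, delay):
    """Request a new divisor. The voltage is set at once; the frequency
    follows immediately when slowing down and after `delay` cycles when
    speeding up."""
    if divisor == c.pending:
        return                        # this increase is already under way
    c.voltage = voltage_for[divisor]
    if divisor >= c.divisor:          # same speed or slower
        c.divisor, c.pending, c.wait = divisor, None, 0
    else:                             # faster: let the voltage settle first
        c.pending, c.wait = divisor, delay


def tick(c):
    """Advance one cycle; the core keeps executing at c.divisor."""
    if c.pending is not None:
        c.wait -= 1
        if c.wait <= 0:
            c.divisor, c.pending = c.pending, None


table = {1.0: 1.3, 2.0: 1.1, 4.0: 0.9}   # hypothetical voltage lookup table
c = Clocking(divisor=4.0, voltage=0.9)
request(c, 1.0, table, delay=3)
history = []
for _ in range(3):
    tick(c)
    history.append(c.divisor)
print(history)
```

During the first two cycles after the request the core continues to execute at the old divisor, so the delay is spent doing useful work instead of halting.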
Evaluation
Processor Models
As a proof of concept, we built the described energy management technique into the VHDL model of the multithreaded, single-issue Komodo processor core with integrated EDF scheduling [2, 20]. Besides the energy management itself, we integrated a clock divider with 11 different output frequencies. To avoid the assumption of $f \sim U$, we used the more realistic voltage levels derived from the Crusoe and the XScale processors, respectively, and the appropriate clock dividers shown in table 1.
All benchmarks are performed by simulating the VHDL model. The frequency divider supports the clock dividers shown in table 1. Each benchmark is simulated twice: first using the voltage levels similar to the Crusoe technology (Crusoe-style), second using the voltage levels corresponding to the XScale technology (XScale-style). In addition to these two technologies, we used three different energy management techniques per benchmark: pipeline gating, a software-based energy management similar to the one of Pillai et al., and the hardware-based EDF energy management.
Energy consumption is estimated by tracking the core frequency in combination with the selected voltage level and applying the formula of section 1. Because energy is proportional to the clock frequency, we just calculate the relative energy consumption.
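The estimation can be sketched as follows, assuming that the simulation yields a trace of intervals with a constant divisor and voltage and that activity and capacitance are normalized to 1. The trace format, the function name, and the example values are hypothetical.

```python
def relative_energy(trace, v_max):
    """Energy of a run relative to full speed at v_max over the same time.

    `trace` is a list of (base_cycles, divisor, voltage) intervals; the
    power in each interval is taken as proportional to V^2 * f with
    f = f_max / divisor."""
    total_cycles = sum(cycles for cycles, _, _ in trace)
    consumed = sum(cycles * (v / v_max) ** 2 / divisor
                   for cycles, divisor, v in trace)
    return consumed / total_cycles


# Hypothetical trace: half of the time at full speed, the other half at
# half the frequency and 80 % of the maximum voltage.
trace = [(500, 1.0, 1.0), (500, 2.0, 0.8)]
energy = relative_energy(trace, v_max=1.0)
print(round(energy, 3))
```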
Benchmarks
We performed two synthetic benchmarks and a realistic benchmark for evaluating the behavior of the hardware-based EDF energy management.
Synthetic Benchmarks: Each synthetic benchmark consists of four threads with a growing processor utilization. The WCETs and the periods of all threads are chosen in the way that the theoretical processor load of a whole benchmark is 100%. During the execution of both benchmarks, the real processor utilization is growing from nearly 0% at the beginning to finally 100% of computing power.
Within the first benchmark (EQUAL) all four threads were activated simultaneously with identical periods. Figure 3 illustrates the activation and the growing real computing time of the threads. In contrast, the threads within the second benchmark (DIFF) were activated at different times using different periods (see table 2).
Fig. 3. Thread activation and execution during the EQUAL benchmark.
The relative energy consumptions of the three processor models using the energy management techniques pipeline gating (PG), Pillai software energy management (Pillai), and hardware-based EDF energy management (EDF) are compared as a function of the real processor utilization, using the processor models similar to the Crusoe and the XScale technology, respectively. Figure 4 and the figures following it show the energy consumption of the EQUAL and the DIFF benchmark. The figures do not show the total energy consumption of the whole benchmark but rather snapshots of the energy consumption at the respective utilization level. Three curves are shown in all four figures. The one starting slightly above 40% and reaching 73% represents the energy consumption of a processor core supporting only pipeline gating. Because of the assumed energy consumption of 40% in gated mode, the minimum energy consumption is likewise 40%. Due to latency bridging, the maximum energy consumption is less than 100%, i.e. in the case of 100% processor utilization, there are still unused clock cycles left for pipeline gating. This phenomenon can be observed in all measurements.
The second curve, mostly in the middle describes the energy consumption of the benchmarks using a software-based energy management similar to the Pillai technique. The relative energy consumption using the EQUAL benchmark behaves as expected. In the case of the DIFF benchmark the energy consumption using the Pillai energy management exceeds the energy consumption of pipeline gating. This behavior can be explained by the software overhead of the energy management and the readjustment of frequency and voltage only at each thread activation and suspend. Because of the disadvantageous distribution of the threads in the DIFF benchmark, this phenomenon appears only here.
The lowest curve in each figure shows the relative energy consumption resulting from the hardware-based EDF energy management. The EQUAL benchmark is a very uniform benchmark, which leads to the approximately proportional energy consumption in figure 4. In contrast to EQUAL, DIFF is a very inhomogeneous benchmark, which leads to a more or less advantageous arrangement of active threads. The low point at about 90% processor utilization using the Crusoe-style model is a result of an advantageous thread arrangement. The flattening of the energy curve at growing processor utilization can be explained by the increasing overlapped thread execution, i.e. by the growing number of usable latency cycles.
Realistic Benchmark: For the realistic benchmark, the Komodo microcontroller prototype was built into an autonomous guided vehicle (AGV). Four hard real-time threads control the movements of the vehicle and are used for evaluation. The microcontroller's inputs are the data sent by a line camera; its outputs are pulse width modulated (PWM) signals for two driving engines. The task of the vehicle is to track a steering line on the floor. The four threads perform the following tasks:
Receiving Camera Data:
This thread is responsible for receiving the digital pixel values sent by the line camera. The data is stored in a Java array. The camera thread is activated each time a pixel is received and deactivates itself after writing the received data into the array. After receiving a whole picture, the array is transmitted to the second thread.
Recognizing the Line:
The task of this thread is to recognize the line that guides the vehicle based on the data within the array. This thread is only active during the line detection, otherwise it is deactivated.
Calculating Steering Data:
Together with the data of previous line pictures and the information about the actual positioning of the line, this thread calculates the new driving direction and speed. These two values are forwarded to the next thread.
Methodology: Because real current measurements cannot be made using an FPGA prototype and an ASIC is much too expensive, the measurement methodology combines real input data from the AGV prototype with a VHDL simulation of the Komodo microcontroller including the different energy management techniques (pipeline gating, software-based energy management, and hardware-based energy management).
First, the vehicle's control program was executed on the FPGA prototype inside the vehicle. During the first 3.2 million clock cycles a logic analyzer records the signals sent from the line camera. The second step is to use the logged data as input to the simulation running the same vehicle program yielding the frequency and voltage changes and the number of cycles with gated pipeline. Figures 8-9 present the results of our simulations. The x-axes mirror the time in base clock cycles and the y-axes show the energy consumption relative to a processor without any energy management. The peaks above 1 in figure 9 stem from the assumed overhead of 8% of energy consumption of the base processor because of the added energy management hardware. Figure 10 summarizes the simulation results by showing the fractions of energy consumption during the simulated time interval. Each column represents the required energy in the specified technology in comparison to a Komodo microcontroller running at full speed all the time. These values are calculated using the formula in section 1, where C (the capacity of the whole circuit) and f are normalized to 1.
Results:
The leftmost bars show the energy consumption using pipeline gating and the highest voltage of the corresponding technology. The reason for the large energy saving of about 51.5% is the low overall processor utilization with an average of 22.6% over the whole time interval. Because we assumed that the energy needed in gated mode is still 40% of the energy in running mode, the required fraction of energy (48.5%) is higher than the overall utilization.
The bars in the middle of figure 10 mirror the results using a software-based energy management similar to the one presented by Pillai et al. within a single-threaded processor core. It reaches energy savings of up to 82%. The remaining bars show the results using the hardware-based frequency/voltage adjustment and pipeline gating. This combination reaches the best results with the least energy consumption due to the fast frequency and voltage switches, the usage of latencies, and the fact that the processor is not idle during voltage/frequency switching. Because of more available voltage levels and a lower voltage at the slow clock rates, the Crusoe-derived version outperforms the XScale version.
Conclusions
This paper presents a new management technique for reducing energy consumption within multithreaded real-time systems. Frequency adjustment and dynamic voltage scaling are managed exclusively by hardware. The management technique is based on the Earliest Deadline First (EDF) scheduling scheme implemented in the multithreaded Komodo microcontroller which is used for benchmarking.
One advantage of hardware-controlled energy management over software-based solutions is the ability to use extremely short periods of underutilization for reducing energy consumption, where software-based solutions are not able to react fast enough. The second advantage is the ability to slow down real-time thread execution at any time during thread execution. Thus, our technique is able to compensate for the disadvantage of discrete frequency levels. As a third advantage, our hardware-based energy management is suitable for both periodic and sporadic real-time threads. Especially in ubiquitous systems, energy management for aperiodic real-time threads is important.
Our evaluations show that energy consumption could be reduced to 2/3 of that of a comparable software-based solution. The consumed energy never exceeds the amount consumed by the software-based algorithm. Additionally, the software-based algorithm supports only periodic threads.
