INTRODUCTION
Research on low power systems has received a great amount of attention in recent years since the sustainability of current technologies and practices has become a serious issue.
A few example systems where lowering power usage is critical are:
 Wireless sensors: several sensors extract data from the environment concurrently, transmit the data to a processing unit, and receive processed data accompanied by appropriate commands from that unit [1] [2] [3] [4]. The sensors and their receivers/transmitters are generally powered by batteries and/or solar cells.
 Satellite circuits: satellites typically contain a massive number of complex circuits that must operate at low power. These circuits are supplied by solar cells, the only available power supply in satellites.
 Robots and surveillance devices: these devices are heavily used in the military, in mining, and in environments that are difficult or unsafe for humans.
 Cell phones and laptops: these devices are powered by batteries that are expected to last a long time.
Meanwhile, steep increases in energy prices and the environmental impact of the carbon dioxide emissions associated with energy generation and transportation have extended the need to reduce energy consumption to a broader range of systems, including High Performance Computing Systems (HPCS).
Various issues, such as resource management at both the software and hardware levels, must be addressed to reduce energy consumption in HPCS. An important issue in hardware resource management is how to reduce the power usage of processors. In the recent past, many hardware-based approaches have been proposed to reduce energy consumption efficiently, particularly for processors. Dynamic voltage-frequency scaling (DVFS) is perhaps the most appealing method and has been incorporated into many recent processors. Energy saving with this method is based on the fact that the power consumption of CMOS circuits is proportional to the frequency and to the square of the supply voltage. The execution time and power consumption can therefore be controlled by switching between the processor's frequencies and voltages. Although this approach was initially designed for single-processor task scheduling [5], it has recently received much attention in multiprocessor systems as well [6, 7].
The DVFS technique and task scheduling can be combined in two ways: (1) schedule generation, and (2) slack reclamation. In schedule generation, task graphs are (re)scheduled on DVFS-enabled processors using a global cost function that includes both energy saving and makespan, so that energy and time constraints are met at the same time [8, 9]. In slack reclamation, which works as a post-processing procedure on the output of scheduling algorithms, the DVFS technique is used to minimize the energy consumption of tasks in a schedule generated by a separate scheduler. The existing methods based on the DVFS technique, however, have two major shortcomings: (1) most of them focus on schedule generation and do not adequately consider slack reclamation approaches to save more energy, and (2) the existing slack reclamation methods use only one frequency for each task, chosen from the discrete set of the processor's frequencies. Using one frequency usually results in uncovered slack time, during which the processor and other devices only waste energy.
In this chapter we focus on slack reclamation and propose a new slack reclamation technique, Multiple Frequency Selection DVFS (MFS-DVFS). The key idea is to execute each task with a linear combination of more than one frequency, such that this combination uses the lowest energy by covering the whole slack time of the task. We have tested our algorithm with both random and real-world application task graphs and compared it with the results of previous work in [7] and [10]. The experimental results show that our approach can achieve energy savings almost identical to the optimum.
Energy Efficiency in HPCSs
Many electronic systems in daily life, such as satellite systems, cell phones and game devices, use rechargeable batteries as their power supplies. Although battery capacity has grown significantly in recent years (battery capacity increases 5% per year), battery life is still the major drawback of most electronic systems. In addition to power-aware battery-based systems, the issue of energy consumption has recently attracted a great amount of attention in high performance computing systems (HPCS). The energy consumption issue in such systems can be classified into three groups: (1) system-level resource allocation, (2) service-level energy-load distribution, and (3) task scheduling level (Figure 1).
At the system level, the problem is how to distribute computational resources (e.g. CPU, network, memory and I/O) between large-scale data storage and processing centers (such as supercomputers and data centers). Distributing resources fairly among applications (or services) requires not only adapting each resource individually but also understanding how the individual resources interact when they work as a system. Therefore, the big challenge here is to find both the relationships among system resources and their trade-offs, which may lead to an optimal balance between performance, QoS and energy consumption [11]. Among the different system-level technologies for managing resources between workloads, virtualization has become a key technology in data centers.
Virtualization allows computational resources to be shared between different workloads. Many of the workloads arriving at data centers are medium-sized workloads that often require only a small fraction of the computational resources. Servers typically draw around 70% of their maximum power even at low utilization. With virtualization, such workloads can be run within virtual machines (VMs), yielding significant savings in overall energy usage. The associated VMs may require fewer resources and can therefore be run on a single hardware unit. When less hardware is used overall, less energy is spent both on running and on cooling the servers.
At the service level, the concern is energy reduction through load balancing, scheduling and mapping of workloads. The main challenge is to use appropriate algorithms both to multiplex/demultiplex workloads in order to save energy and to trade off performance against the reduction in service cost obtained from energy savings. Also, to avoid hotspots in data centers due to highly loaded nodes, services can be moved from nodes with high load and high temperature to nodes with lower load and lower temperature. Generally, this movement of services should happen when the destination nodes can operate the services in an energy-efficient way [11].
In site-level data/task scheduling, the focus of this chapter, the operating system (OS) and hardware configuration such as dynamic power management, micro-architecture techniques and dynamic voltage scaling are used to decrease power. Here, the typical question could be: 
Exploitation of dynamic voltage-frequency scaling
Dynamic voltage-frequency scaling is a modern technique in computer architecture for reducing the energy consumption of microprocessors or controlling the amount of heat generated by the circuit. This technique is commonly used in battery-based devices such as laptops and cell phones, where decreasing the energy drawn from the battery is necessary. In addition, DVFS is used in high-performance computing nodes to decrease not only the power drawn by the nodes but also the energy needed to cool the facilities that house them. An approximate model shows that the dynamic power in CMOS circuits is a linear function of both the switching frequency and the square of the voltage: $C \cdot V^2 \cdot f$, where $C$ is the effective switching capacitance per clock cycle. Therefore, a workload (or task) can save more energy when it is executed at a lower voltage and frequency. In general, a computing node executes several tasks with inter-task relationships (e.g., precedence constraints) simultaneously. These inter-task relationships typically incur slack time (idle time) between tasks, which can be used by DVFS to reduce energy usage. Specifically, the slack time associated with a task is used to execute the task at a lower voltage-frequency pair; this in turn results in energy reduction.
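The effect of the approximate power model above can be illustrated with a minimal sketch. The capacitance, voltages, frequencies and cycle count below are hypothetical values chosen only for illustration; they are not taken from any processor discussed in this chapter.

```python
def dynamic_power(c_eff, voltage, frequency):
    """Approximate dynamic CMOS power, C * V^2 * f, in watts."""
    return c_eff * voltage ** 2 * frequency


def task_energy(c_eff, voltage, frequency, cycles):
    """Energy (joules) to execute `cycles` clock cycles at one voltage/frequency pair."""
    exec_time = cycles / frequency  # seconds spent executing the task
    return dynamic_power(c_eff, voltage, frequency) * exec_time


# Hypothetical operating points (assumptions, not measured data).
C_EFF = 1e-9          # effective switching capacitance in farads
CYCLES = 5_000_000    # clock cycles needed by the task

fast = task_energy(C_EFF, 1.2, 1.0e9, CYCLES)  # high voltage, high frequency
slow = task_energy(C_EFF, 0.9, 0.5e9, CYCLES)  # low voltage, low frequency

print(f"energy at 1.0 GHz / 1.2 V: {fast:.6e} J")
print(f"energy at 0.5 GHz / 0.9 V: {slow:.6e} J")
print(f"ratio slow/fast: {slow / fast:.4f}")
```

Under this model the frequency cancels out of the task energy, so the saving comes from the lower voltage that a lower frequency permits; the price is a longer execution time, which is why slack time is needed.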
There are two ways to combine scheduling and Dynamic Voltage-Frequency Scaling: (1) independent slack reclamation, and (2) integrated schedule generation. The existing methods in the literature based on these combinations have two major limitations: (1) most of them focus on integrating DVFS and scheduling (integrated schedule generation) and do not sufficiently consider slack reclamation approaches to save more energy, and (2) the existing slack reclamation methods use only one frequency for each task, chosen from the discrete set of the processor's frequencies. Using one frequency usually results in uncovered slack time, during which the processor and other devices only waste energy.
Independent slack reclamation
Independent slack reclamation works on the output of other scheduling algorithms as a post-processing procedure, applying the DVFS technique to minimize the energy consumption of the tasks in a schedule generated by a scheduler. In [7], Kimura et al. proposed an energy reduction algorithm for power-scalable clusters supporting DVFS. In a simplified version of this algorithm, the appropriate frequency for each task is chosen from the set of the processor's frequencies according to the task's slack time. Another algorithm was proposed in [10] to reclaim the slack time of each task in a DAG by a linear combination of the processor's highest and lowest frequencies. To the best of our knowledge, among the existing energy-aware algorithms in HPCS, these two methods are the most similar to our MFS-DVFS algorithm presented in this chapter. We refer to the simplified versions of these two algorithms as Reference DVFS (RDVFS) and Maximum-Minimum-Frequency DVFS (MMF-DVFS) in the rest of this chapter and use them as benchmarks to evaluate the performance of our proposed algorithm.
Integrated schedule generation
In integrated schedule generation, task graphs are (re)scheduled on DVFS-enabled processors using a global cost function that includes both energy saving and makespan, so that energy and time constraints are met at the same time [8, 9]. Therefore, the final schedule is a trade-off between makespan and energy. Kappiah et al. in [12] presented a just-in-time DVFS technique to fill slack time in MPI programs. They used a system called Jitter to reduce the frequency on nodes with more slack time and less computation. Jitter aimed to ensure that tasks arrived just in time without increasing the overall execution time. The DVS technique was applied in [8] to processors that did not work at peak performance during the execution of a parallel application. The best processor frequency for each task was selected by analyzing computation and communication power profiles collected prior to execution. A method to reduce power consumption was presented in [13] that adaptively activates and deactivates hardware resources, in particular memory, for intensive HPC applications. Cache misses when accessing main memory also play an important role in adjusting and triggering processor slack times. Lee and Zomaya in [9] presented a DVFS-based algorithm to minimize both the completion time and the energy consumption of precedence-constrained parallel jobs on HPC systems. This method tried to minimize the sum of two cost functions: completion time and energy. Consequently, the final result was a trade-off between the quality of the schedule and the energy consumption. The concept of energy scalability was introduced in formal terms by Ding et al. in [14]. In addition to studying the energy efficiency/iso-efficiency concept, they extended an analytical model to investigate the trade-off between performance and energy saving in HPCS.
Molnos et al. in [15] classified the slack times in real-time applications into static, work and shared slack groups for multiple dependent tasks on multiple DVFS-enabled processors. They proposed a dynamic dependency-aware task scheduling method that adjusts the voltage/frequency of each processor according to the tasks' real-time deadlines. A profile-based power-performance optimization method was presented in [16] that also uses DVFS in HPCS. Here, the execution of a program was divided into several regions. In trial steps, the profile information of each region, including power and execution profiles, was extracted and then used to find the best combination of processor voltages and frequencies for that region. In [17], an upper limit for system energy usage was selected externally. Subsequently, a combination of performance modeling and performance prediction was applied to reduce execution times subject to the predefined upper limit on energy usage. After models were created for both execution time and energy consumption, the key parameters of the models were estimated by executing a program a small number of times and then regressing on the results. For better parameter estimation, the following steps were iterated until a proper schedule was achieved: (1) using the models to predict each possible scheduling of tasks, (2) executing the program a few times with the best predicted schedule, and (3) updating the estimated key parameters. Rountree et al. in [18] proposed an energy-aware schedule generation algorithm for DVFS-enabled processors in which a combination of all processor frequencies is incorporated into an overall linear programming optimization.
Preliminaries
In this section, we describe the system, application and energy models used in our study.
System and application models
In this work, we assume an HPC system comprising N homogeneous processors with individual memories. The switching time from one frequency to another is typically in microseconds (between 30 μs and 150 μs; refer to [19]), while the execution time of tasks is in milliseconds. Therefore, compared with the tasks' execution time, the switching time can be ignored. We consider a set of M dependent tasks denoted as 
$K^{(k)}$ is the number of clock cycles required for executing this task. This parameter can be calculated as: 
Energy model
A typical DVFS-enabled processor can execute a task in a discrete set of frequencies
For example, AMD Turion MT-34 can operate at six frequencies ranging from 800MHz to 1800MHz [5] . The power consumption of a processor consists of two parts: (1) dynamic part that is mainly related to CMOS circuit switching energy, and (2) static part that addresses the CMOS circuit leakage power [20] .
In CPUs, the power consumption is formulated as [21] :
Here, $C_{eff}$, $f$ and $v$ represent the effective capacitance, the processor's frequency and the voltage, respectively. Because the leakage power is always negligible compared with the dynamic power [20], the overall energy consumption of the $k$th task
in the DAG is calculated as: [21]; Therefore, the energy of the $k$th task 
Energy-aware scheduling using DVFS
In this section, we explain existing DVFS-based approaches that reduce the energy consumption of processors by reclaiming the slack time of each task. Finally, we present our algorithm, MFS-DVFS, which uses a linear combination of frequencies to solve the stated problem.
Optimum Continuous Frequency
The optimal approach to removing slack time, and as a result reducing the energy consumption of a processor, is to execute a task at a continuous frequency (Figure 2-c). Before moving further, it is necessary to prove the following theorems: 
, therefore the theorem 1 is proved.
Theorem 2:
If the processor frequency is continuous (an unrealistic assumption), the optimum energy for the $k$th task is obtained when the task covers its whole slack time.
Proof: The result in Theorem 1 shows that the frequency at which a task covers its whole slack time gives the optimum power consumption. Note that this frequency may not exist unless the frequency set is continuous.
According to Theorem 2, for the $k$th task 
In actual systems, however, frequencies must be chosen from a discrete set. Also, finishing a task by its deadline may require choosing a frequency that is higher than the optimal frequency. Therefore, the optimal discrete frequency of the $k$th task is the first frequency in the discrete set larger than 
Reference Dynamic Voltage-Frequency Scaling (RDVFS)
RDVFS is a simplified version of the algorithm introduced by Kimura et al. in [7] for power-scalable high performance clusters supporting DVFS. It reduces the energy consumption of processors by selecting the smallest available processor frequency ($f_{RDVFS}$) capable of finishing a task in a given time frame (Figure 2-b). The details of the RDVFS algorithm are shown in Figure 3.
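The selection rule described above can be sketched as follows. This is a minimal illustration, not the algorithm of Figure 3: it assumes that the optimal continuous frequency equals the task's clock cycles divided by its time frame (execution time plus slack), and the frequency set and task values are hypothetical.

```python
def optimal_continuous_frequency(cycles, time_frame):
    """Assumed form of the continuous optimum: spread the cycles over the whole time frame."""
    return cycles / time_frame


def rdvfs_frequency(cycles, time_frame, frequencies):
    """Smallest discrete frequency that still finishes the task within the time frame."""
    f_opt = optimal_continuous_frequency(cycles, time_frame)
    candidates = [f for f in sorted(frequencies) if f >= f_opt]
    if not candidates:
        raise ValueError("task cannot meet its time frame at any available frequency")
    return candidates[0]


# Hypothetical discrete frequency set in Hz (an assumption for illustration).
FREQS = [200e6, 400e6, 600e6, 800e6, 1000e6]

cycles = 6e6          # clock cycles required by the task
time_frame = 0.0125   # seconds available, including slack

f_rdvfs = rdvfs_frequency(cycles, time_frame, FREQS)
uncovered = time_frame - cycles / f_rdvfs  # slack left idle by the single frequency

print(f"f_opt-cont = {optimal_continuous_frequency(cycles, time_frame) / 1e6:.0f} MHz")
print(f"f_RDVFS    = {f_rdvfs / 1e6:.0f} MHz")
print(f"uncovered slack = {uncovered * 1e3:.2f} ms")
```

In this example the continuous optimum falls between two available frequencies, so the single selected frequency finishes early and leaves part of the slack uncovered.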
For each task assigned to a processor, $f_{RDVFS}$, which is the first frequency larger than the optimal frequency ($f_{opt-cont}$) calculated from Eqn. 4, is likely to be the best discrete frequency candidate for executing the task within the given time frame and covering its 
Figure 4. MMF-DVFS algorithm
The algorithm finds the appropriate time portions of the maximum and minimum frequencies for executing each scheduled task. It can be seen from Figure 7 that, in the worst case, the MMF-DVFS algorithm performs the same as RDVFS.
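The time portions mentioned above can be sketched under a simple assumption: the two portions together fill the task's time frame, and the cycles executed at the two frequencies add up to the task's cycle count. The function name and the numeric values are hypothetical, and the sketch is not the procedure of Figure 4.

```python
def mmf_time_portions(cycles, time_frame, f_min, f_max):
    """Split the time frame between f_min and f_max; returns (t_min, t_max) in seconds.

    Assumed constraints:
        t_min + t_max = time_frame
        f_min * t_min + f_max * t_max = cycles
    """
    if cycles > f_max * time_frame:
        raise ValueError("task cannot finish in the time frame even at f_max")
    if cycles <= f_min * time_frame:
        # The lowest frequency alone is fast enough; no time at f_max is needed.
        return cycles / f_min, 0.0
    t_max = (cycles - f_min * time_frame) / (f_max - f_min)
    return time_frame - t_max, t_max


# Hypothetical task and frequency range (assumptions for illustration).
t_min, t_max = mmf_time_portions(cycles=6e6, time_frame=0.0125,
                                 f_min=200e6, f_max=1000e6)

print(f"time at f_min: {t_min * 1e3:.3f} ms")
print(f"time at f_max: {t_max * 1e3:.3f} ms")
```

Because the portions sum to the time frame, no slack is left uncovered, although the energy obtained depends on how far apart the two frequencies are.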
In the next section, we present the MFS-DVFS algorithm, which uses a linear combination of several processor frequencies, rather than only two, to execute a given task (Figure 2-e). The new approach is more energy-efficient than the other algorithms discussed earlier in this chapter; its energy saving is quite close to that obtained with the continuous optimum frequency.
Multiple Frequency Selection for Dynamic Voltage-Frequency Scaling (MFS-DVFS)
The RDVFS algorithm decreases a task's execution energy by choosing the best processor speed with respect to the task's idle time [7]. As an example, a set of four tasks scheduled on two processors is shown in Figure 2-a, where Figure 
The optimization problem in Eqn.7 represents the power consumption problem: how to 
To find the best possible values of $t_i^{(k)}$, this optimization must be applied to all tasks in the schedule. There are cases in which MFS-DVFS cannot improve the power consumption, for example when a task already runs at $f_1$ (the lowest frequency) under the RDVFS algorithm or has no idle time. Therefore, to speed up the MFS-DVFS algorithm, eligible tasks should be extracted before optimization.
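The per-task optimization can be sketched for a single task as follows. This is an illustrative brute-force search, not the solver used in this chapter. It assumes the problem is to minimize the sum of power times time portion over all frequencies, subject to the portions executing exactly the task's cycles and fitting within its time frame, with idle power ignored. Under these assumed linear constraints an optimum exists that uses at most two frequencies, so the sketch enumerates single frequencies and frequency pairs. The operating points are hypothetical.

```python
from itertools import combinations


def mfs_time_portions(cycles, time_frame, points, eps=1e-12):
    """Return (energy, {frequency: time}) minimizing energy for one task.

    `points` is a list of (frequency_hz, power_watts) operating points.
    """
    best_energy, best_plan = None, None

    def consider(energy, plan):
        nonlocal best_energy, best_plan
        if best_energy is None or energy < best_energy:
            best_energy, best_plan = energy, plan

    # One frequency only: feasible if the task finishes within the time frame.
    for f, p in points:
        t = cycles / f
        if t <= time_frame + eps:
            consider(p * t, {f: t})

    # Two frequencies that together fill the whole time frame.
    for (fa, pa), (fb, pb) in combinations(sorted(points), 2):
        tb = (cycles - fa * time_frame) / (fb - fa)
        ta = time_frame - tb
        if ta >= -eps and tb >= -eps:
            consider(pa * ta + pb * tb, {fa: max(ta, 0.0), fb: max(tb, 0.0)})

    if best_plan is None:
        raise ValueError("task cannot finish within its time frame")
    return best_energy, best_plan


# Hypothetical (frequency, power) operating points, not taken from Table 1.
POINTS = [(200e6, 0.008), (400e6, 0.064), (600e6, 0.216),
          (800e6, 0.512), (1000e6, 1.0)]

energy, plan = mfs_time_portions(cycles=6e6, time_frame=0.0125, points=POINTS)
single_600 = 0.216 * (6e6 / 600e6)  # energy if the task runs at 600 MHz only

print(f"best energy: {energy:.6f} J using {sorted(plan)}")
print(f"single-frequency energy at 600 MHz: {single_600:.6f} J")
```

With these hypothetical convex operating points, the search selects the two frequencies adjacent to the continuous optimum and uses less energy than the smallest single frequency that meets the time frame.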
Task eligibility: to simplify the formulation, let us consider just 4 discrete frequency values (real processors normally have 4-5 frequencies). The same procedure can be used for a larger number of frequencies. The problem in Eqn. 8 becomes:
Merging constraints 2 and 3 results in:
Therefore, the power consumption function changes to
To guarantee achieving less energy consumption using MFS-DVFS algorithm, the following condition should be satisfied.
shows a 3-dimensional surface and the search region is where it satisfies the three following constraints: (1) 0 
Experimental Results
In this section we present the energy consumption results obtained from our simulations. The simulations were carried out using the simulator we developed as a part of this study.
Figure 5. MFS-DVFS algorithm: linear combination of frequencies

Simulation Settings
We use the voltage/frequency settings of two real processors in our simulations: Transmeta Crusoe [7] and Intel Xscale [22]. Table 1 shows the voltage/frequency settings and the related power consumption of these processors, along with the convex model of each processor. These models use least-squares curve fitting to fit a convex function of the frequency $f$ to the frequency-power data of the two real processors, as shown in Figure 6.
We evaluated the performance of MFS-DVFS with two sets of task graphs: randomly generated and real-world parallel applications. The two real-world applications used in our experiments were LU decomposition and Gauss-Jordan, with DAGs extracted from [19]. We applied a large number of variations in the number of processors and tasks for 
These task graphs have different numbers of tasks, task distributions, communication costs and task dependencies. The execution cycles of the randomly generated tasks were drawn from a uniform distribution between 5 and 10 million cycles. We used 150 real-world application task graphs based on the LU decomposition algorithm in our experiments.
For the real-application graphs, the same number of task graphs (ranging from 100 to 500 tasks), with three schedulers and on five sets of processors, were investigated.
Results
The simulation results of normalized energy consumption for all DAGs (Figures 7 and 8) are shown in Table 2. This table clearly shows the superior performance of MFS-DVFS compared to the other approaches in all cases. Figure 8 shows that although all algorithms, including MFS-DVFS, save a significant amount of energy on LU decomposition, they are less effective on Gauss-Jordan tasks. For a deeper examination of this behaviour, a sample three-level Gauss-Jordan job scheduled on three processors is shown in Figure 9. As explained before, since there is no idle time among tasks in Gauss-Jordan application graphs, none of these algorithms can efficiently reduce energy consumption.
An interesting issue for further investigation is the relationship between energy consumption and the number of processors in our experiments. Increasing the number of processors speeds up processing and consequently reduces the makespan; however, as a drawback, it also increases the system slack time. An overhead of MFS-DVFS and MMF-DVFS is the transition time of switching from one frequency to another. A reasonable assumption is that the transition overhead is much smaller than the execution times of tasks, so it can be neglected in the calculations. In our experiments, only tasks with T at least 20 times longer than the transition time are considered for the MFS-DVFS algorithm.
Conclusion
Since most traditional static task scheduling algorithms in HPCS do not consider power management, we addressed the energy issue together with task scheduling and presented the MFS-DVFS algorithm. Our algorithm adopts the DVFS technique, a recent advance in processor design, to reduce energy consumption.
In this chapter, we studied existing DVFS-based approaches to covering idle time and, in particular, the use of a linear combination of more than one frequency to reduce the energy consumption of processors. First, we reviewed the energy model of DVFS-enabled processors. Then, we formulated our algorithm (MFS-DVFS) as an optimization problem over all frequencies for each task and solved it to find suitable time portions.
Simulation results for 1500 randomly generated task graphs and 300 real-world application task graphs showed the effectiveness of the MFS-DVFS algorithm compared with the other algorithms.
Acknowledgment
The work reported in this chapter is in part supported by National ICT Australia (NICTA). Professor A.Y. Zomaya's work is supported by an Australian Research Council Grant LP0884070.
