In recent years, the issue of energy consumption in parallel and distributed computing systems has attracted a great deal of attention. In response, many energy-aware scheduling algorithms have been developed, primarily using the dynamic voltage-frequency scaling (DVFS) capability that has been incorporated into recent commodity processors. The majority of these algorithms involve two passes: schedule generation and slack reclamation. The former pass redistributes tasks among DVFS-enabled processors based on a given cost function that includes makespan and energy consumption, while the latter pass is typically achieved by executing individual tasks with slack at a lower processor frequency. In this paper, a new slack reclamation algorithm is proposed by approaching the energy reduction problem from a different angle. Firstly, the problem of task slack reclamation by using combinations of processors' frequencies is formulated. Secondly, several proofs are provided to show that (1) if the working frequency set of a processor is assumed to be continuous, the optimal energy is always achieved by using only one frequency, (2) for real processors with a discrete set of working frequencies, the optimal energy is always achieved by using at most two frequencies, and (3) these two frequencies are adjacent/neighbouring when processor energy consumption is a convex function of frequency. Thirdly, a novel algorithm to find the best combination of frequencies that results in the optimal energy is presented. The presented algorithm has been evaluated based on results obtained from experiments with three different sets of task graphs: 3000 randomly generated task graphs, and 600 task graphs for two popular applications (Gauss-Jordan and LU decomposition). The results show the superiority of the proposed algorithm in comparison with other techniques.
years across a number of areas and technologies. A few examples of these systems are:
- Wireless sensors: several sensors extract data from the environment, transmit these data to a processing unit and receive processed data accompanied by appropriate commands from the processing unit [1]. The sensors and their receiver/transmitter are generally powered by batteries and/or solar cells.
- Satellite circuits: satellites typically contain a massive number of complex circuits which must work at low power. These circuits are supplied by solar cells, the only available power supply in satellites.
- Robots and surveillance devices: these devices are heavily used in the military, in mining and in environments that are unsafe for humans.
- Cell phones and laptops: these devices are powered by batteries which are expected to work for prolonged periods of time.
In recent years, the high price of energy and a variety of environmental issues have forced the high performance computing (HPC) sector to reconsider some of its old practices with the aim of creating more sustainable HPC systems. The Earth Simulator with a power consumption of 12 MW/h and Petaflop with a power consumption of 100 MW/h are two typical examples of such energy hungry HPC systems [2, 3]. The magnitude of this consumption will be even greater if the energy consumption of the associated cooling systems is also considered. For example, the survey in [4] indicates that if the number of transistors in a processor (1 billion for the recent Intel Itanium 2) continues to increase at the current rate, the heat produced (per cm²) by future processors will exceed that of the sun's surface. Therefore, new processor architectures require mechanisms that reduce energy consumption so that the amount of emitted heat can also be reduced [5, 6]. Furthermore, not only does the rising temperature of a circuit degrade its performance, but it can also significantly shorten the lifetime of its components. For example, a formula based on the Arrhenius law indicates that the life expectancy of many HPC system components is halved for every 10 °C temperature increase [3]. To reduce energy consumption in HPC systems, or clusters on a smaller scale, resource management in both hardware and software must be addressed. One issue in hardware resource management, which depends directly on the number of transistors, is to efficiently reduce the energy consumption of processors. Dynamic voltage-frequency scaling (DVFS), already incorporated into many recent processors, is perhaps the most appealing method for reducing energy consumption. DVFS reduces the energy consumption of processors based on the fact that energy consumption in CMOS circuits has a direct relationship with (1) the working frequency and (2) the square of the supplied voltage.
Thus, DVFS saves energy by switching between a processor's voltages/frequencies to execute tasks during slack times. Although DVFS was originally designed for task scheduling on single processors [4, 7-10], it has recently been extended and used in parallel and distributed computing systems as well [3, 11].
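The relationship described above can be illustrated with a minimal sketch. It assumes the common first-order CMOS model in which dynamic power is proportional to the frequency times the square of the voltage; the function names, the capacitance value and the two operating points are hypothetical and chosen only for illustration.

```python
def dynamic_power(c_eff, voltage, frequency):
    """First-order CMOS dynamic power: P = C_eff * V^2 * f (assumed model)."""
    return c_eff * voltage ** 2 * frequency


def task_energy(c_eff, voltage, frequency, cycles):
    """Energy to run `cycles` clock ticks at one operating point (idle power ignored)."""
    execution_time = cycles / frequency
    return dynamic_power(c_eff, voltage, frequency) * execution_time


# Hypothetical values: effective capacitance in farads, task length in clock ticks.
C_EFF = 1e-9
CYCLES = 5_000_000
fast = task_energy(C_EFF, 1.2, 1.0e9, CYCLES)   # 1.2 V at 1.0 GHz
slow = task_energy(C_EFF, 0.9, 0.5e9, CYCLES)   # 0.9 V at 0.5 GHz
print(f"energy at 1.0 GHz / 1.2 V: {fast:.6f} J")
print(f"energy at 0.5 GHz / 0.9 V: {slow:.6f} J")
```

Under this assumed model the frequency cancels out of the per-task energy, so the saving comes from the lower voltage that the lower frequency permits, while the longer execution time is absorbed by the slack.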
To deploy DVFS, it must be properly integrated with a task scheduler by using one of the following two approaches: (1) during the scheduling process or (2) slack reclamation after scheduling. In the first approach, task graphs are scheduled on DVFS-enabled processors by minimizing both energy and makespan at the same time [12, 13]. In the second approach, an independent scheduler is first used to distribute tasks among processors without considering energy consumption. This procedure is then followed by an independent DVFS technique that minimizes the energy consumption of tasks by filling the tasks' generated slack times. The existing methods based on DVFS techniques, however, have two major limitations: (1) most of them still focus on the scheduler and rarely explore other opportunities for slack reclamation, and (2) they use only one frequency (among a discrete set of frequencies) to perform each task; the use of one frequency usually results in underutilized slack times, leading to energy wastage by processors and other devices. In this paper, we propose a new slack reclamation technique to reduce the energy consumption of processors through efficient use of the slack times generated by an independent scheduler. In our approach, hereafter called Multiple Voltage-Frequency Selection DVFS (MVFS-DVFS), the key idea is to execute tasks using a linear combination of available frequencies so that all slack times are fully utilized. The MVFS-DVFS algorithm is presented in three steps. Firstly, the energy consumption of each task is formulated as an optimization problem with constraints. Secondly, formal proofs are provided to show that an optimal set of at most two voltage-frequencies will always lead to minimum energy consumption. Also, if power consumption is modelled as a convex function of frequency, these two frequencies are adjacent. Thirdly, an algorithm is proposed to find these frequencies for each task.
Performance of MVFSDVFS is compared against previous approaches with similar goals. The rest of the paper is organized as follows. Section 2 presents related works. Section 3 describes preliminaries including our assumed system and energy models. In Section 4, MVFSDVFS algorithm is presented. Experimental results and conclusions are presented in Sections 5 and 6, respectively.
RELATED WORK
In recent years there has been a significant amount of work on task scheduling for real-time embedded systems using various forms of DVFS-enabled techniques. The main idea in most of the existing algorithms is to efficiently use processors' slack times to satisfy the timing requirements of all tasks, e.g. deadlines, release times and execution times. Based on the information provided or estimated for each task, energy-aware task scheduling algorithms in embedded systems can be categorized into two groups: static (offline) and dynamic (online). In static scheduling, timing information of all tasks is made available at compile time, and scheduling is performed to meet all deadlines while maximizing processor utilization [7, 8, 12, 14, 15]. This type of scheduling is used in most large-scale computational problems, such as bioinformatics [16], chemistry [17] and machine vision applications [18]. In dynamic scheduling, on the other hand, although tasks' deadlines might be available at compile time, their release and execution times must be estimated at runtime [3, 10, 13, 19]. This class of scheduling is usually used in dynamic large-scale approximation and optimization problems such as weather forecasting [20] and search algorithms [21], as well as in most power-aware devices like laptops, wireless sensors and cell phones. While there are many algorithms in the literature for energy-efficient static and dynamic scheduling on uniprocessor and multiprocessor systems, most of them cannot be applied to reduce energy consumption in clusters. In fact, these algorithms are suitable for systems with a small number of processors [22, 23] as well as those with shared memories [4, 15]. In addition, these algorithms mostly assume that the tasks (periodic or aperiodic) are independent [22-25]. Kappiah et al. in [26] used a just-in-time DVFS technique to fill slack times in MPI programs.
A system called Jitter was utilized to reduce the working frequency of nodes with more slack time and/or less assigned computation. Jitter ensures that tasks arrive just in time without increasing the overall execution time. Ge et al. in [3] applied the DVS technique to processors that do not work at their peak performance during the execution of parallel applications. In this approach, the best processor frequency for each task was selected before its execution based on thorough analysis of collected computation and communication power profiles. A method that reduces energy consumption by adaptively activating and deactivating hardware resources (e.g., memory) for intensive HPC applications was presented in [27]. Lee and Zomaya in [13] presented a DVFS-based algorithm to simultaneously minimize both the completion time and the energy consumption of precedence-constrained (dependent) parallel jobs. Their final result was a trade-off between quality of scheduling and consumption of energy. Ding et al. in [28] formally modelled efficiency/isoefficiency concepts for energy scalability. They also extended their results to produce an analytical model for studying trade-offs between performance and energy saving in HPC systems. Molnos et al. in [29] classified the slack times in real-time applications into static, work and shared slack groups for multiple dependent tasks on multiple DVFS-enabled processors. A dynamic dependency-aware task scheduling was then proposed to adjust the voltage/frequency of the deadlines for tasks assigned to processors. Hotta [30] presented a profiling-based power-performance optimization method in which the execution of a program was divided into several regions. In this approach, profile information for each region (including power and execution profiles) was extracted and then utilized to find the best combination of processor voltages and frequencies.
In the work of Springer et al. [31], an upper limit for system energy usage was first chosen externally; then, a combination of performance modelling and performance prediction was used to modify execution times according to this upper limit. After creating models for both execution time and energy consumption, key parameters of the models were estimated by executing a program a small number of times followed by regression. Here, for better estimation of parameters, the following steps were iterated until a proper schedule was achieved: (1) using the models to predict each possible scheduling of tasks, (2) executing the program a few times with the best predicted schedule and (3) updating the estimated key parameters. Multiple voltages in Dynamic Voltage Scaling enabled processors were used by Ishihara in [32]. That work is a simplified version of our work in this paper, as will be described briefly in Section 4.4. Kimura et al. in [11] proposed an energy reduction algorithm for power-scalable high performance clusters supported by the DVFS technique. This algorithm selects a suitable set of voltages and frequencies to execute tasks as uniformly as possible using the lowest available frequency while slightly increasing the overall execution time. In our former approach [33], an algorithm was proposed to reclaim slack times of tasks by a linear combination of the processor's highest and lowest frequencies. To the best of our knowledge, the Reference DVFS algorithm (RDVFS) [11] and Maximum-Minimum-Frequency DVFS (MMF-DVFS) [33] are the most efficient algorithms with objectives similar to our work in this paper; therefore, they will be used to measure the efficiency of our new approach.
PRELIMINARIES
This section describes the target system and application models and introduces the relevant energy models.
System and Application Models
In this work, a parallel computing system comprises $N$ homogeneous processors with individual memories. In such systems, the switching time between frequencies can be safely ignored because the time to switch from one frequency to another (30–150 µs) is significantly smaller than the execution time of tasks (at least 1 ms).
A set of dependent tasks represented by a directed acyclic task graph (DAG) is also assumed to be executed in the modelled HPC system. Here, the $k$-th task, $A^{(k)}$, has the following five parameters: (1) $T^{(k)}$, the whole time a processor can assign to the task, i.e. the sum of the task's execution and slack times (Figure 1a); (2) $t_i^{(k)}$, the task execution time when frequency $f_i$ is used; (3) $f_{ideal}^{(k)}$, the ideal continuous frequency based on [34] that results in the optimum energy consumption (Figure 1c); (4) $K^{(k)}$, the number of clock ticks (i.e. clock cycles) the task needs for its execution; and (5) $t_{OS}^{(k)}$, the time the processor spends executing the task in the original schedule (Figure 1a).
Energy Model
DVFSenabled processors can execute a task by using a discrete set of
. In CMOS based processors, the power consumption of a processor consists of two parts: (1) dynamic part that is mainly related to CMOS circuit switching energy, and (2) static part that addresses the CMOS circuit leakage power. The whole power consumption ( d P ) is estimated as [4] :
where $f$ is the processor's working frequency, $v$ is the processor's working voltage, and the remaining coefficient denotes the effective capacitance. Note that $v_t$ is a threshold voltage usually provided by the manufacturer. In this paper, we consider a general relation between voltage, frequency and power as:
The overall energy consumption of k th task ) (
in a DAG is calculated as:
P is the energy a processor consumes when it is in idle.
MULTIPLE VOLTAGEFREQUENCY SELECTION FOR DYNAMIC VOLTAGEFREQUENCY SCALING (MVFSDVFS)
In this section, the general DVFS problem is formally defined and our algorithm, Multiple VoltageFrequency Selection Dynamic VoltageFrequency Scaling (MVFSDVFS), is provided.
Problem Statement
Optimal energy consumption of k th task can be defined as finding the best combination of available voltagefrequencies,
to perform a predefined task with K clock ticks within a predefined time T. For the k th task, this optimal answer is defined as follows:
Because our algorithm reclaims the slack time of each task independent from other tasks in DAG , the above formulation for the k th task is further simplified by replacing
t , T and K , respectively. Here, t and T are time values in miliseconds and K is an integer value.
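A minimal sketch of how a candidate solution of this optimization problem could be checked and costed. The constraint set (all clock ticks executed, total time within $T$, non-negative time shares) follows the description in the text; the exact form of the objective, in particular the idle-power term, as well as all names and numbers, are assumptions made for illustration.

```python
import math


def is_feasible(times, freqs, K, T, tol=1e-9):
    """Check the three constraints for one task.

    times[i] is the time spent at frequency freqs[i].
    """
    executed = sum(t * f for t, f in zip(times, freqs))
    return (
        math.isclose(executed, K, rel_tol=tol)   # 1. all K clock ticks are executed
        and sum(times) <= T + tol                # 2. the task finishes within T
        and all(t >= -tol for t in times)        # 3. no negative time share
    )


def energy(times, powers, T, p_idle=0.0):
    """Busy energy plus idle energy for the unused part of T (assumed objective)."""
    busy = sum(t * p for t, p in zip(times, powers))
    return busy + p_idle * (T - sum(times))


# Hypothetical processor: frequencies in cycles/ms, power in arbitrary units.
freqs = [400.0, 600.0, 800.0]
powers = [1e-9 * f ** 3 for f in freqs]   # assumed cubic power model
K, T = 5000.0, 10.0                       # clock ticks, available time in ms

single = [0.0, K / 600.0, 0.0]            # one frequency, part of T left unused
mixed = [5.0, 5.0, 0.0]                   # two frequencies, T fully used
for name, times in (("single", single), ("mixed", mixed)):
    print(name, is_feasible(times, freqs, K, T), round(energy(times, powers, T), 4))
```

In this toy setting both candidates satisfy the constraints, and the mix of two frequencies that fills the whole available time costs less than the single frequency that finishes early.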
Computing the Optimal Solution
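To find the optimal solution for the problem defined by equation (4), a simplified version of this problem is solved first and then generalized to find the solution for equation (4). This simplified version uses only three voltage-frequencies to perform a task in exactly time $T$ (as opposed to within $T$) and is defined as follows:
$$E = \min_{t_1, t_2, t_3} \sum_{i=1}^{3} t_i \, P_d(f_i, v_i) \quad \text{subject to} \quad 1.\ \sum_{i=1}^{3} t_i f_i = K; \quad 2.\ \sum_{i=1}^{3} t_i = T; \quad 3.\ t_i \ge 0,\ i = 1, 2, 3 \qquad (5)$$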
Theorem 1: The optimal solution for equation (5) is obtained by at most two voltage-frequencies.
Proof: To prove this theorem, the general energy formulation using three voltage-frequencies is first computed, and then minimized.
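From constraints 1 and 2 ($t_1 f_1 + t_2 f_2 + t_3 f_3 = K$ and $t_1 + t_2 + t_3 = T$):
$$t_1 = \frac{(K - f_2 T) - (f_3 - f_2)\, t_3}{f_1 - f_2}, \qquad t_2 = \frac{(f_1 T - K) - (f_1 - f_3)\, t_3}{f_1 - f_2}$$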
Based on constraint 3:
which results in:
By replacing $t_1$ and $t_2$ in the energy formulation with their expressions in terms of $t_3$, the following equation for energy is obtained:
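This equation reveals that the energy consumption of a task can be represented as a linear function of $t_3$. Depending on the sign of the coefficient of $t_3$, two scenarios might arise. If the coefficient is negative, equation (8) is a strictly decreasing function of $t_3$. Therefore, it is minimized when $t_3$ is set to its highest possible value, which is reached when either $t_1$ or $t_2$ becomes zero, so the task is executed with only two voltage-frequencies: $(v_1, f_1)$ and $(v_3, f_3)$, or $(v_2, f_2)$ and $(v_3, f_3)$. If the coefficient is positive, equation (8) is a strictly increasing function of $t_3$ and is minimized when $t_3$ is set to its lowest possible value, where again at least one of the three time shares is zero.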
Equations (9)-(11) show that, regardless of whether energy is a strictly decreasing or increasing function of $t_3$, two voltage-frequencies always provide the optimal energy consumption.
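As a hedged illustration of Theorem 1: once only two voltage-frequencies remain, the two equality constraints determine their time shares uniquely. The sketch below solves $t_a f_a + t_b f_b = K$ and $t_a + t_b = T$; the function name and the sample values are hypothetical.

```python
def split_time(f_a, f_b, K, T):
    """Return (t_a, t_b) with t_a*f_a + t_b*f_b == K and t_a + t_b == T.

    Raises ValueError when the pair cannot execute K clock ticks in exactly T
    with non-negative time shares.
    """
    if f_a == f_b:
        raise ValueError("the two frequencies must differ")
    t_a = (K - f_b * T) / (f_a - f_b)
    t_b = T - t_a
    if t_a < 0 or t_b < 0:
        raise ValueError("K/T does not lie between the two frequencies")
    return t_a, t_b


# Hypothetical task: 5000 clock ticks in 10 ms, frequencies in cycles/ms.
t_low, t_high = split_time(400.0, 600.0, K=5000.0, T=10.0)
print(t_low, t_high)
```

A feasible split exists only when $K/T$ lies between the two chosen frequencies, which is why the sketch rejects pairs that sit on the same side of that value.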
are capable of performing a task; then, their associated optimal energy consumption would be:
Proof: direct observation from Theorem 1.
are capable of performing a task; then, their associated execution times for optimal energy consumption would be:
Lemma 1 (Optimum continuous frequency):
If a processor is able to perform a task with a continuous range of voltage-frequencies, which is an unrealistic assumption, then the optimum energy to perform task $A$ is achieved when the task's available time ($T$), including its slack, is fully utilized.
are two voltagefrequencies to obtain optimal energy for a task, then, equation (5) can be rewritten as:
, the energy formula would be:
$E$ in equation (14) is a strictly decreasing function. This implies that if the frequency can be chosen from a continuous spectrum, the energy is optimized using only one voltage-frequency. Further, this frequency would cover the whole available time $T$ and could be calculated as follows:
$$f_{ideal} = \frac{K}{T} \qquad (15)$$
Lemma 2: If a processor's set of available voltage-frequencies is discrete, then the two voltage-frequencies that lead to the optimal energy consumption lie on either side of $f_{ideal}$ in equation (15).
Proof: Constraint 3 in equation (5) 
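Lemma 1 and Lemma 2 can be sketched together as follows. The sketch assumes that $f_{ideal}$ is the frequency that spreads the task's $K$ clock ticks over the whole available time $T$, and it returns the nearest available frequency on each side of that value. By Lemma 2 an optimal pair has one member on each side; the nearest pair is the one singled out when power is a convex function of frequency. Function names and sample values are hypothetical.

```python
from bisect import bisect_left


def ideal_frequency(K, T):
    """Continuous frequency that executes K clock ticks in exactly T."""
    return K / T


def neighbours(freqs, f_ideal):
    """Return the available frequencies just below and just above f_ideal.

    `freqs` must be sorted in ascending order. If f_ideal matches an available
    frequency, that frequency is returned twice.
    """
    if not freqs[0] <= f_ideal <= freqs[-1]:
        raise ValueError("f_ideal is outside the available frequency range")
    hi = bisect_left(freqs, f_ideal)   # first index with freqs[hi] >= f_ideal
    if freqs[hi] == f_ideal:
        return f_ideal, f_ideal
    return freqs[hi - 1], freqs[hi]


# Hypothetical frequency set in cycles/ms and a task of 5000 ticks in 10 ms.
FREQS = [400.0, 600.0, 800.0, 1000.0]
f_ideal = ideal_frequency(K=5000.0, T=10.0)
print(f_ideal, neighbours(FREQS, f_ideal))
```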
Theorem 2:
Optimal answer for equation (4) uses at most two voltage-frequencies.
Proof (by contradiction):
To prove this theorem, we show that the optimal answer for equation (4) cannot use more than two voltage-frequencies to minimize total energy consumption. If we assume that the optimal answer for equation (4) utilizes more than two voltage-frequencies, then, its utilization profile can be depicted as that in Figure 2 for 3  n . In this case, this total task can be divided into two independent subtasks: (1) a subtask ) , (
and (2) a subtask to cover the rest of calculations, ; and therefore, the optimal answer for equation (4) cannot use more than two voltage-frequencies to minimize energy consumption.
Figure 2. Optimal answer for equation 4 using multiple frequencies
Up to this point, we have proved that equation (4) can only be minimized by using at most two voltage-frequencies. However, in all these formulas, constraint 2 of this problem was treated as an equality, so that the maximum available time $T$ is used to find the optimal solution, although the optimizer is allowed to use less time than $T$. Therefore, in the following theorem we prove that using less time will always lead to more energy consumption. That is, the original assumption of replacing $\sum_i t_i \le T$ with $\sum_i t_i = T$ was correct.
Theorem 3:
In equation (4), using less time will always result in consuming more energy.
Proof:
To prove this theorem, a task is assumed to be executed with two voltage-frequencies
Therefore, the original assumption of replacing $\sum_i t_i \le T$ with $\sum_i t_i = T$ is correct.
Computation of Optimal Energy
Based on constraint 3 from equation (4),
and the optimal energy is calculated as:
Thus, the details of the postprocessing algorithm proposed in this paper are as follows:
SimplifiedMultiple Frequency Selection DVFS (SMFSDVFS)
In most of DVFS algorithms, it is assumed that processor energy consumption is a convex function of frequency (or voltage) as:
The convex function relation between power and voltage was used by Ishihara in [32] where CPU power is just a square function of voltage -not frequency. If the relation between voltage and frequency in equation (1) is assumed to be linear, then the Ishihara work will be similar to the SMFS-DVFS algorithm in this section. Generally, equation (1) is an approximation of the real relation between Voltage-frequency and power in CMOS circuits that may not be followed by a few current or future CPUs. This problem has been solved in MVFS-DVFS algorithm in the previous section of this paper by considering a general form between power and voltage-frequency in CPUs as shown in equation (2) . MVFS-DVFS algorithm claims that independent of the way of modelling between power and voltage-frequency, if equation (2) is satisfied, always two frequencies are involved in the optimal energy consumption. In other words, the technique in [32] is a subset of MVFS-DVFS technique described in this paper. This simplification changes the problem statement in equation (4) to:
MVFSDVFS algorithm
Postprocessing algorithm to optimize energy consumption of scheduled tasks  Schedule tasks given by a DAG using a scheduling algorithm  for k=1:number of tasks in DAG -Select the k th task -Calculate ) (k ideal f -Divide processor frequency set into two groups (U,L):
-Calculate time and energy from equations (18) and (19) 
associated to the lowest energy for this task endfor  return (individual voltagefrequency pair for execution of each task)
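The steps of the algorithm box can be sketched in Python as follows. This is an illustrative reading rather than the authors' implementation: it assumes that group U holds the frequencies at or above $f_{ideal}$ and group L those below it, that the time shares follow from the two equality constraints, and that idle power is zero. All names and sample values are hypothetical.

```python
from itertools import product


def mvfs_dvfs_task(freqs, powers, K, T):
    """Pick the frequency pair with the lowest energy for one task.

    freqs[i] and powers[i] describe one operating point. Returns
    (energy, [(frequency, time), ...]) for the best combination found.
    """
    f_ideal = K / T
    points = list(zip(freqs, powers))
    upper = [(f, p) for f, p in points if f >= f_ideal]   # group U (assumed)
    lower = [(f, p) for f, p in points if f < f_ideal]    # group L (assumed)
    if not upper:
        raise ValueError("task cannot finish within T at any frequency")

    candidates = []
    for f, p in upper:
        if f == f_ideal:                     # one frequency fills T exactly
            candidates.append((p * T, [(f, T)]))
    for (f_u, p_u), (f_l, p_l) in product(upper, lower):
        t_u = (K - f_l * T) / (f_u - f_l)    # time share at the higher frequency
        t_l = T - t_u                        # remaining time at the lower one
        candidates.append((t_u * p_u + t_l * p_l, [(f_l, t_l), (f_u, t_u)]))
    if not candidates:                       # f_ideal is below the lowest frequency
        f, p = min(upper)                    # run at the lowest frequency, then idle
        candidates.append((p * K / f, [(f, K / f)]))
    return min(candidates, key=lambda c: c[0])


# Hypothetical processor (cycles/ms, arbitrary power units) and scheduled tasks (K, T).
FREQS = [400.0, 600.0, 800.0, 1000.0]
POWERS = [1e-9 * f ** 3 for f in FREQS]      # assumed cubic power model
tasks = [(5000.0, 10.0), (6000.0, 10.0), (2000.0, 10.0)]
for K, T in tasks:
    best_energy, plan = mvfs_dvfs_task(FREQS, POWERS, K, T)
    print(K, T, round(best_energy, 4), plan)
```

The exhaustive search over U x L pairs keeps the sketch independent of any convexity assumption, at the cost of evaluating more pairs than the adjacent-frequency shortcut would.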
3. $t_i \ge 0;\ i = 1, 2, \ldots$
Here, to simplify the writing of the equations for the $k$-th task,
are calculated as follows:
Therefore, the algorithm for SMFS-DVFS will be:
EXPERIMENTAL RESULTS AND DISCUSSION
This section presents simulation results of our proposed algorithm (MVFS-DVFS) as well as other algorithms (RDVFS, MMF-DVFS and optimum continuous frequency) for a more comprehensive comparison. Here, the following three schedulers are used with different numbers of processors to produce the original task schedules: (1) list scheduling, (2) list scheduling with Longest Processing Time first (LPT) and (3) list scheduling with Shortest Processing Time first (SPT). The simulator itself was developed as a part of this study.
An Example

SMFS-DVFS algorithm
Post-processing algorithm to optimize energy consumption of scheduled tasks
The following example shows how each of the algorithms uses a task's slack time to reduce its associated energy consumption. To simplify, it is assumed that the power consumption is a cubic function of frequency as 3 24 10 367
. Figure 1a shows the original scheduling of the $k$-th task $A^{(k)}$; the values of the parameters for this task are as follows:
Based on these parameters:
- By referring to equations (14) and (15) (Figure 1b).
- SMFS-DVFS (the simplified version of our proposed method) attempts to find the optimal energy by a linear combination of all processor frequencies. We proved that for each task two neighbouring frequencies always produce the optimal energy. These two frequencies are
which are obtained from RDVFS algorithm (Figure 1d) 
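A comparison of this kind can be sketched numerically. The sketch below is a toy under stated assumptions, not the paper's example: power is cubic in frequency with a hypothetical coefficient, idle power is zero, "single" stands for a one-frequency approach in the spirit of RDVFS (lowest frequency that meets the deadline), "min-max" for a combination of the lowest and highest frequencies in the spirit of MMF-DVFS, and "neighbours" for the two frequencies adjacent to $K/T$. The task parameters and frequency set are invented.

```python
def pair_energy(f_lo, f_hi, power, K, T):
    """Energy when K ticks fill exactly T using two frequencies (or one if equal)."""
    if f_lo == f_hi:
        return power(f_lo) * T
    t_hi = (K - f_lo * T) / (f_hi - f_lo)
    return t_hi * power(f_hi) + (T - t_hi) * power(f_lo)


def compare(freqs, power, K, T):
    """Energy of four strategies for one task; idle power is assumed to be zero."""
    f_ideal = K / T
    if not min(freqs) <= f_ideal <= max(freqs):
        raise ValueError("K/T must lie within the available frequency range")
    f_up = min(f for f in freqs if f >= f_ideal)     # lowest frequency meeting T
    f_down = max(f for f in freqs if f <= f_ideal)   # highest frequency not above K/T
    return {
        "continuous": power(f_ideal) * T,            # ideal, not realizable
        "single": power(f_up) * K / f_up,            # one frequency, slack left idle
        "min-max": pair_energy(min(freqs), max(freqs), power, K, T),
        "neighbours": pair_energy(f_down, f_up, power, K, T),
    }


def cubic_power(f):
    """Assumed cubic power model with a hypothetical coefficient."""
    return 1e-9 * f ** 3


results = compare([400.0, 600.0, 800.0, 1000.0], cubic_power, K=4500.0, T=10.0)
for name, value in results.items():
    print(f"{name:>10}: {value:.4f}")
```

With these invented numbers the continuous frequency gives the lower bound, and the neighbouring pair comes closest to it among the realizable strategies.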
Experimental Settings
Processor models
Voltage/frequency settings are defined based on two groups of processors: the first group includes two synthetic processors, while the second group includes two real processors (Transmeta Crusoe [35] and Intel XScale [36]). Table 2 shows the voltage/frequency and the related power consumption of these processors.
Task information
The performance of MVFSDVFS was evaluated with two sets of task graphs: randomly generated and realworld applications. For each application, a large number of variations in the number of tasks and the number of processors were applied to simulations. The random task graphs set consists of 3000 graphs with five graph sizes, three different schedulers and five sets of processors. These task graphs have different number of tasks, task distributions, communication costs and task dependencies. The execution cycle of these randomly generated tasks varied from 510 million cycles from a uniform distribution, respectively. The two applications used in these experiments were LU decomposition and GaussJordan with directed acyclic graph (DAG) and execution cycles extracted from [19] . Also, 600 realworld application task graphs based on GaussJordan and LU decomposition algorithms were used in the experiments. For each application graph, the same number of task graphs (ranging from 100 to 500 tasks) with three schedulers and on five sets of processors were investigated. Table 3 shows the simulation results of normalized energy consumption for all DAG sets (Figures 4 and 5 ). This table clearly indicates the superior performance of MVFSDVFS compared with others in all cases. The performance of MVFSDVFS and other related algorithms has a strong dependency on tasks' slack times in the original scheduling. This dependency explains why the algorithms are not successful in reducing the energy consumption of GaussJordan task graphs. To clarify, a three level GaussJordan task scheduling on three homogenous processors is shown in Figure 6 , which clearly shows that the relations among tasks and their computation and communication costs leave no slack time for tasks to be used by the algorithms. Besides the effectiveness of MVFS-DVFS compared with other slack reclamation algorithms, some other issues should be addressed. 
The first issue is the relationship between energy consumption and the number of processors in our experiments. Increasing the number of processors speeds up processing and therefore reduces the makespan; as a side effect, however, it also increases the system's slack times. Figure 7, addressing this effect, shows the percentage of overall energy saving for systems with different numbers of processors for random and LU decomposition task graphs. This figure indicates that for both random and LU decomposition task graphs, increasing the number of processors results in these algorithms saving more energy. This figure also shows the influence of the type of scheduling on random task graphs, which increases the amount of slack time between tasks for 8 and 16 processors compared with 4 and 32 processors.
The second issue, which is the major limitation of most DVFS-based algorithms working with one frequency (such as the RDVFS algorithm), is that the slack time cannot be covered by using only one frequency. Those algorithms work better when processors can be used at any arbitrary frequency. Nevertheless, due to technological issues, the number of valid frequencies is limited; therefore, these algorithms must instead select the most suitable frequency among the set of frequencies defined by DVFS. Generally, the relation
The third issue is the overhead of running MVFS-DVFS. This overhead comes from the transition time of switching from one frequency to another. A generally valid assumption is that the transition times are much shorter than the execution times of tasks; therefore, the transition time overhead can be neglected in calculations. In our experiments, only tasks with a duration at least 100 times longer than their transition times are considered for the MVFS-DVFS algorithm.
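The eligibility rule mentioned above (task duration at least 100 times the transition time) can be expressed as a simple filter. This is only a sketch of the stated threshold; the function name, the units and the sample durations are hypothetical.

```python
def eligible_for_switching(durations_us, transition_us, min_ratio=100):
    """Return the task durations that are at least min_ratio times the transition time."""
    threshold = min_ratio * transition_us
    return [d for d in durations_us if d >= threshold]


# Hypothetical task durations and transition time, all in microseconds.
print(eligible_for_switching([5_000, 15_000, 40_000], transition_us=150))
```

Integer microseconds are used so that the threshold comparison is exact and not affected by floating-point rounding.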
Results and Discussions
Conclusion
Since most traditional static task scheduling algorithms for HPC systems do not consider power management, we addressed the energy issue with task scheduling in clusters and presented the MVFS-DVFS algorithm, which is based on the DVFS technique. In this work, we specifically studied the use of a linear combination of more than one voltage-frequency to reduce energy consumption on processors. We proved that the optimal energy in a discrete set of voltage-frequencies for each task is achieved by a combination of at most two voltage-frequencies. These two voltage-frequencies are adjacent when the power consumption of the processor is a convex function of frequency. Simulation results of 3000 randomly generated task graphs and 600 real application task graphs showed the effectiveness of the MVFS-DVFS algorithm compared with other related algorithms. MVFS-DVFS consumes the least amount of energy in all cases.
