Abstract
Introduction
As the performance of modern processors has increased, the high energy consumption of high-performance computer systems has become an important and urgent problem [1]. Performance and energy efficiency are now the two key criteria for modern clusters, and designing energy-efficient, environmentally friendly clusters is highly desirable. Many recent high-performance microprocessors support Dynamic Voltage and Frequency Scaling (DVFS), which allows a processor to operate at multiple frequencies under different supply voltages at run time and thereby saves energy by spreading run cycles into idle time. A parallel application can be represented as a Directed Acyclic Graph (DAG), called a task graph, in which nodes denote tasks with precedence constraints and edges denote the communication between tasks. A parallel program may have slack time because of the precedence constraints and synchronization between its tasks.
Our research is devoted to developing a scheduling algorithm that reduces the energy consumption of parallel task execution by applying the DVFS mechanism during slack time. We identify the slack time of tasks in different stages and scale their supply voltages to the corresponding level, thus reducing the job's energy consumption.
The rest of the paper is organized as follows. Section 2 describes related work. Section 3 introduces the computational models, including the task model and the energy consumption model. Section 4 presents the two-phase frequency scaling strategy. Section 5 reports the simulation results. Finally, Section 6 provides concluding remarks and future research directions.
Related Work
Increasing attention has been directed toward energy efficiency research for high-performance clusters [2] [3]. Among the many energy-saving techniques, scheduling is an efficient approach to reducing energy consumption on clusters. There has been a lot of research on energy-efficient scheduling based on DVFS [4] [5] [6] [7] [8] [9] [10] [11].
Computational Model

Task Model
A parallel application with precedence-constrained tasks can be represented as a weighted DAG. In this paper, a parallel application G is modeled as a vector (V, E, C, T), where V = {v_1, v_2, …, v_n} represents a set of parallel tasks and E denotes a set of communication messages among parallel tasks. Among all tasks, v_1 denotes the entry node and v_n the exit node. C is the set of edge communication costs and T is the set of computation costs. The value t_i ∈ T is defined as the required computing time of v_i, and c_ij ∈ C is defined as the communication cost incurred at the edge e_ij.
A task is a non-preemptive and indivisible unit of work, which may be an assignment statement, a subroutine or even an entire program. For each edge, if v_i and v_j are allocated to the same processor, the communication cost c_ij is set to zero. PRED(v_i) is the set of immediate predecessors of v_i and SUCC(v_i) is the set of immediate successors of v_i. A task allocation matrix U is an n×m binary matrix denoting a mapping of n parallel tasks to m computational nodes in a cluster. Element u_ij of U is 1 if task v_i is assigned to processor p_j and 0 otherwise.
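The task model can be illustrated with a small data structure. The following is a minimal sketch, not the authors' implementation: the names `TaskGraph`, `pred`, `succ`, `comm_cost`, `allocation_matrix` and `proc_of`, as well as the four-task example graph, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """Weighted DAG G = (V, E, C, T); tasks are numbered 1..n."""
    t: dict                                  # T: task id -> computation cost t_i
    c: dict = field(default_factory=dict)    # C: edge (i, j) -> communication cost c_ij

    def pred(self, i):
        """PRED(v_i): immediate predecessors of task i."""
        return sorted(a for (a, b) in self.c if b == i)

    def succ(self, i):
        """SUCC(v_i): immediate successors of task i."""
        return sorted(b for (a, b) in self.c if a == i)

    def comm_cost(self, i, j, proc_of):
        """c_ij, or zero when both tasks run on the same processor."""
        return 0 if proc_of[i] == proc_of[j] else self.c[(i, j)]


def allocation_matrix(proc_of, n, m):
    """Build the n x m binary matrix U with u_ij = 1 iff v_i runs on p_j."""
    return [[1 if proc_of[i] == j else 0 for j in range(1, m + 1)]
            for i in range(1, n + 1)]


# Hypothetical graph: v_1 is the entry node and v_4 the exit node.
g = TaskGraph(t={1: 2, 2: 3, 3: 4, 4: 1},
              c={(1, 2): 1, (1, 3): 2, (2, 4): 1, (3, 4): 3})
proc_of = {1: 1, 2: 1, 3: 2, 4: 2}           # task id -> processor id
U = allocation_matrix(proc_of, n=4, m=2)
```

In this example, tasks v_1 and v_2 share processor p_1, so `g.comm_cost(1, 2, proc_of)` is zero, while the message from v_1 to v_3 crosses processors and keeps its cost.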
Energy Model
A cluster is represented as a set P = {p 1 
Here, C is the capacitance of the circuit, v is the supply voltage, and f is the frequency. The capacitance C is a constant while the processor is working. The energy consumption of a processor is the product of its energy consumption rate R_p and the execution time of its tasks. Thus, we have
To calculate the energy consumption of the interconnects, let el_ij be the energy consumed by the transmission of message e_ij. The energy consumption of the message can be computed as the product of the communication rate R_c and the communication cost c_ij as
Here, k is a constant parameter. The energy consumption of a network link is the cumulative energy consumption of all messages delivered over the link. The communication energy EC can be expressed as follows.
Because our research focuses on the DVFS technique, the processor energy consumption is reduced during task execution by adjusting the processor frequency.
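The energy model can be made concrete with a short sketch. Since the formulas themselves are not reproduced in the text, the code assumes the standard CMOS dynamic power relation R_p = C·v²·f for the energy consumption rate, takes the communication rate R_c as a plain input because the role of the constant k is not specified, and uses hypothetical function names and operating points.

```python
def power_rate(capacitance, voltage, frequency):
    """Assumed energy consumption rate of a processor: R_p = C * v^2 * f."""
    return capacitance * voltage ** 2 * frequency


def processor_energy(capacitance, voltage, frequency, exec_times):
    """Energy of one processor: R_p times the total execution time of its tasks."""
    return power_rate(capacitance, voltage, frequency) * sum(exec_times)


def communication_energy(rate_c, comm_costs):
    """EC: sum over all messages of el_ij = R_c * c_ij."""
    return sum(rate_c * cost for cost in comm_costs)


# Hypothetical operating points for the same task.
# At full speed the task takes 10 s; at half the frequency it takes 20 s,
# but the lower supply voltage more than compensates for the longer run.
e_high = processor_energy(1e-9, 1.2, 2.4e9, [10.0])
e_low = processor_energy(1e-9, 1.0, 1.2e9, [20.0])
e_comm = communication_energy(0.5, [1, 2, 1, 3])
```

Under these assumed values `e_low` is smaller than `e_high`, which is the effect that slack reclamation relies on: a task that can afford to run longer consumes less energy at a lower voltage and frequency.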
Two-Phase Frequency Scaling
The key idea of DVFS is to dynamically scale the supply voltage and frequency of the CPU while still meeting the total computation time. A parallel application has slack time because of the precedence constraints and the synchronization or communication of its tasks.
The slack time can be classified into two categories [12]: Worst Slack Time (WST) and Workload-variation Slack Time (WVST). WST is static slack time, which results from low processor utilization due to precedence constraints between tasks. It can be computed before task execution. WVST occurs because of execution time variation caused by data-dependent computation. It is generated dynamically and can therefore be known only after execution.
A novel energy-aware scheduling algorithm (called NEASA) based on the DVFS technique is proposed in this section. Corresponding to the two kinds of slack time, our energy-aware scheduling method includes two stages.
(1) In the first stage, the optimum processor frequency is calculated from the DAG of the parallel application before task execution. During the WST, the processor frequency is reduced to the optimum frequency. The makespan of the overall parallel tasks is not affected by the adjusted frequencies.
(2) In the second stage, after every task finishes, its execution time is checked to determine whether the real execution time is less than the original t_i. The WVST is then calculated at run time. During the WVST, the minimum allowed frequency is adopted to reduce energy consumption to the greatest extent.
The objective of the scheduling is to reduce the overall energy consumption by adopting the DVFS technique in both WST and WVST, according to the critical path of the task graph and the dynamic execution variation at run time.
First Phase Frequency Scaling
Parallel task scheduling consists of grouping and allocating. Grouping divides the tasks of a DAG into several groups. Allocating maps the groups onto the processors. The tasks of the same group execute on the same processor. For grouping and allocating, the proposed algorithm employs the linear clustering method. The key issue is the frequency setting during execution.
The important parameters are listed in Table 1. Some parameters use notation similar to that of Zong [13]. The top level and bottom level of v_i can be calculated from the vector (V, E, C, T) of the DAG.
The EST of an entry task is defined as 0. The EST of all other tasks can be calculated in a top-down manner by recursively applying the following term on the right side.
The ECT of a task is the sum of its EST and its execution time. The worst slack time of task v_i can be calculated using the following equation.
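The timing parameters can be sketched in code. The version below is an illustration under stated assumptions rather than the paper's exact equations: it reads EST, ECT and LACT as earliest start time, earliest completion time and latest allowable completion time, computes LACT bottom-up from the makespan, and takes the worst slack time as WST = LACT − ECT. The function name `timing_parameters` and the example graph are hypothetical, and communication costs are used as given (tasks are treated as if on different processors).

```python
def timing_parameters(t, c, order):
    """Return EST, ECT, LACT and WST for every task of a DAG.

    t:     task id -> computation cost t_i
    c:     edge (i, j) -> communication cost c_ij
    order: task ids in topological order (entry first, exit last)
    """
    pred = {i: [] for i in t}
    succ = {i: [] for i in t}
    for (i, j) in c:
        succ[i].append(j)
        pred[j].append(i)

    est, ect = {}, {}
    for i in order:                          # top-down pass
        est[i] = max((ect[j] + c[(j, i)] for j in pred[i]), default=0)
        ect[i] = est[i] + t[i]

    makespan = max(ect.values())
    lact = {}
    for i in reversed(order):                # bottom-up pass
        # Successor j must start by LACT(j) - t_j, and its message needs c_ij.
        lact[i] = min((lact[j] - t[j] - c[(i, j)] for j in succ[i]),
                      default=makespan)

    wst = {i: lact[i] - ect[i] for i in t}   # assumed: WST = LACT - ECT
    return est, ect, lact, wst


# Hypothetical 4-task graph with entry v_1 and exit v_4.
t = {1: 2, 2: 3, 3: 4, 4: 1}
c = {(1, 2): 1, (1, 3): 2, (2, 4): 1, (3, 4): 3}
est, ect, lact, wst = timing_parameters(t, c, order=[1, 2, 3, 4])
```

For this graph only v_2 has slack (4 time units), because the path v_1, v_3, v_4 determines the finish time of 12.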
The critical path is the longest path from an entry node to an exit node, and it determines the makespan of the schedule. If LACT(v_i) = ECT(v_i), then v_i is a critical task. The critical tasks together form the critical path. Since the ECT and LACT of critical tasks are equal, they have no slack time in which to scale the processor frequency. Thus, dynamic frequency scaling is applied only to non-critical tasks. The shortest execution time of task v_i can be calculated from the following equation.
Task v_i attains its shortest execution time if the processor of its computing node runs at the highest frequency. In contrast, the longest execution time of task v_i can be calculated from the following equation.
If task v_i finishes within its longest execution time, the makespan is not affected. Therefore, the optimal frequency of non-critical tasks can be derived from the aforementioned parameters using the following equation.
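The frequency selection for a non-critical task can be sketched as follows. This is a hedged illustration, not the paper's formula: it assumes that execution time is inversely proportional to frequency, that t_i is the shortest execution time (at f_max), that the longest admissible execution time is t_i + WST, and that the processor offers a discrete set of frequency levels. The function name `optimal_frequency` and the frequency levels are hypothetical.

```python
def optimal_frequency(t_i, wst, f_max, levels):
    """Lowest available frequency that still finishes within t_i + WST.

    t_i:    execution time of the task at f_max (shortest execution time)
    wst:    worst slack time of the task
    levels: supported frequencies; must include f_max
    """
    if wst <= 0:                             # critical task: no slack to reclaim
        return f_max
    # Continuous optimum: stretch the task over t_i + WST.
    ideal = f_max * t_i / (t_i + wst)
    # Round up to a supported level so the task never overruns its slack.
    return min(f for f in levels if f >= ideal)


# Hypothetical frequency levels in GHz.
levels = [0.8, 1.2, 1.6, 2.0, 2.4]
f_noncritical = optimal_frequency(t_i=3, wst=4, f_max=2.4, levels=levels)
f_critical = optimal_frequency(t_i=4, wst=0, f_max=2.4, levels=levels)
```

Rounding up rather than down is a deliberate choice in this sketch: a slightly higher frequency wastes a little slack, whereas a lower one would lengthen the task beyond its longest execution time and affect the makespan.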
Second Phase Frequency Scaling
The optimal frequency for WST can be calculated before the parallel tasks start. However, WVST arises dynamically and is unknown before the tasks run. The WVST can be calculated only after a task has finished.
Due to various factors, the real execution time of a task is not always the t_i given in the DAG. Therefore, RCT is not necessarily equal to LACT. When RCT is less than LACT, the task completes ahead of schedule and has slack time. WVST can be calculated according to the following equation. During the WVST, the processor frequency of the computing node is scaled to the lowest level. When RCT is greater than LACT, the task has no slack time and may delay its successor tasks. To minimize the delay of the subsequent tasks, the processor frequency of SUCC(v_i) is scaled to f_max.
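The run-time decision described above can be sketched in a few lines. The code is an illustration under assumptions: it takes WVST = LACT − RCT, with RCT read as the real completion time of the finished task, and the function name `second_phase` and all values are hypothetical.

```python
def second_phase(task, rct, lact, succ, freq, f_min, f_max):
    """Adjust frequencies after `task` finishes at real completion time rct.

    lact: task id -> latest allowable completion time
    succ: task id -> list of immediate successors
    freq: task id -> planned frequency (updated in place for successors)
    Returns (wvst, idle_frequency); idle_frequency is None without slack.
    """
    wvst = lact[task] - rct                  # assumed: WVST = LACT - RCT
    if wvst > 0:
        return wvst, f_min                   # finished early: run the slack at f_min
    if wvst < 0:
        for j in succ[task]:                 # finished late: speed up successors
            freq[j] = f_max
    return 0, None


# Hypothetical state: task 2 has LACT 10 and one successor, task 4.
lact = {2: 10}
succ = {2: [4]}
freq = {4: 1.2}
early = second_phase(2, 8, lact, succ, freq, f_min=0.8, f_max=2.4)
freq_after_early = dict(freq)
late = second_phase(2, 11, lact, succ, freq, f_min=0.8, f_max=2.4)
```

Finishing at time 8 yields two units of slack at the lowest frequency and leaves the successor untouched, while finishing at time 11 raises the planned frequency of task 4 to f_max.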
Energy Aware Scheduling Algorithm
The proposed energy-aware scheduling algorithm NEASA distinguishes between static and dynamic slack time and works in two phases. First, the important parameters are calculated from the DAG, and the optimal frequency is derived from them. Second, at run time, once a task finishes, the dynamic workload-variation slack time is calculated. If WVST is greater than zero, the lowest allowable frequency is adopted. Otherwise, the maximum frequency is adopted by the successor tasks. The algorithm description is given in Figure 1.
The WST and WVST are calculated by traversing all tasks, so the implementation complexity is proportional to the number of tasks, that is, O(n).
Experimental Evaluation
The proposed dynamic clustering algorithm is implemented in C++, with a gigabit interconnect and a CPU frequency of 2.4 GHz. Both the impact of RCT and the overall energy efficiency are evaluated. The standard task sets [14] are adopted, as shown in Table 2.
Figure 2 shows the trend of the energy reduction rate when ECR is 10%, 30%, 50% and 70%, respectively. Both TDVAS and NEASA reduce energy consumption, and NEASA has the advantage over TDVAS. As ECR increases, the energy reduction of NEASA becomes more pronounced. The main reason is that TDVAS considers only static slack time, whereas NEASA considers both static and dynamic slack time. As ECR increases, WVST also increases, so NEASA saves more energy than TDVAS. In terms of makespan, NEASA, TDVAS and ETF show no difference.
Figure 3 shows the impact of LCR on energy and makespan. Compared with ETF, the energy consumption of both TDVAS and NEASA increases as LCR increases. The makespan also increases slowly, but NEASA shows a smaller proportional increase than TDVAS. When LCR is large, NEASA has an advantage over TDVAS. The main reason is that when some tasks complete late, NEASA scales the frequency of their successor tasks to the maximum, so the successor tasks execute quickly.
Overall Energy Efficiency
CCR is the ratio of communication time to computation time and is an important parameter. A computation-intensive application has a smaller CCR, whereas a communication-intensive application has a greater CCR. NEASA mainly focuses on optimizing computation energy. Therefore, CCR is varied from 0 to 1.
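As a small illustration of how CCR might be controlled in such an experiment, the sketch below computes CCR from the totals of the communication and computation costs and rescales the edge costs to reach a target value. Using totals rather than averages, and the function names `ccr` and `scale_to_ccr`, are assumptions made here for illustration.

```python
def ccr(t, c):
    """Communication-to-computation ratio, using totals over the task graph."""
    return sum(c.values()) / sum(t.values())


def scale_to_ccr(t, c, target):
    """Rescale all communication costs so that the graph has the target CCR."""
    factor = target * sum(t.values()) / sum(c.values())
    return {edge: cost * factor for edge, cost in c.items()}


# Hypothetical 4-task graph.
t = {1: 2, 2: 3, 3: 4, 4: 1}
c = {(1, 2): 1, (1, 3): 2, (2, 4): 1, (3, 4): 3}
original = ccr(t, c)
rescaled = ccr(t, scale_to_ccr(t, c, target=0.5))
```

The computation costs are left unchanged, so only the communication share of the workload varies between runs.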
ECR and LCR range from 10% to 70% and satisfy
ECR + LCR = 1    (15)
Figure 4 shows the makespan and energy comparison for the Random task sets with different CCR. NEASA reduces energy consumption by 4% on average compared with the other two algorithms. In terms of makespan, NEASA achieves results similar to the other two.
Figure 5 shows the comparison results for Sparse Matrix. In both makespan and energy consumption, NEASA has no obvious advantage over the other two algorithms. The main reason is that Sparse Matrix is a communication-intensive application, whereas the proposed algorithm focuses on reducing computation energy. Therefore, NEASA is not well suited to communication-intensive applications.
Figure 6 shows the makespan and energy comparison for SPEC fpppp, which is a computation-intensive application. NEASA significantly reduces energy compared with the other two algorithms, and the makespan is also reduced to a great extent. Energy consumption is reduced by 12% and 9.1% on average compared with ETF and TDVAS, respectively. NEASA reduces the makespan by 10.8% compared with TDVAS, mainly because it scales the processor frequency dynamically at run time. The proposed algorithm has a great advantage for computation-intensive applications.
Figure 6. Make Span and Energy Comparison for SPEC fpppp
To evaluate the proposed algorithm for synthetic applications, the makespan and energy comparison results are shown in Figure 7. NEASA reduces energy by 8.3% and 5.6% on average compared with ETF and TDVAS, respectively. The makespan is reduced by 3.2% on average compared with TDVAS.
Conclusion and Future Work
To reduce the energy consumption of precedence-constrained parallel tasks, an energy-aware algorithm (named NEASA) based on two-phase frequency scaling is proposed, which reclaims both static and dynamic slack time and employs different frequency
