Abstract - High energy cost has become a salient constraint on the next generation of multicore-based supercomputers. One approach with the potential to conserve energy is to reduce the number of resources allocated to a given parallel application. However, this approach raises the concern that bounding resources may adversely affect performance. In this paper, we demonstrate that using bounded resources to execute dependent parallel tasks on multicore systems can conserve energy with only minor performance degradation. We achieve this goal by proposing BREES, an energy-efficient scheduling algorithm for multicore systems with bounded resources. BREES combines Dynamic Voltage Scaling (DVS) with a task duplication strategy. In addition, a dynamic waiting window (DWW) is implemented in BREES to handle system hardware heterogeneity. We evaluate the effectiveness of BREES through a series of experiments using both real-world and synthetically generated parallel applications on fifteen different multicore processors and four well-known high-speed networks.
I. INTRODUCTION
Excessive power consumption has become a significant challenge for supercomputers that run a large number of scientific or commercial applications on a daily basis. For example, the Environmental Protection Agency (EPA) reported that in 2006 the total energy consumption of servers and data centers in the United States was 61.4 billion kWh, which was almost equal to the total electricity consumed by 5.8 million U.S. households [1]. As a result, energy efficiency has become as important as performance for the next generation of multicore-based high performance computing (HPC) systems.
A number of research projects have proposed different ways of reducing the power cost of HPC systems. For example, the Green Destiny project at Los Alamos National Laboratory uses low-frequency, low-power processors with modest performance to save energy [2]. Another well-known idea is to use Dynamic Voltage/Frequency Scaling (DVFS) technology in power-scalable clusters, in which the power level is scaled down when processors are not fully utilized and scaled up when processors are busy. Intel's SpeedStep technology [3] and AMD's PowerNow! and Cool'n'Quiet technologies [4] are typical examples of the DVFS approach. Researchers at Princeton University investigated the possibility of applying DVFS technology to interconnections [5]. In addition, researchers from Duke University, HP Labs, and Arizona State University proposed a series of cyber-physical approaches to reduce the power cost of the cooling system [1] [6].
Most existing energy-aware systems and algorithms focus primarily on improving the energy efficiency of system hardware (e.g., low-power processors, low-power networks, and low-power cooling systems). The impact of software optimization (e.g., scheduling) on energy efficiency needs to be explored further. One approach with the potential to conserve energy in future multicore systems is to reduce the number of resources allocated to a given parallel application. However, this approach raises concerns about its potential negative influence on performance. There is no doubt that the performance of some applications will be affected if bounded resources are used, but this is not the case for all parallel applications. The Fpppp application shown in Fig. 1 is a counterexample (please refer to Section II for details).
In this paper, we investigate the possibility of executing the same parallel application using a reduced number of resources and analyze the corresponding impact on performance and energy consumption. Our experimental results show that, when parallel applications with task dependencies and imbalanced workloads are scheduled using our proposed bounded resources energy-efficient scheduling (BREES) algorithm, bounded resources cause only minor performance degradation while yielding noticeable energy savings.
II. RELATED WORK
Current energy-efficiency oriented techniques can be classified into two categories. Techniques in the first category conserve energy by dynamically reducing the voltage or frequency of cores when the system's computing resources are not fully utilized. Previous studies [8] [21] have reported that DVS technology can reduce energy consumption in HPC systems. In the second category, energy consumption is controlled by distributing the workload over the system. By constantly relocating the heat-generating activities, a relatively low average energy consumption is achieved. Previous work using the second strategy includes migration at the granularity of functional units [22] [23], pipelines [24], cache banks [25], execution clusters [26], and core hopping [27] [28]. Although DVS is capable of reducing the energy consumption of processors, its benefits may diminish when the energy consumed by interconnects dominates the total power consumption. Task duplication has been shown to be an effective approach to reducing the wait time of ready tasks [29] [30] [31], thereby reducing the energy consumed by interconnections [32] [33] [34].
To reduce both processor-side and network-side energy with minimal influence on performance, our proposed BREES algorithm takes advantage of both DVS and task duplication. Specifically, a processor operates at the highest voltage as long as a ready task is waiting in its processing queue. BREES immediately switches a processor to the lowest voltage once no task is waiting or no task is ready for execution. This policy ensures that tasks are executed as quickly as possible. In addition, tasks on the critical path are duplicated provided that the replicas introduce no significant energy overhead (please refer to our previous work [32] for details).
The other primary weakness of existing scheduling algorithms is that they assume unbounded resources are available in the system. This assumption is unrealistic because the number of required processing units (e.g., cores) grows significantly with the number of tasks. For example, the scheduling algorithm presented in [31] requires 36 cores to run the Robot Control application [35], which has a total of 88 tasks. However, the number of cores required grows to 222 for Fpppp (334 tasks) [36] and 456 for our randomly generated DAG (1000 tasks). We also noticed that the workload among these processors (i.e., cores) is highly imbalanced. Fig. 1 shows the workload of each core when running the Fpppp application. While the workloads of some cores (e.g., cores 1, 13, 14, and 15) are extremely high, 90% of the cores sit idle most of the time. These observations indicate that some parallel applications can be executed using fewer resources with great energy savings and minor performance degradation, provided that the scheduler achieves good load balancing.
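The voltage policy described above can be sketched in a few lines. This is a minimal illustration rather than the BREES implementation: the names `Voltage` and `select_voltage` are hypothetical, and only two voltage levels are modeled because the text mentions only the highest and the lowest.

```python
from enum import Enum


class Voltage(Enum):
    """Hypothetical two-level voltage model (only the extremes are used)."""
    LOWEST = "lowest"
    HIGHEST = "highest"


def select_voltage(ready_tasks_waiting: int) -> Voltage:
    """Run at the highest voltage while any ready task is queued,
    and drop to the lowest voltage as soon as the queue is empty."""
    return Voltage.HIGHEST if ready_tasks_waiting > 0 else Voltage.LOWEST
```

A call such as `select_voltage(0)` models a processor with nothing ready to run, which is the case in which the policy switches it to the lowest voltage.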
III. MODELS AND NOTATIONS

A. System Model and Task Graph Model
A multicore-based HPC system can be modeled as a set of processing units $PU = \{pu_1, pu_2, \ldots, pu_m\}$, where $pu_i$ is the $i$th processing unit. All processing units (i.e., cores) are fully connected with dedicated and reliable interconnections. They are able to exchange data via core-level communication (shared memory model) or node-level communication (distributed memory model). The processing units can be performance asymmetric, meaning that the execution time of the same task on different processing units may vary. This is because the processing units may have different clock speeds and processing capabilities. In addition, we assume that the communication time between two tasks assigned to the same processing unit is negligible.
Parallel applications with precedence-constrained tasks can be represented in the form of DAGs and modeled as a pair $(V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ represents a set of dependent tasks, and $E$ denotes a set of directed edges representing communications and precedence constraints among parallel tasks. For each task $v_i$ in $V$, $t_i$ is the execution time of $v_i$, $1 \le i \le n$. Similarly, $c_{ij}$ is the communication delay between tasks $v_i$ and $v_j$. Note that $c_{ij}$ is set to zero if $v_i$ and $v_j$ are assigned to the same processing unit. Table 1 summarizes the definitions of several notations that are used in our proposed BREES algorithm. Note that these notations have been used in several publications related to DAG scheduling [30].
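The task graph model can be captured with a small data structure. The sketch below is illustrative only: the class name `TaskGraph`, its field names, and the `assignment` mapping (task to processing unit index) are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TaskGraph:
    """DAG (V, E): exec_time holds t_i per task, comm holds c_ij per edge."""
    exec_time: Dict[int, float]
    comm: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def predecessors(self, v: int) -> List[int]:
        return [i for (i, j) in self.comm if j == v]

    def successors(self, v: int) -> List[int]:
        return [j for (i, j) in self.comm if i == v]

    def comm_delay(self, i: int, j: int, assignment: Dict[int, int]) -> float:
        """Return c_ij, or zero when both tasks run on the same processing unit."""
        if assignment[i] == assignment[j]:
            return 0.0
        return self.comm[(i, j)]
```

Storing edges as a dictionary keyed by `(i, j)` keeps the precedence constraints and the communication delays in one place, which mirrors the dual role of $E$ in the model.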
B. Definition and Notation
The earliest completion time of task $v_i$ is the sum of its earliest start time and its execution time. Thus, we have

$$ECT(v_i) = EST(v_i) + t_i. \qquad (3)$$
Allocating task $v_i$ and its favorite predecessor $FP(v_i)$ to the same computational node can lead to a shorter schedule length. As such, the favorite predecessor $FP(v_i)$ is defined as below
As shown by the first term on the right-hand side of Eq. (5), the latest allowable completion time of the exit task equals its earliest completion time. The latest allowable completion times of all the other tasks are calculated in a top-down manner by recursively applying the second term on the right-hand side of Eq. (5).
The latest allowable start time of task $v_i$ is derived from its latest allowable completion time and its execution time. Hence, $LAST(v_i)$ can be written as

$$LAST(v_i) = LACT(v_i) - t_i. \qquad (6)$$
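To make the relationships among these quantities concrete, the following sketch computes EST, ECT, FP, LACT, and LAST for a small DAG. Only three relations come from the text above: ECT is EST plus execution time, the LACT of the exit task equals its ECT, and LAST is LACT minus execution time. The rules used for EST, FP, and the LACT of non-exit tasks are assumptions borrowed from the common formulation in the task duplication scheduling literature and may differ from the exact equations used by BREES. The function name and argument layout are hypothetical, and task ids are assumed to be numbered in topological order.

```python
def timing_parameters(t, c):
    """t: {task: execution time}; c: {(i, j): communication delay} per edge i -> j.
    Task ids are assumed to be numbered in topological order."""
    order = sorted(t)
    preds = {v: [i for (i, j) in c if j == v] for v in order}
    succs = {v: [j for (i, j) in c if i == v] for v in order}

    est, ect, fp = {}, {}, {}
    for v in order:
        if not preds[v]:
            est[v], fp[v] = 0.0, None  # entry task
        else:
            # ASSUMPTION: FP(v) is the predecessor whose data would arrive last.
            fp[v] = max(preds[v], key=lambda p: ect[p] + c[(p, v)])
            others = [ect[p] + c[(p, v)] for p in preds[v] if p != fp[v]]
            # ASSUMPTION: co-locating v with FP(v) removes that one delay.
            est[v] = max([ect[fp[v]]] + others)
        ect[v] = est[v] + t[v]  # from the text: ECT = EST + t

    lact, last = {}, {}
    for v in reversed(order):
        if not succs[v]:
            lact[v] = ect[v]  # from the text: exit task, LACT = ECT
        else:
            # ASSUMPTION: a successor co-located with v (v is its FP) needs no delay.
            lact[v] = min(
                last[s] if fp[s] == v else last[s] - c[(v, s)] for s in succs[v]
            )
        last[v] = lact[v] - t[v]  # from the text: LAST = LACT - t
    return est, ect, fp, lact, last
```

For a four-task diamond with `t = {1: 2, 2: 3, 3: 4, 4: 1}` and `c = {(1, 2): 1, (1, 3): 5, (2, 4): 2, (3, 4): 3}`, the sketch selects task 3 as the favorite predecessor of task 4, because the data from task 3 would otherwise arrive last.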
The communication-to-computation ratio (CCR) of a parallel application is defined as the ratio between its average communication cost and its average computation cost. Formally, the CCR of a DAG $(V, E)$ is given by Eq. (7):

$$CCR(V, E) = \frac{\frac{1}{|E|}\sum_{(v_i, v_j) \in E} c_{ij}}{\frac{1}{|V|}\sum_{v_i \in V} t_i}. \qquad (7)$$
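As a quick numerical check of this definition, the helper below computes the CCR from the same dictionaries of execution times and communication delays used to describe a DAG. The function name `ccr` and the input layout are assumptions made for illustration.

```python
def ccr(t, c):
    """t: {task: execution time}; c: {(i, j): communication delay}.
    Returns average communication cost divided by average computation cost."""
    avg_comm = sum(c.values()) / len(c)
    avg_comp = sum(t.values()) / len(t)
    return avg_comm / avg_comp
```

With two tasks of execution times 2 and 4 joined by one edge of delay 6, the average communication cost is 6 and the average computation cost is 3, so the CCR is 2.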
IV. THE BREES ALGORITHM
A. Critical Paths Analysis
A DAG may have several critical paths. A critical path contains the most time-consuming sequence of tasks that must be executed sequentially even when parallelism is available. All critical paths are identified in the following three steps. First, the level of each task is calculated (Eq. 1), and the tasks are sorted in ascending order of level to form an original task sequence. Second, the EST, ECT, FP, LAST, and LACT of each task are calculated based on Eqs. 2-6. Third, starting from the first task of the original task sequence generated in step 1, the scheduler scans up the DAG by following the FP of each intermediate task until it reaches the entry task. All tasks along the way are marked as visited, and together they form a critical path. In the next iteration, the scheduler starts from the first unvisited task in the original task sequence and finds the next critical path in the same way. This process continues until all tasks have been marked as visited, at which point all critical paths have been found.
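The third step can be sketched as follows. The sketch assumes that the levels and favorite predecessors have already been computed and are passed in as dictionaries; the function name and the input layout are hypothetical. It also assumes that a walk continues all the way to the entry task even through tasks that were visited earlier, so a task may appear on more than one path.

```python
def critical_paths(level, fp):
    """level: {task: level}; fp: {task: favorite predecessor, None for the entry task}.
    Returns the critical paths, each listed from the entry task downward."""
    sequence = sorted(level, key=lambda v: level[v])  # original task sequence
    visited, paths = set(), []
    for start in sequence:
        if start in visited:
            continue  # already covered by an earlier critical path
        path, v = [], start
        while v is not None:  # follow FP links up to the entry task
            path.append(v)
            visited.add(v)
            v = fp[v]
        paths.append(list(reversed(path)))
    return paths
```

Because shared ancestors are revisited, two paths can both begin with the entry task, which is why redundant tasks must later be removed when several paths land on the same processing unit.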
B. Making Duplication Decisions
BREES duplicates tasks whenever duplication can improve performance. The corresponding pseudocode is shown in Fig. 2.
C. Considering Resource Constraints
The next step is to take resource constraints into consideration. Suppose there are $m$ processing units $PU = \{pu_1, pu_2, \ldots, pu_m\}$ available for a parallel application consisting of $n$ critical paths $CP = \{cp_1, cp_2, \ldots, cp_n\}$. When $m$ is equal to or greater than $n$ (i.e., unbounded resources are available), BREES simply allocates each critical path to its own processing unit to achieve the best performance. However, when $m$ is less than $n$ (i.e., only limited resources are available), BREES first assigns the first $m$ critical paths to the $m$ available processing units. Once a critical path has been assigned to a processing unit, the earliest available time of this processing unit is calculated and referred to as the base time. BREES has previously determined the average execution time (AET) of all tasks in the DAG using the equation

$$AET = \frac{1}{|V|}\sum_{v_i \in V} t_i.$$

A multiple of AET is added to the base time to create a dynamic waiting window (DWW). The multiplier is the ratio of the speed of the fastest PU, $S_{max}$, to the speed of the first available PU. For instance, if $S_{max}$ is 4 GHz and the speed of the first available PU at time $t_0$ is 2 GHz, the ratio is 2. In other words, if we allocate the task set to the first available PU, it will take twice as long to complete as it would on the fastest processing unit. Therefore, it may be worth waiting for a faster PU to become available. If the ratio is equal to 1, which means the first available PU is the fastest PU, the task set is allocated to that PU immediately without creating a DWW. Otherwise, a DWW is created as $(t_0, t_0 + AET \times ratio)$. Processing units whose earliest available time falls within this window are added to the pool of available processing units. All processing units in this pool are then sorted by energy efficiency, and BREES assigns the tasks on the next waiting critical path to the most energy-efficient processing unit.
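The window construction can be sketched as below. Several details are assumptions made for illustration: the `PU` record and its field names are hypothetical, the `energy_efficiency` score is a placeholder for whatever metric BREES uses to rank processing units (higher is taken to be better), and the first available PU is treated as a member of the pool.

```python
from dataclasses import dataclass


@dataclass
class PU:
    name: str
    speed_ghz: float
    available_at: float       # earliest available time (base time)
    energy_efficiency: float  # hypothetical score, higher is better


def pick_pu(pus, aet):
    """Choose a processing unit for the next waiting critical path."""
    s_max = max(p.speed_ghz for p in pus)
    first = min(pus, key=lambda p: p.available_at)
    ratio = s_max / first.speed_ghz
    if ratio == 1:
        return first  # the first available PU is already the fastest
    t0 = first.available_at
    window_end = t0 + aet * ratio  # DWW spans t0 to t0 + AET * ratio
    pool = [p for p in pus if t0 <= p.available_at <= window_end]
    return max(pool, key=lambda p: p.energy_efficiency)
```

With a 2 GHz unit free at time 10, a 4 GHz unit free at time 14, and an AET of 5, the ratio is 2 and the window ends at 20, so the faster unit joins the pool and can be chosen if it is the more energy-efficient of the two.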
D. Dependency Violation and Task Redundancy
The last step is to avoid task dependency violations when multiple critical paths are allocated to the same processing unit, and to avoid redundant execution of the same task. For example, if the critical path Task 1 → Task 3 → Task 7 → Task 9 → Task 10 is scheduled before the critical path Task 1 → Task 4, then task 7 may violate its dependency on task 4. In addition, task 1 would be executed twice, which wastes system resources. To solve the dependency violation and task redundancy problems, BREES checks the task list of a processing unit at runtime before assigning new tasks to it. Redundant tasks are removed, and tasks are reordered if any dependency violations are found. In the example above, task 1 is not assigned again because it has already been executed by the same processing unit, and task 4 is moved ahead of task 7 because task 7 depends on task 4.
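The check described above can be sketched as a merge step that drops duplicates and then restores a dependency-respecting order. This is one possible realization, not necessarily the procedure used by BREES; the function name `merge_path` and the `preds` mapping (task to list of predecessors) are assumptions, and the graph is assumed to be acyclic.

```python
def merge_path(task_list, new_path, preds):
    """Add new_path to a processing unit's task list, skipping tasks that are
    already present and reordering so each task follows its local predecessors."""
    merged = list(task_list)
    for v in new_path:
        if v not in merged:  # redundancy check
            merged.append(v)

    ordered, placed = [], set()

    def place(v):
        if v in placed:
            return
        for p in preds.get(v, ()):
            if p in merged:  # only predecessors on this processing unit matter
                place(p)
        placed.add(v)
        ordered.append(v)

    for v in merged:
        place(v)
    return ordered
```

Applied to the example in the text, merging the path 1 → 4 into the list 1, 3, 7, 9, 10 keeps a single copy of task 1 and moves task 4 ahead of task 7.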
V. EXPERIMENTAL RESULTS
A. Processor and Network Profiles
To evaluate the effectiveness of our proposed BREES algorithm, we use fifteen different CPUs and four commonly used high-speed interconnects to simulate a highly heterogeneous multicore system. Fig. 3 shows the power consumption rate of each processor in idle and busy working modes, as measured by the Xbits lab. In our simulator, we select 9 CPUs (including both AMD and Intel CPUs with quad-core, dual-core, and single-core designs) to form a clustered multicore system with a total of 27 processing units (i.e., cores). We record how much time each processing unit is busy (executing a task) or idle (waiting for other tasks to complete). The processor-side energy consumption $E$ is calculated as $E = P_{idle} \times T_{idle} + P_{busy} \times T_{busy}$, where $P_{idle}$ is the idle power rate, $T_{idle}$ is the idle time, $P_{busy}$ is the busy power rate, and $T_{busy}$ is the busy time. Table 2 summarizes the configuration profiles for the Gigabit Ethernet, Infiniband, Myrinet, and QsNetII networks. These four interconnects are widely used in modern multicore HPC systems.
Fig. 3. Idle and busy power consumption rates of the processors, including the Intel Pentium E5200, Intel Celeron 450, Intel Celeron E1600, and Intel Celeron E3300.
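The processor-side energy formula translates directly into code. In the sketch below, the function names and the tuple layout are assumptions; power is taken to be in watts and time in seconds, which gives energy in joules.

```python
def processor_energy(p_idle, t_idle, p_busy, t_busy):
    """E = P_idle * T_idle + P_busy * T_busy for one processing unit."""
    return p_idle * t_idle + p_busy * t_busy


def system_energy(units):
    """Sum the energy of all units; each entry is (p_idle, t_idle, p_busy, t_busy)."""
    return sum(processor_energy(*u) for u in units)
```

For example, a unit that idles at 10 W for 4 s and runs at 50 W for 6 s consumes 10 × 4 + 50 × 6 = 340 J.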
B. Evaluation Metrics
We compare the energy consumption and performance of the proposed BREES algorithm with four well-known scheduling algorithms: the Modified Critical Path (MCP) scheduling algorithm, the Task Duplication Scheduling (TDS) algorithm, the Round-Robin (RR) scheduling algorithm, and the First Available (FA) scheduling algorithm. The energy consumption includes both processor energy and network energy. The schedule length is the total execution time of a parallel application.
We use two real-world DAG applications, the Robot Control application (88 tasks, 131 edges) and the Fpppp application (334 tasks, 1145 edges), and one synthetically generated application (500 tasks) to compare the performance and energy efficiency of the aforementioned algorithms.
C. Performance versus Energy Efficiency
We observe from Fig. 4 that the BREES algorithm consumes the least energy in all three applications. For example, running the Fpppp application with BREES saves 75.59% of the energy consumed with MCP, at the cost of only 8.81% performance degradation (as shown in Figs. 4(a) and 4(b)). For the Robot Control application, the energy consumption decreases by 26.86% and the performance degradation is only 4.81% (as shown in Figs. 4(c) and 4(d)). Fig. 4(f) shows that BREES also consumes the least energy when running a randomly generated application with 500 tasks, a 13.52% energy reduction over MCP. In this case, BREES has almost the same performance as MCP, as shown in Fig. 4(e).
We also noticed that the energy efficiency of BREES is affected by the parallel application to which it is applied. The Robot Control application, for example, is communication intensive, while the Fpppp application is computation intensive. The BREES algorithm demonstrates the ability to balance performance and energy regardless of the communication-to-computation ratio (CCR).
D. Load Balancing Efficiency
As mentioned in Section I, excellent load balancing is critical for minimizing performance degradation when we allocate only limited resources to an application. Our simulation contains a combination of single-core, dual-core, quad-core, and six-core chips with a total of 27 processing units (i.e., cores), which we believe represents an extremely heterogeneous multicore system. We execute the Fpppp application and plot the actual workload of each processing unit in Fig. 5. Since the TDS and MCP algorithms require many more processing units (222), we do not include their results. Note that both Round-Robin (RR) scheduling and First Available (FA) scheduling support bounded-resource scheduling. Given a limited number of processing units, the RR scheduling algorithm assigns critical paths to successive processing units in a round-robin manner without considering their current workload. In contrast, when resources are constrained, the FA algorithm assigns the next waiting critical path to the first available processing unit.
We can observe from Figs. 5(a), (b), and (c) that the BREES algorithm achieves better load balancing than RR and FA. Fig. 5(d) shows that BREES reduces the standard deviation of processor loads by 45.8% compared with the RR algorithm and by 42.2% compared with the FA algorithm.
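The load balancing comparison rests on the standard deviation of per-unit workloads. A minimal sketch of how such a reduction could be computed is given below; the function name is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption, since the text does not state which one was used.

```python
from statistics import pstdev


def load_std_reduction(baseline_loads, new_loads):
    """Percentage reduction in the standard deviation of per-unit loads
    when moving from a baseline scheduler to a new one."""
    base = pstdev(baseline_loads)
    new = pstdev(new_loads)
    return 100.0 * (base - new) / base
```

A lower standard deviation means the work is spread more evenly across processing units, so a positive return value indicates that the new schedule is better balanced than the baseline.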
E. Impact of Interconnections
In this subsection, we evaluate the effectiveness of BREES over different interconnections. The results plotted in Fig. 6 are based on QsNet, Myrinet, Infiniband, and Gigabit Ethernet. Figs. 6(a) and 6(b) show the overall performance and energy efficiency of clusters equipped with the four interconnections. The results show that, regardless of the algorithm, Ethernet consumes more energy than the other interconnections because it takes longer to transmit a message. Figs. 6(c)-6(f) illustrate that BREES greatly improves energy efficiency with only a slight degradation of performance. For example, BREES achieves a 76.19% reduction in energy over TDS (when using Myrinet) with only 8.81% performance degradation.
VI. CONCLUSIONS
In this paper, we study how executing dependent parallel applications with limited resources affects performance and energy consumption. Our experiments show that, using our proposed bounded resources energy-efficient scheduling (BREES) algorithm, we can achieve great energy savings with minor performance degradation when running load-imbalanced parallel applications on multicore systems with bounded resources.
Future work can proceed in the following directions. First, the impact of memory access and I/O activities on performance and energy is not considered here and needs further investigation. Second, the effectiveness of the proposed BREES algorithm needs to be evaluated for applications with relatively balanced workloads.
