Abstract. In this paper, we consider power-aware task scheduling (PATS) in HPC clouds. Users request virtual machines (VMs) to execute their tasks. Each task is executed on a single VM and requires a fixed number of cores (i.e., processors), a computing power (million instructions per second, MIPS) for each core, and a fixed start time, and it runs without preemption for a fixed duration. Each physical machine has a limited number of processors (cores), and each core has limited computing power. The energy consumption of each placement is measured for cost calculation. The power consumption of a physical machine is linear in its CPU utilization. We want to minimize the total energy consumption of the task placements. We propose a genetic algorithm (GA) to solve the PATS problem. The GA is developed in two versions: (1) BKGPUGA, which is implemented using NVIDIA's Compute Unified Device Architecture (CUDA) framework; and (2) SGA, which is a serial GA version on the CPU. The experimental results show that the BKGPUGA program executed on a single NVIDIA® Tesla™ M2090 GPU card (512 cores) obtains significant speedups compared to the SGA program executed on an Intel® Xeon® E5-2630 (2.3 GHz) for the same input problem size. Both versions share the same GA parameters (e.g., number of generations, crossover and mutation probabilities), and the difference between the fitness values of BKGPUGA and SGA is relatively small ($10^{-11}$). Moreover, the proposed BKGPUGA program can handle large-scale task scheduling problems with scalable speedup, within the limitations of the GPU device (e.g., device memory, number of GPU cores).
INTRODUCTION
Cloud platforms have become increasingly popular for provisioning computing resources, under the virtual machine (VM) abstraction, to high performance computing (HPC) users who run their applications. An HPC cloud is such a cloud platform. Keqin Li [1] presented task scheduling problems and power-aware scheduling algorithms on multiprocessor computers. Here we consider the power-aware task scheduling (PATS) problem in the HPC cloud. The challenge of the PATS problem is the trade-off between minimizing energy consumption and satisfying Quality of Service (QoS) (e.g., performance or on-time resource availability for reservation requests).
Genetic algorithms (GAs) have been proposed to solve task scheduling problems [2]. Moreover, the GA is one of the evolution-inspired algorithms used in green computing [3]. The PATS problem with N tasks (each task requires a VM) and M physical machines has $M^N$ possible placements. Therefore, as the problem size grows, the computation time these algorithms need to find an optimal or satisfactory solution becomes unacceptable.
GPU computing has become a popular programming model for achieving high performance on data-parallel applications. NVIDIA introduced the CUDA parallel computing framework, in which a CUDA program can run on GeForce®, Quadro®, and Tesla® products. The latest Tesla® architecture is designed for parallel and high performance computing: each GPU card has hundreds of CUDA cores and delivers multiple teraflops. For example, a Tesla K10 GPU Accelerator with dual GPUs reaches 4.58 teraflops peak single precision [4]. The study of genetic algorithms on GPUs has therefore become an active research topic, and many previous works have proposed GAs on GPUs [5] [6] [7] [8]. However, none of these works has studied the PATS problem. In this paper, we propose BKGPUGA, a GA implemented in the CUDA framework and compatible with the NVIDIA Tesla architecture, to solve PATS problems. BKGPUGA applies the same genetic operations (crossover, mutation, and selection) and the same fitness evaluation to the whole population in each generation, using a data-parallel model on hundreds of concurrent CUDA threads.
Problem Formulation
We describe the notations used in this paper as follows:
$r_j(t)$: Set of indexes of tasks that are allocated on $M_j$ at time $t$
$mips_{i,c}$: Allocated MIPS of the $c$-th processing element (PE) to $T_i$ by $M_j$
$MIPS_{j,c}$: Total MIPS of the $c$-th processing element (PE) on $M_j$
We assume that the total power consumption of a single physical machine ($P(\cdot)$) has a linear relationship with its CPU utilization ($U_{cpu}$), as mentioned in [9]. We calculate the CPU utilization of a host as the sum of the CPU utilization of its $PE_j$ cores:
Total power consumption of a single host (P(.)) at time t is calculated:
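A sketch of the two quantities just described, written in the notation above. It assumes that per-core utilization is the ratio of allocated to total MIPS, and it introduces two symbols that are not defined in the text: $P_{idle}$ and $P_{max}$, the power drawn by the host at 0% and 100% CPU utilization. Whether the sum over cores is further normalized by $PE_j$ is not stated here, so this is one plausible reading rather than the definitive form:

$$U_{cpu}(t) = \sum_{c=1}^{PE_j} \sum_{i \in r_j(t)} \frac{mips_{i,c}}{MIPS_{j,c}}$$

$$P(U_{cpu}) = P_{idle} + (P_{max} - P_{idle}) \, U_{cpu}$$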
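The energy consumption of a host ($E_i$) in the time period $[t_i, t_{i+1}]$ is defined by:
$$E_i = \int_{t_i}^{t_{i+1}} P(U_{cpu}(t)) \, dt$$
In this paper, we assume that $\forall t \in [t_i, t_{i+1}]$: $U_{cpu}(t)$ is constant ($u_i$), then:
$$E_i = P(u_i) \, (t_{i+1} - t_i)$$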
Therefore, we obtain the total energy consumption ($E$) of a host during its operation time $\bigcup_i [t_i, t_{i+1}]$:
$$E = \sum_i E_i$$
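The piecewise-constant energy sum can be illustrated with a short Python sketch. The linear power model uses two assumed parameters, `p_idle` and `p_max` (watts at 0% and 100% utilization), which are not named in the text; the interval data and the function names are hypothetical as well.

```python
def linear_power(u, p_idle, p_max):
    """Power (watts) at CPU utilization u in [0, 1], assuming a linear model."""
    return p_idle + (p_max - p_idle) * u


def host_energy(intervals, p_idle, p_max):
    """Sum P(u_i) * (t_{i+1} - t_i) over (t_start, t_end, u_i) intervals.

    Utilization is assumed constant inside each interval.
    """
    return sum(
        linear_power(u, p_idle, p_max) * (t_end - t_start)
        for t_start, t_end, u in intervals
    )


# Hypothetical host: 100 W when idle, 200 W at full load.
intervals = [(0.0, 10.0, 0.5), (10.0, 30.0, 0.25)]
energy = host_energy(intervals, p_idle=100.0, p_max=200.0)
print(energy)  # watt-seconds (joules) if time is in seconds
```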
We consider power-aware task scheduling (PATS) in the high performance computing (HPC) cloud. We formulate the PATS problem as follows:
Given a set of n independent tasks to be placed on a set of m physical machines. Each task is executed on a single VM.
The set of n tasks is denoted as:
The set of m physical machines is denoted as:
$$M = \{M_j(PE_j, MIPS_j, RAM_j, BW_j) \mid j = 1, \dots, m\}$$
The $i$-th task is executed on a single virtual machine ($VM_i$) that requires $pe_i$ processing elements (cores), $mips_i$ MIPS, $ram_i$ MBytes of physical memory, and $bw_i$ Kbits/s of network bandwidth. $VM_i$ is started at time $ts_i$ and finishes at time $ts_i + d_i$, with neither preemption nor migration during its duration ($d_i$). We consider three types of computing resources: processors, physical memory, and network bandwidth. We assume that every $M_j$ can run any VM and that the power consumption model ($P_j(t)$) of $M_j$ has a linear relationship with its CPU utilization, as described in formula (2). The objective of scheduling is to minimize the total energy consumption while fulfilling the maximum resource requirements of the n tasks (and VMs) under the following constraints:
Constraint 1: Each task is executed on a VM that is run by a physical machine (host).
Constraint 2: No task requests any resource larger than the total capacity of the host's resource.
Constraint 3: Let $r_j(t)$ be the set of indexes of tasks that are allocated to a host $M_j$ at time $t$. The total resource demand of these allocated tasks is less than or equal to the total capacity of the resource of $M_j$. For each $c$-th processing element of a physical machine $M_j$ ($j = 1, \dots, m$):
$$\sum_{i \in r_j(t)} mips_{i,c} \le MIPS_{j,c}$$
For the other resources of $M_j$, such as physical memory (RAM) and network bandwidth (BW):
$$\sum_{i \in r_j(t)} ram_i \le RAM_j, \qquad \sum_{i \in r_j(t)} bw_i \le BW_j$$
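The capacity constraints can be checked per host with a sketch like the following. The `Task` and `Host` records and the function names are hypothetical. The sketch also simplifies the per-core condition by assuming that each VM core is pinned to its own host core, so it compares core counts and per-core MIPS instead of tracking individual $mips_{i,c}$ allocations.

```python
from collections import namedtuple

# Hypothetical records mirroring the symbols pe, mips, ram, bw, ts, d.
Task = namedtuple("Task", "pe mips ram bw ts d")
Host = namedtuple("Host", "pe mips ram bw")


def active(task, t):
    """True if the task occupies its host at time t."""
    return task.ts <= t < task.ts + task.d


def host_is_feasible(host, tasks):
    """Check Constraints 2 and 3 for the tasks placed on one host."""
    # Constraint 2: no single task exceeds the host's capacity.
    for task in tasks:
        if (task.pe > host.pe or task.mips > host.mips
                or task.ram > host.ram or task.bw > host.bw):
            return False
    # Constraint 3: demand can only rise at a start time, so it is
    # enough to test the load at every distinct start time.
    for t in sorted({task.ts for task in tasks}):
        running = [task for task in tasks if active(task, t)]
        if (sum(x.pe for x in running) > host.pe
                or sum(x.ram for x in running) > host.ram
                or sum(x.bw for x in running) > host.bw):
            return False
    return True


host = Host(pe=4, mips=2000, ram=8192, bw=1000)
a = Task(pe=2, mips=1000, ram=4096, bw=100, ts=0, d=10)
b = Task(pe=2, mips=1000, ram=4096, bw=100, ts=5, d=10)
c = Task(pe=2, mips=1000, ram=1024, bw=100, ts=8, d=4)
print(host_is_feasible(host, [a, b]), host_is_feasible(host, [a, b, c]))
```

Because tasks have fixed start times and durations, two tasks that never overlap in time can share the same cores, which is why the check is made per time point and not over the whole task set.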
HPC applications have various sizes, require multiple cores, and are submitted to the system at a dynamic arrival rate [10]. An HPC application can request several VMs.
Genetic Algorithm for Power-Aware Task scheduling
Data structures
The CUDA framework only supports array data structures. Arrays are therefore an easy way to transfer data between host memory and the GPU. Each chromosome is a mapping of tasks to physical machines, where each task requires a single VM. Fig. 1 presents part of a sample chromosome with six tasks (each task is executed on a single VM): the task with ID=0 is allocated to machine 5, the task with ID=1 is allocated to machine 7, etc.
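A minimal Python sketch of this encoding, assuming a chromosome is an integer array indexed by task ID and that the population is flattened into one contiguous array before being copied to the GPU. The flattening layout and the function names are assumptions, and only the first two genes of the sample come from the text.

```python
import random


def random_chromosome(n_tasks, n_machines, rng):
    """Gene k holds the index of the machine that runs task k."""
    return [rng.randrange(n_machines) for _ in range(n_tasks)]


def flatten_population(population):
    """Concatenate chromosomes into one flat array (row-major layout)."""
    return [gene for chromosome in population for gene in chromosome]


def gene_at(flat, n_tasks, chromosome_idx, task_idx):
    """Look up one gene in the flat array, as a GPU thread would."""
    return flat[chromosome_idx * n_tasks + task_idx]


# Task 0 -> machine 5 and task 1 -> machine 7 as in the text;
# the remaining four genes are made up for illustration.
sample = [5, 7, 2, 0, 7, 3]
rng = random.Random(0)
population = [sample, random_chromosome(6, 8, rng)]
flat = flatten_population(population)
print(gene_at(flat, 6, 0, 1))
```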
Implementing Genetic Algorithm on CUDA
Fig. 3 shows BKGPUGA's execution model, which executes genetic operations on both the CPU and the GPU.
Fitness Evaluation
Fig. 5 shows the flowchart of the fitness evaluation. For each placement of a task/VM on a physical machine, the evaluation adds the increase in power consumption when the VM is allocated to the physical machine and subtracts it when the task/VM finishes its execution.
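One way to realize this add-on-start, subtract-on-finish evaluation is an event sweep per host, sketched below in serial Python. Several details are assumptions that the text does not state: each task contributes a fixed positive utilization share `u` to its host, a host draws power only while at least one VM is running, the power model uses hypothetical `p_idle` and `p_max` parameters, and the function returns energy (how energy is mapped to a fitness value is left open).

```python
def placement_energy(chromosome, tasks, hosts):
    """Total energy of one placement.

    tasks[i] = (ts, d, u): start time, duration, utilization share (u > 0).
    hosts[j] = (p_idle, p_max): assumed linear power model parameters.
    """
    events = {}  # host index -> list of (time, utilization delta)
    for i, j in enumerate(chromosome):
        ts, d, u = tasks[i]
        events.setdefault(j, []).extend([(ts, u), (ts + d, -u)])

    total = 0.0
    for j, evs in events.items():
        p_idle, p_max = hosts[j]
        evs.sort()
        util, running, prev_t = 0.0, 0, evs[0][0]
        for t, delta in evs:
            if running > 0:  # host assumed off when it runs no VM
                total += (p_idle + (p_max - p_idle) * util) * (t - prev_t)
            util += delta          # raise on start, lower on finish
            running += 1 if delta > 0 else -1
            prev_t = t
    return total


tasks = [(0, 10, 0.5), (5, 10, 0.25), (0, 4, 1.0)]
hosts = [(100.0, 200.0), (100.0, 200.0)]
consolidated = placement_energy([0, 0, 1], tasks, hosts)
spread = placement_energy([0, 1, 1], tasks, hosts)
print(consolidated, spread)
```

In the GPU version each chromosome would be evaluated by its own thread; the sweep itself stays the same.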
Selection method
BKGPUGA does not use a random selection method. Instead, it sorts the chromosomes by fitness from high to low and then picks the chromosomes with the highest fitness until the population size limit is reached. The selection method is illustrated in Fig. 6. After crossover or mutation, the chromosomes in the Parents and Offspring populations have different fitness values, and the new population that contains both Parents and Offspring is twice the original size. Next, the chromosomes are sorted by fitness value, and the selection method simply retains the high-fitness chromosomes. The resulting population is named F1, and its size equals that of the original population. To prepare for the next step of the GA (crossover or mutation), F1 is copied into Clones. The next operation is performed on Clones, which become F1's offspring. The sorting and selection are repeated after each operation. To simplify and speed up the sorting operation, the program uses the CUDA Thrust library provided by NVIDIA. The selection method keeps the better individuals. This not only improves the speed of evolution but also increases the speedup of the overall program, because these steps run in parallel.
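The sort-and-truncate selection described above can be sketched in serial Python as follows. The function name and the toy fitness are hypothetical, and Python's built-in sort stands in for the Thrust sort used on the GPU.

```python
def select_survivors(parents, offspring, fitness, pop_size):
    """Merge both populations, sort by fitness (high to low), keep the best."""
    merged = parents + offspring          # double-size population
    merged.sort(key=fitness, reverse=True)
    return merged[:pop_size]              # F1: same size as the original


# Toy fitness for illustration only: lower gene sum means higher fitness.
def toy_fitness(chromosome):
    return -sum(chromosome)


parents = [[0, 0], [1, 1]]
offspring = [[0, 1], [2, 2]]
f1 = select_survivors(parents, offspring, toy_fitness, pop_size=2)
clones = [list(c) for c in f1]  # copy of F1 used by the next operation
print(f1)
```

Because the merged population always contains the parents, the best chromosome found so far is never lost between generations.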
Mutation method
Each thread decides, for each cell (gene), whether or not to mutate it, based on a given probability. If the decision is yes, the cell is changed randomly to a different value. Fig. 7 shows an example of the mutation process.
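A serial Python sketch of this per-cell mutation, assuming a gene is a machine index in `range(n_machines)` and that "a different value" means any other machine chosen uniformly. The function name and parameters are hypothetical; in the GPU version each thread would handle its own cells.

```python
import random


def mutate(chromosome, n_machines, p_mut, rng):
    """Return a copy in which each gene mutates with probability p_mut.

    A mutated gene always receives a machine index different from its
    current one (requires n_machines >= 2).
    """
    out = list(chromosome)
    for k in range(len(out)):
        if rng.random() < p_mut:
            # Draw from the n_machines - 1 other indexes, skipping the
            # current value by shifting the upper part of the range.
            new = rng.randrange(n_machines - 1)
            out[k] = new if new < out[k] else new + 1
    return out


rng = random.Random(42)
original = [5, 7, 2, 0, 7, 3]
mutated = mutate(original, n_machines=8, p_mut=0.005, rng=rng)
print(mutated)
```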
Crossover method
Crossover is the process of choosing two random chromosomes to form two new ones. To ensure that crossover yields enough individuals to form the Offspring population, the crossover probability is 100%, which means that all chromosomes take part in crossover.
Fig. 8. Selection process between two chromosomes
The crossover process uses one-point crossover, and the cross point is chosen randomly. Fig. 8 shows an example of the selection process between two chromosomes. The result is two new chromosomes. This implementation is simple and easy to illustrate. It creates the child chromosomes randomly, but it does not guarantee their quality. To improve the quality of the result, we could choose the parent chromosomes according to some criteria, but this would make the algorithm more complex. Instead, the selection with sorting overcomes this drawback.
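One-point crossover as described above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes both parents have the same length of at least two genes so that a cut point strictly inside the chromosome exists.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at a random cut point."""
    point = rng.randrange(1, len(parent_a))  # cut strictly inside
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2


rng = random.Random(7)
a = [5, 7, 2, 0, 7, 3]
b = [1, 1, 4, 6, 0, 2]
c1, c2 = one_point_crossover(a, b, rng)
print(c1, c2)
```

Each child keeps the head of one parent and the tail of the other, so at every position the two children together carry exactly the two parental genes.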
Experimental results
Both the serial (SGA) and GPU (BKGPUGA) programs were tested on a machine with one Intel Xeon E5-2630 (6 cores, 2.3 GHz), 24 GB of memory, and one Tesla M2090 (512 cores, 6 GB of memory). We generated an instance of the PATS problem with 500 physical machines and 500 tasks. In each experiment, the mutation probability is 0.005, the number of chromosomes (popsize, the size of the population) is in {512, 1024, 2048}, the number of generations is in {100, 1000, 10000}, and the number of CUDA threads per block is in {128, 256, 512}. Table 1 shows the computation times of the serial GA (SGA) and of BKGPUGA. Fig. 9 shows the speedup of the BKGPUGA program for configurations of 128, 256, and 512 CUDA threads per block (green, blue, and yellow lines, respectively). The maximum speedup of BKGPUGA is 28.14, obtained with 256 CUDA threads per block, 2048 chromosomes, and 10,000 generations. The number of generations is the main factor that affects the execution time: when the number of generations increases from 100 to 1,000 and to 10,000, BKGPUGA's average execution time increases by factors of approximately 7.66 and 71.67, and SGA's average execution time increases by factors of approximately 9.16 and 91.39, respectively. The fitness comparison between BKGPUGA and the CPU version shows that the difference is relatively small ($10^{-11}$). The fitness values at 1,000 and 10,000 generations are almost equal; this indicates that once the GA is close to the best solution, additional generations improve the fitness only slightly, at the cost of increased execution time for BKGPUGA.
Conclusions and Future Work
Compared to previous studies, this paper presents a parallel GA using GPU computation to solve the power-aware task scheduling (PATS) problem in HPC Cloud. Both BKGPUGA and the corresponding SGA programs are implemented carefully for performance comparison.
Experimental results show that BKGPUGA (the CUDA program) executed on an NVIDIA Tesla M2090 obtains a significant speedup over SGA (the serial GA) executed on an Intel Xeon E5-2630. The execution time of BKGPUGA depends on the number of generations and on the size of the task scheduling problem (number of tasks/VMs, number of physical machines). To maximize speedup, we prefer 128 CUDA threads per block when the number of generations is less than or equal to 1,000, and 256 CUDA threads per block when the number of generations is greater than or equal to 10,000. The number of tasks and the number of physical machines are limited by the size of the local memory available to each CUDA thread on the GPU card.
In future work, we will consider some real constraints (as in [11]) on the PATS problem, and we will investigate improving the quality of chromosomes (solutions) by applying the EPOBF heuristic in [12] and a memetic methodology in each genetic operation.
