Abstract-We present a simulated annealing based partitioning technique for mapping task graphs, onto heterogeneous processing architectures. Task partitioning onto homogeneous architectures to minimize the makespan of a task graph, is a known NP-hard problem. Heterogeneity greatly complicates the aforementioned partitioning problem, thus making heuristic solutions essential. A number of heuristic approaches have been proposed, some using simulated annealing. We propose a simulated annealing method with a novel NEXT STATE function to enable exploration of different regions of the global search space when the annealing temperature is high and making the search more local as the temperature drops. The novelty of our approach is two fold: (1) we go a step further than the existing scientific literature, considering heterogeneity at levels of task parallelism, data parallelism and communication. (2) We present a novel algorithm that uses simulated annealing to find better partitions in the presence of heterogeneous architectures, data parallel execution units, and significant data communication costs. We conduct a statistical analysis of the performance of the proposed method, which shows that our approach clearly outperforms the existing simulated annealing method.
I. INTRODUCTION AND RELATED WORK
Hardware manufacturers in their eagerness to overcome the limits of silicon have introduced heterogeneous execution architectures. There is evidence [1] of software/compiler developers catching up by inventing techniques for automatic parallelisation of programs onto this heterogeneous hardware. Task graph partitioning and scheduling, onto a heterogeneous architecture, to reduce the overall application latency is a well studied problem [2] , [3] . Different approaches have been proposed, some that give optimal solutions for twoprocessor systems [4] , while others that are heuristics for generic hardware topologies [5] . Task graph partitioning and scheduling being NP-hard [6] , [7] in the general case, requires one to search for good heuristics that can efficiently utilize the underlying execution architecture (unless P=NP). There are two facets that any heuristic task-partitioning technique needs to consider: (1) correct and efficient modelling of the application to extract the parallelism from the underlying execution architecture and (2) a quick algorithm that can provide close to optimal solutions. Usually, the algorithm designers concentrate on the second facet; providing an efficient heuristic algorithm (normally some kind of meta-heuristic optimization technique [2] ), with the first one taking a back-seat. This observation is based on the fact that none of the proposed heuristic techniques, targeting partitioning and scheduling of task-graphs on heterogeneous hardware explicitly specify the different types of potential parallelism available in the underlying heterogeneous hardware. For example, if we consider Graphical processing units (GPUs) in conjunction with the standard Central processing units (CPUs), we can identify at least three forms of parallelism; (1) multi-core parallelism suitable for graphs composed of independent tasks, (2) short vector parallelism available on CPUs and (3) very large vector parallelism available on GPUs. None of the work to date [3] , [8] , [9] , [2] explicitly tries to extract these different types of parallelism together. Sanyal et al. [9] and Braun et al. [2] consider heterogeneous architectures with different cores running at different speeds. However, they do not target vector parallelism.
In this paper we use a simulated-annealing approach to partition and schedule applications modelled as task-graphs onto heterogeneous architectures addressing the above mentioned gaps in the current literature on task-graph partitioning on heterogeneous hardware.
Our key contributions in this paper are:
• Mechanisms to exploit data parallelism within tasks and map the parallelism to processing elements with matching vector capacity.
• A simulated annealing approach to partition the task level parallelism and data level parallelism across heterogeneous multi-core architectures.
• We consider communication costs between heterogeneous units and our technique also allocates the so called datastores indicating placement of variables on the correct processor memory.
• We present a novel adaptation of the METIS graph partitioner tool to the problem of partitioning task graphs onto heterogeneous architectures.
• An experimental evaluation of our approach, comparing it with three established approaches: k-way partition, heterogeneous bin packing, and the existing simulated annealing-based approach.
• Experimental results showing that our approach performs well in practice, and in general produces superior results compared to the existing approaches. The rest of the paper is arranged as follows. Section II, formalizes the problem statement. Section III, first describes the established Simulated Annealing approach and then explains our contributions and improvements to this approach. Next, we perform experiments to quantitatively validate our SA approach in Section IV and discuss the related work and position our work in comparison to the currently available techniques in Sections IV-C2 and IV-D respectively. Finally, we conclude in Section V.
II. NOTATIONS AND FORMALIZATION OF THE PROBLEM

STATEMENT
The overall application partitioning problem onto a heterogeneous execution architecture can be formulated in terms of a graph partitioning problem. An application can be represented as a task graph as discussed in Section II-A. We can represent the underlying parallel execution framework as a resource graph. The nodes in this graph are processing units and the communication links are represented by the edges, which is discussed in Section II-B. Given these two graphs, how does one partition the former and map it onto the latter? To answer this question, we extend the execution model presented by Sanyal et al. [9] , [10] to accommodate the three kinds of parallelisms mentioned in Section I.
A. Formalizing the task graph
A task graph is a weighted directed graph G t (V t , E t ), such that each vertex V t is a program statement in the application, and E t ⊆ (V t × V t ), shows the communication edges between the vertices. Each vertex V t is decorated with n weights (n = 2 in the case of the experiments conducted): w t 0 : i → N, ∀i ∈ V t and w t 1 : i → N, ∀i ∈ V t . Weight w t 0 is a function on some vertex i ∈ V t , that maps the amount of work to be carried out at node i (represented as Units of Work (UoW)) to an integer value. Similarly, w t 1 is a function on some vertex i ∈ V t that maps the vector length that is required by the node i again to an integer value.
Each edge is decorated by a weight w e : e → N, ∀e ∈ E t , where w e represents the number of bytes that need to be transferred from the location of the data-store (i) to the utilization node (j). Data-stores are special nodes, one per data variable, that indicate where the data resides. These special nodes (that have their weights set to zero; thereby making them dummy nodes) are inserted into the task graph wherever reader and writer tasks require access to the corresponding data. The introduction of these nodes however, are done ex post facto. Hence the original dependence amongst tasks in the application graph are preserved by inserting out and in edges to read and write tasks respectively. The edges that existed between tasks in the graph before the introduction of these special nodes are preserved. The allocation of data-stores on a heterogeneous resource graph plays an important role, since we consider CPU-GPU architectures that do not have shared memory. This implies there is a significant memory latency associated with data transfer between these compute units. Hence the placement of the data-store nodes plays a vital role in minimizing data transfer on these expensive communication links.
B. Formalizing the resource graph
The system resources are represented by a weighted undirected graph G r (V r , C r ). Where V r represents a processing element in the underlying resource graph, while the edge 
C. Formalizing the objective function
As stated earlier, we extend the objective function from Sanyal et al.'s work [9] , [10] to accommodate the vector parallelism that is now exposed because of the multi-constraint representation of the task and resource graph we employ. Given some application node i ∈ V t mapped to some resource j ∈ V r . The latency for that node is computed as
where succ(i) represents the successors of the task i; c represents the communication link in the underlying hardware topology which connects PE j to the PE onto which the task k is mapped onto. In this formulation for some given task graph node i, we first calculate the number of vectorized instructions that need to be performed (by diving the required vector length with the vector capacity of the resource node). This gives us the total number of vector instructions that would be performed on the resource node j. Next, we multiply the number of vector instructions to be performed by the UoW required, this in turn gives us the total amount of work to be performed by that task graph node. Finally, we find the computation cost by dividing this total UoW value with the UoW of the resource vertex. For communication on the other hand, we calculate the cost, by dividing the number of required bits to be transferred to the successors of the task by the bandwidth of the resource(s).
Given the task graph and the resource-graph, let ζ be all possible mappings of the application on the resource-graph. We extend the formulation from eq. 1 to find how heavily loaded each PE is. In order to do this, we simply add the latencies of all the tasks scheduled on this PE. This necessitates a sequential execution of all the tasks assigned to this PE. Although this formulation doesn't take the dependencies of tasks that are mapped onto other PEs into account, it still forms a lower bound for the makespan. Accounting for the completion times of all the parents and data transfer time from their PEs to this PE increases the makespan, thereby making our estimate a tight lower bound. In our formalization and our experiments we are only concerned about the mapping and not the schedule. Once an effective mapping has been found, a schedule can easily be derived from it using a simple list scheduler.
Let ζ M be the mapping under consideration and ζ M (i) represent the PE to which the task i has been mapped to. Let s be the processor under scrutiny, then we define the load for such a processor as:
Finally, we define the objective function as the most heavily loaded PE according to eq. 2. More formally, the complete objective function can be defined as:
The goal of our framework is to minimize the total objective function value as described in Equation (3). asdasd asd asd asd asd asd III. SIMULATED ANNEALING Simulated Annealing [11] is an adaptation of the Metropolis-Hastings algorithm for solving the problem of locating a good approximation of the global optimum of a given function, F : R → R, which has a large search space. In the context of the problem at hand, we have a 2-dimensional search space, where one axis represents the task ID and the other represents the resource ID. Each point in this 2-D space represents a {task,resource} pair which implies this task is mapped onto this resource. We define a state to be a collection of |V t | points such that each task is mapped to exactly one resource. The total number of possible states in this discrete space is |V r | |Vt| which is exponential. The large number of states make exhaustive enumeration to find optimal solutions, infeasible. Please note that we will use the term resource and processing element (PE) interchangeably. The same applies for task graphs and application graphs.
SA is a heuristic algorithm that explores the search space by inspecting one valid state at each iteration. Each of these inspected states are evaluated by an objective function which tells us how good or bad this state is. The goodness in an SA algorithm is problem dependent and in our case it is given by the metric defined in Equation 3, Section II-C. The algorithm progresses by inspecting a candidate state at each iteration and it either accepts it as its current state or discards the state and moves on to another state. We define a move as the generation of the next candidate state and this progress is governed by The algorithm always accepts a move to a better solution, i.e. whenever a new state which has a better objective function value than the current state, the SA algorithm accepts it. When this value is worse however, the SA algorithm accepts this move with a certain acceptance probability, that changes with the current temperature. When the temperature is high, the algorithm accepts moves to a worse solution with a higher probability; as the temperature reduces over time, this probability decreases as well.
A. Conventional Simulated Annealing : From the mapping problem standpoint
The algorithm employed by Orsila et al. [8] is given in Algorithm 1. This algorithm takes as input an initial (random) mapping (ζ 0 ), the starting temperature T 0 and the final temperature T f , all of which are set by the user. ζ best holds the best mapping found after the algorithm halts. ζ current and T current are the current mapping and the current temperature, respectively. The NEXT STATE function moves a random task to a random processing element (PE). For further information about this algorithm, we encourage the readers to read Sections II.B and III from Orsila et al.'s paper [8] .
B. Our Improved SA Approach
Simulated annealing is a generic framework that is characterized by the definition of few of its parameters and functions. In this section, we discuss our adaptation of Orsila et al.' s   Application   Vector strip size  10  20  30  40  50  |Vt|  |Et|  |Vt|  |Et|  |Vt|  |Et|  |Vt|  |Et|  |Vt|  |Et|  Binomial option pricing  82  206  102  306  122  406  142  506  162  606  Convolution  79  143  89  173  99  203  109  233  119  263  Gram Schmidt  228  443  838  1653  1848 3663  3258 6473 5068  10083  Gauss-Seidel  227  531  837  2041  1847 4551  3257 8061 5067  12571  Jacobi  48  130  78  240  108  350  138  460  168  570   TABLE I: The task graph setup work [8] to suit the heterogeneity in our multi constraint representation of task and resource graphs. We also discuss the rationale behind the generation of next candidate states based on the temperature parameter. Orsila et al.'s paper [8] primarily dealt with parametrizing the Simulated Annealing algorithm. The authors demonstrated that their heuristic for setting the initial and final temperatures in the annealing schedule was sufficient for the algorithm to produce a good quality result. However, the authors do not mention how the NEXT STATE function is computed, i.e. how the SA method finds the next candidate state. Our NEXT STATE function is one of the most important contributions of this paper.
In conventional SA, temperature is used only in the acceptance probability function to accept worse states as a way of escaping local minimums, which we have retained in our implementation. Additionally, we incorporate temperature in the generation of the next candidate state (the next probable move of the SA algorithm). When the temperature is high, we allow a higher proportion of the elements in this mapping of the form t k → r i for some t k ∈ V t and r i ∈ V r to change. We denote this proportion by,
where T current is the current temperature, T 0 is the initial temperature and T f is the final temperature. |V t | * T scaled gives us an estimate of how many tasks we can migrate in order to generate the next candidate state. As is evident from eq. 4, the scaled temperature factor allows a lot of tasks to migrate to different processing elements when the temperature is high. As the temperature decreases, we restrict the motion of these tasks, meaning we allow only fewer tasks to migrate to different processing elements and enforce a condition on the rest of the tasks to stay at the processing element they are currently on. Note that during every iteration, the tasks that are allowed to move are selected at random while the number of tasks that have to be moved is given by |V t | * T scaled . This is in contrast to the conventional simulated annealing algorithm where only the acceptance probability is affected by temperature. Although the acceptance probability decreases with temperature, the annealing schedule generates candidate states which are distributed randomly throughout the search space. In our method, as the temperature drops, the next candidate state is generated closer to the current best solution.
This enables us to fine-tune the current best solution in order to only move tasks that give us better objective function values. This idea of letting temperature influence the generation of the next state opens up a range of optimizations that can be incorporated into the NEXT STATE function. This is something that we plan to explore in our future work.
We have also changed the starting point, ζ 0 of the annealing process. In the conventional algorithm a random starting point is chosen by mapping each task to a random PE. We maintain the randomness in the starting solution, but we mandate all tasks to be mapped onto a single random PE. This puts tightly coupled tasks, i.e. tasks that communicate heavily, onto the same PE and moving them onto different PEs would only incur more communication costs and would thus lead to a worse objective function value.
We use the value for the initial and final temperature for the annealing schedule same as Orsila et al. [8] . To accommodate the multiple-constraint representation model with theirs, we calculate the fastest and slowest processors by multiplying each processor's capabilities (the PE's UoW capability and vector width) and sorting them in non-descending order.
IV. EXPERIMENTS AND RESULTS
Although several list-scheduling heuristics for the problem exist [12] , [5] , [13] , [14] , we focus our comparison of the proposed solution against the conventional SA approach in [8] . Furthermore, we also compare our technique with two state of the art heuristic algorithms for allocation of task graphs onto heterogeneous architectures: one based on K-way partitioning [15] and other based on heterogeneous bin-packing [16] . Statistical analysis is performed on a large set of randomly generated set of heterogeneous execution architectures (resource graphs) and real-world applications (task graphs).
We show the speedup obtained using our improved heuristics against the conventional heuristics for SA as prescribed by [8] . We also compare the results we obtain against the Kway graph partitioning algorithm [15] and heterogeneous bin packing heuristic [16] .
A. K-way graph partitioning
Graph partitioning plays an important role in the multiprocessor and VLIW scheduling and partitioning algorithms [17] . K-way graph partitioning is an important algorithm, which partitions a given graph into K or less parts, resulting in load balanced allocations. K-way partitioning mixed with min-edge cuts can form a good tool to partition METIS is a graph partitioner, which implements K-way partitioning with min-edge cut as the primary objective. The weights on the graph nodes are represented as constraints. Each graph node can have multiple-node weights representing different criteria. The edges between nodes can be weighted themselves, but as opposed to nodes, edges can only be decorated with a single weight. Moreover, in METIS one can use a concept called tp-weights, which gives preference to different node constraints when performing load-balancing during K-way graph partitioning [18] .
The gist of our K-way task partitioning approach onto a heterogeneous multi-core architecture is as follows:
• Our resource graph is first described as a simple graph in METIS. In this description each of the two capabilities W i 0 and W i 1 are described as two constraints for each node in the graph. The communication bandwidth is described as edge weights in METIS.
• Once we have the resource graph in the METIS format we calculate the tp-weights. There are two tp-weights generated, one for each of the resource node capabilities. For each processor the UoW tp-weight is calculated by the formula: W • The nodes in our task graph are described as graph nodes in the METIS format. The two requirements (R i 0 and R i 1 ) for each task node are described as two constraints in the METIS node format. The edge weights in our task graph are described as edge weights in the METIS graph format.
• Once we have the task graph described in the METIS format along with the tp-weight metric for each PE in the resource graph. We ask for a |V r | partition from Metis, giving the tp-weight metric for each constraint of the task graph. • The resultant partition is then used to calculate the objective in Equation (3).
B. Heterogeneous bin packing heuristic
Heuristic bin packing solutions have given good results in the general case [19] . Comparing with the heterogeneous bin packing heuristic [16] allows us to gauge the effectiveness of our algorithm against a standard technique.
Let I be the items to be accommodated into the bins and let K be the set of bins available. From the standpoint of the mapping problem, I refers to the set of task graph nodes (|V t |) and K refers to the nodes in the resource graph (|V r |). Similar to the Knapsack problem [20] , by which A-BFD (Adaptive Best First Decreasing) is inspired, each element i ∈ I, K has two constraints on them represented by c i (cost), which translates to the PE capability W i 0 and V i (volume), which translates to the capability W i 1 , respectively. A-BFD proceeds to sort I according to non-increasing order of their volume and sorts K according to non-increasing order of the ratio c i /V i . Then, it proceeds to allocate items from I into best bins b ∈ S. (3) for different benchmarks A best bin, i.e., the bin with maximum free space, is defined as the bin volume minus the sum of volumes of the items loaded into it. The post pass in A-BFD chooses every bin that has at least one item allocated to it and tries to find an empty bin, that has a higher or equal volume than the allocated volume on the chosen bin but also has a lower cost. If it finds such an empty bin, then it transfers all the items allocated to the chosen bin to the newly found empty bin which is cheaper. One of the main advantages of A-BFD is that it is very fast with a best case complexity of O(N I ) without the post pass, where N I is the number of items (number of tasks |V t | in the application graph G t ). Including the post pass, the best case complexity becomes O(N I +N K ) where N K is the number of bins (number of PEs |V R | in the resource graph G r ).
C. The experimental setup
We set up the resource graphs and the task graphs for performing the experiments as follows.
1) The resource graph setup: The experimental setup consists of the following: 1) A multi-core system with |V r | nodes. A node could be just a multi-core CPU or a multi-core CPU with a GPU attached to it. |V r | varies in a normal distribution from 2 to 32. 2) The bandwidth is selected from a set B = {B 1 , B 2 , ..., B |B| }. Every communication link weight (W c ) is selected from the set B. The elements of the set B varies in the normal distribution: 1 GB/s to 10 GB/s, representative of the multi-core connection networks in today's machines.
3) A set of N G GPUs where N G is at most |V r |. The GPUs are connected in the network at pre-determined locations, chosen randomly in the normal distribution of 25% to 75% of |V r |.
Every GPU in this experiment has a vector length of V i where V i is sampled randomly from the set V . The elements of set V are chosen from a normal distribution ranging from 2 to 32.
power of 2. Every C i ∈ C and GPU in this experiment has a UoW value of M i where M i is sampled randomly from the set M . The elements of set M are chosen from a normal distribution ranging from 2 7 to 2 20 . For given values of |V r |, N G , V , C, and M and a given application, let the k-th trial be defined as one execution of the following sequence of steps.
• For each GPU G i , sample V and M randomly to determine its vector length V i and UoW count M i . • For each CPU P i , sample C randomly to determine the number of cores C i in the processor P i . • For each core C i in the processor P i sample V i and M i randomly from set V and M .
• Use our framework to extract data and task parallelism that is best utilizable by the heterogeneity created by parameters in items 1, 2, and 3 above. Determine the execution time Objective ζ M using eq. 3. An experiment, E (|V r |, N G , V , C, M ), consists of conducting enough of the above trials so that width of the 95% confidence interval on the average value of Objective ζ M is less than 10% of the average value. This results in a variable number of trials with different experimental setups. Note that 2) The task graph setup: We chose 5 applications: binomial option pricing (a financial derivatives application), 2-dimensional convolution (for image processing), Gram Schmidt linear algebra kernel, 2-dimensional Gauss-Seidel stencil computations, and finally our motivating example itself the 2-dimensional Jacobi stencil computation.
The compiler first collapses all the nested loops and vectorizes the resultant statement to the largest possible vector length, if it is able to do so. Once vectorized to a large vector of some length N (N can be of length 1 million in some cases) we strip this vector, breaking it into multiple nodes requiring N/(vector strip size) each. The resultant nodes are then allocated onto the heterogeneous processor. We take this approach to increase the exploitation of parallelism; especially taking into consideration the GPUs that provide the ability to perform large vector operations at once.
For the 5 aforementioned applications we vary the vector strip from 10 to 50, which results in graphs varying from around 50 to 5000 nodes and with 23 to 12,000 edges. A detailed description of the applications and their features is shown in Table I . The vector requirement of applications in our benchmark suite varies from 1000 to 1 Million elements. The UoW count varies from around 1 to 0.3 billion. The edge weights depicting the amount of data transfer on the other hand varies from 3000 bytes to almost 4.8 Mega byte.
D. Experimental results
The comparison of our simulated annealing approach with other techniques are described in the next sub sections. In all the upcoming comparisons we set the value of q = 0.75(which controls how fast the temperature drops) in our SA and standard SA techniques. This value of q is a good compromise between a slow running SA heuristic vs objective function value, since a bigger value of q, results in larger explored state space. For the aforementioned experimental setup the standard SA and our SA techniques were run for 10 minutes.
1) Comparison with K-way partitioning:
The K-way graph partitioning algorithm uses the METIS [21] graph partitioner to solve the problem of mapping task graphs onto hardware topologies. We discuss the adaptation of the METIS graph partitioner to suit the multi constraint representation model in Section IV.A of our previous work [15] . The major statistics comparing the two approaches is provided in Table IIa. In  Table IIa , The Max objective and the Min objective columns, provide the maximum and minimum application latencies for each of the applications. The last column gives the % of instances our SA technique performed better than K-way graph partitioning, in the experiments conducted. The METIS project has been under development for the past ten years and is extremely well suited for performing min-cuts of large graphs and minimizing cross partition communication. This is clearly reflected in the results of two data intensive benchmarks, namely Convolution (better in 88% of the experiments) and 2D Jacobi Stencil computation (better in 68% of the experiments).
2) Comparison with heterogeneous bin packing heuristic:
Comparison of our SA technique with heterogeneous bin packing heuristic is shown in Table IIb . Unlike, K-way graph partitioning, the bin packing heuristic could not partition the binomial option pricing and Gauss-Seidel examples for any vector strip size, because the algorithm terminates if there is no underlying PE that can support the required vector tile size. Our SA heuristic runs a part of the vector on the underlying PE and then iterates in a loop until all vector computations are finished. For example, if the task graph requires a 1000 vector elements that need to be processed at once, and the largest vector capability in the resource graph is a 100, then, our heuristic will allocate a 100 vector elements onto the underlying hardware and then increase the UoW count by 10.
Heterogeneous bin packing prioritizes volume (vector length) over cost (UoW count; see Section IV-B) thereby performing much worse compared to our approach. It packs a large number of task nodes from the application graph into a single large vector PE, including those task nodes, which have a small vector count (w Table IIc . There is not a single case where our SA approach is worse than the current established technique. The reasons for this are two fold,
• We change the annealing schedule such that the search exploration is global initially, i.e. high temperatures, and it becomes local to the current best solution as temperature drops.
• The greedy phase in our annealing schedule helps us fine tune our best solution In all the figures and statistics presented, we ran the conventional SA and our improved SA approach with a time limit of 10 minutes, i.e. SA will terminate after 10 minutes irrespective of whether it has already found an acceptable solution. The experiments were performed on a quad-core dual processor system with 24 gigabytes of RAM.
V. CONCLUSION
In this paper we have described a novel Simulated Annealing (SA) heuristic for mapping applications with task and data level parallelism onto heterogeneous execution architectures. We partition and schedule such applications onto heterogeneous architectures with differing compute costs, vector lengths, and communication bandwidths. To our knowledge we are the first to utilize SA for extracting different kinds of parallelism (task and data) directly from compilers onto heterogeneous architectures.
Our SA approach guides the movement of the partition using temperature itself, as opposed to others, where such movement occurs randomly. Moreover, a guided movement allows for faster identification of good solutions, resulting in faster runtime for the SA algorithm, making it scalable to larger applications and larger number of processor cores. Experimental results show that our SA approach outperforms the existing SA technique in 84% of the instances. Moreover, for all these instances it has a better objective function value. We also compared our SA technique with the well established K-way graph partitioning and heterogeneous bin packing heuristics. Our SA approach out performs the former in 54% of the instances and the latter in 92% of the instances. As part of the future work, we intend to run some instances of the applications on partitions suggested by our method to confirm the validity of the results as the objective function is an approximate estimation of the real runtime. 
