As embedded chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. These capture the usage of available computation resources and available memory hierarchy, respectively. In order to achieve good performance in a chip multiprocessor based embedded system, an optimizing compiler has to exploit both parallelism and locality. Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach improves overall execution latency significantly. In this paper, we also introduce an ILP (Integer Linear Programming) based formulation of the problem, and implement the schedule obtained by the ILP solver. The results indicate that our approach gets within 4% of the ILP solution.
INTRODUCTION
As chip multiprocessors find their way into the commercial embedded market, programming support for these devices is becoming increasingly critical. This support includes language, compiler, and debugging related issues and is likely to receive a lot of attention in the near future.
In a chip multiprocessor based execution environment, two issues are critical to address: parallelism and data locality. The first of these indicates how well an execution exploits available computation resources. Ideally, one wants to use all available processors at each step of computation if doing so improves performance. Data locality, on the other hand, captures how well an execution exercises the available memory hierarchy. The concept of locality is particularly important in the context of chip multiprocessors as the gap between the latencies of on-chip and off-chip accesses is huge. Clearly, one wants to satisfy the majority of data references from the higher levels of memory, i.e., those components that are close to the processor. In order to achieve good performance in a chip multiprocessor based embedded system, an optimizing compiler has to exploit both parallelism and locality in a synergistic fashion.
Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. For example, if locality remains unoptimized, one can expect poor performance at runtime even if all available parallelism is extracted from the application. Similarly, a locality-optimized program that is not parallelized appropriately can result in poor runtime behavior.
The main goal of this paper is to propose and evaluate a compiler-directed code partitioning/scheduling scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph, or LPG for short. It then executes a partitioning/scheduling algorithm on this graph, which assigns the nodes of this graph to the processors in the parallel architecture. We implemented this algorithm and evaluated its effectiveness using a set of four benchmark codes.
In our experiments, we also compared this approach to an alternate parallelization/locality optimization scheme which handles loop nests in the application one by one, and optimizes locality of each loop nest independently. The results collected so far indicate that our approach improves execution latency significantly.
In this paper, we also present an ILP (Integer Linear Programming) based formulation of the combined parallelization and data-locality optimization problem. We implemented this ILP solver based solution and compared the results it generated to those obtained using our heuristic approach. The collected experimental results indicate that our approach gets within 4% of the ILP solution.
The rest of this paper is structured as follows. An abstract view of the chip multiprocessor architecture considered in this work is discussed in Section 2. Section 3 describes our compiler representation, LPG, and Section 4 presents the details of our partitioning/scheduling algorithm. The ILP formulation of the problem is discussed in Section 5. An experimental evaluation of the proposed approach is given in Section 6. Related work is discussed and compared to our work in Section 7, and the paper is concluded in Section 8 with a summary.
ARCHITECTURAL ABSTRACTION
The chip multiprocessor (CMP) architecture considered in this work is based on shared memory. In this architecture, multiple CPUs share an on-chip cache space. A simplified view of this architecture is illustrated in Figure 1. We also assume the existence of a large off-chip memory space, shared by all processors in the system. The important point to note here is that optimizing for both parallelism and locality is very important in this CMP architecture. In particular, in order to attain good performance, one has to use all available CPUs to the maximum extent allowed by intrinsic data dependencies in the code, and the reused data elements should be caught in the on-chip cache space as much as possible (instead of going off chip).
LOCALITY-PARALLELISM GRAPH
Our scheduling algorithm, which targets loop-based applications, operates on a locality-parallelism graph (LPG) of code blocks. This graph captures both the dependencies and the data locality among code blocks. An LPG is an acyclic graph $G(V, E_{dep}, E_{loc})$, where $V$ is a set of nodes and $E_{dep}$ and $E_{loc}$ are sets of edges. Each $v_i \in V$ represents a code block (which will be explained in detail shortly). A directed edge $e_{i,j} \in E_{dep}$ from $v_i$ to $v_j$ means that there is a dependency between $v_i$ and $v_j$ (i.e., data produced by $v_i$ is used by $v_j$). In this case, $v_i$ is an immediate predecessor of $v_j$, and $v_j$ is an immediate successor of $v_i$. We denote the set of immediate predecessors of a node $v$ as $Pred_v$, and the set of immediate successors of $v$ as $Succ_v$. A directed edge $e_{i,j} \in E_{loc}$ from $v_i$ to $v_j$ means that there is data reuse, i.e., $v_i$ and $v_j$ share some data. The weight $W_{e_{i,j}}$ of edge $e_{i,j}$ captures the amount of data shared by the two nodes (code blocks). To simplify our problem formulation, all non-existing edges in an LPG are assumed to have a weight of 0. The set of nodes that share a locality edge with a node $v$ is denoted as $Loc_v$.
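As a rough illustration of this definition, the sketch below stores an LPG as a node set, a set of dependency edges, and a dictionary of weighted locality edges. The class name `LPG` and its method names are hypothetical, and looking up locality weights in both directions is an assumption made here because sharing data is symmetric.

```python
from dataclasses import dataclass, field


@dataclass
class LPG:
    """Locality-parallelism graph G(V, E_dep, E_loc) over code-block ids."""
    nodes: set = field(default_factory=set)
    dep: set = field(default_factory=set)    # (i, j): block j uses data produced by block i
    loc: dict = field(default_factory=dict)  # (i, j) -> amount of data shared by i and j

    def add_dep(self, i, j):
        self.nodes |= {i, j}
        self.dep.add((i, j))

    def add_loc(self, i, j, weight):
        self.nodes |= {i, j}
        self.loc[(i, j)] = weight

    def pred(self, v):
        """Immediate predecessors of v (Pred_v)."""
        return {i for (i, j) in self.dep if j == v}

    def succ(self, v):
        """Immediate successors of v (Succ_v)."""
        return {j for (i, j) in self.dep if i == v}

    def weight(self, i, j):
        """Locality weight; non-existing edges count as 0 (assumed symmetric lookup)."""
        return self.loc.get((i, j), self.loc.get((j, i), 0))

    def loc_set(self, v):
        """Nodes that share a locality edge with v (Loc_v)."""
        return ({j for (i, j) in self.loc if i == v}
                | {i for (i, j) in self.loc if j == v})
```

For instance, after `g = LPG(); g.add_dep(1, 2); g.add_loc(1, 3, 5)`, the call `g.pred(2)` yields `{1}` and `g.weight(2, 3)` yields 0.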
When we have a loop with a large number of iterations, we can rewrite the same loop as a set of loops with fewer iterations, each having the same loop body as the original. For example, if the original loop has n iterations, we can break it into k blocks, each block having roughly n/k iterations. In this context, we call each one of these smaller loops a code block. The code blocks can then be executed in parallel, unless data dependencies prevent it. The number and size (i.e., the number of iterations) of the code blocks can be arranged to achieve the desired level of granularity.
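The splitting step can be sketched as follows. The helper name `split_iterations` is hypothetical, and the choice of contiguous, near-equal ranges is one simple assumption; the text only requires blocks of roughly n/k iterations.

```python
def split_iterations(n, k):
    """Split iterations 0..n-1 into k contiguous (start, stop) ranges of near-equal size."""
    base, extra = divmod(n, k)
    blocks, start = [], 0
    for b in range(k):
        # The first `extra` blocks take one additional iteration each
        size = base + (1 if b < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks


blocks = split_iterations(10, 3)  # [(0, 4), (4, 7), (7, 10)]
```

Each `(start, stop)` pair corresponds to one code block that runs the original loop body over `range(start, stop)`.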
As an example, in the graph in Figure 2, we have five separate loop nests in our code (shown on the left). The solid lines represent the data dependencies, whereas the dotted lines capture the data reuse edges (see footnote 2). We partition each of these loops into smaller loops of reasonable size (i.e., into code blocks). On the right part of this figure, we see the code blocks that are generated by the partitioning of loop nests. Note that the number of edges has increased, since the graph now shows the dependencies and data reuse between small code blocks, instead of larger nests. This graph on the right (which is our LPG in this case) shows dependencies and data localities at a finer granularity, and is the main compiler based data structure on which our partitioning/scheduling scheme operates. In the remainder of this paper, when there is no confusion, "block" means "code block".
OUR APPROACH
We can think of a schedule as a two dimensional matrix, where the rows represent scheduling steps (also referred to as the execution steps in this work) and the columns correspond to available CPUs. At the end of scheduling (which is explained below), we fill the entries of this matrix such that both parallelism and data locality are improved.
Our goal is to schedule, given the CPUs available in the CMP, code blocks that have data reuse between them as close together as possible (in time), while respecting data dependencies. This is because, if two or more blocks that access the same data are scheduled in the same (or nearby) execution steps, then the data will be loaded into the on-chip cache once and used by all of them, instead of each block loading the same data into the cache separately at different times. This is expected to increase the cache hit ratio and enhance overall performance.

Footnote 2: Note that, while a data dependency edge between two nodes means that there is also a data reuse edge between them, the opposite may not always be true. This is because if two code blocks only read their shared data, this does not introduce a data dependency.

Footnote 3: Ideally, we want to load each data element from the off-chip memory to the on-chip cache only once. However, this may not always be realizable due to data dependencies in the code and other potential [...]

Table 1: Global data structures.

| Name | Description |
|---|---|
| [...] | the set of nodes scheduled in execution step i. |
| sReady | the set of nodes that are ready to be scheduled. |
| sManda | the set of nodes that are ready and have to be scheduled as soon as possible. |
| sRemain | the remaining set of nodes that are not ready. |
| cap | the remaining capacity in the current step. |
To accomplish this goal, we designed a heuristic algorithm for resource-constrained scheduling. In this section, the key parts of our algorithm are given as pseudo-code along with short explanations of the functions implemented. In our approach (which is fully automated), certain data structures are used throughout the algorithm and are considered global. These are given in Table 1.
Our approach starts by computing the ASAP (as soon as possible) and ALAP (as late as possible) values for the nodes of the LPG at hand. An ASAP value for a node $v$ gives the earliest step of execution in which $v$ can be scheduled. The ASAP algorithm assigns an ASAP label $S_v$ to each node $v$. Similarly, in ALAP scheduling, each block is scheduled to start at the latest possible step. An ALAP value for a node $v$ represents the latest step of execution in which $v$ can be scheduled. The ALAP algorithm assigns an ALAP label $L_v$ (step index) to each node $v$. $T$ represents an upper bound on the number of steps. We omit the pseudo-code for the ASAP() and ALAP() procedures since they are well known in the literature [28]. Each step of execution has a capacity of $p$, that is, the number of processors in the system. In other words, at most $p$ code blocks can be scheduled in an execution step.
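Since the ASAP() and ALAP() procedures are omitted, the following is a minimal sketch of the textbook, resource-unconstrained versions, not necessarily the exact procedures used in this work. The function name `asap_alap` and the 1-based step numbering are assumptions made for illustration.

```python
def asap_alap(nodes, dep, T):
    """Return (S, L): earliest and latest step (1-based) for each node of a DAG.

    dep is a set of (i, j) dependency edges and T is the upper bound on steps.
    Processor capacity is ignored here, as in classic ASAP/ALAP labeling.
    """
    preds = {v: set() for v in nodes}
    succs = {v: set() for v in nodes}
    for i, j in dep:
        preds[j].add(i)
        succs[i].add(j)

    # Topological order (Kahn's algorithm); the list grows while it is traversed
    indeg = {v: len(preds[v]) for v in nodes}
    order = [v for v in sorted(nodes) if indeg[v] == 0]
    for v in order:
        for w in sorted(succs[v]):
            indeg[w] -= 1
            if indeg[w] == 0:
                order.append(w)

    S = {}
    for v in order:  # one step after the latest predecessor
        S[v] = 1 + max((S[u] for u in preds[v]), default=0)
    L = {}
    for v in reversed(order):  # one step before the earliest successor
        L[v] = min((L[w] for w in succs[v]), default=T + 1) - 1
    return S, L


S, L = asap_alap({1, 2, 3}, {(1, 2)}, T=3)
```

In this toy graph, node 1 must precede node 2, so `S` is `{1: 1, 3: 1, 2: 2}` and `L` is `{2: 3, 3: 3, 1: 2}`.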
The procedure calcSchedule(p, α, β, flag) in Algorithm 1 calculates the schedule for a system with p processors. It first initializes several variables, populates sRemain, and then assigns ASAP/ALAP labels to each node. insertReadyNodes(1) adds the possible starting nodes (nodes that are not dependent on any other nodes) to sReady.
The rest of the code is the main while loop, which iterates as long as there are ready nodes (code blocks) to be scheduled. At each iteration, it first advances to the next execution step, initializes variables, and calls doMandatory (which schedules nodes that would otherwise increase the total number of execution steps). At each iteration of the inner while loop, the ready node with the highest score is selected and scheduled, as long as there are ready nodes and there is room in the current execution step to accommodate new code block(s). When either condition is false, insertReadyNodes(cs + 1) loads sReady with nodes ready for the next execution step, and control returns to the next iteration of the main while loop.
The selection of the nodes from sReady is performed according to the score calculated by calcScore_Sch. The boolean flag enables a contribution to the score by calcScore_sReady when the current step is empty. If the flag is down, the first node of the step is chosen solely based on the data reuse it exhibits with nodes scheduled in previous steps. Otherwise, a constant (β) times the calcScore_sReady score (a node's locality with p − 1 other nodes in sReady) is added to the total score. Note that the use of the parameters flag, α and β enables us to generate multiple heuristics and fine-tune our algorithm.
The procedure insertReadyNodes(s) in Algorithm 2 scans the set of nodes sRemain and determines whether any of the nodes are ready for step s (i.e., all of a node's predecessors have already been scheduled, and step s lies between its ASAP and ALAP labels). It then puts all ready nodes in the set of ready nodes, sReady, and deletes them from sRemain.
The procedure doMandatory(s) in Algorithm 3 iterates through the nodes in sReady to find the nodes that have an ALAP label $L_v$ such that $L_v \le s$ (the given step) and adds them to sManda. It then schedules the nodes in sManda in the order of non-decreasing ALAP values as long as there are nodes in sManda and the scheduling constraints permit.
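Algorithm 1 calcSchedule(p, α, β, flag) (opening steps)
    Assign ASAP/ALAP values to each node
    sReady ← ∅    (initialize ready nodes set)
    insertReadyNodes(1)

The procedure scheduleNode(v, s) in Algorithm 4 schedules node v in step s: it sets $Step_v$ accordingly, deletes the node from sReady, and decreases the remaining capacity of the execution step.

Algorithm 4 scheduleNode(v, s)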
The function calcScore_sReady($v_i$, $c$) in Algorithm 5 is optionally used when it is time to pick the first node for a step, since the amount of data reuse (i.e., the number of data elements shared) with other nodes already scheduled on the same step is not sufficiently high. Instead, it calculates a total weight value for $v_i$, based on its amount of data reuse with other unscheduled ready nodes. The parameter $c$ is used as an upper bound on the number of nodes considered, since there can be at most $p$ nodes scheduled in a step. Basically, the function returns the sum of the highest $t$ locality weights that $v_i$ shares with other ready nodes, where $t$ is the actual number of nodes considered.
The weight value returned by calcScore_sReady($v_i$, $c$) is given by the formula:

$$\sum_{j=1}^{t} W_{e_{i,j}}$$

where $v_j$ is one of the $t$ nodes in sReady with the highest data reuse with respect to $v_i$.
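A small sketch of this score is shown below. The function name `calc_score_sready`, the `weight` callback, and the toy weights are hypothetical. Taking the actual number of nodes as the smaller of c and the number of other ready nodes is an assumption, since the exact expression is not reproduced here.

```python
def calc_score_sready(v, ready, weight, c):
    """Sum of the (at most c) largest reuse weights between v and other ready nodes."""
    shared = sorted((weight(v, u) for u in ready if u != v), reverse=True)
    return sum(shared[:c])  # slicing caps the count at min(c, number of other ready nodes)


# Toy locality weights with symmetric lookup (hypothetical data)
W = {(1, 2): 5, (1, 3): 2, (1, 4): 7}


def weight(a, b):
    return W.get((a, b), W.get((b, a), 0))


score = calc_score_sready(1, {1, 2, 3, 4}, weight, c=2)  # 7 + 5 = 12
```

With p processors, the text suggests calling this with c = p − 1, i.e., the number of other blocks that can still share the step with the candidate.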
The function calcScore_Sch($v_i$, α, $s$) given in Algorithm 6 calculates the amount of weighted data reuse between $v_i$ and already scheduled nodes both in the same step ($s$) and previous steps. The weight value returned by calcScore_Sch($v_i$, α, $s$) is given by the formula:
The signum function in the formula prevents contribution from unscheduled nodes, and is defined as $sgn(x) = 1$ if $x > 0$, $sgn(x) = 0$ if $x = 0$, and $sgn(x) = -1$ if $x < 0$.
Different α values change the effect of step difference (e.g., α=0 ignores execution steps and uses the locality value directly, whereas a large α value concentrates on the current execution step).
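To make the role of α concrete, the sketch below combines a distance-weighted reuse score with a simplified greedy list scheduler. It is an illustration under stated assumptions rather than the algorithm of this paper: the decay form `weight / (s - step + 1) ** alpha` is one form consistent with the description of α above, skipping unscheduled nodes stands in for the signum term, and the mandatory (ALAP-driven) pass, flag, and β are left out. The name `greedy_schedule` is hypothetical.

```python
def greedy_schedule(nodes, dep, loc, p, alpha=1.0):
    """Place at most p blocks per step, respecting dependencies and preferring
    blocks with high distance-weighted reuse with already scheduled blocks."""
    preds = {v: {i for (i, j) in dep if j == v} for v in nodes}
    step = {}      # node -> execution step (1-based); only scheduled nodes appear
    schedule = []  # schedule[s - 1] lists the nodes placed in step s

    def w(a, b):
        return loc.get((a, b), loc.get((b, a), 0))

    def score(v, s):
        # Assumed form: reuse weight decays with step distance; alpha = 0 disables decay
        return sum(w(v, u) / (s - step[u] + 1) ** alpha for u in step)

    remaining = set(nodes)
    s = 0
    while remaining:
        s += 1
        # Ready blocks: every predecessor was placed in an earlier step
        ready = {v for v in remaining
                 if all(u in step and step[u] < s for u in preds[v])}
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        current = []
        while ready and len(current) < p:
            # Ties are broken by the smallest node id
            best = max(sorted(ready), key=lambda v: score(v, s))
            step[best] = s
            current.append(best)
            ready.discard(best)
            remaining.discard(best)
        schedule.append(current)
    return schedule


result = greedy_schedule({1, 2, 3, 4}, {(1, 3)}, {(1, 2): 5, (3, 4): 2}, p=2)
```

On this toy input the sketch returns `[[1, 2], [3, 4]]`: blocks 1 and 2 share data and land in the same step, and block 3 waits for its predecessor, block 1.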
Algorithm 5 calcScore_sReady($v_i$, $c$)
Check weights with all scheduled items and consider the step distance
As explained earlier, in our CMP architecture, we assume that the highest level of memory for data is the on-chip shared cache, and there are no private on-chip data caches. However, we can also accommodate the case where the processors have private on-chip data caches by making a small change in our algorithm. Currently, the algorithm considers which step to schedule a block in, but does not consider which particular processor the block is assigned to. If there are private data caches, the modified algorithm will also try to assign blocks with data reuse to the same processor in consecutive steps, so that the majority of data requests can be satisfied from the on-chip private data cache.
Example
As an example, we consider the adi benchmark (one of the benchmarks used in this study) when using 4 processors. In this case, the adi benchmark code was broken into 40 nodes. The schedules obtained by the heuristic solution and the ILP solution are given in Table 2. Each line shows the nodes executed in that execution step. The weighted data reuse scores for these schedules are 16.33 for the schedule returned by our heuristic approach and 17 for the schedule returned by the ILP solver (which represents the optimal case). These scores are calculated using the formula given in Algorithm 6 with α=1. The heuristic schedule executed in 51.4 seconds, while the ILP schedule executed in 47.2 seconds. The details of the ILP based formulation will be given shortly.
ILP FORMULATION OF THE PROBLEM
In order to see how close our heuristic comes to the optimal, we also implemented an ILP (Integer Linear Programming) based solution to the problem, and performed experiments with it. This section gives the details of our ILP based solution. Table 3 lists the constant terms used in our ILP formulation. We used LPSolve [2], a public-domain ILP tool, to formulate and solve our 0-1 ILP problem. Although ILP generates an optimal result (under the assumptions made), the time complexity of ILP prohibits practical usage in most cases. The computation of the solutions for the ILP problems mentioned below took days on average and more than a week in one case. Therefore, it is very important to explore heuristic solutions for this combined parallelism-data locality problem.

Table 2: adi schedule with 4 processors

| Heuristic | ILP |
|---|---|
| n11-n07-n10-n06 | n01-n02-n03-n04 |
| n09-n13-n14-n08 | n05-n06-n07-n08 |
| n16-n02-n12-n01 | n09-n10-n11-n12 |
| n17-n15-n20-n05 | n13-n14-n15-n16 |
| n19-n04-n18-n03 | n17-n18-n19-n20 |
| n30-n29-n28-n27 | n21-n22-n23-n24 |
| n26-n25-n24-n23 | n25-n26-n27-n28 |
| n22-n21-n31-n32 | n29-n30-n31-n32 |
| n33-n34-n35-n36 | n33-n34-n35-n36 |
| n39-n38-n40-n37 | n37-n38-n39-n40 |

A 0-1 ILP problem is a special kind of ILP problem, where each variable can only take the value of 0 or 1. In our case, we have only one type of 0-1 solution variable, $X_{i,l}$, which indicates whether $v_i$ is scheduled on step $l$. To make our presentation clear, we use the expression $Step_{v_i}$ to represent the execution step in which the code block $v_i$ is scheduled.
$Step_{v_i}$ is expressed in terms of the $X_{i,l}$ variables and ILP constants as follows:

$$Step_{v_i} = \sum_{l=S_{v_i}}^{L_{v_i}} l \cdot X_{i,l}$$
Objective Function
Our objective is to find the execution step in which each node is to be scheduled, such that nodes that exhibit high data reuse with each other are scheduled as close to each other as possible. In our formulation, the 0-1 variables $X_{i,l}$ are used to capture this information. In other words, we want to minimize the scheduling step distance between nodes with data locality. Therefore, we can express our objective function as follows:
Minimize
where $e_{i,j} \in E_{loc}$ (i.e., $e_{i,j}$ is a locality edge).
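One form of the objective that matches this description, offered as an assumption rather than as the exact expression used in this work, weights the step distance of each locality edge by the amount of shared data:

```latex
\min \sum_{e_{i,j} \in E_{loc}} W_{e_{i,j}} \, \bigl| Step_{v_i} - Step_{v_j} \bigr|
```

The absolute value is not linear by itself. In a minimization problem it can be handled in the standard way, by introducing an auxiliary variable $d_{i,j}$ for each locality edge with $d_{i,j} \ge Step_{v_i} - Step_{v_j}$ and $d_{i,j} \ge Step_{v_j} - Step_{v_i}$, and minimizing $\sum W_{e_{i,j}} d_{i,j}$ instead. The symbol $d_{i,j}$ is introduced here for illustration only.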
Constraints
We have three types of constraints in our problem.
1. The execution step $Step_{v_i}$ for code block $v_i$ is unique for all $i \in \{1, ..., n\}$. As a result, for a given $i$, only one of the $X_{i,l}$ variables will take a value of 1, and the rest will be 0, which can be formulated as follows:

$$\sum_{l=S_{v_i}}^{L_{v_i}} X_{i,l} = 1, \quad \forall i \in \{1, ..., n\}$$

For example, if $S_{v_1} = 3$ and $L_{v_1} = 5$, then $X_{1,3} + X_{1,4} + X_{1,5} = 1$ is a constraint in our ILP formulation.
2. Sequencing relations must be satisfied. If a code block $v_j$ depends on $v_i$, then $v_j$ should be scheduled at a later step than $v_i$. This constraint can be expressed as follows: $Step_{v_j} > Step_{v_i}$, where $e_{i,j} \in E_{dep}$ (i.e., $e_{i,j}$ is a dependency edge).
3. Since we have $p$ processors, at most $p$ code blocks can be scheduled at an execution step. In other words, for any given step $l$, the sum of $X_{i,l}$ values cannot exceed $p$. Note that the actual number of nodes scheduled at a step can be less than $p$ due to data dependencies between code blocks. Therefore, for each step $l$, we include the following constraint in our formulation:

$$\sum_{i=1}^{n} X_{i,l} \le p$$
The above mentioned constraints and objective function constitute our ILP formulation of the problem. In this formulation, the nodes that do not exhibit data reuse are not part of the objective function, and are therefore scheduled based solely on data dependencies, or arbitrarily if they have none.
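The constraints can be checked on a candidate schedule with a few lines of code, which is useful for validating solver output. The sketch below is not part of the tool described in this paper. The function names are hypothetical, and `reuse_distance` assumes a weight-times-step-distance objective, which is one reading of the objective described above.

```python
def check_schedule(step, dep, p):
    """True if dependencies go to strictly later steps and no step holds more than p blocks.

    `step` maps each node to exactly one execution step, so uniqueness holds by construction.
    """
    ordered = all(step[j] > step[i] for (i, j) in dep)
    load = {}
    for s in step.values():
        load[s] = load.get(s, 0) + 1
    return ordered and all(count <= p for count in load.values())


def reuse_distance(step, loc):
    """Sum over locality edges of shared-data weight times step distance (assumed objective)."""
    return sum(w * abs(step[i] - step[j]) for (i, j), w in loc.items())


# Toy instance (hypothetical data): 4 blocks, 2 processors
step = {1: 1, 2: 1, 3: 2, 4: 2}
dep = {(1, 3)}
loc = {(1, 2): 5, (3, 4): 2, (1, 4): 1}
feasible = check_schedule(step, dep, p=2)
cost = reuse_distance(step, loc)
```

Here the schedule is feasible, and only the locality edge between blocks 1 and 4 spans two different steps, so the distance-weighted cost is 1.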
EXPERIMENTS
We implemented the algorithm explained in this paper as a software tool which takes an LPG in the TGFF format [14], as well as α, β and the number of processors in the CMP as input parameters. The tool parses the graph and applies our heuristic algorithm with the given parameters to obtain the data locality-optimized parallel execution schedule.
Setup
We used four benchmarks to test our algorithm, and performed experiments on three hardware platforms (with 2, 4 and 8 processors). Table 4 lists the benchmark codes we used and their important characteristics for our experiments. Table 5 lists the key properties of the hardware platforms used in our tests. For our tests, we simulated the hardware and OS platforms using Simics [4]. Simics allows building a binary-compatible instance of the target hardware, which operates completely within a virtualized environment running on standard PCs.
For each of the hardware platforms, the following operations were performed. We applied our algorithm to the LPG of each benchmark. The above mentioned ILP formulation of each of these problems produced an alternate schedule. We used the Intel C++ Compiler 10 [1] and OpenMP API [3] to compile the codes resulting from these schedules (for both the heuristic scheme and the ILP solver based scheme). We also compiled the original code both without parallelization and using the Intel compiler's own parallelization mechanism. For our tests, the default values of α = β = 1 were used. We also experimented with other values of the α and β parameters, but the results were very close to those shown here for the α = β = 1 case.
Results
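On each platform, we obtained four results for each benchmark.
1. The original code, compiled without parallelization.
2. The original code, parallelized using the Intel compiler's own parallelization mechanism.
3. Heuristic Parallel: The code scheduled by our heuristic scheduling approach (Section 4).
4. ILP: The code scheduled based on the solution returned by the ILP based formulation (Section 5).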
Our first observation is that our algorithm performs better than the original in all but one of the benchmarks. This exception arises because that benchmark has a small body of code, which makes the synchronization overhead introduced by parallelization significant. Our algorithm performs better than the compiler's own parallelization mechanism in all cases, and is very close to the result achieved by the ILP solver. The overall speedups achieved by our algorithm with respect to the original code are 1.62, 1.50 and 4.21 for the 2, 4 and 8 processor cases, respectively. By comparison, the overall speedups achieved by the ILP based solution with respect to the original code are 1.63, 1.54 and 4.29 for the 2, 4 and 8 processor cases, respectively.
RELATED WORK
In this section, we first review previous work on parallelization and then survey efforts on data locality. Parallelism can be obtained at different levels of abstraction. Instruction-level parallelism is exploited by high-performance microprocessors, whereas data-level parallelism is utilized in nested loops using compilers. Similarly, task-level parallelism can be found in many embedded applications. To exploit data-level parallelism, Kadayif et al [18] proposed using a different number of processor cores for each loop nest to obtain energy savings. This way, idle processors are switched to a low-power mode to increase the energy savings. Mei et al [26] propose a modulo scheduling algorithm for coarse-grained reconfigurable architectures by exploiting loop-level parallelism. To parallelize loops, Bondalapati [8] exploits the distributed memory available in the digital signal processing domain. More specifically, he exploits the reconfigurable architecture by implementing a data context switching technique. Goumas et al [15] try to generate parallel code for tiled nested loops through different loop transformations using MPI. Hogstedt et al [16] predict the execution time of tiled loop nests and use this prediction to automatically determine the tiling parameters that minimize the execution time. Arenaz et al [6] exploit coarse-grain parallelism by a gated single assignment (GSA) based approach with complex computations. Yu and D'Hollander [35] construct an iteration space dependency graph to visualize a 3D iteration space. Beletskyy et al [7] adopt a hyperplane-based representation to apply on transformation matrices with both uniform and non-uniform dependences. Lim et al [25] employ affine partitioning to maximize parallelism with minimum communication overhead. Ozturk et al [31] focus on optimizing parallelism in chip multiprocessors using constraint networks. In [32], the authors propose an abstract interpretation to analyze needed loop parallelization.
Two major techniques to exploit locality are loop transformations and data transformations. Wolf and Lam [33] define reuse vectors and reuse spaces. Moreover, they use these concepts to implement an iteration space optimization technique. Similarly, Li [24] uses reuse vectors to detect the dimensions of a loop nest that carry reuse. In [10], the authors analyze data dependences by a variable renaming technique to break anti and output dependences along with a technique to resolve recurrences in a nested loop. Navarro et al [29] represent the locality using a locality graph, and mixed integer nonlinear programming is used on this graph to minimize the communication cost and load imbalance. Carr et al [9] re-order the computation by using a simple locality criterion to enhance data locality. Tiling [13, 20, 21] is another loop based locality enhancing technique. On the data transformation side, in [30], the authors generate code with a given data transformation matrix. Kandemir et al [19] implement an explicit layout representation, whereas [22] focuses more on memory consumption reduction due to a layout transformation. There are also efforts to combine data and loop transformations. Among these is one of the first papers [12] that offers a scheme which unifies loop and data transformations. On the other hand, Anderson et al [5] propose a transformation technique that accesses contiguous data elements. Chen et al [11] employ a constraint network based solution to the combined data/loop optimization problem.
Hwu et al [27] present a parallel programming model for many-core microprocessors, and provide initial technical approaches towards this goal. Leverich et al [23] compare hardware-managed coherent caches and software-managed streaming memory under the same set of assumptions in terms of technology, area, and computational capabilities. Xue et al [34] propose a memory-conscious loop parallelization strategy which is formulated as a branch-and-bound problem. Hughes et al [17] examine parallelization potential of physics-based simulation and characterize its behavior on a chip multiprocessor.
As compared to these prior studies, we target chip multiprocessors where processors share an on-chip cache and propose a scheduling scheme for improving both data reuse and parallelism.
CONCLUSION
The increasing use of chip multiprocessors in the embedded computing domain makes automated software support a primary concern for programmers. In particular, the compiler plays an important role since it shapes the code behavior as well as the data access pattern. Targeting embedded chip multiprocessors and loop-intensive computations, we propose a novel compiler-based loop scheduling scheme with the goal of exploiting both parallelism and locality. This paper describes our strategy and evaluates it using a set of four application codes. The results of our algorithm are compared to those obtained by the ILP solution and the original codes. The experimental results we collected are promising. Our ongoing work on loop scheduling includes porting our strategy to CMPs with private data caches.
