This paper presents a novel approach for compiling parallel applications to a target Coarse-Grained Reconfigurable Architecture (CGRA), together with a formal definition of the compilation problem for CGRAs. The application is written in HARPO/L, a parallel object-oriented language suitable for hardware. HARPO/L is first compiled to a Data Flow Graph (DFG) representation. The remaining compilation steps are a combination of three tasks: scheduling, placement, and routing. For compiling cyclic portions of the application, we have adapted a modulo scheduling algorithm: modulo scheduling with integrated register spilling. For scheduling, the nodes of the DFG are ordered using the hypernode reduction modulo scheduling (HRMS) method. Placement and routing are done using the neighborhood relations of the processing elements (PEs).
INTRODUCTION
Reconfigurable computing has been an active field of research for the past two decades. To overcome the disadvantages of Field Programmable Gate Arrays (FPGAs), many coarse-grained or ALU-based reconfigurable architectures (CGRAs) have been proposed as an alternative to FPGA-based systems and fixed-logic CPUs. Although CGRAs have the potential to provide both hardware-like efficiency and software-like flexibility, the absence of proper compilation approaches is an obstacle to their widespread use. There has not been much work on compiling applications directly onto systems containing only CGRAs.
Compiling applications to a CGRA, after the source code of the target application has been transformed and optimized into a suitable intermediate representation, is a combination of three tasks: scheduling, placement, and routing. Scheduling assigns time cycles to the operations for execution. Placement assigns these scheduled operations to specific processing elements (PEs). Routing finds paths for data from the producer PE to the consumer PE using the interconnect structure of the target architecture.
Thanks to NSERC for funding.
The main contribution of this paper is to formally define the compilation problem for CGRAs and propose a compilation approach for a family of CGRAs. The target architecture will be specified by the user. The intended application will be written in HARPO/L [1]. The input of the compilation is the intermediate representation of the target application in the form of a DFG and a description of the target architecture; the output will be executable code. HARPO/L is first compiled to a DFG representation [2]. The remaining compilation steps are scheduling, placement, and routing. The rest of the paper is organized as follows. Section 2 discusses the back end of the compilation process. Section 3 formalizes the scheduling, placement, and routing problem. Section 4 describes the modulo scheduling algorithm for cyclic portions of the target application. Section 5 describes the placement and routing steps performed during mapping from the input DFG to the input target architecture. Section 6 concludes the paper with possible future work.
OVERVIEW OF COMPILATION

Source Language Description
The HARPO/L (HARdware Parallel Objects Language) [1] language is designed for implementation on either microprocessors or CGRAs. HARPO/L is a parallel, object-oriented, multi-threaded programming language. The language design allows explicit parallelism and enables the compiler to extract inherent parallelism. It supports codesign by allowing a distribution of objects between CGRAs, FPGAs, and microprocessors.
Framework of Overall Compilation
Our goal is to compile parallel applications to a given target architecture with minimal execution time. To that end, the target application is first written in HARPO/L. The source code is then transformed and optimized to obtain the intermediate representation in the form of an executable DFG. This executable DFG and a graph-based description of the target architecture form the inputs of the back end of our compilation procedure. The output of the mapping phase is schedule, placement, and routing information.
The accompanying figure shows the framework of our overall compilation.
Overview of our Compilation flow
After target architecture transformation and intermediate representation of the target application, we have two input graphs: a target architecture graph and an executable DFG. Our task is now to map the DFG onto the routing resource graph (RRG) derived from the target architecture as efficiently as possible, so that the execution time is as short as possible.
We will now give an overview of our compilation process. Our idea is to first analyze the DFG to extract some information that may be useful in the later phases. We are assuming here that the DFG given as input has been optimized using various common optimization techniques.
Since the input DFG can be cyclic, we need an approach for separating the cyclic and acyclic parts of the DFG. We then apply the mapping (from DFG to RRG) to each part separately and integrate the results to obtain the mapping as a whole.
For mapping acyclic parts we will use list scheduling, the most commonly used algorithm for resource-constrained scheduling problems. Instead of the conventional priority functions, our list scheduling will use the approach of [3].
Cyclic parts will be mapped using a register-constrained modulo scheduling method, an improved version of the modulo scheduling with integrated register spilling (MIRS) algorithm [4]. For both the cyclic and acyclic parts, the nodes of the DFG are placed on the processing elements of the target architecture and the necessary routing is done accordingly.
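The text does not say how the cyclic and acyclic parts are separated. One common way, shown here purely as an illustrative sketch and not as the method used in this work, is to compute strongly connected components: a component with more than one node, or a single node with a self-loop, is cyclic. The function name `split_cyclic_acyclic` and the edge-list encoding are assumptions made for this example.

```python
def split_cyclic_acyclic(nodes, edges):
    """Split a DFG into cyclic components and acyclic nodes (Kosaraju's SCC algorithm).

    nodes: list of node names; edges: list of (source, target) pairs.
    Returns (cyclic_components, acyclic_nodes).
    """
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)

    # Pass 1: iterative DFS on the graph, recording nodes in order of completion.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, it = stack[-1]
            nxt = next((m for m in it if m not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(succ[nxt])))

    # Pass 2: DFS on the reversed graph in decreasing completion order.
    comp_of, comps = {}, []
    for root in reversed(order):
        if root in comp_of:
            continue
        comp_of[root] = len(comps)
        comp, stack = [], [root]
        while stack:
            node = stack.pop()
            comp.append(node)
            for m in pred[node]:
                if m not in comp_of:
                    comp_of[m] = len(comps)
                    stack.append(m)
        comps.append(comp)

    self_loops = {s for s, d in edges if s == d}
    cyclic = [sorted(c) for c in comps if len(c) > 1 or c[0] in self_loops]
    cyclic_nodes = {n for c in cyclic for n in c}
    acyclic = [n for n in nodes if n not in cyclic_nodes]
    return cyclic, acyclic
```

Each cyclic component would then go to the modulo scheduler and the remaining nodes to the acyclic scheduler.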
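To make the acyclic case concrete, here is a minimal sketch of resource-constrained list scheduling. It assumes a pool of identical functional units and a caller-supplied priority table; the priority used in this work comes from [3] and is not reproduced here, so the default (insertion order) is only a placeholder. The name `list_schedule` and its parameters are hypothetical.

```python
def list_schedule(latency, edges, num_units, priority=None):
    """Schedule an acyclic DFG on num_units identical functional units.

    latency: dict node -> latency in cycles (insertion order is the default priority).
    edges: list of (producer, consumer) pairs; the graph must be acyclic.
    priority: optional dict node -> rank, lower rank is scheduled first.
    Returns dict node -> start cycle.
    """
    preds = {n: [] for n in latency}
    for src, dst in edges:
        preds[dst].append(src)
    rank = priority or {n: i for i, n in enumerate(latency)}

    start = {}
    busy_until = [0] * num_units  # first cycle at which each unit is free again
    cycle = 0
    while len(start) < len(latency):
        # A node is ready once every predecessor has finished executing.
        ready = [
            n for n in latency
            if n not in start
            and all(p in start and start[p] + latency[p] <= cycle for p in preds[n])
        ]
        ready.sort(key=lambda n: rank[n])
        for n in ready:
            unit = next((u for u in range(num_units) if busy_until[u] <= cycle), None)
            if unit is None:
                break  # no free unit left in this cycle
            start[n] = cycle
            busy_until[unit] = cycle + latency[n]
        cycle += 1
    return start
```

This sketch only assigns start cycles; placement on specific PEs and routing are ignored here.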
PROBLEM FORMULATION
Executable DFGs
We use the executable DFGs [2] as input for scheduling, placement, and routing. We can represent our input executable DFG as a graph $DFG = (N, E, \overrightarrow{\cdot}, \overleftarrow{\cdot}, \mathit{op}, \mathit{inRole}, \mathit{outRole})$ such that:
• $N$ is the set of nodes. $\mathit{op} : N \to \mathit{operations}$ labels nodes with operations.
• $E$ is the set of edges. $\overleftarrow{e}$ and $\overrightarrow{e}$ are the source and target nodes of edge $e$. $\mathit{srcRole}, \mathit{targRole} : E \to \mathit{roles}$ indicate the roles that the edge plays at its source and target.
Target Architecture
We can represent the target architecture specified by the architecture designer by a graph $TA = (C, E)$, such that:
• $C = FU \cup RG$ is the set of functional units $FU$ and registers $RG$.
• $E$ is the set of edges, where $\overleftarrow{r}, \overrightarrow{r} \in C$ are the source and target nodes of an edge $r$. When more than one edge enters a register there is a choice as to which value is stored in the register. Although not formalized here, edges entering and leaving functional units may have different roles. When more than one edge enters a functional unit with the same role, there is a choice as to which value is used.
For scheduling, placement, and routing purposes we have modeled our target architecture with a routing resource graph (RRG). The RRG is obtained by replicating the target architecture graph TA once for each time cycle and adding the necessary interconnections within and across time cycles. An RRG is a directed graph whose vertex set is $C \times \mathbb{N}$, i.e., the resources of the TA replicated across time. The edges of the RRG come in three flavours, $X$, $Y$, and $Z$, to be described shortly. In the following we assume that $x$, $y$, and $z$ are one-one functions with mutually disjoint ranges.
• For each edge $r$ in the TA such that $r$ goes to a functional unit, there is a corresponding edge at each time.
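• For each edge $r$ in the TA such that $r$ goes to a register, there is an edge between copies of the TA indicating that the value of the register at time $t + 1$ may be a value computed at time $t$: the edge $y(t, r) \in Y$ satisfies $\overleftarrow{y(t, r)} = (\overleftarrow{r}, t)$ and $\overrightarrow{y(t, r)} = (\overrightarrow{r}, t + 1)$.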
• For each register $f$ in the TA, there is an edge between copies of the TA indicating that the value of the register at time $t + 1$ may be the same value as at time $t$: $Z = \{ z(t, f) \mid f \in RG, t \in \mathbb{N} \}$, where $\overleftarrow{z(t, f)} = (f, t)$ and $\overrightarrow{z(t, f)} = (f, t + 1)$.
For loop scheduling, we use a similar RRG except that the time domain is finite and cyclic and hence the RRG is toroidal.
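As a rough illustration of the construction above, the following sketch builds a finite slice of the RRG from a target architecture graph. The function name `build_rrg`, the finite horizon `num_cycles` (the formal RRG for straight-line code is infinite), and the pair-based encoding of vertices as `(resource, time)` are assumptions made for this example. With `cyclic=True` the time index wraps around, which gives the toroidal variant used for loops.

```python
def build_rrg(units, registers, ta_edges, num_cycles, cyclic=False):
    """Replicate a target architecture over num_cycles time steps.

    units, registers: lists of resource names.
    ta_edges: list of (source, target) pairs of the target architecture.
    Returns (vertices, x_edges, y_edges, z_edges), where each edge is a pair
    of (resource, time) vertices.
    """
    regs = set(registers)
    resources = list(units) + list(registers)
    vertices = [(c, t) for t in range(num_cycles) for c in resources]

    def next_time(t):
        # Toroidal time wraps around; a finite slice simply ends.
        if cyclic:
            return (t + 1) % num_cycles
        return t + 1 if t + 1 < num_cycles else None

    x_edges, y_edges, z_edges = [], [], []
    for t in range(num_cycles):
        t_next = next_time(t)
        for src, dst in ta_edges:
            if dst in regs:
                # Edge into a register: the value is visible one cycle later.
                if t_next is not None:
                    y_edges.append(((src, t), (dst, t_next)))
            else:
                # Edge into a functional unit: stays within the same cycle.
                x_edges.append(((src, t), (dst, t)))
        for reg in registers:
            # A register may hold its value from one cycle to the next.
            if t_next is not None:
                z_edges.append(((reg, t), (reg, t_next)))
    return vertices, x_edges, y_edges, z_edges
```

For a loop, `num_cycles` would be set to the initiation interval so that the wrap-around edges connect the last cycle back to the first.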
Mapping Problem Formulation
A subgraph homeomorphism between two graphs $G$ and $H$ is a pair of functions $(f_1, f_2)$ such that $f_1$ is a one-one function from the nodes of $G$ to the nodes of $H$ and $f_2$ maps edges of $G$ to paths in $H$ such that $f_1(\overleftarrow{e}) = \mathit{start}(f_2(e))$ and $f_1(\overrightarrow{e}) = \mathit{end}(f_2(e))$. A subgraph homeomorphism is considered node disjoint if the internal nodes of each path $f_2(e)$ appear on no other path in the range of $f_2$.
Assuming all operation latencies are 1, we can formulate the scheduling, placement, and routing problem as one of finding a node disjoint subgraph homeomorphism $(f_1, f_2)$ between the input executable $DFG = (N, E, \mathit{op}, \mathit{inRole}, \mathit{outRole})$ and the RRG, based on the target architecture TA. The RRG can be infinite for straight-line code, or toroidal for loops.
• If $f_1(n) = (k, t)$, operation $n$ is scheduled for time $t$ on functional unit $k$. We require that $k$ must be capable of executing $n$'s operation.
• If $f_2(e) = P$, then $P$ is a path carrying information across space and time from one functional unit to another.
If operations can have a latency of more than one clock cycle, we need to generalize the above formulation. We can formulate our mapping problem as one of finding a pair of functions $(f_1, f_2)$ between the input executable DFG and the RRG such that:
• $f_1(n) = \{(k, t), (k, t + 1), \ldots, (k, t + \lambda_n - 1)\}$. Here $k$ is the processing element that will execute $n$'s operation, $t$ is the time at which $n$ starts executing, and $\lambda_n$ is $n$'s latency. We require that $k$ must be capable of executing $n$'s operation in $\lambda_n$ time units. For any two distinct nodes $u$ and $v$, it is required that $f_1(u) \cap f_1(v) = \emptyset$.
• For an edge $e$, let $f_1(\overleftarrow{e}) = \{k_0\} \times \{t_0, t_0 + 1, \ldots, t_0 + \lambda_{\overleftarrow{e}} - 1\}$ and $f_1(\overrightarrow{e}) = \{k_1\} \times \{t_1, t_1 + 1, \ldots, t_1 + \lambda_{\overrightarrow{e}} - 1\}$; then $f_2(e)$ should be a path $P$ such that $\mathit{start}(P) = (k_0, t_0 + \lambda_{\overleftarrow{e}} - 1)$ and $\mathit{end}(P) = (k_1, t_1)$. For any edge $e$, the internal nodes of $f_2(e)$ should appear on no other path in the range of $f_2$ and also not in any $f_1(n)$.
Our formulated compilation problem has the following properties:
• Each node $n$ must be scheduled to be processed on a unique processing element at a unique time.
• A processing element $k$ processes at most one node's operation at a given time $t$.
• If a node $n_1 \in N$ is a predecessor of another node $n_2 \in N$, then $n_1$ must complete its operation's execution before $n_2$'s operation starts.
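The properties listed above can be checked mechanically for a candidate placement. The sketch below is a hedged illustration for straight-line (acyclic) code only; it ignores routing paths and the modulo wrap-around needed for loops. The name `check_mapping` and the encoding of the placement as `node -> (pe, start_cycle)` are assumptions made for this example.

```python
def check_mapping(placement, latency, dfg_edges):
    """Return a list of violations of the resource and precedence properties.

    placement: dict node -> (pe, start_cycle).
    latency: dict node -> latency in cycles.
    dfg_edges: list of (producer, consumer) pairs.
    """
    errors = []

    # A PE may execute at most one operation in any given cycle.
    occupied = {}
    for node, (pe, start) in placement.items():
        for t in range(start, start + latency[node]):
            owner = occupied.setdefault((pe, t), node)
            if owner != node:
                errors.append(("conflict", pe, t, owner, node))

    # A producer must finish before its consumer starts.
    for producer, consumer in dfg_edges:
        finish = placement[producer][1] + latency[producer]
        if finish > placement[consumer][1]:
            errors.append(("precedence", producer, consumer))
    return errors
```

An empty result means the placement satisfies both properties; the first property (one PE and one start time per node) holds by construction of the dictionary.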
MAPPING ALGORITHM
In this section we discuss how to schedule, map, and route the cyclic parts of an application, especially loops. We adapt Modulo Scheduling with Integrated Register Spilling (MIRS) [4]. MIRS is a software pipelining method that performs instruction scheduling with reduced register requirements, register allocation, and register spilling in a single phase. MIRS alone, however, cannot do the compilation required for our problem, because it performs only scheduling and placement and does not consider routing. We need to do routing during placement: a cost function is computed during placement to evaluate the quality of the placement, and that cost function must incorporate the routing cost. We have therefore modified the MIRS algorithm to incorporate this feature. Another factor that we have incorporated into the MIRS algorithm is the handling of loops with conditional branches. For this we have adapted the if-conversion and reverse-if-conversion idea from [5].
The input of the algorithm is an executable DFG representing the cyclic part, together with the RRG. The algorithm has two outputs. One is the initiation interval (II), the number of clock cycles between the initiation of successive iterations of a loop; the lower the initiation interval, the deeper the software pipeline. The other is a schedule of the nodes of the DFG, which is the pair of functions $f_1$ and $f_2$. This schedule enables each node of the DFG to execute at its time cycle on its resource. Here $f_1$ is a partial function that maps each node of the DFG scheduled so far to a set of pairs, each consisting of a time cycle and a resource, and $f_2$ is a partial function that maps each edge of the DFG to a path in the RRG.
Figure 1 shows the actual scheduling, placement, and routing step of the improved MIRS (IMIRS) algorithm. In this phase a node is scheduled starting from a particular clock cycle on one or more resources (functional units and/or registers). II and MII are the initiation interval and minimum initiation interval, respectively [6]. The step first calculates $\mathit{Early\_Start}_u$ and $\mathit{Late\_Start}_u$ of the node $u$ to be scheduled, which yields a time frame in which that node can be scheduled legally. Suppose Start and End define this time frame. For scheduling node $u$, Mapping() determines one or more resources in the RRG within this time frame, starting from Start, that give the lowest cost. During this search, checks are made that there are valid routes between $u$ and its predecessors and successors, that no dependence is violated, and that no resource conflict arises. If such free resources are found, $u$ is scheduled at the time cycles indicated by the position of the resource(s) in the RRG. Necessary updates are made to the partial schedule, resources, and registers. However, if no valid cycle is found, the Force_and_Eject heuristic is applied.
The partial schedule is scanned forwards or backwards depending on the values of Early_Start, Late_Start, and II, and on whether predecessors or successors of the node to be scheduled are already placed in the partial schedule. This is done according to the rules from [3]. The Force_And_Eject heuristic, the Check_and_Insert_Spill heuristic, and the Restart_Schedule heuristic are given in detail in [4]. Figure 2 shows the pseudocode of the improved MIRS algorithm adapted to CGRAs for cyclic parts. This algorithm uses the node ordering strategy of [3] for assigning priority to the nodes of the DFG. Mapping() is used for placing operations and routing them from the producer FU to the consumer FU in the available time cycles.
The basic steps of the algorithm are summarized below. First the algorithm initializes II to MII, and $f_1$ and $f_2$ to empty functions. After the algorithm has completed, $f_1$ will map every node of the DFG to a resource and one or more time cycles, depending on the latency of the node's operation. Budget is initialized to the number of nodes of the DFG times Budget_Ratio, where Budget_Ratio is the average number of times that each node of the DFG can be attempted to be scheduled with a fixed value of II.
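The initialization step can be sketched as follows. The paper refers to [6] for MII and does not spell out its computation; the sketch below uses the standard modulo-scheduling lower bound (the larger of a resource bound and a recurrence bound), which is an assumption and may differ in detail from the formulation in [6]. The function names, the assumption of identical functional units, and the encoding of recurrences as `(total_latency, total_distance)` pairs are hypothetical.

```python
import math


def minimum_initiation_interval(num_ops, num_units, recurrences=()):
    """Lower bound on II from resources and from dependence cycles.

    num_ops: number of operations in the loop body.
    num_units: number of (identical) functional units, an assumption here.
    recurrences: iterable of (total_latency, total_distance) per dependence cycle,
                 with total_distance >= 1 iterations.
    """
    res_mii = math.ceil(num_ops / num_units)
    rec_mii = max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii, 1)


def initial_state(num_nodes, num_units, budget_ratio, recurrences=()):
    """Initial values described in the text: II = MII, empty maps, Budget."""
    return {
        "II": minimum_initiation_interval(num_nodes, num_units, recurrences),
        "f1": {},  # node -> set of (resource, cycle) pairs, filled in later
        "f2": {},  # edge -> path in the RRG, filled in later
        "Budget": num_nodes * budget_ratio,
    }
```

For example, seven operations on three units with one recurrence of latency 5 and distance 1 would start from an II of 5.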
Figure 1: Schedule_Place_Route
Figure 2: Improved MIRS for Compilation on CGRA
After these initializations, all the nodes of the DFG are ordered according to [3]. The ordered nodes are inserted into Priority_List. The algorithm then iteratively tries to schedule, place, and route operations from Priority_List. In each iteration, the operation with the highest priority is removed from the list and Schedule_Place_Route() tries to find an FU for its execution, using a route of free edges of the RRG that minimizes a cost heuristic. If such an FU and time cycle are found without violating any intra-iteration or inter-iteration dependency or resource constraint, that FU and time cycle are reserved for the operation, so that no subsequent operation can use them until it has finished with them. However, if no such cycle exists, the algorithm employs the Force_And_Eject technique, in which the node to be scheduled is forced into a specific cycle. At the same time, Force_And_Eject ejects the nodes that caused the dependency violations or resource conflicts.
Then the algorithm determines whether any values need to be spilled to memory to reduce the register pressure. The algorithm also detects the lifetime of a variable, or the use of it, that needs spilling. Then Restart_Schedule validates the current partial schedule against the current II. If the current partial schedule is valid, the algorithm continues with the next node of Priority_List; otherwise II is increased and the whole procedure is restarted with the new II.
After all the nodes of Priority_List have been scheduled, the algorithm allocates registers for them. Then the configuration for executing the target application on the target CGRA is generated using II and the mapping functions $f_1$ and $f_2$.
PLACEMENT AND ROUTING
This section introduces our strategy for mapping from DFG to RRG used in the IMIRS algorithm. We have proposed a new placement method for CGRA. This method uses the neighborhood relations among the functional units (FUs) and registers. We will denote both FUs and registers as processing elements (PEs).
We can view the RRG as the given target architecture (composed of PEs) replicated across time. The interconnections among the PEs within a particular time cycle and across time boundaries define regions of increasing distance. All the unoccupied PEs in time cycle Start can be viewed as the PEs of first choice. The reason for this highest priority is that those PEs can be reached from the producer/consumer PEs in the fewest possible clock cycles. Our idea is to look for a potential PE for a particular node among these PEs first, provided all the shortest-route edges from the producers or consumers of a node $u$ to $u$ are also unoccupied. All the possible PEs, considering all the predecessors/successors of the node to be placed, are tried and a cost function is evaluated for each of them. The PE with the lowest cost is selected for placing that node. If such a PE cannot be found, we explore the PEs at time cycle (Start + 1), and so on. That is, we choose the shortest paths connecting the producer PEs and the consumer PEs.
The algorithm is outlined in Figure 3. The first for loop determines the possible candidates for mapping the current node. In each iteration, the unoccupied PEs at time cycle $d$ for which all the edges along the shortest path from PSP($u$) or PSS($u$) to $u$ in the RRG are unoccupied become elements of the neighbor set. Then the while loop selects the best PE from the candidates. Each neighbor candidate is considered for mapping, and a cost function is evaluated for each of them considering all the elements in its PSP($u$) or PSS($u$). The PE that contributes the lowest cost is selected for placing the operation in question. Then the selected PE, Selected, is marked occupied from time cycle $d$ to $d + \lambda_u - 1$, where $\lambda_u$ is the latency of $u$. The necessary routing is done following the available interconnections that give the best routing. All the edges in the RRG along the path $P$ that corresponds to the edges between the selected PE at $d$ and the PE occupied by each node in PSP($u$) or PSS($u$) are marked occupied.
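The search described above can be sketched in a much simplified form. The code below is not the algorithm of Figure 3: it assumes unit latency, considers only already-placed producers (not successors), tracks occupancy on RRG vertices rather than edges, and does not check that routes from different producers are disjoint. All names (`shortest_free_path`, `place_node`, `rrg_succ`, `cost`) are hypothetical.

```python
from collections import deque


def shortest_free_path(rrg_succ, occupied, src, dst):
    """Breadth-first search for a shortest path whose internal vertices are free."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in rrg_succ.get(v, ()):
            if w not in parent and (w == dst or w not in occupied):
                parent[w] = v
                queue.append(w)
    return None


def place_node(pes, rrg_succ, occupied, producers, start, end, cost):
    """Try cycles start..end in order and pick the cheapest reachable free PE.

    pes: list of PE names; rrg_succ: dict (pe, t) -> list of successor vertices.
    occupied: set of occupied (pe, t) vertices, updated in place on success.
    producers: list of (pe, t) vertices where the inputs become available.
    cost: function taking the list of routes and returning a number.
    Returns ((pe, cycle), routes), or None if no candidate exists in the window.
    """
    for d in range(start, end + 1):
        best = None
        for pe in pes:
            target = (pe, d)
            if target in occupied:
                continue
            routes = [shortest_free_path(rrg_succ, occupied, p, target) for p in producers]
            if any(r is None for r in routes):
                continue  # some producer cannot reach this candidate
            c = cost(routes)
            if best is None or c < best[0]:
                best = (c, target, routes)
        if best is not None:
            _, target, routes = best
            occupied.add(target)
            for r in routes:
                occupied.update(r[1:-1])  # reserve the intermediate hops
            return target, routes
    return None
```

Returning `None` corresponds to the case where no valid cycle is found in the time frame, at which point the text falls back to Force_And_Eject.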
Cost Evaluation
We use a greedy approach for evaluating the cost function of a particular placement. Our cost function consists of a delay cost and an interconnect cost. The delay cost of a node $u$ is contributed by the time cycles in which the nodes in $\mathit{Pred}(u)$ are scheduled. It is equal to the maximum of such delays. The interconnect cost comes from the interconnections that must be dedicated in order to route the value from the producer PE to the consumer PE. The longer the interconnections are occupied, the larger the Early_Start of the successor nodes will be. The PE with the lowest total of these costs is selected for executing the current node.
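One possible reading of this cost function is sketched below. The text does not give exact formulas, so the choices here are assumptions: the delay of a predecessor is taken as the gap between its finish cycle and the candidate cycle, the interconnect cost is the number of RRG edges on the routes, and the two are added without weights. The name `placement_cost` is hypothetical.

```python
def placement_cost(candidate_cycle, pred_finish_cycles, routes):
    """Delay cost plus interconnect cost of one candidate placement.

    candidate_cycle: cycle at which the node would start.
    pred_finish_cycles: finish cycles of the already-scheduled predecessors.
    routes: list of paths, each a list of RRG vertices from producer to candidate.
    """
    # Delay cost: the largest wait imposed by any predecessor (0 if there are none).
    delay_cost = max((candidate_cycle - f for f in pred_finish_cycles), default=0)
    # Interconnect cost: number of RRG edges that must be dedicated to routing.
    interconnect_cost = sum(len(route) - 1 for route in routes)
    return delay_cost + interconnect_cost
```

A candidate starting at cycle 5 with predecessors finishing at cycles 3 and 4, reached over routes of two edges and one edge, would have a delay cost of 2 and an interconnect cost of 3.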
CONCLUSION AND FUTURE WORK
In this paper a novel compilation approach for parallel applications to coarse-grained reconfigurable architectures has been proposed. The intended application is written in HARPO/L. The input of the compilation is the intermediate representation of the target application in the form of DFGs using static token and a description of the target architecture; the output is executable code. HARPO/L is first compiled to a DFG representation. The remaining compilation steps are a combination of three tasks: scheduling, placement and routing.
