Abstract: Coarse-Grained Reconfigurable Architectures (CGRAs) have gained currency in recent years due to their abundant parallelism and flexibility. To utilize this parallelism, we propose a fast and efficient Modulo-Constrained Hybrid Particle Swarm Optimization (MCHPSO) scheduling algorithm that exploits loop-level parallelism in applications. PSO has proven successful in many continuous optimization problems. In this paper, we show that PSO is capable of software pipelining loops by overlapping the placement, scheduling, and routing of successive loop iterations and executing them in parallel. The proposed algorithm has been experimentally validated on various DSP benchmarks under two different architecture configurations. These experiments indicate that the MCHPSO algorithm can find schedules with small initiation intervals within a reasonable amount of time. PSO is thus a promising alternative for obtaining near-optimal solutions to this NP-hard scheduling problem.
INTRODUCTION
Reconfigurable systems have drawn increasing attention from both academic and commercial research in the past few years because they combine flexibility with efficiency and upgradability [1]. Among the reconfigurable architectures, many Coarse-Grained Reconfigurable Architectures (CGRAs) have been proposed as an alternative to FPGA-based systems [2]. CGRAs consist of programmable coarse-grained Processing Elements (PEs) which support a predefined set of word-level operations, a programmable interconnection network, a configuration memory, and a controller [1]. Unfortunately, the available parallelism has been exploited by few automated design and compilation tools [2].
The massive amounts of parallelism found in CGRAs can be used to map time critical loops of an application. This can be achieved by Modulo Scheduling [2] , which is a software pipelining technique that overlaps several iterations of a loop by generating a schedule for an iteration of the loop. Modulo scheduling uses the same schedule for subsequent iterations started at a constant interval called the initiation interval (II).
Several heuristic techniques have been tried by researchers in solving the modulo scheduling problem. In this paper, we propose a modulo scheduling algorithm based on Particle Swarm Optimization (PSO). We call this the Modulo-Constrained Hybrid Particle Swarm Optimization (MCHPSO) algorithm. PSO provides near-optimal solutions with fast convergence and low execution time for various combinatorial and multidimensional optimization problems [3]. The MCHPSO algorithm enforces modulo constraints on the parallelism of loop operations, as well as data dependences, while mapping onto the CGRA.
The MCHPSO algorithm has been tested on benchmarks taken from [4] , [5] , and [6] . The benchmarks are derived from applications written in the C programming language. The results show that the proposed MCHPSO algorithm finds a valid schedule for the given target applications in reasonable time, with efficient utilization of resources.
The rest of this paper is organized as follows: An overview of compilation and background is given in Section II. The proposed PSO-based modulo scheduling algorithm (MCHPSO) is explained in Section III. The last three sections present the experimental results, related work, and conclusion.
II. BACKGROUND
In this paper, we propose an algorithm for modulo scheduling of a loop to be mapped onto CGRAs. The method starts from an imperative language representation of the application, such as a program written in C or some other high-level language.
Each source program is converted to a Data Flow Graph (DFG). The given Target Architecture (TA) is represented by a graph containing all the necessary information, such as the number of resources, their capacity and interconnections, as well as other specific information for each resource. The generic TA graph representation was designed to allow a wide range of architectures. The ADRES [1] architecture was adopted as the TA for our current work. We chose the ADRES architecture because it has a flexible architecture template and loops can easily be mapped onto the ADRES array in a highly parallel way. Furthermore, choosing this architecture allows direct comparison with the method presented in [1]. The TA is replicated for each time cycle to form the Routing Resource Graph (RRG), an internal time-space graph representation.
The mapping algorithm MCHPSO maps each node of the DFG to a node of the RRG and each edge of the DFG to a path in the RRG. The generated scheduled code of the loop exhibits a high degree of Instruction Level Parallelism (ILP).
A. Motivational Example
The compilation flow is illustrated with a motivational example in Figure 1. Consider the architecture configuration shown in Figure 1(a) and the DFG represented in Figure 1(c). The architecture components in Figure 1(a) are the Input port (I), Functional Unit (FU), Write Port (WP), Read Port (RP), and Register File (RF). Figure 1(b) shows an RRG created by replicating the TA across two time cycles, as the II is 2. The final embedding of the DFG on the RRG is shown in Figure 1(d).
The schedule produced by the algorithm maps each operation to a processing element and a time, and maps each edge in the DFG to a path in the RRG. During the scheduling process, MCHPSO keeps track of the resources being used in a Modulo Reservation Table (MRT), as shown in TABLE I. The columns in the MRT represent the resources in the architecture and the rows represent remainders modulo the initiation interval. In the example, an operation is to be executed at time 0 on FU1, so FU1 is reserved for all cycles divisible by II. Once a resource is reserved, it is not available to other operations in time cycles that have the same remainder modulo II. The routing path between two operations uses WP1 (2), RF1 (2), and RP1 (2), which are also reserved in the MRT. The capacity of a resource is given in brackets; resources without a bracketed value have a capacity of one.
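The reservation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Java implementation; the class name `ModuloReservationTable`, its methods, and the example capacities are assumptions made for the sketch.

```python
class ModuloReservationTable:
    """Tracks how often each resource is used in each (time mod II) row."""

    def __init__(self, ii, capacities):
        self.ii = ii
        self.capacities = dict(capacities)  # resource name -> capacity
        # usage[row][resource] = reservations made in that modulo row
        self.usage = [{name: 0 for name in capacities} for _ in range(ii)]

    def is_free(self, resource, time):
        row = time % self.ii
        return self.usage[row][resource] < self.capacities[resource]

    def reserve(self, resource, time):
        """Reserve the resource for every cycle congruent to `time` mod II."""
        if not self.is_free(resource, time):
            return False
        self.usage[time % self.ii][resource] += 1
        return True


# Hypothetical example: II = 2, one FU of capacity 1, one RF of capacity 2.
mrt = ModuloReservationTable(ii=2, capacities={"FU1": 1, "RF1": 2})
first = mrt.reserve("FU1", 0)   # FU1 is now taken in every even cycle
clash = mrt.reserve("FU1", 4)   # 4 mod 2 == 0, so this hits the same row
other = mrt.reserve("FU1", 3)   # odd cycles are still free
```

Because the table has only II rows, a reservation at time 0 and a request at time 4 compete for the same row, which is exactly the modulo constraint.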
B. Particle Swarm Optimization
Particle Swarm Optimization (PSO) is an optimization approach that follows an evolutionary metaphor. It is a population-based search procedure in which individuals, called particles, change their positions, or states, with time. Each particle in the PSO system represents a potential solution to the problem, and at the end of the search, the best particle holds the best solution found. The standard PSO is discussed in [3].
In every iteration, the velocity and position of each particle are calculated according to the expressions given below.
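$$v_i^{t+1} = w\,v_i^{t} + c_1 r_1 \left(P_i^{t} - x_i^{t}\right) + c_2 r_2 \left(G^{t} - x_i^{t}\right) \quad (1)$$

$$x_i^{t+1} = x_i^{t} + v_i^{t+1} \quad (2)$$

where

$$w = w_{initial} - \frac{w_{initial} - w_{final}}{t_{max}}\, t \quad (3)$$

$t$ denotes the current iteration and $t_{max}$ the maximum number of iterations. $x_i^{t}$ denotes the particle coordinates at iteration $t$ and $v_i^{t}$ denotes the velocity at iteration $t$. $c_1$ and $c_2$ denote the acceleration constants, and $r_1$ and $r_2$ are random values in the range $[0, 1]$. $P_i^{t}$ and $G^{t}$ denote the local best particle position and the global best particle position at iteration $t$. $w$ denotes the inertia weight factor, with $w_{initial}$ as the initial weight and $w_{final}$ as the final weight.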
After calculating the new velocity, we obtain the new particle position to search in the next iteration. The PSO algorithm has the advantages of high speed, stable convergence, and robustness; it parallelizes well and generates good solutions [3].
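As an illustration of the update step, the following Python sketch implements the textbook continuous PSO update with a linearly decreasing inertia weight. It is a sketch under assumptions: the function names and the default values for `c1`, `c2`, `w_initial`, and `w_final` are hypothetical and are not taken from the paper.

```python
import random


def inertia_weight(t, t_max, w_initial=0.9, w_final=0.4):
    """Linearly decrease the inertia weight from w_initial to w_final."""
    return w_initial - (w_initial - w_final) * t / t_max


def pso_step(x, v, pbest, gbest, t, t_max, c1=2.0, c2=2.0, rng=random):
    """Return the new position and velocity of one particle.

    x, v, pbest, gbest are equal-length lists of floats (one per dimension).
    """
    w = inertia_weight(t, t_max)
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()  # fresh random values in [0, 1)
        vi_next = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        new_v.append(vi_next)
        new_x.append(xi + vi_next)
    return new_x, new_v
```

A particle that already sits on both its local best and the global best is moved only by its inertia term, which shrinks as `t` approaches `t_max`.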
PSO shows significant performance in the initial iterations when compared with Ant Colony Optimization (ACO). PSO has the capability to quickly arrive at an optimal or near-optimal solution [7]. An advantage of PSO over Genetic Algorithms (GAs) is that PSO maintains all the solutions in the search space, and changes of the inertia weight lead to convergence [8]. Since previous research on PSO [3], [9] shows that scheduling can be done with PSO, we tried PSO in a hybrid combination with mutation operations for our modulo scheduling problem, to avoid premature convergence in the PSO algorithm.
C. Target Architecture Graph
The target architecture consists of a graph of basic components, including Functional Units (FUs), Register Files (RFs), Column Buses (CBs), and Row Buses (RBs). Similar to the work done in [2] and [10], our work aims to target a wide range of CGRAs. For the experiments reported in Section IV, we targeted an architecture similar to the ADRES [1] architecture template.
The TA graph $G_{TA} = (V_{TA}, E_{TA})$ is formed from a target description file, where
• $V_{TA}$ is the set of vertices. Each vertex represents an FU, RF, CB, or RB as described above.
• $E_{TA}$ is the set of edges, indicating the incoming or outgoing edges of each resource. For an edge $e$, $src(e)$ and $tgt(e)$ are its source and target vertices.
Each FU can receive input from various resources of the graph, and similarly the output of each FU can be routed to various destination resources [1]. The target architecture used in the experiments of Section IV has both 4x4 instances and 8x8 instances of FUs. An example 4x4 instance of the target architecture is shown in Figure 2. Only the top row of FUs, termed Memory Units (MUs), may be used for load and store operations.
D. Routing Resource Graph
For scheduling, placing, and routing loops onto the target architecture, we employ a time-space graph called a Routing Resource Graph (RRG). The RRG is obtained from the TA graph described above by replicating each TA vertex for every time cycle and specifying the interconnections with edges derived from the TA edges. The RRG consists of:
• Vertices: an infinite set of copies of the TA's vertex set, one copy per time cycle.
• Same-cycle edges: every incoming edge in the TA graph that doesn't end at a register write port is replicated across time.
• Register-write edges: every incoming edge in the TA graph that ends at a register write port is represented in the RRG as an outgoing edge from its source in the current time cycle to the write port in the next time cycle. Use of such an edge represents writing to a register [11].
• Register-hold edges: for every RF in the TA graph, we have a set of edges that transmit data from each instance of the RF to the instance in the next time cycle. Use of such an edge represents maintaining data in a register [11].
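The three edge rules above can be sketched as a small time-expansion routine. This is an illustrative Python sketch, not the authors' implementation: it unrolls a finite number of cycles instead of an infinite set, and the vertex kinds (`"FU"`, `"WP"`, `"RF"`, `"RP"`), the function name `build_rrg`, and the toy TA are assumptions.

```python
def build_rrg(vertices, edges, cycles):
    """Expand a TA graph over `cycles` time steps.

    vertices: dict mapping vertex name -> kind ("FU", "WP", "RF", "RP", ...)
    edges:    list of (src, dst) TA edges
    Returns an adjacency dict keyed by (name, time).
    """
    rrg = {(name, t): [] for name in vertices for t in range(cycles)}
    for t in range(cycles):
        for src, dst in edges:
            if vertices[dst] == "WP":
                # Edge into a write port crosses into the next time cycle.
                if t + 1 < cycles:
                    rrg[(src, t)].append((dst, t + 1))
            else:
                # Any other edge stays within the same time cycle.
                rrg[(src, t)].append((dst, t))
        for name, kind in vertices.items():
            # A register file holds its data into the next time cycle.
            if kind == "RF" and t + 1 < cycles:
                rrg[(name, t)].append((name, t + 1))
    return rrg


# Hypothetical four-component TA: FU1 -> WP1 -> RF1 -> RP1 -> FU1.
ta_vertices = {"FU1": "FU", "WP1": "WP", "RF1": "RF", "RP1": "RP"}
ta_edges = [("FU1", "WP1"), ("WP1", "RF1"), ("RF1", "RP1"), ("RP1", "FU1")]
rrg = build_rrg(ta_vertices, ta_edges, cycles=2)
```

In this toy graph, a value produced by `FU1` at cycle 0 can only reach the write port at cycle 1, while the register file at cycle 0 can either feed its read port in the same cycle or hold the value until cycle 1.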
E. Initiation Interval
To enforce the modulo constraints, we have to generate a schedule for one iteration of the loop such that this same schedule is repeated at regular intervals while respecting data dependences and resource constraints [1]. This interval is termed the Initiation Interval (II), and it essentially reflects the performance of the scheduled loop. To start the MCHPSO scheduling process, the II is assigned a lower bound called the Minimum Initiation Interval (MII), which is computed as in [1].
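The text defers the exact MII computation to [1]. As a rough illustration only, the sketch below uses the bound commonly used in modulo scheduling, the maximum of a resource-constrained bound and a recurrence-constrained bound. The function names, the resource classes, and the default `recurrence_mii=1` (appropriate only for loops without inter-iteration dependences) are assumptions.

```python
import math


def resource_mii(op_counts, unit_counts):
    """Smallest II at which every resource class can host its operations."""
    return max(math.ceil(op_counts[k] / unit_counts[k]) for k in op_counts)


def minimum_ii(op_counts, unit_counts, recurrence_mii=1):
    """Lower bound on II: the tighter of the resource and recurrence bounds."""
    return max(resource_mii(op_counts, unit_counts), recurrence_mii)


# Hypothetical loop: 40 ALU operations on 16 units, 10 memory operations on 4 units.
mii = minimum_ii({"alu": 40, "mem": 10}, {"alu": 16, "mem": 4})
```

If no valid schedule is found at this bound, a modulo scheduler typically retries with a larger II.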
F. Data Flow Graph
The target application program description is analyzed and transformed to find the critical loops to be mapped to the CGRA. In our work, we have considered only the inner loop body of the application with no inter-iteration dependence. The loop kernel is rewritten to create a data flow graph representation with nodes as the set of operations in the loop kernel and arcs as the set of interconnection edges, indicating the incoming or outgoing edge of the operation [7] .
III. MAPPING ALGORITHM

A. Modulo Scheduling
Modulo Scheduling is a technique for software pipelining loops [2] . The schedule for each iteration is divided into stages of equal duration, so that different stages of the successive iterations get overlapped. The number of stages in each iteration is called the Stage Count (SC). Modulo scheduling ensures that there are no resource conflicts as multiple stages execute simultaneously.
B. Proposed Algorithm
1) Modulo scheduling with Modulo Constrained Hybrid Particle Swarm Optimization
Our proposed MCHPSO scheduling algorithm simultaneously searches for a good schedule, placement, and routing solution for the entire set of operations given in DFG; it avoids the time consuming sequential search for each operation proposed in the mapping algorithm described in [2] . In [2] , [11] , [10] several trials are needed to find the best schedule for an operation before proceeding to the next operation. In our algorithm, all the particles search for a complete schedule simultaneously. To efficiently map loops onto the CGRA, we have adopted the idea of modulo scheduling used in [2] along with the combination of two heuristic approaches, PSO and randomization. From [3] and [9] we note that PSO could be applied to multidimensional scheduling problems. The application of PSO to modulo scheduling converges faster but can be caught in a local optimum. To escape the local optima, we have used a randomization method in combination with PSO. The overall method of MCHPSO to schedule, place and route a loop is explained in Figure 3 . The inputs to the algorithm are TA graph and a DFG. First the Minimum Initiation Interval (MII) is computed as discussed in the previous section. Second, ASAP (As Soon As Possible) and ALAP (As Late As Possible) times are calculated as in [2] for the given DFG. After generating the DFG and the RRG, the MCHPSO algorithm is executed to schedule, place, and route the loop.
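The ASAP and ALAP times mentioned above bound the schedule window of each operation. The following Python sketch computes them for a DFG given as an edge list. It assumes unit latency for every operation and a non-empty acyclic graph; the function names and the example DFG are hypothetical, and the exact procedure of [2] may differ.

```python
from collections import deque


def topological_order(num_nodes, edges):
    """Kahn's algorithm; returns the node order and the successor lists."""
    succs = [[] for _ in range(num_nodes)]
    indegree = [0] * num_nodes
    for u, v in edges:
        succs[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in range(num_nodes) if indegree[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succs[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return order, succs


def asap_alap(num_nodes, edges, latency=1):
    """Earliest and latest start time of each node (assumes num_nodes > 0)."""
    order, succs = topological_order(num_nodes, edges)
    asap = [0] * num_nodes
    for u in order:                      # forward pass: push successors later
        for v in succs[u]:
            asap[v] = max(asap[v], asap[u] + latency)
    length = max(asap)
    alap = [length] * num_nodes
    for u in reversed(order):            # backward pass: pull predecessors earlier
        for v in succs[u]:
            alap[u] = min(alap[u], alap[v] - latency)
    return asap, alap


# Hypothetical DFG with five operations.
dfg_edges = [(0, 2), (1, 2), (2, 3), (1, 4)]
asap, alap = asap_alap(5, dfg_edges)
```

Node 4 has slack here (ASAP 1, ALAP 2), so a particle may place it in either cycle, while the nodes on the critical path have no freedom in time.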
2) Particle Encoding for the problem
To frame the solution for the scheduling problem using particles, we need to consider various dimensions for each particle: the size of the DFG, the placement of nodes, the routing, and the schedule time. To establish "best solution mapping", we have
3) MCHPSO
In MCHPSO, the inputs are the RRG and the sorted DFG. For each particle, the number of operations is initialized to the number of nodes, N, in the DFG. Each particle takes an initial value for the placement and schedule of each node, with the schedule time in the range [ASAP, ALAP] so that the dependence constraint is satisfied. Once all the particles are initialized, their fitness is calculated as described in the next subsection. Every particle updates its local best position if the new fitness is better than the current fitness. Once all the particles have been updated to their best candidate solution, the global best particle is chosen and its position is recorded as the global best position. Every particle then updates its velocity according to (4). The function in (4) creates a swap sequence [3] between the current particle's placed and scheduled nodes and those of either the global best position or the local best position.
Once the new velocity is generated, the current particle position is swapped according to the coordinates in the velocity's swap sequence, as in (5). Next, the mutation operator is applied to the new particle position, as shown in (6). The mutation function selects a random node of the particle, chooses a random placement and schedule value, and replaces the particle's current value for that node. Once the mutation is done, the new particle coordinates are ready for the next generation of MCHPSO. The particles keep searching for the best solution in the current II. The pseudocode is shown in Figure 4.
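One plausible reading of the swap-sequence update and the mutation step is sketched below in Python. This is an interpretation, not the paper's equations (4) to (6): a position is assumed to be a list of `(fu, time)` pairs, one per DFG node, and the function names, the guide-selection probability `c`, and the parameter `num_fus` are hypothetical.

```python
import random


def swap_velocity(position, pbest, gbest, c=0.5, rng=random):
    """Build a swap sequence of (node, target) pairs.

    For each node, the guide is the global best with probability c,
    otherwise the local best; a swap is recorded where the particle differs.
    """
    velocity = []
    for node, current in enumerate(position):
        guide = gbest if rng.random() < c else pbest
        if guide[node] != current:
            velocity.append((node, guide[node]))
    return velocity


def apply_velocity(position, velocity):
    """Apply the swap sequence to a copy of the position."""
    new_position = list(position)
    for node, target in velocity:
        new_position[node] = target
    return new_position


def mutate(position, asap, alap, num_fus, rng=random):
    """Re-place one random node on a random FU at a random time in [ASAP, ALAP]."""
    new_position = list(position)
    node = rng.randrange(len(position))
    new_position[node] = (rng.randrange(num_fus),
                          rng.randint(asap[node], alap[node]))
    return new_position
```

The swap sequence pulls the particle toward known good solutions, while the single-node mutation keeps some diversity in the swarm so that it is less likely to stall in a local optimum.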
4) Fitness calculation
The fitness calculation considers multiple objectives from the routing path produced by Dijkstra's shortest-path algorithm [12]. The three main objectives considered in our work are that no resource is overused, that all edges in the DFG are routable, and that few resources are used to route. The routing cost is computed by accumulating the cost of all RRG nodes used by the new placement and routing of the operation. The fitness calculation was designed to penalize particles which overuse resources. Each node in the RRG has a capacity, base cost [2], availability, and usage number. The majority of RRG nodes have a capacity of one, whereas a few types of nodes, such as register files, have a capacity larger than one.
Equation (4): particle velocity update, in which the acceleration constant is restricted to a fixed range.
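The three objectives can be combined into a single cost, as in the following Python sketch. It is a simplified illustration rather than the paper's fitness function: routes are found independently with Dijkstra's algorithm over node costs, and the penalty weights `overuse_penalty` and `unroutable_penalty`, the default capacity of one, and all function names are assumptions.

```python
import heapq


def cheapest_path(graph, cost, source, target):
    """Dijkstra over node costs; returns (cost, path) or (inf, []) if unroutable."""
    best = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            path = [node]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return dist, path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph.get(node, []):
            cand = dist + cost.get(nxt, 1)  # pay the cost of the node entered
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    return float("inf"), []


def fitness(graph, cost, capacity, routes,
            overuse_penalty=100, unroutable_penalty=1000):
    """Lower is better: routing cost plus penalties for overuse and failed routes."""
    usage = {}
    total = 0
    for source, target in routes:
        dist, path = cheapest_path(graph, cost, source, target)
        if not path:
            total += unroutable_penalty
            continue
        total += dist
        for node in path[1:-1]:  # count intermediate routing resources only
            usage[node] = usage.get(node, 0) + 1
    overuse = sum(max(0, used - capacity.get(node, 1))
                  for node, used in usage.items())
    return total + overuse_penalty * overuse


# Hypothetical routing graph: two alternative resources between A and B.
graph = {"A": ["R1", "R2"], "R1": ["B"], "R2": ["B"]}
cost = {"R1": 1, "R2": 3, "B": 0}
```

With these numbers, a single route from `A` to `B` costs 1, while two routes that both choose `R1` exceed its capacity of one and incur the overuse penalty.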
IV. EXPERIMENTAL RESULTS
A. Setup
The proposed scheduling algorithm was written in Java and executed on an Intel Core 2 Duo CPU with 4 GB RAM and a clock speed of 2 GHz. To schedule a loop onto the CGRAs, two main inputs were required for the scheduling algorithm. The first input is the DFG generated from the benchmark loops. The second input for the MCHPSO is the CGRA configuration. The TA graph is created from the TA configuration.
Other than the two main inputs, DFG and TA, MCHPSO requires the following parameters: the number of particles is 10, the relax-factor for the schedule length is the II of the DFG, as one or zero depending on the random generation, the number of trials for each II is one, and the number of iterations to carry out the algorithm is 20.
Among the various CGRAs discussed in [1] , Architecture for Dynamically Reconfigurable Embedded Systems (ADRES) [2] was used for the experiments. The TA consists of 64 FUs, which are divided into four tiles. Each tile consists of 16 FUs in a 4 by 4 grid as shown in Figure 2 . The benchmarks used consist of ten programs, which are derived from [4] , [5] , and [6] .
B. Experiment Results
Procedure MCHPSO(sortDFG, RRG, II, schLength)
begin
    for each operation in sortDFG do
        Initialize Particles
        InitializeMRT(noofFU, II)
    end for
    repeat NLOOPS times
        for each particle in Particles do
            Find the fitness value from GetRoutingCost(RRG, particle)
            if the fitness value is better than the best fitness then
                Set current fitness value as the new particle best fitness
            end if
        end for
        Find the global best particle
        for each particle do
            Calculate the new particle velocity according to (4)
            Update particle search position according to (5)
            Apply mutation operator for the newPosition (6)
        end for
    end repeat
    if validSchedule(bestparticle) then
        return true
    else
        return false
    end if
end
Figure 4. Pseudocode of the MCHPSO procedure.

The overall mapping results of all the selected benchmarks are shown in TABLE II, where the first column shows the benchmark name, the second column denotes the number of operations in the loop kernel, and the third column shows the Initiation Interval (II) at which the loop kernel is mapped. The fourth column shows the Operations Per Cycle (OPC), which is calculated by (7). The fifth column shows the schedule density without routing, calculated as in (8); it considers the count of FUs used in the placement. The sixth column shows the schedule density of FUs with routing, calculated by (9), where the number of stages is calculated by (10); it considers the count of FUs used in the placement as well as in the routing of edges. The seventh column shows the total CGRA utilization percentage, including all the computation and routing resources in the CGRA used for scheduling the loop kernel, calculated by (11). The eighth column shows the number of stages overlapped, as calculated in (10). The last column shows the time taken in seconds to schedule the loop kernel. The mapping results show that the proposed scheduling algorithm MCHPSO utilizes from 31.25% to 79.69% of the total FUs available in the CGRA.
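Two of the reported metrics can be illustrated with the definitions commonly used for modulo-scheduled loops. The Python sketch below assumes that OPC is the number of loop operations divided by II and that the number of stages is the schedule length divided by II, rounded up; these formulas, the function names, and the example numbers are assumptions and may differ from equations (7) and (10).

```python
import math


def operations_per_cycle(num_ops, ii):
    """Steady-state throughput: all loop operations issue once every II cycles."""
    return num_ops / ii


def stage_count(schedule_length, ii):
    """Number of II-cycle stages spanned by the schedule of one iteration."""
    return math.ceil(schedule_length / ii)


# Hypothetical kernel: 48 operations mapped at II = 3 with a 7-cycle schedule.
opc = operations_per_cycle(48, 3)
stages = stage_count(7, 3)
```

Under these assumptions, a lower II raises OPC directly, which is why the comparison tables report II alongside OPC.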
The FU utilization depends on the size of the DFG and the number of stages through which a loop is unrolled. The largest loop kernels like IDCT_hor (horizontal pass) and FFT are scheduled within a maximum of 105.89 seconds.
The usage of Functional Units in the CGRA instance has been studied in Figure 5 . From the mapping results, it is understood that the higher the number of loop operations, the larger the routing resources required.
C. Comparison of MCHPSO with other modulo scheduling algorithms
TABLE III shows the comparative results of MCHPSO measured against the modulo scheduling algorithm [1] used in the ADRES architecture. The first column shows the benchmarks taken for comparison. The second and seventh columns show the number of operations derived from the benchmarks for the two algorithms. The third and eighth columns show the II at which each algorithm was able to exploit loop-level parallelism. The fourth and ninth columns show the schedule density of FUs (with routing). The fifth and tenth columns show the Operations Per Cycle (OPC) as calculated in (7). The sixth and eleventh columns show the scheduling time in seconds for the mapping of the benchmark. The comparison shows that our proposed MCHPSO algorithm was able to route the FFT benchmark at the minimum II with a small execution time.
TABLE IV shows the comparison of MCHPSO with the modulo scheduling algorithm used in [10]. The authors of [10] used a 2D CGRA with 16 PEs with the PEIT1 (each PE is connected to the PEs in its row and column) and PEIT2 (nearest neighbour) topologies. The execution time is smaller with PEIT1 than with PEIT2 because PEIT2 experiences a larger average routing delay, while PEIT1 overcomes the routing delay through its richer interconnection topology. A memory-conscious mapping algorithm based on the priority-based list scheduling algorithm is used in [10]. Therefore, we have compared the work done in [10] based on PEIT1 with our proposed algorithm. The first column in TABLE IV shows the benchmarks taken for comparison. The second and sixth columns show the number of operations in the benchmark. The third and seventh columns show the Operations Per Cycle (OPC) as calculated in (7), and the fifth and ninth columns show the schedule density of FUs (with routing) as calculated in (9).
This comparative study shows that our proposed algorithm has a lower schedule density (with routing) and minimal II for the first four benchmarks in spite of not using L1 and L0 scratch pad memory, which has been used in [10]. The fifth benchmark, 8x8 IDCT-hor, is a typical case in which our algorithm maps at a lower II with the same number of operations and schedule density compared with the results in [10]. The numbers of operations differ between the compared algorithms because of the different analysis and transformation phases carried out in [1] and [11]. Our proposed algorithm maps with a minimal II for all the benchmarks taken for comparison with the work done in [11], with better utilization of resources.
V. RELATED WORK
In CGRA compilation, software pipelining [13] is used for instruction-level parallelism. The idea of software pipelining is to look for a pattern of operations from various iterations (often termed the kernel) such that repeatedly iterating over this pattern has the effect that the next iteration is initiated at a regular interval. This interval is termed the Initiation Interval (II), which essentially reflects when the next iteration can start in order to increase the performance of the scheduled loop. Some of the approaches to modulo scheduling of the inner loop body are discussed below.
The compilation of the inner loop body for CGRAs has been done with DRESC (Dynamically Reconfigurable Embedded System Compiler) [2], a retargetable compiler that is able to parse, analyze, place, route, and schedule C source code. That work proposes a modulo scheduling algorithm based on simulated annealing, which takes a long compilation time for larger loops. A memory-conscious mapping methodology for CGRA architectures was presented in [10], with data reuse capabilities and a priority-based list scheduling algorithm. Its resource-aware mapping with local RAMs and a flexible interconnection network enables the application to be mapped. The idea of modulo scheduling is applied with a graph embedding technique using an affinity graph heuristic and a skewed scheduling space in [14]. That method achieves better convergence and faster compilation times with dedicated register files and sparse network connectivity.
The discrete problem of instruction scheduling has been solved using Particle Swarm Optimization (PSO) with the traditional list scheduling algorithm [3]. Our approach closely resembles the work in [2] and [3] in using hybrid PSO with a mutation operator to make the placement and scheduling decisions in CGRAs. The routing path value for the fitness function is calculated with Dijkstra's algorithm to achieve better convergence and faster compilation times. In contrast to all the algorithms discussed, our approach uses an evolutionary process to make the mapping decisions for all the nodes in the DFG simultaneously. The proposed algorithm optimizes the routing cost while maintaining the modulo constraints and data dependences.
VI. CONCLUSION AND FUTURE WORK
In this paper, we have proposed the Modulo Constrained Hybrid Particle Swarm Optimization (MCHPSO) algorithm for the loop scheduling problem in CGRAs. The results indicate that the algorithm can find a valid schedule, placement, and routing for the given benchmark loops at the required initiation interval, with good utilization of resources. Our algorithm can be enhanced to handle if-conversion, conditional branches, and inter-iteration dependences in loops. In future work, we will apply the proposed algorithm to various reconfigurable architectures and complex applications. The results produced by MCHPSO will be compared with other hybrid evolutionary algorithms in the future. To study the parallelization of the mapping solution search in the proposed algorithm, we have tried it on a quad-core machine with eight logical processors. The preliminary results are promising and will be discussed in a future paper.
