This paper presents a novel algorithm for the temporal partitioning of graphs that represent a behavioral description. The algorithm extends traditional static-list scheduling so that it resolves both scheduling and temporal partitioning. The nodes to be mapped into a partition are selected according to a statically computed cost model. The cost of each node integrates communication effects, the critical path length, and the ability of the critical path to hide the delay of parallel nodes. To keep the runtime low, the costs are not updated dynamically. The algorithm is compared with other schedulers and with close-to-optimum results obtained by a simulated annealing approach. The algorithm has been implemented, and the results show that it is robust, effective, and efficient: compared with the other methods, it finds very good results in small amounts of CPU time.
INTRODUCTION
The availability of RPUs (reconfigurable processing units), such as the new FPGAs, with lower reconfiguration times and partial-reconfiguration capability, has made possible the concept of "virtual hardware" [1]: the hardware resources are assumed to be unlimited, and implementations that exceed the RPU area are resolved by temporal partitioning. The partitioned solution is then executed by time-sharing the device such that the initial functionality is maintained. This concept promises to be an efficient way to save silicon area. One application is switching among functionalities that are mutually exclusive in the temporal domain, such as context switching between coding/decoding schemes in communication, video or audio systems. However, temporal partitioning algorithms able to exploit the new concept efficiently are needed. They must consider trade-offs among parallelism, communication costs, latency and reconfiguration times. The nodes of a given graph have to be scheduled in time slots to be executed in each temporal partition. Temporal partitioning must preserve the dependencies between nodes (which are already temporal dependencies), such that a node B that depends on node A cannot be mapped to a partition executed before the partition to which node A was mapped.
Although FPGAs such as the Xilinx™ XC6200 family [2] do not have mechanisms to implement temporal partitions efficiently, and the time to reconfigure the whole FPGA is still quite high, the importance of the "virtual hardware" concept has already been demonstrated with computationally complex applications [3]. Industrial efforts are under way to further improve the capability of the devices to handle multiple configurations by storing several configurations on-chip and permitting context switching in a few nanoseconds [4] [5] [6]. The trend towards on-chip configurations instead of more logic cells is explained by the fact that the area of the SRAM that stores the configurations of a cell is much smaller than the area of the cell itself.
Efficient mechanisms of communication between temporal partitions, such as the micro-registers in [5], have also been actively researched. Most of these efforts consider FPGA registers that keep the same state between contexts whenever required.
As noted above, research efforts are under way both on new RPU architectures [7] and on the automation of the temporal partitioning process. Our efforts address the temporal partitioning of behaviors during the synthesis steps. This paper presents a new temporal partitioning algorithm that takes into account, among other aspects, the inter-communication costs, while maintaining a low computational complexity. In addition, it is flexible enough to accommodate various target architectures. Results are compared with a number of alternative constructive algorithms and, to show how far they are from close-to-optimum solutions, with a simulated annealing (SA) [8] approach.
From now on we refer to temporal partitions and temporal partitioning simply as partitions and partitioning, respectively, since this paper considers neither spatial partitions nor spatial partitioning.
The paper is organized as follows. Section 2 summarizes the related work on temporal partitioning. Section 3 describes the computational models considered by the proposed approach. Section 4 explains the algorithm and the heuristics used. Results are shown in Section 5, where the new algorithm is compared with simple heuristics and with results obtained by the SA implementation. Finally, conclusions are drawn and future work is outlined.
An Enhanced Static-List Scheduling Algorithm for Temporal Partitioning onto RPUs
RELATED WORK
Temporal partitioning algorithms were first considered in [9] [1]. The similarities between scheduling in high-level synthesis [10] and temporal partitioning allow common scheduling schemes to be used for partitioning. However, an important factor that must be considered is the inter-communication cost (the cost of communication among partitions), because it can impose an unacceptable overhead on the overall latency.
In [11] a static-list scheduling (SLS) approach is used for partitioning, trying to minimize the number of nets among partitions. It works on netlists (4-input LUTs) of combinational circuits and uses the path-to-end length of each node as the priority function and the fan-out size of each node as a tiebreaker. The approach is suitable for RPU architectures with inter-buffering (on-chip buffers that maintain the state among partitions). These RPUs have small inter-communication costs, so the overall optimization problem reduces to the minimization of the critical path length. [12] presents an enhanced force-directed scheduling algorithm that considers communication costs and is able to process sequential circuits.
In [13], a variation of SLS followed by an optimizer is used to partition netlists. The algorithm uses three scheduling rules to select among the ready nodes and is tailored to the Time-Multiplexed FPGA [5]. [14] [15] present a network-flow based method for multi-way partitioning of netlists. In [14] the algorithm is also targeted to the Time-Multiplexed FPGA, and the results outperform the SLS approach [13] in terms of communication costs (number of nets between partitions). The algorithm applies the max-flow min-cut computation iteratively to find k partitions. [15] shows improvements over the enhanced force-directed scheduling of [12] with respect to communication costs. The latency of the solutions is neither reported nor compared.
The above approaches are all based on the netlist of the final circuit, previously mapped to the library of the target FPGA. They can be efficient for rapid prototyping but cannot exploit partitioning at the behavioral level. This matters more when partitioning is to be integrated into reconfigware compilation from behavioral descriptions. Moreover, these approaches suffer from the large number of nets and nodes that must be manipulated. At the behavioral level, the operations encapsulate the intricate connections, and only the groups of nets that transport operands are visible to the algorithms.
Some authors, such as those of [9] and [16], have considered partitioning at the behavioral level with integration into synthesis in mind. In [9], a heuristic based on SLS, enhanced to consider dynamic area constraints, is presented. The approach does not consider the inter-communication costs. In [16] the partitioning problem is formulated as a 0-1 non-linear programming (NLP) model [17]. Because of the computational complexity of the approach, heuristic methods must be developed to make execution feasible on large input examples. In [18], a partitioning algorithm based on the levelized nodes obtained by the "as soon as possible" (ASAP) scheduling algorithm is used. The algorithm fills the available area of the RPU in increasing order of the ASAP levels. The selection of nodes within the same level is arbitrary, and the algorithm switches to another partition when it encounters the first node that does not fit in the current partition. In [19], partitioning algorithms are considered that extend the ASAP or "as late as possible" (ALAP) leveling algorithms by selecting a node, within the same ASAP or ALAP level, with a local priority function based on the nodes' mobilities. [19] also shows an algorithm that searches recursively in the list of ready nodes so that, if a node cannot be mapped to the current partition, other nodes can be considered.
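The level-based filling attributed to [18] can be illustrated with a short Python sketch. It is only a reading of the description above, not the implementation of [18]: the function name `level_fill_partitioning`, the list-of-levels input and the area dictionary are assumptions made for the example.

```python
def level_fill_partitioning(levels, area, max_area):
    """Fill partitions in increasing ASAP-level order (illustrative sketch).

    levels   -- list of lists of node names, one list per ASAP level
    area     -- dict mapping node name to its estimated area
    max_area -- area available on the RPU for one partition
    """
    partitions = [[]]
    used = 0
    for level in levels:
        for node in level:  # order inside a level is arbitrary
            if area[node] > max_area:
                raise ValueError(f"node {node!r} does not fit in an empty partition")
            # The first node that does not fit closes the current partition.
            if used + area[node] > max_area:
                partitions.append([])
                used = 0
            partitions[-1].append(node)
            used += area[node]
    return partitions
```

Because a new partition is opened at the first node that does not fit, any area left in the current partition stays unused even if a later, smaller node would have fitted.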
However, none of the above approaches considers both the inter-communication costs and the latency, and most of them are tailored to a specific target architecture. Therefore, this paper presents new efforts to integrate the inter-communication costs and the latency of the solutions in a temporal partitioning algorithm that works at the behavioral level.
RECONFIGURABLE COMPUTING MODEL
Our model assumes that the CPU has access to the reconfigware and is responsible for the reconfiguration of the RPU(s) that make up the reconfigware part. We also consider, without loss of generality, that the CPU has access to the memory attached to the RPU. The partitions are mapped to the reconfigware, and the CPU stores/loads primary inputs/outputs directly to/from the memory, or to/from the FPGA when this is supported. The data transferred between partitions can be stored in the memory by the reconfigware, kept in special registers that are maintained during reconfigurations, or collected by the CPU.
At least three interface schemes between partitions can be considered. The first uses registers and a task running on the microprocessor to load and store operands between partitions. This scheme is well suited to boards of RPUs with a processor or a micro-controller that can also control the reconfiguration of partitions, or to boards without any buffer scheme to maintain results among partitions. With the advent of the integration of RPUs in processor cores, this can be an efficient scheme because of the lower communication overhead of these systems. The second scheme uses a set of registers in the RPU when the device can be partially reconfigured. The registers are configured once in a region of the RPU and shared between partitions (this can include new RPUs tailored to time-sharing). The third scheme uses a macro-cell to access memory locations controlled by a hardware unit. This is well suited to RPUs with no processor (only local memory) or where the communication between the host and the RPUs is time-consuming.
From the above considerations it is clear that feasible partitioning schemes must consider different inter-communication costs due to different interface mechanisms.
PROBLEM FORMULATION
To be independent of a particular software language (e.g., C/C++ or Java), the input program is represented as a hierarchy of PDGs (program dependence graphs) in which the bottom level is formed by DAGs (directed acyclic graphs) where each node represents an operation. A node in the PDG can be a group of statements or a single statement. A loop is represented by a special node in the PDG that encapsulates the PDG of the loop body. Herein, we only consider nodes with deterministic delays (known at compile time), and recursive constructs are not allowed. Thus, a behavioral description is represented by a graph G = (V, E), which is an ordered, directed and acyclic graph with |V| nodes {ν_1, ν_2, …, ν_|V|} and |E| edges, where each node ν_i represents a single behavior. Each edge e_{i,j} ∈ E represents a dependency between nodes ν_i and ν_j. A dependency is either a pure precedence dependency or a transport dependency caused by the transport of data between the two nodes.
The communication cost associated with an edge e_{i,j} that represents a transport dependency is the number of bytes to transfer (d_{i,j}) divided by the maximum bandwidth of each atomic transfer, rounded up to the next integer. In most communication mechanisms, for each connection between different partitions the data must be stored by the partition that defines it and loaded by the partition that uses it. Non-transport dependencies (precedence dependencies) have d_{i,j} equal to zero.
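As a small worked example of this rule, the number of atomic transfers for an edge can be computed as a ceiling division. The function name `transfer_count` and the choice of bytes as the unit of bandwidth are assumptions made for the illustration.

```python
def transfer_count(d_bytes, bandwidth_bytes):
    """Atomic transfers needed to move d_bytes, rounded up to an integer.

    d_bytes         -- bytes carried by the edge (0 for a precedence dependency)
    bandwidth_bytes -- maximum bytes moved by one atomic transfer (assumed > 0)
    """
    if bandwidth_bytes <= 0:
        raise ValueError("bandwidth must be positive")
    if d_bytes < 0:
        raise ValueError("byte count must not be negative")
    # Integer ceiling division avoids floating-point rounding.
    return (d_bytes + bandwidth_bytes - 1) // bandwidth_bytes
```

For instance, a 5-byte operand over a 2-byte interface needs three atomic transfers, while a precedence dependency (zero bytes) needs none.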
We assume that an estimation of the execution time and the reconfigware size (CLBs, cells, FUs, etc.) of each node is available at compile time.
The objective of the partitioning algorithms is to create partitions in the temporal domain such that the cost is minimum and each partition fits in the physically available reconfigware area. Each partition Π_i is a non-empty subset of V. A graph G partitioned into k subsets is correct if:
- The set of partitions ℘ = {Π_1, …, Π_k} covers V, with pairwise disjoint subsets.
- ∀ Π_i ∈ ℘, Area(Π_i) ≤ MaxArea: each temporal partition fits in the RPU resources.
- ∀ e_{i,j} ∈ E, ℘(ν_i) → ℘(ν_j) ∨ ℘(ν_i) ≡ ℘(ν_j), where → indicates the order of execution: all dependencies are met (a necessary condition to obtain the same functionality).
A correct set of partitions guarantees the same behavior as the original graph. However, we are also interested in minimizing the overall latency. The cost that reflects the latency of a graph in a time-multiplexed RPU can be estimated by equation (1). τ_cycles is the number of clock cycles for each load/store. In(Π_i) and Out(Π_i) are the numbers of inter-communications into and out of partition Π_i, without considering the primary inputs/outputs. The second term represents the overall critical path delay without inter-communication costs.
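As an illustration, the correctness conditions and the latency cost can be sketched in Python. The sketch assumes that equation (1) has the form Ω = τ_cycles × Σ_i [In(Π_i) + Out(Π_i)] + Σ_i (critical path delay of Π_i), which is how the text describes its two terms; the exact equation may differ. The function names, the dictionary-based graph encoding, and the rule that each edge crossing partitions costs one store and one load per atomic transfer are assumptions made for the example.

```python
from collections import defaultdict, deque


def topological_order(nodes, edges):
    """Kahn's algorithm; edges are (source, sink) pairs of a DAG."""
    succ, indeg = defaultdict(list), {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in nodes if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order


def is_correct(partition_of, edges, area, max_area):
    """partition_of maps each node to its partition index (execution order)."""
    if not partition_of or set(partition_of) != set(area):
        return False  # every node must be mapped exactly once
    k = max(partition_of.values()) + 1
    used, count = [0] * k, [0] * k
    for v, p in partition_of.items():
        used[p] += area[v]
        count[p] += 1
    if any(c == 0 for c in count) or any(u > max_area for u in used):
        return False  # empty partition or area overflow
    # A sink may share its source's partition or run in a later one.
    return all(partition_of[u] <= partition_of[v] for u, v in edges)


def latency_cost(partition_of, edges, delay, transfers, tau_cycles):
    """Assumed form of equation (1): communication term + execution term."""
    crossing = sum(n for (u, v), n in transfers.items()
                   if partition_of[u] != partition_of[v])
    comm = tau_cycles * 2 * crossing  # one store (Out) plus one load (In)
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    finish, cp = {}, defaultdict(int)
    for v in topological_order(list(partition_of), edges):
        start = max((finish[u] for u in preds[v]
                     if partition_of[u] == partition_of[v]), default=0)
        finish[v] = start + delay[v]
        cp[partition_of[v]] = max(cp[partition_of[v]], finish[v])
    return comm + sum(cp.values())  # partitions execute one after another
```

With three chained nodes a, b, c of delays 2, 3 and 1, where a and b share the first partition and the edge from b to c needs two atomic transfers, the sketch gives a communication term of 4 cycles (for τ_cycles = 1) and an execution term of 5 + 1 = 6 cycles.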
THE ENHANCED STATIC-LIST SCHEDULING ALGORITHM
The proposed partitioning algorithm (ELS), as can be seen in Figure 1, is an extension of the SLS algorithm. It starts by computing the ASAP and ALAP values of the nodes in the graph. Then a cost computed with equation (2) is assigned to each node of the input graph. Each term of the equation has a multiplication factor that gives more weight to the communication cost (α), to the critical path (β), or to a trade-off between them (η). The first term (3) emphasizes the communication costs. A large difference between the input and output edges of a node gives that node a greater priority. Also, a large number of nodes between a given node and the sink can produce more communication costs. The middle term (4) emphasizes the existing parallelism by assigning more weight to the nodes with lower ASAPs (giving the opportunity to place ready nodes in parallel with the nodes of the critical path already scheduled). This factor increases as the weight of the communications decreases. The third term (5) is the starting fine-grain ALAP of each node and allows the nodes to be sorted in ascending order of their ALAPs. The scale factor is set to the ratio between the maximum number of levels of the ASAP leveling and the delay of the critical path. Experimentally, we have found that a weight η given by β/(α+1) generally leads to good results. However, an independent weight can be used instead of this expression so that the exploration is not constrained.
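The ASAP and ALAP values used by the cost model can be obtained with one forward and one backward pass over a topological order. The following Python sketch is a generic version of these two passes, not the authors' code; the function name `asap_alap` and the use of start times measured in cycles are assumptions.

```python
from collections import defaultdict, deque


def asap_alap(nodes, edges, delay):
    """Return (asap, alap, critical_path) with start times in cycles.

    nodes -- iterable of node names
    edges -- (source, sink) pairs of a DAG
    delay -- dict mapping node name to its execution delay
    """
    succ, pred = defaultdict(list), defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        indeg[v] += 1

    # Topological order (Kahn's algorithm).
    queue = deque(v for v in indeg if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)

    # Forward pass: earliest start once all predecessors have finished.
    asap = {}
    for v in order:
        asap[v] = max((asap[u] + delay[u] for u in pred[v]), default=0)
    critical_path = max(asap[v] + delay[v] for v in order)

    # Backward pass: latest start that does not lengthen the critical path.
    alap = {}
    for v in reversed(order):
        latest_finish = min((alap[w] for w in succ[v]), default=critical_path)
        alap[v] = latest_finish - delay[v]
    return asap, alap, critical_path
```

The difference alap[v] - asap[v] is the mobility of node v; nodes on the critical path have zero mobility. Both passes visit each node and edge once.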
The next step of the algorithm is to compute the nodes that are ready to be mapped to the current partition and to sort them in descending order of their costs. The algorithm then loops over the sorted list of ready nodes, checking whether each node can be mapped. If a node is mapped, the algorithm updates the list of ready nodes by considering the sink nodes of its outgoing edges, and then continues the loop with another node. If a node cannot be mapped to the current partition, the algorithm searches the list for a feasible node before creating a new partition.
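The loop described above can be sketched in Python as follows. This is a simplified reading of the steps, not the authors' implementation: the per-node cost is taken as a precomputed input (equation (2) is not reproduced here), only the area constraint is checked when mapping a node, scheduling into time slots inside a partition is omitted, and the name `els_partition` and the tie-breaking by node name are assumptions.

```python
from collections import defaultdict


def els_partition(nodes, edges, area, cost, max_area):
    """List-scheduling style temporal partitioning (illustrative sketch).

    nodes    -- list of node names
    edges    -- (source, sink) pairs of a DAG
    area     -- dict: node name -> estimated area
    cost     -- dict: node name -> statically computed priority
    max_area -- area available for one partition
    """
    succ = defaultdict(list)
    pending = {v: 0 for v in nodes}  # unmapped predecessors per node
    for u, v in edges:
        succ[u].append(v)
        pending[v] += 1

    ready = [v for v in nodes if pending[v] == 0]
    partitions, used = [[]], 0
    while ready:
        # Highest static cost first; names break ties deterministically.
        ready.sort(key=lambda v: (-cost[v], v))
        # Search the whole ready list for a node that still fits.
        chosen = next((v for v in ready if used + area[v] <= max_area), None)
        if chosen is None:
            if not partitions[-1]:
                raise ValueError("a ready node is larger than max_area")
            partitions.append([])  # no feasible node: open a new partition
            used = 0
            continue
        ready.remove(chosen)
        partitions[-1].append(chosen)
        used += area[chosen]
        # Sink nodes become ready once all their predecessors are mapped.
        for w in succ[chosen]:
            pending[w] -= 1
            if pending[w] == 0:
                ready.append(w)
    return partitions
```

A node only enters the ready list after all its predecessors have been mapped, so every dependency ends in the same partition or in a later one. The costs are never recomputed during the loop, which matches the static nature of the cost model.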
The ASAP and ALAP scheduling algorithms are computed with a graph traversal and have a runtime complexity of O(|V|+|E|). The ELS algorithm has a worst-case runtime complexity of O(|V|^2+|E|). To give an idea of how much improvement can be obtained, an SA approach has been implemented. It starts from a feasible partitioning solution and tries to improve it by moving nodes among adjacent partitions (random valid moves are selected probabilistically). Moves that may violate the maximum area available in the destination partition, but do not violate any temporal precedence, are considered valid. The approach can explore solutions with more partitions than the initial solution by adding empty partitions at the beginning of the execution.
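The move rule of the SA approach can be illustrated with the Python sketch below. The validity test follows the description above (adjacent partitions only, precedence preserved, area deliberately not checked). The acceptance test uses the standard Metropolis criterion, which is an assumption: the text does not state which acceptance rule or cooling schedule was used. The names `is_valid_move` and `accept` are hypothetical.

```python
import math
import random


def is_valid_move(node, dst, partition_of, preds, succs):
    """True if moving node to partition dst keeps all temporal precedences.

    partition_of -- dict: node -> current partition index
    preds, succs -- dicts: node -> list of predecessor / successor nodes
    The destination area is not checked here, as in the description.
    """
    if abs(dst - partition_of[node]) != 1:
        return False  # only moves between adjacent partitions
    return (all(partition_of[u] <= dst for u in preds.get(node, ()))
            and all(dst <= partition_of[w] for w in succs.get(node, ())))


def accept(delta, temperature, rng=random):
    """Metropolis rule (assumed): always accept improvements, accept a
    worsening of the cost by delta with probability exp(-delta / T)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)
```

Since area overflow is allowed by the move rule, a complete annealer would have to penalise or repair oversized partitions in its cost function; how this was done is not stated in the text.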
RESULTS AND COMMENTS
All the algorithms presented have been implemented in the Java™ language. To permit a statistical comparison, a random graph generator has also been implemented.
All the results attributed to SA are close-to-optimum (the best result of several executions with different parameters was collected). Herein Ω is computed with equation (1) in units of clock cycles, and ∆_comm and Γ_exec correspond to its first and second terms, respectively. The results consider neither the store/load of primary inputs/outputs nor the possibility of interleaving execution and inter-communications. S_1 refers to the algorithm presented in [18] and S_2 to the leveling of the nodes by the ALAP scheme. S_3 and S_4 refer to the algorithm presented in [19] oriented by the ASAP or ALAP levels, respectively. S_5 is a version of S_2 with the nodes sorted by ascending ALAP levels, with ascending ASAP levels as a tiebreaker. S_6 is a version of S_1 with a list created from the nodes of an ASAP level sorted in ascending order of their ALAP step time. S_7 is an SLS approach with the nodes in the list sorted in ascending order of their ALAP step time.
In Table 1 and Table 2, E is the relative cost improvement of the SA solution over the ELS solution. The constructive approaches obtained each solution in less than 1 ms.
The first example is the loop body of the HAL example [20] (all operands are 16 bits wide). The example has a total area of 4,384 cells and a critical path delay of 58 cycles. The results presented in Table 1 show small improvements of SA over ELS.
The second example is the AR filter [21]. It has 16 multipliers and 12 adders, which contribute to a total area of 16,960 cells and a critical path delay of 90 cycles (all operands are 16 bits wide). The results are shown in Table 2. ELS always produced results better than or equal to those obtained by the other heuristics. A detailed analysis has shown that SA is able to map nodes with many connections into the same partition, which reduces the inter-communications. The ELS approach is unable to balance the last two partitions. This problem has more impact when the number of partitions needed is small, and it is one of the disadvantages of constructive approaches.
The results were not improved by making the SA explore more partitions than those obtained by the constructive approaches. Results from experiments with random graphs also confirm that the number of partitions used by the heuristics (e.g., ELS) is close-to-optimum and that few cases need more partitions (one more) to improve the results. More partitions are more likely to increase the critical path delay and only in a few cases can reduce the overall inter-communication cost.
The results show improvements of ELS over the other heuristics considered. On the fourth and fifth rows, the ELS results would be similar to the S_7 results if only the third term of equation (2) were used. S_3 was the best of the ASAP/ALAP leveling approaches with respect to communication. The results show better solutions of SLS over the ASAP/ALAP approaches with respect to latency, as expected given the opportunity to execute nodes with scheduling freedom in parallel with the critical path. S_6 produced the best results of the constructive approaches when only latency was considered. This is because all the nodes of an ASAP level are considered before nodes from the next level are added (an SLS tries, for each node scheduled, to update the list of ready nodes).
When the cost of each communication was not high, S_7 produced better results than the other constructive algorithms (rows 4 and 5).
The close-to-optimum results showed an average relative improvement of about 16% over ELS (ranging from 4% to 24%). Better ELS results can be achieved by exploring the values of the α, β and η weights in equation (2).
CONCLUSIONS & FUTURE WORK
In this paper, temporal partitioning techniques have been presented and compared. A novel heuristic extension to the static-list scheduling algorithm was presented. Because it belongs to the family of scheduling algorithms, it can incorporate resource constraints without much additional complexity. The low complexity of the algorithm makes it applicable to large graphs. The results show improvements over related algorithms and show that simplified algorithms can be used to resolve the temporal partitioning problem at the behavioral level. The algorithm can also be applied efficiently to the rapid prototyping of DFGs, with much better results than the ASAP approach presented in [18] and without a significant increase in execution time.
Although presented only as a term of comparison, a temporal partitioning approach based on simulated annealing has also been implemented. Given its execution time, the annealing approach does not seem, at least on its own, to be a reasonable choice for reconfigware compilers, where one of the most important objectives is fast compilation. However, efficient cooling schemes should be studied to improve the efficiency of the annealing.
The close-to-optimum experimental results show that, most of the time, the optimal number of partitions is the minimum, as was also stated in [7].
The support of the loop distribution transformation and the possibility to deal with resource sharing will be considered by future temporal partitioning schemes.
