Reconfigurable computing devices such as FPGAs offer application developers the ability to create solutions with performance comparable to that of a hardware solution, but with the flexibility of software. Development tools that support popular software development languages such as C and Java have been developed to reduce the need for the FPGA developer to be trained in hardware design practices; however, these tools have not been successful in mirroring all of the languages' functionality. In particular, most tools do not support programming with recursive functions. Previous research on mapping recursive functions to reconfigurable hardware has built a stack on the device, which does not take full advantage of the massive amount of parallel resources on the reconfigurable device. This paper describes a method for mapping recursive functions to reconfigurable hardware without the use of a stack. It does this by unrolling the function on the device as it is executing. The results presented in this paper show that this method can yield a significant performance increase when compared to a stack-based implementation.
INTRODUCTION
Recursion allows programmers to describe some algorithms more elegantly than would otherwise be possible. While various companies have made some effort to develop tools for software engineers, including Xilinx's Forge [1] package and Celoxica's DK [2] range of packages, none have supported recursion. If the aim of these packages is to allow a software engineer with little or no hardware development training to develop FPGA-based solutions, tools that support the full functionality of the languages must be developed.
For performance reasons, early computer systems did not support dynamic memory allocation. As a consequence, software languages such as FORTRAN did not support recursive functions. As hardware technology developed further, the desire for increased functionality resulted in the development of ALGOL 58, which implemented recursion on a stack-based processor.
* This work was done during Ferizis' PhD program at the School of Computer Science and Engineering at the University of New South Wales.
The difficulties with implementing recursion on early microprocessors are similar to those on FPGA technology: Celoxica cites the space and performance bottlenecks that would result from stack implementations as its reason for not supporting recursion in the Handel-C language [2]. A stack-based implementation squanders the massive amount of parallelism that an FPGA could potentially provide to the application. Maruyama and Hoshino [3] attempt to solve this problem by transforming the recursion into a loop, which they pipeline, and then implementing an in-memory stack instead of a logic stack in hardware to hold some state information. While their results show a performance increase due to the pipelining of the recursion, it is unclear how efficiently this system handles recursive functions that call themselves multiple times.
This paper presents a method for mapping recursion into reconfigurable hardware that does not require any type of stack on the device. Instead, the recursion is unrolled using Runtime Reconfiguration (RTR) in real time, as it is required. The method decomposes the recursive function into smaller functions which, when combined, form the original recursive function. These functions are then placed into a pipeline to extract parallelism out of the recursive function. In effect, this "unrolls" the recursive function in real time. The method described in this paper also attempts to reduce the impact of RTR on the performance of the unrolled function. This paper also presents a method to deal with the limitations imposed on the unrolling process by the finite set of resources on FPGA devices.
RELATED WORK
While there has been little work on mapping recursion to reconfigurable hardware, significant research has been done on exploiting parallelism from iterative loops on reconfigurable hardware. This work serves as a good basis for the mapping techniques described in this paper. Bondalapati and Prasanna [4] and Weinhardt and Luk [5] both propose methods that map a loop into a pipelined linear array where each stage of the array corresponds to an iteration of the loop, an example of which is shown in Figure 1. This approach is similar to the method outlined in this paper, with the exception that the pipelines created by recursive functions may be trees and not arrays if the recursive function contains multiple recursive calls. The approach required to unroll recursion into a pipeline also differs, as an iteration $i_k$ of a loop never requires data calculated by a later iteration $i_j$, $j > k$. This backwards data dependency occurs often in recursive calls because, to compute a result, a recursive function typically requires the result of the recursive call.
Unrolling the recursion at runtime could have a negative effect on the performance of the system, as it may need to stall while further stages of the system are configured. Bondalapati and Prasanna reduce this latency by pipelining the reconfiguration with the computation: stages of the pipeline are used for multiple iterations of the loop until the further stages are configured. This technique temporarily reduces the amount of parallelism in the system, but as the bitstream required to configure an iteration of a loop is typically small, this may not be an issue. As the bitstreams required to configure an instance of a recursive function are much larger, different techniques may be required that initiate the configuration of further stages before they are required.
As more space is required on the FPGA as the function is unrolled, the size of the FPGA will become a limiting factor. Past research into mapping circuits that are larger than the FPGA relies on hardware virtualisation techniques [6, 7] that "page" computational contexts out of the FPGA and place other contexts into the FPGA, in a similar way to a virtual memory system. This technique cannot be applied to the problem of unrolling recursion, as it is inefficient to save context and then reconfigure the device, and it reduces the parallelism available in the solution. Bondalapati et al. address this problem when mapping iterative loops onto reconfigurable devices [8] by forcing each stage in the pipeline to compute multiple iterations of the loop. This is done by forcing each stage to feed back onto itself multiple times before passing the data on to the next stage. This affects the performance and parallelism of the system far less than virtualisation paging techniques. A technique such as this will also work for recursive functions, so long as each stage of the recursive function does not require any state to be maintained.
RECURSION
A recursive function is a function that is defined using itself. A recursive function will continue calling itself until a terminating condition is met and a terminating set of operations is executed. Using this definition, for the remainder of this paper a recursive function will be thought of as being made of two separate functions: recursive, which contains all the operations that may be executed when the function does not terminate, and non-recursive, which contains all the operations that may be executed when the function terminates.
Typically, a recursive function calling itself is waiting for data to be returned from the function call. Figure 2 shows the Fibonacci function and the call graph that results from an instance of the function. It should be noted that the Fibonacci function is used to demonstrate the application of these techniques on a function that contains multiple recursive calls. It will be shown later in this paper that functions such as the Fibonacci function can only be implemented for small values due to space restrictions on the FPGA device. The mutual data dependency between nodes means that if they were each mapped to a single process, each process would have to halt operation while waiting for data from the other. This lack of parallelism effectively renders pipelining the Fibonacci function in this manner useless.
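For reference, the recursive Fibonacci function and the size of its call graph can be sketched as follows. This is an illustrative sketch only: the exact form of the function in Figure 2 is not reproduced here, so the base case `n < 2` and the helper name `call_graph_nodes` are assumptions.

```python
def fib(n):
    """Recursive Fibonacci with two recursive calls per non-terminating call."""
    if n < 2:                        # terminating condition (assumed base case)
        return n                     # terminating set of operations
    return fib(n - 1) + fib(n - 2)   # caller waits on both results


def call_graph_nodes(n):
    """Count the nodes in the call graph rooted at fib(n) (hypothetical helper)."""
    if n < 2:
        return 1
    return 1 + call_graph_nodes(n - 1) + call_graph_nodes(n - 2)


if __name__ == "__main__":
    for n in range(2, 9):
        print(n, fib(n), call_graph_nodes(n))
```

Every interior node of the call graph must wait for both of its children before it can return, which is the data dependency the paragraph above describes.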
However, if each node is mapped to two processes, one to handle data from one direction and one to handle data from the other, the nodes could be run in parallel. In Figure 3 the leftmost node has been partitioned into two nodes: one that passes the arguments into the next node, and another that computes results from the results of the previous node. This in effect partitions the recursive function into two separate functions: pre-recursive, which contains all the operations that are executed before the recursive call, and post-recursive, which contains all the operations that are executed after the recursive call. To avoid repeating computations that have already been made, the pre-recursive function transmits state information to the post-recursive function.
Deriving the functions
In essence, to extract the full amount of parallelism from a pipelined recursive function, the method described in this paper partitions the recursive function into three functions: non-recursive, pre-recursive and post-recursive. To create these functions, the function is partitioned into a flowgraph $G$ as described by Muchnick [9], with the exception that every statement containing a recursive call is in a block with no other statements. Blocks containing recursive calls will be termed recursive blocks for the rest of this paper. An example of this is shown in Figure 4, where a flowgraph is given for the Fibonacci function. Subsets of the blocks in this flowgraph are then used to create a flowgraph for each function. All blocks in the flowgraphs are connected with edges if they were connected in the original graph.
Non-recursive function: A graph $G'$ is created, which is the graph $G$ with all recursive blocks removed. The flowgraph for the non-recursive function, $G_{\text{non-recursive}}$, is de-
Post-recursive function: The flowgraph $G_{\text{post-recursive}}$ for this function is defined as the graph containing all the blocks in $G$ that can reach the exit block and that are reachable from any recursive block in $G$.
These operations on the Fibonacci function would produce the following sets:
• $G'$ = {entry, 1, 2, 5, exit}
The functions that correspond to these flowgraphs are then placed into a pipelined array, an example of which is shown in Figure 5 . The pre-recursive and the post-recursive functions are connected together so that state information can be transmitted between the two.
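The decomposition can be illustrated in software with a function that has a single recursive call. The sketch below is an assumption-laden model, not the authors' implementation: `sum_squares` is a hypothetical example function, and the list `stage_state` merely stands in for the state connection between the pre-recursive and post-recursive modules of each pipeline stage.

```python
def sum_squares(n):
    """Reference recursive function: n^2 + (n-1)^2 + ... + 1."""
    if n == 0:
        return 0
    return n * n + sum_squares(n - 1)


def is_terminal(n):
    return n == 0


def non_recursive(n):
    """Operations executed when the function terminates."""
    return 0


def pre_recursive(n):
    """Operations before the recursive call: next argument plus saved state."""
    return n - 1, n * n


def post_recursive(state, child_result):
    """Operations after the recursive call, reusing the saved state."""
    return state + child_result


def run_unrolled(n):
    stage_state = []                    # one entry per unrolled stage
    arg = n
    while not is_terminal(arg):         # forward pass through pre-recursive stages
        arg, state = pre_recursive(arg)
        stage_state.append(state)
    result = non_recursive(arg)         # final stage of the array
    for state in reversed(stage_state):  # return pass through post-recursive stages
        result = post_recursive(state, result)
    return result


if __name__ == "__main__":
    print(run_unrolled(4), sum_squares(4))
```

In hardware, the forward and return passes would operate concurrently on different data items. The sequential loops here only show that the three derived functions, composed stage by stage, reproduce the original result.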
Mapping functions with multiple recursive calls
The majority of useful recursive functions, such as various sorting algorithms and matrix multiplication algorithms [10], contain multiple recursive calls. When these functions are mapped, the pipeline may not be a linear array but a tree-like structure. To determine the number of "children" each node in the pipeline has, we consider the ratio of work done between the root node and its children (Figure 6). This ratio is the sum of the work done for each child node divided by the work done by the root node, and will be termed the process growth rate. It is shown in Equation (1). The result of the ceiling function on this ratio is the number of function units that need to be allocated to compute the result for all the child nodes with the same throughput as the previous node.
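Equation (1) is not reproduced in this text. A form consistent with the description above can be written as follows, assuming $f$ denotes the work function, $n$ the data arriving at the root node, and $n_i$ the data passed to the $i$-th of $k$ child nodes (the symbols $n$, $n_i$ and $k$ are labels introduced here for illustration):

```latex
\[
  \text{process growth rate} \;=\; \frac{\sum_{i=1}^{k} f(n_i)}{f(n)},
  \qquad
  \text{function units per level} \;=\;
  \left\lceil \frac{\sum_{i=1}^{k} f(n_i)}{f(n)} \right\rceil
\]
```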
Algorithms with a process growth rate of one require only a single function unit to be allocated per level of recursion, which reduces the complexity of mapping the function to hardware to that of a recursive function with a single recursive call. Such algorithms match the following four criteria:
The first criterion requires that the work done by the function monotonically increases as the amount of data increases. The second criterion requires that the amount of data in all the recursive calls is no more than the incoming data to the function. The fourth criterion requires that the rate of change of the work function f also be monotonically increasing. Together, the second and fourth criteria ensure that if an arbitrary stage of recursion meets criterion 1, all further levels of recursion will meet criterion 1. It can be shown that all divide-and-conquer algorithms match the first three criteria, and to the authors' knowledge there is no divide-and-conquer algorithm that does not match the fourth criterion. An example of an algorithm which does not meet the fourth criterion is the recursive definition of the Fibonacci function, as the data increases between one level of recursion and the next. A recursive function with a process growth rate ≤ 1 is effectively mapped to the FPGA in a similar way to a loop. While the techniques discussed in the remainder of this paper could be applied to functions with a process growth rate > 1 with the use of additional storage between stages of the recursion, the techniques could be further refined with the use of parallelism to deal with the exponential growth in space that is required as the function unrolls. Therefore, further discussion in this paper will concentrate on functions with a process growth rate ≤ 1.
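The process growth rate and the resulting function unit count can be illustrated with a short sketch. The function names, the linear work model for a divide-and-conquer split, and the constant-work model for Fibonacci are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def process_growth_rate(work, n, child_sizes):
    """Sum of the children's work divided by the root's work."""
    return sum(work(m) for m in child_sizes) / work(n)


def function_units(work, n, child_sizes):
    """Function units needed per level to match the parent's throughput."""
    return math.ceil(process_growth_rate(work, n, child_sizes))


def linear(m):
    """Assumed work model: work proportional to the amount of data."""
    return m


def constant(m):
    """Assumed work model: the same work for every call."""
    return 1


if __name__ == "__main__":
    # Divide-and-conquer split of 16 items into two halves: rate 1, one unit.
    print(process_growth_rate(linear, 16, [8, 8]), function_units(linear, 16, [8, 8]))
    # Fibonacci-style call with two children doing equal work: rate 2, two units.
    print(process_growth_rate(constant, 10, [9, 8]), function_units(constant, 10, [9, 8]))
```

Under these assumed models, the divide-and-conquer case needs one function unit per level, while the Fibonacci-style case doubles the units required at each level.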
MAPPING THE RECURSION TO THE FPGA
To compute the recursive function, logic modules are synthesised from the three functions that are produced from the original recursive function. These modules are then transformed into configuration bitstreams, which are loaded onto the FPGA using RTR. To conserve space on the FPGA, the bitstreams for the functions are loaded on demand. Loading the bitstreams as the function is executing introduces problems such as the time required to load a bitstream onto the FPGA, the placement of modules and the routing between them, and the physical size limitations imposed on the unrolling process by the FPGA device. This section discusses some solutions to these problems.
Reducing the impact of reconfiguration
If the function is unrolled on demand, the time taken to load the bitstream onto the FPGA becomes an issue that may affect performance. FPGA reconfiguration is a slow process: the minimum theoretical time required to configure an XC2V1000 is 15 ms using these figures; however, experimental results show that the actual time required is longer than this. If the system were to stall for this time while reconfiguring new stages of the recursion, the throughput of the system would be greatly reduced. To reduce, or eliminate, this stalling time, we attempt to initiate the reconfiguration as early as possible by predicting the need for reconfiguration.
Fig. 7. Recursive function with a predictable behavior
Fig. 8. Recursive function with an unpredictable behavior
Predicting the need for recursion is done by analysing the recursive function. The function in Figure 7 will have a depth of $\log_2(n)$, as n is halved until it reaches 1. This function is simple enough for the depth to be predicted accurately; however, the behavior of other functions, such as some divide-and-conquer algorithms, is not predictable. The quick sort algorithm described in Figure 8, which partitions the set in two around a random pivot, will have a depth of O(log(N)) at best and O(N) at worst. In a situation such as this, the value which is used is that which uses the least space on the FPGA (in this case O(log(N))). To reduce the impact of RTR in these cases, an extra stage is allocated at all times; however, this technique can still result in the system stalling while waiting for RTR to occur.
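The difference between a predictable and an unpredictable recursion depth can be sketched as follows. The code is illustrative only: Figures 7 and 8 are not reproduced here, so the halving function, the integer division it uses, and the first-element pivot (chosen instead of a random pivot so that the output is deterministic) are assumptions.

```python
def halving_depth(n):
    """Observed depth of a recursion that halves n until it reaches 1."""
    depth = 0
    while n > 1:
        n //= 2
        depth += 1
    return depth


def predicted_depth(n):
    """Depth predicted ahead of time: floor(log2(n)) for n >= 1."""
    return n.bit_length() - 1


def quicksort_depth(items):
    """Recursion depth of quick sort with a first-element pivot (assumption)."""
    if len(items) <= 1:
        return 0
    pivot, rest = items[0], items[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return 1 + max(quicksort_depth(left), quicksort_depth(right))


if __name__ == "__main__":
    print(halving_depth(1024), predicted_depth(1024))
    print(quicksort_depth([4, 2, 6, 1, 3, 5, 7]))   # balanced partitions
    print(quicksort_depth(list(range(8))))           # degenerate partitions
```

The halving function's depth depends only on n, so stages can be configured ahead of time. The quick sort depth depends on the data, so a prediction based on the best case can be exceeded, and the system may stall while a further stage is configured.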
Placement and routing
Loading the bitstreams on demand introduces further complexities. The placement of, and routing between, newly configured modules becomes an issue which must be dealt with. This is overcome by creating placement constraints such that all modules span entire columns, which reduces the placement and routing problem to a one-dimensional problem. To reduce the placement and routing problem further, the pre-recursive and post-recursive modules are synthesised in the same column, as one module always occurs in conjunction with the other. This removes the need to create a route between the two modules.
To comply with Xilinx recommendations for partial reconfiguration design flows [11] , routing modules are created between each module, and bus macros are used between each function module and the routing module.
Overcoming hardware limitations
As the recursive function is unrolled the amount of logic resources on an FPGA device places a limit on the number of function units that can be placed on the device. This limit restricts the size of the pipeline that is created and therefore the amount of unrolling that can be done.
We use the techniques of Bondalapati et al. for overcoming space limitations when mapping loops to map recursion to reconfigurable hardware. These techniques apply to recursive functions with a process growth rate ≤ 1, as the mapping is essentially the same as the mapping of a loop. The approach is not used for every stage in the recursive pipeline, as the non-recursive function must only operate once on incoming data. An example of this is shown in Figure 9.
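One way to picture this folding is as a schedule that assigns several consecutive levels of recursion to each physical stage. The sketch below is a simplified model under stated assumptions: the function name `fold_schedule` and the even distribution of levels across stages are illustrative choices, not details given in the paper.

```python
import math


def fold_schedule(depth, physical_stages):
    """Assign recursion levels 0..depth-1 to at most `physical_stages` stages.

    Each stage feeds back onto itself for its assigned levels before passing
    data on. The non-recursive stage is not part of this schedule, because it
    must operate only once on incoming data.
    """
    if physical_stages < 1:
        raise ValueError("at least one physical stage is required")
    per_stage = math.ceil(depth / physical_stages)
    schedule = []
    level = 0
    for _ in range(physical_stages):
        levels = list(range(level, min(level + per_stage, depth)))
        if levels:                       # skip stages with nothing to do
            schedule.append(levels)
        level += per_stage
    return schedule


if __name__ == "__main__":
    for stage, levels in enumerate(fold_schedule(10, 4)):
        print("stage", stage, "computes levels", levels)
```

With ten levels of recursion and room for four stages, each stage in this model iterates up to three times, trading some parallelism for the ability to fit on the device.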
RESULTS
Experiments comparing the methods outlined in this paper with a regular stack implementation were carried out using a Celoxica RC200 board with a Xilinx Virtex-II XC2V1000 device. It is worth noting that these results do not measure the impact of runtime reconfiguration, as the board does not support the loading of partial bitstreams. While this is not included in the results, the cost of partial reconfiguration would not account for the large gaps in performance between the pipelined implementation and the stack-based implementation. All results are measured in clock cycles, with the clock frequency set to the maximum frequency the stack-based implementation could use while maintaining stability. In all instances the pipelined solution could be run at a higher clock speed. Two algorithms were implemented: a quick sort algorithm (Figure 8) and a force calculation algorithm [12]. The force calculation algorithm recursively partitions a plane into regions, and approximates the force in each region using the force approximation of the contained subregions. Both of these recursive functions are divide-and-conquer functions that have O(N log(N)) runtime. The results for both experiments are shown in the graphs in Figures 10 and 11. As expected, both experiments show that the algorithms run in linear time when pipelined with O(log(N)) stages in the pipeline. This results in a significant improvement in runtime as the data sets become larger. The differences in runtime on smaller sets are attributed to the computation overhead that is introduced by managing the stack.
CONCLUSION
The results described in this paper demonstrate that unrolling a recursive function on a runtime reconfigurable device provides substantial performance benefits when compared to a stack-based solution. A method for applying function unrolling on devices of finite size is also presented, along with a method to reduce the impact of runtime reconfiguration on the performance of the algorithm.
Further work to make this methodology applicable to more general recursive functions includes solving the issue of mapping functions that have a process growth rate > 1. As this results in an exponential increase in hardware use per level of recursion added, more advanced techniques will be required to reduce the hardware use of this methodology.
