Signal processing applications have been shown to map well to time multiplexed coarse grained reconfigurable array (CGRA) devices, and can often be decomposed into a set of communicating kernels. This decomposition can facilitate application development and reuse but has significant consequences for tools targeting these devices in terms of allocation and arrangement of resources. This paper presents a CGRA floorplanner to optimize the division and placement of resources for multi-kernel applications. The task is divided into two phases aligned with the respective goals.
INTRODUCTION
CGRA devices represent an emerging family of architectures exploring alternatives to commodity microprocessor computing. There is a large body of research converging on these styles of architectures from a variety of perspectives from FPGAs to VLIWs. These devices have been shown to be well suited to applications oriented toward streaming computation. Arguably the most important driver of architecture development and adoption is tool support for efficiently mapping applications to these new architectures. Supporting this notion, we present an automated floorplanning tool as part of a tool chain to leverage the advantages of CGRAs for multi-kernel applications.
Floorplanning is a challenging problem for traditional ASIC and FPGA netlists. While the size of the problem for a CGRA is mitigated somewhat by the coarse granularity of the functional units, it is further distinguished from traditional floorplanning in two main ways. First of all, floorplanning normally deals with a fixed quantity of resources. Once a netlist is mapped and packed to a target technology, the quantity and type of resources required in the physical device is essentially fixed. In the CGRA case, the number of resources allocated is more flexible because physical resources can be traded for time through time multiplexing. Previous work on the Mosaic project [1] considered a single kernel of computation targeting the CGRA device. In this paper we examine multiple communicating kernels sharing the same physical device, made possible by enhancements to the Mosaic compiler [2].
The multi-kernel floorplanning problem for CGRAs is complicated by the flexibility of mapping each kernel to the device when multiple kernels are involved. The number of resources allocated to each kernel can be adjusted with a corresponding impact on the performance of the kernel. Thus, we must decide how to divide the available resources among them. The second issue is related to the shape of kernel regions. Rectangular regions, while convenient for ASIC or FPGA floorplans, are not well suited to the CGRA floorplanning problem. For example, a five cluster kernel can either be restricted to only 1x5 layouts, which radically increases wire lengths within a kernel, or allowed to take a 2x3 shape, wasting 17% of the allocated resources. The allocation of resources in general should make use of as many resources as available to maximize performance. However, a given allocation of resources is not guaranteed to fit on the device in a rectangular region for each kernel. We provide an alternative solution that allows irregularly shaped regions in the resulting floorplan.
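The trade-off in the five cluster example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `rectangle_options` and the use of half-perimeter as a stand-in for wire length are assumptions, not part of the floorplanner.

```python
import math

def rectangle_options(clusters, max_side):
    """Smallest rectangle of each height that holds `clusters` unit clusters.

    Returns (rows, cols, wasted_fraction) tuples; mirrored shapes are skipped.
    """
    options = []
    for rows in range(1, max_side + 1):
        cols = math.ceil(clusters / rows)
        if cols < rows:  # the remaining shapes are mirror images
            break
        wasted = (rows * cols - clusters) / (rows * cols)
        options.append((rows, cols, wasted))
    return options

# Five cluster kernel: 1x5 wastes nothing but is long, 2x3 wastes one of six clusters.
for rows, cols, wasted in rectangle_options(5, 5):
    print(f"{rows}x{cols}: {wasted:.0%} wasted, half-perimeter {rows + cols}")
```

The 2x3 option wastes 1/6 of its clusters, which is the 17% figure quoted above.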
BACKGROUND
The Mosaic project [1] is developing an infrastructure to explore CGRA architectures and CAD tools. CGRAs are clusters of functional units and memories on a configuration plane to enable cycle to cycle static scheduling of operations [3]. To execute multiple independent kernels, an enhanced CGRA allows configurable subsets of the resources to operate as independent CGRA regions within the architecture to allow kernels with different performance characteristics to reside on the same device. This means that within a CGRA region, operations and interconnect are scheduled and have a fixed execution sequence. However, different CGRA regions are able to operate independently in the fabric. This allows individual CGRA regions to be tailored exclusively to a particular kernel of computation instead of trying to shoehorn an entire application into a single monolithic kernel spread across the entire device. Between CGRA regions, the application employs massively parallel processor array (MPPA) style flow controlled buffered interconnect, effectively decoupling the control domains of individual CGRA regions.
This work was supported by NSF grant #CCF-1116248 and by Department of Energy grant #DE-FG02-08ER64676.
Mosaic uses pipelined interconnect in a fixed frequency device. This eliminates adjusting the clock speed as a technique to address communication rate mismatches between kernels. The advantage is that, in a practical implementation, only a single clock network and PLL are required.
No additional hardware is required to synchronize at arbitrary clock boundaries between clusters, simplifying the device architecture.
There are two important properties of individual kernels that the floorplanner uses. The first is the size of the kernel, measured by the total number of operations that must be executed. The second is the recurrence initiation interval. Initiation interval (II) is the number of cycles between starting subsequent iterations of a loop. The recurrence II is the length of the longest loop carried dependence cycle in the dataflow graph. This represents the minimum II achievable for the kernel and therefore the maximum throughput given sufficient resources.
The enhanced architecture has significant benefits when compared to existing approaches. In traditional CGRAs, all computations must operate in lockstep, slowing the entire system to the rate of the slowest element; the enhanced architecture allows an individual kernel to operate at its own rate, often achieving significantly higher throughput. In MPPAs, such as Ambric [10] , users have to write code for each individual processor and must refactor the design manually to employ more resources; the enhanced architecture can automatically spread a given computation across multiple compute units, allowing the user to express a computation in its most natural decomposition while relying on the tools to automatically harness multiple computational resources for individual kernels to provide the best overall throughput.
Supporting the multi-kernel flow in the Mosaic project requires integration of the new floorplanner into the existing toolchain. Figure 1 shows where the floorplanner is inserted into the Mosaic tool chain to manage multiple kernels in the overall application. Design entry begins with Macah [4] which already supports defining multiple kernel blocks [2] . The compiler generates a dataflow graph for each kernel. The compiled dataflow graphs are consumed by SPR [5] . This tool is inspired by VLIW compilation for scheduling as well as FPGA tools for assigning operations to the physical resources as part of placement and routing. SPR targets an individual kernel for CGRA style execution.
In order to support the new multi-kernel model, it is necessary to find an allocation of resources to maximize throughput of the overall computation while respecting the finite amount of total resources. Once this division of resources is decided, global placement works to minimize resources dedicated to the communication links between kernels. After the kernels are assigned to device regions, the existing toolset can be applied to map each kernel onto the subset of resources allocated to it. Therefore, the floorplanner is situated between Macah and SPR to provide the resource partitioning and global placement. Resources in Mosaic CGRA architectures are grouped into clusters of multiple ALUs, memories, stream ports and other resource types on a square grid.
For the purposes of this floorplanning task, these clusters are the granularity at which the resource allocation and placement are performed.
The Mosaic hardware supports multiple kernels by mapping them to different regions of the chip. Each kernel operates like a CGRA array, with a fixed modulo-scheduled operation, deep pipelines, and time-multiplexed logic and routing resources. This provides inexpensive and effective parallelism for streaming computations. Signals between kernels operate with handshaking, moving data independent of the IIs or stalls of the intervening kernels [13]. As such, the inter-kernel wires are more expensive than the intra-kernel wires, and thus the length of communication wires between kernels must be carefully controlled.
RELATED WORK
Floorplanning is an important part of ASIC and FPGA design flows. In the FPGA space, Xilinx PlanAhead [6] allows the designer a high degree of control over where specific modules or components are placed in the architecture, which in many cases can mean the difference between a hopelessly long placement phase and a design that meets the required timing. While floorplan regions are constrained to rectangular regions, they can be composed together to provide an irregularly shaped region. However, this is entirely performed manually by the designer.
There is a large body of floorplanning work to consider. In [7] the authors target heterogeneous FPGA architectures using a slicing technique with compaction. While well suited for FPGAs, this technique does not map well to the coarse granularity of CGRAs. For example, Figure 2 shows the five cluster kernel mentioned previously, which suffers from wasted resources or poor wire length in the top and bottom arrangements, respectively.
In [8], the hierarchical clustering approach leaves unused resources in the array due to the communication pattern of the macro based netlists targeted. In a coarse grained device, this leads to poor utilization, which is much more costly in a CGRA than in an FPGA.
The StreamIt language and compiler [9] is a related project that offers a similar model of computation with actors communicating through FIFO channels. However, the approach is quite different because the underlying hardware is an array of processors which are able to change tasks to a much higher degree than our enhanced CGRA model, which is limited to its static schedule of instructions. With StreamIt, kernel code may be swapped in and out of a particular core, where in our case each computational element is a member of a kernel region and has a small number of operations which must operate in lock step with other members of the same region for the lifetime of the application.
The Ambric [10] flow controlled interconnect channels are similar to the inter-kernel communication resources of the enhanced CGRA. However, Ambric does not support a scheduled execution mode, making it less amenable to operating in a CGRA mode for individual kernels spread across a collection of resources. At its debut, Ambric's programming model required development of individual programs to execute on the processors in the array. Thus, a developer divided an application into components suitable for implementation on a single Ambric processor with individualized programs. Even for applications where one program might be reused on many processors, handling distribution of data to and from each processor would still need to be managed manually. Tool support for leveraging the array without the need to decompose an application by hand is a key feature of the Mosaic project.
FLOORPLANNING ALGORITHM
Floorplanning must both determine the number of resources assigned to each kernel to achieve the best throughput, and place those resources onto the device to minimize the communication costs in the system. These two questions naturally break the floorplanning problem into two phases:
• Resource Allocation: The goal is to optimally assign a finite quantity of available resources amongst the various kernels to maximize the throughput of the overall application. Per-kernel information provides the number of operations performed and the recurrence II (a limit on the maximum throughput). (See Table 1.) Each inter-kernel signal is also annotated by the number of data items per iteration sent and received by the source and destination of the signal respectively. From this data an assignment of resources is generated for each kernel as input to the Kernel Placement step.
• Kernel Placement: The goal is to place the resources assigned in Resource Allocation, seeking to minimize the resulting routing costs. Resources dedicated to a given kernel should be contiguous and as compact as possible to limit the length of intra-kernel routing. To minimize the more expensive inter-kernel signal lengths, kernels that communicate with each other should be placed close together. An example result is shown in Figure 3.
Resource Allocation
Intuitively, each kernel wants enough resources to reduce its schedule depth and increase its throughput. However, this must be balanced in the context of the overall application. With finite resources available in the device, the topology of connections between kernels, and the performance of neighboring kernels, maximizing performance of an individual kernel will not necessarily produce an optimal system solution. For example, consider the positron emission tomography (PET) event detector application (Figure 4 bottom). It has two kernels, and a simple allocation would give each kernel half of the available resources. However, the application actually consists of a line-rate threshold kernel that must quickly process data, looking for a relatively rare event, and a math kernel that does complex processing on those events. In the example, the send and receive rates, measured in data tokens per initiation, are 0.04 for the threshold kernel and 1 for the math kernel. The best allocation (Figure 4 top) actually dedicates almost all of the chip resources to the threshold kernel, since that boosts overall throughput, while starving the math kernel for resources does not affect overall throughput.
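The reason the math kernel can be starved follows from balancing the two ends of the stream. The sketch below is a worked illustration; the helper name `max_consumer_ii` is hypothetical and the model assumes a single stream between the two kernels.

```python
from fractions import Fraction

def max_consumer_ii(producer_ii, sent_per_iter, received_per_iter):
    """Largest consumer II that still keeps up with the producer.

    The stream is balanced when
        sent_per_iter / producer_ii == received_per_iter / consumer_ii.
    """
    return producer_ii * Fraction(received_per_iter) / Fraction(sent_per_iter)

# PET rates from the text: threshold sends 0.04 tokens per initiation,
# math receives 1 token per initiation.
print(max_consumer_ii(1, Fraction("0.04"), 1))  # math kernel II when threshold II is 1
```

Under these rates the math kernel may run with an II 25 times that of the threshold kernel before it becomes the bottleneck, so most resources are better spent lowering the threshold kernel's II.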
While the best allocation of resources for the PET application is relatively obvious, for a more complex network of kernels, it is much more difficult. Each kernel with its own resource requirements, recurrence II, and stream rates, ultimately interacts with all of the kernels in the context of the total resource limit of the device itself.
The resource allocation portion of the algorithm is outlined in the pseudocode of Figure 5 . At a high level, the algorithm begins with a minimal resource allocation to each kernel. The main loop performs an analysis of the application in the context of the current resource allocation, adds resources to kernels that limit throughput, and then iterates until the device is filled or no further performance gain is possible due to limits in the kernels themselves.
The algorithm initially builds a graph describing the communication between the various kernels from the Macah compiler, as well as information about each kernel from SPR, the tool that performs scheduling, placement and routing for an individual kernel in Mosaic. For example, the digital camera pipeline (IPL) application in Figure 3, with kernel parameters shown in Table 1, has recurrence II as the lower limit on II if the kernel is not resource limited. The number of operations for a kernel indicates the size of the dataflow graph representing it. All of these operations must ultimately have an issue slot available in the device. For a given number of resources, the resulting II can be calculated, or for a given II, the number of resources can be calculated. In one extreme, provided sufficient instruction memory, all operations could be executed on one functional unit. The other more desirable extreme spreads operations out among a collection of functional units to take advantage of data and pipeline parallelism in the application.
Lastly, a production or consumption rate for each output or input port is provided by the developer to indicate how often a value is produced or consumed on a per iteration basis. This information allows the algorithm to assess bottlenecks in the communication between kernels operating at their own IIs in order to increase throughput as much as possible. This information may ultimately be obtained from automated simulation of the application prior to mapping, but is currently manually annotated.
At the outset, the algorithm allocates each kernel a minimum number of resources. This minimum is bounded by the maximum II supported by the hardware and by the need to provide a sufficient quantity of each resource type, such as memories or stream ports.
This first solution not only represents the minimum number of resources absolutely necessary to execute the application, but will also be the slowest solution in terms of performance because operations must each have an issue slot in the schedule. Note that we assume that memory operations on different arrays can be packed into the same physical memory. For simplicity of explanation, we will assume the target device supports an unlimited maximum II so for the IPL application, the initial allocation of resources to each kernel is just one resource each.
From the initial resource allocation, the resource limited II is calculated on lines 4-5 of Figure 5 , as the number of operations divided by the number of resources allocated rounded up, and is the minimum schedule depth needed to provide every operation a resource and time slot to execute. On the first iteration, this will be equal to the total number of operations of a given kernel, since only one resource is available to a kernel. With each kernel's II, the next stage calculates the absolute rates at which values would be produced or consumed at each port (lines 6-7) of each kernel assuming input streams always have data available and output streams are never full. Thus, if a kernel has an II of 4, and produces 2 values per iteration on a port, the port would produce at an absolute rate of 0.5 running unconstrained. For IPL, all kernels produce and consume data at the same rate, so their port rates are the reciprocal of their respective resource constrained IIs. Now the algorithm evaluates each stream in the application and assigns the stream rate to the value of the "slower" end (lines 8-9). For example, if one end of the stream is trying to produce results every cycle, and the other end can only consume once every 5 cycles, the stream rate will be 0.2 data elements per cycle. While this local processing puts an upper limit on the rates of each channel and kernel, we must model the more global behavior. The faster end of the stream will slow down to match the stream rate through stalling, and transitively this will slow the other ports of this kernel. Other kernels may then be slowed, until a steady-state is reached. For the IPL example on the first iteration, streams connected to the INT kernel will be set to the INT port rates because it is the slowest kernel on this iteration.
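The local calculations in this step are simple enough to state directly. The following is a minimal sketch of them; the function names are illustrative and do not correspond to the pseudocode of Figure 5, and exact fractions are used only to avoid floating point rounding.

```python
from fractions import Fraction

def resource_limited_ii(num_ops, num_resources):
    """Operations divided by resources, rounded up (integer ceiling)."""
    return -(-num_ops // num_resources)

def port_rate(values_per_iteration, ii):
    """Values per cycle at a port when the kernel runs unconstrained at `ii`."""
    return Fraction(values_per_iteration, ii)

def stream_rate(producer_rate, consumer_rate):
    """A stream can move data no faster than its slower end."""
    return min(producer_rate, consumer_rate)

# Examples from the text.
print(port_rate(2, 4))                          # II of 4, 2 values per iteration
print(stream_rate(Fraction(1), Fraction(1, 5)))  # every cycle vs. once per 5 cycles
```

With a single resource, `resource_limited_ii(n, 1)` is simply `n`, matching the first iteration described above.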
With the kernels, ports and streams annotated with local information, the next phase of the algorithm begins (lines 10-15). For every kernel, the ports are evaluated by comparing the port rate to the stream rate of an adjacent port. If the port rate does not match the stream rate, this means that the port wants to operate faster than its partner on the other end of the stream, but this is not allowed since the slower port dictates the maximum rate. When this condition is detected, the adjustment is made. The change in port rate is then propagated to the kernel II itself (which will no longer be resource limited) and to all other ports of that kernel. The process of evaluating all ports of the system continues until no further changes are made to any port rates after all kernels have been evaluated. Again on the first iteration for IPL, the progression will be the port rates of the INT kernel propagate to the DC and LPF kernels and the LPF rates propagate to the ED kernel until all four kernels are operating at the INT kernel rate. Now the slack of the kernel is calculated as the difference between the originally calculated ideal rate and the rate the kernel was assigned during execution of the algorithm. Zero slack means that the kernel is operating as fast as possible given the resources allocated to the kernel. In the IPL case this will be the INT kernel. The kernel is therefore limiting performance of the overall application. There may be more than one limiting kernel if multiple kernels have zero slack. The limiting kernel or kernels are then provided with the minimum increment of resources required to reduce their II from the present value, and thereby increase throughput (line 16). With the resource allocation changed, the process begins again.
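The propagation and slack calculation can be sketched as a repeated sweep over the streams. This is an illustrative model rather than the implementation of Figure 5: kernel rates are tracked as iterations per cycle, the stream tuples `(src, src_tokens, dst, dst_tokens)` are an assumed encoding, the sweep bound is a safety limit chosen for this sketch, and the II values in the example are hypothetical rather than taken from Table 1.

```python
from fractions import Fraction

def settle_rates(resource_ii, streams):
    """Slow each kernel until both ends of every stream agree on a rate.

    resource_ii: {kernel: resource-limited II}
    streams: list of (src, src_tokens, dst, dst_tokens), tokens per iteration
    Returns {kernel: iterations per cycle} in steady state.
    """
    rate = {k: Fraction(1, ii) for k, ii in resource_ii.items()}
    max_sweeps = 2 * len(streams) + 1  # one sweep per port, plus one to confirm
    for _ in range(max_sweeps):
        changed = False
        for src, src_tok, dst, dst_tok in streams:
            stream = min(rate[src] * src_tok, rate[dst] * dst_tok)
            for kernel, tok in ((src, src_tok), (dst, dst_tok)):
                if rate[kernel] * tok > stream:  # port faster than its stream
                    rate[kernel] = stream / tok  # slows every port of the kernel
                    changed = True
        if not changed:
            return rate
    raise ValueError("rates did not settle; production and consumption look unbalanced")

def limiting_kernels(resource_ii, rate):
    """Kernels with zero slack: assigned rate equals the ideal rate."""
    return [k for k, ii in resource_ii.items() if Fraction(1, ii) - rate[k] == 0]

# Hypothetical IIs for an IPL-like chain DC -> INT -> LPF -> ED.
ii = {"DC": 10, "INT": 40, "LPF": 20, "ED": 15}
chain = [("DC", 1, "INT", 1), ("INT", 1, "LPF", 1), ("LPF", 1, "ED", 1)]
rates = settle_rates(ii, chain)
print(rates, limiting_kernels(ii, rates))
```

In this example every kernel settles at the INT rate and INT is the only kernel with zero slack, mirroring the first iteration described above.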
Termination conditions for the algorithm are as follows. If at least one of the limiting kernels is already operating at its recurrence II, the performance cannot be improved because this value is a lower bound on the schedule depth for the kernel, so additional resources will not improve it further. At this point the algorithm returns to the last solution. Alternatively, if the limiting kernel(s) can benefit from more resources but the sum of all allocated resources would exceed the device capacity then the algorithm will also return to the prior solution. Figure 6 and Figure 7 show the incremental solutions for the IPL application in terms of resources and kernel II respectively as generated by the Resource Allocation algorithm. Note that we assume the device supports a maximum II sufficient for every kernel to execute on a single resource.
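The resource increment and the two termination tests can be sketched as follows. The names `next_allocation` and `grow_limiting` are hypothetical, and the sketch assumes every resource can issue any operation, which ignores the per-type resource requirements mentioned earlier.

```python
def next_allocation(num_ops, resources, recurrence_ii):
    """Smallest resource count that lowers the II, or None if no gain is possible."""
    ii = -(-num_ops // resources)  # current resource-limited II (ceiling)
    if ii <= recurrence_ii or ii <= 1:
        return None  # already at the recurrence bound
    return -(-num_ops // (ii - 1))  # fewest resources giving II <= ii - 1

def grow_limiting(alloc, ops, recurrence_ii, limiting, capacity):
    """Give each limiting kernel its minimum increment.

    Returns the new allocation, or None when the previous solution should be kept.
    """
    new_alloc = dict(alloc)
    for kernel in limiting:
        grown = next_allocation(ops[kernel], alloc[kernel], recurrence_ii[kernel])
        if grown is None:  # a limiting kernel is at its recurrence II
            return None
        new_alloc[kernel] = grown
    if sum(new_alloc.values()) > capacity:  # device is full
        return None
    return new_alloc

# A 10 operation kernel steps through 1, 2, 3, 4, 5, 10 resources (II 10, 5, 4, 3, 2, 1).
r = 1
while r is not None:
    print(r, -(-10 // r))
    r = next_allocation(10, r, 1)
```

Note that the step from 5 to 10 resources is large because no intermediate count lowers the II of a 10 operation kernel below 2.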
It is also possible that the system has no legal solution. Our model of computation for floorplanning currently only allows blocking reads. If the production and consumption rates around a loop or where the flow of data diverges and then re-converges are unbalanced, then somewhere in the system a buffer associated with a stream will either become full or empty such that the execution will deadlock. Figure 8 shows an example of each condition with the port rates labeled. We detect this condition by limiting the number of iterations of the inner loop of the resource division phase to the number of ports in the design. Intuitively, if the algorithm is propagating a change due to a particular port more than once, then there is an unbalanced loop and the algorithm terminates. These conditions ultimately mean there is no steady state behavior of the system with bounded buffers between kernels.
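The same condition can also be checked directly on the annotated graph. The sketch below is not the iteration-bound test described above; it swaps in a balance check in the style of synchronous dataflow consistency analysis, assigning each kernel a relative firing rate and looking for a contradiction. The function name and the stream encoding are assumptions, and token counts must be integers.

```python
from collections import defaultdict, deque
from fractions import Fraction

def is_rate_balanced(streams):
    """True if relative kernel firing rates are consistent on every loop and reconvergence.

    streams: list of (src, src_tokens, dst, dst_tokens), integer tokens per iteration
    """
    adj = defaultdict(list)
    for src, produced, dst, consumed in streams:
        adj[src].append((dst, Fraction(produced, consumed)))  # dst firings per src firing
        adj[dst].append((src, Fraction(consumed, produced)))
    relative = {}
    for start in list(adj):
        if start in relative:
            continue
        relative[start] = Fraction(1)
        queue = deque([start])
        while queue:
            kernel = queue.popleft()
            for neighbor, gain in adj[kernel]:
                expected = relative[kernel] * gain
                if neighbor not in relative:
                    relative[neighbor] = expected
                    queue.append(neighbor)
                elif relative[neighbor] != expected:
                    return False  # two paths demand different rates
    return True

# Unbalanced loop: B returns two tokens for every one it receives.
print(is_rate_balanced([("A", 1, "B", 1), ("B", 2, "A", 1)]))
# Unbalanced reconvergence: the path through C delivers twice as much as the path through B.
print(is_rate_balanced([("A", 1, "B", 1), ("A", 2, "C", 1), ("B", 1, "D", 1), ("C", 1, "D", 1)]))
```

Both calls print `False`; a system that fails this check has no steady state with bounded buffers, regardless of how resources are divided.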
Our algorithm provides the best possible application throughput for a given device capacity and the supported production and consumption model. A proof of optimality is presented in [11] . To summarize the proof, this algorithm progresses through the set of Pareto optimal solutions from the smallest and slowest to the fastest and largest terminating when no further improvement is possible or the available resources are exhausted.
The resource allocation phase is very fast even for the most complex multi-kernel benchmarks such as the 18 kernel discrete wavelet transform (Wavelet) application which completes this phase in no more than 2 seconds. Even if the approach scales poorly with the number of kernels, its execution time is dwarfed by the Macah compiler and SPR.
Global Placement
The global placement phase takes the quantity of resources assigned to each kernel in the resource allocation phase and uses simulated annealing to place these resources in the device. The cost function works to keep the resources for each kernel together while also placing communicating kernels close together to reduce resource utilization and maintain routability.
After the division of resources has been established, the algorithm moves on to the coarse placement of kernels on the device in order to minimize routing resources dedicated to communication between kernels. The global placement is a simulated annealing based placement algorithm with a specialized cost function geared toward the floorplanning problem. Each resource assigned to a kernel is a separate movable object. Moves are made by selecting two locations at random and swapping the resources assigned to these locations. Swaps of resources in two locations from the same kernel are useless and are not allowed. The objective of the placement is to minimize the distance between resources that communicate, covering both intra-kernel and inter-kernel communication.
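A minimal sketch of the move generator is shown below. The grid encoding (a dictionary from location to kernel name, with `None` for an unassigned location) is an assumption, and the acceptance test is the standard Metropolis rule used in simulated annealing generally, not a detail confirmed for this placer.

```python
import math
import random

def propose_swap(grid, rng):
    """Pick two locations whose swap would change the placement.

    grid: {(row, col): kernel name or None}
    """
    if len(set(grid.values())) < 2:
        raise ValueError("every location holds the same kernel; no useful swap exists")
    cells = list(grid)
    while True:
        a, b = rng.sample(cells, 2)
        if grid[a] != grid[b]:  # same-kernel swaps are useless and skipped
            return a, b

def accept(delta_cost, temperature, rng):
    """Metropolis rule: always take improvements, sometimes take uphill moves."""
    return delta_cost <= 0 or rng.random() < math.exp(-delta_cost / temperature)

rng = random.Random(0)
grid = {(0, 0): "A", (0, 1): "A", (1, 0): "B", (1, 1): None}
a, b = propose_swap(grid, rng)
grid[a], grid[b] = grid[b], grid[a]  # apply the move
```

Seeding the generator, as in the usage lines, makes a run repeatable, which is convenient when comparing the best of several runs.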
There are two cost measurements used, which are applied to both internal and external kernel communication. The first is perimeter, evaluated by visiting each resource associated with a kernel and checking its neighbors. An adjacent position that is not another member of the same kernel counts as one unit of perimeter. Placement with a lower perimeter translates to a tightly packed cohesive block, while a large perimeter cost means the elements are spread out more, or even separated. The second measure is the bounding box perimeter, which is simply the perimeter of the smallest rectangle encompassing all members of the kernel. Each resource cluster is one square unit. The overall cost assigned to the kernel is the larger of the actual perimeter and bounding box perimeter. The bounding box helps guide separated elements back together. For example, a two element kernel where the resources are not adjacent will have the same perimeter cost regardless of their separation, so the bounding box dominates in this case to help drive them back together. The same basic approach is applied to inter-kernel communication. The difference in this case is that each pair of communicating kernels is treated as a single super kernel for the purposes of calculating the aforementioned costs. Thus, in this case, the perimeter is only counted if an element is not adjacent to another element from the same kernel or the kernel on the other end of the stream currently being evaluated. All the goals apply here as well; i.e., minimizing the perimeter will minimize the area so it should be tightly packed, and the bounding box will help drive together separated regions. The total cost function for the system is the sum of the individual kernel costs and each super kernel representing each inter-kernel stream in the application. The VPR [12] cooling schedule is used to control temperature adjustments for the annealing. 
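The cost function described above can be sketched in a few lines. This is an illustrative reading of the text: function names are hypothetical, each kernel's placement is given as a set of `(row, col)` cluster positions, and positions beyond the device edge are assumed to count toward the perimeter like any other non-member neighbor.

```python
def perimeter(cells):
    """One unit for every neighboring position that is not in the region."""
    cells = set(cells)
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return sum((r + dr, c + dc) not in cells for r, c in cells for dr, dc in steps)

def bbox_perimeter(cells):
    """Perimeter of the smallest rectangle enclosing the region."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return 2 * ((max(rows) - min(rows) + 1) + (max(cols) - min(cols) + 1))

def region_cost(cells):
    """Larger of the actual perimeter and the bounding box perimeter."""
    return max(perimeter(cells), bbox_perimeter(cells))

def total_cost(placement, streams):
    """Sum of kernel costs plus one super kernel cost per communicating pair."""
    cost = sum(region_cost(cells) for cells in placement.values())
    for a, b in streams:
        cost += region_cost(placement[a] | placement[b])
    return cost

# Two separated clusters: the perimeter stays at 8, the bounding box keeps growing.
print(region_cost({(0, 0), (0, 2)}), region_cost({(0, 0), (0, 3)}))
```

The last line reproduces the two element example: once the clusters are apart, only the bounding box term distinguishes a near placement from a far one.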
Figure 9 shows a simple two kernel example similar to the PET event detection benchmark with two different placements to illustrate the different components of the overall cost metric. The small kernel itself has a perimeter of 4 in either case, while the large kernel cost changes depending on whether it has a concave shape. Since the two kernels communicate, their resources are pooled together to calculate the perimeter once again, which is the same in either case here. In these two cases, the bounding box option is not used because it is never greater than the perimeter calculation for this example.
Even without any optimizations for calculating an incremental cost function per move, the placement phase of the floorplanning executes for no more than 40 seconds on a modest desktop for the most complex benchmark, again a runtime dwarfed by other tools in the Mosaic flow.
RESULTS
We present results for multi-kernel benchmarks written in Macah to demonstrate the floorplanning flow. For the global placements, the best of ten runs is reported to show the effectiveness of the approach. The Resource Allocation progression for the PET application is plotted in Figure 10 and Figure 11, demonstrating the application optimized at different port rates.
While the Resource Allocation process is optimal as shown in [11] , Global Placement is based on a heuristic. Five multi-kernel benchmarks were run through the floorplanner with the results summarized in Table 2 . The Min Cost field is the theoretical minimum placement cost achievable for the given resource allocation. This is calculated as the sum of the minimal rectangular regions for each kernel and pair of communicating kernels, similar to the cost function used in the actual placement. This minimum is generally unachievable in practice, since the placements of different kernels interact. The generated cost is for the best of ten runs of the benchmark through the Global Placement phase with the Cost Ratio indicating the increase over the theoretical minimum. Avg WL (wire length) is the average minimum distance between pairs of communicating kernels as defined in the application while the Max WL is the largest distance.
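The wire length columns can be reproduced from a placement with a short calculation. The sketch below assumes that the distance between two kernels is the Manhattan distance between their nearest clusters, so that adjacent kernels have length 1; the text does not state the metric, and the function names are hypothetical.

```python
def min_distance(cells_a, cells_b):
    """Manhattan distance between the closest pair of clusters (assumed metric)."""
    return min(abs(ra - rb) + abs(ca - cb) for ra, ca in cells_a for rb, cb in cells_b)

def wire_length_stats(placement, streams):
    """Return (Avg WL, Max WL) over all pairs of communicating kernels."""
    lengths = [min_distance(placement[a], placement[b]) for a, b in streams]
    return sum(lengths) / len(lengths), max(lengths)

placement = {"A": {(0, 0)}, "B": {(0, 1), (0, 2)}, "C": {(2, 2)}}
print(wire_length_stats(placement, [("A", "B"), ("B", "C")]))
```

In this toy placement A and B are adjacent while B and C are two steps apart, giving an average of 1.5 and a maximum of 2.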
As can be seen, the placer achieves layouts within 10% of the lower bound in all cases, with a geometric mean of 1.05. Inter-kernel signals are almost always of length 1, meaning communicating kernels are adjacent for all but 1 signal in Wavelet.
Detailed floorplans for several interesting cases are shown in Figure 3 , Figure 4 , and Figure 12 . Clearly, the results are well packed, communications are short, and individual kernels have reasonable shapes.
CONCLUSION
We have presented an algorithm for floorplanning multi-kernel applications on CGRAs. From a description of the inter-kernel communication pattern and basic parameters of the kernels, the algorithm divides the available resources among the kernels in order to maximize throughput. It then provides a high level placement of the kernel resources in order to facilitate global routing. This in turn enables detailed scheduling, placement and routing of each kernel to efficiently map multi-kernel applications onto the reconfigurable fabric.
