Abstract-This paper is concerned with the application of formal optimization methods to the design of mixed-granularity field-programmable gate arrays (FPGAs). In particular, we investigate the appropriate mix and floorplan of heterogeneous elements: multipliers, RAMs, and lookup table (LUT)-based logic, in order to maximize the performance of a set of digital signal processing (DSP) benchmark applications, given a fixed silicon budget. A mathematical programming framework is introduced, along with a set of heuristics, capable of providing upper-bounds on the achievable reconfigurable-to-fixed-logic performance ratio. Moreover, we use linear-programming bounding procedures from the operations research community to provide lower-bounds on the same quantity. Our results provide, for the first time, quantifications of the optimal performance/area-enhancing capability of multipliers and RAM blocks within a system context. The approach detailed provides a formal mechanism to explore future technology nodes.
I. INTRODUCTION

FOR SYSTEM architectures, reconfigurable devices provide designers with high-throughput, cost-effective platforms. When designing reconfigurable devices, the architects must be aware of the types of systems that users intend to map onto the platform. This means that device architectures must be designed with the performance of the different systems they are intended for in mind. Given the widespread use of reconfigurable architectures for digital signal processing (DSP) systems, there have been several advances in reconfigurable chip design for this domain, particularly evident in field-programmable gate arrays (FPGAs).
Manuscript received May 27, 2006; revised May 14, 2007. This work was supported by the EPSRC (U.K.) under Grant EP/C549481/1 and Grant EP/E00024X/1, and under the EPSRC DTA scheme. The authors are with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2BT U.K. (e-mail: alastair.smith@imperial.ac.uk). Digital Object Identifier 10.1109/TVLSI.2008.2000259
Traditionally, FPGAs have consisted of lookup tables (LUTs) capable of performing any $K$-input logic function. There has been considerable research into exploring the architectures of homogeneous LUT-based FPGA devices [1], [2]. This has concentrated on exploring the nature of the LUTs, for instance how many inputs they use, and how they are locally interconnected. More recent introductions into the FPGA fabric include components such as DSP blocks and RAM, for instance the Xilinx Virtex 2 [3] contains lookup-based slices, 18 kb embedded SRAM as well as 18-bit multipliers.
There are optional registers on the outputs. Similarly, Altera's Stratix II chips [4] contain three different sizes of memory as well as DSP blocks, the latter being capable of performing a variety of fused multiply-add operations. These embedded components have been used to speed up computation or take advantage of greater logic density. In this paper, the emphasis is on exploring heterogeneous architectures by examining the ratios and physical placements of the different components that are found in heterogeneous devices. When designing a new reconfigurable device, it is common for the architects to have a baseline parameterizable structure from which many different architectures can be generated. A considerable body of work in FPGA research has been based on the versatile place and route tool [5] , which provides such a structure. The parameters of this type of framework can then be varied to represent different devices with different characteristics. For example, the mix of routing resources can be varied [6] , [7] or the number of inputs for each logic function [1] , [2] . The most common approach for testing the performance of the architectures is to simulate a variety of possible architectures, with reference designs placed and routed in each architecture using heuristics such as simulated annealing. The final architecture will be one from this set that best suits the area, speed, and power consumption metrics for all designs.
The work presented in this paper takes a different approach to reconfigurable architecture design. By using linear programming (LP) and integer linear programming (ILP), this work shows how it is possible to simultaneously place benchmarks and generate heterogeneous architectures, as well as perform module selection for given computational structures in a benchmark; e.g., decide whether a ROM should be implemented in LUTs or in an embedded component. This paper proposes an approach that allows all three problems to be performed concurrently, leading to highly optimized architectures, eliminating both the need for exhaustive testing on a set of architectures and the dependence on heuristic parameters. Incorporating module selection into our approach also allows the tradeoff of the speed and area advantages of dedicated hardware with the flexibility of more general logic.
Many of these subproblems have been addressed in previous work. For instance, there is a large body of work on floorplanning from the application-specific integrated circuit (ASIC) community. However, this is the first time that these three problems have been addressed within the same framework.
Moreover, this work addresses floorplanning problems for which the underlying architecture is configurable and must be reused for different applications.
The main contributions of this paper can be summarized as follows.
• To our knowledge, the first formulation of heterogeneous FPGA architecture exploration as a formal optimization problem considering simultaneous floorplanning, module selection, and architecture generation.
• Floorplanning optimization for configurable devices where the underlying architecture is reused for different applications.
• A heuristic for solution of the combined architecture generation, floorplanning, and module selection problem.
• Bounds on the distance to optimality of the architectures generated, quantifying, for the first time, the optimal area/ speed advantages for a class of reconfigurable architectures.
• A comparison of the generated architectures to a family of commercial architectures.
The remainder of this paper is organized as follows. In Section II, related work is discussed. Section III details the mathematical formulation of the problem as a linear program, based on that presented in [8], and Section IV presents a heuristic bounding procedure for the optimization problem, for use when direct solution of the ILP is not possible due to runtime and computational constraints.
II. RELATED WORK
Many research works have been devoted to automatic floorplanning of digital circuits; we refer the reader to [9] for a comprehensive review of these techniques. The focus of this paper is on ILP floorplanning for reconfigurable hardware. The problem of floorplanning for reconfigurable hardware is somewhat different from that of standard VLSI floorplanning. Reconfigurable hardware can be configured according to the application for which it is to be used. This means that the underlying device floorplan is reused, and that a good architecture floorplan for one application may not be good for an alternative application. In this paper, we build on known ILP floorplanning techniques [10], and extend them into the realm of heterogeneous reconfigurable devices.
Heterogeneous reconfigurable devices provide an extra dimension to the problem of floorplanning, as they provide the designer with choices for implementing different parts of their design. This is essentially a module selection problem, which has been studied in previous works, for example [11] in the VLSI domain and [12] from a high-level synthesis perspective. There are also known ILP formulations of the module selection problem, for example [13], which is again from a high-level synthesis perspective. In the context of FPGA technology, [14] combines technology mapping to homogeneous fine-grain FPGAs with placement and routing, although it does not consider heterogeneous fabrics. More recent FPGA-related works on technology mapping consider embedded memories, for example [15], and there is also work on usage of DSP blocks [16]. In contrast to this paper, existing works have not considered the heterogeneous technology mapping problem in a context where the modules have to be reused when the device is configured for a set of different applications.
There is also a considerable body of work in the field of reconfigurable computing relating to exploration of various aspects of the architecture. However, there are relatively few concerning heterogeneous devices. In [17] , architectures containing embedded memories as well as LUTs are considered. A set of benchmarks is selected with the aim of minimizing the area of the architectures produced while maintaining a minimum circuit delay. The benchmarks are mapped to four-input LUTs to calculate the minimum circuit delay. An attempt is then made to find the best size of embedded memory block by applying algorithms that pack logic into the embedded memories. However, more modern architectures containing components such as multipliers are not covered by their work.
In [18], an abstract model is used to look at a set of predefined architectures containing specific functional units. These are arranged hierarchically so that the routing delay between components in the same level of hierarchy is the same. Applications are then mapped onto the set of architectures, and the architecture that implies the least communication delay is derived, the idea being to assign computations with data dependencies to resources in the same hierarchical level. While this is an interesting piece of work, only the set of architectures supplied by the user is examined, whereas the work presented here uses linear programming to evaluate a large set of architectures.
Heterogeneous coarse-grain reconfigurable devices have been researched extensively in the RaPiD project [19] - [21] . The result of this work has been a tool that generates device architectures, and the most recent work compares RaPiD architectures to standard-cell ASICs and FPGAs. A common theme throughout the RaPiD project has been on using heuristics to solve all parts of the design flow, meaning that their work provides empirically derived upper-bounds on the best solutions possible. The methods employed in this paper are based on formal mathematical methods, meaning that lower-bounds can be obtained in addition to upper bounds. Moreover, the devices that are generated in our work mix both fine and coarse-grain, allowing closer comparison to modern commercial FPGAs.
More recent advances have attempted to quantify the gap between heterogeneous FPGA architectures and ASIC implementations [22] . Their work uses synthesis tools to create both ASIC and heterogeneous FPGA implementations of algorithms in order to evaluate the performance gains of ASIC over FPGA. Our work complements [22] by providing an analytical approach to the problem; while [22] tries to measure the FPGA/ASIC gap for the whole tool chain, our work uses formal techniques to isolate heuristic tool bias from architectural measurements.
The work presented in this paper concentrates on exploring the design space of FPGAs containing a mixture of different fine-grain and coarse-grain resources. In particular, the mix of different resource types, and the effect this has on the benchmark performance in each architecture, is examined. Synthesis tools are used to map the various computation structures found in the benchmarks to different functional unit types found in commercial FPGAs in order to analyze the area and timing characteristics. However, a routing model is used to estimate the inter-component delays, allowing the device floorplan to vary as a problem parameter. Because the focus is on generation of heterogeneous FPGAs, synthesis effects are minimized by using mathematical programming. This allows the determination of performance bounds and true optima. The analytical formulation described in Section III is used to optimize the architecture floorplan for a benchmark set under a given area constraint.
III. DESIGN FLOW AND PROBLEM FORMULATION
This section details the formulation of the combined problem as a linear program. The heuristic methods developed in Section IV are based on the ILP formulation; in particular, the constraints and variables introduced in Section III-B are important for understanding the heuristic methods. For reference, Table I gives the notation used throughout this paper.
The architecture generator uses linear programming in order to combine the problem of benchmark floorplanning and module selection, as well as underlying architecture floorplanning. This is as illustrated in Fig. 1(a) -(c), in which the proposed framework has been used to generate an architecture to be used specifically for two different benchmarks, and has performed the technology mapping and floorplanning for each configuration.
The focus of this work is on the DSP domain, and as a consequence, the benchmarks used to test the architectures in this study have been developed in Xilinx's System Generator for MATLAB [23] . Fig. 2 shows the design flow and how our tool interacts with existing software. The benchmark circuits are specified as a dataflow graph (DFG). The benchmarks consist of a set of node types that represent various computations. The tool optimizes the architecture for the set of benchmarks supplied such that the architecture can be reconfigured in order to implement any one of the benchmarks.
The computational structures in the benchmarks are allowed to be constructed from the various resource types available; the linear programming approach allows this to be done within a unified framework. Existing synthesis tools supplied by Xilinx are used to determine various constants, such as timing and area estimates used in the ILP formulation. The tools allow assessment of the timing and resource requirements of each computational node under a variety of different implementation strategies. The term implementation strategy refers to whether the node is constructed from LUTs, embedded RAMs, or embedded multipliers.
Inter-node delays are estimated by using a simple linear model in which the cross-chip delay is proportional to the Manhattan distance. The constants of proportionality in this model were obtained by modelling the delay between two circuit elements in a Virtex 2 chip. The optimal architecture is then considered to be one that optimizes a measure of the different benchmark clock periods defined below.
In order to relate the clock period values between benchmark circuits, the concept of "relative clock period" has been introduced [8]. The relative clock period is defined in (1), where $R_g$ represents the relative clock period of benchmark $g$, $T_g(a)$ represents the minimum clock period of the benchmark given a particular architecture $a$, and $T_g^{\mathrm{opt}}$ represents the minimum clock period of the benchmark given the optimal set of components for the given area constraint. This means that the speed of each benchmark is normalized by dividing by its optimal speed under the specified area constraint (i.e., the best components for each benchmark circuit, rather than the best components for the entire set of benchmark circuits, are selected).
$R_g$ can be thought of as a measure of the speed lost as a result of introducing reconfigurability; a relative clock period of 1.2 implies that the circuit is 20% slower than it could be in the same area, if the device were designed solely for that specific benchmark. Note that $R_g$ is a measure of the worst case relative clock period of each benchmark. The overall goal of the optimization is to minimize the maximum value of $R_g$ over all benchmarks

$$R_g = \frac{T_g(a)}{T_g^{\mathrm{opt}}} \tag{1}$$
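As a small numerical illustration of (1), the following sketch computes the relative clock period of each benchmark and the worst case that the optimization seeks to minimize. The benchmark names and clock period values are hypothetical and chosen only to reproduce the 20% example given above.

```python
def relative_clock_periods(shared, optimal):
    """Relative clock period per benchmark, and the maximum over all benchmarks.

    shared:  benchmark -> minimum clock period on the shared architecture
    optimal: benchmark -> minimum clock period on a device tailored to that
             benchmark under the same area constraint
    """
    rel = {b: shared[b] / optimal[b] for b in shared}
    return rel, max(rel.values())


# Hypothetical clock periods in nanoseconds (not taken from the paper).
shared = {"fir": 6.0, "fft": 9.0}
optimal = {"fir": 5.0, "fft": 9.0}
rel, worst = relative_clock_periods(shared, optimal)
```

Here the hypothetical "fir" benchmark has a relative clock period of 1.2, i.e., it runs 20% slower on the shared device than on a device designed solely for it, and this value is the quantity the optimization would try to reduce.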
A. Linear Programming Formulation
The benchmark set is denoted $G$, and each $g \in G$ is a dataflow graph representing one benchmark circuit. A dataflow graph is a pair $(V, E)$, where $V$ specifies the set of vertices (or nodes) and $E$ specifies the set of edges between nodes. $(i, j)$ denotes an edge produced at node $i$ and consumed at node $j$. The nodes represent computations and the edges represent the data dependencies or dataflow between nodes.
The clock period of each benchmark is measured by taking the maximum delay between registers, inputs and outputs. Register outputs and circuit inputs can be considered source nodes while register inputs and circuit outputs can be considered sink nodes. The maximum delay of the circuit is thus the longest path between source and sink nodes. The delay of each node is obtained by individually synthesizing each implementation strategy for each node and automatically running a timing analysis. After having transformed the graph and ascertained the delays of each node, an annotated directed acyclic graph remains.
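The longest-path computation described above can be sketched as follows. This is an illustrative sketch only: the graph, the delay values, and the function name are assumptions, and the routing delay on each edge is supplied as a fixed number rather than derived from a floorplan.

```python
from graphlib import TopologicalSorter


def clock_period(node_delay, edges, edge_delay=None):
    """Longest source-to-sink delay of an annotated directed acyclic graph.

    node_delay: node -> combinational delay of the node
    edges:      iterable of (producer, consumer) pairs
    edge_delay: optional (producer, consumer) -> routing delay
    """
    edge_delay = edge_delay or {}
    preds = {n: set() for n in node_delay}
    for u, v in edges:
        preds[v].add(u)
    finish = {}
    # static_order() yields every node after all of its predecessors.
    for n in TopologicalSorter(preds).static_order():
        arrival = max(
            (finish[p] + edge_delay.get((p, n), 0.0) for p in preds[n]),
            default=0.0,  # source nodes start at time 0 in the clock cycle
        )
        finish[n] = arrival + node_delay[n]
    return max(finish.values())


# Hypothetical annotated graph: delays are illustrative only.
delays = {"in": 0.0, "mul": 3.0, "add": 1.5, "out": 0.0}
edges = [("in", "mul"), ("in", "add"), ("mul", "add"), ("add", "out")]
period = clock_period(delays, edges, {("mul", "add"): 0.5})
```

The arrival time computed for each node plays the role of the time at which its inputs become valid within the clock cycle.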
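In the optimization process, the objective is to minimize the maximum relative clock period $R$. The constraints on $R$ are given by (2), which includes a linearization of the max operator

$$R \ge \frac{T_g(a)}{T_g^{\mathrm{opt}}} \quad \forall g \in G \tag{2}$$

The delay of the circuit is dependent on whether a node is mapped to slices or embedded hard-IP, so the constraints dealing with circuit delay (based on Bellman's equations (3) [24]) have to be formulated to account for this, where $t_j$ represents the time during a clock cycle at which the inputs to node $j$ become valid and the edge weight $c_{ij}$ is a function that accounts for the combinatorial delay of node $i$ and the routing delay between nodes $i$ and $j$. The source node starts at time 0 within a clock cycle, i.e., $t_{\mathrm{source}} = 0$. Bellman's equations have been formulated in such a way as to account for the different ways of implementing certain nodes. Thus the version of Bellman's equations used in the ILP formulation is as in (4) and (5)

$$t_j = \max_{(i,j) \in E} \left( t_i + c_{ij} \right) \tag{3}$$

$$t_j \ge t_i + r_{ij} + \sum_{s \in S_i} d_{is}\,\mu_{is} \quad \forall (i,j) \in E \tag{4}$$

$$\sum_{s \in S_i} \mu_{is} = 1 \quad \forall i \in V \tag{5}$$

Here, $r_{ij}$ represents the portion of delay due to routing and $S_i$ represents the set of implementation strategies for node $i$, for example {embedded multiplier, LUT-based, embedded RAM}. The combinatorial delay of node $i$ when implemented in strategy $s$ is represented by $d_{is}$. The binary decision variable $\mu_{is}$ has the value 1 if and only if implementation strategy $s$ of node $i$ is used. The linear program also has the constraint that each node should only be implemented one way (5). The routing delay is related to the physical placement on the device, and is given by (6)

$$r_{ij} = \alpha \left| (x_i + w_i) - (x_j + w_j) \right| + \beta \left| y_i - y_j \right| \tag{6}$$

$$u_{ij} \ge (x_i + w_i) - (x_j + w_j), \qquad u_{ij} \ge (x_j + w_j) - (x_i + w_i) \tag{7}$$

$$v_{ij} \ge y_i - y_j, \qquad v_{ij} \ge y_j - y_i \tag{8}$$

$$w_i = \sum_{s \in S_i} w_{is}\,\mu_{is} \tag{9}$$

$$h_i = \sum_{s \in S_i} h_{is}\,\mu_{is} \tag{10}$$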
In the routing delay model, $\alpha$ and $\beta$ are the constants of proportionality, and have been ascertained by evaluating intercomponent delays on a Virtex 2 device. The Manhattan model has been shown to work well for uncongested designs. In (6), the coordinates of the bottom left corner of node $i$ are $(x_i, y_i)$, and the node has width $w_i$ and height $h_i$. Note that the routing delay is taken from the bottom right-hand corner of the output node to the bottom right-hand corner of the input node, hence the width terms in (6). This is because the nodes are generated by Xilinx Coregen, and use this same convention for inputs and outputs.
In the LP, variables $u_{ij}$ and $v_{ij}$ are used for the horizontal and vertical distance between nodes. Because the linear program is a minimization problem, the absolute brackets in (6) can be linearized by introducing the appropriate inequalities (7)-(8); minimization of the objective ensures that (7)-(8) hold at equality. Equations (9) and (10) are used to give the node the appropriate width and height for the chosen implementation strategy; for example, a large multiplier implemented in LUTs is typically larger than its equivalent embedded version.
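The routing delay model can be sketched in a few lines. The field names, the function names, and the numerical constants below are assumptions made for illustration; in particular the two constants of proportionality are placeholders, not the values measured on the Virtex 2 device. The helper `linearized_abs` shows why the pair-of-inequalities linearization is exact under minimization: the smallest value satisfying both inequalities is the absolute value itself.

```python
from dataclasses import dataclass


@dataclass
class Node:
    x: float  # bottom-left corner, horizontal coordinate
    y: float  # bottom-left corner, vertical coordinate
    w: float  # width under the chosen implementation strategy
    h: float  # height under the chosen implementation strategy


def routing_delay(src, dst, k_horizontal, k_vertical):
    """Manhattan delay between the bottom right-hand corners of two nodes."""
    dx = abs((src.x + src.w) - (dst.x + dst.w))
    dy = abs(src.y - dst.y)
    return k_horizontal * dx + k_vertical * dy


def linearized_abs(d):
    """Smallest u with u >= d and u >= -d, i.e. what a minimizing LP selects."""
    return max(d, -d)


# Placeholder geometry and constants (hypothetical values).
src = Node(x=0.0, y=0.0, w=2.0, h=1.0)
dst = Node(x=4.0, y=3.0, w=1.0, h=1.0)
delay = routing_delay(src, dst, k_horizontal=0.5, k_vertical=0.25)
```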
B. Constraining Node Placement
In order to represent the benchmark floorplanning problem, nodes within any one benchmark must be prevented from overlapping with each other. The related constraints can be thought of as a modified version of a 2-D packing problem, where the sizes of the nodes are not known a priori, and the objective function relates to node interconnect and combinatorial delay. This can be visualized as in Fig. 3. In this case, at least one of the terms in (11) must be true

$$x_i \ge x_j + w_j \;\;\text{or}\;\; x_j \ge x_i + w_i \;\;\text{or}\;\; y_i \ge y_j + h_j \;\;\text{or}\;\; y_j \ge y_i + h_i \tag{11}$$

Equation (11) can be represented in a mathematical program by introducing four binary variables $\delta^1_{ij}$, $\delta^2_{ij}$, $\delta^3_{ij}$, and $\delta^4_{ij}$ as in (12)-(17), where (16) ensures that at least one of the terms in (11) is true.

$$x_j + w_j \le x_i + W\,\delta^1_{ij} \tag{12}$$

$$x_i + w_i \le x_j + W\,\delta^2_{ij} \tag{13}$$

$$y_j + h_j \le y_i + H\,\delta^3_{ij} \tag{14}$$

$$y_i + h_i \le y_j + H\,\delta^4_{ij} \tag{15}$$

$$\delta^1_{ij} + \delta^2_{ij} + \delta^3_{ij} + \delta^4_{ij} \le 3 \tag{16}$$

$$\delta^1_{ij}, \delta^2_{ij}, \delta^3_{ij}, \delta^4_{ij} \in \{0, 1\} \tag{17}$$

$W$ is the constraint on the device width and $H$ is the constraint on the device height, so the overall area constraint is $WH$. Thus, the aspect ratio of the device may be varied. In these equations, if $\delta^1_{ij}$ is equal to zero then (12) is satisfied with $x_i \ge x_j + w_j$, i.e., $i$ is to the right of node $j$. Similarly, if $\delta^1_{ij}$ is equal to one, then (12) is trivially satisfied, as long as $i$ is within the device boundary. By summing these variables, (16) ensures that at least one of the four separation conditions is enforced.
Although most computational nodes can be approximated by a single rectangle, some nodes need to be constructed from more than one. As an example, Fig. 4 shows the floorplan of a Xilinx Coregen [25] relationally placed 8-bit multiplier. The floorplan of this multiplier is represented in the ILP by two rectangles, which are placed by setting appropriate integer variables in the ILP formulation.
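Returning to the pairwise non-overlap constraints, the sketch below checks a "big-M" style encoding of the kind described above for single-rectangle nodes. The tuple layout, function names, and device dimensions are assumptions for illustration; a value of 0 for a binary variable enforces the corresponding separation and a value of 1 relaxes it.

```python
def separation_bits(a, b):
    """Four binary values for rectangles a, b given as (x, y, w, h).

    A 0 asserts the relation; a 1 leaves it unenforced.
    Order: a right of b, a left of b, a above b, a below b.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (
        0 if bx + bw <= ax else 1,
        0 if ax + aw <= bx else 1,
        0 if by + bh <= ay else 1,
        0 if ay + ah <= by else 1,
    )


def satisfies_big_m(a, b, bits, width, height):
    """Check the relaxable separation constraints and the sum constraint."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    p, q, r, s = bits
    return (
        bx + bw <= ax + width * p
        and ax + aw <= bx + width * q
        and by + bh <= ay + height * r
        and ay + ah <= by + height * s
        and p + q + r + s <= 3  # at least one separation must be enforced
    )


# Hypothetical rectangles on a 10 x 10 device.
left, right, clash = (0, 0, 2, 2), (2, 0, 2, 2), (1, 1, 2, 2)
ok = satisfies_big_m(left, right, separation_bits(left, right), 10, 10)
bad = satisfies_big_m(left, clash, separation_bits(left, clash), 10, 10)
```

Two abutting rectangles pass the check, while two overlapping rectangles force all four variables to one and so violate the sum constraint.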
C. Representing an Architecture in the ILP
The architectures under exploration are those in which the resources are grouped into columns; nodes can only be implemented in a particular strategy when placed in a given region. This is similar to some of the most recent devices from Xilinx [26] and Altera [4]. It is illustrated in Fig. 1, in which small benchmarks have been fed into the proposed system to produce an architecture, and some corresponding benchmark floorplans. To accurately represent this type of architecture within the ILP, there are constraints to represent the underlying architecture, and constraints that map the computational structures of the individual benchmarks onto the architecture floorplan. In order for the system to automatically generate the architecture floorplan, the architecture template must be specified. The underlying architecture template is column based, hence, it is necessary to introduce constraints to prevent overlap between the columns of the different types of resource. This overlap can be visualized by observing Fig. 5, and is a 1-D version of Fig. 3. To prevent the regions of different resource types overlapping, clearly one region must lie entirely to the left of the other, or entirely to its right. These constraints can be represented in a mathematical program by introducing two binary variables as in (18)-(21). The equations described here have been extended to account for the different region types. In this case, $S$ is the set of implementation strategies, and $\mathcal{R}_s$ is the set of regions of resource type $s$. The left and right boundaries of region $r$ of resource type $s$ are each specified by a variable.
To map the computational nodes of the individual benchmarks onto the architecture, and set the widths of the regions of particular types of logic, constraints (22)-(24) are used. In other words, the right hand boundary of a given region must be greater than the $x$ coordinate of any node plus its width in that particular implementation strategy. Here, $z_{isr}$ is the binary decision variable that takes the value one if and only if node $i$ is implemented in region $r$ of implementation strategy $s$, and $w_{is}$ is the width of node $i$ when implemented in strategy $s$.
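The column-based template can be illustrated with a simple validity check on a candidate architecture. The data layout, the resource type names, and the column coordinates are hypothetical; the sketch only tests the two conditions described above, namely that columns do not overlap and that a node lies inside a column of the resource type matching its implementation strategy.

```python
def regions_disjoint(regions):
    """regions: list of (resource_type, left, right) column regions."""
    spans = sorted((left, right) for _, left, right in regions)
    return all(prev[1] <= nxt[0] for prev, nxt in zip(spans, spans[1:]))


def node_fits(node_x, node_width, strategy, region):
    """True if the node lies inside the region and matches its resource type."""
    resource_type, left, right = region
    return (
        strategy == resource_type
        and left <= node_x
        and node_x + node_width <= right
    )


# Hypothetical three-column architecture.
architecture = [("lut", 0, 6), ("mult", 6, 8), ("ram", 8, 10)]
columns_ok = regions_disjoint(architecture)
mult_ok = node_fits(6, 2, "mult", architecture[1])
mult_spills = node_fits(5, 2, "mult", architecture[1])
```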
The combination of the constraints presented in this section allows the determination of the optimal architecture. The constraints are added in for each benchmark, with the constraints on the regions of the resource types allowing the architecture to be shared between benchmarks. Valuable lower-bounds on the optimum speed for a given area constraint can be achieved once the problem has been cast in linear form, by using linear-program solvers such as ILOG CPLEX [27] .
IV. HEURISTIC DETERMINATION OF RECONFIGURABLE ARCHITECTURES
ILP is a known NP-complete problem [28]. In the context of the ILP outlined in Section III, the large number of binary variables makes a direct solution of the ILP impossible for even relatively small benchmark sets. In order to counter this issue, a heuristic has been developed based on the ILP framework.
The ILP framework allows the development of a heuristic approach in a structured manner. To make this approach scalable, it is first important to observe the growth of the various integer variables in the system. The number of binary variables modeling node-node placement (12)-(17) is $O(n^2)$, where $n$ is the number of nodes in a given benchmark circuit. Similarly, the number of binary variables representing the floorplan of the underlying architecture (18)-(21) is $O(r^2)$, where $r$ is the number of regions. Finally, the number of binary variables representing the floorplanning of nodes onto the architecture (22)-(24) is at most $nr$. A consequence of this is that, while valuable lower-bounds can be achieved, the computation required for exact solution explodes even for benchmarks and architecture templates of modest size.
The heuristic used aims to minimize the effect of the binary variables on runtime by removing the exponential dependence of the ILP on these variables. The approach is an iterative procedure containing two crucial elements. The first element of the heuristic deals specifically with benchmark node-floorplanning, the second element deals with architecture floorplanning and floorplanning of nodes onto the architecture. The entire heuristic procedure is shown in Fig. 6 .
The heuristic technique developed is based around a controlled relaxation of the binary decision variables (12)-(17) to reals in the range $[0, 1]$. Removing the integrality allows fast solution through, for example, the Simplex method [29], and is a commonly employed method in integer programming solvers. The resulting optimum decision variable values may then be interpreted to steer an iterative process in which, after each iteration, some binary variables relating to benchmark floorplanning are fixed to zero or one, allowing a gradual crystallization of the benchmark floorplans. The rounding heuristic is detailed in Section IV-B.
In order to determine the architecture floorplan, and mapping of nodes onto the architecture, each iteration of the procedure is broken into a number of steps. First, an ILP is run to minimize the relative clock period with no constraints on node locations other than the relaxed placement variables. This run of the ILP is fast, as the binary variables introduced in (17) are not present; the only binary variables present are those defining the module selection, i.e., whether a component should be constructed from LUT-based logic or embedded components. A clustering heuristic, as described in Section IV-A is then applied in order to partition the device into column-based regions of each resource type and group nodes from all benchmarks into the appropriate regions. Once the regions are assigned, the clock period is minimized with constraints on the regions (see Fig. 1 ). The sum of the relaxed decision variables is then minimized to reduce the binary decision variables (12) - (17) to their minimum value, as the smallest of these variables implies the least overlap (see Section IV-B for more detail). Finally, the relaxed variable to round is determined for this iteration, thus, gradually avoiding overlap between computational nodes. In each run of the linear program, the binary decision variables (12)-(17) remain relaxed to real values and are only fixed to integer values in the rounding phase.
A. Achieving a Scalable Runtime Using Clustering
One particular aspect of the proposed heuristic, introduced in order to guarantee scalable runtime, is the introduction of the clustering phase. As the number of regions allowed on the device is increased, so is the number of binary variables related to floorplanning of regions, (18) - (21), and binary variables related to the placement of nodes within regions (23)- (24). If these variables are left to the ILP solving software to determine, a single run of the partially relaxed ILP can take over one hour, and given that the entire procedure takes over 500 iterations for the benchmark set used in our experiments, such an approach is not appropriate. Hence, a phase has been introduced in the optimization procedure, used to determine the locations of the regions, as well as the assignments of nodes to regions.
The clustering algorithm is based on the well known $k$-means algorithm [30]. The modified algorithm is shown in Fig. 7. In this instance, it is used to choose where regions of different types should be placed in relation to one another, and in which of the regions nodes should be implemented, in order to give a suitable architecture for the floorplan constraints in the given iteration of the overall procedure.
The algorithm developed begins similarly to the $k$-means clustering procedure. For each resource type, an arbitrary starting location for the center of each region is defined. Nodes are assigned one-by-one to their closest feasible region, with the order in which nodes are placed determined by the penalty of the node. The penalty is a value that defines how much the floorplan is affected if a node is not assigned to its closest region and is calculated as follows. The difference between the horizontal coordinate of the node and the location of the closest region that it can be feasibly placed within is calculated, as is the difference between the horizontal coordinate of the node and its next closest feasible region. The penalty is then calculated as the difference between these values. This is illustrated in Fig. 8. Nodes that are close to one region but far from their next closest region are placed first. If a node can only be placed in one region it is given a large penalty, so that it is assigned to a region immediately. The penalties are recalculated after each node has been assigned to a region, as floorplanning a node into a particular region adds placement constraints on other nodes.
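The penalty calculation can be sketched as below. The function names, the node coordinates, and the region centers are hypothetical, and the sketch uses infinity as the "large penalty" for a node with a single feasible region. It orders nodes once; it does not model the recalculation of penalties after each assignment, which in the procedure described above depends on the placement constraints added by earlier assignments.

```python
import math


def penalty(node_x, feasible_centers):
    """Distance to the next closest feasible region minus distance to the closest."""
    if not feasible_centers:
        raise ValueError("node has no feasible region")
    distances = sorted(abs(node_x - c) for c in feasible_centers)
    if len(distances) == 1:
        return math.inf  # only one choice: assign immediately
    return distances[1] - distances[0]


def assignment_order(nodes):
    """nodes: name -> (horizontal coordinate, list of feasible region centers)."""
    return sorted(nodes, key=lambda n: penalty(*nodes[n]), reverse=True)


# Hypothetical nodes and region centers.
nodes = {
    "n1": (3.0, [2.0, 10.0]),  # close to one region, far from the other
    "n2": (6.0, [2.0, 10.0]),  # equidistant, so the choice matters least
    "n3": (9.0, [10.0]),       # only one feasible region
}
order = assignment_order(nodes)
```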
After all nodes have been clustered, each region has a new center assigned that corresponds to the mean of the centers of all nodes that have been assigned to that cluster. The nodes are then reclustered using the new centers.
The modification of the well known $k$-means heuristic is important due to the additional constraints introduced in this particular problem. The importance arises due to constraints on the relative placement of the node locations. After each iteration of the overall procedure in Fig. 6, a variable representing the relative placement between two nodes is fixed, hence enforcing constraints on the floorplan of those nodes. The introduction of these constraints is illustrated graphically in Fig. 9. Fig. 9 shows how constraints on node locations can be thought of as edges in a graph, with nodes in the graph representing the computational nodes and edges representing horizontal constraints; a directed edge between two nodes means that one of them must be to the right of the other. In a standard $k$-means problem there are no constraints on the node locations: the introduction of these constraints means that the clustering of each node has to be verified. This is done by also including the constraint edges implied by a particular clustering. A feasible clustering is one for which there are no cycles in the graph. Hence, to determine which region a node can be feasibly placed within, a constraint graph must be constructed for each region placement of each node and a cyclicity check must be performed.
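The cyclicity check can be sketched with a standard topological sort. The edge direction used here (an edge from `a` to `b` meaning that `b` must be to the right of `a`) and the node names are assumptions made for the example; the check itself is independent of the convention, since a cycle is infeasible either way.

```python
from graphlib import CycleError, TopologicalSorter


def is_feasible(constraint_edges):
    """True if the horizontal constraint graph contains no cycle.

    constraint_edges: iterable of (a, b), read as "b must be to the right of a".
    """
    preds = {}
    for a, b in constraint_edges:
        preds.setdefault(a, set())
        preds.setdefault(b, set()).add(a)
    try:
        # prepare() raises CycleError when the graph has a cycle.
        TopologicalSorter(preds).prepare()
    except CycleError:
        return False
    return True


# Hypothetical constraints: fixed relative placements plus those implied
# by a candidate clustering.
fixed = [("a", "b"), ("b", "c")]
feasible = is_feasible(fixed)
infeasible = is_feasible(fixed + [("c", "a")])
```

A candidate region for a node would be rejected whenever adding its implied edges makes the second kind of result occur.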
B. Determination of Architectures Using Variable Relaxation
In order to set the placements of benchmark nodes relative to one another, a heuristic for setting the associated decision variables was developed. In terms of the overall heuristic, this part of the algorithm is denoted by the box entitled "Choose variable to set" in Fig. 6 . This heuristic has been developed to ensure an architecture floorplan falls out of the overall heuristic procedure.
The critical feature of this heuristic approach is that the decision variables related to relative placement of nodes are relaxed. This means that the only binary decision variables in each ILP run are those that determine the implementation strategy of each node. After the post-clustering stage, and after the perimeter of the device has been minimized, the relaxed placement variables (12)-(16) do not appear in the objective function and so take arbitrary values among the values that satisfy the constraints. For example, if (13)-(16) are satisfied with one set of values for the four variables, they will in general also be satisfied with larger values that remain within the limit imposed by (16). Thus, in order to make placement decisions on the basis of these values, it is necessary to perform a round of optimization to reduce them to the minimum feasible values. The linear program is thus rerun with the relative clock period fixed to the minimum, and with the sum of all decision variables used as the objective to be minimized.
After minimizing the sum of the decision variables, each of the placement variables is scaled according to how much free space remains under the area constraint. The term free space refers to the difference between the allowable space and the used space in each dimension, as defined by the width and height constraints. The placement variables related to the x-direction are divided by the free space in the x-direction, and similarly the y-direction variables are divided by the free space in the y-direction. After scaling the variables, a decision is made as to which of the relaxed variables to examine.
In deciding which variables to round, each scaled set of variables (12)-(16) is examined. In order to make critical decisions first, the minimum of the four variables in each set is taken, and the set whose minimum is the largest over all sets is chosen to be examined. Thus, the decision of which set of four variables to examine becomes as in (25), in which the used space in the x and y directions is determined by examining the solution of the perimeter minimization phase. In scaling the variables this way, the dimensions of the device can be accounted for, and the area efficiency of the heuristic can be improved.

The particular variable, out of the four, that is rounded to zero is determined by a method that attempts to minimize area while maintaining clock period. The method involves examining the amount of space the device consumes in both the x and y dimensions and setting variables as follows.
The decision of whether to place the nodes side-by-side or above/below one another is based on how much space is used in each direction. The used space is calculated by summing the widths of all nodes that have overlapping vertical coordinates, and by summing the heights of all nodes that have overlapping horizontal coordinates (nodes may, at this stage, overlap due to the relaxation of the binary decision variables). This is illustrated in Fig. 10. In the example, some of the nodes have identical x and y coordinates. The used space in the horizontal direction is the sum of the widths of the five nodes whose vertical coordinates overlap, and the used space in the vertical direction is the sum of the heights of the three nodes whose horizontal coordinates overlap. The remaining node has no overlapping coordinates and is displayed for illustrative purposes only.
The final decision of which variable to round is made according to which direction has less used space. Thus, if more space is used in the x direction, the smaller of the two variables associated with the y direction is chosen. Similarly, if more space is used in the y direction, the smaller of the two variables associated with the x direction is chosen. The chosen variable is rounded down to zero, as this forces the corresponding constraint to hold.
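The selection procedure of this subsection (scale by free space, take the set with the largest minimum, then round in the direction with less used space) can be sketched as follows. This is an illustrative reading only: the data layout, the function name choose_variable, the assumption that each node pair carries two x-direction and two y-direction variables, and the tie-break when both directions use equal space are assumptions, not details taken from the paper.

```python
def choose_variable(relaxed, used_x, used_y, max_w, max_h):
    """Pick the relaxed placement variable to round down to zero.

    relaxed maps a node pair to {"x": [v1, v2], "y": [v3, v4]}, the
    minimized values of its four relaxed variables (hypothetical layout).
    Free space is assumed to be strictly positive in both directions.
    """
    free_x = max_w - used_x
    free_y = max_h - used_y

    # Scale each variable by the free space in its own direction.
    scaled = {
        pair: {
            "x": [v / free_x for v in values["x"]],
            "y": [v / free_y for v in values["y"]],
        }
        for pair, values in relaxed.items()
    }

    # Max-min rule: take the minimum of the four scaled variables in each
    # set, then select the set whose minimum is largest.
    pair = max(scaled, key=lambda p: min(scaled[p]["x"] + scaled[p]["y"]))

    # Round in the direction with less used space (ties go to x here,
    # which is an arbitrary choice made for this sketch).
    axis = "y" if used_x > used_y else "x"
    index = min(range(2), key=lambda k: scaled[pair][axis][k])
    return pair, axis, index


# Hypothetical values for two node pairs on a 10 x 8 device.
relaxed = {
    ("a", "b"): {"x": [0.2, 0.9], "y": [0.4, 0.6]},
    ("a", "c"): {"x": [0.5, 0.7], "y": [0.8, 0.6]},
}
choice = choose_variable(relaxed, used_x=6, used_y=4, max_w=10, max_h=8)
```

With these made-up numbers the pair (a, c) has the larger minimum, and because more space is used horizontally, the smaller of its two y-direction variables is the one selected for rounding.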
Once the appropriate values are fixed to zero, it is also possible to set any related variables to one. This is due to the constraints specified by (21) and (17): if one of the variables in these equations is zero, then the others may be one. If the variable in (13) is fixed to zero, one node of the pair is thereby placed to the right of the other. The complementary horizontal variable must then be identically one, so that the corresponding constraint only requires the node not to exceed the right-hand perimeter of the silicon area. Similarly, as it no longer matters which of the two nodes is above the other, both vertical variables can be set to one (neither node may exceed the top or bottom perimeter of the device).
In each iteration of the overall heuristic only one set of benchmark floorplanning variables is set. This allows the device architecture to be crystallized out gradually, and allows the architecture floorplan to be dependent on the floorplan of all benchmarks. Due to implied constraints, such as those imposed by the transitivity of the relative node placements, it is not necessary to fix all sets of benchmark floorplanning variables. Some experimental observations in respect of this are given in Section V-B.
The proposed heuristic, while complex, makes effective use of the information resulting from partial linear program relaxations. As a result, high-quality upper-bounds are achieved, as shown in Section V by their closeness to the known lower-bounds on speedup obtained through the ILP approach described in Section III.
V. RESULTS
The ILP and heuristic frameworks were used to explore various aspects of the problem space. In order to evaluate the performance of the heuristic approach, we compare it with the optimal floorplanning approach. The heuristic framework is then used to perform an architectural study of reconfigurable hardware.
A. Scalability of Optimal and Heuristic Approaches
In order to draw a comparison between optimal and heuristic approaches within feasible run-time constraints, in this section, we use the ILP and the heuristic to perform floorplanning and technology mapping for a single benchmark. In these experiments, the target architecture is left unconstrained, hence, the comparison is essentially that of single benchmark ILP technology mapping and floorplanning versus a single benchmark version of the first phase of Fig. 6 .
In order to study the scalability of both approaches, benchmarks of increasing complexity were chosen. A set of Horner scheme polynomial evaluators was used for this purpose. The number of nodes and edges of these benchmarks varies linearly with polynomial order, with the smallest having 7 nodes and 12 edges, and the largest 61 nodes and 102 edges. The number of integer variables varies quadratically (approximately 400 and 30 000 integer variables for the smallest and largest problems, respectively).
The results of the study of the scalability of this problem are shown in Figs. 11 and 12 . The platform used was an Intel Pentium 4 CPU running Fedora (Linux) at 2.8 GHz with GB of memory. The commercial optimization suite CPLEX [27] was used. The runtime of the optimal approach has been averaged by using different settings in CPLEX: the runtime of these methods was seen to vary widely dependent on the problem size, and no particular setting could be deemed better than others. For the five largest circuits, ILP floorplanning took days to solve, whereas the heuristic method always took less than one hour.
The quality of the heuristic compared to the optimal approach is shown in Fig. 12. This is a comparison of the clock periods, which can be read from the corresponding variable in the solution of the ILP. From Fig. 12, it is evident that, in purely floorplanning terms, the heuristic performs only slightly worse than the optimal approach, being no more than 1% away from the optimal clock period.
B. Architectural Exploration
The focus of this work is DSP applications; thus, to perform our exploration of the design space, a suitable set of DSP benchmarks was chosen for evaluating our heuristic approach. The benchmarks include: an LMS adaptive filter, as used in [31]; a multi-channel IIR filter and a Costas Loop, as supplied with Xilinx System Generator [23]; a programmable 2-D 5 × 5 image convolution on a raster-scanned image, based on that supplied with System Generator; an ADPCM encoder [32]; and a Horner scheme polynomial evaluator. All benchmarks are of a similar hardware complexity when implemented in their time-optimal strategy; area figures relative to the size of a slice are given in Table II.

The first results presented evaluate the efficiency of the floorplanning and technology mapping heuristic for the different types of benchmark, in a similar fashion to Section V-A. Table II shows the clock periods of each benchmark when there are no constraints on the architecture [see (1)]. The results presented for the optimal case show the best solution obtained by the ILP solver, which is a known feasible solution, and therefore an upper bound on the optimal solution. The ILP solver also provides a lower-bound on the solution in cases where the optimum cannot be found within the defined time frame; this bound is also given in Table II. The results show that the heuristic presented deviates from the best known solution by 5.8% at worst, which is in contrast to the figure of 1% in Section V-A due to the different characteristics of the benchmarks.
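The two quantities compared in this discussion, the deviation of the heuristic from the best known solution and the gap between the solver's upper and lower bounds, are simple ratios. The sketch below shows one common way of computing them; the function names and the clock-period values are hypothetical and are not taken from Table II.

```python
def deviation_percent(heuristic_period, best_known_period):
    """Percentage by which the heuristic clock period exceeds the best known."""
    return 100.0 * (heuristic_period - best_known_period) / best_known_period


def bound_gap_percent(upper_bound, lower_bound):
    """Relative gap between the best feasible solution (upper bound) and the
    solver's lower bound, expressed as a percentage of the upper bound."""
    return 100.0 * (upper_bound - lower_bound) / upper_bound


# Hypothetical clock periods in arbitrary units, for illustration only.
dev = deviation_percent(heuristic_period=10.5, best_known_period=10.0)
gap = bound_gap_percent(upper_bound=10.0, lower_bound=9.0)
```

A gap of zero means the best feasible solution has been proven optimal; a nonzero gap only brackets the true optimum between the two bounds.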
The second set of results was obtained using the following methodology. The heuristic was used to generate an architecture for all benchmarks. Each benchmark was then individually remapped onto the generated architecture using the full ILP, where the parameters of the generated device are specified as constraints in the ILP. This approach provides a higher quality solution than the mapping used during the heuristic architecture creation procedure. Results were then taken for a mapping of each circuit onto a Xilinx XC2V2000 device. Similarly, the parameters of the commercial device are specified as constraints in the ILP. The device generated by our procedure was given the same overall area as the Xilinx device, with the areas devoted to each component type determined by the heuristic procedure. However, the device is constrained to have the same number of regions of each embedded resource type as the Xilinx device.
On average the heuristic procedure required around 48 hours to complete. This took 700 iterations of the procedure in Fig. 6 , meaning that on average less than 10% of the variables related to floorplanning needed to be set. This is a consequence of implied constraints, such as the transitivity of the relative placement of nodes.
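The effect of transitivity mentioned above can be illustrated with a small closure computation: once a few relative placements are fixed, others follow without being set explicitly. This is a minimal sketch under the assumption that a pair (u, v) denotes "u is to the left of v"; the function name and node labels are hypothetical.

```python
from itertools import product


def transitive_closure(relations):
    """Warshall-style closure of a set of (left, right) placement pairs."""
    closure = set(relations)
    nodes = {n for pair in closure for n in pair}
    # product() varies its first element slowest, so k acts as the outer
    # (intermediate-node) loop, as Warshall's algorithm requires.
    for k, i, j in product(nodes, repeat=3):
        if (i, k) in closure and (k, j) in closure:
            closure.add((i, j))
    return closure


# Three explicitly fixed relations on four hypothetical nodes.
fixed = {("a", "b"), ("b", "c"), ("c", "d")}
implied = transitive_closure(fixed) - fixed
```

In this toy case, fixing three of the six pairwise relations determines the remaining three, which is the same mechanism that lets the heuristic stop after setting only a fraction of the floorplanning variables.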
In order to quantify the benefit that embedded multipliers provide, benchmarks were mapped to a device created by removing the multiplier columns, leaving only 18-kb embedded memory blocks and lookup logic. Similarly, a device with no embedded memory (but with multipliers) was examined. Finally, a comparison to homogeneous fine-grain fabrics was performed by removing the columns of both multipliers and embedded memory.
The results of the architecture evaluations are shown in Table III. Table III entries showing "no solution" mean that the ILP solver was unable to find a solution given the time constraint of the software.
The most striking feature of the results in Table III is that the architecture generated by our optimized automatic generator results in a 58% speed improvement over architectures without embedded multipliers, but only a 1% speed improvement over the Xilinx design. It can be concluded that the Xilinx design is close to optimum for this benchmark set; further improvements can only be made by using different types of embedded components, rather than different orderings or arrangements of the existing ones.
The largest gains are seen in the polynomial evaluator benchmark, which has a large number of multipliers on the critical path. This means that there is significant room for improvement over devices with no multipliers. When comparing the Xilinx device to the one created by the presented heuristics, the polynomial evaluator is also the benchmark with the largest improvements. The lower bound of the clock period for the Xilinx architecture is greater than the best solution of the heuristically determined architecture, meaning that although the true optimum has not been determined, it is certain that the generated architecture marginally outperforms the Xilinx architecture. In this case, the gains are due to the heuristic procedure minimizing routing delay by widening certain regions. It is particularly inefficient to map this benchmark onto the Xilinx device, because each region of multipliers and memory is only one component wide and signals must therefore be routed across much of the chip; our technique automatically redesigns the device to have wider columns of these components.
In terms of the actual device construction, the generated hardware was slightly different from Xilinx's. For similar sized devices, the Xilinx Virtex 2 devotes approximately 62%, 13%, and 25% of its functional area to slices, multipliers, and embedded memory, respectively, whereas the devices generated by our tool consume 44%, 31%, and 29%. When observing the mapping of the nodes in each benchmark, it was noticed that almost all nodes capable of being implemented by a hardwired component were mapped to the embedded version. The area devoted to each component reflects this, and is hence closer to the proportions of the time-optimal component selection exhibited by the benchmarks used.
Interestingly, the introduction of memory components has little effect on circuit clock period, despite embedded memories having significantly lower latency than those built from slices. This observation can be explained by the fact that memory is rarely on the critical path of these benchmark circuits; our optimization system detects this and takes it into account when performing technology mapping. However, the density advantages of embedded components can be seen by observing the area figures given in Table II. The biggest area savings are in the image convolution benchmark, which requires nine embedded memories, four of which are RAM. When built from slices, these require over 1000 slices each, consuming over 40 times the area of a single 18-kb embedded memory.
VI. CONCLUSION
This paper has described a novel heuristic approach to the combined reconfigurable architecture design, floorplanning, and technology mapping problem. As a result, we have been able to present formal upper and lower bounds on the speed attainable by reconfigurable architectures based on slices, 18 × 18 multipliers, and 18-kb embedded RAM blocks. The proposed methodology has been able to automatically design an architecture capable of supporting several input benchmark circuits, and has quantified, for the first time, the optimal speed and logic density achievable on the basis of these embedded components. These results indicate that, while a significant system-level speedup is attained by incorporating embedded multipliers for DSP benchmarks, further improvements in speed are likely to arise only as a result of the design of new embedded components. The proposed framework thus allows new embedded components to be tested in order to influence future technology generations and, in light of this, we intend to address the issue of alternative embedded cores in future work. The focus of our results has been on DSP applications, and future work will also extend our tool to target different application domains.
