Field-programmable systems with multiple FPGAs on a PCB or an MCM are being used by system designers when a single FPGA is not sufficient. We address the problem of partitioning a large technology mapped FPGA circuit onto multiple FPGA devices of a specific target technology. The physical characteristics of the multiple FPGA system (MFS) pose additional constraints to the circuit partitioning algorithms: the capacity of each FPGA, the timing constraints, the number of I/Os per FPGA, and the pre-designed interconnection patterns of each FPGA and the package. Existing partitioning techniques which minimize just the cut sizes of partitions fail to satisfy the above challenges. We therefore present a timing driven N-way partitioning algorithm based on simulated annealing for technology-mapped FPGA circuits. The signal path delays are estimated during partitioning using a timing model specific to a multiple FPGA architecture. The model combines all possible delay factors in a system with multiple FPGA chips of a target technology. Furthermore, we have incorporated a new dynamic net-weighting scheme to minimize the number of pin-outs for each chip. Finally, we have developed a graph-based global router for pin assignment which can handle the pre-routed connections of our MFS structure. In order to reduce the time spent in the simulated annealing phase of the partitioner, clusters of circuit components are identified by a new linear-time bottom-up clustering algorithm. The annealing-based N-way partitioner executes four times faster using the clusters as opposed to a flat netlist with improved partitioning results. For several industrial circuits, our approach outperforms the recursive min-cut bi-partitioning algorithm by 35% in terms of nets cut. Our approach also outperforms an industrial FPGA partitioner by 73% on average in terms of unroutable nets. 
Using the performance optimization capabilities in our approach we have successfully partitioned the MCNC benchmarks satisfying the critical path constraints and achieving a significant reduction in the longest path delay. An average reduction of 17% in the longest path delay was achieved at the cost of 5% in total wire length.
INTRODUCTION
Field Programmable Gate Arrays (FPGAs) are becoming a mainstream technology in board, system and application specific integrated circuit (ASIC) design processes. However, design complexity will continue to increase more rapidly than the availability of larger and faster devices. System-level ASIC designers are turning to FPGAs for design verification to take advantage of their low cost and fast prototyping.
Large and complex designs can require multiple iterations in order to achieve a successful design implementation. If the automatic design tools cannot provide a feasible solution, the designer is forced to obtain expert level architectural knowledge to support the manual intervention required to complete the design. Current FPGA architectures can handle a maximum of only 6000 to 9000 gates compared to ASIC devices which offer hundreds of thousands. Designers utilize multiple FPGAs when a single FPGA is not sufficient for a design implementation.
Multiple re-programmable FPGAs have been configured on multichip modules (Figure 1) and on boards (Figure 2). A large design can be implemented on a PCB [1] using multiple chips of a particular target FPGA architecture such as Xilinx [2], Actel, or Altera. New field programmable architectures using multiple FPGAs on multichip modules have emerged to meet the utilization requirements of prototyping large designs. A multiple FPGA system (MFS) can be modeled as a collection of FPGA chips configured on a single board or a package to realize a design.
In order to effectively use MFSs and benefit from shorter time-to-market, users require an automatic method to partition a large design among multiple FPGAs. This process can be viewed either as a divide-and-conquer process that speeds up the placement and routing phases or as a conventional top-down design process. Each chip in this N-chip (multiple-chip) combination is considered a partition.
Constraints and Objectives of the MFS Partitioning System
Any decision made early in the design process will affect the performance of the subsequent design tools. The quality of the partitioning results will influence several aspects of the design implementation:
1) Capacity: The target FPGA architecture used in the MFS has a maximum gate capacity. However, the amount of logic in each chip is limited by the utilization levels that can be handled by the placement and routing tools. Thus, the feasible utilization levels are always some fraction of the maximum gate capacity. The partitioner must ensure that each chip contains a feasibly implementable amount of logic.
2) Congestion in inter-chip communication: The partitioner must minimize the amount of inter-chip communication. The signals external to individual chips must be routed using the limited number of inter-chip connections to produce a feasible partitioning solution. The fixed inter-chip connections of the packaging or the board design restrict the flexibility of routing any signals which are external to a chip. This process can be viewed as pin assignment at the chip level. Any overflow generated during pin assignment will lead to a design which is not implementable.

Partitioning approaches based on Kernighan-Lin [4] with enhancements by Fiduccia and Mattheyses [5] have been reported [6, 7, 8, 9, 10]. Rectilinear [17] from NeoCAD provides an environment to perform timing-driven partitioning over multiple FPGAs.
InCA has an FPGA partitioner named Concept Silicon [18].

Our MFS partitioning system consists of three main phases, as shown in Figure 3. We first reduce the netlist using a new netlist clustering algorithm. The clustered netlist is then partitioned using a simulated-annealing based partitioning algorithm. Finally, chip-level pin assignment is performed using a graph-based global router.
In section 4, we describe the simulated-annealing based partitioning algorithm. The pin assignment stage is described in section 5. The key issues which affect the performance of the simulated annealing algorithm in phase 2 were strongly considered in the design of our new clustering algorithm, which is used in the first phase. We thus present the clustering algorithm in section 6. Section 7 is devoted to the results and the conclusion is the subject of section 8.
The net weighting scheme.
point during the annealing, the new weight of a moved net is re-evaluated and used to compute the weighted wire length. W is given by:

$$W = \sum_{n=1} \bigl(S_x(n) + S_y(n)\bigr)\, w_n \tag{2}$$

where the sum runs over all nets, $S_x(n)$ and $S_y(n)$ are the width and height of the minimum bounding rectangle of net n, respectively, and $w_n$ is the weight of net n. In order to minimize the CPU time necessary to update W for large nets, we use an incremental net-span updating scheme. The incremental scheme devised for the partitioner takes advantage of the gridded nature of the bin structure used and thus is simpler and faster than previously reported methods [20, 23]. Since detailed placement is going to follow this partitioning stage, the clusters of components are assigned to the center of the bins and the pins of a component are also taken to be at the center of the component. Unlike placement, where we need the exact location of a pin for precise wire length calculations, for each net the partitioner only needs to store the number of pins of the net at each grid line. Thus, updating the pin configuration of a net after a move is easy. The global scale of the grid lines is stored in a lookup table. The x and y spans of a net are calculated using the maximum and minimum of the active grid lines of the net (i.e., grid lines which contain one or more pins of this net).
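The incremental net-span bookkeeping described above can be sketched as follows. This is a minimal illustration, not the implementation used in the partitioner: the class name `NetSpan`, the helper `weighted_wire_length`, and the representation of a pin as a `(column, row)` bin index are all assumptions made for the example.

```python
from collections import Counter


class NetSpan:
    """Per-net pin counts on each grid line (hypothetical helper)."""

    def __init__(self, pins, grid_x, grid_y):
        # pins: list of (col, row) bin indices, one per pin of the net.
        # grid_x / grid_y: lookup tables from grid-line index to coordinate.
        self.grid_x, self.grid_y = grid_x, grid_y
        self.cols = Counter(c for c, _ in pins)
        self.rows = Counter(r for _, r in pins)

    def move_pin(self, old, new):
        """Update the counts after one pin moves from bin `old` to bin `new`."""
        for counts, a, b in ((self.cols, old[0], new[0]),
                             (self.rows, old[1], new[1])):
            counts[a] -= 1
            if counts[a] == 0:
                del counts[a]  # this grid line is no longer active
            counts[b] += 1

    def span(self):
        """Width plus height of the bounding box over the active grid lines."""
        sx = self.grid_x[max(self.cols)] - self.grid_x[min(self.cols)]
        sy = self.grid_y[max(self.rows)] - self.grid_y[min(self.rows)]
        return sx + sy


def weighted_wire_length(nets, weights):
    """Weighted sum of net spans, in the spirit of equation (2)."""
    return sum(w * net.span() for net, w in zip(nets, weights))
```

Only the pin counts of the moved pin's old and new grid lines change on a move, so the update cost does not depend on the number of pins in the net; finding the extreme active lines here scans the active keys, which a production version might track more cleverly.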
Timing Penalty
The timing penalty in the cost function is calculated based on the slacks generated in the critical paths of the circuit by partitioning. A critical path may consist of several nets. The timing penalty is minimized dynamically during partitioning. In this section, we will first describe the propagation delay model we have designed for a timing path over multiple FPGAs. Based on this model, we will define the timing penalty.
The total delay on a path p over multiple FPGAs is the sum of the delay generated in the configurable logic blocks (CLBs) inside each chip, $T_L(p)$, and the total interconnect delay, $T_R(p)$.

$$T_{pd}(p) = T_L(p) + T_R(p) \tag{3}$$

$T_R(p)$ is the sum of all the constituent net routing delays, $T_R(n)$, due to the intra-chip and inter-chip connections of the net.

$$T_R(p) = \sum_{n \in p} T_R(n) \tag{4}$$

[26]. We will follow their assumption that for $1\,\mu m \le \lambda \le 1.75\,\mu m$, the gate delay is approximately proportional to $\lambda$. Given a $T_{CLB}$ of a CLB for a particular $\lambda$, we accordingly scale $T_{CLB}$ for the same CLB design for a different $\lambda$ in that range.
Routing Delay: The total routing delay of a net n, $T_R(n)$, is the sum of the delay due to the intra-chip connections, $T_S(n)$, and the inter-chip connections, $T_M(n)$, of the net.

$$T_R(n) = T_S(n) + T_M(n) \tag{6}$$

[27]. (For the anti-fuse technology and single segment routing, the number of switches between two logic blocks was taken to be 2 and the RC model was formed accordingly in [27].) The total switching delay including the parasitics seen by the wire segments used by the net can be modeled as a lumped RC:

$$T_S(n) = R_{sw} C_{sw} = R_{sw}(C_g + C_p) \tag{7}$$

$R_{sw}$ is the equivalent drive resistance or the switching ON resistance and $C_{sw}$ is the total load capacitance seen by the driver.
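As a rough illustration of how the delay terms compose, the sketch below evaluates the lumped RC switching delay of (7), the net routing delay of (6), and the path delay as the logic delay plus the sum of net routing delays, following (3) and (4). The function names and the numeric values are hypothetical, and the inter-chip delay is simply passed in as a number because its model is not reproduced here.

```python
def switch_delay(r_sw, c_g, c_p):
    """Lumped RC delay of eq. (7): R_sw * (C_g + C_p)."""
    return r_sw * (c_g + c_p)


def net_routing_delay(t_s, t_m):
    """Eq. (6): intra-chip delay plus inter-chip delay of one net."""
    return t_s + t_m


def path_delay(t_logic, net_delays):
    """Eqs. (3)-(4): CLB delay plus the routing delays of the nets on the path."""
    return t_logic + sum(net_delays)


# Illustrative values only (kOhm * pF = ns); they are not taken from the paper.
t_s = switch_delay(r_sw=2.0, c_g=0.5, c_p=1.5)      # 4.0 ns
t_r = net_routing_delay(t_s, t_m=3.0)               # 7.0 ns
t_pd = path_delay(t_logic=10.0, net_delays=[t_r, 4.0])
```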
The total path delay is:
In (12), we can precompute the expressions which are independent of wire length and number of inter-chip connections as: 
FIGURE 8 Inter-chip connection delay.
As reported in [23], the total timing penalty is computed as the sum of the penalties over all critical paths specified. For each critical timing path, the user supplies an upper bound $T_{ub}(p)$ and a lower bound $T_{lb}(p)$ on the required arrival times. The penalty assigned for a path p is the amount by which the delay deviates from satisfying the bounds.

$$P(p) = \begin{cases} T_{pd}(p) - T_{ub}(p) & \text{if } T_{pd}(p) > T_{ub}(p) \\ T_{lb}(p) - T_{pd}(p) & \text{if } T_{pd}(p) < T_{lb}(p) \\ 0 & \text{otherwise} \end{cases}$$
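A minimal sketch of this penalty, assuming each critical path is given as a `(delay, lower_bound, upper_bound)` tuple, is shown below. The function names are hypothetical, and the lower-bound branch is read from the sentence describing the penalty as the deviation from the bounds rather than from a surviving formula.

```python
def path_penalty(t_pd, t_lb, t_ub):
    """Amount by which a path delay violates its [t_lb, t_ub] window."""
    if t_pd > t_ub:
        return t_pd - t_ub   # path is too slow
    if t_pd < t_lb:
        return t_lb - t_pd   # path is too fast (assumed symmetric case)
    return 0.0               # within bounds: no penalty


def total_timing_penalty(paths):
    """Sum of the per-path penalties over all specified critical paths."""
    return sum(path_penalty(t_pd, t_lb, t_ub) for t_pd, t_lb, t_ub in paths)
```

For example, with bounds of 5 and 10 on every path, delays of 12, 3, and 7 contribute 2, 2, and 0 respectively.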
The total timing penalty is the sum of penalties over all the critical paths specified.
Pt , P(P) Figure 10 . Initially, the shortest path routes are found for all external signals. Based on these routes, the total overflow is calculated. The function update_ edge_weight is used to update the weights of the edges with overflow. The routes which use the edges with overflow are discarded and the corresponding signals are re-routed using an iterative rip-up and reroute scheme. The pseudo-code for generating a route for a net is shown in Figure 11 .
Pin assignment is performed based on the final routes given by the global router for the external signals, as shown in Figure 12b.

    for all e ∈ E
        Update_edge_weight(e);
    for all e ∈ E
    for all n
        r = generate_a_route(n, Max_improve);
        R_n = R_n ∪ r;  /* add to the set of routes for net n */

FIGURE 10 Global routing algorithm.

    Set of pins for net n: P(n)
    Set of trees for net n: T(n)
    Subroutine Generate_a_route(net n, Max_improve)
        Make each node corresponding to pin ∈ P(n) a tree and form T(n);
        while (|T(n)| >

Three prominent clustering metrics were proposed recently. Intuitively, the degree/separation metric used in the random-walk based clustering algorithm, RW-ST [13], strengthens the objective of seeking a minimum cut between clusters. However, the RW-ST heuristic has an overall worst-case time complexity of $O(n^3)$, making it inefficient for large circuits. The shortest path clustering algorithm [12], based on the uniform multi-commodity flow problem, was evaluated using the ratio cut metric. In most cases the clusters produced by this method varied widely in size. Clusters having such properties are not at all suitable for N-way partitioning, as discussed previously.
A third clustering algorithm based on (k,1) connectivity was reported in [24]. However, since the appropriate value of k was difficult to predict without experimentation on a particular circuit, we could not use this algorithm to generate clusters for the chip partitioner over a wide range of circuits.
A Two-Phased Natural and Adaptive Approach for Clustering
Our clustering approach is a bottom-up hierarchical technique based on an agglomerative method of clustering. At each level of hierarchy, we cluster nodes which qualify to merge with each other. This results in a netlist of reduced complexity. Our experiments on circuits spanning a wide range of sizes have shown that the clustering process needs to be adaptive to the current state of the netlist in order to obtain the best N-way partitioning results. Thus we have devised a two-phased natural and adaptive clustering technique.
The first method finds the natural clusters of the circuit. The criterion for merging nodes at each level of hierarchy is based on the net connectivity and density of the weighted netlist graph. The second and more adaptive strategy is based on a heuristic technique which aims to refine the netlist obtained from the first method for its most effective application in the chip partitioner.
Since we are interested in circuit applications, we wish to minimize the number of inter-cluster nets. Thus, we considered the k-edge-connectivity of a graph as opposed to k-vertex-connectivity of a graph. Our initial concept of a cluster was derived from the notion of (k, 1) connectivity [24] . However, we adopted a more general model of a graph representation for a netlist which allowed us to handle multipin nets as easily as 2-pin nets. The accumulative weighted graph is simpler and more accurately represents the circuit structure than a hypergraph. The time complexity of our basic algorithm is linear with respect to the number of nodes in the accumulative weighted graph.
Accumulative Weighted Graph
The nodes of the graph represent the components of the circuit and an edge between two nodes indicates that a hyper-edge contains these two nodes. For example, each multipin net connecting n components is represented by a complete graph contributing $n(n-1)/2$ edges to the graph. Our ultimate purpose is to use the clusters for distance-based partitioning. In a final placement, the nets with very high fanout naturally span a larger portion of the core and thus have higher wire lengths. Thus, partitioning programs will have more success in minimizing the net lengths of low-fanout nets. In order to differentiate between large and small nets, we have formulated an edge weighting scheme based on the fanout of a net. An edge representing a net with n pins is given a weight of $1/(n-1)$, so an edge from a 2-pin net is given a weight of 1. After assignment of weights, we collapse all the edges between a pair of nodes into one edge. This final edge thus carries an aggregate weight equal to the sum of the weights of the original edges between those nodes. In this way, as shown in Figure 13, we can represent the global nature of the connectivity of the netlist through a simplified graph, since a pair of nodes will have at most one edge between them.

DEFINITION 2: Natural cluster: A group of components such that every two vertices (s, t) are connected by a strong edge, either directly or indirectly through transitive closure.
Edge (1) (2) (3) (4) (5) in Figure 12 is a strong edge.
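The construction of the accumulative weighted graph can be sketched as follows. This is an illustrative reading of the description above, assuming a weight of 1/(n-1) per clique edge of an n-pin net (which gives a weight of 1 for a 2-pin net) and assuming each pin of a net lies on a distinct component; the function name and the edge representation as a sorted pair are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def accumulative_weighted_graph(nets):
    """Collapse a netlist into a simple weighted graph.

    nets: iterable of nets, each an iterable of component names.
    Returns {(u, v): weight} with u < v and at most one edge per node pair.
    """
    graph = defaultdict(float)
    for net in nets:
        comps = sorted(set(net))
        n = len(comps)
        if n < 2:
            continue  # a single-component net contributes no edge
        w = 1.0 / (n - 1)  # high-fanout nets contribute weaker edges
        for u, v in combinations(comps, 2):  # n(n-1)/2 clique edges
            graph[(u, v)] += w  # accumulate parallel edges into one
    return dict(graph)
```

For the three nets `{a, b}`, `{a, b, c}`, and `{b, c}`, the pair `(a, b)` accumulates 1 + 0.5 = 1.5, the pair `(b, c)` also accumulates 1.5, and `(a, c)` receives only 0.5 from the 3-pin net.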
Multilevel Cluster Growth
Natural Clustering Algorithm
The purpose of the first phase of clustering is to extract dense subgraphs (natural clusters) from the weighted graph of the netlist. The initial seed for this agglomerative growth process consists of the individual components in the netlist. We start out with the netlist of the circuit and construct the weighted graph. Nodes which are connected through strong edges are clustered. The Cluster_Natural algorithm is shown in Figure 14. When nodes $v_i$ and $v_j$ are merged using the natural clustering algorithm, the edges which were between them become internal edges of the cluster to which these two nodes belong. These internal edges do not participate in subsequent levels of clustering. A set of small clusters is formed and used in the algorithm Cluster_Adaptive() (Figure 15). A small cluster is defined as a cluster with size less than the average cluster size.
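One level of the natural clustering phase can be sketched with a disjoint-set (union-find) structure, as below. This is not the Cluster_Natural algorithm of Figure 14: the criterion for a strong edge is not reproduced in this section, so the sketch takes it as a caller-supplied predicate `is_strong`, and the function name and the example threshold are purely illustrative assumptions.

```python
def natural_clusters(nodes, edges, is_strong):
    """Group nodes connected, directly or transitively, by strong edges.

    nodes: iterable of node names.
    edges: {(u, v): weight} from the weighted graph.
    is_strong: predicate on an edge weight (placeholder criterion).
    """
    nodes = list(nodes)
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (u, v), w in edges.items():
        if is_strong(w):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv  # merge the two clusters

    clusters = {}
    for v in nodes:
        clusters.setdefault(find(v), set()).add(v)
    return list(clusters.values())
```

With edges `(a, b)` and `(b, c)` of weight 1.5, an edge `(c, d)` of weight 0.5, and a placeholder threshold of 1.0, the nodes `a`, `b`, and `c` form one cluster and `d` remains on its own.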
We tested this two-phased natural and adaptive approach of clustering on several MFS circuits. The nature of the clusters obtained on one such circuit is shown in Figure 16. Although the number of components per cluster has a wide range, the average size (total component area) of the clusters is relatively uniform.
Our algorithm was designed to be an efficient implementation and thus usable for practical applications. Efficient data structures like disjoint-union sets are used to form clusters. The worst case time complexity of our algorithm for multipin nets is $O(nB)$, where B is a constant representing the upper bound on the number of immediate neighbors a cell has in a circuit and n is the number of cells. Due to the accumulative weighted graph, the space requirements for our algorithm are $O(n^2)$ in the worst case. Practical circuits generate much sparser graphs than complete graphs and thus require approximately $O(n)$ space. Also, this space requirement is for the first level of clustering only. As the network is reduced in successive levels, the space requirement reduces accordingly.
RESULTS
Our MFS partitioning system, Tomus, has been developed in C with an X11 graphics interface to provide interactive features. Several industrial circuits were used to test the partitioning system. The circuit descriptions are shown in Table I. Figure 17 shows the partitioning results of the circuit alma for a 2-FPGA MFS. The net connectivity between bins is shown before and after partitioning. This demonstrates the effectiveness of Tomus in minimizing congestion across chip boundaries.

In Table III, out of a total of 20 runs, we picked the run which gives the best final overflow and compared the overflow with the initial shortest routes and the final overflow after overflow minimization. In all but one case, the overflow (unroutable nets) was reduced to zero. For the case with non-zero overflow, the routing was completed by using unassigned system-level I/Os (which are not available for most designs).
Comparison with Mincut
In this section we compare the N-way partitioning results using Tomus with results from recursive bipartitioning using mincut. We have implemented the

We tested the timing-driven capabilities of the partitioner on the MCNC benchmark circuits. Multiple FPGA configurations using Xilinx devices on a PCB were used to obtain these results. For each circuit we first used Tomus to find a partitioning without impos-

We verified that none of the non-included pin pairs gave rise to a critical delay at the conclusion of the partitioning. We extract the current longest path between these pins. These constitute the set of critical paths used in our timing penalty function, and we impose the delay bound on these paths. Because a particular critical path may not always be the critical path for a pair of pins, we update the set of critical paths 150 times (once per iteration) during the course of the annealing-based partitioning.
We compared the results with the timing penalty deactivated versus the results obtained with the timing penalty activated for each circuit. In Table IX we have shown the number of paths which were within specifications and the number of paths which were outside the specifications in both cases. Column 1 shows the number of devices used. Using our timing penalty function, Tomus successfully partitioned these circuits satisfying the timing constraints in all cases. In all the circuits, Tomus achieved a significant reduction in the longest path delay by using the timing penalty function. The average reduction was 17%. These results were obtained at the cost of 17% in nets cut and 5% in wire length on average.

TABLE IX

                     Timing penalty off     Timing penalty on
Circuit   Paths      Within    Outside      Within    Outside
c1355     100        100       0            100       0
c1908     100        53        47           100       0
c2670     100        97        3            100       0
c3540     81         50        31           81        0
c5315     540        32        508          540       0
c7552     450        108       342          450       0
c1196     110        28        82           110       0
c1423     40         24        16           40        0
c1238     67         46        21           67        0
c6288     81         53        28
