Abstract: Placement and routing heuristics for a Field Programmable Multi-Chip Module (FPMCM) are presented. The placement is done in three phases: partitioning, chip assignment and iterative improvement. The routing is done in two phases: global routing followed by detailed routing. Detailed routing involves new channel routing problems denoted by Exact Segmented Channel Routing (ESCR) and K-ESCR. A very fast K-ESCR heuristic is described. Experimental results show that the placement heuristic achieves high gate utilization, and that the K-ESCR heuristic performs surprisingly well over a wide range of channel sizes.
I. Introduction
Using multiple reconfigurable Field-Programmable Gate Arrays (FPGAs) for logic emulation and rapid prototyping is becoming increasingly popular [9]. Different approaches for combining multiple FPGAs on a substrate have been proposed. First-generation emulation machines employed an array of FPGAs mounted on printed circuit boards and a fixed wiring network among the FPGAs. Inter-chip routing is done using both the fixed wires and the FPGAs themselves. The use of FPGAs for routing results in low FPGA utilization and poor performance. To improve FPGA utilization and performance, newer generations of emulation machines use a combination of FPGAs and dedicated routing chips interconnected via a fixed wiring network [2]. Inter-chip routing is done using the routing chips and the fixed wires.
Recently Dobbelaere et al. [1] proposed an alternative approach called a Field Programmable Multi-Chip Module (FPMCM), which integrates the logic of an FPGA with the fast interconnection of a routing chip. This is done by surrounding the logic core of an FPGA with a programmable interconnection frame. The frame supports fast through-chip routing as well as connections into the logic core. By using an MCM substrate with flip-chip bonding instead of a conventional PCB, a higher pin-to-gate ratio can be supported. As a result, an order of magnitude higher gate density and performance can be achieved using the proposed FPMCM than with existing emulation machines.
To test the proposed FPMCM architecture we are developing an experimental CAD system for mapping large designs onto an FPMCM. The CAD system consists of two major modules, the MCM level module and the FPGA level module [5]. Since the proposed FPMCM can use an existing FPGA core and its accompanying FPGA CAD tools, we are focusing on the development of the MCM level tools. The MCM level tools accept as input an FPMCM description file, which provides the details of the architecture of the FPMCM to be used, and a netlist for the design to be emulated. The system then places the design, i.e., finds an assignment of the design components to the chips, routes the inter-chip nets using both the fixed wiring network and the interconnection frames, and generates the information needed to program the frame switches.
The work was partially supported by FBI contract J-FBI-89-101.
In this paper we describe the MCM-level placement and routing system. The FPMCM placement problem is divided into three phases: partitioning followed by chip assignment and iterative improvement. The netlist is first partitioned among the available chips using a modified Fiduccia-Mattheyses partitioning heuristic [3]. The partitions are then assigned to the FPGAs on the FPMCM so as to minimize routing cost. In the iterative improvement phase, components from different chips are moved around to improve a cost function of the routing.
Routing is invoked after placement has been completed, and is divided into two phases, global routing followed by detailed routing. The global router replaces each net by a set of horizontal or vertical two-point connections. Global routing of a single net is modeled as a Steiner tree problem [10]. To solve it we use a heuristic that combines a known heuristic for the general Steiner Tree Problem [8] with improvements that exploit the grid structure of the FPMCM.
To complete the routing, each connection must be assigned to one or more of the fixed wires. This task is broken into independent assignments for horizontal and vertical routing channels. We denote the problem of assigning connections to fixed wires in a channel by Exact Segmented Channel Routing (ESCR). The ESCR problem is similar to that of segmented channel routing [7], with a number of key differences that prevent us from using the heuristics reported in [7]. We describe a greedy ESCR algorithm that is simple and fast but gives surprisingly good results.
It is important to note that the placement and routing described in this paper, although developed specifically for the FPMCM, can be easily adapted to any multi-FPGA system connected by a fixed interconnect network.
The rest of the paper is organized as follows. In Section II a detailed description of the FPMCM routing architecture is given. Section III presents the three phases of the placement. In Section IV we present heuristics for global routing and detailed routing, and formally define the ESCR problem. Experimental results using an ESCR heuristic are also given. In Section V placement and routing results for benchmark designs mapped into two experimental FPMCMs are presented.
II. Routing Architecture
The FPMCM consists of an array of modified FPGAs mounted on a substrate and interconnected by a fixed wiring network. The logic core of the modified FPGA is assumed to be an SRAM-based FPGA core, although specialized functions such as memories, processors, etc., may also be used. The core is surrounded by an SRAM-programmable interconnection frame.
31st ACM/IEEE Design Automation Conference. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1994 ACM 0-89791-653-0/94/0006 $3.50
The I/O terminals of the FPGA core are connected to the programmable interconnection frame via core pins. The I/O terminals of the programmable interconnection frame are called frame pins. Frame pins on different chips are connected to each other via frame wires in the fixed wiring network. Other frame pins may be connected to external pins. The interconnection frame may be programmed to interconnect pairs of frame pins, resulting in fast through-chip interconnections.
As shown in Figure 1, the interconnection frame comprises four switch boxes placed at the corner regions of the chip. The switch boxes are SRAM programmable and are assumed to have complete flexibility, i.e., any horizontal frame wire may be connected to any vertical frame wire. Permutation boxes are placed between switch boxes. Again the permutation boxes are assumed to have complete flexibility; any two frame wires entering the box from opposite sides may be interconnected. Interconnections between the frame and the core pins are provided so that any signal entering a chip may be routed to the core and any signal leaving the core may be routed to a frame pin. To simplify the routing problem as well as to decrease the average wire length, we restrict the fixed wiring network to consist of two-terminal horizontal or vertical fixed wires only. We also assume the pattern of the fixed wires terminating at any chip is independent of the location of the chip. At the edges of the array of chips this is done by "wrapping around" the fixed wires.
There are two types of paths through the interconnection frame, I-paths and L-paths, as illustrated in Figure 2. Three types of L-paths are possible. Between the chip and its neighbor, the near-far type wastes two additional frame pins, while the far-far type wastes four.
III. Placement
The goal of the MCM level placement is to assign all components of the design to the chips without violating chip gate or pin capacities and such that all inter-chip nets can be routed using the available routing resources. To achieve MCM level routability the placement attempts to minimize the number of connections to be routed. This is done by minimizing the number of inter-chip nets, reducing the number of chips each inter-chip net has to connect, and discouraging inter-chip nets from connecting chips that are in neither the same row nor the same column.
The placement is done in three phases: partitioning, chip assignment and iterative improvement. In the partitioning phase, we use a modified Fiduccia-Mattheyses partitioning heuristic [3] to minimize the number of inter-chip nets. In the chip assignment phase, we collapse components in the same partition into a node and merge nets that go to the same set of partitions into a weighted edge. Since our data show that only 1% of nets eventually go to more than 2 chips, hyperedges are removed for simplicity. We then assign the nodes to the chips on the MCM.
We start with a random placement and improve the placement using a simulated annealing heuristic [4]. The total routing cost is the sum of the routing cost for each edge. The routing cost of each edge is the weight of the edge multiplied by the cost of routing a net between the corresponding pair of chips. If the horizontal and vertical distances between the two chips are x and y respectively, the cost of routing a net between them is the sum of Net2Wire(x) and Net2Wire(y), where the function Net2Wire() computes an estimate of the number of frame wires that are needed to route a connection, which is a number between 1 and 2 because of our restriction on the frame wires that each connection is allowed to use.
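The cost function above can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not give the body of Net2Wire(), so `net2wire` below is a hypothetical stand-in (0 for zero distance, 1 when one wire of an assumed maximum length `MAX_WIRE_LEN` spans the distance, 2 otherwise), wrap-around wires are ignored, and the partition names, weights and grid positions are made up.

```python
MAX_WIRE_LEN = 4  # assumed longest fixed wire, in chip pitches (hypothetical)


def net2wire(distance):
    """Hypothetical stand-in for Net2Wire(): estimated frame wires for a span."""
    if distance == 0:
        return 0  # assumption: no wire is needed along this axis
    return 1 if distance <= MAX_WIRE_LEN else 2


def assignment_cost(edges, position):
    """Sum over weighted edges of weight * (Net2Wire(x) + Net2Wire(y))."""
    total = 0
    for (p, q), weight in edges.items():
        (row_p, col_p), (row_q, col_q) = position[p], position[q]
        x = abs(col_p - col_q)  # horizontal distance between the two chips
        y = abs(row_p - row_q)  # vertical distance between the two chips
        total += weight * (net2wire(x) + net2wire(y))
    return total


# Made-up instance: three partitions joined by two weighted edges.
edges = {("P0", "P1"): 3, ("P1", "P2"): 1}
same_row = {"P0": (0, 0), "P1": (0, 1), "P2": (2, 1)}
diagonal = {"P0": (0, 0), "P1": (1, 1), "P2": (2, 1)}
print(assignment_cost(edges, same_row))
print(assignment_cost(edges, diagonal))
```

Under these assumptions the placement that keeps the heavy edge within one row is cheaper, which is the behavior the annealing step is meant to reward.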
After the chip assignment is completed, we improve the placement using a simulated annealing-based iterative improvement heuristic. The set of components considered at any iteration are those connected to at least one inter-chip net, which we call the frontier set. At each iteration, we randomly pick a component in the frontier set, i.e., a frontier component, select a new destination chip for it, and evaluate the routing cost function. The acceptance or rejection of the move is then determined using the acceptance probability. The routing cost is again calculated using the Net2Wire() function, which gives a rough estimate of the number of frame wires needed to route the net.
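One iteration of this improvement loop might look like the sketch below. Several things here are assumptions rather than details from the text: the acceptance rule is the standard Metropolis criterion, gate and pin capacities are not checked, and `cut_cost` is a toy cost (number of inter-chip nets) standing in for the Net2Wire()-based routing cost.

```python
import math
import random


def frontier_set(nets, chip_of):
    """Components connected to at least one inter-chip net."""
    frontier = set()
    for net in nets:
        if len({chip_of[c] for c in net}) > 1:
            frontier.update(net)
    return frontier


def accept(delta, temperature, rng):
    """Metropolis rule (assumed): keep improvements, sometimes keep worse moves."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def cut_cost(nets, chip_of):
    """Toy cost: number of nets that span more than one chip."""
    return sum(1 for net in nets if len({chip_of[c] for c in net}) > 1)


def anneal_step(nets, chip_of, chips, cost, temperature, rng):
    """Move one random frontier component; return True if the move is kept."""
    frontier = sorted(frontier_set(nets, chip_of))
    if not frontier:
        return False
    comp = rng.choice(frontier)
    old_chip = chip_of[comp]
    candidates = [c for c in chips if c != old_chip]
    if not candidates:
        return False
    before = cost(nets, chip_of)
    chip_of[comp] = rng.choice(candidates)
    delta = cost(nets, chip_of) - before
    if accept(delta, temperature, rng):
        return True
    chip_of[comp] = old_chip  # rejected: undo the move
    return False


nets = [("a", "b"), ("b", "c"), ("c", "d")]
chip_of = {"a": 0, "b": 0, "c": 1, "d": 1}
rng = random.Random(0)
print(sorted(frontier_set(nets, chip_of)))
anneal_step(nets, chip_of, [0, 1], cut_cost, 1.0, rng)
```

Restricting moves to the frontier set keeps each iteration cheap, since components with only intra-chip nets cannot reduce the inter-chip cost by moving alone.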
IV. Routing
Routing in the FPMCM is done in two phases, global routing followed by detailed routing. Global routing is performed after placement is completed; thus it is assumed that each component in the design is assigned to a chip on the MCM. The global router replaces each net by a set of two-terminal connections and assigns each connection to a horizontal or vertical channel. As shown in Figure 3, there are two channels per row or column of chips. In the detailed routing phase the connections are assigned to specific fixed wires in the channels.
A. Global Routing
Global routing using frame wires is done one net at a time. We pick an inter-chip net at random, route it, and update the routing resources and costs. Global routing of a single net can be formulated as a Steiner Tree Problem (STP) [10]. Each chip is represented by eight frame nodes, one for each of the eight groups of frame pins located at the outer two sides of the four switch boxes. Chips connected by the net have an additional core node located at the center, representing the core pins to one of which the signal must be connected. Figure 4 shows a 3 × 3 grid of chips with 3 core nodes at coordinates (1, 1), (2, 3) and (3, 2).
The edges of the graph represent the possible interconnections of frame nodes or core nodes to frame nodes. Associated with each edge is its cost. Since routing resources including frame pins and fixed wires are limited, the edge cost is a function of the frame pins used and the availability of frame wires. However, only estimates of these resources are available during global routing. There are three types of edges in the graph:
Internal edges. If a net has a terminal at a particular chip, it must be connected to the core of that chip through a frame-to-core connection. Such a connection is represented by an internal edge. An internal edge is assigned a unit cost since it uses a single frame pin. In Figure 4 internal edges are shown as dotted lines.
Switching edges. A chip that is not connected by the net may still be used to connect fixed wires in the fixed wiring network. The types of interconnections possible include switching edges for L-paths and I-paths. The cost assigned to a switching edge is 2, since in making such an interconnection we must use two frame pins. In Figure 4 switching edges are shown as dashed lines. Switching edges represent all types of turns in the frame as a combination of I-path edges and L-path edges. For example, a far-far turn, which costs 6 frame pins, is implemented with two I-path edges and an L-path edge. Since making a turn costs 2, 4, or 6 frame pins, it is correspondingly discouraged.
Routing edges. These edges represent all possible horizontal and vertical interconnections among chips. Figure 4 shows a subset of the routing edges as solid lines. The cost given to a routing edge is determined by two factors: the length of the edge and an estimate of the availability of frame wires for realizing the edge. Because the global router does not assign fixed wires to connections, it can only estimate the contributions of these factors to cost.
Routing a net with minimal cost is equivalent to solving the Steiner tree problem on the graph described above, with the net terminals at core nodes. Figure 4 shows the graph as well as a global routing, in bold lines.
The heuristic we use to route a net consists of two phases. In the first phase, we use Takahashi and Matsuyama's [8] heuristic. In the second phase, the result is improved by repeated corner flipping, to T conversion and leaf chip adjustment until no more improvement is possible. Examples of these improvements are given in Figure 5.
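The first phase can be illustrated with a small sketch of the Takahashi-Matsuyama heuristic as it is commonly stated: start from one terminal and repeatedly attach the terminal closest to the current tree along a shortest path. The graph below is a made-up toy, not the frame-node graph of Figure 4, and the second-phase improvements are not shown; all function names are illustrative.

```python
import heapq


def shortest_path_from_tree(graph, tree_nodes, targets):
    """Multi-source Dijkstra: cheapest path from any tree node to any target."""
    dist = {v: 0 for v in tree_nodes}
    prev = {}
    heap = [(0, v) for v in tree_nodes]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u in targets:
            path = [u]
            while path[-1] in prev:  # walk back to the tree
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    raise ValueError("some terminal is unreachable")


def takahashi_matsuyama(graph, terminals):
    """Grow a Steiner tree by attaching the nearest remaining terminal."""
    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    tree_edges = set()
    remaining = set(terminals[1:])
    total = 0
    while remaining:
        d, path = shortest_path_from_tree(graph, tree_nodes, remaining)
        for a, b in zip(path, path[1:]):
            tree_edges.add(frozenset((a, b)))
        tree_nodes.update(path)
        remaining -= set(path)
        total += d
    return total, tree_edges


def undirected(edge_costs):
    """Build a symmetric adjacency map from {(u, v): cost}."""
    graph = {}
    for (u, v), w in edge_costs.items():
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


# Toy graph: three terminals and one intermediate node "s".
toy = undirected({("t1", "s"): 1, ("s", "t2"): 1, ("s", "t3"): 1,
                  ("t1", "t2"): 3, ("t2", "t3"): 3})
cost, tree = takahashi_matsuyama(toy, ["t1", "t2", "t3"])
print(cost, sorted(sorted(e) for e in tree))
```

In the paper's setting the edge costs would be those of the internal, switching and routing edges described above; here they are arbitrary positive integers.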
B. Detailed Routing
To complete the routing, each connection must be assigned to one or more frame wires. This assignment is performed by the detailed router. As a result of the complete flexibility of the switch and permutation boxes, and also assuming complete flexibility in core pin assignments, it is easy to see that detailed routing reduces to Exact Segmented Channel Routing performed independently for each horizontal and vertical channel. The inputs to ESCR are the channel structure and the channel connections.
In Problem 1, the number of frame wires used per connection is unlimited. This may result in unacceptable delays. By restricting the maximum number of frame wires allowed to cover a connection to an integer K, delays may be better controlled. We denote this version of the ESCR problem as the K-wire Exact Segmented Channel Routing (K-ESCR) problem.
Problem 2. The K-wire Exact Segmented Channel Routing problem is an ESCR problem with the additional constraint that at most K frame wires can be used to cover each connection.
The ESCR problem has been proved to be strongly NP-complete [6]. In [6] Lan and How also proved that the general K-ESCR problem is strongly NP-complete when K ≥ 3, and that a variation that prohibits track-permutation is strongly NP-complete for K ≥ 2. The complexity of 2-ESCR is still an open problem.
However, in practice, the number of chips in a channel is bounded by a small constant (on the order of 16), which in turn imposes an upper bound on K. The ESCR with bounded channel length can be solved in polynomial time [6], but is too time-consuming in practice.
C. ESCR Heuristic
The heuristic we use to solve the ESCR problem routes each connection using no more than two frame wires, i.e., K = 2. The algorithm consists of a greedy constructive phase followed by several iterations of a reroute phase; once a connection is routed, the reroute phase may alter this routing, but will never unroute the connection. In each successive reroute iteration, the heuristic searches more and more thoroughly for possible routings for the fewer and fewer remaining unrouted connections.
Since frame wires between the same pair of frame nodes are interchangeable in ESCR, we treat them as a single wire bundle. The capacity of a wire bundle is denoted by Cap(wire bundle). Similarly, a connection group consists of connections between the same pair of frame nodes. With the constraint of at most two frame wires per connection, a connection of length l can be routed in l different ways.
Assume the probabilities of routing the connection in each of these l ways are the same. Then, for each wire bundle, we can compute the expected number of connections that will use this wire bundle, denoted by Exp(wire bundle). We define the cost of using a frame wire in a wire bundle to be Exp(wire bundle) - Cap(wire bundle), unless Cap is zero, in which case the cost is defined to be infinity. In addition, we define the cost of routing a connection to be the maximum of the costs of the wire bundles used.
In the greedy constructive phase, we attempt to route as many connections as possible, one by one. To route a connection of length l, we compute the cost of each of its l possible routings. If a least-cost routing with finite cost exists, we select it as the routing for the connection and update Cap and Exp. Otherwise, we skip the connection and leave it for the reroute iterations.
Example: Consider an ESCR problem with connections Table I as Exp0, Cap0, and Cost0. As an example of the calculation of these numbers, consider the value of Exp for F2, which is 3/2 because C2 will use F2 with probability 1 and C4 will use F2 with probability 1/2. To route C1, we can use either F1 and F7 or F3 and F5. The cost of either alternative is zero, so the router may pick either one; here we choose F1 and F7 and show the updated values in Table I as Exp1, Cap1, and Cost1. To route C2, the router can only use F2; the updated values are shown in Table I as Exp2, Cap2, and Cost2. Finally, to route C3, we can use either F3 and F4 or just F6. Once again both costs are zero, so the router may pick either one; here we choose the first again and show the updated values in Table I as Exp3, Cap3, and Cost3. Since the costs of using F2 and F6 or F4 and F7 for C4 are both infinity, C4 will not be routable. Consequently C4 must be left for the reroute phase to complete.
) connection groups and O(L) time is required to compute the contribution of each connection group to Exp. After Exp is computed, for each connection we need O(L) time to decide how to route this connection and O(L) time to update Exp and Cap. Since L is a constant, the overall computation time is O(n), where n is the number of connections.
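The greedy phase can be traced on a small instance modeled on the example. The sketch below rests on several assumptions that are not in the text: the wire positions in `WIRES` are a hypothetical layout chosen so that each connection has the alternatives listed in the example (F8 is left out because its role is not described), every bundle has capacity 1, the probability is spread uniformly over the routings whose bundles actually exist, and routing a connection removes its contribution from Exp and decrements Cap on the wires it takes.

```python
from fractions import Fraction

INF = float("inf")

# Hypothetical layout: wire name -> (left, right) position in the channel.
WIRES = {"F1": (0, 1), "F2": (1, 2), "F3": (2, 3), "F4": (3, 4),
         "F5": (0, 2), "F6": (2, 4), "F7": (1, 3)}
CONNS = {"C1": (0, 3), "C2": (1, 2), "C3": (2, 4), "C4": (1, 4)}
BUNDLE = {span: name for name, span in WIRES.items()}


def routings(span):
    """Exact coverings of a connection by at most two existing wire bundles."""
    a, b = span
    ways = [(BUNDLE[(a, m)], BUNDLE[(m, b)]) for m in range(a + 1, b)
            if (a, m) in BUNDLE and (m, b) in BUNDLE]
    if (a, b) in BUNDLE:
        ways.append((BUNDLE[(a, b)],))
    return ways


def expected_use(conns):
    """Exp(wire): expected number of connections using each wire bundle."""
    exp = {w: Fraction(0) for w in WIRES}
    for span in conns.values():
        ways = routings(span)
        for way in ways:
            for w in way:
                exp[w] += Fraction(1, len(ways))
    return exp


def route_cost(way, exp, cap):
    """Maximum over the wires used of Exp - Cap, or infinity if Cap is 0."""
    return max(exp[w] - cap[w] if cap[w] > 0 else INF for w in way)


def greedy(conns):
    """Route connections one by one; skip those with no finite-cost routing."""
    exp = expected_use(conns)
    cap = {w: 1 for w in WIRES}
    routed, unrouted = {}, []
    for name, span in conns.items():
        ways = routings(span)
        best = min(ways, key=lambda way: route_cost(way, exp, cap), default=None)
        if best is None or route_cost(best, exp, cap) == INF:
            unrouted.append(name)
            continue
        routed[name] = best
        for way in ways:  # assumed update: drop this connection's share of Exp
            for w in way:
                exp[w] -= Fraction(1, len(ways))
        for w in best:
            cap[w] -= 1
    return routed, unrouted


print(expected_use(CONNS)["F2"])
print(greedy(CONNS))
```

Ties are broken in favor of the first routing enumerated, which on this instance selects F1 and F7 for C1 and F3 and F4 for C3, and leaves C4 for the reroute phase.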
The I-th reroute iteration tries to route each unrouted connection by rerouting up to 2I other connections. Thus from one reroute iteration to the next, we increase the solution search depth. The algorithm is given in Figure 7.
[Table: connections C1 to C4 versus frame wires F1 to F8; entries are 1/2 or 1.]
Avail(F2) will be called first to see whether F2 may be made available by rerouting other connections. The answer is obviously no, because C2, which is using F2, cannot be rerouted. Thus the first alternative for routing C4 fails.
The router will then try to see whether F7 and F4 are both available. It will first call Avail(F7), which will determine whether F7 can be released by rerouting C1. Rerouting C1 requires F5, which is obviously available, as well as F3, which is currently occupied by C3. Therefore the answer from AvailByRerouteNet(C1,1,2) depends on whether or not F3 can be freed from implementing C3. Since C3 can be routed using F6, which is currently available, Avail(F3) will return yes. Thus the router concludes that F7 is available by first rerouting C3 using F6 and then rerouting C1 using F5 and F3.
The router will then check whether F4 is available. Since F4 was freed when C3 was rerouted, Avail(F4) will return yes. Thus C4 can be routed with F7 and F4, and the router has found a solution for this example.
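The walk-through above can be reproduced with a small budget-limited recursive search. This is a sketch written for illustration and is not the algorithm of Figure 7: the function names `free_wire`, `claim` and `reroute` are hypothetical, each wire is assumed to have capacity 1, the alternatives in `ALTS` are those named in the example, and the budget counts how many routed connections may be moved.

```python
# Alternative routings per connection, as listed in the example.
ALTS = {
    "C1": [("F1", "F7"), ("F5", "F3")],
    "C2": [("F2",)],
    "C3": [("F3", "F4"), ("F6",)],
    "C4": [("F2", "F6"), ("F7", "F4")],
}


def free_wire(wire, routed, budget, reserved):
    """Return (routing, budget) in which `wire` is unused, or None."""
    owner = next((c for c, way in routed.items() if wire in way), None)
    if owner is None:
        return routed, budget  # already free
    if budget == 0:
        return None  # no reroutes left
    rest = {c: way for c, way in routed.items() if c != owner}
    for alt in ALTS[owner]:
        if reserved & set(alt):
            continue  # would collide with wires being claimed higher up
        result = claim(alt, rest, budget - 1, reserved)
        if result is not None:
            new_routed, left = result
            new_routed = dict(new_routed)
            new_routed[owner] = alt  # owner now uses its alternative
            return new_routed, left
    return None


def claim(wires, routed, budget, reserved):
    """Free every wire in `wires`, keeping them all reserved meanwhile."""
    reserved = reserved | set(wires)
    for w in wires:
        result = free_wire(w, routed, budget, reserved)
        if result is None:
            return None
        routed, budget = result
    return routed, budget


def reroute(conn, routed, budget):
    """Try to route `conn` by moving at most `budget` routed connections."""
    for alt in ALTS[conn]:
        result = claim(alt, routed, budget, frozenset())
        if result is not None:
            new_routed, _ = result
            return {**new_routed, conn: alt}
    return None


after_greedy = {"C1": ("F1", "F7"), "C2": ("F2",), "C3": ("F3", "F4")}
print(reroute("C4", after_greedy, budget=2))
```

With a budget of 2, which corresponds to 2I for the first reroute iteration, the search moves C3 to F6 and C1 to F5 and F3, then routes C4 on F7 and F4; with a budget of 1 it finds nothing.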
D. Results
In order to determine the quality of the proposed heuristic for the ESCR problem, we measured its performance on instances of randomly generated channel connections. The heuristic was executed for a channel with 8 chips and 50 pins on each side of each chip. The lengths of the frame wires ranged from 1 to 4. The numbers of frame wires per chip of each length are 20, 15, 10, and 5, respectively. The connections are randomly generated with the left edge of each connection uniformly distributed over the chips and the length of each connection independently distributed according to a geometric distribution.
The heuristic is executed for different connection densities, defined as the ratio between the number of connections and the number of frame wires. For densities of less than 0.55 the heuristic succeeded every time. When the density is above 0.6 the success ratio drops sharply. The heuristic achieves high channel utilization, defined as the ratio between the number of frame wires used for the routing and the overall number of frame wires. For the maximum connection density with success probability of 1 the average channel utilization is 0.8. The average running time on a SPARC IPX workstation for successful routing is less than 30 ms for densities of 0.6 or less. The average running time for failed attempts is less than 100 ms for densities between 0.55 and 0.6. Similar results are achieved for channels of different lengths and widths.
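A test instance of the kind described can be generated as sketched below. The parameter `p` of the geometric distribution, the cap on connection length, and the choice to count one full set of wires per chip (as wrap-around wiring would give) are all assumptions; the text does not state them.

```python
import random

N_CHIPS = 8
WIRES_PER_CHIP = {1: 20, 2: 15, 3: 10, 4: 5}  # wire length -> count per chip


def geometric(rng, p):
    """Number of Bernoulli(p) trials up to and including the first success."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n


def random_connections(density, p, rng):
    """Return (left chip, length) pairs for the requested connection density."""
    n_wires = N_CHIPS * sum(WIRES_PER_CHIP.values())  # assumed total wire count
    n_conns = round(density * n_wires)
    conns = []
    for _ in range(n_conns):
        left = rng.randrange(N_CHIPS)  # left edge uniform over the chips
        length = min(geometric(rng, p), N_CHIPS - 1)  # assumed cap on length
        conns.append((left, length))
    return conns


conns = random_connections(0.55, 0.5, random.Random(0))
print(len(conns))
```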
V. Placement and Routing Results
We tested the MCM level placement and routing system using several designs from the Partitioning93 benchmarks [11]. We placed and routed the designs on two experimental FPMCMs. FPMCM9 has 9 chips in a 3 × 3 configuration, while FPMCM25 has 25 chips in a 5 × 5 configuration. Table II gives the main parameters of both FPMCMs. All designs were successfully placed and routed on both FPMCMs. Tables IV and V describe the placement and routing results for the S38584 design, the largest design in the Partitioning93 benchmarks. The parameters of the design are given in Table III.
VI. Conclusions
A placement and routing system for the FPMCM proposed in [1] has been described. We demonstrated that, using our system, an FPMCM can implement designs of sizes at least one order of magnitude higher than those of single FPGAs with the same core utilizations. The placement and routing system can also be easily adapted to any multi-FPGA system with a fixed wiring network by routing directly through the FPGAs.
The results of mapping large Partitioning93 benchmark designs show that only 3% of the nets are inter-chip nets and less than 1% of the nets need more than 2 connections.
We formally defined the ESCR and K-ESCR problems. Heuristics for both global routing of a single net and the K-ESCR problem were presented. We demonstrated experimentally that the K-ESCR heuristic performs surprisingly well.
We plan to perform logic replication [3] after placement to further reduce the number of inter-chip nets.
