Multi-FPGA systems (MFSs) are used as custom computing machines, logic emulators and rapid prototyping vehicles. A key aspect of these systems is their programmable routing architecture, which is the manner in which wires, FPGAs and field-programmable interconnect devices (FPIDs) are connected. In this paper we propose a new routing architecture, called the Hybrid Complete-Graph and Partial-Crossbar (HCGP), which has superior speed and cost compared to a partial crossbar. The new architecture uses both hardwired and programmable connections between the FPGAs. We compare the performance and cost of the HCGP and partial crossbar architectures experimentally, by mapping a set of 15 large benchmark circuits into each architecture. A customized set of partitioning and inter-chip routing tools was developed, with particular attention paid to architecture-appropriate inter-chip routing algorithms. We show that the cost of the partial crossbar (as measured by the number of pins on all FPGAs and FPIDs required to fit a design) is on average 20% more than the new HCGP architecture and as much as 25% more. Furthermore, the critical path delay for designs implemented on the partial crossbar was on average 20% more than the HCGP architecture and up to 43% more. Using our experimental approach, we also explore a key architecture parameter associated with the HCGP architecture: the proportion of hard-wired connections versus programmable connections, to determine its best value.
The routing architecture has a strong effect on the speed, cost and routability of the system. Many architectures have been proposed and built [FCCM] [Butt92] [Van92] [Apti96] [Babb97] [Lewi97], and some research work has been done to empirically evaluate and compare different architectures [Kim96] [Khal97]. These studies have shown that the partial crossbar is one of the best existing MFS architectures. In this paper we present a new routing architecture for MFSs, the HCGP, that uses both hardwired and programmable connections to reduce cost and increase speed. We evaluate and compare the HCGP architecture and the partial crossbar architecture using an empirical approach. In particular, we compare the architectures on the basis of pin cost and speed.
The speed comparisons are based on post inter-chip routing critical path delay of real benchmark circuits, which, to our knowledge, is the first time such detailed timing information has been used in the study of board-level MFS architectures.
We focus on single-board MFS routing architectures that use no more than about 25 FPGAs. This is for two reasons: First, the complete graph topology used in the HCGP architecture does not scale well for a large number of FPGAs. It becomes infeasible to connect each FPGA to every other FPGA because such a scheme would likely result in severe routability problems. In such cases, hierarchical architectures would be more effective. We believe that the HCGP architecture could form the basis of a hierarchical architecture, with the root architecture being an HCGP, and groups of HCGPs connecting in a next-level HCGP and so on. Second, we did not have huge circuits and the CAD tools required for mapping such circuits to hierarchical architectures.
Previous work has evaluated mesh [Hauc94] and other architectures [Chan93]. In [Hauc94], several constructs (1-hop interconnections, Superpins, and Permutations) were proposed to improve the basic 4-way mesh. However, synthetic netlists (not real circuits) were used to evaluate different mesh topologies. In [Chan93], architectural trade-offs in the design of the folded Clos network (partial crossbar) were investigated and an optimal algorithm for routing two-terminal nets was presented. Although this work provides some theoretical insight into these architectures, empirical studies that evaluate the implementation of real circuits on different architectures provide a clearer picture of the 'goodness' of each architecture relative to the others [Kim96] [Khal97]. Our own previous research has shown that the partial crossbar is vastly superior to the best mesh architecture [Khal97]. In [Kim96], several MCNC circuits were mapped to seven different architectures, including the partial crossbar architecture. Each circuit was mapped to a fixed-size system. Each architecture was evaluated on the basis of the total number of CLBs needed across all circuits (where fewer CLBs used implies a better architecture), the type of FPGA chips used (smaller FPGAs imply a better architecture), and the maximum number of hops needed across all inter-FPGA nets (as a metric for speed). A hop is defined as a chip-to-chip connection, i.e. a wire segment that connects two different chips on a board. It was shown that one of the proposed architectures, FPGAs connected together as a tri-partite graph, gave the best results (slightly better than the partial crossbar). That study used few circuits large enough to really 'stress' the architectures: only three reasonably large circuits (>2000 CLBs) were employed.
Also, for the speed estimate, only the worst-case net delay in terms of the number of hops was considered, which is not as representative of the true delay as post-routing critical path delay. An early version of the present work appeared in [Khal98]. The present work includes key enhancements, particularly timing-driven inter-chip routing for HCGP and an exploration of the effects of a key parameter P p (to be defined later) on the speed of the HCGP architecture. This paper is organized as follows: In Section 2 we describe the experimental evaluation procedure and the evaluation metrics used, and give details on the suite of large benchmark circuits used in this experimental work. In Section 3 we cover the architectural issues and assumptions that arise when mapping real circuits to the HCGP and partial crossbar architectures. We also briefly describe architecture-specific inter-chip routing algorithms for these architectures. Experimental results and their analysis are presented in Section 4, and we conclude in Section 5.
Experimental Overview
To evaluate the two routing architectures considered in this paper, we used the experimental procedure illustrated in Figure 2 . Each benchmark circuit was partitioned, placed and routed into each architecture. Section 2.1 describes the general toolset used in this flow. The cost and delay metrics that we use to evaluate architectures are described in Section 2.2. A description of the 15 benchmark circuits used is given in Section 2.3. 
General CAD Flow
As illustrated in Figure 2, we start with a (technology mapped) netlist of 4-LUTs and flip flops of the circuit. The circuit is partitioned into a minimum number of sub-circuits using a multi-way partitioning tool which accepts as constraints the specific FPGA logic capacity and pin count. For all the experiments presented in this paper we used a Xilinx 4013E-1 FPGA, which consists of 1152 4-LUTs, 1152 flip flops, and 192 usable I/O pins [Xili97]. Multi-way partitioning is accomplished using a recursive bi-partitioning procedure. The partitioning tool used is called 'part' and was originally developed for the Transmogrifier-1 rapid prototyping system [Gall94]. It is based on the Fiduccia and Mattheyses partitioning algorithm [Fidu82] with an extension for timing-driven pre-clustering [Shih92]. The output of the partitioning step is a netlist of connections between the FPGAs that contain the circuit.
Given the chip-level interconnection netlist, the next step is to route each inter-FPGA net using the most suitable routing path. The routing path chosen should be the shortest path (use the minimum number of hops) and it should cause the least possible congestion for subsequent nets to be routed. Depending on the architecture, the routing resources available in an MFS could be wires that are direct connections between FPGAs, or wires that connect FPGAs and FPIDs.
If the routing attempt fails, the partitioning step is repeated after reducing the number of I/O pins per FPGA specified to the partitioner. This usually increases the number of FPGAs needed, and helps routability by decreasing the pin demand from each FPGA, and providing more "route-through" pins in the FPGAs which facilitate routing.
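The partition-and-route loop described above can be sketched as follows. This is a minimal illustration, not the toolset used in the experiments: `partition` and `route` are hypothetical callables standing in for the partitioner and the inter-chip router, and the 5% reduction step is an assumed default taken from the mips64 example discussed in the results.

```python
from typing import Callable, Optional, Tuple

def map_circuit(
    partition: Callable[[int], int],
    route: Callable[[int, int], bool],
    pins_per_fpga: int = 192,
    shrink: float = 0.05,
    min_pins: int = 1,
) -> Optional[Tuple[int, int]]:
    """Repeat partitioning with a tighter pin limit until routing succeeds.

    partition(pin_limit)        -> number of FPGAs needed (hypothetical stand-in)
    route(num_fpgas, pin_limit) -> True if every inter-FPGA net was routed
    Returns (num_fpgas, pin_limit), or None if no pin limit worked.
    """
    pin_limit = pins_per_fpga
    while pin_limit >= min_pins:
        num_fpgas = partition(pin_limit)
        if route(num_fpgas, pin_limit):
            return num_fpgas, pin_limit
        # Routing failed: give the partitioner fewer pins per FPGA, which
        # leaves more free (route-through) pins for the router next time.
        pin_limit = min(pin_limit - 1, int(pin_limit * (1.0 - shrink)))
    return None

# Toy stand-ins (assumptions): 14 FPGAs at the full pin limit, 15 otherwise,
# and routing succeeds only once 15 FPGAs are available.
result = map_circuit(lambda limit: 14 if limit >= 190 else 15,
                     lambda n, limit: n >= 15)
print(result)
```

The loop always terminates because the pin limit strictly decreases on every failed attempt.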
Note that in an actual MFS, the inter-FPGA routing step is followed by pin assignment, placement and routing within individual FPGAs. We need not perform these tasks because we are only interested in knowing the MFS size needed to fit the circuit. Our previous research has shown that we can afford to assign pins randomly for each FPGA without jeopardizing routability and speed [Khal95] . During recursive bi-partitioning, we restrict the logic utilization of each FPGA to be at most 70% to avoid placement and routability problems within individual FPGAs. Thus we ensure that if an inter-FPGA routing attempt succeeds, it is almost guaranteed that the subsequent pin assignment, placement, and routing steps will be successful for each FPGA in the MFS.
Notice that the above-mentioned claims about I/O pin-constrained placement and routing are not applicable to the older Xilinx FPGAs (XC3000) and the older Xilinx tool set (the APR tool set, [Xili92]). However, in our research we assume that the Xilinx XC4013 FPGA and the XACT tool set [Xili94] are used, which give excellent results under I/O pin-constrained placement and routing, as shown in [Khal95]. Therefore, our assumption that pin locking on a Xilinx XC4013 FPGA will not have an unduly adverse impact on its routability and speed is valid.
Another important point is that the 70% cap on FPGA logic utilization (that we imposed), could be increased further (whenever possible for specific FPGAs) if we perform placement and routing for individual FPGAs. We did not perform individual FPGA placement and routing because it would have involved a huge amount of experimental effort and time, and probably would not change our architectural conclusions in any significant way. Also, in most cases, very high FPGA logic utilization (say > 85%) after partitioning is rare because of FPGA pin limitations. In fact, the average post-partitioning logic utilization is less than 50%. The 70% cap on logic utilization is based on a conservative estimate. From our previous research study [Khal95] and from anecdotal evidence provided by other FPGA users, we found that in almost all circuits, restricting logic utilization to 70% or less leads to routing completion in the Xilinx XC4000 series of FPGAs.
We have developed a specific router for each of the architectures compared. (We had attempted to create a generic router but found that it had major problems with different aspects of each architecture [Khal99] .)
Evaluation Metrics
To compare the two routing architectures we implement benchmark circuits on each and contrast the pin cost and post-routing critical path delay, as described below.
Pin Cost
The cost of an MFS is likely a direct function of the number of FPGAs and FPIDs: If the routing architecture is inefficient, it will require more FPGAs and FPIDs to implement the same amount of logic as a more efficient MFS. While it is difficult to calculate the price of specific FPIDs and FPGAs, we assume that the total cost is proportional to the total number of pins on all of these devices. Since the exact number of FPGAs and FPIDs varies for each circuit implementation (in our procedure above, we allow the MFS to grow until routing is successful), we calculate, for each architecture, the total number of pins required to implement each circuit. We refer to this as the pin cost metric for the architecture.
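As a small illustration of the pin cost metric, the sketch below simply sums the pins of every FPGA and FPID in a system. The function name and the example device counts are illustrative assumptions, not values from our result tables.

```python
def pin_cost(num_fpgas, pins_per_fpga, fpid_pin_counts):
    """Total number of pins over all FPGAs and FPIDs in the MFS."""
    return num_fpgas * pins_per_fpga + sum(fpid_pin_counts)

# Illustrative system: four FPGAs with 192 usable pins each,
# plus sixteen 48-pin FPIDs.
print(pin_cost(4, 192, [48] * 16))  # 768 FPGA pins + 768 FPID pins = 1536
```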
Post Routing Critical Path Delay
The speed of an MFS, for a given circuit, is determined by the critical path delay obtained after a circuit has been placed and routed at the inter-chip level. We call this the post-routing critical path delay. We have developed an MFS static timing analysis tool (MTA) for calculating the post routing critical path delay for a given circuit and MFS architecture.
The operation and modeling used in the MTA are described briefly as follows: It first calculates the critical path delay of the un-partitioned design using a widely used method called the block oriented technique [Joup87]. It then reads the inter-FPGA netlist and the routing path for each inter-FPGA net, as provided by the inter-chip router, and the MFS architecture description. From this information the circuit is annotated with the inter-chip delays, from which the post-routing critical path delay can be calculated.
In the delay annotation step, the delay values given in Table 1 (obtained from data sheets [Xili97] and [ICub97] and some design experience) are used. Note that since we do not perform individual FPGA place and route, we approximate the CLB-to-CLB delay as a constant. The value of 2.5 ns for CLB-to-CLB routing delay is roughly half the delay on a long line for XC4013E-1 FPGA. This is a pessimistic estimate. Although using a single delay value is somewhat inaccurate, it still gives us a good estimate of the post-routing critical path delay of an MFS because it is dominated by off-chip delay values.
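The annotation-and-analysis step can be sketched as a longest-path computation over the circuit graph. This is a simplified illustration rather than the MTA tool itself. It assumes a combinational DAG, hypothetical per-block delays, and that an annotated inter-chip delay replaces the constant 2.5 ns CLB-to-CLB delay on the affected connection.

```python
from collections import defaultdict, deque

CLB_TO_CLB_NS = 2.5  # constant on-chip routing delay used in the text

def critical_path_delay(block_delay, edges, inter_chip_delay):
    """Longest arrival time over a DAG, visiting blocks in topological order.

    block_delay:      dict block -> logic delay in ns (hypothetical values)
    edges:            list of (src, dst) connections
    inter_chip_delay: dict (src, dst) -> delay in ns for connections that
                      cross chips; all other edges use CLB_TO_CLB_NS
    """
    succs = defaultdict(list)
    indeg = {b: 0 for b in block_delay}
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1
    arrival = dict(block_delay)  # a source block's arrival is its own delay
    ready = deque(b for b in block_delay if indeg[b] == 0)
    visited = 0
    while ready:
        b = ready.popleft()
        visited += 1
        for nxt in succs[b]:
            wire = inter_chip_delay.get((b, nxt), CLB_TO_CLB_NS)
            arrival[nxt] = max(arrival[nxt], arrival[b] + wire + block_delay[nxt])
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    if visited != len(block_delay):
        raise ValueError("graph contains a cycle")
    return max(arrival.values(), default=0.0)

blocks = {"a": 1.0, "b": 1.0, "c": 1.0}  # hypothetical block delays
chain = [("a", "b"), ("b", "c")]
print(critical_path_delay(blocks, chain, {("b", "c"): 25.6}))  # b->c via an FPID
print(critical_path_delay(blocks, chain, {("b", "c"): 12.6}))  # b->c hardwired
```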
Benchmark Circuits
A total of fifteen large benchmark circuits were used in our experimental work. An extensive effort was expended to collect this suite of large benchmark circuits. The details of each benchmark circuit are shown in Table 2, which provides the circuit name, size (in 4-LUTs, D flip flops, and I/O count), a rough description of the functionality, the source of the circuit, and the synthesis tools used to produce the netlist. We show these details of the benchmark circuits because we feel that the MCNC circuits that have been used so far in MFS architecture studies are insufficient in terms of size and variety to 'stress' different architectures and the mapping tools used. Specifically, we found that they are easier to partition and map compared to the other real circuits that we use in this work.
Routing Architecture Description and Routing Algorithms
In this Section we describe the partial crossbar and HCGP architectures. For each architecture, we briefly describe an architecture-specific inter-chip router.
Architectural Description and Routing for the Partial Crossbar
The partial crossbar architecture [Butt91] [Butt92] [Varg93] is used in logic emulators produced by Quickturn Design Systems [Quic96] . A partial crossbar using four FPGAs and three FPIDs is shown in Figure 3 . The pins in each FPGA are divided into N subsets, where N is the number of FPIDs in the architecture. All the pins belonging to the same subset number in different FPGAs are connected to a single FPID. Note that any circuit I/Os will have to go through FPIDs to reach FPGA pins. For this purpose, a certain number of pins per FPID (50) are reserved for circuit I/Os. Notice that we could have reserved the number of pins per FPID required for I/O signals based on circuit requirements. For this scheme, the number of reserved pins per FPID (for I/O signals) would be variable across different circuits. We did not use this scheme because it will not make any significant difference in architectural comparison results. Also, our present scheme is easier to implement. As for the number of pins per FPID (50) reserved for circuit I/O signals, we used this number to meet the maximum I/O requirement among the circuits in our benchmark suite.
The number of pins per subset (P t ) is a key architectural parameter that determines the number of FPIDs needed and the pin count of each FPID. The extremes of the partial crossbar architecture can be illustrated by considering a system with four FPGAs, and assuming 192 usable I/O pins per FPGA: a P t value of 192 will require a single 768-pin FPID that acts as a full crossbar. A P t value of 1 will require 192 4-pin FPIDs. Both of these cases are impractical.
A good value of P t should require low cost, low pin count FPIDs. For the above example, a P t value of 12 will require 16 48-pin FPIDs. When we consider FPID pins required for circuit I/Os we will need to use 64 or 96-pin FPIDs that are commercially available [ICub97] . When choosing a value of P t , we must ensure that number of usable I/Os per FPGA is evenly divisible by P t or at least the remainder should be a very small number so that we can use such pins for routing high fanout inter-FPGA nets. In this work we set P t = 17 which leaves five pins per FPGA to be used as global lines in the partial crossbar architecture. These global lines are used for routing global nets like reset, clock and other very high fanout nets in the circuit. Our previous research [Khal97] has shown that, for real circuits, the routability and speed of the partial crossbar is not affected by the value of P t used. But this is contingent upon using an intelligent inter-chip router that understands the architecture and routes each inter-FPGA net using only two hops to minimize the routing delay. However, a practical constraint is that we should avoid using P t values that require expensive or even unavailable high pin count FPIDs.
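The arithmetic behind these examples can be checked with a short sketch. It assumes the simple model used in the four-FPGA example above: each FPID receives P t pins from every FPGA, and pins reserved for circuit I/Os are ignored. The function name is illustrative.

```python
def partial_crossbar_size(num_fpgas, pins_per_fpga, p_t):
    """Return (number of FPIDs, pins per FPID, leftover pins per FPGA).

    Pins reserved on each FPID for circuit I/Os are not counted here.
    """
    num_fpids = pins_per_fpga // p_t   # one FPID per pin subset
    fpid_pins = num_fpgas * p_t        # P t pins from every FPGA
    leftover = pins_per_fpga % p_t     # pins usable as global lines
    return num_fpids, fpid_pins, leftover

for p_t in (192, 1, 12, 17):
    print(p_t, partial_crossbar_size(4, 192, p_t))
# 192 -> (1, 768, 0)   a single full crossbar
# 1   -> (192, 4, 0)   192 four-pin FPIDs
# 12  -> (16, 48, 0)   sixteen 48-pin FPIDs
# 17  -> (11, 68, 5)   five leftover pins per FPGA for global lines
```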
Routing Algorithm for the Partial Crossbar
For any MFS architecture in general and for the partial crossbar in particular, it is important to use a routing algorithm that exploits architecture-specific features in order to obtain good results.
We have developed a routing tool, PCROUTE, for the partial crossbar architecture that gives excellent routability and speed results for all of our benchmark circuits. Irrespective of the value of P t , it achieves 100% routing completion and produces two-hop routing for all the nets in almost all circuits. For only two circuits, for the specific case of P t = 4, it produced multi-hop routing paths for a negligible number of nets (1 out of 991 nets for the first circuit and 3 out of 645 nets for the second). In practical terms, this means it gives almost optimal results for all of our benchmark circuits.
The PCROUTE algorithm works as follows: for each net (irrespective of fanout), it evaluates potential routing paths through all available FPIDs. It uses a suitable cost function to choose an FPID that will guarantee balanced usage of FPIDs and will preserve the most options for two-hop routing of subsequent nets to be routed. Consider a partial crossbar that consists of n FPGAs and m FPIDs. Consider an N-terminal net called M. Let F denote the set of FPGAs belonging to M, i.e. {f 1 , f 2 ,...., f N }.
Let A ik denote the number of available wires between FPGA i and FPID k. The routing cost of the net M through FPID k, C(M, k), is given by:
An FPID that has the lowest routing cost for the net M is chosen for routing that net. We show in [Khal99] that PCROUTE is equivalent in quality to other partial crossbar routers that have been proposed so far [Kim96] [Mak97a] [Lin97] . PCROUTE is better than [Mak97b] in terms of both speed and routability because that algorithm splits each multi-terminal into a set of two-terminal nets and routes them independently, leading to multiple hops and even possible routing failures.
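The FPID selection step can be sketched as follows. The cost used here is a hypothetical stand-in and not necessarily the cost function used by PCROUTE. It assumes that an FPID is infeasible if any FPGA on the net has no free wire to it, and that the cost is otherwise the sum of 1/A ik over the FPGAs on the net, which favors FPIDs with many free wires and so tends to balance usage.

```python
def routing_cost(net_fpgas, avail, k):
    """Hypothetical cost of routing a net through FPID k.

    avail[i][k] is the number of free wires between FPGA i and FPID k.
    """
    if any(avail[i][k] == 0 for i in net_fpgas):
        return float("inf")  # some FPGA on the net cannot reach FPID k
    return sum(1.0 / avail[i][k] for i in net_fpgas)

def route_net(net_fpgas, avail):
    """Route the net through the cheapest FPID; return its index or None."""
    num_fpids = len(avail[0])
    costs = [routing_cost(net_fpgas, avail, k) for k in range(num_fpids)]
    best = min(range(num_fpids), key=costs.__getitem__)
    if costs[best] == float("inf"):
        return None  # no single FPID offers a two-hop route
    for i in net_fpgas:
        avail[i][best] -= 1  # consume one wire per FPGA on the net
    return best

# Three FPGAs and three FPIDs; FPGA 0 has no free wire to FPID 2.
avail = [[3, 1, 0],
         [3, 2, 2],
         [3, 2, 2]]
print(route_net([0, 1, 2], avail))  # FPID 0 has the lowest cost
print(avail[0])                     # one wire to FPID 0 has been used
```

Because the whole multi-terminal net goes through one FPID, every source-sink path uses exactly two hops, which is the property the text attributes to PCROUTE.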
Architectural Description and Routing for HCGP
The HCGP architecture for four FPGAs and three FPIDs is illustrated in Figure 4. The pins in each FPGA are divided into two groups: hardwired connections and programmable connections. The pins in the first group connect to other FPGAs and the pins in the second group connect to FPIDs. The FPGAs are directly connected to each other using a complete graph topology, i.e. each FPGA is connected to every other FPGA. The connections between FPGAs are evenly distributed, i.e. the number of wires between every pair of FPGAs is the same. The FPGAs and FPIDs are connected in exactly the same manner as in a partial crossbar. As in the partial crossbar, any circuit I/Os will have to go through FPIDs to reach FPGA pins. For this purpose, a certain number of pins per FPID (50) are reserved for circuit I/Os. The direct connections between FPGAs can be exploited to obtain reduced cost and better speed. For example, consider a net that connects FPGA 1 to FPGA 3 in Figure 4. If there were no direct connections, as in the partial crossbar, we would have to use an FPID to connect the two FPGAs. This would cost extra delay and two extra FPID pins. A natural question to ask is: why not dispense with FPIDs and just use FPGAs connected as a completely connected graph, as investigated in [Kim96]? The answer is that routing multi-terminal nets in an FPGA-only architecture is expensive in terms of routability because in such an architecture a multi-terminal net requires many extra pins in the source FPGA, as illustrated in Figure 5(a). In Figure 5(a), two extra FPGA pins are used for routing a fanout 3 multi-terminal net. Since extra pins are scarce on an FPGA, this has an adverse effect on the routability of FPGA-only architectures. On the other hand, if we use an FPID for routing the same multi-terminal net, we do not need even a single extra FPGA pin, other than the FPGA pins needed to access the source and sinks of the net, as shown in Figure 5(b).
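The pin accounting behind this argument can be made concrete with a small sketch. It assumes that every sink of the net lies in a different FPGA and that, in the FPGA-only case, the source drives each sink over its own hardwired wire. The function names are illustrative.

```python
def fpga_pins_direct(fanout):
    """FPGA pins used when the source drives every sink FPGA directly."""
    source_pins = fanout  # one source pin per sink FPGA
    sink_pins = fanout    # one pin at each sink FPGA
    return source_pins + sink_pins

def fpga_pins_via_fpid(fanout):
    """FPGA pins used when the net fans out inside a single FPID."""
    source_pins = 1       # the FPID replicates the signal
    sink_pins = fanout
    return source_pins + sink_pins

def extra_source_pins(fanout):
    """Extra FPGA pins the FPGA-only route needs at the source."""
    return fpga_pins_direct(fanout) - fpga_pins_via_fpid(fanout)

print(extra_source_pins(3))  # 2, matching the fanout-3 net of Figure 5(a)
```

The FPID route trades these FPGA pins for fanout + 1 FPID pins, which is why the hybrid architecture keeps FPIDs for multi-terminal nets.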
A key architectural parameter in the HCGP architecture is the percentage of programmable connections, P p . It is defined as the percentage of each FPGA's pins that are connected to FPIDs (the remainder are connected to other FPGAs). If P p is too high it will lead to increased pin cost, if it is too low it will adversely affect routability. If P p is 0% the HCGP architecture degrades to a completely connected graph of FPGAs with no FPIDs used. If P p is 100% the HCGP architecture degrades to a standard partial crossbar. A key issue we address later is the best value of P p for obtaining minimum cost and good routability.
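To illustrate how P p divides the pins of each FPGA, the sketch below splits the pins into a programmable group and an evenly distributed hardwired group. It is a simplified model under assumptions of our own choosing: integer rounding is done by truncation, and pins that cannot be distributed evenly over the other FPGAs are reported as unassigned. The allocation actually used in the experiments (for example, how global lines are chosen) may differ.

```python
def hcgp_pin_split(num_fpgas, pins_per_fpga, p_p_percent):
    """Return (programmable pins, wires per FPGA pair, unassigned pins)."""
    programmable = pins_per_fpga * p_p_percent // 100  # pins going to FPIDs
    hardwired_budget = pins_per_fpga - programmable    # pins going to FPGAs
    others = num_fpgas - 1
    # Complete graph: the same number of wires to every other FPGA.
    wires_per_pair = hardwired_budget // others if others > 0 else 0
    unassigned = hardwired_budget - wires_per_pair * others
    return programmable, wires_per_pair, unassigned

print(hcgp_pin_split(4, 192, 60))   # hybrid: both groups are populated
print(hcgp_pin_split(4, 192, 0))    # complete graph of FPGAs, no FPID pins
print(hcgp_pin_split(4, 192, 100))  # standard partial crossbar
```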
Routing Algorithm for HCGP
The inter-chip routing algorithm for HCGP is similar to the partial crossbar routing algorithm in the sense that the same algorithm is used when routing nets through FPIDs. However, the difference here is that the router should also exploit the direct connections between FPGAs to minimize the number of FPGA and FPID pins used for routing and to minimize the net delay for critical inter-FPGA nets. A critical net is defined as an inter-FPGA net whose slack (when analyzed after partitioning, but before inter-FPGA routing) is less than the delay incurred for connecting two FPGAs via an FPID. We have developed a timing-driven inter-chip routing tool, called HROUTE_TD, that understands the HCGP architecture and gives excellent routability and speed results for all the benchmark circuits.
The main objectives of HROUTE_TD are to try to route all critical nets using direct connections and to route all other (non-critical) nets using no more than two hops for each source-sink path. Our experience has shown that net ordering, based on slack first and then fanout, is crucial for obtaining good routability and speed. Wherever possible, HROUTE_TD uses direct connections to minimize source-sink net delay when routing critical nets. The HROUTE_TD algorithm works as follows: We first try to route all critical two-terminal nets using the direct connections between FPGAs to minimize usage of pins and net delay. Next, we try to route all multi-terminal nets through FPIDs using a routing algorithm similar to that used in PCROUTE, described above in Section 3.1.1. Finally, the remaining (non-critical) two terminal nets are routed using FPGAs or FPIDs. Any nets that remain unrouted are processed by a maze router. A detailed description of HROUTE_TD is given in [Khal99] .
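The net classification and ordering described above can be sketched as follows. This is an illustration only, not the HROUTE_TD implementation. The `Net` record is hypothetical, the 25.6 ns threshold is the FPGA-to-FPGA delay through an FPID quoted in the results, and sorting higher-fanout nets first among nets of equal slack is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Net:
    name: str
    slack_ns: float     # slack after partitioning, before inter-FPGA routing
    num_terminals: int  # number of FPGAs the net touches

FPID_PATH_NS = 25.6  # delay for connecting two FPGAs via an FPID

def plan_routing(nets, fpid_delay_ns=FPID_PATH_NS):
    """Split nets into the three routing phases, each ordered by slack."""
    ordered = sorted(nets, key=lambda n: (n.slack_ns, -n.num_terminals))

    def is_critical(n):
        return n.slack_ns < fpid_delay_ns

    # Phase 1: critical two-terminal nets, routed on direct connections.
    phase1 = [n.name for n in ordered if n.num_terminals == 2 and is_critical(n)]
    # Phase 2: multi-terminal nets, routed through FPIDs.
    phase2 = [n.name for n in ordered if n.num_terminals > 2]
    # Phase 3: remaining two-terminal nets, routed via FPGAs or FPIDs.
    phase3 = [n.name for n in ordered if n.num_terminals == 2 and not is_critical(n)]
    return phase1, phase2, phase3

nets = [Net("a", 30.0, 2), Net("b", 5.0, 2), Net("c", 10.0, 4),
        Net("d", 1.0, 2), Net("e", 10.0, 6)]
print(plan_routing(nets))  # (['d', 'b'], ['e', 'c'], ['a'])
```

Nets left unrouted after the three phases would be handed to a maze router, which is outside the scope of this sketch.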
Experimental Results
In this Section we determine the effect of varying the value of P p on the routability and speed of the HCGP architecture and compare the partial crossbar and HCGP architectures.
HCGP Architecture: Analysis of P p
Recall the definition of P p, given in Section 3.2: it is the percentage of pins per FPGA used for programmable connections. P p is important because it affects the cost and routability of the HCGP architecture. Here we explore the effect of P p on the routability and speed of the HCGP architecture. We mapped the fifteen benchmark circuits to the HCGP architecture using five different values of P p (20%, 30%, 40%, 50% and 60%). The results are shown in Figure 6. The Y-axis represents the percentage of inter-FPGA nets routed and the X-axis represents the P p values. The first clear conclusion is that P p = 60% gives 100% routability for all the benchmark circuits. Notice that about two thirds of the circuits routed at P p <= 40%, and for the remaining one third, more than 90% of the nets routed. This implies that there is a potential for obtaining 100% routability for all circuits at P p = 40% if we use a routability-driven partitioner like the one used in [Kim96]. This would lead to a further reduction in pin cost for HCGP compared to the partial crossbar.
We conjecture that the P p value required for routing completion of a given circuit on HCGP depends upon how well the circuit structure 'matches' the topology of the architecture.
We also investigated the effects of P p on post-routing critical path delay. Table 3 shows the ten circuits that routed for P p < 60%. The first column shows the circuit name. In subsequent columns, the critical path delay of each circuit for different values of P p (20, 30, 40, 50, 60) is shown. A surprising conclusion is that (overall) the lower P p values have no significant effect on the critical path delay. Compared to the delay value at P p = 60%, for lower P p values the delay remained the same or decreased slightly (only 4% less on average and 12% less in the best case). For circuits where delay was reduced, one or two programmable connections on the critical paths were replaced by faster hardwired connections. Note that as P p is reduced, more hardwired connections are available. For circuits where delay remained the same, 'segments' on the critical path are part of very high fanout connections that have to be routed using FPIDs because of the lack of free pins (required for routing multi-terminal nets using hardwired connections). Even though more hardwired connections are available, they cannot be used for routing nets on the critical path. 
Comparison of HCGP and Partial Crossbar
The 15 benchmark circuits described in Table 2 were mapped to the partial crossbar and HCGP architectures using the experimental procedure described in Section 2. The results obtained are shown in Table 4 and Table 5 . In Table 4 , the first column shows the circuit name. The second column shows the number of FPGAs needed for implementing the circuit on each architecture (recall that we increase the MFS size until routing is successful). The third column shows the pin cost normalized to the number of pins used by the HCGP architecture and the fourth column shows the normalized critical path delay obtained for each architecture. Table 5 is similar to Table 4 except that it shows actual (un-normalized) pin cost and delay values.
The number of FPIDs used is not shown because it is constant for each architecture. All the results for the partial crossbar use P t = 17. The parameter P t determines the number of FPIDs required, and the number of FPGAs in the architecture determines the pin count of each FPID. We have shown that the value of P t used has no effect on the routability and speed of the partial crossbar [Khal97]. Therefore any value of P t can be used. However, for practical reasons, the value chosen should require FPIDs that have reasonable pin counts (about 400 pins or less, which are commercially available) for the largest partial crossbar required in our experiments. A reasonable choice in this respect is P t = 17.
The value of P p for the HCGP architecture was set to 60% to obtain good routability across all circuits, as discussed in Section 4.1. Notice that the parameter P t also applies to the programmable connections in the HCGP. For the same reasons as in the partial crossbar (given in the previous paragraph), we chose P t = 14 for the HCGP architecture. Also the number of global lines used in the HCGP architecture depends upon the MFS size (#FPGAs used) and the parameters P p and P t .
In our experiments (P p = 60%, P t = 14) the number of global lines used for the HCGP architecture varied from 5 to 15. Recall from Section 3.1 that the number of global lines for the partial crossbar is 5, corresponding to P t = 17. The number of global lines varies in HCGP because it depends upon both P p and P t, instead of just P t as in the partial crossbar architecture. In reviewing Table 4, consider the circuit mips64. The first partitioning attempt resulted in 14 FPGAs required to implement the circuit on the partial crossbar. However, the circuit was not routable on HCGP, and the partitioning was repeated after reducing the number of pins per FPGA specified to the partitioner by 5%. This resulted in 15 FPGAs required to implement the circuit. The second partitioning attempt was routable on the HCGP architecture because more 'free pins' were available in each FPGA for routing purposes. The pin cost for the partial crossbar was still more than that for HCGP because it uses many more programmable connections, and hence more FPID pins. A partial crossbar always requires one FPID pin for every FPGA pin; the HCGP architecture requires a lower ratio (0.6:1), as shown in the previous section.
Inspecting Table 4 , we can make several observations. First, the partial crossbar needs 20% more pins on average, and as much as 25% more pins compared to the HCGP architecture. Clearly, the HCGP architecture is superior to the partial crossbar architecture in terms of the pin cost metric. This is because the HCGP exploits direct connections between FPGAs to save FPID pins that would have been needed to route certain nets in partial crossbar. However, for routability purposes, the HCGP needs some free pins in each FPGA and may require repeated partitioning attempts for some circuits.
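The size of this saving can be related to the FPID-to-FPGA pin ratios with an idealized calculation. The sketch below assumes that the pin cost is simply the FPGA pins plus one FPID pin per programmable FPGA pin (a ratio of 1:1 for the partial crossbar and 0.6:1 for HCGP at P p = 60%). It ignores FPID pins reserved for circuit I/Os and the pin counts of actual FPID packages, so its numbers are not the entries of Table 4.

```python
def idealized_pin_cost(num_fpgas, pins_per_fpga, fpid_pins_per_fpga_pin):
    """FPGA pins plus FPID pins under a fixed FPID-to-FPGA pin ratio."""
    return num_fpgas * pins_per_fpga * (1.0 + fpid_pins_per_fpga_pin)

def cost_ratio(pc_fpgas, hcgp_fpgas, pins_per_fpga=192):
    """Idealized partial crossbar pin cost relative to HCGP."""
    pc = idealized_pin_cost(pc_fpgas, pins_per_fpga, 1.0)
    hcgp = idealized_pin_cost(hcgp_fpgas, pins_per_fpga, 0.6)
    return pc / hcgp

print(round(cost_ratio(14, 14), 3))  # 1.25: same number of FPGAs
print(round(cost_ratio(14, 15), 3))  # 1.167: HCGP needs one more FPGA
```

Under these assumptions the partial crossbar costs 25% more when both architectures use the same number of FPGAs, and the gap narrows when HCGP needs an extra FPGA to route.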
Table 4 also shows that the typical circuit delay is lower with the HCGP architecture: the HCGP gives significantly less delay for twelve circuits compared to the partial crossbar and about the same delay for the rest of the circuits. The reason is that the HCGP utilizes fast and direct connections between FPGAs whenever possible. Using the delay values from Table 1, we can show that the interconnection delay is much smaller (12.6 ns) if we use direct connections between FPGAs compared to the delay value (25.6 ns) when connecting two FPGAs through an FPID. Another interesting observation is that even for the circuits where the HCGP needs more FPGAs compared to the partial crossbar, it still gives a comparable or better delay value. This clearly demonstrates that the HCGP architecture is inherently faster due to the nature of its topology. It gives a significant speed-up, especially when we use timing-driven inter-FPGA routing. Table 5 shows the actual pin cost and delay values obtained for the partial crossbar and HCGP architectures. It is interesting that the estimated clock speeds for the partial crossbar architecture range from 20 MHz for the ochip64 circuit to 1.6 MHz for the mac64 circuit. This range is representative of the clock rates expected in MFSs [Quic96].
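The relation between critical path delay and estimated clock rate can be sketched as below, assuming the clock period equals the post-routing critical path delay. The delays of 50 ns and 625 ns are derived from the quoted clock rates under that assumption and are not read from Table 5.

```python
def clock_mhz(critical_path_ns):
    """Estimated clock rate if the period equals the critical path delay."""
    return 1000.0 / critical_path_ns

DIRECT_NS = 12.6    # FPGA to FPGA over a hardwired connection
VIA_FPID_NS = 25.6  # FPGA to FPGA through an FPID

print(clock_mhz(50.0))          # 20.0 MHz
print(clock_mhz(625.0))         # 1.6 MHz
print(VIA_FPID_NS - DIRECT_NS)  # about 13 ns saved per inter-FPGA connection
```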
Conclusions and Future Work
In this paper we have presented the Hybrid Complete-Graph and Partial-Crossbar (HCGP), a new routing architecture for multi-FPGA systems. Using an experimental approach, we evaluated and compared this architecture to the partial crossbar architecture and showed that it is superior in terms of pin cost and speed. To our knowledge, this is the first architectural study of board-level MFSs that considers post-routing critical path delay when evaluating the speed performance of different architectures. We explored a key parameter (P p ) associated with the HCGP architecture and experimentally determined its best value (60%) for obtaining good routability for a variety of circuits.
We believe that the HCGP architecture would give even better results if we used better mapping (CAD) tools for partitioning. A routability-driven partitioner, similar to the one used in [Kim96], may result in further reduced pin cost by making circuits routable for even lower values of P p (say 40%).
The HCGP architecture is suitable for single board MFSs using a maximum of about 25 FPGAs. As FPGA logic and pin capacities continue to rise, it makes sense to use single board systems using a few high capacity FPGAs to avoid the problems associated with using high pin count connectors for multi-board systems [Lewi97] . For applications where hundreds of FPGAs are needed, such as logic emulation, we could use 'clusters' of HCGPs interconnected using a hierarchical partial crossbar scheme [Butt92] . The hardwired connections, within each cluster and between different clusters, would still help in reducing the overall pin cost. Determining the P p value suitable for such hierarchical architectures is an open research problem. We will need extremely large benchmark circuits and appropriate CAD tools to explore hierarchical architectures.
