Index Terms-Partitioning, reconfigurable components, reconfigurable-computing, reconfigurable-systems, system-level.
I. INTRODUCTION
Field-Programmable Gate Arrays (FPGAs) are widely used for implementing digital circuits because they offer moderately high levels of integration and rapid turnaround time. Multi-FPGA systems (MFSs), which are collections of FPGAs joined together by programmable connections as illustrated in Figure 1, are used when the logic capacity of a single FPGA is insufficient, and when a quickly re-programmed system is desired. The typical applications of MFSs are for logic emulation [1], rapid prototyping [2], and reconfigurable custom computing machines [3].
[...] routing architecture, called the Hybrid Complete-Graph and Partial-Crossbar (HCGP), was proposed by Khalid [5] and was shown to provide superior speed and cost compared to partial crossbar. The HCGP architecture uses a mixture of hardwired and programmable connections between the FPGAs whereas the partial crossbar uses only programmable connections. The speed and cost of the HCGP and partial crossbar architectures were compared experimentally, by mapping a set of 15 large benchmark circuits into each architecture. A customized set of partitioning and inter-chip routing tools was developed, with particular attention paid to architecture-appropriate inter-chip routing algorithms. Using the experimental approach, a key architecture parameter of HCGP, called percentage of programmable connections (Pp), was also analyzed. Results showed that a Pp value of 60% provided good routability for a variety of circuits. The HCGP [...]
[...] behind the HCGP architecture, we first need to study the partial crossbar architecture. A partial crossbar using four FPGAs and three FPIDs is shown in Figure 2. The pins in [...]
The HCGP architecture for four FPGAs and three FPIDs is illustrated in Figure 3. The I/O pins in each FPGA are divided into two groups: hardwired connections and programmable connections. The pins in the first group connect to other FPGAs and the pins in the second group connect to FPIDs. The FPGAs are directly connected to each other using a complete graph topology, i.e. each FPGA is connected to [...]
[...] speed, and cost. If Pp is too high it will lead to increased pin cost and lower speed; if it is too low it will adversely affect routability. If [...]
3. We can increase both the total number of FPGAs and the logic and pin capacity of each FPGA.
Scalability issue 1 was addressed by using a hierarchical architecture such as the Hardwired Clusters Partial Crossbar (HWCP), proposed by Khalid [7]. Scalability issue 2 has not been explored so far for the HCGP architecture and is the subject of this paper. Note that scalability issue 3 is a combination of scalability issues 1 and 2.
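The division of each FPGA's I/O pins into a hardwired group and a programmable group, controlled by Pp, can be made concrete with a small calculation. The following is a minimal sketch and not the authors' tool: the name `hcgp_pin_budget`, the rounding rule, and the assumption that the hardwired pins are shared evenly among the other FPGAs of the complete graph are illustrative choices that do not come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HcgpPinBudget:
    programmable_pins: int    # pins per FPGA that connect to FPIDs
    hardwired_pins: int       # pins per FPGA that connect directly to other FPGAs
    wires_per_fpga_pair: int  # hardwired wires between each pair of FPGAs
    unused_pins: int          # pins left over by the even split (assumption)


def hcgp_pin_budget(num_fpgas: int, pins_per_fpga: int, pp_percent: float) -> HcgpPinBudget:
    """Split one FPGA's I/O pins for a given Pp (hypothetical helper)."""
    if num_fpgas < 2:
        raise ValueError("need at least two FPGAs")
    if not 0 <= pp_percent <= 100:
        raise ValueError("Pp must be between 0 and 100")
    # Programmable group: Pp percent of the pins, rounded (assumed rule).
    programmable = round(pins_per_fpga * pp_percent / 100)
    remaining = pins_per_fpga - programmable
    # Complete graph: each FPGA has num_fpgas - 1 neighbours; assume an even split.
    per_pair = remaining // (num_fpgas - 1)
    hardwired = per_pair * (num_fpgas - 1)
    return HcgpPinBudget(programmable, hardwired, per_pair, remaining - hardwired)


if __name__ == "__main__":
    # 8 FPGAs with 1024 I/O pins each at Pp = 60%.
    print(hcgp_pin_budget(8, 1024, 60))
```

Under these assumptions, Pp = 100% leaves no hardwired pins, which corresponds to an architecture with only programmable connections, and Pp = 0% leaves only the direct FPGA-to-FPGA wires.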
As FPGA logic and pin capacities continue to rise, it makes sense to use a limited number (say, 16 or fewer) of very high capacity FPGAs for creating MFSs that can be used for logic emulation or rapid prototyping of small to medium sized designs. This way we avoid the costs associated with using high pin count connectors and expensive boards for multi-board systems, which would be needed if we used many tens or a few hundreds of smaller FPGAs. For handling very large designs, processor-based emulators such as Cadence's Palladium are proving to be more effective than FPGA-based emulators [8].
IV. EXPERIMENTAL OVERVIEW
To evaluate the scalability of the HCGP architecture for large FPGAs, we first had to generate synthetic netlists similar to post-partition netlists produced for real multi-million gate designs. For the experiment we chose netlists consisting of 6, 8, 12, and 16 FPGAs. In order to resemble the real netlists, the netlist generation process was not completely random but followed some statistical patterns derived from real multi-million gate design netlists.
First, consider the issue of the net fanout distribution in the synthetic netlist. We took real design partitioning results and collected statistical data on the distribution of nets according to fanout. From different types of real design partitioning results we determined the typical distribution of nets connecting two FPGAs, three FPGAs, four FPGAs, etc. We reproduced the same distribution while randomly generating the connections in the synthetic netlists.
Second, post-partition netlists may vary in how evenly the connections are distributed between the FPGAs. A netlist may consist of FPGAs that have approximately the same number of connections to each other. In a more typical case there are clusters of tightly connected FPGAs, where there are more connections between FPGAs inside a cluster than between FPGAs that belong to different clusters. In our experiment we generated four types of netlists with different connection patterns. In the first pattern all FPGAs were connected to each other by approximately even numbers of nets. In the second pattern the netlists consisted of tightly connected two-FPGA clusters. In the third pattern the netlists consisted of tightly connected three-FPGA clusters. Finally, the last pattern included one cluster of two tightly connected FPGAs with the rest of the FPGAs connected to each other by approximately even numbers of nets. Note that this issue deals with the amount of "locality" in post-partition netlists. Replicating the "locality" of real post-partition design netlists in synthetic netlists is a very elusive task and there has been little success in this respect in research efforts to date [9]. Fortunately, synthetic netlists produced using our approach are usually much more difficult to map compared to real netlists. Hence they yield a conservative evaluation of architecture and/or mapping CAD tools (rather than overly optimistic evaluation results).
Each netlist was derived using FPGA I/O pin utilization ranging from 50% to 100%. Then each generated netlist was sequentially mapped into HCGP architectures with Pp ranging from 0% to 100%. The numbers of FPGA and FPID I/O pins were assumed to be 1024 and 500 respectively. For every architecture, we tried to route the mapped netlist. We developed an architecture-specific router that restricted the number of chip hops for routing a net to one or two. A chip hop is defined as a pin-to-pin connection between two chips. Hence, the routability of the HCGP architecture was evaluated for different combinations of (a) Pp value, (b) pin utilization per FPGA, (c) total number of FPGAs (varied from 6 to 16), and (d) FPGA interconnection pattern. The goal was to find a minimum value of Pp that provides routability for all cases depending on the I/O pin utilization.
V. RESULTS AND CONCLUSIONS
In this section, we present the experimental results obtained by mapping synthetic post-partition netlists to different configurations of the HCGP architecture. Recall from previous sections that our objective is to evaluate the routability of the HCGP architecture using large FPGAs. We are also interested in the value of Pp that results in routing completion in most cases.
The experimental results are shown in Figure 4, which consists of four graphs, each characterized by the number of FPGAs used in the HCGP architecture. We used synthetic post-partition netlists obtained using 6, 8, 12, and 16 FPGAs and mapped each to an HCGP architecture that used the same number of FPGAs. The FPGA pin utilization (shown on the X-axis) used in the synthetic netlist was varied from 50% to 100%. Each pin utilization case was mapped to the HCGP architecture using different values of Pp (shown on the Y-axis). There were four different types of netlists used: evenly connected FPGAs, a collection of 2-FPGA clusters, a collection of 3-FPGA clusters, and finally one 2-FPGA cluster with the rest of the FPGAs evenly connected.
The results show that a Pp value of 60% is sufficient for achieving routing completion for all types of netlists provided we restrict the FPGA pin utilization to 82%. This is in agreement with previous research results [1] if we consider that in real design netlists, the average pin utilization per FPGA would likely be less than 80%. We have confirmed this assumption by pin utilization statistics collected on ten real designs. Recall that for netlists used in our experiments, FPGA pin utilization of 82% implies every single FPGA has 82% of its pins used. This is even more conservative than what would be expected in real design netlists.
We can conclude from the experimental results that the HCGP architecture is scalable using very large FPGAs, such as Xilinx Virtex II [10], and can be used to handle multi-million gate designs.
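The netlist generation procedure of Section IV (sample each net's size from a distribution, then bias its endpoints toward a cluster) can be sketched as follows. This is only an illustration under stated assumptions: `generate_netlist`, the `locality` probability, the grouping of FPGAs into clusters by index, and any weights passed in are hypothetical and are not the authors' generator or their measured statistics. Pin utilization is not modeled here, and the fourth pattern (one 2-FPGA cluster with the rest evenly connected) is left out for brevity.

```python
import random
from typing import Dict, List, Tuple

Net = Tuple[int, ...]  # sorted indices of the FPGAs that one net connects


def generate_netlist(num_fpgas: int, num_nets: int,
                     net_size_weights: Dict[int, float],
                     cluster_size: int = 1, locality: float = 0.8,
                     seed: int = 0) -> List[Net]:
    """Randomly generate inter-FPGA nets (hypothetical sketch).

    net_size_weights maps "number of FPGAs on a net" to a relative frequency.
    cluster_size=1 gives the evenly connected pattern; 2 or 3 gives clusters
    of consecutive FPGA indices that attract most of each other's nets.
    """
    if min(net_size_weights) < 2 or max(net_size_weights) > num_fpgas:
        raise ValueError("net sizes must lie between 2 and num_fpgas")
    rng = random.Random(seed)
    sizes = sorted(net_size_weights)
    weights = [net_size_weights[s] for s in sizes]
    nets: List[Net] = []
    for _ in range(num_nets):
        size = rng.choices(sizes, weights=weights)[0]  # follow the size distribution
        first = rng.randrange(num_fpgas)
        members = {first}
        while len(members) < size:
            # Unused FPGAs that share a cluster with the first endpoint.
            local = [f for f in range(num_fpgas)
                     if f not in members and f // cluster_size == first // cluster_size]
            if local and rng.random() < locality:
                members.add(rng.choice(local))
            else:
                members.add(rng.choice([f for f in range(num_fpgas) if f not in members]))
        nets.append(tuple(sorted(members)))
    return nets
```

For example, `generate_netlist(12, 1000, {2: 0.7, 3: 0.2, 4: 0.1}, cluster_size=3)` would produce a netlist in the style of the third pattern; the weights shown are placeholders, not values from real designs.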
[Figure 4, panel (b): 8 FPGAs. X-axis: FPGA pin utilization; Y-axis: Pp.]
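The search for the minimum Pp that gives routability at each pin utilization can be organized as a simple sweep. The sketch below assumes a caller-supplied predicate `routes(utilization, pp)` that stands in for generating a netlist, mapping it, and running the router; that predicate and the name `min_pp_per_utilization` are hypothetical and do not describe the authors' tools.

```python
from typing import Callable, Dict, Iterable, Optional


def min_pp_per_utilization(utilizations: Iterable[int],
                           pp_values: Iterable[int],
                           routes: Callable[[int, int], bool]) -> Dict[int, Optional[int]]:
    """For each pin utilization, return the smallest Pp that routes, or None."""
    candidates = sorted(pp_values)  # try the cheapest (lowest) Pp first
    result: Dict[int, Optional[int]] = {}
    for util in utilizations:
        result[util] = next((pp for pp in candidates if routes(util, pp)), None)
    return result
```

Repeating this sweep for each FPGA count and each connection pattern would give one curve per case, in the spirit of the graphs of Figure 4.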
