Abstract: Multi-FPGA systems (MFS) are indispensable for emulating multi-million-gate integrated circuits (ICs) for functional design verification before IC fabrication. However, with every new generation of FPGAs, the ratio of logic capacity to the number of input/output (I/O) pins increases. Consequently, the limited FPGA I/O pins become a constraint when the number of inter-FPGA nets greatly exceeds the number of inter-FPGA physical tracks. This problem is addressed by serializing multiple cut nets using time-division multiplexing (TDM). Besides I/O resources, the routing architecture also has a strong effect on the cost, speed and routability of an MFS. In this paper, we compare the system performance achieved by two routing architectures, Completely Connected Graph (CCG) and Torus, when time multiplexing is employed. Six benchmark circuits have been partitioned such that the per-FPGA logic utilization is up to 60%. Even with such moderate logic utilization, the required I/O usage is observed to be 5 to 11 times the number of available I/O pins, which makes time multiplexing necessary. Experimental results show that CCG achieves higher performance than Torus for the given range of TDM ratios. However, Torus can provide a better cost/performance ratio at higher TDM ratios.
I. INTRODUCTION
Multi-FPGA systems (MFS) offer the potential to deliver higher-performance solutions for computationally intensive tasks, logic emulation, rapid prototyping and reconfigurable custom computing machines [1]. However, the large delays in inter-FPGA communication and the limited number of I/O pins per FPGA restrict the overall system performance [2]. Mapping a design to an MFS is mainly divided into two steps. In the first step, the design is partitioned into several parts. A successful partitioning approach ensures that every part fits within the logic capacity of a single FPGA in the MFS. The second step routes the inter-FPGA nets according to the available physical tracks, the I/O resources of each FPGA and the routing architecture of the MFS. Over the past few years, the logic capacity per FPGA has been increasing at a much faster pace than the number of I/O pins.
Time-Division-Multiplexing (TDM) is the key to resolving the limited I/O resource problem [3]. In this technique, multiple inter-FPGA nets are multiplexed onto a single track on the MFS board. The number of inter-FPGA nets per track is called the TDM ratio.
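The relationship between cut nets, physical tracks and the TDM ratio can be illustrated with a minimal sketch. The function names and the numbers in the usage lines are hypothetical and are not taken from the experiments; the sketch only assumes that every track carries at most `tdm_ratio` nets.

```python
def min_tdm_ratio(cut_nets: int, tracks: int) -> int:
    """Smallest TDM ratio that fits `cut_nets` nets onto `tracks` physical tracks."""
    if tracks <= 0:
        raise ValueError("at least one physical track is required")
    # Integer ceiling division; a ratio of 1 means no multiplexing is needed.
    return max(1, -(-cut_nets // tracks))


def tracks_needed(cut_nets: int, tdm_ratio: int) -> int:
    """Number of physical tracks occupied when each track carries `tdm_ratio` nets."""
    if tdm_ratio <= 0:
        raise ValueError("the TDM ratio must be positive")
    return -(-cut_nets // tdm_ratio)


# Hypothetical example: 540 cut nets leaving an FPGA with 108 usable pins.
print(min_tdm_ratio(540, 108))  # 5
print(tracks_needed(540, 8))    # 68
```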
Over the past years, TDM ratios have also been increasing. However, increasing the TDM ratio degrades MFS performance. The choice of the MFS routing architecture also has a significant effect on the system frequency. The routing architecture of an MFS is the manner in which the FPGAs, fixed wires and/or programmable interconnect chips are connected together. In this paper, we use two routing architectures: Completely Connected Graph (CCG) and Torus. Fig. 1(a) shows that in CCG, every FPGA is directly connected to all the other FPGAs, whereas in the Torus topology each FPGA is connected only to its horizontally and vertically adjacent neighbors. In addition, the peripheral FPGAs wrap around in the horizontal and vertical directions and are connected to the FPGAs on the opposite side of the array.
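The two topologies can be written down as sets of board-level links. The following is a minimal sketch; the row-major FPGA numbering and the function names are assumptions made for illustration only.

```python
from itertools import combinations


def ccg_links(n):
    """CCG: one link between every pair of the n FPGAs."""
    return set(combinations(range(n), 2))


def torus_links(rows, cols):
    """Torus: each FPGA links to its right and lower neighbor, with wrap-around."""
    links = set()
    for r in range(rows):
        for c in range(cols):
            here = r * cols + c
            right = r * cols + (c + 1) % cols
            down = ((r + 1) % rows) * cols + c
            for other in (right, down):
                if other != here:  # a 1-wide dimension would wrap onto itself
                    links.add((min(here, other), max(here, other)))
    return links


print(len(ccg_links(9)))       # 36 links for 9 FPGAs
print(len(torus_links(3, 3)))  # 18 links for a 3x3 array
```

For the same number of FPGAs, CCG needs more board links than Torus, while Torus requires some nets to pass through intermediate FPGAs.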
Extensive research has been carried out in the area of pin multiplexing. However, none of it has explored the combined impact of routing architecture and multiplexing on the system clock frequency. In this paper, the achieved performance of time multiplexing is compared for a given range of TDM ratios using both the CCG and Torus routing architectures.
The rest of the paper is organized as follows. In Section II, related previous work is discussed. In Section III, the design flow for time-multiplexed MFS is described in detail. In Section IV, the experimental results are presented and analyzed. Finally, we conclude the paper in Section V.
II. RELATED WORK
A substantial amount of research has been done in the field of MFS and time multiplexing. Nevertheless, to our knowledge, no work so far has evaluated and compared the system performance of different routing architectures using real benchmark circuits.
In [6] , system performance is compared for a design running on HAPS-70 and HAPS-80 when pin multiplexing is employed. However, the effect of routing architectures is not discussed.
In [8] , the delay characteristics of multi-FPGA system and 3D FPGA are compared. However, the impact of time multiplexing on the two architectures is not discussed.
In [4], IBM's Bluegene/Q project is mapped on a Virtex-5-only platform and its performance is studied for a wide range of TDM ratios. However, the effect of a different routing architecture is not considered.
In [2], the complexity, usability and data rate of a multi-FPGA board using different multiplexing techniques are compared. However, the number of FPGAs is restricted to two.
The Microsoft Catapult project is discussed in [9]. It investigated the use of multiple FPGAs to improve performance and reduce power consumption in the datacenter. However, the possible performance improvement from multiplexing has not been explored.
Maxwell [10] is a 32-way IBM Bladecentre containing 64 Xilinx Virtex-4s connected using InfiniBand cables. It targeted HPC rather than datacenter workloads and demonstrated the achieved system performance. However, it did not discuss time multiplexing.
Another aspect of time-multiplexed MFS is presented in [11], which proposed a concentric FPGA structure that gives equal-length connections between FPGA pins and thereby enables wave-pipelined pin multiplexing. However, the proposed system has not been evaluated using real benchmark circuits.
III. DESIGN FLOW FOR TIME-MULTIPLEXED MFS
In order to evaluate and compare the two time multiplexed routing architectures, an experimental platform was developed that allowed optimized mapping of real sequential circuits.
A. Experimental Procedure
The experimental procedure for mapping a circuit to the given architecture is presented in Fig. 2 . The FPGA used in this research is the Xilinx Spartan-3E FPGA XC3S100E, which consists of 1920 4-LUTs and flip-flops and 108 I/O pins. Previous research [5] has shown that routing architecture evaluation results are scalable when larger benchmark circuits and larger FPGAs are used.
The process starts with the ABC [12] technology mapper, which reads a gate-level netlist in blif format, maps it into 4-LUTs using exhaustive cut enumeration and writes the LUT-based netlist in .bench format. In the next step, a translator developed for this work converts the .bench output of the ABC mapping step into the .hgr hypergraph format. The .hgr hypergraph is then partitioned using hMetis [13], which assigns nodes to k different partitions so that the number of edges between partitions is minimized. hMetis accepts the .hgr file and produces an output .part.k file that stores the results of the k-way partitioning. For this research, every benchmark circuit is partitioned using khMetis. Resulting partitions are considered acceptable only if every partition contains no more than 60% of the 4-LUTs available in the given FPGA.
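The translation and acceptance steps above can be sketched as follows. This is not the translator used in this work. It assumes that each gate line in the .bench file has the shape `out = LUT 0x8 ( a, b )`, and that every gate becomes one hypergraph vertex and every net with at least two attached gates becomes one hyperedge. The .hgr layout (a header with the hyperedge and vertex counts, then one line of 1-based vertex indices per hyperedge) and the .part.k layout (one partition number per vertex) follow the hMetis file formats. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed gate line shape: "<out> = LUT 0x<hex> ( in1, in2, ... )".
GATE = re.compile(r"^\s*(\S+)\s*=\s*\w+(?:\s+0x[0-9a-fA-F]+)?\s*\((.*)\)\s*$")


def bench_to_hgr(bench_text):
    """Convert a LUT-level .bench netlist into hMetis .hgr text."""
    vertex_id = {}  # gate output name -> 1-based vertex index
    fanins = {}     # gate output name -> list of input net names
    for line in bench_text.splitlines():
        match = GATE.match(line.split("#", 1)[0])  # drop comments
        if match:  # INPUT(...) / OUTPUT(...) lines do not match
            out, args = match.group(1), match.group(2)
            vertex_id[out] = len(vertex_id) + 1
            fanins[out] = [a.strip() for a in args.split(",") if a.strip()]
    nets = {}  # net name -> set of vertices touching that net
    for out, inputs in fanins.items():
        nets.setdefault(out, set()).add(vertex_id[out])
        for name in inputs:
            nets.setdefault(name, set()).add(vertex_id[out])
    edges = [sorted(v) for v in nets.values() if len(v) >= 2]
    lines = [f"{len(edges)} {len(vertex_id)}"]
    lines += [" ".join(map(str, e)) for e in edges]
    return "\n".join(lines) + "\n"


def partitions_fit(part_text, lut_capacity=1920, max_util_percent=60):
    """Check a .part.k file against the utilization limit.

    Simplification: every vertex is counted as one 4-LUT.
    """
    sizes = Counter(int(tok) for tok in part_text.split())
    limit = lut_capacity * max_util_percent // 100  # 1152 for the defaults
    return all(size <= limit for size in sizes.values())


sample = """INPUT(a)
INPUT(b)
OUTPUT(n3)
n1 = LUT 0x8 ( a, b )
n2 = LUT 0x6 ( a, n1 )
n3 = LUT 0xe ( n1, n2 )
"""
print(bench_to_hgr(sample), end="")
```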
In case of CCG, any arbitrary placement is acceptable because in this architecture every pair of FPGAs is uniformly connected. For Torus, hMetis provides net-cut optimized placement which is followed by flat clustering that performs wirelength optimization and clustering at the same placement level.
An architecture-specific router was developed. It is scalable, so the number of FPGAs can be increased or decreased according to the MFS size. The router not only tries to find the shortest path for each net but also addresses issues such as routing congestion. The routing problem in Torus is slightly more complicated than in CCG, because the FPGAs are used for both logic and routing. After a circuit is mapped to the Torus architecture, each FPGA has a number of I/O pins reserved for primary inputs and outputs. The rest of the pins are used for inter-FPGA communication and for route-throughs. The inter-FPGA router first routes all nets that do not require route-throughs, i.e. the nets connected through direct wires. It then routes the remaining nets, which require route-throughs, using an algorithm that lists all possible shortest paths between source and target. The shortest path chosen is the one that leaves minimum congestion for the nets still to be routed and uses the route-through FPGAs with the maximum number of available pins. A similar routing approach was employed for CCG to route all the inter-FPGA nets.
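The route-through selection step can be sketched as a breadth-first enumeration of all shortest paths, followed by a pin-availability score. This is not the router developed for this work. The scoring rule (maximize the smallest free-pin count over the route-through FPGAs), the row-major FPGA numbering and all names are assumptions made for illustration.

```python
from collections import deque


def torus_adj(rows, cols):
    """Adjacency lists of a rows x cols torus, FPGAs numbered row-major."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            here = r * cols + c
            nbrs = {((r + dr) % rows) * cols + (c + dc) % cols
                    for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0))}
            nbrs.discard(here)
            adj[here] = sorted(nbrs)
    return adj


def all_shortest_paths(adj, src, dst):
    """List every shortest path from src to dst using BFS parent tracking."""
    dist, parents, queue = {src: 0}, {src: []}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                parents[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                parents[v].append(u)  # another equally short way to reach v
    if dst not in dist:
        return []

    def unwind(node):
        if node == src:
            return [[src]]
        return [p + [node] for par in parents[node] for p in unwind(par)]

    return unwind(dst)


def pick_path(paths, free_pins):
    """Assumed heuristic: prefer the path whose tightest route-through FPGA
    still has the most free pins."""
    def score(path):
        return min((free_pins[f] for f in path[1:-1]), default=float("inf"))
    return max(paths, key=score)


adj = torus_adj(3, 3)
paths = all_shortest_paths(adj, 0, 4)
free_pins = {f: 10 for f in adj}
free_pins[3] = 20  # hypothetical: FPGA 3 has more spare pins than FPGA 1
print(pick_path(paths, free_pins))  # [0, 3, 4]
```

After a path is committed, the free-pin counts of its route-through FPGAs would be reduced before the next net is routed, so that later nets see the updated congestion.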
Once the inter-FPGA routing is done, it is assumed that each sub-circuit can be successfully placed and routed within an FPGA. This assumption is based on previous work [14], which shows that the placement and routing of a circuit within an FPGA will generally succeed provided the FPGA logic utilization is restricted to less than 70%. For this reason, multi-FPGA partitioning restricts the partition size to at most 60% of the FPGA logic capacity. However, even with efficient partitioning and moderate logic utilization, the number of I/O pins required for inter-FPGA nets tends to be 5 to 11 times higher than the number of I/O pins available per FPGA. To address this issue, a multiplexer is developed in which multiple compatible design signals (nets) are assembled and serialized through the same board trace and then de-multiplexed at the receiving FPGA. The connection between the two FPGAs is designed to be single-ended for the multiplexed signals. In this study, we have calculated the emulation time and system frequency for TDM ratios ranging from 1 to 49.
An MFS static timing analyzer (STA) was developed to calculate the critical path delay at different levels of circuit implementation i.e. pre-partitioning CPD, post-partitioning CPD (CPD_PP) and post-routing CPD (CPD_PR). The different delay values used by the analyzer are obtained from the Xilinx Spartan-3E FPGA data sheet. CLB-to-CLB delay is approximated as a constant, because individual FPGA placement and routing is not performed in this research.
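The idea behind such an analyzer can be sketched as a longest-path computation over a combinational netlist. This is not the STA developed for this work. It assumes a constant delay per logic node and a single lumped penalty whenever a net crosses between FPGAs; the delay values in the usage lines are hypothetical rather than data sheet figures.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def critical_path_delay(fanins, part, clb_delay, hop_delay):
    """Longest arrival time over a combinational DAG.

    fanins: node -> list of driver nodes
    part:   node -> index of the FPGA holding the node
    """
    arrival = {}
    # static_order() yields every driver before the nodes it feeds.
    for node in TopologicalSorter(fanins).static_order():
        worst = 0.0
        for drv in fanins.get(node, ()):
            t = arrival[drv]
            if part[drv] != part[node]:
                t += hop_delay  # the net leaves one FPGA and enters another
            worst = max(worst, t)
        arrival[node] = worst + clb_delay
    return max(arrival.values())


netlist = {"c": ["a", "b"], "d": ["c"]}
one_fpga = {"a": 0, "b": 0, "c": 0, "d": 0}
two_fpgas = {"a": 0, "b": 0, "c": 0, "d": 1}
print(critical_path_delay(netlist, one_fpga, 1.0, 10.0))   # 3.0
print(critical_path_delay(netlist, two_fpgas, 1.0, 10.0))  # 13.0
```

The first call corresponds to a pre-partitioning estimate and the second to a post-partitioning one; a post-routing estimate would additionally charge a route-through penalty for each intermediate FPGA on the routed path.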
B. Time Multiplexing in MFS
Time-Division-Multiplexing requires multiple compatible nets to be assembled and serialized through the same single-ended board trace and then de-multiplexed at the destination FPGA. Using I/O flip-flops makes the timing of inter-FPGA connections more predictable and generally faster [2].
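The serialize and de-multiplex behavior can be illustrated with a small behavioral model. It is a sketch rather than the multiplexer hardware: it assumes one time slot per multiplexer clock cycle and groups nets onto traces in index order. The function name is hypothetical.

```python
def tdm_transfer(net_values, tdm_ratio):
    """Send one system-clock cycle's net values over shared traces.

    Returns the values recovered at the destination FPGA and the number of
    board traces used.
    """
    # Group the nets: each trace carries up to `tdm_ratio` of them.
    traces = [net_values[i:i + tdm_ratio]
              for i in range(0, len(net_values), tdm_ratio)]
    received = [None] * len(net_values)
    for slot in range(tdm_ratio):           # one multiplexer clock cycle per slot
        for t, group in enumerate(traces):
            if slot < len(group):
                wire = group[slot]          # source mux selects net `slot`
                received[t * tdm_ratio + slot] = wire  # destination register latches it
    return received, len(traces)


values = [1, 0, 1, 1, 0, 0, 1]
received, traces_used = tdm_transfer(values, 3)
print(received == values, traces_used)  # True 3
```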
The synchronous method of time multiplexing is often system-synchronous [2]. The multiplexer/de-multiplexer clock (mux_clk) and the system clock (sys_clk) for the FPGAs are mutually synchronous, i.e. they are derived from one clock source through a phase-locked loop (PLL) and are phase aligned. The mux_clk frequency is a multiple of the sys_clk frequency. According to [1], the maximum delay T on a multiplexed connection over the MFS inter-FPGA tracks is the sum of the output pad delay Tout, the PCB board trace delay Ttrace, the input pad delay Tin and a tolerance delay Ttolerance (safe margin for MFS clock distribution etc.), as shown in (1).
$$T = T_{out} + T_{trace} + T_{in} + T_{tolerance} \qquad (1)$$
CPD_PR is calculated as the sum of the logic delay, output pad delay, PCB trace delay, route-through delay (if any) and input pad delay. Adding a safe margin of 20% to CPD_PR, T can be rewritten as (2).
$$T = 1.20 \times \mathrm{CPD\_PR} \qquad (2)$$
T is further composed of two types of delays: internal delays TPR_internal and external delays TPR_external. TPR_internal is primarily the sum of the CLB logic delay, intra-FPGA routing delay and route-through delay (if any), whereas TPR_external is the sum of the primary input delay, primary output delay, PCB delay, FPGA input pad delay and FPGA output pad delay. When multiplexing is employed, TPR_internal remains unaffected but TPR_external is scaled up according to the TDM ratio. Since TPR_external is always greater than TPR_internal, multiplexing has a significant effect on the system clock frequency.
The system clock frequency sys_clk is calculated by (3).
C. Evaluation Metrics
The speed of an MFS is determined predominantly by the latency bound, i.e. the length of the post-routing critical path obtained after a circuit has been placed and routed at the inter-chip level [3] [7]. CPD_PR is governed by the internal design delay and the system routing delay. Compared to the internal delay, the board routing delay has a larger impact on the overall system performance. The routing architecture employed mainly dictates the system routing delay. Multiplexing only scales up the external routing delay and, in turn, the emulation time of the MFS. The system clock frequency sys_clk is the reciprocal of CPD_PR with an added safe margin of 20%.
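The dependence of sys_clk on the TDM ratio can be sketched numerically. The model below is an assumption based on the description in this section and is not a formula taken from the paper: the internal delay is kept fixed, the external delay is multiplied by the TDM ratio, and a 20% margin is applied to the sum. All delay values are hypothetical.

```python
def sys_clk_mhz(t_internal_ns, t_external_ns, tdm_ratio, margin=0.20):
    """Assumed model: period = (1 + margin) * (internal + ratio * external)."""
    if tdm_ratio < 1:
        raise ValueError("the TDM ratio must be at least 1")
    period_ns = (1.0 + margin) * (t_internal_ns + tdm_ratio * t_external_ns)
    return 1e3 / period_ns  # a period in ns corresponds to 1000/period MHz


# Hypothetical delays: the second design has extra route-through (internal) delay.
for ratio in (1, 9, 25, 49):
    direct = sys_clk_mhz(10.0, 40.0, ratio)
    routed = sys_clk_mhz(35.0, 40.0, ratio)
    print(ratio, round(direct, 3), round(routed, 3))
```

Under this assumed model the frequency falls monotonically as the TDM ratio grows, and the relative gap between the two hypothetical designs shrinks at high ratios because the scaled external delay dominates the period.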
D. Benchmark Circuits
Six popular sequential benchmark circuits are used in this experimental work. All benchmark circuits are FPGA-proven, single-clock synchronous designs. Table II provides the name, function and size of each circuit. These digital sequential benchmark circuits are obtained from OpenCores [15] and are available as gate-level netlists in blif format.
IV. EXPERIMENTAL RESULTS
In this section, we determine the effect of changing the TDM ratio from 1 to 49 on the system frequency in the CCG and Torus routing architectures. Table III gives the critical path delays before partitioning, after partitioning and after routing for both architectures, and thus shows the delay penalties incurred by partitioning and routing at the board level. The CPD_PP value is on average 47% higher than the CPD value across all the circuits, and in some cases it is more than twice as large. Since CCG provides direct connections among all FPGAs on the board, its routing penalties are lower than those of Torus, where route-through delays add to the CPD_PR value. The average CPD_PR value across all the circuits is approximately 50% higher on Torus than on CCG.
The results in Fig. 3 show that as the TDM ratio is increased, the achieved performance in both CCG and Torus is reduced. However, CCG always has notably higher performance than Torus for TDM ratios in the range of 1 to 25. Beyond that, the system frequencies of the two architectures are comparable. Due to lack of space, we include the results for only two of the six benchmark circuits. The system frequency versus TDM ratio results show a similar trend for the other four benchmark circuits.
V. CONCLUSIONS AND FUTURE WORK
Multi-FPGA boards experience large timing delays in inter-FPGA communication compared to intra-FPGA net delays, as well as limited bandwidth between FPGAs due to the I/O resource constraint per FPGA. This problem is addressed by time multiplexing. Besides I/O resources, the routing architecture also influences the cost, speed and routability of an MFS. In this paper, the achieved performance of time multiplexing is compared for a given range of TDM ratios using both the CCG and Torus routing architectures. Six benchmark circuits have been partitioned and routed on the two architectures, and their performance has been recorded for the given range of multiplexing ratios. Experimental results show that CCG achieves higher performance than Torus for the given range of TDM ratios because of its lower routing delays. However, for TDM ratios greater than 24, the performance of the two architectures becomes comparable, which leads to the conclusion that Torus can provide a better cost/performance ratio since its Printed Circuit Board (PCB) cost is lower than that of CCG.
In future research, this work will be extended by exploring the effects of using other multiplexing techniques such as SERDES and MGT for the same routing architectures. Furthermore, a performance comparison between 3D FPGAs and MFS can be explored when using different multiplexing schemes.
