ABSTRACT Field programmable gate arrays (FPGAs) have evolved enormously since their inception almost three decades ago. Multi-FPGA boards are receiving increasing attention from the research community as efficient solutions for prototyping complex systems, owing to the reliable, high-speed, low-cost, real-life exploration environment they offer. Although multi-FPGA platforms offer higher frequencies than other prototyping alternatives, the growing ratio of logic resources to I/Os in FPGAs is increasing the time-multiplexing ratio of inter-FPGA signals (logical signals) to inter-FPGA tracks (physical resources), which lowers the overall system frequency. This paper introduces a generic testing platform for multi-FPGA modeling. With this platform, users can experience the complete prototyping cycle of a digital system, from benchmark generation all the way to inter-FPGA routing. Using the generic tools of this platform, we explore the effect of three different inter-FPGA routing approaches on the frequency of the final prototyped design. Each routing approach is applied to generic as well as custom multi-FPGA boards. Experimental results show that, for the generic FPGA board, the routing approach that better exploits the two- and multi-point tracks of the target board gives better average frequency than the other two routing approaches.
I. INTRODUCTION
Progress in process technology over the past few years has greatly improved the computational capability of today's digital applications. However, it has also led to more complex design processes, higher performance requirements and shorter time to market, while the total design cost must be kept as small as possible. The total cost of a design is usually dominated by the architecture design, validation and verification costs, with verification normally accounting for 70% of the total design cost [1]. Moreover, pre-silicon verification takes 60 to 80% of the design cycle time of a digital system [2], [3]. Therefore, cheap, reliable and fast verification is a critical point in the design process of any digital system.
In pre-silicon verification, three techniques, namely simulation, emulation and FPGA-based prototyping, are commonly employed [4]. Verification through simulation offers complete system visibility and block-level verification, requires no set-up time and costs a few thousand dollars [5]-[7]. However, its execution speed is only a few kHz (∼1 kHz) and it offers no real-world testing experience. Emulation executes much faster than simulation (∼1 MHz) while retaining significant visibility, but commercial emulation solutions [8]-[10] are very expensive and require many weeks of set-up time. FPGA-based systems run the targeted design at an almost cycle- and bit-accurate level and offer the highest execution speed of the three (∼10 MHz). Furthermore, they enable real-world testing of the design with real external interfaces. FPGA-based prototyping requires about the same set-up time as emulation, and its set-up cost is around a few thousand dollars. Although FPGA-based testing gives poor system visibility, it is the most common of the three verification methods because of the high performance, low cost and real interfacing experience it provides.
Because of increasingly complex designs and the large gap between FPGAs and Application Specific Integrated Circuits (ASICs) [11], FPGA-based prototyping requires complex ASIC designs to be divided across more than one FPGA to obtain the necessary logic capacity; hence multi-FPGA prototyping. Depending on the complexity of the design under consideration, the number of FPGAs required to prototype a digital system can vary from a couple of FPGAs on a single board to many [12], [13]. Prototyping complex designs on multiple FPGAs is a complicated process that can be divided into two main phases, namely partitioning and routing.
Partitioning divides a complex design into several segments, each of which fits into the logic capacity of the target FPGA. The main optimization objective of this process is to minimize the number of communication signals between segments; these signals are termed cut-nets in this work. Once partitioning is completed, inter-FPGA routing routes the communication signals on the inter-FPGA tracks. Normally, the number of cut-nets (i.e., communication signals) resulting from partitioning is larger than the number of physically available tracks on the FPGA board. These cut-nets are therefore routed on the inter-FPGA tracks using Time Division Multiplexing (TDM) [14]. In this method, several cut-nets are multiplexed at the source FPGA according to the TDM ratio, i.e., the number of cut-nets sharing one multiplexer/demultiplexer. They are then routed through an inter-FPGA track and demultiplexed upon arrival at the destination FPGA. The higher the TDM ratio, the lower the system frequency, and vice versa. Although both the logic resources and the number of I/Os of FPGAs are rapidly increasing, the gap between them is larger than ever. This widening gap is worsening the already poor TDM ratio of multi-FPGA systems [12], [13], [15]. Therefore, research is being carried out to explore and optimize this TDM ratio [16].
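The link between cut-nets, tracks and TDM ratio can be illustrated with a minimal Python sketch. It assumes a single FPGA pair connected by identical tracks and treats the values of all cut-nets during one system-clock cycle as a list; the function names are hypothetical and are not part of the tools used in this work, and the serializer hardware itself is not modeled.

```python
import math

def tdm_ratio(num_cut_nets: int, num_tracks: int) -> int:
    """Minimum number of cut-nets that must share one track (ceiling division)."""
    if num_tracks <= 0:
        raise ValueError("at least one physical track is required")
    return math.ceil(num_cut_nets / num_tracks)

def multiplex(values, ratio):
    """Source FPGA: pack the per-cycle cut-net values into frames of `ratio` values.
    Each frame occupies one physical track and is sent serially in `ratio` time slots."""
    return [values[i:i + ratio] for i in range(0, len(values), ratio)]

def demultiplex(frames):
    """Destination FPGA: unpack the frames to recover the original cut-net order."""
    return [v for frame in frames for v in frame]

cut_values = list(range(10))           # 10 cut-nets between the FPGA pair
ratio = tdm_ratio(len(cut_values), 4)  # 4 physical tracks available
frames = multiplex(cut_values, ratio)
print(ratio, len(frames))  # 3 4
```

Halving the number of tracks to two would raise the ratio to 5, which is the frequency penalty described above.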
In our work, a generic optimized environment for the exploration of multi-FPGA prototyping systems is proposed. This environment is novel because it is generic and gives users a complete prototyping experience. Benchmarks are a basic requirement for validating an exploration environment. In our case, this requirement is even more pertinent, as we need large, realistic and complex benchmarks to validate our multi-FPGA prototyping environment. Previous works [17], [18] have generated and used benchmarks for different CAD tools. However, these benchmarks are not large enough to pose a serious challenge to current prototyping tools, which require benchmarks as large as several million gates. Stroobandt et al. [19] proposed a synthetic benchmark generator that builds benchmarks through random hierarchical connections between different components. Although large benchmarks can be generated with it, the repetitive connection pattern between components makes the generated circuits highly redundant and unrepresentative of real-world applications. In contrast, in this paper large, complex and realistic benchmarks are generated using the Design Space eXploration (DSX) tool [20], which allows the design of multiprocessor-based hardware architectures. The generated benchmarks are then synthesized with fast synthesis, partitioned [21] and routed using the proposed exploration environment. Recent research efforts have proposed different types of inter-FPGA routing algorithms. For example, Turki et al. [16] proposed an inter-FPGA routing algorithm based on two-point inter-FPGA tracks. Similarly, the work in [22] proposed a routing algorithm for multi-point FPGA tracks. We, on the other hand, propose a complete exploration environment with a generic routing tool that enables the exploration and optimization of different routing approaches.
For exploration, we apply three different routing approaches to two FPGA boards. Further details on the exploration steps are given in Section III. A preliminary version of this work was presented in [23], where only a single inter-FPGA routing approach was explored. Here, we extend that work and explore two more routing approaches, which yield new results and a deeper insight into the routing architecture of a multi-FPGA board. The main contributions of this work are as follows:
• Generation of large, complex and realistic benchmarks for multi-FPGA prototyping.
• Development of a generic routing environment for exploration and optimization of different routing approaches for multi-FPGA prototyping.
• Extensive experimentation and analysis of the results obtained through proposed exploration environment.
The rest of the paper is structured as follows. A brief overview of the benchmark generation platform used in this work is given in Section II. Section III details the steps of the exploration environment. Section IV gives further insight into the inter-FPGA routing tool. The experimental setup is presented in Section V, and the results obtained through the exploration environment are discussed in detail in Section VI. Finally, Section VII concludes the paper.
II. BENCHMARK GENERATOR
The DSX tool [20] is used in this work for benchmark generation. The generated benchmarks are then used to explore the different routing approaches considered in this work. The benchmark generation flow is shown in Figure 1. This flow accepts three files as input. The first file contains the description of the architecture/platform, in which the components used by the platform are instantiated. Each instantiated component has a specific number of inputs and outputs and a description of its connections with the other components of the architecture. The details of the components and of their interfaces with the others are provided by the second file, the component model file.
The software application graph is described in the third file, which allows multi-task applications to run on a target platform with a varying number of processors. When these files are provided to the DSX tool, synthesizable files of the system under consideration are generated, which can then be used by any synthesis tool for further implementation on FPGAs. Using the DSX tool, large multiprocessor system-on-chip (MPSoC) architectures, which may also contain co-processors, can be generated. Figure 2 illustrates a simplified example of such an architecture, containing N processors that connect to a UART, a RAM and co-processors via multiple FIFOs. The DSX tool employed in this paper is generic and can generate MPSoC architectures with many processors. In practice, however, a single MPSoC cannot contain an unlimited number of processors, as communication between the processors and co-processors is usually limited by insufficient bus bandwidth. This limitation is overcome by moving from a mono-cluster to a multi-cluster MPSoC architecture. In multi-cluster MPSoCs, intra-cluster communication is achieved through a VCI network, while communication between clusters is ensured through the DSPIN [24] Network-on-Chip (NoC). The DSPIN NoC has a two-dimensional mesh topology in which each node consists of a router and its associated cluster. Because of its mesh-based interconnect, DSPIN provides a highly scalable network.
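Why a two-dimensional mesh scales can be shown with a small Python sketch that enumerates the router-to-router links of a cols × rows mesh, with one router per cluster. This is a generic mesh model written for illustration (the function names are hypothetical); it does not reproduce DSPIN's router microarchitecture or routing scheme.

```python
def mesh_links(cols, rows):
    """Router-to-router links of a cols x rows 2D mesh; node (x, y) is one cluster."""
    links = []
    for y in range(rows):
        for x in range(cols):
            if x + 1 < cols:
                links.append(((x, y), (x + 1, y)))  # horizontal neighbour
            if y + 1 < rows:
                links.append(((x, y), (x, y + 1)))  # vertical neighbour
    return links

def max_router_degree(cols, rows):
    """Largest number of mesh neighbours of any router."""
    degree = {}
    for a, b in mesh_links(cols, rows):
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return max(degree.values(), default=0)

for n in (2, 4, 8):
    print(f"{n}x{n} clusters: {len(mesh_links(n, n))} links, "
          f"max degree {max_router_degree(n, n)}")
```

The link count grows linearly with the number of clusters (2*cols*rows - cols - rows), while no router ever has more than four mesh neighbours.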
Benchmarks generated through the platform described above are realistic, as they model real-life multiprocessor NoC scenarios. They are also very large and complex. The benchmarks generated in this work include both mono- and multi-cluster designs. They serve as input to the exploration environment described in the next section.
III. EXPLORATION ENVIRONMENT
The exploration environment proposed in this work takes a synthesizable benchmark as input, passes it through several steps and eventually places and routes it on the target multi-FPGA board. The flow of the exploration steps is shown in Figure 3.
A. SYNTHESIS
As shown in Figure 3, the first exploration step synthesizes the design under consideration. During synthesis, the design is logically optimized using commercial or open-source tools and mapped to the target library of FPGA components; this information is usually provided by the FPGA board file, which is an input parameter of the synthesis process. In this work, we use the Synopsys Certify tool [21] for fast synthesis of the design under test.
B. PARTITIONING
Since realistic benchmarks are usually very large, they must be divided into multiple segments after synthesis so that they can be implemented on a multi-FPGA board. Optimized design partitioning is an important step, as it directly affects the performance of the circuit. During partitioning, a design is divided into several segments (as many as there are FPGAs on the board), where each segment must fit in the logic capacity of the FPGA to which it is assigned. The partitioning tool is usually constrained by the size of each individual partition and by the number of physical connections between partitions. These two constraints make partitioning an NP-hard problem [25]. To generate a high-performance partitioning solution, the partitioning tool has to take into account the FPGA size and the connectivity between FPGAs (provided through the board description file); such a solution cannot be obtained without board information. The board information is generated through the steps shown in Figure 4.
VOLUME 6, 2018
Once the board description file and the synthesized design are ready, they are given to the Certify tool, which produces a high-performance partitioned solution with a minimum number of cut-nets between segments. The output of the partitioner is a trace assignment file that contains all the necessary information about the partitioned design. The trace assignment file is then passed to the inter-FPGA routing tool, which is discussed in the next subsection.
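The two quantities the partitioner trades off, logic fit and cut-nets, can be evaluated for a given assignment with the Python sketch below. It only scores an assignment and does not search for one (Certify's partitioning algorithm is not reproduced); the function name, cell sizes, nets and capacities in the example are invented, and the per-pair track constraint is omitted for brevity.

```python
from collections import defaultdict

def evaluate_partition(cell_size, nets, assignment, capacity):
    """Score a partition.

    cell_size:  cell -> logic size (e.g. ALMs)
    nets:       list of nets, each given as the list of cells it connects
    assignment: cell -> FPGA index
    capacity:   FPGA index -> logic capacity
    Returns (cut_nets, fits).
    """
    usage = defaultdict(int)
    for cell, size in cell_size.items():
        usage[assignment[cell]] += size
    # Logic-capacity constraint: every FPGA must hold its segment.
    fits = all(usage[fpga] <= capacity[fpga] for fpga in usage)
    # A net becomes a cut-net when its cells are spread over several FPGAs.
    cut_nets = [net for net in nets if len({assignment[c] for c in net}) > 1]
    return cut_nets, fits

sizes = {"a": 40, "b": 30, "c": 50, "d": 20}
nets = [["a", "b"], ["b", "c"], ["c", "d"], ["a", "c", "d"]]
cut, ok = evaluate_partition(sizes, nets, {"a": 0, "b": 0, "c": 1, "d": 1}, {0: 80, 1: 80})
print(len(cut), ok)  # 2 True
```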
C. ROUTING
As mentioned in Section I, the gap between the logic resources of FPGAs and their available I/Os is already large and keeps widening with newer FPGA generations. Therefore, the design obtained after partitioning has many cut-nets, which carry the communication between FPGA partitions. Because FPGA I/Os are limited, inter-FPGA routing routes these cut-nets between partitions using time division multiplexing. During routing, the inter-FPGA algorithm computes the shortest path from the source FPGA to the destination FPGA, i.e., the path with the smallest TDM ratio and number of intermediate hops. We propose a routing tool that takes the board information and trace assignment files as input and uses a customized version of the Pathfinder [26] routing algorithm to perform inter-FPGA routing. Pathfinder is a well-known algorithm commonly used for intra-FPGA routing [18]; in this work, we adapt it to inter-FPGA routing. Pathfinder uses a negotiation-based congestion resolution approach to find a conflict-free routing solution, and it employs Dijkstra's shortest path algorithm [27] to find the shortest path from a source to a destination FPGA. To perform inter-FPGA routing with Pathfinder, the information on the physically connected tracks of the FPGA board is collected from the board description file. A simplified description of the physical connections between four FPGAs on a board is shown in Figure 5. This information is then transformed into a directed graph G(V, E). Vertices V represent the I/Os of blocks and the wires of the routing structure, while edges E represent the possible connections between vertices. The Pathfinder algorithm is then applied to this routing graph to find a conflict-free routing solution.
An example of the routing graph representation of the physical connection description is shown in Figure 6 . The next section gives a detailed discussion on the inter-FPGA routing tool that is proposed in this paper. 
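The negotiation idea behind Pathfinder can be sketched in Python on a board-level graph. This is a simplified stand-in, not the customized Pathfinder of our tool: each node is an FPGA, each undirected link has a capacity (for example, the number of cut-net groups it can carry at a given mux ratio), each net is a two-terminal connection, and the cost of a link is (1 + history) × (1 + present-congestion factor × overuse), a simplified form of Pathfinder's base, history and present costs. The four-FPGA ring, the function names and all parameter values are hypothetical.

```python
import heapq
from collections import defaultdict

def dijkstra(adj, cost, src, dst):
    """Cheapest path from src to dst; adj maps node -> neighbours, cost(u, v) -> float."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + cost(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def pathfinder(adj, capacity, nets, max_iters=50, pres_step=1.0, hist_step=1.0):
    """Negotiated-congestion routing of two-terminal nets given as (src, dst) pairs.
    capacity maps frozenset({u, v}) -> number of nets the link can carry."""
    history = defaultdict(float)
    pres_fac = 0.0
    for iteration in range(1, max_iters + 1):
        usage = defaultdict(int)  # rip up everything and reroute from scratch
        routes = []
        for src, dst in nets:
            def cost(u, v):
                link = frozenset((u, v))
                overuse = max(0, usage[link] + 1 - capacity[link])
                return (1.0 + history[link]) * (1.0 + pres_fac * overuse)
            path = dijkstra(adj, cost, src, dst)
            if path is None:
                return None, iteration  # destination unreachable
            routes.append(path)
            for u, v in zip(path, path[1:]):
                usage[frozenset((u, v))] += 1
        overused = [link for link, n in usage.items() if n > capacity[link]]
        if not overused:
            return routes, iteration  # conflict-free solution
        for link in overused:  # remember persistent congestion
            history[link] += hist_step * (usage[link] - capacity[link])
        pres_fac += pres_step  # make present congestion more expensive
    return None, max_iters

ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
caps = {frozenset(e): 1 for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
routes, iters = pathfinder(ring, caps, [(0, 1), (0, 2)])
print(routes, iters)  # [[0, 1], [0, 3, 2]] 2
```

In the first iteration both nets take link 0-1; the accumulated history cost then steers the second net around the ring, which is how negotiation removes the conflict over successive iterations.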
D. INTRA-FPGA PLACE AND ROUTE
After inter-FPGA routing, the partitioned files are combined with the inter-FPGA routing information to generate the design sub-netlists. The design sub-netlists are then synthesized, placed and routed using vendor-specific tools. After placement and routing, these tools generate the bitstreams of the design, which are then loaded into the respective FPGAs of the multi-FPGA system to complete the prototyping flow.
IV. INTER-FPGA ROUTING
In this section, we describe the proposed routing tool used for multi-FPGA prototyping in more detail. As explained earlier, partitioning a design produces the trace assignment file, which includes all the necessary information on the cut-nets of the design. The trace assignment file contains two types of cut-nets: two-terminal nets, with a single source and a single destination, and multi-terminal nets, with one source and several destinations. The inter-FPGA routing flow used here is illustrated in Figure 7. The flow takes the board description, the routing constraints and the trace assignment as input, and ends with the frequency estimation after the optimized routing.
Depending on the board description file and the trace assignment file generated by the partitioner, different types of inter-FPGA routing can be performed with our generic tool. In this work, we explore three routing approaches. The first is illustrated in Figure 8a, which shows the graph representation of a three-FPGA board with only two-point tracks. For this kind of board, the signals of the trace assignment file are routed using only two-point tracks: multi-terminal cut-nets are first decomposed into two-terminal cut-nets and then routed on the two-point tracks of the board. The second approach is illustrated in Figure 8b, where the board contains both two- and multi-point tracks. Here, two-terminal signals are routed on two-point tracks, while multi-terminal signals use only multi-point tracks. Normally, the trace assignment file contains far more two-terminal signals than multi-terminal ones, so the two-terminal signals dictate the routing optimization while some of the multi-point tracks remain unused. This observation leads to the third approach, in which the routing algorithm first routes two-terminal signals on the two-point tracks of the board and, once these are fully used, also uses the available multi-point tracks for two-terminal signals. The results of these three approaches are presented in Section VI. As shown in Figure 7, the routing tool requires three parameters: the routing technique, the board information and the trace assignment file. It then computes the initial mux ratio as the maximum, over all FPGA pairs, of the ratio between the number of cut-nets connecting a pair (extracted from the trace assignment file) and the number of physical tracks between them.
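The three approaches differ mainly in which track types a net may use, and the initial mux ratio depends only on per-pair demand. Both can be sketched in Python as follows. The string labels ("bi", "multi", "mixed"), the (source, sinks) net encoding, the decomposition of a multi-terminal net into one two-terminal net per sink and the ceiling in the ratio are assumptions made for illustration; pairs without direct tracks are skipped here because their nets need hops.

```python
import math
from collections import Counter

def decompose(net):
    """Approach 1: split a multi-terminal net (source, [sinks]) into two-terminal nets."""
    src, sinks = net
    return [(src, sink) for sink in sinks]

def eligible_tracks(approach, is_multi_terminal, two_point_full):
    """Track types a net may use under each approach."""
    if approach == "bi":     # only two-point tracks (multi-terminal nets decomposed first)
        return {"two-point"}
    if approach == "multi":  # strict separation by net type
        return {"multi-point"} if is_multi_terminal else {"two-point"}
    if approach == "mixed":  # two-terminal nets overflow onto multi-point tracks
        if is_multi_terminal:
            return {"multi-point"}
        return {"two-point", "multi-point"} if two_point_full else {"two-point"}
    raise ValueError(f"unknown approach: {approach}")

def initial_mux_ratio(two_terminal_nets, tracks):
    """Max over directly connected FPGA pairs of ceil(cut-nets / tracks)."""
    demand = Counter(frozenset(net) for net in two_terminal_nets)
    ratios = [math.ceil(n / tracks[pair]) for pair, n in demand.items() if pair in tracks]
    return max(ratios, default=1)

tracks = {frozenset(p): 2 for p in [(0, 1), (0, 2), (1, 2)]}
nets = [(0, 1)] * 5 + [(1, 2)] * 2 + decompose((0, [1, 2]))
print(initial_mux_ratio(nets, tracks))  # 3 (6 cut-nets on 2 tracks between FPGAs 0 and 1)
```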
The FPGA board information is then used to generate the routing graph used by the Pathfinder algorithm for inter-FPGA routing. After the routing graph is generated, cut-nets with the same source and destination are grouped together, the number of cut-nets in one group being equal to the mux ratio computed at the beginning of the flow. The groups are then sorted and routed on the corresponding tracks using Pathfinder. Once all the nets are routed, the mux ratio is adjusted until routing becomes infeasible. Previously, the inter-FPGA routing algorithm used a sequential optimization process to find the best mux ratio for a given design. In this work, we instead use a binary search algorithm, which requires fewer iterations to find an optimal solution. Further details and related results on this optimization are given in Section VI.
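The grouping step and the two mux-ratio searches can be sketched in Python. The feasibility test stands in for a full Pathfinder run and is a hypothetical callback; the sequential variant is our reading of the previous approach (lower the ratio one step at a time until routing fails), and binary search is only valid under the assumption that feasibility is monotone, i.e., if a ratio routes, every larger ratio also routes.

```python
from collections import defaultdict

def group_cut_nets(nets, ratio):
    """Bundle cut-nets with the same (source, destination) into groups of at most
    `ratio` nets; each group is then routed as one unit on one track."""
    by_pair = defaultdict(list)
    for name, src, dst in nets:
        by_pair[(src, dst)].append(name)
    groups = []
    for pair, names in sorted(by_pair.items()):
        for i in range(0, len(names), ratio):
            groups.append((pair, names[i:i + ratio]))
    return groups

def min_mux_ratio_sequential(feasible, start):
    """Lower the ratio one step at a time until routing fails."""
    best, iters, ratio = None, 0, start
    while ratio >= 1:
        iters += 1
        if not feasible(ratio):
            break
        best, ratio = ratio, ratio - 1
    return best, iters

def min_mux_ratio_binary(feasible, start):
    """Binary search for the smallest feasible ratio in [1, start] (assumes monotonicity)."""
    lo, hi, best, iters = 1, start, None, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        iters += 1
        if feasible(mid):
            best, hi = mid, mid - 1  # feasible: try smaller ratios
        else:
            lo = mid + 1             # infeasible: the optimum is larger
    return best, iters

def feasible(ratio):
    """Hypothetical stand-in for 'Pathfinder finds a conflict-free routing at this ratio'."""
    return ratio >= 7

print(min_mux_ratio_sequential(feasible, 40), min_mux_ratio_binary(feasible, 40))  # (7, 35) (7, 5)
```

In this toy case binary search needs 5 feasibility checks instead of 35; the actual savings depend on how far the optimum lies from the starting ratio.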
Although the optimization algorithm attempts to find the minimum possible mux ratio, this value is normally greater than one because of the large ratio of cut-nets to I/Os. Note that the mux ratio directly affects the synchronous system clock frequency shared by all the FPGAs on the board. This frequency is provided by the Phase-Locked Loops (PLLs) integrated in the FPGAs and is identical to the board's oscillator frequency. The system frequency is obtained from equation 1, where if_freq is the inter-FPGA frequency, which can be understood with the help of Figure 9.
The work in [28] indicates that the critical path of a multi-FPGA platform is the sum of the T_out, T_board and T_in delays, as illustrated in Figure 9. Adding a tolerance delay yields an if_freq of 125 MHz for the multi-FPGA board. Furthermore, according to [28], the latency of equation 1 is (mux ratio + 3) fast clock cycles. Therefore, the relationship between sys_freq and the fast clock frequency is given by equation 2.
As already discussed, an intermediate FPGA may be required to act as a hop if there are no direct tracks available between the source and the destination FPGA. Routing hops further degrade sys_freq (see Figure 10). Therefore, when routing hops are considered, the relationship between sys_freq and the fast clock is given by equation 3.
FIGURE 10. Example of system frequency calculation on multi-FPGA board with routing hop.
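A minimal Python sketch of the hop-free frequency model, under explicit assumptions: if_freq is taken as the reciprocal of T_out + T_board + T_in plus a tolerance delay, the fast (multiplexing) clock is assumed to run at if_freq, and one system-clock cycle is assumed to cover the stated latency of (mux ratio + 3) fast-clock cycles. The hop-dependent relation is not modeled, the function names are hypothetical, and the delay values below are invented so that their sum gives the 125 MHz quoted above.

```python
def interface_frequency_mhz(t_out_ns, t_board_ns, t_in_ns, t_tol_ns):
    """Inter-FPGA (fast) clock: reciprocal of the critical path plus tolerance, ns -> MHz."""
    return 1000.0 / (t_out_ns + t_board_ns + t_in_ns + t_tol_ns)

def system_frequency_mhz(fast_clock_mhz, mux_ratio):
    """Assumed model: one system cycle spans (mux_ratio + 3) fast-clock cycles."""
    if mux_ratio < 1:
        raise ValueError("mux ratio must be at least 1")
    return fast_clock_mhz / (mux_ratio + 3)

# Hypothetical delays (ns) chosen to sum to 8 ns, i.e. 125 MHz.
if_freq = interface_frequency_mhz(3.0, 1.5, 2.5, 1.0)
for ratio in (1, 4, 7):
    print(ratio, system_frequency_mhz(if_freq, ratio))
```

Under this model, every unit saved in the mux ratio raises the system frequency, which is why the routing tool minimizes the ratio.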
V. EXPERIMENTAL SETUP
This section presents the detailed experimental setup. First, the benchmarks used in this work are outlined; then, the target FPGA boards on which these benchmarks are placed and routed are discussed.
A. BENCHMARKS
In this work, we have generated multiple benchmarks, both mono-cluster and multi-cluster, using the platform described in Section II. A total of 14 benchmarks are used, and their details are given in Table 1. The first four benchmarks in the table are mono-cluster and the next ten are multi-cluster. The table shows that the suite includes both bi-terminal and multi-terminal benchmarks. Depending on their logic size, and to obtain a variety of routing requirements, we partition these benchmarks onto different numbers of FPGAs, as given in the third column of the table. The numbers of bi-terminal, multi-terminal and total cut-nets are given in columns 4 to 6 of Table 1, respectively. These columns indicate that the first four benchmarks contain only bi-terminal nets, so we classify them as bi-terminal benchmarks, whereas the remaining ten contain both bi-terminal and multi-terminal nets, so we classify them as mixed benchmarks. Inter-FPGA routing for these benchmarks is performed using different target FPGA boards, which are discussed next.
B. TARGET FPGA BOARDS
As discussed in Section IV, the proposed routing tool is generic and can be used to explore different routing approaches. In this work, we experiment with two types of target FPGA boards: a generic FPGA board and a custom FPGA board. Each type is further divided into two categories: bi-terminal and multi-terminal. The information on generic bi-terminal boards is used for the inter-FPGA routing of both bi-terminal and mixed benchmarks. Similarly, the information on generic multi-terminal boards is used for the inter-FPGA routing of benchmarks with multi-terminal nets. Generic and custom FPGA boards are discussed further below.
1) GENERIC FPGA BOARDS
Here we use a generic bi-terminal board that contains four Stratix V [29] FPGAs, each with 234,720 Adaptive Logic Modules (ALMs), 512 DSP blocks, 2560 memory blocks and more than 800 I/O pads. In the FPGA boards generated in this work, 60% of the I/Os are reserved for the inter-FPGA interconnect. A graph representation of the bi-terminal generic FPGA board and its I/O distribution is illustrated in Figure 11a; the I/Os are evenly distributed among all FPGAs on the board. Similarly, Figure 11b shows the graph representation of the multi-terminal generic FPGA board, where bi-terminal tracks are shown in black and multi-terminal tracks in red. These two generic FPGA boards are used in this work to perform inter-FPGA routing for the benchmarks listed in Table 1.
2) CUSTOM FPGA BOARDS
The definition of a bi-terminal custom FPGA board is demonstrated using Figure 12a, which shows a sample partitioned benchmark with its bi-terminal and multi-terminal cut-nets. To define a custom FPGA board for such a benchmark, we first split the multi-terminal nets into the corresponding two-terminal nets; this step is required only for multi-terminal benchmarks. The resulting cut-net information is shown in Figure 12b. Once all nets are converted into two-terminal nets, the number of nets of each part is calculated by summing the cut-nets of each partition. For example, Figure 12b shows 976 cut-nets between partitions 1 and 2 and 419 cut-nets between partitions 1 and 3, giving partition 1 a total cut of 1395. The cut-nets of the other two partitions are calculated similarly, so that, with reference to Figure 12b, the numbers of nets for parts 1, 2 and 3 are 1395, 1855 and 1298, respectively.
After the number of nets of each part is computed, the number of tracks between the FPGAs of a custom board is computed using equation 4. First, the part with the largest number of cut-nets is selected, which is part 2 in Figure 12b. If the number of available FPGA I/Os is 480, the tracks between partitions 1 and 2 are calculated as (976/1855) × 480 ≈ 253, and the tracks between FPGAs 2 and 3 are simply 480 − 253 = 227. The number of tracks defined for partition 2 is shown in Figure 12b. Once the inter-FPGA tracks of one FPGA are defined, the remaining cut-nets and the available I/Os of the FPGAs are updated, and the tracks of the FPGA with the next largest number of cut-nets are defined. When two parts have the same number of cut-nets, the part with fewer available I/Os is chosen. For example, after the tracks of FPGA 2 are defined, the remaining cut-nets of partitions 1 and 3 are the same (i.e., 419), but FPGA 1 has fewer available I/Os (i.e., 227) than FPGA 3 (i.e., 253). Therefore, FPGA 1 is processed next and the number of tracks between FPGAs 1 and 3 is set to 227; the complete custom board description for Figure 12a is shown in Figure 13b. Although the bi-terminal custom board definition is explained here on a simple example, the process is generic, and custom FPGA boards can be defined with equation 4 for complex designs with any number of partitions/FPGAs.
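The greedy bi-terminal custom-board procedure can be reproduced in Python on the worked example, where the 879 cut-nets between partitions 2 and 3 follow from 1855 − 976. Equation 4 itself is not restated here; the proportional rule used below, (cut-nets of the pair / cut-nets of the selected part) × available I/Os of that part, is inferred from the example (976/1855 × 480 ≈ 253), while the rounding, the rule that the last pair takes the remaining I/Os and the cap by the partner's free I/Os are our assumptions. The function name is hypothetical.

```python
def define_custom_tracks(cuts, io):
    """Greedy bi-terminal custom-board definition, following the worked example.

    cuts: {(a, b): cut-nets between parts a < b} after splitting multi-terminal nets
    io:   {part: I/Os available for inter-FPGA tracks}
    Returns {(a, b): number of tracks}.
    """
    remaining = {pair: n for pair, n in cuts.items() if n > 0}
    avail = dict(io)
    tracks = {}
    while remaining:
        # Remaining cut-nets per part (sum over its still-undefined pairs).
        load = {p: 0 for p in avail}
        for (a, b), n in remaining.items():
            load[a] += n
            load[b] += n
        # Part with the most cut-nets; ties broken by fewer available I/Os.
        part = max((p for p in load if load[p] > 0),
                   key=lambda p: (load[p], -avail[p]))
        pairs = sorted(pr for pr in remaining if part in pr)
        budget = avail[part]
        for k, pr in enumerate(pairs):
            other = pr[1] if pr[0] == part else pr[0]
            if k < len(pairs) - 1:
                # Proportional share of this part's I/Os (assumed rounding).
                t = round(remaining[pr] / load[part] * avail[part])
            else:
                t = budget  # last pair takes the remaining I/Os, as in 480 - 253
            t = min(t, budget, avail[other])
            tracks[pr] = t
            budget -= t
            avail[other] -= t
            del remaining[pr]
        avail[part] = budget
    return tracks

tracks = define_custom_tracks({(1, 2): 976, (1, 3): 419, (2, 3): 879},
                              {1: 480, 2: 480, 3: 480})
print(tracks)  # {(1, 2): 253, (2, 3): 227, (1, 3): 227}
```

The result matches the tracks given in the text: 253 between FPGAs 1 and 2, 227 between FPGAs 2 and 3, and 227 between FPGAs 1 and 3.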
To explain the procedure for defining a custom multi-terminal FPGA board, we consider the same example as above. The split multi-terminal design of the example is shown in Figure 14a. Contrary to Figure 12b, this figure keeps the multi-terminal cut-nets: there are 307 multi-terminal cut-nets in the design, and two-terminal and multi-terminal cut-nets together give a total of 1967 cut-nets, so multi-terminal nets make up 15.6% of the design's cut-nets. For a custom multi-terminal FPGA board, the multi-terminal tracks are computed first using equation 5. The board representation after this computation is shown in Figure 14b. The cut-nets of the design and the available FPGA I/Os are then updated, and the same procedure as for the bi-terminal custom FPGA board is followed.
A step-by-step graph representation of the remaining process is given in Figures 15a and 15b.
VI. EXPERIMENTAL RESULTS AND ANALYSIS
The benchmarks discussed in Section V-A are synthesized, partitioned and routed using the exploration environment described in Section III. For exploration, we use the three routing approaches discussed in Section IV and apply them to the generic and custom multi-FPGA boards described in the previous section. As in Section V-B, we divide the exploration results into two parts, and then compare them in Section VI-C.
A. GENERIC EXPLORATION RESULTS
For the three inter-FPGA routing approaches, we define generic FPGA boards using the methodology described in Section V-B. Once the generic board for a particular approach is defined, the benchmarks are individually routed using the board information, and our inter-FPGA routing algorithm tries to achieve the minimum possible mux ratio for each benchmark using binary search. Results for generic FPGA boards using the three approaches, namely bi-terminal (i.e., two-terminal), multi-terminal and mixed, are presented in Table 2. In the bi-terminal routing approach, a generic FPGA board containing only two-point tracks is defined, and all benchmarks, whether they contain only two-terminal nets or a mixture of two-terminal and multi-terminal nets, are routed on its two-point tracks only. Results for this approach are given in columns 2 and 3 of Table 2, where column 2 gives the mux ratio and column 3 the number of hops crossed for each benchmark. Columns 4 and 5 of Table 2 give the mux ratio and hop results of the multi-terminal routing approach, in which two-terminal nets are routed on two-point tracks and multi-terminal nets on multi-point tracks only. Since two-terminal nets normally far outnumber multi-terminal nets, they control the overall mux ratio of the inter-FPGA routing solution; as a result, the multi-point tracks are in some cases not used to their full capacity. The third routing approach addresses this issue by allowing two-terminal nets to be routed on multi-point tracks when they find no path on two-point tracks. Its results are shown in columns 6 and 7 of Table 2. The results in Table 2 show that the bi-terminal routing approach gives good results when the benchmarks under consideration contain only two-terminal nets (see Table 1).
However, the results of the bi-terminal approach become poor when the benchmarks contain both two- and multi-terminal nets, and they degrade further as the number of multi-terminal cut-nets in a benchmark increases. On the other hand, the mixed approach exploits the available FPGA resources in the best possible manner for both types of benchmarks and gives the best results among the three approaches explored in this work.
B. CUSTOM EXPLORATION RESULTS
Similarly to the generic exploration, we explore the three routing approaches on custom FPGA boards. Custom boards are defined for each benchmark separately according to its cut-net information, as described in detail in Section V-B, and the custom exploration results for the three approaches are shown in Table 3. The results indicate that, for all benchmarks, custom FPGA boards give equal or better mux ratio and hop results. As with the generic exploration, the mixed approach in Table 3 gives the best overall mux ratio and number of hops.
Frequency comparison results between the generic and custom approaches are presented next. [...] frequency results for almost all the benchmarks. The frequency comparison of the three routing approaches for the generic FPGA board shows that, on average, the mixed routing approach gives 8.3% and 7.5% better frequency than the bi-terminal and multi-terminal routing approaches, respectively. Similarly, for the custom FPGA board, the mixed routing approach gives, on average, 9.5% and 7.4% better frequency than the bi-terminal and multi-terminal routing approaches, respectively. Furthermore, these figures show that custom boards give better results than generic FPGA boards: the bi-terminal, multi-terminal and mixed routing approaches give, on average, 8.9%, 10.2% and 10.1% better frequency, respectively, on the custom FPGA board than on the generic one. This is because, with custom boards, the routing requirements of each benchmark are handled individually. In practice, however, it is not viable to define a separate board for each benchmark to be prototyped. We adopt this approach here to show that the mixed routing approach better exploits the available resources of the FPGA board and gives the best results for both generic and custom FPGA boards.
As discussed in Section IV, mux ratio optimization was previously performed sequentially. In this work, we replace it with a binary search algorithm that reduces the number of iterations required to find an optimal solution. Finally, we compare the number of iterations of the previous approach and of the binary search algorithm in Figure 18. Results are presented for five benchmarks only, but the same trend holds for the other benchmarks.
For each benchmark, Figure 18 reports the number of iterations required to find the optimal solution with each of the three routing approaches (Bi, Mult, Mix), under both the sequential and the binary search optimization. For example, Bi-N in Figure 18 corresponds to the bi-terminal routing approach with the normal (sequential) optimization, and Bi-B to the bi-terminal routing approach with the binary search optimization; the same naming applies to the other two routing approaches. The figure shows that the binary search optimization needs far fewer iterations on average: it requires 19%, 45% and 54% fewer routing iterations for the bi-terminal, multi-terminal and mixed routing approaches, respectively, and therefore reduces execution time compared with the normal optimization approach.
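The iteration savings come from replacing a linear scan over candidate mux ratios with a bisection. The Python sketch below illustrates the idea under stated assumptions: the optimization is taken to search for the smallest integer mux ratio at which inter-FPGA routing succeeds, and routability is assumed to be monotone in the mux ratio (if routing succeeds at ratio r, it also succeeds at any larger ratio). The `is_routable` callback and the `toy_check` model are hypothetical stand-ins for a full routing run; they are not the platform's actual interface, and the toy model ignores track topology entirely.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for one inter-FPGA routing run: returns True if all
# cut nets can be routed when each physical track carries `ratio` signals.
RoutingCheck = Callable[[int], bool]


def sequential_search(is_routable: RoutingCheck, lo: int, hi: int) -> Tuple[int, int]:
    """Try mux ratios lo, lo+1, ..., hi in order until routing succeeds.

    Returns (smallest feasible ratio, number of routing runs)."""
    runs = 0
    for ratio in range(lo, hi + 1):
        runs += 1
        if is_routable(ratio):
            return ratio, runs
    raise ValueError("no feasible mux ratio in range")


def binary_search(is_routable: RoutingCheck, lo: int, hi: int) -> Tuple[int, int]:
    """Bisect [lo, hi] for the smallest feasible mux ratio.

    Assumes feasibility is monotone: routable at r implies routable at any r' > r.
    Returns (smallest feasible ratio, number of routing runs)."""
    runs = 0
    best: Optional[int] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        runs += 1
        if is_routable(mid):
            best = mid      # feasible: remember it and try smaller ratios
            hi = mid - 1
        else:
            lo = mid + 1    # infeasible: more multiplexing is needed
    if best is None:
        raise ValueError("no feasible mux ratio in range")
    return best, runs


def toy_check(ratio: int) -> bool:
    """Toy routability model (hypothetical): 600 cut signals over 40 tracks."""
    return 40 * ratio >= 600


if __name__ == "__main__":
    print(sequential_search(toy_check, 1, 64))  # (15, 15)
    print(binary_search(toy_check, 1, 64))      # (15, 6)
```

Both searches return the same ratio, but over a range of 1 to 64 the bisection needs at most seven routing runs, whereas the sequential scan needs one run per candidate up to the answer. The savings observed in Figure 18 naturally depend on where the optimum lies and on how the real router behaves.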
VII. CONCLUSION
In this work, we have presented an efficient inter-FPGA routing exploration environment for multi-FPGA prototyping systems. Using this environment, we have explored three different routing approaches on two FPGA boards. For experimentation, we have used a set of fourteen complex benchmarks with varying routing requirements. Experimental results show that the mixed routing approach, which better exploits the available FPGA resources, gives the best frequency results for both generic and custom FPGA boards. Across the fourteen benchmarks, the mixed routing approach gives, on average, 8.3% and 7.5% better frequency than the bi-terminal and multi-terminal routing approaches, respectively, on the generic FPGA board, and 9.5% and 7.4% better frequency, respectively, on the custom FPGA board. Furthermore, for the three routing approaches under consideration, custom boards give, on average, 9.7% better frequency than generic FPGA boards.
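The 9.7% figure is the mean of the custom-versus-generic gains reported for the three routing approaches (8.9%, 10.2% and 10.1%). The short Python sketch below reproduces that mean and, as a sanity check, chains the reported gains along two paths: generic base approach to custom base approach to custom mixed, and generic base approach to generic mixed to custom mixed. It treats each averaged percentage as if it were a ratio of single frequencies, which is only an approximation because the reported values are averages over benchmarks; the dictionary and function names are illustrative and not part of the platform's tools.

```python
from statistics import mean
from typing import Tuple

# Average frequency gains (%) as reported in the text; names are illustrative.
MIX_GAIN_GENERIC = {"bi": 8.3, "mult": 7.5}  # mixed vs. bi-/multi-terminal, generic board
MIX_GAIN_CUSTOM = {"bi": 9.5, "mult": 7.4}   # mixed vs. bi-/multi-terminal, custom board
CUSTOM_GAIN = {"bi": 8.9, "mult": 10.2, "mix": 10.1}  # custom vs. generic board, per approach


def factor(gain_percent: float) -> float:
    """Convert a percentage gain into a multiplicative frequency factor."""
    return 1.0 + gain_percent / 100.0


def two_paths(base: str) -> Tuple[float, float]:
    """Frequency of mixed routing on the custom board, relative to `base`
    routing on the generic board, computed along two chains of reported gains."""
    via_custom_base = factor(CUSTOM_GAIN[base]) * factor(MIX_GAIN_CUSTOM[base])
    via_generic_mix = factor(MIX_GAIN_GENERIC[base]) * factor(CUSTOM_GAIN["mix"])
    return via_custom_base, via_generic_mix


# Mean custom-board gain over the three routing approaches.
avg_custom_gain = mean(list(CUSTOM_GAIN.values()))

if __name__ == "__main__":
    for base in ("bi", "mult"):
        a, b = two_paths(base)
        print(f"{base}: {a:.4f} vs {b:.4f} (difference {abs(a - b):.5f})")
    print(f"average custom-board gain: {avg_custom_gain:.1f}%")  # 9.7%
```

For both base approaches, the two chains agree to within 0.0001 in relative frequency, which suggests the reported per-approach gains are mutually consistent.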
UMER FAROOQ received the Ph.D. degree in informatics and electronics from Universite Pierre et Marie Curie, Paris, France, in 2011. He is currently serving as an Assistant Professor at the Electrical Engineering Department, Dhofar University, Salalah, Oman. He has authored one book and co-authored many peer-reviewed, impact-factor research articles. His research interests include reconfigurable architectures, parallel computing, and scheduling techniques for multi-core processing systems. He serves as a reviewer for many international conferences and journals in the field.
IMRAN BAIG received the Ph.D. degree in electrical and electronic engineering from Universiti Teknologi PETRONAS, Malaysia, in 2012. He has been with the Department of Electrical and Computer Engineering, Dhofar University, Salalah, Oman, since 2015, where he is currently an Associate Professor. He has authored and co-authored about 25 journal papers, one book chapter, and 14 conference papers. His research interests cover many aspects of the physical, medium access, and networking layers of wireless communications with a special emphasis on cellular radio networks, mobile ad hoc and sensor networks, and Internet security. He is an Editor-in-Chief of the ITEE Journal. He has also been serving as a Designated Reviewer for many reputed journals of the IEEE, Elsevier, IET, and Springer.
BANDER A. ALZAHRANI received the M.Sc. degree in computer security and the Ph.D. degree in computer science from the University of Essex, U.K., in 2010 and 2015, respectively. He is currently an Assistant Professor at King Abdulaziz University, Saudi Arabia. He has published more than 27 research papers in international journals and conferences. His research interests include network security, network routing and forwarding, information-centric networks, and Bloom-filter data structure.
