ABSTRACT Contemporary datacenters are enhancing their compute capacity, power efficiency, and processing latency by integrating field-programmable gate arrays (FPGAs). One would like to virtualize FPGAs to share them between multiple users and to be able to allocate incoming tasks to FPGAs without interrupting their operation. To virtualize FPGAs, their complexities, such as board-specific system-level integration and tricky I/O timing closure problems, should be abstracted away from users. To this end, FPGA designers have proposed the shell concept, which abstracts away the board-specific details from the user and provides an easy-to-use interface to the user application. In this paper, we create several shells using a wide variety of interconnect solutions and rigorously evaluate them in terms of accelerator frequency, usable bandwidth, area-efficiency, latency, wire demand, and FPGA routing congestion. We show that virtualization of four accelerators per chip with traditional bus-based FPGA interconnect costs an average frequency drop of 24%, increases the wire demand of the shell to 2.78X, and creates significant routing congestion. We also show that while FPGA-optimized soft network on chip interconnect solutions can mitigate the reduction in accelerator frequency, they exacerbate the wire demand and routing congestion problems and offer a lower usable bandwidth. Finally, we demonstrate that hard networks on chip are a superior interconnect solution for virtualized FPGAs in all of the aforementioned evaluation criteria, making them well-suited to datacenter-optimized FPGAs.
I. INTRODUCTION AND BACKGROUND
To meet the ever-growing demand for performance, datacenters are augmenting their infrastructure with specialized accelerators [1]. The rapidly evolving nature of datacenter services makes non-programmable accelerators both very costly and, for many applications, too slow to deploy, while strict datacenter power requirements make Graphics Processing Units (GPUs) an undesirable option [1]. Field-Programmable Gate Arrays (FPGAs), however, are a promising option for specialized accelerators due to their programmability, energy efficiency, and high performance. FPGAs not only boost datacenter capability by transparently performing network acceleration tasks such as on-the-fly compression and encryption [2] but can also be used to accelerate entire user applications. Studies have demonstrated the suitability of FPGAs for several such user applications: search engines [1], database queries [3], convolutional neural networks [4], and text-analytics [5].
While FPGAs can bring compelling gains to datacenter efficiency, adapting their use model to datacenters also presents challenges. Processors, the traditional compute resources in datacenters, are typically virtualized: several users share the same physical resource and the physical details of the device are abstracted from the users [6] . A virtualized compute resource should have three characteristics:
• The physical details of the device should be abstracted from the user.
• The device should be able to host multiple accelerators simultaneously.
• An incoming task should be allocated to an operational device without interrupting other active accelerators.
Ideally, FPGAs should also be virtualized in datacenters but their different compute model requires a different approach to their virtualization. Fig. 1 depicts various parts of a virtualized FPGA system. The shell is responsible for abstracting away the details of external resources for the users and providing the users with an easy-to-use interface to access these resources. Such resources typically include external memories, external processors, and network interfaces [2] but can include other devices as well. The shell delivers the bandwidth of the external resources to partially reconfigurable regions in which the user applications, also known as roles, are placed. A partially reconfigurable region is a predetermined region of the FPGA that can be reprogrammed without disrupting the operation of the rest of the device. The shell designer should deal with all of the tricky I/O timing closure problems for every FPGA board in the datacenter and efficiently distribute the data bandwidth of the external resources to all of the partially reconfigurable regions. Without a shell the applications would become board-dependent and therefore unportable, making datacenter upgrades and use of different FPGA boards infeasible. Moreover, users would have to spend significant development time dealing with system-integration and the I/O timing of a particular board which is also contrary to the idea of virtualization.
Designing the shell is a complex and time-consuming process, but it needs to be done only once per FPGA board. The functionality of the shell is similar to the functionality of an operating system in a processor; it facilitates application access to the available resources and hides the physical and timing details of these resources from the applications. The shell is both predesigned and precompiled guaranteeing its timing closure and ease of use by application developers.
However, this means the shell does not change to suit different applications, so it must implement all features that different roles might need, which can lead to area-inefficiency. For services that vary greatly from one role to another, we can use a soft shell, a wrapper around the role that interfaces the role to the shell. The soft shell is also predesigned but is compiled together with the role in a partially reconfigurable region.
This means that a soft shell cannot guarantee timing closure but can adapt to a role's needs. Some of the services it can provide include: width adaptation, clock domain crossing, and insertion of more pipeline stages at the shell-role interface. Width adaptation is the process of matching the data rate and the data width of a resource distributed by the shell to the role. Distributing the ever-growing bandwidth of the external resources is a challenging task for the shell. Fig. 2 demonstrates that the transceiver bandwidth has been growing along with logic density in Xilinx FPGAs. Memory interface bandwidth has also been continuously growing and is expected to face an even sharper increase with integration of High Bandwidth Memory in FPGAs.
External resources such as DDR memory and transceivers operate at much higher frequencies than the FPGA fabric. In order to match the data rate of these devices, the FPGA logic must use wide parallel on-chip buses, consuming valuable FPGA routing resources. To utilize a typical 64-bit DDR memory running at 1066 MHz with a data rate of about 17 GB/s, FPGAs use a 512-bit data bus at 267 MHz. High bandwidth memory solutions increase bandwidth 17 fold. A typical high bandwidth memory delivers 460 GB/s [7] and 1 TB/s versions have been integrated with FPGAs [8]. Moreover, due to increasing FPGA capacity, one would ideally want to allow more than one accelerator per chip. This makes it even harder to efficiently interconnect different accelerators with external resources without consuming much of the FPGA routing. The challenging task of the shell in distributing the external memory bandwidth motivates us to quantify the costs of FPGA virtualization both with various interconnect solutions possible in current FPGAs and with new FPGA architectures that can more efficiently distribute such tremendous external bandwidth.
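The bus-width arithmetic in the preceding paragraph can be checked with a short calculation. The sketch below is illustrative only: the function names are ours, and it assumes that the 1066 MHz figure is the DDR I/O clock with two transfers per cycle and that the fabric-side bus moves one word per cycle.

```python
def ddr_bandwidth_bytes_per_s(width_bits, clock_hz, transfers_per_cycle=2):
    """Peak bandwidth of a double-data-rate interface in bytes per second.

    Assumption: two transfers per I/O clock cycle (double data rate).
    """
    return width_bits / 8 * clock_hz * transfers_per_cycle


def fabric_clock_hz(bandwidth_bytes_per_s, bus_width_bits):
    """Fabric clock needed to carry a bandwidth on a bus moving one word per cycle."""
    return bandwidth_bytes_per_s / (bus_width_bits / 8)


# 64-bit DDR interface at 1066 MHz, as in the text.
ddr_bw = ddr_bandwidth_bytes_per_s(width_bits=64, clock_hz=1066e6)

# Clock required on a 512-bit on-chip bus to match that data rate.
bus_clock = fabric_clock_hz(ddr_bw, bus_width_bits=512)

print(f"DDR peak bandwidth: {ddr_bw / 1e9:.2f} GB/s")
print(f"512-bit bus clock:  {bus_clock / 1e6:.1f} MHz")
```

Under these assumptions the result is about 17 GB/s and roughly 267 MHz, matching the figures quoted above; widening the bus by a factor of eight is what lets the fabric run at one quarter of the I/O clock.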
As an alternative to the bus interconnect generated by many FPGA CAD flows, a Network on Chip (NoC) can be used as an interconnect solution for FPGAs. Soft NoCs are a network of packet routers built using the logic and routing resources of the FPGA. While full-featured soft NoCs are expensive to implement in FPGAs [9], several FPGA-optimized NoCs with reduced costs, including CONNECT [10] and Hoplite [11], have been developed. Alternatively, one can modify the FPGA architecture to embed a full-featured hard NoC in the device. This reduces NoC area by 20-23X [9] and increases its speed by 5-6X [9].
FIGURE 3. ... [5]. (c) Cloud-scale [2]. (d) Tarafdar et al. [3]. (e) Byma et al. [13].
Given all these diverse interconnect solutions, it's important to investigate their suitability to virtualized FPGAs. While these interconnect solutions have been used in other contexts, to the best of our knowledge there is no evaluation of them for virtualized accelerators, which require high bandwidth connections, partial reconfiguration constraints, clock crossings from interconnect to accelerator, and different traffic patterns. In our prior work [12] , we demonstrated that soft-bus based FPGA virtualization comes with large costs in terms of frequency and routing congestion. To enable more efficient virtualized FPGAs, this paper's contributions include:
• Building and optimizing a variety of shells using traditional interconnect, two different FPGA-optimized soft NoCs, and hard NoCs.
• Rigorously evaluating these interconnect solutions in terms of accelerator operating frequency, shell area, usable bandwidth, latency, wire demand, and routing congestion.
• Investigating the suitability of all these interconnect solutions in the same virtualized context using realistically large benchmark circuits, running on different clock domains together with partial reconfiguration constraints and traffic patterns restricted to shell-role communication.
• Demonstrating that future datacenter-optimized FPGAs should consider employing hard NoCs as an architectural enhancement.
II. VIRTUALIZED FPGA: CURRENT IMPLEMENTATIONS
A pioneer of large-scale FPGA integration in datacenters is Microsoft Catapult [1]. Catapult enhances the compute capability of every compute server by augmenting it with a high-end Intel Stratix V [14] FPGA board which the host processor can access through Peripheral Component Interconnect Express (PCIe). Catapult's shell components include two DRAM controllers, a custom PCIe interface, and high-speed serial links which facilitate communication with other FPGAs. Microsoft accelerated part of their Bing search engine using Catapult and demonstrated its ability to accelerate massive compute tasks. Fig. 3a shows the high-level abstraction of an FPGA in Catapult. In this implementation, the shell consumes 23% of the Stratix V FPGA resources. IBM chose a different approach for augmenting their datacenters with FPGAs; instead of adding an FPGA board to a processor, they directly attach the FPGAs to the network [5]. The shell in this design is only the network service layer and protocol stacks but it still consumes 32% of the Xilinx Virtex-7 FPGA [15] used in their design. This implementation also uses 8GB of DRAM for the TCP/IP stack. However, this memory is not directly accessible to the application.
Microsoft also announced a more advanced iteration of their FPGA-datacenter integration which directly connects the FPGAs to the network [2]. In this implementation, all network traffic, including processor network packets, goes through the FPGA. This allows the FPGA not only to act as an application accelerator, but also to enhance datacenter-related tasks by transparently processing network packets. The high-throughput processing of network packets going through the FPGA results in significant resource savings; encryption/decryption of packets at the same throughput would require at least five processors [2]. The resources in the shell use 44% of the FPGA resources and are different from the original Catapult as shown in Fig. 3c.
There are also several academic studies that have integrated FPGAs into datacenters. Tarafdar et al. [3], like the original Catapult, also access the FPGA through PCIe. The shell in this design is not as feature-rich as the aforementioned industrial implementations but it still takes 15.5% of their Virtex-7 FPGA resources to support one role [16]. The shell contains a PCIe interface to the host processor, a DRAM memory controller, and a 1GB Ethernet, as shown in Fig. 3d.
The work by Byma et al. [13] supports multi-tenancy; each FPGA can be shared between up to four roles. The shell in this design has four Ethernet cores and a DRAM controller as shown in Fig. 3e and uses 32% of the target Virtex-5 FPGA's resources to enable support for four roles. Although this implementation could potentially support more roles, doing so would degrade the design's timing [17].
Although the details of the interfaces provided by each shell in the aforementioned implementations differ, all of them employ the concept of separation of the shell and the roles. This separation is a fundamental part of FPGA virtualization, as discussed in the Introduction. However, the resource and timing costs of the shell are very significant in all these implementations; this further motivates our study of efficient FPGA virtualization.
VOLUME 6, 2018
III. INTERCONNECT SOLUTIONS FOR VIRTUALIZED FPGAs
In this section, we first provide details of our experimental setup for the rest of the paper. Afterwards, we detail each of the interconnect solutions and the shell we designed using it. 
A. EXPERIMENTAL SETUP
The shells designed in the previous works discussed in Section II typically contain DDR controllers, PCIe, and Ethernet cores. We carry out our experiments using the same type of resources: our shells have two DDR controllers, two PCIe controllers, and 40Gb of network bandwidth as shown in Table 1 . We will use a wide variety of interconnect solutions to create the shell and interconnect it with the roles. A high level overview of the virtualized FPGA in our experiments is shown in Fig. 4 .
For our experiments, we use three sets of roles as shown in Table 1. Synthetic Pipeline is a deeply-pipelined synthetic chain of arbitrarily configured LUTs. Synthetic Filler is a moderately pipelined synthetic circuit that uses a wide variety of FPGA resources: arithmetic, memory, and DSP. Compared to the synthetic pipeline, it operates at a lower frequency but it has a more diverse usage of FPGA resources. FIR Filter is a real application generated by the Quartus IP generator and is also very deeply pipelined. The main reason for selecting these circuits is that they can be easily parametrized to achieve a desired level of device utilization and that their high level of pipelining resembles the ideal design style for high-throughput datacenter applications. We use three different variations of the synthetic filler and synthetic pipeline circuits and two variations of the FIR filter for our experiments.
Throughout our experiments we use two compilation flows: virtualized and flat. The virtualized compilation flow uses Intel's partial reconfiguration flow for Arria 10 in Quartus Prime Pro 16.1. In this flow, every partially reconfigurable region, also known as a role slot, must be floorplanned manually and assigned to a reserved and fixed region on the FPGA. However, the shell can still use the routing resources of these role slots. The shell is compiled together with a representative role in a ''base design'', helping guide optimization of the shell-role interface implementation. Every other role sees and connects to the already placed and routed shell. In cases where we have a soft shell, it is added to the role logic and compiled with it. In the flat compilation flow, the shell and roles are compiled together without any of the mentioned placement and routing restrictions. We still put the shell and roles in separate design partitions to avoid cross-boundary optimizations that could lead to different results when comparing flows.
B. TRADITIONAL FPGA INTERCONNECT SHELL
Within the Intel Quartus CAD flow, one can use Avalon Interfaces [19] to interconnect various components. Avalon supports controlling off-chip devices using either streaming data or memory-mapped transactions. It's also straightforward to graphically interconnect components that comply with the Avalon interface in Intel's Qsys system integration software. Therefore, we begin our experiments by creating a baseline shell using Qsys Avalon as the interconnect solution. Qsys can also automatically pipeline the interconnect to allow it to operate at high frequencies. For our experiments we set the highest number of pipeline stages to four, the maximum value allowed by Qsys.
After creating the shell and exposing the interfaces shown in Table 1 to the roles, we create three separate designs: flat compilation, virtualized without a soft shell, and virtualized with a soft shell. Since the shell we create in Qsys provides all the required functionality, including clock domain crossing using on-chip memory, the soft shell in this case is simply an additional stage of pipelining to allow the role to run at higher frequencies by further optimizing the shell-role interface. In the virtualized designs, we need to floorplan the FPGA and manually dedicate parts of the chip to the roles. We iteratively floorplanned the FPGA to allow up to four separate roles. Putting the role slots in corners of the chip allowed us to expand their regions more than other parts of the chip, so we created a floorplan as depicted in Fig. 5 for the case with four roles. The role slots also have an expanded routing region where the role is allowed to use the routing resources; without this expansion several of the roles would fail in the routing stage of the CAD flow.
C. FPGA-OPTIMIZED SOFT NoC-BASED SHELLS
To interconnect a virtualized FPGA using NoCs, we must first decide on the minimum number of endpoints required in the shell. An endpoint in this case is a router that provides connectivity to a particular resource, such as DDR memory. As shown in Table 1, the external resource with the highest bandwidth is a DDR memory. Therefore, to prevent limiting the bandwidth of DDR, we avoid sharing the routers that connect to the DDR controller with other resources and dedicate one router to each of the DDR controllers. The PCIe controllers and the Ethernet cores have a lower bandwidth than DDR and each can share one router and still not be the limiting factor. Therefore, at least four routers on the shell side are required: two for the two DDR controllers, one for the PCIe controllers and one for the Ethernet cores.
In the roles, one can use an arbitrary number of routers while maintaining connectivity. However, with fewer than four routers, the bandwidth to a role will be limited since there are four endpoints on the shell side. Another fundamental design choice for a shell using NoCs is the topology of the NoC. A mesh is the most commonly used topology, provides great flexibility for floorplanning the role slots, and results in short links. Therefore, for CONNECT and hard NoCs we use a mesh topology. Hoplite, however, requires a torus topology since it is unidirectional [11]. As shown in Fig. 6, we floorplan the shell and the role slots similarly to the floorplan of the shell designed using the traditional bus-based interconnect and add only the routers inside the shell and the role slots. For all NoC-based shells, the soft shell includes a set of wrappers that abstract the NoCs from the users. LYNX [18] can automatically generate such wrappers and we used its wrappers in this work. The wrappers are usually very low cost and simple as they mostly put the bits in the correct position in the generated packets and add other control bits such as destination address to the packet.
1) CONNECT
CONNECT is a soft NoC that employs a wide variety of FPGA optimization strategies. The authors of CONNECT have provided a web interface RTL generator which we used to generate a mesh network with the dimensions shown in Fig. 6. Since CONNECT routers are relatively large, we have to allocate a significant fraction of the FPGA area to the routers. Ideally we would want a link width of 610 data bits which would allow an entire DDR transaction to be sent in one packet. However, the size of CONNECT routers increases with their link width, so we were forced to limit the width of the data link to 128 data bits; without this link width reduction, the routers would be too big for our roles to be successfully placed and routed in the remainder of the FPGA. Since we are sending packets larger than the link width, in addition to the NoC wrappers, the soft shell now includes width adaptation circuitry to slice the packets into smaller packets at the sending endpoint and reconstruct the original packet at the receiving endpoint. CONNECT routers don't include clock domain crossing logic, so this feature is also added to the soft shell.
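The slicing and reconstruction performed by the width adaptation circuitry can be sketched in software. This is a minimal behavioural model, not the RTL used in this work or the wrappers generated by LYNX: the function names, the least-significant-slice-first ordering, and the absence of per-flit headers are all assumptions made for illustration.

```python
import math


def slice_packet(payload, payload_bits, link_bits):
    """Split an integer payload into link-wide slices, least significant first.

    Assumption: slices carry data only; real flits also carry control bits.
    """
    n_flits = math.ceil(payload_bits / link_bits)
    mask = (1 << link_bits) - 1
    return [(payload >> (i * link_bits)) & mask for i in range(n_flits)]


def reassemble(flits, link_bits):
    """Rebuild the original payload from slices received in order."""
    payload = 0
    for i, flit in enumerate(flits):
        payload |= flit << (i * link_bits)
    return payload


DDR_PACKET_BITS = 610   # width of an entire DDR packet, from the text
LINK_BITS = 128         # CONNECT data link width, from the text

# Arbitrary test pattern that uses the top bit of the 610-bit packet.
packet = (1 << (DDR_PACKET_BITS - 1)) | 0xDEADBEEF

flits = slice_packet(packet, DDR_PACKET_BITS, LINK_BITS)
restored = reassemble(flits, LINK_BITS)

print(f"flits per DDR packet: {len(flits)}")
print(f"round trip intact:    {restored == packet}")
```

With a 128-bit link, a 610-bit DDR packet needs five slices, so each DDR transaction occupies the link for five transfers instead of one.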
2) HOPLITE
Hoplite's design strategy is to provide very simple single-flit routers that use deflection routing instead of buffering and virtual channels. This strategy makes the routers very small, but may negatively affect the performance under heavy injection. Since Hoplite relies on a torus topology to deflect its packets, we need to pipeline the torus link that spans almost the entire chip to avoid frequency degradation. We have carved out a part of every role and allowed the shell to place the vertical torus link registers in those spots as shown in Fig. 7. We carefully allocated these spots in the floorplan to allow Hoplite to run at a very high frequency; our floorplanned Hoplite can support the entire bandwidth of a DDR controller when there is no deflection by other routers. The link width of the Hoplite routers is 610 data bits, large enough to support an entire DDR packet. Similar to CONNECT, clock domain crossing circuitry and NoC wrappers are added via the soft shell. However, unlike CONNECT, the router links are wide enough for an entire DDR packet and no width adaptation circuitry is required in the soft shell.
D. HARD NoC-BASED SHELL
A hard NoC is a pre-fabricated network of packet-switched routers spread out throughout the chip in fixed locations, connected to each other with hard dedicated links [9]. Since hard NoCs are pre-fabricated and do not use the FPGA programmable resources, they are much more area-efficient and operate at much higher frequencies than NoCs overlaid on the FPGA [9]. Therefore, hard NoCs include much of the otherwise area-consuming logic, such as width adaptation logic, clock domain crossing, and some buffering. The shell and the role logic connect via the soft shell to a router's fabric port [20], which handles the width adaptation, clock domain crossing, and flow control. Since the hard routers operate at a much higher frequency than the rest of the FPGA fabric, the width adaptation circuitry offers a wider interface to the role slots than the links between the routers. This means that hard routers only need a data link width of 160 bits to offer a 640-bit-wide data interface to the role, more than the 610 bits required for an entire DDR packet. The soft shell in the case of a hard NoC-based shell is only the NoC wrappers.
We use a similar floorplan to the other NoC-based shells as shown in Fig. 6. Since existing FPGAs do not contain hard routers, we take several steps to properly model them in Quartus. First, we obtain the area and delay of hard routers by using Hybrid COFFE [21] to synthesize an open-source parametrized virtual channel router [22] plus the router fabric port [18] using STMicroelectronics 28nm Fully-Depleted Silicon On Insulator standard cells [23]. Then we inflate the obtained area by 66.7% as in prior work [9] to account for whitespace, wire buffering, and wiring during placement and routing. Since the rest of the fabric is running on an Arria 10 FPGA, which uses a more advanced 20nm technology node, we believe that our estimates are conservative enough in terms of area and delay to not favour hard NoCs. Lastly, in the Quartus software, we reserve parts of the chip, as shown in the floorplan, to be dedicated to routers. Since no other logic is allowed to be placed in these regions, the design shows the same placement and routing behaviour as if the logic blocks were carved out and replaced with routers.
Hard NoCs have several key characteristics that make them a suitable interconnect solution for virtualized FPGAs:
• Area Efficiency: We would like the shell and the interconnect to be as small as possible to leave more of the FPGA resources for the roles. Hard NoCs are extremely area-efficient and consume less area than their traditional interconnect counterpart, soft parallel buses [24]. In our experiments, the entire network of feature-rich routers consumes an area equivalent to 2% of the logic elements in the Arria 10 FPGA.
• High Bandwidth: Our synthesis results show that a hard router with a data link width of 160 data bits can run at 1.45 GHz. This frequency allows the routers to efficiently deliver the high bandwidth of the external resources throughout the chip.
• Dedicated Links: Traditional FPGA interconnect heavily consumes the programmable routing resources of FPGAs when delivering large buses to multiple roles. On the other hand, hard NoCs use dedicated routing resources that operate at much higher frequencies, allowing the links to be much narrower at the same data rate.
• Separation from Programmable Logic: The hard NoC is not part of the FPGA programmable logic and therefore does not need to be reconfigured to route data packets. This separation allows the hard routers to be located inside of the role regions, as shown in Fig. 6, delivering the data from inside of the role region instead of its boundaries, thereby reducing the stress on placement and routing.
A summary of key parameters of the shells we designed in this work using various interconnect solutions is provided in Table 2.
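The link-width and frequency figures quoted for the hard NoC can be related to the DDR data rate with a short calculation. The sketch below is only a sanity check under stated assumptions: the constant names are ours, one word is assumed to move per clock cycle on both the link and the role interface, and the DDR rate assumes two transfers per cycle of a 64-bit, 1066 MHz interface.

```python
NOC_CLOCK_HZ = 1.45e9        # hard router frequency, from the text
LINK_WIDTH_BITS = 160        # hard NoC data link width, from the text
ROLE_IF_WIDTH_BITS = 640     # role-side interface width, from the text

# Assumed DDR peak rate: 64 bits, 1066 MHz, two transfers per cycle.
DDR_BW_BYTES_PER_S = 64 / 8 * 1066e6 * 2

# Peak payload rate of one hard link, one word per NoC cycle.
link_bw = LINK_WIDTH_BITS / 8 * NOC_CLOCK_HZ

# Width ratio handled by the fabric port's width adaptation.
width_ratio = ROLE_IF_WIDTH_BITS // LINK_WIDTH_BITS

# Fabric clock at which the 640-bit interface would saturate one link.
saturating_fabric_clock = NOC_CLOCK_HZ / width_ratio

# Fabric clock needed on the 640-bit interface to carry one DDR channel.
ddr_fabric_clock = DDR_BW_BYTES_PER_S / (ROLE_IF_WIDTH_BITS / 8)

print(f"link bandwidth:          {link_bw / 1e9:.1f} GB/s")
print(f"DDR channel bandwidth:   {DDR_BW_BYTES_PER_S / 1e9:.2f} GB/s")
print(f"width ratio:             {width_ratio}")
print(f"link-saturating clock:   {saturating_fabric_clock / 1e6:.1f} MHz")
print(f"clock for one DDR chan.: {ddr_fabric_clock / 1e6:.1f} MHz")
```

Under these assumptions a single 160-bit link has headroom over one DDR channel, which is consistent with the claim that narrow hard links can deliver the external bandwidth.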
IV. EXPERIMENTAL RESULTS
A. FREQUENCY
Figure 8 shows the average and the range of role operating frequency for various benchmark circuits using different interconnect solutions and different numbers of roles. The top left subfigure shows the frequency achieved by each of our benchmark circuits when the FPGA supports only one role, for each interconnect solution. On average, flat compilation produces the fastest circuits, and virtualization with one role slows the average circuit by 23% with traditional buses, but only 5% with hard NoCs. The top right subfigure is the case with two roles. The circles show the average role frequency for each benchmark and interconnect solution, while the stars connected by a vertical line show the range of frequencies across the two role slots. A thorough analysis of these results yields the following interesting observations:
• While using the traditional interconnect in flat compilation gives high frequency results, virtualizing FPGAs using the traditional interconnect results in a significant average frequency drop ranging from 23% to 33% depending on the number of roles.
• Using a soft shell to insert more pipeline stages when using traditional soft buses reduces the average frequency drop to 7% when there is only one role. With more roles, the gain of these additional pipeline stages diminishes; in the four role case the frequency drop of 25% is only slightly improved to 24%.
• Using traditional interconnect in a virtualized FPGA not only lowers the average role operating frequency but also makes the frequency dependent on the physical allocation of the role to a role slot on the FPGA as shown by the increasing range of role frequency. This is an adverse effect that would force the datacenter owners to choose between two undesirable options: advertise the lowest frequency of this range, or manage the role slots as heterogeneous resources.
• Using any of the packet-switched networks on chip results in higher operating frequencies with a narrower range than the traditional interconnect. Hard NoCs result in the lowest frequency drop, as low as 1% and as high as 5% depending on the number of roles. CONNECT and Hoplite result in a slightly larger frequency drop but both achieve higher role frequency than the traditional interconnect with virtualized compilation.
• The frequency range of roles is lower in hard NoCs and Hoplite than in CONNECT, especially for roles that require more resources. Since CONNECT consumes a large portion of the FPGA area, placement and routing become more difficult for larger roles, making the results more sensitive to the allocation of roles to slots.
Overall, while there is a significant role frequency drop when virtualizing FPGAs using the traditional interconnect, a packet-switched network of routers can significantly mitigate this problem. Averaging across all the benchmark circuits and number of roles, virtualizing an FPGA interconnected with the traditional bus interconnect results in 28% frequency degradation which can be mitigated by using a soft shell to 20%. The frequency drop of NoC-based virtualized FPGAs is much lower: 10% for Hoplite, 8% for CONNECT, and only 4% using hard NoCs.
B. PERFORMANCE AND DEVICE UTILIZATION
In addition to allowing a high role frequency, an ideal shell would provide high bandwidth and low latency access to I/O resources while having low resource consumption. In state-of-the-art FPGAs such as Arria 10, many of the controllers used in the shell, such as transceivers, PCIe controllers, and DDR controllers, are hardened and using them does not consume fabric resources, leaving more space for role logic. However, other parts of the shell such as the interconnect and arbitration can still consume significant resources. Fig. 9a shows the fraction of the FPGA area consumed by shells supporting different numbers of roles designed using various interconnect solutions. For the traditional bus, the size of the shell grows as the number of role slots increases but the NoC-based shells consume the same amount of resources regardless of the number of roles. To support only one role, a traditional bus-based shell is more area-efficient than other solutions. However, to support more than one role, hard NoCs are the most area-efficient solution. Hoplite becomes a more area-efficient solution than the traditional bus with more than two roles. CONNECT consumes about half of the FPGA resources and we were unable to increase the link size of CONNECT to allow DDR data to be sent over fewer than 5 flits as it would consume even more of the device and some of the roles could not be placed in the remaining space.
FIGURE 10. A virtualized FPGA with three roles and four different types of endpoints. Notice the shared paths in a NoC for different endpoints.
We would like to examine how the interconnect limits I/O bandwidth when it is shared between several endpoints, i.e., when multiple I/O requests are made by one or more roles at the same time. To better illustrate this concept, Fig. 10 shows a NoC-based virtualized FPGA with three roles and four different I/O endpoint types. The data flows with the dimension-ordered routing used by all considered NoCs are also shown. Consider the role on the top left: even within this role, many paths to the corresponding endpoints in the shell go through the same routers and hence the usable bandwidth at an endpoint is actually lower than the theoretical bandwidth of a router. Since DDR is the component with the highest bandwidth, we look at DDR's bandwidth usable at one of the endpoints in a role assuming an ideal DDR response in the shell.
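The path sharing described above follows directly from dimension-ordered routing. The sketch below routes four flows on a small mesh and counts how many flows cross each directed link. The router coordinates, the flow names, and the choice of routing in X before Y are hypothetical and do not reproduce the floorplan of Fig. 10.

```python
from collections import Counter


def xy_route(src, dst):
    """Directed links visited by X-then-Y dimension-ordered routing on a mesh."""
    x, y = src
    links = []
    while x != dst[0]:                      # travel along X first
        nxt = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nxt, y)))
        x = nxt
    while y != dst[1]:                      # then travel along Y
        nxt = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, nxt)))
        y = nxt
    return links


# Hypothetical placement: four role-side routers, four shell-side routers.
flows = {
    "ddr0": ((0, 0), (3, 0)),
    "ddr1": ((1, 0), (3, 1)),
    "pcie": ((0, 1), (3, 2)),
    "eth":  ((1, 1), (3, 3)),
}

# Number of flows crossing each directed link.
load = Counter(link for src, dst in flows.values() for link in xy_route(src, dst))
shared = {link: n for link, n in load.items() if n > 1}

for link, n in sorted(shared.items()):
    print(f"link {link[0]} -> {link[1]} carries {n} flows")
```

Any link that carries more than one flow must split its bandwidth between them, which is why the usable bandwidth at an endpoint can fall below the theoretical bandwidth of a router.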
As depicted in Fig. 9b, traditional interconnect can provide the entire bandwidth to one role. As the number of roles increases, assuming perfect arbitration in the traditional interconnect, the bandwidth gets equally divided between the endpoints in different roles. For NoCs, different endpoints in every role compete with each other over injecting packets to the network and therefore, the bandwidth might be limited. In our experiments for hard NoCs with BookSim 2.0 [25], the usable DDR bandwidth was limited to 76% with one role. As the number of roles increases, the limiting factor becomes the I/O endpoint and the usable bandwidth becomes equal to that of the traditional interconnect. The usable bandwidth for CONNECT is less than 2% of a DDR channel due to its low frequency and the small link size imposed by its relatively large area. The bandwidth usable in Hoplite is also 76% when there is only one role, similar to hard NoCs. Adding more roles, however, causes Hoplite's usable bandwidth to drop as shown in Fig. 9b. This is most likely due to the nature of deflection routing in Hoplite; deflecting packets at router ports with contention causes more packets to exist in the network than with hard NoCs, which sometimes prevents the endpoints from injecting more packets into the network.
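The trend described for the traditional interconnect and for hard NoCs can be written as a simple first-order model. This is our simplified reading of the paragraph, not a reproduction of the BookSim results: the function names are hypothetical, the 76% injection limit is taken from the text, and the model deliberately omits CONNECT and Hoplite, whose behaviour depends on effects it does not capture.

```python
def traditional_share(n_roles):
    """Per-role fraction of one DDR channel under perfect arbitration."""
    return 1.0 / n_roles


def hard_noc_share(n_roles, injection_limit=0.76):
    """Per-role fraction for a hard NoC.

    Assumption: the role-side injection limit (76% in the text) applies
    until the equally shared I/O endpoint becomes the bottleneck.
    """
    return min(injection_limit, 1.0 / n_roles)


for n in (1, 2, 3, 4):
    print(f"{n} role(s): traditional {traditional_share(n):.2f}, "
          f"hard NoC {hard_noc_share(n):.2f}")
```

In this model the two curves differ only for a single role; from two roles onward the shared I/O endpoint sets the limit for both.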
Although FPGA designs are usually somewhat latency-tolerant, latency is also an important performance metric. Virtualizing FPGAs does not fundamentally change latency, and therefore all prior latency studies on hard NoCs [18], CONNECT [10], and Hoplite [26] remain valid for virtualized FPGAs. However, we present the unloaded latency data in Fig. 9c for the sake of completeness. The latency of the traditional bus-based interconnect is shown with a teal circle. This latency is measured by adding the latency of clock domain crossing and of the pipeline stages in the interconnect. The small range in this latency, shown by a vertical bar, is due to small differences between the operating frequencies of the buses connected to different I/O resources. The hard NoC's latency, shown with a red square, is calculated by assuming a latency of two cycles per router as in prior work [18], calculating the range of the number of hops between role endpoints and their corresponding shell endpoints, and adding the clock domain crossing latency to enter and exit the NoC. Hard NoCs are clearly superior in latency compared to other solutions.
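The hard NoC latency calculation described above can be sketched as follows. The function name, the frequencies, the hop range, and the clock domain crossing latency are placeholder assumptions for illustration; they are not the values measured in this study.

```python
def noc_latency_ns(num_hops, noc_freq_mhz, cdc_ns_each_way, cycles_per_router=2):
    """Unloaded latency: router cycles along the path plus the clock
    domain crossings to enter and to exit the NoC."""
    cycle_ns = 1000.0 / noc_freq_mhz
    return num_hops * cycles_per_router * cycle_ns + 2 * cdc_ns_each_way

# Placeholder numbers: the same hop range evaluated at a high and a low
# NoC clock frequency.
min_hops, max_hops = 1, 6
for freq_mhz in (1000.0, 100.0):
    low = noc_latency_ns(min_hops, freq_mhz, cdc_ns_each_way=5.0)
    high = noc_latency_ns(max_hops, freq_mhz, cdc_ns_each_way=5.0)
    print(freq_mhz, low, high, high - low)
```

With identical hop counts, the spread between the shortest and longest path scales inversely with the clock frequency, which is why a fast NoC shows both a lower latency and a narrower range than a slow one with the same topology.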
This superiority is due to their very high frequency compared to other solutions, despite having more pipeline stages than the traditional interconnect. We can also see that the range of latency in hard NoCs is very small despite the difference in the number of hops that packets from different endpoints must traverse, again by virtue of their very high operating frequency. Although CONNECT and hard NoCs have the same topology and therefore the same number of stages between two corresponding endpoints, CONNECT's latency, shown with a green triangle, is much higher than that of hard NoCs and has a wider range due to its significantly lower operating frequency. Hoplite's unloaded latency, shown with a star, is also relatively small and ranks third after hard NoCs and the traditional interconnect, but it has a wider range and varies with the location of the endpoint due to the unidirectional nature of the Hoplite network links.
Fig. 11 depicts the wire usage of the different shells using various interconnect solutions. The types of wires are those used in Intel Arria 10 FPGAs. Block and direct wires are unit-size wires that span only one block. C wires are vertical and R wires are horizontal wires that span a longer length; for example, R32 is a horizontal wire with a length of 32 FPGA tiles. We can make the following observations from this figure:
• Virtualizing FPGAs using the traditional interconnect comes at a considerable wirelength increase of up to 2.78X.
• The increase in wire demand is larger for longer wires, which are expensive to manufacture and important for the timing closure of designs; the virtualized shell supporting four roles with traditional interconnect needs about 5 times more R32 wires than its flat counterpart.
• Using soft NoCs is also costly in terms of shell wire demand; shells made using CONNECT and Hoplite respectively require 6.8X and 4.3X more wire than the shell made using hard NoCs.
• Using hard NoCs as the interconnect solution mitigates the increase of wire demand of virtualization from 178% to 1%.
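The observations above quote wire demand both as a ratio (2.78X) and as a percentage increase (178%). The short sketch below shows how the two relate; the helper names are hypothetical and only the figures quoted in the list are used.

```python
def ratio_to_percent_increase(ratio):
    """Convert a 'times the baseline' ratio into a percentage increase."""
    return (ratio - 1.0) * 100.0

def percent_increase_to_ratio(percent):
    """Convert a percentage increase back into a ratio over the baseline."""
    return 1.0 + percent / 100.0

# 2.78X wire demand with the traditional interconnect is a 178% increase.
print(round(ratio_to_percent_increase(2.78)))
# A 1% increase with hard NoCs corresponds to 1.01X the baseline.
print(percent_increase_to_ratio(1.0))
```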
C. ROUTING UTILIZATION AND CONGESTION
While the wirelength consumed by the shell is a good metric to compare overall FPGA wire demand, we should also look at the routing congestion to get a full profile of the routing resource usage. Routing congestion in the shell can deteriorate the timing of roles that have high wire demand and can even cause routing failure. Visualizing routing congestion also helps assess the scalability of each interconnect solution with regard to the external I/O bandwidth, which will continue to grow in future FPGAs, necessitating wider links. Fig. 12 shows the routing congestion of all FPGA tiles using different interconnects. This figure shows congestion data for each interconnect using the largest variation of each circuit as the roles, and also an empty role to be able to look at the shell alone. The routing congestion data is color-coded; dark blue shows FPGA tiles with unused routing resources and dark red shows FPGA tiles with fully utilized routing resources. A key to this color-coding scheme is provided inside the figure. We can make the following observations from the data presented in this figure:
• In the flat compilation flow, the congestion increases slightly with the number of roles but always stays lower than in the virtualized flow. This demonstrates that the congestion is not due to the nature of the roles.
• When virtualizing the FPGA using the traditional compilation flow, the FPGA is congested and this congestion worsens as the role count increases. Also, adding more pipeline stages using the soft shell does not affect congestion significantly.
• Using either CONNECT or Hoplite as the interconnect solution causes the highest congestion. This congestion, however, does not increase much as we increase the number of roles. This is due to the fact that implementing a soft NoC requires a significant portion of the FPGA routing but then the NoC can serve any number of roles without much increase in the routing demand.
• When hard NoCs are used as the interconnect solution of the virtualized FPGA, routing congestion is very low and becomes similar to the flat compilation flow.
V. CONCLUSION
Table 3 shows the ranking of all of the interconnect solutions for different metrics, with one being the best ranking and five being the worst. The traditional bus-based FPGA interconnect has several limitations for virtualized FPGAs: the operating frequency of the accelerators running on the FPGA drops considerably, shell area increases as the number of supported accelerators increases, and the significant shell wire demand and routing congestion can cause timing closure problems and routing failures in the user accelerators. While using FPGA-optimized soft NoCs mitigates the problem of accelerator frequency drop, it exacerbates the wire demand and routing congestion problems, increases the design latency, and lowers the usable external resource bandwidth. Hard NoCs prove to be the best solution for interconnecting virtualized FPGAs across all of these metrics. Our results indicate a strong case for the inclusion of hard NoCs in future datacenter-optimized FPGAs.
There are many areas for future work to make FPGA virtualization more efficient. Hard NoC architectures could be extended with direct interfaces to high-bandwidth I/O blocks such as PCIe and DDR, potentially reducing latency and wire demand with only a small area cost. Shells should provide not only connectivity but also security functions to isolate accelerators from each other; evaluating the best way to integrate these functions with each interconnect solution is an important research direction.
