The network-on-chip (NoC) is a critical subsystem for many large-scale systems-on-chip (SoCs). We present a complete framework for the design and optimization of NoCs at the system level. By combining a library of pre-designed configurable NoC modules specified in SystemC with high-level synthesis, we can generate a variety of alternative 2D-mesh NoC architectures for a given SoC. We also support the automatic synthesis of network interfaces to translate between IP-specific messages and NoC flits. We demonstrate our approach with the design-space exploration of two complete SoCs running complex applications on a high-end FPGA board.
INTRODUCTION
Networks-on-chip (NoCs) play a critical role in the integration of components in large-scale systems-on-chip (SoCs) at design time, and have a major impact on their performance at run time. Over the last few years, the research community has produced many different frameworks and tools for NoC design and optimization [7, 14, 16, 17]. Most of these approaches provide some degree of parameterization, which allows designers to optimize the NoC architecture for the target SoC and the given ASIC or FPGA technology.
We leveraged this aggregate research experience for the development of ICON (Interconnect Customizer for the On-chip Network). ICON is a new framework for the design and optimization of NoCs at the system level. Some of its distinguishing features include: support for virtual channels for message-class isolation, which is critical for the prevention of protocol deadlock [20]; the ability to generate NoC architectures that combine multiple physical networks with multiple virtual channels [23]; and the ability to explore the NoC design space by varying the NoC parameters in a non-uniform way (e.g., to have different numbers of virtual channels per input port in a router [9]). The generation of NoCs with ICON relies on a rich library of parameterized components that can be combined in a modular way to create complex NoC subsystems and, ultimately, a complete NoC architecture tailored to the target SoC. Table 1 reports a list of the key components that can be used to generate a variety of router micro-architectures.
ICON promotes system-level design as it allows the automatic generation of NoC architectures specified in SystemC. These generated specifications can be integrated with full-system simulators, known as virtual platforms, as well as synthesized with high-level synthesis (HLS) tools to produce corresponding RTL implementations. Makefiles and scripts for synthesis, simulation, and co-simulation across various levels of abstraction are automatically generated along with the SystemC source code.
* Young Jin is now with Intel Corporation, Hillsboro, OR.
NOCS '17, October 19-20, 2017
By bringing the description of the NoC to a higher level, ICON enables the exploration of a broader design space through the combination of system-level parameters with micro-architectural settings for the HLS tool. Also, the compatibility with virtual platforms allows fast full-system simulation, which is crucial to increase the number of design points that can be evaluated.
After summarizing the most closely related NoC research in Section 2, we present the overall architecture of ICON and its unique features in Section 3. Then, in Section 4 we demonstrate some of the capabilities of ICON by generating 36 different NoC configurations that can be seamlessly integrated in two SoCs, which we designed and implemented on an FPGA board. We present a comparative analysis of resource utilization and performance across these NoC configurations for the two SoC designs while running real workloads. We also report estimates of area occupation and throughput for a corresponding ASIC implementation tested with synthetic traffic patterns.
RELATED WORK
How to design low-latency and high-bandwidth architectures by combining flexible and configurable parameterized components has been the focus of many papers in the NoC literature.
Mullin et al. proposed low-latency virtual-channel routers with a free virtual channel queue and VA/SA speculation that offer a high degree of design flexibility in SystemVerilog [14]. Kumar et al. demonstrated a 4.6 Tbits/s, 3.6 GHz single-cycle NoC router with a novel switch allocator scheme that improves the matching efficiency by allowing multiple requests per clock cycle and keeping track of previously conflicted requests [11]. Becker presented a state-of-the-art parameterized virtual channel router RTL with a new adaptive backpressure mechanism that improves the utilization of the router input buffers [3]. Dall'Osso et al. developed ×pipes as a scalable and high-performance NoC architecture, where parameterizable SystemC component specifications are instantiated and connected to create various NoCs [5]. Stergiou et al. improved this architecture by presenting ×pipes Lite, a synthesizable parameterizable NoC component library that includes OCP 2.0 compatible network interfaces, and by providing a companion synthesis and optimization flow [22]. Fatollahi-Fard et al. developed OpenSoC Fabric [7], a tool that simplifies the generation of NoCs from parameterized specifications by leveraging the properties (abstract data types, inheritance, etc.) of the Chisel hardware description language [1].
A large portion of NoC research has focused on FPGAs. In developing ICON we kept in mind the lessons from many of these works. Given the common emphasis on system-level design, our work perhaps has the most in common with the CONNECT project. However, we trade off some optimization in favor of a more flexible framework that targets both ASIC and FPGA technologies. Distinctively, ICON is the first system-level framework that can generate hybrid NoC architectures which combine virtual channels with multiple physical planes. In addition, ICON pushes the design entry point to the system level in a way that enables the exploration of a broader design space and the evaluation of a very large number of design points in that space.
THE ICON FRAMEWORK
The main advantage of ICON is that users can generate multiple different NoCs, integrate them into existing SoCs, and create new NoC components with minimal effort. Most of this flexibility is achieved by allowing users to mix and match several heterogeneous instances of each sub-component listed in Table 1 to build customized NoC components. Following a user-defined topology and connection scheme, these components are then automatically connected to generate the desired NoC configuration. In addition, ICON generates the necessary simulation environment and testbench for validation, which can be reused across all NoC configurations generated with both pre-configured and custom sub-components. Furthermore, users can extend the set of configuration parameters available to ICON. For example, a user can add definitions of round-robin or random-based arbiters to create new types of virtual channel (VC) allocators. At a higher level in the NoC hierarchy, these allocators can be selected to build different types of routers.
Besides NoC generation, ICON automatically creates network interfaces according to the message types and message classes specified for the IP components of the SoC. Hence, users can mix and match different NoC configurations without changing the IP component specification. Alternatively, the same NoC can be used for multiple SoCs, each with a specific set of message types and message classes. All customized NoC components can be seamlessly integrated. The communication behavior of components of the same type, i.e. a component group, is pre-defined in ICON. Testbenches and synthesis scripts can be shared within a component group. This simplifies the validation of user-defined NoC components and their integration into the target system.
ICON consists of six main parts: the configuration parser, the script generator, the NoC component generator, the testbench generator, the SystemC NoC library, and the testbench component library. Fig. 1 illustrates the high-level relationships between these parts and the flow that ICON follows to generate the NoC design and the corresponding scripts for synthesis and simulation. Starting from the user-provided specification of the NoC through an XML template, the parser instantiates the necessary objects to build the NoC architecture with the desired configuration. The objects are then sent to the three generators that produce the actual NoC design, together with the scripts for synthesis and simulation, and the SystemC testbenches to validate the design. With the parameter-specific or customized SystemC code from the NoC component generator, the user can launch first HLS and then logic synthesis using the Tcl scripts from the script generator. The synthesized RTL and netlist can then be co-simulated with the same testbenches by using the generated Makefiles. The SystemC testbench component library is equipped with the set of synthetic traffic models commonly used to evaluate NoCs. These traffic models can be controlled with simulation configurations given in the XML specification.
SystemC NoC Component Library. The SystemC NoC library contains a rich set of components and sub-components that are specified based on object-oriented programming and that can be combined hierarchically to obtain a variety of NoC architectures. Table 1 gives an example of the many components and sub-components for the router and their hierarchical relationships. The router class is one of the main classes and is defined in the NoC component library as a collection of input units, output units, VC and SW allocators, and crossbars. All these sub-components are defined as C++ template parameters in the router class to provide the flexibility of combining various sub-component implementations to build a router. A component like the router can have a uniform microarchitecture, where every sub-component is configured with the same parameter values, or a non-uniform microarchitecture. An example of the latter is a router that supports different numbers of virtual channels across different inputs. The NoC component generator instantiates a predefined design from the library for a uniform microarchitecture, while it creates a customized SystemC class at runtime for non-uniform microarchitectures.
By sharing the same interface across different implementations, NoC components in ICON can be seamlessly combined into a bigger component. Fig. 2 illustrates an example of how these common interfaces are specified for the case of virtual channel allocators. All allocators are derived from allocator_base (Fig. 2(a)), and the numbers of input and output (I/O) virtual channels are specified in vc_allocator_base (Fig. 2(b)). When using uniform sub-components to create a large component, ICON leverages SystemC template parameters. For example, the input-first VC allocator [6] is derived from vc_allocator_base, and contains multiple arbiters in the I/O stages (Fig. 2(c)). For each I/O stage, the type of arbiter is specified as a template parameter for the input-first VC allocator implementation in the NoC component library. If multiple non-uniform sub-components need to be instantiated in a component, e.g. a different number of output VCs per output unit, the front-end SystemC generator dynamically produces SystemC classes by inheriting the common interfaces defined in the SystemC NoC library. For example, to create the allocator of Fig. 2(d) derived from the one of Fig. 2(c), the template parameters for the I/O arbiters are specified as 4-to-1 round-robin arbiters based on the XML specification, and some unused VCs (gray lines) are bound to constants.
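To make the role of the 4-to-1 round-robin arbiters more concrete, the following is a minimal behavioral sketch in Python. It is not ICON's SystemC implementation: the class name `RoundRobinArbiter`, the helper `tie_off`, and the choice to give input 0 the initial priority are assumptions made only for illustration. The helper mimics the idea of binding unused VCs to constants by forcing their request bits to false.

```python
class RoundRobinArbiter:
    """Behavioral model of an N-to-1 round-robin arbiter (hypothetical sketch)."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        # Index of the most recent winner; chosen so input 0 has priority first.
        self.last = num_inputs - 1

    def arbitrate(self, requests):
        """Return the index of the granted input, or None if nobody requests."""
        if len(requests) != self.num_inputs:
            raise ValueError("one request bit per input is expected")
        # Start scanning right after the last winner, which gets lowest priority.
        for offset in range(1, self.num_inputs + 1):
            idx = (self.last + offset) % self.num_inputs
            if requests[idx]:
                self.last = idx
                return idx
        return None


def tie_off(requests, used):
    """Force the request bits of unused inputs to False (constant binding)."""
    return [bool(r) and u for r, u in zip(requests, used)]
```

With four inputs and a fixed request vector, successive calls to `arbitrate` rotate the grant among the requesting inputs, so no requester is starved.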
Input and Output Units. Fig. 3 illustrates how the I/O units are implemented in the SystemC NoC library. Both I/O units consist of flow-control, status control, and pipeline control modules with optional FIFOs to store flits. In addition, an input unit contains a routing unit to calculate the designated output port based on the destination information in the header flit. The routing unit in Fig. 3(a) not only produces the output port of the flit, but also provides the possible output VCs based on the message class of the input VCs. By providing extra information for the output VCs at the routing stage, input units avoid sending unnecessary requests to the VC allocator. Therefore, a generic VC allocator implementation can be used without any modification for message-class isolation. Instead of managing the granted inputs and outputs and their VC information with a centralized status logic, ICON relies on distributed VC and flow management between I/O units. A distributed design makes it easier to instantiate non-uniform I/O ports. It also helps to control the status of the non-uniform I/O ports that characterize a network interface.
Network Interfaces. In order to support multiple physical networks [23], message-class isolation [20], and non-uniform packet specification, we designed network interfaces in ICON as routers with non-uniform data types for the input or output ports. Thanks to the parameterized and component-based design, the implementation of the I/O unit for both source and destination network interfaces reuses most of the router sub-component implementations in the NoC component library.

    <network_type name="example2x2">
      <source_network_interfaces num_src="4">
        <source_network_interface index="0" type="sni"/>
        <source_network_interface index="1" type="sni"/>
        <source_network_interface index="2" type="sni"/>
        <source_network_interface index="3" type="sni"/>
      </source_network_interfaces>
      <destination_network_interfaces num_dest="4">
        <destination_network_interface index="0" type="dni"/>
        <destination_network_interface index="1" type="dni"/>
        <destination_network_interface index="2" type="dni"/>
        <destination_network_interface index="3" type="dni"/>
      </destination_network_interfaces>
      <routers num_routers="4">
        <router index="0" type="r2x2"/>
        <router index="1" type="r2x2"/>
        <router index="2" type="r2x2"/>
        <router index="3" type="r2x2"/>
      </routers>
      <channels>
        <channel type="ch" src_ni="0" src_port="0" dest_router="0" dest_port="4"/>
        <channel type="ch" src_ni="1" src_port="0" dest_router="1" dest_port="4"/>
        <channel type="ch" src_ni="2" src_port="0" dest_router="2" dest_port="4"/>
        <channel type="ch" src_ni="3" src_port="0" dest_router="3" dest_port="4"/>
        <channel type="ch" src_router="0" src_port="4" dest_ni="0" dest_port="0"/>
        <channel type="ch" src_router="1" src_port="4" dest_ni="1" dest_port="0"/>
        <channel type="ch" src_router="2" src_port="4" dest_ni="2" dest_port="0"/>
        <channel type="ch" src_router="3" src_port="4" dest_ni="3" dest_port="0"/>
        <channel type="ch" src_router="0" src_port="1" dest_router="1" dest_port="0"/>
        <channel type="ch" src_router="0" src_port="3" dest_router="2" dest_port="2"/>
        <channel type="ch" src_router="1" src_port="0" dest_router="0" dest_port="1"/>
        <channel type="ch" src_router="1" src_port="3" dest_router="3" dest_port="2"/>
        <channel type="ch" src_router="2" src_port="1" dest_router="3" dest_port="0"/>
        <channel type="ch" src_router="2" src_port="2" dest_router="0" dest_port="3"/>
        <channel type="ch" src_router="3" src_port="0" dest_router="2" dest_port="1"/>
        <channel type="ch" src_router="3" src_port="2" dest_router="1" dest_port="3"/>
      </channels>
    </network_type>

    Fig. 5: Example of an XML tree that defines a simple 2x2 2D-mesh NoC.

Specifically, a source network interface is implemented as a specialized router where the input unit accepts packets and produces multiple flits, while a destination network interface is implemented as a specialized router where the output unit collects multiple flits to produce a packet. Fig. 4 illustrates the specialized I/O units used to build a network interface. Compared to the router I/O units shown in Fig. 3, all components are the same, with the exception of the packet splitter and the flit merger. Starting from the user specification of the packet format for the source and destination, ICON creates a SystemC module that implements a custom channel. The latter is characterized by a specific interface implemented with the list of input ports (sc_in) and output ports (sc_out) for the module. This channel is also used as a data type to create status, flow-control, and FIFOs for the I/O units. Packet splitters and flit mergers are attached to these components to translate a packet from/to multiple flits. Since the flit is the basis of the control mechanism between I/O units, the packet splitter and flit merger must manage the request and grant signals between the input status and the switch allocator. For example, upon receiving a packet from the input queue, the packet splitter creates requests and manages grants for the switch allocator until the entire packet is sent to the output unit as a sequence of multiple flits. After sending the last flit of a packet, the packet splitter sends a grant signal back to the input status to indicate that the transmission is complete. Similarly, flit mergers keep collecting flits from input units to build a packet and send a grant signal to the output status to indicate when a valid packet is ready.
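The packet-to-flit translation performed by the packet splitter and the flit merger can be sketched at a purely functional level as follows. This Python model is an illustration, not ICON's SystemC code: the function names `split_packet` and `merge_flits`, the head/tail flags, and the zero padding of the last flit are assumptions, and the request/grant handshaking with the switch allocator is deliberately left out.

```python
def split_packet(payload_bits, flit_width):
    """Split a packet, given as a string of '0'/'1', into fixed-width flits.

    Each flit is a dict with head/tail flags and a data field that is
    zero-padded on the right up to flit_width bits (assumed format).
    """
    if flit_width <= 0:
        raise ValueError("flit_width must be positive")
    chunks = [payload_bits[i:i + flit_width]
              for i in range(0, len(payload_bits), flit_width)] or [""]
    flits = []
    for i, chunk in enumerate(chunks):
        flits.append({
            "head": i == 0,                 # first flit of the packet
            "tail": i == len(chunks) - 1,   # last flit of the packet
            "data": chunk.ljust(flit_width, "0"),
        })
    return flits


def merge_flits(flits, packet_length):
    """Rebuild a packet from a head-to-tail flit sequence, dropping padding."""
    if not flits or not flits[0]["head"] or not flits[-1]["tail"]:
        raise ValueError("expected a complete head-to-tail flit sequence")
    return "".join(f["data"] for f in flits)[:packet_length]
```

For instance, a 10-bit packet sent over 4-bit flits becomes three flits, and merging them with the original packet length recovers the packet exactly.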
Network Generation. Fig. 5 shows an example of an XML tree that defines a simple 2x2 2D-mesh NoC. A user can specify routers with router XML elements, and network interfaces with source_network_interface and destination_network_interface XML elements. Links are specified with channel elements.
Table 2: NoC configuration parameters.
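The router-to-router channel entries of the Fig. 5 example follow a regular pattern, which the Python sketch below reproduces for a mesh of arbitrary size. The port numbering (0 = west, 1 = east, 2 = north, 3 = south) and the row-major router indexing are inferred from the 2x2 example and should be treated as assumptions; the function names are hypothetical and are not part of ICON.

```python
WEST, EAST, NORTH, SOUTH = 0, 1, 2, 3  # port numbering inferred from Fig. 5


def mesh_channels(width, height):
    """Return (src_router, src_port, dest_router, dest_port) tuples
    for all directed router-to-router links of a width x height 2D mesh."""
    channels = []
    for y in range(height):
        for x in range(width):
            r = y * width + x  # row-major router index (assumption)
            if x > 0:
                channels.append((r, WEST, r - 1, EAST))
            if x + 1 < width:
                channels.append((r, EAST, r + 1, WEST))
            if y > 0:
                channels.append((r, NORTH, r - width, SOUTH))
            if y + 1 < height:
                channels.append((r, SOUTH, r + width, NORTH))
    return channels


def to_xml(channels):
    """Format the tuples as channel elements in the style of Fig. 5."""
    return [
        f'<channel type="ch" src_router="{s}" src_port="{sp}" '
        f'dest_router="{d}" dest_port="{dp}"/>'
        for s, sp, d, dp in channels
    ]
```

Calling `mesh_channels(2, 2)` yields the eight router-to-router channels listed in Fig. 5, while `mesh_channels(4, 4)` yields the 48 directed links of a 4 × 4 mesh.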
EXPERIMENTAL RESULTS
To demonstrate the capabilities of the ICON framework in exploring the NoC design space for a target SoC, we designed two complete SoCs as instances of Embedded Scalable Platforms [4]. As shown in Fig. 6, each SoC contains a LEON3 CPU running Linux and 2 DDR-3 DRAM controllers together with a set of accelerators: 10 accelerators for 5 distinct application kernels from the PERFECT benchmark suite [2] in the heterogeneous SoC and 12 copies of the FFT-2D accelerator in the homogeneous SoC.
For each SoC, we used ICON to generate 36 different NoC designs by combining the 5 parameters of Table 2. While every combination of parameter values is supported, we limit ourselves to three possible combinations for the number N of physical networks and the number V of virtual channels. Table 3 reports how these three configurations support the five distinct message classes that are needed to enable the various independent transactions in the SoC while avoiding protocol deadlock [20]: two for CPU-memory transfers, two for accelerator-memory transfers, and one for accelerator configuration and interrupt requests. Note that ICON allows us to use different numbers of VCs per physical network, e.g. 2 for network 0 and 3 for network 1 with 2N-2/3V. All NoC configurations share a 4 × 4 2D-mesh network topology with XY dimension-order routing and credit-based flow control.
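As a reminder of how XY dimension-order routing behaves on such a 4 × 4 mesh, the Python sketch below routes a flit first along X and then along Y. The port numbering (0 = west, 1 = east, 2 = north, 3 = south, 4 = local) and the row-major router indexing are assumptions taken from the small XML example, and the function names are hypothetical; this is not the routing unit generated by ICON.

```python
WEST, EAST, NORTH, SOUTH, LOCAL = 0, 1, 2, 3, 4  # assumed port numbering


def xy_route(current, dest, width=4):
    """Return the output port chosen at router `current` for a flit
    headed to router `dest`, correcting the X offset before the Y offset."""
    cx, cy = current % width, current // width
    dx, dy = dest % width, dest // width
    if dx > cx:
        return EAST
    if dx < cx:
        return WEST
    if dy > cy:
        return SOUTH
    if dy < cy:
        return NORTH
    return LOCAL  # already at the destination: eject to the network interface


def path(src, dest, width=4):
    """List the routers visited from src to dest under XY routing."""
    step = {EAST: 1, WEST: -1, SOUTH: width, NORTH: -width}
    hops, cur = [src], src
    while cur != dest:
        cur += step[xy_route(cur, dest, width)]
        hops.append(cur)
    return hops
```

For example, `path(0, 15)` traverses the whole top row before turning down the last column, which is the fixed X-then-Y order that makes the routing deterministic.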
Each of the 36 NoC designs given in SystemC was synthesized into a corresponding Verilog design by using Cadence C-to-Silicon. Then, we used two distinct back-end flows, one for ASIC and another for FPGA, to obtain the final implementations for each NoC.
Experiments with ASIC Design Flow. We performed logic synthesis targeting a 45 nm technology and a 500 MHz clock frequency. We simulated the ASIC implementations using the Makefiles and testbenches generated by ICON for the seven "classic" synthetic traffic patterns: Uniform, Random Permutation, Bit Complement, Bit Reverse, Transpose, Neighbor, and Tornado [6]. Fig. 7 reports the results in terms of saturation throughput for all configurations with P = 2 and Q = 2. Across all traffic patterns the throughput changes considerably depending on the flit width. For the same flit width, 5N-1V, which has a bisection bandwidth that is five times larger than that of 1N-5V, provides the highest throughput. The saturation throughput is higher for the simulations with the Random Permutation, Neighbor, and Tornado patterns than in the other cases because on average the destination of the generated traffic is closer to the source. Fig. 8 shows the area-performance trade-off of the NoC configurations for different flit-width values.
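The deterministic patterns among these synthetic workloads map each source node to one fixed destination. The Python sketch below follows the commonly used textbook definitions on node addresses; whether ICON's testbench library uses exactly these formulas is an assumption, and the function names are hypothetical. Here `bits` is the number of address bits (4 for 16 nodes) and `k` is the number of nodes per mesh dimension.

```python
def bit_complement(src, bits):
    """Destination address is the bitwise complement of the source."""
    return ~src & ((1 << bits) - 1)


def bit_reverse(src, bits):
    """Destination address is the source address with its bits reversed."""
    return int(format(src, f"0{bits}b")[::-1], 2)


def transpose(src, bits):
    """Swap the upper and lower halves of the address (bits must be even)."""
    half = bits // 2
    mask = (1 << bits) - 1
    return ((src >> half) | (src << half)) & mask


def _shift_each_dim(src, k, step):
    """Add `step` (mod k) to both coordinates of a k x k node index."""
    x, y = src % k, src // k
    return ((y + step) % k) * k + (x + step) % k


def neighbor(src, k):
    """Each coordinate moves by one position, wrapping around."""
    return _shift_each_dim(src, k, 1)


def tornado(src, k):
    """Each coordinate moves by ceil(k/2) - 1 positions, wrapping around."""
    return _shift_each_dim(src, k, (k + 1) // 2 - 1)
```

Each function is a permutation of the node set, so every node sends to and receives from exactly one other node (or itself), which is what makes these patterns useful for stressing specific parts of the topology.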
Experiments with FPGA Designs. We combined the generated NoC Verilog designs with those for the two SoCs of Fig. 6 and performed logic synthesis for a Xilinx Virtex-7 XC7V2000T FPGA with two DDR-3 extension boards for a target frequency of 80MHz.
For each SoC we ran a multi-threaded application that uses Linux to invoke all accelerators (via their device drivers) so that they run simultaneously and, therefore, compete for access to the NoC and DDR-3 controllers. Fig. 9 reports the execution time of the application (normalized with respect to the simplest configuration) and the SoC area occupation for many different NoC configurations. Specifically, it shows the impact of varying the flit width (F) in a NoC with 1 physical network (N=1), 5 virtual channels (V=5), a 4-stage pipeline (P=4), and 2 different queue sizes (Q={2,4}). When raising F from 8 to 16, the application for the heterogeneous SoC takes a time that is 86.55% (for Q=2) and 87.57% (for Q=4) of the case for F=8 in exchange for modest area increases (3.1% and 4.3%, respectively). The execution time of the corresponding application on the homogeneous SoC becomes 78.24% and 78.98% of the case with F=8 (with 4.11% and 5.55% of area increase, respectively). While the performance improvement obtained by doubling the flit width from 8 bits to 16 bits is considerable, this is not the case when doubling it again from 16 to 32 bits. For both the F=16 and F=32 configurations, the NoCs are not saturated and the zero-load latency has a bigger impact than the contention latency. The main reason is the long communication delay on the off-chip channels between the DDR-3 controllers and DRAM. The average throughput on this channel is about 2.72 bits per clock cycle for both the F=16 and F=32 configurations, while it decreases to 2.48 for the F=8 configuration, when the on-chip links become more congested and the NoC becomes the system bottleneck.
Table 3. Overall, the first configuration is better from an area viewpoint, while the differences in performance are minimal. Fig. 11 summarizes the area and performance trade-offs across all the configurations from the previous two figures as well as the rest of the 36 configurations that we tested for this SoC case study.
For the heterogeneous SoC, the Pareto curve includes 4 NoC configurations: 8F-5N-1V-2P-2Q, 16F-5N-1V-4P-2Q, 16F-5N-1V-2P-2Q, and 16F-2N-2/3V-4P-2Q. For the homogeneous SoC, the Pareto curve consists of 3 configurations: 8F-5N-1V-2P-2Q, 16F-1N-5V-4P-2Q, and 16F-2N-2/3V-2P-2Q. This set of results shows how ICON can be used to quickly generate and evaluate several network design points. Each design can be seamlessly integrated into a complex heterogeneous SoC without modifying any of the computing IP blocks present in the system. Further, ICON allows us to identify the configuration parameters that have a larger impact on performance for the specific target SoC. Exploring such a large design space and gathering accurate information from a full-system evaluation would not have been possible without the ICON automation framework.
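A Pareto curve over area and execution time can be extracted from a set of evaluated design points with a simple dominance check, as in the Python sketch below. The function name `pareto_front` is hypothetical, and the configuration labels and numbers in the usage example are made-up placeholders, not measurements from this work.

```python
def pareto_front(points):
    """Return the names of the non-dominated design points.

    `points` maps a configuration name to an (area, time) pair, where
    lower is better for both metrics. A point is dominated if another
    point is no worse in both metrics and strictly better in at least one.
    """
    front = []
    for name, (area, time) in points.items():
        dominated = any(
            (a <= area and t <= time) and (a < area or t < time)
            for other, (a, t) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order the curve by increasing area for easier reading.
    return sorted(front, key=lambda n: points[n])


# Placeholder values for illustration only (normalized area, normalized time).
example = {
    "cfg_a": (1.00, 1.00),
    "cfg_b": (1.03, 0.87),
    "cfg_c": (1.05, 0.90),  # dominated by cfg_b: larger and slower
    "cfg_d": (1.10, 0.85),
}
print(pareto_front(example))
```

Applied to the full set of evaluated configurations of an SoC, such a filter keeps only the points for which no other configuration is both smaller and faster.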
