Abstract: Network-on-chip (NoC) is a new design paradigm for system-on-chip intraconnections in the billion-transistor era. Application specific on-chip network design is essential for NoC success in this new era. This paper presents a class of source routing switches that can be used to efficiently form arbitrary network topologies and that can be optimized for various applications. Hardware description language versions of the networks can be generated automatically for simulation and synthesis. A series of switches and networks was configured, and their performance, including latency, delay, area, and power, was analyzed theoretically and experimentally. The results show that this NoC architecture provides a large design space for application specific on-chip network designs.
Introduction
One of the challenges of system-on-chip (SoC) designs in the billion-transistor era is the intraconnections between IPs. Network-on-chip (NoC) designs have been proposed as one of the solutions, with many research papers describing their features [1]. On-chip networks will be application-specific networks for several reasons.
Hardware and software resources that can be utilized by on-chip networks are more limited than those used by traditional computer networks, so resources should be optimized for the applications. Traffic patterns in traditional networks are stochastic and generally follow a Poisson distribution, while patterns in on-chip networks are more deterministic [2], which provides opportunities for optimization. On-chip networks are more flexible since they need to be compatible only with the on-chip intraconnections (interconnections between chips are not the goal of NoC), whereas traditional networks have to be compatible with existing standards. The absence of standards provides a large design space for on-chip networks. Due to these intrinsic characteristics of on-chip networks, the design cost should be partitioned so that more effort is spent at design time to explore a larger network design space and at run time to more carefully schedule the traffic, while the hardware architecture should be as simple as possible to save area and power. The source routing switches discussed here provide such a simple but powerful scheme, with automatic generation enabling exploration of a large configuration space.
Related Work
In recent years, NoC designs have been developing in three phases.
(1) Conceptualization The NoC concept was proposed as a replacement for bus-based SoC communication architectures with flexibility, scalability, and [1, 3, 4].
(2) Regularization Regular network structures were developed, including mesh [5, 6], fat-tree [7], octagon [8], star [9], and bi-directional ring [10] structures. Regular topologies may facilitate layout using two-dimensional silicon surface technology, but they do not easily enable the incorporation of heterogeneous applications.
(3) Customization Researchers now seek new design methodologies for application specific on-chip network designs:
Library-based approaches provide a set of parameterized components that can be used to form networks [11] [12] [13] as bottom-up flows.
Synthesis-based approaches map applications to specific networks subject to several constraints [14, 15], which are top-down flows.
Evolution-based methodologies iterate through several stages to develop a suitable network for a specific application, because traffic pattern modeling is complex [15, 16]; these are loop flows.
This paper discusses a library-based approach with synthesizable Verilog code that can be automatically generated and quickly evaluated to accelerate network evolution. These architectures are simple and area-efficient due to more aggressive NoC assumptions. The performance is related to the design parameters, which serve as references for topology selection when synthesizing new designs.
The source routing mechanism discussed by Farber and Vittal [17] and Sunshine [18] has been used in various NoC networks [12, 19] due to its routing simplicity and topology independence. A custom-designed irregular network topology can significantly reduce area and power demands compared with regular mesh topologies [20]. In addition to irregular network topologies, the source routing mechanism can also be used to build irregular switches tailored to the application, which results in more area-efficient switches.
Arbitrary Network Topology and Source Routing Implementation
The optimization process seeks to find and eliminate redundancies. A regular topology is redundant to the on-chip network, as is a regular switch. This section defines an arbitrary network topology and its implementation with the source routing mechanism, which provides an opportunity to find and eliminate redundancies.
Arbitrary network topology
An on-chip network is defined as a directed graph $G(V, E)$ with each vertex $v_i \in V$ representing a module or a switch, and each directed edge $(v_i, v_j) \in E$ representing a link between them. Application specific on-chip network design demands a methodology that can generate an arbitrary topology network subject to the traffic pattern. Here, "arbitrary topology" means:
(1) $|V| > 0$, where $|V|$ represents the cardinality of the vertex set $V$. This means that the modules and/or switches on the chip can be expanded arbitrarily according to the application requirements.
(3) […] $e$. This means that some links may be wider than others to increase bandwidth or narrower to save resources.
(4) For a switch vertex $v_i$, $|S| > 0$ and $|T| > 0$, which means that a switch has at least one input port and one output port; for a module vertex $v_i$, $|S| \ge 0$ and/or $|T| \ge 0$, which means that a module can be a pure source or a pure sink.
(5) $n_{i,j} \ge 0$, where $n_{i,j}$ represents the number of virtual circuit links from input port $s_i$ to output port $t_j$, which means that a large configuration space is available for switch internal structures.
An example of an arbitrary network topology is shown in Fig. 1, with irregular connections, wide or narrow edges, large or small switches, and even different numbers of input ports, output ports, and internal links within one switch.
Source routing mechanism
Figure 1 also shows the source routing mechanism of this arbitrary network topology. To send "DATA" to M7, the M2 module sends the package "32310DATA" to the nearest switch. The head "3" of the message, which is the virtual circuit number, guides the first switch to dispatch the message to output port "3", with the switch shifting out the head "3" and shifting in a trailing "0". In the same manner, the package travels on to module M7 and is transformed along the way from "32310DATA" to "DATA00000". The limitation of this routing scheme is that the network scale is bounded by the link width and the number of virtual circuit bits. For example, a network in which each link is 32 bits wide and each virtual circuit number is 3 bits long is limited to a diameter of 11 hops.
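The header shifting described above can be illustrated with a minimal Python sketch. It is only a behavioural model under stated assumptions: each virtual circuit number is treated as one decimal character rather than a bit field, and the function names `switch_hop`, `route`, and `max_switches` are hypothetical, not part of the generated hardware.

```python
from typing import List, Tuple


def switch_hop(package: str) -> Tuple[int, str]:
    """Model one switch: read the head digit, shift it out, shift in a trailing "0"."""
    output_port = int(package[0])
    return output_port, package[1:] + "0"


def route(package: str, switches: int) -> Tuple[List[int], str]:
    """Pass a package through a number of switches; return the ports used and the result."""
    ports = []
    for _ in range(switches):
        port, package = switch_hop(package)
        ports.append(port)
    return ports, package


def max_switches(link_width_bits: int, vc_bits: int) -> int:
    """Number of whole virtual circuit numbers that fit into one link word."""
    return link_width_bits // vc_bits


# The M2 -> M7 path of Fig. 1 passes through five switches.
ports, delivered = route("32310DATA", switches=5)
print(ports, delivered)     # [3, 2, 3, 1, 0] DATA00000
print(max_switches(32, 3))  # 10
```

With 32-bit links and 3-bit virtual circuit numbers, at most 10 switches can be addressed. This agrees with the 11-hop limit if a hop is counted per link, so that 10 switches are joined by 11 links; that counting convention is an assumption of this sketch.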
Wormhole
Since the virtual circuit number bits for the routing selection are shifted out at each switch, the payload efficiency is limited. Therefore, the wormhole scheme is used: a data package is divided into several segments, called flits, so that once the first flit builds a virtual circuit, the following flits do not need to shift bits again. Additional control signals and internal switch states manage the establishment and removal of the virtual circuit.
One of the disadvantages of the wormhole scheme is deadlock. In a cyclic network topology, if the flit sequences depend on each other to advance one step, deadlock may occur [21]. Some solutions to deadlock demand a more complex switch design. The most area-efficient hardware design would use an acyclic network or carefully schedule messages in a higher layer of the protocol stack.
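Whether a candidate topology is acyclic can be checked at design time. The following Python sketch uses Kahn's topological sort, which is a standard technique swapped in here for illustration and not a step described in this paper. The link list contains only the switch-to-switch links of the two paths named in the text (M2 to M7 and M3 to M8); the remaining links of Fig. 1 are not assumed.

```python
from collections import deque
from typing import List, Tuple


def is_acyclic(links: List[Tuple[str, str]]) -> bool:
    """Return True if the directed graph formed by the links has no cycle."""
    nodes = {name for link in links for name in link}
    indegree = {name: 0 for name in nodes}
    successors = {name: [] for name in nodes}
    for src, dst in links:
        successors[src].append(dst)
        indegree[dst] += 1

    # Repeatedly remove vertices that have no remaining incoming links.
    ready = deque(name for name in nodes if indegree[name] == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Every vertex can be removed only if no cycle exists.
    return removed == len(nodes)


links = [("S1", "S2"), ("S2", "S3"), ("S3", "S5"), ("S5", "S7"), ("S5", "S6")]
print(is_acyclic(links))                   # True
print(is_acyclic(links + [("S7", "S1")]))  # False
```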
Switch Architecture
An arbitrary network is generated using an HDL code template with several selectable parameters. This section uses a simple but complete 2×2 switch to illustrate the architecture of a typical switch, including signal interface, input port structure, output port structure, and architectural features.
Signal interface
The interface defines a set of signals to connect switches and modules, and a set of protocols to transmit data. Any module that implements this interface can communicate with other modules through the network. This protocol provides a virtual circuit link which is easily translated into other protocols. As can be seen in Fig. 2, a sending … Figure 3 shows a possible sequential pattern when sending a message "12345". While data is being sent, the signal req holds "true", and tail goes high when the last data is being sent. The signal ack goes low for 2 cycles, which means that the link is blocked for 2 cycles, so data "3" is held for 3 cycles to prevent data loss.
Input port structure
Figure 4 shows the basic architecture of an input port connected to n output ports. The input port extracts the header routing information, shifts the data by several bits, and forwards the data to the appropriate output port. The virtual circuit number length is one of the properties attached to each input port. If congestion occurs, as indicated by none of the received ack signals ($ack_1, \ldots, ack_n$) going high, the data is saved into a FIFO buffer, and the congestion information is handed back to the previous switch by pulling down the output ack in the first congested switch. Before the previous switch receives this congestion information, at most one flit has been sent to the first congested switch, so each input port needs at least one flit buffer. The FIFO buffer length is another input port property, which can be adjusted to reduce the congestion probability. Each input port has a state register named "linked", which goes high when a virtual circuit is built and goes low when the tail signal is received, indicating that the virtual circuit is destroyed.
Output port structure
Figure 5 shows the basic architecture of an output port connected to n input ports. Each input port communicates with the output port through similar interfaces. When any input port issues a request, an arbitration network selects exactly one of the requesting input ports according to the current priority information. When a virtual circuit is built, appropriate ack signals are sent to each input port, the output port enters the virtual circuit linked state, the linked input port number is recorded, and the priority information is updated to ensure fairness. Various arbitration priority algorithms can be implemented by different priority updating schemes. An output port in the linked state returns to the unlinked state when a tail signal is received. If a congestion signal is received, all these actions are blocked, and the congestion information is passed to the appropriate input ports.
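The output port behaviour described above (arbitration, the linked state, release on tail, and blocking on congestion) can be modelled in a few lines of Python. This is a behavioural sketch only, not the generated Verilog. It assumes a round-robin priority update, which is just one of the possible priority updating schemes, and the class and parameter names are hypothetical.

```python
from typing import List, Optional


class OutputPortArbiter:
    """Cycle-level model of one output port serving several input ports."""

    def __init__(self, num_inputs: int) -> None:
        self.num_inputs = num_inputs
        self.next_priority = 0                   # input port searched first
        self.linked_input: Optional[int] = None  # None means "unlinked"

    def step(self, requests: List[bool], tail: bool = False,
             congested: bool = False) -> Optional[int]:
        """Advance one cycle and return the acknowledged input port, if any."""
        if congested:
            # Downstream congestion blocks every action in this cycle.
            return None
        if self.linked_input is None:
            # Select exactly one requester, starting from the priority pointer.
            for offset in range(self.num_inputs):
                candidate = (self.next_priority + offset) % self.num_inputs
                if requests[candidate]:
                    self.linked_input = candidate
                    # Round-robin update: the winner gets lowest priority next.
                    self.next_priority = (candidate + 1) % self.num_inputs
                    break
        granted = self.linked_input
        if granted is not None and tail:
            # The tail flit destroys the virtual circuit after passing.
            self.linked_input = None
        return granted


arbiter = OutputPortArbiter(num_inputs=2)
both = [True, True]
print(arbiter.step(both))             # 0: input 0 wins and is linked
print(arbiter.step(both, tail=True))  # 0: last flit, circuit released
print(arbiter.step(both))             # 1: priority has moved to input 1
```

Other priority schemes would change only the line that updates `next_priority`.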
Monadic switch
A monadic switch is a switch with just one input port, one output port, and one buffer. As discussed in Sections 3.2 and 3.3, control and data signals start from a latched switch output port, travel along a long trace, enter the input port of the next switch, traverse combinational logic, and are then captured at the next clock edge. The long trace and the combinational logic should be separated if the two switches are far from each other and the trace delay cannot be ignored. Unlike a single group of registers, a monadic switch can handle delays well since it can save one stage of data when congestion occurs. Networks can be configured more easily by offering a monadic switch as an optional input port property.
Arrival guarantee
As shown in Fig. 1, module M2 sends data to module M7 through the path M2→S1→S2→S3→S5→S7→M7. M7 then needs to inform M2 that the data has been received. One method is for M7 to send another package to inform M2, but this method depends on the upper protocol stack, needs to build another virtual circuit, and takes more cycles. The arrival guarantee method uses the nature of the switches to simplify the process. Assume that the FIFO buffer depth at each switch is one flit, so that the maximum latency from M2 to M7 is 10 cycles. A flit sequence shorter than 10 flits can be lengthened to 10 by appending meaningless flits. After sending the 10th flit of a sequence, M2 can be sure that M7 has successfully begun receiving the data. The disadvantage of this arrival guarantee method is that the network load and the risk of congestion are increased.
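The padding step of the arrival guarantee method is simple enough to state as a short Python sketch. The function name and the use of `None` as the meaningless filler flit are assumptions made for illustration.

```python
from typing import List, Optional


def pad_for_arrival_guarantee(flits: List[str], max_latency: int,
                              filler: Optional[str] = None) -> List[Optional[str]]:
    """Lengthen a flit sequence to at least max_latency flits with filler flits."""
    padding = max(0, max_latency - len(flits))
    return list(flits) + [filler] * padding


# A four-flit message on the M2 -> M7 path, whose maximum latency is 10.
padded = pad_for_arrival_guarantee(["D", "A", "T", "A"], max_latency=10)
print(len(padded))  # 10
```

Sequences that are already at least as long as the maximum latency are returned unchanged.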
Automatic Network Generation
Since the switches have similar architectures regardless of their different parameter configurations, a program was developed to automatically generate custom networks.
Network configuration space
The methodology provides a wide design space for arbitrary networks. The supported parameters can be classified into network, switch, and port (input port or output port) hierarchies.
At the network level, the parameters include the number of input ports and output ports, the number of switches, and the connecting edges. At the switch level, the parameters include the ID, the number of input ports and output ports, and a virtual circuit mapping table similar to Fig. 1 . The input port parameters include the ID, the bit width, the virtual circuit length, the FIFO buffer depth, and whether a monadic switch is used. The output port parameters include the ID, the bit width, and a list of connected input port IDs. The program also automatically computes other parameters based on this information to properly generate the Verilog source code.
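The switch-level and port-level parameters listed above could be captured in a data structure such as the following Python sketch. The class names, field names, and the two derived helpers are hypothetical; the actual input format of the generation program is not described in this paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InputPort:
    port_id: int
    bit_width: int
    vc_length: int          # virtual circuit number length, in bits
    fifo_depth: int = 1     # each input port needs at least one flit buffer
    monadic: bool = False   # whether a monadic switch is used on this port


@dataclass
class OutputPort:
    port_id: int
    bit_width: int
    connected_inputs: List[int] = field(default_factory=list)


@dataclass
class Switch:
    switch_id: int
    inputs: List[InputPort]
    outputs: List[OutputPort]

    def check(self) -> None:
        """Reject output ports that reference unknown input port IDs."""
        known = {port.port_id for port in self.inputs}
        for out in self.outputs:
            missing = set(out.connected_inputs) - known
            if missing:
                raise ValueError(
                    f"output {out.port_id}: unknown inputs {sorted(missing)}")

    def max_fan_in(self) -> int:
        """Largest number of input ports connected to any one output port."""
        return max((len(o.connected_inputs) for o in self.outputs), default=0)


# A fully connected 2x2 switch with 32-bit links and 3-bit circuit numbers.
switch = Switch(
    switch_id=0,
    inputs=[InputPort(0, 32, 3), InputPort(1, 32, 3)],
    outputs=[OutputPort(0, 32, [0, 1]), OutputPort(1, 32, [0, 1])],
)
switch.check()
print(switch.max_fan_in())  # 2
```

The `max_fan_in` helper is an example of a derived parameter that a generator could compute from the user-supplied configuration.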
Test-bench and verification
Network architecture verification is a very important part of NoC design, especially for the initial arbitrary network topologies. The two levels of NoC verification are signal connection level and performance statistical level. The statistical level verification objective is to show that a network configuration satisfies the application requirements by analyzing the application traffic patterns, which is a future goal of this project.
A signal connection level test-bench was developed to verify that the networks generated by the program work as described in Section 3. The test-bench has source modules, dummy source modules, sink modules, and dummy sink modules. The dummies are connected to the network ports but do nothing. The sources generate random flit sequences and push them into the network. The sinks observe the network output ports and compare the received flits with the initial sequences. Figure 6 shows part of the simulation results for the irregular network topology shown in Fig. 1. Another link, M3-S3-S5-S6-M8, was added to interfere with the link M2-S1-S2-S3-S5-S7-M7. The M2-M7 transmission required 5 cycles from "TimeF" to "TimeG" but 8 cycles from "TimeC" to "TimeD", an increase of 3 cycles that corresponds to the 3 cycles during which ack was low. Similar results occurred for the M3-M8 transmissions. The signal connection level verification validated both the generation program and the network architecture.
Performance Analyses and Syntheses Results
The automatic Verilog source code generation provides a convenient approach to observe the performance of various network configurations. Simulations and logic syntheses show that this source routing switch provides an efficient methodology for NoC design.
Latency
If congestion is excluded from the analysis, the latency from one module to another through the network is equal to the path length, as can be seen in Fig. 6 from "TimeF" to "TimeG".
If congestion is included in the analysis, the latency is random, since each output port arbitration is random. For an output port connected to $m$ input ports, with an output port bandwidth capacity of $w$ and input port bandwidth requirements $w_i$, $i = 1, \ldots, m$, the probability $p_j$ that input port $j$ connects to the output port is
If each competition is independent, the latency $L_j$ at this output port follows a geometric distribution, so the expected latency in cycles is
$$E[L_j] = \frac{1}{p_j} \qquad (2)$$
For a path with $s$ hops, the total expected latency is the sum of the expected latencies $L^{(k)}$ at each switch,
$$E[L] = \sum_{k=1}^{s} E\left[L^{(k)}\right] \qquad (3)$$
The analysis must be adjusted to account for wormhole routing, flit sequence length, FIFO buffer depth, and arbitration policy. However, Eqs. (1)-(3) provide a reasonable statistical approximation for many systems.
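As a numerical illustration of the latency model, the following Python sketch uses the fact that a geometric distribution with per-cycle success probability p has mean 1/p, and sums this mean over the hops of a path. The per-hop connection probabilities are hypothetical inputs to the sketch; they are not derived here from the bandwidth terms of Eq. (1).

```python
from typing import Iterable


def expected_port_latency(p_connect: float) -> float:
    """Mean of a geometric distribution: expected cycles until the port is won."""
    if not 0.0 < p_connect <= 1.0:
        raise ValueError("connection probability must be in (0, 1]")
    return 1.0 / p_connect


def expected_path_latency(p_per_hop: Iterable[float]) -> float:
    """Sum of the expected per-switch latencies along a path."""
    return sum(expected_port_latency(p) for p in p_per_hop)


# Without congestion every hop succeeds at once, so the latency is the path length.
print(expected_path_latency([1.0] * 5))         # 5.0
# Hypothetical contention at the second and third switches of a 3-hop path.
print(expected_path_latency([1.0, 0.5, 0.25]))  # 7.0
```

The first call reproduces the uncongested case, in which the latency equals the path length.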
Delay, area, and power
After the switches and networks are generated, the delays, areas, power consumptions, and configurations must be analyzed as part of the design space exploration.
The cost of a switch is strongly related to the FIFO buffer depth, the number of input ports, the number of output ports, and the internal link table. To develop an empirical formula, a series of switches was generated and analyzed, including fully connected n × n, 16 × n, and n × 16 switches with n = 2,…,16, fully connected 2 × 2 switches with n = 1,…,15 buffers at each input port, and non-fully connected 16 × 16 switches with each output port connected to n = 2,…,16 input ports, as shown in Fig. 7f.
The logic syntheses were conducted using the Synopsys Design Compiler with TSMC 0.13 μm technology. Figure 7 shows the synthesis results for the area, delay, and power. The tradeoffs between area and delay in the syntheses were based on timing constraints that kept the ratio of the slack to the clock period almost constant, at around 20%, as shown in Fig. 7a.
The areas of the series of switches are shown in Fig. 7b. The area cost is assumed to be linearly related to the design factors. Figure 7c shows the error in Eq. (5). The mean and standard deviation of the error correlation were 0.03 and 0.04, which is acceptable. Figure 7d has two essentially horizontal curves and three ascending curves, which show that the delay is strongly related to the maximum number of input ports connected to each output port. The two horizontal curves, 16 × n and 2 × 2, had a fixed number of input ports connected to each output port, while for the other three, the number of input ports connected to each output port was variable. Figure 7e shows the power consumption of each series. The power curves are similar to the area curves in Fig. 7b, which means that larger switches consume more power.
For each application specific network, tradeoffs must be made between simple topologies with large switches and complex topologies with small switches. For the latency analysis, larger switches may result in shorter paths between modules but longer wait times at each switch due to more competition. For the area and power analysis, a simple topology with large switches uses fewer switches, but each switch is larger, so the effect on the total cost is uncertain. For the delay analysis, larger switches result in slower clock frequencies. All these tradeoffs require automatic code generation and network evolution.
Conclusions
This paper presents an arbitrary network design system with a template for source routing switches to implement the network with an automatic design methodology. The network performances in terms of latency, delay, area, and power are analyzed theoretically and with simulations. The relationship between performance and the design parameters provides guidance for on-chip network topology selection and design. In the future, the methodology will be applied to real traffic patterns for more precise evaluations. 
