Abstract: Reducing the design complexity of switches is essential for cost reduction and power saving in on-chip networks. In wormhole-switched networks, packets are split into flits which are then admitted into and delivered in the network. When reaching destinations, flits are ejected from the network. Since flit admission, flit delivery and flit ejection interfere with each other directly and indirectly, techniques for admitting and ejecting flits exert a significant impact on network performance and switch cost. Different flit-admission and flit-ejection microarchitectures are investigated. In particular, for flit admission, a novel coupling scheme which binds a flit-admission queue with a physical channel (PC) is presented. This scheme simplifies the switch crossbar from 2p × p to (p + 1) × p, where p is the number of PCs per switch. For flit ejection, a p-sink model that uses only p flit sinks to eject flits is proposed. In contrast to an ideal ejection model which requires p · v flit sinks (v is the number of virtual channels per PC), the buffering cost of flit sinks becomes independent of v. The proposed flit-admission and flit-ejection schemes are evaluated with both uniform and locality traffic in a 2D 4 × 4 mesh network. The results show that neither scheme degrades network performance in terms of average packet latency and throughput if the flit injection rate is lower than 0.57 flit/cycle/node.
store flits instead of packets. This imposes a smaller buffering requirement than virtual cut-through switching, which must store whole packets [7]. By adopting virtual channels (VCs or lanes) to decouple link allocation from VC allocation, wormhole switching achieves higher throughput. Due to these merits, wormhole switching has been widely applied to parallel computers, for example the Intel Paragon and the Cray T3D. Nevertheless, using wormhole switching on chips has to be cost effective and power efficient at the minimal expense of performance. This necessitates reducing the design complexity of switches, which directly impacts the gate count and switching capacitance.
In this paper we investigate the micro-architectures of a canonical wormhole VC switch for admitting and ejecting flits. We shall see that different flit-admission and flit-ejection micro-architectures have different complexity. A simpler micro-architecture may not necessarily bring about inferior performance if the network is not saturated. The remainder of the paper is organised as follows. Section 2 outlines the related work. In Section 3, we describe wormhole-switched networks on chips, focusing on the flit admission and ejection problem followed by the operation of the wormhole switch. We discuss flit admission and ejection models in Sections 4 and 5, respectively. Particularly, we detail the coupled admission and the p-sink model. Experimental results are reported in Section 6. Finally we conclude the paper in Section 7.
Related work
A large body of work on wormhole switching exists in the literature [8] [9] [10] [11]. Dally and Seitz first introduced wormhole switching to build fast single-chip switches in 1986 technology [10]. Using VCs for wormhole-switched networks was proposed in [9]. The performance model of a wormhole switch that considers implementation complexity was first noted by Chien [8]. A more efficient wormhole lane switch architecture and its performance model were presented in [11]. In general, the design complexity of a wormhole lane switch is a function of p [the number of physical channels (PCs) per switch] and v (the number of VCs per PC). To our knowledge, no prior work has been reported in the literature specifically discussing flit-admission and flit-ejection models. All of the above work assumes an ideal flit-ejection model while evaluating the network performance. Our motivation is to reduce the switch complexity to achieve cost-effective designs on silicon by exploring the design space of the switch micro-architecture. In line with this motivation, Rijpkema et al. proposed to customise the lane buffers as dedicated hardware FIFOs instead of register-based or RAM-based FIFOs to reduce the buffer area and thus achieve reasonable buffering cost [5]. They chose the input-queuing wormhole switch over the output-queuing wormhole switch after studying their performance-cost trade-offs. To reduce the control complexity of the switches, deterministic routing is favoured over adaptive routing in buffered networks [1, 3]. In bufferless networks, deflection routing is advocated for NoCs in [12]. Due to its simple control (no buffer and flow management), it achieves very small and fast switch design. Moreover, regular low-dimension network topologies are favoured for NoCs to simplify routing and potentially manage the electrical properties of future wires subject to deep submicron (DSM) effects [2, 4].
For example the Nostrum [4] NoC proposes a 2D mesh network and the NoC in [2] suggests a 2D torus network.
3 Wormhole-switched networks on chips
Problem of flit admission and ejection
Fig. 1 shows a portion of a 2D mesh NoC architecture [2-4]. A resource is a local computing or storage region, which may contain a processor, a memory, an IP, an ASIC/FPGA or a bus subsystem. The wormhole switch with bidirectional links has eight 'flit channels' (PCs) for sending and receiving flits and eight 'credit channels' for sending and receiving credits. Besides, a duplex 'packet channel' connects a switch to exactly one resource via a resource-network-interface (RNI). The essential part of an RNI implements an interconnect interface, which may be a legacy interface such as advanced extensible interface (AXI) [13], open core protocol (OCP) [14], virtual component interface (VCI) [15], or a customised interface. The RNI deals with bursty transactions from its resource. Specifically, it encapsulates transactions into packets, decapsulates packets into transactions and conducts flow control and reordering if necessary. Furthermore, it sends packets to and receives packets from its connected switch. With wormhole switching, a packet is decomposed into a head flit, body flit(s) and a tail flit. A single-packet flit is also possible. The type of a flit is indicated by a flit type field that has four options: head, body, tail and single-packet flit. In Fig. 2, we show that a packet of 112 bits is encapsulated into four 32-bit flits, where vcid is the identity number of a virtual channel. This number is determined once a downstream VC is allocated to a packet. Whenever receiving a flit, a switch uses vcid to place the flit into the corresponding VC. The number of bits taken by vcid equals ⌈log v⌉ (⌈x⌉ is the ceiling function, which returns the least integer that is not less than x).
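The encapsulation described above can be illustrated with a minimal Python sketch. It assumes a base-2 logarithm for the vcid width and a per-flit header made of a 2-bit type field followed by the vcid, which with v = 4 leaves 28 payload bits per 32-bit flit, so that a 112-bit packet fills exactly four flits. The field layout and the names `flitise`, `assemble` and `vcid_bits` are illustrative assumptions and are not taken from Fig. 2.

```python
from math import ceil, log2

FLIT_BITS = 32
TYPE_BITS = 2  # assumed width: enough to encode the four flit types
HEAD, BODY, TAIL, SINGLE = range(4)


def vcid_bits(v):
    """Width of the vcid field for v virtual channels per PC (base-2 log assumed)."""
    return ceil(log2(v)) if v > 1 else 0


def flitise(packet, packet_len, v, vcid):
    """Split a packet (an int holding packet_len bits) into (type, vcid, payload) flits."""
    payload_bits = FLIT_BITS - TYPE_BITS - vcid_bits(v)
    n = ceil(packet_len / payload_bits)
    mask = (1 << payload_bits) - 1
    flits = []
    for i in range(n):
        # Most significant payload chunk goes into the first flit.
        payload = (packet >> ((n - 1 - i) * payload_bits)) & mask
        if n == 1:
            ftype = SINGLE
        elif i == 0:
            ftype = HEAD
        elif i == n - 1:
            ftype = TAIL
        else:
            ftype = BODY
        flits.append((ftype, vcid, payload))
    return flits


def assemble(flits, v):
    """Reverse of flitise: concatenate the payloads back into the packet value."""
    payload_bits = FLIT_BITS - TYPE_BITS - vcid_bits(v)
    packet = 0
    for _, _, payload in flits:
        packet = (packet << payload_bits) | payload
    return packet
```

For example, `flitise(packet, 112, v=4, vcid=2)` yields one head flit, two body flits and one tail flit, and `assemble` recovers the original 112-bit value.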
A flit differs from a packet in that (1) a flit has a flit type and has a typically constant and smaller size except for a single-packet flit and (2) only a head flit carries the routing information such as source and destination addresses, packet size, priority, sequencing bits, and checksum. This keeps the addressing overhead small and leaves more space for payload.
Since switches deliver flits via PCs but communicate packets with RNIs, switches must be equipped with facilities to perform flitisation and flit admission as well as flit ejection and assembly, in addition to flit delivery. Assembly is the reverse of flitisation, that is, de-capsulation of flits back into packets. As sketched in Fig. 3, we assume that a switch has two separate packet queues, one packet source queue and one packet sink queue. The packet source queue holds packets injected into the network by its associated RNI. The packet sink queue contains packets to be received by the RNI from the network. These queues decouple the resource clock domain from the network clock domain, and thus allow the network to operate asynchronously from the resources. The frequencies of resources could be different. Some can be faster than the network; some can be slower. As there is only one duplex packet channel between an RNI and a switch, we use a single FIFO model for the packet queues. We also assume that a switch has flit-admission queues for buffering flits after flitisation and before admission, and flit sinks for buffering flits after ejection and before assembly. As can be seen in Fig. 3, a source switch performs flitisation and flit admission, and a destination switch conducts flit ejection and assembly. Since switches in the network are both a packet source and a packet destination, they must perform flitisation, flit admission, delivery, ejection and assembly.
Because a switch is where flits are admitted into and ejected from the network, the switch micro-architectures for flit admission and ejection are an integral part of a switch fabric. The transmission time of a flit comprises admission time, delivery time plus ejection time. Thus the network performance is the function of flit admission, flit delivery and flit ejection. Intuitively, to achieve good network utilisation and throughput, flits should be admitted as fast as possible. However, flits to be advanced (after admission) contend not only with each other, but also with flits to be admitted for shared PCs and VCs. Flit admission and delivery interfere with each other. This implies that a fast admission mechanism may speed up the admission but slow down the delivery. If the network is too loaded, the overall transmission time may get worse. For the ejection process, a faster ejection frees flit buffers quicker, thus the faster the better. A slower ejection of flits may slow down the flit delivery and eventually the flit admission through back-pressure. However, an ideal ejection, which ejects flits immediately once they reach destinations, may over-design the switch. Finally the interplay between flit admission and ejection influences the trade-off between performance and cost. A practical ejection model may actually tolerate a slower but simpler admission model with reasonable performance penalty. Fig. 4 illustrates a canonical input-queuing wormhole switch architecture [5, 11] . It has p PCs and v lanes per PC. It conducts credit-based link-level flow control between directly connected switches.
Wormhole switch architecture
For each packet delivery, a switch passes through the following states: routing, lane allocation, flit scheduling, crossbar arbitration, switch traversal and lane release. In the routing state, the routing logic determines the path over which the packet advances. Routing is only performed with the head flit of a packet and only when the head flit becomes the earliest-come flit in the lane. After routing, the packet path and output PC are determined. As only the head flit of a packet carries the routing information, the routing and lane allocation have to be performed with the head flit. Once a lane-to-lane association is established with the head flit, the rest of flits of the packet inherit this association. After the tail flit leaves, the lane-to-lane association is torn down. In summary, a lane is allocated at the packet level, that is, packet-by-packet. This is in contrast with link usage. Since the flit scheduling as well as the crossbar arbitration is performed on a per flit basis, a link is scheduled at the flit level, that is, flit-by-flit. As the head flit advances, lanes are associated like a chain along the routing path of the packet. All flits are pipelined along the chain path and move through the lane buffers from source to destination. Since lane-to-lane associations are established with the head flit and released with the tail flit, the body flits are always located between the head and tail flits, thus the flit order is guaranteed. Moreover, the flits between the head and tail flits are considered as flits belonging to the same packet and will be assembled into the packet at the destination. Therefore flits belonging to different packets cannot be interleaved in a lane or associated lanes. Otherwise, the packet integrity is destroyed and packets cannot be assembled correctly. To guarantee the packet integrity, a lane-to-lane association must be unique at a time, that is, one-to-one. 
It is forbidden that an upstream lane concurrently associates itself with more than one downstream lane and a downstream lane is allocated to more than one upstream lane simultaneously.
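The one-to-one constraint on lane-to-lane associations can be sketched as a small bookkeeping structure. This is a minimal Python illustration, not the allocator of the switch: the class name `LaneAllocator` and the representation of a lane as any hashable identifier are assumptions made for the example.

```python
class LaneAllocator:
    """Tracks one-to-one associations between upstream and downstream lanes."""

    def __init__(self):
        self.down_of = {}  # upstream lane -> associated downstream lane
        self.up_of = {}    # downstream lane -> associated upstream lane

    def allocate(self, upstream, downstream):
        """Associate on a head flit; refuse if either lane is already associated."""
        if upstream in self.down_of or downstream in self.up_of:
            return False
        self.down_of[upstream] = downstream
        self.up_of[downstream] = upstream
        return True

    def release(self, upstream):
        """Tear the association down once the tail flit has left the upstream lane."""
        downstream = self.down_of.pop(upstream)
        del self.up_of[downstream]
```

A second packet requesting an already associated downstream lane is refused until `release` is called for the tail flit of the first packet, which is what keeps flits of different packets from interleaving in a lane.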
Discussion on the flitisation and assembly
The flitisation and assembly are processed in hardware because packets are derived from and assembled into transactions of the concerned interfaces such as AXI, OCP and VCI. We have chosen to implement the flitisation and assembly in the switch in order to decouple the network from the resources that include RNIs, thus constructing a clean network layer. The network boundary is between the RNIs and switches. Viewed from the network boundary, the network is composed of switches and links, transporting packets. A core communicates AXI/OCP/VCI-like transactions as messages. The interconnect interface supports outstanding and split transactions, and write/read interleaving. Since NoCs aim for complex applications which are communication-intensive with perhaps multiple communication patterns and distributed memories, we can imagine that a core has multiple outstanding transactions and may need to communicate with multiple slaves. For example a core may distribute tasks to multiple networked processing elements, which in turn process the tasks in parallel. In addition, a bursty transaction may be split into a number of packets. These justify the existence of the packet source queue in the wormhole switch. The clean network layer makes an RNI relatively stable and allows us to easily change the switching scheme of the network without extra cost. For example deflection routing and virtual cut-through switching do not decompose packets into flits. If we use a deflection-routing switch or a virtual cut-through switch instead of a wormhole switch to construct the network, we do not need to change the RNI as long as we conform to the packet interface.
Alternatively, one can choose to implement the flitisation and assembly in an RNI. This move could make the switch design lightweight, but it makes the RNI heavyweight. We can make this choice, but (1) we do not gain anything in area and power from the system perspective, because the move in itself does not remove functions but only shifts them upward in the modules. In general, such a move is always possible while implementing a communication protocol stack. Depending on how a problem is decomposed, we can partition it and implement its functional modules in an upper layer or a lower layer; (2) we do not improve performance, because the flitisation and assembly do not affect the critical path of the switch; (3) we may improve the symmetry of the switch, since it now admits, delivers and ejects flits, but we lose the modularity of the RNI, as we have elaborated above.
4 Flit admission
Decoupled admission
We start with the organisation of the flit-admission queue.
In Fig. 5a, the flit-admission queue is organised as a single FIFO. In Fig. 5b and 5c, it is arranged as p parallel FIFOs (p equals the number of PCs). Figs. 5a and 5b allow at most one flit to be admitted into the network at a time, whereas Fig. 5c allows up to p flits to be admitted simultaneously. We adopt the organisation in Fig. 5c for our further discussions since it allows potentially higher performance, whereas the other two may make the network under-loaded. Fig. 4 also illustrates the organisation of p flit-admission queues in the switch architecture. Initially, packets from an RNI are stored in the packet queue. When a flit-admission queue is available, a packet is split into flits which are then put into the admission queue. Similarly to a lane, a flit-admission queue transits states to inject flits into the network via the crossbar. Note that flits to be admitted (in admission) contend with flits already admitted (in delivery) for VCs in the lane-to-lane association state and PCs in the 'crossbar arbitration' state. This interference makes a flit-admission model non-trivial. We shall see that, if the network approaches saturation, a faster admission model may actually make delivery slower. Besides, the routing is performed after flitisation. By this scheme, each flit-admission queue is connected to p multiplexers. Flits from a flit-admission queue can be switched to any one of the p output PCs. To implement this scheme, the crossbar must be fully connected, resulting in a port size of 2p × p. Since the
Coupled admission
Although the decoupled admission allows a flit to be switched to any one of the p output ports, this may not be necessary since a flit is aimed to one and only one port after routing. Based on this observation, we propose a coupling scheme for the flit admission that can sharply decrease the crossbar complexity, as sketched in Fig. 6. Just like the decoupled admission, it uses p admission queues, but one queue is bound to one and only one multiplexer (instead of p multiplexers) for a particular PC. Due to this coupling, flits from a flit-admission queue are dedicated to the PC. The crossbar size is sharply decreased from 2p × p to (p + 1) × p as shown in Fig. 6. The number of control signals per multiplexer is reduced from ⌈log(2p)⌉ to ⌈log(p + 1)⌉ for any p > 1.
Alternatively, an admission queue can directly share an output PC as depicted in Fig. 7a. This solution can also be regarded as having a crossbar complexity of (p + 1) × p, since the combination of a p × 1 and a 2 × 1 multiplexer may be viewed as a (p + 1) × 1 multiplexer. The number of control signals per PC is reduced from ⌈log(2p)⌉ to ⌈log p⌉ + 1.
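To make the crossbar sizes concrete, the following Python sketch counts multiplexer inputs for the decoupled and coupled schemes. It is an illustration only: the function names are hypothetical, the crossbar is modelled as p output multiplexers, and a base-2 logarithm is assumed for the select-signal width.

```python
from math import ceil, log2


def crossbar_ports(p, coupled):
    """(inputs per output multiplexer, number of output multiplexers) for p PCs."""
    inputs = (p + 1) if coupled else 2 * p
    return inputs, p


def crosspoints(p, coupled):
    """Total multiplexer inputs across the crossbar."""
    inputs, outputs = crossbar_ports(p, coupled)
    return inputs * outputs


def select_bits(inputs):
    """Select-signal width of one multiplexer (base-2 logarithm assumed)."""
    return ceil(log2(inputs))


if __name__ == "__main__":
    for p in (2, 3, 4, 5):
        print(p, crosspoints(p, False), crosspoints(p, True),
              select_bits(2 * p), select_bits(p + 1))
```

For p = 4 the total number of multiplexer inputs drops from 32 to 20. Under the base-2 assumption the select widths can coincide for some values of p (for p = 4 both are 3 bits), so in this sketch the saving shows mainly in the multiplexer inputs.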
In order to support the coupling scheme, the routing must be performed before flitisation instead. With a routing algorithm, the PC that the packet requests can be determined. Hence, the corresponding admission queue is identified. One drawback due to the coupling is that the head-of-line blocking may be worse if the packet injection rate is high. Specifically, if the head packet in the packet queue is blocked due to the bounded number and size of the admission queues, the packets behind the head packet are all unconditionally blocked for the period when the head packet is blocked. By the decoupled admission, the head-of-line blocking occurs when the p flit-admission queues are fully occupied. With the coupled admission, this blocking occurs when the flit-admission queue, to which the head packet aims, is full.
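The two head-of-line blocking conditions can be written as simple predicates. This Python sketch assumes that `queue_full[i]` states whether flit-admission queue i is full and that, under decoupled admission, the head packet may use any non-full queue; both the representation and the function names are assumptions for illustration.

```python
def head_blocked_decoupled(queue_full):
    """Decoupled admission: the head packet waits only when all p queues are full."""
    return all(queue_full)


def head_blocked_coupled(queue_full, target_pc):
    """Coupled admission: the head packet waits when the queue bound to its PC is full."""
    return queue_full[target_pc]
```

With `queue_full = [False, True, False, False]`, a head packet routed to PC 1 is blocked under coupled admission but not under decoupled admission, which is the case in which the coupling makes head-of-line blocking worse.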
Flit admission via input channels and in output-queuing switches
In addition to sharing the crossbar or output PCs, a flit-admission approach may choose to share the input PCs of the switch. However, there is a critical section problem. Fig. 8 illustrates the problem with a simplified graph of two connected wormhole switches, A and B. Suppose that lane j in switch B is available at a certain clock cycle, the admission queue sees lane j available and then associates itself to lane j locally. At the same cycle, lane i in switch A also detects lane j available and remotely makes an association with lane j. This is possible since both 
To avoid such a situation and thus achieve a mutually exclusive lane association, we need both architectural support and a control protocol. This complicates the switch design and negatively impacts the network performance. Therefore sharing input PCs should not be favoured as a flit admission solution.
This observation illuminates the flit admission approach in output-queuing wormhole switches. While sharing input PCs and crossbars in output-queuing wormhole switches, we encounter exactly the same critical section problem, which is costly to resolve. This leads to only one reasonable flit-admission option for output-queuing wormhole switches, that is sharing output PCs, as drawn in Fig. 7b, where the multiplexers for admitting flits have a port size of 2 × 1. If the coupling strategy is not used, the multiplexers must have a port size of (p + 1) × 1.
5 Flit ejection
Ideal sink model
An ideal sink model is typically assumed for a wormhole switch. With an ideal ejection, flits reaching destinations are ejected immediately, emptying the lane buffers they occupy. Fig. 9 depicts an ideal flit-ejection model. A 'flit sink' is a FIFO receiving the ejected flits. Each lane is connected to a sink and the crossbar via a de-multiplexer, implying that a flit is either forwarded to the next hop or sunk locally. When all flits of a packet are received, the packet is then assembled from the flits and sent into the packet sink.
To incorporate ejection, a lane is extended with a reception state besides the six states. If the routing determines that a head flit reaches its destination, the lane enters the reception state immediately. This is enabled by a static lane-to-sink association. Just like a lane-to-lane association, the lane-to-sink association must be unique at a time. Since flits of packets are sequentially delivered in their allocated lanes, flits in different lanes of a switch belong to different packets, and they cannot interleave in a sink queue. As there are p · v lanes in a switch, there must be p · v sink queues. A lane is statically bound to an exclusive sink so as to realise an immediate transition to the reception state. After the lane transits to the reception state, the head flit enters its sink, bypassing the crossbar. The subsequent flits of the packet are ejected into the sink immediately upon arriving at the switch. When the tail flit is ejected, the lane returns to the initial state. This model is beneficial in both time and space. Although a head flit may be blocked by flits situated in front of it in the same lane, non-head flits can be ejected immediately once the lane is in the reception state. Moreover, it does not interfere with flits buffered in other lanes of the PC from advancing to next hops, because the lanes additionally have a shared path to the crossbar via the de-multiplexers. Upon receiving all flits of a packet, the packet is composed and delivered into the packet sink. If the packet sink is not empty, the switch outputs one packet per cycle in a FIFO manner.
p-sink model
Implementing the ideal sink model requires p · v flit sinks, which can eject p · v flits per cycle. This may over-design the switch since there are only p input ports, implying that at most p flits can reach the switch per cycle. Based on this observation, we can use p instead of p · v sink queues to eject flits to avoid over-design. Besides, in order to have a more structured design, we connect the p sink queues to the crossbar, as illustrated in the dashed box of Fig. 10. We call this model the p-sink model.
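The difference in sink storage between the two models can be worked out directly. The Python sketch below is illustrative; the function names are hypothetical, and the example parameters (p = 4, v = 4, sink depth of 4 flits, 32-bit flits) are assumed to match the configuration reported in the synthesis results.

```python
def flit_sinks(p, v, model):
    """Number of flit sinks per switch for the given ejection model."""
    if model == "ideal":
        return p * v
    if model == "p-sink":
        return p
    raise ValueError(f"unknown ejection model: {model}")


def sink_buffer_bits(p, v, model, depth, flit_bits):
    """Total storage bits of all flit sinks in one switch."""
    return flit_sinks(p, v, model) * depth * flit_bits


if __name__ == "__main__":
    ideal = sink_buffer_bits(4, 4, "ideal", depth=4, flit_bits=32)
    p_sink = sink_buffer_bits(4, 4, "p-sink", depth=4, flit_bits=32)
    print(ideal, p_sink, ideal // p_sink)
```

Under these assumptions the sink storage shrinks by a factor of v, that is, by 75% for v = 4.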
To enable ejecting flits by the p-sink model, we now extend the six lane states with two new states: an 'arriving' and a 'reception' state. If a head flit reaches its destination and enters the routing state, the lane that the flit occupies transits from the routing to the arriving state. Then it will try to associate with an available sink, that is, to establish a lane-to-sink association. If the association is successful, the lane enters the reception state. Subsequently other flits of the packet follow this association exactly like flits advancing in the network. When the tail flit enters the sink, the association is torn down. The lane-to-sink association fails if all sinks have already been allocated. If this happens, the head flit is blocked in place holding the lane buffer. To speed up the flit ejection, a lane in a reception state has a higher priority than a lane in a state for forwarding flits when contending for the shared crossbar input channels. This sink model may cause an increase of blocking time for flit ejection. First, the lane-to-sink association may fail since all sink queues might be in use. In contrast, an ideal sink model guarantees an exclusive sink for each lane. Second, because the v lanes of a PC share one crossbar input channel, only one lane per PC can win arbitration to use the crossbar input channel. It is possible that more than one lane of a PC is in the reception state since the p sinks are shared by all the p · v lanes. To implement this model, the crossbar must double its capacity from p × p to p × 2p, as illustrated in Fig. 10. The number of crossbar control ports is doubled proportionally. The p-sink model uses only p flit sinks. This number is independent of v. A variant of this model with a simpler structure but lower performance can be found in [16].
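The arriving and reception states can be sketched as a dynamic lane-to-sink allocation over p shared sinks. The Python code below is a simplified illustration: the class name `SinkAllocator`, the first-free allocation order and the lane identifiers are assumptions, and crossbar arbitration is not modelled.

```python
ARRIVING, RECEPTION = "arriving", "reception"


class SinkAllocator:
    """Shares p flit sinks among all lanes of a switch."""

    def __init__(self, p):
        self.free = list(range(p))  # indices of unallocated sinks
        self.sink_of = {}           # lane -> allocated sink

    def try_associate(self, lane):
        """Called for a lane whose head flit has reached its destination.

        Returns RECEPTION if the lane holds a sink, otherwise ARRIVING,
        in which case the head flit stays blocked in its lane buffer.
        """
        if lane in self.sink_of:
            return RECEPTION
        if not self.free:
            return ARRIVING
        self.sink_of[lane] = self.free.pop(0)
        return RECEPTION

    def release(self, lane):
        """Tear down the association when the tail flit enters the sink."""
        self.free.append(self.sink_of.pop(lane))
```

With p = 2, a third lane that reaches the arriving state while both sinks are allocated remains there until one of the other lanes releases its sink, which is the first source of extra blocking noted above.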
Synthesis results
We have implemented the wormhole switch model with the flit admission and ejection schemes in VHDL. The switch implementation has eight input/output flit ports, eight input/output credit ports and two input/output packet ports to interface with an RNI. It consists of five logic modules on the data path, five logic modules on the control path and three buffer banks for input lanes, admission and sink queues. The data path comprises 'input logic' for the access of input lanes, generation of credits, maintaining states and routing for each flit input port, 'admission logic' for the access of admission FIFOs, flitisation, maintaining states and routing, 'sink logic' for the access of sink FIFOs, maintaining states and assembly, 'crossbar' and 'credit logic' for updating and managing the status of downstream lanes. The control path comprises one lane allocator, one lane scheduler, one sink allocator, one sink scheduler and one crossbar arbiter. The implemented routing algorithm is the dimension-order XY routing, which routes packets first along the X-axis followed by the Y-axis.
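Dimension-order XY routing can be sketched in a few lines of Python. The coordinate convention (x grows eastwards, y grows northwards) and the port names are assumptions made for this illustration; the sketch is not derived from the VHDL implementation.

```python
def xy_route(cur, dst):
    """Output port chosen at switch cur for a packet destined to dst.

    Switches are addressed as (x, y). The X offset is resolved first,
    then the Y offset; 'local' means the packet has arrived.
    """
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "east"
    if dx < cx:
        return "west"
    if dy > cy:
        return "north"
    if dy < cy:
        return "south"
    return "local"


def xy_path(src, dst):
    """List of switches visited from src to dst under XY routing."""
    step = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}
    path = [src]
    cur = src
    while True:
        port = xy_route(cur, dst)
        if port == "local":
            return path
        mx, my = step[port]
        cur = (cur[0] + mx, cur[1] + my)
        path.append(cur)
```

For instance, `xy_path((0, 0), (2, 1))` first moves along the X-axis to (2, 0) and only then along the Y-axis to (2, 1).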
We have synthesised the designs for 180 nm technology using Synopsys Design Compiler with optimisation for timing. We set the width of a flit W_flit = 32 and v = 4; the depth of a lane is 2; the depth of a flit admission/sink queue is 4. Packet queues can hold up to eight packets with a size of 112 bits. One packet is encapsulated into four 32-bit flits. We estimate buffer area for lanes and packet/flit queues using registers. The coupled-admission scheme slightly improves the data operating frequency of its decoupled counterpart from 198 to 200 MHz due to the simplified crossbar. The ejection models do not change the switch frequency since they do not alter the critical path. Table 1 compares the switch area in terms of equivalent NAND count for combinations of admission and ejection schemes. The 'other logic' comprises the link-level flow control logic that includes the lane allocator and scheduler and credit logic, which consume 12 604, 3704 and 3465 gates, respectively. The 'other buffer' consists of buffers used for lane FIFOs and packet/flit admission FIFOs, consuming 7998 and 10 462 gates, respectively. The other logic and other buffer do not change in all cases. The second and third rows show the results with the decoupled and coupled admission schemes. Both designs use the p-sink model. The coupled admission reduces the logic count for the crossbar by 41.7% from 2549 to 1485 gates, the arbiter by 8.4% from 6212 to 5691 gates and the admission logic by 40% from 8619 to 5177 gates. The switch area is reduced by 6.8% from 73 928 to 68 901 gates. The first and second rows list the results with the ideal sink and the p-sink models. Both designs use the decoupled admission.
Since the p-sink model maintains and manages four sinks via the crossbar instead of dedicated 16 sinks with the ideal sink model, it complicates the crossbar, arbiter, sink allocator and scheduler, but it decreases area for the input logic by 39% from 10 628 to 6436 gates and the sink logic by 75% from 21 194 to 5160 gates. In addition, the sink buffer area is reduced by 75% from 11 200 to 2800 gates. The total reduction in the buffer and switch area is 28% from 29 660 to 21 260 gates and 22.8% from 95 830 to 73 928 gates, respectively. Altogether, the coupled admission, p-sink model reduces the switch area of the decoupled admission, ideal sink model by 28% from 95 830 to 68 901 gates.
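The quoted reductions follow from the gate counts by simple arithmetic, as the Python sketch below illustrates. The dictionary layout and the function name `reduction_pct` are choices made for this example; the gate counts are those quoted in the text.

```python
def reduction_pct(before, after):
    """Relative area reduction in percent."""
    return 100.0 * (before - after) / before


# (before, after) equivalent NAND gate counts quoted in the text
coupled_vs_decoupled = {
    "crossbar": (2549, 1485),
    "arbiter": (6212, 5691),
    "admission logic": (8619, 5177),
    "switch": (73928, 68901),
}
p_sink_vs_ideal = {
    "input logic": (10628, 6436),
    "sink logic": (21194, 5160),
    "sink buffer": (11200, 2800),
    "all buffers": (29660, 21260),
    "switch": (95830, 73928),
}

if __name__ == "__main__":
    tables = (("coupled vs decoupled", coupled_vs_decoupled),
              ("p-sink vs ideal", p_sink_vs_ideal))
    for title, table in tables:
        for name, (before, after) in table.items():
            print(f"{title:22s} {name:16s} {reduction_pct(before, after):5.1f}%")
    print(f"combined switch area: {reduction_pct(95830, 68901):.1f}%")
```

The computed values agree with the percentages in the text to within rounding.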
Note that the modules for admission and ejection, which include the packet admission and ejection queues, flit admission and ejection queues, flitisation and assembly and associated control logic, are main contributors to the area. If those modules are not counted, the flit-switching modules (crossbar, arbiter, the input logic, the other logic and lane buffers) consume 45% (43 282 gates) of the total area (95 830 gates) for the decoupled admission, ideal sink model, 58% (42 968 gates) of the total area (73 928 gates) for the decoupled admission, p-sink model and 60% (41 383 gates) of the total area (68 901 gates) for the coupled admission, p-sink model. We realise that it is impossible to make a fair and accurate comparison between our design and other switches because these switches have different functions and optimisations, different core-to-switch interfaces, different numbers of input/output ports and buffers and different switching capabilities. In order to show that our design has a reasonable size, we make a very rough comparison with other wormhole switches. In the open literature, we can find the switch area for the AEthereal switch [5] and the SPIN switch [17]. Both are implemented in 130 nm technology. They consume 0.26 and 0.24 mm², respectively. According to ITRS [18], the 130 nm technology for ASIC with auto layout has a transistor density of 89 Mtransistors/cm². Assuming a NAND gate consumes 4 transistors, the number of equivalent NAND gates is 57 850 for the AEthereal switch and 53 400 for the SPIN switch. This very rough comparison suggests that the design complexity of our switch in terms of the number of equivalent NAND gates is in the same range as others.
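The rough conversion from silicon area to equivalent NAND gates can be reproduced as follows. The function name and parameter names in this Python sketch are illustrative; the density of 89 Mtransistors/cm² and the 4 transistors per NAND gate are the figures assumed in the text.

```python
def nand_equivalents(area_mm2, mtransistors_per_cm2=89, transistors_per_gate=4):
    """Estimate equivalent NAND gates from a die area given in mm^2."""
    area_cm2 = area_mm2 / 100.0  # 1 cm^2 = 100 mm^2
    transistors = area_cm2 * mtransistors_per_cm2 * 1e6
    return round(transistors / transistors_per_gate)


if __name__ == "__main__":
    for name, area in (("AEthereal", 0.26), ("SPIN", 0.24)):
        print(name, nand_equivalents(area))
```

This yields the 57 850 and 53 400 gates quoted for the two switches.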
6 Simulation results
Experimental setup
To evaluate the proposed flit admission and ejection models, we have developed a simulator in SystemC comprising the input-queuing wormhole switch model and other supporting objects. The switch is a single-cycle, flit-level model. We construct a 2D 4 × 4 mesh network. The switches operate synchronously, conducting the dimension-order XY routing. The routing algorithm is deterministic and guarantees deadlock freedom on the mesh. All switches are configured as follows: unless otherwise noted, the number v of VCs per PC is chosen to be 4, which is optimal for the cost-performance trade-off [9]; the depth of a lane is 2, which is minimal in order to pipeline flits; the depth of an admission/ejection queue is set to hold the flits of exactly one packet. All these buffer settings are intended to minimise the buffering cost. For packet source queues, we do not use infinite FIFOs since an ideal FIFO is impractical. Instead we use a simple bounded FIFO model. Since it has no handshake signals with an RNI, packets may be dropped after being injected. The packets that are injected into packet FIFOs (not dropped) are called 'offered' packets. We evaluate the networks with two types of traffic. One is random traffic which is distributed uniformly. A source node sends packets to destination nodes with equal probability, that is, irrespective of source-destination distance. The other is 'locality' traffic [19] which is distributed locally. In contrast, a source node sends packets to its closer destination nodes with higher probability. The locality traffic makes it possible to explore the communication locality, which is a main optimisation objective while mapping tasks onto NoC nodes in order to reduce latency and save power. Packets have a fixed length of four flits, with a head flit leading three data flits. They are injected by RNIs to destinations except for themselves at a constant rate. Contention for lanes and channel bandwidth is resolved randomly.
Each simulation was run until the network reached a steady state, that is, increasing simulated network cycles did not change the results appreciably.
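The two traffic patterns can be illustrated with a small destination generator in Python. The uniform case follows the description directly. For the locality case the exact distribution of [19] is not reproduced here; as a stand-in, the sketch assumes a selection weight inversely proportional to the hop distance, and the function names are hypothetical.

```python
import random


def nodes(k=4):
    """All (x, y) node addresses of a k x k mesh."""
    return [(x, y) for x in range(k) for y in range(k)]


def distance(a, b):
    """Shortest hop distance between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def destination_weights(src, traffic, k=4):
    """Selection probability of every destination other than src."""
    others = [n for n in nodes(k) if n != src]
    if traffic == "uniform":
        raw = {n: 1.0 for n in others}
    elif traffic == "locality":
        # Assumed weighting: closer nodes are proportionally more likely.
        raw = {n: 1.0 / distance(src, n) for n in others}
    else:
        raise ValueError(f"unknown traffic type: {traffic}")
    total = sum(raw.values())
    return {n: w / total for n, w in raw.items()}


def pick_destination(src, traffic, rng, k=4):
    """Draw one destination for a packet injected at src."""
    weights = destination_weights(src, traffic, k)
    return rng.choices(list(weights), weights=list(weights.values()))[0]
```

A call such as `pick_destination((1, 1), "locality", random.Random(0))` returns a destination different from the source, with neighbours favoured over distant nodes under the assumed weighting.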
We investigate the average packet latency and the network throughput. The latency of a packet is measured from the instant the packet is offered into the packet source queue to the instant the packet is ejected from the network, so the source queuing time in the packet queue is included. Throughput $\lambda$ is defined as the number of flits received per cycle per node. We denote the number of network nodes as $M$, the link capacity as $C$, the number of simulation cycles as $T$, the number of flits injected/offered into the network as $N_{in}$/$N_{of}$, and the number of flits ejected from the network as $N_{out}$. Suppose that the shortest distance a flit $i$ travels is $D_i$; then the total shortest distance to be travelled by all offered flits is $D_{of} = \sum_{i=1}^{N_{of}} D_i$; the total shortest distance travelled by all ejected or received flits is
We define some terms as follows:
In the following figures, each curve is plotted by connecting ten points, with each point corresponding to a constant packet injection rate $\gamma_p$ in [1/200, 1/40, 1/20, 1/15, 1/10, 1/8, 1/7, 1/6, 1/5, 1/4], where 1/n means that an RNI injects one packet into the network every n network cycles. We assume that the resource frequencies are a factor or a multiple of the network frequency. Since there are packet queues between RNIs and the network, the read/write accesses of the packet queues can work asynchronously. The clock frequencies of the resources are not important here; what matters is the equivalent packet injection rate in terms of the network frequency. For example, given a packet injection rate of 1/4 (in terms of the network frequency), if the network is twice as fast as a resource, the resource injects one packet every two resource cycles; if the network is half as fast as a resource, the resource injects one packet every eight resource cycles. These packet injection rates correspond to flit injection rates [0.02, 0.1, 0.2, 0.267, 0.4, 0.5, 0.571, 0.667, 0.8, 1] flit/cycle/node ($\gamma_f = 4 \times \gamma_p = 4/n$, since one packet equals four flits). Suppose the average flit distance is $D_{avg}$; the relation between $\alpha$ and $\gamma_f$ is $\alpha = \gamma_f \cdot M D_{avg}/C$, if no packets are dropped ($N_{of}$
We noticed that network saturation leads to buffer overflow at the two or three highest injection rates under the uniform traffic. If the buffers overflow, $\alpha$ will be smaller than $8\gamma_f/9$.
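The arithmetic behind the flit injection rates and the 8/9 factor quoted above can be checked with a short Python sketch. It assumes that the link capacity C counts the 48 unidirectional links of the 4 × 4 mesh, each carrying one flit per cycle, and that the average flit distance is taken over uniform traffic to all nodes other than the source; both are assumptions of this example, as are the variable names.

```python
from fractions import Fraction
from itertools import product

MESH = 4                 # 4 x 4 mesh
FLITS_PER_PACKET = 4     # fixed packet length from the experiment setup
M = MESH * MESH          # number of network nodes

# Assumption: C is the number of unidirectional mesh links (1 flit/cycle each).
C = 2 * 2 * MESH * (MESH - 1)

# Packet injection periods n (one packet every n network cycles).
periods = [200, 40, 20, 15, 10, 8, 7, 6, 5, 4]
flit_rates = [Fraction(FLITS_PER_PACKET, n) for n in periods]

# Average shortest (Manhattan) distance under uniform traffic, excluding self.
nodes = list(product(range(MESH), repeat=2))
pairs = [(s, d) for s in nodes for d in nodes if s != d]
total_distance = sum(abs(s[0] - d[0]) + abs(s[1] - d[1]) for s, d in pairs)
D_avg = Fraction(total_distance, len(pairs))

# Ratio between the load and the flit injection rate when nothing is dropped.
load_factor = Fraction(M) * D_avg / C

for n, rate in zip(periods, flit_rates):
    print(f"1/{n}: {float(rate):.3f} flit/cycle/node, load {float(load_factor * rate):.3f}")
```

Under these assumptions the average distance is 8/3 hops and the load factor is 8/9, so a flit injection rate of 4/7 (about 0.57) corresponds to a load of roughly 0.5, which matches the correspondence used in the experiments below.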
Experiments on flit admission
For flit admission, we experiment on the decoupled and coupled admission models. Both use the ideal sink model. Fig. 11 compares their performance. The average latency is unaffected when the offered load is below 0.5, corresponding to a flit injection rate of 0.57. The head-of-line blocking is worse with the coupled admission only once the network is nearly saturated, at the injection rate of 0.66 flit/cycle/node. At this point, the average latency differs by about 9 cycles. (This is an approximate number. At and above the saturation point, the latency varies widely due to contention effects. This note also applies to the other performance figures.) The saturation throughput is 4% worse with the coupled admission (0.75 with the decoupled admission, 0.72 with the coupled one). If packet FIFO overflow at the two saturation points is not allowed, both achieve the same throughput of 0.66 flit/cycle/node when the flit injection rate is 0.66 flit/cycle/node.
To quantify the head-of-line blocking time in packet source queues, we plot the blocking time against the offered load in Fig. 11c. If the offered load is not greater than 0.5, no time is spent queuing with either admission scheme. If the offered load is above 0.5, the queuing time shows an exponential increase, and the coupled admission is worse than the decoupled one, as we analysed previously. However, we also noticed that, if the source queuing time is not counted, the average packet latency is better with the coupled admission than with the decoupled one, as illustrated in Fig. 11d. This is interesting because it suggests that, when the network is about to saturate or is saturated, a faster admission scheme leads to higher network contention and consequently a longer network delivery time. Nevertheless, we should note that the blocking time in packet source queues is much more significant than the packet delivery time in the network when the network starts to saturate.
Experiments on flit ejection
For flit ejection, experiments are conducted for the two sink models: the ideal sink and p-sink models. Both models use the decoupled admission. From Figs. 12a and 12b, we can see that the p-sink model does not degrade the average latency until the offered load exceeds 0.5. With the ideal and p-sink models, the network saturation throughput differs by about 5%. If saturation is avoided, both achieve the same throughput of 0.66 flit/cycle/node.
With the p-sink model, the number p of sinks is independent of the number v of VCs per PC. When v is reduced, the contention for sinks is also alleviated, so the p-sink model is unlikely to become a performance bottleneck for a lower v. However, the blocking time during network delivery and the back-pressure have a more significant impact on performance. We show the performance with v = 3 and v = 2 in Figs. 12c and 12d. As v decreases, network blocking during delivery worsens due to the reduction in VCs; thus the average latency increases (Fig. 12c) and the network throughput decreases (Fig. 12d). As a consequence, the networks exert higher back-pressure on admitting flits, resulting in packet buffer overflow at lower injection rates. In the case of the p-sink model with v = 3 and the ideal/p-sink model with v = 2, buffers overflow at the three highest injection rates (when the offered load is above 0.5). In the other cases, buffers overflow at the two highest injection rates.
Experiments on flit admission and ejection
In this part, we compare the performance of three combinations of flit admission and ejection models: the decoupled admission with the ideal/p-sink ejection and the coupled admission with the p-sink model. As shown in Figs. 13a and 13b, the average packet latency shows no difference when the offered load is below 0.5. If the network does not operate at the last two saturation points, all three achieve the same throughput of 0.66 flit/cycle/node at the flit injection rate of 0.66, where the latency difference between the decoupled admission with the ideal/p-sink ejection and the coupled admission with the p-sink model is about 35/10 cycles. The saturation throughput for the three models is 0.75, 0.71 and 0.695, respectively.
Since locality traffic is a likely scenario for real traffic, we use it to investigate the performance of the coupled admission with the p-sink model. As implementing an ideal ejection is too costly, we use only the decoupled admission in conjunction with the p-sink model as the baseline. In this case, both use the p-sink model. Figs. 13c and 13d compare their performance. With the locality traffic, the packet FIFO overflows only at the highest injection rate. The coupled admission performs very close to the decoupled one. When the injection rate is 0.8 flit/cycle/node, the average packet latency with the coupled admission is about 4 cycles worse, and the throughput in both cases is the same. If the injection rate is below 0.66 flit/cycle/node, the two admission schemes show no difference in average latency and throughput. This means that, as traffic goes from uniform to locality, the performance of the coupled admission with the p-sink model comes closer to the baseline. Hence, the proposed schemes are as applicable to locality traffic as they are to uniform traffic.
Conclusion
We have discussed flit admission and ejection models for wormhole-switched networks on chips. In particular, for flit admission, we present a coupled admission scheme which decreases the crossbar complexity from 2p × p to (p + 1) × p; for flit ejection, we propose a p-sink model whose buffering cost for flit sinks is only 1/v of that of the ideal ejection model. Simulation results show that, under uniformly distributed stochastic traffic with and without locality, the proposed schemes do not degrade network performance if the network is not saturated. Synthesis results suggest that they do not degrade hardware speed, either. Furthermore, with the coupled admission, the area of the crossbar and of the switch is reduced by 41.7% and 6.8%, respectively. Using the p-sink model, the buffer and switch area is reduced by 28% and 22.8%, respectively. We believe these flit admission and ejection models are promising alternatives for achieving cost-effective switch designs. Although our discussions are equally applicable to macro wormhole-switched networks in parallel computing, the experiments were designed for a NoC that employs a low-dimension regular topology, deterministic routing and smaller buffering cost. Future work will investigate the power saving with the flit admission and ejection techniques. It is estimated in [20] that the switch crossbars and buffers consume about 50% and 30% of total node power, respectively, on a 4 × 4 torus on-chip network. As the coupled admission simplifies the crossbar by 41.7% and the p-sink model cuts buffers by 28%, we expect that the proposals save a significant portion of power in addition to reducing cost. Another direction is to simplify the control path of the switch, which is the key to further reducing the switch area.
Fig. 13 Performance of the decoupled/coupled admission with the ideal/p-sink model
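The structural savings summarised in the conclusion can be tabulated with a small Python sketch. Counting crosspoints as inputs × outputs is used here only as a rough proxy for crossbar complexity, and the values p = 5 and v = 4 in the usage lines are arbitrary illustrative choices; neither is claimed to reproduce the synthesis results. The function names are hypothetical.

```python
def crossbar_crosspoints(p, coupled):
    """Crosspoints of a 2p x p (decoupled) or (p + 1) x p (coupled) crossbar."""
    inputs = p + 1 if coupled else 2 * p
    return inputs * p


def flit_sinks(p, v, ideal):
    """Flit sinks needed: p * v for the ideal model, p for the p-sink model."""
    return p * v if ideal else p


# Illustrative values only (assumption): p = 5 PCs per switch, v = 4 VCs per PC.
p, v = 5, 4
decoupled = crossbar_crosspoints(p, coupled=False)
coupled = crossbar_crosspoints(p, coupled=True)
print(f"crosspoints: {decoupled} -> {coupled}")
print(f"flit sinks:  {flit_sinks(p, v, ideal=True)} -> {flit_sinks(p, v, ideal=False)}")
```

The sink count of the p-sink model stays at p for any v, which is the sense in which its buffering cost is independent of the number of virtual channels.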
Acknowledgments
Bei Yin and Ming Liu in the System-on-Chip Design Master program at the Royal Institute of Technology in Sweden contributed to the wormhole switch RTL implementation and synthesis results. We thank Prof. Jan Madsen from the Technical University of Denmark and the anonymous reviewers for providing valuable comments to improve the paper.
References

