Abstract: As the Network-on-Chip (NoC) induces significant hardware overheads, it becomes the performance and scalability bottleneck of System-on-Chip (SoC) design. To address this challenge, we propose a multi-layer, non-blocking ring NoC architecture. Multi-layer links with different bandwidths achieve high link utilization and avoid protocol-level deadlock. The non-blocking architecture leverages bufferless routers to reduce hardware overheads and simplifies the router pipeline to reduce zero-load latency. We also propose a scalable global signal control mechanism to eliminate starvation and avoid packet loss. Compared with the conventional ring network composed of dateline routers (DRing) and the Intel Nehalem-EX ring network (NRing), our design achieves 69.4% and 12.3% performance improvements, respectively. Compared with DRing, it also reduces hardware overheads.
Introduction
As transistor counts continue to increase with state-of-the-art VLSI technology, more and more cores can be integrated on a single chip. Traditional bus-based interconnects struggle to satisfy the demands of such SoCs. As a consequence, the NoC has been proposed to connect the cores in multicore architectures [1] . With process scaling, the latency, design complexity, area and power consumption of the NoC significantly influence the design of chip multiprocessors (CMPs). For example, in the Intel Terascale chip, the mesh network consumes 30% of chip power and 25% of die power [2] ; the NoC hardware overhead thus severely constrains chip design.
A variety of sophisticated NoC topologies have been studied so far, including the Flattened Butterfly, Clos and Fat Tree. They try to make full use of the on-chip link resources. Yet, to reduce the complexity of placement and wire routing, most NoC products use simple topologies, such as the ring and the 2D mesh. Among them, the ring is the simplest. It achieves very high energy efficiency at low-to-medium core counts, and most current industrial chips consist of tens of cores. Hence, the ring has been widely used in industrial products, including IBM CELL [3] , Intel Nehalem-EX [4] and the recent Xeon Phi [5] . These products use different kinds of rings, such as the dateline ring, the Token ring and the Bubble ring [6, 7] .
Although these kinds of rings are effective in improving performance, they introduce significant hardware complexity, such as router buffers and long pipelines. Some rings use a sophisticated injection procedure, so packets cannot be injected every cycle. This paper mitigates these issues by proposing a ring design that supports high performance with low hardware overheads.
In this paper, we propose a novel multi-layer, non-blocking ring NoC architecture, Express Ring (ERing). It optimizes the ring in three ways. First, we leverage multi-layer links to avoid protocol-level deadlock [8] and configure different bandwidths for the data layer and the address layer to improve bandwidth utilization. Second, we eliminate buffers in the router. This removes the VC allocation stage from the router pipeline, which accelerates packet delivery and reduces hardware overhead. Finally, as packet loss and starvation may occur in bufferless rings, we introduce a global signal control mechanism. The result is a high-performance, low-overhead ring NoC. Compared with the Dateline Ring (DRing) and the Nehalem-EX Ring (NRing), our design achieves performance improvements of 69.4% and 12.3%, respectively. Compared with DRing, our design reduces area by 78% and power consumption by 23%.
The remainder of the paper is organized as follows. Section 2 discusses several existing ring designs. Section 3 describes our design of Express Ring. Section 4 shows our simulated system, including the simulators and the parameters of the simulations. Section 5 evaluates the performance of Express Ring and compares it with DRing and NRing. Finally, Section 6 concludes the paper.
Existing designs of rings
Here, we analyze two existing designs of rings, Dateline Ring and Intel Nehalem-EX Ring.
Dateline Ring
A dateline [9] is a conceptual line across a channel of a ring network, as shown in Fig. 1 . Each packet starts in VC 0 and switches to VC 1 when it crosses the dateline. In this way, the cyclic dependencies inherent in the ring are removed. Although this provides a convenient way to avoid deadlock, it requires larger allocators and lowers the router frequency. The pipeline of the Dateline router consists of route computation (RC), VC allocation (VA), switch allocation (SA) and switch traversal (ST). The VCs also incur a large hardware overhead.
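The VC switching rule can be illustrated with a minimal Python sketch. The function name `vc_along_path`, the clockwise-only path, and the choice that the hop arriving at the dateline node already uses VC 1 are assumptions made for illustration; they are not taken from the DRing implementation.

```python
from typing import List


def vc_along_path(src: int, dst: int, num_nodes: int, dateline: int = 0) -> List[int]:
    """Return the VC used on each hop of a clockwise path from src to dst.

    Assumption (illustrative): the dateline sits on the channel entering
    node `dateline`, so the hop that arrives at that node is the first
    hop travelled in VC 1.
    """
    vcs = []
    vc = 0  # every packet starts in VC 0
    node = src
    while node != dst:
        nxt = (node + 1) % num_nodes
        if nxt == dateline:
            vc = 1  # crossing the dateline: switch to VC 1 and stay there
        vcs.append(vc)
        node = nxt
    return vcs


if __name__ == "__main__":
    # A path that wraps around the ring crosses the dateline once.
    print(vc_along_path(6, 2, 8))
    # A path that never crosses the dateline stays in VC 0.
    print(vc_along_path(1, 3, 8))
```

Because a packet never returns from VC 1 to VC 0, the channel dependency graph over (channel, VC) pairs has no cycle, which is the property the dateline is meant to provide.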
Nehalem-EX Ring
Nehalem-EX is the codename of an Intel Xeon processor [10] . Its eight cores are connected by 2 counter-rotating ring sets, as shown in Fig. 2 . Each ring set has 4 protocol rings and supports 32 Bytes in each direction. Rotary rules guarantee that traffic already on the ring has higher arbitration priority. Ring nodes are tagged with "even" or "odd" polarity, and the two rings in opposite directions share the same output port at each ring node. This means that the node output buffer can receive packets from the clockwise ring on even cycles and from the counterclockwise ring on odd cycles. A packet can therefore be injected only when the injection time matches the polarity of the destination node. Although the polarity mechanism saves some output buffers, it slows packet delivery, because a node cannot inject packets destined for the same node into a ring on consecutive cycles.
Express Ring design
In this section, we first provide an overview of the Express Ring architecture. Then we describe multi-layer links and the router microarchitecture. Finally, we describe the proposed global flow control mechanism.
Top architecture
The architecture of Express Ring (ERing) is shown in Fig. 3 . It consists of three main parts: multi-layer links, bufferless routers and a global signal controller.
(1) Multi-layer Links: The multi-layer links comprise a data layer and an address layer. Protocol-level deadlock is avoided by separating request and reply packets. A pair of unidirectional links on each layer supports bidirectional communication. Link widths differ according to the width of the packets they carry, so link bandwidth utilization is very high. The links are described in subsection 3.2.
(2) Bufferless Router: The bufferless router is composed of an access controller and a ring node. The access controller mainly handles sending and receiving packets. The ring node connects the links and the access controller. The bufferless design saves a large amount of hardware overhead. More details of the ring node are given in subsection 3.3.
(3) Global Signal Controller: The global signal controller handles available signals and starvation signals. The available signal tells the source node that the destination node is ready to receive packets. The starvation signal is used to guarantee fairness. Both signals are described in subsection 3.4.
Multi-layer links
The multi-layer links are composed of a data layer and an address layer, and each layer comprises 2 counter-directional unidirectional links that together support bidirectional communication. In practice, 3 kinds of packets are transferred on the ring: read requests, read responses and write requests. To avoid protocol-level deadlock, request and reply packets are transferred on links of different layers. The address link transfers read requests from master devices to slave devices. Read requests do not contain any data, so we configure the address link with a narrower width. The data links, on the other hand, transfer two kinds of packets. One is the write request from master devices to slave devices, and the other is the read response that returns read data from slave devices. Since write requests and read responses contain data, we configure a wider bandwidth for the data links. As long packets are transferred on wider links and short packets on narrower links, the link bandwidth is fully utilized. In addition, this arrangement avoids protocol-level deadlock.
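The mapping from packet type to layer described above can be sketched as follows. This is only an illustrative Python sketch: the names `PacketType`, `Layer` and `select_layer` are hypothetical, and the link widths are the 128-bit and 512-bit values used later in the evaluation setup.

```python
from enum import Enum


class PacketType(Enum):
    READ_REQUEST = "read_request"    # master -> slave, carries no data
    WRITE_REQUEST = "write_request"  # master -> slave, carries data
    READ_RESPONSE = "read_response"  # slave -> master, carries data


class Layer(Enum):
    ADDRESS = "address"
    DATA = "data"


# Link widths in bits, taken from the evaluation configuration of ERing.
LINK_WIDTH_BITS = {Layer.ADDRESS: 128, Layer.DATA: 512}


def select_layer(ptype: PacketType) -> Layer:
    """Pick the layer for a packet: only data-less read requests use the address layer."""
    if ptype is PacketType.READ_REQUEST:
        return Layer.ADDRESS
    return Layer.DATA


if __name__ == "__main__":
    for p in PacketType:
        layer = select_layer(p)
        print(p.value, "->", layer.value, LINK_WIDTH_BITS[layer], "bits")
```

With this mapping a read request and its read response never share a layer, which is the separation the text relies on to avoid protocol-level deadlock.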
Bufferless router microarchitecture
In this paper, the bufferless router is composed of an access controller and a ring node. The router microarchitecture is shown in Fig. 4 .
As both the interface module and the controller module of the ERing router, the access controller plays a critical role in packet injection, ejection and arbitration. The access controller is composed of an input/output arbiter unit, a global signal control unit and a routing unit. As several different devices, such as the L1D, L1P and DMA, are connected to the access controller, packets from different devices need to be arbitrated before being injected into the ring. When these devices send packets to the router, the access controller performs the arbitration and chooses a packet based on a priority scheme. At the same time, the global signal controller checks the available signal of the destination node. If the destination is available, the chosen packet is passed to the routing unit, where it is pre-routed. The routing unit chooses a data link or an address link according to the packet type. Shortest-path routing is used, so the hop count is never greater than half of the node count. After pre-routing, the packet is injected into the Express Ring Node shown in Fig. 4 . Hence, ERing incurs very little overhead when transferring packets.
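The shortest-path pre-routing decision can be sketched in Python as below. The function name `route` and the tie-break toward the clockwise direction when both directions are equally long are assumptions for illustration; the text does not say how ties are resolved.

```python
from typing import Tuple


def route(src: int, dst: int, num_nodes: int) -> Tuple[str, int]:
    """Return (direction, hop_count) for shortest-path routing on a bidirectional ring.

    Assumption (illustrative): ties between the two directions are
    broken in favour of the clockwise link.
    """
    cw_hops = (dst - src) % num_nodes   # hops when travelling clockwise
    ccw_hops = (src - dst) % num_nodes  # hops when travelling counterclockwise
    if cw_hops <= ccw_hops:
        return ("cw", cw_hops)
    return ("ccw", ccw_hops)


if __name__ == "__main__":
    n = 8
    print(route(0, 4, n))  # farthest node in an 8-node ring
    print(route(0, 5, n))  # shorter the other way round
    # The hop count never exceeds half of the node count.
    assert all(route(s, d, n)[1] <= n // 2 for s in range(n) for d in range(n))
```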
The Express Ring Node is composed of several Muxes, as shown in Fig. 5 , which cost little area and power. After the packet is injected into the Express Ring Node, a link is chosen according to the routing information. The packet is then transferred to the destination node on this link without interruption. Once the packet arrives at the destination node, it is passed from that node's Express Ring Node into the access controller, and then through the output arbitration unit to the destination device.
One significant feature of ERing is its non-blocking design. The ERing router is bufferless, and only a small number of buffers exist in the input/output arbiter unit. These buffers store injecting and ejecting packets in order to increase throughput on the ring. After a packet is injected into the ring, it is transferred directly to the destination. This means that an in-flight packet is never stalled in any Ring Node until it is ejected. As a result, it takes only 1 cycle to traverse each hop.
To eliminate network-level deadlock and transfer packets without blocking, traffic on the ring must have higher priority than traffic waiting to be injected. A node can inject a packet only if no in-flight packet will be transferred to it in that cycle. We introduce a busy signal that is connected to the next node. When a packet is ready to be transferred to the next node, the busy signal is sent to the next node in advance. This prevents the next node from injecting its waiting packet into the ring.
In DRing, VC and switch allocation are needed after route computation, so a router pipeline of more than 1 cycle is needed to transfer a packet through a router. Unlike ERing, DRing can inject packets into a node whenever there is an empty VC. However, this blocks the transfer of in-flight packets and costs more cycles in the router.
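The priority of in-flight traffic over injection can be sketched with a one-cycle model of a single unidirectional ring. This is a simplified Python sketch under stated assumptions: the name `ring_cycle`, the representation of a packet by its destination node only, and the conservative rule that any packet arriving from upstream (even one that is ejected at this node) raises the busy signal are all illustrative choices, not details of the ERing RTL.

```python
from typing import List, Optional, Tuple


def ring_cycle(
    ring: List[Optional[int]], pending: List[List[int]]
) -> Tuple[List[Optional[int]], List[int]]:
    """Advance a unidirectional bufferless ring by one cycle.

    ring[i]    : destination of the packet held in node i's link register, or None.
    pending[i] : destinations of packets waiting in node i's injection queue.
    Returns the new ring state and the destinations delivered in this cycle.
    """
    n = len(ring)
    new_ring: List[Optional[int]] = [None] * n
    delivered: List[int] = []
    for i in range(n):
        upstream = ring[(i - 1) % n]
        busy = upstream is not None  # busy signal sent ahead by the upstream node
        if busy:
            if upstream == i:
                delivered.append(upstream)  # packet reached its destination: eject
            else:
                new_ring[i] = upstream      # pass through without stalling (1 cycle per hop)
        elif pending[i]:
            new_ring[i] = pending[i].pop(0)  # link is idle: injection is allowed
    return new_ring, delivered


if __name__ == "__main__":
    pending = [[], [3], [], []]
    ring, done = ring_cycle([2, None, None, None], pending)
    print(ring, done, pending)  # node 1 is held off while the packet for node 2 passes
    ring, done = ring_cycle(ring, pending)
    print(ring, done, pending)  # node 1 injects once the link is idle
```

In the example, node 1 has to wait one cycle because a packet from node 0 is passing through; the in-flight packet is never delayed.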
Our non-blocking design therefore has 2 notable advantages.
(1) It takes only 1 cycle to traverse a hop, which plays an important role in improving ring performance.
(2) The bufferless router consumes much less power than other designs, owing to its simple router pipeline and smaller buffer count.
Global signal controller
The available signal is used to avoid packet loss. When a packet is transferred into the access controller, it waits for the available signal before being injected into the ring. The available signal indicates that the destination node is ready to receive the packet. It is a necessary condition for injection.
Since the latency of the available signal may lead to an incorrect view of the output buffer state, we use a scheme to hide this latency. Once a packet has been injected into the ring, its transfer latency is deterministic. We can therefore predict whether the destination node has enough buffers to receive the injected packet based on the distance between the destination and source nodes. As long as enough buffers remain, the available signal is asserted. For instance, the longest distance in an 8-node ring is from node 0 to node 4. It takes 4 cycles to transfer the available signal from node 4 to node 0, and another 4 cycles to transfer the packet to node 4. During this period, at most 8 packets can be transferred to node 4. Thus, we assert the available signal from node 4 to node 0 when the buffer count is larger than 8. In this way, the available signal reaches the source node whenever the output buffer of the destination node has enough space. In NRing, by contrast, a packet cannot always be injected when the available signal is asserted, because injection also depends on the injection time and the node polarity. Thus, ERing is more efficient than NRing. As the design of ERing is non-blocking, these global available signals prevent packet loss.
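The distance-based rule in the example above can be written as a short Python sketch. The function names are hypothetical, and the generalization to arbitrary node pairs (twice the shortest-path distance) is an assumption extrapolated from the single node 0 to node 4 example; the strict "larger than" comparison follows the wording of the text.

```python
def ring_distance(src: int, dst: int, num_nodes: int) -> int:
    """Shortest-path hop count between two nodes on a bidirectional ring."""
    return min((dst - src) % num_nodes, (src - dst) % num_nodes)


def in_flight_bound(src: int, dst: int, num_nodes: int) -> int:
    """Upper bound on packets that may arrive before the source sees a signal change.

    The available signal needs `d` cycles to reach the source and a packet
    needs another `d` cycles to reach the destination, so up to 2*d packets
    can arrive in that window (one per cycle).
    """
    return 2 * ring_distance(src, dst, num_nodes)


def available(free_buffers: int, src: int, dst: int, num_nodes: int) -> bool:
    """Assert the available signal toward `src` only if the buffer count exceeds the bound."""
    return free_buffers > in_flight_bound(src, dst, num_nodes)


if __name__ == "__main__":
    # Worked example from the text: node 0 to node 4 in an 8-node ring.
    print(in_flight_bound(0, 4, 8))  # 4 cycles for the signal + 4 cycles for the packet
    print(available(9, 0, 4, 8), available(8, 0, 4, 8))
```

Under this sketch a nearby source sees the available signal at a lower buffer count than a distant one, because fewer packets can be in flight during the shorter round trip.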
If a node continuously injects packets into the ring, the ring links fill up with packets. As traffic on the ring has higher priority than traffic waiting at a node, other nodes may be unable to inject packets at all, which causes starvation. To address this, we introduce a starvation signal and set a threshold value for each node. The threshold value is a waiting period in cycles. If a packet waits in the node injection buffer for longer than the threshold value, the starvation signal is broadcast to the other nodes through the global signal controller. Once a node receives the starvation signal, it stops injecting packets, so the starved node is able to inject. After its injection is completed, the starvation signal of the node is de-asserted. The starvation signals guarantee injection fairness among the nodes on the ring.
The global signal controller is scalable. It mainly consists of interconnecting wires among ring nodes, and there is no other control logic or arbitration logic in the central controller. Therefore, when the number of nodes increases, we simply add wires to connect them in the global signal controller. The overhead is acceptable as long as the scale is not too large.
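The starvation mechanism can be modelled with a per-node waiting counter, as in the following Python sketch. The class name `StarvationMonitor`, its methods, and the rule that a starving node is itself still allowed to inject when several nodes starve at once are illustrative assumptions; the default threshold of 10 cycles is the value used in the evaluation.

```python
class StarvationMonitor:
    """Tracks injection waiting time per node and derives the starvation signals."""

    def __init__(self, num_nodes: int, threshold: int = 10) -> None:
        self.threshold = threshold             # waiting period in cycles
        self.wait = [0] * num_nodes            # cycles the head packet has waited
        self.starving = [False] * num_nodes    # starvation signal of each node

    def tick(self, node: int, has_pending: bool, injected: bool) -> None:
        """Update one node for one cycle."""
        if injected or not has_pending:
            self.wait[node] = 0
            self.starving[node] = False        # de-assert after the injection completes
            return
        self.wait[node] += 1
        if self.wait[node] > self.threshold:
            self.starving[node] = True         # broadcast to the other nodes

    def may_inject(self, node: int) -> bool:
        """A node stops injecting while another node is starving.

        Assumption (illustrative): a node that is itself starving may
        still inject, so simultaneous starvation does not block everyone.
        """
        return self.starving[node] or not any(self.starving)


if __name__ == "__main__":
    mon = StarvationMonitor(num_nodes=4, threshold=3)
    for _ in range(4):
        mon.tick(2, has_pending=True, injected=False)
    print(mon.starving, mon.may_inject(0), mon.may_inject(2))
    mon.tick(2, has_pending=True, injected=True)
    print(mon.starving, mon.may_inject(0))
```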
Methodology
We evaluate the performance of ERing using a cycle-accurate interconnection network simulator, Booksim [11] . We modify Booksim to implement ERing and compare it with DRing and NRing. We use both synthetic traffic patterns and real applications for the evaluation. We also use router RTL [12] for area and power evaluations.
We use six synthetic traffic patterns that are typical for performance evaluation: uniform random, bit reverse, bit complement, neighbor, tornado and hotspot. We focus on the 8-node ring, as this scale is widely used in current industrial chips. We also use the 16-node ring to evaluate scalability. The bufferless design of ERing and NRing requires single-flit packets. Thus, we assume all packets are single-flit packets in all three rings. Multiple physical networks are used to avoid protocol-level deadlock in these 3 rings. Since ERing uses 2 physical networks to deliver request and reply messages respectively, the other 2 rings use a similar structure for fair comparison. Shortest-path routing is used. For DRing, credit-based flow control is used to avoid buffer overflow and packet loss. In each router, 16 buffer slots are divided between two VCs to cover the credit round-trip delay. All packets in DRing are first injected into VC 0 and then switched to VC 1 after crossing the dateline (router 0). For the available mechanism, we use 8 output buffers in each node of ERing, as well as of NRing. To evaluate the available mechanism with as high fidelity as possible, we use a dynamic random scheme to model cache misses. Based on estimates from real applications, we assume that the slave device consumes a packet in 2 cycles with probability 60%, and in 25 cycles with probability 40% when a cache miss occurs. In addition, we choose 10 cycles as the starvation threshold value in ERing, because it gave the best simulation results.
For real applications, we combine Booksim with Netrace, a dependency-driven trace-based NoC simulation tool [13] . Although full-system simulation offers high fidelity, it suffers from long runtimes and variability. Dependency-driven trace-based simulation decreases runtimes and offers insight into not only network-level but also application-level performance characteristics. The traces in Netrace are collected from the M5 simulator modeling a 64-core system executing multithreaded applications from the PARSEC v2.1 suite. Since we simulate 8 nodes, we distribute the 64 cores across 8 nodes, with 8 cores per node. The design parameters of the target system are shown in Table I . We modify the router RTL to evaluate the hardware overheads of the routers in the three rings. Designs are synthesized using the Synopsys Design Compiler with a 2 GHz clock frequency. We set each link of DRing to 512 bits. As ERing is multi-layer, we configure 512 bits for the data link and 128 bits for the address link. NRing uses the same configuration as ERing. Other parameters are the same as in the Booksim configuration.
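The dynamic random consumption model can be sketched in Python as a two-point distribution. The function names are hypothetical, and reading the text as "2 cycles with probability 0.6, otherwise 25 cycles on a cache miss" is an interpretation of the description above rather than the exact simulator code.

```python
import random


def sample_consume_cycles(
    rng: random.Random, p_fast: float = 0.6, fast: int = 2, slow: int = 25
) -> int:
    """Draw the number of cycles a slave device needs to consume one packet."""
    return fast if rng.random() < p_fast else slow


def expected_consume_cycles(p_fast: float = 0.6, fast: int = 2, slow: int = 25) -> float:
    """Mean consumption time of the two-point model."""
    return p_fast * fast + (1.0 - p_fast) * slow


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    samples = [sample_consume_cycles(rng) for _ in range(100_000)]
    print("empirical mean:", sum(samples) / len(samples))
    print("analytical mean:", expected_consume_cycles())  # 0.6*2 + 0.4*25
```

Under this interpretation the mean consumption time is 0.6 × 2 + 0.4 × 25 = 11.2 cycles per packet, which gives a sense of how quickly the 8 output buffers drain at a heavily loaded node.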
Evaluation

Synthetic-level performance
We start the evaluation with synthetic traffic patterns. The performance of the three rings is shown in Fig. 6 . The performance of NRing and ERing is similar, and both perform much better than DRing. The zero-load latencies of NRing and ERing are equal, while that of DRing is about 2 times higher. The DRing pipeline, consisting of RC, VA, SA and ST, takes 3 cycles, as we use speculative SA to combine VA and SA. NRing and ERing use single-cycle routers, so it takes only 1 cycle to traverse the router. Therefore, the zero-load latency of ERing is about 66.7% lower than that of DRing. Across all traffic patterns, the average performance gains of ERing over DRing and NRing are 69.4% and 12.3%, respectively. DRing is limited by its pipeline latency and buffer count. Compared with ERing, NRing is limited by the counter-rotation mechanism. Next, we analyze the performance under the different synthetic traffic patterns.
Comparing DRing and ERing, ERing shows a smaller relative performance gain under hotspot traffic. The reason is that all packets are sent to one node and must queue until the hotspot node becomes available. In this case, the injection limitation of the bufferless ring reduces performance. In the other three patterns, hops are uniform and the traffic is relatively dispersed. The non-blocking design of
Comparing ERing and NRing, the two differ in their injection mechanisms. The performance loss of NRing is mainly due to its injection process, while in ERing packets can be injected at any time as long as the destination is available. The performance of ERing and NRing is nearly the same under tornado traffic, while the average performance gain in the other traffic patterns is 15%. Under tornado traffic, packets are always transferred to the farthest node, and the long distance reduces the benefit of faster injection.
The network state after saturation is another crucial issue [14] . Fig. 7 shows the throughput of the three rings from zero load to an injection rate of 1 flit/(node × cycle). Since packets are transferred without loss, the performance of all three rings is stable after saturation. The saturation throughput of ERing is the highest, while that of DRing is the lowest. The reason is that when the ring scale is not too large, ERing and NRing transfer packets much faster than DRing, so more packets can be transferred in a given time. This masks the effect of VCs on throughput. As a result, the saturation throughput gain of ERing over DRing is larger in the bit reverse pattern than in the uniform random pattern. For comparison, we assume that most parameters of ERing and NRing are the same, except for the injection flow control. The saturation throughputs of ERing and NRing follow a similar trend, while that of ERing is higher because it has fewer injection restrictions.
Fig. 8 gives the performance of the 16-node rings. Compared with DRing and NRing, ERing has average performance gains of 85.8% and 11.1%, respectively. As the node count doubles, the average hop count roughly doubles as well. Thus, the performance decreases in all three rings. The performance of DRing, NRing and ERing drops by about 44%, 26% and 28%, respectively. As the number of nodes increases, the average latency becomes longer. Bufferless rings are less sensitive to latency, so they suffer less than DRing.
Fig. 9 shows the runtime speedup relative to DRing. According to the communication analysis of the PARSEC workloads under our distribution of cores, the communication of most workloads is relatively stable between several pairs of nodes. In particular, node 2 is a clear hotspot in x264 relative to the other workloads. The hotspot at node 2 reduces the performance of the ring. As a result, the different designs perform similarly for x264. The largest system gain of ERing over DRing is 26% in blackscholes.
The gain comes from the relatively stable traffic between several pairs of nodes. Across all workloads, the performance of ERing and NRing is similar, and they achieve an average runtime speedup of 18% over DRing. Compared with synthetic traffic, application-level gains are lower. This is due to the lower network traffic in the PARSEC workloads, which require much less throughput than the saturation throughput of DRing.
Overheads: power and area
In this subsection, we compare the router overheads of the rings. Buffers consume much more power and area than other NoC modules. In the ERing router, only a small number of buffer units exist in the input arbiter and output arbiter, and we insert just one register unit in each link. Hence, the area of our design is extremely small. As shown in Fig. 10 , the area of ERing is about 1/5 of that of DRing. In terms of power consumption, owing to the large difference in buffers, ERing consumes 23% less than DRing. The leakage power reduction is 81%. NRing is also a bufferless ring, and because of its polarity mechanism it has fewer buffers than ERing. The area of NRing is 21% less than that of ERing, and its power is 5% lower, which is the cost of the higher performance of ERing. For the whole ring, however, the polarity mechanism of NRing requires more wires and logic, which adds overhead overall.
Conclusion
In this paper, we propose a low-cost, low-complexity ring NoC design, Express Ring (ERing). ERing is based on a bufferless router microarchitecture, and it leverages multi-layer physical links and a global flow control mechanism. ERing avoids both network-level deadlock and protocol-level deadlock caused by dependencies. The bufferless router reduces hardware overheads, including area and power. In addition, the global flow control mechanism controls the packet transfer process. We evaluate both the synthetic-traffic and application-level performance of ERing and compare it with that of DRing and NRing. We also evaluate their hardware overheads. The results show that bufferless rings achieve about 58% performance improvement over conventional rings. Compared with the conventional dateline router, the proposed ERing router microarchitecture reduces area by 78% and power consumption by 23%. Compared with NRing, ERing achieves a 12.3% performance gain.
