Abstract-As technology advances, the number of cores in Chip MultiProcessor systems (CMPs) and MultiProcessor Systems-on-Chip (MPSoCs) keeps increasing. Current test chips and products reach tens of cores, and they are expected to reach hundreds of cores in the near future. Such complexity demands an efficient network-on-chip (NoC). The common choice for building such networks is the 2D mesh topology (as it matches the regular tile-based design) combined with the Dimension-Order Routing (DOR) algorithm (because of its simplicity). The network in such systems must provide sustained throughput and ultra-low latencies. One of the key components in the network is the router, and thus it plays a major role when designing for such performance levels.
I. INTRODUCTION AND MOTIVATION
Current Chip MultiProcessors (CMPs) and high-end MultiProcessor Systems-on-Chip (MPSoCs) are growing in their number of components. As technology advances, more and more transistors can be included in the same die. Therefore rather than building complex processors, the trend is to replicate and include many simpler ones. This is driven by the power consumption concern, as smaller devices are more power-efficient than complex ones.
This ever-growing number of devices demands an efficient interconnect structure inside the chip. Initial implementations relied on buses or crossbars. However, both lack scalability in terms of network bandwidth and implementation cost, and thus become either a bottleneck or an unfeasible implementation option as the number of devices to interconnect increases. As a solution, the network-on-chip concept arose. The idea is quite simple: a point-to-point network is implemented inside the chip to connect all the devices. Current research prototypes by Intel deploy a 2D mesh topology built from routers and links. Such a network is used to interconnect 80 simple cores (with two FPUs each) in the Polaris chip [10], or 24 dual-core tiles, each one being x86 compatible and able to run an operating system, in the case of the Single-chip Cloud Computer prototype [25]. Moreover, Tilera recently announced a 100-core chip that also includes a 2D mesh [26].
In this scenario, and with the expectation of reaching hundreds of cores in the near future, an efficient implementation of the NoC becomes a challenge. The NoC is built from basic components such as routers (also referred to as switches), links, and network interfaces. Routers and end nodes are connected through links, thus forming the topology and final network structure. Although the network interface must be carefully designed in order not to introduce bottlenecks, the complexity is usually moved to the router design. Indeed, many previous works have focused on different router architectures. In CMPs the current trend is to design pipelined router architectures in order to increase clock frequency, and to use wormhole switching due to its low buffer requirements. Moreover, it is common to use the Stop&Go flow control protocol.
The basic (starting point) pipelined wormhole router design is made of four stages: input buffer (IB), route computation (RC), switch allocator (SA) and router traversal (RT). The IB stage is used to store an incoming flit from an input port into a queue. The RC stage is used to compute the output port the message has to take. In a 2D mesh topology, this is usually achieved by using a small logic block implementing the DOR routing algorithm. Once the output port is computed, in the next cycle (the SA stage) the flit contends for the requested output port with all the other flits requesting the same output port. Finally, on success, the flit crosses the internal crossbar of the router, thus reaching the output port. This is done in the RT stage. If the router implements virtual channels, then an additional stage, named virtual channel allocation (VA), may be required to arbitrate for the output virtual channel amongst the virtual channels of an input port.
Several techniques have been proposed to reduce the depth of the router pipeline. One method is to speculatively contend for the virtual channel (VA stage), thus performing this task in parallel with the SA stage. Another possibility is to speculatively forward the flit through the internal crossbar at the same time the VA and SA stages are performed, thus saving two cycles in total [8]. Finally, the RC stage can be hidden if the previous router computes the output port to be used at the next switch, in parallel with its own VA, SA and RT stages, thus ending up with two stages.
The described efforts to reduce the router pipeline show how critical router delay is to overall network performance. Indeed, applications are especially sensitive to latency [2]. Thus, although achieving high throughput in a NoC is important, it is much more challenging to achieve ultra-low latencies. As NoC bandwidth is not limited by pin count (as is usually the case for off-chip networks), large bandwidths can be achieved relatively easily by increasing link width. However, in order to get low latency, an important design effort must be made for the router itself.
In this paper we aim to reduce router latency. However, instead of reducing the number of cycles of the basic pipelined router design, which is orthogonal to our proposal, we identify the most time-consuming operations along the critical path of the router and propose a technique to reduce this overhead.
In particular, Table I shows the delay of each stage for the basic 4-stage pipelined router previously described. We target a 2D mesh network connecting one end node per router. Thus, a 5-port router is assumed. As can be observed, the slowest stage is the SA stage. Its delay prevents the router from operating at frequencies higher than 1.33 GHz, thus setting the lower limit of the router delay (0.75 ns). Our proposal is to reduce the complexity of the SA stage in the router. For that purpose, we reduce the complexity of the arbiters that make up the SA stage. This is achieved by taking into account the routing algorithm implemented in the network (DOR routing). Notice that, with DOR, packets arriving at a given input port will never request some of the output ports. In particular, flits coming from X input ports may request any output port (including the local port). However, packets from Y input ports will only request Y output ports (or the local port). Based on this reasoning, the arbiter can be simplified for Y output ports. Arbiters for X output ports will also be simplified, as will be explained later. The net result is a reduced clock cycle, thus achieving lower latency.
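The port restrictions imposed by DOR can be checked with a small behavioural model. The following is a minimal sketch, not the router logic itself: the port names (E, W, N, S, L), the coordinate convention and the function names are assumptions made for this example, and minimal XY routing without U-turns is assumed. It walks every source/destination pair of a mesh and records which input ports ever request each output port.

```python
# Behavioural sketch of XY (dimension-order) routing on a 2D mesh.
# Port names and the coordinate convention are assumptions for this example.

PORTS = ["E", "W", "N", "S", "L"]  # East, West, North, South, Local


def xy_output_port(cur, dst):
    """Return the output port chosen by XY routing at router `cur`."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "L"  # destination reached: eject to the local core


def requesters_per_output(width, height):
    """For each output port, collect the input ports that ever request it."""
    opposite = {"E": "W", "W": "E", "N": "S", "S": "N"}
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    seen = {p: set() for p in PORTS}
    nodes = [(x, y) for x in range(width) for y in range(height)]
    for src in nodes:
        for dst in nodes:
            if src == dst:
                continue
            cur, in_port = src, "L"  # packets are injected at the local port
            while True:
                out = xy_output_port(cur, dst)
                seen[out].add(in_port)
                if out == "L":
                    break
                dx, dy = step[out]
                cur = (cur[0] + dx, cur[1] + dy)
                in_port = opposite[out]  # leaving east means entering from west
    return seen


if __name__ == "__main__":
    for port, inputs in requesters_per_output(4, 4).items():
        print(port, sorted(inputs))
```

Under these assumptions, each X output port is requested by only two input ports (the opposite X port and the local port), whereas each Y output port and the local output port are requested by four. Y input ports never appear among the requesters of an X output port.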
The remainder of the paper is organized as follows. In Section II we describe the related work. Then, in Section III we describe the basic router architecture we will use as the starting point for our proposals. In Section IV we describe the different router architectures proposed in the paper. Then, in Section V we evaluate the new router architectures.
II. RELATED WORK FOR CMP ROUTERS
We can find in the literature, and in real implementations and prototypes, two different router architectures. The first one is a single stage router. In this architecture, a flit crosses the entire router and reaches the input port of the next router in one cycle, typically. This is the usual case for MPSoC systems. On the other hand, we may find pipelined routers for high-end MPSoC and CMP systems. This is the case for the Intel Polaris chip [10] . In this paper we focus on pipelined router designs for CMP systems.
In recent years, many efforts have been made to reduce the latency of NoC routers for CMP systems. Initially, the knowledge and established techniques from the off-chip network domain (from high-performance interconnects) were applied to the emerging NoC field [19]. First, NoC routers were designed using well-known routing algorithms (e.g. DOR routing) and switching techniques (e.g. wormhole switching). Also, tied to wormhole switching, virtual channels were advocated in order to time-multiplex the physical channel and, therefore, reduce the blocking induced by wormhole switching. All these techniques made the router complex and slow. Different efforts have been made to reduce this complexity and latency. For example, in [18] different techniques are applied and a one-cycle router is achieved, and in [14] the output port is predicted rather than computed in order to minimize latency.
One of the main contributors to latency in a router is the arbiter algorithm that schedules how flits in the input buffers are dispatched to the output ports. Indeed, the design of a fast arbiter algorithm is key to achieving a high-performance, low-latency router. Several scheduling algorithms have been proposed, such as PIM [1], PPA [5], DRRM [6] and iSLIP [16]. These algorithms are iterative and approximate a maximum size matching by finding a maximal size matching. That is, the arbiter aims at maximum performance (taking performance as the number of flits that can traverse the router) without taking into account the delay introduced. However, as remarked in [23], these algorithms are slow and impractical for a high-speed router and, additionally, may cause unfairness. To overcome this issue, non-iterative algorithms have been developed [20], [23]. The main property of these schemes is their speed and simplicity, at the expense of some loss in performance (lower matching rates) when compared with the previous iterative arbiter schemes. However, this lower matching capability is not a burden in real traffic conditions, as shown in [17].
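The difference between a maximal and a maximum size matching can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: the single-pass greedy allocator is not PIM, PPA, DRRM or iSLIP, and the function names and the request pattern are invented for this example. A matching is maximal when no further input/output pair can be added to it, and maximum when no larger matching exists.

```python
def greedy_maximal_matching(requests):
    """requests[i] is the set of outputs requested by input i.
    Inputs are served in index order; each takes its lowest free output.
    The result is maximal: no further pair can be added to it."""
    taken, match = set(), {}
    for i, wanted in enumerate(requests):
        for out in sorted(wanted):
            if out not in taken:
                taken.add(out)
                match[i] = out
                break
    return match


def maximum_matching_size(requests):
    """Exhaustive search for the largest matching (tiny examples only)."""
    def best(i, taken):
        if i == len(requests):
            return 0
        result = best(i + 1, taken)  # option: leave input i unmatched
        for out in requests[i]:
            if out not in taken:
                result = max(result, 1 + best(i + 1, taken | {out}))
        return result
    return best(0, frozenset())


if __name__ == "__main__":
    # Input 0 requests outputs 0 and 1; input 1 requests only output 0.
    requests = [{0, 1}, {0}]
    print(len(greedy_maximal_matching(requests)))  # maximal matching found
    print(maximum_matching_size(requests))         # size of a maximum matching
```

In this example the greedy pass grants output 0 to input 0 and leaves input 1 unserved, while a maximum matching would serve both inputs. Iterative schedulers spend extra iterations to narrow this gap, which is where their additional delay comes from.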
In order to reduce the latency of the router, the complexity of the arbiter can be further minimized by making use of a logic synthesis principle [15] that performs an area-performance trade-off. When delay constraints are loose, area-efficient netlists can be achieved, but when tighter delay constraints are needed, high performance can be obtained at the cost of area. Then, by minimizing the complexity of an arbiter we can relax delay constraints, thus achieving higher performance. In fact, in [19], the authors remark on the need to reduce the complexity of the crossbar from previous works [7], arguing that smaller router modules achieve faster routers. Similarly, Gilabert et al. [9] propose a decoupled crossbar for each virtual channel rather than a shared crossbar for all the virtual channels. Decoupling resources in a router relaxes delay constraints, thus improving area and power consumption. In our case, we will relax area constraints in order to minimize latency.
Other interesting studies are those that discuss the possibility of replacing virtual channels by parallel physical ports. This idea has been inherited from the off-chip domain [21]. Carara et al. [3], [4] reuse this concept for NoCs. In particular, they take advantage of the abundance of wires in current and expected deep sub-micron technologies. Carara et al. based their work on the spatial division multiplexing (SDM) technique introduced by Leroy et al. [13] and the Lane Division Multiplexing (LDM) technique introduced by Wolkotte et al. [22]. SDM and LDM basically increase the number of wires between routers in order to assign different bandwidth resources to each channel, at the cost of increasing the critical path. Carara's contribution is to simplify SDM and LDM, obtaining a reduction in the critical path and, in turn, better performance by replicating channels rather than using virtual channels.
Our router is closely related to other low-latency designs such as the one presented in [12]. In that paper, a low-latency router that supports adaptivity is presented. Its main characteristic is that it exploits the properties of a 2D mesh in order to provide adaptivity. Exploiting these properties makes it possible to reduce the latency of the router, even though in [12] the router is modified to support adaptivity.
III. BASIC ROUTER ARCHITECTURE
In this section we describe the router design used throughout this paper. Figure 1 shows the main components of the router. The router is a pipelined input buffered wormhole router with five stages: IB, RC, SA, RT, and link traversal (LT). Notice that the fifth stage does not belong to the router itself. We have designed a simple router with five input and output ports.
Thus, four ports are intended to provide connectivity with the neighbouring routers in the 2D mesh and the fifth port connects to the local computing core. Link width is set to 8 bytes. Flit size is also set to 8 bytes. Input buffers can store four flits. A Stop&Go flow control protocol has been deployed in order to control the advance of flits between adjacent routers. Additionally, the routing stage has been implemented to support the XY routing algorithm, and there is an RC module for each input port. Note that, although the RC module has been designed to support XY routing, physically each input port can be connected to any output port, including itself. Similarly, there is a switch allocator (SA) module for each output port. The SA module determines which input port is connected to its associated output port. Finally, each SA module has been designed using a round-robin arbiter according to [20]. The router has been implemented using the open-source Nangate 45 nm technology library [24] with Synopsys DC. We have used the M1-M3 metallization layers to perform the Place&Route with Cadence Encounter.
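The behaviour of a round-robin arbiter, as used in each SA module, can be sketched as follows. This is a behavioural model written for illustration only: it does not reproduce the hardware implementation of [20], the class and method names are assumptions, and the wormhole detail that a grant is held until the tail flit of a packet has left is not modelled.

```python
class RoundRobinArbiter:
    """Behavioural model of an N-request round-robin arbiter."""

    def __init__(self, num_requests):
        self.n = num_requests
        # Start so that request line 0 has the highest priority first.
        self.last = num_requests - 1

    def grant(self, requests):
        """`requests` holds one boolean per input port.
        Returns the index of the granted input, or None if none requests."""
        if len(requests) != self.n:
            raise ValueError("expected %d request lines" % self.n)
        # Search starts right after the most recent winner.
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx  # the winner gets the lowest priority next
                return idx
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(5)  # one arbiter per output port
    pattern = [True, False, True, False, True]
    print([arbiter.grant(pattern) for _ in range(4)])
```

With the three active request lines of the usage example, grants rotate among inputs 0, 2 and 4, so no requester is starved. The number of request lines is the parameter that the rest of the paper reduces.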
Using the same architecture as the basic router presented above, we have designed an identical router but with ten input/output ports. Thus, eight ports are intended to provide connectivity with the four neighbouring routers in a 2D mesh (two ports per neighbouring router) and the last two ports connect the router to the local computing core. In both routers, all input ports can be connected to all output ports, thus full connectivity is implemented. Table II shows the area and latency of the basic router architectures presented in this section.
IV. A NEW LATENCY-EFFICIENT ROUTER ARCHITECTURE
In this section we present our router architecture intended to reduce message latency by reducing the critical path of the router pipeline. Therefore, by shortening the longest stage of the pipeline, clock frequency will be increased, thus reducing the overall router latency. We will use the router presented in the previous section as a case study.
TABLE II. ROUTER AREA AND LATENCY
As explained in Section III, the switch allocator stage is the bottleneck of a router with no virtual channels, due to its larger delay. Thus, by reducing the latency of the switch allocator, the latency of the whole router will be reduced because of a smaller clock cycle.
Reducing the latency of a switch allocator is not easy [11], [20], [23]. First, the complexity of the algorithm implemented by the arbiter needs to be reduced. The simplest arbiter algorithm is the round-robin policy. More complex arbitration techniques [11], [17] may be used, although they are discarded due to their higher complexity, which translates into higher arbitration latencies. Second, reducing the complexity of the simplest round-robin arbiter is not trivial. Actually, a fast and simple arbiter implementation is presented in [20]. It can be taken as an example of the fastest practical implementation because, although better implementations may be possible, the difference in latency would probably not be significant.
Assuming that no faster implementation than the one presented in [20] can be carried out, the only parameter that could be modified in order to reduce the latency of this arbiter is the number of concurrent requests the arbiter deals with. In order to assess how the arbiter delay varies with the number of requests, we have evaluated the area and latency of the switch allocator with different numbers of requests. To do so, we have synthesized four different arbiters with 2, 3, 5, and 10 requests. Table III shows the area and latency of the different switch allocators synthesized. As can be seen, the area and latency of the switch allocator increase with the number of requests. Additionally, the area required grows faster than the delay of the arbiter. On the other hand, notice that the delay value for the 5-request arbiter in Table III is lower than the delay of the switch allocator of the router presented in the previous section. The difference in latency is due to the extra control signals that make up the SA stage.
According to the delay numbers shown in Table III, we could conclude that implementing the SA stage by using arbiters with a small number of requests would reduce the latency of that stage. However, that would also reduce the connectivity among ports. For example, if 2-request arbiters are used, then an output port can only receive requests from two different ports. However, in a 2D mesh, each output port may receive requests from 4 other ports (3 ports connecting to other switches and one more port connecting to the local core). Therefore, if the SA stage is implemented with 2-request arbiters, connectivity among ports will not be complete unless some additional changes are introduced. We propose to replicate some of the ports, thus guaranteeing that any input port can be connected to one of the replicas of the required output port while still using 2-request arbiters. Figure 2 shows our proposal for interconnecting ports. Y ports have been replicated three times in order to provide connectivity from both X ports and from the local port. Other schemes are feasible. However, as this router assumes the use of the DOR routing algorithm, replicating the Y ports presents the additional advantage of providing more bandwidth in the Y dimension, which usually gets congested under this routing algorithm.
The connectivity scheme shown in Figure 2 guarantees that each output port arbiter receives requests from only 2 input ports, thus reducing the complexity of the switch allocator from a five-to-one configuration down to a two-to-one configuration. However, the number of ports in the router has been increased from 5 to 9. Therefore, the complexity of the crossbar may increase considerably. Nevertheless, an efficient crossbar implementation similar to the one presented in [9] can be deployed if each output port has its own independent crossbar. In this way, our router proposal will include a 2x1 crossbar at each output port instead of a single 5x5 crossbar as in the router of Section III. Note that even if these smaller crossbars were used in the basic router architecture, this would only save area, while no delay reduction would be achieved, as the bottleneck stage is the switch allocator. From now on, the router proposal shown in Figure 2 will be referred to as proposal 1.
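The replication idea can be made concrete with a small sketch. The wiring below is a guess that is merely consistent with the description in the text (X ports not replicated, Y ports replicated three times, two requesters per output arbiter, nine ports in total); it is not taken from Figure 2. Replica names such as N0 or S2 are hypothetical, and the local output port is left out because the text does not state how it is arbitrated.

```python
# Hypothetical wiring in the spirit of proposal 1 (NOT the actual Figure 2).
# Each key is an output port; the value lists the input ports whose requests
# reach the 2-request arbiter of that output port.

def proposal1_like_wiring():
    wiring = {
        "E": ["W", "L"],  # under XY routing, X outputs see two requesters
        "W": ["E", "L"],
    }
    # Each Y replica pairs the straight-through Y input of the same replica
    # with exactly one of the inputs that may turn into the Y dimension.
    turning_inputs = ["E", "W", "L"]
    for k, turn_in in enumerate(turning_inputs):
        wiring[f"N{k}"] = [f"S{k}", turn_in]  # flits travelling north
        wiring[f"S{k}"] = [f"N{k}", turn_in]  # flits travelling south
    return wiring


if __name__ == "__main__":
    wiring = proposal1_like_wiring()
    for out_port, inputs in wiring.items():
        print(out_port, "<-", inputs)
    # Eight outputs towards neighbouring routers plus one local port.
    print("ports:", len(wiring) + 1)
```

In this sketch every listed output port is fed by a two-input arbiter and a 2x1 crossbar, which is the property the paragraph above relies on. Other assignments of turning inputs to replicas would satisfy the same constraint.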
Our router architecture can be extended if X ports are also replicated by using 3-request arbiters instead of 2-request arbiters. In this way, the delay of the switch allocator stage will be slightly increased, but the bandwidth for connecting to other routers will be noticeably improved. Figure 3 shows the connectivity among ports for this option. Note that now we can include two local ports in the router in order to match the number of input ports with the total number of requests the arbiters can deal with. From now on, the router proposal shown in Figure 3 will be referred to as proposal 2. It is worth mentioning that our proposal noticeably increases the number of ports of the router. However, as this router is intended to be used inside a chip, where interconnection bandwidth is not a major concern, the only disadvantage of an increased number of ports is the additional buffering required. Nevertheless, the philosophy of our router is to trade transistors for latency, which is the real concern in current chips.
Finally, note that our proposal is independent of the arbiter scheme selected. That is, other arbiters can be used for deploying our proposal. In this way, if better arbiter implementations arise, they could be incorporated into our architecture. Furthermore, our proposal is independent of and orthogonal to other router architectures, including one-cycle routers, since the SA functionality is reduced. Moreover, other techniques such as look-ahead routing or predictive routing are also compatible with our proposal.
V. EVALUATION OF THE ARCHITECTURE PROPOSALS
In this section we analyze the benefits of the proposed router architecture over the basic ones presented in Section III. To do so, we present the area and latency metrics as well as a thorough performance comparison of the different characteristics introduced by the new architecture.
A. Area and latency
Table IV shows the area and latency of the routers proposed in Section IV. Remember that our proposals have nine and twelve ports rather than the five or ten ports featured in the basic architectures of Section III. Obviously, as the number of ports increases (and hence the number of input buffers) the area of the proposed routers increases as well. However, note that both proposals have a shorter critical path than the basic router architectures presented in Section III. This means that both proposals have a higher operating frequency. If we compare proposal 1 with a 5-port basic router (as both of them have the same number of links in the X direction), proposal 1 achieves a latency reduction of 10.67% with an area increase of 59.99%. Comparing proposal 2 with a 10-port basic router, we obtain a latency reduction of 20.22% with a small area increase of 2.88%. The large difference in area increase between both cases is due to the fact that, when comparing proposal 2 with a 10-port basic router, only two extra ports are added, while in the first case the number of ports is almost doubled. Additionally, decoupling resources allows the synthesis tools to relax their constraints, so the area increase caused by the larger number of input buffers can be partially offset by a smaller area for the SA and RT stages. This behaviour is more evident when comparing proposal 2 with a 10-port basic router. Nevertheless, remember that the purpose of our architecture is to reduce latency at the expense of increasing area, if required. Also, note that the critical path of the proposed routers does not decrease as much as the reduction of the arbiter critical path would suggest. This is due to the fact that in our proposed routers the critical path is now set by the IB stage instead of the SA stage.
However, the critical path of the proposed routers is longer than the critical path of the IB stage of a basic 5-port router, shown in Table I. This is due to the higher number of ports and the control signals that make up the IB stage in the proposed routers.
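The relation between critical path, operating frequency and relative latency reduction can be written down in a few lines. In the sketch below, the 0.75 ns value is the SA-stage delay quoted in the Introduction. The 0.67 ns value is a hypothetical cycle time, chosen only because it reproduces a reduction of about 10.67% from that baseline; the actual entries of Table IV are not reproduced here. Function names are assumptions made for this example.

```python
def frequency_ghz(critical_path_ns):
    """Maximum clock frequency allowed by a critical path given in ns."""
    return 1.0 / critical_path_ns


def relative_reduction(baseline, proposed):
    """Fractional reduction of `proposed` with respect to `baseline`."""
    return (baseline - proposed) / baseline


if __name__ == "__main__":
    baseline_ns = 0.75  # SA-stage delay quoted in the Introduction
    proposed_ns = 0.67  # hypothetical value, for illustration only
    print("baseline: %.2f GHz" % frequency_ghz(baseline_ns))
    print("proposed: %.2f GHz" % frequency_ghz(proposed_ns))
    print("reduction: %.2f%%" %
          (100 * relative_reduction(baseline_ns, proposed_ns)))
```

The same `relative_reduction` helper applies to area figures, where a negative result indicates an increase.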
B. Effects of the partial connectivity on performance
In this section we analyze the performance penalty suffered when reducing the connectivity among the ports of a router.
For that purpose, we analyze two routers: the router described in proposal 1, and the same router but allowing any input port to reach any output port, that is, with full router connectivity. Figure 4(a) and Figure 4(b) show the latency and throughput for a 4 × 4 network using both routers. Uniform traffic is used, and message size is set to four flits. In both cases, routers are modelled with the same delay (one cycle per stage). As can be seen, there is a small loss in performance when full connectivity is not allowed. Notice, however, that this loss is small and that it will be compensated when both routers are compared at their real operating frequencies, as we will analyze later.
C. Effects of the increased number of ports
In this section we analyze variations in network performance when moving from the 5-port basic router to the 9-port proposed architecture (proposal 1). The purpose of these experiments is to analyze the effect of a larger number of ports. Network size is 16 routers. These results do not take into account the operating frequency of each router. Figure 5 shows the latency and the throughput for different injected uniform traffic rates. All the messages injected into the network are four flits long. Figure 5 shows that the performance of the ad-hoc router is better than that of the basic router. Our proposal obtains a higher throughput and a lower latency over the whole traffic range. This is due to the fact that by increasing the number of ports we reduce the contention in each router and increase the bandwidth in the Y direction. Figure 6 shows the latency and throughput of the same routers when the traffic injected into the network is made of 70% messages of 4 flits and 30% messages of 20 flits. Note that introducing longer messages makes the network performance worse (higher latency and lower throughput). Furthermore, the differences between routers are reduced. However, despite the smaller difference in performance, the same conclusions as before can be drawn. From now on, all results presented are obtained with injected traffic made of messages four flits long. Figure 7 shows the latency and the throughput for different injected uniform traffic rates for a 10-port basic router and our router proposal with twelve input/output ports (proposal 2). As before, the performance of the proposed router is better than that of the basic router, for the same reasons explained above.
D. Effects of the increased router frequency
The previous evaluations focused on the impact of the connectivity pattern provided by the different router architectures. Indeed, all the routers were modelled at the same clock frequency. In this section we provide the real difference when routers are compared with each one running at its maximum operating frequency. Figure 8 shows the latency in nanoseconds for different injected traffic rates for a 4 × 4 mesh network.
Note that the differences in performance described before are now amplified, as the proposed routers work at higher operating frequencies. The differences are remarkable for both low and high injected traffic rates. For low injected traffic rates, the reduction in network latency is 11.37% and 15.41%, as can be seen in Figure 8(a) and Figure 8(b), respectively.
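The conversion from cycle counts to nanoseconds can be sketched as follows. This is a rough zero-load approximation and not the simulation model behind Figure 8: it assumes that the head flit spends one cycle in each of the five stages (IB, RC, SA, RT, LT) at every hop and that the remaining flits follow at one flit per cycle, with no contention. The hop count and the 0.67 ns clock period are hypothetical values chosen for illustration, while 0.75 ns is the baseline delay quoted in the Introduction.

```python
def zero_load_latency_cycles(hops, flits, stages_per_hop=5):
    """Approximate contention-free latency of a wormhole packet in cycles:
    the head flit pays the full pipeline at each hop, and the other flits
    arrive one per cycle behind it."""
    return hops * stages_per_hop + (flits - 1)


def cycles_to_ns(cycles, clock_period_ns):
    """Convert a cycle count into nanoseconds for a given clock period."""
    return cycles * clock_period_ns


if __name__ == "__main__":
    cycles = zero_load_latency_cycles(hops=3, flits=4)
    for name, period_ns in [("basic", 0.75), ("hypothetical faster", 0.67)]:
        print("%s router: %d cycles = %.2f ns"
              % (name, cycles, cycles_to_ns(cycles, period_ns)))
```

When two routers need the same number of cycles, their latency ratio in nanoseconds equals the ratio of their clock periods, which is why a shorter critical path translates directly into lower latency at low load.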
VI. CONCLUSIONS AND FUTURE WORK
In this paper we have proposed a new router design aimed at reducing the latency of the network. Starting from the assumption of a 2D mesh topology and the use of the Dimension-Order Routing (DOR) algorithm, we have redesigned a pipelined router. In particular, the arbiter complexity has been reduced. Multiple smaller arbiters are used in parallel, thus exhibiting a lower latency. In order to build a fully operational router with such smaller arbiters, new ports have been added to the router, together with different internal connections. Results show a clear reduction in router latency. The new architecture, with arbiters of two or three requests, is able to reduce the delay of a canonical router design by 10% to 21%. This router uses additional ports and thus increases the router area by up to 59.99%. Network latency is reduced by 11% to 15%.
As future work we plan to analyze the impact of the new router design when using virtual channels. Furthermore, the impact of combining the new router design with other techniques such as look-ahead routing should be studied. Also, further evaluations with real applications will be carried out. It is expected that applications will benefit from a faster router.
