Abstract: Network-on-Chip (NoC) is an attractive interconnect solution for future systems on chip (SoCs). Network performance depends critically on how efficiently packets are routed. However, as the network becomes more congested, packets are blocked more frequently, which degrades network performance. In this article, we propose a dual-switch allocation (DSA) design. By introducing two switch allocations, the router makes the fullest possible use of idle output ports. Experimental results show that our design significantly improves throughput and latency at the cost of very little power overhead.
I. INTRODUCTION
As semiconductor technology continues to develop, hundreds of cores will be deployed on a single die in future chip-multiprocessor (CMP) designs. To overcome the problems of bus-based systems, such as increasing power consumption and limited bandwidth and scalability, the packet-switched Network-on-Chip (NoC) has become an attractive solution that can provide low latency, high throughput and low power [1]. Besides the routing algorithm, the router architecture can significantly affect network performance. However, researchers continue to face one major challenge: as the whole network becomes more congested, packets are blocked more frequently. As a result, network performance degrades, and the question of how to transfer more packets through a router must be considered. Power dissipation is another important factor that designers should take into account. Input buffers alone can consume almost 46% of the total power of the interconnection network [2]. Therefore, even though simply increasing the size of the input buffers allows more packets to be transmitted and buffered, power increases with the number of buffers.
In this paper, we propose a high-performance, power-modest dual-switch allocation design. To allow more packets to be transmitted and buffered, the design combines a primary switch allocation and a secondary switch allocation with no additional buffers, because buffer power dominates the power of the network. At low traffic load, almost all packets obtain their desired output port through the primary switch allocation. Whenever there is a conflict, a packet that fails in the primary switch allocation is assigned by the secondary switch allocation to another suitable idle output port. The dual-switch allocation design thus lets blocked packets pass through the router via idle output ports as far as possible, achieving high throughput and low latency. The power overhead is very small, as our design adds no buffers or links.
This paper is organized as follows. Section II gives a brief overview of related work. Section III presents the proposed dual-switch allocation design. Section IV describes the experimental setup and results.
II. RELATED WORK
Various techniques have been proposed to allow more packets to be transmitted and buffered and thereby balance congestion in the interconnection network. J. Suseela and V. Muthukumar proposed a loopback virtual channel mechanism to improve router performance [3]. This design reduces latency by adding a virtual channel, but at the cost of increased power consumption and complexity. Another approach, the Network Processor Array (NePA) [4], divides the router into two subnets and uses additional input ports, including buffers and links, for the north and south output ports. NePA can thus separate and transmit the packets that are headed north or south. However, its power consumption is larger because of the additional buffers and links.
Other techniques have also been proposed in recent years [5, 6, 7]. However, some of them rely on additional buffers to improve network performance, so power consumption remains a challenging problem. To address this problem, we propose a dual-switch allocation design that makes full use of idle output ports to improve latency and throughput while keeping the power overhead small.
III. DESIGN OF DSA ROUTER
A. Architecture
In this section, we introduce the architecture of the dual-switch allocation (DSA) design. Our design applies to both conventional and virtual-channel routers. To simplify the discussion, we omit virtual channels in the rest of the paper. Fig. 1(a) shows a baseline router with five input ports and five output ports. Incoming flits are first buffered, and then the desired output port is calculated by routing computation. Switch allocation assigns flits to their desired outputs. If the desired direction is occupied by another flit, the flit waits in the buffer until that direction becomes available. Finally, the crossbar is controlled by the switch allocation so that input ports are correctly connected to output ports.
Fig. 1(b) shows the router architecture of the proposed DSA. There are two allocations within each router. To guarantee fair assignment, each allocation uses the round-robin method [1]. Our design uses a lookahead technique, which calculates the desired route direction for the next router rather than for the current router [1, 8]. First, all flits are buffered. Then the flits in each buffer are assigned to their desired output ports by the primary switch allocation (PSA). Flits that fail allocation in PSA go on to the secondary switch allocation (SSA), which assigns their direction according to the lookahead information calculated in the current router.
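The round-robin method used by each allocation can be illustrated with a minimal Python sketch. It assumes one arbiter per output port and a rotating priority pointer; the class name `RoundRobinArbiter` and its interface are hypothetical and are not taken from the router implementation described here.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Illustrative arbiter: grants one requesting input per call."""

    def __init__(self, num_inputs: int) -> None:
        self.num_inputs = num_inputs
        self.pointer = 0  # input that currently has the highest priority

    def grant(self, requests: Sequence[bool]) -> Optional[int]:
        # Scan the inputs starting at the pointer, wrapping around.
        for offset in range(self.num_inputs):
            candidate = (self.pointer + offset) % self.num_inputs
            if requests[candidate]:
                # The winner gets the lowest priority in the next round.
                self.pointer = (candidate + 1) % self.num_inputs
                return candidate
        return None  # no input requested this output port


# Inputs 0 and 2 keep requesting the same output port.
arbiter = RoundRobinArbiter(num_inputs=5)
grants = [arbiter.grant([True, False, True, False, False]) for _ in range(4)]
print(grants)  # [0, 2, 0, 2]
```

Because the pointer moves past each winner, two inputs that contend for the same output are served alternately, which is the fairness property the round-robin method is meant to provide.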
B. Router Implementation
Fig. 2 shows the pipeline stages of the different router designs. Fig. 2(a) shows the pipeline of the baseline design, whose stages are buffer writing (BW), routing computation (RC), switch allocation (SA) and switch traversal (ST) [1]. Incoming flits are stored in the input buffers in the BW stage, and routing computation is then executed in the RC stage for head flits. After that, SA assigns the desired output. Finally, in the ST stage, the crossbar connects the input port to the output port according to the SA result. In the lookahead (LA) pipeline shown in Fig. 2(b), RC for the current router is done at the preceding router, so flits can enter the SA stage as soon as they are buffered in the input ports. No separate RC stage is therefore needed before a packet is transmitted to a neighboring router. This reduces the pipeline from four stages to three, which improves router performance and reduces power consumption. Fig. 2(c) shows the pipeline of the proposed DSA architecture. Using the lookahead technique, all incoming head flits buffered in the input ports undergo switch allocation by PSA immediately. Flits that fail in PSA then undergo switch allocation by SSA according to the routing information calculated by RC at the current router.
We adopt the router component delay model described by Peh and Dally [9] to estimate the delay of our pipeline stages. Table I shows the delays of the SA and ST pipeline stages for the baseline, LA and proposed routers. The stage with the longest critical path delay sets the clock frequency. Since a very simple deterministic routing algorithm is implemented in each design, the delays of the BW and RC stages are small and less than that of the ST stage. Table I therefore shows that the ST stage has the longest critical path delay. The switch arbitration delay is larger in DSA than in the baseline and LA designs, because DSA performs two switch allocations (PSA and SSA). However, any clock cycle that accommodates the ST stage also accommodates the increased SA delay of the DSA design. As a result, the penalty of the additional SA is hidden in the router pipeline, and the router delay of DSA is no worse than that of LA. Furthermore, packet contention delay can be reduced by our proposed method, described below.
Fig. 3 illustrates the DSA method. We define OLD routing information (OLD-RI) as the routing information calculated in the preceding router and carried by incoming head flits. NEW routing information (NEW-RI) is calculated by the current router and determines which output port is desired in the next router. In Fig. 3, the desired east direction is the OLD-RI, and north, shown in brackets, is the NEW-RI calculated by Router 1. If the desired east output port is occupied by another packet, and we know from the NEW-RI that this packet should be sent north at the next router (Router 2), then to resolve the blocking the packet is transmitted north at Router 1 and then east at Router 3. If a packet is assigned to its desired direction by PSA, it carries the NEW-RI (north) when leaving the current router.
If a packet is instead assigned by SSA, it carries the OLD-RI (east) when leaving the current router. As a result, network performance improves, because idle output ports are used to the fullest and more packets are transmitted and buffered.
Fig. 4 shows the flow chart of the proposed DSA. First, the head flits, which carry OLD-RI, are buffered in the input buffers. Then PSA assigns each flit to its desired output port according to OLD-RI. While PSA performs the assignment, the routing computation unit calculates the NEW-RI. If switch allocation is completed by PSA, the crossbar connects the input ports to the output ports and sends the flits to the next router. In this case, the packet carries the NEW-RI when it leaves the current router. At low traffic loads, each head flit can generally be assigned to its desired output port, as there is no contention between flits. However, as the traffic load increases, a preceding packet may block a succeeding packet that desires the same output port. Some head flits in the input buffers will therefore fail assignment in PSA. These failed head flits then undergo switch allocation in SSA according to the NEW-RI. If the NEW-RI direction is available for a head flit, SSA assigns the packet to that direction. In other words, although the direction desired at the current router is unavailable, the router knows the direction in which the packet should be sent at the downstream router. We can therefore first send the packet in the direction desired at the next router, if it is available, and afterwards send it in the direction desired at the current router. Similarly, if switch allocation is completed by SSA, the crossbar connects the input port to the output port and transmits the packet to the next router, carrying the OLD-RI.
Otherwise, if SSA also fails, the head flit remains in the input buffer to wait for the next PSA, and the process repeats. We use round-robin arbitration in both PSA and SSA to guarantee fairness. As a result, idle output ports are fully used, and network performance improves because contention between flits is reduced. Meanwhile, the power overhead is very small, as only the additional SA consumes a little power (details are discussed in Section IV).
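The rule for which routing information a head flit carries can be checked with a small Python sketch of the Fig. 3 example. The grid coordinates, the direction vectors in `STEP` and the helper names `move` and `leave_router` are assumptions made for illustration only. The sketch shows that the PSA path (east, then north) and the SSA path (north, then east) reach the same router after two hops, so the detour does not lengthen the route.

```python
from typing import Tuple

Coord = Tuple[int, int]

# Assumed coordinate convention: x grows to the east, y grows to the north.
STEP = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}


def move(pos: Coord, direction: str) -> Coord:
    dx, dy = STEP[direction]
    return (pos[0] + dx, pos[1] + dy)


def leave_router(pos: Coord, old_ri: str, new_ri: str,
                 allocator: str) -> Tuple[Coord, str]:
    """Return (next router, routing information carried by the head flit)."""
    if allocator == "PSA":
        return move(pos, old_ri), new_ri  # desired port granted: carry NEW-RI
    if allocator == "SSA":
        return move(pos, new_ri), old_ri  # detour taken: carry OLD-RI
    raise ValueError("allocator must be 'PSA' or 'SSA'")


router1 = (0, 0)  # hypothetical position of Router 1

# PSA succeeds: go east first, then follow the carried NEW-RI (north).
router2, carried_psa = leave_router(router1, "E", "N", "PSA")
end_via_psa = move(router2, carried_psa)

# PSA fails, SSA succeeds: go north first, then follow the carried OLD-RI (east).
router3, carried_ssa = leave_router(router1, "E", "N", "SSA")
end_via_ssa = move(router3, carried_ssa)

print(end_via_psa, end_via_ssa)  # (1, 1) (1, 1)
```

At the downstream router, the carried value plays the role of OLD-RI, which is why a flit forwarded by SSA requests the originally blocked direction one hop later.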
The timing of an incoming flit is shown in Fig. 5. First, incoming flits arrive at the input ports and are stored in the input buffers. Second, PSA performs switch allocation according to OLD-RI. Subsequently, SSA performs switch allocation according to NEW-RI for the flits that failed in PSA. At the same time, the routing computation unit calculates the NEW-RI for the head flit. The crossbar then connects the input port to the output port and transmits the flit to the downstream router according to the SA results. The RC stage and the DSA stage (consisting of PSA and SSA) are executed in parallel.
The DSA method must handle several different situations, and Fig. 6 shows how flits are assigned in each of them. We use a minimal routing algorithm [1], so we do not consider a router sending a packet backward. If the routing direction of a packet is local, we always assign it by PSA. In other words, if a packet that desires the local output port fails in PSA, SSA is not used, and the packet waits for the next PSA. In Fig. 6, the letter before the brackets is the OLD-RI carried by the incoming head flit, and the letter in brackets is the NEW-RI calculated by the current router.
Fig. 6(a) shows the situation with no conflict. All flits can be assigned to their desired output ports by PSA, so they carry the NEW-RI when they leave the current router. For instance, the flit in the north input port is assigned to the south output port (S), and it carries NEW-RI (E) when it leaves the current router. In this case, SSA is not used.
Fig. 6(b) shows the situation in which two flits desire the same direction. The flits in the south and west input ports both want to be sent east. First, all flits in each buffer undergo allocation by PSA. Under round-robin, if the east output port is assigned to the west input port (so that it is unavailable to the flit in the south input), the flit in the south input fails in PSA and goes on to SSA with its NEW-RI (N). In this case, the NEW-RI direction (N) of the flit in the south input is available, so the router first assigns this flit to the north output port by SSA, and in the downstream router the flit will desire the east direction.
Fig. 6(c) shows the situation in which more than two flits desire the same direction. The OLD-RI of the flits in the north, south and west inputs is east, and the east output port is assigned to the flit in the north input by PSA. The two failed flits use SSA to complete assignment. Here the NEW-RI directions of the flits in the south and west inputs (N and S, respectively) are available, so the current router transmits the flit in the south input north and the flit in the west input south. In the downstream router, they will then desire the east output port.
Fig. 6(d) shows the situation in which some flits fail in both PSA and SSA. The flits in the north and west inputs desire the east direction, and the flits in the east and south inputs desire the west direction. The north input port is assigned to the east output port, and the east input is assigned to the west output port, so the flits in the south and west inputs fail in PSA. In SSA, the NEW-RI of both flits is north. Under round-robin, if the north output port is assigned to the west input port, the flit in the south input also fails in SSA and waits in the input buffer for the next PSA.
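The cases of Fig. 6 can be replayed with a minimal Python sketch of the two allocation passes. This is an illustration under stated assumptions, not the router logic itself: the function names are hypothetical, the round-robin state is modeled as an explicit `priority` order so that the outcomes named in the text can be reproduced, and NEW-RI values that the text does not state are filled with placeholder values marked in the comments. Excluding flits whose NEW-RI is local from SSA is also an assumption of the sketch.

```python
from typing import Dict, List, Tuple

PORTS = ["N", "E", "S", "W", "L"]  # L is the local port

Flits = Dict[str, Tuple[str, str]]        # input port -> (OLD-RI, NEW-RI)
Grants = Dict[str, Tuple[str, str, str]]  # input -> (output, allocator, carried RI)


def pick(requesters: List[str], priority: List[str]) -> str:
    """Stand-in for a round-robin arbiter: earliest input in `priority` wins."""
    return min(requesters, key=priority.index)


def dual_switch_allocate(flits: Flits, priority: List[str]) -> Grants:
    grants: Grants = {}
    busy = set()

    # PSA: every head flit requests the output port named by its OLD-RI.
    for out in PORTS:
        requesters = [i for i, (old, _) in flits.items() if old == out]
        if requesters:
            winner = pick(requesters, priority)
            grants[winner] = (out, "PSA", flits[winner][1])  # leaves with NEW-RI
            busy.add(out)

    # SSA: PSA losers request the port named by NEW-RI if it is still idle.
    # Flits whose OLD-RI is local never use SSA (as stated in the text);
    # skipping NEW-RI == "L" as well is an assumption of this sketch.
    losers = [i for i in flits if i not in grants and "L" not in flits[i]]
    for out in PORTS:
        if out in busy:
            continue
        requesters = [i for i in losers if flits[i][1] == out]
        if requesters:
            winner = pick(requesters, priority)
            grants[winner] = (out, "SSA", flits[winner][0])  # leaves with OLD-RI
            busy.add(out)

    return grants  # inputs missing from the result wait for the next PSA


# Fig. 6(b): south and west inputs both want east; west has priority.
# The NEW-RI of the west flit is not given in the text ("E" is a placeholder).
case_b = dual_switch_allocate({"S": ("E", "N"), "W": ("E", "E")},
                              priority=["W", "S", "N", "E", "L"])

# Fig. 6(d): the NEW-RI values of the north and east flits are placeholders.
case_d = dual_switch_allocate(
    {"N": ("E", "E"), "W": ("E", "N"), "E": ("W", "W"), "S": ("W", "N")},
    priority=["N", "E", "W", "S", "L"])

print(case_b["S"])    # ('N', 'SSA', 'E')
print(case_d["W"])    # ('N', 'SSA', 'E')
print("S" in case_d)  # False: the south flit waits for the next PSA
```

A real router would keep a separate rotating pointer per arbiter instead of a fixed priority list. The list is used here only to pin down the "if the port is assigned to ..." conditions in the figure descriptions.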
D. Deadlock and Fault Tolerance
To avoid deadlock, we use the deadlock recovery scheme DISHA [11]. An extra flit buffer at each router stores the head flit of one deadlocked packet. A counter T keeps track of the number of clock cycles and is incremented whenever the head flit cannot be sent out. Whenever T exceeds the threshold T_th, recovery is executed.
The DSA design also enables hardware-level fault tolerance [12, 13]. If the link between two routers is faulty, PSA fails for the OLD-RI direction. However, the router can still transmit the packet by SSA according to NEW-RI, so the packet can proceed to its destination.
Our approach also works with virtual-channel designs [1, 14]. In the same way, if all virtual channels in the desired input port of the downstream router are unavailable, the blocked packet can use SSA to move in the direction desired at the next router according to the NEW-RI.
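The timeout counter described at the start of this subsection can be sketched in Python as follows. The class name `DeadlockTimer`, the threshold value and the reset of T when a flit is sent are assumptions for illustration. The text states only that T increases while the head flit cannot be sent and that recovery starts once T exceeds T_th. The DISHA recovery procedure itself is not modeled.

```python
class DeadlockTimer:
    """Counts the cycles in which a head flit could not be sent."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # T_th
        self.blocked_cycles = 0     # T

    def tick(self, flit_sent: bool) -> bool:
        """Advance one clock cycle; return True when recovery should start."""
        if flit_sent:
            self.blocked_cycles = 0  # assumption: progress resets the counter
            return False
        self.blocked_cycles += 1
        return self.blocked_cycles > self.threshold


timer = DeadlockTimer(threshold=3)  # hypothetical T_th
signals = [timer.tick(flit_sent=False) for _ in range(4)]
print(signals)  # [False, False, False, True]
```

With T_th = 3, the fourth consecutive blocked cycle is the first one in which T exceeds the threshold, so recovery is signaled then.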

IV. PERFORMANCE EVALUATION
In this section, we run several experiments to evaluate the DSA design in terms of area, power consumption and network performance. We compare our design with a basic pipeline router, a lookahead pipeline router and NePA [4], which uses the lookahead method. Dimension-order routing (DOR) [1] is deployed in the baseline, lookahead and NePA designs. All designs are implemented in an 8×8 2D mesh topology.
A. Network Performance under Synthetic Traffic
We evaluate network performance using the open-source simulator Noxim [15]. In our experiments, we first run a series of simulations in which the injection rate is varied from 0.01 to 0.90 under a random traffic pattern in an 8×8 2D mesh. DOR is deployed in the different designs. To keep the comparison fair, we set the total buffer size to be the same for all designs. Each buffer of NePA holds 3 flits, and each buffer of DSA holds 4 flits. Each input buffer of the baseline and LA designs holds 4 flits, except the local input buffer, which holds 5 flits. The total buffer sizes of the different designs are therefore the same. In this experiment, we compare DSA with the other designs in terms of latency, throughput and energy consumption. Fig. 7(a) shows the latency of the different designs. DSA is significantly better than the others. As the traffic load increases, DSA achieves average improvements of 38.8%, 29.6% and 26.5% over the baseline, LA and NePA designs, respectively. Fig. 7(b) shows the throughput of each design. Beyond the saturation point, the throughput of DSA reaches 0.5 and outperforms the baseline, LA and NePA designs. Fig. 7(c) shows the energy consumption of the whole network under the random traffic pattern. Because the NePA design adds two input ports, including buffers, more buffer operations such as writes and reads are executed in NePA, and the dynamic and leakage power of the buffers increases linearly [16].
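The equal-total-buffer setup can be checked with simple arithmetic, sketched below in Python. The per-router port counts are an interpretation rather than figures given in the text: the sketch assumes five input ports for the baseline, LA and DSA routers, seven for NePA (five plus the two additional input ports), and it counts the one-flit deadlock recovery buffer of Section III-D toward the DSA total.

```python
from typing import Sequence


def total_buffer_flits(port_depths: Sequence[int], extra_flits: int = 0) -> int:
    """Sum of input buffer depths per router, plus any extra flit storage."""
    return sum(port_depths) + extra_flits


# Baseline and LA: four 4-flit input buffers and one 5-flit local buffer.
baseline = total_buffer_flits([4, 4, 4, 4, 5])
lookahead = total_buffer_flits([4, 4, 4, 4, 5])

# NePA: assumed seven input ports of 3 flits each.
nepa = total_buffer_flits([3] * 7)

# DSA: five 4-flit input buffers plus the assumed one-flit recovery buffer.
dsa = total_buffer_flits([4] * 5, extra_flits=1)

print(baseline, lookahead, nepa, dsa)  # 21 21 21 21
```

Under these assumptions every design has 21 flits of buffering per router, which is consistent with the statement that the total buffer sizes are the same.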
From Fig. 7(c), at low traffic loads DSA consumes almost the same energy as the baseline and LA designs, because PSA is used most of the time. However, as the traffic load increases, more packets are blocked, and both PSA and SSA are used. The power consumption of DSA therefore becomes slightly larger than that of the baseline and LA designs. Although DSA consumes more energy, the average overhead is only 6.1%. We also evaluate our design under other traffic patterns: bit-reversal, transpose and shuffle. Fig. 8(a) shows the resulting throughput at an injection rate of 0.5. The throughput of DSA is significantly better than that of the other designs, and the improvement is especially distinct under bit-reversal and transpose traffic. Fig. 8(b) shows the energy consumption under the different traffic patterns at an injection rate of 0.5. The figure shows that the DSA design incurs a small energy overhead.
[TABLE II fragment: 246.31 (average). † only PSA is used. ‡ both PSA and SSA are used.]
B. Area and Power Estimation
We use the Orion 2.0 simulator [17] to estimate the area and power of the routers, assuming 65 nm technology, a 1 GHz router clock, a 1 V supply (Vdd) and a flit size of 128 bits. TABLE II shows the area and power estimates for the different designs. In terms of area, the NePA design occupies the largest area, as it has many more input buffers and a more complex crossbar. Its area is almost twice that of the baseline and LA routers. The area of our design, however, is only slightly larger than that of the baseline and LA routers, because we add only one switch allocator, which occupies a very small area, and one flit buffer. The power of NePA is also the largest among the designs, because power increases with the number of input buffers and buffer power dominates the power consumption of the whole router. For the DSA design, there are two cases. In the first, a flit passes through the router assigned only by PSA, that is, SSA is not used. In this case the power consumption is at its minimum and is almost the same as that of LA. In the second, the flit first fails in PSA and is then forwarded to the downstream router by SSA. Here both PSA and SSA are used, so the power consumption is at its maximum. TABLE II shows that both the maximum and the average power of DSA are larger than those of the baseline and LA designs, but only by a very small margin.
V. CONCLUSIONS
In modern Networks-on-Chip, more and more cores can be deployed. As the traffic load increases, packets may be blocked more frequently. Additional buffers are commonly used to reduce blocking and improve latency and throughput, but they noticeably increase power consumption.
To improve network performance with little power overhead, this paper proposes a dual-switch allocation (DSA) design. It first lets packets obtain their desired output port through the primary switch allocation. Packets that fail in the primary switch allocation then use the secondary switch allocation to obtain an output direction. The router thus makes the fullest possible use of idle output ports to improve network performance. Experimental results show that our design achieves a significant improvement over the baseline, LA and NePA designs in terms of latency and throughput, at the cost of a small power overhead.
