This paper presents a single-cycle shared output-buffered router for Networks-on-Chip. In each output port, every input port is always assigned an output virtual channel (VC), and these assignments can be exchanged by a VC swapper. The critical path of the router is only 24 logic gates, and its area overhead is 9.4% lower than that of the classical router.
Introduction
Networks-on-Chip (NoCs) have become the most promising interconnection solution for multi-core chips. The router architecture plays a key role in NoC performance. To improve performance and avoid deadlocks, a physical channel can be shared by several virtual channels [1]. Virtual channels can be implemented in the input ports or the output ports of the router, giving an input virtual channel router (IVR) or an output virtual channel router (OVR), respectively. Almost all state-of-the-art routers adopt the input virtual channel architecture, because an OVR must operate at a speedup of p to improve performance, where p is the number of router ports [2].
In classical virtual-channel routers, the header flit performs routing computation (RC), virtual channel allocation (VA), switch allocation (SA) and switch traversal (ST) in turn, whereas body and tail flits perform only the SA and ST stages. The header flit therefore needs more clock cycles than the other flits, which causes packet stalls. This unequal pipelining between header flits and body/tail flits slows down packet transfer. To reduce packet stalls, advanced routers have been presented, such as speculative routers [1], [3], [4] with lookahead routing. These routers have only partially solved the packet stall problem.
In this paper, we present a single-cycle output-buffered router, denoted SOBR. The router has three important characteristics: 1) reconfiguration of the access matrix allows buffers to be shared by input ports; 2) its buffers support a leap read operation; 3) layered switching is adopted. The proposed router achieves a speedup of p-1 by means of the p-1 VC buffers in each output port. In essence, the proposed router is similar to virtual output queued (VOQ) routers, because it always assigns an output VC to each input port. Exchanging the assigned VCs through the VC swapper allows the VC buffers to be shared by input ports, which effectively improves the throughput of the router.
Related Work
Yuto Hirata et al. proposed an on-chip variable-pipeline (VP) router that adapts its data path structure to current requirements through dynamic reconfiguration [6]. This router combines three pipelines of 3 cycles, 2 cycles and 1 cycle, respectively. Ramanujam et al. proposed a distributed shared-buffer (DSB) router architecture [2]. In essence, it is a special input virtual channel router and, like prior IVRs, it suffers from packet stalls. Moreover, its additional buffers incur a considerable increase in area overhead, which is unacceptable under the resource-limited budget of on-chip communication. In [7], a reconfigurable router was proposed in which buffer slots are dynamically allocated to increase router efficiency. However, it suffers from poor scalability in terms of buffer capacity. In [8], the authors discussed several VOQ buffer structures and compared them in terms of implementation complexity and their ability to deal with variations in traffic patterns and message lengths. They then presented a new buffer structure, the Dynamically Allocated Multi-Queue (DAMQ) buffer, which provides non-FIFO message handling and efficient storage allocation for variable-size packets. In [9], a single-cycle VOQ router was proposed. It is similar to the SAMQ router described in [8], and both suffer from low performance because of poor buffer utilization. In [10], a buffer scheme based on a dynamically allocated multi-queue self-compacting buffer was proposed to reduce the buffer size by sharing the same buffer space between two physical channels. However, DAMQ routers suffer from high latency because of their long buffer pipelines.
In [11], a virtual output and input queued (VOIQ) router was presented to eliminate the need for centralized memory management algorithms. It is similar to the SAFC buffer described in [8]. Taking advantage of the temporal and spatial locality in packet destination distributions for IP routers, a destination-based buffer management strategy, DBBM [12], was proposed to reduce the number of queues while keeping roughly the same throughput as VOQ. It is useful when the number of queues is large. In this paper, we present a shared output-buffered router architecture, which is based on the VOQ router architecture.
Copyright © 2012 The Institute of Electronics, Information and Communication Engineers
Output Buffered Router

Overall Architecture
Figure 1 (a) illustrates the proposed single-cycle router architecture for 2-D mesh Networks-on-Chip. The only logic in each input port is a de-multiplexer. Each output port has four virtual channels (VCs), one for each input port other than its own, as shown in Fig. 1 (b). The assignment of VCs can be changed through the access matrix of the VC swapper. The access matrix is automatically reconfigured by the matrix controller according to the available buffers. Input ports and output ports are interconnected by 20 ad-hoc wires. Each VC holds a dynamic FIFO buffer that supports leap read to reduce packet blocking. A next-hop routing computation (NRC) module computes the output port number at the next router. Each output port has an output manager that allocates the output link and the corresponding VCs of the next router.
Generally speaking, the classical VOQ-based router reduces head-of-line (HoL) blocking by assigning, at each input port, a VC to every output port. If the assigned VCs are moved to the output ports, the new VOQ-based router has the same performance as the classical one, because the two routers differ only in how they are implemented. Clearly, the proposed router is such a new VOQ-based router if the VC swappers are removed. In fact, it remains a VOQ-based router even with the VC swappers in place, because it always assigns an output VC to each input port. With the VC swapper, the assigned VC can change according to the value of the access matrix.
Therefore, the proposed router is in essence similar to VOQ routers. Unlike existing VOQ routers [5], [9], SOBR has the following features: (1) An output-buffered architecture is used rather than an input-buffered architecture. (2) The assignment of VCs can be changed through the access matrix, which is reconfigured according to the available buffers. This reconfiguration allows the buffers to be shared by all input ports. (3) A FIFO buffer architecture supporting a leap read operation is used to reduce head-of-line blocking. (4) Layered switching is used to improve performance. Compared with classical virtual channel routers, header flits and body flits spend the same number of clock cycles, i.e. only one cycle, in each router. The critical path of our router is similar to that of the single-cycle VOQ router [9].
VC Swapper
The VC swapper is made up of 4 multiplexers and an access matrix. It always assigns a VC to each of the other input ports. The access matrix is a 4 × 4 bit register whose initial value is shown in Fig. 1 (b). The matrix controller controls the access matrix, which determines the allocation of VCs. It selects two interchangeable VCs, one that is full or about to be full and one that is idle or about to be idle, and configures the access matrix so as to exchange them. A VC is interchangeable when the last flit written to it was a tail flit, and is not interchangeable while a packet is being written to it.
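The exchange performed by the matrix controller can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the identity initial matrix, the function names, and the simplification of "full or about to be full" to "full" and "idle or about to be idle" to "empty" are all illustrative and are not taken from the paper (the actual initial value is the one in Fig. 1 (b)).

```python
# Hypothetical behavioural model of the 4 x 4 access matrix (all names are illustrative).
# Row i is input-port slot i, column j is VC j; a 1 at [i][j] means slot i writes to VC j.

def initial_matrix(n=4):
    # Assumption: identity mapping. The paper's real initial value is given in Fig. 1 (b).
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def vc_of(matrix, slot):
    # Each row holds exactly one 1, so its position is the VC assigned to this slot.
    return matrix[slot].index(1)

def swap_vcs(matrix, vc_a, vc_b):
    # Exchanging two columns exchanges the owners of the two VCs.
    for row in matrix:
        row[vc_a], row[vc_b] = row[vc_b], row[vc_a]

def reconfigure(matrix, occupancy, capacity, interchangeable):
    """Swap one full VC with one empty VC if both are interchangeable.

    Simplification: 'full' means occupancy >= capacity and 'idle' means occupancy == 0.
    Returns the swapped pair (full_vc, idle_vc), or None if no swap was possible.
    """
    n = len(matrix)
    full = [v for v in range(n) if interchangeable[v] and occupancy[v] >= capacity]
    idle = [v for v in range(n) if interchangeable[v] and occupancy[v] == 0]
    if full and idle:
        swap_vcs(matrix, full[0], idle[0])
        return (full[0], idle[0])
    return None
```

In this model the matrix stays a permutation after every swap, which reflects the property that each input port always owns exactly one VC in the output port.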
FIFO Buffer Supporting Leap Read (DFIFO)
To improve performance, we designed a FIFO buffer that supports a leap read operation. When the head flit of the buffer is blocked, the buffer sends a leap read request to the output manager if possible. After a leap read, the queue has to be updated, i.e. the flits behind the one read are moved forward. A normal read and a leap read cannot be performed at the same time because of the bandwidth limitation. For a FIFO buffer, the priority of a leap read is lower than that of a head read. Once the buffer begins to leap read a packet, it cannot leap read another packet until it has returned to normal read operation.
The DFIFO has five modules: the write logic (WL), the data queue (DQ), the routing results queue (RQ), the queue update (QU), and the read logic (RL). The entries of DQ and RQ correspond one-to-one, and all operations on them are performed at the same time. The WL handles the writing of data and routing results. It consists of two pointers, tail and outport. The former points to the tail of the queues (DQ and RQ), and the latter buffers the routing result of the current packet. The update of the tail pointer is determined by the write operation (WT) and the read operation (HRD). If WT and HRD are both performed in the same cycle, or neither is performed, the tail pointer is unchanged in the next cycle. If WT is performed and HRD is not, the tail pointer moves back, i.e. is incremented by 1. If HRD is performed and WT is not, the tail pointer moves forward, i.e. is decremented by 1.
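The tail-pointer update rule amounts to a truth table over WT and HRD, which can be written as a small function. The function name and the boolean encoding of the two signals are illustrative assumptions; bounds checking against the buffer depth is deliberately omitted.

```python
def next_tail(tail, wt, hrd):
    """Return the tail pointer for the next cycle (illustrative sketch, no bounds checks).

    wt  -- True if a write (WT) is performed in this cycle
    hrd -- True if a read (HRD) is performed in this cycle
    """
    if wt and not hrd:
        return tail + 1   # one flit added: tail moves back
    if hrd and not wt:
        return tail - 1   # one flit removed: tail moves forward
    return tail           # both or neither: occupancy is unchanged
```

For example, a simultaneous write and read leaves the occupancy, and hence the tail pointer, unchanged.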
The RL handles the head read (DRD) and the leap read (PRD) operations. DRD and PRD cannot be performed at the same time because of the bandwidth limitation. The RL first sends a request to read data from the head of the queues if the head flit is not blocked. Otherwise, it sends a request to perform a leap read if there is an unblocked flit in the data queue. The QU updates the queues.
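The read behaviour of the DFIFO can be sketched as a behavioural model in which DQ and RQ are parallel lists. This is a minimal sketch, not the authors' hardware design: the class and field names, the representation of blocking as a set of blocked next-hop ports, and the reading of the leap-read restriction (no other packet may be leap read until a head read has occurred) are assumptions made for illustration.

```python
from collections import namedtuple

# Illustrative flit record: packet identifier plus a tail marker.
Flit = namedtuple("Flit", ["packet", "is_tail"])

class LeapReadFifo:
    """Behavioural sketch of a FIFO with leap read (hypothetical names throughout)."""

    def __init__(self):
        self.dq = []             # data queue
        self.rq = []             # routing-results queue, one entry per flit in dq
        self.leap_packet = None  # packet currently being leap read, if any

    def write(self, flit, outport):
        # DQ and RQ are always updated together.
        self.dq.append(flit)
        self.rq.append(outport)

    def _remove(self, i):
        # Queue update: removing entry i shifts all following flits forward.
        self.rq.pop(i)
        return self.dq.pop(i)

    def read(self, blocked):
        """Perform at most one read; 'blocked' is the set of blocked next-hop ports.

        Returns ("head", flit), ("leap", flit), or None if nothing can be read.
        """
        if not self.dq:
            return None
        if self.rq[0] not in blocked:
            self.leap_packet = None          # back to normal read operation
            return ("head", self._remove(0))
        for i, (flit, port) in enumerate(zip(self.dq, self.rq)):
            if port in blocked:
                continue
            if self.leap_packet is not None and flit.packet != self.leap_packet:
                continue                     # may not leap read a second packet
            self.leap_packet = flit.packet
            return ("leap", self._remove(i))
        return None
```

Because the head read is tried first and a leap read is attempted only when the head flit is blocked, the model gives the head read priority and never performs both kinds of read in the same call.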
Output Manager
In each output port, an output manager (OM) allocates access to the output physical link and the VCs of the next router. It has two arbiters, a head arbiter and a leap arbiter. The former arbitrates the normal read requests of the FIFO buffers, and the latter arbitrates the leap read requests. The OM implements a switching scheme similar to layered switching (LS) [13].
LS implements wormhole switching (WH) on top of virtual cut-through switching (VCT). It performs better than WH and consumes less storage than VCT, and it improves the performance of virtual channel routers. However, LS is not suitable for queued routers, because it requires complex buffer pre-allocation; otherwise, resources are wasted because link and buffer are bound together in fixed groups. To solve this problem, we propose dynamic layered switching (DLS), which is based on layered switching. In DLS, the flow control granularity changes according to the currently available buffer space, ranging from 1 flit to the whole packet. Figure 2 shows the operation sequence under DLS. First, routing computation is performed, and a VC is requested according to the routing result. Next, the output link is requested, and it is locked if granted. Then the forwarding of a dynamic group of flits starts. Buffer availability must be checked while the flits are being forwarded. If the output buffer is available for the current group, forwarding continues in the next cycle. Otherwise, the forwarding of the current group ends and the link is released. The packet then waits for the output buffer and starts the next loop when the buffer becomes available. Once the whole packet has been forwarded successfully, the packet releases the locked VC. It is worth mentioning that all operations except routing computation are performed by the output manager.
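The two-arbiter structure of the output manager can be sketched as follows. The paper does not state the arbitration policy, so the round-robin arbiters here are an assumption, as is giving head read requests strict priority over leap read requests at the output link; all class and method names are illustrative.

```python
class RoundRobinArbiter:
    """Simple round-robin arbiter (assumed policy; not specified in the paper)."""

    def __init__(self, n):
        self.n = n
        self.last = n - 1  # so that index 0 has the highest priority initially

    def grant(self, requests):
        # Search from the position after the last winner, wrapping around.
        for offset in range(1, self.n + 1):
            i = (self.last + offset) % self.n
            if requests[i]:
                self.last = i
                return i
        return None

class OutputManager:
    """Sketch of an output manager with a head arbiter and a leap arbiter."""

    def __init__(self, n_vcs=4):
        self.head_arbiter = RoundRobinArbiter(n_vcs)
        self.leap_arbiter = RoundRobinArbiter(n_vcs)

    def allocate(self, head_reqs, leap_reqs):
        """Grant the output link to one VC per cycle; returns (kind, vc) or None."""
        winner = self.head_arbiter.grant(head_reqs)
        if winner is not None:
            return ("head", winner)
        # Assumption: leap requests are served only when no head request is pending.
        winner = self.leap_arbiter.grant(leap_reqs)
        if winner is not None:
            return ("leap", winner)
        return None
```

With this arrangement a leap read can use the link only in cycles that would otherwise be idle, so it does not delay any unblocked head flit.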
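The variable granularity of DLS can be illustrated by a simplified model of how one packet is forwarded. This is a sketch under assumptions that are not in the paper: one flit is forwarded per cycle, buffer availability is supplied as a per-cycle boolean sequence, and the link is taken to be granted whenever it is requested. The function name and return format are illustrative.

```python
def forward_packet_dls(packet_len, buffer_available):
    """Simulate forwarding one packet under a simplified DLS model.

    buffer_available -- iterable of booleans, one per cycle, telling whether the
                        output buffer can accept a flit in that cycle
    Returns (groups, done): the sizes of the dynamic flit groups that were
    forwarded, and whether the whole packet was sent (after which the VC is released).
    """
    groups = []
    current = 0   # flits sent in the current group (link is locked while > 0)
    sent = 0
    for available in buffer_available:
        if sent == packet_len:
            break
        if available:
            current += 1          # forwarding continues in this cycle
            sent += 1
        elif current:
            groups.append(current)  # group ends, link is released
            current = 0
        # else: the packet is waiting for the output buffer
    if current:
        groups.append(current)
    return groups, sent == packet_len
```

When the buffer is always available the packet goes out as a single group, as in virtual cut-through switching, and when space appears only one slot at a time the groups shrink to single flits, as in wormhole switching.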
Experimentation
To evaluate the performance of the proposed router, we implemented a hardware-level simulation platform. On this platform, the routers (the proposed router and the other routers used for comparison) are implemented in synthesizable Verilog HDL. Assertions and coverage are inserted to debug and verify the routers. Each router has 4 VCs per port, located in the output ports for the proposed router and in the input ports for the other routers. The network topology is a 4 × 4 mesh. The buffer size of the routers is 4 flits. We adopt the uniform random traffic pattern. Figure 3 shows the basic performance of the routers. Normalized throughput is the throughput normalized by the ideal saturation throughput. BIVR denotes the 4-stage input virtual channel router [4], BIVR-LR denotes BIVR with look-ahead routing [4], IVR-SC denotes the speculative input virtual-channel router [1], and SVOQ denotes the single-cycle VOQ router presented in [9]. As Fig. 3 shows, the average latencies of SOBR and SVOQ are much lower than those of the other routers, and SOBR achieves a remarkable improvement in throughput. This is mainly because the VCs in each output port are shared by input ports, owing to the VC swappers of SOBR. Figure 4 shows the saturation throughput for different packet lengths. SOBR achieves up to 86% throughput for 1-flit packets. As the packet length increases, the throughput decreases slightly. Figure 4 also shows that SOBR handles variable-length packets well.
Basic Performance
Note that IVR-SC does not strictly follow the structure presented in [1], for reasons of complexity and area overhead. Moreover, we did not implement the router presented in [3] or the single-cycle pipeline structure presented in [6], because the latency of their critical path (SA + ST) is too large for a high-performance on-chip router. This has been shown in [3], where the critical path is 35 FO4. The router presented in [2] achieves high throughput at the cost of additional buffers. However, its deep pipeline clearly increases packet latency: its per-router latency is 5 cycles, whereas that of our router is only 1 cycle.
Area Overhead
We use Synopsys Design Compiler with a 65 nm technology to evaluate the critical path and area overhead. The critical path of SOBR is only 24 logic gates, with a latency of 0.62 ns. Compared with BIVR, the area overhead of SOBR is reduced by up to 9.4%. This area reduction comes from the reduction of the VC allocation and switch allocation modules.
Conclusion
In this paper, we have presented a single-cycle shared output-buffered router, SOBR. It always assigns an output VC to each input port. Exchanging the assigned VCs through the VC swapper allows the VC buffers to be shared by input ports, which effectively improves the throughput of the router. The router has only one pipeline stage, and its critical path is only 24 logic gates. It achieves up to 86% throughput under the uniform random traffic pattern. Compared with other routers, SOBR achieves a significant latency reduction. Overall, SOBR is low-latency, high-throughput and area-efficient.
