Abstract-This paper proposes a novel QoS-aware and congestion-aware Network-on-Chip architecture that not only enables quality-oriented network transmission at a feasible implementation cost but also balances traffic load inside the network to enhance overall throughput. Application traffic is differentiated into service classes, and bandwidth is allocated accordingly to fulfill QoS requirements. Combined with a congestion control scheme that consists of dynamic arbitration and adaptive routing path selection, high priority traffic is directed to less congested areas and is given preference for available resources. Simulation results show that the average latency of both high priority and overall traffic improves dramatically for various traffic patterns. Cost evaluation results also show that the proposed router architecture requires negligible cost overhead but provides better performance for both advanced mesh NoC platforms.
I. INTRODUCTION
Multi-core integrated circuit designs have been proposed and have recently proven to be a prevailing architecture. Chip multiprocessors (CMPs) such as a 64-core SoC and an 80-core NoC architecture [6] [33] were presented, paving the way to network-based interconnection design. The NoC interconnection scheme has been demonstrated to be a better solution because of its superior performance and fault tolerance characteristics [7] [11] . The NoC interconnection architecture uses a distributed control mechanism, providing a scalable interconnection network.
A multiprocessor system platform called Network-based Processor Array (NePA) has recently been developed [5] , in which processors are interconnected by an on-chip two-dimensional (2D) mesh network. NePA is a deadlock-free and livelock-free network that implements the wormhole packet switching technique and utilizes an adaptive minimal routing algorithm. Because of the limitations of the traditional 2D mesh topology, additional alternative routing resources that provide more network tolerance are employed to further improve the performance of the NePA architecture. Moreover, diagonal links for the 2D mesh network are proposed to improve throughput and performance, motivated by the emergence of the X-architecture routing technique in chip manufacturing [14] [30] . The proposed NoC architecture, referred to as Diagonally-linked Mesh (DMesh), employs diagonal express links between routers on a baseline NePA network. Diagonal links not only reduce the distance between source and destination nodes, but also help alleviate network congestion so that network performance is enhanced dramatically [37] .
Adaptive routing algorithms have been employed in multichip interconnection networks as a means to improve network performance and to tolerate link or router failures. However, congestion in interconnection networks is a well-known phenomenon. This work utilizes a congestion control scheme to provide adaptive routing arbitration control and thus exploit available routing resources efficiently. Layered on top of the baseline adaptive routing algorithm, this congestion avoidance control adaptively balances traffic load and increases overall NoC throughput by intelligently allocating existing resources. Quality-of-Service (QoS) provision supports differentiated service classes among various applications and can further improve the utilization efficiency of network bandwidth. Instead of the conventional approach of adding abundant buffers or virtual channels (VCs), the proposed mechanism, featuring a congestion avoidance scheme and QoS ability, facilitates differentiated service transmission while maintaining a high throughput tolerance.
The rest of the paper is organized as follows. Section II summarizes related work in on-chip interconnection networks. Section III describes an overview of NePA and DMesh NoC platforms. Section IV proposes an innovative router design with congestion-aware and QoS-aware router scheme. Section V shows performance and cost evaluation of the proposed architecture. Concluding remarks are provided in Section VI.
II. RELATED WORK
Congestion control: Congestion management was proposed to prevent networks from saturating and to improve throughput in NoCs. Buffer and link status are two of the most popular indicators of network congestion [28] [31] [32] . A congestion-aware routing algorithm aims to distribute traffic load evenly over the network. For instance, a self-optimized routing strategy [31] decides a favorable path for incoming packets based on buffer load information. A proximity congestion awareness technique avoids congested areas by using stress values that are passed from neighboring switches [26] . Both techniques attempt to divert packets from hot spots in the network. A contention-aware input selection algorithm, which gives priority to incoming packets from congested areas, was proposed to alleviate congestion in the upstream area [38] . An application-aware congestion control algorithm for on-chip bufferless networks relieves congestion by making proper throttling decisions [27] .
In order to design a lightweight congestion-aware NoC router, an approach based on dynamic port arbitration to resolve congestion and adaptive output path selection to distribute traffic load efficiently was devised [36] . This work utilizes that methodology to relieve congestion.
QoS provision: QoS is commonly achieved by providing each traffic class with a separate virtual channel, either in a time-division multiplexing [13] [23] or a dynamic virtual channel allocation [15] [16] manner. Networks providing guaranteed throughput (GT) and best effort (BE) services use a VC reservation methodology. AEthereal is a NoC that provides connection-oriented GT service and a BE service that uses non-reserved time slots, and it supports static and dynamic allocation of slots [12] . MANGO [8] is another NoC that provides connectionless BE routing and connection-oriented guaranteed services (GS). Another implementation example adopts a deterministic dimension order source routing strategy and assigns priority to GT traffic. BE packets are allocated in a round robin manner if the virtual channels (VCs) associated with GT are empty [35] . A customized QoS NoC (QNoC), which classifies service into four classes (signaling, real-time, RD/WR and block transfer), was also proposed. Individual buffers store the different classes of traffic, and bandwidth is allocated accordingly [9] .
Although multiple VCs are implemented to support more service levels, these designs increase switch complexity and arbitration delay. An area-efficient design using two VCs at switches was presented to provide full QoS support; it demonstrates more than acceptable performance and meets the stringent low-cost implementation requirements of NoCs [20] [22] . It was also indicated that the trend for providing sufficient adaptive routing or fault tolerance is to increase the number of ports instead of the number of VCs per port [22] [24] .
Hybrid mechanism: Routers with congestion management and QoS provision generally require a high number of buffers at switch ports, so the implementation cost is high and prohibits their adoption in NoCs. An interconnection network architecture combining both technologies has been proposed recently [19] [21] . Regional Explicit Congestion Notification (RECN), which needs no VCs, and a QoS-aware design with two VCs dramatically reduce resource requirements and design complexity. The combined architecture demonstrates cost efficiency and performance improvement.
This work integrated QoS provision and congestion-aware routing algorithm in an efficient way to facilitate packet transmission and enhance throughput for NoC routers.
III. HIGH PERFORMANCE NOC PLATFORM
Exploiting fully adaptive routing enables extensive routing flexibility and thus enhances network throughput. The double-y routing algorithm has been proposed as an on-chip solution that supports fully adaptive wormhole routing and maintains feasible design complexity [25] . There are two approaches to providing the two additional vertical channels. One uses the VC technique, and the other uses additional physical channels to form double-y networks. Although the former approach can reduce the number of router ports, it might deteriorate transmission performance when the workload increases. Adding physical ports relaxes the congestion problem and sustains the benefit of fully adaptive routing. The NePA platform uses two extra ports to form independent eastbound and westbound subnetworks [4] . Beyond that, the DMesh platform is constructed by integrating four additional diagonal links into NePA to take further advantage of the diagonal architecture and explore more routing space.
A. NePA system platform
The NePA platform is a scalable, flexible and reconfigurable high performance NoC platform [2] which is based on a 2D mesh topology, as shown in Figure 1, and uses the wormhole packet switching technique. The router connects with its four neighboring routers via six 64-bit bidirectional links, including two horizontal and four vertical links. A key feature of the NePA architecture is the use of two separate vertical links, which provides isolated subnetworks and classifies packets into eastbound and westbound traffic. Within each subnetwork, adaptive minimal routing is performed to prevent cycles in the resource dependence graph and to guarantee deadlock freedom [10] . Utilizing an adaptive XY routing scheme can increase network performance and provide fault-tolerant routing ability [4] . The router adaptively selects an alternative output port for packets when an output port is congested or the output buffer is full. Therefore, link utilization is well balanced and network performance also improves.
B. DMesh system platform
The DMesh network is constructed by adding diagonal links to NePA, as presented in Figure 2 . The DMesh network is composed of two sub-networks, E-subnet and W-subnet, represented with dashed arrows and solid arrows in Figure 2, respectively. E-subnet is responsible for transferring eastbound packets while W-subnet is responsible for westbound traffic. When a source PE starts packet transmission, it injects packets into one of the sub-networks depending on the direction of the destination PE. Subsequently, packets traverse that sub-network to their destinations.
IV. ROUTER ARCHITECTURE
Adaptive routing algorithms improve network performance by adjusting routes based on network conditions. Multiple buffers inside each port mitigate Head-of-Line (HOL) blocking effects and improve throughput and latency. Instead of the VC approach, this work employs parallel buffers to solve the HOL blocking issue. To utilize routing resources effectively and improve throughput, the Congestion-Aware (CA) routing scheme relieves packets from highly congested areas and directs them to less congested ones. QoS-aware routing further improves performance for high priority application traffic.
A. Adaptive routing algorithm
Double-y routing and its allowed turns are shown in Figure 3 (a) and (b). Once packets are injected into the network, they can only travel in one of the disjoint eastbound and westbound subnetworks, following the minimal XY routing approach. If the link status indicates congestion in either the X or the Y direction, packets adaptively take less congested paths toward their destinations.
NePA/DMesh routers are divided into an internal router, which deals with injecting and ejecting packets, and two sub-routers: the E-router and the W-router. NePA follows the minimal XY routing approach. DMesh, however, adopts a quasi-minimal routing approach instead of a minimal one to balance workloads over the network, so that performance and throughput are dramatically improved, especially in high workload situations [37] . Diagonal channels are granted first; horizontal or vertical channels are taken if the candidate diagonal channels are occupied or congested. The channels of the internal router have the lowest priority, which prevents further congestion, particularly in heavy load situations.
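As a rough illustration of the port preference described above, the following Python sketch orders the candidate output directions for a packet in the E-subnet. The coordinate convention, the direction names, the `congested` set and the handling of packets with no horizontal offset are assumptions made for this example. Only productive (minimal) moves are listed; the quasi-minimal detours used by DMesh [37] are not modeled.

```python
def select_subnet(src, dst):
    """Pick the subnetwork from the horizontal direction of the destination.

    Assumption: packets with no horizontal offset use the E-subnet.
    """
    return "E" if dst[0] >= src[0] else "W"


def candidate_outputs(cur, dst, diagonal_links=True):
    """List productive output directions for an eastbound packet, best first.

    cur and dst are (x, y) tuples; x grows eastward and y grows northward
    (assumed convention). Diagonal links are offered only when
    diagonal_links is True (DMesh); otherwise this is the NePA case.
    """
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx < 0:
        raise ValueError("eastbound subnet cannot move west")
    if dx == 0 and dy == 0:
        return ["LOCAL"]  # eject through the internal router
    vertical = "N" if dy > 0 else "S"
    outputs = []
    if diagonal_links and dx > 0 and dy != 0:
        outputs.append(vertical + "E")  # diagonal channel is tried first
    if dx > 0:
        outputs.append("E")
    if dy != 0:
        outputs.append(vertical)
    return outputs


def choose_output(cur, dst, congested, diagonal_links=True):
    """Return the first candidate that is not congested, or None to wait."""
    for port in candidate_outputs(cur, dst, diagonal_links):
        if port not in congested:
            return port
    return None
```

For example, `choose_output((0, 0), (3, 2), {"NE"})` falls back from the congested diagonal channel to the horizontal one.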
B. Multiple parallel buffers enhanced architecture
The performance of wormhole routed networks suffers from the HOL blocking effect, especially in high workload scenarios. To tackle this problem, incorporating VCs can effectively mitigate performance degradation and improve throughput and latency accordingly [18] . Different VC approaches have been adopted by many five-port routers and have demonstrated their effects. However, VC flow control needs abundant control signals between routers to keep track of VC buffer status, which causes tremendous overhead for multiport routers.
Different from VC flow control, a new routing-independent Parallel Buffer (PB) structure and its management scheme were proposed to enhance network channel utilization while keeping design overhead moderate [3] . An enhanced router example of the NePA E-router is shown in Figure 4 . Each added channel keeps the merit of the adaptive minimal routing strategy instead of mapping to dedicated outputs in a fixed pattern. This scheme works independently to explore more routing resources so that channel utilization and maximum throughput improve accordingly. The proposed architecture maintains the routing flexibility to deliver packets toward paths that are less likely to be congested. Therefore packets can bypass blocked output ports and keep heading to their destinations along minimal routing paths.
Algorithm 1 CA and QoS management mechanism
Input: Packet_i, Congestion_idx_in
Output: Congestion_idx_out
PB_n: nonempty parallel buffers
QoS_n: predefined service classes
PQ_i: a priority queue that holds M input ports
PQ_o: a priority queue that holds N output ports
P_i: input port index
P_o: output port index
1: Packets are injected into either E-subnet or W-subnet
2: Construct PQ_i for input ports by Congestion_idx_in
3: Construct PQ_o for available output links by Congestion_idx_in
4: Sort PQ_i and PQ_o according to congestion status
5: for all QoS_n do
6:   for all P_o in PQ_o do
7:     for all P_i in PQ_i do
8:       for all PB_n in each P_i do
9:         if input-output pair is available then
10:          route packet from selected input to output port
11:          remove P_i and P_o from PQ_i and PQ_o
12:        else
13:          restore
C. Dynamic congestion-aware router architecture
A previously proposed lightweight dynamic arbitration mechanism has shown its effectiveness in congestion detection and management [36] [37] . The mechanism notably reduces the wiring requirement because it only delivers a congestion index, calculated as the number of active FIFOs with waiting packets, instead of detailed buffer or link information. The purpose of dynamic arbitration is to alleviate traffic congestion by allowing packets coming from hot spots to move first and to advance through less congested routers. Resource contention in the congested region is reduced accordingly. The congestion-aware routing procedure is detailed in Algorithm 1. First, the priority queues PQ_i and PQ_o, associated with the available input and output ports, are established based on congestion indices from neighboring routers. Each input port corresponds to a congestion index from its upstream router, and each output port is associated with a congestion index from its downstream router. These indices serve as keys that identify the congestion status of the upstream and downstream routers and decide the priority of each port. The arbiter matches input-output pairs by taking from PQ_i the inputs with the most congested upstream traffic and from PQ_o the outputs with the least downstream congestion.
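The matching step can be sketched in Python as follows. This is a simplified, software-level illustration rather than the router's arbiter: the request tuples, the port names and the two-class ordering `("GT", "BE")` are assumptions for the example, each input is granted at most one output per round, and the `else` branch of Algorithm 1 is not modeled.

```python
def congestion_index(fifos):
    """Count the FIFOs that currently hold waiting packets."""
    return sum(1 for fifo in fifos if len(fifo) > 0)


def arbitrate(requests, upstream_idx, downstream_idx, qos_classes=("GT", "BE")):
    """Match input ports to output ports for one arbitration round.

    requests: list of (input_port, qos_class, allowed_outputs), one entry
        per head-of-buffer packet (an input with several parallel buffers
        may appear more than once).
    upstream_idx: input_port -> congestion index from the upstream router.
    downstream_idx: output_port -> congestion index from the downstream router.
    Returns a dict mapping each granted input_port to its output_port.
    """
    # Most congested inputs first, least congested outputs first.
    pq_in = sorted(upstream_idx, key=lambda p: -upstream_idx[p])
    pq_out = sorted(downstream_idx, key=lambda p: downstream_idx[p])
    grants = {}
    for qos in qos_classes:              # higher-priority class first
        for out in list(pq_out):
            for inp in list(pq_in):
                wants = any(r_in == inp and r_qos == qos and out in r_outs
                            for r_in, r_qos, r_outs in requests)
                if wants:
                    grants[inp] = out
                    pq_in.remove(inp)    # both ports are used up this round
                    pq_out.remove(out)
                    break
    return grants
```

With requests `[("W_in", "BE", {"E", "N"}), ("S_in", "GT", {"E"})]`, the GT packet obtains the least congested output even though its input port is the less congested one, because service classes are processed in priority order.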
D. QoS-aware router architecture
Application workloads are classified into different service levels, which are indicated in the header field. The header parsing unit interprets the QoS header field and gives routing preference to high priority packets. Less congested outputs and highly congested inputs are selected to advance high priority traffic first. For routers with multiple PBs, resource sharing and channel reservation for GT traffic are two strategies for differentiating resource allocation. The former is described in Algorithm 1: all buffers are shared by all traffic, including GT and BE. In the resource reservation solution, one or multiple buffers are dedicated to GT traffic and the others are assigned to BE traffic [16] .
According to different resource deployment, QoS-aware router can be further classified into the following mechanisms.
• Single Buffer Mechanism (SBM): There is only one associated buffer with each input port. Arbiter routes GT traffic first based on available routing resources and congestion information. Although GT packets gain preference to be served, they might be blocked by BE ones due to shared buffer. Only the arbiter provides differentiated routing arbitration to enable preliminary QoS.
• Multiple Buffers Mechanism (MBM): All buffers are shared by all service classes. Although GT traffic may suffer performance degradation from being blocked by BE traffic, this situation is relaxed by the multiple PBs design and dynamic adaptive routing. On one hand, multiple PBs can store GT traffic to avoid blocking; on the other, GT traffic can advance owing to preferentially allocated resources. Buffer utilization is maximized regardless of the GT rate in this case.
• Multiple Buffers Mechanism with Reserved Channel (MBMRC): MBMRC reserves specific buffers to store GT traffic. Static buffer allocation for QoS is a popular way of assigning routing resources [9] . Different from that, in our simulation GT buffers are reserved for GT traffic and the other buffers are shared by all traffic to improve buffer utilization. In the two PBs case, PBnum 0 is dedicated to GT traffic, and PBnum 1 is shared by both GT and BE traffic. Guaranteed bandwidth is reserved for GT so as to provide better performance. This might cause inefficient resource utilization when GT traffic is low.
The congestion and QoS aware mechanism differentiates traffic transmission and maximizes network resource utilization efficiency. GT packets benefit from the multiple PBs design and preferential adaptive routing. Congestion management helps to mitigate possible performance degradation in high workload cases because GT packets tend to be directed to less congested areas.
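The three buffer deployment mechanisms differ mainly in which parallel buffers a packet of a given class may occupy. The sketch below captures that admission rule in Python; the function name, the boolean `free` list and the first-fit search order are assumptions for illustration, with buffer 0 playing the role of the reserved GT buffer as in the two PBs case above.

```python
def pick_buffer(mechanism, qos, free):
    """Return the index of the parallel buffer that accepts a packet, or None.

    mechanism: "SBM", "MBM" or "MBMRC".
    qos: "GT" or "BE".
    free: list of booleans; free[i] is True when buffer i has room.
    """
    if mechanism == "SBM":
        allowed = [0]                       # one shared buffer per port
    elif mechanism == "MBM":
        allowed = range(len(free))          # every buffer is shared
    elif mechanism == "MBMRC":
        # Buffer 0 is reserved for GT; the remaining buffers are shared.
        allowed = range(len(free)) if qos == "GT" else range(1, len(free))
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    for i in allowed:                       # first fit (assumed policy)
        if free[i]:
            return i
    return None
```

Under MBMRC a BE packet is refused when only the reserved buffer is free, which is the source of the under-utilization noted for low GT ratios.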
E. Starvation prevention
Resource allocation imbalance and a high GT injection rate might cause starvation of BE traffic. However, experimental results have shown that starvation can be avoided by limiting the ratio of GT traffic to around 0.5 when considering multiple PBs without GT channel reservation for NePA. This is realistic because the GT class is expected to be assigned to only a small portion of traffic, which can easily be achieved by self-disciplined processors.
Rather than relying passively on a limited GT ratio, a QoS level boost mechanism is proposed to eliminate starvation. A timer for each buffer records how long packets have been waiting. When the blocking time reaches a predefined threshold, BE packets are boosted to GT traffic in order to accelerate their transmission. Starvation can be actively prevented in this manner.
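A minimal sketch of the boost mechanism, written in Python for readability, is shown below. The `BufferedPacket` type, the `tick` function and the `threshold` parameter are hypothetical names; no threshold value is given in the text, and resetting the timer when the packet is not blocked is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass
class BufferedPacket:
    qos: str          # "GT" or "BE"
    waited: int = 0   # cycles spent blocked at the head of the buffer


def tick(packet, blocked, threshold):
    """Advance the per-buffer timer by one cycle and boost if needed.

    threshold is a hypothetical tuning parameter (in cycles).
    """
    if blocked:
        packet.waited += 1
    else:
        packet.waited = 0   # assumed: timer restarts once the packet moves
    if packet.qos == "BE" and packet.waited >= threshold:
        packet.qos = "GT"   # promote the starving BE packet
    return packet
```

Calling `tick` once per cycle for the packet at the head of each buffer is enough to promote a BE packet after `threshold` consecutive blocked cycles.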
V. EVALUATION
The methodology used to analyze the performance and feasibility is discussed in this section. Different control factors such as network platform, PB allocation, routing algorithm, GT ratio, traffic pattern and workload will be presented and evaluated in the simulation.
A. Experimental setup
The NoC platform with QoS and Congestion Aware (QoSCA) routers is modeled in a SystemC-based cycle accurate simulator. An 8x8 mesh network and different router designs, namely 5-port, 7-port (NePA) and 11-port (DMesh) routers, are considered. Wormhole packet switching is adopted, and packets are composed of 64-bit flits. A traffic generator produces different synthetic traffic traces for evaluating the performance, including {Random, Bit complement, Bit reverse, Matrix transpose} traffic patterns [10] . Additionally, local and global hot region traffic conditions are considered, labeled as {Local, Hot spot}. In Local traffic, 80% of the total injected traffic traverses a distance of less than four hops. In Hot spot traffic, 10% of the nodes receive 68% of the total injected traffic. These patterns define the spatial distribution of packets. A self-similar traffic generation technique was implemented to apply a temporal distribution to transmitted packets [17] [34]. Self-similar traffic can be generated by aggregating a large number of packet sources which exhibit a long-range dependence property [29] . An ON/OFF state is imposed on each source node to control traffic generation during simulation time. The length of time a node spends in the ON or OFF state is determined by the Pareto distribution [1] [10] . Packets are stored in an infinite queue at the source node after they are generated, and wait until they are injected into the network. This method isolates packet generation from network behavior, so that packet generation is independent of the network condition. Each simulation executes 10,000 clock cycles for warm-up and then continues for 100,000 cycles during which router performance is measured.
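For readers who want to reproduce similar workloads, the sketch below shows one common way to define the permutation traffic patterns and the Pareto ON/OFF periods in Python. It is not the simulator used here: the node numbering (`id = y * MESH + x`), the function names and the parameters `alpha_on`, `alpha_off` and `min_len` are assumptions, and no shape parameters are stated in the text.

```python
import random

MESH = 8                      # 8x8 mesh, as in the setup above
BITS = 6                      # log2(64) address bits per node


def bit_complement(src):
    """Destination whose address is the bitwise complement of src."""
    return ~src & (MESH * MESH - 1)


def bit_reverse(src):
    """Destination whose address is src with its bits in reverse order."""
    return int(format(src, f"0{BITS}b")[::-1], 2)


def matrix_transpose(src):
    """Destination obtained by swapping the x and y coordinates."""
    x, y = src % MESH, src // MESH   # assumed numbering: id = y * MESH + x
    return x * MESH + y


def on_off_periods(rng, n, alpha_on, alpha_off, min_len=1.0):
    """Yield n (on_cycles, off_cycles) pairs with Pareto-distributed lengths.

    alpha_on, alpha_off and min_len are placeholder parameters.
    """
    for _ in range(n):
        on = min_len * rng.paretovariate(alpha_on)
        off = min_len * rng.paretovariate(alpha_off)
        yield on, off
```

A source node would generate packets only during ON periods and stay silent during OFF periods; aggregating many such sources gives the bursty, self-similar behavior described above.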
B. Performance evaluation
Performance evaluation is based on transmission time and traffic load among source and destination pairs. Latency and throughput are major performance evaluation metrics. For performance comparison, various router architectures, arbitration schemes, traffic configurations and hardware assignments were evaluated to demonstrate the effectiveness of the proposed mechanism.
1) Throughput enhanced router design: Figure 5 shows how the number of ports (two additional vertical ports for NePA and diagonal links for DMesh) and the PB configuration influence latency and throughput. The 5-port counterpart adopts dimension order routing, and all cases implement the congestion-aware scheme in the simulation. Performance improves as routing resources increase. Multiple ports provide additional routing flexibility and bandwidth, and multiple PBs alleviate congestion and thereby accelerate data transmission. NePA outperforms 5-port NoC routers (NoC5), and DMesh outperforms NePA because of its express diagonal links. NoCs with more ports meet the expectation of improved performance at the same buffer level. Adding more PBs alone, however, does not guarantee a performance improvement: NePA and DMesh with fewer PBs work better than NoC5 with four or eight PBs. DMesh PB2 improves accommodated throughput more than NePA PB2 and NePA PB4, and NePA PB2 improves it more than NoC5 PB2 and NoC5 PB4. Even DMesh PB1 accommodates more traffic than NePA and NoC5 with multiple PBs. The result reflects that adding extra routing resources benefits overall system performance owing to routing flexibility. Observations from different traffic patterns lead to the same conclusion, and the improvement is especially significant in symmetric traffic scenarios such as {Bit reverse} and {Matrix transpose}.
2) Congestion management effectiveness for QoS: GT traffic can take advantage of available resources with less congestion and achieve significant performance improvement. In addition, paths taken by GT traffic are less likely to be blocked, which helps guarantee packet advancement. Routers with fixed and CA routing arbitration are compared in Figure 6 to demonstrate the effectiveness of the CA mechanism. The CA mechanism effectively enhances GT and overall transmission performance. Among the compared designs, DMesh CA provides the best performance and tolerated throughput. The CA scheme significantly improves the average latency of GT traffic as well as that of overall traffic, especially in high workload situations. By employing a congestion control scheme, transmitted flits can eventually find available paths to their destinations in an acceptable time. GT traffic gains preference for routing resources and achieves better performance than other traffic. Overall latency also benefits from well balanced traffic.
3) Resource allocation effectiveness for QoS: A multiple PBs design can improve performance and alleviate congestion. Traffic associated with different service classes can be separated and stored in different buffers to mitigate the order error effect [22] . GT traffic can therefore be routed first and prevented from being blocked by BE traffic. The impact of different PB configurations for the random traffic trace is summarized in Table I . Preliminary QoS provision can still be achieved even for NoCs with a single buffer. For NoC5 and NePA, GT packets achieve relatively better performance than overall packets. However, with a single buffer GT traffic might still be blocked by BE traffic, which hinders its advancement. Multiple PBs can effectively separate GT traffic from BE traffic and further provide privileged bandwidth to GT traffic. NePA/DMesh with multiple PBs demonstrate significant performance improvement over the other cases. DMesh benefits from express links, so GT traffic, and even BE traffic when resources are available, can be transmitted faster with dramatically shorter latency.
It is observed that in the case of a low GT ratio and a high workload, the reserved channel might be under-utilized and the performance of overall and BE traffic suffers from poor resource allocation. A detailed comparison between the mechanisms with and without reserved channels has been performed. Average and GT latency analysis for DMesh is shown in Figure 7 . The GT traffic latency of MBMRC is better than that of MBM under the low workload condition (0.14 flits/node/cycle) for different GT ratios. As the workload increases, the reserved GT channels are not sufficient to accommodate GT traffic, so GT traffic takes the shared buffers and therefore hinders BE traffic. This situation is worse for MBMRC than for MBM. Because MBM can efficiently and flexibly allocate buffers to both GT and BE traffic and thus prevents under-utilization, GT traffic can achieve the best performance while maintaining tolerable average latency. NoC5 and NePA show the same tendency in our observation.
4) Different traffic cases evaluation for QoS:
A detailed performance comparison for various traffic patterns is shown in Table II . The results consistently indicate that the CA mechanism, combined with QoS provision using two PBs, can effectively provide guaranteed bandwidth to GT traffic and ensure its performance, even in the local and hot spot cases.
The following conclusions can be drawn from the simulation results.
• Parallel buffer architecture alleviates congestion and allows reserving channel for GT traffic.
• Adaptive routing approach and congestion-aware mechanism improve overall throughput by well balancing transfer tasks. CA mechanism can be designed to give preference to GT traffic so as to ensure guaranteed bandwidth and performance.
• The extra routing resources from NePA and DMesh effectively enhance QoS provision especially in high workload cases.
• QoS provision was validated across different network platforms, routing algorithms, buffer deployments, GT ratios, workloads and traffic patterns.
Table II OVERALL, GT AND BE TRAFFIC AVERAGE LATENCY COMPARISON BETWEEN NOC5/NEPA/DMESH ARCHITECTURES UNDER VARIOUS TRAFFIC PATTERNS, GT RATIOS AND WORKLOADS (MBM IS USED IN THE
C. Implementation cost evaluation
to perform logic synthesis and analyze hardware cost. The implementation cost of the CA mechanism with a single buffer was presented in [37] . To evaluate the feasibility of QoS provision with the multiple PBs architecture, two PBs per port and four flits per PB are used to estimate the hardware cost. Besides the extra buffers, the routing arbiter has to be modified to provide priority arbitration, which is implemented with a priority multiplexer circuit. Table III lists the implementation cost overhead for both the NePA and DMesh platforms. It shows that NePA QoSCA increases area by 11% and DMesh QoSCA increases area by only 10.7%, indicating that the proposed mechanism can be achieved with a cost efficient modification of the original designs.

VI. CONCLUSION
Current high-performance routers demand congestion management and QoS provision for boosting network performance and supporting differentiated services. Flexible router design and adaptive routing algorithm not only effectively exploit routing resources, but also support more advanced features to accommodate versatile application traffic traces. Fully adaptive routing, parallel buffers and congestion-aware routers alleviate congestion and enhance NoC performance in terms of latency and throughput. QoS-aware routing without multiple PBs provides acceptable performance improvement for GT traffic. Routers with multiple PBs further provide guaranteed transmission performance. Experimental results showed that performance improvement is considerable and implementation cost overhead is moderate for both NePA and DMesh platforms. With alternative links employed between routers, DMesh demonstrated significant performance enhancement for GT and overall traffic.
