Abstract-This paper presents adaptive routing selection strategies suitable for network-on-chip (NoC). The main prototype presented in this paper uses contention information and bandwidth space occupancy to make routing decisions at runtime during application execution. The performance of the NoC router is compared to that of other NoC routers with queue-length-oriented adaptive routing selection strategies. The evaluation results show that the contention- and bandwidth-aware adaptive routing selection strategies perform better than the queue-length-oriented adaptive selection strategies. Messages in the NoC are switched with a wormhole cut-through switching method, where different messages can be interleaved at flit level in the same communication link without using virtual channels. Hence, the head-of-line blocking problem can be solved effectively and efficiently. The routing control concept and the VLSI microarchitecture of the NoC routers are also presented in this paper.

INTRODUCTION AND MOTIVATION
NETWORK-ON-CHIP (NoC) is a feasible communication infrastructure for many-core processor systems because of the scalable bandwidth capacity of the NoC. Currently, there are many research challenges in the field of many-core processor systems, ranging from the abstract application layer down to the physical network layer. In the network layer, optimum network and router architecture design in terms of cost (logic area, power, etc.) and performance (network bandwidth capacity, router latency, etc.) [13] are challenging topics. Among the topics around router architecture, the main part of a network communication infrastructure, switching methods, routing algorithms, quality of service, and flow control have been extensively discussed in the literature. In particular, the routing algorithm affects both the area and the network performance.
In general, a routing algorithm can be deterministic (static) or adaptive. Network designers are motivated to design adaptive routing algorithms by two main objectives, i.e., to avoid entering hotspot links so that communication performance can be increased, and to avoid entering faulty network components (a faulty switch or link). The works in [18], [11], and [16], for instance, propose fault-tolerant adaptive routing algorithms. Network faults can turn a regular network into a nonregular network. The work in [10] presents a fault-tolerant routing algorithm that balances traffic over network faults and the nonregularity caused by network component faults.
The main issue related to adaptive routing algorithms is the deadlock configuration problem caused by cyclic dependency. The works in [4] and [5] present the theory of deadlock-free routing algorithms and formal descriptions of deadlock configurations. Turn models can in principle be used to design a deadlock-free adaptive routing algorithm [6]. A deadlock-free adaptive routing method that addresses the problem of oversized IP component placement in irregular mesh-based networks is presented in [23].
Most routing implementations made at design time use routing tables to route messages (packets). The contents of the routing tables are programmed at design time, and adaptive routing paths are then assigned in every routing table in the network nodes by using some technique. The work in [19], for example, presents an offline (design-time) routing method called "Application-Specific Routing Algorithm" (APSRA), which is used to increase the degree of routing adaptivity for hotspot avoidance. The "Segment-based Routing" (SR) presented in [14] is also an offline routing method, in which the network is segmented into subnets and restrictions are applied to avoid deadlock configurations. Another method is the dynamic routing protocol in [12], which is used to balance the distribution of traffic in NoCs. However, since the aforementioned routing methods [19], [14], [12] use a static or offline routing design approach, they cannot be classified as purely adaptive routing methods.
In most embedded system applications, where most NoC platforms are likely to be used, the intercore communication patterns are known. Therefore, offline (static) congestion-avoiding techniques can be used, resulting in a much simpler router. A runtime (dynamic) adaptive routing method is, however, an interesting approach for future NoC-based multicore embedded systems, where applications may not be known in advance. Indeed, some embedded IC vendors in the multicore era could potentially market not only IP cores but also system architectures [3], onto which many applications can be mapped. Therefore, the implementation of runtime adaptive routing will simplify embedded system production, because designers will no longer need to configure the routing information in the on-chip routers. In this context, however, the runtime techniques incur extra area cost and complexity.
In lookup-table-based routing algorithms, the size of the tables increases with the network size, since all entries must be added to the tables. Some works therefore propose different techniques to reduce the size of the routing tables. The work in [15] presents a region-based routing algorithm aimed at reducing the size of routing tables for NoCs by grouping destinations into network regions. The work in [8] shows a simple data transfer technique that applies local addresses (labels), which are computed offline for each flow in an application (a design-time routing approach).
Sharing the background described in [8], our proposed methodology can also be viewed implicitly as a technique to reduce the number of entries in the routing tables, based on a runtime-variable (dynamic) local message identity (ID) technique that is explained later. In our experiments, all considered traffic can still be routed under several scenarios, although the number of available ID slots per link is set lower than the number of node entries in the NoC. Our methodology can be classified as a runtime distributed routing approach, where routing is performed locally in every NoC router at runtime during application execution.
The remaining sections are organized as follows. Section 2 presents the state of the art of the adaptive routing selection strategies that have been proposed so far for NoCs. Section 3 briefly describes the main contribution of this work. A 2D planar adaptive routing algorithm for the mesh NoC platform is presented in Section 4. Section 4 also shows the different adaptive routing selection functions and the VLSI microarchitecture of the contention- and bandwidth-aware (CBWA) and queue-length-oriented NoC routers. Section 5 shows the performance evaluation of the different adaptive routing selection functions under different traffic scenarios and different network sizes. Sections 6 and 7 present the synthesis results and concluding remarks, respectively.
STATE OF THE ART OF ADAPTIVE ROUTING SELECTION STRATEGIES
Selection Based on FIFO Queue Occupancy (FQO)
A commonly used adaptive routing policy is based on buffer occupancy, where the "congestion information" (CI) of the set of admissible output ports connected to the downstream (next-hop) routers is traced back to the upstream routers. The CI data can be represented as the length of the data queue in the FIFO buffer, which can be indicated by a multiple-bit signal, or as the buffer status (free or busy), which can be indicated by a single-bit signal. These CI signals are used by a packet at the current router to select the best routing direction among the alternative downstream outgoing links at any instant. A "stress value," which indicates how many packets enter the downstream outgoing links per unit time [25], can also be used as alternative CI data for packet-switched routers to make routing decisions. Many works have used this queue-length-oriented adaptive routing selection, such as [9], [17], [7], and [25].
A specific technique to drain messages from hotspot areas, called "Contention-Aware Input Selection" (CAIS), is presented in [24]. Rather than adaptively selecting less congested outgoing links in the downstream directions, the CAIS method focuses on the selection of input ports from the upstream (backtrace) directions. When two or more input ports request the same output port, an arbiter unit at the output port selects the input port that has more waiting packets in its upstream direction. It seems that the adaptive routing path selection is made by the arbitration unit rather than by the routing engine unit.
The work in [2] presents an interesting method that makes adaptive routing selections based on the number of free buffer slots and the availability of buffers in the two-hop adjacent neighbors, called the "Neighbor-on-Path Routing Selection Strategy." However, the main criticism of such a methodology is the problem of unpredictable traffic situations, as shown in Fig. 1.
Packet A will be routed from node (1,1) to node (3,3). By measuring the two-hop neighbor CI, packet A at node (1,1) can overview four alternative paths, i.e., the node-to-node paths (1,1)-(1,2)-(1,3), (1,1)-(1,2)-(2,2), (1,1)-(2,1)-(3,1), and (1,1)-(2,1)-(2,2). However, the header of packet A has only two alternative output selections, i.e., the North or East output port, as depicted in Snapshot 1 of the figure. All four adjacent neighbors send back CI, i.e., the length of the data queues in the FIFO buffers and the buffer status ("free" or "busy"). At the same time, packet headers B and C arrive at node (2,1) via the Local and East input ports, respectively, a situation that packet A cannot know about. In this case, packet A will not use the (1,1)-(1,2)-(1,3) path. As shown in Snapshot 2, we assume that the routing engine finally decides to route packet header A to the East output port and that, at the same time at node (2,1), packet headers B and C are routed to West and North, respectively. Now an unexpected situation occurs: packet A has selected a nonoptimal path because of the unpredictable traffic situation. The same situation could also occur if packet A were routed to the North output port.
Selection Based on Bandwidth-Space Occupancy
Another approach to making adaptive routing decisions is the strategy based on bandwidth-space occupancy. Fig. 2 shows a snapshot of a network situation that illustrates the different orientations of the adaptive routing strategy based on FQO, which can be called a Congestion-Aware Adaptive Routing Selection Strategy, and the strategy based on bandwidth-space occupancy, which can be called a Bandwidth-Aware (BWA) Adaptive Routing Selection Strategy. In the snapshot, the FQO strategy reads the CI traced back from the possible two-hop downstream neighbor routers. In Fig. 2, packet A arriving at the West input port of node (1,2) will be routed to node (3,1). Packet A has three alternative paths to reach node (3,1), i.e., the node-to-node paths (1,2)-(2,2)-(3,2)-(3,1), (1,2)-(2,2)-(2,1)-(3,1), and (1,2)-(1,1)-(2,1)-(3,1). The header of packet A (flit A1) has two alternative output ports at that instant, i.e., East and South. While the header flit of packet A is arriving at node (1,2), at the same router node a payload flit of packet B is arriving from the North input port and a payload flit of packet C is arriving from the South input port. They have acquired the South and East output ports in advance and have reserved 50 and 100 percent of the maximum BW space ($B_{max}$) of those output ports, respectively.
As presented in Fig. 2, the two-hop CI signals are sent back to node (1,2). If packet A reads the two-hop CI signals and buffer availability (like the strategy used in [2]), then packet A will select the South output port as the best output path, because the South output port presents a CI signal value of 1, whereas the East output port presents CI signal values of 1 and 2 for the two consecutive paths that would be formed if packet A were routed to the East port. But if only one-hop CI signals are considered (not presented in the figure), then packet A sees no difference in the hotspot situation, because both the East and South neighbors send back the same queue length (data queue occupancy), i.e., 1 data queue.
The situation presented in Fig. 2 illustrates the main functionality of using the N-hop neighbor CI, leaving aside its drawback mentioned in Section 2.1. However, if packet A simply reads the actual bandwidth (BW) space occupancy of the two alternative output ports at the current node (1,2), it can also see the difference in the hotspot situation. Hence, packet A will also be routed to the South output port when using the BWA strategy, because that port has more free BW space.
CONTRIBUTION
The work in [20] presents a BWA adaptive routing method. However, this method computes adaptive routing paths offline (at design time). A runtime BWA adaptive routing function called AdNoC is presented in [1]. The proposed method selects the output port that has more free bandwidth space. The BWA adaptive routing selection of our NoC, called eXtendable Hierarchical NoC (XHiNoC), follows the same strategy as AdNoC. However, AdNoC uses virtual channel (VC) buffers, leading to a large extra area overhead, while XHiNoC does not use them. The "bandwidth/contention/congestion look-ahead" method used in XHiNoC can compute a routing decision within one cycle period. Meanwhile, the AdNoC implementation requires 4 cycle periods to make a routing decision, leading to routing computation time overhead.
Moreover, the XHiNoC can also implement many of the adaptive output selection strategies or combinations of them. The specific work presented in this paper is the performance evaluation of the congestion-aware, contention-aware, and BWA adaptive routing selection strategies, where routing decisions are made at runtime during application execution. A concept of adaptive routing with the capability to interleave different messages at flit level in the same communication link was introduced in [21]. However, that work makes adaptive routing decisions based on the contention information between alternative output ports.
ALGORITHMS AND MICROARCHITECTURE
Two-Dimensional Planar Adaptive Routing Algorithm
Fig. 3a shows a 2D mesh-planar topology, where the NoC is divided into two subnets, i.e., the $X^{+}$ (increment) subnetwork depicted in solid lines and the $X^{-}$ (decrement) subnetwork depicted in dashed lines. If the target node offset of a packet is $x_{offset} = x_{target} - x_{source} \ge 0$, then the packet will be routed through the $X^{+}$ subnetwork, while a packet with a negative target node offset will be routed through the $X^{-}$ subnetwork. Once a packet is routed to a subnetwork, it will not move to the other subnet. With this routing rule, the minimal planar adaptive routing algorithm is free from cyclic dependency (free from deadlock configurations).
The main advantage of this NoC topology architecture compared to the turn-model approach commonly used in the standard mesh structure is that minimal adaptive routing can be made in all nonzero offset directions with at most two alternative routing directions. As shown in Fig. 3a, for example, when the targets of packets are located in the North-East area (node 1 to 12), South-East area (node 21 to 18), North-West area (node 5 to 8), or South-West area (node 25 to 14), the packets can adaptively use one of three possible paths to reach the target nodes.
Algorithm 1 presents the 2D planar adaptive routing algorithm used for the 2D mesh planar multicast router. The routing algorithm is divided into two subrouting codes for the $X^{+}$ and $X^{-}$ subnetworks in the 2D mesh planar topology. In the $X^{+}$ subnet, the set of output ports that can be selected is {EAST, SOUTH1, NORTH1, LOCAL}. In the $X^{-}$ subnet, the set of output ports that can be selected is {WEST, SOUTH2, NORTH2, LOCAL}.
if X_offs = 0 and Y_offs = 0 then
    Routing = LOCAL
else if X_offs = 0 and Y_offs > 0 then
    Routing = NORTH1
else if X_offs = 0 and Y_offs < 0 then
    Routing = SOUTH1
else if X_offs > 0 and Y_offs = 0 then
    Routing = EAST
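As a rough illustration of the routing rules above, the following Python sketch computes the subnet of a packet and the set of admissible minimal output ports at a hop. It is not the authors' Algorithm 1: the function names are hypothetical, the adaptive cases with two candidate ports (both offsets nonzero) are inferred from the statement that minimal routing offers at most two alternative directions, and the $X^{-}$ branch is assumed to mirror the $X^{+}$ branch.

```python
# Port names per subnet: (X-direction port, north port, south port).
PORTS = {
    "X+": ("EAST", "NORTH1", "SOUTH1"),
    "X-": ("WEST", "NORTH2", "SOUTH2"),
}


def select_subnet(x_source, x_target):
    """Choose the subnet from the X offset (non-negative offsets use X+)."""
    return "X+" if x_target - x_source >= 0 else "X-"


def candidate_ports(x_offs, y_offs, subnet):
    """Return the admissible minimal output ports for the current hop.

    Offsets are target minus current coordinates. Two entries in the
    result mean the selection function may choose adaptively between them.
    """
    x_port, north, south = PORTS[subnet]
    # A packet never changes subnet, so the X offset sign must match it.
    if (subnet == "X+" and x_offs < 0) or (subnet == "X-" and x_offs > 0):
        raise ValueError("X offset does not match the subnet")
    ports = []
    if x_offs != 0:
        ports.append(x_port)
    if y_offs > 0:
        ports.append(north)
    elif y_offs < 0:
        ports.append(south)
    # Both offsets zero: the packet has reached its target node.
    return ports or ["LOCAL"]
```

When the list holds two ports, one of the adaptive selection functions would pick between them; with a single entry the route is forced.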
Local ID-Based Data Multiplexing
We use a wormhole cut-through switching technique [22], where flits of different messages can be interleaved at flit level and share the same communication medium based on a locally organized message identity. Flits belonging to the same message always have the same local ID-tag when acquiring a communication medium (network link). The wormhole messages can be interleaved at flit level because every flit carries a local ID-tag, which changes dynamically from link to link, to differentiate it from flits that belong to other packets in the same link. This switching scheme results in a special routing paradigm, in which the on-chip switch "routes flits instead of packets."
The local ID tag of a message is updated by an ID management (IDM) unit implemented at the output port when the message enters a new communication channel. With this kind of wormhole switching, the head-of-line blocking problem that commonly occurs in traditional wormhole switching can be solved without implementing VCs. The ID-based switching method performs as well as the VC-based method, where the VC flow control interleaves packets from different VCs. Compared to traditional VC-based wormhole switching, the ID-based method demands less area, because it enables us to implement a single two-entry buffer per port. Data buffers significantly increase not only logic area but also power dissipation. A discussion of the area comparison between the ID-based router and VC-based routers can also be found in our previous paper [22]. That paper shows a very significant area overhead of a NoC with VCs compared to our XHiNoC with the ID-based method (without VCs) in the same CMOS technology.
The concept of the wormhole switching is depicted in Fig. 4, where different messages can be interleaved in the same buffer pool, i.e., they can virtually cut through at flit level. Each message reserves one ID slot in order to be able to use the link. Consequently, the contention information of an output port can be obtained by counting the number of reserved ID slots in the link. Moreover, if the ID slot reservation is accompanied by a bandwidth reservation, then a BWA adaptive routing selection strategy can also be implemented in our NoC.
Fig. 4b exhibits two example cases. The first case (upper figure) shows a flit interleaving where the total bandwidth consumption of all messages is 100 percent of the maximum link bandwidth capacity ($B_{max}$), i.e., each of the four messages consumes 25 percent of $B_{max}$. In the second case, 57.5 percent of $B_{max}$ has been consumed by all packets, i.e., packet A consumes 20 percent and packets B, C, and D consume 12.5 percent of $B_{max}$ each. Thus, there is still 42.5 percent free BW that can be used by other wormhole packets arriving at the link. Fig. 4c shows the flow of data when using the traditional packet switching method. Each packet header is blocked and must wait until all flits of the previously switched packet have been forwarded. Due to the head-of-line blocking problem, the latency in traditional packet switching tends to increase exponentially as the packet size or the number of packets increases. This situation does not occur in our NoC, where the flits of different packets can be interleaved with each other as depicted in Fig. 4b. This results in a better network latency characteristic, where the latency tends to increase linearly with the number of flits for certain traffic scenarios.
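The bandwidth figures quoted for the two cases of Fig. 4b can be checked with a few lines of Python. This is only an arithmetic sketch; the helper name `link_usage` and the use of percent of $B_{max}$ as the unit are choices made here for illustration.

```python
def link_usage(reserved_percent, b_max=100.0):
    """Return (used, free) bandwidth on a link, both in percent of B_max."""
    used = sum(reserved_percent)
    return used, b_max - used


# Case 1: four messages, each reserving 25 percent of B_max.
case1 = link_usage([25.0, 25.0, 25.0, 25.0])

# Case 2: packet A reserves 20 percent; packets B, C, and D reserve 12.5 percent each.
case2 = link_usage([20.0, 12.5, 12.5, 12.5])
```

Here `case1` is `(100.0, 0.0)` and `case2` is `(57.5, 42.5)`, matching the fully loaded link and the 42.5 percent of free BW described in the text.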
The wormhole packets reserve a local ID slot from the ID slot table and program a routing output direction in the routing reservation table (RRT) at runtime during application execution. To enable these mechanisms, the XHiNoC uses a special packet format for the wormhole packets, which introduces a flit type bit field (besides the ID-tag bit field) in every flit of the wormhole packets. The ID-based architecture gives more significant benefits if packets are very long.
Adaptive Routing Selection Functions
Five router implementations are presented, distinguished by the information considered to make routing decisions and by the viewpoint of our NoC microarchitecture. The three kinds of information considered are described in the following:
• Identity (ID) slot occupancy (the number of free ID slots). This information can also be called the Contention Information of an output port, i.e., the number of messages that have contended (competed) so far to access the output port. Since our router can interleave different wormhole messages at flit level in the same link without using VCs, the number of reserved ID slots represents the number of wormhole messages that have been mixed in the outgoing link.
• BW space occupancy (the number of free BW spaces). This information can also be called the BW-Reservation Information of an output port, i.e., the number of BW spaces that have been reserved by messages to access the output port.
• Buffer space occupancy (the number of data queued in a FIFO buffer). This information can also be called the CI of an output port, i.e., the queue length in the FIFO buffer at the input port of the next neighbor switch connected directly to the output port.
BW-ID Version
This prototype uses two information signals to make routing decisions. The first-priority signal is the number of reserved bandwidth spaces, and the second is the number of used ID slots (ID slot occupancy). This adaptive routing strategy can be called a Contention- and BWA Adaptive Routing Selection Strategy.
Messages are routed to the output direction that has fewer reserved bandwidth spaces. If the numbers of reserved BW spaces of the two alternative output ports are equal, then the second-priority signal is used, i.e., the messages are routed to the output direction that has fewer reserved ID slots.
FQ-ID Version
This prototype also uses two information signals to make routing decisions. The first-priority signal is the number of used buffer spaces, and the second is the ID slot occupancy. This adaptive routing strategy can be called a Contention- and Congestion-Aware (CCA) Adaptive Routing Selection Strategy.
Messages are routed to the output direction that has fewer used buffer spaces. If the FIFO queue occupancies of the two alternative output ports are equal, then the second-priority signal is used, i.e., the messages are routed to the output direction that has fewer reserved ID slots.
BW Version
This prototype uses a single information signal to make routing decisions. This adaptive routing strategy can be called a BWA Adaptive Routing Selection Strategy. The BW version of the adaptive routing selection function is simpler than the BW-ID version, because the usedID signals from both alternative output ports are removed from the selection mechanism. Messages are routed to the output direction that has fewer reserved BW spaces.
FQ Version
This prototype uses a single information signal to make routing decisions. Messages are routed to the output direction that has fewer used FIFO buffer spaces. This adaptive routing strategy can be called a Congestion-Aware Adaptive Routing Selection Strategy.
ID Version
This prototype also uses a single information signal to make routing decisions. This adaptive routing strategy can be called a Contention-Aware Adaptive Routing Selection Strategy. In this router prototype, messages are routed to the output direction that has fewer reserved local ID tags.
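The five selection functions differ only in which signals they compare and in what priority order, which can be sketched compactly in Python. The names `PortState`, `SELECTION_KEYS`, and `select_port` are hypothetical, the numeric values in the example are illustrative, and the tie-breaking rule (keep the first candidate when all compared signals are equal) is an assumption, since the text does not state one.

```python
from dataclasses import dataclass


@dataclass
class PortState:
    used_bw: int    # reserved BW spaces on the output port
    used_id: int    # reserved local ID slots on the output port
    queue_len: int  # FIFO occupancy at the downstream input port


# Each version maps a port state to a tuple that is compared lexicographically,
# so the first element is the first-priority signal.
SELECTION_KEYS = {
    "BW-ID": lambda p: (p.used_bw, p.used_id),
    "FQ-ID": lambda p: (p.queue_len, p.used_id),
    "BW": lambda p: (p.used_bw,),
    "FQ": lambda p: (p.queue_len,),
    "ID": lambda p: (p.used_id,),
}


def select_port(version, candidates):
    """Pick the output port with the smallest key for the given version.

    candidates maps port names to PortState; on a full tie the first
    candidate in insertion order is kept (assumed tie-break).
    """
    key = SELECTION_KEYS[version]
    return min(candidates, key=lambda name: key(candidates[name]))


# Two candidate ports with equal queue lengths but different BW reservations.
ports = {"EAST": PortState(100, 1, 1), "SOUTH1": PortState(50, 1, 1)}
bw_choice = select_port("BW", ports)
fq_choice = select_port("FQ", ports)
```

In this example the BW version chooses `SOUTH1`, while the FQ version sees a tie and falls back to the first candidate, `EAST`. Comparing tuples lexicographically gives the same ordering as comparing fixed-width counters concatenated with the first-priority signal in the most significant bits.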
Router Microarchitecture
The microarchitectures of the NoC routers that use the CBWA and CCA adaptive routing selection strategies are presented in Fig. 5. For the sake of simplicity, only the router components in the East input port and in the West output port are depicted. The router is designed based on a 2D mesh-planar topology, where each router has seven IO ports, i.e., the East, North1, North2, West, South1, South2, and Local ports. The crossbar interconnect is customized to optimize the logic area of the router based on the allowed turns in the 2D planar adaptive routing algorithm. The remaining internal IO connections of the router, which represent the prohibited turns, are removed from the architecture.
The set of components at each input and output port n comprises the FIFO buffer, the Routing Engine with Data Buffering (REB), the Multiplexor with ID Management unit (MIM), and the Arbiter unit. Based on the crossbar interconnects shown in Fig. 5, each port name is assigned a port number as follows: East (1), North (2), West (3), South (4), North2 (5), South2 (6), and Local (7).
The set of subcomponents in the REB module at output port n comprises the Routing State Machine (RSM), the Route Buffer, the RRT, and the Grant Controller (GC). In the REB unit, the combination of the RSM, in which the planar adaptive routing algorithm (Algorithm 1) is implemented, and the RRT supports the runtime adaptive routing mechanism. The GC unit controls the read operation of the FIFO Buffer, and the Route Buffer stores the data that will be routed to an output port.
Only a little effort is needed to reconfigure the microarchitecture of the XHiNoC at design time from the BW-ID to the FQ-ID version. The required modifications are:
1. add new output and input ports for the queue-length signal from the FIFO buffer,
2. replace the b(i, j) (BW) signal paths with q(i, j) (QL) signal paths,
3. replace the RSM with a new RSM, and
4. remove the BW accumulator unit.
Packet Format
The detailed packet format and the control bits used in the XHiNoC architecture for the CBWA and BWA adaptive routers are presented in Fig. 6. In our NoC, rather than being split into packets, a message is split into several flits. Hence, a single message (short, long, or even very long) can be regarded as a single packet, which consists of a single header flit, payload data flits, and a single tail flit. At the bottom of Fig. 6, we can see examples of a short message (four flits) and a very long message (N flits). Even if the message is a very long stream of data, it has only a single header flit and a single tail flit, and it is not divided into packets.
Formally, each flit can then be defined as $F(type, k)$, where $type \in \{header, databody, tail\}$. A single flit is 39 bits wide: 32 bits for the dataword plus 7 extra bits, i.e., a 3-bit field to define the flit type and a 4-bit field to hold the local identity label or ID-tag $k$ of a message. With a 4-bit ID field, we have $2^{4} = 16$ ID-tags, such that $k \in \{0, 1, 2, \ldots, M\}$ with $M = 15$.
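To make the field widths concrete, the sketch below packs and unpacks a flit with a 3-bit type field, a 4-bit ID-tag, and a 32-bit dataword (3 + 4 + 32 = 39 bits). The bit ordering (type in the most significant bits) and the numeric type codes are assumptions made only for this illustration; the actual layout is the one given in Fig. 6.

```python
# Hypothetical type encodings; the real codes are defined by the packet format.
FLIT_TYPES = {"header": 0, "databody": 1, "tail": 2}
TYPE_NAMES = {code: name for name, code in FLIT_TYPES.items()}


def pack_flit(flit_type, id_tag, data):
    """Pack (type, ID-tag, dataword) into a 39-bit integer."""
    if not 0 <= id_tag < 2 ** 4:
        raise ValueError("ID-tag must fit in 4 bits")
    if not 0 <= data < 2 ** 32:
        raise ValueError("dataword must fit in 32 bits")
    # Assumed layout: [type: 3 bits | ID-tag: 4 bits | dataword: 32 bits]
    return (FLIT_TYPES[flit_type] << 36) | (id_tag << 32) | data


def unpack_flit(flit):
    """Recover (type, ID-tag, dataword) from a packed 39-bit flit."""
    type_code = (flit >> 36) & 0x7
    id_tag = (flit >> 32) & 0xF
    data = flit & 0xFFFFFFFF
    return TYPE_NAMES[type_code], id_tag, data


flit = pack_flit("tail", 15, 0xDEADBEEF)
fields = unpack_flit(flit)
```

The round trip returns the original fields, and every packed value stays below $2^{39}$.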
An extra 12-bit field in the header and tail flits carries the expected communication bandwidth. When the header flit of a message flows through an output port of a multiplexor, the value in this field is used to reserve BW for the message on the output port. The tail flit is used to remove the BW reservation.
BW Management
This section briefly explains how BW reservation can be managed at runtime. When multiple applications are running concurrently, the network may become saturated, and the BW accumulator (BW occupancy) register on a link may not be able to cover the total BW of the packets flowing through the link. For $N_{app}$ applications, there is a probability that the total BW occupancy on a link exceeds the maximum value of the BW accumulator. BW management should guarantee that this problem cannot occur. A simple solution is to set the maximum data rate of the communications in an application equal to the maximum value of the BW accumulator divided by the total number of applications. However, the final optimum solution for this issue is not discussed further in this paper.
The width of the required-BW ($ReqBW$) field determines the resolution of the BW space in each outgoing port. For a $q$-bit $ReqBW$ field, the resolution of the BW space is $2^{q}$. Hence, when $q = 12$ as set in Fig. 6, the number of BW levels that messages can use to reserve BW space at the output port is $2^{12} = 4{,}096$. If we use megabytes per second (MB/s) as the unit of the required BW and the maximum capacity of the link is, for instance, 4,096 MB/s, then a required BW of 80 MB/s, for example, gives the binary $ReqBW$ signal [000001010000].
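The ReqBW encoding and the simple per-application rate cap mentioned for BW management can be illustrated as follows. The helper names are hypothetical, and the accumulator limit passed to `per_app_rate_cap` is an example value rather than a figure taken from the implementation.

```python
def encode_req_bw(req_bw, q=12):
    """Encode a required BW value as a q-bit binary string."""
    if not 0 <= req_bw < 2 ** q:
        raise ValueError("required BW does not fit in the ReqBW field")
    return format(req_bw, "0{}b".format(q))


def per_app_rate_cap(acc_max, n_app):
    """Simple cap: maximum accumulator value divided by the number of applications."""
    return acc_max // n_app


signal = encode_req_bw(80)          # 80 MB/s in a 12-bit field
cap = per_app_rate_cap(4096, 4)     # example: four concurrent applications
```

With these inputs, `signal` is `"000001010000"`, the pattern quoted in the text, and `cap` is 1024.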
However, for practical use in a specific application that does not require fine BW resolution, the width of the ReqBW field can be reduced to a reasonable width. This reduction can also reduce the logic area and static power of the on-chip router.
Routing Slot Reservation
Fig. 7a shows in detail how the header flit of a packet reserves a routing slot in the RRT unit at the West input port. The header with ID-tag 3 arrives at the West input port. The REB unit routes the header flit and buffers it in its data register. In the same cycle, the RSM unit computes the requested routing direction based on the target address written in the header bit fields. The RSM unit selects one of the two alternative output ports based on two signals indicating the number of used ID slots (usedID) and used BW spaces (usedBW) in the two alternative output ports. Both signals are concatenated (usedBW&usedID) by the RSM unit, and the output port with the smaller concatenated value is selected as the best output direction. The routing output computed by the RSM unit is then stored in slot number 3 of the RRT unit (in accordance with the ID-tag of the header flit). The routing output decision is controlled by the type field of a flit via a two-input multiplexor. If the flit type is a header, then the routing decision is computed by and fetched from the RSM unit.
When a databody flit belonging to the same message flows through the REB component, as presented in Fig. 7b, the routing direction is indexed by the databody flit using its ID-tag. The routing direction is thus fetched directly from the RRT. The header with ID-tag 3 belongs to the same message as the databody flit with ID-tag 3, because they have the same ID-tag number. When a tail flit flows through the REB component, the same index operation takes place as for the databody flit, but in the same cycle the routing direction is removed from the RRT.
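The header, databody, and tail behavior of the routing reservation table can be summarized in a small behavioral model. This is a software sketch under assumptions, not the hardware design: the class and method names are hypothetical, the table is assumed to have one slot per ID-tag, and the RSM is represented by a callback that returns the chosen direction.

```python
class RoutingReservationTable:
    """Behavioral model of an RRT indexed by the local ID-tag of a flit."""

    def __init__(self, num_slots=16):
        self.slots = [None] * num_slots

    def route(self, flit_type, id_tag, compute_route=None):
        if flit_type == "header":
            # Header: the routing decision comes from the RSM and is stored.
            direction = compute_route()
            self.slots[id_tag] = direction
            return direction
        # Databody and tail: the direction is fetched using the ID-tag.
        direction = self.slots[id_tag]
        if direction is None:
            raise KeyError("no routing reservation for this ID-tag")
        if flit_type == "tail":
            # Tail: same lookup, then the reservation is released.
            self.slots[id_tag] = None
        return direction


rrt = RoutingReservationTable()
first = rrt.route("header", 3, compute_route=lambda: "EAST")
second = rrt.route("databody", 3)
last = rrt.route("tail", 3)
```

All three flits of the message leave in the same direction, and slot 3 is free again after the tail flit has passed.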
Bandwidth Space and ID-Slot Reservations
In the output port, there are two main components, i.e., an arbiter unit and a multiplexor with ID management unit (MIM). The IDM consists of an ID slot table, a BW accumulator, and an ID accumulator. Fig. 8a shows how the IDM unit allocates a new local ID slot to a header flit as its new ID tag. The ID tag of the header flit is 0, and the flit requests a communication rate of 80 MB/s. First, when a header flit type is detected, a free ID slot is searched for. As shown in Fig. 8a, the new ID slot (IDN) 3 is found free and is then used as the new ID tag for the header flit. In the same cycle, the previous ID tag of the header and the port from which the header flit comes are written into slot number 3. The select signal set by the arbiter unit determines from which port the header flit comes. The BW accumulator unit increments the actual reserved BW spaces (the increment is equal to the required BW of the header). Meanwhile, the ID accumulator unit also increments the number of reserved ID slots from 2 to 3 ($usedID \leftarrow usedID + 1$).
When a databody flit (ID-tag 0) belonging to the same message as the previous header flows through the MIM component, as presented in Fig. 8b, the IDM unit checks the current ID-tag of the databody flit and the port from which it comes. As shown in Fig. 8b, the pair of both signals (ID-tag 0, from L port) is detected in ID slot number 3; thus, the databody flit also receives ID-tag number 3, which is the same as the ID of the previously switched header (see Fig. 8a). When a tail flit flows through the MIM component, the same operation takes place as for the databody flit, but in the same cycle the BW reservation is subtracted from the BW accumulator and the ID reservation is removed from the ID slot table.
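The ID slot table and the two accumulators can be modeled in a few lines to show how the contention (usedID) and bandwidth (usedBW) signals evolve. This is a behavioral sketch only: the class name is hypothetical, the search policy (lowest-numbered free slot) is an assumption, and the tail flit is assumed to carry the same BW value as the header so that the reservation can be subtracted.

```python
class IdManager:
    """Behavioral model of the IDM unit at one output port."""

    def __init__(self, num_slots=16):
        self.table = [None] * num_slots  # slot -> (previous ID-tag, input port)
        self.used_bw = 0                 # BW accumulator
        self.used_id = 0                 # ID accumulator

    def process(self, flit_type, old_id, in_port, req_bw=0):
        """Return the new local ID-tag of a flit on this output link."""
        key = (old_id, in_port)
        if flit_type == "header":
            # Assumed policy: take the lowest-numbered free slot.
            slot = self.table.index(None)
            self.table[slot] = key
            self.used_bw += req_bw
            self.used_id += 1
            return slot
        # Databody and tail: find the slot reserved by the header.
        slot = self.table.index(key)
        if flit_type == "tail":
            self.table[slot] = None
            self.used_bw -= req_bw
            self.used_id -= 1
        return slot


idm = IdManager()
tag_a = idm.process("header", 0, "L", req_bw=80)
tag_b = idm.process("header", 0, "W", req_bw=40)
body_a = idm.process("databody", 0, "L")
```

After these three flits, the two messages hold different local ID-tags on the link even though both arrived with ID-tag 0, because the slot is keyed by the pair of previous ID-tag and input port.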
EXPERIMENTAL RESULTS
In this section, the five NoC prototypes with different adaptive routing selection strategies are simulated. The adaptive routing of all five prototypes is minimal, which means that messages are never routed away from their destination node. Thus, a message has at most two alternative routing directions at each intermediate node.
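The bound of two alternative directions follows directly from minimality in a 2D mesh, as the following sketch illustrates. The function name and the port labels are hypothetical; the orientation (east increases the first coordinate, north increases the second) is an assumption chosen to match a node (1, 1) at the south-west edge.

```python
def minimal_directions(current, destination):
    """Return the output directions that move a flit closer to its target.

    Nodes are (x, y) tuples in a 2D mesh. At most one horizontal and one
    vertical direction can reduce the distance, so the result has at most
    two entries; "L" (local port) is returned at the destination itself.
    """
    cx, cy = current
    dx, dy = destination
    directions = []
    if dx > cx:
        directions.append("E")
    elif dx < cx:
        directions.append("W")
    if dy > cy:
        directions.append("N")
    elif dy < cy:
        directions.append("S")
    return directions or ["L"]
```

An adaptive selection strategy only has a choice to make when this list contains two entries; with one entry the route is forced.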
Two performance metrics are used to evaluate the NoCs, i.e., the average bandwidth and the tail flit acceptance latency measured on each target node. We also measure the injection and acceptance rates in every cycle on each communication pair (both the source and target nodes), and we present the distribution of the BW reservations for each scenario to give an overview of the hotspot locations in the network during simulation. The performance measurements are interesting in our NoC context because of the specification of the packet format and the use of bandwidth-oriented adaptive routing. Thus, transient and steady-state behaviors of the NoC can be analyzed in detail.
Fig. 9 shows the tail flit acceptance (latency) measurement in clock cycles under the transpose scenario in a 4 × 4 mesh network, in which a source node located at $(i, j)$ sends packets to a target node located at $(j, i)$, where $i \neq j$. Hence, there are 12 communication pairs with 12 tail flit latency measurements ($L_k$, $k \in \{1, 2, \ldots, 12\}$). In the simulation, we measure the number of clock cycles needed to receive the tail flit at each target node. The average latency is then formulated as
$$L_{\mathrm{avg}} = \frac{1}{12} \sum_{k=1}^{12} L_k .$$
The measurements are made for eight different injection rates, given in flits per cycle (fpc); a rate of $1/S$ fpc means that one flit is injected at each source node every $S$ cycles. We inject 500 flits from each source node and measure the NoC performance for each data injection rate. Thus, the latency of the 500th flit is measured and presented in the figure. In this case, we set node $(1, 1)$ at the south-west edge as node number 1 and node $(4, 4)$ at the north-east edge as node number 16. Hence, node 1, node 6, node 11, and node 16, whose node addresses satisfy $i = j$, give zero acceptance latency because these nodes do not send or receive messages.
Transpose Scenario in a 4 × 4 Mesh Network
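The transpose traffic pattern and the latency average can be checked with a short sketch. The helper names are hypothetical, and the node numbering (counting along the first coordinate within each row) is an assumption; it is consistent with node (1, 1) being node 1 and node (4, 4) being node 16, and it does not affect the diagonal nodes.

```python
def node_number(i, j, width=4):
    """Map mesh address (i, j) to a node number (assumed numbering)."""
    return (j - 1) * width + i


def transpose_pairs(width=4):
    """List (source, target) addresses with target (j, i) and i != j."""
    return [
        ((i, j), (j, i))
        for i in range(1, width + 1)
        for j in range(1, width + 1)
        if i != j
    ]


def average_latency(latencies):
    """Arithmetic mean of the per-pair tail flit latencies L_k."""
    return sum(latencies) / len(latencies)


pairs = transpose_pairs()
idle_nodes = sorted(node_number(i, i) for i in range(1, 5))
print(len(pairs))    # 12
print(idle_nodes)    # [1, 6, 11, 16]
```

The 16 nodes minus the 4 diagonal nodes give the 12 communication pairs over which the latency is averaged.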
As depicted in Fig. 9, when the NoC is not saturated (the injection rate is lower than the rate at which the NoC becomes saturated), the latency tends to increase as the injection rate is decreased. As the injection rate becomes higher and starts to saturate the NoC, however, the latency tends to converge to a certain value. When the NoC is saturated, the link-level flit flow control used in our NoC maintains the continuity of data injection at the source node and dynamically follows the variable data rate at the target node. This unique performance characteristic is achieved by the specific wormhole cut-through switching method [22], which switches and routes packets flit by flit and interleaves them with each other at flit level. From Fig. 9, we can also see that the performance of the congestion-aware (FQ-version) adaptive routing technique is lower than that of the other adaptive routing strategies.
The average tail flit latency under the transpose scenario in the 4 × 4 mesh NoC for different workload sizes is shown in Fig. 10. The initial injection time on every data producer node is set randomly for the three scenarios. Figs. 10a, 10b, and 10c show the simulation results for three different injection-rate settings. As shown in the three subfigures, the latency increases linearly as the workload size is increased, and the same trend is obtained for the different settings of the data injection rates. This is again a unique performance characteristic of our NoC, which uses the novel wormhole cut-through switching method [22].
The bandwidth space occupancy for every output port of all network nodes is presented in Fig. 11. The simulation result is obtained by randomly applying different injection rates, drawn from a predefined set of rates, to the source nodes. Different initial injection times could give different performance evaluation results because correct decisions on an optimal routing direction depend strongly on the dynamic neighbor states, i.e., the FIFO buffer occupancy, the bandwidth space, and the ID slot reservations of the link at a certain instant of time, as explained in Section 2. The experimental result presented in this section is one of many simulations that could be run to test the performance of the five selected adaptive router prototypes. The following section will describe another simulation result in a larger network size with a bit-complement data distribution scenario.
SYNTHESIS RESULTS
The synthesis results of the five adaptive NoC routers with different routing selection functions are presented in Table 1.
The NoC routers are synthesized using a 130-nm CMOS standard-cell library from Faraday Technology. The target data frequency for the five adaptive NoC router prototypes is 1 GHz. The table presents the total logic cell area and the estimated dynamic power (net switching and cell internal power). We can see in the table that the BW-ID version of the BWA adaptive routers has a larger logic cell area and a higher power consumption than the other prototypes.
In Table 1, we can also see that the BW-version of the BWA adaptive router has a larger logic cell area than the FQ-version. The area overhead is due to the bandwidth accumulator unit, which is integrated in each crossbar multiplexor component of the router together with the IDM unit. As presented in the table, the ID-version of the adaptive NoC router has the smallest logic cell area among the adaptive NoC prototypes.
CONCLUSIONS
The CBWA adaptive NoC routers, which select the best outgoing port at runtime based on the bandwidth occupancy and the number of free reservable ID slots, have been presented in this paper.
The awareness of the routing engine units of the number of free bandwidth spaces at alternative outgoing ports is aimed at avoiding congestion situations, in which the bandwidth capacity of communication channels is overloaded. In any case, the BWA adaptive routing selection strategy helps to balance the utilization of the total NoC bandwidth capacity provided by the overall communication channels. The CBWA adaptive routing method considers not only the bandwidth space occupancy but also the number of messages contending to acquire the alternative output ports. Hence, the CBWA adaptive routing method would theoretically make efforts to balance the distribution of traffic on the NoC links.
The BWA adaptive routing selection strategy could potentially be used in heterogeneous NoC-based multiprocessor systems, especially where several processing element cores inject data into the NoC at different injection rates. The differences are due to the application requirements, the maximum rate of the tile processors, and the complexity of the tasks executed in each tile processor.
