Abstract: Real-time (RT) communication support is a critical requirement for many complex embedded applications that currently target Network-on-Chip (NoC) platforms. In this paper, we present novel methods to efficiently calculate worst case bandwidth and latency bounds for RT traffic streams on wormhole-switched NoCs with arbitrary topology. The proposed methods apply to best-effort NoC architectures, with no extra hardware dedicated to RT traffic support. By applying our methods to several realistic NoC designs, we show substantial improvements (more than 30 percent in bandwidth and 50 percent in latency, on average) in bound tightness with respect to existing approaches.
INTRODUCTION
The Network-on-Chip [1], [2] paradigm has emerged in recent years to overcome the power and performance scalability limitations of point-to-point signal wires, shared buses, and segmented buses [1], [2], [3], [4], [5], [6]. While the scalability and efficiency advantages of NoCs have been demonstrated on many occasions, their timing predictability and suitability to transport real-time communication are still a source of technical concern.
Many applications have strict requirements on latency and bandwidth of on-chip communication, which are often expressed as real-time constraints on traffic flows. On a NoC fabric, this translates to guaranteed quality of service (QoS) requirements for packet delivery. Different approaches have been used to support guaranteed QoS for NoCs: priority-based switching schemes [7], time-triggered communication [8], time-division multiple access [9], and many variations thereof. All these approaches require the use of special hardware mechanisms and often come with strict service disciplines that limit NoC flexibility and penalize the average performance to provide worst-case guarantees. In fact, NoC prototypes are often classified as being either best-effort (BE) or guaranteed-service (GS), depending on the availability of hardware support for RT traffic.
Our work takes a new viewpoint. We consider best-effort NoC architectures without special hardware support for QoS traffic. We only assume that the traffic injected by the network's end-nodes is characterized in terms of worst case behavior. We then formulate algorithms to find latency and bandwidth bounds on end-to-end traffic flows transported by such a best-effort wormhole NoC fabric. For applications with traffic streams that have RT latency and/or bandwidth constraints, it is critical to be able to bound the maximum delay and minimum injectable bandwidth for packets of such streams. This helps in choosing topologies that meet the RT constraints with minimum area and power overhead and with optimum utilization of resources. Our approach is inspired by the work by Lee et al. [10] for traditional multiprocessor networks, and extends it in several directions. We propose two different methods for characterizing worst case performance. The first method, Real-Time Bound for High-Bandwidth traffic (RTB-HB), is conceived for NoCs supporting workloads where injected flows have high average bandwidth demands and require a guaranteed minimum bandwidth under worst case traffic (mBW) and a guaranteed upper bound on NoC traversal latency (UB). In this case, we do not assume any a priori regulation on the traffic injection rate; a core can send packets at any time, as long as the network has buffer capacity to accept them.
The second method considers applications with latency-critical flows that require low and guaranteed UB values, but have moderate bandwidth requirements, and thus can send packets at intervals no shorter than a minimum permitted interval, which implies a maximum bandwidth (MBW) limitation. This method, called Real-Time Bound for Low-Latency traffic (RTB-LL), requires a very simple traffic regulation at network injection points. RTB-LL is a significant improvement to the WCFC bound proposed in [10], while RTB-HB is completely new. Table 1 compares typical values for upper bound delay and bandwidth of the RTB-HB, RTB-LL, and WCFC methods. In [11] we presented these methods in their basic modes of operation. In this paper, we extend the methods to be more comprehensive, considering more generic NoC models, supporting various modes of operation, and reporting more experiments. In particular, this paper makes several new and important contributions beyond our earlier work in [11].
The remainder of the paper is organized as follows: Section 2 summarizes related work. Section 3 gives definitions and basic concepts. Section 4 describes RTB-HB and RTB-LL methods. Section 5 focuses on experimental results and quantitative comparisons. Section 6 describes the time complexity of the proposed methods. Finally, Section 7 concludes the paper.
RELATED WORK
The body of knowledge on macroscale RT networks is extensive and an overview of the state of the art is beyond the scope of this work. The interested reader is referred to [12], [13], [14], [15], [16], [17], [18], [19], [20], [21]. Here, we focus on RT-NoCs, which have often been called guaranteed-service or QoS-enabled NoCs.
QoS is an important issue in many application domains such as multimedia, aerospace, healthcare, and military. Many of these applications have one or more traffic flows that have real-time requirements and need hard QoS guarantees. Two major parameters that account for QoS guarantees in NoCs are worst case delay and worst case bandwidth. They are sometimes referred to as the upper bound delay and lower bound bandwidth of flows. Historically, designers have focused on extracting the average delay and average bandwidth, and a large body of work to extract such parameters exists [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33]. Simulation and mathematical modeling are two different approaches to do so. While simulation is widely used in many situations, it is time consuming and it gives limited insight into sensitivity to traffic parameters and worst case conditions. In contrast, devising accurate mathematical models of a system is complicated, but if such models can be extracted, they are usually computationally efficient and insightful. Therefore, they can be used within design tools, for example, to iterate in the NoC synthesis process to tailor NoC architectures for specific applications. Frequently used mathematical frameworks are queuing theory and statistical timing analysis [25], [26], [28], [31]. A node is modeled as a queuing system, which can be M/G/1, M/G/1/t, G/G/1, etc. The NoC is then modeled by interconnecting a number of queues, and the parameters are extracted using standard solutions from queuing theory. In these models, applicability and accuracy are the main concerns.
In order to provide QoS, some NoC architectures use special hardware mechanisms. They are known as Guaranteed Service NoCs, as opposed to Best Effort NoCs. To distinguish briefly, GS NoCs commit to a performance level for one or more flows (typically latency or bandwidth). Hard or soft QoS can be provided, depending on whether the NoC strictly enforces the desired performance level or merely strives to achieve it. GS NoCs can leverage resource reservation or priority-based scheduling mechanisms. The former technique usually achieves hard QoS, but the resource utilization may be poor because reserved resources are underutilized. The latter may achieve better resource utilization, as resources are used on demand in a best-effort fashion with priorities, but generally only ensures soft QoS, and problems like starvation of low priority flows may occur. GS NoCs require extra hardware complexity with respect to BE NoCs to support redundant resources or priority mechanisms. On the other hand, the performance of flows in a GS NoC can be more easily characterized. In a pure BE NoC, analyzing the temporal behavior of the flows is very complex, due to the large number of contentions that may block a packet of a flow several times along its journey to the destination. Simulation is also very complicated and time consuming, as identifying the worst case scenario, and forcing the network to operate in such a worst case situation, is extremely difficult, if not impossible. Thus, the modeling approach may be an option; however, due to extreme complexity, until recently there have been no applicable approaches to model the performance parameters in worst case situations. Thus, most of the efforts to provide QoS have been in the context of GS or combinations of GS and BE NoCs.
In [9], Goossens et al. present the AEthereal NoC, which combines GS with BE. The MARS [34], aSoc [35], and Nostrum [36] architectures use time-division multiple access (TDMA) mechanisms to provide real-time guarantees on packet-switched networks. The aelite NoC [37] provides a GS and scalable TDMA-based architecture, using mesochronous or asynchronous links. In Shi and Burns [7], a priority-based wormhole switching scheme for scheduling RT flows is presented. In [8], Paukovits and Kopetz propose the concept of a predictable Time-Triggered NoC (TTNoC) that realizes QoS-based communication services. Diemer and Ernst in [38] introduce Back Suction, a flow control scheme to implement service guarantees using a prioritized approach between BE and GS services. In [39], Hansson et al. provide latency and throughput guarantees based on a data flow analysis technique that determines the required buffer size at network interfaces, and which is applicable to the AEthereal NoC. Many other works have been published with variations on these basic ideas [40], [41], [42], [43], [44], [45], [46], [47], [48], [49].
However, most NoC architectures are of the best-effort type [50] and do not have special hardware mechanisms to guarantee QoS. To the best of our knowledge, there are only a few works that calculate worst case bandwidth and delay values for a BE NoC. In [51], the lumped link model was proposed, where the links a packet traverses are lumped into a single link. This model does not distinguish direct contention (due to arbitration losses) from indirect contention (due to full buffers ahead along the path); thus, the estimated bounds are pessimistic. In [52], Qian et al. provide a method based on network calculus [53], [54] to calculate real-time bounds for NoCs; the method uses service curves and arrival curves that characterize the service characteristics of switches and injected traffic. Extracting arrival curves is not a straightforward task for many applications. Thus, for an arbitrary injected traffic load, traffic regulators may be needed to make sure that the amount of injected traffic in a specified period does not exceed a specified level. In [55], a buffer optimization problem is solved under worst case performance constraints based on network calculus. In [56], Bakhouya et al. also present a model based on network calculus to estimate the maximum end-to-end delay and buffer size for mesh networks; the delay bounds calculated for the flows are not hard bounds, and real values may be larger.
In [11], we proposed methods for worst case analysis that do not need traffic regulators. The bounds presented by those methods are tighter than those reported by previous studies. In this paper, we extend these methods in several directions: we provide a more detailed switch model that differentiates stage delay and buffer depth, which results in tighter bounds in the calculations of the RTB-LL method. Our analysis also considers networks with virtual channels and variable buffer lengths; the analysis for small buffers is also extended to account for message lengths, which results in tighter bounds with respect to our previous results.
THE NETWORK MODEL
A router model is essential to characterize network latency and bandwidth. We consider the very general reference architecture shown in Fig. 1 where a crossbar handles the connections among input and output channels inside the router. For more generality, we consider optional buffering at input and output ports. We assume round-robin arbitration in the switches, a commonly used arbitration scheme in many NoCs. Each port is equipped with some virtual channels sharing the bandwidth of the physical channel associated with it. Links, which can be pipelined to maximize the operating frequency, connect the output ports to the input ports of adjacent routers. Note that due to backpressure signaling, packet loss and packet dropping do not happen in switches. Table 2 summarizes the parameters used to describe the model. For the sake of simplicity, we use a single parameter Freq for the operating frequency of all cores and FlitWidth as the data width of all NoC links.
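The round-robin assumption is what later keeps the worst-case wait at a switch bounded. The following is a minimal behavioral sketch of such an arbiter, not the router's actual implementation; the class name `RoundRobinArbiter` and its interface are hypothetical and introduced only for illustration.

```python
class RoundRobinArbiter:
    """Behavioral sketch: grants one requesting input port per round, rotating priority."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        # Start so that port 0 has the highest priority in the first round.
        self.last = num_ports - 1

    def grant(self, requests):
        """`requests` is a set of requesting port indices; returns the winner or None."""
        for offset in range(1, self.num_ports + 1):
            port = (self.last + offset) % self.num_ports
            if port in requests:
                self.last = port  # the winner gets the lowest priority next round
                return port
        return None


arbiter = RoundRobinArbiter(4)
# All four ports keep requesting: every port is served once every four rounds.
grants = [arbiter.grant({0, 1, 2, 3}) for _ in range(8)]
```

With `n` ports that all request continuously, a given port loses at most `n - 1` consecutive arbitrations, which is why the analysis can assume "serviced last" and still obtain a finite bound.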
The buffer depth ($B_d$) parameter is used frequently in the paper. As seen in Fig. 1, $B_d$ is the sum of the number of registers and/or the number of slots of one or two FIFO(s), from the arbitration point (at the entry of the crossbar) of switch $j$ to the arbitration point of switch $j + 1$. The input buffer depth is denoted by $b_1$ (we assume at least one register), b [...] Blocking always happens because of arbitration conflicts, either directly in front of a switch crossbar, or indirectly due to full buffers ahead. For simplicity, throughout the paper, we consider the buffering between two adjacent switches to be lumped, so we mention "output buffer of switch $j$" and "input buffer of switch $j + 1$" equivalently, referring to the same number of intermediate registers or FIFOs between the arbitration points of switches $j$ and $j + 1$, i.e., to $B_d$. The stage delay ($S_d$) parameter instead describes the minimum delay the header flit of a packet will face, in the absence of any contention with other packets, between the arbitration points of two adjacent switches. Please also note that the switches along a path are indexed $j = 1..m$, but $j = 0$ can conveniently be seen as a virtual switch inside the source node, which acts like a physical switch to model source conflicts (i.e., sending more than one flow from a source node). The parameters $ts_1$ and $ts_2$ model the setup time at NoC sources and the consumption time at NoC destinations to inject and eject packets. Of course, to be able to use finite parameters, we assume that the receiving nodes are able to accept incoming data at the required rates. Table 3 lists the parameters we use to describe traffic flows across the network, while Table 4 summarizes the parameters that we use to model the performance of such flows. Most notably, $UB_i$ represents the upper bound delay for a packet of flow $F_i$ traversing the network, and is a key factor for the interconnect designer.
We try to use a notation as close as possible to that used in [10] for ease of comparison. We first present a method called Real-Time Bound for High-Bandwidth traffic, which calculates $UB_i$ in a completely worst case traffic situation. Crucially, this includes the possibility for other system cores to inject unregulated bandwidth, i.e., any amount of traffic at irregular intervals. This is a key property for the analysis of real-world interconnects, as most available IP cores operate on an unregulated-injection basis. In order to calculate $UB_i$ in such a case, we consider all intermediate buffers along the route to be full, and we assume arbitration loss at all switches where other flows are contending for the same output port. As will be seen, the calculation procedure is deterministic, so it always yields a unique worst case solution for each flow. The calculated values are tight in most network scenarios, as the worst case situation can really happen. The bounds may be slightly pessimistic in rare network scenarios where some of the contending flows connect two switches on different routes, which can prevent enough contending packets from being supplied to create the worst case situation.
NETWORK TRAVERSAL DELAY ANALYSIS
Deadlock and livelock do not occur, as we assume that the routing paths along the switches are deterministic and predefined for all flows (like the networks used in [57]). As we are modeling the worst case behavior, we consider that the flows send packets at the maximum rate that the network permits; thus, at this level of design, knowing the flow behavior is not important. Since switches are assumed to feature round-robin arbitration, even though we assume the current flow to be serviced last, the maximum delay is bounded, i.e., starvation cannot occur. Therefore, the packets sent by the source $S_i$ are eventually delivered. RTB-HB calculates the Maximum Interval $MI_i$, i.e., the number of cycles after which the output buffer of $S_i$ is guaranteed to be free again for further injection. From this value, the worst traffic minimum injectable bandwidth ($mBW_i$) can also be easily derived. This analysis can be applied to most NoC architectures, without any specific QoS hardware or software provisioning. We then move on to the description of another method, called RTB-LL. In this scenario, we assume that traffic injection can be regulated, as in some application scenarios. Therefore, we also calculate a minimum permitted interval ($mI_i$) between two consecutive packets from the same source, which can be translated into a maximum permitted bandwidth ($MBW_i$). This approach is similar to the previously reported method [10] (real-time wormhole channel feasibility checking, or WCFC, which will be briefly described later) but provides much better results in terms of bound tightness. For proper operation, the system must then respect the $MBW_i$ bounds at runtime.
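The text states that an interval can be translated into a bandwidth without spelling out the conversion. The sketch below assumes the natural reading, namely one packet of $L$ flits of width FlitWidth every `interval` cycles at frequency Freq; this formula, the function names, and the numeric values are assumptions for illustration rather than definitions taken from the paper. The second function is a hypothetical check of the "minimum permitted interval" regulation that RTB-LL relies on.

```python
def bandwidth_bits_per_second(length_flits, flit_width_bits, freq_hz, interval_cycles):
    """Assumed conversion: one packet of `length_flits` flits every `interval_cycles` cycles.

    With interval = MI_i this would give the minimum injectable bandwidth,
    with interval = mI_i the maximum permitted bandwidth.
    """
    return length_flits * flit_width_bits * freq_hz / interval_cycles


def respects_min_interval(injection_cycles, min_interval):
    """True if consecutive injection times (in cycles) are at least `min_interval` apart."""
    return all(later - earlier >= min_interval
               for earlier, later in zip(injection_cycles, injection_cycles[1:]))


# Illustrative numbers only: 4-flit packets, 32-bit flits, 500 MHz, interval of 16 cycles.
example_bw = bandwidth_bits_per_second(4, 32, 500_000_000, 16)
```

A source that injects at cycles 0, 16, and 40 satisfies a 16-cycle minimum interval, while one that injects at cycles 0 and 10 does not.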
The Proposed Delay Model RTB-HB
The goal here is to calculate the parameters $UB_i$ (worst case latency to traverse the network) and $MI_i$ (maximum worst case interval). Let us first consider the case $B_d = L$.
$B_d = L$ means that a packet fills exactly the buffering resources between the arbitration points of two adjacent switches. Considering the case where the network is completely loaded (an unrealistic scenario, used just for visualization purposes) and $B_d = L$, the network operates by shuffling packets around in lockstep: all switches simultaneously rearbitrate every $L$ cycles and packets trail each other, filling up the buffers as soon as they become free.
More formally, when $P_i$ is generated in $S_i$, we consider all intermediate buffers along its route to be full of packets from different flows. In the worst case, for $P_i$ to reach its destination, all these packets must leave their buffers. Focusing on hop $j$, $P_i$ may have arbitration conflicts with a number $z_c(i, j)$ of other flows contending for the output channel $c$, e.g., flows $F_a$ and $F_r$. Since round-robin arbitration is assumed, it is enough to consider all contending flows to send a packet before $P_i$ to guarantee a worst case analysis. The order in which contending flows obtain the arbitration is not important for the latency calculation of $P_i$. So, $P_a$ should make a one-hop forward progress. While $P_a$ frees the buffers at hop $j$, flit by flit, the flits from $P_r$ will smoothly replace the free buffer spaces. Eventually, $P_i$ also goes through. Section 4.1.3 presents a simple example to visualize this. The parameter $u_i^0$ represents the time needed for $P_i$ to be ejected from $S_i$ and be placed in the output buffer of $S_i$ (or input buffer of the first switch of $F_i$). $u_i^j$ then represents the time needed for $P_i$ to go from the input buffer of $SW_j$ to the input buffer of $SW_{j+1}$, except for the last switch. At the last switch, $P_i$ is ejected, so it is instead the time needed to get into the input buffer of destination $D_i$. To calculate $UB_i$, as shown in (1), all these contributions must be aggregated, plus the fixed overhead for the packet creation and ejection:

$$UB_i = ts_1 + \sum_{j=0}^{h_i} u_i^j + ts_2 \tag{1}$$
The time needed for $S_i$ to inject the next packet is the time to create such a packet, plus the time needed for this packet to move on to the input buffer of the first switch. Thus,

$$MI_i = ts_1 + u_i^0 \tag{2}$$
To be consistent with the notations from [10], we introduce the uppercase $U_i^j$ symbol, which models the hop delay from output buffer to output buffer (instead of from input buffer to input buffer).
Let us consider a packet of flow $F_i$ initiated at the source $S_i$. For this packet to reach the input buffer of the first switch, all existing packets at that buffer have to leave. Such an existing packet could be a packet from the same flow $F_i$ or from any of the contending flows at the output channel of the source. Thus, the worst case time taken for any existing packet to leave the buffer is given by $\mathrm{MAX}_x(U_i^0, U_{I(x)}^0)$, where $I(x)$ is the index of the contending flows at the output channel, with $x = 1..z_0(i, 0)$. Also, all other contending flows of $F_i$ may have to send a packet before this flow. Thus, the total delay for a packet from $F_i$ to reach the input buffer of the first switch is given by:

$$u_i^0 = \mathrm{MAX}_x\left(U_i^0, U_{I(x)}^0\right) + \sum_x U_{I(x)}^0, \quad x = 1..z_0(i, 0). \tag{3}$$
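Similarly, for the subsequent hops, $u_i^j$ can be calculated as:

$$u_i^j = \mathrm{MAX}_x\left(U_i^j, U_{I(x)}^j\right) + \sum_x U_{I(x)}^j, \quad x = 1..z_c(i, j), \quad 1 \le j \le h_i. \tag{4}$$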
Please note that, if there is no contention for the flow, the above equation reduces to $u_i^j = U_i^j$.
This is again akin to a packet moving in a pipeline fashion in the network.
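In order to calculate the $U_i^j$ values, let us consider the packet from flow $F_i$ moving from the output buffer of the source to the output buffer of the first switch. For the packet to move, any existing packet in the output buffer of the first switch should move to the output buffer of the second switch. Similar to the above calculations, the maximum delay for this is given by $\mathrm{MAX}_x(U_i^1, U_{I(x)}^1)$, and the contending flows may again have to send a packet first. In general:

$$U_i^j = \mathrm{MAX}_x\left(U_i^{j+1}, U_{I(x)}^{j+1}\right) + \sum_x U_{I(x)}^{j+1}, \quad x = 1..z_c(i, j+1), \quad 0 \le j \le h_i - 1. \tag{5}$$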
For the case of the last switch, from the output port, the packet can be ejected in $L_i$ cycles (one flit per cycle). Thus,

$$U_i^{h_i} = L_i.$$
Based on (3) and (4), the problem of finding $UB_i$ and $MI_i$ ((1) and (2)) is now mapped onto a summation of $U_i^j$ values, which can be solved by (5). Please note that we assume that the destination has enough buffers to eject the packets at the rate at which the network delivers them. By applying the above formulas recursively, we can obtain the worst case delay ($UB$) and injection interval ($MI$) for the different flows.
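The recursion described above can be sketched in a few lines for the case $B_d = L$. This is an illustrative reading of the derivation and not the authors' implementation: it assumes $UB_i = ts_1 + \sum_j u_i^j + ts_2$ and $MI_i = ts_1 + u_i^0$, it takes each hop delay as a MAX term over the flows that may occupy the buffer ahead plus one term per contending flow, and it simplifies contention by treating every other flow that shares a channel as a contender. The toy workload, the channel names, and the values of `TS1` and `TS2` are hypothetical and do not correspond to Fig. 2.

```python
from functools import lru_cache

# Hypothetical toy workload: each flow is the ordered list of channels it uses,
# from the source output channel (hop 0) to the ejection channel (hop h_i).
FLOWS = {
    "F1": ["S1>SW1", "SW1>SW2", "SW2>D1"],
    "F2": ["S2>SW1", "SW1>SW2", "SW2>D2"],
}
L = {"F1": 4, "F2": 4}  # packet lengths in flits; B_d = L is assumed
TS1, TS2 = 1, 1         # assumed source setup and destination consumption overheads


def contenders(flow, channel):
    """Other flows using `channel`, paired with their own hop index on it.

    Simplification: every other flow sharing the channel counts as a contender.
    """
    return [(g, path.index(channel)) for g, path in FLOWS.items()
            if g != flow and channel in path]


@lru_cache(maxsize=None)
def U(flow, j):
    """Output-buffer-to-output-buffer delay of `flow` at hop j."""
    path = FLOWS[flow]
    if j == len(path) - 1:          # last hop: ejection takes L_i cycles
        return L[flow]
    others = [U(g, k) for g, k in contenders(flow, path[j + 1])]
    return max([U(flow, j + 1)] + others) + sum(others)


def u(flow, j):
    """Input-buffer-to-input-buffer delay of `flow` at hop j."""
    others = [U(g, k) for g, k in contenders(flow, FLOWS[flow][j])]
    return max([U(flow, j)] + others) + sum(others)


def upper_bound(flow):
    """UB_i: overheads plus the sum of all hop delays."""
    return TS1 + sum(u(flow, j) for j in range(len(FLOWS[flow]))) + TS2


def max_interval(flow):
    """MI_i: packet creation plus the time to reach the first switch."""
    return TS1 + u(flow, 0)
```

For this toy workload, `upper_bound("F1")` evaluates to 22 cycles and `max_interval("F1")` to 9 cycles. The memoized recursion terminates only if the channel dependencies between flows are acyclic, which is the case for the deadlock-free deterministic routing assumed in the text.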
To describe the details of the RTB-HB analytical method for the calculation of upper bound delay and interval, we apply it step-by-step to an example NoC (shown in Fig. 2). The NoC contains four switches and there are four message flows: from $S_1$ to $D_1$, $S_{2,3}$ to $D_3$, $S_{2,3}$ to $D_{2,4}$, and $S_4$ to $D_{2,4}$ ($S_{2,3}$ and $D_{2,4}$ are a source and a destination with two flows originating from or ending at them). We consider
As an example, we study the time needed for a packet $P_0$ of flow $F_1$ to cross the network. In general, from (1), we can write:

$$UB_1 = ts_1 + u_1^0 + u_1^1 + u_1^2 + u_1^3 + ts_2.$$
To start, let us model the time $u_1^0$ needed to move from $S_1$ to the input buffer of switch $SW_1$. We start from the most congested network possible, so there exists another packet $P_1$ of the same flow ahead, and this packet needs $U_1^0$ to move from the output buffer of the source (remember that source nodes are tagged with superscript 0) to the output buffer of $SW_1$; so, $u_1^0 = U_1^0$. $U_1^0$ has to be calculated recursively based on the delays of the contending packets and the delays of the packets ahead along the same route.
We observe that two factors mainly contribute when calculating the delay: first, the possibility of losing arbitrations at $SW_1$; second, the fact that there may be no available buffer space at the output of $SW_1$ (due to arbitration losses ahead), which also effectively stalls packets at the input of $SW_1$. As for the arbitration loss, flow $F_1$ contends with flow $F_2$ at the output of $SW_1$. Thus, a packet $P_2$ of $F_2$ currently in the input buffer of $SW_1$ could be arbitrated before $P_1$. As for the output buffer full condition, in the worst case, there will be a single ($B_d = L$) packet $P_3$ in the output buffer of $SW_1$. $P_3$ could belong to either $F_1$ or $F_2$, in which case it takes, respectively, $U_1^1$ or $U_2^1$ until it moves on to the output buffer of $SW_2$, leaving the output buffer at $SW_1$ empty. However, in the worst case, an arbitration loss occurs to $P_1$, so it is packet $P_2$ which will smoothly replace $P_3$ (Fig. 3a). Before $P_1$ can move on by one hop, we must also consider the time for packet $P_2$ to go from the output buffer of $SW_1$ to the output buffer of $SW_2$ (Fig. 3b), which is $U_2^1$. Thus, we can write:

$$U_1^0 = \mathrm{MAX}\left(U_1^1, U_2^1\right) + U_2^1,$$

which traces back to (3). As mentioned above, this is the delay for $P_1$ to move one hop on, but equivalently it is also the delay for $P_0$ to replace it in the previous location (Fig. 3c). Now, similarly, $P_0$ needs to move another hop on, from the input buffer of $SW_1$ to the input buffer of $SW_2$, with a delay which is defined as $u_1^1$. In the worst case, $P_0$ should wait for a packet $P_4$ of $F_2$. A packet $P_1$, again either from $F_1$ or $F_2$, should be considered at the output buffer of $SW_1$. So again, a time $\mathrm{MAX}(U_1^1, U_2^1)$ is needed for this packet to move on, after which $P_4$ must also advance, allowing $P_0$ to eventually get to the input buffer of $SW_2$. Thus,

$$u_1^1 = \mathrm{MAX}\left(U_1^1, U_2^1\right) + U_2^1.$$
In a similar manner, $u_1^2$ can be calculated. Once $P_0$ is in the input buffer of $SW_3$, it is only one hop away from its destination, and as there is no contending flow at the destination, the ejection time for the messages equals $L_1$. For the sake of uniformity of presentation, we can write $u_1^3 = U_1^3 = L_1$.
Now, the target metric $UB_1$ can be calculated recursively, as a function of $U$ [...]

When considering the source $S_{2,3}$, it can be noticed that two flows $F_2$ and $F_3$ can originate from it; therefore, source conflicts may happen. As Fig. 4 shows, for example, when analyzing flow $F_2$, $u_2^0$ (the time to transfer a packet of $F_2$ into the input buffer of the first switch) should include a delay $\mathrm{MAX}(U_2^0, U_3^0)$, which accounts for a packet of either $F_2$ or $F_3$ to move away from the input of $SW_1$ toward the input of $SW_2$ (during which time we must assume, in the worst case, that it is a packet of $F_3$ which replaces it), and then again the time $U$ [...]

In order to tighten the performance bounds, we also present equations to calculate $UB_i(A)$ directly, achieving lower figures than those calculated above for network B. For this purpose, we introduce a new parameter [...] point of switch $j + 1$ to the moment when the tail flit leaves the arbitration point of switch $j$. In particular, when $B_d = L$ (as in network B), this time is zero, since as soon as the header flit enters the arbitration point of switch $j + 1$, the tail flit has left the arbitration point of switch $j$.
In case $B_d < L$, a packet occupies more buffering space than that available between the arbitration points of two adjacent switches. Consider now a situation where the header flit of a packet $p$ of flow $i$ is at the arbitration point of switch $j + 1$, while an interfering packet $q$ of flow $t$ is also traversing switch $j + 1$, but with its header already at the arbitration point of switch $j + 2 + S$ ($S = L/B_d - 1$). Also, a number $z_c(i, j + 1)$ of packets at different input ports of switch $j + 1$ are contending with $p$ for the same output port. Before the tail flit of packet $p$ can leave the arbitration point at switch $j + 1$, the following must happen:
1. Packet $q$ should proceed for $B_d$ steps (registers), so that its tail flit leaves the arbitration point of switch $j + 2$. After this step the header flit of one of the [...], $x = 1..z_c(i, j + 1)$.
2. All the $z_c(i, j + 1)$ packets, to account for the worst case, shall be arbitrated before $p$; so, after this step the header of $p$ is at the arbitration point of switch $j + 2$. The required time for this step is $[\ldots]_{I(x)}(A)$ with $x = 1..z_c(i, j + 1)$.
3. Packet $p$, whose header is at the arbitration point of switch $j + 2$, must proceed for a number of steps until its tail flit leaves the arbitration point of switch $j + 1$. This time is calculated as $[\ldots]_i^{j+1}$.

The summation of the time for these steps results in (7). As can be seen in (8), in the last switch for a flow, $L$ cycles are required for the packet to be ejected to the destination.

Definition 3. We denote by $u_i^j(A)$ the worst case delay elapsing from the moment when the header flit of $P_i$ enters the arbitration point of $SW_j$ ($1 \le j \le h_i - 1$) to the time it enters the arbitration point of $SW_{j+1}$. For $j = 0$, it is the worst case delay elapsing from when the header flit of the packet in the source node is ready to be injected in the network to the time it enters the arbitration point of the first switch. For $j = h_i$, it is the worst case delay elapsing from the moment the header flit of $P_i$ enters the arbitration point of the last switch $h_i$ to when a number $B_d$ of flits of the packet have been ejected from the network at $D_i$.
Using the above definitions, the equations to calculate $UB_i(A)$ and $MI_i(A)$ can be summarized as:
In other words, the time needed for a packet to move from source to destination includes the source and destination overheads ($ts_1$ and $ts_2$), the time needed for the header flit to move across the network switch by switch, and $L - B_d$ (as for the last switch, (10) is used and the packet needs $L - B_d$ further steps to completely leave the network).
To calculate the value of the maximum interval between two consecutive packets, it is enough to add the source overhead ($ts_1$), the time needed to send the header flit of the packet to the arbitration point of the first switch ($u_i^0(A)$) [...] (10) is equal to $B_d$, as there is no flow contention on the port connected to $D_1$ in $SW_3$. Equation (9) is used to calculate $u_1^2$, which is equal to the time needed for the header flit of a packet $P_1$ of $F_1$ to move from the arbitration point of $SW_2$ to the arbitration point of $SW_3$. As there is no contention from flows from different input ports of $SW_2$ for the same output port, the term $\sum_x U_{I(x)}^j(A)$ in (9) is zero; thus, only the value of the $\mathrm{MAX}(\cdot)$ term is calculated (the reason for using the MAX operator is that the packet just ahead of $P_1$ may be either from $F_1$ or $F_2$). Flow contention clearly happens between $F_1$ and $F_2$ for the output port of $SW_1$ which is connected to $SW_2$; thus, to calculate $u_1^1$ this contention should be considered and, based on (9), the value is $\mathrm{MAX}(\ldots)$ [...]

In such a topology, we can always write $u_i^{k'} = u_i^k$, independent of the flows of the packets in the buffer; thus, $UB_i$ and $MI_i$ are given as:
Extending this approach to the case $B_d = m \times L$, $m - 1$ dummy switches can now be inserted between each pair of switches and thus:
In cases where $B_d$ is not a multiple of $L$, i.e.,
A straightforward conservative solution can round up $B_d$ to $(m + 1) \times L$ and consider $m$ dummy switches; so, the formula becomes
The above equation shows that $MI_i$, and thus $mBW_i$, is not a function of $m$. In other words, the minimum guaranteed bandwidth in RTB-HB does not depend on the network's buffering space when the buffer depth is larger than the message length. Indeed, in this scenario, we assume that a buffer of size $m \times L$ exists just before the arbitration point of the inputs of the switches, and for calculation purposes we assume that $m - 1$ dummy switches are placed to split the buffer into $m$ stages. As the delays of all the stages are equal, these stages can be compared with a pipeline in which increasing or decreasing the number of stages does not affect the data injection rate into the pipeline, although the delay increases when $m$ increases. This also holds for the buffers just at the outputs of the source cores, and thus the injection rate to the network is not affected when $m$ changes. This is better shown in an example in the next section. Also as in Section 4. [...]
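The dummy-switch construction for $B_d \ge L$ can be summarized in a short helper. This is a sketch of the counting rule only, as read from the text (exact multiples $m \times L$ use $m - 1$ dummy switches; other depths are conservatively rounded up to the next multiple); the function name is hypothetical.

```python
def dummy_switches(buffer_depth, packet_length):
    """Dummy switches inserted per link when B_d >= L.

    B_d is rounded up to the next multiple of L, and a buffer of k * L slots
    is modeled as k stages separated by k - 1 dummy switches.
    """
    if buffer_depth < packet_length:
        raise ValueError("this sketch only covers B_d >= L")
    # Integer ceiling of B_d / L without floating-point arithmetic.
    stages = (buffer_depth + packet_length - 1) // packet_length
    return stages - 1
```

For example, with $L = 4$, a depth of 12 flits ($m = 3$) gives two dummy switches, and a depth of 10 flits (between $2L$ and $3L$) is rounded up to 12 and also gives two.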
Virtual Channels
The extension of the proposed method to support virtual channels is straightforward, provided that the index of the virtual channel used for a flow in different switches is a predefined parameter. For this purpose, all virtual channels in intermediate switches can be considered as physical channels in the proposed model. Arbitration conflicts would be among all the incoming virtual channels, and each output virtual channel would account for a separate output port (so, e.g., the value for U h i i would change from L i to m Â L i , assuming m virtual channels per physical channel). In the following equations, the modifications needed to support virtual channels are shown. The parameter u j i ½v i;j denotes the maximum delay to move from the arbitration point of switch j to the arbitration point of switch j þ 1 when using the predefined input virtual channel v i; j used for flow i at the input of switch j. Similarly, U j i ½v i; jþ1 is the maximum delay to move from the output buffer of switch j to the output buffer of switch j þ 1 using input virtual channel v i;jþ1 at the input of switch j þ 1. Also v i;hiþ1 is defined as the virtual channel index used for flow i at the input port of the destination (D i ) IP core. The parameter u 0 i ½1 is used to calculate MI i as shown below; in this case, the selection of virtual channel number 1 is mandatory as only one flow enters the virtual switch inside an IP core 
$I(x)$ $[v_{I(x),\,j+2}]$;
To illustrate the support of virtual channels, let us consider the example shown in Fig. 2 with two virtual channels per physical channel. Table 5 shows the mapping between flows and virtual channels at each switch input port, and Fig. 7 shows the calculation for this example. Comparing the results with the case where no virtual channels are used shows that virtual channels can reduce the number of flow contentions in different switches. At the same time, the bandwidth of a physical channel is shared among its virtual channels. Thus, depending on the application and on the strategy that assigns virtual channels to flows, the worst case latency and bandwidth values can be better or worse than in the case without virtual channels, even before the extra hardware cost and complexity of virtual channels are taken into account.
The Proposed Delay Model RTB-LL
We present a substantial improvement to the previously published method WCFC [10]. WCFC also calculates upper bounds on propagation delays and the permitted injection intervals for the flows in a wormhole network. It considers the arbitration contention that packets face and the delay incurred because of other packets that share part of their route. With a notation similar to that used above, WCFC employs [10] the following equation to calculate $UB_i$ and $mI_i$:
$I(x);\quad x = 1, \ldots, z_c(i,j);\quad 0 \le j \le h_i.$
In the WCFC method, the calculations are based on the assumption that each flow injects packets with a minimum permitted interval. For the applications that can support such an assumption, we present a method, which we call RTB-LL, that provides a significant improvement in bound tightness over WCFC. As RTB-LL is less pessimistic than WCFC in evaluating worst case performance, it enables the design of more hardware-efficient NoCs. To improve upon WCFC, a new concept, called overlapping flows, is introduced. If two or more different flows contend for the same output port at a switch, and they also share the same input port, we call such flows overlapping at the switch. This notion allows us to significantly improve the bound tightness.
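The definition of overlapping flows can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `overlapping_groups`, the tuple layout, and the port numbers are hypothetical, chosen only to show how flows that share both the input port and the output port of a switch would be grouped.

```python
from collections import defaultdict


def overlapping_groups(contenders):
    """Group the flows at one switch by (input port, output port).

    `contenders` is a list of (flow_id, input_port, output_port) tuples;
    this layout is an assumption made for illustration only.
    Returns only the groups with two or more flows, i.e. overlapping flows.
    """
    groups = defaultdict(list)
    for flow_id, in_port, out_port in contenders:
        groups[(in_port, out_port)].append(flow_id)
    return {ports: flows for ports, flows in groups.items() if len(flows) > 1}


# Hypothetical port numbers: F1 and F2 arrive on the same input port,
# F3 arrives on another one, and all three request output port 2.
sw2 = [("F1", 0, 2), ("F2", 0, 2), ("F3", 1, 2)]
overlaps = overlapping_groups(sw2)
print(overlaps)
```

In this hypothetical switch, F1 and F2 form one overlapping group, while F3 remains a separate contender for the same output port.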
When $F_i$ contends with multiple overlapping flows at a switch, it is possible to locally coalesce all such overlapping flows into a single one. This is because the arbitration cannot be lost to more than one of those flows: they enter the switch through the same input port, so they cannot physically produce contending packets simultaneously. If there exist, e.g., two overlapping contending flows at hop $j$ with delay parameters U , for calculating the parameters of $F_i$. Moreover, whenever $F_i$ itself overlaps with other flows, those other flows should be ignored as contenders. By applying these optimizations, we observe a significant improvement in bound tightness for RTB-LL, as shown in the next section and summarized in Table 1.
Figs. 8 and 9 show the calculated $UB_i$ and $mI_i$ values for both the WCFC and RTB-LL methods for the same example in Fig. 2. Since flows $F_1$ and $F_2$ are overlapping at $SW_2$, the proposed RTB-LL improves the bound tightness compared to WCFC. Consider, for the sake of exemplification, the NoC variant shown in Fig. 10, with another contending flow at $SW_2$. In RTB-LL, the delay u
To show how virtual channels are supported in RTB-LL, the same example used for method RTB-HB is considered. Fig. 11 shows the results. As in RTB-HB, using virtual channels results in better latency and bandwidth bounds, since flow contentions are resolved.
STUDIES ON APPLICATIONS
The proposed methods RTB-HB and RTB-LL can be used to analyze the scheduling of traffic flows in real-world applications. In this section, we present studies on a multimedia application (four other multimedia and RT applications are considered in the appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TC.2011.240). We compare methods RTB-HB and RTB-LL to the baseline method WCFC, using the parameters listed in Table 6. In these applications, we assume that NoC topologies are predefined based on application communication requirements, without any feedback from the proposed algorithms to customize the network structure for better upper bound delay and interval results (such feedback is a possible extension for future work). In particular, for many applications, it is possible to identify a small subset of flows as critical, and then to optimize the NoC based on feedback loops from RTB-HB and RTB-LL to improve the performance of such critical flows. It is possible to do this without dedicated hardware support or any priority scheme.
Case Study: A Multimedia Application
In this section, we compare the results of applying RTB-HB, RTB-LL, and WCFC to D26-media (Fig. 12), a real-time multimedia application with 67 communication flows, some of which are critical. The application is mapped onto different NoC topologies, each with different switch counts and switch radices. Fig. 14 shows the average flow latency for different analysis methods and topologies. In particular, we show implementations with switch counts up to seven, which are typical values, and a topology with 20 switches, which is a reasonable design point with many longer-hop flows. Topologies with one or a few switches (e.g., 1-3 switches in this example) need "fat" (high-radix) switches, while other cases need "medium" or "thin" (low-radix) switches. It is important to note that limitations in physical implementation (such as power consumption, area, and frequency) may limit the use of fat switches in practice; we still present the results for these cases for the sake of latency comparison, and assume that proper constraints can be implemented at a higher level in the NoC synthesis flow. Two detailed implementations with 5 and 20 switches are shown in Fig. 13.
Fig. 11. Calculation of $UB_i$ and $mI_i$ for the example in Fig. 2 using RTB-LL and WCFC, with two virtual channels per physical channel.
TABLE 6. Network Parameters for the Study
Fig. 15 presents the results in terms of latency, intervals, and bandwidth for the whole set of flows for the 1-, 5-, and 20-switch networks. Figs. 15a, 15b, and 15c compare $UB_i$. The RTB-LL model always provides the tightest bounds. Compared to WCFC, the largely improved tightness (more than 50 percent on average) is due to the analysis of overlapping flows, a novelty of this paper. Note that the improved tightness does not compromise the bounds, which are still derived under worst case assumptions. For the topology with only one switch and without overlapping flows in Fig. 15a, the results for RTB-LL and WCFC are identical, but increasing the switch count triggers different performance profiles. RTB-HB is intrinsically expected to return higher worst case latencies, because it assumes that no hardware traffic injection regulation facilities are available. Still, thanks to the more accurate calculation approach, its bounds are on average 30 percent lower than those given by WCFC, despite the less restrictive assumptions. There are, however, a few flows for which WCFC predicts lower delays than RTB-HB, due to the regulated injection assumption.
In a zero-load scenario (with no contention at all), the minimum theoretical latency to traverse the 5-switch NoC for flows spanning a single hop is eight cycles $(S_d + L)$, while RTB-LL gives a minimum upper bound of 17 cycles under worst case contention. The delays calculated for the 20-switch topology, in this example, are higher, as a result of longer paths (more hops) per flow, a higher probability of contention, and, especially for RTB-HB, more in-flight packets. This suggests, as intuitively expected, that NoCs with fewer hops guarantee, on average, lower delay bounds. As described earlier, physical implementation limitations may prevent the use of fat switches in practice. On the other hand, increasing the number of switches not only increases the system cost but also requires careful consideration in the design process to reduce flow contentions and obtain tighter upper bound delays; a trade-off in the number of switches may therefore be considered. For RTB-LL, the number of contending in-flight packets is unrelated to the NoC topology, so the delay will not increase on that account as a result of more hops.
Figs. 15d, 15e, and 15f show the maximum and minimum injection intervals ($MI_i$ and $mI_i$). Intuitively, if traversal delays are lower, new packets can be injected sooner, so the $MI_i$ ($mI_i$) plots resemble the $UB_i$ trends: flows with lower latencies can be injected more frequently. Thus, the $mI_i$ intervals are always shorter in RTB-LL, and the $MI_i$ intervals are often shorter in RTB-HB, when compared to $mI_i$ in WCFC (except when using a few fat switches). These intervals can be directly translated into $mBW_i$ and $MBW_i$ using the equations in Table 3. Results are shown in Figs. 15g, 15h, and 15i. Compared to WCFC, the maximum injectable bandwidths ($MBW_i$) given by RTB-LL are on average 35 percent higher, and the minimum bandwidths ($mBW_i$) given by RTB-HB are on average 25 percent higher. The maximum theoretical injectable bandwidth is 1,600 MB/s (Freq $\times$ FlitWidth); according to RTB-LL, even under worst case assumptions, some flows on the 5-switch NoC are guaranteed injection rates of as much as 533 MB/s. In the 20-switch network, the higher contention likelihood affects injectable bandwidth negatively, but the use of more resources has a positive effect on many-hop flows, resulting in comparable injectable bandwidths. In summary, NoCs with few hops exhibit clearly better average upper bound traversal delays, but in terms of injectable bandwidths, the mapping of the flows (i.e., the contention patterns) and the amount of used resources play a decisive role.
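As a rough arithmetic check on the figures quoted above, the following Python sketch computes the peak bandwidth as Freq × FlitWidth and the share of it represented by the 533 MB/s guarantee. The 400 MHz clock and 4-byte flit width are assumed values that merely reproduce 1,600 MB/s; the actual parameters belong to Table 6 and are not restated here. The function names are hypothetical.

```python
def peak_bandwidth_mb_s(freq_mhz, flit_width_bytes):
    """Peak injectable bandwidth in MB/s, i.e. Freq x FlitWidth."""
    return freq_mhz * flit_width_bytes


def fraction_of_peak(guaranteed_mb_s, peak_mb_s):
    """Share of the peak bandwidth covered by a guaranteed rate."""
    return guaranteed_mb_s / peak_mb_s


# Assumed (hypothetical) clock and flit width that yield 1,600 MB/s.
peak = peak_bandwidth_mb_s(freq_mhz=400, flit_width_bytes=4)
share = fraction_of_peak(533, peak)
print(peak, round(share, 3))
```

Under these assumptions, the best guaranteed injection rate reported for the 5-switch NoC corresponds to roughly one third of the theoretical peak.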
The Effect of Virtual Channels
The results of employing different numbers of virtual channels in the network in Fig. 12 are reported in Figs. 16 and 17. Here we consider the D26-media application with the RTB-HB and RTB-LL methods for one, two, and four virtual channels per physical channel. The strategy that assigns virtual channels to flows shares the load of each input port among its virtual channels, thus minimizing contention on switch output ports. The figures show that, for both methods, increasing the number of virtual channels results in better average RT metrics for the flows.
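One simple way to share the load of an input port among its virtual channels is a round-robin assignment, sketched below in Python. This is only an assumption about how such a strategy could look; the text does not specify the exact assignment rule, and the function name, port labels, and flow names are hypothetical.

```python
def assign_virtual_channels(flows_by_input_port, num_vcs):
    """Spread the flows of each input port over its virtual channels.

    `flows_by_input_port` maps a port label to the list of flows entering
    through it. Flows are assigned round-robin, so the per-channel load on
    a port differs by at most one flow. Returns {(port, flow): vc_index}.
    """
    assignment = {}
    for port, flows in flows_by_input_port.items():
        for index, flow in enumerate(flows):
            assignment[(port, flow)] = index % num_vcs
    return assignment


# Hypothetical switch with two input ports and two virtual channels each.
example = {"in0": ["F1", "F2", "F3"], "in1": ["F4"]}
vc_map = assign_virtual_channels(example, num_vcs=2)
print(vc_map)
```

With two virtual channels, F1 and F3 share channel 0 of port `in0` while F2 uses channel 1, so fewer flows compete on any single channel than with one channel per port.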
Study with Variable Buffer Depths
Figs. 18 and 19 show the comparison of worst case delay and maximum interval for the D26-media application for the case $B_d < L$. The figures suggest that shallower buffers, when the message length is fixed, result in smaller worst case delay and maximum interval bounds. Fig. 20 illustrates the effects of increasing the buffer depth $B_d$, in all switches of the NoC, as an integer multiple of the message length $L$. The test is run for the D26-media application using method RTB-HB. A linear relationship can be observed between $B_d$ and the upper bound delay $UB_i$. However, because of the pipeline effect described in Section 4.1.2, there is no such relation between $B_d$ and $mBW_i$.
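The pipeline effect can be illustrated with a toy simulation, given here as a sketch rather than as the RTB-HB equations. It models a contention-free pipeline of equal-delay stages fed by a saturated source; the function name and the 8-cycle stage delay are assumptions made for illustration.

```python
def simulate_pipeline(stages, stage_delay, packets):
    """Push `packets` packets through `stages` equal-delay stages.

    The source injects a packet as soon as the first stage is free.
    Returns (worst latency, worst interval between departures).
    Requires packets >= 2 so that at least one interval exists.
    """
    free_at = [0] * stages  # time at which each stage becomes free
    latencies, departures = [], []
    for _ in range(packets):
        injected = free_at[0]
        t = injected
        for s in range(stages):
            start = max(t, free_at[s])  # wait until the stage is free
            t = start + stage_delay
            free_at[s] = t
        latencies.append(t - injected)
        departures.append(t)
    intervals = [b - a for a, b in zip(departures, departures[1:])]
    return max(latencies), max(intervals)


shallow = simulate_pipeline(stages=1, stage_delay=8, packets=10)
deep = simulate_pipeline(stages=4, stage_delay=8, packets=10)
print(shallow, deep)
```

In this toy model the latency grows linearly with the number of stages, while the interval between packets, and therefore the sustainable injection rate, stays the same.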
A contradiction may be perceived, since deeper buffers are generally expected to improve performance, while Fig. 20a reveals worse latency with deeper buffers. In fact, the average latency is probably improved with a larger $B_d$, but the worst case latency is not, as shown in Fig. 20a. To understand why the worst case latency deteriorates, consider the following explanation. When the basic case of $B_d = L$ is considered, RTB-HB calculates the maximum delay for a packet $P_i$ from the time the packet is supposed to be injected into the network until it is ejected at the destination. Since the output buffer of the source core has a depth of $B_d = L$, it is not possible to have more than one packet in this buffer waiting to be serviced before $P_i$. In the worst situation, the traffic generator can inject a packet every $WI_i$ cycles, at most; if it injects more, the traffic generator may be stalled by NoC backpressure until this interval elapses. In the more complex case $B_d > L$, some packets may be queued in the source core buffer ahead of $P_i$, and since RTB-HB imposes no restriction on injection rates, a worst case analysis must assume that they are there. Thus, the calculated worst case delay for $P_i$ includes the extra time needed to service them. Exactly the same reasoning applies to $B_d > L$ in the intermediate buffers of the network. As a consequence, deeper buffers increase worst case latencies, but this is not incompatible with the fact that increasing the buffer size decreases the average delay. Indeed, if we apply the same traffic pattern to two identical networks with different buffer sizes, the total average delay in the network with larger buffers will typically be lower than that of the other network. Our method RTB-LL does not consider the buffering space $B_d$ to calculate the worst case delay; instead, it uses the stage delay $S_d$. Therefore, changing the buffer size does not affect the calculated worst case delay.
Suitability to Critical Flows
Fig. 21 shows the average $UB_i$ traversal delay and the average $mBW_i$ injectable bandwidth for the flows traversing $x$ hops of the 5-switch NoC, considering the D26-media application and using RTB-HB. One-hop flows exhibit reasonably low latencies and high bandwidths, suitable for critical traffic loads. Thus, the proposed methodology has a clear applicability to industrial RT applications.

COMPLEXITY OF THE METHODS
To estimate the time complexity of the proposed methods, we calculate the maximum number of required operations. As (1), (3), and (5) show, the only operations are additions and comparisons (for the MAX operator); we count one cycle for each such operation. We call $h$ the maximum number of switches traversed by a flow, and $k$ the number of flows. We also pessimistically assume the maximum number of contending flows at a switch output to be $k$. For calculating one $U_i^j$ parameter, we need (according to (5)) at most $k$ comparisons and $k$ additions, thus a total of $2k$ operations. The number of $U_i^j$ parameters to be calculated is $hk$; so, the maximum number of operations needed for them is $2hk^2$. In RTB-HB, the outcome is $k$ values of $UB_i$ and $k$ values of $MI_i$. For calculating one $UB_i$ value (according to (1)), we need $h + 1$ additions; so, for all $k$ values of $UB_i$, we need $(h + 1)k$ operations, while $k$ operations are needed in the case of $MI_i$. The total number of operations is the sum of all the above, i.e., $2hk^2 + 2k^2 + (h + 1)k + k$. Therefore, the complexity of the algorithm is $O(hk^2)$. For RTB-LL, using the same approach, we can show that the complexity of the algorithm for calculating $UB_i$ and $mI_i$ is again $O(hk^2)$. Thus, both algorithms have a time complexity that is quadratic in the number of flows $k$ and linear in $h$. The time complexity of the WCFC algorithm is similar to that of RTB-LL, as it exhibits a similar recursive behavior [10]. In practice, the execution time for all our test applications is very small (a few seconds on a standard PC), and the modeling of delay and bandwidth parameters does not pose significant runtime issues.
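The operation count above can be evaluated numerically with a short Python sketch. The function name is hypothetical, and the value $h = 5$ is an assumed path length used only for illustration; $k = 67$ matches the number of flows in the D26-media application.

```python
def rtb_hb_operation_count(h, k):
    """Upper bound on additions and comparisons, as stated in the text:
    2*h*k^2 + 2*k^2 + (h + 1)*k + k.
    """
    quadratic_terms = 2 * h * k**2 + 2 * k**2
    ub_terms = (h + 1) * k  # h + 1 additions for each of the k UB values
    mi_terms = k            # one operation for each of the k MI values
    return quadratic_terms + ub_terms + mi_terms


# h = 5 is an assumed maximum hop count; k = 67 flows as in D26-media.
total = rtb_hb_operation_count(h=5, k=67)
doubled = rtb_hb_operation_count(h=5, k=134)
print(total, doubled / total)
```

Doubling the number of flows roughly quadruples the count, which is consistent with the quadratic dependence on $k$.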
CONCLUSION AND FUTURE WORK
We have proposed two different methods to characterize bandwidth and latency for NoC-based real-time SoCs, aiming at guaranteed QoS provisions. The choice of the most suitable method depends on the performance demands of the system and on whether dedicated hardware facilities can be provided by the NoC. One method is aimed at applications demanding the minimum latencies and requires injection regulation, while the other is suitable for applications where packet injection must be flexible to accommodate higher average injected bandwidths and no hardware regulation is possible. We have shown that the proposed methods return the worst case metrics in a much tighter way than existing approaches, which makes them applicable to real-world SoC applications. The next step is to use the results of this work as an input to NoC synthesis and optimization tools, so that the QoS demands of critical traffic flows are met.
