Abstract-This paper presents a compositional approach to formally verify quality-of-service properties of network-on-chip designs. A major challenge to scalability is the need to verify worst-case latency bounds for hundreds to thousands of cycles, which are beyond the capacity of state-of-the-art model checkers. The scalability challenge is addressed using a compositional model checking approach. The overall latency bound problem is divided into a number of smaller sub-problems, termed latency lemmas. The sub-problems imply the overall latency bound, but are easier to prove on account of being inductive. A method is presented for computing these lemmas based on the topology of the network and a subset of relevant state, and the latency lemmas are verified using k-induction. The effectiveness of this compositional technique is demonstrated on illustrative examples and an industrial ring interconnection network. In the ring network, a latency bound that cannot be verified in 10 000 s without lemmas is proved inductively in just 75 s when the lemmas are used.
a packet from one node in the network to another. In principle, this property can be expressed in linear temporal logic (LTL), and the problem can be solved using model checking. The LTL property expresses a bounded liveness property, written in English as "every packet from source A gets to its destination B within N cycles." Bounded liveness can be reduced to a safety assertion by adding extra logic that tracks the passage of time. One can use state-of-the-art model checking strategies such as k-induction [10], interpolation [11], and IC3/property-directed reachability (PDR) [12], [13] to verify this property. However, regardless of the strategy, it is generally necessary to analyze at least N consecutive cycles to either prove or disprove the latency bound, assuming that N is tight. Typical latency bounds for NoCs can be in the hundreds or thousands of clock cycles. Unrolling the model's transition relation to such a depth is beyond the capacity of state-of-the-art model checking engines. PDR [12], [13], while avoiding explicit unrolling of the transition relation, still does not scale past tens of clock cycles, as will be shown in Sections VI and VII.
This paper addresses the scalability challenge using a popular approach in formal verification: compositional reasoning. In compositional reasoning, one breaks up the overall monolithic proof obligation into a number of "smaller" proof sub-goals, which are much easier to verify, such that if all of the sub-goals are proved, then so is the original monolithic property. The approach may involve decomposing the system description, the property, or both. The key is to devise a decomposition that is well-suited to the verification task at hand. When the overall property is a latency bound of N cycles through a network, a natural decomposition is to prove smaller bounds on a packet's progress through particular sub-paths of the network. These proof sub-goals are termed latency lemmas, and the core contributions of this paper are methods to discover and apply them.
Specifically, this paper shows that for some common network topologies, one can enumerate finitely many stages that a packet can go through. Each location in the network belongs to at least one stage at every time moment. Stages are arranged into a directed, acyclic stage graph, to capture the order in which they can be visited by a packet. A stage graph is defined through the use of age lemmas that bound the total time between when a packet is injected into the network and when it exits each stage. The age lemmas are in turn created through the use of progress lemmas, which bound the number of cycles that a packet can spend in each stage. Age lemmas and progress lemmas are collectively referred to as latency lemmas, and by proving the latency lemmas one proves bounds corresponding to all paths through the network.
To summarize, this paper makes the following novel contributions.
1) A compositional approach to proving latency bound properties in NoCs by decomposition into latency lemmas.
2) Methods of formulating latency lemmas for a particular type of microarchitectural description using a stage graph. The stage graph formulation is automated for acyclic networks.
3) Experimental results on illustrative examples showing that the proposed technique can reduce the runtime of inductive verification of latency bounds by 20-50x, and allows k-induction to verify latency bounds 4-10x faster than can be achieved with the state-of-the-art IC3/PDR technique using all of the same strengthenings. Furthermore, it is demonstrated that verification runtime can be traded off against tightness of proved latency bounds.
4) Experimental results on an industrial-style ring interconnection network showing that latency lemmas give a speedup of greater than 50x, and in all cases allow latency bounds to be proved inductively. For an eight-agent ring, latency is verified using k-induction in 75 s with latency lemmas, and cannot be verified in 10 000 s without them.
While the approach given in this paper is shown to be efficient for proving latency bounds, two limitations should be noted. The first limitation is that it assumes a network to be primarily described using a particular microarchitectural modeling language called xMAS (see Section II). The second limitation is that the approach is not automated for cyclic networks. Yet, the ideas presented in this paper are more general than the specific implementation given, and can be applied to most latency verification works that are based on model checking.
The remainder of the paper is organized as follows. Section II introduces basic terminology and sketches the compositional latency verification approach using an example. Section III presents notation. Section IV describes the compositional approach formally and in more detail, including rules for creating the stage graph. Section V presents methodology used to evaluate the approach, and demonstrates efficient encoding of packet ages. Results for illustrative examples are presented in Section VI, and for the ring network in Section VII. Related work is presented in Section VIII, and Section IX concludes. This paper is a significantly extended and revised version of a conference paper [14] .
II. PRELIMINARIES

A. Background
In this paper, NoC designs are described using a high-level modeling formalism called executable microarchitectural specifications (xMAS models) [5]. The motivation for xMAS is to model network microarchitectures in a way that is expressive enough to capture interesting behaviors, yet sufficiently regular to enable formal reasoning. Toward this goal, an xMAS model N is a composition of simple kernel primitives (Fig. 2) connected via communication interfaces known as channels. Each channel c comprises three signals: c.data and c.irdy controlled by the initiator primitive, and c.trdy controlled by the target primitive (Fig. 1). A packet with value c.data is transferred from initiator to target when c.irdy and c.trdy are both asserted in the same cycle. Each channel is specified to exclusively communicate either data packets or dataless tokens; the difference is that in the former case c.data encodes information such as destination address, and in the latter case c.data is ignored. A channel c is said to be blocked (by the target) when c.irdy is asserted and c.trdy is not. A channel c obeys a liveness bound x if the temporal logic formula $\mathbf{G}\,(c.\mathit{irdy} \Rightarrow \mathbf{F}_{\le x}\; c.\mathit{trdy})$ holds, where G is the temporal operator "Globally" and F is "Eventually." In other words, x is the largest number of consecutive blocked cycles on channel c; a liveness bound of x = 0 means that a channel never blocks. Transfer attempts are persistent, meaning that if a transfer is blocked on a channel c, c.irdy remains asserted until the block is resolved and the eventual transfer occurs [15].
Fig. 2. Set of xMAS primitives [5]. Inputs and outputs are written in normal font and the parameters are in bold.
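The liveness bound can be illustrated on a finite trace. The following is a minimal sketch, assuming a channel is recorded as one (irdy, trdy) Boolean pair per cycle; the function name and the trace encoding are illustrative and are not part of xMAS. It computes the longest run of consecutive blocked cycles, which is the smallest x with which the trace is consistent.

```python
def max_blocked_run(trace):
    """Return the longest run of consecutive blocked cycles in a trace.

    trace is a list of (irdy, trdy) booleans, one pair per cycle
    (a hypothetical encoding chosen for illustration). A cycle is
    blocked when irdy is asserted and trdy is not.
    """
    longest = current = 0
    for irdy, trdy in trace:
        if irdy and not trdy:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# A transfer attempt is blocked for two cycles and then succeeds.
example_trace = [
    (False, False),
    (True, False),
    (True, False),
    (True, True),
    (False, True),
]
assert max_blocked_run(example_trace) == 2
```

On this trace any liveness bound x of at least 2 is consistent with the observed behavior, while x = 1 is violated.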
The stateful xMAS primitives are implemented as finite state machines, and the stateless primitives as combinational logic. The network has a single initial state in which all queues are empty. Each data queue comprises one or more queue slots, and every data queue slot in network N is indexed by a unique identifier i. To save space, only brief descriptions of the xMAS components are provided here. The interested reader is referred to papers by Chatterjee et al. [5], [7]. This paper adopts the convention that data transformations are always unary, as in the so-called restricted primitives of previous work [5]. Under this assumption, deviating from the original xMAS descriptions [5], it suffices to have the function primitive be the only one that transforms data.
1) Queue: Parameterized by its depth (number of slots). Data packets or tokens are read from the fixed head slot, and written to a tail position that varies with the number of packets stored in the queue. When a packet is read from the head slot, all other packets in the queue advance by one slot toward the head.
2) Function: Transforms input data on i to output data on o using a deterministic function that is a parameter f of the primitive. The function primitive directly connects the irdy and trdy signals from its input channel and output channel, and therefore is transparent with respect to timing.
3) Data or Token Source: A source sends packets through its output channel o. A data source nondeterministically decides when to send a packet, and the data of said packet is also nondeterministic. A token source is eager and attempts to send a token on every cycle.
4) Data or Token Sink: A sink consumes packets from a channel i. A data sink nondeterministically decides when to consume a waiting data packet, but always does so within x cycles, where x is a parameter of the sink. A token sink eagerly consumes tokens from channel i.
5) Fork: A synchronization primitive that consumes a data packet or token from i and produces a packet of the same type on b, as well as a token on a.
6) Join: A synchronization primitive that consumes a token from a and a data packet or token from b, and outputs on o a packet of the same type as b.
7) Switch: A routing primitive parameterized by a switching function f : i → B. A switch consumes a data packet from i and produces it on a if f(i) = true, and on b otherwise.
8) Merge: Arbitration primitive that consumes a packet from either a or b, and produces the same packet on o. A state bit u stores the arbitration priority among the inputs, and its update ensures local fairness.
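As an illustration of how a stateful primitive drives its handshake signals, the following is a minimal sketch of the queue primitive. It assumes the usual reading that the output initiates whenever the queue is nonempty and the input accepts whenever the queue is not full; the class and method names are hypothetical and chosen only for this example.

```python
from collections import deque


class Queue:
    """Behavioral sketch of a FIFO queue with irdy/trdy handshakes."""

    def __init__(self, depth):
        self.depth = depth      # number of slots (the queue's parameter)
        self.slots = deque()    # leftmost element is the head slot

    def o_irdy(self):
        # Assumed rule: the output offers a packet when the queue is nonempty.
        return len(self.slots) > 0

    def i_trdy(self):
        # Assumed rule: the input accepts a packet when the queue is not full.
        return len(self.slots) < self.depth

    def step(self, i_irdy, i_data, o_trdy):
        """Advance one cycle; return the packet read from the head, if any."""
        read = self.o_irdy() and o_trdy
        write = i_irdy and self.i_trdy()
        out = self.slots.popleft() if read else None  # others advance toward head
        if write:
            self.slots.append(i_data)                 # written at the tail position
        return out


q = Queue(depth=2)
q.step(True, "p0", False)
q.step(True, "p1", False)
assert not q.i_trdy()                      # full queue blocks its input channel
assert q.step(True, "p2", True) == "p0"    # head is read; blocked write is dropped
```

Both handshake values are sampled before the state is updated, so a full queue does not accept a new packet in the same cycle in which its head is read.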
B. Sketch of Latency Lemmas
To make latency verification more efficient, this paper adds two kinds of latency lemmas, termed progress lemmas and age lemmas. An informal sketch here shows how each type of lemma is derived and applied, using a credit loop network as an illustrative example. A more detailed description of progress and age lemmas will follow in Section IV.
1) Description of Credit Loop Network:
The credit loop network in Fig. 3, adapted from work by Chatterjee et al. [6], implements credited flow of data packets from a master agent to a target agent. The network has three queues: "available tokens" in the master, and "outstanding credits" and "ingress" in the target. The ingress queue stores data packets, and the other two store tokens. The master's source nondeterministically injects data packets on channel a. The data packets consume one available token when propagating through the join and onward to the ingress. Packets stored in the ingress can be consumed only if there are outstanding credits, and if so the data sink on channel d cannot block progress for more than five cycles. Whenever the data sink consumes a packet from the ingress, an outstanding credit is also consumed by the token sink. The token source simultaneously adds tokens to both the available tokens and outstanding credits queues in every cycle in which both have free space.
2) Progress Lemmas-Bounds on Time to Make Progress: The first step toward proving an end-to-end latency bound is to compute a conservative upper bound on how long a packet might wait to advance once inside the ingress queue. In doing so, one can assume that the ingress holds at least one data packet (i.e., $n_1 \neq 0$) and reason about different conditions for the rest of the network.
1) If a token is in the outstanding credits queue, then signal d.irdy is asserted and the data sink may at any time consume the data packet while the token sink consumes the outstanding credit. The liveness bound of the data sink limits it to five cycles of blocking, so c.trdy occurs within five cycles from states satisfying $n_2 \neq 0$.
2) If the outstanding credits queue is empty and the available tokens queue is not full, then the token source will add an outstanding credit, causing d.irdy to be asserted in the next cycle. The sink can only block for five cycles once d.irdy is asserted, so in the worst case c.trdy will occur within six cycles from states satisfying $n_2 = 0 \wedge n_0 \neq 2$.
3) If the outstanding credits queue is empty and the available tokens queue is full, then no token can ever be injected into the outstanding credits queue, and the packet in the ingress will never advance.
Next, the approach assumes the network is deadlock-free, and moreover hypothesizes that every reachable network state in which a packet awaits progress will satisfy either condition (1) or (2) described above. This hypothesis is formalized as the candidate inductive invariant $\theta_{c,\mathrm{STATE}}$ (given by (1) for the credit loop). If $\theta_{c,\mathrm{STATE}}$ is valid, and all reasoning is sound, a packet in the ingress should always make progress in, at worst, the 6th future cycle (denoted by $\delta_{abs} = 6$ in Fig. 3), and property $\theta_{c,\mathrm{TIMING}}$ (2) explicitly checks this on channel c.
Properties $\theta_{c,\mathrm{STATE}}$ and $\theta_{c,\mathrm{TIMING}}$ are collectively called progress lemmas. The progress lemmas are conservative and they over-approximate the reachable state space. As will be shown later (by (15) in Section VI-B), the condition ($n_2 = 0$) is unsatisfiable when the ingress contains packets, so condition (2) discussed above is unachievable.
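The case analysis for the credit loop can be sketched as a small guarded bound set. The sketch below assumes that n0 and n2 count the items in the available tokens and outstanding credits queues, that the available tokens queue holds at most two tokens (reading "not full" as n0 ≠ 2), and that the outstanding credits queue also has capacity two; the last capacity is chosen only for illustration. It takes the generalized guard to be the disjunction of the guards and the generalized bound to be the largest bound, and lists the states not covered by any guard.

```python
from itertools import product

AVAILABLE_CAP = 2     # assumed from the guard "n0 != 2" meaning "not full"
OUTSTANDING_CAP = 2   # assumed for illustration only

# Each entry is (guard over (n0, n2), bound in cycles on c.trdy).
guarded_bounds = [
    (lambda n0, n2: n2 != 0, 5),                          # condition 1
    (lambda n0, n2: n2 == 0 and n0 != AVAILABLE_CAP, 6),  # condition 2
]


def g_abs(n0, n2):
    """Generalized guard: at least one individual guard holds."""
    return any(guard(n0, n2) for guard, _ in guarded_bounds)


# Generalized bound: the weakest (largest) of the individual bounds.
delta_abs = max(bound for _, bound in guarded_bounds)

# States in which no guard holds; these correspond to condition 3.
uncovered = [
    (n0, n2)
    for n0, n2 in product(range(AVAILABLE_CAP + 1), range(OUTSTANDING_CAP + 1))
    if not g_abs(n0, n2)
]

assert delta_abs == 6
assert uncovered == [(2, 0)]
```

The single uncovered valuation (n0 = 2, n2 = 0) is the deadlocked configuration of condition 3, which the state lemma claims is unreachable whenever a packet is waiting.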
3) Age Lemmas-Bounds on Time Since Injection: If the queue slots are visited in a known order, and the progress lemmas provide a way to bound the time spent at each slot, then it becomes possible to compute a bound on the total propagation delay through the network. The credit loop has an obvious ordering among queue slots, as a packet injected from the data source first occupies the tail of the ingress (or bypasses it), then the head of the ingress, then reaches the sink. The progress lemmas assert that channel c will transfer any waiting packet in the 6th future cycle in the worst case. This means that a packet will advance every seven cycles. If a packet cannot spend more than seven cycles in either ingress slot, an age bound of eight cycles is implied for the ingress tail slot, and 15 cycles for the ingress head slot; these specialized bounds are called age lemmas, and formulated using a stage graph as shown at the top of Fig. 3 .
It is clear that the total latency is bounded by 15 if the age lemmas can be proved, yet including the latency lemmas makes the 15 cycle bound compositional and easier to verify using k-induction. Proving the 15 cycle latency bound requires an induction depth of 13 frames without latency lemmas, versus just eight frames with them; the reduced induction depth translates to a 2x speedup in this case. In general, the induction depth required to prove a latency bound property without the lemmas is proportional to the total latency, while induction depth to prove the same bound using lemmas is proportional to the time for a packet to make progress. The speedup from using latency lemmas is therefore more pronounced when proving large latency bounds, as will be shown in Sections VI and VII.
III. FORMALISM
As sketched in the previous section, a set of conjectured latency lemmas makes it possible to efficiently verify a possibly loose bound on the worst case end-to-end latency from any source in the network to any sink. The model being verified is an xMAS model N. Every data-carrying queue slot (i.e., the slots belonging to queues where input and output channels carry data packets instead of tokens) in the network is assigned a unique index i, and variable $q_i$ refers to the content of the ith queue slot. A latency bound is translated to a simple safety property by checking the age of a packet in each cycle. The age of the packet in slot i is denoted $\mathit{age}(q_i)$ and is stored using specification variables.²
A. Checking Cumulative Latencies as Age Bounds
The mapping from queue slots in N to stages in G can depend on the state of N. This allows the same queue slot to represent different stages of progress depending on certain aspects of network state. The possible mapping from the ith queue slot to the jth stage is defined by a formula $p_{i,j}$; the ith slot maps to the jth stage whenever $p_{i,j}$ is true. The formula $p_{i,j}$ is evaluated over a slot state $q_i$ and a global state $w$, where:
1) $Q_i$ is the set of states of slot i; $q_i$ denotes a state of $Q_i$.
2) $W$ is the set of states for selected global variables including the number of items in any queue and the status of reservations in Section VII; $w$ denotes a state of $W$.
A few special cases are worth mentioning. If a slot i never maps to stage j, then $p_{i,j} = \mathit{false}$ regardless of $q_i$ and $w$. If slot i always maps to stage j, then $p_{i,j} = \mathit{true}$, regardless of $q_i$ and $w$.
² Specification variables are defined here as variables in N that record expressions over system variables but do not influence them.
Each stage j in the stage graph G has an associated $t_j$ that is a claimed upper bound on the age of any packet that maps to the stage. An age lemma for the ith slot and jth stage of progress is written as $\phi(i, p_{i,j}, t_j)$ (4). For each slot i, assume the existence of a specification variable $\mathit{used}_i$ that is true in every state where slot i stores a packet. The property $\phi(i, p_{i,j}, t_j)$ checks that, whenever the network state satisfies $p_{i,j}$, any packet stored in slot i must have been injected into the network less than $t_j$ cycles prior.
Let $\Phi_L$ denote a property that is true in a state of N if the age lemmas for all progress stages hold (5).
Let $G_t$ denote the property for a global latency bound of t. The global bound differs from stage bounds in that it is checked on all used queue slots regardless of the state of N.
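The difference between stage-specific age lemmas and the global bound can be made concrete by evaluating both in a single network state. The following is a minimal sketch; it assumes the age lemma has the implication form described in the text (a used slot that maps to stage j holds a packet of age less than t_j) and that the global bound uses the same strict comparison. The state encoding and function names are hypothetical.

```python
def age_lemma_holds(used_i, p_ij, age_i, t_j):
    """Age lemma for slot i and stage j, evaluated in one state."""
    return not (used_i and p_ij) or age_i < t_j


def all_age_lemmas_hold(slots, mapping, bounds):
    """Conjunction of the age lemmas over every slot/stage pair.

    slots:   list of (used, age), one entry per slot i
    mapping: mapping[i][j] is the truth value of p_ij in this state
    bounds:  bounds[j] is the stage bound t_j
    """
    return all(
        age_lemma_holds(used, mapping[i][j], age, bounds[j])
        for i, (used, age) in enumerate(slots)
        for j in range(len(bounds))
    )


def global_bound_holds(slots, t):
    """Global bound: every used slot holds a packet of age less than t."""
    return all(not used or age < t for used, age in slots)


# Two slots, each always mapping to its own stage, with stage bounds 8 and 15.
mapping = [[True, False], [False, True]]
bounds = [8, 15]

ok_state = [(True, 3), (True, 12)]
assert all_age_lemmas_hold(ok_state, mapping, bounds)
assert global_bound_holds(ok_state, 15)

# A packet of age 9 in the first slot satisfies the global bound of 15
# but violates the stage bound of 8.
late_state = [(True, 9), (False, 0)]
assert global_bound_holds(late_state, 15)
assert not all_age_lemmas_hold(late_state, mapping, bounds)
```

The second state shows why the lemmas strengthen the global property: they reject states that the global bound alone would still admit.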
B. Auxiliary Invariants ($\Psi$)
An advantage of modeling microarchitectures using the xMAS formalism is automated invariant strengthening. The automatically generated invariants are unrelated to QoS, but are essential for verifying any type of property in xMAS networks because they prevent the verifier from exploring unreachable states that may include deadlocked states. The set of auxiliary invariants is denoted $\Psi$ and comprises local invariants on queues [5], [7], persistency invariants on channels [15], and design-specific numeric invariants $\psi_{num}$ [6], [8], [16]. Note that the design-specific numeric invariants used are automatically derived in earlier works [6], [8], and added manually in this paper.
C. Proving Latency Bounds
The overall problem of proving a latency bound t is $N \models G_t$. With auxiliary invariants $\Psi$ the problem becomes $N \models G_t \wedge \Psi$. As described in detail in the next section, this paper further strengthens the problem using progress lemmas $\Theta$ and age lemmas $\Phi_L$, such that the overall problem becomes $N \models G_t \wedge \Psi \wedge \Phi_L \wedge \Theta$. It will be shown that this property is compositional, and leads to inductive proofs with shorter induction depths and shorter runtimes.
IV. LATENCY LEMMAS
A distinguishing feature of this paper is to strengthen overall latency properties using precise bounds termed age lemmas for different stages of progress arranged in a stage graph G . Computing the amount of time that a packet can spend in each stage further requires computing transfer bounds for different channels in the network. Algorithms are given for automatically deriving age lemmas for a subset of possible xMAS networks, yet the power of age lemmas is more general than just the subset of designs that are handled automatically, as will be demonstrated in the ring interconnect example of Section VII.
The remainder of this section first presents an automated approach for creating a stage graph, and then presents an automated approach for computing the transfer bounds that determine age bounds of each stage in the stage graph.
A. Generating Age Lemmas ($\Phi_L$) Using Stage Graph G
A stage graph is a tool for constructing age lemmas that lead to compositional proofs of overall latency bounds. In each state of the network, every queue slot that stores a data packet maps to a stage in G. The stage that a packet maps to determines a specialized age bound to check on the packet. Formally, a stage graph is an acyclic digraph G = (S, E) with vertices $s_0, s_1, \ldots, s_{L-1} \in S$ called stages. A stage $s_j$ has an associated lemma $\phi(i, p_{i,j}, t_j)$ asserting that any packet in slot i of N that maps to stage $s_j$ in G must have an age less than $t_j$. The conjunction of all age lemmas is denoted $\Phi_L$ (5).
Stage graph construction is automated for acyclic networks. Acyclic networks are those without cycles in data paths, where a "data path" is defined as any sequence $c_0, c_1, \ldots, c_N$ of data channels with each $c_i$ and $c_{i+1}$ being input and output channels of the same xMAS primitive. One can trivially check whether a network N is acyclic, for example by using depth-first search from each data channel. An automated approach for constructing stage graphs for acyclic networks is presented as a two-step process: first creating the stage graph topology, and second adding the age annotations to the stages.
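The acyclicity check can be sketched with a standard three-color depth-first search. The sketch assumes the network has already been abstracted into a dictionary that maps each data channel to the data channels reachable through one primitive; this encoding and the channel names are illustrative only.

```python
def is_acyclic(successors):
    """Return True if the channel graph contains no directed cycle.

    successors maps a channel name to the channels that follow it
    through a single primitive (hypothetical encoding).
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the DFS stack, finished
    color = {c: WHITE for c in successors}

    def visit(c):
        color[c] = GRAY
        for nxt in successors.get(c, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:                 # back edge closes a cycle
                return False
            if state == WHITE and not visit(nxt):
                return False
        color[c] = BLACK
        return True

    return all(color[c] != WHITE or visit(c) for c in list(successors))


# A feed-forward chain of data channels is acyclic.
assert is_acyclic({"a": ["b"], "b": ["c"], "c": []})
# A ring of data channels is not.
assert not is_acyclic({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Only data channels should be included in the dictionary, since token channels may form loops (as in the credit loop) without making the data paths cyclic.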
1) Creating Stage Graph Topology: The queue slots of an acyclic network will always have a partial ordering with respect to when a packet can occupy them. The topology of the stage graph reflects this ordering. Each queue slot i is represented by its own stage $s_{i+1}$ in stage graph G. The mapping from queue slots to stages is accomplished by setting $p_{i,j}$ to true for combinations of i and j where $i = j - 1$. A special source stage ($s_0$) is added for all data sources, and a sink stage ($s_{M+1}$) is added for data sinks. These source and sink stages are nonstandard in that no packets can ever map to them.
Edges in the stage graph reflect transitions that packets can make in N . All data sources map to a stage s 0 , and all data sinks map to stage s M+1 . Stages s x and s y in the stage graph G are connected by an edge if the components (source, sink, or queue slot) mapping to s x and s y are adjacent slots within a single queue, or if there exists a queue-free data path in N from the first component to the second.
2) Assigning Age Bounds to Stages: Once the stages and edges of the stage graph are created, age bounds must be assigned to each stage. The first step toward this is to compute for each stage $s_j$ a value $d_j$ that is the maximum residence time of the stage. Source stage ($s_0$) is assigned $d_0 = 1$, and sink stage ($s_{M+1}$) has $d_{M+1} = 0$; all other stages correspond to queue slots. For a stage ($s_j$) corresponding to a queue slot, the maximum residence time ($d_j$) cannot exceed one greater than the maximum blocking time of the channel that is the output of the queue containing this slot. While the maximum blocking time of each channel is not known a priori, the next subsection gives a way to compute an upper bound on it. For a channel c, the computed upper bound on blocking time is denoted $\delta_{abs}$, and therefore $d_j = 1 + \delta_{abs}$.
Now that each stage in the network is assigned a residence time ($d_j$), age bounds for each stage are computed. The maximum age in any stage depends on the maximum residence time of that stage, and the maximum age of a packet when it enters the stage. The critical path for each stage $s_j$ is therefore the path from $s_0$ to $s_j$ in G for which the sum of the residence times of the stages on the path is largest. This path sum is denoted $t_j$ and it is the age bound of stage $s_j$.
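The age bound assignment amounts to a longest-path computation on the acyclic stage graph. The following is a minimal sketch, assuming stages are given by name with their residence times and that every stage is reachable from the source stage; the function name and the stage names are illustrative. It is applied to the credit loop of Section II-B, where each ingress slot has residence time 1 + 6 = 7.

```python
from functools import lru_cache


def age_bounds(edges, residence):
    """Compute t_j for every stage as the largest path sum of residence times.

    edges:     iterable of (predecessor, successor) stage names
    residence: dict mapping each stage name to its residence time d_j
    """
    preds = {stage: [] for stage in residence}
    for u, v in edges:
        preds[v].append(u)

    @lru_cache(maxsize=None)
    def t(stage):
        # Worst age on entry is the largest bound among predecessors.
        entry_age = max((t(p) for p in preds[stage]), default=0)
        return entry_age + residence[stage]

    return {stage: t(stage) for stage in residence}


# Credit loop: the source feeds the ingress tail or bypasses it to the head.
residence = {"source": 1, "tail": 7, "head": 7, "sink": 0}
edges = [("source", "tail"), ("source", "head"), ("tail", "head"), ("head", "sink")]

bounds = age_bounds(edges, residence)
assert bounds["tail"] == 8
assert bounds["head"] == 15
```

The computed values match the age bounds of eight and 15 cycles quoted for the ingress tail and head slots; the bypass edge from the source to the head does not matter because the path through the tail is longer.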
3) Global Age Bound $T_L$ From Age Lemmas: If every data packet in every reachable state of N maps to a stage in G, then the largest $t_j$ associated with any stage is a claimed global age bound for N. Therefore, let $T_L$ denote the largest $t_j$ over all stages. $T_L$ is often conservative for several reasons.
a) The channel blocking bounds that are computed are conservative, causing the residence times of each stage to be over-approximated. This occurs in the credit loop (Section II-B) where the blocking time of packets in the ingress queue is overestimated.
b) The stage graph conservatively over-approximates the connectivity of the network by ignoring logical propagation conditions.
c) It may be impossible for any one packet to experience all of the worst-case progress bounds, even if each is individually achievable.
4) Cyclic Versus Acyclic Networks:
No automated procedure is given to construct a stage graph for cyclic networks. In such a network, a straightforward mapping from each slot to a single stage will induce cycles in G , and this leads to infinite age bounds even if each stage has a known finite residence time. In such a case, the progress stages must be made more precise by refining the p i,j formulas to consider more than just the slot that a packet occupies. Section VII demonstrates on a ring interconnect that a manual refinement can be used to obtain an acyclic stage graph from a cyclic network.
B. Channel Blocking Bounds and Progress Lemmas ($\Theta$)
The preceding section shows that stage residence times in the stage graph G depend on channel blocking bounds of the output channels of queues, and now a heuristic is given for using rule-based propagation to compute blocking bounds. As the heuristic propagation generates blocking bounds, it also generates "progress lemmas" that formalize the assumptions made in deriving the bounds. These progress lemmas are added to the overall verification problem to check the assumptions.
Given a channel c, a blocking bound of x cycles is the claim $c.\mathit{irdy} \Rightarrow \mathbf{F}_{\le x}\; c.\mathit{trdy}$. Such a blocking bound of a channel c is computed by first deriving a guarded bound set $r_{c.trdy}$ (7). Each guarded bound $\langle g_i, \delta_i \rangle \in r_{c.trdy}$ consists of a predicate $g_i$ on network state and a bound $\delta_i$ on the number of cycles until c.trdy is asserted from any network state satisfying $g_i$. In other words, the guarded bound set $r_{c.trdy}$ can be used to make the claim of (8).
The guarded bound set is used to create a single unconditional bound on the number of cycles of blocking on channel c. To accomplish this, the guarded bound set $r_{c.trdy}$ (7) is generalized into a single guarded bound $\langle g_{abs}, \delta_{abs} \rangle$, where $g_{abs}$ is the disjunction of the guards $g_i$ and $\delta_{abs}$ is the largest of the bounds $\delta_i$ (9).
The generalized form of (8) is given by (10).
A blocking bound is proved by way of the guarded bound set r c.trdy using (10) . Property θ c,STATE (11) is equivalent to the portion of (10) labeled "guard coverage"; θ c,STATE checks that the generalized guard holds in all states where the initiator of channel c is attempting to send a packet. Assuming that θ c,STATE is valid, then (10) simplifies to property θ c,TIMING (12) . Property θ c,TIMING checks that the bound (δ abs ) implied by the guarded bound set does in fact hold in the network. This property can be considered as a way to check that the bounds associated with the guards are correct. For each channel c, θ c (13) is checked.
Property $\theta_c$ is a roundabout way of proving $c.\mathit{irdy} \wedge g_{abs} \Rightarrow \mathbf{F}_{\le \delta_{abs}}\; c.\mathit{trdy}$, which is itself just a strengthened version of $c.\mathit{irdy} \Rightarrow \mathbf{F}_{\le \delta_{abs}}\; c.\mathit{trdy}$ (i.e., a strengthened version of $\theta_{c,\mathrm{TIMING}}$). Given that the stage graph ultimately makes use of only $\delta_{abs}$, one may question the motivation for proving $\theta_{c,\mathrm{STATE}}$ and $\theta_{c,\mathrm{TIMING}}$. These properties are checked because they formalize and validate the assumptions made in deriving $\delta_{abs}$, and if any modifications are needed in generating the guarded bound sets, counterexamples to these properties will indicate where the modifications are required. If property $\theta_{c,\mathrm{STATE}}$ fails, then a counterexample reaches a state where c.irdy is true and $g_{abs}$ is false; this can be remediated by adding to the guarded bound set a new guarded bound $\langle g_i, \delta_i \rangle$ such that $g_i$ is true for the bad state of the counterexample. If property $\theta_{c,\mathrm{STATE}}$ passes and $\theta_{c,\mathrm{TIMING}}$ fails, then there exists some $\langle g_i, \delta_i \rangle$ in the guarded bound set such that the guard ($g_i$) does not imply the bound ($\delta_i$); this would indicate that the propagation rules used to create the guarded bound set are unsound.
The preceding paragraphs demonstrate how to derive a conservative bound $\delta_{abs}$ for an arbitrary channel c. The residence times are created for each stage in the stage graph G by deriving such a bound for each channel c that is the output of a data queue. The property checked on the network is then (14). The following sections present the approach for computing a guarded bound set for each such channel. This is done using operations for combining guarded bound sets, applied according to the xMAS network connections. A primitive has operations to determine the guarded bound set for each of its output signals, and these operations are defined using recurrence relations over other guarded bound sets.
$\Theta := \bigwedge_{c \,\in\, \text{data queue outputs}} \theta_c \qquad (14)$
1) Recurrence Relations for Future Readiness: Analogous to how irdy and trdy mark current readiness of initiators and targets, guarded bound sets mark future readiness of channel initiators and targets. For any signal irdy marking current readiness, R(irdy) is a symbolic representation of the guarded bound set marking its future readiness. Just as the readiness signals irdy and trdy are defined using recurrence relations over other readiness signals and state variables, the guarded bound sets for future readiness are defined using recurrence relations over other guarded bound sets and state variables.
The recurrence relations for R(irdy) and R(trdy) are defined using the three operations MAX, PLUS, and ITE. MAX is used when readiness depends on the larger of two guarded bound sets. PLUS is used when readiness depends on the sum of two guarded bound sets. ITE is used when readiness depends on one of two guarded bound sets, with the choice determined by the state of some Boolean condition.
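One possible reading of the three operations is sketched below. The semantics are an assumption made for illustration, since the text describes the operations only informally: a guarded bound set is a list of (guard, bound) pairs with guards as predicates on a state, MAX and PLUS combine every pair of guarded bounds by conjoining the guards and taking the larger bound or the sum, and ITE conjoins the condition (or its negation) onto each guard. Under this reading an empty set, meaning "no bound is known", propagates through MAX and PLUS. All names are hypothetical.

```python
from itertools import product


def always(state):
    """Guard that holds in every state."""
    return True


def const(delta):
    """Unconditional bound of delta cycles."""
    return [(always, delta)]


def _combine(a, b, op):
    # Pair every guarded bound of a with every guarded bound of b.
    return [
        (lambda s, ga=ga, gb=gb: ga(s) and gb(s), op(da, db))
        for (ga, da), (gb, db) in product(a, b)
    ]


def PLUS(a, b):
    return _combine(a, b, lambda x, y: x + y)


def MAX(a, b):
    return _combine(a, b, max)


def ITE(cond, a, b):
    then_part = [(lambda s, g=g: cond(s) and g(s), d) for g, d in a]
    else_part = [(lambda s, g=g: (not cond(s)) and g(s), d) for g, d in b]
    return then_part + else_part


def bound_in(state, r):
    """Tightest bound whose guard holds in state, or None if none holds."""
    bounds = [d for g, d in r if g(state)]
    return min(bounds) if bounds else None


# Example: readiness within 0 cycles if a counter n is nonzero, and one
# cycle after an upstream bound of 3 cycles otherwise.
def nonempty(state):
    return state["n"] > 0


r = ITE(nonempty, const(0), PLUS(const(3), const(1)))
assert bound_in({"n": 2}, r) == 0
assert bound_in({"n": 0}, r) == 4
```

If the upstream set is empty instead of const(3), the else branch contributes no guarded bounds and bound_in returns None for states with n = 0, which reflects that no guarantee can be made there.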
Defining recurrence relations for each output signal of each primitive is the first step toward expanding each symbolic R(irdy) or R(trdy) into a concrete set of guarded bounds $r_{trdy} \equiv \{\langle g_0, \delta_0 \rangle, \langle g_1, \delta_1 \rangle, \ldots, \langle g_N, \delta_N \rangle\}$. While there is no unique solution for defining useful recurrence relations, the relations for each primitive used in this paper are shown in graphical form in Fig. 4, and explained in the following paragraphs.
a) Data source: Because a data source nondeterministically injects data packets onto channel o, there is no upper bound on when a packet will be injected. The guarded bound set is therefore the empty set {}; from any state of the network there are no conditions that ensure o.irdy will be asserted in the future. In other words, there is no assurance that the source will ever again attempt to inject a packet.
c) Data sink: A data sink is the target of a single channel i. The sink provides to the network a service guarantee to always receive a waiting packet within x cycles. This means that i.trdy is guaranteed to occur no more than x cycles after i.irdy does. In the progress lemmas, given that R(i.irdy) is a (recurrence-defined) guarantee on when i.irdy will occur, R(i.trdy)'s guarantee is x cycles later than that.
For a queue with input channel i and output channel o, the recurrence relation for R(o.irdy) is a case split on whether the queue is empty. In the first case the queue is nonempty, and the value {⟨true, 0⟩} reflects that the output is currently trying to initiate a transfer. In the second case the queue is empty, and the value PLUS(R(i.irdy), {⟨true, 1⟩}) reflects that the output is ready to initiate no more than one cycle after the input is ready to initiate a transfer. The recurrence relation for R(o.irdy) holds because any attempted transfer from the input is immediately received into the empty queue to cause an attempted output transfer in the next cycle. Target readiness R(i.trdy) is handled similarly, with a case split depending on whether or not the queue is currently full.
f) Merge: A merge primitive arbitrates between two input channels a and b, and is the initiator of a single channel o. The merge primitive has a Boolean state variable u to store the current priority among inputs. To create a round robin arbitration policy, the state of u is updated with its negation whenever the high priority input transfers a packet through the merge. The recurrence relations abstract away the arbitration priority u to give a conservative bound for each input that does not depend on the value of u. Because the high priority input channel can only be blocked for R(o.trdy) cycles before transferring, the low priority input achieves high priority in at most PLUS(R(o.trdy), {⟨true, 1⟩}) cycles. Therefore, the progress bound for an input channel of undetermined priority is the time to achieve high priority added to the progress time once high priority is achieved. A merge primitive that uses a priority scheme other than round robin would require modified recurrence relations.

R(o.irdy) := MAX(R(a.irdy), R(b.irdy))
R(a.trdy) := PLUS(PLUS(R(o.trdy), {⟨true, 1⟩}), R(o.trdy))
R(b.trdy) := PLUS(PLUS(R(o.trdy), {⟨true, 1⟩}), R(o.trdy))
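As a worked instance of the merge relations, suppose (purely for illustration) that the output channel has a single unconditional bound, R(o.trdy) = {⟨true, 3⟩}. The value 3 is an assumption chosen for this sketch and is not taken from the networks studied in the paper. Since PLUS adds bounds pairwise, the bound for input a becomes:

```latex
\begin{align*}
\mathrm{PLUS}\bigl(R(o.\mathit{trdy}),\ \{\langle \mathit{true}, 1\rangle\}\bigr)
  &= \{\langle \mathit{true}, 3 + 1\rangle\} = \{\langle \mathit{true}, 4\rangle\} \\
R(a.\mathit{trdy})
  = \mathrm{PLUS}\bigl(\{\langle \mathit{true}, 4\rangle\},\ \{\langle \mathit{true}, 3\rangle\}\bigr)
  &= \{\langle \mathit{true}, 7\rangle\}
\end{align*}
```

Under this assumption, input a waits at most 4 cycles to gain high priority and at most 3 further cycles to transfer, and the same bound of 7 applies to input b.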
g) Switch:
Boolean transfer equations for the switch primitive are parameterized by a switching function f. In the recurrence relations for future readiness in the forward direction, no assumptions are made on the packet data, and therefore no upper bound is asserted for either output. This can optionally be refined using ITE to take into account the data value of packets on channel i.

R(a.irdy) := {}
R(b.irdy) := {}
R(i.trdy) := MAX(R(a.trdy), R(b.trdy))

h) Join:
Join consumes an input packet from channel a and one from channel b to produce a single output packet on channel o. It is only ready to produce a packet on o if both inputs are ready to initiate. It is only ready to consume an input packet from a or b when the other input is ready to initiate and the output is ready to receive. Because each Boolean signal in the join is defined using a logical AND of two other signals, the future readiness of each signal depends on the latest future readiness of the two inputs.

R(i.trdy) := MAX(R(a.trdy), R(b.trdy))
R(a.irdy) := MAX(R(i.irdy), R(b.trdy))
R(b.irdy) := MAX(R(i.irdy), R(a.trdy))

j) Function:
Function transforms data, but is transparent to irdy and trdy signals and propagates future readiness unchanged.
2) Computing Guarded Bound Sets:
The guarded bound sets that describe bounds on future readiness (e.g., r c.trdy ) are created by expanding the symbolic representations of the same (e.g., R(c.trdy)) using the dependency graph of the recurrence relations. The dependency graph is formed by connecting the recurrence relations (Fig. 4) for each primitive in the network according to common signals. Each irdy or trdy signal in N is an input of one primitive and an output of another; similarly each R(irdy) or R(trdy) is dependent upon the relations of one primitive, and also depended upon by the relations of another primitive. Note that circular dependencies can exist.

Procedure 1: CREATEGUARDEDBOUNDSET (function EXPANDSUBTREE, lines 14-33)
    ...
14:     return node
15:   else    // expand children, compose guarded bound sets
16:     visited ← visited ∪ node
17:     r n ← {}    // guarded bound set for node
18:     r left ← EXPANDSUBTREE(left, visited)
19:     r right ← EXPANDSUBTREE(right, visited)
20:     if node = MAX(left, right) then
21:       for ⟨g i , δ i ⟩ ∈ r left , ⟨g j , δ j ⟩ ∈ r right do
22:         r n ← r n ∪ ⟨g i ∧ g j , max(δ i , δ j )⟩
23:       end for
24:     else if node = PLUS(left, right) then
25:       for ⟨g i , δ i ⟩ ∈ r left , ⟨g j , δ j ⟩ ∈ r right do
26:         r n ← r n ∪ ⟨g i ∧ g j , δ i + δ j ⟩
27:       end for
28:     else if node = ITE(predicate, left, right) then
29:       for ⟨g i , δ i ⟩ ∈ r left do
30:         r n ← r n ∪ ⟨g i ∧ predicate, δ i ⟩
31:       end for
32:       for ⟨g j , δ j ⟩ ∈ r right do
33:         r n ← r n ∪ ⟨g j ∧ ¬predicate, δ j ⟩
    ...
For each data queue in the network, with output channel c, CREATEGUARDEDBOUNDSET (Procedure 1) is called to extract the guarded bound set r c.trdy from the dependency graph. Starting from the dependency graph node R(c.trdy), function EXPANDSUBTREE recursively computes the guarded bound sets for each node's future readiness, and includes an additional check to short-circuit to an empty set in the case of a cyclic dependency. The guarded bound sets for each node are computed in the tail of the recursion, and thus computed over two concrete guarded bound sets. The procedure for combining the guarded bound sets of a node's left and right dependency graph children is according to whether the node is implementing MAX (line 20), PLUS (line 24), or ITE (line 28). For MAX and PLUS, the guarded bound sets are combined as Cartesian products augmented by the appropriate numeric operation. The ITE operation does not take a Cartesian product because the guards of the two children are made disjoint by including the predicate of the ITE in opposing polarities.
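The composition step described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the graph encoding (a dictionary of named nodes), the representation of a guard as a set of (predicate, polarity) literals, and all function names are assumptions made for this sketch. The example graph models a source feeding a queue whose output channel c drives a sink with bound 3.

```python
from itertools import product

TRUE = frozenset()  # empty conjunction: a guard that always holds


def lit(name, polarity=True):
    """A guard consisting of a single literal."""
    return frozenset({(name, polarity)})


def max_sets(left, right):
    # Cartesian product; guards are conjoined, bounds combined with max.
    return {(gi | gj, max(di, dj)) for (gi, di), (gj, dj) in product(left, right)}


def plus_sets(left, right):
    # Cartesian product; guards are conjoined, bounds are added.
    return {(gi | gj, di + dj) for (gi, di), (gj, dj) in product(left, right)}


def ite_sets(pred, left, right):
    # No Cartesian product: the predicate makes the two sides disjoint.
    return ({(g | lit(pred, True), d) for g, d in left}
            | {(g | lit(pred, False), d) for g, d in right})


def expand(name, graph, visited=frozenset()):
    """Recursively expand a symbolic node into a concrete guarded bound set."""
    if name in visited:          # cyclic dependency: short-circuit to empty set
        return set()
    node = graph[name]
    if node[0] == "LEAF":
        return set(node[1])
    visited = visited | {name}
    if node[0] == "ITE":
        _, pred, left, right = node
        return ite_sets(pred, expand(left, graph, visited),
                        expand(right, graph, visited))
    op, left, right = node
    combine = max_sets if op == "MAX" else plus_sets
    return combine(expand(left, graph, visited), expand(right, graph, visited))


# Hypothetical network: source -> queue -> sink (sink bound x = 3).
graph = {
    "R(c.trdy)": ("PLUS", "R(c.irdy)", "sink_bound"),
    "sink_bound": ("LEAF", {(TRUE, 3)}),
    "R(c.irdy)": ("ITE", "nonempty", "now", "after_input"),
    "now": ("LEAF", {(TRUE, 0)}),
    "after_input": ("PLUS", "R(i.irdy)", "one"),
    "one": ("LEAF", {(TRUE, 1)}),
    "R(i.irdy)": ("LEAF", set()),   # data source: no bound
}
result = expand("R(c.trdy)", graph)
```

In this sketch the empty set from the source eliminates the "queue empty" branch, so the only surviving guarded bound is the one guarded by the queue being nonempty.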
Finally the procedure returns to line 2 with a guarded bound set (r c.trdy ) for target readiness of channel c. Because channel c has a queue as its initiator, and only cases where channel c is blocked are relevant, all guards in the set are strengthened with the condition that the initiating queue is nonempty (lines 3-6). A pruning step (line 7) then removes any guarded bounds ( g i , δ i ) that are trivially unsatisfiable, such as a guard asserting that a single queue is both full and empty; an improved pruning step could also remove guards that violate numeric invariants. Finally, CREATEGUARDEDBOUNDSET returns with the final guarded bound set r c.trdy , which is then used as explained at the start of this section (7) to derive residence times of stages in G .
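The strengthening and pruning steps can be illustrated with a short Python sketch. The guard representation (a frozenset of (predicate, polarity) literals), the predicate names such as "q.empty", and the `exclusive_pairs` parameter are all assumptions introduced here for illustration; the paper does not prescribe a data structure.

```python
def strengthen(bound_set, literal):
    """Conjoin one literal, e.g. ("q.empty", False), onto every guard."""
    return {(guard | {literal}, delta) for guard, delta in bound_set}


def trivially_unsat(guard, exclusive_pairs=()):
    """True if the guard asserts a predicate in both polarities, or asserts
    two predicates declared mutually exclusive (e.g. full and empty)."""
    asserted = {name for name, pol in guard if pol}
    negated = {name for name, pol in guard if not pol}
    if asserted & negated:
        return True
    return any(a in asserted and b in asserted for a, b in exclusive_pairs)


def prune(bound_set, exclusive_pairs=()):
    """Drop guarded bounds whose guards can never hold."""
    return {(g, d) for g, d in bound_set
            if not trivially_unsat(g, exclusive_pairs)}


# Hypothetical guarded bound set for channel c, whose initiator is queue q.
r = {
    (frozenset({("q.empty", True)}), 2),
    (frozenset({("p.full", True), ("p.empty", True)}), 9),
    (frozenset({("p.full", False)}), 4),
}
strengthened = strengthen(r, ("q.empty", False))   # initiating queue nonempty
pruned = prune(strengthened, exclusive_pairs=[("p.full", "p.empty")])
```

Only the third guarded bound survives here: the first contradicts the nonempty condition, and the second asserts that queue p is both full and empty.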
3) Limitations in Deriving Blocking Bounds:
This paper presents recurrence relations that are found to be useful for deriving blocking bounds for example networks. These recurrence relations will not be precise enough to handle all possible networks. Property (14) is formulated such that its failure will serve to indicate when the recurrence relations are not suitable for a given network. Furthermore, a counterexample to Property (14) can guide the development of rules that lead to a more precise set of recurrence relations. Some situations in which this may arise are highlighted.
One abstraction used in the recurrence relations is to consider only whether a queue is full or empty, and not the number of items in the queue. A similar abstraction is made in proving deadlock freedom by Verbeek and Schmaltz [16] . The significant difference between this paper and Verbeek's approach is that this paper goes beyond deadlock freedom to include numeric progress bounds in the reasoning. Yet, like Verbeek's work, the abstraction causes the approach to be sound but incomplete for proving latency bounds.
A second conservative abstraction is that the recurrence relations presented make no assumptions about data values of packets. This abstraction prevents the recurrence relations from giving any bounds on readiness outputs of the switch primitive. In a network where progress of a data packet depends on switch output, refinement would be needed. This situation can be addressed with manual refinement, by modifying the recurrence relations of the switch to take into account the data value of the input. Ongoing work by Viktorov and Gotmanov [17] aims to overcome this limitation by propagating rules that can be automatically refined to handle cases such as this. Counterexample-guided abstraction-refinement techniques [18] could also be used.
V. METHODOLOGY
The methodology used across all experiments is described here. The xMAS models are created within a C++ framework, with primitives as objects. Progress lemmas and age lemmas are added automatically, and flattened word-level Verilog is generated with all properties added as assertions. The Verilog is bit-blasted into an and-inverter graph (AIG) using the VeriABC flow [19]. Verification is performed on the AIG using the bit-level model checker ABC [20] on a 2.4 GHz Intel Core i5 processor with 4 GB of RAM. The bounded model checking (BMC), property directed reachability (PDR) [13], and k-induction calls are performed by ABC.
A. Bounded Model Checking to Evaluate Looseness of T L
The global age bound (T L ) that is implied by the age lemmas can be loose. While the tight latency bound for a given network is not generally known, the looseness of T L is bounded by comparison to T FEAS , where T FEAS is the smallest number such that G T FEAS cannot be disproved within allotted resource bounds. T FEAS is found by iteratively increasing T and using BMC to disprove each G T until reaching the first value of T that cannot be disproved; that value is T FEAS . There cannot exist a tighter bound than T FEAS because G T FEAS −1 is disproved by counterexample. However, G T FEAS may not be proved, and could instead be an artifact of the BMC resource limits.
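The search for T FEAS can be sketched as a simple loop around a BMC oracle. In the Python sketch below, `bmc_finds_cex` is a hypothetical callback standing in for a resource-bounded BMC run; it is not an ABC command or API, and the toy oracle (a network whose worst-case latency is assumed to be 12 cycles) is invented purely for illustration.

```python
from typing import Callable, Optional


def find_t_feas(bmc_finds_cex: Callable[[int], bool],
                t_start: int = 1,
                t_limit: int = 10_000) -> Optional[int]:
    """Return the first bound T that BMC cannot disprove, or None.

    bmc_finds_cex(T) should return True when bounded model checking finds a
    counterexample to the age bound T within its resource limits.
    """
    t = t_start
    while t <= t_limit:
        if not bmc_finds_cex(t):
            return t          # first bound that is not disproved: T_FEAS
        t += 1                # bound T was disproved; try a looser one
    return None


def toy_oracle(t: int) -> bool:
    # Stand-in for BMC: a counterexample exists for every bound below 12.
    return t < 12


t_feas = find_t_feas(toy_oracle)
```

As in the text, a returned value only means that no counterexample was found within the oracle's limits; it is not by itself a proof of the bound.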
B. Efficient Encoding of Packet Ages
The formulation of an age bound property requires that the age of a packet in a queue slot can be checked as a simple safety property and evaluated in a single state of N . The age of a packet occupying slot i is denoted age(q i ), and begins at 0 when the packet is first injected and then increments in every cycle until the packet is ejected. The variable age(q i ) is considered a part of the data packet and requires each queue slot in the network to be widened by log 2 (t max ) bits to store it, where t max is some number that exceeds the largest packet age bound that is checked. Because standard xMAS queues store data without modifying it, some extra logic is added to the queue primitives for incrementing packet ages. For each queue slot i, the age variable age(q i ) that is stored is incremented by 1 (mod t max ) relative to the value that would be stored by a normal xMAS queue. This ensures that the ages are incremented both for packets remaining in slot i from the previous cycle, and for packets being newly written into slot i from the queue's input channel. An empirical runtime comparison [21] favors for inductive verification the particular encoding of packet ages employed here over that of earlier works [22] . 
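The age encoding can be illustrated with a small behavioral model. The class below is a sketch under stated assumptions: the name `AgedQueue`, the list-based FIFO, and the `step` interface are all hypothetical, and a packet offered to a full queue is simply not accepted (in a real network it would remain blocked upstream, which is not modeled here).

```python
class AgedQueue:
    """FIFO whose stored packets carry an age that grows every cycle."""

    def __init__(self, depth, t_max):
        self.depth = depth
        self.t_max = t_max
        self.slots = []          # list of (data, age), head of queue first

    def step(self, incoming=None, dequeue=False):
        """Advance one clock cycle.

        incoming: (data, age) offered on the input channel, or None.
        dequeue:  True if the output channel transfers this cycle.
        Returns the ejected packet (with its stored age), or None.
        """
        out = None
        nxt = list(self.slots)   # what a normal queue would store next
        if dequeue and nxt:
            out = nxt.pop(0)
        if incoming is not None and len(self.slots) < self.depth:
            nxt.append(incoming)
        # Every stored age is one more (mod t_max) than a normal queue
        # would store, covering both retained and newly written packets.
        self.slots = [(d, (a + 1) % self.t_max) for d, a in nxt]
        return out


q = AgedQueue(depth=2, t_max=64)
q.step(incoming=("a", 0))        # packet a injected with age 0
q.step(incoming=("b", 0))        # packet b injected one cycle later
```

After these two cycles, packet a has age 2 and packet b has age 1, so an age bound can be checked by inspecting the slots in a single state.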
VI. ILLUSTRATIVE EXAMPLES
Several examples highlight strengths and weaknesses of using latency lemmas. The primary strength is a dramatic improvement in verification runtime, and the weakness is that the bounds proved using lemmas are in some cases loose. All latency lemma reasoning including stage graph construction is automated for the examples in this section. The subsequent example of a ring interconnect (Section VII) is a case of an xMAS extension where some manual reasoning is needed to create an acyclic stage graph and apply latency lemmas.
A. Single Queue
An example with a single queue (Fig. 5) demonstrates scalability of latency lemmas in a simple network without arbitration, routing, or flow control. The liveness bound of the sink is fixed to three in this example, and the depth of the queue is varied. The progress lemmas give a bound of δ abs = 3 for the queue's output channel, and therefore each slot in the queue maps to a stage s j in G with a weight of d j = 4. The global bound from the lemmas is T L = 1 + 4 * depth. For each queue depth it is found that T FEAS = T L −1 (Fig. 6), indicating that T L is one cycle larger than the tightest feasible bound. The bound T L at each queue depth is proved inductively using two different strengthenings of the latency property: the first omits the latency lemmas, and the second proves the same bound strengthened by the age lemmas and the progress lemmas. Without latency lemmas, the induction depth required for the proof is never less than T FEAS . With latency lemmas, the induction depth is four independent of the queue depth, demonstrating the composability of the approach. In a queue with depth ten, a latency bound of 41 cycles is proved in under 2 s with latency lemmas versus 267 s without.
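The arithmetic behind the quoted figures can be checked directly from the formula in the text. For the depth-ten queue with stage weight 4:

```latex
\begin{align*}
T_L &= 1 + 4 \cdot \mathit{depth} = 1 + 4 \cdot 10 = 41, \\
T_{\mathit{FEAS}} &= T_L - 1 = 40 .
\end{align*}
```

The proved bound of 41 cycles is therefore one cycle looser than the tightest feasible bound of 40.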
B. Credit Loop
The credit loop (Fig. 3) introduced in Section II-B is now revisited in more detail. Experiments are performed to explore the scalability of the latency lemma approach, different verification engines, and the tradeoff of verification runtime versus tightness of proved bounds. A credit loop has a numeric invariant ψ num [6] (15) asserting that each outstanding credit corresponds to exactly one available token or data packet in the ingress queue; this numeric invariant is included as part of the auxiliary invariant.

Fig. 6. Sweeping the queue depth in a network comprising a single queue, and a sink with liveness bound of 3. In the lower plot, for the two properties the y-axis is the induction depth required to prove them; for T FEAS the y-axis is the tightest feasible bound; for T L the y-axis is the bound proved using the lemmas.

Fig. 7. Comparing runtime and induction depth for proving latency bounds with and without latency lemmas while varying the depth of the queues in a credit loop (Fig. 3).
1) Sweeping the Depth of Queues:
As the depths of the credit loop queues are swept from 2 to 10 (Fig. 7), the bounds implied by the lemmas (T L ) at each depth exceed the tightest feasible bound (T FEAS ) on account of the conservativeness of the progress lemmas. As in Section II-B, the bound of the sink is 5. The inclusion of latency lemmas yields inductive latency proofs in eight frames of unrolling and less than 11 s of runtime for all depths. For a queue depth of 10, the lemmas give a speedup of 120x.
2) Comparing Proof Engines: Induction is evaluated against the PDR verification engine when the latency property is formulated with and without latency lemmas. In this experiment, the queue depths are fixed to six and the sink bound is again five. Attempts are made to prove two different bounds; the first is the tight bound (T FEAS ), and the second is the looser bound (T L ) implied by the stage graph G lemmas. The results are shown in Table I . When proving tight bound T FEAS , the PDR engine gives a 3x speedup over induction, and adding latency lemmas does not significantly impact runtime. When proving the looser bound T L , adding latency lemmas causes a dramatic speedup in inductive verification. The speedup is 34x compared to induction without the lemmas, and over 7x compared to PDR with and without lemmas. The speedup is caused by the latency lemmas making the proof compositional and hence provable in only eight frames of unrolling.
3) Precision Versus Scalability:
This section demonstrates that, when using latency lemmas, there exists a tradeoff of inductive verification runtime against looseness of proved bounds. This generalizes the speedup observed in Table I when proving the looser bound T L instead of tight bound T FEAS . The tradeoff is shown by proving individually each bound from T FEAS to T L + 5 (Fig. 8). The black vertical line indicates the tight bound T = T FEAS = 35 and the gray line indicates T = T L = 43. As the verified bound increases from T FEAS to T L the property including the lemmas gets progressively easier to prove, as evidenced by the reduction in both verification runtime and the number of frames needed for the proof. The points where the plotted data cross the black and gray vertical lines correspond to the four rows in Table I that use k-induction as the verification engine.

Fig. 9. Parameterized ring network, instantiated with three agents and an ingress depth of 2. All channels in the ring network carry data packets, and control is implemented using sequential reservation logic instead of tokens.
VII. NONSTALLABLE RING INTERCONNECT
A ring network [23] is a topology for routing traffic amongst a number of agents. Each agent in the ring comprises arbitration logic, a ring queue slot, and an ingress queue (Fig. 9) . Packets reach their destinations by circling around the ring until being admitted into their destination agent's ingress. The ring network is parameterized by the number of agents, depth of ingress queues, and the liveness bound of each agent's sink.
A packet that is injected at agent i and destined for agent j will first occupy the ring slot of agent i. The packet circles the ring thereafter (e.g., occupying slots 6, 7, 8, 6, . . . in Fig. 9 ) and requests admission to the ingress whenever arriving at agent j. If the request is denied, the packet bounces back onto the ring to repeat the request after making one trip around the ring. Unfair arbitration logic prioritizes traffic in the ring over traffic attempting to enter the ring from a source. This unfair arbitration ensures that traffic in the ring is never blocked, but it does permit sources to be blocked indefinitely.
Each packet that a source injects into the ring is nondeterministically assigned a destination address between 0 and n−1 to indicate the agent to which the packet should be routed. This destination is stored using additional bits appended to nominal packet data in the same manner as the packet age. For a packet stored in queue slot i, let dst(q i ) be its destination address. The auxiliary inductive invariant includes ψ dst (16) to block off unreachable states where packets have invalid destinations
A. Implementation of Ring Agent
Each agent in the ring is created in xMAS design style using modified versions of the basic xMAS primitives (Fig. 2) to implement reservations and unfair arbitration. When a packet is transferred from one ring agent to the next, the first primitive encountered is a switch that routes the packet toward the admission logic if this agent is the packet's destination, and to the bypass channel otherwise. Packets that are routed to the admission logic encounter a demultiplexer that is controlled by sequential reservation logic. The state of the sequential reservation logic and the number of free slots in the ingress determine whether the packet is admitted or bounced. A merge primitive then propagates onward a bounced packet or bypass packet, or no packet at all. Finally, a priority merge primitive gives priority to packets already on the ring, and allows the source to inject packets only if there is no competing packet on the ring.
B. Receive Reservation Logic
A naïve ring implementation can have infinite latency even though all sinks obey bounded liveness. A single packet on the ring may never be granted access to the ingress of its destination, despite an unbounded number of other packets being granted access to the same ingress. Receive reservations [24] are a mechanism to enforce fairness; together with bounded liveness of sinks, receive reservations ensure that packets on the ring have finite latency bounds. The receive reservation scheme used by the ring agents is described here.
1) Each agent can issue a single receive reservation. If an agent's reservation is available, then the agent issues it to any packet that is bounced (due to a full ingress). The next ingress slot to become free is reserved for this packet.
2) If the agent has an outstanding receive reservation, packets without the reservation are denied entry to the ingress unless more than one slot is free.
3) When a packet with a receive reservation returns, it is granted entry to the ingress if any slots are free, and the reservation becomes available for other packets. If no slots are free, the reservation is renewed by the packet and remains unavailable.

The receive reservation of each agent is implemented using a modified xMAS switch in which an incoming packet is either bounced or admitted to the ingress (see Fig. 9) depending on the current state of sequential control logic. The sequential control logic does not use the xMAS design style, and hence this modified switch gives an example of how the latency verification approach of this paper can be applied beyond basic xMAS. As shown in Fig. 10, the control logic tracks the reservation using a sort of counter. When the receive logic is in state rsv = n, it is an indication that the packet with the reservation will return in n cycles. When the state reaches rsv = 0, the next arriving packet on the ring is the same one for which the reservation was made. State rsv = ⊥ indicates that the reservation is available.

Receive reservations are fair with respect to packets in the ring. Whenever one packet returns the reservation (see edge return reservation in Fig. 10), the packet trailing it on the ring has a chance to make the reservation in the next cycle. Each packet in the ring gets a turn at making a receive reservation in order.
The dashed edge labeled renew reservation in Fig. 10 is taken when a packet holding the reservation bounces at its destination on account of there not yet being any free slots in the destination's ingress. This situation of renewing a reservation cannot occur if the sink bounds are smaller than the delay around the ring (i.e., the number of agents), a condition that is here assumed to hold. Therefore, the dashed edge from Fig. 10 is ignored, and packets arriving when the reservation state is rsv = 0 will always find a free ingress slot and be admitted to the ingress.

Fig. 10. State machine for receive reservations of a single agent in a three-agent ring. The reservation state is rsv = ⊥ when the reservation is available, and is rsv ∈ {0, 1, 2} when the reservation is outstanding. The state is rsv = 0 when the packet holding the reservation is the next to arrive, and the state is rsv = 2 when the reservation has just been made and the reserving packet is in the ring slot of its destination agent.

Fig. 11. Product automaton for state of receive reservation and index of currently occupied ring slot in a three-agent ring (Fig. 9).
C. Creating Age Lemmas Using Stage Graph G
The location of packets in the ring network is not a precise enough indicator of progress to create an acyclic stage graph because a packet can occupy the same slot many times as it circles the ring. This precludes use of the automated stage graph construction of Section IV, which attempts to equate a packet's location to its stage of progress. Instead, some manual insight is required to devise a correspondence between packets circling the ring and ordered stages of progress in the stage graph. For packets circling the ring, progress is marked both by changes in its location, and changes in the reservation state of its destination agent. In an n-agent ring, there exists a stage in G for every combination of the n current packet locations, the n packet destinations, and the n + 1 reservation states, so the total number of stages for the ring slots is n × n × (n + 1). For clarity, the explanation here deals only with the n×(n+1) stages for packets destined for agent two; in implementation all destinations are considered.
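The stage counts given by these formulas can be worked out for the two ring sizes evaluated later (three and eight agents):

```latex
\begin{align*}
n = 3:&\quad n \times n \times (n+1) = 3 \cdot 3 \cdot 4 = 36, &\quad n \times (n+1) &= 3 \cdot 4 = 12, \\
n = 8:&\quad n \times n \times (n+1) = 8 \cdot 8 \cdot 9 = 576, &\quad n \times (n+1) &= 8 \cdot 9 = 72 .
\end{align*}
```

The first count on each line covers all destinations; the second is the number of ring-slot stages for a single fixed destination.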
Using the three-agent ring as an example, composing the reservation state machine of Fig. 10 (without dashed edge) with the packet's behavior of advancing to the next ring slot in every cycle produces the state machine of Fig. 11 . This product machine is the foundation for creating a stage graph for packets in the ring. Each state in Fig. 11 has two labels, the first is the state of the reservation state machine (Fig. 10 ) and the second is the index of a ring slot (Fig. 9) . As a packet destined for agent two moves around the ring, in every cycle it maps to some state of this product automaton. For example, state (0, 7) in Fig. 11 is the state that a packet in slot seven maps to when the receive reservation of agent two has state rsv 2 = 0. The mapping from packets to states in Fig. 11 could serve as an indicator of progress if only the product automaton were acyclic. An acyclic stage graph is obtained from the product automaton by discerning that there are pairs of edge-connected states in Fig. 11 that no single packet can map to. Removing these transitions reveals an ordering among progress stages.
1) (0, 7) → (⊥, 8) can never be made by a packet destined for agent two because it corresponds to a packet bouncing at agent two (from slot seven to slot eight) while agent two has its reservation returned (i.e., it transitions to state rsv 2 = ⊥ that indicates that a reservation is available). The transition is impossible because bouncing and returning a reservation are exclusive; only admitted packets cause the reservation to return to the available state.
2) (⊥, 7) → (⊥, 8) can never be made by a packet destined for agent two because it corresponds to a packet that bounces (from slot seven to slot eight) while an available reservation remains available. The transition is impossible because any packet bouncing while the reservation is available would have the reservation issued to it.

Without the two transitions described above, Fig. 11 becomes acyclic and can be used to order the progress stages of a packet in the ring. The product machine of Fig. 11 then becomes the stage graph of Fig. 12 by simply removing the two unrealizable transitions, and adding stages for sources, ingress slots, and the sink. The mapping from queue slots in the ring network to stages in Fig. 12 is given by the age lemmas in Table II.
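The acyclicity argument can be checked mechanically on a reconstructed product automaton. The Python sketch below is built on an assumption: it pairs every reservation transition (available to available, available to 2, 2 to 1, 1 to 0, 0 to available, with the renew edge omitted) with the packet advancing one slot in the order 6, 7, 8. The actual edge set of Fig. 11 may differ, and the names used here are hypothetical.

```python
from itertools import product

SLOTS = [6, 7, 8]        # ring slot indices of the three-agent ring
AVAILABLE = "avail"      # stands for rsv = ⊥ (reservation available)

# Assumed reservation transitions, without the dashed renew edge.
RSV_NEXT = {AVAILABLE: [AVAILABLE, 2], 2: [1], 1: [0], 0: [AVAILABLE]}


def next_slot(slot):
    return SLOTS[(SLOTS.index(slot) + 1) % len(SLOTS)]


def product_edges():
    """Every reservation transition paired with a one-slot advance."""
    return {((r, s), (r2, next_slot(s)))
            for r, s in product(RSV_NEXT, SLOTS)
            for r2 in RSV_NEXT[r]}


def is_acyclic(edges):
    """Kahn's algorithm: acyclic iff every node can be topologically removed."""
    nodes = {u for edge in edges for u in edge}
    indegree = {u: 0 for u in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [u for u in nodes if indegree[u] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == u:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == len(nodes)


# The two transitions argued to be unrealizable for packets destined to agent two.
UNREALIZABLE = {((0, 7), (AVAILABLE, 8)), ((AVAILABLE, 7), (AVAILABLE, 8))}

full = product_edges()
reduced = full - UNREALIZABLE
```

Under this reconstruction, `is_acyclic(full)` is false and `is_acyclic(reduced)` is true, which agrees with the claim that removing the two transitions yields an ordering among the twelve progress stages.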
D. Results
The latency lemma approach is evaluated on a three-agent ring and an eight-agent ring, each with ingress depth of two and sink bound of two. For the three-agent ring, the tightest feasible bound (T FEAS ) is 18, and the bound implied by the latency lemmas (T L ) is 19; for the eight-agent ring, T FEAS is 78 and T L is 79. The runtimes and number of frames for proving a bound of T L on each ring, with and without latency lemmas, are shown in Tables III and IV. The latency bound for the eight-agent ring (Table IV) is only proved by each engine within 10 000 s when the lemmas are used. The induction engine is able to verify the property with the lemmas 9x faster than PDR verifies the same, and at least 130x faster than either engine does without lemmas.
VIII. RELATED WORK
One way of addressing QoS guarantees at the architectural level is to use resource reservation and contention-free routing [25] . Analysis can be performed manually, but formal verification is still useful for providing guarantees. Network calculus [26] has been demonstrated as a useful tool for NoC performance analysis [27] . However, it has limited applicability and precision for networks with backpressure and complex circular message dependencies. Network calculus formalism relies on very high-level abstraction of arbiters, often modeling them as latency-rate servers. The synchronous protocol automata of Avnit et al. [4] model network components using an xMAS-like formalism, and as in this current work make the distinction between data and token channels.
Recent abstraction-based model checking approaches have been applied to latency verification [9] , [28] , but these works address scalability by explicitly decomposing the overall problem into distinct subproblems. The proofs for the subproblems are then stitched together for an end-to-end proof. By contrast, this current work uses subgoals to strengthen the overall latency bound property to make it efficiently provable with induction, but avoids explicit decomposition.
Several works have explored (unbounded) liveness verification of communication fabrics. The standard approach of verifying liveness using a liveness-to-safety transformation [29] does not scale to large networks in practice [15]. Alternative approaches include reducing deadlock conditions to a set of equations [15], [16], and proving liveness with the help of intermediate safety assertions [30]. Higher-order logic has been applied to verifying deadlock freedom, using network models described in the PVS specification language [31]. Prior works compare liveness and safety methods for verifying grant latencies on a particular style of weighted round robin arbiter [32]. The notion of using LTL properties where all eventually properties have time bounds is also referred to as a prompt system [33]. Finite latency bounds imply deadlock freedom; however, because the latency verification approach in this paper assumes deadlock freedom as a starting point, it is intended to complement and not replace existing techniques for verifying deadlock freedom.
The presented verification approach is conceptually similar to ranking functions [34] , i.e., numeric functions of model state that measure progress toward some goal. Typically, ranking functions are useful in proving termination or liveness properties, but they are also applicable for latency bounds. In fact, the stage graph can be viewed as a structural description of a ranking function for the model. Note, however, that stage graphs specify partial orders, rather than the linear orders that are typical for ranking functions. Viktorov and Gotmanov [17] propose a theorem-proving approach to latency verification in xMAS networks that is based on ranking functions. Their inference rules are analogous to the rule-based propagation of local bounds used in this paper.
IX. CONCLUSION
This paper presents a compositional approach to verifying latency bound properties of NoC designs. The key idea is to decompose the overall proof into a finite number of latency lemmas, based on the notion of stages that a packet can be in. The approach is fully automated for acyclic networks constructed from basic xMAS primitives, and some manual input is required for cyclic xMAS networks or xMAS-like networks that use an extended set of primitives. Promising directions for future work include automation of the stage graph construction for cyclic networks, and applying the approach to QoS properties other than latency. The latency lemma approach is applied to several examples including an industrial ring design, and is shown to decrease runtime for proving latency bounds, while also decreasing the induction depth needed to prove them.
