Abstract-To integrate the nowadays rapidly expanding distributed real-time systems, we need multi-hop real-time switched networks. A (if not "the") widely recognized/adopted real-time switch architecture is the TDMA crossbar real-time (TCRT) switch architecture. However, the original TCRT switch architecture assumes per-flow queueing. To support scalability, queue sharing (i.e., flow aggregation) must be allowed. With simple flow aggregation, flow burstiness can grow and infect, making schedulability and end-to-end delay bound analysis an open problem. To deal with this, we propose the real-time aggregate scheme. The scheme complies with the existing TCRT switch architecture, and deploys spatial-temporal isolation and over-provisioning to curb aggregate member flows' burstiness. This allows us to derive a closed-form end-to-end delay bound, and to give the corresponding resource planning and admission control strategies. Simulations are carried out to show the effectiveness of the design.
I. INTRODUCTION
Real-time networks are the venue where distributed real-time systems integrate. As distributed real-time systems nowadays rapidly scale up, real-time networks have to evolve from traditional Local Area Networks (LANs) toward multi-hop switched networks [1]. A typical example is avionics. A modern aircraft, such as the A380, the F-35, or the space shuttle, already runs hundreds of processors, and may include hundreds of high-definition real-time video sources [2][3]. Such a large number of nodes and flows cannot be hosted by a single real-time LAN. This forces the launch of several initiatives to develop multi-hop real-time switched networks [4][5]. Similar demands also arise from industrial control, telepresence/telerobotics, intelligent transportation, medical device integration, etc. [6].
A (if not "the") widely recognized/adopted real-time switch architecture for multi-hop real-time networks is the TimeDivision-Multiple-Access (TDMA) crossbar real-time switch architecture (simplified as "TCRT switch" architecture in the following) [7] [8] [1] [9] [10] [11] . This architecture is particularly important for mainstream switch manufacturers because it complies with (and even simplifies) a mainstream Internet switch architecture [1] . This lays a smooth evolution path for these manufacturers to build real-time switches.
However, the existing TCRT switch architecture assumes per-flow queueing. It is well-known that per-flow queueing has poor scalability [12][13][14]. In fact, this is the reason why nearly all high performance switches nowadays carry out certain flow aggregation: flows have to share queues. With aggregates, how to provide a real-time delay guarantee becomes a non-trivial problem. Simply aggregating flows in TCRT switches allows the member flows' burstiness to grow and infect (an example is in Section III). As these member flows join other aggregates, the burstiness infection can spread further. This seriously complicates the network model, making schedulability and end-to-end (E2E) delay bound analysis an open problem.
To deal with the problem, we propose a novel scheme called real-time aggregates. This scheme exploits the features of the TCRT switch architecture to curb aggregate member flows' burstiness. With real-time aggregates, we can derive a closed-form E2E delay bound, along with the corresponding resource planning and admission control strategies. Simulations show the real-time aggregate scheme is efficient in terms of providing short E2E delay bounds and high network utilization.
The remainder of the paper is organized as follows. Section II introduces the TCRT switch architecture; Section III proposes a naive aggregation scheme, and shows how it leaves E2E delay bound an open problem; Section IV proposes and analyzes the real-time aggregate scheme; Section V evaluates real-time aggregates; Section VI discusses related work; and Section VII concludes the paper.
II. BACKGROUND
We shall first introduce the existing/basic TCRT switch architecture. Network switches (real-time or not) can be categorized as output queueing or input queueing. In output queueing, input ports (simplified as "inputs" in the following) do not buffer packets. Once a packet enters an input, it is immediately routed to its corresponding output port (simplified as "output" in the following) and buffered there.
Though output queueing is intuitive, its inherent "N-speedup" problem [15] limits its adoption. Input queueing has instead become the de facto standard among switch vendors [1][16].
In input queueing, a crossbar fabric connects inputs with outputs (see Fig. 1(a) ). Packets are only buffered at inputs; when a packet enters an input, it is immediately routed into the right queue in the input (see Fig. 1(b) ). At scheduled time, an output connects with one input, picks one of its queues, and fetches the queue's header packet. The fetched packet exits the output directly without further buffering (see Fig. 1(c) ).
To facilitate scheduling, each packet is typically manipulated as fixed-size fragments called cells. The time cost to send one cell across the crossbar is called one cell-time. To satisfy the crossbar constraint that at any time instant an input can connect to at most one output and vice versa, the switch operates periodically, and the period is one cell-time. At the beginning of each cell-time, the switch scheduler decides a one-to-one matching (simplified as "matching" in the following) between inputs and outputs and connects/disconnects the crossbar intersections (the grey dots in Fig. 1(a)) accordingly. During the cell-time, each output tries to fetch a cell from its matched input for outputting.

Fig. 1. (a) Crossbar fabric, which connects inputs with outputs; each input connects to a data bus (the horizontal line segments) that intersects with each output's data bus (the vertical line segments); the intersections (grey dots) can be connected/disconnected at runtime by the scheduler(s). (b) An input port: packet routing and queueing are carried out in it; in input i, the kth queue buffering packets to output j is denoted Q(i, j, k). (c) An output port: at different time slots, the output fetches packets from different input queues according to the switch scheduling scheme.
Depending on different queueing and cell matching schemes, many input queueing switch architectures exist. TCRT switch architecture is one of them [1] [9][10] [11] . It runs as follows.
Each input carries out per-flow queueing. Each output maintains a TDMA schedule of M cell-times, a.k.a. the M-slot frame. The gth (g = 0, 1, . . . , M − 1) slot of the frame specifies which per-flow queue in which input to grant (i.e., to send a "grant" signal to) at the beginning of the gth cell-time. Here g is a global counter incremented by 1 every cell-time (modulo M). On receiving a grant, the input port per-flow queue sends its header cell to the granting output during the cell-time, or does nothing if the queue is empty.

Fig. 2. Conflict-free schedule for a TCRT switch (quoted from [17]): in this example, the switch has N = 4 inputs/outputs and M = 5; each row of the "schedule matrix" is a conflict-free schedule for its corresponding output.
To ease narration, in the following, we use the term "M -slot frame" and "frame" interchangeably; and the term "slot" and "cell-time" interchangeably.
Recall that the crossbar requires that at any time instant an input can connect to no more than one output and vice versa. That is, the M-slot frame schedules of all N outputs must form a matching between the N inputs and N outputs in each cell-time. We call this requirement "conflict free" (see Fig. 2).
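To make the conflict-free requirement concrete, here is a minimal check in Python (our own sketch; the schedule layout and names are illustrative, not taken from the TCRT references): schedule[o][g] holds the input granted by output o in slot g (or None), and the check verifies that every slot forms a matching.

```python
def is_conflict_free(schedule):
    """schedule[o][g]: the input granted by output o in slot g, or None.
    Conflict-free: in every slot, no input is granted by two outputs."""
    num_outputs = len(schedule)
    num_slots = len(schedule[0]) if num_outputs else 0
    for g in range(num_slots):
        granted = set()
        for o in range(num_outputs):
            i = schedule[o][g]
            if i is None:
                continue
            if i in granted:        # two outputs grab the same input in slot g
                return False
            granted.add(i)
    return True

# An illustrative 4-output, M = 5 schedule (a rotating matching plus an idle slot).
schedule = [
    [0, 1, 2, 3, None],
    [1, 2, 3, 0, None],
    [2, 3, 0, 1, None],
    [3, 0, 1, 2, None],
]
print(is_conflict_free(schedule))   # True
```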
An important feature for TCRT switch architecture is its schedulability test method [1] , quoted here as Theorem 1:
Theorem 1 (Schedulability): For an N-input N-output TCRT switch, if in every M-slot frame each output needs to receive no more than M cells and each input needs to send no more than M cells, then we can always find a conflict-free schedule, with a time cost of O(N^4).
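The admission side of Theorem 1 is easy to state in code. Below is a minimal sketch (ours, not the algorithm of [1]); it only checks the two per-frame load conditions, while constructing the conflict-free schedule itself is the separate O(N^4) step.

```python
def tcrt_schedulable(demand, M):
    """demand[i][o]: cells per M-slot frame that input i must send to output o.
    Theorem 1: if every input sends <= M and every output receives <= M cells
    per frame, a conflict-free schedule exists."""
    n = len(demand)
    inputs_ok = all(sum(demand[i]) <= M for i in range(n))
    outputs_ok = all(sum(demand[i][o] for i in range(n)) <= M for o in range(n))
    return inputs_ok and outputs_ok

# Example: a 3x3 switch with M = 5; all row and column sums are <= 5.
demand = [
    [2, 1, 0],
    [0, 3, 1],
    [1, 1, 2],
]
print(tcrt_schedulable(demand, M=5))   # True
```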
The corresponding O(N^4) scheduling algorithm is in [1]. So far, we have been assuming all flows are unicast. The extension to support multicast is simple [8], as shown in Fig. 3. When a to-be-multicast cell enters an input (see Fig. 3(a)), it is duplicated into m copies (see Fig. 3(b)). Depending on the multicast routing, each copy enters its corresponding input port per-flow queue, and the rest is the same as unicast. When a copy enters the next-hop switch, the same thing can happen again for further multicast branching.
Such an extension complies with the common crossbar constraint that at any time instant, one input can connect to at most one output and vice versa; hence it benefits smooth design evolution. Later, we will use this multicast scheme to design our flow aggregate mechanisms.
III. NAIVE APPROACH: PER-AGGREGATE QUEUEING
To address the flow aggregation demand raised in Section I, in this section we attempt a naive approach: adding per-aggregate queueing to the aforementioned TCRT switches. Later in this section, we show that, due to insufficient flow isolation, this approach makes the analysis of each flow's schedulability (i.e., how many resources must be allocated to guarantee the existence of an E2E delay bound?) and E2E delay bound (i.e., what is the E2E delay bound?) an open problem. A better solution is then proposed in Section IV.

Fig. 3. Multicast extension example (based on [8]; note C_2 and C_3 may be fetched/outputted in different cell-times, but this does not matter).
Per-aggregate queueing means flows of a same aggregate share the same queue, as shown by Fig. 4 .
Formally, let set F represent an aggregate, where f ∈ F iff f is a member flow of F ("iff" means "if and only if"). A queue can be uniquely identified as Q(I, O, F): I is the input where the queue resides; O is the intended output; and F is the aggregate the queue exclusively serves. In addition, let

P_F def= {X | ∃ flow f ∈ X and f joins aggregate F right after leaving aggregate X}

denote the set of Predecessor Aggregates of F; and let

S_F def= {X | ∃ flow f ∈ X and f joins aggregate X right after leaving aggregate F}

denote the set of Successor Aggregates of F. Topologically, an aggregate F starts from an output O_0 that fetches cells from a set of queues {Q(I, O_0, X ∩ F) | ∀X ∈ P_F} (note O_0 and X ∩ F together determine I). We call O_0 the Aggregator of F, or equivalently, O_0 creates F. F then passes through several subsequent switches. Without loss of generality, suppose they are switches 1, 2, . . . , k − 1, respectively. Suppose in switch i (i = 1 ∼ k − 1), F is queued in Q(I_i, O_i, F) and forwarded to output O_i; and suppose O_{k−1}, the last-hop output of F, wires to input I_k of switch k. Input I_k then is F's Segregator, where F segregates into |S_F| queues (|·| denotes the cardinality of a set). For a flow f ∈ F that joins X ∈ S_F, the flow enters queue Q(I_k, O, F ∩ X), where O is determined by I_k and F ∩ X.
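As an illustration of these definitions (our own sketch; the data layout and names are hypothetical), the predecessor and successor sets can be derived directly from each flow's ordered list of aggregates:

```python
from collections import defaultdict

def predecessor_successor_sets(flow_paths):
    """flow_paths: {flow id: [aggregate ids, in traversal order]}.
    Returns (P, S) where P[F] and S[F] are F's predecessor / successor aggregate sets."""
    P, S = defaultdict(set), defaultdict(set)
    for path in flow_paths.values():
        for prev_agg, next_agg in zip(path, path[1:]):
            P[next_agg].add(prev_agg)   # prev_agg is in P of next_agg
            S[prev_agg].add(next_agg)   # next_agg is in S of prev_agg
    return P, S

# Flow f1 rides aggregates A then B; flow f2 rides A then C.
P, S = predecessor_successor_sets({"f1": ["A", "B"], "f2": ["A", "C"]})
print(P["B"], S["A"])   # {'A'} {'B', 'C'}   (set print order may vary)
```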
These concepts on aggregates are more intuitively explained by an analogy to express trains, depicted in Fig. 4 's caption.
As shown in Fig. 4 , aggregates can share the same physical link(s), but their queues are mutually exclusive. This spatially partitions flows of different aggregates. However, within each aggregate, per-aggregate queueing is unable to isolate the aggregate's member flows. If one member flow is bursty (i.e. the flow's data rate changes drastically; e.g., to have no cell arriving in one M -cell-time frame, and then have four cells arriving in the next frame), it may make other flows bursty. Fig. 5 shows an example on how the burstiness of a flow may emerge and "grow" due to clock drift [18] ; and the burstiness of flows may "infect" each other due to queue sharing.
The example assumes six consecutively connected switches, S_1 ∼ S_6, along an aggregate F. The events taking place in S_1 ∼ S_6 are shown in Fig. 5 by six synchronized time axes from top to bottom: the top time axis for switch S_1, the second-from-top time axis for switch S_2, and so on.
Without loss of generality, we assume the TCRT switches always run an M-slot frame with M = 10 (note in reality, for giga-bps switches, M is usually on the order of 10^3 ∼ 10^6). Let τ_i (in seconds) denote the duration of a cell-time for switch S_i; and τ_1 < τ_2 < . . . < τ_6, which implies clock drift.
The aggregate F through S_1 ∼ S_6 consists of two flows: f_a and f_b. f_a's source end maximal traffic load is 3 cells/frame, while f_b's source end maximal traffic load is 1 cell/frame. They enter the per-aggregate queue Q_1 in switch S_1's input port, and are fetched by an output port of S_1, namely output O_1. Based on f_a and f_b's source end maximal traffic loads, O_1 shall grant Q_1 four times per M-slot frame. In Fig. 5, O_1 grants Q_1 for the kth time at g^(1)_k. Unfortunately, f_a is bursty, either ever since its source end or due to burstiness growth/infection in the network; therefore, f_a injects no traffic load throughout our observation period. f_b, however, is steady. The kth cell of f_b, denoted c_k, arrives at Q_1 at time a_k and may be forwarded using a slot originally intended for f_a. Similar things can happen at switches S_2 ∼ S_5, so that c_0 ∼ c_3 arrive at switch S_6 within one frame, using three additional slots originally for f_a, i.e., f_b grows bursty. This burst of c_0 ∼ c_3 can infect other flows, as flow f_b later joins other aggregates.
Note we cannot stop the growth of f_b's burstiness by allocating more slots per frame to Q_i (i = 1 ∼ 5), because bad phasing between cell arrivals and grants can still occur. That is, in general, we do not know the exact worst case burstiness of a flow in a per-aggregate queueing TCRT switched network. This prevents us from deriving an efficient schedulability test and a tight E2E delay bound for per-aggregate queueing, though a conservative sufficient schedulability test and a loose E2E bound exist by applying the classic DiffServ math framework [20][21][22].
IV. REAL-TIME AGGREGATE
To fix the shortcomings of per-aggregate queueing, we propose real-time aggregate, which carries out spatial-temporal partitioning and over-provisioning to curb the burstiness "growth" and "infection", hence making schedulability and E2E delay bound analysis possible.

Fig. 4. Aggregate Topological Architecture. We can imagine every switch as a railway station, with each input an arrival platform and each output a departure platform. An aggregate is an express train that runs non-stop from a unique departure platform (i.e., its aggregator) to a unique arrival platform (i.e., its segregator). A flow is a passenger, who boards/alights from express trains (aggregates) at their aggregators/segregators. A departure platform (output) can serve as the aggregator for some express trains (aggregates), and as a pass-by platform for some other express trains. The same holds for an arrival platform (input) as segregator and pass-by.

Fig. 5. Burstiness growth example: cells c_0 ∼ c_3 originally belong to 4 different frames respectively when arriving at switch S_1, but they all arrive at switch S_6 in one frame. The forwarding output of switch S_6 thus may forward c_0 ∼ c_3 within one M-slot frame, propagating burstiness to other parts of the network.
To make an analogy, per-aggregate queueing is like queueing all words of an article without any comma. Such an article is certainly hard to read and error prone. But the solution is simple: just add commas between words, then the article becomes readable. These commas provide temporal isolation.
Similarly, we can add temporal isolation to the original per-aggregate queueing. This is illustrated by Fig. 6. Fig. 6(a) shows a queue under per-aggregate queueing: all cells are piled up together. Our plan is to insert dummy cells between these cells (see Fig. 6(b)) to separate cells that should not be forwarded within a same M-slot frame. These dummy cells serve the function of "commas". When an output "reads" (i.e., forwards) cells from queue (b), it should "pause" whenever it reaches a "comma" (i.e., a dummy cell) until the next M-slot frame starts.
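The "comma" heuristic can be sketched as follows (our illustration in Python, not the paper's pseudocode): within one M-slot frame, the reading output forwards data cells until it either runs out of granted slots or meets a dummy cell, and it only removes the dummy cell at the start of the next frame.

```python
from collections import deque

DUMMY = "DUMMY"

def read_one_frame(queue, granted_slots):
    """Forward at most `granted_slots` data cells from `queue` in this frame,
    pausing as soon as a dummy cell ("comma") reaches the queue header."""
    # Start of a new frame: delete the dummy cell left blocking from last frame.
    if queue and queue[0] == DUMMY:
        queue.popleft()
    forwarded = []
    for _ in range(granted_slots):
        if not queue or queue[0] == DUMMY:
            break                       # pause until the next M-slot frame
        forwarded.append(queue.popleft())
    return forwarded

q = deque(["c0", "c1", DUMMY, "c2", "c3", DUMMY, "c4"])
print(read_one_frame(q, granted_slots=4))   # ['c0', 'c1'] -- pauses at the comma
print(read_one_frame(q, granted_slots=4))   # ['c2', 'c3']
```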
With dummy cells, we add temporal isolation to the spatial isolation of per-aggregate queueing (different aggregates are spatially isolated from each other because they use different queues). This combined spatial-temporal isolation better curbs the burstiness growth and infection. But we still have one more glitch: since we admit clock drift exists between switches, the incoming of cells may be slightly faster than the reading of cells. This may cause queue overflow.
We address this problem with over-provisioning. We shift the "comma" to include some more words than the original "sentence" length. Then in each M -slot frame, if the next hop output "reads" until it sees a "comma", it reads more than needed (i.e., over-provisioning). In this way, we speed up the "reading" of cells, so that possible clock-drift is compensated.
To implement this, we let the aggregator insert a dummy cell into the aggregate once every (M + 1) cell-times; the details are given next.
B. Real-Time Aggregate Design Details
To implement the heuristics of Section IV-A, we reuse the per-aggregate queueing topology architecture of Fig. 4 , and revise the granting mechanism in the TCRT switch of Section II to achieve the effects shown in Fig. 7 .
In Fig. 7, output O plays the role of aggregator for two aggregates. Let

M_O def= {X | O is the aggregator for aggregate X}.

Then every (M + 1) cell-times, O creates a dummy cell c and multicasts a copy of c into each aggregate X ∈ M_O. Each copy of c passes along its aggregate X ∈ M_O, marking the temporal border (i.e., beginning) of a new (M + 1) cell-time frame, a.k.a. "virtual frame" or "v-frame", to be differentiated from the M cell-time frame, a.k.a. "frame".

When c enters the segregator of X (∀X ∈ M_O) (denoted as input I in the figure), c is duplicated and enqueued as if it were going to be further multicast to S_X, X's successor aggregates. Specifically, for each Y ∈ S_X, suppose output O′ is Y's aggregator, and Q(I, O′, X ∩ Y) is the queue in segregator I that corresponds to Y (i.e., all cells leaving aggregate X and joining aggregate Y will be queued in Q(I, O′, X ∩ Y)); then a copy of dummy cell c will enter Q(I, O′, X ∩ Y). When this copy of c reaches the queue's header, aggregator O′ sees the beginning of the next v-frame. O′ will hence block until the next M cell-time frame. In the next M cell-time frame, the first thing O′ will do is to delete c, and then forward the remaining cells from Q(I, O′, X ∩ Y).

Since O′ deletes c, c does NOT enter Y, although c attempts to. Also, since O′ blocks on seeing c until the next frame, O′ will not aggregate cells generated in different v-frames into one frame. This temporally curbs burst growth.
The above behavior is formally specified by the pseudocode in Fig. 8 and 9, which extends the "grant" protocol of the TCRT switch described in Section II. At the beginning of each cell-time, an output executes OnCellTimeStart() (see Fig. 8) to grant an input. On getting a grant g from an output, an input executes OnGranted(g) (see Fig. 9). Some pseudocode details are explained in the following. There are three types of grants: normal-data-grant, aggregator-data-grant, and sync-grant (see Fig. 8 lines 4, 12, 14; and Fig. 9 lines 5, 7, 8). "Normal-data-grant" is the conventional grant that fetches the input queue's header cell if there is one. It is used by the intermediate nodes of an aggregate (the outputs that the aggregate passes along, but which are not the aggregator of the aggregate). In contrast, if an aggregator wants to issue a data grant, it must use an "aggregator-data-grant". The difference between normal- and aggregator-data-grant is that an aggregator-data-grant fetches nothing if the input queue header is a dummy cell. This realizes the heuristic of "pausing" for temporal isolation. Meanwhile, the dummy cell will block there until it is "deleted" by a "sync-grant" from the aggregator, issued once every M-slot frame. This realizes the heuristic of "pausing" until the next frame before "resuming reading". Note the only purpose of a "sync-grant" is to delete blocking dummy cells; a "sync-grant" does not fetch cells.
Suppose output O grants Q v times in each M-slot frame, at slots s_{i_0}, s_{i_1}, . . ., s_{i_{v−1}}. If O is an aggregator for queue Q, then O sync-grants Q in the first granting slot s_{i_0}, and aggregator-data-grants Q in all other granting slots s_{i_1} ∼ s_{i_{v−1}}. If O is not an aggregator for Q, then s_{i_0} ∼ s_{i_{v−1}} are all normal-data-grants.
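That slot-labeling rule is simple enough to state directly; the following is our own small sketch (slot indices are arbitrary examples):

```python
def label_granting_slots(granting_slots, is_aggregator):
    """granting_slots: slots s_{i0}..s_{i(v-1)} at which this output grants queue Q
    within one M-slot frame.  Returns (slot, grant type) pairs."""
    labels = []
    for idx, slot in enumerate(granting_slots):
        if is_aggregator:
            grant = "sync-grant" if idx == 0 else "aggregator-data-grant"
        else:
            grant = "normal-data-grant"
        labels.append((slot, grant))
    return labels

print(label_granting_slots([3, 7, 12, 18], is_aggregator=True))
# [(3, 'sync-grant'), (7, 'aggregator-data-grant'),
#  (12, 'aggregator-data-grant'), (18, 'aggregator-data-grant')]
```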
Also, for an output O, every (M + 1) cell-times it needs to multicast a dummy cell to M_O (this is triggered by the combined effects of the variables i, newVFrame, and lastVFrameStartsAt in Fig. 8 lines 2, 7, 9, 10, 17, 18, 19). This means in every M-slot frame, at most one slot will be sacrificed for outputting a dummy cell instead of a data cell. This is remedied by allocating one more slot for normal- or aggregator-data-grant during the resource planning and admission control stage (see Section IV-C, "Resource Planning Method"). Note if the sacrificed slot happens to be a sync-grant, there is no negative effect, because a sync-grant's job is only to delete the blocking dummy cell in the previous hop; a sync-grant does not forward cells.
C. Analysis
Without loss of generality, in the following, we use aggregate F in Fig. 4 as an example for our analysis.
We first define some notations and conventions. Definition 1: We call two dummy cells c_0 and c_1 "consecutive dummy cells" from output O iff O does not output any other dummy cell between outputting c_0 and c_1.
Fig. 8. OnCellTimeStart procedure, executed by an output O at the beginning of each cell-time (excerpt):
    if (newVFrame) {
        create and output a multicast dummy cell c to M_O; at the segregator of each X ∈ M_O, c will be enqueued as if it will further multicast to S_X (X's successor aggregates);
    }
    // M_O = {X | O is the aggregator for aggregate X}.
    // Note: the life cycle of c is better explained by Fig. 7; although c will enter each queue X segregates to, c will be deleted at the corresponding queue's header, see Fig. 9.

Fig. 9. OnGranted procedure, called at the beginning of each cell-time by an input iff it receives a grant g (g.queue is the queue granted; g.type is one of sync-grant, aggregator-data-grant, or normal-data-grant); excerpt, where c refers to the header cell of g.queue:
    if (g.type == sync-grant) {
        if (c is a dummy cell) delete c; // else do nothing ("pause")
    } else if (g.type == aggregator-data-grant) {
        if (c is not a dummy cell) send c to O; // else do nothing ("pause")
    } else /* g.type == normal-data-grant */
        send c to O;
Special care should be taken at the ends of flows. Without loss of generality, suppose input I_0 in Fig. 4 connects to a source end computer, which enqueues flow f into Q(I_0, O_0, F′ ∩ F). Then every (M + 1) cell-times, the source end computer shall enqueue a dummy cell into Q(I_0, O_0, F′ ∩ F). To facilitate narration, we define the following:
Definition 2: Suppose between enqueueing any two consecutive dummy cells, the source end computer enqueues ñ cells of flow f, and ñ never exceeds Ñ^src_f. We say that flow f has a worst case source end virtual traffic load of Ñ^src_f (cell/v-frame).
Here we use the tilde ("~") to indicate that the corresponding parameter is related to certain "virtual" concepts, such as the "virtual frame".
The definition also tells us how to measure/specify Ñ^src_f. For example, if our system's v-frame size is (M + 1) = 1001 (cell/v-frame), then the source end computer enqueues a dummy cell every 1001 cell-times. If the source end computer NEVER enqueues more than 9 cells of f between enqueueing dummy cells, then Ñ^src_f = 9 (cell/v-frame). Also, from now on, unless explicitly noted, let us assume the default units for time and data are "second" and "cell" respectively.
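The measurement described above amounts to taking a maximum over inter-dummy intervals; a minimal Python illustration (ours, with a made-up trace format) is:

```python
def worst_case_source_virtual_load(trace, flow_id):
    """trace: the source-end enqueue sequence, e.g. ["f", "DUMMY", "f", "f", ...].
    Returns the largest number of `flow_id` cells enqueued between two
    consecutive dummy cells, i.e. the measured worst case virtual load."""
    worst = current = 0
    for item in trace:
        if item == "DUMMY":
            worst = max(worst, current)
            current = 0
        elif item == flow_id:
            current += 1
    return max(worst, current)

trace = ["f", "f", "DUMMY", "f", "f", "f", "DUMMY", "f", "DUMMY"]
print(worst_case_source_virtual_load(trace, "f"))   # 3
```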
We use τ^(i) to denote the cell-time duration of output O_i (i = 0, 1, . . .) in Fig. 4. Therefore, O_i's M-slot frame duration is P^(i) = Mτ^(i). We admit the existence of clock drift, and denote the minimum and maximum cell-time durations over all switches in the system as τ_min and τ_max respectively. Correspondingly, the minimum and maximum M-slot frame durations are P_min = Mτ_min and P_max = Mτ_max.
We use w̃(f,
Knowing the traffic load is the first step to plan resource allocation. Therefore, we need the following lemma:
Lemma 1 (Burstiness Bound): Suppose flow f's worst case source end virtual traffic load (see Definition 2) is Ñ^src_f; then the number of f's cells carried in any single v-frame along f's path is bounded (the proof is in Appendix A). Secondly, note Lemma 1 is NOT about overflow or delay; rather, it is only about burstiness: the possible number of f's cells between any consecutive dummy cells enqueued. The fact that Lemma 1 does not involve τ_min and τ_max, the shortest and longest cell-time durations of all switches, implies our real-time aggregate can bound burstiness growth and infection regardless of clock drift.
Thirdly, however, clock-drift may still cause queue overflow, which means infinite E2E delay. Therefore, the delay bound analysis (see Theorem 2 and 3) will still involve τ min and τ max . In fact, we will see Theorem 2 and 3 give a sufficient condition involving τ min and τ max that bounds E2E delay, and hence avoids queue overflow.
Fourthly, Lemma 1 also tells us that if f passes h̃ hops of real-time aggregates, then f's burstiness is bounded by O(h̃). In many cases (see Section V), we can configure a network so that h̃ = O(log₂ h), where h is the number of physical links that f passes. Therefore, the burstiness of f in such networks is controlled at O(log₂ h). Or, if we always configure a single real-time aggregate from the source to the destination end of f, then the burstiness is controlled at O(1).
Fifthly, Lemma 1 tells us how to calculate Ñ^(0)_f for any f ∈ F′ ∩ F (where F is the aggregate of concern, and F′ ∈ P_F). For example, if in Fig. 4 input I_0 connects to a source end computer that enqueues flow f into Q(I_0, O_0, F′ ∩ F) with Ñ^src_f = 9 (cell/v-frame), then Ñ^(0)_f follows directly from Lemma 1. With all the Ñ^(0)_f's (∀f, see Fig. 4) known, we can plan resources, analyze E2E delay bounds, test schedulability, and create TDMA schedules as follows.
Resource Planning Method
For each predecessor queue Q(I_0, O_0, F′ ∩ F) (∀F′ ∈ P_F), O_0 shall allocate the number of slots per M-slot frame given by Eq. (5). Note the first slot is for the sync-grant; the other two additional slots are for over-provisioning, whose meaning will become clear during the analysis of Theorems 2 and 3; and Ñ^(0)_{F′∩F} is a notational shortcut for the total worst case virtual traffic load of the member flows in F′ ∩ F. Eq. (5) means O_0 totally allocates Ñ^(0)_F slots per frame, as given by Eq. (6), for aggregate F. Subsequently, each O_i (i = 1 ∼ k − 1) allocates the number of slots per frame given by Eq. (7) for aggregate F. Note the two additional slots there are for over-provisioning. Also note that since O_i is not the aggregator for F, all of its allocated slots are for normal-data-grant. Lemma 1 tells us how to calculate Ñ^(0)_f (∀f ∈ F′ ∩ F, where F is the aggregate of concern and F′ ∈ P_F). Given all the Ñ^(0)_f's, Eq. (5) ∼ (7) tell us how many slots per frame to grant aggregate F along its path.
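The bookkeeping implied by this planning step can be sketched as follows. Since Eq. (5) ∼ (7) are not reproduced above, the sketch only assumes the structure described in the text: at the aggregator, each predecessor queue gets its member flows' total Ñ^(0) plus three extra slots (one sync-grant, two over-provisioning), and each intermediate output gets the aggregator's total plus two over-provisioning slots. The "+3" and "+2" constants are our reading of the text, not a verbatim copy of the equations.

```python
def aggregator_allocation(member_loads_per_pred):
    """member_loads_per_pred: {predecessor F': sum of N~^(0)_f over f in F' ∩ F}.
    Returns (slots per predecessor queue, total slots at the aggregator O_0).
    The '+ 3' (one sync-grant + two over-provisioning slots) follows the text
    around Eq. (5)."""
    per_queue = {pred: load + 3 for pred, load in member_loads_per_pred.items()}
    return per_queue, sum(per_queue.values())

def intermediate_allocation(aggregator_total):
    """Slots per frame at each intermediate output O_i (i = 1..k-1);
    the '+ 2' over-provisioning slots follow the text around Eq. (7)."""
    return aggregator_total + 2

per_queue, total = aggregator_allocation({"F1": 6, "F2": 4})
print(per_queue, total, intermediate_allocation(total))
# {'F1': 9, 'F2': 7} 16 18
```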
Next we will see how to calculate delay bounds.
Real-Time Delay Bounds
Still, without loss of generality, we refer to Fig. 4 and study a flow f that joins aggregate F.

Theorem 2 (Intra-Aggregate Delay Bound): Suppose we allocate resources per Eq. (5) ∼ (7) and the clock-drift precondition below holds. Then the delay a cell of f experiences inside aggregate F (i.e., across O_1 ∼ O_{k−1}) is bounded by D^(1∼k−1)_F given in Eq. (8); in particular, D^(1∼k−1)_F ≤ kP_max. The quantities involved are defined in Eq. (6) and (7) respectively. Proof: Please see Appendix B.

Theorem 3 (Inter-Aggregate Delay Bound): Suppose we allocate resources per Eq. (5) ∼ (7) and the clock-drift precondition below holds. Then the time a cell of f spends backlogged in the segregator queue Q(I_k, O_k, F ∩ F′) of the next aggregate F′ ∈ S_F is at most D^(1∼k−1)_F + 3P^(k) + 2τ^(k) ≤ (k + 3)P_max + 2τ_max, where D^(1∼k−1)_F is given in Eq. (8). Proof: Please see Appendix C.

Note the preconditions in Theorems 2 and 3 define the constraint on all clock drifts between switches: τ_max − τ_min < τ_min/M. As long as this constraint is met, we have solid delay bounds in spite of the existence of clock drift.
By applying Theorem 3 along a flow's path, we can calculate the flow's E2E delay bound (note the 0th aggregate is the source end computer and the input port/queue it connects to).
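As a sketch only (ours), the E2E calculation is a sum of per-aggregate bounds; `theorem3_bound` below is a placeholder callable standing in for whatever Theorem 3 and Eq. (8) yield, here instantiated with the expression (k + 3)·P_max + 2·τ_max that appears at the end of Appendix C.

```python
def e2e_delay_bound(aggregate_hop_counts, theorem3_bound):
    """aggregate_hop_counts: one k per aggregate along the flow's path
    (the 0th 'aggregate' is the source-end computer and its input queue).
    theorem3_bound: callable k -> per-aggregate delay bound."""
    return sum(theorem3_bound(k) for k in aggregate_hop_counts)

P_MAX, TAU_MAX = 1e-4, 5e-8    # seconds, the Section V settings
bound = e2e_delay_bound([4, 2, 3], lambda k: (k + 3) * P_MAX + 2 * TAU_MAX)
print(f"{bound * 1e3:.3f} ms")  # 1.800 ms for this illustrative 3-aggregate path
```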
Switch Schedulability and Scheduling
Given the flows in the network, their worst case source end virtual traffic loads (i.e., Ñ^src_f), and their routing plans among real-time aggregates, Lemma 1 and Eq. (5) ∼ (7) determine how many slots per frame each output shall grant each input. We can then reuse Theorem 1 to test schedulability, and reuse the corresponding polynomial-time scheduling algorithm described in [1] to derive the schedule.
V. EVALUATION
We evaluate the efficiency of our real-time aggregate design in networks of TCRT switches.
Specifically, the physical link layout of our evaluated networks takes the form of grid. A grid of edge length E consists of (E + 1) × (E + 1) TCRT switches. Switches are deployed in a two-dimensional plane at coordinates (x, y) (where x = 0, 1, 2, . . . , E and y = 0, 1, 2, . . . , E). For simplicity, we use (x, y) to denote the TCRT switch at coordinate (x, y). Switch (x, y) has a directional physical link to connect it to each of its one hop neighbors (here "one hop" means "geographical distance of one"). Fig. 10 (a) illustrates the physical link layout of a grid of 4 × 4 (i.e., E = 4).
Given an aforementioned grid network of E × E, we then overlay (1 + log₂ E) layers of aggregates upon the network. The Lth (L = 0, 1, . . . , log₂ E) layer of aggregates also forms a grid, which connects those switches whose x, y coordinates are both multiples of 2^L. For example, Fig. 10(b) is the aggregate layout of the 4 × 4 network of Fig. 10(a).
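For concreteness (our own sketch), the switches participating in layer L of the overlay are exactly those whose coordinates are both multiples of 2^L:

```python
import math

def layer_switches(E, L):
    """Coordinates (x, y) of the switches on aggregate layer L of an E x E grid."""
    step = 2 ** L
    return [(x, y) for x in range(0, E + 1, step) for y in range(0, E + 1, step)]

E = 4
num_layers = 1 + int(math.log2(E))    # 1 + log2(E) layers (E is a power of two here)
for L in range(num_layers):
    print(L, len(layer_switches(E, L)))
# Layer 0 uses all 25 switches, layer 1 the 9 even-coordinate switches,
# layer 2 only the 4 corner switches.
```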
We evaluate five grid networks, where E = 1, 2, 4, 8, 16 respectively. In each network, there are (E+1) × (E+1) TCRT switches. Each input/output of these switches has a capacity of 10 ∼ 10.0040016 Gbps (the range is due to clock drift between different switches, i.e., the clock periods of different switches are not exactly the same, hence the time used to transmit one bit is not exactly the same). For convenience, we assume each cell is 500 bits (instead of the de facto standard of 512 bits) and M = 2000 slots/frame. These settings result in τ_max = 50 ns, τ_min = 49.98 ns, P_max = 0.1 msec, and P_min = 0.09996 msec, which complies with [6]'s suggestion that the frame duration be orders of magnitude less than typical real-time tasks' periods (which are typically >> 1 ms [23][6][24][25]).
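A quick sanity check of these numbers (our arithmetic, using the stated 500-bit cell and M = 2000):

\[
\tau_{\max} = \frac{500\,\text{bit}}{10\,\text{Gbps}} = 50\,\text{ns},\qquad
\tau_{\min} = \frac{500\,\text{bit}}{10.0040016\,\text{Gbps}} \approx 49.98\,\text{ns},
\]
\[
P_{\max} = M\tau_{\max} = 2000 \times 50\,\text{ns} = 0.1\,\text{ms},\qquad
P_{\min} = M\tau_{\min} \approx 0.09996\,\text{ms}.
\]
\[
\tau_{\max}-\tau_{\min} = 0.02\,\text{ns} \;<\; \tau_{\min}/M \approx 0.025\,\text{ns},
\]

so the clock-drift precondition of Theorems 2 and 3 is satisfied by these settings.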
We compare two flow aggregate methods: real-time aggregate and per-aggregate queueing. For each aggregate method, we run 100 trials. In each trial, we add into the network randomly generated periodic real-time flows; 90% of them are sensing/actuating traffic, and the flows' worst case source end virtual traffic loads Ñ^src_f are set accordingly (cf. [25]). The source and destination ends are randomly picked from the switches in the network according to a uniform distribution (here, we are not associating our simulation with any specific networked real-time applications; without application specific knowledge, the uniform distribution is a natural and generic enough choice, just as most software libraries' default random number generators assume a uniform distribution on [0, 1)). Once source/destination ends are picked, the flow is routed via the aggregate layout of the network using Dijkstra's shortest path algorithm (each aggregate is considered to be of length "1" in the Dijkstra path planning).
Under the real-time aggregate method, in each switch that the route passes, the corresponding output allocates resources according to Section IV-C, "Resource Planning Method". Then we test the schedulability of such resource allocation according to Section IV-C, "Switch Schedulability and Scheduling", and calculate the flow's E2E delay bound with Theorems 2 and 3. If every switch on the route can afford the resource allocation and the E2E delay bound is within 50 msec (we choose 50 msec because, through literature survey, it is a commonly acceptable E2E delay bound for networked real-time applications [23][6][24][25]), the flow is admitted; otherwise it is rejected. Under the per-aggregate queueing method [20][21][22], we allocate each queue a number of TDMA slots so that the service rate is no less than the total source end arrival rate of the queue's member flows. Then we can apply Theorem 2.4.2 of [20] to carry out a sufficient schedulability test and derive the corresponding E2E delay bound.
As mentioned before, given the network and aggregate method, we carry out 100 trials. In each trial, we keep adding real-time flows into the network until no more flows can be admitted. Each admitted flow corresponds to an E2E delay bound. Then we calculate the worst case E2E delay bound over all admitted flows, and the average physical link utilization; the latter is computed from each admitted flow's worst case source end traffic load Ñ^src_f and the physical links the flow traverses. In summary, in each trial we derive a worst case (i.e., maximum) E2E delay bound over all flows admitted, and an average physical link utilization. The statistics of the two metrics are plotted in Fig. 11 and Fig. 12. In the figures, each dot represents the mean over the corresponding 100 trials, while the error bars represent that mean value's 95% confidence interval. Note some mean values are very accurate, resulting in nearly overlapping upper and lower error bars. According to Fig. 11, with no less than 95% confidence, we can claim all the flows admitted have E2E delay bounds below the real-time deadline requirement of 50 msec, a commonly acceptable E2E delay bound for networked real-time applications [23][6][24][25]. What is more, real-time aggregate achieves a much better worst case E2E delay bound (all mean values are below 10 msec) than per-aggregate queueing (whose mean values fluctuate from 15 msec to nearly 30 msec).
According to Fig. 12, with no less than 95% confidence, we can claim the following. First, under both aggregate methods, the schedulable average physical link utilization decreases as the network diameter H increases. This is because as H increases, flows on average travel more hops of aggregates; their burstiness increases each time they join a new aggregate, which degrades schedulability. Second, real-time aggregate achieves much higher schedulable average physical link utilization than per-aggregate queueing. When H = 4, 8, 16, and 32, the former is respectively 1.5, 2.5, 3.8, and 7.6 times that of the latter. Third, note our real-time aggregate performance analysis already takes the dummy cell overhead into consideration. The results show that even with the dummy cell overhead, real-time aggregate still achieves a much better schedulable average physical link utilization and worst case E2E delay bound than per-aggregate queueing.
VI. RELATED WORK
In the real-time community, there are three sets of highly relevant works.
The first set is Pinwheel scheduling [26] . Though also based on TDMA, Pinwheel scheduling assumes one CPU per node or independent multiprocessors, and mainly focuses on finding the optimal TDMA scheduling period. In contrast, we have N outputs contending for N inputs in parallel within a same switch, and focus on finding a contention free crossbar schedule (matching).
The second set is hierarchical scheduling [27]. However, hierarchical scheduling is about CPUs. Though recently Santos et al. [28] proposed using hierarchical scheduling for output queueing real-time switches, how to migrate the hierarchical CPU task model to the popular input queueing TCRT switch architecture without introducing many modifications is still a non-trivial open problem, not to mention supporting aggregates and clock drift.
The third set is non-work-conserving switch scheduling (e.g., Stop-and-Go [29]). But these schemes also assume output queueing instead of the input queueing crossbar switch architecture, which has only become predominant more recently. In addition, to the best of our knowledge, the existing non-work-conserving switch scheduling schemes (such as Stop-and-Go) are not about flow aggregation, nor do they cover the burstiness growth and infection problems caused by clock drift between switches.
In the networking community, first, we note that our multi-hop real-time switched networks bear drastically different design philosophies, traffic features, and network coverage compared to those of the Internet. The Internet prefers flow aggregation (shared queues) to flow isolation (e.g., per-flow routing) in pursuit of scalability. Unlike the Internet, mission critical real-time networks/applications need flow isolation, zero packet loss, and hard E2E delay bounds to guarantee dependability.
It is worth noting that in earlier years, the Internet community did propose many per-flow queueing zero packet loss QoS schemes, e.g., WFQ [30] . However, these schemes mostly assume output queueing, and need time-stamp based packet sorting. Due to implementation and runtime complexity, they are not widely implemented by manufacturers.
It is the other branch of efforts, on ATM and telephone switches, that finally evolved into today's widely adopted input queueing TCRT switch architecture [7][8][1][9][10][11], which this paper is about (see Section II).
Another set of architectures supporting per-flow queueing and zero packet loss is the real-time LANs, a.k.a. fieldbuses [31][32]. But their support for real-time mostly assumes a shared medium, hence is for LANs instead of multi-hop networks. In fact, merging fieldbuses into multi-hop switched networks would need the aforementioned real-time switches [11][9][1].
TTEthernet [33] is a fieldbus standard that considers multihop real-time support. However, TTEthernet standard assumes the underlying multi-hop switched network already guarantees bounded E2E delay. The standard itself does not specify the detailed design of the switches. Therefore, our realtime switch/aggregate design can complement TTEthernet by providing a detailed design that meets its core assumption.
There are other real-time fieldbus standards that involve support for real-time flow aggregates, e.g., IEC 61784 [34]. MPLS [35] is a flow labeling mechanism for aggregation-based routing. However, MPLS is a Layer 2.5 mechanism, which sits above Layer 2 (the Data Link Layer), while the real-time aggregate design of this paper is strictly a Layer 2 design. In other words, real-time aggregate can serve MPLS.
Sun and Shin proposed Guaranteed Rate (GR) server based flow aggregates with bounded E2E delay in [12] . However, GR servers (e.g., WFQ [30] ) are not widely implemented as they usually assume output queueing and need packet sorting.
In contrast, serving flow aggregates with FIFO is widely implemented due to its simplicity. This method is also known as DiffServ [22] . As Wang et al. [13] point out, DiffServ's schedulability and E2E delay bound are very susceptible to rogue bursty traffic, mainly due to lack of isolation in FIFO. A generic schedulability test and tight E2E delay bound are still open problems. However, there is a well-known sufficient schedulability test and corresponding E2E delay bound analysis framework developed by Boudec et al. [20] [21] . This framework can be applied to per-aggregate queueing TCRT switched networks (in fact, per-aggregate queueing is the way to implement DiffServ on the TCRT switch architecture). In Section V, we used this DiffServ analysis framework to analyze the per-aggregate queueing performance, and compared it with that of real-time aggregate.
The IEEE 802.1 AVB task group has recently released the IEEE 802.1Qav specifications [36] , which also proposes a flow aggregate mechanism for multi-hop switched networks. However, this mechanism is designed for output queueing work-conserving switch architecture with prioritized scheduling. In contrast, this paper's aggregate mechanism is designed for the input queueing non-work-conserving crossbar switch architecture with TDMA scheduling.
Finally, Scharbarg et al. [37] give a probabilistic E2E delay bound for aggregates in AFDX [4] switched networks. In contrast, this paper focuses on providing a deterministic E2E delay bound instead.
VII. CONCLUSION
In this paper, we proposed a novel flow aggregation (queue sharing) mechanism for the popular TCRT switches, which are widely recognized/adopted to build multi-hop real-time networks for integrating nowadays quickly expanding distributed real-time systems. The mechanism, called "real-time aggregates", exploits the TCRT switch's features and deploys spatial-temporal isolation to curb the burstiness growth and infection of aggregates' member flows. This allows us to derive a closed-form E2E delay bound and the corresponding resource-planning/admission-control strategies. Simulations show that real-time aggregates can guarantee short E2E delay bounds and provide high utilization of network resources.

APPENDIX A PROOF OF LEMMA 1

We first prove two auxiliary lemmas.

Lemma 2: In any single M-slot frame, aggregator O_0 can fetch cells of at most one v-frame from Q(I_0, O_0, F′ ∩ F).

Proof: Let s_0 denote the slot in which O_0 sync-grants Q(I_0, O_0, F′ ∩ F) in the frame. If at s_0 the header of Q(I_0, O_0, F′ ∩ F) is a dummy cell, then the dummy cell is removed by the sync-grant; and till the end of the M-slot frame, O_0 can fetch at most one v-frame from Q(I_0, O_0, F′ ∩ F). O_0 cannot fetch more than one v-frame in the current frame, because it issues no more sync-grants to remove the next dummy cell in Q(I_0, O_0, F′ ∩ F).
If at s_0 the header of Q(I_0, O_0, F′ ∩ F) is not a dummy cell, then till the end of the M-slot frame, O_0 cannot issue any more sync-grant to remove the dummy cell leading the next v-frame. Therefore, in the current frame, O_0 can at most get one v-frame from Q(I_0, O_0, F′ ∩ F).
Lemma 3: Let D denote the data cells carried in any k consecutive v-frames output by O_0. Then the data cells of D come from at most (k + 2) consecutive v-frames enqueued into Q(I_0, O_0, F′ ∩ F).

Proof: O_0 outputs k consecutive v-frames using k(M + 1) consecutive cell-times. Therefore, these k consecutive v-frames are output during at most (k + 2) consecutive M-slot frames. According to Lemma 2, these (k + 2) M-slot frames can fetch at most (k + 2) consecutive v-frames from Q(I_0, O_0, F′ ∩ F).

Now we are ready to prove Lemma 1.

Case 1: h̃ = 1. Since O outputs a v-frame every (M + 1) cell-times, the slots of one v-frame belong to at most 2 consecutive M-slot frames of O. According to Lemma 2, the data contents can come from at most 2 v-frames from the previous hop (i.e., f's source end computer). That is, each v-frame output by O can contain at most 2Ñ^src_f cells of f.

Case 2: h̃ = 2. Using a similar analysis as Case 1, for each v-frame output from O, the data contents come from at most 2 v-frames from the previous aggregate. Suppose O′ is the aggregator of the previous aggregate. O′ outputs 2 v-frames using 2M + 2 cell-times: slot_0 ∼ slot_{2M+1}. These slots belong to either 3 or 4 consecutive M-slot frames of O′.
Suppose these slots belong to 4 consecutive frames: frame_0 ∼ frame_3. Then slot_{2M+1}, and only slot_{2M+1}, belongs to frame_3, and slot_{2M+1} must be frame_3's first slot. Because an aggregator cannot issue an aggregator-data-grant before issuing the sync-grant within any frame, slot_{2M+1} cannot be an aggregator-data-grant. Therefore, slot_{2M+1} cannot fetch any content of flow f. Therefore, only frame_0 ∼ frame_2 can fetch contents of flow f. Therefore, either way, the slots among slot_0 ∼ slot_{2M+1} that contain data from flow f come from at most 3 consecutive frames of O′. According to Lemma 2, the data contents come from at most 3 v-frames from the previous hop (i.e., f's source end computer). That is, there can be at most 3Ñ^src_f cells of f.
Case 3: h̃ ≥ 3. For convenience, denote O_{h̃} = O. Suppose after leaving the source end computer, flow f passes aggregators O_1, O_2, . . . , O_{h̃} sequentially. Using a similar analysis as Case 2, for each v-frame output from O_{h̃}, the cells containing contents of f must come from at most 3 consecutive v-frames output from O_{h̃−2}. Then we can recursively apply Lemma 3 till we reach the conclusion that the cells containing contents of f must come from at most 3 + 2(h̃ − 2) = 2h̃ − 1 consecutive v-frames from f's source end computer. That is, there can be at most (2h̃ − 1)Ñ^src_f cells of f. Combining Cases 1 ∼ 3, we prove the lemma.
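A small counting note (our restatement of the step used in Lemma 3's proof): a window of k(M + 1) consecutive cell-times, however it is phased against the M-slot frame boundaries, overlaps at most

\[
1 + \left\lceil \frac{k(M+1)-1}{M} \right\rceil
  = 1 + k + \left\lceil \frac{k-1}{M} \right\rceil
  \;\le\; k + 2 \qquad (1 \le k \le M+1)
\]

consecutive M-slot frames, which is where the (k + 2) figure comes from.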
APPENDIX B PROOF OF THEOREM 2
Our analysis may use two special functions of time t: the affine function Λ[σ, ρ](t) and the rate-delay function Γ[r, δ](t), defined as Λ[σ, ρ](t) def= σ + ρt for t > 0 (and 0 for t ≤ 0), and Γ[r, δ](t) def= r · max(0, t − δ).
We can use network calculus to prove Theorem 2. According to Eq. (6), O_0 allocates Ñ^(0)_F slots in each of its M-slot frames to serve aggregate F. Due to TDMA, these Ñ^(0)_F slots per frame mean the traffic of F entering Q(I_1, O_1, F) is constrained by an affine arrival curve Λ[σ^(0)_F, ρ^(0)_F](t), where σ^(0)_F and ρ^(0)_F follow from Ñ^(0)_F and the frame duration.

According to Theorem 1.4.6 (and its corollary) in [20], we can regard O_1 ∼ O_{k−1} as one single server in a black box, which serves Q(I_1, O_1, F) with a service curve of Γ[r^(1∼k−1)_F, L^(1∼k−1)_F](t), where r^(1∼k−1)_F and L^(1∼k−1)_F are given by Eq. (17) and (18) respectively. Note at most 1 slot might be taken over for sending dummy cells in an M-slot frame; hence in Eq. (17) and (18) we use (Ñ^(1)_F − 1) in place of Ñ^(1)_F.

Since τ_max − τ_min < τ_min/M, we have M · M(τ_max − τ_min) < Mτ_min, i.e., M(P_max − P_min) < P_min, and hence Ñ^(0)_F (P_max − P_min) < P_min (as Ñ^(0)_F ≤ M). Thus the black-box server's long-term service rate is no less than the arrival rate ρ^(0)_F, and we can apply basic network calculus to get

D^(1∼k−1)_F ≤ (k − 1)Mτ_max + P_max = (k − 1)P_max + P_max = kP_max.
APPENDIX C PROOF OF THEOREM 3
Let č_i denote the ith (i = 0, 1, . . .) dummy cell output from O_0 since the system starts. Let Ľ^(k)_i denote the time č_i is cleared from Q(I_k, O_k, F ∩ F′) by O_k's sync-grant, where D^(1∼k−1)_F is defined in Eq. (8). We further have the following lemma:

Lemma 4: Consecutive dummy-cell clearing times from O_k satisfy Eq. (20), where P^(k) and τ^(k) are the M-slot frame and cell-time durations (in the unit of "second") of O_k.

Proof: We can prove Eq. (20) by induction, distinguishing two cases.

Case 1: At Ľ^(k)_i, č_{i+1} has already arrived at Q(I_k, O_k, F ∩ F′), which also means all the v-frame contents between č_i and č_{i+1} are already backlogged before č_{i+1} in Q(I_k, O_k, F ∩ F′). Note O_k goes through a full M-slot frame starting from Ľ^(k)_i, issuing enough data-grants to clear the v-frame data cells backlogged in front of č_{i+1}. Therefore, when O_k issues its next sync-grant, č_{i+1} is cleared in that cell-time, which establishes Eq. (20) for this case; here Ineq. (21) holds because τ_max − τ_min < τ_min/M ⇒ P_min + τ_min > P_max.

Case 2: č_{i+1} arrives at Q(I_k, O_k, F ∩ F′) only after Ľ^(k)_i; and since č_i has already left, at most one v-frame of data cells is backlogged before č_{i+1} in Q(I_k, O_k, F ∩ F′). Suppose t is the first time O_k sync-grants Q(I_k, O_k, F ∩ F′) after č_{i+1} arrives. By t + P^(k), all the v-frame data cells backlogged in front of č_{i+1} are cleared. Therefore, č_{i+1} leaves O_k by t + P^(k) + τ^(k) at the latest, which again establishes Eq. (20).

Now consider a data cell c of flow f. Due to Theorem 2 and Lemma 4, the dummy cell č_i preceding c leaves Q(I_k, O_k, F ∩ F′) by some time t. Therefore, by t, even if c is still in Q(I_k, O_k, F ∩ F′), there is no dummy cell backlogged before c; in addition, there is at most one v-frame of data cells backlogged before c. Therefore Eq. (24) holds, which means c is backlogged in Q(I_k, O_k, F ∩ F′) for at most D^(1∼k−1)_F + 3P^(k) + 2τ^(k) ≤ (k + 3)P_max + 2τ_max seconds.
