Abstract: Output-buffered switches are known to have better performance than other switch architectures. However, output-buffered switches also suffer from the notorious scalability problem, and direct constructions of large output-buffered switches are difficult. In this paper, we study the problem of constructing scalable switches that have comparable performance to output-buffered switches. For this, we propose a new concept, called quasi-output-buffered switch. Like an output-buffered switch, a quasi-output-buffered switch is a deterministic switch that delivers packets in the FIFO order and achieves 100% throughput. Using the three-stage Clos network, we show that one can recursively construct a larger quasi-output-buffered switch with a set of smaller quasi-output-buffered switches. By recursively expanding the three-stage Clos network, we obtain a quasi-output-buffered switch with only 2 × 2 switches. Such a switch is called a packet-pair switch as it always transmits packets in pairs. By computer simulations, we show that packet-pair switches have better delay performance than most load-balanced switches with comparable construction complexity.
I. INTRODUCTION
It is known that an output-buffered switch achieves 100% throughput and has the best delay performance among all switch architectures. However, this comes at the cost of N times speedup for an N × N output-buffered switch. The required speedup limits our ability to construct a large output-buffered switch. There are several studies in the literature that achieve exact emulation of an output-buffered switch, e.g., the crosspoint-buffered switch [1], the parallel-buffered switch [13], and the combined input-output queue [15], [22]. However, all of these have either non-scalable hardware complexity or computation and communication overheads.
One of the key problems in high-speed switching is whether one can construct scalable switches with comparable performance to output-buffered switches. Recent advances in load-balanced switches (see e.g., [6], [9], [14], [17]) have shed some light on that problem. A typical load-balanced switch consists of two stages: the first stage performs load balancing that converts the incoming traffic into uniform traffic, and the second stage switches the uniform traffic. Moreover, the connection patterns in the switches of both stages are deterministic and periodic. It is shown that various load-balanced switches have comparable performance to output-buffered switches. As such, they can achieve 100% throughput with O(1) computation and communication overheads.
One of the main contributions of this paper is to identify the key ingredients in load-balanced switches that enable us to construct large switches with comparable performance to output-buffered switches. For this, we propose a new concept, called quasi-output-buffered switch. Like an output-buffered switch, a quasi-output-buffered switch is a deterministic switch that delivers packets in the First-in First-out (FIFO) order and achieves 100% throughput. Using the three-stage Clos network [11], we show that one can recursively construct a larger quasi-output-buffered switch with a set of smaller quasi-output-buffered switches. To the best of our knowledge, this is the first result that allows recursive constructions of switches with comparable performance (in the sense of 100% throughput and FIFO delivery) to output-buffered switches. Analogous to the construction of a Benes network [2], we recursively expand the three-stage Clos network to obtain a quasi-output-buffered switch with only 2 × 2 switches. Such a switch is called a packet-pair switch as it always transmits packets in pairs. The packet-pair switch has several nice features: 100% throughput, FIFO delivery of packets, deterministic connection patterns for 2 × 2 switches, self-routing of packets, and no need for communication and computation. By computer simulations, we also show that packet-pair switches have better delay performance than most load-balanced switches with comparable construction complexity.
The key theory behind our constructions of quasi-output-buffered switches is a refined calculus based on a traffic characterization in [8]. Such a traffic characterization allows us to describe a flow of packets by a single "rate." It is shown that the aggregated flow has a "rate" equal to the sum of the "rates" of the individual flows. Round-robin splitting of a flow yields several subflows with smaller "rates." Moreover, a departing flow has the same "rate" as that of the arriving flow provided that the system is "stable." Unlike the theory of effective bandwidth (see e.g., [16] and references therein), the refined calculus does not need the independence assumption on the flows.
The paper is organized as follows: in Section II, we introduce the traffic characterization and its associated calculus. Then we define the concept of a quasi-output-buffered switch. In Section III, we propose the three-stage construction of a quasi-output-buffered switch. The packet-pair switches are introduced in Section IV. Finally, the paper is concluded in Section V.
II. QUASI-OUTPUT-BUFFERED SWITCHES

A. Traffic characterization
A flow is commonly known as a sequence of packets that have the same source and destination pair in a switch (or a network of switches). In most switching papers, flows in a switch (or a network of switches) are assumed to follow certain traffic models, e.g., Bernoulli arrival processes and Markov processes. These traffic models are too specific for our constructions of quasi-output-buffered switches. Instead, we will use the much more general traffic characterization for a flow of packets in [8]. Throughout this paper, we only consider the discrete-time setting and make the following assumptions:
(A1) Time is slotted and synchronized in every link.
(A2) Packets are of the same size and each packet can be transmitted in a time slot.
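Definition 1:
(i) A stochastic process {Q(t), t ≥ 0} is said to have a finite moment generating function if there exists a θ > 0 such that
$$\sup_{t \ge 0} E\left[e^{\theta Q(t)}\right] < \infty. \tag{1}$$
(ii) For a flow A, we will use A(t) to denote the cumulative number of packets of that flow that arrive by time t. Flow A is said to be λ-moment generating function bounded from above (λ-m.b.f.a.) if for every ε > 0, the stochastic process {Q(t), t ≥ 0} defined below has a finite moment generating function:
$$Q(t) = \max_{0 \le s \le t}\left[A(t) - A(s) - (\lambda + \epsilon)(t - s)\right]. \tag{2}$$
With Q(0) = 0, we note that Q(t) in (2) is in fact the recursive expansion of the Lindley equation [19]
$$Q(t+1) = \left(Q(t) + a(t+1) - (\lambda + \epsilon)\right)^+, \tag{3}$$
where $(x)^+ = \max(x, 0)$ and
$$a(t) = A(t) - A(t-1)$$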
is the number of packets that arrive at time t. In view of (3), Q(t) is simply the number of packets (or more precisely bits, with Q(t) being a real number) at time t when we feed flow A to a work-conserving link with capacity λ + ε. It is known from the Loynes construction [20] that {Q(t), t ≥ 0} converges in distribution to a steady-state random variable Q(∞) if the sequence {a(t), t ≥ 1} is stationary and ergodic with a mean rate not greater than λ. However, traffic characterization by the mean rate of a stationary and ergodic sequence is not strong enough to guarantee that the steady-state random variable Q(∞) has a finite moment generating function. For this, we need a stronger condition in [3]. Let
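$$a^*(\theta) = \limsup_{t \to \infty} \frac{1}{\theta t} \sup_{s \ge 0} \log E\left[e^{\theta (A(s+t) - A(s))}\right]$$
be the minimum envelope rate (MER) with respect to θ > 0 (also known as the effective bandwidth function, see e.g., [16]). When
$$a^*(\theta) < \lambda + \epsilon,$$
it was shown in Theorem 3.8 in [3] that
$$\sup_{t \ge 0} E\left[e^{\theta Q(t)}\right] < \infty.$$
This shows that flow A is $a^*(\theta)$-m.b.f.a. for any θ > 0. One can further choose the best traffic characterization by letting $\rho = \inf_{\theta > 0} a^*(\theta)$, and thus flow A is ρ-m.b.f.a. We note that for many stochastic processes, the value ρ is simply the mean arrival rate, as illustrated in the following example for the Bernoulli arrival process.
Example 2: (Bernoulli arrival process) Consider a Bernoulli arrival process with mean ρ, i.e., {a(t), t ≥ 1} is a sequence of independent and identically distributed Bernoulli random variables with P(a(t) = 1) = ρ. Then
$$a^*(\theta) = \frac{1}{\theta} \log\left(\rho e^{\theta} + 1 - \rho\right)$$
and
$$\inf_{\theta > 0} a^*(\theta) = \lim_{\theta \to 0} a^*(\theta) = \rho.$$
Thus, the Bernoulli arrival process with mean ρ is ρ-m.b.f.a.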
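As a numerical illustration of this characterization, the following sketch evaluates the effective bandwidth of a Bernoulli flow and compares the Lindley recursion with its expanded maximum form. It assumes the standard closed form (1/θ) log(ρ e^θ + 1 − ρ) for independent Bernoulli arrivals; the function names, the seed and the parameter values are illustrative choices and are not taken from the paper.

```python
import math
import random

def bernoulli_mer(rho, theta):
    """(1/theta) * log E[exp(theta * a(t))] for an i.i.d. Bernoulli(rho) flow."""
    return math.log(rho * math.exp(theta) + 1.0 - rho) / theta

def lindley(arrivals, capacity):
    """Q(t+1) = max(Q(t) + a(t+1) - capacity, 0) with Q(0) = 0; returns [Q(0), ..., Q(T)]."""
    q = [0.0]
    for a in arrivals:
        q.append(max(q[-1] + a - capacity, 0.0))
    return q

def expanded(arrivals, capacity):
    """Q(t) = max over 0 <= s <= t of A(t) - A(s) - capacity * (t - s)."""
    cum = [0]
    for a in arrivals:
        cum.append(cum[-1] + a)
    return [max(cum[t] - cum[s] - capacity * (t - s) for s in range(t + 1))
            for t in range(len(cum))]

# Illustrative parameters (assumptions): rate rho, slack eps, 200 time slots.
rng = random.Random(1)
rho, eps = 0.6, 0.1
arrivals = [1 if rng.random() < rho else 0 for _ in range(200)]
q_rec = lindley(arrivals, rho + eps)
q_max = expanded(arrivals, rho + eps)
```

The two queue-length sequences agree slot by slot, and `bernoulli_mer(rho, theta)` decreases toward ρ as θ decreases to 0, which is the sense in which the Bernoulli flow is ρ-m.b.f.a.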
In view of Example 2, our traffic characterization is only slightly stronger than the traffic characterization by the mean arrival rate. The additional assumption on the bounded moment generating functions leads to the following three important properties: the superposition property in Lemma 3, the splitting property in Lemma 4 and the departure property in Lemma 5. The proofs of Lemma 3, Lemma 4 and Lemma 5 are omitted due to space limitations and can be found in the full report [4].
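In the following lemma, we derive the superposition property for two flows.
Lemma 3: (Superposition) (i) If {Q_1(t), t ≥ 0} and {Q_2(t), t ≥ 0} both have finite moment generating functions, then {Q_1(t) + Q_2(t), t ≥ 0} also has a finite moment generating function. (ii) If flow A_1 is λ_1-m.b.f.a. and flow A_2 is λ_2-m.b.f.a., then the aggregated flow A_1 + A_2 is (λ_1 + λ_2)-m.b.f.a.
We note that the proof of Lemma 3 is based on the Cauchy-Schwarz inequality, and that Q_1(t) and Q_2(t) in Lemma 3(i) (resp. A_1 and A_2 in Lemma 3(ii)) need not be independent. As discussed before, if we view λ_1 as the "mean" rate for flow A_1 and λ_2 as the "mean" rate for flow A_2, then the aggregated flow has the "mean" rate λ_1 + λ_2.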
The second property is the splitting property in Lemma 4: round-robin splitting of a flow yields several subflows with smaller "mean" rates.
The third property is the departure property in Lemma 5. The departure property shows that if flow A has the "mean" rate λ, then flow B, the departure flow of flow A, also has the "mean" rate λ provided that the system is "stable" (in the sense of a bounded moment generating function). As we shall see later, the superposition property, the splitting property, and the departure property provide us with a simple calculus for our traffic characterization in a network of switches.
We note that it is difficult to obtain the departure property in Lemma 5 if one uses weaker traffic characterizations, such as stationarity and ergodicity. On the other hand, it is possible to obtain such a departure property by using stronger traffic characterizations, such as the (σ, ρ)-deterministic traffic characterization in the network calculus [12]. However, such a deterministic traffic characterization cannot be used for the stochastic analysis needed in our later development.
B. Output-buffered switches
Let flow $A_{i,k}$ be the flow from input i to output k, and $A_{i,k}(t)$, i = 1, 2, . . . , M, k = 1, 2, . . . , N, be the cumulative number of packets that arrive by time t for that flow. Also, let $B_k(t)$, k = 1, 2, . . . , N, be the cumulative number of packets that depart from output k by time t, and $Q_k(t)$ be the number of packets stored at the k-th output at time t.
Definition 6: (Output-buffered switch) An M × N switch is called an output-buffered switch if it satisfies the following two properties when it is started from an empty system at time 0 (i.e., Q(0) = 0):
(i) packets destined for the same output depart in the First-in First-out (FIFO) order, and
(ii) the number of packets stored at the k-th output, k = 1, 2, . . . , N, satisfies
$$Q_k(t+1) = \Big(Q_k(t) + \sum_{i=1}^{M} a_{i,k}(t+1) - 1\Big)^+, \tag{10}$$
where
$$a_{i,k}(t) = A_{i,k}(t) - A_{i,k}(t-1)$$
is the number of flow $A_{i,k}$ packets that arrive at time t. Equation (10), known as the Lindley recursion, says that all the packets that arrive at time t from flows $A_{i,k}$, i = 1, 2, . . . , M, are sent to the output buffer of the k-th output port at the same time. If there are packets in that output buffer, then one packet will depart from the output port. We note that in the worst case there might be packets arriving from all the M flows at the same time. In that case, each output buffer is required to have the capability of receiving M packets at the same time. As such, each output buffer needs to be sped up (at least) M times, and that causes the notorious scalability problem for an output-buffered switch.
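The per-output queue dynamics described above can be sketched in a few lines. This is a minimal illustration, assuming the usual form of the Lindley recursion in which up to M packets join an output buffer in a slot and at most one packet departs in that slot; the function name and the 3 × 2 example are hypothetical.

```python
def output_buffered_step(queues, arrivals):
    """Advance an output-buffered switch by one time slot.

    queues[k]      -- number of packets stored at output k before the slot
    arrivals[i][k] -- 1 if a packet from input i destined for output k arrives, else 0
    """
    new_queues = []
    for k, q in enumerate(queues):
        arriving = sum(row[k] for row in arrivals)   # up to M packets at once
        new_queues.append(max(q + arriving - 1, 0))  # at most one departure per slot
    return new_queues

# Worst case for a 3 x 2 switch: all three inputs send to output 0 in the same slot.
burst = [[1, 0], [1, 0], [1, 0]]
idle = [[0, 0], [0, 0], [0, 0]]
after_burst = output_buffered_step([0, 0], burst)
after_idle = output_buffered_step(after_burst, idle)
```

In the burst slot the buffer of output 0 has to accept three packets at once, which is the M times speedup that limits scalability.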
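By recursively expanding the Lindley equation in (10) with $Q_k(0) = 0$, we have
$$Q_k(t) = \max_{0 \le s \le t}\Big[\sum_{i=1}^{M} \big(A_{i,k}(t) - A_{i,k}(s)\big) - (t - s)\Big].$$
Since
$$B_k(t) = \sum_{i=1}^{M} A_{i,k}(t) - Q_k(t),$$
we have
$$B_k(t) = \min_{0 \le s \le t}\Big[\sum_{i=1}^{M} A_{i,k}(s) + (t - s)\Big]. \tag{12}$$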
Note that from the FIFO property and (12) of an output-buffered switch, the departure of a packet at time t is uniquely determined by all the packets that arrive by time t. As such, if the arrival times of all the packets are delayed by a constant c, then the departure times of all the packets are also delayed by the same constant c.
To ensure the stability of an output-buffered switch, we introduce the following no overbooking condition.
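Definition 7: (No overbooking condition) Consider an M × N switch. The input traffic is said to satisfy the no overbooking condition if flow $A_{i,k}$ is $\lambda_{i,k}$-m.b.f.a. for all i = 1, 2, . . . , M and k = 1, 2, . . . , N, and
$$\sum_{i=1}^{M} \lambda_{i,k} < 1, \quad k = 1, 2, \ldots, N. \tag{13}$$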
Intuitively, the no overbooking condition in (13) indicates that the total "mean" rate to a particular output port cannot exceed 1. Under the no overbooking condition, we show that an output-buffered switch is stable in the sense of having a finite moment generating function.
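A check of the no overbooking condition on a given rate matrix can be sketched as follows. The strict inequality on every column sum is an assumption here (it leaves the slack needed by the λ-m.b.f.a. definition), and the function name and example matrices are illustrative.

```python
def satisfies_no_overbooking(rates):
    """rates[i][k] is the "mean" rate of the flow from input i to output k.

    Returns True if the total rate into every output is strictly below 1
    (strictness is an assumption of this sketch).
    """
    num_outputs = len(rates[0])
    return all(sum(row[k] for row in rates) < 1.0 for k in range(num_outputs))

# Uniform traffic on a 4 x 4 switch with load 0.9: every output receives rate 0.9.
uniform = [[0.9 / 4] * 4 for _ in range(4)]
# Hot-spot traffic on a 2 x 2 switch: output 0 is offered rate 1.1.
hot_spot = [[0.6, 0.1], [0.5, 0.1]]
```

Here `satisfies_no_overbooking(uniform)` holds while `satisfies_no_overbooking(hot_spot)` does not, since the first output is overbooked.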
Lemma 8: For an M × N output-buffered switch, if the input traffic satisfies the no overbooking condition in Definition 7, then (i) {$Q_k(t)$, t ≥ 0} has a finite moment generating function, k = 1, 2, . . . , N, and (ii) {Q(t), t ≥ 0} has a finite moment generating function, where
$$Q(t) = \sum_{k=1}^{N} Q_k(t)$$
is the total number of packets in the switch at time t.
Such a property is called the universal stability property (in the sense of the existence of a finite moment generating function for the total number of packets in a switch).
Proof. (i) Using the superposition property in Lemma 3(ii), the aggregated flow to the k-th output is $\big(\sum_{i=1}^{M} \lambda_{i,k}\big)$-m.b.f.a. The result in (i) then follows directly from (13) and Definition 1(ii).
(ii) This is a direct consequence of the superposition property in Lemma 3(i).
C. Definition of quasi-output-buffered switches
As discussed before, output-buffered switches do not scale due to the needed speedup. As such, it is difficult to construct a large output-buffered switch directly. The natural question is then whether one can construct a larger switch using a set of smaller output-buffered switches. We will show in this paper that this is possible by extracting and preserving some key properties in output-buffered switches. The switches that satisfy these key properties are called quasi-output-buffered switches (defined below), i.e., they behave like output-buffered switches but they are not exactly the same as output-buffered switches.
Definition 9: (Quasi-output-buffered switch) An M × N switch is called a quasi-output-buffered switch if it satisfies the following properties when it is started from an empty system at time 0:
(P1) Deterministic mapping: the departure time of every packet is a deterministic function of the arrival times of all the packets. This implies that if the arrival times of all the packets are delayed by a constant c, then a quasi-output-buffered switch can be operated in a way (by shifting the starting time of the switch) so that the departure times of all the packets are also delayed by the same constant c.
(P2) FIFO: packets of the same flow depart in the FIFO order.
(P3) Universal stability: let Q(t) be the total number of packets in the switch. If the input traffic of the switch satisfies the no overbooking condition in Definition 7, then {Q(t), t ≥ 0} has a finite moment generating function.
Clearly, an output-buffered switch is a quasi-output-buffered switch (from Lemma 8). Quasi-output-buffered switches also include the switches that achieve exact emulation of output-buffered switches (e.g., the CIOQ switch in [15]). Various versions of load-balanced Birkhoff-von Neumann switches, including the Uniform Frame Spreading (UFS) in [17], the Padded Frame in [14], and the CR switch in [25], are shown to hold a total number of packets that is within a constant bound of that in the corresponding output-buffered switch. Thus, they are also quasi-output-buffered switches. However, it is not clear whether an input-buffered switch with maximum weight matching (MWM) [21] is a quasi-output-buffered switch, as the universal stability property in (P3) has not been proved in the literature yet. We also note that switches that use randomized algorithms (e.g., [24]) are not quasi-output-buffered switches as they fail to satisfy the deterministic mapping property.
III. A THREE-STAGE CONSTRUCTION OF QUASI-OUTPUT-BUFFERED SWITCHES
In this section, we show how one can construct a larger quasi-output-buffered switch by using a set of smaller quasi-output-buffered switches. In Figure 1, we show a three-stage construction of an N × N quasi-output-buffered switch, where N = p × q. In the first stage, there are q p × p input-buffered switches. Each input buffer at an input link of a switch in the first stage has N virtual output queues (VOQ). The second stage consists of p q × q quasi-output-buffered switches. Finally, in the third stage, there are also q p × p input-buffered switches. Each input buffer at an input link of a switch in the third stage has p VOQs. As in a standard Clos network [11], the switches in the first stage and those in the second stage are connected by the perfect shuffle exchange, i.e., for m = 1, 2, . . . , p, ℓ = 1, 2, . . . , q, the m-th output from the ℓ-th switch in the first stage is connected to the ℓ-th input of the m-th switch in the second stage. Similarly, the switches in the second stage and those in the third stage are also connected by the perfect shuffle exchange, i.e., for m = 1, 2, . . . , p, ℓ = 1, 2, . . . , q, the ℓ-th output from the m-th switch in the second stage is connected to the m-th input of the ℓ-th switch in the third stage.
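The perfect shuffle wiring between consecutive stages of the three-stage construction can be written down directly. The sketch below uses 1-based indices, with q first-stage switches of size p × p, p second-stage switches of size q × q, and q third-stage switches of size p × p; the function names are hypothetical and the check only confirms that every downstream input port is used exactly once.

```python
def first_to_second(ell, m):
    """Output m of first-stage switch ell feeds input ell of second-stage switch m."""
    return (m, ell)   # (second-stage switch, input port)

def second_to_third(m, ell):
    """Output ell of second-stage switch m feeds input m of third-stage switch ell."""
    return (ell, m)   # (third-stage switch, input port)

def wiring_is_one_to_one(p, q):
    """True if both wirings hit each of the p * q downstream input ports exactly once."""
    stage2 = [first_to_second(ell, m)
              for ell in range(1, q + 1) for m in range(1, p + 1)]
    stage3 = [second_to_third(m, ell)
              for m in range(1, p + 1) for ell in range(1, q + 1)]
    ok2 = (len(set(stage2)) == p * q
           and all(1 <= sw <= p and 1 <= port <= q for sw, port in stage2))
    ok3 = (len(set(stage3)) == p * q
           and all(1 <= sw <= q and 1 <= port <= p for sw, port in stage3))
    return ok2 and ok3
```

For example, with p = 4 and q = 3 (N = 12), `wiring_is_one_to_one(4, 3)` holds, and output 2 of first-stage switch 3 is wired to input 3 of second-stage switch 2.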
The main idea of the three-stage construction is to accumulate packets in the first stage to form a frame. We then use the uniform frame spreading (UFS) scheme in [17] to distribute the packets in a frame evenly to the quasi-output-buffered switches in the second stage. Finally, packets in a frame are "re-assembled" in the last stage.
To do this, the connection patterns of the p × p switches in the first stage and the third stage are specified by the symmetric TDM switch in [7]. Recall that a p × p symmetric TDM switch implements the following periodic connection patterns: input i is connected to output j at time t if and only if
$$(i + j) \bmod p = (t + 1) \bmod p. \tag{14}$$
(R1) There are N VOQs at an input of a symmetric TDM switch in the first stage. When a packet destined for output j arrives, it is placed in the j-th VOQ, j = 1, 2, . . . , N. The switches in the first stage are operated in a frame-based manner as in the UFS scheme [17]. Every frame consists of p consecutive time slots. However, the beginning time slots of frames are different for different inputs. Specifically, frame f of input i of a switch in the first stage begins at the f-th time when input i is connected to the first quasi-output-buffered switch in the second stage. As such, we have from (14) that frame f of input i consists of time slots i + (f − 1)p, . . . , i − 1 + fp. If the number of packets in a VOQ is not less than p, that VOQ is called a full-framed VOQ. At the beginning of a frame, if an input of a switch in the first stage has at least one full-framed VOQ, then the switch selects one full-framed VOQ and sends p consecutive packets from that VOQ in that frame. As such, these p packets are distributed to the p q × q quasi-output-buffered switches in the second stage. Otherwise, it does nothing during that frame.
From the UFS scheme in (R1), we know that if there is a packet destined for output j that arrives at the i-th input of the first switch in the second stage at time t, then there is also a packet destined for output j that arrives at the i-th input of the ℓ-th switch in the second stage at time t + ℓ − 1, ℓ = 2, . . . , p. In other words, the arrival process to the ℓ-th switch in the second stage is simply a time-shifted version of that to the first switch in the second stage. Thus, they can be made identical if we run the clock in the ℓ-th switch by the new time t′ = t − ℓ + 1.
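The frame timing of the first stage can be checked with a small sketch. It assumes that the symmetric TDM pattern connects input i to output j at time t when (i + j) mod p = (t + 1) mod p, which is the form consistent with the frame slots stated above; indices are 1-based and the function names are hypothetical.

```python
def connected_output(i, t, p):
    """Output j in 1..p connected to input i at time t, assuming (i + j) mod p == (t + 1) mod p."""
    return (t - i) % p + 1

def frame_slots(i, f, p):
    """Time slots i + (f - 1) p, ..., i - 1 + f p that make up frame f (f = 1, 2, ...) of input i."""
    start = i + (f - 1) * p
    return list(range(start, start + p))

# With p = 4, input 3 visits outputs 1, 2, 3, 4 in order during its second frame.
visited = [connected_output(3, t, 4) for t in frame_slots(3, 2, 4)]
```

Under this assumption every frame of every input sweeps the outputs in the order 1, 2, ..., p, so the ℓ-th packet of a frame reaches the ℓ-th second-stage switch exactly ℓ − 1 slots after the first packet reaches the first one.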
As there is a unique routing path to an (external) output from an input of a switch in the second stage, we know from the deterministic mapping property in (P1) that the departure process from the first switch in the second stage and that from the ℓ-th switch in the second stage are also identical with respect to the new clocks. As such, if there is a packet destined for output j that arrives at the first input of the k-th switch in the third stage at time t, then there is also a packet destined for output j that arrives at the ℓ-th input of the k-th switch in the third stage at time t + ℓ − 1, ℓ = 2, . . . , p.
(R3) There are p VOQs at an input of a symmetric TDM switch in the third stage. When a packet destined for output j arrives, it is placed in the k(j)-th VOQ, where k(j) = j − ⌊(j − 1)/p⌋ p. The switches in the third stage are operated in a frame-based manner as those in the first stage. Every frame consists of p consecutive time slots. However, the beginning time slots of frames are different for different outputs. Specifically, frame f of output i of a switch in the third stage begins at the f-th time when output i is connected to the first input of that switch. As such, we have from (14) that frame f of output i consists of time slots i + (f − 1)p, . . . , i − 1 + fp. During a frame of output i, every input sends a packet from its i-th VOQ to output i (if its i-th VOQ is not empty).
Theorem 10: The three-stage construction described above is indeed an N × N quasi-output-buffered switch.
We note that there are several early works in the literature (see e.g., [5], [10]) that also used the three-stage Clos network to construct a larger switch. To the best of our knowledge, Theorem 10 on quasi-output-buffered switches is the first result that allows recursive constructions of switches with comparable performance (in the sense of 100% throughput and FIFO delivery) to output-buffered switches.
Clearly, as the switches in the first stage and the third stage are symmetric TDM switches, they are deterministic. As the quasi-output-buffered switches in the second stage satisfy the deterministic mapping property in (P1), the three-stage construction also satisfies the deterministic mapping property. Also, from the UFS scheme in (R1) and the inverse UFS in (R3), packets of the same flow depart in the FIFO order. Thus, (P2) of the three-stage construction is satisfied. It remains to show the universal stability property in (P3). This will be done in the following section.
B. Universal stability
In this section, we show the universal stability property for the three-stage construction. Denote by flow $A_{i,k}$, i, k = 1, 2, . . . , N, the sequence of packets from input i to output k. For the proof of the universal stability property, we assume that the no-overbooking condition in Definition 7 is satisfied, i.e., for all i, k = 1, 2, . . . , N, $A_{i,k}$ is $\lambda_{i,k}$-m.b.f.a., and for all k = 1, 2, . . . , N,
$$\sum_{i=1}^{N} \lambda_{i,k} < 1. \tag{15}$$
As the switches in the first stage are operated under the UFS scheme, it is well known (see e.g., [17] , [25] ) that the number of packets stored in an input buffer of a switch in the first stage is bounded above by a finite constant. This is stated in the following proposition.
Proposition 11: The total number of packets in an input buffer of a switch in the first stage is bounded above by Np.
Now we show the universal stability property for the switches in the second stage. As the switches in the second stage are quasi-output-buffered switches, the key step is then to verify that the no-overbooking condition is satisfied for every quasi-output-buffered switch in the second stage.
Let flow B 
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE INFOCOM 2008 proceedings.
the cumulative number of packets from flow A i,k that arrive at that switch by time t. As the switches in the first stage are operated under the uniform frame spreading scheme, the packets from the same flow are distributed in a round-robin fashion to the switches at the second stage. Thus, 
where we use (15) in the last inequality. As such, the no-overbooking condition for the m-th switch in the second stage is satisfied. In view of the definition of a quasi-output-buffered switch and the superposition property in Lemma 3(i), we then have the following proposition.
Proposition 12:
be the total number of packets in the m th switch in the second stage at time t. Then {Q 2 m (t), t ≥ 0} has a finite moment generating function.
(ii) Let
be the total number of packets in the second stage at time t. Then {$Q^2(t)$, t ≥ 0} also has a finite moment generating function.
Now we show the universal stability property for the switches in the third stage. Consider the switch in the third stage that contains the k-th output. Let A
k,m (t) be the total number of packets destined for the k th output that are stored in the m th input buffer of the switch in the third stage that contains the k th output. Also, let C 3 k,m (t) be the cumulative number of time slots that the m th input of that switch is connected to output k by time t. As the connection pattern of that switch is periodic with period p, we have
Moreover, we have from the Lindley equation (with
Since the aggregated flow
, it then follows from the no overbooking condition in (15) that {Q 3 k,m (t), t ≥ 0} has a finite moment generating function. Using the superposition property in Lemma 3(i), we then derive the following result.
Proposition 13: Let
be the total number of packets in the third stage at time t. Then {$Q^3(t)$, t ≥ 0} also has a finite moment generating function.
Let Q(t) be the total number of packets inside the three-stage construction at time t. From Proposition 11, Proposition 12, Proposition 13 and the superposition property in Lemma 3(i), we then conclude that {Q(t), t ≥ 0} also has a finite moment generating function.
IV. PACKET-PAIR SWITCHES
A. Architecture
In the case that N is a power of 2, we can recursively construct an N × N quasi-output-buffered switch by the three-stage construction in Section III (as in the construction of a Benes network [2]). To do this, we first note that for N = 2 we can simply choose p = 2 and q = 1 in the three-stage construction in Section III. Since a 1 × 1 switch can simply be replaced by a single link, the three-stage construction in this case is equivalent to the (two-stage) load-balanced Birkhoff-von Neumann switch with the uniform frame spreading scheme. For such a switch, the frame size is 2 and packets are transmitted in pairs under the uniform frame spreading scheme. Now we can define packet-pair switches recursively as follows:
Definition 14: (Packet-pair switches) (i)
The operations of a packet-pair switch can also be specified in detail by recursively expanding the operations in (R1) and (R3). In the following, we describe the detailed operations of an N × N packet-pair switch with N = 2^n. For ease of presentation, we index the inputs/outputs by 0, 1, 2, . . . , 2^n − 1. Also, the N/2 switches at each stage are indexed by 0, 1, 2, . . . , 2^{n−1} − 1.
(R4) Uniform frame spreading for the first n stages:
For j = 1, 2, . . . , n, the m-th 2 × 2 switch in the j-th stage has 2^{n−j+1} VOQs at each input. These 2^{n−j+1} VOQs are indexed by 0, 1, 2, . . . , 2^{n−j+1} − 1. The connection patterns of the switch are periodic with period 2. It is set to the "bar" state when
is an odd number and to the "cross" state otherwise. Suppose a packet destined for output k arrives at a switch in the j 
A VOQ is called a full-framed VOQ if the number of packets in that VOQ is not less than 2. When an input is connected to the first output at time t, it selects a full-framed VOQ and sends two consecutive packets (a packet-pair) from that VOQ at times t and t + 1.
th VOQ, where k 2 (j) = b 2n−j+1 . When the switch is in the "bar" state at time t, VOQ 0 is selected and its head-of-line packet is transmitted at time t. Otherwise, VOQ 1 is selected and its head-of-line packet is transmitted at time t.
Note that the 2 × 2 switches in the first n stages of the N × N packet-pair switch are operated under the UFS scheme with frame size 2. From Proposition 11, it follows that the total number of packets in an input buffer of a switch in the j-th stage, j = 1, 2, . . . , n, is bounded above by 2^{n−j+1} × 2. Moreover, we have from the deterministic mapping property that the arrival process to any input buffer of a 2 × 2 switch in the (n + 1)-th stage is simply a time-shifted version of the arrival process to the first input buffer of the first switch in the (n + 1)-th stage. In view of this, the first n stages in fact perform load balancing for the incoming traffic at the N × N packet-pair switch.
Now we consider the Bernoulli arrival traffic in Example 2. In every time slot, a packet arrives at an input of the N × N packet-pair switch with probability ρ, 0 ≤ ρ < 1. This is independent of everything else. With probability $r_{i,k}$, an arriving packet at input i is destined for output k. This is also independent of everything else. Note that from the law of total probability, we must have
$$\sum_{k=1}^{N} r_{i,k} = 1, \quad i = 1, 2, \ldots, N.$$
In view of (22), we have
for all i = 1, 2, . . . , N. As the N × N packet-pair switch is a quasi-output-buffered switch, we then have the following universal stability result.
Theorem 15: For the Bernoulli arrival traffic described above, there exists a θ > 0 such that
$$\sup_{t \ge 0} E\left[e^{\theta Q(t)}\right] < \infty,$$
where Q(t) is the total number of packets in the N × N packet-pair switch.
In summary, the packet-pair switch has the following nice features:
1) It achieves 100% throughput.
2) It delivers packets in the FIFO order.
3) It only contains 2 × 2 switches and the connection patterns of these 2 × 2 switches are deterministic and periodic with period 2.
4) Packets are self-routed through the network of 2 × 2 switches.
5) No communication and computation is needed.
We note that the idea of using uniform traffic spreading and self-routing in a buffered Benes network was previously used in [23], [18]. However, there is no guarantee that packets are delivered in the FIFO order in [23], [18].
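The size of the recursive construction can be counted with a short sketch. It assumes, as in the recursion described earlier, that each level of the three-stage construction with p = 2 contributes one first stage and one last stage of 2 × 2 switches and that a 1 × 1 switch is just a link; the function names are hypothetical.

```python
def packet_pair_stages(n_ports):
    """Number of stages of 2 x 2 switches in an N x N packet-pair switch, N a power of 2."""
    if n_ports < 1 or n_ports & (n_ports - 1):
        raise ValueError("n_ports must be a power of 2")
    if n_ports == 1:
        return 0                       # a 1 x 1 switch is a single link
    # One outer first stage, one outer last stage, and the recursion in the middle.
    return 2 + packet_pair_stages(n_ports // 2)

def packet_pair_size(n_ports):
    """(number of stages, total number of 2 x 2 switches), with N/2 switches per stage."""
    stages = packet_pair_stages(n_ports)
    return stages, stages * (n_ports // 2)
```

Under these assumptions an N × N packet-pair switch with N = 2^n has 2n stages, so the 32 × 32 switch has 10 stages and 160 switches of size 2 × 2.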
B. Delay analysis
To gain some intuition on the delay performance of the packet-pair switch, let us consider the uniform Bernoulli traffic, i.e., r i,k = 1/N for all i and k in the Bernoulli traffic.
For a 2 × 2 switch in the first stage, there are N VOQs at each input. Recall that the operation of a 2 × 2 switch at an input is to transmit a full-framed VOQ when it is connected to the first output of the 2 × 2 switch. A full-framed VOQ in this case is simply a VOQ that contains at least two packets. As such, we can implement the N VOQs by two parts: the first part for storing packets that have not been "paired," and the second part for storing packets that have been "paired." For this, there are N queues with buffer size 1 in the first part, indexed from 1 to N, and two VOQs (for the two outputs of the 2 × 2 switch) in the second part. Suppose a packet of flow k arrives at the switch. If the k-th queue in the first part is empty, the arriving packet is placed in the k-th queue. On the other hand, if the k-th queue is not empty, the arriving packet and the packet stored in the k-th queue are "paired" and they can be moved to the two VOQs in the second part (at the beginning of the next frame).
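The two-part implementation can be sketched as a small data structure. This is only an illustration: the class and attribute names are hypothetical, flows are indexed from 0 rather than 1, and the sketch moves a pair to the second part immediately instead of waiting for the beginning of the next frame.

```python
from collections import deque

class PairingBuffer:
    """Two-part store at one input of a first-stage 2 x 2 switch (illustrative sketch)."""

    def __init__(self, num_flows):
        self.waiting = [None] * num_flows   # part 1: one single-packet queue per flow
        self.voq = (deque(), deque())       # part 2: one VOQ per output of the 2 x 2 switch

    def arrive(self, flow, packet):
        """Store an unpaired packet, or pair it with the waiting packet of the same flow."""
        held = self.waiting[flow]
        if held is None:
            self.waiting[flow] = packet     # odd-numbered packet waits for its partner
        else:
            self.waiting[flow] = None
            self.voq[0].append(held)        # first packet of the pair
            self.voq[1].append(packet)      # second packet of the pair

buf = PairingBuffer(num_flows=4)
buf.arrive(2, "a")      # waits in part 1
buf.arrive(1, "b")      # waits in part 1
buf.arrive(2, "c")      # pairs with "a"; both move to part 2
```

Only the odd-numbered packets of a flow spend time in the first part, which is why the "pairing" delay depends on the interarrival time of the flow.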
In view of the two-part implementation of the N VOQs, the delay at a switch in the first stage consists of two parts: (i) the delay for "pairing" and (ii) the queueing delay for transmitting through the 2 × 2 switch. To compute the "pairing" delay, note that only the odd-numbered packets in a flow need to wait for "pairing," and the "pairing" delay for an odd-numbered packet is simply the interarrival time of the next packet. Under the uniform Bernoulli traffic, the expected interarrival time of a flow is N/ρ. Thus, the expected "pairing" delay is N/(2ρ). For the queueing delay, we approximate the arrival process to the two VOQs in the second part by the Bernoulli arrival traffic with arrival rate ρ. As the connection pattern is periodic with period 2, this model is a special case of the uniform Bernoulli traffic model in [6] (with N = 2). Thus, the expected queueing delay can be approximated by 1/(2(1 − ρ)). Adding these two parts of delay, the expected delay through a switch in the first stage can be approximated by
$$\frac{N}{2\rho} + \frac{1}{2(1-\rho)}. \tag{27}$$
If we approximate the arrival process to every input of a 2 × 2 switch in the packet-pair switch by the uniform Bernoulli traffic with arrival rate ρ, then using the same argument as that in the first stage yields the following approximation for the expected delay through a switch in the j-th stage:
$$\frac{2^{n-j+1}}{2\rho} + \frac{1}{2(1-\rho)} \ \text{ for } j = 2, \ldots, n, \qquad \frac{1}{2(1-\rho)} \ \text{ for } j = n+1, \ldots, 2n, \tag{28}$$
as there is no "pairing" delay for the last n stages. Summing up the delay in (27) and (28), we can approximate the expected delay through the N × N packet-pair switch by
$$\frac{N-1}{\rho} + \frac{n}{1-\rho}. \tag{29}$$
In Figure 3 , we compare our approximation in (29) with computer simulation. As shown in Figure 3 , our approximation (APPR) is a conservative estimate of the delay of the packetpair switch (PP). The reason for that is the arrival process to every input of a 2 × 2 switch in the packet-pair switch is not the uniform Bernoulli traffic. In fact, it is much more regular (less random) than the uniform Bernoulli traffic. This is because "pairing" takes time and it is less likely to have two consecutive pairs with the same destination.
To reduce the "pairing" delay of the packet-pair switch in light traffic, we can also use the idea proposed in the padded frame scheme [14]. At the beginning of a frame, if there is no full-framed VOQ in an input buffer of a switch in the first n stages, we can pad a fake packet to a VOQ with only one packet to form a padded frame (with frame size 2). Then the padded frame is transmitted inside the packet-pair switch. Clearly, it is most beneficial to generate padded frames in the first stage. The gain starts to diminish as the number of stages is increased. For this, we define a parameter n^+ as the number of stages that are allowed to generate padded frames. To ensure stability, the number of padded frames inside the packet-pair switch has to be restrained. For this, we only allow padded frames to be generated when the total number of packets in the first input buffer of the first switch in the (n + 1)-th stage does not exceed a threshold TH. Such an enhancement is called a packet-pair-plus (PP+) switch in this paper.
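The rule for generating a padded frame can be summarized as a predicate. This is a hedged sketch of the conditions stated above; the function name and argument names are hypothetical, and the choice of which single-packet VOQ to pad is left open.

```python
def may_generate_padded_frame(stage, n_plus, has_full_framed_voq,
                              has_single_packet_voq, monitored_backlog, threshold):
    """Decide at the beginning of a frame whether an input may send a padded frame.

    stage             -- index of the stage of the 2 x 2 switch (1, 2, ...)
    n_plus            -- number of leading stages allowed to generate padded frames
    monitored_backlog -- packets in the first input buffer of the first switch
                         in stage n + 1
    threshold         -- the threshold TH on that backlog
    """
    return (stage <= n_plus
            and not has_full_framed_voq       # real frames always take priority
            and has_single_packet_voq         # there must be a packet to pad
            and monitored_backlog <= threshold)
```

For instance, with n_plus = 3 and a threshold of 2, a second-stage input holding only single-packet VOQs may pad when the monitored backlog is 1, but not when it is 3.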
C. Simulations
In this section, we perform various simulations for packet-pair switches. In all our simulations, the switch size N is chosen to be 32. Each simulation run contains 10^6 time slots. In Figure 3, we consider the uniform Bernoulli traffic model and plot the delay of the packet-pair switch (PP), the packet-pair-plus switch (PP+), the ideal output-buffered switch (OB), the uniform frame spreading scheme (UFS) in [17], the padded frame scheme (PF) in [14], and the Contention and Reservation switch (CR) in [25]. Certainly, the output-buffered switch has the best delay performance (at the cost of N times speedup). The packet-pair switch outperforms both the UFS scheme and the padded frame scheme. It also beats the CR switch in heavy traffic. However, its delay is higher than that of the CR switch in light traffic. This is because the CR switch uses the contention mode in light traffic, while the packet-pair switch spends a lot of time forming a frame of two packets in light traffic. In this simulation, the packet-pair-plus switch is run with n^+ = 3 and TH = 2, i.e., only the first 3 stages are allowed to generate padded frames, and only when the total number of packets in the first input buffer of the first switch in the 6th stage does not exceed 2. The delay of the PP+ switch is much better than that of the PP switch in light traffic and is comparable to that of the PP switch in heavy traffic. Similar results are also shown in Figure 4 under the uniform Pareto traffic model in [6].
V. CONCLUSION
In this paper, we proposed a new concept, called quasi-output-buffered switch. Like an output-buffered switch, a quasi-output-buffered switch is a deterministic switch that delivers packets in the FIFO order and achieves 100% throughput. Using the three-stage Clos network, we showed that one can recursively construct a larger quasi-output-buffered switch with a set of smaller quasi-output-buffered switches. By recursively expanding the three-stage network, we obtained a packet-pair switch with only 2 × 2 switches. By computer simulations, we showed that packet-pair switches have better delay performance than most load-balanced switches with comparable construction complexity.
