Abstract: Buffered crossbar switches are special crossbar switches with a small exclusive buffer at each crosspoint of the crossbar. They demonstrate unique advantages, such as variable length packet handling and distributed scheduling, over traditional unbuffered crossbar switches. The current main approach for buffered crossbar switches to provide performance guarantees is to emulate push-in-first-out output queued switches. However, such an approach has several drawbacks, and in particular it has difficulty in providing tight constant performance guarantees. To address the issue, we propose in this paper the guaranteed-performance asynchronous packet scheduling (GAPS) algorithm for buffered crossbar switches. GAPS aims to provide tight performance guarantees, and requires no speedup. It directly handles variable length packets without segmentation and reassembly, and makes scheduling decisions in a distributed manner. We show by theoretical analysis that GAPS achieves constant performance guarantees. We also prove that GAPS has a bounded crosspoint buffer size of 3L, where L is the maximum packet length. Finally, we present simulation data to verify the analytical results and show the effectiveness of GAPS.
I. INTRODUCTION
Buffered crossbar switches have recently attracted considerable attention [1]-[16] as promising high speed interconnects. They are special crossbar switches with a small exclusive buffer at each crosspoint of the crossbar, as shown in Figure 1. Such a switch architecture was once regarded as not scalable [17]. Fortunately, recent developments in VLSI technology have made it feasible to integrate on-chip memories into crossbar switching fabrics, and thus to build moderate-size buffered crossbar switches [1]-[4]. Buffered crossbar switches demonstrate unique advantages over traditional unbuffered crossbar switches [5]-[7].
Unbuffered crossbar switches have no buffers on the crossbar, and packets have to be directly transmitted from input ports to output ports. They usually work with fixed length cells in a synchronous time slot mode [18] . To maximize throughput and accelerate scheduling, all the scheduling and transmission units must have the same length. In each time slot, all input-output pairs transmit cells at the same time. When variable length packets arrive, they will be segmented into fixed length cells at input ports. The cells are then transmitted to output ports, where they are reassembled into original packets and sent to the output lines. This process is called segmentation and reassembly (SAR) [14] .
For buffered crossbar switches, crosspoint buffers decouple input ports and output ports, and simplify the scheduling process [8] . They can directly handle variable length packets without SAR and work in an asynchronous mode [7] . To be specific, input ports periodically send packets of arbitrary length to the corresponding crosspoint buffers, from where output ports retrieve the packets one by one. Note that packets in most practical networks are of variable length [19] . Compared with fixed length cell scheduling of unbuffered crossbar switches, variable length packet scheduling of buffered crossbar switches has some unique advantages [7] [8], such as high throughput, low latency, and reduced hardware cost.
In this paper, we study fair scheduling algorithms for buffered crossbar switches to provide performance guarantees. The considered problem is that each flow of the switch is allocated a specific amount of bandwidth, and the fair scheduling algorithm arranges packet transmission to ensure that the flow receives its allocated bandwidth, and thus provides guaranteed delay and jitter. There exist a number of solutions [5]-[12] in the literature, and the main approach is to emulate push-in-first-out (PIFO) output queued (OQ) switches [5], i.e. duplicating the packet departure time in PIFO OQ switches. As indicated by the name, OQ switches have buffer space only at output ports. Because input ports have no buffers, all arriving packets have to be immediately transferred to the output buffers by the crossbar. Thus, the crossbar of an N × N OQ switch needs to run N times faster than the input port and output port, or in other words has speedup of N [20]. The speedup requirement makes OQ switches difficult to scale. On the other hand, because all packets are already stored at the output ports, it is easy for OQ switches to run various fair queueing algorithms, such as Deficit Round Robin (DRR) [21], Weighted Fair Queueing (WFQ) [22], and Worst-case Fair Weighted Fair Queueing (WF²Q) [23], to provide performance guarantees. The objective is for each output port to emulate the packet departure sequence in the ideal Generalized Processor Sharing (GPS) [24] fairness model. Specifically, a PIFO OQ switch is an OQ switch with a push-in-first-out queueing policy. For such a switch, an arriving packet can be put anywhere in the output queue and a departing packet can only be removed from the head of the queue [5].
However, the above emulation approach has several drawbacks. In particular, it has difficulty in providing constant performance guarantees. Constant performance guarantees mean that for any flow, the difference between its received bandwidth in a specific algorithm and in the ideal GPS model is bounded by constants, i.e. the equations in Theorem 1 of [23]. They are the key properties to assure worst-case fairness [23]. The reason is that WF²Q (including its variants) [23] is the only known fair queueing algorithm to achieve constant performance guarantees. Unfortunately, WF²Q does not use a PIFO queueing policy [25]: a packet at the head of a queue may not be eligible for departure, because it has not yet started transmission in GPS [23].
In this paper, we propose the Guaranteed-performance Asynchronous Packet Scheduling (GAPS) algorithm for buffered crossbar switches to provide constant performance guarantees. The considered buffered crossbar switches do not need speedup. Because the crossbar runs at the same speed as the output ports, no buffer space is necessary at the output ports. When a packet is transmitted to its output port, it will be immediately sent to the output line. GAPS uses time stamps of packets in GPS as the scheduling criteria, and perfectly emulates the ideal GPS model. It directly handles variable length packets without SAR, and allows input ports and output ports to make independent scheduling decisions based on only local information without data exchange. Specifically, an input port needs only the statuses of its input queues, and an output port needs only the statuses of its crosspoint buffers. We show by theoretical analysis that GAPS provides constant performance guarantees. Furthermore, we prove that GAPS has a crosspoint buffer size bound of 3L, where L is the maximum packet length. Finally, we conduct simulations to verify the analytical results and evaluate the effectiveness of GAPS.
The rest of the paper is organized as follows. In Section II, we introduce some preliminaries for the paper. In Section III, we propose the GAPS algorithm. In Section IV, we theoretically analyze the performance of GAPS. In Section V, we present simulation data. In Section VI, we conclude the paper.
II. PRELIMINARIES
In this section, we first provide an overview of the approach to provide performance guarantees by emulating PIFO OQ switches. We then analyze in detail the drawbacks of the emulation approach, and describe the ideal fairness model used in this paper.
A. Emulating PIFO OQ Switches
The emulation of PIFO OQ switches is the current main approach in the literature for crossbar switches to provide performance guarantees. It was proved in [10] that a buffered crossbar switch with speedup of two satisfying non-negative slackness insertion, lowest time to live (LTTL) blocking, and LTTL fabric scheduling can exactly emulate a PIFO OQ switch. In [11], the MCAF-LTF cell scheduling scheme for one-cell buffered crossbar switches was proposed. MCAF-LTF does not require a costly time stamping mechanism, and is able to emulate a PIFO OQ switch with speedup of two. [5] studied practical scheduling algorithms for buffered crossbar switches. It showed that with speedup of two, a buffered crossbar switch can mimic a restricted PIFO OQ switch (a PIFO OQ switch with the restriction that the cells of an input-output pair depart the switch in the same order as they arrive), and that with speedup of three, a buffered crossbar switch can mimic an arbitrary PIFO OQ switch and hence provide delay guarantees. [12] presented a cell scheduling algorithm for buffered crossbar switches with speedup of two to emulate an arbitrary PIFO OQ switch and achieve flow based performance guarantees. The performance guarantees of packet scheduling for asynchronous buffered crossbar switches were discussed in [7]. The Packet GVOQ and Packet LOOFA scheduling algorithms were designed based on existing cell scheduling algorithms. They require 2L or more buffer space at each crosspoint. Besides buffered crossbar switches, Combined-Input-Output-Queued switches have also been proved to be able to emulate PIFO OQ switches with speedup of two [26] [27].
The above algorithms were designed to make exact emulation of PIFO OQ switches. There are also some other schemes that intend to emulate OQ switches but cannot duplicate the same packet departure sequence. [28] proposed the Distributed Packet Fair Queueing architecture for physically dispersed line cards to emulate an OQ switch with fair queueing, and used simulation results to demonstrate its effectiveness with modest speedup. iFS was proposed in [29] for virtual output queued (VOQ) switches to emulate WFQ [22] at each output port. iFS uses a grant-accept two stage iterative matching method, and uses the virtual time as the grant criterion. Similarly, iDRR in [30] emulates DRR [21] at each output port of VOQ switches. iDRR uses the round robin principle in its iterative matching steps, and thus is able to make fast arbitration.
B. Drawbacks of Emulation Approach
There are two main drawbacks with the above approach to provide performance guarantees by emulating PIFO OQ switches. First, as discussed in Section I, it has difficulty in providing tight performance guarantees. Second, the proportional bandwidth allocation policy of PIFO OQ switches is not practical, because it does not consider the bandwidth constraints at the input ports, while flows may oversubscribe input ports [16] .
The objective of the emulation approach is to emulate a fair queueing algorithm at each output port. Fair queueing algorithms schedule packets from multiple flows of a shared output link to ensure fair bandwidth allocation, and they allocate bandwidth to the flows proportional to their requested bandwidth [24]. Numerically, assume that the available bandwidth of the shared output link is $R$, and $\phi_i$ and $R_i$ are the requested bandwidth and allocated bandwidth of the $i$th flow, respectively. With proportional bandwidth allocation, we have $\forall i, \forall j,\ \frac{R_i}{\phi_i} = \frac{R_j}{\phi_j}$ and $\sum_i R_i \le R$. However, simple proportional bandwidth allocation is not suitable for switches [31] [32]. The reason is that, while flows of a shared output link are constrained only by the link bandwidth, flows of a switch are subject to two bandwidth constraints: the available bandwidth at both the input port and output port of the flow. Naive bandwidth allocation at the output port may make the flows violate the bandwidth constraints at their input ports, and vice versa.
In the following, we use an example to illustrate the issue. Consider a 2 × 2 switch. For ease of representation, denote the $i$th input port as $In_i$ and the $j$th output port as $Out_j$. Assume that each input port or output port has available bandwidth of one unit. Use $\phi_{ij}$ and $R_{ij}$ to represent the requested bandwidth and allocated bandwidth of $In_i$ at $Out_j$, respectively. Assume that each output port uses the proportional bandwidth allocation policy, i.e. the policy used by fair queueing algorithms for shared output links. First we look at only $Out_1$. Because $\phi_{11} = 0.9$ and $\phi_{21} = 0.6$, by the proportional policy we have $R_{11} = 0.6$ and $R_{21} = 0.4$. The same applies to $Out_2$. The allocated bandwidth $R_{ij}$ is thus shown in (1). However, this allocation is not feasible, because the total bandwidth allocated at $In_1$ is $R_{11} + R_{12} = 0.6 + 0.6 = 1.2$, exceeding the available bandwidth of 1. For the same reason, if bandwidth allocation is conducted independently by each input port using the proportional policy, the allocation will not be feasible either.
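The infeasibility in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the requests at $Out_2$ ($\phi_{12} = 0.9$, $\phi_{22} = 0.6$) are an assumption read from the statement that "the same applies to $Out_2$".

```python
def proportional_allocation(requested, capacity=1.0):
    """Per-output proportional allocation (hypothetical helper).

    requested[i][j] is the bandwidth requested by input i at output j.
    Each output scales its requests down proportionally only when the
    requests exceed the output capacity.
    """
    n_in, n_out = len(requested), len(requested[0])
    allocated = [[0.0] * n_out for _ in range(n_in)]
    for j in range(n_out):
        total = sum(requested[i][j] for i in range(n_in))
        for i in range(n_in):
            if total > capacity:
                allocated[i][j] = capacity * requested[i][j] / total
            else:
                allocated[i][j] = requested[i][j]
    return allocated


def input_loads(allocated):
    """Total bandwidth allocated at each input port (row sums)."""
    return [sum(row) for row in allocated]


# Assumed request matrix: column 1 is from the text, column 2 is assumed equal.
phi = [[0.9, 0.9],
       [0.6, 0.6]]
alloc = proportional_allocation(phi)
loads = input_loads(alloc)
print(alloc)   # each output is fully used, split 0.6 / 0.4
print(loads)   # the first input port is allocated about 1.2 > 1
```

Each output port looks consistent in isolation, but the row sum for $In_1$ exceeds its capacity, which is exactly the violation described in the text.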
In addition, to improve utilization, fair queueing algorithms will reallocate the leftover bandwidth of empty flows using the proportional policy. In other words, when a flow temporarily becomes empty, the fair queueing algorithm will reallocate its bandwidth to the remaining backlogged flows in proportion to their requested bandwidth. However, this strategy does not apply to switches either, and we use an additional example to explain. Consider the same 2 × 2 switch, and assume that initially $\forall i, \forall j,\ R_{ij} = 0.5$, as shown in (2). Now suppose that $In_1$ temporarily has no traffic to $Out_1$, i.e. $R_{11}$ changes from 0.5 to 0. The fair bandwidth allocation policy would allocate the leftover bandwidth of $R_{11}$ to $R_{21}$, because now only $In_2$ has traffic to $Out_1$. However, this is not possible here, because it would oversubscribe $In_2$ by 0.5. As a matter of fact, the leftover bandwidth of $R_{11}$ cannot be reallocated at all in this case.
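The second example can be checked numerically in the same spirit. The sketch below uses a hypothetical helper name and simply applies the naive output-side reallocation described above to the uniform 0.5 allocation.

```python
def oversubscription(allocated, capacity=1.0):
    """Amount by which each input port exceeds its capacity (0 if none)."""
    return [max(0.0, sum(row) - capacity) for row in allocated]


# Initial allocation: every flow of the 2 x 2 switch gets 0.5.
alloc = [[0.5, 0.5],
         [0.5, 0.5]]

# In_1 temporarily has no traffic to Out_1; a per-output fair queueing
# policy would hand the leftover bandwidth to the only backlogged flow.
leftover = alloc[0][0]
alloc[0][0] = 0.0
alloc[1][0] += leftover

excess = oversubscription(alloc)
print(excess)  # the second input port is oversubscribed by 0.5
```

The excess at $In_2$ equals the whole leftover, which matches the observation that none of it can be reallocated in this case.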
C. Our Fairness Model
To effectively evaluate the performance guarantees achieved by a scheduling algorithm, it is necessary to have an ideal fairness model as the comparison reference. A fairness model for packet scheduling can be regarded as having two roles. The first role is to calculate allocated bandwidth for flows based on their requested bandwidth. The second role is to schedule packets of different flows to ensure that the actual received bandwidth of each flow is equal to its allocated bandwidth.
As we have seen in Section II-B, the simple proportional bandwidth allocation policy does not apply to switches. Fortunately, there have been some solutions in the literature [31] [32] to fairly allocate bandwidth for flows in a switch based on their requested bandwidth. In this paper, we focus on addressing the first drawback of the emulation approach. In other words, we assume that bandwidth allocation has been calculated by such algorithms, and the scheduling algorithms should provide tight performance guarantees to ensure the allocated bandwidth of each flow. Also, when a flow of the switch temporarily becomes empty, we do not assume that its allocated bandwidth is immediately reallocated. Instead, the bandwidth allocation algorithms will consider the leftover bandwidth in the next calculation. Bandwidth allocation is recalculated when requested bandwidth changes or existing backlogged flows become empty.
We use GPS as the ideal model for packet scheduling. Specifically, given the allocated bandwidth, we compare the received service of a flow in our algorithm and in GPS. GPS views flows as fluids of continuous bits, and creates an independent logical channel for each flow based on its allocated bandwidth. Since the channel bandwidth of a flow is equal to its allocated bandwidth, GPS achieves perfect fairness. Fair queueing algorithms for shared output links also use GPS as the ideal model for packet scheduling, as shown in Figure 2 (a). Similarly, GPS can apply to switch packet scheduling by creating logical channels for different flows based on their allocated bandwidth, as shown in Figure  2 (b). Note that because GPS is a fluid based system, traffic of a flow can smoothly stream from the input port to the output port without buffering in the middle. We thus assume that packets in GPS do not need to be buffered at the crosspoint buffers of the switch.
III. GUARANTEED-PERFORMANCE ASYNCHRONOUS PACKET SCHEDULING
In this section, we describe the considered switch structure, formulate the problem, and present the guaranteed-performance asynchronous packet scheduling (GAPS) algorithm.
A. Switch Structure
The switch structure that we consider is shown in Figure 1. $N$ input ports and $N$ output ports are connected by a buffered crossbar without speedup. Denote the $i$th input port as $In_i$ and the $j$th output port as $Out_j$. Use $R$ to represent the available bandwidth of each input port and output port, and the crossbar also has bandwidth $R$. Each input port has a buffer organized as virtual output queues (VOQ) [33]. In other words, there are $N$ virtual queues at an input port, each storing the packets destined to a different output port. Denote the virtual queue at $In_i$ for packets to $Out_j$ as $Q_{ij}$. Each crosspoint has a small exclusive buffer. Denote the crosspoint buffer connecting $In_i$ and $Out_j$ as $B_{ij}$. Output ports have no buffers. Define the traffic from $In_i$ to $Out_j$ to be a flow $F_{ij}$, and denote the $k$th arriving packet of $F_{ij}$ as $P^k_{ij}$. After $P^k_{ij}$ arrives at the switch, it is first stored in $Q_{ij}$, and waits to be sent to $B_{ij}$. It will then be sent from $B_{ij}$ to $Out_j$ and immediately delivered to the output line. We say that a packet arrives at or departs from a buffer when its last bit arrives at or departs from the buffer.
B. Problem Formulation
As explained in Section II-C, specific bandwidth allocation algorithms will calculate explicit allocated bandwidth for each flow, and the objective of GAPS is to provide service guarantees for each flow.
Use $R_{ij}(t)$ to represent the allocated bandwidth of $F_{ij}$, which is a function of time $t$ with discrete values in practice. The calculated bandwidth allocation should be feasible, i.e., there is no over-subscription at any input port or output port:
$$\forall i,\ \sum_{j} R_{ij}(t) \le R, \qquad \forall j,\ \sum_{i} R_{ij}(t) \le R$$
The feasibility requirement is only for bandwidth allocation. It is necessary because it is impossible to allocate more bandwidth than what is actually available. However, temporary overload is allowed for an input port or output port. Use $toO_{ij}(0, t)$ and $\widehat{toO}_{ij}(0, t)$ to represent the numbers of bits transmitted by $F_{ij}$ to $Out_j$ during interval $[0, t]$ in GAPS and in GPS, respectively. The objective of GAPS is to ensure that $toO_{ij}(0, t) - \widehat{toO}_{ij}(0, t)$ is bounded by constants.
C. Algorithm Description
[Table 1: Pseudo code description of GAPS]
There are two types of scheduling in GAPS, which we call input scheduling and output scheduling. In input scheduling, an input port selects a packet from one of its N input queues, and sends it to the corresponding crosspoint buffer. In output scheduling, an output port selects a packet from one of its N crosspoint buffers, and sends it to the output line. 
it is sent to the crosspoint buffer, and will be removed before it is sent to the output line. For easy understanding, the pseudo code description of GAPS is given in Table 1 .
As can be seen, the input scheduling and output scheduling of GAPS are similar to WF²Q. However, GAPS is different in that the leftover bandwidth of empty flows is not reallocated by the scheduling algorithm but by the bandwidth allocation algorithm.
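As an illustration of how GPS time stamps can drive the two scheduling decisions, the following Python sketch stamps the packets of one flow and picks a packet at an input port and at an output port. It is not the pseudo code of Table 1; it rests on several assumptions that are not spelled out in the surviving text: each flow has a fixed allocated bandwidth, a packet's virtual start time is the later of its arrival time and the previous packet's virtual finish time, eligible packets are served in order of smallest virtual finish time (as in WF²Q), and a packet becomes eligible for output scheduling at its virtual start time plus $L/R$ (consistent with Property 1). All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    arrival: float   # arrival time at the input queue (seconds)
    length: float    # packet length (bits)
    vs: float = 0.0  # virtual start time in GPS
    vf: float = 0.0  # virtual finish time in GPS


def stamp_flow(packets, rate):
    """Assign GPS virtual start/finish times to one flow, in arrival order.

    Assumes a fixed allocated bandwidth `rate` (bits/s) for the flow.
    """
    prev_vf = 0.0
    for p in packets:
        p.vs = max(p.arrival, prev_vf)
        p.vf = p.vs + p.length / rate
        prev_vf = p.vf
    return packets


def pick_input_packet(heads, now):
    """Index of the eligible head packet (vs <= now) with the smallest vf."""
    eligible = [(p.vf, j) for j, p in enumerate(heads)
                if p is not None and p.vs <= now]
    return min(eligible)[1] if eligible else None


def pick_output_packet(heads, now, max_len, link_rate):
    """Same rule at an output port, with eligibility delayed by L / R."""
    offset = max_len / link_rate
    eligible = [(p.vf, i) for i, p in enumerate(heads)
                if p is not None and p.vs + offset <= now]
    return min(eligible)[1] if eligible else None


flow_a = stamp_flow([Packet(0.0, 100.0), Packet(0.5, 50.0)], rate=100.0)
flow_b = stamp_flow([Packet(0.2, 20.0)], rate=100.0)
heads = [flow_a[0], flow_b[0]]
print(pick_input_packet(heads, now=0.1))
print(pick_input_packet(heads, now=0.3))
print(pick_output_packet(heads, now=0.6, max_len=100.0, link_rate=200.0))
```

Both functions use only local state (the head packets of one port's queues or crosspoint buffers), which reflects the distributed nature of the scheduling described above.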
We define the actual input start time and finish time of P 
Define the actual output start time and finish time of P k ij , denoted as OS k ij and OF k ij , to be the time that the first bit and the last bit of P k ij leave B ij in GAPS, respectively. It is obvious that
IV. PERFORMANCE ANALYSIS
In this section, we theoretically analyze the performance of GAPS. We will show that GAPS provides constant performance guarantees and has a bounded crosspoint buffer size.
A. Performance Guarantees
In this subsection, we show that GAPS achieves constant performance guarantees. According to the description of the GAPS algorithm, we have the following property.
Property 1: For any packet, its actual input start time is larger than or equal to its virtual start time, and its actual output start time is larger than or equal to its virtual start time plus $L/R$, i.e. $IS^k_{ij} \ge VS^k_{ij}$ and $OS^k_{ij} \ge VS^k_{ij} + \frac{L}{R}$.
First we define some notations for input scheduling. We say that Q ij is backlogged at time t, if there exists k such 
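Use $toB_{i*}(t_1, t_2)$ and $\widehat{toB}_{i*}(t_1, t_2)$ to represent the total numbers of bits sent from $In_i$ to all its crosspoint buffers during interval $[t_1, t_2]$ in GAPS and GPS, respectively, i.e.
$$toB_{i*}(t_1, t_2) = \sum_{j} toB_{ij}(t_1, t_2), \qquad \widehat{toB}_{i*}(t_1, t_2) = \sum_{j} \widehat{toB}_{ij}(t_1, t_2)$$
The following lemma gives the relationship between $toB_{i*}(0, t)$ and $\widehat{toB}_{i*}(0, t)$.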
Lemma 1: At any time, the number of bits sent from a specific input port in GAPS is larger than or equal to that in GPS, i.e.
$$toB_{i*}(0, t) \ge \widehat{toB}_{i*}(0, t)$$
Proof: ... to be a packet that finished transmission in GPS before $t'$, and thus $VS^k_{ij} < t'$, which means that $P^k_{ij}$ is eligible for input scheduling in GAPS before $t'$. Since $In_i$ was idle immediately before $t'$, it indicates that $P^k_{ij}$ finished transmission in input scheduling of GAPS before $t'$. The analysis applies to any packet transmitted in GPS before $t'$, which means that all packets transmitted in GPS before $t'$ have finished transmission in input scheduling of GAPS by $t'$. In other words,
$$toB_{i*}(0, t') \ge \widehat{toB}_{i*}(0, t') \quad (15)$$
On the other hand, because $In_i$ is busy during $[t', t]$ in GAPS, we know that
$$toB_{i*}(t', t) = R(t - t') \ge \widehat{toB}_{i*}(t', t) \quad (16)$$
Adding (15) and (16), we have $toB_{i*}(0, t) \ge \widehat{toB}_{i*}(0, t)$.
The following lemma compares the service time of a packet in GAPS and in GPS.
Lemma 2: For any packet, its actual input start time in GAPS is less than or equal to its virtual finish time in GPS, i.e. $IS^k_{ij} \le VF^k_{ij}$.
Proof: ... (18) In GPS, since $IS$ ... (19) Combining (18) and (19), we have ... which is a contradiction to Lemma 1.
The next lemma compares $toB_{ij}(0, t)$ and $\widehat{toB}_{ij}(0, t)$.
Lemma 3: At any time, the difference between the numbers of bits sent from input port $In_i$ to crosspoint buffer $B_{ij}$ in GAPS and GPS is greater than or equal to $-L$ and less than or equal to $L$, i.e.
$$-L \le toB_{ij}(0, t) - \widehat{toB}_{ij}(0, t) \le L$$
$\le t$, and thus $P^{k-1}_{ij}$ has started transmission by time $t$ in input scheduling of GAPS. As a result
On the other hand, since $t < VF^k_{ij}$, $P^k_{ij}$ has not finished transmission by $t$ in GPS. Therefore
By (22) and (23), we have
Because IS
Combining (24), (25) , and (26)
has not started transmission by t in input scheduling of GAPS.
Combining the above two equations, we obtain
Correspondingly, we define some notations for output scheduling. We say that B ij is backlogged at time t, if there 
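Use $toO_{*j}(t_1, t_2)$ and $\widehat{toO}_{*j}(t_1, t_2)$ to represent the total numbers of bits sent from all the crosspoint buffers to $Out_j$ during interval $[t_1, t_2]$ in GAPS and GPS, respectively, i.e.
$$toO_{*j}(t_1, t_2) = \sum_{i} toO_{ij}(t_1, t_2), \qquad \widehat{toO}_{*j}(t_1, t_2) = \sum_{i} \widehat{toO}_{ij}(t_1, t_2)$$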
We have a corresponding version of Lemma 1 for output scheduling as follows.
Lemma 4: The number of bits received by a specific output port in GAPS by time $t$ is larger than or equal to that in GPS by time $t - \frac{L}{R}$, i.e.
$$toO_{*j}(0, t) \ge \widehat{toO}_{*j}\left(0, t - \frac{L}{R}\right)$$
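This indicates that $P^k_{ij}$ arrived at $B_{ij}$ before $t'$ in GAPS, and it is eligible for output scheduling before $t'$. Since $Out_j$ was idle immediately before $t'$, it means that $P^k_{ij}$ finished transmission in output scheduling of GAPS before $t'$. The analysis applies to any packet transmitted in GPS before $t' - \frac{L}{R}$, which means that all packets transmitted in GPS before $t' - \frac{L}{R}$ have finished transmission in output scheduling of GAPS by $t'$. In other words,
$$toO_{*j}(0, t') \ge \widehat{toO}_{*j}\left(0, t' - \frac{L}{R}\right) \quad (34)$$
On the other hand, because $Out_j$ is busy during $[t', t]$ in GAPS, we know that
$$toO_{*j}(t', t) = R(t - t') \ge \widehat{toO}_{*j}\left(t' - \frac{L}{R}, t - \frac{L}{R}\right) \quad (35)$$
Adding (34) and (35), we have $toO_{*j}(0, t) \ge \widehat{toO}_{*j}\left(0, t - \frac{L}{R}\right)$.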
Similarly, we have the corresponding version of Lemma 2 for output scheduling of GAPS.
Lemma 5: For any packet, its actual output start time in GAPS is less than or equal to its virtual finish time in GPS plus $L/R$, i.e. $OS^k_{ij} \le VF^k_{ij} + \frac{L}{R}$.
The proof is similar to that of Lemma 2 but based on Lemma 4, and is omitted. The following theorem shows that GAPS achieves constant performance guarantees.
Theorem 1: At any time, the difference between the numbers of bits transmitted by a flow to the output port in GAPS and GPS is greater than or equal to $-2L$ and less than or equal to $L$, i.e.
$$-2L \le toO_{ij}(0, t) - \widehat{toO}_{ij}(0, t) \le L$$
Proof: Without loss of generality, assume that
and thus P k−1 ij has started transmission by time t in output scheduling of GAPS. As a result
On the other hand, in GPS we have
By (38) and (39), we have
by Lemma 5, we know
Combining (40), (41), and (42)
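Next, we prove $toO_{ij}(0, t) - \widehat{toO}_{ij}(0, t) \le L$. Because a packet $P^k_{ij}$ has to be transmitted to $B_{ij}$ before being sent to $Out_j$, it is obvious that $toO_{ij}(0, t) \le toB_{ij}(0, t)$. In addition, by neglecting propagation delay, we have $\widehat{toO}_{ij}(0, t) = \widehat{toB}_{ij}(0, t)$ in GPS. Thus, by Lemma 3,
$$toO_{ij}(0, t) - \widehat{toO}_{ij}(0, t) \le toB_{ij}(0, t) - \widehat{toB}_{ij}(0, t) \le L$$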
The next theorem gives the delay bounds. For easy analysis of the delay difference lower bound, we assume that the allocated bandwidth
Theorem 2: For any packet P k ij , the difference between its departure time in GAPS and GPS is greater than or equal to
and less than or equal to
Proof: First, we prove OF
R . Because a packet P k ij has to be buffered at B ij before sent to Out j , it is obvious that
Based on Property 1, we know V S
R , and thus we obtain
Next, we prove OF
B. Crosspoint Buffer Size Bound
Crosspoint buffers are expensive on-chip memories, and it is desirable that each crosspoint has only a limited size buffer. To avoid overflow at crosspoint buffers, we would like to find the maximum number of bits buffered at any crosspoint.
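Theorem 3: In GAPS, the maximum number of bits buffered at any crosspoint buffer is upper bounded by $3L$, i.e.
$$toB_{ij}(0, t) - toO_{ij}(0, t) \le 3L$$
Proof: By Lemma 3,
$$toB_{ij}(0, t) - \widehat{toB}_{ij}(0, t) \le L$$
By Theorem 1,
$$\widehat{toO}_{ij}(0, t) - toO_{ij}(0, t) \le 2L$$
Because GPS is a fluid based system, we have $\widehat{toB}_{ij}(0, t) = \widehat{toO}_{ij}(0, t)$ by neglecting the propagation delay. Summing the above equations, we have proved the theorem.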
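To give a feel for what the $3L$ bound means in memory terms, the short Python sketch below evaluates it for the switch size and maximum packet length used later in the simulations (16 × 16, 1500 bytes). The function names are hypothetical, and the total assumes every crosspoint is provisioned for the worst case.

```python
def crosspoint_buffer_bound_bytes(max_packet_bytes):
    """Worst-case occupancy of one crosspoint buffer under the 3L bound."""
    return 3 * max_packet_bytes


def total_crossbar_memory_bytes(n_ports, max_packet_bytes):
    """Memory needed if all N x N crosspoints are sized for the bound."""
    return n_ports * n_ports * crosspoint_buffer_bound_bytes(max_packet_bytes)


per_crosspoint = crosspoint_buffer_bound_bytes(1500)
total = total_crossbar_memory_bytes(16, 1500)
print(per_crosspoint)  # 4500
print(total)           # 1152000
```

The per-crosspoint requirement is independent of the switch size $N$, while the total on-chip memory grows with $N^2$.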
V. SIMULATION RESULTS
We have conducted simulations to verify the analytical results obtained in Section IV and evaluate the effectiveness of GAPS.
In the simulations, we consider a 16 × 16 buffered crossbar switch without speedup. Each input port and output port has bandwidth of 1 Gbps. Since GAPS can directly handle variable length packets, we set packet length to be uniformly distributed between 40 and 1500 bytes [19]. For bandwidth allocation, we use the same model as that in [9] and [15]. The allocated bandwidth $R_{ij}(t)$ of flow $F_{ij}$ at time $t$ is defined by an unbalanced probability $w$ as follows
When $w = 0$, an input port $In_i$ has the same amount of allocated bandwidth $R/N$ at each output port. Otherwise, $In_i$ has more allocated bandwidth at $Out_i$, which is called the hotspot destination. Because each flow is allocated a specific amount of bandwidth, it is necessary to have admission control to avoid over-subscription. Arrival of a flow $F_{ij}$ is constrained by a leaky bucket $(l \times R_{ij}(t), \sigma_{ij})$, where $l$ is the effective load. We set the burst size $\sigma_{ij}$ of every flow to a fixed value of 10,000 bytes, and the burst may arrive at any time during a simulation run. We use two traffic patterns in the simulations. For traffic pattern one, each flow has fixed allocated bandwidth during a single simulation run. $l$ is fixed to 1 and $w$ is one of the 11 possible values from 0 to 1 with a step of 0.1. For traffic pattern two, a flow has variable allocated bandwidth. $l$ is one of the 10 possible values from 0.1 to 1 with a step of 0.1, and for a specific $l$ value, a random permutation of the 11 different $w$ values is used. Each simulation run lasts for 10 seconds.
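The allocation model can be sketched in Python as follows. The exact formula is an assumption: the sketch uses the common form of the unbalanced traffic model, $R_{ij} = R\,(w + \frac{1-w}{N})$ for $i = j$ and $R\,\frac{1-w}{N}$ otherwise, which agrees with the two cases described in the text ($w = 0$ gives $R/N$ everywhere, and $w = 1$ sends all traffic of $In_i$ to $Out_i$). The function names are hypothetical.

```python
def unbalanced_allocation(n_ports, link_rate, w):
    """Allocated bandwidth matrix under an assumed unbalanced model.

    The hotspot flow (i == j) gets link_rate * (w + (1 - w) / n_ports);
    every other flow gets link_rate * (1 - w) / n_ports.
    """
    share = link_rate * (1.0 - w) / n_ports
    return [[link_rate * w + share if i == j else share
             for j in range(n_ports)]
            for i in range(n_ports)]


def is_feasible(alloc, link_rate):
    """Check that no input port (row) or output port (column) is oversubscribed."""
    n = len(alloc)
    limit = link_rate * (1.0 + 1e-9)  # small tolerance for rounding
    rows_ok = all(sum(alloc[i]) <= limit for i in range(n))
    cols_ok = all(sum(alloc[i][j] for i in range(n)) <= limit
                  for j in range(n))
    return rows_ok and cols_ok


uniform = unbalanced_allocation(16, 1e9, 0.0)
hotspot = unbalanced_allocation(16, 1e9, 1.0)
mixed = unbalanced_allocation(16, 1e9, 0.5)
print(uniform[0][0], uniform[0][1])
print(hotspot[0][0], hotspot[0][1])
print(is_feasible(mixed, 1e9))
```

Under this form every row and column sums to $R$ for any $w$ in $[0, 1]$, so the allocation is feasible and fully subscribes each port when $l = 1$.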
A. Service Guarantees
By Theorem 1, we know that the service difference of a flow in GAPS and GPS at any time has a lower bound of −2L and upper bound of L. We look at the simulation data on service guarantees. Figure 3 (a) shows the minimum and maximum service differences among all the flows during the entire simulation run under traffic pattern one. As can be seen, the minimum service difference is always greater than the lower bound. It drops gradually when the unbalanced probability increases. This indicates that when the traffic distribution is more unbalanced, flows tend to transmit less traffic in GAPS than in GPS. Note that when the unbalanced probability becomes one, the minimum service difference jumps suddenly to −1500 bytes. The reason is that when the unbalanced probability is one, all packets of In i go to Out i , and there is no switching necessary. The only difference between GAPS and GPS is that a packet needs to be buffered at the crosspoint buffer in GAPS but not in GPS. Thus, the service difference in the worst case is equal to the maximum packet length. On the other hand, the maximum service difference is always less than but very close to the upper bound. However, when the unbalanced probability becomes one, the maximum service difference drops to a negative value. As analyzed in the above, when the unbalanced probability is one, the only difference between GAPS and GPS is the extra buffering at the crosspoint buffer. As a result, a flow always transmits less traffic in GAPS than in GPS, and the actual maximum service difference depends on the length of the first packet, which is a random number between 40 and 1500. Figure 3(b) shows the minimum and maximum service differences under traffic pattern two. The minimum service difference is always greater than the lower bound, and keeps relatively constant. This indicates that the minimum service difference is not sensitive to the change of effective load. 
The maximum service difference increases steadily with the effective load, but is always less than the upper bound. 
B. Delay Guarantees
Theorem 2 gives the lower bound and upper bound for the delay difference of a packet in GAPS and GPS. In this subsection, we present the simulation data on delay guarantees. Figure 4 (a) shows the minimum, maximum, and average delay differences of a representative flow F 11 under traffic pattern one. Note that the delay difference lower bound in Theorem 2 assumes fixed allocated bandwidth R ij and depends on the packet length L k ij . For easy plotting of the figure, we calculate the delay difference lower bound for all packets of flow F ij as follows
As can be seen, the minimum delay difference almost coincides with the theoretical lower bound, and the maximum delay difference is almost identical to the upper bound. This shows that the theoretical bounds are tight. While the minimum delay difference increases as the unbalanced probability increases, the maximum delay difference is not sensitive to the change of the unbalanced probability. The average delay difference is initially negative. This is reasonable because when the traffic is uniformly distributed, most packets leave earlier than their departure time in GPS. When the unbalanced probability increases, the average delay difference also increases. Note that when the unbalanced probability becomes one, the minimum, maximum, and average delay differences all become $1.2 \times 10^{-5}$ seconds. The explanation is the same as above, and a packet P ...
Figure 4(b) shows the minimum, maximum, and average delay differences of flow $F_{11}$ under traffic pattern two. Because the delay difference lower bound in Theorem 2 assumes fixed allocated bandwidth, the lower bound curve cannot be plotted. We can still see that the maximum delay difference is always less than and very close to the upper bound. Both the minimum and average delay differences are relatively constant, and the average delay difference is always negative, which means that most packets depart earlier in GAPS than in GPS.
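As a quick sanity check on the value $1.2 \times 10^{-5}$ seconds, the Python snippet below computes the time needed to transmit one maximum-length packet (1500 bytes) at the port bandwidth of 1 Gbps used in the simulations. Reading the observed value as $L/R$ is an interpretation suggested by the numbers, not a statement taken from the text; the function name is hypothetical.

```python
def transmission_time_seconds(packet_bytes, link_rate_bps):
    """Time to transmit one packet of the given size at the given line rate."""
    return packet_bytes * 8 / link_rate_bps


max_packet_delay = transmission_time_seconds(1500, 1e9)
print(max_packet_delay)
```

The result equals the extra time a maximum-length packet spends passing through a crosspoint buffer at line rate, which is in line with the explanation that buffering at the crosspoint is the only difference between GAPS and GPS when the unbalanced probability is one.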
C. Crosspoint Buffer Size
Theorem 3 gives the bound of the crosspoint buffer size as 3L. In this subsection, we look at the maximum and average crosspoint buffer occupancies in the simulations. Figure 5(a) shows the maximum and average crosspoint buffer occupancies under traffic pattern one. As can be seen, the maximum occupancy is always smaller than the theoretical bound. It grows as the unbalanced probability increases, but suddenly drops to about 3000 bytes when the unbalanced probability becomes one. The reason is that there will now be at most 2L bits buffered in crosspoint buffers $B_{ii}$. The average occupancy does not change significantly with different unbalanced probabilities. It drops to about 100 bytes when the unbalanced probability becomes one. This is because only crosspoint buffers $B_{ii}$ are now used, and the remaining crosspoint buffers are empty. We find that the average occupancy is affected more by the load than by the unbalanced probability. Figure 5(b) shows the maximum and average crosspoint buffer occupancies under traffic pattern two. We can see that the maximum occupancy increases as the load increases, but does not exceed the theoretical bound. On the other hand, the average occupancy does not change much and is smaller than 150 bytes before the load increases to one. This also confirms the previous observation that the average occupancy is determined by the effective load.
D. Throughput
Next, we present the simulation data on throughput. Figure 6(a) shows the throughput under traffic pattern one. We can see that the throughput for all unbalanced probabilities is greater than 99.99%, which demonstrates that GAPS practically achieves 100% throughput. Figure 6(b) shows the throughput under traffic pattern two. As can be seen, the throughput grows consistently with the effective load, and finally reaches one. 
VI. CONCLUSIONS
Recent developments in VLSI technology have made buffered crossbar switches feasible, and they demonstrate unique advantages over traditional unbuffered crossbar switches. The current emulation approach for buffered crossbar switches to provide performance guarantees has difficulty in providing tight constant performance guarantees, because of its inability to emulate WF²Q. To address the issue, we have presented in this paper the guaranteed-performance asynchronous packet scheduling (GAPS) algorithm for buffered crossbar switches. GAPS requires no speedup, and directly handles variable length packets without segmentation and reassembly (SAR). Different input ports and output ports conduct scheduling independently without any data exchange. We show by theoretical analysis that GAPS achieves constant performance guarantees. In addition, we prove that GAPS has a crosspoint buffer size bound of 3L. Finally, we present simulation data to verify the analytical results and evaluate the effectiveness of GAPS.
