Abstract: Buffered crossbar switches are special crossbar switches with an exclusive buffer at each crosspoint. They demonstrate unique advantages over traditional unbuffered crossbar switches, such as asynchronous scheduling and variable length packet handling. However, since crosspoint buffers are expensive on-chip memories, it is desirable that each crosspoint have only a small buffer. In this paper, we propose a scheduling algorithm called Fair Asynchronous Segment Scheduling (FASS) for buffered crossbar switches, which reduces the crosspoint buffer size by dividing packets into shorter segments before transmission. FASS also provides tight constant performance guarantees by emulating the ideal Generalized Processor Sharing (GPS) model. Furthermore, FASS requires no speedup for the crossbar, lowering the hardware cost and improving the switch capacity. By theoretical analysis, we prove that FASS is strongly stable and therefore achieves 100% throughput. We also calculate the size bound for the crosspoint buffers. Moreover, we show that FASS provides bounded delay guarantees. Finally, we present simulation data to verify the analytical results.
I. INTRODUCTION
Buffered crossbar switches have recently attracted considerable attention as the next generation of high speed interconnects. They are a special type of crossbar switch with an exclusive buffer at each crosspoint of the crossbar, which has become feasible as advances in modern VLSI technology allow miniaturized on-chip memories to be integrated. Crosspoint buffers relax port contention and greatly simplify the scheduling process. Buffered crossbar switches have thus demonstrated significant advantages over traditional unbuffered crossbar switches, such as asynchronous scheduling and variable length packet handling [1] [2] [3] [4] [5].
However, crosspoint buffers are expensive on-chip memories, and the total crosspoint buffer size grows with the square of the switch size, i.e., N² crosspoint buffers for an N×N switch. For buffered crossbar switches to be practical, each crosspoint should have only a small buffer, which is one of the main motivations of our work. In addition, the crosspoint buffer size depends on the maximum packet length. For packets in the Internet, although the maximum IP packet length is 1,500 bytes [6], about 60% of all packets are shorter than 64 bytes, including TCP ACK and TCP control packets [7]. This indicates that even if we provision large crosspoint buffers based on the maximum packet length, they cannot be efficiently utilized. To address the issue, we propose in this paper a segment scheduling algorithm, which divides a packet into shorter segments before transmission. The maximum segment length can be arbitrarily small (in theory), leading to arbitrarily small crosspoint buffers. Note that our segmentation scheme is different from that for traditional unbuffered crossbar switches [8], because there are no padding bits for the last segment of a packet and thus no waste of bandwidth.
Another motivation of our work is to provide tight constant performance guarantees for buffered crossbar switches. The emulation of Push-In-First-Out (PIFO) Output Queued (OQ) switches is the main approach in the literature for crossbar switches to provide performance guarantees [9] [10] [11]. However, there are three main drawbacks with this approach. First, it has difficulty in providing tight performance guarantees, because it cannot emulate Worst-case Fair Weighted Fair Queueing (WF²Q) [12], which is the only known fair queueing algorithm achieving constant performance guarantees [13]. Second, the emulation approach requires the switches to have a speedup of at least two, which means that the crossbar needs to run twice as fast as the input and output ports. The speedup requirement increases the implementation cost and reduces the switch capacity. Third, the bandwidth allocation is not practical, because it does not consider bandwidth constraints at input ports, while flows may oversubscribe input ports [15]. In this work, we focus on addressing the first two drawbacks. Specifically, we use existing bandwidth allocation algorithms [16] [17] to calculate a fair bandwidth allocation, and design a scheduling scheme that ensures the allocated bandwidth of each flow and achieves tight performance guarantees. In addition, our scheduling scheme requires no speedup for the crossbar.
In this paper, we propose a distributed scheduling scheme, called Fair Asynchronous Segment Scheduling (FASS), for buffered crossbar switches without speedup to achieve constant performance guarantees with reduced crosspoint buffers. First, we present a segmentation-and-reassembly (SAR) scheme to divide packets into short segments before transmission, so as to correspondingly reduce the crosspoint buffer size. We then propose the FASS algorithm to schedule segment transmissions, which uses a time stamp based approach to emulate the ideal Generalized Processor Sharing (GPS) [18] model and provides tight performance guarantees. By theoretical analysis, we prove that FASS is strongly stable and therefore achieves 100% throughput. We also calculate the size bound for the crosspoint buffers. Moreover, we show that FASS provides bounded delay guarantees. Finally, we conduct simulations to verify the analytical results and measure the performance of FASS.
978-1-4244-5638-3/10/$26.00 ©2010 IEEE
II. FAIR ASYNCHRONOUS SEGMENT SCHEDULING (FASS)
In this section, we present the Fair Asynchronous Segment Scheduling (FASS) algorithm for buffered crossbar switches.
A. Packet Segmentation and Reassembly
As mentioned earlier, one of the motivations of our work is to make buffered crossbar switches practical by reducing crosspoint buffers. We present a packet segmentation-and-reassembly (SAR) scheme for this purpose. After a packet arrives at the input port, it is divided into segments before transmission if it is longer than a threshold, i.e., the maximum segment length. The maximum segment length can be arbitrarily small in theory, and a smaller maximum segment length leads to a smaller crosspoint buffer size. The segments are used as the scheduling and transmission units. After they arrive at the output port, they are reassembled into the original packet before being delivered to the output line.
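The SAR scheme described above can be illustrated with a minimal sketch. The function names `segment_packet` and `reassemble` and the byte-string packet representation are assumptions made for illustration; the only properties taken from the text are that a packet longer than the maximum segment length is split, and that the last segment carries no padding.

```python
from typing import List


def segment_packet(packet: bytes, max_segment_len: int) -> List[bytes]:
    """Split a packet into segments of at most max_segment_len bytes.

    The last segment is left short rather than padded, so no bandwidth
    is spent on filler bits.
    """
    if max_segment_len <= 0:
        raise ValueError("max_segment_len must be positive")
    if len(packet) <= max_segment_len:
        # Short packets are sent as a single segment.
        return [packet]
    return [
        packet[start:start + max_segment_len]
        for start in range(0, len(packet), max_segment_len)
    ]


def reassemble(segments: List[bytes]) -> bytes:
    """Concatenate in-order segments back into the original packet."""
    return b"".join(segments)


# Example: a 1,500-byte packet with a 64-byte maximum segment length.
example_packet = bytes(1500)
example_segments = segment_packet(example_packet, 64)
```

In this example the packet yields 23 full segments plus one 28-byte tail segment, and the total number of bytes sent equals the packet length.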
B. Switch Architecture
The considered switch architecture includes N input ports and N output ports, connected by a buffered crossbar without internal speedup. Buffers are located at three possible bottlenecks: input ports, output ports, and crosspoints of the crossbar. Let In i denote the i-th input port and Out j denote the j-th output port. Each input port, each output port, and the crossbar has an available bandwidth of R. Each input port has a buffer that stores arriving packets according to their destination output ports using Virtual Output Queues (VOQs) [4]. VOQs avoid the head-of-line (HOL) problem [6], which limits the maximum throughput of the switch. Denote the virtual queue at In i for packets destined to Out j as Q ij . Each crosspoint is equipped with an exclusive buffer, represented by B ij , that connects In i and Out j . Each output port has a buffer that stores received segments according to their source input ports using Virtual Input Queues (VIQs) [7]. VIQs are used to reassemble segments into the original packets before delivery to the output line.
C. Algorithm Description
FASS has two types of scheduling, called input scheduling and output scheduling. In input scheduling, an input port selects a segment from one of its N input queues and sends it to the corresponding crosspoint buffer. In output scheduling, an output port selects a segment from one of its N crosspoint buffers and retrieves it to the corresponding output queue.
We use the notation "I-O" to differentiate the algorithms for input scheduling and output scheduling, where "I" is the scheduler for input scheduling and "O" is the scheduler for output scheduling. "I" and "O" can each be either FASS or GPS. If the scheduler for output scheduling does not matter, we use a * mark for "O". For example, FASS-GPS means that FASS is used for input scheduling and GPS for output scheduling. Note that GPS serves as the ideal fairness model: we compare the service a flow receives under our algorithm with the service it receives under GPS.
Define the traffic from In i to Out j to be a flow F ij . Use r ij (t) to represent the allocated bandwidth of F ij at time t, which is calculated by specific bandwidth allocation algorithms [16] [17]. The calculated bandwidth should be feasible, i.e., there is no oversubscription at any input or output port:
$\forall i,\ \sum_{j=1}^{N} r_{ij}(t) \le R$ and $\forall j,\ \sum_{i=1}^{N} r_{ij}(t) \le R$.
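The feasibility condition can be checked mechanically. The following is a minimal sketch, assuming the allocation at a given time is available as an N×N matrix `r` (a list of rows) and that "no oversubscription" means every row sum and every column sum is at most the port bandwidth R; the function name `is_feasible` and the tolerance `eps` are illustrative assumptions.

```python
from typing import Sequence


def is_feasible(r: Sequence[Sequence[float]], R: float, eps: float = 1e-9) -> bool:
    """Return True if no input port (row) or output port (column) is oversubscribed.

    r[i][j] is the bandwidth allocated to flow F_ij; R is the port bandwidth.
    eps absorbs floating-point rounding in the sums.
    """
    n = len(r)
    inputs_ok = all(sum(r[i][j] for j in range(n)) <= R + eps for i in range(n))
    outputs_ok = all(sum(r[i][j] for i in range(n)) <= R + eps for j in range(n))
    return inputs_ok and outputs_ok
```

For example, on a 2×2 switch with R = 1, the uniform allocation with every r ij = 0.5 is feasible, whereas an allocation whose first row sums to 1.1 is not.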
To avoid input buffer overflow, input ports have admission control for each flow based on its allocated bandwidth. We use an extended leaky bucket for the admission control [6] , which has been discussed in detail in [10] .
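As an illustration of per-flow admission control, the sketch below implements a plain (r, σ) leaky bucket in its token-bucket form. It is a simplification: the extended leaky bucket referred to above is not specified here, and the class name `LeakyBucket`, the method `admit`, and the choice of bits and seconds as units are all assumptions.

```python
class LeakyBucket:
    """Plain (rate, burst) leaky bucket; a simplified stand-in for admission control."""

    def __init__(self, rate_bits_per_s: float, burst_bits: float) -> None:
        self.rate = rate_bits_per_s
        self.burst = burst_bits
        self.tokens = burst_bits  # the bucket starts full
        self.last_time = 0.0

    def admit(self, now: float, length_bits: float) -> bool:
        """Admit a packet of length_bits arriving at time now (non-decreasing)."""
        # Refill tokens for the elapsed time, capped at the burst size.
        elapsed = now - self.last_time
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_time = now
        if length_bits <= self.tokens:
            self.tokens -= length_bits
            return True
        return False


# Example: a flow allocated 1,000 bit/s with a 500-bit burst allowance.
bucket = LeakyBucket(rate_bits_per_s=1000.0, burst_bits=500.0)
first = bucket.admit(0.0, 500.0)    # consumes the whole initial burst
second = bucket.admit(0.0, 1.0)     # nothing left at the same instant
third = bucket.admit(0.25, 250.0)   # 0.25 s of refill covers 250 bits
```

With this policy, the traffic admitted for a flow over any interval of length T is at most r·T + σ, which is the property the stability analysis relies on.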
Input scheduling and output scheduling of FASS rely on only local information, and are conducted in an asynchronous and distributed manner. To be specific, an input port needs only the statuses of the queues in its input buffer, and does not exchange information with any crosspoint buffer or output port. Similarly, an output port needs only the statuses of its crosspoint buffers.
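To make the idea of a purely local, time stamp based decision concrete, the sketch below orders the segments queued at one input port by a per-flow virtual finish time. It uses virtual-clock style stamps (start = max(arrival, previous finish of the same flow), finish = start + length / rate) as a simple stand-in; this is not the FASS time stamp rule, and the function name `schedule_order` and the tuple layout are assumptions.

```python
from typing import Dict, Hashable, List, Tuple

Segment = Tuple[float, Hashable, float]  # (arrival_time, flow_id, length_bits)


def schedule_order(
    segments: List[Segment], rates: Dict[Hashable, float]
) -> List[Tuple[Hashable, float, float]]:
    """Order segments by virtual finish time, using only local queue state.

    segments must be listed in arrival order. Returns (flow_id, length, finish)
    tuples sorted by finish time; ties keep arrival order.
    """
    last_finish: Dict[Hashable, float] = {}
    stamped = []
    for seq, (arrival, flow, length) in enumerate(segments):
        # A segment starts after its arrival and after the flow's previous segment.
        start = max(arrival, last_finish.get(flow, 0.0))
        finish = start + length / rates[flow]
        last_finish[flow] = finish
        stamped.append((finish, seq, flow, length))
    stamped.sort(key=lambda item: (item[0], item[1]))
    return [(flow, length, finish) for finish, _seq, flow, length in stamped]


# Example: flow "a" is allocated twice the bandwidth of flow "b".
order = schedule_order(
    [(0.0, "a", 4.0), (0.0, "a", 4.0), (0.0, "b", 2.0)],
    {"a": 2.0, "b": 1.0},
)
```

The computation reads only the queues and allocated rates at the port itself, which mirrors the point that no information is exchanged with crosspoint buffers or output ports.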
We first explain input scheduling. For easy presentation, let P ijk represent the k th arrived packet of F ij and S 
where arv(in ijk ) is the arrival time of P ijk at the input port, and V IF
Lemma 2: During the time interval [0, t], the difference between the number of bits sent from an input port In i to a crosspoint buffer B ij in FASS-* and in GPS-* is at most l, where l is the maximum segment length, i.e., $|toB^{FASS}_{ij}(0,t) - toB^{GPS}_{ij}(0,t)| \le l$.
A. Switch Stability and Throughput
We prove that FASS achieves strong stability by showing that the lengths of the input virtual queues are finite.
Let X(t) be the vector of queue lengths at time t, and let Q ij (t) denote the occupancy of the virtual queues, such that X(t) = (Q 11 (t), Q 12 (t), ..., Q ij (t), ..., Q NN (t)). We follow the definitions in [21] and study the strong stability of our scheme, which implies 100% throughput [11] [20].
Definition 1: $\|X(t)\|$ is called the Euclidean norm of vector X(t), i.e., $\|X(t)\| = \left(Q_{11}^2(t) + Q_{12}^2(t) + \dots + Q_{NN}^2(t)\right)^{1/2}$.
Definition 2: A system of queues is strongly stable if $\limsup_{t \to \infty} E[\|X(t)\|] < \infty$.
The intuitive explanation is that segments belonging to packets of flow F ij arrive and depart at the same rate, so they do not accumulate without bound at Q ij , B ij , or O ij .
Theorem 1: FASS is strongly stable when flows are leaky bucket compliant, i.e., FASS provides 100% throughput.
Proof: Assume that flow F ij is leaky bucket (r ij (t), σ ij ) compliant [6], where σ ij is the burst size of F ij . Also assume that Q ij is empty at s and that [s, t] is the last continuously backlogged period before t. This indicates that all segments belonging to packets of F ij that arrived at Q ij before s have finished transmission by s in GPS-*, and that the next packet has not arrived yet. The complete proof is given in [14]; the rest of the proof is omitted here due to space restrictions.
B. Crosspoint Buffer Size Bound
To avoid overflow at crosspoint buffers, we would like to find the maximum number of bits buffered at any crosspoint.
Theorem 2: In FASS-FASS, the maximum number of bits buffered at any crosspoint buffer is upper bounded by four times the maximum segment length, i.e., $toB^{FASS}_{ij}(0,t) - toO^{FASS}_{ij}(0,t) \le 4l$.
Proof: By Lemma 4, $toO^{FASS}_{ij}(0,t) + l \ge toO^{GPS}_{ij}(0,t)$; based on Lemma 5, $toO^{GPS}_{ij}(0,t) + 2l \ge toB^{GPS}_{ij}(0,t)$; and according to Lemma 2, $toB^{GPS}_{ij}(0,t) + l \ge toB^{FASS}_{ij}(0,t)$.
Summing the above inequalities gives $toO^{FASS}_{ij}(0,t) + 4l \ge toB^{FASS}_{ij}(0,t)$, which proves the theorem.
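The bound of four maximum segment lengths per crosspoint translates directly into a total on-chip memory budget, since an N×N switch has N² crosspoints. The helper below is a back-of-the-envelope sketch; the function name `crosspoint_memory_bytes` and the use of bytes as the unit are assumptions.

```python
def crosspoint_memory_bytes(n_ports: int, max_segment_len_bytes: int) -> int:
    """Total crosspoint memory if every buffer is sized at the 4*l bound."""
    per_crosspoint = 4 * max_segment_len_bytes
    return n_ports * n_ports * per_crosspoint


# Example: a 16x16 switch with 1,500-byte versus 100-byte maximum segments.
without_segmentation = crosspoint_memory_bytes(16, 1500)  # 1,536,000 bytes
with_segmentation = crosspoint_memory_bytes(16, 100)      # 102,400 bytes
```

Reducing the maximum segment length from 1,500 to 100 bytes shrinks the total crosspoint memory by the same factor of 15.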
C. Delay Guarantees (Jitter)
In this subsection, we show that FASS-FASS can provide bounded delay guarantees. A packet can depart once its last segment has arrived at the reassembly buffer of the output port, i.e., the departure time of a packet is equal to the departure time of its last segment. For ease of analysis, we assume that the allocated bandwidth r ij (t) of F ij is a constant r ij during interval [min(AIS 
Proof: We first prove the left side inequality as
Based on the Property 1, we know that V IS 
Next, we prove AOD
IV. SIMULATION RESULTS
We have performed simulations to evaluate the performance of FASS and verify the analytical results.
In our simulation, we consider a 16×16 buffered crossbar switch without speedup. Each input port and output port has a bandwidth of 1 Gbps. Since FASS is capable of handling variable length packets, we set the packet length in the range of [40, 1500] bytes. We use the same model as in [19] for the bandwidth allocation. This model defines an allocated bandwidth r ij (t) of a flow F ij at time t by applying an unbalanced probability w, i.e., 0 ≤ w ≤ 1, as follows
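A minimal sketch of such an allocation follows, assuming the widely used unbalanced form in which a fraction w of each input's bandwidth goes to the output with the same index and the remaining (1 − w) is spread uniformly over all N outputs. This form, the function name `unbalanced_allocation`, and the normalization R = 1 in the example are assumptions, not taken from the text.

```python
from typing import List


def unbalanced_allocation(n_ports: int, w: float, R: float = 1.0) -> List[List[float]]:
    """Build an N x N allocation matrix under the assumed unbalanced model.

    r[i][j] = R * (w + (1 - w) / N) if i == j, else R * (1 - w) / N.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    uniform_share = R * (1.0 - w) / n_ports
    return [
        [uniform_share + (R * w if i == j else 0.0) for j in range(n_ports)]
        for i in range(n_ports)
    ]


# Example: 4 ports, half of each input's bandwidth pinned to the same-index output.
alloc = unbalanced_allocation(4, 0.5)
```

Under this assumed form every row and column sums to R, so the allocation is feasible for any w; w = 0 gives uniform traffic, and w = 1 sends all traffic of In i to Out i.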
To constrain the burstiness of a flow F ij , we consider a leaky bucket (η × r ij , σ ij ), where η is the effective load and σ ij is the burst size of F ij . We set σ ij of every flow to a fixed value of 10,000 bytes, and the burst may arrive at any time during a simulation run. We use two traffic patterns in the simulations. In the first pattern, each flow has a fixed allocated bandwidth during a single simulation run: η is fixed to 1, and w takes one of the 11 values in [0, 1] with a step of 0.1. In the second pattern, a flow has a variable allocated bandwidth: η takes one of the 10 values in [0.1, 1] with a step of 0.1, and for each η value, a random permutation of the 11 different w values is used. We set the initial value of the crosspoint buffer size to 40 bytes and then adjust it from 100 to 1,500 bytes with a step of 100.
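The parameter sweep described above can be enumerated as follows. This is only a sketch of the grid of values; the variable names are assumptions, and the simulator itself is not shown.

```python
# Unbalanced probability w: 11 values in [0, 1] with a step of 0.1.
w_values = [round(0.1 * k, 1) for k in range(11)]

# Effective load eta for traffic pattern two: 10 values in [0.1, 1].
eta_values = [round(0.1 * k, 1) for k in range(1, 11)]

# Sizes in bytes: an initial value of 40, then 100 to 1,500 in steps of 100.
size_values_bytes = [40] + list(range(100, 1501, 100))

# Burst size of every flow, in bytes.
burst_size_bytes = 10_000

# Traffic pattern one fixes eta = 1 and sweeps w for every size.
pattern_one_runs = [(1.0, w, size) for w in w_values for size in size_values_bytes]
```

Pattern one therefore covers 11 × 16 = 176 combinations of unbalanced probability and size.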
A. Throughput
To verify Theorem 1, we present simulation data showing that our scheme achieves 100% throughput. Figure 1(a) illustrates the throughput under traffic pattern one with different segment sizes. The throughput for all unbalanced probabilities is greater than 99.99%, which demonstrates that FASS practically achieves 100% throughput. The throughput decreases slightly as the segment size increases, because within the same simulation time, the last segment is less likely to finish transmission when segments are larger. Figure 1(b) depicts the throughput under traffic pattern two. As can be seen, the throughput grows consistently with the effective load, independent of the segment size, and finally reaches 100% when the effective load becomes 1. We conclude that the segment size has no significant impact on the throughput performance.
B. Crosspoint Buffer Size Bound
Theorem 2 gives the upper bound of the crosspoint buffer size as 4l bytes to avoid overflow. Figure 2(a) shows the maximum crosspoint buffer occupancy under traffic pattern one. The maximum occupancy is always smaller than the theoretical bound. The occupancy increases in proportion to the segment size and grows as the unbalanced probability increases, but suddenly drops to about l bytes when the unbalanced probability becomes 1. This is because, at this point, all packets of In i go to Out i , and hence no switching is necessary. Figure 2(b) shows the maximum crosspoint buffer occupancy under traffic pattern two. The maximum occupancy increases as the load and the segment size grow, but never exceeds the theoretical bound.
C. Delay Guarantees (Jitter)
Finally, we present the simulation data on jitter, which is the difference between the departure time of a packet in FASS and in GPS. Theorem 3 gives the lower bound and upper bound for the jitter as −l Figure 3(a) shows the maximum and minimum jitter of a representative flow F 11 under traffic pattern one. Since Theorem 3 assumes a fixed allocated bandwidth r ij , the jitter depends on the segment size l m ijk . As can be seen, the minimum jitter almost coincides with, but is always greater than or equal to, the theoretical lower bound. The maximum jitter is always less than the theoretical upper bound and increases slightly as the segment size grows. However, the jitter suddenly drops when the unbalanced probability becomes one, as well as when the segment size reaches 1,500 bytes. Figure 3(b) shows the maximum and minimum jitter of a representative flow F 11 under traffic pattern two. As can be seen, the maximum jitter is always less than, and close to, the upper bound. It increases slightly as the segment size grows, and jumps when the effective load becomes one. The minimum jitter decreases when the segment size increases, but is always greater than the lower bound. Negative jitter means that most packets in FASS depart earlier than in GPS.
V. CONCLUSIONS
We have proposed the Fair Asynchronous Segment Scheduling (FASS) algorithm for buffered crossbar switches. The main features of FASS can be summarized as follows. First, FASS reduces the crosspoint buffer size by dividing packets into segments before transmission. It needs no padding bits and thus does not waste bandwidth. Second, FASS provides tight constant performance guarantees by closely emulating the ideal GPS model. Third, FASS requires no speedup for the crossbar, reducing implementation cost and improving switch capacity. By theoretical analysis, we prove that FASS is strongly stable and therefore achieves 100% throughput. We also calculate the size bound for the crosspoint buffers. Moreover, we show that FASS provides bounded delay guarantees. Finally, we have performed simulations, and the collected data are consistent with the analytical results.
