Strong Performance Guarantees for Asynchronous Buffered Crossbar Schedulers by Turner, Jonathon
Washington University in St. Louis 
Washington University Open Scholarship 
All Computer Science and Engineering 
Research Computer Science and Engineering 
Report Number: WUCSE-2007-52 
2007 
Strong Performance Guarantees for Asynchronous Buffered 
Crossbar Schedulers 
Jonathon Turner 
Recommended Citation 
Turner, Jonathon, "Strong Performance Guarantees for Asynchronous Buffered Crossbar Schedulers" 
Report Number: WUCSE-2007-52 (2007). All Computer Science and Engineering Research. 
https://openscholarship.wustl.edu/cse_research/152 
Department of Computer Science & Engineering - Washington University in St. Louis 
Campus Box 1045 - St. Louis, MO - 63130 - ph: (314) 935-6160. 
This technical report is available at Washington University Open Scholarship:
https://openscholarship.wustl.edu/cse_research/152
Authors: Jonathan Turner
Corresponding Author: jon.turner@wustl.edu
Web Page: http://www.arl.wustl.edu/~jst
Notes:
Supported in part by the National Science Foundation (grant # 0325291). Any opinions, findings or
Type of Report: Other
Strong Performance Guarantees
for Asynchronous Buffered
Crossbar Schedulers
Jonathan Turner
September 8, 2008
Abstract
Crossbar-based switches are commonly used to implement routers with
throughputs up to about 1 Tb/s. The advent of crossbar scheduling algorithms
that provide strong performance guarantees now makes it possible to engineer
systems that perform well, even under extreme traffic conditions. Until
recently, such performance guarantees have only been developed for crossbars
that switch cells rather than variable length packets. Cell-based crossbars incur
a worst-case bandwidth penalty of up to a factor of two, since they must
fragment variable length packets into fixed length cells. In addition, schedulers
for cell-based crossbars may fail to deliver the expected performance guarantees
when used in routers that forward packets. We show how to obtain performance
guarantees for asynchronous crossbars that are directly comparable to
those previously developed for synchronous, cell-based crossbars. In particular,
we define derivatives of the Group by Virtual Output Queue (GVOQ) scheduler
of Chuang et al. and the Least Occupied Output First scheduler of Krishna et
al. and show that both can provide strong performance guarantees in systems
with speedup 2. Specifically, we show that these schedulers are work-conserving
and that they can emulate an output-queued switch using any queueing
discipline in the class of restricted Push-In, First-Out queueing disciplines. We
also show that there are schedulers for segment-based crossbars (introduced
recently by Katevenis and Passas) that can deliver strong performance guarantees
with small buffer requirements and no bandwidth fragmentation.
1 Introduction
Crossbar switches have long been a popular choice for transferring data from inputs
to outputs in mid-range performance switches and routers [1]. Unlike bus-based
switches, crossbars can provide throughputs approaching 1 Tb/s, while allowing
individual line cards to operate at speeds comparable to the external links.
However, the control of high performance crossbars is challenging, requiring crossbar
schedulers that match inputs to outputs in the time it takes for a minimum length
packet to be forwarded. The matching selected by the scheduler has a major influence
on system performance, placing a premium on algorithms that can produce high
quality matchings in a very short period of time.
Traditionally, crossbar schedulers have been evaluated largely on the basis of how
they perform on random traffic arrival patterns that do not cause long term overloads
at inputs or outputs. Most often, such evaluations have been carried out using
simulation [14]. Recently, there has been a growing body of work providing rigorous
performance guarantees for such systems [11, 15] in the context of well-behaved,
random traffic. A separate thread of research concentrates on schedulers that can
provide strong performance guarantees that apply to arbitrary traffic patterns
[3, 8, 18], including adversarial traffic that may overload some outputs for extended
periods of time. The work reported here belongs to this second category. Since the
internet lacks comprehensive mechanisms to manage traffic, extreme traffic conditions
can occur in the internet due to link failures, route changes or simply unusual
traffic conditions. For these reasons, we argue that it is important to understand how
systems perform when they are subjected to such extreme conditions. Moreover, we
argue that strong performance guarantees are desirable in backbone routers, if they
can be obtained at an acceptable cost.
There are two fundamental properties that are commonly used to evaluate crossbar
schedulers in this worst-case sense. A scheduler is said to be work-conserving if an
output link is kept busy so long as there are packets addressed to the output, anywhere
in the system. A scheduler is said to be order-preserving if it is work-conserving and
it always forwards packets in the order in which they arrived. A crossbar with an
order-preserving scheduler faithfully emulates an ideal nonblocking switch with FIFO
output queues. In their seminal paper, Chuang, et al. provided the first example of an
order-preserving scheduler [3] for a crossbar with small speedup, where the speedup
of a crossbar switch is the ratio of its ideal throughput to the total capacity of its
external links. So a crossbar with a speedup of S has the potential to forward data
S times faster than the input links can supply it. In fact, Chuang, et al. showed a
stronger property; that certain schedulers can be specialized to emulate an output
queued switch that implements any one of a large class of scheduling algorithms at
the outputs.
Until recently, strong performance guarantees have been available only for crossbars
that forward fixed length cells. There is a sound practical justification for
concentrating on such systems, since routers commonly use cell-based crossbars. Variable
length packets are received at input line cards, segmented into fixed length cells for
transmission through the crossbar and reassembled at the output line cards. This
simplifies the implementation of the crossbar and allows for synchronous operation,
which allows the scheduler to make better decisions than would be possible with
asynchronous operation. Unfortunately, cell-based crossbar schedulers that deliver
strong performance guarantees when viewed from the edge of the crossbar can fail
to deliver those guarantees for the router as a whole. For example, a system using
a work-conserving cell-based scheduler can fail to keep an outgoing link busy, even
when there are complete packets for that output present in the system.
We show that strong performance guarantees can be provided for packets, using
asynchronous crossbars that directly handle packets, rather than cells, if the crossbars
are equipped with a moderate amount of internal buffer space. Specifically, we define
packet-oriented derivatives of the Group by Virtual Output Queue algorithm (GVOQ)
of [3] and the Least Occupied Output First Algorithm (LOOFA) of [8, 18] and show
that they can deliver strong performance guarantees for systems with a speedup of 2.
Because our crossbar schedulers operate asynchronously, we have had to develop new
methods for analyzing their performance. These methods now make it possible to
evaluate asynchronous crossbars in a way that is directly comparable to synchronous
crossbars.
The use of buffered crossbars is not new. An early ATM switch from Fujitsu used
buffered crossbars, for example [17]. However, most systems use unbuffered crossbars,
because the addition of buffers to each of the n² crosspoints in an n × n crossbar has
been viewed as prohibitively expensive. There has recently been renewed interest in
buffered crossbars [4, 6, 9, 10, 12, 16, 19, 20]. A recent paper by Chuang et al. [4]
advocates the use of buffers in cell-based crossbars in order to reduce the complexity
of the scheduling algorithms. The authors argue that ongoing improvements in
electronics now make it feasible to add buffering to a crossbar, without requiring an
increase in the number of integrated circuit components. Hence, the cost impact of
adding buffering is no longer a serious obstacle. Our results add further weight to the
case for buffered crossbars, as the use of buffering allows inputs and outputs to operate
independently and asynchronously, allowing variable length packets to be handled
directly. Katevenis et al. [9, 10] have also advocated the use of buffered crossbars for
variable length packets and have demonstrated their feasibility by implementing a
32-port buffered crossbar with 2 KB buffers at each crosspoint.
Section 2 discusses the differences between switching cells and switching packets,
and explains how buffered crossbars are particularly advantageous for systems that
directly switch packets. Section 3 defines the terminology and notation used in the
analysis to follow. Section 4 collects several key lemmas that are used repeatedly in
the analysis. Section 5 presents strong performance guarantees for a packet variant of
the Group by Virtual Output Queue crossbar scheduler. Section 6 presents a similar
set of guarantees for a packet variant of the Least Occupied Output First scheduler.
Section 7 explains how our asynchronous crossbar scheduling algorithms can be used
in systems that switch variable length segments rather than cells, reducing the amount
of memory required by crossbar buffers by more than an order of magnitude. Finally,
Section 8 provides some closing remarks, including a discussion of several ways this
work can be extended.
Figure 1: Buffered Crossbar (diagram labels: Virtual Output Queues, Crosspoint Buffers, Output Queues)
2 Switching Packets vs. Cells
As noted in the introduction, most crossbar-based routers segment packets into cells
at input line cards, before forwarding them through the crossbar to output line cards,
where they are reassembled into packets. This enables synchronous operation, allowing
the crossbar scheduler to make decisions involving all inputs and outputs at one
time.
Unfortunately, cell-based crossbars have some drawbacks. One is simply the added
complication of segmentation and reassembly. More seriously, the segmentation of
packets into cells can lead to degraded performance if the incoming packets cannot
be efficiently packed into fixed length cells. In the worst case, arriving packets may
be slightly too large to fit into a single cell, forcing the input line cards to forward
them in two cells. This effectively doubles the bandwidth that the crossbar requires in
order to handle worst-case traffic. While one can reduce the impact of this problem by
allowing parts of more than one packet to occupy the same cell, this adds complexity
and does nothing to improve performance in the worst case.
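The worst-case penalty described above can be checked with a little arithmetic. The following is a minimal sketch; the function names and the 64-byte cell size are assumptions chosen for illustration and do not come from this report.

```python
def cells_needed(packet_len: int, cell_len: int) -> int:
    """Number of fixed-length cells needed to carry one packet (integer ceiling)."""
    return -(-packet_len // cell_len)


def crossbar_load_factor(packet_len: int, cell_len: int) -> float:
    """Bytes carried through the crossbar (padding included) per byte of packet."""
    return cells_needed(packet_len, cell_len) * cell_len / packet_len


# A packet that exactly fills a cell incurs no penalty.
print(crossbar_load_factor(64, 64))
# A packet one byte longer than a cell needs two cells, so the crossbar
# carries almost twice as many bytes as the packet contains.
print(crossbar_load_factor(65, 64))
```

As the packet length approaches the cell length from above, the load factor approaches 2, which is the factor-of-two penalty referred to in the text.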
In addition, crossbar schedulers that operate on cells, without regard to packet
boundaries, can fail to deliver the expected guarantees from the perspective of the
system as a whole. In a system that uses a cell-based crossbar scheduler, an output
line card can typically begin transmission of a packet on its outgoing link only after all
cells of the packet have been received. Consider a scenario in which n input line cards
receive packets of length L at time t, all addressed to the same output. If the length
of the cell used by the crossbar is C, each packet must be segmented into ⌈L/C⌉ cells
for transmission through the fabric. A crossbar scheduler that operates on cells has
no reason to prefer one input over another. Assuming that it forwards cells from each
input in a fair fashion, at least n(⌈L/C⌉ − 1) cells must pass through the crossbar
before the output line card has a complete packet that it can forward on the output
link. While some delay between the arrival of a packet and its transmission on the
output link is unavoidable, delays that are substantially longer than the time it takes
to receive a packet on the link are clearly undesirable. In this situation, the delay is
about n times larger than the time taken for the packet to be received. Interestingly,
one can obtain strong performance guarantees for packets using cell-based schedulers
that are packet-aware. We discuss this in Section 7.
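The cell count in this scenario can be reproduced with a small simulation. The sketch below assumes that "fair" means strict round-robin over the n inputs; that choice, the function name, and the example sizes are illustrative assumptions and not part of the report.

```python
def cells_before_first_packet(n: int, packet_len: int, cell_len: int) -> int:
    """Cells that cross the fabric before any packet is complete at the output,
    when n inputs each hold one packet and cells are served round-robin."""
    cells_per_packet = -(-packet_len // cell_len)  # integer ceiling of L/C
    sent = [0] * n       # cells forwarded so far from each input
    forwarded = 0        # total cells forwarded so far
    while True:
        for i in range(n):
            sent[i] += 1
            forwarded += 1
            if sent[i] == cells_per_packet:
                # This cell completes the first packet; count the ones before it.
                return forwarded - 1


n, L, C = 4, 1500, 64
k = -(-L // C)
print(cells_before_first_packet(n, L, C), n * (k - 1))
```

Under round-robin service the simulated count coincides with n(⌈L/C⌉ − 1), so the lower bound stated in the text is met exactly in this case.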
There are a few previous studies of the performance of bufferless crossbars that
switch packets, rather than cells. References [5, 13] focus on performance for well-
behaved random traffic, so are not directly comparable to the results presented here.
On the other hand, [2] studies packet-mode emulation of unbuffered crossbars and
shows that strong performance guarantees can be obtained for such systems. However,
the frame-based scheduling methods used in [2] impose a delay that can be several
orders of magnitude larger than the very modest delays imposed by the schedulers
studied here.
Asynchronous crossbars offer an alternative to cell-based crossbars. They eliminate
the need for segmentation and reassembly and are not subject to bandwidth
fragmentation, allowing one to halve the worst-case bandwidth required by the crossbar.
Unfortunately, there is no obvious way to obtain strong performance guarantees
for unbuffered asynchronous crossbars, since the ability of the scheduler to coordinate
the movement of traffic through the system seems to depend on its ability to make
decisions involving all inputs and outputs at one time. A scheduler that operates on
packets must deal with the asynchronous nature of packet arrivals, and must schedule
packets as they arrive and as the inputs and outputs of the crossbar become available.
In particular, if a given input line card finishes sending a packet to the crossbar at
time t, it must then select a new packet to send to the crossbar. It may have packets
that it can send to several different outputs, but its choice of output is necessarily
limited to those outputs that are not currently receiving packets from other inputs.
This can prevent it from choosing the output that it would prefer, were its choices not
so constrained. One can conceivably ameliorate this situation by allowing an input to
select an output that will become available in the near future, but this adds complication
and sacrifices some of the crossbar bandwidth. Moreover, it is not clear that
such a strategy can lead to a scheduling algorithm with good worst-case performance
and small speedup.
The use of buffered crossbars offers a way out of this dilemma. The addition
of buffers to each crosspoint of an n × n crossbar effectively decouples inputs from
outputs, enabling the asynchronous operation that variable length packets seem to
require. A diagram of a system using a buffered crossbar is shown in Figure 1.
In addition to the now conventional Virtual Output Queues (VOQ) at each input,
a buffered crossbar has a small buffer at each of its crosspoints. As pointed out
in [4], the buffers allow inputs and outputs to operate independently, enabling the
use of simpler crossbar scheduling mechanisms, but the buffers have an even greater
import for asynchronous crossbars. With buffers, whenever an input finishes sending
a packet to the crossbar, it can select a packet from one of its VOQs, so long as the
corresponding crosspoint buffer has room for the packet. We show that crosspoint
buffers of modest size are sufficient to allow strong performance guarantees with the
same speedup required by cell-based schedulers.
3 Preliminaries
To start, we introduce common notations that will be used in the analysis to follow.
We say a packet x is an ij-packet if it arrived at input i and is to be forwarded on
output j. We let s(x) denote the time at which the first bit of x is received on an
input link and we let f(x) be the time at which the last bit is received. We let L(x)
denote the number of bits in x and LM denote the maximum packet length (in bits).
The time unit is the time it takes for a single bit to be transferred on an external
link, so f(x) − s(x) = L(x). The time at which a new packet is selected by an input
and sent to the crossbar is referred to as an input scheduling event. We also define
the time at which an active period ends to be an input event. The time at which an
output selects a packet from one of its crosspoint buffers is referred to as an output
scheduling event. We use event to refer to either type, when the type is clear from
the context.
We let Vij denote the VOQ at input i that contains packets for output j and we
let Vij(t) denote the number of bits in Vij at time t. Similarly, we let Bij denote the
crosspoint buffer for packets from input i to output j, Bij(t) denote the number of
bits in Bij at time t, and B denote the capacity of the crosspoint buffers. For all
quantities that include a time parameter, we sometimes omit the time parameter.
We focus on schedulers for systems in which packets are fully buffered at the input
line cards where they arrive before they are sent to the crossbar. A packet is deemed
to have arrived only when the last bit has arrived. Consequently, an ij-packet that
is in the process of arriving at time t is not included in Vij(t). We say that a VOQ
is active, whenever the last bit of its first packet has been received. For an active
VOQ Vij, we refer to the time period since it last became active as the current active
period. For a particular active period of Vij, we define notations for several quantities.
In particular, if x was the first packet to arrive in the active period, we let sij = s(x),
fij = f(x), and Lij = L(x). The time of the first input event in the active period is
denoted by τij. We say an input event is a backlog event for Vij if, when the event
occurs, Bij is too full to accept the first packet in Vij, and we let βij denote the time of
the first backlog event of an active period. We say that Vij is backlogged if it is active
and its most recent input event was a backlog event. These definitions are illustrated
in Figure 2. Note that τij < fij + LM/S and that if βij ≠ τij, then βij ≥ τij + Lij/S.
While we require that packets be fully buffered at inputs, we assume that packets
can be streamed directly through crossbar buffers, and through output buffers to
outgoing links. The first assumption is the natural design choice. The second was
made to simplify the analysis slightly, but is not essential. Extending our analyses to
the case where outputs fully buffer packets is straightforward.
Figure 2: Basic Definitions for Active Periods (timeline marking sij, fij, τij and βij, with fij − sij = Lij, τij − fij < LM/S, and βij − τij either 0 or ≥ Lij/S)
To define a specific crossbar scheduler, we must specify an input scheduling policy
and an output scheduling policy. The input scheduling policy selects an active VOQ
from which to transfer a packet to the crossbar. We assume that the input scheduler
is defined by an ordering of the active VOQs. At each input scheduling event, the
scheduler selects the first active VOQ in this ordering that is not backlogged, and
transfers the first packet in this VOQ to the crossbar. We also assume that the
output scheduling policy is defined by an ordering imposed on the packets to be
forwarded from each output. At each output scheduling event, the scheduler selects
the crosspoint buffer whose first packet comes first in this packet ordering.
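The two selection rules just described can be sketched in a few lines. This is an illustration under stated assumptions, not the report's implementation: the `Voq` class, the function names, and the representation of the output ordering as integer ranks (smaller rank means earlier in the ordering) are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class Voq:
    output: int                                          # output this VOQ feeds
    packets: Deque[int] = field(default_factory=deque)   # fully received packet lengths (bits), head first


def select_voq(ordering: List[Voq], crosspoint_bits: Dict[int, int],
               capacity: int) -> Optional[Voq]:
    """Input side: first active VOQ in the ordering that is not backlogged."""
    for voq in ordering:
        if not voq.packets:
            continue  # inactive: no fully received packet at the head
        if crosspoint_bits[voq.output] + voq.packets[0] <= capacity:
            return voq  # crosspoint buffer has room for the head packet
    return None


def select_crosspoint(head_rank: Dict[int, int]) -> Optional[int]:
    """Output side: input whose crosspoint buffer's first packet comes first
    in the output's packet ordering (smallest rank)."""
    return min(head_rank, key=head_rank.get) if head_rank else None


ordering = [Voq(0, deque([800])), Voq(1, deque([400])), Voq(2)]
chosen = select_voq(ordering, {0: 1500, 1: 0, 2: 0}, capacity=2000)
print(chosen.output if chosen else None)
print(select_crosspoint({3: 7, 1: 2, 5: 4}))
```

In the example, the VOQ for output 0 comes first in the ordering but is skipped because its crosspoint buffer cannot accept the 800-bit head packet, so the VOQ for output 1 is selected.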
Given a VOQ ordering for an input, we say that one VOQ precedes another if it
comes before the other in this VOQ ordering. We extend the precedes relation to the
packets in the VOQs and the bits in those packets by ordering the packets (bits) in
different VOQs according to the VOQ ordering, and packets (bits) in the same VOQ
according to their position in the VOQ. To simplify the language used in the analysis
to follow, we include the bits in Vij in the set of bits that are said to precede Vij. For
packets (bits) at different inputs going to the same output, we say that one precedes
the other, if it comes first in the ordering that defines the output scheduling policy.
For an active VOQ Vij, we let pij(t) equal the number of bits in VOQs at input
i that precede Vij at time t (note, this includes the bits in Vij), plus the number of
bits in the current incoming packet that have been received so far (if there is such a
packet). We define qj(t) to be the number of bits at output j at time t and qij(t) to
be the number of bits at output j that precede the last bit in Vij.
With these preliminaries, we can now define two key quantities, slack and margin.
Specifically, we define slackij(t) = qj(t) − pij(t) and marginij(t) = qij(t) − pij(t). In
the analysis to follow, we will show that shortly after the start of an active period
for Vij, slackij becomes non-negative and stays non-negative. This is useful, because
when an output j becomes idle, qj is necessarily zero. If slackij is not negative, then
pij must be zero also. Since this implies that Vij is empty, there can be no packet
at input i that should be going out on output j. Consequently, we can show that a
scheduler is work-conserving by showing that the slack is non-negative. We can use
margin in a very similar way when showing that a crossbar-based system emulates an
output-queued switch with a specific scheduling policy.
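As a concrete reading of these definitions, the sketch below computes pij, slack and margin from bit counts. The function names and the sample numbers are assumptions made for illustration only.

```python
from typing import List


def preceding_bits(voq_bits: List[int], position: int, arriving_bits: int = 0) -> int:
    """p_ij: bits in the VOQs of one input up to and including the VOQ at
    `position` in the VOQ ordering, plus bits of a packet still arriving."""
    return sum(voq_bits[: position + 1]) + arriving_bits


def slack(q_j: int, p_ij: int) -> int:
    """slack_ij = q_j - p_ij."""
    return q_j - p_ij


def margin(q_ij: int, p_ij: int) -> int:
    """margin_ij = q_ij - p_ij."""
    return q_ij - p_ij


voq_bits = [1200, 0, 800]   # bits queued per VOQ at one input, in VOQ order
p = preceding_bits(voq_bits, position=2, arriving_bits=100)
print(p, slack(q_j=3000, p_ij=p), margin(q_ij=2500, p_ij=p))
```

Since pij cannot be negative, a non-negative slack together with qj = 0 forces pij = 0, which is the step the work-conservation argument relies on.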
Our worst-case performance guarantees are defined relative to a reference system
consisting of an ideal output-queued switch followed by a fixed delay of length T . An
output-queued switch is one in which packets are transferred directly to output-side
queues as soon as they have been completely received. An output-queued switch is
fully specified by the queueing discipline used at the outputs.
In [3], the class of Push in, First Out (PIFO) queueing disciplines is defined to
include all queueing disciplines that can be implemented by inserting arriving packets
into a list, and selecting packets for transmission from the front of the list. That is, a
PIFO discipline is one in which the relative transmission order of two packets is fixed
when the later arriving packet arrives. Most queueing disciplines of practical interest
belong to this class. In [4], the restricted PIFO queueing disciplines are defined as
those PIFO disciplines in which any two ij-packets are transmitted in the same order
they were received. Note that this does not restrict the relative transmission order
of packets received at different inputs. Our emulation results for buffered crossbars
apply to restricted PIFO queueing disciplines.
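A restricted PIFO list for a single output can be sketched as follows. The class and method names are hypothetical, and the caller is assumed to supply the insertion position that the queueing discipline chooses; the sketch only enforces the restriction that packets from the same input keep their arrival order.

```python
from typing import List, Tuple


class RestrictedPifo:
    """Push-in, first-out list for one output; entries are (input, packet id)."""

    def __init__(self) -> None:
        self._list: List[Tuple[int, str]] = []

    def push_in(self, input_port: int, packet_id: str, position: int) -> None:
        # Index of the last queued packet from the same input, or -1 if none.
        last_same = max(
            (k for k, (i, _) in enumerate(self._list) if i == input_port),
            default=-1,
        )
        if position <= last_same:
            raise ValueError("would overtake an earlier packet from the same input")
        self._list.insert(position, (input_port, packet_id))

    def pop_first(self) -> Tuple[int, str]:
        # Transmission always takes the packet at the front of the list.
        return self._list.pop(0)


q = RestrictedPifo()
q.push_in(0, "a", 0)
q.push_in(1, "b", 0)   # allowed: packets from different inputs may be reordered
q.push_in(0, "c", 2)   # allowed: stays behind "a", which came from the same input
print(q.pop_first())
```

Dropping the position check gives an unrestricted PIFO list, in which a later ij-packet could be pushed in ahead of an earlier one.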
We say that a crossbar T-emulates an output-queued switch using a specific queueing
discipline if, when presented with an input packet sequence, it forwards each
packet in the sequence at the same time that it would be forwarded by the reference
system, with an output delay of T. We say that a switch is work-conserving
if, whenever there is a packet in the system for output j, output j is sending data.
A crossbar-based system is T-work-conserving if it T-emulates some work-conserving
output-queued switch. Alternatively, we can say that a system is T-work-conserving
if output j is busy whenever there is a packet in the system for output j that arrived
at least T time units before the current time.
A crossbar that T-emulates an output-queued switch is defined by a specific crossbar
scheduling algorithm and by the output queueing discipline of the emulated
switch. To achieve the emulation property, the output line cards of the crossbar
must hold each packet until T time units have passed since its arrival. While it is
being held, other packets that reach the output after it may be inserted in front
of it in the PIFO list. Whenever the output becomes idle, the line card selects for
transmission the first packet in the list that arrived at least T time units in the
past. This may not be the first packet in the list, since the PIFO ordering need not
be consistent with the arrival order.
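The selection rule for an idle output can be sketched as a scan of the PIFO list. The function name and the representation of the list as (arrival time, packet id) pairs are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def select_for_transmission(pifo_list: List[Tuple[float, str]],
                            now: float, T: float) -> Optional[Tuple[float, str]]:
    """Remove and return the first packet in PIFO order that arrived at least
    T time units ago, or None if every queued packet is still being held."""
    for k, (arrival, _) in enumerate(pifo_list):
        if now - arrival >= T:
            return pifo_list.pop(k)
    return None


# "late" is ahead in PIFO order but has not yet been held for T time units.
pifo = [(9.0, "late"), (2.0, "early")]
print(select_for_transmission(pifo, now=10.0, T=5.0))
print(pifo)
```

The example shows the case noted in the text: the packet selected is not the first one in the list, because the first one arrived too recently.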
In the next few sections, we will prove work-conservation and emulation results
for two crossbar scheduling algorithms. These results all require that the speedup
S, crossbar buffer size B, and time delay T be at least as large as some minimum
threshold. Figure 3 summarizes these thresholds. Note that the values for B and T
are stated relative to the maximum packet length LM .
                                      S    B/LM                      T/LM
Packet Group-by-VOQ
  T-work conservation                 2    2                         2/(S−1)      (2 for S=2)
  T-emulation, restricted PIFO        2    3+2/(S−1)  (5 for S=2)    2/(S−1)+1/S  (2.5 for S=2)
Packet Least-Occupied Output First
  T-work conservation                 2    2S/(S−1)   (4 for S=2)    2/(S−1)      (2 for S=2)
  T-emulation, restricted PIFO        2    3S/(S−1)   (6 for S=2)    2/(S−1)+1/S  (2.5 for S=2)
Figure 3: Quantitative Results
4 Common Properties
In this section, we prove a number of properties that apply to certain large classes of
crossbar schedulers. Readers may want to skip this section on first reading, referring
back to the individual lemmas as they are used in later sections.
4.1 Prompt Schedulers
All the schedulers we consider have the property that they keep the inputs and outputs
busy whenever possible. In particular, if an input line card has any packet x at the
head of one of its VOQs and the VOQ is not backlogged, then the input must be
transferring bits to some crosspoint buffer at rate S. Similarly, if any crosspoint
buffer for output j is not empty, then output j must be transferring bits from some
crosspoint buffer at rate S. A scheduler that satisfies these properties is called a
prompt scheduler.
The first two lemmas provide lower bounds on qj that apply to all prompt schedulers.
These are useful when attempting to establish lower bounds on slackij.
Lemma 1 For a buffered crossbar using any prompt scheduler, qj(t) ≥ (1−1/S)Bij(t)
for all i.
proof. If Bij(t) > 0, then Bij became non-empty at some time no later than t −
Bij(t)/S, since Bij can grow at a rate no faster than S. That is, Bij > 0 throughout
the interval [t − Bij(t)/S, t]. For any prompt scheduler, whenever a crosspoint buffer
for a given output is non-empty, the crossbar transfers bits to the output at rate S.
Since an output sends bits from the output queue to the link at rate 1, an output
queue grows at rate S − 1 during any period during which one or more of its crosspoint
buffers is non-empty. It follows that qj(t) ≥ (1 − 1/S)Bij(t).
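The last step of the proof can be written out explicitly. The display below only restates the argument in symbols; t_0 is shorthand introduced here for the time at which Bij last became non-empty, so that t − t_0 ≥ Bij(t)/S.

```latex
\begin{align*}
q_j(t) &\ge q_j(t_0) + (S-1)(t - t_0) \\
       &\ge 0 + (S-1)\,\frac{B_{ij}(t)}{S} \\
       &= \left(1 - \frac{1}{S}\right) B_{ij}(t).
\end{align*}
```

The first line uses the growth rate S − 1 of the output queue while Bij is non-empty, and the second uses q_j(t_0) ≥ 0 together with the lower bound on t − t_0.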
Lemma 2 Consider an active period for Vij. For any prompt scheduler
qj(τij) ≥ (1− 1/S)Bij(τij) + (S − 1)(τij − fij)
if Bij(τij) > 0.
proof. Note that since Vij is inactive just before fij, Bij cannot grow between fij
and τij, hence Bij(fij) ≥ Bij(τij) > 0. Consequently, qj must increase at rate S − 1
throughout the interval [fij, τij], so
qj(τij) ≥ qj(fij) + (S − 1)(τij − fij)
By Lemma 1,
qj(fij) ≥ (1− 1/S)Bij(fij) ≥ (1− 1/S)Bij(τij)
Combining the two inequalities yields the desired result. 
4.2 Invariant Schedulers
We say that a scheduling algorithm is invariant if it does not change the relative
order of any two VOQs during a period when they are both continuously active. This
property is shared by a number of different crossbar schedulers, including one we
consider in detail in the next section.
The next lemma can be used to show that for prompt and invariant schedulers,
slack does not decrease following the first scheduling event of an active period, and it
applies to any prompt and invariant scheduler.
Lemma 3 Let t1 be the time of an input scheduling event in an active period of Vij
and let t > t1 be no later than the next event at input i. For any prompt and invariant
scheduler,
slackij(t) ≥ slackij(t1) + (S − 2)(t− t1)
if B ≥ 2LM .
proof. If Vij is backlogged at time t1, then Bij(t1) > LM which implies that Bij
remains non-empty until at least t1 +LM/S ≥ t. Consequently, qj(t) ≥ qj(t1) + (S −
1)(t − t1). Since the VOQ ordering is invariant in the interval [t1, t], any increase
in pij during this interval can only result from the arrival of bits on the input link.
Consequently, pij(t) ≤ pij(t1) + (t− t1) and slackij(t) ≥ slackij(t1) + (S − 2)(t− t1).
If Vij is not backlogged at t1, then either Vij or another VOQ that precedes Vij
must be selected at t1. In either case, pij(t) ≤ pij(t1) − (S − 1)(t − t1). Since
qj(t) ≥ qj(t1)− (t− t1), it follows that slackij(t) ≥ slackij(t1) + (S − 2)(t− t1). 
We can prove a stronger version of Lemma 3 that can be used to obtain more
precise results.
Lemma 4 Let t1 be the time of an input scheduling event in an active period of Vij
and let t > t1 be no later than the next event at input i. For any prompt and invariant
scheduler with S ≥ 2 and B ≥ 2LM , if slackij(t1) ≥ −Vij(t1) then slackij(t) ≥ −Vij(t).
proof. If Vij(t) ≥ Vij(t1) then the result follows from Lemma 3. Assume then that
Vij(t) < Vij(t1). This implies that Vij was selected at t1. Consequently, qj increases at
rate S − 1 during the interval [t1, t] (since the scheduling algorithm is prompt), while
pij decreases at rate ≥ S − 1 (since the scheduling algorithm is invariant). Thus,
slackij(t) ≥ slackij(t1) + 2(S − 1)(t − t1). Since Vij can decrease at a rate no faster
than S, Vij(t) ≥ Vij(t1)− S(t− t1). Consequently,
slackij(t)≥ slackij(t1) + 2(S − 1)(t− t1)
≥ −Vij(t1) + 2(S − 1)(t− t1)
≥ −(Vij(t) + S(t− t1)) + 2(S − 1)(t− t1)
= −Vij(t) + (S − 2)(t− t1)
≥ −Vij(t)
since S ≥ 2. 
4.3 ij-FIFO Schedulers
We say that a system is ij-FIFO if for all inputs i and outputs j, all ij-packets are
forwarded in the same order they were received. Note that systems that implement
restricted PIFO queueing disciplines are ij-FIFO.
In this subsection, we prove several lemmas that are useful in proving emulation
results. The first two lemmas provide lower bounds on qij for prompt and ij-FIFO
schedulers. These are useful for proving lower bounds on marginij.
Lemma 5 For any prompt and ij-FIFO scheduler, qij(t) ≥ (1−1/S)(Bij(t)−LM).
proof. The statement is trivially true if Bij(t) ≤ LM. So assume Bij(t) > LM,
and note that this implies that Bij became non-empty at some time no later than
t−Bij(t)/S, since Bij can grow at a rate no faster than S. Consequently, there must
be a scheduling event at output j in the interval [t−Bij(t)/S, (t−Bij(t)/S)+LM/S]
and from the time of that event until t, output j must be receiving bits that precede
Vij, since the scheduler is ij-FIFO. Consequently, qij increases at rate S−1 throughout
the interval [(t−Bij(t)/S) + LM/S, t] and so qij(t) ≥ (1− 1/S)(Bij(t)− LM). 
Lemma 6 Consider an active period for Vij. For any prompt and ij-FIFO scheduler,
qij(τij) ≥ (1− 1/S)(Bij(τij)− LM) + (S − 1)(τij − fij)
if Bij(τij) ≥ LM .
proof. Since Bij cannot grow between fij and τij, Bij(fij) ≥ Bij(τij) ≥ LM . Con-
sequently, Bij became non-empty no later than fij − LM/S, which implies that qij
increases at rate S − 1 throughout the interval [fij, τij]. Hence,
qij(τij) ≥ qij(fij) + (S − 1)(τij − fij)
≥ (1− 1/S)(Bij(fij)− LM) + (S − 1)(τij − fij)
≥ (1− 1/S)(Bij(τij)− LM) + (S − 1)(τij − fij)

The next lemma can be used to show that margin does not decrease after the first
event of an active period. It applies to any scheduler that is prompt, invariant and
ij-FIFO.
Lemma 7 Let t1 be the time of an input scheduling event in an active period of Vij
and let t > t1 be no later than the next event at input i. For any prompt, invariant
and ij-FIFO scheduler with speedup S and B ≥ 2LM , marginij(t) ≥ marginij(t1) +
(S − 2)(t− t1).
proof. If Vij is backlogged at t1, then Bij(t1) > LM and Bij became non-empty
before t1 − LM/S and will remain non-empty until at least t1 + LM/S. This implies
that qij increases at rate S − 1 throughout the interval [t1, t] (since the scheduler
is ij-FIFO). Since pij can increase at a rate no faster than 1 during this period,
marginij(t) ≥ marginij(t1)+(S−2)(t−t1). If Vij is not backlogged at t1, pij decreases
at rate ≥ S − 1 in the interval [t1, t] (since the scheduler is invariant) and since qij
can decrease at a rate no faster than 1, marginij(t) ≥ marginij(t1) + (S − 2)(t− t1).

5 Packet Group by VOQ
Group by Virtual Output Queue (GVOQ) is a cell switch scheduling algorithm first
described in [3] and extended to buffered crossbars in [4]. We define the Packet
GVOQ (PGV) scheduler by defining an ordering that it imposes on the VOQs. In
this ordering, the relative order of two VOQs does not change so long as they both
remain active. Hence, PGV is invariant. When an inactive VOQ becomes active, it is
placed first in the VOQ ordering. When a VOQ becomes inactive, it is removed from
the VOQ ordering. Different variants of PGV can be defined by specifying different
output scheduling strategies.
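The ordering rule above can be sketched for a single input as follows. This is an illustrative model only: the class name, the method names, and the use of a Python list are assumptions, and the sketch says nothing about how the input or output schedulers select packets.

```python
class PGVOrdering:
    """VOQ ordering at one input under the PGV rule (illustrative sketch)."""

    def __init__(self):
        # Index 0 is the first VOQ in the ordering.
        self._order = []

    def activate(self, voq):
        """A VOQ that becomes active is placed first in the ordering."""
        if voq in self._order:
            raise ValueError("VOQ is already active")
        self._order.insert(0, voq)

    def deactivate(self, voq):
        """A VOQ that becomes inactive is removed from the ordering."""
        self._order.remove(voq)

    def precedes(self, a, b):
        """True if active VOQ a comes before active VOQ b."""
        return self._order.index(a) < self._order.index(b)

    def order(self):
        return list(self._order)
```

Because the only operations are insertion at the front and removal, two VOQs that both stay active never swap places, which is the invariance property used in the analysis.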
5.1 T -Work-Conservation
In this section, we show that regardless of the specific output scheduling policy used,
PGV is T -work-conserving. We prove two versions of the work-conservation result.
The first is a bit weaker than the second, but is included because the analysis is more
straightforward and hence it provides a useful stepping stone to the more difficult
results to follow.
Theorem 1 Any PGV scheduler is T -work-conserving if S ≥ 2, B ≥ (2 + 1/(S −
1))LM and T ≥ 2LM/(S − 1).
The proof of this theorem involves four steps. The first step is to show that slack
does not decrease after the first scheduling event of an active period. This was shown
in Lemma 3 in the previous section. The second step is to show that a backlog event
must occur near the start of an active period, and the third step is to show that when
the first backlog event occurs, slack is non-negative. These two steps are shown in the
proofs of the next two lemmas. The final step, which appears as the proof of the
theorem, is to show that when an output is idle, no input can have a packet that has
been present for more than time T .
Lemma 8 Consider an active period for Vij in a crossbar using a PGV scheduler
with speedup S. If the duration of the active period is at least 2LM/(S − 1), then it
includes at least one backlog event for Vij and βij ≤ fij + 2LM/(S − 1).
proof. Suppose there is no backlog event in the interval [τij, t] for t = fij+2LM/(S−1).
Then, at each event in this interval, the input scheduler selects either Vij or some other
VOQ that precedes Vij. Since the scheduling algorithm is invariant, any contribution
to increasing pij during this interval can only result from the arrival of new bits from
the input link. Consequently, pij decreases at a rate ≥ (S−1) throughout this period.
Since pij(τij) ≤ Lij + (τij − fij),
pij(t)≤ Lij + (τij − fij)− (S − 1)(t− τij)
= Lij + (τij − fij)− (S − 1)((fij + 2LM/(S − 1))− τij)
= Lij + S(τij − fij)− 2LM
< LM + S(LM/S)− 2LM = 0
The first line in the above inequality follows from the fact that pij(fij) = Lij and that
pij can increase at rate at most 1 during the interval [sij, τij] and must decrease at
rate S − 1 after τij. The second and third lines follow directly from the definitions,
and the last line from the fact that τij − fij < LM/S. The above result contradicts
the premise that the duration of the active period is at least 2LM/(S − 1). 
Our next lemma shows that within a short time following the start of an active
period, slackij ≥ 0.
Lemma 9 Consider some active period for Vij that includes the time t ≥ fij +
2LM/(S − 1). For any PGV scheduler, slackij(t) > 0 if S ≥ 2 and B ≥ (2 +
1/(S − 1))LM .
proof. We show that slackij(βij) > 0. The result then follows from Lemmas 3 and 8.
If βij = τij, then by Lemma 2,
qj(βij) > (1− 1/S)(B − LM) + (S − 1)(τij − fij)
For any PGV scheduler,
pij(βij) = pij(τij) ≤ LM + (τij − fij)
Combining the inequalities for pij and qj, we obtain
slackij(βij) > (1− 1/S)(B − LM) + (S − 2)(τij − fij)− LM ≥ 0
since S ≥ 2 and B ≥ (2 + 1/(S − 1))LM .
Now, suppose βij > τij. Since at least one packet must be sent from Vij during
the active period, in order for it to become backlogged, βij ≥ τij +Lij/S. During the
interval [τij, βij], pij decreases at rate ≥ S − 1. Consequently,
pij(βij) ≤ Lij + (τij − fij)− (S − 1)(βij − τij) < Lij/S + LM/S ≤ 2LM/S
By Lemma 1, qj(βij) > (1 − 1/S)(B − LM) ≥ 2(1 − 1/S)LM , so slackij(βij) >
2(1− 1/S)LM − 2LM/S ≥ 0. 
We can now proceed to the proof of the theorem.
Proof of Theorem 1. Suppose some output j is idle at time t and no input is
currently sending it a packet, but some input i has a packet x for output j with
f(x) + 2LM/(S − 1) < t. By Lemma 9, slackij(t) > 0. Since qj(t) = 0, this implies
that pij(t) < 0, which contradicts the fact that Vij is active at t. 
Using a more precise analysis, we can reduce the required crossbar buffer size to
2LM.
Theorem 2 Any PGV scheduler with S ≥ 2 and B ≥ 2LM is T -work-conserving for
T ≥ 2LM/(S − 1).
To prove this, we must first show that slack is bounded from below, shortly after
the start of an active period.
Lemma 10 Consider an active period for Vij that includes the time t ≥ fij+2LM/(S−
1). For any PGV scheduler with speedup S ≥ 2 and B ≥ 2LM , slackij(t) > −Vij(t).
proof. If Bij(τij) > 0 then by Lemma 2, qj(τij) > τij − fij. Since
pij(τij) ≤ Lij + (τij − fij) ≤ Vij(τij) + (τij − fij)
it follows that slackij(τij) > −Vij(τij). By Lemma 4, slackij(t) > −Vij(t).
Now, suppose that Bij(τij) = 0. By Lemma 8, βij ≤ fij + 2LM/(S − 1) ≤ t. If
x is the first packet in Vij at βij, then βij > τij + (B − L(x))/S. During the interval
[τij, βij], pij decreases at rate ≥ S − 1. Consequently,
pij(βij)< Lij + (τij − fij)− (S − 1)(βij − τij)
≤ (1 + 1/S)LM − (1− 1/S)(B − L(x))
By Lemma 1, qj(βij) > (1− 1/S)(B − L(x)) so,
slackij(βij)> 2(1− 1/S)(B − L(x))− (1 + 1/S)LM
≥ 4(1− 1/S)LM − (1 + 1/S)LM − 2(1− 1/S)L(x)
= (3− 5/S)LM − (1− 2/S)L(x)− L(x)
> −L(x) ≥ −Vij(βij)
By Lemma 4, slackij(t) > −Vij(t). 
We can now proceed to prove the theorem.
Proof of Theorem 2. Suppose some output j is idle at time t and no input is
currently sending it a packet, but some input i has a packet x for output j with
f(x) + 2LM/(S − 1) < t. By Lemma 10, slackij(t) > −Vij(t). Since qj(t) = 0, this
implies that pij(t) < Vij(t), which contradicts the definition of pij. 
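To make the difference between the two work-conservation results concrete, the sketch below evaluates the smallest crosspoint buffer size B and delay bound T that each theorem permits for a given speedup S and maximum packet length LM. The function name and the `tight` flag are illustrative assumptions; the formulas are those in the statements of Theorems 1 and 2.

```python
def pgv_work_conservation_params(S, LM, tight=True):
    """Return (B_min, T_min) for PGV work-conservation.

    tight=False uses the buffer bound of Theorem 1,
    tight=True uses the smaller buffer bound of Theorem 2.
    """
    if S < 2:
        raise ValueError("both theorems require S >= 2")
    B_min = 2.0 * LM if tight else (2.0 + 1.0 / (S - 1.0)) * LM
    T_min = 2.0 * LM / (S - 1.0)
    return B_min, T_min
```

With S = 2 and LM = 12000 bits (a 1500 byte packet), Theorem 1 needs a buffer of 36000 bits while Theorem 2 needs 24000 bits; the delay bound is 24000 bit times in both cases.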
5.2 T -Emulation Results for PGV
We refer to a PGV scheduler defined by a restricted PIFO queueing discipline as a
PGV-RP scheduler. We show that for any restricted PIFO queueing discipline, the
corresponding PGV-RP scheduler T -emulates an ideal output-queued switch using
the same discipline. Our result for PGV generalizes the corresponding result for
cell-based crossbars given in [3].
Theorem 3 Let X be an output-queued switch using a restricted PIFO scheduler. A
crossbar using the corresponding PGV-RP scheduler T -emulates X if S ≥ 2, B ≥
(3 + 2/(S − 1))LM , and T ≥ (2/(S − 1) + 1/S)LM .
The analysis leading to this result is similar to the analysis used to establish work-
conservation. The first step is to show that margin does not decrease following the
first input event of an active period (Lemma 7). The second step is to establish a
lower bound on margin at the time of the first backlog event. We then use this lower
bound to prove the emulation result.
Lemma 11 Consider an active period for Vij that includes t ≥ fij+2LM/(S−1). For
any PGV-RP scheduler, marginij(t) > LM/S if S ≥ 2 and B ≥ (3 + 2/(S − 1))LM .
proof. We show that marginij satisfies the bound at the time of the first backlog
event. If Vij is backlogged at τij, then Bij(τij) > B − LM and by Lemma 6,
qij(τij) > (1− 1/S)(B − 2LM) + (S − 1)(τij − fij)
Since pij(τij) ≤ LM + (τij − fij),
marginij(τij)> (1− 1/S)(B − 2LM) + (S − 1)(τij − fij)− (LM + (τij − fij))
≥ (1− 1/S)B − (3− 2/S)LM
This is ≥ LM/S so long as B ≥ (3+2/(S−1))LM . By Lemma 7, marginij(t) > LM/S.
Now suppose Vij is not backlogged at τij. By Lemma 8, βij ≤ t and since Vij is
backlogged at βij, Bij(βij) > B−LM , so by Lemma 5, qij(βij) > (1−1/S)(B−2LM).
Since βij ≥ τij + Lij/S, it follows that
pij(βij) ≤ Lij + LM/S − (1− 1/S)Lij ≤ 2LM/S
and
marginij(βij)> (1− 1/S)(B − 2LM)− 2LM/S
= (1− 1/S)B − 2LM
This is ≥ LM/S so long as B ≥ (2+3/(S−1))LM, which is implied by the condition
on B in the statement of the lemma. By Lemma 7, marginij(t) > LM/S. 
We can now proceed to the proof of the theorem.
Proof of Theorem 3. Suppose that up until time t, the PGV-RP crossbar faithfully
emulates the output-queued switch, but that at time t, the output-queued switch
begins to forward an ij-packet x, while the crossbar does not.
Now suppose that in the crossbar, one or more bits of x have reached Bij by time
t− LM/S. Note that the interval [t− LM/S, t) must contain at least one scheduling
event at output j and all such events must select packets that precede x. However,
this implies that during some non-zero time interval [t1, t], output j is continuously
receiving bits that precede x at a faster rate than it can forward them to the output.
This contradicts the fact that by time t, the crossbar has forwarded all the bits that
precede x (since it faithfully emulates the output-queued switch up until time t).
Assume then that at time t − LM/S, no bits of x have reached Bij. Since the
output-queued switch has an output delay of T, f(x) ≤ t − T, so t − LM/S ≥
f(x) + 2LM/(S − 1). Since the crossbar has sent everything sent by the output-
queued switch up until t, it follows that qij(t − LM/S) ≤ LM/S. By Lemma 11,
marginij(t− LM/S) > LM/S and hence pij(t− LM/S) < 0, which is not possible. 
The analysis of Lemma 11 requires a crossbar buffer of size at least 5LM when
S = 2. We conjecture that this can be reduced using a more sophisticated analysis.
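The buffer and delay requirements of Theorem 3 can be tabulated in the same way. The following sketch simply evaluates the two expressions in the theorem statement; the function name is an illustrative assumption.

```python
def pgv_emulation_params(S, LM):
    """Return (B_min, T_min) from Theorem 3 for a PGV-RP scheduler."""
    if S < 2:
        raise ValueError("Theorem 3 requires S >= 2")
    B_min = (3.0 + 2.0 / (S - 1.0)) * LM
    T_min = (2.0 / (S - 1.0) + 1.0 / S) * LM
    return B_min, T_min
```

For S = 2 this gives B = 5 LM and T = 2.5 LM, matching the remark above; larger speedups shrink both values.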
6 Packet LOOFA
The Least Occupied Output First Algorithm (LOOFA) is a cell scheduling algorithm
described in [8]. We define an asynchronous crossbar scheduling algorithm based on
LOOFA, called Packet LOOFA (PLF). Like PGV, PLF is defined by the ordering it
imposes on the VOQs at each input. The ordering of the VOQs is determined by
the number of bits in the output queues. In particular, when a VOQ Vij becomes
active, it is inserted immediately after the last VOQ Vih, for which qh ≤ qj. If there
is no such VOQ, it is placed first in the ordering. At any time, active VOQs may
be re-ordered, based on the output occupancy. We allow one VOQ to move ahead of
another during this re-ordering, only if its output has strictly fewer bits. The work-
conservation result for PLF is comparable to that for PGV, but the required analysis
is technically more difficult because in PLF, the relative orders of VOQs can change.
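The insertion rule for a newly active VOQ can be sketched as follows for a single input. The representation is an assumption made for illustration: each VOQ is identified by its output, `order` lists the active VOQs with the first element first in the ordering, and `q` maps each output to its current output queue length in bits. The later re-ordering of active VOQs is not modeled.

```python
def plf_insert(order, j, q):
    """Insert the VOQ for output j into a PLF ordering at one input.

    The new VOQ goes immediately after the last VOQ h with q[h] <= q[j].
    If there is no such VOQ, it is placed first.
    """
    pos = 0
    for idx, h in enumerate(order):
        if q[h] <= q[j]:
            pos = idx + 1  # remember the position just after the last match
    return order[:pos] + [j] + order[pos:]
```

Note that the rule looks for the last qualifying VOQ, so the new VOQ may land behind a VOQ whose output queue is longer than its own.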
Because the order of VOQs can change, PLF is also more responsive to changing
traffic conditions than PGV. While this has no effect on work-conservation when
S ≥ 2, it does provide better fairness when used with smaller speedups. As one
example of this, consider the following traffic pattern. From time 0 to time T , a
switch with speedup of S < 4/3, receives packets on inputs A and B for output X
at the link rate of 1. After time T , input A receives packets for output Y (at rate
1) while input B receives packets for output Z. Due to the symmetry of the traffic
pattern, a scheduler has no reason to favor one input over the other, so we assume
that the inputs are treated fairly by the output scheduling policy. Up until time T ,
the two inputs each send packets to X at rate S/2 and X forwards packets at rate 1,
while building a backlog. If a PGV scheduler is used, then after time T , input A gives
preference to output Y , while input B gives preference to output Z. Consequently,
output X receives packets only at rate 2(S − 1). As a result, the output side backlog
at X is fully consumed by time ((2−S)/(3− 2S))T , after which X starts forwarding
packets at rate 2(S − 1), while both outputs Y and Z continue to forward packets
at rate 1. So for S = 1.2, X is limited to an output rate of 0.4. On the other
hand, a PLF scheduler attempts to keep the output queue lengths equal, so after
time ((2− S)/(3− 2S))T , outputs X, Y and Z will all receive packets at rate 2S/3.
So for S = 1.2, all three outputs will forward packets at rate 0.8. This doubles the
rate at which X is able to send, dramatically improving the fairness with respect to
the other outputs.
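The arithmetic in this example can be checked with a short calculation. The sketch below assumes the fluid model used in the text, with link rate 1, and additionally assumes S > 1 so that a backlog actually builds at X; the function name is illustrative.

```python
def fairness_example(S, T):
    """Return (empty_time, pgv_rate_x, plf_rate) for the two-phase pattern."""
    if not 1.0 < S < 4.0 / 3.0:
        raise ValueError("the example assumes 1 < S < 4/3")
    backlog_at_T = (S - 1.0) * T        # X fills at rate S and drains at rate 1
    pgv_rate_x = 2.0 * (S - 1.0)        # rate at which X receives after T under PGV
    drain_rate = 1.0 - pgv_rate_x       # net drain of X's backlog, equal to 3 - 2S
    empty_time = T + backlog_at_T / drain_rate  # equal to ((2 - S)/(3 - 2S)) T
    plf_rate = 2.0 * S / 3.0            # per-output rate under PLF, as stated in the text
    return empty_time, pgv_rate_x, plf_rate
```

For S = 1.2 the PGV rate at X is 0.4 and the PLF rate is 0.8, which is the factor of two noted above.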
6.1 More Definitions
To facilitate the analysis of PLF, it’s helpful to separate the analysis of “old bits”
from “new bits”. When considering an active period for Vij, the old bits at input i
are those bits that arrived before sij. All other bits at input i are considered new.
Also, we say that a VOQ V is older than a VOQ W at time t if both are active, and
V last became active before W did. We say that a VOQ V passes a VOQ W during
a given time interval, if W precedes V at the start of the interval and V precedes W
at the end of the interval.
For an active VOQ, Vij, we let newij(t) be the number of bits present at input i at
time t that arrived in the interval [sij, t]. We let $\bar{p}_{ij}(t)$ equal newij(t) plus the number
of bits that precede Vij at time t that arrived before sij. Note that $p_{ij}(t) \le \bar{p}_{ij}(t)$ and
consequently,
$\mathrm{slack}_{ij}(t) \ge \overline{\mathrm{slack}}_{ij}(t) = q_j(t) - \bar{p}_{ij}(t)$
and
$\mathrm{margin}_{ij}(t) \ge \overline{\mathrm{margin}}_{ij}(t) = q_{ij}(t) - \bar{p}_{ij}(t)$
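The relationship between pij and the modified count that adds all new bits to the old bits preceding Vij can be illustrated on a small snapshot of one input. The sketch below rests on two assumptions: that pij counts the bits in Vij itself together with the bits in VOQs ahead of it (consistent with the use of pij(t) ≥ Vij(t) in the proof of Theorem 2), and that each VOQ is stored as a list of (arrival time, bits) pairs. The function name is illustrative.

```python
def p_and_pbar(order, voqs, j, s_ij):
    """Return (p, p_bar) for the VOQ of output j at one input.

    order: outputs of the active VOQs, first in the ordering first.
    voqs:  dict mapping output -> list of (arrival_time, bits) chunks.
    s_ij:  start time of the current active period of the VOQ for j.
    """
    k = order.index(j)
    ahead = order[: k + 1]  # assumption: includes the VOQ for j itself
    p = sum(bits for h in ahead for (_, bits) in voqs[h])
    # New bits: everything at this input that arrived at or after s_ij.
    new = sum(bits for h in order for (a, bits) in voqs[h] if a >= s_ij)
    # Old bits that precede the VOQ for j.
    old_preceding = sum(bits for h in ahead for (a, bits) in voqs[h] if a < s_ij)
    return p, new + old_preceding
```

The difference between the two counts is exactly the number of new bits that do not precede Vij, which is why the modified count is never smaller than pij.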
6.2 Additional General Lemmas
Here we give several more lemmas that apply to a broad class of scheduling algorithms
and are useful for establishing both T -work-conservation and T -emulation results for
PLF. The reader may want to skip this section on first reading, and refer back to the
lemmas presented here, as they are used.
Lemma 12 Let t1 be the time of an input scheduling event in an active period of Vij
and let t > t1 be no later than the next event at input i. For any prompt scheduler,
$\overline{\mathrm{slack}}_{ij}(t) \ge \overline{\mathrm{slack}}_{ij}(t_1) + (S - 2)(t - t_1)$
if no older VOQ passes Vij in [t1, t] and B ≥ 2LM.
proof. If Vij is backlogged at time t1, then Bij(t1) > LM which implies that Bij
remains non-empty until at least t1 + LM/S ≥ t. Consequently, qj(t) ≥ qj(t1) + (S −
1)(t − t1). Since no older VOQ passes Vij in the interval [t1, t], any increase in $\bar{p}_{ij}$
during this interval can only result from the arrival of new bits on the input link.
Consequently,
$\bar{p}_{ij}(t) \le \bar{p}_{ij}(t_1) + (t - t_1)$
and
$\overline{\mathrm{slack}}_{ij}(t) \ge \overline{\mathrm{slack}}_{ij}(t_1) + (S - 2)(t - t_1)$
If Vij is not backlogged at t1, then either Vij or another VOQ that precedes Vij must
be selected at t1. In either case, $\bar{p}_{ij}(t) \le \bar{p}_{ij}(t_1) - (S - 1)(t - t_1)$. Since
qj(t) ≥ qj(t1) − (t − t1)
it follows that
$\overline{\mathrm{slack}}_{ij}(t) \ge \overline{\mathrm{slack}}_{ij}(t_1) + (S - 2)(t - t_1)$

Lemma 13 Let t1 be the time of an input scheduling event in an active period of Vij
and let t > t1 be no later than the next event at input i. For any prompt, ij-FIFO
scheduler,
$\overline{\mathrm{margin}}_{ij}(t) \ge \overline{\mathrm{margin}}_{ij}(t_1) + (S - 2)(t - t_1)$
if no older VOQ passes Vij in [t1, t] and B ≥ 2LM.
proof. If Vij is backlogged at time t1, then Bij(t1) > LM which implies that Bij became
non-empty before t1 − LM/S and will remain non-empty until at least t1 + LM/S ≥ t.
Consequently, qij(t) ≥ qij(t1) + (S − 1)(t − t1). Since no older VOQ passes Vij in the
interval [t1, t],
$\bar{p}_{ij}(t) \le \bar{p}_{ij}(t_1) + (t - t_1)$
and
$\overline{\mathrm{margin}}_{ij}(t) \ge \overline{\mathrm{margin}}_{ij}(t_1) + (S - 2)(t - t_1)$
If Vij is not backlogged at t1, then
$\bar{p}_{ij}(t) \le \bar{p}_{ij}(t_1) - (S - 1)(t - t_1)$
and since qij(t) ≥ qij(t1) − (t − t1), it follows that
$\overline{\mathrm{margin}}_{ij}(t) \ge \overline{\mathrm{margin}}_{ij}(t_1) + (S - 2)(t - t_1)$

Our next lemma applies to any scheduling algorithm.
Lemma 14 If there is some VOQ that is older than Vij and that precedes Vij at time
t, then there is some such VOQ Vih for which $\bar{p}_{ij}(t) \le \bar{p}_{ih}(t)$.
proof. Let Vih be a VOQ that is older than Vij and that precedes Vij at time t. More
specifically, let Vih be that VOQ that comes latest in the VOQ ordering, among all
VOQs that satisfy the condition. Let X be the set of bits that precede Vij at time t
but not Vih. Note that $|X| = p_{ij}(t) - p_{ih}(t)$ and that all bits in X must have arrived
since sij (otherwise, there would be some VOQ older than Vij that precedes Vij and
comes later in the VOQ ordering than Vih). Since Vih is older than Vij, these bits also
arrived after sih. Let Y be the set of bits that arrived after sij and are still present at
time t and do not precede Vij. Note that $|Y| = \bar{p}_{ij}(t) - p_{ij}(t)$ and that X and Y have
no bits in common. Now, let Z be the set of bits that arrived since sih and do not
precede Vih. Both X and Y are subsets of Z and so $|X| + |Y| \le |Z| = \bar{p}_{ih}(t) - p_{ih}(t)$.
Consequently,
$(p_{ij}(t) - p_{ih}(t)) + (\bar{p}_{ij}(t) - p_{ij}(t)) \le \bar{p}_{ih}(t) - p_{ih}(t)$
which implies that $\bar{p}_{ij}(t) \le \bar{p}_{ih}(t)$. 
6.3 T -Work-Conservation
Theorem 4 A buffered crossbar using any PLF scheduler is T -work-conserving if
S ≥ 2, B ≥ 2LMS/(S − 1), and T ≥ 2LM/(S − 1).
To prove the theorem, we need the following lemma.
Lemma 15 If Vij is active at t ≥ τij, then for any PLF scheduler, either
$\overline{\mathrm{slack}}_{ij}(t) \ge 0$
or
$\bar{p}_{ij}(t) \le (1 + 1/S)L_M - (S - 1)(t - \tau_{ij})$
if S ≥ 2 and B ≥ 2LMS/(S − 1).
Before we proceed with the proof of the lemma, we note that for t ≥ fij+2LM/(S−
1),
(1 + 1/S)LM − (S − 1)(t− τij) ≤ (1 + 1/S)LM + (S − 1)(τij − fij)− 2LM
≤ (1 + 1/S)LM + (1− 1/S)LM − 2LM ≤ 0
Consequently, the lemma implies that $\overline{\mathrm{slack}}_{ij}(t) \ge 0$ for t ≥ fij + 2LM/(S − 1) and
since $\mathrm{slack}_{ij}(t) \ge \overline{\mathrm{slack}}_{ij}(t)$, slackij(t) ≥ 0 also. We state this as a corollary.
Corollary 1 If Vij is active at t ≥ fij + 2LM/(S − 1), then for any PLF scheduler,
slackij(t) ≥ 0, if S ≥ 2 and B ≥ 2LMS/(S − 1).
Proof of Lemma 15. Assume that there is some time t when the lemma does not
hold. More specifically, let t be the earliest time when it is not true for some VOQ
and let Vij be the oldest VOQ that violates the lemma at time t.
Suppose first there is no event in [τij, t] at which there is an older VOQ that
precedes Vij. This implies that
pij(τij) ≤ τij − sij ≤ (1 + 1/S)LM
It also implies that there are no two consecutive events in [τij, t] between which an
older VOQ passes Vij. Consequently, by Lemma 12, slackij does not decrease between
any two consecutive events in [τij, t].
If βij > t then Vij is eligible for selection at every event in [τij, t]. This implies
that at every such event, the selected packet precedes Vij. Consequently,
pij(t) ≤ pij(τij)− (S − 1)(t− τij) ≤ (1 + 1/S)LM − (S − 1)(t− τij)
which contradicts our assumption that the lemma does not hold at t. Assume then
that βij ≤ t. By Lemma 1,
qj(βij) ≥ (1− 1/S)(B − LM)
and since
pij(βij) ≤ pij(τij)− (S − 1)(βij − τij)
it follows that
slackij(βij)≥ (1− 1/S)(B − LM)− ((1 + 1/S)LM − (S − 1)(βij − τij))
≥ (1− 1/S)B − 2LM
which is ≥ 0 for B ≥ 2(S/(S − 1))LM . Since slackij does not decrease in [τij, t], it
follows that slackij(t) ≥ 0. This again, contradicts our assumption that the lemma
does not hold at t.
From the above, it follows that there must be some event in [τij, t] at which there
is an older VOQ that precedes Vij. Let t1 be the time of the latest such event. By
Lemma 14, there is a VOQ Vih for which pij(t1) ≤ pih(t1) and since Vih precedes Vij at
t1, qj(t1) ≥ qh(t1) and consequently, slackij(t1) ≥ slackih(t1). If t1 = t, then since Vij
is the oldest VOQ that does not satisfy the lemma at t, Vih does satisfy the lemma.
That is,
slackih(t) ≥ 0
or
pih(t) ≤ (1 + 1/S)LM − (S − 1)(t− τih)
If slackih(t) ≥ 0, then slackij(t) ≥ 0 also. On the other hand, if pih(t) ≤ (1 +
1/S)LM − (S − 1)(t− τih), then
pij(t)≤ (1 + 1/S)LM − (S − 1)(t− τih)
≤ (1 + 1/S)LM − (S − 1)(t− τij)
Once again, this contradicts our assumption that the lemma does not hold at t.
Consequently, we must have t1 < t and since t is the earliest time at which the lemma
is violated, either slackij(t1) ≥ 0 or
pij(t1) ≤ (1 + 1/S)LM − (S − 1)(t1 − τij)
If slackij(t1) ≥ 0 then by Lemma 12, slackij(t) ≥ 0 also. Assume then, that pij(t1) ≤
(1 + 1/S)LM − (S − 1)(t1 − τij). Now, if βij ≥ t, then Vij is eligible for selection at
every event in [t1, t] and so pij(t) ≤ (1 + 1/S)LM − (S − 1)(t − τij). On the other
hand, if t1 ≤ βij < t, it follows that
pij(βij) ≤ (1 + 1/S)LM − (S − 1)(βij − τij)
and since qj(βij) ≥ (1− 1/S)(B − LM),
slackij(βij)≥ (1− 1/S)(B − LM)− ((1 + 1/S)LM − (S − 1)(βij − τij))
≥ (1− 1/S)B − 2LM ≥ 0
Since slackij cannot decrease in [t1, t], it follows that slackij(t) ≥ 0. Once again, this
contradicts the assumption that the lemma does not hold at t.
That leaves one more case: βij < t1. Since slackij(t1) ≥ 0 implies that slackij(t) ≥
0, we must have
pij(t1) ≤ (1 + 1/S)LM − (S − 1)(t1 − τij)
Also, since qj(βij) ≥ (1− 1/S)(B − LM),
qj(t1) ≥ (1− 1/S)(B − LM)− (t1 − βij)
and
slackij(t1)≥ (1− 1/S)(B − LM)− (t1 − βij)− ((1 + 1/S)LM − (S − 1)(t1 − τij))
≥ (1− 1/S)B + (S − 2)(t1 − τij)− 2LM
≥ (1− 1/S)B − 2LM ≥ 0
This completes the contradiction to our original assumption that the lemma does not
hold at time t. 
Proof of Theorem 4. Suppose some output j is idle at time t, but some input i
has a packet x for output j with f(x) + T < t. By Corollary 1, slackij(t) ≥ 0. Since
qj(t) = 0, this implies that pij(t) ≤ 0, which contradicts the fact that Vij contains x
at t. 
6.4 T -Emulation
In this section we show that a variant of the PLF algorithm is capable of emulating
an output-queued switch using any restricted PIFO queueing discipline. This variant
differs from the standard PLF algorithm in that it orders VOQs based on the values
of qij, rather than qj. That is, when Vij becomes non-empty, it is inserted into the
VOQ ordering after the last VOQ Vih for which qih ≤ qij. If there is no such VOQ, Vij
is placed first in the ordering. Strictly speaking, this variant is different from PLF,
so to avoid confusion we refer to it as Refined PLF or RPLF.
Theorem 5 Let X be an output-queued switch using a restricted PIFO scheduler.
A crossbar using the corresponding RPLF scheduler T -emulates X if S ≥ 2 and
B ≥ 3LMS/(S − 1) and T ≥ (2/(S − 1) + 1/S)LM .
To prove the theorem, we need the following lemma.
Lemma 16 If Vij is active at t ≥ τij, then for any RPLF scheduler, either
$\overline{\mathrm{margin}}_{ij}(t) \ge L_M/S$
or
$\bar{p}_{ij}(t) \le (1 + 1/S)L_M - (S - 1)(t - \tau_{ij})$
if S ≥ 2 and B ≥ 3LMS/(S − 1).
proof. Assume that there is some time t when the lemma does not hold. More
specifically, let t be the earliest time when it is not true for some VOQ and let Vij be
the oldest VOQ that violates the lemma at time t.
Suppose first there is no event in [τij, t] at which there is an older VOQ that
precedes Vij. This implies that
pij(τij) ≤ τij − sij ≤ (1 + 1/S)LM
It also implies that there are no two consecutive events in [τij, t] between which an
older VOQ passes Vij. Consequently, by Lemma 13, marginij does not decrease
between any two consecutive events in [τij, t].
If βij > t then Vij is eligible for selection at every event in [τij, t]. This implies
that at every such event, the selected packet precedes Vij. Consequently,
pij(t) ≤ pij(τij)− (S − 1)(t− τij) ≤ (1 + 1/S)LM − (S − 1)(t− τij)
which contradicts our assumption that the lemma does not hold at t. Assume then
that βij ≤ t. By Lemma 5,
qij(βij) ≥ (1− 1/S)(B − 2LM)
and since
pij(βij) ≤ pij(τij)− (S − 1)(βij − τij)
it follows that
marginij(βij)≥ (1− 1/S)(B − 2LM)− ((1 + 1/S)LM − (S − 1)(βij − τij))
≥ (1− 1/S)B − (3− 1/S)LM
which is ≥ LM/S for B ≥ 3LMS/(S − 1). Since marginij does not decrease in [τij, t],
it follows that marginij(t) ≥ LM/S. This again, contradicts our assumption that the
lemma does not hold at t.
From the above, it follows that there must be some event in [τij, t] at which there
is an older VOQ that precedes Vij. Let t1 be the time of the latest such event. By
Lemma 14, there is a VOQ Vih for which pij(t1) ≤ pih(t1) and since Vih precedes Vij
at t1, qij(t1) ≥ qih(t1) and consequently, marginij(t1) ≥ marginih(t1). If t1 = t, then
since Vij is the oldest VOQ that does not satisfy the lemma at t, Vih does satisfy the
lemma. That is,
marginih(t) ≥ LM/S
or
pih(t) ≤ (1 + 1/S)LM − (S − 1)(t− τih)
If marginih(t) ≥ LM/S, then marginij(t) ≥ LM/S also. On the other hand, if
pih(t) ≤ (1 + 1/S)LM − (S − 1)(t− τih), then
pij(t)≤ (1 + 1/S)LM − (S − 1)(t− τih)
≤ (1 + 1/S)LM − (S − 1)(t− τij)
Once again, this contradicts our assumption that the lemma does not hold at t.
Consequently, we must have t1 < t and since t is the earliest time at which the lemma
is violated, either
marginij(t1) ≥ LM/S
or
pij(t1) ≤ (1 + 1/S)LM − (S − 1)(t1 − τij)
If marginij(t1) ≥ LM/S then by Lemma 13, marginij(t) ≥ LM/S also. Assume then,
that
pij(t1) ≤ (1 + 1/S)LM − (S − 1)(t1 − τij)
Now, if βij ≥ t, then Vij is eligible for selection at every event in [t1, t] and so
pij(t) ≤ (1 + 1/S)LM − (S − 1)(t− τij)
On the other hand, if t1 ≤ βij < t, it follows that
pij(βij) ≤ (1 + 1/S)LM − (S − 1)(βij − τij)
and since qij(βij) ≥ (1− 1/S)(B − 2LM),
marginij(βij)≥ (1− 1/S)(B − 2LM)− ((1 + 1/S)LM − (S − 1)(βij − τij))
≥ (1− 1/S)B − (3− 1/S)LM ≥ LM/S
Since marginij cannot decrease in [t1, t], it follows that marginij(t) ≥ LM/S. Once
again, this contradicts the assumption that the lemma does not hold at t.
That leaves one more case: βij < t1. Since marginij(t1) ≥ LM/S implies that
marginij(t) ≥ LM/S, we must have
pij(t1) ≤ (1 + 1/S)LM − (S − 1)(t1 − τij)
Also, since qij(βij) ≥ (1− 1/S)(B − 2LM),
qij(t1) ≥ (1− 1/S)(B − 2LM)− (t1 − βij)
and
marginij(t1)≥ (1− 1/S)(B − 2LM)− (t1 − βij)− ((1 + 1/S)LM − (S − 1)(t1 − τij))
≥ (1− 1/S)B + (S − 2)(t1 − τij)− (3− 1/S)LM
≥ (1− 1/S)B − (3− 1/S)LM ≥ LM/S
This completes the contradiction to our original assumption that the lemma does not
hold at time t. 
Corollary 2 If Vij is active at t ≥ fij +2LM/(S− 1), then for any RPLF scheduler,
marginij(t) ≥ LM/S, if S ≥ 2 and B ≥ 3LMS/(S − 1).
proof. For t ≥ fij + 2LM/(S − 1),
(1 + 1/S)LM − (S − 1)(t− τij) ≤ (1 + 1/S)LM + (S − 1)(τij − fij)− 2LM
≤ (1 + 1/S)LM + (1− 1/S)LM − 2LM ≤ 0
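Since Vij is active at t, $\bar{p}_{ij}(t) > 0$, so the second alternative of Lemma 16 cannot hold.
Consequently, the lemma implies that $\overline{\mathrm{margin}}_{ij}(t) \ge L_M/S$ for t ≥ fij + 2LM/(S − 1) and
since $\mathrm{margin}_{ij}(t) \ge \overline{\mathrm{margin}}_{ij}(t)$, marginij(t) ≥ LM/S also. 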
Proof of Theorem 5. Suppose that up until time t, the RPLF crossbar faithfully
emulates the output-queued switch with added delay T , but that at time t, the
output-queued switch begins to forward an ij-packet x, while the crossbar does not.
Now suppose that in the crossbar, one or more bits of x have reached Bij by time
t− LM/S. Note that the interval [t− LM/S, t) must contain at least one scheduling
event at output j and all such events must select packets that precede x. However,
this implies that during some non-zero time interval [t1, t], output j is continuously
receiving bits that precede x at a faster rate than it can forward them to the output.
This contradicts the fact that by time t the crossbar forwards all bits that precede x
(since it faithfully emulates the output-queued switch up until time t).
Assume then that at time t − LM/S, no bits of x have reached Bij. Since the
output-queued switch has a delay of T , f(x) ≤ t − T and so t − LM/S ≥ f(x) +
2LM/(S − 1). Since the crossbar has sent everything sent by the output-queued
switch up until time t, it follows that qij(t − LM/S) ≤ LM/S. By Corollary 2,
marginij(t − LM/S) ≥ LM/S and hence pij(t − LM/S) ≤ 0, which is not possible since
x is still in Vij at that time. 
7 Segment-Based Switching
Chuang et al. [3] showed that cell-based crossbars can emulate an output-queued
switch using any push-in, first-out (PIFO) queueing discipline. It is straightforward
to define PIFO scheduling policies that keep the cells of a packet together (simply
insert later arriving cells of a given packet right after their immediate predecessors).
This makes it possible to provide strong performance guarantees for packets, not just
cells, using variants of standard crossbar schedulers that are packet-aware. (Thanks
to the anonymous referee who made this observation in his insightful review of an
earlier version of this paper.) Note that this method may require that the output
line card forward cells that form the initial part of a packet, before all cells in the
packet are received, but this is feasible in this context, since the crossbar scheduler can
guarantee that the remaining cells are received by the time they are needed. While
packet-aware schedulers can provide packet-level performance guarantees in systems
that use cell-based crossbars, such systems still suffer from bandwidth fragmentation,
since packet lengths are generally not integer multiples of the cell length.
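
The insertion rule described above (place each later cell of a packet right after its immediate predecessor) can be sketched as follows. This is only an illustration under stated assumptions, not the scheduler of [3]: the names Cell and PacketAwarePifo, the list-based queue, and the rule that moves an insertion point forward to the next packet boundary are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    packet_id: int  # hypothetical packet identifier (assumption)
    index: int      # position of the cell within its packet (0 = first)


class PacketAwarePifo:
    """Push-in, first-out queue that keeps the cells of a packet adjacent."""

    def __init__(self):
        self._cells = []

    def push(self, cell, position=None):
        n = len(self._cells)
        if cell.index == 0:
            # The queueing discipline picks the position of a packet's first
            # cell (default: tail). Assumption: if that position would split
            # two queued cells of one packet, move it to the packet boundary.
            pos = n if position is None else max(0, min(position, n))
            while 0 < pos < n and (
                self._cells[pos - 1].packet_id == self._cells[pos].packet_id
            ):
                pos += 1
        else:
            # Later cells go right after their immediate predecessor. If the
            # predecessor has already departed, the packet is at the head.
            pos = 0
            for i, queued in enumerate(self._cells):
                if (queued.packet_id, queued.index) == (cell.packet_id, cell.index - 1):
                    pos = i + 1
                    break
        self._cells.insert(pos, cell)

    def pop(self):
        """Remove and return the cell at the head of the queue."""
        return self._cells.pop(0)

    def packet_order(self):
        return [c.packet_id for c in self._cells]


q = PacketAwarePifo()
q.push(Cell(1, 0))
q.push(Cell(2, 0))
q.push(Cell(1, 1))              # lands directly behind cell 0 of packet 1
q.push(Cell(3, 0), position=1)  # pushed in, but not between packet 1's cells
q.push(Cell(2, 1))
print(q.packet_order())         # [1, 1, 3, 2, 2]
```

The sketch does not model transmissions in progress or the timing guarantees discussed in the text; it only shows that the cells of each packet stay contiguous when packets are pushed in ahead of others.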
One possible objection to the use of crosspoint buffers that are large enough to
hold packets is that they might be too expensive, even for modern integrated circuit
components. A 32-port crossbar in which every crosspoint buffer is large enough to hold
two 1500-byte packets would require a total of more than 3 MB of SRAM. In [10], the
authors propose switching variable length segments rather than cells, as a way of
addressing the fragmentation problem with fixed-size cells. If this is coupled with
a packet-aware crossbar scheduler that provides performance guarantees for variable
length packets, we can reduce the crossbar buffer size to a multiple of the maximum
segment length. For IP routers, a maximum segment length of 80 bytes is sufficient
to eliminate bandwidth loss due to fragmentation effects. Even after adding 20 bytes
for header information, this reduces the required buffer size by a factor of 15, making
it small enough to be easily accommodated within the constraints of current circuit
technologies.
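
The buffer sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes one buffer per crosspoint (32 × 32 of them) and assumes that the segment-based design keeps the same multiple, two units per crosspoint, as the packet-based one; the function name is a hypothetical helper introduced only for this check.

```python
def crossbar_buffer_bytes(ports, units_per_crosspoint, unit_bytes):
    """Total crosspoint memory for a ports x ports buffered crossbar."""
    return ports * ports * units_per_crosspoint * unit_bytes


PORTS = 32

# Packet-sized buffers: two 1500-byte packets per crosspoint.
packet_buffers = crossbar_buffer_bytes(PORTS, 2, 1500)

# Segment-sized buffers: two segments per crosspoint (assumed multiple),
# each an 80-byte segment plus 20 bytes of header information.
segment_buffers = crossbar_buffer_bytes(PORTS, 2, 80 + 20)

print(packet_buffers)                    # 3072000 bytes, just over 3 MB
print(segment_buffers)                   # 204800 bytes
print(packet_buffers / segment_buffers)  # 15.0
```

The factor of 15 does not depend on the port count or on the multiple chosen, as long as both designs use the same values; it is simply 1500/(80 + 20).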
Also observe that in a segment-based system, an input line card can forward seg-
ments to an output before all segments of the packet have been received. The perfor-
mance guarantee for the crossbar ensures that the remaining segments are transferred
through the crossbar in time to be forwarded on the outgoing link, if the system is
operated with a speedup of 2. Thus, we reduce both the amount of buffering required
and the delay.
8 Concluding Remarks
The results of sections 5 and 6 can be extended to systems that place different con-
straints on where and when packets are buffered. In particular, most routers buffer
packets at both input and output line cards, not just at the inputs. Modifying the
analysis to handle this case is straightforward and requires only that the value of T be
increased by LM/S, to accommodate the added delay for a maximum length packet
to be fully buffered at the outputs.
With an asynchronous crossbar, it is possible to build a system in which packets
pass from inputs to outputs without ever being fully buffered. This is known as cut-
through switching [7] and can provide superior delay performance when load is light.
While our results cannot be directly applied to such systems, it seems likely that
similar results could be developed for this model. Indeed, the segment-based switches
already approach the behavior of a cut-through switch.
There are several ways the work described here can be extended. First, there
are opportunities for tightening the results, particularly with respect to the crossbar
buffer size. There seems to be no intrinsic reason that PLF should require a larger
crossbar buffer size than PGV. An analysis that directly compares the behavior of a
PLF scheduler to a PGV scheduler may be able to reduce the buffer size requirement
for PLF.
It would also be interesting to see if the analysis techniques can be extended to
provide stronger performance guarantees. In particular, it would be useful to show
that an asynchronous buffered crossbar can emulate an output-queued switch using
any PIFO queueing discipline, not just any restricted PIFO discipline. The difficulty
in making the transition from restricted PIFO queueing disciplines to unrestricted
PIFO disciplines is that once a packet is in a crossbar buffer, there is no way for
a later arriving packet from the same input to reach the output line card before it
does, even if the queueing discipline gives it higher priority. Reference [4] describes
several techniques that can be used to allow cell switches using buffered crossbars to
overcome this crosspoint blocking phenomenon. It seems likely that these methods
can be generalized to accommodate asynchronous crossbars.
Still another direction to explore is how scheduling algorithms that deliver strong
performance guarantees when operated with a speedup of 2 perform when operated
with a smaller speedup. Since the crossbar cost increases in direct proportion to the
speedup, there are practical reasons to be interested in the performance of systems
with smaller speedup, even if they are not able to deliver strong performance guaran-
tees. A comprehensive simulation study exploring how such systems perform under
a wide range of conditions would have considerable practical value.
References
[1] Anderson, T., S. Owicki, J. Saxe and C. Thacker. “High speed switch scheduling
for local area networks,” ACM Trans. on Computer Systems, 1993.
[2] Attiya, H., D. Hay and I. Keslassy. “Packet-Mode Emulation of Output-Queued
Switches,” Proc. of ACM SPAA, 2006.
[3] Chuang, S.-T., A. Goel, N. McKeown and B. Prabhakar. “Matching output queueing
with a combined input output queued switch,” IEEE J. on Selected Areas in
Communications, 12/1999.
[4] Chuang, S.-T., S. Iyer and N. McKeown. “Practical Algorithms for Performance
Guarantees in Buffered Crossbars,” Proc. of IEEE INFOCOM, 3/2005.
[5] Ganjali, Y., A. Keshavarzian and D. Shah. “Input queued switches: cell switching
vs. packet switching,” Proceedings of IEEE INFOCOM, 2003.
[6] Iyer, S., R. Zhang, and N. McKeown, “Routers with a Single Stage of Buffering,”
Proceedings of ACM SIGCOMM, 9/2002.
[7] Kermani, P. and L. Kleinrock. “Virtual Cut-Through: A New Computer Com-
munication Switching Technique,” Computer Networks, 267–286, 1979.
[8] Krishna, P., N. Patel, A. Charny and R. Simcoe. “On the speedup required for
work-conserving crossbar switches,” IEEE J. on Selected Areas in Communications,
6/1999.
[9] Katevenis, M., G. Passas, D. Simos, I. Papaefstathiou, N. Chrysos. “Variable
Packet Size Buffered Crossbar (CICQ) Switches,” Proceedings IEEE Interna-
tional Conference on Communications, pp. 1090-1096, 6/2004.
[10] Katevenis, M., G. Passas. “Variable-Size Multipacket Segments in Buffered
Crossbar (CICQ) Architectures,” Proceedings IEEE International Conference
on Communications, 5/2005.
[11] Leonardi, E., M. Mellia, F. Neri, and M.A. Marsan, “On the stability of input-
queued switches with speed-up,” IEEE/ACM Transactions on Networking, Vol.
9, No. 1, pp. 104–118, 2/2001.
[12] Magill, B., C. Rohrs, R. Stevenson, “Output-Queued Switch Emulation by Fab-
rics With Limited Memory,” IEEE Journal on Selected Areas in Communica-
tions, pp. 606–615, 5/2003.
[13] Marsan, M. A., A. Bianco, P. Giaccone, E. Leonardi and F. Neri. “Packet-Mode
Scheduling in Input-Queued Cell-Based Switches,” IEEE/ACM Transactions on
Networking, 2002.
[14] McKeown, N. “iSLIP: a scheduling algorithm for input-queued switches,” IEEE/ACM
Trans. on Networking, 4/1999.
[15] McKeown, N., A. Mekkittikul, V. Anantharam, and J. Walrand. “Achieving
100% Throughput in an Input-Queued Switch,” IEEE Trans. on Communica-
tions, Vol. 47, No. 8, 8/1999.
[16] Mhamdi, L. and M. Hamdi. “MCBF: A High-Performance Scheduling Algorithm
for Buffered Crossbar Switches,” IEEE Communications Letters, 2003.
[17] Nojima, S., E. Tsutsui, H. Fukuda and M. Hashimoto. “Integrated Services Packet
Network Using Bus Matrix Switch,” IEEE Journal on Selected Areas in
Communications, 10/1987.
[18] Rodeheffer, T. and J. Saxe. “An Efficient Matching Algorithm for a High-
Throughput, Low-Latency Data Switch,” Compaq Systems Research Center,
Research Report 162, 11/5/98.
[19] Rojas-Cessa, R., E. Oki, Z. Jing and H. J. Chao. “CIXB-1: Combined
Input-One-cell-Crosspoint Buffered Switch,” IEEE Workshop on High Performance
Switching and Routing, 7/2001.
[20] Stevens, D. and H. Zhang. “Implementing Distributed Packet Fair Queueing in
a Scalable Switch Architecture,” Proc. of IEEE Infocom, 1998.
[21] Turner, J. “Strong Performance Guarantees for Asynchronous Crossbar Sched-
ulers,” Proceedings of Infocom, 2006.
