On Buffered Clos Switches by Krishnan, Santosh & Schulzrinne, Henning G.
Technical Report






There is a widespread interest in switching architectures that can scale in capacity with increasing
interface transmission rates and higher port counts. Furthermore, packet switches that provide
Quality of Service (QoS), such as bandwidth and delay guarantees, to the served user traffic are
also highly desired. This report addresses the issue of constructing a high-capacity QoS-capable,
multi-module switching node.
Output queued switches provide the best performance in terms of throughput as well as QoS
but do not scale. Input queued switches, on the other hand, require complex arbitration procedures
to achieve the same level of performance. We enumerate the design constraints in the construction
of a packet switch and present several approaches to build a system composed of lower-capacity
memory and space elements, and analyze their performance. Towards this goal, we establish a
new taxonomy for a class of switches, which we call Buffered Clos switches, and present a formal
framework for optimal packet switching performance, in terms of both throughput and QoS.
Within the taxonomy, we augment the existing combined input-output queueing (CIOQ) sys-
tems with Aggregation and Pipelining techniques. Furthermore, we present the design and analysis
of a novel parallel packet switch architecture. For the items in the taxonomy, we present algorithms
that provide optimal throughput and QoS, in accordance with the above performance framework.
While some of the presented ideas are still in the investigative stage, we believe that the current
state of the work, especially the formal treatment of switching, will be beneficial to the ongoing
research in the field.
Contents
1 Introduction
1.1 Motivation
1.2 Our Contributions
1.2.1 Applicability and Scope
1.3 Outline of the Report
2 Switching Framework
2.1 Notions of Optimal Performance
2.2 Forwarding Models
2.3 Basic Switching Elements: Building Blocks
2.4 Common Existing Solutions
3 Related Work
3.1 Clos Network
3.2 IQ Switch Algorithms
3.2.1 Size Matchings
3.2.2 Weight and Rate Matchings
3.2.3 OQ Emulation
3.3 QoS Methodologies
3.3.1 QoS in Input Queued Switches
4 Buffered Clos switches
4.1 Switch Model
4.1.1 Functional Equivalence
4.2 CIOQ: Some Loose Ends
4.2.1 Strict Relative Stability
4.2.2 Envelope and Batch Switching
4.3 Single-Path Buffered Clos Switches
4.3.1 CIOQ-A: Aggregation
4.3.2 CIOQ-P: Pipelining
4.3.3 G-MSM: Memory Space Memory
4.4 Parallel Packet Switches
4.4.1 Flow-Based PPS
4.4.2 Packet-by-Packet PPS
4.5 QoS in Multi-Module Switches
5 Conclusions
A Switching Primer
A.1 Properties of Circuit Switches
A.2 Packet Switch Design
B Scheduling and Buffer Management
B.1 Link Scheduling
B.2 Buffer Management
C Proofs of Theorems
C.1 Single-Path Switches





The past decade has witnessed a manifold increase in the demand for switches that employ packet-
based forwarding. The networking protocols supported by these systems vary from connection-
oriented to connectionless ones, including Asynchronous Transfer Mode (ATM), Multi-Protocol
Label Switching (MPLS) [60], the ubiquitous Internet Protocol (IP), and switched Ethernet. Ir-
respective of the specific protocol supported, these switches consist of a forwarding path that
transfers data units, called a packet in the generic sense, from the input to the output interfaces,
optionally providing preferential treatment, or Quality of Service (QoS) guarantees, to specified
user flows.
Widespread consumer and business adoption of the Internet, together with improvements in the
physical layer technologies, has resulted in increased demand for packet switches with very large
capacities, in terms of both the number of supported interfaces as well as the interface transmission
rates. For example, a typical IP router in production today supports 10 to 20 ports, each at 10 Gb/s,
requiring a forwarding path capacity in the vicinity of 200 Gb/s. Since packet sizes have not
changed materially with the increasing rates, assuming a worst case of 64-byte packets, the router
also requires a packet processing capacity of about 20 million packets per second per port. This represents at
least a tenfold increase in capacities since just half a decade ago. Moreover, this trend is expected
to continue with the anticipated adoption of 10-G Ethernet in the metropolitan area networks, and
OC-192 (10 Gb/s) and OC-768 (40 Gb/s) links in the long haul. This motivates the design of
switch architectures that can scale with increasing rates and port counts. In general, an increase in
the desired forwarding path capacity imposes design constraints on the construction of the switch,
while the packet processing capacity constrains the complexity of the algorithms that operate on a
per-packet basis.
As an orthogonal trend, in an attempt to benefit from the efficiencies of a packet-based net-
work, traditionally circuit-switched network operators are migrating towards the former in order
to support customer flows that require service guarantees or QoS. Examples of such flows include
Voice-over-IP (VoIP) sessions, packetized Video-on-Demand, and MPLS Virtual Private Network
(VPN) tunnels with guaranteed bandwidth between office locations. The implication of this trend
for switch design is the need to implement scheduling and buffer management techniques in the
forwarding paths, so that the specified service requirements may be met.
This report addresses both the above issues, and presents architectures and algorithms that can
be used to construct the forwarding path of a large-scale packet switch that also guarantees QoS
to the supported flows. Towards this goal, we propose a new taxonomy for a class of scalable
packet switches, together with an analytical framework to characterize different levels of optimal
performance for the algorithms employed within such switches. At present, some of the results
are at a conjectural stage; nonetheless, we believe that the current state of the work, especially the
formal treatment of switching, will be beneficial to the ongoing research in switching.
1.1 Motivation
There is a vast amount of prior work in the design and analysis of circuit switches (see, for exam-
ple, [29] for an overview, and Appendix A for a primer on basic switching concepts and commonly
used terms). Circuit switching has benefitted immensely from a well established performance
framework, namely, different levels of blocking behavior, which characterizes the performance of
a given design in isolation from the properties of specific circuit arrival processes. All non-blocking
circuit switches are functionally equivalent in terms of the circuits that they can admit. This has
driven much of the work in the construction of scalable circuit switches using an inter-connection
network of smaller components, examples of which include the Banyan, Benes, Clos and Cantor
networks.
Traditional packet switching theory has borrowed, often erroneously, many of the terms, con-
structs, and performance measures from the former. However, packet switching is fundamentally
different due to the presence of external link contention and buffers in the switch components.
There exists no uniform framework that characterizes optimal performance, or which provides a
notion of functional equivalence with an ideal switch. Consequently, several conflicting measures
have been used to claim optimality. For example, the term blocking, which is really a circuit
switching concept, has been variously used to mean either a blocking structure, or head-of-line
blocking, or non-work conserving. Some other approaches use the notion of 100% throughput
(100% compared to what?) to characterize optimal performance, while quite a few study a switch
design by computing the loss ratios, for the finite buffers within the switch, for specific arrival
processes (most likely an artifact of the arrivals themselves rather than that of the switch design).
In summary, a framework which is able to encompass optimal throughput and QoS properties of a
switch design will be beneficial towards the work in complex scalable packet switches.
The problem of scaling packet switches by using multiple stages of elements has been ad-
dressed in the past, and a taxonomy of ATM switch architectures can be found in [57]. However,
most of the work deals with the construction of the inter-connection network within the switch, the
focus being solely on the structural blocking property of the network. Further studies (e.g., [7])
have characterized the throughput performance of a specific class of Banyan-based topologies.
Though such works have enabled the implementation of practical multi-stage switches, providing functional
equivalence with an ideal switch remains an open topic of research for many of the simple
non-blocking networks augmented with buffers at arbitrary points within them. This report takes a
small step in that direction.
1.2 Our Contributions
We address the problem of constructing a high capacity packet switch, using stages of lower capac-
ity memory elements and non-blocking space elements. We restrict ourselves to designs that are
structurally non-blocking and resemble the three-stage Clos network, with the new performance
framework of functional equivalence. For switches with this structure, which we call Buffered
Clos switches, we establish a taxonomy of designs driven not by the switch structure alone but by
the methodologies employed to overcome various design constraints. Each item in the taxonomy
is accompanied by the constraint it resolves, so that switch implementors may identify the most
appropriate one to use. While some items in the classification, such as the combined input-output
queueing (CIOQ) switch, are already addressed to a large extent in the recent literature, the performance
framework and the taxonomy itself are novel. In addition to providing a road-map to
construct scalable switches, the taxonomy also allows one to ascertain the properties of existing vendor equipment
by inspecting the structure and the associated algorithms for functional equivalence.
Within the above taxonomy, we also present the design and analysis of CIOQ switches with
aggregation and pipelining, two crucial ingredients to scaling. These fall into the category of
single-path buffered Clos switches. Some existing switches appear similar to these; nevertheless,
the presented algorithms and analytical treatment are novel. Furthermore, we propose and analyze
a new parallel packet switch architecture, which belongs to the multi-path category. Finally, we
propose a new methodology to enable QoS guarantees by combining certain levels of functional
equivalence with existing scheduling schemes.
1.2.1 Applicability and Scope
This work is restricted mainly to the treatment of unicast traffic flows. While the ability to handle
multicast traffic is certainly relevant to packet switching, we do not deal with it in any level of
detail in order to keep this work tractable. For the same reason, we also do not include specialized
node algorithms that may be used to optimize the performance of adaptive traffic. Our assumption
is that the optimal throughput property by itself is beneficial to adaptive traffic, and any other
mechanisms, such as active queue management (AQM) to enforce fairness among adaptive flows,
can be added on to this work without much difficulty. In addition, we do not address the end-to-end
network behavior, and network engineering. Instead, we concentrate on node mechanisms whose
suitability for end-to-end behavior may be studied independently. Finally, this work is agnostic
of the protocol processing specifics and control plane handling, e.g., signaling the QoS flows, and
route control.
1.3 Outline of the Report
The remainder of this report is organized as follows. Chapter 2 introduces the switching model
including the components used and meaningful notions of optimal performance. Chapter 3 covers
the related work in switch design and provisions for QoS. Chapter 4 contains our original con-
tributions, including the taxonomy of Buffered Clos switches and the framework of functional
equivalence. It also includes the completed work on aggregated and pipelined CIOQ switches,
the parallel packet switch architecture, and our proposed QoS methodology. Finally, Chapter 5 presents our conclusions.




We now present the switching framework, including some meaningful notions of optimal perfor-
mance. We then overview the models for the forwarding path of a packet switch, and their atomic
components. Then, we briefly introduce the output queued and input queued switching models,
the former representing the ideal reference switch for equivalence purposes.
2.1 Notions of Optimal Performance
A circuit request between an input-output link pair, in the context of circuit switching, is consid-
ered admissible if there is capacity available on the specified links. A non-blocking switch ensures
that a path within the switch is realizable for every admissible circuit request. Since circuit admissibility
on the external links constrains the maximum possible circuit throughput, we may assert the
following:
Proposition 1 The non-blocking property is necessary and sufficient to ensure optimal throughput
of a circuit switched node.
Similarly, in the context of packet switching, a flow between an input-output link pair with a
given effective bandwidth is considered admissible if the bandwidth is available on the specified
links. The successful admission of this flow depends solely on the existence of a switching path,
with the said bandwidth (and associated buffers) available throughout the path. In other words, if
the flow bandwidth can be accommodated on the input and output links of the switch, this existence
ensures that the same can be accommodated throughout the switching path. By definition, since the
external links impose a physical limit on the total amount of flow bandwidth that can be admitted
into the switch and consequently any notion of an optimal set of flows, by recognizing the analogy
with circuit switched nodes, we may assert the following (without analytical proof):
Proposition 2 A packet switch design which is structurally equivalent to a non-blocking circuit
switch is necessary and sufficient to admit an optimal set of flows with bandwidth requirements.
The sufficiency assumes that the scheduling and buffer management schemes can guarantee
the allocations, itself a non-trivial task. We will refer to such switches as structurally non-blocking
packet switches, and the act of allocating bandwidth and buffer resources throughout the switching
path as flow fitting. Since much of the early work was based on ATM switches, wherein the entire
traffic undergoes CAC and all the flow specifications are known prior to actual packet transmission,
the ability to fit (and schedule) flows was the primary objective, leading to vendor claims of optimal
performance based only on the non-blocking property. Even though this may constitute the first
step in packet switch design by enabling flow fitting, ensuring a non-blocking structure cannot be
equated to providing optimal throughput.
Limited by the physical constraints of the input and output links, maximum packet-switched
throughput may be achieved as long as an output link of the switch never idles when a packet
destined to it is backlogged anywhere within the switch. Switches with this property are referred
to as work conserving and a given switch design is considered ideal (in throughput) if it satisfies
this property. If it is difficult or impossible to prove this property, one may justifiably claim op-
timality by proving a functional equivalence with a well known ideal switch. Similar to the case
of circuit switch design, where the non-blocking property ensured functional equivalence with a
well known optimal switch (e.g., a single-stage crossbar), an unambiguous framework is required
to characterize functional equivalence of packet switch designs.
Proposition 3 A packet switch which is functionally equivalent to a well known ideal switch is
optimal in throughput.
We can identify a few approaches to functional equivalence, which address the construction
of a switch rather than the specifics of the arrival traffic processes. The first relies on constructing
a switching fabric in such a manner that the packets depart from the outputs in an exact
emulation of the departure from a well known ideal switch of the same dimensions, for any
traffic pattern. The second approach relaxes the constraint of exact emulation and instead borrows
from the concept of stability of queueing systems [55]. A switch with a set of queues is said to be
absolutely stable if the expected queue occupancies are finite. Clearly, absolute stability depends
on the characteristics of the arrival process in addition to the design of the switch. Therefore, we
refer to a fabric as relatively stable with respect to a reference switch if it is stable for a broad class
of arrival processes for which the reference switch is also stable. We may refer to the fabric as relatively
stable in the strict sense if the subset of queues that are stable in the reference switch remain so in
the fabric, under any arrival pattern. The strict sense property allows us to consider traffic patterns that
can lead to (partial) instability in the ideal switch, and to handle them in the same manner as the latter. The final approach
further relaxes the stability constraint by relying on a prior knowledge of flow rates. If the rates
of all the flows (the actual offered rates, not the rate requirements) traversing the switch can be
pre-computed, optimal throughput is achieved just by the ability to fit flows, i.e., the structural
non-blocking property.
To summarize, the structural non-blocking property, along with the appropriate scheduling and
buffer management policies, provides the framework for QoS in a packet switch, and a functional
equivalence with an ideal switch provides the framework for throughput performance. Optimal
packet switching performance may be achieved by the simultaneous satisfaction of both of the
above.
2.2 Forwarding Models
The first generation routers, and many of the current lower speed ones, perform software-based
packet forwarding. A general purpose CPU is connected to multiple line cards through an I/O
bus, as shown in Fig. 2.1(a). The cards perform the physical and link layer protocol processing



























Figure 2.1: Forwarding models: (a) Centralized CPU, (b) Cut-through
the outgoing interface and any special handling, if applicable. Buffers are distributed among the
line cards and the CPU memory, to account for the output, I/O bus and CPU contention. The
disadvantage of this model is that the limited communication bandwidth of the I/O bus and the
software module on the CPU present significant bottlenecks to packet throughput, and thereby
limit the port counts and the interface rates that can be supported. For example, a $16 \times 16$ switch,
with 1 Gb/s ports, requires (i) a processing capability of 32 million (64-byte) packets per second,
which translates to an upper bound of 32 (instructions plus memory) cycles of a 1 GHz CPU; (ii)
an I/O bandwidth of 32 Gb/s, compared to the 8 Gb/s available using the latest 64-bit 133 MHz
PCI bus; and (iii) a memory bandwidth of 32 Gb/s, allowing merely 16 ns for a memory transfer
even assuming a total bus width of 64 bytes. Clearly, these are tall orders even for the moderately
sized switch of the example. Consequently, the path traversed by the packet through the central
CPU is referred to as the slow path.
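The figures quoted in this example can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a 16-port switch (which is what the quoted numbers imply), that every packet crosses the I/O bus twice (card to CPU and back), and that every packet is written to and read from memory once. The function name and dictionary keys are illustrative and do not come from the report.

```python
def centralized_requirements(ports, rate_bps, pkt_bytes, cpu_hz):
    """Resource demands of a shared-CPU forwarding path (illustrative only)."""
    pkt_bits = pkt_bytes * 8
    total_pps = ports * rate_bps / pkt_bits          # worst-case aggregate packet rate
    shared_bps = 2 * ports * rate_bps                # assumed: two crossings per packet
    return {
        "total_pps": total_pps,
        "cycles_per_packet": cpu_hz / total_pps,     # CPU budget per packet
        "io_bps": shared_bps,                        # I/O bus: card -> CPU -> card
        "mem_bps": shared_bps,                       # memory: one write plus one read
        "transfer_ns": pkt_bits / shared_bps * 1e9,  # time per 64-byte-wide transfer
    }


req = centralized_requirements(ports=16, rate_bps=1e9, pkt_bytes=64, cpu_hz=1e9)
for name, value in req.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the aggregate rate comes to 31.25 million packets per second, which the text rounds to 32 million, leaving a budget of 32 CPU cycles per packet.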
Modern designs use the cut-through model illustrated in Fig. 2.1(b). The incoming packets
undergo header processing in the port processor cards attached to each interface, and are directly
routed through the switch fabric, or the fast path, to the corresponding output interfaces. The
central CPU is connected via a special switch port. Only the control traffic is routed to the CPU,
which maintains the overall state of the switch and programs the cards. In addition to the physical
and link layer processing, the components on the cards are responsible for address lookup, flow
filtering, policing, statistics collection and other protocol specific tasks, and for appending the
results of those operations into a special local header. For the above example, this model requires
a processing capability of 2 million packets per second in each card, a switching fabric aggregate
capacity of 16 Gb/s, and a memory bandwidth that depends on the switch design, but no more
than 32 Gb/s. Communication IC vendors already offer network processors that implement those
components in silicon at interface rates of 10 Gb/s, processing capabilities of 20-30 million packets





































Figure 2.2: Forwarding elements: (a) memory element, (b) space element
2.3 Basic Switching Elements: Building Blocks
A fabric may be constructed using one or more instances of basic forwarding elements, of two
types: the memory element and the space element. An $N \times N$ memory element, shown in
Fig. 2.2(a), accepts packets from the $N$ input interfaces and enqueues them into one or more (in
the case of multicast) queues, which are arranged into groups, one for each of the $N$ outputs. Note
that the queues need not be physically separate in memory, and may indeed share packet buffers
from a common pool. The structure of a group may be as simple as a single output queue per interface,
or may allow grouping packets on the basis of several criteria such as input interface, flow
identity, and priority, in order to facilitate preferential treatment. A buffer management logic is
responsible for admitting packets into the queues, while a scheduler logic at each output interface
dictates the outgoing packet rate and the relative order of the packets. The latter may vary from
a first-in first-out (FIFO) policy acting on a single output queue to a sophisticated weighted fair
queueing (WFQ) policy acting on per-flow queues. Given an interface rate of $R$ and a minimum
packet size of $L$, the element needs a memory bandwidth of $(N+1)R$ at each of the outputs to
accommodate $N$ simultaneous enqueue operations (from the $N$ inputs) and a single dequeue operation.
At the same time, the scheduler logic needs to dequeue at a frequency of $R/L$ to sustain
the output interface rate. Hence, the dimensions and the rates that can be supported by a single
memory element are limited by the available memory bandwidth, while the interface rates along
with the smallest packet size limit the complexity of the scheduler. Memory elements available
today operate in the order of tens of Gb/s.
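To make the scaling constraint concrete, the sketch below evaluates the per-output memory bandwidth (one enqueue per input plus one dequeue) and the scheduler decision frequency (interface rate divided by minimum packet size) for a few element sizes. The function name, the parameter names, and the choice of 10 Gb/s interfaces with 64-byte packets are assumptions made for illustration only.

```python
def memory_element_requirements(n_inputs, rate_bps, min_pkt_bytes):
    """Return (memory bandwidth per output, dequeue frequency, time per decision)."""
    min_pkt_bits = min_pkt_bytes * 8
    mem_bw_bps = (n_inputs + 1) * rate_bps    # n_inputs enqueues plus one dequeue
    dequeue_hz = rate_bps / min_pkt_bits      # one scheduling decision per minimum-size packet
    decision_ns = 1e9 / dequeue_hz            # time available for each decision
    return mem_bw_bps, dequeue_hz, decision_ns


for n in (4, 16, 64):
    bw, hz, ns = memory_element_requirements(n, rate_bps=10e9, min_pkt_bytes=64)
    print(f"N={n:3d}: memory {bw / 1e9:6.0f} Gb/s per output, "
          f"{hz / 1e6:.2f} M decisions/s, {ns:.1f} ns each")
```

The memory bandwidth grows linearly with the number of inputs, while the time available per scheduling decision depends only on the interface rate and the packet size.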
A space element, shown in Fig. 2.2(b), is composed of a bufferless cross-connect, and an arbiter
(possibly distributed) that is responsible for configuring the inputs and outputs into a matching.
Packets on the input interfaces are segmented into equal sized cells, if necessary, and in each
synchronous time unit of the element, called a timeslot, a single cell is transferred from each matched
input to its output, as configured by the matching. The segmentation and re-assembly of packets is not
the responsibility of the element, and is typically implemented in one of the components of the
line cards. The matching is re-computed at every timeslot, and therefore the arbitration logic needs
to operate at a frequency of $R/L$, i.e., once per cell time at the interface rate, while the complexity
of the logic is determined by the size of the element. While it is the frequency and complexity of
arbitration that eventually limit the dimensions of a single element, we should note that the chip size
and the underlying technology also affect the interface counts and rates that may be supported. Electronic
crossbars available today are capable of interface rates in the order of 1-10 Gb/s, with port counts in the
order of 20-30, and can be re-configured in less than 100 ns, thereby providing a very small timeslot. Optical
components such as wavelength routers [71] can support a significantly higher interface rate, limited only
by the optical physical layer. However, such components are expensive, the port counts supported
by the currently available ones are lower, and most significantly, the re-configuration times are in
the order of a few $\mu$s, thus providing a very coarse-grain timeslot.
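The behavior of a space element in one timeslot can be sketched as follows. This is an illustrative model only: the matching is represented as a dictionary from input port to output port, each input presents at most one cell, and the helper names are hypothetical.

```python
def is_matching(matching, n):
    """True if `matching` (dict input -> output) uses valid ports and each output at most once."""
    outputs = list(matching.values())
    ports_ok = all(0 <= p < n for p in list(matching) + outputs)
    return ports_ok and len(set(outputs)) == len(outputs)


def switch_timeslot(cells_in, matching):
    """Transfer one cell per matched input-output pair; the element itself holds no buffers."""
    n = len(cells_in)
    if not is_matching(matching, n):
        raise ValueError("not a matching: an output is claimed twice or a port is out of range")
    cells_out = [None] * n
    for inp, out in matching.items():
        cells_out[out] = cells_in[inp]   # unmatched inputs keep their cell upstream
    return cells_out


# Three-port element: input 0 is matched to output 2, input 1 to output 0.
print(switch_timeslot(["a", "b", "c"], {0: 2, 1: 0}))
```

Because each input appears at most once as a dictionary key, only the outputs need an explicit uniqueness check. Cells at unmatched inputs (here, input 2) are not transferred and must wait in whatever memory element feeds the space element.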
Note that we consider the basic elements as logical entities, i.e., as black boxes. The imple-
mentation of a memory element subsumes the existing work on different internal logic that may be
employed to exhibit the behavior described above. Similarly, the techniques from circuit switching
may be used to construct a large non-blocking inter-connection network that emulates a single
$N \times N$ logical element.
2.4 Common Existing Solutions
A packet switch that employs the cut-through forwarding model, with a single memory element
for the switch fabric is referred to as an output queued (OQ) switch. Several of the early switches
used this design, an overview of which can be found in [69], in which a few alternative approaches
were used to design the high-capacity queueing logic. For example, the Prelude switch [21]
uses a shared timeslot exchanger to realize output queueing, whereas the knockout switch [72]
uses $N$ parallel shared buses for the same purpose. While the implementation of the queueing
logic is beyond our consideration, it suffices to note that pure output queued designs are rarely
used in current high speed switches due to the significant memory bandwidth bottleneck. Output
queueing may also be emulated by physically grouping queues on the basis of input-output pairs,
and either providing $N^2$ disjoint paths between the $N$ inputs and $N$ outputs to form a bus-matrix
switch, or placing those queues at the crosspoints of an $N \times N$ crossbar. While both these
approaches reduce the memory bandwidth requirement to $2R$, where $R$ is the interface rate, the chip
size imposes a restriction on their dimensions.
Assuming that the port processors and the queueing logic operate at (interface) wire speed,
an output queued switch does not suffer from any internal contention, and hence has the ability
to be work conserving. Furthermore, since packets are directly presented to the output queues
without any intermediate delay, the flow rates and packet delays can be tightly controlled by the
output scheduler. Consequently, given the required interface counts and rates, a hypothetical output
queued switch of those dimensions may be considered an ideal reference switch for performance
comparisons.
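The work conserving behavior of the reference switch can be illustrated with a toy timeslot model. The sketch below assumes fixed-size cells, one FIFO queue per output, and one departure per output per timeslot; the class and method names are hypothetical, and a real output scheduler could apply any policy in place of FIFO.

```python
from collections import deque


class OutputQueuedSwitch:
    """Toy OQ switch: arriving cells go straight to their output queue."""

    def __init__(self, n_ports):
        self.queues = [deque() for _ in range(n_ports)]

    def timeslot(self, arrivals):
        """`arrivals` is a list of (output_port, cell), at most one per input."""
        for out, cell in arrivals:
            self.queues[out].append(cell)   # no internal contention before the queue
        # Work conserving: an output idles only when its own queue is empty.
        return [q.popleft() if q else None for q in self.queues]


sw = OutputQueuedSwitch(3)
print(sw.timeslot([(0, "a"), (0, "b"), (0, "c")]))  # all three inputs target output 0
print(sw.timeslot([(1, "d")]))
print(sw.timeslot([]))
```

In the first timeslot all three inputs write to the same output queue at once, which is exactly the memory bandwidth burden that keeps pure output queueing from scaling.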
In contrast to the above, an input queued (IQ) switch concentrates all the packet buffers at the
input interfaces of the system. The switch fabric consists of a single logical $N \times N$ space element,
with $N$ memory elements of size $1 \times 1$ connecting each of the input interfaces to the input ports of the
space element. The memory elements may be collocated within the port processing cards of the switch. Notice
that such a design suffers from internal contention at the inputs, i.e., the memory elements do not
function in a work conserving manner and are driven by the sequence of matchings computed by
the arbiter of the space element. Therefore, while the maximum memory bandwidth required for
an input queued switch is $2R$ (one enqueue and one dequeue at the interface rate $R$) irrespective of port count, complex arbitration policies are required
to achieve functional equivalence with a reference ideal switch. This has been one of the primary





We now review specific parts of the literature that relate to this work. First, we show how to
construct a Clos network, in the context of circuit switching, and discuss its applicability to packet
switches. Next, we outline some of the recent results in IQ and CIOQ switches, which form the
starting point of our taxonomy. Finally, we review the work on QoS methodologies for packet
switches.
3.1 Clos Network
The Clos network is the simplest three-stage non-blocking arrangement, composed of space ele-
ments in each stage. While it was first introduced and analyzed in the 1950s, it remains relevant
to this date, and we borrow this overview from Pattavina’s more recent work [57]. As shown in
Fig. 3.1, an $N \times N$ switch is composed of three fully connected stages, where each link has the
same capacity. The first stage is composed of $k$ instances of $n \times m$ non-blocking space
elements (so that $N = nk$). These are inter-connected to the third stage, composed of $k$ instances of
$m \times n$ elements, using $m$ elements of size $k \times k$ in the central stage. The network is non-blocking in
the strict sense if and only if $m \ge 2n - 1$ (Clos theorem). The speedup required between
the first and second stages can be easily calculated as $2 - 1/n$. Provided the above condition
is satisfied, given an idle input-output pair, a greedy algorithm which visits each of the central
elements is guaranteed to find a path in $O(n)$ time.
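The greedy path search can be sketched as a first-fit scan over the central elements. The code below is a minimal illustration using the conventional Clos notation, which is an assumption here: `n` ports per first- and third-stage element, `m` central elements, and `k` elements in each outer stage. It tracks only link occupancy and expects the caller to request idle ports.

```python
class ClosNetwork:
    """Three-stage Clos network with first-fit (greedy) path selection."""

    def __init__(self, n, m, k):
        self.n, self.m, self.k = n, m, k
        # up[a][c]: link from first-stage element a to central element c is busy
        self.up = [[False] * m for _ in range(k)]
        # down[c][b]: link from central element c to third-stage element b is busy
        self.down = [[False] * k for _ in range(m)]

    def strictly_nonblocking(self):
        return self.m >= 2 * self.n - 1          # Clos theorem

    def connect(self, src, dst):
        """Return the central element used, or None if the request is blocked."""
        a, b = src // self.n, dst // self.n
        for c in range(self.m):                  # visit the central elements in order
            if not self.up[a][c] and not self.down[c][b]:
                self.up[a][c] = self.down[c][b] = True
                return c
        return None


requests = [(0, 0), (2, 4), (3, 2), (1, 3)]      # (input, output) pairs, all ports idle
for m in (2, 3):
    net = ClosNetwork(n=2, m=m, k=3)
    paths = [net.connect(s, d) for s, d in requests]
    print(f"m={m}, strict-sense non-blocking={net.strictly_nonblocking()}, paths={paths}")
```

With `m = 2` (below the bound of `2n - 1 = 3`) the last request finds no free central element even though its input and output are idle, whereas with `m = 3` the same sequence is always routable without touching existing paths.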
The network is re-arrangeably non-blocking if and only if $m \ge n$ (Slepian-Duguid theorem),
and the re-arrangements necessary to realize a path for an idle pair can be found in polynomial
time (a consequence of Paull's theorem). The minimum number of crosspoints in the network is
$O(N^{3/2})$, which results when $n \approx \sqrt{N/2}$. Notice that no speedup is required if re-arrangement is
allowed. The results continue to hold when we aggregate the links (in and out) of a space element
to a single faster physical link. Specifically, this implies that $n$ simultaneous matchings on the
(aggregated) input-output pairs can be partitioned into $m$ concurrent matchings on the (pipelined)
central stage. In other words, the fan-in and fan-out of the space elements may interchangeably
be in space or time. Strict-sense non-blocking refers to the case when the existing partitions may
not be touched to admit a new entry, while a re-arrangement refers to the ability to re-compute the
partitions on every circuit admission.
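
The conditions above can be checked numerically. The following is a minimal sketch, assuming a
symmetric network with N ports, k first-stage (and third-stage) elements of size n x m with
n = N/k, and m central elements of size k x k. The function names `clos_profile` and
`best_strict_sense_k` are illustrative and do not come from the report.

```python
def clos_profile(N: int, k: int, m: int) -> dict:
    """Summarize a symmetric three-stage Clos network (illustrative helper)."""
    if N % k != 0:
        raise ValueError("k must divide N")
    n = N // k  # external ports per first-stage element
    # Two outer stages of k elements (n*m crosspoints each) plus m central k x k elements.
    crosspoints = 2 * k * n * m + m * k * k
    return {
        "n": n,
        "strict_sense": m >= 2 * n - 1,   # Clos theorem
        "rearrangeable": m >= n,          # Slepian-Duguid theorem
        "speedup": m / n,                 # internal links per external port
        "crosspoints": crosspoints,
    }


def best_strict_sense_k(N: int) -> tuple:
    """Search the divisors of N for the k minimizing crosspoints with m = 2n - 1."""
    best = None
    for k in range(1, N + 1):
        if N % k:
            continue
        n = N // k
        cost = clos_profile(N, k, 2 * n - 1)["crosspoints"]
        if best is None or cost < best[1]:
            best = (k, cost)
    return best
```

For example, `clos_profile(16, 4, 7)` describes a strict-sense non-blocking network with a speedup
of 1.75, whereas `clos_profile(16, 4, 4)` is only re-arrangeably non-blocking. For N = 32 the
search settles on k = 8, which agrees with the square-root scaling of the optimum.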
An interesting aspect of the above properties is that similar results can be obtained when the

Figure 3.1: An $N \times N$ Clos Network

is admissible for the pair $(i, j)$ if both input $i$ and output $j$ have current allocations that do
not exceed [...]. We can prove that a path can be realized for an admissible

This reduces to the Clos theorem when all quantities are unity, the extra terms accounting for
fragmentation of the link capacities due to the variability in the quantities. Note that the scalar
quantities may represent the peak or the effective bandwidths of packet flows. Hence, we may
construct a three stage network that is structurally equivalent to the Clos network, and enable
fitting of admissible flows, irrespective of whether the nodes are memory or space elements, as
long as each element can internally (through arbitration or scheduling) realize the bandwidths of
the individual flows on their input and output links. This result mirrors Proposition 2 in the context
of Clos topologies.
3.2 IQ Switch Algorithms
Since IQ switches require a memory bandwidth that is independent of the switch size, a significant
amount of research has been conducted to characterize their performance. Consider an $N \times N$
IQ switch, as defined in Sec. 2.4, with all rates normalized to an external interface rate of 1
cell/timeslot, and mean arrival rates of $\lambda_{i,j}$ between input $i$ and output $j$. These
rates are said to be admissible when $\sum_i \lambda_{i,j} < 1$ for all $j$ and
$\sum_j \lambda_{i,j} < 1$ for all $i$, since they correspond to

Figure 3.2: A combined input-output queueing switch (CIOQ)
algorithms1 that compute a sequence of matchings $M(t)$, so that the observed behavior is
functionally equivalent to the OQ switch. Broadly speaking, there are two types of such algorithms:
rate-driven schemes, which use prior knowledge of all the rates $\lambda_{i,j}$, and
queue-occupancy-driven schemes, which use the state of the queues to devise a matching. The latter
is especially suitable when the rates are not known beforehand, as is the case with best-effort
traffic, or when an equivalent rate-driven scheme is too complex to implement.
Early IQ switches used a single FIFO queue in each of the input memory elements. However,

independent Bernoulli traffic on each input link, which is uniformly distributed to all the outputs.
This is because of the phenomenon called head-of-line (HOL) blocking, in which a cell destined
to an idle output is queued behind a cell destined for a currently busy output. This situation is
remedied by queueing the cells in an input element into $N$ separate queues, one for each output,
referred to as virtual output queues [68]. Let the queue of cells destined to output $j$ in input $i$
be denoted as $Q_{i,j}$, with length $L_{i,j}$. Essentially, a matching matrix $M(t)$ enables the
transfer of a cell

is set to 1.

and corresponds to the total number of cells transferred in the timeslot. Given a weight function,

An IQ switch is said to employ a speedup $s > 1$ if $s$ matchings are computed per timeslot,
enabling the transfer of up to $s$ cells per timeslot between an input-output pair. Since the output
link operates at a rate of 1 cell/timeslot, any speedup necessitates buffers at the front of the
output link, in addition to the ones in the input elements. Such a switch is also referred to as a
combined input-output queueing (CIOQ) [31] switch and is illustrated in Fig. 3.2. Note that a CIOQ
switch requires a memory bandwidth equal to $(1 + s)$ times the interface rates, and a CIOQ switch
with speedup $N$ is the same as an OQ switch.

1Such algorithms have also been referred to as switch scheduling algorithms, but to avoid confusion
with the output link scheduling algorithms, we refer to these as arbitration algorithms.
3.2.1 Size Matchings
An algorithm that computes a matching with the maximum size would provide the highest
instantaneous throughput in each timeslot. Such a matching can be obtained by finding the maximum
network flow in a bipartite graph, where the vertices of the graph correspond to the inputs in one
set and the outputs in the other, and an edge with unit capacity exists between input $i$ and output
$j$ if $L_{i,j} > 0$. The most efficient algorithm to solve this problem runs in $O(N^{2.5})$ time
[28], and is too complex to implement at a high frequency (e.g., 20 million arbitrations/second for
an interface rate of 10 Gb/s and cell size of 64B, allowing less than 10 cycles on a 133 MHz clock
for the arbitration logic). This motivated the design of practical so-called maximal-size [50]
matching algorithms.
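
The arithmetic behind the parenthetical example can be reproduced in a few lines. This is only a
back-of-the-envelope sketch, assuming one arbitration per cell time and ignoring any framing
overhead; the function name `arbitration_budget` is illustrative.

```python
def arbitration_budget(line_rate_bps: float, cell_bytes: int, clock_hz: float):
    """Return (arbitrations per second, clock cycles available per arbitration)."""
    arbitrations_per_s = line_rate_bps / (cell_bytes * 8)  # one matching per cell time
    cycles_per_arbitration = clock_hz / arbitrations_per_s
    return arbitrations_per_s, cycles_per_arbitration


rate, cycles = arbitration_budget(10e9, 64, 133e6)
print(f"{rate / 1e6:.1f} M arbitrations/s, {cycles:.1f} cycles each")
```

With these figures the budget is roughly 19.5 million arbitrations per second and just under 7
clock cycles per arbitration, consistent with the "less than 10 cycles" quoted above.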
A matching is referred to as maximal in size if it is built by adding one connection at a time,
i.e., in a greedy approach, and converges when no unmatched input-output pair with a queued cell
remains. Clearly, any maximal matching algorithm will converge in $N$ steps, though it is not
guaranteed to find the maximum size match. A simple example may be constructed as follows: a
centralized arbiter matches each input $i$, in sequence, to the first unmatched output $j$ for which
$L_{i,j} > 0$. This algorithm runs in $O(N^2)$ time, which is the minimum it would take for any
algorithm that needs to sequentially inspect all the entries of an $N \times N$ matrix. Dai and
Prabhakar [19] proved, using fluid analysis, that any maximal matching algorithm guarantees that the
system of queues is stable for any admissible arrival process, with a speedup $s \ge 2$. This is a
remarkable result as it allows one to choose the least complex maximal matching, and yet remain
stable. Similar results, i.e., the sufficiency of $s = 2$, were shown by Leonardi et al. [48] for a
specific maximal matching under admissible traffic which leads to an embedded Markov-chain queue
length evolution, by establishing a drift condition on a well chosen quadratic Lyapunov function
[46] of the queue length vector. The same work [48] also established the stability of a
probabilistic (queue-length driven) maximal matching.
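
The centralized arbiter described above can be sketched as follows. This is an illustrative
implementation, assuming the queue lengths are given as an N x N list of lists `L` with `L[i][j]`
the occupancy of the virtual output queue from input i to output j; the helper `is_maximal` is
added here only to check the defining property.

```python
def greedy_maximal_matching(L):
    """Match each input, in sequence, to the first unmatched output it has a cell for."""
    N = len(L)
    matched_outputs = set()
    match = {}  # input -> output
    for i in range(N):
        for j in range(N):
            if L[i][j] > 0 and j not in matched_outputs:
                match[i] = j
                matched_outputs.add(j)
                break
    return match


def is_maximal(L, match):
    """True if no unmatched input and unmatched output share a non-empty queue."""
    N = len(L)
    free_inputs = [i for i in range(N) if i not in match]
    free_outputs = set(range(N)) - set(match.values())
    return not any(L[i][j] > 0 for i in free_inputs for j in free_outputs)


L = [[1, 1],
     [1, 0]]
match = greedy_maximal_matching(L)
print(match, is_maximal(L, match))
```

In this small instance the arbiter gives output 0 to input 0, leaving input 1 unmatched: the result
is maximal (size 1) but not maximum, since matching input 0 to output 1 and input 1 to output 0
would transfer two cells.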
Several earlier works [1, 51, 49] proposed matching algorithms for IQ switches without speedup,
and analyzed their performance under restricted arrival processes, as well as sub-maximal sizes. In
Parallel Iterative Matching (PIM) [1], each input, in parallel, sends a request to each of the
outputs for which $L_{i,j} > 0$. An output chooses one from the contending inputs in a random
manner, and finally, each of the inputs selects one from the granting outputs, again in a random
manner. These steps are iterated until the algorithm converges to a maximal matching. Even though
$N$ iterations are required to converge in the worst case, it was shown that the algorithm needs
only $O(\log N)$ iterations on average. Alternatively, in iSLIP [51], in each iteration, the output
grant and the input select operations are performed in a round-robin fashion, with special care to
ensure that the pointers do not synchronize. It was shown that due to the de-synchronization
property, a single iteration is sufficient to ensure stability for i.i.d. Bernoulli traffic on each
input, with uniformly distributed destinations. For uniform bursty traffic, several iterations, $N$
in the worst case, are required to maintain stability. Note that such iterative algorithms, due to
their parallel nature, can conceivably be implemented in a distributed fashion, as long as there is
enough control bandwidth between the inputs and outputs to exchange results repeatedly (equal to the
number of iterations) within the same timeslot.
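
The request-grant-accept cycle of PIM can be sketched in a sequential simulation. The code below
is an illustrative model of the three steps as described above, not the implementation from [1];
the function name `pim` and its parameters are assumptions, and the parallel hardware behavior is
emulated with ordinary loops.

```python
import random


def pim(L, rng=None, max_iterations=None):
    """Simulate Parallel Iterative Matching on queue-length matrix L.

    Returns (input -> output matching, number of iterations that added matches).
    """
    rng = rng or random.Random()
    N = len(L)
    in_to_out, out_to_in = {}, {}
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        # Request: every unmatched input asks every unmatched output it has cells for.
        requests = {}
        for i in range(N):
            if i in in_to_out:
                continue
            for j in range(N):
                if j not in out_to_in and L[i][j] > 0:
                    requests.setdefault(j, []).append(i)
        if not requests:
            break  # no unmatched pair with a queued cell: the matching is maximal
        # Grant: each requested output picks one contending input at random.
        grants = {}
        for j, inputs in requests.items():
            grants.setdefault(rng.choice(inputs), []).append(j)
        # Accept: each input picks one granting output at random.
        for i, outputs in grants.items():
            j = rng.choice(outputs)
            in_to_out[i] = j
            out_to_in[j] = i
        iterations += 1
    return in_to_out, iterations
```

Calling `pim(L, random.Random(0))` on a fully backlogged 4 x 4 matrix always ends with all four
inputs matched, but the number of iterations taken depends on the random choices. Passing
`max_iterations=1` models a single-iteration arbiter, which may stop short of a maximal matching.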
3.2.2 Weight and Rate Matchings
Even though the performance of maximal size matching algorithms, without speedup, has been
demonstrated for uniform traffic, it was recognized [52] that even a maximum size matching is
insufficient for stability under non-uniform admissible traffic. The same work showed that a
maximum weight matching, where the weight of the flow from $i$ to $j$ is the queue length
$L_{i,j}$, is sufficient to ensure stability for any admissible i.i.d. stationary arrival pattern.
The i.i.d. and stationarity requirements, which were artifacts of the proof methodology that used
embedded Markov chains, were relaxed in [19]. The most efficient algorithm to compute the maximum
weight matching runs in $O(N^3 \log N)$ time. It was later realized [53] that the best maximum size
matching can indeed guarantee stability by computing all possible maximum matchings, in each
timeslot, and choosing the one with the maximum weight, where the weight $w_{i,j}$ is computed as
the sum of all the occupancies in input $i$ plus the sum of occupancies in output $j$.
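
The difference between a maximum size and a maximum weight matching is easiest to see on a tiny
example. The sketch below enumerates all permutations, so it takes factorial time and is meant only
as an illustration for small N; it is not the efficient algorithm cited above. The function name
`max_weight_matching` is an assumption.

```python
from itertools import permutations


def max_weight_matching(L):
    """Brute-force maximum weight matching with weights L[i][j] (queue lengths)."""
    N = len(L)
    best_perm = max(permutations(range(N)),
                    key=lambda perm: sum(L[i][perm[i]] for i in range(N)))
    # Drop pairs with empty queues: they carry no cell and add no weight.
    match = {i: j for i, j in enumerate(best_perm) if L[i][j] > 0}
    weight = sum(L[i][j] for i, j in match.items())
    return match, weight


L = [[9, 1],
     [1, 0]]
print(max_weight_matching(L))
```

Here the maximum size matching pairs input 0 with output 1 and input 1 with output 0 (two cells,
total weight 2), while the maximum weight matching serves only the long queue from input 0 to
output 0 (one cell, weight 9).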
Given prior knowledge of all the rates $\lambda_{i,j}$, either through traffic engineering
considerations, or through specifications of traffic profiles, a repeating sequence of $m$
matchings may be used to essentially provide a virtual trunk with the specific rates, thus ensuring
stability. The $m$ matchings may be computed offline using the Birkhoff-Von Neumann (BVN) matrix
decomposition [5], which computes $m = O(N^2)$ templates in a total run time of $O(N^{4.5})$ and
requires no speedup to operate. Note that while the algorithm has high complexity and memory
requirement to store the templates, the decomposition needs to be repeated only when the rate matrix
changes.
Alternatively, the templates can also be computed by performing flow fitting on a structurally
equivalent Clos network [30]. (This may be viewed as a proof for Proposition 2 for the specific
cases of IQ and CIOQ switches.) In this approach, assuming that all the rates are integer multiples
of some $\lambda_{\min}$, each rate is converted into $\lambda_{i,j}/\lambda_{\min}$ circuits,
which need to be realized in $1/\lambda_{\min}$
time. This can be done with a speedup of 2, i.e., by partitioning the circuits into
$2/\lambda_{\min}$ templates using the Clos theorem, or without speedup using the Slepian-Duguid
theorem. The running time complexity for the former can be found as $O(N/\lambda_{\min})$, and for
the latter as $O(N^2/\lambda_{\min})$. In contrast to deterministically computing and storing
templates, a probabilistic rate-driven algorithm was presented in [48], where the rates were used to
generate a maximal size matching.
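
A BVN-style decomposition can be sketched compactly for small matrices. The code below is an
illustrative version, assuming the rate matrix has already been padded so that every row and column
sums to the same value (for example 1); admissible matrices with slack would need that padding step
first, which is not shown. It uses a simple augmenting-path matching rather than the faster
algorithms behind the complexity figures quoted above, and the names `perfect_matching` and
`bvn_decompose` are assumptions.

```python
from fractions import Fraction


def perfect_matching(positive):
    """Augmenting-path bipartite matching; positive[i][j] marks a usable (i, j) pair."""
    N = len(positive)
    out_owner = [None] * N  # output -> input currently assigned to it

    def try_input(i, seen):
        for j in range(N):
            if positive[i][j] and j not in seen:
                seen.add(j)
                if out_owner[j] is None or try_input(out_owner[j], seen):
                    out_owner[j] = i
                    return True
        return False

    for i in range(N):
        if not try_input(i, set()):
            return None
    perm = [None] * N
    for j, i in enumerate(out_owner):
        perm[i] = j
    return perm  # perm[i] is the output matched to input i


def bvn_decompose(R):
    """Split R into weighted permutations (templates): R = sum of phi * P."""
    R = [[Fraction(x) for x in row] for row in R]  # exact arithmetic, private copy
    N = len(R)
    templates = []
    while any(x > 0 for row in R for x in row):
        perm = perfect_matching([[x > 0 for x in row] for row in R])
        if perm is None:
            raise ValueError("row and column sums are not all equal")
        phi = min(R[i][perm[i]] for i in range(N))  # largest weight this template can carry
        for i in range(N):
            R[i][perm[i]] -= phi
        templates.append((phi, perm))
    return templates
```

Each returned pair `(phi, perm)` is a template: the matching `perm` should be used for a fraction
`phi` of the timeslots. Every step zeroes at least one entry of the residual matrix, so the loop
ends after at most N^2 templates.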
To summarize, a maximal size matching is sufficient to ensure stability under any admissi-
ble traffic using a speedup of 2, while a maximum weight matching ensures the same without
speedup2. A maximal size algorithm, such as iSLIP, may be used without speedup for uniform
traffic, and requires only a single iteration for uniform Bernoulli traffic. Finally, if all the rates are
known beforehand, an offline rate decomposition algorithm may be used to generate a sequence of
templates to provision those rates.
3.2.3 OQ Emulation
While different size, weight and rate-based matchings have been shown to possess varying stability
properties, the ability to exactly emulate an OQ switch for any traffic pattern would undoubtedly
be the most desirable property for an input queued switch. Chuang et al. [17] showed that this
was indeed possible, in theory, for a CIOQ switch with a speedup of 2. (This was also proved
independently in [44, 66].)
The algorithm presented in [17], Critical Cells First (CCF), maintains a priority list, for every
output, of cells enqueued anywhere in the system. The position of a cell in this list, called its
output cushion, is calculated by simulating an OQ switch of the same dimensions with the desired
link scheduling scheme. The class of scheduling schemes which can be emulated is the Push-In
First-Out (PIFO), which is a model for any service discipline in which the relative departure order
of cells does not change with future arrivals. An incoming cell is inserted into the input queues
in such a way that the cells are in ascending order of their output cushions. A stable matching is
calculated which ensures that, for each input, either a cell is transferred or a cell from another
input with a lower output cushion is transferred. Such a matching can be computed by solving the
stable marriage problem, for example, using the Gale-Shapley algorithm, which runs in time linear in
the sum of the lengths of all the input queues.

2It has been widely conjectured that these two results are analogous to the Clos theorem and the
Slepian-Duguid theorem, respectively, for a circuit-switched Clos network, as a matching can be
viewed as an online fitting of circuits over the time-scale of averaging of the rates.
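
The stable marriage step can be sketched with a generic Gale-Shapley routine in which inputs
propose to outputs. This is an illustration of the matching primitive only, not of CCF itself: how
the preference lists are derived from output cushions is left out, and the names `gale_shapley` and
`is_stable` are assumptions. The sketch assumes the lists are mutually consistent, i.e., input i
lists output j exactly when output j lists input i.

```python
def gale_shapley(input_prefs, output_prefs):
    """Input-proposing stable matching; prefs are lists in decreasing preference order."""
    N = len(input_prefs)
    rank = [{i: r for r, i in enumerate(prefs)} for prefs in output_prefs]
    next_choice = [0] * N        # next position in each input's list to propose to
    partner_of_output = {}
    free = list(range(N))
    while free:
        i = free.pop()
        if next_choice[i] >= len(input_prefs[i]):
            continue             # list exhausted: input i stays unmatched
        j = input_prefs[i][next_choice[i]]
        next_choice[i] += 1
        current = partner_of_output.get(j)
        if current is None:
            partner_of_output[j] = i
        elif rank[j][i] < rank[j][current]:
            partner_of_output[j] = i   # output j prefers the new proposer
            free.append(current)
        else:
            free.append(i)             # rejected: input i will try its next choice
    return {i: j for j, i in partner_of_output.items()}


def is_stable(match, input_prefs, output_prefs):
    """True if no input-output pair would both prefer each other to their partners."""
    partner = {j: i for i, j in match.items()}
    for i, prefs in enumerate(input_prefs):
        for j in prefs:
            if match.get(i) == j:
                break  # remaining outputs are less preferred by input i
            rival = partner.get(j)
            if rival is None or output_prefs[j].index(i) < output_prefs[j].index(rival):
                return False
    return True
```

For instance, with `input_prefs = [[0, 1], [0, 1]]` and `output_prefs = [[1, 0], [0, 1]]`, both
inputs prefer output 0, output 0 prefers input 1, and the routine settles on input 1 to output 0 and
input 0 to output 1.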
3.3 QoS Methodologies
As opposed to IQ arbitration algorithms, the main focus of which (except for rate decomposition
and emulation schemes) has been to provide maximum throughput in relation to a work conserving
ideal switch, the primary goals of QoS algorithms are twofold: (i) to provide isolated bandwidth
guarantees to specified flows; and (ii) to provide fairness in the distribution of excess (best-effort
portion) bandwidth. Isolation refers to the ability to provide a virtual bandwidth trunk to a flow,
irrespective of the arrival patterns of other flows. The time-scale over which such isolation is
provided translates into the latency of service, and consequently affects the delays observed by the
packets belonging to that flow. Though several notions of fairness may be contemplated depending
upon the application of the flow (such as TCP connections), we restrict ourselves to the ability
to control the excess allocation depending on specified weights. Both the above objectives are
especially meaningful under inadmissible traffic.
The goals of isolation and fairness are provided by a combination of scheduling and buffer
management algorithms. These have been studied extensively in the context of OQ switches, i.e.,
single memory elements, and are overviewed in Appendix B.
3.3.1 QoS in Input Queued Switches
While OQ emulation established the theoretical possibility to provide any desired QoS perfor-
mance in a CIOQ switch by emulating a chosen link scheduler for an equivalent OQ switch, its
implementation remains infeasible. Accordingly, in practice, QoS is provided by considering each
of the stages as independent elements and using a hierarchical scheduling structure, implemented
in a distributed fashion [63, 8] among the elements.
We will consider a simple example shown in Fig. 3.3 to illustrate the methodology. The input
elements contain a virtual output scheduler (VOS) for each output, which are served by the ar-
biter. Each VOS contains a GBS (guaranteed bandwidth scheduler) and an EBS (excess bandwidth
scheduler) portion, which divides the bandwidth provided by the arbiter among the flows. The arbi-
tration proceeds in two steps. In the first step, a (non-work-conserving) rate decomposition scheme
is used to provide isolated bandwidth trunks on every input-output pair. These grants are scheduled
among the flows by the GBS scheduler, the delay bound in the first element being dependent on
the added latency of the arbiter and the VOS. This has motivated some recent interest [39] in the
generation of low-latency smooth templates. In the second step, a maximal matching algorithm is
used to maximize throughput, which is distributed by the EBS scheduler in a chosen fair manner.
Note that, for admissible traffic, the need for fairness in the distribution of (excess) bandwidth is
moot. For traffic that causes (sustained or temporary) instability, excess bandwidth is first divided

Figure 3.3: QoS in CIOQ using distributed schedulers (Step 1: Rate Templates; Step 2: Maximal Size)

distributed by the VOS. Clearly, this does not, in general, correspond to the weighting received by




The methods for constructing a high capacity switching fabric, composed of smaller space and
memory elements, are the subjects of this chapter. We first present the framework of functional
equivalence used to characterize the performance of the fabric. The architectures under our consid-
eration are restricted to a class of fully connected three stage models, which we call Buffered Clos
switches, and are driven by different design constraints. The class includes single-path models
such as the CIOQ switch and more general memory-space-memory designs, and multi-path ones
such as the parallel packet switch (PPS).
The taxonomy itself is novel, and is unpublished as of this writing. Some of the problems
addressed for memory-space-memory switches are still works in progress, for which we present
our initial ideas. Regarding multi-path switches, we include our published results as well as a
discussion on similar work being conducted elsewhere, and identify some as yet unsolved issues.
We conclude by showing how the QoS methodologies, well studied in the context of OQ switches,
can be applied to the multi-module Buffered Clos switches.
4.1 Switch Model
A Buffered Clos switch is a three-stage switch with the following structural limitations. The stages
are uniform, i.e., each stage is composed of a single type of element, memory or space. The
switch is symmetrical in arrangement, i.e., the number of elements in the third stage, and their
inter-connections with the second, are a mirror image of the first. Finally, we restrict ourselves
to a square switch, i.e., one with an equal number of input and output ports. We introduce the

$N$ is the size of the switch, $k$ is the number of first and third stage elements, and $m$ is
the number of central stage elements.

space and memory element, respectively. Due to the property of symmetry,
the first stage elements have dimensions of

(We assume that $N/k$ is an integer.)

The parameter $s$ refers to the speedup between the first and second stages. Given an external
link capacity of $C$, the total input link bandwidth of the switch is $NC$. Similarly, if the
capacity of the links between the stages is $C'$, the total bandwidth between two adjacent stages is
equal to $C'km$.

Packet Switches
      OQ: Output Queued
      IQ : Input Queued
      Buffered Clos Switches
             Single Path
                    CIOQ: Combined Input Output Queued
                    CIOQ-A: CIOQ with Aggregation
                    CIOQ-P: CIOQ with Pipelining
                    G-MSM: General Memory-Space-Memory
             Multi-Path
                    PPS: Parallel Packet Switch
                    BVN: Birkhoff Von-Neumann
Figure 4.1: A Taxonomy of Buffered Clos switches
By definition, if all the rates are normalized to the external link rate, then the internal capacity
can be calculated as $sN/(km)$. Discounting, for now, the effects of bandwidth fragmentation on the
internal links, the switch is structurally non-blocking as long as $s \ge 1$. We define a switch as
single-path if each realizable path between an input and output contains the same memory element(s).
If two paths contain different memory elements, the switch is considered multi-path. Note that we
allow two paths in a single-path switch to consist of different space elements, because a cell
experiences a small fixed delay as it traverses a space element, essentially exhibiting predictable
behavior.
The taxonomy of Buffered Clos switches is obtained by varying [...] and the relationship between
$N$, $k$, and $m$. While we defer the motivation and analysis for the specific points of interest
in the taxonomy, we illustrate it in Fig. 4.1, and define the structures as follows. (Strictly
speaking, IQ and OQ are not members of this class, but we use the same notation for uniformity.)
OQ : 3 N!E  S5N  NObNO 
IQ : 3 N!E ; S5N  N  N 	
CIOQ : 3 N!E ; S5N  N  N w 
CIOQ-A : 3 N!E ; S5NO\N  N w  ,   is an integer

































We address the performance of all but the last item in the above list, the so-called Birkhoff-Von
Neumann (BVN) switch (not to be confused with BVN rate decomposition), which has garnered
some recent interest [6, 35] as a general space-memory-space design. The design constraints that
lead us through the taxonomy can be enumerated as follows. The memory element is constrained
in size and capacity by the available memory bandwidth, and in complexity by the frequency of
link scheduling operations. The space element is constrained in capacity by the transfer rate
between an input-output pair, in size by the realizable pin count in ASIC, and in complexity by the
frequency of arbitration, which itself depends on the capacity and the size.
As seen from the previous chapter, CIOQ is fairly well-studied and widely deployed. CIOQ-A,
CIOQ-P, and G-MSM switches are already found in practice (e.g., [12, 13]), but their performance
is relatively less well studied. PPS and BVN are still mainly under theoretical investigation.
4.1.1 Functional Equivalence
We denote a switch $S_1$ operating under a set of algorithms $A_1$ to be functionally equivalent
with a switch $S_2$ operating under a set of algorithms $A_2$ as

    $(S_1, A_1) \overset{T,\,E}{\equiv} (S_2, A_2)$

where $T$ is the class of traffic for which the equivalence holds, and $E$ denotes the level of
equivalence. The following are the levels which are of our immediate interest:

$E_1$: Assuming all the average rates $\lambda_{i,j}$ between the input-output pairs are known, and
the rates are admissible (i.e., $\forall j, \sum_i \lambda_{i,j} < 1$), given an algorithm set $A_2$
which ensures the stability of the queues in $S_2$, we can find an algorithm set $A_1$ which ensures
the same in $S_1$. In other words, $E_1$ denotes the ability to provide virtual trunks of known
admissible rates between the input-output pairs.

$E_2$: Assuming rates $\lambda_{i,j}$ are admissible (i.e., $\forall j, \sum_i \lambda_{i,j} < 1$),
given an algorithm set $A_2$ which ensures the stability of the queues in $S_2$, we can find an
algorithm set $A_1$ which ensures the same in $S_1$, without explicit knowledge of the individual
rates. We say that $S_1$ is relatively stable with respect to $S_2$.

$E_3$: Given an algorithm set $A_2$ which ensures the stability of each output $j$ with admissible
rates (i.e., those $j$ for which $\sum_i \lambda_{i,j} < 1$), in switch $S_2$, we can find a set
$A_1$ which enables the same in $S_1$. We define an output $j$ to be stable if the system of queues
which contain traffic to output $j$ is stable. In other words, $E_3$ represents the ability of a
switch to isolate unstable outputs in the presence of inadmissible traffic. We say that $S_1$ is
relatively stable in the strict sense with respect to $S_2$.

$E_4$: Given $A_2$ which ensures the stability of a subset of input-output pairs with admissible
rates (i.e., pairs $(i, j)$ for which $\lambda_{i,j} < w_{i,j}$, the fraction of the output link
bandwidth given to the pair by $A_2$) in switch $S_2$, we can find $A_1$ which guarantees the same
in $S_1$ for the same subset. We define an input-output pair $(i, j)$ to be stable if the system of
queues which contain the traffic belonging to the pair is stable. $E_4$ represents the ability of
the switch to isolate unstable input-output pairs within the same (unstable) output. We say that
$S_1$ is relatively stable in the strictest sense with respect to $S_2$.

$E_5$: Given an algorithm set $A_2$ for $S_2$, we can find a set $A_1$ which ensures that cells
depart from $S_1$ in an exact emulation of the departure from $S_2$.

In essence, $E_1$ denotes the ability of a switch to utilize the knowledge of all the admissible
rates to ensure that the given rates can be supported. For example, the BVN rate matrix
decomposition provides virtual trunks with the given rates for input queued switches, and can be
used for $E_1$ equivalence. Note however that such rate decomposition schemes are stronger than
$E_1$ since they can also guarantee given rate requirements even if the offered arrival rates do not
adhere to those requirements. Equivalence at level $E_2$ has been equated in the literature to
providing 100%

provide further means to maintain the so-called 100% throughput, while isolating specific flows
that contribute to congestion and instability. In that regard, they might also be considered as
levels of fairness in the distribution of bandwidth in a (partially) unstable system. Note also that
the property of work-conservation, on an output link basis, is sufficient for $E_3$, but not for
$E_4$ equivalence.
The methodology to prove the stability of a system of queues is left open, which might itself
restrict the traffic class $T$. We establish two important properties to aid in recognizing
equivalence. Notice that each level of equivalence subsumes the previous level, i.e., if the
equivalence holds at level $E_i$, then it holds for all $E_j$, $j < i$.

Also, it is easy to see that each level of stability lends itself to the transitive property. That
is, if a set of equivalences of switch $S_2$ with respect to $S_3$ is well known, it suffices to
show the

Since the optimality of work-conserving OQ switches is well established, we can unambiguously
characterize the performance of any given switch $S$ by finding algorithms $A$, such that it is
functionally equivalent at level $E_i$ with the OQ switch, for the largest value of $i$ and the
broadest traffic

(OQ, {Work Conserving (WC)})

The known results for IQ and CIOQ switches, which were reviewed in the previous chapter, fit
into this framework as follows:

(IQ, {BVN Decomposition}) $\overset{T,\,E_1}{\equiv}$ (OQ, {WC})
(IQ, {iSLIP}) $\overset{T,\,E_2}{\equiv}$ (OQ, {WC}), $T$: Bernoulli, i.i.d., uniform; single iteration
(IQ, {iSLIP}) $\overset{T,\,E_2}{\equiv}$ (OQ, {WC}), $T$: i.i.d., uniform; $N$ iterations
(IQ, {Maximum Weight}) $\overset{T,\,E_2}{\equiv}$ (OQ, {WC})
(CIOQ, {Maximal Size}) $\overset{T,\,E_2}{\equiv}$ (OQ, {WC}), $s = 2$
(CIOQ, {CCF}) $\overset{T,\,E_5}{\equiv}$ (OQ, {WC, PIFO}), $s = 2$
4.2 CIOQ: Some Loose Ends
The CIOQ switch is attractive because the memory bandwidth required is $(1 + s)$ times the
interface rates, without a dependence on the size of the switch. However, assuming that a large
space element with a link rate of $s$ can be built, there are two immediate problems. The first is
related to performance. Functional equivalence at level $E_5$ can be achieved as shown in [17];
however, the algorithm is exceedingly complex. Practical algorithms have not been shown to be
relatively stable in the strict sense. The second problem is related to the frequency of
arbitration. The length of an arbitration cycle is given by $L/(sC)$, where $C$ is the output link
capacity and $L$ is the size of the cell. Therefore, it decreases with increasing transfer rates.
Additionally, many of the optical space elements which provide the fastest transfer rates suffer
from a high re-configuration latency, once a matching has been computed. We address these issues
before proceeding to the next point in the taxonomy.
4.2.1 Strict Relative Stability
One of the desirable properties of an OQ switch is that instability at an output is isolated to that
output. In other words, if $\sum_i \lambda_{i,j} \ge 1$ for some output $j$, it does not affect the
stability of queues

[...] into two proper subsets $\mathcal{S}$ and $\mathcal{U}$. Let $\sum_i \lambda_{i,j} < 1$ for
all $j \in \mathcal{S}$, and $\sum_i \lambda_{i,j} \ge 1$ for all $j \in \mathcal{U}$. A CIOQ switch
would be equivalent at level $E_3$ with a work conserving OQ switch, as long as all outputs in
$\mathcal{S}$ are stable. Of course, directly providing a specific matching algorithm that
guarantees work-conservation (such as the one in [44]) immediately establishes $E_3$ equivalence.
However, our goal here is to provide the same using simpler matchings.
Note that each input element $i$ has virtual output queues corresponding to outputs in both
$\mathcal{S}$ and $\mathcal{U}$. Let $Q_{i,j}$ and $Q_{i,l}$ be two such queues, respectively. At
first glance, it seems that the combination of the instability of $Q_{i,l}$ and a bad maximal size
matching algorithm which always chooses that queue would cause the instability to spread to
$Q_{i,j}$. However, we conjecture1 that because of the physical input link limitation
$\sum_j \lambda_{i,j} \le 1$, any maximal matching with speedup of 2 is sufficient to ensure
relative stability in the strict sense. (This strengthens the result proved in [19], which
applied only when all the rates were admissible.)

Conjecture 1 (CIOQ, {Maximal Size}) $\overset{T,\,E_3}{\equiv}$ (OQ, {WC}), $s = 2$
Next, consider an output $j$ belonging to the set $\mathcal{U}$ of unstable ports. Also, assume that
the reference OQ switch has the ability to divide the capacity of output $j$ according to normalized
weights $w_{i,j}$ (e.g., by using virtual input queueing at the outputs, and any weight-based
scheduling scheme) among the pairs $(i, j)$. Then, if $\lambda_{i,j}$ turns out to be less than
$w_{i,j}$, the pair $(i, j)$ remains stable, even within an unstable port. In other words, a work
conserving OQ switch with any weight based policy has the ability to isolate instability on an
input-output pair basis. Our next goal is to enable the same in a CIOQ switch, thereby providing
$E_4$ equivalence.
Consider a CIOQ switch with virtual input queues $Q'_{i,j}$ in each output element $j$ for each of
the inputs. We can show by a simple counter example that an arbitrary maximal matching algorithm
cannot ensure relative stability with an OQ switch in the strictest sense. Let
$\lambda_{1,j} = \lambda_{2,j} = 1$, and $\lambda_{3,j} = 1/2$, with no other traffic in the switch,
and weight $w_{3,j} = 1/2 + \epsilon$. Clearly, in the reference OQ switch, the pair $(3, j)$ would
remain stable. However, irrespective of the scheduling scheme

in an alternating fashion, and leave out $(3, j)$ altogether even with a speedup of 2.
To remedy the above, we propose the Shortest Output Queue First (SOQF) algorithm, which
works as follows. In the output elements, the virtual input queues are scheduled with the same
weights $w_{i,j}$ as in the reference OQ switch. The maximal matching proceeds by first sorting the
virtual input queue lengths and then, in ascending order, matching the first pair $(i, j)$ with
unmatched $i$ and $j$ that has a cell to send. It can be easily shown that this is an instance of a
maximal size matching, and hence is at least equivalent to a work-conserving OQ switch at level
$E_2$ (and $E_3$ if Conjecture 1 is proved). We claim2 that by giving preference in the maximal
matching to the least congested pair, a CIOQ switch can isolate instability on an input-output pair
basis. The running time of SOQF is dominated by the time to sort the output queues, i.e.,
$O(N^2 \log N)$.

Conjecture 2 (CIOQ, {SOQF}) $\overset{T,\,E_4}{\equiv}$ (OQ, {WC, Weight Scheduling}), $s = 2$

1This result is somewhat counter-intuitive, and a tremendous effort to find a counter-example turned
out to be futile. Of course, we need a rigorous proof to establish this as a theorem. An author of
the original result on maximal size seems to concur with this conjecture [59].
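
The matching step of SOQF, as described above, can be sketched as follows. This is an illustrative
reading of the description, assuming `voq_len[i][j]` holds the occupancy of the virtual output
queue at input i for output j and `viq_len[j][i]` the occupancy of the virtual input queue at
output j for input i. Ties are broken by index order, which is an arbitrary choice made here, and
the function name `soqf_matching` is an assumption.

```python
def soqf_matching(voq_len, viq_len):
    """Greedy maximal matching that favors the shortest virtual input queue."""
    N = len(voq_len)
    # Only pairs with a cell waiting at the input are candidates.
    candidates = sorted(
        (viq_len[j][i], i, j)
        for i in range(N)
        for j in range(N)
        if voq_len[i][j] > 0
    )
    used_inputs, used_outputs, match = set(), set(), {}
    for _, i, j in candidates:  # ascending order of output-side occupancy
        if i not in used_inputs and j not in used_outputs:
            match[i] = j
            used_inputs.add(i)
            used_outputs.add(j)
    return match


# Three inputs contend for output 0; input 2's virtual input queue there is shortest.
voq = [[4, 0, 0], [4, 0, 0], [2, 0, 0]]
viq = [[10, 12, 1], [0, 0, 0], [0, 0, 0]]
print(soqf_matching(voq, viq))
```

In this instance the least congested pair, input 2 to output 0, wins the contended output, which is
the behavior the counter example above calls for. The sort over up to N^2 candidate pairs is what
gives the running time quoted in the text.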
4.2.2 Envelope and Batch Switching
There have been two recent approaches to tackle the problems associated with a high arbitration
frequency. The first [36] recognizes the fact that the frequency is given by $sC/L$ and can be
reduced by increasing the denominator. Several cells are packed into an envelope and a matching
is computed on an envelope timeslot. It has been proven that this does not affect the stability of
the queues (each queue will have an additional length of at most an envelope length); however,
since envelopes are released only when full, the delay of a cell in a partial envelope is
potentially unbounded. An additional speedup $s_1$ (of up to 2) can be used to bound the delay. Note
that envelopes also enable moderately complex (and distributed) matching algorithms by amortizing
the complexity over the envelope timeslot.
Another, possibly complementary, approach [70] deals with a common problem associated
with optical elements, that of high re-configuration times. Even though matchings may proceed at
a frequency of $sC/L$, where $L$ is now the envelope size, these may be used in conjunction with
even slower re-configuration, by essentially accumulating a batch of matchings and computing a
sequence of configurations that cover the batch. In [70], it has been shown that $2N$ configurations
with an additional speedup of $s_2 = 2$ are sufficient to cover a batch of any size. Though we do
not propose any new algorithms that address these issues, we need to note that the actual speedup of
a space element, in practice, would equal $s_1 s_2 s$, where $s$ is the speedup of a CIOQ switch
that does not employ either of the two techniques.
4.3 Single-Path Buffered Clos Switches
The CIOQ switch is the simplest instance of a single-path Buffered Clos switch. We now show
how to add the techniques of aggregation and pipelining to obtain more complex designs, and the
algorithms that maintain equivalence with an OQ switch.
4.3.1 CIOQ-A: Aggregation
A CIOQ switch with aggregation (CIOQ-A) is obtained by grouping $N/m$ consecutive interfaces
of the former into the same memory element, as shown in Fig. 4.2(b). Consequently, there are
$m$ memory elements in the first and third stage, and a single space element in the second stage.
Each memory element now requires a memory bandwidth of $(1 + S)N/m$ times the interface rate.
The space element requires a transfer rate of $SN/m$ on each link (instead of $S$ in CIOQ) and an
2As of now, we possess the intuition behind the algorithm, but no analytical proof. If we fail in the latter, the plan
is to validate the algorithm by resorting to simulations. We may also use the scheme to provide equivalence at level
ëIì
in case the previous conjecture turns out to be incorrect or in the case when the sum of the incoming rates into an
input element exceeds unity due to multicast traffic.
arbitration and configuration frequency of $SN/m$ matchings per timeslot. We refer to the cycle time
of transfer of the space element as the internal timeslot, which is $m/(SN)$ times the external timeslot.
From a design point of view, this approach has a few advantages. The size of the space element, and
hence the number of inter-connections and the internal contention points (see footnote 3), decreases by a factor of
$m$. A maximal match algorithm on the smaller space element has a time complexity of $(SN/m) \cdot O(m^2)$
per external timeslot, or $O(Nm)$, as opposed to $O(N^2)$ in CIOQ (though this point is rendered
ineffective in some of the more complex algorithms explained below). Aggregation also allows the
reuse of existing space elements to support multiple subports. The primary disadvantage of this
approach is the higher memory bandwidth compared to a CIOQ switch of the same dimensions and











Figure 4.2: Single Path Buffered Clos Switches
Figure 4.3: Example: To shadow a CIOQ switch (panels: Shadow CIOQ: π; Aggregated: π*; Slepian-Duguid Fitting)
3The advantages of lower number of contention points will likely not be revealed in maximal matchings with
The same rate decomposition techniques of CIOQ, as reviewed in Sec. 3.2.2, can be performed
on an input-output element basis, for known admissible rates, to achieve equivalence at level $f_1$
with an OQ switch. Furthermore, a maximal size matching done on an element-pair basis can be
easily shown to be relatively stable with OQ for $S = 2$. The straightforward proof of the following
may be found in Appendix C.1. (Note that $f_3$ equivalence also holds if Conjecture 1 is true.)
Theorem 1 (CIOQ-A, {Maximal Size}) $\equiv_{f_2}$ (OQ, {WC}), $S = 2$
For stronger equivalence, we propose an algorithm called shadow CIOQ-$\mathcal{A}$, which emulates
a CIOQ switch running $\mathcal{A}$, followed by a decomposition of the matchings, to achieve the same
level of equivalence with an OQ switch as that achieved by $\mathcal{A}$. Consider an $N \times N$ CIOQ-A,
with all rates normalized with respect to the interface rates. Let the queues be arranged (e.g., a
combination of virtual output and virtual input queues) as required by $\mathcal{A}$. Shadow CIOQ-$\mathcal{A}$ first
computes $S^*$ matchings every timeslot by running $\mathcal{A}$, where $S^*$ is the speedup required to operate
$\mathcal{A}$ in a CIOQ switch with the same dimensions. Consider one such matching $\pi$, and compose an























Note that the aggregate matching has the following properties. The sum of all entries in a
column or a row is no greater than $N$, and each entry sums up to no greater than $N/m$. Viewing the
entries as number of circuits, it follows from Clos theorem (Sec. 3.1) that $\pi^*$ can be partitioned into
$2(N/m) - 1$ matchings in a greedy fashion in $O(N^2/m)$ time, or using the Slepian-Duguid theorem
into $N/m$ matchings in $O(Nm^2)$ time. Each of the computed partitions is used as a matching
for the $m \times m$ space element, running at a frequency of $SN/m$. Consequently, the shadow
CIOQ-$\mathcal{A}$, followed by Clos or Slepian-Duguid (SD) fitting, requires a speedup of $(2 - m/N) S^*$ or $S^*$,
respectively, to exactly emulate a CIOQ switch running $\mathcal{A}$ at a speedup of $S^*$ (see Fig. 4.3 for an
example). We have just proved the following result:
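Theorem 2
(CIOQ-A, {Shadow CIOQ-$\mathcal{A}$, Clos Fitting}) $\equiv_{f_5}$ (CIOQ, $\mathcal{A}$), $S = (2 - m/N) S^*$
(CIOQ-A, {Shadow CIOQ-$\mathcal{A}$, SD Fitting}) $\equiv_{f_5}$ (CIOQ, $\mathcal{A}$), $S = S^*$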
The above result, along with the containment and transitive properties of functional equivalences,
leads us to the following:
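Corollary 1 (CIOQ, $\mathcal{A}$) $\equiv_{f_i}$ (OQ, $\mathcal{A}$), $S = S^*$ $\Longrightarrow$
(CIOQ-A, {Shadow CIOQ-$\mathcal{A}$, Clos Fitting}) $\equiv_{f_i}$ (OQ, $\mathcal{A}$), $S = (2 - m/N) S^*$
(CIOQ-A, {Shadow CIOQ-$\mathcal{A}$, SD Fitting}) $\equiv_{f_i}$ (OQ, $\mathcal{A}$), $S = S^*$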
This allows us to establish equivalences for CIOQ-A by applying well known results for CIOQ.
For example, a CIOQ-A switch is relatively stable in the strictest sense with an OQ switch, by
shadowing a CIOQ with the SOQF algorithm with speedup 2, followed by fitting. Clos fitting
requires a total speedup of $2(2 - m/N)$ and a complexity of $O(N^2 \log N)$ dominated by the sorting
time, while SD fitting requires a total speedup of 2 and a complexity of $O(N^2 \log N + Nm^2)$. An
open question is whether equivalence at level $f_4$ (i.e., strictest sense) can be established without
resorting to a shadow CIOQ, possibly in faster time.
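The aggregation and fitting steps of Sec. 4.3.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the report's own procedure: the function names (`aggregate`, `pad_to_regular`, `perfect_matching`, `fit`) are hypothetical, $N$ is assumed divisible by $m$, and the decomposition uses repeated augmenting-path perfect matchings on a padded regular matrix. It yields $N/m$ matchings, the same count as SD fitting, but it is not the Slepian-Duguid rearrangement procedure itself.

```python
def aggregate(matching, n, m):
    """Collapse an N x N matching {input: output} into an m x m count matrix."""
    group = n // m  # interfaces per memory element (assumes m divides n)
    agg = [[0] * m for _ in range(m)]
    for i, j in matching.items():
        agg[i // group][j // group] += 1
    return agg


def pad_to_regular(agg, d):
    """Add dummy circuits so that every row and column sums to exactly d."""
    m = len(agg)
    padded = [row[:] for row in agg]
    row_sum = [sum(row) for row in padded]
    col_sum = [sum(padded[i][j] for i in range(m)) for j in range(m)]
    for i in range(m):
        for j in range(m):
            extra = min(d - row_sum[i], d - col_sum[j])
            if extra > 0:
                padded[i][j] += extra
                row_sum[i] += extra
                col_sum[j] += extra
    return padded


def perfect_matching(mult):
    """Find a perfect matching on the positive entries (augmenting paths)."""
    m = len(mult)
    owner = [-1] * m  # owner[j] is the row currently matched to column j

    def augment(i, seen):
        for j in range(m):
            if mult[i][j] > 0 and not seen[j]:
                seen[j] = True
                if owner[j] == -1 or augment(owner[j], seen):
                    owner[j] = i
                    return True
        return False

    for i in range(m):
        if not augment(i, [False] * m):
            raise ValueError("matrix is not regular")
    return {owner[j]: j for j in range(m)}


def fit(agg, d):
    """Partition the aggregate matrix into d matchings for the m x m element."""
    padded = pad_to_regular(agg, d)
    real = [row[:] for row in agg]
    configs = []
    for _ in range(d):
        config = {}
        for i, j in perfect_matching(padded).items():
            padded[i][j] -= 1
            if real[i][j] > 0:  # a real circuit, not a dummy one
                real[i][j] -= 1
                config[i] = j
        configs.append(config)
    return configs
```

For an $8 \times 8$ matching aggregated onto $m = 4$ elements, `fit(aggregate(matching, 8, 4), 2)` returns two configurations of the $4 \times 4$ space element that together carry every circuit of the original matching.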
4.3.2 CIOQ-P: Pipelining
A CIOQ switch with pipelining (CIOQ-P) is obtained by replacing a single space element of CIOQ,
running at interface rate $S$, by $K$ instances of elements running at rate $S/K$, as shown in Fig. 4.2(c).
Each memory element requires a memory bandwidth equal to $(1 + S)$ times the external interface
rates (same as in CIOQ). Each space element has a configuration frequency of $S/K$ matchings per
timeslot. The length of an internal timeslot is now much longer, equal to $K/S$ times the external
timeslot. The frequency of the total number of arbitrations, over all the space elements, remains
at $S$ per timeslot. The main design advantage of this approach is that slower space elements of the
same dimensions may be used to construct a higher interface rate switch. Also, if the matchings
for all the elements can be computed in parallel (a non-trivial task), the frequency of arbitrations
goes down by a factor of $K$.
We first show two algorithms that rely on shadowing a CIOQ switch, in the same fashion as in
the previous section. Again, let $S^*$ be the speedup required by algorithm $\mathcal{A}$ on CIOQ, to achieve a
certain equivalence with an OQ switch. Shadow CIOQ-$\mathcal{A}$ computes matchings $\pi(t)$ at a frequency











Therefore, each space element gets $1/K$ of the original matchings, which can be satisfied as long
as $S = S^*$. Since $\pi(t)$ is essentially allocated to the $K$ space elements, one element at a time, we
call this method sequential dispatch.
To bring down the arbitration frequency in line with the lower transfer rates of the space
element, we propose a method called striping. In this, we shadow a CIOQ switch running $\mathcal{A}$ with
envelopes of size $K$. If $S'^{*} \ge S^*$ is the speedup required to achieve the desired equivalence while
using envelopes, the shadow CIOQ computes $S'^{*}/K$ envelope matchings per timeslot. For each such
matching, we break the envelope and assign a single cell to each of the $K$ space elements. The
assignments can be satisfied as long as $S = S'^{*}$.






































Figure 4.4: Equal Dispatch using per-path queues
Theorem 3
(CIOQ-P, {Shadow CIOQ-$\mathcal{A}$, Sequential Dispatch}) $\equiv_{f_5}$ (CIOQ, $\mathcal{A}$), $S = S^*$
(CIOQ-P, {Shadow CIOQ-$\mathcal{A}$, Striping}) $\equiv_{f_5}$ (CIOQ, $\mathcal{A} \cup$ {K-Envelopes}), $S = S'^{*}$
Again, by using the containment and transitive properties of equivalences, we have the following
result for proving a desired equivalence of CIOQ-P with an OQ switch (a similar one holds for the
striping algorithm):
Corollary 2 (CIOQ, $\mathcal{A}$) $\equiv_{f_i}$ (OQ, $\mathcal{A}$), $S = S^*$ $\Longrightarrow$
(CIOQ-P, {Shadow CIOQ-$\mathcal{A}$, Sequential Dispatch}) $\equiv_{f_i}$ (OQ, $\mathcal{A}$), $S = S^*$
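The two shadow-based dispatch methods of this section differ only in how the computed matchings reach the $K$ space elements. The sketch below illustrates that difference; the function names and the representation of a matching as a dictionary are assumptions made for the example, not part of the report.

```python
def sequential_dispatch(matchings, k):
    """Hand successive shadow matchings to the k space elements in turn."""
    planes = [[] for _ in range(k)]
    for t, matching in enumerate(matchings):
        planes[t % k].append(matching)
    return planes


def stripe(envelope_matching, envelopes, k):
    """Break each matched envelope of k cells, one cell per space element.

    envelope_matching maps input -> output; envelopes[(i, j)] lists the k
    cells of the envelope matched for the pair (i, j).
    """
    planes = [[] for _ in range(k)]
    for i, j in envelope_matching.items():
        cells = envelopes[(i, j)]
        if len(cells) != k:
            raise ValueError("an envelope must hold exactly k cells")
        for plane, cell in enumerate(cells):
            planes[plane].append((i, j, cell))
    return planes
```

With sequential dispatch each element sees every $K$-th matching, so the arbiter still runs at the full shadow frequency. With striping one envelope matching configures all $K$ elements at once, which is what lowers the arbitration frequency.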
It would be advantageous to have methods which enable equivalences without resorting to
either shadowing CIOQ, or using envelopes of size $K$. Our goal is to maintain an arbitration
frequency of $S/K$ using concurrent arbiters attached to each space element. Recognize that envelopes,
in essence, allow the matched traffic to be distributed in exactly equal proportion to each of the $K$
elements. We propose an algorithm called equal dispatch which achieves the same result using
slightly more complex queueing logic. In each input memory element, per-path virtual output queues are
maintained for output $j$, for each of the space elements $k$, as shown in Fig. 4.4. The traffic
belonging to an input-output pair $(i, j)$ is split equally (on a round robin basis) among the per-path
queues, $k = 1, \ldots, K$. Each space element $k$ computes matchings concurrently
based only on the state of the per-path queues corresponding to $k$. Essentially, per-path queueing
allows the traffic belonging to $(i, j)$ to be distributed to $K$ concurrent planes, each being served by an
individual space element. We can show (see Appendix C.1) that if each element performs maximal
matching, all the queues are stable for admissible traffic, as long as $S = 2$.
Theorem 4 (CIOQ-P, {Equal Dispatch, Maximal Size}) $\equiv_{f_2}$ (OQ, {WC}), $S = 2$
Note that the above would also hold for $f_3$ equivalence if Conjecture 1 is true. Similarly, if
each output memory element maintains per-path virtual input queues, grouped on the basis of the
space element that feeds into it, we can easily show that $f_4$ equivalence also holds, as long as






, where $w_{i,j}$ is the weight given to the input-output pair $(i, j)$ by a scheduler in the reference
OQ switch.
Conjecture 3 (CIOQ-P, {Equal Dispatch, SOQF}) $\equiv_{f_4}$ (OQ, {WC, Weight Sched}), $S = 2$
For implementation purposes, to avoid mis-sequencing, the per-path queues are not physically
separate. Indeed, all the cells are enqueued into the same physical virtual output (and input) queue,
and only the per-path occupancies are maintained. These occupancies are viewed as actual queue
lengths for matching purposes only. By sequentially numbering the space elements, or by slightly
skewing the cell transfers, the ordering of the cells can be determined at the output element.
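The implementation note above (one physical queue, per-path occupancy counters only) can be sketched as follows. The class and method names are hypothetical, and the choice to serve the head-of-line cell whenever any path is matched is an assumption made for the example, consistent with the counters being used "for matching purposes only".

```python
from collections import deque


class EqualDispatchQueue:
    """One physical virtual output queue with K per-path occupancy counters."""

    def __init__(self, k):
        self.k = k
        self.cells = deque()       # single physical FIFO for the pair (i, j)
        self.occupancy = [0] * k   # what the arbiter of each space element sees
        self.next_path = 0         # round-robin pointer for equal dispatch

    def enqueue(self, cell):
        """Store the cell and credit the next path in round-robin order."""
        path = self.next_path
        self.occupancy[path] += 1
        self.next_path = (path + 1) % self.k
        self.cells.append(cell)
        return path

    def serve(self, path):
        """Called when space element `path` matches this pair."""
        if self.occupancy[path] == 0:
            raise ValueError("no traffic credited to this path")
        self.occupancy[path] -= 1
        return self.cells.popleft()  # always the oldest cell, preserving order
```

Because the cell handed to a space element is always the oldest one in the physical queue, the order in which cells leave the input element does not depend on which path happens to be matched first.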
4.3.3 G-MSM: Memory Space Memory
A General Memory-Space-Memory (G-MSM) switch may be constructed by combining aggregation
and pipelining as shown in Fig. 4.2(d). Each memory element requires a memory bandwidth
of $(1 + S)N/m$ times the interface rates. Each space element has a transfer rate and configuration
frequency of $SN/(mK)$. The total arbitration frequency, over all the space elements, is $SN/m$
matchings per timeslot, as in CIOQ-A. The length of an internal timeslot of the space element is
$mK/(SN)$ times that of the external timeslot, and may be higher or lower depending on the chosen
dimensions.
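The design parameters quoted for CIOQ, CIOQ-A, CIOQ-P and G-MSM follow one pattern, which the short helper below tabulates. The function name and the dictionary keys are hypothetical; all quantities are expressed relative to the external interface rate and the external timeslot, as in the text.

```python
from fractions import Fraction


def design_parameters(n, s, m=None, k=1):
    """Parameters of a single-path Buffered Clos switch.

    n: number of interfaces, s: speedup, m: memory elements per stage
    (m = n means no aggregation), k: space elements (k = 1 means no pipelining).
    """
    m = n if m is None else m
    group = Fraction(n, m)             # interfaces per memory element
    s = Fraction(s)
    link_rate = s * group / k          # per-link transfer rate of a space element
    return {
        "memory_bandwidth": (1 + s) * group,
        "link_rate": link_rate,
        "matchings_per_timeslot_per_element": link_rate,
        "total_matchings_per_timeslot": s * group,
        "internal_timeslot": 1 / link_rate,
    }
```

For example, `design_parameters(16, 2, m=4, k=2)` describes a G-MSM switch with memory bandwidth 12, link rate 4, and an internal timeslot one quarter of the external one, while `design_parameters(16, 2, k=4)` describes a CIOQ-P switch whose internal timeslot is twice the external one.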
The performance of a G-MSM switch is a natural combination of that of the CIOQ-A and





in input element $i$ for each path
(space element) $k$, where $j$ is the destination output element. We can show, similar to Theorems 1
and 4, that the equal dispatch algorithm followed by $K$ concurrent maximal size matchings, at a
frequency of $2N/(mK)$, is sufficient to ensure relative stability with a work-conserving OQ switch:
Theorem 5 (G-MSM, {Equal Dispatch, Maximal Size}) $\equiv_{f_2}$ (OQ, {WC}), $S = 2$
While the truth of Conjecture 1 will allow us to easily establish $f_3$ equivalence, an open question is
whether equivalence at level $f_4$ can be established using a variation of SOQF, without resorting to
shadowing.
There are three shadowing approaches for G-MSM, which may be used to achieve stronger
equivalences. In the first, virtual input (and output) queues are maintained, as required by a target
algorithm $\mathcal{A}$, on a per-path basis. The equal dispatch algorithm is used to separate traffic into $K$
planes. Each path then (in parallel) shadows a CIOQ switch at a frequency of $S^*/K$ matchings
per timeslot, followed by the aggregation and fitting techniques of CIOQ-A. While equal dispatch
does not lend itself to an exact emulation of a CIOQ switch, the shadowing and fitting algorithms
emulate a CIOQ in each plane (Theorem 2). Combining with Conjecture 3, we obtain:
Conjecture 4 (G-MSM, {Equal Dispatch, Shadow CIOQ-$\mathcal{A}$, SD Fitting}) $\equiv$






Figure 4.5: Equivalences for single-path Buffered Clos switches (edge labels: f1: BVN Decomposition; f2, f3: Maximal Match, optionally with Equal Dispatch; f4: SOQF Maximal Match, optionally with Equal Dispatch, or Equal Dispatch with Shadow and Fitting; f5: Stable Marriage, or Shadow with Fitting, Sequential Dispatch, and/or Striping)
The above may be extended by the transitive property to achieve any equivalence $f_i$, $i \le 4$, with a
reference OQ switch.
The second approach is an extension of the striping algorithm, which does away with the need
for equal dispatch or per-path queues. Choosing an envelope size of $K$ cells and maintaining a
shadowing frequency of $S'^{*}/K$, the matchings are aggregated and partitioned in a manner similar
to the one in CIOQ-A. The same partitions are used to configure all the $K$ space elements in each
internal timeslot. Each of the space elements then carries a single cell from the envelope. Since
striping (Theorem 3) and aggregating (Theorem 2) provide for precise emulation of a CIOQ, we
obtain the following:
Theorem 6
(G-MSM, {Shadow, SD Fitting, Striping}) $\equiv_{f_5}$ (CIOQ, $\mathcal{A} \cup$ {K-Envelopes}), $S = S'^{*}$
The third approach is an extension of the sequential dispatch algorithm, and requires a
shadowing frequency of $S^*$ ($\le S'^{*}$) matchings per timeslot, but does not use envelopes. Each of the
matchings is aggregated, as in a CIOQ-A; however, the resulting partitions are used to configure
one space element after the other in a sequential fashion. Note that the number of partitions ($N/m$
if SD fitting is used) is not required to exactly fit in the $K$ elements. We obtain the same result as
in the above theorem, with $S'^{*}$ replaced by $S^*$. The results corresponding to the last two approaches
may be extended, again using the transitive property (similar to Corollary 2), to achieve the desired
equivalence with an OQ switch.
Fig. 4.5 summarizes all the equivalences discussed so far (including the conjectures). The
containment and transitive properties allow us to follow multiple hops of the equivalence graph. An
interesting open problem for a G-MSM switch is whether the memory elements can be recursively
(!!) constructed using an $(N/m) \times K$ CIOQ switch. This seems to be possible for all equal
dispatch-based algorithms, but only with physically separate per-path queues, which puts it in the realm of
multi-path switches.
4.4 Parallel Packet Switches
A parallel packet switch (PPS) pools the bandwidth resources on several switching paths
in order to construct a high capacity switch. As shown in Fig. 4.6, an $N \times N$ switch with interface
rate $R$ is composed of $K$ logical memory elements, each of which is of size $N \times N$ and operates at
an interface rate of $r$. A central element is referred to as a core switch, while the $1 \times K$ first stage
element is referred to as an ingress demultiplexor and the $K \times 1$ third stage element is referred to as
an egress multiplexor. The $K$ core switch elements may be realized by a single physical $KN \times KN$
memory element, in which case the switch is referred to as an inverse multiplexed switch [10], or
by $K$ separate physical planes [33, 42]. Also, the memory element(s) itself may be replaced by a
CIOQ switch, which is equivalent at a given level $f_i$ to an OQ switch (a single memory element).
For the purposes of analysis, however, we will restrict ourselves to the multi-plane realization, and
$K$ physical memory elements, with the understanding that an $f_i$ equivalence with an OQ switch
for this system will continue to hold if the memory elements are replaced with a CIOQ exhibiting






























Figure 4.6: A parallel packet switch (PPS)
If $S$ is the speedup of the switch, then the internal links operate at $S/K$ times the interface rates.
Assuming a CIOQ implementation with speedup $S^*$ for the core switch, the memory bandwidth
required in the core is $(1 + S^*)S/K$ times the interface rate, which can be made significantly lower
than the interface rate by choosing an appropriate $K$. However, if the first and third stages are also
composed of memory elements, they still require a memory bandwidth of $(1 + S)$ times the interface
rates. Thus, we identify two significant benefits of this architecture. Firstly, it allows the reuse of lower
capacity components, even though it might require the design of memory elements running at rates
comparable to the higher target interface rates. Secondly, it raises the possibility of using space
elements in the first and third stages, thereby allowing a switch to be built in which the fastest memory
runs at a fraction of the external interface rates. Our goal is to establish equivalences for the PPS,
while attempting to keep the ingress and egress either memoryless or with a small fixed amount of
memory which can be implemented on-chip.
4This statement itself needs a rigorous proof, which we omit here.
4.4.1 Flow-Based PPS
Consider a PPS of type (memory, memory, memory), i.e., all the stages are composed of memory elements.
Assuming the rates of all the flows that traverse the switch are known, $f_1$ equivalence with an OQ
switch may be achieved by statically assigning a flow to a switching path in one of the $K$ core
elements. The resulting switch is called a flow-based PPS. A path for an admissible flow can be found
as long as the number of paths satisfies the condition for multirate Clos topologies (see Sec. 3.1),
which translates to the following result:







Note that per-path queues are required in the ingress and egress elements to buffer the burstiness
of the flows assigned to the respective paths. The problem associated with the fragmentation of the
internal link bandwidth can be mitigated by splitting a selected number of high bandwidth flows
across multiple paths, and assigning weights to each path for the split flows [10]. The advantage of
static assignment is that there is no mis-sequencing of packets (except for the split flows), as there
is a single fixed path for each of the flows. Several so-called clustered routers use this approach
due to its simplicity. Even flows with no specified allocations are assigned a path, usually through
uniform hash functions. Flow-based PPS has several disadvantages. Bufferless elements in the
first and third stages cannot be contemplated. In addition, a large number of high bandwidth
(and hence, split) flows essentially negates the sequencing advantage. Lastly, it can be shown by
a simple counter-example (see footnote 5) that equivalence with an OQ switch at level $f_2$ or higher cannot be
achieved.
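The hash-based assignment of unallocated flows mentioned above can be illustrated with a few lines. This is a sketch under assumptions: the function name, the string flow identifier, and the use of SHA-256 as the uniform hash are choices made for the example and are not taken from the report or from any particular router.

```python
import hashlib


def hash_path(flow_id, k):
    """Statically map a flow identifier to one of k core elements."""
    digest = hashlib.sha256(flow_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % k
```

Every packet of a flow takes the same path, so no mis-sequencing arises. The mapping ignores flow rates, however, so a few heavy flows hashed to the same path can overload it while other paths stay underutilized, which is the situation described in the counter-example of footnote 5.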
4.4.2 Packet-by-Packet PPS
A packet-by-packet PPS performs the path assignment on a per-packet basis in contrast to static
flow assignment. Let $\lambda_{i,j}$ be the average rate of the traffic between the input-output pair $(i, j)$.
Consider an equal dispatch algorithm that distributes traffic belonging to the pair equally among
all the $K$ paths offered by the core elements, either on a round robin basis for fixed sized units
(cells), or using deficit round robin for variable sized packets. In other words, each flow, defined
as an input-output pair in this case, is distributed equally among all the paths. Consequently, the
traffic of the pair arriving at each core element has an average rate of $\lambda_{i,j}/K$.
5Consider the case when the flows assigned to a certain path $k$, and destined to output $j$, generate more than $1/K$
times the total average rate destined to that output. Even when output $j$ is admissible, it causes instability in path $k$, while some
other path(s) is underutilized.
Therefore, if output $j$ is stable in a reference OQ switch, i.e., $\sum_i \lambda_{i,j} \le R$, then the
aggregate rate destined to output $j$ is at most $R/K$
for each of the core elements. A work conserving core element with interface rate $R/K$, i.e., $S = 1$,
ensures that output $j$ is also stable in that element. Thus, we have the following result:
Theorem 8 (PPS, {Equal Dispatch}) $\equiv_{f_2}$ (OQ, {WC}), type (memory, memory, memory), $S = 1$
The above result continues to hold when the granularity of equal dispatch is smaller than an
input-output pair, e.g., an equal dispatch of individual user flows among the paths. Per-path queues
are required for each split flow in the ingress element to resolve contention among many flows that
simultaneously choose the same path, while similar queues are required in the egress element to
correctly order the packets which arrive from different paths. Since traffic is equally split at least
on an input-output pair basis, Theorem 8 can be easily extended by considering weight-based
schedulers in the core elements with exactly the same weights as in the reference OQ switch.
Corollary 3 (PPS, {Equal Dispatch, Weight Sched}) $\equiv_{f_4}$ (OQ, {WC, Weight Sched}),
type (memory, memory, memory), $S = 1$
We will now extend the equal dispatch algorithm to a cell-based fractional dispatch
algorithm [42], which enables a PPS to have bufferless ingress elements, while maintaining the same
equivalence. First, we make two observations regarding the above results. A fluid model PPS
which divides the arriving traffic equally, in an instantaneous fashion, among the $K$ paths satisfies
Theorem 8. Moreover, no buffers are required in the ingress due to the instantaneous dispatch, and
none are required in the egress since all the core elements behave as mirror images of each other.
Secondly, notice that given an interface rate $R$ and an internal rate $r$, the number of required core
elements is given by $K = \lceil R/r \rceil$. While an equal dispatch algorithm is sufficient for relative
stability in a cell-based PPS model, a fractional dispatch which ensures that out of $\lceil R/r \rceil$ consecutive
cells belonging to a pair $(i, j)$, not more than one cell is forwarded to any single core element, is
also sufficient for the same (see Appendix C.2).
Theorem 9 (PPS, {Fractional Dispatch}) $\equiv_{f_2}$ (OQ, {WC}), type (memory, memory, memory), $S \ge \lceil R/r \rceil / (R/r)$
Next, we recognize that when a cell arrives at an input interface and is assigned to a path, the
path remains unusable for $\lceil R/r \rceil - 1$ additional timeslots, as shown in Fig. 4.7. Therefore, an
arriving cell finds a total of at most $\lceil R/r \rceil - 1$ busy interfaces to the core elements. In addition,
if the ingress performs fractional dispatch, the $\lceil R/r \rceil - 1$ previously used paths cannot be used to
dispatch the current cell. However, the cell can be immediately dispatched, without queueing, if
at least one path remains, i.e., if $K \ge 2\lceil R/r \rceil - 1$.
This translates to a speedup of $S = K/\lceil K/2 \rceil$. Furthermore, we showed in [42] that fractional
















Figure 4.7: PPS: Timeslot Structure (for a fixed value of $\lceil R/r \rceil$)
allowing its application to variable packet sizes. Specifically, the algorithm stipulates that out of a













Since the condition of Theorem 9 is not violated by the instantaneous fractional dispatch, we have
the following result for a PPS with bufferless ingress elements:
Theorem 10 (PPS, {Fractional Dispatch}) $\equiv_{f_2}$ (OQ, {WC}), type (space, memory, memory), $S = K/\lceil K/2 \rceil$
Note that even though the element is bufferless, there do exist some on-chip buffers to perform
header processing, and to execute the load balancing logic. Also, with fractional dispatch, the
egress element still requires buffers for re-sequencing. A work item in progress is to find the mis-
sequencing bound for this algorithm, and hence the size of the egress memory in order to ascertain
if it is small enough to be implemented on-chip (therefore qualifying the egress also as a space
element). Another item [41] is the trade-off between buffers and speedup, while using fractional
dispatch to maintain relative stability.
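The bufferless fractional dispatch described above combines two constraints: a path stays busy for $\lceil R/r \rceil$ timeslots once used, and the last $\lceil R/r \rceil - 1$ paths used by the same input-output pair are excluded. The sketch below models one ingress element under these constraints. The class name, the parameter `window` (standing for $\lceil R/r \rceil$), the first-fit choice among eligible paths, and the assumption of at most one arriving cell per timeslot are all choices made for the example.

```python
from collections import defaultdict, deque


class FractionalDispatcher:
    """Ingress element that dispatches each cell immediately, without queueing."""

    def __init__(self, k, window):
        if k < 2 * window - 1:
            raise ValueError("need k >= 2 * window - 1 core elements")
        self.k = k
        self.window = window
        self.free_at = [0] * k  # first timeslot at which each path is idle again
        # Paths used by the last (window - 1) cells of each output.
        self.recent = defaultdict(lambda: deque(maxlen=window - 1))

    def dispatch(self, now, output):
        """Pick a path for the cell arriving at timeslot `now` for `output`."""
        recent = self.recent[output]
        for path in range(self.k):
            if self.free_at[path] <= now and path not in recent:
                self.free_at[path] = now + self.window
                recent.append(path)
                return path
        raise RuntimeError("no eligible path (more than one cell per timeslot?)")
```

At most `window - 1` paths are busy and at most `window - 1` are excluded by the pair history, so with `k >= 2 * window - 1` the loop always finds an eligible path.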
S. Iyer et al. [33] have shown that an OQ switch with a FIFO discipline can be exactly emulated
by a PPS with bufferless ingress and egress elements, with the same speedup as in the last
theorem. The proposed emulation algorithm, called the centralized PPS algorithm (CPA), maintains a
shadow OQ switch, and for every incoming cell at time $t$ for the pair $(i, j)$, determines the time
$d(t, i, j)$ when the cell departs in the shadow switch. Recognizing that at most $\lceil R/r \rceil - 1$ paths
at the ingress $i$ will be busy at the arrival time $t$, and a similar number at egress $j$ at the departure
time $d(t, i, j)$, a path which simultaneously satisfies both the input and output constraints can be
found as long as $K \ge 2\lceil R/r \rceil - 1$. Therefore, we have
Theorem 11 (PPS, {CPA}) $\equiv_{f_5}$ (OQ, {WC, FIFO}), type (space, memory, space), $S = K/\lceil K/2 \rceil$
A followup result [34] showed that a distributed variant of CPA, which maintains the input
and output constraints independently at each input, can emulate an OQ switch within a delay of
yﬁ²³{´µ} using the same speedup as above. However, the egress element now requires a buffer of size
$NK$, which can presumably be implemented on-chip. Further, it was shown that the same amount
of buffers in the ingress element allows the OQ switch to be emulated, within the same delay, using
no speedup at all. It is important, however, to note that the emulation schemes need to maintain
temporal lists from which the departure time can be determined.
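The path selection of the centralized algorithm described above, which intersects the paths idle at the ingress at arrival time with those idle at the egress at the shadow departure time, can be sketched as follows. This is a simplified illustration and not the algorithm of [33] as published: the class name, the parameter `window` (standing for $\lceil R/r \rceil$), the first-fit choice, and the simplified shadow FIFO departure time (one departure per output per timeslot, no fixed offset) are assumptions made for the example.

```python
class CentralizedPPS:
    """Chooses a core element satisfying both ingress and egress constraints."""

    def __init__(self, n, k, window):
        if k < 2 * window - 1:
            raise ValueError("need k >= 2 * window - 1 core elements")
        self.k = k
        self.window = window
        self.in_free_at = [[0] * k for _ in range(n)]   # ingress i -> core links
        self.out_free_at = [[0] * k for _ in range(n)]  # core -> egress j links
        self.next_departure = [0] * n                   # shadow FIFO OQ state

    def arrive(self, now, i, j):
        """Assign the cell arriving at `now` for pair (i, j); return (path, departure)."""
        depart = max(now, self.next_departure[j])
        self.next_departure[j] = depart + 1
        for path in range(self.k):
            idle_in = self.in_free_at[i][path] <= now
            idle_out = self.out_free_at[j][path] <= depart
            if idle_in and idle_out:
                self.in_free_at[i][path] = now + self.window
                self.out_free_at[j][path] = depart + self.window
                return path, depart
        raise RuntimeError("no path satisfies both constraints")
```

Each constraint rules out at most `window - 1` paths, so a common path exists whenever `k >= 2 * window - 1`. The per-output departure state kept here is a small instance of the temporal bookkeeping that the emulation schemes require.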
Note that multi-path switches, in general, suffer from out-of-sequence arrivals at the egress
element. This implies that local sequence control algorithms might be required. A summary of our
findings on open-loop sequence control can be found in [43].
4.5 QoS in Multi-Module Switches
We now propose a two-phase algorithm (see footnote 6) called switched fair airport, which enables QoS
provisioning in a multi-module switch. The idea is to use proven $f_1$ and $f_4$ equivalences of a switch to
provide isolated bandwidth guarantees and fairness in sharing excess bandwidth, respectively, in
a fashion similar to an OQ switch, without the need for exact emulation ($f_5$). Our reference OQ
switch employs the fair airport scheduler at each output link. A guaranteed-QoS (see Appendix A





is scheduled using a non-work conserving policy (e.g., a shaped virtual clock). Each flow, includ-




, which is used by a work conserving
scheduler (e.g., weighted round robin) to fill up the slots unused by the guaranteed portion. We
assume a finite buffer size $B$ associated with each output $j$, which uses a fair buffer management
scheme (which also guarantees buffer allocations).
The switched fair airport algorithm is a combination of two phases, the first that ensures $f_1$
equivalence with an OQ switch for a shaped portion of the flows, and the second that ensures $f_4$
equivalence. We will illustrate its working on a CIOQ switch, and then describe its extension to
other single-path Buffered Clos switches. Consider an $N \times N$ CIOQ switch, in which the output
elements employ the same per-flow fair airport scheduler as in the reference OQ switch. An input
element $i$ contains a non-work conserving scheduler per output $j$, which serves the $k$ flows for the





implementation purposes, these queues contain only the occupancy counters for each flow, and the
cells continue to reside in the $k$ per-flow queues associated with the pair $(i, j)$. The total average























, over all input-output pairs, this step essentially provides a
virtual bandwidth trunk of the aggregated rate allocations for each pair $(i, j)$. The shaping within
each pair ensures that the trunk bandwidth is properly divided among the $k$ flows in the pair (see footnote 7).
Conjecture 5 Rate shaping of flows at every input-output pair, followed by a maximal size match-
ing is sufficient to ensure the individual flow rates through the space element of a CIOQ.
6We call this scheme switched fair airport due to its resemblance with the fair airport algorithm [27] for OQ.
7We have recently proved this conjecture to be true, using the Clos theorem (surprise, surprise!). However, we











Figure 4.8: QoS in CIOQ using Switched Fair Airport (panel label: Phase 1: Maximal size on G-VOQ)
In the second phase, the same maximal size matching adds connections using SOQF. For this
purpose, an output element maintains virtual input queue occupancies $q_{i,j}$, and an input element
contains a virtual output EBS scheduler, which when enabled by an added connection in the second
phase, distributes the excess bandwidth according to the same weight $w^{k}_{i,j}$. The intuition behind
this approach is that if an input-output pair remains stable (in the strictest sense) when its total
arrival rate is $w_{i,j}$ times the available excess bandwidth, then a flow within that pair remains stable
when its arrival rate is $w^{k}_{i,j}$ times the available bandwidth on the output link. However, this still
remains a conjecture as we have not thoroughly analyzed the scheme yet.
Conjecture 6 The SOQF algorithm, combined with a virtual-output weight-scheduler, maintains
the same per-flow weighted allocation of the output link excess bandwidth as an OQ switch.
Note that SOQF provides the function of feedback control from the output link scheduler by
giving priority, in the matching, to flows whose service rate at the input element does not overstep
the service rate at the output element. Note also that the effect of finite buffers is not completely
clear. We assume that a buffer of size $B$ (same as in each output of a memory element) is now
present in each of the input and output elements with the same buffer management scheme to
control the residues of scheduling.
The two-phase switched fair airport queueing structure is illustrated in Fig. 4.8. This approach
can be easily extended to CIOQ-A that shadows a CIOQ by maintaining the same queueing struc-
ture as above. Similarly, it may be extended to CIOQ-P and G-MSM switches that emulate a
CIOQ, and to equal dispatch based systems, by using the same rates and weights on each path for
an input-output pair. (An elaboration of these extensions is still in progress.)
Providing QoS guarantees for the PPS architecture is more involved. Bandwidth guarantees
may be provided by using a flow-based PPS. However, the throughput as well as the fairness in
the allocation of excess bandwidth is not comparable to an OQ switch. Presumably, a
packet-by-packet PPS may be used with equal or fractional dispatch on a per flow basis, with each core
element containing the same output link scheduler as the reference switch, with the same weights
for the subflows (since each core element receives no more than $1/K$ of the flow traffic). However,




We addressed the problem of building a high capacity QoS-capable packet switch using stages of
lower capacity memory and space elements. Towards that end we proposed a taxonomy of fully
connected three-stage switches called Buffered Clos switches, and provided a formal framework
to study their performance. Within this framework, we presented algorithms that provide func-
tional equivalence with an ideal reference switch for the aggregated and pipelined CIOQ, general
memory-space-memory switches, and the parallel packet switch architecture. To apply the QoS
methodologies to multi-module switches, we introduced the switched fair airport algorithm.
The following are some items that need to be addressed. The SOQF algorithm needs to be
analyzed, and the related conjectures proven. The work on PPS needs further investigation to
bound the mis-sequencing and egress buffers for the fractional dispatch algorithm. Also, the work
on switched fair airport is fairly nascent and needs to be concluded.
We would like to thank Denis Khotimsky and Dimitrios Stiliadis for several productive techni-




A.1 Properties of Circuit Switches
Circuit switched networks rely on dividing the physical communication media into synchronous
units called channels. For example, in time division multiplexed (TDM) networks, the time axis of
a transmission link is divided into timeslots, while in wavelength division multiplexed (WDM) and
frequency division multiplexed (FDM) networks, a channel corresponds to a carrier wavelength or
frequency, respectively. A circuit is established by pre-determining the path to be traversed through
the network, and assigning channels, if they are available, on the links that comprise the path. The
allowed circuit bit rates, the frame sizes to be transmitted in each synchronous unit, and corre-
spondingly the size of the unit, are governed by digital transmission standards, such as SDH and
SONET [16] for TDM networks. A circuit switched node transfers the frames that arrive on a given
channel of an input link to a pre-configured channel of an output link. Examples of such systems
include SONET switches that employ electronic crossbars, and more recently, wavelength routers
based on optical components such as tunable lasers and Micro-Electro-Mechanical (MEMS) mir-
rors.
Contention occurs when more than one frame arrives at a given link at the same instant. Since
channels are allocated on every network link of the end-to-end path of a circuit, prior to frame
transmission, there is no external contention for the output link resources of a circuit switch, and
hence no inherent necessity for buffers (see footnote 1) to absorb contention. Consequently, the performance
measures of interest are related only to the admissibility of circuits, also called call-level behavior, and
the realization of internal paths, through the switching node, for the admitted circuits. The former
is studied using so-called loss models [61], which characterize circuit acceptance probabilities,
based on the given statistics of call arrivals and holding times. A celebrated example is the Erlang
loss system which has been used for years to engineer telephony networks. The issue of path real-
ization, on the other hand, deals with the architecture of the switching node itself. By definition,
a node achieves maximum throughput as long as a switching path can be established for every
circuit that is admissible solely on the basis of the available bandwidth of the external links. Once
a circuit is admitted and a path is realized, the frame-level behavior, in terms of the observed frame
bandwidth and delays, is guaranteed without additional mechanisms.
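As an illustration of the loss models mentioned above, the following is a minimal sketch of the Erlang B formula, which gives the call blocking probability of a link with a fixed number of channels under Poisson call arrivals. The function name `erlang_b` and its parameter names are assumptions chosen for this example and do not appear in the report.

```python
def erlang_b(offered_load: float, channels: int) -> float:
    """Blocking probability of an Erlang loss system.

    offered_load: call arrival rate times mean holding time (in Erlangs).
    channels:     number of circuits available on the link.

    Uses the standard recursion B(0) = 1,
    B(n) = a * B(n-1) / (n + a * B(n-1)), which avoids large factorials.
    """
    if offered_load < 0 or channels < 0:
        raise ValueError("offered_load and channels must be non-negative")
    blocking = 1.0
    for n in range(1, channels + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking
```

For example, `erlang_b(2.0, 2)` evaluates the blocking probability of a two-channel link offered two Erlangs of traffic. Such a value characterizes only the call-level behavior; it says nothing about path realization inside the node.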
Footnote 1: A fixed amount of frame buffers may be used to convert from one channel to another, e.g., for interchanging
timeslots, an operation called grooming.
A bipartite mapping from the input to the output links of a switch, during a given time unit, is
referred to as a matching if each input is connected to no more than one output, and vice-versa.
A switch is called non-blocking if internal paths can be set up to satisfy any given matching. It is
said to be non-blocking in the strict sense if a path can be found between an idle input-output pair
without disturbing the ones that support any existing matching. If a re-arrangement of the existing
paths is necessary and sufficient to support the new pair, the switch is called re-arrangeably non-
blocking. All non-blocking switches are functionally equivalent in terms of the matchings that
they can realize. Note that while a matching determines the input-output pairs to be connected,
path establishment within the switch determines the feasibility of the given matching. In circuit
switched networks, both the sequence of matchings and the associated paths are determined at the
call-level, prior to actual frame transmission.
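The definition of a matching can be made concrete with a short check. The sketch below assumes that inputs and outputs are numbered from zero and that a mapping is given as a list of (input, output) pairs; the function name `is_matching` is hypothetical.

```python
def is_matching(connections, num_inputs, num_outputs):
    """Return True if `connections` is a matching.

    connections: iterable of (input, output) index pairs.
    A matching connects each input to at most one output and
    each output to at most one input.
    """
    used_inputs, used_outputs = set(), set()
    for i, j in connections:
        if not (0 <= i < num_inputs and 0 <= j < num_outputs):
            raise ValueError(f"pair ({i}, {j}) is out of range")
        if i in used_inputs or j in used_outputs:
            return False  # an input or an output is used twice
        used_inputs.add(i)
        used_outputs.add(j)
    return True
```

A non-blocking fabric must be able to realize every mapping for which this check succeeds; whether it can do so without disturbing existing paths is what separates the strict-sense and re-arrangeable variants.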
A trivial example of a strictly non-blocking $N \times M$ switching fabric is a crossbar that employs
$N \cdot M$ electronic crosspoints, configured using the current matching, in order to connect $N$ inputs
to $M$ outputs. There is a unique path, between an input and an output, which is readily available
when both are idle. Much of the work in circuit switching [29, 57] addresses the construction of
a bufferless fabric using an inter-connection network of smaller components, motivated primarily
by the fact that electronic crosspoints were expensive, and their number in a crossbar increases
quadratically with the size of the switch. Such networks may have internal contention due to
commonality in the internal paths between input-output pairs, and may employ internal speedup
to counter blocking, which is defined as the ratio of the total link bandwidth between two stages,
to the total external bandwidth. Examples of notable inter-connection networks are the Banyan
network and the Batcher sorting network, which require $O(N \log N)$ and $O(N \log^2 N)$ crosspoints,
respectively. Both these networks are self-routing with a unique path between an input-output pair,
and hence do not require centralized path set-up; however, the resulting structure is blocking. A
popular example of a non-blocking network, which we shall revisit on several occasions, is the
3-stage (and recursive) Clos network, which uses $O(N^{1.5})$ crosspoints. A more complex example is
a Cantor network, which uses $O(N \log^2 N)$ crosspoints and remains non-blocking by employing
several planes of a (blocking) sorting network. Indeed, the literature in the design of bufferless
inter-connection networks is fairly rich, yet rigorous due to the well defined cost parameters and
performance measures, i.e., the number of crosspoints and blocking behavior, respectively.
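To make the crosspoint comparison concrete, the following sketch counts crosspoints for a square crossbar and for a symmetric 3-stage Clos network. It assumes the standard strict-sense non-blocking condition of $2n - 1$ middle-stage modules for first-stage modules with $n$ inputs; the function and parameter names are hypothetical.

```python
def crossbar_crosspoints(num_ports: int) -> int:
    """Crosspoints in a square num_ports x num_ports crossbar."""
    return num_ports * num_ports


def clos_crosspoints(num_ports: int, group_size: int) -> int:
    """Crosspoints in a symmetric, strictly non-blocking 3-stage Clos network.

    The ports are split into r = num_ports / group_size first-stage modules
    of size group_size x m, with m = 2 * group_size - 1 middle modules of
    size r x r, and r last-stage modules of size m x group_size.
    """
    if group_size <= 0 or num_ports % group_size != 0:
        raise ValueError("group_size must evenly divide num_ports")
    r = num_ports // group_size
    m = 2 * group_size - 1
    first_and_last = 2 * r * group_size * m
    middle = m * r * r
    return first_and_last + middle
```

With these assumptions, a 1024-port fabric built from groups of 32 ports needs far fewer crosspoints than the crossbar, whereas for a small fabric such as 16 ports in groups of 4 the crossbar is cheaper.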
A.2 Packet Switch Design
Packet switched networks do not rely on holding dedicated link resources and on switching pre-
established physical layer circuits. Instead, the physical and link layers have the flexibility to
employ either synchronous means such as SONET, or asynchronous ones such as Ethernet, on a
link-by-link basis. The forwarding decisions at each packet switched node, on the end-to-end path,
are made on a per-packet basis. A packet may traverse multiple nodes that are connected
to each other through heterogeneous link layers, where a logical link may itself consist of
a physical circuit that spans several circuit switched nodes. In connection-oriented networks such
as ATM, the packet (see footnote 2) header contains a virtual circuit (VC) identifier, which corresponds to a flow
and is used to look up the path information, namely, the output interface to use for forwarding,
as well as other service parameters, both of which are set up during a signaling phase prior to
data transmission. Alternatively, in connectionless networks such as IP, the network layer address
in the packet header is used to look up a routing table, which maintains reachability information
using a concurrent control plane. Flows are identified by a header filter, which may be configured
using a signaling phase (e.g., as in RSVP [74]) or through service level agreements (SLA), and
may correspond to ranges of source and destination addresses, protocol types, and/or application
port numbers. In either kind of network, a flow may be fine-grained, such as a single VoIP session
configured via RSVP, or coarse-grained such as a permanent ATM VC providing connectivity
between two IP subnets. The aggregate traffic between the input-output pairs of a switch may
be considered as the flows with the coarsest granularity, as seen by that switch. Since dedicated
resources need not be held for the entire lifetime of a flow, packet switched networks, in theory,
provide better utilization due to statistical multiplexing of the link resources.
Footnote 2: In ATM, this is actually referred to as a cell. However, we will continue to refer to all forwarded units as “packets”,
and reserve the term “cell” for the internal fixed size unit which is switched.
The foremost distinguishing feature of a packet switch is the inherent presence of external
contention for the output link, i.e., several packets destined to the same output may arrive simulta-
neously. This phenomenon can be sustained over an arbitrary period of time, resulting in a backlog
of unserved packets inside the switch, which needs to be buffered. An arriving packet may be
dropped in response to congestion in the finite amount of available buffers, yielding a packet loss
ratio for the flow. The admitted backlog is then scheduled in a chosen fashion, which determines
the packet delays and the observed flow service rates.
The traffic presented to a typical packet switch consists of a combination of guaranteed QoS
flows, for which the traffic profiles and service requirements, such as desired average bandwidth
and packet delay bound, are known in advance, and best effort flows without any pre-specified
profile or requirements. Guaranteed QoS flows undergo a call admission control (CAC) procedure
during the signaling phase. This procedure uses the profile (e.g., a leaky bucket specification [18])
of the flow and the service requirements, in order to ascertain if the switch has enough resources, in
the longer term, to meet those specifications. If successful, effective bandwidth and buffer size val-
ues are calculated, which are then used to program the scheduling and buffer management schemes
within the switch so that the agreed upon requirements may be honored. In addition, policers at
the inputs of the switch ensure that the traffic sources adhere to the advertised profiles. Often, the
flows are allowed to violate their profiles; however, the excess component of the traffic is treated
on a best effort basis. While loss models [61], similar to the ones in circuit switching, may be used
to study the call-level behavior of the guaranteed QoS flows, and hence to engineer the network,
it is ultimately the scheduling and buffer management policies, i.e., the packet-level behavior, that
govern the feasibility of the longer term call admission, and thereby the QoS capability of the cho-
sen switch. On the other hand, best effort flows, which comprise the majority of the Internet traffic
today, do not undergo similar CAC procedures or conventional loss models for network engineer-
ing. Instead, empirical studies [58, 4] are used to characterize the statistical nature of the flows in
the chosen network segments, which then yield the expected amount of effective bandwidth and
buffers required to accommodate them in the network nodes, and thereby to engineer the network.
Consequently, the performance measures of interest for the best effort traffic are related mainly to
optimality in packet throughput.
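The input policers described above can be sketched as a token-based leaky bucket that marks each arriving packet as conforming or excess. This is a minimal illustration under assumed units (bytes and seconds); the class name `LeakyBucketPolicer` and its fields are hypothetical and are not taken from the report.

```python
class LeakyBucketPolicer:
    """Checks packets against a (rate, bucket_size) leaky bucket profile."""

    def __init__(self, rate: float, bucket_size: float):
        self.rate = rate                # token refill rate, bytes per second
        self.bucket_size = bucket_size  # maximum burst, bytes
        self.tokens = bucket_size       # the bucket starts full
        self.last_time = 0.0

    def conforms(self, arrival_time: float, packet_size: float) -> bool:
        """Return True if the packet fits the profile, False if it is excess."""
        elapsed = arrival_time - self.last_time
        if elapsed < 0:
            raise ValueError("arrival times must be non-decreasing")
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.bucket_size, self.tokens + elapsed * self.rate)
        self.last_time = arrival_time
        if packet_size <= self.tokens:
            self.tokens -= packet_size
            return True
        return False  # excess traffic: drop it or treat it as best effort
```

A caller would typically forward conforming packets to the guaranteed QoS queue of the flow and hand non-conforming ones to the best effort path, matching the treatment of excess traffic described above.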
In summary, the primary goals in the design of a packet switched node are to sustain optimal
throughput (given a definition of optimality) and to provide preferential treatment, in the form of




Scheduling and Buffer Management
B.1 Link Scheduling
Consider an OQ switch, composed of a single $N \times N$ memory element. To enable bandwidth,
delay, and fairness guarantees, packets are enqueued into separate per-flow queues at each output,
and are dequeued using a chosen scheduling policy. A flow may correspond to a single end-to-end
connection, or to aggregates of several (including best-effort) connections, depending upon the
level at which the scheduler operates. A simple instance of a scheduler is the Weighted Round
Robin (WRR) [38] poller, in which each flow is guaranteed a bandwidth that is proportional to
its weight, and experiences a worst-case latency equal to the size of the WRR frame. Guaranteed
bandwidth as well as fairness are linked to the same weight. A popular variant which accounts for
variable packet sizes is the Deficit Round Robin [62] policy.
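As a concrete instance of a round-robin scheduler that handles variable packet sizes, the following is a minimal sketch of Deficit Round Robin. Each flow receives a quantum per round (proportional to its weight) and carries unused credit forward in a deficit counter. The function name, the dictionary-based queue representation, and the fixed number of rounds are assumptions made for this example.

```python
from collections import deque


def deficit_round_robin(queues, quanta, rounds):
    """Serve per-flow queues with Deficit Round Robin.

    queues: dict mapping flow id -> deque of packet sizes (head at the left).
    quanta: dict mapping flow id -> quantum added to the flow in each round.
    rounds: number of round-robin passes to run.
    Returns the list of (flow, packet_size) pairs in service order.
    """
    deficits = {flow: 0 for flow in queues}
    served = []
    for _ in range(rounds):
        for flow, queue in queues.items():
            if not queue:
                deficits[flow] = 0  # idle flows do not accumulate credit
                continue
            deficits[flow] += quanta[flow]
            # Send packets while the head packet fits in the deficit.
            while queue and queue[0] <= deficits[flow]:
                size = queue.popleft()
                deficits[flow] -= size
                served.append((flow, size))
            if not queue:
                deficits[flow] = 0
    return served
```

Because the deficit is measured in bytes rather than packets, a flow with large packets cannot obtain more than its share of bandwidth over time, which is the weakness of plain WRR that this variant addresses.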
A classical rate-based scheme, which provided a framework for perfect isolation, is the
fluid-model Generalized Processor Sharing (GPS) [56] scheme. Each flow $i$ is associated with a rate $\rho_i$,
and receives an instantaneous service rate proportional to that rate, whenever it is backlogged. The
scheduler operates in a work conserving fashion, and consequently, the flow rate is guaranteed as
long as the sum of the rates is less than the link capacity $r$. Due to the rate proportional service




Õ , the amount of

















The GPS scheduler has several interesting properties. Firstly, as seen from the above equation, the
normalized service received by each flow is equal at every instant. Therefore, the excess bandwidth
is also distributed in proportion to the flow rates, leading to one of the first formal definitions of
service fairness as the worst-case difference (between any two backlogged flows) in normalized
service received. Secondly, if $B(t)$ is the set of backlogged flows at time $t$, the instantaneous
service rate observed by a non-empty queue $i$ is equal to $\rho_i \cdot r / \sum_{j \in B(t)} \rho_j$. That is, the flow
receives isolated bandwidth at every instant, or with zero latency. Thirdly, the observed delay
depends only upon the flow’s own arrival process, as shown in Fig. B.1. For example, if the flow $i$ is
leaky bucket constrained [18] with a bucket size of $\sigma_i$, the delay is bounded by $\sigma_i / \rho_i$.
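The second and third properties can be illustrated numerically. The sketch below computes the instantaneous GPS service rate of each backlogged flow as its own rate times the link capacity, divided by the sum of the rates of all backlogged flows, and the delay bound of a leaky bucket constrained flow as its bucket size divided by its rate. The function and parameter names are assumptions made for this example.

```python
def gps_service_rates(rates, backlogged, link_capacity):
    """Instantaneous GPS service rate of each backlogged flow.

    rates:         dict mapping flow id -> configured rate.
    backlogged:    collection of flow ids with a non-empty queue.
    link_capacity: capacity of the output link.
    """
    if not backlogged:
        return {}
    total = sum(rates[flow] for flow in backlogged)
    return {flow: rates[flow] * link_capacity / total for flow in backlogged}


def gps_delay_bound(bucket_size, rate):
    """Delay bound of a leaky bucket constrained flow under GPS."""
    return bucket_size / rate
```

For instance, with configured rates of 2, 1 and 1 on a link of capacity 6, and only the first two flows backlogged, the whole capacity is shared between those two flows in the ratio 2:1, so each receives more than its configured rate.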
The packetized version of GPS, called Weighted Fair Queueing (WFQ) [20], works by essen-
tially simulating a fluid system by keeping track of the normalized work done, also called virtual
time $V(t)$, at every timeslot. Note that, in GPS, $V'(t) = r / \sum_{i \in B(t)} \rho_i$.









Figure B.1: GPS: Delays and Backlog
(Plot labels: A(t), arrival process; S(t), GPS service process; D(b1), delay of bit b1; B(t1), buffer backlog at time t1; slope = r.)

















This policy simulates GPS with an added latency of $(L_i/\rho_i + L_{\max}/r)$. Several of the subse-
quent packet-based scheduling algorithms proposed the use of approximations of $V(t)$ that were
easier to compute, but yielded different delay bounds and service fairness. For example, in Virtual
Clock [73], $V(t)$ is set to real time $t$, which leads to delay bounds comparable to WFQ but un-
bounded service fairness. In Self-Clocked Fair Queueing (SCFQ) [26], the virtual time is set to the
timestamp of the currently served packet, yielding the best fairness properties among GPS-related
schedulers, but a delay bound that grows linearly with the number of flows. A general model
for packet-based schedulers, called rate proportional [64] servers, was later devised showing that
any scheme in which $V(t)$ grows at least as fast as real time, but always under-estimates it with
respect to a GPS server, has the same added latency as WFQ, with the service fairness dependent
only on the extent of under-estimation. Examples of such schemes include the Virtual Clock and
Frame-Based Fair Queueing (FFQ) [65].
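The timestamp-based schedulers above differ mainly in how the virtual time is computed. As a minimal sketch, the Virtual Clock case, where virtual time equals real time, can be written as follows. It assumes the commonly used recursion in which a packet's timestamp is the larger of its arrival time and the previous timestamp of the same flow, plus the packet size divided by the flow rate. The function name and the input format are hypothetical.

```python
def virtual_clock_tags(packets, rate):
    """Compute Virtual Clock timestamps for the packets of one flow.

    packets: list of (arrival_time, size) tuples in arrival order.
    rate:    reserved rate of the flow (size units per time unit).
    Returns one timestamp per packet; the scheduler always serves the
    packet with the smallest timestamp across all flows.
    """
    tags = []
    previous_tag = 0.0
    for arrival_time, size in packets:
        # Virtual time is real time here, so the start is the later of the
        # arrival instant and the previous packet's timestamp.
        start = max(arrival_time, previous_tag)
        previous_tag = start + size / rate
        tags.append(previous_tag)
    return tags
```

A flow that bursts ahead of its reserved rate accumulates timestamps far in the future and can later be denied service while other flows catch up, which is one way to see why the service fairness of this scheme is unbounded.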
The notion of worst-case fairness was introduced in [2], especially suited for hierarchical [3]
(see Fig. B.2(a)) and distributed schedulers, and was defined as the worst-case inter-service time. It
was shown that the added latency observed by a packet is equal to the sum of the worst-case fairness
indices of the intermediate nodes. In contrast to all the above schemes in which fairness is tied to
the associated rate, the Fair Airport [27] algorithm allows the scheduler to be decoupled into a
non-work-conserving guaranteed bandwidth scheduler (GBS) (e.g., shaped Virtual Clock) to ensure
rates and delays, and a work conserving excess bandwidth scheduler (EBS) (e.g., WRR) as shown
in Fig. B.2(b). For the purposes of this work, we will assume the existence of an implementable [9]
link scheduling scheme which provides a method to isolate bandwidth on a specified time-scale,
and to distribute excess bandwidth according to pre-specified weights. The most recent work in
link scheduling is primarily focused on the statistical characterization of GPS-based and Earliest
Deadline First (EDF) schedulers, as well as more complicated service curves, and is beyond the




















Figure B.2: Scheduler Arrangements: (a) Hierarchical (b) Fair Airport
B.2 Buffer Management
The work on scheduling and arbitration algorithms focuses on the finite bandwidth resource at a
contention point, assuming essentially an infinite attached memory so that bandwidth isolation,
inter-queue fairness and overall throughput depend solely on those algorithms. Given a finite
buffer resource in a memory element, we may view buffer management as a mechanism to control
the residue of scheduling to enable unhindered performance of the above bandwidth schedulers, by
selectively admitting packets into the finite memory. Therefore, we identify three, often conflicting
goals: (i) to provide isolation to guaranteed QoS flows; (ii) to allow fair access to the available
buffers, in order to support fair scheduling; and (iii) to maintain high throughput by sharing the
available buffers among the served flows. Much like the literature on scheduling, the primary focus
has been on OQ switches, and the application to other switch designs is still immature.
The notion of isolation, in the context of buffers, is well defined. Given a traffic profile, which is
readily available for guaranteed QoS flows, and a service rate, a per-flow buffer size may be calculated for
either zero loss for regulated traffic (e.g., the bucket size $\sigma_i$ plus the scheduler latency for a leaky-bucket regulated
source served by a virtual bandwidth trunk with service rate equal to the token rate) or a specified
loss ratio for statistical traffic (e.g., using the tail of the distribution of a queue). Once the size is
computed, it can be totally isolated by using the complete partitioning [32] scheme, in which that size is
dedicated to flow $i$, with the per-flow sizes together fitting within $B$, the total buffer size. A packet which causes a queue length
to exceed this size is dropped on arrival. This is a special case of static thresholding [24, 47], in
which the sum of sizes may be allowed to exceed the total buffer size. While the latter does not
provide perfect isolation, it allows the buffers to be shared (in a limited way) by utilizing a knowledge
of the inter-dependencies of arrival processes.
While the concept of fairness is well established for link schedulers, the same cannot be said for
fair allocation of buffers to store the residue of the excess-bandwidth traffic served by a scheduler,
due to a lack of arrival traffic characterization. Consequently, an equal access to the buffers is often
considered fair, and may be achieved by static thresholding. However, the throughput decreases
with respect to the case when each flow has full access to the available memory. The dynamic
thresholding scheme [15] aims to equalize the queue lengths, while increasing the throughput by
dynamically changing the allocation, so that the total queue length may always approach the total
buffer size. The threshold is calculated as $\alpha$ times the current free space, and weighted access may
be offered by choosing different values of $\alpha$.
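The dynamic thresholding rule can be sketched as an admission check performed on each arrival. This is a minimal illustration, assuming that queue lengths and packet sizes are measured in the same units and that a packet is admitted only if its queue is currently below the threshold and the packet physically fits. The function name, its arguments, and the multiplier name `alpha` are assumptions made for this example.

```python
def dynamic_threshold_admit(queue_lengths, flow, packet_size, buffer_size, alpha):
    """Admit or drop an arriving packet under dynamic thresholding.

    queue_lengths: dict mapping flow id -> current queue length (updated
                   in place when the packet is admitted).
    buffer_size:   total shared buffer size.
    alpha:         multiplier applied to the current free space.
    """
    free_space = buffer_size - sum(queue_lengths.values())
    threshold = alpha * free_space
    if packet_size <= free_space and queue_lengths[flow] < threshold:
        queue_lengths[flow] += packet_size
        return True
    return False  # arrival-drop: the packet is discarded
```

Because the threshold shrinks as the buffer fills, a single long queue is cut off before it exhausts the memory, leaving some free space for newly active flows.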
In contrast to the above schemes, all of which are arrival-drop policies, the push-out [25]
scheme maintains full buffer occupancy, and hence the highest throughput for a given
$B$, with exactly equal allocations to each backlogged flow. This is achieved by expelling an
already enqueued packet from the queue with the longest length. The choice of the packet within
the longest queue is flexible, e.g., in [67], it is suggested that dropping the head packet is the most
beneficial for TCP flows. In fact, dynamic thresholding may be considered as an arrival-drop ap-
proximation of push-out. Another attractive property of push-out schemes is that even the portion
of memory providing isolated allocations may be completely shared, by slightly modifying the
push-out criterion. When the buffer is full, a packet is expelled from a queue, whose length is the
greatest over a specified threshold. In spite of such properties, push-out is difficult to implement at
wire speeds, since it requires the maintenance of sorted queue lengths and an extra memory access.
The dynamic partitioning scheme [45] can indeed be considered as an arrival-drop approximation
of push-out with thresholds.
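For comparison with the arrival-drop policies, the basic push-out rule (without thresholds) can be sketched as follows. When an arriving packet does not fit, packets are expelled from the head of the currently longest queue until it does. The function name and the list-based queue representation are assumptions made for this example, and the linear search for the longest queue is exactly the cost that makes the scheme hard to implement at wire speed.

```python
def push_out_enqueue(queues, flow, packet_size, buffer_size):
    """Enqueue a packet, expelling from the longest queue if the buffer is full.

    queues: dict mapping flow id -> list of packet sizes (head at index 0).
    Returns the list of (flow, packet_size) pairs that were expelled.
    """
    if packet_size > buffer_size:
        raise ValueError("packet is larger than the whole buffer")
    expelled = []
    occupancy = sum(sum(q) for q in queues.values())
    while occupancy + packet_size > buffer_size:
        # Find the longest queue (in size units) and drop its head packet.
        longest = max(queues, key=lambda f: sum(queues[f]))
        victim = queues[longest].pop(0)
        occupancy -= victim
        expelled.append((longest, victim))
    queues[flow].append(packet_size)
    return expelled
```

Dropping from the head follows the suggestion cited above for TCP flows; expelling from the tail instead would only change the index passed to `pop`.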
There are several schemes that enforce intra-queue fairness, in which a single FIFO queue is
composed of packets belonging to several best-effort flows. By essentially regulating the entry
of packets belonging to different flows into the same queue, such schemes also may be viewed
as single-FIFO scheduling schemes as they control the fraction of the total queue bandwidth that
each flow receives. Popular examples are Random Early Detection (RED) [23] and its several
subsequent variants, as well as the recent AQM schemes. In our framework, if several flows share
the same queue, we assume that some such algorithm is also present, in addition to the inter-queue










 (OQ,  WC  ),  å
Proof: Let $i, j$ be an input and output element (not interface), respectively. Let $\Lambda$ be the rate
matrix of the switch. Since, for 
×














Compose an aggregate rate matrix ﬀﬁ



























































If we re-normalize time so that each timeslot now is Ù
+*
ﬁ times the external timeslot, then the
rates become admissible for the smaller timeslot (faster time). Consequently, according to [19],
a maximal size matching algorithm with a speedup of 2 is sufficient to ensure the stability of the
queues in the system. Since the space element operates at a frequency of QÉ³Ç(' , it follows that









 (OQ,  WC  ),  å,
Proof: Let Þ
ã
á be an input and output interface, respectively. Let +- be the rate matrix of the

















Let æÁ ò Ä½¾ ¿ be the average arrival rate into the per-path VOQ »Á ò Ä½¾ ¿ . Since an input-output traffic flow

























We re-normalize time so that each timeslot is now È times the external timeslot. Due to the above
inequality, the rates continue to remain admissible, and a maximal size matching algorithm [19]
with a speedup of 2 on the larger timeslot is sufficient to ensure the stability of each of the VOQs
for each of the path. Since each space element operates at a frequency of lÇ$È , it follows that  å
is sufficient to ensure relative stability with an OQ switch for admissible rates.
C.2 Multi-path Switches
Theorem 9
(PPS,  Fractional Dispatch  ) 
¾
Â65






Proof: By the definition of the fractional dispatch algorithm, the average rate for an input-output
pair ÑﬃÞ
ã












Consequently, the total traffic destined to an output $j$, which is stable in a reference OQ switch, can
























Therefore, the output á in each core element ð is also stable, thereby establishing relative stability
in the strict sense. Note that the proof continues to hold when the traffic to the input-output pair
ÑﬃÞ
ã
áEÕ is less than À ½¾ ¿
H
, in which case the same for each core element becomes less than À ½¾ ¿ ß . In




[1] T. Anderson, S. Owicki, J. Saxe, C. Thacker, “High speed Switch Scheduling for Local Area
Networks,” ACM Trans. Computer Systems, Nov 1993.
[2] J. C. R. Bennett and H. Zhang, “WF2Q: Worst-Case Fair Weighted Fair Queueing,” in Proc.
IEEE Infocom ‘96, pp. 120-128, Mar 1996.
[3] J. C. R. Bennett and H. Zhang, “Hierarchical Packet Fair Queueing Algorithms,” in Proc.
ACM Sigcomm ‘96, pp. 143-156, Aug 1996.
[4] J. Cao, W. S. Cleveland, D. Lin, D. X. Sun, “On the Nonstationarity of Internet Traffic,” in
Proc. ACM Sigmetrics, Cambridge, MA, Jun 2001.
[5] C. S. Chang, J. W. Chen, H. Y. Huang, “On Service Guarantees for Input-Buffered crossbar
switches: A capacity decomposition approach by Birkhoff and Von-Neumann,” in Proc. IEEE
Intl. Workshop on QoS, London 1999.
[6] C. S. Chang, J. W. Chen, H. Y. Huang, “Birkhoff-Von Neumann Input-Buffered Crossbar
Switches,” in Proc. IEEE Infocom ‘00, Tel Aviv, pp. 1614-1623, Mar 2000.
[7] F. M. Chiussi, “Design, Performance and Implementation of a Three-Stage Banyan-based
Architecture with Input and Output buffers for Large Fast Packet Switches,” PhD. Thesis,
Stanford University, Jul 1993.
[8] F. M. Chiussi and A. Francini, “Providing QoS Guarantees in a Packet Switch,” in Proc. IEEE
Globecom ‘99, Nov 1999.
[9] F. M. Chiussi, A. Francini, J. G. Kneuer, “Implementing Fair Queueing in ATM Switches,”
(Parts 1-2) in Proc. IEEE Globecom ‘97, Phoenix, AZ, Nov 1997.
[10] F. M. Chiussi, D. A. Khotimsky, S. Krishnan, “Generalized Inverse Multiplexing of Switched
ATM Connections,” in Proc. Globecom ‘98, Sydney, Nov 1998.
[11] F. M. Chiussi, D. A. Khotimsky, S. Krishnan, “Advanced Frame Recovery in Switched Con-
nections Inverse Multiplexing for ATM,” in Proc. Intl. Conf. on ATM, Colmar, France, Jun
1999.
[12] F. M. Chiussi, J. G. Kneuer, V. P. Kumar, “Low-cost Scalable switching solutions for Broad-
band Networking: the ATLANTA architecture and chipset,” IEEE Communications Maga-
zine, vol. 35, no. 12, Dec 1997.
[13] F. M. Chiussi, et al., “A Chipset for Scalable QoS-Preserving Protocol-Independent Packet
Switch Fabrics,” in Proc. Intl. Solid-State Circuits Conf., San Francisco, Feb 2001.
[14] F. M. Chiussi, et al., “A Family of ASIC devices for Next Generation Distributed Packet
Switches with QoS support for IP and ATM,” in Proc. Hot Interconnects 9, Stanford, Aug
2001.
[15] A. K. Choudhury and E. L. Hahne, “Dynamic Queue Length Thresholds in a Shared Memory
ATM Switch,” in Proc. IEEE Infocom ‘96, pp. 679-687, Mar 1996.
[16] M-C. Chow, “Understanding SONET/SDH: Standards and Applications,” Adnan Publisher,
New Jersey, 1995.
[17] S-T. Chuang, A. Goel, N. McKeown, B. Prabhakar, “Matching Output Queueing with a Com-
bined Input Output Queued Switch,” IEEE J. Selected Areas of Communications, vol. 17, no.
6, Jun 1999.
[18] R. L. Cruz, “A Calculus for Network Delay, Part I: Network Elements in Isolation,” IEEE
Trans. Information Theory, vol. 37, no. 1, Jan 1991.
[19] J. G. Dai and B. Prabhakar, “The throughput of data switches with and without speedup,” in
Proc. Infocom 2000, Tel Aviv, Mar 2000.
[20] A. Demers, S. Keshav, S. Shenker, “Analysis and Simulation of a Fair Queueing Algorithm,”
in Proc. Sigcomm ‘89, Austin, TX, Sep 1989.
[21] M. Devault, J. Cochennec, M. Servel, “The Prelude ATD Experiment: Assessments and
Future Prospects,” IEEE J. Selected Areas in Comm., vol. 6, no. 9, Dec 1988.
[22] A. Elwalid, D. Mitra, R. Wentworth, “A New Approach for Allocating Buffers and Bandwidth
to Heterogenous, Regulated Traffic in an ATM node,” IEEE J. Selected Areas of Communi-
cations, vol. 13, pp. 1115-1127, Aug 1995.
[23] S. Floyd and V. Jacobson, “Random Early Detection Gateways for Congestion Avoidance,”
IEEE Transactions on Networking, Aug 1993.
[24] G. J. Foschini and B. Gopinath, “Sharing Memory Optimally,” IEEE Transactions on Com-
munications, vol. 31, pp. 352-360, Mar 1983.
[25] L. Georgiadis, I. Cidon, R. Guerin, A. Khamisy, “Optimal Buffer Sharing,” IEEE J. Selected
Areas of Communications, vol. 13, pp. 1229-1240, Sep 1995.
[26] S. J. Golestani, “A Self-Clocked Fair Queueing scheme for Broadband Applications,” in Proc.
IEEE Infocom ‘94, Jun 1994.
[27] P. Goyal and H. M. Vin, “Fair Airport Scheduling Algorithm,” in Proc. NOSSDAV ‘97, pp.
272-283, May 1997.
[28] J. E. Hopcroft and R. M. Karp, “An $n^{5/2}$ algorithm for maximum matching in bipartite
graphs,” SIAM Journal on Computing, vol. 2, 1973.
[29] J. Y. Hui, “Switching and Traffic Theory for Integrated Broadband Networks,” Kluwer Aca-
demic Press, 1990.
[30] A. Hung, G. Kesidis, N. McKeown, “ATM Input-Buffered switches with Guaranteed Rate
property,” in Proc. IEEE ISCC, Athens, 1998.
[31] I. Iliadis and W. E. Denzel, “Performance of packet switches with input and output queueing.”
in Proc. ICC ‘90, Atlanta, Apr 1990.
[32] M. I. Irland, “Buffer Management in a Packet Switch,” IEEE Trans. Communications, vol.
26, pp. 328-337, Mar 1978.
[33] S. Iyer, A. Awadallah, N. McKeown, “Analysis of a Packet Switch with Memories running
slower than line rate,” in Proc. Infocom 2000, Tel Aviv, Mar 2000.
[34] S. Iyer and N. McKeown, “Making Parallel Packet Switches practical,” in Proc. Infocom ‘01,
Anchorage, Alaska, Apr 2001.
[35] S. Iyer, R. Zhang, N. McKeown, “Routers with a Single Stage of Buffering,” to appear in
ACM Sigcomm ‘02, Pittsburgh, Sep 2002.
[36] K. Kar, T. V. Lakshman, D. Stiliadis, L. Tassiulas, “Reduced Complexity Input-Buffered
Switches,” in Proc. Hot Interconnects 8, Palo Alto, Aug 2000.
[37] M. Karol, M. Hluchyj, S. Morgan, “Input versus Output queueing on a Space Division
Switch,” IEEE Trans. Communications, vol. 35, no. 12, Dec 1987.
[38] M. Katevenis, S. Sidiropoulos, C. Courcoubetis, “Weighted Round Robin Cell Multiplexing
in a General Purpose ATM Switch chip,” IEEE J. Selected Areas in Communications, vol. 9,
pp. 1265-1279, Oct 1991.
[39] I. Keslassy, M. Kodialam, T. V. Lakshman, D. Stiliadis, “On Guaranteed Smooth Scheduling
for Input-Queued Switches,” submitted to IEEE Infocom ‘03, 2003.
[40] D. A. Khotimsky, “A Packet Re-sequencing Protocol for Fault-Tolerant Multi-Path transmis-
sion with Non-Uniform Traffic Splitting,” in Proc. Globecom ‘99, Rio de Janeiro, Dec 1999.
[41] D. A. Khotimsky and S. Krishnan, “Towards the Recognition of Parallel Packet Switches,”
Gigabit Networking Workshop at Infocom ‘01, Anchorage, Alaska, Apr 2001.
[42] D. A. Khotimsky and S. Krishnan, “Stability Analysis of a Parallel Packet Switch with Buffer-
less input Demultiplexors,” in Proc. ICC ‘01, Helsinki, Jun 2001.
[43] D. A. Khotimsky and S. Krishnan, “Evaluation of Open-loop Sequence Control schemes for
Multi-path Switches,” in Proc. ICC ‘02, New York, Apr 2002.
[44] P. Krishna, N. S. Patel, A. Charny, R. Simcoe, “On the speedup required for work-conserving
Crossbar Switches,” in Proc. 6th Intl. Workshop on QoS, Napa, CA, May 1998.
[45] S. Krishnan, A. K. Choudhury, F. M. Chiussi, “Dynamic Partitioning: A Mechanism for
Shared-Memory Management,” in Proc. IEEE Infocom ‘99, pp. 144-152, New York, Mar
1999.
[46] P. R. Kumar and S. P. Meyn, “Stability of queueing networks and scheduling policies,” IEEE
Transactions on Automatic Control, vol. 40, no. 2, Feb 1995.
[47] G. Latouche, “Exponential Servers Sharing a Finite Storage: Comparison of Space Allocation
Policies,” IEEE Transactions on Communications, vol. 28, pp. 992-1003, Jun 1980.
[48] E. Leonardi, M. Mellia, F. Neri, M. A. Marsan, “On the Stability of Input-Queued switches
with Speed-up,” IEEE/ACM Trans. Networking, vol. 9, no. 1, Feb 2001.
[49] Y. Li, S. Panwar, H. J. Chao, “On the Performance of a Dual Round-Robin Switch,” in Proc.
IEEE Infocom ‘01, San Francisco, Apr 2001.
[50] N. McKeown, “Scheduling Algorithms for Input-Queued Cell Switches,” PhD. Thesis, Uni-
versity of California at Berkeley, 1995.
[51] N. McKeown, “The iSLIP Scheduling Algorithm for Input-Queued Switch,” IEEE/ACM
Transactions on Networking, vol. 7, Apr 1999.
[52] N. McKeown, V. Anantharam, J. Walrand, “Achieving 100% Throughput in an Input-Queued
Switch,” in Proc. Infocom ‘96, San Francisco, Mar 1996.
[53] N. McKeown, A. Mekkittikul, “A Practical Scheduling Algorithm to achieve 100% through-
put in Input-Queued Switches,” in Proc. IEEE Infocom ‘98, San Francisco, Apr 1998.
[54] S. Melen and J. Turner, “Non-blocking Multirate Networks,” SIAM Journal on Computing,
vol. 18, 1989.
[55] S. P. Meyn and R. Tweedie, “Markov Chains and Stochastic Stability,” Springer-Verlag,
London, 1994.
[56] A. K. Parekh and R. G. Gallager, “A Generalized Processor Sharing approach to Flow Con-
trol in Integrated Service Networks– The Single Node Case,” IEEE/ACM Transactions on
Networking, Jun 1993.
[57] A. Pattavina, “Switching Theory: Architecture and Performance in Broadband ATM Net-
works,” John Wiley, W. Sussex, UK, 1998.
[58] V. Paxson and S. Floyd, “Wide-Area Traffic: The failure of Poisson Modeling,” in Proc. ACM
Sigcomm ‘94, Aug 1994.
[59] B. Prabhakar, Personal Communication, Aug 2002.
[60] E. Rosen, A. Viswanathan, and R. Callon, “Multiprotocol Label Switching Architecture,”
Internet Engg. Task Force RFC 3031, Jan 2001.
[61] K. W. Ross, “Multiservice Loss Models for Broadband Telecommunication Networks,”
Springer-Verlag, London, 1995.
[62] M. Shreedhar and G. Varghese, “Efficient Fair Queueing using Deficit Round Robin,” in Proc.
ACM Sigcomm ‘95, pp. 231-242, Sep 1995.
[63] D. C. Stephens and H. Zhang, “Implementing Distributed Packet Fair Queueing in a Scalable
Switch Architecture,” in Proc. IEEE Infocom ‘98, Mar 1998.
[64] D. Stiliadis and A. Varma, “A General Methodology for Designing Efficient Traffic Schedul-
ing and Shaping Algorithms,” in Proc. IEEE Infocom ‘96, Apr 1996.
[65] D. Stiliadis and A. Varma, “Design and Analysis of Frame-based Fair Queueing: A new
traffic scheduling algorithm for packet switched networks,” in Proc. ACM Sigmetrics ‘96,
May 1996.
[66] I. Stoica and H. Zhang, “Exact emulation of an Output Queueing switch by a Combined Input
Output queueing switch,” in Proc. 6th Intl. Workshop on QoS, Napa, CA, May 1998.
[67] B. Suter, T. V. Lakshman, D. Stiliadis, A. K. Choudhury, “Design Considerations for sup-
porting TCP with Per-flow Queueing,” in Proc. IEEE Infocom ‘98, Mar 1998.
[68] Y. Tamir and G. Frazier, “High Performance multiqueue buffers for VLSI communication
switches,” in Proc. 15th Annual Symp. on Computer Arch., Jun 1988.
[69] F. A. Tobagi, “Fast Packet Switch Architectures for Broadband Integrated Services Digital
Networks,” in Proc. of IEEE, vol. 78, no. 1, Jan 1990.
[70] B. Towles and W. J. Dally, “Guaranteed Scheduling for Switches with Configuration Over-
head,” in Proc. IEEE Infocom ‘02, New York, Jun 2002.
[71] G. Wilfong, B. Mikkelsen, C. Doerr, M. Zirngibl, “WDM Cross-connect Architectures with
reduced complexity,” J. Lightwave Technology, pp. 1732–1741, Oct 1999.
[72] Y. S. Yeh, M. G. Hluchyj, A. S. Acampora, “The Knockout Switch: A simple modular archi-
tecture for High-performance Packet Switching,” IEEE J. Selected Areas in Comm., vol. 5,
no. 8, Oct 1987.
[73] L. Zhang, “Virtual Clock: A New Traffic Control Algorithm for Packet Switching,” ACM
Transactions on Computer Systems, May 1991.
[74] L. Zhang, S. Deering, D. Estrin, S. Shenker, D. Zappala, “RSVP: A new Resource Reserva-
tion Protocol,” IEEE Network, vol. 7, no. 7, Sep 1993.
