Asymptotic performance limits of switches with buffered crossbars by Paolo Giaccone & Emilio Leonardi
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 2, FEBRUARY 2008 595
Asymptotic Performance Limits of Switches With
Buffered Crossbars Supporting Multicast Trafﬁc
Paolo Giaccone, Member, IEEE, and Emilio Leonardi, Member, IEEE
Abstract—Input queued (IQ) switches exploiting buffered
crossbars (CICQ switches) are widely considered very promising
architectures that outperform IQ switches with bufferless
switching fabrics both in terms of architectural scalability and
performance. Indeed, the problem of scheduling packets for
transfer through the switching fabric is significantly simplified
by the presence of internal buffers in the crossbar, which makes
possible the adoption of efficient, simple, and fully distributed
scheduling algorithms. This paper studies the throughput
performance of CICQ switches supporting multicast traffic,
showing that, similarly to IQ architectures, CICQ switches with
an arbitrarily large number of ports may also suffer from
significant throughput degradation under “pathological”
multicast traffic patterns. Despite the asymptotic nature of
these results, the authors believe that they can contribute to a
deeper understanding of the behavior of CICQ architectures
supporting multicast traffic.
Index Terms—Buffered crossbars, multicast, packet switching,
scheduling.
I. INTRODUCTION AND PREVIOUS WORK
INPUT QUEUED (IQ) switches have been extensively
studied in the last decade [1], since they can achieve performance
similar to that of pure output queued (OQ) switches, while
guaranteeing much better scalability in terms of number of ports and
line data rate. Under unicast traffic, IQ switches achieve 100%
throughput (i.e., the same throughput achieved by OQ) without
speedup [2], [3]; moreover, IQ switches with speedup 2 can
perfectly emulate OQ switches, i.e., they can behave identically
to OQ switches when observed at the output ports [4], [5].
On the contrary, very large IQ switches can experience
throughput degradation with respect to OQ under multicast
traffic [6]; indeed, as the number of ports $N \to \infty$, the
speedup required to achieve 100% throughput may grow to infinity,
showing a fundamental and intrinsic limitation of IQ switches
supporting multicast traffic.
In a switching architecture built around buffered crossbars
(“CICQ switch”), the bufferless switching fabric of an IQ ar-
chitecture is replaced by a crossbar provided with small buffers
at each crosspoint (see Fig. 1). A survey on CICQ switch
architectures and their scheduling algorithms is available in
[7]; some interesting proposals (also for multicast trafﬁc) have
been described in [8]–[15].
Manuscript received October 30, 2006; revised October 3, 2007. This work
has been partially funded by the Italian Ministry for University and Research
under the PRIN project “FAMOUS”. The material in this paper was presented
in part at IEEE INFOCOM 2006, Barcelona, Spain, April 2006.
The authors are with the Dipartimento di Elettronica, Politecnico di Torino,
Torino 10129, Italy (e-mail: paolo.giaccone@polito.it, emilio.leonardi@polito.it).
Communicated by E. Modiano, Associate Editor for Communication
Networks.
Digital Object Identifier 10.1109/TIT.2007.913564
Fig. 1. The N × N CICQ architecture with VOQ and buffered crosspoints.
The problem of scheduling packets
through a CICQ architecture is significantly simplified with
respect to an IQ architecture by the presence of internal buffers.
In IQ switches, a centralized scheduler is required to control the
access to the bufferless switching fabric in a globally coordi-
nated fashion; on the contrary, in CICQ switches access to the
switching fabric can be managed by uncoordinated schedulers
residing at every input and output port.
For these reasons, CICQ architectures are widely recognized
as very promising solutions which potentially outperform IQ
switches both in terms of scalability and performance. This
belief seems conﬁrmed by both theoretical and simulative
performance analysis of CICQ architectures. Recently, it has
been shown that under unicast trafﬁc, CICQ architectures
with speedup 2 and minimal internal buffers operating under
very simple uncoordinated schedulers, achieve the maximum
throughput and in some cases may perfectly emulate an OQ
switch [16], [17]. The minimum speedup required to achieve
100% throughput can be further lowered by increasing the
internal buffer size, as shown in [18]. Indeed, for any speedup
greater than $1$, it has been proved that there exists an internal
buffer size such that the CICQ architecture achieves 100%
throughput under uncoordinated scheduling algorithms. Finally,
simulation studies have shown that throughput under unicast
traffic is always very close to 100% when no speedup is
provided, even though no theoretical proof of this has been
given.
In this paper we show that, unfortunately, similarly to what
happens for IQ switches, heavy performance degradation may
be experienced by CICQ architectures with $N \to \infty$ ports
supporting multicast traffic, for any finite speedup, regardless
of scheduling complexity and internal buffer size. Indeed, we
have identified “pathological” traffic patterns which create
frequent contentions for output ports, thus leading to hot-spot
congestion of internal buffers and, consequently, limiting the
switch throughput. One
possibility to avoid such congestion is to increase the internal
buffer size; we show how the minimum buffer size should scale
with the switch size $N$ to achieve the maximum throughput. We
recognize that our findings have mainly theoretical relevance due to their
asymptotic nature. Nevertheless we believe that they can con-
tribute to a deeper comprehension of the behavior of CICQ ar-
chitectures supporting multicast trafﬁc. In addition, we notice
that arguments developed in this paper can be successfully ap-
plied to pure IQ architectures (architectures built around buffer-
less switches) to reprove in a more concise and elegant way
the results in [6]. Finally, we point out that heavy performance
degradations may be experienced also in OQ switches with lim-
ited queues at output ports and ﬂow control (from outputs to-
ward inputs).
The paper is organized as follows: in Section II we describe
the problem of scheduling multicast trafﬁc in CICQ switches;
in Section III we deﬁne trafﬁc admissibility conditions and
we introduce a class of “worst-case” trafﬁc patterns that, in
Section IV, are shown to lead to poor performance in CICQ
switches. Section V deﬁnes some practical scheduling schemes
for CICQ switches and shows by simulation the throughput
degradation suffered under these trafﬁc patterns. In Section VI,
we show how our negative results apply also to Output Queued
architectures with ﬁnite buffers and ﬂow control. Finally,
Section VII concludes the paper.
To ease a first reading of the paper, most analytical derivations
and theorem proofs were moved to the Appendices.
II. SYSTEM DYNAMICS
We consider an $N \times N$ CICQ architecture, where
is the set of switch input ports 1
and is the set of switch output ports .
Fig. 1 shows a basic model for it. Each crosspoint of the
crossbar is provided with an internal buffer of size $B$ packets;
thus, internal buffers are in one-to-one correspondence with
input–output pairs. An instantaneous ﬂow control mechanism
from each crosspoint to the corresponding input port prevents
the input port from overﬂowing the internal buffers. We assume
time to be slotted, and packets to be of ﬁxed size (denoted as
“cells”). Cells arrive at the input ports according to a discrete
time random process. At every slot, some cells enqueued at
input ports are moved in the internal buffers while cells in
internal buffers are moved toward output ports.
Each multicast cell is associated with a pair , where
is the input port at which it arrives and is the
set of its destination outputs (denoted as the cell “fanout-set”).
Let the cell “fanout” be the number of destination ports. Unicast
cells correspond to the subset of multicast cells whose fanout-set
comprises just one output port. We partition the set of cells
arriving at the switch into flows according to their attributes
(input port and fanout-set).
Thus, a different ﬂow corresponds to every possible attribute
.
For each ﬂow , function returns the input port at
which ﬂow cells arrive. Function returns the fanout-set
associated to the ﬂow cells. Function returns the set of
ﬂows which arrive at input port . Function returns the
1 We denote with |A| the size of set A.
set of multicast ﬂows whose fanout-set comprises , i.e.,
.
In order to reach a particular output port belonging to the
fanout-set, a copy of the multicast cell must be enqueued in the
internal buffer leading to that output port. As a consequence, each
multicast cell originates several unicast copies, one per destination,
called “fragments,” which are enqueued in the corresponding
internal buffers. Fragments of a multicast cell, after being copied
into the internal buffers, are independently delivered to the output
ports.
Two possible basic strategies can be devised for the transfer
of multicast cells from inputs to internal buffers.
• No fanout splitting policies: all the fragments originated
by a multicast cell are transferred to internal buffers simul-
taneously. If any of the internal buffers corresponding to
fanout destinations is full, no fragments of the cell can be
transferred to the internal buffers.
• Fanout splitting policies: different fragments originated by
the same cell may be transferred to buffers in different
internal timeslots.
In the latter case, the set of destinations in the fanout-set
corresponding to the internal buffers into which a fragment has still
to be transferred is called the “residual fanout-set” of a cell. In both
cases a cell is removed from its input when every internal buffer
corresponding to a fanout destination has been reached by a cell
fragment, i.e., its residual fanout-set is empty. The adoption of
no fanout splitting scheduling policies entails a significant
reduction in the complexity of both the architecture and the
algorithms, since no information on the residual fanout-set of cells
partially transferred toward internal buffers must be stored and
managed. However, as we shall show later, no fanout splitting
policies perform rather poorly with respect to fanout splitting
policies.
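The difference between the two transfer strategies can be illustrated with a minimal Python sketch. The function names and the representation of fanout-sets as Python sets of output indices are assumptions made for illustration only; they do not come from the paper.

```python
def transfer_no_splitting(residual, full):
    """No fanout splitting: all fragments move together, or none does.

    residual: set of outputs still to be reached by the cell.
    full:     set of outputs whose internal buffer (for this input) is full.
    Returns (transferred outputs, new residual fanout-set).
    """
    if residual & full:
        # At least one destination buffer is full: the whole cell waits.
        return set(), set(residual)
    return set(residual), set()


def transfer_with_splitting(residual, full):
    """Fanout splitting: move every fragment whose buffer has room."""
    transferred = residual - full
    return transferred, residual - transferred


# Example: a cell destined to outputs {0, 1, 2}; the buffer toward output 1 is full.
blocked = transfer_no_splitting({0, 1, 2}, {1})
partial = transfer_with_splitting({0, 1, 2}, {1})
```

In the example, the no-splitting policy transfers nothing and the cell keeps its whole fanout-set, while the splitting policy transfers the fragments for outputs 0 and 2 and leaves a residual fanout-set containing only output 1.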
For the sake of readability, we will adopt the following con-
vention: a fragment is “transferred” when it is moved from the
input queue to the internal buffer, and a fragment is “delivered”
when it is moved from the internal buffer to the output port.
In IQ and CICQ switches supporting unicast traffic only, to
avoid head-of-line (HoL) blocking [19], one separate queue for
cells directed to each output is necessary at inputs; this queueing
architecture is called Virtual Output Queueing (VOQ) [20].
As shown in [6], one queue is required at every input for each
possible fanout-set (i.e., at most $2^N - 1$ queues per input, one
per nonempty subset of the $N$ outputs) to avoid HoL blocking in
switches supporting multicast traffic; this queueing scheme is
denoted as MC-VOQ or “per-multicast-flow queueing.” Upon its
arrival at the switch, each cell is stored
in the queue that corresponds to the flow fanout-set and it is
removed only when its residual fanout-set is empty. Only cells
at the head of the input queues are eligible for transfer to the
internal buffers.2
2 We notice that, in order to maximize the switch throughput, partially
transferred cells should be re-enqueued into the queue corresponding to their
residual fanout-set [6]; the adoption of such an optimal-throughput queueing
scheme would lead to out-of-sequence cell delivery at output ports, which is
not tolerable in many application contexts. For these reasons, in analogy to
what was done in [6] for IQ switches, we limit our performance investigation
to MC-VOQ schemes without re-enqueuing.
We are aware that this queueing structure is hardly
implementable in large switches, due to the possibly large number of
queues needed at every input. However, since in this paper we
are mostly interested in providing a theoretical investigation of
intrinsic throughput limits of CICQ structures, by considering
this idealized queueing structure we are able to find general
bounds on the system performance, which do not depend on
the particular queueing structure implemented at inputs.
We say that a speedup $S$ (with $S$ integer) is available at the
switch when the internal data-paths between input and output
ports, through the buffered crossbar, run at a speed $S$ times faster
than the external input/output links.
When the speedup is $S > 1$, queues are necessary at output ports
(similarly to what happens in IQ switches), to store fragments that
cannot be immediately delivered to output links. All the
following rates are accelerated by the same factor $S$ with respect
to the link speed: read rate from the input queues, read/write
rates from/to the internal buffers, write rate to the output queues.
Furthermore, it is necessary to distinguish between external and
internal timeslots; the former is the transmission time of a cell
on the external input/output links; the latter is the transmission
time of a cell on the internal data path (between input and output
ports). Each external timeslot corresponds to $S$ consecutive
internal timeslots. We denote with the th external timeslot from
a conventional time origin; in a similar way we denote with
the th internal timeslot from the same conventional time
origin, or equivalently we denote with the -th internal
timeslot with corresponding to external timeslot ,
being .
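The correspondence between the two time scales can be sketched in Python as follows. The 0-based indexing from the common time origin and the function names are assumptions of this sketch; the indexing convention used in the paper is not legible in this text.

```python
def split_internal_index(t_int, speedup):
    """Map an internal timeslot index to (external timeslot, offset).

    Assumes 0-based indices and that each external timeslot spans
    `speedup` consecutive internal timeslots.
    """
    external, offset = divmod(t_int, speedup)
    return external, offset


def internal_index(external, offset, speedup):
    """Inverse mapping: (external timeslot, offset) to internal index."""
    if not 0 <= offset < speedup:
        raise ValueError("offset must lie in [0, speedup)")
    return external * speedup + offset


# With speedup 3, internal timeslot 7 is the second internal slot
# (offset 1) of external timeslot 2.
pair = split_internal_index(7, 3)
```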
Considering internal timeslot , we define the following
variables:
• : the number of fragments from input to output
(denoted as ) transferred to the internal buffer;
• : the number of cells partially or fully transferred
from input ;
• : the number of cells whose transfer to the internal
buffers is completed;
• : the number of fragments delivered from
the internal buffer to the output port;
• : the number of fragments delivered from internal
buffers to output ; note that ;
• : the number of fragments stored in the
internal buffer at the beginning of the timeslot;
• : the total number of fragments
stored in the internal buffers fed by input at the beginning
of the timeslot;
• : be the total number of fragments
stored in internal buffers leading to output at the begin-
ning of the timeslot.
The dynamics of internal buffer occupancy is given by
The following constraints must be satisfied by the cells switched
across CICQ architectures:
• in each timeslot , at most one cell per input can be trans-
ferred to the internal buffers
(1)
• in each timeslot , at most one fragment is delivered to
each output port
(2)
• due to ﬂow control, no fragment can be transferred to an
internal buffer which is full
if (3)
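The occupancy update and constraints (2) and (3) can be sketched for a single internal timeslot as follows. This is only an illustrative model under stated assumptions: the dictionary layout keyed by (input, output) pairs and all names are hypothetical, deliveries are assumed to draw only from fragments stored at the beginning of the timeslot, and the per-cell constraint (1) is approximated by allowing at most one fragment per crosspoint.

```python
def step_buffers(occupancy, transfers, deliveries, buffer_size):
    """Advance the crosspoint buffers by one internal timeslot.

    occupancy, transfers, deliveries: dicts keyed by (input, output).
    Returns the occupancy at the beginning of the next timeslot:
        new = stored + transferred - delivered.
    """
    served = {}
    for (i, j), d in deliveries.items():
        if d < 0 or d > occupancy.get((i, j), 0):
            raise ValueError("cannot deliver a fragment that is not stored")
        served[j] = served.get(j, 0) + d
    if any(count > 1 for count in served.values()):
        raise ValueError("constraint (2): more than one delivery to an output")

    for key, a in transfers.items():
        if a not in (0, 1):
            raise ValueError("at most one fragment per crosspoint per timeslot")
        if a == 1 and occupancy.get(key, 0) >= buffer_size:
            raise ValueError("constraint (3): transfer into a full buffer")

    keys = set(occupancy) | set(transfers) | set(deliveries)
    return {
        k: occupancy.get(k, 0) + transfers.get(k, 0) - deliveries.get(k, 0)
        for k in keys
    }


# Input 0 holds one fragment for output 0; it is delivered while a new
# fragment is transferred toward output 1 (buffer size 1).
state = step_buffers({(0, 0): 1}, {(0, 1): 1}, {(0, 0): 1}, buffer_size=1)
```

Because a transfer is admitted only into a buffer that is not full at the beginning of the timeslot, the occupancy returned by the function never exceeds `buffer_size`.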
At last, the behavior of a pure IQ switch is emulated by CICQ
(with ) when the following extra constraint is added:
(4)
Indeed, under (4), fragments are directly transferred (i.e., trans-
ferred within one slot) from the input to output ports, and the
internal buffers remain empty (i.e., ). This
observation will be exploited in the following to extend the re-
sults for CICQ to IQ architectures.
III. TRAFFIC AND PERFORMANCE DESCRIPTION
A. -Complex Trafﬁc
In this section we introduce the “pathological” multicast
trafﬁc patterns under which throughput degradations are expe-
rienced. Deﬁnitions reported in this section generalize those
in [6, Sec. IV]. Consider an $N \times N$ switch, with inputs
receiving cells from at least one flow (inputs receiving cells are
called “active inputs” throughout this paper).
Deﬁnition 1: A set of ﬂows with is said
-complex, with , if:
1) ﬂows in arrive at active input port , i.e., ,
for any active input port ;
2) ﬂows in are directed to output port , i.e.,
, for any output port ;
3) for each subset of comprising ﬂows, a destination ex-
ists to which all the ﬂows in the subset are directed.
Table I reports an example of a -complex ﬂow set for
a switch, where only inputs and are active. In an
switch where input ports are active,
and ,a -complex ﬂow set of size can be
generated with the following algorithm.
Step 1) Let be a set of different
flows.
Step 2) Form all the possible different subsets
of whose size is , and create an arbitrary injective
correspondence from subsets of cells to destinations.
With reference to Table I,
, and
.
Step 3) To each ﬂow in assign all the destinations that
correspond to sets containing the cell itself. With
reference to Table I,
, and
.
Step 4) Assign the first flows in to the first input, the
second flows to the second input, and so on, until
the last flows have been assigned to input .
With reference to Table I, , and
and .
TABLE I
EXAMPLE OF A (2, 2)-COMPLEX FLOW SET FOR A 6 × 6 SWITCH
TABLE II
SWITCH SIZE N TO DEFINE A (k, N )-COMPLEX TRAFFIC
For each flow, the resulting fanout is equal to
. Table II shows the switch size $N$ as a function of and .
Note that only small values of and are meaningful in a
realistic switch, which can have up to a few thousand ports; larger
values of are significant only for theoretical speculations.
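The four steps of the construction can be sketched in Python as follows. The parameter names are hypothetical, and the subset size is left as an explicit argument because the corresponding symbol is not legible in this text. Under the assumption of two active inputs with two flows each and subsets of size two, the sketch yields six destinations, which matches the 6 × 6 switch mentioned in the caption of Table I.

```python
from itertools import combinations


def build_flow_set(num_active_inputs, flows_per_input, subset_size):
    """Sketch of Steps 1-4 (all parameter names are assumptions).

    Returns (fanout, input_of, num_outputs), where fanout maps each flow
    to its fanout-set and input_of maps each flow to its input port.
    """
    # Step 1: a set of distinct flows, identified by integers.
    flows = list(range(num_active_inputs * flows_per_input))

    # Step 2: one distinct destination per subset of the given size
    # (the enumeration order provides the injective correspondence).
    subsets = list(combinations(flows, subset_size))

    # Step 3: each flow is directed to the destinations of the subsets
    # that contain it.
    fanout = {f: set() for f in flows}
    for destination, subset in enumerate(subsets):
        for f in subset:
            fanout[f].add(destination)

    # Step 4: consecutive groups of flows go to consecutive inputs.
    input_of = {f: f // flows_per_input for f in flows}
    return fanout, input_of, len(subsets)


fanout, input_of, num_outputs = build_flow_set(2, 2, 2)
```

By construction, every group of `subset_size` flows shares at least one destination, which is the source of the output contention exploited in the following sections.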
Deﬁnition 2: An arrival process is uniformly -com-
plex with normalized rate (with ) if:
1) the associated ﬂow set is -complex;
2) for every flow , cells arrive at the switch according to
a general discrete time stationary arrival process satisfying
the strong law of large numbers;
3) for each ﬂow the average number of cell arrivals
per timeslot is .
Note that .
Deﬁnition 3: A set of cells of size is a perfect
-complex cell set if each cell in belongs to a different
ﬂow , being a -complex ﬂow set .
Deﬁnition 4: A set of cells of size is a generalized
-complex cell set if each cell in belongs to a ﬂow
, being a -complex ﬂow set (note that in this
case more cells of the same ﬂow can belong to ).
B. Trafﬁc Admissibility, Stability, and Maximum Throughput
We introduce the basic notion of trafﬁc admissibility.
Definition 5: A stationary arrival process satisfying the
strong law of large numbers is admissible if no input or
output port is overloaded, i.e.
We notice that uniformly -complex arrival processes are
admissible when .
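The admissibility condition can be checked numerically with a short Python sketch. The flow representation as (input port, fanout-set, average rate) triples, the function name, and the use of a non-strict bound of one cell per timeslot on each port are assumptions of this sketch, since the formula of Definition 5 is not legible in this text.

```python
def is_admissible(flows, tolerance=1e-12):
    """Check that no input or output port is overloaded.

    flows: iterable of (input_port, fanout_set, rate), with rate the
    average number of cell arrivals per timeslot for that flow.
    A multicast flow loads its input once and every output in its
    fanout-set once.
    """
    input_load, output_load = {}, {}
    for input_port, fanout_set, rate in flows:
        input_load[input_port] = input_load.get(input_port, 0.0) + rate
        for output_port in fanout_set:
            output_load[output_port] = output_load.get(output_port, 0.0) + rate
    loads = list(input_load.values()) + list(output_load.values())
    return all(load <= 1.0 + tolerance for load in loads)


# Four flows on two inputs; each output is reached by exactly two flows.
pattern = [
    (0, {0, 1, 2}, 0.45),
    (0, {0, 3, 4}, 0.45),
    (1, {1, 3, 5}, 0.45),
    (1, {2, 4, 5}, 0.45),
]
ok = is_admissible(pattern)
```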
Definition 6: A switch architecture is stable under an
admissible arrival process if, with probability $1$, the observed average
departure cell rate from the switch equals the average cell arrival
rate at the switch input ports.
A necessary condition for the switch to be stable is that the rate
at which cells are moved toward output ports equals the average
cell arrival rate at inputs
A different necessary condition for switch stability is that the
rate at which cells are transferred to the internal buffers equals
the average cell arrival rate at inputs
which is equivalent to:
(5)
Deﬁnition 7: A switch achieves 100% throughput if it is
stable under every admissible arrival process.
Observe that an OQ switch (with inﬁnite queues size), by
construction, achieves 100% throughput, also when it is fed by
-complex trafﬁc.
IV. MAIN RESULTS
In this section we report our results regarding the maximum
throughput achievable by CICQ (or IQ) architectures under uni-
formly -complex arrival process.
In order to prove our main results, we need ﬁrst to intro-
duce some partial results on the performance of CICQ (or IQ)
architectures evaluated over a ﬁnite temporal horizon. We as-
sume that at the beginning of timeslot a generalized
-complex cell set is waiting for transfer at inputs; we
further assume that all internal buffers are empty, and no
further cells arrive at input ports. We evaluate the “input queues
clearance time,” deﬁned as the minimum number of timeslots
necessary to completely transfer all the cells residing at in-
puts toward the internal buffers, thus clearing the input ports
memories.
These results constitute fundamental building blocks for
the main theorems presented in the next subsection; however,
since their proofs are quite long and involved, to ease a first
reading of the paper, we moved all proofs to the appendices.
A. Input Queues Clearance Time
1) No Fanout Splitting Policies:
Theorem 1: Consider a CICQ switch with loaded
by a generalized -complex cell set . The input queues
clearance time satisﬁes
if fanout splitting is not allowed at input ports.3
Corollary 1: For IQ switches, it results .
Corollary 2: Consider a CICQ switch with (or an IQ
switch), loaded by a generalized -complex cell set .
Under a no fanout splitting transfer policy, if ,(
for IQ switches) at most cells can be transferred from the
active inputs to the internal buffers in consecutive
internal timeslots ( timeslots for IQ switches).
2) Fanout Splitting Policies:
Deﬁnition 8: Let be an ordered set comprising ele-
ments, i.e., under the assumption
. We denote with the number of possible dif-
ferent -tuple such that for
and .
It can be shown that can be computed by the
following recursion:
for
Theorem 2: Consider a CICQ switch with , loaded by
a generalized -complex cell set . For any finite , there
exist two integers and ,
with and , such that
and no complete exists with
frame length , under any fanout-splitting policy.
Corollary 3: Consider an IQ switch loaded by a generalized
-complex cell set . For any ﬁnite , there exist two
integers and , with and
, such that and
no complete exists with frame length ,
under any fanout-splitting policy.
B. Switch Throughput Results
1) No-Fanout-Splitting Scheduling Policy Case:
Theorem 3: Consider a CICQ switch with and
finite , in which cells arrive according to an admissible
uniformly -complex traffic pattern at rate . For
(for example, when and
) the switch implementing any no fanout splitting
transfer policy is unstable.
3 Let ⌈x⌉ be the minimum integer ≥ x and ⌊x⌋ be the maximum integer ≤ x.
Proof: This proof easily follows from Corollary 2. Indeed,
since at most cells can be transferred to the internal buffers
in internal timeslots, the average number of cells
transferred to the internal buffers per external timeslot satisfies
the following relation:
from which
Now, thanks to (5), if
the system is unstable. Now for the system becomes
unstable for any .
2) Fanout-Splitting Scheduling Policy Case:
Theorem 4: Consider either a CICQ switch
or an IQ switch with ﬁnite speedup implementing
a fanout splitting scheduling policy. Under an admissible uni-
formly -complex trafﬁc at rate , the switch is
unstable for and , being
and , with ,
and .
Proof: In this proof, we denote with the term “frame” a
limited temporal horizon comprising a ﬁnite set of consecutive
internal/external timeslots. We say that the transfer of a cell is
completed within a given ﬁnite time horizon (timeslot/frame)
if the transfer of the last fragment toward the internal buffer
occurs within the considered ﬁnite temporal horizon. We say
that a cell is fully transferred within a given ﬁnite time horizon
(timeslot/frame) if all the fragments of the considered cell are
transferred toward internal buffers within the considered ﬁnite
time-horizon.
The theorem can be proved by contradiction; suppose that
there exists a per-multicast-ﬂow scheduler that makes the
switch stable with speedup under a uniformly -com-
plex trafﬁc pattern at rate . Since on average cells arrive
at the switch in each timeslot, in a stable switch, with
probability $1$, it must be
(6)
Consider a sample path for which the above property holds,
and on it consider a sequence of time windows
with , and group the timeslots belonging to in
frames of length external timeslots (which correspond to
internal timeslots); now (6) becomes
Thus, since sequence is in-
teger valued, there exists such that
from which immediately follows:
that is, there exists a frame of consecutive external times-
lots (to which internal timeslots correspond), in which
the transfer of at least cells from inputs to internal buffers
is completed. Since the scheduler is per-multicast-ﬂow, under a
generalized -complex trafﬁc pattern no more than
cells may be simultaneously handled by the scheduler (one
per flow). Hence, when the considered frame starts, among the
cells whose transfer will be completed within the frame,
no more than cells can be already partially transferred
toward internal buffers. As a consequence, at least cells
are necessarily fully transferred within the considered frame
comprising internal time slots. However, since any set
of cells belonging to a uniformly -complex arrival
process forms a generalized- -complex cell set, for any
it is possible to choose a such that ;
for example ﬁxing . Now we can apply Theorem 2
with playing the role of ; hence, if
and , for
any and we obtain a contradiction with
Theorem 2.
An important question that may arise at this point is whether
sufficient conditions can be established on and to guarantee
that a CICQ switch achieves 100% throughput. The following
result provides a partial answer to this question, deﬁning sufﬁ-
cient conditions for stability of CICQ switches loaded by uni-
formly -complex arrival processes with rate .
Theorem 5: Consider an CICQ switch loaded by a
uniformly -complex traffic at rate . If and
, the switch is stable under an appropriate
fanout-splitting scheduling policy.
Proof: We consider a CICQ with speedup .
We subdivide time in frames, each frame comprises consec-
utive internal timeslots (this corresponds to external times-
lots). We consider a scheduling policy according to which we
have the following.
• Among the internal timeslots corresponding to an ex-
ternal timeslot, only the ﬁrst is exploited for the transfer of
cells from inputs toward internal buffers.
• A ﬁxed internal timeslot per frame is reserved at input to
every ﬂow enqueued at input . In an internal timeslot
reserved to ﬂow , no cells belonging to other ﬂows can
be transferred from inputs to internal buffers, even if
MC-VOQ queue associated to ﬂow is empty.
• Schedulers at output ports are assumed work-conserving
(i.e., implies ).
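The input side of this frame-based policy can be illustrated with a small Python sketch. It rests on assumptions that are not legible in this text and are therefore hypothetical: each active input carries `flows_per_input` flows, a frame spans `flows_per_input` external timeslots, flows are reserved in round-robin order within the frame, and timeslots are indexed from 0.

```python
def reserved_flow(t_int, speedup, flows_per_input):
    """Return the local index of the flow allowed to transfer at internal
    timeslot t_int, or None if the slot is not used for transfers.

    Only the first internal timeslot of every external timeslot is
    used; within a frame, each flow of an input owns one such slot.
    """
    external, offset = divmod(t_int, speedup)
    if offset != 0:
        return None  # remaining internal slots are left to the outputs
    return external % flows_per_input


# With speedup 2 and 3 flows per input, a frame lasts 6 internal slots.
frame = [reserved_flow(t, 2, 3) for t in range(6)]
```

The reservation is static: a slot owned by a flow with an empty queue stays unused, which keeps the transfer pattern periodic and simplifies the buffer-occupancy argument that follows.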
Note that the above scheduling policy ensures stability for any
provided that the transfer of fragments is never blocked
by flow control; indeed, in every frame, fragments
directed to output are transferred from input ports toward the
internal buffers. As a consequence, we are interested in finding
the minimal storage capacity requirements on internal buffers
to ensure that fragment transfer is never blocked by
flow control. To this purpose we consider the pessimistic case
in which input queues are saturated and never become empty.
Under the above assumptions is perfectly periodic
(with period equal to a frame). Since the output scheduler is
work conserving (i.e., implies
), to avoid contradictions, must exhibit a periodic
behavior too, being zero, for exactly internal timeslots
within each frame.
Our claim is that is sufﬁcient to avoid blocks.
Without loss of generality, we suppose and
consider the dynamics of for and ,
by construction it holds:
and for
Combining the previous two expressions, we obtain that for any
it holds
Suppose , this implies for
any and . Now since the output scheduler
is work-conserving it must be if
, thus:
Therefore, since , at least cells have been
transferred from inputs to internal buffers within the internal
timeslots ; being , by construction it
results
i.e., ; this proves that a buffer
is sufﬁcient to guarantee that the transfer of fragments from
input port to internal buffers is never blocked by flow control.
Fig. 2. Minimum buffer requirements B (N, S) to guarantee stability under
any (k, N )-complex traffic.
Given an $N \times N$ CICQ switch with speedup $S$, we define
with the minimum internal buffer storage requirement to
guarantee stability under any admissible uniform -com-
plex trafﬁc arrival process. Note that an CICQ
switch can be loaded with only a ﬁnite set of possible uni-
formly -complex arrival processes; indeed it must hold:
(we remark that in case in which
some of the switch output ports remain idle). Furthermore, note
that CICQ switches are trivially stable for if for any ,
whenever . Thus given an CICQ switch with
speedup , the minimum buffer should be chosen such that
, being the maximum value
of such that , with (it can be shown
that this last constraint implies ). Fig. 2 shows the
minimum buffer required for stability, , as a function of
and of the available speedup (fractional values for are
allowed). It is clear that for practical switch sizes ,
buffers smaller than 4 are sufﬁcient to avoid the throughput
degradations due to critical trafﬁc patterns.
It is interesting to evaluate how the buffer requirement
asymptotically scales with the switch size and the speedup
. Using the Stirling approximation for the factorial and setting
, after some algebra it results
from which, ﬁxing the value of the speedup it results
, while ﬁxing the value of , it results
, being the minimum speedup sufﬁ-
cient to guarantee stability under any -complex trafﬁc
arrival process.
On the other hand, Theorem 4 provides necessary condi-
tions on buffer and speedup requirements to achieve maximal
throughput in large CICQ switches under
-complex traffic patterns. Since , after some
algebra it results: and .
Thus ﬁxing to the buffer size at every crosspoint, we obtain
that the minimal necessary speedup to achieve maximal
throughput scales with as: ; whereas
the minimum necessary buffer at each crosspoint, for a
ﬁxed speed up , scales with as: .
Note that and exhibit the same asymptotic scaling
behavior, for any fixed speedup , meaning that our bounds give
asymptotically tight predictions on the buffer requirement to
achieve stability in large CICQ switches with finite speedup. On
the contrary, and exhibit different asymptotic
behaviors for fixed , meaning that our bounds become loose when
the speedup is increased to infinity.
V. SIMULATION RESULTS
We investigated the performance achievable by distributed
schedulers in a CICQ under uniformly -complex arrival
process. We assume per-multicast-ﬂow queueing and we con-
sider fanout splitting policies only.
The scheduling decisions are taken in a local uncoordinated
fashion by schedulers at input ports and schedulers at
output ports. Each input scheduler selects a cell; fragments of
the selected cell are transferred toward the set of non-full in-
ternal buffers. Each output scheduler selects an internal buffer
from which a fragment is delivered to the output port.
The following selection policies are adopted by input
schedulers.
• Longest Queue First (LQF): Each input selects the
longest input queue and transfers all the fragments for
which the internal buffer is not full.
• Min-Splitting (MS): Each input finds the set of all
multicast cells that can be transferred completely to the internal
buffers with a null residual fanout-set. If this set is nonempty, the
cell with the largest fanout is selected. Otherwise, the input
selects the multicast cell with the largest number of
fragments that can be transferred to the internal buffers. This
policy tries to reduce the number of partial transmissions
needed to completely transfer cells from input ports toward
internal buffers (i.e., the number of timeslots in which a cell
is partially transferred from inputs to internal buffers). In
this way the access bandwidth to the switching fabric is more
efficiently exploited, resulting in better global performance,
as shown in [21] for IQ architectures.
The output schedulers adopt one of the following simple policies.
• LQF: Each output selects the longest internal buffer.
• Round robin (RR): Each output selects the first nonempty
buffer according to a round-robin scheme.
Ties are broken randomly both by input and output
schedulers.
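The selection rules described above can be sketched in Python as follows. This is an illustrative reading of the textual descriptions, not the simulator used for the results: all names and data layouts are hypothetical, the Min-Splitting rule ranks head-of-line cells by the size of their residual fanout-set (an assumption, since the text says "largest fanout"), and the update of the round-robin pointer is left to the caller because it is not specified here.

```python
import random


def pick_max(candidates, key, rng):
    """Return a candidate maximizing key, breaking ties randomly."""
    best = max(key(c) for c in candidates)
    return rng.choice([c for c in candidates if key(c) == best])


def input_lqf(queue_len, rng=random):
    """Input LQF: select the longest nonempty input queue."""
    nonempty = [q for q, n in queue_len.items() if n > 0]
    return pick_max(nonempty, lambda q: queue_len[q], rng) if nonempty else None


def input_min_splitting(hol_residual, full, rng=random):
    """Input MS. hol_residual maps a queue to the residual fanout-set of
    its head-of-line cell; full is the set of outputs whose buffer is full."""
    if not hol_residual:
        return None
    complete = [q for q, r in hol_residual.items() if not (r & full)]
    if complete:
        # Some cell can be transferred entirely: take the largest one.
        return pick_max(complete, lambda q: len(hol_residual[q]), rng)
    # Otherwise maximize the number of fragments that can move now.
    return pick_max(list(hol_residual), lambda q: len(hol_residual[q] - full), rng)


def output_lqf(buffer_len, rng=random):
    """Output LQF: select the longest nonempty internal buffer."""
    nonempty = [i for i, n in buffer_len.items() if n > 0]
    return pick_max(nonempty, lambda i: buffer_len[i], rng) if nonempty else None


def output_rr(buffer_len, pointer, num_inputs):
    """Output RR: first nonempty buffer scanning inputs from pointer."""
    for step in range(num_inputs):
        i = (pointer + step) % num_inputs
        if buffer_len.get(i, 0) > 0:
            return i
    return None
```

For example, with head-of-line residual fanout-sets `{"a": {0, 1}, "b": {2}}` and the buffer toward output 1 full, `input_min_splitting` selects `"b"`, the only cell that can be transferred completely.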
We emphasize that the scheduling policies described in this
section are inspired by those recently proposed in [22], [23],
[10]. We selected the following two combinations of “input
scheduler”-“output scheduler”: LQF-RR (chosen for its sim-
plicity) and MS-LQF (chosen because it outperforms all other
three combinations of policies).
Fig. 3 compares the minimum speedup required to obtain
100% throughput for each switch size; the curves refer to the two
scheduling policies and to different internal buffer sizes
. The curves were obtained by feeding the switch with
all possible -complex traffic patterns, such that
, and finding the highest speedup required for stability. All the
curves are approximately straight, corresponding to a growth
; for LQF-RR the curves are also slightly concave,
coherent with a growth as predicted by
Theorem 5.
Fig. 3. Minimum speedup to achieve 100% throughput for different scheduling
policies and internal buffer sizes.
TABLE III
COMPARISON OF THE MINIMUM BUFFER SIZES TO ACHIEVE STABILITY
WHEN AN INTERNAL SPEEDUP S = 2 IS AVAILABLE
Furthermore, given any , MS-LQF always outperforms
LQF-RR, since it minimizes the number of times a packet
is partially transferred from inputs to internal buffers. This
performance gain is smaller for and becomes more
significant for larger values of .
When Fig. 3 is compared to Fig. 2, it can be observed that
both scheduling policies require smaller buffers than the
theoretical predictions. Table III reports the minimum buffer
sizes necessary to guarantee stability, according to Theorem 5
and for the two scheduling policies; the available speedup has
been fixed equal to S = 2. Both policies require smaller buffers
than the ones predicted by Theorem 5, which provides just sufficient
conditions for stability. Furthermore, MS-LQF requires
a smaller buffer than LQF-RR, thanks to its higher
efficiency.
In conclusion, for practical switch sizes, simulations also
show that small buffers are enough to compensate for the throughput
degradation due to critical multicast patterns.
VI. FINAL REMARK
The results in Section IV can also be rephrased for OQ
switches with finite output buffers and flow control.
Fig. 4. A CICQ emulating an OQ switch with finite output buffers.
Observe that CICQ architectures operating at speedup one are
able to perfectly emulate an OQ architecture when the effects of
flow control are neglected (i.e., when internal buffers have infinite
size). Indeed, in such architectures cells arriving at inputs
are immediately transferred toward internal buffers regardless
of their potential conﬂicts, and then sequentially selected for
transmission onto the output lines by a work-conserving output
scheduler. For each output , in each timeslot, up to different
fragments directed to (one per input line) are transferred from
the inputs to the internal buffers.
Thus, by moving the internal buffers of a CICQ to the output
cards and replacing the buffered crossbar with a bufferless
crossbar (with expansion factor in the output port number), a
CICQ with speedup one is transformed into an OQ architecture
(as in Fig. 4) without impacting on its black-box behavior.
Considering the effects of the ﬂow control mechanism, it is
possible to establish an equivalence between a CICQ architec-
ture, with internal buffers of size and unitary speedup, and
an OQ switch, maintaining at each output a separate buffer of
size for fragments coming from every input. The equivalent
OQ structure is provided with a ﬂow-control mechanism which
prevents output buffers from overﬂowing, by stopping cells at
input ports.
In conclusion, the minimum amount of buffer at each
output port that guarantees 100% throughput under any admissible
traffic pattern in OQ switch architectures with flow
control increases with the switch size , as ,
becoming unbounded for asymptotically large switches.
VII. CONCLUSION
In this paper we have established necessary and sufﬁcient
conditions to achieve stability under any -complex
trafﬁc pattern in large CICQ architectures. Our main
ﬁnding is that, to avoid throughput degradations, the amount
of buffering to be deployed at every crosspoint must necessarily
scale logarithmically with the switch size , in CICQ
architectures with finite speedup . This result extends known
results for pure IQ switches and shows that the introduction of
internal buffers does not solve, but only mitigates, the problem
of possible throughput degradation due to multicast trafﬁc.
Asymptotic theoretical predictions are confirmed by simulation
results, which show that throughput degradations may occur
in CICQ switches of moderate size, even if small buffers are able
to compensate for such degradations.
Even if our results are mainly theoretical in nature, we
believe that they can contribute to a deeper understanding of the
behavior of CICQ architectures supporting multicast traffic.
As a final remark, we emphasize that our results also apply to
OQ switches with finite output buffers and flow control.
APPENDIX A
EVALUATION OF THE INPUT QUEUES CLEARANCE TIME
In this appendix, we study the problem of switching multicast
cells through a CICQ and IQ architecture over a finite temporal
horizon. Essentially, we are interested in evaluating the minimum
number of internal timeslots (that we also call the “minimum
frame length”) needed to fully transfer (up to evacuation
of input queues) a perfect or a generalized -complex set
of cells residing at the inputs toward internal buffers.
For unicast traffic, a solution to the problem of transferring
cells over a finite horizon is provided by the well-known
Birkhoff–von Neumann theorem, which states that any set of
unicast cells can be completely transferred through a pure IQ
switch (and thus, also through a CICQ) in timeslots, being
the maximum number of cells either stored at any input port or
directed to any output port. Here we prove that an extension of
the Birkhoff–von Neumann theorem to the case of multicast traffic
is not possible. Before proving our results, however, we need
to formalize the problem more precisely.
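As an illustration of the unicast case recalled above, the following Python sketch builds a frame whose length equals the maximum number of cells stored at any input or directed to any output. It is a minimal sketch under stated assumptions, not the authors' construction: the demand is given as a hypothetical integer matrix `demand[i][j]` (cells from input `i` to output `j`), the matrix is padded with dummy cells until all row and column sums coincide, and one perfect matching per timeslot is extracted with a standard augmenting-path search. All function names are introduced here for illustration.

```python
def frame_length(demand):
    """Max number of cells at any input (row sum) or for any output (column sum)."""
    n = len(demand)
    rows = [sum(row) for row in demand]
    cols = [sum(demand[i][j] for i in range(n)) for j in range(n)]
    return max(rows + cols)


def pad_to_regular(demand, length):
    """Add dummy cells so that every row and column sums to `length`."""
    n = len(demand)
    padded = [row[:] for row in demand]
    row_def = [length - sum(row) for row in padded]
    col_def = [length - sum(padded[i][j] for i in range(n)) for j in range(n)]
    i = j = 0
    while i < n and j < n:
        add = min(row_def[i], col_def[j])
        padded[i][j] += add
        row_def[i] -= add
        col_def[j] -= add
        if row_def[i] == 0:
            i += 1
        else:
            j += 1
    return padded


def perfect_matching(matrix):
    """Return match[j] = input matched to output j, using positive entries only."""
    n = len(matrix)
    match = [-1] * n

    def augment(i, seen):
        for j in range(n):
            if matrix[i][j] > 0 and not seen[j]:
                seen[j] = True
                if match[j] == -1 or augment(match[j], seen):
                    match[j] = i
                    return True
        return False

    for i in range(n):
        if not augment(i, [False] * n):
            raise ValueError("no perfect matching: matrix is not regular")
    return match


def schedule_unicast(demand):
    """List of timeslots; each timeslot is a list of (input, output) transfers."""
    length = frame_length(demand)
    padded = pad_to_regular(demand, length)
    real = [row[:] for row in demand]
    frame = []
    for _ in range(length):
        slot = []
        for j, i in enumerate(perfect_matching(padded)):
            padded[i][j] -= 1
            if real[i][j] > 0:  # dummy cells leave the pair idle
                real[i][j] -= 1
                slot.append((i, j))
        frame.append(slot)
    return frame


demand = [[2, 1, 0], [0, 1, 1], [1, 0, 2]]
frame = schedule_unicast(demand)
```

In each timeslot every input sends at most one cell and every output receives at most one cell, which is the bufferless IQ constraint; the appendix shows that no analogous guarantee holds for multicast cell sets.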
A. The Scheduling Problem
Recalling that represents the set of cells to be
switched at the beginning of the considered frame, we rede-
ﬁne functions in such a way to operate directly on cells
in , according to the following rules: for each cell
with parameters , function returns input port ,
while function returns fanout-set . Given a set of cells
, we denote with the set of input ports corre-
sponding to cells in . Function returns the set of cells
whose input port is . Function returns the set
of cells whose fanout-set comprises , i.e., .
We denote with a set comprising consecutive (internal)
timeslots which represents our ﬁnite time horizon, or frame;
.
We now introduce three functions deﬁning the switching
process of cells from input to output ports.
Function deﬁnes the correspondence between each
cell and the set of timeslots in which the cell is selected
at the input port for a possible (full or partial) transfer to the
internal buffers. More formally:
Deﬁnition 9: An Input Time Slot Assignment is
deﬁned as a function whose domain is and whose image is
the power set of , i.e.,4: .
For each cell the effective transfer process of different
fragments to the internal buffers is fully speciﬁed by the Input
Scheduling function which, for each cell and each
4 Given any set A, 2^A denotes the power set of A.
output port , returns the timeslot at which the frag-
ment transfer occurs. More formally, we have the following
deﬁnition.
Deﬁnition 10: An Input Scheduling is deﬁned as
a function whose domain is the set of pairs
and whose image is . Conventionally,
means that the fragment is not transferred
in the considered frame. An input scheduling is
said to be complete if and , it results
. In this case all the fragments originated by cells in are transferred
toward internal buffers within the considered frame.
Finally, function of cell directed to destina-
tion returns the timeslot of the frame in which the
corresponding fragment is delivered from the internal buffer to
the output port; conventionally if the frag-
ment is not transferred in the current frame. More formally, we
have the following deﬁnition.
Deﬁnition 11: An Output Time Slot Assignment
is deﬁned as a function whose domain is
the set of pairs with and
and whose image is .
Note that the three functions and cannot
be independently deﬁned. For every cell and every output
the transfer of fragments can occur only in timeslots
specified by , i.e., it must be .
For each fragment, the delivery from the internal buffers to
the output ports can occur only after the fragment has been
moved into the internal buffers, i.e.,5
If instead
(7)
the behavior of an IQ switch is emulated. Indeed, (7) is equiv-
alent to (4) and forces fragments to be directly transferred (i.e.,
transferred within one slot) from input to output ports.
Assuming a ﬁrst-come–ﬁrst-served queueing discipline at
each internal buffer, fragments must be delivered to the output
port from internal buffer in the same order they were moved in,
i.e., if then
We also deﬁne the inverse relations which will be useful in
the following.
• is the function returning the set
of cells which, according to , are selected for
transfer to the internal buffers in timeslot , i.e., the set
of cells such that .
• is the function in returning the set of
cells whose fragments directed to are transferred to the
internal buffers in timeslot , i.e., the set of cells such that
. Note that denotes the
set of fragments directed to originated by cells enqueued
at and transferred to the internal buffer in timeslot .
5 Throughout this paper we conventionally assume
• is the function in returning
the set of cells whose fragments are transferred to
output in timeslot , i.e., the set of cells such that
.
We force the Input Time Slot Assignment to satisfy the fol-
lowing constraint:
(8)
which expresses the fact that in each timeslot , at most one
cell per input has access to the internal buffer.
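The constraints stated so far in words can be checked mechanically. The sketch below is illustrative only and rests on assumptions introduced here: cells are identified by arbitrary labels, `cell_input` maps each cell to its input port, `slot_assignment` maps each cell to the set of timeslots of its Input Time Slot Assignment, and `input_schedule` maps each (cell, output) pair to the timeslot of the fragment transfer, with `None` standing in for the conventional value meaning "not transferred in the frame".

```python
def one_cell_per_input(cell_input, slot_assignment):
    """Constraint (8) in words: per timeslot, at most one cell per input."""
    used = set()
    for cell, slots in slot_assignment.items():
        for t in slots:
            key = (cell_input[cell], t)
            if key in used:
                return False
            used.add(key)
    return True


def transfers_within_assignment(slot_assignment, input_schedule):
    """Each fragment is transferred only in a timeslot assigned to its cell."""
    return all(
        t is None or t in slot_assignment[cell]
        for (cell, _output), t in input_schedule.items()
    )


def is_complete(input_schedule):
    """Complete input scheduling: every fragment is transferred in the frame."""
    return all(t is not None for t in input_schedule.values())


# Hypothetical example: cells "a" and "b" sit at input 0, cell "c" at input 1.
cell_input = {"a": 0, "b": 0, "c": 1}
slot_assignment = {"a": {1, 2}, "b": {3}, "c": {1}}
input_schedule = {("a", 0): 1, ("a", 1): 2, ("b", 0): 3, ("c", 1): 1}
```

Cell "a" is split over two timeslots (fanout splitting), which is allowed as long as no other cell of input 0 is selected in those timeslots.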
The triple of functions and completely
specifies the switching process of multicast cells during the considered
frame; any other information on cell movements can be
derived from their definition. In particular, we can express the
following:
• ;
• ;
•
;
• ;
• .
As a consequence, constraints (1)–(3) can be rephrased in terms
of and .
Finally, considering a generic subset , function
returns the set of consecutive timeslots within
which all the cells in are transferred to internal buffer
; i.e., it returns the interval of timeslots such that
.
When the behavior of an IQ switch is emulated (i.e., under
constraint (7)) the input scheduler satisﬁes the following
additional constraint:
which states that, in each slot , for each output port at most
one fragment is transferred through the switching fabric.
B. A Preliminary Result
We establish a ﬁrst simple bound on the minimum number of
time slots necessary to completely transfer toward the internal
buffers a subset of cells residing at inputs, extracted by a gener-
alized -complex cell set.
Lemma 1: Consider a complete input scheduling . Given
, a perfect or generalized -complex cell set, for any
such that , with , (i.e.,
is the set of cells in directed to ), the following constraint
must be satisfied:
Proof: We focus our attention on the transfer process of fragments
directed to , ignoring what happens to fragments directed
toward other outputs. Conventionally, we count timeslots
starting from the timeslot in which the first fragment of is
transferred to the internal buffer (as a consequence, timeslot 1
represents the timeslot in which the first fragment from is
transferred). Let be the number of cells belonging to
whose fragments destined to are scheduled in the first timeslots
(i.e., ). Let
be the number of fragments that are
delivered from the internal buffers to output during the first
timeslots of the frame. It holds
Note that
; since at most one fragment can be delivered to per timeslot,
it results
now since, by construction, for every ,
thus the results
from which
Thus choosing , and , it follows:
Note that the previous result can also be applied to IQ architectures.
In the latter case, since at most one fragment belonging to
per slot can be transferred from the input queues to output
queues, trivially it results
(9)
C. Input Queues Clearance Time for Nonfanout Splitting
Policies
As already said, under nonfanout splitting policies, all the
fragments originated from a cell must be simultaneously trans-
ferred to the internal buffers. We can formalize this constraint,
imposing
and
Then we obtain the following result:
Theorem 1: Consider a CICQ with loaded by a gen-
eralized -complex cell set. Input queues clearance time
must satisfy
Proof: We recall that under no fanout splitting policies
cells are atomically transferred toward internal buffers (no partial
transfer of cells is allowed).
Consider any set (with ) of cells extracted from a
generalized -complex cell set. Since, by definition, there
exists an output port to which all the cells in are directed,
the minimum number of timeslots within which the cells of
(their fragments directed to ) are transferred, by Lemma 1 is
(10)
thus, at most cells can be transferred to the internal buffers
of the switch, in any set of consecutive timeslots.
As a consequence, the minimum number of timeslots to transfer
all the cells in without violating (10) is
which for tends to .
Corollary 1: Consider an IQ switch loaded by a generalized
-complex cell set. Input queues clearance time must
satisfy
Proof: Considering (9), the proof of this corollary can be
easily obtained by repeating the same arguments used in the proof
of Theorem 1. Note that the result can also be obtained directly by setting
in Theorem 1.
Corollary 2: Under a no fanout splitting scheduling policy,
consider a CICQ (or an IQ) switch loaded by a generalized
-complex cell set. If ( for IQ switches),
no more than cells can be transferred to the internal buffers in
consecutive timeslots ( consecutive timeslots
for IQ switches).
Proof: This statement has already been proved in the
proof of Theorem 1.
D. Input Queues Clearance Time for Fanout Splitting Policies
Consider a CICQ switch with generic over a ﬁnite
frame (with ) and loaded with either a perfect or
generalized -complex cell set .
Given a strictly positive integer constant, with ,
let be the set of cells for which , i.e., the
set of cells that have at most transmission opportunities.
Any cell attempts the transfer to internal buffers
at most in timeslots, (with
) and , where
conventionally means that .
Now, assuming , consider any subset
with , since is a -complex cell set, there
exists an output such that all cells in are directed to .
Consider the subsets of comprising those cells whose
fragment directed to is transferred in . By construction,
it results and ( if
is completely scheduled within ). Let ,
being the common destination in .
Lemma 2: A necessary condition for a set
, with to be completely transferred to the internal
buffers within frame is that
(11)
Proof: For any set , Lemma 1 provides a relation be-
tween the size of and
Now since is completely transferred within
. Thus, summing over we get the assertion
(12)
The previous result can be extended also to IQ architectures.
In the latter case it results:
(13)
Lemma 3: Consider a CICQ switch with , loaded
with a set of cells subset of either a perfect or a generalized
-complex cell set . If there exists a set of inputs with
(for some ) such that each input in
transfers at least cells (for some ); then for
and , with
the input queues clearance time is strictly greater than .
Proof: We prove this statement by contradiction. Suppose
it is possible to complete the transfer to internal buffers of cells
in within a frame of length .
We partition frame in groups of contiguous timeslots
(called periods) with iff
. To avoid empty
periods, we assume , i.e., .
Under the conventional assumption that
, consider all possible -tuples
, with exception of the null -tuple
, such that and
. It results .
Note that by construction, every cell is scheduled
within a -tuple of periods ; i.e.,
, for every .
Thanks to the pigeonhole principle (Lemma 7), each input
can be associated with a -tuple of periods
in which the considered input has scheduled at least
cells, i.e., and
.
For by hypothesis it results ;
then, by the pigeonhole principle, there exists a set of inputs
, with and a common -tuple of
periods , such that
for all . As a consequence, the
total number of cells to be scheduled in by
inputs in is .
For the ease of comprehension, Fig. 5 graphically illustrates
the construction described in the previous two paragraphs in a
simple case.
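The two-stage use of the pigeonhole principle can also be checked numerically on the figures quoted in the caption of Fig. 5 (9 cells per input spread over 4 tuples of periods, 24 inputs). The Python sketch below is an illustration under assumptions introduced here: cells are placed on tuples uniformly at random, and the function and variable names are hypothetical. Whatever the placement, each input has a tuple holding at least 3 of its cells, and at least 6 inputs mark the same tuple.

```python
import random
from collections import Counter, defaultdict


def two_stage_pigeonhole(num_inputs, cells_per_input, num_tuples, seed=0):
    """Return (shared tuple, inputs marking it, smallest per-input mark size)."""
    rng = random.Random(seed)
    marked = defaultdict(list)
    smallest_mark = cells_per_input
    for inp in range(num_inputs):
        # Stage 1: each input spreads its cells over the tuples and marks
        # the tuple that received the most cells.
        placement = [rng.randrange(num_tuples) for _ in range(cells_per_input)]
        tup, count = Counter(placement).most_common(1)[0]
        smallest_mark = min(smallest_mark, count)
        marked[tup].append(inp)
    # Stage 2: among all inputs, find the tuple marked most often.
    shared = max(marked, key=lambda t: len(marked[t]))
    return shared, marked[shared], smallest_mark


shared, inputs, smallest_mark = two_stage_pigeonhole(24, 9, 4)
```

The random placement only plays the role of an arbitrary scheduler; the two lower bounds do not depend on the seed.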
Fig. 5. An example of application of the pigeonhole principle, for the case
n = 4, k = 12, with at least 0.75 k = 9 cells scheduled among 4 T-tuples of
periods for each input. Hence, for each input port it is possible to find at least
one T-tuple (marked in gray) in which the number of scheduled cells is at least
⌈9/4⌉ = 3. Finally, considering at least 24 inputs, it is always possible to find
a subset of 6 inputs sharing the same marked T-tuple. In the example, six
inputs share the same T-tuple (M ... M ).
Now consider a set , comprising cells scheduled by inputs
within -tuple of periods .
We denote with the output port to which all the cells in
are directed, and partition set in subsets , comprising
those cells whose fragment directed to is transferred
toward internal buffer in slot ; by construction it results
. Thus
However, the last inequality contradicts (11) for
. Indeed, by applying Lemma 2
The previous statement can be extended to IQ switches, resulting in the
following lemma.
Lemma 4: Consider an IQ switch loaded with a set of cells ,
subset of either a perfect or a generalized -complex cell
set . If there exists a set of inputs with (for
some ) such that each input in transfers at least
(for some ); then for and ,
with the input queues clearance
time is strictly greater than .
Proof: The assertion can be proved by repeating exactly the
same arguments used in the proof of Lemma 3.
Now we can present our two major results.
Lemma 5: Consider a CICQ switch with generic (or
an IQ switch) loaded with a perfect -complex cell set .
For any ﬁnite , there exists an integer , being
such that and
(and for an IQ switch) no complete
exists with frame length , under any fanout-splitting
policy.
Proof: We recall that, for any , is the
number of transmission opportunities for cell . By construction,
for each input at most transmission opportunities can
be distributed among the enqueued cells without conflicts as
described by (8). Hence, the average number of transmission
opportunities per cell cannot exceed .
Consider , the set of all cells for which .
Thanks to Lemma 6, at least -th of the cell set is in , i.e.,
.
By setting in
Lemma 8, it is possible to claim that, since ,
there exists a set of inputs with such that,
each input in schedules at least cells belonging
to .
Now we invoke Lemma 3 (or 4 in case of IQ) to conclude our
proof. We notice that in our case plays the role of in the
statement of Lemma 3, plays the role of plays
the role of and plays the role of .
Theorem 2: Consider a CICQ switch with , loaded by
a generalized -complex cell set . For any finite , there
exist two integers and ,
with and , such that
and no complete exists with
frame length , under any fanout-splitting policy.
Proof: If there is an input such that the
proof is trivial. Let us consider the case in which
for all . By setting
in Lemma 8, it is possible to find a set of inputs with
such that each input in has at least
cells.
We mark exactly cells for each input in . The set of
marked cells by construction is a perfect set. Thus
we can apply Lemma 5 to obtain the assertion (we notice that in
our case plays the role of in the statement of
Lemma 5).
The proof of Corollary 3 follows the same arguments as the
previous proof.
APPENDIX B
SOME USEFUL COMBINATORIAL RESULTS
Lemma 6: For a strictly positive integer-valued random vari-
able , with average value .
The proof of the lemma is available in [6].
The following results are extensions of the well-known com-
binatorics “pigeonhole” principle.
Lemma 7: If balls are divided into bins, it is always pos-
sible to ﬁnd a non-empty subset of bins that contains
not less than balls in total.
The proof of this lemma is available in [6].
Lemma 8: If balls are distributed into bins, assuming that
each bin contains at most balls, for any , it is
possible to find at least bins containing at least
balls each.
Proof: Consider all the bins ordered by decreasing
number of contained balls. Let be the number of balls in
bin . By hypothesis, it must be . First observe
that at least must be larger than , otherwise
in contradiction with the
hypothesis.
Let be the number of bins containing at least balls;
since the remaining bins, by construction, contain no more than
balls each, it results:
Thus,
At last, since at most balls can be contained by each bin, it
results:
By solving with respect to , we get the assertion.
REFERENCES
[1] E. Leonardi, M. Mellia, F. Neri, and M. A. Marsan, “On the Stability of
Input-Queued Switches with Speedup,” IEEE/ACM Trans. Netw., vol.
9, no. 1, pp. 104–118, Feb. 2001.
[2] J. G. Dai and B. Prabhakar, “The throughput of data switches with and
without speedup,” in Proc. IEEE INFOCOM 2000, Tel Aviv, Israel,
Mar. 2000, pp. 556–564.
[3] N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand,
“Achieving 100% throughput in an input-queued switch,” IEEE Trans.
Commun., vol. 47, no. 8, pp. 1260–1272, Aug. 1999.
[4] S. T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, “Matching
output queueing with a combined input/output-queued switch,” IEEE
J. Sel. Areas Commun., vol. 17, no. 6, pp. 1030–39, Jun. 1999.
[5] I. Stoica and H. Zhang, “Exact emulation of an output queueing
switch by a combined input output queueing switch,” in Proc. 6th Int.
Workshop Quality of Service (IWQoS’98), Napa, CA, May 1998, pp.
218–224.
[6] M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri,
“Multicast trafﬁc in input-queued switches: Optimal scheduling and
maximum throughput,” IEEE/ACM Trans. Netw., vol. 3, no. 11, pp.
465–477, Jun. 2003.
[7] K. Yoshigoe and K. J. Christensen, “An evolution to crossbar switches
with virtual output queueing and buffered crosspoint,” IEEE Network,
vol. 17, no. 5, pp. 48–56, Sep. 2003.
[8] N. Chrysos and M. Katevenis, “Weighted Fairness in Buffered Crossbar
Scheduling,” in Proc. IEEE HPSR 2003, Torino, Italy, Jun. 2003, pp.
17–22.
[9] T. Javidi, R. Magill, and T. Hrabik, “A high throughput scheduling
algorithm for a buffered crossbar switch fabric,” Proc. IEEE ICC 2001,
pp. 1581–1591, Jun. 2001.
[10] M. Katevenis, G. Passas, D. Simos, I. Papaefstathiou, and N. Chrysos,
“Variable Packet Size Buffered Crossbar (CICQ) Switches,” in Proc.
IEEE ICC 2004, Paris, France, Jun. 2004, vol. 2, pp. 1090–1096.
[11] L. Mhamdi and M. Hamdi, “MCBF: a high-performance scheduling
algorithm for buffered crossbar switches,” IEEE Commun. Lett., vol.
7, no. 9, pp. 451–453, Sep. 2003.
[12] M. Nabeshima, “Performance evaluation of a combined input and
crosspoint queued switch,” IEICE Trans. Commun., vol. E83-B, no. 3,
Mar. 2000.
[13] R. Rojas-Cessa, E. Oki, Z. Jing, and H. J. Chao, “CIXB-1: Combined
input-one-cell-crosspoint buffered switch,” in Proc. IEEE HPSR 2001,
Dallas, TX, pp. 324–329.
[14] R. Rojas-Cessa, E. Oki, and H. J. Chao, “CIXOB-k: combined input-
crosspoint-output buffered packet switch,” Proc. IEEE GLOBECOM
2001, pp. 2654–60, Nov. 2001.
[15] R. Rojas-Cessa and E. Oki, “Round robin selection with adaptable size
frame in a combined input-crosspoint buffered switch,” IEEE Commun.
Lett., vol. 7, no. 11, Nov. 2003.
[16] R. B. Magill, C. E. Rohrs, and R. L. Stevenson, “Output queued
switch emulation by fabrics with limited memory,” IEEE J. Sel. Area
Commun., vol. 21, no. 4, May 2003.
[17] S.-T. Chuang, S. Iyer, and N. McKeown, “Practical algorithms for per-
formance guarantees in buffered crossbars,” in Proc. IEEE INFOCOM
2005, Miami, FL, Mar. 2005, vol. 2, pp. 981–991.
[18] P. Giaccone, E. Leonardi, and D. Shah, “Throughput region of ﬁnite-
buffered networks,” IEEE Trans. Parallel Distrib. Syst., vol. 18, no. 2,
pp. 251–263, Feb. 2007.
[19] M. G. Hluchyj, M. J. Karol, and S. Morgan, “Input versus output
queueing on a space division switch,” IEEE Trans. Commun., vol. 35,
no. 12, pp. 1347–1356, Dec. 1987.
[20] Y. Tamir and G. L. Frazier, “High performance multiqueue buffers for
VLSI communication switches,” ACM SIGARCH, pp. 353–354, May/
Jun. 1988.
[21] R. Ahuja, B. Prabhakar, and N. McKeown, “Multicast Scheduling for
Input-Queued Switches,” IEEE J. Sel. Areas Commun., vol. 15, no. 5,
pp. 855–866, Jun. 1997.
[22] L. Mhamdi and M. Hamdi, “Scheduling Multicast Trafﬁc in Inter-
nally Buffered Crossbar Switches,” Proc. IEEE ICC’04, vol. 2, pp.
1103–1107, Jun. 2004.
[23] X. Zhang and L. N. Bhuyan, “An efﬁcient scheduling algorithm for
combined input-crosspoint-queued (CICQ) switches,” Proc. IEEE
GLOBECOM’04, pp. 1168–1173, 2004.