A Split-Central-Buffered Load-Balancing Clos-Network Switch with
  In-Order Forwarding by Sule, Oladele Theophilus et al.
Oladele Theophilus Sule, Roberto Rojas-Cessa, Senior Member, IEEE, Ziqian Dong, Senior Member, IEEE,
Chuan-Bi Lin, Member, IEEE
Abstract—We propose a configuration scheme for a load-
balancing Clos-network packet switch that has split central
modules and buffers in between the split modules. Our split-
central-buffered Load-Balancing Clos-network (LBC) switch is
cell based. The switch has four stages, namely input, central-
input, central-output, and output stages. The proposed configura-
tion scheme uses a pre-determined and periodic interconnection
pattern in the input and split central modules to load-balance and
route traffic. The LBC switch has low configuration complexity.
The operation of the switch includes a mechanism applied at
input and split-central modules to forward cells in sequence. The
switch achieves 100% throughput under uniform and nonuniform
admissible traffic with independent and identical distributions
(i.i.d.). This high switching performance and low complexity are
achieved while performing in-sequence forwarding and without
resorting to memory speedup or central-stage expansion. Our
discussion includes throughput analysis, where we describe the
operations that the configuration mechanism performs on the
traffic traversing the switch, and proof of in-sequence forwarding.
A simulation study is presented as a practical demonstration of
the switch performance on uniform and nonuniform i.i.d. traffic.
Index Terms—Clos-network switch, load-balancing switch, in-
order forwarding, high performance switching, packet schedul-
ing, packet switching.
This paper is an extended version of that published in IEEE Trans. on
Networking. (Corresponding author: Oladele Theophilus Sule)
O.T. Sule and R. Rojas-Cessa are with the Department of Electrical and
Computer Engineering, New Jersey Institute of Technology, Newark, NJ
07102. Email: {ots5, rojas}@njit.edu.
Z. Dong is with the Department of Electrical and Computer Engineering,
New York Institute of Technology, New York, NY 10023.
C. Lin is with the Department of Information and Communication Engineer-
ing, Chaoyang University of Technology, Wufeng District, Taichung, 41349,
Taiwan.
This work was partially supported by National Science Foundation (NSF)
under Grant No. (CNS) 1641033.
I. INTRODUCTION
Clos-network switches are attractive for building large-size
switches [1]. These switches mostly employ three stages,
where each stage uses switch modules as building blocks.
Each module is a small- or medium-size switch. Modules
of the first, second, and third stages are often called input,
central, and output modules, and they are denoted as IM,
CM, and OM, respectively. Overall, Clos-network switches
require fewer crosspoint elements, each of which is the atomic
switching unit of a packet switch, than a single-stage switch
of equivalent size, and thus they may require less build-
ing hardware. This trait of a Clos network often comes at
the cost of an increased configuration complexity. The term
configuration here means the local interconnection between
inputs and outputs of a module. In general, a Clos-network
switch requires the configuration of the modules in every stage
before packets are forwarded through it. Moreover, owing to the
multi-stage architecture of such a switch, the time for switch
reconfiguration increases as the number of stages holding
dependences increases. In a multi-stage switch, there is a
dependence when the configuration of a module is affected
by the configuration of another. The required configuration
time dictates the internal data transmission time, which in
turn defines the minimum size of the internal data unit. For
example, switches that require long configuration time may
need to use a long internal segment and time to transmit data
while switches with fast configuration times may use a smaller
segment size. Therefore, the configuration time of a switch
must be kept as short as possible to allow fast and efficient
reconfiguration [2].
In the remainder of this paper, we consider the proposed
packet switch to be cell based; that is, upon arrival at an
input port of a switch, packets of variable size are segmented
into fixed-size cells. Cells are forwarded through the switch
to their destination outputs. Packets are re-assembled at the
outputs of the switch. The selection of the cell length is
left for the implementation of the LBC switch. However, as
in any other switch, the cell length is decided by the time
required to reconfigure IMs and CIMs and memory speed (of
central queues or CBs). Cell length may be selected such
that cell transmission time is equal to or greater than the
largest of the switch configuration or memory response times.
Additionally, the cell length can be increased if the average
Internet packet is longer than the configuration time to reduce
segmentation/reassembly processing [2].
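As a rough illustration of the selection rule above, the following sketch computes the smallest cell size whose transmission time covers the larger of the configuration time and the memory response time. The function name, the units (Gb/s and ns), and the numeric values are hypothetical and chosen only for the example; they are not taken from the LBC design.

```python
def min_cell_bytes(line_rate_gbps, config_time_ns, memory_time_ns):
    """Smallest cell size (in bytes) whose transmission time is at least
    max(configuration time, memory response time).

    1 Gb/s times 1 ns equals 1 bit, so integer arithmetic is sufficient.
    """
    slot_ns = max(config_time_ns, memory_time_ns)
    bits = line_rate_gbps * slot_ns
    return -(-bits // 8)  # ceiling division to whole bytes


# Hypothetical values: 10 Gb/s port, 40 ns reconfiguration, 50 ns memory access.
cell_size = min_cell_bytes(10, 40, 50)
print(cell_size)
```

With these assumed values the memory response time dominates, so it, rather than the configuration time, would set the lower bound on the cell length.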
Based on the design of its switching modules, each stage
of a Clos-network switch can be categorized as either space-
based (S) or memory-based (M), where space switching modules
are bufferless while memory switching modules are buffered.
Space switching refers to the use of a level of parallelism
where multiple cells can be switched at the same time slot by
using multiple connections. Memory switching refers to the
use of memory to store cells when they cannot be forwarded
to the outputs (or next stage). Some of these categories
are SSS (or S3) [3], [4], MSM [5]–[8], MMM [9]–[12],
SMM [13], and SSM [14], [15], among the most popular
ones. Out of those, S3 switches require small amounts of
hardware but their configuration has been proven challenging
as input-to-output path setup must be resolved before cells
are transmitted. On the other hand, inclusion of memory in
modules may relax the configuration complexity. However,
configuration complexity has remained high despite using
memory in every switch module because of internal blocking
and the multiplicity of input-output paths associated with
diverse queuing delays [9], [16]. Specifically, switches with
buffered central or output stages are prone to forwarding
packets out of sequence, making re-sequencing or in-sequence
transmission mechanisms an added feature. Moreover, the
number and size of queues in a module are restricted to the
available on-chip real estate. This restriction plus the adopted
in-sequence measures may exacerbate internal blocking that,
in turn, may lead to performance degradation [11].
Minimizing the complexity of the central module of a Clos-
network switch has been of research interest in recent years.
Hassen et al. proposed a Clos-network switch that combines
different switching stages [17]. In this work, central modules
are replaced with multi-directional networks-on-chip (MDN)
modules. The switch uses a static dispatching scheme from
the input/output modules, for which every input constantly
delivers packets to the same MDN module, and adopts inter-
central-module routing to enable forwarding of the cells to
the final destination. However, this switch may forward cells
to the output ports out of sequence if cells from the same flow
are routed through different paths on the central modules.
Load balancing traffic prior to routing it towards the desti-
nation output is a technique that not only improves switching
performance but also reduces the configuration complexity of
a packet switch when the load-balancing and routing follow a
deterministic schedule [18]. Such a schedule may be obtained
as an application of matrix decomposition [19], [20]. This
technique enables high performance not only on switches but
also on a large number of network applications [21].
A switch that load-balances traffic may need at least two
stages to operate; one for load balancing and the other
for routing cells to their destination outputs [18]. A switch
with such a deterministic and periodic schedule may require
the use of queues between the load-balancing and routing
stages. However, placing such queues and enabling multiple
interconnection paths between an input and an output make
load-balancing switches susceptible to forwarding cells out
of sequence [18]. This issue has been addressed by intro-
ducing either re-sequencing buffers at the output ports [22]
or mechanisms that prevent out-of-sequence forwarding [23],
[24]. However, these approaches are either complex or degrade
switching performance.
Load balancing has been applied to Clos-network switches
[9], [25]. For example, Zhang et al. [25] proposed an SMM
switch which adopts the two-stage load-balanced Birkhoff-von
Neumann switch in each central module but has no input port
buffers. Here, a central module consists of two k×k bufferless
crossbar switches and k buffers in between the crossbars.
The switch performs load balancing at the input module and
the first stage of the load-balanced Birkhoff-von Neumann
switch. Each of the k buffers accommodates up to one cell
to guarantee the transmission of cells in sequence. However,
the distance between modules in a large switch calls for larger
queues, with which this switch would suffer from out-of-
sequence forwarding.
The switches discussed above suffer from either limited
switching performance, high complexity, or out-of-sequence
forwarding. These drawbacks then raise the question: can a
load-balancing Clos-network switch achieve high switching
performance, low configuration complexity, and in-sequence
cell forwarding without resorting to memory speedup?
In this paper, we aim at answering this question by propos-
ing a split-central-buffered Load-Balancing Clos-network
(LBC) switch. The switch has a split central module and
queues in between. The switch employs predetermined and
periodic interconnection patterns to interconnect the inputs and
outputs of the switch modules. The switch load balances the
incoming traffic and switches the cells towards the destination
outputs, both with minimum configuration complexity. The
result is a switch that attains high throughput under admissible
traffic with independent and identical distribution (i.i.d.) and
uses a configuration scheme with O(1) complexity. The switch
also adopts an in-sequence forwarding mechanism at the input
queues to keep cells in sequence despite the presence of buffers
between the split CMs.
Different from the existing switching architectures discussed
above, the LBC switch achieves high performance, configura-
tion simplicity, and in-sequence service, all attained without
memory speedup or central-module expansion.
We analyze the performance of the proposed switch by
modeling the effect of each stage on the traffic passing through
the switch. In addition, we study the performance of the switch
through traffic analysis and computer simulation. We show that
the throughput of the switch approaches 100% under several
admissible traffic models, including traffic with nonuniform
distributions, and demonstrate that the switch forwards cells
to the output ports in sequence. The high performance and
the in-sequence forwarding of packets of the switch are both
achieved without resorting to speedup throughout the switch.
In summary, the contributions of this paper are as fol-
lows: 1) the proposal of a configuration scheme for a split-
central-buffered load-balancing switch such that the attained
throughput is 100% under admissible traffic while having
O(1) scheduling complexity, 2) the proposal of an in-sequence
mechanism for forwarding of cells in sequence throughout
the switch, 3) the presentation of throughput analysis of the
LBC switch for each of the stages that shows that the switch
achieves 100% throughput under i.i.d. admissible traffic, and
4) proof of the in-sequence capability of the proposed in-
sequence forwarding mechanism.
The remainder of this paper is organized as follows: Section
II introduces the LBC switch. Section III analyzes the through-
put performance of the proposed switch. Section IV analyzes
the in-sequence forwarding property of the LBC switch. Sec-
tion V presents a simulation study on the performance of the
proposed switch. Section VI presents our conclusions.
II. SWITCH ARCHITECTURE
The LBC switch has N inputs and N outputs, each denoted
as IP (i, s) and OP (j, d), respectively, where 0 ≤ i, j ≤
k − 1, 0 ≤ s, d ≤ n − 1, and N = nk. Figure 1 shows
the architecture of the LBC switch. This switch has k n×m
IMs and k m×n OMs. Each central module is split into two
modules called central-input and -output modules, denoted as
CIMs and COMs, respectively. The switch has m CIMs and
the same number of COMs. Each CIM and COM is a k × k
switch. In the remainder of this paper, we set n = k = m for
symmetry and cost-effectiveness. The IMs, CIMs, and COMs
are bufferless crossbars while the OMs are buffered ones.
The use of a split central module on this switch enables
preserving staggered symmetry and in-order delivery [26] by
using a pre-determined configuration in the IMs, CIMs and
COMs with a mirror sequence between CIMs and COMs.
The staggered symmetry and in-order delivery refers to the
fact that at time slot t, IP (i, s) connects to COM(r) which
connects to OM(j). Then at the next time slot (t+1), IP (i, s)
connects to COM((r + 1) mod m), which also connects to
OM(j). This property enables the configuration of IMs/CIMs
and COMs to be easily represented with a pre-determined
compound permutation that repeats every k time slots. This
property also ensures that cells experience the same amount
of delay under uniform traffic and enables the incorporation of a simple
in-sequence mechanism. A switch with queues between IMs
and CMs but without a split central module may require more
complex load balancing and routing configurations to achieve
the same objective.
Each input port has N virtual output queues (VOQs),
denoted as VOQ(i, s, j, d), to store cells destined to output
port d at OM(j). The combination of IMs and CIMs forms
a compound stage, called the IM-CIM stage. The COMs and
OMs operate as single stages. There are queues placed
between CIMs and COMs to store cells coming from an IM and
destined to OMs. These central queues may be implemented
as virtual output port queues (VOPQs), as shown in Figure
2(a). Each VOPQ, denoted as VOPQ(r, p, j, d), stores cells
destined for OP(j, d) that arrive through LCIM(r, p). As an
alternative, to reduce the number of VOPQs for a large switch,
we consider the use of virtual output module queues (VOMQs)
instead, as shown in Figure 2(b). A VOMQ, denoted as
VOMQ(r, p, j), stores cells coming from LCIM(r, p) and
destined to any OP at OM(j).
Compared to VOPQs, VOMQs introduce the possibility of
head-of-line (HoL) blocking. However, as we show in Section
II-F, such HoL effect is not a concern when the switch is
loaded with admissible traffic. The remainder of this paper
considers VOMQs, as this option stresses the load-balancing
feature of LBC.
Every CIM has k LCIM ports. Every LCIM(r, p) of a CIM
is connected to one input IC(r, p) of the corresponding COM.
The LCIM includes a set of k VOMQs, one per OM. Each
OP has m crosspoint buffers, each denoted as CB(r, j, d). A
flow control mechanism operates between VOMQs and VOQs,
and between CBs and VOMQs, to avoid buffer overflow; this
mechanism is described in Section II-E. The VOMQs are off-chip.
The switch has N LCIMs, and therefore N sets of k VOMQs
each. Table I lists the notations used in the description of the
LBC switch.
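The queue counts stated above can be tallied with a short sketch, which also shows why VOMQs reduce the number of central queues relative to VOPQs. The function and dictionary keys are illustrative names only; the counts follow the per-port figures given in the text (N VOQs per IP, k LCIM ports per CIM, N VOPQs or k VOMQs per LCIM, and m CBs per OP).

```python
def queue_counts(n, k, m):
    """Tally the queues of an LBC switch with k IMs/OMs of n ports each
    and m CIMs/COMs, using the per-port counts given in the text."""
    num_ports = n * k          # N = nk input (and output) ports
    num_lcims = m * k          # each of the m CIMs has k LCIM ports
    return {
        "voqs_per_ip": num_ports,
        "lcims": num_lcims,
        "vopqs_total": num_lcims * num_ports,  # one VOPQ per OP at each LCIM
        "vomqs_total": num_lcims * k,          # one VOMQ per OM at each LCIM
        "cbs_per_op": m,
    }


# The 9 x 9 example switch, with n = k = m = 3.
counts = queue_counts(3, 3, 3)
print(counts)
```

For n = k = m, the number of central queues drops from N^2 with VOPQs to Nk with VOMQs, which is the reduction that motivates the VOMQ option for large switches.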
The following is a walk-through description of how the
switch operates: After arriving at the IP, a cell is placed at
the VOQ corresponding to its destination OP. The IP arbiter
selects a VOQ to be served in a round-robin manner. When
a VOQ is selected, the HoL cell is forwarded to a VOMQ at
the LCIM identified by the current configuration of the IM
and CIM. The VOMQ is the one associated with the OM that
includes the destination OP of the cell. When the configuration
of the COM permits forwarding to the destination OM, the
cell is forwarded to the OM and stored at the crosspoint buffer
(CB) allocated for cells from the source COM. The OP arbiter
selects CBs in a round-robin manner. Upon selection
of a CB, the HoL cell is forwarded from the CB to the OP.
TABLE I
NOTATIONS USED IN THE DESCRIPTION OF THE LBC SWITCH
Term               Description
N                  Number of input/output ports.
n                  Number of input/output ports for each IM and OM.
m                  Number of CIMs and COMs.
k                  Number of IMs and OMs, where k = N/n.
IP(i, s)           Input port s of IM(i), where 0 ≤ i ≤ k − 1, 0 ≤ s ≤ n − 1.
IM(i)              Input module i.
OM(j)              Output module j, where 0 ≤ j ≤ k − 1.
CIM(r)             Central Input Module r, where 0 ≤ r ≤ m − 1.
COM(r)             Central Output Module r.
VOQ(i, s, j, d)    VOQ at IP(i, s) that stores cells destined to OP(j, d), where 0 ≤ d ≤ n − 1.
LIM(i, r)          Output link of IM(i) connected to CIM(r).
LCIM(r, p)         Output port p of CIM(r), where 0 ≤ p ≤ k − 1.
IC(r, p)           Input port p of COM(r).
LCOM(r, j)         Output link of COM(r) connected to OM(j).
VOMQ(r, p, j)      VOMQ at output of CIMs that stores cells destined to OM(j).
VOPQ(r, p, j, d)   VOPQ at output of CIMs that stores cells destined to OP(j, d).
CB(r, j, d)        Crosspoint buffer at OM(j) that stores cells going through COM(r) and destined to OP(j, d).
OP(j, d)           Output port d at OM(j).
A. Module Configuration
The IMs and CIMs in the IM-CIM stage are configured
based on a pre-determined sequence of disjoint permutations,
applying one permutation every time slot. We call a permu-
tation disjoint from the set of permutations if an input-output
pair is interconnected in one and only one of the permutations.
This pre-determined sequence of permutations repeats every
k time slots. Cells at the inputs of IMs are forwarded to the
outputs of the CIMs determined by the configuration of that
time slot. A cell is then stored in the VOMQ corresponding
to its destination OM.
The COMs follow a configuration similar to that of the
CIMs, but in a mirror (i.e., reverse order) sequence. The HoL
cell at the VOMQ destined to OM(j) is forwarded to its
destination when the input of the COM is connected to the
input of the destination OM(j). Else, the HoL cell waits until
the required configuration takes place. The forwarded cell is
queued at the CB of its destination OP once it arrives in the
OM. At the OP, a CB (i.e., HoL cell of that queue) is selected
from all non-empty CBs by an output arbitration scheme.
[Figure 1: a cell traverses IP(i, s) with VOQ(i, s, j, d), IM(i), LIM(i, r), CIM(r), LCIM(r, p) with VOMQ(r, p, j) or VOPQ(r, p, j, d), IC(r, p), COM(r), LCOM(r, j), and OM(j) with CB(r, j, d) to reach OP(j, d).]
Fig. 1. Split-central buffered load-balancing Clos-network (LBC) switch.
[Figure 2: (a) VOPQs, from VOPQ(r, p, 0, 0) to VOPQ(r, p, k−1, n−1); (b) VOMQs, from VOMQ(r, p, 0) to VOMQ(r, p, k−1).]
Fig. 2. Split-central buffers with: (a) VOPQs and (b) VOMQs.
The specific configurations of the bufferless modules, IM,
CIM, and COM, are as follows.
At time slot t, IM(i) is configured to interconnect input
IP(i, s) to LIM(i, r), with:
r = (s + t) mod m    (1)
Similarly, CIM input LIM(i, r) is interconnected to CIM
output LCIM(r, p) at time slot t with:
p = (i + t) mod k    (2)
The configuration of COMs is similar to that of IMs, but
in a reverse sequence. At time slot t, COM input IC(r, p) is
interconnected to output LCOM(r, j) with:
j = (p − t) mod k ¹    (3)
Round-robin could also be used to select VOMQs and config-
ure COMs. OM buffers allow forwarding a cell from a VOMQ
to the destination output without requiring port matching [14].
Figure 3 shows an example of the configuration of a 9× 9
LBC switch. As k = 3, the example shows the configuration
of three consecutive time slots, after which the configuration
pattern repeats. Because similar connections are set for all the
IMs and CIMs and a different connection pattern is set for all
COMs at each time slot, Table II describes the configuration
on the figure for IM(0), CIM(0), and COM(0) at each time
slot. In this example, we use → to denote an interconnection.
¹ For a < 0, a mod k denotes a + (a multiple of k) such that the result lies in [0, k − 1] (e.g., −2 mod 5 = 3).
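The configuration rules in (1)-(3) can be checked with a minimal sketch that regenerates the interconnections of IM(0), CIM(0), and COM(0) for a 9 × 9 switch. The function names and the returned data layout are illustrative choices, not part of the switch specification.

```python
def im_link(s, t, m):
    """Eq. (1): IM(i) connects IP(i, s) to LIM(i, r), with r = (s + t) mod m."""
    return (s + t) % m


def cim_port(i, t, k):
    """Eq. (2): CIM(r) connects LIM(i, r) to LCIM(r, p), with p = (i + t) mod k."""
    return (i + t) % k


def com_link(p, t, k):
    """Eq. (3): COM(r) connects IC(r, p) to LCOM(r, j), with j = (p - t) mod k.

    Python's % yields a non-negative result for a positive modulus,
    so negative values of p - t need no special handling.
    """
    return (p - t) % k


def table_rows(n, k, m):
    """Interconnections of IM(0), CIM(0), and COM(0) for one period of k slots."""
    rows = {}
    for t in range(k):
        rows[t] = {
            "IM(0)": [(s, im_link(s, t, m)) for s in range(n)],    # s -> r
            "CIM(0)": [(i, cim_port(i, t, k)) for i in range(k)],  # i -> p
            "COM(0)": [(p, com_link(p, t, k)) for p in range(k)],  # p -> j
        }
    return rows


rows = table_rows(n=3, k=3, m=3)
for t, row in rows.items():
    print(t, row)
```

Each list is a permutation of the module's ports, and the COM pattern steps in the opposite direction to the IM and CIM patterns, which is the mirror sequence described above.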
B. Arbitration at Output Ports
An output port arbiter selects a HoL cell from the crosspoint
buffers in a round-robin fashion. Because there is one cell
from each flow at these buffers, out-of-sequence forwarding
is not a concern at this stage. We discuss this case in Section
IV. Here, a flow is the set of cells from IP (i, s) destined to
OP (j, d). The round-robin schedule ensures fair service for
different flows.
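A round-robin selection among the crosspoint buffers of an output port could be sketched as follows. The class name and the pointer-update rule (advance to one position past the last served buffer) are assumptions made for illustration; the text only states that the selection is round-robin.

```python
class RoundRobinArbiter:
    """Selects one non-empty queue per time slot in round-robin order."""

    def __init__(self, num_queues):
        self.num_queues = num_queues
        self.pointer = 0  # queue with the highest priority in the next slot

    def select(self, occupancy):
        """Return the index of the selected queue, or None if all are empty.

        occupancy[q] is the number of cells waiting in queue q.
        """
        for offset in range(self.num_queues):
            q = (self.pointer + offset) % self.num_queues
            if occupancy[q] > 0:
                # Assumed policy: the served queue gets lowest priority next.
                self.pointer = (q + 1) % self.num_queues
                return q
        return None


# An OP with m = 3 crosspoint buffers; CB 1 is empty.
arbiter = RoundRobinArbiter(3)
served = [arbiter.select([1, 0, 2]) for _ in range(3)]
print(served)
```

Under this policy a non-empty buffer is served at least once every m selections, which corresponds to the worst-case service rate of 1/k (with m = k) used in Section II-F.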
C. In-sequence Cell Forwarding Mechanism
The proposed in-sequence forwarding mechanism for the
LBC switch is based on holding cells of a flow at the VOQs
so that no younger cell is forwarded from VOMQs to OPs
before any given cell of the same flow. The policy used for
holding cells at an IP is as follows: No cell of flow y at the IP is
forwarded to a VOMQ for δk time slots after cell τ of the same
flow has been forwarded to a VOMQ, whose occupancy is δ
cells at the time of arrival in the VOMQ. For a cell that arrives
at an empty VOMQ, δ = 0. The flow control mechanism keeps
IPs informed about VOMQ occupancy as discussed in Section
II-E.
TABLE II
EXAMPLE OF CONFIGURATION OF MODULES IN A 9 × 9 LBC SWITCH.
Time slot   IM(0)                  CIM(0)                   COM(0)
t = 0       IP(0, 0) → LIM(0, 0)   LIM(0, 0) → LCIM(0, 0)   IC(0, 0) → LCOM(0, 0)
            IP(0, 1) → LIM(0, 1)   LIM(1, 0) → LCIM(0, 1)   IC(0, 1) → LCOM(0, 1)
            IP(0, 2) → LIM(0, 2)   LIM(2, 0) → LCIM(0, 2)   IC(0, 2) → LCOM(0, 2)
t = 1       IP(0, 0) → LIM(0, 1)   LIM(0, 0) → LCIM(0, 1)   IC(0, 0) → LCOM(0, 2)
            IP(0, 1) → LIM(0, 2)   LIM(1, 0) → LCIM(0, 2)   IC(0, 1) → LCOM(0, 0)
            IP(0, 2) → LIM(0, 0)   LIM(2, 0) → LCIM(0, 0)   IC(0, 2) → LCOM(0, 1)
t = 2       IP(0, 0) → LIM(0, 2)   LIM(0, 0) → LCIM(0, 2)   IC(0, 0) → LCOM(0, 1)
            IP(0, 1) → LIM(0, 0)   LIM(1, 0) → LCIM(0, 0)   IC(0, 1) → LCOM(0, 2)
            IP(0, 2) → LIM(0, 1)   LIM(2, 0) → LCIM(0, 1)   IC(0, 2) → LCOM(0, 0)
[Figure 3: panels (a) Time slot 0, (b) Time slot 1, and (c) Time slot 2 show the interconnections among IM(0) to IM(2), CIM(0) to CIM(2), COM(0) to COM(2), and OM(0) to OM(2) of a 9 × 9 LBC switch.]
Fig. 3. Configuration example of LBC switch modules.
Figure 4 shows an example of this forwarding mechanism
for flow A. Cells from flow A are denoted as At, where t is
the cell arrival time. In this example, cells arrive at time slots
1, 2, 4, and 5, and they are denoted as A1, A2, A4, and A5,
respectively. VOMQ(k) denotes the kth VOMQ to which cells
are forwarded. Here, the “X” mark indicates that the buffer at
VOMQ(k) is occupied by cells from other flows. Assuming
k = 3 and no other cell arrival or departure during this time
period, A1 is the first cell of the flow with arrival time t = 1
and is sent to VOMQ(1) at time slot t = 2. Because VOMQ(1)
has no backlogged cells before A1, there is no waiting time for
A2. Therefore, A2 is sent to VOMQ(2) at t = 3. A2 finds three
cells already queued, so no cell from this flow is forwarded
for 3 × 3 = 9 time slots, that is, from t = 4 to t = 12.
After that, A4 is sent to VOMQ(3) at t = 13. This cell finds
no other cell, so A5 is sent to VOMQ(1) at t = 14.
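The forwarding times in this example can be reproduced with a small sketch of the hold-down policy. It assumes, as the example suggests, that a cell is forwarded no earlier than one time slot after its arrival and that the occupancy δ found by each cell is given as input; the function and parameter names are illustrative.

```python
def forwarding_times(arrivals, occupancy_found, k):
    """Time slots at which the cells of one flow leave the VOQ.

    arrivals[x]        -- arrival time slot of the x-th cell of the flow
    occupancy_found[x] -- delta, the VOMQ occupancy that cell finds on arrival
    k                  -- period of the configuration pattern, in time slots
    """
    send_times = []
    next_allowed = 0
    for arrival, delta in zip(arrivals, occupancy_found):
        # Assumption: a cell is eligible one slot after it arrives.
        t_send = max(arrival + 1, next_allowed)
        send_times.append(t_send)
        # Hold the flow for delta * k slots after this cell is forwarded.
        next_allowed = t_send + 1 + delta * k
    return send_times


# Cells A1, A2, A4, A5; only A2 finds a non-empty VOMQ (three cells).
times = forwarding_times([1, 2, 4, 5], [0, 3, 0, 0], k=3)
print(times)
```

The hold of 9 time slots after A2 is what keeps A4 from overtaking it: the three cells queued ahead of A2 each need up to k time slots before the COM configuration allows them to leave.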
D. Implementation of In-sequence Mechanism
Each IP has an input port counter (IPC) for each VOMQ
to which it connects. IPCs keep track of the number of cells
at these VOMQs. Each IP also has a hold-down timer for
each VOQ. The timer is used by the in-sequence forwarding
mechanism. The timer is triggered by the IPC count of the
VOMQ where the last cell was forwarded. When a cell is
forwarded from a VOQ to a VOMQ, and the IPC is updated
to σ, this update sets the hold-down timer for that VOQ to
(σ − 1)k time slots, where δ = σ − 1.
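[Figure 4: cells A1, A2, A4, and A5 of flow A, ordered by arrival time, are forwarded to VOMQ(1) at t = 2, VOMQ(2) at t = 3, VOMQ(3) at t = 13, and VOMQ(1) at t = 14; VOMQ(2) holds three cells marked X, where X is a cell of a flow different from flow A.]
Fig. 4. Example of the operation of the proposed in-sequence forwarding
mechanism.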
E. Flow Control
There is a flow control mechanism between VOMQs and
IPs and another between CBs and VOMQs that extends to IPs.
There are fixed connections between each VOMQ and its k
corresponding IPs and between each CB and its corresponding
k ICs. Each IP has mk = N occupancy counters, IPCs, one
per VOMQ. Each VOMQ updates the corresponding k IPCs
about its occupancy. A VOMQ uses two thresholds for flow
control: pause ($T_{pv}$) and resume ($T_{rv}$), where $T_{pv} > T_{rv}$, in
number of cells. When the occupancy of a VOMQ, |VOMQ|,
is larger than $T_{pv}$, the VOMQ signals the corresponding IPs
to pause sending cells to it. When |VOMQ| < $T_{rv}$, the
VOMQ signals the corresponding IPs to resume sending cells
to it. Here, $T_{pv}$ is such that $C_{VOMQ} - T_{pv} \geq D_v$, where
$C_{VOMQ}$ is the size of the VOMQ and $D_v$ is the flow-control
information delay.
Similar to VOMQs, CBs use two thresholds: pause ($T_{pc}$)
and resume ($T_{rc}$), where $T_{pc} > T_{rc}$, and $T_{pc}$ is such that
$C_{CB} - T_{pc} \geq D_c$, for a CB size, $C_{CB}$, and flow-control
information delay between a CB and corresponding IPs, $D_c$.
These CB thresholds work in a similar way as for VOMQs.
Different from IPs, VOMQs have a binary flag to pause/resume
forwarding of cells to CBs. When the occupancy of a CB,
|CB|, becomes larger than $T_{pc}$, the CB informs the corre-
sponding VOMQs, and in turn VOMQs inform corresponding
IPs to pause forwarding cells to the VOMQ for the congested
OP. With IPs paused for traffic to a CB, traffic already at
VOMQs can still be forwarded to CBs as long as |CB| is
such that $T_{pc} < |CB| < C_{CB}$. When |CB| < $T_{rc}$, the CB
signals the corresponding VOMQs to resume forwarding, and
in turn, VOMQs signal source IPs to resume forwarding cells
for that destination OP.
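The pause/resume behavior of a queue with two thresholds can be sketched as a small state machine. The class and attribute names are hypothetical, and the sketch assumes that the flow-control delay is expressed in cells (one cell in flight per time slot of delay), so that the headroom condition of capacity minus pause threshold can be checked directly.

```python
class ThresholdFlowControl:
    """Pause/resume signaling with hysteresis for a VOMQ or a CB."""

    def __init__(self, capacity, pause_threshold, resume_threshold, delay_cells):
        if not resume_threshold < pause_threshold:
            raise ValueError("pause threshold must exceed resume threshold")
        if capacity - pause_threshold < delay_cells:
            # Headroom must absorb cells already in flight when pause is sent.
            raise ValueError("insufficient headroom above the pause threshold")
        self.pause_threshold = pause_threshold
        self.resume_threshold = resume_threshold
        self.paused = False

    def update(self, occupancy):
        """Return True while upstream senders must stay paused."""
        if occupancy > self.pause_threshold:
            self.paused = True
        elif occupancy < self.resume_threshold:
            self.paused = False
        # Between the two thresholds the previous state is kept.
        return self.paused


# Hypothetical sizes: 16-cell queue, pause above 12, resume below 4, delay of 3.
fc = ThresholdFlowControl(16, 12, 4, 3)
states = [fc.update(x) for x in (12, 13, 8, 4, 3)]
print(states)
```

Keeping the state unchanged between the two thresholds avoids rapid toggling of the pause signal when the occupancy hovers around a single value.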
F. Avoiding HoL Blocking in LBC with VOMQs
Concerns of HoL blocking, owing to the aggregation at a
VOMQ of traffic going to different OPs at the same OM,
may arise. However, one must note that this HoL blocking
may occur if and only if a CB gets congested. Here, we argue
that the efficient load-balancing mechanism and the use of one
CB for each COM at an OP avoid congestion of CBs even in
the presence of heavy (but admissible) traffic. We also show
that CB occupancy does not build up. Let us consider the input
traffic matrix, $R_1$, with input load, $\lambda_{i,s,j,d}$, which gets load-
balanced to CIMs at a rate of $\frac{1}{m}$. The aggregate traffic arrival
rate at an LCIM from all IMs, $R_{LCIM}$, is:
$$R_{LCIM} = \frac{1}{m} \sum_{i=0}^{k} \lambda_{i,s,j,d} \tag{4}$$
Therefore, the traffic arrival rate to a CB from COMs, $R_{CB}$,
is:
$$R_{CB} = \frac{1}{mk} \sum^{k} \sum_{i=0}^{k} \lambda_{i,s,j,d} \tag{5}$$
To test the growth of CBs, we consider three stressing traffic
scenarios: a) All IPs in the switch have traffic only for OPs
in an OM; b) all IPs in an IM forward traffic to all OPs in an
OM; and c) a single flow, with a large rate, going from an IP
to a single OP.
Then, for a), the largest admissible arrival rate at IPs is:
$$\lambda_{i,s,j,d} = \frac{1}{N} \tag{6}$$
Substituting (6) into (5) and m = n = k yields:
$$R_{CB} = \frac{1}{k^2} \sum^{k} \sum_{i=0}^{k} \frac{1}{N} = \frac{1}{N} = \frac{1}{k^2} \tag{7}$$
Because round-robin is used as the selection policy at an OP, the
service rate, $S_{CB}$, of a CB would be:
$$\frac{1}{k} \leq S_{CB} \leq 1$$
In the worst-case scenario:
$$S_{CB} = \frac{1}{k} \tag{8}$$
Therefore, CB occupancy does not grow because
$S_{CB} > R_{CB}$.
For b), the arrival rate at IPs for admissibility is:
$$\lambda_{i,s,j,d} = \frac{1}{k} \tag{9}$$
Substituting (9) into (5) yields:
$$R_{CB} = \frac{1}{m} \frac{1}{k} \sum^{k} \sum_{i=0}^{k} \frac{1}{k} = \frac{1}{k} \tag{10}$$
The service rate would be the same as in (8). Therefore, the
CB would not become congested as $R_{CB} = S_{CB}$.
For c), the arrival rate at the IP is:
$$\lambda_{i,s,j,d} = 1 \tag{11}$$
The traffic arrival rate to an LCIM is:
$$R_{LCIM} = \frac{1}{m} \lambda_{i,s,j,d} = \frac{1}{m} \tag{12}$$
The traffic arrival rate to a CB from COMs is:
$$R_{CB} = \frac{1}{m} \frac{1}{k} \sum^{k} \lambda_{i,s,j,d} = \frac{1}{m} = \frac{1}{k} \tag{13}$$
Therefore, the CB would not become congested since $R_{CB} \leq S_{CB}$
for admissible traffic.
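The three cases can be compared numerically with a short sketch that evaluates the closed forms in (7), (10), and (13) and checks them against the worst-case service rate in (8). It assumes m = n = k and that every summation contributes exactly k terms, which is what the stated results require; the function name and the case labels are illustrative.

```python
from fractions import Fraction


def cb_arrival_rates(k):
    """Arrival rate at a CB for scenarios a), b), and c), with m = n = k."""
    m = k
    num_ports = k * k                               # N = nk
    scale = Fraction(1, m * k)                      # factor 1/(mk) in Eq. (5)
    return {
        "a": scale * (k * k) * Fraction(1, num_ports),  # Eq. (7)
        "b": scale * (k * k) * Fraction(1, k),          # Eq. (10)
        "c": scale * k * 1,                             # Eq. (13)
    }


def worst_case_service_rate(k):
    """Eq. (8): a CB is served at least once every k time slots."""
    return Fraction(1, k)


rates = cb_arrival_rates(3)
service = worst_case_service_rate(3)
for case, rate in rates.items():
    print(case, rate, rate <= service)
```

For k = 3 the arrival rate is below the worst-case service rate in case a) and equal to it in cases b) and c), so in none of the scenarios does the arrival rate exceed the rate at which a CB is guaranteed to be served.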
III. THROUGHPUT ANALYSIS
In this section, we analyze the performance of the proposed
LBC switch. Let us denote the traffic coming to the IM-CIM
stage, the COM stage, the OMs, OPs, and the traffic leaving
LBC as R1, R2, R3, R4, and R5, respectively. Figure 1
shows these analysis points. Here, R1, R2, and R3 are N×N
matrices, R4 comprises N m × 1 column vectors, and R5
comprises N scalars.
The traffic from input ports to the IM-CIM stage, $R_1$, is
defined as:
$$R_1 = [\lambda_{u,v}] \tag{14}$$
Here, $\lambda_{u,v}$ is the arrival rate of traffic from input u to output
v, where
$$u = ik + s \tag{15}$$
$$v = jm + d \tag{16}$$
and 0 ≤ u, v ≤ N − 1.
In the following analysis, we consider admissible traffic,
which is defined as:
$$\sum_{u=0}^{N-1} \lambda_{u,v} \leq 1, \quad \sum_{v=0}^{N-1} \lambda_{u,v} \leq 1 \tag{17}$$
under i.i.d. traffic conditions.
The IM-CIM stage of the LBC switch balances the traffic
load coming from the input ports to the VOMQs. Specifically,
the permutations used to configure the IMs and CIMs inter-
connect the traffic from an input to k different CIMs, and then
to the VOMQs connected to these CIMs.
$R_2$ is the traffic directed towards COMs and it is de-
rived from $R_1$ and the permutations of IMs and CIMs. The
configuration of the combined IM-CIM stage at time slot t
that connects IP(i, s) to LCIM(r, p) is represented as an
N × N permutation matrix, $\Pi(t) = [\pi_{u,\upsilon}]$, where r and p are
determined from (1) and (2) and the matrix element is:
$$\pi_{u,\upsilon} = \begin{cases} 1 & \text{for any } u,\ \upsilon = rk + p \\ 0 & \text{elsewhere.} \end{cases}$$
The configuration of the compound IM-CIM stage can be
represented as a compound permutation matrix, P1, which is
the sum of the IM-CIM permutations over k time slots as
follows,
P1 =
k∑
Π(t) (18)
Because the configuration is repeated every k time slots, the
traffic load from the same input going to each VOMQ is 1/k
of the traffic load of R1. Therefore, a row of R2 is the sum
of the row elements of R1 at the nonzero positions of P1,
normalized by k. That is:
$$R_2 = \frac{1}{k}\left((R_1 * \mathbf{1}) \circ P_1\right) \tag{19}$$
where $\mathbf{1}$ denotes an N × N unit matrix (all elements equal to 1) and $\circ$ denotes
element-wise (position-wise) multiplication.
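The operation in (19) can be illustrated with a short Python sketch. It uses the compound permutation matrix P1 of the 4×4 (k = 2) example given later in this section; the function name `vomq_load` and the numeric rates in `R1` are assumptions made only for illustration.

```python
def vomq_load(r1, p1, k):
    """Evaluate (19): R2 = (1/k) * ((R1 * 1) o P1) for square list-of-lists inputs."""
    n = len(r1)
    # R1 * 1 (1 = all-ones matrix) places the row sum of R1 in every column of that row.
    row_sums = [sum(row) for row in r1]
    # The element-wise product with P1 keeps only the positions that are nonzero in P1.
    return [[row_sums[u] * p1[u][c] / k for c in range(n)] for u in range(n)]


# P1 of the 4 x 4, k = 2 example; the rates in R1 are assumed values.
P1 = [[1, 0, 0, 1],
      [0, 1, 1, 0],
      [0, 1, 1, 0],
      [1, 0, 0, 1]]
R1 = [[0.1, 0.2, 0.0, 0.3],
      [0.2, 0.1, 0.1, 0.0],
      [0.0, 0.3, 0.2, 0.1],
      [0.3, 0.0, 0.1, 0.2]]

R2 = vomq_load(R1, P1, k=2)
for row in R2:
    print([round(x, 3) for x in row])
```

Each row of the result carries the total load of one input, split evenly over the k nonzero positions of P1, so the row sums of R2 and R1 coincide.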
There are k non-zero elements in each row of R2. Here,
R2 is the aggregate traffic in all the VOMQs destined to all
OPs. This matrix can be further decomposed into k N × N
submatrices, R2(j), each of which is the aggregate traffic at
VOMQs designated for OM(j).
$$R_2 = \sum_{j=0}^{k-1} R_2(j) \tag{20}$$
where j is obtained from (16) ∀ d. The configuration of the
COM stage at time slot t that connects IC(r, p) to LCOM(r, j)
can be represented as an N × N permutation matrix, $\Phi(t) = [\phi_{u,v}]$,
with matrix elements:
$$\phi_{u,v} = \begin{cases} 1 & \text{for any } u,\ v = jk + r \\ 0 & \text{elsewhere.} \end{cases} \tag{21}$$
Similarly, the switching at the COM stage is represented by
a compound permutation matrix P2, which is the sum of k
permutations of the COM stage over k time slots. Here,
$$P_2 = \sum^{k} \Phi(t) \tag{22}$$
The output traffic of COMs going to different OMs is
described by matrix R3(j), which is defined as
R3(j) = R2(j) ◦P2 (23)
where j is obtained from (16) ∀ d. The traffic destined to
OP (j, d) at OM(j), R3(j, d), is obtained by extracting the
traffic elements from R3(j), or:
$$R_3(j) = \sum_{d=0}^{k-1} R_3(j, d) \tag{24}$$
where d is obtained from (16) for the different j.
Ds is an m × N matrix, built by concatenating N k × 1
vectors of all ones, $\vec{1}$, as:
$$D_s = [\vec{1}, \cdots, \vec{1}] \tag{25}$$
$\vec{A}$ is a 1 × k row vector, built by setting the first element to 1
and every other element to 0, or:
$$\vec{A} = [1 \cdots 0] \tag{26}$$
$\vec{A}_s$ is an N × 1 column vector, built by concatenating k copies of $\vec{A}$ and
taking the transpose, or:
$$\vec{A}_s = [\vec{A}_{s_1}, \cdots, \vec{A}_{s_k}]^T \tag{27}$$
where $\vec{A}_{s_1} = \vec{A}_{s_k} = \vec{A}$, such that
$$\vec{A}_s = [\vec{A}, \cdots, \vec{A}]^T \tag{28}$$
The traffic queued at the CB of an OP, R4(v), is the multiplication
of Ds, R3(j, d), and $\vec{A}_s$, or:
$$R_4(v) = D_s * R_3(j, d) * \vec{A}_s \tag{29}$$
The traffic leaving an OP, R5(v), is:
$$R_5(v) = (\vec{1})^T * R_4(v) \tag{30}$$
Therefore, R5(v) is the sum of the traffic leaving OP (v).
Equations (19), (29), and (30) show that the admissibility
conditions in (17) are satisfied by the traffic at the VOMQs,
CBs, and OPs. Because R2, R4(v), and R5(v) meet the admissibility
conditions in (17), the sum of the
traffic load at each VOMQ, CB, and OP does not exceed
its respective capacity. From (29), we can deduce that R4
is equal to the input traffic R1, or:
R4(v) = R1(v) ∀ v (31)
From the admissibility of R2, R4(v), R5(v) and (31), we
can infer that the input traffic is successfully forwarded to the
output ports.
As discussed in Section II-B, the output arbiter selects a
flow in a round-robin fashion and if no cell of a flow is
selected, the OP arbiter moves to the next flow. This implies
the queues are work conserving which ensures fairness and
that cells forwarded to OPs are successfully forwarded out of
OPs. Hence, from (30), we can infer that R5(v) is equal to
R4(v), or:
$$R_5(v) = (\vec{1})^T * R_4(v) \quad \forall\, v \tag{32}$$
From (31) and (32), we can conclude that LBC successfully
forwards all traffic at IPs out of OPs.
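The following example shows the different traffic matrices
for a 4×4 (k = 2) LBC switch. Let the input traffic matrix be
$$R_1 = \begin{bmatrix} \lambda_{0,0} & \lambda_{0,1} & \lambda_{0,2} & \lambda_{0,3} \\ \lambda_{1,0} & \lambda_{1,1} & \lambda_{1,2} & \lambda_{1,3} \\ \lambda_{2,0} & \lambda_{2,1} & \lambda_{2,2} & \lambda_{2,3} \\ \lambda_{3,0} & \lambda_{3,1} & \lambda_{3,2} & \lambda_{3,3} \end{bmatrix}$$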
First, R1 is decomposed into R2 at the IM-CIM stage.
From (18), the compound permutation matrix for the IM-CIM
stage for this switch is:
$$P_1 = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}$$
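Using (19), we get:
$$R_2 = \frac{1}{2} \begin{bmatrix} \sum_{i=0}^{3} \lambda_{0,i} & 0 & 0 & \sum_{i=0}^{3} \lambda_{0,i} \\ 0 & \sum_{i=0}^{3} \lambda_{1,i} & \sum_{i=0}^{3} \lambda_{1,i} & 0 \\ 0 & \sum_{i=0}^{3} \lambda_{2,i} & \sum_{i=0}^{3} \lambda_{2,i} & 0 \\ \sum_{i=0}^{3} \lambda_{3,i} & 0 & 0 & \sum_{i=0}^{3} \lambda_{3,i} \end{bmatrix}$$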
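From (20), the traffic matrices at VOMQs destined for the
different OMs are:
$$R_2(0) = \frac{1}{2} \begin{bmatrix} \lambda_{0,0} + \lambda_{0,1} & 0 & 0 & \lambda_{0,0} + \lambda_{0,1} \\ 0 & \lambda_{1,0} + \lambda_{1,1} & \lambda_{1,0} + \lambda_{1,1} & 0 \\ 0 & \lambda_{2,0} + \lambda_{2,1} & \lambda_{2,0} + \lambda_{2,1} & 0 \\ \lambda_{3,0} + \lambda_{3,1} & 0 & 0 & \lambda_{3,0} + \lambda_{3,1} \end{bmatrix}$$
$$R_2(1) = \frac{1}{2} \begin{bmatrix} \lambda_{0,2} + \lambda_{0,3} & 0 & 0 & \lambda_{0,2} + \lambda_{0,3} \\ 0 & \lambda_{1,2} + \lambda_{1,3} & \lambda_{1,2} + \lambda_{1,3} & 0 \\ 0 & \lambda_{2,2} + \lambda_{2,3} & \lambda_{2,2} + \lambda_{2,3} & 0 \\ \lambda_{3,2} + \lambda_{3,3} & 0 & 0 & \lambda_{3,2} + \lambda_{3,3} \end{bmatrix}$$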
The rows of R2(j) represent the traffic from IPs, and the
columns represent VOMQ(r, p, j) at IC(r, p). From (22), the
compound permutation matrix for the COM stage for this
switch is:
$$P_2 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 \end{bmatrix}$$
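From (23) and (24), the traffic forwarded to an OP is:
$$R_3(0, 0) = \frac{1}{2} \begin{bmatrix} \lambda_{0,0} & 0 & 0 & \lambda_{0,0} \\ 0 & \lambda_{1,0} & \lambda_{1,0} & 0 \\ 0 & \lambda_{2,0} & \lambda_{2,0} & 0 \\ \lambda_{3,0} & 0 & 0 & \lambda_{3,0} \end{bmatrix}$$
$$R_3(0, 1) = \frac{1}{2} \begin{bmatrix} \lambda_{0,1} & 0 & 0 & \lambda_{0,1} \\ 0 & \lambda_{1,1} & \lambda_{1,1} & 0 \\ 0 & \lambda_{2,1} & \lambda_{2,1} & 0 \\ \lambda_{3,1} & 0 & 0 & \lambda_{3,1} \end{bmatrix}$$
$$R_3(1, 0) = \frac{1}{2} \begin{bmatrix} \lambda_{0,2} & 0 & 0 & \lambda_{0,2} \\ 0 & \lambda_{1,2} & \lambda_{1,2} & 0 \\ 0 & \lambda_{2,2} & \lambda_{2,2} & 0 \\ \lambda_{3,2} & 0 & 0 & \lambda_{3,2} \end{bmatrix}$$
$$R_3(1, 1) = \frac{1}{2} \begin{bmatrix} \lambda_{0,3} & 0 & 0 & \lambda_{0,3} \\ 0 & \lambda_{1,3} & \lambda_{1,3} & 0 \\ 0 & \lambda_{2,3} & \lambda_{2,3} & 0 \\ \lambda_{3,3} & 0 & 0 & \lambda_{3,3} \end{bmatrix}$$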
The rows of R3(j, d) represent the traffic from
VOMQ(r, p, j) at IC(r, p) and the columns represent
LCOM (r, j). Ds and $\vec{A}_s$ are obtained from (25) and (28),
respectively, as:
$$D_s = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}$$
$$\vec{A}_s = [1\ 0\ 1\ 0]^T$$
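The traffic forwarded from CBs to the corresponding OP is
obtained from (29):
$$R_4(0) = \frac{1}{2} \begin{bmatrix} \sum_{i=0}^{3} \lambda_{i,0} \\ \sum_{i=0}^{3} \lambda_{i,0} \end{bmatrix}$$
$$R_4(1) = \frac{1}{2} \begin{bmatrix} \sum_{i=0}^{3} \lambda_{i,1} \\ \sum_{i=0}^{3} \lambda_{i,1} \end{bmatrix}$$
$$R_4(2) = \frac{1}{2} \begin{bmatrix} \sum_{i=0}^{3} \lambda_{i,2} \\ \sum_{i=0}^{3} \lambda_{i,2} \end{bmatrix}$$
$$R_4(3) = \frac{1}{2} \begin{bmatrix} \sum_{i=0}^{3} \lambda_{i,3} \\ \sum_{i=0}^{3} \lambda_{i,3} \end{bmatrix}$$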
The rows of R4(v) represent the traffic from COM(r).
Using (30), we obtain the sum of the traffic leaving the OP, or:
$$R_5(0) = \sum_{i=0}^{3} \lambda_{i,0}$$
$$R_5(1) = \sum_{i=0}^{3} \lambda_{i,1}$$
$$R_5(2) = \sum_{i=0}^{3} \lambda_{i,2}$$
$$R_5(3) = \sum_{i=0}^{3} \lambda_{i,3}$$
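The last steps of the 4×4 example can be reproduced numerically. The Python sketch below builds R3(j, d) as it is written in the example (λu,v/k at the nonzero positions of P1), applies (29) and (30), and compares R5(v) with the column sums of R1. The numeric rates and all function names are assumptions chosen for illustration only.

```python
K = M = 2          # k and m of the 4 x 4 example
N = K * M

# Illustrative admissible rates lambda[u][v] (assumed values).
R1 = [[0.1, 0.2, 0.0, 0.3],
      [0.2, 0.1, 0.1, 0.0],
      [0.0, 0.3, 0.2, 0.1],
      [0.3, 0.0, 0.1, 0.2]]
P1 = [[1, 0, 0, 1],
      [0, 1, 1, 0],
      [0, 1, 1, 0],
      [1, 0, 0, 1]]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def r3(j, d):
    """R3(j, d) as in the example: lambda[u][v] / k at the nonzero positions of P1."""
    v = j * M + d                      # (16)
    return [[R1[u][v] * P1[u][c] / K for c in range(N)] for u in range(N)]


D_S = [[1] * N for _ in range(M)]                     # (25): m x N matrix of ones
A_S = [[1 if c % K == 0 else 0] for c in range(N)]    # (28): k copies of A, transposed


def r4(j, d):
    """(29): an m x 1 column with the traffic queued at the CBs of OP(j, d)."""
    return matmul(matmul(D_S, r3(j, d)), A_S)


def r5(j, d):
    """(30): total traffic leaving OP(j, d)."""
    return sum(row[0] for row in r4(j, d))


for v in range(N):
    j, d = divmod(v, M)
    column_sum = sum(R1[u][v] for u in range(N))
    print(v, round(r5(j, d), 6), round(column_sum, 6))
```

For these assumed rates, the two printed values agree for every output port, which is the numeric counterpart of R5(v) being the sum of the v-th column of R1.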
We use the traffic analysis of this section to demonstrate that
the LBC switch achieves 100% throughput under admissible
traffic. This demonstration is provided in Appendix B.
IV. ANALYSIS OF IN-SEQUENCE SERVICE
In this section, we demonstrate that the LBC switch for-
wards cells in sequence through the proposed in-sequence
forwarding mechanism.
Table III lists the definition of terms used in the discus-
sion of the properties of the proposed LBC switch. Here,
cy,τ (i, s, j, d) denotes the τ th cell of traffic flow y, which
comprises cells going from IP (i, s) to OP (j, d) with arrival
time tx. In addition, tay,τ denotes the arrival time of cy,τ , and
q1y,τ , q2y,τ , and q3y,τ denote the queuing delays experienced
by cy,τ at VOQ(i, s, j, d), VOMQ(r, p, j), and CB(r, j, d),
respectively. The departure times of cy,τ from these queues are
denoted as d1y,τ , d2y,τ , and d3y,τ , respectively. In this paper,
we consider admissible traffic as defined in (17).
Here, we claim that the LBC switch forwards cells in
sequence to the output ports, through the following theorem.
Theorem 1. For any two cells cy,τ (i, s, j, d) and
cy,τ ′(i, s, j, d), where τ < τ ′, cy,τ (i, s, j, d) departs the
destination output port before cy,τ ′(i, s, j, d).
TABLE III
NOTATIONS FOR IN-SEQUENCE ANALYSIS.
Term     Definition
cy,τ     The τ th cell of flow y from IP (i, s) to OP (j, d).
tay,τ    Arrival time of cy,τ in VOQ(i, s, j, d) at IP (i, s).
q1y,τ    Queuing delay of cy,τ at VOQ(i, s, j, d).
d1y,τ    Departure time of cy,τ from VOQ(i, s, j, d) at IP (i, s).
q2y,τ    Queuing delay of cy,τ at VOMQ(r, p, j).
d2y,τ    Departure time of cy,τ from VOMQ(r, p, j) at LCOM (r, j).
q3y,τ    Queuing delay of cy,τ at CB(r, j, d) of OP (j, d).
d3y,τ    Departure time of cy,τ from CB(r, j, d).
The proof of this theorem is divided into the following lemmas.
Lemma 1. For a single flow traversing the LBC switch, any
cell of the flow experiences the same delay. That is, let td be
the delay experienced by a cell. Then, tdy,τ = γ ∀ τ , where
γ is a positive constant.
A constant delay for each cell implies that cells depart the
switch in the same order they arrived under the conditions of
this lemma.
Lemma 2. For any number of flows traversing the LBC switch,
cells from the same flow arrive at the OM in sequence.
Lemma 3. For any number of flows traversing the LBC switch,
the cells of each flow arrive and are cleared at the output port
(OP) in the same order the cells arrived at the input port (IP).
Appendix A presents the proofs of these lemmas.
V. PERFORMANCE ANALYSIS
We evaluated the performance of the LBC switch through
computer simulation under both uniform and nonuniform traf-
fic models. We also compared the performance of the proposed
switch with that of an output-queued (OQ) switch, a high-
performing Memory-Memory-Memory Clos-network (MMM)
switch, and an MMM switch with extended memory (MMeM).
The MMM switch uses forwarding arbitration schemes to
select cells from the buffers in the previous stage modules
and is agnostic to cell sequence, therefore delivering high
switching performance. We considered switches with sizes
N = {64, 256}.
A. Uniform Traffic
We evaluated the LBC, OQ, MMM, and MMeM switches
under uniform traffic with Bernoulli and bursty arrivals. Fig-
ures 5(a) and 5(b) show the average delay under uniform
Bernoulli traffic arrivals for N = 64 and N = 256, respec-
tively. The results in the figures show that the LBC switch
achieves 100% throughput under uniform traffic with Bernoulli
arrivals, indicated by the finite and moderate average queuing
delay. The high throughput of the proposed
switch is the result of using an efficient load-balancing process
in the IM-CIM stage. However, this high performance is
expected under this traffic pattern as the
incoming traffic is already uniformly distributed.
Figure 5(a) shows that the LBC switch experiences a delay
similar to that of the MMeM switch at high input load. Figure 5(b)
shows that the LBC switch experiences a slightly higher
average delay than the OQ switch. This additional delay in the
LBC switch is caused by having cells wait in the VOMQs until
a configuration that allows forwarding the cells to their destination
output modules takes place. Because MMeM requires
an excessive amount of memory to implement the extended
set of queues, its average cell delay could not
be measured for N=256 by our simulators. This figure also
shows that the LBC switch achieves a lower average delay
than the MMM switch at input loads of 0.95 and larger.
Uniform bursty traffic is modeled as an ON-OFF Markov
modulated process, with the average duration of the ON period
set as the average burst length, l, with l = {10, 30} cells.
Figures 5(c) and 5(d) show the average delay under uniform
traffic with bursty arrivals for average burst length of 10 and
30 cells, respectively, for switches with N=256. The results
show that the LBC switch achieves 100% throughput under
bursty uniform traffic. In contrast, the MMM switch has a
throughput of 0.8 and 0.75 for an average burst length of 10
and 30 cells, respectively. Therefore, the LBC switch achieves
a performance closer to that of the OQ switch. There is a very
small difference in the delay of the LBC. The uniformly distributed nature of
the traffic and the load-balancing stages help to achieve this
high throughput and low queueing delay. The slightly larger
average queueing delay of the LBC switch for very small input
loads is caused by the predetermined and cyclic configuration
of the bufferless modules, as some cells wait for a few time
slots to be forwarded, irrespective of the switch
size. Nevertheless, these two figures show that the queueing
delay difference between the LBC and the OQ switch is not
significant for large input loads.
B. Nonuniform traffic
We also compared the performance of the proposed LBC
switch with the MMM, MMeM, and OQ switches under un-
balanced [27], [28] and hot-spot patterns as nonuniform traffic.
The unbalanced traffic can be modeled using an unbalanced
probability ω to indicate the load variances for different flows.
For input port IP (i, s) and output port OP (j, d) of the
LBC switch, the traffic load is determined by
$$\rho_{i,s,j,d} = \begin{cases} \rho\left(\omega + \dfrac{1-\omega}{N}\right), & \text{if } i = j \text{ and } s = d, \\ \rho\,\dfrac{1-\omega}{N}, & \text{otherwise} \end{cases} \tag{33}$$
where ρ is the traffic load for input IP (i, s) and ω is the unbalanced
probability. When ω=0, the input traffic is uniformly
distributed and when ω=1, the input traffic is completely
directional; all traffic from IP (i, s) is destined for OP (j, d) with i = j and s = d.
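The unbalanced pattern in (33) can be generated with a few lines of Python. The sketch below assumes that IMs and OMs have the same number of ports, so that the condition i = j and s = d corresponds to u = v under the port numbering of (15) and (16); the function name `unbalanced_rates` and the sample parameters are illustrative.

```python
def unbalanced_rates(n, rho, omega):
    """Rate matrix for the unbalanced pattern in (33) on an n x n switch.

    Assumes input u and output v share the same (module, port) pair when u == v.
    """
    background = rho * (1 - omega) / n      # share sent to every output
    return [[background + (rho * omega if u == v else 0.0) for v in range(n)]
            for u in range(n)]


# Assumed sample parameters: N = 4, input load 0.9, unbalanced probability 0.6.
rates = unbalanced_rates(4, rho=0.9, omega=0.6)
for row in rates:
    print([round(x, 3) for x in row], "row sum:", round(sum(row), 3))
```

Every row and every column of the resulting matrix sums to ρ, so the pattern is admissible for any ρ ≤ 1 regardless of ω.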
[Figure 5: six panels plotting average delay (time slots) versus input load. (a) Bernoulli uniform traffic, N=64 (MMM, MMeM, OQ, LBC). (b) Bernoulli uniform traffic, N=256 (MMM, OQ, LBC). (c) Bursty uniform, l=10, N=256 (MMM, OQ, LBC). (d) Bursty uniform, l=30 (MMM, OQ, LBC). (e) Unbalanced traffic, ω = 0.6 (LBC, OQ). (f) Hot-spot traffic (LBC, OQ).]
Fig. 5. Average queueing delay of LBC switch under uniform traffic: (a) Bernoulli arrivals and N = 64, (b) Bernoulli arrivals and N = 256, (c) bursty
traffic with average burst length l=10 for N=256, and (d) bursty traffic with average burst length l=30 for N=256, and under nonuniform traffic, such as: (e)
unbalanced traffic with ω = 0.6 for N=256 and (f) hot-spot traffic for N=256.
The simulation results show that the throughput of the LBC
switch is 100% under unbalanced traffic for all values of
ω, matching that of the MMM and MMeM switches, which
are also known to achieve high throughput but neglect in-sequence
forwarding. It has been shown that many switches
do not achieve high throughput when ω is around 0.6 [28].
Therefore, we measured the average delay of the LBC switch
under this traffic pattern for ω=0.6, as shown in Figure 5(e),
and compared with the OQ switch as this switch is well-known
to achieve 100% throughput. As the figure shows, the average
delay of the LBC switch is comparable to that of an OQ
switch. The load-balancing stage of the LBC switch distributes
the traffic uniformly throughout the switch.
We compared the performance of the proposed LBC switch
with the MMM, MMeM, and OQ switches under hot-spot
traffic [24]. Hot-spot traffic occurs when all IPs send most or
all traffic to one OP. For input port IP (i, s) and output
port OP (j, d) of the LBC switch, the traffic load is determined
by
$$\rho_{i,s,j,d} = \begin{cases} \rho\left(\dfrac{1}{N}\right), & \text{for } jm + d = h, \\ 0, & \text{otherwise} \end{cases} \tag{34}$$
where h is the hot-spot OP and 1 ≤ h ≤ N.
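A hot-spot rate matrix following (34) can be sketched in Python as below. The sketch indexes the hot-spot port h from 0, matching v = jm + d in (16); this indexing, the function name `hot_spot_rates`, and the sample parameters are assumptions made for illustration.

```python
def hot_spot_rates(n, rho, h):
    """Rate matrix for the hot-spot pattern in (34): every input sends rho/n to output h."""
    return [[rho / n if v == h else 0.0 for v in range(n)] for _ in range(n)]


# Assumed sample parameters: N = 4, load 0.8, hot-spot output 2 (0-based).
hot = hot_spot_rates(4, rho=0.8, h=2)
hot_spot_load = sum(hot[u][2] for u in range(4))
print(round(hot_spot_load, 3))
```

The load offered to the hot-spot output adds up to ρ, so the pattern stays admissible for ρ ≤ 1.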
Our simulation shows that the LBC switch as well as the
MMM and MMeM switches achieve 100% throughput under
admissible hot-spot traffic.
Figure 5(f) shows the measured average delay of the LBC
switch under this traffic pattern and that of an OQ switch.
The figure shows that the average delay of the LBC switch
is comparable to that of an OQ switch. This is a result of
effective load-balancing at the IMs, CIMs, and COMs of the
multiple flows coming from different inputs.
In addition to the analysis presented in Section II-F, we
also simulated the LBC switch under two new traffic patterns,
which we believe may stress the occupancy of CBs and therefore
increase the likelihood of occurrence of HoL blocking
conditions. The traffic patterns are: a) k flows from IPs at
different IMs, each arriving at a rate of 1/k for admissibility,
are forwarded to all OPs at one OM. The source IPs of the
flows are selected such that they share VOMQs; i = s, or
IP (0, 0), IP (1, 1), · · · , IP (k−1, n−1). b) Each IP at an IM
forwards cells at rate 1/k to each OP at an OM (e.g., i = j). Each
OP in the destination OM receives traffic from all IPs of one
IM. VOMQs are also shared by different flows. Figures 6(a)
and 6(b) show the average delay under the first and second
traffic patterns presented above, respectively. The results in
the figures show that LBC experiences a finite and moderate
average queuing delay, which implies that LBC achieves 100%
throughput under both traffic patterns. We also measured the
average CB length, and this length does not grow beyond
one cell, indicating that no CB gets congested. This result
is obtained because the load-balancing mechanism spreads a
flow to different VOMQs.
[Figure 6: two panels plotting average delay (time slots) versus input load, with curves labeled LBC=64 and LBC=256. (a) k flows from k IMs to all OPs in an OM. (b) Hot-spot per-module traffic.]
Fig. 6. Average queueing delay of LBC switch under: (a) k flows from k IMs to all OPs in an OM and (b) Hot-spot per-Module traffic.
VI. CONCLUSIONS
We have introduced a configuration scheme for a split-central-buffered
load-balancing Clos-network switch and a
mechanism that forwards cells in sequence for this switch.
To effectively perform load balancing, the switch has virtual
output module queues between the two split central stages. With
the split central module, the switch comprises four stages,
named IM, CIM, COM, and OM. The IM, CIM, and COM
stages are bufferless crossbars, while the OM stage is buffered.
All the bufferless modules follow a predetermined
configuration while the OM follows a round-robin sequence to
forward cells from the CB to the output ports. Therefore, the
switch does not have to perform matching in any stage despite
having bufferless modules, and the configuration complexity of
the switch is minimal, making it comparable to that of MMM
switches. We introduced an in-sequence mechanism that operates
at the inputs of the LBC switch to avoid out-of-sequence
forwarding caused by the central buffers. We modeled and
analyzed the operations that each of the stages effects on the
incoming traffic to obtain the loads seen by the output ports.
We showed that for admissible independent and identically
distributed traffic, the switch achieves 100% throughput. Unlike
the existing switching architectures discussed in Section
I, LBC achieves high performance, configuration simplicity,
and in-sequence service without memory speedup or central
module expansion. In addition, we analyzed the operation
of the forwarding mechanism and demonstrated that cells
are forwarded in sequence. We showed, through computer
simulation, that the switch achieved 100%
throughput for all tested uniform and nonuniform traffic distributions.
APPENDIX A
ANALYSIS OF IN-SEQUENCE SERVICE
In this section, we demonstrate the lemmas that support
the theorem where we claim that the LBC switch forwards
cells in sequence through the proposed in-sequence forwarding
mechanism.
Lemma 1. For a single flow traversing the LBC switch, any
cell of the flow experiences the same delay. That is, let td be
the delay experienced by a cell. Then, for any cell traversing
the LBC switch, tdy,τ = γ, where γ is a positive constant.
We analyze first the scenario of a single flow, i.e., y,
traversing the switch, whose cells arrive back to back, one
each time slot. For simplicity but without losing generality,
let us also consider empty queues as an initial condition.
Proof:
For any cy,τ , the total delay time is defined as:
tdy,τ = q1y,τ + q2y,τ + q3y,τ (35)
in number of time slots. Here we consider fixed arbitration
time at each queue and this delay is included in the queuing
delay. We are then interested in finding q1y,τ , q2y,τ , and q3y,τ .
For q1y,τ , under a single-flow scenario, let us consider any
two cells of flow y with arrival times k time slots apart, cy,τ−2k
and cy,τ−k; they are forwarded to the same VOMQ. If cy,τ−k
finds one or more cells in the VOMQ, cy,τ is held at the VOQ
(owing to the mechanism to keep cells in sequence at the VOQ)
and q1y,τ increases. In this case, the empty-queue initial
condition makes the waiting factor δ = 0.
On the other hand, an OM is connected to a VOMQ every k
time slots as per the configuration scheme of COM. Therefore,
q2y,τ ≤ k − 1 (36)
This queuing delay is smaller than the arrival gap between
these two cells, as:
tay,τ−k − tay,τ−2k = k time slots
Therefore, cy,τ is not backlogged further in VOMQ and
there is no impact on the time the cell is held in a VOQ, such
that:
q1y,τ = 0 ∀ y, τ
For q2y,τ , let us now assume that cy,τ−k arrives at a time
that it has to wait γ time slots, where 1 ≤ γ ≤ k, to be
forwarded to the destination OM, or
q2y,τ−k = γ
Then when cy,τ arrives, k time slots later, it finds exactly the
same configuration in the COM as found by cy,τ−k. Because
cells arrive consecutively,
q2y,τ = γ ∀ τ
For q3y,τ , because there is a single flow traversing the switch
and the configuration scheme followed by COM, one cell
arrives in the CB each time slot and one cell departs OP at the
same time slot. Therefore, no cell is backlogged in this case
and
q3y,τ = 0
From (35):
tdy,τ = γ ∀ τ
for empty queues as initial condition.
It is then easy to see that for any queued cells, q1y,τ would
be increased by δk time slots, and q2y,τ as well as q3y,τ remain
unchanged.
Therefore, all cells of the flow experience the same delay
and are forwarded in sequence.
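The periodicity argument used for q2 can be illustrated with a small Python sketch. It assumes a simplified model in which a VOMQ is connected to its OM only in time slots congruent to a fixed phase modulo k and the cell finds the VOMQ empty; the function name `vomq_wait` and the parameter `service_phase` are hypothetical and are not part of the paper.

```python
def vomq_wait(arrival_slot, service_phase, k):
    """Slots a cell waits in an empty VOMQ that is served only in slots
    congruent to service_phase modulo k (simplified, assumed model)."""
    return (service_phase - arrival_slot) % k


k = 4
service_phase = 3
# Cells of the same flow that reach this VOMQ arrive k slots apart.
arrivals = [1, 1 + k, 1 + 2 * k]
waits = [vomq_wait(a, service_phase, k) for a in arrivals]
print(waits)
```

In this model, cells that arrive k time slots apart find the same configuration and therefore wait the same number of slots, and no wait exceeds k − 1, in line with (36).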

Lemma 2. For any number of flows traversing the LBC
switch, cells from the same flow arrive at the OM in sequence.
Proof: Here, we consider the following traffic scenario:
There are k flows coming from different IPs, each from a
different IM. In each of the flows, cells arrive back to back
and are destined to the same OP. Furthermore, the flows have
one time slot difference in their arrival times such that the
cells with the same sequence number of each different flow
are stored in the same VOMQs. Here, each flow consists of k
cells. Table IV shows an example of the arrival pattern of this
traffic scenario for three flows. The table shows the arrival of
k cells from k flows at different IPs and IMs that arrive at
one time slot apart to enable these flows to be forwarded to
the same VOMQ, otherwise the flows would be forwarded to
different VOMQs.
TABLE IV
EXAMPLE OF BACK-TO-BACK ARRIVALS OF ONE BURST OF k FLOWS.
Cell arrival time
          tx      tx+1    tx+2    tx+3    tx+4
Flow 1    c1,1    c1,2    c1,3
Flow 2            c2,1    c2,2    c2,3
Flow 3                    c3,1    c3,2    c3,3
Table V shows that cells c1,1, c1,2, c1,3, c2,1, and c3,1 were
successfully forwarded to the VOMQ without any blocking,
while the in-sequence mechanism holds back the cells c2,2,
c2,3, c3,2, and c3,3 to prevent out-of-sequence forwarding, because cells
c2,1 and c3,1 were forwarded to a non-empty VOMQ.
The configuration pattern used in the IMs and CIMs, and
the in-sequence mechanism determine the order in which cells
arrive to the VOMQs. Table V shows this order in our example.
In such arrival pattern, the departures from VOMQs follow
the deterministic configuration of the COMs. Table VI shows
the corresponding departures of the cells from VOMQs of
these three flows.
Table VI shows that all the cells were forwarded out of the
VOMQ in the same pattern they arrived, one cell every k
time slots, because the COM connects to the OM once every k
time slots.
Also, let us assume that the first cell of a flow at the LCIM
arrives at least one time slot before the configuration
of the COM allows forwarding the cell to its destination OM.
Thus, a cell may depart in the time slot following its arrival
or a few time slots later. A cell then may wait up to k − 1 time slots
for the designated interconnection to take place before being
forwarded to the OM.
Given k flows, with their τ th cells being c1,τ to ck,τ , the
arrival time of the first arriving cell c1,τ is:
ta1,τ = tx (37)
The number of cells at the VOQ, N1(cy,τ ), upon the arrival
of c1,τ is:
N1(c1,τ ) = 0 (38)
This condition holds because there is no cell at the VOQ when
c1,τ arrives. Because of (38), the queuing delay at the VOQ
of c1,τ is:
q11,τ = 0 (39)
The departure time of a cell cy,τ from the VOQ is:
d1y,τ = tay,τ + q1y,τ (40)
Using (37) to (40), the departure time of c1,τ from the VOQ
is:
d11,τ = tx + 1 (41)
Upon arriving at the VOMQ, c1,τ finds no cell ahead of it.
Thus, the number of cells at the VOMQ, N2(c1,τ ), upon the
arrival of c1,τ is:
N2(c1,τ ) = 0 (42)
Based on the considered traffic pattern, c1,τ is stored in the
VOMQ for additional k − 1 time slots. Therefore,
q21,τ = k − 1 (43)
The departure time of a cell cy,τ from the VOMQ is:
d2y,τ = d1y,τ + q2y,τ (44)
Using (41), (43), and (44), the departure time of c1,τ from the
VOMQ is:
d21,τ = tx + k (45)
Let us consider now another cell from the same flow, c1,τ+θ,
where 0 < θ < k, with
ta1,τ+θ = tx + θ (46)
Upon the arrival of c1,τ+θ, there is no cell at the VOQ, or:
N1(c1,τ+θ) = 0 (47)
Because of (42) and (47), the queuing delay at the VOQ for
c1,τ+θ is:
q11,τ+θ = 0 (48)
Using (40), (46), and (48), the departure time of c1,τ+θ from
the VOQ is:
d11,τ+θ = tx + θ + 1 (49)
Upon arriving at the VOMQ, c1,τ+θ finds no cell ahead of it,
or:
N2(c1,τ+θ) = 0 (50)
Because of the considered traffic, c1,τ+θ is queued extra k−1
time slots at the VOMQ, hence:
q21,τ+θ = k − 1 (51)
Using (44), and (49) to (51),
d21,τ+θ = tx + k + θ (52)
Using (45), therefore,
d21,τ+θ = d21,τ + θ (53)
In general, for cz,τ , where 1 < z ≤ k, the arrival time is
taz,τ = tx + (z − 1) (54)
and upon the arrival of cz,τ in the VOQ, there is no cell:
N1(cz,τ ) = 0 (55)
With (55),
q1z,τ = 0 (56)
TABLE V
TIME SLOTS IN WHICH CELLS ARRIVE TO VOMQS OF A SINGLE k-CELL BURST.
Time slots cells arrive at the VOMQs
tx tx+1 tx+2 tx+3 tx+4 tx+5 tx+6 tx+7 tx+8 tx+9 tx+10 tx+11
c1,1 c1,2 c1,3
c2,1 c2,2 c2,3
c3,1 c3,2 c3,3
TABLE VI
TIME SLOTS WHEN CELLS DEPART VOMQS IN EXAMPLE OF THE IN-SEQUENCE FORWARDING MECHANISM.
Cell departure time slots from VOMQs
tx tx+1 tx+2 tx+3 tx+4 tx+5 tx+6 tx+7 tx+8 tx+9 tx+10 tx+11 tx+12
c1,1 c1,2 c1,3
c2,1 c2,2 c2,3
c3,1 c3,2 c3,3
Using (40), (54) , and (56),
d1z,τ = tx + z (57)
However, upon arriving in the VOMQ, cz,τ finds δ cells ahead
of it, or:
N2(cz,τ ) = δ (58)
δ = z − 1 (59)
where 0 < δ < k. The queuing delay of cz,τ at the VOMQ is then:
q2z,τ = qHz,τ + (δ − 1)k + k (60)
Here, qHz,τ is the delay from the HoL cell in the VOMQ on cz,τ ,
(δ − 1)k is the delay generated from the other (δ − 1) cells
ahead of cz,τ in the VOMQ, and the extra k time slots are the
delay cz,τ experiences as it waits for the configuration pattern
to repeat after the last cell ahead of it is forwarded to the OM.
The delay qHz,τ satisfies:
d21,τ = d1z,τ + qHz,τ (61)
Using (44), (60), and (61), the departure time of cz,τ from the
VOMQ is:
d2z,τ = d21,τ + δk (62)
Using (45) and (59), then:
d2z,τ = tx + zk (63)
Let us now consider any other cell from flow z, cz,τ+θ,
where 0 < θ < k. The time of arrival of the cell cz,τ+θ is:
taz,τ+θ = tx + (z − 1) + θ (64)
Upon the arrival of cz,τ+θ, there could be zero or more cells at the
VOQ, hence:
N1(cz,τ+θ) = γ (65)
where γ is the number of cells at the VOQ upon the arrival
of cz,τ+θ and 0 ≤ γ < k. Using (58) and (65), then:
$$q1_{z,\tau+\theta} = \begin{cases} \delta k & \text{for } \gamma = 0 \\ \delta k + \sum_{\sigma=1}^{\theta-1} q1_{z,\tau+\sigma} & \text{for } \gamma > 0 \end{cases} \tag{66}$$
where $\sum_{\sigma=1}^{\theta-1} q1_{z,\tau+\sigma}$
is the delay generated from the γ cells ahead of cz,τ+θ at the
VOQ. Let
$$\gamma_q = \sum_{\sigma=1}^{\theta-1} q1_{z,\tau+\sigma} \tag{67}$$
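Using (40), (64), (66), and (67), then:
$$d1_{z,\tau+\theta} = \begin{cases} t_x + (z-1) + \theta + \delta k & \text{for } \gamma = 0 \\ t_x + (z-1) + \theta + \delta k + \gamma_q & \text{for } \gamma > 0 \end{cases} \tag{68}$$
The queuing delay of cz,τ+θ at the VOMQ is equal to (60).
Therefore, using (44), (60), and (68), the departure time of
cz,τ+θ from the VOMQ is:
$$d2_{z,\tau+\theta} = \begin{cases} d2_{1,\tau+\theta} + \delta k & \text{for } \gamma = 0 \\ d2_{1,\tau+\theta} + \delta k + \gamma_q & \text{for } \gamma > 0 \end{cases} \tag{69}$$
Using (53) and (59), then:
$$d2_{z,\tau+\theta} = \begin{cases} d2_{1,\tau} + (z-1)k + \theta & \text{for } \gamma = 0 \\ d2_{1,\tau} + (z-1)k + \theta + \gamma_q & \text{for } \gamma > 0 \end{cases} \tag{70}$$
Using (45), then:
$$d2_{z,\tau+\theta} = \begin{cases} t_x + zk + \theta & \text{for } \gamma = 0 \\ t_x + zk + \theta + \gamma_q & \text{for } \gamma > 0 \end{cases} \tag{71}$$
From (53),
$$d2_{1,\tau+\theta} - d2_{1,\tau} = \theta \tag{72}$$
Using (63), gives:
$$d2_{z,\tau+\theta} - d2_{z,\tau} = \begin{cases} \theta & \text{for } \gamma = 0 \\ \theta + \gamma_q & \text{for } \gamma > 0 \end{cases} \tag{73}$$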
The difference between the departure times of any two cells
of a flow from VOMQ is a function of θ, which is the arrival
time difference of the two cells. Therefore, cells of a flow are
forwarded to the OM in the same order they arrived.
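The closed-form departure times in (63) and (71) can be evaluated directly to confirm the gap stated in (73). The following Python sketch is only an arithmetic check of those expressions; the function names and the choice k = 3, tx = 0 are assumptions for illustration.

```python
def d2_first(tx, z, k):
    """(63): VOMQ departure slot of the first considered cell of flow z (1 <= z <= k)."""
    return tx + z * k


def d2_later(tx, z, k, theta, gamma_q=0):
    """(71): VOMQ departure slot of the cell theta positions later in flow z.

    gamma_q is the extra VOQ delay of (67); it is 0 when gamma = 0.
    """
    return tx + z * k + theta + gamma_q


k, tx = 3, 0
# Departure gap between a later cell and the first cell of the same flow, for gamma = 0.
gaps = {(z, theta): d2_later(tx, z, k, theta) - d2_first(tx, z, k)
        for z in range(1, k + 1) for theta in range(1, k)}
print(gaps)
```

In every case the gap equals θ (or θ + γq when γ > 0), which is positive, so the later cell of a flow leaves the VOMQ after the earlier one.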

Lemma 3. For any number of flows traversing the LBC
switch, the cells of each flow arrive and are cleared at the
output port (OP) in the same order the cells arrived at the
input port (IP).
In our discussion of this lemma, let us consider the
following traffic scenario: The switch has cells from only two
flows, each arriving in a different IM (and therefore IP) and
both of them are destined to the same OP. In each flow, cells
arrive back-to-back, one at each time slot, and the first cell
of both flows arrive at a time slot such that the configuration
pattern of IM-CIM stage would not enable forwarding them
to the COM immediately. With this condition, we analyze
how these two flows are kept from affecting each other, and
therefore, the sequence in which cells may depart the OP.
This traffic scenario may present the greatest opportunity of
experiencing out-of-sequence forwarding by any two cells of
a flow as cells from these two flows interact at the CBs of
the destination OP. Let us also consider empty queues as an
initial condition.
Consider flows y and z, whose first cells, cy,τ
and cz,τ , respectively, arrive at their respective VOQs at time
slot tx and whose θth cells, cy,τ+θ and cz,τ+θ ∀ θ ≥ 1, arrive
at time slot tx + θ. According to this lemma, cy,τ
and cz,τ must be forwarded and cleared from the output port
OP (j, d) before cy,τ+θ and cz,τ+θ, respectively.
Proof:
We analyze the departure time of the cells cy,τ and cz,τ
from the CBs. The arrival times for cells cy,τ and cz,τ are:
tay,τ = taz,τ = tx (74)
Upon arriving in the VOQ, cy,τ and cz,τ are placed as HoL
cells because there are no backlogged cells. Hence:
N1(cy,τ ) = 0 (75)
and
N1(cz,τ ) = 0 (76)
Using (75) and (76), the queuing delays of cy,τ and cz,τ at
the VOQ are:
q1y,τ = 0 (77)
and
q1z,τ = 0 (78)
Using (40), (74), and (77) the departure time for cy,τ from the
VOQ is:
d1y,τ = tx + 1 (79)
Using (40), (74), and (78) the departure time for cz,τ from the
VOQ is:
d1z,τ = tx + 1 (80)
Thus, cy,τ and cz,τ are forwarded to the same CIM (so that
these two cells would share the same CB) and stored in their
respective VOMQs. Because the VOMQs are empty at the time
the two cells arrive:
N2(cy,τ ) = 0 (81)
and
N2(cz,τ ) = 0 (82)
Based on the adopted traffic scenario, cy,τ and cz,τ are held
at the VOMQ for β1 and β2 time slots, respectively, before
the configuration pattern enables forwarding them to their
destination OM. Here, 1 ≤ β1 < k and 1 ≤ β2 < k. Hence,
the queuing delay of cy,τ at the VOMQ is:
q2y,τ = β1 (83)
The queuing delay of cz,τ at the VOMQ is:
q2z,τ = β2 (84)
Assuming β1 < β2, cy,τ would be forwarded to the
destination OM before cz,τ . From (44), (79), and (83), the
departure time of cy,τ from the VOMQs is:
d2y,τ = tx + 1 + β1 (85)
From (44), (80), and (84), the departure time of cz,τ from the
VOMQs is:
d2z,τ = tx + 1 + β2 (86)
When cy,τ and cz,τ arrive at the OM, they are stored at CBs
before being forwarded to the output port.
Let us now consider cy,τ+1 and cz,τ+1, which arrive at time
slot tx + 1, hence:
tay,τ+1 = taz,τ+1 = tx + 1 (87)
Because there are no cells at the VOQ upon the arrival of
cy,τ+1 and cz,τ+1, then:
N1(cy,τ+1) = 0 (88)
and
N1(cz,τ+1) = 0 (89)
With (81) and (88), the queuing delay of cy,τ+1 at the VOQ
is:
q1y,τ+1 = 0 (90)
With (82) and (89), the queuing delay of cz,τ+1 at the VOQ
is:
q1z,τ+1 = 0 (91)
Using (40), (87), and (90), the departure time of cy,τ+1 from
the VOQ is:
d1y,τ+1 = tx + 2 (92)
Using (40), (87), and (91), the departure time of cz,τ+1 from
the VOQ is:
d1z,τ+1 = tx + 2 (93)
cy,τ+1 and cz,τ+1 are forwarded to the same CIM and stored
in their respective VOMQs. Based on the traffic scenario
cy,τ+1 and cz,τ+1 are also stored for β1 and β2 time slots,
respectively, at the VOMQs before the configuration pattern
of the COM enables forwarding them to the destination OM.
Hence, the queuing delays of cy,τ+1 and cz,τ+1 at the VOMQ
are equal to (83) and (84), respectively. From (44), (83), and
(92), the departure time of cy,τ+1 from the VOMQ is:
d2y,τ+1 = tx + 2 + β1 (94)
From (44), (84), and (93), the departure time of cz,τ+1 from
the VOMQ is:
d2z,τ+1 = tx + 2 + β2 (95)
Next, we analyze the departure time of the cells from the
output port. Because d2y,τ+1 > d2y,τ and d2z,τ+1 > d2z,τ ,
cy,τ and cz,τ arrive at the output module before
cy,τ+1 and cz,τ+1, respectively. With the CB initially empty
based on the initial condition:
N3(cy,τ ) = 0 (96)
With d2z,τ > d2y,τ :
N3(cz,τ ) = 0 (97)
With (96) and (97), the queuing delays of cy,τ and cz,τ at the
CB are:
q3y,τ = 0 (98)
and
q3z,τ = 0 (99)
The queuing delays of cy,τ+1 and cz,τ+1 at the CB are equal
to (98) and (99), respectively. The departure time of a cell cc,τ from the
CB is:
d3c,τ = d2c,τ + q3c,τ + 1 (100)
Therefore, using (85), (98), and (100), the departure time of
cy,τ from the output port is:
d3y,τ = tx + 2 + β1
Using (94), (98), and (100), the departure time of cy,τ+1 from
the output port is:
d3y,τ+1 = tx + 3 + β1
Using (86), (99), and (100), the departure time of cz,τ from
the output port is:
d3z,τ = tx + 2 + β2
Using (95), (99), and (100), the departure time of cz,τ+1 from
the output port is:
d3z,τ+1 = tx + 3 + β2
Therefore, with d3y,τ+1 > d3y,τ and d3z,τ+1 > d3z,τ , cy,τ
and cz,τ would depart the output port before cy,τ+1 and cz,τ+1,
respectively. Note that for N1(cy,τ ) > 0, δ > 0, such that
the cells from the same flow are forwarded with larger time
separation from each other, and there are fewer chances that
they will be at the CBs at the same time slot. Therefore, this
property, as described by this lemma, applies to any two cells
of a flow.
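
The timeline derived above can be checked numerically. The following Python sketch is only an illustration: it assumes the per-stage relations implied by the departure times listed in this proof, namely d1 = ta + q1 + 1, d2 = d1 + q2, and d3 = d2 + q3 + 1, and the names `Cell`, `departure_times`, `scenario`, and `in_order`, as well as the sample values of tx, β1, and β2, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    flow: str      # flow label, "y" or "z" (hypothetical names)
    seq: int       # position in the flow: 0 for tau, 1 for tau + 1
    arrival: int   # arrival time slot at the VOQ
    q1: int        # queuing delay at the VOQ
    q2: int        # queuing delay at the VOMQ
    q3: int        # queuing delay at the CB


def departure_times(cell):
    """Return (d1, d2, d3) under the assumed per-stage relations."""
    d1 = cell.arrival + cell.q1 + 1   # VOQ departure (assumed form)
    d2 = d1 + cell.q2                 # VOMQ departure after q2 slots of waiting
    d3 = d2 + cell.q3 + 1             # output-port departure (assumed form)
    return d1, d2, d3


def scenario(tx, beta1, beta2):
    """Build the four cells of the traffic scenario used in the proof."""
    cells = []
    for seq in (0, 1):
        cells.append(Cell("y", seq, tx + seq, 0, beta1, 0))
        cells.append(Cell("z", seq, tx + seq, 0, beta2, 0))
    return cells


def in_order(cells):
    """True if, within each flow, output departures follow sequence order."""
    by_flow = {}
    for c in sorted(cells, key=lambda c: c.seq):
        by_flow.setdefault(c.flow, []).append(departure_times(c)[2])
    return all(all(a < b for a, b in zip(d, d[1:])) for d in by_flow.values())


if __name__ == "__main__":
    tx, beta1, beta2 = 10, 1, 3   # sample values with 1 <= beta1 < beta2
    for c in scenario(tx, beta1, beta2):
        print(c.flow, c.seq, departure_times(c))
    print("in order:", in_order(scenario(tx, beta1, beta2)))
```

For any tx and any β1 < β2, the cell with sequence τ + 1 of each flow leaves the output port exactly one slot after the cell with sequence τ, which matches the departure times listed above.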

This completes the proof of Theorem 1.

APPENDIX B
100% THROUGHPUT
In this section we prove that LBC achieves 100% throughput
by using the analysis presented in Section III.A and the
concept of queue stability. A switch is defined as stable for
a traffic pattern if the queue length is bounded, and a switch
achieves 100% throughput if it is stable for admissible i.i.d.
traffic [29]. With this, we state the following theorem:
Theorem 2. LBC achieves 100% throughput under admissible
i.i.d. traffic.
Proof: Here, we consider the queue to be weakly stable if the
drift of the queue occupancy from the initial state is a finite
integer ε ∀ t as t → ∞. Using this definition, we
show that the queue lengths of VOQs, VOMQs, and CBs are
weakly stable under i.i.d. traffic, and hence, that LBC achieves 100%
throughput under that traffic pattern.
Let us represent the queue occupancy of VOQs at time slot
t, N1(t) as:
N1(t) = N1(t− 1) + A1(t)−D1(t) (101)
where A1(t) is the packet arrival matrix at time slot t to VOQs
and D1(t) is the service rate matrix of VOQs at time slot
t. Solving (101) with an initial condition N1(0), recursively
yields:
N_1(t) = N_1(0) + \sum_{\gamma=0}^{t} A_1(\gamma) - \sum_{\gamma=0}^{t} D_1(\gamma) (102)
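
As a quick sanity check of the step from (101) to (102), the following Python sketch unrolls the recursion for a single queue and compares it with the closed form. It is an illustration only: the function names and sample sequences are hypothetical, scalars stand in for the matrices, and the arrival and departure sequences are assumed to be indexed from the first slot after the initial condition N1(0).

```python
def recursive_occupancy(n0, arrivals, departures):
    """Apply N(t) = N(t-1) + A(t) - D(t) slot by slot, as in (101)."""
    n = n0
    for a, d in zip(arrivals, departures):
        n = n + a - d
    return n


def closed_form_occupancy(n0, arrivals, departures):
    """Evaluate N(0) + sum of arrivals - sum of departures, as in (102)."""
    return n0 + sum(arrivals) - sum(departures)


if __name__ == "__main__":
    n0 = 2
    arrivals = [1, 0, 1, 1]     # hypothetical per-slot arrivals
    departures = [0, 1, 0, 1]   # hypothetical per-slot departures
    print(recursive_occupancy(n0, arrivals, departures))
    print(closed_form_occupancy(n0, arrivals, departures))
```

Both functions return the same occupancy for any pair of equal-length sequences, since the closed form is just the telescoped recursion.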
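Let us consider s1u,v (t) as the service rate received by the
VOQ at IP (u) for OP (v) at time slot t, or:
\begin{cases}
\frac{1}{N} \le s_{1_{u,v}}(t) \le 1 & \text{for } \delta = 0 \\
\frac{1}{\delta N k} \le s_{1_{u,v}}(t) \le \frac{1}{\delta k} & \text{for } \delta > 1
\end{cases} (103)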
Another way to express D1(t) is:
D1(t) = [s1u,v (t)] (104)
and recalling R1 as the aggregate traffic arrival to VOQs or:
R_1 = \sum_{\gamma=0}^{t} A_1(\gamma) (105)
Let us assume the worst-case scenario in (103). Substituting
(103) into (104), and (104) and (105) into (102), yields:
N_1(t) = \begin{cases}
N_1(0) + R_1 - 1 \ast \frac{t}{N} & \text{for } \delta = 0 \\
N_1(0) + R_1 - 1 \ast \frac{t}{\delta N k} & \text{for } \delta > 1
\end{cases} (106)
From (106), we obtain:
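\begin{cases}
\lim_{t \to \infty} \frac{R_1}{t} - 1 \ast \frac{1}{N} \le \epsilon < \infty & \text{for } \delta = 0 \\
\lim_{t \to \infty} \frac{R_1}{t} - 1 \ast \frac{1}{\delta N k} \le \epsilon < \infty & \text{for } \delta > 1
\end{cases} (107)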
From the admissibility condition of R1, it is easy to see that
for any value of t, (107) is finite. Hence, from the admissibility
of R1, (106), and (107), we conclude that the occupancy of VOQs
is weakly stable.

Now we prove VOMQs stability. As before, the queue
occupancy matrix of VOMQs at time slot t can be represented
as:
N2(t) = N2(t− 1) + A2(t)−D2(t) (108)
where A2(t) is the arrival matrix at time slot t to VOMQs
and D2(t) is the service rate matrix of VOMQs at time slot
t. Solving (108) recursively with an initial
condition N2(0) yields:
N_2(t) = N_2(0) + \sum_{\gamma=0}^{t} A_2(\gamma) - \sum_{\gamma=0}^{t} D_2(\gamma) (109)
Because a VOMQ is serviced at least once every k time slots,
the service rate of the VOMQ at IC(r, p) for OP (v) at time
slot t, d2µ,v (t) is:
d_{2_{\mu,v}}(t) = \frac{1}{k} \quad \forall \mu \text{ and } v
Then, the service matrix of VOMQs is:
D2(t) = [d2µ,v (t)] (110)
and representing R2 as the aggregate traffic arrival to VOMQs
or:
R_2 = \sum_{\gamma=0}^{t} A_2(\gamma) (111)
Substituting (110) and (111) into (109) gives:
N_2(t) = N_2(0) + R_2 - \frac{1}{k} P_1 (112)
where
R_2 - \frac{1}{k} P_1 \le \epsilon < \infty (113)
Recalling that R2 is admissible, per the discussion in Section
III.A, and by substituting P1 and R2 into (113), it is easy to
see that ε is finite. Hence, from (112) and (113), we conclude
that the occupancy of VOMQ is weakly stable.
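
The bounded-occupancy argument can be illustrated with a small simulation. The Python sketch below is not the authors' simulator: it assumes a single queue with Bernoulli arrivals that is served exactly once every k slots (the service rate 1/k used above), and the function name, seed, and parameter values are hypothetical.

```python
import random


def simulate_vomq(arrival_prob, k, slots, seed=1):
    """Simulate one queue served at most once every k slots.

    Arrivals are Bernoulli(arrival_prob) per slot. A service opportunity
    occurs in slots where t % k == 0, which mimics a periodic configuration.
    Returns the occupancy at the end of the run.
    """
    rng = random.Random(seed)
    occupancy = 0
    for t in range(1, slots + 1):
        if rng.random() < arrival_prob:
            occupancy += 1
        if t % k == 0 and occupancy > 0:
            occupancy -= 1
    return occupancy


if __name__ == "__main__":
    k, slots = 4, 100_000
    for p in (0.20, 0.30):   # below and above the service rate 1/k = 0.25
        final = simulate_vomq(p, k, slots)
        print(f"p={p:.2f} final occupancy={final} "
              f"occupancy per slot={final / slots:.4f}")
```

With k = 4, an arrival probability below 1/k keeps the occupancy per elapsed slot near zero, whereas a probability above 1/k makes the occupancy grow roughly linearly with t, which is the inadmissible case excluded by the theorem.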

Now we prove the stability of CBs. The queue occupancy
matrix of CBs at time slot t can be represented as:
N3(t) = N3(t− 1) + A3(t)−D3(t) (114)
where A3(t) is the packet arrival matrix at time slot t to CBs,
and D3(t) is the service rate matrix of CBs at time slot t.
Solving (114) recursively as before yields:
N_3(t) = N_3(0) + \sum_{\gamma=0}^{t} A_3(\gamma) - \sum_{\gamma=0}^{t} D_3(\gamma) (115)
Because a CB is serviced at least once every k time slots,
the service rate of the CB at OP (v) at time slot t,
d3v (t), is:
\frac{1}{k} \le d_{3_v}(t) \le 1
and the service matrix of CBs is:
D3(t) = [d3v (t)] (116)
Similarly, the aggregate traffic arrival to the CBs is:
R_4 = \sum_{\gamma=0}^{t} A_3(\gamma) (117)
Let us assume d3v (t) = 1/k ∀ v in (116), which is the worst-case
scenario, in which a CB gets served once every k time
slots. Substituting (116) and (117) into (115) gives:
N_3(t) = N_3(0) + R_4 - \frac{1}{k} \ast \vec{1} (118)
where
R_4 - \frac{1}{k} \ast \vec{1} \le \epsilon < \infty (119)
With R4 being admissible, as discussed in Section III.A, and
by substituting R4 into (119), it is easy to see that ε is finite.
Hence, from (118) and (119), we conclude that the occupancy
of CB is also weakly stable.

This completes the proof of Theorem 2.

REFERENCES
[1] C. Clos, “A study of non-blocking switching networks,” Bell System
Technical Journal, vol. 32, no. 2, pp. 406–424, 1953.
[2] N. A. Al-Saber, S. Oberoi, T. Pedasanaganti, R. Rojas-Cessa, and
S. G. Ziavras, “Concatenating packets in variable-length input-queued
packet switches with cell-based and packet-based scheduling,” in Sarnoff
Symposium, 2008 IEEE. IEEE, 2008, pp. 1–5.
[3] T. T. Lee and C. H. Lam, “Path switching-a quasi-static routing scheme
for large-scale ATM packet switches,” IEEE Journal on Selected Areas
in Communications, vol. 15, no. 5, pp. 914–924, 1997.
[4] H. J. Chao, Z. Jing, and S. Y. Liew, “Matching algorithms for three-stage
bufferless Clos network switches,” Communications Magazine, IEEE,
vol. 41, no. 10, pp. 46–54, 2003.
[5] F. M. Chiussi, J. G. Kneuer, and V. P. Kumar, “Low-cost scalable switch-
ing solutions for broadband networking: the ATLANTA architecture and
chipset,” IEEE Communications Magazine, vol. 35, no. 12, pp. 44–53,
1997.
[6] J. Kleban and U. Suszynska, “Static dispatching with internal back-
pressure scheme for SMM Clos-network switches,” in Computers and
Communications (ISCC), 2013 IEEE Symposium on. IEEE, 2013, pp.
654–658.
[7] J. Kleban, M. Sobieraj, and S. Weclewski, “The modified MSM Clos
switching fabric with efficient packet dispatching scheme,” in High
Performance Switching and Routing, 2007. HPSR’07. Workshop on.
IEEE, 2007, pp. 1–6.
[8] R. Rojas-Cessa, E. Oki, and H. J. Chao, “Maximum weight matching
dispatching scheme in buffered Clos-network packet switches,” in Com-
munications, 2004 IEEE International Conference on, vol. 2. IEEE,
2004, pp. 1075–1079.
[9] H. J. Chao, J. Park, S. Artan, S. Jiang, and G. Zhang, “Trueway: a
highly scalable multi-plane multi-stage buffered packet switch,” in High
Performance Switching and Routing, 2005. HPSR. 2005 Workshop on.
IEEE, 2005, pp. 246–253.
[10] N. Chrysos and M. Katevenis, “Scheduling in non-blocking buffered
three-stage switching fabrics.” in INFOCOM, vol. 6, 2006, pp. 1–13.
[11] Z. Dong and R. Rojas-Cessa, “Non-blocking memory-memory-memory
Clos-network packet switch,” in 34th IEEE Sarnoff Symposium. IEEE,
2011, pp. 1–5.
[12] Y. Xia, M. Hamdi, and H. J. Chao, “A practical large-capacity three-
stage buffered Clos-network switch architecture,” IEEE Transactions on
Parallel and Distributed Systems, vol. 27, no. 2, pp. 317–328, 2016.
[13] X. Li, Z. Zhou, and M. Hamdi, “Space-memory-memory architecture
for Clos-network packet switches,” in Communications, 2005. ICC 2005.
2005 IEEE International Conference on, vol. 2. IEEE, 2005, pp. 1031–
1035.
[14] C.-B. Lin and R. Rojas-Cessa, “Minimizing scheduling complexity
with a Clos-network space-space-memory (SSM) packet switch,” in
High Performance Switching and Routing (HPSR), 2013 IEEE 14th
International Conference on. IEEE, 2013, pp. 15–20.
[15] R. Rojas-Cessa and C.-B. Lin, “Scalable two-stage Clos-network switch
and module-first matching,” in 2006 Workshop on High Performance
Switching and Routing. IEEE, 2006, pp. 6–pp.
[16] Z. Dong and R. Rojas-Cessa, “MCS: buffered Clos-network switch with
in-sequence packet forwarding,” in Sarnoff Symposium (SARNOFF),
2012 35th IEEE. IEEE, 2012, pp. 1–6.
[17] F. Hassen and L. Mhamdi, “High-capacity Clos-network switch for data
center networks,” in IEEE International Conference on Communications
2017. IEEE, 2017.
[18] C.-S. Chang, D.-S. Lee, and Y.-S. Jou, “Load balanced Birkhoff–von
Neumann switches, part i: one-stage buffering,” Computer Communica-
tions, vol. 25, no. 6, pp. 611–622, 2002.
[19] G. Birkhoff, “Tres observaciones sobre el algebra lineal,” Univ. Nac.
Tucumán Rev. Ser. A, vol. 5, pp. 147–151, 1946.
[20] H. Y. Lee, F. K. Hwang, and J. D. Carpinelli, “A new decomposition
algorithm for rearrangeable Clos interconnection networks,” IEEE
Transactions on Communications, vol. 44, no. 11, pp. 1572–1578, 1996.
[21] L. Shi, B. Liu, C. Sun, Z. Yin, L. N. Bhuyan, and H. J. Chao,
“Load-balancing multipath switching system with flow slice,” IEEE
Transactions on Computers, vol. 61, no. 3, pp. 350–365, March 2012.
[22] C.-S. Chang, D.-S. Lee, and C.-M. Lien, “Load balanced Birkhoff-von
Neumann switches, part ii: Multi-stage buffering.”
[23] I. Keslassy and N. McKeown, “Maintaining packet order in two-stage
switches,” in INFOCOM 2002. Twenty-First Annual Joint Conference of
the IEEE Computer and Communications Societies. Proceedings. IEEE,
vol. 2. IEEE, 2002, pp. 1032–1041.
[24] R. Rojas-Cessa, Interconnections for Computer Communications and
Packet Networks. CRC Press, 2016.
[25] M. Zhang, Z. Qiu, and Y. Gao, “Space-memory-memory Clos-network
switches with in-sequence service,” Communications, IET, vol. 8, no. 16,
pp. 2825–2833, 2014.
[26] B. Hu and K. L. Yeung, “On joint sequence design for feedback-
based two-stage switch architecture,” in High Performance Switching
and Routing, 2008. HPSR 2008. International Conference on. IEEE,
2008, pp. 110–115.
[27] R. Rojas-Cessa, E. Oki, Z. Jing, and H. J. Chao, “CIXB-1: combined
input-one-cell-crosspoint buffered switch,” in High Performance Switch-
ing and Routing, 2001 IEEE Workshop on, 2001, pp. 324–329.
[28] R. Rojas-Cessa, E. Oki, and H. J. Chao, “CIXOB-k: Combined input-
crosspoint-output buffered packet switch,” in Global Telecommunica-
tions Conference, 2001. GLOBECOM’01. IEEE, vol. 4. IEEE, 2001,
pp. 2654–2660.
[29] A. Mekkittikul and N. McKeown, “A practical scheduling algorithm to
achieve 100% throughput in input-queued switches,” in INFOCOM’98.
Seventeenth Annual Joint Conference of the IEEE Computer and Com-
munications Societies. Proceedings. IEEE, vol. 2. IEEE, 1998, pp.
792–799.
