Efficient Scheduling for SDMG CIOQ Switches by Yang, Mei & Zheng, S. Q.
Electrical and Computer Engineering Faculty
Publications Electrical & Computer Engineering
2006
Efficient scheduling for SDMG CIOQ switches
Mei Yang
University of Nevada, Las Vegas, mei.yang@unlv.edu
S. Q. Zheng
The University of Texas at Dallas, sizheng@utdallas.edu
Citation Information
Yang, M., Zheng, S. Q. (2006). Efficient scheduling for SDMG CIOQ switches. IEICE Transactions on Communications, E89-B(9),
2457-2468.
http://digitalscholarship.unlv.edu/ece_fac_articles/659
IEICE TRANS. COMMUN., VOL.E89-B, NO.9 SEPTEMBER 2006
PAPER
Efficient Scheduling for SDMG CIOQ Switches* 
SUMMARY Combined input and output queuing (CIOQ) switches are 
being considered as high-performance switch architectures due to their 
ability to achieve 100% throughput and perfectly emulate output queu-
ing (OQ) switch performance with a small speedup factor S. To realize a 
speedup factor S, a conventional CIOQ switch requires the switching fabric
and memories to operate S times faster than the line rate. In this paper, we 
propose to use a CIOQ switch with space-division multiplexing expansion 
and grouped input/output ports (SDMG CIOQ switch for short) to realize 
speedup while only requiring the switching fabric and memories to operate 
at the line rate. The cell scheduling problem for the SDMG CIOQ switch 
is abstracted as a bipartite k-matching problem. Using fluid model tech-
niques, we prove that any maximal size k-matching algorithm on an SDMG 
CIOQ switch with an expansion factor 2 can achieve 100% throughput as-
suming input line arrivals satisfy the strong law of large numbers (SLLN) 
and no input/output line is oversubscribed. We further propose an efficient 
and starvation-free maximal size k-matching scheduling algorithm, kFRR, 
for the SDMG CIOQ switch. Simulation results show that kFRR achieves 
100% throughput for SDMG CIOQ switches with an expansion factor 2 
under two SLLN traffic models, uniform traffic and polarized traffic, con-
firming our analysis. 
key words: CIOQ switch, cell scheduling, maximal size matching, speedup 
1. Introduction 
Output queuing (OQ) switches are employed for many com-
mercial switching systems today due to their ability to max-
imize throughput and provide quality of service (QoS) guar-
antees. However, OQ switches are not scalable for high line 
rates and/or large numbers of ports since the switching fab-
ric and memories for an NxN OQ switch are required to run 
N times faster than the line rate. On the other hand, input 
queuing (IQ) switches are scalable with their switching fab-
ric and memories operating at the line rate, but IQ switches 
have a limited throughput because of head-of-line (HOL) 
blocking and cannot provide QoS guarantees. To reduce 
the speed requirement of the switching fabric and memo-
ries of OQ switches and improve the switch performance 
of IQ switches, combined input and output queuing (CIOQ) 
switches have been proposed. CIOQ switches are being considered as
high-performance switch architectures due to their ability to achieve
100% throughput and even emulate OQ switch performance with a small
speedup factor [2]. Figure 1 shows an N x N CIOQ switch. To remove
head-of-line (HOL) blocking [13], each input port $I_i$ maintains N
virtual output queues (VOQs) with $Q_{i,j}$ buffering packets destined
for output port $O_j$. With an internal speedup larger than 1, packets
need to be buffered at outputs as well.
Fig. 1 An N x N CIOQ switch.
Mei YANG†a) and Si Qing ZHENG††b), Nonmembers
Manuscript received May 30, 2005.
Manuscript revised December 4, 2005.
†The author is with the Department of Electrical and Computer
Engineering, University of Nevada, Las Vegas, Las Vegas, NV 89154 USA.
††The author is with the Department of Computer Science, University of
Texas at Dallas, Richardson, TX 75080 USA.
*Part of the results has been presented at IEEE INFOCOM 2003 [28].
a) E-mail: meiyang@egr.unlv.edu
b) E-mail: sizheng@utdallas.edu
DOI: 10.1093/ietcom/e89-b.9.2457
In this paper, we assume that all switches we discuss 
are cell based. In such a switch, variable-length packets 
are segmented into fixed-size cells upon arrival, transferred 
through the switching fabric, and reassembled back into 
original packets before they depart the switch. Time is divided
into cell slots, and one cell slot equals the transmission
time of a cell on the input/output line. In each cell
slot, a scheduling algorithm selects a matching between in-
put ports and output ports such that no input (resp. output) 
port may be matched to more than one output (resp. input) 
port. Fixed-size cells and slotted time switching make it 
easier for the scheduler to configure the switching fabric for 
high throughput [16]. 
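The segmentation and reassembly described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical 48-byte cell payload and a zero-padded final cell; neither detail is specified in this paper, and the function names are illustrative only.

```python
from typing import List, Tuple

CELL_PAYLOAD = 48  # hypothetical payload size in bytes (assumption)


def segment(packet: bytes, size: int = CELL_PAYLOAD) -> List[Tuple[bool, bytes]]:
    """Split a variable-length packet into fixed-size cells.

    Each cell is (is_last, payload); the final payload is zero-padded.
    """
    chunks = [packet[i:i + size] for i in range(0, len(packet), size)] or [b""]
    cells = []
    for idx, chunk in enumerate(chunks):
        is_last = idx == len(chunks) - 1
        cells.append((is_last, chunk.ljust(size, b"\x00")))
    return cells


def reassemble(cells: List[Tuple[bool, bytes]], length: int) -> bytes:
    """Concatenate cell payloads and strip the padding of the last cell."""
    data = b"".join(payload for _, payload in cells)
    return data[:length]
```

Because every cell has the same size, each one occupies exactly one cell slot on a line, which is what allows the scheduler to work in slotted time.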
The cell scheduling problem for VOQ based switches 
can be modelled as a bipartite matching problem [16]. Although
maximum weight matching algorithms are proved to achieve 100%
throughput for all admissible independent and identically
distributed (i.i.d.) arrivals [17], they are infeasible for high
speed implementation with their time complexity of
$O(N^3 \log N)$ [25]. The most efficient maximum size
matching algorithm has a time complexity of $O(N^{2.5})$ [11],
[25]. However, maximum size matching algorithms are too
complex for hardware implementation and can cause unfairness
[17]. Most practical scheduling algorithms proposed,
such as parallel iterative matching (PIM) [1], iSLIP [16],
dual round-robin matching (DRRM) [5], first come first
serve in round-robin matching (FIRM) [22], static round-robin
(SRR) [12], the iterative ping-pong arbitration (PPA) scheme [4],
and round-robin priority matching (RRPM) [14], are iterative
algorithms that find a maximal size matching to approximate a
maximum size matching.
Copyright © 2006 The Institute of Electronics, Information and Communication Engineers
A switch with a speedup factor S can remove up to S 
cells from each input port and deliver up to S cells to each 
output port within one cell slot. Hence, an IQ switch has 
a speedup of 1, an OQ switch has a speedup of N, and a 
CIOQ switch has a speedup between 1 and N. It has been 
shown that a CIOQ switch with a speedup of 4 or 2 can 
exactly emulate an OQ switch by employing stable match-
ing [9] based algorithms, such as the most urgent cell first 
algorithm (MUCFA) [21], the critical cell first (CCF) algo-
rithm [6], and the just preferred matching (JPM) algorithm 
[24]. Unfortunately, these scheduling algorithms are highly
impractical due to their high time complexity ($O(N^2)$
iterations).
In [8], Dai and Prabhakar proved that a CIOQ switch with S = 2
employing any maximal size matching algorithm can achieve
100% throughput for arbitrarily distributed input patterns,
provided that input arrivals satisfy the strong law of large
numbers (SLLN) and no input/output is oversubscribed.
Since almost all real traffic processes satisfy these
properties, this result has high practical significance for at 
least two reasons. First, achieving 100% throughput is 
a necessary condition for a CIOQ switch to realize OQ-
equivalent quality of service (QoS) guarantees with care-
fully designed queuing disciplines at each VOQ and at each 
output queue. Second, maximal size matching algorithms 
are easier to implement than maximum size matching al-
gorithms or stable matching algorithms. In addition, it is 
shown that CIOQ switches with any maximal size matching
algorithm perform as well as OQ switches in terms of
delay under Bernoulli i.i.d. arrival traffic [23].
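To illustrate the difference between a maximal and a maximum size matching, the sketch below builds a maximal matching greedily from a Boolean request matrix. The function names and the fixed scan order are illustrative assumptions; the practical algorithms cited above use pointers and iterations rather than this simple scan.

```python
from typing import Dict, List


def greedy_maximal_matching(requests: List[List[bool]]) -> Dict[int, int]:
    """Return {input: output}; requests[i][j] is True if VOQ (i, j) is non-empty."""
    n = len(requests)
    out_taken = [False] * n
    match: Dict[int, int] = {}
    for i in range(n):
        for j in range(n):
            if requests[i][j] and not out_taken[j]:
                match[i] = j
                out_taken[j] = True
                break
    return match


def is_maximal(requests: List[List[bool]], match: Dict[int, int]) -> bool:
    """A matching is maximal if no request joins an unmatched input and output."""
    n = len(requests)
    free_out = set(range(n)) - set(match.values())
    return not any(
        requests[i][j] for i in range(n) if i not in match for j in free_out
    )
```

With requests `[[True, True], [True, False]]` the greedy scan matches input 0 to output 0 and stops with one match. That matching is maximal (nothing can be added), yet a maximum size matching of size 2 exists (input 0 to output 1, input 1 to output 0).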
In the conventional scheme, realizing speedup for CIOQ switches
requires the switching fabric and memories to run S times faster
than the line rate. Under current technology, the switching
fabric can support up to 3.6 Gbps line rate [26]. On the other
hand, advances in fiber-optic transmission technologies have
greatly increased optical transmission rates. Each individual
channel now can operate at OC-192 (10 Gbps) or even OC-768
(40 Gbps). Although silicon technologies have advanced rapidly,
the gap
between the data rate that optical transmission technology 
can deliver and the switching speed that electronic switch-
ing fabric can provide is becoming wider and wider [18]. 
Thus it may not always be feasible to run the switching 
fabric much faster than the line rate. Memories with suf-
ficient access rate are simply not available for high line rate 
due to the limitation of current semiconductor technology. 
Even with fast switching fabric and memories, it may not 
be possible to run the cell scheduling algorithm fast enough 
to realize switch speedup greater than 1. In [27], we proposed
pipelined maximal size matching algorithms to relax
the running time for the scheduling algorithm for CIOQ
switches with speedup. However, the running time for the 
switching fabric and memories is not relaxed. 
To relax the stringent timing requirement of the opera-
tion speed of the switching fabric and memories, we intro-
duce a CIOQ switch architecture with space-division mul-
tiplexing expansion and grouped input/output ports, short-
ened as an SDMG CIOQ switch. In an SDMG CIOQ switch, 
the number of connections between each input/output port 
and the switching fabric is increased, but the switching fabric
only needs to run as fast as the line rate. We define the
expansion factor of an SDMG CIOQ switch as the ratio of
the number of connections between an input/output port and
the switching fabric to the number of input/output lines
associated with an input/output port.
We model the cell scheduling problem for SDMG 
CIOQ switches as a bipartite k-matching problem. Using 
fluid model techniques, we prove that any maximal size k-
matching algorithm for an SDMG CIOQ switch with an ex-
pansion factor 2 can achieve 100% throughput assuming that 
input line traffic arrivals satisfy SLLN and no input/output 
line is oversubscribed. We propose the k-connection FIRM-
based round-robin (kFRR) algorithm to find maximal size 
k-matchings on SDMG CIOQ switches. Through simu-
lations, we show that the kFRR algorithm achieves 100% 
throughput for SDMG CIOQ switches with an expansion 
factor 2 under two SLLN traffic models: uniform traffic 
(both Bernoulli arrivals and bursty arrivals) and polarized 
traffic. This confirms our analysis based on fluid model tech-
niques. The advantage of the proposed scheme compared to 
existing schemes is that it achieves the same performance as 
switches with speedup but only requires the switching fabric 
and memories to operate at the same speed as the line rate. 
The remainder of this paper is organized as follows. 
Section 2 presents the SDMG CIOQ switch architecture. 
Section 3 defines and models the cell scheduling problem 
for SDMG CIOQ switches. Section 4 gives an analysis of 
the expansion factor that is sufficient for an SDMG CIOQ 
switch to achieve 100% throughput. Section 5 describes 
the kFRR scheduling algorithm and discusses its proper-
ties. Section 6 presents the simulation results of kFRR. Sec-
tion 7 discusses possible hardware implementation schemes 
for the kFRR algorithm. Section 8 concludes the paper. 
2. SDMG CIOQ Switches 
We assume all the switch architectures we discuss are cell 
based. To realize the speedup required for a CIOQ switch, 
we consider an alternative CIOQ switch architecture with 
more connections between each input/output port and the 
switching fabric. We generalize this CIOQ switch architec-
ture by grouping multiple input/output lines into one port. 
The purpose of introducing grouped input/output ports is 
to achieve better buffer utilization [20], improve scheduling 
performance [19], and balance switch input/output loads. 
We name such a CIOQ switch a CIOQ switch with space-division
multiplexing expansion and grouped input/output
ports (SDMG CIOQ switch for short). Figure 2 shows an
N x N SDMG CIOQ switch, where N is the number of
input/output lines.
Fig. 2 An SDMG CIOQ switch.
The characteristics of the SDMG CIOQ switch are 
listed as follows. 
• It has $N/g$ (grouped) input ports denoted as $I_i$'s, and
$N/g$ (grouped) output ports denoted as $O_j$'s, where
$1 \le i, j \le N/g$. Input port $I_i$ groups input lines $L_l$'s,
$(i-1)g + 1 \le l \le ig$, and output port $O_j$ groups output
lines $M_m$'s, $(j-1)g + 1 \le m \le jg$. $g$ is called the group
factor, $1 \le g \le N$. In practice, $g$ is selected appropriately
to balance the performance and implementation complexity.
• Each input port $I_i$ maintains $N/g$ VOQs with $Q_{i,j}$
buffering cells destined for output port $O_j$, $1 \le i, j \le N/g$.
• Each output port $O_j$ maintains $g$ output queues, each
associated with an output line.
• It has an $Nk/g \times Nk/g$ switching fabric with $k$
connections to each input/output port. We assume that the
switching fabric is non-blocking or rearrangeable non-blocking.
$k$ is called the port connection factor, and it
is assumed that $k \ge g$.
• Cells belonging to one VOQ of an input port may
be transmitted through the switching fabric simultaneously.
The sequence of cells can be kept by appropriately
setting the switching fabric such that the cell order
is consistent with the connection order.
• A cell in an input port can be switched to its destination
output port through any of the $k$ connections between
the input port and the switching fabric and any of the $k$
connections between the switching fabric and the
destination output port.
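The parameters listed above can be collected in a small helper. This is only a sketch: the class and method names are hypothetical, and it assumes that N is divisible by g so that N/g is an integer number of ports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SDMGParams:
    N: int  # number of input/output lines
    g: int  # group factor (lines per port)
    k: int  # port connection factor (connections per port)

    def __post_init__(self) -> None:
        if not (1 <= self.g <= self.N) or self.N % self.g != 0:
            raise ValueError("need 1 <= g <= N and g dividing N (assumption)")
        if self.k < self.g:
            raise ValueError("the paper assumes k >= g")

    @property
    def ports(self) -> int:
        """Number of grouped input (and output) ports, N/g."""
        return self.N // self.g

    @property
    def fabric_size(self) -> int:
        """The switching fabric is fabric_size x fabric_size, i.e. Nk/g."""
        return self.ports * self.k

    @property
    def expansion_factor(self) -> float:
        """Ratio of connections to lines at a port, k/g."""
        return self.k / self.g

    def lines_of_port(self, i: int) -> range:
        """Lines grouped by port i (1-based): (i-1)g+1 .. ig."""
        return range((i - 1) * self.g + 1, i * self.g + 1)

    def port_of_line(self, line: int) -> int:
        """Port (1-based) that groups the given line (1-based)."""
        return (line - 1) // self.g + 1
```

For example, `SDMGParams(N=16, g=4, k=8)` describes a switch with 4 grouped ports, a 32 x 32 fabric, and an expansion factor of 2.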
We define $P = k/g$ as the expansion factor of an SDMG
CIOQ switch. To relax the memory access rate, the interface
between each VOQ (or output queue) and the switching
fabric is expanded to multiple copies to allow more than
one cell to be transmitted from a VOQ (or into an output
queue). Figure 3(a) shows a possible queuing scheme at an
input port, in which each VOQ is composed of $k$ sub-queues.
Cells belonging to one VOQ are buffered in sub-queues
following the order of $1, 2, \ldots, k$. Since $k \ge g$, it is feasible for
each VOQ to receive up to $g$ cells (one to each sub-queue)
coming from different input lines without speedup. A queue
controller (QC) is used to select up to $g$ out of $k$ sub-queues
to receive these incoming cells. Since up to $k$ cells (in different
sub-queues) from one VOQ may be sent to the switching
fabric through up to $k$ connections in one cell slot, an
interconnection controller (IC) is used as the interface
between each VOQ and the $k$ connections to ensure the cells
are sent in the same sequence as they arrive. Figure 3(b)
shows a possible queuing scheme at an output port, where
each output queue is composed of $k$ sub-queues. An IC is
used as the interface between $k$ connections and each output
queue to ensure the cells enter the sub-queues in the same
sequence as they are transmitted on the $k$ connections. Each
output queue is connected to its corresponding output line.
In this scheme, since in one cell slot at most one cell enters
or leaves a sub-queue, memory speedup is not needed.
Fig. 3 (a) Queue structure at the input port. (b) Queue structure at the
output port.
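One way to read the sub-queue scheme above is as a cyclic write and read discipline over the k sub-queues. The sketch below is an interpretation under that assumption, not the authors' controller design; the class name and its methods are hypothetical.

```python
from collections import deque
from typing import Any, List


class SubQueuedVOQ:
    """A VOQ made of k sub-queues, written and read in cyclic order 1..k."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.sub = [deque() for _ in range(k)]
        self.tail = 0  # sub-queue that receives the next arriving cell
        self.head = 0  # sub-queue that holds the oldest queued cell

    def enqueue(self, cells: List[Any]) -> None:
        """Store the cells of one cell slot, at most one per sub-queue."""
        if len(cells) > self.k:
            raise ValueError("at most k cells can arrive in one cell slot")
        for cell in cells:
            self.sub[self.tail].append(cell)
            self.tail = (self.tail + 1) % self.k

    def dequeue(self, count: int) -> List[Any]:
        """Remove up to min(count, k) cells in arrival order, one per sub-queue."""
        out = []
        for _ in range(min(count, self.k)):
            queue = self.sub[self.head]
            if not queue:  # the VOQ is empty
                break
            out.append(queue.popleft())
            self.head = (self.head + 1) % self.k
        return out
```

The n-th cell of the VOQ always lands in sub-queue n mod k, so reading the sub-queues in the same cyclic order returns cells in arrival order, and no sub-queue is written or read more than once per cell slot.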
In [19], Obara et al. proposed a similar switch architecture
to enhance the scheduling performance for an ATM
switch. Our purpose in using the SDMG CIOQ switch architecture
is to achieve speedup while only requiring the switching
fabric and memories to operate at the same speed as the
line rate. The tradeoff of the SDMG CIOQ switch is the 
increased complexity of the switching fabric and the added 
QCs and ICs in input/output ports. With current semicon-
ductor technology, it is feasible to implement the SDMG 
CIOQ switch with regular size g and k (for the switch sizes 
discussed in Sect. 6). 
3. Cell Scheduling for SDMG CIOQ Switches 
For an SDMG CIOQ switch, the scheduling algorithm needs 
to determine a conflict-free switching fabric setting for 
switching cells from input ports to output ports in each cell 
slot. The cell scheduling problem for the SDMG CIOQ 
switch can be modelled as a bipartite k-matching problem on 
the graph $G = (V, E)$, where $V = V_1 \cup V_2$, $V_1$ = {input ports},
$V_2$ = {output ports}, $|V_1| = |V_2| = N/g$, and $E$ = {connection
requests from input ports to output ports}. Let $M = |E|$. In
each cell slot, there might be up to k connection requests 
from an input port to an output port. Therefore, G may not 
be a simple graph since there may be more than one edge 
between one pair of nodes. 
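The request multigraph can be represented by a matrix of edge multiplicities. The sketch below assumes that a VOQ holding z cells issues min(z, k) connection requests, which is consistent with the bound of k requests per port pair stated above; the function names are illustrative.

```python
from typing import List


def request_matrix(Z: List[List[int]], k: int) -> List[List[int]]:
    """R[i][j] = number of parallel edges (requests) from input i to output j."""
    return [[min(z, k) for z in row] for row in Z]


def edge_count(R: List[List[int]]) -> int:
    """M = |E|, the total number of connection requests."""
    return sum(sum(row) for row in R)
```

For queue lengths `[[3, 0], [1, 5]]` and k = 2 this gives `[[2, 0], [1, 2]]`, so M = 5 and the port pairs (1, 1) and (2, 2) are each joined by two parallel edges.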
A k-matching is an edge set $\mathcal{K} \subseteq E$ such that no node
of G is incident to more than k edges in $\mathcal{K}$, where $k \ge 1$.
A matching is a special case of k-matching with k = 1. A
match is an edge $(i, j) \in \mathcal{K}$. A perfect k-matching $\mathcal{K}$ is one
in which each node of G is incident to k edges in $\mathcal{K}$. A maximum
size k-matching is one with the maximum number of edges.
A maximal size k-matching is one that is not contained in
any other k-matching.
Fact 1: For a maximal size k-matching $\mathcal{K}$ of G, all the
following statements are true. (1) The number of matches
between any $I_i$ and any $O_j$ in $\mathcal{K}$ is less than or equal to k.
(2) The total number of matches between any $I_i$ and all $O_j$'s
in $\mathcal{K}$ is less than or equal to k. (3) The number of matches
between all $I_i$'s and any $O_j$ in $\mathcal{K}$ is less than or equal to k.
(4) If there are at least k connection requests between some
$I_i$ and some $O_j$, then at least one of the following holds:
(a) $I_i$ has k matches to some output ports, or (b) $O_j$ has k
matches to some input ports.
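The definitions above translate directly into a checker. In this sketch a k-matching is stored as a matrix K of match counts and the requests as a matrix R of edge multiplicities; both representations and the function names are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[int]]


def is_k_matching(K: Matrix, R: Matrix, k: int) -> bool:
    """K uses only requested edges and no port has more than k matches."""
    n = len(K)
    within_requests = all(0 <= K[i][j] <= R[i][j] for i in range(n) for j in range(n))
    rows_ok = all(sum(K[i]) <= k for i in range(n))
    cols_ok = all(sum(K[i][j] for i in range(n)) <= k for j in range(n))
    return within_requests and rows_ok and cols_ok


def is_maximal_k_matching(K: Matrix, R: Matrix, k: int) -> bool:
    """Maximal: no unserved request joins two ports that both have spare connections."""
    if not is_k_matching(K, R, k):
        return False
    n = len(K)
    row = [sum(K[i]) for i in range(n)]
    col = [sum(K[i][j] for i in range(n)) for j in range(n)]
    for i in range(n):
        for j in range(n):
            if K[i][j] < R[i][j] and row[i] < k and col[j] < k:
                return False
    return True
```

With `R = [[2, 2], [2, 2]]` and k = 2, the matrix `[[2, 0], [0, 0]]` is a 2-matching but not a maximal one, because the second input and the second output both still have free connections and an unserved request between them.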
Figure 4 compares a maximum size 2-matching and a
maximal size 2-matching for a 4 x 4 SDMG CIOQ switch
with g = 1 and k = 2. With the maximum size 2-matching
shown in Fig. 4(b), $Q_{1,1}$, $Q_{1,3}$, $Q_{2,2}$, $Q_{2,4}$, and $Q_{3,2}$ will be
served.
Fig. 4 A maximum size 2-matching and a maximal size 2-matching of a
4 x 4 SDMG CIOQ switch: (a) request graph, (b) maximum 2-matching,
(c) maximal 2-matching.
As a special case of the bipartite b-matching prob-
lem [7], the maximum bipartite k-matching problem can 
be transformed to a maximum-flow problem in O(M) time. 
Since the transformed flow network is a unit network [25], 
we can use Dinic's algorithm to solve the corresponding 
maximum-flow problem in $O(\sqrt{N} M)$ time [25]. However,
this algorithm is too complex for hardware implementation. 
Another noticeable problem with maximum size k-matching 
algorithms is that they may cause unfairness. For example, 
in Fig. 4, if $Q_{1,1}$, $Q_{1,3}$, $Q_{2,2}$, $Q_{2,4}$, and $Q_{3,2}$ continue having
requests and other VOQs continue having no request in successive
cell slots, then $Q_{1,2}$ may get starved since edge (1, 2)
does not belong to any maximum size 2-matching.
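The reduction to maximum flow can be sketched as follows. Note the substitutions: instead of the unit network and Dinic's algorithm mentioned above, this illustration uses integer capacities (k on the source and sink edges, the request multiplicity on the middle edges) and plain breadth-first augmenting paths, which is simpler but slower. The function name and matrix representation are assumptions.

```python
from collections import deque
from typing import List, Tuple


def max_k_matching(R: List[List[int]], k: int) -> Tuple[int, List[List[int]]]:
    """Return (size, K) of a maximum size k-matching for request matrix R."""
    n = len(R)
    src, dst, size = 2 * n, 2 * n + 1, 2 * n + 2
    cap = [[0] * size for _ in range(size)]
    for i in range(n):
        cap[src][i] = k          # at most k matches per input port
        cap[n + i][dst] = k      # at most k matches per output port
        for j in range(n):
            cap[i][n + j] = R[i][j]
    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual network.
        parent = [-1] * size
        parent[src] = src
        queue = deque([src])
        while queue and parent[dst] == -1:
            u = queue.popleft()
            for v in range(size):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[dst] == -1:
            break
        # Find the bottleneck capacity, then push that much flow.
        bottleneck, v = k, dst
        while v != src:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = dst
        while v != src:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        total += bottleneck
    K = [[R[i][j] - cap[i][n + j] for j in range(n)] for i in range(n)]
    return total, K
```

On the hypothetical request matrix `[[1, 1], [1, 0]]` with k = 1, the only maximum size matching is `[[0, 1], [1, 0]]`, so the request from the first input to the first output is never served if this pattern persists. This is the same kind of starvation as in the example above.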
For practical use, we desire scheduling algorithms to 
be fast, starvation-free, easy to implement, and of high 
throughput [16]. Compared with maximum size k-matching 
algorithms, maximal size k-matching algorithms are simpler
and can avoid unfairness. However, how good can
the performance of maximal size k-matching algorithms
be? In the following, we give an analysis based on fluid
model techniques.
4. Analysis of Maximal Size k-Matching Algorithms 
We follow the definitions of SLLN and fluid model used in 
[8]. We define $A'_{l,m}(n)$ as the cumulative number of cells
that have arrived from input line $L_l$ destined for output line
$M_m$ up to cell slot n. We assume that the arrival processes
$\{A'_{l,m}(\cdot),\ l, m = 1, \ldots, N\}$ satisfy a strong law of large
numbers (SLLN), i.e. with probability one,
$$\lim_{n \to \infty} \frac{A'_{l,m}(n)}{n} = \lambda'_{l,m}, \quad l, m = 1, \ldots, N, \tag{1}$$
where $\lambda'_{l,m}$ is called the cell arrival rate from input line $L_l$ to
output line $M_m$. We also assume that no input/output line is
oversubscribed, i.e.
$$\forall\, 1 \le l, m \le N, \quad \sum_{m=1}^{N} \lambda'_{l,m} \le 1, \quad \sum_{l=1}^{N} \lambda'_{l,m} \le 1. \tag{2}$$
Equations (1) and (2) are very mild conditions. Almost all
real traffic processes satisfy these two equations. Let $D'_{l,m}(n)$ be
the cumulative number of cells from input line $L_l$ that have
departed on output line $M_m$ up to cell slot n. An SDMG CIOQ
switch under a k-matching algorithm is said to be work conserving if
$$\lim_{n \to \infty} \frac{\sum_{l} D'_{l,m}(n)}{n} = \sum_{l} \lambda'_{l,m} \tag{3}$$
for any (traffic) arrival satisfying Eqs. (1) and (2), i.e. the
long-run fraction of time that output line $M_m$ ($1 \le m \le N$)
is busy is equal to the cell arrival rate at the output line.
This is equivalent to saying that the SDMG CIOQ switch
can achieve 100% throughput if there is enough offered load.
We define $A_{i,j}(n)$ as the cumulative number of cells
arrived at $Q_{i,j}$ (i.e. cells arriving at input port $I_i$ and destined
for output port $O_j$) up to cell slot n. For arrival processes
$A'_{l,m}(\cdot)$'s satisfying Eq. (1), we derive that arrival processes
$\{A_{i,j}(\cdot),\ i, j = 1, \ldots, N/g\}$ also satisfy SLLN since with
probability one,
$$\lim_{n \to \infty} \frac{A_{i,j}(n)}{n} = \lim_{n \to \infty} \frac{\sum_{m=(j-1)g+1}^{jg} \sum_{l=(i-1)g+1}^{ig} A'_{l,m}(n)}{n} = \sum_{m=(j-1)g+1}^{jg} \sum_{l=(i-1)g+1}^{ig} \lambda'_{l,m} = \lambda_{i,j}, \quad i, j = 1, \ldots, N/g. \tag{4}$$
We call $\lambda_{i,j}$ the cell arrival rate at $Q_{i,j}$. For arrival processes
$A'_{l,m}(\cdot)$'s satisfying Eq. (2), no input/output port is
oversubscribed since
$$\forall\, 1 \le i, j \le N/g, \quad \sum_{j=1}^{N/g} \lambda_{i,j} = \sum_{m=1}^{N} \sum_{l=(i-1)g+1}^{ig} \lambda'_{l,m} \le g, \quad \sum_{i=1}^{N/g} \lambda_{i,j} = \sum_{l=1}^{N} \sum_{m=(j-1)g+1}^{jg} \lambda'_{l,m} \le g. \tag{5}$$
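The aggregation in Eqs. (4) and (5) can be checked numerically. The sketch below assumes 0-based indices, so that line l belongs to port l // g, and uses a small tolerance for floating-point sums; the function names are illustrative.

```python
from typing import List

Rates = List[List[float]]


def lines_admissible(lam_line: Rates, tol: float = 1e-9) -> bool:
    """Eq. (2): every input line and every output line carries load <= 1."""
    n = len(lam_line)
    rows_ok = all(sum(row) <= 1 + tol for row in lam_line)
    cols_ok = all(sum(lam_line[l][m] for l in range(n)) <= 1 + tol for m in range(n))
    return rows_ok and cols_ok


def port_rates(lam_line: Rates, g: int) -> Rates:
    """Eq. (4): sum the line-level rates over the g x g block of each port pair."""
    n = len(lam_line)
    ports = n // g
    lam = [[0.0] * ports for _ in range(ports)]
    for l in range(n):
        for m in range(n):
            lam[l // g][m // g] += lam_line[l][m]
    return lam


def ports_admissible(lam: Rates, g: int, tol: float = 1e-9) -> bool:
    """Eq. (5): every input port and every output port carries load <= g."""
    p = len(lam)
    rows_ok = all(sum(row) <= g + tol for row in lam)
    cols_ok = all(sum(lam[i][j] for i in range(p)) <= g + tol for j in range(p))
    return rows_ok and cols_ok
```

For uniform traffic on 4 lines with every line-level rate equal to 0.25 and g = 2, each port-level rate is 1.0 and each port carries a total load of exactly g = 2, which is the boundary case of Eq. (5).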
Let $D_{i,j}(n)$ be the number of cells departing from $Q_{i,j}$
up to cell slot n. We say an SDMG CIOQ switch under a
matching algorithm is VOQ rate stable if with probability
one,
$$\lim_{n \to \infty} \frac{D_{i,j}(n)}{n} = \lambda_{i,j}, \quad i, j = 1, \ldots, N/g \tag{6}$$
for any arrival process satisfying Eq. (4). An SDMG
CIOQ switch is said to be port conserving if Eqs. (5) and (6)
hold, which means that the cell arrival rate at output port $O_j$
satisfies
$$\lim_{n \to \infty} \frac{\sum_{i=1}^{N/g} D_{i,j}(n)}{n} \le g, \quad 1 \le j \le N/g.$$
Let an $N/g \times N/g$ matrix $Z(n)$ be the request matrix at
cell slot n, where $Z_{i,j}(n)$ denotes the number of cells in $Q_{i,j}$
at the beginning of cell slot n. A maximal size k-matching
algorithm $\mathcal{H}$ determines a matrix $\pi(n)$ in cell slot n, where
$\pi_{i,j}(n)$, $1 \le i, j \le N/g$, indicates how many cells can be
transmitted from input port $I_i$ to output port $O_j$ during cell
slot n. Since $\mathcal{H}$ is a maximal size k-matching algorithm, we
have the following equations based on Fact 1.
$$\forall\, 1 \le i, j \le N/g, \quad \pi_{i,j}(n) \le k, \tag{7}$$
$$\forall\, 1 \le i \le N/g, \quad \sum_{j=1}^{N/g} \pi_{i,j}(n) \le k, \tag{8}$$
$$\forall\, 1 \le j \le N/g, \quad \sum_{i=1}^{N/g} \pi_{i,j}(n) \le k, \tag{9}$$
$$\forall\, 1 \le i, j \le N/g, \quad \sum_{j=1}^{N/g} \pi_{i,j}(n) + \sum_{i=1}^{N/g} \pi_{i,j}(n) \ge k, \quad \text{if } Z_{i,j}(n) \ge k. \tag{10}$$
We define $T^{\mathcal{H}}_{\pi}(n)$ as the cumulative amount of time that
permutation $\pi$ determined by $\mathcal{H}$ has been used by cell slot
n. Assuming that $\Pi$ is the set of matrices determined by $\mathcal{H}$
that satisfy Eqs. (7)-(10), the following equations hold for
the SDMG CIOQ switch:
$$Z_{i,j}(n) = Z_{i,j}(0) + A_{i,j}(n) - D_{i,j}(n),$$
$$D_{i,j}(n) = \sum_{\pi \in \Pi} \pi_{i,j} T^{\mathcal{H}}_{\pi}(n),$$
$$\sum_{\pi \in \Pi} T^{\mathcal{H}}_{\pi}(n) = n,$$
where $n \ge 0$ and $i, j = 1, \ldots, N/g$.
Consider a deterministic, continuous fluid model of the
SDMG CIOQ switch shown in Fig. 2 operating under the
maximal size k-matching algorithm $\mathcal{H}$ with offered arrivals
satisfying Eq. (4). For each $t \ge 0$ and $i, j = 1, \ldots, N/g$, the
fluid model is governed by the following set of equations:
$$Z_{i,j}(t) = Z_{i,j}(0) + \lambda_{i,j} t - D_{i,j}(t) \ge 0, \tag{11}$$
$$\dot{D}_{i,j}(t) = \sum_{\pi \in \Pi} \pi_{i,j} \dot{T}^{\mathcal{H}}_{\pi}(t) \ge 0, \tag{12}$$
$$T^{\mathcal{H}}_{\pi}(\cdot) \text{ is nondecreasing, and } \sum_{\pi \in \Pi} T^{\mathcal{H}}_{\pi}(t) = t, \tag{13}$$
where $\dot{f}$ denotes the derivative of function f at t, assuming
f is differentiable at t. A solution to Eqs. (11)-(13) is
said to be a fluid model solution. The fluid model of an
SDMG CIOQ switch operating under a k-matching algorithm
is said to be VOQ weakly stable if every fluid model
solution (D, T, Z) has $Z(t) = 0$ for $t \ge 0$.
Theorem 1: An SDMG CIOQ switch under a k-matching
algorithm is VOQ rate stable if its fluid model is VOQ
weakly stable.
For a detailed proof, please refer to the proof of Theorem
3 in [8]. We only give an intuitive explanation here. By
Eq. (11), we get $D_{i,j}(t) = \lambda_{i,j} t$ for $Z_{i,j}(t) \equiv 0$, $t \ge 0$. Hence
$\lim_{t \to \infty} D_{i,j}(t)/t = \lambda_{i,j}$.
Before we go further, we first state a simple lemma [8].
Lemma 1: Let $f : [0, \infty) \to [0, \infty)$ be an absolutely continuous
function with $f(0) = 0$. Assume that $\dot{f}(t) \le 0$ for
almost every t (wrt Lebesgue measure) such that $f(t) > 0$
and f is differentiable at t. Then $f(t) \equiv 0$ for almost every
$t \ge 0$.
For the proof, please refer to the proof of Lemma 1 in [8].
Let (D, T, Z) be a fluid model solution satisfying
Eqs. (11)-(13) with $Z(0) = 0$. Let $R_i(t) = \sum_j Z_{i,j}(t)$ denote
the total amount of fluid queued at input port $I_i$ at time
t and $S_j(t) = \sum_i Z_{i,j}(t)$ be the total amount of fluid destined
for output port $O_j$ and queued at some input ports at time t.
Define $C_{i,j}(t) = R_i(t) + S_j(t)$. In addition to the fluid model
Eqs. (11)-(13), we have the following lemma.
Lemma 2: For an SDMG CIOQ switch with expansion
factor $P = k/g$ operating under a maximal size k-matching
algorithm, each fluid limit must satisfy the following equation
for $1 \le i, j \le N/g$:
$$\dot{C}_{i,j}(t) \le \sum_{j=1}^{N/g} \lambda_{i,j} + \sum_{i=1}^{N/g} \lambda_{i,j} - k, \quad \text{whenever } Z_{i,j}(t) > 0. \tag{14}$$
The proof is given in Appendix A.
Theorem 2: For an SDMG CIOQ switch shown in Fig. 2,
any maximal size k-matching algorithm with $k = 2g$, i.e.
$P = k/g = 2$, can achieve 100% throughput assuming that
input line arrivals satisfy SLLN and no input/output line is
oversubscribed.
The proof is given in Appendix B.
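As a brief arithmetic check of why an expansion factor of 2 is the natural threshold (a sketch only; the complete argument is the one in Appendix B), the bounds of Eq. (5) can be substituted into Eq. (14) with $k = 2g$:

```latex
\dot{C}_{i,j}(t)
  \;\le\; \sum_{j=1}^{N/g} \lambda_{i,j} + \sum_{i=1}^{N/g} \lambda_{i,j} - k
  \;\le\; g + g - 2g
  \;=\; 0,
  \qquad \text{whenever } Z_{i,j}(t) > 0 .
```

In words, while $Q_{i,j}$ holds fluid, the combined backlog $C_{i,j}$ at input port $I_i$ and for output port $O_j$ cannot grow. This is the kind of non-positive drift condition that Lemma 1 requires.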
5. The kFRR Scheduling Algorithm 
In this section, we focus our study on practical maximal size 
k-matching algorithms, which can be developed based on it-
erative maximal size matching scheduling algorithms, such 
as PIM [1], iSLIP [16], DRRM [5], FIRM [22], SRR [12], 
and iterative PPA scheme [4]. Among these algorithms, 
round-robin based algorithms, such as iSLIP, DRRM, and 
FIRM, are more attractive than others because of their fairness
and implementation simplicity. FIRM improves iSLIP
by reducing the service guarantee time from $(N-1)^2 + N^2$
cell slots to $N^2$ cell slots. It is starvation-free and easy to
implement in hardware at high speed [22]. In the following,
we generalize the idea of FIRM for the SDMG CIOQ
switch and present the k-connection FIRM-based round-
robin (kFRR) scheduling algorithm. Similar to FIRM, kFRR 
is also an iterative and round-robin based algorithm. 
For input port $I_i$, let $a_i$, where $1 \le a_i \le N/g$, be its
accept pointer indicating the accept starting position in the
circular round-robin priority queue, and $C(I_i)$ be the number
of available connections at $I_i$. For output port $O_j$, let $g_j$,
where $1 \le g_j \le N/g$, be its grant pointer indicating the
grant starting position in the circular round-robin priority
queue, and $C(O_j)$ be the number of available connections at
$O_j$. Prior to the first iteration of kFRR in any cell slot, we
set $C(I_i) = C(O_j) = k$ for all $1 \le i, j \le N/g$.
In each cell slot, kFRR iteratively finds a k-matching.
It terminates after a fixed number of iterations or after a
non-profit iteration (i.e. a maximal size k-matching has been found).
Each iteration of kFRR consists of the following three steps.
Step 1: Request. $\forall I_i$, $1 \le i \le N/g$, if $I_i$ has any available
connection and any unresolved request (an unresolved
request is one to an output port with any available
connection), it sends all unresolved requests to their
corresponding $O_j$'s.
Step 2: Grant. $\forall O_j$, $1 \le j \le N/g$, if $O_j$ has any available
connection and receives any request, it grants
min{$C(O_j)$, the number of requests to $O_j$} requests,
starting from $g_j$. These grants are sent to their
corresponding $I_i$'s. If and only if this is the first iteration,
$g_j$ is updated, searching from $g_j$ in a circular manner
(modulo $N/g$), to the first input port that receives $O_j$'s
grant but does not accept it in the Accept step or, if all
of $O_j$'s grants are accepted, to the first input port that
does not receive $O_j$'s grant. $C(O_j)$ is updated
to the number of available connections at $O_j$.
Step 3: Accept. $\forall I_i$, $1 \le i \le N/g$, if $I_i$ has any available
connection and receives any grant, it accepts min{$C(I_i)$,
the number of grants to $I_i$} grants starting from $a_i$. $a_i$
is updated to the next position after the last output port
whose grant is accepted by $I_i$ in a circular manner
(modulo $N/g$). $C(I_i)$ is updated to the number of
available connections at $I_i$.
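The three steps above can be sketched in software as follows. This is an illustrative reading of the description, not the authors' implementation, and it makes three simplifying assumptions: ports and pointers are 0-based; each non-empty VOQ sends at most one request per iteration; and a grant pointer is left unchanged when every input port received a grant and accepted it. The function name and signature are hypothetical.

```python
from typing import List, Optional


def kfrr_slot(Z: List[List[int]], k: int, a: List[int], gp: List[int],
              max_iters: Optional[int] = None) -> List[List[int]]:
    """Run kFRR for one cell slot; returns K with K[i][j] = matches from i to j.

    Z[i][j] is the VOQ length; a and gp (accept and grant pointers) are
    updated in place.
    """
    n = len(Z)
    ci, co = [k] * n, [k] * n              # available connections C(I_i), C(O_j)
    K = [[0] * n for _ in range(n)]
    it = 0
    while max_iters is None or it < max_iters:
        # Step 1: Request (one unresolved request per non-empty VOQ).
        req = [[ci[i] > 0 and co[j] > 0 and Z[i][j] > K[i][j]
                for j in range(n)] for i in range(n)]
        # Step 2: Grant, scanning input ports circularly from the grant pointer.
        granted = []
        for j in range(n):
            order = [(gp[j] + d) % n for d in range(n)]
            granted.append([i for i in order if req[i][j]][:co[j]])
        # Step 3: Accept, scanning output ports circularly from the accept pointer.
        accepted = [set() for _ in range(n)]
        progress = False
        for i in range(n):
            order = [(a[i] + d) % n for d in range(n)]
            take = [j for j in order if i in granted[j]][:ci[i]]
            for j in take:
                K[i][j] += 1
                co[j] -= 1
                accepted[j].add(i)
            if take:
                ci[i] -= len(take)
                a[i] = (take[-1] + 1) % n
                progress = True
        # Grant pointers move only in the first iteration.
        if it == 0:
            for j in range(n):
                if not granted[j]:
                    continue
                order = [(gp[j] + d) % n for d in range(n)]
                refused = [i for i in order
                           if i in granted[j] and i not in accepted[j]]
                ungranted = [i for i in order if i not in granted[j]]
                if refused:
                    gp[j] = refused[0]
                elif ungranted:
                    gp[j] = ungranted[0]
        it += 1
        if not progress:                   # non-profit iteration: K is maximal
            break
    return K
```

Run with `max_iters=1` on a saturated 4 x 4 switch with k = 2 and all pointers at the first port, the sketch matches the first two input ports to the first two output ports and moves their pointers to the third port, while the other pointers stay where they were.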
Figure 5(a) shows how kFRR with one iteration works
using an example for a 4 x 4 SDMG CIOQ switch with $k = 2$
under saturated load. Saturated load means that at some cell slot,
$\forall\, 1 \le i, j \le 4$, $Q_{i,j} > 0$, and input port traffic arrivals are
maintained in such a manner that $Q_{i,j} > 0$ in the following
cell slots. At the start of cell slot 0, we assume that $g_j = 1$
and $a_i = 1$ for all $1 \le i, j \le 4$. In the Request step, each
input port $I_i$ sends a request to each output port $O_j$, represented
by an edge in the figure. In the Grant step, each $O_j$ grants
$I_1$ and $I_2$ since each $O_j$ only has two connections available
and $g_j = 1$ for all $1 \le j \le 4$. In the Accept step, both
$I_1$ and $I_2$ accept grants from $O_1$ and $O_2$ since each of them
only has two connections available and $a_1 = a_2 = 1$. Finally
a 2-matching of size 4 is found. $a_1$ and $a_2$ are updated to 3,
while $a_3$ and $a_4$ are not updated; $g_1$ and $g_2$ are updated to 3,
while $g_3$ and $g_4$ are not updated. Figure 5(b) illustrates the
desynchronization effect of grant pointers of kFRR with the
previous example. After cell slot 0, due to the desynchronization
of grant pointers, a perfect 2-matching is obtained
at cell slot 1. For the same reason, perfect 2-matchings (with
different patterns) are found in cell slots 2 and 3.
kFRR has the following properties. 
Property 1: At each output port $O_j$, due to the property
of round-robin, the lowest priority input port is set as the one
before the first input port that receives its grant but does not
accept it in the first iteration or the input port before the first
input port that does not receive $O_j$'s grant if all $O_j$'s grants
are accepted in the first iteration.
Property 2: Under saturated load, all VOQs with a
common output port have the same throughput because the
grant pointer at the output port moves to each requesting
input port in a fixed order (every $\frac{N}{gk}$ cell slots).
Property 3: No connection request is starved. This
property comes from the following theorem.
Theorem 3: kFRR serves an existing connection request
within no more than $\left(\frac{N}{gk}\right)^2$ cell slots.
Proof: The worst case service scenario of kFRR is the situation
where a request from input port $I_i$ to output port $O_j$
has to wait for all other $N/g - k$ input ports to be served by
$O_j$, i.e. for some cell slot n, $Z_{i',j}(n) > 0$ (where $Z_{i',j}(n)$ is the
number of cells in $Q_{i',j}$ at cell slot n) for all $I_{i'}$'s and
$g_j = ((i + 1) \bmod N/g)$, where $i' \ne i$. The delay between posting
a request and serving the request consists of the delay for the
request to be granted and the delay for the grant to be accepted.
The delay for the request from $I_i$ to $O_j$ to be granted is
$\left(\frac{N}{gk} - 1\right)\frac{N}{gk}$ cell slots since it takes $\frac{N}{gk} - 1$ cell slots
for $O_j$ to grant requests from the other $N/g - k$ input ports and it
takes at most $\frac{N}{gk}$ cell slots for each grant to be accepted. After
the grant to $I_i$ is issued, it also takes $\frac{N}{gk}$ cell slots to get it
accepted. Thus in total it takes
$\left(\frac{N}{gk} - 1\right)\frac{N}{gk} + \frac{N}{gk} = \left(\frac{N}{gk}\right)^2$ cell slots to serve an
existing connection request. ∎
Fig. 5 An example of kFRR with one iteration: (a) three steps (Request,
Grant, Accept) of cell slot 0, (b) 2-matchings found in cell slots 1-3.
Property 4: kFRR finds a maximal size k-matching in
at most $N/g - k + 1$ iterations, i.e. kFRR converges in at most
$N/g - k + 1$ iterations.
The reason is explained as follows. The size of a maximal size k-matching is at most $Nk/g$. If finding a maximal
size k-matching takes more than 1 iteration, the first iteration finds at least $k^2$ matches, the last iteration finds at least 1
match, and other iterations find at least $k$ matches. Thus, the
total number of iterations needed is at most $\lfloor (Nk/g - k^2 - 1)/k \rfloor + 2$,
which is given by $N/g - k + 1$. We further conjecture that
under uniform traffic arrivals kFRR converges in $O(\log N)$
iterations on average.
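The arithmetic behind Property 4 is easy to check numerically. The short Python sketch below assumes the counting bound reads floor((Nk/g - k^2 - 1)/k) + 2 and that g divides N; the helper name `iteration_bound` is hypothetical.

```python
def iteration_bound(n, g, k):
    """Worst-case iteration count from the counting argument:
    floor((N*k/g - k^2 - 1) / k) + 2, assuming g divides N."""
    ports = n // g  # N/g ports on each side of the switch
    return (ports * k - k * k - 1) // k + 2


# The floor expression collapses to N/g - k + 1 for each of these settings.
for n, g, k in [(8, 1, 2), (32, 1, 2), (32, 2, 4), (32, 4, 8)]:
    assert iteration_bound(n, g, k) == n // g - k + 1
    print(n, g, k, iteration_bound(n, g, k))
```

For the 8 x 8 switch with k = 2 the bound is 7 iterations, so it is a worst-case figure rather than a typical one.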
Figure 6 shows an example of the number of iterations 
needed for kFRR to converge for an 8 x 8 SDMG CIOQ 
switch with k = 2 under saturated load. In cell slot 0, kFRR 
takes 4 iterations to converge. It takes 3 and 2 iterations for 
kFRR to converge in cell slots 1 and 2 respectively. After 
cell slot 3, all grant pointers are totally desynchronized and 
kFRR converges in a single iteration. 
Fig. 6 Example of the number of iterations needed for kFRR to converge
for an 8 x 8 SDMG CIOQ switch under saturated load.
6. Performance Evaluation 
In Sect. 4, we proved that any maximal size k-matching algorithm can achieve 100% throughput for SDMG CIOQ
switches with an expansion factor 2. Nevertheless, in practice, the number of iterations allowed in one cell slot may
not be sufficient for finding a maximal size k-matching. In
this section, we evaluate the performance of kFRR when the
number of iterations allowed in each cell slot is limited, on
SDMG CIOQ switches with an expansion factor 2, in terms
of the average queuing delay. The queuing delay is defined
as the cell's queuing delay at input and output ports counted
in the number of cell slots.
Two traffic models are used in our simulations: uniform 
traffic and polarized traffic. For uniform traffic, we consider 
both Bernoulli arrivals and bursty arrivals. The polarized 
traffic is defined as follows [3]. Given the geometric progression factor $q \ge 1.00$, the proportion of traffic arriving at
input line $L_l$ destined for output line $M_m$ should satisfy
\[ d_{l,m} = \frac{q^{(l+m) \bmod N}\,(q - 1)}{q^N - 1} \]
such that
\[ \forall l \in [1 \ldots N],\ \sum_{m=1}^{N} d_{l,m} = 1 \quad \text{and} \quad \forall m \in [1 \ldots N],\ \sum_{l=1}^{N} d_{l,m} = 1. \]
Polarized traffic with q = 1.00 is uniform traffic. 
One can verify that both uniform traffic and polarized traf-
fic satisfy the SLLN condition without oversubscribed in-
put/output lines. Simulations have been performed for the 
kFRR algorithm for SDMG CIOQ switch sizes of 16 x 16,
32 x 32, 64 x 64, and 128 x 128 with different group factors
(g), different port connection factors (k), different polarization factors (q), and different numbers of iterations. In the
following, we present simulation results with the example 
of a 32 x 32 SDMG CIOQ switch. Without loss of gener-
ality, in our simulations, all pointers in kFRR are initialized 
randomly. 
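The normalization of the polarized traffic matrix can be checked with a small Python sketch. It assumes the definition reads d(l, m) = q^((l+m) mod N) (q - 1) / (q^N - 1) and treats q = 1 as the uniform limiting case; the function name `polarized_matrix` is hypothetical.

```python
def polarized_matrix(n, q):
    """Return d[l][m] for 1-based lines l and m, stored 0-based as d[l-1][m-1]."""
    if q == 1.0:
        # Limiting case of the formula: uniform traffic.
        return [[1.0 / n] * n for _ in range(n)]
    scale = (q - 1.0) / (q ** n - 1.0)
    return [[q ** ((l + m) % n) * scale for m in range(1, n + 1)]
            for l in range(1, n + 1)]


d = polarized_matrix(32, 1.5)
row_sums = [sum(row) for row in d]
col_sums = [sum(col) for col in zip(*d)]
# No input or output line is oversubscribed: every row and column sums to 1.
assert all(abs(s - 1.0) < 1e-9 for s in row_sums + col_sums)
```

Each row contains every power q^0, ..., q^(N-1) exactly once, so the geometric sum (q^N - 1)/(q - 1) cancels the scale factor; the same holds for each column.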
6.1 Bernoulli Arrivals 
Figure 7 shows the average cell delay vs. load of kFRR with
1, 2, and 4 iterations, g = 1, k = 2, and q = 1.00, 1.50, and
2.00 for a 32 x 32 SDMG CIOQ switch under Bernoulli arrivals. In the figure, "x-y" represents the case of kFRR with
q = x and the number of iterations equal to y. kFRR
achieves 100% throughput for all polarization factors. The
performance of kFRR improves when the polarization factor
increases. We observe that the number of
iterations has little effect on the performance of kFRR
under Bernoulli arrivals.
Figure 8 compares the average queuing delay vs. load
of one-iteration kFRR with k = g (solid curve) and k = 2g
(dotted curve) for g = 1, 2, and 4 for a 32 x 32 SDMG CIOQ
switch under uniform Bernoulli arrivals. In the figure, "x-y"
represents the case of kFRR with g = x and k = y. kFRR
with k = 2g performs dramatically better than kFRR with k = g.
For k = 2g, a larger group factor yields better
performance.
Fig. 7 Delay performance of kFRR with g = 1, k = 2, and different
numbers of iterations under Bernoulli arrivals.
Fig. 8 Delay performance of one-iteration kFRR with different g's and
different k's under uniform Bernoulli arrivals.
6.2 Bursty Arrivals
To show the performance of the proposed scheme under real
traffic, such as multimedia traffic which tends to be bursty, 
we study the performance of kFRR under bursty traffic using
two-state Markov-chain-modulated on-off arrival processes
[15], [16]. Each input line alternately generates a burst of 
full cells (all with the same destination) followed by an idle 
period of empty cells. The number of cells in each burst or 
idle period is geometrically distributed. Let E(B) and E(I) 
be the average burst length and the average idle length in the 
number of cells respectively. E(I) = E(B)(1- p)/p, where 
p is the load of each input line. We assume the destination 
of each burst is uniformly distributed. As a matter of fact, 
Bernoulli traffic can be considered as a special case of bursty 
traffic with E(B) = 1. 
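The on-off arrival process described above can be sketched as a Python generator. This is a minimal model under stated assumptions rather than the simulator used for the results: burst lengths are taken as geometric on {1, 2, ...} with mean E(B), idle lengths as geometric on {0, 1, ...} with mean E(I) = E(B)(1 - p)/p, and the names `mean_idle_length` and `on_off_source` are hypothetical.

```python
import random


def mean_idle_length(mean_burst, load):
    """E(I) = E(B) * (1 - p) / p."""
    return mean_burst * (1.0 - load) / load


def on_off_source(n_slots, mean_burst, load, n_outputs, rng):
    """Yield one destination (or None for an empty cell) per cell slot."""
    p_end_burst = 1.0 / mean_burst  # burst length >= 1, mean E(B)
    p_end_idle = 1.0 / (1.0 + mean_idle_length(mean_burst, load))  # idle length >= 0
    emitted = 0
    while emitted < n_slots:
        dest = rng.randrange(n_outputs)  # one destination for the whole burst
        while emitted < n_slots:
            yield dest
            emitted += 1
            if rng.random() < p_end_burst:
                break
        while emitted < n_slots and rng.random() >= p_end_idle:
            yield None
            emitted += 1


rng = random.Random(1)
cells = list(on_off_source(200_000, mean_burst=16, load=0.5, n_outputs=32, rng=rng))
measured = sum(c is not None for c in cells) / len(cells)
print(round(measured, 2))  # should be close to the offered load p = 0.5
```

Setting `mean_burst=1` makes every burst a single cell with an independently drawn destination, which corresponds to the Bernoulli special case mentioned above.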
Figure 9 illustrates the average queuing delay vs. load
of kFRR with 1, 2, and 4 iterations, g = 1, and k = 2 for
a 32 x 32 SDMG CIOQ switch under bursty arrivals with
E(B) = 16, 32, 64, 128, and 256, respectively. In the figure,
"x-y" represents the case of kFRR with E(B) = x and the
number of iterations equal to y. kFRR achieves 100%
throughput with all average burst length settings. The number of iterations has little effect on
the average queuing delay of kFRR under bursty arrivals.
Figure 10 compares the average queuing delay vs. load
of one-iteration kFRR with k = g (solid curve) and k = 2g
(dotted curve) for g = 1, 2, and 4 for a 32 x 32 SDMG CIOQ
switch under bursty arrivals with E(B) = 64. In the figure,
"x-y" represents the case of kFRR with g = x and k = y. As
shown in Fig. 10, kFRR with k = 2g performs dramatically better than
kFRR with k = g. The performance
of kFRR improves when the group factor increases.
Fig. 9 Delay performance of kFRR with g = 1, k = 2, and different
numbers of iterations under bursty arrivals.
Fig. 10 Delay performance of one-iteration kFRR with different g's and
different k's under bursty arrivals.
Figure 11 compares the performance of one-iteration
kFRR with P = 2 (solid curve) and one-iteration FIRM with
S = 2 (dotted curve) for g = 1 of a 32 x 32 SDMG CIOQ
switch under Bernoulli arrivals and under bursty arrivals with E(B) =
64 and E(B) = 128. As we can see, in all cases,
kFRR with expansion factor 2 achieves the
same performance as FIRM with speedup factor 2.
Fig. 11 Delay performance of one-iteration kFRR with P = 2 and one-iteration
FIRM with S = 2 for g = 1 under Bernoulli and bursty arrivals.
Fig. 12 Delay performance of one-iteration kFRR with different switch
sizes under bursty arrivals.
6.3 With Different Switch Sizes 
To evaluate the scalability of the SDMG CIOQ switch architecture and the kFRR algorithm, we conducted
simulations for different switch sizes. Figure 12 shows the
performance of kFRR with g = 1 and k = 2 for SDMG
CIOQ switch sizes of 16 x 16, 32 x 32, 64 x 64, and 128 x 128.
As shown in the figure, the average delay of kFRR increases
only slightly with larger switch sizes. This indicates that
the SDMG CIOQ switch architecture and the kFRR algorithm scale well as the switch size increases.
7. Hardware Implementation of kFRR Algorithm 
An important property of an efficient scheduling algorithm
is that it is simple to implement. In this section, we show that kFRR
can be readily implemented in hardware. Figure 13 shows
a possible design of a kFRR scheduler, which consists of
2N/g port arbitration components, a state update logic, and
a state memory. Each port arbitration component (PAC) is
responsible for selecting k out of Nk/g requests in a round-robin manner.
We discuss three possible designs of a PAC. The first 
design is employing the programmable priority encoder 
(PPE) proposed in [10]. The second design is using the par-
allel round-robin arbiter (PRRA) proposed in [29]. Since 
either the PPE or the PRRA can only make one selection 
each time, we have to run the PPE or PRRA k times to make 
k selections. The time needed for one-iteration kFRR using
these two designs is 2k times the delay of an N/g-input
PPE or PRRA. The third design is using the programmable 
k-selector proposed in [30]. The advantage of using pro-
grammable k-selectors is that the timing performance is in-
dependent of k. The time needed for one-iteration kFRR 
using such a design is 2 times the delay of an Nk/g-input 
programmable k-selector. 
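The first two PAC designs make k selections by running a single-selection arbiter k times. A software model of that idea is sketched below in Python; it is not a description of the PPE or PRRA circuits. Requests are simplified to one bit per requester, and the names `priority_encode` and `select_k` are hypothetical.

```python
def priority_encode(request_bits, pointer, width):
    """Index of the first set bit at or after `pointer`, wrapping around.

    Returns None if no request bit is set.
    """
    for offset in range(width):
        idx = (pointer + offset) % width
        if (request_bits >> idx) & 1:
            return idx
    return None


def select_k(request_bits, pointer, width, k):
    """Pick up to k requesters by invoking the single-output encoder k times."""
    chosen = []
    for _ in range(k):
        idx = priority_encode(request_bits, pointer, width)
        if idx is None:
            break
        chosen.append(idx)
        request_bits &= ~(1 << idx)  # mask the winner before the next pass
        pointer = (idx + 1) % width  # continue the scan just past the winner
    return chosen


print(select_k(0b10110, pointer=3, width=5, k=2))  # [4, 1]
```

The k sequential passes are what make the delay of these designs grow with k, in contrast to a k-selector that produces all k winners in one pass.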
For an N x N SDMG CIOQ switch, the scheduler receives
an N/g x log k-bit request vector from each input port
at the start of each cell slot. Then, taking the example of a
one-iteration kFRR scheduler, it works as follows:
Step 1: Each grant PAC selects up to k requests and 
sends them to N I g accept PACs. 
Step 2: Each accept PAC selects up to k grants and 
sends them to the decision register, the state memory and 
update logic, where the grant pointers are updated. 
For an iterative kFRR scheduler, the PACs used are al-
most identical to those used for a one-iteration kFRR sched-
uler except the following differences. (1) The request matrix 
should be updated after each iteration. (2) The number of 
available connections at each PAC should be updated after 
each iteration. (3) Once an input/output port has no available 
connection, its PAC should be disabled in subsequent itera-
tions of the same cell slot. These three modifications make 
an iterative kFRR scheduler slightly more complex than a 
one-iteration kFRR scheduler. 
8. Concluding Remarks 
The major contributions of this paper include: (1) We in-
troduced the SDMG CIOQ switch, which features space-
division multiplexing expansion and grouped input/output 
ports to eliminate the speedup requirement of the switching 
fabric and memories of CIOQ switches. (2) We modelled 
the cell scheduling problem for the SDMG CIOQ switch 
as a bipartite k-matching problem. (3) Using fluid model 
techniques, we proved that any maximal size k-matching 
algorithm for the SDMG CIOQ switch with an expansion 
factor 2 can achieve 100% throughput so long as input line 
arrivals satisfy SLLN and no input/output line is oversub-
scribed. (4) We proposed an efficient and starvation-free dis-
tributed scheduling algorithm for the SDMG CIOQ switch, 
kFRR, for finding maximal size k-matchings. (5) Through 
simulations, we showed that kFRR achieves 100% through-
put for the SDMG CIOQ switch with an expansion factor 
2 for two SLLN traffic arrivals: uniform traffic and polar-
ized traffic. (6) We proposed three hardware implementation 
schemes for the kFRR algorithm. In conclusion, the SDMG 
CIOQ switch provides an alternative solution to the CIOQ 
switch with speedup and kFRR is an efficient and practical 
scheduling algorithm for the SDMG CIOQ switch. Future 
work includes study of efficient scheduling algorithms sup-
porting QoS differentiation for different types of traffic on 
the SDMG CIOQ switch. 
Fig. 13 Block diagram of a kFRR scheduler for an N x N SDMG CIOQ switch.
References 
[1] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High speed
switch scheduling for local area networks," ACM Trans. Comput.
Syst., vol.11, no.4, pp.319-352, Nov. 1993.
[2] A. Awan and R. Venkatesan, "Design and implementation of enhanced
crossbar CIOQ switch architecture," Proc. Canadian Conf.
Electrical and Computer Engineering, vol.2, pp.1045-1048, 2005.
[3] J. Blanton, H. Badt, G. Damm, and P. Golla, "Impact of polarized
traffic on scheduling algorithms for high speed optical switches,"
presented at ITCom 2001, Denver, Aug. 2001.
[4] H.J. Chao, C.H. Lam, and X.L. Guo, "A fast arbitration scheme for
terabit packet switches," Proc. GLOBECOM, pp.1236-1243, 1999.
[5] H.J. Chao, "Saturn: A terabit packet switch using dual round-robin,"
IEEE Commun. Mag., pp.78-84, Dec. 2000.
[6] S.T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, "Matching
output queuing with a combined input output queued switch," IEEE
J. Sel. Areas Commun., vol.17, no.6, pp.1030-1039, June 1999.
[7] W.J. Cook, W.R. Pulleyblank, A. Schrijver, and W.H. Cunningham,
Combinatorial Optimization, John Wiley & Sons Inc., Nov. 1997.
[8] J. Dai and B. Prabhakar, "The throughput of data switches with and
without speedup," Proc. INFOCOM, pp.556-564, 2000.
[9] D. Gale and L.S. Shapley, "College admissions and the stability of
marriage," American Mathematical Monthly, vol.69, pp.9-15, 1962.
[10] P. Gupta and N. McKeown, "Designing and implementing a fast
crossbar scheduler," IEEE Micro, vol.19, no.1, pp.20-28, Jan.-Feb.
1999.
[11] J.E. Hopcroft and R.M. Karp, "An $n^{2.5}$ algorithm for maximum
matching in bipartite graphs," Soc. Ind. Appl. Math. J., vol.2,
pp.225-231, 1973.
[12] Y. Jiang and M. Hamdi, "A fully desynchronized round-robin matching
scheduler for a VOQ packet switch architecture," Proc. IEEE
HPSR, pp.407-412, 2001.
[13] M.J. Karol, M.G. Hluchyj, and S.P. Morgan, "Input vs. output queuing
on a space-division packet switch," IEEE Trans. Commun.,
vol.35, no.12, pp.110-115, May 1987.
[14] C. Li, S.Q. Zheng, and M. Yang, "Scalable schedulers for high-performance
switches," Proc. IEEE HPSR, pp.198-202, 2004.
[15] S.Q. Li, "A general solution technique for discrete queuing analysis
of multimedia traffic on ATM," IEEE Trans. Commun., vol.39, no.7,
pp.1115-1132, July 1991.
[16] N. McKeown, "The iSLIP scheduling algorithm for input-queued
switches," IEEE/ACM Trans. Netw., vol.7, no.2, pp.188-201, April
1999.
[17] N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand,
"Achieving 100% throughput in an input-queued switch," IEEE
Trans. Commun., vol.47, no.8, pp.1260-1267, Aug. 1999.
[18] C. Minkenberg, On packet switch design, Ph.D. dissertation,
Eindhoven University of Technology, 2001.
[19] H. Obara, S. Okamoto, and Y. Hamazumi, "Input and output queuing
ATM switch architecture with spatial and temporal slot reservation
control," Electron. Lett., vol.28, no.1, pp.22-24, Jan. 1992.
[20] A. Pattavina, "Multichannel bandwidth allocation in broadband
packet switch," IEEE J. Sel. Areas Commun., vol.6, no.9, pp.1489-1499,
Dec. 1988.
[21] B. Prabhakar and N. McKeown, "On the speedup required for combined
input and output queued switching," Automatica, vol.35, no.12,
1999.
[22] D.N. Serpanos and P.I. Antoniadis, "FIRM: A class of distributed
scheduling algorithms for high-speed ATM switches with multiple
input queues," Proc. INFOCOM, pp.548-555, 2000.
[23] D. Shah, "Maximal matching scheduling is good enough," Proc.
GLOBECOM, pp.3009-3013, 2003.
[24] I. Stoica and H. Zhang, "Exact emulation of an output queuing
switch by a combined input output queuing switch," Proc. 6th
IEEE/IFIP IWQoS, pp.218-224, 1998.
[25] R.E. Tarjan, Data Structures and Network Algorithms, Bell Labs,
Murray Hill, NJ, 1983.
[26] Vitesse Corp., TeraStream chip set [Online], Available:
http://www.vitesse.com, 2003.
[27] M. Yang and S.Q. Zheng, "Pipelined maximal size matching
scheduling algorithms for CIOQ switches," Proc. ISCC, pp.521-526,
2003.
[28] M. Yang and S.Q. Zheng, "An efficient scheduling algorithm for
CIOQ switches with space-division multiplexing expansion," Proc.
IEEE INFOCOM, pp.1643-1650, 2003.
[29] S.Q. Zheng, M. Yang, J. Blanton, P. Golla, and D. Verchere, "A
simple and fast parallel round-robin arbiter for high-speed switch
control and scheduling," Proc. 45th IEEE MWSCAS, pp.671-674,
2002.
[30] S.Q. Zheng, M. Yang, and F. Masetti-Piacci, "Constructing schedulers
for high-speed, high-capacity switches/routers," Int. J. Comput.
Appl., vol.25, no.4, pp.264-271, 2003.
Appendix A: Proof of Lemma 2 
Proof: Proving Eq. (14) is equivalent to showing that, if
$Z_{i,j}(n) \ge k$, then
\[ C_{i,j}(n+1) - C_{i,j}(n) \le \sum_{j=1}^{N/g} \bigl(A_{i,j}(n+1) - A_{i,j}(n)\bigr) + \sum_{i=1}^{N/g} \bigl(A_{i,j}(n+1) - A_{i,j}(n)\bigr) - k. \tag{A·1} \]
Let $V_{i,j}$ denote the set of all VOQs holding cells arriving
at input port $I_i$ or destined for output port $O_j$. Then
$C_{i,j}(n+1) - C_{i,j}(n)$ is the difference between the number
of arrivals to $V_{i,j}$ at cell slot $n+1$ and the number of departures
from $V_{i,j}$ at cell slot $n$. The number of arrivals to
$V_{i,j}$ at cell slot $n+1$ equals
$\sum_{j=1}^{N/g} (A_{i,j}(n+1) - A_{i,j}(n)) + \sum_{i=1}^{N/g} (A_{i,j}(n+1) - A_{i,j}(n))$.
Since $Z_{i,j}(n) \ge k$ and the switch employs a maximal
size k-matching algorithm, it follows from Eq. (10) that
\[ \sum_{j=1}^{N/g} \pi(n)_{i,j} + \sum_{i=1}^{N/g} \pi(n)_{i,j} \ge k. \]
That is to say, at least $k$ cells are removed from
those VOQs in $V_{i,j}$. Thus, we get the bound on the right
side of Eq. (A·1). □
Appendix B: Proof of Theorem 2 
Proof: To prove the theorem, we first show that the SDMG
CIOQ switch is VOQ rate stable. In light of Theorem 1, this
is equivalent to showing that the corresponding fluid model
is VOQ weakly stable, i.e. every fluid solution $(D, T, Z)$ has
$Z(t) = 0$ for $t \ge 0$.
Let $E$ be the $N/g \times N/g$ matrix with each entry being
1. We have
\[ C(t) = EZ(t) + Z(t)E, \quad t \ge 0. \tag{A·2} \]
Define $f(t) = \langle Z(t), C(t) \rangle$, where $\langle A, B \rangle = \sum_{i,j} A_{i,j} B_{i,j}$.
Then we have $f(t) \ge 0$ for $t \ge 0$ and $f(0) = 0$. It is also true
that $f(t) = 0$ implies that $Z(t) = 0$. We observe that
\[ f(t) = \sum_{i,j} Z_{i,j}(t) C_{i,j}(t) = \sum_{i,j} Z_{i,j}(t) \Bigl( \sum_{k} Z_{i,k}(t) + \sum_{k} Z_{k,j}(t) \Bigr) = \sum_{i,j,k} \bigl( Z_{i,j}(t) Z_{i,k}(t) + Z_{i,j}(t) Z_{k,j}(t) \bigr). \]
Therefore,
\[ \dot{f}(t) = \sum_{i,j,k} \dot{Z}_{i,j}(t) Z_{i,k}(t) + \sum_{i,j,k} Z_{i,j}(t) \dot{Z}_{i,k}(t) + \sum_{i,j,k} \dot{Z}_{i,j}(t) Z_{k,j}(t) + \sum_{i,j,k} Z_{i,j}(t) \dot{Z}_{k,j}(t) \]
\[ = 2 \sum_{i,j,k} Z_{i,j}(t) \dot{Z}_{i,k}(t) + 2 \sum_{i,j,k} Z_{i,j}(t) \dot{Z}_{k,j}(t) = 2 \sum_{i,j} Z_{i,j}(t) \dot{C}_{i,j}(t) \le 0, \tag{A·3} \]
since from Lemma 2 and Eq. (5),
\[ \dot{C}_{i,j}(t) \le \sum_{j=1}^{N/g} \lambda_{i,j} + \sum_{i=1}^{N/g} \lambda_{i,j} - k \le g + g - 2g = 0. \]
According to Lemma 1, $f(t) = 0$ because $f(t) \ge 0$ and
$\dot{f}(t) \le 0$. Hence $Z(t) = 0$ for $t \ge 0$, i.e. the fluid model
is VOQ weakly stable. By Theorem 1, the SDMG CIOQ switch
is VOQ rate stable, i.e., $\lim_{n \to \infty} D_{i,j}(n)/n = \lambda_{i,j}$. Also from
Eq. (5), the SDMG CIOQ switch is port conserving.
We then show that the SDMG CIOQ switch is work conserving.
In fact, $\lim_{n \to \infty} D_{i,j}(n)/n$ is equal to the scheduled cell
arrival rate from input port $I_i$ to output port $O_j$, for any
$1 \le i, j \le N/g$. Then the total scheduled cell arrival
rate at output port $O_j$ equals
\[ \sum_{i=1}^{N/g} \lim_{n \to \infty} \frac{D_{i,j}(n)}{n} = \sum_{i=1}^{N/g} \lambda_{i,j} = \sum_{l=1}^{N} \sum_{m=(j-1)g+1}^{jg} \lambda'_{l,m}. \]
Therefore, the scheduled cell arrival rate at output line $M_m$,
where $(j-1)g + 1 \le m \le jg$ for any $1 \le j \le N/g$, is given
by
\[ \sum_{l=1}^{N} \lim_{n \to \infty} \frac{D'_{l,m}(n)}{n} = \sum_{l=1}^{N} \lambda'_{l,m} \le 1 \]
based on Eq. (2). Hence the SDMG CIOQ switch is work conserving,
i.e. it can achieve 100% throughput if input line arrivals
are sufficient. □
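The claim in the proof above that f(t) = 0 implies Z(t) = 0 can be illustrated numerically. Entry (i, j) of C = EZ + ZE is the sum of column j of Z plus the sum of row i of Z, so f = <Z, C> equals the sum of the squared row sums plus the sum of the squared column sums, which vanishes only when the non-negative matrix Z is zero. The Python sketch below checks this identity on a small integer matrix; the helper names are hypothetical.

```python
def row_col_sums(z):
    """Row sums and column sums of a rectangular matrix given as a list of rows."""
    rows = [sum(r) for r in z]
    cols = [sum(c) for c in zip(*z)]
    return rows, cols


def c_matrix(z):
    """C = EZ + ZE: entry (i, j) is the column-j sum plus the row-i sum of Z."""
    rows, cols = row_col_sums(z)
    return [[rows[i] + cols[j] for j in range(len(cols))] for i in range(len(rows))]


def f_value(z):
    """f = <Z, C>, the sum over i, j of Z[i][j] * C[i][j]."""
    c = c_matrix(z)
    return sum(z[i][j] * c[i][j] for i in range(len(z)) for j in range(len(z[0])))


z = [[2, 0, 1], [0, 3, 0], [1, 0, 0]]
rows, cols = row_col_sums(z)
assert f_value(z) == sum(r * r for r in rows) + sum(c * c for c in cols)
print(f_value(z))  # 38
```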
Mei Yang received her Ph.D. degree in 
Computer Science from the University of Texas 
at Dallas in Aug. 2003. She is currently an 
assistant professor in the Department of Elec-
trical and Computer Engineering, University of 
Nevada, Las Vegas (UNLV). Before she joined 
UNLV, she worked as an assistant professor in 
the Department of Computer Science, Colum-
bus State University, from Aug. 2003 to Aug. 
2004. Her research interests include computer 
networks, wireless sensor networks, computer 
architectures, and embedded systems. 
Si Qing Zheng received the Ph.D. de-
gree from the University of California, Santa 
Barbara, in 1987. After serving on the faculty 
of Louisiana State University for eleven years 
since 1987, he joined the University of Texas 
at Dallas, where he is currently a professor of 
computer science, computer engineering, and 
telecommunications engineering. Dr. Zheng's 
research interests include algorithms, computer 
architectures, networks, parallel and distributed 
processing, telecommunications, and VLSI de-
sign. He has published 200 papers in these areas. He was a consultant of 
several high-tech companies, and he holds numerous patents. He served as 
program committee chairman of numerous international conferences and 
editor of several professional journals. 
