Traffic Management for Next Generation Transport Networks by Yu, Hao
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Traffic Management for Next Generation Transport Networks
Yu, Hao; Dittmann, Lars
Publication date: 2011
Document Version: Publisher's PDF, also known as Version of record
Citation (APA):
Yu, H., & Dittmann, L. (2011). Traffic Management for Next Generation Transport Networks. Kgs. Lyngby,
Denmark: Technical University of Denmark (DTU).
Trac Management for Next Generation
Transport Networks
by
Hao Yu
March, 2011
Networks Technology & Service Platforms
in the
Department of Photonics Engineering
of the
TECHNICAL UNIVERSITY of DENMARK
KGS. LYNGBY
DENMARK
To my parents.
Abstract
Video services are believed to be prevalent in the next generation
transport networks. The popularity of these bandwidth-intensive services,
such as Internet Protocol Television (IPTV), online gaming, and
Video-on-Demand (VoD), is currently driving network service providers
to upgrade their network capacities. However, in order to provide more
advanced video services than simply porting traditional television
services to the network, the service provider needs to do more than just
augment the network capacity. Advanced traffic management capability
is one of the abilities required by the next generation transport
network to provide Quality-of-Service (QoS) guaranteed video services.
Augmenting network capacity and upgrading network nodes imply
long deployment periods, replacement of equipment, and thus significant
cost to the network service providers. This challenge may slow the
progress of some network operators towards providing IPTV services. In
this dissertation, the topology-based hierarchical scheduling scheme is
proposed to tackle this problem. The scheme simplifies the deployment
process by placing an intelligent switch with centralized traffic
management functions at the edge of the network, scheduling traffic on
behalf of the other nodes. Owing to its centralized scheduling ability,
the topology-based hierarchical scheduling scheme provides strong flow
isolation, which is essential for providing IPTV services.
In order to reduce the required bandwidth, multicast is favored for
providing IPTV services. Currently, transport networks lack sufficient
multicast abilities. As network capacity increases, it is challenging
to build a multicast-enabled switch for the transport network, because,
from the traffic management perspective, the multicast scheduling
algorithm and the switch architecture should be able to scale in switch
size and link speed. The Multi-Level Round-Robin Multicast Scheduling
(MLRRMS) algorithm is proposed for the Input Queuing (IQ) multicast
architecture in this dissertation. The algorithm is shown to have low
implementation and computational complexity, and high performance in
terms of delay and throughput. This contribution makes it possible to
provide QoS control in a very high-speed switch, such as a 100 Gbit/s
Ethernet switch.
In addition to the multicast scheduling algorithm, the switch fabric,
which is the core of the switching system, should also be able to scale
and deliver excellent QoS performance. One challenge is to solve the
Out-Of-Sequence (OOS) problem of multicast cells in the three-stage
Clos-network, a type of multistage switch fabric that scales better
than single-stage switch fabrics. In this dissertation, two cell
dispatching schemes are proposed for the Space-Memory-Memory (SMM) Clos
architecture: the Multicast Flow-based Desynchronized Static Round-Robin
(MF-DSRR) and the Multicast Flow-based Round-Robin (MFRR). Both schemes
are capable of reducing the OOS problem, and thus decrease the
reassembly delay and buffer size. This improvement is of great
significance for the multicast switching service, which is foreseen to
be extensively used in the next generation transport network.
To sum up, this dissertation discusses traffic management for
the next generation transport network, and proposes novel scheduling
algorithms to solve some of the challenges currently encountered by both
academia and industry. The topics covered in this dissertation
are related to two projects: High quality IP network for IPTV and
VoIP (HIPT) and The Road to 100 Gigabit Ethernet (100GE), which
are detailed in the dissertation.
Resume
Videotjenester forventes at blive fremherskende i den nste generation
transportnetvk. Populariteten bandbreddekrvende tjenester, sasom
Internet TV, online gaming, og Video-on-Demand (VoD), stiller krav
til internet-udbydere om at opgradere deres netvrk til den krvede
kapacitet. For at tilbyde mere avancerede video-tjenester end blot tra-
ditionel tv, er der behov for at yderligere tiltag end simpel forgelse af
bandbredden. En af de ndvendige tiltag er avanceret trakstyring af
fremtidens netvrk, saledes at der sikre hj kvalitet (QoS) og garanteret
bandbredde til de forventede videotjenester.
Forgelse af netvrkskapacitet og opgradering af knudepunkter har
en lang tidshorisont og krver betydelige omkostninger for for netvrk-
sudbyderne. Disse udfordringer kan begrnse udbydernes lyst til at
levere Internet Protocol Television (IPTV)-tjenester. I denne afhan-
dling foreslas en topologi-baseret hierarkisk skeduleringsmetode med
henblik pa lse dette problem. Metoden forenkler implementeringen ved
at placere en intelligent switch med centraliseret trakstyringfunktion-
alitet pa kanten af nettet, hvor den styrer trakken pa vegne af de vrige
knudepunkter. Denne topologi-baserede hierarkiske skeduleringsmetode
er i stand til at levere fremragende opdeling af trak-ows grundet dens
centrale skeduleingsevne, som er afgrende for at tilbyde IPTV-tjenester.
For at reducere den krvede bandbredde nskes at benytte multicast
til at levere IPTV-tjenester. Idag har de tilgngelige transportnetvrk
dog ikke tilstrkkelig understttelse af multicast, og med forgelsen af
netvrkskapaciteten er det krvende at opbygge en multicast-understttet
switch til transport netvrket, da det fra et trakstyringsperspektiv er
ndvendigt at multicast skeduleringsalgoritmen og switchens arkitektur
kan skaleres i forhold til strrelse og hastigheden af dens forbindelser.
iii
iv Resume
Multi-Level Round-Robin Multicast Scheduling (MLRRMS) algoritmen
er foreslaet som Input Queuing (IQ) multicast arkitektur i denne afhan-
dling. Algoritmen er pavist at have en beskeden implementerings- og
databehandlingskompleksitet smatidig med med hj ydeevne i form af
forsinkelse og bandbredde. Dette bidrag gr det muligt at yde Quality-
of-Service (QoS) styring i en ekstrem hj-hastighedsswitche, som f.eks.
100 Gbit/s Ethernet switche.
Udover multicast skeduleringsalgoritmen, skal switch bagplanet, som
er kernen i switch-systemet, kunne skalere og levere fremragende QoS
resultater. En af udfordringerne er at lse Out-Of-Sequence (OOS) prob-
lemet med multicast celler i tre-trins Clos-netvrk, som er en slags er-
trins switch bagplan hjere skalerbarhed end en et-trins switch-bagplan.
I denne afhandling foreslas to lsninger til afsendelse af celler i Space-
Memory-Memory (SMM) Clos arkitekturer. Disse er Multicast Flow-
based DSRR (MF-DSRR) og Multicast Flow-based Round-Robin (MFRR).
Begge lsninger er i stand til at reducere OOS problemet og dermed
mindske buer-dybde og forsinkelse i forbindelse med genetableringen
af data-pakken. Denne forbedring er af stor betydning for multicast
tjenester, som forventes at blive ittigt brugt i fremtidens transport-
netvrk.
Opsummerende diskuterer denne afhandling trakstyringen for fremti-
dens transportnetvrk og foreslar nye skeduleringsalgoritmer til at lse
nogle af de udfordringer, som idag ndes i bade den akademiske verden
og i industrien. De dkkede emner i denne afhandling er relateret til de
to projekter: Hj kvalitet IP netvrk til IPTV og VoIP (HIPT) og The
Road to 100 Gigabit Ethernet (100GE), som er beskrevet i afhandlin-
gen.
Acknowledgement
I would like to thank my supervisors, Professor Lars Dittmann,
Dr. Michael S. Berger, and Dr. Sarah Ruepp, for their continued
guidance, support, and inspiration. It has been an arduous yet pleasant
journey to pursue my Ph.D during my stay at DTU. I really appreciate the
encouragement that Professor Lars Dittmann gave me back in 2008 before
this journey. Without him, it would have been impossible for me to
experience the beauty of the pursuit of my Ph.D.
I am grateful to Dr. Michael S. Berger for all the discussions and
help on the three topics in this thesis, and for the freedom of the
research environment that you have given to me and your other students.
Your inspiration has led me to many of my accomplishments, and I deeply
thank you for all the encouragement on numerous occasions over the
years.
I would like to express my appreciation to Dr. Sarah Ruepp for all the
kind help and support on the work of the projects. Your precision
has left a great impression on me, and it has been enjoyable and
agreeable to work with you.
Special thanks go to Dr. Ying Yan. The close collaboration with you
on the HIPT project gave me countless experiences which greatly helped
me in the first year of my Ph.D study.
Thanks to Dr. Villy Bæk Iversen for the inspiring discussions about
the analysis of the multi-level round-robin multicast scheduling
algorithm. Your solid traffic engineering and probability skills have
helped and inspired me greatly.
Thanks to all my other colleagues in the Network Technology
and Service Platform group: Dr. Henrik Wessing, Dr. Jose Soler, Dr.
Lars Staalhagen, Dr. Anna Vaseliva Manolova, Rong Fu, Jiang Zhang,
Lukasz Brewka, Ana Rossello, Anders Rasmussen, Anna Zakrzewska,
Jiayuan Wang, Thang Tien Pham, and Brian Sørensen. You have made
this group a pleasant place to work in and it has been my great pleasure
to spend these years with you.
Thanks again to Dr. Sarah Ruepp, Dr. Henrik Wessing, Dr. Jose
Soler, Jiang Zhang, and Dr. Ying Yan for proofreading and commenting
on this thesis.
Last, and most of all, I would like to express my gratitude to
my parents. I dedicate this thesis to you, for your understanding and
support that have given me strength over the years.
Ph.D Publications
The following publications have been made throughout this Ph.D project.
Publications on the topic: IPTV traffic
management in Carrier Ethernet transport
networks
[1] H. Yu, Y. Yan, and M. S. Berger, "IPTV traffic management in
Carrier Ethernet transport networks," in OPNETWORK 2008, 2008
[2] H. Yu, Y. Yan, and M. S. Berger, "IPTV traffic management using
topology-based hierarchical scheduling in Carrier Ethernet
transport networks," in International Conference on Communications
and Networking in China (ChinaCom), pp. 1-5, 2009
[3] H. Yu, Y. Yan, and M. S. Berger, "Topology-based hierarchical
scheduling using deficit round robin: Flow protection and isolation
for triple play service," in First International Conference on Future
Information Networks, pp. 269-274, 2009
[4] A. Rasmussen, J. Zhang, H. Yu, R. Fu, S. Ruepp, H. Wessing,
and M. S. Berger, "Towards 100 gigabit Carrier Ethernet
transport networks," WSEAS Transactions on Communications, vol. 9,
pp. 153-164, 2010
[5] H. Wessing, M. S. Berger, H. Yu, A. Rasmussen, L. Brewka, and
S. Ruepp, "Evaluation of network failure induced IPTV degradation
in metro networks," Recent Advances in Circuits, Systems, Signal
and Telecommunications, pp. 135-139, 2010
[6] H. Wessing, M. S. Berger, H. M. Gestssson, H. Yu, A. Rasmussen,
L. Brewka, and S. Ruepp, "Evaluation of restoration mechanisms
for future services using Carrier Ethernet," WSEAS Transactions
on Communications, vol. 9, pp. 322-331, 2010
Publications on the topic: Multicast scheduling
for input-queued high-speed switches
[1] H. Yu, S. Ruepp, and M. S. Berger, "A novel round-robin based
multicast scheduling algorithm for 100 gigabit ethernet switches," in
29th IEEE International Conference on Computer Communications
(INFOCOM) Workshops, pp. 1-2, 2010
[2] H. Yu, S. Ruepp, and M. S. Berger, "Round-robin based
multicast scheduling algorithm for input-queued high-speed Ethernet
switches," in OPNETWORK 2010, 2010
[3] H. Yu, S. Ruepp, and M. S. Berger, "Enhanced fifo based
round-robin multicast scheduling algorithm for input-queued switches,"
IET Communications, vol. 5, pp. 1163-1171, 2011
[4] H. Yu, S. Ruepp, and M. S. Berger, "Multi-level round-robin
multicast scheduling with look-ahead mechanism," in IEEE International
Conference on Communications, 2011
Publications on the topic: Out-of-sequence
prevention for multicast Clos-network
[1] H. Yu, S. Ruepp, and M. S. Berger, "Out-of-sequence prevention
for multicast input-queuing space-memory-memory Clos-network,"
IEEE Communications Letters, 2011
[2] H. Yu, S. Ruepp, and M. S. Berger, "Out-of-sequence preventative
cell dispatching for multicast input-queued space-memory-memory
Clos-network," in 12th IEEE International Conference on High
Performance Switching and Routing, 2011
Publications on the topic: Integrated control
platform design in converged optical and
wireless networks
[1] Y. Yan, H. Yu, and L. Dittmann, "Wireless channel condition aware
scheduling algorithm for hybrid optical/wireless networks," in 3rd
International Conference on Access Networks, pp. 397-409, 2008
[2] Y. Yan, H. Yu, H. Wang, and L. Dittmann, "Integration of EPON
and WiMAX networks: Uplink scheduler design," in SPIE Symposium
on Asia Pacific Optical Communications, 2008
[3] Y. Yan, H. Yu, H. Wessing, and L. Dittmann, "Integrated resource
management for hybrid optical wireless (HOW) networks," in
International Conference on Communications and Networking in China
(ChinaCom), pp. 1-5, 2009
[4] Y. Yan, H. Yu, H. Wessing, and L. Dittmann, "Enhanced signaling
scheme with admission control in the hybrid optical wireless (HOW)
networks," in 28th IEEE International Conference on Computer
Communications (INFOCOM) Workshops, pp. 1-6, 2009
[5] Y. Yan, H. Yu, H. Wessing, and L. Dittmann, "Integrated resource
management framework in hybrid optical wireless networks," IET
Optoelectronics Special Issue on Next Generation Optical Access,
vol. 4, pp. 267-279, 2010
This dissertation only includes work on the topics of (1) IPTV traffic
management in Carrier Ethernet transport networks, (2) Multicast
scheduling for input-queued high-speed switches, and (3) Out-of-sequence
prevention for multicast Clos-network.

List of Figures
1.1 Different levels of traffic scheduling
2.1 HIPT network architecture
3.1 Carrier Ethernet control and transport planes
3.2 Class-based scheduling system
3.3 Flow-based scheduling system
3.4 Balanced tree topology
3.5 Topology-based hierarchical scheduling system
3.6 Simulation scenario set-up
3.7 End-to-end delay (class-based, flow-based, and hierarchical)
3.8 Jitter (class-based, flow-based, and hierarchical)
3.9 Flow isolation ability (class-based, flow-based, and hierarchical)
3.10 Flow isolation ability (flow-based and hierarchical)
3.11 Affected delay (flow-based and hierarchical)
4.1 Unicast and multicast
4.2 Illustration of an input-queued switch
4.3 Illustration of an output-queued switch
4.4 Illustration of a shared-buffer switch
4.5 Illustration of a virtual output queued switch
4.6 System model of the multi-level round-robin multicast scheduling algorithm
4.7 Illustration of splitting a multicast scheduling problem
4.8 Multicast head-of-line blocking problem
4.9 MLRRMS: Submission, Decision, and Sync
4.10 MLRRMS: Look-ahead, Submission, Decision, and post-transmission status
4.11 Multicast latency, Bernoulli traffic
4.12 Queue size per input, Bernoulli traffic
4.13 Average LA depth, Bernoulli traffic
4.14 Multicast latency, bursty traffic (cell-based fan-out mode)
4.15 Queue size per input, bursty traffic (cell-based fan-out mode)
4.16 Average LA depth, bursty traffic (cell-based fan-out mode)
4.17 Multicast latency, bursty traffic (burst-based fan-out mode)
4.18 Queue size per input, bursty traffic (burst-based fan-out mode)
4.19 Average LA depth, bursty traffic (burst-based fan-out mode)
4.20 Improvement of the sync, Bernoulli traffic
4.21 Improvement of the sync, bursty traffic (cell-based)
4.22 Improvement of the sync, bursty traffic (burst-based)
4.23 Multicast latency, different balance factors
4.24 Average number of transmissions per cell, different balance factors
4.25 Throughput, different balance factors
5.1 Crossbar switch fabric
5.2 Clos-network switch fabric
5.3 Memory-Space-Memory Clos-network
5.4 Memory-Memory-Memory Clos-network
5.5 Input-Queued Space-Memory-Memory Clos-network
5.6 Demonstration of a fan-out vector
5.7 An example of the bit-cluster. The fan-out vector has N = 12 bits, and each bit-cluster has n = 4 bits. Therefore the fan-out vector can also be expressed by 3 bit-clusters. The cell is sent to OM0 and OM1 accordingly.
5.8 Desynchronized Static Round Robin
5.9 Multicast Flow-based DSRR
5.10 Multicast Flow-based Round Robin
5.11 Percentage of inter-packet OOS cells, LA = 0
5.12 Percentage of in-packet OOS cells, LA = 0
5.13 Percentage of the total number of OOS cells, LA = 0
5.14 Average reassembly delay per packet, LA = 0
5.15 Average reassembly buffer size, LA = 0
5.16 Maximum reassembly buffer size, LA = 0
5.17 Percentage of inter-packet OOS cells, LA = 0, 1, 2
5.18 Percentage of in-packet OOS cells, LA = 0, 1, 2
5.19 Percentage of the total number of OOS cells, LA = 0, 1, 2
5.20 Average reassembly delay per packet, LA = 0, 1, 2
5.21 Average cell delay, LA = 0
5.22 Average cell delay, LA = 0, 1, 2
List of Tables
1.1 A brief summary of the evolution of Ethernet.
5.1 A comparison of different Clos-network architectures.
5.2 A summarized comparison of different Clos-network architectures.
Contents
Abstract
Resumé
Acknowledgement
Ph.D Publications
1 Introduction
2 Motivation
3 Topology-based Hierarchical Scheduling
3.1 Introduction
3.2 Related Work
3.3 System Model and Problem Definition
3.4 Topology-based Hierarchical Scheduling Algorithm
3.5 Simulated Performance
3.5.1 Evaluation of Statistical Multiplexing Gain
3.5.2 Evaluation of Flow Protection
3.6 Summary
4 Multicast Scheduling Algorithms for Input-Queued Switches
4.1 Introduction
4.2 Related Work
4.3 System Architecture and Problem Definition
4.3.1 System Architecture
4.3.2 Problem Definition
4.4 The Multi-Level Round-Robin Multicast Scheduling Algorithm
4.5 MLRRMS Algorithm Analysis
4.5.1 Definitions
4.5.2 Analytical Description of the MLRRMS Algorithm
4.5.3 Heuristic Analysis of the Look-Ahead Mechanism
4.5.4 Complexity Analysis
4.6 Simulated Performance of MLRRMS
4.6.1 Traffic Model
4.6.2 Performance for Balanced Multicast Traffic under Different Offered Loads
4.6.3 Performance for Unbalanced Multicast Traffic under the Same Offered Load
4.7 Summary
5 Out-of-Sequence Prevention for Multicast Input-Queuing Space-Memory-Memory Clos-Network
5.1 Introduction
5.2 Related Work
5.3 System Model
5.4 Cell Dispatching Algorithms
5.4.1 Multicast Flow-based Desynchronized Static Round-Robin (MF-DSRR) Dispatching
5.4.2 Multicast Flow-based Round-Robin (MFRR) Dispatching
5.5 Performance Analysis and Simulation Results
5.5.1 In-Packet OOS Performance of the MF-DSRR
5.5.2 In-Packet OOS Performance of the MFRR
5.5.3 Time Complexity of MF-DSRR and MFRR
5.5.4 Advantages and Limitation of the MFRR
5.5.5 Simulation Results
5.6 Summary
6 Conclusion
Bibliography
List of Acronyms
Chapter 1
Introduction
"The best way to predict the future is to invent it."
Over the last two decades, with the rapid development of
telecommunication technologies, network bandwidth capacity has increased
significantly, both in the access and the transport areas. This has
led to a boom of network applications that require broadband access
and high network capacity, such as High Definition (HD)
Video-on-Demand (VoD), video sharing, videoconferencing, and online
gaming. At the same time, the recent invention and development of such
bandwidth-hungry applications have sped up the growth of network
capacity and have generated more demands on transmission speed. It is
foreseeable that this trend will continue in the near future, and
network operators will keep upgrading the capacity of their networks.
However, increasing the network capacity alone cannot provide customers
with an excellent experience of the applications without proper traffic
management functions to classify, schedule, and monitor the enormous
amount of network traffic.
Quality-of-Service (QoS) has been a relevant issue for many
years, ever since Internet applications ceased to be limited to
best-effort data applications such as plain web browsing. Guaranteeing
QoS is of great importance to network operators because
it is the foundation for providing many applications, including HD-VoD,
Internet Protocol Television (IPTV), and Voice-over-IP (VoIP). Without
the ability to provide QoS guarantees, IPTV traffic, for instance,
can be suddenly delayed, or a voice call can be unexpectedly dropped,
due to an increase in network traffic load, which undoubtedly affects the
user experience and therefore the popularity of the application. Traffic
management aims to schedule different traffic, avoid congestion, and
allocate bandwidth, in order to provide a fine-grained QoS ability to the
network.
Developed in the 1970s, Ethernet is a frame-based networking
technology standardized in IEEE 802.3, and was originally designed for
Local Area Networks (LANs). As shown in Table 1.1, Ethernet has evolved
from 10 Mbit/s to today's 100 Gbit/s. In addition to higher bandwidth,
the evolution also includes improved Media Access Control (MAC) schemes
and changes of physical medium, which are outside the scope of this
dissertation.
Generation            Year   Standards                    Bit Rate
Ethernet              1985   IEEE 802.3a                  10 Mbit/s
Fast Ethernet         1995   IEEE 802.3u                  100 Mbit/s
Gigabit Ethernet      1999   IEEE 802.3ab, IEEE 802.3ah   1000 Mbit/s
10 Gigabit Ethernet   2002   IEEE 802.3ae                 10 Gbit/s
40 Gigabit Ethernet   2010   IEEE 802.3ba                 40 Gbit/s
100 Gigabit Ethernet  2010   IEEE 802.3bg                 100 Gbit/s
Table 1.1: A brief summary of the evolution of Ethernet.
Ethernet has successfully dominated LANs for decades, and its bit
rate has increased from several megabits per second to today's
100 gigabits per second, driven by the development of various
applications. Since LANs are increasingly connected
to the Metropolitan Area Network (MAN) over Ethernet interfaces,
operators are driven to provide Ethernet services in their MANs.
Due to the dominance of Ethernet, Carrier Ethernet is defined
by the Metro Ethernet Forum (MEF) [18] as an extension that enables
telecommunication operators to provide standardized Ethernet services
to their customers, such as E-Line, E-LAN and E-Tree services [19, 20].
The transport network is evolving from the legacy technology that
provides constant bit rate connections to today's Carrier Ethernet
technologies, which are capable of providing flexible bandwidth and
services through packet switching. The two main candidate Carrier
Ethernet technologies are Provider Backbone Bridge with Traffic
Engineering (PBB-TE), defined in IEEE 802.1Qay [21], and Multi-protocol
Label Switching Transport Profile (MPLS-TP) [22], jointly developed by
the International Telecommunication Union-Telecommunication
Standardization Sector (ITU-T) and the Internet Engineering Task Force
(IETF).
It has been widely discussed that Carrier Ethernet can become the
main framework for the next generation transport networks [19, 23-28].
However, advanced traffic management functionalities are integral to
the Carrier Ethernet transport network, in order to provide guaranteed
QoS to various applications. Thus, this dissertation focuses on the
development of effective traffic management mechanisms for switches in
the next generation Carrier Ethernet transport network, to satisfy
various QoS requirements. This implies that traffic management on the
Internet Protocol (IP) layer, IP lookup, IP routing and other
Layer 3 (L3) technologies are outside the scope of this dissertation.
The term switch is used throughout this dissertation to indicate either
the Layer 2 (L2) switch or the switch engine used by IP routers.
Scheduling technology plays an essential role in traffic management,
because it is the scheduling algorithm that steers the QoS
performance of a switch and ultimately of the entire network. As packets
arrive at a switch, a packet-based scheduler schedules packets according
to the QoS requirements of different packet streams. Before packets are
transmitted to the switch fabric, the core of the switch, they are
usually segmented into fixed-size pieces, known as cells. The purpose of
doing so is to increase the throughput and reduce the scheduling
complexity [29]. A cell-based scheduler is responsible for selecting
cells to transmit to the switch fabric. As an enormous number of cells
enter the switch fabric, cells are arranged by a scheduling algorithm on
the lower, in-switch fabric level of traffic management. Figure 1.1
illustrates the relation between different levels of scheduling.
[Figure 1.1: Illustration of different levels of traffic scheduling (flow-level and cell-level).]
Given the increased network capacity, it is challenging for both
academia and industry to provide solutions for the traffic management
of switches in the next generation high-speed transport networks. The
issues of scalability, complexity, and performance should be taken into
consideration. Hence, the work in this dissertation can be summarized
as:
The objective of this dissertation is to develop novel
traffic management mechanisms for the next generation
transport networks, in order to provide improved
traffic QoS guarantees. The work considers
scalability and complexity as key factors,
and aims to improve the QoS performance. Scheduling
algorithms on different levels and scales are proposed,
and simulations are carried out for evaluation.
Following the relation between scheduling on different levels
illustrated above, this dissertation is organized as follows:
Chapter 2 presents the motivation for the work of this dissertation
within the scope of two projects: High quality IP network for IPTV and
VoIP (HIPT) and The Road to 100 Gigabit Ethernet (100GE).
Chapter 3 concentrates on traffic scheduling algorithms for IPTV
in Carrier Ethernet transport networks, aiming to provide improved
end-to-end QoS. This chapter discusses the possibility and the
benefits of centralizing intelligent traffic management functions at the
edge of the network, and examines different packet-based scheduling
algorithms. The topology-based hierarchical scheduling algorithm is
proposed to reduce the network deployment cost and at the same time to
provide flow isolation, which is critical to IPTV traffic. The work in
this chapter is included in [1-3].
Chapter 4 explores multicast scheduling algorithms for switches.
Taking scalability and implementation complexity into account,
this chapter proposes a novel cell scheduling algorithm for input-queued
multicast switches. The proposed sync mechanism enables the switch
to reduce unnecessary repeated transmissions of a multicast cell
without affecting the output port utilization. The proposed look-ahead
mechanism is able to increase the throughput of the input queuing
architecture by reducing head-of-line blocking. The work in this chapter
is included in [7-10].
Chapter 5 investigates multicast scheduling algorithms and cell
dispatching schemes inside the switch fabric of the three-stage
Clos-network architecture. In order to prevent the out-of-sequence
problem, two novel cell dispatching schemes are proposed for the
input-queued space-memory-memory Clos-network. Two types of
out-of-sequence problems are defined in this chapter in order to
evaluate the different cell dispatching schemes. The work in this
chapter is included in [11, 12].
Finally, Chapter 6 presents the conclusion of this dissertation and
addresses future research.

Chapter 2
Motivation
In this chapter, the motivation of the dissertation is presented within
the scope of two projects.
For many years, telecommunication service providers have been seeking
new ways to offset the declining revenue from the traditional
telephone service and broadband access. It is believed that the
introduction of Internet Protocol Television (IPTV) service is the next
step that can deliver a substantial increase in revenue to the
operators. As a response to the increased interest in IPTV, the Danish
Advanced Technology Foundation decided to finance a research project
entitled High quality IP network for IPTV and VoIP (HIPT). The objective
of the HIPT project is to enhance the Carrier Ethernet transport network
for IPTV applications by developing technologies that can fulfill the
increasing requirements, such as an integrated control plane, traffic
management, extended surveillance mechanisms and methods for protection,
redundancy and resiliency, and that can at the same time reduce the cost
of network operation.
Although IP and Layer 3 (L3) have proven useful in addressing the
Internet and other best effort data applications, this approach is not
well suited to high-bandwidth, critical services, such as IPTV, which
cannot tolerate delays in the network in general. The HIPT project
intends to investigate whether intelligent Layer 2 (L2) and Layer 1 (L1)
networks can be used to alleviate the problems seen in the current IPTV
networks. Using Provider Backbone Bridge with Traffic Engineering
(PBB-TE) and Multi-protocol Label Switching Transport Profile (MPLS-TP),
autonomous network decision making is removed and more traffic
engineering is performed in the network. This ensures control over
exactly where traffic is being transported in the network, with the
further ability to monitor individual traffic flows, which is not easy
to accomplish in L3 [30]. This approach has the further advantage of
being able to support L3. Rather than replacing L3, the solution intends
to supplement it by reducing the need for costly nodes supporting
services. Thus, instead of deploying a large number of nodes with large
complexity, a simpler, yet intelligent, L2/L1 network based on Carrier
Ethernet transport technologies can reduce cost and complexity, while
enabling independent scaling of IPTV services at L3 with fewer nodes.
The Carrier Ethernet based network architecture for IPTV transport is
based on an L2 approach with L3 support in the edge routers and L3
awareness in the Digital Subscriber Line Access Multiplexer (DSLAM), as
shown in Figure 2.1.
Figure 2.1: HIPT network architecture.
For simplicity, only the DSLAM access is shown, but other types of
access technologies can also be applied. IPTV traffic flows are
terminated in the Set Top Box (STB) in the home network. The L2 network
between the Internet Protocol (IP) DSLAM and the edge router is assumed
to be based on either MPLS-TP or PBB-TE. In both cases, the goal is to
transport the IPTV traffic with carrier-class quality, and at the same
time to reduce the cost by utilizing the Carrier Ethernet technologies.
This demands that the L2 Carrier Ethernet network is able to deliver
sufficient capacity and traffic management capabilities. The goal of the
HIPT project is to develop high-capacity Carrier Ethernet network nodes
with advanced traffic management, Operation, Administration and
Maintenance (OAM), and support to guarantee the transport of demanding
real-time applications, such as IPTV. Research challenges include the
QoS-enabled Carrier Ethernet control plane, OAM for IPTV flow
monitoring, resiliency and survivability, and traffic management with
end-to-end QoS guarantee.
Given the increasing popularity of IPTV services and other high-
bandwidth applications, the Ethernet transmission speed has evolved
from 10 Mbit/s to today's 100 Gbit/s. The Road to 100 Gigabit Eth-
ernet (100GE) is an ongoing project funded by the Danish Advanced
Technology Foundation, aiming at scaling the Ethernet capacity from
current 10 Gbit/s to the next generation 100 Gbit/s. The challenges will
include interface adaptation, data and control plane processing, high-
speed switching, power and printed circuit board design. The goal of
achieving the speed of 100 Gbit/s is ambitious and requires improve-
ments in the low-level circuit technology. However, the improvements in
the low-level circuit technology will be far from enough to guarantee the
switching performance when moving to 100 Gbit/s. Advanced packet
processing and switching technologies with a high degree of scalability
are indispensable.
At the speed of 100 Gbit/s, the processing time of an Ethernet packet
can be as short as about 5 ns. Access to external memories for traffic
management is a substantial challenge because memory speed does not
follow Moore's law, under which the processing capacity doubles roughly
every second year. Therefore, a highly scalable and low-complexity
scheduling algorithm for switching becomes significant for the project.
The work in this dissertation includes the IPTV traffic management
for Carrier Ethernet transport networks in the HIPT project, and the
high-speed multicast scheduling algorithm and cell dispatching
algorithms in the 100GE project.

Chapter 3
Topology-based
Hierarchical Scheduling
Carrier Ethernet is becoming a favorable transport technology for the
Next Generation Network (NGN). Its cost-efficiency, operational
flexibility and high bandwidth are highly attractive to service
providers [20,24]. However, to achieve these characteristics, Carrier
Ethernet needs to obtain the required provisioning abilities, which
guarantee the end-to-end performance of voice, video and data traffic
delivered over the network.
Switches with class-based scheduling algorithms schedule traffic based
on different QoS classes. Although simple to implement, the class-based
scheme cannot isolate different traffic flows that belong to the same
QoS class, because packets of the same QoS class are stored in the same
queue. Any maliciously behaving traffic flow can affect the other
conforming traffic of the same QoS class, leaving the network vulnerable
to traffic attacks.
Switches with flow-based scheduling algorithms, on the other hand,
are able to protect each traffic flow from being affected by others.
This is implemented by further dividing the traffic of the same QoS
class into different queues, based on information such as port number,
source address, or destination address. However, in order to provide
end-to-end QoS guarantees, the network operator needs to upgrade all the
switches in the network, which is difficult and costly to implement.
In this chapter, a topology-based hierarchical scheduling scheme is
proposed as an alternative solution for providing end-to-end QoS
guarantees. The main idea of the topology-based hierarchical scheduling
is to map the topology of the connected network into the logical
structure of the scheduling system, and to combine several token
schedulers according to the topology. The mapping process can be
completed through the network management plane or by manual
configuration, which is out of the scope of this chapter. Based on the
knowledge of the network topology, the scheduler can manage the traffic
on behalf of other less advanced nodes in the network, avoiding
potential traffic congestion, and providing flow protection and
isolation.
Comparisons among the topology-based hierarchical scheduling, the
flow-based scheduling, and the class-based scheduling algorithms are
carried out under a symmetric binary tree topology. Simulation results
show that the topology-based hierarchical scheduling algorithm
outperforms the others in terms of flow protection and isolation from
the attack of malicious traffic. This is significant for Internet
Protocol Television (IPTV) services in the Carrier Ethernet transport
networks.
3.1 Introduction
Ethernet, an incontestable technology that has dominated the Local
Area Network (LAN) for decades, is now being developed and extended
to become a possible choice for the Metropolitan Area Network (MAN).
The pressure from competition and the changing communication and
entertainment needs of residential customers are driving operators to
upgrade their networks to be capable of voice, video and data delivery
(also known as triple-play services). Most voice, video, and data
services used to be provided by separate networks, such as the Public
Switched Telephone Network (PSTN), the cable television network, and
the Internet. The tendency of today is to integrate the services on a
single network. In such a network, video broadcasting/multicasting and
Video-on-Demand (VoD) services on IP networks (also known as IPTV
services) will significantly increase the traffic load. To ensure that
the quality of IPTV services is guaranteed without damaging
Voice-over-IP (VoIP) services and high-speed Internet access, the
different QoS requirements of each type of traffic must be met by the
converged network. Thus, a fine-grained traffic management scheme is
demanded.
Each type of service has different QoS requirements [31, 32].
Developing a set of traffic management functions to meet the various
requirements of each service is an important issue. The network can be
subjected to a very heavy traffic load for a certain period. Especially
at the edge node, a large amount of incoming traffic competes for the
output bandwidth. Hence, it is relevant to discriminate different
services and provide guaranteed QoS performance, so that a
bandwidth-hungry user does not cause performance degradation to other
users in the network.
The Metro Ethernet Forum (MEF) [18] has provided a clear definition
of Carrier Ethernet. Based on the description from MEF, Carrier
Ethernet is defined as an omnipresent, standardized, carrier-class
service and network defined by five attributes that distinguish Carrier
Ethernet from LAN based Ethernet:
- Standardized Services
- Scalability
- Reliability
- Quality of Service
- Service Management
To use Ethernet as a transport technology, which requires customer
separation and manageability, Provider Backbone Bridge with Traffic
Engineering (PBB-TE) [21] and Multi-protocol Label Switching Transport
Profile (MPLS-TP) [22] have been developed and proposed as carrier grade
Ethernet transport network solutions. PBB-TE is a recent development
after several years of work by the Institute of Electrical and
Electronics Engineers (IEEE) aiming at improving and enhancing Ethernet
technology for use in carrier networks. PBB-TE reuses current
implementations of Virtual Local Area Networks (VLANs) and combines
them with the network separation and layering principles of PBB [19,
23]. MPLS-TP, the former Transport MPLS (T-MPLS), is now developed
under the cooperation of the International Telecommunication
Union-Telecommunication Standardization Sector (ITU-T) and the Internet
Engineering Task Force (IETF). It promises a solution that provides
familiar and reliable packet-based technology in a way that is aligned
with circuit-based transport networks. Both technologies aim to provide
a connection-oriented packet switching transport network, where traffic
is tunneled and delivered to the destinations [26].
As shown in Figure 3.1, Carrier Ethernet contains two separate and
independent domains, the control plane and the transport plane. The
specification of the control plane implementation is not yet finished
in the process of standardization. The main functions of the control
plane nevertheless include QoS mapping, label distribution, and Call
Admission Control (CAC) [27,28]. In the transport plane, traditional
switches should be updated with advanced functionalities in order to
provide carrier grade services and to guarantee the QoS performance,
especially for real-time traffic such as IPTV.
Figure 3.1: The concept of the control plane and transport plane of a Carrier
Ethernet network.
In [1], the flow-based scheduling scheme using the Deficit Round Robin
(DRR) [33] algorithm has been evaluated and shown to be an appropriate
choice for the IPTV service in Carrier Ethernet transport networks.
Although the flow-based scheduling scheme is capable of treating
traffic flows separately and providing better protection than the
class-based scheduling scheme, it requires the network operator to
upgrade the entire network with flow-based scheduling nodes, which
demand large volumes of buffers. Under economic consideration, network
operators consider not only the capability of the network, but also
the corresponding cost to deploy and maintain such a network. It has
been discussed in [24, 25] that Carrier Ethernet can greatly reduce the
consequences of the complexity associated with the large scale of
carrier networks by being a cost-effective replacement for Synchronous
Optical Networking (SONET)/Synchronous Digital Hierarchy (SDH) [34].
To keep the preferable features of Carrier Ethernet and to reduce the
required deployment period, the topology-based hierarchical scheduling
scheme is proposed in this chapter. The term hierarchical scheduling
has been mentioned and discussed actively in other researchers' work.
In [35–38], hierarchical scheduling is mainly discussed as an
improvement to the traditional DRR for a single network node. What is
still lacking is a hierarchical scheduling scheme that takes the
network topology into consideration. Given the detailed topology of a
network whose nodes are incapable of flow management, the
topology-based hierarchical scheduling node is able to avoid traffic
congestion and guarantee QoS requirements. Since the interior nodes of
the network may only provide simple forwarding abilities, intelligence
can be condensed in the hierarchical scheduling nodes at the edge of
the network. By learning the topology of the connected network, the
hierarchical scheduler will be able to schedule packets on behalf of
the interior nodes.
The remaining parts of this chapter are structured as follows. In
Section 3.2, different related scheduling algorithms are compared and
the advantages of the DRR scheduling algorithm are explained. In
Section 3.3 the system model is presented and the problem is defined.
Section 3.4 discusses the benefit of hierarchical scheduling and
demonstrates the concept. Section 3.5 presents and analyzes the
simulation results. Finally, Section 3.6 concludes the chapter.
3.2 Related Work
Scheduling algorithms are used in a switch design in order to attain
the QoS requirement and fairly allocate limited resources among traffic
flows. A significant amount of research has contributed to the
development of scheduling algorithms [33, 35, 39–52], and they can be
mainly divided into two categories: timestamp-based scheduling (also
known as sorted-priority scheduling) and frame-based scheduling.
Timestamp-based schedulers maintain a global virtual time to emulate
the ideal Generalized Processor Sharing (GPS) [39]. Arriving packets
are marked with timestamps which are generated through the virtual
machine. The timestamps are used by the scheduler to determine the
order of packet departure. This category includes the Weighted Fair
Queuing (WFQ) [41], the Worst-case Fair Weighted Fair Queuing (WF²Q)
[42], the Self-Clocked Fair Queuing (SCFQ) [43], and the Start-time
Fair Queuing (SFQ) [44]. These timestamp-based schedulers can provide
good fairness and low latency. However, a main drawback is that these
methods are not efficient enough, due to the complexity involved in
computing the system virtual time and sorting the packets based on the
timestamps [53]. The WFQ and the WF²Q scheduling schemes require O(N)
time to complete a scheduling decision, where N denotes the number of
active sessions or flows sharing the outgoing link of the switch. The
SCFQ approach reduces the time complexity but still has an O(log N)
bottleneck. Using this kind of scheduler can hinder the scalability of
the switching system.
On the other hand, frame-based schedulers serve packets in a round
robin manner, i.e. during each round, at least one flow receives a
transmission opportunity. This category includes the Deficit Round
Robin (DRR) [33], the Elastic Round Robin (ERR) [45], the Carry-Over
Round Robin (CORR) [47], and the Mini Round Robin (MRR) [48].
These schedulers do not need to calculate the virtual time, and thus
have low time complexities, and their design is simpler than that of
timestamp-based schedulers. DRR is one of the early frame-based
scheduling algorithms proposed to overcome the unfairness and has a
time complexity of O(1), which is much lower than that of the
timestamp-based algorithms. It has been concluded in [33] that DRR
provides near-perfect isolation at low implementation cost and can be
combined with other fair queuing algorithms to offer better latency
bounds. Low time complexity is a significant factor in a switch design,
especially for a high speed link.
Together with advanced buffer management, the DRR algorithm can
support sufficient QoS differentiation between flows and can guarantee
that any maliciously behaving flow does not affect the QoS performance
of other conforming traffic flows. The traditional way of dividing
packets is based on their QoS classes, as shown in Figure 3.2.
Different QoS classes have different requirements on delay or
bandwidth. As mentioned earlier, different services have different QoS
requirements, i.e. some traffic is sensitive to delay while other
traffic requires adequate bandwidth. Based on this model, a separate
queue for each QoS class is created in a switch to store packets of
the same class. By giving different priorities, different shares of
bandwidth are assigned to the classes.
Figure 3.2: Demonstration of a class-based scheduling system. Variable-length
packets are stored into queues based on their QoS classes.
However, this traditional queuing scheme is incapable of providing
isolation to traffic flows that have the same QoS class but different
sources or destinations. Since all the packets of the same QoS class
are stored in the same queue, a malicious flow can consume a huge
amount of bandwidth, resulting in a significant performance degradation
to other flows of the same class. Given the requirement of flow
isolation, the flow-based scheduling scheme is proposed in [1], as
shown in Figure 3.3. Within each class queue, a separate queue is
assigned to packets of the same source or destination, depending on
whether the flow-based scheduler is located at the output port or the
input port. In addition to using ⟨Source|Destination|QoS class⟩ to
define a flow, information such as VLAN ID can also be included. A
central scheduler grants permission to each subscheduler using DRR, and
each subscheduler selects packets from different queues in a DRR
manner once a permission is granted. This architecture ensures
isolation between each flow as well as each class. However, to provide
end-to-end QoS guarantee, all the nodes in the network should be
upgraded to this advanced architecture, which is not cost-efficient.
Figure 3.3: Demonstration of a flow-based scheduling system. Variable-length
packets are first sorted based on the QoS classes, then further sorted based on
the flow ID.
3.3 System Model and Problem Definition
As discussed in Section 3.2, although the flow-based scheme can provide
the operator with a network with end-to-end QoS guarantee by replacing
all the switches with advanced models, it places a considerable burden
on the network operator, especially when the network is fairly large.
Upgrading a large network inevitably requires a long deployment time
and a substantial financial investment. Distributing intelligence, in
terms of large memory, advanced scheduling algorithms, flow control
ability and so forth, to all the nodes in the network will also require
a management platform that can manage and configure the switches
efficiently. Besides, the resources, e.g. the size of the buffer, which
the operator brings to each of the nodes, may not be fully utilized.
A possible alternative is to centralize intelligence and introduce an
intelligent switch with the knowledge of the network topology located
at the edge of the network. One good example could be the ingress
and egress router in an MPLS network. Typically, the MPLS label is
attached to an IP packet at the ingress router and removed at the egress
router, while label swapping is performed at the intermediate routers.
This intelligent node should be able to manage the traffic on behalf of
other nodes which lack advanced traffic management ability, and thus
can avoid potential traffic congestion in the network.
From the point of view of IPTV traffic distribution, the tree topology
is usually used to construct the network [54,55]. A generic tree
topology is presented in Figure 3.4.
The intelligent node is connected to the root node, denoted as HS.
The tree network is assumed to have N levels and each node is a parent
to the nodes in the level below and at the same time a child to the
node in the level above. Nodes in level 0 are leaf nodes and have no
child nodes attached. It is assumed that in each level the number of
child nodes connected to an upper-level node is the same, denoted as
$M_l$, where $l = 0, 1, 2, \ldots, N-1$. Except for the root node, HS,
an arbitrary node in the network is denoted as $N_i^l$, where
$0 \le i < \prod_{j=0}^{N-1} M_j$, and $M_0 = 1$. The transmission
speed of the link between a node $N_i^l$ and one of its child nodes is
assumed to be $1/M_l$ of the link speed between $N_i^l$ and its parent
node. Since the link speed of a node is evenly divided and allocated to
its child nodes, this network topology is referred to as the balanced
tree topology in this chapter.
Figure 3.4: A balanced tree topology network with the topology-based hierarchical
scheduler (HS) node connected to the root node.
Since the nodes in the network lack advanced traffic management
functionalities, the edge node HS should schedule the arriving traffic
on behalf of the lower-level nodes. The topology-based hierarchical
scheduling scheme is an attractive candidate for such a situation.
3.4 Topology-based Hierarchical Scheduling
Algorithm
First, the principle of DRR is reviewed before describing the
topology-based hierarchical scheduling algorithm. To serve queues, the
scheduler uses a round-robin pattern with a quantum assigned to each
queue, which is the number of bytes allowed to be sent from a queue
within one round. The quantum size in bytes, Q, is usually set to the
maximum packet size to ensure that at least one packet is served during
one scheduling round, which maintains a low complexity. If Q is at
least as large as the length in bytes, L, of the packet at the head of
the queue, the packet is sent and Q = Q - L. If a queue is not able to
send a packet in the previous round because the packet is too large,
i.e. Q < L, the remainder from the previous quantum is added to the
quantum for the next round. Hence, the deficits are kept and unfairly
treated queues are compensated in the next round. By adjusting the
quantum size for each queue, the total bandwidth is allocated in
proportion to the quantum size.
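The DRR principle reviewed above can be sketched in a few lines. This is
an illustrative version only: the function name `drr_schedule`, the use
of plain packet lengths as queue entries and the example values are
assumptions of this sketch, not part of the scheduler evaluated in this
chapter.

```python
from collections import deque


def drr_schedule(queues, quanta, rounds):
    """Serve queues of packet lengths (bytes) with Deficit Round Robin.

    Returns (queue_index, packet_length) pairs in departure order.
    """
    queues = [deque(q) for q in queues]
    deficit = [0] * len(queues)          # one deficit counter per queue
    sent = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0           # an empty queue does not save up credit
                continue
            deficit[i] += quanta[i]      # remainder of last round plus new quantum
            while q and q[0] <= deficit[i]:
                length = q.popleft()
                deficit[i] -= length     # Q = Q - L
                sent.append((i, length))
            if not q:
                deficit[i] = 0
    return sent


# Queue 1 starts with a 900-byte packet that exceeds its 500-byte quantum,
# so it waits one round and is compensated in the next one.
print(drr_schedule([[300, 300, 300], [900, 100]], [500, 500], rounds=3))
```

In this example queue 1 sends nothing in the first round and both of its
packets in the second, which is the compensation effect described above.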
To provide per-flow isolation, the flow-based schedulers are leveraged
to compose the topology-based hierarchical scheduling system. Based
on the balanced tree topology of the connected network, a mapping can
be created by several token schedulers. Figure 3.5 demonstrates the
schematic structure of the topology-based hierarchical scheduler.
A token is generated for each arriving packet at the packet classifier.
The token should carry scheduling information for its corresponding
packet, such as packet weight, destination/source ID, and QoS class.
The packet weight is a value in proportion to the actual packet length.
It is used by the token scheduler as a virtual packet length to control
the packet transmission rate. The packet is forwarded into the packet
memory and the token is stored in the token queues based on its flow
ID ⟨Source|Destination|QoS class⟩.
The topology-based hierarchical scheduling algorithm contains three
steps, Selection, Grant, and Update, which are described below:
Selection: Using the DRR scheduling algorithm, the scheduling
system establishes N levels of token schedulers, from level 0 to $N-1$.
Each scheduler, except the top-level scheduler $S_{N-1}$, and each token
queue has a Deficit Counter (DC) attached to store the value of Q, the
remainder of the quantum size. A token queue is defined to be
backlogged if it is not empty during a scheduling period [33].
Similarly, a scheduler turns backlogged only when it has selected a
queue or a lower-level scheduler to serve. A level-p scheduler
$S_p(x)$, $0 < p < N-1$, operates the DRR algorithm on its backlogged
level-(p-1) schedulers. A level-0 scheduler, $S_0(x)$, runs the DRR
algorithm on its backlogged class schedulers $cS_j$, and each class
scheduler executes the DRR algorithm on the backlogged token queues.
Following this process, the schedulers make the scheduling decision
level by level until the level-(N-2) schedulers complete the selection
phase. All the decisions made in this step are pending and wait for
further grants from the top-level scheduler.
Grant: A scheduler can grant a permission to the selected token
queue/scheduler when and only when it receives a grant from its
upper-level scheduler (parent scheduler). The top-level scheduler,
$S_{N-1}$, has no parent scheduler and therefore grants permissions
using only the DRR algorithm. Once a permission is granted by
$S_{N-1}$, it is passed down level by level until the permission
reaches a token queue. A token path is established after this step.
Update: Upon the reception of the permission, a token is sent to
$S_{N-1}$ along the token path and all the schedulers on the path update
their deficit counters with a reduction of the packet length information
carried by the token, i.e. DC = DC - L. When $S_{N-1}$ receives the
token, it sends out the corresponding packet from the packet memory
according to the information carried by the token. After this step, a new
scheduling round begins until the quantum sizes of all token schedulers
become zero or no selection is made because the packet size is larger
than the quantum size.
Figure 3.5: The topology-based hierarchical scheduling system. The main scheduler
$S_{N-1}$ grants permission backwards, and a token path is established, shown as
the red arrowed dashed line. Token(s) are passed to $S_{N-1}$, and deficit
counters on the path are updated.
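The Selection, Grant and Update steps can be illustrated with a heavily
simplified, single-threaded sketch. It is not the implementation
evaluated in this chapter: class schedulers are omitted, token rates and
timing are not modelled, and the class names `TokenQueue` and
`Scheduler`, the helper `run_round` and all numeric values are
hypothetical. Each scheduler other than the top-level one and each token
queue carries a deficit counter, and a token is only released when every
counter on its path can cover the token's packet weight.

```python
from collections import deque


class TokenQueue:
    def __init__(self, name, quantum, weights):
        self.name, self.quantum, self.dc = name, quantum, 0
        self.tokens = deque(weights)          # packet weights of queued tokens

    def backlogged(self):
        return bool(self.tokens)

    def refill(self):
        self.dc = self.dc + self.quantum if self.tokens else 0

    def select(self):
        # Offer the head token only if the deficit counter covers its weight.
        if self.tokens and self.tokens[0] <= self.dc:
            return [self], self.tokens[0]
        return None


class Scheduler:
    def __init__(self, quantum, children, top=False):
        self.quantum, self.dc, self.children = quantum, 0, children
        self.top, self.next = top, 0          # next: round-robin pointer

    def backlogged(self):
        return any(c.backlogged() for c in self.children)

    def refill(self):
        self.dc = self.dc + self.quantum if self.backlogged() else 0
        for c in self.children:
            c.refill()

    def select(self):
        # Selection: scan children round-robin for a token this level can afford.
        n = len(self.children)
        for k in range(n):
            idx = (self.next + k) % n
            offer = self.children[idx].select()
            if offer and (self.top or offer[1] <= self.dc):
                return [(self, idx)] + offer[0], offer[1]
        return None


def run_round(top):
    """One scheduling round: refill counters, then grant until nothing is selectable."""
    top.refill()
    sent = []
    while True:
        offer = top.select()
        if offer is None:
            return sent
        path, weight = offer
        *hops, queue = path                   # Grant: path from top scheduler to queue
        for sched, idx in hops:               # Update: pointers and deficit counters
            sched.next = (idx + 1) % len(sched.children)
            if not sched.top:
                sched.dc -= weight
        queue.dc -= weight
        queue.tokens.popleft()
        sent.append((queue.name, weight))


a = TokenQueue("A", 1500, [1000, 1000])       # flow A offers more than its share
b = TokenQueue("B", 1500, [500])
c = TokenQueue("C", 1500, [700])
d = TokenQueue("D", 1500, [])
top = Scheduler(0, [Scheduler(1500, [a, b]), Scheduler(1500, [c, d])], top=True)
print(run_round(top))
print(run_round(top))
```

In this toy run the second token of flow A has to wait for the next
round, because the level-0 scheduler shared by A and B has used up its
quantum, while B and C are served in the first round. This mirrors the
flow isolation the scheme is designed to provide.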
To avoid congestion in the network, token schedulers control the
packet transmission rate by the token rate and the packet weight.
Packet Weight (PW) is a function of the actual packet length l, and is
calculated by the packet classifier for each arriving packet. In the
token scheduling system, a virtual packet transmission time is
calculated as the PW divided by the token rate. The PW function is
shown in Equation 3.1.
$$PW(l) = l \cdot f_w \tag{3.1}$$
where $l$ denotes the packet length stored in the Ethernet Media Access
Control (MAC) header, and $f_w$ is the configurable packet weight factor.
Since the token rate corresponds to the actual transmission rate of
the node in the network, it is calculated as the product of the weight
factor $f_w$ and the actual link rate $R_{link}$. The token rate
calculation is shown in Equation 3.2. The network operator can modify
the packet weight factor to control the granularity of token rates in
order to adapt the switch to the network.
$$R = f_w \cdot R_{link} \tag{3.2}$$
where $R$ is the basic token rate and $R_{link}$ is the link rate of the
end node.
The time between two token selections is the sum of the packet weights
of the token(s) divided by the token rate of the scheduler. By this
means, the packet transmission rate is controlled so that packets are
transmitted within the capacity of the nodes in the network, and thus
traffic congestion can be avoided.
Each $S_0$ can only send one token to $S_{N-1}$ per time unit.
$S_{N-1}$ is able to grant up to $\prod_{i=0}^{N-2} M_i$ permissions, so
that each $S_0$ receives at most one permission within one time unit. A
token scheduler $S_p(x)$ on level p receives at most
$\prod_{i=0}^{p-1} M_i$ permissions from its parent scheduler on level
p+1. Therefore, if the token rate of $S_0$ is assumed to be R, the
token rate of $S_p$ becomes $\prod_{i=0}^{p-1} M_i \cdot R$,
$0 < p \le N-1$.
Schedulers on different levels have different quantum sizes, which
correspond to their token rates. If the quantum size for a token queue
is assumed to be Q, then the quantum sizes for the class scheduler and
$S_0$ are both Q because a level-0 scheduler can only send one token
per time unit. For a level-p scheduler $S_p$, the quantum size becomes
$Q_p = \prod_{i=0}^{p-1} M_i \cdot Q$, $0 < p < N-1$.
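The per-level token rates and quantum sizes follow directly from the
products above. The sketch below assumes a balanced tree described by a
list `fanouts` holding $M_0, \ldots, M_{N-2}$; the function names and the
example values (fan-out 2 on every level, R = 10, Q = 1500) are
illustrative assumptions.

```python
from math import prod


def level_scale(fanouts, p):
    """Product M_0 * ... * M_(p-1); the empty product for p = 0 is 1."""
    return prod(fanouts[:p])


def token_rates(fanouts, base_rate):
    # Levels 0 .. N-1, where N - 1 == len(fanouts).
    return [level_scale(fanouts, p) * base_rate for p in range(len(fanouts) + 1)]


def quantum_sizes(fanouts, base_quantum):
    # Levels 0 .. N-2 only: the top-level scheduler has no deficit counter.
    return [level_scale(fanouts, p) * base_quantum for p in range(len(fanouts))]


print(token_rates([2, 2, 2, 2], 10))
print(quantum_sizes([2, 2, 2, 2], 1500))
```

Each level's rate is the rate of the level below multiplied by the
fan-out between them, so the top-level scheduler can serve every level-0
scheduler once per time unit.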
It is important to mention that the structure of the topology-based
hierarchical scheduler is not limited to the example shown in Figure 3.4,
but can be reconfigured according to the actual network topology. If
the topology is an asymmetric tree or a star, for instance, the token
schedulers will be reorganized and the logical structure will be
configured accordingly. For an unbalanced or non-binary tree topology,
where bandwidth is not evenly divided, the scheduling system can adjust
the quantum size of the token schedulers on each level to match the
bandwidth allocation. The quantum size of a token scheduler on level p
is then the sum of the quantum sizes of all its child token schedulers,
$Q_p(x) = \sum_{i \to x} Q_{p-1}(i)$, where $i \to x$ implies that
$S_{p-1}(i)$ is a child token scheduler of $S_p(x)$.
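For an unbalanced tree, the sum rule can be evaluated bottom-up. In the
minimal sketch below a tree is written as nested lists whose leaves are
the quantum sizes of the level-0 schedulers; this representation, the
function name `quantum` and the example values are assumptions made for
illustration.

```python
def quantum(node):
    """Quantum of a scheduler: a number for level 0, else the sum over its children."""
    if isinstance(node, (int, float)):
        return node
    return sum(quantum(child) for child in node)


# One branch with three level-0 schedulers and one with a single, faster one.
unbalanced = [[1500, 1500, 1500], [3000]]
print(quantum(unbalanced))
```

For a balanced tree with equal leaf quanta this reduces to the product
form given earlier, e.g. two levels of fan-out 2 give four times the
level-0 quantum.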
The hierarchical scheduler is a linear combination of several DRR
schedulers. Each DRR scheduler has a time complexity of O(1) [33]. If
the topology-based hierarchical scheduler has N hierarchies or levels, it
needs to establish a token path in N steps and thus the time complexity
will be O(N). When N = 1, the scheduler will become a single DRR
scheduler of which the time complexity is O(1).
3.5 Simulated Performance
In this section, the QoS performance of the topology-based hierarchical
scheduling, the flow-based scheduling, and the class-based scheduling
algorithms is compared in terms of delay, jitter, and flow protection.
The simulations are carried out in OPNET Modeler [56].
The network configuration for the simulation is shown in Figure 3.6.
Three networks of the same balanced tree topology are created, each
of which uses one scheduling scheme, i.e. the class-based, the
flow-based, or the topology-based hierarchical scheduling. Each network
is assumed to have 5 levels in total, including the top-level node and
the end nodes. It is assumed that, for each network, $M_l = 2$
($l = 1, 2, 3, 4$). The number of leaf nodes thus becomes $2^4 = 16$.
All three networks are connected to the same IPTV traffic generator,
which provides 16 identical traffic flows simultaneously.
Figure 3.6: The simulation scenario set-up. Three networks (hierarchical,
class-based, and flow-based), with different scheduling schemes but the
same binary tree topology and end nodes 00 to 15, are connected to the
same IPTV traffic source.
3.5.1 Evaluation of Statistical Multiplexing Gain
Each flow is configured to be transmitted to a different end node.
The peak and the minimum bandwidth of each flow are assumed to be 10
Mbps and 4 Mbps, respectively. Each flow is assumed to run at its peak
bandwidth 50% of the time, resulting in an average bandwidth of
B = 7 Mbps. The output link rate of the edge node is reduced while the
input traffic flows are maintained in order to evaluate the Statistical
Multiplexing Gain (SMG) performance [57, 58]. The input-output rate
ratio is used as the x-axis. Since the output rate is reduced gradually,
the ratio increases from the initial value of 1.0.
Due to the burstiness of the input traffic and the aggregation of flows,
link capacity can be saved by using statistical multiplexing to reduce
the link rate. If the traffic is not highly bursty, the average
end-to-end delay and jitter will increase as the link rate decreases.
Figure 3.7 provides the average end-to-end delay comparison between the
class-based, flow-based and hierarchical scheduling under various
input-output rate ratios. Figure 3.8 shows the jitter comparison under
the same range of input-output rate ratios.
In Figure 3.7, hierarchical scheduling improves the average end-to-end
delay. The curve of hierarchical scheduling lies below those of the
other two schemes, class-based and flow-based scheduling. To achieve
the same end-to-end delay, hierarchical scheduling can sustain a larger
reduction of link capacity. As the input-output rate ratio increases,
the average end-to-end delay increases for all three schemes, but the
slope of the hierarchical scheduling curve becomes lower than that of
the class-based scheme.
In Figure 3.8, the three scheduling methods, i.e. class-based,
flow-based and hierarchical scheduling, show similar performance in
terms of traffic jitter under different input-output rate ratios. As the
input-output rate ratio increases, the jitter of all three schemes
becomes larger. Hierarchical scheduling brings little improvement in
jitter performance.
The results indicate that the improvement of the SMG factor by the
hierarchical scheduling scheme is limited. Traditional website browsing
allows the operator to reduce the bandwidth required for the aggregated
flows. If there are 1000 users, for instance, and each one is guaranteed
10 Mbps download bandwidth, the operator can assign 200 Mbps bandwidth
for the aggregated traffic to satisfy the requirement, since not all the
users need the resource at the same time. The SMG thus becomes
$\frac{1000 \times 10}{200} = 50$ under this circumstance. When IPTV
services are introduced in a network, the SMG factor will begin to
decrease, because the traffic is low in burstiness but high in bandwidth
consumption. This traffic characteristic is different from that of
applications that generate bursty traffic, such as website browsing.
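The arithmetic behind the SMG example, and behind the 7 Mbps average flow bandwidth used earlier in this subsection, can be reproduced with a short sketch. The function names are hypothetical, and the two-state (peak/minimum) traffic model is an assumption inferred from the stated figures.

```python
def statistical_multiplexing_gain(users, per_user_mbps, provisioned_mbps):
    """Ratio of the summed guaranteed rates to the provisioned rate."""
    return users * per_user_mbps / provisioned_mbps


def average_bandwidth(peak_mbps, min_mbps, peak_fraction):
    """Mean rate of a flow that is at its peak for a given fraction of time."""
    return peak_fraction * peak_mbps + (1 - peak_fraction) * min_mbps


# Web browsing example from the text: 1000 users, 10 Mbps each, 200 Mbps link.
print(statistical_multiplexing_gain(1000, 10, 200))

# IPTV flow from the simulation set-up: 10 Mbps peak, 4 Mbps minimum, 50% peak.
print(average_bandwidth(10, 4, 0.5))
```

With the input-output rate ratios of at most 1.16 shown in Figures 3.7 and 3.8, the corresponding gain for the IPTV flows is far smaller than the value of 50 obtained for browsing traffic.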
Figure 3.7: Comparison between class-based, flow-based, and hierarchical
scheduling schemes in terms of average end-to-end traffic delay (ms)
under different input-output rate ratios (1.00 to 1.16).
Figure 3.8: Comparison between class-based, flow-based, and hierarchical
scheduling schemes in terms of traffic jitter (ms) under different
input-output rate ratios (1.00 to 1.16).
The advantage of the hierarchical scheduling scheme is that it can
provide nearly the same performance as the distributed-intelligence
approach. By learning the network topology through the management plane
or manual configuration, the scheduler at the edge of the network forms
a mapping structure with virtual token schedulers. The cooperation
between token schedulers is far more efficient than the cooperation
between different nodes. Centralized intelligence can therefore be
considered a viable approach to traffic management.
3.5.2 Evaluation of Flow Protection
A conforming traffic flow is configured to have an average bandwidth of
9 Mbps, which is similar to the bandwidth needed by a high-definition
IPTV channel. Sixteen flows, each bound for one of the 16 destinations,
are sent to the three networks simultaneously. The link speed is reduced
by half at each level, as explained previously. For the end user, the
link supports a transmission rate of up to 10 Mbps.
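A rough sketch of the capacity arithmetic implied above is given below. It assumes (as an interpretation, not something stated explicitly in the text) that the link rate doubles at each tier towards the edge node and that each link carries the conforming flows of all end nodes beneath it. The function names are hypothetical.

```python
def link_capacities(levels, leaf_link_mbps, fanout=2):
    """Capacity of one link at each tier, from the end-user link upwards."""
    return [leaf_link_mbps * fanout ** tier for tier in range(levels - 1)]


def link_loads(levels, flow_mbps, fanout=2):
    """Average conforming load on one link at each tier (assumed model)."""
    return [flow_mbps * fanout ** tier for tier in range(levels - 1)]


capacities = link_capacities(levels=5, leaf_link_mbps=10)
loads = link_loads(levels=5, flow_mbps=9)
for tier, (cap, load) in enumerate(zip(capacities, loads)):
    print(f"tier {tier}: average load {load} Mbps on a {cap} Mbps link")
```

Under these assumptions every link runs at 90% average utilization, so even a modest excess from one flow is enough to push its path into congestion.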
To evaluate the flow protection and isolation ability of the networks,
a highly bursty traffic flow is introduced for a certain period of time.
The impact on the conforming flow is then observed at the destination.
The highly bursty traffic flow has a higher average bandwidth than a
conforming flow.
The simulation lasts for 60 seconds and the highly bursty traffic
flow is introduced from 10 to 20 seconds. In the networks shown in
Figure 3.6, the highly bursty flow is bound for user 01. The flow to
user 00 is observed because it is the most affected by the highly bursty
flow.
Figure 3.9 presents the comparison between class-based, flow-based and
hierarchical scheduling under the malicious flow attack. The bandwidth
of the highly bursty traffic is 9.5% more than that of a normal flow.
Since the class-based scheduling scheme cannot distinguish different
flows of the same traffic type, the normal flow is affected the most in
terms of the increase in end-to-end delay. Flow-based and hierarchical
scheduling schemes are both capable of flow isolation, and thus the
end-to-end delay of the normal flow increases only slightly. Since the
class-based scheduling scheme performs the worst under the malicious
flow attack, the remaining comparisons are carried out between the
flow-based and the hierarchical scheduling schemes.
In Figure 3.10, the bandwidth of the highly bursty traffic flow is
increased to 67% more than the bandwidth of a normal flow. The figure
presents the affected end-to-end delay of the conforming flow bound for
destination 00 and compares the flow-based scheduling with the
hierarchical scheduling. The highest end-to-end delay of the network
using flow-based scheduling rises to around 4.5 ms, while the delay of
the network using the hierarchical scheduling scheme rises to around
3.0 ms at most. After the highly bursty flow stops, both end-to-end
delays return to the normal level. The hierarchical scheduling clearly
performs better than the flow-based one in terms of flow protection.
Figure 3.9: Comparison between class-based, flow-based and hierarchical
scheduling in terms of end-to-end traffic delay (ms) over time (s) when
a non-conforming flow appears. Bandwidth of the highly bursty traffic is
9.5% more than a normal flow.
Figure 3.10: Comparison between flow-based and hierarchical scheduling
in terms of end-to-end traffic delay (ms) over time (s) when the load of
a highly bursty flow increases. Bandwidth of the highly bursty traffic
is 67% more than a normal flow.
Figure 3.11: Comparison between flow-based and hierarchical scheduling
in terms of average end-to-end traffic delay (ms) of the affected period
as the bandwidth of the non-conforming flow increases from 10 to
16 Mbps.
To further investigate the flow protection and isolation ability of
the two scheduling schemes, i.e. flow-based scheduling and hierarchical
scheduling, several simulations under different traffic loads of the
highly bursty flow are carried out. The average end-to-end delay of the
affected period, during which the highly bursty flow is introduced, is
measured for each case. The comparison results are shown in Figure 3.11.
The average bandwidth of the highly bursty flow bound for destination
01 is increased from 10 to 16 Mbps. Both schemes show similar average
end-to-end delay at a bandwidth of 10 Mbps. This is because the switches
in both networks still have enough capacity. Once the bandwidth of the
highly bursty flow exceeds the maximum limit, congestion occurs and
consequently increases the average end-to-end delay of the normal flow.
The curve of the hierarchical scheduling scheme, compared to the
flow-based one, remains stable, which indicates that the hierarchical
scheduling scheme is able to provide better flow isolation and
protection.
The improvement should be credited to centralizing network intelligence
in the edge node. Potential congestion or any malicious attack is
handled by the scheduler inside the node. The necessary internal
resources are arranged and utilized by the node to limit the impact of
the misbehaving flow. On the other hand, the flow-based scheme, which is
a distributed way to protect flows, can be ineffective or inefficient,
since the cooperation between nodes in a network is more difficult than
the cooperation between schedulers inside the hierarchical scheduler.
From the point of view of protecting traffic flows to guarantee the QoS
requirements, the hierarchical scheduling scheme shows better
performance than the distributed flow-based scheme.
It is also worth mentioning that the results show a trend of how a flow
is affected by highly bursty traffic in a network using various
scheduling schemes. In a real network, the actual values are very likely
to differ from the ones shown in these figures. What is important is the
relative relation demonstrated by the results.
3.6 Summary
In this chapter, a topology-based hierarchical scheduling scheme for
IPTV traffic management in Carrier Ethernet transport networks is
proposed. The hierarchical scheduler can be placed at the edge of a
broadband access network, where the topology is relatively static from
an IPTV distribution point of view.
Based on the assumption that the topology-based hierarchical scheduler
is able to acquire the network topology, this chapter has demonstrated
a method in which the hierarchical scheduler combines several DRR token
schedulers to build a mapping structure of the connected network. The
hierarchical scheduler manages traffic on behalf of other nodes in the
network and is able to avoid severe performance degradation caused by
maliciously behaving traffic flows.
Simulation results have shown that the proposed scheduler can provide
better flow protection and isolation against potential attacks from
malicious traffic and, as a result, provide a QoS guarantee, which is a
significant requirement for IPTV services in Carrier Ethernet transport
networks. The proposed scheme could also benefit network operators in
terms of deployment effort and cost-efficiency.
It is also important to mention that the hierarchical scheduling scheme
presented in this chapter is not limited to the topology used in the
example. In fact, the scheduler can adapt to different network
topologies. Through network management or manual configuration, the
scheduler can learn where the potential congestion points are and what
the network topology is. Different knowledge about the network leads
to different combinations of the DRR token schedulers. This flexibility
enables the scheduler to adapt to various network topologies, e.g. star,
asymmetric tree and so forth. It is out of the scope of this chapter to
discuss how the combination of the DRR token schedulers is implemented.

Chapter 4
Multicast Scheduling Algorithms for Input-Queued Switches
The Input Queuing (IQ) architecture has been favored for designing
high-speed multicast switches due to its scalability and low
implementation complexity. Various improvements on the First-In-First-Out
(FIFO)-based IQ architecture have been proposed to reduce the
Head-Of-Line (HOL) blocking problem and, as a result, to increase
throughput. However, a trade-off exists between the complexity and the
performance of the multicast scheduling algorithms. Algorithms with low
implementation complexity usually suffer from HOL blocking [59]. On the
other hand, algorithms that achieve high throughput are usually high in
implementation complexity, making them hard to scale in terms of either
switch size or port speed [60, 61]. Given that multicast switches are
able to reduce the continuously increasing network load, an effective
and efficient, yet low-complexity multicast scheduling algorithm is
needed.
In this chapter, the Multi-Level Round-Robin Multicast Scheduling
(MLRRMS) algorithm is presented for FIFO-based input-queued switches.
First of all, the advantages of the IQ architecture are discussed in
comparison with other architectures. Different algorithms developed for
the IQ architecture are briefly reviewed as background knowledge for the
MLRRMS. The problems encountered by the system architecture are defined,
and the solutions to the problems, which together comprise the MLRRMS
algorithm, are provided. Theoretical analysis and simulated performance
results demonstrate that the FIFO-based IQ multicast architecture is
able to achieve significant improvement with the MLRRMS algorithm in
terms of multicast delay and throughput, with the capability of
searching a limited number of cells stored in the input queues.
4.1 Introduction
It is foreseeable that network capacity will increase substantially as
bandwidth-intensive applications become more and more popular and the
required network bandwidth grows correspondingly. Traffic generated by
bandwidth-intensive applications, such as videoconferencing and IPTV,
usually has several groups of subscribers, and each group subscribes to
the same content, e.g. group A watches a football game and group B
watches a movie.
Even though it is possible to deliver content to a group through
several unicast flows, i.e. to transmit M traffic flows to the network,
each bound for one subscriber in a target group that has M subscribers,
the traffic load in the network will increase sharply. Multicast, on the
other hand, is able to reduce the traffic sent to the network. As
demonstrated in Figure 4.1, instead of loading the network with
redundant unicast traffic as in Figure 4.1(a), the switches in
Figure 4.1(b) are able to copy the received packets and send them to
the subscribed nodes, and thus the traffic load in the network is
substantially reduced. The spared network resources can be used for
other services. As a result, multicast-enabled nodes are favored for
high-bandwidth services in order to reduce the required bandwidth and
the multicast latency in the transport network, e.g. the Carrier
Ethernet transport network.
Figure 4.1: A simple explanation of using unicast and multicast to
provide IPTV services from a content server to Group A and Group B.
(a) Illustration of unicast. (b) Illustration of multicast.
Because fixed-size switching technology is able to achieve high
switching efficiency, it is widely considered in the literature
[7, 29, 60-68]. Variable-length packets are segmented into fixed-size
cells before traversing the switch fabric, and are reassembled into
packets before being sent out of the switch. In fact, in several
advanced Internet routers/switches and prototypes, such as the Cisco
GSR [69], the Tiny-Tera [70], and the iPoint [71], the switch fabric
internally operates on cells. In the rest of this section, packet is
used as a generic term to indicate a data unit regardless of its length,
for simplicity.
To ensure a low packet loss rate, switches usually have buffers
installed to store packets that cannot be served immediately on arrival.
Buffers can be placed at the input port, at the output port, or in a
location shared by input and output ports. Based on the position of the
buffers, the buffering mechanisms of switches can be categorized into
three main types: Input Queuing (IQ), Output Queuing (OQ), and
shared-buffer. Different combinations of these schemes are possible in
practical switch design.
Pure IQ switches place FIFO queues at the inputs, as illustrated in
Figure 4.2. The memory only needs to run as fast as the input line
speed, which lowers the implementation complexity, but the IQ scheme
with FIFO queues suffers from degraded throughput due to the HOL
blocking problem [59], where a packet that fails to win its output port
stays at the head of the queue and blocks the packets behind it from
being transmitted, even if their destined output ports are available.
The advantage of the IQ scheme is that it can easily scale up in terms
of switch size and link speed, but HOL blocking limits the throughput to
approximately 58.6% of the maximum [59].
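For reference, the figure of about 58.6% corresponds to the well-known saturation throughput of a FIFO input-queued switch as the number of ports grows large. The traffic model behind it (saturated inputs, fixed-size cells, and unicast destinations chosen uniformly and independently) is stated here as an assumption; it is not spelled out in the text above.

```latex
\lim_{N \to \infty} \rho_{\max} = 2 - \sqrt{2} \approx 0.586
```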
Figure 4.2: Illustration of an input-queued switch (input ports, switch
fabric, output ports). A pure input-queued switch places FIFO buffers at
the input ports. The buffers need to run only as fast as the input
links, but head-of-line blocking can occur and cause throughput
degradation.
Pure OQ switches place buffers at the outputs to store packets, as
shown in Figure 4.3. As a result, packets received by an OQ switch can
always reach their destination ports immediately on arrival, which
eliminates the HOL blocking problem, provided that the buffer runs at N
times the link speed for a switch with N input ports. This covers the
worst case, in which the packets at all the input ports are destined to
the same output port. However, the scalability of the OQ architecture is
constrained. Since no input buffers are allocated, the switch must be
able to deliver N packets to an output buffer to avoid packet loss, and
that output buffer must be able to store N packets in the time it takes
for one packet to arrive at an input. This buffer speedup requirement
limits the scalability of the OQ scheme.
Figure 4.3: Illustration of an output-queued switch (input ports, switch
fabric, output ports). A pure output-queued switch allocates buffers
only at the output ports. The buffers and the switch fabric need to run
N times as fast as the link speed in order to avoid packet loss. This
speedup requirement limits the scalability of the output-queuing scheme.
In the shared-buffer architecture, input ports and output ports share
a memory pool, as shown in Figure 4.4. Incoming packets are stored in
the shared memory. The packet headers are extracted and used for
scheduling purposes by the switch. When a packet is scheduled for
transmission, the output port removes it from the shared memory.
However, for an $N \times N$ switch, the memory must be able to read and
write N packets in only one packet arrival time, which strongly
restricts the scalability of the switch.
Figure 4.4: Illustration of a shared-buffer switch (input ports, switch
fabric, output ports, shared buffer). Input and output ports share a
memory pool, where arriving packets are stored. The memory must be able
to read and write N packets in one packet arrival time for an
$N \times N$ switch. This speedup requirement restricts the scalability
of the shared-buffer scheme.
Since the IQ architecture has advantages over the others in terms of
building a scalable architecture for high-speed switching, the IQ scheme
is favored except for its HOL blocking problem when FIFO queues are
employed. By using a different buffering strategy at each input port,
the HOL blocking can be eliminated entirely. This is known as Virtual
Output Queuing (VOQ), where each input maintains a separate queue for
each output [72, 73], as shown in Figure 4.5. Since a packet cannot be
blocked by a packet ahead of it that is bound for a different output
port, the HOL blocking is eliminated. No speedup is required in the VOQ
scheme because, for cell switching, at most one cell can arrive at and
depart from each input within a cell transmission time slot. Several
scheduling algorithms have been proposed based on the VOQ architecture,
such as iSLIP [74] and PIM [72], providing a solution to eliminate the
HOL blocking problem and achieving higher throughput than the IQ
architecture with FIFO queues at the inputs [73]. Strictly speaking, VOQ
is a subcategory of the IQ architecture, where buffers are allocated at
the input ports. But for simplicity, we use the term IQ to refer to the
IQ architecture with FIFO queues and VOQ for the IQ scheme where
separate queues are employed at each input for unicast.
Figure 4.5: Illustration of a virtual output queued switch (input ports
with N queues each, switch fabric, output ports). The head-of-line
blocking can be eliminated entirely by employing a separate queue for
each output at each input.
Although VOQ, by creating separate queues at each input, can entirely
eliminate the HOL blocking and improve the throughput performance, it
scales poorly due to the requirement for $N^2$ queues in total for an
$N \times N$ switch. The scalability of the VOQ mechanism becomes even
worse when applied to multicast. To eliminate the HOL blocking for an
$N \times N$ multicast switch, each input must maintain a separate queue
to store the multicast packets of each possible combination of the N
destinations. Such an architecture is called MultiCast Virtual Output
Queuing (MC-VOQ) [75]. MC-VOQ requires $2^N - 1$ queues for each input
and thus $N(2^N - 1)$ queues in total. The scalability of the MC-VOQ
architecture is poor, and it thus becomes impractical in medium and
large switches. For simplicity, we use the term MC-VOQ to refer to the
VOQ architecture for multicast throughout this chapter.
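The growth of these queue counts can be made concrete with a few lines of code. This is only an illustrative calculation of the formulas above; the function names are hypothetical.

```python
def voq_queues_total(n):
    """Unicast VOQ: one queue per (input, output) pair, i.e. N^2 queues."""
    return n * n


def mc_voq_queues_per_input(n):
    """MC-VOQ: one queue per non-empty subset of the N outputs."""
    return 2 ** n - 1


def mc_voq_queues_total(n):
    """MC-VOQ over all N inputs: N * (2^N - 1) queues."""
    return n * mc_voq_queues_per_input(n)


for ports in (4, 16, 32):
    print(ports, voq_queues_total(ports), mc_voq_queues_total(ports))
```

Already for a 16-port switch the MC-VOQ total exceeds one million queues, whereas a FIFO-based IQ switch needs only one queue per input.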
Therefore, for multicast switches, the attention is focused on the
scalability of the IQ architecture, where the arriving multicast packets
are stored in a FIFO queue at each input. No buffers are allocated to
the outputs or shared between the input and output ports, in order to
avoid the requirement of speedup. To reduce the HOL blocking problem of
using FIFO queues for multicast traffic, a novel multicast scheduling
algorithm, MLRRMS, is proposed. The MLRRMS is implemented in a
distributed manner to provide high scalability instead of using a
centralized scheduling module, which can hinder the scalability of
high-speed switches.
The rest of this chapter is structured as follows. Section 4.2 briefly
introduces the related work on multicast scheduling algorithms as
background knowledge. Section 4.3 describes the system architecture used
throughout this chapter and defines the problems to be solved. In
Section 4.4, the MLRRMS algorithm is proposed and described in detail.
In Section 4.5, the analysis of the MLRRMS algorithm is provided. In
Section 4.6, simulations and discussions on the results are presented.
Finally, Section 4.7 concludes this chapter.
4.2 Related Work
Since it is impractical to use the MC-VOQ architecture for multicast,
where each destination combination requires a queue, several
architectures and algorithms have been proposed to schedule multicast
traffic leveraging either the FIFO or the VOQ architecture.
The multicast scheduling algorithm for IQ switches known as TATRA [62]
focuses on the IQ architecture for multicast, where multicast cells are
stored in FIFO queues. After the scheduler decides which cells to send,
it leaves a residue of cells to be scheduled in the next cell time.
Motivated by the game Tetris, TATRA schedules the residue of cells
based on the departure date, which is the number of cell times before a
copy of the cell is served. The TATRA algorithm is strict in fairness
and achieves low latency; however, it is high in implementation
complexity. To remedy this, another algorithm, the Weight-Based
Algorithm (WBA), is proposed in [62] as a simpler replacement for
TATRA. This algorithm works by allocating weights to input cells
according to their age and fan-out (the number of destinations in the
multicast group) at the beginning of every cell time, after which each
output port chooses the HOL cell with the highest weight. Although the
WBA ensures fairness and has a low implementation complexity, it suffers
from the HOL blocking problem.
The FIFO-based Multicast Scheduling (FIFOMS) algorithm [60] and the
Credit based Multicast Fair (CMF) scheduling algorithm [61] utilize the
VOQ architecture for unicast to schedule multicast traffic. Instead of
assigning a queue to each combination of N destinations, only N queues
are allocated for each input port. Up to N address tokens are generated
for each arriving cell, each of which is stored in a queue corresponding
to a destination. The arrived multicast cell is stored in a memory pool
and is linked by its address tokens. Based on the scheduling decisions
from the scheduling algorithms executed on the address tokens, the
multicast cell is sent, and it is removed from the memory once all its
destinations are reached. The FIFOMS and CMF are able to achieve low
latency and high throughput, but the bottlenecks of the architecture can
hinder its scalability.
The hardware complexity of the address token generator can be O(N),
since up to N tokens are generated for each arriving cell, and the
address token generating rate is required to be N times the cell
arrival rate because multiple tokens are generated for each arriving
cell within one cell transmission time. Besides, this architecture
requires a complex buffer management mechanism to send a multicast cell
using the link address in an address token, because the actual cell to
be sent is not always the HOL cell. In addition, the number of token
queues in total is $N^2$, which can be an obstacle for the switch to
scale up to hundreds or even thousands of ports.
In addition, the k-MC-VOQ architecture is proposed in [63]. Each
input port maintains k FIFO queues, with $1 < k < 2^N - 1$. The main
issues for the k-MC-VOQ architecture are related to the scheduling
algorithm and the queuing discipline that associates each multicast flow
with a queue. A Greedy Min-Split Scheduling (GMSS) algorithm [63] is
proposed to schedule multicast traffic for the k-MC-VOQ architecture.
Each queue is associated with a weight, which is the product of the
queue length and the fan-out of the multicast cell at the head of the
queue. Queues are examined in decreasing order of their weights. The
scheduling algorithm iterates with two phases until either all output
ports are selected or no more non-empty queues exist at unselected
inputs. With an increase of k, i.e. the number of queues at each input,
throughput improves significantly only for small k (compared with N).
Load balancing based on the queue length across multicast queues is
required to distribute cells to different queues for performance
improvement. This makes the system complex to implement.
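The queue weighting and examination order described above can be sketched as follows. The sketch covers only the weight computation and the sorting step, not the two-phase iteration of GMSS, and the data layout (each queue as a list of fan-out sets with the head first) is an assumption made for illustration.

```python
def queue_weight(queue):
    """Weight = queue length * fan-out of the cell at the head of the queue."""
    if not queue:
        return 0
    return len(queue) * len(queue[0])


def examination_order(queues):
    """Indices of the non-empty queues, sorted by decreasing weight."""
    non_empty = [index for index, queue in enumerate(queues) if queue]
    return sorted(non_empty, key=lambda index: queue_weight(queues[index]),
                  reverse=True)


# Hypothetical queues: each entry is the fan-out set of one multicast cell.
queues = [
    [{1, 2}],            # weight 1 * 2 = 2
    [{0, 1, 2}, {3}],    # weight 2 * 3 = 6
    [],                  # empty, skipped
    [{3}, {1}, {2}],     # weight 3 * 1 = 3
]
print(examination_order(queues))
```

Because the weight depends on the current queue length, it has to be recomputed in every cell time, which is part of the implementation cost mentioned above.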
4.3 System Architecture and Problem Definition
The system architecture is presented in this section, followed by the
problem definition. The notations in the system model are used
throughout the rest of this chapter for consistency.
4.3.1 System Architecture
The multicast cell-based switching system used in this chapter is
assumed to have the architecture described in Figure 4.6. The switch is
assumed to have an equal number of input and output ports, because an
input and an output port usually reside as a pair on the same line card.
Each input $i$, $0 \le i \le N-1$, is connected to a FIFO queue. The
information of cells is collected and scheduling decisions are made by
the multicast scheduling module. The status of each output $j$,
$0 \le j \le N-1$, is also collected by the multicast scheduling module.
We also assume that the switch fabric has intrinsic multicast/broadcast
capabilities, e.g. a crossbar switch fabric. Incoming variable-length
packets are segmented into fixed-size cells before traversing the switch
fabric and are reassembled at the output ports before being sent out. It
is out of the scope of this chapter to consider the details and
technologies of packet segmentation and reassembly. Thus we only focus
on the multicast cell switching part. Sufficient buffer capacities are
assumed so that no cell loss occurs due to buffer overflow. Since
variable-length packets are segmented into fixed-size cells, time is
divided into fixed periods, denoted as cell times. In one cell time, an
input can send at most one cell to the switch fabric and an output can
receive at most one cell from the switch fabric. If more than one cell
is bound for an output within a cell time, output contention occurs and
only one cell can be scheduled for transmission according to the
scheduling algorithm, with the other cells left to be scheduled in the
next cell time.
Figure 4.6: The system model of the multi-level round-robin multicast
scheduling algorithm. (Diagram labels: input 0 to input N-1 with FIFO
queues of cells marked by their fan-out sets, fragmentation, switch
fabric, reassembling, output 0 to output N-1, multicast scheduling
module.)
Any multicast cell is characterized by its fan-out set, i.e. the set of
the output ports for which the cell is bound. In the simple example
shown in Figure 4.6, input 0 has a cell at the head of the queue
destined to outputs 2, 3, and 8, and its fan-out set can thus be
expressed as {2, 3, 8}. We consider the case where fan-out splitting
[76] is applied so that copies of multicast cells can be delivered to
output ports over any number of cell times. Until all the destinations
in the fan-out set are reached, the cell is not removed but remains in
the queue. A multicast scheduler makes scheduling decisions prior to
each cell time and grants cell transmissions accordingly.
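The fan-out splitting discipline can be illustrated with a small sketch of one FIFO input queue. The representation of a cell as the set of its not-yet-served outputs and the function name `serve_cell_time` are assumptions made for this example.

```python
from collections import deque


def serve_cell_time(queue, granted_outputs):
    """Send copies of the HOL cell to the granted outputs in one cell time.

    Each queue entry is the residual fan-out set of a multicast cell. The
    HOL cell leaves the queue only when its residual fan-out is empty.
    """
    if not queue:
        return set()
    head = queue[0]
    served = head & granted_outputs
    head -= served          # in-place update of the residual fan-out set
    if not head:
        queue.popleft()
    return served


# Queue of input 0 with two cells, using the fan-out sets of the example.
queue = deque([{2, 3, 8}, {1, 4, 9}])
print(serve_cell_time(queue, {2, 8, 9}), list(queue))
print(serve_cell_time(queue, {3}), list(queue))
```

In the first cell time the HOL cell is only partially served and stays at the head with residual fan-out {3}; output 9 is granted but cannot be used, because the cell that wants it is not at the head of the queue.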
4.3.2 Problem Definition
An ecient way to schedule multicast cells is to see the cells from the
output's point of view. Even though a multicast cell is bound for sev-
eral output ports, for a specic output port, it only takes into account
whether the multicast cell is destined to itself. For example in Fig-
ure 4.7, the fan-out information of the HOL cells in each input queue
is shown and a diagram can be created to represent all the fan-out in-
formation. For output 1, the fan-out information for output 2, 3, and
4 is useless and therefore a subdiagram can be created which lters out
all the fan-out information for other output ports, as shown in Fig-
48 Multicast Scheduling Algorithms for Input-Queued Switches
ure 4.7(b). Similarly, subdiagrams for output 2, 3, and 4 are shown in
Figure 4.7(c), Figure 4.7(d), and Figure 4.7(e), respectively. Based on
this way of scheduling, the round-robin scheduling algorithm can run
independently on each output to select a cell for transmission. This
guarantees that an output can always succeed in scheduling a cell to be
sent to it, as long as the fan-out information of the cells includes the
output.
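The per-output view can be illustrated with the HOL fan-out sets shown in
Figure 4.7(a). The sketch below is an illustration only: the helper names
`output_views` and `round_robin_pick` and the pointer values are
hypothetical, and ports are numbered from 1 as in the figure.

```python
def output_views(hol_fanout, outputs):
    """For each output, list the inputs whose HOL cell is destined to it."""
    return {j: [i for i, fan in sorted(hol_fanout.items()) if j in fan]
            for j in outputs}


def round_robin_pick(candidates, pointer, n_inputs):
    """Return the first candidate at or after `pointer`, cycling over 1..n_inputs."""
    for step in range(n_inputs):
        i = (pointer - 1 + step) % n_inputs + 1
        if i in candidates:
            return i
    return None


# HOL fan-out sets of inputs 1 to 4, as in Figure 4.7(a).
hol_fanout = {1: {1, 2}, 2: {1, 2, 3}, 3: {3, 4}, 4: {1, 4}}
views = output_views(hol_fanout, outputs=[1, 2, 3, 4])

# Hypothetical round-robin pointer positions, one per output.
pointers = {1: 3, 2: 1, 3: 1, 4: 4}
picks = {j: round_robin_pick(views[j], pointers[j], 4) for j in views}
```

Every output that appears in at least one fan-out set obtains a pick, but
outputs 1 and 4 both select input 4 while outputs 2 and 3 select two
further inputs, so the independent choices do not coordinate on a single
multicast cell.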
The MLRRMS algorithm is proposed based on this principle. However, two
crucial problems must be solved. The first is HOL blocking: if the
scheduling algorithm only operates on the HOL cells, a cell can be
blocked by the one ahead of it and lose its chance to be sent to idle
output ports. To illustrate this problem, an example is shown in
Figure 4.8. A multicast cell in the FIFO queue for input 1 is scheduled
to be sent to outputs 1 and 2, for instance. The other two HOL cells,
from inputs 2 and 3, lose the chance to be sent and are to be scheduled
in the next cell time. As a result, output 3 is idle for this cell time
and the two cells in queues 1 and 3 are blocked from transmission. The
throughput is thus reduced by the HOL blocking problem. For unicast
traffic, the throughput of an IQ switch is limited to about 58.6% by the
HOL blocking problem [59].
The other problem is the number of transmissions of each multicast
cell. As described, a multicast cell is removed from the queue when and
only when all its destinations are reached. This implies that a multicast
cell can be transmitted up to as many times as its fan-out value if the
scheduling algorithm fails to utilize the multicast/broadcast capability
of the switch fabric. Since each output port makes its scheduling
decision independently, the outputs may select different cells even if
some cell could reach all its destinations within one cell time and be
removed from the queue. These unnecessary multiple transmissions of
multicast cells can result in an increased cell delay, since the system
takes more cell times to remove a multicast cell from the queue than a
system with a more advanced algorithm that avoids such a situation.
To alleviate these problems, the Look-Ahead (LA) and the sync mechanisms
are proposed in the MLRRMS algorithm. The sync mechanism aims to reduce
the unnecessary multiple transmissions of a multicast cell. The LA
mechanism aims to reduce the HOL blocking and is used by the MLRRMS
based on the assumption that the scheduler is able to examine the cells
stored further in the queues and is capable of sending them to the
corresponding output ports.
Figure 4.7: Illustration of splitting a multicast scheduling problem. For
each output port, a diagram can be created which filters out all the
fan-out information for other output ports. Panels: (a) the full diagram,
with HOL fan-out sets {1, 2}, {1, 2, 3}, {3, 4}, and {1, 4} at inputs 1
to 4; (b) the subdiagram for output 1; (c) the subdiagram for output 2;
(d) the subdiagram for output 3; (e) the subdiagram for output 4.
Figure 4.8: An example of the multicast head-of-line (HOL) blocking
problem. Output 3 is idle and the two cells in queues 1 and 3 are blocked
from transmission by the cell ahead.
4.4 The Multi-Level Round-Robin Multicast
Scheduling Algorithm
The MLRRMS algorithm is a distributed multicast scheduling algorithm,
though it is assumed to be implemented in one module to reduce the signal
latency. The terms input and output used in the algorithm description are
not necessarily the actual inputs and outputs of the switch, but rather a
conceptual indication for scheduling purposes. The MLRRMS can reduce the
unnecessary multiple transmissions of a multicast cell by using the sync
mechanism. Unlike the WBA, which operates only on the HOL cells, the
MLRRMS uses the LA mechanism to iterate the scheduling process over
different cell positions to increase the throughput. The detailed
description of the MLRRMS algorithm is given below and an example is
shown in Figure 4.9 and Figure 4.10:
Initial condition: Before each cell time, the position pointer, p, is
reset to point to the HOL cell, i.e. p = 0. All the input and output
ports are in unreserved status and are eligible for transmitting and
receiving cells.
Step 1) Submission: Each unreserved input submits to the unreserved
outputs which are contained in the fan-out set of the cell pointed to by
the position pointer p. If p + 1 is larger than the queue length, the
input skips this step. The output ports that have received submissions
from the inputs will appear in the round-robin schedule of the dictator
assignment.
Step 2) Dictator Assignment: The dictator arbiter of the current
position pointer chooses the output that appears next in a round-robin
schedule, starting from the highest priority element, to be the dictator
over the other outputs. After the assignment, the dictator pointer a(p)
to the highest priority element of the round-robin schedule is
incremented (modulo N) to one position beyond the current dictator.
Step 3) Decision: If an unreserved output receives any fan-out
information submissions, it chooses the one that appears next in a
round-robin schedule of the current position pointer, starting from the
highest priority element. The output notifies each input whether its
submission is selected in the decision, and the output becomes reserved.
The decision pointer d(p) to the highest priority element of the
round-robin schedule is incremented (modulo N) to one location beyond the
selected input if and only if the output receives a cell from its
selected input. Upon receiving a decision, the input temporarily stores
the index of the output that has sent this decision, as well as the value
of the current position pointer.
Step 4) Sync: If an input receives a decision from the dictator of the
current position pointer, it invalidates the decisions of the other
outputs that are contained in its submission set, and turns its
submissions of Step 1 into valid decisions. An input without valid
decisions loses permission to transmit cells and remains unreserved. Only
an input having at least one valid decision becomes reserved and is
eligible for transmission.
Step 5) Look-Ahead: If any unreserved output port exists and the position
pointer has not reached its maximum value, the position pointer is
increased by 1, i.e. p = p + 1, and the process returns to Step 1.
Otherwise, if all output ports are reserved or the position pointer has
reached its maximum value, the scheduling process is completed.
After the completion of the scheduling process, each reserved input
copies the cell in the FIFO queue from the position that has been stored
in Step 3) and sends copies to the outputs that are included in the
decision set. If a cell has reached all the output ports in its fan-out
set, it is removed from the queue. Otherwise, the cell remains in the
queue, and the reached outputs are removed from its fan-out set to update
its fan-out information.
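The five steps and the transmission rule above can be sketched as a
single function that schedules one cell time. This is a simplified
illustration under stated assumptions, not the author's implementation:
ports are 0-indexed, each input queue is a Python list of residual
fan-out sets, `dictator_ptr[p]` and `decision_ptr[p][j]` are hypothetical
names for the pointers a(p) and d(p) of output j, and the example traffic
and pointer values are invented for the illustration.

```python
def rr_first(start, n, eligible):
    """First index at or after `start` (cyclically, modulo n) that is in `eligible`."""
    for k in range(n):
        idx = (start + k) % n
        if idx in eligible:
            return idx
    return None


def mlrrms_cell_time(queues, dictator_ptr, decision_ptr, max_depth):
    """Schedule and transmit for one cell time; returns {input: (position, outputs)}."""
    n = len(queues)
    reserved_in, reserved_out, grants = set(), set(), {}
    for p in range(max_depth + 1):              # Step 5: look ahead up to depth L
        if len(reserved_out) == n:
            break                               # all outputs reserved
        # Step 1: unreserved inputs submit to unreserved outputs of the cell at p.
        subs = {}
        for i in range(n):
            if i not in reserved_in and p < len(queues[i]):
                wanted = queues[i][p] - reserved_out
                if wanted:
                    subs[i] = wanted
        if not subs:
            continue
        requested = set().union(*subs.values())
        # Step 2: round-robin choice of the dictator among the requested outputs.
        dictator = rr_first(dictator_ptr[p], n, requested)
        dictator_ptr[p] = (dictator + 1) % n
        # Step 3: every requested output independently picks one submitting input.
        choice = {j: rr_first(decision_ptr[p][j], n,
                              {i for i in subs if j in subs[i]})
                  for j in requested}
        # Step 4: the dictator's input takes over its whole submission set.
        winner = choice[dictator]
        final = dict(choice)
        for j in subs[winner]:
            final[j] = winner
        for j, i in final.items():
            grants.setdefault(i, (p, set()))[1].add(j)
            reserved_in.add(i)
            reserved_out.add(j)
            if choice[j] == i:                  # pointer moves only if the pick is served
                decision_ptr[p][j] = (i + 1) % n
    # Transmission: prune the fan-out sets and drop fully served cells.
    for i, (p, outs) in grants.items():
        queues[i][p] -= outs
        if not queues[i][p]:
            del queues[i][p]
    return grants


# Hypothetical 4x4 example with look-ahead depth L = 1.
queues = [[{0}, {3}], [{0, 1, 2}], [{2}, {3}], []]
dictator_ptr = [1, 0]
decision_ptr = [[0, 0, 2, 0], [0, 0, 0, 0]]
grants = mlrrms_cell_time(queues, dictator_ptr, decision_ptr, max_depth=1)
```

In this example output 1 is the dictator at p = 0 and has picked input 1,
so input 1 sends its HOL cell to outputs 0, 1, and 2 in one cell time; the
look-ahead at p = 1 then fills the otherwise idle output 3 from input 0.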
4.5 MLRRMS Algorithm Analysis
The analysis of the MLRRMS algorithm is presented in this section.
First, several terms are defined in Section 4.5.1. Then in Section 4.5.2,
the analytical description of the MLRRMS algorithm is provided for the
purpose of further analysis. A heuristic analysis of the LA mechanism is
given in Section 4.5.3, and finally the complexity analysis is presented
in Section 4.5.4.
4.5.1 Definitions
We define several terms used in the analysis of the MLRRMS algorithm:
Definition 1 (Maximum Look-Ahead Depth): The maximum look-ahead depth,
$L$, is defined as the limit of the number of cells that the scheduler is
able to examine further into the queue. $L = 0$ means that the switch
only operates on the HOL cells, while $L = l$ indicates that the switch
can look up to $l$ cells after the HOL cell.
Definition 2 (Cell Position): The cell position, $p$, is defined as the
position of a cell in the queue. The cell at the HOL of the queue has
$p = 0$.
Definition 3 (Fan-out Vector): A fan-out vector is used to indicate the
fan-out set carried by a multicast cell in input $i$ at position $p$, and
is denoted as $f^{(i,p)} \triangleq \left(f^{(i,p)}_k\right)$,
$k = 0, 1, \ldots, N-1$, $p = 0, 1, \ldots, L$, $f^{(i,p)}_k \in \{0, 1\}$.
$f^{(i,p)}_k = 0$ indicates that output $k$ is not in the fan-out set of
the cell and $f^{(i,p)}_k = 1$ indicates the opposite. The cardinality of
the fan-out set thus becomes
$|f^{(i,p)}| \triangleq \sum_{k=0}^{N-1} f^{(i,p)}_k$.
Denition 4 (Trac Matrix): The Trac Matrix is an NN matrix
constructed by the scheduler, based on the fan-out vectors of the cells in
the position p of each input i, before a cell transmission. It is denoted
as T(p) =

T
(p)
i;j

. Obviously, we have T
(p)
i;j = f
(p)
i;j ; 8i; j; p. We dene
4.5 MLRRMS Algorithm Analysis 53
input output
1
2
3
4
1
21
2
3
4
3
4
1
2
3
1
4
2
3
1
2
3
4
1
2
1
3
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
p = 0
(a) Step 1 with p = 0: Submission. Each unreserved input submits the fan-
out information of its HOL cell to the corresponding outputs. The round-
robin scheduler (p = 0) of each output is at the position left from the last
scheduling process.
input output
1
2
3
4
1
21
2
3
4
3
4
1
2
3
1
4
2
3
1
2
3
4
1
2
1
3
dictator
(b) Step 2 and 3: Dictator Assignment and
Decision. Output 2 is the dictator in this
round. Based on the round-robin pointer,
each output sends a decision to an input and
becomes reserved.
input output
1
2
3
4
1
21
2
3
4
3
4
1
2
3
1
4
2
3
1
2
3
4
1
2
1
3
dictator
(c) Step 4: Sync. Input 2 receives a deci-
sion of the dictator and thus it invalidates
the decisions sent by output 1 and 3, be-
cause they are in its fan-out set. Input 1
and 3 lose their decisions and therefore be-
come unreserved.
Figure 4.9: MLRRMS: Submission, Decision, and Sync.
54 Multicast Scheduling Algorithms for Input-Queued Switches
input output
1
2
3
4
1
21
2
3
4
3
4
1
2
3
1
4
2
3
1
2
3
4
1
2
1
3
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
p = 0
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
p = 1
(a) Step 5 and Step 1 with p = 1: Look-Ahead and submission of the
increased cell position. Since output 4 is unreserved, input 1 and 3 both
submit the fan-out information of the cell at p = 1.
input output
1
2
3
4
1
21
2
3
4
3
4
1
2
3
1
4
2
3
1
2
3
4
1
2
1
3
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
p = 0
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
p = 1
(b) Step 2 with p = 1: Decision with p =
1. Output 4 sends a decision to input 1
according to its round-robin pointer at p =
1, and becomes reserved.
input output
1
2
3
4
1
21
2
3
4
3
1
4
2
3
1
2
3
4
1
2
1
3
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
p = 0
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
p = 1
(c) Post-transmission. The HOL cell in in-
put 2 is sent to all its destinations and is re-
moved from the FIFO queue. Since the cells
received by output 1 and 3 are dierent from
what the outputs' round-robin pointers in-
dicate, no update occurs on the pointers.
Figure 4.10: MLRRMS: Look-ahead, Submission, Decision, and post-transmission
status.
4.5 MLRRMS Algorithm Analysis 55
T
(p)
i;j = 0; 8j; p, if input queue i is empty.
Denition 5 (Decision Matrix): The Decision Matrix is an N  N
matrix denoted as D(p) =

D
(p)
i;j

, D
(p)
i;j 2 f0; 1g. This matrix contains
the scheduling decisions for each output j, with D
(p)
i;j = 1 indicating that
a copy of the cell in input i at position p will be transferred to output
j and D
(p)
i;j = 0 meaning that no copy will be sent to output j. Thus,
0  Pj D(p)i;j  1, 8j. D(p) satises the conditions in Equation 4.1 4.2,
and 4.3 as below:
0 
N 1X
j=0
D
(p)
i;j  N; 8i; p (4.1)
0 
N 1X
i=0
D
(p)
i;j  1; 8j; p (4.2)
0 
N 1X
i=0
N 1X
j=0
D
(p)
i;j  N; 8p (4.3)
Denition 6 (Set of Decision Matrices): The Set of Decision Matri-
ces is dened as  , fD(0);D(1); :::;D(L)g. It contains up to L decision
matrices. Multicast cells are released by the scheduler according to the
decision matrices stored in .
Denition 7 (Assistant Matrix): The Assistant Matrix is an N N
matrix denoted as A(p) =

A
(p)
i;j

, A
(p)
i;j 2 f0; 1g. This matrix is used to
help generate D(p); p > 0.
Denition 8 (Cross Disable Mark X): We dene X as a matrix
transform mark for the sake of convenience, whereX = (Xi;j), Xi;j 2 0; 1
is the matrix in operation. If we haveY = X, rst letY = O, (Yi;j = 0,
8i; j) with the same dimensions as X, and if Xk;l = 1, then Yk;j = 1,
Yi;l = 1, 8i; j.
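The cross disable transform of Definition 8 can be illustrated with a
small function. This is a sketch under stated assumptions: matrices are
plain nested lists of 0/1 values, and the name `cross_disable` is
hypothetical.

```python
def cross_disable(x):
    """Return Y where every row and column of X that holds a 1 is filled with 1s."""
    n_rows, n_cols = len(x), len(x[0])
    y = [[0] * n_cols for _ in range(n_rows)]   # start from the zero matrix O
    for k in range(n_rows):
        for l in range(n_cols):
            if x[k][l] == 1:
                for j in range(n_cols):
                    y[k][j] = 1                 # disable the whole row k
                for i in range(n_rows):
                    y[i][l] = 1                 # disable the whole column l
    return y


x = [[0, 0, 0],
     [0, 0, 1],
     [0, 0, 0]]
y = cross_disable(x)
```

A single 1 at row 1, column 2 therefore marks all of row 1 and all of
column 2, which corresponds to one reserved input and one reserved output.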
4.5.2 Analytical Description of the MLRRMS Algorithm
We here describe the proposed multicast scheduling algorithm in detail
based on the previous definitions. Before each cell transmission time,
the scheduler executes the following procedures and accordingly releases
cells after completion.
Initial condition: $p = 0$, $\Delta = \emptyset$ (the set of decision
matrices of Definition 6), and $D^{(-1)} = O$ ($D^{(-1)}_{i,j} = 0$,
$\forall i, j$).
i): The scheduler examines the fan-out vector $f^{(i,p)}$ of the cell in
input $i$ at position $p$ for all inputs to construct $T^{(p)}$.
ii): $A^{(p)} = T^{(p)} - \boxtimes\left(\sum_{q=0}^{|\Delta|} D^{(q-1)}\right)$,
where $\boxtimes(\cdot)$ is the cross disable mark of Definition 8, and if
$A^{(p)}_{i,j} < 0$, then set $A^{(p)}_{i,j}$ to 0, $\forall i, j$.
iii): The round-robin scheduling algorithm is independently executed on
each non-zero column of $A^{(p)}$. Only one element in a column can be
selected, because one output port can only receive one transmission
during a cell time. The scheduling results thus form $D^{(p)}$.
iv): The sync procedure is carried out on $D^{(p)}$ to reduce the
unnecessary multiple transmissions of cells caused by the independent
scheduling processes: each column plays the role of dictator in a
round-robin manner, and if column $y$ plays the role of dictator during
the current cell time and $D^{(p)}_{x,y} = 1$, then for every $j \neq y$
with $A^{(p)}_{x,j} = 1$ and $D^{(p)}_{x,j} \neq 1$, let
$D^{(p)}_{x,j} = 1$ and $D^{(p)}_{i,j} = 0$, $\forall i \neq x$. The
scheduler stores the refined $D^{(p)}$ in $\Delta$, i.e.
$D^{(p)} \to \Delta$. The round-robin pointer of a column that is synced
to the dictator remains in the same position for the next cell time.
v): If a zero column is found in $\sum_{q=0}^{|\Delta|} D^{(q-1)}$, check
the queue size of each unreserved input, i.e. each input whose
corresponding row in $\sum_{q=0}^{|\Delta|} D^{(q-1)}$ is zero. If the
queue size is larger than $p + 1$ and $p + 1 \le L$, increase $p$ by 1 and
go to step i. Otherwise, continue to step vi.
vi): The scheduler examines $\Delta$ and releases multicast cells at
particular positions from the input queues according to each $D^{(p)}$.
If the fan-out set of a cell becomes empty after the service, the cell is
removed from the queue. Otherwise, the cell remains with a new fan-out
set.
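Step ii can be illustrated as follows, assuming it is read as masking out
of the traffic matrix every input (row) and output (column) that already
holds a decision, with negative entries clipped to 0. The helper names
`cross_disable` and `assistant_matrix` and the example matrices are
hypothetical.

```python
def cross_disable(x):
    """Fill with 1s every row and every column of X that contains a 1."""
    rows = {i for i, row in enumerate(x) if any(row)}
    cols = {j for row in x for j, value in enumerate(row) if value}
    return [[1 if (i in rows or j in cols) else 0 for j in range(len(x[0]))]
            for i in range(len(x))]


def assistant_matrix(t, decided_sum):
    """A = T - cross_disable(sum of stored decision matrices), clipped at 0."""
    mask = cross_disable(decided_sum)
    return [[max(t_ij - m_ij, 0) for t_ij, m_ij in zip(t_row, m_row)]
            for t_row, m_row in zip(t, mask)]


# D^(0): input 1 was granted outputs 0, 1, and 2 at the HOL position.
d0 = [[0, 0, 0, 0],
      [1, 1, 1, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0]]
# T^(1): fan-out vectors of the cells at position p = 1 (hypothetical traffic).
t1 = [[0, 0, 0, 1],
      [1, 0, 0, 1],
      [0, 0, 0, 1],
      [0, 0, 0, 0]]
a1 = assistant_matrix(t1, d0)
```

Only column 3 of the result is non-zero, and the row of the already
reserved input 1 is cleared, so the round-robin arbitration of step iii at
p = 1 considers inputs 0 and 2 for output 3 only.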
4.5.3 Heuristic Analysis of the Look-Ahead Mechanism
As described previously in the algorithm, the LA mechanism is only
performed when the output ports are not fully reserved, a situation that
decreases the throughput. There are potentially two causes of partial
output port occupancy: (1) the HOL blocking, and (2) the traffic pattern.
Obviously, if the cause of the partial output port occupancy is the
traffic pattern, there is nothing to improve. On the other hand, the HOL
blocking phenomenon may be the cause, and therefore the LA mechanism is
introduced to reduce the problem.
The HOL blocking can be eliminated if the switch is capable of searching
infinitely deep in the queues, i.e. the maximum LA depth is always larger
than the queue size. However, taking the implementation complexity into
consideration, an infinite searching capability is impractical. As
defined previously, a maximum LA depth is introduced to constrain the
implementation complexity and, at the same time, to increase the output
utilization. In this section, the relation between the maximum LA depth
and the output utilization is discussed.
Assume that there are enough cells stored in the queues and that the
fan-out vectors are uniformly distributed among cells, i.e. a multicast
cell is bound for each output port with the same probability, written
here as $\rho$:
$$P\left(f^{(i,p)}_k = 1\right) = \rho, \quad \forall i, k, p \tag{4.4}$$
Since a multicast cell always carries a non-zero fan-out vector, i.e.
there is at least one destination in the fan-out set (unicast being
considered a special case of multicast), the probability distribution of
the fan-out can be calculated. A random variable $F = |f^{(i,p)}|$ is
defined for the fan-out of a multicast cell, and the probability of
$F = f$ is calculated as:
$$P(F = f) = \frac{\binom{N}{f} \rho^{f} (1-\rho)^{N-f}}{1 - (1-\rho)^{N}}, \quad f = 1, 2, \ldots, N \tag{4.5}$$
$$E[F] = \frac{N \rho}{1 - (1-\rho)^{N}} \tag{4.6}$$
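Equations 4.5 and 4.6 describe a binomial distribution conditioned on at
least one destination, which can be checked numerically. The sketch below
uses hypothetical function names and writes the per-output probability of
Equation 4.4 as `rho`; the values N = 8 and rho = 0.5 are chosen only for
illustration.

```python
from math import comb


def fanout_pmf(n, rho):
    """P(F = f) for f = 1..n: binomial(n, rho) conditioned on F >= 1 (Equation 4.5)."""
    norm = 1.0 - (1.0 - rho) ** n
    return {f: comb(n, f) * rho ** f * (1.0 - rho) ** (n - f) / norm
            for f in range(1, n + 1)}


def mean_fanout(n, rho):
    """Closed-form mean fan-out (Equation 4.6)."""
    return n * rho / (1.0 - (1.0 - rho) ** n)


pmf = fanout_pmf(8, 0.5)
total = sum(pmf.values())                       # should be 1
mean = sum(f * p for f, p in pmf.items())       # should match Equation 4.6
```

For N = 8 and rho = 0.5 the mean fan-out is 1024/255, i.e. slightly above
4, because the all-zero fan-out vector is excluded.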
Given the restriction in Equation 4.5, it is possible to derive the
probability of a random element in $T^{(p)}$ being 1, written here as
$\theta^{(p)}_1$, where $\rho$ is the per-output probability of
Equation 4.4:
$$P\left(T^{(p)}_{i,j} = 1\right) = \frac{\rho}{1 - (1-\rho)^{N}} \cdot \frac{N}{N} = \frac{\rho}{1 - (1-\rho)^{N}} = \theta^{(p)}_1 \tag{4.7}$$
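Therefore the probability of one random column in $T^{(0)}$ being zero,
written here as $\varphi^{(0)}_1$, is:
$$\varphi^{(0)}_1 = P\left(\sum_{i} T^{(0)}_{i,j} = 0\right) = \left(1 - \theta^{(0)}_1\right)^{N}, \quad \forall j \tag{4.8}$$
where $\theta^{(0)}_1$ is the probability of a random element in
$T^{(0)}$ being 1 (Equation 4.7).
Given that one zero column is known, the probability of a random element
in the remaining $N - 1$ columns of $T^{(0)}$ being 1 is:
$$\theta^{(0)}_2 = \frac{\rho}{1 - (1-\rho)^{N}} \cdot \frac{N}{N-1} \tag{4.9}$$
The probability of a second random column in $T^{(0)}$ being zero, given
that one zero column is known, is:
$$\varphi^{(0)}_2 = \left(1 - \theta^{(0)}_2\right)^{N} \tag{4.10}$$
Thus, the probability of two random columns in $T^{(0)}$ being zero is:
$$P\left(\text{2 random zero columns in } T^{(0)}\right) = \varphi^{(0)}_1 \cdot \varphi^{(0)}_2 \tag{4.11}$$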
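Suppose that there are $x$ zero columns in $T^{(0)}$. With $\theta$
denoting the probability of an element being 1 and $\varphi$ the
probability of a column being zero, we can derive:
$$\theta^{(0)}_x = \frac{\rho}{1 - (1-\rho)^{N}} \cdot \frac{N}{N - x + 1} \tag{4.12}$$
$$\varphi^{(0)}_x = \left(1 - \theta^{(0)}_x\right)^{N} \tag{4.13}$$
Thus, a random variable $X^{(0)}$ is defined for the number of zero
columns in $T^{(0)}$, and the probability of $X^{(0)} = x$ is calculated
as:
$$P\left(X^{(0)} = x\right) = \begin{cases} \binom{N}{x} \left(\varphi^{(0)}_1 \varphi^{(0)}_2 \cdots \varphi^{(0)}_x\right) \left(1 - \varphi^{(0)}_x\right)^{N-x}, & 1 \le x \le N-1 \\ \left(1 - \varphi^{(0)}_1\right)^{N}, & x = 0 \end{cases} \tag{4.14}$$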
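The element and zero-column probabilities of Equations 4.12 and 4.13 are
straightforward to evaluate. In the sketch below the function names
`theta` and `phi` are hypothetical labels for the two probabilities, `rho`
is the per-output probability of Equation 4.4, and the values N = 8 and
rho = 0.5 are chosen only for illustration.

```python
def theta(n, rho, x):
    """Probability that an element is 1 when x - 1 zero columns are already known
    (Equation 4.12; x = 1 gives Equation 4.7)."""
    return rho / (1.0 - (1.0 - rho) ** n) * n / (n - x + 1)


def phi(n, rho, x):
    """Probability that the x-th examined column is zero (Equation 4.13)."""
    return (1.0 - theta(n, rho, x)) ** n


n, rho = 8, 0.5
theta_1 = theta(n, rho, 1)
theta_2 = theta(n, rho, 2)
phi_1 = phi(n, rho, 1)
two_zero_columns = phi_1 * phi(n, rho, 2)       # Equation 4.11
```

As more zero columns are assumed, the same expected fan-out is spread over
fewer columns, so `theta` grows with x and the chance of a further zero
column shrinks.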
If zero columns exist in $T^{(0)}$, they are further examined in
$T^{(p)}$, $p > 0$. Assume that each bit in a non-zero column has an
equal chance of being selected by the round-robin scheduler. Then it is
possible to derive:
$$P(\text{a random bit in a non-zero column is selected}) = \frac{x}{N} \tag{4.15}$$
If the scheduling decisions on those columns are fully scattered over
$N - x$ different rows, then $N - x$ rows will be disabled in $T^{(p)}$,
$p > 0$. The probability of this situation is:
$$P(\text{fully-scattered}) = \binom{N}{1}\frac{x}{N} \cdot \binom{N-1}{1}\frac{x}{N} \cdots \binom{x+1}{1}\frac{x}{N} = \left(\frac{x}{N}\right)^{N-x} \prod_{b=1}^{N-x} \binom{N-b+1}{1} \tag{4.16}$$
If the scheduling decisions on those columns all fall on the same row,
only 1 row will be disabled for the further look-ahead process. The
probability of this situation is:
$$P(\text{zero-scattered}) = \binom{N}{1} \cdot \left(\frac{x}{N}\right)^{N-x} = \left(\frac{x}{N}\right)^{N-x} \prod_{b=1}^{1} \binom{N-b+1}{1} \tag{4.17}$$
In between the two extreme cases described in Equation 4.16 and
Equation 4.17, the probability that the scheduling decisions are
scattered among a number of different rows, written here as $\kappa$,
where $1 < \kappa < N - x$, can be calculated:
$$P(\kappa\text{-scattered}) = \binom{N}{1}\left(\frac{x}{N}\right)^{N-x-\kappa+1} \cdot \binom{N-1}{1}\frac{x}{N} \cdot \binom{N-2}{1}\frac{x}{N} \cdots \binom{N-\kappa+1}{1}\frac{x}{N} = \left(\frac{x}{N}\right)^{N-x} \prod_{b=1}^{\kappa} \binom{N-b+1}{1}, \quad 1 < \kappa < N - x \tag{4.18}$$
Thus, combining the three situations, one generic formula is derived for
the probability of $\kappa$ rows being disabled in $T^{(1)}$, as shown in
Equation 4.19:
$$P_d(\kappa) = \left(\frac{x}{N}\right)^{N-x} \prod_{b=1}^{\kappa} \binom{N-b+1}{1}, \quad 1 \le \kappa \le N - x \tag{4.19}$$
Then the probability of one random column in $T^{(1)}$ being zero can be
calculated:
$$\varphi^{(1)}_1 = P_d(\kappa) \left(1 - \theta^{(1)}_1\right)^{N-\kappa} \tag{4.20}$$
where $\theta^{(1)}_1 = \theta^{(0)}_1$ is the probability of a random
element being 1, which is the same for every position since each
$T^{(p)}$, $\forall p$, is independent.
Therefore the probability of no zero columns existing after looking 1
cell behind the HOL cell becomes:
$$P\left(X^{(0)} = x\right) P\left(X^{(1)} = 0\right) = \binom{N}{x} \left(\varphi^{(0)}_1 \varphi^{(0)}_2 \cdots \varphi^{(0)}_x\right) \left(1 - \varphi^{(0)}_x\right)^{N-x} \cdot \left(1 - \varphi^{(1)}_1\right)^{x} \tag{4.21}$$
Given a large $N$ and a per-output probability of 0.5 in Equation 4.4,
the probability shown in Equation 4.21 can be high, which is later
demonstrated by the simulation results in Section 4.6. This indicates
that allowing the switch to look 1 cell further into the queues after the
HOL cell already achieves the largest improvement in output utilization.
4.5.4 Complexity Analysis
The time complexity of each step of the MLRRMS algorithm is discussed
in this section.
First of all, the time complexity of Submission for each input can be
O(N) because the input needs to examine all N bits of the fan-out vector
carried by the cell. If a parallel structure is used to allow the input
to read the N bits in one access, the time complexity can be reduced to
O(1).
Secondly, for each Decision arbiter to generate a decision, the time
complexity can be O(N) because the scheduling arbiter needs to examine at
most N submissions before sending the decision back to an input. However,
if priority encoders are used, the time complexity can be reduced to
O(log N). The Dictator Assignment arbiter can also use a priority encoder
to reduce the time complexity of selecting the dictator output from O(N)
to O(log N). The sync mechanism is simple to implement and has a time
complexity of O(1), since the input that receives the decision from the
dictator only needs to invalidate the decisions from the outputs included
in its submission set.
Given a maximum LA depth, L, the scheduling process iterates until either
all outputs are reserved or the maximum LA depth is reached. Thus, the
overall time complexity becomes O(L log N).
4.6 Simulated Performance of MLRRMS
A comparison between the MLRRMS, the WBA [62], and the FIFOMS [60] is
carried out by simulations in OPNET Modeler [56]. Independent multicast
traffic is assumed at each input. To compare the performance of the
algorithms in varying traffic conditions, Bernoulli traffic and bursty
traffic with different fan-out schemes are considered, which are further
explained in Section 4.6.1.
4.6.1 Traffic Model
For the Bernoulli traffic process, a cell arrives at an input with a
probability of $q$, which is also the arrival rate. Thus the offered
load, written here as $\lambda$, can be calculated as
$\lambda = q \cdot E[F]$, where $F$ is the random variable of the fan-out
of each cell.
The bursty traffic process, or Correlated Arrival Process, has two
states, busy and idle. Cells are generated only in the busy state. The
process stays in each state for a random number of cell times following
the geometric distribution with mean values of $E[B]$ and $E[I]$,
respectively. The arrival rate is calculated as
$q = E[B]/(E[B] + E[I])$, and the offered load can be calculated as
$\lambda = q \cdot E[F]$. Since the traffic arrives at the switch in
bursts, two modes of fan-out schemes can be applied: cell-based and
burst-based. In cell-based fan-out mode, the fan-out vector is
independently generated for each cell. In burst-based mode, the fan-out
vector is independently generated for each burst of cells, with all cells
of a burst having the same fan-out vector.
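The two arrival processes can be sketched as simple slot generators. This
is an illustrative sketch, not the OPNET model used for the results: the
function names are hypothetical, the geometric durations are assumed to
take values 1, 2, 3, ... (so each mean must be at least 1), and the load
of 0.8 with E[F] = 4 and E[B] = 16 is only an example setting.

```python
import random


def bernoulli_arrivals(q, n_slots, rng):
    """One boolean per cell time: True if a cell arrives (probability q)."""
    return [rng.random() < q for _ in range(n_slots)]


def geometric(mean, rng):
    """Geometric sample on {1, 2, ...} with the given mean (mean >= 1)."""
    count = 1
    while rng.random() >= 1.0 / mean:
        count += 1
    return count


def bursty_arrivals(mean_busy, mean_idle, n_slots, rng):
    """Alternate busy and idle periods; cells arrive only in busy slots."""
    slots, busy = [], True
    while len(slots) < n_slots:
        duration = geometric(mean_busy if busy else mean_idle, rng)
        slots.extend([busy] * duration)
        busy = not busy
    return slots[:n_slots]


def idle_mean_for_load(load, mean_fanout, mean_busy):
    """Solve load = q * E[F] with q = E[B] / (E[B] + E[I]) for E[I]."""
    q = load / mean_fanout
    return mean_busy * (1.0 - q) / q


rng = random.Random(1)
mean_idle = idle_mean_for_load(load=0.8, mean_fanout=4, mean_busy=16)
bursty = bursty_arrivals(16, mean_idle, 400_000, rng)
bernoulli = bernoulli_arrivals(0.2, 200_000, rng)
bursty_rate = sum(bursty) / len(bursty)
bernoulli_rate = sum(bernoulli) / len(bernoulli)
```

With these settings both processes have an arrival rate close to
q = 0.2, but the bursty process delivers its cells in runs of 16 cell
times on average.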
To evaluate the multicast scheduling algorithms, the notion of multicast
balance is introduced. We define the probability of a bit in the fan-out
vector of a cell in input $i$ being 1 according to Equation 4.22:
$$P\left(f^{(i,p)}_k = 1\right) = \begin{cases} \min\left\{1,\; E[F] \cdot \left(\omega + \frac{1-\omega}{N}\right)\right\}, & k = i, \; \forall p \\ E[F] \cdot \frac{1-\omega}{N}, & k \neq i, \; \forall p \end{cases} \tag{4.22}$$
where $\omega$ is the balance factor, $0 \le \omega \le 1$. When
$\omega = 0$, all the bits have the same probability of being 1 in the
fan-out vector, which indicates a balanced fan-out. As $\omega$
increases, the bit that has the same index as the input has a higher
probability of being 1. When $\omega$ reaches 1, the traffic becomes
unicast, which can be considered an extremely unbalanced multicast case.
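The balance factor can be explored with a direct transcription of
Equation 4.22. The sketch assumes that the probability for k = i is
capped at 1, since a probability cannot exceed 1; the function name
`fanout_bit_probability` and the parameter values are hypothetical.

```python
def fanout_bit_probability(i, k, mean_fanout, omega, n):
    """Probability that bit k of a fan-out vector at input i is 1 (Equation 4.22)."""
    if k == i:
        # Assumption: the value is capped at 1 so that it remains a probability.
        return min(1.0, mean_fanout * (omega + (1.0 - omega) / n))
    return mean_fanout * (1.0 - omega) / n


n, mean_fanout = 8, 4
balanced = [fanout_bit_probability(0, k, mean_fanout, 0.0, n) for k in range(n)]
skewed = [fanout_bit_probability(0, k, mean_fanout, 0.1, n) for k in range(n)]
unicast = [fanout_bit_probability(0, k, 1, 1.0, n) for k in range(n)]
```

As long as the cap is not reached, the probabilities of one vector sum to
E[F], so changing the balance factor redistributes the destinations
without changing the mean fan-out.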
4.6.2 Performance for Balanced Multicast Traffic under
Different Offered Loads
When $\omega = 0$, the multicast traffic becomes balanced and each bit in
the fan-out vector has the same probability of being 1, resulting in a
binomial distribution. Bernoulli traffic is first applied to the switch
with $N = 8$, $E[F] = 4$ and $N = 32$, $E[F] = 16$. The mean fan-out is
set to half of the number of output ports based on the assumption that
the probability of a multicast cell being sent to a given output port is
0.5.
Figure 4.11 compares the average multicast delays under different traffic
loads. A multicast cell is stored in the queue until all the destinations
in its fan-out set are reached. The multicast delay of a cell is
calculated as the number of cell times that the cell stays in the queue
until it is removed. Since the WBA and the MLRRMS (L = 0) both operate
only on the HOL cells, they become unstable under high offered loads. By
looking ahead at most 1 cell further, the MLRRMS (L = 1) demonstrates a
significant improvement of the multicast delay compared to the MLRRMS
(L = 0) and the WBA. As L increases, Figure 4.11 displays more
improvement from the MLRRMS (L = 2) and (L = 10), but the marginal
improvement is decreasing. That is, the improvement is not in proportion
to the hardware implementation complexity added to the switching system
to enable the switch to search more cells. From both Figure 4.11(a) and
Figure 4.11(b), it can be seen that by allowing the switch system to look
1 cell further into the queues, the system obtains the largest
improvement in terms of the multicast latency. Among all schemes, the
FIFOMS has the lowest delay because it uses the VOQ architecture to
handle the multicast traffic with a total number of queues of $N^2$, but
as discussed previously, the VOQ architecture is low in scalability.
The average queue size per input, including the cell in service, is
examined in Figure 4.12. Since the MLRRMS (L = 0) and the WBA operate
only on the HOL cells, they both suffer from the HOL blocking problem and
have the highest average queue size compared to the other schemes. Again,
a significant improvement of the MLRRMS (L = 1) can be observed in both
Figure 4.12(a) and Figure 4.12(b).
In Figure 4.13, the average LA depth, i.e. the actual number of cells
searched by the switch, is evaluated for both N = 8 and N = 32. For the
MLRRMS (L = 0), the average LA depth is obviously always 0. The MLRRMS
(L = 1) allows the switch to search up to 1 cell further in the queues.
When the traffic load is heavy, its average LA depth is almost the same
as L, which is set to 1. For the MLRRMS (L = 2) and (L = 10), the average
LA depth under heavy load is less than the L value, revealing that the
switch does not utilize its full potential. The average LA depth of the
MLRRMS (L = 10) is approximately 7 with N = 8 and 8 with N = 32 when the
switch is heavily loaded, indicating that the added implementation
complexity of the switch is not fully exploited and that the performance
improvement is nonlinear. Both the MLRRMS (L = 1) and (L = 2) begin to
converge once the offered load exceeds 0.9, because the queue sizes
become larger than the L values.
Bursty traffic is further applied with $N = 8$, $E[F] = 4$ and $N = 32$,
$E[F] = 16$, and the mean burst size $E[B] = 16$ [62].
Performance comparisons are first carried out under the bursty traffic
with the cell-based fan-out mode. In Figure 4.14, the average multicast
latency of all the scheduling schemes increases. The WBA and the MLRRMS
(L = 0) have the largest delay compared to the others. By looking ahead
up to 1 cell, the MLRRMS (L = 1) reduces the multicast latency
dramatically. The MLRRMS (L = 2) does not provide the same level of
improvement compared to the complexity it adds to the switch. The delay
performances of the MLRRMS (L = 10) and the FIFOMS are nearly the same
under heavy traffic loads. In Figure 4.15, the average queue size per
input is examined. Due to the bursty characteristic of the traffic, the
average queue size under light traffic load is larger than when the
Bernoulli traffic is applied. Since the fan-out scheme is cell-based, the
LA mechanism can still reduce the HOL blocking problem and increase the
output utilization of the switch. This can be observed in the improvement
of the MLRRMS (L = 1) in both Figure 4.14 and Figure 4.15. In
Figure 4.16, similar results are observed as in Figure 4.13, but the
average LA depth of the MLRRMS (L = 2) converges earlier than in
Figure 4.13. This is because the bursty traffic leads to a larger queue
size under light load than the Bernoulli traffic.
Figure 4.11: Average multicast latency under Bernoulli traffic. Panels:
(a) $\omega = 0$, N = 8, F = 4; (b) $\omega = 0$, N = 32, F = 16. Axes:
offered load (0.5 to 1.0) versus average multicast latency in cell times
(0 to 500). Curves: MLRRMS (L = 0, 1, 2, 10), FIFOMS, and WBA.
Figure 4.12: Average queue size per input under Bernoulli traffic.
Panels: (a) $\omega = 0$, N = 8, F = 4; (b) $\omega = 0$, N = 32, F = 16.
Axes: offered load (0.5 to 1.0) versus average queue size per input
(0 to 100). Curves: MLRRMS (L = 0, 1, 2, 10), FIFOMS, and WBA.
Figure 4.13: Average look-ahead depth under Bernoulli traffic. Panels:
(a) $\omega = 0$, N = 8, F = 4; (b) $\omega = 0$, N = 32, F = 16. Axes:
offered load (0.5 to 1.0) versus average LA depth (0 to 10). Curves:
MLRRMS (L = 0, 1, 2, 10).
Performance comparisons are then carried out under the bursty traffic
with the burst-based fan-out mode. Each burst of cells has the same
fan-out vector, which means that looking ahead a limited number of cells
in the queues is not enough for the switch to alleviate the HOL blocking
problem. A deeper search should be carried out for this type of traffic.
Among all schemes, the WBA has the largest multicast latency, as shown in
Figure 4.17. The FIFOMS performs better than the MLRRMS with any of the
L values. In the MLRRMS group, the improvement of the MLRRMS (L = 1) is
not as significant as in Figure 4.11 and Figure 4.14, and the MLRRMS
(L = 0), (L = 1), and (L = 2) have similar multicast delays. This
phenomenon corresponds to the reason stated previously, i.e. the
burst-based mode requires a larger L, since L = 1 or 2 is not enough to
cover most possibilities of the burst distribution. The MLRRMS (L = 10)
is able to reduce the latency significantly because of the large L.
In Figure 4.18, we obtain similar results as in Figure 4.17. All
algorithms suffer from performance degradation due to the traffic
pattern. The average LA depths of the MLRRMS with different L values
under the burst-based mode are shown in Figure 4.19. The MLRRMS (L = 1)
and (L = 2) begin to converge when the offered load reaches 0.6 due to
the traffic pattern. The MLRRMS (L = 10) also converges, at an offered
load of 0.8, which indicates that the traffic of the burst-based fan-out
mode puts a greater demand on the possible LA depth than the cell-based
scheme and that the queue size is always larger than 10.

Figure 4.14: Average multicast latency under bursty traffic (cell-based
fan-out mode). (a) $\omega = 0$, N = 8, F = 4; (b) $\omega = 0$, N = 32,
F = 16. [Plots omitted: average multicast latency (cell times) vs.
offered load for MLRRMS (L = 0, 1, 2, 10), FIFOMS and WBA.]

Figure 4.15: Average queue size per input under bursty traffic
(cell-based fan-out mode). (a) $\omega = 0$, N = 8, F = 4;
(b) $\omega = 0$, N = 32, F = 16. [Plots omitted: average queue size per
input vs. offered load for MLRRMS (L = 0, 1, 2, 10), FIFOMS and WBA.]

Figure 4.16: Average look-ahead depth under bursty traffic (cell-based
fan-out mode). (a) $\omega = 0$, N = 8, F = 4; (b) $\omega = 0$, N = 32,
F = 16. [Plots omitted: average LA depth vs. offered load for MLRRMS
(L = 0, 1, 2, 10).]

Figure 4.17: Average multicast latency under bursty traffic (burst-based
fan-out mode). (a) $\omega = 0$, N = 8, F = 4; (b) $\omega = 0$, N = 32,
F = 16. [Plots omitted: average multicast latency (cell times) vs.
offered load for MLRRMS (L = 0, 1, 2, 10), FIFOMS and WBA.]

Figure 4.18: Average queue size per input under bursty traffic
(burst-based fan-out mode). (a) $\omega = 0$, N = 8, F = 4;
(b) $\omega = 0$, N = 32, F = 16. [Plots omitted: average queue size per
input vs. offered load for MLRRMS (L = 0, 1, 2, 10), FIFOMS and WBA.]

Figure 4.19: Average look-ahead depth under bursty traffic (burst-based
fan-out mode). (a) $\omega = 0$, N = 8, F = 4; (b) $\omega = 0$, N = 32,
F = 16. [Plots omitted: average LA depth vs. offered load for MLRRMS
(L = 0, 1, 2, 10).]

Figure 4.20: Improvement of the sync mechanism under Bernoulli traffic,
with $\omega = 0$, N = 32, F = 16. [Plot omitted: average transmissions
per cell vs. offered load for MLRRMS with and without sync,
L = 0, 1, 2, 10.]

In the MLRRMS algorithm, the sync mechanism aims to reduce
unnecessary multiple transmissions of cells. Figure 4.20 compares
the average transmissions per cell under Bernoulli traffic for different
L values, i.e. L = 0, 1, 2, 10. An obvious difference between the
MLRRMS schemes with and without the sync mechanism can be observed
in the results. At an offered load of 0.8, for instance, the sync
mechanism reduces the average transmissions per cell by 1, which
results in a significant reduction of the cells traversing the switch
fabric.

Figure 4.21 compares the average transmissions per cell under bursty
traffic with the cell-based fan-out mode, and Figure 4.22 compares the
average transmissions per cell under bursty traffic with the burst-based
fan-out mode, for different L values. The clear difference between the
MLRRMS schemes with sync and without sync in Figure 4.21 is not
obvious in Figure 4.22. This is due to the traffic profile of the
burst-based fan-out mode, where cells in the same burst carry the same
fan-out vector. Under the burst-based fan-out mode, the scheduling
system needs to increase the L value in order to examine more cells.
With a larger L, cells from a different burst can be transmitted, which
results in an increase of the average transmissions per cell. This can
be observed by comparing the MLRRMS (L = 10) and the MLRRMS
(no sync, L = 0) in Figure 4.22.

Figure 4.21: Improvement of the sync mechanism under bursty traffic
(cell-based fan-out mode), with $\omega = 0$, N = 32, F = 16. [Plot
omitted: average transmissions per cell vs. offered load for MLRRMS
with and without sync, L = 0, 1, 2, 10.]

Figure 4.22: Improvement of the sync mechanism under bursty traffic
(burst-based fan-out mode), with $\omega = 0$, N = 32, F = 16. [Plot
omitted: average transmissions per cell vs. offered load for MLRRMS
with and without sync, L = 0, 1, 2, 10.]

4.6.3 Performance for Unbalanced Multicast Traffic
under the Same Offered Load
When $\omega > 0$, the multicast traffic becomes unbalanced. Each bit of
the fan-out vector has a different probability of being 1, as in
Formula 4.22. To examine how the balance factor $\omega$ affects the
performance, we keep the offered load fixed. From Formula 4.22, it can
be calculated that when $\omega$ becomes larger than the threshold
$\frac{N - E[F]}{E[F](N-1)}$, the actual mean fan-out begins to
decrease, due to the change of the Binomial distribution. The actual
mean fan-out, written here as $E[F]_{\mathrm{actual}}$, follows
Formula 4.23:

\[
E[F]_{\mathrm{actual}} =
\begin{cases}
E[F], & 0 \le \omega \le \dfrac{N - E[F]}{E[F](N-1)} \\[6pt]
\dfrac{E[F](N-1)(1-\omega)}{N} + 1, & \dfrac{N - E[F]}{E[F](N-1)} < \omega < 1
\end{cases}
\tag{4.23}
\]

Thus, to achieve a fixed offered load, the arrival rate should be
increased by a factor of $E[F] / E[F]_{\mathrm{actual}}$ in order to
compensate for the reduced mean fan-out. Unbalanced multicast traffic
with N = 8 and E[F] = 4 is applied for performance evaluation.
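
As a rough numerical illustration of Formula 4.23, the following Python sketch evaluates the threshold, the actual mean fan-out and the resulting arrival-rate scaling factor. The function names are hypothetical and chosen only for this example, and the sketch assumes that $\omega = 1$ can be evaluated with the second branch (which gives the unicast value of 1).

```python
def fanout_threshold(n_ports: int, mean_fanout: float) -> float:
    """Balance-factor threshold (N - E[F]) / (E[F] * (N - 1))."""
    return (n_ports - mean_fanout) / (mean_fanout * (n_ports - 1))


def actual_mean_fanout(omega: float, n_ports: int, mean_fanout: float) -> float:
    """Actual mean fan-out as a function of the balance factor (Formula 4.23)."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    if omega <= fanout_threshold(n_ports, mean_fanout):
        # Below the threshold the nominal mean fan-out is preserved.
        return float(mean_fanout)
    # Above the threshold the mean fan-out shrinks linearly towards 1.
    return mean_fanout * (n_ports - 1) * (1.0 - omega) / n_ports + 1.0


def arrival_rate_scale(omega: float, n_ports: int, mean_fanout: float) -> float:
    """Factor by which the arrival rate is raised to keep the offered load fixed."""
    return mean_fanout / actual_mean_fanout(omega, n_ports, mean_fanout)


if __name__ == "__main__":
    # Parameters used in the evaluation: N = 8, E[F] = 4.
    for omega in (0.0, 0.1, 0.5, 0.9, 1.0):
        print(omega,
              actual_mean_fanout(omega, 8, 4),
              arrival_rate_scale(omega, 8, 4))
```

For N = 8 and E[F] = 4 the threshold evaluates to 1/7, so in this setting the arrival-rate compensation only starts once the balance factor exceeds roughly 0.14.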
Figure 4.23 shows the average multicast latency vs. the balance
factor under a fixed offered load. The offered loads in Figure 4.23(a)
and Figure 4.23(b) are different because the maximum load that the
switch can handle is reduced when the burst-based fan-out mode is
applied. In both Figure 4.23(a) and Figure 4.23(b), the multicast
latency decreases with the increase of $\omega$ until $\omega$ goes
beyond the threshold in Formula 4.23. When $\omega$ becomes larger than
the threshold, the arrival rate increases to compensate for the loss of
E[F]. The multicast latency initially increases, and then decreases as
$\omega$ becomes larger. The WBA suffers from throughput degradation
under unbalanced fan-out schemes, which will be shown later, and thus
the focus is set on the comparison between the MLRRMS schemes. The
results show that the MLRRMS (L = 1) schemes have lower latency than
the MLRRMS (L = 0) schemes under various fan-out distributions in both
fan-out modes. Below the threshold of $\omega$, the delay curves drop as
$\omega$ increases. This is because traffic arriving at input i begins
to concentrate on output i, rather than spreading uniformly among all
outputs. The traffic load on the other outputs is therefore reduced,
resulting in a decrease of latency. When $\omega$ goes beyond the
threshold, the arrival rate begins to increase and thus a rise of the
curves can be observed in Figure 4.23. As $\omega$ continues to
increase, the multicast latency decreases again because the effect of
the balance factor outweighs that of the increasing arrival rate.

Figure 4.23: Average multicast latency under different balance factors.
(a) Cell-based fan-out with output load of 0.8; (b) Burst-based fan-out
with output load of 0.68. [Plots omitted: average multicast latency
(cell times, logarithmic scale) vs. balance factor $\omega$ for MLRRMS
(no sync, L = 0), MLRRMS (L = 0), MLRRMS (no sync, L = 1), MLRRMS
(L = 1) and WBA, with the threshold marked.]

Figure 4.24 illustrates the average number of transmissions per cell
vs. the balance factor under a fixed offered load. As $\omega$
increases, the number of transmissions decreases. When $\omega = 1.0$,
the traffic becomes unicast traffic, and thus the number of
transmissions per cell naturally becomes 1. As discussed previously,
E[F] decreases when $\omega$ crosses the threshold. This corresponds to
the results in Figure 4.23. Jointly observing Figure 4.23 and
Figure 4.24, we can see that the LA mechanism causes an increase in the
number of transmissions per cell in comparison to the basic schemes,
together with a decrease in the multicast latency. This is because, by
searching into the queues, cells are more likely to receive service
rather than being blocked, which naturally increases the number of
transmissions per cell. The HOL blocking problem is therefore
alleviated, consequently reducing the average multicast latency.

Figure 4.24: Average number of transmissions per cell under different
balance factors. (a) Cell-based fan-out with output load of 0.8;
(b) Burst-based fan-out with output load of 0.68. [Plots omitted:
average transmissions per cell vs. balance factor $\omega$ for MLRRMS
(no sync, L = 0), MLRRMS (L = 0), MLRRMS (no sync, L = 1), MLRRMS
(L = 1) and WBA.]

Figure 4.25 compares the throughput under different balance factors
with a fixed offered load. The MLRRMS schemes are able to maintain
a stable throughput for various fan-out distributions. The throughput
of the WBA is most affected by the change of $\omega$. It is important
to maintain the throughput for both balanced and unbalanced multicast
traffic within the load that the switch is able to process. Ideally,
throughput should be independent of the fan-out pattern, which is
difficult to predict in practice; otherwise the algorithm risks sudden
performance degradation. The proposed MLRRMS is able to meet such
requirements, as shown in Figure 4.25.
4.7 Summary
In this chapter, the MLRRMS algorithm with the LA mechanism and
the sync mechanism is proposed for $N \times N$ switches of the FIFO-IQ
architecture.

The LA and the sync mechanisms, which consist of matrix operations,
can be implemented in a parallel fashion with a low time complexity.
Under varying traffic conditions, the MLRRMS with L = 1 outperforms
the WBA and gains the largest performance improvement relative to
the added implementation complexity. The HOL blocking problem is
alleviated by the LA mechanism. With a larger L, the algorithm
performs close to the FIFOMS, which uses the VOQ structure, but the
marginal performance improvement obtained does not justify the
implementation complexity introduced.

In addition, under both balanced and unbalanced multicast traffic
conditions, the MLRRMS with L = 1 is able to maintain a stable
throughput compared to the WBA and achieves better performance than
the MLRRMS with L = 0. Being able to search up to 1 cell further into
the queues, i.e. being capable of processing the 2 cells at the head
of each queue within one cell time, instead of creating multiple
queues for each input, provides a significant performance improvement
in terms of multicast delay and average queue size in practical
switch design.

Figure 4.25: Throughput under different balance factors. (a) Cell-based
fan-out with output load of 0.8; (b) Burst-based fan-out with output
load of 0.68. [Plots omitted: throughput vs. balance factor $\omega$
for MLRRMS (no sync, L = 0), MLRRMS (L = 0), MLRRMS (no sync, L = 1),
MLRRMS (L = 1) and WBA, with the threshold marked.]

Chapter 5
Out-of-Sequence Prevention for Multicast Input-Queuing
Space-Memory-Memory Clos-Network
To increase scalability, a multistage interconnection network is
introduced to the switch fabric to reduce the number of crosspoints
compared with a single-stage fabric such as the crossbar. Since it was
introduced by Charles Clos in 1953, the three-stage Clos architecture
has constantly played an important role in the construction of
high-scalability and large-capacity switches.

For three-stage Clos-networks with a bufferless central stage, often
referred to as the Space-Space-Space (S3) and Memory-Space-Memory
(MSM) architectures, complex scheduling algorithms are required to
avoid potential contentions due to the bufferless central stage. To
reduce the scheduling complexity, buffers can be applied to the central
stage. With a cell dispatching algorithm, the buffers at the input
stage can be removed to reduce the implementation complexity. With
buffers at the output stage, contentions are absorbed. This results in
the Space-Memory-Memory (SMM) Clos-network architecture. However,
cells traversing the fabric through different routes may experience
different delays due to the introduced buffers, resulting in a
disordered cell sequence when arriving at the outputs. This problem is
referred to as the Out-Of-Sequence (OOS) problem. Reordering the
disordered cells at the outputs incurs additional reassembly delay and
requires more buffers at the output ports. Multicast traffic is
sensitive to the OOS problem because a single delayed flow in the
multicast group may ruin the application, such as Internet Protocol
Television (IPTV) or teleconferencing.

In this chapter, an Input Queuing (IQ)-SMM architecture is proposed,
aiming for high scalability. In order to prevent the OOS problem, two
novel cell dispatching algorithms are proposed, i.e. the Multicast
Flow-based DSRR (MF-DSRR) and the Multicast Flow-based Round-Robin
(MFRR). In both schemes, cells are no longer dispatched independently
but are treated as flows in order to maintain their order. In
comparison with the Desynchronized Static Round Robin (DSRR) [77],
which treats cells independently, analytical and simulation results
show that the MF-DSRR and MFRR outperform the DSRR in terms of
reassembly delay and buffer size.
5.1 Introduction
Multicast handles traffic in a resource-efficient manner, especially
when it comes to bandwidth-intensive services such as IPTV or
teleconferencing. A multicast router/switch can be implemented using a
single crossbar switching fabric, and various publications have
discussed the scheduling algorithms for such a fabric type
[7, 60-62, 75, 78]. However, the scalability of crossbar switches is
limited by the growth of the number of cross points, $N^2$, where N
denotes the number of input/output ports, as shown in Figure 5.1.

Figure 5.1: Crossbar switch fabric. The number of cross points is
$N^2$, where N denotes the number of input/output ports. [Diagram
omitted: inputs and outputs numbered 0 to N-1.]

A Clos-network consists of several Switching Elements (SEs), as shown
in Figure 5.2, and can be denoted as $C(n, m, r)$, where n is the
number of input/output ports on an Input Module (IM)/Output Module
(OM), m is the number of Central Modules (CMs), and r is the number of
IMs/OMs. Since each SE is a crossbar switch, the total number of cross
points becomes $2nmr + mr^2$, where $nr = N$. By using multiple smaller
crossbar switches, the Clos-network saves $N^2 - 2nmr - mr^2$ cross
points when the number of input/output ports, N, is large, and is thus
more scalable than crossbar switches [65, 79].
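
The cross-point arithmetic above can be checked with a short Python sketch. The function names and the example dimensions are illustrative assumptions and do not come from the evaluated switch configurations.

```python
def crossbar_crosspoints(n_ports: int) -> int:
    """Cross points of a single N x N crossbar: N^2."""
    return n_ports * n_ports


def clos_crosspoints(n: int, m: int, r: int) -> int:
    """Cross points of a three-stage Clos-network C(n, m, r).

    r IMs of size n x m, m CMs of size r x r and r OMs of size m x n,
    i.e. 2*n*m*r + m*r^2 in total.
    """
    return 2 * n * m * r + m * r * r


def crosspoint_saving(n: int, m: int, r: int) -> int:
    """Cross points saved relative to an N x N crossbar with N = n * r."""
    return crossbar_crosspoints(n * r) - clos_crosspoints(n, m, r)


if __name__ == "__main__":
    # Illustrative sizes only: C(8, 8, 8) versus C(2, 2, 2).
    print(crosspoint_saving(8, 8, 8))  # N = 64
    print(crosspoint_saving(2, 2, 2))  # N = 4
```

For small N the saving can be negative (C(2, 2, 2) needs more cross points than a 4 x 4 crossbar), which is consistent with the statement that the advantage appears when N is large.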
Usually, variable-length packets are segmented into several fixed-size
cells before traversing the switch fabric. The IM is responsible for
choosing one or several central switching modules to which the
incoming multicast cells are sent, according to its Cell Dispatching
(CD) algorithm. There are various CD algorithms for different types of
Clos-networks, which are reviewed in Section 5.2.
The MSM architecture, shown in Figure 5.3, refers to the architecture
where memories are allocated at the output ports of the IMs and the
OMs. Since this architecture uses Output Queuing (OQ) for both IMs
and OMs, it requires a speedup of the buffer memory, which hinders the
scalability of the switch. A bufferless Clos-network, referred to as
the S3 architecture, has no speedup problem, but a complex scheduling
algorithm is required to resolve contention because cells cannot be
stored in any SE. A fully buffered architecture, referred to as the
Memory-Memory-Memory (MMM) architecture and shown in Figure 5.4, has
no contention problem and requires no complex scheduling algorithm,
but the implementation is costly since buffers are required for all
the stages.

Figure 5.2: Clos-network switch fabric. The Clos-network consists of
three stages of smaller-size switching elements. [Diagram omitted:
r IMs (IM 0 to IM r-1), m CMs (CM 0 to CM m-1) and r OMs (OM 0 to
OM r-1); each IM and OM has n ports numbered 0 to n-1.]

The SMM Clos-network architecture, proposed in [77], is shown to
be able to achieve 100% throughput under admissible traffic. However,
this architecture uses the OQ scheme at the CMs and the OMs, which
limits the scalability due to the memory speedup problem. This
architecture is referred to as the OQ-SMM in this chapter in order to
distinguish it from the proposed IQ-SMM architecture. In addition to
the speedup problem, as discussed, cells may experience different
delays and arrive at the outputs out of order due to the buffers at
the CMs if the CD algorithm is not carefully designed. Disordered
cells need to be re-sequenced and reassembled into packets at the
output ports. This requires extra memory and results in a longer
delay. This problem is referred to as the Out-Of-Sequence (OOS)
problem, which can be categorized into two types: inter-packet OOS and
in-packet OOS. Inter-packet OOS means that cells generated by
different multicast packets are disordered, and in-packet OOS means
that cells generated by the same multicast packet are disordered. The
purpose of this subdivision is to differentiate the characteristics of
OOS and to analyze and compare different CD algorithms, so that
appropriate reassembly methods can be applied if required.

To the best of our knowledge, the OQ-SMM architecture proposed
in [77] has not been evaluated for multicast. Therefore, in this
chapter, multicast traffic is used to analyze the performance. In
order to reduce the internal speedup requirement of the OQ-SMM and to
solve the OOS problem, the IQ-SMM Clos-network with the Multicast
Flow-based DSRR (MF-DSRR) and the Multicast Flow-based Round-Robin
(MFRR) dispatching algorithms is proposed in this chapter. IQ
crossbar switches, presented in Chapter 4, are leveraged as the CMs
and OMs in the IQ-SMM architecture. The MF-DSRR/MFRR runs
independently in the IMs, distributing incoming cells to the CMs.
The MF-DSRR utilizes the connection pattern of the DSRR and takes
multicast into consideration. The implementation complexity of the
MF-DSRR is thus low, and it can alleviate the OOS problem. The
MFRR, by using more resources, is able to eliminate the in-packet OOS
problem and reduce the reassembly buffer size, and thus results in a
smaller reassembly delay compared to the DSRR and the MF-DSRR.

Figure 5.3: The Memory-Space-Memory (MSM) Clos-network architecture.
Memories are placed at the outputs of the IMs and the OMs. [Diagram
omitted: r IMs, m CMs and r OMs, each IM and OM with n ports numbered
0 to n-1.]

Figure 5.4: The Memory-Memory-Memory (MMM) Clos-network architecture.
Memories are placed at the outputs of all switching elements. [Diagram
omitted: r IMs, m CMs and r OMs, each IM and OM with n ports numbered
0 to n-1.]

The remaining parts of this chapter are structured as follows. In
Section 5.2, different CD algorithms for various Clos-network
architectures are reviewed. Section 5.3 describes the Clos-network
model that is used throughout this chapter for the proposed CD
algorithms. Section 5.4 introduces the MF-DSRR and the MFRR with
detailed descriptions. Section 5.5 presents the analysis and shows the
simulated performance. Section 5.6 concludes the chapter.
5.2 Related Work
Various scheduling algorithms have been developed for different
Clos-network architectures with respect to buffer allocations.
Table 5.1 compares the four existing architectures.

Architecture  Complex algorithm  Memory speedup  OOS  Examples
S3            yes                no              no   Distro [80]
MSM           yes                yes             no   MWMD [81], CRRD/CMSD [82]
MMM           no                 yes             yes  -
OQ-SMM        no                 yes             yes  DSRR [77]

Table 5.1: A comparison of different Clos-network architectures.

The architecture where memories are only allocated in the input and
output stages is referred to as the MSM architecture. The Maximum
Weight Matching Dispatching (MWMD) [81] and the Concurrent Round-Robin
Dispatching (CRRD)/Concurrent Master-Slave round-robin Dispatching
(CMSD) [82] are proposed for the MSM using round-robin arbitration.
One disadvantage of these schemes is that both the input and output
stages use shared memory schemes, resulting in a requirement of memory
speedup, which hinders the scalability of the switch. To solve this
problem, a bufferless S3 Clos-network architecture is proposed in
[80]. Based on the Static Round-Robin Dispatching (SRRD), Distro [80]
is proposed for the S3 architecture and is demonstrated to achieve
100% throughput under uniform traffic. Both the MSM and the S3
architectures require complex algorithms to solve the cell contention
problems, since no buffers are allocated at the central stage.

In [77], the OQ-SMM Clos-network architecture with the DSRR
dispatching is proposed. The study demonstrates that the SMM can
achieve 100% throughput with the DSRR under admissible traffic.
However, the OQ-SMM with DSRR in [77] uses OQ for both buffered
stages, where speedup is again required. Besides, since memories are
placed in the central stage, cells can experience different delays and
thus re-sequencing is required to solve the OOS problem at the output
ports. For multicast, the DSRR, which was initially designed for
unicast and has not been evaluated for multicast, can result in a
serious OOS problem, causing different delays within the same
multicast group. This strongly affects the performance of some
multicast applications, e.g. IPTV, since users of the same group
experience different delays.
5.3 System Model
Assume that a First-In-First-Out (FIFO) queue is installed at each
input port to temporarily store multicast packets. Packets are assumed
to be segmented into fixed-size cells in the Input Port Processors
(IPPs) before entering the IMs and to be reassembled at the Output
Port Processors (OPPs) after traversing the switch fabric. The IQ-SMM
Clos-network switch model consists of three stages of SEs and is
denoted as $C(n, m, r)$. As shown in Figure 5.5, the switch has r
IMs/OMs of size $n \times m$ and m CMs of size $r \times r$. Each IM/OM
has n connections to IPPs/OPPs and m interstage connections to CMs.
Only one interstage connection exists between an IM/OM and a CM. The
number of input/output ports of the switch is $N = nr$. Cell
dispatching algorithms are run by the IMs. Buffers are placed at the
inputs of the CMs and the OMs to store the incoming cells. A CM or an
OM runs the Multi-Level Round-Robin Multicast Scheduling (MLRRMS)
algorithm proposed in Chapter 4. All CMs and OMs have the same maximum
look-ahead depth. Each buffered SE can be considered as an IQ switch
as proposed in Chapter 4.

Figure 5.5: The Input-Queued Space-Memory-Memory (IQ-SMM) Clos-network
architecture. Buffers are placed at the inputs of the central modules
and the output modules. [Diagram omitted: IPPs feeding r IMs running
the CD algorithm, m CMs and r OMs with a maximum LA depth, and OPPs at
the outputs; each IM and OM has n ports numbered 0 to n-1.]

Since the IQ-SMM architecture is applied, the IMs are bufferless
and forward incoming cells to the CMs following certain cell
dispatching algorithms, which are investigated in the following
sections. CMs and OMs are IQ crossbar switches. Each CM has r FIFO
input queues and each OM has m FIFO input queues, each of which is
connected to one interstage link. Several notations used throughout
this chapter are listed as follows:
$IM_i$: the i-th input module, where $0 \le i \le r - 1$
$CM_k$: the k-th central module, where $0 \le k \le m - 1$
$OM_j$: the j-th output module, where $0 \le j \le r - 1$
$I_{i,p}$: the p-th input port of $IM_i$, where $0 \le p \le n - 1$
$O_{j,q}$: the q-th output port of $OM_j$, where $0 \le q \le n - 1$
$QC_{i,k}$: the i-th input queue of $CM_k$, connected to $IM_i$
$QO_{k,j}$: the k-th input queue of $OM_j$, connected to $CM_k$
$IL_{i,k}$: the interstage link connecting $IM_i$ and $CM_k$
$CL_{k,j}$: the interstage link connecting $CM_k$ and $OM_j$
Assume that each packet carries a fan-out vector
$b = \langle b_j \rangle$, $b_j \in \{0, 1\}$, $0 \le j \le N - 1$, where
$b_j = 1$ indicates that the packet is destined to the j-th output
port, and $b_j = 0$ otherwise. An example of the fan-out vector is
shown in Figure 5.6. Cells generated from the same packet have the
same fan-out vector. Cells are replicated as far downstream as
possible in the switch model. More specifically, no cell replication
occurs in the IMs; instead, multicast capability is implemented in
the CMs and OMs. The FIFO queue is assumed to be able to examine the
fan-out vector of each packet and inform the switch fabric of any
fan-out change.

Figure 5.6: Demonstration of a fan-out vector of N bits. The cell
carrying the fan-out vector shown in this figure is bound for output
ports 1, 2, and N-3, since the bits at those positions are 1.

Figure 5.7: An example of the bit-cluster. The fan-out vector has
N = 12 bits, and each bit-cluster has n = 4 bits. Therefore the fan-out
vector can also be expressed by 3 bit-clusters. The cell is sent to
$OM_0$ and $OM_1$ accordingly.

A bit-cluster is defined as a set of n consecutive bits in the fan-out
vector, and is denoted as $c_d = \langle b_j \rangle$,
$0 \le d \le r - 1$, $nd \le j \le nd + n - 1$. A fan-out vector
consists of r non-overlapping bit-clusters and therefore the
intersection of different bit-clusters is empty, i.e.
$c_{d_1} \cap c_{d_2} = \emptyset$ if $d_1 \ne d_2$. Thus, a fan-out
vector can also be expressed by bit-clusters, $b = \langle c_d \rangle$,
$0 \le d \le r - 1$. Define
$|c_d| = \min\left(1, \sum_{j=nd}^{nd+n-1} b_j\right)$. The $CM_k$
transmits an incoming cell by examining the r bit-clusters of the
fan-out vector. If $|c_d| \ne 0$, $CM_k$ sends a copy of the cell to
$OM_d$. The $OM_d$ examines all the bits in $c_d$, and if $b_j = 1$,
the $OM_d$ sends a copy of the cell to $O_{d, j - nd}$. Figure 5.7
shows an example of the bit-cluster with n = 4 in a fan-out vector
with N = 12.
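
The bit-cluster rule above can be sketched in a few lines of Python. The function names are hypothetical, and the concrete fan-out vector is an illustrative assumption: it reproduces only the cluster pattern of Figure 5.7 (first two clusters non-zero, third cluster zero), not the exact bits of the figure.

```python
from typing import List


def bit_clusters(fanout: List[int], n: int) -> List[List[int]]:
    """Split an N-bit fan-out vector into r = N / n non-overlapping bit-clusters."""
    if n <= 0 or len(fanout) % n != 0:
        raise ValueError("fan-out vector length must be a multiple of n")
    return [fanout[d * n:(d + 1) * n] for d in range(len(fanout) // n)]


def cluster_weight(cluster: List[int]) -> int:
    """|c_d| = min(1, sum of the bits in the cluster)."""
    return min(1, sum(cluster))


def cm_destinations(fanout: List[int], n: int) -> List[int]:
    """Indices d of the output modules to which a CM sends a copy of the cell."""
    return [d for d, c in enumerate(bit_clusters(fanout, n))
            if cluster_weight(c) != 0]


def om_destinations(fanout: List[int], n: int, d: int) -> List[int]:
    """Local output ports (j - n*d) of OM_d that receive a copy of the cell."""
    return [j - n * d for j in range(n * d, n * d + n) if fanout[j] == 1]


if __name__ == "__main__":
    # Illustrative vector with N = 12 and n = 4 (bits chosen for the example).
    b = [0, 1, 1, 0,  1, 0, 0, 0,  0, 0, 0, 0]
    print(cm_destinations(b, 4))
    print(om_destinations(b, 4, 0), om_destinations(b, 4, 1))
```

Keeping the CM decision at cluster granularity reflects the statement that cells are replicated as far downstream as possible: a CM makes at most one copy per OM, and the per-port copies are made only inside the OM.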
5.4 Cell Dispatching Algorithms
In general, a cell dispatching algorithm for an IM specifies which CM
the incoming cells are sent to, following specific requirements such
as balancing load or minimizing blocking probability. For the S3 or
the MSM Clos-network architectures, complex algorithms or strategies
have been proposed in [83, 84] to avoid internal cell contentions. In
the SMM architecture, however, buffers are introduced in the CMs and
OMs to resolve the internal cell contentions. Cells are temporarily
stored in the buffers until the intended output links are available,
and thus no internal blocking is experienced. As discussed in [77],
the DSRR runs independently in each IM and connects each input to all
outputs in a round-robin fashion. The connection pattern changes after
each cell time as shown in Figure 5.8, resulting in a balanced
distribution of cells to the CMs. However, when DSRR is applied to the
IQ-SMM architecture shown in Figure 5.5, it causes a serious OOS
problem since DSRR treats each cell independently. Therefore, finding
cell dispatching algorithms that take the sequential order of incoming
cells into account is of great importance.

Figure 5.8: Desynchronized Static Round Robin (DSRR) connection
pattern. The connection pattern changes after each cell time. [Diagram
omitted: connections between ports 0 to 5 in time slots k to k+6.]
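
To make the per-cell-time rotation concrete, the sketch below models a DSRR-like pattern in Python. The mapping used here, input p connected to interstage link (p + t) mod m in time slot t, is an assumption chosen to match the shifting pattern described above; the exact rule in [77] may differ in detail. The function names are hypothetical.

```python
from typing import List


def dsrr_link(p: int, t: int, m: int) -> int:
    """Assumed DSRR rule: input p uses interstage link (p + t) mod m in slot t."""
    return (p + t) % m


def dsrr_pattern(n: int, m: int, t: int) -> List[int]:
    """Interstage link used by each of the n inputs of an IM in time slot t."""
    return [dsrr_link(p, t, m) for p in range(n)]


if __name__ == "__main__":
    # Successive cells arriving at input 0 are spread over all CMs,
    # regardless of which packet they belong to.
    print([dsrr_link(0, t, 6) for t in range(6)])
```

Under this rule, consecutive cells of one packet on the same input traverse different CM queues, which is why cells can overtake each other and arrive out of sequence.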
5.4.1 Multicast Flow-based Desynchronized Static
Round-Robin (MF-DSRR) Dispatching
The MF-DSRR runs independently in each IM and requires that IMi
is notied when a change of received fan-out vector occurs. Unlike
DSRR [77] that changes the IM connection pattern after every cell time,
the MF-DSRR modies the connections of all the input ports of the IM
in a DSRR manner when and only when a notication is received. By
utilizing the principle of the DSRR, the MF-DSRR maintains a low im-
plementation complexity while reducing in-packet OOS problem. When
96
Out-of-Sequence Prevention for Multicast Input-Queuing
Space-Memory-Memory Clos-Network
a new fan-out vector is detected, regardless on which input port the
change occurs, each input port of the IMi moves its connection to the
interstage link next to the current one of the IMi in a round-robin man-
ner before a time slot begins. At most one change of the connection
pattern is performed before a cell time.
The example shown in Figure 5.9 demonstrates how the connection pattern
changes for a $4 \times 6$ IM. The initial configuration can be
($I_{i,0} \rightarrow IL_{i,0}$, $I_{i,1} \rightarrow IL_{i,1}$,
$I_{i,2} \rightarrow IL_{i,2}$, $I_{i,3} \rightarrow IL_{i,3}$), as shown
in Figure 5.9(1), where each input port is serving a flow of cells with
the same fan-out vector. As a fan-out change occurs on $I_{i,0}$ from
$f_0$ to $f_0'$, each input port moves its connection to the next
interstage link, resulting in the connection pattern
($I_{i,0} \rightarrow IL_{i,1}$, $I_{i,1} \rightarrow IL_{i,2}$,
$I_{i,2} \rightarrow IL_{i,3}$, $I_{i,3} \rightarrow IL_{i,4}$), as shown
in Figure 5.9(2). This connection pattern may be kept for several cell
times until another change is detected, as in Figure 5.9(3), where
$I_{i,3}$ has a fan-out change from $f_3$ to $f_3'$. The connection
pattern then becomes ($I_{i,0} \rightarrow IL_{i,2}$,
$I_{i,1} \rightarrow IL_{i,3}$, $I_{i,2} \rightarrow IL_{i,4}$,
$I_{i,3} \rightarrow IL_{i,5}$). If another fan-out change occurs on
$I_{i,2}$, the pattern becomes ($I_{i,0} \rightarrow IL_{i,3}$,
$I_{i,1} \rightarrow IL_{i,4}$, $I_{i,2} \rightarrow IL_{i,5}$,
$I_{i,3} \rightarrow IL_{i,0}$).
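The connection updates in this example can be reproduced with a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `mf_dsrr_step`, the list-based representation of the connection pattern, and the boolean change flags are illustrative choices, not part of the original design.

```python
def mf_dsrr_step(pattern, m, fanout_changed):
    """Return the IM connection pattern for the next cell time.

    pattern[p] is the index of the interstage link used by input port p.
    All ports shift together by one link, at most once per cell time,
    and only if at least one port reported a fan-out change.
    """
    if not any(fanout_changed):
        return list(pattern)
    return [(link + 1) % m for link in pattern]


# 4 x 6 IM from the example: input port p starts on interstage link p.
m = 6
pattern = [0, 1, 2, 3]
history = [pattern]

# Fan-out changes on input port 0, then port 3, then port 2.
for port in (0, 3, 2):
    changed = [p == port for p in range(len(pattern))]
    pattern = mf_dsrr_step(pattern, m, changed)
    history.append(pattern)

for step, connections in enumerate(history):
    print(step, connections)
```

Which port reports the change does not matter in this sketch: any change shifts every port, which is why a flow on an unchanged port can still be moved to a different CM in the middle of a packet.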
5.4.2 Multicast Flow-based Round-Robin (MFRR)
Dispatching
The MFRR also runs independently in each IM and requires that each
input port monitors changes of the received fan-out vectors. Even though
the MF-DSRR has a low implementation complexity, it fails to eliminate
the in-packet OOS. During the transmission of a flow of cells with the
same fan-out vector, the connection pattern may be changed because of a
fan-out change detected on another input port. This distributes
same-packet cells (cells belonging to the same packet) to different CMs.
In the MFRR, the connection changes independently on each input port,
which eliminates the in-packet OOS problem.
A list called AvailableList is created for each IM to record the idle
interstage links as its elements. Elements can only be popped from the
top of the list and inserted at the bottom. When an element is popped,
all the remaining elements in the AvailableList move one position up
towards the top. Since there are $n$ inputs and $m$ interstage links in
the IM, the number of elements in the AvailableList is $(m - n)$.
Therefore, in order to use the MFRR, the switch must have $m > n$.

Figure 5.9: Multicast Flow-based DSRR cell dispatching (MF-DSRR). The
connection pattern changes only when a fan-out change is detected. Each
input port moves its connection to the interstage link next to the
current one. (Panels 1 to 4 show the connections between the four input
flows and interstage links 0 to 5 after successive fan-out changes.)
Upon detecting a change of fan-out vector on $I_{i,p}$, $IM_i$ pops an
idle interstage link $IL_{i,k'}$ from the AvailableList, tears down the
connection ($I_{i,p} \rightarrow IL_{i,k}$) and sets up a new connection
($I_{i,p} \rightarrow IL_{i,k'}$) instead. The released interstage link
$IL_{i,k}$ is inserted at the bottom of the AvailableList. For those
input ports where no fan-out vector change is detected, the connections
to the interstage links are unchanged. If fan-out vector changes are
detected on several input ports in the same time slot, ties are broken
by randomly assigning the idle interstage links popped from the
AvailableList to the input ports.
To better illustrate the scheme, the $4 \times 6$ IM shown in
Figure 5.10 is considered. The connection pattern can initially be
($I_{i,0} \rightarrow IL_{i,0}$, $I_{i,1} \rightarrow IL_{i,1}$,
$I_{i,2} \rightarrow IL_{i,2}$, $I_{i,3} \rightarrow IL_{i,3}$) with
$\{IL_{i,4}, IL_{i,5}\}$ in the AvailableList. Assume a new fan-out
vector $f_0'$ occurs on $I_{i,0}$: $IL_{i,4}$ is popped from the list and
a new configuration is established as ($I_{i,0} \rightarrow IL_{i,4}$,
$I_{i,1} \rightarrow IL_{i,1}$, $I_{i,2} \rightarrow IL_{i,2}$,
$I_{i,3} \rightarrow IL_{i,3}$). The released interstage link $IL_{i,0}$
is inserted at the bottom of the AvailableList, which becomes
$\{IL_{i,5}, IL_{i,0}\}$. When a new fan-out vector is detected by
$I_{i,3}$, $IL_{i,5}$ is popped from the AvailableList and the connection
pattern ($I_{i,0} \rightarrow IL_{i,4}$, $I_{i,1} \rightarrow IL_{i,1}$,
$I_{i,2} \rightarrow IL_{i,2}$, $I_{i,3} \rightarrow IL_{i,5}$) is
established, with the AvailableList being $\{IL_{i,0}, IL_{i,3}\}$.
Further connection pattern modifications caused by fan-out vector
changes are also shown in the figure.
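The AvailableList behaves like a first-in first-out queue of idle links, so the MFRR update can be sketched with a double-ended queue. The class name `MfrrInputModule` and its attributes are hypothetical, introduced only for illustration; the sketch handles one fan-out change at a time and does not model the random tie-breaking used when several ports change in the same time slot.

```python
from collections import deque


class MfrrInputModule:
    """Hypothetical model of one n x m IM under MFRR (requires m > n)."""

    def __init__(self, n, m):
        if m <= n:
            raise ValueError("MFRR needs m > n so that idle links exist")
        # Initially input port p is connected to interstage link p.
        self.connection = list(range(n))
        # Idle links; the top of the list is the left end of the deque.
        self.available = deque(range(n, m))

    def fanout_change(self, port):
        """Move `port` to the idle link on top of the list in O(1) time."""
        new_link = self.available.popleft()
        # The released link goes to the bottom of the list.
        self.available.append(self.connection[port])
        self.connection[port] = new_link
        return new_link


im = MfrrInputModule(n=4, m=6)
im.fanout_change(0)   # port 0 moves to link 4, list becomes [5, 0]
after_first = (list(im.connection), list(im.available))
im.fanout_change(3)   # port 3 moves to link 5, list becomes [0, 3]
after_second = (list(im.connection), list(im.available))
print(after_first, after_second)
```

Only the port that detected the change is touched, so a flow on any other port keeps its interstage link for the whole packet.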
5.5 Performance Analysis and Simulation
Results
The traffic to each input port $I_{i,p}$ is assumed to be an independent
Poisson arrival process with arrival rate $\lambda \le 1$.
Variable-length packets are segmented into $L$ fixed-size cells in the
IPPs, where $L$ is a uniformly distributed random variable with mean
$E(L) = \bar{L}$.

Packets are independent, and each packet is bound for an output port
$O_{j,q}$ with probability $p$:

$$P(b_j = 1) = p \tag{5.1}$$

The fan-out of a fan-out vector is defined as
$F \triangleq |b| = \sum_{j=0}^{N-1} b_j$.
Figure 5.10: Multicast Flow-based Round Robin cell dispatching (MFRR). An
AvailableList is maintained by each IM. When a fan-out vector change is
detected, a link is popped from the top of the list and the input port
moves its connection accordingly. (Panels 1 to 5 show the connections to
interstage links 0 to 5 and the contents of the AvailableList after
successive fan-out changes.)
Since each packet is bound for at least one output port, the fan-out $F$
of each packet is distributed as:

$$P(F = f) = \frac{\binom{N}{f} p^{f} (1 - p)^{N - f}}{1 - (1 - p)^{N}} \tag{5.2}$$

$$E(F) = \bar{F} = \frac{N p}{1 - (1 - p)^{N}} \tag{5.3}$$

As described in Section 5.3, the CMs only observe the bit-clusters, and
the fan-out $F_{CM} \triangleq \sum_{d} |c_d|, \forall d$, seen by a CM
becomes:

$$P(F_{CM} = f) = \frac{\binom{r}{f} \left[1 - (1 - p)^{n}\right]^{f} \left[(1 - p)^{n}\right]^{r - f}}{1 - (1 - p)^{N}} \tag{5.4}$$

$$E(F_{CM}) = \bar{F}_{CM} = \frac{r \left[1 - (1 - p)^{n}\right]}{1 - (1 - p)^{N}} \tag{5.5}$$

All traffic is admissible, which implies that no input or output port
is oversubscribed. The total packet traffic load on all output ports is
$N \lambda \bar{F}$, and since traffic is equally distributed among all
output ports, the offered packet load seen on each output is
$\lambda \bar{F}$.
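As a quick numerical check of the fan-out model, the conditioned binomial distribution of the packet fan-out can be summed directly and compared with the closed-form means. The sketch below assumes illustrative values (N = 16 output ports split into r = 4 groups of n = 4, and p = 0.5); these values and the function names are assumptions made for the example, not parameters stated in this section.

```python
from math import comb


def fanout_pmf(N, p):
    """P(F = f) for f = 1..N: a binomial conditioned on F >= 1."""
    norm = 1.0 - (1.0 - p) ** N
    return {
        f: comb(N, f) * p**f * (1.0 - p) ** (N - f) / norm
        for f in range(1, N + 1)
    }


def mean_fanout(N, p):
    """Closed-form mean packet fan-out, E(F)."""
    return N * p / (1.0 - (1.0 - p) ** N)


def mean_cm_fanout(n, r, p):
    """Closed-form mean fan-out seen by a CM, E(F_CM), with N = n * r."""
    N = n * r
    return r * (1.0 - (1.0 - p) ** n) / (1.0 - (1.0 - p) ** N)


# Illustrative (assumed) parameters.
N, n, r, p = 16, 4, 4, 0.5
pmf = fanout_pmf(N, p)
total = sum(pmf.values())
mean_from_pmf = sum(f * q for f, q in pmf.items())

print(total, mean_from_pmf, mean_fanout(N, p), mean_cm_fanout(n, r, p))
```

The probabilities sum to one and the mean computed from the distribution agrees with the closed form; the mean fan-out seen by a CM can never exceed r, because a CM addresses at most one bit-cluster per output module.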
5.5.1 In-Packet OOS Performance of the MF-DSRR
The main principle of the MF-DSRR is to maintain the IM connection
pattern until a change of fan-out vector occurs in any of the traffic
received by the IM. Intuitively, this low-complexity scheme cannot
guarantee the elimination of in-packet OOS, because packet lengths vary
and packet arrivals are unpredictable. For an input port $I_{i,p}$ under
MF-DSRR cell dispatching, the probability of $j$ connection pattern
changes ($j = 0, 1, 2, \ldots, L$) during the transmission of a packet
is:

$$P(j) = \binom{L}{j} \left[(n - 1)\lambda\right]^{j} \left[1 - (n - 1)\lambda\right]^{L - j} \tag{5.6}$$

After $m$ changes, the connection pattern of an IM returns to its
starting configuration; thus the probability that same-packet cells are
sent to different CMs is:

$$P_{\text{MF-DSRR}} = 1 - \sum_{\kappa} P(\kappa) = 1 - (1 - \hat{\lambda})^{L} \left[ 1 + \binom{L}{m} \left( \frac{\hat{\lambda}}{1 - \hat{\lambda}} \right)^{m} + \binom{L}{2m} \left( \frac{\hat{\lambda}}{1 - \hat{\lambda}} \right)^{2m} + \cdots \right] \tag{5.7}$$

where $\kappa = 0, m, 2m, \ldots$, and $\hat{\lambda} = (n - 1)\lambda$.
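Equations (5.6) and (5.7) can be evaluated numerically for a fixed packet length. The sketch below is only an illustration: it treats L as a constant rather than a random variable, and the parameter values (L = 13 cells, m = 7 interstage links, n = 4 input ports, arrival rate 0.05) are assumptions chosen so that (n - 1) times the arrival rate stays below one.

```python
from math import comb


def p_changes(j, L, n, lam):
    """Equation (5.6): probability of j pattern changes while L cells are sent."""
    lam_hat = (n - 1) * lam
    return comb(L, j) * lam_hat**j * (1.0 - lam_hat) ** (L - j)


def p_oos_mf_dsrr(L, m, n, lam):
    """Equation (5.7): one minus the probability that the number of pattern
    changes is a multiple of m (0, m, 2m, ...), in which case the pattern
    is back where it started."""
    return 1.0 - sum(p_changes(k, L, n, lam) for k in range(0, L + 1, m))


# Illustrative (assumed) parameters.
L, m, n, lam = 13, 7, 4, 0.05
p_oos = p_oos_mf_dsrr(L, m, n, lam)
print(p_oos)
```

With no competing arrivals (an arrival rate of zero) the function returns zero, which matches the intuition that the pattern never changes while a packet is in flight.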
5.5.2 In-Packet OOS Performance of the MFRR
Unlike the DSRR and the MF-DSRR, the MFRR maintains the connection of
each input port independently. Changes of fan-out vectors on one input
port have no influence on the others. Thus, during the transmission of
same-packet cells, no connection interruption occurs. The probability
that same-packet cells are sent to different CMs is:

$$P_{\text{MFRR}} = 0 \tag{5.8}$$

For the DSRR dispatching scheme, the connection pattern changes every
cell time. Therefore the probability that same-packet cells are sent to
different CMs is:

$$P_{\text{DSRR}} = P(L > 1) = 1 - P(L = 1) = 1 - \frac{1}{L_{\max}} \tag{5.9}$$

where $L_{\max}$ is the maximum number of cells contained in a packet.

The relation between the probabilities of same-packet cells being sent
to different CMs under the DSRR, the MF-DSRR, and the MFRR thus becomes:

$$P_{\text{DSRR}} > P_{\text{MF-DSRR}} > P_{\text{MFRR}} \tag{5.10}$$
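As a worked example of (5.9), assume that the packet length is uniformly distributed over the integers 1 to $L_{\max}$, which is what $P(L = 1) = 1/L_{\max}$ implies, and take a mean packet length of 13 cells (an illustrative value):

```latex
\begin{align*}
\bar{L} &= E(L) = \frac{1 + L_{\max}}{2} = 13
  \;\Longrightarrow\; L_{\max} = 25, \\
P_{\mathrm{DSRR}} &= 1 - \frac{1}{L_{\max}} = 1 - \frac{1}{25} = 0.96, \\
P_{\mathrm{MFRR}} &= 0 .
\end{align*}
```

Under these assumptions, almost every packet dispatched by the DSRR has its cells spread over more than one CM, whereas the MFRR keeps each packet on a single CM.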
5.5.3 Time Complexity of MF-DSRR and MFRR
In the DSRR, each input port moves its connection to the next interstage
link after each cell time. No complex algorithm is involved in
establishing the new connection pattern. Thus, the time complexity of
establishing a new connection pattern in the DSRR is O(1). Since the
MF-DSRR leverages the DSRR, the time complexity of establishing a new
connection pattern in the MF-DSRR is also O(1).

The MFRR eliminates the in-packet OOS by using distributed and
independent connection management for each input port, and achieves low
complexity by introducing the AvailableList. Without the AvailableList,
each input port has to check the interstage links one by one until an
idle one is found. In the worst case, an input port will look through
$(m - 1)$ interstage links, resulting in a time complexity of O(m). With
the AvailableList, an input port merely pops an element from the top of
the list, and the time complexity of establishing a new connection for
an input port is reduced to O(1).
5.5.4 Advantages and Limitation of the MFRR
1) No Contention on Interstage Links
If each input port locally maintains a round-robin pointer without using
the AvailableList, contention on interstage links may occur. The case of
multiple inputs choosing the same interstage link has to be resolved.
This can delay the establishment of the new connection and degrade the
throughput of the IM. Since the IM is bufferless, cell loss can occur if
throughput is degraded.

Using the AvailableList, the MFRR guarantees that each input port can
always connect to an idle interstage link every time the input port
detects a fan-out vector change. No computation with high complexity is
required for the connection establishment, and no cell loss occurs in
the IM.
2) Fairness to CMs
If only local round-robin pointers are used for each input port,
unfairness may occur. For an input port, the next available interstage
link indicated by its round-robin pointer, e.g. $IL_{i,k}$, can be one
that was released by another input port in the previous cell time. In
this case, the interstage link $IL_{i,k}$ will be busy for two
consecutive multicast flows. In the worst case, $IL_{i,k}$ can be busy
for $n$ multicast flows, causing a sudden cell increase in $QC_{i,k}$
and starvation in other queues.

Using the AvailableList, the MFRR can provide fairness among the
interstage links. After releasing an interstage link, the IM waits for
$(m - n)$ fan-out vector changes before selecting the link again. This
results in a fair distribution of different multicast flows to the CMs,
and no starvation occurs.
3) Memory Access Speed Requirement
The memory access speed of the AvailableList must be high enough to
handle $n$ memory accesses (each including a read and a write) within
one cell time. If $n = 1$, the Clos-network needs $N$ IMs, which is
impractical from a scalability perspective. As $n$ becomes larger, the
number of IMs decreases but the required access speed of the
AvailableList increases, which may lead to implementation challenges if
$n$ becomes too large.
5.5.5 Simulation Results
The simulation is carried out in the OPNET Modeler [56]. The Static and
the DSRR schemes are used as references in the performance comparison.
The Static scheme is simply a stationary configuration of the internal
connections of the IMs, which remains the same during the entire
simulation.

Admissible traffic with $\bar{F} = 8$ and $\bar{L} = 13$ is generated
independently at each input port of the simulated $C(4, 7, 4)$ IQ-SMM
Clos-network switch. The multicast scheduling algorithms used in the CMs
and the OMs, with the sync mechanism that reduces multiple transmissions
of cells, are described in [7,10], as well as in Chapter 4. A multicast
cell may be served several times before it is removed from the queue.
Unnecessary multiple transmissions can increase cell delays in a
multicast switch. The sync mechanism aims to reduce the number of
transmissions per multicast cell while maintaining the output port
utilization. In order to reduce the Head-Of-Line (HOL) blocking problem,
the IQ-SEs in both the central and output stages can look ahead into the
queues for cells that can be served to idle outputs. Since it is
impractical to search to an infinite depth in the queues, a maximum
look-ahead depth, LA, is defined. If the LA is reached and the IQ-SE
still has idle outputs, it stops searching and completes the scheduling
process.

Figure 5.11: Percentage of inter-packet OOS cells, LA = 0. (Inter-packet
OOS cells in percent versus offered load for the Static, DSRR, MFRR, and
MF-DSRR schemes, each with and without sync.)
Figure 5.11 compares the inter-packet OOS under different cell
dispatching schemes with LA = 0. The inter-packet OOS cells are counted
in the OPP module shown in Figure 5.5. Under high offered load, more
specifically when the load is beyond 0.77, the DSRR schemes (with and
without sync) outperform the others except for the Static schemes. This
is because the DSRR distributes cells evenly to the input queues of the
CMs, so that cells belonging to one packet are placed ahead of cells
generated by another packet more often than under the MF-DSRR or the
MFRR. It can also be observed that, with the sync mechanism, both the
MFRR and the MF-DSRR reduce the inter-packet OOS. This is because the
sync mechanism aims to reduce the number of cell transmissions without
decreasing the output utilization of the switching module, which can
reduce the number of inter-packet OOS cells.
Figure 5.12 compares the in-packet OOS under different cell dispatching
schemes with LA = 0. The in-packet OOS cells are counted in the OPP
module shown in Figure 5.5. Apart from the Static schemes, the MFRR
schemes outperform the others, with zero in-packet OOS cells. In both
MF-DSRR schemes, less than 10% of the total received cells are OOS cells
under high offered load. The DSRR schemes result in serious in-packet
OOS problems. With the sync mechanism, the DSRR causes more in-packet
OOS cells. This is because, in the DSRR, same-packet cells are treated
independently and are distributed to different CMs, resulting in
well-distributed fan-out vectors in the input queues of the CMs and the
OMs. The same-packet cells therefore have a higher probability of being
disordered by the round-robin working mechanism of the sync. A decrease
of the in-packet OOS for the DSRR and the MF-DSRR schemes can be
observed under loads higher than 0.7. This is because the CM queue
length begins to increase non-linearly, resulting in more inter-packet
OOS cells. Since the OPP modules count inter-packet and in-packet OOS
separately, the percentage of in-packet OOS decreases.

Figure 5.12: Percentage of in-packet OOS cells, LA = 0. (In-packet OOS
cells in percent versus offered load for the Static, DSRR, MFRR, and
MF-DSRR schemes, each with and without sync.)
Figure 5.13 shows the total number of OOS cells, i.e. the sum of
inter-packet and in-packet OOS cells, under different schemes with
LA = 0. The MFRR (with sync) significantly reduces the OOS problem under
high offered loads, while the DSRR schemes cause nearly linear growth of
the OOS cells as the offered load increases.

Figure 5.13: Percentage of the total number of OOS cells, LA = 0. (Total
OOS cells in percent versus offered load for the Static, DSRR, MFRR, and
MF-DSRR schemes, each with and without sync.)
Figure 5.14 depicts the average reassembly delay (LA = 0) for each
packet in cell times. The reassembly buffers are assumed to be located
in the OPP module shown in Figure 5.5. Under heavy loads, the DSRR
schemes result in a reassembly delay of 10 cell times, which is about
77% of the mean packet transmission time. The MFRR (with sync) reduces
the delay to approximately 3.5 cell times under heavy loads.
Figure 5.15 shows the average reassembly buffer size (LA = 0). The
average buffer size of the Static schemes levels off under high offered
load because the buffers at the OMs become unstable and the throughput
decreases. The DSRR schemes require larger reassembly buffers at the
OPPs under offered loads larger than 0.5. The MFRR schemes are able to
reduce the average reassembly buffer size.
Besides the average reassembly buffer size, the maximum buffer size is
also worth examining, since it can be used as a benchmark in designing
the reassembly buffer. Figure 5.16 compares the maximum reassembly
buffer size. The DSRR and the MF-DSRR schemes demonstrate higher maximum
buffer sizes, and the MFRR schemes are able to reduce the maximum buffer
size.

Figure 5.14: Average reassembly delay per packet, LA = 0. (Average
reassembly delay per packet in cell times versus offered load.)

Figure 5.15: Average reassembly buffer size, LA = 0. (Average reassembly
buffer size in cells versus offered load.)

Figure 5.16: Maximum reassembly buffer size, LA = 0. (Maximum reassembly
buffer size in cells versus offered load.)
Figure 5.17, Figure 5.18 and Figure 5.19 compare the inter-packet OOS,
in-packet OOS and total OOS cells, respectively, under varying LA values
with the sync mechanism. With larger LA values, a decrease in
inter-packet OOS is observed for each cell dispatching scheme in
Figure 5.17. The DSRR with LA = 2 outperforms the other two schemes. In
Figure 5.18, the look-ahead mechanism greatly reduces the in-packet OOS
for the DSRR scheme, but the DSRR with LA = 2 still suffers from
approximately 50% of all the received cells being in-packet OOS under
high load. The MFRR always maintains zero in-packet OOS under different
LA values. In terms of the total OOS cells, the MFRR with LA = 2
outperforms the others, as shown in Figure 5.19. This is because, with
the capability of looking ahead into the queues for blocked cells, the
IQ-SEs are able to send some of the delayed cells that cause the OOS
problem.
Figure 5.20 depicts the average reassembly delay per packet under
different LA values with the sync mechanism. The look-ahead mechanism
reduces the reassembly delay, which corresponds to the reduction of
total OOS cells. The MFRR with LA = 2 outperforms the others.

Figure 5.17: Percentage of inter-packet OOS cells, LA = 0, 1, 2.
(Inter-packet OOS cells in percent versus offered load for the DSRR,
MFRR, and MF-DSRR schemes with sync.)

Figure 5.18: Percentage of in-packet OOS cells, LA = 0, 1, 2. (In-packet
OOS cells in percent versus offered load for the DSRR, MFRR, and MF-DSRR
schemes with sync.)

Figure 5.19: Percentage of the total number of OOS cells, LA = 0, 1, 2.
(Total OOS cells in percent versus offered load for the DSRR, MFRR, and
MF-DSRR schemes with sync.)
Although the Static schemes have no OOS and low reassembly buffer sizes,
they have the highest cell delays, as shown in Figure 5.21, because
fewer CMs are used than in the other schemes. The Static schemes become
unstable beyond an offered load of 0.7, resulting in a decrease of the
throughput, which explains the convergence in Figure 5.15. The DSRR
schemes outperform the others in terms of average cell delay because
cells are well distributed among the CMs. The MFRR schemes perform
better than the MF-DSRR schemes under both sync options.

Figure 5.22 further compares the average cell delays of the DSRR, the
MFRR and the MF-DSRR under different LA values with the sync mechanism.
As discussed previously, the look-ahead mechanism applied in both CMs
and OMs reduces the cell delay. The DSRR with LA = 2 has the lowest cell
delay because it distributes cells evenly to the CMs.
Figure 5.20: Average reassembly delay per packet, LA = 0, 1, 2. (Average
reassembly delay per packet in cell times versus offered load for the
DSRR, MFRR, and MF-DSRR schemes with sync.)

Figure 5.21: Average cell delay, LA = 0. (Average cell delay in cell
times versus offered load for the Static, DSRR, MFRR, and MF-DSRR
schemes, each with and without sync.)

Figure 5.22: Average cell delay, LA = 0, 1, 2. (Average cell delay in
cell times versus offered load for the DSRR, MFRR, and MF-DSRR schemes
with sync.)
5.6 Summary
In this chapter, two OOS preventative cell dispatching algorithms are
proposed for the multicast IQ-SMM Clos-network architecture, i.e. the
Multicast Flow-based DSRR (MF-DSRR) and the Multicast Flow-based
Round-Robin (MFRR). Table 5.2 presents a summarized comparison of
different Clos-network architectures.
Architecture   Complex     Memory    OOS   Examples
               algorithm   speedup
S3             yes         no        no    Distro [80]
MSM            yes         yes       no    MWMD [81], CRRD/CMSD [82]
MMM            no          yes       yes
OQ-SMM         no          yes       yes   DSRR [77]
IQ-SMM         no          no        yes   MF-DSRR, MFRR
Table 5.2: A summarized comparison of different Clos-network architectures.
The MF-DSRR utilizes the connection modification pattern of the DSRR and
obtains a low implementation complexity. The MF-DSRR can alleviate the
OOS problem but still suffers from the in-packet OOS problem. Using more
resources, i.e. the AvailableList, the MFRR is able to eliminate the
in-packet OOS problem and thus significantly reduces the reassembly
buffer size and delay. By using the AvailableList to store the
information about the idle interstage links, the MFRR achieves a low
complexity for the internal connection setup of the IMs.

Simulation results show that the MFRR cell dispatching outperforms the
DSRR and the MF-DSRR in terms of reducing the OOS problem, the
reassembly buffer size, and the reassembly delay. The sync mechanism is
able to improve the performance of the MFRR and the MF-DSRR but worsens
the performance of the DSRR in terms of the in-packet OOS. With the
look-ahead mechanisms applied in both CMs and OMs, the IQ-SMM
architecture can further reduce the OOS problem and the cell delay.

Chapter 6
Conclusion
Since the boom of high-bandwidth applications, such as Internet Protocol
Television (IPTV), telecommunication service providers have experienced
a continuous increase in bandwidth requirements. In both wireless and
wired communication networks, the access speed experienced by customers
has increased by a factor of more than 100 in the past decades. This
trend has started the era of 100 Gigabit Ethernet in the next generation
transport network.

In this dissertation, traffic management for the next generation
transport network is investigated at three different network scales. On
the packet scheduling level, the topology-based hierarchical scheduling
algorithm is proposed in Chapter 3. The proposed scheme is based on the
assumption that information about the network topology can be acquired
by the scheduling system of the edge node. Token schedulers can be
arranged by the scheduling system to map the acquired topology, in order
to schedule the incoming traffic on behalf of the switches in the
network, which lack advanced traffic management abilities. This
intelligent switch is usually placed at the edge of the IPTV
distribution network, so that the operator can leverage the
already-built infrastructure to provide Quality-of-Service (QoS)
guaranteed services. In network simulations, the topology-based
hierarchical scheduling scheme demonstrates strong flow isolation and
effective traffic management, compared with schemes that require full
network updates.
On the cell scheduling level, where the attention is mainly focused
inside the switch, a novel Multi-Level Round-Robin Multicast Scheduling
(MLRRMS) algorithm is proposed for the input-queuing architecture in
Chapter 4. In the context of high capacity transport networks,
scalability and scheduling complexity become extremely important issues.
Thus, the Input Queuing (IQ) architecture is selected due to its high
scalability. The proposed MLRRMS aims to overcome the Head-Of-Line (HOL)
blocking problem of the IQ architecture and to boost the throughput of
the switch. The sync mechanism is proposed to reduce the unnecessary
multiple transmissions of a multicast cell, and the Look-Ahead (LA)
process is used to reduce the HOL blocking problem. Analysis and
simulation results show that, with limited complexity, the switch is
able to achieve high scalability and significant improvements in terms
of multicast delay and throughput, compared to other existing multicast
scheduling algorithms.
As the focus moves into the switch fabric, the three-stage Clos-network
is investigated in Chapter 5. One of the challenges of multicast in the
Clos-network is the prevention of Out-Of-Sequence (OOS) cells. Much of
the literature considers cells to be independent; however, this is not
the case most of the time. One packet usually generates more than one
cell as it arrives at the switch fabric, where cell switching is mostly
used to achieve high throughput. Therefore two OOS preventative cell
dispatching schemes are proposed in Chapter 5 for the IQ
Space-Memory-Memory (SMM) Clos-network architecture, i.e. Multicast
Flow-based DSRR (MF-DSRR) and Multicast Flow-based Round-Robin (MFRR).
Analysis and simulation results demonstrate that both proposed schemes
can reduce the OOS problem, resulting in a decrease of the reassembly
delay and buffer size for the switch fabric.
The work in this dissertation provides guidance and a reference for
future research in traffic management for the next generation transport
network. Firstly, IPTV traffic management with the topology-based
hierarchical scheduling scheme can be further investigated. How to
integrate the transport function with the control plane, so that the
scheduling system adapts to different network topologies and bandwidth
allocations, can be a research direction.
Secondly, multicast for switches is still open for discussion and
research. Hardware implementation of the simulated scheduling algorithms
will be an interesting topic, including performance evaluation,
complexity analysis, and experiments. As the link speed increases to
100 Gbit/s, the packet processing time will become extremely short,
which makes hardware implementation of the advanced traffic scheduling
system challenging.
Last but not least, multicast inside the switch fabric, especially for
the multi-stage switch fabric, needs further investigation. The cell
dispatching schemes applied in the Clos-network should consider route
selection, in addition to OOS prevention, by means of back pressure or a
control mechanism, to ensure low cell delays and high throughput.
Convergence of different switching technologies in the multistage
switching network, such as time switching and space switching, is also
an interesting area worth attention.

Bibliography
[1] H. Yu, Y. Yan, and M. S. Berger, "IPTV traffic management in Carrier
Ethernet transport networks," in OPNETWORK 2008, 2008.
[2] H. Yu, Y. Yan, and M. S. Berger, "IPTV traffic management using
topology-based hierarchical scheduling in Carrier Ethernet transport
networks," in International Conference on Communications and Networking
in China (ChinaCom), pp. 1–5, 2009.
[3] H. Yu, Y. Yan, and M. S. Berger, "Topology-based hierarchical
scheduling using deficit round robin: Flow protection and isolation for
triple play service," in First International Conference on Future
Information Networks, pp. 269–274, 2009.
[4] A. Rasmussen, J. Zhang, H. Yu, R. Fu, S. Ruepp, H. Wessing, and
M. S. Berger, "Towards 100 gigabit Carrier Ethernet transport networks,"
WSEAS Transactions on Communications, vol. 9, pp. 153–164, 2010.
[5] H. Wessing, M. S. Berger, H. Yu, A. Rasmussen, L. Brewka, and
S. Ruepp, "Evaluation of network failure induced IPTV degradation in
metro networks," Recent Advances in Circuits, Systems, Signal and
Telecommunications, pp. 135–139, 2010.
[6] H. Wessing, M. S. Berger, H. M. Gestssson, H. Yu, A. Rasmussen,
L. Brewka, and S. Ruepp, "Evaluation of restoration mechanisms for
future services using Carrier Ethernet," WSEAS Transactions on
Communications, vol. 9, pp. 322–331, 2010.
[7] H. Yu, S. Ruepp, and M. S. Berger, "A novel round-robin based
multicast scheduling algorithm for 100 gigabit ethernet switches," in
29th IEEE International Conference on Computer Communications (INFOCOM)
Workshops, pp. 1–2, 2010.
[8] H. Yu, S. Ruepp, and M. S. Berger, "Round-robin based multicast
scheduling algorithm for input-queued high-speed Ethernet switches," in
OPNETWORK 2010, 2010.
[9] H. Yu, S. Ruepp, and M. S. Berger, "Enhanced FIFO based round-robin
multicast scheduling algorithm for input-queued switches," IET
Communications, vol. 5, pp. 1163–1171, 2011.
[10] H. Yu, S. Ruepp, and M. S. Berger, "Multi-level round-robin
multicast scheduling with look-ahead mechanism," in IEEE International
Conference on Communications, 2011.
[11] H. Yu, S. Ruepp, and M. S. Berger, "Out-of-sequence prevention for
multicast input-queuing space-memory-memory Clos-network," IEEE
Communications Letters, 2011.
[12] H. Yu, S. Ruepp, and M. S. Berger, "Out-of-sequence preventative
cell dispatching for multicast input-queued space-memory-memory
Clos-network," in 12th IEEE International Conference on High Performance
Switching and Routing, 2011.
[13] Y. Yan, H. Yu, and L. Dittmann, "Wireless channel condition aware
scheduling algorithm for hybrid optical/wireless networks," in 3rd
International Conference on Access Networks, pp. 397–409, 2008.
[14] Y. Yan, H. Yu, H. Wang, and L. Dittmann, "Integration of EPON and
WiMAX networks: Uplink scheduler design," in SPIE Symposium on Asia
Pacific Optical Communications, 2008.
[15] Y. Yan, H. Yu, H. Wessing, and L. Dittmann, "Integrated resource
management for hybrid optical wireless (HOW) networks," in International
Conference on Communications and Networking in China (ChinaCom),
pp. 1–5, 2009.
[16] Y. Yan, H. Yu, H. Wessing, and L. Dittmann, "Enhanced signaling
scheme with admission control in the hybrid optical wireless (HOW)
networks," in 28th IEEE International Conference on Computer
Communications (INFOCOM) Workshops, pp. 1–6, 2009.
BIBLIOGRAPHY 121
[17] Y. Yan, H. Yu, H. Wessing, and L. Dittmann, \Integrated resource
management framework in hybrid optical wireless networks," IET
Optoelectronics Special Issue on Next Generation Optical Access,
vol. 4, pp. 267{279, 2010.
[18] Metro Ethernet Forum, http://metroethernetforum.org/, 2011.
[19] L. Fang, R. Zhang, and M. Taylor, "The evolution of Carrier
Ethernet services - requirements and deployment case studies," IEEE
Communications Magazine, vol. 46, pp. 69–76, 2008.
[20] J. Mocerino, "Carrier class Ethernet service delivery migrating
SONET to IP & triple play offerings," in 2006 Optical Fiber
Communication Conference and National Fiber Optic Engineers
Conference, pp. 396–401, 2006.
[21] IEEE Standard, 802.1Qay-2009 - IEEE Standard for Local and
Metropolitan Area Networks - Virtual Bridged Local Area Networks
Amendment 10: Provider Backbone Bridge Traffic Engineering,
2009.
[22] Internet Engineering Task Force (IETF), RFC 5921: A Framework
for MPLS in Transport Networks.
[23] D. Fedyk and D. Allan, "Ethernet data plane evolution for provider
networks [next-generation Carrier Ethernet transport technologies],"
IEEE Communications Magazine, vol. 46, pp. 84–89, 2008.
[24] A. Reid, P. Willis, I. Hawkins, and C. Bilton, "Carrier Ethernet,"
IEEE Communications Magazine, vol. 46, pp. 96–103, 2008.
[25] M. Huynh and P. Mohapatra, "Metropolitan Ethernet network: A
move from LAN to MAN," Computer Networks, vol. 51,
pp. 4867–4894, 2007.
[26] S. Salam and A. Sajassi, "Provider Backbone Bridging and MPLS:
Complementary technologies for Next Generation Carrier Ethernet
transport," IEEE Communications Magazine, vol. 46, pp. 77–83,
2008.
[27] S. Vedantham, S. H. Kim, and D. Kataria, "Carrier-grade Ethernet
challenges for IPTV deployment," IEEE Communications Magazine,
vol. 44, pp. 24–31, 2006.
[28] R. Fu, Y. Wang, and M. S. Berger, "Carrier Ethernet network
control plane based on the Next Generation Network," in First ITU-T
Kaleidoscope Academic Conference, pp. 293–298, 2008.
[29] M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri,
"Packet scheduling in input-queued cell-based switches," in
Twentieth Annual Joint Conference on the IEEE Computer and
Communications Societies, 2001.
[30] High Capacity Carrier Ethernet Transport Networks, 2010.
[31] K. H. Lee, S. T. Trong, B. G. Lee, and Y. T. Kim, "QoS-guaranteed
IPTV service provisioning in IEEE 802.11e WLAN-based home
network," in 2008 IEEE Network Operations and Management
Symposium Workshops, pp. 71–76, 2008.
[32] D. Qiu, "On the QoS of IPTV and its effects on home networks," in
5th IEEE Consumer Communications and Networking Conference,
pp. 834–838, 2008.
[33] M. Shreedhar and G. Varghese, "Efficient fair queuing using deficit
round-robin," IEEE/ACM Transactions on Networking, vol. 4,
pp. 375–385, 1996.
[34] International Telecommunication Union (ITU), G.803 Architecture
of transport networks based on the synchronous digital hierarchy
(SDH), 1997.
[35] C. Wu, H. Wu, and W. Lin, "Delivering relative differentiated
services in future high-speed networks using hierarchical dynamic
deficit round robin," Multimedia Systems, vol. 13, pp. 205–221,
2007.
[36] D. Back, K. Pyun, S. Lee, J. Cho, and N. Kim, "A hierarchical
deficit round-robin scheduling algorithm for a high level of fair
service," in 2007 International Symposium on Information Technology
Convergence, pp. 115–119, 2007.
[37] S. Jiwasurat, G. Kesidis, and D. Miller, "Hierarchical shaped deficit
round-robin scheduling," in IEEE Global Telecommunications
Conference, pp. 689–693, 2005.
[38] M. Yang, J. Wang, E. Lu, and S. Q. Zheng, "Hierarchical scheduling
for DiffServ classes," in IEEE Global Telecommunication Conference,
pp. 707–712, 2004.
[39] A. K. Parekh and R. G. Gallager, "A generalized processor
sharing approach to flow control in integrated services networks: the
single-node case," IEEE/ACM Transactions on Networking, vol. 1,
pp. 344–357, 1993.
[40] A. K. Parekh and R. G. Gallager, "A generalized processor
sharing approach to flow control in integrated services networks: the
multiple-node case," IEEE/ACM Transactions on Networking,
vol. 2, pp. 137–150, 1994.
[41] M. B. Mamoun, J. Fourneau, and N. Pekergin, "Analyzing weighted
round robin policies with a stochastic comparison approach,"
Computers and Operations Research, vol. 35, pp. 2420–2431, 2007.
[42] J. C. R. Bennett and H. Zhang, "WF2Q: worst-case fair weighted
fair queueing," in Fifteenth Annual Joint Conference of the IEEE
Computer Societies (INFOCOM'96), pp. 120–128, 1996.
[43] S. J. Golestani, "A self-clocked fair queueing scheme for
broadband applications," in 13th International Conference on Computer
Communications (INFOCOM '94), pp. 636–646, 1994.
[44] P. Goyal, H. M. Vin, and C. Haichen, "Start-time fair queueing: a
scheduling algorithm for integrated services packet switching
networks," IEEE/ACM Transactions on Networking, vol. 5,
pp. 690–704, 1997.
[45] S. S. Kanhere, H. Sethu, and A. B. Parekh, "Fair and efficient
packet scheduling using elastic round robin," IEEE Transactions
on Parallel and Distributed Systems, vol. 13, pp. 324–336, 2002.
[46] S. S. Kanhere and H. Sethu, "Fair, efficient and low-latency packet
scheduling using nested deficit round robin," in Workshop on High
Performance Switching and Routing, pp. 6–10, 2011.
[47] D. Saha, S. Mukherjee, and S. Tripathi, "Carry-over round robin: a
simple cell scheduling mechanism for ATM networks," IEEE/ACM
Transactions on Networking, vol. 6, pp. 779–796, 1998.
[48] T. Al-Khasib, H. Alnuweiri, H. Fattah, and V. C. V. Leung, "Fair
and efficient frame-based scheduling algorithm for multimedia
networks," in 10th IEEE Symposium on Computers and
Communications, pp. 597–603, 2005.
[49] C. Guo, "SRR: An O(1) time-complexity packet scheduler for flows
in multiservice packet networks," IEEE/ACM Transactions on
Networking, vol. 12, pp. 1144–1155, 2004.
[50] C. Guo, "G-3: An O(1) time complexity packet scheduler that
provides bounded end-to-end delay," in 2007 IEEE INFOCOM,
pp. 1109–1117, 2007.
[51] C. Guo, "Improved smoothed round robin schedulers for high-speed
packet networks," in 2008 IEEE INFOCOM, pp. 906–914, 2008.
[52] S. Jiwasurat and G. Kesidis, "A class of Shaped Round-Robin
(SDRR) schedulers," Telecommunications Systems, vol. 25,
pp. 173–191, 2004.
[53] A. Varma and D. Stiliadis, "Hardware implementation of fair
queuing algorithms for asynchronous transfer mode networks," IEEE
Communications Magazine, vol. 35, pp. 54–68, 1997.
[54] X. Luo, Y. Jin, Q. Zeng, W. Sun, W. Guo, and W. Hu, "On the
stability of multicast flow aggregation in IP over optical network for
IPTV delivery," Chinese Optics Letters, vol. 6, pp. 553–557, 2008.
[55] Y. J. Won, M. Choi, B. Park, J. W. Hong, H. Lee, C. Hwang, and
J. Yoo, "End-user IPTV traffic measurement of residential
broadband access networks," in Network Operations and Management
Symposium Workshops 2008, pp. 95–100, 2008.
[56] OPNET Modeler 16.0, http://www.opnet.com/, 2011.
[57] G. A. F. M. Khalaf and S. S. K. El-Yamany, "Statistical
multiplexing gain: direct estimation and its application to admission
control in ATM networks," in 18th National Radio Science Conference,
pp. 483–496, 2001.
[58] J. Huang, C. W. Tan, M. Chiang, and R. Cendrillon, "Statistical
multiplexing over DSL networks," in 26th IEEE International
Conference on Computer Communications, pp. 571–579, 2007.
[59] M. Karol, M. Hluchyj, and S. Morgan, "Input versus output
queueing on a space-division packet switch," IEEE Transactions on
Communications, vol. 35, pp. 1347–1356, 1987.
[60] D. Pan and Y. Yang, "FIFO-based multicast scheduling algorithm
for virtual output queued packet switches," IEEE Transactions on
Computers, vol. 54, pp. 1283–1297, 2005.
[61] D. Pan and Y. Yang, "Bandwidth guaranteed multicast scheduling
for virtual output queued packet switches," Journal of Parallel and
Distributed Computing, vol. 69, pp. 939–949, 2009.
[62] B. Prabhakar, N. McKeown, and R. Ahuja, "Multicast scheduling
for input-queued switches," IEEE Journal on Selected Areas in
Communications, vol. 15, pp. 855–866, 1997.
[63] A. Bianco, P. Giaccone, C. Piglione, and S. Sessa, "Practical
algorithms for multicast support in input queued switches," in IEEE
International Conference on High Performance Switching and
Routing, pp. 187–192, 2006.
[64] A. Mekkittikul and N. McKeown, "A practical scheduling algorithm
to achieve 100% throughput in input-queued switches," in 17th
Annual Joint Conference of the IEEE Computer and Communications
Societies, pp. 792–799, 1998.
[65] H. J. Chao, "Next generation routers," Proceedings of the IEEE,
vol. 90, pp. 1518–1558, 2002.
[66] S. Gupta and A. Aziz, "Multicast scheduling for switches with
multiple input-queues," in 10th Symposium on High Performance
Interconnects, pp. 28–33, 2002.
[67] M. Shoaib, "Selectively weighted multicast scheduling designs for
input-queued switches," in 2007 IEEE International Symposium on
Signal Processing and Information Technology, pp. 92–97, 2007.
[68] L. Mhamdi and S. Vassiliadis, "Integrating uni- and multicast
scheduling in buffered crossbar switches," in IEEE International
Conference on High Performance Switching and Routing, 2006.
[69] Cisco Product Overview, http://www.cisco.com, Cisco 12000
Gigabit Switch Router, March 2011.
[70] N. McKeown, M. Izzard, B. E. A. Mekkittikul, and M. Horowitz,
"The tiny tera: a packet switch core," IEEE Micro Magazine,
vol. 17, pp. 27–40, 1997.
[71] H. Duan, J. W. Lockwood, S. M. Kang, and J. D. Will, "A high
performance OC-12/OC-48 queue design prototype for input buffered
ATM switches," in Sixteenth Annual Joint Conference on the IEEE
Computer and Communications Societies, vol. 1.
[72] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High-speed
switch scheduling for local-area networks," IEEE/ACM
Transactions on Networking, vol. 11, pp. 319–352, 1993.
[73] Y. Tamir and G. Frazier, "High performance multi-queue buffers for
VLSI communication switches," in 15th Annual International
Symposium on Computer Architecture, pp. 343–354, 1988.
[74] N. McKeown, "The iSLIP scheduling algorithm for input-queued
switches," IEEE/ACM Transactions on Networking, vol. 7,
pp. 188–201, 1999.
[75] M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri,
"Multicast traffic in input-queued switches: optimal scheduling and
maximum throughput," IEEE/ACM Transactions on Networking,
vol. 11, pp. 465–477, 2003.
[76] J. Hayes, R. Breault, and M. Mehmet-Ali, "Performance
analysis of a multicast switch," IEEE Transactions on Communications,
vol. 39, pp. 581–587, 1991.
[77] X. Li, Z. Zhou, and M. Hamdi, "Space-memory-memory
architecture for Clos-network packet switches," in IEEE International
Conference on Communications, pp. 1031–1035, 2005.
[78] S. Sun, S. He, Y. Zheng, and W. Gao, "Multicast scheduling in
buffered crossbar switches with multiple input queues," in IEEE
International Conference on High Performance Switching and
Routing, pp. 73–77, 2005.
[79] F. Abel, C. Minkenberg, I. Iliadis, T. Engbersen, M. Gusat,
F. Gramsamer, and R. P. Luijten, "Design issues in next-generation
merchant switch fabrics," IEEE/ACM Transactions on Networking,
vol. 15, pp. 1603–1615, 2007.
[80] K. Pun and M. Hamdi, "Distro: A distributed static round-robin
scheduling algorithm for bufferless Clos-network switches," in IEEE
Global Communications Conference, pp. 2298–2302, 2002.
[81] R. Rojas-Cessa, E. Oki, and J. Chao, "Maximum weight
matching dispatching scheme in buffered Clos-network packet switches,"
in IEEE International Conference on Communications,
pp. 1075–1079, 2004.
[82] E. Oki, Z. Jing, R. Rojas-Cessa, and J. Chao, "Concurrent
round-robin-based dispatching schemes for Clos-network switches,"
IEEE/ACM Transactions on Networking, vol. 10, pp. 830–844,
2002.
[83] Y. Yang and G. M. Masson, "The necessary conditions for Clos-type
nonblocking multicast networks," IEEE Transactions on
Computers, vol. 48, pp. 1214–1227, 1999.
[84] Y. Yang and J. Wang, "On blocking probability of multicast
networks," IEEE Transactions on Communications, vol. 46,
pp. 957–968, 1998.
List of Acronyms
CAC Call Admission Control
CD Cell Dispatching
CM Central Module
CMF Credit based Multicast Fair
CMSD Concurrent Master-Slave round-robin Dispatching
CORR Carry-Over Round Robin
CRRD Concurrent Round-Robin Dispatching
DC Decit Counter
DRR Decit Round Robin
DSLAM Digital Subscriber Line Access Multiplexer
DSRR Desynchronized Static Round Robin
ERR Elastic Round Robin
FIFO First-In-First-Out
FIFOMS FIFO-based Multicast Scheduling
GMSS Greedy Min-Split Scheduling
GPS Generalized Processor Sharing
HD High Definition
HOL Head-Of-Line
IEEE Institute of Electrical and Electronics Engineers
IETF Internet Engineering Task Force
IM Input Module
IP Internet Protocol
IPP Input Port Processor
IPTV Internet Protocol Television
IQ Input Queuing
ITU-T International Telecommunication Union-Telecommunication Standardization Sector
L1 Layer 1
L2 Layer 2
L3 Layer 3
LA Look-Ahead
LAN Local Area Network
MAC Media Access Control
MAN Metropolitan Area Network
MC-VOQ MultiCast Virtual Output Queuing
MEF Metro Ethernet Forum
MF-DSRR Multicast Flow-based DSRR
MFRR Multicast Flow-based Round-Robin
MLRRMS Multi-Level Round-Robin Multicast Scheduling
MMM Memory-Memory-Memory
MPLS Multi-protocol Label Switching
MPLS-TP Multi-protocol Label Switching Transport Profile
MRR Mini Round Robin
MSM Memory-Space-Memory
MWMD Maximum Weight Matching Dispatching
NGN Next Generation Network
OAM Operation, Administration and Maintenance
OM Output Module
OOS Out-Of-Sequence
OPP Output Port Processor
OQ Output Queuing
PBB-TE Provider Backbone Bridge with Traffic Engineering
PSTN Public Switched Telephone Network
PW Packet Weight
QoS Quality-of-Service
SCFQ Self-Clocked Fair Queuing
SDH Synchronous Digital Hierarchy
SE Switching Element
SFQ Start-time Fair Queuing
SMG Statistical Multiplexing Gain
SMM Space-Memory-Memory
SONET Synchronous Optical Networking
SRRD Static Round-Robin Dispatching
S3 Space-Space-Space
STB Set Top Box
T-MPLS Transport MPLS
VLAN Virtual Local Area Network
VoD Video-on-Demand
VoIP Voice-over-IP
VOQ Virtual Output Queuing
WBA Weight-Based Algorithm
WFQ Weighted Fair Queuing
WF2Q Worst-case Fair Weighted Fair Queuing
