Scheduling in Packet Switches with Relaxed Constraint

A thesis submitted in fulfilment of the
requirements for the award of the degree
Master of Engineering by Research
from
The University of Wollongong
by
Lixiang Xiong
B.Eng, Chongqing University, China, 1997
M.Eng Studies with Distinction, University of Wollongong, Australia, 2002

School of Electrical, Computer and Telecommunications Engineering
2004

Abstract
In this thesis, I present a series of new scheduling algorithms for an ATM-like crossbar input-queued switching fabric of an IP router. These new scheduling algorithms
are developed based on three popular existing scheduling algorithms: Parallel Iterative Matching, Round Robin Matching, and Iterative Round Robin Matching with
SLIP. The basic idea of our research is to divide all outputs of the IP router into a
few groups. All outputs in the same output group are multiplexed into a high-speed
link, so cells (traffic) can be directed to a group of outputs instead of an individual output. The performance of our new scheduling algorithms is measured by simulation. The simulation results indicate that our new scheduling algorithms can achieve
an excellent throughput while consuming much less computing time than the existing
scheduling algorithms.


Statement of Originality
I hereby declare that this thesis, submitted in fulfillment of the requirement for the
award of Master of Engineering by Research, in the school of Electrical, Computer
and Telecommunication Engineering, University of Wollongong, is my own work
unless otherwise referenced or acknowledged. The document has not been accepted
for the award of any other qualification at any educational institution.

Signed

Lixiang Xiong
18 November 2004


Acknowledgments
I would like to express my thanks to my supervisor Dr Don Platt for his kind advice,
support and assistance throughout this study.
I would also like to thank my co-supervisor Professor Fazrad Safaei for his helpful
advice and discussion at the early stage of our research.
Finally, I would like to thank my family in China for their support, which has been the most
important assistance I have received during my study.


Contents

1 Introduction
  1.1 Background
  1.2 Contribution Resulting from Thesis
  1.3 Dissertation Overview
  1.4 Publication Based on This Thesis

2 Scheduling Algorithms in IP Router
  2.1 Introduction
  2.2 Operation of IP Router
  2.3 Parallel Iterative Matching (PIM)
  2.4 Round Robin Matching (RRM)
  2.5 Iterative Round Robin Matching with SLIP (iSLIP-RRM)
  2.6 Other Scheduling Algorithms
  2.7 Multiple Iterations of Scheduling Algorithms
  2.8 Summary

3 Scheduling Algorithms with Relaxed Constraint
  3.1 The Basic Idea of Our Research
  3.2 Scheduling Algorithms with Relaxed Constraint Using Approach 1
    3.2.1 PIM with Relaxed Constraint Version 1 (PIMRC 1)
    3.2.2 RRM with Relaxed Constraint (RRMRC)
    3.2.3 iSLIP-RRM with Relaxed Constraint (iSLIP-RRMRC)
  3.3 Scheduling Algorithms with Relaxed Constraint Using Approach 2
    3.3.1 PIM with Relaxed Constraint Version 2 (PIMRC 2)
    3.3.2 Semi-RRM with Relaxed Constraint (Semi-RRMRC)
  3.4 Multiple Iterations of Scheduling Algorithms with Relaxed Constraint
  3.5 Summary

4 Simulation and Results
  4.1 Introduction
  4.2 Simulation Environment
    4.2.1 General Simulation Parameters
    4.2.2 2-MMBP Traffic Model
  4.3 Simulation Results and Analysis
    4.3.1 Simulation Results for the 16 × 16 Switch
    4.3.2 Simulation Results for the 32 × 32 Switch
  4.4 Conclusion

5 Summary and Comments
  5.1 Summary of Dissertation
  5.2 Future Study

Bibliography

A Publication

B Simulation Results
  B.1 Simulation Results for 16 × 16 Switch
    B.1.1 Full Simulation Result
    B.1.2 Throughput Comparison Result
    B.1.3 Average Queue Length Comparison Result
    B.1.4 Average Processing Time Comparison Result
    B.1.5 Performance Comparison Result with Different Average Burst Length in Traffic Model
  B.2 Simulation Results for 32 × 32 Switch
    B.2.1 Full Simulation Result
    B.2.2 Throughput Comparison Result
    B.2.3 Average Queue Length Comparison Result
    B.2.4 Average Processing Time Comparison Result
    B.2.5 Performance Comparison Result with Different Average Burst Length in Traffic Model

C Some Additional Examples Used in This Thesis
  C.1 An Example of One iSLIP-RRM Iteration
  C.2 An Example of How iSLIP-RRM Avoids the Synchronization of Round Robin Pointers at Outputs

D Simulation Matlab Programs

List of Figures

2.1 The basic hardware structure of IP router [1]
2.2 An example of queue structure for a 4 × 4 switch running PIM
2.3 An example of PIM iteration [2]
2.4 An example of RRM iteration
2.5 An example of synchronization of round robin pointers at outputs for RRM
3.1 The basic idea of our research
3.2 An example of two approaches for developing our new scheduling algorithms
3.3 An example of PIMRC 1
3.4 An example of RRMRC
3.5 An example of iSLIP-RRMRC
3.6 An example of the input/output choosing policy during stage 3 of PIMRC 2
3.7 An example of PIMRC 2
3.8 An example of Semi-RRMRC
4.1 The switch model used in the simulation
4.2 2-MMBP traffic model [3]
4.3 The on-off traffic model
4.4 Throughput comparison for PIM with different number of iterations (16 × 16 switch, average burst length=8 cells)
4.5 Throughput comparison for RRM with different number of iterations (16 × 16 switch, average burst length=8 cells)
4.6 Throughput comparison for iSLIP-RRM with different number of iterations (16 × 16 switch, average burst length=8 cells)
4.7 Throughput comparison for PIMRC 1 with different number of iterations (16 × 16 switch, average burst length=8 cells)
4.8 Throughput comparison for RRMRC with different number of iterations (16 × 16 switch, average burst length=8 cells)
4.9 Throughput comparison for iSLIP-RRMRC with different number of iterations (16 × 16 switch, average burst length=8 cells)
4.10 Throughput comparison for PIMRC 2 with different number of iterations (16 × 16 switch, average burst length=8 cells)
4.11 Throughput comparison for Semi-RRMRC with different number of iterations (16 × 16 switch, average burst length=8 cells)
4.12 Throughput comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=8 cells)
4.13 Average queue length comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=8 cells)
4.14 Average processing time comparison (16 × 16 switch, average burst length=8 cells)
4.15 Throughput comparison for 3-iteration PIMRC 1 with different average burst length (16 × 16 switch)
4.16 Average queue length comparison for 3-iteration PIMRC 1 with different average burst length (16 × 16 switch)
4.17 Average processing time comparison for 3-iteration PIMRC 1 with different average burst length (16 × 16 switch)
B.1 Throughput comparison for PIM with different number of iterations (16 × 16 switch)
B.2 Throughput comparison for RRM with different number of iterations (16 × 16 switch)
B.3 Throughput comparison for iSLIP-RRM with different number of iterations (16 × 16 switch)
B.4 Throughput comparison for PIMRC 1 with different number of iterations (16 × 16 switch)
B.5 Throughput comparison for RRMRC with different number of iterations (16 × 16 switch)
B.6 Throughput comparison for iSLIP-RRMRC with different number of iterations (16 × 16 switch)
B.7 Throughput comparison for PIMRC 2 with different number of iterations (16 × 16 switch)
B.8 Throughput comparison for Semi-RRMRC with different number of iterations (16 × 16 switch)
B.9 Throughput comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=8 cells)
B.10 Throughput comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=16 cells)
B.11 Throughput comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=32 cells)
B.12 Average queue length comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch)
B.13 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=8 cells)
B.14 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=16 cells)
B.15 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=32 cells)
B.16 The performance comparison for 3-iteration PIMRC 1 with different average burst length (16 × 16 switch)
B.17 The performance comparison for 2-iteration RRMRC with different average burst length (16 × 16 switch)
B.18 The performance comparison for 2-iteration iSLIP-RRMRC with different average burst length (16 × 16 switch)
B.19 The performance comparison for 4-iteration PIMRC 2 with different average burst length (16 × 16 switch)
B.20 The performance comparison for 4-iteration PIMRC 2 with different average burst length (16 × 16 switch)
B.21 Throughput comparison for PIM with different number of iterations (32 × 32 switch)
B.22 Throughput comparison for RRM with different number of iterations (32 × 32 switch)
B.23 Throughput comparison for iSLIP-RRM with different number of iterations (32 × 32 switch)
B.24 Throughput comparison for PIMRC 1 with different number of iterations (32 × 32 switch)
B.25 Throughput comparison for RRMRC with different number of iterations (32 × 32 switch)
B.26 Throughput comparison for iSLIP-RRMRC with different number of iterations (32 × 32 switch)
B.27 Throughput comparison for PIMRC 2 with different number of iterations (32 × 32 switch)
B.28 Throughput comparison for Semi-RRMRC with different number of iterations (32 × 32 switch)
B.29 Throughput comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch)
B.30 Average queue length comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch)
B.31 Average processing time comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch, average burst length=8 cells)
B.32 Average processing time comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch, average burst length=16 cells)
B.33 Average processing time comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch, average burst length=32 cells)
B.34 The performance comparison for 3-iteration PIMRC 1 with different average burst length (32 × 32 switch)
B.35 The performance comparison for 3-iteration RRMRC with different average burst length (32 × 32 switch)
B.36 The performance comparison for 3-iteration iSLIP-RRMRC with different average burst length (32 × 32 switch)
B.37 The performance comparison for 4-iteration PIMRC 2 with different average burst length (32 × 32 switch)
B.38 The performance comparison for 4-iteration Semi-RRMRC with different average burst length (32 × 32 switch)
C.1 An example of iSLIP-RRM
C.2 An example of how iSLIP-RRM avoids the synchronization of round robin pointers at outputs

List of Tables

3.1 The calculation result of the maximum number of operations of one scheduling iteration
3.2 The comparison result for the maximum number of operations of one iteration between the new and original scheduling algorithms
4.1 General simulation setting
4.2 The minimum number of iterations recommended for scheduling algorithms to achieve acceptable throughput (average burst length=8 cells)
4.3 Throughput comparison for scheduling algorithms with minimum number of iterations to achieve their highest throughput (16 × 16 switch, average burst length=8 cells)
4.4 Average processing time comparison (16 × 16 switch, average burst length=8 cells)
4.5 The minimum number of iterations recommended for scheduling algorithms to achieve acceptable throughput (32 × 32 switch)
4.6 The throughput comparison for the scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (32 × 32 switch)
4.7 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (32 × 32 switch)
B.1 PIM simulation results (16 × 16 switch)
B.2 RRM simulation results (16 × 16 switch)
B.3 iSLIP-RRM simulation results (16 × 16 switch)
B.4 PIMRC 1 simulation results (16 × 16 switch)
B.5 RRMRC simulation results (16 × 16 switch)
B.6 iSLIP-RRMRC simulation results (16 × 16 switch)
B.7 PIMRC 2 simulation results (16 × 16 switch)
B.8 Semi-RRMRC simulation results (16 × 16 switch)
B.9 The minimum number of iterations recommended for scheduling algorithms to achieve acceptable throughput (16 × 16 switch)
B.10 Throughput comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (16 × 16 switch, average burst length=8 cells)
B.11 Throughput comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (16 × 16 switch, average burst length=16 cells)
B.12 Throughput comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (16 × 16 switch, average burst length=32 cells)
B.13 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=8 cells)
B.14 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=16 cells)
B.15 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=32 cells)
B.16 PIM simulation results (32 × 32 switch)
B.17 RRM simulation results (32 × 32 switch)
B.18 iSLIP-RRM simulation results (32 × 32 switch)
B.19 PIMRC 1 simulation results (32 × 32 switch)
B.20 RRMRC simulation results (32 × 32 switch)
B.21 iSLIP-RRMRC simulation results (32 × 32 switch)
B.22 PIMRC 2 simulation results (32 × 32 switch)
B.23 Semi-RRMRC simulation results (32 × 32 switch)
B.24 The minimum number of iterations recommended for scheduling algorithms to achieve acceptable throughput (32 × 32 switch)
B.25 Throughput comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch)
B.26 Average processing time comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch)

Glossary of Acronyms and Symbols

Acronyms

2-MMBP          2-state Markov Modulated Bernoulli Process
ADM             Add-Drop Multiplexer
ATM             Asynchronous Transfer Mode
DRR             Deficit Round Robin
FIFO            First In First Out
HOL             Head of Line
IRRM-MC         Iterative Round Robin Matching with Multiple Classes
iSLIP-RRM       Iterative Round Robin Matching with SLIP
iSLIP-RRMRC     iSLIP-RRM with Relaxed Constraint
PIM             Parallel Iterative Matching
PIMRC 1         PIM with Relaxed Constraint Version 1
PIMRC 2         PIM with Relaxed Constraint Version 2
PPIM            Probabilistic Parallel Iterative Matching
RRM             Round Robin Matching
RRMRC           RRM with Relaxed Constraint
SDH             Synchronous Digital Hierarchy
Semi-RRMRC      Semi-Round Robin Matching with Relaxed Constraint
SPIM            Simplified Parallel Iterative Matching
VOQ             Virtual Output Queue
WPIM            Weighted Probabilistic Iterative Matching
WRR             Weighted Round Robin

Symbols

N × N
    The switch with N inputs and N outputs. It also represents the number of N × N.

L(n, m)
    The queue length of the cells that are queued at the nth input and are destined for the mth output.

M
    The number of outputs for an N × M switch.

N
    The number of inputs for an N × N switch or an N × M switch.

B
    The number of output groups into which the N outputs of an N × N switch are divided.

$Q_{total\ max\ PIM}$
    The maximum number of operations of one PIM iteration for an N × N switch.

$Q_{total\ max\ RRM}$
    The maximum number of operations of one RRM iteration for an N × N switch.

$Q_{total\ max\ iSLIP-RRM}$
    The maximum number of operations of one iSLIP-RRM iteration for an N × N switch.

$Q_{total\ max\ PIMRC1}$
    The maximum number of operations of one PIMRC 1 iteration for an N × N switch with B output groups.

$Q_{total\ max\ RRMRC}$
    The maximum number of operations of one RRMRC iteration for an N × N switch with B output groups.

$Q_{total\ max\ iSLIP-RRMRC}$
    The maximum number of operations of one iSLIP-RRMRC iteration for an N × N switch with B output groups.

K
    The number of unmatched outputs in one output group of an N × N switch applying PIMRC 2 or Semi-RRMRC.

L
    The number of requests received by one output group of a switch applying PIMRC 2 or Semi-RRMRC.

$Q_{total\ max\ PIMRC2}$
    The maximum number of operations of one PIMRC 2 iteration for an N × N switch with B output groups.

$Q_{total\ max\ Semi-RRMRC}$
    The maximum number of operations of one Semi-RRMRC iteration for an N × N switch with B output groups.

α
    The probability of changing from state 1 to state 2 in the 2-MMBP traffic model.

β
    The probability of changing from state 2 to state 1 in the 2-MMBP traffic model.

p
    The probability of generating traffic in state 2 in a 2-MMBP traffic model.

q
    The probability of generating traffic in state 1 in a 2-MMBP traffic model.

λ
    Cell arrival rate.

ρ
    Traffic load.

l
    Average burst length.

tl
    Simulation time (not including the simulation time when the traffic model generates traffic).

al
    Average processing time.

Chapter 1
Introduction

1.1 Background

The Internet has developed rapidly since the 1990s. As the core hardware of the Internet, the IP router has been studied extensively. A popular approach
to building an IP router is to use an ATM-like input-queued crossbar switching
fabric [4, 5, 1, 6]: variable-length IP packets arriving at the inputs of the IP router
are fragmented into fixed-length cells (such as ATM cells). These cells pass through
the input-queued crossbar switching fabric under the control of a scheduling algorithm,
and are then reassembled into IP packets and transmitted out of the IP router.
Because every cell must be scheduled across the switching fabric, the scheduling algorithm
has a critical influence on the router’s overall performance. It is therefore a focus of many
researchers.
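
The segmentation and reassembly step described above can be sketched as follows. This is only an illustrative sketch: the 48-byte cell payload (the ATM payload size), the zero padding of the last cell, and the function names `segment` and `reassemble` are assumptions made for the example, and a real line card would also carry header and length information with each cell.

```python
CELL_PAYLOAD = 48  # bytes; the ATM cell payload size, assumed here for illustration


def segment(packet: bytes, payload: int = CELL_PAYLOAD) -> list:
    """Split a variable-length packet into fixed-length cells.

    The last cell is zero-padded so that every cell has the same length.
    """
    cells = []
    for start in range(0, len(packet), payload):
        chunk = packet[start:start + payload]
        cells.append(chunk.ljust(payload, b"\x00"))
    return cells


def reassemble(cells: list, length: int) -> bytes:
    """Join the cells and strip the padding, given the original packet length."""
    return b"".join(cells)[:length]


packet = bytes(range(100))           # a 100-byte example packet
cells = segment(packet)              # three 48-byte cells, the last one padded
restored = reassemble(cells, len(packet))
```

Because all cells have the same length, the switching fabric can operate in fixed time slots, which is what allows the scheduling algorithms discussed in this thesis to compute one matching per slot.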
The simplest scheduling algorithm is First In First Out (FIFO). However, FIFO
causes serious Head of Line (HOL) blocking, which degrades switching
performance: the maximum throughput of FIFO is limited to about 58% [7]. To
avoid this HOL problem, many new scheduling algorithms have been developed and
studied. These algorithms can be divided into two categories: those based
on Parallel Iterative Matching (PIM) [8], and those based on Round Robin
Matching (RRM) [2, 9].
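
The throughput ceiling of FIFO input queueing can be reproduced with a small saturation experiment. The sketch below is not taken from the thesis simulations; it assumes that every input always has a cell waiting, that destinations are uniformly random, and that each contended output serves one contender chosen at random. The function name and parameters are hypothetical.

```python
import random


def fifo_saturation_throughput(n_ports=16, n_slots=20000, seed=1):
    """Estimate the saturated throughput of a FIFO input-queued switch.

    hol[i] is the destination of the head-of-line cell at input i. A cell that
    loses contention stays at the head of its queue and blocks the cells behind it.
    """
    rng = random.Random(seed)
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        # Group the inputs by the output their head-of-line cell is destined for.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each contended output serves exactly one input per time slot.
        for out, inputs in contenders.items():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # next cell moves to the head
            served += 1
    return served / (n_slots * n_ports)


throughput = fifo_saturation_throughput()
```

Under these assumptions the estimate for a 16 × 16 switch settles at roughly 60% of the line rate; for very large switches the known analytical limit is 2 − √2 ≈ 58.6%.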


PIM was originally introduced by Anderson et al. [8]. The basic idea of this scheduling algorithm is to apply a random policy in both the grant and accept stages of each
iteration. RRM was originally introduced by McKeown et al. [2, 9]. RRM can be
treated as an improved version of PIM. The basic idea of this scheduling algorithm
is to apply a round robin policy instead of a random policy in both the grant and accept
stages of each iteration. Both PIM and RRM have been shown to need multiple
iterations per time slot to achieve 100% (or almost 100%) throughput [8, 2, 9].
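
To make the request, grant and accept stages concrete, the following is a minimal sketch of PIM for one time slot. It is an illustration under stated assumptions rather than the simulation code used in this thesis: `requests[i]` is assumed to be the set of outputs for which input i has queued cells, and the function names are hypothetical.

```python
import random


def pim_iteration(requests, matched_in, matched_out, rng):
    """Run one request-grant-accept iteration; the two dicts are updated in place."""
    # Request: each unmatched input asks every unmatched output it has cells for.
    asks = {}
    for inp, outs in enumerate(requests):
        if inp in matched_in:
            continue
        for out in outs:
            if out not in matched_out:
                asks.setdefault(out, []).append(inp)
    # Grant: each output picks one of its requesting inputs at random.
    grants = {}
    for out, inputs in asks.items():
        grants.setdefault(rng.choice(inputs), []).append(out)
    # Accept: each input picks one of the outputs that granted it at random.
    for inp, outs in grants.items():
        out = rng.choice(outs)
        matched_in[inp] = out
        matched_out[out] = inp


def pim(requests, iterations, seed=0):
    """Return a dict mapping each matched input to its output for one time slot."""
    rng = random.Random(seed)
    matched_in, matched_out = {}, {}
    for _ in range(iterations):
        pim_iteration(requests, matched_in, matched_out, rng)
    return matched_in


# 4 x 4 example: input 0 has cells for outputs 0 and 1, input 1 for output 0, etc.
requests = [{0, 1}, {0}, {2, 3}, {3}]
match = pim(requests, iterations=4)
```

A single iteration can leave inputs and outputs unmatched even though they could still be paired, which is why several iterations per time slot are needed. Replacing the two random choices with round robin pointers gives the general shape of RRM.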
Many scheduling algorithms have been developed based on PIM and RRM, such
as Weighted Probabilistic Iterative Matching (WPIM) [10], Iterative Round Robin
Matching with Multiple Classes (IRRM-MC) [11], Deficit Round Robin (DRR) [12],
Iterative Round Robin Matching with SLIP (iSLIP-RRM) [2, 9], and Weighted Round
Robin (WRR) [13]. These scheduling algorithms offer
significant performance improvements, such as a flexible bandwidth allocation policy
and improved fairness. However, all these existing scheduling algorithms, including PIM, RRM and all scheduling algorithms based on PIM or RRM, have the same
limitation: the input traffic can be directed to only one pre-defined output. In other
words, each individual unit of input traffic (cell or packet) can be destined for one
and only one output (ignoring multicast).
We often meet the following situation in a typical mesh network, or in the Internet:
most nodes (such as IP routers) are connected to only about 3 or 4 other nodes. One example
is an IP router between the edge network and the core network, where the input side
may be fully connected to the terminals in the edge network while the output side is
connected to only a few nodes (such as other switches or routers). However, as the switching
capacity of IP routers has grown significantly, the number of inputs and outputs
of today’s IP routers has grown to 16, 32, 64 or even 128. If all
these outputs are to be used, many of them must clearly feed links that go to the
same node. They may be multiplexed, or be fed into fibres that are laid in parallel to
the same node. If we consider a group of such outputs, all going to the same node,
it does not matter which output an individual packet is sent to, because
every output in the group will take it to the correct node for its next hop.
Current scheduling algorithms ensure that a packet (or a cell) goes to
one pre-defined output, and they assume that at most one output is connected to any one node.
There therefore appears to be a possibility of speeding up the operation of the scheduling
algorithms by allowing the packet (or the cell) to go to any output in the group. It is
the intention of this thesis to explore what improvement may be available.
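
The relaxed constraint can be illustrated with a short sketch. It is not one of the scheduling algorithms developed in this thesis; it only shows the difference between naming a single output and naming a group. The contiguous, equal-sized grouping and the function names `group_of` and `pick_output` are assumptions made for the example.

```python
def group_of(output, n_outputs, n_groups):
    """Group index of an output, assuming equal-sized contiguous groups."""
    size = n_outputs // n_groups
    return output // size


def pick_output(group, busy, n_outputs, n_groups):
    """Return any idle output in the group, or None if the whole group is busy."""
    size = n_outputs // n_groups
    for out in range(group * size, (group + 1) * size):
        if out not in busy:
            return out
    return None


# 16 outputs in 4 groups: outputs 4..7 all lead to the same next-hop node.
busy = {4, 5}
strict_ok = 4 not in busy                 # a cell pinned to output 4 must wait
relaxed = pick_output(group_of(4, 16, 4), busy, 16, 4)  # any idle output will do
```

With a strict destination the cell waits for output 4, whereas with a group destination it can leave on output 6 in the same time slot, so a scheduler has more ways to complete a match.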

1.2 Contribution Resulting from Thesis

This dissertation briefly introduces the basic structure of IP routers and their functions.
Two widely studied scheduling algorithms, PIM and RRM, are
studied and illustrated, together with iSLIP-RRM, an improved version of RRM that
is also a popular research topic. A series of new
scheduling algorithms, which we developed based on PIM, RRM and iSLIP-RRM, is
then presented, along with simulations and results for all of the above scheduling
algorithms.
A clear performance comparison among PIM, RRM, iSLIP-RRM and the new scheduling algorithms developed in our research is given. The results show that all
the scheduling algorithms can achieve an acceptable throughput with an appropriate
number of iterations. However, our new scheduling algorithms consume much
less processing time than the original scheduling algorithms (typically only 9-31%
of the original algorithms’ processing time for the 16 × 16 switch, or about 10% for the 32 × 32
switch).

1.3 Dissertation Overview

This dissertation is organized as follows:
Chapter 2 reviews the literature on current research about scheduling algorithms,
including a detailed introduction for PIM, RRM and iSLIP-RRM, and a brief introduction for other scheduling algorithms developed from them. We use a simple 4 × 4
switch model to illustrate the detailed operation of these scheduling algorithms. The
structure and operation of an IP router are also introduced briefly in this chapter.


Chapter 3 presents a series of new scheduling algorithms developed by us. In this
part, a total of five new scheduling algorithms are presented in detail. A 4 × 4 switch
model and an 8 × 8 switch model are used to illustrate their operation.
Chapter 4 presents simulations for all the above scheduling algorithms. The 2-MMBP traffic model is applied as the traffic source in the simulation. A 16 × 16
switch model and a 32 × 32 switch model are used as the switching fabric in the
simulation. We focus on the trade-off between throughput (or average queue length)
and scheduling speed in the simulations. We compare the processing time of our new
algorithms to that of the originals.
Chapter 5 provides a summary of this dissertation, conclusions and advice for future
research.
Appendix A provides the full text of the publication based on the dissertation. Appendix B provides the detailed simulation results (including figures and tables).
Appendix C provides some additional examples used in this dissertation to illustrate
the operation of some scheduling algorithms. Finally, Appendix D provides the
original Matlab simulation programs.

1.4 Publication Based on This Thesis

Lixiang Xiong, Don Platt, “Scheduling with Relaxed Constraint for Input-Queued
Switch/Router with Crossbar Switching Fabric”, DSCPC 2003/WITSP 2003, Gold
Coast, Australia, Dec 2003.
Lixiang Xiong, Don Platt, Guoqiang Mao, “Scheduling with Relaxed Constraint for
ATM-like Input-queued Crossbar Switching Fabric in IP Router”, APCC 2004/MDMC
2004, Beijing, China, Aug-Sep 2004.

Chapter 2
Scheduling Algorithms in IP Router

2.1 Introduction

The Internet has been an important part of our daily life since the 1990s. The Internet
user population is increasing significantly every year. Therefore, researchers and
vendors are working hard to improve and develop Internet technology so that they can
keep pace with the demand of Internet users. As the core hardware of the Internet,
the IP router performs the function of forwarding IP packets from one network to
other different networks. Extensive research has been carried out to improve the
performance of the IP router.
The IP router is a complicated combination of hardware and software, and its overall performance is affected by many factors. In particular, the scheduling algorithms for
the IP router play a critical role in its overall performance. Therefore, extensive
research has been focused on scheduling algorithms. In this chapter, we give a brief
description of the IP router in Section 2.2. In Sections 2.3 to 2.5, we describe three popular scheduling algorithms in detail: PIM, RRM, and iSLIP-RRM. In Section 2.6, we
introduce other scheduling algorithms briefly. In Section 2.7, we discuss the need for
multiple iterations per time slot for PIM, RRM and iSLIP-RRM. Finally, Section 2.8
summarises this chapter.


Figure 2.1 The basic hardware structure of IP router [1]

2.2

Operation of IP Router

The basic hardware structure of an IP router is shown in Figure 2.1. As described in [4,
1], an IP router contains three core parts:

1. Line cards. A router is connected to different networks via line cards. They
are the physical hardware of the output and input of a router. The line cards
offer a physical interface so that they can be connected to different network
media, such as coaxial cable, twisted-pair cable, and optical fibers. For the
IP router using an ATM-like switching fabric, the line cards also carry out the
function of segmenting IP packets into fixed-length cells (such as ATM cells)
at the input side, and assembling the cells into IP packets at the output side.
2. A CPU, or a router processor. It provides the processing capability to run most software
functions of an IP router, such as routing protocols, scheduling algorithms,
and route lookup algorithms. However, as a result of the rapid development
of hardware technology, the current trend in IP router design is to distribute
some functions of the CPU, such as the route lookup algorithm, to the line cards [4, 5, 14].


3. A switching fabric. This is the hardware to connect the input line cards and the
output line cards. How traffic passes through the switching fabric is controlled
by the CPU via applying some kind of scheduling algorithm.

The operation of an IP router can be briefly described as follows:
Once an IP packet arrives at an input line card, the input line card checks whether
the newly arrived packet is valid by means of an error-detecting code. If the packet
is valid, the IP router performs a route lookup algorithm to choose which output
line card this IP packet should be forwarded to. The result of the route lookup algorithm is based on the address information contained in the IP packet header and the
routing table maintained in the IP router. Depending on the design, the route lookup
algorithm can be applied by the input line cards, the CPU or a combination of both.
The routing table in an IP router can be fixed, or it can be updated dynamically by
exchanging routing information with other IP routers via some exchange protocols.
Once the destined output line card is determined via the route lookup algorithm, the
IP packet will be forwarded to the destination output line card, where the CPU will
perform some scheduling algorithm to control the switching fabric to transfer the IP
packet.
The switching fabric used in a router provides a high-speed dedicated path from the
input port to the output port [1, 14]. It is the core hardware for high-speed routers.
There are two popular types of switching fabric: shared memory switching fabric and
crossbar switching fabric [4, 5]. The crossbar switching fabric is more popular since
the performance of the shared memory switching fabric is limited by the memory
access speed. The crossbar switching fabric offers a dedicated physical path between
each input and each output by opening a series of crosspoints between the input and
output.
Although various designs have placed buffers (queues) at either the output side or
the input side, this thesis considers only the input-queued design, since the output-queued switching fabric has a scalability problem with buffer size [4, 5]. That
is, the output-queued switching fabric has to provide a large buffer at each output for
the situation in which all input traffic is directed to the same output. Obviously it is not a


Figure 2.2 An example of queue structure for a 4 × 4 switch running PIM

cost-effective scheme if there is such a large buffer at each output. Further, in order
to remove the Head of Line (HOL) problem for the input-queued switching fabric,
virtual output queueing (VOQ) [15] is applied. That is, each input queue is divided
into N sub-queues for an N × N switch, and each sub-queue stores the cells/packets
destined for one of the N outputs. Only one class of packet is considered in
this thesis. That is, all packets have the same priority to be transmitted.
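The VOQ structure described above can be sketched as follows. This is a minimal illustration only: the class name `VoqInput`, its methods, and the use of 0-based output indices are assumptions made for this sketch and do not come from the simulation programs in Appendix D.

```python
from collections import deque


class VoqInput:
    """One input port of an N x N switch holding N virtual output queues.

    Hypothetical helper for illustration; outputs are indexed from 0.
    """

    def __init__(self, n_outputs):
        # One FIFO sub-queue per output
        self.queues = [deque() for _ in range(n_outputs)]

    def enqueue(self, cell, output):
        """Store an arriving cell in the sub-queue of its destination output."""
        self.queues[output].append(cell)

    def requests(self):
        """Outputs for which this input holds at least one queued cell."""
        return [j for j, q in enumerate(self.queues) if q]

    def dequeue(self, output):
        """Remove the head cell of the sub-queue for the given output."""
        return self.queues[output].popleft()


port = VoqInput(4)
port.enqueue("cell-a", 1)
port.enqueue("cell-b", 3)
port.enqueue("cell-c", 1)
```

In this sketch the cell destined for output 3 can be served even though it arrived after a cell destined for output 1, which is the HOL problem that a single FIFO queue per input would suffer from.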
Meanwhile, fixed-length switching technology, or the so-called ATM-like/cell-based
switching technology is widely accepted to achieve high-speed switching for the IP
router [4, 5, 6]. It has been widely applied in the commercial products, such as Cisco
12000 series router [16]. By applying ATM-like switching fabric, the variable-length
IP packet is segmented into fixed-length cells in the input line card. Once all cells
of the IP packet arrive in the output line card, the output line card will re-assemble
them into an IP packet, and then the IP packet is transmitted out of the router.
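The segmentation and reassembly performed by the line cards can be illustrated with a short sketch. The 48-byte cell size (the ATM payload size), the zero padding of the last cell, and the function names `segment` and `reassemble` are assumptions chosen for this illustration; a real line card would also carry the header information needed to identify the packet to which each cell belongs.

```python
def segment(packet, cell_size=48):
    """Split a packet into fixed-length cells, zero-padding the last cell."""
    return [packet[k:k + cell_size].ljust(cell_size, b"\x00")
            for k in range(0, len(packet), cell_size)]


def reassemble(cells, packet_length):
    """Concatenate the cells in order and strip the padding."""
    return b"".join(cells)[:packet_length]


packet = bytes(range(100))     # a 100-byte packet used as test data
cells = segment(packet)        # three 48-byte cells, the last one padded
restored = reassemble(cells, len(packet))
```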

2.3

Parallel Iterative Matching (PIM)

PIM was originally developed for a 16 × 16 switch by DEC System Research Centre [8]. For an N × N input-queued switch, PIM sets up N queues at each input
corresponding to N outputs separately, so there are N × N input queues in total. An
example of such a queue structure is shown in Figure 2.2: it is a 4 × 4 switch, so
that there are 4 queues at each input, and each queue stores the cells destined for one
output. Therefore there are 16 input queues in total.


PIM is an iterative scheduling algorithm, and each iteration contains three stages:
request, grant, and accept. It applies a random policy in both the grant and the accept
stages of each iteration. The detail of one PIM iteration is shown below:

• Stage 1: Request. Each unmatched input sends a request to each unmatched
output for which it has one or more queued cells.
• Stage 2: Grant. If an unmatched output receives more than one request, it chooses one
and only one request to be granted. The choosing decision is based on a random policy.
• Stage 3: Accept. If an unmatched input receives more than one grant, it chooses one
and only one grant to be accepted. This decision is also based on a random
policy.

After each iteration, the inputs and outputs that have established matching pairs during this iteration will be marked as “matched”, and they will not be taken into account in
the following iterations of the same time slot. Figure 2.3 is an example illustrating
how PIM works. The same example is also used in [2]. This example illustrates the three
stages of one PIM iteration for a 4 × 4 switch. Here we use the symbol “L (n, m)”
to represent the queues in the switch, where “n” represents the input where the queue is
located, and “m” represents the output for which the cells in the queue are destined.
For example, “L (1,2)=3” would indicate that there are three cells destined for output 2 queued
at input 1. The queueing situation in this example is as follows: at input 1, there
is one queued cell destined for output 1 and four queued cells destined for output 2.
At input 2, no cell is queued. At input 3, there are two queued cells destined for
output 2 and one queued cell destined for output 4. At input 4, there are three queued
cells destined for output 4. The three stages of this PIM iteration are presented in
Figure 2.3 (a), (b), (c) separately. The detailed description follows:

(a) Stage 1: Request. We assume that all inputs and outputs are unmatched initially.
Based on the queueing situation in Figure 2.3 (a), only inputs 1, 3 and 4 send
requests since there is no queued cell at input 2. Input 1 sends requests to


outputs 1 and 2, input 3 sends requests to outputs 2 and 4, and input 4 sends a
request to output 4.
(b) Stage 2: Grant. According to the result of stage 1, only outputs 1, 2 and 4 receive
requests. Output 1 receives only one request, from input 1, and it grants
this request since it has no other choice. Output 2 receives two requests, from
inputs 1 and 3. It applies a random policy to grant only one of them.
Here we assume it grants the request from input 3. Output 4 receives two
requests, from inputs 3 and 4. It also applies a random policy to
grant only one of them. Here we assume it grants the request from input 3.
(c) Stage 3: Accept. According to the result of stage 2, only inputs 1 and 3 receive
grants. Input 1 receives only one grant, from output 1, and it accepts this
grant since it has no other choice. Input 3 receives two grants, from outputs
2 and 4. It applies a random policy to accept only one of them.
We assume it accepts the grant from output 2. So, two matching pairs are
found after this PIM iteration: input 1-output 1 and input 3-output 2. They are
marked as “matched” so that they will not join the following iterations of the
same time slot.
According to [8], PIM can achieve about 63% throughput with one iteration under
full traffic load. This throughput is too low to be acceptable. It is also found that 4-iteration
PIM for a 16 × 16 switch can achieve a throughput greater than 99% under full traffic
load [8].
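The three stages of one PIM iteration can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Matlab simulation program of Appendix D: the function name `pim_iteration`, the 0-based indices, and the use of sets to record matched ports are all choices made here. The queue occupancy is that of the example in Figure 2.3.

```python
import random


def pim_iteration(queue_len, matched_in, matched_out, rng=random):
    """One request-grant-accept iteration of PIM (illustrative sketch).

    queue_len[i][j] is the number of cells queued at input i for output j.
    matched_in and matched_out are sets that are updated in place.
    Returns a dict {input: output} with the pairs matched in this iteration.
    """
    n = len(queue_len)
    # Stage 1: Request. Unmatched inputs request unmatched outputs
    # for which they hold at least one cell.
    requests = {j: [i for i in range(n)
                    if i not in matched_in and queue_len[i][j] > 0]
                for j in range(n) if j not in matched_out}
    # Stage 2: Grant. Each output grants one request chosen at random.
    grants = {}
    for j, reqs in requests.items():
        if reqs:
            grants.setdefault(rng.choice(reqs), []).append(j)
    # Stage 3: Accept. Each input accepts one grant chosen at random.
    new_matches = {i: rng.choice(outs) for i, outs in grants.items()}
    matched_in.update(new_matches.keys())
    matched_out.update(new_matches.values())
    return new_matches


# Queue occupancy of Figure 2.3 (row = input, column = output, 0-based)
L = [[1, 4, 0, 0],
     [0, 0, 0, 0],
     [0, 2, 0, 1],
     [0, 0, 0, 3]]
matched_in, matched_out = set(), set()
matches = pim_iteration(L, matched_in, matched_out, random.Random(0))
```

Because the grant and accept decisions are random, the pairs found may differ from those assumed in Figure 2.3. The first input is always matched in this example, since it is the only input requesting the first output.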

2.4

Round Robin Matching (RRM)

RRM was introduced by McKeown et al. [2, 9]. There are also three stages within
one RRM iteration: request, grant, and accept. However, RRM applies a round robin
policy instead of a random policy in the grant and accept stages. The detail of one
RRM iteration is shown below:
• Stage 1: Request. Each unmatched input sends a request to each unmatched
output for which it has one or more queued cells.


Figure 2.3 An example of PIM iteration [2]

• Stage 2: Grant. If an unmatched output receives more than one request, it
grants one and only one request. The choosing decision is based on a round
robin policy. There is a round robin arbiter at each output. The output grants
the request from the input that appears next based on the round robin arbiter at
the output. At the end of stage 2, the position of the round robin pointer at the
output is increased to the location beyond the granted input (modulo M, here
M is the number of inputs in the switch).
• Stage 3: Accept. If an unmatched input receives more than one grant, it
chooses one and only one grant to be accepted. The choosing decision is based
on a round robin policy too. There is a round robin arbiter at each input. The
input accepts the grant from the output that appears next based on the round
robin arbiter at the input. At the end of stage 3, the position of the round robin
pointer at the input is increased to the location beyond the accepted output
(modulo N, here N is the number of outputs in the switch).


The status of inputs/outputs after each iteration is updated in the same way as for
PIM. An example of one RRM iteration is shown in Figure 2.4. This example
presents one RRM iteration for a 4 × 4 switch. At input 1, there are two queued
cells destined for output 1 and one queued cell destined for output 3. At input 2, no
cell is queued. At input 3, there are three queued cells destined for output 1 and two
queued cells destined for output 2. At input 4, there is one queued cell destined for
output 1 and one queued cell destined for output 4. The three stages of the RRM iteration are presented in Figure 2.4 (a), (b) and (c) separately. The detailed description
follows:
(a) Stage 1: Request. We assume that all inputs and outputs are unmatched initially.
Only inputs 1, 3 and 4 send requests since there is no queued cell at input 2.
Input 1 sends requests to outputs 1 and 3, input 3 sends requests to outputs 1
and 2, and input 4 sends requests to outputs 1 and 4.
(b) Stage 2: Grant. We assume all round robin pointers at the output side are set
to position “1” initially, which means requests from input 1 have the highest
priority to be granted. Output 1 receives three requests, from inputs 1, 3, and 4,
and it grants the request from input 1 based on its round robin
arbiter. Then its round robin pointer is increased to position “2”, which is the
location beyond the granted input 1 (modulo 4). Output 2 receives only one
request, from input 3, so it grants this request since it has no other choice,
and its pointer is increased to position “4”, which is the location beyond the
granted input 3 (modulo 4). Similarly, outputs 3 and 4 grant the requests from
inputs 1 and 4 respectively since they have no other choice. The round robin
pointer of output 3 is increased to position “2” (modulo 4), and the round robin
pointer of output 4 remains at position “1” since the location in its round robin
arbiter beyond the granted input 4 is still position “1” (modulo 4).
(c) Stage 3: Accept. We assume all round robin pointers at the input side are set to
position “1” initially, which means the grants from output 1 have the highest
priority to be accepted. Input 1 receives two grants, from outputs 1 and 3, so it accepts the grant from output 1 based on its round robin arbiter,
and its round robin pointer is increased to position “2” (modulo 4). Similarly


Figure 2.4 An example of RRM iteration

inputs 3 and 4 accept the grants from outputs 2 and 4 separately since each of
them receives only one grant. The round robin pointer at input 3 will be increased to position “3”. The round robin pointer at input 4 remains at position
“1” since the location in its round robin arbiter beyond the accepted output 4
is still position “1” (modulo 4). The round robin pointer at input 2 remains at
position “1” since no grant is received by it.
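The RRM iteration of Figure 2.4 can be reproduced with a short sketch. The function name `rrm_iteration` and the 0-based indices are assumptions of this sketch (position “1” in the text corresponds to pointer value 0 here), and for brevity it covers only the first iteration of a time slot, in which all inputs and outputs are unmatched.

```python
def rrm_iteration(queue_len, grant_ptr, accept_ptr):
    """First RRM iteration of a time slot (all ports unmatched).

    queue_len[i][j]: cells queued at input i for output j.
    grant_ptr[j]: round robin pointer at output j; accept_ptr[i]: at input i.
    Both pointer lists are updated in place. Returns {input: output}.
    """
    n = len(queue_len)

    def pick(candidates, pointer):
        # First candidate at or after the pointer, in circular order
        return min(candidates, key=lambda c: (c - pointer) % n)

    # Stage 1: Request
    requests = {j: [i for i in range(n) if queue_len[i][j] > 0]
                for j in range(n)}
    # Stage 2: Grant, then move the output pointer beyond the granted input
    grants = {}
    for j, reqs in requests.items():
        if reqs:
            i = pick(reqs, grant_ptr[j])
            grants.setdefault(i, []).append(j)
            grant_ptr[j] = (i + 1) % n
    # Stage 3: Accept, then move the input pointer beyond the accepted output
    matches = {}
    for i, outs in grants.items():
        j = pick(outs, accept_ptr[i])
        matches[i] = j
        accept_ptr[i] = (j + 1) % n
    return matches


# Queue occupancy of Figure 2.4 (row = input, column = output, 0-based)
L = [[2, 0, 1, 0],
     [0, 0, 0, 0],
     [3, 2, 0, 0],
     [1, 0, 0, 1]]
grant_ptr = [0, 0, 0, 0]
accept_ptr = [0, 0, 0, 0]
matches = rrm_iteration(L, grant_ptr, accept_ptr)
```

With these inputs the sketch finds the pairs input 1-output 1, input 3-output 2 and input 4-output 4 (in the 1-based numbering of the text), and leaves the output pointers at positions 2, 4, 2, 1 and the input pointers at positions 2, 1, 3, 1, in agreement with the description above.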

According to McKeown et al. [9], RRM suffers from synchronization of the round
robin pointers at the outputs. We present an example of this flaw in Figure 2.5. This
example illustrates one RRM iteration for a 4 × 4 switch. The input traffic destined


for each output is backlogged, which means there are always cells destined for all
four outputs queued at each input. Therefore, each input will send requests to all four
outputs at the request stage of the RRM iteration, shown in Figure 2.5(a). We assume
all round robin pointers at both the output side and input side are set to position “1”
initially. Therefore, during the grant stage of the RRM iteration, all outputs will
grant the request from input 1 only, and the round robin pointers at all outputs
are increased to position “2” (modulo 4), shown in Figure 2.5(b). Finally, during the
accept stage of the RRM iteration, we can see only one matching pair between input
1 and output 1 is found after this RRM iteration. It also can be seen easily that all
outputs will grant the requests from input 2 during the next RRM iteration since all
round robin pointers at all outputs are set to position “2”, therefore only one matching
pair between input 2 and some output can be found during the next RRM iteration.
From this example, we can see that synchronization of round robin pointers at outputs
causes considerable inefficiency in the algorithm: only one matching pair can be
found during one RRM iteration with backlogged traffic if the round robin pointers
at the outputs become synchronised. To overcome this flaw, McKeown introduced
iSLIP-RRM, a new scheduling algorithm [9], and we introduce it in the following
section of this chapter.

2.5

Iterative Round Robin Matching with SLIP (iSLIP-RRM)

McKeown adjusted RRM slightly to overcome the problem of synchronization [9].
The detail of one iSLIP-RRM iteration is shown below:

• Stage 1: Request. Same as for RRM.
• Stage 2: Grant. Same as for RRM, except that the round robin pointers at all
outputs remain at the previous position no matter whether they grant a request
or not during this stage.
• Stage 3: Accept. Same as for RRM except that the positions of the round robin
pointers at the output side will be updated if and only if the grants from the


Figure 2.5 An example of synchronization of round robin pointers at outputs for RRM


outputs are accepted.

The obvious difference between RRM and iSLIP-RRM is that iSLIP-RRM does not
update the round robin pointers at the outputs at the end of the grant stage. Instead, it updates
them at the end of the accept stage, and only the pointers of those outputs whose
grants are accepted during the accept stage are updated. The round robin pointers
of those outputs whose grants are not accepted during this iteration remain at their
previous positions. As a result, the synchronization of round robin pointers at outputs
is broken in iSLIP-RRM.
An example of one iSLIP-RRM iteration and an example of how iSLIP-RRM
avoids the problem of synchronization are illustrated in Appendix C.
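The effect of the different pointer update rules can be seen in a small sketch that counts the matching pairs found per time slot under backlogged traffic, with one iteration per time slot and all pointers starting at the first position. The function name `matches_per_slot` and the 0-based indices are assumptions of this sketch; it is not taken from Appendix C or Appendix D.

```python
def matches_per_slot(n, slots, islip):
    """Count the matching pairs found in each time slot for an n x n switch.

    Traffic is backlogged, so every input requests every output.
    islip=False: RRM rule, output pointers move at the end of the grant stage.
    islip=True: iSLIP-RRM rule, output pointers move only on accepted grants.
    """
    grant_ptr = [0] * n
    accept_ptr = [0] * n
    counts = []
    for _ in range(slots):
        # Grant: every input is requesting, so each output grants the
        # input that its round robin pointer points to.
        grants = {}
        for j in range(n):
            i = grant_ptr[j]
            grants.setdefault(i, []).append(j)
            if not islip:
                grant_ptr[j] = (i + 1) % n
        # Accept: each input takes the first granting output at or after
        # its own round robin pointer.
        for i, outs in grants.items():
            j = min(outs, key=lambda o: (o - accept_ptr[i]) % n)
            accept_ptr[i] = (j + 1) % n
            if islip:
                grant_ptr[j] = (i + 1) % n
        counts.append(len(grants))   # every granted input accepts one grant
    return counts


rrm_counts = matches_per_slot(4, 6, islip=False)    # [1, 1, 1, 1, 1, 1]
islip_counts = matches_per_slot(4, 6, islip=True)   # [1, 2, 3, 4, 4, 4]
```

Under these assumptions, RRM stays locked at one matching pair per time slot for the 4 × 4 switch, while the output pointers of iSLIP-RRM drift apart and all four pairs are found from the fourth time slot onwards.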

2.6

Other Scheduling Algorithms

Most scheduling algorithms for input-queued routers can be divided into two categories: one is based on Parallel Iterative Matching (PIM) [8], such as Simplified PIM
(SPIM) [17], Weighted Probabilistic Iterative Matching (WPIM) [10], and Probabilistic Parallel Iterative Matching (PPIM) [18]. The other is based on Round Robin
Matching (RRM) [9], such as Iterative Round Robin with SLIP (iSLIP-RRM) [2, 9],
and Iterative Round Robin Matching with Multiple Classes (IRRM-MC) [11].
Apart from the iterative scheduling algorithms, there are also non-iterative scheduling algorithms, such as Weighted Round Robin (WRR) [13] and Deficit Round Robin
(DRR) [12].
All these existing scheduling algorithms avoid the Head of Line (HOL) problem.
Their developers focused mainly on improving the bandwidth allocation performance
of the algorithms, adding functions to the original PIM and
RRM so that they perform better. For example, different weights (or probabilities)
are assigned to different traffic flows in their algorithms so that they can allocate flow
bandwidth more flexibly and more fairly than the original PIM and RRM, which
helps the switching fabric to handle traffic with different service priorities. Meanwhile, all these algorithms are still limited by the rule that one cell/packet can be
directed to only one output, even though several outputs may be multiplexed into a
link going to a single network node.

2.7

Multiple Iterations of Scheduling Algorithms

As presented in [8, 2, 9], PIM, RRM and iSLIP-RRM are iterative scheduling algorithms. From the examples illustrated in previous sections, we can see that none
of these scheduling algorithms can guarantee finding all possible matching pairs
with only one iteration per time slot. Therefore, multiple iterations are necessary
to achieve satisfactory throughput.
Multiple-iteration PIM is very easily implemented: matching pairs (inputs/outputs)
found in one PIM iteration will be marked as “matched” and will not join the following PIM iterations of the same time slot, which means the matched inputs will
not send any request while the matched outputs will not receive any request. When
all PIM iterations of one time slot finish, all inputs and outputs will be reset to “unmatched” again. Based on [8], the appropriate number of PIM iterations for a 16 × 16
switch is 4. It can achieve almost 100% throughput under full traffic load with 4 PIM
iterations. It is also shown in [8] that the appropriate number of PIM iterations for an
N × N switch is log2 N.
For multiple-iteration RRM and multiple-iteration iSLIP-RRM, the same policy of
updating the status of inputs/outputs as PIM is applied. Furthermore, the updating of
the round robin arbiter needs to be considered for them. Based on [2, 9], there are
two solutions for round robin arbiter updating: updating them during each iteration of
one time slot, or updating them only during the first iteration of one time slot. In our
simulation, we applied the solution of updating the round robin arbiter during each
iteration. Based on [2, 9], the appropriate number of RRM or iSLIP-RRM iterations
for an N × N switch should generally follow the rule of O(log2 N).
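The handling of multiple iterations within one time slot can be sketched for PIM as follows. The function name `pim_time_slot`, the 0-based indices, and the early exit when no request remains are assumptions of this sketch. The matching is held in a local dictionary, so every call starts with all inputs and outputs unmatched, as at the start of a new time slot.

```python
import math
import random


def pim_time_slot(queue_len, iterations, rng):
    """Run several PIM iterations within one time slot (illustrative sketch).

    Returns {input: output}. Ports matched in an earlier iteration neither
    send nor receive requests in the later iterations of the same time slot.
    """
    n = len(queue_len)
    match = {}
    for _ in range(iterations):
        matched_out = set(match.values())
        # Request and grant among the ports that are still unmatched
        grants = {}
        for j in range(n):
            if j in matched_out:
                continue
            reqs = [i for i in range(n)
                    if i not in match and queue_len[i][j] > 0]
            if reqs:
                grants.setdefault(rng.choice(reqs), []).append(j)
        if not grants:
            break                      # no further pair can be found
        # Accept
        for i, outs in grants.items():
            match[i] = rng.choice(outs)
    return match


n = 16
backlogged = [[1] * n for _ in range(n)]     # every queue is non-empty
iterations = int(math.log2(n))               # 4 iterations for a 16 x 16 switch
result = pim_time_slot(backlogged, iterations, random.Random(1))
full = pim_time_slot(backlogged, n, random.Random(1))
```

Each iteration that still has a pending request adds at least one pair, so n iterations always suffice to match every port under backlogged traffic. The number of pairs found with log2 N iterations depends on the random choices.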


2.8


Summary

In this chapter, a brief introduction to the IP router is given. Several popular scheduling algorithms for the input-queued switching fabric, that is, PIM, RRM and iSLIP-RRM, are also introduced in detail. It is clear that those existing scheduling algorithms share the limitation that a unit of the arriving traffic can be destined for
only one output, while a node can be connected to only one output of the router via
one link. This limitation may cause the problem that the switching capacity cannot
be fully utilized when the outputs are not fully connected. Our solution is explained
in detail in the next chapter.

Chapter 3
Scheduling Algorithms with Relaxed
Constraint
3.1

The Basic Idea of Our Research

As mentioned in Chapters 1 and 2, the existing scheduling algorithms have the following constraint: each packet is directed to one and only one output, which is connected to a
node via a link. Other outputs may be connected to the same node, but
they are not made available by the standard algorithms. In this thesis, we develop a
series of new scheduling algorithms based on the following idea: we divide all outputs of the ATM-like input-queued crossbar switching fabric of the IP router into a
few groups. Each group contains several outputs. The outputs in a group are multiplexed into one high-speed output link that is connected to a node somewhere else
in the network. Cells destined for this node can be routed to any output within this
group, since all outputs in this group direct their cells to the one link, and any of
these outputs would serve equally well. The idea is shown in Figure 3.1.
In our research, we think the existing digital multiplexing technology should be able
to support our idea. For example, we can multiplex the traffic from all outputs (its
data rate is usually in the order of MBps) within the same group into some GBps
high-speed link via some SDH (Synchronous Digital Hierarchy) multiplexer, like
ADM (Add-Drop Multiplexer). But reassembling cells into an IP packet at the output
side is a key challenge in our research since cells of an IP packet may arrive in

Figure 3.1 The basic idea of our research

different outputs within a group. We consider applying a buffer at the output of
the multiplexer, where the cells are stored and reassembled to the IP packet. The
assembled IP packets would not be queued in the buffer if the high-speed link has
enough capacity to support all the multiplexed outputs. In this thesis, this buffer is
not considered in the analysis and simulation to avoid complexity.
By applying this idea, each individual unit of input traffic can be directed to any
output of a group of available outputs through the switching fabric. All
outputs can therefore join the switching, since every output belongs to one of the output groups,
and the switching capacity will be fully used. In this thesis, our new algorithms apply
matching policies similar to those used in the existing scheduling algorithms. Therefore, we need to adjust the existing scheduling algorithms to
make them suitable for this idea.
As described in Chapter 2, an N × N input-queued switching fabric that applies
the existing scheduling algorithms (such as PIM, RRM, and iSLIP-RRM) will have
N × N input queues at the input side. The number of input queues has a critical influence on
the computing complexity of the algorithms. The general working procedure of one
PIM iteration in the simulation programming structure (refer to Appendix D) for an
N × N switch is shown below:
• Step 1. Check the status of the N × N input queues, so that the program can find
how many inputs would send a request to each output. This step needs N × N loops,
and each loop contains two operations: check the status of one input queue
(empty or not), and increase the counter for the number of requests received
by the output if its corresponding input queue is not empty. The
maximum number of operations for this step is 2 × N × N, reached when no input queue
is empty.
• Step 2. Perform N loops to carry out the grant stage. Each loop contains two
operations: check the number of requests received by one output, and grant
one request randomly if the number of requests received by the output is not zero.
The maximum number of operations for this step is 2 × N if the number of
requests received by each output is not zero.
• Step 3. Perform N loops to carry out the accept stage. Each loop contains two
operations: check the number of grants received by one input, and accept one
grant randomly if the number of grants received by the input is not zero. The
maximum number of operations for this step is 2 × N if the number of grants
received by each input is not zero.

Therefore, we can get the maximum sum of operations for one PIM iteration as
shown in Equation 3.1.
\[ Q_{\mathrm{total\_max\_PIM}} = 2 \times N \times N + 2 \times N + 2 \times N \tag{3.1} \]

For RRM, the only difference is that each loop in steps 2 and 3 performs the choosing
decision based on a round robin policy instead of a random policy, and it needs one
additional operation to update the round robin pointers at the output side in step 2
and at the input side in step 3 separately. So, the maximum sum of operations for one
RRM iteration should be as shown in Equation 3.2.
\[ Q_{\mathrm{total\_max\_RRM}} = 2 \times N \times N + 3 \times N + 3 \times N \tag{3.2} \]

For iSLIP-RRM, similarly, the only difference is that each loop in steps 2 and 3
performs the choosing decision based on a round robin policy instead of a random
policy, and the loop in step 3 needs two additional operations to update the round
robin pointers at the output side and the input side separately. So, the maximum sum
of operations for one iSLIP-RRM iteration should be as shown in Equation 3.3.
\[ Q_{\mathrm{total\_max\_iSLIP\text{-}RRM}} = 2 \times N \times N + 2 \times N + 4 \times N \tag{3.3} \]

The above three equations are only general estimates of the computing
complexity of these three algorithms, and they apply only to the situation
in which the maximum number of operations is needed. Different programming methods
and different traffic loads will influence the actual computing complexity. We also
need to consider the fact that multiple iterations are required, so the actual
computing complexity over a full time slot is harder to predict, since the matched
inputs/outputs will not join switching in the following iterations of the same time slot
(fewer operations are needed for the following iterations). However, these equations
are still a good reference for predicting the computing complexity during our
research.
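As a worked illustration, Equations 3.1 to 3.3 can be evaluated for the switch sizes used in the simulations of Chapter 4. The function names below are hypothetical and chosen only for this sketch.

```python
def max_ops_pim(n):
    """Equation 3.1: maximum operations of one PIM iteration."""
    return 2 * n * n + 2 * n + 2 * n


def max_ops_rrm(n):
    """Equation 3.2: maximum operations of one RRM iteration."""
    return 2 * n * n + 3 * n + 3 * n


def max_ops_islip_rrm(n):
    """Equation 3.3: maximum operations of one iSLIP-RRM iteration."""
    return 2 * n * n + 2 * n + 4 * n


for n in (16, 32):
    print(n, max_ops_pim(n), max_ops_rrm(n), max_ops_islip_rrm(n))
# prints:
# 16 576 608 608
# 32 2176 2240 2240
```

For N = 16 the first step accounts for 512 of the 576 operations of PIM, and for N = 32 it accounts for 2048 of 2176.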
Based on the above, we can see that a larger switch size results in a higher
computing complexity. In Equations 3.1-3.3, the 2 × N × N operations of
the first step make up the majority of the total (especially when N is very
large). Obviously 2 × N × N is determined by the number of input queues,
N × N. Assuming the number of output groups for the N × N switching fabric
applying our new algorithms is B, the number of input queues for the switching
fabric is N × B (N > B). Obviously there is a significant decrease of the number of
input queues in the new algorithms. Since we will apply a similar scheduling policy
to the one used in the existing scheduling algorithms to the new algorithms, we can
see that the new algorithms will have a similar working procedure to the existing
algorithms, and so the equations to calculate the computing complexity should also
be similar. Therefore we can see that the significant decrease in the number of input
queues would also result in a considerable decrease in the number of operations. We
expect that the new algorithms would run faster than the existing algorithms due to
the decreasing number of operations. We give the description for it in detail in the
following sections of this chapter.
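The reduction in the number of input queues, and in the maximum number of operations of the first step, can be illustrated numerically. The value B = 4 and the function names are assumptions made only for this example.

```python
def input_queue_count(n, groups=None):
    """N x N queues for the existing algorithms, N x B with B output groups."""
    return n * (n if groups is None else groups)


def step1_max_ops(n, groups=None):
    """Two operations per input queue in the first step."""
    return 2 * input_queue_count(n, groups)


n, b = 16, 4   # B = 4 is an assumed group count for illustration
print(input_queue_count(n), input_queue_count(n, b))   # prints: 256 64
print(step1_max_ops(n), step1_max_ops(n, b))           # prints: 512 128
```

In this example the number of input queues and the maximum number of operations of the first step both fall by a factor of N/B = 4.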
We develop the new scheduling algorithms in two approaches by applying different
output selection policy in the output group:


Figure 3.2 An example of two approaches for developing our new scheduling algorithms

• Approach 1: We allow all unmatched outputs in one group to receive and respond to the request from inputs individually.
• Approach 2: We treat each group of outputs as a virtual output. This virtual
output deals with all requests coming to this group of outputs, and then it applies some policy to pick one idle output to set up an input-output matching
pair.

These two approaches are shown in Figure 3.2. An N × N switch fabric has the N
outputs divided into a few groups. The first output group contains 4 outputs. In
Figure 3.2 (a), a request from input 1 is received by all 4 outputs in this group, and
each output will respond to this request individually. In Figure 3.2 (b), the request
is received and dealt with by the output group on behalf of all outputs in this group.
The detail for both approaches is discussed in the following sections of this chapter.


We develop our new scheduling algorithms from PIM, RRM and iSLIP-RRM. So we
refer to them as PIM with Relaxed Constraint Version 1 (PIMRC 1), RRM with Relaxed Constraint (RRMRC), iSLIP-RRM with Relaxed Constraint (iSLIP-RRMRC),
PIM with Relaxed Constraint Version 2 (PIMRC 2), and Semi-RRM with Relaxed
Constraint (Semi-RRMRC). PIMRC 1, RRMRC and iSLIP-RRMRC are using Approach 1. PIMRC 2 and Semi-RRMRC are using Approach 2 (We do not adjust
RRM and iSLIP-RRM by using Approach 2 due to the difficulty of applying the
round robin arbiter). We introduce them in detail in the following sections of this
chapter.

3.2

Scheduling Algorithms with Relaxed Constraint using Approach 1

3.2.1 PIM with Relaxed Constraint Version 1 (PIMRC 1)
Based on PIM, we develop PIMRC 1 by applying Approach 1. Similar to PIM,
PIMRC 1 also contains three stages in one iteration: request, grant and accept. The
algorithm is exactly the same as PIM, except that an input with a cell for a particular
output group sends a request to all unmatched outputs in the group during the request
stage of one PIMRC 1 iteration.
An example of PIMRC 1 is shown in Figure 3.3. This example illustrates the three stages
of one PIMRC 1 iteration for a 4 × 4 switch. The 4 outputs
are divided into two groups, and each group contains 2 outputs. The outputs in the
same group are multiplexed into one output link. At input 1, there is one queued
cell destined for output link 1 and four queued cells destined for output link 2. At
input 2, no cell is queued. At input 3, there are two queued cells destined for output
link 1 and one queued cell destined for output link 2. At input 4, there is one queued
cell destined for output link 1. The three stages are presented in Figure 3.3 (a), (b),
(c) separately. The detailed description is shown below:

(a) Stage 1: Request. We assume that all inputs and outputs are unmatched initially.
Only inputs 1, 3 and 4 send requests since there is no queued cell at input 2.


Figure 3.3 An example of PIMRC 1

Input 1 sends requests to output links 1 and 2. Input 3 sends requests to output
links 1 and 2, and input 4 sends a request to output link 1. Based on PIMRC
1, the input should send requests to all unmatched outputs in a group directed
to the same output link if there is at least one cell destined for this output link
queued at this input. Therefore, inputs 1 and 3 send requests to all four outputs,
and input 4 sends requests to outputs 1 and 2.
(b) Stage 2. Grant. Based on the result of stage 1, all outputs receive multiple requests from different inputs. Each output grants one and only one request, and
the choosing decision is based on a random policy. Here we assume outputs 1
and 4 grant the requests from input 1, outputs 2 and 3 grant the requests from
input 3.
(c) Stage 3. Accept. Based on the result of stage 2, only inputs 1 and 3 receive grants.
Input 1 receives two grants from outputs 1 and 4 separately, and it accepts one
and only one of them. The choosing decision is also based on a random policy.


Here we assume it accepts the grant from output 1. Similarly, input 3 receives
two grants from outputs 2 and 3 separately, and we assume it accepts the grant
from output 2 based on a random policy. So, finally two matching pairs (input
1- output 1, and input 3-output 2) are found in this iteration, and they will be
marked as “matched” so that they will not join the following iterations of the
same time slot.
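The three stages above can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulation code used in this thesis: the function name `pimrc1_iteration`, the dictionary layout of `queued` and `groups`, and the use of a seeded `random.Random` are assumptions made for the example.

```python
import random


def pimrc1_iteration(queued, groups, matched_in, matched_out, rng):
    """One request/grant/accept iteration of PIMRC 1 (illustrative sketch).

    queued[i]  : set of output groups for which input i holds a queued cell
    groups[g]  : list of the outputs that belong to output group g
    matched_in, matched_out : sets of already matched ports (updated in place)
    """
    # Stage 1 (request): an unmatched input requests every unmatched output
    # of each group for which it has at least one queued cell.
    requests = {}  # output -> list of requesting inputs
    for i, wanted in queued.items():
        if i in matched_in:
            continue
        for g in wanted:
            for o in groups[g]:
                if o not in matched_out:
                    requests.setdefault(o, []).append(i)

    # Stage 2 (grant): each requested output grants exactly one request,
    # chosen at random.
    grants = {}  # input -> list of granting outputs
    for o, inputs in requests.items():
        grants.setdefault(rng.choice(inputs), []).append(o)

    # Stage 3 (accept): each granted input accepts exactly one grant,
    # chosen at random, and the pair is marked as matched.
    matches = []
    for i, outputs in grants.items():
        o = rng.choice(outputs)
        matches.append((i, o))
        matched_in.add(i)
        matched_out.add(o)
    return matches


# Queue state of Figure 3.3: output links 1 and 2 own outputs {1, 2} and {3, 4}.
groups = {1: [1, 2], 2: [3, 4]}
queued = {1: {1, 2}, 2: set(), 3: {1, 2}, 4: {1}}
matched_in, matched_out = set(), set()
matches = pimrc1_iteration(queued, groups, matched_in, matched_out, random.Random(0))
print(matches)
```

Because the grant and accept choices are random, the pairs printed depend on the seed and need not equal the pairs assumed in Figure 3.3. What always holds is that no input or output appears in more than one pair.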
Assuming we apply PIMRC 1 to an N × N switch with B output groups, we can see that
PIMRC 1 should have a similar computing complexity to PIM, except that PIMRC 1
performs only N × B loops instead of N × N loops in step 1. So, the maximum
number of operations of one PIMRC 1 iteration is given by Equation 3.4.
$$Q_{\mathrm{total\_max\_PIMRC1}} = 2 \times N \times B + 2 \times N + 2 \times N. \tag{3.4}$$

3.2.2 RRM with Relaxed Constraint (RRMRC)
We develop RRMRC based on RRM by applying Approach 1. Similar to RRM,
RRMRC also has three stages in one iteration: request, grant, and accept. One
RRMRC iteration is exactly the same as for RRM except that an input with a cell
for a particular output group sends a request to all unmatched outputs in the group
during the request stage of one RRMRC iteration.
An example of RRMRC is shown in Figure 3.4. This example illustrates one RRMRC
iteration for a 4 × 4 switch. The 4 outputs are divided into two groups, and each
group contains 2 outputs. The outputs in the same group are multiplexed into one
high-speed output link. At input 1, there is one queued cell destined for output link
1 and there are four queued cells destined for output link 2. At input 2, no cell is queued. At
input 3, there are two queued cells destined for output link 1 and one queued cell
destined for output link 2. At input 4, there is one queued cell destined for output
link 1. The three stages of this RRMRC iteration are presented in Figure 3.4 (a), (b) and
(c) respectively. The detailed description is given below:
(a) Stage 1: Request. We assume that all inputs and outputs are unmatched initially.
Only inputs 1, 3 and 4 send requests since there is no queued cell at input 2.

Figure 3.4 An example of RRMRC

Input 1 sends requests to output links 1 and 2. Input 3 sends requests to output
links 1 and 2, and input 4 sends a request to output link 1. Based on RRMRC,
the input should send requests to all unmatched outputs of the output link if at
least one cell destined for this output link is queued at this input. Therefore,
inputs 1 and 3 send requests to all four outputs, and input 4 sends requests to
outputs 1 and 2.
(b) Stage 2. Grant. Based on the result of stage 1, all outputs receive multiple
requests from different inputs. Each output grants one and only one request,
and the choice is based on a round robin policy. Here we assume the
round robin pointers at all outputs are set to position “1” initially, which means
the requests from input 1 have the highest priority to be granted. Therefore all
outputs grant the requests from input 1. The round robin pointers at all outputs
are then advanced to position “2” since it is the location beyond the granted
input 1 (modulo 4).
(c) Stage 3. Accept. Based on the result of stage 2, only input 1 receives grants from
all four outputs. Here we also assume that round robin pointers at all inputs
are set to position “1” initially, which means the grants from output 1 own the
highest priority to be accepted. Therefore input 1 accepts the grant from output
1, and its round robin pointer is increased to position “2” since it is the location
beyond the accepted output 1 (modulo 4). The round robin pointers at inputs
2, 3 and 4 will remain at position “1” since these inputs do not accept any grant
during this iteration. Finally one matching pair (input 1- output 1) is found in
this iteration, and they will be marked as “matched” so that they will not join
the following iterations of the same time slot.
From this example, we can see that round robin pointers at all outputs are increased
to position “2”. Obviously, RRMRC also has the same problem of synchronization
of round robin pointers at outputs as RRM has.
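The pointer synchronization described above can be reproduced with a short Python sketch of one RRMRC iteration. The helper names `round_robin_pick` and `rrmrc_iteration` and the dictionary-based state are assumptions made for illustration. Positions are numbered from 1, as in the text.

```python
def round_robin_pick(candidates, pointer, size):
    """Return the first candidate at or after `pointer`, wrapping modulo `size`.

    Positions are numbered 1..size. Returns None if there is no candidate.
    """
    for step in range(size):
        pos = (pointer - 1 + step) % size + 1
        if pos in candidates:
            return pos
    return None


def rrmrc_iteration(queued, groups, out_ptr, in_ptr, matched_in, matched_out):
    """One RRMRC iteration (illustrative sketch); pointers are updated in place."""
    n = len(out_ptr)

    # Stage 1 (request): as in PIMRC 1, request all unmatched outputs of a group.
    requests = {o: set() for o in out_ptr}
    for i, wanted in queued.items():
        if i in matched_in:
            continue
        for g in wanted:
            for o in groups[g]:
                if o not in matched_out:
                    requests[o].add(i)

    # Stage 2 (grant): round robin choice; the output pointer moves one
    # position beyond the granted input, whether or not the grant is accepted.
    grants = {i: set() for i in in_ptr}
    for o, inputs in requests.items():
        if not inputs:
            continue
        i = round_robin_pick(inputs, out_ptr[o], n)
        grants[i].add(o)
        out_ptr[o] = i % n + 1

    # Stage 3 (accept): round robin choice; the input pointer moves one
    # position beyond the accepted output.
    matches = []
    for i, outputs in grants.items():
        if not outputs:
            continue
        o = round_robin_pick(outputs, in_ptr[i], n)
        in_ptr[i] = o % n + 1
        matches.append((i, o))
        matched_in.add(i)
        matched_out.add(o)
    return matches


# State of Figure 3.4: all pointers start at position 1.
groups = {1: [1, 2], 2: [3, 4]}
queued = {1: {1, 2}, 2: set(), 3: {1, 2}, 4: {1}}
out_ptr = {o: 1 for o in range(1, 5)}
in_ptr = {i: 1 for i in range(1, 5)}
matches = rrmrc_iteration(queued, groups, out_ptr, in_ptr, set(), set())
print(matches)   # [(1, 1)]
print(out_ptr)   # {1: 2, 2: 2, 3: 2, 4: 2}
```

All four output pointers move together to position 2 although only output 1 is matched, which is the synchronization effect noted above.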
Similarly, we can get the maximum number of operations of one RRMRC iteration
for an N × N switch with B output groups, shown as Equation 3.5.
$$Q_{\mathrm{total\_max\_RRMRC}} = 2 \times N \times B + 3 \times N + 3 \times N. \tag{3.5}$$


3.2.3 iSLIP-RRM with Relaxed Constraint (iSLIP-RRMRC)
We develop this scheduling algorithm based on iSLIP-RRM by applying Approach
1. Similarly, iSLIP-RRMRC has three stages in one iteration: request, grant and
accept. The only difference between iSLIP-RRM and iSLIP-RRMRC is that an
input with a cell for a particular output group sends requests to all unmatched outputs
in the group during the request stage of one iSLIP-RRMRC iteration.
An example of iSLIP-RRMRC is shown in Figure 3.5. This example illustrates one
iSLIP-RRMRC iteration for a 4 × 4 switch. The starting state is the same as in
Figure 3.4. The three stages of this iSLIP-RRMRC iteration are presented in Figure 3.5
(a), (b) and (c) respectively. The detailed description is given below:

Figure 3.5 An example of iSLIP-RRMRC


(a) Stage 1: Request. We assume that all inputs and outputs are unmatched initially.
As before in Figure 3.4, inputs 1 and 3 send requests to all four outputs, and
input 4 sends requests to outputs 1 and 2.
(b) Stage 2. Grant. Based on the result of stage 1, all outputs receive multiple
requests from different inputs. Each output grants one and only one request
based on a round robin policy. Again we assume the round robin pointers at all
outputs are set to position “1” initially, which means the requests from input 1
own the highest priority to be granted. Therefore all outputs grant the requests
from input 1. The round robin pointers at all outputs remain at position “1” at
this stage.
(c) Stage 3. Accept. Based on the results of stage 2, input 1 receives grants from all
four outputs, and no other input receives a grant. Here we also assume that the
round robin pointers at all inputs are set to position “1” initially, which means
the grants from output 1 own the highest priority to be accepted. Therefore
input 1 accepts the grant from output 1, and its round robin pointer is increased
to position “2” since it is the location beyond the accepted output 1 (modulo
4). The round robin pointers at inputs 2, 3 and 4 will remain at position “1”
since these inputs do not accept any grant during this iteration. Meanwhile the
round robin pointer at output 1 is also increased to position “2” since it is the
location beyond its granted input 1 (modulo 4). The round robin pointers at
other outputs will remain at position “1” since their grants are not accepted.
Finally only one matching pair between input 1 and output 1 is found in this
iteration. They will be marked as “matched” so that they will not join the
following iterations of the same time slot.

From Figure 3.5, we can see that the round robin pointers at the output side are no longer
synchronized after one iSLIP-RRMRC iteration: the round robin pointer at output 1
is set to position “2” while the others remain at position “1”. The problem of
synchronization of the round robin pointers at the output side in RRMRC is therefore avoided in
iSLIP-RRMRC.
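The only difference between the two pointer update rules can be isolated in a small Python sketch. The function name `advance_output_pointers` and its arguments are hypothetical and exist only for this illustration. The example feeds it the grants and the accepted pair of Figure 3.5.

```python
def advance_output_pointers(out_ptr, granted_to, accepted, n, on_accept_only):
    """Return the updated output pointers (positions 1..n).

    out_ptr        : output -> current round robin pointer
    granted_to     : output -> input that the output granted
    accepted       : set of (input, output) pairs accepted in stage 3
    on_accept_only : True for the iSLIP-RRMRC rule, False for the RRMRC rule
    """
    new_ptr = dict(out_ptr)
    for o, i in granted_to.items():
        if on_accept_only and (i, o) not in accepted:
            continue  # iSLIP-RRMRC: an unaccepted grant leaves the pointer alone
        new_ptr[o] = i % n + 1  # one location beyond the granted input
    return new_ptr


# Figure 3.5: every output grants input 1, and input 1 accepts output 1.
out_ptr = {1: 1, 2: 1, 3: 1, 4: 1}
granted_to = {1: 1, 2: 1, 3: 1, 4: 1}
accepted = {(1, 1)}

islip_ptr = advance_output_pointers(out_ptr, granted_to, accepted, 4, True)
rrm_ptr = advance_output_pointers(out_ptr, granted_to, accepted, 4, False)
print(islip_ptr)  # {1: 2, 2: 1, 3: 1, 4: 1}
print(rrm_ptr)    # {1: 2, 2: 2, 3: 2, 4: 2}
```

Under the iSLIP-RRMRC rule only output 1 advances, so in the next iteration the outputs no longer all favour the same input.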
Similarly, we can get the equation to calculate the maximum number of operations
of one iSLIP-RRMRC iteration for an N × N switch with B output groups, shown
below:
$$Q_{\mathrm{total\_max\_iSLIP\text{-}RRMRC}} = 2 \times N \times B + 2 \times N + 4 \times N. \tag{3.6}$$

3.3 Scheduling Algorithms with Relaxed Constraint Using Approach 2

3.3.1 PIM with Relaxed Constraint Version 2 (PIMRC 2)
We develop this scheduling algorithm based on PIM by applying Approach 2. Similar to the original PIM, PIMRC 2 has three stages in one iteration: request, grant,
and accept. The details of one PIMRC 2 iteration are given below:

• Stage 1: Request. Each unmatched input sends a request to each output group
for which it has at least one queued cell, provided the group still has at least one unmatched
output.
• Stage 2: Grant. There are two cases:
– Case 1: The number of unmatched outputs in an output group is greater
than or equal to the number of requests received by the output group. All
requests can be granted.
– Case 2: The number of unmatched outputs in an output group is less
than the number of requests received by the output group. Assuming the
number of unmatched outputs is K, and the number of received requests is
L, L>K, then K of the total L requests will be chosen to be granted based
on a random policy.
• Stage 3: Accept. If an unmatched input receives more than one grant from
different output groups, it chooses one and only one grant from one output
group to be accepted. The choosing decision is based on a random policy.
Once the grant from one output group is accepted, one unmatched output will
be picked up to transmit the cell. The matching pairs of inputs/outputs found in


this iteration are marked as “matched” so that they are not taken into account
for the following iterations of the same time slot.

During stage 2, the output group, instead of an individual output, generates the grants.
Once the grant from one output group is accepted by an unmatched input during
stage 3, we need a policy to pick one unmatched output in
the group to set up the matching pair. Here we apply a simple policy:
the lowest numbered input and the lowest numbered unmatched output
have the highest priority to set up the matching pair. Figure 3.6 illustrates an example of how it works. In this example, the output group contains outputs 1, 2, 3 and 4.
All four outputs except output 1 are unmatched. This group’s grants are
accepted by inputs 1 and 3. Based on our policy, output 2 has the highest priority among the unmatched outputs since it is the lowest numbered, and so does input 1 among the inputs. So, the first
matching pair is set up between input 1 and output 2, shown in Figure 3.6 (a).
Then output 3 has the highest priority to set up a connection for input 3 since it is the
lowest numbered output among the remaining unmatched outputs in this group. Therefore
another matching pair is set up between input 3 and output 3, shown in Figure 3.6
(b).

An example of one PIMRC 2 iteration is shown in Figure 3.7. This example
illustrates one PIMRC 2 iteration for a 4 × 4 switch. The four outputs are divided
into two groups, and each group contains two outputs. The outputs in the same group
are multiplexed into one high-speed output link. We use the same queues as before.
The three stages of this PIMRC 2 iteration are presented in Figure 3.7 (a), (b) and (c)
respectively:

(a) Stage 1. Request. We assume that all inputs and outputs are unmatched initially.
Only inputs 1, 3 and 4 send requests since there is no queued cell at input 2.
Input 1 sends requests to output links 1 and 2, input 3 sends requests to output
links 1 and 2, and input 4 sends a request to output link 1.
(b) Stage 2. Grant. Based on the result of stage 1, both output links receive multiple
requests from different inputs. Each output link can grant two requests at most
from two different inputs since there are two unmatched outputs at each output
link. Output link 1 receives three requests from inputs 1, 3 and 4 separately,

Figure 3.6 An example of the input/output choosing policy during stage 3 of PIMRC 2

and it will choose two of them randomly to be granted. Here we assume it
grants the requests from inputs 1 and 4. Output link 2 receives two requests
from inputs 1 and 3 separately, and it grants both of them since there are two
unmatched outputs at this output link.
(c) Stage 3. Accept. Based on the result of stage 2, input 1 receives two grants
from output links 1 and 2 separately, and it will choose one and only one to be
accepted based on a random policy. We assume it accepts the grant from output
link 1. Input 3 receives one grant from output link 2, and input 4 receives one
grant from output link 1. Both inputs will accept the only grant they receive
since they have no other choice. Therefore, output link 1 needs to set up two
matching pairs for the connection request from inputs 1 and 4, and output link
2 needs to set up one matching pair for the connection request from input 3.
Based on the choosing policy introduced above, three matching pairs are set
up: input 1- output 1, input 4- output 2, and input 3- output 3. They will be
marked as “matched” so that they are not taken into account for the following
iterations of the same time slot.
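A minimal Python sketch of one PIMRC 2 iteration follows, including the lowest-numbered output policy of Figure 3.6. It is an illustration under assumed data structures (`queued`, `groups`, sets of matched ports) and a hypothetical function name `pimrc2_iteration`, not the code used for the simulations.

```python
import random


def pimrc2_iteration(queued, groups, matched_in, matched_out, rng):
    """One PIMRC 2 iteration (illustrative sketch).

    queued[i] : set of output groups for which input i holds a queued cell
    groups[g] : list of the outputs that belong to output group g
    """
    # Unmatched outputs that each group can still offer.
    free = {g: [o for o in outs if o not in matched_out] for g, outs in groups.items()}

    # Stage 1 (request): requests go to output groups, not to single outputs.
    requests = {g: [] for g in groups}
    for i in sorted(queued):
        if i in matched_in:
            continue
        for g in queued[i]:
            if free[g]:
                requests[g].append(i)

    # Stage 2 (grant): with K unmatched outputs and L requests, grant all
    # requests if L <= K (case 1), else K requests chosen at random (case 2).
    grants = {}  # input -> list of granting groups
    for g, inputs in requests.items():
        k = len(free[g])
        chosen = inputs if len(inputs) <= k else rng.sample(inputs, k)
        for i in chosen:
            grants.setdefault(i, []).append(g)

    # Stage 3 (accept): inputs are served from the lowest numbered upwards;
    # each accepts one grant at random and takes the lowest numbered
    # unmatched output of the accepted group.
    matches = []
    for i in sorted(grants):
        g = rng.choice(grants[i])
        o = min(free[g])
        free[g].remove(o)
        matches.append((i, o))
        matched_in.add(i)
        matched_out.add(o)
    return matches


# Figure 3.6: one group with outputs 1..4, output 1 already matched,
# inputs 1 and 3 both hold cells for the group.
fig36 = pimrc2_iteration({1: {1}, 3: {1}}, {1: [1, 2, 3, 4]}, set(), {1}, random.Random(0))
print(fig36)  # [(1, 2), (3, 3)]

# Queue state of Figure 3.7; the result depends on the random choices.
groups = {1: [1, 2], 2: [3, 4]}
queued = {1: {1, 2}, 2: set(), 3: {1, 2}, 4: {1}}
fig37 = pimrc2_iteration(queued, groups, set(), set(), random.Random(0))
print(fig37)
```

A group never grants more requests than it has unmatched outputs, so the `min(free[g])` lookup in stage 3 always finds an output.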

Figure 3.7 An example of PIMRC 2

Based on the above, we can get the equations to calculate the maximum number of
operations of one PIMRC 2 iteration for an N × N switch with B output groups,
shown as follows:

• Step 1. Check the status of the N × B input queues to find out how many
inputs would send a request to each output group. It is the same as for PIMRC 1,
RRMRC and iSLIP-RRMRC. The maximum number of operations for this
step is 2 × N × B if no input queue is empty.
• Step 2. Perform B loops to perform the grant stage. Each loop contains two
operations: At first the number of requests received by each output link is
checked. If the number of the received requests L is greater than the number
of unmatched outputs K in the output group corresponding to the output link,
K of total L requests will be chosen to be granted based on a random policy,
otherwise all requests will be granted. The maximum number of operations for


this step is 2 × B if the number of requests received by each output link is not
zero.
• Step 3. Perform N loops to carry out the accept stage. Each loop contains three
operations: check the number of grants received by each input (the checking
starts from the least numbered input to the highest numbered input), and accept
one grant randomly if the number of grants received by the input is not zero,
and then pick up the least numbered output in the corresponding output group
to set up a matching pair. The maximum number of operations for this step is
3 × N if the number of grants received by each input is not zero.
Therefore, the maximum number of operations of one PIMRC 2 iteration is
as shown below:
$$Q_{\mathrm{total\_max\_PIMRC2}} = 2 \times N \times B + 2 \times B + 3 \times N. \tag{3.7}$$

3.3.2 Semi-RRM with Relaxed Constraint (Semi-RRMRC)
We develop this scheduling algorithm based on RRM by applying Approach 2. Due
to the difficulty of applying a round robin arbiter at the output side, we apply one only
at the input side, where it helps each input
choose which grant to accept. In RRM and RRMRC, by contrast,
a round robin policy is applied at both the input side and the
output side. That is why we refer to this scheduling algorithm as “Semi-RRMRC”. Semi-RRMRC also contains three stages in one iteration: request, grant,
and accept. The details of one Semi-RRMRC iteration are given below:
• Stage 1: Request. Each unmatched input sends a request to each output group for
which it has at least one queued cell, provided the output group still has unmatched
outputs.
• Stage 2: Grant. There are two cases:
– Case 1: the number of unmatched outputs in the output group is greater
than or equal to the number of requests received by the output group. All
requests can be granted.


– Case 2: the number of unmatched outputs in the output group is less
than the number of requests received by the output group. Assuming the
number of unmatched outputs is K, and the number of received requests
is L, L>K, then K of the total L requests will be chosen to be granted
based on a random policy. It is very difficult to use a round
robin arbiter at each output link for this task, since a
single round robin arbiter cannot determine multiple grants. Therefore we
do not apply a round robin policy at this stage.
• Stage 3: Accept. If an unmatched input receives more than one grant, it
chooses one and only one grant to be accepted, and the choosing decision is
based on a round robin policy. There is a round robin arbiter at each input. The
input accepts the grant from the output group that is next on its round robin
arbiter. Then the position of its round robin pointer is increased to the location
beyond the accepted output group (modulo B). Semi-RRMRC applies the same
policy as PIMRC 2 to choose the unmatched output(s) of the output group to
set up the matching pair. The matching pairs found in one Semi-RRMRC iteration will be marked as “matched” so that they will not join the following
iterations of the same time slot.

An example of Semi-RRMRC is shown in Figure 3.8. This example illustrates one
Semi-RRMRC iteration for an 8x8 switch. The 8 outputs of the switch are divided
into four output groups. Each output group contains 2 outputs. The outputs in the
same output group are multiplexed into one high-speed output link. Only inputs 1, 4
and 8 have queued cells. At input 1, there is one queued cell destined for output link
1, and there are two queued cells destined for output link 2, four queued cells destined for output
link 3, and two queued cells destined for output link 4. At input 4, there are three
queued cells destined for output link 1, four queued cells destined for output link 2,
one queued cell destined for output link 3, and two queued cells destined for output
link 4. At input 8, there are three queued cells destined for output link 1, four queued
cells destined for output link 2, one queued cell destined for output link 3, and two
queued cells destined for output link 4. There is no queued cell at inputs 2, 3, 5, 6 and
7. The three stages of this Semi-RRMRC iteration are presented in Figure 3.8 (a),
(b) and (c) respectively. The detailed description is given below:

Figure 3.8 An example of Semi-RRMRC

(a) Stage 1. Request. We assume that all inputs and outputs are unmatched initially.
Only inputs 1, 4 and 8 send requests since there is no queued cell at other
inputs. Input 1 sends requests to output links 1, 2, 3 and 4 since there are
queued cells destined for those four output links at input 1. So do inputs 4 and
8.
(b) Stage 2. Grant. Based on the result of stage 1, output links 1, 2, 3 and 4 each receive
three requests, from inputs 1, 4 and 8. Each output link can grant
at most two requests from two different inputs since there are two unmatched
outputs at each output link. The choice is based on a random policy.
We assume the following result: output link 1 grants the requests from inputs
1 and 4, output link 2 grants the requests from inputs 4 and 8, output link 3
grants the requests from inputs 1 and 4, and output link 4 grants the requests
from inputs 1 and 4.
(c) Stage 3. Accept. Based on the result of stage 2, input 1 receives three grants from
output links 1, 3 and 4 separately, input 4 receives four grants from output links
1, 2, 3 and 4 separately and input 8 receives only one grant from output link
2. We assume that round robin pointers at all inputs are set to position “1”
initially, which means the grants from output link 1 own the highest priority
to be accepted. Therefore input 1 accepts the grant from output link 1, and its
round robin pointer is increased to position “2” since it is the location beyond
the accepted output link 1 (modulo 4). Similarly, input 4 accepts the grant
from output link 1 too, and its round robin pointer is increased to position “2”
for the same reason. Input 8 accepts the grant from output link 2 since it
has no other choice, and its round robin pointer is increased to position “3”
since it is the location beyond the accepted output link 2 (modulo 4). We can
see that output link 1 needs to set up two matching pairs for the requests from
input 1 and 4 separately, and output link 2 needs to set up a matching pair for
input 8. Based on the choosing policy introduced before, three matching pairs
are found after this Semi-RRMRC iteration: input 1- output 1, input 4- output
2, and input 8- output 3. These inputs/outputs will be marked as “matched”
so that they are not taken into account for the following iterations of the same
time slot.
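The accept stage of this example can be traced with the following Python sketch. The names `round_robin_pick` and `semi_rrmrc_accept` and the dictionary layout are assumptions made for illustration. The grants passed in are the ones assumed in stage 2 of Figure 3.8.

```python
def round_robin_pick(candidates, pointer, size):
    """Return the first candidate at or after `pointer`, wrapping modulo `size`.

    Positions are numbered 1..size. Returns None if there is no candidate.
    """
    for step in range(size):
        pos = (pointer - 1 + step) % size + 1
        if pos in candidates:
            return pos
    return None


def semi_rrmrc_accept(grants, in_ptr, free_outputs, num_groups):
    """Accept stage of Semi-RRMRC (illustrative sketch).

    grants       : input -> set of output groups that granted its request
    in_ptr       : input -> round robin pointer over groups 1..num_groups
                   (updated in place)
    free_outputs : group -> list of unmatched outputs (updated in place)
    """
    matches = []
    for i in sorted(grants):  # lowest numbered input is served first
        if not grants[i]:
            continue
        g = round_robin_pick(grants[i], in_ptr[i], num_groups)
        in_ptr[i] = g % num_groups + 1  # one location beyond the accepted group
        o = min(free_outputs[g])        # lowest numbered unmatched output
        free_outputs[g].remove(o)
        matches.append((i, o))
    return matches


# Grants assumed in stage 2 of Figure 3.8 (8 x 8 switch, four groups of two).
grants = {1: {1, 3, 4}, 4: {1, 2, 3, 4}, 8: {2}}
in_ptr = {i: 1 for i in range(1, 9)}
free_outputs = {1: [1, 2], 2: [3, 4], 3: [5, 6], 4: [7, 8]}
matches = semi_rrmrc_accept(grants, in_ptr, free_outputs, 4)
print(matches)                          # [(1, 1), (4, 2), (8, 3)]
print(in_ptr[1], in_ptr[4], in_ptr[8])  # 2 2 3
```

The sketch relies on the grant stage never granting more requests per group than the group has unmatched outputs, as the grant stage described above guarantees.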

Based on the above, we can see that Semi-RRMRC has a similar computing procedure to PIMRC 2 except each loop in step 3 in a Semi-RRMRC iteration performs
the accept decision based on a round robin policy instead of a random policy, and it
needs one additional operation to update the status of the round robin pointers at the
input side. Therefore, the total maximum number of operations of one Semi-RRMRC
iteration is as shown below:
$$Q_{\mathrm{total\_max\_Semi\text{-}RRMRC}} = 2 \times N \times B + 2 \times B + 4 \times N. \tag{3.8}$$

3.4 Multiple Iterations of Scheduling Algorithms with Relaxed Constraint

The examples in previous sections illustrate only one iteration, and we can see that
these new algorithms cannot find a complete set of matching pairs with only one iteration per time slot. We therefore do not expect these new scheduling algorithms to
achieve satisfactory throughput with only one iteration per time slot;
multiple iterations per time slot are necessary.
For PIMRC 1 and PIMRC 2, we mark all matching pairs (input/output) found in one
iteration as “matched”, and they do not join the following iterations of the same time
slot. After one time slot, all outputs and inputs are restored to “unmatched” status. For RRMRC, iSLIP-RRMRC and Semi-RRMRC, we apply the same policy for
updating the status of inputs/outputs as for PIMRC 1 and PIMRC 2. In addition,
we update their round robin arbiters during each iteration within one time slot.
When these new algorithms run with multiple iterations, matched inputs/outputs do
not join the scheduling of the following iterations. Therefore the later iterations
should have less computing complexity than the earlier iterations, and computing the
overall complexity of these algorithms becomes more involved when multiple iterations are
applied.

3.5 Summary

In this chapter, the operations of a series of new scheduling algorithms developed
by us, viz. PIMRC 1, RRMRC, iSLIP-RRMRC, PIMRC 2 and Semi-RRMRC, are
presented in detail, including their operation with multiple iterations. We also give
the equations to calculate the maximum number of operations of one iteration for all
algorithms, including the new algorithms and the original algorithms. Based on those
equations, we calculate the maximum number of operations of one iteration
for the case that the new algorithms and the original algorithms (viz. PIM, RRM and
iSLIP-RRM) are applied in a 16 × 16 switch and a 32 × 32 switch, where all outputs
are divided into 4 groups (each group contains an equal number of outputs). The
calculation results are given in Table 3.1.

Table 3.1
The calculation result of the maximum number of operations of one scheduling iteration

Table 3.2
The comparison result for the maximum number of operations of one iteration between the
new and original scheduling algorithms
Based on the results in Table 3.1, we present the comparison between the new algorithms and the originals in Table 3.2. We can see that the reduction in the maximum
number of operations of one iteration becomes more significant when the switch size
increases. Although the equations used to calculate these results give only an estimate of the computing complexity, and some other factors may have a considerable
influence on the actual computing complexity as we said before, the results in Tables 3.1 and 3.2 are a good indication that the new algorithms can run much faster
than the originals.
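Equations 3.4 to 3.8 can be evaluated directly for the two switch sizes considered. The short Python sketch below does so for the five new algorithms only. The function name `max_operations` is an assumption, and the printed values are simply the equations evaluated, not a transcription of Table 3.1.

```python
def max_operations(n, b):
    """Maximum operations of one iteration for an n x n switch with b output
    groups, following Equations 3.4 to 3.8."""
    return {
        "PIMRC 1": 2 * n * b + 2 * n + 2 * n,      # Equation 3.4
        "RRMRC": 2 * n * b + 3 * n + 3 * n,        # Equation 3.5
        "iSLIP-RRMRC": 2 * n * b + 2 * n + 4 * n,  # Equation 3.6
        "PIMRC 2": 2 * n * b + 2 * b + 3 * n,      # Equation 3.7
        "Semi-RRMRC": 2 * n * b + 2 * b + 4 * n,   # Equation 3.8
    }


for n in (16, 32):
    print(f"{n} x {n} switch, 4 output groups:")
    for name, ops in max_operations(n, 4).items():
        print(f"  {name:12s} {ops}")
```

For the 16 × 16 switch with 4 groups this gives, for example, 192 operations for PIMRC 1 and 184 for PIMRC 2.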

Chapter 4
Simulation and Results
4.1 Introduction

The performance of the scheduling algorithms is measured by simulation. In order to
show the advantage of the new scheduling algorithms over the original
scheduling algorithms, we also carry out simulations for the originals, viz. PIM, RRM
and iSLIP-RRM. We apply a 2-state Markov Modulated Bernoulli Process
(2-MMBP) [19, 3, 20] as the traffic model in our simulations since it is a widely
used traffic model for IP-based bursty traffic in research on scheduling
algorithms for switching fabrics. Although Internet traffic is widely considered
self-similar [21], the self-similar traffic model is not widely applied in this research field. Therefore the 2-MMBP model, instead of the self-similar traffic model, is applied in the simulation
to limit the complexity of the simulation. Applying the self-similar traffic model is
left for future research, where it may give better simulation results. Matlab is
used as the simulation platform since it is a stable and widely recognized simulation tool.
This chapter is organized as follows: Section 2 introduces the details of the simulation
environment, including a detailed introduction to the 2-MMBP traffic model, and
Section 3 gives the simulation results and our analysis in detail. Finally, Section 4
concludes the chapter.

Figure 4.1 The switch model used in the simulation

4.2 Simulation Environment

4.2.1 General Simulation Parameters
We model two switching fabrics in our simulations: a 16 × 16 switching fabric and
a 32 × 32 switching fabric, shown in Figure 4.1. When we apply the new scheduling
algorithms developed by us to both switches, we divide all the outputs of the switches
into four groups. Each group contains an equal number of outputs. The outputs in the
same output group are multiplexed into a high-speed output link. When we apply the
original scheduling algorithms to the switching fabric, they run as the normal 16 × 16
or 32 × 32 switching fabric. We introduce the 2-MMBP traffic source in detail in the
following parts of this section. The general simulation setting is shown in Table 4.1.

4.2.2 2-MMBP Traffic Model
The 2-state Markov Modulated Bernoulli traffic source was originally introduced by
Adas [3]. It has been widely accepted as an appropriate simulation model for
IP-based bursty traffic since it reproduces the bursty character of IP traffic. It is
illustrated in Figure 4.2, which shows that the 2-MMBP traffic model has
two states. In state 1, the traffic model generates traffic (cells) with probability q
and behaves as a Bernoulli process, and in state 2, the traffic model generates traffic
(cells) with probability p and behaves as a Bernoulli process. The probability of
changing from state 1 to state 2 is α, and the probability of changing from state 2 to
state 1 is β. We can use the quadruple (α, β, p, q) to specify any 2-MMBP traffic
model [20].

Table 4.1 General simulation setting

Figure 4.2 2-MMBP traffic model [3]

Figure 4.3 The on-off traffic model
Based on the analysis in [20], the mean cell arrival rate for a 2-MMBP traffic
model is shown as Equation 4.1.
$$\lambda = \frac{p\alpha + q\beta}{\alpha + \beta}. \tag{4.1}$$
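Equation 4.1 can be checked against the stationary distribution of the two-state chain. The short derivation below assumes the chain is in steady state and writes π₁ and π₂ for the stationary probabilities of states 1 and 2. This notation is introduced here only for the check and does not appear in [20].

```latex
\begin{aligned}
\pi_1 \alpha &= \pi_2 \beta, \qquad \pi_1 + \pi_2 = 1
  \;\Longrightarrow\;
  \pi_1 = \frac{\beta}{\alpha + \beta}, \quad
  \pi_2 = \frac{\alpha}{\alpha + \beta}, \\
\lambda &= q\,\pi_1 + p\,\pi_2
  = \frac{p\alpha + q\beta}{\alpha + \beta}.
\end{aligned}
```

The first relation balances the probability flow from state 1 to state 2 against the flow back. The second line weights the cell generation probability of each state by the fraction of time spent in it.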

In our simulation, we apply a special version of the 2-MMBP traffic source: the on-off
traffic model [3], shown in Figure 4.3. In the on-off traffic model, the probability of
generating traffic in the off state is zero, and the probability of generating traffic in the
on state is 1. Therefore the quadruple for the on-off traffic model is (α, β, 0, 1), and
the mean cell arrival rate λ for the on-off traffic source is shown as Equation 4.2.
$$\lambda = \frac{\beta}{\alpha + \beta}. \tag{4.2}$$

We apply an independent on-off traffic source at each input in the simulations. We
assume that the buffer space at each input is infinite and that each input can submit at most one cell
per time slot; therefore an on-off traffic source at each input can be modeled
as a 2-MMBP/D/1/∞ traffic model [20]. Its traffic load is shown below:
$$\rho = \frac{\lambda}{\nu}, \tag{4.3}$$
where ν is the service rate. In the simulation, the service rate ν is 1 cell per time slot.
Therefore, the traffic load for the 2-MMBP/D/1/∞ model is equal to λ.


The on-off traffic source generates a cell at each time slot if the traffic source remains
in the on state. All cells generated during this period, which is ended when the
traffic source is switched back to the off state, are regarded as a burst. A burst of
cells is directed to the same destination. In our simulations, one of the 4 output
groups is chosen randomly as the burst destination for the scheduling algorithms
with relaxed constraint, and one of the 16 or 32 outputs is chosen randomly as the
burst destination for the original scheduling algorithms.
The average burst length l for the on-off traffic source is shown below:
$$l = \frac{1}{\alpha}. \tag{4.4}$$
For the full-load on-off traffic model, the off state does not exist since the full-load
on-off traffic source generates traffic all the time. However, the bursty character still
remains for the full-load on-off traffic since the cells’ destination can change
randomly between bursts. In that case the probability α is the probability of generating a
new burst.
Therefore, when we apply the on-off traffic source in our simulations, we just need
to specify the values of traffic load ρ and average burst length l in advance.
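The mapping from (ρ, l) to the model parameters follows from Equations 4.2 to 4.4: α = 1/l, and solving ρ = β/(α + β) for β gives β = ρα/(1 − ρ). The Python sketch below applies this mapping and generates cells from one on-off source. It is an illustration under stated assumptions, not the Matlab code used in this thesis: the function names are hypothetical, groups are numbered from 0, and only loads below 1 are handled because β must remain a probability. The full-load case, where the off state does not exist, is outside this sketch.

```python
import random


def on_off_parameters(load, mean_burst_len):
    """Return (alpha, beta) of an on-off source for a given load and burst length."""
    if not 0.0 < load < 1.0:
        raise ValueError("this sketch handles loads strictly between 0 and 1")
    alpha = 1.0 / mean_burst_len        # Equation 4.4: l = 1 / alpha
    beta = load * alpha / (1.0 - load)  # Equation 4.2 solved for beta
    if beta > 1.0:
        raise ValueError("beta would exceed 1 for this load and burst length")
    return alpha, beta


def on_off_source(alpha, beta, num_groups, slots, rng):
    """Yield, for each time slot, the destination group of the generated cell,
    or None when the source is in the off state.

    All cells of one burst share a destination, drawn uniformly from
    0..num_groups-1 when the burst starts (numbering is an assumption).
    """
    on = False
    dest = None
    for _ in range(slots):
        if on:
            if rng.random() < alpha:  # on -> off with probability alpha
                on = False
        elif rng.random() < beta:     # off -> on with probability beta
            on = True
            dest = rng.randrange(num_groups)
        yield dest if on else None


alpha, beta = on_off_parameters(load=0.5, mean_burst_len=8)
print(alpha, beta)  # 0.125 0.125

cells = list(on_off_source(alpha, beta, num_groups=4, slots=20, rng=random.Random(1)))
print(cells)
```

Over a long run the fraction of slots that carry a cell should approach the requested load, which gives a simple sanity check on the parameters.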

4.3 Simulation Results and Analysis

4.3.1 Simulation Results for the 16 × 16 Switch
4.3.1.1 Throughput

When we carry out our simulation, we focus on the performance comparison among
scheduling algorithms with the minimum number of iterations required to achieve an
acceptable throughput. Figures 4.4 to 4.11 are the throughput comparison results for
all scheduling algorithms with different numbers of iterations under the condition that
the average burst length l is 8 cells. The number of iterations varies from one to four or
five.
Based on these results, and assuming we need a worst case throughput of 98%, Table 4.2 shows the minimum number of iterations recommended for each algorithm.


Figure 4.4 Throughput comparison for PIM with different number of iterations (16 × 16
switch, average burst length=8 cells)

Figure 4.5 Throughput comparison for RRM with different number of iterations (16 × 16
switch, average burst length=8 cells)

Figure 4.6 Throughput comparison for iSLIP-RRM with different number of iterations (16×
16 switch, average burst length=8 cells)


Figure 4.7 Throughput comparison for PIMRC 1 with different number of iterations (16×16
switch, average burst length=8 cells)

Figure 4.8 Throughput comparison for RRMRC with different number of iterations (16 × 16
switch, average burst length=8 cells)

Figure 4.9 Throughput comparison for iSLIP-RRMRC with different number of iterations
(16 × 16 switch, average burst length=8 cells)

Figure 4.10 Throughput comparison for PIMRC 2 with different number of iterations (16 ×
16 switch, average burst length=8 cells)

Figure 4.11 Throughput comparison for Semi-RRMRC with different number of iterations
(16 × 16 switch, average burst length=8 cells)

Table 4.2
The minimum number of iterations recommended for scheduling algorithms to achieve acceptable throughput (average burst length=8 cells)

Scheduling algorithm    Minimum number of iterations per cell time
PIMRC 1                 3
RRMRC                   2
iSLIP-RRMRC             2
PIMRC 2                 4
Semi-RRMRC              4
PIM                     4
RRM                     3
iSLIP-RRM               3

Figure 4.12 Throughput comparison for scheduling algorithms with recommended minimum
number of iterations to achieve acceptable throughput (16×16 switch, average burst length=8
cells)

Detailed data can be found in Appendix B. Figure 4.12 reproduces the throughput
data for the cases recommended in Table 4.2. The change of scale allows a clearer
view of the performance.
The numerical data used to plot Figure 4.12 are reproduced in Table 4.3,
and the recommended number of iterations for the original algorithms is consistent with previous research [8, 2, 9]. We can see that all scheduling algorithms achieve 100%
throughput under light traffic load, and that throughput drops slightly as the traffic
load increases. However, even under full traffic load, the worst case is a throughput of 97.98%.

Table 4.3
Throughput comparison for scheduling algorithms with minimum number of iterations to
achieve their highest throughput (16 × 16 switch, average burst length=8 cells)

4.3.1.2 Average Queue Length

We now compare the average queue lengths of these scheduling algorithms, each with its minimum number of iterations.
The average queue lengths for an average burst length of 8 cells are shown in
Figure 4.13. The average queue lengths are consistent with the throughput: a
scheduling algorithm with a higher throughput has a shorter average queue length.
Under 100% traffic load, all of the algorithms must produce a queue that grows without
bound, since none of them achieves 100% throughput on average. However, as soon as the
load is decreased to 90%, the queue sizes become relatively small. This behavior is
acceptable. The algorithm with the longest average queue and the lowest throughput
under full traffic load is 2-iteration RRMRC. It is notable that an algorithm can
achieve this level of performance with only two iterations, and 2-iteration iSLIP-RRMRC produces even better results.
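The growth of the queue at full load can be made concrete with a rough worked estimate. It assumes that throughput θ is the ratio of delivered cells to offered cells, that ρ is the offered load per input in cells per time slot, and that the backlog grows linearly over a run of T time slots; these symbols and assumptions are introduced here for illustration and are not taken from the simulation code.

\[
Q(T) \approx \rho\,(1-\theta)\,T, \qquad \bar{Q} \approx \tfrac{1}{2}\,\rho\,(1-\theta)\,T .
\]

For example, with ρ = 1, θ = 0.9798 and T = 100,000 time slots, this gives a final backlog of about 0.0202 × 100,000 = 2020 cells per input and a time-averaged backlog of about 1010 cells. Under the same assumptions the average grows in proportion to T, so a measured average queue length at full load depends on the simulation duration.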
4.3.1.3 Average Processing Time (Processing Time per Time Slot)

We now compare the speed of all these scheduling algorithms, since they
all achieve a satisfactory throughput with an appropriate number of iterations. A
faster scheduling speed indicates less computing time. During the simulation, we record
the simulation time t_l spent executing the scheduling algorithm, in milliseconds (ms);
the part of the simulation time spent by the traffic model generating traffic is excluded.
The average processing time a_l is calculated according to Equation 4.5.

\[ a_l = \frac{t_l}{\text{simulation duration}}. \tag{4.5} \]

Figure 4.13 Average queue length comparison for scheduling algorithms with recommended
minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average
burst length=8 cells)

The simulation duration is the number of time slots used for the simulation. According to Table 4.1, it is 100,000 for the 16 × 16 switch and 10,000 for the 32 × 32
switch. The results are shown in Table 4.4 and Figure 4.14. We see that the original
scheduling algorithms, viz. 4-iteration PIM, 3-iteration RRM and 3-iteration iSLIP-RRM, have a much longer average processing time than all the new scheduling algorithms. As the traffic load increases, the speed advantage of
the new algorithms becomes more obvious. All the new scheduling algorithms achieve an average processing time that is only 9%∼31% of that of the
original scheduling algorithms.
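The measurement described by Equation 4.5 can be sketched as follows. This is only an illustration under stated assumptions: the function names `timed_simulation` and `average_processing_time`, and the placeholder traffic generator and scheduler passed to them, are hypothetical and do not reproduce the simulator used in this thesis.

```python
import time


def average_processing_time(t_l_ms, simulation_duration):
    """Equation 4.5: scheduler time in ms divided by the number of time slots."""
    return t_l_ms / simulation_duration


def timed_simulation(generate_traffic, schedule, num_slots):
    """Run num_slots time slots and return the accumulated scheduler time in ms.

    Only the call to `schedule` is timed, so the time spent generating
    traffic is excluded, as in the text.
    """
    t_l_ms = 0.0
    for slot in range(num_slots):
        requests = generate_traffic(slot)    # not timed
        start = time.perf_counter()
        schedule(requests)                   # timed
        t_l_ms += (time.perf_counter() - start) * 1000.0
    return t_l_ms


# Placeholder traffic generator and scheduler, used only to exercise the timing loop.
calls = []
t_l = timed_simulation(lambda slot: slot, calls.append, num_slots=1000)
a_l = average_processing_time(t_l, 1000)
```

Because a_l is a per-slot average, runs of different duration (100,000 and 10,000 time slots) remain comparable.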

Based on the results in Table 4.4 and Figure 4.14, we can see that these simulation results are consistent with Tables 3.1 and
3.2. We can also see that 2-iteration RRMRC and 2-iteration iSLIP-RRMRC take
the least average processing time among all algorithms in our simulations.

Table 4.4
Average processing time comparison (16 × 16 switch, average burst length=8 cells)

Figure 4.14 Average processing time comparison (16 × 16 switch, average burst length=8
cells)

Figure 4.15 Throughput comparison for 3-iteration PIMRC 1 with different average burst
length (16 × 16 switch)

4.3.1.4 The Influence of Average Burst Length

We now examine how the average burst length of the traffic model influences
the performance of the algorithms. An indication of the effect of increased burst
length is given in Figures 4.15∼4.17. For the single case of PIMRC 1 with 3
iterations, these figures show throughput vs. traffic load, average queue length vs.
traffic load, and average processing time vs. traffic load. With a longer average
burst length, the performance in terms of throughput and average queue length
decreases slightly under heavy traffic load. However, the average
processing time also decreases slightly with a longer burst length, especially under
heavy traffic load; the decrease is not very visible in Figure 4.17, but it can be
seen in the detailed simulation results in Appendix B. We obtain similar results
for the influence of average burst length on the performance of RRMRC, iSLIP-RRMRC, PIMRC 2 and Semi-RRMRC. Their detailed simulation results can also be
found in Appendix B. Therefore, it is reasonable to conclude that the average burst length
of the traffic model generally does not have much influence on the performance of
the new algorithms with an appropriate number of iterations.
4.3.1.5 Summary

All these simulations are repeated for average burst lengths of 16 and 32 cells. In both
cases, the results are very similar to those presented above for an average burst length
of 8 cells. The details are also given in Appendix B.
In summary, for a switch model with four output groups,
all new algorithms with an appropriate number
of iterations achieve an excellent throughput (>98%), with an
average queue length comparable to that of the original algorithms, while taking much
less average processing time (only 9%∼31% of that of the original ones). Among all the
new algorithms, 2-iteration RRMRC and 2-iteration iSLIP-RRMRC achieve the most
satisfactory performance, since they take the least average processing time for the
16 × 16 switch model. Given our analysis of the computing complexity in
the previous chapter, we can conclude that the above simulation results are reasonable.

Figure 4.16 Average queue length comparison for 3-iteration PIMRC 1 with different average burst length (16 × 16 switch)

Figure 4.17 Average processing time comparison for 3-iteration PIMRC 1 with different
average burst length (16 × 16 switch)

4.3.2 Simulation Results for the 32 × 32 Switch
Similarly, we obtain the simulation results for the 32 × 32 switch; the detailed
data are given in Appendix B. The analysis procedure is very similar to that for the
16 × 16 switch. Here we focus on comparing the performance of all the algorithms
between the 16 × 16 switch and the 32 × 32 switch. Table 4.5 shows the minimum number
of iterations recommended for all algorithms to achieve the acceptable throughput.
We can see that the recommended minimum number of iterations for some algorithms,
including all the original algorithms, RRMRC, and iSLIP-RRMRC, meets the rule
of since they need only one more iteration for the 32 × 32 switch than for the 16 × 16
switch. The recommended minimum number of iterations for PIMRC 1, PIMRC
2, and Semi-RRMRC remains the same as that for the 16 × 16 switch.
Table 4.6 presents the numerical throughput data for the algorithms
with the recommended minimum number of iterations. There is a slight
decrease in throughput under very high traffic load compared with the results for the 16 × 16 switch. However, the worst case for the new algorithms is
3-iteration iSLIP-RRMRC, which achieves a throughput of 95.94% when the
average burst length is 32 cells and the traffic load is full. Compared with the more
obvious decrease in throughput for the originals (the worst being a throughput of 91.63% achieved by 4-iteration iSLIP-RRM when the average burst
length is 32 cells and the traffic load is full), we can conclude that the slight decrease
in throughput for the new algorithms is acceptable. Because different
simulation durations are applied to the two switch models, we do not compare
the average queue length between them, since it is affected by the
simulation duration. However, the average queue length of each algorithm
still corresponds to its throughput, as presented in detail in Appendix B.

Table 4.5
The minimum number of iterations recommended for scheduling algorithms to achieve acceptable throughput (32 × 32 switch)

Scheduling algorithm    Minimum number of iterations per cell time
PIMRC 1                 3
RRMRC                   3
iSLIP-RRMRC             3
PIMRC 2                 4
Semi-RRMRC              4
PIM                     5
RRM                     4
iSLIP-RRM               4

The average processing times of the algorithms are shown in Table 4.7. All algorithms take a longer average processing time for
the 32 × 32 switch than for the 16 × 16 switch. The increase is due to the
longer time per iteration and, for some algorithms, a higher number of iterations. The most
impressive result in Table 4.7 is that the speed advantage of the
new algorithms is much more significant for the 32 × 32 switch than for
the 16 × 16 switch. All the new algorithms take only about 10% of the average processing time of the original algorithms under traffic loads ranging from 0.1 to 1 for
the 32 × 32 switch, while this value is 9%∼31% for the 16 × 16 switch. These
results generally match the prediction in the previous chapter. We also obtain similar
results for the effect of the average burst length on performance for
the 32 × 32 switch as for the 16 × 16 switch. The performance in terms of throughput and
average queue length under full traffic load decreases slightly when the average burst length
increases, while the average processing time decreases slightly. The details of
the influence of the average burst length are also given in Appendix B.
In summary, the new algorithms generally achieve a lower throughput
for the 32 × 32 switch than for the 16 × 16 switch, especially under heavy
traffic load, although the worst case is still a throughput of 95.94%. Their average queue length corresponds to their throughput, as it does for
the 16 × 16 switch. They also have a longer average processing time for the 32 × 32
switch than for the 16 × 16 switch, which matches our prediction in the previous chapter. However, the speed advantage of the new algorithms over the
originals is more significant for the 32 × 32 switch than for the 16 × 16
switch, which was also predicted in the previous chapter. All the new
algorithms with an appropriate number of iterations have a similar average processing time (ranging from 4 to 6 ms under full traffic load), and 3-iteration PIMRC
1 achieves the highest throughput (>99%) among all the new algorithms.

Table 4.6
The throughput comparison for the scheduling algorithms with recommended minimum
number of iterations to achieve acceptable throughput (32 × 32 switch)

Table 4.7
Average processing time comparison for scheduling algorithms with recommended minimum
number of iterations to achieve acceptable throughput (32 × 32 switch)

4.4 Conclusion

We carry out the simulations for all algorithms, viz. the new algorithms and the original algorithms, for both the 16 × 16 switch and the 32 × 32 switch. The simulation
results are generally consistent with our prediction in the previous chapter and with previous research [8, 2, 9]. All our new scheduling algorithms, viz.
PIMRC 1, RRMRC, iSLIP-RRMRC, PIMRC 2 and Semi-RRMRC, achieve a satisfactory performance compared to the original scheduling algorithms, viz. PIM, RRM
and iSLIP-RRM. With an appropriate number of iterations, all our new scheduling
algorithms achieve a satisfactory throughput and average queue length, similar to or better than those of the original scheduling algorithms with an appropriate
number of iterations. Meanwhile, our new algorithms are faster
than the original algorithms: the average processing time of all new scheduling algorithms is only 9%∼31% of that of the original scheduling algorithms for the
16 × 16 switch and about 10% for the 32 × 32 switch. The simulation results indicate that a larger switch size makes the speed advantage
of the new algorithms over the originals more obvious. A larger switch size also results
in a slightly lower throughput, although it remains at an acceptable level (>95%).
Finally, the simulation results indicate that the influence of the
average burst length is negligible. Based on these simulation results, we can conclude
that our new algorithms offer a significant performance improvement over
the originals.
We can also see that 2-iteration RRMRC and 2-iteration iSLIP-RRMRC perform best
for the 16 × 16 switch, and 3-iteration PIMRC 1 performs best for the 32 × 32 switch.

Chapter 5
Summary and Comments
5.1 Summary of Dissertation

In this thesis, we present a series of new scheduling algorithms developed by us based
on a simple idea: all outputs of the ATM-like crossbar input-queued switching fabric
of an IP router are divided into a few output groups, and outputs in the same output
group are multiplexed into one high-speed output link. Therefore traffic (cells) can
be directed to any member of a group of available outputs instead of one specific
output.
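The grouping idea can be sketched in a few lines. The contiguous assignment of outputs to groups, the first-free selection rule, and the names `group_members` and `first_free_output` are assumptions chosen for illustration; they are not the selection policy of any of the algorithms in this thesis.

```python
def group_members(group, num_outputs=16, num_groups=4):
    """Outputs belonging to one output group (assumed contiguous and equal-sized)."""
    size = num_outputs // num_groups
    return range(group * size, (group + 1) * size)


def first_free_output(group, busy_outputs, num_outputs=16, num_groups=4):
    """Return any still-unmatched output of the group, or None if all are taken.

    A cell addressed to a group is satisfied by whichever member is free,
    which is the relaxed constraint; picking the first free member is only
    one possible policy.
    """
    for output in group_members(group, num_outputs, num_groups):
        if output not in busy_outputs:
            return output
    return None


# Group 1 of a 16 x 16 switch holds outputs 4..7; outputs 4 and 5 are already matched.
chosen = first_free_output(group=1, busy_outputs={4, 5})
```

With a single specific output, the cell above would have to wait whenever that output is matched; with a group, it is blocked only when every member of the group is matched.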
We develop our new algorithms mainly from three popular existing scheduling algorithms: PIM, RRM and iSLIP-RRM. We carry out a set of simulations with both a
16 × 16 switch model and a 32 × 32 switch model to test the performance of our new
scheduling algorithms. The simulation results indicate that all of our new algorithms,
viz. PIMRC 1, RRMRC, iSLIP-RRMRC, PIMRC 2, and Semi-RRMRC, can achieve
a satisfactory throughput (>98% in the worst case for the 16 × 16 switch, or >95%
in the worst case for the 32 × 32 switch) with an appropriate number of iterations,
compared to the original scheduling algorithms with an appropriate number of iterations. At the same time, the average queue length is also satisfactory. Meanwhile,
all our new scheduling algorithms achieve a significant improvement in computing
speed compared to the original scheduling algorithms: the average processing time
of all the new algorithms is only 9%∼31% of that of the original scheduling algorithms for the 16 × 16 switch and about 10% for the 32 × 32 switch. This is a
significant performance improvement over the original scheduling algorithms.

5.2 Future Study

In this thesis we do not investigate the fairness property of our new scheduling algorithms, although we have no reason to think that our new scheduling algorithms
have a worse fairness performance than the original algorithms due to the similar
scheduling policy.

Bibliography
[1] C. Metz, “IP routers: new tool for gigabit networking,” IEEE Internet Computing, vol. 2, no. 6, pp. 14–18, 1998.
[2] N. McKeown, “The iSLIP scheduling algorithm for input-queued switches,” IEEE/ACM Transactions on Networking, vol. 7, no. 2, pp. 188–201, 1999.
[3] A. Adas, “Traffic models in broadband networks,” IEEE Communications Magazine, vol. 35, no. 7, pp. 82–89, 1997.
[4] S. Keshav and R. Sharma, “Issues and trends in router design,” IEEE Communications Magazine, vol. 36, no. 5, pp. 144–151, 1998.
[5] N. Yamanaka, “Next generation Internet backbone router,” in Joint 4th IEEE International Conference on ATM (ICATM 2001) and High Speed Intelligent Internet Symposium, 2001, pp. 316–319.
[6] J. Chao, “Saturn: a terabit packet switch using dual round robin,” IEEE Communications Magazine, vol. 38, no. 12, pp. 78–84, 2000.
[7] M. Karol, M. Hluchyj, and S. Morgan, “Input Versus Output Queueing on a Space-Division Packet Switch,” IEEE Transactions on Communications, vol. 35, no. 12, pp. 1347–1356, 1987.
[8] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, “High-speed switch scheduling for local-area networks,” ACM Transactions on Computer Systems, vol. 11, no. 4, pp. 319–352, 1993.
[9] N. McKeown, P. Varaiya, and J. Walrand, “Scheduling cells in an input-queued switch,” Electronics Letters, vol. 29, no. 25, pp. 2174–2175, 1993.
[10] D. Stiliadis and A. Varma, “Providing bandwidth guarantees in an input-buffered crossbar switch,” in Proceedings of IEEE INFOCOM ’95, Fourteenth Annual Joint Conference of the IEEE Computer and Communications Societies: Bringing Information to People, vol. 3, 1995, pp. 960–968.
[11] S. Motoyama, “Cell delay modelling and comparison of iterative scheduling algorithms for ATM input-queued switches,” IEE Proceedings - Communications, vol. 150, no. 1, pp. 11–16, 2003.
[12] M. Shreedhar and G. Varghese, “Efficient fair queuing using deficit round-robin,” IEEE/ACM Transactions on Networking, vol. 4, no. 3, pp. 375–385, 1996.
[13] M. Katevenis, S. Sidiropoulos, and C. Courcoubetis, “Weighted round-robin cell multiplexing in a general-purpose ATM switch chip,” IEEE Journal on Selected Areas in Communications, vol. 9, no. 8, pp. 1265–1279, 1991.
[14] H. Chao, “Next generation routers,” Proceedings of the IEEE, vol. 90, no. 9, pp. 1518–1558, 2002.
[15] Y. Tamir and G. Frazier, “High-performance multiqueue buffers for VLSI communication switches,” in Proceedings of the 15th Annual International Symposium on Computer Architecture, 1988, pp. 343–354.
[16] Cisco Systems, “Cisco 12000 Series Internet Router Architecture: Switch Fabric-Cisco 12000 Series Routers,” 2004. [Online]. Available: http://www.cisco.com/warp/public/63/arch12000-swfabric.pdf
[17] S. Motoyama, D. Petr, and V. Frost, “Input-queued switch based on a scheduling algorithm,” Electronics Letters, vol. 31, no. 14, pp. 1127–1128, 1995.
[18] Y.-K. Park and Y.-K. Lee, “Parallel iterative matching-based cell scheduling algorithm for high-performance ATM switches,” IEEE Transactions on Consumer Electronics, vol. 47, no. 1, pp. 134–137, 2001.
[19] C. Ng, L. Bai, and B. Soong, “Modelling multimedia traffic over ATM using MMBP,” IEE Proceedings - Communications, vol. 144, no. 5, pp. 307–310, 1997.
[20] W.-C. Miao and J.-F. Chang, “Individual sojourn delay analysis of an ATM switch receiving heterogeneous Markov-Modulated Bernoulli processes under FIFO and priority service disciplines,” IEICE Transactions on Communications, vol. E80-B, no. 5, pp. 712–725, 1997.
[21] W. Leland, M. Taqqu, W. Willinger, and D. Wilson, “On the self-similar nature of ethernet traffic (extended version),” IEEE/ACM Transactions on Networking, vol. 2, no. 1, pp. 1–15, 1994.

Appendix A
Publication
Title of Publication
Scheduling with Relaxed Constraint for Input-Queued Switch/Router
with Crossbar Switching Fabric
DSCPC 2003/WITSTP 2003, Dec 2003, Gold Coast, Australia
Scheduling with Relaxed Constraint for ATM-like Input-Queued
Crossbar Switching Fabric in IP Router
APCC 2004/MDMC 2004, Aug-Sep 2004, Beijing, China

Please see print copy for Appendix A

Appendix B
Simulation Results
B.1 Simulation Results for 16 × 16 Switch

B.1.1 Full Simulation Result

Table B.1 PIM simulation results (16 × 16 switch)

Table B.2 RRM simulation results (16 × 16 switch)

Table B.3 iSLIP-RRM simulation results (16 × 16 switch)

Table B.4 PIMRC 1 simulation results (16 × 16 switch)

Table B.5 RRMRC simulation results (16 × 16 switch)

Table B.6 iSLIP-RRMRC simulation results (16 × 16 switch)

Table B.7 PIMRC 2 simulation results (16 × 16 switch)

Table B.8 Semi-RRMRC simulation results (16 × 16 switch)

B.1.2 Throughput Comparison Result

Figure B.1 Throughput comparison for PIM with different number of iterations (16 × 16 switch)

Figure B.2 Throughput comparison for RRM with different number of iterations (16 × 16 switch)

Figure B.3 Throughput comparison for iSLIP-RRM with different number of iterations (16 × 16 switch)

Figure B.4 Throughput comparison for PIMRC 1 with different number of iterations (16 × 16 switch)

Figure B.5 Throughput comparison for RRMRC with different number of iterations (16 × 16 switch)

Figure B.6 Throughput comparison for iSLIP-RRMRC with different number of iterations (16 × 16 switch)

Figure B.7 Throughput comparison for PIMRC 2 with different number of iterations (16 × 16 switch)

Figure B.8 Throughput comparison for Semi-RRMRC with different number of iterations (16 × 16 switch)

Table B.9
The minimum number of iterations recommended for scheduling algorithms to achieve acceptable throughput (16 × 16 switch)

Table B.10
Throughput comparison for scheduling algorithms with recommended minimum iterations
to achieve acceptable throughput (16 × 16 switch, average burst length=8 cells)

Figure B.9 Throughput comparison for scheduling algorithms with recommended minimum
number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=8
cells)

Table B.11
Throughput comparison for scheduling algorithms with recommended minimum iterations
to achieve acceptable throughput (16 × 16 switch, average burst length=16 cells)

Table B.12
Throughput comparison for scheduling algorithms with recommended minimum iterations
to achieve acceptable throughput (16 × 16 switch, average burst length=32 cells)

Figure B.10 Throughput comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst
length=16 cells)

Figure B.11 Throughput comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst
length=32 cells)

B.1.3 Average Queue Length Comparison Result

Figure B.12 Average queue length comparison for scheduling algorithms with recommended
minimum number of iterations to achieve acceptable throughput (16 × 16 switch)

B.1.4 Average Processing Time Comparison Result

Table B.13
Average processing time comparison for scheduling algorithms with recommended minimum
number of iterations to achieve acceptable throughput (16 × 16 switch, average burst length=8
cells)

Figure B.13 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch,
average burst length=8 cells)

Table B.14
Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst
length=16 cells)

Figure B.14 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch,
average burst length=16 cells)

Table B.15
Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch, average burst
length=32 cells)

Figure B.15 Average processing time comparison for scheduling algorithms with recommended minimum number of iterations to achieve acceptable throughput (16 × 16 switch,
average burst length=32 cells)

B.1.5 Performance Comparison Result with Different Average
Burst Length in Traffic Model

Figure B.16 The performance comparison for 3-iteration PIMRC 1 with different average
burst length (16 × 16 switch)

Figure B.17 The performance comparison for 2-iteration RRMRC with different average
burst length (16 × 16 switch)

Figure B.18 The performance comparison for 2-iteration iSLIP-RRMRC with different average burst length (16 × 16 switch)

Figure B.19 The performance comparison for 4-iteration PIMRC 2 with different average
burst length (16 × 16 switch)

Figure B.20 The performance comparison for 4-iteration Semi-RRMRC with different average
burst length (16 × 16 switch)

B.2 Simulation Results for 32 × 32 Switch

B.2.1 Full Simulation Result

Table B.16 PIM simulation results (32 × 32 switch)

Table B.17 RRM simulation results (32 × 32 switch)

Table B.18 iSLIP-RRM simulation results (32 × 32 switch)

Table B.19 PIMRC 1 simulation results (32 × 32 switch)

Table B.20 RRMRC simulation results (32 × 32 switch)

Table B.21 iSLIP-RRMRC simulation results (32 × 32 switch)

Table B.22 PIMRC 2 simulation results (32 × 32 switch)

Table B.23 Semi-RRMRC simulation results (32 × 32 switch)

B.2.2 Throughput Comparison Result

Figure B.21 Throughput comparison for PIM with different number of iterations (32 × 32 switch)

Figure B.22 Throughput comparison for RRM with different number of iterations (32 × 32 switch)

Figure B.23 Throughput comparison for iSLIP-RRM with different number of iterations (32 × 32 switch)

Figure B.24 Throughput comparison for PIMRC 1 with different number of iterations (32 × 32 switch)

Figure B.25 Throughput comparison for RRMRC with different number of iterations (32 × 32 switch)

Figure B.26 Throughput comparison for iSLIP-RRMRC with different number of iterations (32 × 32 switch)

Figure B.27 Throughput comparison for PIMRC 2 with different number of iterations (32 × 32 switch)

Figure B.28 Throughput comparison for Semi-RRMRC with different number of iterations (32 × 32 switch)

Table B.24
The minimum number of iterations recommended for scheduling algorithms to achieve acceptable throughput (32 × 32 switch)

Table B.25
Throughput comparison for scheduling algorithms with recommended minimum iterations
to achieve acceptable throughput (32 × 32 switch)

Figure B.29 Throughput comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch)

B.2.3 Average Queue Length Comparison Result

Figure B.30 Average queue length comparison for scheduling algorithms with recommended
minimum iterations to achieve acceptable throughput (32 × 32 switch)

B.2.4 Average Processing Time Comparison Result

Table B.26
Average processing time comparison for scheduling algorithms with recommended minimum
iterations to achieve acceptable throughput (32 × 32 switch)

Figure B.31 Average processing time comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch, average burst length=8 cells)

Figure B.32 Average processing time comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch, average burst length=16 cells)

Figure B.33 Average processing time comparison for scheduling algorithms with recommended minimum iterations to achieve acceptable throughput (32 × 32 switch, average burst length=32 cells)

B.2.5 Performance Comparison Result with Different Average
Burst Length in Traffic Model

Figure B.34 The performance comparison for 3-iteration PIMRC 1 with different average
burst length (32 × 32 switch)

Figure B.35 The performance comparison for 3-iteration RRMRC with different average
burst length (32 × 32 switch)

Figure B.36 The performance comparison for 3-iteration iSLIP-RRMRC with different average burst length (32 × 32 switch)

Figure B.37 The performance comparison for 4-iteration PIMRC 2 with different average
burst length (32 × 32 switch)

Figure B.38 The performance comparison for 4-iteration Semi-RRMRC with different average burst length (32 × 32 switch)

Appendix C
Some Additional Examples Used in
This Thesis
C.1 An Example of One iSLIP-RRM Iteration

An example of one iSLIP-RRM iteration for a 4 × 4 switch is illustrated in Figure C.1, which is very similar to the RRM example in
Figure 2.4 except that they update the round robin pointers at the outputs at
different stages. The initial queues are as follows: at input 1, there are two
queued cells destined for output 1 and one queued cell destined for
output 3. At input 2, no cell is queued. At input 3, there are three
queued cells destined for output 2 and two queued cells destined for
output 1. At input 4, there is one queued cell destined for output 1 and
one queued cell destined for output 4. The three stages of the iSLIP-RRM iteration are illustrated in Figure C.1 (a), (b) and (c) respectively.
The detailed description is given below:
(a) Stage 1, Request. We assume that all inputs and outputs are unmatched initially. Only inputs 1, 3 and 4 send requests since there
is no queued cell at input 2. Input 1 sends requests to outputs 1
and 3, input 3 sends requests to output 1 and 2, and input 4 sends
requests to outputs 1 and 4.
(b) Stage 2. Grant. We assume that all round robin pointers at the output side
are initially set to position “1”, which means that requests from input
1 have the highest priority to be granted. Output 1 receives three
requests, from inputs 1, 3 and 4, and it grants the request
from input 1 based on its round robin arbiter. Output 2
receives only one request, from input 3, and grants it
since it has no other choice. Similarly, outputs 3 and 4 grant the
requests from inputs 1 and 4 respectively, since each has no other
choice. The round robin pointers at the outputs remain unchanged at this
stage.
(c) Stage 3. Accept. We assume that all round robin pointers at the input side
are initially set to position “1”, which means that the grant
from output 1 has the highest priority to be accepted. Input 1 receives
grants from outputs 1 and 3; it therefore accepts the
grant from output 1 based on its round robin arbiter, and its round
robin pointer is advanced to position “2”, the location
beyond the accepted output 1 (modulo 4). Similarly, inputs 3 and
4 accept the grants from outputs 2 and 4 respectively, since each of
them receives only one grant. The round robin pointer at input 3
is advanced to position “3”, the location beyond the accepted
output 2 (modulo 4). The round robin pointer at input
4 remains at position “1”, which is the location beyond the accepted
output 4 (modulo 4). The round robin pointer at input 2 remains
at position “1” since it receives no grant. Meanwhile, the
round robin pointers at the output side are also updated. The round
robin pointer at output 1 is advanced to position “2”, the
location beyond input 1, which accepts its grant (modulo 4).
Similarly, the round robin pointer at output 2 is advanced to position
“4”, the location beyond input 3, which accepts its grant
(modulo 4). The round robin pointer at output 4 remains at position
“1”, which is the location beyond input 4, which accepts
its grant (modulo 4). The round robin pointer at output 3
remains at position “1” since its grant is not accepted.
Finally, after this iteration, three matching pairs have been found:
input 1-output 1, input 3-output 2, and input 4-output 4. They are
marked as “matched” so that they do not take part in the following iterations
of the same time slot.

Figure C.1 An example of iSLIP-RRM
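The three stages described above can be traced with a short program. The following Python sketch is only an illustration under stated assumptions, not the Matlab code used for the simulations in this thesis: the function names, the dictionary-based data layout and the variable names are all hypothetical, the request sets are those stated in Stage 1, and only a single iteration is modelled.

```python
N = 4  # switch size; ports are numbered 1..N as in the text


def round_robin_pick(candidates, pointer, n=N):
    """Return the first candidate at or after `pointer`, wrapping modulo n."""
    for offset in range(n):
        position = (pointer - 1 + offset) % n + 1
        if position in candidates:
            return position
    return None


def islip_rrm_iteration(requests, grant_ptr, accept_ptr, n=N):
    """Run one request-grant-accept iteration and return {input: output}.

    requests   -- {input: set of requested outputs} (Stage 1)
    grant_ptr  -- {output: pointer}, updated only when a grant is accepted
    accept_ptr -- {input: pointer}, updated when the input accepts a grant
    """
    # Stage 2, Grant: each output grants one request; pointers stay unchanged.
    grants = {}  # input -> set of outputs that granted it
    for out in range(1, n + 1):
        requesters = {inp for inp, outs in requests.items() if out in outs}
        chosen = round_robin_pick(requesters, grant_ptr[out], n)
        if chosen is not None:
            grants.setdefault(chosen, set()).add(out)

    # Stage 3, Accept: each input accepts one grant; both pointers of the
    # matched pair move to the location beyond the matched port (modulo n).
    matches = {}
    for inp, granting_outputs in grants.items():
        out = round_robin_pick(granting_outputs, accept_ptr[inp], n)
        matches[inp] = out
        accept_ptr[inp] = out % n + 1
        grant_ptr[out] = inp % n + 1
    return matches


# Requests as stated in Stage 1; all pointers start at position 1.
requests = {1: {1, 3}, 3: {1, 2}, 4: {1, 4}}
grant_ptr = {out: 1 for out in range(1, N + 1)}
accept_ptr = {inp: 1 for inp in range(1, N + 1)}

matches = islip_rrm_iteration(requests, grant_ptr, accept_ptr)
print(matches)     # matching pairs found in this iteration
print(accept_ptr)  # round robin pointers at the inputs
print(grant_ptr)   # round robin pointers at the outputs
```

With these inputs the sketch yields the three matching pairs and the pointer positions given in Stage 3: the pointers at inputs 1 and 3 and at outputs 1 and 2 advance, while all other pointers stay at position “1”.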

C.2 An Example of How iSLIP-RRM Avoids the Synchronization of Round Robin Pointers at Outputs

To show how iSLIP-RRM resolves the synchronization of output round
robin pointers in RRM, we apply iSLIP-RRM to the
same example as in Figure 2.5. The result is shown in Figure C.2.
At the end of this iSLIP-RRM iteration in Figure C.2,
the round robin pointer at output 1 is set to position “2”, the
location beyond input 1, which accepts its grant (modulo 4). Meanwhile, the round robin pointers at outputs 2, 3 and 4 remain at position “1”
since their grants are not accepted. In Figure 2.5, by contrast, the round robin pointers
at all outputs are advanced to position “2” after the RRM iteration.
This synchronization does not occur in Figure C.2.
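The difference between the two pointer update rules can be illustrated with a small sketch. The Python code below is a hypothetical illustration, not the thesis simulation code. Since the request pattern of Figure 2.5 is not reproduced in this appendix, the sketch assumes that every input requests every output. The function name and the flag `update_on_accept` are likewise assumptions introduced here.

```python
N = 4  # switch size; ports are numbered 1..N


def pick(candidates, pointer, n=N):
    """Return the first candidate at or after `pointer`, wrapping modulo n."""
    for offset in range(n):
        position = (pointer - 1 + offset) % n + 1
        if position in candidates:
            return position
    return None


def output_pointers_after_iteration(requests, update_on_accept, n=N):
    """Return the output pointers after one iteration, all starting at 1.

    update_on_accept=False: RRM rule, a pointer moves whenever a grant is issued.
    update_on_accept=True:  iSLIP-RRM rule, a pointer moves only if the grant
                            is accepted.
    """
    grant_ptr = {out: 1 for out in range(1, n + 1)}
    accept_ptr = {inp: 1 for inp in range(1, n + 1)}

    # Grant stage: each output grants one requesting input.
    granted = {}  # output -> granted input
    for out in range(1, n + 1):
        requesters = {inp for inp, outs in requests.items() if out in outs}
        chosen = pick(requesters, grant_ptr[out], n)
        if chosen is not None:
            granted[out] = chosen

    # Accept stage: each input accepts one of the grants it received.
    accepted = {}  # input -> accepted output
    for inp in range(1, n + 1):
        offers = {out for out, g in granted.items() if g == inp}
        chosen = pick(offers, accept_ptr[inp], n)
        if chosen is not None:
            accepted[inp] = chosen

    # Output pointer update according to the selected rule.
    for out, inp in granted.items():
        if not update_on_accept or accepted.get(inp) == out:
            grant_ptr[out] = inp % n + 1
    return grant_ptr


# Assumed request pattern: every input requests every output.
all_requests = {inp: set(range(1, N + 1)) for inp in range(1, N + 1)}

rrm_ptrs = output_pointers_after_iteration(all_requests, update_on_accept=False)
islip_ptrs = output_pointers_after_iteration(all_requests, update_on_accept=True)
print(rrm_ptrs)    # RRM: every output pointer moves to the same position
print(islip_ptrs)  # iSLIP-RRM: only the output whose grant was accepted moves
```

Under this assumed request pattern, every output grants input 1 and input 1 accepts output 1. The RRM rule then moves all output pointers to position “2”, whereas the iSLIP-RRM rule moves only the pointer at output 1, which agrees with the behaviour described above.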

Figure C.2 An example of how iSLIP-RRM avoids the synchronization of round robin pointers at outputs

Appendix D
Simulation Matlab Programs
A CD-ROM is attached to this thesis. The Matlab programs on this
CD-ROM are listed below:

islip.m          simulation program for iSLIP-RRM (traffic load 0.1∼0.9)
islipfull.m      simulation program for iSLIP-RRM (full traffic load)
pim.m            simulation program for PIM (traffic load 0.1∼0.9)
pimfull.m        simulation program for PIM (full traffic load)
rrm.m            simulation program for RRM (traffic load 0.1∼0.9)
rrmfull.m        simulation program for RRM (full traffic load)
pimrc1.m         simulation program for PIMRC 1 (traffic load 0.1∼0.9)
pimrc1full.m     simulation program for PIMRC 1 (full traffic load)
rrmrc.m          simulation program for RRMRC (traffic load 0.1∼0.9)
rrmrcfull.m      simulation program for RRMRC (full traffic load)
isliprc.m        simulation program for iSLIP-RRMRC (traffic load 0.1∼0.9)
isliprcfull.m    simulation program for iSLIP-RRMRC (full traffic load)
pimrc2.m         simulation program for PIMRC 2 (traffic load 0.1∼0.9)
pimrc2full.m     simulation program for PIMRC 2 (full traffic load)
semirrm.m        simulation program for Semi-RRMRC (traffic load 0.1∼0.9)
semirrmfull.m    simulation program for Semi-RRMRC (full traffic load)

The operation of all the simulation programs is generally identical:
when a program is executed in Matlab, the user is prompted to enter
the following simulation parameters in advance:

• The switch size N for an N × N switch.
• The number of iterations per time slot for the scheduling algorithm.
• The number of output groups (output links). This parameter applies
only to the scheduling algorithms with relaxed constraint.
• The number of outputs in each output group, which also applies
only to the scheduling algorithms with relaxed constraint.
• The average burst length of the traffic model, in cells.
• The traffic load, ranging from 0 to 0.99. This parameter is not requested by the
simulation programs for the full traffic load.
• The simulation duration, in time slots.
After the above parameters have been entered, the following information is displayed when the program stops running:
• The values of the simulation parameters just entered.
• The total number of cells arriving at the switch.
• The total queue length for all inputs of the switch, in
cells.
• The average queue length, which is obtained by dividing the total queue length by the following values: the number of inputs,
the number of outputs, and the simulation duration.
• The total number of matching requests.
– For the existing scheduling algorithms without relaxed constraint, this value is increased by one if there is a queued cell in
one input at the beginning of a time slot.
– For the scheduling algorithms with relaxed constraint, the
value is counted separately for each output link.
The value is increased by one if one input queue for the corresponding output link is not empty at the beginning of a time
slot.

• The total number of matching pairs found.
• The simulation time used for scheduling, in seconds (excluding the time used by the traffic model to generate the
traffic).
After these values have been obtained, the throughput can be calculated by
dividing the total number of matching pairs found by the total
number of arriving cells. The average processing time can also be
obtained by applying Equation 4.5.
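The two derived quantities described above can be written out as a short calculation. The Python sketch below is only an illustration, not part of the Matlab programs on the CD-ROM. The function names are hypothetical and the numbers in the usage example are invented purely for illustration; they are not simulation results. The sketch also assumes that "dividing by the following values" means dividing by the product of the three values.

```python
def throughput(matched_pairs, arrived_cells):
    """Throughput = total matching pairs found / total arriving cells."""
    if arrived_cells <= 0:
        raise ValueError("no cells arrived; throughput is undefined")
    return matched_pairs / arrived_cells


def average_queue_length(total_queue_length, num_inputs, num_outputs,
                         duration_slots):
    """Average queue length in cells.

    Assumption: the total queue length is divided by the product of the
    number of inputs, the number of outputs and the simulation duration.
    """
    divisor = num_inputs * num_outputs * duration_slots
    if divisor <= 0:
        raise ValueError("switch size and duration must be positive")
    return total_queue_length / divisor


# Hypothetical values for a 4 x 4 switch simulated for 1000 time slots.
example_throughput = throughput(matched_pairs=9000, arrived_cells=10000)
example_avg_queue = average_queue_length(
    total_queue_length=64000, num_inputs=4, num_outputs=4, duration_slots=1000
)
print(example_throughput)
print(example_avg_queue)
```

The average processing time is deliberately left out of this sketch, since it depends on Equation 4.5, which is defined in the main text.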

