Scheduling Optical Circuits in Data Center Networks by Huang, Xin

ABSTRACT
Data centers built on optical circuit switching networks, or optical data centers,
are emerging as an alternative to traditional data centers, whose electrical packet
switching networks are already overwhelmed by bulk data transfers. Optical data
centers promise high bandwidth, but this promise is offset by circuit reconfiguration
delays, which make circuit scheduling non-trivial. The optical circuit scheduler must
manage traffic over both hybrid and pure optical network architectures, handle both
sparse and dense traffic patterns, and scale to large network sizes. In this thesis, we
show that the previously proposed algorithms for circuit scheduling in optical data
centers fail to meet these goals. To address their deficiencies, we introduce a
scheduling algorithm called Decomp. We show that regardless of hybrid or pure
architectures, sparse or dense traffic, Decomp simultaneously eliminates the
long-tailed flow waiting times that existing algorithms suffer from, achieves high
network utilization, and maintains a low computational delay as the network size
scales up.
Acknowledgments
I am immensely grateful to my advisor, T. S. Eugene Ng, who enlightens me with the
appreciation for great work, ample guidance for good research, and various aspects
for effective presentation. He gives me freedom to explore the problems, as well as
timely feedback on my studies. I am especially fortunate to work with him and I look
forward to our collaboration in my PhD studies.
I am also thankful to my thesis committee members, Alan Cox and Chris Jermaine,
for their useful feedback.
I am grateful to many colleagues at Rice, especially the members of the BOLD
Lab. I have been fortunate to work with Xiaoye Sun, Simbarashe Dzinamarira, Yiting
Xia, Zhaolei Fred Liu, and Ruiqi Liu. I learned a lot from each of them.
I would like to express my deepest gratitude to my parents for their love and
support. They encourage me to do what I love and to love what I do. My mother
instills in me perseverance and my father teaches me the sense of responsibility.
Last but not least, I want to thank my friends for their support throughout my
studies.
Contents
Abstract
Acknowledgments
List of Illustrations
List of Tables
1 Introduction
2 Background
  2.1 Network Model
  2.2 Optical Circuit Constraints
  2.3 Traffic Demand Requirement
  2.4 Traffic Scheduling Assumptions
3 Related Work
  3.1 Overview
  3.2 Scheduling One Assignment per Cycle
    3.2.1 Edmond
    3.2.2 Other Scheduling Algorithms for Packet Switches
  3.3 Scheduling Multiple Assignments per Cycle
    3.3.1 TMS
    3.3.2 Other Scheduling Algorithms for Crossbar Switches
4 Motivating Example
  4.1 Inefficiency behind Edmond
  4.2 Inefficiency behind TMS
5 Efficient Optical Circuit Scheduling with Decomp
  5.1 Algorithm Overview
    5.1.1 Key Techniques
    5.1.2 Decomp Techniques and the Design Space
  5.2 The Decomp Algorithm
    5.2.1 Stage I: Partition
    5.2.2 Stage II: Schedule
    5.2.3 Stage III: Merge
  5.3 Decomp Algorithm Analysis
    5.3.1 Process Usage
    5.3.2 Computation Complexity
    5.3.3 Circuit Configurations
    5.3.4 Comparison with Existing Algorithms
6 Evaluation
  6.1 Methodology
    6.1.1 Network Topology and Link Bandwidth
    6.1.2 Metrics
    6.1.3 Simulation Settings
    6.1.4 Traffic Workloads
  6.2 Numerical Results
    6.2.1 Decomp achieves low flow waiting time while preserving network utilization
    6.2.2 Decomp requires far less computation time than Edmond and TMS
    6.2.3 Decomp generates far fewer slots than TMS
    6.2.4 TMS makes inefficient scheduling decisions for Isolated under combined
          effect of computation and reconfiguration overhead
7 Conclusion
Bibliography
Illustrations
1.1 Design space of circuit scheduling algorithms. The feasible zone of
    Edmond is outlined with the solid line and TMS with the dashed line.
    These algorithms fail to perform well in all three dimensions.
2.1 Network model with an electrical packet switch and an optical circuit
    switch. Each ToR is both a sender and receiver.
3.1 An illustration of existing schedulers’ workflow for a 4-port optical
    switch.
4.1 Scheduling result of Edmond. Grape shaded cells indicate demanding
    circuits NOT covered in the scheduling.
4.2 Scheduling result of TMS. All numbers are rounded up to the second
    decimal. Red shaded cells indicate distorted demand and wasted circuits.
5.1 An example illustrating 4-way Decomp for 6 racks.
5.2 Greedy matrix decomposition (GMD) algorithm on the example
    demand. Demand is not partitioned. Back filling circuits are marked red.
5.3 Decomp merges slots from non-conflicting regions. Generate a new
    slot by combining two slots employed at the same virtual time in
    different regions.
5.4 Back filling extra non-conflicting valid circuits in the merge stage.
    Back filling circuits are marked red.
6.1 CDF of flow waiting time for the Hotspot pattern under the 10 Gbps
    pure optical network architecture. Decomp is suitable for the pure
    optical network architecture and eliminates the long-tailed flow
    waiting times that Edmond and TMS suffer from.
6.2 Traffic finish time under the 10 Gbps pure optical architecture.
    Decomp is as good as other algorithms in terms of network utilization
    regardless of traffic density, while simultaneously eliminating
    long-tailed flow waiting times.
6.3 Total flow waiting time for Hotspot under the hybrid network
    architecture (total bandwidth is the sum of 10 Gbps constant optical
    bandwidth and variable electrical bandwidth). Decomp is suitable for
    the hybrid network architecture and achieves the lowest flow waiting
    times regardless of the amount of electrical network bandwidth available.
6.4 Computation time for the dense Urand pattern under different
    algorithms. Note that the y-axes have very different scales, and
    Decomp is orders of magnitude more computationally efficient.
    Parallelization allows Decomp to scale well to large network sizes.
    While not shown in this figure, Decomp is also much more efficient
    for sparse traffic (see Section 6.2.2).
6.5 Number of slots for the Urand pattern under Decomp and TMS.
    Decomp generates two orders of magnitude fewer scheduling slots,
    leading to much lower circuit switching overhead. Recall that
    switching an optical circuit takes up to tens of milliseconds.
6.6 Finish time of the Isolated traffic pattern under the 10 Gbps pure
    optical architecture. TMS’s inefficiencies are rooted in its high
    computation and circuit reconfiguration overhead.
Tables
5.1 Comparison of scheduling algorithms on N racks. (Decomp is the
    only algorithm that is parallelizable)
Chapter 1
Introduction
The rising tide of big-data-driven cluster computation is imposing heavy traffic on
data center networks. In the face of high traffic demand, conventional data centers
built with layers of electrical packet switches are fundamentally challenged. Copper
cables can no longer support high bit rates over distances of more than a few meters.
Besides, large bisection bandwidth requires a highly layered topology with a vast
number of switches, which complicates management. Instead, researchers in both
academia [1, 2, 3, 4, 5, 6] and industry [7, 8] are turning to optical fabrics and
switches for next-generation data center solutions.
However, upgrading to an optical data center is non-trivial. The circuit switching
delay of commercially available optical switches (e.g. WSS and 3D-MEMS) is 2 to
5 orders of magnitude slower than what is needed to switch at packet granularity.
Therefore, the switching decisions greatly impact network performance, and
sophisticated management of optical circuits plays a critical role in optical data
centers. As a basic requirement, the circuit scheduler has to achieve low flow waiting
times and high network utilization. In addition, the leading proposals for optical
data centers [1, 2, 3, 4, 5, 6] and the traffic patterns of data intensive applications
[9] point to three requirements that a circuit scheduler has to address:
Handle Hybrid and Pure Architectures. Several network architectures have
been proposed for optical data centers. One line of work [1, 2, 3, 6] adopts a hybrid
architecture, which combines electrical packet switching with optical circuit switching.
The top-of-rack switches are connected to a conventional electrical packet switched
network and an optical circuit switched network. These networks operate side by
side, and heavy traffic flows are off-loaded to the optical circuit switched network.
Another line of work [4, 5] adopts a pure architecture, where the top-of-rack switches
are connected solely to an optical circuit switched network. Here, the fastest available
optical switches must be employed to minimize the negative performance impact of
circuit switching delays. It is also worth noting that the hybrid architecture behaves
similarly to the pure architecture if it possesses only a small amount of electrical
bandwidth. In data centers where the bisection bandwidth might change vastly due
to congestion or link failure, the circuit scheduler must function well regardless of
whether the network behaves like a hybrid or a pure optical network.
Robustness under Different Traffic Patterns. Data intensive applications
generate traffic with a wide range of traffic density [9]. A data center can carry
an arbitrary and changing set of applications, resulting in vastly different traffic
density over time. At one time, jobs can trigger multiple one-to-all and all-to-one
flow groups, resulting in a uniform dense traffic pattern. At another, there might be
only a few large flows among a few select machine racks, resulting in a skewed sparse
traffic pattern. The circuit scheduler must be robust to different traffic patterns.
Scalability. Data intensive applications’ storage and computation demands are
driving data centers to expand. Particularly in this era of cloud computing, the
number of servers in a data center is growing continuously to meet operational
requirements. In an optical data center, the circuit scheduler can become a critical
bottleneck for network scalability. Therefore, the circuit scheduler must be
sufficiently scalable.
It turns out that meeting these requirements cannot be taken for granted. In
Figure 1.1, we illustrate a three-dimensional design space of circuit scheduling
algorithms. Each point in the design space represents a situation that a circuit
scheduler might encounter. For stable network performance, a circuit scheduler should
function well under various data center sizes, from small to large, different
architectures, from pure to hybrid, and unpredictable traffic patterns, from uniform
to skewed and from dense to sparse.
Figure 1.1 : Design space of circuit scheduling algorithms. The feasible zone of
Edmond is outlined with the solid line and TMS with the dashed line. These algorithms
fail to perform well in all three dimensions. (Labels in the figure: Edmond, TMS,
Skewed/Sparse.)
Unfortunately, in this thesis, we show that the optical data center circuit
schedulers proposed in previous works [1, 2, 4, 5] fail to perform well in all three
dimensions (these scheduling algorithms are described in Chapter 3). Specifically,
Edmond’s algorithm, which we refer to as Edmond hereafter, is used in [1, 2] to
maximize the traffic volume over the optical network. In this setting, however, small
flows will starve in the pure optical network or in the hybrid architecture with
little electrical bandwidth. Further, Edmond’s computation time is very high for a
large network with dense traffic. The Traffic Matrix Scheduling (TMS) algorithm
proposed in [4, 5] mitigates the starvation problem, but it is not computationally
scalable. Its computation time for sparse traffic is prohibitive even for a small
network; moreover, it schedules wasted circuits due to a distortion of traffic demand
in one of its computation steps. A circuit is valid if it has non-zero traffic
demand, so valid circuits effectively clear traffic. In contrast, wasted circuits are
idle circuits that have no demand to serve. Wasted circuits are costly not only
because they cause expensive switch reconfigurations, but also because they may block
valid circuits, further degrading network performance. The feasible zones of Edmond
and TMS are outlined by the solid line and the dashed line in Figure 1.1.
To address the multi-dimensional challenges of optical data centers, we propose a
circuit scheduling algorithm called Decomp. Decomp can be seen as a framework.
Unlike other algorithms, Decomp structures the circuit schedule computation into
stages and can be customized with different circuit selection policies. Decomp also
employs three key approaches that are not found in previous proposals, namely
partitioning, randomization, and parallelization (details are presented in Chapter 5).
We choose simulation as the methodology for evaluating different scheduling
algorithms, since it allows us to precisely characterize and compare the performance
of the scheduling algorithms and to stress these algorithms at large network sizes.
The results show that regardless of hybrid or pure architectures, sparse or dense
traffic, Decomp simultaneously eliminates the long-tailed flow waiting times that
existing algorithms suffer from, achieves high network utilization, and maintains a
low computational delay as network size scales up. We observe cases where Decomp
clears up to 55% more traffic flows than Edmond, or 13% more than TMS, within the
same length of time; Decomp takes 28000× less time than TMS to compute a schedule
among 600 racks, or 170× less than Edmond.
The rest of this thesis is organized as follows. In Chapter 2, we provide the
background for optical circuit scheduling. In Chapter 3, we discuss a range of
related work on circuit scheduling, which should be helpful to readers who are not
familiar with this area. In Chapter 4, we use examples to illustrate the inefficiency
behind Edmond and TMS, two representative optical circuit scheduling algorithms in
the context of data center networks. In Chapter 5, we present the design of the
Decomp algorithm, and in Chapter 6 we compare the performance of Edmond, TMS,
and Decomp. Finally, we summarize our contributions and conclude in Chapter 7.
Chapter 2
Background
2.1 Network Model
As shown in Figure 2.1, we model the core network of the data center as a hybrid
network consisting of an electrical packet switch with bisection bandwidth B_e and
an optical circuit switch with bisection bandwidth B_o (B_o >> B_e).
The sources and destinations behind the input and output ports of the switches
can vary in practice (e.g. rack-to-rack, pod-to-pod); without loss of generality, we
assume a rack-to-rack granularity in the discussions that follow.
In Figure 2.1, each top-of-rack (ToR) switch is both a sender and a receiver. A ToR
switch aggregates the traffic from the servers within its domain and routes inbound
and outbound traffic accordingly. Each ToR is assigned an input port and an output
port on each of the electrical and optical switches. However, traffic between two
ToR switches can go through only one of the two switches at a time. By default,
traffic goes through the optical switch when a circuit in the corresponding
direction is set up, while all other traffic is left to the electrical switch. Note that
although traffic from ToR_i to ToR_j travels through the optical switch if a circuit
from ToR_i to ToR_j is set up, traffic in the reverse direction (from ToR_j to ToR_i)
is unaffected: whether it travels through the optical or the electrical switch
depends on whether a circuit from ToR_j to ToR_i is set up.
Note that when the electrical bandwidth B_e → 0, the network becomes a pure
optical network, in which all traffic can only travel through the optical switch.
Figure 2.1 : Network model with an electrical packet switch and an optical circuit
switch. Each ToR is both a sender and receiver. (Labels in the figure: B_e, B_o.)
2.2 Optical Circuit Constraints
The optical switch has circuit constraints: each input port can be connected to
only one output port at a time, and each output port to only one input port. Let p
denote a circuit assignment of the optical switch at any time. Then p is an N × N
binary matrix, and p(i, j) = 1 indicates setting up a circuit from input port i to
output port j. Thus p is a permutation matrix corresponding to a one-to-one matching
between the input ports and output ports of the optical switch.
The circuit scheduler schedules a list of circuit assignments {p_1, ..., p_m, ..., p_M}
over time, where M is the total number of circuit assignments deployed. Each circuit
assignment is coupled with its duration in time {t_1, ..., t_m, ..., t_M}, so that
assignment p_m is active for t_m (m ∈ {1, 2, ..., M}).
Applying a circuit assignment incurs a reconfiguration delay of δ.
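
The constraints above can be made concrete with a minimal Python sketch. It checks
whether a binary matrix is a valid circuit assignment (a permutation matrix) and
computes the time occupied by a schedule. The function names `is_valid_assignment`
and `schedule_length` are illustrative, and charging one delay δ in addition to each
duration t_m is an assumption made for this sketch, not a statement from the thesis.

```python
def is_valid_assignment(p):
    """Return True if p is an N x N permutation matrix,
    i.e. exactly one circuit per input port and per output port."""
    n = len(p)
    if any(len(row) != n for row in p):
        return False
    if any(v not in (0, 1) for row in p for v in row):
        return False
    rows_ok = all(sum(row) == 1 for row in p)
    cols_ok = all(sum(p[i][j] for i in range(n)) == 1 for j in range(n))
    return rows_ok and cols_ok


def schedule_length(durations, delta):
    """Total time of a schedule, assuming (for illustration) that each
    assignment costs one reconfiguration delay delta plus its duration t_m."""
    return sum(t + delta for t in durations)


if __name__ == "__main__":
    p = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
    print(is_valid_assignment(p))
    print(schedule_length([40.0, 60.0], delta=0.02))
```

The second function shows why the number of assignments M matters: every additional
assignment adds another δ to the schedule.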
2.3 Traffic Demand Requirement
The schedule should meet the traffic demand requirement. The traffic demand can
be represented as an N × N matrix D, where N is the number of racks. Each element
D(i, j) indicates the byte volume of traffic demand from ToR_i to ToR_j.
2.4 Traffic Scheduling Assumptions
In this study, we focus on the impact of circuit scheduling on network performance.
Therefore, we assume the circuit scheduler can only control the circuit assignments
{p_1, ..., p_M} and the corresponding durations {t_1, ..., t_M}. Traffic is redirected
automatically to the optical switch if the corresponding circuits are set up. The
traffic shares the bandwidth with max-min fairness.
Chapter 3
Related Work
3.1 Overview
In the context of data center networks, existing optical circuit schedulers generally
apply centralized management and schedule circuits at the beginning of each recurring
scheduling cycle [1, 2, 4, 5]. A scheduling cycle has a length of time T, which is
typically fixed and on the order of hundreds of milliseconds. Each scheduling cycle
can be further divided into slots, so that each slot corresponds to a unique circuit
assignment.
At the beginning of each scheduling cycle, the scheduler collects the traffic demand,
represented as a matrix D. Based on D, the circuit scheduling algorithm computes
one or more slot configurations in a list C = {c_1, c_2, ..., c_{M_k}}, where M_k is
the number of slots produced in the k-th scheduling cycle (Figure 3.1). Each slot c_m
has a circuit assignment c_m.p and a coefficient c_m.t. The coefficient c_m.t is the
ratio between the length of slot c_m and the length of the scheduling cycle, i.e. the
time share of c_m in a cycle. In other words, a slot with time share c_m.t has a
length of c_m.t × T in time (t_m = c_m.t × T in Figure 3.1), during which the circuits
in c_m.p are set up on the optical switch. Note that the slot lengths within a
scheduling cycle sum to T, i.e.
$$\sum_{m=1}^{M_k} c_m.t = 1.$$
Each slot thus corresponds to a circuit assignment p_m with duration t_m = c_m.t × T,
as defined in Section 2.2 of Chapter 2.
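
The relation between time shares and slot lengths can be sketched in a few lines of
Python. The `Slot` class and `slot_durations` function are hypothetical names used
only for illustration; the check that the time shares sum to 1 mirrors the constraint
stated above.

```python
class Slot:
    """One slot configuration: a circuit assignment p and a time share t."""

    def __init__(self, p, t):
        self.p = p  # N x N 0/1 circuit assignment matrix
        self.t = t  # fraction of the scheduling cycle given to this slot


def slot_durations(slots, cycle_length):
    """Convert time shares c_m.t into durations t_m = c_m.t * T.

    Raises ValueError if the time shares do not sum to 1.
    """
    total = sum(c.t for c in slots)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1, got %r" % total)
    return [c.t * cycle_length for c in slots]


if __name__ == "__main__":
    slots = [
        Slot([[1, 0], [0, 1]], 0.25),
        Slot([[0, 1], [1, 0]], 0.75),
    ]
    # With a 200 ms cycle, the two slots last 50 ms and 150 ms.
    print(slot_durations(slots, 200))
```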
Figure 3.1 illustrates this cycle-based workflow adopted by existing schedulers. In
Figure 3.1, for both the traffic demand matrix and the circuit assignment matrices,
each row represents an input port and each column represents an output port. The
ones in the circuit assignment matrices indicate setting up optical circuits between
the corresponding input and output ports.
Figure 3.1 : An illustration of existing schedulers’ workflow for a 4-port optical
switch.
3.2 Scheduling One Assignment per Cycle
3.2.1 Edmond
In the context of optical circuit scheduling for data center networks, the schedulers
proposed in [1, 2] use the maximum weighted matching algorithm, or Edmond, as the
core scheduling algorithm. Edmond schedules only one slot per scheduling cycle,
i.e. |C| = 1, and the slot is configured as
$$c_1.p \leftarrow \arg\max_{p} \langle p, D \rangle \ \text{over all matchings } p, \qquad c_1.t \leftarrow 1 \tag{3.1}$$
where D is the traffic demand matrix. In other words, for each scheduling cycle of
length T, Edmond chooses the circuit assignment that maximizes the sum of traffic
demand on the circuits set up, based on the traffic demand at the beginning of each
cycle.
The computation complexity of Edmond is O(N^3), where N is the dimension of the
demand matrix D, which equals the port count of the optical switch and thus the rack
count inside the data center network.
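
To illustrate what Equation (3.1) selects, the following Python sketch finds the
assignment p that maximizes ⟨p, D⟩ by enumerating all permutations. This brute-force
search is a stand-in chosen for clarity; it runs in O(N!) time and is not the O(N^3)
matching algorithm used in [1, 2]. The function name and the sample demand matrix are
hypothetical.

```python
from itertools import permutations


def max_weight_assignment(demand):
    """Return (p, weight): the permutation matrix p maximizing sum of
    p(i, j) * D(i, j), found by exhaustive search (small N only)."""
    n = len(demand)
    best_perm, best_weight = None, float("-inf")
    for perm in permutations(range(n)):
        # perm[i] is the output port connected to input port i.
        weight = sum(demand[i][perm[i]] for i in range(n))
        if weight > best_weight:
            best_perm, best_weight = perm, weight
    p = [[1 if best_perm[i] == j else 0 for j in range(n)] for i in range(n)]
    return p, best_weight


if __name__ == "__main__":
    demand = [[0, 9, 1],
              [2, 0, 3],
              [4, 5, 0]]
    p, weight = max_weight_assignment(demand)
    for row in p:
        print(row)
    print("served demand:", weight)
```

In this sample, the chosen assignment serves the three circuits with demands 9, 3,
and 4, while the remaining non-zero entries receive no optical circuit in the cycle.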
3.2.2 Other Scheduling Algorithms for Packet Switches
The Edmond algorithm can be traced back to the classic circuit scheduling problem
for packet switches, where one circuit assignment is produced for each scheduling
cycle to transmit one round of packets. A plethora of works focus on switch
scheduling for input queuing packet switches by finding the maximum bipartite
matching (MBM) among input and output ports. Edmond [10], PIM [11], SLIP [12],
APSARA [13], LAURA [13] and SERENA [13] are some of the matching algorithms
that serve this purpose. However, circuit scheduling in optical data centers differs
from input queuing switch scheduling for two important reasons, outlined as
follows.
First, in input queuing packet switches, switch scheduling happens at packet
granularity, i.e. one MBM is used per scheduling cycle to transmit one round of
packets. This is possible because the traffic demand is instantly available to the
scheduling algorithm by counting local buffers. However, in optical data centers, the
traffic sources are distributed. The core optical switch is bufferless and the traffic
is buffered either at packet switches or at hosts. Before the scheduling computation,
the traffic demand needs to be measured in a distributed manner and communicated to
the circuit manager, which incurs a significant and often unpredictable delay. This
delay is further compounded by communication overhead, making it impossible to collect
traffic demand at packet granularity in optical data centers. Instead, such
information has to be collected at a coarser granularity, with a larger update
interval on traffic demand.
Second, in input queuing switches, only one MBM is used per scheduling cycle to
transmit one round of packets. However, in optical data centers, multiple matchings
are generally needed per scheduling cycle.
Consider a slow 3D-MEMS optical switch, which takes tens of milliseconds for
circuit reconfiguration. Given such a circuit reconfiguration delay and the
communication overhead of collecting traffic demand, previous works [1, 2] propose to
collect traffic demand and schedule one circuit assignment at O(100 ms) intervals. In
a typical data center with 10 Gbps bisection bandwidth, a scheduling interval of
O(100 ms) transmits O(80,000) packets, and the communication overhead is an acceptable
cost for transmitting such an amount of data. However, as faster optical switching
technologies, such as the wavelength-selective switch (WSS), are being proposed, the
optical circuit reconfiguration delay drops from tens of milliseconds to tens of
microseconds, and can drop even further with silicon photonic switches. On one hand,
scheduling at O(100 µs) intervals is no longer possible because it causes an
unacceptably high communication overhead for collecting distributed traffic demand
information. On the other hand, keeping the scheduling interval at the slower
O(100 ms) is not desirable because it does not take advantage of the faster switching
enabled by these technologies. The advantages of faster switching include more
fine-grained sharing of bandwidth resources among large and small flows, which can
significantly increase network throughput and reduce flow starvation problems.
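
The O(80,000) figure above follows from simple arithmetic, sketched below in Python.
The packet size of 1500 bytes is an assumption made here for illustration (a common
Ethernet payload size); the thesis does not state the packet size it uses.

```python
def packets_per_interval(bandwidth_bps, interval_s, packet_bytes=1500):
    """Number of packets a link of the given bandwidth can carry in one
    scheduling interval, assuming fixed-size packets (assumed 1500 bytes)."""
    bits_per_packet = 8 * packet_bytes
    return bandwidth_bps * interval_s / bits_per_packet


if __name__ == "__main__":
    ten_gbps = 10e9
    # Scheduling interval on the order of 100 ms (slow 3D-MEMS switching).
    print(round(packets_per_interval(ten_gbps, 100e-3)))
    # Scheduling interval on the order of 100 microseconds (fast switching).
    print(round(packets_per_interval(ten_gbps, 100e-6)))
```

Under this assumption, a 100 ms interval carries roughly three orders of magnitude
more packets than a 100 µs interval, which is why the fixed cost of collecting demand
is much easier to amortize over the longer interval.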
To bridge the gap between coarse updates of traffic demand and fast switching of
optical circuits, one can schedule multiple circuit assignments per cycle.
3.3 Scheduling Multiple Assignments per Cycle
3.3.1 TMS
TMS is proposed in [4, 5] as the scheduling algorithm for optical circuits in data
center networks. It schedules multiple circuit assignments per scheduling cycle. The
TMS algorithm relies on the Birkhoff-von Neumann algorithm [14, 15], which we refer
to as BvN hereafter, to decompose a demand matrix into multiple circuit assignments.
BvN assumes its input matrix to have an identical sum for each row and each column.
Therefore, the TMS algorithm applies a pre-processing step that uses the Sinkhorn
algorithm, referred to as Sinkhorn hereafter, to transform an arbitrary demand
matrix so that it meets the input requirements of BvN. Hence the TMS algorithm
consists of two steps, as described in Algorithm 1.
Algorithm 1 TMS scheduling algorithm
Input: traffic demand matrix D, error tolerance for Sinkhorn τ
Output: a list of slot configurations C
1: Doubly Stochastic Matrix D′ ← Sinkhorn(D, τ)
2: C ← BvN(D′)
In the first step, TMS uses the Sinkhorn algorithm to transform the traffic demand
matrix D into a Doubly Stochastic Matrix (DSM) D′. A DSM is a matrix whose row
sums and column sums are all 1, i.e.
$$\sum_{j} D'(i, j) = 1 \ \ \forall i, \qquad \sum_{i} D'(i, j) = 1 \ \ \forall j \tag{3.2}$$
As shown in Algorithm 2, Sinkhorn obtains a DSM by iteratively normalizing
the rows and columns of a matrix with positive (greater than 0) entries (lines 3-9).
Therefore, the zero entries in D are replaced by a small value σ > 0 before the
iterative normalization (line 2). The output of Sinkhorn is an approximate DSM,
whose row sums and column sums may not be exactly 1; Sinkhorn repeats the
normalizations until the sums are sufficiently close to 1, as specified by the error
tolerance (line 9). Sinkhorn’s normalization may distort the input demand matrix D
and result in distorted demand, which corresponds to entries with zero value in D
but non-zero value in the resulting DSM D′, i.e. D(i, j) = 0, but D′(i, j) > 0.
Algorithm 2 Subroutine: The Sinkhorn Normalization
1: procedure Sinkhorn(traffic demand D, error tolerance τ)
2:   D′ ← Fill zero entries in D with σ > 0    ▷ Fill zero entries
3:   repeat
4:     Dtmp ← 0
5:     ∀ row i in D′ : Dtmp(i, j) ← $D'(i, j) / \sum_{j=1}^{N} D'(i, j)$    ▷ Normalize over rows
6:     ∀ column j in Dtmp : Dtmp(i, j) ← $D_{tmp}(i, j) / \sum_{i=1}^{N} D_{tmp}(i, j)$    ▷ Normalize over columns
7:     D′ ← Dtmp
8:     error ϵ ← $\max\left( \left| \sum_{i=1}^{N} D'(i, j) - 1 \right|, \left| \sum_{j=1}^{N} D'(i, j) - 1 \right| \right)$
9:   until error ϵ < tolerance τ
10:  return Doubly Stochastic Matrix (DSM) D′
11: end procedure
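
The Sinkhorn normalization and the distorted demand it introduces can be sketched in
plain Python as follows. This is a minimal illustration under stated assumptions:
the function name `sinkhorn`, the fill value `sigma`, the iteration cap `max_iter`,
and the sample demand matrix are all hypothetical choices, not values from [4, 5].

```python
def sinkhorn(demand, tol, sigma, max_iter=10000):
    """Scale a non-negative square matrix towards a doubly stochastic matrix.

    Zero entries are first replaced by sigma > 0, then rows and columns are
    normalized alternately until all row and column sums are within tol of 1
    (or max_iter iterations have run).
    """
    n = len(demand)
    d = [[v if v > 0 else sigma for v in row] for row in demand]
    for _ in range(max_iter):
        # Normalize over rows.
        d = [[v / sum(row) for v in row] for row in d]
        # Normalize over columns (applied to the row-normalized matrix).
        col_sums = [sum(d[i][j] for i in range(n)) for j in range(n)]
        d = [[d[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        # Largest deviation of any row or column sum from 1.
        row_err = max(abs(sum(row) - 1) for row in d)
        col_err = max(abs(sum(d[i][j] for i in range(n)) - 1) for j in range(n))
        if max(row_err, col_err) < tol:
            break
    return d


if __name__ == "__main__":
    demand = [[3, 1, 0],
              [1, 0, 2],
              [0, 2, 1]]
    dsm = sinkhorn(demand, tol=1e-6, sigma=0.1)
    for i, row in enumerate(dsm):
        for j, v in enumerate(row):
            tag = "  <- distorted (D = 0)" if demand[i][j] == 0 else ""
            print("D'(%d,%d) = %.4f%s" % (i, j, v, tag))
```

Every entry that is zero in the input ends up strictly positive in the output, which
is exactly the distorted demand described above.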
In the second step, TMS generates multiple circuit assignments by decomposing
the DSM D′ with BvN. As shown in Algorithm 3, BvN is an iterative algorithm
for matrix decomposition. In each iteration, BvN produces an assignment and the
corresponding coefficient, i.e. its time share in the circuit schedule (lines 4-5),
and updates the matrix for the next iteration (line 8). BvN schedules the circuits
to serve the exact demand specified by the input DSM D′. In other words, the
slot configurations produced by BvN satisfy
$$\sum_{m=1}^{M_k} c_m.p \times c_m.t = D' \quad \text{and} \quad \sum_{m=1}^{M_k} c_m.t = 1.$$
Since the BvN input D′ may be distorted by the preceding Sinkhorn step, BvN may
be misguided, so that the scheduled circuits serve the original demand matrix D
poorly. Besides, BvN incurs a high circuit configuration cost: it may produce up to
O(N^2) circuit assignments per scheduling cycle, where N is the port count of the
optical switch.
Algorithm 3 Subroutine: The BvN Decomposition
1: procedure BvN(DSM D′)
2:   a list of slot configurations C ← ∅
3:   while D′ ≠ ∅ do
4:     p ← perfect matching over D′
5:     t ← min{D′(i, j) : (i, j) ∈ p}
6:     slot configuration c.p ← p, c.t ← t
7:     append c to C
8:     D′ ← D′ − t·p
9:   end while
10:  return C
11: end procedure
Both Sinkhorn and BvN are iterative algorithms, and thus are hard to parallelize.
The run time of Sinkhorn depends on both the input demand matrix D and the error
tolerance τ. The computation complexity of BvN is O(N^4.5) for an optical switch
with N ports. Therefore, the computation complexity of TMS is at least O(N^4.5).
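
The decomposition loop of BvN can be sketched in Python as follows. This is an
illustrative sketch, not the implementation evaluated in the thesis: the perfect
matching is found with a simple augmenting-path search (Kuhn's algorithm) rather than
whichever matching routine underlies the O(N^4.5) bound, and the names
`perfect_matching`, `bvn`, and the threshold `eps` are assumptions.

```python
def perfect_matching(d, eps):
    """Find a perfect matching that uses only entries d[i][j] > eps.

    Returns a list of (row, column) pairs, or None if no perfect matching
    exists. Uses simple augmenting paths (Kuhn's algorithm).
    """
    n = len(d)
    match_col = [-1] * n  # match_col[j] = row currently matched to column j

    def try_row(i, seen):
        for j in range(n):
            if d[i][j] > eps and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            return None
    return [(match_col[j], j) for j in range(n)]


def bvn(dsm, eps=1e-9):
    """Decompose a doubly stochastic matrix into (matching, time share) slots."""
    d = [row[:] for row in dsm]  # work on a copy
    slots = []
    while True:
        matching = perfect_matching(d, eps)
        if matching is None:  # nothing (or only numerical residue) is left
            break
        t = min(d[i][j] for i, j in matching)
        slots.append((matching, t))
        for i, j in matching:
            d[i][j] -= t  # the smallest matched entry becomes exactly zero
    return slots


if __name__ == "__main__":
    dsm = [[0.5, 0.5, 0.0],
           [0.5, 0.0, 0.5],
           [0.0, 0.5, 0.5]]
    for matching, t in bvn(dsm):
        print("time share %.2f: circuits %s" % (t, sorted(matching)))
```

Each iteration zeroes at least one entry, so the loop produces at most N^2 slots,
which matches the O(N^2) bound on circuit assignments stated above. Because every
iteration depends on the matrix left by the previous one, the loop is inherently
sequential.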
3.3.2 Other Scheduling Algorithms for Crossbar Switches
A number of algorithms can generate multiple matchings from one demand matrix
(e.g. [16, 17, 18]), and the TMS algorithm discussed above [4, 5] is a representative
example. These algorithms can be traced back to the classic crossbar switch
scheduling problem, or the so-called Time Slot Assignment (TSA) problem [16]. The
solution algorithms usually rely on matrix decomposition techniques, which are
similar to BvN in the sense that they iteratively update and decompose a matrix into
multiple assignments, with a weighted coefficient for each assignment.
However, these algorithms all suffer from high computational complexity. Furthermore,
they are hard to parallelize because they rely on iteratively updating the
matrix to produce one assignment after another. In contrast, we propose a circuit
scheduling algorithm that can run in parallel to handle large-scale circuit scheduling
problems with reasonable computation time.
Chapter 4
Motivating Example
In this chapter, we illustrate the inefficiency behind the two existing circuit
scheduling algorithms with an example. In particular, we apply both algorithms to
the same traffic demand; the scheduling results are shown in Figure 4.1 and
Figure 4.2.
4.1 Ineﬃciency behind Edmond
For each scheduling cycle, Edmond only schedules one set of circuits with one assignment.
Traffic on unscheduled circuits cannot take advantage of the optical bandwidth
and is thus starved in a pure optical network, as shown with the shaded demand in
Figure 4.1. Particularly, Edmond schedules circuits based on the maximum weighted
matching of the traffic demand: it considers the corresponding links to be hot spots
and tailors the circuit schedule to the heaviest demand.
As heavy demand also takes longer to serve, traffic on circuits
with smaller demand may be delayed for a long time before it is allocated
optical circuits and benefits from the optical bandwidth. For example in Figure 4.1,
D(1, 4) and D(3, 2) are competing with D(1, 2) for the same input port or output
port. Edmond would prioritize the circuit for D(1, 2), and the circuits for D(3, 2) and
D(1, 4) would need to wait until D(1, 2) has been served long enough
for them to be selected, based on the maximum weighted matching
Figure 4.1 : Scheduling result of Edmond. Grape shaded cells indicate demanding
circuits NOT covered in the scheduling.
of the traffic demand. In a pure optical network, or a hybrid network with small
electrical bandwidth, traffic on circuits with small demand (e.g. D(3, 2) and D(1, 4))
ends up with a long delay because it can hardly receive any optical bandwidth when
some large demand (e.g. D(1, 2)) dominates the optical circuit scheduling.
The problem of delaying small traﬃc is mitigated in the studies on the Edmond
scheduling [1, 2], which are based on a hybrid network with suﬃciently large electrical
bandwidth, so that the traﬃc on circuits with small demand may take advantage of
the abundant electrical bandwidth and avoid starvation, even if they are not allocated
with optical circuits.
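The starvation effect can be reproduced on a small demand matrix. The following is a minimal sketch, not the Edmond implementation: it finds the maximum weighted matching by brute force over all port permutations (an assumption made for brevity, practical only for very small N), and the 4 × 4 demand values and the helper names max_weight_assignment and starved_demand are hypothetical. Ports are 0-indexed here, so D(1, 2) in the text corresponds to demand[0][1].

```python
from itertools import permutations


def max_weight_assignment(demand):
    """Return the circuits of one maximum weighted matching (brute force)."""
    n = len(demand)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(demand[i][perm[i]] for i in range(n)),
    )
    # Keep only circuits that actually carry demand.
    return {(i, best[i]) for i in range(n) if demand[i][best[i]] > 0}


def starved_demand(demand, circuits):
    """List non-zero demand entries that received no circuit in this cycle."""
    return sorted(
        (i, j)
        for i, row in enumerate(demand)
        for j, d in enumerate(row)
        if d > 0 and (i, j) not in circuits
    )


# Hypothetical demand: one heavy entry competing with two small ones.
demand = [
    [0, 100, 0, 3],
    [0, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 0],
]
circuits = max_weight_assignment(demand)
print("scheduled:", circuits)
print("starved:", starved_demand(demand, circuits))
```

In this example the single assignment contains only the circuit for the heavy demand; the two small demands share a port with it and receive no optical circuit in the cycle.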
4.2 Ineﬃciency behind TMS
TMS schedules multiple circuit assignments to cover more traffic demand in each
scheduling cycle. However, both steps of TMS, i.e. Sinkhorn and BvN, can result in
significant inefficiency due to wasted circuit resources and heavy switching overhead.
For one thing, Sinkhorn's normalization may heavily distort the original demand
and produce a misleading DSM for BvN. Particularly for a demand matrix with zero
entries, Sinkhorn can bring in distorted demand, by modifying the demand of a circuit
Figure 4.2 : Scheduling result of TMS. All numbers are rounded to two decimal
places. Red shaded cells indicate distorted demand and wasted circuits.
in D′ to an arbitrary value even if the circuit has no traffic demand in the original
demand matrix D, i.e. D(i, j) = 0 but D′(i, j) > 0. For example in Figure 4.2,
the shaded non-zero demand in the DSM D′ corresponds to zero demand in the
original demand matrix. As the scheduling produced by BvN exactly reflects the
input DSM, BvN would schedule wasted circuits based on the misleading DSM. As
shown in Figure 4.2, multiple wasted circuits are shaded in the assignments produced
by BvN. These wasted circuits do not eﬀectively serve any traﬃc demand. Besides,
setting up these wasted circuits largely harms performance because they not only
cause switching overhead, but they may also take up the ports required by other
circuits with non-zero demand in the original demand matrix. For example in p3 of
Figure 4.2, rather than setting up the two wasted circuits, circuit from input port 1
to output port 2 could be set up to serve traﬃc demand. In summary, Sinkhorn may
distort the original demand matrix heavily so that the resulting circuits scheduled
would poorly serve the original demand.
Besides, BvN schedules circuit switching aggressively, producing up to $O(N^2)$
assignments per cycle, which may incur excessive circuit reconfiguration overhead as
the optical switch port count N grows. The computation overhead of TMS, $O(N^{4.5})$,
also grows rapidly with N, which makes TMS less capable of scheduling circuits for an
optical switch with a large port count.
Unfortunately, the Sinkhorn distortion problem and the limited scalability are
ignored in the studies of the TMS algorithm [4, 5], which are based on uniform dense
traﬃc demand on a small scale optical switch with 6 ports. For one thing, compared
with a skewed and sparse matrix, Sinkhorn’s iterative normalization would apply less
distortion on a matrix full of uniform non-zero entries. For another, the 6-port switch
in the studies is too small to reveal the switching and computation overhead,
both of which grow with the switch port count.
Chapter 5
Eﬃcient Optical Circuit Scheduling with Decomp
5.1 Algorithm Overview
Decomp can be seen as a framework. Unlike other algorithms, Decomp structures
the circuit schedule computation into three stages, namely partition, schedule,
and merge, so that 1) the traﬃc demand matrix is first partitioned into multiple
regions, 2) the schedule stage makes circuit selection decisions on diﬀerent regions,
and 3) the merge stage coordinates selected circuits on diﬀerent regions. Particularly,
Decomp employs three key techniques that are not found in previous proposals, i.e.
1) partitioning the demand matrix, 2) randomization before partitioning, and 3)
parallelization for scalability, discussed as follows.
5.1.1 Key Techniques
Partitioning the Demand Matrix Existing algorithms such as Edmond or TMS
that schedule all circuits over the entire traﬃc demand matrix allow large traﬃc de-
mand to easily dominate the use of the optical circuits, which starves small demand in
the optical network. To solve this problem, Decomp first partitions the traﬃc demand
matrix into multiple regions and schedules circuits for diﬀerent regions in isolation
(the randomization technique described next further improves the algorithm’s robust-
ness against small demand starvation). In this way, Decomp distributes the optical
bandwidth among small and large demand because the large demand can now only
dominate a single region rather than the entire network. Then, Decomp merges par-
tial scheduling results into a global scheduling decision for all circuits, and enforces
high utilization of the optical network in merging.
Randomization before Partitioning A partition of traffic demand for parallel
computation might still result in scheduling decisions that favor some circuits over
others. For example, circuits in sparse regions are more likely to be set up than
those in dense regions with intensive competition for optical bandwidth. To maintain
robust and stable performance over various traffic patterns, Decomp randomizes the
labeling of racks before partitioning demand regions, so that one region may represent
demand for a different subset of optical circuits in each computation iteration. In this
way, Decomp removes the chance of persistent bias toward certain circuits.
Parallelization for Scalability Decomp's computations can be parallelized in K
ways, where K is a power of 4 (to maintain the square shape of each sub-matrix). To do
so, inter-rack traﬃc demand matrix is first partitioned into regions, each representing
traﬃc demand for a portion of links between source and destination racks. Next, each
parallel process takes in one region and computes partial circuit assignments for links
included in the region. In other words, each process controls a subset of all optical
circuits and schedules them in the scope of the partitioned region. Finally the partial
circuit assignments are merged to obtain the schedule for all circuits.
5.1.2 Decomp Techniques and the Design Space
These key techniques help Decomp to cover the design space of the circuit scheduling
algorithm (Chapter 1).
Scalability Decomp can run in parallel to handle scheduling computation for
an optical switch with large port count. To avoid high computational overhead, a
lightweight heuristic can be applied in the schedule stage. Moreover, the schedule
stage can be customized with diﬀerent circuit selection policies that aim for a variety
of performance objectives. Particularly, it can also be explicitly designed to mitigate
excessive switching overhead as the switch port count grows.
Handling Hybrid and Pure Architectures As we discussed in Section 5.1.1,
compared with the existing algorithms such as Edmond or TMS that schedule all
circuits over the entire traffic demand matrix, Decomp schedules circuits based on
each partitioned demand region and distributes the optical bandwidth among small
and large traffic demand. The randomization technique also helps to mitigate the
large demand dominance problem.
Robustness under Diﬀerent Traﬃc Patterns. In contrast with TMS that
may apply arbitrary distortion to the traﬃc demand and misguide the resulting cir-
cuit schedule, Decomp may use a heuristic in the schedule stage so that circuits are
scheduled based on the original traffic demand. Moreover, the merge stage of Decomp
is also designed to respect the scheduling results on different regions, so that
the ultimate circuit schedule may better match the demand originally requested.
5.2 The Decomp Algorithm
As described in Algorithm 4, Decomp has three major stages, i.e. partition, schedule
and merge. As a pre-processing step, the order of racks is randomized with a
random permutation function $f^r$ (line 1). The randomization is repaired by mapping
the sources and destinations of circuits in the output configurations back to the
corresponding racks using the inverse function of $f^r$ (line 17).
Algorithm 4 Decomp scheduling algorithm
Input: traffic demand matrix $D$, parallel in $4^L$-way
Output: a list of slot configurations $C$
1: $f^r$ ← random permutation of {1, ..., N}    ▷ Randomization
2: construct randomized demand $D^r$ s.t. $D^r(i, j) = D(f^r(i), f^r(j))$
3: divide $D^r$ into $4^L$ regional SDMs $\{D^s_1, ..., D^s_{4^L}\}$    ▷ Partition
4: for all SDM $D^s \in \{D^s_1, ..., D^s_{4^L}\}$ do    ▷ Parallel Schedule
5:     regional configurations $C^s$ ← Schedule($D^s$)
6: end for
7: for merge level l = 1 to L do
8:     divide $D^r$ into $4^{L-l}$ SDMs
9:     for all major-region covered by $D^s \in \{D^s_1, ..., D^s_{4^{L-l}}\}$ do    ▷ Parallel merge
10:        $(C_0, C_1, C_2, C_3)$ ← configurations of 4 minor-regions within major-region
11:        major-region configurations $C^s$ ← Merge($C_0, C_1, C_2, C_3, D^s$)
12:    end for
13: end for
14: $C$ ← configurations of the one major-region on the L-th merge level
15: $f^{-r}$ ← repairing permutation s.t. $f^{-r}(f^r(i)) = i$
16: for all c ∈ C do
17:     construct $p^{-r}$ s.t. $p^{-r}(i, j) = c.p(f^{-r}(i), f^{-r}(j))$    ▷ Repair from randomization
18:     c.p ← $p^{-r}$
19: end for
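The randomization and repair steps (lines 1, 2 and 15 to 18 of Algorithm 4) can be illustrated in Python. This is a sketch under stated assumptions: racks are 0-indexed, circuits are stored as a set of (source, destination) pairs rather than as an assignment matrix, and the function names randomize and repair are hypothetical.

```python
import random


def randomize(demand, rng):
    """Relabel racks with a random permutation f and build D^r(i, j) = D(f(i), f(j))."""
    n = len(demand)
    f = list(range(n))
    rng.shuffle(f)
    randomized = [[demand[f[i]][f[j]] for j in range(n)] for i in range(n)]
    return f, randomized


def repair(circuits, f):
    """Map circuits expressed in randomized labels back to the original rack labels."""
    return {(f[i], f[j]) for (i, j) in circuits}


# Hypothetical 3-rack demand.
demand = [
    [0, 5, 0],
    [0, 0, 7],
    [3, 0, 0],
]
f, randomized = randomize(demand, random.Random(0))
# Pretend every non-zero entry of the randomized matrix was given a circuit.
circuits_r = {(i, j) for i in range(3) for j in range(3) if randomized[i][j] > 0}
restored = repair(circuits_r, f)
print(restored)
```

Because the set representation stores circuits by their endpoints, repairing amounts to applying f to both endpoints, which is equivalent to indexing the assignment matrix with the inverse permutation as in line 17.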
Figure 5.1 : An example illustrating 4-way Decomp for 6 racks.
5.2.1 Stage I: Partition
The schedule stage of Decomp can be parallelized in $4^L$ ways ($L \in \{1, 2, ...\}$). In $4^L$-way
Decomp over N racks, the inter-rack traffic demand is an N × N traffic demand
matrix (TDM) D, which is divided into $4^L$ regions, each of size
$\frac{N}{2^L} \times \frac{N}{2^L}$ (line 3).
A regional demand matrix is called a sub-demand matrix (SDM). For example, in 4-way
Decomp over 6 racks, the 6 × 6 TDM for racks in random order is partitioned
into four 3 × 3 SDMs, as shown in the Partition step in Figure 5.1.
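As an illustration, the partition step can be sketched in Python as follows. The function name partition and the row-major numbering of regions (region 0 top-left, region 1 top-right, region 2 bottom-left, region 3 bottom-right for the 4-way case) are assumptions made for this example, and N is assumed to be divisible by $2^L$.

```python
def partition(demand, levels):
    """Split an N x N demand matrix into 4**levels square sub-demand matrices."""
    n = len(demand)
    blocks_per_side = 2 ** levels
    if n % blocks_per_side != 0:
        raise ValueError("N must be divisible by 2**levels")
    size = n // blocks_per_side
    regions = []
    for block_row in range(blocks_per_side):
        for block_col in range(blocks_per_side):
            rows = demand[block_row * size:(block_row + 1) * size]
            regions.append(
                [row[block_col * size:(block_col + 1) * size] for row in rows]
            )
    return regions


# 6 x 6 example in which entry (i, j) holds 6 * i + j, so blocks are easy to recognize.
tdm = [[6 * i + j for j in range(6)] for i in range(6)]
sdms = partition(tdm, levels=1)
print(len(sdms), sdms[0])
```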
5.2.2 Stage II: Schedule
During the schedule stage, a list of slot configurations is computed for each region
based on the regional demand SDM. Different from Edmond and TMS, each slot
configuration in Decomp has an extra parameter, weight, which indicates the demand
covered by the slot. The time duration of each slot is assigned proportionally to its weight,
and the slot weights are further used in the merge stage of Decomp.
Parallel Schedule. Each SDM is fed into a subroutine to compute a list of slot
configurations for the circuits covered by the SDM. Computation on each SDM can
be run in parallel in diﬀerent processes since there is no dependency among SDMs
(line 4 in Algorithm 4). For example, in the Schedule stage of Figure 5.1, the SDMs
are fed into four parallel processes.
Greedy Matrix Decomposition. The Schedule subroutine is described in Algorithm 5.
We choose a lightweight greedy algorithm as the schedule function,
which iteratively decomposes the SDM into multiple circuit assignments. In each iteration,
we scan through the list of SDM entries sorted in descending order (line 3), and add
circuits for demand whenever the new circuit does not conflict with circuits already
assigned in the same iteration (line 7 to line 12). The weight for each slot is set to
the maximum demand covered by the slot's circuit assignment during the iterations
of the greedy decomposition (line 6). We remove the demand entries with circuits
assigned (line 10) and keep producing new circuit assignments until the demand entries
are drained (line 4).
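The greedy scan described above can be sketched in Python as follows. This is an illustrative sketch of the decomposition loop only (back filling, pruning and time shares are left out); the function name greedy_decompose, the dictionary layout of a slot and the 3 × 3 demand values are hypothetical.

```python
def greedy_decompose(demand):
    """Decompose a demand matrix into slots of mutually non-conflicting circuits."""
    # Non-zero entries as (demand, src, dst), largest demand first.
    entries = sorted(
        ((d, i, j) for i, row in enumerate(demand) for j, d in enumerate(row) if d > 0),
        key=lambda e: -e[0],
    )
    slots = []
    while entries:
        weight = entries[0][0]  # largest remaining demand
        used_in, used_out = set(), set()
        circuits, remaining = set(), []
        for d, i, j in entries:
            if i in used_in or j in used_out:
                remaining.append((d, i, j))  # conflicts, retry in a later slot
            else:
                circuits.add((i, j))
                used_in.add(i)
                used_out.add(j)
        entries = remaining
        slots.append({"circuits": circuits, "weight": weight})
    return slots


# Hypothetical 3 x 3 sub-demand matrix.
demand = [
    [100, 23, 5],
    [0, 90, 20],
    [21, 0, 80],
]
slots = greedy_decompose(demand)
for slot in slots:
    print(slot["weight"], sorted(slot["circuits"]))
```

On this input the scan produces three slots with weights 100, 23 and 5; the last one holds a single circuit, which is the situation that back filling and pruning address.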
We choose this greedy decomposition algorithm for several reasons. Firstly, un-
like Edmond which uses the maximum weighted matching algorithm and generates
Algorithm 5 Subroutine: Schedule (Greedy Matrix Decomposition)
1: procedure Schedule(demand matrix D)
2: a list of slot configurations C ← ∅
3: DL ← list of descending-sorted non-zero entries in D
4: while DL not empty do ◃ Greedy matrix decomposition
5: circuit assignment p← ∅
6: weight w ← largest entry in DL
7: for demand entry d ∈ DL do
8: if the circuit from d.src to d.dst does not conflict with p then
9: p← p+ the circuit from d.src to d.dst
10: DL ← DL − d
11: end if
12: end for
13: add non-conflicting circuits with non-zero demand in D to p ◃ Back Filling
14: slot assigned circuits c.p← p, weight c.w ← w
15: append c to C
16: end while
17: cutoff ← $\max_{p \in C.p} |p|$
18: remove c ∈ C if |c.p| < cutoff ▷ Dynamic Pruning
19: time share for c ∈ C : c.t ← $c.w / \sum_{c \in C} c.w$
20: return C
21: end procedure
only one circuit assignment per cycle, the greedy decomposition algorithm produces
multiple circuit assignments based on demand, so that more valid circuits may be
covered in each cycle. Secondly, the greedy decomposition algorithm has much lower
computation overhead. For an $\frac{N}{2^L} \times \frac{N}{2^L}$ SDM, the running time of the greedy algorithm
is $O(n^3)$ with $n = \frac{N}{2^L}$ [19]. To produce multiple circuit assignments per cycle, one
can also replace the greedy scanning (line 7 to line 12) with the maximum weighted
matching algorithm, so that in each iteration, the maximum weighted matching algorithm
produces a set of circuits based on the SDM. Then the SDM is updated by
removing the demand entries covered by the circuits in the maximum weighted
matching, before the updated SDM is fed into the next iteration. However, using
the maximum weighted matching algorithm instead of greedy scanning would incur
$O(n^4)$ computation. TMS also produces multiple circuit assignments per cycle;
however, its computation complexity is even higher, at $O(n^{4.5})$. The Schedule subroutine
is a heavily used function, so we pick a lightweight greedy algorithm to
avoid excessive delay in computing the control logic. Thirdly, the greedy decomposition
algorithm incurs less circuit reconfiguration overhead compared with TMS. For an
$\frac{N}{2^L} \times \frac{N}{2^L}$ SDM, the number of assignments generated per cycle by the greedy algorithm
is in $O(n)$ [19], compared with $O(n^2)$ under TMS. It is also worth mentioning
that replacing the greedy scanning (line 7 to line 12) with the maximum weighted
matching algorithm will not help to reduce the number of assignments generated per
cycle from $O(n)$ [19]. Fourthly, this greedy algorithm is efficient because it groups circuits
with demand close to one another into the same assignment, which effectively reduces
the circuit idleness incurred when relatively small demand is drained in one assignment.
The upper half of Figure 5.2 shows the results of the greedy matrix decomposition
algorithm on the example demand without partitioning.
Figure 5.2 : Greedy matrix decomposition (GMD) algorithm on the example demand.
Demand is not partitioned. Back filling circuits are marked red.
Back Filling Circuits. In the greedy matrix decomposition algorithm, demand
entries are removed iteratively (line 10). Hence assignments produced in those later
iterations might fail to cover some non-conflicting circuits that have non-zero demand.
For example, in Figure 5.2, the circuit from input port 1 to output port 2 is not
covered in the last circuit assignment produced. In an online system, letting resources
sit idle may hurt performance. As a result, we propose to perform circuit back filling on
all circuit assignments generated, by adding circuits that have non-zero demand in
SDM as long as the circuits are not conflicting with circuits already included in the
assignment (line 13). For example in Figure 5.2, we add an extra circuit from input
port 1 to output port 2 to the last circuit assignment.
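Back filling can be sketched in Python as follows. The text does not specify the order in which candidate circuits are considered, so scanning them in descending order of demand is an assumption of this sketch, as are the function name back_fill and the example values.

```python
def back_fill(circuits, demand):
    """Add circuits with non-zero demand that do not conflict with those already chosen."""
    filled = set(circuits)
    used_in = {i for i, _ in filled}
    used_out = {j for _, j in filled}
    # Assumption: consider candidates from largest to smallest demand.
    candidates = sorted(
        ((d, i, j) for i, row in enumerate(demand) for j, d in enumerate(row) if d > 0),
        key=lambda e: -e[0],
    )
    for _, i, j in candidates:
        if i not in used_in and j not in used_out:
            filled.add((i, j))
            used_in.add(i)
            used_out.add(j)
    return filled


# Hypothetical SDM and a late assignment that covers only one circuit.
demand = [
    [100, 23, 5],
    [0, 90, 20],
    [21, 0, 80],
]
last_assignment = {(0, 2)}
filled = back_fill(last_assignment, demand)
print(sorted(filled))
```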
Dynamic Slot Pruning. We can dynamically prune out undesired slots to
maintain high circuit utilization. Particularly, we choose to keep the slots with the maximum
number of concurrent circuits and prune out the rest in order to maximize circuit
utilization (line 18). For example in Figure 5.2, the last slot is pruned because it has
only 2 concurrent circuits, fewer than the other slots with 3 circuits. Note that
other pruning policies may also be applied to serve different scheduling purposes. Finally,
depending on slot weights, we scale the time share for the remaining slots to fill the
whole scheduling cycle, so that slots covering larger demand are allocated more time
share (line 19). For example in Figure 5.2, after pruning out the last slot, we may
assign the first two slots with duration of 100/123 and 23/123 respectively.
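The pruning and time-share steps (lines 17 to 19 of Algorithm 5) can be sketched in Python as follows. The slot weights 100 and 23 follow the example above; the weight of the pruned slot, the circuit sets and the function name prune_and_share are hypothetical.

```python
from fractions import Fraction


def prune_and_share(slots):
    """Keep slots with the maximum circuit count and split the cycle by weight."""
    if not slots:
        return []
    cutoff = max(len(s["circuits"]) for s in slots)
    kept = [dict(s) for s in slots if len(s["circuits"]) >= cutoff]
    total_weight = sum(s["weight"] for s in kept)
    for s in kept:
        s["share"] = Fraction(s["weight"], total_weight)
    return kept


slots = [
    {"circuits": {(0, 0), (1, 1), (2, 2)}, "weight": 100},
    {"circuits": {(0, 1), (1, 2), (2, 0)}, "weight": 23},
    {"circuits": {(0, 2), (1, 0)}, "weight": 5},  # only 2 circuits, gets pruned
]
kept = prune_and_share(slots)
print([(s["weight"], s["share"]) for s in kept])
```

Exact fractions are used so that the shares of the remaining slots sum to exactly one scheduling cycle.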
5.2.3 Stage III: Merge
Taking Advantage of Non-conflicting Regions. After the schedule stage, we obtain
the regional circuit scheduling, i.e. a list of slot configurations for each region. However,
we cannot apply these slots directly because the circuits scheduled in different regions
might conflict with each other. Particularly, the ports covered by two conflicting
regions overlap, and thus the circuits to be scheduled in these two regions
conflict if they require the same port. For example, in Figure 5.1, region
0 and region 1 are conflicting regions because they cover the same set of input ports.
Nevertheless, non-conflicting regions (e.g. region 0 and region 3 in Figure 5.1) cover
disjoint subsets of ports and thus slots in two non-conflicting regions can be scheduled
at the same time.
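The notion of conflicting regions can be made concrete with a small Python check. The sketch assumes the four regions of one major region are numbered row by row in a 2-by-2 layout (0 and 1 on top, 2 and 3 below), which matches the examples above; the function names and the parameter half (the side length of a minor region) are hypothetical.

```python
def region_ports(region, half):
    """Return the input and output port sets covered by one of the four regions."""
    block_row, block_col = divmod(region, 2)
    inputs = set(range(block_row * half, (block_row + 1) * half))
    outputs = set(range(block_col * half, (block_col + 1) * half))
    return inputs, outputs


def regions_conflict(a, b, half):
    """Two regions conflict if they share any input port or any output port."""
    in_a, out_a = region_ports(a, half)
    in_b, out_b = region_ports(b, half)
    return bool(in_a & in_b) or bool(out_a & out_b)


# 6 racks split into four 3 x 3 regions.
print(regions_conflict(0, 1, half=3))  # same input ports
print(regions_conflict(0, 3, half=3))  # diagonal regions
```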
The main task of merge is to compute a globally feasible circuit scheduling from
the regional scheduling, while respecting the scheduling results on diﬀerent regions.
Instead of serializing all regional scheduling, we propose an eﬃcient merge algorithm
which takes advantage of the diagonal non-conflicting regions by first 1) merging
scheduling on non-conflicting regions and then 2) serializing merged scheduling on
conflicting regions. Particularly, we make use of major regions and minor regions
so that a major region contains 4 minor regions in 2-by-2 layout. We obtain circuit
scheduling for the major regions by merging slots from the diagonal non-conflicting
minor regions. We merge recursively until the circuit scheduling that covers all circuits
is obtained.
A $4^L$-way Decomp with $4^L$ regions in the schedule stage has L merge levels (line 7
in Algorithm 4). The major regions and minor regions are defined recursively
across merge levels. A minor region on one level is the major region on
the next lower level, with each of the $4^L$ regions in the schedule stage defined as a
minor region on the 1st merge level. A major region on level l is defined as the
links covered by one of the $4^{L-l}$ SDMs, each of size $\frac{N}{2^{L-l}} \times \frac{N}{2^{L-l}}$, divided from the
randomized demand matrix $D^r$ (line 9 in Algorithm 4). For example, in the Merge
step of Figure 5.1, each of the 3 × 3 blocks is a minor region on the 1st merge level
and the 6 × 6 block is a major region on the 1st merge level and a minor region on
the 2nd merge level.
Computation of merging on different major regions can run in parallel (line 9 in
Algorithm 4). Furthermore, merging configurations from minor regions on different
diagonals can also run in parallel (details to be discussed in Section 5.3.1). Note that
on the L-th merge level, there is only one major region, and the configurations for
this one major region cover all circuits (line 14 in Algorithm 4).
Merging Slots from Non-conflicting Regions. Merging slots from two non-conflicting
regions is non-trivial: we must avoid excessive switching overhead but also
need to make the most of regional scheduling results. On one hand, a cross product of
the slots from both regions may result in excessive circuit switching overhead because
the number of resulting assignments would be the product of the numbers of
slots in the two regions. On the other, merging slots from each region one by one would
constrain the number of resulting circuit assignments. However, it is less desirable
because the number of slots and their weights may differ vastly across regions,
and a slot from the region with more slots may fail to find a matching slot in the other
region. Besides, it is more favorable to merge two slots with similar weights so as to
group circuits with similar demand into one assignment.
We propose an efficient way to merge slots from the non-conflicting regions, so
that the merge stage may respect and make the most of regional scheduling decisions,
while avoiding excessive scheduling overhead. Particularly, we allocate virtual
time for each slot based on slot weights and combine two slots employed at the same
virtual time on the non-conflicting minor regions into one slot on the major region
(line 7 to line 15), with the weight for the combined slot set to the maximum weight
among the two original slots (line 14). For example in Figure 5.3, the schedule stage
generates two slots C0(0) and C0(1) for region 0 (blue) with virtual time shares of 0.8
and 0.2 respectively, and another two slots C3(0) and C3(1) for region 3 (red)
with virtual time shares of 0.5 and 0.5. Merging region 0 and 3 then yields three
configurations. The first one has a circuit assignment of C(0).p = C0(0).p ∪ C3(0).p,
the second C(1).p = C0(0).p ∪ C3(1).p (C0(0) is still active when C3(1) starts at
virtual time 0.5), and the third C(2).p = C0(1).p ∪ C3(1).p.
The slots should be sorted in descending order of weight before the merge (line 4). By sorting,
we try to group circuits with similar weight into one slot and assign to the new slot
a new weight that is close to the original ones, such that the new weight may better
represent the demand covered by the new assignment.
Note that each combined slot uses one slot from each of the two regions, and only
O(n) new slots are produced on the resulting 2n-by-2n major region given O(n) slots
from the original n-by-n minor regions.
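The virtual-time merge can be sketched in Python as a sweep over the two sorted slot lists. This is an illustration rather than a transcription of Algorithm 6: each pair of overlapping slots is combined before the indices advance, back filling is omitted, both regions are assumed to have at least one slot, and the function name, slot layout, weights and circuits are hypothetical. Shares are exact fractions so that simultaneous slot endings are detected reliably.

```python
from fractions import Fraction


def merge_non_conflicting(slots_a, slots_b):
    """Combine slots of two non-conflicting regions that overlap in virtual time."""
    a = sorted(slots_a, key=lambda s: -s["weight"])
    b = sorted(slots_b, key=lambda s: -s["weight"])
    merged = []
    ia = ib = 0
    end_a, end_b = a[0]["share"], b[0]["share"]  # virtual end times of current slots
    while ia < len(a) and ib < len(b):
        merged.append({
            "circuits": a[ia]["circuits"] | b[ib]["circuits"],
            "weight": max(a[ia]["weight"], b[ib]["weight"]),
        })
        # Advance the slot that ends first; advance both if they end together.
        advance_a = end_a <= end_b
        advance_b = end_b <= end_a
        if advance_a:
            ia += 1
            if ia < len(a):
                end_a += a[ia]["share"]
        if advance_b:
            ib += 1
            if ib < len(b):
                end_b += b[ib]["share"]
    return merged


region0 = [
    {"circuits": {(0, 0)}, "weight": 80, "share": Fraction(4, 5)},
    {"circuits": {(1, 1)}, "weight": 20, "share": Fraction(1, 5)},
]
region3 = [
    {"circuits": {(3, 3)}, "weight": 50, "share": Fraction(1, 2)},
    {"circuits": {(4, 4)}, "weight": 50, "share": Fraction(1, 2)},
]
merged = merge_non_conflicting(region0, region3)
for slot in merged:
    print(slot["weight"], sorted(slot["circuits"]))
```

With shares of 0.8 and 0.2 against 0.5 and 0.5, the sweep yields three slots, covering the virtual time intervals [0, 0.5], [0.5, 0.8] and [0.8, 1], so the number of merged slots stays linear in the number of input slots.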
Figure 5.3 : Decomp merges slots from non-conflicting regions. Generate a new slot
by combining two slots employed at the same virtual time in different regions.
Serializing Slots from Conflicting Regions. The merged slots from conflicting
regions, e.g. the merged slots from region 0 and 3, and the merged slots from
region 1 and 2, might contain conflicting circuits, and therefore these slots are serialized
to be the slots for the major region (line 23).
Optimizing with Back Filling and Dynamic Slot Pruning. Similar to the
schedule stage, the circuit assignments in the merged slots should be filled with
non-conflicting valid circuits (line 16). Since the two circuit assignments to be merged
have been filled in the schedule stage (or previous merges), extra circuits are less likely
in the two merging regions. However, if the two merging assignments do not take
up all input and output ports, more valid circuits might be added in the remaining two
regions not involved in the merge, by using the idle ports left after the two merging
assignments. For example in Figure 5.4, in the first 3 merged circuit assignments
from region 0 (blue) and 3 (red), extra circuits might exist in region 1 (cyan) and 2
(yellow). Besides, we can also further optimize the scheduling decisions by pruning out
undesired slots to enforce high circuit utilization (line 22). For example in Figure 5.4,
Figure 5.4 : Back filling extra non-conflicting valid circuits in the merge stage. Back
filling circuits are marked red.
we can prune out the third slot since it only has 5 concurrent circuits, fewer
than the remaining slots, which all have 6 circuits.
5.3 Decomp Algorithm Analysis
5.3.1 Process Usage
The schedule stage of $4^L$-way Decomp takes $4^L$ processes, one for each minor region on
the first level. The merge stage on the l-th level uses $2 \times 4^{L-l}$ processes for merging
minor regions on the two diagonals in each of the $4^{L-l}$ major regions. For example, in
Figure 5.1, one process is used to merge region 0 and 3 and another one is used to
merge region 1 and 2 in parallel.
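The process counts stated above can be tabulated with a few lines of Python. The function name process_usage is hypothetical; the formulas are the ones given in this section.

```python
def process_usage(levels):
    """Return (schedule processes, list of merge processes per merge level)."""
    schedule = 4 ** levels
    merge = [2 * 4 ** (levels - level) for level in range(1, levels + 1)]
    return schedule, merge


print(process_usage(1))  # 4-way Decomp
print(process_usage(2))  # 16-way Decomp
```

For 4-way Decomp this gives four schedule processes and two merge processes, one per diagonal, as in the Figure 5.1 example.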
5.3.2 Computation Complexity
In $4^L$-way Decomp for N racks, each schedule process takes $O((\frac{N}{2^L})^2 \log \frac{N}{2^L})$ to sort
the $(\frac{N}{2^L})^2$ entries in each SDM. The number of loops for picking up non-conflicting
entries from the sorted list (line 4 of Algorithm 5) is $O(\frac{N}{2^L})$. In each iteration, the
Algorithm 6 Subroutine: Merge
1: procedure Merge(four minor-region configurations $C_0, C_1, C_2, C_3$, major-region demand $D$)
2:     major region configurations C ← ∅
3:     for all minor-regions $(i_1, i_2)$ on one diagonal ((0, 3) or (1, 2)) do
4:         Apply descending sort on slots from both regions according to weight
5:         index of configurations in $C_{i_1}, C_{i_2}$: $j_1, j_2$ ← 0
6:         while $j_1 < |C_{i_1}|$ or $j_2 < |C_{i_2}|$ do
7:             if $C_{i_1}(j_1)$ ends before $C_{i_2}(j_2)$ then
8:                 $j_1 ← j_1 + 1$
9:             else if $C_{i_1}(j_1)$ ends after $C_{i_2}(j_2)$ then
10:                $j_2 ← j_2 + 1$
11:            else if $C_{i_1}(j_1)$, $C_{i_2}(j_2)$ end simultaneously then
12:                $j_1 ← j_1 + 1$, $j_2 ← j_2 + 1$
13:            end if
14:            weight $w = \max(C_{i_1}(j_1).w, C_{i_2}(j_2).w)$
15:            assignment $p ← C_{i_1}(j_1).p \cup C_{i_2}(j_2).p$
16:            add non-conflicting circuits with non-zero demand in D to p ▷ Back Filling
17:            slot assigned circuit c.p ← p, weight c.w ← w
18:            append c to C
19:        end while
20:    end for
21:    cutoff ← $\max_{p \in C.p} |p|$
22:    remove c ∈ C if |c.p| < cutoff ▷ Dynamic Pruning
23:    time share for c ∈ C : c.t ← $c.w / \sum_{c \in C} c.w$
24:    return C
25: end procedure
sorted list is scanned and its size decreases. Thus, the loop takes $O((\frac{N}{2^L})^3)$, resulting
in $O((\frac{N}{2^L})^3)$ for the schedule stage.
In the l-th level of merge, the maximum number of loops (line 6 in Algorithm 6)
is the total number of assignments in two diagonal minor-regions, or $O(\frac{N}{2^{L-l+1}})$. Thus
merges take $O(\frac{N}{2} - \frac{N}{2^{L+1}})$ in total. This is a loose estimation and merge has a
much lower complexity in practice. We observe that the computation time of the
merge stage is very small compared with that of the schedule stage. The bottleneck
of Decomp is thus its schedule stage, at $O((\frac{N}{2^L})^3)$.
5.3.3 Circuit Configurations
The number of circuit configurations for each scheduling cycle is subject to the algorithm
used in the schedule stage to decompose each SDM. The greedy matrix
decomposition algorithm used by Decomp is shown [20] to produce at most $2\frac{N}{2^L} - 1$
circuit assignments for an $\frac{N}{2^L} \times \frac{N}{2^L}$ SDM. In the 1st merge level, each minor region has no
more than $2\frac{N}{2^L} - 1$ circuit assignments. Thus merging four minor regions produces no
more than $4 \times (2\frac{N}{2^L} - 1)$ circuit assignments, with $2 \times (2\frac{N}{2^L} - 1)$ from merging minor
regions on each of the two diagonals. Thus each minor region on the 2nd merge level
has no more than $2\frac{N}{2^{L-2}}$ circuit assignments. In the l-th level of merge, assume each
minor region has no more than $2\frac{N}{2^{L-2l+2}}$ circuit assignments. Thus on the (l + 1)-th
merge level, each minor region has no more than $2\frac{N}{2^{L-2(l+1)+2}}$ circuit assignments. By induction,
we know that on the final merge level, the total number of circuit assignments
is less than $2^{L+1}N$ with L as a constant. Thus the number of circuit configurations
for Decomp is in O(N) for N racks.
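The induction can be checked numerically with a short Python sketch. It assumes the worst case stated above, where each merge level multiplies the per-region slot bound by 4, and the function name slot_bound is hypothetical.

```python
def slot_bound(n_racks, levels):
    """Upper bound on slots per cycle after all merge levels of 4**levels-way Decomp."""
    n = n_racks // 2 ** levels   # side length of one SDM
    bound = 2 * n - 1            # bound for one SDM in the schedule stage
    for _ in range(levels):
        bound *= 4               # two diagonals, each at most doubling the count
    return bound


for levels in (1, 2):
    print(levels, slot_bound(16, levels), 2 ** (levels + 1) * 16)
```

For 16 racks the bound is 60 slots with L = 1 and 112 slots with L = 2, in both cases below $2^{L+1}N$.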
Algorithm         | Computation Complexity                                        | Slots/Cycle
$4^L$-way Decomp  | $O((\frac{N}{2^L})^3) + O(\frac{N}{2} - \frac{N}{2^{L+1}})$   | $O(N)$
Edmond            | $O(N^3)$                                                      | 1
TMS               | $O(N^{4.5})$                                                  | $O(N^2)$
Table 5.1 : Comparison of scheduling algorithms on N racks. (Decomp is the only
algorithm that is parallelizable)
5.3.4 Comparison with Existing Algorithms
Table 5.1 shows a comparison of $4^L$-way Decomp, Edmond and TMS on N racks.
Although Decomp is a polynomial-time algorithm like Edmond and TMS, Decomp is
the only algorithm among the three that is parallelizable. As we will show in Section
6.2.2, Decomp is much faster than Edmond and TMS.
Besides, Decomp applies a modest amount of circuit switching per scheduling
cycle. On one hand, unlike Edmond which rejects all circuits outside the maximum
weighted matching set, Decomp schedules multiple circuit assignments per cycle, so as
to cover more traffic demand in each cycle and mitigate the starvation problem seen in Edmond.
On the other hand, Decomp maintains network utilization by bounding the number
of circuit configurations per cycle to $O(N)$, instead of $O(N^2)$ in TMS, to allow for
better scalability.
Chapter 6
Evaluation
6.1 Methodology
We study the performance of diﬀerent scheduling algorithms under the same settings,
i.e. topology, bandwidth allocation, circuit switching delay, etc.
6.1.1 Network Topology and Link Bandwidth
We simulate a network consisting of 16 racks, with 20 servers per rack. In the hybrid
architecture, each top of rack (TOR) switch connects to both a 16 × 16 optical
switch and an electrical Ethernet packet switch. In the pure architecture, each TOR
switch only connects to the optical switch. In the hybrid architecture, the electrical
bandwidth Be varies from 10 Mbps to 1 Gbps in the experiments. The bandwidth
of optical links in both architectures is 10 Gbps. The links between TOR and server
are 10 Gbps. Thus each server can saturate either the optical or electrical links in
the core network with its own traﬃc.
6.1.2 Metrics
We use three metrics, i.e. traffic finish time, flow waiting time and algorithm computation
time, to evaluate the scheduling algorithms. These metrics describe the
network resource utilization and traffic delay under a scheduling algorithm, as well
as how responsive a scheduling algorithm can be in case of traffic change.
• Traffic Finish Time $T_F$
Consider a set of flows F in a certain traffic pattern. A flow f ∈ F arrives
at s(f) and finishes at e(f). The traffic finish time is the time when all flows
finish, i.e.
$$T_F = \max_{f \in F} e(f). \qquad (6.1)$$
The traffic finish time reveals the network utilization when all flows arrive at
time 0 and wait to be serviced. To clear the same set of flows, the shorter the
traffic finish time is, the higher the utilization.
• Flow Waiting Time $t_W(f)$ and $T_W$
The waiting time of a flow f ∈ F is defined as the length of time from its
arriving time to its finishing time, i.e.
$$t_W(f) = e(f) - s(f). \qquad (6.2)$$
Summing the waiting time of all flows gives the total waiting time $T_W$ for flow
set F.
$$T_W = \sum_{f \in F} t_W(f) \qquad (6.3)$$
$t_W(f)$ measures the responsiveness experienced by individual flow, exposing the
impact on short and long flows, while $T_W$ measures overall responsiveness. The
waiting time reflects how long a flow has to wait before it goes through the
network. For applications blocked to wait for data from another end host, the
waiting time determines how fast they are able to respond. Long total waiting
time implies overall slow response of applications running on the data center.
• Computation Time
The computation time affects how quickly a scheduling algorithm can adapt to
network changes, and therefore the algorithm should be computationally efficient. The
computation time of the scheduling algorithm is a function of network size and
traffic pattern.
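The first two metrics follow directly from Equations 6.1 to 6.3 and can be computed as in the Python sketch below. Representing a flow as a (start, end) pair in seconds, the function names and the sample values are assumptions made for this illustration.

```python
def traffic_finish_time(flows):
    """T_F: the time at which the last flow finishes (Eq. 6.1)."""
    return max(end for _, end in flows)


def waiting_times(flows):
    """t_W(f) for every flow: finish time minus arrival time (Eq. 6.2)."""
    return [end - start for start, end in flows]


def total_waiting_time(flows):
    """T_W: sum of the waiting times of all flows (Eq. 6.3)."""
    return sum(waiting_times(flows))


# Hypothetical flows given as (arrival time, finish time) in seconds.
flows = [(0.0, 2.0), (0.5, 1.5), (1.0, 4.0)]
print(traffic_finish_time(flows), waiting_times(flows), total_waiting_time(flows))
```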
6.1.3 Simulation Settings
Algorithms are called at the beginning of a scheduling cycle. Edmond generates only
one circuit assignment for each cycle consisting of one slot. Decomp and TMS can
generate multiple circuit assignments corresponding to multiple slots within one cycle.
In our experiment, the circuit switching delay is 10 µs, which is typical for optical
wavelength-selective switches (WSS). The scheduling cycle is 0.1 second to allow for a
fair trade-oﬀ between optical circuit utilization and fast adaptivity to traﬃc change.
6.1.4 Traﬃc Workloads
In data centers, traﬃc demand fluctuates over time, resulting in various traﬃc pat-
terns, uniform or skewed, dense or sparse. At one time, traﬃc might be distributed
uniformly across racks but at another, hotspots show up if a few racks have signifi-
cantly larger demand. Besides, the traﬃc is said to be dense when most racks have
traﬃc demands for each other, while the traﬃc is sparse when only some racks have
traﬃc demands for each other. We study the performance of scheduling algorithms
under these four combinations of patterns.
• Urand (uniform-dense): Each server sends a flow of 125 MB to N other
randomly chosen servers, where N is the number of racks. In a data center with
N racks and M servers per rack, this implies that each rack has NM inward and
41
NM outward flows on average. Urand is a standard pattern used in existing
work [3, 4, 5] to simulate intensive communications in data centers.
• Stride (uniform-sparse): The servers are indexed from 0 to 320. Server i sends
a flow of 125 MB to each of 8 other servers in 8 different neighbor racks, namely
servers (i + 20j + 1) mod 20 with j ∈ {1, ..., 8}. In Stride, each rack has
traffic to half of all TORs. Stride is another standard pattern used in existing
work [3].
• Hotspot (skewed-dense): Hotspot is a mixture of heavy flows of 10 GB each
between 10 pairs of randomly chosen servers, and small flows of 100 KB each that
form uniform random dense traffic. We include it because an extensive
measurement study of data center traffic [21] reports that hotspots are common.
• Isolated (skewed-sparse): Three non-conflicting isolated flows of 100 MB each
coexist in a 16 × 16 traffic matrix. In this traffic pattern, all flows can
be routed by the optical switch at the same time. We include it to demonstrate
a deficiency of the TMS algorithm on sparse and skewed traffic in
Section 6.2.4.
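To make the workload definitions concrete, the following is a minimal sketch of how
Urand and Isolated could be written down as rack-level demand matrices. It is not the
simulator used in this thesis. The function names, the seed handling, the use of
decimal megabytes, and the particular rack pairs chosen for Isolated are all
illustrative assumptions.

```python
import random

MB = 10**6  # assumption: decimal megabytes


def urand_rack_demand(n_racks, servers_per_rack, flow_bytes=125 * MB, seed=0):
    """Rack-level demand matrix (bytes) for a Urand-like workload.

    Each server sends one flow to each of n_racks other servers chosen
    uniformly at random. Flows are summed per (source rack, destination rack);
    diagonal entries hold traffic that stays inside a rack.
    """
    rng = random.Random(seed)
    n_servers = n_racks * servers_per_rack
    demand = [[0] * n_racks for _ in range(n_racks)]
    for src in range(n_servers):
        others = [s for s in range(n_servers) if s != src]
        for dst in rng.sample(others, n_racks):
            demand[src // servers_per_rack][dst // servers_per_rack] += flow_bytes
    return demand


def isolated_demand(n=16, pairs=((0, 5), (3, 9), (7, 12)), flow_bytes=100 * MB):
    """Demand matrix with a few flows that share no source or destination rack."""
    assert len({s for s, _ in pairs}) == len(pairs)  # distinct sources
    assert len({d for _, d in pairs}) == len(pairs)  # distinct destinations
    demand = [[0] * n for _ in range(n)]
    for s, d in pairs:
        demand[s][d] = flow_bytes
    return demand
```

In this sketch every rack sends exactly NM flows, which matches the outward flow
count stated for Urand; the inward count per rack is NM only on average because
destinations are random.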
6.2 Numerical Results
6.2.1 Decomp achieves low flow waiting time while preserving network
utilization
In this experiment, we zero out the algorithm computation time and circuit switching
time in order to remove the influence from these two factors on performance. In other
words, we want to show how each algorithm assigns circuits for the same set of flows.
In practice, the performance would also be impacted by algorithm computation time
and circuit reconfiguration delay.
Figure 6.1 shows the distribution of flow waiting time for the Hotspot pattern under
a 10 Gbps pure optical architecture. We make the following observations. First, flows
tend to experience long waiting times under Edmond in the pure optical architecture.
More than 50% of the flows are still alive after 5 seconds under Edmond, while in the
same time both TMS and Decomp have already cleared more than 89% of the flows.
With the objective of maximizing the total traffic served by each circuit assignment,
Edmond is more likely to assign large flows to optical paths, leaving small ones
waiting until their sizes are comparable with what remains of the large flows.
Without the assistance of an electrical network that lets small flows pass quickly,
the small flows suffer under Edmond.
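The starvation effect can be reproduced on a toy example. The sketch below uses a
brute-force maximum-weight assignment as a stand-in for Edmonds' matching algorithm
[10]; it is not the Edmond implementation evaluated here, and the 3-rack demand
matrix, the per-slot capacity, and the function names are hypothetical.

```python
from itertools import permutations


def max_weight_assignment(demand):
    """Return the permutation p (rack i -> rack p[i]) serving the most demand.

    Brute force over all permutations, so only suitable for tiny examples.
    """
    n = len(demand)
    return max(permutations(range(n)),
               key=lambda p: sum(demand[i][p[i]] for i in range(n)))


def slots_until_served(demand, src, dst, capacity):
    """Count one-slot cycles until the circuit (src, dst) is first selected."""
    remaining = [row[:] for row in demand]  # do not modify the caller's matrix
    slots = 0
    while remaining[src][dst] > 0:
        p = max_weight_assignment(remaining)
        slots += 1
        if p[src] == dst:
            return slots
        # Serve up to `capacity` units on every selected circuit.
        for i, j in enumerate(p):
            remaining[i][j] = max(0, remaining[i][j] - capacity)
    return slots


# Hypothetical demand in MB: two large flows (0->1 and 1->0) plus small flows.
demand = [
    [0, 1000, 1],
    [1000, 0, 1],
    [1, 1, 0],
]
# The first assignment pairs racks 0 and 1 with each other and leaves rack 2 idle.
first = max_weight_assignment(demand)
# The 1 MB flow from rack 0 to rack 2 waits until the large flows are drained.
wait = slots_until_served(demand, 0, 2, capacity=100)
```

With a per-slot capacity of 100 MB, the large flows need ten slots to drain, and the
small flow is not selected during any of them.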
Second, flow waiting time is significantly improved with Decomp. Both Decomp
and TMS allow bandwidth sharing among flows through several different circuit
assignments in each scheduling cycle. Thus even small flows can take advantage of
optical circuits assigned in short slots. Furthermore, Decomp achieves even lower flow
waiting time than TMS. Decomp partitions traffic demand and schedules each partition
separately, so that large flows can dominate only a single partition rather than the
entire network. In TMS, heavy demand can dominate the use of the optical network more
easily because TMS schedules all circuits over the entire traffic demand matrix.
Experiments show that by 0.13 seconds, Decomp has cleared more than 97% of the flows,
while the figures for TMS and Edmond are merely 84% and 42% respectively. We also
observe that Decomp reduces the average per-flow waiting time by more than 0.14
seconds for the Stride pattern and 0.78 seconds for the Urand pattern compared with
TMS and Edmond under the same network settings.
[Figure 6.1 plot omitted. x-axis: Per Flow Waiting Time (s), 0 to 0.3; y-axis: CDF,
0 to 1; curves: 4-Way Decomp, 16-Way Decomp, Edmond, TMS.]
Figure 6.1: CDF of flow waiting time for the Hotspot pattern under the 10 Gbps
pure optical network architecture. Decomp is suitable for the pure optical network
architecture and eliminates the long-tailed flow waiting times that Edmond and TMS
suffer from.
Furthermore, we present the traffic finish time for Hotspot, Urand and Stride
under the pure architecture in Figure 6.2. Decomp achieves network utilization
similar to that of Edmond and TMS: it clears traffic with roughly the same traffic
finish time as the other two algorithms. With comparable network utilization,
however, Decomp is much better at allocating bandwidth among flows to reduce flow
waiting time.
[Figure 6.2 plot omitted. Bar groups: Hotspot, Urand, Stride; y-axis: Finishing
Time (s), 0 to 30; bars: 4-Way Decomp, 16-Way Decomp, Edmond, TMS.]
Figure 6.2: Traffic finish time under the 10 Gbps pure optical architecture. Decomp
is as good as other algorithms in terms of network utilization regardless of traffic
density, while simultaneously eliminating long-tailed flow waiting times.
Next, we extend our study to the hybrid network. Figure 6.3 shows the total flow
waiting time for the Hotspot pattern under the hybrid architecture with different
ratios of electrical and optical bandwidth. Decomp has short waiting time over all
hybrid network settings. In contrast, flows still suffer from long waiting time
under Edmond when there is insufficient electrical bandwidth. Like Edmond, TMS also
suffers from not having enough electrical bandwidth, though it is more tolerant. In
the hybrid network with as much as 100 Mbps electrical bandwidth, the total flow
waiting time for Decomp is less than 44% of that for TMS and only 4.6% of that for
Edmond. When the electrical bandwidth drops further to 10 Mbps, the total flow
waiting time for Decomp is only 9.3% of that for TMS and 1.2% of that for Edmond.
The starvation of small flows under TMS and Edmond is mitigated only after we add a
considerable amount of electrical bandwidth for the small flows to go through
without significant delay.
6.2.2 Decomp requires far less computation time than Edmond and TMS
[Figure 6.3 plot omitted. x-axis: Total Bandwidth (Gb/s), 10 to 11; y-axis: Total
Waiting Time (x 10^4 s), 0 to 3.5; curves: 4-Way Decomp, 16-Way Decomp, Edmond, TMS.]
Figure 6.3: Total flow waiting time for Hotspot under the hybrid network architecture
(Total bandwidth is the sum of 10 Gbps constant optical bandwidth and variable
electrical bandwidth). Decomp is suitable for the hybrid network architecture and
achieves the lowest flow waiting times regardless of the amount of electrical network
bandwidth available.
We measure the execution time of the scheduling algorithms on a 3.10 GHz dual-core
Intel Core i3 processor with 3.7 GB memory. The computation time is not memory
bound: the maximum memory consumption is only 1521 MB, for 16-way Decomp over 600
racks. All algorithms are implemented in C for best performance. We have ensured
that Edmond and TMS are well optimized. The computation time of Decomp is measured
as the total time spent on the schedule and merge phases. The time spent on the
schedule phase is the maximum time spent on calculating any single SDM. The time
spent on merging at a given level is the maximum time spent on any major-region at
that level, and the time for the whole merge phase is the sum of the per-level merge
times.
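The measurement rule above amounts to a maximum over parallel tasks followed by a
sum over merge levels. A minimal sketch, assuming per-task timings have already been
collected; the function name, argument layout, and the timings in the example are
hypothetical and are not measurements from this thesis:

```python
def decomp_computation_time(schedule_times, merge_times_by_level):
    """Combine per-task timings into one Decomp computation time.

    schedule_times: time spent on each SDM (computed in parallel).
    merge_times_by_level: one list per merge level, holding the time spent
        on each major-region at that level (regions merged in parallel).
    """
    schedule = max(schedule_times)                           # slowest SDM
    merge = sum(max(level) for level in merge_times_by_level)  # levels run in sequence
    return schedule + merge


# Hypothetical timings in milliseconds: 4 SDMs and two merge levels.
total_ms = decomp_computation_time([10, 12, 9, 11], [[4, 5], [6]])
```

For these made-up timings the slowest SDM takes 12 ms and the two merge levels
contribute 5 ms and 6 ms.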
Figure 6.4 shows the computation time for Urand traffic matrices of size up to
600 × 600. We observe that Decomp takes much less computation time than Edmond
and TMS. 16-way Decomp is 28000× faster than TMS and 170× faster than Edmond
when the topology scales to 600 racks. Under the sparse Stride pattern (graphs
omitted due to space constraints), the computation time of Decomp and TMS drops
significantly because a sparse matrix can be easily decomposed into a small number of
assignments, reducing the computational complexity of the Birkhoff decomposition [15]
step in TMS and of the schedule/merge phases in Decomp. However, Decomp is still
much faster. Under Stride, on average across the different network sizes, 4-way and
16-way Decomp speed up computation by 9× and 42× respectively compared with TMS,
and by 92× and 426× respectively compared with Edmond.
[Figure 6.4 plots omitted. Three panels (TMS, Edmond, Decomp), each with x-axis:
Number of Racks, 0 to 600, and y-axis: Computation Time (s); y-axis ranges: 0 to 2000
(TMS), 0 to 20 (Edmond), 0 to 0.2 (Decomp); curves: TMS, Edmond, 4-Way Decomp,
16-Way Decomp.]
Figure 6.4: Computation time for the dense Urand pattern under different algorithms.
Note that the y-axes have very different scales, and Decomp is orders of magnitude
more computationally efficient. Parallelization allows Decomp to scale well to large
network sizes. While not shown in this figure, Decomp is also much more efficient for
sparse traffic (see Section 6.2.2).
A practical algorithm should adapt quickly to changing traffic. For dense traffic,
TMS takes thousands of seconds to respond in a data center with 600 racks, which
is unacceptable. In other words, if the traffic in a data center changes dramatically
in a short time and the scheduling algorithm is required to respond and adapt
within, say, 0.05 seconds, then the topology cannot grow beyond 48 racks with
TMS, or 132 racks with Edmond. In contrast, Decomp can leverage modern multi-core
processors. Under the same responsiveness requirement, 4-way Decomp can support
up to 312 racks and 16-way Decomp can support up to 552 racks.
6.2.3 Decomp generates far fewer slots than TMS
Unlike Edmond, which generates only one circuit assignment at a time, Decomp and
TMS schedule multiple circuit assignments, or slots, in each run. A larger
number of slots per scheduling cycle leads to lower efficiency because more time is
wasted on reconfiguring circuits. Figure 6.5 shows the number of slots, or circuit
assignments, generated by Decomp and TMS for Urand traffic matrices of size
up to 600 × 600. We find that the number of slots generated by Decomp is O(N)
for N racks, much less than the O(N^2) of TMS. Under Stride, the number of slots
drops compared with Urand for both Decomp and TMS because a sparse matrix can be
easily decomposed into fewer assignments, but Decomp still generates at least 35%
fewer slots than TMS on average across the different network sizes. Pruning out small
slots might seem a plausible way to improve utilization for TMS, but doing so
may inadvertently harm small flows.
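The cost of extra slots can be estimated with simple arithmetic. The sketch below
assumes one circuit reconfiguration per slot and uses the 10 µs switching delay and
0.1 second scheduling cycle from Section 6.1.3 as defaults; the function name and the
slot counts in the example are illustrative and are not measured values.

```python
def reconfiguration_overhead(num_slots, switching_delay_s=10e-6, cycle_s=0.1):
    """Fraction of a scheduling cycle spent reconfiguring circuits.

    Assumes every slot in the cycle costs exactly one reconfiguration.
    """
    return num_slots * switching_delay_s / cycle_s


# Illustrative slot counts: overhead grows linearly with the number of slots.
for slots in (100, 1_000, 10_000):
    print(slots, reconfiguration_overhead(slots))
```

Under these assumptions the overhead grows linearly in the slot count, so an
algorithm whose slot count grows as O(N^2) exhausts the cycle at a much smaller N
than one whose slot count grows as O(N).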
[Figure 6.5 plots omitted. Two panels, each with x-axis: Number of Racks, 0 to 600,
and y-axis: Number of Slots; y-axis ranges: 0 to 150000 (TMS) and 0 to 1500
(4-Way Decomp, 16-Way Decomp).]
Figure 6.5: Number of slots for the Urand pattern under Decomp and TMS. Decomp
generates two orders of magnitude fewer scheduling slots, leading to much lower
circuit switching overhead. Recall that switching an optical circuit takes up to tens
of milliseconds.
6.2.4 TMS makes ineﬃcient scheduling decisions for Isolated under com-
bined eﬀect of computation and reconfiguration overhead
TMS is studied only with dense traﬃc in previous works [6, 4, 5]. Instead we subject
TMS to sparse and skewed matrices in the Isolated pattern. The Sinkhorn algo-
rithm [22], a building block of TMS, takes in matrices with strictly positive elements.
TMS does not work when it is given a matrix with zero entries. A work-around so-
lution is to substitute all zeros with a small quantity σ before feeding the matrix to
Sinkhorn. In this way, the σ entries can be scaled to fit into a doubly stochastic ma-
trix. Otherwise, the zeros persist all the way through the iterative scaling of columns
and rows in Sinkhorn, preventing the algorithm from converging.
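The work-around can be sketched as follows. This is a plain alternating row and
column normalization with zero entries replaced by σ, written only to illustrate the
idea; it is not the TMS implementation, and the function name, tolerance, iteration
cap, and the 4 × 4 example matrix are assumptions.

```python
def sinkhorn(matrix, sigma=1e-2, tol=1e-6, max_iters=10_000):
    """Scale a non-negative square matrix toward a doubly stochastic one.

    Zero entries are first replaced by sigma so that every entry is strictly
    positive. Returns (scaled_matrix, iterations_used).
    """
    n = len(matrix)
    m = [[x if x > 0 else sigma for x in row] for row in matrix]
    for it in range(1, max_iters + 1):
        # Row step: make every row sum to one.
        for row in m:
            s = sum(row)
            for j in range(n):
                row[j] /= s
        # Column step: make every column sum to one.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        for i in range(n):
            for j in range(n):
                m[i][j] /= col_sums[j]
        # Columns are now normalized; stop once the rows are close enough too.
        if max(abs(sum(row) - 1.0) for row in m) < tol:
            return m, it
    return m, max_iters


# Hypothetical Isolated-like demand: three non-conflicting flows, one idle rack.
demand = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
scaled, iterations = sinkhorn(demand)
```

In the scaled matrix, positions that carried no demand end up with small positive
values, which is the distortion discussed next. Rerunning the sketch with different
values of sigma and tol is one way to observe how the iteration count reacts.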
However, the substitutions also distort the traffic demand and lengthen the
computation time of TMS. Consider a simple example such as the Isolated
pattern. Ideally Sinkhorn would converge to a permutation matrix for Isolated, with
entries of one covering the original traffic demand. But convergence to a strictly
doubly stochastic matrix is time consuming, and a more common practice is to
terminate Sinkhorn when the error is considered tolerable. Nevertheless, it still
takes a long time to scale the rows and columns of substituted σ to below the error
tolerance. Besides, by allowing an error tolerance, Sinkhorn leaves many small
entries in the resulting doubly stochastic matrix, but most of the links
corresponding to the small entries have no traffic demand at all. When given this
doubly stochastic matrix with error, the Birkhoff decomposition step in TMS generates
many circuit assignments corresponding to the non-existent demand. Although the
distorted assignments tend to have short time slot durations, setting up these
unnecessary circuits wastes network resources on optical circuit reconfigurations.
Besides, Birkhoff takes a long time to generate these assignments.
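The effect of the small entries on the decomposition can be seen on a toy matrix.
The sketch below is a greedy Birkhoff-von Neumann style decomposition using exact
fractions and a simple augmenting-path matching; it illustrates the idea only and is
not the decomposition code used by TMS. The function names and both 3 × 3 matrices
are hypothetical.

```python
from fractions import Fraction


def perfect_matching(support):
    """Augmenting-path bipartite matching on a boolean n x n support matrix.

    Returns match_col where match_col[j] is the row matched to column j,
    or None if no perfect matching exists.
    """
    n = len(support)
    match_col = [-1] * n

    def try_row(i, seen):
        for j in range(n):
            if support[i][j] and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            return None
    return match_col


def birkhoff_decompose(ds):
    """Split a doubly stochastic matrix into (weight, permutation) slots.

    permutation[i] = j means a circuit from row i to column j.
    """
    n = len(ds)
    rem = [[Fraction(x) for x in row] for row in ds]
    slots = []
    while any(x > 0 for row in rem for x in row):
        match_col = perfect_matching([[x > 0 for x in row] for row in rem])
        if match_col is None:
            raise ValueError("input is not doubly stochastic")
        perm = [0] * n
        for j, i in enumerate(match_col):
            perm[i] = j
        weight = min(rem[i][perm[i]] for i in range(n))  # slot duration share
        for i in range(n):
            rem[i][perm[i]] -= weight
        slots.append((weight, perm))
    return slots


e = Fraction(1, 100)  # hypothetical residual error left by early termination
clean = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
distorted = [[e, 1 - 2 * e, e], [e, e, 1 - 2 * e], [1 - 2 * e, e, e]]
clean_slots = birkhoff_decompose(clean)
distorted_slots = birkhoff_decompose(distorted)
```

The clean permutation matrix needs a single slot, while the distorted matrix needs
several, and the extra ones serve entries that correspond to no real demand.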
The authors of [4] suggest pruning circuit assignments with small time shares
after Birkhoff by scheduling only a few of the longest slots. This approach
introduces a predefined threshold to decide which slots are removed. But short slots
are not necessarily wasteful; they might simply represent traffic with small demand.
A predefined threshold cannot distinguish slots generated by distortion from slots
that represent small demand.
The long execution time under Isolated implies that TMS cannot respond to traffic
changes quickly. One can try to enforce a short scheduling cycle, at the risk that
the last slot in the current cycle ends before the computation for the next cycle
finishes. To accommodate long computation in a short cycle, one can (fix 1) extend
the last slot until computation finishes. But the last slot might be a wasteful one
generated due to distortion. Instead, one can (fix 2) always schedule the slot with
the longest time share last, because it is more likely to represent valid demand. In
this case, the longest slot is extended disproportionately. A lot of bandwidth could
be wasted if its traffic is cleared before computation ends, while traffic
represented in other slots is left unserved. To avoid this problem, one can also try
to (fix 3) repeat the assignments in the current cycle during computation.
[Figure 6.6 plot omitted. y-axis: Finishing Time (s), 0 to 1; bars: 4-Way Decomp,
16-Way Decomp, Edmond, TMS fix 1, TMS fix 2, TMS fix 3.]
Figure 6.6: Finish time of the Isolated traffic pattern under the 10 Gbps pure optical
architecture. TMS's inefficiencies are rooted in its high computation and circuit
reconfiguration overhead.
In Figure 6.6, we show the traffic finishing time of different algorithms under the
Isolated pattern. In this experiment, two Isolated traffic patterns (I1 and I2) are
used. Every 8 ms, a new set of traffic that alternates between I1 and I2 is injected.
This tests how the algorithms adapt to traffic changes. Computation time and circuit
switching time are both accounted for in this experiment. As shown in
Figure 6.6, the traffic finishing time of TMS with fix 2 and fix 3 is 1.5× longer
than that of Decomp and Edmond. This is because whenever a new set of traffic arrives
at time t, TMS takes tens of milliseconds to compute a new set of circuit
assignments, and during this time traffic is served by inefficient assignments
computed from the traffic volumes prior to time t. In addition to this slow
adaptation problem, TMS with fix 1 is even worse because it extends wasteful slots
during computation, resulting in a 2.5× longer traffic finishing time than Decomp and
Edmond.
Chapter 7
Conclusion
We make two contributions in this thesis. First, we explore the challenges of circuit
scheduling for next-generation optical data centers along three dimensions: support
for both hybrid and pure architectures, robustness under sparse and dense traffic
patterns, and scalability. In doing so, we expose the weaknesses of the existing
optical data center circuit scheduling algorithms. Second, we propose Decomp, a
framework that can be customized with different circuit selection policies and that
incorporates partitioning, randomization, and parallelization approaches not found
in existing algorithms. We show that it significantly outperforms the existing
algorithms along all three dimensions.
Bibliography
[1] G. Wang, D. G. Andersen, M. Kaminsky, K. Papagiannaki, T. S. E. Ng,
M. Kozuch, and M. Ryan, “c-Through: Part-time Optics in Data Centers,”
in SIGCOMM ’10, (New Delhi, India), p. 327, Aug. 2010.
[2] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya,
Y. Fainman, G. Papen, and A. Vahdat, “Helios: A Hybrid Electrical/Optical
Switch Architecture for Modular Data Centers,” in SIGCOMM ’10, (New Delhi,
India), p. 339, Aug. 2010.
[3] K. Chen, A. Singla, A. Singh, K. Ramachandran, L. Xu, Y. Zhang, X. Wen, and
Y. Chen, “OSA: An optical switching architecture for data center networks with
unprecedented flexibility,” 2012.
[4] N. Farrington, G. Porter, Y. Fainman, G. Papen, and A. Vahdat, “Hunting Mice
with Microsecond Circuit Switches,” in ACM HotNets, (Redmond, WA, USA),
Oct. 2012.
[5] G. Porter, R. Strong, N. Farrington, A. Forencich, P. Chen-Sun, T. Rosing,
Y. Fainman, G. Papen, and A. Vahdat, “Integrating microsecond circuit switch-
ing into the data center,” in Proceedings of the ACM SIGCOMM 2013 conference
on SIGCOMM, pp. 447–458, ACM, 2013.
[6] H. Liu, F. Lu, A. Forencich, R. Kapoor, M. Tewari, G. M. Voelker, G. Papen,
A. C. Snoeren, and G. Porter, “Circuit switching under the radar with REACToR,”
in Proceedings of the 11th ACM/USENIX Symposium on Networked Systems
Design and Implementation (NSDI), Seattle, WA, 2014.
[7] Calient, “Software defined packet-optical datacenter networks,” accessed July,
2014.
[8] Corning, “Fiber optic solutions for data centers and SANs,” accessed July, 2014.
[9] M. Chowdhury and I. Stoica, “Coflow: A Networking Abstraction for Cluster
Applications,” in Hotnets 12, (Seattle, WA, USA), pp. 31–36, Oct. 2012.
[10] J. Edmonds, “Paths, trees, and flowers,” Canadian Journal of Mathematics,
vol. 17, pp. 449–467, Jan. 1965.
[11] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, “High-speed switch
scheduling for local-area networks,” ACM Transactions on Computer Systems
(TOCS), vol. 11, no. 4, pp. 319–352, 1993.
[12] N. McKeown, P. Varaiya, and J. Walrand, “Scheduling cells in an input-queued
switch,” Electronics Letters, vol. 29, no. 25, pp. 2174–2175, 1993.
[13] P. Giaccone, B. Prabhakar, and D. Shah, “Towards simple, high-performance
schedulers for high-aggregate bandwidth switches,” in INFOCOM 2002. Twenty-
First Annual Joint Conference of the IEEE Computer and Communications So-
cieties. Proceedings. IEEE, vol. 3, pp. 1160–1169, IEEE, 2002.
[14] G. Birkhoff, “Tres observaciones sobre el algebra lineal,” Univ. Nac. Tucumán
Rev. Ser. A, vol. 5, pp. 147–151, 1946.
[15] C.-S. Chang, W.-J. Chen, and H.-Y. Huang, “On service guarantees for
input-buffered crossbar switches: a capacity decomposition approach by Birkhoff and
von Neumann,” in Quality of Service, 1999. IWQoS’99. 1999 Seventh International
Workshop on, pp. 79–86, IEEE, 1999.
[16] T. Inukai, “An efficient SS/TDMA time slot assignment algorithm,”
Communications, IEEE Transactions on, vol. 27, no. 10, pp. 1449–1455, 1979.
[17] E. Balas and P. R. Landweer, “Traffic assignment in communication satellites,”
Operations Research Letters, vol. 2, no. 4, pp. 141–147, 1983.
[18] J. L. Lewandowski, J. W. Liu, and C. Liu, “SS/TDMA time slot assignment with
restricted switching modes,” in NTC’81; National Telecommunications Conference,
Volume 3, vol. 3, p. 7, 1981.
[19] A. Kesselman and K. Kogan, “Nonpreemptive scheduling of optical switches,”
Communications, IEEE Transactions on, vol. 55, no. 6, pp. 1212–1219, 2007.
[20] I. Keslassy, M. Kodialam, T. Lakshman, and D. Stiliadis, “On guaranteed smooth
scheduling for input-queued switches,” in INFOCOM 2003. Twenty-Second An-
nual Joint Conference of the IEEE Computer and Communications. IEEE Soci-
eties, vol. 2, pp. 1384–1394, IEEE, 2003.
[21] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, “The Nature
of Data Center Traﬃc: Measurements and Analysis,” in IMC ’09, (Chicago,
Illinois, USA), pp. 202–208, Nov. 2009.
[22] R. Sinkhorn et al., “A relationship between arbitrary positive matrices and dou-
bly stochastic matrices,” The annals of mathematical statistics, vol. 35, no. 2,
pp. 876–879, 1964.
