Power-Aware Datacenter Networking and Optimization by Yi, Qing
Portland State University 
PDXScholar 
Dissertations and Theses
Winter 3-2-2017 
Power-Aware Datacenter Networking and 
Optimization 
Qing Yi 
Portland State University 
Recommended Citation 
Yi, Qing, "Power-Aware Datacenter Networking and Optimization" (2017). Dissertations and Theses. Paper 
3474. 
https://doi.org/10.15760/etd.5358 
This Dissertation is brought to you for free and open access. It has been accepted for inclusion in Dissertations 
and Theses by an authorized administrator of PDXScholar. For more information, please contact 
pdxscholar@pdx.edu. 
Power-Aware Datacenter Networking and Optimization
by
Qing Yi
A dissertation submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
in
Computer Science
Dissertation Committee:
Suresh Singh, Chair
Jingke Li
Fei Xie
Branimir Pejcinovic
Portland State University
2017
ABSTRACT
Present-day datacenter networks (DCNs) are designed to achieve full bisection
bandwidth in order to provide high network throughput and server agility.
However, the average utilization of typical DCN infrastructure is below 10% [8]
for significant time intervals. As a result, energy is wasted during these
periods. In this thesis we analyze traffic behavior of datacenter networks using
traces as well as simulated models. Based on the insight developed, we present
techniques to reduce energy waste by making energy use scale linearly with load.
The solutions developed are analyzed via simulations, formal analysis, and
prototyping. The impact of our work is significant because the energy savings we
obtain for networking infrastructure of DCNs are near optimal.
A key finding of our traffic analysis is that network switch ports within the
DCN are grossly under-utilized. Therefore, the first solution we study is to
modify the routing within the network to force most traffic to the smallest
number of switches. This increases the hop count for the traffic but enables the
powering off of many switch ports. The exact extent of energy savings is derived
and validated using simulations. An alternative strategy we explore in this
context is to replace about half the switches with fewer switches that have
higher port density. This has the effect of enabling even greater traffic
consolidation, thus enabling even more ports to sleep. Finally, we explore a
third approach in which we begin with end-to-end traffic models and
incrementally build a DCN topology that is optimized for that model. In other
words, the network topology is optimized for the potential use of the
datacenter. This approach makes sense because, as other researchers have
observed, the traffic in a datacenter is heavily dependent on the primary use of
the datacenter.
A second line of research we undertake is to merge traffic in the analog domain
prior to feeding it to switches. This is accomplished by use of a passive device
we call a merge network. Using a merge network enables us to attain linear
scaling of energy use with load regardless of datacenter traffic models. The
challenge in using such a device is that layer 2 and layer 3 protocols require a
one-to-one mapping of hardware addresses to IP (Internet Protocol) addresses. We
overcome this problem by building a software shim layer that hides the fact that
traffic is being merged. In order to validate the idea of a merge network, we
build a simple merge network for gigabit optical interfaces and demonstrate
correct operation of layer 2 and layer 3 protocols at line speed. We also
conduct measurements to study how traffic gets mixed in the merge network prior
to being fed to the switch. Finally, we show that the merge network uses only a
fraction of a watt of power, which makes this a very attractive solution for
energy efficiency.
In this research we have developed solutions that enable linear scaling of
energy with load in datacenter networks. The different techniques developed have
been analyzed via modeling and simulations as well as prototyping. We believe
that these solutions can be easily incorporated into future DCNs with little
effort.
DEDICATION
To my family.
ACKNOWLEDGMENTS
First and foremost, I want to thank my advisor, Professor Suresh Singh. As one
of his Ph.D. students, I have been very fortunate to conduct research under his
guidance for the past four years. Professor Singh has been not only an excellent
and inspirational advisor in my research field, but also the best mentor in my
professional life. I deeply appreciate all his contributions of time, ideas, and funding
to make my Ph.D. studies productive and stimulating.
For this dissertation, I would like to thank my reading committee, Professor Li
Jingke, Professor Xie Fei and Professor Branimir Pejcinovic, for their time, interest,
and valuable comments and insightful questions. In particular, I am grateful for
the many conversations I had with Professor Xie Fei regarding my research, which
gave me new perspective in many areas I had not previously considered.
I also want to thank the National Science Foundation that made my Ph.D.
work possible. My time at Portland State was made enjoyable in large part due to
the many friends who have become part of my life. I am grateful for the time spent
with my fellow Ph.D. research students and friends, Farnoosh Moshifatemi, Cong
Kai, Yang Zhengkun, Simon Niklaus, Wang Qin, Cuong Nguyen and Lin Bin, and
for many other people and good memories.
Lastly, I would like to thank my family for all their love and support. For
my parents who raised me with a love of math and science and supported me in
all my pursuits. And most of all for my loving and encouraging husband whose
faithful support and patience throughout my doctoral program is so appreciated.
In addition, these acknowledgments would not be complete if I did not mention my
sons, Kevin and Ryan. They are great sources of love and support for my scholarly
endeavor. Their thirst for knowledge always gives me inspiration and strength to
persist during this long journey. Thank you all!
Qing Yi
Portland State University
TABLE OF CONTENTS
Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.2 Background
1.2.1 Datacenter Network Topology
1.2.2 Energy-Efficient Networking
1.3 Contributions
Chapter 2 Literature Review
2.1 Energy-Efficient Networking
2.1.1 Green Network Devices
2.1.2 Power-Aware Network Infrastructure
2.1.3 Power-Aware Software Stack
2.2 Datacenter Network Architecture
2.2.1 Datacenter Network Topologies
2.2.2 Datacenter Network Protocols
2.2.3 Alternative Datacenter Architectures
2.3 Datacenter Network Energy Efficiency
2.4 Comparison and Discussions
Chapter 3 Modeling Energy Usage of Datacenter Networks
3.1 Modeling Energy Consumption
3.1.1 Basic Model
3.1.2 Extended Model
3.1.3 Asymmetric Model
3.1.4 Summary
3.2 Simulations
3.3 Summary
Chapter 4 Analytical Optimization Model
4.1 Minimizing Energy Consumption
4.1.1 Optimization Model
4.2 Greedy Flow Assignment
4.2.1 Heuristic Algorithm
4.2.2 Validation of Greedy Algorithm
4.3 Summary
Chapter 5 Usage-Based Datacenter Network Topology
5.1 Sub-Trees for Different Traffic Characteristics
5.1.1 Traffic Model
5.1.2 Active Sub-Trees
5.1.3 Analytical Model of Sub-Tree Size
5.2 Right sizing the edge switches
5.3 Energy savings of larger-sized edge switches
5.3.1 Static Cost
5.3.2 Dynamic Cost
5.4 Summary
Chapter 6 Traffic Consolidation using Merge Networks
6.1 Our Approach: Merging
6.2 Energy Savings Due to Traffic Merging
6.3 Traffic Patterns
6.4 Traffic Merging within a Switch
6.4.1 Number of Active Interfaces
6.4.2 Energy Savings
6.5 Traffic Merging within a Pod
6.5.1 Lower Bound on Energy Consumption
6.5.2 Energy Savings Due to Traffic Merging
6.6 Simulation Results
6.7 Summary
Chapter 7 Simulation Results with Merge Networks
7.1 Traffic Data
7.1.1 Synthetic Traffic Data
7.1.2 Empirical Traffic Data
7.2 Applying Merge Network Within A Switch
7.3 Applying Merge Network Within A Pod
7.3.1 Number of active switches
7.3.2 Energy cost
7.4 Summary
Chapter 8 Prototype of Merge Networks
8.1 2 × 2 Merge Network Architecture Design
8.2 Measurement Results
8.3 Higher-order Merge Networks
8.4 Summary
Chapter 9 Conclusions
9.1 Summary
9.2 Conclusions
References
LIST OF TABLES
4.1 Number of active switches and active interfaces from optimization model vs.
simulation with greedy algorithm.
5.1 Power Consumption of Datacenter Modular Switch - Cisco Catalyst 4503-E.
7.1 Probabilities of flow going to the same subnet ($p_1$), to the same pod
($p_2$) and to different pods ($1 - p_1 - p_2$) for all traffic suites studied.
8.1 Arduino board STATE values.
LIST OF FIGURES
1.1 Global electricity demand of datacenters 2010 - 2030 [15].
3.1 Fat-tree network model.
3.2 Active switches for the basic model.
3.3 Extended model with high external connectivity.
3.4 Extended model with reduced external connectivity and high external load.
3.5 Traffic loss corresponding to Figure 3.4.
3.6 Asymmetric model with high external traffic.
3.7 A more realistic asymmetric model with high capacity external links.
3.8 Traffic loss for the model in Figure 3.6.
3.9 Using only one externally connected switch.
3.10 Fraction of active switches for the staggered model.
3.11 Fraction of active switches for the analytical model (staggered cases).
3.12 Fraction of active switches for the stride model.
3.13 Fraction of active switches for the analytical model (stride cases).
5.1 Minimal fat-trees with uniform and non-uniform traffic of load 10%.
5.2 Fraction of switches required for uniform and non-uniform traffic.
5.3 Traffic between edge layer and aggregation layer is less when the size of
edge switches increases. (Figures shown above are for uniform and nonuniform
traffic patterns in EDU Datacenters. CLD, PRV and EDU1 Datacenters also have the
same properties.)
5.4 Fraction of active switches with larger-sized edge switches.
5.5 EDU DCNs with 12-port and 72-port edge switches, 70% load.
5.6 Static power consumption of a k = 12 and a k = 48 fat-tree DCN with
different sizes of edge switches.
5.7 Fraction of total power consumption of network switches with larger-sized
edge switches for different traffic load and patterns in EDU Datacenters.
6.1 Fat-tree model.
6.2 Merge networks applied to a switch.
6.3 Merge network applied to pod in a fat-tree.
6.4 Difference in number of active switches and active interfaces network-wide.
6.5 Reduction in total cost when using traffic merging.
6.6 Active switches for the model in Section 3.1.1.
6.7 Active switches for the model with traffic merging.
6.8 Simulation results of active switches for near and far traffic.
6.9 Modeling active switches for near and far traffic.
7.1 Traffic load of a university datacenter.
7.2 Number of active switches and active interfaces network-wide for a k = 12
fat-tree network.
7.3 Reduction in total cost when using traffic merging.
7.4 Energy savings when using traffic merging.
7.5 Compare number of active switches with vs. without traffic merging.
7.6 Fraction of active switches before using traffic merging (left) and after
using traffic merging (right).
7.7 Reduction in total cost after using traffic merging.
7.8 Fraction of total cost without traffic merging vs. using traffic merging.
8.1 A 2 × 2 merge network implemented with two 2 × 2 optical switches.
8.2 Optical switches.
8.3 Two states of the optical switch.
8.4 Controlling the state of merge networks: state-transferring logic
implemented at PACLAB12.
8.5 Arduino board to control the states of the two optical switches.
8.6 Architecture design of a 2 × 2 merge network.
8.7 Traffic flows and port usage of Host A and Host B.
8.8 Total port usage of Host A and Host B.
8.9 Traffic flows and port usage of Host A and Host B after switching serial
ports.
8.10 Total port usage of Host A and Host B after switching serial ports.
8.11 Average port usage of Host A and Host B.
8.12 Total Port 1 and Port 2 utilization.
8.13 State switching times of the merge network in two experiments.
8.14 Total state switching times.
8.15 Example of a 16 × 16 MEMS matrix optical switch.
8.16 MEMS 3D matrix optical switch.
8.17 STATE of a 4 × 4 matrix switch.
8.18 A 4 × 4 matrix switch: STATE and SIGNAL.
8.19 DiCon Tap/Detector module.
8.20 Customized functional module of merge networks.
8.21 DiCon customized module.
Chapter 1
INTRODUCTION
Datacenters have experienced substantial growth in recent years due to the
growing popularity of Internet services and the widespread adoption of cloud
computing. From server rooms of small-sized organizations to enterprise
datacenters and the server farms that run cloud computing services, datacenters
have become the backbone of the economy. Datacenters achieve economies of scale
with large numbers of servers. Many Internet service providers and cloud
computing service providers have built very large datacenters, often housing
50,000 or more servers each, at geographically distributed locations. For
example, Google has built more than 30 datacenters in 15 countries with a total
of approximately 900,000 servers [1]. Amazon has 11 cloud regions across the
world. Each region has multiple sets of datacenters, with a typical facility
containing 50,000 to 80,000 servers. A conservative estimate puts Amazon at over
1.5 million servers globally.
These gigantic datacenters consume a significant amount of electricity. In 2010,
the total electricity consumption by datacenters was 235.5 BkWh (billion
kilowatt-hours) [58], which accounts for about 1.3% of the total electricity
consumption of the entire world. In 2013, U.S. datacenters consumed an estimated
91 BkWh of electricity. This is equivalent to the annual output of 34 large
(500-megawatt) coal-fired power plants. As shown in Figure 1.1, datacenter
electricity consumption is projected to continue to increase. Energy costs
currently comprise approximately 70% of the operations costs of the average
datacenter facility [72]. As a result, energy cost and energy availability have
become some of the top concerns in planning future datacenter operations.

Figure 1.1. Global electricity demand of datacenters 2010 - 2030 [15].
However, despite the rising energy consumption and limits on electric power,
the resource utilization of most datacenters is poor. An investigation of 5,000
servers in a datacenter over a six-month period shows that the average CPU
utilization is between 10% and 50% of the peak load [35]. Low server utilization
also leads to underutilization of the network infrastructure interconnecting the
servers. Studies show that the average link utilization in the lower level of
the network is only 8% for 95% of the time [20]. Energy concerns and
underutilization of IT resources are driving the industry to find ways of
improving the energy efficiency of datacenters. New technologies have driven up
power capacity requirements in IT, mainly in server and storage-related
operations. For instance, Google builds its own customized energy-efficient
servers for use in its datacenters. Software solutions such as server
virtualization have been proposed to improve energy efficiency by reducing
computing overhead.
The datacenter network has not traditionally been a major contributor to the
power problem because the network consumes relatively little power compared to
the overall power consumed in a datacenter. Network power requirements have been
relatively low because of the comparatively low level of computing functions in
switches and routers. Most estimates of networking infrastructure consumption
range from 8 to 12 percent of the total power consumed by the entire datacenter.
As the network has evolved to include higher levels of intelligence, however,
its power requirements have continued to grow. In addition, as server and
storage energy efficiency improves, networking components are expected to
account for an increasing share of datacenter energy.
1.1 MOTIVATION
A great deal of recent research has examined the question of improving the
energy efficiency of servers (hardware and software) as well as developing
better algorithms for distributed computing. However, the question of optimizing
the power performance of the datacenter network has received far less attention.
With servers becoming more energy efficient, it is projected that the relative
energy consumption of the network components will increase by up to 50% and
become the predominant source of energy waste in datacenters. Thus, it is
meaningful to consider how the datacenter network can be made more energy
efficient. We can identify the following primary reasons for the energy
inefficiency of datacenter networks:
• Over-provisioned Network Topology: The underlying assumption in datacenter
network design is that full bisection bandwidth needs to be supported at all
times. This assumption has resulted in the use of dense network topologies such
as the fat-tree network. While these topologies do provide high bandwidth as
well as low latency, numerous measurements show that the links and switches are
grossly under-utilized.
• Use of Legacy Switches: Datacenter network operators buy off-the-shelf
switches from vendors such as Cisco with little thought to adapting the switch
design to the specific needs of the datacenter. Indeed, the primary focus of
switch development has been to make the interfaces faster and the switches more
dense. There has been little effort to customize switches for the specific
ecosystem of a datacenter.
• Application-Oblivious Topology Design: The philosophy of designing datacenter
networks for the general case rather than the expected case causes network
inefficiencies. For instance, datacenter job scheduling algorithms try to
concentrate computation within a single rack to minimize memory and disk access
latencies. In these datacenters, there is no need for the high bandwidth between
racks that is provisioned today.
• Unlinked Server and Network Energy Savings: Today, servers in the datacenter
are very energy-conscious and typically sleep for extended periods. However, the
underlying network remains fully powered on. This is clearly a waste of energy,
and linking the energy-conserving approach for servers to the underlying network
will result in significant energy savings with little impact on performance.
In summary, current datacenter network architectures and network devices were
never designed for energy efficiency. Network switches consume virtually the
same energy whether they are forwarding packets or not. Furthermore, the
architecture precludes any scaling of switch performance for energy efficiency.
One approach that has been standardized, IEEE 802.3az [7], reduces the power
drawn by each interface on the switch when its link is idle. However, as
interfaces typically consume a tiny fraction of overall switch power, this is an
insignificant saving.
The overall goal of this research is to develop datacenter networks
whose energy cost scales linearly with load and which can support high
network throughput.
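
To make the goal concrete, the following minimal Python sketch contrasts a switch whose power draw is independent of load with an ideal switch whose power scales linearly with load. The function names and the 100 W peak figure are assumptions chosen purely for illustration; they are not measurements from this work.

```python
def _check_load(load: float) -> None:
    """Reject loads outside the normalized range [0, 1]."""
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0.0 and 1.0")


def fixed_power(load: float, peak_watts: float) -> float:
    """Power of a switch that draws its peak power regardless of load."""
    _check_load(load)
    return peak_watts


def proportional_power(load: float, peak_watts: float) -> float:
    """Power of an ideal switch whose draw scales linearly with load."""
    _check_load(load)
    return load * peak_watts


def savings_fraction(load: float) -> float:
    """Share of the fixed-power draw that linear scaling would save."""
    _check_load(load)
    return 1.0 - load


if __name__ == "__main__":
    PEAK_WATTS = 100.0  # hypothetical peak power, for illustration only
    for load in (0.1, 0.5, 1.0):
        print(
            f"load={load:.0%} "
            f"fixed={fixed_power(load, PEAK_WATTS):.1f} W "
            f"proportional={proportional_power(load, PEAK_WATTS):.1f} W "
            f"savings={savings_fraction(load):.0%}"
        )
```

Under this idealized model, the lower the utilization, the larger the share of power that linear scaling could recover, which is why the low utilization figures cited above matter.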
In this research we will develop novel techniques to address the question of
energy efficiency of datacenter networks while delivering high network bandwidth
for distributed computations. We use a passive switching fabric to form a merge
network that allows traffic from N links to be merged onto fewer links feeding a
switch. This allows us to either replace high port density switches with lower
density switches or power off large parts of a switch. We examine the
application of this novel hardware element in the context of existing topologies
and develop new topologies that exploit its properties. Furthermore, we consider
replacing the datacenter network with this switching fabric. Thus, if we
consider servers located on one rack, rather than using traditional switches, we
can connect all the servers using a switching fabric composed of passive
elements like multiplexers. Logically, this can be viewed as throwing away the
network interfaces of switches and directly connecting the end-host network
interfaces to the internal interconnection network found in each switch.
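
The effect of merging can be illustrated with a small, idealized Python sketch. It assumes a fluid traffic model in which the load on each of the N input links is a fraction of link capacity and the merge network can pack traffic onto output links without loss; burstiness and switching delays are ignored. The function names are hypothetical and the model is a simplification, not the analysis developed later in this work.

```python
import math
from typing import Sequence


def active_interfaces_without_merge(link_loads: Sequence[float]) -> int:
    """Every link that carries any traffic keeps its switch interface awake."""
    return sum(1 for load in link_loads if load > 0.0)


def active_interfaces_with_merge(
    link_loads: Sequence[float], capacity: float = 1.0
) -> int:
    """Fewest switch interfaces that can carry the merged (fluid) traffic."""
    if capacity <= 0.0:
        raise ValueError("capacity must be positive")
    if any(load < 0.0 or load > capacity for load in link_loads):
        raise ValueError("each load must lie between 0 and the link capacity")
    total = math.fsum(link_loads)
    # The small tolerance guards against floating-point error pushing an
    # exact multiple of the capacity over to the next integer.
    return max(0, math.ceil(total / capacity - 1e-9))


if __name__ == "__main__":
    loads = [0.05] * 8  # eight links, each 5% utilized (illustrative values)
    print(active_interfaces_without_merge(loads))
    print(active_interfaces_with_merge(loads))
```

In this example, eight lightly loaded links keep eight interfaces awake without merging, whereas the merged traffic fits on a single interface, so the remaining interfaces could be powered off.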
This research relies on detailed simulations using traffic models extracted from
measurements reported in the literature as well as analysis to determine
performance bounds. We examine metrics including energy cost of the new networks
as compared to traditional networks, network reliability, network bisection
bandwidth, and scalability.
1.2 BACKGROUND
In this section, we will lay the groundwork for some fundamental terminology and
concepts. First, we will discuss the design and engineering of datacenter intercon-
nection networks, including network topology, network architecture, and routing
algorithms. Specifically, we will focus on the hierarchical network topologies that
are widely used in production datacenters. Indeed, most of the work in the follow-
ing chapters is based on hierarchical datacenter topologies. Second, we will review
the emerging technologies in power management of network devices, and motivate
our power-aware datacenter network research.
1.2.1 Datacenter Network Topology
Inside a datacenter, a large number of servers are interconnected using specific
datacenter network (DCN) structures. The design goals of the datacenter network
are to provide high-bandwidth, low-latency, and high-throughput communication
between servers through a network infrastructure with low cost and high
scalability. The network enables applications running on the servers to
communicate and interoperate in an orchestrated and efficient way. The
performance of the interconnection network plays an important role in improving
the overall performance of datacenters.
Mega datacenters nowadays may contain tens of thousands of servers that are
interconnected by network links and switches. A three-layer hierarchical model
is widely adopted for designing a scalable internetwork for datacenters, in
which switches are connected in a network structure consisting of core,
aggregation, and access layers. The traditional hierarchical model is a tree
topology with aggregation to larger-sized, higher-speed switches moving up the
hierarchy. The access layer grants end servers access to the network. Switches
in the access layer connect directly to servers and are referred to as edge
switches. The access layer is in turn connected to the network aggregation
layer, which aggregates the data received from the access layer switches before
it is transmitted to the core layer for routing to its final destination. The
switches in the core layer form the network backbone, providing a fabric for
high-speed packet switching between aggregation switches. A single core switch,
even with a density of hundreds of ports, limits the network size to a few
thousand hosts. Also, one core switch can provide only limited bandwidth between
servers in different racks, making it difficult to support applications like
MapReduce that require high intra-cluster bandwidth. Moreover, a single-rooted
tree has a single point of failure, which causes poor reliability. Therefore,
multi-rooted tree topologies with multiple core switches, e.g., Clos, are
deployed for large-sized datacenters. A multi-rooted tree topology can provide
multiple paths between any pair of end hosts.
In the access layer and core layer, high-end switches with high port density
and connection rate are used to provide high-speed data transmission. Due to
the high cost of high-end switches, datacenter architects choose to oversubscribe
the networks. The higher layers of the three-layer datacenter network are highly
oversubscribed, which in turn causes poor network utilization and limits the
overall bisection bandwidth and network throughput. To address the problem of
limited cross-section bandwidth close to the roots, the fat-tree topology [11]
(also known as a folded Clos) was proposed for large-scale datacenter
environments.
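
As a hedged illustration of what oversubscription means at a single switch, the sketch below computes the ratio of host-facing capacity to uplink capacity. The port counts and link rates in the example are hypothetical values chosen for illustration, and the function name is an assumption, not terminology from this work.

```python
from fractions import Fraction


def oversubscription_ratio(
    down_ports: int, down_gbps: int, up_ports: int, up_gbps: int
) -> Fraction:
    """Ratio of total downlink (host-facing) to total uplink capacity."""
    if min(down_ports, down_gbps, up_ports, up_gbps) <= 0:
        raise ValueError("port counts and link rates must be positive")
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


if __name__ == "__main__":
    # Hypothetical access switch: 48 x 1 Gbps host ports, 2 x 10 Gbps uplinks.
    ratio = oversubscription_ratio(48, 1, 2, 10)
    print(f"oversubscription = {float(ratio):.1f}:1")

    # A switch with equal downlink and uplink capacity is not oversubscribed.
    print(oversubscription_ratio(2, 1, 2, 1) == 1)
```

A ratio greater than 1 means that the hosts below the switch cannot all send at line rate through the uplinks at the same time; a ratio of exactly 1 corresponds to the 1:1 case.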
A fat-tree [11] is a multi-rooted tree topology that is a special instance of
the folded Clos network. It has 'fatter' links in the upper layers, which makes
it a mesh-like network with full bisection bandwidth. A fat-tree leverages
off-the-shelf Ethernet switches, and all the switches in the different layers
are identical, making it a cost-effective and easy-to-deploy solution for
large-scale datacenters with tens of thousands of servers. The fat-tree topology
is deployed in many datacenters. A $k$-ary fat-tree has $k$ pods, each
containing 2 layers of $k/2$ switches. Each edge switch in a pod is connected to
$k/2$ hosts and $k/2$ aggregation switches. There are $(k/2)^2$ core switches,
each of which has $k$ ports connected to the $k$ pods. The fat-tree topology has
great scalability: a $k$-ary fat-tree network can support $k^3/4$ hosts.
Furthermore, the fat-tree topology has identical bandwidth at any bisection, and
each layer has the same aggregated bandwidth. Therefore, it can achieve full
bisection bandwidth with a 1:1 oversubscription ratio. Compared with other
conventional tree-based topologies, a fat-tree network has fewer bandwidth
bottlenecks and can provide high bandwidth by interconnecting smaller commodity
switches.
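
The element counts of a $k$-ary fat-tree stated above can be checked with a short Python sketch. The class and function names are hypothetical; the formulas are the ones given in the text ($k$ pods, two layers of $k/2$ switches per pod, $(k/2)^2$ core switches, and $k^3/4$ hosts).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FatTreeSize:
    pods: int
    edge_switches: int
    aggregation_switches: int
    core_switches: int
    hosts: int


def fat_tree_size(k: int) -> FatTreeSize:
    """Element counts of a k-ary fat-tree built from identical k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return FatTreeSize(
        pods=k,
        edge_switches=k * half,         # k pods, k/2 edge switches each
        aggregation_switches=k * half,  # k pods, k/2 aggregation switches each
        core_switches=half * half,      # (k/2)^2
        hosts=k * half * half,          # k * (k/2) * (k/2) = k^3 / 4
    )


if __name__ == "__main__":
    for k in (4, 12, 48):
        print(k, fat_tree_size(k))
```

For example, 48-port commodity switches ($k = 48$) yield a network of 27,648 hosts, which is consistent with the "tens of thousands of servers" scale mentioned above.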
Recently, some relatively flat network architectures have been proposed to
replace the traditional multi-layer enterprise network architecture for large
datacenters in order to support virtualization. Virtualization is implemented in
large cloud datacenters to improve efficiency, and as a result, the network no
longer connects only hardware blocks, but also interconnects virtual machines
(VMs) and virtual storage volumes. Resources need to be dynamically reassigned
from any point within the datacenter to another point. Multi-layer
architectures, however, have high latencies and complex software: moving a VM
not only affects the access switch configuration, but may also require
reconfiguration of the aggregation and core switches. Therefore, the multi-layer
architecture is not considered efficient for VM migration. In contrast, flat
datacenter network architectures can interconnect basic units such as virtual
machines and virtual storage volumes across large, switched Ethernet fabrics.
1.2.2 Energy-Efficient Networking
The fat-tree topology provides a scalable and cost-effective solution for
large-scale datacenters. A fat-tree is built from a large number of densely
connected switches and can support any communication pattern at full bandwidth.
However, it achieves the extra bandwidth by provisioning many redundant network
links and switches, providing multiple paths between any two end servers. This
rich connectivity delivers high performance at peak network loads. However, the
redundancy also wastes network resources, especially during night hours when the
network load is extremely low. If the redundant network resources can be put
into a low-power mode, it is possible to dynamically vary the number of active
network elements so that only the necessary paths remain powered, thereby
improving the energy efficiency of datacenter networks.
Until recently, network interfaces and links were designed to stay awake all the
time. During idle periods, the interfaces send frames periodically to provide
synchronization as well as to serve as a keepalive mechanism. As a result,
interfaces run at full power all the time even when there is no traffic. An
increasing interest in networking energy efficiency led to the formation of the
Energy Efficient Ethernet (IEEE 802.3az) project in November 2006. The project
task force considered many proposals for changes to the Ethernet interface
standard to enable energy efficiency while ensuring backward compatibility and
network robustness. The final IEEE 802.3az Energy Efficient Ethernet (EEE)
standard [7] was published in November 2010, representing the beginning of a
change in networking architecture design. The core idea of the EEE standard is
the low-power idle (LPI) state, first proposed by Intel [50]. All the
EEE-supported PHY types in IEEE 802.3az can enter an LPI state during periods
when no data is sent to or from the interface. The EEE standard defines a
signaling protocol that network devices use to communicate with each other and
indicate the power state of the link between them. The protocol uses an LPI
signal, a modification of the normal idle signal transmitted between data
packets, to indicate that the link can switch to sleep mode, minimizing the
power consumption of the device ports on either side of the link. The
transmitting port sends normal idle signals when it wishes to restore the link
to its fully functional state. The EEE protocol can awaken the link at any time,
and there is no minimum or maximum sleep interval, which allows EEE to function
effectively in the presence of unpredictable traffic. Therefore, by switching
between a higher power state (data mode) and a lower power state (LPI mode)
depending on whether data is flowing through it, a network device can reduce
energy consumption when its utilization pattern includes long periods of
idleness.
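To make the effect of switching between data mode and LPI mode concrete, the following is a minimal two-state energy model. It is only a sketch: the power values, the per-transition overhead, and the function name are illustrative assumptions and are not taken from the IEEE 802.3az standard.

```python
def link_energy_joules(duration_s, busy_fraction, p_active_w, p_lpi_w,
                       transitions=0, transition_s=0.0):
    """Energy of one port that is either in data mode or in LPI mode.

    Assumption: every sleep/wake transition keeps the port at full power
    for `transition_s` seconds. All numbers are hypothetical inputs.
    """
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be in [0, 1]")
    busy_s = duration_s * busy_fraction
    overhead_s = transitions * transition_s
    idle_s = duration_s - busy_s - overhead_s
    if idle_s < 0:
        raise ValueError("busy time plus transition overhead exceeds duration")
    return (busy_s + overhead_s) * p_active_w + idle_s * p_lpi_w


if __name__ == "__main__":
    always_on = link_energy_joules(100.0, 1.0, p_active_w=1.0, p_lpi_w=0.1)
    with_lpi = link_energy_joules(100.0, 0.1, p_active_w=1.0, p_lpi_w=0.1)
    print(f"always on: {always_on:.1f} J, with LPI at 10% load: {with_lpi:.1f} J")
```

Under these assumed values, a port that is busy 10% of the time uses roughly a fifth of the always-on energy; frequent transitions reduce the benefit because each one adds time at full power.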
1.3 CONTRIBUTIONS
This research proposes a framework for improving the energy efficiency of
datacenter networks. Most datacenter network topologies are designed to maximize
cross-section bandwidth at minimal link and switch cost, in order to achieve
high scalability, low latency, high throughput, and low cost. Recently, however,
energy efficiency has become a performance metric for datacenter networks.
Al-Fares et al. [11] show that a fat-tree (also known as folded-Clos) topology
built from 1 Gbps commodity Ethernet chips uses considerably less power than a
hierarchical network composed of high-end, power-inefficient 10 Gbps switches.
Other researchers report that a flattened-butterfly topology is inherently more
power efficient than the other commonly proposed datacenter network
topologies [8].
This research complements prior work by developing analytical models for energy
consumption, which enable us to study fat-tree DCNs theoretically. In order to
dynamically adapt the datacenter topology to the network traffic load, we build
mathematical models of energy usage for a fat-tree datacenter network. The
approach can be generalized to derive energy consumption models for other
network topologies as well. The model considers different traffic patterns and
loads, and it can serve as a reference in power-aware network design and be used
to estimate energy consumption.
Based on the energy consumption models, we explore maximizing energy savings by
jointly optimizing task scheduling and flow assignment for given job loads.
Network traffic can be planned and consolidated through virtual machine
migration or by changing the size of edge switches.
We next propose a hardware merge network to consolidate the traffic at the
switches automatically. We evaluate simulation models of datacenters with merge
networks and obtain almost load-proportional energy consumption in large-scale
datacenters. We also build a prototype of a 2 × 2 merge network using passive
optical devices and test its performance. The results show that a merge network
can successfully consolidate traffic and decrease the number of active ports
needed in datacenter switches. In particular, significant energy savings can be
obtained for edge switches in the access layer.
The major outcomes of this work are summarized as follows:
1. Systematic analysis of the energy cost of fat-tree datacenter network
topologies to model how energy usage scales with total load as well as with
different types of loads;
2. Usage-based topology analysis of fat-tree networks with job placement and
scheduling in order to minimize network energy cost;
3. Development of a traffic-driven model for fat-tree networks that proves to be
accurate in predicting switch activity;
4. Evaluation of energy savings when using high-port-density edge switches
with certain traffic patterns and the impact on inter-pod routing;
5. Heuristic flow assignment algorithms to compute routing tables empirically
in simulations of large-scale datacenters;
6. Application of traffic merging in existing topologies and a study of the
energy efficiencies obtained, throughput sustained, and costs; and
7. Development, performance measurement and analysis of a new prototype of
merge networks using fiber interfaces and optical switches.
Chapter 2
LITERATURE REVIEW
Traditional network design has focused on providing increasingly greater
bandwidth and better coverage. As a result, networking equipment such as routers
and switches was designed with little consideration given to energy efficiency.
Indeed, networking equipment runs at full power regardless of traffic. However,
empirical measurements show that typical Ethernet traffic remains low most of
the time with occasional bursts, giving ample opportunity for saving energy in
network equipment. At the network level, the network architecture is designed
and dimensioned with over-provisioning and redundancy to sustain peak-hour
traffic. As a result, over-provisioned networks still consume a significant
amount of energy during low-traffic periods. A decade ago, researchers began
paying attention to this energy waste and studying ways to reduce it. In this
chapter, we first discuss energy-aware networking and then describe this problem
in the context of datacenter networks.
2.1 ENERGY-EFFICIENT NETWORKING
The IEEE 802.3az [7] standard published in November 2010 represents the
beginning of a change in networking architecture design. An EEE-compliant
network device should consume power only when data is being sent. When there is
a gap in the data stream, the network device or interface is in an idle state
and can be put into LPI mode. Early EEE-compliant devices use a simple
application of static logic design in the physical layer devices (PHYs) to
transition to the Low Power Idle (LPI) mode and save energy when data is not
present. Later generations of networking systems adopt new architecture designs
that apply more aggressive energy-saving techniques to all of the system
silicon, extending the range of energy savings.
2.1.1 Green Network Devices
Early work first explored the energy efficiency of network devices by
considering two power modes: a sleep mode and a fully working mode. The
constraints here include the non-zero wake-up time of the interfaces and the
spike in power consumption when an interface powers on. Therefore, it is
challenging to find the optimal trade-off between system reactivity and energy
savings and to determine when to wake up the device. In early research, Gupta
and Singh [48] examined the feasibility of putting different subcomponents of a
network switch or router into sleep mode and described the possible impact of
sleep mode on protocols such as VLAN, STP, and channel bonding. In their
subsequent research [46], the authors described different types of sleep mode
for an interface and presented an algorithm, incorporated in the host operating
system, that can dynamically change the power mode of the interfaces. By
analyzing the packet interarrival distribution in the buffer, the host
determines whether the next idle period is long enough to justify putting the
interface into sleep mode. In this work, they also proposed modifying L2
protocols to enable a more aggressive strategy that keeps the Ethernet interface
in the sleep state until the buffer queue exceeds a predefined threshold.
However, some low-level network protocols such as ARP and STP require periodic
control frames to maintain network connectivity, which is the primary constraint
on the length of the sleep state. Christensen et al. [28] [54] proposed a
network connectivity proxy (NCP) to handle network presence tasks for an idle
network host in a low-power sleep state. To address the problem of possible
packet loss while a port is asleep, Ananthanarayanan et al. [14] proposed a
novel architecture for buffering packets at the network ports when a port is in
the low-power state. Specifically, they propose using a shadow port to receive
ingress packets if any of the conventional ports are in the low-power state.
Each shadow port has the same hardware as a regular port and is associated with
a cluster of normal ports.
Besides using sleep mode to reduce energy consumption during idle periods, newer
network devices can scale power dynamically while they are in the active state.
During low-utilization periods, network devices can lower the operating rate of
their processing engines and reduce the link transmission rate, resulting in a
significant reduction of energy consumption. In general, operating a device at a
lower frequency allows substantial energy savings. For network equipment (e.g.,
linecards and transceivers) that supports frequency scaling and dynamic voltage
scaling, power consumption scales cubically with operating frequency [70].
There is a great deal of research on dynamically adapting the link rate to the
actual transmitted load. Gunaratne et al. [40] measure the power consumption of
Ethernet NICs and switches for a range of data rates and utilization levels, and
they first propose the notion of adaptive link rate (ALR) for Ethernet. In their
work, the Ethernet link data rate is scaled as a function of queue length in
both the PC and the LAN switch. In a later work, the same authors propose a
refined rate control policy based on dual buffer thresholds [42]. Subsequently,
they developed and evaluated a utilization-threshold policy and a
time-out-threshold policy to eliminate rate oscillations in the rate transition
process [41].
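The idea of a dual-threshold rate policy can be sketched as a simple hysteresis rule on queue length. This is an illustration in the spirit of the policies discussed above, not the cited authors' implementation; the function names, the two rates, and the threshold values are assumptions.

```python
def next_link_rate(current_rate_mbps, queue_len, low_threshold, high_threshold,
                   low_rate_mbps=100, high_rate_mbps=1000):
    """Pick the link rate for the next interval from the current queue length.

    The gap between the two thresholds acts as hysteresis: inside it the
    current rate is kept, which avoids oscillating between rates.
    """
    if low_threshold >= high_threshold:
        raise ValueError("low_threshold must be below high_threshold")
    if queue_len >= high_threshold:
        return high_rate_mbps   # queue is building up: switch to the high rate
    if queue_len <= low_threshold:
        return low_rate_mbps    # queue has drained: switch to the low rate
    return current_rate_mbps    # in between: keep the current rate


def simulate(queue_trace, low_threshold, high_threshold, initial_rate_mbps=100):
    """Apply the policy to a sequence of sampled queue lengths."""
    rate = initial_rate_mbps
    rates = []
    for queue_len in queue_trace:
        rate = next_link_rate(rate, queue_len, low_threshold, high_threshold)
        rates.append(rate)
    return rates


if __name__ == "__main__":
    print(simulate([0, 5, 12, 8, 5, 2, 6], low_threshold=3, high_threshold=10))
```

With a single threshold, a queue hovering around that value would trigger a rate change at almost every sample; the second threshold is what suppresses that behavior.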
In more complicated switches and routers, the linecard accounts for most of the
energy consumption. Mandviwalla et al. [65] study linecard architecture in
backbone routers and dynamically scale the linecard’s power to the predicted
workload. Their simulation results show that dynamic power scaling can achieve
60% energy savings. Hu et al. [52] propose a reconfigurable router architecture
that supports multiple router operational states (e.g., an energy-saving state)
with fast switching ability. The router settings, including routing path, clock
frequency, and supply voltage, are reconfigurable, aiming to support
rate-adaptive processing and power-aware routing. The authors of [77] measure
the power consumption of NetFPGA-based gigabit routers in standard and
low-frequency modes with different numbers of activated ports. They compare the
internal power usage of a gigabit router at packet-level and byte-level
granularity and analyze the impact of router frequency, number of activated
ports, traffic load, and packet size on potential network power savings. In
another work, the same authors propose a practical implementation of power
scaling algorithms that adaptively modulate router frequency on a periodic or
threshold basis [78].
Other work compares the sleep mode scheme with the rate adaptation scheme.
Nedevschi et al. [70] evaluate the two approaches in terms of achieved energy
savings, QoS, packet delay, and loss rate. Both schemes can offer a substantial
reduction in power consumption with minimal packet loss and a relatively small
increase in network latency. However, there exists a boundary utilization below
which sleep mode offers better energy savings than adaptive link rate, depending
on how much a device’s power consumption scales with frequency and on the
magnitude of its active-to-idle power draw. Furthermore, the authors compare two
sets of data rates. In the first set, rates are distributed exponentially (e.g.,
10 Mbps, 100 Mbps, 1000 Mbps), while in the second set they are distributed
uniformly (e.g., 330 Mbps, 660 Mbps, 1000 Mbps). Interestingly, the uniformly
distributed rates incur a lower additional delay and achieve greater energy
savings than the exponentially distributed set, since the uniformly distributed
set requires fewer rate transitions; each transition incurs a delay, which
reduces energy savings and increases overall delay. However, the authors also
mention that supporting more rates increases management complexity and causes
extra overhead. Other work [67] [83] also compares sleep mode and ALR mode and
concludes that rate adaptation is more robust during bursty load periods, while
sleep mode has much lower complexity and overhead with comparable performance.
Meanwhile, new hardware technologies enable the redesign and re-engineering of
network devices to improve hardware energy efficiency. For example, novel
energy-efficient silicon (ASICs and FPGAs) contributes to performance gains in
packet processing engines, allowing higher clock frequencies and faster packet
forwarding, and achieves a better energy cost per gigabit. As part of this
research, Yamada et al. integrate ASICs/FPGAs and router memories and adopt a
scalable central architecture in the router, which successfully supports 1 Tbps
with doubled energy efficiency. Other research focuses on using optical
switching architectures to replace electronic devices [18]. In general, optical
switching is much more energy efficient than its electronic counterpart.
However, it is not possible to buffer the optical signal, so optical switches
lack management flexibility. Also, optical switches support only a limited
number of ports (fewer than 100), which limits their application in large
backbone networks.
2.1.2 Power-Aware Network Infrastructure
The emerging low-energy modes of network devices allow network switches, links,
or parts of the network to be put into sleep mode, achieving non-negligible
energy savings for each device or group of collaborating devices. Some work goes
further and coordinates devices network-wide to dynamically put a portion of the
network to sleep during periods of low to medium utilization, addressing the
problem of network over-provisioning and redundant design. In this section, we
discuss some of these network-wide energy-aware strategies.
Power-Aware Routing and Traffic Engineering
Energy-efficient devices utilize modified physical layer and link layer
protocols to coordinate the transition between sleep and active modes. At the
network layer, power-aware routing consolidates traffic flows onto a subset of
links and network devices, allowing more idle interfaces, links, and network
components to be put into the sleep state. Finding a minimum subset of network
devices is an optimization problem subject to the constraints of preserving full
network connectivity and satisfying QoS requirements. Theoretically, the
power-aware routing problem is an extension of the general capacitated
multi-commodity flow problem, which is an NP-complete mixed integer programming
(MIP) problem.
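One generic way to write such a problem down is shown below. This is a textbook-style sketch of a capacitated multi-commodity flow model with on/off decisions, not the formulation of any particular cited paper. All symbols are assumptions introduced here: a graph $(V, E)$, binary variables $x_{ij}$ (link on) and $y_i$ (node on), flow variables $f^{sd}_{ij}$, demands $t^{sd}$, link capacities $c_{ij}$, a maximum utilization $\alpha \in (0, 1]$, and power costs $P^{\mathrm{link}}_{ij}$ and $P^{\mathrm{node}}_{i}$.

```latex
\begin{aligned}
\min_{x,\,y,\,f}\quad
  & \sum_{(i,j)\in E} P^{\mathrm{link}}_{ij}\, x_{ij}
    + \sum_{i\in V} P^{\mathrm{node}}_{i}\, y_{i} \\
\text{s.t.}\quad
  & \sum_{j:(i,j)\in E} f^{sd}_{ij} - \sum_{j:(j,i)\in E} f^{sd}_{ji} =
    \begin{cases}
      t^{sd}  & \text{if } i = s \\
      -t^{sd} & \text{if } i = d \\
      0       & \text{otherwise}
    \end{cases}
    && \forall i \in V,\ \forall (s,d) \\
  & \sum_{(s,d)} f^{sd}_{ij} \le \alpha\, c_{ij}\, x_{ij}
    && \forall (i,j)\in E \\
  & x_{ij} \le y_{i}, \quad x_{ij} \le y_{j}
    && \forall (i,j)\in E \\
  & x_{ij},\, y_{i} \in \{0,1\}, \quad f^{sd}_{ij} \ge 0
\end{aligned}
```

The first constraint is flow conservation for every demand, the second caps utilization on active links and forces zero flow on sleeping links, and the third keeps a link on only if both of its endpoints are on. QoS requirements such as delay bounds would add further constraints.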
In the position paper where Gupta and Singh [47] first presented the idea of
putting network components into sleeping mode, the possibility of coordinated
sleeping mode of multiple routers was described. The challenge is that the routing
algorithms (OSPF, for example) will consider the sleeping nodes as having failed
and recompute the routes, which incurs extra computing overhead. The paper
discussed a possible solution of pre-computing alternative routes.
Chiaraviglio et al. [27] were the first to formulate an optimization model to
find the set of routers and links that must be powered on so that the total
power consumption is minimized, subject to flow conservation and maximum link
utilization constraints. This network design problem falls into the class of
capacitated multi-commodity flow problems, which is NP-complete. Therefore, they
propose greedy heuristic algorithms and apply them to ISP backbone networks.
They test different node and link selection strategies in response to day/night
traffic patterns and show that the total network power consumption can be
reduced by switching off links and ports accordingly. Similar work includes
[37], which evaluates heuristic algorithms on topology and traffic data from the
Abilene backbone network. The authors show that the simplest heuristic algorithm
can reduce energy consumption by 79% under realistic traffic loads.
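The flavor of such greedy heuristics can be illustrated with a deliberately simplified sketch. It tries to put links to sleep in decreasing order of power and keeps a link awake only if removing it would disconnect the network. Unlike the cited work, it ignores traffic demands and link utilization limits; the function names and the example topology are hypothetical.

```python
from collections import deque


def is_connected(nodes, links):
    """Breadth-first search check that all nodes are mutually reachable."""
    nodes = set(nodes)
    if not nodes:
        return True
    adjacency = {node: set() for node in nodes}
    for u, v in links:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes


def greedy_link_pruning(nodes, link_power):
    """Put links to sleep, most power-hungry first, while staying connected.

    link_power maps an undirected link (u, v) to its power draw in watts.
    Returns (active_links, sleeping_links).
    """
    active = set(link_power)
    sleeping = []
    for link in sorted(link_power, key=link_power.get, reverse=True):
        trial = active - {link}
        if is_connected(nodes, trial):
            active = trial
            sleeping.append(link)
    return active, sleeping


if __name__ == "__main__":
    nodes = ["A", "B", "C", "D"]
    power = {("A", "B"): 10, ("B", "C"): 10, ("C", "D"): 10, ("D", "A"): 20}
    print(greedy_link_pruning(nodes, power))
```

A version closer to the cited approaches would also re-route the traffic matrix after each tentative removal and reject the removal if any link exceeded its utilization limit.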
More recent work further explores the practical application of energy-aware
routing. Coudert [33] formulates a model combining redundancy elimination and
energy-aware routing to increase the energy efficiency of backbone networks. The
work in [62] quantifies the effects of five recently proposed power-aware
routing approaches and shows that switching off redundant links significantly
affects terminal reliability (TR) and route reliability (RR). Accordingly, the
authors propose a practical algorithm called “reliable Green-Routing” that
switches off as many network cables as possible subject to link utilization as
well as TR/RR requirements.
Power-Aware Architecture Design
Power-aware routing aims at dynamically adapting the network topology to network
usage to address the over-provisioning problem. This approach can be a practical
way to improve the energy efficiency of existing networks. Other work advocates
redesigning the network architecture in order to meet energy efficiency goals
and QoS guarantees. Several authors propose new architecture designs for energy
savings [17] [25] [73]. For instance, [17] considers synchronizing the operation
of routers and scheduling traffic in advance, since traffic comes from
predictable services (such as video). [25] proposes power awareness in the
design, configuration, and management of networks and in protocol
implementations. The authors conducted a measurement study of the power
consumption of various configurations of widely used core and edge routers and
created a generic power model for routers. In addition, [73] proposes a planning
model that clearly shows the trade-off between energy consumption and network
performance, emphasizes the importance of accounting for reliability in
energy-efficient network design, and analyzes robustness issues in some of the
designs. To leverage the high efficiency of optical switching, some research on
hybrid network architectures combines optical transport and electronic packet
processing. For instance, Baldi et al. [17] propose using a complementary Dense
Wavelength Division Multiplexing (DWDM) optical cable for deterministic traffic.
2.1.3 Power-Aware Software Stack
In the operating system and user-space applications, it is possible to implement
energy-saving strategies in transport layer and application layer protocols. For
example, Irish et al. [59] modify the TCP/IP protocol by putting a TCP SLEEP
signal in the TCP header to notify the other party to stop sending data.
Application layer protocols can enforce power-aware configuration as well. For
example, Blackburn et al. [53] [22] customize the Telnet and BitTorrent
protocols so that clients send a sleep signal to servers to advertise their
energy state, and they implement a probing mechanism to avoid sending keepalive
messages. Furthermore, Microsoft scientists have developed general tools that
help application programmers write energy-efficient programs. Kansal et al. [55]
present automated tools that profile the energy usage of the various network
resource components used, as guidance for energy-efficient application design.
Baek et al. [16] provide a framework that enables programmers to approximate
expensive functions and loops in a systematic manner while providing statistical
QoS guarantees.
2.2 DATACENTER NETWORK ARCHITECTURE
2.2.1 Datacenter Network Topologies
Inside a present-day datacenter, tens of thousands of servers are interconnected
by a network of switches called a datacenter network (DCN). Many datacenters
deploy a multi-layer, multi-rooted tree-structured DCN, such as Clos [32] or
fat-tree [11] (also known as folded-Clos).
A Clos network topology [29][32] has three stages. Each stage is composed of
many crossbar switches. An (m, n, r) Clos network has m middle-stage switches,
r input switches, and r output switches, where n is the number of input (output)
ports on each input (output) switch. Each input switch is an n × m crossbar, and
every output switch is an m × n crossbar. The input switches are fully connected
with the middle-stage switches, and the middle-stage switches are in turn fully
connected with all output switches. Every middle-stage switch has r input links
from the input-stage switches and r output links to the output-stage switches.
Thus, the middle-stage switches are r × r crossbars. An (m, n, r) Clos network
can have N = rn end nodes. With m middle-stage switches, there are m different
paths between each pair of input and output nodes. Therefore, the Clos topology
has very good path diversity.
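The relationships among m, n, and r described above can be collected in a small helper. The function and key names are illustrative; the values simply restate the definitions in the text.

```python
def clos_parameters(m: int, n: int, r: int) -> dict:
    """Derived sizes of a three-stage (m, n, r) Clos network."""
    if min(m, n, r) < 1:
        raise ValueError("m, n and r must be positive integers")
    return {
        "input_switch_crossbar": (n, m),    # each of the r input switches is n x m
        "middle_switch_crossbar": (r, r),   # each of the m middle switches is r x r
        "output_switch_crossbar": (m, n),   # each of the r output switches is m x n
        "end_nodes": r * n,                 # N = rn
        "paths_per_input_output_pair": m,   # one path through each middle switch
        "total_switches": 2 * r + m,
    }


if __name__ == "__main__":
    print(clos_parameters(m=4, n=2, r=3))
```

Increasing m adds middle-stage switches and therefore paths between every input and output pair, which is where the path diversity noted above comes from.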
Fat-tree [11] is another example of a multi-rooted tree topology that is widely
deployed in many datacenters. A fat-tree network leverages off-the-shelf
Ethernet switches to connect tens of thousands of nodes. A k-ary fat-tree has
k pods, each containing two layers of $k/2$ switches. Each edge switch in a pod
is connected to $k/2$ hosts and $k/2$ aggregation switches. There are $(k/2)^2$
core switches, each of which has k ports, one connected to each of the k pods.
The fat-tree topology scales well: a k-ary fat-tree network can support $k^3/4$
hosts. Furthermore, the fat-tree topology has identical bandwidth at any
bisection, and each layer has the same aggregate bandwidth. Therefore, it can
achieve full bisection bandwidth with a 1:1 oversubscription ratio. Compared
with conventional tree-based topologies, a fat-tree network has fewer bandwidth
bottlenecks and can provide high bandwidth by interconnecting smaller commodity
switches.
Another cost-efficient interconnection network topology is the flattened
butterfly [57]. The flattened butterfly structure is derived from the
conventional butterfly topology by combining the routers in each row into a
single router. Channels within a row are eliminated, and all the other channels
in the flattened butterfly are bidirectional. Because the routers in the same
row are combined, a path between a pair of nodes can traverse the dimensions in
any order, providing better path diversity than a conventional butterfly
topology. A k-ary n-flat flattened butterfly can support $N = k^n$ end nodes
using routers with $n(k-1)+1$ links each. With its high degree of
interconnection, the flattened butterfly scales more effectively than k-ary
n-cubes. Also, the flattened butterfly has a smaller network diameter than the
folded-Clos network. With load-balanced traffic, a flattened butterfly costs
approximately half as much as a folded-Clos.
The fat-tree, Clos, and flattened butterfly are all switch-centric architectures
with interconnection intelligence built in switches [45]. Recently, some new server-
centric architectures were proposed to rely on servers to make routing decisions
and use them as intermediate routers. Examples of server-centric topologies include
DCell [43], BCube [44], FiConn [60] and CamCube [9].
DCell [43] is a recursively defined interconnection structure. Lower-level
DCells are connected to construct a higher-level DCell. Therefore, DCell scales
out doubly exponentially. It can support an enormous number of servers with only
a few levels (k) and switch ports (n). For example, a 3-level DCell can
accommodate around 3.26 million servers [21]. In a DCell network, servers
actively participate in the routing and forwarding process. Each DCell can be
regarded as a virtual node, and all virtual nodes at the same level are fully
connected to each other. The rich physical connectivity and distributed routing
protocol provide better fault tolerance. There is no single point of failure in
a DCell. Compared with other datacenter network topologies, DCell has better
scalability, better fault tolerance, and higher network capacity. However, DCell
incurs a higher wiring cost and has load-balancing problems.
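The doubly exponential growth can be reproduced with the recurrence from the DCell design, in which a level-k DCell is built from $t_{k-1} + 1$ copies of a level-(k-1) DCell, so that $t_k = t_{k-1}(t_{k-1} + 1)$ with $t_0 = n$. The sketch below assumes n = 6 servers per level-0 DCell, a value chosen here because it matches the 3.26 million figure quoted above; the text itself does not state n.

```python
def dcell_servers(n: int, k: int) -> int:
    """Number of servers in a level-k DCell built from n-port mini-switches.

    Uses t_0 = n and t_k = t_(k-1) * (t_(k-1) + 1).
    """
    if n < 1 or k < 0:
        raise ValueError("n must be >= 1 and k must be >= 0")
    servers = n
    for _ in range(k):
        servers = servers * (servers + 1)
    return servers


if __name__ == "__main__":
    for level in range(4):
        print(level, dcell_servers(6, level))
```

Each level roughly squares the server count, which is why three levels already reach the millions.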
BCube [44] is another server-centric architecture, suitable for container-based
modular datacenters. The switches connect only to servers and act as crossbars.
Servers communicate through the switches or through other relaying servers.
Since all the switches in BCube are equally connected, there is no
oversubscription bottleneck, and BCube can provide high inter-server throughput.
However, BCube requires each server to have multiple network ports in order to
scale out, which is a barrier to scaling BCube to millions of servers.
DCell and BCube both require that a server have more network interfaces in order
to scale out, which is a challenge for datacenters currently using commodity
servers with at most two network ports each. FiConn [60] is a server-centric
network that expands similarly to DCell. However, a server in FiConn needs only
two Ethernet ports to scale out to a large number of servers. DPillar [61] is
another server-centric network with dual-port servers. DPillar’s structure was
inspired by the classic butterfly network. It has a symmetric structure and
eliminates network bottlenecks. As a result, DPillar can achieve better
scalability and network performance.
The server-centric networks we discussed above are all hybrid direct-connect
topologies, which use simple dummy mini-switches to connect servers. Abu-Libdeh
et al. [9] propose a direct-connect topology (CamCube) where each server connects
directly with a set of other servers, without using any switches or routers. It
connects servers in a 3D torus topology, and each server directly connects to six
other servers.
Singla et al. propose a high-capacity network topology called Jellyfish [76]. It
adopts a random topology with high flexibility and network capacity. A Jellyfish
network has a small network diameter and supports fine-grained incremental
expansion. However, the unstructured design also brings challenges in routing
strategy and network wiring. Similar works include Scafida [49] and Small-World
Datacenters (SWDC) [74].
2.2.2 Datacenter Network Protocols
There is a broad range of applications running in datacenters today, from web
hosting services to file storage services to other customized applications. With
the increasing popularity of cloud computing, large online service providers
such as Amazon, Google, and Microsoft have launched large cloud datacenters
offering user-facing Internet services such as web services, instant messaging,
and webmail. Additionally, many large-scale data-intensive applications like
MapReduce, Hadoop, and Dryad run in large production datacenters. Benson et
al. [20] studied the applications and traffic patterns of ten datacenters.
As the number and variety of applications increase, the network performance of
datacenters will significantly influence QoS. For example, a MapReduce job
running in the partition/aggregate pattern distributes small tasks to worker
nodes and collects the results afterward, which requires transferring large
amounts of data among servers at a very high rate and demands substantial
network bandwidth. MapReduce is also latency-sensitive and requires minimal
response time. If a worker node misses its deadline, its result is simply
ignored, which degrades the quality of the overall outcome.
To address the performance requirements of applications, datacenter designers
have focused their work on routing, flow control, and management protocols to
optimize the desired metrics of network utilization, latency, and throughput.
Current datacenter network protocols originate from those designed for general
LAN settings, which have predictable communication patterns and limited paths
between end nodes. With the exponential increase in the number of hosts, and
especially the emergence of distributed cloud computing, many research efforts
are focusing on datacenter protocols to address the network management
challenges and new application requirements.
Routing and Addressing
The development of large-scale datacenters with an increasing number of servers
also imposes a significant challenge on the scalability of datacenter protocols.
In a cloud computing environment, host virtualization allows multiple virtual
machines (VMs) to run on one physical machine. Each virtual machine is assigned
a fixed IP address and a MAC address, requiring more scalable address resolution
to locate millions of end nodes. Furthermore, virtual machines are migrated to
tightly coupled hosts in order to achieve higher throughput, which imposes
challenges on IP address configuration. Some researchers have proposed new
addressing schemes for datacenters. For example, Al-Fares et al. [11] described
a particular IP addressing scheme for the fat-tree topology that provides high
bisection bandwidth without using high-cost core switches. The fat-tree network
enforces a special IP addressing scheme and a routing protocol customized for
its topology. The routing algorithm uses two-level look-ups at each node to
distribute traffic. The prefix look-up in the primary table routes down to
servers, and the suffix look-up in the secondary table routes up towards core
switches. The pod switches forward subsequent packets of the same flow to the
same outgoing port. Hence, all the packets going to the same destination follow
the same path without packet reordering.
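A two-level look-up of this kind can be sketched as follows. The table contents model a hypothetical aggregation switch in pod 2 of a k = 4 fat-tree and assume addresses of the form 10.pod.switch.host; the function name, the port numbers, and the choice to match the suffix on the last address byte are illustrative assumptions rather than details given in the text.

```python
import ipaddress


def two_level_lookup(dst, prefix_table, suffix_table):
    """Return the output port for a destination IPv4 address.

    prefix_table: list of (network, port); longest matching prefix wins and
                  routes traffic down towards servers in this pod.
    suffix_table: dict mapping the last address byte (host ID) to an uplink
                  port; used only when no prefix matches, i.e. for traffic
                  leaving the pod.
    """
    address = ipaddress.ip_address(dst)
    best = None
    for network, port in prefix_table:
        net = ipaddress.ip_network(network)
        if address in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, port)
    if best is not None:
        return best[1]
    host_id = int(address) & 0xFF
    return suffix_table[host_id]


if __name__ == "__main__":
    prefixes = [("10.2.0.0/24", 0), ("10.2.1.0/24", 1)]  # subnets inside pod 2
    suffixes = {2: 2, 3: 3}                              # host ID -> uplink port
    print(two_level_lookup("10.2.1.3", prefixes, suffixes))  # stays in the pod
    print(two_level_lookup("10.3.0.2", prefixes, suffixes))  # goes up, by host ID
    print(two_level_lookup("10.3.0.3", prefixes, suffixes))
```

Because the uplink is chosen from the destination's host ID, traffic to different hosts in the same remote subnet is spread over different uplinks, while all packets to one destination keep using the same port.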
Datacenter protocols originate from Ethernet and IP-based protocols support-
ing arbitrary topologies. As a result, many current datacenter protocols have
obvious limitations such as inflexibility, high configuration overhead, and limited
scalability. Recently, some Ethernet-compatible protocols for datacenters have
been proposed to address these problems.
For example, SEATTLE [56] is an early proposal of scalable Ethernet-compatible
architectures for large-scale datacenters. SEATTLE keeps the simplicity of Eth-
ernet by forwarding packets based on layer 2 flat MAC addresses, and employs a
broadcast-based link state protocol. Flat addressing treats the datacenter as a uni-
fied entity and enables lower administration overhead on handling network config-
uration and host mobility. To overcome the scalability problem of flat addressing,
SEATTLE employs a directory service by building a one-hop Distributed Hash Ta-
ble (DHT). Instead of requiring each switch to maintain a state for every end node,
SEATTLE employs a link-state protocol, which only keeps switch-level topology
and then uses a hash function to map host information to a switch. Switches
first run a discovery protocol to find their positions and automatically configure
the link-state protocol. Additionally, SEATTLE leverages communication locality
by letting switches cache the shortest paths of previous queries. Further, location
information is memorized during end host ARP queries for later data packet
transmission. Compared with other hybrid IP/Ethernet network protocols, SEATTLE
improves the communication efficiency with minimal management complexity.
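As a rough illustration of the one-hop DHT idea, the sketch below hashes each host's MAC address onto a ring of switches, so that exactly one switch stores that host's location and any switch can find it in a single hop. The class name OneHopDirectory, the choice of SHA-256, and the ring size are assumptions of this example rather than details of SEATTLE [56].

```python
import bisect
import hashlib


def _h(key: str) -> int:
    """Hash a string onto a ring of 2**32 points."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class OneHopDirectory:
    """Hypothetical directory: each host entry lives on one resolver switch."""

    def __init__(self, switches):
        # Every switch knows the full switch list from the link-state protocol,
        # so each one can compute the same ring locally.
        self.ring = sorted((_h(s), s) for s in switches)
        self.points = [p for p, _ in self.ring]
        self.store = {s: {} for s in switches}

    def resolver(self, mac: str) -> str:
        """Return the switch responsible for this MAC address."""
        i = bisect.bisect_left(self.points, _h(mac)) % len(self.ring)
        return self.ring[i][1]

    def publish(self, mac: str, attached_switch: str) -> None:
        """Record which switch the host is attached to."""
        self.store[self.resolver(mac)][mac] = attached_switch

    def resolve(self, mac: str):
        """Look up a host's location with a single query to its resolver."""
        return self.store[self.resolver(mac)].get(mac)
```

With this arrangement a switch keeps switch-level topology plus only the host entries that hash to it, rather than state for every end node.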
Another proposed approach with similar goals is PortLand [69], which is a
scalable and fault-tolerant layer 2 network protocol implemented over the fat-tree
topology. It consists of a set of routing, forwarding, and address resolution
protocols, which apply to any multi-rooted tree topology of current datacenters.
Instead of using flat MAC addresses as seen in SEATTLE, PortLand leverages the
knowledge of the topology by encoding positions into hierarchical addresses and
thus achieves scalability and efficiency. It uses pseudo MAC (PMAC) addresses
and embeds the baseline network topology information into the PMAC addresses.
Each end node is assigned a unique PMAC address containing hierarchical position
information. Switches update the forwarding table through a lightweight Location
Discovery Protocol (LDP). Compared with the traditional layer 2 flat MAC
addresses, the hierarchical structure of the PMAC addresses allows a relatively small
forwarding table in each switch and enables more efficient routing and forwarding.
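To make the idea of a position-encoding address concrete, the sketch below packs a hierarchical position into a 48-bit MAC-sized value. The pod/position/port/vmid fields and their widths (16, 8, 8 and 16 bits) are assumptions of this example, as are the function names; they are meant only to show how a switch could forward on a short prefix instead of a full flat address.

```python
def encode_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    """Pack a hierarchical position into a 48-bit address (assumed layout)."""
    for value, bits in ((pod, 16), (position, 8), (port, 8), (vmid, 16)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in its assumed width")
    raw = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in raw.to_bytes(6, "big"))


def decode_pmac(pmac: str):
    """Recover (pod, position, port, vmid) from the textual address."""
    raw = int(pmac.replace(":", ""), 16)
    return (raw >> 32) & 0xFFFF, (raw >> 24) & 0xFF, (raw >> 16) & 0xFF, raw & 0xFFFF


def same_pod(pmac_a: str, pmac_b: str) -> bool:
    """A core switch only needs the pod field to choose an output port."""
    return decode_pmac(pmac_a)[0] == decode_pmac(pmac_b)[0]
```

For example, encode_pmac(2, 1, 0, 1) yields 00:02:01:00:00:01, and a switch can decide whether a destination lies in its own pod by comparing the first two bytes.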
While PortLand achieves scalability by using topologically significant PMAC
addresses, the position-dependent addressing constrains virtual machine migration.
To overcome this limitation, Greenberg et al. presented the Virtual Layer 2
(VL2) [39] network architecture, which is based on layer 3 IP routing in the network
infrastructure and implements flat addressing at the server level. In VL2, all the
switches use location-specific IP addresses (LAs). At the application level,
applications are assigned application-specific IP addresses (AAs). The VL2 agent
at each server encapsulates each packet, which carries the AA of the destination
application, with the LA of the ToR switch directly connected to the destination
server. When the packet arrives at the destination ToR, the switch decapsulates
the packet and forwards it to the target server. The AA represents the name of the
server instead of its location, which allows virtual machines to migrate with no IP
address modification overhead. In addition, the network-level IP protocol ensures
efficient forwarding and scalability with a small amount of state at each switch. The
VL2 implementation does not require changing the software and API of current
switches. Therefore, it is a practical solution for datacenters with commodity switches.
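The AA/LA separation can be sketched in a few lines. The example below is only an illustration of the encapsulation step: the directory is modeled as a plain dictionary from AA to ToR LA, and the names Packet, Encapsulated, Vl2Agent and tor_receive are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_aa: str      # application-specific address of the sender
    dst_aa: str      # application-specific address of the receiver
    payload: bytes


@dataclass(frozen=True)
class Encapsulated:
    dst_la: str      # location-specific address of the destination ToR
    inner: Packet


class Vl2Agent:
    """Hypothetical server-side agent holding an AA -> ToR LA directory."""

    def __init__(self, directory: dict):
        self.directory = directory

    def send(self, pkt: Packet) -> Encapsulated:
        # Wrap the AA-addressed packet with the LA of the destination's ToR.
        return Encapsulated(dst_la=self.directory[pkt.dst_aa], inner=pkt)


def tor_receive(tor_la: str, frame: Encapsulated) -> Packet:
    """Destination ToR strips the outer header and delivers by AA."""
    if frame.dst_la != tor_la:
        raise ValueError("frame was routed to the wrong ToR")
    return frame.inner
```

When a virtual machine migrates, only the directory entry changes (for example, directory["20.0.0.55"] is updated to the new ToR's LA); the AA seen by applications stays the same.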
The multi-rooted tree structures of current datacenter networks provide multi-
ple paths between each pair of servers. Many existing forwarding protocols apply
Equal Cost Multipath (ECMP) to select a path statically using flow hashing,
resulting in collisions and bandwidth losses. To address these challenges, Al-Fares
et al. proposed Hedera, a dynamic flow scheduling system that adaptively allocates
flows to paths [12]. Hedera is a centralized scheduler with a global view of the flows
and network utilization status. Therefore, it can not only schedule flows onto the
less-utilized core switches, but also fully leverage the high degree of parallelism
provided by the multi-rooted tree topology and achieve nearly full bisection
bandwidth.
To achieve load balancing in multipath datacenter networks, a multipath
forwarding approach called Smart Path Assignment In Networks (SPAIN) [68] has
been proposed. While the traditional Ethernet protocol uses the Spanning Tree
Protocol (STP) to generate a single loop-free tree, SPAIN explores a set of redundant
paths in a network topology and merges these paths into a set of trees. Every switch installs
the information of VLANs, each of which maps to one tree. SPAIN supports layer
2 flat addressing and routing, and can deliver higher bandwidth and better fault
tolerance than spanning tree.
Similarly, Raiciu et al. propose Multipath TCP (MPTCP) [71] as an extension
of the current TCP protocol. MPTCP explores multiple paths simultaneously
and utilizes the congestion feedback to choose the effective paths. Compared with
single-path TCP, MPTCP can find unused capacity more effectively and maximize
the utilization of networks in topologies with full bisection bandwidth.
MPTCP has been deployed in Amazon's latest EC2 environment, and experimental
tests show that it can improve the throughput by 300%.
The server-centric architectures allow servers to perform routing and forward-
ing. DCell uses a single path routing protocol. Each server is assigned a (k + 1)-
tuple as the address. The DCellRouting routing algorithm follows a divide-and-
conquer approach to find the path from source to destination. BCube adopts a
source routing protocol called BSR (BCube Source Routing). When a new flow
arrives, the source server sends out probe packets through multiple parallel paths
and selects the best route after it receives the probe response. BSR can fully
utilize the high capacity provided by the BCube topology and realize automatic
load balancing. In a FiConn network, every intermediate server takes a greedy
approach to establishing a traffic-aware path hop by hop. The source server
always selects the outgoing link with higher bandwidth to forward the traffic, thus
balancing the traffic.
Flow Control and Resource Management
OpenFlow [66] provides an open protocol to program the flow table in a switch.
OpenFlow-enabled switches support fine-grained, flow-level control over Ethernet
switching. However, its centralized control and global visibility also create
scalability issues. Curtis et al. designed a modified model called DevoFlow [31]. It
pushes most flow control to the switches, and the controller manages only
significant flows and packets. The distributed control mechanism reduces the sizes
of flow tables and the volume of control messages.
Many studies show that datacenter traffic is characterized by a few large flows
and many small flows. Although there are only a few long-lived flows, these flows play
a critical role in deciding the achievable network bisection bandwidth. Current
datacenter commodity switches have limited buffer sizes. Short flows may
also experience long latencies if the long flows occupy the available buffers
in the switches. To maintain high throughput for large flows and low latency for
short flows, the Datacenter TCP protocol (DCTCP) [13] was proposed.
DCTCP uses Explicit Congestion Notification (ECN) as feedback
on the extent of switch congestion and resizes the window in proportion to it. In
contrast, traditional TCP halves the window whenever it receives ECN feedback, which
causes buffer underflow and throughput losses. Experimental results show that
DCTCP can deliver comparable throughput with 90% less buffer space occupancy
compared with conventional TCP.
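The difference between proportional and fixed window reduction can be illustrated with a short sketch. It assumes the commonly described DCTCP update rule, in which the sender keeps a running estimate alpha of the fraction of ECN-marked packets and shrinks the window by a factor of alpha/2. The gain g = 1/16 and the function names are assumptions of this example.

```python
def dctcp_update(cwnd: float, alpha: float, marked: int, total: int, g: float = 1 / 16):
    """One window's worth of feedback: return the new (cwnd, alpha).

    alpha is an exponentially weighted estimate of the marked fraction;
    the window shrinks in proportion to it (assumed update rule).
    """
    frac = marked / total if total else 0.0
    alpha = (1 - g) * alpha + g * frac
    if marked:
        cwnd = max(1.0, cwnd * (1 - alpha / 2))
    return cwnd, alpha


def tcp_ecn_update(cwnd: float, marked: int) -> float:
    """Conventional reaction: halve the window on any ECN mark."""
    return max(1.0, cwnd / 2) if marked else cwnd
```

Under mild congestion (10 of 100 packets marked, alpha starting at 0) the DCTCP window falls from 100 to about 99.7, whereas the conventional rule drops it to 50. Under persistent heavy marking alpha approaches 1 and the two rules coincide.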
Another traffic management system, Mahout [30], was proposed recently
to detect elephant flows in datacenter traffic. Instead of polling the switches,
Mahout monitors the end hosts' socket buffers to detect long-lived flows and signals the
central controller. Hence, only the detected long flows are handled by
the central controller. As such, Mahout incurs much lower monitoring overhead
and consumes fewer switch resources than Hedera. It is relatively simple to implement
Mahout because it only requires a shim layer of software in the end host OS.
Although many resource management proposals consider the sizes of flows in
congestion control and flow scheduling mechanisms, few of them are aware of
flow deadlines or study how meeting those deadlines influences application
throughput. Wilson et al. designed and implemented a Deadline-Driven
Delivery (D3) control protocol [84] that applies explicit rate control to allocate
network bandwidth according to the flow deadlines. Results from a small
testbed implementation show that D3 can effectively double the peak load supported.
2.2.3 Alternative Datacenter Architectures
The topologies we have considered thus far consist of an enormous number of
electrical switches and links, resulting in high complexity of wiring and deployment
of the datacenter networks. Compared with the packet-switching technology, the
optical circuit switching technology can provide higher bandwidth at much lower
power cost, but it comes at a cost of a slower switching speed. An optical switch
requires on the order of one millisecond to establish a new circuit, which is much
longer than the transmission time for a single packet. Therefore, optical circuit
switching works best with high-speed, high-volume communications. Many propos-
als present new network architectures to leverage the high bandwidth transmission
advantage of optical circuit switching.
Wang et al. propose a hybrid packet and circuit switched datacenter network
architecture (HyPaC) [80] that combines a traditional electrical packet-switched
network with a rack-to-rack circuit-switched optical network. In the implementation of
the prototype system, servers buffer traffic until enough has accumulated for the
links to leverage the high-speed bandwidth. The experiment on an implemented
prototype shows that the optical inter-rack switches integrate well with the
Ethernet/TCP electrical switches. The case studies on different types of applications
suggest that this hybrid architecture benefits applications that require bulk data
transfers and are insensitive to latency.
Modular datacenters have been a new direction in building datacenters in the
past few years. Moderate numbers of servers interconnected with non-blocking
networks form a module called a pod. High bandwidth inter-pod connections are
necessary to prevent communication bottlenecks. An optical switch is an
option for the pod-level aggregated traffic demand. Farrington et al. [36] proposed
Helios, a hybrid electrical/optical switch architecture for modular datacenters.
Helios dynamically reconfigures the network topology at run-time according to the
monitored communication patterns and traffic demands.
More recently, Chen et al. [26] introduced an innovative Optical Switching
Architecture (OSA) for datacenter networks. Consisting of all optical switches,
OSA achieves high topology flexibility. It can dynamically change its topology or
link capacity to adapt to varying traffic patterns. Compared with
the hybrid structures, OSA is characterized by higher bandwidth, lower energy
consumption, and lower wiring complexity. However, small flows may suffer
non-trivial delay due to the reconfiguration delay of OSA.
2.3 DATACENTER NETWORK ENERGY EFFICIENCY
The exponential growth of Internet-scale applications drives the expansion of
large-scale, geographically distributed datacenters with fast-increasing energy cost. With
various types of applications, from online services and scientific computations to
MapReduce, running on datacenters, the communication patterns and network bandwidth
requirements have been driving new research on datacenter networks. We explore
the literature of recent studies that address various aspects of the challenges in
designing efficient datacenter networks.
A lot of research effort is focused on achieving energy proportionality in
datacenter networks. In practice, companies provision their datacenters for peak
usage. However, a typical datacenter workload is around 5%-25% of the peak.
Many researchers have therefore proposed energy-proportional datacenter networks. For
example, Lin et al. [63] explored the option of adaptively right-sizing the datacenters
by turning off idle servers. Heller et al. designed an ElasticTree topology [51] that
changes the network topology dynamically to adapt to varying traffic load.
Lin et al. introduced an online algorithm, Lazy Capacity Provisioning,
LCP(w), which predicts the arriving workload in a window of size w. In their case
study, they discuss several parameters that affect the cost savings of the approach.
In particular, the impact of valley filling, an alternative approach to right-sizing, is
evaluated and compared.
In the ElasticTree approach, Heller et al. developed a variety of optimizers to
compute a minimal-power subset of network elements according to different traffic
patterns. The power control turns off unnecessary switches and links and the
routing assigns routes accordingly. Tradeoffs between power, fault tolerance and
performance are considered as well in their approaches. A similar work on
dynamic topologies is CARPO [81], a correlation-aware power optimization
algorithm. Unlike ElasticTree, CARPO first consolidates traffic flows by
putting negatively correlated flows onto the same path and positively correlated
flows onto different paths, and then feeds the consolidated flows into a smaller set
of links before shutting off idle links.
More recently, Adnan and Gupta proposed an online path consolidation
algorithm to dynamically right-size the networks [10]. From multiple equal cost paths
between each pair of nodes, their algorithm selects the path that has the most
overlap with the paths between other pairs of nodes and meets the total flow bandwidth
requirement. By combining all the best overlapping paths together, the minimum
energy-proportional topology is formed. When there is a large number of flows in
the network, their method outperforms the ElasticTree approach.
Also, others have explored the idea of energy-proportional hardware, such as
energy-proportional servers or energy-proportional links. For instance, Abts et al.
proposed building energy-proportional datacenter networks by adapting the data
rate of individual links to the traffic intensity [8]. Abts et al. then compared the power
consumption of different topologies and proposed to build the network based
on the flattened butterfly topology, which is more energy-efficient than a fat-tree
topology of equivalent size and performance. Compared to the dynamic topology
approaches, this technique adapts at a finer granularity. Additionally, it
does not require changing the topology and routing.
The idea of traffic merging was proposed recently for running more
energy-efficient datacenters [23]. Given the low utilization of the links, traffic from
multiple links is aggregated into fewer uplinks through a simple hardware design
called the Merge Network. Unused links can be set to low power mode to save energy.
The hardware implementation is simple, adds no delay, and consumes
little power.
2.4 COMPARISON AND DISCUSSIONS
The expansion of datacenter infrastructure motivates the research on how to
efficiently interconnect a huge number of servers. The high capacity requirement of
applications implies that the topologies must have high bisection bandwidth. For
example, the fat-tree and Clos topologies are both hierarchical multi-rooted
tree-structured topologies connecting multiple levels of commodity switches. These
topologies support high-rate communications between any pair of end hosts and are
widely deployed in existing datacenters. Another approach considers server-centric
topologies, such as DCell, BCube and FiConn. These topologies leverage the
programming capability of servers and conduct routing and forwarding by the servers
instead of switches. The network scales out through recursive expansion of lower-level
server interconnections. Compared to the tree-based topologies, server-centric
topologies have better scalability and more convenient routing and management mechanisms.
Also, there are alternative architectures proposed using optical switches to meet
the high bandwidth demands between different racks. Optical switches can form
a circuit-switching path and have very high transmission rates.
In our research, we apply merge networks to the switches in different topologies
and evaluate the total energy saving under various traffic patterns. For a
hierarchical tree-structured fat-tree topology, more traffic concentrates in the core layer
switches, according to Benson [20]. The link utilization of other layers is much
lower. Considering this variation, we can apply merge networks more aggressively
in the aggregation/edge layer. For example, we can connect a 2N × 2N merge
network to two N-interface switches.
We also propose using a switching fabric for the interconnection of servers.
Conventional networks use switches as intermediate nodes to forward packets. The
forwarding switches increase the network latency and consume a large amount of
energy. In our research, we replace switches with simple analog multiplexers,
which use minimal electricity. We use servers to make routing decisions and
establish a full path from source to destination before data transmission. As we
discussed, server-centric architectures also put routing intelligence on the server.
However, they need servers to work as relaying nodes during the transmission,
which incurs high latency. Our approach lets the source server set up a particular
path. The packets then traverse the path from source to destination without going
through intermediate servers. Optical switches also establish a circuit-switching
path. However, they have a long configuration time, which hinders their application
to inter-server data transmissions. Typically, optical switches are used for
inter-rack networking. Our proposed switching fabric is simple to configure and has low
transmission latency and energy cost.
The increasing concern about datacenter power consumption has attracted
many research efforts. Barroso [19] proposed the concept of energy proportionality,
which aims to keep the energy usage proportional to the offered workload.
There are different approaches to achieving energy proportionality. Some proposed
dynamically turning off idle servers [63]. Others focus on ensuring that the energy
use of network infrastructure is proportional to the network utilization, such as
turning off unused switches [51] or adapting the link rate to the workload [8]. More
aggressive approaches include consolidating traffic flows [81] or merging transmission
paths [10] to find a minimal energy-proportional topology. Our research on
merge networks is also focused on combining the traffic and dynamically turning off
the idle switches. However, the previous traffic-consolidating approach [81] requires
statistical analysis of the traffic flow correlations. The path-consolidating approach
[10] has to calculate the optimal overlapping path first. The merge network is
implemented from simple analog hardware, and it consolidates traffic automatically
with almost no software overhead. The proposed switching fabric is more energy
efficient still, since it replaces all switches with low-energy-cost analog multiplexers.
Chapter 3
MODELING ENERGY USAGE OF DATACENTER NETWORKS
Datacenter networks tend to consume about 10-20% of energy in normal usage
[51] but account for up to 50% of energy [38] during low loads, since at those times
servers can be put into low power states. It is therefore important to adapt the
network energy consumption to actual traffic loads as the servers do. To do this,
we need to develop a better understanding of network energy consumption under
varying types of loads with the eventual goal of designing more energy-efficient
datacenter networks.
We consider the fat-tree network, which has been a popular choice for many
commercial data centers due to its full bisection bandwidth (which minimizes
latency and boosts throughput). Unfortunately, the energy consumption of this or
any other network is very dependent on the type of traffic, the type of load and on
the selected routing algorithm. For instance, even at high loads if most of the
traffic is between servers located in the same pod (see Figure 3.1) then core switches
are never used resulting in significant energy savings. On the other hand, at light
loads if most of the traffic is between servers located in different pods, then savings
are small since more switches in the network will need to be utilized for routing.
Routing also plays an important part in the potential for energy savings. Thus,
routing algorithms that seek to minimize only latency will distribute flows over
unused paths when possible, ensuring that a majority of switches are kept busy
(albeit at very low loads). Alternatively, if paths can be consolidated into a few,
then there is the potential to save energy at the idle switches.
In this chapter, we provide a systematic analysis of the energy efficiency of a
fat-tree network using modeling and detailed simulations. The key question we ask is
how energy usage scales with total load as well as with different types of loading.
To answer this question, we build a detailed analytical model that gives the lower
bound on the fraction of active switches required for a given load and type of load.
We show that fat-trees have a minimal cost of about 40-50% (i.e., about half the
switches need to remain active at all times) but beyond that, the lower bound scales
almost linearly with total offered load. The lower bound computation is based on
aggregating traffic into few routes. We next conduct a detailed simulation of a
fat-tree network where we use different types of load (staggered and stride) [11]
and different amounts of total load. We compute routing tables empirically every
second for the next second and compute the fraction of needed active switches.
The simulation shows that the model we develop is accurate in predicting switch
activity and the simulation demonstrates that by modifying the routing algorithms
we can potentially save significant amounts of energy in real networks.
3.1 MODELING ENERGY CONSUMPTION
The structure of a fat-tree is shown in Figure 3.1. The tree is made up of 2k “pods”
which are connected to $k^2$ core switches. Within each pod we have k aggregation
switches and k edge switches. Each edge switch is in turn connected to k servers.
Therefore, each pod has $k^2$ servers and the DCN has a total of $2k^3$ servers. Each
core switch has one link to each of the 2k pods. The ith port of a core switch is
connected to an aggregation switch in pod i. The left-most k core switches are
connected to the leftmost aggregation switch of each of the 2k pods. The next set
of k core switches are connected to the second aggregation switch of each of the
pods, and so on.
Figure 3.1. Fat-tree network model.
Our goal here is to derive analytical expressions for the minimal energy
consumption of fat-trees for different types of loading. The metric we use for energy
consumption is the fraction of active switches. To model different types of loading we
use three parameters. A packet from a server goes to another server connected to
the same edge switch with probability $p_1$, it goes to a server in the same pod but
on another edge switch with probability $p_2$, and with probability $p_3 = 1 - p_1 - p_2$ it
goes to a server in a different pod. Thus $p_1$ of the traffic is never seen by either
the core or the aggregation switches while $p_2$ of the traffic is not seen by the core
switches. It is easy to see that by varying $p_1$ and $p_2$ we can model very different
types of traffic. Finally, we model external traffic (i.e., traffic going to/from the
Internet) as the fraction q (also see discussion in first part of section 3.1.2).
Let $\lambda$ denote the average internal load offered by each server expressed as a
fraction of link speed (which we normalize to 1). This load refers to packets that
will stay within the datacenter. Thus, the total offered load per server is $\lambda + q$. For
simplicity, we assume that $\lambda + q$ is the same for all the servers in the datacenter.
Thus, the total load in the datacenter is $2k^3(\lambda + q)$. We have the following equalities
for total traffic at the level of edge switches, pod aggregation switches and core
switches:
\begin{align*}
\text{traffic per edge switch} &= (\lambda + q)k \\
\text{traffic for all aggregation switches in a pod} &= ((1 - p_1)\lambda + q)k^2 \\
\text{traffic for all core switches} &= ((1 - p_1 - p_2)\lambda + q)k^2 \times 2k
\end{align*}
Note that traffic flow is symmetric and the numbers above correspond to both
traffic into and traffic out of a switch or switches.
We perform our analysis below in three stages:
• In the basic model we assume that q = 0 and thus all traffic is internal only.
This analysis gives us a good starting point for generalization to the other
two models.
• The extended model allows q > 0 but assumes that every core switch has a
connection to the Internet. This model is valid for small datacenters.
• In the asymmetric model we only allow a small subset of core switches to be
connected to the Internet and these switches are equipped with much higher
rate links. This model is representative of a large number of datacenters
today.
3.1.1 Basic Model
Assume that q = 0. The first observation we can make is that all the edge switches
need to remain active at all loads to ensure servers have network connectivity.
This gives us $2k^2$ active switches at this level. Within each pod we have total
traffic equal to $(1 - p_1)\lambda k^2$ going into/from the k aggregation switches from/to
the edge switches. Given each link has a normalized capacity of 1, and that there
are k interfaces per aggregation switch connected to the edge switches, we require
at least $\lceil (1 - p_1)\lambda k^2 / k \rceil$ active aggregation switches per pod. Observe that in the
fat-tree each edge switch is connected to each aggregation switch. Therefore, we can
force all traffic from the edge switches to go to the fewest number of aggregation
switches. This fact is represented in the expression for the total number of active
aggregation switches above. Since there are 2k pods, the total number of active
aggregation switches becomes $2k\lceil (1 - p_1)\lambda k \rceil$. Finally, since every core switch i is
connected to aggregation switch i from each of the 2k pods, the number of active
core switches we require is simply $\lceil (1 - p_1 - p_2)\lambda \times 2k^3 / 2k \rceil$, where we divide the
total traffic passing through the core switches by the number of links per switch and
round up. Therefore, the total number of active switches can be written as:
\[
\mathrm{Active}_{\mathrm{Basic}} = 2k^2 + 2k\lceil (1 - p_1)\lambda k \rceil + \left\lceil \frac{(1 - p_1 - p_2)\lambda \times 2k^3}{2k} \right\rceil \tag{3.1}
\]
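Equation 3.1 is straightforward to evaluate numerically. The sketch below is one way to do so; the function names and the small tolerance EPS (used so that floating-point noise does not push an exact integer up to the next one) are assumptions of this example. The denominator $5k^2$ is the total switch count implied by the topology above: $2k^2$ edge, $2k^2$ aggregation and $k^2$ core switches.

```python
import math

EPS = 1e-9  # tolerance against floating-point noise just above an integer


def active_basic(k: int, lam: float, p1: float, p2: float) -> int:
    """Number of active switches in the basic model (Equation 3.1, q = 0)."""
    edge = 2 * k * k                                         # always on
    aggr = 2 * k * math.ceil((1 - p1) * lam * k - EPS)       # 2k pods
    core = math.ceil((1 - p1 - p2) * lam * 2 * k**3 / (2 * k) - EPS)
    return edge + aggr + core


def fraction_active(k: int, lam: float, p1: float, p2: float) -> float:
    """Active switches as a fraction of all 5k^2 switches."""
    return active_basic(k, lam, p1, p2) / (5 * k * k)
```

For k = 6, traffic confined to the edge switches (p1 = 1) gives 72 active switches out of 180, a fraction of 0.4 at every load. Purely inter-pod traffic (p1 = p2 = 0) gives 126 switches (0.7) at a load of 0.5 and all 180 switches at a load of 1.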
Figure 3.2. Active switches for the basic model. [Plot: basic model for k = 6; x-axis: $\lambda$; y-axis: fraction of active switches; one curve for each of $(p_1, p_2, p_3)$ = (1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.25, 0.5, 0.25), (0.25, 0.25, 0.5), (0.0, 0.0, 1.0).]
Figure 3.2 plots the fraction of active switches as a function of load $\lambda$ for five
different scenarios when k = 6. The plot with the labels -o ($p_1 = 1$) corresponds to the
case when all traffic is between the servers connected to an edge switch. In other
words, no traffic needs to flow to the aggregation switches or to the core switches.
As expected, the graph stays flat. However, what is relevant here is that even at
light loads of 0.1 all the edge switches are fully active. At this load value, each
server generates traffic equal to 1/10th of the uplink capacity (similarly for downlink)
but the energy consumed is the same as when the link is fully loaded. The total
combined traffic from all the k = 6 servers is 0.6 which is less than the capacity
of a single link. It is clear that significant energy savings can be accomplished
here by redesigning the edge switches or the topology at the edge. We return to a
discussion of this point later.
The plot for $p_3 = 1$ corresponds to the extreme case when all the traffic
is destined for servers in a different pod. Hence the core and aggregation switches
will be utilized. Contrasting this with the case discussed above, we observe that
energy scales approximately linearly with load, if we discount the edge switches.
This is the desired behavior for energy-proportional networking.
3.1.2 Extended Model
Let us now extend the above discussion to the more realistic case when the
datacenter sees external traffic from the Internet. Let us assume that traffic coming
into the datacenter is $2k^3 q_{in}$, equally distributed among all the servers, and traffic
going out is $2k^3 q_{out}$, also equally generated by each server. It is easy to see that
$\lambda + q_{in} \le 1$ and $\lambda + q_{out} \le 1$ since the normalized capacity of the link connecting
each server to the edge switch is 1. Before proceeding with the derivations below,
note that a switch interface is typically bi-directional. Therefore, even if there is
no traffic in one direction, the entire interface is functioning and running link layer
protocols to maintain connectivity. Therefore, instead of considering $q_{in}$ and $q_{out}$
separately, we need to only consider the maximum of the two. Let,
\[
q = \max\{q_{in}, q_{out}\}
\]
In order to handle external traffic, let us assume that each of the core switches
is equipped with additional interfaces with a total normalized capacity of Q and
connected to a border switch or router. Therefore the network can handle a total
external load of $Qk^2$. Observe that $Q \le 2k$.
In order to compute the number of aggregation switches and core switches that
are active, it is convenient to begin at the core layer. The total external traffic is
$2k^3 q$, therefore the minimum number of core switches required to handle this traffic
is,
\[
n = \frac{2qk^3}{Q}
\]
It is possible that n is a fraction or is greater than the total number of core switches.
Therefore, we obtain,
\[
n^{ext}_{core} = \min\left\{k^2, \lceil n \rceil\right\}
\]
since each core switch can only handle Q external traffic. The reason we take a
minimum above is to account for the case when the external traffic exceeds the
total capacity of the network to handle it.
Each of the 2k interfaces of the core switches (facing towards the servers) has
a normalized capacity of 1. For the core switches serving external traffic, $Q/2k$ of
each link's capacity is used for external traffic coming/going from/to the connected
pods leaving a capacity of $(1 - Q/2k)$ for internal traffic between pods. The reason
for this is two-fold. First, each of the pods is assumed to be identical to the
other pods and generate an equal amount of external traffic. And second, in the
computation of the number of core switches needed to support the external traffic,
we assume that all the external traffic is put into as few core switches as possible
rather than spreading it out among all the core switches. This design is more energy
efficient since we can minimize the number of active switches.
We may require additional core switches to handle inter-pod internal traffic.
To compute this additional number of core switches we first determine how much
internal traffic can be carried by the $n^{ext}_{core}$ switches. The remaining internal traffic
can then be carried by free core switches. To compute the first value, note that of
the $n^{ext}_{core}$ switches, the first $n^{ext}_{core} - 1$ will be running their external links at full
capacity of Q each while the last switch may be running at lesser capacity. Thus,
the residual capacity of these active switches can be written as,
\[
(2k - Q)\lfloor n \rfloor + \left(2k - (2qk^3 - \lfloor n \rfloor Q)\right)
\]
Recall from our discussion of the basic model that the total internal traffic reaching
the core switches from the pods (i.e., traffic sent between pods) is $2(1 - p_1 - p_2)\lambda k^3$.
Therefore, the number of additional core switches we require to handle internal
traffic is,
\[
n^{addl}_{core} = \left\lceil \frac{2(1 - p_1 - p_2)\lambda k^3 - \left[(2k - Q)\lfloor n \rfloor + \left(2k - (2qk^3 - \lfloor n \rfloor Q)\right)\right]}{2k} \right\rceil \tag{3.2}
\]
We divide by 2k because that is the capacity of each free core switch (2k links
of capacity 1 each). Of course, it is possible that the total traffic will exceed the
capacity of the core switches. Therefore, we obtain the final value for the number
of required core switches as,
\[
n^{total}_{core} = \min\left\{k^2,\; n^{ext}_{core} + n^{addl}_{core}\right\}
\]
Finally, let us compute the number of aggregation switches required. In each
pod, each server sends $(1 - p_1)\lambda$ amount of internal traffic to the
aggregation layer. An additional $q$ amount of external traffic is also sent.
Therefore, the total amount of traffic within each pod that reaches its
aggregation layer is $((1 - p_1)\lambda + q)k^2$. Each aggregation switch has $k$
interfaces connected to edge switches. Therefore, the number of aggregation
switches required per pod is
\[ \left\lceil \frac{((1 - p_1)\lambda + q)k^2}{k} \right\rceil \]
yielding the total number of required aggregation switches in the network as
\[ n^{total}_{aggr} = 2k \min\left\{ k,\; \lceil ((1 - p_1)\lambda + q)k \rceil \right\} \]
Combining all the derived values, we obtain the total number of active switches
in the extended model as
\[ Active_{Extended} = 2k^2 + n^{total}_{aggr} + n^{total}_{core} \tag{3.3} \]
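The computation in Equations 3.2 and 3.3 can be sketched in Python. This is a minimal sketch rather than the code used to produce the figures. The function name is hypothetical, and it assumes $n = 2qk^3/Q$ and $n^{ext}_{core} = \min\{k^2, \lceil n \rceil\}$. It also makes two edge-case choices that the formulas leave implicit: the partially loaded switch contributes residual capacity only when $n$ has a fractional part, and the number of additional core switches is clamped at zero.

```python
import math

def extended_active_switches(k, Q, q, lam, p1, p2):
    """Hypothetical helper: returns (core, aggregation, total) active switches."""
    ext = 2 * q * k**3                      # total external traffic
    n = ext / Q                             # assumed definition of n
    full = min(k**2, math.floor(n))         # switches with saturated external links
    n_ext = min(k**2, math.ceil(n))
    residual = (2 * k - Q) * full
    if n_ext > full:                        # one partially loaded switch (assumption)
        residual += 2 * k - (ext - full * Q)
    internal = 2 * (1 - p1 - p2) * lam * k**3
    # Equation 3.2, clamped at zero (assumption)
    n_addl = max(0, math.ceil((internal - residual) / (2 * k)))
    n_core = min(k**2, n_ext + n_addl)
    n_aggr = 2 * k * min(k, math.ceil(((1 - p1) * lam + q) * k))
    return n_core, n_aggr, 2 * k**2 + n_aggr + n_core   # Equation 3.3

# No internal traffic, k = 6, q = 0.25, Q = 12
print(extended_active_switches(6, 12, 0.25, 0.0, 1.0, 0.0))  # (9, 24, 105)
```

With no internal load the sketch reports 9 active core switches, which agrees with the value quoted in the text for k = 6, q = 0.25 and Q = 12.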
In Figure 3.3 we plot the fraction of active switches as a function of
$\lambda + q$ for the case when q = 0.25, k = 6 and Q = 12. In this case, the
external capacity of each core switch is more than sufficient to handle all
external traffic. In the figure, point A denotes the case when there is no
internal traffic at all and all traffic is external. Here, all the edge switches
are active and, in addition, 9 core switches are required for external traffic.
Each of these 9 core switches has zero available capacity to handle internal
traffic. Therefore, as we start increasing the internal load $\lambda$, we
require additional core switches to become active whenever
$(1 - p_1 - p_2) > 0$. Observe that the case $p_1 = p_2 = 0.5$ only requires
additional aggregation switches after $\lambda$ exceeds 0.2. Prior to that we
can get away with using the aggregation switches that are already active for
external traffic.
Figure 3.3. Extended model with high external connectivity. (Plot: fraction of
active switches versus $\lambda + q$ for k = 6, q = 0.25 and Q = 12, with one
curve for each of (p1, p2, p3) = (1.0, 0.0, 0.0), (0.5, 0.5, 0.0),
(0.25, 0.5, 0.25), (0.25, 0.25, 0.5) and (0.0, 0.0, 1.0); point A is marked.)

It is instructive to contrast the above figure with Figure 3.4, where we now
have Q = 6 and q = 0.5. This represents a case where the external traffic
accounts for 50% of all traffic handled by the network and the number of
external interfaces is smaller. Point A again denotes the case when there is no
internal traffic. The -o data set corresponds to the case when all internal
traffic is confined to edge switches only. As we increase the traffic within the
pod but between edge switches (data shown by -+), we see an increase in the
number of aggregation switches used but no change in core switches. The
remaining three plots correspond to the case where we slowly increase the amount
of inter-pod internal traffic. This causes an increase in the number of core
switches required until the point where the internal inter-pod traffic plus the
external traffic exceeds the capacity of the network. To understand this
further, let us derive the expressions for traffic loss.
First note that there are no traffic losses in the basic model without external
traffic, since the fat-tree has full bisection bandwidth. In the extended model,
traffic losses will not occur within the pod if $\lambda + q < 1$. However,
traffic losses occur in the core if the external capacity Q is unable to handle
the external load.
Traffic losses at the core can potentially be divided into two types. The first
type corresponds to losses of external traffic and happens if
\[ Qk^2 < 2qk^3 \]
Figure 3.4. Extended model with reduced external connectivity and high external
load. (Plot: fraction of active switches versus $\lambda + q$ for k = 6, q = 0.5
and Q = 6, with one curve for each of (p1, p2, p3) = (1.0, 0.0, 0.0),
(0.5, 0.5, 0.0), (0.25, 0.5, 0.25), (0.25, 0.25, 0.5) and (0.0, 0.0, 1.0);
point A is marked.)
This yields a loss of external traffic of
\[ Loss_{ext} = \max\{0,\; 2qk^3 - Qk^2\} \]
The second source of losses is internal traffic, if the residual capacity of the
core switches (after handling external traffic) is insufficient for internal
traffic. Recall that the total internal traffic coming to the core layer is
$2(1 - p_1 - p_2)\lambda k^3$. Equation 3.2 gives us the number of additional
core switches required to handle this traffic. Therefore, the amount of internal
traffic loss is
\[ Loss_{int} = \max\Bigl\{0,\; 2(1 - p_1 - p_2)\lambda k^3 - \bigl((2k - Q)\lfloor n \rfloor + (2k - (2qk^3 - \lfloor n \rfloor Q))\bigr) - 2k\,(k^2 - n^{ext}_{core})\Bigr\} \]
Theorem 1: In the extended model, $Loss_{int} = 0$.
Proof sketch: (We have not included the formal proof here for space reasons.)
The intuition behind this result is relatively simple. Consider traffic going up
to the core first. Since $\lambda + q < 1$, there will be no losses seen by any
of the traffic either in the pods or at the inputs to the core switches. Traffic
heading out of the core to the Internet is limited by Q, and hence we may see
packet drops if $2kq > Q$. Consider now traffic coming into the network from the
Internet as well as inter-pod and intra-pod traffic. At the core layer, this
traffic will be $2(1 - p_1 - p_2)\lambda k^3 + 2k^3 q'$. The first term is the
inter-pod traffic and the second term is the amount of external traffic that was
not lost due to the limitation on Q. Clearly, $q' \le q$, and hence the total
traffic flowing into the servers is below the link capacity of one and there
will again be no losses. Therefore, we can write the total loss in the extended
model as
\[ Loss_{extended} = Loss_{ext} \]

Figure 3.5. Traffic loss corresponding to Figure 3.4. (Plot: traffic loss versus
$\lambda + q$ for k = 6, p1 = 0.25, p2 = 0.25, p3 = 0.5, with curves for
(q, Q) = (0.1, 6), (0.5, 6), (0.1, 4) and (0.5, 4); losses only occur for the
last case.)

Figure 3.5 plots the fraction of traffic loss (total traffic lost divided by the
number of servers) versus $\lambda + q$ for various cases. Note that there are
no losses when q = 0.1 for Q = 4, 6. The only case in which we see losses is
q = 0.5, Q = 4, since there is not sufficient external capacity.
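The four cases of Figure 3.5 can be checked numerically from the expression for $Loss_{ext}$. The following is a minimal sketch under the assumption that the plotted quantity is $Loss_{ext}$ divided by the $2k^3$ servers; the function name is hypothetical.

```python
def external_loss_per_server(k, Q, q):
    """Loss_ext = max{0, 2qk^3 - Qk^2}, normalised by the 2k^3 servers."""
    servers = 2 * k**3
    loss = max(0.0, 2 * q * k**3 - Q * k**2)
    return loss / servers

# The (q, Q) pairs listed in the legend of Figure 3.5, for k = 6
for q, Q in [(0.1, 6), (0.5, 6), (0.1, 4), (0.5, 4)]:
    print(q, Q, round(external_loss_per_server(6, Q, q), 3))
```

Only the last pair (q = 0.5, Q = 4) gives a non-zero value, about 0.167, which is consistent with the statement that losses occur only when the external capacity is insufficient.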
3.1.3 Asymmetric Model
The extended model above assumes that every core switch has a link to a border
gateway for connectivity to the Internet. This assumption may be reasonable for
smaller datacenter networks, but for larger ones the more likely scenario is one
where only a few of the core switches have external links. Let us assume that of
the $k^2$ core switches, $C$ (with $k \ge C \ge 1$) have external connectivity
via links of capacity $Q$. Assume further that these $C$ switches are connected
to the aggregation switches using links of capacity $l \ge 1$. All remaining
links in the network have a capacity of 1. Clearly, $Q \le 2kl$ and $l \le k$.
The latter inequality makes sense since an aggregation switch is connected to
$k$ edge switches with capacity-one links, and thus there is little point in
connecting it to a core switch by a link of capacity greater than $k$. Without
loss of generality, assume that the $C$ core switches are
$1, 1 + k, 1 + 2k, \cdots, 1 + (C - 1)k$. Thus aggregation switches
$1, \cdots, C$ in each pod are connected with a link of capacity $l$ to these
special core switches.
As before, assume that the total external traffic load is $2qk^3$ and the
traffic is uniformly distributed among all the servers. The total number of
externally connected core switches we need to be active is thus given by
\[ m = \frac{2qk^3}{Q} \]
Since $m$ may be greater than $C$ or have a fractional part, we obtain
\[ m^{ext}_{core} = \min\{C, \lceil m \rceil\} \]
If the external traffic exceeds the capacity of the network to handle it, then
we can compute the loss as
\[ Loss^{Asym}_{ext} = \max\{0,\; 2qk^3 - CQ\} \]
We proceed as in the previous section to compute the number of additional core
switches needed to support internal traffic. Recall that the externally
connected core switches may not be using all of their link capacity and thus
they can be used for routing internal traffic as well.
Each of the $2k$ interfaces of the active externally connected core switches has
a capacity of $l$. If $m^{ext}_{core} = \lceil m \rceil$, then
$\lfloor m \rfloor$ of these switches are using their full external capacity of
$Q$, leaving $(2kl - Q)\lfloor m \rfloor$ free capacity. For example, a switch
may have 2k = 12 one-gigabit links and one Q = 10 gigabit link connected
externally. Thus this switch has 2 gigabits of free capacity that can be used
for routing internal traffic.
One additional externally connected core switch will be using less capacity for
external traffic, leaving $(2kl - (2qk^3 - \lfloor m \rfloor Q))$ free capacity.
The total free capacity is thus
$f = (2kl - Q)\lfloor m \rfloor + (2kl - (2qk^3 - \lfloor m \rfloor Q))$.
If $C < \lceil m \rceil$, however, then all the $m^{ext}_{core}$ switches are
using their full external capacity $Q$, leaving only $f = (2kl - Q)C$ free
capacity. The total internal traffic that needs to be forwarded by core switches
is $2(1 - p_1 - p_2)\lambda k^3$. Therefore, the number of additional core
switches we need is
\[ m^{addl}_{core} = \begin{cases} 0, & \text{if } 2(1 - p_1 - p_2)\lambda k^3 \le f \\ \left\lceil \dfrac{2(1 - p_1 - p_2)\lambda k^3 - f}{2k} \right\rceil, & \text{otherwise} \end{cases} \]
We divide the second term above by $2k$ since that is the degree of the
additional core switches used. Since it is possible that the above number
exceeds the available number of free core switches, we can write the final
answer as
\[ m^{total}_{core} = \min\left\{ k^2,\; m^{ext}_{core} + m^{addl}_{core} \right\} \]
Let us next compute the number of aggregation switches required per pod.
Within each pod, the total external traffic is $qk^2$, and this is forwarded
to/from the externally connected core switches using $m^{ext}_{core}$
aggregation switches. This is the case because the way we perform the
minimization forces traffic from/to each pod to be identically routed. The total
internal traffic that needs to be handled by the aggregation switches is
$(1 - p_1)\lambda k^2$.
Consider the aggregation switches in a pod that are connected to the active
externally connected core switches. Say the high-capacity link (of capacity $l$)
carries traffic $a$. This traffic is evenly distributed over the $k$ capacity-1
links connecting this aggregation switch to edge switches. In other words, each
of the edge switches can send up to $(1 - a/k)$ internal traffic to this
aggregation switch. In all, the $k$ connected edge switches can send $(k - a)$
total internal traffic to this aggregation switch. Consider next the $k$ links
from this switch connected to the core switches. One of the links has capacity
$l$ while the others have capacity 1 each. Thus, the total available capacity of
these links is $(l - a) + (k - 1) = (k - a) + (l - 1)$. Since
$k - a \le (k - a) + (l - 1)$, the total internal capacity that can be handled
by this aggregation switch is $k - a$.
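In a pod, if $m^{ext}_{core} = \lceil m \rceil$, then there are
$\lfloor m \rfloor$ aggregation switches where $a = Q/2k$ (corresponding to the
$\lfloor m \rfloor$ core switches that run their external links at full
capacity) and at most one switch ($\lceil m \rceil - \lfloor m \rfloor$) where
$a = (2qk^3 - Q\lfloor m \rfloor)/2k$. These $\lceil m \rceil$ aggregation
switches handle internal traffic equal to
\[ t_{aggr} = \lfloor m \rfloor \left(k - \frac{Q}{2k}\right) + (\lceil m \rceil - \lfloor m \rfloor)\left(k - \frac{2qk^3 - Q\lfloor m \rfloor}{2k}\right) \]
leaving $(1 - p_1)\lambda k^2 - t_{aggr}$ to be handled by other aggregation
switches. If, however, $C < \lceil m \rceil$, then $t_{aggr} = (k - Q/2k)C$. In
all, in the entire network, the total number of aggregation switches needed is
then given by
\[ m^{total}_{aggr} = 2k \min\left\{ k,\; m^{ext}_{core} + \left\lceil \frac{(1 - p_1)\lambda k^2 - t_{aggr}}{k} \right\rceil \right\} \]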
Putting all these values together, we have
\[ Active_{Asymmetric} = 2k^2 + m^{total}_{aggr} + m^{total}_{core} \tag{3.4} \]
Recall that $C \le k$ in the asymmetric model; therefore, $q \le QC/2k^3$. Let
us assume that C = k = 6. In Figure 3.6 we plot the fraction of active switches
versus $\lambda + q$ when q = 0.25, Q = 12, C = 6, k = 6, l = 1. We observe that
even when $\lambda = 0$ we are using over 80% of the switches. The reason is
that when $m^{ext}_{core}$ switches are active in the core, an equal number of
aggregation switches are forced to be active in each pod, since they have a
high-capacity link to these externally connected core switches. In this
particular example, therefore, all C = 6 externally connected core switches are
active. No additional switches are needed since there is no internal traffic. In
each of the 2k = 12 pods, six aggregation switches are also active, and all the
edge switches are active. Therefore we get 6 + 72 + 72 = 150 active switches, or
83%. The switches are essentially idle, but due to the external traffic and the
topology of the network we are paying this high cost.
Figure 3.6. Asymmetric model with high external traffic. (Plot: fraction of
active switches versus $\lambda + q$ for k = 6, q = 0.25, Q = 12, C = 6, l = 1,
with one curve for each of (p1, p2, p3) = (1.0, 0.0, 0.0), (0.5, 0.5, 0.0),
(0.25, 0.5, 0.25), (0.25, 0.25, 0.5) and (0.0, 0.0, 1.0).)
Let us consider a more realistic case where the external links are Q = 40 Gbps
and l = 10 Gbps for the C = 6 externally connected core switches. Figure 3.7
plots the fraction of active switches, and we note that in this case only 3 of
the C = 6 core switches are used, resulting in an overall reduction in the
number of aggregation switches needed. Also, since 2kl = 120 Gbps for each of
these C switches, there is 120 - 40 = 80 Gbps of excess capacity that can be
used to forward internal traffic. Unfortunately, as in the previous example,
each pod has at least $m^{ext}_{core}$ active aggregation switches, and thus
even when $\lambda = 0$ we use 62% of the switches.

Figure 3.7. A more realistic asymmetric model with high capacity external links.
(Plot: fraction of active switches versus $\lambda + q$ for k = 6, q = 0.25,
Q = 40, C = 6, l = 10, with one curve for each of (p1, p2, p3) =
(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.25, 0.5, 0.25), (0.25, 0.25, 0.5) and
(0.0, 0.0, 1.0).)
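The two worked examples for the asymmetric model (150 active switches for Q = 12, l = 1, and about 62% for Q = 40, l = 10) can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not the author's code: the function name is hypothetical, the total is taken as edge plus aggregation plus core switches, and the ceiling terms for additional core and aggregation switches are clamped at zero so that spare capacity never reduces the count.

```python
import math

def asymmetric_active_switches(k, Q, C, l, q, lam, p1, p2):
    """Hypothetical helper: returns (core, aggregation, total) active switches."""
    ext = 2 * q * k**3                       # total external traffic
    m = ext / Q
    full = math.floor(m)
    inter_pod = 2 * (1 - p1 - p2) * lam * k**3
    if math.ceil(m) <= C:                    # enough externally connected switches
        m_ext = math.ceil(m)
        partial = ext - full * Q             # load on the partly used external link
        f = (2 * k * l - Q) * full
        t_aggr = full * (k - Q / (2 * k))
        if m_ext > full:
            f += 2 * k * l - partial
            t_aggr += k - partial / (2 * k)
    else:                                    # all C external links saturated
        m_ext = C
        f = (2 * k * l - Q) * C
        t_aggr = (k - Q / (2 * k)) * C
    m_addl = max(0, math.ceil((inter_pod - f) / (2 * k)))        # clamp: assumption
    m_core = min(k**2, m_ext + m_addl)
    extra = max(0, math.ceil(((1 - p1) * lam * k**2 - t_aggr) / k))  # clamp: assumption
    m_aggr = 2 * k * min(k, m_ext + extra)
    return m_core, m_aggr, 2 * k**2 + m_aggr + m_core

print(asymmetric_active_switches(6, 12, 6, 1, 0.25, 0.0, 1.0, 0.0))   # (6, 72, 150)
print(asymmetric_active_switches(6, 40, 6, 10, 0.25, 0.0, 1.0, 0.0))  # (3, 36, 111)
```

With 5k^2 = 180 switches in total, 150 and 111 active switches correspond to 83% and roughly 62%, matching the values quoted in the text.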
Finally, let us consider traffic loss for the asymmetric model. As in the case
of the extended model, we can state that the only traffic loss will occur at the
core switches when $\lambda + q < 1$. Thus,
\[ Loss_{Asym} = Loss^{Asym}_{ext} \]
Figure 3.8 plots the traffic lost (total traffic lost divided by the number of
servers) for the case corresponding to Figure 3.6. Regardless of internal
traffic patterns, the traffic loss is the same. This makes sense since only
external traffic (q = 0.25) will be lost due to insufficient external bandwidth.
Figure 3.8. Traffic loss for the model in Figure 3.6. (Plot: traffic loss versus
$\lambda + q$ for k = 6, q = 0.25, Q = 12, C = 6, l = 1, with one curve for each
of (p1, p2, p3) = (1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.25, 0.5, 0.25),
(0.25, 0.25, 0.5) and (0.0, 0.0, 1.0).)
3.1.4 Summary
The derivations above point to a few significant areas where energy is being
squandered. The first is the cost of running edge switches all the time, even
when there is minimal load. This contributes a large constant to the overall
energy cost. Second, external traffic to the datacenter can cost a lot of
energy, as shown in the asymmetric case. In the examples described, we end up
using more aggregation switches than necessary due to the topology.
In order to reduce energy consumption and make it linear for low loads, we need
to modify the edge topology of these networks without sacrificing the full
throughput needed to support high loads. To deal with the challenge of
supporting external traffic while not using unnecessary aggregation switches, we
need to consider minimizing C. For example, by making C = 1 but boosting its
link speeds dramatically, we can ensure that only one aggregation switch per pod
needs to be active for a wider range of loads. This fact is illustrated in
Figure 3.9, where C = 1 but the external capacity is 120 Gbps, with each of its
12 links running at 10 Gbps. This one core switch now carries a significant
amount of internal traffic as well, and thus even at $\lambda + q = 1$ we see
less than 100% active switches.

Figure 3.9. Using only one externally connected switch. (Plot: fraction of
active switches versus $\lambda + q$ for k = 6, q = 0.5, Q = 120, C = 1, l = 10,
with one curve for each of (p1, p2, p3) = (1.0, 0.0, 0.0), (0.5, 0.5, 0.0),
(0.25, 0.5, 0.25), (0.25, 0.25, 0.5) and (0.0, 0.0, 1.0).)
3.2 SIMULATIONS
We built a simulator for a fat-tree network with k = 6 and 1 Gbps link capacity.
We also use C = 1 and designate the leftmost core switch as the externally
connected core switch. In the simulator, we read trace files generated
externally and forward packets based on routing tables computed every second of
simulated time. The routing algorithm is a modified version of Dijkstra's
algorithm in which we force flows to use routes that are already in use, thus
packing flows together. In the algorithm we assign weights to edges as well as
nodes. Edge weights are constant, but node weights can be 0 or 1. If a node has
been used for forwarding a flow, its weight changes from 1 to 0. Thus, flows are
encouraged to reuse the same subset of nodes (or switches). Of course, if adding
a new flow over a link would exceed the link's capacity, we eliminate that link
from further consideration in that round of routing computation.
We used the traffic models developed in [11] to analyze our algorithm. The two
models that are most relevant for datacenters are the stride and the staggered
models. Imagine numbering the $2k^3$ servers consecutively starting from 1. In
stride(s), packets are sent from server $i$ to server $(i + s) \bmod 2k^3$.
Thus, stride(1) tends to mainly send traffic between servers connected to the
same edge switch (high $p_1$ in our analytical model), while stride(6) generates
mainly traffic within a pod but between edge switches (high $p_2$), and
stride(36) is mainly inter-pod traffic. Staggered traffic is very similar to the
traffic model we used for our analysis and is specified using the same
probabilities $p_1$ and $p_2$. The seven different traffic models we used are as
follows:
1. stride(1), stride(6), stride(36), stride(216)
2. staggered(1): p1 = 1.0, p2 = 0.0
   staggered(2): p1 = 0.5, p2 = 0.3
   staggered(3): p1 = 0.2, p2 = 0.3
We assume that external traffic q is 10% in all cases.
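The way a stride value maps onto the locality classes of the analytical model can be illustrated with a short Python sketch. It is a simplified count, not the estimation procedure used for the figures: it assumes zero-based consecutive numbering in which each edge switch hosts k servers and each pod hosts k^2 servers, and the function name is hypothetical. The fractions it returns therefore need not coincide with the values of p1 and p2 estimated from the simulations.

```python
from fractions import Fraction

def stride_mix(k, s):
    """Fractions of stride(s) flows that stay on one edge switch (p1),
    stay in one pod but cross edge switches (p2), or leave the pod (p3)."""
    servers = 2 * k**3
    same_edge = same_pod = 0
    for i in range(servers):
        j = (i + s) % servers               # destination of server i
        if i // k == j // k:                # same edge switch (assumed layout)
            same_edge += 1
        elif i // k**2 == j // k**2:        # same pod, different edge switch
            same_pod += 1
    p1 = Fraction(same_edge, servers)
    p2 = Fraction(same_pod, servers)
    return p1, p2, 1 - p1 - p2

for s in (1, 6, 36, 216):
    print(s, stride_mix(6, s))
```

Under these assumptions stride(1) is dominated by p1, stride(6) by p2, and stride(36) and stride(216) are entirely inter-pod, in line with the qualitative description above.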
In Figure 3.10 we plot the fraction of active switches versus total load using
simulation for the staggered data, and in Figure 3.11 we plot the same metric
using the asymmetric model from the previous section (but using the staggered
packet trace parameters). It is easy to see that our model is a very close match
to the simulations. The implication is that the lower bound on energy efficiency
can be achieved in practice by utilizing the simple routing algorithm described
above.
Figure 3.10. Fraction of active switches for the staggered model. (Plot:
simulation for k = 6, q = 0.1, Q = 12, C = 1, l = 1; fraction of active switches
versus traffic load $\lambda + q$, with curves for (p1, p2) = (1.0, 0.0),
(0.5, 0.3) and (0.2, 0.3).)

Figure 3.11. Fraction of active switches for the analytical model (staggered
cases). (Plot: asymmetric model for k = 6, q = 0.1, Q = 12, C = 1, l = 1;
fraction of active switches versus $\lambda + q$, with curves for
(p1, p2, p3) = (1.0, 0.0, 0.0), (0.5, 0.3, 0.2) and (0.2, 0.3, 0.5).)

Figure 3.12 plots the fraction of active switches versus load for the stride
case using simulation. Figure 3.13 plots the same data using our asymmetric
analytical model. Again note the closeness of the two models. The 5% error
between the simulations and the analysis is due to the fact that we estimated
the values of $p_1$ and $p_2$ from the simulations and then used them in the
analysis. The estimated values for these probabilities are listed in the legend
of Figure 3.13. One noteworthy feature of the stride model is that there is no
difference between stride(36) and stride(216). This makes sense because in both
cases packets are inter-pod. The difference between stride(6) and stride(36) is
that most packets in stride(6) remain within one pod and thus only one core
switch is used.
When we examine Figures 3.10 to 3.13, we observe that the type of loading has a
significant impact on energy consumption. While stride(1) and staggered(1) may
be impractical for many distributed applications, we see that stride(6) and
staggered(2) are better choices than stride(36) and staggered(3). This means
that when it comes to allocating tasks to servers, the task manager should be
mindful of the type of traffic that will be generated, since we can obtain
significant energy savings by careful scheduling.
3.3 SUMMARY
In this chapter, we provide a systematic analysis of the energy efficiency of a
fat-tree network topology using modeling and detailed simulations. The key
question we ask is how energy usage scales with total load as well as with
different types of loading. To answer this question, we build a detailed
analytical model that gives the lower bound on the fraction of active switches
required for a given load and type of load. We show that fat-trees have a
minimal cost of about 40-50% (i.e., about half the switches need to remain
active at all times) but beyond that, the lower bound scales almost linearly
with total offered load. The lower bound computation is based on aggregating
traffic into few routes. We conducted a detailed simulation of a fat-tree
network where we used different types of load (staggered and stride) [11] and
different amounts of total load. We compute routing tables empirically every
second for the next second and compute the fraction of needed active switches.
The simulation shows that the model we develop is accurate in predicting switch
activity, and it demonstrates that by modifying the routing algorithms we can
potentially save significant amounts of energy in real networks. By developing
analytical models for energy consumption, datacenter researchers are able to
study fat-tree DCNs theoretically. A practical application of our work would be
to jointly optimize task scheduling and flow assignment so as to maximize the
traffic consolidation for given job loads.
Figure 3.12. Fraction of active switches for the stride model. (Plot: simulation
for k = 6, q = 0.1, Q = 12, C = 1, l = 1; fraction of active switches versus
traffic load $\lambda + q$, with curves for (p1, p2) = (0.75, 0.125),
(0.0, 0.75) and (0.0, 0.0).)

Figure 3.13. Fraction of active switches for the analytical model (stride
cases). (Plot: asymmetric model for k = 6, q = 0.1, Q = 12, C = 1, l = 1;
fraction of active switches versus $\lambda + q$, with curves for stride(1)
p1 = 0.75, p2 = 0.125; stride(6) p1 = 0.0, p2 = 0.75; stride(36) p1 = 0.0,
p2 = 0.0.)
Chapter 4
ANALYTICAL OPTIMIZATION MODEL
To compute the minimal power required by a datacenter network, we compute the
minimum subset of network elements that must remain active. For a given traffic
load, we need to find optimal route assignments of traffic flows that involve
the minimum number of switches and links. This optimization problem is in
general a mixed-integer programming (MIP) problem and can be formulated as a
capacitated minimum-cost multi-commodity network flow (MCMCF) problem.
In this chapter, we examine the power optimization model with the goal of
minimizing the energy consumption of datacenter networks. We implement the power
model using commercial optimization software. For a more scalable
implementation, we propose a heuristic algorithm that finds a near-optimal
subset of network switches and links that satisfies a given traffic load and
consumes minimal power. We demonstrate that this simple routing algorithm can
approximate the optimization model very closely and that it can be applied to
large-scale datacenter networks to obtain a near-optimal subset of the network
with minimum overhead.
4.1 MINIMIZING ENERGY CONSUMPTION
For a datacenter network, we formulate a power model for all network elements,
including switches and links. A network $G(V, E)$ is given, where $V$ is the set
of nodes in the network and $E$ is the set of links. We consider both the end
hosts and the switches as network nodes, and thus we have $V = V_1 \cup V_2$,
where $V_1$ is the set of end hosts and $V_2$ is the set of switches. Link
$(u, v) \in E$ connects node $u$ and node $v$ ($u, v \in V$). Assuming each
switch consumes power $P_s$ and each link consumes power $P_l$, the total power
consumed by the entire network can be expressed as
\[ P_{total} = \frac{1}{2} \sum_{u \in V_2} k_u \times P_l + n \times P_s + \frac{\epsilon}{2} \times \sum_{u \in V,\, w \in V_u} f_{u,w} \tag{4.1} \]
where $n$ is the number of active switches and $k_u$ is the number of active
interfaces of switch $u$. $V_u$ is the set of nodes connecting to node $u$.
$\epsilon$ is the dynamic energy consumption factor representing the power
consumption per unit of data transmitted through a link. $f_{u,v}$ is the amount
of traffic flow assigned to link $(u, v)$. We use binary variables $y_u$ and
$x_{u,v}$ to represent the power state of node $u$ and link $(u, v)$,
respectively. For instance, if $x_{u,v} = 1$, link $(u, v)$ is active; if it is
0, link $(u, v)$ is idle and can be powered off. Therefore, $k_u$ and $n$ can be
written as
\[ n = \sum_{u \in V_2} y_u \tag{4.2} \]
\[ \forall u \in V_2, \quad k_u = \sum_{w \in V_u} x_{u,w} \tag{4.3} \]
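Equations 4.1 to 4.3 can be evaluated directly for a candidate set of power states. The following Python sketch does so for a toy network; the function name, the dictionary layout and the numeric values of $P_s$, $P_l$ and $\epsilon$ are illustrative assumptions, not values from this work.

```python
def total_power(x, y, f, P_s, P_l, eps):
    """Evaluate Equation 4.1.

    x: {(u, w): 0 or 1} power state of each directed link
    y: {u: 0 or 1} power state of each switch (the set V_2)
    f: {(u, w): flow} traffic assigned to each directed link
    """
    n = sum(y.values())                                   # Equation 4.2
    # Sum of k_u over switches u, with k_u as in Equation 4.3
    k_sum = sum(state for (u, _w), state in x.items() if u in y)
    dynamic = sum(f.values())
    return 0.5 * k_sum * P_l + n * P_s + 0.5 * eps * dynamic

# Toy example: hosts h1 and h2 attached to switches a and b, which are linked.
x = {("a", "b"): 1, ("b", "a"): 1, ("a", "h1"): 1, ("h1", "a"): 1,
     ("b", "h2"): 1, ("h2", "b"): 1}
y = {"a": 1, "b": 1}
f = {("h1", "a"): 0.5, ("a", "b"): 0.5, ("b", "h2"): 0.5}
print(total_power(x, y, f, P_s=10, P_l=1, eps=2))  # 23.5
```

In this toy case each switch has two active interfaces, so the link term is 0.5 * 4 * 1 = 2, the switch term is 20, and the dynamic term is 0.5 * 2 * 1.5 = 1.5.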
4.1.1 Optimization Model
Based on the power model defined above, we define an optimization problem in
order to find the optimal flow assignment that involves a minimum subset of
active network elements, $(n, k_u)$, with the minimal total power consumption
$P_{total}$ for a given network topology and traffic load. This optimization
problem is an extension of the capacitated minimum-cost multi-commodity flow
(MCMCF) problem. A classical MCMCF problem is subject to three constraints: the
capacity constraint (4.4), the flow conservation constraint (4.5) and the demand
satisfaction constraint (4.6), written as follows:
\[ \forall (u, v) \in E, \quad f_{u,v} \le c\, x_{u,v} \tag{4.4} \]
\[ \forall u,\; u \notin S \text{ and } u \notin D, \quad \sum_{w \in V_u} f_{u,w} - \sum_{w \in V_u} f_{w,u} = 0 \tag{4.5} \]
\[ \begin{cases} \forall s \in S, & \sum_{w \in V_s} g^i_{s,w} - \sum_{w \in V_s} g^i_{w,s} = t^i_{s,d} \\ \forall d \in D, & \sum_{w \in V_d} g^i_{w,d} - \sum_{w \in V_d} g^i_{d,w} = t^i_{s,d} \end{cases} \tag{4.6} \]
where $c$ is the capacity of each link, $S$ is the set of source nodes and $D$
is the set of destination nodes. $V_s$ and $V_d$ are the sets of switches that
connect to source node $s$ and sink node $d$, respectively. $f_{u,w}$ is the
total flow assigned to link $(u, w)$, with $f_{u,w} = \sum_i g^i_{u,w}$, where
$g^i_{u,v}$ represents the flow of the $i$th traffic demand $t^i_{s,d}$ routed
through link $(u, v)$.
Capacity constraint (4.4) takes account of maximum link utilization and ensures
that the total traffic flow assigned to a link does not surpass the link
capacity. The capacity constraint also forces flows to go through active links
only. For example, an inactive link $(u, v)$ has $x_{u,v} = 0$, which forces
$f_{u,v} = 0$, meaning no traffic flow is assigned to this link. Flow
conservation (4.5) ensures that the traffic entering an intermediate node equals
the traffic exiting from it. Demand satisfaction (4.6) states that the overall
traffic departing a source node or entering a destination node equals the
traffic demand.
Besides these three constraints, the bidirectional link rule ensures that both
directions of a link are powered on if there is a flow assigned to either
direction of the link. The bidirectional link constraint is expressed as
\[ \forall (u, v) \in E, \quad x_{u,v} = x_{v,u} \tag{4.7} \]
Additionally, we include constraints that correlate the power states of switches
and links. For each node $u$ and the connected links $(u, w)$ and $(w, u)$, we
have
\[ \forall u \in V,\; \forall w \in V_u, \quad x_{u,w} \le y_u \text{ and } x_{w,u} \le y_u \tag{4.8} \]
\[ \forall u \in V, \quad y_u \le \sum_{w \in V_u} (x_{u,w} + x_{w,u}) \tag{4.9} \]
Constraint (4.8) makes sure that a switch is powered off only when all its
connected links are powered off, and constraint (4.9) ensures that a switch is
powered off when all its connected links are powered off. Optionally, we can
include a non-splitting constraint as follows to prevent flow splitting:
\[ \forall i,\; \forall (u, v) \in E, \quad g^i_{u,v} = t^i \times r^i_{u,v} \tag{4.10} \]
where $r^i_{u,v}$ is a binary decision variable that indicates whether the
traffic demand $t^i$ is assigned to link $(u, v)$. Constraint (4.10) ensures
that $g^i_{u,v}$, the flow assignment to link $(u, v)$, is either equal to the
$i$th traffic demand $t^i$ or equal to zero.
Furthermore, we define heuristic constraints to reduce the problem size. For
example, since a k-ary fat-tree network has $5k^2/4$ switches and each switch
has at most $k$ active links, we explicitly apply an upper bound and a lower
bound to $k_u$ and $n$ as $0 \le k_u \le k$ and $0 \le n \le \frac{5}{4}k^2$,
which can greatly improve convergence time for the problem.
We implement the power optimization model using CPLEX, which is an optimization
solver for integer programming problems. For a given traffic matrix, the
optimization model outputs the numbers of active switches and links, and the
flow assignment to each link corresponding to every traffic flow demand. Our
model is implemented with both flow-splitting and non-flow-splitting options.
4.2 GREEDY FLOW ASSIGNMENT
Through the formal power optimization model, we can find the optimal flow
assignment for a given network topology and traffic loading. However, since
mixed-integer programming is known to be strongly NP-hard, an MCMCF problem for
a large datacenter network cannot be solved within a reasonable time frame. To
address this problem, we propose a heuristic greedy algorithm to find a
near-optimal flow assignment.
4.2.1 Heuristic Algorithm
Our greedy flow assignment algorithm is based on Dijkstra's algorithm for the
shortest path problem. For each traffic flow, the algorithm finds the
lowest-cost route with sufficient bandwidth between the source node and the
destination node. The cost function is defined as the sum of the costs of the
switches and links along the route. By carefully defining the cost value of each
node and each link, our greedy algorithm finds the lowest-cost route for each
flow incrementally, and ultimately obtains a near-optimal routing for all the
flows. The greedy algorithm is described in Algorithm 1.
Each link and each node has a fixed capacity. We only assign a flow to a link
when there is available bandwidth at that link and also at the source and
destination nodes. Once a flow is assigned, the corresponding amount of traffic
demand is subtracted from the bandwidth of the link and of the nodes on both
ends. Link cost cost(u, v) is defined as a constant value of 2 for all links,
while node cost cost(v) is initialized to 1. Each link is counted in the cost of
the route, so we are ensured to find the shortest route. Once node v has been
used in a route, cost(v) is updated to 0. This makes sure that a switch that has
been used in a previous route will have higher priority to be reused. As a
result, we can reduce the overall number of active switches. We use a higher
link cost than node cost in order to avoid detour routes between switches.
The greedy algorithm is not optimal, but we verified that the results produced
by the algorithm are very close to those from the CPLEX optimization solver in
Section 3 for all traffic types and loads. However, the optimization model can only
scale to k = 4 fat-tree networks. In the next part of this paper, we use this greedy
algorithm to simulate larger scale fat-tree networks.

Algorithm 1 Flow Assignment algorithm
function flowAssign(source, sink, demand)
    for each vertex v in Graph do
        dist[v] ← Infinity
    dist[source] ← 0
    insert (source, dist[source]) to Q
    while Q is not empty do
        u ← first pair in Q
        if u == sink then
            break
        for each neighbor v of u do
            if (capacity(u, v) ≠ 0) and (capacity(u) ≠ 0) then
                alt ← dist[u] + cost(v) + cost(u, v)
            else
                alt ← Infinity
            if alt < dist[v] then
                dist[v] ← alt
                previous[v] ← u
                update (v, dist[v]) in Q
    for v = sink; v ≠ −1; v = previous[v] do
        insert v to route
    return route
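A minimal Python sketch of Algorithm 1 may make the cost rules concrete. It is an illustration under stated assumptions, not the simulator used in this work: the names `flow_assign`, `link_cap`, `node_cap` and `node_cost` are hypothetical, links are assumed to be stored in both directions, and the capacity test is taken to mean "at least `demand` left" rather than the "nonzero" test of the pseudocode.

```python
import heapq
import math

LINK_COST = 2  # constant link cost from the text; node cost starts at 1


def flow_assign(graph, link_cap, node_cap, node_cost, source, sink, demand):
    """Route one flow on the lowest-cost path; return the route or None.

    graph:     dict node -> list of neighbor nodes
    link_cap:  dict (u, v) -> remaining bandwidth (both directions present)
    node_cap:  dict node -> remaining bandwidth
    node_cost: dict node -> 1 if never used, 0 once used by some route
    """
    if node_cap[source] < demand:
        return None
    dist = {v: math.inf for v in graph}
    previous = {}
    dist[source] = 0
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue  # stale queue entry
        if u == sink:
            break
        for v in graph[u]:
            # Skip links or nodes that cannot carry this flow.
            if link_cap[(u, v)] < demand or node_cap[v] < demand:
                continue
            alt = d + node_cost[v] + LINK_COST
            if alt < dist[v]:
                dist[v] = alt
                previous[v] = u
                heapq.heappush(queue, (alt, v))
    if dist[sink] == math.inf:
        return None
    route = [sink]
    while route[-1] != source:
        route.append(previous[route[-1]])
    route.reverse()
    # Commit the flow: consume bandwidth and make used nodes free to reuse.
    for u, v in zip(route, route[1:]):
        link_cap[(u, v)] -= demand
        link_cap[(v, u)] -= demand
    for v in route:
        node_cap[v] -= demand
        node_cost[v] = 0
    return route
```

Calling `flow_assign` once per flow, in order, reproduces the incremental behavior described above: a second flow between the same hosts prefers the switches already in use, and moves to an unused switch only when capacity runs out.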
4.2.2 Validation of Greedy Algorithm
The greedy algorithm is not optimal but, as we show below, the routes produced by
the algorithm are very close to those produced by solving the optimization
formulation using the CPLEX optimization solver in section 4.1. We use a fat-tree
topology with k = 4 and generate a number of packet traces following certain
datacenter network traffic patterns [20]. The packet traces in each one-second
interval are organized as a traffic matrix, which is fed into the CPLEX optimization
model and the simulated greedy algorithm. We obtain the number of active switches
and active interfaces for the eight traffic patterns and seven traffic loads shown in
Table 4.1.
Table 4.1. Number of active switches and active interfaces from optimization model (opt) vs.
simulation with greedy algorithm (greedy).

        Random                              Staggered(1)
        active switches  active interfaces  active switches  active interfaces
Load    opt    greedy    opt    greedy      opt    greedy    opt    greedy
10%     13     13        40     40          8      8         16     16
20%     13     13        40     40          8      8         16     16
30%     13     14        40     44          8      8         16     16
40%     14     14        48     48          8      8         16     16
50%     14     14        48     48          8      8         16     16
60%     18     19        64     72          8      8         16     16
70%     19     19        72     72          8      8         16     16

        Staggered(2)                        Staggered(3)
        active switches  active interfaces  active switches  active interfaces
Load    opt    greedy    opt    greedy      opt    greedy    opt    greedy
10%     13     13        40     40          13     13        40     40
20%     13     13        40     40          13     13        40     40
30%     13     13        40     40          13     13        40     40
40%     13     13        40     40          13     13        40     40
50%     13     13        40     40          14     14        48     47.2
60%     13     13        40     40          14     14        48     53.4
70%     13     13        40     40          18     19        64     72

        Stride(1)                           Stride(2)
        active switches  active interfaces  active switches  active interfaces
Load    opt    greedy    opt    greedy      opt    greedy    opt    greedy
10%     13     13        40     40          13     13        40     40
20%     13     13        40     40          13     13        40     40
30%     13     13        40     40          13     13        40     40
40%     13     13        40     40          13     13        40     40
50%     13     13        40     40          17     17        58     56.2
60%     13     13        40     40          18     18        64     64
70%     13     13        40     40          19     18        66     64

        Stride(4)                           Stride(8)
        active switches  active interfaces  active switches  active interfaces
Load    opt    greedy    opt    greedy      opt    greedy    opt    greedy
10%     13     13        40     40          13     13        40     40
20%     13     13        40     40          13     13        40     40
30%     14     14        48     48          14     14        48     48
40%     14     14        48     48          14     14        48     48
50%     17     17        60     60.8        17     17        60     63.6
60%     19     19        72     72          19     20        72     75.2
70%     19     19        72     72          19     20        72     75.6
The results we get from the simulated greedy algorithm are very close to those
we get from the CPLEX optimization model, especially for the lighter loads. Since
the optimization model can only scale to a fat-tree datacenter network with k = 6,
we use the greedy algorithm to simulate the optimization of a large-scale fat-tree
network in the following chapters.
4.3 SUMMARY
Inspired by the earlier work of Gupta et al. [48], many researchers propose
energy-proportional datacenter network topologies that use topology-aware
heuristics to find an optimal subset and power off idle interfaces or devices. For
example, ElasticTree [51] leverages the regularity of hierarchical datacenter
networks and uses left-most heuristics to find the smallest topology. CARPO [81]
adapts the topology dynamically by consolidating flows that are negatively
correlated over time onto a smaller set of links and shutting off unused ones.
Instead, we propose a universal greedy flow assignment algorithm to select the
active network subset. Our algorithm can find near-optimal flow assignments
comparable to solutions obtained from an MIP model solver, not just for hierarchical
network topologies but also for random or irregular datacenter network topologies.
In addition, our approach is shown to conserve energy based on the real-time
traffic load.
Chapter 5
USAGE-BASED DATACENTER NETWORK TOPOLOGY
In this chapter, we construct a datacenter network (DCN) topology that supports
the expected loading for different application domains but incurs a lower energy
cost. Specifically,
1. We begin with a fat-tree network and examine the sub-graphs of these
networks that are used for loadings as high as 70% for different types of
applications (educational, cloud, and private datacenters) when using
left-most routing as in [51]. The results indicate that we can indeed reduce the
number of switches by 50% at the aggregation and core layers of the network
without incurring any loss or increased latency.
2. Next we consider the possibility of moving flows to fewer servers, particularly
for low loads. This approach is interesting since it can inform job schedulers
about how and where to place jobs in order to minimize network energy cost.
Consolidating flows further reduces the needed switches in the network by
up to 10%.
3. Our analysis shows that edge switches (i.e., switches connected to servers)
account for a high energy cost as they are always powered on, even at tiny
loads. Given that a significant part of the energy cost of a switch is static (in
the chassis, power supply, processor, interconnect fabric), by using high
cardinality switches (and thus fewer switches) we can save a significant amount
of energy even if they are always on.
Putting all these studies together we obtain a new DCN in which edge switches
have high port density and where the other switches are connected in a left-skewed
topology which is a subgraph of the fat-tree. This type of topology has a lower
capital cost and a lower operational cost as well.
5.1 SUB-TREES FOR DIFFERENT TRAFFIC CHARACTERISTICS
A typical fat-tree DCN consists of three layers of switches: edge layer, aggregation
layer and core layer. For a k-ary fat-tree network, there are $k^2/4$ switches in the
core layer. The aggregation layer and edge layer are divided into k pods, each of
which has $k/2$ edge switches and $k/2$ aggregation switches. In total, the network is
composed of $5k^2/4$ switches and each switch has k ports. Every edge switch is
connected to $k/2$ end hosts, thus $k^3/4$ end hosts in total can be interconnected
through the fat-tree network.
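As a quick check of these counts, the following Python sketch evaluates them for a given k. The function name `fat_tree_size` is only illustrative.

```python
def fat_tree_size(k):
    """Return the switch and host counts of a k-ary fat-tree (k even)."""
    if k <= 0 or k % 2:
        raise ValueError("k must be a positive even integer")
    return {
        "core": k * k // 4,          # k^2/4 core switches
        "aggregation": k * k // 2,   # k pods, k/2 aggregation switches each
        "edge": k * k // 2,          # k pods, k/2 edge switches each
        "switches": 5 * k * k // 4,  # total number of k-port switches
        "hosts": k ** 3 // 4,        # k/2 hosts per edge switch
    }
```

For k = 12 this gives 180 switches (36 of them core switches) and 432 end hosts, the configuration simulated later in this chapter.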
5.1.1 Traffic Model
Benson et al. [20] analyzed network traffic characteristics of ten datacenters,
including three university datacenters (EDU), two private enterprise datacenters
(PRV) and five commercial cloud datacenters (CLD). EDU datacenters serve
students and staff on campus. The main applications in EDU datacenters include
distributed file systems and Web services. PRV datacenters mainly serve corporate
users and developers. Besides hosting traditional Web services, these datacenters
also run customized applications. CLD datacenters are purpose-built to
support specific applications and serve external users. For example, two of the CLD
datacenters primarily run MapReduce style jobs and the other three are mainly
for Internet-facing applications, including messaging, Webmail, Web portal and
search.
By observation, a significant part of the traffic in the EDU datacenters is
distributed file system traffic across the entire network. On average, about 30% of
the traffic in these three EDU datacenters is within the same rack. The applications
in the PRV datacenters show emerging patterns of consolidation
and virtualization, and around 45% of the traffic is within the same rack. The
MapReduce jobs in the CLD datacenters are scheduled to be packed into the same
rack to reduce core interconnection, and nearly 75% of the traffic is confined to the
same rack.
For our topology study, we need traffic traces that not only follow the different
patterns in the EDU, PRV and CLD datacenters but also have different loads.
Therefore, we created a traffic generator to generate traffic traces following the
traffic patterns of these datacenters with various loadings and fed them to the
fat-tree simulator we implemented. Traffic in a fat-tree can be characterized by two
probabilities: $p_1$ and $p_2$. $p_1$ denotes the probability that the source and
destination of a packet are connected to the same edge switch. Of the other packets,
there is a probability $p_2$ that their destination is within the same pod and thus
will need to traverse an aggregation switch. The rest of the packets are destined for
servers in other pods and thus need to pass through a core switch. By varying the
probabilities $p_1$ and $p_2$, we can simulate different types of traffic patterns.
Based on the results of Benson et al. [20], we generate synthetic traffic traces using
an On/Off process with the On and Off periods following the lognormal distributions
($\sigma_{off} = [2.6, 3]$, $\mu_{off} = [12.4, 13.8]$, $\sigma_{on} = [2.6, 2.9]$,
$\mu_{on} = [12.2, 14.1]$). The packet interarrival time is also a lognormal process
($\sigma = [3.8, 4.1]$, $\mu = [20.1, 22.1]$). The source and destination nodes are
chosen uniformly from servers in each pod. The specific parameters we used are:
EDU ($p_1 = 0.3$, $p_2 = 0.2$), EDU1 ($p_1 = 0.3$, $p_2 = 0.5$),
PRV ($p_1 = 0.45$, $p_2 = 0.2$) and CLD ($p_1 = 0.75$, $p_2 = 0.2$).
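The destination-selection step can be sketched in Python as follows. This is only an illustration of how $p_1$ and $p_2$ partition packets, not the traffic generator used in this work: the names `PROFILES` and `pick_scope` are hypothetical, the On/Off timing is left out, and $p_2$ is applied to the packets that leave the rack, following the wording of the paragraph above (the analytical model in Section 5.1.3 instead writes the inter-pod share as $1 - p_1 - p_2$).

```python
import random

# (p1, p2) pairs for the four traffic patterns listed in the text.
PROFILES = {
    "EDU": (0.3, 0.2),
    "EDU1": (0.3, 0.5),
    "PRV": (0.45, 0.2),
    "CLD": (0.75, 0.2),
}


def pick_scope(p1, p2, rng):
    """Classify one packet as staying in its rack, its pod, or crossing the core."""
    if rng.random() < p1:
        return "rack"  # source and destination share an edge switch
    if rng.random() < p2:
        return "pod"   # traverses an aggregation switch
    return "core"      # destined to another pod


def scope_fractions(profile, n, seed=0):
    """Empirical share of rack/pod/core packets over n draws."""
    rng = random.Random(seed)
    p1, p2 = PROFILES[profile]
    counts = {"rack": 0, "pod": 0, "core": 0}
    for _ in range(n):
        counts[pick_scope(p1, p2, rng)] += 1
    return {scope: c / n for scope, c in counts.items()}
```

With the CLD parameters, roughly three quarters of the generated packets stay within the rack, consistent with the observation that nearly 75% of CLD traffic is rack-local.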
In a second study, we examine the benefits of consolidating jobs into fewer
servers by distributing different traffic loads to each pod, which we call the
non-uniform case. For example, say we have 12 pods. We generate 70% pod load for
the first five pods and 10% pod load for the sixth pod, and leave all other pods
with zero traffic, so that the overall load is 30% for the entire network. For each of
the traffic patterns we generate loads from 10% to 70% of network capacity.
We simulate a k = 12 fat-tree DCN that supports 432 end hosts connected
through 180 12-port switches. These switches are grouped in 12 pods and each
pod contains 6 edge switches and 6 aggregation switches. The core layer consists
of 36 core switches, which connect the 12 pods together.
5.1.2 Active Sub-Trees
We feed a 10-second synthetic packet trace to the simulated fat-tree and use
left-most routing. Figure 5.1 shows the sub-trees for all traffic patterns at a load of
10%. A minimum spanning tree is sufficient for CLD
datacenters because only around 25% of their traffic leaves the rack. We observe
that even when the load increases to 70%, a significant number of switches
and links remain idle. If we can pack the communicating jobs into fewer
pods, as in the non-uniform scenario, then the number of edge switches and
aggregation switches required will be further reduced.
The fraction of active switches is shown in Figure 5.2. It shows that when the
traffic load is less than 70%, 20% of switches are never used. For light traffic at
10%, less than 50% of switches are needed. Since the CLD datacenters are usually
designed with meticulous job placement in order to decrease cross-pod traffic, the
fraction of unused switches for CLD is the greatest. For non-uniform traffic, fewer
edge switches and aggregation switches are required since jobs are consolidated
into fewer pods. However, the non-uniform case requires more core switches than
the uniform cases because the number of active core switches depends on the pod
with the heaviest traffic going to the core layer. So when we pack jobs into fewer
servers, it is better to balance the load among the active pods. For example, the
10% CLD nonuniform traffic load is split into 70% pod load in pod 1 and 50%
pod load in pod 2. The maximum number of core switches required (= 12) is
determined by pod 1 since more traffic from pod 1 is going to the core layer. If we
split the 10% overall load into 60% in pod 1 and 60% in pod 2, the maximum number
of core switches required is reduced to 6.
5.1.3 Analytical Model of Sub-Tree Size
The simulations above clearly show that a significant part of a fat-tree can be
dispensed with without affecting performance. However, the simulations were
conducted for relatively small DCNs. In this section, we provide a theoretical model
that can be generalized to arbitrary sized DCNs. Let us assume that the traffic loads
generated by the k pods are $\lambda_1, \lambda_2, \ldots, \lambda_k$ (represented as a
fraction of full load). We use parameters $p_1$ and $p_2$ to represent the probability
of traffic travelling between servers connected to the same edge switch, and of
traffic travelling within the same pod but between different subnets, respectively.
Therefore, a fraction $(1 - p_1 - p_2)$ of the traffic goes to other pods. We assume
that $p_1$ and $p_2$ for each of the k pods are written as
$p_1^1, p_1^2, \ldots, p_1^k$ and $p_2^1, p_2^2, \ldots, p_2^k$. For pod i, a traffic
load of $\lambda_i$ is generated, of which $\lambda_i(1 - p_1^i)$ goes up to the
aggregation layer and $\lambda_i(1 - p_1^i - p_2^i)$ reaches the core layer
switches.
The edge switches are constantly active since they are connected to servers.
[Figure 5.1: eight panels showing the sub-tree at 10% load for EDU, EDU1, PRV and
CLD traffic, each with uniform and nonuniform load.]
Figure 5.1. Minimal fat-trees with uniform and non-uniform traffic of load 10%.
[Figure 5.2: plot titled "Fraction of Active Switches"; x-axis: load; y-axis: fraction of
active switches; curves for uniform and nonuniform CLD, PRV, EDU and EDU1 traffic.]
Figure 5.2. Fraction of switches required for uniform and non-uniform traffic.
Thus for each pod i, the number of edge switches that are powered on is $e_i = k/2$.
Since the traffic from the edge switches takes the left-most available aggregation
switches first, and the total capacity of each aggregation switch is $\frac{1}{k/2}$ of
the pod load, the total number of aggregation switches in pod i that handle traffic
load $\lambda_i(1 - p_1^i)$ is:

$$a_i = \left\lceil \lambda_i \left(1 - p_1^i\right) \frac{k}{2} \right\rceil \qquad (5.1)$$
For the core layer, we consider two scenarios. The first scenario is when $p_2 = 0$,
so that all the traffic arriving at the aggregation switches goes up to the core layer.
The number of core switches is determined by the maximum load of the k pods.
Suppose pod j has the maximum load, $\lambda_j(1 - p_1^j)$, going to the core layer.
Since each core switch can handle a fraction $\frac{1}{k^2/4}$ of the pod load, the
total number of core switches for the entire network is computed as:

$$c = \left\lceil \lambda_j \left(1 - p_1^j\right) \frac{k^2}{4} \right\rceil \qquad (5.2)$$
When $p_2 \neq 0$, the number of core switches varies with the traffic load going to
the core layer. We can compute the range of the number of core switches needed. If
the traffic going to the core layer is distributed on the left-most aggregation switches
for all the pods, then the minimum number of core switches is needed and is
calculated from the maximum core load of the k pods. Similarly, we suppose the
maximum load going to the core layer is from pod j with the load
$\lambda_j(1 - p_1^j - p_2^j)$, and the number of core switches is computed as:

$$c_{min} = \left\lceil \lambda_j \left(1 - p_1^j - p_2^j\right) \frac{k^2}{4} \right\rceil \qquad (5.3)$$
If the load going to the core layer is distributed randomly on the active aggregation
switches of each pod, the maximum number of active core switches depends on
the maximum number of active aggregation switches. For example, suppose pod
j has the most active aggregation switches and the number of active aggregation
switches is $a_j = \lceil \lambda_j (1 - p_1^j) \frac{k}{2} \rceil$. Each of these
aggregation switches can send traffic to the $k/2$ core switches connected to it.
Therefore, we can estimate the upper bound of the number of core switches as:

$$c_{max} = \left\lceil \lambda_j \left(1 - p_1^j\right) \frac{k}{2} \right\rceil \times \frac{k}{2} \qquad (5.4)$$
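Equations 5.1 to 5.4 translate directly into code. The Python sketch below is illustrative only: the function names are hypothetical, `lam` stands for the pod load, and the rounding step before taking the ceiling is an added guard against floating-point noise rather than part of the model.

```python
import math


def _ceil(x):
    # Round first so that values such as 3.0000000001 do not become 4.
    return math.ceil(round(x, 9))


def agg_switches(k, lam, p1):
    """Eq. 5.1: active aggregation switches in a pod with load lam."""
    return _ceil(lam * (1 - p1) * k / 2)


def core_switches_no_intra_pod(k, lam, p1):
    """Eq. 5.2: core switches when p2 = 0, for the most loaded pod."""
    return _ceil(lam * (1 - p1) * k * k / 4)


def core_min(k, lam, p1, p2):
    """Eq. 5.3: lower bound on core switches, for the most loaded pod."""
    return _ceil(lam * (1 - p1 - p2) * k * k / 4)


def core_max(k, lam, p1):
    """Eq. 5.4: upper bound on core switches, for the most loaded pod."""
    return agg_switches(k, lam, p1) * (k // 2)
```

For the CLD non-uniform example of Section 5.1.2 (k = 12, $p_1 = 0.75$), `core_max(12, 0.7, 0.75)` evaluates to 12 and `core_max(12, 0.6, 0.75)` to 6, matching the core switch counts quoted there.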
Using the above formulations, we can compute the number of active switches in
each layer of a fat-tree network. The significant conclusions we can draw are as
follows:
• Even at 70% loading, in both the uniform and non-uniform traffic cases, no
more than 50% of aggregation switches are used.
• For CLD datacenters, only a third of the aggregation switches are used
because of the job placement policies.
• At 70% loading, no more than 50% of core switches are used, while for CLD
this percentage is even smaller at 33%.
Based on the above results, we can reduce the number of aggregation switches and
core switches by 50% from current fat-tree networks. One approach may be to keep
each pod symmetric (for the uniform traffic model) and discard the rightmost
50% of aggregation switches from each pod. For the non-uniform traffic case, the
leftmost pods would not be modified but the rightmost pods would have only a few
aggregation switches as computed in eqn. 5.1. Similarly, we can discard the rightmost
core switches based on eqn. 5.3 and 5.4.
5.2 RIGHT SIZING THE EDGE SWITCHES
A regular fat-tree uses switches of the same size over the entire network. While this
is a useful feature when purchasing switches in bulk from OEMs, we show that it is
not the best approach from an energy efficiency standpoint. Consider the benefits
of increasing the degree of edge switches. The immediate impact is that it increases
$p_1$ and decreases $p_2$, so we need fewer aggregation switches. The second
benefit comes from the energy cost of the edge switches. The energy cost of a switch
can be viewed as the cost of the chassis, switching fabric, linecards and ports [14]. As
we show in Section 5.3.1, we can increase the port density of switches by adding
new linecards, which has the net effect of scaling the energy cost sub-linearly with
the number of ports. Thus using a single switch with twice as many ports is more
energy efficient than using two switches with half as many ports each.
As we analyze in Section 5.1, the number of switches and links required in
any pod for a given traffic is dependent on the traffic load $\lambda$ and the traffic
pattern parameters $p_1$ and $p_2$. If we increase the size of edge switches, more
servers will directly connect to each of them and, by definition, $p_1$ will be greater,
and thus more traffic is transferred directly through the edge switches. As a result,
the number of required aggregation layer switches will decrease to

$$a'_i = \left\lceil \lambda_i \left(1 - p_1^{i'}\right) \frac{k}{2} \right\rceil \qquad (5.5)$$
If we keep the size of the pod unchanged, then fewer edge switches are required
when we replace small-sized edge switches with larger-sized switches. At the same
time, the amount of inter-pod traffic, $1 - p_1 - p_2$, remains the same. Therefore,
the lower bound of required core switches is still $c_{min}$. However, since a smaller
number of aggregation switches are used, the inter-pod traffic is moved to the left
side of the aggregation switches, so the maximum number of required core switches
will decrease to

$$c'_{max} = \left\lceil \lambda_j \left(1 - p_1^{j'}\right) \frac{k}{2} \right\rceil \times \frac{k}{2} \qquad (5.6)$$
In particular, when $p_1$ keeps increasing and $p_2$ decreases to zero, all the traffic
reaching the aggregation layer is directed to the core switches. Under these
circumstances, all traffic in the core layer is consolidated onto the leftmost core
switches.
We simulate fat-trees using edge switches of different sizes. The original k = 12
fat-tree has six 12-port edge switches in each pod. In our experiment, we use three
24-port switches, two 36-port switches, or one 72-port switch in the edge layer of
each pod. In all cases, edge switches still use half of their ports to connect to servers
and the other half to connect to aggregation switches. For example, a 24-port
edge switch connects to 12 servers and to 6 aggregation switches, with two
links between each edge switch and each aggregation switch. We note that changing
the radix of edge switches will not change the total traffic between servers and
edge switches or between the aggregation and core layers. The only change is in the
traffic between the edge and aggregation layers, as shown in Figure 5.3. Indeed, as the
radix of edge switches increases, less traffic goes from the edge switches to the
aggregation switches.
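The edge-layer configurations above follow from simple arithmetic, sketched here in Python. The function name `edge_layout` is hypothetical, and the sketch assumes, as in the text, that each pod keeps its $(k/2)^2$ servers and $k/2$ aggregation switches and that every edge switch splits its ports evenly between servers and uplinks.

```python
def edge_layout(k, ports):
    """Per-pod edge layout when edge switches have `ports` ports.

    Returns (edge switches per pod, servers per edge switch,
    parallel links from each edge switch to each aggregation switch).
    """
    servers_per_pod = (k // 2) ** 2
    aggs_per_pod = k // 2
    down = ports // 2  # half of the ports face servers
    up = ports - down  # the other half face aggregation switches
    if ports % 2 or servers_per_pod % down or up % aggs_per_pod:
        raise ValueError("port count does not divide the pod evenly")
    return servers_per_pod // down, down, up // aggs_per_pod
```

For k = 12, `edge_layout(12, 24)` returns three edge switches per pod with 12 servers each and two links to every aggregation switch, as in the example above.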
[Figure 5.3: two panels, "Traffic between edge and aggregation layer for EDU
Nonuniform" and "Traffic between edge and aggregation layer for EDU Uniform";
x-axis: load; y-axis: total traffic (GB); curves for 12-port, 24-port, 36-port and
72-port edge switches.]
Figure 5.3. Traffic between edge layer and aggregation layer is less when the size of edge switches
increases. (Figures shown above are for uniform and nonuniform traffic patterns in EDU
datacenters. CLD, PRV and EDU1 datacenters also have the same properties.)
The fraction of active switches is illustrated in Figure 5.4. We can conclude
that as the size of edge switches increases, the fraction of the total number of
required switches decreases. The reduction in aggregation switches needed comes
about since $p_2$ decreases (less intra-pod traffic). Although the inter-pod traffic
remains unchanged, the fraction of core switches may reduce slightly because the
traffic going to the core layer can be further consolidated onto the left core switches.
Figure 5.5 shows examples of the resulting sub-trees when we use either 12-port
or 72-port edge switches for an EDU datacenter in the k = 12 fat-tree. We can
conclude that using the highest port-density switches for the edge layer minimizes
the overall number of aggregation and core layer switches required.
5.3 ENERGY SAVINGS OF LARGER-SIZED EDGE SWITCHES
Let us next consider the energy benefits of using high port density switches at the
edge. Datacenter switches are chassis-based modular switches designed for
reliability and performance. The number of ports can be expanded by inserting more
linecards in the switch chassis. In this section, we formulate the power
consumption model of the DCN and use actual power consumption data from different
Cisco switches to make the case for utilizing high port density edge switches.
[Figure 5.4: eight panels titled "Fraction of Active Switches" for EDU, EDU1, PRV
and CLD traffic, each with uniform and nonuniform load; x-axis: load; y-axis: fraction
of active switches; curves for 12-port, 24-port, 36-port and 72-port edge switches.]
Figure 5.4. Fraction of active switches with larger-sized edge switches.
[Figure 5.5: two panels, "Sub-tree for EDU Uniform 70% Load" and "Sub-tree with
72-port Edge Switches for EDU Uniform 70% Load".]
Figure 5.5. EDU DCNs with 12-port and 72-port edge switches, 70% load.
5.3.1 Static Cost
Consider a fat-tree network constructed with 12-port switches supporting 432 end
hosts. We compare its energy consumption with the cases using edge switches of
24, 36 and 72 ports. Cisco Catalyst 4503-E is a modular datacenter switch with two
linecard slots. From the Cisco Power Calculator [2], we find the power consumption
data of Catalyst 4503-E switches (shown in Table 5.1). The fixed power cost of a
switch includes the power cost of the chassis, supervisor engines and linecards. The
Catalyst 4503-E switch works with three supervisor engine models: 6LE, 7E/7LE
and 8E. Model 6LE supports 46XX series 1GE 12-port and 24-port linecards and
models 7E/7LE/8E support 47XX series linecards with up to 48 ports. Combining
different choices of linecards, we can create switches with port numbers from 12 to
72. We calculate the total power consumption of a k = 12 fat-tree DCN when
using different sizes of edge switches. The results are shown in Figure 5.6 (left) and
we see around 30% power savings for the entire network when replacing 12-port
switches with 72-port switches in the edge.

Table 5.1. Power Consumption of Datacenter Modular Switch - Cisco Catalyst 4503-E.
Component           Model         Power Cost
Chassis                           48 W
Supervisor Engine   6LE           168 W
                    7E/7LE        223.68 W
                    8E            319.97 W
Linecard            4612-SFP-E    24 W
                    4624-SFP-E    40.32 W
                    4712-SFP-E    19.97 W
                    4724-SFP-E    31.97 W
                    4748-SFP-E    73.63 W

[Figure 5.6: two panels titled "Static Power Consumption with Different Sizes of Edge
Switches"; x-axis: port number of edge switches; y-axis: fixed power consumption of
a DCN (kW). Left panel: 12 to 72 ports, curves for Catalyst 4503-E with Supervisor
Engine 6LE, 7E/7LE and 8E. Right panel: 48 to 576 ports, curves for Cisco Catalyst
6513-E and Cisco Nexus 7018.]
Figure 5.6. Static power consumption of a k = 12 and a k = 48 fat-tree DCN with different sizes
of edge switches.
Large cloud datacenters have tens of thousands of servers and require
high-port-density switches. For example, we need a 192-port switch to merge four
48-port edge switches together. A Catalyst 6513-E switch chassis has 11 linecard slots
and supports 528 1GE ports in total. A Cisco Nexus 7018 switch has 16 linecard slots,
providing 768 1GE ports. We choose the switch configurations that consume the least
power per port from the two switches and calculate the DCN power consumption
shown in Figure 5.6 (right).
5.3.2 Dynamic Cost
The power consumption of the chassis, supervisor engine and linecards is fixed when
the switch is deployed. However, a port consumes more power when it is active.
Besides, the port capacity setting, port utilization and switch firmware version also
affect the power consumption of a switch [64]. For simplicity, we only consider
the static power consumption and the power cost of ports in this work, and we
represent the power model of a switch as:

$$P_{switch} = P_{chassis} + P_{supervisor\,engine} + numCard \times P_{linecard} + numActPort \times P_{actPort} + numIdlePort \times P_{idlePort}$$

where numCard is the number of linecards supported by the switch, numActPort
is the number of active ports and numIdlePort is the number of inactive ports.
We compute the overall DCN power consumption as the sum of the power cost of all
switches:

$$P_{total} = \sum P_{switch}$$
Using the data in Table 5.1, the switch 4503-E chassis with 12, 24, 36 and 72
ports has a fixed cost of at least 240W, 264W, 280.32W and 377.28W, respectively.
We also learn from [14] that each port consumes 3W when active and 0.1W when
idle. Thus we can formulate the power model for estimating the switch power
consumption as:

$$P_{switch} = \begin{cases}
240 + (12 - x) \times 0.1 + 3x & \text{12-port switch} \\
264 + (24 - x) \times 0.1 + 3x & \text{24-port switch} \\
280.32 + (36 - x) \times 0.1 + 3x & \text{36-port switch} \\
377.28 + (72 - x) \times 0.1 + 3x & \text{72-port switch}
\end{cases}$$

where x is the number of active ports.
[Figure 5.7: two panels, "Total Power Consumption of Switches with EDU Uniform
Traffic" and "Total Power Consumption of Switches with EDU Nonuniform Traffic";
x-axis: load; y-axis: fraction of power consumption; curves for 12-port, 24-port,
36-port and 72-port edge switches.]
Figure 5.7. Fraction of total power consumption of network switches with larger-sized edge
switches for different traffic load and patterns in EDU datacenters.
The resulting power consumption of the sub-trees of EDU datacenters is compared
in Figure 5.7. With the uniform traffic load, the homogeneous fat-tree can achieve
more than 50% power savings through left-most routing. By replacing the 12-port
edge switches with larger-sized switches, traffic flows can be further consolidated at
the edge and core layers, thus achieving a skinnier sub-tree with more energy savings.
Thus we can conclude that using higher port density edge switches saves energy by
requiring fewer aggregation
switches and by reducing the energy needed by the edge switches themselves.
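The switch power model can be written out as a short Python sketch. It is an illustration, not the evaluation code of this work. The wattages come from Table 5.1 and from the per-port figures quoted from [14]; the `BUILDS` table is an assumption, namely one combination of supervisor engine and linecards that reproduces the stated fixed costs, since the text does not say which linecards were combined.

```python
CHASSIS_W = 48.0
SUPERVISOR_W = {"6LE": 168.0, "7E/7LE": 223.68, "8E": 319.97}
LINECARD_W = {
    "4612-SFP-E": 24.0,
    "4624-SFP-E": 40.32,
    "4712-SFP-E": 19.97,
    "4724-SFP-E": 31.97,
    "4748-SFP-E": 73.63,
}
P_ACTIVE_W = 3.0  # per active port
P_IDLE_W = 0.1    # per idle port

# Assumed builds (hypothetical): chosen only because their sums match the
# fixed costs of 240 W, 264 W, 280.32 W and 377.28 W given in the text.
BUILDS = {
    12: ("6LE", ["4612-SFP-E"]),
    24: ("6LE", ["4612-SFP-E", "4612-SFP-E"]),
    36: ("6LE", ["4612-SFP-E", "4624-SFP-E"]),
    72: ("7E/7LE", ["4724-SFP-E", "4748-SFP-E"]),
}


def fixed_power(ports):
    """Chassis + supervisor engine + linecards, in watts."""
    supervisor, cards = BUILDS[ports]
    return CHASSIS_W + SUPERVISOR_W[supervisor] + sum(LINECARD_W[c] for c in cards)


def switch_power(ports, active):
    """Piecewise model above: fixed cost plus idle and active port power."""
    if not 0 <= active <= ports:
        raise ValueError("active ports must be between 0 and ports")
    return fixed_power(ports) + (ports - active) * P_IDLE_W + active * P_ACTIVE_W
```

Under this model, a pod's edge layer built from six fully active 12-port switches draws `6 * switch_power(12, 12)` watts, whereas a single fully active 72-port switch draws `switch_power(72, 72)` watts, which illustrates where the static savings come from.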
5.4 SUMMARY
Many approaches have been studied to find the minimum subset of a DCN topology
for an offered traffic load without changing the network interconnection or the
network devices. Recently, Widjaja et al. [82] compared the energy savings of
deploying different sizes of switches in a fat-tree network. They find it is more energy
efficient to use smaller-sized switches when the traffic is highly localized. Chabarek
et al. [24] propose to build energy proportional DCNs using low-power low-radix
switches for matched traffic patterns.
This chapter explores approaches to find the minimum network topology
for given loads and traffic patterns. We derive a traffic-driven model to calculate
the number of switches required for each layer in a fat-tree DCN. We use the
left-most heuristic flow assignment algorithm to simulate the traffic consolidation
process and validate the correctness of the model. Based on the model, we propose to
use high-radix edge switches when the traffic load within the same pod is high,
which significantly reduces the number of switches in the aggregation layer.
Furthermore, using high-radix edge switches can affect the routing of inter-pod traffic
and consolidate it onto the left-side core switches. As a result, fewer core switches are
used and the active core switches are aligned to the left of the core layer, resulting
in a smaller sub-tree. Using this principle, datacenter operators can easily
determine which core switches can be powered off. We survey the power consumption
of commodity modular switches and conduct an evaluation of the power savings
when using larger-sized edge switches in different types of datacenters. We find
that the overall power consumption is reduced by using high-radix edge switches
in fat-tree DCNs.
Chapter 6
TRAFFIC CONSOLIDATION USING MERGE NETWORKS
In previous chapters, we investigated usage-based network optimization. We
also formulated analytical models and proposed a heuristic routing algorithm to
minimize the power consumption of datacenter networks. However, in practice, a large
number of switches still need to remain active even for very light loadings, resulting
in sub-optimal energy savings.
[Figure 6.1: diagram of a k-ary fat-tree with k pods (Pod 1, Pod 2, ...), each with
k/2 edge switches and k/2 aggregation switches, k/2 servers per edge switch, and
$k^2/4$ core switches.]
Figure 6.1. Fat-tree model.
We use the k-ary fat-tree shown in Figure 6.1 as an example to illustrate the
situation and explain the motivation for our work. The k-ary fat-tree
interconnects $k^3/4$ servers using three layers of switches. The edge-layer switches
and aggregation-layer switches are organized in k “pods”. Each pod includes k/2
aggregation switches and k/2 edge switches and each edge switch connects to k/2
servers. There are $k^2/4$ core switches and each core switch has k links connecting
to the k pods.
In datacenters, all the edge switches are always powered on as they are
connected to servers and have to remain active to be able to forward traffic upward
and downward at all times. Even if a server has very little traffic going to the
connecting edge switch, the switch will be fully powered on although very lightly
loaded. The contribution of our approach is to enable powering off more switches
and links by consolidating traffic at the edge and aggregation layer.
6.1 OUR APPROACH: MERGING
Consider the case of an edge switch connected to $k/2$ servers. Assume each
server offers a load of $\lambda$ (expressed as a fraction of link rate). The
total traffic to this switch from the servers is $k\lambda/2$. Normally the
$k/2$ switch interfaces remain active even when $\lambda$ is small. If the
traffic can be consolidated, we need at most $\lceil k\lambda/2 \rceil$
interfaces to be active. When the load $\lambda$ is small, more interfaces of
the switch can be powered off. In other words, if there was a way to merge the
traffic from the $k/2$ servers, we could potentially power off more switch
interfaces and save power. Previous papers [75][85] provided a hardware design
of a device called a merge network. Rather than repeating that discussion here,
we provide a functional model of what such a network does, and then use it in
the remainder of this chapter.

Figure 6.2. Merge networks applied to a switch: a $k$-port edge switch with one
$k/2 \times k/2$ merge network on its $k/2$ links to the $k/2$ servers and
another on its $k/2$ links to the $k/2$ aggregation switches.
We illustrate a $k/2 \times k/2$ merge network in Figure 6.2. The merge network
has $k/2$ connections to the $k/2$ servers and $k/2$ connections to the $k/2$
switch interfaces. A merge network has the property that it pushes all traffic
from servers to the leftmost interface of the switch. If that interface is busy,
then the traffic is forced to the next interface, and so on. This traffic
merging behavior ensures that several switch interfaces can be put into low
power mode without compromising connectivity. Of course, because we are breaking
the 1-1 association of a switch interface to a server interface, several layer 2
protocols will break. Some additional observations about the merge network are
as follows:
1. The merge network is a fully analog device with no transceivers and, as a
result, its power consumption is below one watt. The merge network can be
visualized as a train switching station where trains are re-routed by switching
the tracks (rather than store-and-forward).
2. Consider the uplink from the servers to the merge network. All traffic coming
into the merge network is output on the leftmost $m \le k/2$ links connected
to the $m$ leftmost interfaces of the switch, where $m = \lceil k\lambda/2 \rceil$
(assuming a normalized unit capacity for links). This is accomplished internally
by sensing packets on links and automatically redirecting them to the leftmost
output from the merge network that is free.
3. On the downlink to the servers, traffic from the switch to the $k/2$ servers
is sent out along the leftmost $m \le k/2$ switch interfaces to the merge
network. The packets are then sent out along the $k/2$ links attached to the
servers from the output of the merge network. The manner in which this is
accomplished is described in [75] (note that the challenge is to correctly route
the packets flowing through the merge network to the appropriate destinations).
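The uplink behavior in observation 2 can be sketched as an idealized fluid
model. This is only a functional illustration under stated assumptions: the
names `merge_uplink` and `active_interfaces` are hypothetical, traffic is
treated as infinitely divisible, and each link has unit capacity. The hardware
design in [75][85] works on packets, not fluid loads.

```python
def merge_uplink(server_loads):
    """Idealized fluid model of the uplink side of a merge network.

    server_loads: offered load on each server link, as a fraction of link rate.
    Returns the load carried by each switch interface, leftmost first.
    """
    remaining = sum(server_loads)
    outputs = []
    for _ in server_loads:
        # Fill the leftmost free interface up to its unit capacity first.
        carried = min(1.0, max(0.0, remaining))
        outputs.append(carried)
        remaining -= carried
    return outputs


def active_interfaces(server_loads, eps=1e-9):
    """Number of switch interfaces that carry traffic after merging."""
    return sum(1 for load in merge_uplink(server_loads) if load > eps)


# Six servers (k = 12), each offering 10% of link rate.
light = merge_uplink([0.1] * 6)
```

With six servers at $\lambda = 0.1$, all traffic lands on the leftmost
interface, matching $m = \lceil k\lambda/2 \rceil = 1$ for $k = 12$.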
We apply two $k/2 \times k/2$ merge networks to each edge switch as shown in
Figure 6.2. The connections are similar for each aggregation switch. For the
core switches, we connect a $k \times k$ merge network.
Alternatively, we can connect a merge network to multiple switches. In this
scenario, traffic is merged to interfaces of the leftmost switches, and the
right-end switches with all idle interfaces can be powered off, achieving more
energy savings. In this work, we apply merge networks at two locations within a
pod of a fat-tree network: one location is between the servers and the edge
switches and the other location is between the edge switches and the aggregation
switches. Figure 6.3 shows a single pod of a fat-tree after applying merge
networks. As shown, we utilize one $k^2/4 \times k^2/4$ merge network to connect
all the servers in a pod to the interfaces of edge switches and apply another
merge network to connect edge switches to aggregation switches.
Figure 6.3. Merge network applied to pod in a fat-tree: one
$k^2/4 \times k^2/4$ merge network connects the $k/2 \times k/2 = k^2/4$ servers
to the $k/2$ $k$-port edge switches, and another $k^2/4 \times k^2/4$ merge
network connects the edge switches to the $k/2$ $k$-port aggregation switches,
each of which has $k/2$ ports to core switches.
To apply merge networks to a fat-tree network, we add two $k/2 \times k/2$ merge
networks to each edge switch as shown in Figure 6.2. The connections are similar
for each aggregation switch. For the core switches, we connect a $k \times k$
merge network.
6.2 ENERGY SAVINGS DUE TO TRAFFIC MERGING
To illustrate the additional energy savings achieved by merge networks when
compared with approaches such as ElasticTree, we quantify this benefit by
running the optimization problem on several different types of network loadings
for a small fat-tree topology of size $k = 4$. In this topology, there are 8
edge switches, 8 aggregation switches and 4 core switches. For each edge switch,
there are 2 servers connected for a total of 16 servers in 4 pods. We assume
that there is a $2 \times 2$ merge network connected to either side of each edge
and aggregation switch and there is a $4 \times 4$ merge network connected to
each core switch.
6.3 TRAFFIC PATTERNS
Traffic patterns in data centers can vary greatly, and to ensure our results are
widely applicable, we run the optimization algorithm on the following types of
traffic: Random, Stride(n), and Staggered(n) [11]. In Random, the source and
destination are randomly selected from among the servers. For Stride(n), the
destination of a flow from server $i$ is server $[(i + n) \bmod 16]$, where
servers are numbered left to right as $0, 1, \cdots, 15$. For example, in a
$k = 4$ fat-tree network, Stride(1) sends almost half of the traffic between
servers connected to the same edge switch, and the other half goes to
aggregation and core switches. On the other hand, Stride(4) sends all traffic
between pods, so a larger number of switches participate in forwarding traffic.
The Staggered traffic model assigns a probability $p_1$ for traffic going to a
server in the same subnet (i.e., connected to the same edge switch), a
probability $p_2$ for traffic going to a server in the same pod but a different
subnet, and a probability $1 - p_1 - p_2$ that the flow is destined to a server
in a different pod. By varying these probabilities, we can generate a large
number of different loading patterns.

Figure 6.4. Difference in number of active switches and active interfaces
network-wide. (a) Number of active switches: fraction of switches active out of
20 versus $\lambda$. (b) Total number of active interfaces: additional active
interfaces in ElasticTree versus $\lambda$. Traffic suites: Random, Stride(1),
Stride(2), Stride(4), Stride(8), Staggered(1), Staggered(2), Staggered(3).
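The Staggered model can be sketched as a destination sampler. This is a minimal
illustration, not the generator used in the experiments: the function name
`staggered_destination`, the left-to-right server numbering per subnet and pod,
and the uniform choice within each locality class are all assumptions.

```python
import random


def staggered_destination(src, k, p1, p2, rng=random):
    """Pick a destination for a flow from server `src` in a k-ary fat-tree.

    With probability p1 the destination is in the same subnet, with
    probability p2 in the same pod but a different subnet, otherwise in a
    different pod. Requires k >= 4 so that every class is non-empty.
    """
    if k < 4 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 4")
    per_subnet = k // 2                  # servers per edge switch
    per_pod = per_subnet * (k // 2)      # servers per pod
    total = per_pod * k
    subnet, pod = src // per_subnet, src // per_pod
    u = rng.random()
    if u < p1:
        # Same subnet, excluding the source itself.
        start = subnet * per_subnet
        candidates = [s for s in range(start, start + per_subnet) if s != src]
    elif u < p1 + p2:
        # Same pod, different subnet.
        start = pod * per_pod
        candidates = [s for s in range(start, start + per_pod)
                      if s // per_subnet != subnet]
    else:
        # Different pod.
        candidates = [s for s in range(total) if s // per_pod != pod]
    return rng.choice(candidates)


rng = random.Random(0)
same_subnet_dst = staggered_destination(5, 4, 1.0, 0.0, rng)
```

In the $k = 4$ topology, server 5 shares its edge switch only with server 4, so
with $p_1 = 1$ the sampled destination is always server 4.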
6.4 TRAFFIC MERGING WITHIN A SWITCH
6.4.1 Number of Active Interfaces
Figure 6.4a plots the percentage of active switches for our approach as well as
for ElasticTree for different loading patterns and different loads. As expected,
the number of active switches for Stride(1) does not vary with $\lambda$. This
is because almost all the traffic goes to a server in the same subnet or in the
same pod and, therefore, the active switches required are always the eight edge
switches, one aggregation switch per pod, and one core switch. Stride(8) shows
the highest number of active switches because all the traffic is inter-pod
traffic and hence more core switches are used.
In order to illustrate the potential benefits of traffic merging, we take the
difference between the total number of active interfaces when using ElasticTree
and when using traffic merging with the above optimization. The results, shown
in Figure 6.4b, clearly illustrate the benefits of merging. In the case of
Stride(1), ElasticTree uses 12 more interfaces than merging. The reason is that
one aggregation switch is active per pod. In ElasticTree, all four interfaces of
this switch are active (albeit with very low traffic). In our approach, in
contrast, we merge the traffic using a merge network and use only a single
interface of the switch, which saves three interfaces in each of the four pods.
6.4.2 Energy Savings
The overall energy cost of a switch can be roughly partitioned into the cost of
the chassis and the cost of the interfaces. As described in [82, 34], a
reasonable approximation to the cost of a switch is
$$\text{Switch Cost} = C + m \log m + m$$
where $m$ is the number of active switch ports. The constant $C$ accounts for
static costs of a switch such as fans. The second term corresponds to the cost
of the interconnection fabric within the switch, which is a significant
contributor to energy consumption (typically 30% to 40%). This cost scales as
$m \log m$ for a switch with $m$ active ports. The last term is the cost
contribution from the active interfaces. This term folds into itself the cost of
the line cards that the interfaces are on. For the purpose of comparing the
overall cost reduction of traffic merging relative to ElasticTree, we set $C$ to
50% of the maximum switch cost and express it as
$$C = m_{\max} \log m_{\max} + m_{\max}$$
where $m_{\max}$ is the number of switch ports. If the traffic load fraction
going to a switch is $\lambda$, the merge network will switch the traffic to the
leftmost $k = \lceil \lambda m \rceil$ ports. Thus, the cost of a switch with
merge networks is written as
$$\text{Traffic Merging Switch Cost} = C + k \log k + k$$
Figure 6.5. Reduction in total cost when using traffic merging: percentage cost
improvement over ElasticTree versus $\lambda$ for the Random, Stride(1),
Stride(2), Stride(4), Stride(8), Staggered(1), Staggered(2), and Staggered(3)
traffic suites.
Therefore, the fraction of cost savings of traffic merging over ElasticTree is
calculated as
$$\text{Cost Savings} = \frac{m \log m - k \log k + m - k}{C + m \log m + m}$$
Figure 6.5 plots the fraction of reduction of network cost using traffic merging
over ElasticTree. It is noteworthy that, for all traffic patterns and across all
loads, traffic merging reduces the overall energy cost even for a small-sized
network consisting of 20 switches. These savings are more substantial when we
consider realistic DCNs as we do later in this thesis.
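The per-switch savings formula can be evaluated with a short sketch. The helper
names `port_cost` and `merging_cost_savings` are hypothetical, and the base of
the logarithm is not stated in the text, so base 2 is assumed here. A small
epsilon guards the ceiling against floating-point round-off.

```python
import math


def port_cost(ports):
    """Fabric plus interface cost, ports * log2(ports) + ports (base 2 assumed)."""
    return ports * math.log2(ports) + ports if ports > 0 else 0.0


def merging_cost_savings(load, m, m_max):
    """Fraction of switch cost saved by merging relative to ElasticTree.

    load:  traffic load fraction (lambda) going to the switch
    m:     ports kept active without merging
    m_max: total number of switch ports (sets the static cost C)
    """
    c = port_cost(m_max)                        # C = 50% of maximum switch cost
    k = max(0, math.ceil(load * m - 1e-9))      # leftmost ports used after merging
    elastic = c + port_cost(m)
    merged = c + port_cost(k)
    return (elastic - merged) / elastic


# A 4-port switch with all ports active and lambda = 0.1.
savings = merging_cost_savings(0.1, 4, 4)
```

Under these assumptions a 4-port switch at $\lambda = 0.1$ uses $k = 1$ port, so
the numerator is $8 - 0 + 4 - 1 = 11$ and the denominator is $12 + 8 + 4 = 24$.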
6.5 TRAFFIC MERGING WITHIN A POD
In this section, we compute the minimum number of active switches required when
we apply merge networks to all switches at each layer within a datacenter pod.
The merge networks force traffic to the left so that more aggregation and core
switches can be powered off. In particular, traffic merging at the edge layer
leaves some edge switches idle, and these can be powered off.
6.5.1 Lower Bound on Energy Consumption
We derived the analytical expressions for the minimal energy consumption of
fat-trees for different types of loadings in Chapter 3. We use the fraction of
active switches as the metric for energy consumption and assume an underlying
routing protocol that pushes traffic to the left. We use parameters $p_1$, $p_2$
and $p_3$ to model traffic loading types: a packet from a server goes to another
server connected to the same edge switch with probability $p_1$, to a server in
the same pod but on another edge switch with probability $p_2$, and to a server
in a different pod with probability $p_3 = 1 - p_1 - p_2$. With $\lambda$
denoting the average load offered by each server (as a fraction of link speed,
which we normalize to 1), the total traffic at the level of edge switches, pod
aggregation switches, and core switches is represented as follows:
$$\text{Traffic per edge switch} = \frac{\lambda k}{2}$$
$$\text{Traffic for all aggregation switches in a pod} = \frac{(1 - p_1)\lambda k^2}{4}$$
$$\text{Traffic for all core switches} = k \times \frac{(1 - p_1 - p_2)\lambda k^2}{4}$$
Note that traffic flow is symmetric and the numbers above correspond to both
traffic into and traffic out of a switch or switches.
We observe that all the edge switches need to remain active at all levels of
load to ensure servers have network connectivity. This gives us $k^2/2$ active
edge switches. Within each pod we have total traffic equal to
$(1 - p_1)\lambda k^2/4$ going into/from the $k/2$ aggregation switches from/to
the edge switches. Given that each link has a normalized capacity of 1, and that
there are $k/2$ interfaces per aggregation switch connected to the edge
switches, we require at least
$\left\lceil \frac{(1 - p_1)\lambda k^2/4}{k/2} \right\rceil$ active aggregation
switches per pod. Since there are $k$ pods, the total number of active
aggregation switches becomes
$k \left\lceil \frac{(1 - p_1)\lambda k}{2} \right\rceil$. Finally, since every
core switch is connected to an aggregation switch from each of the $k$ pods, the
number of active core switches we require is simply
$\left\lceil \frac{(1 - p_1 - p_2)\lambda k^3/4}{k} \right\rceil$, where we
divide the total traffic passing through the core switches by the number of
links per switch and round up. Therefore, the total number of active switches
can be written as:
$$\text{Active Switches} = \frac{k^2}{2} + k \left\lceil \frac{(1 - p_1)\lambda k}{2} \right\rceil + \left\lceil \frac{(1 - p_1 - p_2)\lambda k^2}{4} \right\rceil$$
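The lower bound above translates directly into a few lines of code. The
function name `active_switches` is an illustrative choice, and the small
epsilon is an added guard against floating-point round-off in the ceilings; the
formula itself is the one derived in the text.

```python
import math


def active_switches(k, lam, p1, p2, eps=1e-9):
    """Lower bound on active switches in a k-ary fat-tree without merging."""
    edge = k * k // 2                                        # all edge switches stay on
    agg = k * math.ceil((1 - p1) * lam * k / 2 - eps)        # per pod, times k pods
    core = math.ceil((1 - p1 - p2) * lam * k * k / 4 - eps)
    return edge + agg + core


total = 5 * 12 * 12 // 4                    # 180 switches for k = 12
local_only = active_switches(12, 0.1, 1.0, 0.0)
all_remote = active_switches(12, 0.1, 0.0, 0.0)
```

For $k = 12$ and purely local traffic ($p_1 = 1$) the bound stays at the 72 edge
switches, or 40% of the 180 switches, regardless of $\lambda$.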
Figure 6.6 plots the fraction of active switches as a function of load $\lambda$
for five different scenarios when $k = 12$. The plot with the label -o
corresponds to the case when all traffic is between the servers connected to an
edge switch ($p_1 = 1$). In other words, no traffic needs to flow to the
aggregation switches or to the core switches. As expected, the graph stays flat.
However, what is relevant here is that even at light loads of 0.1, all the edge
switches are fully active. At this load value, each server generates traffic
equal to 1/10th of the uplink capacity (similarly for the downlink), but the
energy consumed is the same as when the link is fully loaded. The total combined
traffic from all 6 servers of an edge switch is 0.6, which is less than the
capacity of a single link.
The plot for $p_1 = 0$, $p_2 = 0$, $p_3 = 1$ corresponds to the extreme case
when all the traffic is destined for servers in a different pod. Hence the core
and aggregation switches will be utilized. Contrasting this with the case
discussed above, we observe that energy scales approximately linearly with load,
if we discount the edge switches. This is the desired behavior for
energy-proportional networking.
Figure 6.6. Active switches for the model in Section 3.1.1: fraction of active
switches versus $\lambda$ for $k = 12$, with curves for $(p_1, p_2, p_3)$ =
(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.25, 0.5, 0.25), (0.25, 0.25, 0.5), and
(0.0, 0.0, 1.0).
6.5.2 Energy Savings Due to Traffic Merging
To save energy used by the edge switches, we apply a
$\frac{k^2}{4} \times \frac{k^2}{4}$ merge network between the $k^2/4$ servers
and the $k/2$ edge switches. There are two consequences of applying the merge
networks. First, the $\lambda p_2$ traffic that goes to other subnets within the
same pod no longer needs to go through the aggregation level. Instead, it is
transferred directly to the destination servers. The traffic loading parameters
change to $p'_1 = p_1 + p_2$ and $p'_2 = 0$. Accordingly, the number of active
aggregation switches required in each pod is
$\left\lceil \frac{(1 - p'_1)\lambda k}{2} \right\rceil = \left\lceil \frac{(1 - p_1 - p_2)\lambda k}{2} \right\rceil$.
Second, traffic from servers is now sent to a merge network and consolidated to
the leftmost edge switches, and the idle edge switches can be put into low power
mode to save energy. Therefore, the number of active edge switches in one pod
can be calculated as
$\left\lceil \frac{\lambda k^2/4}{k/2} \right\rceil = \left\lceil \frac{\lambda k}{2} \right\rceil$,
which changes with the traffic load $\lambda$. The total number of active
switches after applying the merge networks can be written as:
$$\text{Active Switches} = k \left\lceil \frac{\lambda k}{2} \right\rceil + k \left\lceil \frac{(1 - p'_1)\lambda k}{2} \right\rceil + \left\lceil \frac{(1 - p'_1 - p'_2)\lambda k^2}{4} \right\rceil$$
$$= k \left\lceil \frac{\lambda k}{2} \right\rceil + k \left\lceil \frac{(1 - p_1 - p_2)\lambda k}{2} \right\rceil + \left\lceil \frac{(1 - p_1 - p_2)\lambda k^2}{4} \right\rceil$$
Figure 6.7 shows the fraction of active switches in a $k = 12$ fat-tree with
merge networks for the same traffic models as Figure 6.6. It illustrates the
benefits of traffic merging well. The number of active switches is reduced more
when the traffic load is lower. Notably, the line that is flat in Figure 6.6 now
scales roughly linearly with the load after traffic merging.
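As a worked check of the two expressions for active switches, consider one
operating point chosen here for illustration: $k = 12$, $\lambda = 0.1$,
$p_1 = 0.25$, $p_2 = 0.5$ (one of the plotted scenarios), with
$5k^2/4 = 180$ switches in total.

Without merging:
$$\frac{k^2}{2} + k \left\lceil \frac{(1 - p_1)\lambda k}{2} \right\rceil + \left\lceil \frac{(1 - p_1 - p_2)\lambda k^2}{4} \right\rceil = 72 + 12 \lceil 0.45 \rceil + \lceil 0.9 \rceil = 72 + 12 + 1 = 85$$

With merging:
$$k \left\lceil \frac{\lambda k}{2} \right\rceil + k \left\lceil \frac{(1 - p_1 - p_2)\lambda k}{2} \right\rceil + \left\lceil \frac{(1 - p_1 - p_2)\lambda k^2}{4} \right\rceil = 12 \lceil 0.6 \rceil + 12 \lceil 0.15 \rceil + \lceil 0.9 \rceil = 12 + 12 + 1 = 25$$

At this load the fraction of active switches therefore drops from
$85/180 \approx 0.47$ to $25/180 \approx 0.14$, and the entire reduction comes
from the edge layer.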
Figure 6.7. Active switches for the model with traffic merging: fraction of
active switches versus $\lambda$ for $k = 12$, with curves for
$(p_1, p_2, p_3)$ = (1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.25, 0.5, 0.25),
(0.25, 0.25, 0.5), and (0.0, 0.0, 1.0).
6.6 SIMULATION RESULTS
We build a simulator for a fat-tree network with $k = 6$ and 1 Gbps link
capacity. Each server generates traffic based on a two-state On/Off process in
which the lengths of the On and Off periods follow a lognormal distribution. In
the On state, packet inter-arrival times are also drawn from a lognormal process
[20]. The parameters selected for the lognormal processes are based on different
types of traffic patterns as well as different loading patterns. For each
packet, the destination is selected at random from the set of all nodes
according to the probabilities $p_1$ and $p_2$. The simulator reads these
externally generated trace files and forwards packets based on routing tables
computed every second of simulated time. In our simulation, we test several sets
of traffic loads with different $p_1$ and $p_2$ and obtain similar results. Due
to space limitations, we only report the results from the first set of traffic,
with $p_1$ and $p_2$ as follows:
1. $p_1 = 0.75$, $p_2 = 0.125$;
2. $p_1 = 0$, $p_2 = 0.75$;
3. $p_1 = 0$, $p_2 = 0$.
We assume that external traffic $q$ is 10% in all three cases. The total traffic
load ranges from 10% to 70% of the full bandwidth.
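A two-state On/Off source with lognormal period lengths and inter-arrival times
can be sketched as follows. This is an illustration only: the function name
`on_off_arrivals` and all numeric parameter values are hypothetical, since the
text does not list the lognormal parameters used in the experiments.

```python
import random


def on_off_arrivals(duration, on, off, gap, rng):
    """Generate packet arrival times (seconds) for one On/Off source.

    on, off, gap: (mu, sigma) pairs of the underlying normal distribution for
    On-period length, Off-period length, and packet inter-arrival time.
    """
    arrivals = []
    t = 0.0
    while t < duration:
        # On period: emit packets separated by lognormal gaps.
        on_end = min(t + rng.lognormvariate(*on), duration)
        t_pkt = t + rng.lognormvariate(*gap)
        while t_pkt < on_end:
            arrivals.append(t_pkt)
            t_pkt += rng.lognormvariate(*gap)
        # Off period: stay silent for a lognormal duration.
        t = on_end + rng.lognormvariate(*off)
    return arrivals


# Hypothetical parameters, chosen only to make the example run quickly.
trace = on_off_arrivals(1.0, on=(-2.0, 1.0), off=(-2.0, 1.0),
                        gap=(-7.0, 1.0), rng=random.Random(1))
```

Passing a seeded `random.Random` instance keeps the generated trace
reproducible, which is convenient when the same trace is replayed with and
without merge networks.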
The routing algorithm is a modified version of Dijkstra's algorithm where we
force flows to use routes that are already in use, thus packing flows together.
In the algorithm we assign weights to edges as well as nodes. Edge weights are a
constant 2, but node weights can be 0 or 1. If a node has been used for
forwarding a flow, its weight changes from 1 to 0. Thus, flows are encouraged to
reuse the same subset of nodes (or switches). We eliminate links with zero
available capacity from further consideration in that round of routing
computation.
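The weighting scheme can be sketched as a small variant of Dijkstra's algorithm.
This is not the simulator's code: the function name `route_flow`, the data
structures, and the toy topology are hypothetical, and the sketch assumes
unsplit flows, so it skips any link whose remaining capacity is below the flow's
demand rather than only links with exactly zero capacity.

```python
import heapq

EDGE_WEIGHT = 2  # constant edge weight from the text


def route_flow(links, capacity, used, src, dst, demand):
    """Route one unsplit flow and update `capacity` and `used` in place.

    links:    dict mapping node -> list of neighbor nodes
    capacity: dict mapping directed link (u, v) -> remaining capacity
    used:     set of nodes that already forward some flow (node weight 0)
    Returns the chosen path as a list of nodes, or None if no path fits.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        if u == dst:
            break
        for v in links[u]:
            # Assumption: a link is unusable if it cannot carry the whole flow.
            if capacity[(u, v)] < demand:
                continue
            nd = d + EDGE_WEIGHT + (0 if v in used else 1)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    for u, v in zip(path, path[1:]):
        capacity[(u, v)] -= demand
    used.update(path)
    return path


# Hypothetical toy topology: two edge switches joined by two aggregation switches.
pairs = [("h1", "e1"), ("h3", "e1"), ("e1", "a1"), ("e1", "a2"),
         ("a1", "e2"), ("a2", "e2"), ("e2", "h2"), ("e2", "h4")]
links, capacity = {}, {}
for u, v in pairs:
    links.setdefault(u, []).append(v)
    links.setdefault(v, []).append(u)
    capacity[(u, v)] = capacity[(v, u)] = 1.0

used = set()
path1 = route_flow(links, capacity, used, "h1", "h2", 0.4)
path2 = route_flow(links, capacity, used, "h3", "h4", 0.4)
path3 = route_flow(links, capacity, used, "h1", "h2", 0.4)
```

In this toy run the second flow is packed onto the aggregation switch already
chosen by the first, and the third flow spills to the other aggregation switch
only because the shared link no longer has enough capacity.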
We use $C = 1$ and designate the leftmost core switch as the externally
connected core switch. For 10% external traffic, the link capacity $l$ of the
externally connected core switch has to be greater than 4. Therefore, we use
$l = 4$ and let $Q = 2kl = 48$ to avoid traffic loss.
In Figure 6.8, we plot the fraction of active switches versus total load using
simulation without merge networks and with merge networks. Figure 6.9 shows
the same metric from the analytical models described in Chapter 3 and Chapter
6. It is easy to see that our model is a very close match to the simulations.
The minor difference between the simulations and the model is due to the fact
that we estimated the values of $p_1$ and $p_2$ from the synthetic traffic and
then used them in the analysis. The estimated values for these probabilities are
listed in the legend of Figure 6.9. The implication of this is that the lower
bound of energy efficiency can be achieved in practice by utilizing the simple
routing algorithm described above.
Figure 6.8. Simulation results of active switches for near and far traffic:
fraction of active switches versus traffic load $\lambda + q$ for $k = 6$,
$q = 0.1$, $Q = 48$, $C = 1$, $l = 4$, with curves for $(p_1, p_2)$ =
(0.75, 0.125), (0.0, 0.75), and (0.0, 0.0), each without and with merge
networks.
When we examine Figure 6.8, we observe that the type of loading has a
significant impact on energy consumption. When it comes to allocating tasks to
servers, the task manager should be mindful of the type of traffic that will be
generated since we can obtain significant energy savings by careful scheduling.
Figure 6.9. Modeling active switches for near and far traffic: fraction of
active switches versus $\lambda + q$ for $k = 6$, $q = 0.1$, $Q = 48$, $C = 1$,
$l = 4$, with curves for $(p_1, p_2)$ = (0.75, 0.125), (0.0, 0.75), and
(0.0, 0.0), each without and with merge networks.
6.7 SUMMARY
Many approaches have been proposed to reduce energy consumption by right-sizing
datacenter networks. Lin et al. [63] propose to turn off idle servers. CARPO
consolidates negatively-correlated traffic flows to keep a smaller subset of
links active. Adnan and Gupta [10] proposed an online path consolidation
algorithm to dynamically right-size the networks. These approaches perform well
when there are lots of flows in the network. ElasticTree [51] proposes to force
traffic in a network to the leftmost switches in a fat-tree topology in order to
power off unused switches. It is the most energy-efficient of all approaches,
but is still sub-optimal because, as we show, for many light loading patterns, a
large number of switches still need to remain active.
In this chapter, we present the concept of a merge network applied to switches
to consolidate traffic. Our merging approach enables powering off more switches
and links by merging traffic at the edge and aggregation layers, and scales
network energy cost with the number of busy interfaces of each switch. By
applying merge networks to each switch, we further reduce the power consumption
of active switches. With very light load, our approach saves 20% to 40% of
energy cost compared with ElasticTree, depending on the traffic type. Traffic
with a small number of inter-pod and inter-subnet flows can benefit even more
from traffic merging. We show that with traffic merging at switches, datacenter
networks can achieve energy proportionality without changing the network
topology and devices.
Chapter 7
SIMULATION RESULTS WITH MERGE NETWORKS
We simulate a $k = 12$ fat-tree network which supports 432 servers and 180
12-port switches. In this network, there are 12 pods, each of which has six edge
switches and six aggregation switches. We assign 1 Gbps capacity to each link
and assume that each of the core switches has extra ports to be connected to the
external Internet through border routers. We experiment with synthetic traffic
data from a traffic generator and real packet traces from a university
datacenter. Since flow splitting incurs a packet reordering cost, which is not
desirable in real datacenters, we implement our simulation using non-splitting
flow assignment.
7.1 TRAFFIC DATA
7.1.1 Synthetic Traffic Data
The experimental traffic traces are generated following the On/Off pattern
derived from production datacenters [79][20]. The durations of the On/Off
periods and the packet interarrival times follow the lognormal distribution. We
generate different traffic types including Random, Stride(n), and Staggered(n),
each of which has a different pattern of near and far traffic. For instance, a
flow in Stride(n) goes from node $i$ to node $[(i + n) \bmod N]$ ($N$ is the
total number of servers). Source and destination nodes in the Random type are
uniformly distributed. Staggered(n) is staggered probability traffic and assigns
fixed probabilities for traffic going to the same subnet and to the same pod.

Table 7.1. Probabilities of a flow going to the same subnet ($p_1$), to the same
pod ($p_2$), and to different pods ($1 - p_1 - p_2$) for all traffic suites
studied.

Traffic Suite   $p_1$    $p_2$    $1 - p_1 - p_2$
Random          1.2%     7%       91.8%
Stride(1)       83.3%    13.9%    2.8%
Stride(6)       0%       83.3%    16.7%
Stride(36)      0%       0%       100%
Stride(216)     0%       0%       100%
Staggered(1)    100%     0%       0%
Staggered(2)    50%      30%      20%
Staggered(3)    20%      30%      50%
We generate 8 traffic suites with parameters $p_1$, $p_2$ and $1 - p_1 - p_2$
shown in Table 7.1. Flows in Stride(1) always go to the next server. In a
$k = 12$ fat-tree, each edge switch connects to 6 servers and forms a subnet.
Flows from the first 5 servers stay in the same subnet, while the flow from the
6th server travels to the next subnet or the next pod. Therefore, 5/6 of the
traffic goes to the same subnet. Flows in Stride(6) always travel across
subnets. Stride(36) has all flows traveling to other pods. The load fraction
$\lambda$ offered by each server varies from 0.1 to 0.7.
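The Stride rows of Table 7.1 can be reproduced by enumerating all flows. The
sketch below assumes servers are numbered left to right so that consecutive
blocks of $k/2$ servers share an edge switch and consecutive blocks of $k^2/4$
servers share a pod; the function name `stride_locality` is illustrative.

```python
def stride_locality(k, n):
    """Return (p1, p2, p3) for Stride(n) traffic in a k-ary fat-tree."""
    per_subnet = k // 2                   # servers per edge switch
    per_pod = per_subnet * per_subnet     # servers per pod
    total = per_pod * k                   # k^3/4 servers
    same_subnet = same_pod = 0
    for i in range(total):
        j = (i + n) % total
        if i // per_subnet == j // per_subnet:
            same_subnet += 1
        elif i // per_pod == j // per_pod:
            same_pod += 1
    p1 = same_subnet / total
    p2 = same_pod / total
    return p1, p2, 1 - p1 - p2


stride1 = stride_locality(12, 1)
stride6 = stride_locality(12, 6)
stride36 = stride_locality(12, 36)
```

For $k = 12$ this yields $p_1 = 5/6$, $p_2 = 5/36$, and $p_3 = 1/36$ for
Stride(1), which round to the 83.3%, 13.9%, and 2.8% listed in the table.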
7.1.2 Empirical Traffic Data
We use packet traces from a university datacenter published by Benson et al.
[20]. This university datacenter has about 500 servers providing services for
campus users. 60% of the traffic is for Web services and the rest is for other
applications such as file sharing services. Traffic traces are captured by a
sniffer installed at a randomly selected switch in the datacenter. Figure 7.1
illustrates the total load of the packet traces within 50 minutes. The overall
load is very small for a high-bandwidth fat-tree topology.

Figure 7.1. Traffic load of a university datacenter: total traffic load (MB)
versus time (seconds).
7.2 APPLYING MERGE NETWORK WITHIN A SWITCH
Our simulation outputs the number of active switches (Figure 7.2a) and the
number of active interfaces of each switch with various traffic loads and
patterns. In general, the number of active switches increases with the traffic
load. However, both Stride(1) and Staggered(1) have a constant number of active
switches and active interfaces. This is because, for Stride(1), all loads can be
satisfied by using a minimum spanning tree. For Staggered(1), only edge switches
are used since all the traffic flows are local traffic within the same subnet.
Figure 7.2b illustrates the difference in the total number of active interfaces
of a DCN using merge networks versus ElasticTree. It shows that more interfaces
of the active switches become idle when the traffic is light, which demonstrates
that traffic merging can save more energy with lighter traffic (Figure 7.3).
Stride(1) achieves the most energy savings over ElasticTree (around 42%)
because, under ElasticTree, the energy consumed by the five idle interfaces of
each active edge switch is wasted. Staggered(1) saves 30% of energy consumption
since, for the entire network, only half of the interfaces (those facing the
servers) of the edge switches are used.

Figure 7.2. Number of active switches and active interfaces network-wide for a
$k = 12$ fat-tree network. (a) Number of active switches: fraction of switches
active out of 180 versus $\lambda$. (b) Total number of active interfaces:
additional active interfaces in ElasticTree versus $\lambda$. Traffic suites:
Random, Stride(1), Stride(6), Stride(36), Stride(216), Staggered(1),
Staggered(2), Staggered(3).
ElasticTree provides an energy-efficient solution for DCNs. However, the
drawback of ElasticTree is that a DCN still consumes a large amount of power
with light load [51]. In contrast, our approach reduces energy consumption when
the network is lightly loaded, which demonstrates that traffic merging achieves
better energy proportionality than ElasticTree.
We observe that power cost decreases from 30% to 17% when applying merge
networks compared with ElasticTree (Figure 7.4).
Figure 7.3. Reduction in total cost when using traffic merging: percentage cost
improvement over ElasticTree versus $\lambda$ for the Random, Stride(1),
Stride(6), Stride(36), Stride(216), Staggered(1), Staggered(2), and Staggered(3)
traffic suites.
Figure 7.4. Energy savings when using traffic merging: fraction of power cost of
a fat-tree network versus time (seconds) for ElasticTree and traffic merging.
7.3 APPLYING MERGE NETWORK WITHIN A POD
The $k = 12$ fat-tree network consists of 180 12-port switches, with 12 pods
each containing six edge switches and six aggregation switches. Every switch in
the edge layer and aggregation layer has six uplink ports and six downlink
ports. So within each pod, there are a total of 36 servers connected to six edge
switches, and the number of links between the edge layer and the aggregation
layer is 36. In this experiment, within each pod, we apply one $36 \times 36$
merge network between servers and edge switches, and another $36 \times 36$
merge network between edge switches and aggregation switches. The merge network
switches traffic flows to the left switches. Flow splitting is allowed for
simplicity.
7.3.1 Number of active switches
Figure 7.5 compares the number of active switches at each level before and after
applying traffic merging. For all traffic suites, the number of active switches
at the edge level is reduced significantly after applying merge networks. For
aggregation-level switches, we observe an obvious reduction for Stride(6),
Staggered(2) and Staggered(3). We notice from Table 7.1 that these three traffic
suites have higher $p_2$ and, as we discussed in Section 6.5.2, an amount
$\lambda p_2$ of traffic is switched away from the aggregation level after
traffic merging, which means a substantial part of the traffic originally going
to the aggregation layer is now transferred directly through edge switches with
merge networks.
Figure 7.6 illustrates the fraction of total active switches of the DCNs before
and after using merge networks. The fraction of active switches is reduced from
40%-60% to 10%-35% for light traffic loadings ($\lambda$ between 0.1 and 0.2).
Figure 7.5. Compare number of active switches with vs. without traffic merging:
one panel per traffic suite (Stride(1), Stride(6), Stride(36), Stride(216),
Staggered(1), Staggered(2), Staggered(3), Random), each plotting the number of
active edge, aggregation, and core switches versus $\lambda$.
Figure 7.6. Fraction of active switches (out of 180) versus $\lambda$ before
using traffic merging (left) and after using traffic merging (right), for the
Random, Stride(1), Stride(6), Stride(36), Staggered(1), Staggered(2), and
Staggered(3) traffic suites.
7.3.2 Energy cost
The above discussion focused on reducing the number of active switches. The
overall energy cost of a DCN consists of the cost incurred by switches and
links. However, the cost incurred by links is negligible and can be incorporated
within the cost of the interfaces of switches. Generally speaking, the energy
cost of a switch can be roughly partitioned into the cost of the chassis and the
interfaces. As described in [82][34], a reasonable approximation to the cost of
a $k$-port switch is:
$$\text{Switch Cost} = C + k \log k + k$$
The constant $C$ accounts for static costs of the switch such as fans. The
second term corresponds to the cost of the interconnection fabric within the
switch, which is a significant contributor to energy consumption (typically
30%-40%). This cost scales as $k \log k$ for a $k$-port switch. The last term is
the contribution to the cost from the active interfaces. This term folds into
itself the cost of the linecards that the interfaces are on. For the purposes of
comparing the overall cost reduction of traffic merging, we set $C$ to 50% of
the maximum switch cost. That is:
$$C = k \log k + k$$
[Figure 7.7, one panel: "Cost improvement". x-axis: λ; y-axis: % cost
improvement; series: Random, Stride(1), Stride(6), Stride(36), Stride(216),
Staggered(1), Staggered(2), Staggered(3).]
Figure 7.7. Reduction in total cost after using traffic merging.
If the traffic load fraction going to a switch is $\lambda$, the merge network will switch
the traffic to the leftmost $m = \lceil \lambda k \rceil$ interfaces. The cost of a switch with merge
networks is thus:
$$\text{Switch Cost} = C + m \log m + m$$
Figure 7.7 shows the overall cost improvement over approaches without merge
networks. It demonstrates that our traffic merging method can save up to 90%
of the energy cost when the traffic load is low. Figure 7.8 shows that traffic
merging achieves better energy efficiency, closer to ideal energy
proportionality.
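The per-switch cost model above can be made concrete with a short calculation. The following Python sketch is illustrative only: the base-2 logarithm, the function names, and the example values (k = 48 and the listed loads) are assumptions that do not come from the text.

```python
import math


def chassis_cost(k: int) -> float:
    """Static cost C, set to 50% of the maximum cost: C = k log k + k."""
    return k * math.log2(k) + k  # assumption: base-2 logarithm


def switch_cost(k: int, m: int) -> float:
    """Cost of a k-port switch with m active interfaces: C + m log m + m."""
    fabric = m * math.log2(m) if m > 0 else 0.0  # avoid log(0) when idle
    return chassis_cost(k) + fabric + m


def merged_ports(k: int, load: float) -> int:
    """m = ceil(load * k): interfaces the merge network keeps active."""
    return math.ceil(load * k)


def cost_improvement(k: int, load: float) -> float:
    """Percent cost reduction versus keeping all k interfaces active."""
    full = switch_cost(k, k)
    merged = switch_cost(k, merged_ports(k, load))
    return 100.0 * (full - merged) / full


if __name__ == "__main__":
    for load in (0.1, 0.25, 0.5, 0.75):
        print(f"load={load:.2f}  improvement={cost_improvement(48, load):.1f}%")
```

Because C is paid whenever a switch is powered on, the per-switch improvement in this sketch stays below 50%. The sketch does not model switches being powered off entirely, so it is not a reproduction of the network-wide results in Figure 7.7.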
[Figure 7.8, one panel: "Cost Improvement with Random Traffic". x-axis: λ;
y-axis: fraction of cost; series: without traffic merging, with traffic
merging, ideal energy proportionality.]
Figure 7.8. Fraction of total cost without traffic merging vs. using traffic merging.
7.4 SUMMARY
We examine the approach of merging traffic by simulating a large fat-tree
datacenter network, applying a merge network at each switch and across a whole
pod, and finding a lower bound on the number of network switches and links
needed to satisfy a variety of traffic patterns and loads similar to those from
actual datacenters.
Simulation results show that our approach can substantially reduce the number
of active switches and lower the energy consumption of fat-tree datacenter
networks when the load is light. We show that our solution achieves 70% to 90%
total energy savings through traffic merging and achieves almost perfect energy
proportionality.
Chapter 8
PROTOTYPE OF MERGE NETWORKS
In previous chapters, we studied the idea of merging traffic at a switch to
consolidate traffic flows and minimize the number of active switch ports. In this
chapter, we describe the implementation of a prototype of a simple 2 × 2 merge
network built using optical switches. We use optical networks and devices for two
reasons: 1) fiber optic cables are the fastest-growing transmission medium used in
data centers today, since they provide high-bandwidth communication and reliable
high-speed data transmission, and 2) passive optical networks (PON) can be used
in the enterprise as point-to-multipoint solutions, because passive optical splitters
can distribute data, voice, and video signals throughout a network at much
better cost efficiency than Ethernet switches. Thus our work on merging traffic
is relevant to enterprise networks as well.
8.1 2 × 2 MERGE NETWORK ARCHITECTURE DESIGN
We use three Linux workstations to build a test-bed for a 2 × 2 merge network.
Figure 8.1 shows the architecture of the test environment. We configure the top
Linux machine, named PACLAB11, as a virtual Linux bridge to simulate an L3
network switch with two interfaces, with IP addresses 192.168.0.11 and 192.168.0.14.
The bottom two Linux machines, PACLAB12 and PACLAB13, work as two servers
connected to the simulated network switch, PACLAB11. The network addresses of
these two hosts are 192.168.0.12 and 192.168.0.13, respectively. We build a merge
network to merge traffic to the left port of the network switch. In other words, the
hosts always choose to connect to port 192.168.0.14 if it is available. Otherwise,
the right-side port (192.168.0.11) is used.
We implement the 2 × 2 merge network using two 2 × 2 optical switches, shown
as the two beige boxes in Figure 8.1. To communicate with the optical switches,
we install two Gigabit multimode SC fiber optic network adapters in PACLAB11
to act as the two ports of the network switch. Each SC fiber optic network adapter
provides a Transmitter (Tx) and a Receiver (Rx). We install one multimode SC
fiber optic network adapter in each of the hosts PACLAB12 and PACLAB13. The
left-side optical switch is the uplink switch, which connects the Transmitters of the two
servers to the Receivers of the network switch. The optical switch on the right is
the downlink switch, connecting the Receivers of the two hosts to the Transmitters
of the network switch.
Figure 8.2 is a picture of the two 2 × 2 optomechanical optical switches used in
our prototype [3]. A 2 × 2 optical switch has two states: Inserted State (A) and
Bypass State (B) (Figure 8.3). The state of the optical switch is controlled by
electric signals applied to the latches on the outer side of the switch. For instance,
when the uplink switch and downlink switch are both configured in the Bypass
state, host PACLAB12 is connected to the switch port 192.168.0.11 and host
PACLAB13 is connected to port 192.168.0.14. If both optical switches are in the Insert
state, PACLAB12 and PACLAB13 are connected to 192.168.0.14 and 192.168.0.11,
respectively.
Figure 8.1. A 2 × 2 merge network implemented with two 2 × 2 optical switches.
Figure 8.2. Optical switches.
Figure 8.3. Two states of the optical switch.
To implement priority on the left port, the controller has to decide which port is
the leftmost available port. Since it is not possible to detect whether a network
switch port is busy when the host is disconnected from the ports of the network
switch, we use a variable named STATE to keep the status of the optical switches
and store it on the Arduino board. In our 2 × 2 merge network, the STATE values
corresponding to the different scenarios are as follows:
1. STATE = 0: idle state - no port is in use
2. STATE = 1: both uplink and downlink switches are in Bypass state
3. STATE = 2: both uplink and downlink switches are in Insert state
4. STATE = 8: conflict state - two hosts send contradictory states
We implement a communication protocol between the hosts and the Arduino to
negotiate the merge network state. Before a host starts to transfer data, it reads
the current STATE from the Arduino. If the Arduino is already set to one of the active
states (STATE = 1 or STATE = 2), the host updates its own state variable,
myState, to the same value as the STATE of the merge network and starts to send data
using the current setting. If STATE = 0, the host sets myState to the value
that makes the merge network connect the host to the left port of the network
switch. The host sends myState, as a STATE update request, to the Arduino. The
host then checks STATE again and starts to send data if STATE is set to 1 or 2. If
STATE = 8, the other host was trying to set the merge network to
the opposite state at the same time; in other words, both hosts were competing for the left port.
In this case, the host backs off for a random time, sets myState to 0, and
restarts.
Figure 8.4 shows the flowchart of the logic that determines myState, as implemented
on host PACLAB12. When the current STATE is 0, there is no active data
transmission. Host PACLAB12 needs to set the merge network to STATE = 1 in order to
be connected to the left-side port. If the current STATE is 2, which means host
PACLAB13 is connected to the left port, PACLAB12 can only use the right port
and keeps STATE = 2. For host PACLAB13, the preferred STATE is 2, so it sets
myState to 2 when the board STATE is 0. We implement the state-determination
logic in the socket connection function at each host as part of the application-level
protocol.
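The host-side decision described above can be summarized in a few lines. This is a minimal Python sketch, not the prototype's socket code: the function name `decide_my_state`, the constant names, and the `preferred` parameter are hypothetical, and the random back-off wait is left out.

```python
# STATE values listed in the text.
IDLE, BYPASS, INSERT, CONFLICT = 0, 1, 2, 8


def decide_my_state(board_state: int, preferred: int) -> int:
    """Return the myState a host adopts after reading the board STATE.

    `preferred` is the active state the host requests when the merge
    network is idle (1 for PACLAB12 and 2 for PACLAB13, per the text).
    """
    if board_state in (BYPASS, INSERT):
        return board_state  # network already configured: reuse the setting
    if board_state == IDLE:
        return preferred    # request the state that favors this host
    if board_state == CONFLICT:
        return IDLE         # back off (random wait omitted), then restart
    raise ValueError(f"unknown STATE value: {board_state}")
```

For example, `decide_my_state(IDLE, preferred=1)` models PACLAB12 finding the merge network idle, and `decide_my_state(INSERT, preferred=1)` models PACLAB12 adopting the state already chosen by PACLAB13.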
We use a 5V Arduino Uno board as the controller. It negotiates STATE with the
hosts and sends control signals to the optical switches, coordinating them to
provide connection channels between the hosts and the network switch (Figure 8.5).
The Arduino Uno consumes only 232 mW of power when it is active. The two optical
switches are passive optical devices with no power requirement. As such, the
power consumption of the 2 × 2 merge network is negligible. In addition, the
optical switch requires minimal management and is highly reliable. It supports
multimode optical fiber operating at wavelengths from 650 nm to 1310 nm.
The architecture design of the merge network is illustrated in Figure 8.6. The
Arduino board works as a micro-controller: it receives state inquiries and state
update requests from the hosts, and controls the optical switches by outputting control
signals to their latches. The Arduino board communicates with
the hosts through serial ports. Since the Arduino Uno has only one hardware serial port,
we use two pins on the Arduino as the Tx and Rx of a software serial port
to communicate with the second host.
Figure 8.4. Controlling the state of merge networks: state-transferring logic implemented at
PACLAB12.
Figure 8.5. Arduino board to control the states of the two optical switches.
Figure 8.6. Architecture design of a 2 × 2 merge network.
The Arduino reads the myState values from the two hosts and sets a new STATE,
which is sent to the optical switches to change the switching states. The new value
of the STATE variable is determined by the values of the state request
variables received from the two hosts, myState1 and myState2. The logic is shown
in Table 8.1.
Table 8.1. Arduino board STATE values
                 myState2
myState1      0      1      2
    0         0      1      2
    1         1      1      8
    2         2      8      2
If the new STATE is 1 or 2, the Arduino board sends control
signals to the uplink and downlink optical switches to turn them to the Bypass or
Insert state. If the new STATE is 0, the Arduino turns both optical switches off. It
is possible for the two hosts to send contradictory myState values (e.g., one sends
1 and the other sends 2). This situation happens when the current STATE is 0
and both hosts have data to transfer at the same time, so that each of the two hosts
considers itself the only sender and expects to use the left port to transfer data. As a
result, one host sets myState to 1 and the other host sets myState to 2. When the
Arduino receives two different active myState values, the new STATE is set to 8. When
the hosts detect this situation, they reset myState to 0, wait a random
time, and try again.
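The controller logic of Table 8.1 reduces to a small function. The sketch below is written in Python for readability and is not the firmware running on the Arduino; the function name `resolve_state` is hypothetical.

```python
def resolve_state(my_state1: int, my_state2: int) -> int:
    """Combine the two hosts' requests into the board STATE (Table 8.1)."""
    for value in (my_state1, my_state2):
        if value not in (0, 1, 2):
            raise ValueError(f"invalid myState: {value}")
    requests = {my_state1, my_state2} - {0}  # ignore hosts with no request
    if not requests:
        return 0               # both hosts idle
    if len(requests) == 1:
        return requests.pop()  # one request, or two agreeing requests
    return 8                   # contradictory requests: conflict state


if __name__ == "__main__":
    # Print the table row by row (rows: myState1, columns: myState2).
    for s1 in (0, 1, 2):
        print(s1, [resolve_state(s1, s2) for s2 in (0, 1, 2)])
```

Treating 0 as "no request" is what lets a single active host set the STATE on its own, and it makes the conflict case exactly the one where the two hosts ask for different active states.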
8.2 MEASUREMENT RESULTS
We test the utilization of the two ports of a switch (implemented on workstation
PACLAB11). We use Iperf to send packets from the two hosts and customize
the traffic flows with a log-normal distribution of flow length and inter-arrival time
to generate traffic at different loadings. We measure the active time periods of
the two ports and compare the throughput with that of a switch without a merge
network.
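To illustrate what a log-normal traffic schedule looks like, the following Python sketch draws flow lengths and inter-arrival times from log-normal distributions. It only generates a schedule and does not invoke Iperf; the function names and every parameter value are assumptions, since the text does not give the distribution parameters used in the experiments.

```python
import random
from typing import List, Tuple


def generate_flows(n: int, mu_len: float, sigma_len: float,
                   mu_gap: float, sigma_gap: float,
                   seed: int = 0) -> List[Tuple[float, float]]:
    """Return n (start_time, duration) pairs, both in seconds."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    flows = []
    t = 0.0
    for _ in range(n):
        t += rng.lognormvariate(mu_gap, sigma_gap)        # inter-arrival time
        duration = rng.lognormvariate(mu_len, sigma_len)  # flow length
        flows.append((t, duration))
    return flows


def offered_load(flows: List[Tuple[float, float]]) -> float:
    """Rough load estimate: total flow time over the observed time span."""
    busy = sum(duration for _, duration in flows)
    span = max(start + duration for start, duration in flows)
    return busy / span


if __name__ == "__main__":
    # Hypothetical parameters chosen only for demonstration.
    schedule = generate_flows(1000, mu_len=0.0, sigma_len=1.0,
                              mu_gap=1.5, sigma_gap=0.5, seed=42)
    print(f"estimated load: {offered_load(schedule):.2f}")
```

Raising `mu_len` or lowering `mu_gap` increases the estimated load, which is one way a range of loadings could be produced from the same generator.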
We use the Iperf application to send traffic flows to the switch, with
overall loadings from about 10% to 75%. The two ports of the switch are named
Port 1 (left) and Port 2 (right) for simplicity. When no merge network is
used, Port 1 and Port 2 are connected to Host A and Host B respectively, and they
are always in active mode. With the merge network at the switch, the leftmost
available port is chosen first. In our test case, where the switch has
two ports, this means the left port (Port 1) always has higher priority than Port 2 (on the
right).
We use green to represent the traffic flows going to the left port (Port 1)
and red for the flows to Port 2. Figure 8.7 illustrates the ten-hour record of
the traffic flows from Host A and Host B. Host A sends most of its traffic flows to
Port 1. Host B also uses Port 1 when loading is low, and uses Port 2 more when
loading increases to 75%.
In our experiment, the software settings of the two hosts are exactly the same. The
reason that Host A can hold Port 1 for more of the time is that we use an Arduino Uno
as the state control circuit of the merge network. Each of the two hosts sends its
state control signal to the Arduino through a serial port. Since the Arduino Uno has
only one built-in hardware serial port, we use a software serial port in place of
a second hardware serial port to let the Arduino receive the signal from the second host.
Although we set the two serial ports to the same baud rate, it appears
that the host using the hardware serial port always gets connected to Port 1
faster and more frequently.
[Figure 8.7, four panels: "Host A Sending Traffic Flows to Ports" and "Host B
Sending Traffic Flows to Ports", each at 20% and 75% loading. x-axis: packet
flow sequence; y-axis: flow length (seconds).]
Figure 8.7. Traffic flows and port usage of Host A and Host B.
[Figure 8.8, two panels: "Host A Port Utilization" and "Host B Port
Utilization". x-axis: traffic loadings; y-axis: percentage of active time;
series: port 1, port 2, lost.]
Figure 8.8. Total port usage of Host A and Host B.
In this first experiment, Host A is connected to the hardware serial port and
Host B is linked to the software serial port. This would explain why
Host A sends most of its traffic flows using Port 1. Figure 8.8 shows the total
percentage of traffic flows sent to each of the ports from Host A and Host
B as loading increases from 10% to 75%. In this figure, the green line represents
the usage of Port 1, and the red line represents the usage of Port 2. Port 1 is
clearly the dominant port used by Host A. For Host B, when loading is
below approximately 45%, the traffic flows from Host B can still squeeze into Port
1, so more traffic goes to Port 1 than to Port 2. When loading
is above 45%, the total traffic from the two hosts approaches the capacity of
each port. As a result, Host A occupies Port 1 most of the time, and Host B
increasingly sends traffic to Port 2.
To verify our assumption that the different results for Host A and
Host B are caused by the difference between the Arduino hardware serial port and software
serial port, we swap the hardware and software serial ports between Host A and
Host B in the second set of experiments. Figure 8.9 shows the reverse of the results
in Figure 8.7, which confirms that the host connected to the hardware serial
port of the Arduino (now Host B) uses Port 1 more. The total port utilization
shown in Figure 8.10 also shows that Host B sends significantly more traffic to
Port 1 after we swap the serial port connections.
To show the results without the influence of the Arduino serial ports, we sum the
results of the two experiments and calculate the average. The results shown in Figure 8.11
illustrate very similar performance for Host A and Host B, and indicate that more
traffic goes to Port 1 for both Host A and Host B. It is also possible to achieve this
result by using a control circuit with two hardware serial ports. For example, the Arduino
Mega provides 3 extra built-in serial ports on board.
We calculate the overall utilization of the two ports by adding the traffic from
Host A and Host B. The result is shown in Figure 8.12. The x-axis represents
the loading. The y-axis represents the percentage of time that the
port is active. We can see that Port 1 is used much more than Port 2, especially when the
loading is low. When the loading is about 75%, Port 1 is used close to the line
capacity and Port 2 is active half of the time. This shows that the merge network
consolidates the traffic onto the left port, which is Port 1.
We extract the state switching data from the Arduino board in order to understand
how frequently the merge network changes its switching state, and illustrate
the result in Figure 8.13. The x-axis is the workload. The y-axis is the number of
times that a state switching command is received from each host. We find that
the host connected to the software serial port of the Arduino board switches
state more often. The host connected to the hardware serial port maintains a much
lower and more stable number of state switches.
We add the state switching counts of Host A and Host B and compare the total
numbers of state switches in experiments 1 and 2; the two curves are very
similar (Figure 8.14). Both curves start at about 300 and climb as loading
increases, peaking when loading reaches about 30%. After that, the curves
descend, with totals lower than 100 when the loading increases to 75%.
[Figure 8.9, four panels: "Host A Sending Traffic Flows to Ports" and "Host B
Sending Traffic Flows to Ports", each at 20% and 75% loading. x-axis: packet
flow sequence; y-axis: flow length (seconds).]
Figure 8.9. Traffic flows and port usage of Host A and Host B after switching serial ports.
[Figure 8.10, two panels: "Host A Port Utilization" and "Host B Port
Utilization". x-axis: traffic loadings; y-axis: percentage of active time;
series: port 1, port 2, lost.]
Figure 8.10. Total port usage of Host A and Host B after switching serial ports.
[Figure 8.11, two panels: "Host A Port Utilization" and "Host B Port
Utilization". x-axis: traffic loadings; y-axis: percentage of active time;
series: port 1, port 2, lost.]
Figure 8.11. Average port usage of Host A and Host B.
[Figure 8.12, one panel: "Switch Port Usage". x-axis: traffic loadings;
y-axis: percentage of active time; series: port 1, port 2.]
Figure 8.12. Total Port 1 and Port 2 utilization.
[Figure 8.13, two panels: "State Switching Times". x-axis: traffic loadings;
y-axis: state switching times; series: Host A, Host B.]
Figure 8.13. State switching times of the merge network in two experiments.
[Figure 8.14, one panel: "Merge Network States Switching Times". x-axis:
traffic loadings; y-axis: total state switching times; series: "Host A uses
software serial port and Host B uses hardware serial port" and "Host A uses
hardware serial port and Host B uses software serial port".]
Figure 8.14. Total state switching times.
For controlling state switching, there are two possible strategies. One is
non-blocking transfer: when a host's preferred port is busy, the host connects to
the other port immediately. The other strategy is blocking transfer: when a host's
preferred port is busy, the host waits for some time to see whether the preferred port
becomes available soon. In our experiments, we use the non-blocking strategy to guarantee
that all traffic is sent immediately without delay, thereby preserving network
performance. In future work, we can experiment with the blocking strategy and
find the waiting-time threshold that still guarantees reasonable throughput
and latency.
8.3 HIGHER-ORDER MERGE NETWORKS
When we apply a merge network to all ports of a ToR switch, or to several ToR
switches within a datacenter pod, we need merge networks with more
inputs/outputs. If we have an N × N optical switch analogous to the one in Figure 8.2,
we can use a pair of these optical switches as the uplink and downlink switches to
build an N × N merge network that connects N hosts to an N-port network switch,
similar to the architecture described in Figure 8.6.
Figure 8.15. Example of a 16 × 16 MEMS matrix optical switch.
An example of such a switch is a MEMS matrix optical switch [4], which supports
all-optical cross connections in a fully non-blocking manner, allowing simultaneous
connections between a number of input and output fibers. Figure 8.15 shows a 16 × 16
matrix switch. Any of the 16 input fibers can be connected to any of the 16 output
fibers.
The MEMS matrix optical switch is based on micro-electro-mechanical
system (MEMS) mirror technology, which uses a MEMS chip to rotate a matrix
of movable silicon mirrors to change the coupling of light between input fibers and
output fibers (Figure 8.16 left). MEMS technology offers low cost,
excellent optical performance, high reliability, and a fast switching time of less than
40 ms. Currently, the MEMS matrix switch is available in sizes up to 32 × 32
(Figure 8.16 right). Using a modular design, it is also possible to customize the
optical switch to larger N × N configurations.
Figure 8.16. MEMS 3D matrix optical switch.
The MEMS matrix optical switch is controlled through an RS232 or
I2C interface. We generalize the algorithm we used for the 2 × 2 optical switch to the N × N
optical switch. We use an array, STATE, to keep the current switching state
of the optical switches. The size of the array equals the dimension of the matrix optical
switch. The ith item of the array stores the name of the host whose
connection is switched to the ith output port. For example, consider four hosts
named A, B, C, and D. Assume the STATE array of the 4 × 4 matrix switch shown
in Figure 8.17 is CADB. This means host C is connected to port 1 (leftmost), host
A is connected to port 2, and so on. When there is no traffic from a host, the
value of the corresponding item of STATE is set to 0. In the example shown in
Figure 8.17, if the packet flow from Host A now completes, the STATE
array changes from CADB to C0DB. Following that, if Host C is done, the
STATE changes to 00DB. The STATE array records which ports are available for
subsequent traffic flows, and the algorithm uses the value of STATE to give
the leftmost available port the highest priority. For example, if the next traffic flow is
from Host A again, the STATE changes from 00DB to A0DB.
Figure 8.17. STATE of a 4 × 4 matrix switch.
The STATE array keeps the current outputs of the matrix optical switches.
It contains information for two types of ports. For active ports that carry data
flows, it represents the current switching logic of those ports. For output ports that
are currently idle, it stores 0 in the corresponding positions. The 0s in the
STATE array are used to determine whether there is a port further to the left available
when new traffic arrives, so as to consolidate traffic to the left side of datacenter
switches. However, since the matrix optical switch is non-blocking, the control
signal sent to the optical switch has to specify the switching path of every port,
even if no packets are going through it. Since the STATE array may contain many
0s for idle ports, it cannot be used to control the switch state transfer. Therefore,
we use another array, SIGNAL, to store the control signals for the optical switches.
We take a 4 × 4 optical switch as an example. The STATE array is of size 4
to store the output port states. It is initialized as 0000 since all four ports are
idle at the beginning. We set SIGNAL = ABCD to start the optical switch
in the bypass state initially. For the traffic flows, we use '+A' to represent that
Host A starts sending data, and '-C' to indicate that Host C's data transmission is
completed. We use a string to represent a sequence of starts and ends of data
transmission from specified hosts. For example, "+A+B-B-A" means the sequence of
events "Host A starts data transmission; then Host B starts data transmission;
Host B's data transmission ends; and then Host A's data transmission ends". We
illustrate the changing values of the STATE and SIGNAL arrays for the
sequence of data flows shown in Figure 8.18.
Figure 8.18. A 4 × 4 matrix switch: STATE and SIGNAL.
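The event-string notation is simple to parse mechanically. The sketch below is an illustration in Python; the function name `parse_events` and the ("start"/"end", host) tuple format are hypothetical, and it assumes single-character host names as in the A to D example.

```python
from typing import List, Tuple


def parse_events(sequence: str) -> List[Tuple[str, str]]:
    """Split a string such as "+A+B-B-A" into (action, host) pairs."""
    if len(sequence) % 2 != 0:
        raise ValueError("expected pairs of a sign followed by a host name")
    events = []
    for k in range(0, len(sequence), 2):
        sign, host = sequence[k], sequence[k + 1]
        if sign == "+":
            events.append(("start", host))  # host begins transmitting
        elif sign == "-":
            events.append(("end", host))    # host's transmission completes
        else:
            raise ValueError(f"unexpected symbol {sign!r} at position {k}")
    return events


if __name__ == "__main__":
    print(parse_events("+A+B-B-A"))
```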
For this 4 × 4 case, the STATE array starts from 0000 and changes every time
a new data flow starts or an old data flow completes. When a new data flow
starts, the algorithm finds the first available port (the first zero) in the STATE array
and updates that item to the name of the host that sends the data flow. When
a host completes its data transmission, the algorithm traverses the STATE array
again, finds the item with the host name, and changes it to 0. The algorithm thus
traverses the STATE array twice for each data flow, which takes at most 2n steps
for an array of size n. The SIGNAL array starts from ABCD and is updated together with the
STATE array. For example, when a new data flow starts, the algorithm changes the
ith item of the STATE array to Host x. It also checks the value of the ith item of SIGNAL.
If it is not equal to Host x, the algorithm finds Host x at the jth position of SIGNAL
and swaps the ith and jth items. Updating the SIGNAL
array takes at most n steps. Therefore, the algorithm takes at most 3n steps per
data flow, which is O(n). The
pseudo-code of the algorithm is given in Algorithm 2. The SIGNAL array is
used to control the matrix switches. It only needs to be updated when a
new data flow comes in. However, the STATE array needs to be updated both when a
data flow starts and when it completes, in order to store the real status of each port of the merge
network.
Algorithm 2 Algorithm to update state and control the matrix switch
function SwitchControl(host)
    STATE ← {0};
    SIGNAL ← {A, B, C, D};
    loop
        if Host x has data to transfer then
            for i ← 0; i < SwitchDimension; i ← i + 1 do
                if STATE[i] == 0 then
                    STATE[i] ← x;
                    if SIGNAL[i] <> x then
                        for j ← 0; j < SwitchDimension; j ← j + 1 do
                            if SIGNAL[j] == x then
                                swap(SIGNAL[i], SIGNAL[j]);
                                break;
                    break;
            Send SIGNAL to matrix switch;
        if Host x complete data transferring then
            for i ← 0; i < SwitchDimension; i ← i + 1 do
                if STATE[i] == x then
                    STATE[i] ← 0;
                    break;
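Algorithm 2 can be exercised with a small executable model. The following Python sketch follows the pseudo-code above but is not the author's implementation: the class name `MergeController`, the `sent` list that stands in for writing SIGNAL to the switch over RS232 or I2C, and the example event order are all assumptions made for illustration.

```python
IDLE = "0"  # marker for an idle output port, as in the STATE examples


class MergeController:
    """Tracks STATE and SIGNAL for an N x N matrix switch."""

    def __init__(self, hosts: str) -> None:
        self.state = [IDLE] * len(hosts)  # all ports idle, e.g. 0000
        self.signal = list(hosts)         # bypass configuration, e.g. ABCD
        self.sent = []                    # stand-in for the switch interface

    def start(self, host: str) -> int:
        """Place `host` on the leftmost idle port and return its index."""
        i = self.state.index(IDLE)        # raises ValueError if no port is idle
        self.state[i] = host
        if self.signal[i] != host:
            j = self.signal.index(host)   # current SIGNAL position of host
            self.signal[i], self.signal[j] = self.signal[j], self.signal[i]
        self.sent.append("".join(self.signal))  # "Send SIGNAL to matrix switch"
        return i

    def end(self, host: str) -> None:
        """Mark the port used by `host` as idle; SIGNAL is left unchanged."""
        self.state[self.state.index(host)] = IDLE


if __name__ == "__main__":
    ctrl = MergeController("ABCD")
    for action, host in [("start", "B"), ("start", "A"),
                         ("end", "B"), ("start", "D")]:
        getattr(ctrl, action)(host)
        print(action, host, "".join(ctrl.state), "".join(ctrl.signal))
```

The swap only ever moves hosts that are not transmitting: position i is idle by construction, and position j holds a host that is just starting, so active connections keep their ports.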
When more than one host requests to send data within a short
period of time, it is more efficient to update STATE and SIGNAL for all host
requests and send only the final SIGNAL to the switch, given the switching time
required for each state change of the matrix optical switches.
In the prototype of the 2 × 2 merge network, we implemented the control algorithm
in the software running on the Arduino board, which receives inputs from the two
hosts through serial ports. For the general implementation of an N × N merge
network, we can use optical detectors to detect data on the input fiber from each
host. The DiCon optical detector [5] provides in-line power monitoring by using
fused couplers on every input, which tap off a portion (1% to 10%) of the signal
and deliver it to the output (Figure 8.19). We propose to integrate the optical
switch, the micro-controller, and the optical detectors into a single merge
network module. The architecture design of the functional module is shown in Figure
8.20. We have found some successful examples of customized function modules.
Figure 8.21 is an example of a module integrated by DiCon [6] with MEMS VOAs,
tap-detectors, control electronics, and firmware, for optical power balancing and
management. With our control algorithm, the micro-controller can be integrated
with the tap-detectors and the matrix switch to make a functional module of an
N × N merge network.
Figure 8.19. DiCon Tap/Detector module.
Figure 8.20. Customized functional module of merge networks.
Figure 8.21. DiCon customized module.
8.4 SUMMARY
This chapter describes the implementation of a 2 × 2 merge network using optical
switches. Hosts that have data to transfer send a request to an Arduino controller,
which calculates the control signal and sends it to the optical switch. The 2 × 2
merge network successfully consolidates the data to the left port of the network
switch. We extend the algorithm to the general N × N case and discuss the
architecture design for integrating optical tap-detectors, optical switches, an electronic
controller, and firmware to form a functional N × N merge network that can be
used to consolidate traffic to the leftmost ports of edge switches within a datacenter
pod.
Chapter 9
CONCLUSIONS
9.1 SUMMARY
In this research, we consider the energy efficiency problem of datacenter networks.
Since fat-tree topologies are the predominant choice for datacenter networks, much of
our work is based on this topology. While fat-trees provide full bisection
bandwidth, which minimizes latency and boosts throughput, the energy consumption
of this network, and of other datacenter networks, is not proportional to the network
load. We find that the usage of the network devices in a fat-tree network depends
greatly on the type of traffic, the traffic load, and the selected routing
algorithm. For instance, if most of the traffic is between servers located in the same
pod, the core switches are never used, even at high loads. On the other hand,
if most of the traffic is between servers in different pods, the better part of the
network switches will be active even at low loads, since more switches in
the network need to be used for routing.
We first analyze the problem of energy consumption in fat-tree networks by
deriving expressions for the fraction of active switches and traffic losses for
arbitrary traffic loads and traffic losses. The developed analytical models for energy
consumption enable us to study fat-tree DCNs theoretically. We show that there
is a base cost of approximately 45% (due to edge switches), but beyond that point
energy consumption can scale linearly by appropriately consolidating traffic flows.
A practical application of the models is to jointly optimize task scheduling and
flow assignment so as to maximize the traffic consolidation for given job loads.
Based on the energy consumption model, we investigate skinnier network
topologies that meet the performance requirements of realistic loads, thus saving not only
energy but capital cost as well. Through a comprehensive study of the sub-graphs
of fat-trees for different traffic characteristics, we conclude that it is possible to
further reduce the number of active switches by up to 10% by consolidating
the corresponding jobs onto fewer servers, particularly at low loads. Furthermore, we find
that edge switches account for a large portion of the energy cost even at very low
loads. We propose to replace the edge switches with high-cardinality switches to
build an energy-proportional DCN. We evaluate the DCN power consumption using
the power data of Cisco modular switches. We find that the overall power
consumption is significantly reduced by using high-radix switches in the edge
layer of fat-tree DCNs.
In order to find the minimum active subset of a network, we formulate an optimization
model for computing routes with the goal of minimizing energy consumption and
use the CPLEX solver to find optimal solutions for a small fat-tree network. Routing
plays an important part in the potential for energy savings. Compared to routing
algorithms that seek to balance load, our routing algorithm consolidates traffic
onto a few paths to save energy at the idle switches. We use a universal greedy
flow assignment algorithm, which is shown to find flow assignments
close to those obtained from the optimization solver for a variety of loading scenarios.
Although the greedy bin-packing algorithm used in ElasticTree also finds the
shortest route using the left-most heuristic, it relies on the regularity of hierarchical
DCNs. Our greedy algorithm can find flow assignments close to those of the MIP
model, for not just hierarchical network topologies, but also random or irregular
DCN topologies.
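The idea of consolidating flows greedily on an arbitrary topology can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm evaluated in this thesis: flows are placed largest first, and each flow takes the feasible path that powers on the fewest additional switches, with hop count as the tie-breaker. The path search is a Dijkstra variant chosen for the sketch, and the function names, data layout, and node names are hypothetical.

```python
import heapq
from collections import defaultdict


def greedy_assign(links, switches, flows):
    """Route flows one at a time, preferring already-active switches.

    links:    {(u, v): capacity} for undirected links
    switches: set of node names that draw power when active
    flows:    list of (src, dst, demand)
    Returns (routes, active); routes holds (src, dst, demand, path) in
    processing order, with path None when no feasible path exists.
    """
    residual = {frozenset(link): cap for link, cap in links.items()}
    adj = defaultdict(set)
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)

    active, routes = set(), []
    # Largest flows first, in the spirit of first-fit-decreasing bin packing.
    for src, dst, demand in sorted(flows, key=lambda f: -f[2]):
        path = _cheapest_path(adj, residual, switches, active, src, dst, demand)
        if path is not None:
            for u, v in zip(path, path[1:]):
                residual[frozenset((u, v))] -= demand
            active.update(n for n in path if n in switches)
        routes.append((src, dst, demand, path))
    return routes, active


def _cheapest_path(adj, residual, switches, active, src, dst, demand):
    """Dijkstra with lexicographic cost (switches newly powered on, hops)."""
    def wake_cost(node):
        return 1 if node in switches and node not in active else 0

    start = (wake_cost(src), 0)
    best = {src: start}
    heap = [(start, src, [src])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if cost > best[node]:
            continue  # stale heap entry
        for nxt in sorted(adj[node]):
            if residual[frozenset((node, nxt))] < demand:
                continue  # link cannot carry this flow
            new_cost = (cost[0] + wake_cost(nxt), cost[1] + 1)
            if nxt not in best or new_cost < best[nxt]:
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None
```

Because the search uses only link capacities and the set of active switches, it makes no use of pod structure or left-most ordering, which is the property that lets a greedy scheme of this kind run on random or irregular topologies as well as hierarchical ones.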
Many approaches that address the DCN energy efficiency problem fall short of
the goal of energy proportionality. A considerable amount of energy is still
wasted, especially when the network is very lightly loaded during off-peak
hours. For example, in ElasticTree, the edge switches are always fully powered
on, even during idle hours, because they are connected to servers. At the
aggregation layer, switches that are powered on do not fully load their interfaces
facing the edge switches. Our proposed merging approach explores additional
savings made possible by a hardware device called a merge network, which
further consolidates traffic onto fewer switches and enables powering off a subset
of interfaces in active switches, thereby managing power at a finer granularity.
We attach merge networks to each switch of the network or, alternatively, to
all switches of the same layer within a pod, so as to scale the network energy cost
with the number of busy interfaces. We customize the analytical model to include the
merge networks by including the number of active interfaces as a parameter in the
minimization function. The model shows that, in addition to the savings obtained
by forcing traffic to the left, as shown in ElasticTree, we can achieve significant
additional savings by powering off unused interfaces in active switches, which is
made possible by merge networks. Simulation results show that the merge networks
can reduce the energy consumption by around 50% at light loads, and that the
DCN energy consumption can scale linearly when traffic flows are consolidated
appropriately.
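The effect of merging on a single switch can be illustrated with a simple interface-counting sketch. It is not the analytical model of this thesis. It assumes that, without a merge network, every interface whose link carries any traffic stays on, that a merge network can pack the combined traffic onto the smallest number of interfaces that fits it, and that a switch draws a fixed chassis power plus a fixed power per active interface. The function names and the wattage values in the usage note are hypothetical.

```python
import math


def active_ports_without_merge(link_loads):
    """One interface stays on for every link that carries any traffic."""
    return sum(1 for load in link_loads if load > 0)


def active_ports_with_merge(link_loads, port_capacity=1.0):
    """Merged traffic is packed onto as few interfaces as its total needs."""
    total = sum(link_loads)
    if total <= 0:
        return 0
    return min(len(link_loads), math.ceil(total / port_capacity))


def switch_power(active_ports, chassis_w, port_w):
    """Assumed model: chassis power plus per-interface power; off when idle."""
    if active_ports == 0:
        return 0.0
    return chassis_w + port_w * active_ports
```

For example, with link loads of 0.10, 0.05, 0.20, and 0.0 (as fractions of interface capacity), three interfaces stay on without merging and one with merging. With an illustrative 60 W chassis and 5 W per interface, that is 75 W against 65 W for this one switch; the larger savings reported above come from applying the same consolidation across switches so that whole devices can also be powered off.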
In simulations of larger fat-tree networks, we analyze the energy savings obtained
when using merge networks. At very light load, applying merge networks to each
switch reduces energy cost by 20% to 40% compared with ElasticTree,
depending on the traffic type. Localized traffic can benefit even more from
tra c merging. When deploying merge network at edge-layer and aggregation-
layer switches within the same pod, the tra c merging achieves up to 70%  90%
total energy savings and the network exhibits power-usage behavior close to that
of an energy-proportional system.
As a proof of concept, we design and build a hardware prototype of a 2 × 2
merge network using fiber links and passive optical devices. We experiment with
the merge network in a small test bed built on Linux boxes and Arduino. The
merge network consolidates data onto one interface, with a slight time delay due to
the switching time of the optical switches. The energy cost of the prototype is minimal.
We extend the prototype to larger merge networks by generalizing the control
algorithm and hardware design. The system can be built at reasonable expense
compared with the cost of datacenter network devices. The time and space
complexity of the control algorithm are linear, and the system requires minimal change
at the end hosts.
9.2 CONCLUSIONS
An important conclusion of this research is that the type of traffic correlates
closely with the potential energy savings for datacenter networks. This is clearly
demonstrated in the simulations as well as in the analytical models developed in
this thesis. The key idea is to keep traffic as local as possible in order to
save energy. Another conclusion is that symmetric design and homogeneous
network equipment are not generally energy-efficient. Topology enhancements for
providing external connectivity and heterogeneous deployment of network devices
both affect the overall energy consumption.
REFERENCES
[1] http://www.datacenterknowledge.com/archives/2011/08/01/report-google-
uses-about-900000-servers/.
[2] http://tools.cisco.com/cpc/.
[3] http://www.fs.com/products/32884.html.
[4] https://www.diconfiberoptics.com/products/mems_matrix_optical_switches.php.
[5] https://www.diconfiberoptics.com/products/scd0308/scd0308A.pdf.
[6] http://www.diconfiberoptics.com/products/prd_custom.php?sec=modules.
[7] IEEE 802.3az. http://www.ieee802.org/3/az/.
[8] Dennis Abts, Michael R. Marty, Philip M. Wells, Peter Klausler, and Hong
Liu. Energy Proportional Datacenter Networks. In ISCA, 2010.
[9] Hussam Abu-Libdeh, Paolo Costa, Antony Rowstron, Greg O’Shea, and
Austin Donnelly. Symbiotic Routing in Future Data Centers. In SIGCOMM,
pages 51–62, 2010.
[10] Muhammad Abdullah Adnan and Rajesh Gupta. Path Consolidation for Dynamic
Right-sizing of Data Center Networks. In Proceedings IEEE Sixth International
Conference on Cloud Computing, 2013.
[11] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A Scalable,
Commodity Data Center Network Architecture. In SIGCOMM, pages 63–74,
2008.
[12] Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson
Huang, and Amin Vahdat. Hedera: Dynamic Flow Scheduling for Data Center
Networks. In NSDI, 2010.
[13] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan.
Data Center TCP (DCTCP). In SIGCOMM, pages 63–74, 2010.
[14] Ganesh Ananthanarayanan and Randy H. Katz. Greening the Switch. In
Proc. USENIX HotPower, San Diego, CA, Dec 2008.
[15] Anders S. G. Andrae and Tomas Edler. On Global Electricity Usage of Com-
munication Technology: Trends to 2030. Challenges, April 2015.
[16] Woongki Baek and Trishul M. Chilimbi. Green: A framework for supporting
energy-conscious programming using controlled approximation. SIGPLAN
Not., 45(6):198–209, June 2010.
[17] M. Baldi and Y. Ofek. Time for a "greener" internet. In Communications
Workshops, 2009. ICC Workshops 2009. IEEE International Conference on,
pages 1–6, June 2009.
[18] J. Baliga, R. Ayre, K. Hinton, and R. S. Tucker. Photonic switching and the
energy bottleneck. In Photonics in Switching, 2007, pages 125–126, Aug 2007.
[19] Luiz André Barroso and Urs Hölzle. The Case for Energy-Proportional
Computing. In Computer. IEEE Computer Society, 2007.
[20] Theophilus Benson, Aditya Akella, and David A. Maltz. Network Traffic
Characteristics of Data Centers in the Wild. In IMC, 2010.
[21] Kashif Bilal, Samee U. Khan, Joanna Kolodziej, Limin Zhang, Khizar Hayat,
Sajjad A. Madani, Nasro Min-Allah, Lizhe Wang, and Dan Chen. A Compar-
ative Study of Data Center Network Architectures. In 26th European Confer-
ence on Modeling and Simulation (ECMS), pages 526–532, Koblenz, Germany,
2012.
[22] J. Blackburn and K. Christensen. A simulation study of a new green bit-
torrent. In Communications Workshops, 2009. ICC Workshops 2009. IEEE
International Conference on, pages 1–6, June 2009.
[23] Alessandro Carrega, Suresh Singh, Roberto Bruschi, and Raffaele Bolla.
Traffic Merging for Energy-Efficient Datacenter Networks. In International
Symposium on Performance Evaluation of Computer and Telecommunication
Systems, 2012.
[24] J. Chabarek, S. Banerjee, P. Sharma, J. Mudigonda, and P. Barford. Networks
of Tiny Switches (NoTS): In Search of Network Power Efficiency and
Proportionality. In Proceedings of the 5th Workshop on Energy-Efficient Design,
June 2013.
[25] J. Chabarek, J. Sommers, P. Barford, C. Estan, D. Tsiang, and S. Wright.
Power awareness in network design and routing. In INFOCOM 2008. The
27th Conference on Computer Communications. IEEE, April 2008.
[26] Kai Chen, Ankit Singla, Atul Singh, Kishore Ramachandran, Lei Xu, Yueping
Zhang, Xitao Wen, and Yan Chen. OSA: An Optical Switching Architecture
for Data Center Networks with Unprecedented Flexibility. In NSDI, 2012.
[27] L. Chiaraviglio, M. Mellia, and F. Neri. Reducing power consumption in
backbone networks. In Communications, 2009. ICC ’09. IEEE International
Conference on, pages 1–6, June 2009.
[28] Kenneth J. Christensen and Franklin ‘Bo’ Gulledge. Enabling power manage-
ment for network-attached computers. Int. J. Netw. Manag., 8(2):120–130,
March 1998.
[29] Charles Clos. A Study of Non-Blocking Switching Networks. The Bell System
Technical Journal, 32(2):406–424, March 1953.
[30] Andrew R. Curtis, Wonho Kim, and Praveen Yalagandula. Mahout:
Low-overhead Datacenter Traffic Management using End-host-based Elephant
Detection. In INFOCOM, 2011.
[31] Andrew R. Curtis, Jeffrey C. Mogul, Jean Tourrilhes, Praveen Yalagandula,
Puneet Sharma, and Sujata Banerjee. DevoFlow: Scaling Flow Management
for High-Performance Networks. In SIGCOMM, pages 254–265, 2011.
[32] William James Dally and Brian Towles. Principles and Practices of Intercon-
nection Networks. ELSEVIER, 2004.
[33] David Coudert, Alvinice Kodjo, and Truong Khoa Phan. Robust energy-aware
routing with redundancy elimination. Computers and Operations Research,
64:71–85, 2015.
http://www.sciencedirect.com/science/article/pii/S0305054815001252.
[34] V. Eramo, A. Germoni, A. Cianfrani, E. Miucci, and M. Listanti. Comparison
in Power Consumption of MVMC and BENES Optical Packet Switches. In
Proceedings IEEE NOC (Network on Chip), pages 125–128, 2011.
[35] Xiaobo Fan, Wolf-Dietrich Weber, and Luiz André Barroso. Power
Provisioning for a Warehouse-sized Computer. In ISCA, 2007.
[36] Nathan Farrington, George Porter, Sivasankar Radhakrishnan, Hamid
Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and
Amin Vahdat. Helios: A Hybrid Electrical/Optical Switch Architecture for
Modular Data Centers. In SIGCOMM, pages 339–350, 2010.
[37] Will Fisher, Martin Suchara, and Jennifer Rexford. Greening backbone
networks: Reducing energy consumption by shutting off cables in bundled links.
In Proceedings of the First ACM SIGCOMM Workshop on Green Networking,
Green Networking ’10, pages 29–34, New York, NY, USA, 2010. ACM.
[38] Albert Greenberg, James Hamilton, David A. Maltz, and Parveen Patel. The
Cost of a Cloud: Research Problems in Data Center Networks. In SIGCOMM
CCR, pages 68–73, 2009.
[39] Albert Greenberg, James R. Hamilton, and Navendu Jain. VL2: A Scalable
and Flexible Data Center Network. In SIGCOMM, pages 51–62, 2009.
[40] C. Gunaratne, K. Christensen, and B. Nordman. Managing energy
consumption costs in desktop PCs and LAN switches with proxying, split TCP
connections, and scaling of link speed. International Journal of Network
Management, 15:297–310, 2005.
[41] C. Gunaratne, K. Christensen, B. Nordman, and S. W. Yuen. Reducing the
Energy Consumption of Ethernet with Adaptive Link Rate (ALR). 57:448–461,
April 2008.
[42] C. Gunaratne, K. Christensen, and S. W. Yuen. Ethernet Adaptive Link Rate
(ALR): Analysis of a Buffer Threshold Policy. In GLOBECOM, November
2006.
[43] Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and
Songwu Lu. DCell: A Scalable and Fault-tolerant Network Structure for
Data Centers. In SIGCOMM, pages 75–86, 2008.
[44] Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng
Shi, Chen Tian, Yongguang Zhang, and Songwu Lu. BCube: A High Per-
formance, Server-centric Network Architecture for Modular Data Centers. In
SIGCOMM, pages 63–74, 2009.
[45] Deke Guo, Tao Chen, Dan Li, Yunhao Liu, and Guihai Chen. BCN: Ex-
pansible Network Structures for Data Centers Using Hierarchical Compound
Graph. In INFOCOM, 2011.
[46] M. Gupta, S. Grover, and S. Singh. A feasibility study for power management
in LAN switches. In Proceedings of the 12th IEEE International Conference
on Network Protocols, pages 361–371, Washington, DC, 2004.
[47] M. Gupta and S. Singh. Using low-power modes for energy conservation in
Ethernet LANs. In IEEE INFOCOM, pages 2451–2455, 2007.
[48] Maruti Gupta and Suresh Singh. Greening of the Internet. In Proceedings of
ACM SIGCOMM, 2003.
[49] László Gyarmati and Tuan Anh Trinh. Scafida: A scale-free network inspired
data center architecture. SIGCOMM Comput. Commun. Rev., 40(5):4–12,
October 2010.
[50] R. Hays. Active/Idle Toggling with 0BASE-x for Energy Efficient Ethernet,
November 2007. presentation to the IEEE802.3az Task Force.
[51] Brandon Heller, Srini Seetharaman, Priya Mahadevan, Yannis Yiakoumis,
Puneet Sharma, Sujata Banerjee, and Nick McKeown. ElasticTree: Saving
Energy in Data Center Networks. In NSDI, 2010.
[52] C. Hu, C. Wu, W. Xiong, B. Wang, J. Wu, and M. Jiang. On the design of
green reconfigurable router toward energy efficient internet. IEEE
Communications Magazine, 49(6):83–87, June 2011.
[53] Jeremy Blackburn and K Christensen. Green Telnet Modifying a client-server
application to save energy. In DR DOBBS JOURNAL, volume 33, 2008.
[54] M. Jimeno, K. Christensen, and B. Nordman. A network connection proxy
to enable hosts to sleep and save energy. In Performance, Computing and
Communications Conference, 2008. IPCCC 2008. IEEE International, pages
101–110, Dec 2008.
[55] Aman Kansal and Feng Zhao. Fine-grained energy profiling for power-aware
application design. SIGMETRICS Perform. Eval. Rev., 36(2):26–31, August
2008.
[56] Changhoon Kim, Matthew Caesar, and Jennifer Rexford. Floodless in SEAT-
TLE: A Scalable Ethernet Architecture for Large Enterprises. In SIGCOMM,
pages 3–14, 2008.
[57] John Kim, William J. Dally, and Dennis Abts. Flattened Butterfly: A
Cost-Efficient Topology for High-Radix Networks. In ISCA, pages 126–137, 2007.
[58] Jonathan G. Koomey. Growth in Data Center Electricity Use 2005 to 2011.
Technical report, Stanford University, 2011.
[59] L. Irish and K. J. Christensen. A Green TCP/IP to Reduce Electricity Con-
sumed by Computers. In Proceedings IEEE Southeastern Engineering for a
New Era, 1998.
[60] Dan Li, Chuanxiong Guo, Haitao Wu, Kun Tan, Yongguang Zhang, and
Songwu Lu. FiConn: Using Backup Port for Server Interconnection in Data
Centers. In INFOCOM, 2009.
[61] Yong Liao, Dong Yin, and Lixin Gao. DPillar: Scalable Dual-Port Server
Interconnection for Data Center Networks. In Proceeding of 26th International
Conference on Computer Communication Networks, 2010.
[62] Gongqi Lin, Sieteng Soh, and Kwan-Wu Chin. Energy-aware traffic
engineering with reliability constraint. Computer Communications, 57:115–128, 2015.
[63] Minghong Lin, Adam Wierman, Lachlan L. H. Andrew, and Eno Thereska.
Dynamic Right-sizing for Power-proportional Data Centers. In INFOCOM,
2011.
[64] Priya Mahadevan, Puneet Sharma, Sujata Banerjee, and Parthasarathy Ran-
ganathan. A power benchmarking framework for network devices. In Pro-
ceedings of the 8th International IFIP-TC 6 Networking Conference, NET-
WORKING ’09, pages 795–808, Berlin, Heidelberg, 2009. Springer-Verlag.
[65] M. Mandviwalla and Nian-Feng Tzeng. Energy-efficient scheme for
multiprocessor-based router linecards. In Applications and the Internet, 2006.
SAINT 2006. International Symposium on, pages 8–163, Jan 2006.
[66] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry
Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow:
Enabling Innovation in Campus Networks. In SIGCOMM CCR, pages 69–74,
2008.
[67] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: Eliminating Server
Idle Power. In Proceeding of the 14th ACM International Conference on Ar-
chitectural Support for Programming Languages and Operating Systems (AS-
PLOS 2009), pages 205–216, March 2009.
[68] Jayaram Mudigonda, Praveen Yalagandula, Mohammad Al-Fares, and
Jeffrey C. Mogul. SPAIN: COTS Data-Center Ethernet for Multipathing over
Arbitrary Topologies. In NSDI, 2010.
[69] Radhika Niranjan Mysore, Andreas Pamboris, Nathan Farrington, Nelson
Huang, Pardis Miri, Sivasankar Radhakrishnan, Vikram Subramanya, and
Amin Vahdat. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center
Network Fabric. In SIGCOMM, pages 39–50, 2009.
[70] Sergiu Nedevschi, Lucian Popa, Gianluca Iannaccone, Sylvia Ratnasamy, and
David Wetherall. Reducing Network Energy Consumption via Sleeping and
Rate-Adaptation. In NSDI, 2008.
[71] Costin Raiciu, Sebastien Barre, Christopher Pluntke, Adam Greenhalgh, Da-
mon Wischik, and Mark Handley. Improving Datacenter Performance and
Robustness with Multipath TCP. In SIGCOMM, pages 266–277, 2011.
[72] Forrester Research. Power and Cooling Heat Up the Data Center. September
2011.
[73] Brunilde Sanso and Hakim Mellah. On Reliability, Performance and Internet
Power Consumption. In Proceedings of 7th International Workshop on the
Design of Reliable Communication Networks, Oct 2009.
[74] Ji-Yong Shin, Bernard Wong, and Emin Gün Sirer. Small-world datacenters.
In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC ’11,
pages 2:1–2:13, New York, NY, USA, 2011. ACM.
[75] Suresh Singh and Candy Yiu. Putting the Cart Before the Horse: Merging
Tra c for Energy Conservation. In IEEE Communications Magazine. June
2011.
[76] Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey. Jellyfish:
Networking Data Centers Randomly. In NSDI, 2012.
[77] T. Song, W. Fu, O. Ormond, M. Collier, and X. Wang. Energy evaluation of
gigabit routers towards energy efficient network. In Local Metropolitan Area
Networks (LANMAN), 2014 IEEE 20th International Workshop on, pages
1–5, May 2014.
[78] Tian Song, Xiangjun Shi, and Xiaowei Ma. Fine-grained power scaling
algorithms for energy efficient routers. In Proceedings of the Tenth ACM/IEEE
Symposium on Architectures for Networking and Communications Systems,
ANCS ’14, pages 197–206, New York, NY, USA, 2014. ACM.
[79] Theophilus Benson, Ashok Anand, Aditya Akella, and Ming Zhang.
Understanding Data Center Traffic Characteristics. In WREN, 2009.
[80] Guohui Wang, David G. Andersen, Michael Kaminsky, Konstantina Papa-
giannaki, T. S. Eugene Ng, Michael Kozuch, and Michael Ryan. c-Through:
Part-time Optics in Data Centers. In SIGCOMM, pages 327–338, 2010.
[81] Xiaodong Wang, Yanjun Yan, Xiaorui Wang, Kefa Lu, and Qing Cao.
CARPO: Correlation-aware Power Optimization in Data Center Networks.
In INFOCOM, pages 1125–1133, 2012.
[82] Indra Widjaja, Anwar Walid, Yanbin Luo, Yang Xu, and H. Jonathan Chao.
Switch Sizing for Energy Efficient Datacenter Networks. In Proceedings
GreenMetrics 2013 Workshop (in conjunction with ACM Sigmetrics 2013),
Pittsburgh, PA, June 2013.
[83] A. Wierman, L. L. H. Andrew, and A. Tang. Power-Aware Speed Scaling in
Processor Sharing Systems. In Proceedings of the 28th Annual IEEE Confer-
ence on Computer Communications. (INFOCOM 2009), April 2009.
[84] Christo Wilson, Hitesh Ballani, Thomas Karagiannis, and Ant Rowstron. Bet-
ter Never than Late: Meeting Deadlines in Datacenter Networks. In SIG-
COMM, pages 50–61, 2011.
[85] Candy Yiu and Suresh Singh. Merging Traffic to Save Energy in the
Enterprise. In E-Energy, 2011.
