A regulated series-connected power delivery architecture for server clusters in highly efficient datacenters by McClurg, Josiah
© 2014 Josiah C. McClurg
A REGULATED SERIES-CONNECTED POWER DELIVERY
ARCHITECTURE FOR SERVER CLUSTERS IN HIGHLY EFFICIENT
DATACENTERS
BY
JOSIAH C. MCCLURG
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2014
Urbana, Illinois
Adviser:
Assistant Professor Robert C. N. Pilawa-Podgurski
Abstract
The series-connected power delivery architecture is an essentially lossless way
to deliver power directly from a high voltage bus to a number of low-voltage
loads. The benefits of this architecture can be combined with voltage regu-
lation of individual nodes through a technique known as differential power
processing. In contexts involving large numbers of high-power low-voltage
loads like data centers, this combined power delivery and regulation tech-
nique can provide an order of magnitude loss reduction over the existing
parallel power distribution architecture. This work describes the motivation,
background, theory, design, implementation, and evaluation of the world’s
first fully-operational series-connected server cluster.
I would like to playfully dedicate this work to Scary Movie 5 legend Charlie
Sheen, for reminding me that winning isn’t everything – especially if it
requires something like tiger blood or Adonis DNA.
In all seriousness, I would not have made it even this far, if it had not been
for the love and encouragement of my parents Fred and Marty and my
siblings Jedidiah, Moriah, and Micah. To them, I dedicate everything good
that comes out of my time in Illinois.
Acknowledgments
This material is based upon work supported by the National Science
Foundation Graduate Research Fellowship Program under Grant Number
DGE-1144245, Texas Instruments, and Google. I would like to thank my advisor
for his invaluable assistance and support, and my colleagues within the Pilawa
Group for their continued personal friendship and professional inspiration.
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Chapter 1 Introduction
1.1 The Energy Cost of the Internet
Chapter 2 Series-Connected Power Delivery
2.1 Motivating the Paradigm Shift
2.2 Challenges
2.3 Differential Power Processing
2.4 DPP Topology Selection
Chapter 3 Background and Previous Work
3.1 Early History
3.2 Switch-mode DPP
3.3 DPP for Digital Loads
Chapter 4 Theory
4.1 Load Model Assumptions
4.2 General Applicability of Series-Stacked Architecture
4.3 Applicability of Element-to-Element DPP Topology
Chapter 5 Prototype Converter
5.1 Key Hardware Features
5.2 Control and Regulation
Chapter 6 Experiment
6.1 Series-Stacked Operation
6.2 Experimental Results
Chapter 7 Conclusions
Appendix A Prototype Converter Module Design Files
Appendix B Related Software Projects
B.1 Web Traffic Generator
B.2 Queuing Behavior Simulator
B.3 Distributed Load Balancing Emulator
Appendix C Setting Up a Cluster
C.1 Installing Ubuntu Hardy
C.2 Setting up NFS share
C.3 Setting up the TFTP server and PXE bootloader
C.4 Setting up DHCP server and Compiling Kerrighed
References
List of Tables
2.1 Efficiency improvement for simulated 10 kW server rack (95% confidence).
4.1 Simulation parameters for general DPP applicability simulation.
4.2 Simulation parameters for element-to-element DPP applicability simulation.
4.3 Simulation parameters for pulsed load simulation.
5.1 Per-converter design parameters.
5.2 Selected components.
6.1 Bus voltage sequence for startup and shutdown.
6.2 Comparison of architecture-related losses.
B.1 Model parameters.
B.2 Varying model parameters of this study.
B.3 Constant model parameters of this study.
B.4 Metrics of interest.
B.5 Metrics of interest (95% confidence).
B.6 Sign table for sensitivity analysis (95% confidence).
B.7 Effect estimates.
List of Figures
1.1 Example energy allocation in typical data center.
1.2 Conventional power distribution architecture.
2.1 Series-connected architecture.
2.2 Parallel architecture.
2.3 Differential power processing concept.
2.4 Water fountain analogy of differential power processing.
2.5 Series-connected power delivery topology chosen for this work.
2.6 Simulation illustrating DPP conversion loss reduction.
3.1 Selected DPP technology influences.
4.1 Constant power model.
4.2 Constant resistance model.
4.3 Constant current model.
4.4 Measured CPU power consumption CDF compared to Gaussian CDF.
4.5 Simulation verification of general DPP applicability formula.
4.6 Reference for element-to-element circuit analysis.
4.7 Comparing conversion loss calculation for high-efficiency converters.
4.8 Comparing conversion loss calculation for mid-efficiency converters.
4.9 Comparing conversion loss calculation for low efficiency converters.
4.10 Expected loss reduction.
4.11 Detail of current profile of server under web traffic load.
4.12 Percentage processed out of total power delivered.
4.13 Example load profile (100 pulse/s).
5.1 Schematic drawing of experimental implementation of series-stacked server system.
5.2 Typical voltage regulation.
5.3 Converter efficiency and phase shedding.
5.4 Reference schematic for control discussion.
5.5 Binary integral control.
5.6 Light load control reference for buck mode.
5.7 Light load control reference for boost mode.
5.8 PFM with pulse width 2 and incorrect pulse direction.
5.9 PFM with pulse width 2, correct pulse direction, and excessive pulse frequency.
5.10 PFM with pulse width 2, correct pulse direction.
5.11 Light load current direction detect circuit.
5.12 Operation of light load current direction detect.
5.13 Efficiency improvement of light load optimization.
6.1 “Soft start” circuit in general-purpose PSU.
6.2 Bus initialization sequence.
6.3 Single system image cluster environment.
6.4 Cluster configuration for computational load experiment.
6.5 Parallel architecture.
6.6 Series architecture.
6.7 National Instruments data acquisition unit.
6.8 Optiplex servers.
6.9 Representative time domain power loss comparison.
A.1 Top level schematic.
A.2 High side current sense module.
A.3 UART isolator module.
A.4 Server switch module.
A.5 Microcontroller module.
A.6 Four-phase power stage module.
A.7 Single-phase submodule of power stage.
A.8 Four-phase inductor block module.
A.9 PCB mask for top layer.
A.10 PCB mask for inner ground layer.
A.11 PCB mask for inner route layer.
A.12 PCB mask for bottom layer.
B.1 Constant-variance binning.
B.2 Google search trends for Chicago datacenter.
B.3 Threadsafe producer/consumer queue structure.
B.4 Measured traffic rate versus generated traffic rate for one traffic generator computer.
B.5 Measured traffic rate versus generated traffic rate for cluster of four traffic generator computers.
B.6 Measured traffic rate versus generated traffic rate for cluster of four traffic generator computers.
B.7 Proposed rate-to-power mapping.
B.8 Beaglebone Black with nginx server mean power consumption (single-threaded configuration).
B.9 Beaglebone Black with nginx server mean power consumption (multi-threaded configuration).
B.10 Beaglebone measurement setup.
B.11 Dell Optiplex GX280 with Apache server mean power consumption (multi-threaded configuration).
B.12 Power-aware routing policy.
B.13 FIFO routing policy.
B.14 Memory saving strategy for similar objects.
B.15 Mean energy under different simulation runs.
B.16 Mean latency under different simulation runs.
B.17 Existing system architecture.
B.18 Improved system architecture.
List of Abbreviations
DPP Differential power processing, or differential power processor
(depending on the context)
KCL Kirchhoff’s current law
KVL Kirchhoff’s voltage law
I_n Current draw of server n
i_l,n Low-side current injection of DPP n
i_h,n High-side current rejection of DPP n
V_n Differential voltage across server n
V_bus DC bus voltage for cluster
I_s Current delivered to cluster from DC bus
Chapter 1
Introduction
1.1 The Energy Cost of the Internet
In many ways, it could be said that the Internet, more than any other tech-
nology, has enabled the increasingly global and information-based society in
which we live. More and more, data center users and operators are becoming
aware of the environmental and commercial impact of the energy consumed
by the Internet technology infrastructure [1]. In 2013, a survey of 46 North
American data centers found the average data center power consumption to be
over 1 MW [2], more than the average power required by 800 U.S.
homes [3]. In addition to the fundamental environmental concerns with the
large energy expenditure of the Internet, data center leaders are becoming
more and more conscious of energy costs – especially as the number of data
centers, and the power density per data center, continue to grow [4].
1.1.1 Three Main Datacenter Energy Allocations
A typical energy allocation of a modern data center is shown in Fig. 1.1.
The PSU and on-board losses are typical of low-cost servers which are not
optimized for saving energy. All other data was taken from [5]. On average,
the largest portion of the power incoming to a data center is spent on the
computational work. Because servers and other network equipment do not
typically process or exchange information in a thermodynamically reversible
manner, essentially all of the energy needed to complete the computational
effort is dissipated as heat, rather than being eventually returned to the
distribution bus. While there have been some efforts to recover some of
this dissipated thermal energy to serve a useful purpose [6], this is not the
common practice. In most data centers, the computationally generated heat
is the direct cause of the second largest portion of incoming data center
energy. Due to the increasingly small die size, heat extraction
is a significant engineering challenge at all size domains, requiring at least
forced-air cooling within server enclosures, and in almost all cases requiring
active building-wide cooling [7].
Figure 1.1: Example energy allocation in typical data center. Electrical
power input: chiller 23%, humidifier 3%, CRAC/CRAH 15%, IT equipment 47%,
PDU 3%, UPS 6%, lighting / aux devices 2%, switchgear / generator 1%.
Additional labels: PSU losses 20%, on-board losses 30%, computational ICs.
Figure 1.2: Conventional power distribution architecture. Diagram blocks:
power grid, backup generator, uninterruptible power supply, HV distribution
bus, power distribution units, server rack enclosures, and server blades
(power supply and compute hardware).
Aside from computational heat and the energy used to extract that heat,
the next largest energy consumer within data centers is the power conversion
and distribution infrastructure. As is shown in Fig. 1.2, a minimal data
center power distribution architecture places an uninterruptible power supply,
a high voltage in-building power distribution bus, and a power converter
between each server’s compute hardware and the power grid distribution
network. While the percentage efficiencies of each power stage can be quite
high, the fact that the total energy required by the servers passes through
multiple conversion stages means that the power conversion infrastructure
can use large quantities of power.
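The compounding effect of cascaded conversion stages can be illustrated with a
short calculation. The following is a minimal sketch only: the number of stages
and the efficiency values are assumed for illustration and are not measurements
from this work.

```python
from math import prod


def cascaded_efficiency(stage_efficiencies):
    """Overall efficiency of power conversion stages connected in cascade."""
    return prod(stage_efficiencies)


def conversion_loss(load_power_w, stage_efficiencies):
    """Power dissipated in the conversion chain while the load draws load_power_w."""
    eta = cascaded_efficiency(stage_efficiencies)
    input_power_w = load_power_w / eta
    return input_power_w - load_power_w


# Hypothetical stage efficiencies (assumed values, not taken from this work):
# UPS, power distribution unit, server power supply, on-board regulator.
stages = [0.94, 0.97, 0.90, 0.85]

overall = cascaded_efficiency(stages)
loss_w = conversion_loss(1000.0, stages)
print(f"overall efficiency: {overall:.3f}")
print(f"conversion loss per 1 kW delivered: {loss_w:.0f} W")
```

Even though each assumed stage is individually efficient, the product of the
stage efficiencies is noticeably lower than any single stage, which is the
effect described above.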
1.1.2 Ways to Reduce Datacenter Energy
There are other energy costs associated with maintaining an adequate
workplace environment for the humans that operate the data center,
including lighting, office air conditioning/heating, and administrative
IT equipment. However, the three discussed here (computational load, cooling,
and electrical power conversion) represent those aspects of data center
energy allocation which have a first-principles link with the processing power
or speed of the data center.
Rather than pursuing traditional techniques to reduce the losses in the
previously-mentioned areas, this work investigates a paradigm shift in power
delivery designed to reduce the energy losses associated with electric power
conversion. This approach to reducing data center energy cost is taken for
a variety of reasons, foremost of which is that while standard engineering
solutions to mitigate the demands of all three of the fundamental energy
consumers discussed in the previous section have been available for decades,
all of these techniques have begun reaching the realm of diminishing returns
in recent years [4].
Advances in fields like adiabatic computing (also known as “charge re-
cycling”) are theoretically capable of producing processors with computa-
tional power comparable to that found in modern servers at significantly
reduced power consumption by using thermodynamically reversible compu-
tation techniques [8]. Unfortunately, these techniques are available only at
lower switching frequencies and require more board area per gate, result-
ing in larger, and significantly more complex (and therefore more expensive)
chips. Because high processor speed and low processor cost remain higher
priorities than energy efficiency in today’s network climate, there has been
limited commercial adoption of fundamentally low-power servers within mod-
ern data centers. Similarly, the area of “power aware computing” has proven
that certain software scheduling techniques can be used to limit the power
consumption of a server without significantly reducing its performance [9].
While these techniques are reaching an advanced stage of theoretical
development, at the time of this writing, they have yet to garner substantial commercial
deployment, due to the cost and effort associated with re-designing existing
software workflows or network routing behavior. Thus, barring a global re-
duction in Internet consumption, or an industry-wide decision to artificially
throttle network speeds, the present plateau [10] in computations-per-joule
is likely to result in an ever-increasing fundamental energy demand from the
servers themselves.
Along with this increased server power demand will necessarily come increased
cooling requirements. Much work has already been done on improved airflow
designs within server chassis [11] and datacenters [12]. While further
reduction in the power losses associated with cooling may lie in optimized
utilization of low-temperature reservoirs, such as “passive geothermal”
cooling [13], this is a costly solution, and its availability is climate-dependent.
Likewise, energy can be saved through “free” cooling by choosing to build
new data centers in cooler climates and directly making use of outside air [14].
However, despite the increasing number of data centers being built in cold
regions [15], the fact remains that new data centers continue to be built in
warm and tropical climates that do not offer the advantage of outside-air
cooling throughout the whole year.
Because the electrical conversion losses in modern data centers are cur-
rently tied directly to the total power consumption, conversion losses may
also be expected to increase in years to come. The two standard methods of
combating this trend have been to reduce the number of conversion stages,
and to improve efficiency of each power converter. While there has been some
discussion [16] regarding the relative merits of using alternating or direct cur-
rent for the distribution bus shown in Fig. 1.2, the majority consensus [17]
seems to be that a high voltage DC distribution bus allows for fewer con-
version stages and therefore represents a fundamentally more efficient way
to deliver power to servers. Thus, several large industry groups have
begun switching to DC power distribution. The demands imposed on server
power supplies are a challenging six-way tradeoff among cost, reliability,
size, peak power rating, average power rating, and efficiency. Even under
these stringent requirements, industry power supplies are reaching
efficiencies in the upper 90% range [12]. Once it becomes technically infeasible or
cost-ineffective to further increase power supply efficiency, conversion losses
will continue to rise in lock-step with increased power requirements, along with the
increased cooling losses previously discussed. It should be noted that the
series-connected architecture proposed in this work makes use of a highly-
efficient DC distribution bus, and draws on some of the advances made for
high efficiency power supplies. Thus, it is not a radical departure from the
current state of the art, but instead leverages existing techniques in a unique
way to take an order-of-magnitude leap forward in data center efficiency.
1.1.3 Thesis Outline
The work of this thesis distinguishes itself from the traditional methods of
reducing data center energy consumption by decoupling power delivery from
power conversion through a change in power delivery architecture. By sig-
nificantly reducing the amount of power converted, the conversion losses can
be reduced substantially without the need for more efficient power supplies.
Importantly, this thesis presents results demonstrating that the conversion
losses can be made truly independent of the average power delivered to the
servers. Thus, the techniques presented here will offer an ever-increasing ben-
efit over the traditional power distribution architecture, as the total power
delivery requirements of today’s data centers continue to increase.
Chapter 2 introduces the series-connected power distribution architecture,
and the related concept of differential power processing. Chapter 3 provides
an abbreviated technological genealogy of the previous research leading up
to the work presented in this thesis, emphasizing its interdisciplinary nature
and broad scope. Chapter 4 discusses a number of results that provide the
theoretical framework on which this work is based and give fundamental in-
sights into both the range and the limitations of its applicability. Chapter 5
explains design details of how the prototype DPP converter modules were
implemented. Chapter 6 describes in detail the experimental implementa-
tion of the series-connected server cluster, and presents the results of a case
study comparison between the series-connected architecture and a best-in-
class parallel distribution system. The appendices contain a variety of useful
information (schematics, software code, tutorials, etc.) designed to assist
readers in replicating the experimental results of this work and in jumpstart-
ing future efforts.
Chapter 2
Series-Connected Power Delivery
2.1 Motivating the Paradigm Shift
As was previously discussed, the fundamental challenge of today’s power dis-
tribution architectures is that the high-power, low-voltage loads cannot draw
their power directly from the distribution bus. Instead, all of the delivered
power must first pass through step-down voltage conversion stages, which
fundamentally have limited practical efficiencies. Because of this intrinsic
coupling between conversion losses and total delivered power, losses may
be expected to increase as outdated servers are upgraded to achieve higher
performance or are utilized to a greater capacity in the near future. Even
today, some of the larger individual server racks consume upwards of 20 kW
of average power [2], in which case even highly efficient converters generate
kilowatts of waste heat.
From the previous discussion, it is clear that the ideal scenario would be
to decouple the conversion losses within a data center from the total power
delivered to the servers themselves. Shown in Fig. 2.1 is the proposed
series-connected power delivery architecture that achieves this goal. When the
servers are connected electrically in series, the current required by the first
server is allowed to flow directly from a high voltage DC distribution bus.
Instead of returning immediately to ground, this current is then “recycled”
again and again as it passes through each successive server. Thus, compared
to the standard parallel distribution architecture of Fig. 2.2, the series
architecture delivers the same total amount of power to its loads, but it does
so in a more direct manner. In this way, the series-connected architecture
allows the conversion losses of a cluster to become truly independent of the
total power delivered to its servers.
Figure 2.1: Series-connected architecture.
Figure 2.2: Parallel architecture.
2.2 Challenges
2.2.1 Communication
Despite the promising advantages of the series-stacked architecture, there
are some fundamental challenges that make it difficult to apply within the
context of data centers. The problem of communication across voltage do-
mains may seem to present a significant design issue – as it does at the
chip level [18]. However, in the specific case of server loads, isolated com-
munication is easily achieved by outfitting servers with standards-compliant
Ethernet interfaces (which guarantee at least 1.5 kV DC isolation [19, 20]),
or with high-speed optical fiber network cards (which represent inherently
isolated communication channels). In addition to reducing the design costs
associated with architecture implementation, the fact that existing commu-
nication hardware can be re-used without modification in a stacked voltage
domain configuration adds substantial value to this architecture in the
context of data centers.
2.2.2 Reliability and Voltage Regulation
Considering that traditionally-encountered factors such as temperature and
server workload can still present serious difficulties in data center reliabil-
ity [21], it is not surprising that the series-connected power distribution ar-
chitecture brings along with it a set of reliability and maintenance concerns.
Cascaded failure is an important possibility to be addressed in the case of
stacked server clusters, because all servers share the same line current, and
none are isolated from the distribution bus.
The fundamental cause of the reliability challenge is the fact that the
series-connected distribution architecture does not provide inherent voltage
limiting behavior. As is illustrated in Fig. 2.1, if the average server currents
I_load,i differ, Kirchhoff’s current law indicates that the difference in average
current between each successive server must be supplied by the capacitor
banks at each node. Since this scenario results in a nonzero average current
into the capacitor banks, the voltage at each node will drift apart. Thus, if
the computational loads of the servers are not intelligently managed as in [22],
or if the computational load balancing software experiences a fault, the input
voltages may drift outside their design limits and damage the servers.
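The drift mechanism described above can be sketched numerically. The model
below is a deliberately simplified assumption, not the analysis used in this
work: a stiff bus, an identical ideal capacitor across every server, and
constant-current loads, so that the string current equals the mean load
current. The capacitance, current, and voltage margin values are hypothetical.

```python
def node_voltage_drift(server_currents_a, capacitance_f):
    """Rate of change (V/s) of each server voltage in an unregulated series stack.

    Simplifying assumptions: stiff bus voltage, one identical capacitor across
    each server, constant-current loads. Under these assumptions the string
    current equals the mean load current, and each capacitor absorbs the
    difference between the string current and its own server current.
    """
    string_current_a = sum(server_currents_a) / len(server_currents_a)
    return [(string_current_a - i_n) / capacitance_f for i_n in server_currents_a]


def time_to_limit(drift_v_per_s, margin_v):
    """Seconds until a voltage drifting at a constant rate consumes margin_v."""
    if drift_v_per_s == 0:
        return float("inf")
    return margin_v / abs(drift_v_per_s)


# Hypothetical values: four servers, 10 mF per node, 1 V of allowed deviation.
currents = [5.0, 5.0, 5.5, 4.5]
drift = node_voltage_drift(currents, capacitance_f=10e-3)
for n, rate in enumerate(drift, start=1):
    limit_ms = time_to_limit(rate, margin_v=1.0) * 1e3
    print(f"server {n}: dV/dt = {rate:+.1f} V/s, margin used up in {limit_ms:.0f} ms")
```

In this sketch, a server drawing less than the mean current sees its voltage
rise and a server drawing more sees it fall, so even a modest sustained
mismatch exhausts the voltage margin quickly unless the node voltages are
actively regulated.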
2.3 Differential Power Processing
While there are a variety of power electronics solutions to address the
previously discussed voltage regulation challenge, differential power processing (DPP) is
one technique that is well-suited to series-connected power delivery [23, 24].
In general, the DPP concept illustrated in Fig. 2.3 allows the individual
elements to maintain different average currents, while ensuring a desired voltage
at each node by re-routing the difference in average current between
successive nodes, such that the average current into each node’s capacitor bank is
zero. Since the DPP system is processing only the difference between
successive server currents, the technique is able to maintain the original objective
of decoupling conversion losses from the total energy delivered to the servers,
while maintaining server voltage regulation. To the extent that the average
differences between successive server currents can be kept small
compared to their average current, the DPP voltage conversion strategy can
not only reduce system power losses, but it can do so in a way which is
decoupled from the total power delivered.
Figure 2.3: Differential power processing concept.
Figure 2.4: Water fountain analogy of differential power processing. Left:
parallel configuration requires high-volume pumps, which leak more water.
Right: differential configuration allows lower-volume pumps and less leakage.
The basic operating principle of the DPP technique is illustrated by the
fountain-and-water-pump analogy of Fig. 2.4. In this analogy, the water
represents energy and flow rate represents power. In the left diagram of
Fig. 2.4, the full volume of water for each fountain must pass through a pump
which leaks some percentage of the water flowing through it, resulting in a
relatively large total rate of leakage. This scenario is analogous to how the
inherently-limited efficiencies of the parallel-configured DC/DC converters in
Fig. 2.2 result in relatively large total conversion losses, even if the percentage
loss of each individual converter remains small.
Figure 2.5: Series-connected power delivery topology chosen for this work
(element-to-element DPP).
In the right diagram of Fig. 2.4, the bulk of the water is delivered directly
to the fountains, with only a small volume of water being pumped to make
up the difference in height for the center fountain. Because very little water
is being pumped, leakage is decoupled from the total water delivered, and
the reduction in leakage is substantial. This is similar to how the bulk of
the energy can flow directly from the bus to the loads in Fig. 2.3, with only a
small portion of energy needing to be injected at each node. Since only a
small portion of energy is being processed, the losses remain relatively small,
even if the percentage loss of each individual converter is relatively large. A
detailed analysis of the differential power processing topology utilized in this
work can be found in Chapter 4.
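As a rough numerical illustration of the loss argument above, the sketch below
compares the two configurations under a simplifying assumption: each converter
dissipates a fixed fraction (one minus its efficiency) of the power it
processes. The server powers, the fraction of power processed, and the
efficiencies are hypothetical values chosen for illustration.

```python
def parallel_loss(server_powers_w, efficiency):
    """Conversion loss when all delivered power passes through a converter."""
    return (1.0 - efficiency) * sum(server_powers_w)


def differential_loss(processed_power_w, efficiency):
    """Conversion loss when only the mismatch power passes through converters."""
    return (1.0 - efficiency) * processed_power_w


# Hypothetical cluster: ten 300 W servers, with differential converters
# processing an assumed 10% of the total delivered power.
powers = [300.0] * 10
processed = 0.10 * sum(powers)

# The differential converters are deliberately given a lower assumed efficiency.
p_loss = parallel_loss(powers, efficiency=0.95)
d_loss = differential_loss(processed, efficiency=0.80)
print(f"parallel loss:     {p_loss:.0f} W")
print(f"differential loss: {d_loss:.0f} W")
```

Under these assumptions the differential configuration loses less power than
the parallel configuration even though its converters are individually less
efficient, because the loss scales with the power processed rather than the
power delivered.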
2.4 DPP Topology Selection
The basic design principle of any differential power processing topology is
to regulate the voltages of series-connected elements by injecting or reject-
ing small currents to each node in the stack, while allowing the major-
ity of the string current to flow directly through successive load elements.
There are a wide variety of differential power processing converter topolo-
gies that can perform this task, each with its own benefits and challenges.
The topology implemented in this work is known as the element-to-element
topology [25], and achieves voltage regulation by neighbor-to-neighbor power
transfer among the DPP modules. The circuit implementation this work uses
to achieve this type of power transfer is shown in Fig. 2.5.
Figure 2.6: Simulation illustrating DPP conversion loss reduction. Plot of
average power processed for 300 W servers: power (watts, 0 to 350) versus
server number (0 to 35), with series for power delivered to server and power
processed.
The primary downside of the element-to-element topology compared to
other differential power processing topologies is that because power transfer is
only possible between neighboring DPP modules, the system power losses are
sensitive to the location of power mismatch. For example, a power transfer
from the top server to the bottom server in the stack would incur more
losses than a power transfer between two neighboring servers. Therefore,
if server powers are not managed to some degree in software or exhibit a
large variance, the element-to-element topology can result in higher losses than some
of the other DPP topologies. However, as is illustrated by the 33-server
simulation of Fig. 2.6 and Table 2.1, the total amount of power converter
losses associated with this particular topology compared to the total losses
is quite low, even in the case of server power with relatively high variance.
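The location sensitivity described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a mismatch of `delta_w` watts is relayed through `hops` neighboring converters in sequence and that every converter has the same fixed efficiency `eta`. The function name and the numbers are hypothetical and are not taken from the simulation of Fig. 2.6.

```python
def relay_loss(delta_w, hops, eta=0.90):
    """Watts lost when delta_w is passed through `hops` converters in series.

    Assumption: every converter has the same efficiency eta, so the power
    that arrives after h hops is delta_w * eta**h.
    """
    return delta_w * (1.0 - eta ** hops)


# A 30 W mismatch between neighbors (1 hop) versus between the top and
# bottom servers of a 33-server stack (32 hops).
neighbor_loss = relay_loss(30.0, hops=1)
end_to_end_loss = relay_loss(30.0, hops=32)
print(f"neighbor: {neighbor_loss:.1f} W, end-to-end: {end_to_end_loss:.1f} W")
```

Under these assumptions the end-to-end transfer loses most of the transferred power, which is why the topology favors loads whose mismatch is small or local.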
In the particular application under investigation in this thesis (evenly-loaded
servers), the straightforward control and low-voltage design of the
power processing modules associated with neighbor-to-neighbor power trans-
fer far outweigh the small amount of “power routing overhead” previously
described. In addition to allowing for low-voltage switches and non-isolated
converters, the element-to-element strategy illustrated in Fig. 2.5 is attractive
from a control standpoint. If the voltages of each node are to be equalized
(such as is the case in a server cluster context), each differential power supply
module can run either current-mode or voltage-mode control locally, without
the need for any communication between individual DPP modules, thus allowing
the system to leverage a fast and straightforward distributed control
algorithm.

Table 2.1: Efficiency improvement for simulated 10 kW server rack (95%
confidence).

                                        Conventional    Series-Stacked
Server power [W]                        300 ± 30        300 ± 30
Distribution of server power            Gaussian        Gaussian
Number of servers in rack               33              33
Total power delivered to servers [W]    9900 ± 0.2      9900 ± 0.2
Power converter efficiency              90%             90%
Total power converted [W]               9900 ± 0.2      1773 ± 2
Power conversion losses [W]             990 ± 0.02      177 ± 0.2
Total system efficiency                 90%             98.2%
Factor reduction in losses              N/A             8.4 times
Chapter 3
Background and Previous Work
The practical and academically interesting research opportunities presented
by the DPP architecture rest on a base of industrial and scientific research
that spans decades. This section is intended to give a flavor of how the
concept of series-stacked loads and regulation through techniques similar
to DPP has been applied in a number of different inter-connecting fields.
Figure 3.1 is derived from a review of current and previous DPP research,
and sketches out a rough influence map of several salient DPP applications.
The elements of the figure are arranged vertically in (relative) chronological
order, and the arrows indicate interdisciplinary influence.
3.1 Early History
Since the invention of the first rechargeable battery in the late 19th century,
scientists and engineers have been concerned with improving the lifetime and
operation of systems of series-connected cells. Very early in the development
of the technology [26], it became evident that unequal charging and discharg-
ing rates of the individual cells had a detrimental effect on the performance of
the battery. For many years, battery makers addressed this problem through
more precise manufacturing techniques and improved electrochemical pro-
cesses. In the 1930s, around the same time that stage lighting saw the first
inklings of switching power converters [27], passive battery cell [28] and light-
fixture equalization circuits began appearing in U.S. patent offices. Since the
passive DPP topologies simply and cheaply achieved the desired effect in
these areas, the field of differential power processing did not see much fur-
ther development until the advent of transistor-based switching converters.
This kind of revolutionary device began appearing in the 1950s [29], and
found its way back into the scientific spotlight, as its small size, weight, and
high efficiency made switching converters a staple of the electric power system
for space applications [30]. During the 1960s and 1970s, transistor-based
switching converters came to flourish in the mid-range power market, and the
electric utility industry began to make use of new advances in semiconductors
(in particular, the silicon controlled rectifier, or SCR) in high power
applications. Though these applications typically required many SCRs to be
connected in series to attain a higher voltage rating [31], the reverse avalanche
characteristic of the semiconductor devices provided enough self-regulation to
preclude the need for external balancing of SCR voltage. And so, the notion
of DPP continued to be relegated to the world of passive cell balancing.

[Figure 3.1: Selected DPP technology influences. Timeline from 1890 to 2010 relating industrial HV power converters, cell equalization, differential power processing for PV, CMOS charge recycling, power aware computing, RSCPD for low-power devices, and RSCPD for server clusters; groupings: power electronics, photovoltaics.]
3.2 Switch-mode DPP
Finally, in the late 1980s and early 1990s, automatic cell voltage equalizer
circuits were introduced [32,33] that made use of switching power converters and
marked the birth of the first non-dissipative differential power processors. Of
interest is the fact that many of the high-power battery systems that applied these
load balancing techniques were used as storage within a photovoltaic context.
In fact, strings of series-connected photovoltaic cells [34] were among the
first non-battery applications of switch-mode DPP. The compelling efficiency
increases made possible by this technique continue to fuel a growing body of
academic and industrial research related to hardware design and control of
differential power converters within a photovoltaic context.
3.3 DPP for Digital Loads
Meanwhile, in a largely separate technology space, investigations into power
aware computing [35] and CMOS charge recycling [36, 37] began to appear
in computer science and IC design publications. Essentially, the growing
demand for smaller, more powerful mobile devices in the early 1990s kicked
off an energetic search for ways to reduce IC power consumption without
sacrificing performance. The computer scientists approached the problem
through software scheduling techniques which could extract more processing
power per watt than was previously possible, and the IC designers came up
with a variety of clever power-saving methods – among them, the technique
of recycling charge between busses.
The idea of charge recycling within the internal IC busses gave rise to
the notion of stacking block-level [38] and chip-level [39] voltage domains
in series to “recycle” leakage current and drastically decrease system power
consumption. Regulation of these voltage domains was initially attained
through the use of linear-mode differential power processors. However, draw-
ing from advances in photovoltaic DPP and power aware computing, it was
later demonstrated that the use of switching converters [40,41] and scheduling
techniques [42] allow for series-connected voltage domains with significantly
improved system efficiency. More recently, a variety of results have been ob-
tained that indicate excellent potential for scaling these gains beyond the chip
level, to apply series-connected voltage domains to high-power digital loads
such as servers [22, 25, 43]. While the architecture has been previously pro-
posed and analyzed to some extent, the results of this thesis represent the first
experimental work demonstrating a fully-operational series-connected server
cluster making use of differential power processing for voltage regulation.
Chapter 4
Theory
The intent of this chapter is to provide theoretical motivation for two foun-
dational aspects of the experimental work presented here: 1) The modeling
of servers as current sources, and 2) the use of a server loading profile which
is known to produce low-variance power consumption. A description of the
prototype converter hardware design parameters can be found in Chapter 5
and concepts related to the control of the series-stacked system and modules
can be found in Chapter 6.
4.1 Load Model Assumptions
When designing a combined power conversion and delivery system like the
one proposed in this thesis, it is important to have a reasonably accurate
model of the load. This section examines two load assumptions, with dif-
ferent physical motivations, then makes the claim that these can both be
accurately approximated by a simple constant current model because it is
assumed that the voltage regulation hardware maintains server terminal volt-
age close to the nominal voltage. Because of the regulation behavior, a first
order Taylor series expansion of the server voltage-time relationship is an ac-
curate approximation. In this manner, the analysis of the following sections
is applicable to differential power processing of both servers and of CMOS
devices.
4.1.1 Constant Power Model
The constant power model of Fig. 4.1 is inspired by the notion that most
modern servers utilize on-board DC/DC converters to deliver power to low-voltage
devices like the processor. This model assumes that the majority of
the incoming power to the server is delivered through these high-efficiency
DC/DC converters, causing the input terminals to behave as a constant
power load, with current being inversely proportional to terminal voltage.
Mathematically, the time-dependent behavior of Fig. 4.1 can be described by
Eq. (4.1), where the minus sign reflects the load discharging the capacitance C.

\begin{equation}
\frac{P}{V(t)} = -C \frac{dV(t)}{dt} \implies
V(t) = \frac{\sqrt{C V(0)^2 - 2Pt}}{\sqrt{C}} \approx V(0) - \frac{Pt}{C V(0)}
\tag{4.1}
\end{equation}

[Figure 4.1: Constant power model. Labels: PSU (DC/DC), CPU, C, V.]

[Figure 4.2: Constant resistance model. Labels: Logic, V, R, C.]
4.1.2 Constant Resistance Model
The constant resistance model of Fig. 4.2 is motivated by the well-known
power/voltage relationship for CMOS logic devices given in Eq. (4.2), where
P is the power consumption of the device, C is the capacitance, V is the
voltage, f is the frequency, and α is the activity factor of the device. If C, f,
and α are constant, then the device behaves as a resistor with value
$R = \frac{1}{\alpha C f}$.

\begin{equation}
P = \alpha C V^2 f \implies P = \frac{V^2}{R}
\tag{4.2}
\end{equation}

The mathematical time-varying behavior of Fig. 4.2 is given in Eq. (4.3),
where the minus sign reflects the load discharging the capacitance C.

\begin{equation}
\frac{V(t)}{R} = -C \frac{dV(t)}{dt} \implies
V(t) = V(0) e^{-\frac{t}{RC}} \approx V(0) - \frac{V(0)\, t}{RC}
\tag{4.3}
\end{equation}

[Figure 4.3: Constant current model. Labels: PSU+Logic, C, V, I.]
4.1.3 Constant Current Model
The constant current model shown in Fig. 4.3 is motivated by comparing
the first-order Taylor series approximations shown in Eqs. (4.1) and (4.3).
Substituting $I = \frac{P}{V(0)}$ and $I = \frac{V(0)}{R}$, respectively, into each
approximation results in Eq. (4.4). Thus, the simple constant-current model of
Fig. 4.3 holds regardless of whether the loads being referred to are servers or
are logic chips within a server. Since a constant current model is a
reasonably accurate representation of the server behavior when the voltage is
regulated, the “server load” will henceforth be represented as a current sink.

\begin{equation}
I = -C \frac{dV(t)}{dt} \implies V(t) = V(0) - \frac{I t}{C}
\tag{4.4}
\end{equation}
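As a rough numerical check of the argument above, the following sketch compares the exact constant-power and constant-resistance discharge curves against the constant-current line over a short interval. It assumes the load discharges the capacitance, and the values (12 V, 300 W, 13 mF, 100 µs) are illustrative choices rather than parameters from the analysis.

```python
import math

V0 = 12.0         # initial (nominal) server voltage [V], illustrative
P = 300.0         # server power [W], illustrative
C = 13e-3         # capacitance at the server terminals [F], illustrative
R = V0 ** 2 / P   # resistance drawing the same initial current
I = P / V0        # equivalent constant current


def v_const_power(t):
    """Exact solution of the constant power model."""
    return math.sqrt(V0 ** 2 - 2.0 * P * t / C)


def v_const_resistance(t):
    """Exact solution of the constant resistance model."""
    return V0 * math.exp(-t / (R * C))


def v_const_current(t):
    """Constant current model: the shared first-order approximation."""
    return V0 - I * t / C


t = 100e-6  # a short interval, so the voltage stays close to V0
err_power = abs(v_const_power(t) - v_const_current(t))
err_resistance = abs(v_const_resistance(t) - v_const_current(t))
print(f"constant power error:      {err_power * 1e3:.2f} mV")
print(f"constant resistance error: {err_resistance * 1e3:.2f} mV")
```

With these numbers both exact models stay within a few millivolts of the constant-current line, while the voltage itself has moved by roughly 190 mV. The approximation degrades as the voltage is allowed to drift further from V(0).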
4.2 General Applicability of Series-Stacked
Architecture
4.2.1 Gaussian Current Model Assumption
Since it is not always possible to estimate or control the current of a server or
a CPU core exactly, a probabilistic analysis is useful for understanding the
computational loading profiles under which the differential power processing
architecture is applicable. It will greatly assist the analysis to make the
assumption that server power is normally distributed. Physically, this is a
reasonable assumption, due to the nature of the manner in which digital
logic consumes power. Referring back to the discussion of Eq. (4.1), and how
it relates to a constant current model, the following discussion provides a
physical justification of modeling the current distribution as being Gaussian.
Assume that voltage and frequency scaling are operated at a slow enough
rate of change that V and f are constant with respect to instantaneous server
power. Now, let $I_i(t)$ be the indicator function for the condition that logical
block i within the CPU is active. Assume the scheduler and processor were
designed jointly such that the per-block activity factors $E[I_i(t)]$ have roughly
the same distribution. If the number of logical blocks is large, as in modern
architectures, the total activity factor $\sum_i E[I_i(t)] \sim X$, where X is normally
distributed. Since total activity factor is proportional to power consumption,
the power of a CPU under constant load is normally distributed. The CPU
dominates the power consumption of the whole motherboard, so the motherboard
power can also be approximated as being normally distributed.
This analysis is corroborated by the experimental results of Fig. 4.4, which
compares the measured CDF of a low-end desktop CPU with the CDF gener-
ated from a normal distribution with the same mean and standard deviation.
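The central-limit argument of this section can be sketched numerically. The example below assumes, as a simplification that the text does not state, that the logic blocks are independent and share one activity probability. The block count, probability, and sample count are hypothetical.

```python
import random
import statistics


def total_activity(n_blocks, p_active, rng):
    """Number of active logic blocks at one instant (sum of indicator variables)."""
    return sum(1 for _ in range(n_blocks) if rng.random() < p_active)


rng = random.Random(1)
samples = [total_activity(500, 0.3, rng) for _ in range(4000)]

mean = statistics.mean(samples)
std = statistics.pstdev(samples)

# For a normal distribution, roughly 68% of samples fall within one
# standard deviation of the mean.
within_one_sigma = sum(abs(s - mean) <= std for s in samples) / len(samples)
print(f"mean={mean:.1f}, std={std:.2f}, within one sigma={within_one_sigma:.3f}")
```

A fraction close to 0.68 is consistent with the Gaussian assumption. Correlated block activity would call for a different check.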
[Figure 4.4: Measured CPU power consumption CDF compared to Gaussian CDF. Series: measured CDF, Gaussian CDF.]

4.2.2 Circuit Analysis
Referring back to Fig. 2.3, it is assumed that each node is regulated to some
fixed voltage $V_{load,i}$ such that Eq. (4.5) holds according to Kirchhoff’s voltage
law.

\begin{equation}
\sum_{i=0}^{N-1} V_{load,i} = V_{bus}
\tag{4.5}
\end{equation}
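In this case, the total power processed and the losses experienced by
differential converters with an efficiency of η are given by Eq. (4.6).

\begin{equation}
P_{total,DPP} = \sum_{i=0}^{N-2} \left| V_{load,i+1} I_{inject,i} \right| \implies
P_{loss,DPP} = (1-\eta) \sum_{i=0}^{N-2} \left| V_{load,i+1} I_{inject,i} \right|
\tag{4.6}
\end{equation}

If it is further assumed that $V_{load,i}$ is regulated to $\frac{V_{bus}}{N}$ for each node, then
the differential power processing losses and the parallel case losses can be
compared by comparing only the sum of injected currents with the sum of
server currents, according to Eq. (4.7).

\begin{equation}
\begin{aligned}
P_{loss,DPP} &\lessgtr P_{loss,parallel} \iff \\
(1-\eta) \sum_{i=0}^{N-2} \left| V_{load,i+1} I_{inject,i} \right| &\lessgtr (1-\eta) \sum_{i=0}^{N-1} \left| V_{load,i} I_{load,i} \right| \iff \\
\sum_{i=0}^{N-2} \frac{V_{bus}}{N} \left| I_{inject,i} \right| &\lessgtr \sum_{i=0}^{N-1} \frac{V_{bus}}{N} \left| I_{load,i} \right| \iff \\
\sum_{i=0}^{N-2} \left| I_{inject,i} \right| &\lessgtr \sum_{i=0}^{N-1} \left| I_{load,i} \right|
\end{aligned}
\tag{4.7}
\end{equation}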
4.2.3 Probabilistic Analysis
As was hinted at during the discussion of fundamental DPP operation, the
applicability of DPP within the context of data centers is confined to sys-
tems of digital loads which can be managed in such a way that the average
instantaneous current difference between one series-connected load and the
next is small compared to the average instantaneous string current. The
simple reason for this is that the primary benefit of connecting digital loads
in series is to reduce the amount of energy lost in conversion. This goal is
not achieved if the differential power processors within a particular series-
connected configuration must convert an equal or greater amount of power
than would be converted in the standard parallel configuration.
Mathematically speaking, if the load voltages are equal, and the load current
is a random variable $I_i$, then DPP offers better performance than the
standard parallel configuration when the mean of the sum of absolute
differences between adjacent load currents is less than the mean of the sum of
all currents.
According to Kirchhoff’s current law, the injected currents $I_{inject,i}$ are given
by Eq. (4.8).

\begin{equation}
I_{inject,i} = I_i - I_{i+1}
\tag{4.8}
\end{equation}

In this analysis, it is assumed that there are n servers in the stack, and
that the server currents $I_i$ are independent and identically distributed
normal random variables with mean µ and standard deviation σ. Invoking the
result of [44] shown in Eq. (4.9), the applicability of DPP to a given
type of digital load can be related to the coefficient of variation of the
current. Specifically, the results of Eq. (4.10) specify that the best-case
series-connected power delivery system with voltages regulated using the
differential power processing technique will suffer less loss than the best-case
parallel-connected power delivery system as long as the coefficient of variation
of the server currents remains less than about $\frac{\sqrt{\pi}}{2}$.

\begin{equation}
\sum_{i=1}^{n-1} E\left[ |I_i - I_{i+1}| \right] = (n-1) \frac{2\sigma}{\sqrt{\pi}}
\tag{4.9}
\end{equation}

\begin{equation}
\begin{aligned}
E\left[ \sum_{i=1}^{n-1} |I_i - I_{i+1}| \right] &< E\left[ \sum_{i=1}^{n} I_i \right] \implies \\
\sum_{i=1}^{n-1} E\left[ |I_i - I_{i+1}| \right] &< \sum_{i=1}^{n} E\left[ I_i \right] \implies \\
(n-1) \frac{2\sigma}{\sqrt{\pi}} &< n\mu \implies \\
\frac{\sigma}{\mu} &< \frac{n\sqrt{\pi}}{2(n-1)}
\end{aligned}
\tag{4.10}
\end{equation}

[Figure 4.5: Simulation verification of general DPP applicability formula. Plot title: "Comparison between DPP and parallel distribution for 33-server 10kW rack"; x-axis: coefficient of variation; y-axis: power losses (W); series: parallel system, general DPP.]

Table 4.1: Simulation parameters for general DPP applicability simulation.

Simulation Parameter                  Value
Mean server power [W]                 303.03
Distribution of server power          Gaussian
Number of servers in rack             33
Power converter efficiency            90%
Number of experiments                 19
Coefficients of variation             0.1828 - 4.567
Number of samples per experiment      100 - 100000
4.2.4 Simulation Results
The simulation results of Fig. 4.5 from the experiment of Table 4.1 verify the
formula of Eq. (4.10) using Monte-Carlo methods. As is shown in the figure,
the general-case DPP losses start to exceed the parallel-distribution losses at
a coefficient of variation of 0.9139, which corresponds to the value obtained
from Eq. (4.10).
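The threshold quoted above follows directly from Eq. (4.10), and the half-normal result behind Eq. (4.9) can be spot-checked by sampling. The sketch below is a minimal illustration and not the simulation of Table 4.1. The function names, sample count, and seed are arbitrary choices.

```python
import math
import random


def cv_threshold(n):
    """Right-hand side of Eq. (4.10): n * sqrt(pi) / (2 * (n - 1))."""
    return n * math.sqrt(math.pi) / (2.0 * (n - 1))


def mean_abs_adjacent_difference(sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[|I_i - I_(i+1)|] for i.i.d. normal currents.

    The mean of the currents cancels in the difference, so zero is used.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += abs(rng.gauss(0.0, sigma) - rng.gauss(0.0, sigma))
    return total / n_samples


threshold = cv_threshold(33)
estimate = mean_abs_adjacent_difference(1.0)
expected = 2.0 / math.sqrt(math.pi)   # per-pair term of Eq. (4.9) with sigma = 1

print(f"CV threshold for 33 servers: {threshold:.4f}")   # 0.9139
print(f"sampled {estimate:.4f} vs closed form {expected:.4f}")
```

As n grows, `cv_threshold(n)` approaches $\frac{\sqrt{\pi}}{2} \approx 0.886$, the approximate limit quoted in Section 4.2.3.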
4.3 Applicability of Element-to-Element DPP
Topology
4.3.1 Circuit Analysis
General-Case Converters
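Referring to Fig. 4.6 and applying KCL gives the results of Eq. (4.11).

\begin{equation}
\begin{aligned}
I_0 + i_{l0} &= I_1 + i_{h1} \\
I_n + i_{ln} &= I_{n+1} + i_{h(n+1)} + i_{l(n-1)} - i_{h(n-1)}, \quad \text{for } 0 < n < N \\
I_{N-1} + i_{l(N-1)} &= I_N + i_{l(N-2)} - i_{h(N-2)} \\
I_0 + i_{h0} &= I_N + i_{h(N-1)} - i_{l(N-1)}
\end{aligned}
\tag{4.11}
\end{equation}

[Figure 4.6: Reference for element-to-element circuit analysis.]

Arranging these equations in the manner of Eq. (4.12) assists in
creating a matrix form.

\begin{equation}
\begin{aligned}
i_{h1} - i_{l0} &= I_0 - I_1 \\
-i_{h(n-1)} + i_{h(n+1)} + i_{l(n-1)} - i_{ln} &= I_n - I_{n+1}, \quad \text{for } 0 < n < N \\
-i_{h(N-2)} + i_{l(N-2)} - i_{l(N-1)} &= I_{N-1} - I_N \\
i_{h(N-1)} - i_{h0} - i_{l(N-1)} &= I_0 - I_N
\end{aligned}
\tag{4.12}
\end{equation}

Eq. (4.12) is equivalent to matrix equation Eq. (4.13), with
$I = \begin{bmatrix} I_0 & I_1 & \ldots & I_N \end{bmatrix}^T$.

\begin{equation}
\begin{bmatrix}
0 & 1 & 0 & \cdots & \cdots \\
-1 & 0 & 1 & 0 & \cdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\cdots & \cdots & 0 & -1 & 0
\end{bmatrix} i_h +
\begin{bmatrix}
-1 & 0 & \cdots & \cdots & \cdots \\
1 & -1 & 0 & \cdots & \cdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\cdots & \cdots & 0 & 1 & -1
\end{bmatrix} i_l =
\begin{bmatrix}
1 & -1 & 0 & \cdots & \cdots \\
0 & 1 & -1 & 0 & \cdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\cdots & \cdots & 0 & 1 & -1
\end{bmatrix} I
\tag{4.13}
\end{equation}

Now, from KVL, we have $\sum V_n = V_{bus}$. Again assuming ideal converters,
the results of Eq. (4.14) hold.

\begin{equation}
V_n = D_{n-1} (V_{n-1} + V_n) \text{ for } 0 < n \le N \implies
-D_{n-1} V_{n-1} + (1 - D_{n-1}) V_n = 0
\tag{4.14}
\end{equation}

Applying KVL, $\sum V_n = V_{bus}$. If the converters are programmed to locally
achieve $R_n (V_n + V_{n+1}) = V_{n+1}$, then the results of Eq. (4.15) hold.

\begin{equation}
V_n = R_{n-1} (V_{n-1} + V_n) \text{ for } 0 < n \le N \implies
-R_{n-1} V_{n-1} + (1 - R_{n-1}) V_n = 0
\tag{4.15}
\end{equation}

In matrix form, Eq. (4.15) becomes Eq. (4.16), with
$V = \begin{bmatrix} V_0 & V_1 & \ldots & V_N \end{bmatrix}^T$.

\begin{equation}
\begin{bmatrix}
1 & 1 & \cdots & 1 \\
-R_0 & 1 - R_0 & 0 & \cdots \\
0 & \ddots & \ddots & 0 \\
\cdots & 0 & -R_{N-1} & 1 - R_{N-1}
\end{bmatrix} V =
\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} V_{bus}
\tag{4.16}
\end{equation}

Setting $D_n = 0.5$ for all n, then $V_n = \frac{V_{bus}}{N}$ for all n, in the steady state.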
High-Efficiency Converters
Given ideal converters running at duty ratios given by D, $i_h = \mathrm{diag}(D)\, i_l$,
yielding the results of Eq. (4.17), which is the relationship used for
calculating power loss in [25] and all other theoretical work known to the author
that relates to the element-to-element configuration. Here A, B, and C denote,
in order, the three coefficient matrices of Eq. (4.13).

\begin{equation}
(A\, \mathrm{diag}(D) + B)\, i_l = C I \implies
i_l = (A\, \mathrm{diag}(D) + B)^{-1} C I
\tag{4.17}
\end{equation}

However, it should be noted that this is an approximation that explicitly
relies on ideal converters, and therefore always under-reports the true power
losses. Using Eq. (4.13) directly results in a more accurate representation
of the true losses, because it takes into account the fact that the individual
module efficiencies have a compounded effect on the total system losses.
The following simulation results were calculated iteratively using Eq. (4.13)
and the relation specified by Eq. (4.18) over 1000 33-server samples (current
distributed normally, with the same coefficient of variation).

\begin{equation}
i_h = \begin{cases}
\eta\, i_l & \text{if } i_l > 0 \\
\frac{1}{\eta}\, i_l & \text{else}
\end{cases}
\tag{4.18}
\end{equation}
Figs. 4.7 to 4.9 show the discrepancies in average conversion loss calculation
between the idealized model of Eq. (4.17) and the more detailed iterative
calculation previously described. In the 99% efficient converter case, the
results differ by over 50%, while in the 90% and 80% efficiency cases, the
calculations differ by 32% and 12% respectively.
Therefore, it is important to note that the standard calculation of element-
to-element power losses, as reported in previous work, is to be used as a
lower-bound estimate, since the true amount of power processed will always
be greater, due to the compounding effect of converter inefficiencies.
[Figure 4.7: Comparing conversion loss calculation for high-efficiency converters. Plot title: "33 x 300 watt string with 99.0% efficiency"; x-axis: sample number; y-axis: average total power loss (W); series: simple calculation, detailed calculation; data cursors at sample 1000: 35.13 W and 52.85 W.]

[Figure 4.8: Comparing conversion loss calculation for mid-efficiency converters. Plot title: "33 x 300 watt string with 90.0% efficiency"; x-axis: sample number; y-axis: average total power loss (W); series: simple calculation, detailed calculation; data cursors at sample 1000: 352.1 W and 463.4 W.]

[Figure 4.9: Comparing conversion loss calculation for low efficiency converters. Plot title: "33 x 300 watt string with 80.0% efficiency"; x-axis: sample number; y-axis: average total power loss (W); series: simple calculation, detailed calculation; data cursors at sample 1000: 694.9 W and 781.7 W.]

4.3.2 Probabilistic Analysis
Referring to Fig. 4.6 and the preceding discussion, it can be shown that the
output $i_l$ of each differential converter for a string of servers specified by
$I_s = \{i_1, i_2, \ldots, i_N\}$ is given by Eq. (4.19).

\begin{equation}
i_l = X I_s
\tag{4.19}
\end{equation}

If there are N items in the string, X is given by Eq. (4.20).

\begin{equation}
X = \frac{2}{N}
\begin{bmatrix}
-(N-1) & 1 & \cdots & \cdots & 1 \\
-(N-2) & -(N-2) & 2 & \cdots & 2 \\
\vdots & & & & \vdots \\
-1 & \cdots & \cdots & -1 & (N-1)
\end{bmatrix}
\tag{4.20}
\end{equation}
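If we take server currents $i_i = X_i$ to be random variables, and if the $X_i$ are
independent identically distributed normal random variables with mean µ
and variance σ², Eq. (4.19) can be written as Eq. (4.21), where $Y_{l,k}$ is a
random variable describing the behavior of DPP current k, $i_{l,k}$, and
$Z_{\mathcal{N}(\mu,\sigma^2)}$ is an independent normally-distributed random variable
with mean µ and variance σ².

\begin{equation}
\begin{aligned}
Y_{l,k} &= \frac{2}{N} \left( -(N-k) \sum_{j=1}^{k} X_j + k \sum_{j=k+1}^{N} X_j \right) \\
&= \frac{2}{N} \left( -(N-k)\, Z_{\mathcal{N}(k\mu,\, k\sigma^2)} + k\, Z_{\mathcal{N}((N-k)\mu,\, (N-k)\sigma^2)} \right) \\
&= \frac{2}{N} \left( -Z_{\mathcal{N}(k(N-k)\mu,\, k(N-k)^2\sigma^2)} + Z_{\mathcal{N}(k(N-k)\mu,\, k^2(N-k)\sigma^2)} \right) \\
&= \frac{2}{N}\, Z_{\mathcal{N}(0,\, Nk(N-k)\sigma^2)} \\
&= Z_{\mathcal{N}\left(0,\, 4k\left(1 - \frac{k}{N}\right)\sigma^2\right)}
\end{aligned}
\tag{4.21}
\end{equation}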
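From Eq. (4.21), it is clear that the expected value of the power processed
by DPP k is given by Eq. (4.22), where $V = \frac{V_{bus}}{N}$ is the voltage across one
server. It should be noted that $|Y_{l,k}|$ follows a half-normal distribution.

\begin{equation}
\begin{aligned}
E(P_{DPP_k}) &= \frac{V_{bus}}{N} E\left( |Y_{l,k}| \right) \\
&= \frac{V_{bus}}{N}\, \sigma_{Y_{l,k}} \frac{\sqrt{2}}{\sqrt{\pi}} \\
&= \frac{V_{bus}}{N} \sqrt{4k\left(1 - \frac{k}{N}\right)\sigma^2}\, \frac{\sqrt{2}}{\sqrt{\pi}} \\
&= \frac{2\sigma V_{bus}}{N\sqrt{\pi}} \sqrt{2k\left(1 - \frac{k}{N}\right)}
\end{aligned}
\tag{4.22}
\end{equation}

Thus, the expected losses of the whole DPP string are given by Eq. (4.23),
where $V_{bus}$ is the bus voltage, N is the number of servers in the string, and
η is the efficiency of each converter.

\begin{equation}
\begin{aligned}
E(P_{loss,string}) &= (1-\eta) \sum_{k=1}^{N-1} E(P_{DPP_k}) \\
&= (1-\eta) \frac{2\sigma V_{bus} \sqrt{2}}{N\sqrt{\pi}} \sum_{k=1}^{N-1} \sqrt{k\left(1 - \frac{k}{N}\right)}
\end{aligned}
\tag{4.23}
\end{equation}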
Eq. (4.23) shows that the expected losses of the string will be reduced in
direct proportion to the standard deviation of the server currents, independent
of the mean server current. Eq. (4.24) indicates the maximum coefficient of
variation for which the best-case element-to-element topology
outperforms the best-case parallel distribution architecture.

\begin{equation}
\begin{aligned}
E(P_{loss,string}) &< E(P_{loss,parallel}) \iff \\
(1-\eta) \frac{2\sigma V_{bus} \sqrt{2}}{N\sqrt{\pi}} \sum_{k=1}^{N-1} \sqrt{k\left(1 - \frac{k}{N}\right)} &< (1-\eta)\, N \left( \frac{V_{bus}}{N} \mu \right) \iff \\
\frac{2\sigma \sqrt{2}}{N\sqrt{\pi}} \sum_{k=1}^{N-1} \sqrt{k\left(1 - \frac{k}{N}\right)} &< \mu \iff \\
\frac{\sigma}{\mu} &< \frac{N\sqrt{\pi}}{2\sqrt{2} \sum_{k=1}^{N-1} \sqrt{k\left(1 - \frac{k}{N}\right)}}
\end{aligned}
\tag{4.24}
\end{equation}

[Figure 4.10: Expected loss reduction. Plot title: "Comparison between element-to-element DPP and parallel distribution for 33-server 10kW rack"; x-axis: coefficient of variation; y-axis: power losses (W); series: parallel system, element-to-element DPP.]
4.3.3 Simulation Results
A plot of the simulation described by Table 4.2 is shown in Fig. 4.10, indi-
cating that the results of Eq. (4.24) hold.
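A comparison of this kind can be sketched in a few lines. The example below is not the simulation of Table 4.2. It assumes ideal converters, works directly in watts (every node at $\frac{V_{bus}}{N}$), and uses an illustrative server power standard deviation of 15 W. Function names, sample count, and seed are arbitrary.

```python
import math
import random


def expected_processed_power(n, sigma_w):
    """Closed-form mean power processed by the string, summing Eq. (4.22) over k.

    sigma_w is the standard deviation of server power in watts, i.e.
    sigma * V_bus / N in the notation of the text.
    """
    scale = 2.0 * sigma_w * math.sqrt(2.0 / math.pi)
    return scale * sum(math.sqrt(k * (1.0 - k / n)) for k in range(1, n))


def simulated_processed_power(n, mean_w, sigma_w, n_samples=5000, seed=0):
    """Monte Carlo mean of the summed |Y_k|, using the rows of Eq. (4.20)."""
    rng = random.Random(seed)
    grand_total = 0.0
    for _ in range(n_samples):
        powers = [rng.gauss(mean_w, sigma_w) for _ in range(n)]
        total = sum(powers)
        prefix = 0.0
        for k in range(1, n):
            prefix += powers[k - 1]          # sum over servers 1..k
            y_k = (2.0 / n) * (-(n - k) * prefix + k * (total - prefix))
            grand_total += abs(y_k)
    return grand_total / n_samples


def cv_limit(n):
    """Right-hand side of Eq. (4.24)."""
    s = sum(math.sqrt(k * (1.0 - k / n)) for k in range(1, n))
    return n * math.sqrt(math.pi) / (2.0 * math.sqrt(2.0) * s)


closed = expected_processed_power(33, 15.0)
simulated = simulated_processed_power(33, 300.0, 15.0)
print(f"closed form: {closed:.0f} W, Monte Carlo: {simulated:.0f} W")
print(f"maximum coefficient of variation for N = 33: {cv_limit(33):.3f}")
```

Multiplying either figure by (1 - η) gives the corresponding loss estimate of Eq. (4.23). If the ± 30 W of Table 2.1 is read as two standard deviations, the closed-form value should land close to the 1773 W reported there.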
Table 4.2: Simulation parameters for element-to-element DPP applicability
simulation.

Simulation Parameter                  Value
Mean server power [W]                 303.03
Distribution of server power          Gaussian
Number of servers in rack             33
Power converter efficiency            90%
Number of experiments                 19
Coefficients of variation             0.05587 - 1.397
Number of samples per experiment      100 - 100000

4.3.4 Example of a Non-Applicable Load
From the previous discussion, it should be clear that not all server current
profiles lend themselves to differential power processing. Moreover, it should
be noted that the element-to-element differential power processing topology
is more sensitive to load variance than the general-case DPP, due to the
fact that some power differences have to be processed by multiple DPPs in
the stack, depending on the location of the power differences. The following
discussion provides an example of a load which exhibits particularly poor
performance under the element-to-element configuration.
Web traffic is one case where the current seems to be fundamentally
“spikey” in nature, as is shown by the detail plot of Fig. 4.11, which was
measured on a Dell Optiplex SX280 server under a load of 600 requests per
second, and with 13 mF of external capacitance added to the server input
terminals. It should be noted that due to the fundamental limitations of
Eq. (4.25), these current spikes could not be reduced substantially without
adding hundreds of additional millifarads of external capacitance to the server
input terminals.

\begin{equation}
C_{needed} = \frac{\Delta I_{spike}\, \Delta t_{spike}}{\Delta V_{allowable}}
\tag{4.25}
\end{equation}
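To give a sense of scale for Eq. (4.25), the sketch below evaluates it for one set of illustrative numbers: a 5 A spike lasting 10 ms with a 100 mV allowable dip. These values are assumptions chosen for the example, not measurements from Fig. 4.11.

```python
def capacitance_needed(delta_i_a, delta_t_s, delta_v_v):
    """Eq. (4.25): charge drawn by the spike divided by the allowed voltage dip."""
    return delta_i_a * delta_t_s / delta_v_v


# Illustrative values only: 5 A spike, 10 ms wide, 100 mV allowable dip.
c_farads = capacitance_needed(delta_i_a=5.0, delta_t_s=10e-3, delta_v_v=0.1)
print(f"{c_farads * 1e3:.0f} mF")   # 500 mF
```

Under these assumptions the requirement is on the order of half a farad, which is consistent with the "hundreds of millifarads" noted above.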
The simulation results shown in Fig. 4.12 were obtained by running the
experiment described in Table 4.3. The number of pulses per sample and
the average pulse rate determined the time over which each sample was run.
A sample server load profile is shown in Fig. 4.13. The simulation shows
that, under the pulsed load profile, the converters are actually required to
process more power, on average, than the amount of power which is delivered to
the servers. This intuitively makes sense, because the nature of the
high-current pulses significantly reduces the probability that the difference
between any two successive server currents remains small throughout the
duration of the test. It should be noted that the differential power
requirement calculation used the results of Eq. (4.17), so the actual losses
would be even greater than what is indicated in the figures.

[Figure 4.11: Detail of current profile of server under web traffic load. Plot title: "Current spike under heavy web traffic (with 30mF input capacitance)"; x-axis: time (ms); y-axis: server current (A).]

[Figure 4.12: Percentage processed out of total power delivered. Plot title: "Percentage power processed out of 10 kilowatts for 25.3-A, 10ms pulse load"; x-axis: average pulse rate (pulses per second); y-axis: differential power processed (percent).]

Table 4.3: Simulation parameters for pulsed load simulation.

Simulation Parameter                      Value
Mean server power [W]                     303.03
Distribution of server power              Pulsed
Distribution of inter-pulse times         Exponential
Pulse widths                              10 milliseconds
Number of servers in rack                 33
Number of experiments                     7
Average pulse rate (pulses per second)    8.3 - 100
Number of samples per experiment          50
Number of pulses per sample               1000

[Figure 4.13: Example load profile (100 pulse/s). Plot title: "Typical power profile of a server under 25.3-A, 10ms pulse load"; x-axis: time (s); y-axis: instantaneous power (W).]
Chapter 5
Prototype Converter
5.1 Key Hardware Features
5.1.1 Component Selection and Parameters
The bidirectional buck-boost DPP modules implementing the previously-
discussed element-to-element DPP architecture were designed using the pa-
rameters shown in Table 5.1. The specified control mode is explained in
Section 5.2. A detailed schematic diagram of one converter module is shown
in Fig. 5.1, providing a reference for the component values specified in
Table 5.2. Fig. 5.1 also provides a photograph of the converter modules and
context regarding how the modules are interconnected within the
series-connected server cluster.
[Figure 5.1: Schematic drawing of experimental implementation of series-stacked server system. Labels: Servers 0 to 3; Differential Power Converters 0 to 2; converter phase (C1, Q1, C2, Q2, L1, gate driver); four converter phases per differential power converter; MCU and isolated COM links (to PC); 1 cm scale bar on photograph.]

Table 5.1: Per-converter design parameters.

Parameter                          Value
Switching frequency                250 kHz
Number of interleaving phases      4
Phase shift between PWM pairs      90°
Dead time                          31.25 ns, fixed
Rated power                        400 W
Peak efficiency                    ~96%
Operating mode                     Forced CCM
Control mode                       Binary Integral

Table 5.2: Selected components.

Component                          Symbol         Value
Per-phase high side capacitance    C1             3 × 47 µF
Per-phase low side capacitance     C2             3 × 47 µF
High-side switch                   Q1             TI CSD18504Q5A, 40 V, 15 A
Low-side switch                    Q2             TI CSD18504Q5A, 40 V, 15 A
Per-phase inductor                 L1             Coilcraft XAL1010-103, 10 µH
Combined high/low gate driver      Gate Driver    TI LM5101CMYE/NOPB, 1 A
Microcontroller                    MCU            ATXMega128A4U, 128 MHz
Isolated communication module      COM            ADUM1250ARZ, UART

5.1.2 Voltage Regulation
Fig. 5.2 is a plot of the voltage regulation under typical operating conditions
(full server loading). As is shown in the figure, the typical per-node voltage
ripple is quite small, around 15 mV. The reason for the 100 mV offset on the
voltage of Node 3, and the smaller offsets on the other nodes, is that
local voltage sensing was used on the differential power processors, allowing a
small voltage droop (due to cable and switch resistances) to develop on nodes
which required significant differential current. Future revisions can correct
this minor issue by using remote sensing leads to control the node voltage.

[Figure 5.2: Typical voltage regulation. Panels: "Voltage regulation (1ms average)", voltage (V) versus time (s); "Node currents (1ms average)", current (A) versus time (s); series in both panels: Node 0 to Node 3.]
5.1.3 Phase Shedding
In the DPP topology used here, the converter modules must be rated for
a maximum of twice the maximum server power in order to handle the
worst-case mismatch condition. However, because the benefit of the series-
connected architecture is most pronounced when there is low mismatch among
the server powers, it is desired that the DPP modules incur low losses during
the light load condition as well.
In order to simultaneously achieve both of these goals, the phase-shedding
technique illustrated in Fig. 5.3 is applied. An efficiency plot was made for
the converter, with one, two, three, and four phases enabled, and current
thresholds at the various intersections of the plots were used to increase or
decrease the number of enabled phases to achieve the optimized aggregate
efficiency shown in Fig. 5.3. Due to the symmetric nature of the design, the
efficiency curve remains essentially the same regardless of whether the DPP
module is functioning in buck mode or in boost mode.

[Figure 5.3: Converter efficiency and phase shedding. Plot title: "DPP module efficiency"; x-axis: output power (W); y-axis: efficiency (percent); series: optimized, 1 phase, 2 phases, 3 phases.]
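The threshold-based selection described in this section can be sketched as a simple lookup. The crossover values below are hypothetical placeholders expressed in watts for readability, whereas the text describes current thresholds read from the measured curves. The function name is likewise an assumption.

```python
# Hypothetical power levels (W) at which the efficiency curves for k and
# k + 1 enabled phases intersect; real values would come from measurements.
CROSSOVER_W = [60.0, 130.0, 210.0]


def phases_for_power(power_w, crossover_w=CROSSOVER_W):
    """Return how many of the four phases to enable for a given processed power.

    The magnitude is used because the module is bidirectional and its
    efficiency curve is essentially the same in buck and boost mode.
    """
    phases = 1
    for threshold in crossover_w:
        if abs(power_w) > threshold:
            phases += 1
    return phases


for p in (10.0, 100.0, 150.0, -300.0):
    print(f"{p:7.1f} W -> {phases_for_power(p)} phase(s)")
```

A practical implementation would likely add hysteresis around each threshold to avoid rapid toggling between phase counts. That refinement is omitted here.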
5.2 Control and Regulation
5.2.1 Voltage-Mode Control
The following discussion refers to Fig. 5.4. The hardware implementation
discussed in this work makes use of a local voltage-mode control algorithm to
achieve the desired system-level regulation. This algorithm achieves global
voltage equalization by setting the local control of each DPP module to
equalize the voltage across each pair of adjacent server nodes. Fig. 5.4 and
Eq. (5.1) explain this behavior in more detail. Let $\Delta V_n = V_n - V_{n+1}$ be the
voltage that appears at each server’s input terminals, and $R_n = \frac{1}{2}$ be the
desired ratio between a server’s voltage and the voltage across itself and the
node above it.
$$\Delta V_{n+1} = R_n \times (\Delta V_n + \Delta V_{n+1}) \tag{5.1}$$
Applying Kirchhoff’s voltage law, assuming that Eq. (5.1) has been achieved
locally for each converter, the results of Eq. (4.16) are obtained.
Figure 5.4: Reference schematic for control discussion. (Schematic labels: Server 0 to Server 3, Differential Power Converter 0 to 2, switches Q1 and Q2, capacitors C1 and C2, inductor L1, gate driver, 1k resistor.)
When a control value of $R_n = \frac{1}{2}$ is chosen for each n (i.e., telling the
individual converter modules to locally equalize $\Delta V_n$ and $\Delta V_{n+1}$), the unique
solution for the $\Delta V$ vector in Eq. (4.16) is given by Eq. (5.2). Thus, assuming
that the ratio $V_{bus}/N$ has been chosen to match the server input terminal voltage,
the local control of Eq. (5.1) will result in all server voltages being globally
regulated to the correct value.
$$\Delta V_i = \frac{V_{bus}}{N} \quad \forall\, i \in \{0 \ldots N\} \tag{5.2}$$
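The way purely local equalization produces a global result can be illustrated numerically. The sketch below is an idealized model and not the converter dynamics: it assumes each converter acts in turn, instantly enforces Eq. (5.1) with R_n = 1/2 on its pair of server voltages, and leaves the sum of that pair unchanged. The starting voltages and the function name `equalize_locally` are hypothetical.

```python
def equalize_locally(dv, sweeps=200):
    """Repeatedly apply Eq. (5.1) with R_n = 1/2 to each adjacent pair.

    dv: list of server voltages (V); their sum is the bus voltage.
    """
    dv = list(dv)
    for _ in range(sweeps):
        for n in range(len(dv) - 1):
            pair_sum = dv[n] + dv[n + 1]   # fixed by the rest of the string
            dv[n + 1] = 0.5 * pair_sum     # Eq. (5.1) with R_n = 1/2
            dv[n] = pair_sum - dv[n + 1]
    return dv


v_bus = 48.0
start = [20.0, 4.0, 15.0, 9.0]   # arbitrary unbalanced start, sums to 48 V
result = equalize_locally(start)
print([round(v, 6) for v in result])
```

Under these assumptions every server voltage settles at V_bus / N, even though no converter ever uses the bus voltage or the number of servers.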
5.2.2 Regulation Implementation
To implement the distributed voltage-mode control algorithm as proposed
above, the DPP modules presented here make use of a bidirectional buck-
boost converter topology. Each module acts either as a synchronous buck
converter or a synchronous boost converter, depending on which direction the
inductor current needs to flow in order to achieve the desired local voltage
equalization. Referring to Fig. 5.4, the buck mode operation can be understood
by performing well-known analysis techniques, considering $V_n - V_{n+2}$
as the converter input and $V_{n+1} - V_{n+2}$ as converter output. Conversely, the
boost mode operation can be understood by considering $V_{n+1} - V_{n+2}$ as input
and $V_n - V_{n+2}$ as output.
Figure 5.5: Binary integral control. (Diagram labels: increase high side duty, decrease high side duty.)
Due to symmetry, an interesting and useful feature of this hardware
configuration is that if it is operated in forced continuous conduction mode
(CCM), the same integral control law can be used to achieve regulation
independent of current direction.
$$V_n - V_{n+2} = \frac{1}{1 - D_{Q2}} (V_{n+1} - V_{n+2}) \tag{5.3}$$
$$V_{n+1} - V_{n+2} = D_{Q1} (V_n - V_{n+2}) \tag{5.4}$$
The proposed DPP hardware discussed in this work implements the
controller described by Fig. 5.5 and Eq. (5.5), where $D_{Q1}(t)$ represents the duty
ratio of the high-side switch, and $D_{Q2}(t) = 1 - D_{Q1}(t)$. While this voltage
regulation strategy is theoretically slower than a standard PID, PI, or pure
integral controller (see Eq. (5.5)), the limited number of duty ratios available
on the low-cost microcontroller chosen for the converter (Table 5.2) limits
the benefits of more fine-grained control to such a degree that the simple
control scheme performs comparatively well in this particular application. It
should also be noted that this is a low-cost, low-power solution because it
does not rely on detecting current direction, or on accurate voltage sensing,
but instead relies only on comparing two voltages.
$$D_{Q1}(t) = K \int_0^t \operatorname{sgn}(\Delta V_{n+1} - \Delta V_n)\, d\tau \tag{5.5}$$
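A discrete-time reading of Eq. (5.5) on a microcontroller with a limited number of duty settings might look like the sketch below. It is an illustration under stated assumptions and not the firmware used in this work: the 255-count duty resolution, the function names, and the ideal static plant taken from Eq. (5.4) are hypothetical. The sign of K is not stated in the text; under this static model, negative feedback requires the high-side duty to decrease when ∆V_{n+1} exceeds ∆V_n, so the sketch assumes a gain of −1 count per update.

```python
def sgn(x):
    """Sign function: -1, 0, or +1."""
    return (x > 0) - (x < 0)


def binary_integral_step(duty_counts, dv_lower, dv_upper,
                         k_counts=-1, max_counts=255):
    """One update of D_Q1, in timer counts, following Eq. (5.5).

    dv_lower: Delta V_{n+1} (V), dv_upper: Delta V_n (V).
    k_counts is the assumed gain K in counts per update.
    """
    new_counts = duty_counts + k_counts * sgn(dv_lower - dv_upper)
    return max(0, min(max_counts, new_counts))   # clamp to valid duty range


def simulate(v_pair=24.0, start_counts=40, max_counts=255, steps=300):
    """Iterate the controller against the ideal static model of Eq. (5.4)."""
    d = start_counts
    for _ in range(steps):
        dv_lower = (d / max_counts) * v_pair   # Eq. (5.4)
        dv_upper = v_pair - dv_lower
        d = binary_integral_step(d, dv_lower, dv_upper,
                                 max_counts=max_counts)
    return d


final_counts = simulate()
print(final_counts)   # settles next to the half-scale duty setting
```

Because only the sign of the voltage difference is used, the duty ratio ends up toggling between the two settings that bracket 50%, which is the price of needing only a comparison of two voltages.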
Figure 5.6: Light load control reference for buck mode. (Horizontal axis: time.)
5.2.3 Light Load Optimization
Because it is desired to operate the DPP modules at high efficiency during the
light load condition (when the servers have balanced power consumption), a
pulse frequency modulation scheme was used to minimize losses during the
period when the absolute value of the average low-side current was less than
half of the inductor current peak-to-peak ripple.
The initial implementation of this optimization attempted to achieve op-
eration similar to a non-synchronous buck or a non-synchronous boost con-
verter, depending on the sensed direction of average current, by using analog
voltage sensing across the switches. While the theory for this technique
was sound (this is the method used by some commercial integrated syn-
chronous buck controller ICs), difficulties with board area, sensing accuracy,
crosstalk, and other factors associated with a discrete implementation caused
this methodology to be abandoned.
The second implementation attempt made use of pulse frequency modu-
lation. Due to microcontroller limitations, it was not possible to adjust the
duty ratio during the pulse frequency modulation mode. Thus, the duty
ratio was kept at 50%, and pulse frequency alone was used to control the
converter voltage. The heights of the inductor current pulses were limited
by allowing the body diode to conduct for a brief period to ensure that the
inductor current returned to zero.
The relationship between high-side and low-side voltage when the
converter is running in buck mode is derived from Fig. 5.6. Assuming that
$(V_{hi}/V_{lo} - 1)\Delta t$ is greater than $\Delta t$ (the duration that the high-side switch is on),
and that the low-side switch turns off after $2\Delta t$ (fixed 50% duty cycle), the
average inductor current $\langle I_L \rangle$ can be found geometrically, and can be used
to obtain the relation between the high-side and low-side voltages given in
Eq. (5.6).
$$\langle I_L \rangle = \frac{1}{T} \times \frac{V_{hi} - V_{lo}}{L} \Delta t \times \left( \Delta t + \left( \frac{V_{hi}}{V_{lo}} - 1 \right) \Delta t \right) \iff V_{ls} = \frac{2 V_{hs} \Delta t}{\sqrt{4 \langle I_L \rangle T + \Delta t^2} + \Delta t} \tag{5.6}$$
Figure 5.7: Light load control reference for boost mode. (Horizontal axis: time.)
Again, due to microcontroller limitations, ∆t is not adjustable, and there-
fore only T is available as a control handle. Thus, in buck mode, the control
strategy has the intuitive behavior of increasing the pulse frequency when
the low-side voltage is too low, and decreasing the pulse frequency when the
low-side voltage is too high.
A similar result can be obtained for the boost-mode control using the
geometrical information from Fig. 5.7, and is shown in Eq. (5.7). However,
in this case, the control law is opposite. If the low-side voltage is too low, the
pulse frequency must be decreased, and if it is too high, the pulse frequency
must be increased.
$$\langle I_L \rangle = \frac{1}{T} \times \frac{V_{lo}}{L} \Delta t \times \left( \Delta t + \frac{V_{lo}}{V_{hi} - V_{lo}} \Delta t \right) \iff V_{ls} = \frac{V_{hs} \langle I_L \rangle L T}{V_{hs} \Delta t^2 + \langle I_L \rangle L T} \tag{5.7}$$
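The opposite control directions in the two modes can be checked numerically. The sketch below is only an illustration: it assumes V_hs and V_ls denote the same quantities as V_hi and V_lo, it solves the geometric (left-hand) expressions of Eqs. (5.6) and (5.7) for V_lo by an algebraic step made here rather than taken from the text, and the component values and function names are hypothetical.

```python
import math


def i_avg_buck(v_hi, v_lo, L, T, dt):
    """Geometric expression on the left of Eq. (5.6)."""
    return (1 / T) * (v_hi - v_lo) / L * dt * (dt + (v_hi / v_lo - 1) * dt)


def v_lo_buck(v_hi, i_avg, L, T, dt):
    """Left-hand expression of Eq. (5.6) solved for V_lo (algebra done here)."""
    return v_hi ** 2 * dt ** 2 / (i_avg * L * T + v_hi * dt ** 2)


def v_lo_boost(v_hi, i_avg, L, T, dt):
    """Closed form on the right of Eq. (5.7); i_avg is a magnitude."""
    return v_hi * i_avg * L * T / (v_hi * dt ** 2 + i_avg * L * T)


# Hypothetical values: 24 V across the pair, 0.5 A average, 10 uH, 1 us pulse.
V_HI, I_AVG, L_H, DT = 24.0, 0.5, 10e-6, 1e-6

for period in (10e-6, 20e-6):   # a longer period is a lower pulse frequency
    print(period,
          round(v_lo_buck(V_HI, I_AVG, L_H, period, DT), 2),
          round(v_lo_boost(V_HI, I_AVG, L_H, period, DT), 2))
```

With these numbers, lengthening the period lowers the buck-mode low-side voltage and raises the boost-mode low-side voltage, which matches the control directions stated in the text.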
Because the control strategy depends on knowledge of which mode the
device is operating in, it requires explicit knowledge of the sign of < IL >.
However, PFM is only needed when the average current is quite small. As
the average current approaches zero, accurate sensing of its sign becomes
increasingly difficult due to noise and comparator hysteresis effects. In order
to overcome this challenge, the inductor current was sensed directly, and
its time-dependent behavior was analyzed to determine whether or not the
correct current direction had been sensed.
Figure 5.8: PFM with pulse width 2 and incorrect pulse direction.
Figure 5.9: PFM with pulse width 2, correct pulse direction, and excessive pulse frequency.
In buck converters and boost converters without current limiting behavior,
the average inductor current is set externally, meaning that the electrical
characteristics of the circuit cause the shape of the inductor current waveform
to change such that it has whatever average value is set by the load. Thus,
even if the wrong switch is turned on first, causing current to temporarily
flow in the wrong direction, the current will eventually “slosh” back in the
right direction in order to maintain the correct average current, as long as
there are enough sequential pulses.
For example, Fig. 5.8 shows a simulation of a buck converter running in
pulse mode, but turning on the low side switch first, rather than the high
side switch, as it should. This results in an initial pulse of current in the
wrong direction, but after one additional pulse, the current is drawn back in
the correct direction in order to maintain the average output current.
Similarly, if the pulse direction is correct, but the pulse frequency is higher
than is required, the inductor current will be driven to “slosh” in the negative
direction, as is shown in the simulation of Fig. 5.9, in order to maintain the
average output current. Both of these scenarios exhibit behavior similar to
that observed in forced continuous conduction mode, when the output current
is too small. Specifically, both of these scenarios have current “sloshing” back
and forth across the zero current boundary.
Figure 5.10: PFM with pulse width 2, correct pulse direction.
Figure 5.11: Light load current direction detect circuit. (Schematic labels: sense resistor, amplifier, comparators, forward current detect, reverse current detect, connections to switch node and to low-side terminal.)
However, if the pulse frequency is less than or equal to what it should be,
the inductor current does not exhibit this “sloshing” behavior, but instead
increases the height of the pulses if necessary in order to maintain the output
current, as is shown in Fig. 5.10. Thus, if inductor current is sloshing back
and forth across the zero boundary, either the pulse frequency is too high, or
the pulse direction is incorrect. If inductor current is not sloshing back and
forth, the pulse direction is correct. These observations were used to design
the light load current detect circuit of Fig. 5.11.
The current direction detect circuit uses two comparators with negative
reference voltages above and below the voltage output of the current-sense
amplifier corresponding to zero amps. It should be noted that the current-sense
amplifier does not have to have an extraordinarily high gain-bandwidth
product or excellent linearity. It is only necessary to produce a change in
the positive or negative direction when the inductor current goes above or
below zero. As is illustrated in Fig. 5.12, the circuit produces two digital
signals which can trigger microcontroller interrupts.
Figure 5.12: Operation of light load current direction detect. (Traces: forward detect, reverse detect, inductor current. Annotated cases: correct pulse direction, adjust pulse frequency to set voltage; pulse frequency too high, decrease pulse frequency if possible; pulse direction incorrect, switch pulse direction.)
If one of the signals is constant, the controller can be certain it has selected
the correct current direction, and may proceed to adjust the pulse frequency
to set the low side voltage according to Eq. (5.6) or Eq. (5.7). If both of
the signals are changing, then the controller must first decrease the pulse
frequency in order to check if the sloshing is due to an excessive frequency.
If the pulse frequency has been decreased to its minimum value and the
inductor current is still exhibiting the sloshing behavior, then the controller
can be confident in switching pulse directions.
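The decision procedure described above can be summarized in a few lines. This is a sketch of the logic only, not the interrupt-driven firmware used in this work; the enumeration, the function name, and the boolean inputs (whether each detect signal changed during the observation window, and whether the pulse frequency is already at its minimum) are hypothetical.

```python
from enum import Enum


class Action(Enum):
    ADJUST_FREQUENCY = "adjust pulse frequency to set the low-side voltage"
    DECREASE_FREQUENCY = "decrease pulse frequency and observe again"
    SWITCH_DIRECTION = "switch pulse direction"


def light_load_decision(forward_changed, reverse_changed, at_min_frequency):
    """Choose the next controller action from the two detect signals.

    forward_changed / reverse_changed: True if the corresponding comparator
    output toggled during the observation window.
    at_min_frequency: True if the pulse frequency cannot be lowered further.
    """
    sloshing = forward_changed and reverse_changed
    if not sloshing:
        # At least one signal is constant: the pulse direction is correct.
        return Action.ADJUST_FREQUENCY
    if not at_min_frequency:
        # Sloshing may be caused by an excessive pulse frequency.
        return Action.DECREASE_FREQUENCY
    # Still sloshing at minimum frequency: the direction must be wrong.
    return Action.SWITCH_DIRECTION
```

Lowering the frequency before reversing direction reflects the ordering given in the text: a direction change is made only after excessive frequency has been ruled out.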
Fig. 5.13 shows the efficiency improvement at light load made possible by
the previously described optimization strategy.
Figure 5.13: Efficiency improvement of light load optimization. (Plot of DPP module efficiency in percent versus output power in W, 0 to 15 W; curves: optimized, 1 phase, 2 phases, 3 phases.)
Chapter 6
Experiment
This chapter describes the configuration, operation, and experimental re-
sults of a functional series-connected cluster. It discusses details of the op-
eration of the series-stacked environment and the software load, and con-
cludes with an experimental comparison between the power losses within the
series-connected cluster, and the losses within a best-in-class parallel power
distribution system.
6.1 Series-Stacked Operation
6.1.1 Pre-Charging Behavior
Although the three differential power processing modules used in this
experiment do not communicate with each other, they are still conceptually
parts of a single, unified power delivery system. Thus, like any other power
converter, this system must take precautions, through centralized control, to
avoid instability or current overages during bus startup, bus shutdown, and
load startup. In order to provide stability during initialization of the input
bus, many general-purpose power supplies are equipped with an “enable”
signal which will leave the output disconnected until the internal control
circuitry has started, and the input voltage has reached a high enough level for
the converter to operate. In a similar manner, the series-connected system
is equipped with software-controllable switches such that the actual servers
remain disconnected from the string until all converters have successfully
turned on and are regulating the node voltages of the load input capacitors.
These switches automatically disconnect the servers from the stack during
bus shutdown.
Figure 6.1: “Soft start” circuit in general-purpose PSU. (Schematic labels: fast current limiter, enable, pre-charge complete, DC/DC converter, load.)
In order to limit current stresses during load startup, especially in the case
of loads with large input capacitance, many general-purpose power supplies
use a current-limiting circuit like the one shown in Fig. 6.1 to “pre-charge”
any external output capacitors. Similarly, the series-connected power delivery
system makes use of a software-programmable current limit on the bus
power supply to prevent large current transients during bus initialization.
6.1.2 String Initialization
Due to the series-connected nature of the load input capacitors, additional
startup precautions must be taken to avoid capacitor overvoltages during bus
initialization, and excessive negative capacitor voltages during bus shutdown.
These precautions are realized by stepping the bus voltage in increments
designed to ensure that neither overvoltages nor negative voltages can occur
during a single voltage step.
The worst-case scenario during startup or shutdown is when one capacitor
exhibits substantially reduced ESR, and thus sees the whole change in bus
voltage during each bus voltage step. This worst-case scenario is expressed
in Eq. (6.1), where $\Delta V_{c_i,k}$ is the k’th change in voltage across the load input
capacitance of node i, and $\Delta V_{bus,k}$ is the k’th step in bus voltage.
$$\Delta V_{c_i,k} = \Delta V_{bus,k} \tag{6.1}$$
As previously discussed, it is desired that Eq. (6.2) holds, where $V_{c_i,k} + \Delta V_{c_i,k}$
is the worst-case voltage across the load input capacitance of node i during
the transient generated by changing the bus voltage by $\Delta V_{bus,k}$.
$$V_{c,min} < V_{c_i,k} + \Delta V_{c_i,k} < V_{c,max} \quad \forall\, k, i \tag{6.2}$$
Combining Eq. (6.1) with knowledge that the differential converters will
balance the capacitor voltages to $V_{bus,k}/N$, where N is the number of servers in
the string, results in Eq. (6.3).
$$V_{c_i,k} + \Delta V_{c_i,k} = \frac{V_{bus,k-1}}{N} + \Delta V_{bus,k} \implies \text{Eq. (6.2)} \iff V_{c,min} < \frac{V_{bus,k-1}}{N} + \Delta V_{bus,k} < V_{c,max} \tag{6.3}$$
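With the further requirement that $\left| \sum_{k=1}^{M} \Delta V_{bus,k} \right| = V_{bus,nom}$, where M is
the number of bus voltage steps needed to reach the nominal bus voltage
$V_{bus,nom}$, the bus voltage sequences for startup and shutdown are given by
Eqs. (6.4) and (6.5), respectively.
$$V_{bus,k} = \min\left( V_{c,max} + \left( 1 - \frac{1}{N} \right) V_{bus,k-1},\ V_{bus,nom} \right), \quad V_{bus,0} = 0 \tag{6.4}$$
$$V_{bus,k} = \max\left( V_{c,min} + \left( 1 - \frac{1}{N} \right) V_{bus,k-1},\ 0 \right), \quad V_{bus,0} = V_{bus,nom} \tag{6.5}$$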
The particular case of this application (namely, N = 4, Vbus,nom = 48 V,
Vc,max = 12.4 V, and Vc,min = −0.4 V) results in the sequences given by
Table 6.1. It should be noted that because the converters are powered by the
voltage across two nodes, the node voltages during this portion of the startup
sequence are balanced by the string of equal-value 1 kΩ resistors at the output
of each differential converter (see Fig. 5.4).
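The recursions of Eqs. (6.4) and (6.5) are straightforward to evaluate. The short sketch below, with hypothetical function names and an arbitrary safety limit on the number of steps, generates both sequences for the parameter values quoted above; rounded to one decimal place, they agree with Table 6.1.

```python
def startup_sequence(n, v_nom, vc_max, max_steps=1000):
    """Bus voltage steps from 0 V up to v_nom, following Eq. (6.4)."""
    seq = [0.0]
    while seq[-1] < v_nom and len(seq) <= max_steps:
        seq.append(min(vc_max + (1 - 1 / n) * seq[-1], v_nom))
    return seq


def shutdown_sequence(n, v_nom, vc_min, max_steps=1000):
    """Bus voltage steps from v_nom down to 0 V, following Eq. (6.5)."""
    seq = [v_nom]
    while seq[-1] > 0.0 and len(seq) <= max_steps:
        seq.append(max(vc_min + (1 - 1 / n) * seq[-1], 0.0))
    return seq


startup = startup_sequence(n=4, v_nom=48.0, vc_max=12.4)
teardown = shutdown_sequence(n=4, v_nom=48.0, vc_min=-0.4)

for k, (up, down) in enumerate(zip(startup, teardown)):
    print(f"{k:2d}  {up:5.1f}  {down:5.1f}")
```

Both sequences reach their end points in 12 steps. The step size shrinks as the bus voltage approaches its target because each step must fit within the margin left by the already balanced capacitor voltage.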
In order to facilitate the startup sequencing explained earlier, each
converter was equipped with a low-resistance MOSFET used to switch in the servers
(indicated in the figure by the switch and dashed line at the negative
terminal of each server). An additional module containing only a communication
module and a MOSFET was added to the highest node in the stack in order
to allow all four servers to remain disconnected during the bus initialization
sequence. Fig. 6.2 shows the bus initialization sequence. Because the servers
remained disconnected during the startup sequence, and the external node
capacitors were rated for higher voltages, it was safe to use fewer voltage
steps for the startup sequence than are specified in Eqs. (6.4) and (6.5).
In order to avoid stability problems during the brief period when there
are fewer than three DPP modules running, the modules are configured to
run open-loop during the startup sequence. Around the 900 millisecond
mark, the server pairs were switched in, starting with node 0, and working
up to node 3. The large current transients (and resulting voltage dips)
are due to charging the input capacitor bank of the servers. In future
revisions, these current transients can be reduced by adding soft-start control
circuitry to the individual DPP modules, similar to the bus-level soft-start
circuitry described earlier. The bidirectional nature of the DPP modules,
and the position-dependent power processing requirements for the
element-to-element DPP topology, are illustrated by the direction and profile of the
differential currents.
Figure 6.2: Bus initialization sequence. (String initialization with converters running open loop; panels: node voltage (V) for nodes 0 to 3, node current (A) for nodes 0 to 3, and differential current (A) for DPP 0 to 2, each versus time (ms) from 0 to 1600 ms.)
Table 6.1: Bus voltage sequence for startup and shutdown.
k     Startup (V)   Teardown (V)
0     0.0           48.0
1     12.4          35.6
2     21.7          26.3
3     28.7          19.3
4     33.9          14.1
5     37.8          10.2
6     40.8          7.2
7     43.0          5.0
8     44.6          3.4
9     45.9          2.1
10    46.8          1.2
11    47.5          0.5
12    48.0          0.0
6.1.3 Cluster Configuration
In order to better represent a real-world data center task, the series-connected
cluster was configured to operate as a “real” coordinating cluster, rather
than merely as fully independent servers. The clustering strategy used in
this thesis is called the single system image (SSI) model, and is used in
data centers where the cluster operator has little knowledge of the nature of
tasks being executed on the cluster, or in cases where the cluster operator
wishes to enforce some degree of parallelism in software originally designed
to operate on a single system. Under this approach, a distributed operating
system automatically migrates processes, memory locations, and I/O streams
between disparate machines. The system has also been configured to run
(and to properly parallelize) the ARAGORN tRNA and tmRNA genome
detection software. However, in the absence of a sufficiently large sequencing
dataset, the experimental results presented in this thesis were obtained using
only the “stress” workload generator.
Figure 6.3: Single system image cluster environment. (Diagram labels: rack hardware with 1 CPU, 1G memory, 1 network card, no disk; container; distributed kernel; standard processes; migratable processes; network; root filesystem on file server hardware; multi-process application (cluster-enabled); distributed operating system, as seen by cluster-enabled applications, with 3 CPU, 3G memory, 3 network cards, 1 disk.)
The particular implementation of SSI used in this work is shown in Fig. 6.3.
In order to provide scalability and to ensure identical configuration among
the servers, the operating system and file system of each server were served over
the network from a centralized computer. In order to produce homogeneous
loading among the cluster nodes and to better emulate a real-world application,
the cluster utilized a distributed version of the Linux kernel [45] configured to
operate the cluster under a single-system image configuration. The operating
system scheduler was configured to migrate threads among the cluster nodes
to balance workload. The computational workload generator used was the
standard Linux “stress” utility [46], configured to perform repeated square
root operations. The computational load setup is shown in Fig. 6.4.
In order to provide a fair comparison between the power losses associated
with the series-connected power distribution system and the parallel power
architecture, the two configurations illustrated in Figs. 6.5 and 6.6 were
created. A programmable 0-60 V, 2 kW power supply (Agilent 6674A) was used
to generate the 48 V DC bus in both cases. The loads under both configurations
consisted of four pairs of servers (Dell Optiplex SX280 with Pentium 4
and 755 with Core 2 Duo), each pair having a single 12 V DC input and a
combined power at full load of 160 W.
Figure 6.4: Cluster configuration for computational load experiment. (Diagram labels: three rack hardware nodes, each running a cluster container; network; root filesystem; user terminal hardware; multithreaded computational workload (stress).)
Figure 6.5: Parallel architecture. (Diagram labels: 48V DC bus, best-in-class 12V PSU (DC/DC), 160W server pair.)
6.2 Experimental Results
In the parallel distribution system of Fig. 6.5, an extensive search was made to
locate the best-in-class (most efficient) commercially available network power
supply with 48 V input, 12 V output, and a 300 W power rating. At the time
of writing, the power supply meeting these criteria was the PQ60120QEx25
model, manufactured by SynQor, Inc. Because the efficiency characteristics
of the commercial power supply are unlikely to exhibit significant variation,
only one power supply was purchased, and the power loss measurements were
performed sequentially for each server pair. The total losses were calculated
by summing the mean power loss associated with each particular server pair
running at maximum computational load for ten minutes.
Figure 6.6: Series architecture. (Diagram labels: 48V DC bus, custom bidirectional PSU (DC/DC), 160W server pair.)
Figure 6.7: National Instruments data acquisition unit.
The series-connected power delivery system of Fig. 6.6 was implemented
according to Fig. 5.1 previously explained in Chapter 5, using the same server
pairs and the same computational load as in the parallel configuration. In
this case, the power measurements of each server were done in parallel, rather
than sequentially. Due to temperature variations between measurements,
the differences in voltage regulation (see Fig. 5.2), and other factors, the
total power of the fully-loaded servers when operating under the parallel
configuration experiment was not identical to the total power of the same
fully-loaded servers when operating under the series configuration. Thus,
it was necessary to compare the power loss (difference between total power
leaving the bus and total power delivered to the servers) in each configuration,
rather than just comparing the total required bus power.
Fig. 6.7 is a photograph of the measurement equipment, and Fig. 6.8 is
a photograph of the servers. Fig. 6.9 shows an example time-domain
comparison between the two different architectures. The solid blue line indicates
total losses from the entire four-node cluster, while the solid red line
indicates losses from a best-in-class power supply delivering power in the parallel
configuration to a single node. As would be expected, the parallel-configured
system exhibits losses that are roughly proportional to the delivered power.
As explained in the preceding section, the series-configured system shows
losses that are independent of the total power delivered, and instead
depend only on the power mismatch between the nodes. During the time
period between three and nine seconds, the workload of the individual servers
in the cluster was increased, one server at a time. The relatively large power
losses during this region illustrate the effect of power mismatch between the
cluster nodes. The relatively low power losses after the nine second mark
illustrate the effect of balancing the servers’ workload (and thus their power
consumption). From the plot, it is clear that the best-in-class power supply is
operating at high efficiency (95%). Still, these losses are quite high compared
to the series-connected configuration.
Figure 6.8: Optiplex servers.
Table 6.2 presents the results of the comparison experiment. The totals
for input power, output power, and power loss were calculated directly from
the measured data before rounding, and may differ slightly from the values
obtained by summing the rounded values in the table. Compared to the
parallel architecture, the series-connected power delivery system delivers
approximately the same amount of power to the cluster, but reduces conversion
losses by a factor of over five.
Figure 6.9: Representative time domain power loss comparison. (Example power loss comparison, 100 ms average; power in W on a logarithmic axis from 10^0 to 10^3 versus time from 0 to 12 s; traces: whole-cluster losses from series-connected power delivery system, power delivered to whole series-connected cluster, single-server losses from best-in-class power supply, power delivered to single server by best-in-class power supply; annotated values: 155.7 W, 8.474 W, 7.842 W, 617.8 W.)
It should be noted that this experiment represents a reasonably
conservative estimate of the attainable power loss reduction in a series-stacked
server cluster. First, because all the control and communication circuitry in
the series-connected system draws its power from the bus, and no effort was
made to minimize microcontroller or communication power consumption, the
static losses of the DPP modules may realistically be expected to decrease.
Second, since the server load in this case was balanced only in the CPU
utilization sense (no power-aware scheduling was used), future systems that take
advantage of power-aware scheduling techniques may be expected to further
reduce the differential power required to maintain voltage regulation and
thereby reduce conversion losses even further. Third, the series-connected
system in this case is being compared against losses associated with a
best-in-class commercial power supply with very high efficiency. In many data
center applications, where the existing power supplies are operating in the
80% to 90% efficiency range, the series-stacked architecture could offer a
conversion loss reduction factor upwards of 20 to 50, rather than the factor of 5
improvement reported here. Last, it should be noted that the power savings
presented here can be expected to scale as the power delivered to the cluster
increases, due to the decoupled nature of power delivery and conversion.
Table 6.2: Comparison of architecture-related losses.
                           Parallel system   Series-stacked system
Number of server nodes     4 pairs of 2      4 pairs of 2
Duration of measurement    10 min            10 min
Measurement accuracy       ± 0.5 W           ± 0.5 W
Node 0 input power         167 W             -
Node 0 output power        164 W             159 W
Node 1 input power         169 W             -
Node 1 output power        164 W             149 W
Node 2 input power         164 W             -
Node 2 output power        160 W             161 W
Node 3 input power         157 W             -
Node 3 output power        154 W             154 W
Total input power          657 W             625 W
Total output power         641 W             623 W
Total power loss           16 W              3 W
System efficiency          97.5%             99.5%
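The loss reduction factors quoted in this section can be reproduced with simple arithmetic. The sketch below uses the rounded totals from Table 6.2 and assumes, for illustration only, that the series-stacked losses stay at the measured 3 W while the same 623 W is delivered through a parallel supply of a given efficiency; the function name is hypothetical.

```python
def parallel_loss_w(p_out_w, efficiency):
    """Conversion loss (W) of a supply delivering p_out_w at a given efficiency."""
    return p_out_w * (1.0 / efficiency - 1.0)


# Rounded totals from Table 6.2.
PARALLEL_LOSS_W = 16.0
SERIES_LOSS_W = 3.0
SERIES_OUTPUT_W = 623.0

measured_factor = PARALLEL_LOSS_W / SERIES_LOSS_W
print(f"measured loss reduction factor: {measured_factor:.1f}")

for eff in (0.90, 0.80):
    factor = parallel_loss_w(SERIES_OUTPUT_W, eff) / SERIES_LOSS_W
    print(f"parallel supply at {eff:.0%}: factor of about {factor:.0f}")
```

Under these assumptions the measured factor is a little over five, and supplies operating at 90% and 80% efficiency give factors of roughly 23 and 52, consistent with the range of 20 to 50 stated above.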
Chapter 7
Conclusions
This investigation has practically demonstrated the loss-reduction benefits
of the series-connected power delivery architecture by comparing losses from
a fully-operational series-connected cluster with an equivalent cluster pow-
ered by best-in-class converters. Areas for future investigation include fault
protection strategies and hardware, more energy-conscious voltage regulation
during the case of ultra-small power mismatch between nodes, power-aware
scheduling, and investigation of other differential power processing topologies
and loading scenarios.
Appendix A
Prototype Converter Module Design Files
Figure A.1: Top level schematic. (Sheet title: Top Level Schematic, Rev 1.1, Thursday, July 03, 2014. Blocks: DRMOS_4x1PHASE, HISIDE_CURRENT_SENSE (two instances), SERVER_SWITCH, I2CISOLATOR, PSU_DRIVER, PSU_LOGIC, MCU, 1.3V LDO. Schematic notes: There are two ports in case the current capacity of one is not enough, and for debugging purposes. Change notice: GNDSRV_SNS repurposed to become Vhi_window_lo and shorted to ADC7_sns; VHI_sns repurposed to become Vhi_window_hi. Configuration notes: Populate D1-D3. Short inductor current sense. Populate R3. Use LDOs instead of switching regulators. Don't populate J3.)
Figure A.2: High side current sense module. (Sheet title: High Side Current Sense, Rev 1.1, Thursday, July 03, 2014. Schematic notes: Change since v0.3: Added stability capacitor on Vin. Removed output current limit and protection zener. It's now up to the outside designer to choose input voltage that doesn't damage ADC. Notes: Only need the 10 ohm resistors and zeners if you're worried about the current going really high. Use 0 ohms instead of 10, and skip the zeners.)
Figure A.3: UART isolator module. (Sheet title: UART Isolator, Rev 1.1, Thursday, July 03, 2014. Parts: I2C Isolator, shrouded header, two 0.1uF capacitors. Schematic notes: You need external pullup resistors on the earth-side SDA and SCL, in addition to an external +5V supply.)
[Schematic: Server Switch, Rev 1.1, Thursday, July 03, 2014. Parts: three SiR812DP MOSFETs (U25, U26, U27). Nets: GNDsrv, GATE, GNDsrvOUT.]
Figure A.4: Server switch module.
[Schematic: Microcontroller Module, Rev 1.1, Thursday, July 03, 2014. Parts: U17 (ATXMEGA-A4U), U18 (2.6V; pins IN, OUT, GND), L5 (10 uH), headers J5 to J9, decoupling capacitors and 0 ohm / 470K resistors. Signals: PWM[0..7], ADC0 to ADC9, DAC0, DAC1, SDA, SCL, RX, TX, PDI, RESET, AREF, PR0, PR1, PC4 to PC7, PD4 to PD7. Sense inputs and scaling labels: VLO_sns (VLO/5.77), VHI_sns (VHI/11.5), GNDSRV_sns (GNDSRV/5.77), IHI_sns (IHI/20 + os), ILO_sns (ILO/20 + os), IIND_sns (IIND/20 + os), IOFFSET, SRV_EN.]
Schematic notes:
Change from 0.3: PWM4->BATT_EN, PWM5->DRIVER_PSU_EN, PWM6->PD1, PWM7->PD0.
Vcc: 3.3V (3.6V max). Avcc: 3.3V (Vcc + 0.3V max). AREF: 2.6V (Avcc - 0.6V = 2.7V max).
Change notice: GNDSRV_SNS repurposed to become Vhi_window_lo, and shorted to ADC7_sns. VHI_sns repurposed to become Vhi_window_hi.
Figure A.5: Microcontroller module.
[Schematic: Power Stage Module, Rev 1.1, Thursday, July 03, 2014. Blocks: BLK9, BLK11, BLK12 and BLK14 (DRMOS single-phase power stages), BLK10 (HISIDE_CURRENT_SENSE), BLK13 (INDUCTORS_BLOCK). Parts: R14 (0.002 ohm), R13 (0 ohm). Nets: VHI, VLO, 10V, 3.3V, GND, PWM[0..7], GATE[0..7], VSW[0..3], VSWOUT+, VSWOUT-, INDUCTOR_SNS+, INDUCTOR_SNS-, REF_0AMPS, ISNS_OUT.]
Schematic notes:
Change since v0.3: Added current sense to inductor 0.
TODO: Replace current sense resistor with 0.002 ohm, for improved accuracy of reading.
NOTE: Current sense of inductor is upside down for layout reasons.
Figure A.6: Four-phase power stage module.
[Schematic: Single Phase Power Stage, Rev 1.1, Thursday, July 03, 2014. Parts: U6 and U7 (CSD18504Q5A MOSFETs), U8 (FET driver, HS+LS; pins VDD, VSS, HB, HO, HS, HI, LI, LO, PAD), C11, C13 and C14 (47 uF), C10 (0.1 uF), C12 (1 uF). Nets: VHI, 10V, GND, PWM[0..1], GATE[0..1], VSW, BOOT.]
Figure A.7: Single-phase submodule of power stage.
[Schematic: Inductor Block, Rev 1.1, Thursday, July 03, 2014. Parts: L1 to L4 (10 uH), C20 to C31 (47 uF), header J4. Nets: VSW[0..3], VSWOUT0 to VSWOUT3, VSWOUT+, VSWOUT-, Vlo, GND.]
Figure A.8: Four-phase inductor block module.
Figure A.9: PCB mask for top layer.
Figure A.10: PCB mask for inner ground layer.
Figure A.11: PCB mask for inner route layer.
Figure A.12: PCB mask for bottom layer.
Appendix B
Related Software Projects
B.1 Web Traffic Generator
Introduction
While it was determined in Chapter 4 that web traffic was not particularly
amenable to the element-to-element differential power processing configu-
ration, the following study may be useful to other series-connected server
clusters making use of alternative DPP topologies. The reason it is provided
here is that the data center function with the largest market share is serving
web traffic. In order to reduce the variation between server powers within
a stack, a simple binning method like the one illustrated in Fig. B.1 may
be applied once there exists an estimate of power consumption per queued
job. In this case, it is assumed that there is a large queue of incoming jobs.
After analyzing the incoming traffic over some period of time, a probability
distribution of the incoming job parameter may be generated empirically.
Once this is done, constant-variance bin boundaries may be set, and further
incoming jobs can be placed into their respective bins, which function as sub-
queues. Each bin is then associated with a particular stack of servers, such
that that stack of servers only obtains jobs from one bin. In this manner, the
stacks will see mean power consumption with significantly reduced variation
between them, while the datacenter as a whole can support a much wider
range of mean power consumptions. Of course, in a real-world implementa-
tion, special considerations may have to be made for the edge bins, but the
general principle remains the same.
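The binning step described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes that "constant variance" can be approximated by capping the variance of each bin, and the function name `bounded_variance_bins` and the parameter `max_variance` are hypothetical.

```python
def bounded_variance_bins(powers, max_variance):
    """Split sorted power estimates into consecutive bins (sub-queues).

    Assumption: a new bin is started whenever adding the next sample would
    push the population variance of the current bin above max_variance.
    """
    bins, current = [], []
    n = total = total_sq = 0.0
    for p in sorted(powers):
        # Variance of the current bin if p were added, from running sums.
        n2, s2, q2 = n + 1, total + p, total_sq + p * p
        var = q2 / n2 - (s2 / n2) ** 2
        if current and var > max_variance:
            bins.append(current)            # close the bin at this boundary
            current, n2, s2, q2 = [], 1, p, p * p
        current.append(p)
        n, total, total_sq = n2, s2, q2
    if current:
        bins.append(current)
    return bins
```

Each returned bin would then feed one stack of servers, so that the spread of job power within a stack stays below the chosen cap.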
When constructing the mapping between job and power in the case of
web traffic jobs, the parameter chosen for this study is traffic rate. This
parameter is easily identifiable because it is often heavily correlated with
URL and search term. In many cases, the traffic rate for a particular URL
changes slowly, often on a time scale of hours, and sometimes, as illustrated
in Fig. B.2, on a time scale of months. It is assumed for the purpose of
modeling that the traffic arrivals follow a Poisson process, at least on the
time scale at which the scheduler can route traffic flows.

[Plot: "Variance reduction by binning". x-axis: sorting order (x 10^4); y-axis: estimated power consumption.]
Figure B.1: Constant-variance binning.
Figure B.2: Google search trends for Chicago datacenter.
Traffic Generation
In order to develop a mapping between Poisson traffic arrival rate and server
power consumption, it was first necessary to develop a traffic generator
capable of generating requests both at a high rate and with precisely
distributed inter-request times. Initial development efforts led to the
conclusion that these two requirements are difficult to achieve using a
single traffic generating computer.

[Diagram: M request producers, each with exponentially distributed inter-request times, feed a threadsafe queue that is drained by N response consumers. The aggregate request stream is also exponentially distributed, with the rate scaled by M, provided N >> M.]
Figure B.3: Threadsafe producer/consumer queue structure.
The naïve traffic generator implementation allocates a new thread each
time a new request is generated. However, at high traffic rates, it was found
that the operating system threading limit was reached with enough frequency
to make the traffic generation software unreliable and prone to crashing. To
overcome this limitation, the threadsafe producer/consumer queue structure
illustrated in Fig. B.3 was employed to avoid thread allocation wait times
and to keep the number of total threads below the operating system limit.
The number of consumer threads required for this queue structure was
calculated from a probability analysis of the queue length under the
assumption that the producer threads inserted requests into the queue
according to a Poisson process and that the queue can grow without bound.
Where $\rho = \lambda_{request}/\lambda_{service}$, Eq. (B.1) is a well-known
expression for the probability $\pi_i$ that the type of queue described above
contains $i$ elements.
\[ \pi_i = (1-\rho)\rho^i \tag{B.1} \]
Re-arranging and summing Eq. (B.1) results in Eq. (B.2), which describes
the minimum number of response consumers $i_{min}$ needed to ensure that, with
probability $P = \sum_{j=0}^{i} \pi_j$, every element in the queue may be
expected to be serviced within time $1/\lambda_{service}$. In other words,
$i_{min}$ determines the severity and number of expected deviations from the
generated inter-request times.
\[ i_{min} > \mathrm{ceil}\left( \frac{\log\left(1-\sum_{j=0}^{i}\pi_j\right)}{\log(\rho)} - 1 \right) \tag{B.2} \]
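As a worked illustration of Eqs. (B.1) and (B.2), the sketch below evaluates the ceiling expression on the right-hand side of Eq. (B.2) for a given utilization and target probability. The function names are hypothetical, and the closed form used in the comment follows from summing the geometric series in Eq. (B.1).

```python
import math

def queue_length_probability(rho, i):
    """Eq. (B.1): probability that the queue holds exactly i elements."""
    return (1.0 - rho) * rho ** i

def consumer_bound(rho, p):
    """Ceiling expression of Eq. (B.2) for utilization rho and probability p.

    Summing Eq. (B.1) gives sum_{j=0..i} pi_j = 1 - rho**(i + 1), so the
    cumulative probability reaches p once i >= log(1 - p) / log(rho) - 1.
    """
    if not (0.0 < rho < 1.0 and 0.0 < p < 1.0):
        raise ValueError("rho and p must lie strictly between 0 and 1")
    return max(0, math.ceil(math.log(1.0 - p) / math.log(rho) - 1.0))
```

For example, with `rho = 0.5` and `p = 0.99` the expression evaluates to 6, and the first seven terms of Eq. (B.1) sum to 1 - 0.5^7, which is just above 0.99.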
While this method allowed the traffic generation rate to increase beyond
the naïve traffic generator implementation, the increase was not significant
enough to merit further development of this architecture. The range of
possible traffic rates was simply too low to be useful in understanding the
full range of server power consumption in response to traffic. For example,
Fig. B.4 illustrates how a single-machine traffic generator fails to generate
accurate traffic rates at speeds just over 200 requests per second.

[Plot: "Desired rate versus generated rate". x-axis: time (seconds); y-axis: traffic rate (requests per second); one trace each for λ = 40, 120, 200, 280, 400, 800, 1200 and 1600.]
Figure B.4: Measured traffic rate versus generated traffic rate for one traffic
generator computer.
In order to overcome this challenge, distributed traffic generation software
with synchronized start capability was developed. This software exploits the
well-known minimum-of-exponentials theorem (Theorem B.1.1) to set the
aggregate traffic rate by adjusting the rates of individual traffic
generating nodes. An example of the fast and accurate traffic generation rates
attainable by this software is shown in Fig. B.5, which represents results from
a four-machine traffic generation cluster.
Theorem B.1.1 Let $X_1 \sim \exp(\lambda_1)$ and $X_2 \sim \exp(\lambda_2)$ be independent. Then
$X = \min\{X_1, X_2\} \sim \exp(\lambda_1 + \lambda_2)$.
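The consequence of Theorem B.1.1 for a generator cluster can be checked numerically. The sketch below is only an illustration, not the distributed generator described here: it merges independent Poisson streams produced in one process and measures the aggregate rate. The function names and the seed are hypothetical.

```python
import random

def poisson_arrivals(rate, duration, rng):
    """Arrival times on [0, duration) with Exp(rate) inter-arrival times."""
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def merged_rate(node_rates, duration, seed=0):
    """Measured aggregate rate when every node starts at time zero."""
    rng = random.Random(seed)
    merged = sorted(
        t for rate in node_rates for t in poisson_arrivals(rate, duration, rng)
    )
    return len(merged) / duration
```

Calling `merged_rate([100.0] * 4, 50.0)` should return a value close to 400 requests per second, the sum of the node rates.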
In the development of the distributed traffic generation software, it was
found that small time shifts between the traffic start times of the individual
generator nodes could have a significant effect on the rate of the aggregate
traffic. This effect was especially pronounced when the number of nodes was
large and the desired aggregate rate was high. Fig. B.6 illustrates this effect,
showing as much as 12% deviation in the traffic rate, at a time shift of only
around 150 milliseconds.

[Plot: "Desired rate versus generated rate". x-axis: time (seconds); y-axis: traffic rate (requests per second); one trace each for λ = 400, 280, 200, 120 and 40.]
Figure B.5: Measured traffic rate versus generated traffic rate for cluster of
four traffic generator computers.
On the commodity network setup and computers used in this experiment,
standard TCP sockets could not be accessed in rapid enough succession to
ensure node start delays below 100 milliseconds, due to various software and
network delays. IP multicast was therefore used to address all generator
nodes at once and allow for a truly synchronized start (less than 10
millisecond time shift between nodes). Commented code along
with scripts for automated cluster deployment and detailed documentation
for configuring Windows and Linux operating systems to allow bidirectional
multicast traffic are available for download at http://cl.ly/060K2J1U0v2W.
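A multicast start trigger of the kind described above might look like the following sketch. It is not the released code: the group address, port, message layout and function names are all assumptions made for illustration, and only standard UDP socket options are used.

```python
import socket
import struct

GROUP = "239.1.2.3"   # hypothetical administratively scoped multicast group
PORT = 50000          # hypothetical UDP port

def encode_start(run_id):
    """Pack a run identifier into a 4-byte start message (assumed format)."""
    return struct.pack("!I", run_id)

def decode_start(payload):
    """Recover the run identifier from a start message."""
    return struct.unpack("!I", payload)[0]

def send_start(run_id):
    """Controller side: one datagram reaches every subscribed generator."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(encode_start(run_id), (GROUP, PORT))
    sock.close()

def wait_for_start():
    """Generator side: join the group and block until the trigger arrives."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    membership = socket.inet_aton(GROUP) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    payload, _sender = sock.recvfrom(64)
    sock.close()
    return decode_start(payload)
```

Each generator node would call `wait_for_start()` and begin emitting requests as soon as it returns, while the controller calls `send_start()` once.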
Model Description
The proposed mapping from Poisson traffic rate to server power is shown
in Fig. B.7. The top portion of the figure indicates the implicit assumption
that the request/server environment behaves as a so-called M/M/1 queue.
The bottom portion of the figure indicates a linear increase from idle at very
slow traffic to some maximum power consumption, which occurs when the
request rate exceeds the processing rate. After this point, the power remains
relatively constant.

[Plot: "Aggregate rate for 4 1000-sample streams with λ = 1000". x-axis: average time shift between stream start time (seconds); y-axis: percent variation from expected traffic rate.]
Figure B.6: Measured traffic rate versus generated traffic rate for cluster of
four traffic generator computers.
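The rate-to-power mapping described above can be written as a short piecewise-linear function. This is a sketch under the stated model only; the parameter names `p_idle` and `p_max` are introduced here for illustration.

```python
def server_power(request_rate, service_rate, p_idle, p_max):
    """Piecewise-linear power model: linear in the request rate up to the
    service rate, then saturated at p_max."""
    utilization = min(max(request_rate, 0.0) / service_rate, 1.0)
    return p_idle + (p_max - p_idle) * utilization
```

For instance, with an idle power of 1.2 W, a maximum of 1.8 W and a service rate of 400 requests per second, a request rate of 200 per second maps to about 1.5 W, and any rate above 400 per second maps to 1.8 W.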
This model is corroborated by two different experimental measurements.
The first, shown in Figs. B.8 and B.9, makes use of the Agilent power
measurement software to measure the input power of a single Beaglebone Black
computer, running Ubuntu and the nginx server. The traffic generator nodes
are six other Beaglebone Blacks, also running Ubuntu. A photograph of the
setup is shown in Fig. B.10.

[Diagram: top, a queue with arrival rate λ and service rate μ; bottom, power P plotted against λ, with the breakpoint at μ.]
Figure B.7: Proposed rate-to-power mapping.

[Plots: "Average latencies for 6 streams" (mean latency in seconds versus average rate in req/s; traces "Including data load" and "Time to response") and "Average powers for 6 streams" (average power in W versus average rate in req/s).]
Figure B.8: Beaglebone Black with nginx server mean power consumption
(single-threaded configuration).
The second experiment makes use of the DAQmx power measurement soft-
ware to measure the input power of a single Dell Optiplex GX280 running
Ubuntu and Apache server. The traffic generator nodes are four other Opti-
plex desktops also running Ubuntu. The server’s mean power consumption
is shown in Fig. B.11.
B.2 Queuing Behavior Simulator
B.2.1 Simulation Model
While there are many different ways of injecting currents and ensuring that
the average current between successive servers is small, the scope of this
study is limited to the effect of power-aware job scheduling on systems
which utilize the element-to-element DPP architecture (described in the
Implementation Details section). The model implements two distinct routing
policies, with a number of adjustable parameters. Due to time constraints,
this study only implements one server stack, rather than the many clusters
of stacked servers that would appear in a real-world datacenter. However,
since all stacks ought to have approximately equal job input rates, the policy
evaluation results should be the same for a multiple-cluster system.

[Plots: "Average latencies for 6 streams" (mean latency in seconds versus average rate in req/s; traces "Including data load" and "Time to response") and "Average powers for 6 streams" (average power in W versus average rate in req/s).]
Figure B.9: Beaglebone Black with nginx server mean power consumption
(multi-threaded configuration).

Figure B.10: Beaglebone measurement setup.

[Plots: "Average response time for a single 9MB file (95% confidence)" (response time in ms versus time in s) and "Average power with minimum and maximum 100ms average" (power in W versus time in s); one trace each for λ = 400, 280, 200, 120 and 40.]
Figure B.11: Dell Optiplex GX280 with Apache server mean power
consumption (multi-threaded configuration).

[Diagram: job routing with a FIFO stage.]
Figure B.12: Power-aware routing policy.
B.2.2 Implementation Description
The model parameters and their distributions are described in Table B.1.
The actual values used for these parameters are discussed in the experimental
design section. Figures B.12 and B.13 represent the conceptual simulation
model.
[Diagram: job routing with a FIFO stage.]
Figure B.13: FIFO routing policy.

Table B.1: Model parameters.
Symbol        Description
T             Maximum simulation time.
λ             Exponential job interarrival rate.
N_input       Length of the unsorted job input buffer.
N_sort        Length of the job sorting queue.
N_srv         Number of servers in the stack.
t_a1, t_b1    Minimum and maximum of uniformly distributed amortized job sorting time.
t_a2, t_b2    Minimum and maximum of uniformly distributed amortized job routing time.
µ_s, σ_s      Mean and standard deviation of normally distributed job completion time.
µ_p, σ_p      Mean and standard deviation of normally distributed job power consumption.
σ_err         Standard deviation of normally distributed power consumption estimator error.
V_bus         Bus voltage.
η_dpp         DPP efficiency.

The power-aware routing policy estimates the power consumption of each
job, and then sends the job with the highest estimated power consumption
to the empty server which is most “in need of work.” The notion of
“needing work” follows from the discussion in the introduction
regarding how energy conversion losses are proportional to the difference
between server currents. The metric for work “need” used in this study is
$(I_{i-1} + I_{i+1})/2$, where $I_{i-1}$ and $I_{i+1}$ are the currents of the
servers above and below server $i$. The edge cases are handled by assigning
$I_{-1}$ and $I_{N+1}$ to equal the string current. In contrast, the FIFO
routing policy simply sends jobs to a randomly-chosen available server in the
order that the jobs arrive.
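The power-aware selection step can be sketched as below. This is an illustrative reading of the policy rather than the simulator's code: it assumes that a larger neighbour-average current means a greater "need" for work, and the function names and list-based data layout are hypothetical.

```python
def work_need(i, currents, string_current):
    """Need metric (I_{i-1} + I_{i+1}) / 2 for server i.

    Neighbours beyond either end of the stack are replaced by the string
    current, as in the edge-case rule described above.
    """
    above = currents[i - 1] if i > 0 else string_current
    below = currents[i + 1] if i + 1 < len(currents) else string_current
    return (above + below) / 2.0

def route_power_aware(estimated_powers, currents, idle_servers, string_current):
    """Pair the highest-power queued job with the neediest idle server.

    Returns (job_index, server_index). Assumes both the job list and the
    idle-server list are non-empty.
    """
    job = max(range(len(estimated_powers)), key=lambda j: estimated_powers[j])
    server = max(idle_servers,
                 key=lambda i: work_need(i, currents, string_current))
    return job, server
```

With server currents `[10, 0, 30, 0, 5]`, idle servers 1 and 3, and a string current of 15, server 1 has a need of 20 and server 3 a need of 17.5, so the highest-power job is sent to server 1.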
Implementation Details
This simulator is implemented in C++ and makes extensive use of the Boost
library, version 1.55. Where not otherwise specified, it is to be assumed
that all external libraries referenced in this document are part of this Boost
library. The source code for this simulator is released to the public
domain and is available for download at
https://github.com/jcmcclurg/DataCenterSim.

Let $P$ be a random variable with cumulative distribution function $F(x)$
such that
\[ F(x) = \Pr[P \le x] \]
It can be shown that if $U \sim \mathrm{Uniform}(0, 1)$, then the new random
variable $X$ defined as $X = F^{-1}(U)$ has the same distribution as $P$,
according to Eq. (B.3).
\[ \Pr[X \le x] = F(x) = \Pr[P \le x] \tag{B.3} \]
Thus, generating a uniform random variable between zero and one, and
choosing samples from X as defined above, will generate samples from P. In
this study, the Mersenne-Twister [47] pseudo-random number generator is
used to generate a single stream of U(0, 1) random doubles, which are then
transformed into a variety of distributions by the method described above.
The inverse cumulative distribution functions are defined in the Random
library.
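The inverse-transform method can be illustrated with two distributions whose inverse cumulative distribution functions have closed forms. The sketch below uses Python's standard `random` module, whose generator is also a Mersenne Twister, rather than the Boost Random library used by the simulator; the function names are hypothetical.

```python
import math
import random

def exponential_from_uniform(rate, u):
    """Inverse CDF of Exp(rate): solves u = 1 - exp(-rate * x) for x."""
    return -math.log(1.0 - u) / rate

def uniform_from_uniform(a, b, u):
    """Inverse CDF of Uniform(a, b)."""
    return a + (b - a) * u

def draw_interarrival_times(rate, count, seed=0):
    """Transform a single stream of U(0, 1) doubles into Exp(rate) samples."""
    rng = random.Random(seed)
    return [exponential_from_uniform(rate, rng.random()) for _ in range(count)]
```

Averaging a large number of samples from `draw_interarrival_times(1000.0, n)` should give a value close to the expected mean of 0.001 seconds.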
[Diagram: a single JobEvent allocated in memory represents multiple different conceptual objects throughout its lifecycle. As a job arrival event sitting in the event list, it has time = arrival time and is sorted by time. As a queued job waiting in the sorting queue, it is sorted by estimated power consumption. As a job working in the server stack, it has time = current time and is sorted by differential current. As a job finished event sitting in the event list, it has time = scheduled completion time and is sorted by time.]
Figure B.14: Memory saving strategy for similar objects.
Queuing and Sorting Behavior
This project makes use of three sorting queues, as mentioned in the model
description above: the event list, the sorted jobs list, and the working servers
list. To efficiently maintain the sorted nature of these lists, the project makes
use of the priority queue data structure, implemented as a Fibonacci heap.
This data structure is attractive because of its logarithmic amortized
complexity [48]. The Heap library provides an implementation of this data
structure.
The first implementation of this project had large memory requirements
and was quite slow, due to copying object data as the job moved from one list
to the next. A substantial speed and memory utilization improvement was
achieved by implementing a mutable job object which is allocated
once, upon arrival, and then automatically adjusts its behavior depending
on the state of its lifecycle. This concept is illustrated in
Figure B.14.
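The mutable job object can be sketched as a single object whose sort key depends on its lifecycle state. The sketch below is an assumption-laden illustration: it uses Python's `heapq` binary heap instead of a Fibonacci heap, models only three of the lifecycle states, and all class, state and function names are hypothetical.

```python
import heapq
import itertools

class JobEvent:
    """One allocation per job; its sort key changes with its lifecycle state."""
    _ids = itertools.count()

    def __init__(self, arrival_time, estimated_power):
        self.id = next(JobEvent._ids)
        self.state = "arrival"          # then "queued", then "finished"
        self.arrival_time = arrival_time
        self.estimated_power = estimated_power
        self.finish_time = None

    def key(self):
        if self.state == "arrival":
            return self.arrival_time        # event list: sorted by time
        if self.state == "queued":
            return -self.estimated_power    # sorting queue: highest power first
        return self.finish_time             # event list: sorted by completion

def push(heap, job):
    # The unique id breaks ties, so JobEvent objects are never compared.
    heapq.heappush(heap, (job.key(), job.id, job))

def pop(heap):
    return heapq.heappop(heap)[2]
```

A job is pushed onto the event list on arrival, popped, switched to the "queued" state, and pushed onto the sorting queue without being copied.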
Collection of Statistics
This program provides the ability to estimate the mean of the system param-
eters (described in the next section), with associated confidence intervals.
Let $R$ be a constant representing the number of samples in the series.
Assume that $R$ is large enough that
$\mu_R = \frac{1}{R}\sum_{i=1}^{R} X_i \sim N(\mu, \sigma^2/R)$, by the
Central Limit Theorem. From here, we seek not only to estimate the mean
$\mu_R$, but also to find the $p$-confidence interval. The $p$-confidence
interval indicates the range of values $V = [a_p, b_p]$ for which
\[ \Pr\left(a_p \le (\mu_R - \mu) \le b_p\right) = p \]
Now, the confidence interval as written above depends on the unknown
parameter $\mu$. However, the similar random variable
$t = \frac{(\mu_R - \mu)\sqrt{R}}{S_R}$ (again, under the assumption that
$\mu_R \sim N(\mu, \sigma^2/R)$) follows the Student's T distribution, which
is symmetric and depends only on the parameter $R - 1$. Letting
$\alpha = 1 - p$,
\[ \Pr\left( t_{\alpha/2} \le \frac{(\mu_R - \mu)\sqrt{R}}{S_R} \le -t_{\alpha/2} \right) = p \implies
   \Pr\left( t_{\alpha/2}\frac{S_R}{\sqrt{R}} \le (\mu_R - \mu) \le -t_{\alpha/2}\frac{S_R}{\sqrt{R}} \right) = p \]
After finding the critical value $t_{\alpha/2}$ such that
$\Pr\left(t \le t_{\alpha/2}\right) = \alpha/2$ from a table, the
$p$-confidence interval is found, with
$b_p = -a_p = -t_{\alpha/2}\frac{S_R}{\sqrt{R}}$.
Now, while it is straightforward to implement an iterative computation
of the sample mean $\mu_R$, calculating the sample standard deviation from the
definition below is prone to numerical errors, computationally costly, and
requires all of the previous samples to be kept in memory:
\[ S_R = \sqrt{\frac{1}{R-1}\sum_{i=1}^{R}(X_i - \mu_R)^2} \]
Fortunately, the method provided in [49] allows the quantity
$\frac{1}{R}\sum_{i=1}^{R}(X_i - \mu_R)^2$ to be computed iteratively without
the need for all previous samples to be kept in memory. A version of this
algorithm is implemented by the variance method of the Accumulator library.
From here, multiplying by $\frac{R}{R-1}$ and taking the square root gives
the desired value for $S_R$. The value $t_{\alpha/2}$ is
found from the inverse cumulative distribution function of the Student's T
distribution, which is implemented in the Quantile library. The confidence
intervals returned by this implementation are the single-sided confidence
interval widths $w$ indicating that $\mu_R$ is within $\pm w$ of $\mu$, with
probability $p$.

Table B.2: Varying model parameters of this study.
Parameter symbol   Parameter name    “-” Value          “+” Value
A                  Routing policy    Power aware        FIFO
B                  σ_s               0.00288 seconds    0.0288 seconds
C                  σ_err             3 watts            30 watts
The system makes use of Eq. (4.17) in order to calculate the power losses.
It should be noted that the matrix $(A\,\mathrm{diag}(D) + B)^{-1}C$ is
constant for this study, so it may be pre-computed and stored in memory, to
avoid the computationally expensive matrix inversion. The matrix algebra is
implemented in the uBlas library.
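The running statistics described in this section can be sketched with a Welford-style update. This is a stand-in for illustration: it is not necessarily the method of [49] or the algorithm inside the Boost Accumulator library, and the critical value $t_{\alpha/2}$ is passed in by the caller instead of being computed from the Student's T quantile function. The class and method names are hypothetical.

```python
import math

class RunningStats:
    """Single-pass mean and variance; no samples are kept in memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.sum_sq_dev = 0.0   # running sum of (X_i - mean)^2

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.sum_sq_dev += delta * (x - self.mean)

    def sample_std(self):
        """S_R, using the R - 1 denominator."""
        return math.sqrt(self.sum_sq_dev / (self.count - 1))

    def half_width(self, t_critical):
        """Confidence half-width w = |t_{alpha/2}| * S_R / sqrt(R)."""
        return abs(t_critical) * self.sample_std() / math.sqrt(self.count)
```

After feeding every sample through `add`, the interval estimate is `mean ± half_width(t_critical)`.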
B.2.3 Parameters and Metrics
While it would be desirable for us to consider the effect of all the model
parameters in our analysis, we will investigate the effects of only the three
parameters given in Table B.2, due to time constraints. All other factors
remain constant and are given in Table B.3.
Table B.3: Constant model parameters of this study.
Parameter     Value
T             100 seconds
λ             1000 jobs/second
N_input       100 positions
N_sort        100 positions
N_srv         32 servers
t_a1, t_b1    2.88e-05 seconds, 5.76e-05 seconds
t_a2, t_b2    2.88e-05 seconds, 5.76e-05 seconds
µ_p, σ_p      300 watts, 30 watts
µ_s           0.0288 seconds
V_bus         380 volts
η_dpp         100%

Table B.4: Metrics of interest.
Metric symbol   Metric name     Statistic description
R1              Mean latency    Difference between a particular job's arrival time and its departure time.
R2              Mean energy     Product of total differential power processed and the time increment between power change.
The metrics of interest in this study are shown in Table B.4.
B.2.4 Experimental Design
Output analysis
In this study, it is of interest to obtain a degree of confidence that the
average long-term performance results of a single run of the simulation are not
skewed by initialization bias. That is to say, we would like to believe that
the simulation results are in some sense representative of the fundamental
long-term properties of the system under study, and are not merely artifacts
of the random number seed chosen at the beginning of the simulation.

[Plot: "Average total energy over time." x-axis: time (seconds); y-axis: performance statistic; one trace each for seeds 0, 1, 10, 100 and 1000.]
Figure B.15: Mean energy under different simulation runs.

Table B.5: Metrics of interest (95% confidence).
Seed    Mean latency                 Mean energy
0       0.0289061 ± 0.000116207      8.17736 ± 0.0536616
1       0.0289005 ± 0.000116618      8.28702 ± 0.0544258
10      0.0288917 ± 0.000116776      8.28020 ± 0.0540625
100     0.0289031 ± 0.000117078      8.36641 ± 0.0556036
1000    0.0289119 ± 0.000118624      8.28692 ± 0.0536499
To achieve this confidence in a rigorous way is beyond the scope of this
study. For now, we will content ourselves with the observation that the two
metrics under consideration appear to approach a constant value under a
number of different simulation runs. The results of Figures B.15 and B.16 and
Table B.5 were obtained for the constant model parameters from Table B.3
discussed above, with the “-” value parameters specified from Table B.2.
Parameter Sensitivity
The techniques described in [50] can be used to assess the sensitivity of the
model to the three parameters under consideration. The goal of this
sensitivity analysis is to strengthen the claim that scheduling policy
considerations are the primary factor in determining system performance.

[Plot: "Average latency over time." x-axis: time (seconds); y-axis: performance statistic; one trace each for seeds 0, 1, 10, 100 and 1000.]
Figure B.16: Mean latency under different simulation runs.

Table B.6: Sign table for sensitivity analysis (95% confidence).
A   B   C   AB  AC  BC  ABC   R1                      R2
−   −   −   +   +   +   −     0.0289 ± 1.16×10^-4     8.17 ± 0.0536
+   −   −   −   −   +   +     0.0289 ± 1.18×10^-4     5.77 ± 0.0344
−   +   −   −   +   −   +     0.0289 ± 1.16×10^-4     8.17 ± 0.0536
+   +   −   +   −   −   −     0.0289 ± 1.18×10^-4     5.77 ± 0.0344
−   −   +   +   −   −   +     0.0311 ± 2.17×10^-4     7.14 ± 0.0515
+   −   +   −   +   −   −     0.0312 ± 2.19×10^-4     5.44 ± 0.0364
−   +   +   −   −   +   −     0.0311 ± 2.17×10^-4     7.14 ± 0.0515
+   +   +   +   +   +   +     0.0312 ± 2.19×10^-4     5.44 ± 0.0364
The details and explanation of the calculation are omitted due to time
constraints. However, the takeaway is that the effect estimates listed in
Table B.7 represent the relative sensitivity of the model to a particular
parameter, with larger magnitudes indicating the more important factors.
This effect estimate table indicates that while the latency (R1) is domi-
nated by the effect of the standard deviation of the error estimator, the total
mean energy is actually negatively affected by the sorting queue. One pos-
sible explanation for this is that the additional routing delay caused by the
sorting queue causes more energy loss than it saves.
Table B.7: Effect estimates.
Parameter   R1            R2
A           3.300e-05     -2.052
B           -8.674e-19    0
C           2.303e-03     -0.6796
AB          0             0
AC          2.160e-05     0.3522
BC          8.674e-19     0
ABC         0             0
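Since the calculation itself is omitted, the sketch below shows one standard way such effect estimates are obtained from a two-level sign table: each effect is the signed sum of the responses divided by $2^{k-1}$. This is an assumption about the calculation rather than the procedure of [50], and the function name is hypothetical. Runs are taken in the row order of Table B.6, where A alternates fastest.

```python
from itertools import combinations

def effect_estimates(responses, factors="ABC"):
    """Main and interaction effects for a full 2^k factorial design.

    responses[r] is the response of run r, where bit b of r is 1 when
    factor b is at its "+" level (A is bit 0, so it alternates fastest).
    """
    k = len(factors)
    if len(responses) != 2 ** k:
        raise ValueError("expected one response per run of the 2^k design")
    effects = {}
    for size in range(1, k + 1):
        for combo in combinations(range(k), size):
            contrast = 0.0
            for run, y in enumerate(responses):
                sign = 1
                for bit in combo:
                    sign *= 1 if (run >> bit) & 1 else -1
                contrast += sign * y
            name = "".join(factors[b] for b in combo)
            effects[name] = contrast / 2 ** (k - 1)
    return effects
```

Applied to the rounded R2 column of Table B.6, this gives about -2.05 for A, -0.68 for C and 0.35 for AC, which agrees with Table B.7 to within the rounding of the tabulated responses.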
B.2.5 Conclusion
We have presented a detailed simulation model for evaluating different power-
aware scheduling domains for data centers which make use of the differential
power processing architecture. From the results of the previous section, we
can draw two conclusions. First, the scheduling domain is the primary factor
affecting system performance, as indicated by the sensitivity analysis. Sec-
ond, by comparing the relative performance indicated in Table B.6, it seems
clear that the power-aware scheduling domain, at least under the parameters
studied, does not offer a substantial reduction in energy losses.
B.3 Distributed Load Balancing Emulator
The hardware implementation discussed in this thesis is useful for maintain-
ing system reliability in the case of software failure. This can be comple-
mented by power-aware scheduling frameworks like the one described in [51]
and the proof-of-concept software-only voltage regulation implementation de-
scribed in [22].
Existing weighted load-balancing protocols [52] may be capable of achiev-
ing similar functionality through careful coordination and continuous adjust-
ment of traffic allocation weights. However, this approach is costly in terms
of messages and may thwart some system-level goals for load allocation. The
protocol presented here was designed to operate at a small scale (the level of
an individual server rack), and to avoid deviating from the pre-imposed load
allocation except when necessary to prevent an overvoltage.
90
12V 
12V 
12V
Power-
aware load 
scheduler
Server cluster with 
distributed UPS
380V 
DC bus
Traffic 
generation
Network 
switch
Cluster test and 
control unit
..
.
Server 1
Server 2
Server 32
Figure B.17: Existing system architecture
[Block diagram. Labels: system-level load scheduler; distributed voltage-aware load scheduler; traffic generation; network switch; cluster test and control unit; server cluster with distributed UPS (Server 1, Server 2, ..., Server 32, 12 V each); 380 V DC bus.]
Figure B.18: Improved system architecture
The primary motivation for this software framework and the voltage-aware
load balancer presented here is to serve as the next step beyond the centralized
load balancing software presented in [22]. A graphical illustration of the
desired architectural changes is shown in Figs. B.17 and B.18.
B.3.1 Background
Assuming that each server has some input capacitance Ci, then its voltage
depends upon the string current Is and the server current Ii according to
Eq. (B.4).
$$ V_i(t) = \frac{1}{C_i} \int_0^t \left( I_s(t) - I_i(t) \right) dt + V_{i0} \tag{B.4} $$
Further assuming that the average string current and the average server
current are constant (we will justify this assumption later) so that
$\frac{1}{t} \int_0^t I_s(t)\,dt = \langle I_s \rangle$ and
$\frac{1}{t} \int_0^t I_i(t)\,dt = \langle I_i \rangle$, then the relationship
becomes the simple linear relation of Eq. (B.5).
$$ V_i(t) = \frac{\langle I_s \rangle - \langle I_i \rangle}{C_i}\, t + V_{i0} \tag{B.5} $$
Now, the electrical constraint that $\sum_{i=0}^{N} V_i(t) = V_{bus}$ may
actually be ignored in the case where either $I_s$ or $N$ is large. In the first
case, nonzero line resistance comes into play, and allows mismatch between the
sum of the server voltages and the bus voltage. In the second case,
$\sum_{i=0}^{N} V_i(t) \pm \max(V_i(t)) \approx V_{bus}$, so the difference
between the voltage calculated with the KVL constraint and the voltage
calculated only using KCL is small. Because our system exhibits both high
currents and a large number of nodes, it is proper to continue with the
derivation, using KCL only.
Because of the linear nature of our system, a simple proportional control
law will suffice to reach nominal voltage $V_{nom}$ in time $\Delta t$ according
to Eq. (B.6).
$$ \langle I_i \rangle_{setpoint} = \langle I_s \rangle_{measured} - \left( V_{nom} - V_{i,measured} \right) \frac{C_i}{\Delta t} \tag{B.6} $$
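As a quick numerical check, the control law of Eq. (B.6) can be substituted back into the linear model of Eq. (B.5). The following is a minimal sketch only; the function names and the numerical values (capacitance, currents, voltages) are hypothetical and are not taken from the hardware described in this thesis.

```python
def setpoint_current(i_string, v_measured, v_nom, capacitance, dt):
    """Eq. (B.6): average server current that brings V_i to V_nom in time dt."""
    return i_string - (v_nom - v_measured) * capacitance / dt


def voltage_after(i_string, i_server, v_start, capacitance, t):
    """Eq. (B.5): linear voltage ramp for constant average currents."""
    return (i_string - i_server) / capacitance * t + v_start


# Hypothetical operating point: the server sits 0.5 V above nominal.
i_set = setpoint_current(i_string=10.0, v_measured=12.5, v_nom=12.0,
                         capacitance=0.5, dt=2.0)
# Holding the setpoint current for dt seconds should return V_i to V_nom.
v_end = voltage_after(i_string=10.0, i_server=i_set, v_start=12.5,
                      capacitance=0.5, t=2.0)
print(i_set, v_end)
```

An overvoltage therefore calls for a server current above the string current, which is why the protocol responds by giving the affected server more work.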
Electrically, a server can be modeled as a constant power load, as is dis-
cussed in Chapter 4, with the current draw depending almost solely on load
and frequency. Now, the end goal of this protocol will be to obtain a local
estimate of the global state variable 〈Is〉, and to set 〈Ii〉 to a desired setpoint
〈Ii〉setpoint by rerouting queued server traffic within the cluster. However,
because we have shown that current is directly proportional to CPU utiliza-
tion, it is sufficient (for control purposes) to provide a means to throttle or
increase the average CPU time.
The way that this implementation accomplishes this is by adjusting the
number of concurrently executing threads (“jobs” in the sequel). Systems
which are memory or disk constrained (i.e. most real-world systems) will
show an increased CPU utilization as the thread count increases. Beyond a
certain point, this may actually come at the expense of an overall increase
in latency, and may even result in thrashing if the thread number is allowed
to grow without bound. But, these considerations are beyond the scope of
this implementation and are often avoidable by imposing a simple maximum
thread count constraint.
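One way to turn a current setpoint into a thread count is sketched below. It assumes, purely for illustration, a linear model in which each additional job adds a fixed current on top of an idle current; the parameters `i_idle`, `amps_per_job`, and `max_jobs` are hypothetical and would have to be measured for a particular server. The result is rounded up so that the server draws at least the setpoint current, and clamped to the maximum thread count mentioned above.

```python
import math


def batch_size_for_current(i_setpoint, i_idle, amps_per_job, max_jobs):
    """Number of parallel jobs needed to draw at least i_setpoint amps.

    Assumes current = i_idle + amps_per_job * jobs (a hypothetical model).
    """
    if amps_per_job <= 0:
        raise ValueError("amps_per_job must be positive")
    jobs = math.ceil((i_setpoint - i_idle) / amps_per_job)
    # Never go below zero jobs or above the configured maximum thread count.
    return max(0, min(max_jobs, jobs))


print(batch_size_for_current(10.125, i_idle=4.0, amps_per_job=2.5, max_jobs=8))
```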
B.3.2 Design goals
In the process of our design, we make the following assumptions:
1. Large Ci: Inter-server communication can happen during input voltage
transients. This is certainly true in the presence of a distributed unin-
terruptible power supply (our current configuration). More research is
needed to determine if this is a good assumption in the general case.
2. Bursty traffic: Requests are sent to the cluster in large bursts, rather
than being spread out as individual requests. This assumption is needed
only to ensure that there are enough jobs available for every server that
needs them.
3. Fast and reliable communication channel between servers. This is a
good assumption for many server cluster configurations (local Ethernet
with dedicated switch).
4. As a corollary to the above, job transfer between nodes is much faster
than the execution of any job.
5. External group membership management detects node failures, group
leaves, and group joins, and notifies group members. This is done for
implementation convenience only, as the protocol is easily extensible to
a distributed failure detector.
6. Time-invariant relationship between number of parallel jobs and CPU
current. Future implementations may easily relax this assumption,
because the protocol supports an arbitrary job allocation set point from
the voltage control loop.
Under these assumptions, we were able to meet the following design goals:
1. Minimal impact on throughput and latency: Protocol only re-allocates
jobs which otherwise would have had to wait to complete.
2. Minimal re-allocation of jobs: Protocol only re-allocates jobs if one or
more servers is in danger of overvoltage.
3. No job duplication, loss of jobs, or failure to acquire needed and
available jobs: Protocol implements a distributed mutex (Ricart-Agrawala)
to manage access to job allocation routines.
4. Completely thread-safe operation: Each server supports multiple
simultaneous job consumers (simulating a manycore processor) and
multiple simultaneous job producers (simulating multiple parallel load
balancers).
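For readers unfamiliar with the Ricart-Agrawala mutex named in the third design goal, the following single-process simulation sketches its message flow. It is not the implementation used in this project: the message queue, the function name, and the assumption that every request is issued at the start are simplifications made here, and the critical section is treated as instantaneous. At least two nodes are assumed.

```python
from collections import deque


def simulate_ricart_agrawala(n_nodes, requesters):
    """Return the order in which requesting nodes enter the critical section."""
    clock = [0] * n_nodes                        # Lamport clock per node
    request_ts = {}                              # node -> timestamp of pending request
    replies = {i: set() for i in range(n_nodes)}
    deferred = {i: [] for i in range(n_nodes)}
    network = deque()                            # FIFO of (kind, src, dst, timestamp)
    entry_order = []

    # Every requester timestamps its request and broadcasts it to all peers.
    for node in requesters:
        clock[node] += 1
        request_ts[node] = clock[node]
        for peer in range(n_nodes):
            if peer != node:
                network.append(("REQUEST", node, peer, request_ts[node]))

    while network:
        kind, src, dst, ts = network.popleft()
        clock[dst] = max(clock[dst], ts) + 1
        if kind == "REQUEST":
            # Defer the reply only if our own pending request has priority:
            # the lower timestamp wins, and ties go to the lower node id.
            if dst in request_ts and (request_ts[dst], dst) < (ts, src):
                deferred[dst].append(src)
            else:
                network.append(("REPLY", dst, src, clock[dst]))
        else:  # REPLY
            replies[dst].add(src)
            if len(replies[dst]) == n_nodes - 1:
                # All peers have granted permission: enter the critical section
                # (job allocation would happen here), then leave it again.
                entry_order.append(dst)
                del request_ts[dst]
                for waiting in deferred[dst]:
                    network.append(("REPLY", dst, waiting, clock[dst]))
                deferred[dst].clear()
    return entry_order


print(simulate_ricart_agrawala(3, [2, 0]))
```

Each requester enters exactly once, and a node only enters after every peer has replied, which is what rules out two servers claiming the same job.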
B.3.3 Protocol
Algorithm 1 Job allocation (called when the queue size decreases, or when the
batch size [number of required parallel jobs] changes and queue size < batch
size)
procedure populateQueue
    Acquire local locks (thread safety)
    Acquire distributed mutex (implemented as Ricart-Agrawala)
    for each other node do
        Get number of extra jobs (queue size - batch size)
    end for
    Based on this information, decide where to get jobs.
    Get them according to the previous step.
    Enqueue the new jobs, marking them with a higher priority (to prevent
    multi-hopping of jobs).
    Exit distributed mutex
    Exit local locks
end procedure
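Algorithm 1 leaves open how a node decides where to get jobs once it knows each peer's surplus. The sketch below shows one plausible policy, taking from the peers with the largest surplus first. This policy and the function name are assumptions made for illustration; the source does not state which rule the implementation uses.

```python
def plan_job_transfers(needed, extra_jobs):
    """Decide how many jobs to take from each peer.

    needed:     number of jobs this node is short (batch size - queue size)
    extra_jobs: dict mapping peer name -> (queue size - batch size) on that peer
    Returns a dict mapping peer name -> number of jobs to take from it.
    """
    plan = {}
    # Visit peers in order of decreasing surplus (hypothetical policy).
    by_surplus = sorted(extra_jobs.items(), key=lambda kv: kv[1], reverse=True)
    for peer, extra in by_surplus:
        if needed <= 0 or extra <= 0:
            break  # satisfied, or no remaining peer has jobs to spare
        take = min(extra, needed)
        plan[peer] = take
        needed -= take
    return plan


print(plan_job_transfers(5, {"node2": 2, "node3": 4, "node4": -1}))
```

Because only surplus jobs are taken, a peer is never pushed below its own batch size, which matches the goal of re-allocating only jobs that would otherwise have had to wait.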
B.3.4 Compiling and Running
1. Install Apache Thrift with libevent support.
2. git clone https://github.com/jcmcclurg/jcm-loadbalancer.git
3. make
4. To start the multicast group manager, type: ./hub
5. To start individual servers, type: ./node
6. To test the operation of the protocol, type: ./nodeTest
The node test executable demonstrates the correct operation of the proto-
col by sending many jobs to one node only, and directing the voltage control
loop to indicate that all nodes need to execute three jobs in parallel to avoid
an overvoltage condition. The jobs are slowed down significantly, so the op-
eration of the protocol can be observed by the user. What will be seen is
that one node will enqueue many jobs, while the other nodes continually re-
allocate those jobs to themselves, three at a time, as is needed to maintain
correct CPU loading. To exit the nodes, wait until you see “Job consumer
processing...”, and then hit Ctrl+C. To exit the hub or nodeTest, simply hit
Ctrl+C.
B.3.5 Conclusion and Ongoing Work
As the next step toward a fully operational series-connected server cluster,
this project has designed and implemented a fully distributed and flexible
load balancing protocol which lends itself particularly well to voltage regu-
lation. The protocol is designed to operate with minimal intrusion into the
existing traffic patterns – only performing job reallocation when it becomes
necessary to do so to prevent an overvoltage condition. Future research
will see the performance of this implementation first tested on server hard-
ware instrumented with power monitoring equipment. Following application-
specific parameter tuning, the protocol will be installed and tested on a series-
connected server cluster, comparing performance with the existing central-
ized scheduler.
Appendix C
Setting Up a Cluster
Setting up the distributed cluster described in this paper was somewhat
tricky, because the most recent version of the distributed kernel only
compiles on 64-bit computers, and half of the servers in the cluster have
only 32-bit processors. Kerrighed 2.4.1 is the most recent version that is
known to compile for 32-bit, and the most recent version of Ubuntu that it
compiles on is Hardy (the latest release of Hardy may be used). With this
in mind, the following procedure is needed to set up a new cluster of
diskless clients. It is inspired by
https://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide.
A shell script that automatically performs the following steps, together
with all the required initialization files, can be downloaded
from https://db.tt/YRwLFO3a.
C.1 Installing Ubuntu Hardy
The install of Hardy must be done using a DVD, as the image does not come
with the correct drivers for booting off of a USB stick. Also note that you will
be unable to use the net install, due to the fact that Hardy is past its Long-
Term-Support lifetime, and has been moved to old-releases.ubuntu.com. Be-
cause of this, it will be necessary to update /etc/apt/sources.list after the
install process has completed, in order to change all non-canonical URLs
pointing to [something].ubuntu.com to old-releases.ubuntu.com.
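The rewrite of /etc/apt/sources.list can be scripted. The following sketch shows the substitution on the file contents as a string; the regular expression and the function name are assumptions made here, and hosts outside ubuntu.com are deliberately left untouched.

```python
import re

# Matches the scheme and host of any [something].ubuntu.com URL.
_UBUNTU_HOST = re.compile(r"(https?://)[A-Za-z0-9.-]*\.ubuntu\.com")


def use_old_releases(sources_text):
    """Point every [something].ubuntu.com URL at old-releases.ubuntu.com."""
    return _UBUNTU_HOST.sub(r"\1old-releases.ubuntu.com", sources_text)


example = (
    "deb http://us.archive.ubuntu.com/ubuntu/ hardy main restricted\n"
    "deb http://security.ubuntu.com/ubuntu hardy-security main\n"
)
print(use_old_releases(example))
```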
C.2 Setting up NFS share
NFS stands for Network File System. After installing nfs-kernel-server and
nfs-common using the apt-get command line tool, a shared directory must be
created and mounted. This is done using the /etc/exports file, the format of
which is explained in the exports man page. Before continuing further, make
sure it is possible to mount the exported nfs share on another computer, using
the mount command. If another computer is not available, simply test the
mounting process using localhost as the remote host name. The reason this
share is being created is that it will contain the filesystem that the diskless
nodes will all mount.
In order to install this filesystem, it is necessary to use the debootstrap
utility to create a new filesystem structure, with all the proper links and
permissions, and then to use the chroot and mount utilities to configure that
filesystem. After debootstrap is installed, an architecture and release
(i386, hardy), a local path (whatever you set your NFS share to be), and a
URL (old-releases.ubuntu.com/ubuntu) must be specified as command line options.
Once this is done, the chroot command will treat subsequent commands as if
the new filesystem is the actual operating system root. However, in order to
perform network operations within the chroot-ed filesystem, it is necessary to
use the mount command within the chroot-ed filesystem to mount the /proc
directory with type proc none. After any network operations are completed,
/proc can be umount-ed, and the chroot exited.
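The sequence described above can be summarized as a list of command lines. The sketch below only builds and prints the commands rather than running them; the path /srv/nfsroot is a hypothetical stand-in for the NFS share, and the helper name is an assumption made here.

```python
def chroot_setup_commands(nfs_root, mirror="http://old-releases.ubuntu.com/ubuntu"):
    """Return the command lines for creating and entering the diskless root."""
    return [
        # Create an i386 Hardy filesystem tree inside the NFS share.
        ["debootstrap", "--arch", "i386", "hardy", nfs_root, mirror],
        # Mount /proc inside the new root so network operations work there.
        ["chroot", nfs_root, "mount", "-t", "proc", "none", "/proc"],
        # After the network operations are done, unmount /proc again.
        ["chroot", nfs_root, "umount", "/proc"],
    ]


for argv in chroot_setup_commands("/srv/nfsroot"):
    print(" ".join(argv))
```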
C.3 Setting up the TFTP server and PXE bootloader
In order to load the operating system over the network, each cluster node
must have a PXE-enabled (Preboot eXecution Environment) network card
(sometimes, PXE must be enabled manually in the BIOS) and have loaded
a pxe-enabled bootloader. Since the nodes are diskless, this bootloader must
also be loaded over the network. To do this, the cluster nodes make use of
TFTP (Trivial File Transfer Protocol) to obtain the pxe-enabled bootloader
from the first server that assigns them an IP address. Thus, on the host
computer, the tftp-hpa package must be installed, and the pxelinux.0 boot-
loader (found in the /usr/lib/syslinux/ directory, after installing the syslinux
package) must be loaded into the /var/lib/tftpboot directory. A number of
system-specific options must be set in the TFTP server directory listed above.
These options are documented on the Kerrighed website.
C.4 Setting up DHCP server and Compiling Kerrighed
The dhcp3-server configuration file specifying the network interface on which
it should run is located in /etc/default/dhcp3-server. The configuration file
designating IP address allocation is found in /etc/dhcp3/dhcpd.conf. The
reason this server is being set up is that the diskless nodes need to obtain
the previously-installed bootloader and kernel options along with their IP
addresses. For each connecting client, the filename and the option root-path
should respectively specify the PXE bootloader and the NFS share location.
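As an illustration of the per-client settings, the following sketch generates one host entry for dhcpd.conf. The host name, MAC address, IP address, and NFS path are hypothetical placeholders, and the exact form of the root-path value depends on how the NFS share is exported.

```python
def dhcp_host_stanza(name, mac, ip, nfs_root, bootloader="pxelinux.0"):
    """Build a dhcpd.conf host entry for one diskless client."""
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f'  filename "{bootloader}";\n'          # PXE bootloader fetched over TFTP
        f'  option root-path "{nfs_root}";\n'    # location of the NFS share
        f"}}\n"
    )


print(dhcp_host_stanza("node1", "00:11:22:33:44:55", "192.168.1.101", "/srv/nfsroot"))
```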
The compile process is well-documented on the Kerrighed website. How-
ever, it is important to remember to either bundle all the network drivers
that will be used by the client nodes into the kernel, or to include those
drivers as modules, and add a ramfs image to the initrd option in the pxe
configuration file. Keep in mind that there are special considerations to using
a ramfs image documented on the Kerrighed website. One of these consid-
erations is that a hook script must be created in the initramfs-tools/hooks
folder to copy the /etc/kerrighed-nodes file into the ramfs image.
References
[1] B. Tushi, D. Sedera, and J. Recker, “Green IT segment analysis: An
academic literature review,” in 20th Americas Conference on Information
Systems, Savannah, Georgia, 2014.
[2] M. Mescall, “2013 data center density study.” [Online].
Available: http://www.symposium.uptimeinstitute.com/schedule/
1862-dc-density-2013
[3] “How much electricity does an American home use? - FAQ -
U.S. energy information administration (EIA).” [Online]. Available:
http://www.eia.gov/tools/faqs/faq.cfm?id=97&t=3
[4] M. Stansberry, “Annual data center industry survey report and full re-
sults,” Uptime Institute, Technical Report, 2013.
[5] N. Rasmussen, “Electrical efficiency measurements for data centers,”
APC by Schneider Electric, Data Center Science Center, White Paper
154.
[6] S. Zimmermann, I. Meijer, M. K. Tiwari, S. Paredes, B. Michel, and
D. Poulikakos, “Aquasar: A hot water cooled data center with direct
energy reuse,” Energy, vol. 43, no. 1, pp. 237–245, 2012, 2nd Inter-
national Meeting on Cleaner Combustion (CM0901-Detailed Chemical
Models for Cleaner Combustion).
[7] C. Patel, R. Sharma, C. Bash, and A. Beitelmal, “Thermal considera-
tions in cooling large scale high compute density data centers,” in Ther-
mal and Thermomechanical Phenomena in Electronic Systems, 2002.
ITHERM 2002. The Eighth Intersociety Conference on, 2002, pp. 767–
776.
[8] “Reversible and adiabatic computing: Energy-efficiency maximized,” in
Field-Coupled Nanocomputing, ser. Lecture Notes in Computer Science,
N. G. Anderson and S. Bhanja, Eds., 2014.
[9] C. Cai, L. Wang, S. Khan, and J. Tao, “Energy-aware high performance
computing: A taxonomy study,” in Parallel and Distributed Systems
(ICPADS), 2011 IEEE 17th International Conference on, Dec 2011, pp.
953–958.
[10] “CPU, GPU and MIC hardware characteristics over time - Karl Rupp,”
www.karlrupp.net. [Online]. Available: http://www.karlrupp.net/2013/
06/cpu-gpu-and-mic-hardware-characteristics-over-time/
[11] E. Frachtenberg, A. Heydari, H. Li, A. Michael, J. Na, A. Nisbet, and
P. Sarti, “High-efficiency server design,” in Proceedings of 2011
International Conference for High Performance Computing, Networking,
Storage and Analysis, ser. SC ’11. New York, NY, USA: ACM, 2011, pp.
27:1–27:27.
[12] “Efficiency: How we do it - “data centers” - Google.” [Online]. Avail-
able: http://www.google.com/about/datacenters/efficiency/internal/
#servers
[13] “Exergy analysis of data center thermal management systems,” in En-
ergy Efficient Thermal Management of Data Centers, Y. Joshi and
P. Kumar, Eds., 2012.
[14] “Optimization of outside air cooling in data centers.”
[15] S. G. A. N. Rolander, “For data center, google goes for the cold,” Wall
Street Journal, Sep. 2011. [Online]. Available: http://online.wsj.com/
news/articles/SB10001424053111904836104576560551005570810
[16] N. Rasmussen and J. Spitaels, “A quantitative comparison of high effi-
ciency ac vs. dc power distribution for data centers,” White Paper #127
of APC Inc, 2007.
[17] M. Ton, B. Fortenbery, and W. Tschudi, “DC power for improved data
center efficiency,” Lawrence Berkeley National Laboratory, Tech. Rep.,
Mar. 2008.
[18] U. Ogras, R. Marculescu, P. Choudhary, and D. Marculescu, “Voltage-
frequency island partitioning for GALS-based networks-on-chip,” in De-
sign Automation Conference, 2007. DAC ’07. 44th ACM/IEEE, June
2007, pp. 110–115.
[19] R. White, “Electrical isolation requirements in power-over-ethernet
(poe) power sourcing equipment (pse),” in Applied Power Electron-
ics Conference and Exposition, 2006. APEC ’06. Twenty-First Annual
IEEE, Mar. 2006, pp. 1–4.
[20] “IEEE standard for information technology–telecommunications and
information exchange between systems–local and metropolitan area
networks–specific requirements part 3: Carrier sense multiple access
with collision detection (CSMA/CD) access method and physical layer
specifications - section one,” IEEE Std 802.3-2008, pp. c1–597, 2008.
[21] D. Economou, S. Rivoire, C. Kozyrakis, and P. Ranganathan, “Full-
system power analysis and modeling for server environments,” in In
Proceedings of Workshop on Modeling, Benchmarking, and Simulation,
2006, pp. 70–77.
[22] J. McClurg, Y. Zhang, J. Wheeler, and R. Pilawa-Podgurski, “Re-
thinking data center power delivery: Regulating series-connected volt-
age domains in software,” in Power and Energy Conference at Illinois
(PECI), 2013 IEEE, Feb. 2013, pp. 147–154.
[23] P. Shenoy, K. Kim, B. Johnson, and P. Krein, “Differential power pro-
cessing for increased energy production and reliability of photovoltaic
systems,” Power Electronics, IEEE Transactions on, vol. 28, no. 6, pp.
2968–2979, June 2013.
[24] P. S. Shenoy and P. T. Krein, “Differential power processing for DC
systems,” Power Electronics, IEEE Transactions on, vol. 28, no. 4, pp.
1795–1806, Apr. 2013.
[25] P. Shenoy, “Improving performance, efficiency, and reliability of DC-DC
conversion systems by differential power processing,” Ph.D. dissertation,
University of Illinois at Urbana-Champaign, Sep. 2012.
[26] W. W. Griscom, “Some storage battery phenomena,” American Institute
of Electrical Engineers, Transactions of the, vol. XI, pp. 302–336, 1894.
[27] T. Wilson, “The evolution of power electronics,” in Proceedings of the
IEEE International Symposium on Industrial Electronics, 1992, 1992,
pp. 1–9 vol.1.
[28] R. Oppenheim, “Electric cells connected in a battery,” Patent
US2 085 598 A, June, 1937.
[29] H. J. P. Johanne, “Circuit arrangement for converting a low voltage
into a high direct voltage,” U.S. Patent US2 780 767 A, Feb., 1957,
U.S. Classification 363/18, 331/112, 327/100; International Classifica-
tion H02M3/24, H02M3/338; Cooperative Classification H02M3/338,
H02M3/3385; European Classification H02M3/338, H02M3/338C.
[30] D. C. Bomberger, D. Feldman, D. E. Trucksess, S. J. Brolin, and P. W.
Ussery, “The spacecraft power supply system,” Bell System Technical
Journal, vol. 42, no. 4, pp. 943–972, 1963.
[31] C. Dewey, F. Ellert, T. H. Lee, and C. H. Titus, “Development of
experimental 20-kV, 36-MW solid-state converters for HVDC systems,” IEEE
Transactions on Power Apparatus and Systems, vol. PAS-87, no. 4, pp.
1058–1066, 1968.
[32] D. Bjork, “Maintenance of batteries; new trends in batteries and au-
tomatic battery charging,” in Telecommunications Energy Conference,
1986. INEC ’86. International, 1986, pp. 355–360.
[33] H. Schmidt and C. Siedle, “The charge equalizer-a new system to extend
battery lifetime in photovoltaic systems, UPS and electric vehicles,” in
Telecommunications Energy Conference, INEC ’93. 15th International,
vol. 2, 1993, pp. 146–151 vol.2.
[34] T. Shimizu, M. Hirakata, T. Kamezawa, and H. Watanabe, “Generation
control circuit for photovoltaic modules,” IEEE Transactions on Power
Electronics, vol. 16, no. 3, pp. 293–300, 2001.
[35] M. Weiser, B. Welch, A. Demers, and S. Shenker, “Scheduling for
reduced CPU energy,” USENIX SYMP. OPERATING, pp. 13–23, 1994.
[36] T. Kawahara, Y. Kawajiri, M. Horiguchi, T. Akiba, G. Kitsukawa,
T. Kure, and M. Aoki, “A charge recycle refresh for gb-scale DRAM’s in
file applications,” IEEE Journal of Solid-State Circuits, vol. 29, no. 6,
pp. 715–722, 1994.
[37] H. Yamauchi, H. Akamatsu, and T. Fujita, “An asymptotically zero
power charge-recycling bus architecture for battery-operated ultrahigh
data rate ULSI’s,” IEEE Journal of Solid-State Circuits, vol. 30, no. 4,
pp. 423–431, 1995.
[38] S. Rajapandian, K. Shepard, P. Hazucha, and T. Karnik, “High-tension
power delivery: operating 0.18 µm CMOS digital logic at 5.4V,”
in Solid-State Circuits Conference, 2005. Digest of Technical Papers.
ISSCC. 2005 IEEE International, 2005, pp. 298–599 Vol. 1.
[39] J. Gu and C. Kim, “Multi-story power delivery for supply noise reduc-
tion and low voltage operation,” in Proceedings of the 2005 International
Symposium on Low Power Electronics and Design, 2005. ISLPED ’05,
2005, pp. 192–197.
[40] P. Shenoy, I. Fedorov, T. Neyens, and P. Krein, “Power delivery for series
connected voltage domains in digital circuits,” in 2011 International
Conference on Energy Aware Computing (ICEAC), 2011, pp. 1–6.
[41] R. A. Abdallah, P. S. Shenoy, N. R. Shanbhag, and P. T. Krein, “System
energy minimization via joint optimization of the DC-dc converter and
the core,” in Proceedings of the 17th IEEE/ACM international sympo-
sium on Low-power electronics and design, ser. ISLPED ’11. Piscat-
away, NJ, USA: IEEE Press, 2011, pp. 97–102.
[42] S. K. Lee, D. Brooks, and G.-Y. Wei, “Evaluation of voltage stacking
for near-threshold multicore computing,” in Proceedings of the 2012
ACM/IEEE international symposium on Low power electronics and design,
ser. ISLPED ’12. New York, NY, USA: ACM, 2012, pp. 373–378.
[43] P. Shenoy, S. Zhang, R. Abdallah, P. Krein, and N. Shanbhag, “Over-
coming the power wall: Connecting voltage domains in series,” in Energy
Aware Computing (ICEAC), 2011 International Conference on, 2011,
pp. 1–6.
[44] U. S. Nair, “The standard error of gini’s mean difference,” Biometrika,
vol. 28, no. 3/4, pp. 428–436, Dec. 1936, ArticleType: research-article /
Full publication date: Dec., 1936 / Copyright 1936 Biometrika Trust.
[45] C. Morin, R. Lottiaux, G. Vallée, P. Gallard, G. Utard, R. Badrinath,
and L. Rilling, “Kerrighed: A single system image cluster operating
system for high performance computing,” in Euro-Par 2003 Parallel
Processing, ser. Lecture Notes in Computer Science, H. Kosch, L. Böszörményi,
and H. Hellwagner, Eds. Springer Berlin Heidelberg, 2003, vol. 2790,
pp. 1291–1294.
[46] A. Waterland, “Stress POSIX workload generator.” [Online].
Available: http://people.seas.harvard.edu/~apw/stress/
[47] M. Matsumoto and T. Nishimura, “Mersenne twister: A 623-
dimensionally equidistributed uniform pseudo-random number genera-
tor,” ACM Trans. Model. Comput. Simul., vol. 8, no. 1, pp. 3–30, Jan.
1998.
[48] M. L. Fredman and R. E. Tarjan, “Fibonacci heaps and their uses in
improved network optimization algorithms,” J. ACM, vol. 34, no. 3, pp.
596–615, July 1987.
[49] B. P. Welford, “Note on a method for calculating corrected sums of
squares and products,” Technometrics, vol. 4, no. 3, pp. 419–420, 1962.
[50] P. G. Webster, “Design of experiments in the Möbius modeling
framework,” M. Eng. thesis, University of Illinois at Urbana-Champaign, 2002.
[51] V. Sharma, A. Thomas, T. Abdelzaher, K. Skadron, and Z. Lu,
“Power-aware QoS management in web servers,” in Real-Time Systems
Symposium, 2003. RTSS 2003. 24th IEEE, 2003, pp. 63–72.
[52] M. Randles, E. Odat, D. Lamb, O. Abu-Rahmeh, and A. Taleb-Bendiab,
“A comparative experiment in distributed load balancing,” in Devel-
opments in eSystems Engineering (DESE), 2009 Second International
Conference on, 2009, pp. 258–265.
