Fast, accurate power measurement and optimization for microprocessor platforms by Johnson, Matthew Robert
© 2015 Matthew Robert Johnson
FAST, ACCURATE POWER MEASUREMENT AND OPTIMIZATION FOR
MICROPROCESSOR PLATFORMS
BY
MATTHEW ROBERT JOHNSON
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2015
Urbana, Illinois
Doctoral Committee:
Professor Sanjay J. Patel, Chair
Associate Professor Deming Chen
Adjunct Assistant Professor Matthew I. Frank
Associate Professor Steven S. Lumetta
ABSTRACT
Power and energy consumption have become important for all computers,
but the tools used to measure and optimize power on physical hardware lag
far behind performance-focused tools. Existing measurement apparatuses have
low analog bandwidth, do not explicitly correlate power data with processor
activity, and are not explained in sufficient detail to quantify uncertainty
in their data. We present the design, implementation, and application of
Jouler’s Loupe, a measurement device that overcomes these obstacles and en-
ables a new generation of fast, fundamentally sound energy-efficiency-focused
tools. We demonstrate substantial opportunity for energy-focused software
optimizations on a mobile CPU core.
To my parents, for their love and support.
ACKNOWLEDGMENTS
First, I thank my mother LuAnn, my father Steve, and my wife Katya for
sticking with me through graduate school. Your love, support, and patience
all these years have made all the difference.
The Rigel years at the beginning of my graduate career provided a won-
derful opportunity for technical and personal growth. The members of the
Rigel group, especially John Kelm, Danny Johnson, Bill Tuohy, Aqeel Mah-
esri, Neal Crago, and Voytek Truty, were — and still are — a great source
of camaraderie, technical advice, and devil’s advocacy.
I have been very fortunate in receiving the generous financial support
of the Department of Electrical and Computer Engineering through the
ECE Distinguished Fellowship; Microsoft Corporation and Intel Corporation
through the Universal Parallel Computing Research Center; Intel Corpora-
tion through the Intel ECE Computer Engineering Fellowship; and Daniel
F. Vivoli through the Dan Vivoli Endowed Fellowship.
I’d like to thank Steve Lumetta for serving on my committee and for all
his contributions to my work throughout graduate school; his depth and
breadth of intellect have cut through the fog of many difficult problems, and
his attention to detail has greatly improved the clarity of my writing and
thinking. Matt Frank was also an invaluable mentor even before he served on
my committee; his experience in many areas of systems work informed many
technical decisions in the Rigel project, and many of the practical aspects
of this work. I thank Deming Chen for lending his unique expertise and
perspective to this work by serving on my committee.
Finally, I thank my advisor, Professor Sanjay Patel, for giving me direc-
tion when I needed direction, freedom when I needed freedom, and unending
support and enthusiasm in every research endeavor I pursued. His encour-
agement to devote time to new ideas, and his knack for knowing when to
focus on promising ones, have been invaluable.
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
CHAPTER 2 RELATED WORK
2.1 Power Measurement Methodology
2.2 Power Measurement Applications
CHAPTER 3 MOTIVATION
3.1 PDN Modeling
CHAPTER 4 LOUPE DESIGN AND IMPLEMENTATION
4.1 Design Goals
4.2 Implementation
4.3 Analysis
CHAPTER 5 EXPERIMENTAL METHODOLOGY
CHAPTER 6 RESULTS AND DISCUSSION
CHAPTER 7 CONCLUDING REMARKS
APPENDIX A POWER SENSOR DESIGN
A.1 Power Sensor Schematic
A.2 Amplifier Error Analysis
REFERENCES
LIST OF ABBREVIATIONS
ADC Analog-to-Digital Converter
CMRR Common Mode Rejection Ratio
DUT Device Under Test
ESL Equivalent Series Inductance
ESR Equivalent Series Resistance
FET Field Effect Transistor
FIR Finite Impulse Response
FPGA Field-Programmable Gate Array
GBP Gain-Bandwidth Product
IIR Infinite Impulse Response
LC Inductor-Capacitor
LSB Least Significant Bit
LUT Lookup Table
PCB Printed Circuit Board
PCIe Peripheral Component Interconnect Express
PDN Power Delivery Network
PLL Phase-Locked Loop
PMIC Power Management Integrated Circuit
ppm Parts Per Million
PSRR Power Supply Rejection Ratio
RC Resistor-Capacitor
RF Radio Frequency
RL Resistor-Inductor
RLC Resistor-Inductor-Capacitor
RSS Root Sum Square
RTI Referred-To-Input
RTO Referred-To-Output
SPICE Simulation Program with Integrated Circuit Emphasis
TCR Temperature Coefficient of Resistance
VRM Voltage Regulator Module
CHAPTER 1
INTRODUCTION
Power and energy consumption have become extremely important for design-
ers, software developers, and end users of nearly all modern computers as a
proxy for mobile device battery life, server electricity cost, or peak perfor-
mance within a fixed power or thermal envelope. Despite this importance, the
tools available to measure and optimize the power consumption of real hard-
ware are primitive relative to their counterparts in the performance domain
such as cycle-accurate timers, profilers, profile-guided software optimization,
and auto-tuners.
Existing power measurement techniques lack the following three desirable
attributes:
• An analog-to-digital signal path with low noise, high analog bandwidth,
and high sampling rate for voltage and current signals;
• A precise temporal correspondence between power measurements and
device-under-test (DUT) activity;
• A rigorous derivation of power measurement uncertainty and error
sources that can be used to construct robust measurement and op-
timization algorithms.
As a result, these techniques can only reliably measure power at a coarse
granularity. Without uncertainty analysis, the power consumption of two
pieces of hardware or software cannot be compared in a rigorous way. In this
thesis, we present the design, implementation, and characterization of the
Jouler’s Loupe, or Loupe, a measurement apparatus that overcomes these
challenges; it serves as an ideal building block for the next generation of
much faster, more ubiquitous power measurement and optimization tools for
a wide range of processor systems.
The contributions of this work include:
• An enumeration and analysis of the key design parameters and concerns
for fast, accurate power measurement;
• The design and implementation of the Jouler’s Loupe, a power measurement
device that can reliably sample 10×–100× faster than state-of-the-art
methods;
• A case study showing how Loupe enables developers to build practical
tools to measure and optimize the energy efficiency of software.
CHAPTER 2
RELATED WORK
2.1 Power Measurement Methodology
Measuring the power of an RF or microwave signal is a common problem in
communications and electromagnetic compatibility, and a wealth of commer-
cial equipment is available for this purpose. RF power sensors must handle
signals up to tens of GHz and do not digitize their input signals directly, but
rather integrate them in the analog domain using a thermistor, thermocou-
ple, or diode, then send the integrated signal to an RF power analyzer to
be digitized [1]. RF power sensors’ frequency responses do not extend down
to DC, and power analyzers, despite sampling frequencies of up to
100MHz, provide only postprocessed average or peak power readings at up
to ≈1.5kHz, owing to the low analog bandwidth of the underlying sensing
mechanisms. For instance, since thermistor- and thermocouple-based sen-
sors integrate the RF signal by dissipating it across a resistor and measuring
its temperature, typical 10–90% rise times measure in seconds. Therefore,
RF power measurement techniques are not directly applicable to high-speed
processor power measurements, which require very high analog bandwidth
extending all the way down to DC.
The four major current sensing modalities are the current sense (or shunt)
resistor, the current transformer, the Hall effect sensor, and the Rogowski
coil [2]. The latter three modalities can be implemented non-intrusively by
wrapping a loop or coil around the conductor being measured. This feature
allows for the measurement of very high currents up to thousands of am-
peres, provides galvanic isolation between DUT and measurement system,
and avoids the self-heating and power dissipation inherent in the shunt re-
sistor method. The downside of non-intrusive methods is that they are more
suited to measuring power through a wire or cable and are inconvenient for
PCB-level measurements, since commercial designs rarely allow for a loop to
be wrapped around the power trace or plane leading to a processor. Fur-
thermore, the three coil-based modalities typically have bandwidth of a few
MHz or less and variously have issues with linearity, hysteresis, temperature
dependence, and measuring DC current. Despite the relatively high power
consumption of sense resistors and the attendant self-heating and temperature
dependence issues, sense resistors provide the best foundation for a high-speed
processor power measurement platform due to their simplicity; their
excellent frequency response and linearity; their ability to measure DC cur-
rent; their suitability to measuring currents both on a wire or cable and on a
PCB; and the ability to easily use a common power measurement platform to
measure different current ranges by using different resistances in a common
form factor.
Commercial current probes for oscilloscopes have many desirable charac-
teristics, but are ultimately unsuitable for our purposes except as a useful
cross-validation tool. The highest-performance probes, such as the Agilent
N2783B, use both a Hall effect sensor and a current transformer to mea-
sure from DC to a -3dB bandwidth of 100MHz. However, these probes have
the disadvantages of non-intrusive methods discussed previously, and higher-
frequency probes are usually limited to measuring smaller conductors. For
example, the N2783B can only measure conductors up to 5mm in diameter.
For PCB-level power rail measurements, the parasitic impedance of a 5mm
wire long enough to fit through the probe would cause substantial ringing
problems, as discussed in Section 4.1.1. In addition to the added impedance
of the wire itself, even non-contact current probes also incur an insertion
impedance. While this impedance is on the order of 1mΩ at low frequencies,
it quickly grows to tens or hundreds of mΩ at 10MHz, far higher than the
33mΩ impedance of a 10mΩ sense resistor with 0.5nH parasitic inductance.
While off-the-shelf current probes are specified with fairly flat gain responses
out to their specified bandwidth, no concrete flatness guarantees beyond the
-3dB bandwidth are provided and no information is provided as to phase lin-
earity/group delay flatness; a sensor with flat gain and nonlinear phase may
produce significant distortions in the time-domain waveform, causing error in
power and energy estimates over short time periods. Current probes require
periodic degaussing and offset voltage nulling due to residual magnetization
of the magnetic core, and are thus more suited for controlled laboratory
measurements than continuous, automated in-system use. Current probes
also have limited sensitivity in the µA–mA range; for example, the minimum
current the N2783B can reliably measure is 5mA. One technique to improve
sensitivity is to wind multiple turns of the wire through the probe opening,
multiplying the probe’s output voltage by the number of turns. This tech-
nique has the downside of further increasing inductance along the wire by
coiling it and potentially needing to increase its length, and is limited by the
conductor size limit of the probe. Further limitations of commercial current
probes for our purpose include their large physical size which precludes their
use in space-constrained production systems, their current cost of thousands
of dollars, and the lack of insight into their error sources and behavior due
to their proprietary design. We opted to design our own sense resistor-based
power sensor that addresses all of the above issues.
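As a quick check on the impedance comparison above, the 33mΩ figure for a 10mΩ sense resistor with 0.5nH of parasitic inductance matches the magnitude of a series RL impedance evaluated at 10MHz. The sketch below assumes that simple series model; the function name is illustrative and does not come from the original text.

```python
import math

def series_rl_impedance(r_ohm, l_henry, f_hz):
    """Magnitude of a resistor in series with its parasitic inductance."""
    x_l = 2 * math.pi * f_hz * l_henry  # inductive reactance in ohms
    return math.hypot(r_ohm, x_l)       # sqrt(R^2 + X_L^2)

# Values quoted in the text: 10 mOhm, 0.5 nH, evaluated at 10 MHz.
z = series_rl_impedance(10e-3, 0.5e-9, 10e6)
print(f"|Z| at 10 MHz = {z * 1e3:.1f} mOhm")
```

At 10MHz the inductive reactance (about 31mΩ) already dominates the 10mΩ resistance, which is why even a small parasitic inductance matters at these frequencies.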
A large body of literature focuses on the accurate measurement of AC
mains power, with applications ranging from individual electrical appliances
to large industrial equipment, entire buildings, and large-scale electrical grid
design. The signals of interest in this domain have low fundamental frequen-
cies from 50Hz up to perhaps 1kHz. One important feature of mains power
measurement devices is the ability to synchronize the sample clock to the fun-
damental frequency for fast convergence of peak and RMS power estimates
on periodic signals. Processor DC power rails have no such periodicity, and as
such the speed of power estimate convergence is purely dependent on analog
bandwidth, noise, and sample rate. Svensson designed and characterized an
accurate digital watt meter for AC loads which may have significant harmonic
content up to several kHz [3]. Like Svensson, we use a shunt resistance and
an ADC in our power measurement apparatus to digitize current and voltage
waveforms, and we perform an uncertainty analysis to enable the informed
use of the apparatus in statistically rigorous measurement and optimization
applications. However, our apparatus has several orders of magnitude more
bandwidth and is suited for PCB-level or system-level measurement, and
our error analysis is more focused on thorough characterization of the de-
vice than on the uncertainty of particular derived quantities from the mains
power domain like apparent, active, and reactive power.
Chang et al. [4] introduced a switched-capacitor-based methodology for
measuring the dynamic energy of simple ARM microcontrollers at a cycle
granularity. The methodology was later applied to FPGAs by Lee et al. [5].
Under this methodology, two or more capacitors are connected across the pro-
cessor’s supply pins, and the measurement system controls digital switches
that connect one of the capacitors to the processor at a time. The processor
then draws current from that capacitor for a short period of time, rather than
from the power supply, while the power supply recharges the other capacitors
which are disconnected from the processor. While the capacitor is powering
the processor, its voltage drops from VDD to V2, and the total energy drawn
during the time period can be calculated as:
$$E = \int_{t_1}^{t_2} I(t)\,V(t)\,dt = \frac{1}{2}\,C\left(V_{DD}^2 - V_2^2\right) \tag{2.1}$$
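To make Equation 2.1 concrete, the following sketch compares the closed-form capacitor energy with a numerical integration of I(t)V(t) for an ideal capacitor discharged by a constant-current load. The capacitance, load current, and interval are hypothetical values chosen for illustration, and the constant-current load is a simplifying assumption; real processors draw time-varying current.

```python
def capacitor_energy(c_farad, v_start, v_end):
    """Energy drawn from an ideal capacitor as its voltage falls (Equation 2.1)."""
    return 0.5 * c_farad * (v_start**2 - v_end**2)

def integrated_energy(c_farad, v_start, i_load, duration, steps=1000):
    """Trapezoidal integration of I(t)*V(t) for a constant-current load.

    Returns the integrated energy and the final capacitor voltage.
    """
    dt = duration / steps
    energy = 0.0
    v_prev = v_start
    for _ in range(steps):
        v_next = v_prev - i_load * dt / c_farad          # dV = -I*dt/C
        energy += i_load * 0.5 * (v_prev + v_next) * dt  # trapezoid area
        v_prev = v_next
    return energy, v_prev

# Hypothetical values: 1 uF capacitor, 1.0 V supply, 100 mA load, 100 ns window.
CAP, VDD, I_LOAD, WINDOW = 1e-6, 1.0, 0.1, 100e-9
e_num, v2 = integrated_energy(CAP, VDD, I_LOAD, WINDOW)
e_cap = capacitor_energy(CAP, VDD, v2)
print(f"V2 = {v2:.4f} V, integrated = {e_num:.3e} J, closed form = {e_cap:.3e} J")
```

For these values the capacitor voltage falls by 10mV and both computations agree at about 9.95nJ, which illustrates how the capacitor acts as an analog integrator under ideal conditions.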
The switched capacitor method has an important advantage over sense resistor-
based approaches in that it uses the capacitor as a high speed analog inte-
grator, seemingly sidestepping the need for high analog bandwidth to the
ADC, high ADC sampling rate, and the computational expense of integrat-
ing measured current and voltage data in the digital domain. However, three
key downsides to this approach motivate a high speed sense resistor-based
system like Loupe. First, physical capacitors deviate in many ways from
their ideal specification, including leakage, parasitic inductance, and capaci-
tance dependent on applied voltage and temperature. These deviations limit
the utility of the naïve energy computation in Equation 2.1, particularly
when absolute accuracy is important to compare results across processors or
measurement methods. Sense resistors also have nonidealities such as para-
sitic inductance and thermoelectric voltage, but they are characterized much
more thoroughly than those of capacitors, and the designer can compensate
for them more readily. Second, in order to get cycle-level energy measure-
ments, Chang et al. and Lee et al. provide an external clock to the system
under test so that each capacitor can power the processor — and the voltage
drop can be measured — for a single cycle. As a practical concern, nearly
all processors of interest today derive their clock from an on-chip PLL, not
external pins. Most commercial products do not allow for single-stepping
the processor from an external clock at all, so the automatic cycle-level syn-
chronization of power measurements and processor activity assumed in the
literature could not be maintained. Even if single-stepping were possible,
leakage has come to represent a far larger fraction of total power consumption
in modern processors than in previous generations, so accurate total power
measurements require running the processor at realistic speeds to generate
realistic thermal conditions and to capture a realistic balance between static
and dynamic power. The third limitation of the switched capacitor approach
is that transistors and other on-die structures are highly non-linear, and the
switching current and total energy are dependent on supply voltage over time.
On one hand, it is desirable to size the capacitor and measurement period to
generate a large voltage drop that can be measured precisely. On the other,
it is desirable to keep the supply voltage very close to nominal VDD to avoid
altering the operating point of the DUT and causing the measured energy
consumption to differ from that under realistic operating conditions. This
fundamental tension is analogous to the burden voltage for a sense resistor,
but the solution in this case — carefully designed amplifiers to allow small
voltage drops to be measured precisely — eliminates much of the simplic-
ity advantage of the switched capacitor approach, and does not address the
other two downsides mentioned earlier. While switched capacitors provide an
interesting way to cross-validate sense resistor-based energy measurements,
the internal clocking of modern processors and the need to run the DUT at
its normal operating point for best results preclude the cycle-level measure-
ment granularity achieved in the literature and limit any accuracy benefit of
the former technique.
Even if the bandwidth, accuracy, and simplicity characteristics of resistance-
based current sensing are desirable, the power dissipated across the sense
resistor may be untenable in some low-power production systems. Previ-
ous studies have evaluated lossless current sensing, where instead of a sense
resistor, the voltage drop associated with current is measured across par-
asitic resistances present in the unmodified power supply. The resistances
used are the DC resistance (DCR) of the switching FETs [6] or inductors
[6, 7] in a single- or multi-phase buck regulator, or even the copper trace
or plane between the regulator and the processor [8]. In the case of the in-
ductor, the DC component of the current is measured by using a single-pole
RC filter to cancel out the reactance of the inductor. While the resulting
measurements are sufficiently accurate to detect and mitigate overcurrent
conditions or share load current between multiple phases [9], lossless sensing
approaches have three downsides for fast, accurate processor power measure-
ment. First, the notion of a single resistance value belies the complexity of
physical power inductors; manufacturer simulation models do include such a
resistance, but also have a parallel combination of frequency-dependent R,
RC, and RL circuits [10] that makes the real frequency response consider-
ably more complex and limits the accuracy of RC-filtered current data above
DC. Second, current sense resistors are designed with great attention to detail
in areas like thermoelectric voltage and stability over temperature, lifetime,
and environmental conditions that are not important for power inductors,
and thus provide superior stability and characterizability. Finally, since the
inductor is placed physically and electrically “far” from the processor, it
will see less of the high frequency processor current content than a sense
resistor near the PCB-mounted decoupling capacitors, limiting the effective
time granularity of energy and power measurements. MOSFET- or copper-based
lossless sensing suffers from largely the same drawbacks. In the case of
copper-based sensing, placement of the low-side voltage endpoint is an addi-
tional issue, since this placement fully determines the measured current, but
the voltage will vary substantially, both spatially across the power plane or
planes and in time across different use cases as the current draw distribution
among the processor balls changes. Nevertheless, characterizing the temporal
resolution and accuracy upper bounds of DCR sensing systems is an interest-
ing area for future study, since lossless sensing may make the measurement
and optimization techniques developed in this dissertation applicable to an
even broader range of systems and use cases.
2.2 Power Measurement Applications
High-side shunt resistors generate a small voltage in response to the
current passing through them. This microvolt- or millivolt-level signal must
be amplified substantially to occupy a significant fraction of the 1–4 V input
range of most high-speed ADCs. Amplification is also necessary to reduce
the effect of noise on the signal between the current sensor and the ADC.
As shown in Table 2.1, most existing power measurement setups do not am-
plify the current signal at all; those that do amplify use specialized current
sense amplifiers or amplifiers integrated into the Hall effect sensor, both of
which have a typical 3dB bandwidth of 10–100kHz. Applications with abso-
lute accuracy requirements like power measurement require a flat amplitude
response and linear phase response in the frequency range of interest; the
Table 2.1: A small sample of existing power measurement techniques. Most
papers do not describe the measurement setup in significant detail. Most
techniques do not explicitly amplify the current signal, leading to low
dynamic range at the ADC; those that do amplify it have relatively low
bandwidth. The commercial AC power measurement devices used by Do et
al. and others are based on a sense resistor [11] or current transformer [12].
Jouler’s Loupe uses a sense resistor, has a 3dB bandwidth of 60–200MHz,
and uses 64Msps 12-bit ADCs.
Citation                      Method          Amplifier BW (3dB)  ADC Specs       Correlated w/ Proc. Activity?
Bedard et al. [13]            Sense Resistor  N/A                 6.6kHz, 12b     No
Weissel and Bellosa [14]      Sense Resistor  N/A                 ≤3Msps          No
Rajamani et al. [15]          Sense Resistor  ≤10kHz              333ksps, 16b    No
Zhu et al. [16]               Sense Resistor  ≤80kHz              200ksps, 16b    No
Carroll and Heiser [17]       Sense Resistor  N/A                 250ksps, 16b    No
Contreras and Martonosi [18]  Sense Resistor  N/A                 100sps          Yes
Govindan et al. [19]          Hall effect     100kHz              10ksps, 14b     No
Esmaeilzadeh et al. [20]      Hall effect     80kHz               50sps, ≈7b      No
Do et al. [21]                AC Power        N/A                 1Hz readings    No
0.1dB bandwidth is more applicable and is about 1/6th the 3dB bandwidth
for voltage-feedback amplifiers, or about 14–17kHz for the amplifiers in Table
2.1 [22]. This low bandwidth makes it impossible to accurately measure code
executions shorter than 1–5ms (millions of cycles on modern processors), and
increases the required run time for a given level of accuracy and confidence
[23].
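The relationship between 0.1dB and 3dB bandwidth can be sketched for an idealized single-pole low-pass response. The single-pole model is an assumption made here for illustration; the one-sixth figure in the text comes from [22] and applies to voltage-feedback amplifiers generally.

```python
import math

def flatness_bandwidth_ratio(droop_db):
    """Ratio f/f_3dB at which a single-pole low-pass response is down by droop_db."""
    gain = 10 ** (-droop_db / 20)      # linear amplitude at the droop point
    return math.sqrt(1 / gain**2 - 1)  # from |H| = 1/sqrt(1 + (f/f_3dB)^2)

ratio = flatness_bandwidth_ratio(0.1)
print(f"0.1 dB bandwidth / 3 dB bandwidth = {ratio:.3f}")
for f3db in (80e3, 100e3):  # 3 dB bandwidths of amplified setups in Table 2.1
    print(f"{f3db / 1e3:.0f} kHz amplifier: flat to 0.1 dB up to {ratio * f3db / 1e3:.1f} kHz")
```

Under this idealization the ratio is about 0.15, close to the one-sixth rule of thumb, so an amplifier with tens of kHz of 3dB bandwidth is flat to 0.1dB only over roughly the first 12–15kHz.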
Of the methods in Table 2.1, only Contreras and Martonosi [18] can derive
a relationship between a power sample and processor activity, but they do
not exploit this capability for fine-grained measurement. As observed by
Jung et al. [24], misalignment or uncertain alignment in time between when
the code of interest is executed and power sampling instants leads to error
in the estimate of that code’s energy consumption. Minimizing this error
by increasing run length further decreases the effective throughput of the
measurement system. These two limitations of existing measurement schemes
make them ineffective for executions shorter than about a million cycles,
stymying the development of energy-aware compilers, profilers, and auto-
tuners. In Chapter 4, we will show how Loupe improves bandwidth by several
orders of magnitude and extends effective power measurements down to the
microsecond timescale.
Also related to this work are relatively new commercial devices for measur-
ing power consumption of smartphones [25] or embedded CPUs [26]. While
the latter device does collect processor activity data along with power, the
two data streams are not explicitly correlated using a common timebase and
the processor activity data is not immediately useful to automated software
tools. Furthermore, these devices suffer from the same bandwidth and dy-
namic range limitations as the methods in the academic literature.
CHAPTER 3
MOTIVATION
A primary challenge in high speed current measurement for processors is the
decoupling capacitance and parasitic impedance between transistors drawing
current on the processor die and a current measurement point on the PCB.
Power delivery networks (PDNs) are designed to supply current to on-die
transistors at nearly constant supply voltage, regardless of time-varying load
characteristics. The parasitic inductances of the on-die supply grid, package
substrate, and PCB develop a differential voltage in response to changing
load current according to the equation $V = L\,\frac{dI}{dt}$. Thus, the primary PDN
design goal is to minimize impedance $|Z| = |V|/|I|$ over a broad frequency range,
where I is the current drawn by the processor and V is the voltage drop
caused by the parasitic inductance and resistance. A PCB-mounted voltage
regulator by itself can only maintain sufficiently low impedance up to the
kHz to low MHz range, while processor current draw has significant spectral
content up to several GHz. To bridge this gap, a hierarchy of decoupling
capacitors is placed on the PCB, package, and processor die. Each successive
level of capacitors has smaller capacitance, but a lower-inductance path
to the load, and supplies current at low impedance for successively higher
frequency ranges. Thus, the charge for high-frequency current transients is
supplied by on-die decoupling capacitance, mid-frequency transients are han-
dled by package-mounted capacitors, and lower-frequency content is handled
by PCB-mounted capacitors and the regulator. The highest-frequency infor-
mation in the current signal is significantly attenuated at the PCB level, and
the observed signal is essentially a time-averaged version of the true current
draw on the chip.
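A minimal numeric sketch of the inductive voltage drop discussed in this chapter, using the relation V = L dI/dt for a linear current ramp. The inductance, current step, and edge time below are hypothetical values chosen only to show the scale of the effect.

```python
def inductive_drop(l_henry, delta_i, delta_t):
    """Voltage across a parasitic inductance for a linear current ramp: V = L * dI/dt."""
    return l_henry * delta_i / delta_t

# Hypothetical values: 100 pH of parasitic inductance, 1 A load step over 1 ns.
v = inductive_drop(100e-12, 1.0, 1e-9)
print(f"Inductive drop = {v * 1e3:.0f} mV")
```

Even 100pH yields a drop of roughly 100mV for such a step, a large fraction of a 1V supply, which is why low-inductance decoupling close to the load is needed for fast transients.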
We quantify the potential benefit of higher power sampling frequencies us-
ing a bottom-up approach. We construct a relatively detailed SPICE model
of a complete processor PDN, from the VRM through the on-die power dis-
tribution grid. We then drive the SPICE model using current consumption
data from a cycle-level simulator running representative workloads. We can
then determine the resolution and accuracy bounds of PCB sense resistor
measurement approaches by comparing PCB-level current and voltage wave-
forms to the true on-die load.
3.1 PDN Modeling
Our SPICE model of a PDN is similar at a high level to the lumped model
used by Gupta et al. [27]. However, our model improves accuracy and allows
the model to be more easily adapted to new designs by adding a 4-element
lumped model of the VRM, using individual lumped RLC models for each
decoupling capacitor and its parasitics, and deriving its parasitic component
values from first principles when possible rather than by empirically modify-
ing the values to approximate the shape of a measured PDN impedance curve.
For the purposes of evaluating the benefit of high frequency power sampling,
we model the PDN of an existing system. The Hardkernel ODROID-XU+E
single board computer houses a Samsung Exynos 5410 SoC with 4 high-
performance ARM A15 cores and 4 energy-efficient ARM A7 cores. Here,
we model the voltage rail powering the A15 cores exclusively, which includes
several values of board-level decoupling capacitors and a 10mΩ sense resistor
used for a built-in low-frequency current sensor. The structure of the model
and the particular component values used in our simulations can be found
in Figure 3.1 and Table 3.1. We set VDD at the VRM to compensate for the
DC IR drop due to the sense resistor and PCB and provide 1.0V at the pack-
age. Figure 3.2 shows the impedance of the PDN for current stimuli from
10kHz to 10GHz, with and without the 10mΩ sense resistor. The sense resis-
tor substantially increases impedance, and thus supply voltage noise, at low
frequencies, providing another reason to prefer very small sense resistances.
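The VRM setpoint compensation described above amounts to adding the DC IR drop to the target package voltage. In this sketch the load current is a hypothetical value, and only the sense resistor and PCB resistances from Table 3.1 are included; whether the simulations also compensate for other series resistances is not stated in the text.

```python
def vrm_setpoint(v_package, i_dc, series_resistances):
    """VRM output voltage needed to deliver v_package after the DC IR drop."""
    return v_package + i_dc * sum(series_resistances)

R_SENSE = 10e-3  # ohms, from Table 3.1
R_PCB = 0.1e-3   # ohms, from Table 3.1
I_DC = 1.0       # amperes, hypothetical DC load current

print(f"VDD at VRM = {vrm_setpoint(1.0, I_DC, [R_SENSE, R_PCB]):.4f} V")
```

With these numbers the sense resistor accounts for nearly all of the 10.1mV drop, consistent with the preference for very small sense resistances.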
3.1.1 Power Plane Impedance Modeling
Most high-current IC power rails are connected to the PCB-mounted decou-
pling capacitors and voltage regulator via large power and ground copper
pours, or planes, on adjacent PCB layers. The two planes separated by a
dielectric form a parallel plate capacitor, and this capacitor supplies charge
[Figure 3.1 schematic labels: VDD source; LSlew, RFlat; CVRM1, CVRM2; Rsense, LRsense; RPCB, LPCB; CPCB1; RBall, LBall; RPkg, LPkg; CPkg1; RBump, LBump; RGrid; CDie1; ILoad. Each capacitor is drawn with its own series R and L.]
Figure 3.1: Conceptual PDN, modeling VRM responsiveness, decoupling
capacitors, plane and package parasitics, and the on-die power grid.
Table 3.1: Component values used in SPICE model of ODROID-XU+E
A15 power rail.
Component   Value
VRM         Rflat = 0.75mΩ, Lslew = 1.6nH
CVRM        2×(22µF / 3mΩ / 1.4751nH)
Rsense      10mΩ
PCB         Rpcb = 0.1mΩ, Lpcb = 21pH
CPCB1       2×(22µF / 3mΩ / 1.4751nH)
CPCB2       1×(4.7µF / 7mΩ / 1.4256nH)
CPCB3       12×(0.1µF / 26mΩ / 1.339nH)
Pkg. Balls  45×(424µΩ / 32pH)
CPkg1       26µF / 541.5µΩ / 5.61pH
Die Bumps   0.3mΩ / 0.5pH (lumped)
On-die      CDie1 = 50nF / 1mΩ, Rgrid = 1mΩ
[Figure 3.2 plot: PDN impedance (Ω), 10^-3 to 10^-1, versus frequency (Hz), 10^4 to 10^10; curves for Rsense = 0Ω and Rsense = 10mΩ.]
Figure 3.2: Modeled impedance of the ODROID-XU+E A15 power rail
with and without a 10mΩ current sense resistor.
[Figure 3.3 schematic labels: top GND plane, inner VDD plane, bottom GND plane; PCB plane elements CP1, LP1, CP2, LP2, LIC; LVias; capacitor impedances ZCT1, ZCT2, ZCB1, ZCB2 for the top-mounted and bottom-mounted capacitors; output to package.]
Figure 3.3: We model the PCB planes and capacitors using the topology
proposed by Shringarpure et al. [28], where the plane is split into two LC
elements and a mutual inductance, and top- and bottom-mounted
decoupling capacitors are connected to separate GND nodes. Since the
PMIC and processor are mounted on the top of the PCB, the top GND
node in this figure corresponds to the GND node in Figure 3.1.
to the IC in the mid-frequency range between those of the PCB-mounted ca-
pacitors and the package-mounted or on-die capacitors. We model the power
planes and PCB-mounted capacitors using the split topology presented by
Shringarpure et al. [28], as shown in Figure 3.3. The planes are modeled as an
LC circuit split into two halves, LP1/CP1 and LP2/CP2, connected to the
top- and bottom-mounted decoupling capacitors, respectively. The mutual
inductance between the two LC circuits is represented by a third inductor,
LIC. To estimate the values of the three inductors and two capacitors, we
use the methodology introduced by Jones [29]. First, the planes are modeled
as solid rectangles, and their self-impedance is calculated using the Modal-
Expansion Cavity Model for a rectangular patch given by Carver and Mink
[30]. For simplicity, we assume that the top and bottom ground planes are
equidistant from the VDD plane so that LP1 = LP2 and CP1 = CP2, and
that the processor package and die are approximately centered on the planes.
We then iteratively calculate the LC circuit’s impedance using SPICE and
modify the component values until the LC circuit closely matches the first
resonance and anti-resonance of the more detailed model.
We estimate that the A15 power planes for the ODROID-XU+E measure
1.5 by 1.0 inches and are separated by a 2.0-mil sheet of FR-4 (εr = 4.7).
The cavity model shows first resonance and anti-resonance at 834MHz and
3.63GHz, respectively. We find that component values of LP1 = LP2 = 43pH,
CP1 = CP2 = 400pF, and LIC = 9.6pH closely match the more detailed
model up to 3.63GHz, an upper bound which is more than sufficient for eval-
uating voltage and current waveforms at the sense resistor at ≈100MHz and
below.
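As a rough sanity check on the fitted plane capacitances, the ideal parallel-plate capacitance of the estimated plane geometry can be computed directly. This is only a sketch: it assumes a single 1.5 by 1.0 inch plane pair, ignores fringing fields, and is not the cavity model used above. The function and constant names are illustrative.

```python
EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m
INCH = 0.0254            # metres per inch
MIL = 25.4e-6            # metres per mil

def parallel_plate_capacitance(length_in, width_in, gap_mil, eps_r):
    """Ideal parallel-plate capacitance in farads (fringing ignored)."""
    area = (length_in * INCH) * (width_in * INCH)
    return EPS0 * eps_r * area / (gap_mil * MIL)

# Plane geometry estimated in the text: 1.5 x 1.0 inch, 2.0 mil FR-4, eps_r = 4.7
c_plane = parallel_plate_capacitance(1.5, 1.0, 2.0, 4.7)
print(f"parallel-plate estimate: {c_plane * 1e12:.0f} pF")
print(f"fitted CP1 + CP2:        {2 * 400} pF")
```

The estimate comes out close to 0.8nF, the same magnitude as the fitted CP1 + CP2, which suggests the fitted lumped values are physically plausible.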
3.1.2 Decoupling Capacitor Modeling
We model each decoupling capacitor as the series combination of the nominal
capacitance value, equivalent series resistance (ESR), and two inductances:
the equivalent series inductance and a mounting inductance Lmount. The
mounting inductance is a result of the current loop formed by the capacitor,
the power and ground planes, and the PCB tracks and vias that connect
the capacitor to the planes. We calculate these individual component val-
ues using estimated PCB stackup, capacitor placement, and via geometry
information from PCB layout data and simplified physically-based models
provided by Altera[31], rather than using an electromagnetic simulator.
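The series capacitor model described above can be sketched in a few lines. The example values (22µF, 3mΩ, 1.4751nH) are taken from the PDN component table, under the assumption that the third figure is the total series inductance (ESL plus Lmount); the function names are illustrative.

```python
import math

def capacitor_impedance(freq_hz, c_farads, esr_ohms, l_henries):
    """Complex impedance of a series R-L-C capacitor model at freq_hz."""
    omega = 2 * math.pi * freq_hz
    return complex(esr_ohms, omega * l_henries - 1.0 / (omega * c_farads))

def self_resonant_frequency(c_farads, l_henries):
    """Frequency at which the inductive and capacitive reactances cancel."""
    return 1.0 / (2 * math.pi * math.sqrt(l_henries * c_farads))

# Assumed reading of one table entry: C / ESR / total series inductance
C, ESR, L_TOTAL = 22e-6, 3e-3, 1.4751e-9
f_res = self_resonant_frequency(C, L_TOTAL)
for f in (1e4, f_res, 1e8):
    z = capacitor_impedance(f, C, ESR, L_TOTAL)
    print(f"{f:12.4g} Hz  |Z| = {abs(z) * 1e3:.3f} mOhm")
```

Below the self-resonant frequency the capacitor looks capacitive, at resonance its impedance collapses to the ESR, and above resonance the series inductance dominates, which is why each capacitor tier is only effective over a limited band.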
3.1.3 Power Consumption Simulation Data
We use version 5.3 of the Sniper simulator [32], which can simulate single-
or multi-threaded workloads running on parameterized x86 core models. We
model a single core similar to Intel’s Silvermont running at 1GHz. This
relatively low-power, low-frequency processor will tend to have lower fre-
quency content and lower dynamic range in its power consumption than
faster or wider cores; thus, our results here provide a conservative estimate
of the benefit of high power sampling frequency. We simulate representative
100M-instruction sections of the SPEC2006 benchmarks [33], derived using a
SimPoint [34]-like methodology, for a total workload corpus of several billion
instructions. We use McPAT 1.0[35] to estimate current consumption on a
cycle-by-cycle basis using the simulator’s block activity data. We drive Iload
in the SPICE simulations using this cycle-granularity current trace, generat-
ing observed current traces at the die, package, board, and VRM levels along
with the true load current.
3.1.4 Results
Each successive layer of decoupling capacitance from the die out to the VRM
handles progressively lower-frequency current demand, so the observed sense
resistor current waveform will certainly not reflect all the high-frequency
content present in the load current, and thus the observed current can only
accurately predict the load current when averaged over relatively long time
periods. Typical averaging periods in the literature range from several mil-
liseconds to several seconds, or even minutes in the case of full-system AC
power measurement. We can quantify the accuracy of much higher sampling
rates and averaging over shorter time intervals by analyzing the corpus of
load and observed current for our modeled PDN and simulated power con-
sumption data. Sample results for the astar benchmark are shown in Figures
3.4 and 3.5. Figure 3.5 compares, for a given sampling frequency, the average
observed current over a sample period to the actual average load current, to
[Figure 3.4 plot: current (A, 0 to 10) versus time (s, 0 to 0.00005); traces: Load and Sense.]
Figure 3.4: Sample load and sense resistor current trace from the astar
benchmark. The decoupling capacitors on the die, package, and board
handle high-frequency current transients, so the sensed current waveform
does not capture the rapid variations in load current, but accurately
captures the average over longer time periods.
evaluate how well the sense resistor waveform represents the true load. Note
that to achieve the stated error percentiles, the measurement system must
have analog bandwidth matching or exceeding the sampling rate. The data
show that, while high-frequency load transients cannot be fully observed at
the current sense resistor, sampling rates in the MHz range are justified and
can provide insight into the current consumption of much shorter pieces of
code than existing methods. The relatively substantial 99.9th and 99.99th
percentile errors at high sampling rates suggest that power measurement and
optimization algorithms at these rates must take these statistical properties
into account to make correct decisions in the face of uncertainty.
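The error-percentile analysis can be sketched as follows. The traces here are synthetic stand-ins, not the Sniper/McPAT data: the load is random per-sample current, and a one-pole low-pass filter serves as a crude, assumed proxy for the PDN. Function names and the filter coefficient are illustrative.

```python
import random

def window_means(trace, window):
    """Mean of each consecutive, non-overlapping block of `window` samples."""
    count = len(trace) // window
    return [sum(trace[i * window:(i + 1) * window]) / window for i in range(count)]

def error_percentiles(load, sensed, window, percentiles=(50, 95, 99)):
    """Percent error of windowed sensed current vs. windowed load current."""
    pairs = zip(window_means(sensed, window), window_means(load, window))
    errors = sorted(abs(s - l) / l * 100.0 for s, l in pairs)
    # Nearest-rank percentile over the sorted per-window errors.
    return {p: errors[min(len(errors) - 1, int(p / 100.0 * len(errors)))]
            for p in percentiles}

# Synthetic load current between 1A and 10A, one sample per "cycle".
rng = random.Random(0)
load = [rng.uniform(1.0, 10.0) for _ in range(200_000)]

# Assumed PDN proxy: one-pole low-pass filter applied to the load current.
sensed, state = [], load[0]
for sample in load:
    state += 0.05 * (sample - state)
    sensed.append(state)

short = error_percentiles(load, sensed, window=10)    # high sampling rate
long_ = error_percentiles(load, sensed, window=2000)  # low sampling rate
print("window=10:  ", short)
print("window=2000:", long_)
```

Longer averaging windows correspond to lower sampling rates; with this stand-in model every error percentile shrinks as the window grows, mirroring the trend described for the simulated workloads.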
[Figure 3.5 plot: measurement error (%, 10^0 to 10^2) versus sampling frequency (Hz, 10^6 to 10^9) on logarithmic axes; curves for the 50th, 95th, 99th, 99.9th, and 99.99th percentiles.]
Figure 3.5: Percentiles of the error between observed and actual load
current vs. sampling frequency/interval length for the astar benchmark.
Low sampling rates (up to ≈ 100kHz, not shown) can capture average
current consumption of long intervals with 1% accuracy. Median error
ranges from 0.2% at 1MHz to 2% at 10MHz. 99th percentile error ranges
from 3% at 1MHz to 10% at 10MHz.
CHAPTER 4
LOUPE DESIGN AND IMPLEMENTATION
4.1 Design Goals
To achieve maximum bandwidth, accuracy, and flexibility, Loupe is based
on current sense resistors; high-bandwidth amplifiers shift and amplify the
voltage and current signals so that their anticipated value ranges fully occupy
a ±1V ADC input range. To minimize the effect on the DUT’s operating
voltage, the sense resistor should have low enough resistance that the worst-
case voltage drop across it is very small compared to the rail voltage. The
divide between the small current sense signal (0–10mV/0–120mV for 1V/12V
power rails) and the large ADC input range leads to a high required gain,
up to 200× for low-voltage rails feeding processors directly. An additional
factor of 2 gain is required if wideband current and voltage signals are to
be transmitted to the ADC via a terminated transmission line, since the
termination resistors act as a 2:1 voltage divider. Achieving high bandwidth
and high gain is one of the biggest challenges in designing Loupe, due to the
fundamental tradeoff between the two in amplifier design; in fact, amplifiers
are often specified in terms of a fixed gain-bandwidth product. We achieve
multi-MHz current amplifier bandwidth and gains of 36–134 in our two proof
of concept implementations using two cascaded stages of very high-speed
amplifiers; the first is an AD8129 differential amplifier in all cases with a
fixed gain of 10, and the second stage is an OPA843 or OPA847 depending on
the required gain. For voltage amplifiers, we used a single stage of LM7171A
or LMH6609 depending on required gain.
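The gain arithmetic above can be made explicit with a short sketch. The gain-bandwidth figure used at the end is a hypothetical placeholder, not a data-sheet value for any of the parts named here, and the single-pole bandwidth estimate is a first-order approximation.

```python
def required_gain(adc_span_v, signal_span_v, terminated=False):
    """Voltage gain needed for a signal span to fill the ADC input span.

    A doubly terminated transmission line halves the signal at the ADC,
    so the amplifier must supply an extra factor of 2.
    """
    gain = adc_span_v / signal_span_v
    return 2.0 * gain if terminated else gain

def closed_loop_bandwidth(gbp_hz, gain):
    """First-order estimate: bandwidth = gain-bandwidth product / gain."""
    return gbp_hz / gain

ADC_SPAN = 2.0  # volts, for a +/-1V input range

print(required_gain(ADC_SPAN, 10e-3))                   # 1V rail, 0-10mV sense signal
print(required_gain(ADC_SPAN, 10e-3, terminated=True))  # same, through a terminated line
print(required_gain(ADC_SPAN, 120e-3))                  # 12V rail, 0-120mV sense signal

HYPOTHETICAL_GBP = 1e9  # Hz; illustrative only
print(closed_loop_bandwidth(HYPOTHETICAL_GBP, 200.0) / 1e6, "MHz in a single stage")
```

Because bandwidth falls as single-stage gain rises, splitting the total gain across two cascaded stages lets each stage run at a much lower gain and correspondingly higher bandwidth.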
As mentioned in Section 2, a flat gain response and linear phase response
(flat group delay) for the frequencies of interest is necessary to recover a
clean time-domain amplified signal without complex and error-prone inverse
filtering. Our target for Loupe is flat group delay and 0.1dB gain response
flatness from DC to at least 10MHz for all channels. We verified using SPICE
simulations that all amplifier channels had a 0.1dB bandwidth of 10–50MHz;
channel group delays ranged from 1.5ns to 4ns, and varied less than 90ps
over the 0–10MHz band. These small, flat group delays can easily be cali-
brated out in the digital domain.
The sense resistor itself must be selected carefully for high-frequency mea-
surements, paying special attention to parasitic inductance and tempera-
ture coefficient (tempco) as discussed in Section 4.1.1. The amplifier power
supplies should be designed with the amplifier power supply rejection ratio
(PSRR) in mind so that voltage ripple on the amplifier power supply has a
minimal impact on the output signal; our limit for Loupe was less than 1/4 LSB
for a 12-bit ADC ($< \frac{1}{4} \cdot \frac{4\,\mathrm{V}}{2^{12}} = 81.4\,\mu\mathrm{V}$ on the amplifier output). Likewise,
variations in DUT rail voltage due to supply noise or DVFS will manifest as a
common-mode input signal at the amplifiers, but the amplifiers should have
sufficiently high CMRR to avoid large erroneous swings in the output current
waveform. For Loupe, we specify that the maximum variation in the DUT
rail voltage should cause less than 1/4 of a 12-bit LSB of variation at the output due to
amplifier CMRR. Finally, besides having sufficient gain-bandwidth product,
the amplifiers’ slew rate should be sufficient to drive full scale output sinu-
soids throughout the frequency band of interest with the load capacitance
presented by the Loupe PCB, coaxial cable, and ADC-side buffer.
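The slew-rate requirement follows from the peak slope of a sinusoid, 2πfVpeak. The sketch below assumes a 2V peak amplifier output (the [-2V, +2V] output range described later) at the 10MHz band edge; the load capacitance is a hypothetical value chosen only to illustrate the drive-current calculation.

```python
import math

def min_slew_rate(freq_hz, v_peak):
    """Minimum slew rate (V/s) needed to reproduce a sinusoid of amplitude v_peak."""
    return 2 * math.pi * freq_hz * v_peak

def capacitive_drive_current(c_load_farads, slew_v_per_s):
    """Peak current (A) needed to slew a capacitive load: I = C * dV/dt."""
    return c_load_farads * slew_v_per_s

sr = min_slew_rate(10e6, 2.0)
print(f"required slew rate: {sr / 1e6:.1f} V/us")

HYPOTHETICAL_C_LOAD = 100e-12  # farads; illustrative PCB + cable + buffer load
i_peak = capacitive_drive_current(HYPOTHETICAL_C_LOAD, sr)
print(f"peak drive current: {i_peak * 1e3:.1f} mA")
```

An amplifier whose data-sheet slew rate and output current comfortably exceed these figures will not slew-limit a full-scale signal within the band of interest.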
For Loupe, we wanted to decouple the ADCs from the amplifier and have
them on separate boards connected by coaxial cable, to enable upgrading
the ADC independently and using an off-the-shelf ADC board to avoid the
time-consuming task of designing, assembling, validating, and calibrating a
high-speed ADC PCB. The ADCs should have at least 12 bits of precision,
should sample at upwards of twice the analog bandwidth of the power sen-
sor (≥20MHz), and should be able to stream data back to a wide range of
DUTs or analysis systems over a high-bandwidth standard interface. The
ADCs should also ideally be on the same clock domain as the hardware that
receives processor activity data, so that the power and activity data can be
precisely aligned using a common timebase. Our design goal for the coaxial
cable was to have less than 0.1dB maximum insertion loss from 0–10MHz;
if using RG174 cable, this yields a maximum cable length of 3 feet, which
is sufficient in most cases to run a cable from the possibly space-constrained
DUT enclosure to the external ADC board.
4.1.1 Current Sense Resistor
The current sense resistor, while a conceptually simple component, plays
a key role in the ultimate accuracy of current measurements, especially at
high speed, and must be chosen carefully. Nominal resistance varies part-to-
part, but this variation can be calibrated out. Other important properties
of sense resistors cannot easily be calibrated out, including parasitic induc-
tance, temperature coefficient, and thermoelectric voltage. For long-lived
power measurement systems, long term resistance stability over thermal and
mechanical shock, vibration, and humidity are also important, but we do
not consider them in this work due to our relatively controlled measurement
environment and inability to meaningfully test these parameters.
All real components have some amount of parasitic inductance and
capacitance. We consider only surface mount resistors, as the high lead
inductance of through-hole resistors renders them unsuitable. Typical thick-
or thin-film surface mount resistors have parasitic capacitance on the order
of tens of femtofarads in parallel with the series RL circuit; we ignore this
capacitance in this work, as it only has a significant impact for RF or mi-
crowave circuits in the GHz range. The parasitic inductance can range from
picohenries to tens of nanohenries for the large resistors used in current sens-
ing, and resists changes in current by producing a differential voltage across
the resistor according to the differential equation $V = L\,\frac{dI}{dt}$. In a current
sensing application, this differential voltage is added to the “real” voltage
caused by IR drop, and amounts to error in the measured current. The
inductance adds reactance proportional to frequency producing an overall
impedance $Z = R + j2\pi fL$, artificially increasing the measured amplitude of
high-frequency current changes and rendering the amplifiers’ flat gain/linear
phase response useless. The resistor’s resistance R, its inductance L, and any
capacitance C to which the resistor is connected form an RLC circuit, also
called a damped resonator. At the circuit’s resonant frequency, calculated as
$\frac{1}{2\pi\sqrt{LC}}$, energy transfers repeatedly between the inductor and capacitor, and
is slowly dissipated (damped) as it travels through the resistor. The oscil-
lating voltage across the resulting RLC circuit after a rapid current change
is known as ringing, and is depicted in Figure 4.1. This ringing has two
negative effects. First, while the error in the sensed current due to ringing
will likely average out over time, this is not necessarily true in the short
[Figure 4.1 plot: current (A, 0 to 12) versus time (0.92 to 1.00 ×10^-6 s); traces for parasitic inductances of 0, 10pH, and 1nH.]
Figure 4.1: Measured current using a 10mΩ current sense resistor with
various parasitic inductances. The true current draw switches between 1A
and 10A with 10ns rise and fall times. Inductances up to several pH do not
contribute significant error, but inductances of 1nH and above produce
significant ringing. The maximum allowable parasitic inductance value
varies with resistor value, error budget, and load current frequency content.
term, so power and energy estimates for short time intervals will be greatly
affected. Second, ringing produces large peaks in the current sense voltage
during load transients. These peaks can be handled either by reducing the
gain in the power sensor so that the peak still fits in the input range of the
ADC, or by keeping higher gain and saturating the ADC during the peaks.
The first approach adds quantization error by reducing ADC resolution for
all current data, and the second adds clipping error. For our purposes a
parasitic inductance under 1nH gives sufficiently low error. If additional cor-
rection is necessary due to higher frequency content, more stringent error
requirements, or a highly inductive resistor, a single-pole RC filter can be
used between the sense resistor and the first-stage amplifier to cancel out the
parasitic inductance with capacitance [36].
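The effect of parasitic inductance on the sense impedance, and the time constant an RC compensation filter would need, can be sketched as below. This treats the resistor in isolation (it ignores the surrounding PDN and any ringing), and the compensation rule RC = L/R is the standard pole-zero cancellation condition, assumed here rather than taken from the design described in this work. Function names are illustrative.

```python
import math

def sense_impedance(freq_hz, r_ohms, l_henries):
    """Impedance Z = R + j*2*pi*f*L of a resistor with series inductance."""
    return complex(r_ohms, 2 * math.pi * freq_hz * l_henries)

def corner_frequency(r_ohms, l_henries):
    """Frequency at which the inductive reactance equals the resistance."""
    return r_ohms / (2 * math.pi * l_henries)

def compensation_time_constant(r_ohms, l_henries):
    """RC product that places a filter pole on the resistor's L/R zero."""
    return l_henries / r_ohms

R_SENSE = 10e-3  # 10 mOhm sense resistor
for l_par in (10e-12, 1e-9):  # the two nonzero inductances shown in Figure 4.1
    ratio = abs(sense_impedance(10e6, R_SENSE, l_par)) / R_SENSE
    print(f"L = {l_par:.0e} H: |Z|/R at 10MHz = {ratio:.4f}, "
          f"corner = {corner_frequency(R_SENSE, l_par) / 1e6:.2f} MHz, "
          f"RC = {compensation_time_constant(R_SENSE, l_par) * 1e9:.1f} ns")
```

The ratio |Z|/R is the factor by which a sinusoidal current at that frequency would be overstated if the inductance were left uncorrected.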
A resistor’s value changes with temperature. While the true temperature-
resistance curve is usually nonlinear and varies based on material composi-
tion, the effect of temperature is usually approximated using a linear coeffi-
cient called the temperature coefficient of resistance, also known as TCR or
tempco [37]. The TCR is a conservative measure of the fractional change in
resistance value (relative to nominal value) for every degree change in
temperature and is usually expressed in ppm/°C. Given an estimate of how much
the resistor’s temperature can be expected to vary over a period of time,
one can use the TCR to compute an upper bound on the error in measured
current due to temperature. The resistor’s temperature may vary not just
due to ambient temperature, but also due to self-heating from dissipating
power. Larger sense resistance values lead to larger current sense voltages
and easier current sensing, but also dissipate more power, heat the resistor
more, and cause more temperature-related error. A similar tradeoff exists
between physically large resistors with low thermal resistance and small re-
sistors with low parasitic inductance. The designer should choose a resistor
with a TCR as small as possible given the design constraints, and the effects
of TCR can be mitigated by keeping the resistor’s temperature more con-
stant. The temperature can be kept more constant by designing the system
to use resistors with small value and/or small thermal resistance, or by me-
chanical means such as a heatsink, forced air, or a temperature-controlled
oven like those used for precision oscillators and voltage references. Copper
PCB traces have a very poor TCR of 3930ppm/°C; indeed, high TCR and the small
current handling capability of thin copper layers are the primary reasons for
using special current sense resistors rather than carefully sized PCB traces.
When integrating a sense resistor into a power sensor, it is therefore impor-
tant that the PCB traces from the resistor to the first stage amplifier be
short, geometrically symmetric, and thermally symmetric to minimize their
contribution to sensed current error.
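The TCR error bound and the self-heating tradeoff can be illustrated with a short calculation. The copper TCR is the value quoted above; the 50ppm/°C sense resistor, the 10°C temperature swing, and the 25°C/W thermal resistance are hypothetical values chosen only for illustration.

```python
def tcr_error_fraction(tcr_ppm_per_c, delta_t_c):
    """Upper bound on fractional resistance (and current) error from TCR."""
    return tcr_ppm_per_c * 1e-6 * abs(delta_t_c)

def self_heating_rise(current_a, r_ohms, theta_c_per_w):
    """Temperature rise (degC) from I^2 * R dissipation through a thermal resistance."""
    return current_a ** 2 * r_ohms * theta_c_per_w

DELTA_T = 10.0  # degC; hypothetical temperature swing
print(f"copper trace:   {tcr_error_fraction(3930, DELTA_T) * 100:.2f}% error")
print(f"50ppm resistor: {tcr_error_fraction(50, DELTA_T) * 100:.3f}% error")

# 9A through a 10 mOhm resistor with a hypothetical 25 degC/W thermal resistance
rise = self_heating_rise(9.0, 10e-3, 25.0)
print(f"self-heating rise: {rise:.2f} degC")
```

The same temperature swing produces an error nearly two orders of magnitude larger in a copper trace than in the assumed low-TCR resistor, and the self-heating term shows why larger resistance values make the temperature swing itself worse.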
Wherever two dissimilar metals meet, as in the resistive element and the
substrate of a surface mount resistor or the substrate and the copper PCB
pads, a thermocouple is formed. That is, if the two metals are at different
temperatures, they develop a differential voltage proportional to the temper-
ature difference; this thermoelectric voltage is specified for a pair of materials
in V/°C. While this phenomenon is used to great effect for measuring temperature,
in the case of a current sense resistor the thermocouples formed are consid-
ered parasitic, since the thermoelectric voltage manifests as sensed current
error. The thermoelectric voltage is analogous to TCR in terms of system
design considerations; these considerations are broken into two broad cat-
egories, avoidance and mitigation. Thermoelectric voltage minimization is
mostly undertaken by component manufacturers by selecting thermoelectri-
cally compatible materials; its impact can be mitigated by minimizing the
temperature difference across all parasitic thermocouples. These mitigation
techniques are mostly at the PCB layout level; the easiest technique is to
align a resistor such that its terminals are isothermal based on a first-order
model of major heat sources in the system, such as high-power components.
Another technique is to implement a single resistance using two series half-
value resistors next to one another; even if the terminals of a given resistor
are not isothermal, producing a thermoelectric voltage, the voltages of the
two adjacent resistors will cancel [38].
4.1.2 Analog-to-Digital Converter
The voltage and current signals must ultimately be digitized in order to store
and make decisions based on power measurements. The first design choice is
whether to digitize voltage and current on two separate ADCs and multiply
in the digital domain, or use an analog multiplier and a single ADC. While
analog multipliers have simplicity, power, and cost benefits over a high-speed
ADC, there is no commercially available multiplier with the combination of
high bandwidth and high accuracy matching high-end ADCs. In our proof
of concept, we digitize voltage and current separately to achieve the high-
est bandwidth and accuracy possible and gain the ability to examine each
separately.1 The most obviously important specifications of the ADC in this
context are sampling rate and resolution, but several other specifications also
impact the uncertainty of the resulting power measurements. Integral nonlin-
earity (INL) and differential nonlinearity (DNL) characterize the nonideality
of the mapping from ADC input voltages to output codes. The system de-
signer can use a wide variety of techniques to compensate for the DC and
AC incarnations of these nonlinearities [39]. In this work, we use a simple
one-dimensional lookup table (LUT) to compensate for ADC nonlinearities
and amplifier gain, offset, and linearity errors. While our technique maps
a single ADC sample to an output current or voltage value considering no
other information, it could be extended to consider several previous samples
and internal ADC error sources.
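A one-dimensional calibration LUT of the kind described above might be built as in the following sketch. The calibration points are invented for illustration, and linear interpolation between them is an assumption; the actual calibration procedure and table contents are not specified here.

```python
def build_lut(cal_points, n_bits=12):
    """Build a per-code lookup table from (adc_code, true_value) pairs.

    Values between calibration points are linearly interpolated; codes
    outside the calibrated range are extrapolated from the nearest segment.
    """
    points = sorted(cal_points)
    lut = []
    for code in range(2 ** n_bits):
        # Select the first segment whose upper code is >= this code;
        # if none matches, the loop leaves the last segment selected.
        for lo, hi in zip(points, points[1:]):
            if code <= hi[0]:
                break
        slope = (hi[1] - lo[1]) / (hi[0] - lo[0])
        lut.append(lo[1] + slope * (code - lo[0]))
    return lut

# Hypothetical calibration data: ADC code -> true current in amps
CAL_POINTS = [(0, -0.02), (1024, 2.24), (2048, 4.51), (3072, 6.76), (4095, 9.0)]
lut = build_lut(CAL_POINTS)

raw_samples = [0, 1024, 1500, 4095]
print([round(lut[code], 3) for code in raw_samples])
```

At run time, correcting a sample is a single table lookup, `lut[code]`, which folds gain, offset, and static nonlinearity corrections into one step.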
1For example, to characterize voltage droop in the PDN or capture current stimulus
data to drive simulations using alternative PDNs.
Besides maintaining a correct code-to-input-voltage mapping, the other
primary requirement for obtaining an accurate digital waveform is well con-
trolled sample timing; that is, the position of the sample instants should
be well known in order to properly reconstruct a digital waveform with the
same shape as its analog counterpart. In this work, we consider only uniform
sampling regimes, in which the sampling instants are ideally regularly spaced
and sample jitter should be minimized. For externally clocked ADCs, sample
jitter consists of external clock jitter, which can be improved at the system
level, and internal aperture jitter, which is intrinsic to the ADC selected.
Nonuniform or periodic nonuniform sampling, where samples are taken at ir-
regularly spaced but known times, are also promising techniques in a power
measurement context. Nonuniform sampling allows perfect reconstruction
of certain signals that do not strictly meet the classical Nyquist criterion
of being bandlimited to half the average sample rate [40]. In offline power
measurement applications like a power optimizing compiler, the system has
the ability to run the same piece of code repeatedly, thus presenting a peri-
odic signal to the ADC. An importance sampling algorithm like VEGAS [41]
could be used to vary sample positions between runs to get more information
about quickly varying portions of the sampled waveforms; in this way, the
overall power and energy waveforms could be reconstructed accurately even
if the bandwidth of these waveforms greatly exceeded average ADC sample
rate, as long as the system has fine-grained control over sample position as
in the system proposed by Papenfuss et al. [42].
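The idea of combining repeated runs can be illustrated with a much simpler scheme than importance sampling: equivalent-time sampling, in which each run shifts the sample clock by a fixed fraction of the sample period and the runs are interleaved afterwards. This sketch is not VEGAS and not the Papenfuss et al. system; it assumes a perfectly repeatable waveform and exact control of the sample offset, and all names are illustrative.

```python
import math

def sample_run(signal, n_samples, sample_period, offset):
    """Uniformly sample one repetition of `signal`, delayed by `offset`."""
    return [signal(offset + n * sample_period) for n in range(n_samples)]

def interleave_runs(signal, n_samples, sample_period, n_runs):
    """Combine n_runs repetitions, each shifted by sample_period / n_runs,
    into one waveform with an effective period of sample_period / n_runs."""
    runs = [sample_run(signal, n_samples, sample_period,
                       k * sample_period / n_runs) for k in range(n_runs)]
    # Sample n of run k was taken at time (n + k / n_runs) * sample_period.
    return [runs[k][n] for n in range(n_samples) for k in range(n_runs)]

def waveform(t):
    """Hypothetical repeatable current waveform: 30MHz tone on a 5A offset."""
    return 5.0 + 2.0 * math.sin(2 * math.pi * 30e6 * t)

TS = 1 / 20e6  # 20MS/s ADC; a 30MHz tone exceeds its 10MHz Nyquist limit
dense = interleave_runs(waveform, n_samples=16, sample_period=TS, n_runs=8)
print(len(dense), "samples at an effective rate of", 8 / TS / 1e6, "MS/s")
```

Eight runs at 20MS/s yield the same samples as a single 160MS/s capture, so a tone well above the per-run Nyquist limit is recovered without aliasing.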
4.1.3 Amplifiers
Amplifiers are needed to apply gain and offset to the raw voltage and cur-
rent signals from the current sense resistor so that they occupy the entire
ADC input voltage range and maximize effective resolution. The µV- to
mV-level sense current signals and multi-Volt ADC input ranges lead to rel-
atively high required gain, up to 100× or more for low-voltage rails with low
burden voltage tolerance. The amplifiers must also have very high band-
width to be able to accurately measure the power and energy consump-
tion of very short pieces of code. Amplifiers are often specified in terms
of their gain-bandwidth product (GBP); our application requires very high
GBP, particularly for the current channel, since the differential current sense
voltage is necessarily much smaller than the rail voltage. In this work, we
only consider operational amplifiers (op-amps), as opposed to standalone
FET amplifiers, to take advantage of their complex integrated matching
and compensation mechanisms. Op-amps often have fairly complex closed-
loop gain characteristics which vary according to load capacitance, and are
often characterized using a scalar bandwidth figure, the -3dB bandwidth.
The -3dB bandwidth is the frequency $f_{-3\mathrm{dB}}$ for which $\mathrm{gain}(f) \geq \mathrm{gain}(0\,\mathrm{Hz}) - 3\,\mathrm{dB}$
for all $f \leq f_{-3\mathrm{dB}}$. For applications demanding absolute accuracy
across a broad frequency range, a more demanding specification is in or-
der; in this work, we use the -0.1dB flatness specification commonly used
in analog video. The -0.1dB flatness bandwidth is the frequency $f_{-0.1\mathrm{dB}}$ for
which $|\mathrm{gain}(f_1) - \mathrm{gain}(f_2)| \leq 0.1\,\mathrm{dB}$ for all $f_1, f_2 \leq f_{-0.1\mathrm{dB}}$. That is, given two
equal-power sinusoidal inputs of arbitrary frequencies up to the -0.1dB flat-
ness point, the two output waveforms will have power within 0.1dB of one
another [43]. This 0.1dB power difference corresponds to a maximum ampli-
tude difference of just over 1%. For first-order op-amp systems, the -0.1dB
flatness bandwidth will be about 6.55× lower than the -3dB bandwidth due
to the more stringent requirement [22].
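Both numerical claims above can be checked for an ideal single-pole response, |H(f)| = 1/sqrt(1 + (f/fc)^2). The sketch below assumes that idealized first-order model, with the -3dB point taken as the half-power corner frequency; real op-amp responses with peaking will differ.

```python
import math

def first_order_bandwidth_ratio(droop_db):
    """f_droop / f_c for a single-pole response |H| = 1 / sqrt(1 + (f/f_c)^2)."""
    return math.sqrt(10 ** (droop_db / 10.0) - 1.0)

def db_to_amplitude_ratio(db):
    """Amplitude ratio corresponding to a level difference in dB."""
    return 10 ** (db / 20.0)

half_power_db = 10 * math.log10(2)  # the "-3dB" point, about 3.01dB
ratio = (first_order_bandwidth_ratio(half_power_db)
         / first_order_bandwidth_ratio(0.1))
print(f"-3dB bandwidth / -0.1dB bandwidth = {ratio:.2f}")
print(f"0.1dB amplitude difference = {(db_to_amplitude_ratio(0.1) - 1) * 100:.2f}%")
```

For example, under this model an amplifier channel needs a -3dB bandwidth of roughly 65MHz to be flat within 0.1dB out to 10MHz.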
We set out to design one power sensor architecture that will work for
any voltage rail in commonly used processors and memory systems, so the
amplifiers connected to the sense resistor must be able to handle a common-
mode voltage of up to 12V with full accuracy.
For our application, high DC and AC accuracy are also important, so we
look for amplifiers with high common mode rejection ratio (CMRR) and
power supply rejection ratio (PSRR), and low bias current, offset current,
offset voltage, and gain error.
4.2 Implementation
We now describe a proof of concept implementation of Loupe, including two
separate power sensor PCBs for different categories of DUT, satisfying the
design constraints and goals described in Section 4.1. Figure 4.2 shows a
diagram of the complete system.
 
 
[Figure 4.2 block diagram: power supply, interposer, DUT, USRP (ADC), and data acquisition machine; signals labeled power, voltage, current, power data, and summarized power data.]
Figure 4.2: In our proof-of-concept experimental setup, the interposer
provides amplified voltage and current signals to the high-speed ADCs on
the USRP. The USRP’s FPGA combines the power data with processor
activity data sent directly from the DUT and streams the combined data to
a second measurement computer over USB 2.0. The combined data stream
could also be sent, directly from the USRP or via the measurement
computer, back to the DUT to aid self-hosted power optimization tools like
compilers and auto-tuners.
4.2.1 Power Sensor PCB
Bus-based accelerators like PCIe GPUs are an easy candidate to instrument
with Loupe, since power can be intercepted with no soldering or other sys-
tem modifications required. Our first power sensor implementation was thus
designed as an interposer between a PCIe card and a motherboard.
Motherboards use three rails to provide up to 75W to PCIe cards (12V, 3.3V,
and a lower-current 3.3VAUX), and our power sensor has three voltage and
three current channels to measure each rail separately.
Amplifier Circuit Design
The full amplifier schematics can be found in Figures A.1–A.2.
Noise Analysis. In addition to the systematic biases in output voltage
due to component tolerances and repeatable ambient temperature effects,
we must consider the effect of random noise. Noise is one of the two major
sources of non-systematic, uncalibratable error in our system, along with
self-heating-related thermal effects in the sense resistor and ICs. Thus, though we
include the computed noise parameters in our overall uncertainty analyses,
we also examine it separately here to gain insight into the fundamental limits
of accuracy in our system.
In op-amp circuits, the main sources of error are input noise current and
voltage associated with the op-amp itself and Johnson noise, also known as
thermal noise, from discrete offset and gain resistors [44]. The noise cur-
rent and voltage op-amp specifications are aggregates of multiple underlying
noise mechanisms such as Johnson noise from integrated resistors, shot noise,
and flicker noise. The Johnson noise spectral density observed across a
resistance R, measured in V/√Hz, is proportional to the square root of
absolute temperature T and is calculated as $\sqrt{4k_BTR}$, where kB is
Boltzmann’s constant. Johnson noise is
spectrally white, meaning that it has equal power at all frequencies; there-
fore, to convert the noise density to an RMS noise voltage, a frequency band
of interest with bandwidth B must be defined. With this constraint, the
RMS noise voltage over the frequencies of interest is $\sqrt{4k_BTRB}$, and the
distribution of noise voltage over time is assumed to be normal.
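The RMS Johnson noise expression can be evaluated directly. As a check, the sketch below uses the 2kΩ stage 1 feedback resistor, 25°C, and 10MHz bandwidth from Table 4.1; the function names are illustrative.

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant in J/K

def johnson_noise_density(r_ohms, temp_k):
    """Johnson noise voltage spectral density in V/sqrt(Hz)."""
    return math.sqrt(4 * K_B * temp_k * r_ohms)

def johnson_noise_rms(r_ohms, temp_k, bandwidth_hz):
    """RMS Johnson noise voltage over a given bandwidth."""
    return johnson_noise_density(r_ohms, temp_k) * math.sqrt(bandwidth_hz)

T_25C = 298.15  # kelvin
density = johnson_noise_density(2e3, T_25C)
v_rms = johnson_noise_rms(2e3, T_25C, 10e6)
print(f"2kOhm at 25C: {density * 1e9:.2f} nV/sqrt(Hz), "
      f"{v_rms * 1e6:.1f} uV RMS over 10MHz")
```

The result, about 18.1µV RMS, agrees with the noise contribution listed for the stage 1 Rf in Table 4.1.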
In the following analysis, we denote Johnson noise associated with a resistor
Rx as simply Rx; input noise current on the inverting and non-inverting
inputs as Iin− and Iin+, respectively; and input noise voltage as Vnin. Table
4.1 lists the relevant component values, parameters, and noise analysis for
the baseline design of our 0–9A, 0.3–1.7V mobile SoC power sensor. In our
analysis, noise voltages are referred to the output of the relevant amplifier
stage or stages (RTO noise), because we are ultimately interested in the
average- and worst-case noise produced at the ADC, rather than comparing
the noise magnitude against the magnitude of any particular input signal.
Total output noise is ≈ 795µV RMS over a 10MHz bandwidth, corresponding
to 0.815 LSB at the 12-bit ADC. RMS noise of 1/3–1 LSB actually has some
benefit in our application since it enables dithering, where oversampling can
be used to average out quantization error and improve resolution [45, 46].
Stage 1 contributes 2.26× as much gain and has 4.57× as much standalone
RTO noise as stage 2. As in many multi-stage amplifier systems, we find that
the noise of the first stage dominates. Specifically, stage 1’s input voltage
noise alone, when amplified by stage 2’s noise gain, is 91.3% as large as the
overall RSS noise.
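The combination of the two stage totals can be sketched as a root-sum-square (RSS) calculation using the values from Table 4.1. Two assumptions are made here: that the stage 2 noise gain is 1 + Rf/Rg for its listed resistor values, and that one LSB corresponds to 4V/2^12 at the amplifier output.

```python
import math

def rss(*values):
    """Root-sum-square of uncorrelated noise contributions."""
    return math.sqrt(sum(v * v for v in values))

def cascaded_rto_noise(stage1_rto, stage2_rto, stage2_noise_gain):
    """Noise at the stage 2 output: amplified stage 1 noise RSS'd with stage 2 noise."""
    return rss(stage1_rto * stage2_noise_gain, stage2_rto)

STAGE1_RTO = 146.13e-6              # V RMS, Table 4.1
STAGE2_RTO = 32.00e-6               # V RMS, Table 4.1
STAGE2_NOISE_GAIN = 1 + 174.0 / 39.2  # assumed 1 + Rf/Rg for stage 2

total = cascaded_rto_noise(STAGE1_RTO, STAGE2_RTO, STAGE2_NOISE_GAIN)
lsb = 4.0 / 2 ** 12                 # assumed 4V span at the amplifier output
print(f"total RTO noise: {total * 1e6:.1f} uV RMS = {total / lsb:.3f} LSB")
print(f"stage 1 share of total: {STAGE1_RTO * STAGE2_NOISE_GAIN / total * 100:.1f}%")
```

Under these assumptions the calculation reproduces the ≈795µV, 0.815 LSB total, and shows that the amplified stage 1 noise accounts for nearly all of it.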
By quantitatively evaluating noise in the design stage, we can quickly
evaluate several alternative designs or environmental conditions. Figure 4.3
shows the noise components of the stage 1 amplifier for the baseline design
Table 4.1: Noise analysis parameters and results for a two-stage 0–9A
current amplifier circuit.
Parameter                            Value          Noise Contribution
Bandwidth                            10MHz
Temperature                          25°C
Stage 1 Part Number                  AD8129
Stage 1 In+/−                        2.79nA         1.4µV / 1.4µV
Stage 1 InFB                         1.4pA/√Hz      8.85µV
Stage 1 Vnin                         133µV          133µV
Stage 1 Rfilt                        50Ω            2×8.85µV
Stage 1 Rf                           2kΩ            18.1µV
Stage 1 Rg                           221Ω           54.5µV
Total Stage 1 RTO Noise                             146.13µV
Stage 2 Part Number                  OPA847
Stage 2 In+/−                        2.5pA/√Hz      5.51µV / 1.38µV
Stage 2 Vnin                         0.85nV/√Hz     14.6µV
Stage 2 Rf                           174Ω           5.34µV
Stage 2 Rg                           39.2Ω          11.26µV
Stage 2 ROffset1/2                   137Ω / 2kΩ     24.9µV
Total Stage 2 RTO Noise                             32.00µV
Total Stage 1 + Stage 2 RTO Noise                   795.42µV
and 4 alternatives. The four alternative design decisions are: reducing the
feedback resistor values by 4× to 499Ω/54.9Ω, using 1kΩ input filter resistors,
eliminating the input filter resistors, and using an AD8130 amplifier with
lower noise and a gain of 2× versus the baseline AD8129’s 10×. When using
the alternative part, the stage 2 gain is increased to compensate for the
lost stage 1 gain. While reducing the feedback resistor values is an attractive
option to reduce Johnson noise and the impact of the FB input current noise,
it only reduces stage 1 RSS noise by 5.8% due to the dominance of input
voltage noise. The benefit may not be worth the 4× increase in drive current
required by the lower-valued resistors, especially considering the resulting
self heating not modeled in this analysis. Similarly, using a larger input filter
resistor or no filter at all impacts stage 1 RSS noise by +4.9% and -6.2%,
respectively.
Figure 4.4 shows the noise components of the stage 2 amplifier and the
cascaded two-stage system for several scenarios, all using the baseline stage
1 design. The first three groups of bars show the typical behavior at 25°C,
[Figure 4.3 bar chart: stage 1 RTO noise (V, 0 to 1.6E-04) for the Baseline, 499/54.9 Rf/Rg, 1k Rfilt, 0 Rfilt, and AD8130 (G=2) designs; bars: In-, In+, InFB, Vnin, Rf, Rg, Rfilt, Total.]
Figure 4.3: Stage 1 current amplifier noise components for the baseline
design and four potential design variants.
worst-case behavior at 25°C, and worst-case behavior over the 0–70°C
temperature range. The results vary by less than 0.3%, since the dominant
Vnin,Stage1 does not change. The fourth group shows the small effect (0.04%)
of reducing the offset resistor values by 10×; this noise reduction clearly does
not outweigh the 10× static power increase. The fifth group shows the overall
system implications of switching to an AD8130 in stage 1 and increasing the
gain in stage 2. In this case, the stage 1 RTO noise is halved despite the
AD8130’s larger Vnin due to the decreased noise gain. However, the stage 2
RTO noise is increased by 3.36× due to its correspondingly increased noise
gain. The overall effect is a noise increase of 2.18×, so the change is not
beneficial.
Figure 4.5 shows the noise components for the single stage voltage amplifier
in our 0.3–1.7V power sensor. This amplifier receives a much larger input
signal than the current amplifiers, and its noise gain is thus much lower at
3.85×. The overall RTO voltage noise is 47.81µV, or 0.049 LSBs at the ADC.
The dominant component, as with the current amplifiers, is Vnin.
SPICE Characterization. The amplifiers on the power sensor have a lin-
ear response on all voltage and current channels through the entire range
specified by the PCIe standard, as shown in Table 4.2. Group delays for
all six channels are small and extremely flat from DC to 20MHz, yielding
minimal time-domain waveform distortion. All six channels also have flat
AC responses to 10MHz or above, and amplify the signal range of interest to
[Figure 4.4 bar chart: RTO noise (V RMS, 0 to 2.0E-03) for the Nominal, Worst-Case @ 25C, Worst-Case 0-70C, Smaller Roffset, and Smaller Roffset, AD8130 scenarios; bars: Stage2 In-, Stage2 Vnin, Stage2 Rf/Rg, Stage2 Roffset, Stage2 Total, Stage1 In+/-, Stage1 InFB, Stage1 Vnin, Stage 1 Rf/Rg, Stage1 Rfilt, Stage1/Stage2 Total; data labels in scenario order: 7.954E-04, 7.955E-04, 7.956E-04, 7.951E-04, 1.733E-03.]
Figure 4.4: Stage 2 amplifier and two-stage system noise components for
the typical case, worst case, worst case over temperature, and two potential
design variants. All values are in RMS Volts.
[Figure 4.5 plot. Y-axis: Voltage Amp. RTO Noise (V), 0 to 5E-05. Noise components: In-, In+, Vnin, Rf, Rg, Roffset1, Roffset2, Total.]
Figure 4.5: Voltage amplifier noise components. Due to the lower noise
gain, the RTO voltage noise is much lower than that of the current
amplifiers, at 47.81µV RMS.
Table 4.2: Range of input values resulting in a linear DC response at the
USRP ADC for each of six channels on the PCIe interposer. All channels
have a linear response over the entire range allowed by the PCIe 1.0–3.0
standards. Small, stable group delays from 0–20MHz enable simple time
registration of voltage, current, and processor activity in the digital domain.
Channel       Measured ADC Linear Range   PCIe Specified Range      Group Delay
V(12V)        10.1–14.08V                 12.8–13.2V                1.54–1.56ns
V(3.3V)       2.73–3.77V                  2.97–3.63V                2.51–2.60ns
V(3.3VAUX)    2.73–3.77V                  2.97–3.63V                2.51–2.60ns
I(12V)        0–9A                        0–130mV (0–5.5A)          3.75–3.79ns
I(3.3V)       0–3.11A                     0–27mV (0–3A)             2.63–2.66ns
I(3.3VAUX)    0–0.4A                      0–26.25mV (0–0.375A)      2.63–2.66ns
occupy a [-2V, +2V] range, as shown in Figure 4.6. The termination resistors
on either end of the coaxial cables act as a voltage divider and attenuate the
amplified signal by 2× to fit the USRP ADC’s [−1V, +1V] input range.
Power Supply Circuit Design
PCB Design
Loupe’s amplifiers are housed in a custom designed 4-layer PCB, shown in
the renderings in Figure 4.7 and the photographs in Figure 4.8. The amplifier
PCB is in the form factor of a PCIe card, and the DUT input and output
connectors are PCIe male and female connectors. The board can fit inside a
standard ATX PC case between any PCIe 1.0–3.0 card and the motherboard.
We populated one copy of the board with amplifiers for all 3 PCIe power rails
in accordance with the PCIe 3.0 specified voltage and current ranges, and
another copy with amplifiers for low-voltage, high-current power rails like
those powering mobile SoCs, as shown in Figure 4.9. Together, these two
PCB variants span devices with two orders of magnitude difference in power
consumption with a single design.
[Figure 4.6 plots. Panels: (a) V(12V) AC; (b) I(12V) AC; (c) V(12V) AC (zoomed); (d) I(12V) AC (zoomed); (e) V(12V) DC; (f) I(12V) DC; (g) V(3.3V) AC; (h) I(3.3V) AC; (i) V(3.3V) AC (zoomed); (j) I(3.3V) AC (zoomed); (k) V(3.3V) DC; (l) I(3.3V) DC. AC panels plot Gain (dB) against Frequency (Hz); DC panels plot Voltage (V) against Input voltage (V).]
Figure 4.6: Signal chain AC and DC responses shown for 12V and 3.3V
voltage and current rails on PCIe power sensor. The 3.3V and 3.3VAUX
rails share identical amplifier circuits and differ only in sense resistor value.
All amplifiers have linear DC responses and amplify the signal values of
interest to occupy [-2V, +2V]. All AC responses have better than 0.1dB
flatness from 0Hz to at least 10MHz and minimal peaking above 10MHz.
[Figure 4.7 panels: (a) Front (2D); (b) Back (2D); (c) Front (3D); (d) Back (3D).]
Figure 4.7: 2D and 3D renderings of the 4-layer PCIe interposer PCB.
[Figure 4.8 panels: (a) Bare Board; (b) Populated, annotated with Signal Passthrough, DUT Power Input, DUT Power Output, Current Outputs, Voltage Outputs, Amplifier Power Supplies, and Current Sense Resistors.]
Figure 4.8: Photographs of the bare and populated 4-layer PCIe interposer
PCB. The interposer passes through PCIe data signals undisturbed, and
uses 1 and 2 stages of op-amps to amplify the voltage and current signals,
respectively, corresponding to the DUT-side voltage and voltage drop
across the current sense resistors on each of PCIe’s 3 power rails. The
amplified current and voltage signals are sent to a high-speed ADC via
RG174 coaxial cable.
[Figure 4.9 panels: PCI Express (12V, 3.3V, 3.3VAUX); Mobile SoC (0.3–1.7V, 0–{5,9}A).]
Figure 4.9: The Loupe amplifier PCB design can be populated with
different footprint-compatible amplifiers and passive components to measure
a wide range of systems, including a 75W PCIe card and a mobile SoC.
Breakout daughtercards enable the measurement of non-PCIe power rails.
Figure 4.10: The Jouler’s Loupe adapter PCB allows non-PCIe voltage rails
to be measured using the PCIe form factor power sensor by adapting
between male or female PCIe connectors and terminal blocks.
4.2.2 Connection to DUT
Our original PCIe form factor power sensor can measure unmodified PCIe
1.0–3.0 cards plugging into unmodified PCIe motherboards via its built-in
male and female PCIe connectors. To enable the same power sensor to mea-
sure current through wires in non-PCIe contexts, we designed an adapter
board as shown in Figure 4.10. The same adapter board can plug into a fe-
male PCIe connector using the bare edge connectors or a male PCIe connector
by soldering a female PCIe connector to the edge connectors on the adapter
board. The other side of the adapter contains terminal blocks where DUT
power wires can be attached without soldering. Using one male-configured
and one female-configured adapter board on each side of the power sensor,
we are able to intercept any external DUT power rail for measurement. Such
a configuration is best suited for low-frequency power measurement or for
measuring power through a wire or connector, since the loop inductance in-
curred by the ≈4.5 inch span of the adapter boards may cause significant
ringing in PCB power rails with rapidly varying current. The small-form-
factor power sensor we developed to address this problem has two parallel
sets of DUT connections: terminal blocks as in the PCIe adapter boards,
and large plated through holes with short, thick gauge wires soldered to the
through holes and the relevant positions on the DUT PCB. The sense resistor
on the small-form-factor sensor is less than 2mm from the plated through holes for
a dramatically reduced current loop area compared to the PCIe sensor plus
adapter boards.
[Figure 4.11 panels: (a) Bare Board; (b) Populated (Back); (c) Populated (Front).]
Figure 4.11: The small form factor single channel power sensor incorporates
some lessons learned during the design of the PCIe sensor and enables very
low inductance DUT connections for PCB level power measurements.
4.2.3 Small Form Factor Power Sensor
As mentioned in Section 4.2.2, the PCIe power sensor is well suited to mea-
suring currents through wires or connectors, but adds a relatively large par-
asitic inductance when used to measure non-PCIe PCB-level power rails.
Many PCB-level power rails of interest are also part of space-constrained
systems where the sheer size of the PCIe power sensor is problematic. We
designed a smaller single channel power sensor to address these issues, as
shown in Figure 4.11. The smaller sensor uses a circuit topology and com-
ponents similar to the PCIe sensor, including on-board +5V, +15V, and
-5V voltage regulators for the amplifiers. We also opportunistically modified
the design in small ways to further improve measurement uncertainty and
adjustability. First, we added the ability to select the voltages used in the
resistive divider that sets the DC offset for the voltage and stage 2 current
amplifiers by including multiple parallel offset resistor footprints. This gives
the user additional flexibility in the voltage and current ranges that can be
measured, the supported ADC input ranges, and the resistor values that can
be used to achieve a given offset. We also added a potentiometer in the offset
voltage dividers to enable some offset adjustability after assembly to com-
pensate for small DC offsets in the amplifiers and ADC. This adjustability
is helpful in avoiding clipping at the ADC when the amplifier output range
is very close to the ADC input range to maximize resolution. Care must be
taken to maintain the good noise performance of the power sensor by using a
small value potentiometer with low Johnson noise; in this work, we use a 50Ω
potentiometer. We also used a new Kelvin sensing pad layout for the cur-
rent sense resistor that can accommodate the Vishay WSK2512 series sense
resistors in addition to the WSL2512 series supported on the PCIe sensor;
the WSK series offers an improved TCR of ±35ppm/°C, down from ±50ppm/°C for
the WSL.
In this work, we populated the smaller power sensors with components to
measure two power rails with relatively low voltage, 0.3–1.7V, and relatively
high current, 0–5A and 0–9A. These measurement ranges are suitable for
measurement of two CPU, GPU, or DRAM power rails on a mobile SoC.
Both channels were designed assuming a 10mΩ sense resistor, for a full-scale
resistor voltage of 50mV and 90mV, respectively. Figure 4.12 shows the
DC transfer function and frequency response of these two channels. Due
to the smaller rail voltage, higher gain is required and these channels have
slightly lower -0.1dB bandwidth than the 3.3–12V PCIe rail amplifiers, but
still exceed the 10MHz specification.
4.2.4 Digital Processing
For the high-speed ADCs and incorporation of processor activity data, Loupe
uses an Ettus Research USRP1, a device originally designed for software-
defined radio. The USRP1 includes 4 64MSPS, 12-bit ADCs, a modest
FPGA, and a USB 2.0 interface. The coaxial cables carrying amplified
voltage and current signals are connected to the ADCs via low-frequency
receiver (LFRX) daughterboards. The daughterboards contain SMA coaxial
connectors, a unity-gain single-ended-to-differential amplifier, and headers to
break out GPIO pins on the FPGA. We transmit processor activity data to
the FPGA via a custom cable from the DUT’s signaling mechanism to the
daughterboard’s 0.1 inch headers. When using the
PCIe amplifier board, the host x86 PC sends the USRP processor activity
[Figure 4.12 plots. Panels: (a) V(9A) AC; (b) I(9A) AC; (c) V(9A) AC (zoomed); (d) I(9A) AC (zoomed); (e) V(9A) DC; (f) I(9A) DC; (g) V(5A) AC; (h) I(5A) AC; (i) V(5A) AC (zoomed); (j) I(5A) AC (zoomed); (k) V(5A) DC; (l) I(5A) DC. AC panels plot Gain (dB) against Frequency (Hz); DC panels plot Voltage (V) against Input voltage (V).]
Figure 4.12: Signal chain AC and DC responses of the voltage and current
channels on the small form factor 5A and 9A sensors.
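[Figure 4.13 block diagram. Within the FPGA, the 12-bit Current and Voltage inputs each pass through a LUT (12 to 16 bits), an LPF, and a Delay block; the delayed signals feed a multiplier (×) and integrator (∫) as well as downsamplers (↓), and I/O Logic combines the results with Processor Activity to produce Summarized Power Data to Host.]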
Figure 4.13: The FPGA in Jouler’s Loupe post-processes digitized current
and voltage data and integrates it with processor activity data to output a
stream of temporally registered power samples or energy estimates. The
post-processed data is sent to the DUT itself or a separate host machine via
USB; other USRP devices are available with higher-bandwidth 1Gb
Ethernet, 10Gb Ethernet, or PCIe interfaces.
data via a StarTech PEX1P PCIe-connected parallel port used as bit-bang
GPIO; on an Intel i7-2600k processor at 3.4GHz, the PC can write to the
parallel port 573000 times per second. The mean write latency is 1.365µs,
with a very small standard deviation of 22.274ns. Like the amplifier group
delay, then, this skew is very consistent and its mean is calibrated out using
the digital delay lines on the FPGA. Our knowledge of the distribution of
write latencies is useful in characterizing the uncertainty of aggregate time-
interval energy estimates, as discussed in Section 4.3.2. When using the small
form factor amplifier board, the mobile SoC DUT sends the USRP processor
activity data via memory-mapped GPIO pins, accessed from userspace when
necessary using a Linux kernel module.
The ADCs, which reside on the USRP’s main board, are directly connected
to the FPGA, where all at-speed digital-domain processing takes place. Fig-
ure 4.13 shows an overview of the signal processing chain implemented in
the FPGA. We heavily modified the USRP’s stock FPGA image to remove
unused communications-centric functionality, add needed calibration and fil-
tering logic, and improve dynamic range and rounding logic. The incoming
12-bit current and voltage samples from the ADC are first remapped to 16-bit
values via separate current and voltage LUTs. The LUTs are filled with val-
ues computed from a per-power-sensor calibration procedure and can correct
for any nonlinearities in the overall analog transfer function, including gain
resistor component tolerances, amplifier gain error and offset voltage, and
ADC INL and DNL. In our implementation, all FPGA logic is fixed-point;
that is, it assumes a simple linear relationship between the N -bit digital value
and the underlying current, voltage, power, or energy. A floating point or
other nonlinear mapping would improve resolution when the measured volt-
age or current is low, a helpful quality for measuring mobile and embedded
devices with a large dynamic range of possible power consumption; we leave
an investigation of such mappings to future work.
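To make the remapping step concrete, here is a minimal sketch of how such a lookup table could be filled on the host before being loaded into the FPGA. It assumes a purely linear gain/offset correction and an unsigned 16-bit output word; the real tables can hold arbitrary nonlinear corrections, and `build_lut`, `gain`, and `offset` are hypothetical names and values, not part of the Loupe sources.

```python
def build_lut(gain, offset, adc_bits=12, out_bits=16):
    """Map each raw ADC code to a corrected fixed-point output code.

    corrected = gain * raw + offset (in raw-LSB units), rescaled to the
    wider output word and clamped to its representable range.
    """
    n_codes = 1 << adc_bits
    out_max = (1 << out_bits) - 1
    scale = out_max / (n_codes - 1)
    lut = []
    for raw in range(n_codes):
        corrected = gain * raw + offset
        code = int(round(corrected * scale))
        lut.append(min(max(code, 0), out_max))
    return lut


# Hypothetical calibration result: 0.2% gain error and a 3.5 LSB offset.
lut = build_lut(gain=0.998, offset=3.5)
print(len(lut), lut[0], lut[-1])
```

Because the table is indexed directly by the raw ADC code, the correction costs one memory read per sample regardless of how complicated the calibration function is.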
The ADC analog inputs include a simple first-order RC antialiasing filter to
attenuate very high-frequency content, but the ADC sample rate fs greatly
exceeds the analog bandwidth of the power sensor fa. Therefore, the raw
digitized signal may still contain significant frequency content in the interval
[fa, fs/2], which does not necessarily exhibit the flat group delay
characteristics we have shown for the passband. This high frequency content should
therefore be attenuated to avoid corrupting the time domain properties of
the measured power signal; this attenuation is accomplished by linear-phase
FIR low pass filters, labeled “LPF” in Figure 4.13.
The remapped and filtered current and voltage data is fed to a digital delay
line (DDL) which delays each signal by a programmable amount of time. In-
teger cycle delays are implemented with a programmable shift register. The
fractional cycle remainder of the delay, if any, is achieved with
a fractional delay filter (FDF), which performs bandlimited interpolation to
shift its input signal by a fractional number of samples; in the frequency
domain, the ideal FDF has linear phase (i.e., flat group delay) and flat am-
plitude response [47]. The desired delay only depends on the power sensor
and DUT characteristics, so we are able to use a fixed FDF for a given power
sensor/DUT pair, improving the quality of the filter and reducing its
implementation cost. In future implementations, a more expensive variable FDF
may be used to enable setting the delay at runtime and avoid regenerating a
new FPGA image for each power sensor/DUT pair.
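The split between the shift register and the FDF can be illustrated with a short sketch. It assumes a Hamming-windowed sinc design for the fixed FDF, which is one common choice and not necessarily the filter used in Loupe, and it uses the 1.365µs mean parallel port latency and 64MSPS sample rate quoted in this section purely as an example delay. All function names are illustrative.

```python
import math


def split_delay(delay_samples):
    """Split a delay into whole shift-register cycles and a fractional part."""
    whole = math.floor(delay_samples)
    return whole, delay_samples - whole


def fractional_delay_taps(frac, n_taps=21):
    """Hamming-windowed sinc taps delaying by (n_taps - 1) / 2 + frac samples."""
    center = (n_taps - 1) / 2
    taps = []
    for n in range(n_taps):
        x = n - center - frac
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize for unity DC gain


def fir(taps, signal):
    """Direct-form FIR filter; samples before the start are treated as zero."""
    out = []
    for k in range(len(signal)):
        acc = 0.0
        for n, h in enumerate(taps):
            if k - n >= 0:
                acc += h * signal[k - n]
        out.append(acc)
    return out


SAMPLE_RATE_HZ = 64e6
whole, frac = split_delay(1.365e-6 * SAMPLE_RATE_HZ)  # about 87.36 samples
taps = fractional_delay_taps(frac)
print(whole, round(frac, 2))
```

The filter's own bulk delay of (n_taps − 1)/2 samples is a known constant, so in a real design it would be subtracted from the integer delay programmed into the shift register.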
After the DDL, the current and voltage data is sent down two parallel
paths: the first multiplies current and voltage to obtain power samples and
integrates the power samples to form energy estimates over time intervals
defined by the external processor activity data, and the second simply down-
samples the current and voltage data to fit the available host bandwidth.
Finally, a block of output processing logic sends either the integrated energy
estimates or a stream of interleaved current, voltage, and processor activity
data to the host, depending on the mode set by the user.
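The first of these two paths can be modeled in software as follows. This is a behavioral sketch, not the FPGA logic: it assumes the processor activity signal is presented as a level that is high while the code of interest runs, and it uses floating point where the hardware uses fixed point. The function name and the sample data are hypothetical.

```python
def integrate_energy(voltage, current, active, sample_period_s):
    """Return one energy estimate (J) per contiguous run of active samples."""
    energies = []
    acc = None  # None while outside a measured interval
    for v, i, a in zip(voltage, current, active):
        if a:
            # Multiply to get power, then accumulate power * dt as energy.
            acc = (acc or 0.0) + v * i * sample_period_s
        elif acc is not None:
            energies.append(acc)
            acc = None
    if acc is not None:  # interval still open at the end of the capture
        energies.append(acc)
    return energies


# Hypothetical capture: constant 1 V and 2 A, two measured intervals.
print(integrate_energy([1.0] * 6, [2.0] * 6, [0, 1, 1, 0, 1, 0], 0.5))
# [2.0, 1.0]
```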
4.3 Analysis
4.3.1 Calibration
Apart from the oscilloscope-based delay calibration, the end-to-end DC current-
to-ADC-code and voltage-to-ADC-code transfer functions were measured
using a low-noise linear power supply and a calibrated 6.5-digit multime-
ter. As predicted by SPICE simulations in the design phase, every chan-
nel’s DC response was extremely linear throughout the ADC’s input range
(r² > 0.9999998 for 15-point best fit lines). The calibration phase helps
correct for any DC gain and offset deviations caused by component tolerances.
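A best-fit line and its r² value of the kind quoted above can be computed with an ordinary least-squares fit. The sketch below uses only the standard library; the 15-point sweep is synthetic data invented for illustration, not measurements from a Loupe board.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus r^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic 15-point sweep: reference current (A) vs. mean ADC code, with a
# small alternating residual standing in for measurement noise.
currents = [9.0 * k / 14 for k in range(15)]
codes = [200.0 + 400.0 * amps + (0.3 if k % 2 else -0.3)
         for k, amps in enumerate(currents)]
slope, intercept, r2 = fit_line(currents, codes)
print(f"slope={slope:.3f} codes/A, intercept={intercept:.3f}, r2={r2:.9f}")
```

The fitted slope and intercept are the gain and offset that a per-board calibration would feed into the correction tables.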
4.3.2 Uncertainty Analysis
Single Measurement Uncertainty
Inspired by Nakutis [48], we used the software tool GUM Workbench [49]
to analyze the uncertainty in our current measurements. In an uncertainty
analysis, one specifies all the uncertain quantities, such as temperatures,
temperature coefficients, and component tolerances, any correlations between
these quantities, and the algebraic equations that combine them to form the
output variable (measured current in this case). The software tool then
analytically computes the contribution of each input quantity’s uncertainty
to the overall output uncertainty, and can run Monte Carlo simulations to
visualize complex output variable distributions.
Our uncertainty analysis involved 39 equations and 99 quantities, includ-
ing component tolerances, temperature variations, amplifier bias current and
offset voltage, and ADC clock jitter. Even with the conservatively tight tol-
erance for most components on the initial Loupe PCB, when the board is
uncalibrated and actual component values are not known, the uncertainty
is as high as 3.4% of full scale; of that uncertainty, most of it is caused by
uncertainty in stage 2 amplifier bias current (57%), sense resistor room tem-
perature value (12.6%), and stage 1 amplifier feedback resistor nominal value
(10.1%).
Once a board has been carefully calibrated, the uncertainty in many quan-
tities, including nominal passive values and amplifier offset/bias, can be re-
duced to nearly 0. Calibration drops the uncertainty by more than 20× to
0.16% of full scale; by far the dominant source of remaining uncertainty is the
sense resistor’s tempco and associated resistance change due to self-heating
by up to 25°C. To further reduce the uncertainty of a calibrated Loupe board,
the results indicate that a heatsink or fan for the sense resistors would be
most effective.
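A rough calculation shows why the sense resistor dominates after calibration. The sketch below multiplies the TCR values quoted in Section 4.2.3 for the two Vishay resistor series by the 25°C self-heating figure; it is a worst-case first-order estimate under our own simplifying assumption that resistance drift maps directly onto current error, not a substitute for the GUM Workbench analysis.

```python
def resistance_drift_percent(tcr_ppm_per_c, delta_t_c):
    """Worst-case resistance change, in percent, for a given TCR and heating."""
    return tcr_ppm_per_c * 1e-6 * delta_t_c * 100.0


SELF_HEATING_C = 25.0  # worst-case self-heating quoted in the text

for series, tcr in (("WSL2512", 50.0), ("WSK2512", 35.0)):
    drift = resistance_drift_percent(tcr, SELF_HEATING_C)
    print(f"{series}: {drift:.4f}% resistance change")

# Improvement from calibration: 3.4% of full scale down to 0.16%.
print(f"calibration improvement: {3.4 / 0.16:.1f}x")
```

A drift on the order of 0.1% is comparable to the 0.16% calibrated uncertainty, which is consistent with the tempco term being the largest remaining contributor.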
Application-Level Uncertainty
In most cases, the interesting outputs of a power measurement apparatus
at the application level will be estimates of power or energy derived from
multiple current and voltage samples. Additional statistical analysis can be
applied to understand the properties of these higher-level quantities in order
to inform a new class of more rigorous power-focused characterization and
optimization algorithms.
For the simple case of measuring a constant steady-state power, we can
treat individual power samples as independent random samples from a normal
distribution N(µ, σ). The uncertainty or standard deviation of the average
of n samples from such a distribution is σ/√n. Thus, to achieve any
desired uncertainty σd, we must take the average of (σ/σd)² samples.
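The relationship between per-sample noise and the number of samples to average is simple enough to express directly. The helper names and numbers below are illustrative only.

```python
import math


def samples_needed(sigma, sigma_target):
    """Smallest n such that sigma / sqrt(n) <= sigma_target."""
    return math.ceil((sigma / sigma_target) ** 2)


def uncertainty_of_mean(sigma, n):
    """Standard deviation of the mean of n independent samples."""
    return sigma / math.sqrt(n)


# Example: reducing a per-sample uncertainty of 2.0 units to 0.5 units.
n = samples_needed(2.0, 0.5)
print(n, uncertainty_of_mean(2.0, n))  # 16 0.5
```

Halving the target uncertainty quadruples the number of samples, so at a fixed sample rate it also quadruples the minimum measurement time.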
A more common case for power measurement in processor systems is the
estimation of the energy consumption of a piece of code using processor activ-
ity data to select which of the power samples are relevant. In this case, there
are not only amplitude errors for each current and voltage measurement, but
also timing errors due to the uncertainty in the temporal alignment between
the “start” and “stop” signals at the power measurement device and the
true starting and stopping points of the code’s execution on the processor.
The mean value of the processor activity signal delay can be calibrated out
by the digital delay lines on the FPGA, but any remaining jitter will cause
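[Figure 4.14 plot. Power P(t) against time t, with the Measured Interval bounded by t′1 and t′2 and the Actual Interval bounded by t1 and t2, ordered t′1, t1, t′2, t2 along the time axis. Shaded regions are annotated erre,t1 > 0 and erre,t2 < 0.]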
Figure 4.14: An illustration of energy estimation errors resulting from
timing errors. Errors in estimating the true start and stop times of a
measurement interval [t1, t2] cause energy over time periods at the
beginning and end of the interval to erroneously be included or excluded
when computing the energy estimate. We provide a quantitative analysis of the
distribution of energy errors given the distribution of timing errors and
measured power values.
additional uncertainty in the resulting energy estimate. The amount of total
estimation error depends not only on the amount and direction of timing
errors on the “start” and “stop” signals, but also on the measured power
during the erroneous time periods in question. For example, if power con-
sumption dropped to zero before and after executing a piece of code, then
“start” signals arriving too early or “stop” signals arriving too late would not
cause any error in the energy estimate. Quantifying the impact of timing un-
certainty on energy uncertainty can be handled in two ways: the measured
power signal’s behavior during the erroneous time periods can be treated
statistically as an ensemble of average behavior, yielding a single energy un-
certainty number for a given hardware platform, or the error distribution can
be more precisely calculated for a given pair of “start” and “stop” signals by
looking at the surrounding power measurements.
Assume we have a measured power signal P (t) and we would like to es-
timate the energy consumed by the processor over the interval [t1, t2]. Due
to timing uncertainty in the system, the start and stop signals arrive at the
FPGA at times t′1 and t′2. The timing errors errt,t1 = t1 − t′1 and errt,t2 = t2 − t′2
are independent, identically distributed random variables with a PDF f(t).
We assume that the timing errors have zero mean, since in a real system jitter
will be relative to the mean measured delay for that hardware platform. The
relationship between timing errors and energy estimation errors is shown in
Figure 4.14. In the figure, the real start time t1 is after the estimated start
time t′1, causing a positive energy error; that is, the shaded region is included
in the energy estimate when it shouldn’t be. The real stop time t2 is also after
the estimated stop time t′2, but this causes a negative energy error because
the shaded region is not included in the energy estimate when it should be.
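In the following equations, we use the slightly modified definite integral
notation:
\[
\int_x^y f(t)\,dt =
\begin{cases}
\int_x^y f(t)\,dt, & x \le y \\
-\int_y^x f(t)\,dt, & x > y
\end{cases}
\]
The expected value and variance of the energy error due to the timing
error τ in t′1 are given by:
\[
E[\mathrm{err}_{e,t_1}] = \int_{-\infty}^{\infty} \left[ f(\tau) \cdot \int_{t'_1}^{t'_1+\tau} P(t)\,dt \right] d\tau \tag{4.1}
\]
\[
\sigma^2(\mathrm{err}_{e,t_1}) = \int_{-\infty}^{\infty} f(\tau) \cdot \left( \int_{t'_1}^{t'_1+\tau} P(t)\,dt \right)^2 d\tau \tag{4.2}
\]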
While Equations 4.1–4.2 are too computationally expensive for real-time
use in the general continuous-time case, a much simpler solution can be found
if P (t) is piecewise-constant as in a sampled waveform with sample period
ts using zero-order hold reconstruction. If we register the processor activity
data using the ADC sample clock, in the absence of other information we
assume that the signals arrive in the center of the clock period. In this case,
the inner integrals are very simple when evaluated over a sample period, and
we can recast the outer integrals as summations where each term covers one
sample period. P [0] is a special case in that half the sample period is before t′1
yielding a negative inner integral, and half is after t′1 yielding a positive inner
integral; the integrands for all other sample periods are positive everywhere
or negative everywhere in the domain. To calculate expected error more
efficiently, we first split the outer integral into 4 pieces:
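\[
\begin{aligned}
E[\mathrm{err}_{e,t_1}] ={}& \int_{t'_1}^{t'_1-\frac{t_s}{2}} \left[ f(\tau) \cdot \int_{t'_1}^{t'_1+\tau} P(t)\,dt \right] d\tau
 + \int_{t'_1-\frac{t_s}{2}}^{-\infty} \left[ f(\tau) \cdot \int_{t'_1}^{t'_1+\tau} P(t)\,dt \right] d\tau \\
&+ \int_{t'_1}^{t'_1+\frac{t_s}{2}} \left[ f(\tau) \cdot \int_{t'_1}^{t'_1+\tau} P(t)\,dt \right] d\tau
 + \int_{t'_1+\frac{t_s}{2}}^{\infty} \left[ f(\tau) \cdot \int_{t'_1}^{t'_1+\tau} P(t)\,dt \right] d\tau
\end{aligned}
\]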
The first and third terms’ inner integrals have domains only covering a
single power sample, and their integrands can be replaced by P [0], a constant.
The inner integrals thus evaluate to P [0]τ , yielding:
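\[
\begin{aligned}
E[\mathrm{err}_{e,t_1}] ={}& P[0] \int_{t'_1}^{t'_1-\frac{t_s}{2}} \tau f(\tau)\,d\tau
 + \int_{t'_1-\frac{t_s}{2}}^{-\infty} \left[ f(\tau) \cdot \int_{t'_1}^{t'_1+\tau} P(t)\,dt \right] d\tau \\
&+ P[0] \int_{t'_1}^{t'_1+\frac{t_s}{2}} \tau f(\tau)\,d\tau
 + \int_{t'_1+\frac{t_s}{2}}^{\infty} \left[ f(\tau) \cdot \int_{t'_1}^{t'_1+\tau} P(t)\,dt \right] d\tau
\end{aligned}
\]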
For the second and fourth terms, we split the inner integrals into three
parts: the first half sample period from t′1 towards t′1 + τ, zero or more aligned
full sample periods, and the last partial sample period to get to t′1 + τ. We
will restate τ as the sum of ts/2, an integer number of sample periods i·ts, and
a sample period offset o = τ − i·ts − ts/2, all multiplied by sgn(τ). This split
yields:
\[
\begin{aligned}
E[\mathrm{err}_{e,t_1}] ={}& P[0] \int_{t'_1}^{t'_1-\frac{t_s}{2}} \tau f(\tau)\,d\tau \\
&+ \int_{t'_1-\frac{t_s}{2}}^{-\infty} f(\tau) \cdot \left[ \int_{t'_1}^{t'_1-\frac{t_s}{2}} P(t)\,dt + \int_{t'_1-\frac{t_s}{2}}^{t'_1-\frac{t_s}{2}-it_s} P(t)\,dt + \int_{t'_1+\tau-o}^{t'_1+\tau} P(t)\,dt \right] d\tau \\
&+ P[0] \int_{t'_1}^{t'_1+\frac{t_s}{2}} \tau f(\tau)\,d\tau \\
&+ \int_{t'_1+\frac{t_s}{2}}^{\infty} f(\tau) \cdot \left[ \int_{t'_1}^{t'_1+\frac{t_s}{2}} P(t)\,dt + \int_{t'_1+\frac{t_s}{2}}^{t'_1+\frac{t_s}{2}+it_s} P(t)\,dt + \int_{t'_1+\tau-o}^{t'_1+\tau} P(t)\,dt \right] d\tau
\end{aligned}
\]
Of the three split integral components, the first simply evaluates to P[0]ts/2,
the second is a summation of similar terms over the full sample periods
between t′1 and t′1 + τ, and the third contains τ and is analogous to the inner
integrals of the first and third terms. The outer integral of the second and
fourth terms can be expanded into a sum of integrals over sample periods.
Performing these two operations leads to:
\[
\begin{aligned}
E[\mathrm{err}_{e,t_1}] ={}& P[0] \int_{t'_1}^{t'_1-\frac{t_s}{2}} \tau f(\tau)\,d\tau \\
&+ \sum_{i=-\infty}^{0} \left[ \int_{t'_1-\frac{t_s}{2}-(i+1)t_s}^{t'_1-\frac{t_s}{2}-it_s} f(\tau) \cdot \left( -\frac{P[0]t_s}{2} - \sum_{j=i}^{-1} \bigl[ P[j]t_s \bigr] + P[-i-1] \cdot \left( \tau + it_s + \frac{t_s}{2} \right) \right) d\tau \right] \\
&+ P[0] \int_{t'_1}^{t'_1+\frac{t_s}{2}} \tau f(\tau)\,d\tau \\
&+ \sum_{i=0}^{\infty} \left[ \int_{t'_1+\frac{t_s}{2}+it_s}^{t'_1+\frac{t_s}{2}+(i+1)t_s} f(\tau) \cdot \left( \frac{P[0]t_s}{2} + \sum_{j=1}^{i} \bigl[ P[j]t_s \bigr] + P[i+1] \cdot \left( \tau - it_s - \frac{t_s}{2} \right) \right) d\tau \right]
\end{aligned}
\tag{4.3}
\]
Equation 4.3 contains two types of integral terms of the form ∫f(τ)dτ and
∫τf(τ)dτ, both with domains of a half or full aligned sample period. These
integral terms are all constants that can be precomputed once for a given
platform and jitter PDF. For brevity, we adopt the notation F[i] and G[i] for
these two sets of coefficients, where i is the index of the relevant P sample;
coefficients covering the negative and positive halves of P[0] will use the
subscripts 0− and 0+, respectively. We can now rewrite Equation 4.3 as:
\[
\begin{aligned}
E[\mathrm{err}_{e,t_1}] ={}& -P[0]\,G[0^-] - \sum_{i=-\infty}^{-1} \left( t_s \left[ \frac{P[0]}{2} + \sum_{j=i+1}^{-1} P[j] \right] F[i] + P[i]\,G[i] \right) \\
&+ P[0]\,G[0^+] + \sum_{i=1}^{\infty} \left( t_s \left[ \frac{P[0]}{2} + \sum_{j=1}^{i-1} P[j] \right] F[i] + P[i]\,G[i] \right)
\end{aligned}
\tag{4.4}
\]
The analogous equation for variance is:
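\[
\begin{aligned}
\sigma^2(\mathrm{err}_{e,t_1}) ={}& P[0]^2\,G[0^-] + \sum_{i=-\infty}^{-1} \left( t_s \left[ \frac{P[0]^2}{2} + \sum_{j=i+1}^{-1} P[j]^2 \right] F[i] + P[i]^2\,G[i] \right) \\
&+ P[0]^2\,G[0^+] + \sum_{i=1}^{\infty} \left( t_s \left[ \frac{P[0]^2}{2} + \sum_{j=1}^{i-1} P[j]^2 \right] F[i] + P[i]^2\,G[i] \right)
\end{aligned}
\tag{4.5}
\]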
As a practical matter, the jitter PDF will have bounded support for real
systems, and the user may choose to truncate the PDF at a certain point to
avoid excess computation to account for extremely rare cases of large timing
error. Equations 4.4–4.5 hold for any finite or infinite sets of coefficients
F [i] and G[i], and the structure of the inner and outer summations suggests
an efficient iterative algorithm to compute the expected value or variance
in linear time with respect to |F [i]|. Listing 4.1 shows C-like pseudocode
for such an algorithm using a PDF supported on the interval [a . . . b] and,
without loss of generality, using floating point values throughout and negative
indexing on the P , F , and G arrays.
float ev = P[0] * (G[ZERO_PLUS] - G[ZERO_MINUS]);
float sum_p = 0.0f;  /* running sum of P[j] for j strictly between 0 and i */
for (int i = -1; i >= a; i--) {
    ev -= ts * (P[0]/2 + sum_p) * F[i] + (P[i] * G[i]);
    sum_p += P[i];   /* P[i] enters the sum only for later i, per Eq. 4.4 */
}
sum_p = 0.0f;
for (int i = 1; i <= b; i++) {
    ev += ts * (P[0]/2 + sum_p) * F[i] + (P[i] * G[i]);
    sum_p += P[i];
}
return ev;
Listing 4.1: Iterative algorithm for computing expected energy error for
a timing jitter PDF supported on [a,b].
It may seem unintuitive that E[erre,t1] is not necessarily zero; that is, even
though mean delay has been calibrated out, for a given measurement the
mean energy error may be nonzero, meaning that this expected value
should be added to the nominal energy estimate returned to the application.
Suppose ts = 1, f(t) is a uniform distribution over [−1, 2], and
P[−1 . . . 1] = [0W, 1W, 2W]. If we want to compute the expected energy error
for a run starting during sample period 0 (t′1 assumed to be 0.5), we first
compute the relevant coefficients. For a uniform distribution with domain of
size 3, f(t) = 1/3 everywhere, so F[−1] = F[1] = 1/3 and
G[−1, 0−, 0+, 1] = [1/6, 1/24, 1/24, 1/6]; the inner sums over P[j] are empty
in this example since we are only looking at one sample on either side of
t′1. The summation of Equation 4.4 evaluates to
−1 · 1/24 − (1 · 1/2 · 1/3 + 0 · 1/6) + 1 · 1/24 + (1 · 1/2 · 1/3 + 2 · 1/6) = 1/3 J
expected error. Many terms
cancel due to symmetries in the uniform jitter PDF, but the overall expected
error is positive, meaning that, on average, our measurements overestimate
the energy during [t1, t2] by 1/3 J. This overestimation is due to the asymmetry
in the measured power waveform with respect to t′1; though the true t1 is
equally likely to be before or after the measured t′1 given our jitter PDF, the
cases where t1 is after t′1 have a disproportionate impact on the expected
error since the erroneously attributed power is higher.
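The worked example can be reproduced numerically. The sketch below evaluates Equation 4.4 iteratively and cross-checks the result against a brute-force midpoint-rule evaluation of Equation 4.1. It uses the coefficient values that appear in the arithmetic of the example (F[±1] = 1/3, G[±1] = 1/6, G[0±] = 1/24), and it interprets the uniform PDF over [−1, 2] as timing errors τ in [−1.5, 1.5] around t′1 = 0.5; that interpretation, and all function names, are our assumptions.

```python
def expected_energy_error(P, F, G, g0_minus, g0_plus, ts, a, b):
    """Evaluate Equation 4.4 for samples P[a..b] around the start sample P[0].

    P, F, and G are dicts keyed by (possibly negative) sample index.
    """
    ev = P[0] * (g0_plus - g0_minus)
    sum_p = 0.0  # sum of P[j] for j strictly between 0 and i
    for i in range(-1, a - 1, -1):
        ev -= ts * (P[0] / 2 + sum_p) * F[i] + P[i] * G[i]
        sum_p += P[i]
    sum_p = 0.0
    for i in range(1, b + 1):
        ev += ts * (P[0] / 2 + sum_p) * F[i] + P[i] * G[i]
        sum_p += P[i]
    return ev


def cumulative_energy(P, ts, t):
    """Integral of the zero-order-hold waveform from its first sample to t."""
    total, k, edge = 0.0, min(P), min(P) * ts
    while edge < t:
        total += P[k] * min(ts, t - edge)
        edge += ts
        k += 1
    return total


def brute_force_expected_error(P, ts, t1_meas, tau_lo, tau_hi, steps=3000):
    """Midpoint-rule evaluation of Equation 4.1 for a uniform jitter PDF."""
    base = cumulative_energy(P, ts, t1_meas)
    d_tau = (tau_hi - tau_lo) / steps
    density = 1.0 / (tau_hi - tau_lo)
    total = 0.0
    for n in range(steps):
        tau = tau_lo + (n + 0.5) * d_tau
        inner = cumulative_energy(P, ts, t1_meas + tau) - base  # signed
        total += density * inner * d_tau
    return total


P = {-1: 0.0, 0: 1.0, 1: 2.0}   # watts; sample k spans [k*ts, (k+1)*ts)
F = {-1: 1 / 3, 1: 1 / 3}
G = {-1: 1 / 6, 1: 1 / 6}
ev = expected_energy_error(P, F, G, 1 / 24, 1 / 24, ts=1.0, a=-1, b=1)
check = brute_force_expected_error(P, 1.0, 0.5, -1.5, 1.5)
print(ev, check)
```

Both evaluations give an expected error of about 1/3 J, matching the hand calculation; the iterative form needs only one pass over the samples inside the PDF's support.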
The astute reader will note that the foregoing analysis has focused on
t1, the beginning of a measurement interval on the processor, exclusively.
Equations 4.4–4.5 can be used to quantify end-of-interval error around t2
by replacing t′1 with t′2 and inverting the signs of the four terms in each
equation, since t′2 < t2 leads to an underestimation of interval energy, as
opposed to overestimation when t′1 < t1.
CHAPTER 5
EXPERIMENTAL METHODOLOGY
Our experimental platform is an ODROID-XU development board running
version 3.4.75 of the Linux kernel. The ODROID-XU+E powers the A7
cores off their own rail and has its own 10mΩ sense resistor on-board con-
nected to an INA231 I2C-connected power monitor. We removed the current
sense resistor and connected its pads to the input and output of the Loupe
measurement board via two short wires. We were then able to use Loupe’s
current sense resistor, which has far better thermal dissipation, parasitic in-
ductance, and tempco properties. We communicate processor activity data
to the USRP via a memory-mapped GPIO pin using a very simple proto-
col; we toggle the GPIO at the beginning and end of each measured run.
More sophisticated protocols might use multiple GPIOs to do fine-grained
accounting of execution time between multiple layers of the software stack.
The GPIO is connected to the USRP via a 1.8V-to-3.3V level shifter and a
custom-built cable. Its latency mean and standard deviation are even lower
than the parallel port solution for x86 machines due to tighter coupling be-
tween the processor core and the GPIO.
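On the capture side, the GPIO trace can be used to delimit each measured run in the sampled power data. The following is a minimal sketch of that idea, not the Loupe software itself: it assumes the GPIO and power channels are sampled on a common clock, that the GPIO level changes once at the start and once at the end of a run, and that a simple rectangle-rule sum is accurate enough. The function names, threshold, and synthetic data are all hypothetical.

```python
def find_toggles(gpio, threshold=0.5):
    """Return sample indices where the GPIO trace crosses the threshold."""
    high = [v > threshold for v in gpio]
    return [i for i in range(1, len(high)) if high[i] != high[i - 1]]


def run_energy(power_w, gpio, sample_rate_hz, threshold=0.5):
    """Integrate power between the first two GPIO toggles (rectangle rule)."""
    toggles = find_toggles(gpio, threshold)
    if len(toggles) < 2:
        raise ValueError("need a start toggle and an end toggle")
    start, end = toggles[0], toggles[1]
    dt = 1.0 / sample_rate_hz
    return sum(power_w[start:end]) * dt


# Synthetic capture at 8 MSPS: 1 W during the run, 0.2 W before and after.
fs = 8e6
gpio = [0] * 10 + [1] * 80 + [0] * 10
power = [0.2] * 10 + [1.0] * 80 + [0.2] * 10
energy_j = run_energy(power, gpio, fs)
print(f"{energy_j * 1e6:.1f} uJ")  # 10.0 uJ
```

A multi-GPIO protocol of the kind mentioned above would extend `find_toggles` to several traces, one per software layer.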
We demonstrate Loupe’s capabilities by using it to modify FFTW, a
performance-focused FFT auto-tuner, to measure power for the transforms it
runs as well. FFTW can decompose large, multidimensional transforms into
combinations of many different algorithms, data layouts, loop nest orders,
and so on, and is thus fertile ground for testing power-performance tradeoffs
at a software level, since it provides many different ways of performing the
same computation.
To ensure that all FFT variants were tested under the same conditions,
we turned off DVFS and kept all 4 A7 cores at 1.2GHz, 1.2375V nominal.
Precise performance data was gathered using the ARM cycle counter; we
used a kernel module to allow userspace access to the appropriate registers.
We modified FFTW version 3.3.3 to toggle the GPIO before and after each
timed run. We found through experimentation that power measurements
converged to a high degree of repeatability and stability when runs took at
least 10000 processor cycles (8.3µs), just over one cycle of the buck regulator
powering the A7s. Clear features can be seen in the raw data at a much finer
granularity, but nonlinearities arising from the regulator itself and the power
delivery network require slightly longer to average out. Further experimen-
tation is needed to develop a robust procedure for obtaining high-confidence
measurements of even shorter runs. To reduce interference effects from other
processes and the operating system, we run each FFT variant 8 times and
use the minimum runtime and minimum measured energy.
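The reduction over repeated runs can be sketched as follows. This is an illustration under stated assumptions: the 1.2GHz clock comes from the text, but the measurement tuples are made up, and we assume the two minima are taken independently (the text does not say whether they must come from the same repetition).

```python
CORE_CLOCK_HZ = 1.2e9  # A7 clock frequency stated in the text


def cycles_to_us(cycles, clock_hz=CORE_CLOCK_HZ):
    """Convert a processor cycle count to microseconds."""
    return cycles / clock_hz * 1e6


def summarize_runs(runs):
    """Reduce repeated (cycles, energy_j) measurements to their minima.

    The minimum runtime and the minimum energy are taken independently,
    so they need not come from the same repetition.
    """
    return min(c for c, _ in runs), min(e for _, e in runs)


# Hypothetical measurements for 8 repetitions of one FFT variant.
runs = [(41200, 5.31e-6), (40950, 5.28e-6), (43010, 5.60e-6), (40990, 5.27e-6),
        (41020, 5.29e-6), (40960, 5.30e-6), (41500, 5.35e-6), (40970, 5.28e-6)]
best_cycles, best_energy = summarize_runs(runs)
print(best_cycles, best_energy)
print(round(cycles_to_us(10000), 1), "us for the 10000-cycle minimum run length")
```

Taking minima rather than means discards repetitions inflated by interrupts or other processes, which can only add time and energy to a run.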
CHAPTER 6
RESULTS AND DISCUSSION
Analog Bandwidth. Figure 6.1 shows the benefit of Loupe’s high analog
bandwidth; it is able to detect short-lived events that lower-bandwidth sys-
tems would miss entirely, and can perform other measurements much faster
than existing apparata.
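To see why bandwidth matters for short-lived events, one can pass a sampled trace through a low-pass filter with a given 3dB bandwidth. The sketch below assumes a first-order (single-pole RC) response; the filter actually used to produce the simulated traces in Figure 6.1 is not stated here, and the spike data are synthetic.

```python
import math


def lowpass(samples, sample_rate_hz, cutoff_hz):
    """First-order IIR low-pass with the given 3 dB cutoff (RC model)."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, y = [], samples[0]
    for x in samples:
        y += alpha * (x - y)  # exponential smoothing step
        out.append(y)
    return out


# Synthetic trace: a 1 us, 1 A current spike on a 0.1 A baseline at 8 MSPS.
fs = 8e6
trace = [0.1] * 400
for i in range(200, 208):
    trace[i] = 1.0

for bw in (10e3, 80e3):
    peak = max(lowpass(trace, fs, bw))
    print(f"{bw / 1e3:.0f} kHz bandwidth: observed peak {peak:.3f} A (true peak 1.000 A)")
```

Under this model both lower-bandwidth instruments report a peak well below the true value, and the 10kHz one barely registers the event.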
FFTW. We used the FFTW planner to perform 1.32 million runs spanning
thousands of different FFT problems in about 5 minutes. The problems in-
clude real and complex, in-place and out-of-place, and sizes from 1 up to 4.2
million points. The null hypothesis when considering energy-specific opti-
mizations is that optimizing for high performance is the same as optimizing
for low energy; that is, static power and baseline dynamic power are great
enough to override whatever dynamic power differences exist between multi-
ple strategies to perform the same computation, and the most energy efficient
configuration will be the one that finishes first (and thus incurs the least of
this fixed cost). If this null hypothesis were true universally, there would
be no need for energy-focused optimizations or tools; performance-centric
compilers would already be optimal.
[Plot: current vs. time (µs), 50–350 µs; traces: Loupe, 80kHz BW, 10kHz BW.]
Figure 6.1: Real downsampled 8MSPS current data from the ODROID-XU
mobile SoC board and simulated data from existing measurement
approaches from the literature with 3dB bandwidths of 10kHz and 80kHz.
Loupe captures many samples per cycle of the voltage regulator. Loupe’s
10MHz analog bandwidth enables higher accuracy and a much shorter
minimum run time.
Of the 5230 problems where more than one solution was attempted, many
indeed supported the null hypothesis: when plotting energy vs. runtime for
all solutions attempted, a single solution, the most performant one, lies
on the Pareto-optimal frontier, and the frontier is said to be degenerate.
However, a substantial fraction, 433/5230 or 8.3%, yielded interesting
energy-performance tradeoffs and a Pareto frontier with multiple options.
For example, Figure 6.2 shows a configuration with 6 Pareto-optimal
solutions. These results include the static and baseline dynamic power of
the other 3 A7 cores, which remain idle during the benchmark but are
powered off the same rail as the active core. Subtracting 3/4 of 30.047mA,
the measured median baseline power, from all power results to more fairly
reflect the power consumed on the active core yields 476/5230 problems
with interesting energy-performance tradeoffs, or about 9.1%. Such a high
fraction of problems with interesting energy-performance tradeoffs suggests
that there is substantial opportunity for a new generation of tools to
optimize software for any weighted combination of energy and performance
on modern hardware.
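Deciding whether a problem has a degenerate frontier amounts to computing the Pareto-optimal set of its (runtime, energy) points. The following is a minimal sketch of one common way to do that (sort by time, then sweep); the function name and the sample data are hypothetical and are not taken from the FFTW experiments.

```python
def pareto_frontier(points):
    """Return the (time, energy) points not dominated by any other point.

    A point is dominated if another point is no worse in both time and
    energy and strictly better in at least one of them.
    """
    frontier = []
    best_energy = float("inf")
    # Sweep in order of increasing time (ties broken by energy); a point
    # joins the frontier only if it strictly improves on the best energy.
    for t, e in sorted(set(points)):
        if e < best_energy:
            frontier.append((t, e))
            best_energy = e
    return frontier


# Hypothetical (cycles, microjoules) results for one FFT problem.
solutions = [(39500, 6.9), (40200, 6.4), (41000, 6.6),
             (42500, 5.9), (44500, 5.5), (46000, 5.8)]
frontier = pareto_frontier(solutions)
print(frontier)
print("degenerate" if len(frontier) == 1 else "non-degenerate")
```

A frontier of length one corresponds to the null hypothesis holding for that problem: the fastest solution is also the most energy efficient.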
[Plot: energy (µJ), 4.5–7.5, vs. execution time (cycles), 39000–49000, for
FFTW problem (dft 0 0 0 1 1 ((256 2048 2)) ()).]
Figure 6.2: Scatter plot of energy consumption vs. execution time for a
256-point complex out-of-place FFT. Of the 27 ways we tried to execute
this FFT, the 6 highlighted in blue are on the Pareto-optimal frontier. The
most energy efficient option is 12.6% slower but uses 20.2% less energy than
the fastest option. An energy-focused autotuner could choose among these
6 options according to the user’s preference for high performance or energy
efficiency.
CHAPTER 7
CONCLUDING REMARKS
In this dissertation, we motivated a faster, more accurate, more rigorously de-
veloped power measurement platform. We presented Jouler’s Loupe, a power
measurement system addressing the shortcomings of previous techniques. Fi-
nally, we showed that Loupe can indeed be used to gather power-performance
data very quickly on real hardware, and that there exists considerable oppor-
tunity for tools, enabled by Loupe, to optimize software for energy efficiency
rather than performance.
APPENDIX A
POWER SENSOR DESIGN
A.1 Power Sensor Schematic
A.1.1 PCIe Power Sensor
Figures A.1 – A.5 show the schematics for the three-channel PCIe 3.0 form
factor power sensor. All passives are 0603 or larger to facilitate manual as-
sembly. Fewer than three channels can be populated if necessary; indeed,
we used a PCIe form factor board as the prototype for a two-channel sen-
sor for a mobile SoC. The gain and offset of the amplifiers can be changed
within a wide range, to accommodate systems with widely varying voltage
and current ranges of interest, by varying passive component values.
A.1.2 Mobile CPU Power Sensor
The single channel power sensor uses the same power supply configuration as
the PCIe power sensor, as depicted in Figure A.2. The amplifiers use nearly
the same topology other than the small enhancements listed in Section 4.2.3,
but have different passive component values due to their different voltage and
current measurement ranges. The amplifier schematics are shown in Figures
A.6–A.7.
[Schematic: sense resistors R6–R8; AD8129ARMZ amplifiers U4–U6; OPA843 (U7)
and OPA847 (U8, U9) amplifiers; SMA edge connectors J4–J6.]
Figure A.1: Current sense resistors and current amplifiers.
A.2 Amplifier Error Analysis
In this section, we list the equations used to model the sources of error or
uncertainty in the power measurements obtained by our power sensor. These
equations model the most relevant specifications of the passive components
and integrated circuits that influence current and voltage measurements. We
show equations for a two-stage amplifier, as in the current channel of our
power sensors; the single-stage voltage channel is modeled in much the same
way, omitting the equations pertaining to the sense resistor and stage 2
amplifier. Note that each of the quantities listed in the equations below
may itself be represented as a distribution of values, rather than a single
nominal value. In this way, we can capture error sources like resistor value
tolerances in our analysis by modeling them implicitly.
[Schematic: LM7171A (U10) and LMH6609 (U11, U12) amplifiers; SMA right-angle
connectors J7–J9.]
Figure A.2: Voltage amplifiers.
[Schematic: LMR62014 (U1), TPS62163 (U2), and LM2611 (U3) regulators, with
12VIN, 15V, +5V, and -5V nets.]
Figure A.3: Voltage regulators.
[Schematic: REFCLK, TX0–TX15, and RX0–RX15 differential pairs on PCIe x16
card-edge connectors J1 and J2.]
Figure A.4: PCIe differential pairs.
[Schematic: power, ground, JTAG, SMBus, PRSNT#, WAKE#, PWRGD, and reserved
pins on PCIe x16 card-edge connectors J1 and J2.]
Figure A.5: Miscellaneous PCIe signals.
[Schematics: (a) Voltage Amplifier, (b) Current Amplifiers.]
Figure A.6: 0–9A channel amplifier schematics.
[Schematics: (a) 0–5A Voltage Amplifier, (b) 0–5A Current Amplifiers.]
Figure A.7: 0–5A channel amplifier schematics.
A.2.1 Sense Resistor
The following equations compute the actual resistance value $R_s$ and
differential voltage $V_{R_s}$ of the sense resistor based on the resistor’s
nominal value $R_{sn}$, load current $I$, temperature coefficient of
resistance (TCR) $\mathrm{TCR}_{R_s}$, thermoelectric voltage $V_{t,R_s}$,
and the temperature differences from the resistor to nominal ambient
temperature (usually 20°C or 25°C) ($\Delta T_{R_s,A}$) and between the two
terminals of the resistor ($\Delta T_{R_s,S}$):
\begin{align}
R_s &= R_{sn} \times (1 + \Delta T_{R_s,A} \times \mathrm{TCR}_{R_s}) \tag{A.1} \\
V_{R_s} &= I \times R_s + \Delta T_{R_s,S} \times V_{t,R_s} \tag{A.2}
\end{align}
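Equations A.1 and A.2 can be evaluated directly for a single set of parameter values, as in the sketch below. The function and argument names are ours, and the numbers are illustrative rather than the component values of the actual sensor; in a full uncertainty analysis each argument would be drawn from a distribution instead.

```python
def sense_resistor_voltage(r_nominal, current, tcr, dt_ambient,
                           v_thermo_per_k, dt_terminals):
    """Evaluate Equations A.1 and A.2 for one set of parameter values."""
    r_actual = r_nominal * (1 + dt_ambient * tcr)               # (A.1)
    v_rs = current * r_actual + dt_terminals * v_thermo_per_k   # (A.2)
    return r_actual, v_rs


# Illustrative values only: 10 mOhm nominal, 1 A load, 50 ppm/K TCR,
# 20 K above nominal ambient, 3 uV/K thermoelectric voltage, and 0.5 K
# between the resistor terminals.
r, v = sense_resistor_voltage(10e-3, 1.0, 50e-6, 20.0, 3e-6, 0.5)
print(f"R_s = {r * 1e3:.4f} mOhm, V_Rs = {v * 1e3:.4f} mV")
```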
A.2.2 Stage 1 Amplifier
The stage 1 amplifier is a non-inverting amplifier whose gain is set by a
feedback resistor Rs1f and gain resistor Rs1g. We model value tolerances,
temperature difference from ambient, and TCR of the resistors. We also
model the effects of the amplifier’s CMRR, PSRR, offset voltage, and offset
voltage temperature drift on the ultimate input voltage being amplified. Fi-
nally, we model the amplifier’s DC gain error, gain error temperature drift,
and gain nonlinearity.
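\begin{align}
R_{s1f} &= R_{s1fn} \times (1 + \Delta T_{R_{s1f},A} \times \mathrm{TCR}_{R_{s1f}}) \tag{A.3} \\
R_{s1g} &= R_{s1gn} \times (1 + \Delta T_{R_{s1g},A} \times \mathrm{TCR}_{R_{s1g}}) \tag{A.4} \\
V_{cm,\mathrm{Stage1}} &= (V_{\mathrm{Rail}} + (V_{\mathrm{Rail}} - V_{R_s}))/2 \tag{A.5} \\
V_{in,\mathrm{Stage1}} &= -V_{R_s} + V_{os,\mathrm{Stage1}} + \mathrm{TC}_{V_{os},\mathrm{Stage1}} \times \Delta T_{\mathrm{Stage1}} \notag \\
&\quad + \frac{V_{cm,\mathrm{Stage1}}}{\mathrm{CMRR}_{\mathrm{Stage1}}} + \frac{\mathrm{Ripple}_{V_{dd},\mathrm{Stage1}}}{\mathrm{PSRR}_{\mathrm{Stage1}}} \tag{A.6} \\
\mathrm{Gain}_{\mathrm{Stage1}} &= \left(1 + \frac{R_{s1f}}{R_{s1g}}\right) \times (1 + \mathrm{DCGainError}_{\mathrm{Stage1}} \notag \\
&\quad + \Delta T_{\mathrm{Stage1}} \times \mathrm{TC}_{\mathrm{Stage1},\mathrm{GainError}} + \mathrm{GainNonlinearity}_{\mathrm{Stage1}}) \tag{A.7} \\
V_{out,\mathrm{Stage1}} &= V_{in,\mathrm{Stage1}} \times \mathrm{Gain}_{\mathrm{Stage1}} \tag{A.8}
\end{align}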
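The stage 1 model can be written as a single function of its inputs. The sketch below follows Equations A.5 through A.8 term by term; it assumes CMRR and PSRR are given as linear ratios (not in dB), defaults every error term to its ideal value, and uses illustrative resistor values rather than those of the actual board.

```python
def stage1_output(v_rs, v_rail, r_f, r_g, v_os=0.0, tc_vos=0.0, dt=0.0,
                  cmrr=float("inf"), ripple_vdd=0.0, psrr=float("inf"),
                  dc_gain_error=0.0, tc_gain_error=0.0,
                  gain_nonlinearity=0.0):
    """Evaluate Equations A.5 through A.8 for one set of parameter values."""
    v_cm = (v_rail + (v_rail - v_rs)) / 2                      # (A.5)
    v_in = (-v_rs + v_os + tc_vos * dt
            + v_cm / cmrr + ripple_vdd / psrr)                 # (A.6)
    gain = (1 + r_f / r_g) * (1 + dc_gain_error
                              + dt * tc_gain_error
                              + gain_nonlinearity)             # (A.7)
    return v_in * gain                                         # (A.8)


# Illustrative values: 10 mV across the sense resistor on a 1.2375 V rail,
# with a gain of 11 set by hypothetical 1 kOhm and 100 Ohm resistors.
ideal = stage1_output(0.010, 1.2375, 1000.0, 100.0)
with_cmrr = stage1_output(0.010, 1.2375, 1000.0, 100.0, cmrr=1e4)
print(ideal, with_cmrr)
```

Comparing the two outputs isolates the contribution of finite common-mode rejection, which matters here because the common-mode voltage (about the rail voltage) is far larger than the differential signal.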
A.2.3 Stage 2 Amplifier
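The stage 2 amplifier is an inverting amplifier whose gain is set by a
feedback resistor $R_{s2f}$ and gain resistor $R_{s2g}$. The amplifier’s
offset is set by a voltage divider of two resistors $R_{s2o1}$ and
$R_{s2o2}$. We model value tolerances, temperature difference from ambient,
and TCR of all four resistors. We also model the effects of the amplifier’s
bias currents ($I_{b-}$ and $I_{b+}$ for the inverting and noninverting
inputs, respectively), CMRR, PSRR, offset voltage, and offset voltage
temperature drift on the ultimate input voltage being amplified.
\begin{align}
R_{s2f} &= R_{s2fn} \times (1 + \Delta T_{R_{s2f},A} \times \mathrm{TCR}_{R_{s2f}}) \tag{A.9} \\
R_{s2g} &= R_{s2gn} \times (1 + \Delta T_{R_{s2g},A} \times \mathrm{TCR}_{R_{s2g}}) \tag{A.10} \\
R_{s2o1} &= R_{s2o1n} \times (1 + \Delta T_{R_{s2o1},A} \times \mathrm{TCR}_{R_{s2o1}}) \tag{A.11} \\
R_{s2o2} &= R_{s2o2n} \times (1 + \Delta T_{R_{s2o2},A} \times \mathrm{TCR}_{R_{s2o2}}) \tag{A.12} \\
I_{b+,\mathrm{Stage2}} &= I_{b,\mathrm{Stage2}} + \frac{I_{\mathrm{offset},\mathrm{Stage2}}}{2} \tag{A.13} \\
I_{b-,\mathrm{Stage2}} &= I_{b,\mathrm{Stage2}} - \frac{I_{\mathrm{offset},\mathrm{Stage2}}}{2} \tag{A.14} \\
V_{I_{b+},\mathrm{Stage2}} &= \frac{I_{b+,\mathrm{Stage2}}}{\frac{1}{R_{s2o1}} + \frac{1}{R_{s2o2}}} \tag{A.15} \\
V_{I_{b-},\mathrm{Stage2}} &= \frac{I_{b-,\mathrm{Stage2}}}{\frac{1}{R_{s2f}} + \frac{1}{R_{s2g}}} \tag{A.16} \\
V_{\mathrm{VoltageDivider}} &= V_{dd,\mathrm{VoltageDivider}} \times \frac{R_{s2o1}}{R_{s2o1} + R_{s2o2}} \tag{A.17} \\
V_{cm,\mathrm{Stage2}} &= \frac{V_{\mathrm{VoltageDivider}} + V_{out,\mathrm{Stage1}}}{2} \tag{A.18} \\
V_{in,\mathrm{Stage2}} &= V_{out,\mathrm{Stage1}} - V_{\mathrm{VoltageDivider}} + V_{os,\mathrm{Stage2}} \notag \\
&\quad + \mathrm{TC}_{V_{os},\mathrm{Stage2}} \times \Delta T_{\mathrm{Stage2}} + \frac{V_{cm,\mathrm{Stage2}}}{\mathrm{CMRR}_{\mathrm{Stage2}}} + \frac{\mathrm{Ripple}_{V_{dd},\mathrm{Stage2}}}{\mathrm{PSRR}_{\mathrm{Stage2}}} \notag \\
&\quad + (V_{I_{b+},\mathrm{Stage2}} - V_{I_{b-},\mathrm{Stage2}}) \tag{A.19} \\
\mathrm{Gain}_{\mathrm{Stage2}} &= -\frac{R_{s2f}}{R_{s2g}} \tag{A.20} \\
V_{out,\mathrm{Stage2}} &= V_{\mathrm{VoltageDivider}} + (V_{in,\mathrm{Stage2}} \times \mathrm{Gain}_{\mathrm{Stage2}}) \tag{A.21}
\end{align}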
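A corresponding sketch for the stage 2 model follows Equations A.13 through A.21. As before, the function names are ours, CMRR and PSRR are assumed to be linear ratios, error terms default to their ideal values, and the component values in the example are hypothetical.

```python
def parallel(r_a, r_b):
    """Parallel combination of two resistances."""
    return 1.0 / (1.0 / r_a + 1.0 / r_b)


def stage2_output(v_out_stage1, v_dd_divider, r_f, r_g, r_o1, r_o2,
                  i_bias=0.0, i_offset=0.0, v_os=0.0, tc_vos=0.0, dt=0.0,
                  cmrr=float("inf"), ripple_vdd=0.0, psrr=float("inf")):
    """Evaluate Equations A.13 through A.21 for one set of values."""
    i_b_plus = i_bias + i_offset / 2                       # (A.13)
    i_b_minus = i_bias - i_offset / 2                      # (A.14)
    v_ib_plus = i_b_plus * parallel(r_o1, r_o2)            # (A.15)
    v_ib_minus = i_b_minus * parallel(r_f, r_g)            # (A.16)
    v_div = v_dd_divider * r_o1 / (r_o1 + r_o2)            # (A.17)
    v_cm = (v_div + v_out_stage1) / 2                      # (A.18)
    v_in = (v_out_stage1 - v_div + v_os + tc_vos * dt
            + v_cm / cmrr + ripple_vdd / psrr
            + (v_ib_plus - v_ib_minus))                    # (A.19)
    gain = -r_f / r_g                                      # (A.20)
    return v_div + v_in * gain                             # (A.21)


# Hypothetical values: -0.1 V from stage 1, a 5 V divider supply, a gain
# of -2, and a divider that places the offset at 1 V.
ideal = stage2_output(-0.1, 5.0, 1000.0, 500.0, 1000.0, 4000.0)
with_bias = stage2_output(-0.1, 5.0, 1000.0, 500.0, 1000.0, 4000.0,
                          i_bias=1e-6)
print(ideal, with_bias)
```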
A.2.4 Transmission Line Termination Resistors and ADC
Buffer
The stage 2 amplifier is connected to the ADC buffer via a transmission line,
with series and parallel termination resistors $R_{t1}$ and $R_{t2}$ on the
amplifier and ADC buffer side, respectively. Since the transmission line is
properly terminated and the frequencies and currents of interest do not
result in significant attenuation over short coaxial cables, we do not model
the cable itself. The ADC buffer is a single-ended to differential amplifier
whose gain is set by two pairs of resistors, $R_{abf-}$ and $R_{abg-}$ on the
negative input and $R_{abf+}$ and $R_{abg+}$ on the positive input. We model
value tolerances, temperature difference from ambient, and TCR of all six
resistors. We model the effects of varying resistor values on the buffer’s
gain and the differential output voltage presented to the ADC.
\begin{align}
R_{t1} &= R_{tn} \times (1 + \Delta T_{R_{t1},A} \times \mathrm{TCR}_{R_{t1}}) \tag{A.22} \\
R_{t2} &= R_{tn} \times (1 + \Delta T_{R_{t2},A} \times \mathrm{TCR}_{R_{t2}}) \tag{A.23} \\
V_{in,ab} &= V_{out,\mathrm{Stage2}} \times \frac{R_{t2}}{R_{t1} + R_{t2}} \tag{A.24} \\
R_{abg+} &= R_{abg+n} \times (1 + \Delta T_{R_{abg+},A} \times \mathrm{TCR}_{R_{abg+}}) \tag{A.25} \\
R_{abg-} &= R_{abg-n} \times (1 + \Delta T_{R_{abg-},A} \times \mathrm{TCR}_{R_{abg-}}) \tag{A.26} \\
R_{abf+} &= R_{abf+n} \times (1 + \Delta T_{R_{abf+},A} \times \mathrm{TCR}_{R_{abf+}}) \tag{A.27} \\
R_{abf-} &= R_{abf-n} \times (1 + \Delta T_{R_{abf-},A} \times \mathrm{TCR}_{R_{abf-}}) \tag{A.28} \\
\beta_{ab1} &= \frac{R_{abg+}}{R_{abg+} + R_{abf+}} \tag{A.29} \\
\beta_{ab2} &= \frac{R_{abg-}}{R_{abg-} + R_{abf-}} \tag{A.30} \\
\mathrm{Gain}_{ab} &= 2 \times \frac{1 - \beta_{ab1}}{\beta_{ab1} + \beta_{ab2}} \tag{A.31} \\
V_{out,ab} &= V_{in,ab} \times \mathrm{Gain}_{ab} \tag{A.32}
\end{align}
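The termination divider and buffer gain of Equations A.24 and A.29 through A.32 can be sketched as below. The names and resistor values are hypothetical. With matched resistor pairs the gain expression reduces to the familiar feedback-to-gain resistor ratio, which makes a convenient sanity check.

```python
def adc_buffer_output(v_out_stage2, r_t1, r_t2,
                      r_g_pos, r_f_pos, r_g_neg, r_f_neg):
    """Evaluate Equations A.24 and A.29 through A.32."""
    v_in_ab = v_out_stage2 * r_t2 / (r_t1 + r_t2)   # (A.24)
    beta1 = r_g_pos / (r_g_pos + r_f_pos)           # (A.29)
    beta2 = r_g_neg / (r_g_neg + r_f_neg)           # (A.30)
    gain = 2 * (1 - beta1) / (beta1 + beta2)        # (A.31)
    return v_in_ab * gain                           # (A.32)


# Hypothetical values: 50 Ohm series and parallel termination (halving the
# signal) and 500 Ohm gain and feedback resistors (unity buffer gain).
matched = adc_buffer_output(1.0, 50.0, 50.0, 500.0, 500.0, 500.0, 500.0)
# The same circuit with a 1% high feedback resistor on the positive side.
mismatched = adc_buffer_output(1.0, 50.0, 50.0, 500.0, 505.0, 500.0, 500.0)
print(matched, mismatched)
```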
A.2.5 ADC
Sample Jitter
In a real uniform sampling ADC, the sampling instants will deviate slightly
from their ideal positions at $t = N \times \frac{1}{f_s}$. This deviation is
known as sample jitter, and has two main components: the jitter in the
triggering edges of the external ADC sample clock, and the jitter in the
internal delay between the triggering clock edge and the sample capture.
These two components are known as clock jitter and aperture delay jitter,
respectively. The modeling equations used in our uncertainty analysis are as
follows:
\begin{align}
\frac{dV_{in,\mathrm{ADC}}}{dt} &= \sqrt{2} \times \pi \times f_{in} \times \mathrm{Amplitude}_{in} \tag{A.33} \\
t_{\mathrm{SampleJitter}} &= t_{\mathrm{ClockJitter}} + t_{\mathrm{ApertureDelayJitter}} \tag{A.34} \\
V_{in,\mathrm{ADCWithJitter}} &= V_{in,\mathrm{ADC}} + \frac{dV_{in,\mathrm{ADC}}}{dt} \times t_{\mathrm{SampleJitter}} \tag{A.35}
\end{align}
Note that we only model the aperture delay jitter and not the mean aperture
delay itself, as this delay can be calibrated out in the full system along with
the average delays of the amplifiers, coaxial cable, and processor activity
signaling mechanism.
The ADC in the USRP1 used for this work is clocked using a temperature-
compensated crystal oscillator (TCXO) with ≤1ps RMS phase jitter [50].
The ADC itself has 1.2ps RMS aperture delay jitter [51], for a maximum root
sum square (RSS) sample jitter of 1.56ps. The TCXO may be replaced by hand
with a more expensive pin-compatible model to reduce clock jitter [52]. The
timing characteristics of the replacement are expressed in terms of its
phase noise spectrum. Translating these to a scalar jitter figure for
uncertainty analysis [53] yields an RMS jitter of 0.347ps, reducing the
total RSS sample jitter to 1.25ps. If sample jitter is small enough that the
input waveform maps to the same ADC code at the ideal and actual sampling
instants, the jitter has no practical effect. For an $N$-bit ADC and a
full-scale input sinusoid with frequency $f_{in}$, the jitter must be less
than $\frac{1}{\pi \times f_{in} \times 2^{N+1}}$ to cause a sampled voltage
error of less than $\frac{1}{2}$ LSB in the worst case [54]. For a 10MHz
sinusoid matching the input bandwidth used in our prototype power sensor and
a 12-bit ADC, the critical jitter value is 3.886ps. Assuming a normal jitter
distribution, the actual jitter will exceed the critical value 1.29% of the
time using the stock TCXO, and 0.19% of the time using the upgraded TCXO.
Note that a full-scale 10MHz sinusoid is the worst case for slew rate and
thus jitter-induced error, and jitter is low enough using the stock TCXO
that jitter will seldom cause a different ADC code to be read.
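The jitter figures in this paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes a zero-mean normal jitter distribution and a two-sided exceedance probability (jitter of either sign counts), which is our reading of the text; the function names are ours.

```python
import math


def rss(*values):
    """Root sum square of independent RMS contributions."""
    return math.sqrt(sum(v * v for v in values))


def critical_jitter(f_in_hz, n_bits):
    """Jitter bound for under 1/2 LSB error on a full-scale sinusoid."""
    return 1.0 / (math.pi * f_in_hz * 2 ** (n_bits + 1))


def fraction_exceeding(sigma, limit):
    """Probability that a zero-mean normal variable falls outside +/-limit."""
    return math.erfc(limit / (sigma * math.sqrt(2)))


t_crit = critical_jitter(10e6, 12)     # 10 MHz input, 12-bit ADC
stock = rss(1.0e-12, 1.2e-12)          # stock TCXO plus ADC aperture jitter
upgraded = rss(0.347e-12, 1.2e-12)     # upgraded TCXO plus ADC aperture jitter

print(f"critical jitter: {t_crit * 1e12:.3f} ps")
print(f"stock:    {stock * 1e12:.2f} ps RSS, exceeds {fraction_exceeding(stock, t_crit):.2%}")
print(f"upgraded: {upgraded * 1e12:.2f} ps RSS, exceeds {fraction_exceeding(upgraded, t_crit):.2%}")
```

The results should land close to the 3.886ps, 1.56ps, and 1.25ps figures and the percentages quoted above.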
The slew rate can either be modeled using empirical data for a particular
system, or by modeling the analog waveform as an ensemble of one or more
sinusoids with given frequencies and amplitudes. For a given sinusoid with
amplitude $A$ and frequency $f$, the sinusoid is described by
$f(t) = A \times \sin(2 \times \pi \times f \times t)$ and its slew rate by
$\frac{df(t)}{dt} = 2 \times \pi \times f \times A \times \cos(2 \times \pi \times f \times t)$.
The distribution of slew rates over a period is U-shaped, with a mean value
of 0, a mean absolute value of $4 \times f \times A$, and peaks at
$\frac{df(t)}{dt} = \pm(2 \times \pi \times f \times A)$. The standard
uncertainty of this distribution is $\sqrt{2} \times \pi \times f \times A$;
GUM Workbench and other uncertainty analysis software can handle U-shaped
distributions directly.
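The statistics of the slew-rate distribution are easy to confirm numerically by sampling one period of the sinusoid. The sketch below is only a check of the closed-form values (mean 0, mean absolute value 4 × f × A, RMS √2 × π × f × A); the function name and sample count are arbitrary choices.

```python
import math


def slew_statistics(amplitude, freq_hz, n=20000):
    """Sample the slew rate of A*sin(2*pi*f*t) uniformly over one period."""
    peak = 2 * math.pi * freq_hz * amplitude
    # Evaluate at interval midpoints so the samples cover the period evenly.
    slews = [peak * math.cos(2 * math.pi * (k + 0.5) / n) for k in range(n)]
    mean = sum(slews) / n
    mean_abs = sum(abs(s) for s in slews) / n
    rms = math.sqrt(sum(s * s for s in slews) / n)
    return mean, mean_abs, rms


A, f = 1.0, 10e6  # 1 V amplitude, 10 MHz
mean, mean_abs, rms = slew_statistics(A, f)
print(f"mean     = {mean:.3e} V/s")
print(f"mean abs = {mean_abs:.4e} V/s  (4fA        = {4 * f * A:.4e})")
print(f"rms      = {rms:.4e} V/s  (sqrt2*pi*fA = {math.sqrt(2) * math.pi * f * A:.4e})")
```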
We model the ADC’s input noise, quantization error, and INL. We also
model the effect of sample jitter on the digitized signal by multiplying the
jitter by the slew rate of the signal in the neighborhood of the sample instant.
Note that we only model aperture delay jitter, not the mean aperture delay
value, because constant aperture delay can be calibrated out in the full system
along with the average delays of the amplifiers, coaxial cable, and processor
activity signaling mechanism.
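\begin{align}
V_{in,\mathrm{ADC}} &= V_{out,ab} + \mathrm{InputNoise}_{\mathrm{ADC}} \tag{A.37} \\
V_{\mathrm{LSB}} &= \frac{V_{\mathrm{InputRange},\mathrm{ADC}}}{2^{N_{\mathrm{bits},\mathrm{ADC}}}} \tag{A.38} \\
V_{\mathrm{QuantizationError},\mathrm{ADC}} &= N_{\mathrm{LSBQuantizationError}} \times V_{\mathrm{LSB}} \tag{A.39} \\
V_{\mathrm{INL},\mathrm{ADC}} &= N_{\mathrm{LSBINL}} \times V_{\mathrm{LSB}} \tag{A.40} \\
V_{\mathrm{quantized},\mathrm{ADC}} &= V_{in,\mathrm{ADCWithJitter}} + V_{\mathrm{INL},\mathrm{ADC}} + V_{\mathrm{QuantizationError},\mathrm{ADC}} \tag{A.41}
\end{align}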
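Equations A.38 through A.41 express INL and quantization error in units of LSBs and add them to the jittered input. A minimal sketch, with hypothetical names and an illustrative 2 V, 12-bit converter (the input range of the actual ADC is not given here):

```python
def adc_quantized_voltage(v_in_with_jitter, v_input_range, n_bits,
                          n_lsb_inl=0.0, n_lsb_quantization_error=0.0):
    """Evaluate Equations A.38 through A.41; returns (V_LSB, V_quantized)."""
    v_lsb = v_input_range / 2 ** n_bits                           # (A.38)
    v_quantization_error = n_lsb_quantization_error * v_lsb       # (A.39)
    v_inl = n_lsb_inl * v_lsb                                     # (A.40)
    v_quantized = v_in_with_jitter + v_inl + v_quantization_error  # (A.41)
    return v_lsb, v_quantized


# Illustrative values: 2 V input range, 12 bits, 1 LSB of INL, and the
# worst-case 1/2 LSB of quantization error.
v_lsb, v_q = adc_quantized_voltage(1.0, 2.0, 12, n_lsb_inl=1.0,
                                   n_lsb_quantization_error=0.5)
print(f"V_LSB = {v_lsb * 1e3:.4f} mV, V_quantized = {v_q:.6f} V")
```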
REFERENCES
[1] “Fundamentals of RF and Microwave Power Measurements,” Agilent
Technologies, Tech. Rep. 64-1B, 2000.
[2] W. Koon, “Current sensing for energy metering,” in Proceedings of IIC-
China, 2002, pp. 321–324.
[3] S. Svensson, “Power measurement techniques for non-sinusoidal condi-
tions,” Ph.D. dissertation, Chalmers University of Technology, 1999.
[4] N. Chang, K. Kim, and H. G. Lee, “Cycle-accurate energy consumption
measurement and analysis: Case study of ARM7TDMI,” in Low Power
Electronics and Design, 2000. ISLPED ’00. Proceedings of the 2000 In-
ternational Symposium on, 2000, pp. 185–190.
[5] H. G. Lee, K. Lee, Y. Choi, and N. Chang, “Cycle-accurate energy
measurement and characterization of FPGAs,” Analog Integr. Circuits
Signal Process., vol. 42, no. 3, pp. 239–251, Mar. 2005. [Online].
Available: http://dx.doi.org/10.1007/s10470-005-6758-5
[6] X. Zhou, X. Peng, and F. Lee, “A high power density, high efficiency and
fast transient voltage regulator module with a novel current sensing and
current sharing technique,” in Applied Power Electronics Conference
and Exposition, 1999. APEC ’99. Fourteenth Annual, vol. 1, Mar 1999,
pp. 289–294 vol.1.
[7] D. Goder, “System to protect switch mode DC/DC converters against
overload current,” Patent US 6 127 814 A, 10 03, 2000. [Online].
Available: http://www.google.com/patents/US6127814
[8] S. Ziegler, H.-C. Iu, R. Woodward, and L. Borle, “Theoretical and prac-
tical analysis of a current sensing principle that exploits the resistance
of the copper trace,” in Power Electronics Specialists Conference, 2008.
PESC 2008. IEEE, June 2008, pp. 4790–4796.
[9] A. Ikriannikov and O. Djekic, “Investigation of DCR Current Sensing in
Multiphase Voltage Regulators,” in IBM Power and Cooling Technology
Symposium 2007, 2007.
[10] “SPICE Model – 0603PS,” Coilcraft, Inc., Tech. Rep. 267-1, August
2007.
[11] P4400 Kill A Watt Operation Manual, P3 International.
[12] Watts Up? Power Analyzer, Watt Meter, and Electricity Monitor, Elec-
tronic Educational Devices, February 2008.
[13] D. Bedard, M. Lim, R. Fowler, and A. Porterfield, “Powermon: Fine-
grained and integrated power monitoring for commodity computer sys-
tems,” in IEEE SoutheastCon 2010 (SoutheastCon), Proceedings of the,
2010, pp. 479–484.
[14] A. Weissel and F. Bellosa, “Process cruise control: event-driven clock
scaling for dynamic power management,” in Proceedings of the 2002
international conference on Compilers, architecture, and synthesis for
embedded systems, ser. CASES ’02. New York, NY, USA: ACM,
2002. [Online]. Available: http://doi.acm.org/10.1145/581630.581668
pp. 238–246.
[15] K. Rajamani, H. Hanson, J. Rubio, S. Ghiasi, and F. Rawson,
“Application-aware power management,” in Workload Characterization,
2006 IEEE International Symposium on, 2006, pp. 39–48.
[16] Y. Zhu and V. J. Reddi, “High-performance and energy-efficient
mobile web browsing on big/little systems,” in Proceedings of the
2013 IEEE 19th International Symposium on High Performance
Computer Architecture (HPCA), ser. HPCA ’13. Washington,
DC, USA: IEEE Computer Society, 2013. [Online]. Available:
http://dx.doi.org/10.1109/HPCA.2013.6522303 pp. 13–24.
[17] A. Carroll and G. Heiser, “An analysis of power consumption in a smart-
phone,” in Proceedings of the 2010 USENIX conference on USENIX an-
nual technical conference, ser. USENIXATC’10. Berkeley, CA, USA:
USENIX Association, 2010, pp. 21–21.
[18] G. Contreras and M. Martonosi, “Power prediction for Intel XScale
processors using performance monitoring unit events,” in Proceedings
of the 2005 international symposium on Low power electronics and
design, ser. ISLPED ’05. New York, NY, USA: ACM, 2005. [Online].
Available: http://doi.acm.org/10.1145/1077603.1077657 pp. 221–226.
[19] M. S. S. Govindan, S. W. Keckler, and D. Burger, “End-to-end val-
idation of architectural power models,” in Proceedings of the 14th
ACM/IEEE international symposium on Low power electronics and de-
sign, ser. ISLPED ’09. New York, NY, USA: ACM, 2009, pp. 383–388.
[20] H. Esmaeilzadeh, T. Cao, Y. Xi, S. M. Blackburn, and K. S. McKin-
ley, “Looking back on the language and hardware revolutions: measured
power, performance, and scaling,” in Proceedings of the sixteenth inter-
national conference on Architectural support for programming languages
and operating systems, ser. ASPLOS XVI. New York, NY, USA: ACM,
2011, pp. 319–332.
[21] T. Do, S. Rawshdeh, and W. Shi, “pTop: A Process-level Power Profiling
Tool,” in HotPower ’09: Proceedings of the Workshop on Power Aware
Computing and Systems. New York, NY, USA: ACM, Oct 2009.
[22] F. Barany, “System Bandwidth vs. Resolution for Analog Video,” Ana-
log Devices, Tech. Rep. AN-945, 2007.
[23] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M.
Aamodt, and V. J. Reddi, “GPUWattch: Enabling energy optimizations
in GPGPUs,” in Proceedings of the 40th Annual International Symposium
on Computer Architecture, ser. ISCA ’13. New York, NY, USA: ACM,
2013, pp. 487–498. [Online]. Available:
http://doi.acm.org/10.1145/2485922.2485964
[24] W. Jung, C. Kang, C. Yoon, D. Kim, and H. Cha, “DevScope:
A nonintrusive and online power analysis tool for smartphone
hardware components,” in Proceedings of the Eighth IEEE/ACM/IFIP
International Conference on Hardware/Software Codesign and System
Synthesis, ser. CODES+ISSS ’12. New York, NY, USA: ACM, 2012,
pp. 353–362. [Online]. Available:
http://doi.acm.org/10.1145/2380445.2380502
[25] Monsoon Solutions, Inc., “Mobile device power monitor manual, version
1.14,” May 2014.
[26] “Using ARM Streamline,” ARM, Tech. Rep. DUI 0482H, May 2012.
[27] M. S. Gupta, J. L. Oatley, R. Joseph, G.-Y. Wei, and D. M.
Brooks, “Understanding voltage variations in chip multiprocessors
using a distributed power-delivery network,” in Proceedings of the
Conference on Design, Automation and Test in Europe, ser. DATE
’07. San Jose, CA, USA: EDA Consortium, 2007, pp. 624–629. [Online].
Available: http://dl.acm.org/citation.cfm?id=1266366.1266498
[28] K. Shringarpure, S. Pan, J. Kim, B. Achkir, B. Archambeault, J. Fan,
and J. Drewniak, “Innovative PDN design guidelines for practical high
layer-count PCBs,” in Proceedings of DesignCon 2013, ser. DesignCon
’13, 2013.
[29] M. Jones, “Simulating power planes with LTSpice IV,” April 2013.
[Online]. Available: http://www.linear.com/solutions/1810
[30] K. R. Carver and J. Mink, “Microstrip antenna technology,” Antennas
and Propagation, IEEE Transactions on, vol. 29, no. 1, pp. 2–24, Jan
1981.
[31] “Printed circuit board (PCB) power delivery network (PDN) design
methodology,” Altera Corporation, Tech. Rep. AN 574, May 2009.
[32] T. E. Carlson, W. Heirman, and L. Eeckhout, “Sniper: Exploring
the level of abstraction for scalable and accurate parallel multi-core
simulation,” in Proceedings of 2011 International Conference for
High Performance Computing, Networking, Storage and Analysis, ser.
SC ’11. New York, NY, USA: ACM, 2011, pp. 52:1–52:12. [Online].
Available: http://doi.acm.org/10.1145/2063384.2063454
[33] H. Patil and T. E. Carlson, “Pinballs: Portable and shareable user-level
checkpoints for reproducible analysis and simulation,” in 2014 Workshop
on Reproducible Research Methodologies (REPRODUCE 2014), Febru-
ary 2014.
[34] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and
B. Calder, “Using SimPoint for accurate and efficient simulation,”
in Proceedings of the 2003 ACM SIGMETRICS International
Conference on Measurement and Modeling of Computer Systems, ser.
SIGMETRICS ’03. New York, NY, USA: ACM, 2003, pp. 318–319.
[Online]. Available: http://doi.acm.org/10.1145/781027.781076
[35] S. Li, J. Ahn, J. B. Brockman, and N. P. Jouppi, “McPAT 1.0: An
integrated power, area, and timing modeling framework for multicore
architectures,” HP Labs, Tech. Rep., 2009.
[36] Analog Modules Inc., “The quest for accurate current sensing,”
http://www.omnipulsetechnology.com/products/pdfs/The%20Quest%20for%20Accurate%20Current%20Sensing.pdf,
Tech. Rep., January 2010.
[37] F. Zandman and J. Szwarc, “Non-Linearity of Resistance/Temperature
Characteristic: Its Influence on Performance of Precision Resistors,”
Vishay Precision Group, Tech. Rep. 108, February 2013.
[38] K. Blake, “Op Amp Precision Design: PCB Layout Techniques,” Mi-
crochip Technology Inc., Tech. Rep. AN1258, 2009.
[39] E. Balestrieri, P. Daponte, and S. Rapuano, “A state of the art on
ADC error compensation methods,” Instrumentation and Measurement,
IEEE Transactions on, vol. 54, no. 4, pp. 1388–1394, Aug 2005.
[40] M. Mishali and Y. Eldar, “Blind multiband signal reconstruction: Com-
pressed sensing for analog signals,” Signal Processing, IEEE Transac-
tions on, vol. 57, no. 3, pp. 993–1009, March 2009.
[41] G. P. Lepage, “A New Algorithm for Adaptive Multidimensional Inte-
gration,” Journal of Computational Physics, vol. 27, p. 192, May 1978.
[42] F. Papenfuss, Y. Artyukh, E. Boole, and D. Timmermann, “Nonuniform
sampling driver design for optimal ADC utilization,” in Circuits and
Systems, 2003. ISCAS ’03. Proceedings of the 2003 International
Symposium on, vol. 4, May 2003, pp. IV-516–IV-519.
[43] Analog Devices Inc., Op Amp Bandwidth and Bandwidth Flatness, ser.
Tutorial MT-045, October 2008.
[44] “Noise analysis in operational amplifier circuits,” Texas Instruments,
Tech. Rep. SLVA043B, 2007.
[45] L. Schuchman, “Dither signals and their effect on quantization noise,”
Communication Technology, IEEE Transactions on, vol. 12, no. 4, pp.
162–165, December 1964.
[46] L. Melkonian, “Improving A/D Converter Performance using Dither,”
National Semiconductor Corporation, Tech. Rep. 804, February 1992.
[47] T. Laakso, V. Valimaki, M. Karjalainen, and U. Laine, “Splitting the
unit delay [FIR/all pass filters design],” Signal Processing Magazine,
IEEE, vol. 13, no. 1, pp. 30–60, Jan 1996.
[48] Z. Nakutis, “A current consumption measurement approach for
FPGA-based embedded systems,” Instrumentation and Measurement, IEEE
Transactions on, vol. 62, no. 5, pp. 1130–1137, May 2013.
[49] W. C. Losinger, “A review of the GUM workbench,” The American
Statistician, vol. 58, no. 2, pp. 165–167, 2004.
[50] Part Number Data Sheet, Ecliptek Corporation, June 2012, revision V.
[51] Mixed-Signal Front-End (MxFE™) for Broadband Communications,
Analog Devices, 2002, revision 0.
[52] CMOS 7x5x2.5mm SMD, ‘V’ Group, Euroquartz Limited.
[53] “Clock (CLK) Jitter and Phase Noise Conversion,” Maxim Integrated,
Tech. Rep. APP 3359, December 2004.
[54] “Aperture Jitter Calculator for ADCs,” Maxim Integrated, Tech. Rep.
APP 4466, September 2009.
