Washington University in St. Louis

Washington University Open Scholarship
McKelvey School of Engineering Theses &
Dissertations

McKelvey School of Engineering

Spring 5-15-2021

Efficient and Scalable Computing for Resource-Constrained
Cyber-Physical Systems: A Layered Approach
An Zou
Washington University in St. Louis

Part of the Electrical and Electronics Commons

Recommended Citation
Zou, An, "Efficient and Scalable Computing for Resource-Constrained Cyber-Physical Systems: A Layered
Approach" (2021). McKelvey School of Engineering Theses & Dissertations. 640.
https://openscholarship.wustl.edu/eng_etds/640

This Dissertation is brought to you for free and open access by the McKelvey School of Engineering at Washington
University Open Scholarship. It has been accepted for inclusion in McKelvey School of Engineering Theses &
Dissertations by an authorized administrator of Washington University Open Scholarship. For more information,
please contact digital@wumail.wustl.edu.

WASHINGTON UNIVERSITY IN ST. LOUIS
School of Engineering & Applied Science
Department of Electrical and Systems Engineering

Dissertation Examination Committee:
Xuan Zhang, Chair
Shantanu Chakrabartty
Christopher D. Gill
Jing Li
Chuan Wang

Efficient and Scalable Computing for Resource-Constrained Cyber-Physical Systems: A
Layered Approach
by
An Zou

A dissertation presented to
The Graduate School
of Washington University in
partial fulfillment of the
requirements for the degree
of Doctor of Philosophy

May 2021
St. Louis, Missouri

© 2021, An Zou

Contents

List of Figures
List of Tables
Acknowledgments
Abstract

1 Introduction
    1.1 Computing in Cyber-Physical Systems
    1.2 Power-efficient Computing
    1.3 Performance-efficient Computing
    1.4 Scalability and Layered Solutions
    1.5 Dissertation Contributions

2 Circuit Layer: Early-Stage Modeling and Evaluation of IVR-assisted Processor Power Delivery System
    2.1 Introduction
    2.2 Background and Related Work
        2.2.1 Conventional Power Delivery System and Efficiency
        2.2.2 Integrated Voltage Regulator
        2.2.3 IVR-enabled Power Delivery System and Efficiency
        2.2.4 Related Work
    2.3 Modeling Methodology
        2.3.1 System-Level Modeling Framework
        2.3.2 Power/Area/Ripple Static Module
        2.3.3 Dynamic Response Module
    2.4 Model Validation
    2.5 Case Study I: Many-core GPU PDS
        2.5.1 System Configuration
        2.5.2 IVR Design Space Exploration
        2.5.3 Power Delivery System Dynamic Behaviors
        2.5.4 Putting It Together: Power Efficiency Analysis
    2.6 Case Study II: PDS with Fast Per-Core DVFS
        2.6.1 System Configuration
        2.6.2 IVR Support for Fast DVFS
        2.6.3 Power Delivery System and Architecture Co-Design
    2.7 Conclusions

3 Circuit and Architecture Layers: Voltage-Stacked Power Delivery Systems: Reliability, Efficiency, and Power Management
    3.1 Introduction
    3.2 Background and Related Work
        3.2.1 Power Delivery System
        3.2.2 Voltage Stacking
        3.2.3 Supply Voltage Noise
        3.2.4 Power Delivery Efficiency
        3.2.5 Related Work
    3.3 System Configuration
        3.3.1 Power Grid Routing and PDN Modeling of VS
        3.3.2 Communication Across Layers
    3.4 Supply Voltage Noise Analysis
        3.4.1 Supply Voltage Noise Characterization
        3.4.2 Dominating Supply Voltage Noise
        3.4.3 Worst-Case Supply Voltage Noise
    3.5 Noise Mitigation by Hybrid Regulation
        3.5.1 Hybrid Regulation Framework
        3.5.2 Centralized and Distributed Integrated Voltage Regulator
        3.5.3 Off-Chip Charge-Recycling VR
        3.5.4 Charge-Recycling VR Power Loss
        3.5.5 Hybrid Regulated VS Power Delivery Efficiency
    3.6 Architectural Support for VS
        3.6.1 Control Theoretic Formulation
        3.6.2 Control Stability and Performance
        3.6.3 Voltage Smoothing Actuation
        3.6.4 Implementation Considerations
    3.7 Advanced Power Management
        3.7.1 Dynamic Voltage and Frequency Scaling
        3.7.2 Power Gating
        3.7.3 Power Management Hypervisor in Voltage Stacking
    3.8 Evaluation of Hybrid Regulation
        3.8.1 Supply Voltage Noise Evaluation
        3.8.2 Efficiency in Real Applications
        3.8.3 Compatibility with Advanced Power Management
        3.8.4 Comparison with Other Power Delivery Systems
    3.9 Evaluation of Architecture Support
        3.9.1 System-level Efficiency
        3.9.2 Supply Reliability
        3.9.3 Performance Tradeoffs
        3.9.4 Collaborative Power Management
    3.10 Conclusion

4 Architecture and Operating System Layers: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks with Fine-Grain Utilization
    4.1 Introduction
    4.2 Background and Related Work
        4.2.1 Background on GPU Systems
        4.2.2 Background on Multi-Segment Self-Suspension
        4.2.3 Related Work
    4.3 CPU and Memory Model
        4.3.1 CPU Modelling
        4.3.2 Memory Modeling
    4.4 GPU Parallel Kernel Execution Model
        4.4.1 Kernel-granularity and SM-granularity Scheduling
        4.4.2 Kernel Execution Model
        4.4.3 Interleaved Execution and Virtual SM
        4.4.4 Workload Pinning and Self-Interleaving
    4.5 Practical RT-GPU Tasks Scheduling
        4.5.1 Task Model
        4.5.2 Federated Scheduling for GPU Segments
        4.5.3 Fixed-Priority Scheduling for memory copy Segments with Self-Suspension and Blocking
        4.5.4 Fixed-Priority Scheduling for CPU Segments
        4.5.5 RT-GPU Scheduling Algorithm and Analysis
    4.6 Full-System Evaluation
        4.6.1 Experiment Setup
        4.6.2 Schedulability Analysis
        4.6.3 GPU Experiment
    4.7 Conclusion

5 Circuit, Architecture, and Operating System Layers: Fast Learning-based Energy Management for Multi-/Many-core Processors
    5.1 Introduction
    5.2 Background and Related Work
        5.2.1 Dynamic Voltage Frequency Scaling (DVFS)
        5.2.2 Adaptive Power Management
        5.2.3 Integrated Voltage Regulators
        5.2.4 Related Work
    5.3 Methodology
        5.3.1 Power Delivery System for Fast DVFS
        5.3.2 Hierarchical Power Management Framework
        5.3.3 Global Controller
        5.3.4 Learning Controller
        5.3.5 Swift Controller
    5.4 Quantitative Study of Internal Metrics with Synthetic Benchmarks
    5.5 Online Learning and System Implementation
    5.6 Evaluation Results
        5.6.1 System Setup
        5.6.2 Hierarchical Fast Learning Approach
        5.6.3 Hierarchical Layered Approach with Ablation Study
        5.6.4 Workload Transition and Scalability
    5.7 Conclusion

6 Conclusion

References

List of Figures

1.1 The cyber-physical systems and their computing systems.
1.2 Computing impacts on self-driving cars.

2.1 Overview of the power delivery subsystem (PDS) in modern microprocessors with distributed integrated voltage regulators (IVRs).
2.2 Block diagram of the IVR and IVR-enabled PDS system-level modeling framework.
2.3 Three types of converter topologies.
2.4 Hierarchical power delivery system with integrated voltage regulator (IVR) dynamic models.
2.5 Interleaved (multi-phase) buck converter.
2.6 Periodical linear time-varying (PLTV) systems.
2.7 Efficiency validation for SC converters.
2.8 Efficiency validation for buck converters.
2.9 Interleaved (multi-phase) buck converter dynamic responses in time domains.
2.10 Interleaved (multi-phase) buck converter frequency responses in frequency domains.
2.11 Voltage noise across benchmarks and VR config.
2.12 Voltage noise waveforms (CFD) with VR config.
2.13 Supply voltage noise effective impedance.
2.14 IVR efficiency trade-off with area.
2.15 Power delivery system optimization.
2.16 Hierarchical power delivery system for SoC systems.
2.17 CPU and GPU power activity frequency analysis in executing benchmark blackscholes and backp.
2.18 Inductor and capacitor sizes for different voltage scaling speeds.
2.19 CPU Energy benefit from different speed fast DVFS.
2.20 GPU Energy benefit from different speed fast DVFS.

3.1 Conventional single-layer and voltage-stacked multi-layer power delivery system. (PCB board voltage: 4V; each core requires 1V voltage and 1A current)
3.2 Illustration of on-chip power routing for conventional and voltage-stacked power delivery configurations.
3.3 Power delivery network (PDN) of a 2x4 voltage-stacked many-core processor.
3.4 Supply voltage noise decomposition.
3.5 Illustrative example for noise decomposition using 2 × 3 voltage stacking network: (a) simplified 2 × 3 network; (b) equivalent network for I_G; (c) voltage response with I_G; (d) equivalent impedance for I_G; (e) equivalent network for I_ST; (f) voltage response with I_ST; (g) equivalent impedance for I_ST; (h) equivalent network for I_R.
3.6 Effective impedance of current components.
3.7 An example instruction trace contributing to worst-case supply noise.
3.8 Histogram comparison between analytically derived worst case and other heuristic core activation patterns.
3.9 Hybrid voltage regulation based on distributed on-chip CR-IVRs and off-chip CR-VRM.
3.10 Voltage distribution among the 16 SMs.
3.11 Supply voltage noise distribution.
3.12 Effective impedance after employing CR-VRM.
3.13 Simplified circuit of (a) the 4 × 4 VS GPU, (b) a single VS stack.
3.14 Timescales of different power actuation mechanisms.
3.15 SM microarchitecture and operation of dynamic issue width scaling.
3.16 Implementation of the proposed cross-layer VS GPU solution with architectural support for voltage smoothing and VS-aware PM hypervisor.
3.17 Evaluation of the supply voltage noise in hybrid regulated voltage stacking system.
3.18 Power delivery efficiency comparison between voltage-stacked system with SC/LDO hybrid regulation and conventional single-layer system across ten benchmarks.
3.19 DVFS power saving comparison between conventional single-layer system and voltage-stacked system with hybrid regulation across benchmarks.
3.20 Normalized system energy consumption under DVFS.
3.21 Power delivery efficiency under PDE guided power gating and original power gating on voltage stacking.
3.22 Power delivery efficiency and power breakdown across benchmarks and power delivery subsystems configurations.
3.23 Transient voltage waveforms under worst imbalance scenarios.
3.24 Worst supply noise in response to worst imbalance as a function of CR-IVR area and control latency.
3.25 Noise distribution across benchmarks and the worst-case imbalance.
3.26 Performance penalty varies with controller voltage threshold.
3.27 Energy saving and performance penalty tradeoff space.
3.28 Performance penalty and energy saving across benchmarks.
3.29 Applying DFS on conventional and proposed voltage-stacked GPU.
3.30 Applying PG on conventional and proposed voltage-stacked GPU.
3.31 Distribution of imbalanced currents by their normalized magnitudes when no power management (No PM), DFS with different performance goals, and power gating are applied in a VS GPU.

4.1 RTGPU framework.
4.2 Typical GPU task execution pattern.
4.3 Comparison of three different GPU application scheduling approaches.
4.4 Kernel execution time trends.
4.5 Virtual SM model for interleaved execution.
4.6 Characterization of the latency extension ratios of interleaved execution.
4.7 GPU tasks real-time scheduling model.
4.8 Schedulability under different computation (CPU) and suspension (memory + GPU) lengths.
4.9 Schedulability under different numbers of subtasks.
4.10 Schedulability under different numbers of tasks.
4.11 Schedulability under different numbers of SMs.
4.12 CPU to GPU memory copy time distribution.
4.13 GPU kernel execution time distribution.
4.14 Schedulability under different numbers of SMs with schedulability analysis and Real GPU experiments (with worst case execution time model).
4.15 Schedulability under different numbers of SMs with schedulability analysis and Real GPU experiments (with average execution time model).
4.16 RTGPU Throughput improvements.

5.1 Microsecond-Level hierarchical fast power management (DVFS) for multi-core and many-core processors.
5.2 The integrated voltage regulator based power delivery system.
5.3 Workload power and throughput traces in many-core processors.
5.4 Normalized energy consumption of throughput (IPC) guided DVFS at different microsecond timescales.
5.5 Reinforcement learning and swift controllers.
5.6 Quantitative study of the DVFS on energy saving.
5.7 Quantitative study of the DVFS on performance loss.
5.8 Learning progress under different reward functions.
5.9 Normalized energy consumption of F-LEMMA.
5.10 Normalized performance of F-LEMMA.
5.11 Energy delay product of F-LEMMA.
5.12 Normalized energy consumption of F-LEMMA.
5.13 Normalized performance of F-LEMMA.
5.14 Learning under Workload Transitions.
5.15 Normalized energy consumption of F-LEMMA DVFS with the swift controller at different microsecond timescales.
5.16 Normalized performance of F-LEMMA DVFS with the swift controller at different microsecond timescales.
5.17 Normalized energy of F-LEMMA on multi-core and many-core processors.
5.18 Normalized performance of F-LEMMA on multi-core and many-core processors.

List of Tables

2.1 Summary of Ivory input parameters.
2.2 Summary of design space exploration.
2.3 CPU GPU many-core system.
2.4 CPU core DVFS frequency and voltage pairs.
2.5 GPU core DVFS frequency and voltage pairs.
2.6 Summary of design space explorations of 16-phase buck IVRs.

3.1 GPU voltage-stacked system configuration
3.2 Freq. Distribution of decomposed core current
3.3 Switched Cap. Regulator Parameters
3.4 Voltage Detector Options
3.5 LDO Regulator Parameters
3.6 SM Core DVFS Frequency and Voltage Pairs
3.7 Power delivery system comparison
3.8 Comparison of Different Power Delivery Subsystems (PDS)

4.1 Parameters for the taskset generation

5.1 Summary of design space explorations of 16-phase buck IVRs.
5.2 RL terminology. RL's goal is to find an optimal policy π(a|s)*.
5.3 Action space of the actor neural network.
5.4 Pearson correlation coefficients between input features and output weights
5.5 Architecture parameters and hyperparameters for the hierarchical controller.

Acknowledgments

During my Ph.D. study at Washington University in St. Louis, I have received enormous
help and support in study, research, and living from many people, and this thesis would not
have been possible without all of them.
First and foremost, I would like to express my deepest gratitude to my advisor, Prof. Xuan
Zhang, for her guidance, advice, and support throughout my Ph.D. study. She has not only
guided me in conducting high-quality research but also helped me develop critical thinking habits
and relevant soft skills such as writing, presentation, and communication. She has always
believed in me, helped me through difficulties, and encouraged me to pursue higher
goals. For the invaluable knowledge, research methodology, and attitude she has passed on, as well as
her full support and kind advice on my career, she has my deepest respect and
thanks.
I would like to thank my advisory committee members and collaborators, Prof. Christopher
D. Gill, Prof. Jing Li, Prof. Shantanu Chakrabartty, Prof. Chuan Wang, Prof. Vijay Janapa
Reddi, Prof. Jingwen Leng, Prof. Sanjoy Baruah, Prof. Kunal Agrawal, Prof. Zhishan Guo,
Prof. Jinghao Sun, and Prof. Benjamin C. Lee. I thank them for their supportive and excellent
collaboration, which improved and expedited my research.
I was fortunate to have worked and interacted closely with many postdocs and students,
Dr. Xin He, Dr. Wei Yan, Weidong Cao, Huifeng Zhu, Adith Jagadish Boloor, Karthik
Garimella, Chenfeng Zhao, Liu Ke, and Tianrui Ma, and all others who have graduated and
may not be listed here. I thank them for their support and companionship.
I am also thankful for the support from SRC task 2810.003 and NSF CCF 1646579. Meanwhile, I would like to express my gratitude to James Ballard at the Communication Center,
Francesca Allhoff, other staff and faculty members at the Department of Electrical and
Systems Engineering, and the staff of the Graduate School at Washington University in St. Louis. I
have greatly enjoyed my graduate study here and received all kinds of help from the friendly
staff and faculty members.
I would like to thank all my friends, whom I cannot list here, for all the time we spent
together. It is you who made my study a truly pleasant and enjoyable experience.
Thank you for making my life so colorful and for the many sweet memories we share.
I would like to thank my mother, Chunjuan Wang, and my father, Runmin Zou, for their
unconditional and endless love and support. Special thanks to my wife, Yehan Ma, my
daughter, Luyi Zou, and my parents-in-law, for their selfless love, company, and support
all the time, which make me a better person in my life. None of this could have happened
without them.

An Zou
Washington University in St. Louis
May 2021

Dedicated to my family.

ABSTRACT OF THE DISSERTATION

Efficient and Scalable Computing for Resource-Constrained Cyber-Physical Systems: A
Layered Approach
by
An Zou
Doctor of Philosophy in Electrical Engineering
Washington University in St. Louis, May 2021
Research Advisor: Professor Xuan Zhang

With the evolution of computing and communication technology, cyber-physical systems
such as self-driving cars, unmanned aerial vehicles, and mobile cognitive robots are achieving increasing levels of multifunctionality and miniaturization, enabling them to execute
versatile tasks in a resource-constrained environment. Therefore, the computing systems
that power these resource-constrained cyber-physical systems (RCCPSs) have to achieve
high efficiency and scalability. First of all, given a fixed amount of onboard energy, these
computing systems should not only be power-efficient but also exhibit sufficiently high performance to gracefully handle complex algorithms for learning-based perception and AI-driven
decision-making. Meanwhile, scalability requires that the current computing system and its
components can be extended both horizontally, with more resources, and vertically, with
emerging advanced technology. To achieve efficient and scalable computing systems in RCCPSs, my research broadly investigates a set of techniques and solutions via a bottom-up
layered approach. This layered approach leverages the characteristics of each system layer
(e.g., the circuit, architecture, and operating system layers) and their interactions to discover
and explore the optimal system tradeoffs among performance, efficiency, and scalability. At
the circuit layer, we investigate the benefits of novel power delivery and management schemes
enabled by integrated voltage regulators (IVRs). Then, between the circuit and microarchitecture/architecture layers, we present a voltage-stacked power delivery system that offers
best-in-class power delivery efficiency for many-core systems. After this, using Graphics
Processing Units (GPUs) as a case study, we develop a real-time resource scheduling framework at the architecture and operating system layers for heterogeneous computing platforms
with guaranteed task deadlines. Finally, fast dynamic voltage and frequency scaling (DVFS)
based power management across the circuit, architecture, and operating system layers is
studied through a learning-based hierarchical power management strategy for multi-/many-core systems.

Chapter 1
Introduction

1.1 Computing in Cyber-Physical Systems

A cyber-physical system integrates computing with physical processes. Computing platforms
and networks monitor and control the physical processes, usually through feedback loops in which
computing affects the physical processes and vice versa. As an intellectual challenge, a cyber-physical
system involves the intersection and cooperation, not just the union, of computing and
physical processes [1]. Examples of cyber-physical systems include smart grids,
autonomous automobile systems, medical monitoring, industrial control systems, robotics
systems, and autonomously piloted vehicles [2]. If we liken cyber-physical systems to human
beings, computing platforms act as their brains, fundamentally determining the performance of the cyber-physical systems. The computing platforms in modern cyber-physical
systems range from small microcontroller units, usually integrated as one of the components
of a system on a chip (SoC), up to large server computers.
New technologies such as artificial intelligence (AI) and machine learning (ML) are increasingly deployed, and people's lives are being improved by complex technologies such as
the internet of things (IoT), smart cities and homes, healthcare intelligence, self-driving
cars, mobile cognitive robots, and wearable devices. As shown in Fig. 1.1, these developments synergistically benefit from increasing levels of multifunctionality and miniaturization
as they execute versatile tasks in a resource-constrained environment. Computing platforms
that intelligently control these resource-constrained cyber-physical systems have to achieve
high power and performance efficiency. Many high-tech companies are seeking to provide
high-performance, low-power computing platforms for cyber-physical systems. For example,
Apple announced the M1, a powerful processor [3] that provides up to 2x longer battery
life, allowing mobile devices to use smaller and lighter batteries. Tesla built a computing
platform called Autopilot Hardware [4] that supports full self-driving from the ground up,
taking into account every small architectural and micro-architectural improvement while
pushing hard to squeeze maximum silicon performance-per-watt. Besides such efforts from
industry, many researchers [5] have proposed (ultra-)high performance or (ultra-)low energy
consumption requirements for embedded computing in modern cyber-physical system
applications, especially in resource-constrained environments [6–9].

Figure 1.1: The cyber-physical systems and their computing systems.

1.2 Power-efficient Computing

Power efficiency in computing not only directly impacts the operation time of cyber-physical
systems but also imposes a ceiling on their computing capacity, which further limits performance.
Most cyber-physical systems, from micro/nano robots to wireless sensors in industrial plants,
are powered by a battery with a limited density and capacity. Some cyber-physical systems,
such as wireless sensor nodes [10–12], can require years of battery life. For example, the
Emerson Smart Wireless Solution [13] for web-based monitoring has an estimated battery
life of up to six years with a solar power recharging option. For cyber-physical systems
chasing high performance, the energy consumed by computing is also increasing significantly.
For example, AI and ML are being widely adopted in modern cyber-physical systems with
startling results. The computing power required for key AI benchmarks has doubled roughly
every 3.4 months, increasing by 300,000 times between 2012 and 2018 [14]. However, the
battery energy density has improved by only 5-8% per year [15, 16].
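
As a rough consistency check on these growth rates, the short sketch below converts the cited 300,000-fold increase into an equivalent number of 3.4-month doublings and compounds the cited 5-8% annual battery improvement over the same span. The variable names are illustrative, and the calculation assumes idealized exponential growth; it is not part of the cited studies.

```python
import math

# Figures quoted in the text: compute demand doubling period and total growth.
DOUBLING_MONTHS = 3.4
COMPUTE_GROWTH = 300_000

# Number of doublings needed for a 300,000x increase, and the time they take.
doublings = math.log2(COMPUTE_GROWTH)
years = doublings * DOUBLING_MONTHS / 12

# Battery energy density compounded at 5% and 8% per year over the same span
# (idealized assumption: constant annual improvement).
battery_low = 1.05 ** years
battery_high = 1.08 ** years

print(f"{doublings:.1f} doublings over about {years:.1f} years")
print(f"battery density gain over that span: {battery_low:.2f}x to {battery_high:.2f}x")
```

Under these assumptions, compute demand grows by more than five orders of magnitude in roughly five years, while battery energy density grows by well under 2x over the same period.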
Power efficiency also limits computing performance. Since the last decade, with the failure of
Dennard scaling [17], processors have been gradually entering an era of dark silicon [18]. In
dark silicon, the thermal design power (TDP) constraint means that not all the computing
resources in a multi-core or many-core processor can be energized at the same time. As
discussed in the Landscape of the New Dark Silicon Design [19], transistor density continues

to double every two years, and native transistor speeds improve by 1.4×. But transistor
energy efficiency also improves by only 1.4×, which, under constant power budgets, causes a
2× shortfall in the energy budget for powering a chip at its native frequency. Because power
and energy deficits impede computing performance by degrees, improving power and energy
efficiency pays obvious dividends.
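
The 2× shortfall can be reproduced with a back-of-the-envelope calculation. The sketch below assumes, as a simplification, that full-chip power scales with transistor count times switching frequency times energy per switching event; the symbols $N$, $f$, and $E$ are introduced here only for illustration.

```latex
\[
\frac{P_{\text{new}}}{P_{\text{old}}}
  = \underbrace{\frac{N_{\text{new}}}{N_{\text{old}}}}_{2}
    \cdot \underbrace{\frac{f_{\text{new}}}{f_{\text{old}}}}_{1.4}
    \cdot \underbrace{\frac{E_{\text{new}}}{E_{\text{old}}}}_{1/1.4}
  = 2
\]
```

Under this simplification and a fixed power budget, only about half of the chip can run at its native frequency in each new generation.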

1.3 Performance-efficient Computing

In the past decades, we have witnessed the impact of rapidly increasing computing capability on cyber-physical systems in almost every field, especially with the emergence of powerful algorithms such as AI and ML.
Boroujerdian et al. [20, 21] comprehensively evaluated the role of computing in autonomous
and mobile cyber-physical systems. Based on optimization case studies of co-designs of computing platforms and cyber-physical systems, they found that cyber-physical systems can
achieve up to 2× faster mission times and use 1.8× less mission energy. Ma et al. [22] used
the improved computing performance in edge computing platforms to improve the performance of industrial cyber-physical systems. Besides its direct impacts on the whole cyber-physical system, computing performance also impacts the modules inside. For example,
computer vision and natural language processing are two of the most widely used modules
in cyber-physical systems that interact with the surrounding environment and human beings. Agrawal [23] quantified the relationship between recognition completion performance
(recognition accuracy) and the computing size of convolutional neural network (CNN) based
object recognition. On all tested objects, up until resource saturation, the completion performance significantly increased with computing size. Similarly, Sharir [24] and Strubell [25]
Figure 1.2: Computing impacts on self-driving cars. Each panel plots the performance score against computing ability (embedded, desktop, server): (a) simple environments; (b) complex environments.
quantified the approximate computing performance and energy costs of training a variety of
recent successful neural network models used for natural language processing.
To demonstrate how vitally computing affects an entire autonomous cyber-physical system,
we tested the impacts of computing ability on the performance of self-driving cars. We used
CARLA [26], an open-source driving simulator, to model the environments and the self-driving cars, and used self-driving algorithms from the open-source CARLA Autonomous
Driving Challenge [27]. We evaluated self-driving performance under different computing
capabilities in different street environments. The scores of the self-driving cars came from
multiple metrics, including the driving score, route completion percentage, and an infraction
penalty. The self-driving algorithms were executed with increasing amounts of computing
ability, ranging from embedded systems [28] to server systems [29], evenly divided into nine
levels. Fig. 1.2 shows performance scores for the self-driving algorithms’ execution with
these levels of computing ability. Clearly, greater computing ability allows the self-driving
system to perceive the environment better and to respond faster. The self-driving car with a
higher computing ability usually achieves a higher performance score, which means it drives
longer and is more stable and reliable.

1.4 Scalability and Layered Solutions

Scalability can be categorized into horizontal scalability and vertical scalability. Horizontal
scalability is the property of a system to handle a growing amount of work by adding resources
to the system [30]. Vertical scalability means the system designs are scalable and compatible
with emerging advanced technology. At this point in its history, computing is evolving from
multi-core through many-core to heterogeneous architectures. Techniques and solutions to
improve the energy efficiency and performance of computing systems should not only achieve
good savings and performance improvements, but also should be generally universal, able to
adapt to diverse computing platforms and future developments.
A complete computing platform is made up of multiple layers, from the circuit layer, through
the micro-architecture and architecture layers, and up to the kernel, operating system, and
application layers. On the one hand, each layer has unique functions, characteristics, and
response times. For example, more power- and energy-efficient techniques and solutions
are proposed on the lower layers, because these layers are close to the physical transistors
where power and energy are consumed. Performance-efficient techniques and solutions are
typically applied on high layers, because they are closer to user applications, where performance attracts more attention. On the other hand, these layers are all closely connected and
interact with each other to finish computing tasks. Lower layers provide implementations
for higher-layer functions, and higher layers abstract and manage lower-layer resources. For
example, the circuit layer directly performs digital logic operations on the physical devices
at nanosecond timescales. The architecture layer abstractly describes the functionality, organization, and implementation of computing platforms with a timescale of microseconds to
milliseconds. The operating system layer manages the resources abstracted by architectures
and allocates and schedules tasks to these resources, usually at the millisecond scale. As a

complex system, a high-performance and low-power computing platform requires close cooperation among different layers to make better use of its potential while keeping power and
energy consumption low.


1.5 Dissertation Contributions

As discussed above, computing, especially power- and performance-efficient computing, plays
an important role in resource-limited cyber-physical systems. Power delivery efficiency, computing power efficiency, and computing resource utilization are the three key parts of efficient
computing. Therefore, in this dissertation, we propose a layered co-design approach, presenting detailed and scalable techniques and solutions from the circuit layer through the
architecture layer to the operating system layer, to improve computing energy and performance efficiency for current and future cyber-physical systems.
With this bottom-up layered approach, we choose four representative cases that improve
computing energy and performance efficiency from the perspectives of power delivery, real-time resource scheduling, and power management,
targeting the multi-core and many-core CPU and GPU systems that are the most widely
used computing systems in cyber-physical systems.
This dissertation makes four main contributions:
1. First, at the circuit layer, we propose an early-stage modeling and evaluation of the integrated voltage regulator (IVR)-assisted processor power delivery system. The work demonstrates that the IVR-assisted power delivery solution can improve the efficiency of power
delivery to the processor and support microsecond scale power management.
2. Next, at the circuit and architecture layers, we present two voltage-stacked power delivery systems that are reliable, efficient, and compatible with typical power management
techniques. Both systems achieve state-of-the-art power delivery efficiency for a many-core
processor. An improved power delivery efficiency directly contributes to a longer operating
time of both the computing system and cyber-physical systems.

3. Then, at the architecture and operating system layers, to execute multiple parallel real-time applications with hard deadlines for cyber-physical systems, we use a GPU,
which is the main computing platform for self-driving cars, as a representative of heterogeneous computing systems to study the real-time scheduling of hard-deadline parallel tasks,
employing fine-grain resource utilization to boost the computing performance for autonomous cyber-physical systems.
4. Finally, across the circuit, architecture, and operating system layers, we design a hierarchical fast integrated voltage and frequency scaling technique for energy-efficient multi/many-core processors, leveraging the microsecond scale power management supported by
integrated voltage regulators. Effective power management not only improves the power
efficiency but also eases the restrictions on computing performance, allowing the computing platforms to have a longer operating time and a higher computing performance for the
cyber-physical systems.
On the one hand, each contribution either opens new opportunities at other layers or leverages co-designs across several layers. On the other hand, the four contributions work together
to form a complete, layered approach, spanning from the circuit layer, through the architecture layer, to the operating system and application layers. This layer-spanning approach
successfully improves both power and performance efficiencies of the computing systems,
which further strengthens whole cyber-physical systems.


Chapter 2
Circuit Layer: Early-Stage Modeling and Evaluation of IVR-assisted Processor Power Delivery System

First of all, the computing systems in resource-constrained cyber-physical systems require
an efficient power delivery. Despite being employed in numerous efforts to improve power
delivery efficiency for the computing systems, the integrated voltage regulator (IVR) has
yet to be evaluated in a rigorous or quantitative manner in a full power delivery system
(PDS) setting. To fulfill this need, we present a system-level modeling and design space
exploration tool, Ivory 2.0, for IVR-enabled PDSs. With a novel modeling methodology,
it can accurately estimate power delivery efficiency, static performance characteristics, and
dynamic transient responses under different load variations and external voltage/frequency
scaling conditions. We validate the model over a wide range of IVR topologies with silicon
measurement and SPICE simulation. Finally, we present two case studies in combination
with architecture-level performance and power simulators. The first case study focuses on
optimal PDS design for multi-core systems which achieves 9.5% power efficiency improvement

over conventional off-chip voltage regulator module (VRM)-based PDS. The second case
study explores the design trade-offs for IVR-enabled PDSs in CPU and GPU systems with
fast per-core dynamic voltage and frequency scaling (DVFS). We find 2 µs to be the optimal
DVFS time scale, which not only reaps the energy benefits (12.5% improvement in CPU and
50.0% improvement in GPU), but also avoids costly IVR overheads. The improved power
and power delivery efficiency allow the computing system to lower the power consumption
which is a must in resource-constrained cyber-physical system applications.

2.1 Introduction

With the decline of Dennard scaling, thermal design power and energy efficiency restrict
single thread performance [18], and designers are looking for more efficient ways to deliver
power to microprocessors. Integrated voltage regulators (IVRs) can enhance supply integrity
and enable flexible voltage scaling by moving power conversion closer to the point-of-load.
Distributed IVRs (shown in Fig. 2.1) can further provide per-core, fine-grain, and fast dynamic voltage and frequency scaling (DVFS) [31] and effective supply noise suppression [32]
at a level unattainable with traditional off-chip regulators. These benefits lead to both improved performance and efficiency. Also, IVR solutions save precious board/package area
compared to bulky off-chip regulators with large discrete passive components, making them
especially attractive for mobile SoCs [33]. As IVRs become viable solutions for power delivery in modern microprocessors, it is important to explore various design alternatives and
thoroughly evaluate their impacts on performance and efficiency at the system level.


Figure 2.1: Overview of the power delivery subsystem (PDS) in modern microprocessors
with distributed integrated voltage regulators (IVRs).
Despite the recent proliferation of IVR research, prior studies often focus on circuit-level
implementation to improve conversion efficiency [34]. Real implementation benefits in IVR-enabled power delivery subsystems remain elusive due to the lack of modeling tools and
evaluation frameworks to explore the design space and investigate the performance and
efficiency implications of IVRs in a full system setting. Given the absence of high-level user-friendly IVR models, previous studies resort to either over-simplified assumptions of IVR
efficiency [35–37] that overlook important design considerations such as dynamic response,
or a fixed IVR design covering only a fraction of the entire design space [31].
To address these shortcomings, we propose an analytical modeling framework for early-stage design space exploration that is compatible with architecture-level performance and
power simulators. Our system-level model captures the complex yet subtle design trade-offs
among different IVR topologies to evaluate the performance benefits and implementation
costs in a full power delivery subsystem setting. It abstracts away the details of low-level
IVR circuit implementation to enable architects, system engineers, and other experts at the


upper levels of the system stack to effectively explore new design spaces enabled by the IVR’s fine-grain voltage regulation capability, similar to what Cacti [38] did for memory systems and
ORION [39] did for network-on-chip designs. Our modeling framework incorporates several
advanced features that were previously lacking and makes the following key contributions:

• A fast, accurate, and validated (using both SPICE simulations and measured silicon
data) parameterized IVR static model is introduced to estimate the static characteristics such as conversion efficiency, static voltage ripple/droop, and die/board area of
multiple IVR topologies in different technology nodes or processes.
• A novel method to derive an IVR’s dynamic model as a two-port network allows direct
drop-in of IVR modules into the power delivery system. This model facilitates the
complete capture of an IVR-enabled PDS’s dynamic voltage/current waveform, noise
characteristics, and power efficiency, given power traces from real-world workloads or
voltage scaling.
• As a comprehensive design exploration tool, Ivory covers a wide spectrum of IVR
topologies and a variety of IVR metrics for hierarchical composition of multi-stage
on-chip and off-chip power delivery networks and provides compatible interfaces with
architecture simulators.

Two case studies with the system-level design exploration tool Ivory are presented:

• Case study I investigates the optimal power delivery system in a many-core GPU architecture, and reveals that a distributed IVR configuration can outperform a conventional
off-chip VRM’s output efficiency by 9.5%.


• Case study II explores the IVR-enabled hierarchical power delivery with a microsecond
level DVFS for a heterogeneous CPU-GPU system. This DVFS can achieve 12.5% and
50.0% net energy improvement for CPU and GPU respectively.

2.2 Background and Related Work

The benefits of integrated fine-grain voltage regulation [31] have driven recent advancements
in device fabrication [34, 40], circuit implementation [33, 41], and system integration of integrated voltage regulators (IVRs) [35, 37]. In this section, we review the current state of
IVR designs and implementations, especially in the context of the entire PDS of modern
processors.

2.2.1 Conventional Power Delivery System and Efficiency

The underlying physical mechanism to convert and transfer electron charges from the higher
supply voltage on the motherboard to the much lower supply voltage on the microprocessor
chip invariably causes energy loss. The energy loss in power delivery can be broken down
into three parts:
First, energy is lost in voltage conversion to step down the supply voltage [42]. We define the
conversion efficiency of a voltage regulator ($\eta_{VR}$) as the ratio of the power it delivers
at the voltage regulator output to the power it consumes at the input. $\eta_{VR}$ is usually a
function of the step-down conversion ratio $\alpha$. A high performance off-chip switching VRM
can deliver over 90% conversion efficiency, but the efficiency is degraded at a lower output
voltage with a higher step-down ratio [43].

The second part of the energy loss occurs in the power delivery networks mostly because
of heat dissipation when current runs through the parasitic resistance that exists along the
path of the power delivery network. This loss is related to the IR-drop component of the
supply voltage noise [44, 45]:

\[
\eta_{PDN} = \frac{R_{core}(V_{core})}{R_{PDN} + R_{core}(V_{core})}, \tag{2.1}
\]

where $R_{PDN}$ represents the total parasitic resistance contributed by the power delivery network, and $R_{core}$ represents the equivalent resistive impedance of the computational load as
a function of $V_{core}$. The definition of $R_{core}$ suggests that $R_{core} = V_{core}/I_{core}$. For a fixed $V_{core}$
value, $R_{core}$ is a measure of the power rating.
The third and often overlooked part is the energy overhead incurred by raising the supply
by a non-negligible voltage margin, $\Delta V = V_{core} - V_{min}$, to accommodate the supply voltage
noise and sustain fault-free operation [46, 47]. We can express this component as $\eta_{\Delta V}$:

\[
\eta_{\Delta V} = \frac{P_{core}(V_{min})}{P_{core}(V_{core})} = \frac{V_{min} I_{core}(V_{min})}{V_{core} I_{core}(V_{core})}, \tag{2.2}
\]

where $P_{core}$ and $I_{core}$ represent the power consumption and the current load of the processor
core as a function of the core supply voltage ($V_{core}$ and $V_{min}$).
Based on the above analysis, the full power delivery efficiency can be expressed as

\[
\eta_{PDS} = \frac{P_{core}(V_{min})}{P_{src}} = \eta_{VR} \cdot \eta_{PDN} \cdot \eta_{\Delta V}, \tag{2.3}
\]

where $P_{src}$ is the total power drawn from the source.
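
To make the three-part breakdown concrete, the following minimal sketch evaluates Eqs. (2.1) to (2.3) at a single operating point. The function names and all numeric values are hypothetical and chosen only to exercise the formulas; in particular, the core current at $V_{min}$ is assumed to scale linearly with voltage, which the text does not state.

```python
def pdn_efficiency(r_pdn: float, v_core: float, i_core: float) -> float:
    """Eq. (2.1): share of power reaching the core after IR loss in the PDN."""
    r_core = v_core / i_core  # equivalent load resistance, R_core = V_core / I_core
    return r_core / (r_pdn + r_core)


def margin_efficiency(v_min: float, i_min: float, v_core: float, i_core: float) -> float:
    """Eq. (2.2): power needed at V_min relative to power drawn at V_core."""
    return (v_min * i_min) / (v_core * i_core)


def pds_efficiency(eta_vr: float, eta_pdn: float, eta_dv: float) -> float:
    """Eq. (2.3): product of the three efficiency components."""
    return eta_vr * eta_pdn * eta_dv


# Hypothetical operating point (assumed values, not from the dissertation).
eta_vr = 0.90                # off-chip VRM conversion efficiency
v_core, i_core = 1.0, 10.0   # supply including margin: 1.0 V at 10 A
v_min, i_min = 0.9, 9.0      # minimum fault-free supply: 0.9 V at 9 A (assumed linear I-V)
r_pdn = 0.005                # 5 milliohm of parasitic resistance

eta_pdn = pdn_efficiency(r_pdn, v_core, i_core)
eta_dv = margin_efficiency(v_min, i_min, v_core, i_core)
eta_pds = pds_efficiency(eta_vr, eta_pdn, eta_dv)
print(f"eta_PDN={eta_pdn:.3f}, eta_dV={eta_dv:.3f}, eta_PDS={eta_pds:.3f}")
```

In this made-up example the voltage margin term is the smallest of the three factors, which illustrates why the text calls it an often overlooked source of loss.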

2.2.2 Integrated Voltage Regulator

A voltage regulator converts an input voltage to an output voltage at a different level that
serves as the supply to load circuits. Linear and switching regulators are the two main types,
and they differ most notably in their efficiency ranges. The linear regulator’s efficiency is
determined by the input/output voltage ratio, whereas the switching regulator yields higher
efficiency even with a higher conversion ratio.
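
The efficiency limit of a linear regulator follows from the fact that, ideally, it passes the load current from input to output while dropping the excess voltage across a pass device. A brief sketch, with $I_q$ denoting an assumed quiescent (controller) current that is not named in the text:

```latex
\[
\eta_{\text{linear}}
  = \frac{V_{out}\, I_{load}}{V_{in}\,(I_{load} + I_q)}
  = \frac{V_{out}}{V_{in}} \cdot \frac{I_{load}}{I_{load} + I_q}
  \le \frac{V_{out}}{V_{in}}
\]
```

For example, regulating 1.0 V down to 0.7 V caps the efficiency at 70%, however small $I_q$ is.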
Due to their lower switching frequencies (< 10 MHz), switching regulators usually require
large discrete passive components such as capacitors and inductors to mitigate static ripples.
Recent technology advances make it possible for switching regulators to operate at much
higher frequencies and to be integrated on the same die as processors [34,40]. Buck converters [48] and switched-capacitor converters [33, 34, 49] are two types of topologies commonly
adopted for such IVRs, in addition to low dropout linear regulators (LDO). While a buck
converter requires both an inductor and a capacitor, it can sustain a relatively constant conversion efficiency over a wide output range. In contrast, the inductor-free switched-capacitor
topology benefits from higher capacitor density with technology scaling but incurs a linear
drop in efficiency when its output voltage deviates from its peak efficiency points. The
efficiencies of both the switched-capacitor and the buck converter are sensitive to device
parameters that depend on technology and process options.
Prior work on the system-level impact of IVR provides fragmented evaluations on a few
fixed configurations of technology/processes, topologies, input/output voltage ratios, and
load current levels [31, 32]. Therefore, the findings cannot easily be extended to different use
cases. While analytical models of the buck [50] and switched-capacitor converters [49, 51]


exist, they primarily focus on modeling individual IVRs as stand-alone blocks, and thus are
unable to handle integration with the entire PDS.

2.2.3 IVR-enabled Power Delivery System and Efficiency

As shown in Fig. 2.1, in an IVR-enabled PDS, voltage conversion is moved from off-chip to
on-chip. Because the on-chip die space is limited, IVRs adopt high frequency switches to
compensate for the reduced size of passive components like capacitors and inductors. As the
high frequency switches may cause more power loss, IVRs usually suffer from lower conversion
efficiencies than the conventional off-chip voltage regulator modules (VRMs). After moving
voltage conversion on chip, the current that flows through the power delivery network and the
power lost to its parasitic resistance are reduced to $1/\alpha$ and $1/\alpha^2$ of their original values, respectively.
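
The scaling can be seen from a short derivation. It assumes, for illustration only, an ideal lossless on-chip converter with step-down ratio $\alpha = V_{in}/V_{core} > 1$ and a fixed network resistance $R_{PDN}$; $I_{PDN}$ and $P_{loss}$ are symbols introduced here.

```latex
\[
I_{PDN} = \frac{V_{core}\, I_{core}}{V_{in}} = \frac{I_{core}}{\alpha},
\qquad
P_{loss} = I_{PDN}^{2}\, R_{PDN} = \frac{I_{core}^{2}\, R_{PDN}}{\alpha^{2}}
\]
```

With $\alpha = 2$, for instance, the network carries half the current and dissipates one quarter of the resistive loss compared with delivering the same core power at $V_{core}$.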

As voltage regulation is now located closer to the load, an IVR-enabled PDS enjoys multiple
intrinsic benefits. In a conventional PDS with an off-chip VRM, the voltage margin, $\Delta V = V_{core} - V_{min}$, causes a non-negligible power loss. In an IVR-enabled PDS, we can potentially
reduce the voltage margin to mitigate the energy overhead. Besides, IVRs open up the
opportunity for faster power management at the microsecond level. In this chapter, we
present two case studies to reveal the benefits of IVR-enabled PDS.

2.2.4 Related Work

Proof-of-concept circuits [52–56] and silicon prototypes [57–62] have been presented previously to explore the designs and benefits of integrated voltage regulators (IVRs) and IVR-enabled PDSs. Burton et al. [63] presented a fully integrated voltage regulator (FIVR) design
on commercial 4th generation Intel® Core™ SoCs with improved power delivery efficiency.
Fluhr et al. [64] presented the design of the POWER8™ processor powered by integrated voltage
regulation. Zimmer et al. [65] designed an integrated switched-capacitor voltage regulator that
can support sub-microsecond scale fast DVFS power management.
On the system side, Zhuo et al. [66] and Zhou et al. [32] proposed cross-layer infrastructures
for the co-exploration of power delivery and system architecture, especially focusing on the
power delivery network supply noises from parasitic components. Kim et al. [31] evaluated
the system-level benefits from fast DVFS supported by a fixed IVR-enabled PDS. Zeng et
al. [67] studied the system dynamic stability of integrating a large number of LDO on-chip
voltage regulators, and found the design offers a strong local load regulation and facilitates
system-level power management. Wang et al. [68] developed PowerSoc, a modeling, analysis, and optimization platform for buck-converter-based PDSs. Based on analytical
models, PowerSoc provides an accurate and fast evaluation of static characteristics, such as
power efficiency, transient response, and cost. Zhan et al. [69] proposed a heterogeneous
voltage regulation (HVR) architecture, exploring the rich heterogeneity and tunability of
HVR. They developed systematic workload-aware power management policies to adapt heterogeneous VRs with respect to workload change at multiple temporal scales. These policies
significantly improved the system’s power efficiency while providing a guarantee for power
integrity. However, none of these previous works provides a comprehensive study
and fair comparison across different IVR topologies and IVR-enabled PDSs, in either static
or dynamic characteristics, at the system level. To fill this need, we present Ivory 2.0, a
system-level early-stage modeling framework that can accurately estimate both the static
and dynamic behaviors of IVRs and IVR-enabled PDSs.


Figure 2.2: Block diagram of the IVR and IVR-enabled PDS system-level modeling framework. User inputs: PDN parameters (LPCB, RPCB, Lpkg, Rpkg), technology parameters (MOS device, capacitor, inductor, wire), architecture parameters (range of Vin and Vout, average and maximum load current), and optimization target (maximum efficiency, minimum power/area/cost). Modules: design parameter module, power/area/ripple static module, design optimizer module, and dynamic response module (with the SPICE 3 circuit simulator and workload power traces from architecture simulation). Results: optimized integrated voltage regulator and optimized IVR-enabled power delivery system, with static power, area, and ripple and dynamic transient voltage, current, and efficiency.

2.3 Modeling Methodology

Ivory’s system-level model enables rapid design exploration of IVR-enabled PDSs for computing systems with diverse configurations. Towards this end, it is crucial to capture the
main parameters that critically determine the overall PDS characteristics such as the power
consumption (loss) of each component in the PDS under static load conditions, and the
dynamic transient voltage, current and power variations and the system’s responses under
different scenarios. Here, we present a detailed description of the modeling framework and
methodology to obtain accurate estimates of these characteristics.

2.3.1 System-Level Modeling Framework

An overview of the IVR system-level modeling framework is shown in Fig. 2.2. Users input
high-level parameters, such as the input/output voltage range and maximum load current.

Technology parameters that characterize CMOS switches, capacitors, and inductors in the
IVR are built-in and extensible when necessary, with a comprehensively compiled database
containing MOSFET and capacitor data from 130 nm down to 10 nm, based on ITRS and
PTM models [70], as well as recently published surface-mounted-inductor and integrated-inductor data
[40, 48]. By default, the static module optimizes for maximum conversion efficiency
(to reduce power delivery overhead); it also allows users to specify a different optimization
target, such as area. The dynamic module considers the dynamic responses in IVR-enabled
PDSs. The internal structure of system-level modeling consists of the following key modules:
• Design parameter module reads in user input and technology information, such as
input/output voltage, load power, power switch width, capacitor/inductor density and
so on.
• Power/area/ripple static module calculates power consumption, static voltage
ripple, and die/board area for various building blocks across different IVR topologies,
based on design parameters.
• Design optimizer module calculates the optimal IVR designs based on the specified
technology, architecture configurations and basic circuit design guidelines. The system-level modeling can further support run-time optimization to achieve the desired power
delivery performance considering the PDS dynamic responses.
• Dynamic response module rapidly models the dynamic responses of IVRs and IVR-enabled full PDSs under load current transients and/or external commands with the
help of the SPICE 3 circuit simulator.

Advanced users familiar with IVR design trade-offs can leverage built-in interfaces to specify design parameters directly. Our model not only considers both the static performance
Figure 2.3: Three types of converter topologies: (a) switched-capacitor converter; (b) buck converter; (c) linear regulator.
characteristics of the IVR-enabled PDSs, but also applies distinctive modeling strategies to
accurately capture the system dynamic behaviors, which we will elaborate in the remaining
sections.

2.3.2 Power/Area/Ripple Static Module

By power/area/ripple static modeling, we refer to the calculation of the IVR conversion
efficiency, area, and voltage ripples based on the static assumption of average load conditions
and statistics. In contrast, the dynamic module described in Section 2.3.3 deals with an
IVR’s response to load current transients from dynamic power traces. The static model
applies to switched-capacitor converters, buck converters, and linear regulators, which are
the most commonly used IVR topologies in processor PDSs.
Switched-capacitor converters: Fig. 2.3(a) illustrates a basic switched-capacitor circuit.
The system-level modeling adopts the analytical model introduced by Seeman [51] and Le
[49]. The model derives the charge multiplier vectors (ac,i and ar,i ) based on the switch
topology, and uses these vectors to calculate both the slow (RSSL ) and fast switching (RF SL )
21

limit output impedances. RSSL and RF SL can be expressed as:

RSSL

ř
p i |ac,i |q2
“
Ctot fsw

RF SL

ř
p i |ar,i |q2
“
.
Gtot Dcyc

(2.4)

$C_{tot}$ is the total fly capacitance, $G_{tot}$ is the total switch conductance, $f_{sw}$ is the switching frequency, and $D_{cyc}$ is the duty cycle of the switching phase signals in a switched-capacitor IVR. The power loss due to the series output impedance is $I_{load}^2 \sqrt{R_{SSL}^2 + R_{FSL}^2}$. The losses due to the switch parasitic capacitance, the bottom-plate parasitic capacitance, and the gate leakage current from the fly capacitors are calculated to model the total power loss from the switching cells. Our model considers the commonly used Series-Parallel and Symmetric Ladder switched-capacitor topologies because both require capacitors with the same voltage rating and thus are suitable for on-chip implementation [51]. Researchers can plug in their own switched-capacitor topology by providing the charge multiplier vectors explicitly.
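The following Python sketch shows how Eq. (2.4) and the output-impedance loss term could be evaluated once the charge multiplier vectors are known. It is only an illustration: the function names are hypothetical, and the 2:1 charge multipliers, capacitance, conductance, and frequency values are assumptions chosen for the example rather than parameters from Ivory.

```python
from math import sqrt


def sc_output_impedance(a_c, a_r, c_tot, g_tot, f_sw, d_cyc):
    """Evaluate Eq. (2.4): slow and fast switching limit output impedances."""
    r_ssl = sum(abs(a) for a in a_c) ** 2 / (c_tot * f_sw)
    r_fsl = sum(abs(a) for a in a_r) ** 2 / (g_tot * d_cyc)
    return r_ssl, r_fsl


def output_impedance_loss(i_load, r_ssl, r_fsl):
    """Power loss I_load^2 * sqrt(R_SSL^2 + R_FSL^2) from the text."""
    return i_load ** 2 * sqrt(r_ssl ** 2 + r_fsl ** 2)


# Assumed example: a 2:1 converter with one fly capacitor and four switches,
# each carrying half of the output charge per switching period.
A_C = [0.5]
A_R = [0.5, 0.5, 0.5, 0.5]

r_ssl, r_fsl = sc_output_impedance(
    A_C, A_R,
    c_tot=100e-9,   # total fly capacitance in farads (assumed)
    g_tot=400.0,    # total switch conductance in siemens (assumed)
    f_sw=100e6,     # switching frequency in hertz (assumed)
    d_cyc=0.5,      # duty cycle of each switching phase (assumed)
)
loss = output_impedance_loss(1.0, r_ssl, r_fsl)
print(f"R_SSL = {r_ssl:.4f} ohm, R_FSL = {r_fsl:.4f} ohm, loss at 1 A = {loss:.4f} W")
```

A custom topology would only change the two charge multiplier lists; the impedance formulas stay the same.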
Buck converters: A typical buck converter is shown in Fig. 2.3(b). We adopt an existing validated analytical model that calculates the power loss of buck converters, taken from previous work on off-chip voltage regulators [50]. This model is based on the high-side and low-side switch resistance/capacitance, inductor size, parasitic resistance, capacitance, switching frequency, and PWM signal duty cycle. We extend this model to on-chip regulators by deriving the required parameters from the technology characteristics of switches and inductors, using parameters stored in the model's internal device database. Compared to an off-chip voltage regulator with a low switching frequency, the change of inductor characteristics with frequency is more pronounced in buck IVRs, and this effect is considered in the proposed system-level model through a polynomial-fitted, frequency-dependent coefficient of the inductance.


Linear regulators: Analog Gm amplifiers have been traditionally used in linear regulators.
Recent design trends [71] have increasingly adopted digital comparators and controllers to
achieve faster transient responses. Therefore, our Ivory model evaluates linear regulators
with a digital feedback path, as illustrated in Fig. 2.3(c). Since a current efficiency close to
99% can usually be achieved by state-of-the-art linear regulator design for moderate load
currents, the conversion efficiency of a linear regulator in this load range will closely follow
a linear relationship given by $V_{out}/V_{in}$.
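As a quick numerical check of this relationship, the sketch below multiplies the voltage ratio by the current efficiency. The function name is hypothetical; the 99% current efficiency comes from the text, and the 3.3 V input and 1 V output are the values listed in Table 2.1.

```python
def linear_regulator_efficiency(v_in, v_out, current_efficiency=0.99):
    """Conversion efficiency of a linear regulator: current efficiency times Vout/Vin."""
    if not 0.0 < v_out <= v_in:
        raise ValueError("a linear regulator requires 0 < v_out <= v_in")
    return current_efficiency * v_out / v_in


# 3.3 V board supply regulated down to a 1 V core supply.
lr_efficiency = linear_regulator_efficiency(v_in=3.3, v_out=1.0)
print(f"linear regulator efficiency: {lr_efficiency:.1%}")
```

The result of roughly 30% shows why a linear regulator is unattractive for large step-down ratios.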
Common building blocks: As illustrated in Fig. 2.3, different IVR topologies share many of the same circuit building blocks, such as power switches, drivers, comparators, a digital controller, and a clock generator, not to mention the basic capacitor and inductor devices. By commensurately modeling these shared building blocks across all topologies, the system-level modeling guarantees fair comparisons between different topologies, given the same technology and design constraints, which is of paramount importance for the efficiency-driven design exploration discussed in Section 2.5.2. For advanced digital technology, the power consumed and the area occupied by the digital feedback system are minimal compared to the moderate load current (10s of mA) and the on-chip capacitor and inductor needed for IVRs. Despite its insignificant power and area proportion, such peripheral circuitry is still important for transient response analysis and the scalability studies of IVR designs, and therefore is taken into account in the dynamic module of this system-level modeling.
In the design optimization, we adopt the traditional optimization approach called grid search, or parameter sweep, which is simply an exhaustive search through a manually specified subset of the design parameter space. The grid search algorithm is guided by performance metrics such as conversion efficiency or die area.
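A minimal sketch of such a grid search is shown below. The efficiency and area functions are deliberately simple toy models invented for this example (they are not Ivory's models), and all names and coefficients other than the 10 nF/mm² capacitor density and the 200 mm² area budget of Table 2.1 are assumptions.

```python
from itertools import product
from math import sqrt


def toy_efficiency(f_sw, c_fly, g_sw, i_load=1.0, v_out=1.0):
    """Toy switched-capacitor efficiency model (illustrative assumption only)."""
    r_ssl = 1.0 / (c_fly * f_sw)            # slow switching limit term
    r_fsl = 2.0 / g_sw                      # fast switching limit term
    p_cond = i_load ** 2 * sqrt(r_ssl ** 2 + r_fsl ** 2)
    p_sw = 1e-13 * g_sw * f_sw              # hypothetical gate-drive loss
    p_out = v_out * i_load
    return p_out / (p_out + p_cond + p_sw)


def toy_area(c_fly, g_sw):
    """Area in mm^2: 10 nF/mm^2 capacitors plus a hypothetical switch density."""
    return c_fly / 10e-9 + g_sw / 50.0


def grid_search(f_grid, c_grid, g_grid, max_area):
    """Exhaustively sweep the grid and keep the most efficient feasible design."""
    best = None
    for f_sw, c_fly, g_sw in product(f_grid, c_grid, g_grid):
        if toy_area(c_fly, g_sw) > max_area:
            continue                        # violates the area budget
        eff = toy_efficiency(f_sw, c_fly, g_sw)
        if best is None or eff > best[0]:
            best = (eff, f_sw, c_fly, g_sw)
    return best


F_GRID = [50e6, 100e6, 150e6, 200e6]        # switching frequencies (Hz)
C_GRID = [200e-9, 500e-9, 1e-6, 2e-6]       # fly capacitance (F)
G_GRID = [100.0, 200.0, 400.0]              # switch conductance (S)
MAX_AREA = 200.0                            # mm^2

best_design = grid_search(F_GRID, C_GRID, G_GRID, MAX_AREA)
print("best (efficiency, f_sw, c_fly, g_sw):", best_design)
```

Swapping the objective from efficiency to area only changes the comparison inside the loop, which is why a plain sweep is convenient for early design exploration despite its exponential cost in the number of parameters.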


Figure 2.4: Hierarchical power delivery system with integrated voltage regulator (IVR) dynamic models. [Diagram: a VRM and PCB/package parasitics feed local power grids through buck and SC IVRs; insets show the buck and switched-capacitor two-way averaging switch-free models.]

2.3.3 Dynamic Response Module

Besides static characteristics, the dynamic responses of IVRs also determine critical properties of the PDS, such as system reliability, efficiency, and power management flexibility. The dynamic module models the dynamic responses of the three main types of IVRs in PDSs. Fig. 2.4 shows a hierarchical IVR-enabled PDS where the supply voltage is stepped down by multiple off-chip VRMs and on-chip IVRs before reaching the workloads. To effectively model these coexisting "serial and parallel" voltage regulators and the dynamic responses of the full PDSs, we propose a two-way average switch-free model, which models each voltage regulator as a two-port network without periodic switches. This model captures all the critical dynamic responses while filtering out the static voltage ripples from periodic switches, whose magnitudes are negligible in modern multi-phase IVR designs.


Figure 2.5: Interleaved (multi-phase) buck converter.
The two-way average switch-free model uses a power delivery network side and a load side to model each IVR, as shown in the dynamic models of Fig. 2.4. Because it models an IVR as a switch-free two-port network, the model can be directly plugged into the power delivery network. In this model, the IVR switch dynamics are considered as average values of currents and voltages within a switching period by employing a weighted combination of the state equations of the switching phases in pulse-width modulated (PWM) converters. By avoiding the periodic switches in the dynamic model, this model runs about 1000x faster than direct SPICE simulation, and also supports the AC analysis of the hierarchical PDS including multiple IVRs. Compared with a real voltage regulator, this averaging approach, which uses the switching state-space averaging (SSA) method [72], neglects only the static voltage ripple effects. The generalized transfer function (GTF) can be further deployed to evaluate the influence of these periodic switching ripples. Here, we use a classic buck converter to derive and demonstrate how this model captures the IVR dynamic responses in the PDS. The derivation and demonstration are not limited to buck converters, and can also be applied to other switching voltage regulators. We start with the two-way average model and then present the GTF analysis for the static voltage ripples.

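
An integrated buck converter is shown in Fig. 2.5. Its single-phase state-space model can be described as:

$$
\begin{aligned}
\dot{X}(t) &= A X(t) + B_i u(t), \quad i = 1, 2, \\
y(t) &= C X(t) + D u(t),
\end{aligned}
\tag{2.5}
$$

where

$$
X(t) = \begin{bmatrix} V_C(t) \\ I_L(t) \end{bmatrix}, \quad
A = \begin{bmatrix} -\frac{1}{RC} & \frac{1}{C} \\ -\frac{1}{L} & 0 \end{bmatrix}, \quad
B_1 = \begin{bmatrix} 0 \\ \frac{1}{L} \end{bmatrix}, \quad
B_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix},
$$

$$
y(t) = V_o(t), \quad C = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad D = \begin{bmatrix} 0 \end{bmatrix}, \quad u(t) = V_{in}(t).
$$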

Modeling the switching period with the average model, the input matrix $B$ is written as:

$$
B = \alpha B_1 + (1 - \alpha) B_2 = \begin{bmatrix} 0 \\ \frac{\alpha}{L} \end{bmatrix},
$$

where $\alpha$ is the duty ratio of the periodic switch, which is also the voltage conversion ratio of the integrated buck converter.

Thus, the above system can be modeled as an averaged switch-free system with input $V'_{in}$ and input matrix $B'$:

$$
V'_{in} = \alpha V_{in}, \quad B' = \begin{bmatrix} 0 \\ \frac{1}{L} \end{bmatrix}.
$$
Similarly, looking from the power delivery network toward the IVR and its loads, the IVR and its loads can be modeled as Eq. (2.6):

$$
\dot{X}(t) = A X(t) + B u(t),
\tag{2.6}
$$

where

$$
X(t) = \begin{bmatrix} V_{C_\alpha}(t) \\ I_{L_\alpha}(t) \end{bmatrix}, \quad
A = \begin{bmatrix} -\frac{1}{R_\alpha C_\alpha} & \frac{1}{C_\alpha} \\ -\frac{1}{L_\alpha} & 0 \end{bmatrix}, \quad
B = \begin{bmatrix} 0 \\ \frac{1}{L_\alpha} \end{bmatrix}, \quad
u(t) = V_{in}(t).
$$

Figure 2.6: Periodical linear time-varying (PLTV) systems. (a) PLTV system. (b) Output of PLTV system.

This two-way average model supports the analysis and simulation of hierarchical power delivery networks by bridging the lower-level and higher-level power delivery networks through IVRs. Similarly, the dynamic model of switched-capacitor IVRs can be derived from the two-way average model [51].
The two-way average switch-free model discussed above ignores the static voltage ripples from periodic switches. The GTF analysis is derived to include the disturbances of periodic switches that are filtered out in the two-way average switch-free model. Continuing from Eq. (2.5), the time-domain solution of the integrated buck converter is

$$
x(t) = e^{A(t - t_0)} x_0 + \int_{t_0}^{t} e^{A(t - \tau)} B_i u(\tau) \, d\tau, \quad t \ge t_0.
\tag{2.7}
$$

Phase 1 (switch on): When $t \in [kT, kT + DT)$,

$$
x(kT + DT) = e^{ADT} x(kT) + \int_{kT}^{kT + DT} e^{A(kT + DT - \tau)} B_1 u(\tau) \, d\tau.
\tag{2.8}
$$

Phase 2 (switch off): When $t \in [kT + DT, (k + 1)T)$,

$$
x(t) = e^{A(t - kT - DT)} x(kT + DT).
\tag{2.9}
$$

At the end of the period, $t = (k + 1)T$,

$$
x((k + 1)T) = e^{AT} x(kT) + e^{A(k + 1)T} \int_{kT}^{kT + DT} e^{-A\tau} B_1 u(\tau) \, d\tau.
\tag{2.10}
$$
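The two phases above can also be stepped numerically. The Python sketch below alternates between the switch-on input ($B_1$) and the switch-off input ($B_2 = 0$) within each period, using forward Euler instead of the matrix exponentials of Eqs. (2.8) to (2.10). The function name is hypothetical; the component values and the 20 MHz switching frequency are taken from the validation setup later in this chapter, and the 5 V input is an assumption.

```python
def simulate_switched_buck(v_in, duty, f_sw, L, C, R, n_periods, steps_per_period=500):
    """Simulate a single-phase buck converter with explicit on/off phases.

    Returns the output-voltage samples of the last simulated period.
    """
    dt = 1.0 / (f_sw * steps_per_period)
    on_steps = int(round(duty * steps_per_period))
    v_c, i_l = 0.0, 0.0
    last_period = []
    for _ in range(n_periods):
        last_period = []
        for step in range(steps_per_period):
            u = v_in if step < on_steps else 0.0   # phase 1 uses B1, phase 2 uses B2 = 0
            dv = (-v_c / (R * C) + i_l / C) * dt
            di = (-v_c / L + u / L) * dt
            v_c += dv
            i_l += di
            last_period.append(v_c)
    return last_period


steady = simulate_switched_buck(
    v_in=5.0, duty=0.2, f_sw=20e6,
    L=0.1e-6, C=0.5e-6, R=0.5,
    n_periods=100,
)
mean_v = sum(steady) / len(steady)
ripple_pp = max(steady) - min(steady)
print(f"mean output: {mean_v:.4f} V, peak-to-peak ripple: {ripple_pp * 1e3:.2f} mV")
```

The cycle-averaged output matches the averaged model, while the small peak-to-peak ripple is the part that the averaged model filters out.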

Because the buck converter is a non-linear system, small-signal analysis is used to analyze its dynamic response. The input can be expressed as the combination of a DC value and an AC component of frequency $\omega$, $u(t) = u_0 + \tilde{u} e^{j\omega t}$. The output at steady state contains a DC component and an AC component of the same frequency. The GTF for the above system is given by

$$
H_{GTF}(j\Omega) = C \left( e^{j\Omega T} I - e^{AT} \right)^{-1} e^{AT} \int_{0}^{DT} e^{-A\tau} B_1 e^{j\Omega\tau} \, d\tau,
\tag{2.11}
$$

where $I$ is the identity matrix. The impulse response and transfer function can be extended to time-varying systems. The output is

$$
y(t) = \int_{-\infty}^{\infty} h(t, \tau) u(\tau) \, d\tau, \quad h(t, \tau) = \mathcal{R}[\delta(t - \tau)],
\tag{2.12}
$$

where $h(t, \tau)$ is the generalized impulse response, $\mathcal{R}$ is an operator describing the system behavior, $t$ is the observation time, and $\tau$ is the excitation time.

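
The bi-frequency transfer function is

$$
H(\omega, \Omega) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(t, \tau) e^{-j(\omega t - \Omega \tau)} \, dt \, d\tau,
\tag{2.13}
$$

where $\Omega$ and $\omega$ are the input and output frequencies.
The time-varying transfer function can be written as

$$
H(t, \Omega) = \int_{-\infty}^{\infty} h(t, \tau) e^{-j\Omega(t - \tau)} \, d\tau.
\tag{2.14}
$$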

Here, $H(t, \Omega)$ is a periodic function of $t$ with fundamental frequency $\omega_s = 2\pi/T$, and the system is a periodic linear time-varying (PLTV) system [73], as shown in Fig. 2.6. It can therefore be expanded as a Fourier series:

$$
H(t, \Omega) = \sum_{n=-\infty}^{\infty} H_n(\Omega) e^{jn\omega_s t}.
\tag{2.15}
$$

The LTI relationship is recovered for $n = 0$, which is exactly what the two-way average switch-free model captures. The frequency-dependent Fourier coefficients $H_n(\Omega)$ are called aliasing transfer functions:

$$
H_n(\Omega) = \frac{1}{T} \int_{0}^{T} H(t, \Omega) e^{-jn\omega_s t} \, dt.
\tag{2.16}
$$
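The role of the $n = 0$ term can be illustrated on the switching function itself rather than on $H(t, \Omega)$. The Python sketch below numerically evaluates Fourier coefficients of the same form as Eq. (2.16) for an ideal on/off switching signal; this is an assumed stand-in used only to show the decomposition, and the names, the 50 ns period, and the 0.2 duty ratio are example choices.

```python
import cmath
import math


def fourier_coefficient(signal, n, period, samples=10000):
    """Midpoint-rule estimate of (1/T) * integral_0^T signal(t) exp(-j n w_s t) dt."""
    dt = period / samples
    w_s = 2.0 * math.pi / period
    total = 0j
    for k in range(samples):
        t = (k + 0.5) * dt
        total += signal(t) * cmath.exp(-1j * n * w_s * t) * dt
    return total / period


PERIOD = 50e-9   # switching period for a 20 MHz converter
DUTY = 0.2       # duty ratio


def switching_function(t):
    """Ideal switch state: 1 during the on phase, 0 during the off phase."""
    return 1.0 if (t % PERIOD) < DUTY * PERIOD else 0.0


c0 = fourier_coefficient(switching_function, 0, PERIOD)
c1 = fourier_coefficient(switching_function, 1, PERIOD)
print(f"n = 0 coefficient: {c0.real:.4f}, |n = 1 coefficient|: {abs(c1):.4f}")
```

The $n = 0$ coefficient equals the duty ratio, which is the averaged quantity kept by the switch-free model; the $n \neq 0$ coefficients describe the harmonics at multiples of $\omega_s$ that the GTF analysis accounts for.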

The switches in the buck converter make the system non-linear by introducing new harmonics at multiples of ωs , which can be evaluated by GTF analysis.
Based on the model of single-phase buck converter, the dynamic model of the modern interleaved buck converter (also called a multi-phase buck converter) can be derived as follows.


For an N-phase buck converter, its state-space model is

$$
\begin{aligned}
\dot{X}(t) &= A X(t) + B_i u(t), \quad i = 1, 2, \ldots, 2N, \\
y(t) &= C X(t) + D u(t),
\end{aligned}
\tag{2.17}
$$

where

$$
X(t) = \begin{bmatrix} V_C(t) \\ I_{L1}(t) \\ \vdots \\ I_{LN}(t) \end{bmatrix}, \quad
A = \begin{bmatrix}
-\frac{1}{RC} & \frac{1}{C} & \cdots & \frac{1}{C} \\
-\frac{1}{NL} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{NL} & 0 & \cdots & 0
\end{bmatrix},
$$

$$
B_1 = \begin{bmatrix} \frac{1}{NL} \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
B_2 = \begin{bmatrix} \frac{1}{NL} \\ \frac{1}{NL} \\ \vdots \\ 0 \end{bmatrix}, \quad \cdots, \quad
B_N = \begin{bmatrix} \frac{1}{NL} \\ \frac{1}{NL} \\ \vdots \\ \frac{1}{NL} \end{bmatrix},
$$

$$
B_{N+1} = \begin{bmatrix} 0 \\ \frac{1}{NL} \\ \vdots \\ \frac{1}{NL} \end{bmatrix}, \quad
B_{N+2} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ \frac{1}{NL} \end{bmatrix}, \quad \cdots, \quad
B_{2N} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},
$$

$$
C = \begin{bmatrix} 1 & \cdots & 0 \end{bmatrix}, \quad D = \begin{bmatrix} 0 \end{bmatrix}, \quad y(t) = V_o(t), \quad u(t) = V_{in}(t).
$$

According to the state-space description of the multi-phase interleaved buck converter in Eq. (2.17), it has the same two-way average model as the conventional single-phase buck converter. For an N-phase interleaved buck converter, the GTF model can be derived with 2N phases in Eqs. (2.7)-(2.10). In modern multi-phase interleaved voltage regulator designs, the static voltage ripple effects from periodic switches are sufficiently mitigated. When the number of interleaved phases $N \to \infty$, which is the ideal voltage regulator, there are no ripple effects from periodic switches, and the two-way average switch-free model reflects all the dynamic behaviors. We have now presented a general IVR dynamic model. In addition, any customized feedback law for IVRs can be easily reflected in this model by adjusting the duty ratio $\alpha$, which is part of the IVR dynamic module. For example, in a voltage regulator with PID control, the duty ratio is set by the PID controller, so the duty ratio in the dynamic model is expressed as $\alpha + k_{PID}(V_{ref} - V_{out})$.
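The effect of such a feedback law can be sketched by making the duty ratio of the averaged model depend on the output error. The Python example below keeps only a proportional term, so it is a simplification of the PID case described above; the function name, the clamping of the duty ratio to [0, 1], the 5 V to 4.5 V input droop, and the gain of 3 are assumptions for illustration.

```python
def simulate_buck_with_feedback(v_in, v_ref, k_p, alpha0, L, C, R, t_end, dt):
    """Averaged buck model with duty ratio alpha0 + k_p * (v_ref - v_out).

    The simulation starts from the nominal steady state (v_out = v_ref).
    """
    v_c, i_l = v_ref, v_ref / R
    for _ in range(int(round(t_end / dt))):
        alpha = alpha0 + k_p * (v_ref - v_c)
        alpha = min(1.0, max(0.0, alpha))      # a physical duty ratio stays in [0, 1]
        dv = (-v_c / (R * C) + i_l / C) * dt
        di = (-v_c / L + alpha * v_in / L) * dt
        v_c += dv
        i_l += di
    return v_c


# Input droops from the nominal 5 V to 4.5 V while the nominal duty ratio stays at 0.2.
common = dict(v_in=4.5, v_ref=1.0, alpha0=0.2, L=0.1e-6, C=0.5e-6, R=0.5,
              t_end=5e-6, dt=0.5e-9)
v_open_loop = simulate_buck_with_feedback(k_p=0.0, **common)
v_closed_loop = simulate_buck_with_feedback(k_p=3.0, **common)
print(f"open loop: {v_open_loop:.4f} V, proportional feedback: {v_closed_loop:.4f} V")
```

Without feedback the output follows the input droop down to 0.9 V, whereas the proportional term pulls it back to within a few millivolts of the reference; an integral term would be needed to remove the remaining static error.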

2.4 Model Validation

We validate Ivory’s analytical model against both SPICE simulation results and measurement data from recent publications, spanning different technology nodes, input/output voltage ranges, and power levels. All these results demonstrate that the system-level model can
faithfully model and explore the design space of voltage regulator configurations in realistic
PDS settings.


For the static model, validation data for the switched-capacitor IVR model is presented in Fig. 2.7. On the left, Ivory is compared against silicon measurements taken from a reconfigurable switched-capacitor converter implemented in a 32 nm SOI process [49]. Ivory adequately models the measured data for the 3:2 and the 2:1 configurations until an efficiency drop occurs past the peak efficiency. Normal switched-capacitor converters do not function past the efficiency cliff region. Given that these points are non-functional and are most likely caused by aggravated leakage current when the power switch exceeds its intended operating range, we conclude that Ivory is sufficiently accurate over the realistic, functional range of operation. Data points on the right plot were generated by SPICE simulations of two sets of 2:1 and 3:1 switched-capacitor converter designs in a 40 nm CMOS process [33]. Regular CMOS capacitors are used for the low-power-density design, whereas embedded trench capacitors [34] are used for the high-power-density design. The data validates Ivory's ability to model the conversion efficiency across all four designs. The buck converter IVR topologies are validated in Fig. 2.8. The measured data on the left is obtained from a 2.5D buck converter using an integrated inductor-on-silicon interposer, a 45 nm SOI process, and an embedded trench capacitor. The buck converter operates at different load current levels [48]. The data on the right is from our buck design simulated in a 40 nm CMOS process. Ivory again proves capable of modeling voltage regulator efficiency, validating its internal buck converter modeling framework. Additionally, the analytical buck model used in Ivory has previously been validated against off-chip VRMs [50].
For the dynamic model, we validate the IVR two-way switch-free average model with SPICE simulations of recent IVR designs in both the time and frequency domains. Fig. 2.9 compares the step responses of the proposed two-way switch-free average model with SPICE simulations of integrated buck voltage regulators [74] with L = 0.1 µH, C = 0.5 µF, f_sw = 20 MHz, D = 0.2, load R = 0.5 Ω, and different numbers of phases.

Figure 2.7: Efficiency validation for SC converters. (a) Silicon measurements. (b) SPICE simulations. [Plots: efficiency (%) versus output voltage (V), Ivory versus reference data.]

Figure 2.8: Efficiency validation for buck converters. (a) Silicon measurements. (b) SPICE simulations. [Plots: efficiency (%) versus output voltage (V), Ivory versus reference data at different load currents.]

Figure 2.9: Interleaved (multi-phase) buck converter dynamic responses in time domains. [Plot: voltage (V) versus time (µs) for the 1-phase buck IVR (SPICE), the 10-phase buck IVR (SPICE), and the buck IVR switch-free average model.]
Although the 1-phase buck converter has voltage ripples from the periodic switches, the voltage ripples are effectively mitigated in the interleaved multi-phase (10-phase) designs. Such static voltage ripples become negligible in real designs because interleaved multi-phase designs are widely used in modern IVRs. The two-way switch-free average model naturally filters out the ripples from periodic switches but accurately captures all critical dynamic responses. Fig. 2.10 shows the output and input frequency responses of recent integrated buck voltage regulator designs (case 1 [75], case 2 [74], and case 3 [76]). The two-way switch-free average models (plotted as curves) match the SPICE simulations (plotted as points) of the full integrated buck converters below half of the switching frequency. The generalized transfer function (GTF) can further analyze the interference of these voltage ripples, as described in Section 2.3.3, when needed.

2.5 Case Study I: Many-core GPU PDS

To demonstrate how Ivory enables early-stage design exploration at upper levels of the system stack, we present a case study on finding the optimum PDS configuration in the context of a GPU-style many-core processor. Our goal is not to champion any one particular configuration; rather, it is to demonstrate how Ivory can be used for the early-stage design exploration of the PDS.

Figure 2.10: Interleaved (multi-phase) buck converter frequency responses in frequency domains.

2.5.1 System Configuration

In this case study, we focus on the comparison between the IVR-based PDS and a conventional off-chip VRM-based PDS. We assume an embedded GPU system with four cores (i.e., Streaming Multiprocessors, SMs) that form a 2×2 grid, but note that Ivory allows an arbitrary number of cores and layout. The Fermi architecture based SM has an average power of 5 W and a peak power of 14 W. This system uses the same off-chip and on-chip PDN equivalent circuit as in GPUVolt [77], with a 3.3 V supply at the board and a 0.85 V SM nominal voltage + 0.15 V voltage guardband. The four SMs are modelled with 12×12 on-chip power grids, where each SM is modelled with 3×3 grid points. The maximum area budget for the IVR is 200 mm², scaled to be similar to the IVR area in a 4-core Intel CPU with 45 nm technology [63]. The other input parameters of Ivory are summarized in Table 2.1.


Table 2.1: Summary of Ivory input parameters.
Configuration                              Value
Max. Area (mm²)                            200
Total Average Power (W)                    20
Total Peak Power (W)                       56
Input Voltage (V) / Output Voltage (V)     3.3 / 1
Max. Number of Distributed IVRs            4
Rsw (Ω·µm) / L (nH/mm²) / C (nF/mm²)       40 / 1 / 10
Off/On-Chip PDN Parameters                 Roff,on / Loff,on

Table 2.2: Summary of design space exploration.
Topology        3:1 SC            Buck              LR
Distri. No.     1 / 2 / 4         1 / 2 / 4         1 / 2 / 4
Eff. (%)        81.3 / 81.2 / 81  80.4 / 80.2 / 80  33.2 / 30.1 / 30
Ripple (mV)     1.6 / 1.6 / 1.5   1.6 / 1.5 / 1.2   5.1 / 4.7 / 4.1
fsw (MHz)       141 / 139 / 137   59 / 57 / 56      300 / 300 / 300

2.5.2 IVR Design Space Exploration

In this study, we set the maximum efficiency as the optimization target, and use Ivory to find the optimal IVR design (Fig. 2.14). We find that the buck converter has higher efficiency than the SC converter under a more stringent area budget, although a high-capacitor-density process can be used to alleviate such hurdles. With the design constraints shown in Table 2.1, Ivory performs the design space exploration and gives the optimal IVR solutions shown in Table 2.2.

2.5.3 Power Delivery System Dynamic Behaviors

We find that a 32-phase interleaved 3:1 switched-capacitor converter has the highest efficiency for this GPU system, and use it for the dynamic response analysis and PDS optimization. We use the dynamic module to explore the centralized and distributed IVR designs, and we compare the results from the previous default setting with the conventional off-chip VRM design, which adopts a 6-phase buck converter [78]. The dynamic response analysis compares the IVR designs through a workload-dependent analysis. We feed Ivory with GPU SM power traces from performance and power simulation infrastructures (GPGPUSim 3.2.0 [79] and GPUWattch [80]) running large programs from the Rodinia suite [81] and NVIDIA CUDA SDK.

Figure 2.11: Voltage noise across benchmarks and VR config. [Box plots: voltage noise (V) versus benchmark name (VR configuration) for the off-chip VRM, centralized IVR, and distributed IVRs; annotation: lower voltage noise with distributed IVRs.]
Ivory allows us to compare the run-time voltage noise of all centralized and distributed IVR configurations. In the centralized IVR configuration, the IVR is located in the middle of the 12×12 on-chip power grids, while in the distributed IVR configurations, the four IVRs are evenly distributed in the 12×12 on-chip power grids. The voltage statistics of the GPU system running different workloads are shown as box plots in Fig. 2.11. As indicated by the tight boxes with short whiskers, the design with four distributed IVRs is the optimal solution for supply voltage noise mitigation. Fig. 2.12 shows the supply voltage trace of the workload "CFD" with different VR designs. The voltage noise ranges in the off-chip VRM, the centralized IVR, the two distributed IVRs, and the four distributed IVRs scenarios are 125 mV, 59 mV, 55 mV, and 25 mV, respectively.


Besides the exhaustive time-domain simulation of supply voltage noise with a run-time workload power trace, Ivory also supports the AC analysis of the full PDS, including VRMs, IVRs, and customized feedback controls. Fig. 2.13(a) from Ivory presents the effective impedance to load variations of off-chip VRM-based, centralized IVR-based, and distributed IVRs-based PDSs. For a closer examination of each point, we index the 12×12 on-chip power grid with 12×12 X-Y coordinates, where (x6,y6) is the middle grid point of the power delivery network. The effective impedance plot directly demonstrates that the distributed IVR configuration has a lower effective impedance and less supply voltage noise than the centralized IVR configuration, especially for load points located far away from the IVRs, like the grid point (x2,y2).
The constant values in the low-frequency range come from the voltage regulator's internal resistance and the power delivery network's parasitic resistance, which were ignored in previous works. In previous work such as [44], VRMs are directly modeled as a fixed voltage source for simplicity. To account for the voltage drop across the voltage regulator's internal resistance and the power delivery network's parasitic resistance, which is also called IR drop, an extra voltage margin is added to the fixed voltage source as load line compensation. However, in real voltage regulator designs and PDSs, feedback control plays an important role in mitigating the static voltage drop and the low-frequency voltage noise within the regulation frequency. Fig. 2.13(b) shows the effective impedance after introducing feedback control (for example, proportional feedback control with k=3). In the high-frequency range, the impedance resonance in an off-chip VRM-based PDS lies beyond the feedback regulation frequency, but it can be mitigated by IVRs, especially by a distributed IVR-based PDS. Furthermore, the distributed IVR-based PDS has a lower supply noise impedance over the full range, especially at the resonant frequency. In the medium- and low-frequency ranges, within the regulation frequency range, the feedback control can further help mitigate static IR drop and low-frequency voltage noise.
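The low-frequency plateau and the resonance peak discussed above can be reproduced with a very small lumped model. The Python sketch below treats the PDS as a series resistance and inductance feeding a decoupling capacitor at the load, which is a single-stage simplification assumed for illustration only (Ivory models the full hierarchical network); the function name and the element values are hypothetical.

```python
import math


def pdn_impedance(freq_hz, r_series, l_series, c_decap):
    """Impedance seen by the load: (R + jwL) in parallel with 1/(jwC)."""
    w = 2.0 * math.pi * freq_hz
    z_series = complex(r_series, w * l_series)
    y_cap = complex(0.0, w * c_decap)
    return z_series / (1.0 + z_series * y_cap)


R_S = 5e-3      # series resistance in ohms (assumed)
L_S = 100e-12   # series inductance in henries (assumed)
C_D = 100e-9    # decoupling capacitance in farads (assumed)

f_res = 1.0 / (2.0 * math.pi * math.sqrt(L_S * C_D))
z_low = abs(pdn_impedance(1.0, R_S, L_S, C_D))
z_peak = abs(pdn_impedance(f_res, R_S, L_S, C_D))
print(f"|Z| at 1 Hz: {z_low * 1e3:.2f} mohm")
print(f"|Z| at resonance ({f_res / 1e6:.1f} MHz): {z_peak * 1e3:.1f} mohm")
```

At low frequency the impedance equals the series resistance, which corresponds to the constant plateau, while the LC resonance produces a peak that is far above it.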

Figure 2.12: Voltage noise waveforms (CFD) with VR config. [Plots versus time (µs): GPU workload power trace (W), and supply voltage (V) for the off-chip VRM, the centralized IVR, 2 distributed IVRs, and 4 distributed IVRs.]

Figure 2.13: Supply voltage noise effective impedance. (a) Effective impedance in VRM-based, centralized and distributed IVRs-based PDS. (b) Effective impedance in centralized and distributed IVRs with feedback. [Plots: impedance versus frequency (Hz) at grid points (x2,y2) and (x6,y6).]

Figure 2.14: IVR efficiency trade-off with area.

Figure 2.15: Power delivery system optimization.

2.5.4 Putting It Together: Power Efficiency Analysis

Ivory lets designers rapidly evaluate the final PDS efficiency through combined static and
dynamic analysis. The static converter design analysis finds the optimal converter with high
converter efficiency and low IR-drop loss. Ivory further optimizes the voltage margin by
identifying the IVR design with the minimal voltage noise that accounts for most of the
voltage margin [82]. Fig. 2.15 shows the power delivery efficiency breakdowns of different
PDS designs. The power efficiency is the percentage of power consumed by cores that perform
the actual computation over total power input to the PDS. The optimal PDS solution by
Ivory achieves a 9.5% power efficiency improvement over the previous off-chip VRM-based
PDS, without any performance loss.

2.6 Case Study II: PDS with Fast Per-Core DVFS

Another significant benefit of an IVR-enabled PDS is fast power management, such as microsecond-level fast DVFS. Computer architects keep pursuing faster power management, because faster voltage scaling means higher power and energy efficiency. Voltage regulator circuit designers usually focus on voltage conversion efficiency under area constraints. In this case study, we demonstrate Ivory as the downstream platform, after architecture-level performance analysis and power simulation, to analyze power delivery for many-core computing systems and bridge the gap between computer architects and voltage regulator circuit designers.

Table 2.3: CPU GPU many-core system.
Configuration       Value      Configuration       Value
PCB Supply Volt.    5 V        Process Tech.       40 nm
CPU PDN Para.       DisPDN     GPU PDN Para.       GPUvolt
CPU Core Arch.      Nehalem    GPU Core Arch.      Fermi
CPU Core Volt.      0.6-1 V    GPU Core Volt.      0.8-1 V
CPU Core Power      0-5 W      GPU Core Power      0-14 W
CPU Core L1$        32 KB      Threads per SM      1536
CPU Core L2$        512 KB     Registers per SM    128 KB
CPU Core L3$        8 MB       Threads per warp    32
Execution Order     OoOE       Mem Bandwidth       179.2 GB/s

2.6.1 System Configuration

We consider applying the fast DVFS supported by an IVR-enabled PDS to both CPU cores with an Intel Nehalem (x86) architecture and GPU streaming multiprocessors (SMs) with the NVIDIA Fermi architecture. The detailed specifications of this CPU and GPU system are shown in Table 2.3. The system is powered by a hierarchical IVR-based PDS [31, 68, 69], shown in Fig. 2.16. The hierarchical IVR-enabled PDS [68, 69] adopts an off-chip VRM to step down the voltage from the board level to an intermediate level (for example, 1.8 V) with higher efficiency. The per-core IVR then regulates the intermediate voltage to the desired core voltage with more flexibility. The off-chip VRM is modeled based on a commercial product [78]. The CPU on-chip power grid is scaled from the distributed power delivery network [44], and the GPU on-chip power grid is from GPUVolt [77]; these are validated with CPU and GPU systems, respectively. We use Ivory to find the IVR designs that can support the desired microsecond per-core DVFS.

Figure 2.16: Hierarchical power delivery system for SoC systems.

Table 2.4: CPU core DVFS frequency and voltage pairs.
Core Freq. (MHz)    2000    1800    1500    1000    800
Core Voltage (V)    1.2     0.9     0.8     0.7     0.6
On the architecture side, we use Sniper [83] (with McPAT) and GPGPU-Sim [79] (with GPUWattch) to simulate the architecture-level performance and power activities of the CPU and GPU systems. Sniper (with McPAT) simulates the CPU part, generating run-time statistics with a granularity of 100 ns, and GPGPU-Sim (with GPUWattch) simulates the GPU part at 700 MHz. We use representative benchmarks that cover a wide range of scientific and computational domains: CPU benchmarks from PARSEC and SPLASH-2, and GPU benchmarks from the Rodinia suite [81] and NVIDIA CUDA SDK. The CPU and GPU voltage and frequency scaling pairs are shown in Table 2.4 and Table 2.5, respectively.
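To give a feel for what these operating points mean for power, the Python sketch below applies the standard first-order CMOS approximation that dynamic power scales with f·V². This approximation is an assumption introduced here for illustration; the case study itself obtains power from McPAT and GPUWattch. The list holds the CPU pairs of Table 2.4, and the function name is hypothetical.

```python
# CPU core DVFS operating points from Table 2.4: (frequency in MHz, voltage in V).
CPU_DVFS_POINTS = [(2000, 1.2), (1800, 0.9), (1500, 0.8), (1000, 0.7), (800, 0.6)]


def relative_dynamic_power(points):
    """Dynamic power of each point relative to the first, assuming P ~ f * V^2."""
    f_ref, v_ref = points[0]
    return [(f, v, (f / f_ref) * (v / v_ref) ** 2) for f, v in points]


cpu_relative_power = relative_dynamic_power(CPU_DVFS_POINTS)
for freq, volt, rel in cpu_relative_power:
    print(f"{freq:5d} MHz @ {volt:.2f} V -> {rel:.3f} of peak dynamic power")
```

Under this approximation, stepping from the highest to the lowest point cuts dynamic power by about a factor of ten while frequency drops by only 2.5x, which is the motivation for scaling voltage together with frequency.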


Table 2.5: GPU core DVFS frequency and voltage pairs.
Core Freq. (MHz)    700     650     600     550     300
Core Voltage (V)    1.00    0.95    0.91    0.87    0.46

Figure 2.17: CPU and GPU power activity frequency analysis in executing benchmark blackscholes and backp. (a) CPU frequency analysis. (b) GPU frequency analysis. [Plots: power/frequency (W/Hz) versus frequency (Hz).]

2.6.2 IVR Support for Fast DVFS

To guide the IVR designs, we perform a frequency analysis on the power activities of the CPU and GPU cores when running benchmarks. For example, the CPU and GPU core power frequency analyses for the benchmarks blackscholes and backp are shown in Fig. 2.17, where both the CPU and GPU power activities extend into the MHz range. We correspondingly explore the IVR designs that can support fast DVFS, down to the microsecond scale.
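A frequency analysis of this kind can be sketched with a plain discrete Fourier transform. In the Python example below, the power trace is synthetic (a 2.5 W average with a 1 MHz component, both assumed), the 100 ns sample period matches the simulation granularity quoted for the CPU statistics, and the function name is hypothetical.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of DFT bins 0..N/2, each normalized by the number of samples."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        magnitudes.append(abs(acc) / n)
    return magnitudes


SAMPLE_PERIOD = 100e-9   # 100 ns between power samples
NUM_SAMPLES = 200

# Synthetic trace: 2.5 W average plus a 1 W amplitude oscillation at 1 MHz (assumed).
power_trace = [2.5 + 1.0 * math.sin(2.0 * math.pi * 1e6 * i * SAMPLE_PERIOD)
               for i in range(NUM_SAMPLES)]

spectrum = dft_magnitudes(power_trace)
peak_bin = max(range(1, len(spectrum)), key=lambda k: spectrum[k])   # skip the DC bin
peak_freq = peak_bin / (NUM_SAMPLES * SAMPLE_PERIOD)
print(f"dominant power activity at {peak_freq / 1e6:.2f} MHz")
```

The location of the dominant non-DC bin indicates how fast the voltage regulator has to respond if DVFS is to track the workload's power activity.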

Table 2.6: Summary of design space explorations of 16-phase buck IVRs.
DVFS Speed                 500 ns    2 µs     5 µs     50 µs
Efficiency (%)             77.0      83.3     83.4     84.8
Switch Freq. fsw (MHz)     189       62       53       44
L per-phase (nH)           0.25      0.5      0.75     0.5
C per-phase (µF)           0.125     0.56     0.56     1.125
Area (mm²)                 42.9      184.9    187.0    365.9

Based on the hierarchical IVR-based PDS in Fig. 2.16, we use the Ivory dynamic module to explore the design spaces of IVRs with hierarchical PDSs that can support different fast DVFS speeds. Here, we set the voltage scaling rise time to within 1% of the DVFS interval [58, 68, 69, 84] and the voltage overshoot to less than 5%. In the design space of an integrated buck voltage regulator, the passive inductors and capacitors directly and significantly affect the voltage scaling speed. Fig. 2.18 shows the design space explorations of the inductor and capacitor sizes for the desired voltage scaling speeds, where the blue points indicate that the inductor and capacitor design parameters can support the desired CPU fast DVFS. These parameters further form new constraints and are passed to the static module to find proper IVR design configurations. Similar approaches are also applied to the GPU SM cores. The key design parameters for the IVRs that support different speeds of DVFS are summarized in Table 2.6. When supporting fast DVFS, IVR designs keep reducing the size of on-die inductors and capacitors to achieve a faster voltage transition, and one prominent side effect is that the switching frequency is pushed from tens to hundreds of MHz. The higher switching frequency comes at the cost of degrading the conversion efficiency of the IVRs, as the switching loss becomes more significant.

2.6.3

Power Delivery System and Architecture Co-Design

Finally, we evaluate the system's energy efficiency with fast per-core voltage scaling
supported by this hierarchical PDS. For a fair comparison of the raw benefit from the fast
per-core DVFS enabled by an IVR, we use a native DVFS mechanism in which the
instructions-per-cycle (IPC) value is monitored to adjust the frequency and voltage at
run-time. The energy benefits for different speeds of fast DVFS supported by IVRs on the CPU
and GPU are shown in Fig. 2.19 and Fig. 2.20, respectively. The fast DVFS supported by IVRs
can reach finer granularity and save more energy for CPUs and GPUs. On the CPU side, the
50 µs, 2 µs, and 500 ns DVFS achieve energy savings of 7.65%, 12.5%, and 15.7% on average,
and 20.7%, 33.5%, and 43.2% on specific workloads such as Canneal. On the GPU side, the
50 µs, 2 µs, and 500 ns DVFS offer energy savings of 18.2%, 50.0%, and 55.4% on average,
and 55.8%, 66.6%, and 66.9% on specific workloads such as Srad and LavaMD. Taken together
with the results from Ivory, although the fastest DVFS (0.5 µs) achieves the greatest energy
saving, the implementation overheads of its IVRs offset the fast DVFS benefits. The 2 µs DVFS
is the proper candidate for this heterogeneous system, especially for the GPU SMs, because
it not only reaps the energy benefits from fast DVFS and the power delivery efficiency
improvement seen in case study I, but also avoids the costly IVR overheads. Besides the power
and energy analysis of CPU/GPU systems, Ivory can guide the design of both IVRs, such as
re-configurable IVR designs for various frequencies of fast DVFS, and IP cores with
customized fast power management mechanisms.

Figure 2.18: Inductor and capacitor sizes for different voltage scaling speeds.
[Panels: (a) 0.5µs voltage scaling; (b) 2µs voltage scaling; (c) 5µs voltage scaling; (d) 50µs voltage scaling. x-axis: C Capacitor Size (log10(x) F); y-axis: L Inductor Size (log10(y) H); legend: success, fail.]

Figure 2.19: CPU Energy benefit from different speed fast DVFS.
[y-axis: Normalized Energy Consumption; legend: No DVFS, 50000ns DVFS, 2000ns DVFS, 500ns DVFS; benchmarks: Blackscholes, Bodytrack, Canneal, Fluidanimate, Freqmine, Barnes, Cholesky, Ffm, Radiosity, Radix.]

Figure 2.20: GPU Energy benefit from different speed fast DVFS.
[y-axis: Normalized Energy Consumption; legend: No DVFS, 50000ns DVFS, 2000ns DVFS, 500ns DVFS; benchmarks: Backp, Hotspot, LavaMD, Pathfinder, Srad, Blackscholes, Convolution, Dxtc, Mergesort, Sortingnetworks.]

2.7

Conclusions

Computing systems, especially those in resource-constrained cyber-physical systems,
require an efficient power delivery solution to reduce power consumption.
The IVR is one of the attractive solutions. However, subtle trade-offs and topology choices in
IVRs make efficiency decisions unintuitive, forcing researchers to use inaccurate or incomplete
models. As IVRs continue to grow in popularity and become more beneficial, the system-level
model exposes design space trade-offs and supports dynamic response optimization without
manual effort and without the circuit expertise otherwise required, making the system-level
model and tool useful to system architects. Using the design space exploration tool Ivory,
we show cases where optimizing across technologies, topologies, and dynamic responses can
yield area and efficiency savings that would be missed without such a high-level
model. The saved power and energy can either be used to extend the operating time or be
allocated to other critical components in cyber-physical systems to boost performance.


Chapter 3
Circuit and Architecture Layers:
Voltage-Stacked Power Delivery
Systems: Reliability, Efficiency, and
Power Management

In today's computing systems, energy loss of more than 25% may result from the inherent
inefficiencies of conventional power delivery system (PDS) design. This wasted energy is
critical when the computing system is used in a resource-constrained scenario such as a
resource-constrained cyber-physical system, because it directly shortens the operating time,
one of the most important metrics of such systems. By
stacking multiple voltage domains in series to lower the step-down conversion ratio of the
off-chip voltage regulator module (VRM) and reduce energy loss along the path of the power
delivery network (PDN), voltage stacking (VS) offers a novel alternative power delivery
technique to fundamentally improve power delivery efficiency (PDE). However, voltage stacking
suffers from aggravated supply voltage noise caused by current imbalance, which hinders its
adoption. In this chapter, we investigate practical voltage stacking implementation in many-core
processors to improve power delivery efficiency and achieve reliable performance,
while maintaining compatibility with advanced power management techniques. We first
present the system configuration of a voltage-stacked many-core processor. We then
systematically characterize supply voltage noise in voltage stacking, identify global and residual
differential currents as its dominant contributors, and calculate the worst-case supply
voltage noise. We next propose two practical solutions that keep the supply voltage noise
within a safe range while maintaining high power delivery efficiency. The first solution is
a hybrid voltage regulation solution, based on a charge-recycling off-chip voltage regulator
and distributed integrated voltage regulators, to mitigate supply voltage noise effectively.
The second solution is a control-theory-driven cross-layer solution that leverages
architecture-level power management with feedback control to prevent serious current imbalance
and supply voltage noise. We also study the compatibility of voltage stacking with
higher-level power management techniques. Finally, the performance of a voltage-stacked GPU
system is comprehensively evaluated. Simulation results show that our approach can achieve
93.5% power delivery efficiency, reducing the power loss by 13.6% compared to a conventional
single-layer power delivery system. This 13.6% may be negligible for plugged-in systems,
but it is essential for resource-constrained cyber-physical systems.

3.1

Introduction

A closer examination of the power delivery path in modern computing systems reveals a
provocative finding: delivering the power from the PCB to the processor chip can
waste more than 25% of the power [85–87]. Thus, improving the efficiency of the last power
delivery stage of today's processors is critically important, especially when they are used in
an energy-limited environment such as a resource-constrained cyber-physical system.
Despite the importance of improving processor power delivery efficiency (PDE), the energy
loss in a conventional single-layer power delivery system (PDS), such as that shown in Fig.
3.1(a), is difficult to eliminate. Three main sources of inefficiency are directly associated with
the PDS. The first is the voltage conversion loss incurred in converting a higher supply voltage
at the board level to the lower supply voltage required by the microprocessor [42]. The second
is the power delivery network (PDN) loss in the parasitic resistance when transferring electron
charges from the off-chip power source to the distributed on-chip computing units [44, 45].
The third is the supply voltage margin needed to accommodate supply voltage noise and process
variation. Generally speaking, the three inefficiencies become worse with lower supply
voltages, increasing power density, and higher power ratings. Although various techniques have
been proposed in prior work to reduce PDN loss by moving the voltage regulation closer
to the point-of-load [88, 89], they cannot simultaneously address the inefficiencies of the
voltage regulator, and thus are fundamentally unable to close the efficiency
gap.
Voltage stacking, shown in Fig. 3.1(b) and also known as charge recycling [90] or multi-story
power delivery [91], is a novel technique that allows efficient power delivery from a single
high voltage source to multiple serially-stacked voltage domains. Due to the inherent
voltage division among the voltage domains in series, it obviates the need for step-down
voltage conversion and reduces the current flowing through the PDN. Ideally, if the current
loads from all the voltage domains are perfectly balanced, the input voltage is evenly
divided with no supply voltage noise. Voltage stacking's theoretical peak power
delivery efficiency under balanced power activity is close to 100%, making it an attractive
solution. However, in real applications, voltage stacking is seriously limited by the
exacerbated supply voltage noise caused by the current imbalance between the serially-stacked
voltage domains [92]. This limitation prevents wide adoption in practical systems that
require consistent and reliable operation. In this chapter, we systematically investigate the
feasibility and potential benefits of applying voltage stacking to a graphics processing unit
(GPU) to improve its power delivery efficiency.

Figure 3.1: Conventional single-layer and voltage-stacked multi-layer power delivery system:
(a) conventional single-layer; (b) voltage-stacked multi-layer.
(PCB voltage: 4V; each core requires 1V voltage and 1A current)

3.2

Background and Related Work

3.2.1

Power Delivery System

The power delivery system (PDS) in modern processors consists of a step-down voltage
regulation module (VRM) on the motherboard, sockets, off-chip decoupling capacitors, and
electrical connections at the board, package, and chip levels in the form of PCB traces,
socket bumps, and C4 bumps, where undesirable parasitic resistance and inductance reside.
The decoupling capacitors (C) and the parasitic resistance (R) and inductance (L) along
the connection path form the electrical model of the PDN in a computing system with a
conventional PDS. To study the power delivery efficiency and system reliability, it is usually
sufficient to assume that the output of the board-level VRM is an ideal voltage source, and we
adopt this convention in this work.
In a conventional setting, voltage conversion using a step-down VRM is necessary because
the voltage level at the board is higher than the digital supply of a processor. Yet due
to inherent inefficiency of step-down VRMs, energy is lost during the voltage conversion.
Resistive parasitics along the PDN path also contribute to energy loss and incur voltage
drop across the resistance, which is known as IR-drop. These two major efficiency losses can
approach 20% or more in advanced technology nodes and under peak power operations.
Moreover, because of the non-ideal effect of the parasitic RLC network, electrons cannot be
delivered instantaneously from the VRM output to immediately satisfy the fast changing
current loads of various on-chip components. This lag results in on-chip voltage fluctuations
and causes supply voltage noise reliability issues during operation.

3.2.2

Voltage Stacking

Figure 3.2: Illustration of on-chip power routing for conventional and voltage-stacked power
delivery configurations: (a) power routing hierarchy; (b) conventional power grid routing;
(c) power grid routing of voltage stacking.

In voltage stacking (VS), the step-down VRM can be eliminated by serially stacking the
voltage domains. It can be intuitively understood as allowing electron charges to recycle through
the stacking layers in series. In addition to eliminating step-down conversion loss, voltage
stacking lowers the PDN loss due to resistive parasitics, because in an $N$-layer voltage-stacked
system, the PDN path current is reduced by $N\times$, which corresponds to an $N^2\times$
reduction in power loss. These efficiency improvements have been demonstrated in prior work [93, 94].
A theoretical peak PDE close to 100% can be achieved using voltage stacking [94] when all
the stacking layers have balanced activities, and hence the same transient current demands.
In practice, though, applying voltage stacking in real computing systems, where activity
mismatches abound both spatially and temporally, proves to be challenging. As has been
shown in previous studies, such activity mismatches can cause severe voltage fluctuations in
a voltage-stacked system [87,95–99]. The aggravated noise problem remains one of the most
obstinate obstacles preventing voltage stacking adoption in the mainstream.
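The current and loss scaling of voltage stacking can be checked with a short calculation. The sketch below is illustrative only: it uses the operating point given in the caption of Fig. 3.1 (four cores, each drawing 1 A at 1 V) and a hypothetical lumped PDN resistance of 10 mΩ, a value assumed here and not taken from this work. It models perfectly balanced loads and ignores regulator loss and supply noise.

```python
def pdn_path_current(num_cores, core_current_a, num_layers):
    """Current through the PDN path when cores are split evenly across stacked layers."""
    if num_cores % num_layers != 0:
        raise ValueError("cores must divide evenly across layers")
    cores_per_layer = num_cores // num_layers
    # Layers are in series, so the path carries only one layer's worth of current.
    return cores_per_layer * core_current_a


def pdn_loss_w(path_current_a, r_pdn_ohm):
    """Resistive (I^2 * R) loss in a lumped PDN resistance."""
    return path_current_a ** 2 * r_pdn_ohm


R_PDN_OHM = 0.010  # hypothetical lumped PDN resistance, assumed for illustration

conventional_i = pdn_path_current(num_cores=4, core_current_a=1.0, num_layers=1)
stacked_i = pdn_path_current(num_cores=4, core_current_a=1.0, num_layers=4)

conventional_loss = pdn_loss_w(conventional_i, R_PDN_OHM)
stacked_loss = pdn_loss_w(stacked_i, R_PDN_OHM)

print("path current ratio:", conventional_i / stacked_i)
print("PDN loss ratio:", conventional_loss / stacked_loss)
```

With four layers, the path current ratio equals the number of layers and the loss ratio equals its square, independent of the resistance value assumed.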

3.2.3

Supply Voltage Noise

Due to its impact on system reliability, supply voltage noise has been diligently studied and
characterized for conventional single-layer PDS in single-core [100,101], multi-core [102,103],
and many-core GPU processors [77, 97, 104–106]. While circuit techniques such as load line
compensation are effective at taming IR-drop induced noise [107], dynamic $L\,di/dt$ noise, and
resonance noise in particular, are more dominant and harder to tackle [108, 109], and often
demand a cross-layer solution. However, a voltage-stacked many-core processor experiences
more serious and complex supply voltage noise behavior due to the interactions between
the cores that can lead to constructive or destructive noise composition. This aggravated
supply voltage noise prevents the wide adoption of voltage stacking in mainstream computing
systems, despite its higher power delivery efficiency. Up until now, it has been only intuitively
understood that the supply voltage noise in voltage stacking comes from an imbalanced workload,
and a systematic supply voltage noise study of many-core processors with a multi-layer
voltage-stacked PDS is still lacking.

3.2.4

Power Delivery Efficiency

The underlying physical mechanism to convert and transfer electron charges from the higher
supply voltage on the motherboard to the much lower supply voltage on the microprocessor
chip invariably causes energy loss. The energy loss can be broken down into three parts.
First, energy is lost in voltage conversion to step down the supply voltage [42]. We define the
conversion efficiency of a voltage regulator ($\eta_{VR}$) as the ratio of the power it delivers
at its output to the power it consumes at its input. $\eta_{VR}$ is usually
a function of the step-down conversion ratio. A high performance off-chip switching VRM
can deliver over 90% conversion efficiency, but the efficiency is degraded at a lower output
voltage with a higher step-down ratio [43].
The second part of the energy loss occurs in the power delivery network, mostly because
of heat dissipation when current runs through the parasitic resistance that exists along the
path of the PDN, which is related to the IR-drop component of supply voltage noise [44,45]:

\begin{equation}
\eta_{PDN} = \frac{R_{core}(V_{core})}{R_{PDN} + R_{core}(V_{core})} \tag{3.1}
\end{equation}

where $R_{PDN}$ represents the total parasitic resistance contributed by the PDN, and $R_{core}$
represents the equivalent resistive impedance of the computational load as a function of
$V_{core}$. The definition of $R_{core}$ suggests that $R_{core} = V_{core}/I_{core}$. For a fixed
$V_{core}$ value, $R_{core}$ is a measure of the power rating.
The final and often overlooked part is the energy overhead incurred by raising the supply by
a non-negligible voltage margin, $\Delta V = V_{core} - V_{min}$, to accommodate and sustain fault-free
operation [46, 47]. We can express this component as $\eta_{\Delta V}$:

\begin{equation}
\eta_{\Delta V} = \frac{P_{core}(V_{min})}{P_{core}(V_{core})} = \frac{V_{min}\, I_{core}(V_{min})}{V_{core}\, I_{core}(V_{core})} \tag{3.2}
\end{equation}

$P_{core}$ and $I_{core}$ represent the power consumption and the current load of the processor core
as a function of the core supply voltage ($V_{core}$ and $V_{min}$).
Based on the above analysis, the full power delivery system efficiency can be expressed as

\begin{equation}
\eta_{PDS} = \frac{P_{core}(V_{min})}{P_{src}} = \eta_{VR} \cdot \eta_{PDN} \cdot \eta_{\Delta V} \tag{3.3}
\end{equation}

where $P_{src}$ is the total power drawn from the source.
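As a rough numerical illustration of Eqs. (3.1) to (3.3), the sketch below multiplies the three efficiency terms. All numbers are hypothetical placeholders chosen only for illustration (a 90% regulator, 5 mΩ of PDN resistance, a core drawing 10 A at 1 V, and a core that would consume 9.5 W at its minimum voltage instead of 10 W); none of them are results from this work.

```python
def eta_pdn(r_pdn_ohm, v_core, i_core):
    """PDN efficiency, Eq. (3.1), with R_core = V_core / I_core."""
    r_core = v_core / i_core
    return r_core / (r_pdn_ohm + r_core)


def eta_margin(p_core_at_vmin_w, p_core_at_vcore_w):
    """Voltage-margin efficiency, Eq. (3.2)."""
    return p_core_at_vmin_w / p_core_at_vcore_w


def eta_pds(eta_vr, eta_pdn_value, eta_margin_value):
    """Full power delivery system efficiency, Eq. (3.3)."""
    return eta_vr * eta_pdn_value * eta_margin_value


# Hypothetical operating point, assumed for illustration only.
regulator = 0.90
pdn = eta_pdn(r_pdn_ohm=0.005, v_core=1.0, i_core=10.0)
margin = eta_margin(p_core_at_vmin_w=9.5, p_core_at_vcore_w=10.0)
total = eta_pds(regulator, pdn, margin)

print(f"eta_PDN = {pdn:.4f}, eta_dV = {margin:.4f}, eta_PDS = {total:.4f}")
```

Because the three terms multiply, a modest loss in each stage compounds into a noticeably lower end-to-end efficiency.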
Voltage stacking can reduce the power loss in the step-down voltage regulator by eliminating
supply voltage conversion, and the power loss in the parasitic resistance of the power delivery
network by reducing the path current. In addition, previous works [97, 110] show that voltage
stacking can also diminish the voltage margin to further improve power delivery efficiency. In this
work, for a fair comparison, we assume voltage stacking has the same voltage margin as the
conventional single-layer power delivery system and mainly focus on its improvements in
conversion efficiency ($\eta_{VR}$) and PDN efficiency ($\eta_{PDN}$).

3.2.5

Related Work

Proof-of-concept circuits [91, 111] and silicon prototypes [90, 93, 94, 112, 113] have been
presented previously to explore voltage stacking using low-power microcontrollers, along
with design methodology for floorplanning and placement [114, 115]. These pioneering
works demonstrate the feasibility of voltage stacking, but they are often limited to simple
assemblies of uncorrelated cores with low power density. Inter-layer current imbalance has
been discussed qualitatively [92] as a contributor to the supply voltage noise in
voltage-stacked systems, but without rigorous quantitative derivation of worst-case conditions. To
overcome supply voltage noise, most voltage stacking prototypes [90, 93, 94, 112, 113] resort
to employing charge-recycling integrated voltage regulators (CR-IVR) to actively balance
the current mismatches. However, the overhead and trade-offs of CR-IVRs require further
discussion and should be reduced.
Built upon these early prototypes, a number of novel approaches have been proposed to take
advantage of voltage stacking under different scenarios, such as 3D-IC with varying TSV,
on-chip decoupling capacitance, and package parameters [116,117]; optimal system partitioning
to unfold CPU cores [96, 110]; and GPU systems with supercapacitors [118] operating under
near-threshold voltages [97,110]. CoreUnfolding [96] is a novel method in which voltage stacking
is used within each core. However, it is highly invasive, as it requires separating the function
units inside the core into balanced groups. Voltage-Stacked GPUs [119–121] model
the power grid of the voltage-stacked GPU as a linear dynamic system to derive a power
control strategy that guarantees the supply noise bound, but the architecture-level power control
scheme sacrifices system performance for reliability.

Table 3.1: GPU voltage-stacked system configuration

Configuration            Value     Configuration        Value
PCB supply voltage       4.1V      SM core voltage      1V
No. of SM cores          16        Clock frequency      700MHz
Voltage-stacked layers   4         SM cores per layer   4
SM core ave power        5W        SM core max power    14W
Threads per SM core      1536      Threads per warp     32
Registers per SM core    128KB     Shared memory        48KB

Figure 3.3: Power delivery network (PDN) of a 2x4 voltage-stacked many-core processor.

3.3

System Configuration

In this work, we use a GPU system with the NVIDIA Fermi architecture as a representative
many-core processor. Table 3.1 lists the configuration details of the Fermi architecture. The Fermi
architecture GPU has 16 streaming multiprocessor (SM) cores [122]. We use a 4 × 4 voltage
stacking structure: the 16 SM cores are stacked in 4 layers and each layer has 4 SM cores, as 4V
is generally available on the board and the SM cores require 1V. Our analysis and solutions are
not limited to this 4 × 4 voltage stacking configuration and can be applied to other many-core
processors with arbitrary voltage stacking configurations.

3.3.1

Power Grid Routing and PDN Modeling of VS

Voltage stacking can be implemented in both 2D and 3D-IC chips, but for a fair comparison
with conventional power delivery methods, we focus on voltage stacking implementation in a
2D planar technology. To properly isolate the transistors in each voltage layer from the global
substrate, voltage stacking often relies on advanced process technology such as triple wells
or Silicon on Insulator (SOI) to establish local body biasing voltages [92, 98, 99, 123, 124].
A hierarchical structure is used in 2D power routing, as shown in Fig. 3.2(a). The top
metal layers form the global power grid, which connects cores or modules. The next layers
are local power grids connecting function blocks such as the ALU and register file. Finally, local
power grids in the bottom metal layers connect to the logic gates. As illustrated by the
power/ground routing scheme in Fig. 3.2(b) and 3.2(c), topologically stacking the voltage
domains on a 2D chip can be achieved with minimal modifications by re-routing the top
metal layers from parallel connections to series connections, leaving the local power/ground
grids in the lower metals and the physical floorplans of the underlying blocks largely intact.
Assuming this minimally-invasive routing method, we derive the voltage stacking PDN model
shown in Fig. 3.3 based on the typical RLC circuits and parameters introduced previously
to study GPU many-core processors [77, 106]. Note that there is parasitic resistance ($R_S$)
between the vertically-connected cores (modeled by current sources), as depicted in Fig. 3.3.
Our study focuses on the SM core power grid to clearly demonstrate the benefit of voltage
stacking and evaluate the proposed hybrid regulation methodology, since its peak and average
powers account for 80% and 93% of the total GPU chip power consumption [80]. A similar
scheme can also be adapted to other on-chip components, such as SRAM, in a voltage-stacked
configuration [111].

3.3.2

Communication Across Layers

A voltage-stacked system suffers from inherently complex communication across different voltage
layers: instructions and data are communicated among memory, cache, and core registers,
which are in different voltage layers. In the GPU system, SM cores do not directly
communicate with each other, and the cross-layer communication mainly happens between SM
cores and Memory Partition Units through an interconnection network. There are two
interconnection networks with butterfly topology and 22 nodes: one for traffic from SM cores
to Memory Partitions, and one for traffic from Memory Partitions back to SM cores.
Cross-layer communication requires extra level shifters added to the interconnection network.
Several level shifter designs are suitable for a stacked architecture, such as
capacitive-coupling-based (conventional) [125], two-stage cross-coupled (TSCC) [126–130], Wilson
current mirror (WCM) [127,129,131], stacked Wilson current mirror (Stacked) [132], switched-capacitor
(Tong) [98], and modified switched-capacitor (Mod-Tong). In tests by Ebrahimi [133] with a 1GHz
input signal, Tong has the best energy-delay trade-off.

3.4

Supply Voltage Noise Analysis

3.4.1

Supply Voltage Noise Characterization

Unlike previous empirical approaches [77,106], we develop an analytical modeling framework
to study and characterize supply voltage noise responses in the voltage stacking PDN, especially
in the presence of both correlated and uncorrelated core activities. The cornerstone of
our analytical approach is the decomposition and superposition principles of
fundamental circuit theory.

Figure 3.4: Supply voltage noise decomposition.
[Plot of supply voltage noise (V) versus time (us) for the total noise $V$ and its components $V^{G}$, $V^{ST}$, and $V^{R}$.]

Figure 3.5: Illustrative example for noise decomposition using 2 × 3 voltage stacking network:
(a) simplified 2 × 3 network; (b) equivalent network for $I^{G}$; (c) voltage response with $I^{G}$; (d)
equivalent impedance for $I^{G}$; (e) equivalent network for $I^{ST}$; (f) voltage response with $I^{ST}$;
(g) equivalent impedance for $I^{ST}$; (h) equivalent network for $I^{R}$.

Noise Decomposition & Superposition

Since the basic electrical model of the voltage stacking PDN consists of only linear components,
including the RLC elements and ideal voltage and current sources, the superposition principle
of linear systems holds, allowing us to decompose the core current into different components
to reveal their distinctive characteristics. Without loss of generality, let us assume a
voltage-stacked system that consists of $N_L$ vertically-stacked layers with $N_V$ cores on each layer. For
example, Fig. 3.3 shows an $N_L = 2$ and $N_V = 4$ voltage-stacked system. The cores that align
vertically are defined as a voltage stack. To facilitate later analysis, we adopt the s-domain
expressions for current sources and give the following definitions:

\begin{equation}
I^{core}_{i,j}(s) = I^{G}(s) + I^{ST}_{i}(s) + I^{R}_{i,j}(s) \tag{3.4}
\end{equation}

\begin{equation}
I^{G}(s) = \frac{\sum_{i=1}^{N_V} \sum_{j=1}^{N_L} I^{core}_{i,j}(s)}{N_V N_L} \tag{3.5}
\end{equation}

\begin{equation}
I^{ST}_{i}(s) = \frac{\sum_{j=1}^{N_L} I^{core}_{i,j}(s)}{N_L} - I^{G}(s) \tag{3.6}
\end{equation}

\begin{equation}
I^{R}_{i,j}(s) = \frac{(N_L - 1)\, I^{core}_{i,j}(s) - \sum_{k=1,\, k \neq j}^{N_L} I^{core}_{i,k}(s)}{N_L} \tag{3.7}
\end{equation}

where $I^{core}_{i,j}(s)$ is the current contributed by the core in the $i$th stack and the $j$th layer.
It is decomposed into three components, $I^{G}(s)$, $I^{ST}_{i}(s)$, and $I^{R}_{i,j}(s)$, in Eq. (3.4) - (3.7).
$I^{G}(s)$ represents the global current component shared by all the cores, $I^{ST}_{i}(s)$ represents the
common current component shared by the cores in the $i$th stack, and $I^{R}_{i,j}(s)$ is the residual
current component after removing the global and per-stack common terms. Now, the supply
voltage noise at the core (in the $i$th stack and the $j$th layer) can be expressed by the current
components working on their respective effective impedances, $Z^{G}_{eff}$, $Z^{ST}_{eff,i}$, and
$Z^{R}_{eff,i,j}$, and causing superimposed supply voltage noises, $\Delta V^{G}_{core_{i,j}}$,
$\Delta V^{ST}_{core_{i,j}}$, and $\Delta V^{R}_{core_{i,j}}$, as described in Eq. (3.8) and Fig. 3.4.

\begin{equation}
\Delta V_{core_{i,j}} = \Delta V^{G}_{core_{i,j}} + \Delta V^{ST}_{core_{i,j}} + \Delta V^{R}_{core_{i,j}}
= I^{G} Z^{G}_{eff} + I^{ST}_{i} Z^{ST}_{eff,i} + \sum_{n=1}^{N_V} \sum_{m=1}^{N_L} I^{R}_{n,m} Z^{R}_{eff,n,m} \tag{3.8}
\end{equation}

To illustrate how the decomposition and superposition in Eq. (3.4) - Eq. (3.8) help us
analyze and characterize supply voltage noise effects in voltage stacking, we use a simplified
RLC network of a 2 × 2 voltage stacking PDN, as shown in Fig. 3.5.
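The decomposition in Eq. (3.4) - (3.7) can be sketched numerically at a single frequency point. The function below is a minimal illustration, not the framework used in this work: the function name, the nested-list layout `core_current[i][j]` (stack `i`, layer `j`), and the example current values are assumptions made here for demonstration.

```python
def decompose_currents(core_current):
    """Split per-core currents into global, per-stack, and residual parts.

    core_current[i][j] is the current of the core in stack i and layer j at one
    frequency point (real or complex). Returns (i_global, i_stack, i_residual).
    """
    n_v = len(core_current)       # number of stacks
    n_l = len(core_current[0])    # number of layers per stack

    # Eq. (3.5): average over all cores.
    i_global = sum(sum(stack) for stack in core_current) / (n_v * n_l)

    # Eq. (3.6): per-stack average minus the global component.
    i_stack = [sum(stack) / n_l - i_global for stack in core_current]

    # Eq. (3.7): what remains for each core inside its stack.
    i_residual = [
        [
            ((n_l - 1) * stack[j] - sum(stack[k] for k in range(n_l) if k != j)) / n_l
            for j in range(n_l)
        ]
        for stack in core_current
    ]
    return i_global, i_stack, i_residual


# Hypothetical 2-stack, 2-layer example currents in amperes.
currents = [[1.0, 3.0], [2.0, 6.0]]
i_global, i_stack, i_residual = decompose_currents(currents)

# Eq. (3.4): the three components add back up to each core current.
rebuilt = [
    [i_global + i_stack[i] + i_residual[i][j] for j in range(2)]
    for i in range(2)
]
print(i_global, i_stack, i_residual, rebuilt)
```

Two properties follow directly from the definitions and are visible in the example: the per-stack components sum to zero across stacks, and the residual components sum to zero within each stack.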

Global Uniform Current

Since $I^{G}(s)$ is a uniform component across all the cores, the effective network can be
transformed by removing the paths between equal-potential nodes and merging the parallel
components, as in Fig. 3.5(b) for our 2 × 2 example. We can derive the supply
voltage noise caused by $I^{G}$ with an analytical expression for a general $N_L \times N_V$ network,
where the symbol $\parallel$ denotes parallel connection:

\begin{equation}
\Delta V^{G}_{core_{i,j}} = I^{G} Z^{G}_{eff} = I^{G} \left[ \left( \frac{Z_{C4}}{N_L} + \frac{Z_{S}}{N_L} + \frac{N_V}{N_L} Z_{off} \right) \parallel Z_{C} \right] \tag{3.9}
\end{equation}

Due to the uniform nature of the global current, all cores share the same common-mode
$Z^{G}_{eff}$, and thus the same $\Delta V^{G}_{core_{i,j}}$. Eq. 3.9 also applies to the case when
$N_L = 1$, which is a conventional single-layer PDN. From Eq. 3.9 and the typical impedance
profile of $Z^{G}_{eff}$ shown in Fig. 3.5(c), we can see that in an $N_L \times N_V$ voltage stacking
PDN, $\Delta V^{G}_{core_{i,j}}$ peaks at the dominant resonant frequency of $Z_{off}$, similar to the
conventional single-layer PDN, but its magnitude is reduced by $N_L\times$ when stacked.

Local Uniform Through-stack Current

Following our definition of $I^{ST}_{i}(s)$, we can see that since $\sum_{i=1}^{N_V} I^{ST}_{i}(s) = 0$,
there is no current going through $Z_{off}$ according to Kirchhoff's Current Law (KCL), and the
entire branch can be eliminated. The linear circuit network is again transformed to a simpler
form, as in Fig. 3.5(d). For example, in our 2 × 2 example, we can derive
$\Delta V^{ST}_{core_{i,j}}$, for $i = 1, 2$ and $j = 1, 2$ respectively, as a function of the unit current
stimulus $I^{ST}_{i}$ and complex impedances in the form of $Z_L$ and $Z_C$:

\begin{equation}
\Delta V^{ST}_{core_{i,j}} = I^{ST}_{i} Z^{ST}_{eff,i} = I^{ST}_{i} \, \frac{1}{N_L} \left[ Z_C \parallel Z_L \right] \tag{3.10}
\end{equation}

where $\Delta V^{ST}_{core_{i,j}}$ represents the supply voltage noise induced by $I^{ST}_{i}$, the common
current component shared by all the cores in the $i$th stack. All cores in the $i$th stack share the
same common-mode $\Delta V^{ST}_{core_{i,j}}$ disturbance. The resulting expression suggests that, to
first order, the combined effect of all the $I^{ST}_{i}$ exerts differential voltage fluctuations between
the vertical stacks, which are further voltage divided across the cores in the same stack, as
illustrated in Fig. 3.5(d). The dividing ratio depends on the ratio $Z_L/Z_C$, and in its
high-frequency limit asymptotically approaches $Z_L/N_L$. The analytical results for the local
uniform through-stack current again suggest that by moving from single-layer to multi-layer,
the supply voltage noise experienced at each core and contributed by this current
component is reduced by $N_L\times$ on average.

Residual Per-Core Differential Current

On closer inspection of Eq. 3.7, $I^{R}_{i,j}$ can be rearranged as the summation of differential
currents in the form of $I^{core}_{i,j} - I^{core}_{i,k}$, where $k \neq j$. The summation suggests that
the remaining voltage noise effect, unaccounted for by the global and the local terms,
$\Delta V^{G}$ and $\Delta V^{ST}_{i}$, is induced by the aggregated differential currents. This
differential current represents the mismatched part of the current between cores, which causes
voltage noise not only at the core itself but also at other cores. For example, at core$(i, j)$,
the noise from residual current comes from its own residual current and the residual currents of
other cores:

\begin{equation}
\Delta V^{R}_{core_{i,j}} = I^{R}_{i,j} Z^{R}_{eff,i,j} + \sum_{n \neq i}^{N_V} \sum_{m \neq j}^{N_L} I^{R}_{n,m} Z^{R}_{eff,n,m} \tag{3.11}
\end{equation}

where $I^{R}_{i,j} Z^{R}_{eff,i,j}$ is the supply voltage noise caused by its own residual current, and
$\sum_{n \neq i}^{N_V} \sum_{m \neq j}^{N_L} I^{R}_{n,m} Z^{R}_{eff,n,m}$
is the supply voltage noise caused by residual current from other cores. Most importantly,
this type of residual per-core differential current is unique to voltage stacking, since these
terms simply vanish when $N_L = 1$.
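The rearrangement of Eq. (3.7) into pairwise differences can be written out explicitly. The short derivation below uses only the definition in Eq. (3.7) and the fact that the sum over $k \neq j$ contains $N_L - 1$ terms:

```latex
\begin{align*}
I^{R}_{i,j}(s)
  &= \frac{(N_L - 1)\, I^{core}_{i,j}(s) - \sum_{k=1,\, k \neq j}^{N_L} I^{core}_{i,k}(s)}{N_L} \\
  &= \frac{1}{N_L} \sum_{k=1,\, k \neq j}^{N_L} \left( I^{core}_{i,j}(s) - I^{core}_{i,k}(s) \right)
\end{align*}
```

For $N_L = 1$ the sum over $k \neq j$ is empty, so $I^{R}_{i,j}(s) = 0$, which matches the observation that the residual term is absent in a single-layer PDN.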

3.4.2

Dominating Supply Voltage Noise

Based on the above system configuration, we characterize the effective impedances,
$Z^{G}_{eff}$, $Z^{ST}_{eff,i}$, and $Z^{R}_{eff,i,j}$, of each current component defined in Eq. (3.8).
The effective impedance for core$(1, 1)$ is shown in Fig. 3.6. Due to location symmetry, the
effective impedances of other cores are similar to those of core$(1, 1)$. We divide the frequency
range into low frequency (< 10 MHz), medium frequency (10 MHz to 50 MHz), and high frequency
(> 50 MHz). From the effective impedance curves, we can see that both $Z^{R}_{eff,i,j}$ at low
frequency and $Z^{G}_{eff}$ at high frequency (especially at resonance) have relatively large
magnitudes. The corresponding low frequency residual current components and high frequency
(resonance) global current components that excite these effective impedances can thus cause
large supply voltage noise, and we identify them as the dominant causes of voltage noise in
voltage stacking.

Figure 3.6: Effective impedance of current components.
[Plot of impedance (Ohm) versus frequency (MHz) for $Z^{G}_{eff}$, $Z^{ST}_{eff}$, $Z^{R}_{eff(i,1)}$, and $Z^{R}_{eff(i,2)(i,3)(i,4)}$, with the low, medium, and high frequency regions marked.]

Figure 3.7: An example instruction trace contributing to worst-case supply noise:
(a) $I^{core}_{i,j=n}(s)$; (b) $I^{core}_{i,j \neq n}(s)$.

3.4.3

Worst-Case Supply Voltage Noise

Identifying the root cause of noise is not sufficient for rigorous reliability analysis. We must
also consider what core activity conditions can result in the worst-case supply voltage noise.
Understanding the condition and the magnitude of the worst case would help us determine the
necessary and sufficient noise mitigation strategy to guarantee reliable operation in real-world
voltage-stacked systems.

Figure 3.8: Histogram comparison between analytically derived worst case and other heuristic
core activation patterns: (a) worst case; (b) residual; (c) global; (d) random.
After characterizing $Z^{G}_{eff}$, $Z^{ST}_{eff}$, and $Z^{R}_{eff}$, and establishing $\Delta V_{core}$
as a function of these impedances, searching for the load current conditions that would
result in worst-case supply voltage noise can now be performed in the frequency domain. We
formulate it as an optimization problem of finding the optimal frequency distribution of each
core current $I^{core}_{i,j}$ to maximize their combined effect $\Delta V^{core}_{m,n}$ on core$(m, n)$.
This optimization can be solved as a linear programming problem, and the process is described
in Algorithm 1. The optimization variables are the current distributions of each core over the
different frequency ranges, $I^{core}_{i,j}(s)$. The optimization objective function is the supply
voltage noise $\Delta V^{core}_{m,n}$ at core$(m, n)$, and the constraints come from the voltage noise
decomposition in Eq. (3.4) - (3.7) and the peak GPU SM core power, as shown in Table 3.1. This
linear optimization formulation with a general constraint of maximum power/current allows us to
search the vast space of arbitrary synthetic core current stimuli from all possible activity
combinations, including the effects caused by clock gating and power gating, and therefore can
quantitatively represent the worst-case supply voltage noise for rigorous reliability analysis.

Algorithm 1 Maximize supply voltage noise
Optimization Variables:
    Each core current frequency distribution $I^{core}_{i,j}(s)$
Objective Function:
    $\Delta V_{core_{i,j}} = \Delta V^{G}_{core_{i,j}} + \Delta V^{ST}_{core_{i,j}} + \Delta V^{R}_{core_{i,j}}$ in Eq. (3.8)
Subject to:
    1: $\forall i, j;\ 0 \le I^{core}_{i,j}(s)$
    2: $\forall i, j;\ I^{core}_{i,j}(t) = F^{-1}(I^{core}_{i,j}(s)) \le$ peak current (14 A)
    3: $\forall i, j;\ I^{core}_{i,j}(s),\ 0 \le s \le$ clock frequency (700 MHz)
    4: Eq. (3.4) - (3.7): current decomposition rules
Table 3.2: Frequency distribution of decomposed core current

Core Current                  Frequency                     Major Component
$I^{core}_{i,j=n}(s)$         low frequency (< 10 MHz)      residual current
$I^{core}_{i,j\neq n}(s)$     high frequency (> 50 MHz)     global current
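As a rough illustration of why the linear program in Algorithm 1 concentrates each core's current in one frequency range, the sketch below solves a heavily simplified version in Python. It assumes that noise contributions add linearly in magnitude, that the time-domain peak constraint can be replaced by a per-core budget on the sum of the bin amplitudes, and that the impedance values (`Z_SAME_COLUMN`, `Z_OTHER_COLUMN`) are hypothetical placeholders chosen only to mimic the trends in Fig. 3.6. None of these names or numbers come from the original analysis.

```python
I_PEAK = 14.0  # per-core peak current in A (Algorithm 1, constraint 2)

# Hypothetical |Z_eff| in Ohm seen from the victim core(m, n), per frequency bin.
Z_SAME_COLUMN = {"low": 0.20, "medium": 0.05, "high": 0.10}   # cores with j == n
Z_OTHER_COLUMN = {"low": 0.02, "medium": 0.03, "high": 0.10}  # cores with j != n


def worst_case_bin(z_by_bin, budget):
    """Maximize sum(z[b] * i[b]) subject to sum(i[b]) <= budget and i[b] >= 0.

    A linear objective with a single budget constraint is maximized at a
    vertex, so the whole budget goes to the bin with the largest impedance.
    Returns the chosen bin and the objective value (toy units).
    """
    best = max(z_by_bin, key=z_by_bin.get)
    return best, z_by_bin[best] * budget


same_bin, same_noise = worst_case_bin(Z_SAME_COLUMN, I_PEAK)
other_bin, other_noise = worst_case_bin(Z_OTHER_COLUMN, I_PEAK)
print(same_bin, other_bin)
```

Under these assumed impedances, the budget of a core with j = n lands in the low frequency bin and that of a core with j ≠ n lands in the high frequency bin, which mirrors the pattern summarized in Table 3.2. The full formulation couples all cores through Eq. (3.4) - (3.7), so it requires a general LP solver rather than this per-core shortcut.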

The numerical solution of the linear programming problem, based on the GPU configurations in Table 3.1, gives us a glimpse of the core current distributions and combinations that act together to cause the largest supply voltage fluctuation at core(m, n), as shown in Table 3.2. The currents $I^{core}_{i,j=n}(s)$ are distributed at low frequency, with residual currents as the major component, while the currents $I^{core}_{i,j\neq n}(s)$ are distributed at the resonant frequency of $Z^{G}_{eff}$, with global currents as the major component. This worst-case scenario is plausible in real GPU applications, as shown in Fig. 3.7: the $I^{core}_{i,j\neq n}(s)$ alternate between idle and Sine/Cosine special function instructions (SF Inst) at the resonant frequency, while the $I^{core}_{i,j=n}(s)$ are at peak power executing Sine/Cosine special function instructions. We compare the worst-case noise derived by our optimization algorithm with three other scenarios based only on heuristics: (1) all cores have low frequency residual currents, (2) all cores have high frequency global currents, and (3) all cores have randomly distributed currents. From the supply voltage noise histograms in Fig. 3.8, we can see that the worst case rigorously derived by our method is more severe than the heuristic ones, and is therefore more representative as a stressmark for supply voltage noise reliability analysis.

Figure 3.9: Hybrid voltage regulation based on distributed on-chip CR-IVRs and off-chip CR-VRM.

3.5 Noise Mitigation by Hybrid Regulation

3.5.1 Hybrid Regulation Framework

To combat elevated and hard-to-predict supply voltage noise and guarantee reliable operation in spite of worst-case conditions in voltage-stacked many-core processors, we explore a
hybrid voltage regulation mechanism using both on-chip charge-recycling integrated voltage
regulators (CR-IVRs) and an off-chip charge-recycling voltage regulator module (CR-VRM).
Fig. 3.9 shows the framework of the proposed hybrid regulated voltage stacking using either
switched-capacitor or low dropout voltage regulators. This hybrid approach takes advantage
of the unique merits of on-chip and off-chip voltage regulators and simultaneously avoids
their individual defects.
Unlike step-down voltage regulators, which convert the supply voltage, charge-recycling voltage regulators move extra charge between different layers to balance current and maintain a stable voltage at each layer. Because the direction and amplitude of the extra charge keep changing with core workload conditions, charge-recycling voltage regulators should support bidirectional, fast-switching current. Voltage regulators such as low drop-out voltage regulators and switched capacitor voltage regulators can be used as charge-recycling voltage regulators, while inductor-based voltage regulators, such as buck converters, do not support bidirectional fast switching of current and incur extra $L\,di/dt$ noise. Multi-output switched capacitor (SC) voltage regulators are the most widely used charge-recycling voltage regulators because they have higher power efficiency, but they require each layer to have the same voltage. Previous work has demonstrated a multi-output switched-capacitor integrated voltage regulator [112] that balances the layer currents in voltage-stacked systems. Although low drop-out voltage regulators have lower power efficiency, they do not force each layer to have exactly the same voltage, and hence are more suitable for supporting dynamic voltage and frequency scaling in voltage stacking.

Figure 3.10: Voltage distribution among the 16 SMs (supply voltage in V for SM1 to SM16, comparing voltage stacking + IVR with default voltage stacking). (a) Voltage distribution among the 16 SMs during execution of the backp benchmark. (b) Voltage distribution among the 16 SMs during execution of the blackscholes benchmark.

3.5.2 Centralized and Distributed Integrated Voltage Regulator

Located closer to the point-of-load, on-chip integrated voltage regulators enjoy fast regulation response, but have limited on-die area and capacity, making them suitable for reducing high-frequency noise of smaller magnitude. According to the analysis in Section 3.4, one of the dominant causes of worst-case supply voltage noise is high frequency global currents. This noise can be mitigated by on-chip CR-IVRs.

By moving charges across the stacking layers, the CR-IVR effectively behaves as an additional parallel impedance connected to the original effective impedance $Z^{G}_{eff}$. It thus reduces the supply voltage noise caused by global current:

$$\Delta V^{G}_{core_{i,j}} = I^{G} \left[ Z^{G}_{eff} \parallel \left( Z^{\text{CR-IVR}} + Z^{\text{CR-IVR-path}} \right) \right] \tag{3.12}$$

Here, $Z^{\text{CR-IVR}}$ is the impedance of the on-chip charge-recycling voltage regulator, and $Z^{\text{CR-IVR-path}}$ is the parasitic impedance of the on-chip power grid between the core and the voltage regulator. By deploying a CR-IVR with the desired impedance, $\Delta V^{G}_{core_{i,j}}$ from the global current $I^{G}$ can be effectively mitigated. The effective impedance of a multi-output switched-capacitor CR-IVR can be expressed as:

$$Z^{\text{CR-IVR}} = \sqrt{Z_{SSL}^{2} + Z_{FSL}^{2}} \tag{3.13}$$

$$Z_{SSL} = \frac{1}{C_{total} f_{SW}} \left( \sum_{i=1}^{n} |a_{c,i}| \right)^{2}, \qquad Z_{FSL} = \frac{1}{G_{total} D_{cycle}} \left( \sum_{i=1}^{n} |a_{r,i}| \right)^{2} \tag{3.14}$$

where $C_{total}$ is the fly capacitance, $G_{total}$ is the total switch conductance, $f_{SW}$ is the switching frequency, and $D_{cycle}$ is the duty cycle. Further, $a_{c,i}$ and $a_{r,i}$ are charge multiplier vectors [116, 134]. $Z^{\text{CR-IVR-path}}$ is the other important factor that determines the supply voltage noise mitigation. It is related to the distance between the core and the voltage regulator. When the regulator is located farther from the load, the noise mitigation effect is reduced because the parasitic impedance between the core and the voltage regulator contributes to a larger $Z^{\text{CR-IVR-path}}$. One effective way to enhance the noise mitigation is to split a large centralized voltage regulator into smaller distributed ones, because the distributed voltage regulators can be located closer to each core.
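The effect of the path impedance in Eq. (3.12) can be sketched numerically. The Python snippet below is a minimal illustration, assuming purely resistive impedances and hypothetical values for the global current, $Z^{G}_{eff}$, the regulator impedance, and the two path impedances; it holds the regulator impedance fixed to isolate the path term, which is a simplification of the distributed design.

```python
def parallel(z1: complex, z2: complex) -> complex:
    """Impedance of z1 in parallel with z2."""
    return z1 * z2 / (z1 + z2)


def global_noise(i_g: float, z_eff_g: complex, z_ivr: complex, z_path: complex) -> float:
    """Magnitude of the global-current noise in Eq. (3.12), in volts."""
    return abs(i_g * parallel(z_eff_g, z_ivr + z_path))


# Hypothetical values: 1 A of global current, impedances in Ohm.
I_G = 1.0
Z_EFF_G = 0.15
Z_IVR = 0.05

unregulated = abs(I_G * Z_EFF_G)
centralized = global_noise(I_G, Z_EFF_G, Z_IVR, z_path=0.10)  # regulator far from the core
distributed = global_noise(I_G, Z_EFF_G, Z_IVR, z_path=0.01)  # regulator next to the core
print(f"{unregulated:.3f} V, {centralized:.3f} V, {distributed:.3f} V")
```

With these placeholder numbers the shorter path yields the lower noise, which is the qualitative behavior the text attributes to placing regulators closer to each core.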
We next demonstrate the effectiveness of hardware regulation by on-chip charge-recycling integrated voltage regulators (CR-IVRs) and compare the regulation effects of centralized and distributed CR-IVRs. We first simulate the transient voltage waveforms of all the SMs with one centralized CR-IVR physically located in the middle of each layer and plot their voltage distributions using box plots. The statistics presented in Fig. 3.10 were collected from the backp and blackscholes benchmarks, but similar results are observed for all the benchmarks from both the NVIDIA CUDA SDK and Rodinia 2.0 benchmark suites. Comparing the standard deviations and peak-to-peak values of all the SM core voltages in the proposed voltage-stacked GPU, with centralized CR-IVR and without CR-IVR, reveals that the regulation effect is uneven among the SMs. This phenomenon is highlighted in the histograms in Fig. 3.11(a), which show the voltage distributions of SM1 and SM2, collected with 500,000 samples over a typical 10 µs period from the backp benchmark. SM2 exhibits the smallest supply voltage noise spread, yet noise worsens at SM1, because SM2 is closer to the centralized CR-IVR and has a smaller $Z^{\text{CR-IVR-path}}$ than SM1.
Now, we leverage the scalability of the CR-IVR in a distributed design. The distributed CR-IVR divides the original centralized design into four equal sub-IVRs and connects each sub-IVR directly to the SMs in each layer, with each sub-IVR consisting of 1/4 of the total switched capacitance. The extra implementation overhead of the distributed design is mainly due to the duplication of control logic, which accounts for negligible area and power consumption compared to the rest of the CR-IVR circuitry. The resulting SM voltage distribution using the distributed regulation is presented in Fig. 3.11(b). The location dependence is now completely removed and the same regulation effect is achieved across all SMs. The optimal design parameters [135] of the distributed CR-IVR are shown in Table 3.3.

Figure 3.11: Supply voltage noise distribution. (a) Centralized CR-IVR. (b) Distributed CR-IVRs.

Table 3.3: Switched Cap. Regulator Parameters

Design Parameters       CR-IVR             CR-VRM
Topology of VR          Multi-output SC    Multi-output SC
Number of VR            4                  1
Switch frequency        50 MHz             1 MHz
Total capacitor / VR    1.24 µF            624 µF
Capacitor density       50 nF/mm²          0.2 µF/mm²
Switch on resistance    130 Ω·µm           37600 Ω·µm
Area per VR             24.8 mm² (Die)     3.12 cm² (Board)

3.5.3 Off-Chip Charge-Recycling VR

Compared with CR-IVRs, off-chip CR-VRMs have slower response times, but they offer better efficiency [68, 136] and do not consume expensive die area. It is important to note that although on-chip CR-IVRs can be designed to provide regulating capacity similar to that of an off-chip counterpart, they incur large area overhead, sometimes exceeding the total area of the logic cores, making them impractical in real systems. Therefore, an off-chip CR-VRM is a better and more economical choice for regulating supply voltage noise at low frequency.

Figure 3.12: Effective impedance after employing CR-VRM (impedance versus frequency in MHz; curves for $Z^{G}_{eff}$, $Z^{R}_{eff}$, $[Z^{G}_{eff} \parallel Z^{\text{CR-IVR}}]$, and $[Z^{R}_{eff(i,j)} \parallel (Z^{\text{CR-VRM}} + Z_{C4} + Z_{pkg} + Z_{PCB})]$).

Similarly, the addition of the CR-VRM results in an effective parallel impedance connected with the original $Z^{R}_{eff(i,j)}$ through the C4 pad, package, and PCB. In this case, the supply voltage noise caused by residual current becomes

$$\Delta V^{R}_{core_{i,j}} = \sum_{i}^{N_L} \sum_{j}^{N_V} I^{R}_{i,j} \left[ Z^{R}_{eff(i,j)} \parallel \left( Z^{\text{CR-VRM}} + Z^{\text{CR-VRM-path}} \right) \right] \tag{3.15}$$

where $Z^{\text{CR-VRM}}$ is the impedance of the off-chip charge-recycling voltage regulator module; $Z^{\text{CR-VRM-path}}$ includes the parasitic impedances of not only the on-chip power grid but also the C4 pads, package, and PCB between the CR-VRM and the cores. A design optimization similar to that for the CR-IVR is applied to arrive at an optimal set of design parameters, as summarized in Table 3.3. The new effective impedance of the residual current after employing on-chip CR-IVR and off-chip CR-VRM is shown in Fig. 3.12. With reduced effective impedance, the supply voltage noise components, $\Delta V^{G}_{core_{i,j}}$ and $\Delta V^{R}_{core_{i,j}}$, are significantly mitigated.

3.5.4 Charge-Recycling VR Power Loss

In voltage stacking, most of the current goes through the stacked layers, the occasional residual current is absorbed by decoupling capacitors, and only the accumulated residual current components go through the CR-IVR or CR-VRM. We call this accumulated residual current the imbalanced current. When imbalanced current is recycled by a charge-recycling VR, power losses in the VR are unavoidable.

Switched-capacitor charge-recycling VR

A switched capacitor voltage regulator suffers mainly from the following four types of power losses:

Intrinsic Switched-Capacitor Loss: A switched capacitor voltage regulator delivers current to a synchronous digital system whose clock frequency is set by the minimum voltage over a clock period. The power lost in the voltage ripple above this minimum voltage is the intrinsic switched-capacitor loss [49], which is

$$P_{intrinsic} = I_{imbalance} \frac{\Delta V}{2} = \frac{I_{imbalance}^{2}}{M_{cap} C_{fly} f_{sw}} \tag{3.16}$$

Switching Conductance Loss: The finite conductance of the transistor switches causes a series power loss:

$$P_{R_{sw}} = N I_{imbalance}^{2} R_{sw} D \tag{3.17}$$

Plate Parasitic Capacitance Loss: In steady-state operation, both the top and the bottom plates experience approximately equal voltage swings, and the parasitic capacitance causes extra power loss:

$$P_{plate\text{-}cap} = M_{bott} V^{2} C_{plate} f_{sw} \tag{3.18}$$

Switching Parasitic Capacitance Loss: The voltage swing on the parasitic capacitance of the switch transistors causes a loss that can be expressed as

$$P_{sw\text{-}cap} = N C_{sw} V^{2} f_{sw} \tag{3.19}$$

Among these losses, the intrinsic switched-capacitor loss and the switching conductance loss are the main components [135].
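The four loss terms of Eq. (3.16) - (3.19) can be evaluated side by side with a short Python sketch. The function and parameter names below are illustrative rather than taken from the original work, $N$ is interpreted as a switch count (an assumption), and every numeric value is a hypothetical placeholder except the fly capacitance and switching frequency, which follow the CR-IVR column of Table 3.3.

```python
def sc_losses(i_imb, m_cap, c_fly, f_sw, n_sw, r_sw, duty, m_bott, c_plate, c_sw, v):
    """Return the four switched-capacitor loss terms of Eq. (3.16)-(3.19) in watts."""
    return {
        "intrinsic": i_imb ** 2 / (m_cap * c_fly * f_sw),       # Eq. (3.16)
        "switch_conductance": n_sw * i_imb ** 2 * r_sw * duty,  # Eq. (3.17)
        "plate_cap": m_bott * v ** 2 * c_plate * f_sw,          # Eq. (3.18)
        "switch_cap": n_sw * c_sw * v ** 2 * f_sw,              # Eq. (3.19)
    }


# Hypothetical operating point (placeholders unless noted otherwise).
PARAMS = dict(
    i_imb=0.5,      # imbalanced current in A
    m_cap=1.0,      # topology constant M_cap
    c_fly=1.24e-6,  # fly capacitance in F (Table 3.3, CR-IVR)
    f_sw=50e6,      # switching frequency in Hz (Table 3.3, CR-IVR)
    n_sw=8,         # N, interpreted as the number of switches (assumption)
    r_sw=0.01,      # switch on-resistance in Ohm
    duty=0.5,       # duty cycle D
    m_bott=1.0,     # topology constant M_bott
    c_plate=2e-11,  # plate parasitic capacitance in F
    c_sw=1e-12,     # switch parasitic capacitance in F
    v=1.0,          # voltage swing in V
)

base = sc_losses(**PARAMS)
doubled = sc_losses(**{**PARAMS, "i_imb": 2 * PARAMS["i_imb"]})
```

Doubling the imbalanced current quadruples the intrinsic and switching conductance terms while leaving the two parasitic capacitance terms unchanged, since only the first two depend on $I_{imbalance}$.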

Low Drop-out Charge-Recycling VR

Low drop-out voltage regulators suffer from the following three power losses [137]:

Switch Conduct Loss: The main cause of loss in low drop-out voltage regulators is the power dissipated as heat in the transistor switch resistance. It is highly dependent on the difference between the input and output voltages:

$$P_{R_{sw}} = I_{imbalance}^{2} R_{sw} \tag{3.20}$$

Switching Parasitic Capacitance Loss: The gate parasitic capacitance switching loss is similar to the corresponding loss in the switched capacitor charge-recycling voltage regulator.

Control Logic Loss: In an LDO, feedback control circuitry holds the output voltage at the reference value. The loss is due to the current flowing through the operational amplifier, the resistive voltage divider, and the voltage reference generator in the control logic circuitry; the sum of these currents is called the quiescent current. The quiescent current can be reduced by optimizing the components, making it negligible compared to the load current.

3.5.5 Hybrid Regulated VS Power Delivery Efficiency

The power delivery efficiency of hybrid regulated many-core voltage stacking can be described as

$$\eta_{PDS} = \frac{P_{core}}{P_{core} + P_{PDN} + P_{\text{CR-IVR}} + P_{\text{CR-VRM}}} = \frac{I_{core} V_{core}}{I_{core} V_{core} + \left( \frac{I_{core}}{N} \right)^{2} R_{PDN} + P_{\text{CR-IVR}} + P_{\text{CR-VRM}}} \tag{3.21}$$

where $P_{core}$ is the power consumed by the cores, and $P_{PDN}$ is the power loss in the parasitic resistance along the power delivery network. $P_{\text{CR-IVR}}$ and $P_{\text{CR-VRM}}$ are the power losses in the CR-IVR and CR-VRM, derived in Eq. (3.16) - (3.20).
In the ideal case, if there is no residual current, the power losses in the CR-IVR and CR-VRM are negligible and the power delivery efficiency can approach 100%. In the normal case, most of the current goes through the stacked layers, and only the accumulated residual current components go through the CR-IVR or CR-VRM, where they introduce a small amount of power loss. In the worst case, when cores in one layer are powered off and cores in the other layers are powered on, all the current consumed by the cores is imbalanced current and goes through the CR-IVR and CR-VRM, causing significant power loss. However, this worst case seldom happens, and the system on average achieves high power delivery efficiency. Consequently, the hybrid regulation scheme not only mitigates the supply voltage noise but also maintains high power delivery efficiency.
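Eq. (3.21) can be evaluated directly, as in the following Python sketch. The function name and all numeric values are hypothetical, and $N$ is interpreted as the number of stacked layers, which is an assumption rather than a definition given alongside the equation.

```python
def pds_efficiency(i_core, v_core, n, r_pdn, p_cr_ivr=0.0, p_cr_vrm=0.0):
    """Power delivery efficiency of Eq. (3.21).

    i_core, v_core: total core current (A) and core voltage (V)
    n: N in Eq. (3.21), interpreted here as the number of stacked layers
    r_pdn: parasitic resistance of the power delivery network (Ohm)
    p_cr_ivr, p_cr_vrm: regulator losses (W)
    """
    p_core = i_core * v_core
    p_pdn = (i_core / n) ** 2 * r_pdn
    return p_core / (p_core + p_pdn + p_cr_ivr + p_cr_vrm)


# Hypothetical operating point: 40 A of core current at 1 V, 5 mOhm PDN resistance.
balanced = pds_efficiency(40.0, 1.0, n=4, r_pdn=0.005)  # no residual current
with_regulator_loss = pds_efficiency(40.0, 1.0, n=4, r_pdn=0.005, p_cr_ivr=1.5, p_cr_vrm=0.5)
single_layer = pds_efficiency(40.0, 1.0, n=1, r_pdn=0.005)  # same formula with N = 1
print(f"{balanced:.3f} {with_regulator_loss:.3f} {single_layer:.3f}")
```

In this toy setting the balanced stacked case is limited only by the PDN term, and the regulator losses reduce the efficiency as imbalanced current grows, in line with the ideal, normal, and worst cases described above.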

3.6 Architectural Support for VS

Based on the above analysis, an important insight is revealed: the highest impedance peak in a VS system occurs in the low frequency range and contributes the largest supply fluctuations in worst-case scenarios. This finding opens the possibility for architecture-level techniques to suppress low frequency supply noise. In this section, we explore such opportunities by proposing control-theory-driven architectural support for VS.
The motivation for leveraging control theory in the architecture-level technique is to provide strong guarantees of worst-case behavior and control stability, which are necessary in a voltage-stacked system (unlike in its conventional PDS counterpart). In a conventional system with a single-layer PDS, the worst-case supply noise is often induced by repetitive execution sequences or sudden trigger events near its peak resonance impedance. These execution activities occur over a short period of time (tens or hundreds of clock cycles) and can be predicted and rearranged either at compile time or at runtime. Such predictability does not readily apply to voltage stacking, however, because its impedance profile may exhibit a high plateau over a wide low frequency range due to the imbalanced residual current components, which could span from hundreds to tens of thousands of clock cycles. In light of the intractability of the root causes of current imbalance/misalignment in the GPU, we resort to a control-theory-based approach to stabilizing the layer voltages in voltage stacking. We first present the control-theoretic formulation of our proposed architectural support for VS, modeling the layer voltages as a four-dimensional linear dynamic system. We then discuss the available voltage smoothing techniques and identify dynamic issue width scaling (DIWS), fake instruction injection (FII), and dynamic current compensation (DCC) as suitable actuation mechanisms. Finally, the detailed implementation is considered, to account for potential performance impacts and power/area overheads of the proposed techniques.

Figure 3.13: Simplified circuit of (a) the 4 × 4 VS GPU, (b) a single VS stack.

3.6.1 Control Theoretic Formulation

In order to apply control theory to mitigate severe voltage droops caused by current imbalance, and to stabilize layer voltages, we model the on-chip power grid of the voltage-stacked GPU as a linear dynamic system and then formally derive the control strategy in response to the measured state of the system. Fig. 3.13(a) illustrates the simplified on-chip power grid of the example GPU system with a 4 × 4 VS configuration. Here, we simplify the PDS by neglecting the parasitic impedance and assuming an ideal 4 V supply voltage ($V_{DD}$). We further simplify the model by only looking at the voltages and corresponding current terms in a single stack (or column) of the 4 × 4 array and ignoring the small parasitic on-chip inductance, as shown in Fig. 3.13(b). Assuming that the system reaches equilibrium when all the layer voltages are evenly divided, and using that equilibrium point as the initial condition, we can write down the differential equation for each layer voltage at time t as:

$$V_i(t) = V_{i-1}(t) + \frac{1}{4} V_{DD} + \frac{1}{C} \int_{0}^{t} \left( I_{i+1} - I_i + \Delta I_i \right) d\tau \tag{3.22}$$

in which $V_i(t)$ represents the absolute voltage level at layer i. Assuming $V_{DD}$ is an ideal voltage source, $V_4(t) = V_{DD}$ is a constant value. The system of equations depicted by (3.22) can be expressed in matrix form as:
$$
\begin{bmatrix} \dot{V}_1 \\ \dot{V}_2 \\ \dot{V}_3 \\ \dot{V}_4 \end{bmatrix}
=
\begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} V_1 \\ V_2 \\ V_3 \\ V_{DD} \end{bmatrix}
+
\begin{bmatrix} -\frac{1}{C} & \frac{1}{C} & 0 & 0 \\ -\frac{1}{C} & 0 & \frac{1}{C} & 0 \\ -\frac{1}{C} & 0 & 0 & \frac{1}{C} \\ 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} I_1 \\ I_2 \\ I_3 \\ I_4 \end{bmatrix}
+
\begin{bmatrix} \frac{\Delta I_1}{C} \\ \frac{\Delta I_2}{C} \\ \frac{\Delta I_3}{C} \\ 0 \end{bmatrix}
\tag{3.23}
$$

where $I_i$ represents the current of the SM in the ith layer. Replacing $I_i$ with the SM power ($P_i$) divided by the layer voltage across the SM, i.e., $I_i = \frac{P_i}{V_i - V_{i-1}}$, we have the dynamic system describing the relation between voltage and power as:

$$
\begin{bmatrix} \dot{V}_1 \\ \dot{V}_2 \\ \dot{V}_3 \\ \dot{V}_4 \end{bmatrix}
=
\begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} V_1 \\ V_2 \\ V_3 \\ V_{DD} \end{bmatrix}
+
\begin{bmatrix} -\frac{1}{C} & \frac{1}{C} & 0 & 0 \\ -\frac{1}{C} & 0 & \frac{1}{C} & 0 \\ -\frac{1}{C} & 0 & 0 & \frac{1}{C} \\ 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} \frac{P_1}{V_1 - V_{GND}} \\ \frac{P_2}{V_2 - V_1} \\ \frac{P_3}{V_3 - V_2} \\ \frac{P_4}{V_4 - V_3} \end{bmatrix}
+
\begin{bmatrix} \frac{\Delta I_1}{C} \\ \frac{\Delta I_2}{C} \\ \frac{\Delta I_3}{C} \\ 0 \end{bmatrix}
\tag{3.24}
$$
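The right-hand side of Eq. (3.24) can be evaluated numerically to see how a power imbalance moves the layer voltages. The Python sketch below is only illustrative: the capacitance `C_LAYER`, the power values, and the helper names are hypothetical, and the state matrix is taken as zero so that only the current and disturbance terms contribute.

```python
C_LAYER = 1e-6  # hypothetical per-layer capacitance in F
V_GND = 0.0

# Input matrix of Eq. (3.24) without the 1/C factor.
B_ROWS = [
    [-1.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]


def layer_currents(v, p):
    """I_i = P_i / (V_i - V_{i-1}), with V_0 = V_GND."""
    lower = [V_GND] + v[:-1]
    return [p_i / (v_i - v_lo) for p_i, v_i, v_lo in zip(p, v, lower)]


def voltage_derivatives(v, p, delta_i=(0.0, 0.0, 0.0), c=C_LAYER):
    """dV/dt for [V1, V2, V3, V4] from the current and disturbance terms."""
    currents = layer_currents(v, p)
    disturbance = list(delta_i) + [0.0]  # V4 is pinned to VDD
    return [
        (sum(b * i for b, i in zip(row, currents)) + d) / c
        for row, d in zip(B_ROWS, disturbance)
    ]


v_eq = [1.0, 2.0, 3.0, 4.0]  # evenly divided layer voltages with VDD = 4 V
balanced = voltage_derivatives(v_eq, [10.0, 10.0, 10.0, 10.0])
imbalanced = voltage_derivatives(v_eq, [12.0, 10.0, 10.0, 10.0])
```

With equal SM powers all derivatives are zero, so the evenly divided voltages form an equilibrium. Raising only the bottom SM's power makes $\dot{V}_1$, $\dot{V}_2$, and $\dot{V}_3$ negative while $V_4$ stays at $V_{DD}$, meaning the bottom layer's voltage shrinks and the top layer's grows.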

Assuming small voltage disturbances, we can linearize the above system around its equilibrium point, where $[V_1\ V_2\ V_3\ V_4]' = [1\ 2\ 3\ 4]'$, resulting in the final linear dynamic system equation (3.25), which has the classic form (3.26).

$$
\begin{bmatrix} \dot{V}_1 \\ \dot{V}_2 \\ \dot{V}_3 \\ \dot{V}_4 \end{bmatrix}
=
\begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} V_1 \\ V_2 \\ V_3 \\ V_{DD} \end{bmatrix}
+
\begin{bmatrix} -\frac{1}{C} & \frac{1}{C} & 0 & 0 \\ -\frac{1}{C} & 0 & \frac{1}{C} & 0 \\ -\frac{1}{C} & 0 & 0 & \frac{1}{C} \\ 0 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} P_1 \\ P_2 \\ P_3 \\ P_4 \end{bmatrix}
+
\begin{bmatrix} \frac{\Delta I_1}{C} \\ \frac{\Delta I_2}{C} \\ \frac{\Delta I_3}{C} \\ 0 \end{bmatrix}
\tag{3.25}
$$

$$\dot{X} = AX + BU + \Delta F \tag{3.26}$$

where $X = [V_1\ V_2\ V_3\ V_4]'$ is the state of the above linear dynamic system; $A$ is the state matrix; $B$ is the control input matrix; and $U = [P_1\ P_2\ P_3\ P_4]'$ gives the SM power levels, which are the control inputs of the system. $\Delta F$ captures the current disturbance that incurs supply noise.
We consider a classic proportional state feedback controller $U = KX$ as an illustrative example, as it is considered an effective stabilization technique with both computational advantages and satisfactory regulation results. In a proportional state feedback controller, the SM power is a function of the SM voltage:

$$P_i = k V_i \tag{3.27}$$

where $k$ is the proportional feedback coefficient. Hence, the system with feedback control can be represented as:

$$\dot{X} = AX + BKX + \Delta F = (A + BK)X + \Delta F \tag{3.28}$$

3.6.2 Control Stability and Performance

The control delay plays an important role in determining the system stability and control performance in real applications. We express the total delay as $T$, which includes the sensor/actuator delay and the communication and computation latencies along the feedback loop. We discretize the system with a sampling period of $T$:

$$X(n + 1) = Z(A + BK) X(n) + \Delta F \tag{3.29}$$

where $Z(A + BK)$ is the discretization of the matrix $A + BK$ with sampling period $T$. The system model depicted by (3.29) suggests that $V_1$, $V_2$, and $V_3$ are controllable and $V_4$ is equal to $V_{DD}$. We use MATLAB (R2018a) Simulink to examine the system's dynamic response and select the proper coefficient $k$. It can be shown that the largest voltage deviations are caused by the worst-case disturbance $\Delta F$. When the disturbance frequency falls within half of the discrete system sampling frequency, $\frac{1}{2T}$, the voltage deviations are guaranteed to be suppressed within a fixed range (i.e., 0.2 V), and a formal proof can be obtained by analyzing the Bode plot of the discrete system described by (3.29). In this way, we can rigorously prove that our control scheme is not only stable but also guaranteed to constrain the supply noise within the bound of the predetermined voltage margin. In addition to the theoretical proof, we are able to experimentally verify the system's stability and control performance under both worst-case disturbances and representative benchmark workloads in Section VI-B.
In essence, formulating the on-chip VS power grid as a discrete-time linear dynamic system allows us to employ rigorous voltage smoothing mechanisms in the VS setting. The sampling period $T$ of the discretized system accounts for the various latencies introduced by real implementations of the front-end detector, the controller, and the back-end actuator. To effectively mitigate the dominant low frequency plateau exhibited by the effective impedance of the VS GPU, we need the total latency to be such that the low frequency peaks safely fall within $\frac{1}{2T}$. The detailed choice of actuation mechanisms and the implementation considerations of the voltage smoothing scheme are discussed next.

Figure 3.14: Timescales of different power actuation mechanisms (thread migration, power gating, dynamic frequency scaling (DVFS), DIWS, FII, and DCC, placed along frequency, time, and clock-cycle axes spanning the low, medium, and high frequency ranges).
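The latency requirement implied by the $\frac{1}{2T}$ bound can be turned into a back-of-the-envelope budget. The Python sketch below assumes, for illustration only, that the disturbances to be covered extend up to the 10 MHz edge of the low frequency range and that the loop runs at the 700 MHz clock used in Algorithm 1; the helper names are hypothetical.

```python
CLOCK_HZ = 700e6  # SM clock frequency used in Algorithm 1


def max_loop_delay_s(f_disturbance_hz):
    """Largest sampling period T such that f_disturbance <= 1 / (2 T)."""
    return 1.0 / (2.0 * f_disturbance_hz)


def delay_in_cycles(delay_s, clock_hz=CLOCK_HZ):
    """Express a delay in clock cycles."""
    return delay_s * clock_hz


t_max = max_loop_delay_s(10e6)  # assumed upper edge of the low frequency range
cycles = delay_in_cycles(t_max)
print(f"T <= {t_max * 1e9:.0f} ns, about {cycles:.0f} cycles")
```

Under these assumptions the combined detector, controller, and actuator latency would need to stay within roughly 50 ns, or about 35 clock cycles. A lower disturbance frequency of interest relaxes this budget proportionally.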

3.6.3 Voltage Smoothing Actuation

As described by equation (3.25), the power consumption of the SM in each VS layer can be used as a control input, which suggests that any mechanism that actively modulates SM power can be considered a type of voltage smoothing actuator. Fig. 3.14 surveys several typical power management techniques in a GPU together with their respective response time scales. To achieve effective control, the actuator response time generally has to be at least an order of magnitude faster than the time scale of the relevant disturbance. In the case of a voltage-stacked GPU, we have shown earlier through impedance analysis that the part of the noise to be suppressed using architecture-level techniques is associated with the low frequency impedance caused by the residual current components. Therefore, the maximum response time required of the voltage smoothing actuator is on the order of hundreds of clock cycles, or around tens of MHz. Techniques such as Thread Migration [138–140] and Power Gating [141, 142] require content migration or state saving and operate at slower time scales (longer than 1000 clock cycles). The speed of Dynamic Frequency Scaling is determined by the re-locking time of the digital phase-locked loop (DPLL) and is typically on the order of milliseconds [143, 144]. Our survey rules out these slower techniques and identifies three promising candidates for voltage smoothing actuation: dynamic issue width scaling, fake instruction injection, and dynamic current compensation.
Dynamic Issue Width Scaling (DIWS): In the Fermi architecture, each SM has a 2 warp/cycle issue width. Each warp includes 16 instructions. Either one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four special function units (SFU), and one block of 16 load/store units (LSU), as shown in Fig. 3.15. To reduce the SM power, its warp issue width can be reduced, and it can later be restored to 2 warps/cycle when voltage smoothing is no longer needed. One appealing advantage of DIWS is its low performance penalty when dynamically scaled. Before a warp is issued, the warp scheduler first checks with the scoreboard. Only when the warp is marked ready in the scoreboard can it be issued. Therefore, although each SM has a 2 warp/cycle issue width, the number of warps issued in each cycle varies at runtime. In our experiments with benchmarks from Rodinia and the NVIDIA CUDA SDK, the average issue rate is 0.8 to 1.8 warps per cycle due to data dependences, memory stalls, compute stalls, and idle cycles. When DIWS is applied, the peak issue rate is reduced, which throttles the performance in certain cycles, but more "ready" warps may accumulate in the warp pool. These accumulated "ready" warps can be issued opportunistically later to fully occupy the issue width, offering a speedup that partly compensates for the performance loss from previous cycles. For example, in Fig. 3.15, the warps in cycles 1 to i are issued without DIWS; during cycles i to k − 1, DIWS sets the issue width to 1; and from cycles k to n, the issue width is back up to 2.

Figure 3.15: SM microarchitecture and operation of dynamic issue width scaling. In the illustrated example, cycles 1 to i run without DIWS (issue rate = 1.5), cycles i to k − 1 run with DIWS setting the issue width to 1 (issue rate = 1), and cycles k to n run with the issue width restored to 2 (issue rate = 2).
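The catch-up behavior of DIWS described above can be mimicked with a very small queue model. The Python sketch below is a toy under stated assumptions: the per-cycle trace of newly ready warps is invented, the scoreboard is reduced to a single counter of ready warps, and the function name is hypothetical; it is not a model of the actual Fermi scheduler.

```python
def simulate_issue(ready_per_cycle, width_per_cycle):
    """Issue ready warps each cycle, limited by that cycle's issue width.

    ready_per_cycle[c]: warps that become ready in cycle c (invented trace)
    width_per_cycle[c]: issue width allowed in cycle c (1 or 2 under DIWS)
    Returns the number of warps issued in each cycle.
    """
    pool = 0  # ready warps waiting in the warp pool
    issued = []
    for ready, width in zip(ready_per_cycle, width_per_cycle):
        pool += ready
        n = min(pool, width)
        pool -= n
        issued.append(n)
    return issued


ready = [2, 1, 2, 1, 1, 1, 0, 1]
no_diws = simulate_issue(ready, [2] * 8)
with_diws = simulate_issue(ready, [2, 2, 1, 1, 1, 2, 2, 2])  # throttled in cycles 2 to 4
print(no_diws, with_diws)
```

In this trace the throttled run issues one fewer warp while the width is limited to 1, then issues two warps in the first unthrottled cycle, so both runs complete the same number of warps over the window. Real workloads may not recover fully, which is why the text speaks of partial compensation.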
Fake Instruction Injection (FII): Inserting fake instructions to fill up the issue width
slack also can be used to introduce extra power consumption. Like DIWS, FII operates
at the warp issuing speed, and thus has a fast response time. FII can leverage existing
GPU architectures and does not require extra circuitry or die area to implement, but its
availability is limited by the difference between the number of valid instructions and the
maximum issue width at each cycle: when there are already two valid instructions in the
warp pool, no extra instruction can be injected.
Dynamic Current Compensation (DCC): Finally, dummy digitally controlled current
sources can be added on-chip to provide extra current/power and thus help balance the layer
currents. We refer to this method as dynamic current compensation (DCC). While a similar
method has been implemented using ring oscillator circuits [118], we employ binary-weighted
current ladder circuits that are widely used as digital-to-analog converters (DACs). These
DACs can be digitally controlled at runtime to compensate layer current imbalance at the
time scale of a single clock cycle. Compared to DIWS and FII, deploying DCC requires
extra die area and consumes more leakage power, and thus should be used sparingly to avoid
energy and area penalties.
Weighted Control Inputs: Given that each of these three temporally suitable actuation mechanisms has its own merits and drawbacks, we consider a weighted linear combination of DIWS, FII, and DCC to exert the control inputs in equation (3.25). Therefore, the actual control inputs can be expressed as follows:

$$P_{SM} = w_1 P_{dyn,ins} \frac{IssueWidth}{\max(IssueWidth)} + w_2 P_{dyn,ins} N_{FII} + w_3 P_{d0} N_{DCC} \tag{3.30}$$

where $w_1$, $w_2$, and $w_3$ are the respective weights for the power components of DIWS, FII, and DCC; $P_{dyn,ins}$ represents the dynamic power of the SM while executing the instruction ins; $P_{d0}$ represents the unit power of the least significant bit (LSB) of the DCC current DAC; $N_{FII} \in \{0, 1, 2\}$ is the number of fake instructions injected; and $0 \le N_{DCC} \le 2^{n_{DCC}}$ is the digital code that controls the $n_{DCC}$-bit current DAC to implement DCC. Formulating the control input as a weighted sum allows us to explore the design space of our proposed voltage smoothing method by sweeping different combinations for the same power effect, and to find optimal control strategies under different optimization objectives.
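Eq. (3.30) maps directly onto a small helper, sketched here in Python. The function name, argument names, and all numeric values are hypothetical; only the structure of the weighted sum and the ranges of $N_{FII}$ and $N_{DCC}$ follow the text.

```python
def sm_power(weights, p_dyn_ins, issue_width, max_issue_width, n_fii, p_d0, n_dcc, n_dcc_bits):
    """Control input P_SM of Eq. (3.30) for one SM, in watts."""
    w1, w2, w3 = weights
    if not 0 <= issue_width <= max_issue_width:
        raise ValueError("issue width out of range")
    if n_fii not in (0, 1, 2):
        raise ValueError("N_FII must be 0, 1, or 2")
    if not 0 <= n_dcc <= 2 ** n_dcc_bits:
        raise ValueError("N_DCC out of range for the current DAC")
    diws = w1 * p_dyn_ins * issue_width / max_issue_width  # DIWS term
    fii = w2 * p_dyn_ins * n_fii                           # FII term
    dcc = w3 * p_d0 * n_dcc                                # DCC term
    return diws + fii + dcc


# Hypothetical numbers: 2 W dynamic power per instruction stream, 10 mW DAC LSB.
diws_only = sm_power((1.0, 0.0, 0.0), 2.0, issue_width=1, max_issue_width=2,
                     n_fii=0, p_d0=0.01, n_dcc=0, n_dcc_bits=5)
diws_plus_dcc = sm_power((1.0, 0.0, 1.0), 2.0, issue_width=1, max_issue_width=2,
                         n_fii=0, p_d0=0.01, n_dcc=16, n_dcc_bits=5)
```

Sweeping the weight tuple while holding the target $P_{SM}$ fixed is one way to compare actuator mixes that produce the same power effect.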

3.6.4 Implementation Considerations

A number of circuit-level and microarchitecture-level changes have to be made in a GPU system to accommodate the proposed control-theory-driven voltage smoothing technique. Fig. 3.16 illustrates the overall architecture to implement our scheme, which consists of the front-end detector and back-end actuator circuits, the voltage smoothing controller, and the VS-aware power management hypervisor.

Figure 3.16: Implementation of the proposed cross-layer VS GPU solution with architectural support for voltage smoothing and VS-aware PM hypervisor.

Detector and actuator

To monitor spatial and temporal voltage fluctuations, front-end voltage detectors are placed
close to each SM. An RC low-pass filter is applied before the voltage detector to filter out
high-frequency noise. The cutoff frequency of the filter is $\omega_c = 50$ MHz, and it can be
implemented with a 10 kΩ resistor and a 2 pF capacitor, which together occupy 1120 µm² of area.
On-chip voltage detector circuits can be implemented in a number of ways, using on-die droop
detector (ODDD) [145–147], critical path monitor (CPM) [148], or analog-to-digital converter
(ADC) [149] approaches, as listed in Table 3.4. All these voltage sensing/inference methods
are compatible with the front-end detector requirements of our proposed scheme.

Table 3.4: Voltage Detector Options

Sensor   Latency (cycle)   Power (mW)   Resolution (mV)   Output
ODDD     1-2               0-10         10-20             detect indicator
CPM      10-100            30-60        10-100            timing variation
ADC      1-10              10-100       1/2^N V           N-bit digit signal

The back-end actuators consist of the instruction issue adjuster embedded in the warp
scheduler at each SM to support DIWS and FII, and the binary-weighted current DAC located near
the load of each distributed CR-IVR to support DCC. The instruction issue adjuster arbitrates
the instruction issue width and issues fake instructions to exert power actuation. Since each
SM can issue up to two instructions per cycle, we can adjust the total number of instructions
issued every N cycles to achieve finer-grained control resolution, from 1 to 1/N instructions
per cycle on average. For instance, if the issue width is set to 1.7 instructions per cycle, it
is adjusted by setting the down-counter that arbitrates the instruction issue to 17, with a
reset every 10 cycles.
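The down-counter arbitration can be illustrated with a short Python model. This is only a sketch under stated assumptions: the dissertation does not specify how the 17 issue slots are spread over the 10-cycle window, so the greedy policy below (issue as many instructions as allowed until the counter reaches zero) and the function name `issue_schedule` are hypothetical.

```python
def issue_schedule(budget, window, max_per_cycle=2):
    """Model a down-counter that limits instructions issued per window.

    budget        -- counter value loaded at each reset (e.g. 17)
    window        -- number of cycles between resets (e.g. 10)
    max_per_cycle -- hardware issue limit per cycle (two per SM in the text)

    Returns the number of instructions issued in each cycle of one window.
    Assumption: issue greedily until the counter is exhausted.
    """
    remaining = budget
    issued = []
    for _ in range(window):
        n = min(max_per_cycle, remaining)  # never exceed hardware width or budget
        issued.append(n)
        remaining -= n                     # count down by the number issued
    return issued


schedule = issue_schedule(budget=17, window=10)
average_width = sum(schedule) / len(schedule)  # average instructions per cycle
```

With a window of N cycles the average issue width can be set in steps of 1/N instructions per cycle, which is the finer-grained resolution described above.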

Voltage smoothing controller

The voltage smoothing controller executes the boundary triggered control algorithm using
measured voltages from the detectors, and sends the updated issue width to the instruction
issue adjuster. Algorithm 1 shows an implementation of the proportional control algorithm.
To reduce the negative effect of voltage smoothing on system performance, the controller
is triggered by real-time supply noise measurements from the voltage detectors and only
intervenes when a voltage droop below a certain threshold is detected. To evaluate the
performance and overhead of the voltage smoothing controller accurately, we implement the
controller and the SM instruction issue adjusters using VHDL². We synthesize the VHDL
code in the Synopsys Design Compiler with TSMC 40 nm technology, which is comparable to
the process used in the NVIDIA Fermi GPU. The voltage smoothing controller and the 16 SM
instruction issue adjusters in total consume 1.634 mW of power and occupy 3084 µm² of area when
operating at the same GPU frequency of 700 MHz. Finally, we account for control latency
from several components: the detector response time, the controller computation time, the
actuation delay, and the round-trip communication delay between the detector/actuator
and the controller. We obtain the detector response time from previous work, calculate
the controller computation time and the actuation delay based on our synthesized circuit
model, and estimate the communication delay using an Elmore delay model based on tapered
inverter buffer chains, assuming the controller is situated in the middle voltage stacking layer
near the center of the SM.

² https://github.com/xz-group/gpuvs.git

Algorithm 1: Streaming Multiprocessor Power Controller
Input: Measured voltage from voltage sensor V_(i,j)
Output: Issue width: Issue_SM(i,j); fake rate: N_fake-SM(i,j)
Procedure: The Controller
1: Read in measured voltage: V_(1,1) ... V_(N_layer, N_column);
2: for (i ≤ N_layer, j ≤ N_column) do:
3:     Calculate SM(i,j) voltage: V_SM(i,j) = V_(i,j) − V_(i−1,j);
4:     if (V_SM(i,j) < V_threshold) then:
           Power control enable:
           SM(i,j) = active; n_SM = n_SM + 1;
           Issue_SM(i,j) = Issue_max − k1 × w1 × (1 − V_SM(i,j));
           N_fake-SM(i+1,j) = k2 × w2 × (1 − V_SM(i,j));
           P_current-SM(i+1,j) = k3 × w3 × (1 − V_SM(i,j));
           where k1, k2, k3 are proportional control factors
       end if
   end for //finish a round of calculation
6: return Issue_SM(i,j), N_fake-SM(i,j), P_current-SM(i,j)
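The proportional control step of Algorithm 1 can be sketched in Python as follows. This is not the authors' VHDL implementation; it is a behavioral model under assumptions that the listing leaves open: the node below the bottom layer is treated as ground (0 V), the FII and DCC updates for the layer above are skipped when no such layer exists, and the function and variable names are hypothetical.

```python
def smoothing_step(v_node, v_threshold, issue_max, k, w):
    """One round of the proportional controller in Algorithm 1.

    v_node[i][j] -- measured node voltage at the top of layer i, column j
                    (index 0 is the bottom layer; ground is assumed below it)
    k, w         -- (k1, k2, k3) control factors and (w1, w2, w3) weights
    Returns (issue, fake, dcc): per-SM issue width, fake-instruction rate,
    and DCC power command.
    """
    k1, k2, k3 = k
    w1, w2, w3 = w
    n_layer, n_col = len(v_node), len(v_node[0])
    issue = [[issue_max] * n_col for _ in range(n_layer)]
    fake = [[0.0] * n_col for _ in range(n_layer)]
    dcc = [[0.0] * n_col for _ in range(n_layer)]
    for i in range(n_layer):
        for j in range(n_col):
            below = v_node[i - 1][j] if i > 0 else 0.0   # assumption: ground below layer 0
            v_sm = v_node[i][j] - below                  # voltage across SM(i, j)
            if v_sm < v_threshold:                       # boundary trigger
                err = 1.0 - v_sm                         # deviation from the nominal 1 V
                issue[i][j] = issue_max - k1 * w1 * err  # throttle the drooping SM (DIWS)
                if i + 1 < n_layer:                      # assumption: skip if no layer above
                    fake[i + 1][j] = k2 * w2 * err       # FII on the SM above
                    dcc[i + 1][j] = k3 * w3 * err        # DCC on the SM above
    return issue, fake, dcc


# Illustrative single-column stack: layer voltages are 1.0, 0.8, 1.2, 1.0 V.
issue, fake, dcc = smoothing_step(
    v_node=[[1.0], [1.8], [3.0], [4.0]], v_threshold=0.9,
    issue_max=2.0, k=(1.0, 1.0, 1.0), w=(0.5, 0.3, 0.2))
```

In this example only the second layer (0.8 V) falls below the threshold, so only that SM is throttled and only the SM above it receives FII and DCC commands.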

VS-aware power management hypervisor
Algorithm 2: VS-aware Power Management Hypervisor
Input: Command from OS: f_SM(i,j), gate_SM(i,j)
Output: Command to SMs: f'_SM(i,j), gate'_SM(i,j)
Procedure: Command Mapping
1: Read in operating system command:
       f_SM(1,1) ... f_SM(N_layer, N_column), gate_SM(1,1) ... gate_SM(N_layer, N_column)
2: for (i ≤ N_layer, j ≤ N_column) do:
3:     Calculate Δf_SM(i,j), Δp_leakage-SM(i,j):
           Δf_SM(i,j) = f_SM(i,j) − f_SM(i+N_layer, j);
           Δp_leakage-SM(i,j) = p_leakage-SM(i,j) − p_leakage-SM(i+N_layer, j);
4:     Update f_threshold-SM(i,j), p_threshold-SM(i,j);
5:     if (|Δf_SM(i,j)| > f_threshold-SM(i,j)) then:
           Increase the frequency of SM(i+N_layer, j):
           f'_SM(i,j) = min(f_SM(≠i,j)) + f_threshold-SM(i,j);
       end if
6:     if (|Δp_leakage-SM(i,j)| > p_threshold-SM(i,j)) then:
           gate'_SM(i,j) = 0
       end if
   end for
7: return f'_SM(i,j), gate'_SM(i,j)
Due to voltage stacking's unique topology and constraints on layer current imbalance, previous
VS studies have not thoroughly explored its compatibility with higher-level power
optimization techniques such as dynamic frequency scaling (DFS) [150–153] and power gating
(PG) [154–156]. We consider the implications of collaborative power management in
a voltage stacking setting and propose a voltage-stacking-aware hypervisor layer to interface
with other power techniques. This hypervisor interface is added between the operating
system layer and the GPU architecture layer, as illustrated in Fig. 3.16. Since the voltage
smoothing actuation mechanisms (DIWS, FII, and DCC) used in our cross-layer solution
are orthogonal to the optimization mechanisms (frequency scaling and power gating) used
in other techniques, we can accommodate these higher-level mechanisms, which often operate
over longer time scales, in the same control framework. The most significant impact of
higher-level power management via frequency scaling and power gating on voltage stacking is
that it may inadvertently introduce current imbalance, because the power or performance
optimization strategies apply different scaling/gating actions to different SMs. In
terms of reliability, since these power-management-induced imbalances do not exceed the
worst-case imbalance analyzed previously, system reliability is still guaranteed by our
control theory driven approach. However, a large imbalance could lead to undesirable energy
loss associated with the on-chip CR-IVRs and performance penalties associated with
throttling actions in the voltage smoothing mechanism. Here, we propose a heuristic optimization
algorithm, shown in Algorithm 2, to constrain layer current imbalance and alleviate
performance penalties. The VS-aware hypervisor actively maintains balanced power across each
voltage stack by preventing the frequency scaling and power gating requested by the power
optimization techniques from exceeding a maximum power imbalance budget. The budget
is dynamically adjusted according to the SM performance loss, which gauges how many
instructions have been throttled due to voltage smoothing.

3.7 Advanced Power Management

A well-designed power delivery system should be compatible with advanced high-level power
management techniques, the most common of which are dynamic voltage and frequency scaling
(DVFS) and power gating. In this section, we discuss applying DVFS and power gating
together with the proposed hybrid regulation in voltage-stacked GPU systems.

3.7.1 Dynamic Voltage and Frequency Scaling

Dynamic voltage and frequency scaling (DVFS) adjusts the supply voltage and frequency
of a voltage domain to boost performance or save power. In voltage stacking, each layer
can be divided into different voltage domains, and the division should be consistent across
layers to maintain the stacked power delivery. In this chapter, we assume the SMs
in each layer share one voltage domain in the proposed GPU many-core voltage-stacked
system. When DVFS is applied in voltage stacking, each layer (voltage domain) may have a
different voltage. A low drop-out (LDO) voltage regulator is used to recycle the imbalance
current because an LDO allows each layer to have its own voltage. We therefore use LDO
hybrid regulation in the following voltage stacking DVFS analysis and evaluations. When one
voltage domain needs to change to a different supply voltage, the LDO regulator can change
its reference voltage and conversion ratio to adjust the voltage of each
layer [93, 95, 157].
Compared to the original voltage stacking, DVFS maintains the stacked power delivery but
may bring more frequent current imbalance. Because the current imbalance introduced by DVFS
does not go beyond the worst cases studied in Section 3.4, where one layer is totally powered
off, the proposed hybrid regulation can effectively guarantee the system's stability under
DVFS operation. Although the amplitude of the imbalanced current does not exceed the
worst case, the extra current imbalance introduced by DVFS causes more power loss
in the CR-IVR and CR-VRM. Compared with the original voltage stacking, part of the efficiency
benefit is therefore sacrificed under DVFS.

Table 3.5: LDO Regulator Parameters

Design Parameters        CR-IVR            CR-VRM
Number of VR             4                 1
Switch frequency         50 MHz            2 MHz
Total capacitor per VR   1.1 µF            600 µF
Capacitor density        50 nF/mm²         0.2 µF/mm²
Switch on resistance     130 Ω·µm          37600 Ω·µm
Area per VR              22.0 mm² (Die)    3.1 cm² (Board)

3.7.2 Power Gating

Power gating turns off the circuitry inside a core, or the core itself, while it is not in
use. Power gating introduces current imbalance and also causes supply voltage noise. The
most severe imbalance happens when one layer is totally powered off while the other layers are
working. This scenario is already captured by the supply voltage noise worst-case analysis in
Section 3.4, and the supply voltage can also be guaranteed by the proposed hybrid regulation
as described in Section 3.5. Similar to DVFS, the extra imbalanced current introduced
by power gating causes more power loss in the CR-IVR and CR-VRM, thus degrading
efficiency gains.

3.7.3 Power Management Hypervisor in Voltage Stacking

DVFS, power gating, and other power management techniques optimize the power and
performance tradeoffs based on commands from the software operating system. At the
software command level, power management techniques should take voltage stacking into
consideration, and many techniques such as fast thread migration [140] can balance the
workload before current imbalance happens. To make the correct decision at the software
level, the power management techniques first need to know the potential power benefit and
performance loss and then find the proper tradeoff point. To estimate the potential power
benefit that each core can earn, we introduce a power management technique estimator
for software-level power management, as described in Algorithm 4. The estimator
evaluates the potential net power consumption of each core considering the extra power loss
in the power delivery system.

Algorithm 4 Power Saving from DVFS and Power Gating
Input Variables:
    DVFS / power gating command: f^core_{i,j} / P^core-gate_{i,j}
Output Variables:
    Power estimation of each core: P^core_{i,j}
Steps:
1: Replace I^core_{i,j} with f^core_{i,j} / P^core-gate_{i,j} in Eq. (3.4) - (3.7).
2: Calculate residual frequency / gated power: f^R_{i,j} / P^R-gate_{i,j}.
3: Residual current can be known as:
       I^R_{i,j} = αCV f^R_{i,j}  /  P^R-gate_{i,j} / V
4: Calculate VR loss P_CR-IVR / P_CR-VRM in Eq. (4.4)-(3.20).
5: Return power estimation:
       P^core_{i,j} = αCV² f^core_{i,j} / P^core-gate_{i,j} − P_CR-IVR − P_CR-VRM
At the hardware level of the power delivery system, we provide a power management hypervisor
for DVFS and power gating commands in voltage stacking that guarantees power delivery
efficiency. According to Sections 3.4 and 3.5, the power loss in voltage stacking comes from
the accumulated residual current component going through the charge-recycling voltage
regulators. The hypervisor guarantees the power delivery efficiency by limiting the maximum
allowed residual current, as described in Algorithm 5. In the hypervisor, the residual current
of each core under DVFS and power gating is calculated with Eq. (3.4) - (3.7). The residual
current threshold $\Delta I_{threshold}$ is given to limit the residual current $I^{R}_{i,j}$
and bound the power loss in the power delivery system. Each core whose residual current exceeds
the threshold $\Delta I_{threshold}$ or $P_{threshold}$ is then compensated by
$I^{R}_{i,j} - \Delta I_{threshold}$ to make sure that the residual current $I^{R}_{i,j}$
and the power loss in the power delivery system are limited to the desired range.
Algorithm 5 Power Management Hypervisor in VS
Input Variables:
    Commands in conventional system: f^core_{i,j} / P^core_{i,j}
Output Variables:
    Commands for voltage stacking: f'^core_{i,j} / P'^core_{i,j}
Steps:
1: Replace I^core_{i,j} with αCV f^core_{i,j}, P^core_{i,j} / V in Eq. (3.4) - (3.7).
2: Calculate residual current: I^R_{i,j}
3: Dynamic Voltage Frequency Scaling:
   for i, j = 1, 2, 3, 4 do
       if |I^R_{i,j}| > |ΔI_threshold| then
           I'^core_{i,j} = I^core_{i,j} − (I^R_{i,j} − ΔI_threshold)
           f'^core_{i,j} = I'^core_{i,j} / (αCV)
       else
           f'^core_{i,j} = f^core_{i,j}
   Return DVFS commands: f'^core_{i,j}
4: Power Gating:
   for i, j = 1, 2, 3, 4 do
       if |I^R_{i,j}| > |ΔI_threshold| then
           I'^core_{i,j} = I^core_{i,j} − (I^R_{i,j} − ΔI_threshold)
           P'^core_{i,j} = V I'^core_{i,j}
       else
           P'^core_{i,j} = P^core_{i,j}
   Return power gating commands: P'^core_{i,j}
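The per-core update in Algorithm 5 can be sketched in Python as below. The residual current comes from Eq. (3.4) - (3.7), which are outside this section, so the sketch takes it as an input rather than computing it. The function names, the lumped parameter `alpha_c` (standing for the product αC), and the numeric values are assumptions for illustration; the update rule is applied exactly as written in the listing, which is most natural for a positive residual current.

```python
def limit_core_current(i_core, i_residual, di_threshold):
    """Reduce the core current by the residual current in excess of the threshold."""
    if abs(i_residual) > abs(di_threshold):
        return i_core - (i_residual - di_threshold)
    return i_core


def dvfs_command(f_core, i_residual, di_threshold, alpha_c, v):
    """Step 3: map a requested frequency to a VS-safe frequency using I = alpha*C*V*f."""
    i_core = alpha_c * v * f_core
    i_new = limit_core_current(i_core, i_residual, di_threshold)
    return i_new / (alpha_c * v)


def gating_command(p_core, i_residual, di_threshold, v):
    """Step 4: map a requested core power to a VS-safe power using P = V*I."""
    i_core = p_core / v
    return v * limit_core_current(i_core, i_residual, di_threshold)


# Illustrative values only: alpha*C = 1 nF and V = 1 V give 0.7 A at 700 MHz.
f_limited = dvfs_command(f_core=700e6, i_residual=0.3, di_threshold=0.1,
                         alpha_c=1e-9, v=1.0)
f_unchanged = dvfs_command(f_core=700e6, i_residual=0.05, di_threshold=0.1,
                           alpha_c=1e-9, v=1.0)
p_limited = gating_command(p_core=2.0, i_residual=0.5, di_threshold=0.2, v=1.0)
```

In the first call the residual current exceeds the threshold by 0.2 A, so the core current is cut from 0.7 A to 0.5 A and the command is lowered accordingly; in the second call the request passes through unchanged.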

3.8 Evaluation of Hybrid Regulation

In this section, we evaluate the hybrid regulated GPU many-core voltage-stacked system
in terms of supply voltage noise, power delivery efficiency, and advanced power management
compatibility, and finally compare it with other power delivery systems. We develop a
hybrid simulation infrastructure that combines SPICE3 [158] and GPGPU-Sim 3.1.1 (with
GPUWattch) [79, 159]. SPICE3 simulates the circuit transient response of the full
voltage-stacked power delivery system and the charge-recycling voltage regulators as
illustrated in Fig. 3.9, and GPGPU-Sim 3.1.1 simulates the GPU architecture level system
specified in Table 3.1. We use ten representative benchmarks that cover a wide range of
scientific and computational domains from two benchmark suites, five from Rodinia 2.0 [81]
and five from NVIDIA CUDA SDK [160].

Figure 3.17: Evaluation of the supply voltage noise in hybrid regulated voltage stacking
system. (a) Supply voltage noise comparison between SC / LDO hybrid regulated and default
voltage stacking. (b) Worst supply noise distribution.

3.8.1 Supply Voltage Noise Evaluation

We first evaluate the supply voltage noise across real GPU benchmarks and the worst case
derived by Algorithm 1. As shown in Fig. 3.17(a), in default voltage stacking without any
voltage regulation, the supply voltage suffers large noise, especially under the worst case. As
demonstrated by the noise histograms in Fig. 3.17(a) and 3.17(b), after deploying hybrid
regulation in the voltage-stacked GPU system, the supply voltage noise across both the
benchmarks and the worst case is limited to a range of 0.2 V, comparable to a conventional
single-layer power delivery system³. One of the key strengths of our hybrid approach is
its use of the more expensive on-chip regulator for high-frequency noise mitigation and the
more economical off-chip regulator for low-frequency noise mitigation. This choice avoids
overdesign of the on-chip CR-IVR, saves significant on-die area, and provides worst-case
guaranteed reliability.

³ 0.2 V is the voltage margin used in commercial GPU systems for tolerable supply noise [77].

Figure 3.18: Power delivery efficiency comparison between voltage-stacked system with
SC/LDO hybrid regulation and conventional single-layer system across ten benchmarks.

3.8.2 Efficiency in Real Applications

We evaluate the system-level power delivery efficiency (PDE) by running a wide range of
real GPU benchmarks on our integrated hybrid simulation infrastructure. We compare our
hybrid regulated voltage-stacked system in Fig. 3.9 with the conventional single-layer power
delivery system with a board-level voltage regulator module (VRM), which is the default
GPU power delivery system [77, 161].

Table 3.6: SM Core DVFS Frequency and Voltage Pairs

Core freq. (MHz)    700    650    600    550    300
Core voltage (V)    1      0.95   0.91   0.87   0.46

The normalized breakdown of the full system power delivery efficiency across benchmarks is
shown in Fig. 3.18. On average, voltage-stacked power delivery system configurations (with
hybrid regulation) can deliver power at close to 93.5% efficiency with switched capacitor
charge-recycling voltage regulators and 92.3% efficiency with LDO charge-recycling voltage
regulators, as compared to 79% for the single-layer VRM (conventional baseline). The
charge-recycling voltage regulator in voltage stacking outperforms the step-down voltage
regulator in the single-layer PDS because the former only needs to shuffle the accumulated
imbalanced part, usually within 20% of the layer power, whereas the latter delivers the total
power. For example, in benchmark Transpose, only 11.8% and 2.9% of the current is imbalanced
current that goes through the CR-IVR and CR-VRM respectively, causing 3.7% and
1.1% of power loss in the switched capacitor CR-IVR and CR-VRM respectively.
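To put the average efficiencies quoted above in terms of power drawn from the source, the short Python calculation below compares the input power needed to deliver the same core power at each PDE. The efficiency values are the ones stated in the text; the helper name and the assumption of identical core power in all three systems are illustrative.

```python
def input_power_saving(pde_new, pde_base):
    """Fraction of source power saved when the same core power is delivered
    at efficiency pde_new instead of pde_base (input power = core power / PDE)."""
    return 1.0 - pde_base / pde_new


PDE_VS_SC = 0.935    # voltage stacking with switched capacitor CR regulators
PDE_VS_LDO = 0.923   # voltage stacking with LDO CR regulators
PDE_BASELINE = 0.79  # conventional single-layer VRM

saving_sc = input_power_saving(PDE_VS_SC, PDE_BASELINE)    # roughly 15.5%
saving_ldo = input_power_saving(PDE_VS_LDO, PDE_BASELINE)  # roughly 14.4%
```

Under this equal-core-power assumption, the gap between 79% and about 93% efficiency corresponds to roughly a 15% reduction in power drawn from the source.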

3.8.3 Compatibility with Advanced Power Management

First, we leverage the common and classic DVFS algorithm proposed in [152] to explore
per-core DVFS on a voltage-stacked GPU system. This algorithm monitors and predicts the
application status (compute bound or memory bound) to adjust each core and memory frequency.
The SM core frequency and voltage pairs are shown in Table 3.6. In the conventional
single-layer power delivery system, each core has its own frequency and voltage. In voltage
stacking, the cores in each layer share a voltage domain, and the highest voltage and
frequency requested by the cores in one layer are used as the voltage and frequency for this
layer. We evaluate DVFS on voltage stacking and compare it with DVFS on the conventional
single-layer power delivery system in Fig. 3.19. Although DVFS on voltage stacking causes
more power loss than normal execution on voltage stacking, it still has a higher power
delivery efficiency than the conventional power delivery system on most benchmarks,
Transpose being an exception. This is because the GPU's single instruction multiple thread
(SIMT) architecture leads to synchronized activity and synchronized DVFS commands across the
cores most of the time. In addition, as shown by the right bars in Fig. 3.19, the power
delivery efficiency guided hypervisor in Algorithm 5 can further prevent the aggravated power
loss in the CR-IVR and CR-VRM by limiting the occasional current imbalance from DVFS. The
power delivery efficiency guided hypervisor helps voltage stacking achieve nearly 90% power
delivery efficiency under DVFS operations.

Figure 3.19: DVFS power saving comparison between conventional single-layer system and
voltage-stacked system with hybrid regulation across benchmarks.
The normalized energy consumption across benchmarks of the conventional single-layer system,
DVFS on the conventional single-layer system, and DVFS on power delivery efficiency guided
voltage stacking is shown in Fig. 3.20. On the conventional single-layer system, DVFS reduces
the energy consumption of the cores across most benchmarks. On power delivery efficiency
guided voltage stacking, the energy consumption of the cores is also reduced relative
to the conventional single-layer system without DVFS, but not by as much as with
DVFS on the single-layer system. This is because the power delivery efficiency guided
hypervisor modifies the aggressive DVFS commands that cause current imbalance and low power
delivery efficiency. Although the energy consumption of the cores is higher than with DVFS on
the conventional single-layer system, when the energy loss in the power delivery system is
taken into consideration, DVFS on voltage stacking achieves the best overall energy consumption.
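The layer-level DVFS rule described in this section (each layer adopts the highest frequency and voltage requested by its cores) can be sketched with the operating points of Table 3.6. The relative dynamic power uses the αCV²f model that appears in Algorithm 4, with α and C assumed constant across operating points. The function names and the example request list are hypothetical.

```python
# Frequency (MHz) -> voltage (V) pairs from Table 3.6.
DVFS_TABLE = {700: 1.00, 650: 0.95, 600: 0.91, 550: 0.87, 300: 0.46}


def layer_setting(requested_freqs_mhz):
    """Pick the shared operating point for one layer: the highest requested
    frequency, and with it the highest voltage, among the cores in the layer."""
    freq = max(requested_freqs_mhz)
    return freq, DVFS_TABLE[freq]


def relative_dynamic_power(freq_mhz):
    """Dynamic power at an operating point relative to 700 MHz / 1 V,
    assuming P = alpha*C*V^2*f with alpha and C unchanged."""
    volt = DVFS_TABLE[freq_mhz]
    return (volt ** 2 * freq_mhz) / (DVFS_TABLE[700] ** 2 * 700)


# Hypothetical per-core requests in one layer.
layer_freq, layer_volt = layer_setting([550, 700, 600, 300])
lowest_point_power = relative_dynamic_power(300)  # roughly 9% of the 700 MHz point
```

Because a single core requesting 700 MHz keeps the whole layer at the top operating point, per-layer DVFS saves less core energy than per-core DVFS, which is consistent with the comparison in Fig. 3.20.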
Figure 3.20: Normalized system energy consumption under DVFS.

For power gating, we manually power off the cores in one layer and leave the cores in the
other layers under normal execution, which causes the largest current imbalance, the largest
power loss, and the worst power delivery efficiency. The residual current threshold
$\Delta I_{threshold}$ in the power delivery efficiency guided hypervisor is set to 25%, 50%,
and 75% of the core current respectively to protect power delivery efficiency. Fig. 3.21
describes the full system power delivery efficiency across benchmarks. Compared with voltage
stacking without power gating in Fig. 3.18, the continuous imbalanced current from power
gating causes more power loss in the CR-IVR and CR-VRM. When $\Delta I_{threshold}$ is set to
25% or 50% of the core current, the full system power delivery efficiency can still be
maintained at 80%. When $\Delta I_{threshold}$ is set to 75% of the core current, the full
system power delivery efficiency is lower than 70%. This means that the power benefits from
gating the cores in one layer, which amount to about 1/4 of system power, are all wasted in
the power delivery system. Therefore, when power gating is applied in voltage stacking, power
gating at the software level should prefer powering off the cores in the same stack with the
help of thread migration. When it is inevitable to power off the cores in the same layer, the
hardware-based power delivery efficiency guided power management hypervisor will be deployed
to prevent gating over 50% of one layer from happening frequently, protecting the voltage
stacking power delivery efficiency.
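The statement that the gating benefit is "all wasted" once the efficiency falls below 70% can be checked with a small calculation. It assumes that gating one of four layers removes one quarter of the core power and that the ungated system runs at the 93.5% average efficiency reported in Section 3.8.2; both are simplifying assumptions, and the function names are hypothetical.

```python
def input_power(core_power, pde):
    """Power drawn from the source to deliver core_power at efficiency pde."""
    return core_power / pde


def break_even_pde(core_fraction_after_gating, pde_before):
    """Efficiency below which gating no longer reduces the source power."""
    return core_fraction_after_gating * pde_before


ungated = input_power(1.0, 0.935)          # all four layers active
gated_at_70 = input_power(0.75, 0.70)      # one layer gated, PDE degraded to 70%
threshold = break_even_pde(0.75, 0.935)    # about 0.70
```

Under these assumptions the break-even efficiency is about 70%, so a gated system running below that point draws at least as much source power as the ungated one.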

Figure 3.21: Power delivery efficiency under PDE guided power gating and original power
gating on voltage stacking.

Table 3.7: Power delivery system comparison

Power Delivery System     Efficiency   Die Area     Reliable   Compatibility
Single-layer VRM [118]    79.9%        N/A          ✓          ✓
Single-layer IVR [162]    85.8%        172.3 mm²    ✓          ✓
VS IVR [94]               92%          88.3 mm²     ×          ×
VS IVR (worst) [94]       92%          912 mm²      ✓          ×
VS Hybrid (this work)     93.5%        99.2 mm²     ✓          ✓

3.8.4 Comparison with Other Power Delivery Systems

In Table 3.7, we compare the proposed hybrid regulated voltage-stacked power delivery
system with other existing and emerging power delivery system configurations. Although
charge-recycling voltage regulators are employed, the voltage-stacked system does not suffer a
large efficiency penalty, because most current goes through the vertically stacked grid without
incurring energy loss at the regulators. Validated by benchmarks, the proposed
voltage-stacked system with hybrid regulation achieves 93.5% power delivery efficiency on
average and guarantees that the supply voltage noise remains within the reliable region.
Besides efficient power delivery and supply voltage noise mitigation, hybrid regulated
voltage-stacked systems are also compatible with other advanced high-level power management
techniques, such as DVFS and power gating. Although a large imbalance current may lead to
power delivery efficiency loss when advanced power management is applied in a voltage-stacked
system, the power delivery efficiency guided hypervisor is able to limit the magnitude and
frequency of the imbalance and preserve the improved power delivery efficiency. Furthermore,
many other techniques, such as high-efficiency charge-recycling circuits, can be explored to
further improve the voltage-stacked power delivery efficiency.

3.9 Evaluation of Architecture Support

In this section, we quantitatively evaluate the efficiency, overhead, and reliability of our
cross-layer voltage-stacked GPU system leveraging control theory. We first examine
system-level power delivery efficiency and compare with alternative PDS configurations. The
results indicate that our cross-layer voltage-stacked PDS is the only practical solution that
can deliver power at 92.3% efficiency (12.3 percentage points above the conventional PDS)
without incurring prohibitive area overhead. Next, we evaluate the supply noise behavior of our
solution against both synthetic worst-case scenarios and real-world benchmarks to verify
that it can sustain the specified voltage margin with strong guarantees. We then perform
a sensitivity study and design space exploration to reveal the potential performance and
energy efficiency tradeoffs in the voltage-stacked GPU system. Finally, we demonstrate
collaborative power management operations by combining the cross-layer VS framework
with other higher-level power optimization techniques, which can yield better overall
system-level efficiency results than any of the individual methods alone.

Figure 3.22: Power delivery efficiency and power breakdown across benchmarks and power
delivery subsystems configurations.

Table 3.8: Comparison of Different Power Delivery Subsystems (PDS)

PDS Configuration            PDE      Die Area Overhead
Single layer VRM [161]       80%      N/A
Single layer IVR [63]        85%      172.3 mm² (0.33× GPU die)
VS circuit only [94, 112]    93.0%    912 mm² (1.72× GPU die)
VS cross-layer               92.3%    105.8 mm² (0.2× GPU die)

3.9.1 System-level Efficiency

System-level power delivery efficiency (PDE) is evaluated by running a wide range of GPU
benchmarks on our integrated hybrid simulation infrastructure. We compare our cross-layer
VS solution with three alternatives: the conventional single-layer PDS with a board-level
voltage regulator module (VRM), the single-layer IVR PDS with an on-chip switched-capacitor
integrated voltage regulator but without voltage stacking, and the circuit-only
solution to implement VS with the aid of on-chip charge-recycling IVR (CR-IVR).
The normalized breakdown of the total system power across benchmarks is shown in Fig. 3.22.
On average, both voltage stacking PDS configurations (circuit-only and cross-layer) can
deliver power at close to 92.3% efficiency, as compared to 80% for the single-layer VRM
(conventional baseline) and 85% for the single-layer IVR. The IVR in VS outperforms the IVR
in the single-layer PDS because the former only needs to shuffle the imbalanced load, which
is usually less than 20% of the layer power, whereas the latter delivers the total power.
Figure 3.23: Transient voltage waveforms under worst imbalance scenarios.
Table 3.8 summarizes the comparison results. Besides efficiency, it also highlights the die
area overhead of the different PDS configurations. Although both VS solutions exhibit high
PDE, the circuit-only approach consumes excessive die area (1.72× the GPU die area) in order
for the CR-IVR to have enough capacity to deal with the worst-case current imbalance. In
contrast, our cross-layer approach, which leverages architecture-level support to deal with
the slow-changing part of the current imbalance, appears to be the only practical solution
known that can consistently achieve above 90% efficiency.
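The area figures in Table 3.8 can be cross-checked with a few lines of Python. The sketch assumes that the 0.2× ratio quoted for the cross-layer design is exact and uses it to infer the GPU die area; the inferred value and the variable names are therefore illustrative and not numbers stated in the text.

```python
# Die area overhead in mm^2, as listed in Table 3.8.
AREA_MM2 = {
    "single-layer IVR": 172.3,
    "VS circuit only": 912.0,
    "VS cross-layer": 105.8,
}

# Assumption: the cross-layer overhead is exactly 0.2x the GPU die area.
gpu_die_mm2 = AREA_MM2["VS cross-layer"] / 0.2

# Overhead of each configuration as a multiple of the inferred GPU die area.
ratios = {name: area / gpu_die_mm2 for name, area in AREA_MM2.items()}

# Area saved by the cross-layer design relative to the circuit-only design.
area_reduction = 1.0 - AREA_MM2["VS cross-layer"] / AREA_MM2["VS circuit only"]
```

With this assumption the inferred ratios agree with the 0.33× and 1.72× entries in the table to within rounding, and the cross-layer design needs roughly 88% less regulator area than the circuit-only design.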

3.9.2 Supply Reliability

We first construct a synthetic worst case scenario to verify reliable operation of the proposed
VS GPU. At the 3µs mark (Fig. 3.23), we manually turn off SMs in one layer to simulate
extreme current imbalance. In the circuit-only VS systems, the voltage droop worsens as the
CR-IVR area decreases and it takes about 2ˆ the GPU area to stabilize the voltage above
0.8 V. Instead, our cross-layer solution incurs only 0.2ˆ area overhead to achieve a similarly
stable transient SM voltage, which is a nearly 90% area reduction.

Figure 3.24: Worst supply noise in response to worst imbalance as a function of CR-IVR
area and control latency. (a) CR-IVR area. (b) Control latency.
We also perform a sensitivity study on the impact of CR-IVR area and control latency on
the supply reliability of our cross-layer VS GPU. Fig. 3.24 plots the worst voltage droop in
response to the synthetic current imbalance event as a function of CR-IVR area (a) and
control latency (b). In the left plot, when the control latency is greater than 80 cycles, the
worst-case voltage droop becomes highly sensitive to the area budget. Similarly, in the right
plot, when the area budget is smaller than 0.8×, the worst-case voltage droop becomes highly
sensitive to the control latency. Since the architecture-level voltage smoothing scheme can
only deal with slow-changing supply fluctuations, a minimal-sized CR-IVR is always required
to handle fast current imbalances. From the sensitivity analysis, we choose a 0.2× sized CR-IVR and a 60-cycle latency controller as the optimal parameters to implement the cross-layer
VS solution, and use that default setting from now on. We also simulate the distribution
of supply noise across real-world benchmarks. Each box in Fig. 3.25 summarizes the noise
distribution of all 16 SMs for a benchmark. We compare noise distribution between the
cross-layer solution and the circuit-only solution, both with a 0.2× sized CR-IVR. Nine out of
12 benchmarks experience a modest reduction in voltage noise magnitude from the control-theoretic
voltage smoothing. The 3 outliers (pathfinder, simpleatomic, fastwalsh) are due to
[Figure 3.25 plot: box plots of voltage (V), on an axis from 0.6 to 1.4, per benchmark for the cross-layer solution and the circuit-only solution.]

Figure 3.25: Noise distribution across benchmarks and the worst-case imbalance.
the choice of control parameters and boundary transitions, but their lowest voltage excursions
are still bounded by 0.8 V, satisfying the specified 0.2 V voltage margin. The rightmost
box plot represents the worst-case noise distribution, which indicates that although the
architecture voltage smoothing is only occasionally triggered for regular benchmarks, it is
essential to provide the worst-case guarantee to ensure supply reliability.

3.9.3

Performance Tradeoffs

[Figure 3.26 plot: performance penalty versus threshold voltage (V), one curve per benchmark.]

Figure 3.26: Performance penalty varies with controller voltage threshold.

[Figure 3.27 plot: performance penalty versus net energy saving for DIWS, FII, DCC, and weighted combinations (labeled 0.8DIWS+0.2DII and 0.8DIWS+0.2DCC).]

Figure 3.27: Energy saving and performance penalty tradeoff space.

Due to the throttling nature of voltage smoothing mechanisms such as DIWS, our cross-layer
approach inevitably incurs performance penalties. When evaluating energy efficiency of the
[Figure 3.28 plot: net energy saving and performance loss (%) for each benchmark.]

Figure 3.28: Performance penalty and energy saving across benchmarks.
GPU system, such performance penalties lead to longer total execution times and higher
energy consumption caused by leakage power. We account for such performance penalties
and their resulting increased leakage energy in our total energy savings calculation. The
performance penalty and net energy savings of our proposed cross-layer VS GPU
are presented in Fig. 3.28, normalized against the performance and total energy of a conventional
PDS with a single-layer VRM. The performance penalty falls within 2% to 4% across
benchmarks. After taking the extended execution time and increased leakage energy into
account, voltage-stacked GPUs with the cross-layer solution still enjoy 10% to 15% net energy
savings (improved energy efficiency) due to higher power delivery efficiency. We perform
another sensitivity study by varying the voltage threshold ($V_{threshold}$) used in the voltage
smoothing controller, as it determines how often DIWS throttling is triggered. The results
across benchmarks are shown in Fig. 3.26. A lower threshold leads to a smaller performance
overhead, but jeopardizes supply reliability. In this work, we set the default $V_{threshold}$ at
0.9 V. At this level, voltage smoothing engages (when the layer voltage falls below 0.9 V) in
less than 20% of the cycles during benchmark execution.
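The accounting described above can be illustrated with a minimal Python sketch. It is not the evaluation model used in this work: the split of load energy into dynamic and leakage parts, the function name `net_energy_saving`, and every numeric value in the usage note are assumptions chosen only to show how a small performance penalty trades against a higher power delivery efficiency (PDE).

```python
def net_energy_saving(pde_conv, pde_vs, leak_frac, perf_penalty):
    """Fractional net energy saving of a VS system versus a conventional PDS.

    Assumed model (illustrative only): load energy of the conventional system
    is normalized to 1.0 and split into a dynamic part (1 - leak_frac) and a
    leakage part (leak_frac). Throttling stretches execution time by
    perf_penalty, so leakage energy grows by the same factor, while dynamic
    energy is unchanged because the same work is done.
    """
    # Energy drawn from the source = load energy / power delivery efficiency.
    e_conv = 1.0 / pde_conv
    load_vs = (1.0 - leak_frac) + leak_frac * (1.0 + perf_penalty)
    e_vs = load_vs / pde_vs
    return 1.0 - e_vs / e_conv


# Hypothetical inputs: 80% vs. 92% PDE, 20% leakage share, 3% slowdown.
saving = net_energy_saving(pde_conv=0.80, pde_vs=0.92,
                           leak_frac=0.20, perf_penalty=0.03)
```

With these hypothetical inputs the saving is roughly 12.5%; setting both efficiencies equal and the penalty to zero gives a saving of zero, as expected.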
In the previous evaluation, we use only DIWS as the voltage smoothing mechanism, and the
performance penalty is a result of its throttling effect. If an even smaller performance penalty
is desired, our cross-layer approach has the flexibility to incorporate other mechanisms using

Figure 3.29: Applying DFS on conventional and proposed voltage-stacked GPU.

Figure 3.30: Applying PG on conventional and proposed voltage-stacked GPU.
the weighted control inputs as specified in (3.30). We explore the space of different weight
combinations and the resulting performance penalty and net energy savings in Fig. 3.27.
On the Pareto frontier of the design space, we can see that when high net energy saving is
desired, DIWS is generally the better voltage smoothing mechanism to choose, while FII and
DCC can deliver a lower performance penalty. Due to its extra area overhead and leakage
current, DCC is usually an inferior mechanism when FII can be applied to achieve similar
performance.

Figure 3.31: Distribution of imbalanced currents by their normalized magnitudes when no
power management (No PM), DFS with different performance goals, and power gating are
applied in a VS GPU.

3.9.4

Collaborative Power Management

Finally, we demonstrate the collaborative operation of voltage stacking with dynamic frequency scaling (DFS) and power gating (PG) for high-level power optimization. Previous
DFS studies [150, 163] find the optimal SM operating frequencies to minimize the computational power under different performance goals. We apply a similar DFS strategy and
examine the total GPU energy consumption with and without VS. The energy in Fig. 3.29 is
normalized by the total GPU energy operating at its peak performance when the power delivery inefficiency is taken into account. Since our VS-aware power management hypervisor
may modify the optimal frequency settings to ensure a bounded layer current imbalance, this
negative effect of VS on DFS can be observed in the slight increase of computational energy
(1-2%) in the second bar representing our cross-layer VS GPU solution. However, when the
power delivery loss is considered, the slight energy penalty experienced by the VS GPU is
more than compensated by its superior PDE, resulting in overall energy savings of 7-13%
compared to applying DFS in the GPU with a conventional PDS. We observe similar results
when combining PG techniques (i.e., Warped Gates [155]) with VS. As shown in Fig. 3.30,
although the minimum current imbalance requirement in the VS GPU disrupts the optimal
PG setting, it is more than compensated by improved PDE.
These favorable DFS and PG results can be better understood by carefully examining the
distribution of imbalanced currents between two vertically stacked SMs across cycles. We
normalize the current imbalance by the peak SM current and plot its distribution during the
lifetime of the benchmark execution in Fig. 3.31. When no power management is applied,
the benchmark with the most imbalance is BACKP (left bar) and the benchmark with
the highest uniformity is heartwall (right bar). The middle bar presents the distribution
averaged over all benchmarks and it shows that 50% of the time, the current imbalance
is less than 10% of its peak magnitude, and 93% of the time, it is less than 40% of the
peak. Similar exercises can be performed when DFS and PG are applied by evaluating the
imbalance distribution for the worst, best, and average benchmarks. Fig. 3.31 suggests that
SM-level activities are overwhelmingly uniform and synchronized, resulting in well-balanced
currents across the stack, and high-level power optimizations such as DFS and PG do not
fundamentally disturb such balanced activities.

3.10

Conclusion

Voltage stacking fundamentally improves the power delivery efficiency of many-core processors but
suffers from aggravated supply voltage noise. According to the analysis using circuit decomposition and superposition, the contributors to supply voltage noise are the high-frequency global
current and the low-frequency residual current. The current configuration leading to the
worst supply voltage is then derived as an optimization problem. Based on the characteristics of
supply voltage noise, a hybrid regulation scheme, with distributed on-chip and an off-chip charge recycle voltage regulators, is proposed to effectively mitigate supply voltage noise. The supply
voltage noise is guaranteed to stay within a safe range even in the worst case. Moreover, the proposed
hybrid regulated voltage-stacked system is not only compatible with other power management techniques such as DVFS and power gating but also maintains a high power delivery
efficiency. Compared with a conventional power delivery system, the proposed hybrid regulated voltage-stacked system achieves a 13.6% improvement in power delivery efficiency. The
improved power delivery efficiency can help computing systems and cyber-physical
systems operate longer in energy-limited scenarios.

Chapter 4
Architecture and Operating System
Layers: Real-Time GPU Scheduling
of Hard Deadline Parallel Tasks with
Fine-Grain Utilization

Many emerging cyber-physical systems, such as autonomous vehicles and robots, rely heavily on artificial intelligence and machine learning algorithms to perform important system
operations. Since these highly parallel applications are computationally intensive, they need
to be accelerated by graphics processing units (GPUs) to meet stringent timing constraints.
However, despite the wide adoption of GPUs for machine learning and artificial intelligence,
efficiently scheduling multiple GPU applications while providing rigorous real-time guarantees remains a challenge. In this chapter, we propose RTGPU, which can schedule the
execution of multiple GPU applications in real-time to meet hard deadlines. Each GPU
application can have multiple CPU execution and memory copy segments, as well as GPU
kernels. We start with a model to explicitly account for the CPU and memory copy segments
of these applications. We then consider the GPU architecture in the development of a precise timing model for the GPU kernels and leverage a technique known as persistent threads
to implement fine-grained kernel scheduling with improved performance through interleaved
execution. Next, we propose a general method for scheduling parallel GPU applications
in real time. Finally, to schedule multiple parallel GPU applications, we propose a practical real-time scheduling algorithm based on federated scheduling and grid search (for GPU
kernel segments) with uniprocessor fixed priority scheduling (for multiple CPU and memory copy segments). Our approach provides superior schedulability compared with previous
work, and gives real-time guarantees to meet hard deadlines for multiple GPU applications
according to comprehensive validation and evaluation on a real NVIDIA GTX1080Ti GPU
system.

4.1

Introduction

Nowadays, artificial intelligence (AI) and machine learning (ML) applications accelerated
by graphics processing units (GPUs) are widely adopted in emerging autonomous systems,
such as self-driving vehicles and collaborative robotics [164, 165]. For example, Volvo deployed NVIDIA DRIVE PX 2 technology for semi-autonomous driving in 100 XC90 luxury
SUVs [166]. These autonomous systems must simultaneously execute different algorithms
in the GPU in order to perform tasks such as object detection, 3D annotation, movement
prediction, and route planning [167, 168]. The systems must also process images and signals from various sensors and decide the next action in real time. Therefore, it is essential
to diligently manage the concurrent execution of applications in the GPUs with respect to

various timing constraints, since their behaviors can have direct and critical impacts on the
stability and safety of the whole system.
For general purpose computing in non-real-time systems with GPUs, the GPU scheduling
problem has been extensively studied to minimize the makespan of a single application or to
maximize the total throughput of the system [169–172]. Examples include accelerating the
training of an AI algorithm or optimizing the average utilization of GPUs running an ML
inference application. However, many of these techniques do not translate well to scheduling
GPU applications with real-time deadlines. The conventional programming interface that
comes with off-the-shelf desktop and embedded GPUs allows scheduling only at the granularity of GPU kernels. In particular, by default, the first-launched GPU kernel will occupy
all the GPU resources until completion, at which time the next scheduled GPU kernel can
begin executing,⁴ so a GPU scheduler can decide only which of the available GPU kernels
should be launched first, even with Multi-Process Service (MPS) [173]. This kernel-granular
scheduling is not sufficient to meet real-time deadlines. For example, consider two real-time
tasks run on the same GPU, one of which has a large GPU kernel with a long deadline, while
the other has a small GPU kernel with a short deadline. If the large GPU kernel arrives
slightly before the small GPU kernel, the large task will take over the entire GPU, leaving
the small task stuck waiting and likely missing its deadline. To overcome this deficiency
and improve the real-time performance of GPU applications, it has been proposed to add
some form of preemption via low-level driver support and to modify CUDA APIs so that the
system’s timing behavior is more predictable [174–181]. However, none of these approaches
⁴ GPU CUDA activity from independent host processes will normally create independent CUDA contexts,
one for each process. Thus, the CUDA activity launched from separate host processes will take place in
separate CUDA contexts, on the same device. CUDA activity in separate contexts will be serialized. The
GPU will execute the activity from one process, and when that activity is idle, it can and will context-switch to another context to complete the CUDA activity launched from the other process. The detailed
inter-context scheduling behavior is not specified. (Running multiple contexts on a single GPU also cannot
normally violate basic GPU limits, such as memory availability for device allocations.)

Figure 4.1: RTGPU framework.
provide fine-grained real-time GPU scheduling and the corresponding schedulability analysis
needed to execute multiple real-time tasks in GPUs.
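The two-task example given earlier in this section (a large kernel with a long deadline arriving just before a small kernel with a short deadline) can be made concrete with a small Python sketch. It is a toy model, not the scheduler proposed in this chapter: work is measured in hypothetical SM-milliseconds and is assumed to divide perfectly across SMs, and the SM count, arrival times, work amounts, deadlines, and function names are all invented for illustration.

```python
def finish_times_kernel_granular(kernels, num_sms):
    """Kernels run one at a time in arrival order, each occupying every SM."""
    now, finish = 0.0, {}
    for name, arrival, work, _deadline in sorted(kernels, key=lambda k: k[1]):
        start = max(now, arrival)      # wait for the previous kernel to end
        now = start + work / num_sms   # assumed perfect speedup over all SMs
        finish[name] = now
    return finish


def finish_times_sm_partitioned(kernels, sm_share):
    """Each kernel starts on arrival and runs on its own fixed set of SMs."""
    return {name: arrival + work / sm_share[name]
            for name, arrival, work, _deadline in kernels}


# (name, arrival in ms, work in SM-ms, absolute deadline in ms): hypothetical
kernels = [("large", 0.0, 400.0, 100.0), ("small", 1.0, 20.0, 10.0)]
deadlines = {name: d for name, _a, _w, d in kernels}

serial = finish_times_kernel_granular(kernels, num_sms=10)
split = finish_times_sm_partitioned(kernels, {"large": 6, "small": 4})
serial_ok = {n: serial[n] <= deadlines[n] for n in serial}
split_ok = {n: split[n] <= deadlines[n] for n in split}
```

Under these assumed numbers, the small kernel finishes at 42 ms and misses its 10 ms deadline when kernels are serialized, whereas a static 6/4 split of the SMs lets both kernels meet their deadlines.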
In this chapter, we propose RTGPU, a general real-time GPU scheduling framework shown
schematically in Fig. 4.1. This framework includes the GPU partitioning and modeling on
the system side and a scheduling algorithm with schedulability analysis on the theory side.
First, working from an in-depth understanding of GPU kernel execution and profiling synthetic workloads, we leverage a technique called persistent threads to support SM-granularity
scheduling for concurrent GPU applications [182–184]. Building on the persistent threads technique, we propose interleaved execution, which achieves a 10% to 37% improvement in system
utilization. Then we develop a real-time GPU system model that introduces the concept of
virtual streaming multiprocessors (virtual SMs). With this model, we are able to explicitly
assign the desired number of virtual SMs to each GPU kernel of each GPU application,
allowing finer-grained GPU scheduling without any low-level modifications to GPU systems.
Compared with previous kernel-granularity scheduling approaches, this model supports more
flexible parallel execution in the GPUs.

As each GPU application has multiple CPU execution segments, memory copy segments, and GPU
kernels, on the scheduling algorithm side we introduce a strategy that combines fixed-priority scheduling with federated scheduling
for the system. For the GPU segment, based on the proposed real-time GPU system
model, we extend a parallel real-time scheduling paradigm for CPUs, namely federated
scheduling [185], to schedule real-time GPU applications with implicit deadlines. The key
idea behind federated scheduling is to calculate and statically assign the specific computing
resources that each parallel real-time task needs to meet its deadline. Note that preemption
between tasks is not needed if the correct number of fixed-granularity computing resources
can be accurately derived in analysis and enforced during runtime. For the CPU segment and
memory copies between CPU and GPU, a novel uniprocessor fixed priority scheduling method
is then proposed based on calculating the response time upper bounds and lower bounds of
each segment alternately. This scheduling algorithm is not limited to GPU applications and
can be further applied to other applications running on heterogeneous architecture computing
systems.
Compared with previous work, the proposed GPU federated scheduling and CPU and memory copy fixed priority scheduling techniques collaborate well with each other and achieve the
best schedulability known to date. To assess the effectiveness of those techniques on real
platforms, we evaluate and validate our proposed RTGPU framework on real NVIDIA GPU
systems.

4.2

Background and Related Work

4.2.1

Background on GPU Systems

GPUs are designed to accelerate compute-intensive workloads with high levels of data parallelism. As shown in Fig. 4.2, a typical GPU program contains three parts: a code segment
that runs on the host CPU (the CPU segment), the host/device memory copy segment, and
the device code segment which is also known as the GPU kernel. GPU kernels are single
instruction multiple threads (SIMT) programs. The programmer writes code for one thread,
many threads are grouped into one thread block, and many thread blocks form a GPU kernel.
The threads in one block execute the same instruction on different data simultaneously. A
GPU consists of multiple streaming multiprocessors (SMs). The SM is the main computing
unit, and each thread block is assigned to an SM to execute. Inside each SM are many
smaller execution units that handle the physical execution of the threads in a thread block
assigned to the SM, such as CUDA cores for normal arithmetic operations, special function
units (SFUs) for transcendental arithmetic operations, and load and store units (LD/ST)
for transferring data from/to cache or memory.
When GPU-accelerated tasks are executed concurrently, kernels from different tasks are
issued to a GPU simultaneously. Standard CUDA streaming supports multiple kernels concurrently within the same CUDA context. However, it cannot effectively manage concurrent
GPU kernels and tasks in an explicit manner. When kernels are launched, the thread blocks
are dispatched to all the SMs on a first-come, first-served basis. The first-launched kernel
occupies all the GPU resources, and the next kernel begins its execution only when SMs are
freed after completion of the first kernel. Therefore, the execution of the concurrent tasks
remains sequential despite the CUDA streaming mode.

Figure 4.2: Typical GPU task execution pattern.

4.2.2

Background on Multi-Segment Self-Suspension

In the multi-segment self-suspension model, a task $\tau_i$ has $m_i$ execution segments and $m_i - 1$
suspension segments between the execution segments. So task $\tau_i$ with deadline $D_i$ and period
$T_i$ is expressed as a 3-tuple:

\[ \tau_i = \big( (L_i^0, S_i^0, L_i^1, \ldots, S_i^{m_i-2}, L_i^{m_i-1}),\ D_i,\ T_i \big) \]

where $L_i^j$ and $S_i^j$ are the lengths of the j-th execution and suspension segments, respectively.
$[\check{S}_i^j, \hat{S}_i^j]$ gives the lower and upper bounds of the suspension length $S_i^j$, and $\hat{L}_i^j$ is the upper bound
on the length of the execution segment $L_i^j$.
The analysis in [186] bounds the worst-case response time of a task under the multi-segment
self-suspension model, which is summarized below and utilized in this work for analyzing
the response time of CPU-GPU tasks.

Lemma 4.2.1 The following workload function $W_i^h(t)$ bounds the maximum amount of
execution that task $\tau_i$ can perform during an interval with a duration $t$ and a starting segment
$L_i^h$:

\[ W_i^h(t) = \sum_{j=h}^{l} \hat{L}_i^{j \bmod m_i} + \min\Big( \hat{L}_i^{(l+1) \bmod m_i},\ t - \sum_{j=h}^{l} \big( \hat{L}_i^{j \bmod m_i} + S_i(j) \big) \Big) \]

where $l$ is the maximum integer satisfying the following condition:

\[ \sum_{j=h}^{l} \big( \hat{L}_i^{j \bmod m_i} + S_i(j) \big) \le t \]

and $S_i(j)$ is the minimum inter-arrival time between execution segments $L_i^j$ and $L_i^{j+1}$,
which is defined by:

\[ S_i(j) = \begin{cases}
\check{S}_i^{j \bmod m_i} & \text{if } j \bmod m_i \ne (m_i - 1) \\
T_i - D_i & \text{else if } j = m_i - 1 \\
T_i - \sum_{j=0}^{m_i-1} \hat{L}_i^j - \sum_{j=0}^{m_i-2} \check{S}_i^j & \text{otherwise}
\end{cases} \]
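As a reading aid, the workload bound of Lemma 4.2.1 can be transcribed into a short Python sketch. The function and parameter names (`min_gap`, `workload`, `L_hat`, `S_check`) are ours, the task parameters in the usage note are hypothetical, and the code assumes every execution bound in `L_hat` is strictly positive so that the loop terminates.

```python
def min_gap(j, L_hat, S_check, T, D):
    """S_i(j): minimum gap that follows the j-th execution segment."""
    m = len(L_hat)
    if j % m != m - 1:
        return S_check[j % m]              # suspension inside one job
    if j == m - 1:
        return T - D                       # last segment of the first job
    return T - sum(L_hat) - sum(S_check)   # last segment of later jobs


def workload(h, t, L_hat, S_check, T, D):
    """W_i^h(t): execution bound in a window of length t starting at segment h."""
    m = len(L_hat)
    total = 0.0   # sum of L_hat over the fully counted segments h..l
    used = 0.0    # sum of (L_hat + S_i(j)) over the same segments
    j = h
    while True:
        step = L_hat[j % m] + min_gap(j, L_hat, S_check, T, D)
        if used + step > t:
            break                          # j is now l + 1
        total += L_hat[j % m]
        used += step
        j += 1
    # Add whatever part of segment l + 1 still fits in the window.
    return total + min(L_hat[j % m], t - used)


# Hypothetical task: two execution segments (2 and 3), one suspension (>= 4).
L_hat, S_check, T, D = [2, 3], [4], 20, 15
w8 = workload(0, 8, L_hat, S_check, T, D)
```

For this hypothetical task, a window of length 8 starting at segment 0 admits the full first segment (2 units) plus 2 units of the second one, so `w8` is 4.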

Then the response time of execution segment $L_k^j$ in task $\tau_k$ can be bounded by calculating
the interference caused by the workload of the set of higher-priority tasks $hp(k)$.

Lemma 4.2.2 The worst-case response time $\hat{R}_k^j$ is the smallest value that satisfies the following recurrence:

\[ \hat{R}_k^j = \hat{L}_k^j + \sum_{\tau_i \in hp(k)} \max_{h \in [0, m_i - 1]} W_i^h(\hat{R}_k^j) \]
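A recurrence of this form is typically solved by fixed-point iteration, sketched below in Python. Each interferer is passed as a callable standing for $\max_h W_i^h(\cdot)$ of one higher-priority task; to keep the sketch self-contained, the usage example substitutes the classic periodic step function for that term, which is an assumption and not the workload function of Lemma 4.2.1. The names `response_time` and `periodic_workload` are illustrative, and integer parameters are assumed so that the equality test is exact.

```python
import math


def response_time(own_length, interferers, limit):
    """Smallest R with R = own_length + sum(w(R) for w in interferers).

    Starts from own_length and iterates upward, which reaches the smallest
    fixed point when every w is non-decreasing. Returns None once the
    iterate exceeds `limit` (for example, the deadline).
    """
    r = own_length
    while r <= limit:
        nxt = own_length + sum(w(r) for w in interferers)
        if nxt == r:
            return r
        r = nxt
    return None


def periodic_workload(c, period):
    """Stand-in interference bound: ceil(t / period) * c."""
    return lambda t: math.ceil(t / period) * c


# Hypothetical higher-priority tasks: (C=1, T=4) and (C=2, T=6).
hp = [periodic_workload(1, 4), periodic_workload(2, 6)]
r = response_time(3, hp, limit=20)
```

With these hypothetical values the iterates are 3, 6, 7, 9, 10, and the recurrence settles at 10; with a limit of 8 the function reports failure instead.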

Hence, the response time of task $\tau_k$ can be bounded by either taking the summation of the
response times of every execution segment and the total worst-case suspension time, or
calculating the total interference caused by the workload of the set of higher-priority tasks
$hp(k)$ plus the total worst-case execution and suspension time.

Lemma 4.2.3 The worst-case response time $\hat{R}_k$ of task $\tau_k$ is upper bounded by the
minimum of $\widehat{R1}_k$ and $\widehat{R2}_k$, where:

\[ \widehat{R1}_k = \sum_{j=0}^{m_k-2} \hat{S}_k^j + \sum_{j=0}^{m_k-1} \hat{R}_k^j \tag{4.1} \]

and $\widehat{R2}_k$ is the smallest value that satisfies the recurrence:

\[ \widehat{R2}_k = \sum_{j=0}^{m_k-2} \hat{S}_k^j + \sum_{j=0}^{m_k-1} \hat{L}_k^j + \sum_{\tau_i \in hp(k)} \max_{h \in [0, m_i - 1]} W_i^h(\widehat{R2}_k) \tag{4.2} \]

4.2.3

Related Work

Previous work on the general topic of GPU resource management has looked at the problem at the operating system-level [169, 170, 187] and has used persistent threads to implement SM-granularity workload assignment for non-real-time systems [182–184]. Meanwhile,
Lin [165] proposed integrated vectorization and scheduling methods to exploit multiple forms
of parallelism for optimizing throughput for synchronous dataflows on memory-constrained
CPU-GPU platforms. Wang [188] implemented a user-mode lightweight CPU–GPU resource
management framework to optimize the CPU utilization while maintaining good Quality of
Service (QoS) of GPU-intensive workloads in the cloud, such as cloud games. For a more
complex system, Kayiran [171] considered GPU concurrency in a heterogeneous setting. For
a large scale server system, Yang [172] studied parallel execution on multicore GPU clusters.

Besides, Park [178], Basaran [179], Tanasic [180], and Zhou [181] proposed architecture extensions and Effisha [189] introduced software techniques without any hardware modification
to support kernel preemption. Chen [190] extended the original Flink on CPU clusters to
GFlink on heterogeneous CPU-GPU clusters for big data applications. The thermal-aware
and energy efficient GPU systems were also studied in [191] and [192].
For real-time systems with GPUs, previous work mainly involves GPU kernel-granularity
scheduling. For example, Kato [176] introduced a priority-based scheduler; Elliott proposed shared resources and containers for integrating GPU and CPU scheduling [177] and
GPUSync [193] for managing multi-GPU multicore soft real-time systems with flexibility,
predictability, and parallelism; Golyanik [194] described a scheduling approach based on
time-division multiplexing in GPU; S³DNN [174] optimized the execution of DNN workloads
on GPU in a real-time multi-tasking environment through scheduling the GPU kernels. However, these approaches focus on predictable GPU control; they do not allow multiple tasks
to use the GPU at the same time. Thus, the GPU may be underutilized and there may be
a long waiting time for a task to access the GPU. Besides kernel-granularity scheduling, researchers have also explored other approaches to improve schedulability. Gerum [195]
and Berezovskyi [196] targeted accurate timing estimation for GPU workloads. Zhou [181]
proposed a technique based on reordering and batching kernels to speed up deep neural
networks. Lee [175] studied how to schedule two real-time GPU tasks. Bakhoda [197],
Wang [198], Xu [199], and Lee [200] studied GPU scheduling on a GPU simulator.
On the scheduling theory side, the CPU-GPU system looks like the self-suspension model,
but it has CPU, memory copy, and GPU segments leading to more unique and complicated features like the interactions and blockings from non-preemptive components in the
suspension segments. Saha [201] used the persistent threads technique and busy-waiting
suspension mode, which underrates the system’s performance and causes extra pessimism
in the scheduling ability. Sun [202] proposed a formal scheduling-theoretic representation of
the scheduling problem upon the host-centric acceleration architectures but it cannot handle
the classic sporadic/periodic tasks.

4.3
4.3

CPU and Memory Model

4.3.1

CPU Modelling

As represented in Fig. 4.2, a typical GPU application has multiple segments of CPU code,
memory copies between the CPU and GPU, and GPU code (which are also called GPU
kernels). Because a GPU has powerful parallel computational capacity, it is assigned to
execute computationally-intensive workloads, such as matrix operations. The CPU executes
serial instructions, e.g., for communication with IO devices (sensors and actuators) and
launches memory copies and GPU kernels.
When a CPU executes serial instructions, it naturally behaves as a single-threaded application without parallelism. When the CPU code launches memory copies or GPU kernels,
these instructions are added to FIFO buffers called "CUDA streams". The
memory copies and GPU kernels, which are in different CUDA streams, can execute in parallel if there are remaining available resources. The execution order of memory copies and
GPU kernels in a single CUDA stream can be controlled by the order in which they are added
to it by the CPU code. After the CPU has launched memory copies and GPU kernels into a
CUDA stream, it will immediately execute the next instruction, unless extra synchronization
is used in the CPU code to wait for the memory copies or GPU kernels to finish. Thus, the
CPU segments in GPU applications can be modelled as serial instructions executed by one
thread.

4.3.2

Memory Modeling

Memory copying between the CPU and GPU execution units includes two stages. In the
first stage, data is copied between the CPU memory and the GPU memory through a single
peripheral component interconnect express (PCIe) for a desktop/server GPU, or through a
network on chip (NoC) for an embedded GPU. Because of the hardware protocols for PCIe
and NoC, only one global memory copy can be performed at a time. Also, the memory
copy through PCIe/NoC is non-preemptive once it starts. The memory copy time between
CPU memory and GPU memory is a linear function of the copied memory size. The GPU
and other accelerators mainly provide two types of memory movement between the CPU
and GPU (accelerators) [203,204]: direct memory copy (also called traditional memory) and
unified memory (introduced in CUDA 6.0 and strengthened in CUDA 8.0). Direct memory
copy uses traditional memory to store and access memory, where data must be explicitly
copied from CPU to GPU portions of DRAM. Unified memory is developed from zero-copy
memory where the CPU and the GPU can access the same memory area by using the same
memory addresses between the CPU and GPU. In unified memory, the GPU can access
any page of the entire system memory and then migrate the data on-demand to its own
memory at the granularity of pages. Compared with unified memory, direct memory copy is
faster (higher bandwidth) [205] and more universally applicable: it is not limited to GPU
systems but is widely used in heterogeneous computing systems in general. In the following
discussion, we focus mainly on direct memory copy, but our approach can also be directly
applied to unified memory by setting the explicit memory copy length to zero.
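The statement that copy time is a linear function of the copied size can be sketched in Python as follows. The two coefficients (a fixed per-copy cost and a per-byte cost) and the names `copy_time_ms` and `fit_linear` are assumptions for illustration; on a real platform the coefficients would come from profiling, and the sample points below are synthetic rather than measured.

```python
def copy_time_ms(num_bytes, fixed_ms, ms_per_byte):
    """Assumed linear model: copy time = fixed cost + per-byte cost * size."""
    return fixed_ms + ms_per_byte * num_bytes


def fit_linear(samples):
    """Least-squares fit of (bytes, ms) samples to the linear model above.

    Returns (fixed_ms, ms_per_byte). Needs at least two distinct sizes.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic "profiling" points generated from hypothetical coefficients.
samples = [(size, copy_time_ms(size, 0.01, 1e-6))
           for size in (1_000, 1_000_000, 4_000_000)]
fixed_ms, ms_per_byte = fit_linear(samples)
```

Because a copy over PCIe or the NoC is non-preemptive and only one runs at a time, a bound obtained this way can be treated as the length of a non-preemptive segment in the analysis.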

The second stage is the memory access from the GPU’s execution units to the GPU cache or
memory. The GPU adopts a hierarchical memory architecture. Each GPU SM has a local
L1 cache, and all SMs share a global L2 cache and DRAM banks. Although the current
NVIDIA Multi-Process Service (MPS) does not provide any official mechanism for shared
memory hierarchy partitioning, computer architecture researchers have proposed software-based generic algorithms [206] for partitioning the publicly unknown architectural details of
the GPU L2 cache and DRAM through reverse engineering. These memory accesses actually
happen simultaneously with the kernel’s execution. Thus, the second memory operation is
modeled as part of the critical-path overhead of the kernel execution model, which is discussed
in the following Section 4.4.

4.4

GPU Parallel Kernel Execution Model

This section introduces the modeling of GPU kernels, which are the key components in GPU
accelerated applications. A hard deadline requires an accurate task execution model, built
upon a deep understanding of the GPU architecture and its parallel execution mechanism.

4.4.1

Kernel-granularity and SM-granularity Scheduling

An off-the-shelf GPU supports only kernel-granularity scheduling, as shown in Fig. 4.3(a).
When kernels are launched in the GPU, each kernel fully occupies all the compute resources
(SMs) on the GPU, so even with Multi-Process Service (MPS), a GPU is by default only able
to execute one kernel at a time. The execution order of the kernels of the different tasks
can be changed in kernel-granularity scheduling, as shown in Fig. 4.3(b). Ever since the

(a) Default sequential execution

(b) Kernel-granularity scheduling

(c) SM-granularity scheduling

Figure 4.3: Comparison of three different GPU application scheduling approaches.
development of the Pascal GP100 architecture, preemption has been supported by swapping
the whole kernel context to GPU DRAM. However, preemption is mainly used for long-running or ill-behaved applications. It is not suitable for run-time systems [207, 208], since
it introduces intolerable overhead when a whole GPU kernel is swapped in and out.
The persistent threads approach is a new software workload assignment solution proposed to
implement finer and more flexible SM-granularity GPU scheduling. The persistent threads
technique alters the notion of the lifetime of virtual software threads, bringing them closer
to the execution lifetime of the physical hardware thread [183]. Specifically, each persistent
threads block links multiple thread blocks of one kernel and is assigned to one SM to execute

(a) with increasing numbers of assigned SMs

(b) comprehensive kernel with increasing size

Figure 4.4: Kernel execution time trends.
for the entire hardware execution lifetime of the kernel. For example, in Fig. 4.3(c), the
first thread block in kernel 1 (K1) links the other thread blocks in K1 to form a big linked
thread block. When this first thread block is executed by one SM, the other thread blocks
in K1, which are linked by the first block, will also be executed in the first SM. Thus, K1
takes one SM to execute. Similarly, in kernel 3 (K3), the first two thread blocks link the
other thread blocks and form two big linked thread blocks. Thus, K3 takes two
SMs to execute. The detailed persistent threads technique of linking thread blocks to form
linked thread blocks is shown in Algorithm 6.
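The linking idea can be pictured with a small host-side Python sketch that records which logical thread blocks each persistent block would work through. This is not Algorithm 6: the strided assignment (persistent block p takes logical blocks p, p + P, p + 2P, ...) is one common convention assumed here for illustration, the function name is ours, and the sketch also assumes each persistent block is resident on exactly one SM.

```python
def persistent_block_assignment(num_logical_blocks, num_persistent_blocks):
    """Map each persistent block to the logical thread blocks it runs in turn.

    Assumed strided policy: persistent block p handles logical blocks
    p, p + P, p + 2P, ... where P is the number of persistent blocks.
    """
    return {p: list(range(p, num_logical_blocks, num_persistent_blocks))
            for p in range(num_persistent_blocks)}


# K1 in the text: all of its thread blocks are linked into one persistent block.
k1 = persistent_block_assignment(num_logical_blocks=5, num_persistent_blocks=1)

# K3 in the text: its thread blocks are linked into two persistent blocks.
k3 = persistent_block_assignment(num_logical_blocks=8, num_persistent_blocks=2)
```

In this picture, K1 yields a single list covering every logical block and so occupies one SM, while K3 yields two lists and occupies two SMs. The block counts 5 and 8 are hypothetical.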
When the numbers of linked thread blocks are changed, the resulting number of persistent threads blocks controls how many SMs (i.e., GPU resources) are used by a kernel. In
addition, when there are remaining available SMs, CUDA introduces CUDA Streams that
support concurrent execution of multiple kernels. Therefore, by exploiting persistent threads
and CUDA Streams, we can explicitly control the number of SMs used by each kernel and
execute multiple kernels of different tasks concurrently to achieve SM-granularity scheduling.
Persistent threads enabled SM-granularity scheduling fundamentally improves the schedulability of parallel GPU applications by exploiting finer-grained parallelism.

4.4.2

Kernel Execution Model

To understand the relationship between the execution time of a kernel and the number
of SMs assigned via persistent threads, we conducted the following experiments. We use
five synthetic kernel benchmarks that utilize different GPU resources: a computation kernel, consisting mainly of arithmetic operations; a branch kernel containing a large number of
conditional branch operations; a memory kernel full of memory and register visits; a special-function kernel with special mathematical functions, such as sine and cosine operations;
and a comprehensive kernel including all these arithmetic, branch, memory, and special
mathematical operations. Each kernel performs 1000 floating-point operations on a $2^{15}$-long
vector.
We first run each kernel separately with a fixed workload for 1000 times and record its corresponding execution time with increasing numbers of assigned SMs, as shown in Fig. 4.4(a).
From the boxplot, we can see that the kernel execution time t follows the classic formula

$$t = \frac{C - L}{m} + L \tag{4.3}$$

where m is the number of assigned SMs, C is the amount of work of the kernel, and L is the
GPU overhead including on-chip memory visit.
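As a quick numerical illustration of Eq. (4.3), the following minimal sketch evaluates the model for a hypothetical kernel. The function name `kernel_time` and the values C = 100 and L = 4 (arbitrary time units) are assumptions made only for illustration; they are not measurements from the experiments.

```python
def kernel_time(C, L, m):
    """Execution time t = (C - L) / m + L from Eq. (4.3)."""
    if m < 1:
        raise ValueError("at least one SM must be assigned")
    return (C - L) / m + L


# Hypothetical kernel: total work C = 100, overhead L = 4 (arbitrary units).
times = {m: kernel_time(100.0, 4.0, m) for m in (1, 2, 4, 8)}

# Speedup relative to a single SM; it stays below m because L is not parallelized.
speedups = {m: times[1] / t for m, t in times.items()}
```

With these made-up values, the speedup on 8 SMs stays below 8 because the overhead L is paid regardless of m, and it approaches m only when C is much larger than L.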
This formula makes it clear that GPU kernels are fully parallel workloads, which can utilize
all the m allocated SMs. The only sequential execution time during a kernel’s execution
is when the GPU is copying data and launching the kernel. We can also observe that
the execution time of a GPU kernel has low variation because it benefits from a single-instruction, multiple-threads (SIMT) architecture, in which single-instruction, multiple-data
(SIMD) processing is combined with multithreading for better parallelism.
Next, we examined the kernel execution time with increasing kernel sizes and different numbers of assigned SMs. Fig. 4.4(b) shows that the comprehensive kernel and the other types of
kernels have similar trends. The results are again consistent with Eq. (4.3). When the size of
the kernel is significantly larger than the GPU overhead, the execution time is dominated by
the work of the kernel and has a nearly linear speedup. Also, no matter whether the kernel
is large or small, and no matter what types of operations are executed inside the kernel, the
variance of the kernel execution times is consistently small.

4.4.3 Interleaved Execution and Virtual SM

In SM-granularity scheduling with multiple GPU tasks, we can further improve GPU utilization by exploiting the interleaved execution of GPU kernels. On a GPU with M SMs, naive
SM-granularity scheduling can first concurrently execute the K1 and K2 kernels, each with
M/2 persistent threads blocks, and then execute the K3 kernel with M persistent threads
blocks, as shown in Fig. 4.5(a). Each persistent threads block requires one SM to execute
one persistent thread at a time.
On the other hand, an SM actually allows the parallel execution of two or more persistent
threads blocks to overlap if they use different components of the SM in the same cycle [209].
This interleaved execution is similar to the hyper-threading in conventional multithreaded
CPU systems that aims to improve computation performance. For example, in an NVIDIA
GTX 1080 Ti, one SM can hold 2048 software threads, whereas one thread block can have
at most 1024 software threads. Thus, two or more thread blocks can be interleaved and
executed on one SM. One important consequence of interleaved execution is that the execution time of a kernel increases. Therefore, to improve GPU utilization and efficiency, we
can simultaneously launch all three kernels, as illustrated in Fig. 4.5(b), where kernel 1 and
kernel 2 will simultaneously execute with kernel 3. The execution latency of each kernel
is increased by a factor called the interleaved factor, which ranges from 1.0 to 1.8 in the
following experiments.
We propose a virtual SM model to capture this interleaved execution of multiple GPU kernels, as shown in Fig. 4.5(c). In particular, we double the number of physical SMs to get
the number of virtual SMs. Each virtual SM can execute the same type of instruction from
one persistent threads block in one virtual cycle. Compared with a physical SM, a virtual
SM has a reduced computational ability and hence a prolonged virtual cycle, the length of
which is related to the type of instructions in the interleaved kernel. To understand the
interleaved ratio between the virtual cycle and the actual cycle, $\alpha = \frac{\text{virtual cycle}}{\text{actual cycle}}$, we empirically measured the execution time of a synthetic benchmark when it was interleaved with
another benchmark. Fig. 4.6 illustrates the minimum, median, and maximum interleaved
execution time, colored from light to dark, normalized over the worst-case execution time of
the kernel without interleaving, where the left bar is without interleaving and right bar is
with interleaving. We can see that the interleaved execution ratio is at most 1.45×, 1.7×,
1.7×, and 1.8× for special, branch, memory, and computation kernels, respectively. The
proposed virtual SM model improves the throughput by 11% to 38% compared to the naive
non-interleaved physical SM model.
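One way to relate the measured ratios to the reported throughput range is the back-of-the-envelope sketch below. It assumes, as a simplification that is not stated in the measurements themselves, that doubling the number of SMs while stretching each cycle by a factor α changes throughput by a factor of 2/α. The function name `throughput_gain` is hypothetical.

```python
def throughput_gain(alpha, virtual_per_physical=2):
    """Relative throughput change when each physical SM is modeled as
    `virtual_per_physical` virtual SMs whose cycles are `alpha` times longer.
    ASSUMPTION: throughput scales as virtual_per_physical / alpha."""
    return virtual_per_physical / alpha - 1.0


# Maximum interleaved execution ratios reported for each kernel type.
ratios = {"special": 1.45, "branch": 1.7, "memory": 1.7, "computation": 1.8}
gains = {name: throughput_gain(a) for name, a in ratios.items()}
```

Under this assumption, the largest ratio (1.8) gives a gain of about 11% and the smallest (1.45) gives about 38%, which is consistent with the range quoted above.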

Figure 4.5: Virtual SM model for interleaved execution.

Figure 4.6: Characterization of the latency extension ratios of interleaved execution: (a) on computation kernel; (b) on memory kernel; (c) on branch kernel; (d) on special kernel. Each panel plots the normalized execution time (min, median, and max) without and with interleaving.

Algorithm 6 Pseudo Code of Pinned Self-Interleaving Persistent Threads
// Get the ID of the current SM with inline assembly
static __device__ __inline__ uint32_t mysmid() {
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Kernel pinned to the desired SMs with self-interleaved persistent threads
__global__ void kernel(int desired_SMs, ...) {
    int SM_num;
    SM_num = mysmid();  // Get the ID of the current SM
    // Execute on the desired SMs, otherwise return
    // (desired_SMs denotes the pinned SMs desired_SM_start, ..., desired_SM_end)
    if (SM_num == desired_SMs) {
        // Get the global thread index: tid
        int tid = threadIdx.x + (SM_num - desired_SM_start) * blockDim.x;
        // off_set links to the next thread block by persistent threads
        int off_set = blockDim.x * (desired_SM_end - desired_SM_start + 1);
        // Divide the N threads inside a kernel into 2 halves, [0, N/2) and [N/2, N).
        // The two halves of the same kernel execute interleaved with each other.
        // From the kernel's perspective, the kernel interleaves with itself.
        if (blockIdx.x < virtual_SM / 2) {
            for (int i = tid; i < N / 2; i += off_set) {
                Execute on thread i;
            }
        } else {
            for (int i = tid + N / 2; i < N; i += off_set) {
                Execute on thread i;
            }
        }
    }
    return;
}

// Kernel launch
void main() {
    dim3 gridsize(number of virtual SM);
    dim3 blocksize(Max number of threads per block);
    kernel<<<gridsize, blocksize, ..., stream>>>(desired_SMs, ...);
    return;
}

4.4.4 Workload Pinning and Self-Interleaving

Using the persistent threads and interleaved execution techniques, multiple tasks can be
executed in parallel, and the interleaved execution further improves GPU performance. In
real GPU systems, such as NVIDIA GPUs, a hardware scheduler is implemented that allocates the thread blocks to SMs in a greedy-then-oldest manner [197]. Thus, at run time,
the thread blocks from a kernel are interleaved and executed with thread blocks from other
possible kernels, and the interleaved execution ratio is different when different kernels are
interleaved and executed, as shown in Fig. 4.6. To guarantee a hard deadline, each kernel has
to adopt the largest interleaved execution ratio when this kernel is interleaved and executed
with other possible kernels. However, always assuming the largest interleaved execution ratio
underestimates the computational capability of the GPU. Therefore, we introduce workload
pinning which pins the persistent threads blocks to specific SMs, and self-interleaving where
the kernel interleaves with itself on its pinned SMs.
Workload pinning is implemented by launching 2M persistent threads blocks in each kernel,
which is also the number of virtual SMs, so that all virtual SMs will finally have one persistent
threads block to execute. If the SM is the targeted pinning SM, the thread block will begin
to execute. Persistent threads blocks that are assigned to undesired SMs (untargeted pinning
SMs) simply return, which takes only about 10 µs. When a persistent
threads block is assigned to the correct SM, it will not only execute its own workload, but
will also execute the workloads from blocks assigned to the undesired SMs. Thus, the kernel
is actually executed on the desired SMs, and the undesired SMs execute an empty block
within a negligible time.

The self-interleaving technique evenly divides the original kernel into two small kernels, which
are assigned to the same specific SMs using workload pinning. The two small kernels are then
interleaved and executed on the pinned SMs. From the perspective of the original kernel,
it is self-interleaved on the pinned SMs. The design and implementation of persistent threads with pinned self-interleaving
is described in Algorithm 6.
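To make the index arithmetic of Algorithm 6 concrete, the following host-side sketch simulates it in plain Python and counts how often each work item is executed. It is only an illustration under stated assumptions: the function name `simulate` is hypothetical, and the block-to-SM placement `block_idx % M` is an assumed stand-in for the hardware scheduler, chosen so that every SM receives one block from each half of the grid.

```python
def simulate(N, M, block_dim, sm_start, sm_end):
    """Return how many times each work item in range(N) is executed."""
    hits = [0] * N
    virtual_sm = 2 * M                        # launched blocks = virtual SMs
    stride = block_dim * (sm_end - sm_start + 1)
    for block_idx in range(virtual_sm):
        sm = block_idx % M                    # ASSUMED block-to-SM placement
        if not (sm_start <= sm <= sm_end):
            continue                          # block on an undesired SM returns
        for thread_idx in range(block_dim):
            tid = thread_idx + (sm - sm_start) * block_dim
            if block_idx < virtual_sm // 2:   # first half of the work items
                items = range(tid, N // 2, stride)
            else:                             # second half, same pinned SM
                items = range(tid + N // 2, N, stride)
            for i in items:
                hits[i] += 1
    return hits


# Hypothetical configuration: 4 physical SMs, kernel pinned to SMs 1 and 2.
hits = simulate(N=64, M=4, block_dim=4, sm_start=1, sm_end=2)
```

In this configuration every work item is executed exactly once, even though only the blocks that land on SMs 1 and 2 do any work; the blocks on SMs 0 and 3 return immediately.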

4.5 Practical RT-GPU Tasks Scheduling

In this section, we first introduce the task model for real-time GPU tasks, then propose
the RT-GPU scheduling algorithm, and develop the corresponding response time analysis.
The RT-GPU algorithm uses federated scheduling to execute GPU kernels on virtual SMs and
uses fixed-priority scheduling to schedule CPU and memory copy segments.
One of the key challenges of deriving the end-to-end response times for CPU-GPU tasks
is to simultaneously bound the interference on CPU, GPU, and bus without being too
pessimistic. Extending federated scheduling allows us to achieve efficient and predictable
execution of GPU kernels and to analyze the response times of GPU kernels independently.
When analyzing the response times of the CPU segments, we view the CPU segments as
execution and the response times of GPU and memory copy segments as suspension; similarly,
when analyzing the response times of the memory copy segments, we switch the view and
consider the memory copy segments as execution and the response times of GPU and CPU
segments as suspension. By analyzing in this double-view, we can exploit the response time
analysis in [186] for multi-segment self-suspension tasks, which allows us to achieve better
schedulability for CPU-GPU tasks. Our proposed end-to-end response time analysis is not
limited to CPU-memory-GPU systems. It can also be applied to other heterogeneous systems,
such as CPU-memory-FPGA and CPU-memory-TPU systems.

Figure 4.7: GPU tasks real-time scheduling model.

4.5.1 Task Model

Leveraging the platform implementation and the resulting CPU, memory, and GPU models
discussed in previous sections, we now formally define the parallel real-time tasks executing
on a CPU-GPU platform. We consider a task set $\tau$ comprised of $n$ sporadic tasks, where
$\tau = \{\tau_1, \tau_2, \cdots, \tau_n\}$. Each task $\tau_i$, where $1 \le i \le n$, has a relative deadline $D_i$ and a period
(minimum inter-arrival time) $T_i$. In this work, we restrict our attention to constrained-deadline tasks, where $D_i \le T_i$, and tasks with fixed task-level priorities, where each task
is associated with a unique priority. More precisely, when making scheduling decisions on
any resource, such as CPU and bus, the system always selects the segment with the highest
priority among all available segments for that resource to execute. Of course, a segment of
a task only becomes available if all the previous segments of that task have been completed.

On a CPU-GPU platform, task $\tau_i$ consists of $m_i$ CPU segments, $2m_i - 2$ memory copy
segments, and $m_i - 1$ GPU segments. As discussed in Section 4.4.2, a GPU segment $G_i^j$
models the execution of a GPU kernel on interleaved SMs using total work $GW_i^j$, critical-path overhead $GL_i^j$, and interleaved execution ratio $\alpha_i^j$, i.e., $G_i^j = (GW_i^j, GL_i^j, \alpha_i^j)$. Thus,
task $\tau_i$ can be characterized by the following 3-tuple:

$$\tau_i = \Big( \big( CL_i^0, ML_i^0, G_i^0, ML_i^1, CL_i^1, ML_i^2, G_i^1, ML_i^3, \cdots, CL_i^j, ML_i^{2j}, G_i^j, ML_i^{2j+1}, \cdots, CL_i^{m_i-2}, ML_i^{2m_i-4}, G_i^{m_i-2}, ML_i^{2m_i-3}, CL_i^{m_i-1} \big), D_i, T_i \Big) \tag{4.4}$$

where $CL_i^j$ and $ML_i^j$ are the execution times of the $(j+1)$-th CPU and memory copy segments, respectively. In addition, we use $\check{\ }$ and $\hat{\ }$ to denote the lower and upper bounds on a
random variable. For example, $\hat{CL}_i^j$ and $\check{CL}_i^j$ are the upper and lower bounds on the execution
time of the $(j+1)$-th CPU segment of $\tau_i$, respectively.
To derive the end-to-end response time $R_i$ of task $\tau_i$, we will analyze the response times $GR_i^j$,
$MR_i^j$, and $CR_i^j$ of each individual GPU, memory copy, and CPU segment, respectively, and
calculate their lower and upper bounds in the following subsections.

4.5.2 Federated Scheduling for GPU Segments

For executing the GPU segments of the n tasks on the shared GPU with 2GN virtual SMs
(i.e., GN physical SMs), we propose to generalize federated scheduling [185], a scheduling
paradigm for parallel real-time tasks on CPU, to scheduling parallel GPU segments. The key
insight of federated scheduling is to calculate and assign the minimum number of dedicated
resources needed for each parallel task to meet its deadline.
Specifically, we allocate $2GN_i$ dedicated virtual SMs to each task $\tau_i$, such that its GPU segment $G_i^j$ can start executing immediately after the completion of the corresponding memory
copy $ML_i^{2j}$. In this way, the mapping and execution of GPU kernels to SMs are explicitly
controlled by the platform via the persistent thread and workload pinning interfaces, so the
effects caused by the black-box internal scheduler of a GPU are minimized. Additionally,
tasks do not need to compete for SMs, so there is no blocking time on the non-preemptive
SMs. Furthermore, via the self-interleaving technique, we enforce that GPU kernels do not
share any physical SMs. Therefore, the interference between different GPU segments is
minimized, and the execution times of GPU segments are more predictable.
In summary, each task $\tau_i$ is assigned $2GN_i$ dedicated virtual SMs where each of its
GPU segments self-interleaves and has an interleaved execution ratio $\alpha_i^j$. In Section 4.5.5,
we will present the algorithm that determines the SM allocation to tasks. Here, for a given
allocation, we can easily extend the formula in Section 4.4.2 to obtain the following lemma
for calculating the response time $GR_i^j$ of a GPU segment $G_i^j$.

Lemma 4.5.1 If the GPU segment $G_i^j$ has a total work in range $[\check{GW}_i^j, \hat{GW}_i^j]$, a critical-path overhead in range $[0, \hat{GL}_i^j]$, and an interleaved execution ratio in range $[1, \alpha_i^j]$, then when
running on $2GN_i$ dedicated virtual SMs, its response time is in $[\check{GR}_i^j, \hat{GR}_i^j]$, where

$$\check{GR}_i^j = \frac{\check{GW}_i^j}{2GN_i}, \quad \text{and} \quad \hat{GR}_i^j = \frac{\hat{GW}_i^j \alpha_i^j - \hat{GL}_i^j}{2GN_i} + \hat{GL}_i^j.$$

The lower bound $\check{GR}_i^j$ is the shortest execution time of this GPU segment on $2GN_i$ virtual
SMs. In the best case, there is no critical-path overhead and no execution time inflation due
to interleaved execution. The minimum total virtual work $\check{GW}_i^j$ is executed in full parallelism
on $2GN_i$ virtual SMs, which gives the formula of $\check{GR}_i^j$. In the worst case, the maximum total
virtual work is $\hat{GW}_i^j \alpha_i^j$, and the maximum critical-path overhead $\hat{GL}_i^j$ captures the maximum
overhead of launching the kernel. Since $\hat{GL}_i^j$ is a constant overhead and is not affected by
self-interleaving and multiple virtual SMs, we do not need to apply the interleaved execution
ratio $\alpha_i^j$ to $\hat{GL}_i^j$. After deducting the critical-path overhead, the remaining GPU computation
is embarrassingly parallel on $2GN_i$ virtual SMs, which results in the formula of $\hat{GR}_i^j$.
Note that Lemma 4.5.1 calculates both the lower and upper bounds on the response
time of GPU segment $G_i^j$, because both bounds are needed when analyzing the total response
time of task $\tau_i$. Both the lower and upper bounds can be obtained by profiling the execution
time of GPU segments many times.
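The two bounds of Lemma 4.5.1 can be computed directly, as in the minimal sketch below. The function name `gpu_response_bounds`, its parameter names, and the numeric values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def gpu_response_bounds(gw_min, gw_max, gl_max, alpha, gn):
    """Lower and upper response-time bounds of one GPU segment (Lemma 4.5.1).

    gw_min, gw_max: lower/upper bound on the total work
    gl_max:         upper bound on the critical-path overhead
    alpha:          maximum interleaved execution ratio
    gn:             number of physical SMs (2 * gn virtual SMs)
    """
    virtual_sms = 2 * gn
    lower = gw_min / virtual_sms
    # The overhead is not inflated by alpha and is not parallelized.
    upper = (gw_max * alpha - gl_max) / virtual_sms + gl_max
    return lower, upper


# Hypothetical segment on 2 physical (4 virtual) SMs.
lo, hi = gpu_response_bounds(gw_min=40.0, gw_max=60.0, gl_max=2.0, alpha=1.5, gn=2)
```

For these made-up inputs the segment finishes between 10 and 24 time units after it starts.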
To ensure that tasks do not share SMs, the total number of virtual SMs assigned to all tasks
must be no more than the number of available virtual SMs, i.e., $\sum_i GN_i \le GN$; otherwise,
the task set is unschedulable. During runtime execution of schedulable task sets, our platform
will generate $2GN_i$ persistent threads blocks for each GPU segment of task $\tau_i$ to execute on
its assigned $2GN_i$ virtual SMs.

4.5.3 Fixed-Priority Scheduling for Memory Copy Segments with Self-Suspension and Blocking

Our proposed algorithm, which will be explained in detail in Section 4.5.5, schedules the CPU
and memory segments according to fixed-priority scheduling. In this subsection, we will focus
on analyzing the fixed-priority scheduling of the memory copy segments on the bus. Looking
from the perspective of executing memory-copies over the bus, memory copy segments are
“execution segments”; the time intervals that task $\tau_i$ spends waiting for the CPU and GPU
to complete the corresponding computation are “suspension segments”, since the bus can be
used by other tasks during these intervals of $\tau_i$ even if $\tau_i$ has higher priority. The analysis uses
the lower bounds on the lengths of suspension segments, i.e., the lower bounds on response
times of CPU and GPU segments. For a GPU segment, the lower bound $\check{GR}_i^j$ has been
obtained in Section 4.5.2, since our proposed algorithm uses federated scheduling on the
GPU. Since the CPU segments are executed on a uniprocessor, the response time of a CPU
segment is lower bounded by the minimum execution time of this segment, i.e., $\check{CR}_i^j = \check{CL}_i^j$.
However, compared with the standard self-suspension model in Section 4.2.2, the execution
of memory copy over the bus has the following differences. (1) Because memory copy is non-preemptive, a memory copy segment of a high-priority task can be blocked by at most one
memory copy segment of any lower-priority task if this lower-priority segment has already
occupied the bus. (2) The length of suspension between two consecutive memory-copies
depends on the response time of the corresponding CPU or GPU segment. (3) The response
times of CPU segments are related to the response times of memory copy segments, which
will be analyzed in Section 4.5.4. (4) Moreover, the lower bounds on the end-to-end response
times of a task are related to the response times of all types of segments, which requires a
holistic fixed-point calculation to be presented in Section 4.5.5.
We now define the following memory copy workload function $MW_i^h(t)$, which is similar to
the workload function defined for standard self-suspension tasks in Lemma 4.2.1.

Lemma 4.5.2 $MW_i^h(t)$ bounds the maximum amount of memory copy that task $\tau_i$ can perform during an interval with a duration $t$ and a starting memory copy segment $ML_i^h$, where:

$$MW_i^h(t) = \sum_{j=h}^{l} \hat{ML}_i^{j \bmod (2m_i-2)} + \min\Big( \hat{ML}_i^{(l+1) \bmod (2m_i-2)},\ t - \sum_{j=h}^{l} \big( \hat{ML}_i^{j \bmod (2m_i-2)} + MS_i(j) \big) \Big)$$

where $l$ is the maximum integer satisfying the following condition:

$$\sum_{j=h}^{l} \big( \hat{ML}_i^{j \bmod (2m_i-2)} + MS_i(j) \big) \le t$$

and $MS_i(j)$ is defined as follows:
• If $j \bmod (2m_i-2) \ne (2m_i-3)$ and $j \bmod 2 = 0$, then $MS_i(j) = \check{GR}_i^{(j \bmod (2m_i-2))/2}$;
• Else if $j \bmod (2m_i-2) \ne (2m_i-3)$ and $j \bmod 2 = 1$, then $MS_i(j) = \check{CL}_i^{((j \bmod (2m_i-2))+1)/2}$;
• Else if $j = 2m_i - 3$, then $MS_i(j) = T_i - D_i + \check{CL}_i^{m_i-1} + \check{CL}_i^0$;
• Else $MS_i(j) = T_i - \sum_{j=0}^{2m_i-3} \hat{ML}_i^j - \sum_{j=1}^{m_i-2} \check{CL}_i^j - \sum_{j=0}^{m_i-2} \check{GR}_i^j$;

From the perspective of executing memory-copies over the bus, the $2m_i - 2$ memory copy
segments are the execution segments by the definition of self-suspension task in Section 4.2.2.
So the definition of $MW_i^h(t)$ and $l$ directly follows those in Lemma 4.2.1 by applying $\hat{ML}$ to
$\hat{L}$ and changing from $m_i$ to $2m_i - 2$.
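The workload function can be evaluated with a simple scan over consecutive segments. The sketch below is a minimal illustration of the formula in Lemma 4.5.2 under stated assumptions: the function name `workload`, the list `exec_max` (the upper bounds on the memory copy lengths of one task, used cyclically), and the callable `susp` (standing in for $MS_i(j)$) are hypothetical names, and the numbers in the usage lines are made up.

```python
def workload(exec_max, susp, h, t):
    """Maximum execution in a window of length t >= 0 starting at segment h.

    exec_max: upper bounds on the execution segments, indexed cyclically
    susp:     callable j -> minimum gap after segment j (must keep
              exec_max[j % n] + susp(j) > 0 so that the scan terminates)
    """
    n = len(exec_max)
    elapsed = 0   # time consumed by complete (execution + gap) pairs
    work = 0      # execution accumulated over segments h .. l
    j = h
    while elapsed + exec_max[j % n] + susp(j) <= t:
        elapsed += exec_max[j % n] + susp(j)
        work += exec_max[j % n]
        j += 1
    # Segment l + 1 may contribute only the part that still fits in the window.
    return work + min(exec_max[j % n], t - elapsed)


# Hypothetical task with two segments of length 2 and 3 and a constant gap of 1.
w7 = workload([2, 3], lambda j: 1, h=0, t=7)
w8 = workload([2, 3], lambda j: 1, h=0, t=8)
```

The scan stops at the largest index l that satisfies the condition in the lemma, and the final `min` term adds the partial contribution of segment l + 1.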
The key difference is in the definition of $MS_i(j)$, which is the minimum “interval-arrival
time” between execution segments $ML_i^j$ and $ML_i^{j+1}$. By the RT-GPU task model, when
$j \bmod (2m_i - 2) \ne (2m_i - 3)$, there is either a GPU or CPU segment after $ML_i^j$, depending
on whether the index is even or odd. So the lower bound on the response time of the
corresponding GPU or CPU segment is the minimum interval-arrival time on the bus. For
the latter case, the response time of a CPU segment is lower bounded by its minimum
execution time. When $j = 2m_i - 3$, $ML_i^j$ is the last memory copy segment of the first job
of $\tau_i$ occurring in the time interval $t$. In the worst case, all the segments of this job are
delayed toward its deadline, so the minimum interval-arrival time between $ML_i^j$ and $ML_i^{j+1}$
is the sum of $T_i - D_i$, the minimum execution time of the last CPU segment $\check{CL}_i^{m_i-1}$, and
the minimum execution time of the first CPU segment $CL_i^0$ of the next job. The last case
calculates the minimum interval-arrival time between the last memory copy segment of a
job that is not the first job and the first memory copy segment of the next job. Since these
two jobs have an inter-arrival time $T_i$ between their first CPU segments, intuitively, $MS_i(j)$
is $T_i$ minus all the segments of the previous job plus the last CPU segment $\check{CL}_i^{m_i-1}$ of the
previous job plus the first CPU segment $CL_i^0$ of the next job, which is the above formula.
Hence, the response time of memory copy segment $ML_k^j$ can be bounded by calculating the
interference caused by the workload of tasks $hp(k)$ with higher priorities than task $\tau_k$ and
the blocking term from a low-priority task in $lp(k)$.

Lemma 4.5.3 The worst-case response time $\hat{MR}_k^j$ is the smallest value that satisfies the
following recurrence:

$$\hat{MR}_k^j = \hat{ML}_k^j + \sum_{\tau_i \in hp(k)} \max_{h \in [0, 2m_i-3]} MW_i^h(\hat{MR}_k^j) + \max_{\tau_i \in lp(k)} \max_{h \in [0, 2m_i-3]} \hat{ML}_i^h \tag{4.5}$$

Because the execution of memory copy segments is non-preemptive, the calculation of $\hat{MR}_k^j$
extends Lemma 4.2.2 by incorporating the blocking due to a low-priority memory copy
segment that is already under execution on the bus. Under non-preemptive fixed-priority
scheduling, a segment can only be blocked by at most one lower-priority segment, so this
blocking term is upper bounded by the longest lower-priority segment.
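A recurrence of this shape is typically solved by fixed-point iteration. The sketch below illustrates that iteration with a blocking term; it is not the exact analysis of Eq. (4.5). The names `worst_case_response` and `periodic` are hypothetical, and `periodic` is a classic periodic-task workload used only as a stand-in for $\max_h MW_i^h(\cdot)$ so that the example stays short.

```python
import math


def worst_case_response(own, hp_workloads, blocking, limit):
    """Smallest r with r = own + blocking + sum(w(r) for w in hp_workloads).

    Returns None if the iteration exceeds `limit` (e.g. the deadline).
    Each workload w must be non-decreasing in its argument.
    """
    r = own + blocking
    while r <= limit:
        nxt = own + blocking + sum(w(r) for w in hp_workloads)
        if nxt == r:
            return r
        r = nxt
    return None


def periodic(c, t_period):
    """STAND-IN workload bound (not MW): ceil(r / T) * C of a periodic task."""
    return lambda r: math.ceil(r / t_period) * c


# Hypothetical segment of length 2, blocked for at most 1 time unit.
r_ok = worst_case_response(2, [periodic(1, 5)], blocking=1, limit=20)
r_fail = worst_case_response(2, [periodic(3, 4)], blocking=1, limit=10)
```

Integer time units are used so that the equality test for convergence is exact; with floating-point inputs a tolerance would be needed.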

4.5.4 Fixed-Priority Scheduling for CPU Segments

Now, we will switch the view and focus on analyzing the fixed-priority scheduling of the
CPU segments. Looking from the perspective of the uniprocessor, CPU segments become
the “execution segments”; the time intervals that task $\tau_i$ spends waiting for memory
copy and GPU segments to complete now become the “suspension segments”, since the processor can
be used by other tasks during these intervals.
For now, let us assume that the upper bounds $\hat{MR}_i^j$ and lower bounds $\check{MR}_i^j$ on response times
of memory copy segments are already given in Section 4.5.3. As for GPU segments, the
upper bounds $\hat{GR}_i^j$ and lower bounds $\check{GR}_i^j$ have been obtained in Section 4.5.2. Similarly,
we define the following CPU workload function $CW_i^h(t)$.

Lemma 4.5.4 $CW_i^h(t)$ bounds the maximum amount of CPU computation that task $\tau_i$ can
perform during an interval with a duration $t$ and a starting CPU segment $CL_i^h$, where:

$$CW_i^h(t) = \sum_{j=h}^{l} \hat{CL}_i^{j \bmod m_i} + \min\Big( \hat{CL}_i^{(l+1) \bmod m_i},\ t - \sum_{j=h}^{l} \big( \hat{CL}_i^{j \bmod m_i} + CS_i(j) \big) \Big)$$

where $l$ is the maximum integer satisfying the following condition:

$$\sum_{j=h}^{l} \big( \hat{CL}_i^{j \bmod m_i} + CS_i(j) \big) \le t$$

and $CS_i(j)$ is defined as follows:
• If $j \bmod m_i \ne (m_i - 1)$, then $CS_i(j) = \check{ML}_i^{2(j \bmod m_i)} + \check{GR}_i^{j \bmod m_i} + \check{ML}_i^{2(j \bmod m_i)+1}$;
• Else if $j = m_i - 1$, then $CS_i(j) = T_i - D_i$;
• Else $CS_i(j) = T_i - \sum_{j=0}^{m_i-1} \hat{CL}_i^j - \sum_{j=0}^{2m_i-3} \check{ML}_i^j - \sum_{j=0}^{m_i-2} \check{GR}_i^j$;

From the perspective of the uniprocessor, the $m_i$ CPU segments are the execution segments
by the definition of self-suspension task in Section 4.2.2. So the definition of $CW_i^h(t)$ and
$l$ directly follows those in Lemma 4.2.1 by applying $\hat{CL}$ to $\hat{L}$. For the minimum “interval-arrival time” $CS_i(j)$, there are two memory copy segments and one GPU segment between segments
$CL_i^j$ and $CL_i^{j+1}$ by the RT-GPU task model, when $j \bmod m_i \ne (m_i - 1)$. So $CS_i(j)$ is the
sum of the minimum response times of these segments, where the response time of a memory
copy segment is lower bounded by its minimum length. The case of $j = m_i - 1$ is the same.
The last case considers a job that is not the first job in interval $t$. The calculation is
similar to the one in Lemma 4.2.1, except that both the $2m_i - 2$ memory copy and $m_i - 1$
GPU segments constitute the suspension time.
Hence, the response time of CPU segment $CL_k^j$ can be bounded by calculating the interference
caused by the CPU workload of tasks $hp(k)$ with higher priorities than task $\tau_k$.
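The three cases of $CS_i(j)$ translate into a short function. The following sketch is a direct transcription of the case analysis in Lemma 4.5.4 for a single task; the function name `cs` and the parameter names are hypothetical, and the numbers describe a made-up task with two CPU segments.

```python
def cs(j, m, T, D, cl_max, ml_min, gr_min):
    """Minimum gap between CPU segment j and the next CPU segment.

    m:       number of CPU segments of the task
    cl_max:  upper bounds on the m CPU segments
    ml_min:  lower bounds on the 2m - 2 memory copy segments
    gr_min:  lower bounds on the m - 1 GPU segment response times
    """
    r = j % m
    if r != m - 1:
        # Two memory copies and one GPU segment separate CPU segments r and r + 1.
        return ml_min[2 * r] + gr_min[r] + ml_min[2 * r + 1]
    if j == m - 1:
        # Last CPU segment of the first job in the window.
        return T - D
    # Last CPU segment of any later job.
    return T - sum(cl_max) - sum(ml_min) - sum(gr_min)


# Hypothetical task: m = 2, period 50, deadline 40.
params = dict(m=2, T=50, D=40, cl_max=[5, 6], ml_min=[1, 2], gr_min=[10])
gaps = [cs(j, **params) for j in range(4)]
```

Here `gaps[0]` and `gaps[2]` are the within-job gaps, `gaps[1]` is the gap after the first job, and `gaps[3]` is the gap after a later job.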

Lemma 4.5.5 The worst-case response time $\hat{CR}_k^j$ is the smallest value that satisfies the
following recurrence:

$$\hat{CR}_k^j = \hat{CL}_k^j + \sum_{\tau_i \in hp(k)} \max_{h \in [0, m_i-1]} CW_i^h(\hat{CR}_k^j) \tag{4.6}$$

The formula is directly extended from Lemma 4.2.2.

4.5.5 RT-GPU Scheduling Algorithm and Analysis

For a particular virtual SM allocation $2GN_i$ for all tasks $\tau_i$, we can calculate the response
times of all GPU, memory copy, and CPU segments using the formulas in Sections 4.5.2 to 4.5.4.
Note that a task starts with the CPU segment $CL_i^0$ and ends with the CPU segment $CL_i^{m_i-1}$.
Therefore, we can upper bound the end-to-end response times for all tasks using the following
theorem, from the perspective of the CPU.

Theorem 4.5.6 The worst-case end-to-end response time $\hat{R}_k$ of task $\tau_k$ is upper bounded by
the minimum of $\hat{R1}_k$ and $\hat{R2}_k$, i.e., $\hat{R}_k = \min(\hat{R1}_k, \hat{R2}_k)$, where:

$$\hat{R1}_k = \sum_{j=0}^{m_k-2} \hat{GR}_k^j + \sum_{j=0}^{2m_k-3} \hat{MR}_k^j + \sum_{j=0}^{m_k-1} \hat{CR}_k^j \tag{4.7}$$

and $\hat{R2}_k$ is the smallest value that satisfies the recurrence:

$$\hat{R2}_k = \sum_{j=0}^{m_k-2} \hat{GR}_k^j + \sum_{j=0}^{2m_k-3} \hat{MR}_k^j + \sum_{j=0}^{m_k-1} \hat{CL}_k^j + \sum_{\tau_i \in hp(k)} \max_{h \in [0, m_i-1]} CW_i^h(\hat{R2}_k) \tag{4.8}$$

The calculations for $\hat{R1}_k$ and $\hat{R2}_k$ are extended from Lemma 4.2.3 by noticing that the time
spent waiting for GPU and memory copy segments to complete constitutes suspension segments
from the perspective of CPU execution.
With the upper bound on the response time of a task, the following corollary follows immediately.

Corollary 4.5.6.1 A CPU-GPU task $\tau_k$ is schedulable under federated scheduling on virtual
SMs and fixed-priority scheduling on CPU and bus, if its worst-case end-to-end response time
$\hat{R}_k$ is no more than its deadline $D_k$.

Computational complexity. Note that the calculations for the worst-case response times
of individual CPU and memory copy segments, as well as one upper bound on the end-to-end
response time, involve fixed-point calculations. Thus, the above schedulability analysis has
pseudo-polynomial time complexity.
Note that the above schedulability analysis assumes a given virtual SM allocation under federated scheduling. Hence, we also need to decide the best virtual SM allocation for task sets,
in order to get better schedulability. The following RT-GPU Scheduling Algorithm adopts
a brute-force approach to deciding the virtual SM allocation. Specifically, it enumerates all
possible allocations for a given task set on a CPU-GPU platform and uses the schedulability
analysis to check whether the task set is schedulable or not. Alternatively, one could easily
apply a greedy approach by assigning the minimum numbers of virtual SMs to tasks and
increasing the numbers for tasks that miss their deadline according to the schedulability
analysis, if one needs to reduce the running time of the algorithm while a slight loss in
schedulability is affordable.
The full procedure of scheduling GPU tasks can be described as follows: (1) Grid search
a federated scheduling allocation for the GPU codes and calculate the GPU segment response times
$[\check{GR}_i^j, \hat{GR}_i^j]$, as detailed in Section 4.5.2. (2) The CPU segments and memory copy segments
are scheduled by fixed-priority scheduling. (3) If all the tasks can meet their deadlines, then
they are schedulable; otherwise, go back to step (1) to grid search for the next federated
scheduling allocation. This schedulability test for hard-deadline parallel GPU tasks is summarized
in Algorithm 7.
Algorithm 7 Fixed-Priority Self-Suspension with Grid-Searched Federated Scheduling
Input Variables:
    Parameters for tasks and sub-tasks
Output Variables:
    Schedulability, SM allocation: $GN_i$
Steps:
void main() {
    // 1: GPU kernel federated scheduling grid search:
    for $GN_1 = 1, ..., GN$ do
        for $GN_i = 1, ..., GN$ do
            for $GN_n = 1, ..., GN$ do
                // 2: Calculate response times of GPU segments:
                if ($\sum_{i=1}^{n} GN_i \le GN$) then
                    $\check{GR}_i^j = \frac{\check{GW}_i^j}{2GN_i}$, $1 \le i \le n$
                    $\hat{GR}_i^j = \frac{\hat{GW}_i^j \alpha_i^j - \hat{GL}_i^j}{2GN_i} + \hat{GL}_i^j$, $1 \le i \le n$
                    // 3: Calculate worst-case response time $\hat{MR}_k^j$ for memory copy segments using Eq. (4.5)
                    // 4: Calculate worst-case response time $\hat{CR}_k^j$ for CPU segments using Eq. (4.6)
                    // 5: Calculate worst-case end-to-end response time $\hat{R}_k$ for all tasks using Theorem 4.5.6
                    if ($\hat{R}_k \le D_k$ for all $\tau_k$) then
                        Schedulability = 1; break out of all for loops
            end for
        end for
    end for
    return;
}
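The nested loops of Algorithm 7 amount to enumerating every allocation vector and testing each feasible one. The sketch below shows only that enumeration skeleton; the function name `grid_search_allocation` is hypothetical, and the schedulability test is passed in as a callable. The toy test used in the usage lines checks only a simplified GPU-time condition and is a stand-in, not the full analysis of Theorem 4.5.6.

```python
from itertools import product


def grid_search_allocation(n_tasks, gn_total, is_schedulable):
    """Return the first SM allocation (GN_1, ..., GN_n) accepted by the test,
    or None if no allocation with sum(GN_i) <= gn_total is accepted."""
    for alloc in product(range(1, gn_total + 1), repeat=n_tasks):
        if sum(alloc) > gn_total:
            continue  # tasks would have to share SMs
        if is_schedulable(alloc):
            return alloc
    return None


# TOY stand-in test: work / (2 * GN_i) must not exceed a per-task deadline.
works = [40.0, 12.0]
deadlines = [6.0, 4.0]


def toy_test(alloc):
    return all(w / (2 * g) <= d for w, g, d in zip(works, alloc, deadlines))


found = grid_search_allocation(2, 6, toy_test)
not_found = grid_search_allocation(2, 5, toy_test)
```

The search space grows as GN to the power n, which is why a greedy alternative is attractive when the running time of the algorithm matters.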

4.6 Full-System Evaluation

4.6.1 Experiment Setup

In this section, we describe extensive experiments using synthesized tasksets to evaluate the
performance of the proposed RTGPU real-time scheduling approach, via both schedulability
tests and a real system. We choose self-suspension [186] and STGM (Spatio-Temporal
GPU Management for Real-Time Tasks) [201] as baselines for comparison, as they represent the
state-of-the-art in fine-grained (SM-granularity) GPU real-time scheduling algorithms and
schedulability tests. The names for the three approaches used in our experiments are given
below.
1. Proposed RTGPU: the proposed real-time GPU scheduling of hard deadline parallel
tasks with fine-grain utilization of persistent threads, interleaved execution, virtual SM, and
fixed-priority federated scheduling.
2. Self-Suspension: real-time GPU scheduling of hard deadline parallel tasks with the
persistent threads with self-suspension scheduling, as in [186].
3. STGM: real-time GPU scheduling of hard deadline parallel tasks with the persistent
threads and busy-waiting scheduling, as in [201].
To compare the schedulability results of the three approaches, we measured the acceptance
ratio in each of four simulation setups with respect to a given goal for taskset utilization. We
generated 100 tasksets for each utilization level, with the following task configurations. The
acceptance ratio of a level was the number of schedulable tasksets, divided by the number of
tasksets for this level, i.e., 100.

Figure 4.8: Schedulability under different computation (CPU) and suspension (memory + GPU) lengths: (a) computation:suspension = 2:1; (b) computation:suspension = 1:2; (c) computation:suspension = 1:8. Each panel plots schedulable tasksets (%) against the utilization rate (0 to 2.0) for RTGPU, Self-suspension, and STGM, each with 1 and 2 memory copies.

Figure 4.9: Schedulability under different numbers of subtasks: (a) 3 subtasks; (b) 5 subtasks; (c) 7 subtasks. Each panel plots schedulable tasksets (%) against the utilization rate (0 to 2.0) for RTGPU, Self-suspension, and STGM, each with 1 and 2 memory copies.

According to the GPU workload profiling and characterization
[210], the memory length upper bound was set to 1/4 of the GPU length upper bound. We
first generated a set of utilization rates, $U_i$, with a uniform distribution for the tasks in the
taskset, and then normalized the tasks to the taskset utilization values for the given goal.
Next, we generated the CPU, memory, and GPU segment lengths, uniformly distributed
within their ranges in Table 4.1. The deadline $D_i$ of task $i$ was set according to the generated
segment lengths and its utilization rate: $D_i = \big( \sum_{j=0}^{m_i-1} \hat{CL}_i^j + \sum_{j=0}^{2m_i-3} \hat{ML}_i^j + \sum_{j=0}^{m_i-2} \hat{GL}_i^j \big) / U_i$.
In the configuration setting, the CPU, memory, and GPU lengths were normalized with one
CPU, one memory interface, and one GPU SM. When the total utilization rate, U , is 1, the
one CPU, one memory interface, and one GPU SM are fully utilized. As there are multiple
SMs available (and used), the total utilization rate will be larger than 1. The period Ti
is equal to the deadline Di . The task priorities are determined with deadline-monotonic
priority assignment.
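
The generation procedure described above can be sketched as follows. This is a minimal illustration rather than the generator used in the experiments: the function names, the dictionary layout, and the `is_schedulable` callback are hypothetical, and the segment ranges are copied from Table 4.1. It assumes each task with m subtasks has m CPU segments, 2m-2 memory copies, and m-1 GPU segments, matching the summation bounds in the deadline formula.

```python
import random

# Segment-length ranges in ms, taken from Table 4.1 (CPU, memory, GPU).
CPU_RANGE, MEM_RANGE, GPU_RANGE = (1, 20), (1, 5), (1, 20)


def generate_taskset(n_tasks, m_subtasks, target_util, rng):
    """Return task dicts whose utilizations sum to target_util."""
    # Draw raw utilizations uniformly in (0, 1], then rescale to the target.
    raw = [1.0 - rng.random() for _ in range(n_tasks)]
    scale = target_util / sum(raw)
    tasks = []
    for u_raw in raw:
        util = u_raw * scale
        cpu = [rng.uniform(*CPU_RANGE) for _ in range(m_subtasks)]
        mem = [rng.uniform(*MEM_RANGE) for _ in range(2 * m_subtasks - 2)]
        gpu = [rng.uniform(*GPU_RANGE) for _ in range(m_subtasks - 1)]
        # Deadline = total segment length / utilization; period = deadline.
        deadline = (sum(cpu) + sum(mem) + sum(gpu)) / util
        tasks.append({"util": util, "cpu": cpu, "mem": mem, "gpu": gpu,
                      "deadline": deadline, "period": deadline})
    # Deadline-monotonic: a shorter deadline means a higher priority.
    tasks.sort(key=lambda t: t["deadline"])
    return tasks


def acceptance_ratio(tasksets, is_schedulable):
    """Fraction of tasksets accepted by the supplied schedulability test."""
    return sum(1 for ts in tasksets if is_schedulable(ts)) / len(tasksets)
```

With 100 tasksets per utilization level, `acceptance_ratio` gives the value plotted on the vertical axis of the schedulability figures once a concrete schedulability test is supplied as `is_schedulable`.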

Table 4.1: Parameters for the taskset generation

Parameters                                  Value
Number of tasks N in taskset                5
Task type                                   periodic tasks
Number of subtasks M in each task           5
Number of tasksets in each experiment       100
CPU segment length (ms)                     [1 to 20]
Memory segment length (ms)                  [1 to 5]
GPU segment length⁵ (ms)                    [1 to 20]
Task period and deadline                    (Ti/Di)
GPU kernel launch overhead                  12%
Number of physical GPU SMs (N_SM/2)         10
Priority assignment                         Deadline-monotonic

In each experiment we also evaluated two models. The first model has two memory copies
between a CPU segment and a GPU segment: one from CPU to GPU and one back from GPU to
CPU. This is exactly the execution model introduced in Section 4.4. The second model has
a single memory copy between a CPU segment and a GPU segment, which combines the
CPU-to-GPU copy and the GPU-to-CPU copy. Together, these two models capture not only
CPU-GPU systems but also general heterogeneous computing architectures.

4.6.2 Schedulability Analysis

Our first evaluation focused on the schedulability of tasksets as the overall utilization
increased, with respect to different parameters pertinent to schedulability. The following
sub-subsections present the results of four simulations, each of which varied one of the
parameters we examined: the ratios of CPU, memory, and GPU segment lengths; the number of
subtasks; the number of tasks; and the number of total SMs.

CPU, Memory, and GPU Lengths

We first investigated the impact of CPU, memory, and GPU segment lengths on the acceptance
ratio. To study this quantitatively, we tested the acceptance ratio under different
length range ratios. The CPU length range is given in Table 4.1, and we changed the memory and
GPU length ranges according to the length ratio. Fig. 4.8 shows the taskset acceptance ratio when
the ratio of computation (CPU) length to suspension (memory + GPU) length was set to 2:1, 1:2,
and 1:8, which gives an exponential scale.
Not surprisingly, the STGM approach is effective only when the memory and GPU segment
(suspension segment) lengths are short enough, because STGM was developed based
on "busy waiting". While a task is being processed in memory copy and GPU segments, the
CPU core is not released and remains busy, waiting for those segments to finish. Although
this is the most straightforward approach, its pessimism lies in exactly this CPU waiting
time. Thus, it becomes ineffective and highly pessimistic when the memory copy and GPU
segments are long.
Self-suspension scheduling in [186] improves schedulability compared with
the straightforward STGM approach. Self-suspension models the memory and GPU segments
as being suspended, and the CPU is released during this suspension. The theoretical
drawback of this approach is that the suspension does not distinguish between the memory
segments and GPU segments. Instead, they are modelled as non-preemptive and will block

[Figure 4.10: Schedulability under different numbers of tasks. Panels: (a) 3 tasks; (b) 5 tasks; (c) 7 tasks. Axes: utilization rate (0 to 2.0) versus schedulable tasksets (%). Series: RTGPU, Self-suspension, and STGM, each with two memory copies (2mems) and one memory copy (1mem).]

higher priority tasks. However, in real systems, each task is allocated its own exclusive GPU
SMs, and the GPU segments in one task will not interfere with the GPU segments in other tasks.
The RTGPU schedulability analysis proposed in this work is effective even when the memory
and GPU segment (suspension segment) lengths are long. In this approach, we distinguish
the CPU, memory, and GPU segments based on their individual properties. For example,
if the CPU cores are preemptive, then no blocking happens on the CPU. Blocking happens only in
non-preemptive memory segments. Meanwhile, because federated scheduling is applied to
the GPU segments and each task is allocated its own exclusive GPU SMs, the GPU segments
can be executed immediately when they are ready, without waiting for higher-priority GPU
segments to finish or being blocked by lower-priority GPU segments.
Also, by comparing the models with one memory copy and two memory copies, we notice
that the memory copy is the bottleneck in CPU-GPU systems because of its limited resource
(bandwidth) and non-preemptive nature. Reducing the number of memory copies or combining
memory copies can increase system schedulability, especially when the memory copy
length is large, as shown in Fig. 4.8(b) and (c).

[Figure 4.11: Schedulability under different numbers of SMs. Panels: (a) 5 SMs; (b) 8 SMs; (c) 10 SMs (in progress). Axes: utilization rate (0 to 2.0) versus schedulable tasksets (%). Series: RTGPU, Self-suspension, and STGM, each with two memory copies (2mems) and one memory copy (1mem).]

Number of Subtasks

We then evaluated the impact of the number of subtasks in each task on the acceptance
ratio. From the possible values in Table 4.1, the number of subtasks, M , in each task was
set to 3, 5, or 7. The corresponding acceptance ratios are shown in Fig. 4.9. The results show
that with more subtasks in a task, schedulability decreases under all approaches, but the
proposed RTGPU approach still outperforms the other approaches. Compared with STGM,
the proposed RTGPU approach and the self-suspension approach are more robust as the
number of subtasks increases.

Number of Tasks

In a third simulation, we evaluated the impact of the number of tasks in each taskset on the
acceptance ratio. Again, from the possible values in Table 4.1, the number of tasks, N, in
each taskset was set to 3, 5, or 7. The corresponding acceptance ratios are shown in Fig. 4.10.
As with subtasks, schedulability decreases under all the approaches as the number of tasks
increases, but the proposed RTGPU approach outperformed the other two.

Number of SMs

Finally, we examined the impact of the number of total SMs on the acceptance ratio. With
the other parameters taken from Table 4.1, the number of SMs was set to 5, 8, or 10.
The corresponding acceptance ratios are shown in Fig. 4.11. All three approaches have
better schedulability as the number of available SMs increases. From this set of experiments
we can see that adding two more SMs increases the utilization rate that can be scheduled
under all three approaches. Meanwhile, among the three approaches, the proposed RTGPU
approach again achieved the best schedulability across different numbers of SMs. As shown
in Fig. 4.11(a), when the computation resources (GPU SMs) are limited, the bottleneck from
memory copy is more pronounced. The two-memory-copy model has poor schedulability under
all approaches, and the one-memory-copy model performs significantly better.

4.6.3 GPU Experiment

We also empirically evaluated the proposed RTGPU scheduling framework on a real system
with an NVIDIA GTX 1080 Ti GPU, which has 28 SMs modeled as 56 virtual SMs. (Of the 28
physical streaming multiprocessors (SMs) in an NVIDIA GTX 1080 Ti, 27 can be used
for executing parallel tasks, and 1 is reserved for handling default system applications.)
The CPU was an Intel(R) Core(TM) i7-3930K operating at 3.20 GHz with 12 cores
and 12,288 KB of on-chip cache. We implemented the synthetic benchmarks described in
Section 4.4 in a common real-time scheduling context, since concurrent execution of multiple
GPU kernels is supported only within the same CUDA context. To run multiple kernels from different
tasks simultaneously, we created a single parent process and launched each kernel using a
separate CPU thread of that parent process. For parallel kernel execution, CUDA streams

[Figure 4.12: CPU to GPU memory copy time distribution. Panels: (a) memory copy of 100KB data; (b) memory copy of 1MB data; (c) memory copy of 10MB data; (d) memory copy of 100MB data.]

[Figure 4.13: GPU kernel execution time distribution. Panels: (a) kernel thread length 10; (b) kernel thread length 100; (c) kernel thread length 1000; (d) kernel thread length 10000.]

were used to allow asynchronous copy and kernel execution. By default, the NVIDIA GPU
adopts an "adaptive power setting", in which the firmware adaptively throttles the clock speeds
of SM cores and memory when they experience a low utilization rate. To avoid interference
from this adaptive power setting and to guarantee hard deadlines, we manually fixed the
SM core and memory frequencies using the nvidia-smi command. We also set
the GPU to persistence mode to keep the NVIDIA driver loaded even when no applications
are accessing the card. This is particularly useful when running a series of short jobs.
As in the previous schedulability analysis experiments, each task in a taskset was randomly
assigned one of the values in Table 4.1. The deadline was set to the same value as the period.
We first modeled the memory copies and GPU kernels by their worst-case execution times.
The execution time distributions of different sizes of memory copies through PCIe from CPU
to GPU and from GPU to CPU are shown in Fig. 4.12, where each size of memory copy is
executed 10,000 times. Meanwhile, the execution time distributions of different GPU kernel
thread lengths (number of floating-point addition operations in one thread) are shown in
Fig. 4.13, where each thread of each GPU kernel is executed 10,000 times. Using the real
GPU system, we examined schedulability using different numbers of SMs and compared the
results from the schedulability analysis and from the real GPU experiments (with the
worst-case execution time model). Fig. 4.14 presents the acceptance ratio results of the RTGPU
schedulability analysis and experiments on the real GPU system. Both have better
schedulability as the number of available SMs increases. The gaps between the schedulability
analysis and the real GPU system arise from the pessimism of the schedulability analysis
and the mismatch between worst-case execution time and actual execution time. In the
limited computation resource scenarios (5 SMs and 8 SMs), the bottlenecks from memory
copy exist in both the schedulability test and the experiments with the real GPU system.
Reducing the number of memory copies or combining memory copies is an appropriate way to
deal with these bottlenecks. We then modeled the memory copies and GPU kernels by their
average execution times. The results from the RTGPU schedulability analysis and the real
GPU system are presented in Fig. 4.15. Because the segments are modeled by their average
execution times, which are much tighter than the worst-case execution times, the gaps between
the schedulability analysis and experiments on the real GPU system are further reduced.

[Figure 4.14: Schedulability under different numbers of SMs with schedulability analysis and real GPU experiments (with worst case execution time model). Axes: utilization rate (0 to 4.0) versus schedulable tasksets (%). Series: schedulability analysis and real GPU, each with 2mem and 1mem models on 5, 8, and 10 SMs.]

[Figure 4.15: Schedulability under different numbers of SMs with schedulability analysis and real GPU experiments (with average execution time model). Axes: utilization rate (0 to 4.0) versus schedulable tasksets (%). Series: schedulability analysis and real GPU, each with 2mem and 1mem models on 5, 8, and 10 SMs.]

[Figure 4.16: RTGPU throughput improvements. Panels: (a) improvement over whole GPU system; (b) improvement over used resources.]

Finally, we quantified the GPU throughput gained by the virtual SM model on the synthetic
and real benchmark tasksets:

$$\eta_1 = \sum_{i=1}^{N=5} \frac{\text{Numbers of SM}_{\text{task}(i)}}{\text{GPU Total Numbers of SMs}} \times \left( \frac{2}{\alpha(i)} - 1 \right) \qquad (4.9)$$

$$\eta_2 = \sum_{i=1}^{N=5} \frac{\text{Numbers of SM}_{\text{task}(i)}}{\text{Total Numbers of SMs used in taskset}} \times \left( \frac{2}{\alpha(i)} - 1 \right) \qquad (4.10)$$


where Numbers of SM_task(i) is the number of SMs used by task(i) and α(i) is the interleaved
ratio of task(i). Fig. 4.16(a) shows the throughput improvement over the whole GPU system
according to Eq. (4.9). At low utilization, few SMs are actually in use, so the throughput
gain over the whole GPU system is small. As the utilization rate increases, more
SMs are in use and bring more throughput gain over the whole system. To better quantify the
throughput improvement, we compare it with the SMs actually used, as described in Eq. (4.10),
in Fig. 4.16(b). We can see throughput improvements of over 20% and 11% in the
synthetic benchmarks and real benchmarks, respectively. This throughput improvement can be
achieved on any GPU system, regardless of its number of SMs. The synthetic
benchmark has more throughput improvement than the real benchmark because the special
function kernel in the synthetic benchmark has a low interleaved ratio α. The special kernel
is "better" interleaved, as it uses the special function units
(SFUs) while other kernels rarely use these units.
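
As a worked illustration of Eqs. (4.9) and (4.10), the following sketch evaluates both metrics for a made-up taskset. The SM counts and interleaved ratios are hypothetical values chosen only for the example, and the function name is not from the original work. The sketch assumes that α(i) = 2 corresponds to no gain from interleaving and that smaller α(i) means better interleaving, which is how the factor (2/α(i) − 1) behaves.

```python
def throughput_gain(sms_per_task, alphas, total_sms):
    """Sum over tasks of (SMs_i / total_sms) * (2 / alpha_i - 1)."""
    return sum(n / total_sms * (2.0 / a - 1.0)
               for n, a in zip(sms_per_task, alphas))


# Hypothetical taskset of N = 5 tasks: SMs per task and interleaved ratios.
sms_per_task = [2, 2, 1, 1, 2]
alphas = [1.8, 1.6, 1.9, 1.7, 1.8]

# Eq. (4.9): normalize by all SMs on the GPU (28 on the GTX 1080 Ti).
eta1 = throughput_gain(sms_per_task, alphas, total_sms=28)
# Eq. (4.10): normalize only by the SMs the taskset actually uses.
eta2 = throughput_gain(sms_per_task, alphas, total_sms=sum(sms_per_task))
```

Because the denominator in Eq. (4.10) is never larger than the one in Eq. (4.9), `eta2` is at least `eta1`, which matches the observation that the improvement over used resources is the more informative of the two at low utilization.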

4.7 Conclusion

To execute multiple parallel real-time applications on GPU systems, especially for
cyber-physical systems with hard deadlines, we propose RTGPU, a real-time scheduling
method that includes both system work and a real-time scheduling algorithm with
schedulability analysis. RTGPU leverages a precise timing model of the GPU applications with
the persistent threads technique and achieves improved fine-grained utilization through
interleaved execution. The RTGPU real-time scheduling algorithm is able to provide real-time
guarantees of meeting deadlines for GPU tasks with better schedulability compared with
previous work. We empirically evaluate our approach using synthetic benchmarks on both
schedulability analysis and real NVIDIA GTX 1080 Ti GPU systems and demonstrate significant
performance gains compared to existing methods. The improved performance and resource
utilization rates accelerate the artificial intelligence and machine learning algorithms
executed on GPUs for many emerging cyber-physical systems, such as autonomous vehicles
and robots, to perform important system operations.

Chapter 5
Circuit, Architecture, and Operating System Layers: Fast Learning-based Energy Management for Multi-/Many-core Processors

Over the last two decades, as microprocessors have evolved to achieve higher computational
performance, their power density has also increased at an accelerated rate. Improving energy
efficiency and reducing power consumption are therefore critically important to computing
systems, especially those used in cyber-physical systems, which have strong requirements for
high performance and low power consumption. One effective technique for improving energy
efficiency is dynamic voltage and frequency scaling (DVFS). With the emergence of integrated
voltage regulators, the speed of DVFS can reach microsecond (µs) timescales. However, a
practical and effective strategy to guide fast DVFS remains a challenge. In this chapter, we
propose F-LEMMA: a fast, learning-based, hierarchical DVFS framework consisting of a global
power allocator in the kernel space, a reinforcement learning-based power management scheme
at the architecture level, and a swift controller at the digital circuit level. This
hierarchical approach combines computation at the system and architecture levels with the
short response time of the swift controller to achieve effective and rapid µs-level power
management supported by the integrated voltage regulator. Our experimental results
demonstrate that F-LEMMA can achieve significant energy savings (35.2%)
across a broad range of workloads. Compared conservatively with existing state-of-the-art
DVFS-based power management schemes that can operate only at millisecond timescales,
F-LEMMA can provide notable (up to 11%) Energy-Delay Product (EDP) improvements
across benchmarks. Compared with state-of-the-art non-learning-based power management,
our method has a universally positive effect on all of the evaluated benchmarks, demonstrating
its adaptability.

5.1 Introduction

Multi-/many-core processors have become mainstream computing workhorses for cyber-physical
systems; in particular, multi-core and many-core embedded systems [28] are widely
used in resource-constrained environments. With the demise of Dennard scaling [211, 212]
and the increasing level of integration of digital logic on a single die, high power density
has become a key design constraint and performance-limiting bottleneck for future generations
of computing systems used in resource-constrained cyber-physical systems, which need
not only high performance but also low power consumption. Dynamic power management
(DPM) techniques, such as dynamic voltage and frequency scaling (DVFS) and power gating,
are widely used in state-of-the-art processor systems to save power and improve
energy efficiency. For example, Intel's Enhanced Intel SpeedStep Technology (EIST) [213],

[Figure 5.1: Microsecond-level hierarchical fast power management (DVFS) for multi-core and many-core processors. Three layers are shown with their response times: the global controller at the user applications/operating system layer (milliseconds), the learning controller at the computer architecture layer (milliseconds to microseconds), and the swift controller at the digital circuit layer (microseconds).]

AMD’s PowerNow! [214], ARM’s Intelligent Energy Controller (IEC) [215] and NVIDIA’s
Power Management Mode [216] provide the utility for the voltage and frequency (clock
speed) of the processor to be dynamically changed to different power states by software.
This capability allows the processor to meet the instantaneous performance demands from
diverse computational workloads while minimizing power consumption and heat generation.
In a typical setting, voltage and frequency are decreased as the processor enters an idle stage
and increased as it enters an active stage.
Because seeking a more effective power management strategy is critically important for the
computing systems used in cyber-physical systems, many adaptive solutions have been explored
recently by leveraging control theory and machine learning approaches. In these adaptive
power management schemes, the control/learning agent can monitor the workload status at
run-time and adjust the voltage and frequency settings according to its online estimation
model [217–220]. In conventional power delivery systems for multi-core and many-core
processors, a cluster of cores (or even all the cores) may reside in one voltage domain and share
one voltage rail from an off-chip voltage regulator. Due to the long physical distance and
associated parasitic loading effect, the voltage transition time of an off-chip voltage regulator
generally exceeds a millisecond, which fundamentally limits how quickly the power management
settings can be adjusted in response to transient workload events that can happen within
several microseconds. Although integrated voltage regulation provides much finer spatial
(per-core) and temporal (tens to hundreds of nanoseconds) granularity in supply voltage
allocation and delivery [31, 65], a practical and effective method is still lacking to realize
adaptive power management at the microsecond timescale and take advantage of such
fast integrated voltage and frequency scaling ability. Meanwhile, as the computational
complexity of the control and machine learning algorithms and their execution costs in software
increase, the latency and response time of many adaptive power management schemes cannot
be readily scaled to meet the demands of microsecond-level DVFS.
In this chapter, we present F-LEMMA, a fast learning-based voltage and frequency scaling
approach for energy-efficient multi-core and many-core processors. To reap the previously
unattainable benefits of microsecond timescale power management, we propose a hierarchical
learning-based approach, illustrated in Fig. 5.1. This hierarchical power management approach has three layers: a global controller works as the kernel space interface to a userspace
energy and power management methodology; an intermediate learning-based controller takes
in the architectural information and utilizes a reinforcement learning agent to update the
configuration of a lower-level swift controller; finally, the swift controller uses a fast linear
classifier to generate voltage and frequency pairs for each core at the microsecond timescale.
Here, we validate the proposed F-LEMMA approach under different configurations and using
several benchmark applications, and compare it with previous related work. Our experimental
results show that F-LEMMA achieves 35.2% energy savings on average across a wide
range of benchmarks. Compared with state-of-the-art power management at the millisecond
timescale, the microsecond-level fast power management in F-LEMMA saves significant
amounts of energy with only minimal performance loss.
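
To make the division of labor between the learning controller and the swift controller concrete, the sketch below shows one way a fast linear classifier could map run-time counters to a voltage and frequency pair. Everything in it is an assumption made for illustration: the voltage/frequency levels, the feature vector, the threshold scheme, and the function names are hypothetical and are not taken from the F-LEMMA implementation.

```python
# Hypothetical discrete (voltage in V, frequency in GHz) levels, low to high.
VF_LEVELS = [(0.8, 1.0), (0.9, 1.5), (1.0, 2.0), (1.1, 2.5)]


def swift_controller(features, weights, thresholds):
    """Pick a V/F pair from one weighted sum and a few comparisons."""
    score = sum(w * x for w, x in zip(weights, features))
    # The level index is the number of thresholds the score reaches,
    # so thresholds must hold len(VF_LEVELS) - 1 ascending values.
    level = sum(1 for t in thresholds if score >= t)
    return VF_LEVELS[level]


def reconfigure(weights, new_weights):
    """Slower learning layer overwrites the swift controller's weights."""
    weights[:] = new_weights
```

The point of the split is that `swift_controller` costs only a handful of multiply-adds and compares, so it can plausibly run every few microseconds, while `reconfigure` is called far less often by the slower learning layer.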
This chapter makes the following contributions to the state-of-the-art in power and energy
management:

• An illustration of the potential benefits of microsecond timescale per-core DVFS and
a comparison study of integrated voltage regulators and power delivery systems supporting this fast DVFS.
• A hierarchical power management strategy, including a global controller as the interface to the operating system, a learning controller at the architecture layer, and a
swift controller at the circuit layer. This architecture provides adaptive, microsecond
timescale, per-core, fast DVFS.
• A quantitative study methodology proposed and applied to F-LEMMA with OpenMP
synthetic benchmarks. F-LEMMA power management achieves over 90% of the ideal
DVFS and the learning-based program phase prediction is critical to the power management.
• An evaluation of the run-time adaptive hierarchical power management approach, and
an implementation of its learning controller with High-Level Synthesis (HLS).

• A comprehensive experimental study of the proposed F-LEMMA approach, which
demonstrates extra energy savings from fast power management. The evaluation includes comparisons to previous related work, ablation studies of different layers, and
assessments of performance with different system configurations and scales.

5.2 Background and Related Work

5.2.1 Dynamic Voltage Frequency Scaling (DVFS)

Dynamic voltage and frequency scaling (DVFS) is a technique to manage processor power
consumption. Run-time dynamic power has a quadratic relationship with voltage and a linear
relationship with frequency ($P_{dynamic} \sim C V^2 f$), whereas static power scales with
voltage ($P_{static} \sim V N_{tr} I_{static}$), where $N_{tr}$ is the number of transistors and
$I_{static}$ is the normalized static current for each transistor.
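
The two proportionalities can be turned into a small numerical check. The sketch below is illustrative only: the function names and the capacitance, voltage, and frequency values are assumptions, and the proportionality constants are taken to be 1.

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """Dynamic power, proportional to C * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz


def static_power(voltage, n_transistors, i_static):
    """Static power, proportional to V * N_tr * I_static."""
    return voltage * n_transistors * i_static


# Scaling both voltage and frequency to 80% of nominal (hypothetical values).
nominal = dynamic_power(1e-9, 1.0, 2.0e9)
scaled = dynamic_power(1e-9, 0.8, 1.6e9)
ratio = scaled / nominal  # 0.8 ** 3, i.e. roughly half the dynamic power
```

Lowering voltage and frequency together by 20% therefore cuts dynamic power by about 49%, while static power under this model falls only in proportion to the voltage.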
Effective DVFS for multi-core processors requires multiple voltage domains. The circuitry
within one voltage domain shares a common voltage rail, hence opportunities to reduce the
domain’s voltage are limited by the unit that needs the highest supply voltage. Voltage
levels are scaled in fixed, discrete steps and are typically selected using tables that map
frequency to voltage. Voltage and frequency scaling is based on the application’s performance
requirements. For example, when one core is waiting for synchronization, its voltage and
frequency are reduced to save power and energy.
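
The shared-rail constraint described above can be sketched in a few lines. The frequency-to-voltage table and the function name are hypothetical; real tables are product-specific.

```python
# Hypothetical table mapping frequency (GHz) to minimum supply voltage (V).
FREQ_TO_VOLT = {1.0: 0.80, 1.5: 0.90, 2.0: 1.00, 2.5: 1.10}


def domain_voltage(core_freqs):
    """A shared rail must satisfy the core that needs the highest voltage."""
    return max(FREQ_TO_VOLT[f] for f in core_freqs)


# Four cores in one voltage domain: one busy core keeps the rail at 1.10 V.
shared = domain_voltage([1.0, 1.0, 2.5, 1.5])
# With per-core domains, each core gets only the voltage it needs.
per_core = [domain_voltage([f]) for f in [1.0, 1.0, 2.5, 1.5]]
```

In this example three of the four cores run at a higher voltage than their frequency requires when they share a rail, which is the inefficiency that multiple voltage domains remove.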

[Figure 5.2: The integrated voltage regulator based power delivery system. The diagram contrasts microsecond-scale voltage transitions from the on-chip integrated voltage regulator (IVR) with millisecond-scale transitions from the off-chip voltage regulator (VR) on the PCB.]

5.2.2 Adaptive Power Management

In recent years, as the workloads in multi-core and many-core systems have become more
diverse and variable, adaptive power management has replaced previous fixed models. To
achieve effective power management, workloads are predicted at run-time using adaptive
models. There are two general strategies. On one hand, control theoretic mechanisms, such
as Kalman filters [150] and model predictive control [221], use dynamically updated models
to scale voltage and frequency under power or performance constraints. On the other hand,
learning mechanisms predict application phases and control decisions without knowing an
accurate workload model in advance [222,223]. With reinforcement learning, an agent learns
to act optimally in an environment by evaluating and selecting actions that optimize for
desired rewards. Reinforcement learning can be adapted for power management by training
a per-core DVFS agent that selects the appropriate voltage and frequency levels by observing
system conditions [217]. Because both the adaptive control and learning algorithms are
relatively complex with considerable execution time, such adaptive power management can
operate only at low frequencies. This problem can be mitigated by introducing a hierarchical
design in which adaptive power management techniques are constrained at the software level
and supply information to fast controllers.
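
As a minimal sketch of the reinforcement learning idea, the code below uses tabular Q-learning, which is one common choice; the source does not state which algorithm the cited work [217] uses. The state labels, the three voltage/frequency level indices, the reward, and the function names are all hypothetical.

```python
import random

ACTIONS = [0, 1, 2]  # hypothetical V/F level indices: low, mid, high


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy selection over the V/F levels for one core."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))


def q_update(q, state, action, reward, next_state, lr=0.1, gamma=0.9):
    """One tabular Q-learning step: Q += lr * (target - Q)."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + lr * (reward + gamma * best_next - old)
```

In a power management setting the reward would typically combine energy saved with a penalty for lost performance, and the state would be derived from observed system conditions such as performance counters. Each update needs a table lookup over all actions, which illustrates why such agents tend to run at a coarser timescale than a simple fast controller.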

5.2.3 Integrated Voltage Regulators

In a conventional power delivery system for multi-core or even many-core processors, cores
share a common voltage rail and a centralized voltage regulator is located off-chip to step
down the supply voltage from the PCB level (5-12V) to the core level (0.8-2V). Because
the off-chip voltage regulator uses large inductors and capacitors, together with the
board-level decoupling capacitors and prominent parasitic inductance, there is an unavoidably
long transition time (rise time and fall time) before the voltage reaches a desired level. This
limits dynamic voltage and frequency scaling in processors with off-chip voltage regulator
module (VRM) based power delivery systems to millisecond timescales.
Emerging power delivery systems use integrated voltage regulators, moving the step-down
voltage regulator on-chip, as shown in Fig. 5.2. Integrated regulator design strives to
reduce the size of inductors and capacitors to a small on-die area. One prominent side effect
of this design strategy is that it pushes the switching frequency into the range of tens to
hundreds of MHz. Such a high switching frequency incurs significant switching losses and
degrades conversion efficiency.
The integrated voltage regulator naturally has a much shorter transition time than
conventional off-chip voltage regulators. This advantage comes from smaller inductors and
capacitors, faster switching, and reduced parasitic inductance thanks to its closer location to
the core. Measured results from prototype silicon chips [224–227] suggest that power delivery
with integrated regulators can easily switch between voltage levels at timescales of tens to
hundreds of nanoseconds. Furthermore, integrated on-chip regulators support multiple, flexible
voltage domains, which would incur expensive design overhead when using off-chip regulators.
In summary, integrated voltage regulators permit fast, per-core power management,
which was previously unattainable.

5.2.4 Related Work

As more transistors are integrated on die and the limits of Dennard scaling are being realized,
power and energy have become major constraints on processors' development. Many
power management techniques have been proposed to improve power efficiency according
to different objectives. Winter et al. [228] presented a thread scheduling and global power
management co-design for a heterogeneous many-core processor. Sartori et al. [229] studied
peak power management in a distributed hierarchical configuration, given a power budget.
Haghbayan et al. [211], Rahmani et al. [212] and Shafique et al. [218] used dynamic power
management methods based on a PID controller, a multi-objective controller, and an adaptive
controller to improve system power efficiency. Jung et al. [219], Shen et al. [222, 230], Chen et
al. [217, 231], Rapp et al. [232] and Yu et al. [233] used learning-based predictors and
controllers to find optimal power and performance. Rahmani et al. [234], Ebi et al. [235], Lai
et al. [236] and Kanduri et al. [237] explored reliability/variability, thermal, latency or
accuracy aware solutions. Hierarchical power management [238] has been widely adopted from
mobile devices [239] to cloud computers [240]. Muthukaruppan et al. [241] and Ren et al. [220]
used hierarchical frameworks for adaptive power management. More techniques [242–244]
target multi-core than many-core processors. Limited by the supply voltage transition time
in processor power delivery systems and the complexity of effective power management
algorithms, these management schemes operate at millisecond timescales.

With the development of integrated voltage regulators, per-core microsecond level fast DVFS
has become practical. Kim et al. [224], Toprak-Deniz et al. [225], Meinerzhagen et al. [245],
Kim et al. [58], and Keller et al. [227] designed integrated voltage regulators that can support sub-microsecond level dynamic voltage scaling. Kim et al. [31] and Eyerman et al. [246]
studied the potential system level energy benefits from microsecond level dynamic voltage
scaling supported by on-chip integrated voltage regulators. Höppner et al. [247] and Tseng
et al. [226] studied fast DVFS on MPSoCs and SRAMs respectively. Kasture et al. [248]
proposed a fine-grain DVFS scheme for latency-critical workloads. Bai et al. [249] proposed
a voltage regulator efficiency aware power management strategy, which relied upon reinforcement learning. Although the fast per-core DVFS supported by integrated voltage regulators
offers a potential means to improve power and energy efficiency, effective power management
strategies are still missing. Our learning-based hierarchical power management approach not
only can leverage run-time information to make optimal decisions adaptively, but also can
reach microsecond timescales to adjust voltage and frequency.

5.3 Methodology

In this section, we first reveal the potential benefits of microsecond timescale per-core DVFS
and compare integrated voltage regulator and power delivery system designs to study the
achievable speeds of fast DVFS. Then we introduce the proposed fast hierarchical learning-based power management strategy, with the global controller as the interface to users, the
learning controller at the architecture level, and the swift controller at the digital circuit level.

Figure 5.3: Workload power and throughput traces in many-core processors. (Panels: (a) FFT (core 1); (b) Radix (core 1); (c) Radix (core 2); (d) Radix (core 4); (e) Water (core 1). Each panel plots Power (W) and IPC over time.)

Fig. 5.3 illustrates the potential benefits of microsecond-level power management. Here, the
power consumption of a core is shown in black lines and its throughput (instructions per cycle,
IPC) in blue lines during microsecond intervals for representative workloads on a simulated
16-core Intel Nehalem CPU processor. An interval with fewer instructions per cycle (IPC)
transitions could be a candidate DVFS interval, in which the core can reduce the frequency
and voltage to save power with only rare instances of performance degradation. In addition
to intervals that have transitions at millisecond timescales, we find there exist many more
transitions at microsecond timescales, exhibiting distinctive traits. First, such transitions
often appear irregularly within the workloads. For example, the transitions indicated in
Fig. 5.3 (a) are occasional power and activity peaks and valleys in the power-light “FFT”
benchmark, providing opportunities to apply DVFS to lower the voltage and frequency
during the low-activity period without incurring performance loss. Secondly, transitions
arise from interactions among threads. In running the power-hungry “Radix” benchmark,
core 1 and core 4 (in Fig. 5.3 (b) and (d)) are in synchronization stalls, waiting for core
2 (in Fig. 5.3 (c)). Thirdly, transitions occur from periodic power and activity within
a workload, and between the completion of one workload and the start of another. Fig.
5.3 (e) shows the periodic power and activity in the workload “Water”. In addition to
computation in the user space, the majority of request service times in the kernel space
require less than 250 microseconds, even with millisecond tail latencies [250]. Based on
observations from benchmark executions on the architecture simulators and the program
execution patterns (under fast DVFS scenarios) discussed in related works [31, 246],
typical DVFS opportunities generally fall into two classes: one originates from periodic
or occasional execution periods with low computation and memory intensity, and the other
can be attributed to stalls from synchronization, thread scheduling, periodic activities, and
so on. Since conventional power delivery systems with off-chip voltage regulators can only
support millisecond voltage scaling, many energy-saving opportunities are lost. In contrast,
integrated voltage regulators can adjust voltages within microseconds and offer flexible per-core implementation, thus opening the door for fast and adaptive power management at the
system level.

5.3.1 Power Delivery System for Fast DVFS

As the first step in building the foundation for our hierarchical power management approach
with online learning, we explore the state-of-the-art power delivery systems designed to
enable fast, per-core DVFS.
In conventional power delivery systems that use off-chip voltage regulators, a buck converter
is deployed for its high efficiency across a wide input and output range. However, it requires
more than 10 microseconds to scale voltage, due to the passive components like inductors
and capacitors in off-chip voltage regulators, parasitic inductance along the power delivery
networks, and bloated decoupling capacitance at the PCB board and package levels. Recent
technology advances make it possible for switching regulators to operate at much higher
frequencies. At the higher switching frequency, the passive components can be much smaller
and integrated on the same die as processors. Given these advantages, integrated voltage
regulators have been adopted in both academic prototypes and industrial and commercial
processors. Although IVRs have a slightly lower voltage conversion efficiency than off-chip
voltage regulators, they enjoy lower supply voltage noise, which compensates for the voltage
conversion loss. Most importantly, the IVR naturally has a much shorter transition time
because of the smaller passive components, reduced parasitics, and the avoidance of PCB
and package-decoupling capacitance.
As the starting point for exploring hierarchical fast integrated voltage and frequency
scaling for energy-efficient multi-/many-core processors, we begin with the power delivery
systems that determine the possible DVFS speeds. To maximize the versatility of the proposed hierarchical learning-based power management, we choose mainstream two-stage heterogeneous power delivery systems with both off-chip and on-chip integrated buck voltage
regulators. A buck-based two-stage heterogeneous power delivery system can fully represent
mainstream power delivery systems with integrated voltage regulators because it offers high
power delivery efficiency and flexible, fast voltage scaling [68, 251].
Alternatives for the on-chip regulator suffer from several limitations. A switched capacitor
has a fixed conversion ratio and cannot support fine-grained voltage scaling. A low drop
out (LDO) voltage regulator offers fast voltage scaling, but its power conversion efficiency
is determined by the ratio of output to input voltages. As voltage and frequency scale
down, the conversion losses in an LDO more than offset any power and energy savings in
the processor. Customized reconfigurations of IVR-based power delivery systems are studied
in [252], but they lack the needed versatility.

Table 5.1: Summary of design space explorations of 16-phase buck IVRs.

DVFS Speed            1 µs     2 µs     4 µs     8 µs     16 µs
Efficiency (%)        79.1     80.6     82.8     82.8     82.8
Switch Freq. (MHz)    146      119      60       60       60
L per-phase (nH)      0.188    0.188    0.75     0.75     0.75
C per-phase (µF)      0.281    0.422    0.422    0.422    0.422
Area (mm²)            92       137      142      142      142

Having decided to use heterogeneous power delivery systems with both off- and on-chip
integrated buck voltage regulators, we proceed to determine the proper DVFS speeds. As
we discussed before, the passive components like inductors and capacitors in integrated
voltage regulators and power delivery networks limit the voltage transition time. For a
heterogeneous power delivery system, we use the open-source integrated voltage regulator
modeling tool Ivory [162] and the power delivery networks for manycore systems [44] to
explore the design spaces of IVRs in heterogeneous power delivery systems that can support
different fast DVFS. The loads are the processor cores described in Section 5.6.1. Here, we
set the voltage scaling rise time to within 0.5% of the DVFS interval durations [58,68,69,84]
and the voltage overshoot to less than 5%. The key design parameters for IVRs that support
different DVFS speeds are summarized in Table 5.1. When the DVFS speeds are faster than
8 µs, the DVFS speed is one of the constraints of IVR design. To support faster
DVFS, IVR designs keep reducing the size of on-die inductors and capacitors to achieve a
faster voltage transition, and one prominent side effect is pushing the switching frequency
from tens to hundreds of MHz. The higher switching frequency comes at the cost of degrading
the conversion efficiency of the IVRs as the switching loss becomes more significant. When
the DVFS speeds are slower than 4 µs, the DVFS speed is not a constraint of IVR design,
which means the optimal IVR that targets high efficiency can naturally support
DVFS speeds no faster than 4 µs.

Figure 5.4: Normalized energy consumption of throughput (IPC) guided DVFS at different
microsecond timescales. (Bar chart; y-axis: normalized energy saving, 0 to 30%; one bar per DVFS interval of 1, 2, 4, 8, 16, 32, 64, and 128 µs; benchmarks: fft, radix, lu.cont, ocean, cholesky, water.nsq, blackscholes, bt, cg, ft, lu, sp.)

Unlike conventional millisecond timescale power management, at microsecond timescales the
DVFS controller has limited computational ability, and only simple arithmetic operations
can be applied to control the DVFS. Fig. 5.4 shows the normalized energy consumption
of throughput-guided DVFS (measured in IPC, instructions per cycle [253]) at different
microsecond timescales for the system described in Section 5.6. In this DVFS strategy, if the
run-time IPC in the DVFS interval is larger than 0.8 × the average run-time IPC, the
voltage and frequency increase by one level. If the run-time IPC in the DVFS interval is
smaller than 0.6 × the average run-time IPC, the voltage and frequency decrease by one
level. The experimental results show that microsecond-level fast DVFS based
on throughput (IPC) saves more energy when the DVFS runs faster. However, limited by
the computational ability at the microsecond timescale, the fast DVFS cannot be effective
on all the benchmarks, e.g., radix, lu.cont, cholesky, blackscholes, and cg.
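
The throughput-guided rule described above can be sketched in a few lines. This is a minimal illustration rather than the controller used in the experiments: the function names, the four-level V/F ladder, and the use of a running mean as the "average run-time IPC" are assumptions made for this example; only the 0.8 and 0.6 thresholds come from the text.

```python
def ipc_guided_step(level, interval_ipc, average_ipc, num_levels=4,
                    raise_ratio=0.8, lower_ratio=0.6):
    """Return the next V/F level index (0 = lowest) for one DVFS interval.

    num_levels is a hypothetical ladder size; the ratios follow the text.
    """
    if interval_ipc > raise_ratio * average_ipc:
        return min(level + 1, num_levels - 1)   # raise V/F by one level
    if interval_ipc < lower_ratio * average_ipc:
        return max(level - 1, 0)                # lower V/F by one level
    return level                                # inside the dead band: hold


def run_policy(ipc_trace, start_level=3, num_levels=4):
    """Apply the rule to a trace, using a running mean as the average IPC
    (an assumption; the text does not say how the average is maintained)."""
    level, total, levels = start_level, 0.0, []
    for n, ipc in enumerate(ipc_trace, start=1):
        total += ipc
        level = ipc_guided_step(level, ipc, total / n, num_levels)
        levels.append(level)
    return levels
```

For the short trace `[2.0, 2.0, 0.2, 0.2, 2.0]`, starting at the top level, this sketch steps down twice during the low-IPC intervals and back up once afterwards. Because each decision needs only one multiplication and one comparison per threshold, a rule of this shape fits the limited computation available at microsecond timescales.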

5.3.2 Hierarchical Power Management Framework

Conventional DVFS control algorithms can be implemented in the processor microarchitecture, in the scheduler, or through compiler algorithms [254,255]. Most prior research in DVFS
control has been implemented in the operating system with coarse temporal granularity, a
sensible approach when off-chip regulators have slow response times and voltages change
on the order of several milliseconds. Integrated voltage regulators enable more responsive
DVFS, saving power and energy at microsecond granularity, but effective mechanisms are
required to guide such fine-grained DVFS. Simply running
conventional DVFS control algorithms more frequently is not feasible, not only because it is hard
to finish the computation within microseconds but also because the kernel module overhead
(such as thread switching) alone already exceeds the microsecond scale.
To guide the microsecond timescale fast DVFS effectively within the computational constraints, we propose a hierarchical DVFS management that implements three control layers.
First, a global controller in the kernel space specifies the power budget and energy performance weights. This global controller also works as the interface with computer users. Users
can use their own power budget and energy performance weights, based on the applications
they are running. Next, a per-core learning controller in the architectural layer is implemented with reinforcement learning to pass the refined run-time architectural information
to a swift controller. Finally, the fast swift controller then makes decisions based on the
refined architectural information and run-time power and performance. This hierarchical
layered approach not only acts fast, at the swift controller frequency, but also adapts as the
application progresses at the learning controller frequency.

Figure 5.5: Reinforcement learning and swift controllers. (Block diagram labels: Environment; Core; Per-Core RL Agent with an actor-critic network of 4 layers (19, 32, 32, |A|+1); Per-Core Swift Controller; State (500 µs); Reward (500 µs); BackProp (25 × 500 µs); Swift Input Features (4 µs); V/F Pair (4 µs).)
State features: 01. IPC; 02. Branch Predictor Misses; 03. Voltage Level; 04. Power Consumption; 05. D-TLB Misses; 06. D-TLB Accesses; 07. L1-D Stores; 08. L1-D Store Misses; 09. L1-D Loads; 10. L1-D Load Misses; 11. L2 Stores; 12. L2 Store Misses; 13. L2 Loads; 14. L2 Load Misses; 15. L3 Stores; 16. L3 Store Misses; 17. L3 Loads; 18. L3 Load Misses; 19. Memory Loads.
Swift input features: 01. IPC; 02. Power Consumption.

5.3.3 Global Controller

The top controller in this hierarchical framework is the global controller. The global controller runs at the kernel level and provides a programmable interface for the users of multi-/many-core systems to adjust the features of power management. The global controller can
accept energy and performance weights and power budgets from user inputs that guide the
learning controller as it navigates varied modes that favor battery life, favor performance, or
balance the two. The global controller updates the reward function of the learning controller
according to the user inputs, as shown in Eq. 5.1.

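$$R = -W_E \times \frac{\text{energy}}{\text{peak energy}} + W_I \times \frac{\text{IPC}}{\text{peak IPC}} - W_B \times \frac{|\text{power} - \text{budget}|}{\text{peak energy}} \qquad (5.1)$$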

where $W_E$ is the weight for energy, $W_I$ is the weight for performance, and $W_B$ is the weight for
power budget. The energy is calculated at the frequency of the learning controllers. By adjusting
the weights in the reward function, the corresponding features of the power management
will be selected.
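
Eq. 5.1 can be evaluated directly once the normalizing peak values are known. The sketch below is an illustration only: the function name `reward` and its parameter names are hypothetical, and the budget term is normalized by peak energy simply because that is how the equation is printed.

```python
def reward(energy, ipc, power, budget, peak_energy, peak_ipc,
           w_e=1.0, w_i=1.0, w_b=1.0):
    """Evaluate the reward of Eq. 5.1 for one learning-controller period.

    w_e, w_i, w_b stand for W_E, W_I, W_B; all names are illustrative.
    """
    energy_term = energy / peak_energy               # penalized
    ipc_term = ipc / peak_ipc                        # rewarded
    budget_term = abs(power - budget) / peak_energy  # penalized, as printed
    return -w_e * energy_term + w_i * ipc_term - w_b * budget_term


# Equal weights versus an energy-only reward (W_I = W_B = 0).
balanced = reward(5.0, 1.0, 12.0, 10.0, peak_energy=10.0, peak_ipc=2.0)
energy_only = reward(5.0, 1.0, 12.0, 10.0, peak_energy=10.0, peak_ipc=2.0,
                     w_i=0.0, w_b=0.0)
```

Setting a weight to zero removes the corresponding term, which is how a user-facing mode that favors battery life or performance could be expressed under these assumptions.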

5.3.4 Learning Controller

The learning controller, at the architectural layer, leverages reinforcement learning (RL)
to help the DVFS adapt to applications. RL is a subset of machine learning built upon a
Markov Decision Process, which describes interactions between an agent and its environment
over time. The environment is represented by states. At each time step, the agent selects
an action that changes the environment and thus the state. After selecting this action, the
agent transitions to a new state and receives a reward associated with this state.
Table 5.2 lists the components of an RL model. Note that actions or elements of the state
space can be either continuous or discrete.
Table 5.2: RL terminology. RL’s goal is to find an optimal policy $\pi(a|s)^*$.

Terminology                    Symbol
Action Space                   $a \in A$
State Space                    $s \in S$
Reward Function                $R \in \mathbb{R}^1$
Return                         $G = \sum_{t=0}^{k} R$
Policy                         $\pi(a|s)^*$
State Value Function           $V(s)$
State-Action Value Function    $Q(s, a)$

The value functions describe the expected return for being in some state or for taking an
action in a state when following policy π:

$$V(s) = \mathbb{E}_\pi[G \mid S_t = s] \qquad (5.2)$$

$$Q(s, a) = \mathbb{E}_\pi[G \mid S_t = s, A_t = a] \qquad (5.3)$$

The agent’s goal is to learn the policy $\pi(a|s)^*$, which maps each state to an action that maximizes
the expected return $G$ over $k$ time-steps, where future time-steps are discounted by a factor $\gamma$.
Policy gradient methods, such as Actor-Critic, directly optimize the policy by approximating
the policy $\pi$ and value function ($Q$ or $V$) using approximators such as neural networks [256].
The actor consumes the state and produces a probability distribution over the action space.
The critic learns a real-valued number that approximates the value function, $V(s)$. The
actor-critic network reaps the benefits of both value-based reinforcement learning methods,
which are more sample efficient and steady, and policy-based methods, which are better
for continuous and stochastic environments. In this project, we use the actor-critic network
in the learning engine as a modest starting point, in the hope of encouraging researchers to come forward
with valuable contributions based on more powerful learning engines.
If the action space is discrete with size |A|, the final layer in the approximator network is
a flattened vector of size |A| that is passed through a softmax layer to produce a discrete
probability distribution. If the action space is continuous (e.g., $a \in [0, 1]$), the final layer
approximates a probability distribution by predicting its parameters. For example, a layer
that approximates a normal distribution must predict the mean µ and variance σ of a,
increasing the number of outputs required for the approximation function [257].
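
The two output styles can be illustrated without a neural network by treating the final-layer outputs as plain numbers. The following sketch makes several assumptions: the function names are hypothetical, the second Gaussian parameter is treated as a standard deviation, and samples are clipped to [0, 1] to respect the example action range (the text does not say how out-of-range samples are handled).

```python
import math
import random


def softmax(logits):
    """Turn |A| final-layer outputs into a discrete probability distribution."""
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(logits, rng):
    """Sample an action index from the softmax distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_continuous(mean, std, rng, low=0.0, high=1.0):
    """Sample from a predicted normal distribution and clip to [low, high].

    Clipping is an assumption made for this example.
    """
    return min(high, max(low, rng.gauss(mean, std)))
```

A discrete head with four actions needs four outputs, while a continuous head for a two-dimensional action needs a mean and a spread per dimension, which is the increase in output count mentioned above.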
In each time step, the actor takes a normalized state as input, then forward propagates the
neural network approximator and selects an action by sampling from the output distribution.
A trajectory is built by saving actions, states, and rewards over several time steps. The actor
and critic networks share weights and are trained jointly by back propagating them with the
appropriate loss functions and utilizing stored trajectories.
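
The text does not spell out the loss functions, so the following is only a sketch of one common ingredient of actor-critic training: computing discounted returns from a stored trajectory and comparing them with the critic's value estimates. The function names and the default discount factor are assumptions.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = R_t + gamma * G_{t+1} for every step of a trajectory."""
    returns, running = [], 0.0
    for r in reversed(rewards):           # accumulate from the last step back
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, gamma=0.99):
    """Difference between observed returns and the critic's estimates V(s_t).

    A positive value means the sampled action did better than the critic
    expected; using it to weight the actor update is a common choice, assumed
    here rather than taken from the source.
    """
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]
```

With a trajectory of 25 stored steps, matching the backpropagation period shown in Fig. 5.5, both lists would simply have 25 entries.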

Algorithm 8 Learning Controller (with Swift) (∼500 µs)
Input: N_cores, f(s; θ_1), ···, f(s; θ_{N−1}), ŝ_mean, ŝ_std
i ← 0
while (i < N_cores) do
    s ← get_core_state(i)
    s ← (s − ŝ_mean) / ŝ_std
    Forward propagation: µ_policy, σ_policy, V(s) ← f(s; θ_i)
    Construct π(a|s) ← N(µ_policy, σ_policy)
    Sample weights w_i ∼ π(a|s)
    Update swift controller (i, w_i)
    R ← observe_reward(i)
    Store µ_policy, σ_policy, R, V(s)
    i ← i + 1
end while
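
The loop of Algorithm 8 can be mirrored in a short Python sketch. Everything outside the algorithm's own steps is an assumption for illustration: the per-core networks are passed in as plain callables returning (µ, σ, V), and the state, swift-update, and reward hooks are hypothetical callbacks supplied by the caller.

```python
import random


def learning_controller_step(n_cores, policies, get_core_state, update_swift,
                             observe_reward, s_mean, s_std, rng):
    """One pass over all cores, following the steps of Algorithm 8.

    policies[i](state) stands in for f(s; theta_i) and must return
    (mu, sigma, value), where mu and sigma are lists of equal length.
    """
    log = []
    for i in range(n_cores):
        raw = get_core_state(i)
        # Normalize each counter with the stored mean and standard deviation.
        state = [(x - m) / sd for x, m, sd in zip(raw, s_mean, s_std)]
        mu, sigma, value = policies[i](state)          # forward propagation
        # Sample the swift-controller weight vector w_i ~ N(mu, sigma).
        weights = [rng.gauss(m, sd) for m, sd in zip(mu, sigma)]
        update_swift(i, weights)
        r = observe_reward(i)
        log.append((mu, sigma, r, value))              # stored for training
    return log
```

The returned log plays the role of the "Store" step; in this sketch, training on the stored tuples is left to the caller.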

For power management, the environment is the processor core’s activity, and the state space
is defined by 19 normalized performance counters, including instruction throughput, branch
prediction misses, cache misses, and reads, as well as the current power and voltage levels.
To collect samples of the state space, each benchmark is run with a random DVFS policy
that performs DVFS every 500 µs. The performance counter values are stored at each of the
VF transitions. This process is repeated until roughly 1,000 samples for each benchmark
are collected. When a particular benchmark is studied, the corresponding stored samples
are then used to normalize inputs to the learning controller. See Fig. 5.5 for details. The
reward function is a linear combination of instruction throughput, energy, and the power
budget determined by the global controller [249].
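
The normalization step can be sketched as follows. This is an assumed implementation: the function names are hypothetical, the population standard deviation is used, and a small epsilon guards against counters that never change in the collected samples.

```python
import statistics


def fit_normalizer(samples):
    """Compute per-counter mean and standard deviation.

    samples is a list of state vectors (one list of counter values per
    V/F transition) collected under the random DVFS policy.
    """
    columns = list(zip(*samples))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def normalize(state, means, stds, eps=1e-8):
    """Standardize one state vector; eps avoids division by zero."""
    return [(x - m) / (s + eps) for x, m, s in zip(state, means, stds)]
```

With 19 counters and roughly 1,000 samples per benchmark, `fit_normalizer` would return two 19-element lists that play the role of ŝ_mean and ŝ_std in Algorithm 8.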
The learning controller can manage the DVFS settings either independently or in coordination with the swift controller at a lower level. During independent management, it directly
maps the core’s state to a voltage-frequency pair. During coordinated management, it sends
an intermediate weight vector to the swift controller as described in Algorithm 8. Table 5.3
summarizes the action spaces of these two operations.

Table 5.3: Action space of the actor neural network.

Experiment Type             Action Space
Without Swift Controller    $a \in \{VF_1, \cdots, VF_4\}$
With Swift Controller       $\vec{a} \in [0, 1]^2$

5.3.5 Swift Controller

The swift controller for each core is implemented at the digital circuit layer, managing
its power and energy consumption by adjusting its voltage and frequency on microsecond
timescales, which is supported by the integrated voltage regulator. First, the swift controller
monitors current drawn by its core during each fine-grained monitoring interval (e.g., 100
ns in our study) to calculate power consumption. Second, it accesses hardware performance
counters. These measurements together guide voltage and frequency settings at microsecond
timescales.
The swift controller uses a linear classifier as described in Eq. 5.4, where $X$ is the input
feature vector, $W$ is the weight vector for the input features, and $b$ is the bias. When
$f(X, W, b)$ is greater than threshold $R_i$, the swift controller sets voltage and frequency to $V_i$
and $F_i$.

$$f(X, W, b) = WX + b \qquad (5.4)$$

Operating at microsecond timescales, the linear classifier must be computationally simple
yet effective. The classifier takes only two run-time parameters, power consumption and
instruction throughput (IPC), to define the input $X = [P(t), IPC(t)]$. Beyond the instruction
throughput, we also considered and tested other performance counters, such as cache hits and misses.
At conventional millisecond timescales, these counters improve the model’s accuracy when
estimating system dynamics. However, at the microsecond timescales we consider, these
counters exhibit rapid and large fluctuations that can cause the system to oscillate and fail
to converge.
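
The decision made by the linear classifier of Eq. 5.4 can be sketched as below. The mapping from score to level is an assumption: thresholds are taken to be sorted in ascending order, and the selected level is the number of thresholds the score exceeds, so that level 0 is the lowest V/F pair. The function name and argument layout are hypothetical.

```python
def swift_decision(power, ipc, weights, bias, thresholds):
    """Return a V/F level index from the linear score f(X, W, b) = W.X + b.

    weights is the two-element vector supplied by the learning controller;
    thresholds are assumed ascending (R_1 < R_2 < ...).
    """
    score = weights[0] * power + weights[1] * ipc + bias
    level = 0
    for r_i in thresholds:
        if score > r_i:       # score clears this threshold: move one level up
            level += 1
    return level


# Example with normalized inputs and three thresholds (four V/F levels).
level = swift_decision(power=0.5, ipc=0.5, weights=(0.5, 0.5), bias=0.0,
                       thresholds=(0.25, 0.5, 0.75))
```

The per-interval cost is two multiplications, two additions, and a handful of comparisons, which is consistent with the requirement that the classifier stay computationally simple.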
We propose a hierarchical management strategy in which the global and learning controllers
dynamically update the weight vector W . Updated weights help the swift controller capture
diverse workload phases and variations adaptively. Depending on the workload phase, power
and IPC have different roles in estimating system behavior. For example, suppose the fixed-point unit dissipates less power and the floating-point unit dissipates more power. As a
workload performs a varying mix of fixed and floating-point operations, simply using power
or instruction throughput alone cannot accurately classify the system behavior, even with
offline trained weights.

5.4 Quantitative Study of Internal Metrics with Synthetic Benchmarks

In this section, to quantitatively study the internal metrics and behaviors of hierarchical
learning-based fast power management, we propose a rigorous methodology with synthetic
benchmarks. We generate synthetic benchmarks with manually defined “ideal” DVFS opportunities. Theoretically, in these DVFS opportunities, the voltage and frequency should
be immediately reduced to the lowest levels, with no performance loss. By comparing the
behavior of the DVFS controller with this ideal strategy, we can quantitatively describe
the distance between the proposed F-LEMMA DVFS controller and the ideal DVFS controller at microsecond timescales and find out what contributes to and dominates the “less than ideal”
mismatch.
To cover the two categories of fast DVFS opportunities (computation/memory intensity
variations and long stalls in thread activity), we generate three benchmarks, computation,
memory, and combination, based on OpenMP for multi-core and many-core systems. We
manually create ideal opportunities for microsecond timescale DVFS by inserting microsecond timescale sleep intervals between the operations. In the computation and
memory benchmarks, we use the sleep to create DVFS opportunities by adjusting the computation
and memory intensity. In the combination benchmark shown in Algorithm 9, we not only use
the inner loop sleep to create DVFS opportunities by adjusting the computation and
memory intensity, but also use the outer loop sleep to emulate long stalls, such as thread synchronization and scheduling. Meanwhile, the switching between the computation and memory
parts represents program phase changes, which always happen with the long stalls.
In these synthetic benchmarks, the voltage and frequency should be reduced during the sleep
intervals without adding performance loss. With these synthetic benchmarks, any power and
performance pattern with microsecond timescale resolution can be generated easily and
its theoretical DVFS bounds can be obtained. After applying F-LEMMA to these
three synthetic benchmarks, we quantitatively evaluate the energy saving and performance
loss of F-LEMMA against the theoretically ideal DVFS strategy for each benchmark. We apply F-LEMMA with the swift controller at different speeds (1 µs and 4 µs) to the benchmarks
with different DVFS interval lengths. Fig. 5.6 shows the normalized energy consumption
of F-LEMMA applied to the three synthetic benchmarks. Fig. 5.7 shows the normalized
performance of F-LEMMA applied to the three synthetic benchmarks, where the ideal bound is 100%. The X axis is the DVFS interval, and the boxplots at each interval indicate
the performance of F-LEMMA with swift controllers of different speeds.

Algorithm 9 Combination Benchmark Example
#include <omp.h>
#include <unistd.h>

// length, threads_1, threads_2, computation_length, memory_length, and the
// *_interval values are benchmark parameters assumed to be defined elsewhere.
int main(int argc, char* argv[]) {
    int threads = omp_get_max_threads();  // assumption: unset in the original
    double x[length], y[length];
    // OpenMP parallel execution
    #pragma omp parallel
    for (int i = 0; i < threads_1; i++) {
        // The computation part:
        for (int j = 0; j < computation_length; j++) {
            x[j] = (i + j) * 0.5 / (threads + 0.1);
        }
        usleep(low_computation_interval);  // inner sleep: low computation intensity
        // The memory part:
        for (int j = 0; j < memory_length; j++) {
            x[j] = y[j];
        }
        usleep(low_memory_interval);       // inner sleep: low memory intensity
        // The next computation/memory part ......
    }
    usleep(iteration_interval);            // outer sleep: emulates a long stall
    for (int i = 0; i < threads_2; i++) {
        // ......
    }
    return 0;
}

Figure 5.6: Quantitative study of the DVFS on energy saving. (Panels: (a) Computation benchmark; (b) Memory benchmark; (c) Combination benchmark. Each panel plots normalized energy, 50% to 120%, against DVFS intervals in microseconds, 2 to 256, for the 1 µs and 4 µs DVFS.)

Figure 5.7: Quantitative study of the DVFS on performance loss. (Panels: (a) Computation benchmark; (b) Memory benchmark; (c) Combination benchmark. Each panel plots normalized performance in instructions per second, 76% to 100%, against DVFS intervals in microseconds, 2 to 256, for the 1 µs and 4 µs DVFS.)

Accurate detection and fast action are the two most important internal metrics of the
DVFS, and both arise from the cooperation between the learning controller and the swift controller. Fast action is determined by how quickly the swift controller can detect the DVFS
intervals with the linear classifier and adjust voltage and frequency. Accurate detection
is mainly determined by how accurately the learning controller can provide the weights that
capture the status of the program and processor for the swift controllers.
We first take a look at the fast action given an accurate prediction. We choose the synthetic
computation and memory benchmarks because in these two benchmarks, there is only one
type of operation, there are no program phase changes, and an accurate prediction can be obtained
after a long enough training process. We use the energy saving and performance loss under
different swift controller speeds within each synthetic benchmark to evaluate the impact
of the fast action on the DVFS. In the computation and memory benchmarks shown in
sub-figures (a) and (b) of Fig. 5.6 and Fig. 5.7, when the swift controller operates at 1
µs, the voltage and frequency are adjusted to save energy once the DVFS interval is
over 2 µs. The swift controller operating at 4 µs can only save energy when the
DVFS interval reaches 32 µs. This is because if the swift controller operating at 4 µs is used
to adjust voltage and frequency to save energy for intervals smaller than 32 µs, significant
performance loss will be introduced. For the swift controllers operating at both 1 µs and 4 µs,
the distribution becomes wider when the DVFS interval is short, because immediate
voltage and frequency changes are not always possible. When the arrival or end of a DVFS
interval is detected just before the swift controller’s action, the swift controller can act
quickly and more energy is saved. However, when the arrival or end of a DVFS interval
is detected just after the swift controller’s action, the voltage and frequency cannot be
adjusted until the swift controller’s next action. On one hand, reducing the voltage
and frequency late leads to less energy saving. On the other hand, if the swift controller is not
able to immediately increase the voltage and frequency because of its action time, performance
loss will be introduced.
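
The timing effect described above can be made concrete with a small worked example. It rests on simplifying assumptions that are not from the source: the swift controller acts exactly at multiples of its period, detection is instantaneous, and the voltage transition time is ignored. The function names are hypothetical.

```python
import math


def reaction_delay(event_time_us, period_us):
    """Time from an event (start or end of a DVFS interval) to the next
    controller action, assuming actions occur at multiples of period_us."""
    next_tick = math.ceil(event_time_us / period_us) * period_us
    return next_tick - event_time_us


def usable_fraction(interval_us, period_us, event_time_us):
    """Fraction of a low-activity interval that remains after the
    controller has reacted; 0.0 means the interval is missed entirely."""
    delay = reaction_delay(event_time_us, period_us)
    return max(0.0, interval_us - delay) / interval_us
```

Under these assumptions, an interval that begins 0.5 µs after a 4 µs controller tick waits 3.5 µs for the next action: a 2 µs interval is over before the controller reacts, whereas a 32 µs interval still leaves most of its length available. The same delay applies when the interval ends, which is where the performance loss comes from.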
Next, we consider the accurate prediction. We use the synthetic combination benchmark
to test the prediction, as shown in sub-figure (c) of Fig. 5.6 and Fig. 5.7. We choose the
combination benchmark because it contains two representative program phases (computation intensive and memory intensive) and these two phases keep switching as the benchmark
executes. Previously, in the computation and memory benchmarks shown in sub-figures (a)
and (b) of Fig. 5.6 and Fig. 5.7, there was only one type of operation and no program phase
changes. Therefore, the learning controller only needed to give an accurate prediction
after training. However, in the combination benchmark, a 100% accurate prediction does not exist, especially when the program phases change within the period of the learning
controller. This is because the learning controller cannot update the prediction when the
phase changes are faster than the prediction rate of the learning controller. The best prediction
from the learning controller is a compromise prediction that considers all the phases. For example, in this synthetic benchmark, the learning controller needs to give a compromise
prediction that considers both the computation-intensive and memory-intensive phases. Previously, in the
computation and memory benchmarks, the accurate prediction let the 1 µs swift controller
and the 4 µs swift controller adjust the voltage and frequency when the DVFS interval
was longer than 2 µs and 32 µs, respectively, which successfully saved energy. However, under the compromise
prediction of the combination benchmark, the 4 µs swift controller changes voltage and frequency even when the DVFS interval is only 2 µs, which is shorter than the swift controller’s
period. This means the changed voltage and frequency cannot keep up with the DVFS interval, and the
voltage and frequency should not be changed at all. Therefore, the compromise prediction not only introduces performance
loss but also consumes more energy.
To summarize, both fast action and accurate prediction are critical to the hierarchical
learning-based fast power management. Under an accurate prediction, the action speed
determines the shortest DVFS intervals the controller can keep up with. However, because of
the physical limitation (learning rate given the computation size) of the learning controller,
it sometimes has to give a compromise prediction. This compromise prediction weakens the DVFS
benefit or even causes extra energy consumption, which happens rarely but does occur.

5.5 Online Learning and System Implementation

In this section, we first demonstrate the necessity of online learning control and then validate
the functionality of learning controllers and test them with different reward functions. Then
we use high-level synthesis (HLS) to implement the online learning controller and estimate
its latency and power cost.
To demonstrate the necessity of online learning control, we examine the impacts of the
input features (19 performance counters) on the output weights for the IPC and power.
We measure the average Pearson correlation coefficients between input features and output
weights, where the reward function has IPC, energy, and power budget terms with the same
weight during 100 epochs. Table 5.4 shows the average Pearson correlation coefficients of the
most power-light and power-hungry benchmarks, fft and radix, in the SPLASH-2 benchmark
set. From the results of the fft benchmark, the weight of the performance indicator IPC has
a higher correlation coefficient with the input features than the weight of power. Conversely,
for the benchmark radix, the weight of power has a higher correlation coefficient than the
weight of the performance indicator IPC. In the power-light application fft, the performance
indicator is more critical in guiding the DVFS. Not surprisingly, the misses, like TLB misses
and cache misses, have a negative correlation with the weight for the performance indicator
IPC. Meanwhile, the correlation coefficients for the same input feature vary greatly across
different benchmarks. Using an offline trained learning controller, it is hard to balance the
variations across benchmarks and adapt to different benchmarks with a convincing energy
saving.
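
For reference, the Pearson correlation coefficient between one input feature and one output weight can be computed as in the sketch below. The function name is hypothetical, and returning 0.0 for a constant series is a convention chosen for this example; how the per-epoch values were averaged in the experiments is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the samples of one performance counter and `ys` the corresponding samples of the IPC weight or the power weight produced by the learning controller.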
The online-learning based power management solution can adapt to these variations. Fig.
5.8 shows the learning (convergence) progress for three representative benchmarks, where

Table 5.4: Pearson correlation coefficients between input features and output weights.

                         Weight for (fft)        Weight for (radix)
Input feature            IPC        power        IPC        power
IPC                       0.054      0.0161       0.0574     0.3794
Branch Pred. Misses       0.085      0.0035       0.0055     0.2174
Voltage Levels           -0.0251     0.0098       0.0017    -0.2725
Power Consumption        -0.0146     0.0123      -0.0232    -0.1653
D-TLB Misses             -0.0219     0.0190      -0.0249     0.3244
D-TLB Accesses            0.0026     0.0143      -0.0544     0.4350
Memory Loads             -0.0310    -0.0290       0.0185     0.2091
L1 D $ Stores             0.0032     0.0166      -0.0541     0.4381
L1 D $ Store Misses      -0.0223     0.0099      -0.0195     0.2272
L1 D $ Loads              0.0023     0.0126      -0.0550     0.4326
L1 D $ Load Misses       -0.0042    -0.0260      -0.0510     0.4955
L2 Stores                -0.0223     0.0099      -0.0195     0.2272
L2 Store Misses          -0.0223     0.0099      -0.0184     0.2335
L2 $ Loads               -0.0032    -0.0256      -0.0491     0.4999
L2 $ Load Misses         -0.0059    -0.0218      -0.0497     0.5032
L3 Stores                -0.0261     0.0098      -0.0189     0.2350
L3 Store Misses          -0.0243     0.0159       0.0051     0.2175
L3 $ Loads               -0.0073    -0.0217      -0.0497     0.5045
L3 $ Load Misses         -0.0097    -0.0234       0.0101     0.2259

the reward function has IPC, energy, and power budget terms with the same weight. For
comparison, the progress under a reward function with only the energy term is also shown
(where we set the weights for IPC and power budget to 0). In these experiments, the learning controller for each benchmark starts from randomly generated weights,
meaning the learning controller starts from a random position. Fig. 5.8 shows
that energy consumption in both scenarios decreases as the benchmark progresses, which
means the online learning controller keeps adapting to the current benchmark
and saves more energy over time. As expected, the reward function with only the energy term
makes the system energy consumption fall further and faster over the epochs. By adjusting

Figure 5.8: Learning progress under different reward functions.
[Plot: normalized energy versus epoch (0 to 100) for ocean, water.nsq, and cg, each under the IPC+energy+budget and the energy-only reward functions.]

the weights in the reward function of the learning controller, the user can customize F-LEMMA to favor
either performance or energy.
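To make the effect of the reward weights concrete, the following is a minimal sketch of a weighted reward with IPC, energy, and power budget terms. The functional form, the normalization of the inputs, and the weight order (energy, IPC, budget) are assumptions made for illustration only; the sketch is not the reward function defined earlier in this chapter.

```python
def reward(ipc, energy, power, power_budget, weights):
    """Hypothetical weighted reward; weights = (w_energy, w_ipc, w_budget).

    All inputs are assumed to be normalized to a baseline run so that the
    three terms are comparable in magnitude.
    """
    w_energy, w_ipc, w_budget = weights
    over_budget = max(0.0, power - power_budget)  # penalize only violations
    return w_ipc * ipc - w_energy * energy - w_budget * over_budget

balanced = (1.0, 1.0, 1.0)     # same weight on all three terms
energy_only = (1.0, 0.0, 0.0)  # weights for IPC and power budget set to 0

# Two hypothetical operating points for the same workload.
slow = dict(ipc=0.60, energy=0.85, power=0.60, power_budget=0.90)
fast = dict(ipc=1.00, energy=1.00, power=0.85, power_budget=0.90)

for name, w in (("balanced", balanced), ("energy_only", energy_only)):
    print(name, reward(**slow, weights=w), reward(**fast, weights=w))
```

With these illustrative numbers, the balanced weights prefer the fast operating point, while the energy-only weights prefer the slow one, which is the kind of trade-off the user can steer through the weights.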
In our design, the learning controller is executed on an application-specific integrated circuit
(ASIC) located close to the core. Compared with execution at the software level on general-purpose CPUs, execution in an ASIC avoids redundancy from middleware, which improves
both performance and energy efficiency. We experimentally compared software and ASIC designs of the learning
controller. In software, the learning controller took over 40
microseconds to execute on a 2.3 GHz dual-core Intel Core i5 processor (about 40 µs / 500 µs = 8% performance overhead). To estimate the performance of the ASIC design, we synthesized
the learning controller on a Zynq-7000 FPGA using Xilinx Vivado HLS 2019.1, with
pipelining applied to optimize the design. The learning controller took 1464 cycles at 100 MHz
(14.64 µs) to execute, with an average power consumption of 1.38 W. Scaled from the 28 nm technology
used in the Zynq-7000 FPGA to the state-of-the-art 10 nm technology [258, 259] used in
mainstream processors, this online learning controller can execute within 10 microseconds
and introduces less than 2% overhead with lower power consumption. The ASIC design
thus speeds up the learning controller by 4 times over execution in software. The swift
controllers operate at microsecond timescales, and each swift controller operation requires 2 fixed-point
multiplications, 1 addition, and up to 3 comparisons. The overheads of the swift
controllers are negligible compared with that of the learning controller.
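The overhead figures quoted above follow from simple arithmetic, reproduced here as a short sketch. All constants are taken from the text; the variable names are illustrative.

```python
EPOCH_US = 500.0         # learning controller period
SW_LATENCY_US = 40.0     # measured software latency (lower bound, "over 40")
FPGA_CYCLES = 1464       # HLS result on the Zynq-7000
FPGA_CLOCK_MHZ = 100.0   # 100 MHz means 100 cycles per microsecond
ASIC_LATENCY_US = 10.0   # projected upper bound after scaling to 10 nm

fpga_latency_us = FPGA_CYCLES / FPGA_CLOCK_MHZ
sw_overhead = SW_LATENCY_US / EPOCH_US
asic_overhead = ASIC_LATENCY_US / EPOCH_US
speedup = SW_LATENCY_US / ASIC_LATENCY_US

print(f"FPGA latency:  {fpga_latency_us:.2f} us")
print(f"SW overhead:   {sw_overhead:.0%}")
print(f"ASIC overhead: {asic_overhead:.0%} (upper bound)")
print(f"Speedup:       {speedup:.0f}x")
```

Because the software latency is a lower bound and the ASIC latency an upper bound, the 4 times speedup is a conservative figure.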

5.6

Evaluation Results

In this section, we test F-LEMMA (the proposed hierarchical learning-based fast DVFS) via
architecture-level performance and power simulators. We compare F-LEMMA with state-of-the-art power management solutions and evaluate F-LEMMA with an ablation study
and under different system configurations across real benchmarks.

5.6.1

System Setup

We evaluate the proposed hierarchical learning-based power management scheme with experiments on an Intel Nehalem x86 processor, which is detailed in Table 5.5. We use Sniper
v7.3 [83] (with McPAT [260]) to simulate the system performance and power (dynamic power
and leakage power) for this multi-/many-core processor, generating run-time statistics with
a granularity of 100 ns. We integrate both the NumPy and PyTorch packages with Sniper to
implement the hierarchical learning design. Sniper performs timing simulations for multithreaded, shared-memory applications with tens to hundreds of cores, and has been validated
for Intel Core2 and Nehalem systems. From the PARSEC, SPLASH-2, and NPB benchmark suites,
we select representative power-light, power-moderate, and power-hungry benchmarks that
cover a wide range of scientific and computational domains. The global controller operates in

Table 5.5: Architecture parameters and hyperparameters for the hierarchical controller.

Configuration                   Value
Number of cores                 2-128
Core architecture               Intel Nehalem (x86)
V/F levels (V/GHz)              1.20/2.0, 1.08/1.8, 0.96/1.6, 0.84/1.4
Nominal V/F                     1.20/2.0
DVFS transition overhead        40 cycles
L1-I/D cache                    32KB, 4-way, LRU
L2 cache                        512KB, 8-way, LRU
L3 cache                        8MB, 16-way, LRU
Global/learning/swift ctrl.     10 ms, 500 µs, 4 µs
NN architecture                 4-layer (19, 32, 32, |A| + 1)
Learning rate                   1 × 10^-3
Discount reward factor          γ = 0.95
Trajectory size for backprop    25
Optimizer                       Adam (β1, β2 = 0.9, 0.999)

kernel space and is triggered by userspace power management. The learning controllers operate every 500 microseconds, a rate limited by the computational complexity of the learning
algorithm. To account for the DVFS transition overhead, each voltage and frequency
switch is charged 40 cycles in the Sniper simulator. As a conservative choice, the swift
controllers operate at a 4-microsecond timescale, as determined by the voltage transition times
of the integrated voltage regulators. Later, we examine the swift controllers operating
at the different DVFS speeds supported by integrated voltage regulators, as studied in Section
5.3.1.
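As an illustration of how lightweight a swift controller step can be, the sketch below selects one of the four V/F levels of Table 5.5 using exactly 2 fixed-point multiplications, 1 addition, and at most 3 comparisons. Only this operation count and the V/F levels come from the text; the decision rule, the Q8 fixed-point format, the weights, and the thresholds are assumptions chosen for illustration.

```python
FRAC_BITS = 8  # assumed Q8 fixed point: real value = raw / 2**8

def to_fixed(x):
    """Convert a real number to its Q8 integer representation."""
    return int(round(x * (1 << FRAC_BITS)))

# V/F levels from Table 5.5, ordered lowest to highest: (volts, GHz).
VF_LEVELS = [(0.84, 1.4), (0.96, 1.6), (1.08, 1.8), (1.20, 2.0)]

def swift_step(power, throughput, w_power, w_throughput, thresholds):
    """Return a V/F level index from two fixed-point inputs.

    Uses 2 multiplications and 1 addition to form a score, then at most
    3 comparisons against ascending thresholds (hypothetical values).
    """
    score = (w_power * power + w_throughput * throughput) >> FRAC_BITS
    level = 0
    for t in thresholds:  # three thresholds, so at most 3 comparisons
        if score < t:
            break
        level += 1
    return level

# Hypothetical weights (as if supplied by the learning controller).
w_p, w_t = to_fixed(-0.5), to_fixed(1.0)
cuts = [to_fixed(0.1), to_fixed(0.3), to_fixed(0.5)]

busy = swift_step(to_fixed(0.4), to_fixed(0.9), w_p, w_t, cuts)
stalled = swift_step(to_fixed(0.4), to_fixed(0.2), w_p, w_t, cuts)
print(VF_LEVELS[busy], VF_LEVELS[stalled])
```

In this sketch a high-throughput interval is mapped to the nominal level and a stalled interval to the lowest level; a hardware version would replace the loop with three parallel comparators.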

5.6.2

Hierarchical Fast Learning Approach

We compare F-LEMMA with two state-of-the-art DVFS techniques for multi-/many-core
processors: Profit, Priority and Power/Performance Optimization for Many-Core Systems,
and Grape, Minimizing Energy for GPU Applications with Performance Requirements. All
results are normalized to the default race-to-idle execution mode. F-LEMMA and the previous
techniques are implemented on the same Sniper and McPAT simulation platform described
above. For fairness, F-LEMMA (the learning controller), Profit, and Grape all operate at
a fixed timescale of 500 microseconds. Profit and Grape are implemented to the best of our
knowledge from the details given in their papers. The names of the approaches used in our experiments are
given below.

• Default Race-to-Idle: Runs each benchmark as fast as possible. All other approaches are normalized to this.
• F-LEMMA: The proposed learning-based fast power and energy management in a
hierarchical layered approach.
• Profit: State-of-the-art reinforcement learning-based power and energy management
for multi-core and many-core systems [217].
• Grape: State-of-the-art feedback control-based power and energy management for
multi-core and many-core systems with performance constraints [150].

We evaluate energy consumption, not power dissipation, for a standard comparison across
workloads and configurations. The energy consumption metric evaluates net benefits and
accounts for potential losses due to extended execution times when lowering frequency. We
normalize the energy and performance results to those of the Default Race-to-Idle
case.
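The reason for reporting energy rather than power can be seen in a small worked example. The sketch below uses hypothetical power and execution-time numbers (not measurements from this work) to show that a lower average power does not guarantee a net energy saving once the longer execution time is taken into account.

```python
def energy_joules(avg_power_w, exec_time_s):
    """Energy is average power multiplied by execution time: E = P * t."""
    return avg_power_w * exec_time_s

# Hypothetical runs of one benchmark (illustrative numbers only).
baseline = energy_joules(avg_power_w=20.0, exec_time_s=10.0)   # race-to-idle
good_dvfs = energy_joules(avg_power_w=12.0, exec_time_s=11.5)
poor_dvfs = energy_joules(avg_power_w=18.0, exec_time_s=12.0)

norm_energy_good = good_dvfs / baseline  # below 1: net energy saving
norm_energy_poor = poor_dvfs / baseline  # above 1: lower power, more energy

# For a fixed instruction count, normalized performance (instructions per
# second) is the inverse of the normalized execution time.
norm_perf_good = 10.0 / 11.5

print(f"{norm_energy_good:.2f} {norm_energy_poor:.2f} {norm_perf_good:.2f}")
```

Both DVFS runs draw less power than the baseline, but only the first one reduces energy; the second stretches execution enough to cost more energy overall.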
Fig. 5.9 shows the normalized energy consumption, and Fig. 5.10 shows the normalized
performance (instructions per second). F-LEMMA achieves a 35.2% energy saving with

Figure 5.9: Normalized energy consumption of F-LEMMA.
[Bar chart: normalized energy (0 to 100%) of F-LEMMA, Profit, and GRAPE for fft, radix, lu.cont, ocean, cholesky, water.nsq, blackscholes, bt, cg, ft, lu, and sp.]

an 11.8% performance penalty, on average, compared to the Default Race-to-Idle. The best
case is the fft benchmark, which saves 30.4% energy with only a 1.0% performance loss. The
worst case is the radix benchmark, which saves 30.4% energy with a 25.3% performance loss.
Compared to Profit and Grape, F-LEMMA achieves 6.6% and 11.5% extra energy savings
with 3.5% and 2.6% performance penalties, respectively. For the fft, lu.cont, cholesky,
water.nsq, blackscholes, and ft benchmarks, DVFS saves significant amounts of energy with a
minimal performance penalty under all three power management approaches. In the Grape
results, although feedback control is more effective than the
throughput-based DVFS of Section 5.3 on the radix, bt, and cg benchmarks, it still struggles to
achieve consistent results across all the benchmarks.
Fig. 5.11 shows the Energy-Delay Product normalized to Default Race-to-Idle. Across most
benchmarks, F-LEMMA has the highest energy efficiency and the smallest energy-delay product,
after accounting for potential performance losses. On average, F-LEMMA, Profit, and Grape
have normalized energy-delay products of 0.73, 0.78, and 0.84, respectively.
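As a rough cross-check of these numbers, the normalized energy-delay product can be estimated from the average energy saving and performance penalty reported above. The sketch assumes that delay is the inverse of normalized performance and applies the formula to the averages, which is not necessarily identical to averaging the per-benchmark products.

```python
def normalized_edp(norm_energy, norm_performance):
    """EDP relative to the baseline: E_norm * D_norm, with D_norm = 1 / perf_norm."""
    return norm_energy / norm_performance

# Averages reported for F-LEMMA: 35.2% energy saving, 11.8% performance penalty.
edp_flemma = normalized_edp(1 - 0.352, 1 - 0.118)
print(f"{edp_flemma:.2f}")  # close to the reported average of 0.73
```

The estimate lands near the reported 0.73, which suggests that the energy, performance, and EDP averages are mutually consistent.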

Figure 5.10: Normalized performance of F-LEMMA.
[Bar chart: normalized performance in instructions per second (0 to 100%) of F-LEMMA, Profit, and GRAPE for fft, radix, lu.cont, ocean, cholesky, water.nsq, blackscholes, bt, cg, ft, lu, and sp.]

Figure 5.11: Energy delay product of F-LEMMA.
[Bar chart: normalized energy-delay product (0 to 100%) of F-LEMMA, Profit, and GRAPE for the same benchmarks.]

5.6.3

Hierarchical Layered Approach with Ablation Study

As an ablation study, Figs. 5.12 and 5.13 compare the energy savings and performance penalties
of F-LEMMA with those of alternatives that use only a subset of the layered global, learning, and
swift controllers. In the configurations with a learning controller, the learning controller
learns online. In the swift-controller-only configuration, the weights for the controller
inputs come from an offline-trained learning controller. F-LEMMA outperforms a framework with
only global and learning controllers (i.e., the second bar), achieving significant energy savings
with only a tiny performance loss. For example, on the lu.cont, ocean, and ft benchmarks,
F-LEMMA achieves 9%, 8%, and 6% energy savings, respectively, while reducing performance
by less than 1%. F-LEMMA also outperforms a framework with only the swift controller
(i.e., the third bar).

Figure 5.12: Normalized energy consumption of F-LEMMA.
[Bar chart: normalized energy (0 to 100%) per benchmark; bars in order: F-LEMMA, Global+Learning, Swift Only, Global(1,0,0)+Learning+Swift, Global+Learning+Swift(power).]

Figure 5.13: Normalized performance of F-LEMMA.
[Bar chart: normalized performance in instructions per second (0 to 100%) per benchmark; same five configurations in the same order.]

We also compare full hierarchical management with different configurations at each layer.
When the learning controller pursues only energy savings, because the global controller
specifies weights (1,0,0) for its reward function (i.e., the fourth bar), the system achieves
more energy savings but with slightly greater performance penalties. Finally, suppose the
swift controller uses only power as its input feature and neglects instruction throughput
(i.e., the fifth bar). Compared to F-LEMMA, this configuration induces larger performance
penalties for the same energy savings: with only power measurements, the swift controller
predicts the effects of DVFS less accurately. These effects were discussed in Section 5.3.3.

5.6.4

Workload Transition and Scalability

Finally, we examine the distinctive capabilities required of, or provided by, F-LEMMA. The learning
controller must remain effective, or converge quickly, when the processor transitions from one

Figure 5.14: Learning under Workload Transitions.
[Plot: normalized energy.]

Figure 5.15: Normalized energy consumption of F-LEMMA DVFS with the swift controller
at different microsecond timescales.
[Bar chart: normalized energy (0 to 100%) per benchmark for swift controller periods of 1, 2, 4, 8, 16, 32, 64, and 128 µs.]

Figure 5.16: Normalized performance of F-LEMMA DVFS with the swift controller at different microsecond timescales.
[Bar chart: normalized performance in instructions per second (0 to 100%) per benchmark for the same swift controller periods.]

Figure 5.17: Normalized energy of F-LEMMA on multi-core and many-core processors.
[Bar chart: normalized energy (0 to 100%) per benchmark for 2, 4, 8, 16, 32, 64, and 128 cores.]

Figure 5.18: Normalized performance of F-LEMMA on multi-core and many-core processors.
[Bar chart: normalized performance in instructions per second (0 to 100%) per benchmark for 2, 4, 8, 16, 32, 64, and 128 cores.]

workload to another. Fig. 5.14 compares the energy consumption when control starts from
randomly generated weights (i.e., no prior epoch) with the case where it starts from weights learned for the fft
benchmark in the first two epochs before the workload transition. The energy is normalized to
the first execution of each benchmark. Compared with starting from scratch, the benchmarks
switching from fft inherit a learning controller configuration that was trained under fft, and
show a significant energy saving from the beginning. Benchmarks other than radix save more
energy as the execution proceeds. The same phenomena are also observed in transitions for
other benchmarks.
Next, we change the swift controller frequency to control the DVFS at
different speeds, as shown in Table 5.1. Fig. 5.15 and Fig. 5.16 show the normalized energy
consumption and performance across benchmarks. In general, F-LEMMA is
effective at different swift controller speeds, and faster DVFS can achieve
more energy savings with a small performance loss.
Finally, Fig. 5.17 and Fig.
5.18 show the energy and performance when scaling from 2 cores to 128 cores; some bars
are blank because the benchmark does not support that configuration. Overall, as the number
of cores scales from 2 to 128 (2, 4, 8, 16, 32, 64, and 128-core systems), the average energy
saving of F-LEMMA goes from 35.2% to 41.1%, while the average performance loss goes from
12.1% to 5.4%. As the number of cores increases, the performance penalty decreases because
more thread synchronizations create more DVFS opportunities.
To summarize, F-LEMMA achieves effective power management at microsecond timescales
with significant energy savings and only a moderate performance loss. The global and learning
controllers help the swift controller make better decisions, saving more energy across benchmarks. F-LEMMA is also effective during workload transitions and supports userspace
inputs to balance energy and performance according to specified weights. Furthermore,
F-LEMMA scales from multi-core systems up to many-core systems.

5.7

Conclusion

Low power consumption and high computing performance are essential requirements
of the computing systems used in resource-constrained cyber-physical systems. In this
chapter, we proposed F-LEMMA, a hierarchical fast integrated voltage and frequency scaling approach for multi-core and many-core processors. With integrated voltage regulators,
DVFS power management can reach microsecond timescales. A learning-based hierarchical
approach, including a global controller in userspace, a learning controller at the architecture level, and swift controllers at the digital circuit level, is presented to guide microsecond-level
power management. Experimental results show that, on average, F-LEMMA can save
35.2% of energy with an 11.8% performance decrease. Compared with two classic millisecond timescale DVFS techniques using reinforcement learning and control theory, F-LEMMA
achieves 5% and 11% Energy-Delay Product (EDP) improvements, respectively. F-LEMMA
can be readily applied to computing systems, allowing both the multi-core and many-core
computing platforms in cyber-physical systems to have lower power consumption and
higher computing performance.

Chapter 6
Conclusion

This dissertation has explored techniques and solutions for improving the power and performance efficiency of computing in contemporary and future cyber-physical systems, especially
in resource-constrained environments. In cyber-physical systems, both power efficiency and
performance efficiency directly impact the operation time and performance of the whole
system. We proposed a layered approach to improve computing power and performance
efficiency, leveraging the characteristics of each layer, from the circuit layer, through the
architecture layer, up to the operating system and application layers. With a bottom-up
layered approach, we presented four representative case studies on mainstream computing
systems used in cyber-physical systems, exploring each study from the three perspectives
of power delivery efficiency, real-time resource and task scheduling, and power management. At the circuit layer, we provided early-stage modeling and evaluation of IVR-assisted
processor power delivery systems, with two case studies that demonstrate the significant
power efficiency improvements realized from supply voltage noise mitigation and microsecond-timescale power management. Next, at the circuit and architecture layers, we provided a
practical voltage-stacked power delivery system with guaranteed reliability and power management functions. Compared with a conventional power delivery system, the proposed
hybrid regulated voltage-stacked system achieves over 90% power delivery efficiency, which
directly extends the operating time of the computing systems and cyber-physical systems.
Then, at the architecture and operating system layers, we used a CPU-GPU system as a
representative of heterogeneous computing systems for autonomous cyber-physical systems
like self-driving cars, and we presented real-time scheduling of hard-deadline parallel tasks
with fine-grained utilization, which offers higher schedulability than current state-of-the-art
solutions. Finally, across the circuit, architecture, and operating system layers, we designed
and presented a hierarchical, learning-based fast power management strategy operating at
microsecond timescales. Each of these innovations opens the door to new opportunities
at other layers or leverages the characteristics of several neighboring layers. Cyber-physical
systems place tough demands on the performance and power consumption of their computing platforms. Together, the techniques presented here form a complete layered approach,
from the circuit layer through the architecture layer to the operating system and application
layers, that improves both power and performance efficiency for the computing systems
used in cyber-physical systems, especially in resource-constrained environments.
There is no limit to improving the performance and reducing the power consumption of
computing systems. Many remarkable solutions and techniques have been proposed in the past,
are being proposed now, and will be proposed in the future. We hope our layered approach
and proposed techniques can serve as a modest spur that encourages later researchers to come
forward with valuable contributions to improve computing and cyber-physical systems and
help people build a better life together.

References
[1] Edward Ashford Lee and Sanjit A Seshia. Introduction to embedded systems: A cyber-physical systems approach. MIT Press, 2017.
[2] Siddhartha Kumar Khaitan and James D McCalley. Design techniques and applications
of cyberphysical systems: A survey. IEEE Systems Journal, 9(2):350–365, 2014.
[3] Apple unleashes M1. https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/.
[4] https://www.tesla.com/autopilotai.
[5] Lech Jóźwiak. Embedded computing technology for highly-demanding cyber-physical
systems. IFAC-PapersOnLine, 48(4):19–30, 2015.
[6] Stefano Zanero. Cyber-physical systems. Computer, 50(4):14–16, 2017.
[7] Jin Ho Kim. A review of cyber-physical system research relevant to the emerging
it trends: industry 4.0, iot, big data, and cloud computing. Journal of industrial
integration and management, 2(03):1750011, 2017.
[8] Ragunathan Rajkumar, Insup Lee, Lui Sha, and John Stankovic. Cyber-physical systems: the next computing revolution. In Design automation conference, pages 731–736.
IEEE, 2010.
[9] Yansheng Zhang, I-L Yen, Farokh B Bastani, Ann T Tai, and S Chau. Optimal adaptive system health monitoring and diagnosis for resource constrained cyber-physical
systems. In 2009 20th International Symposium on Software Reliability Engineering,
pages 51–60. IEEE, 2009.
[10] Bo Li, Yehan Ma, Tyler Westenbroek, Chengjie Wu, Humberto Gonzalez, and
Chenyang Lu. Wireless routing and control: a cyber-physical case study. In 2016
ACM/IEEE 7th International Conference on Cyber-Physical Systems (ICCPS), pages
1–10. IEEE, 2016.
[11] Yehan Ma, Dolvara Gunatilaka, Bo Li, Humberto Gonzalez, and Chenyang Lu. Holistic
cyber-physical management for dependable wireless control systems. ACM Transactions on Cyber-Physical Systems, 3(1):1–25, 2018.
[12] Yehan Ma, Chenyang Lu, and Yebin Wang. Efficient holistic control: Self-awareness
across controllers and wireless networks. ACM Transactions on Cyber-Physical Systems, 4(4):1–27, 2020.

[13] Power management plantweb university. https://www.emerson.com/documents/automation/training
power-management-en-41156.pdf.
[14] Ai can do great things—if it doesn’t burn the planet. https://www.wired.com/story/aigreat-things-burn-planet/.
[15] Scaling battery technology. https://semiengineering.com/scaling-battery-technology/.
[16] Elon Musk: Battery energy density to increase 50 percent by 2024.
https://www.pcmag.com/news/elon-musk-battery-energy-density-to-increase-50-percent-by-2024.
[17] Robert H Dennard, Fritz H Gaensslen, Hwa-Nien Yu, V Leo Rideout, Ernest Bassous,
and Andre R LeBlanc. Design of ion-implanted MOSFET's with very small physical
dimensions. IEEE Solid-State Circuits Society Newsletter, 12(1):38–50, 2007.
[18] Hadi Esmaeilzadeh, Emily Blem, Renee St Amant, Karthikeyan Sankaralingam, and
Doug Burger. Dark silicon and the end of multicore scaling. In 2011 38th Annual
international symposium on computer architecture (ISCA), pages 365–376. IEEE, 2011.
[19] Michael B Taylor. A landscape of the new dark silicon design regime. IEEE Micro,
33(5):8–19, 2013.
[20] Behzad Boroujerdian, Hasan Genc, Srivatsan Krishnan, Bardienus Pieter Duisterhof,
Brian Plancher, Kayvan Mansoorshahi, Marcelino Almeida, Wenzhi Cui, Aleksandra
Faust, and Vijay Janapa Reddi. The role of compute in autonomous aerial vehicles.
arXiv preprint arXiv:1906.10513, 2019.
[21] Behzad Boroujerdian, Hasan Genc, Srivatsan Krishnan, Wenzhi Cui, Aleksandra Faust,
and Vijay Reddi. Mavbench: Micro aerial vehicle benchmarking. In 2018 51st Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 894–907.
IEEE, 2018.
[22] Yehan Ma, Chenyang Lu, Bruno Sinopoli, and Shen Zeng. Exploring edge computing
for multitier industrial control. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 39(11):3506–3518, 2020.
[23] Pulkit Agrawal, Ross Girshick, and Jitendra Malik. Analyzing the performance of
multilayer neural networks for object recognition. In European conference on computer
vision, pages 329–344. Springer, 2014.
[24] Or Sharir, Barak Peleg, and Yoav Shoham. The cost of training nlp models: A concise
overview. arXiv preprint arXiv:2004.08900, 2020.
[25] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243, 2019.
[26] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen
Koltun. Carla: An open urban driving simulator. In Conference on robot learning,
pages 1–16. PMLR, 2017.
[27] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In Conference on Robot Learning, pages 66–75. PMLR, 2020.
[28] Jetson AGX Xavier developer kit. https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit.
[29] Nvidia graphics card specification chart. https://www.studio1productions.com/Articles/NVidiaGPU-Chart.htm.
[30] André B Bondi. Characteristics of scalability and their impact on performance. In
Proceedings of the 2nd international workshop on Software and performance, pages
195–203, 2000.
[31] Wonyoung Kim, Meeta S Gupta, Gu-Yeon Wei, and David Brooks. System level
analysis of fast, per-core dvfs using on-chip switching regulators. In 2008 IEEE 14th
International Symposium on High Performance Computer Architecture, 2008.
[32] Pingqiang Zhou, Dong Jiao, Chris H Kim, and Sachin S Sapatnekar. Exploration of
on-chip switched-capacitor dc-dc converter for multicore processors using a distributed
power delivery network. In Custom Integrated Circuits Conference (CICC), 2011 IEEE,
pages 1–4. IEEE, 2011.
[33] Tao Tong, Xuan Zhang, Wonyoung Kim, David Brooks, and Gu-Yeon Wei. A fully
integrated battery-connected switched-capacitor 4: 1 voltage regulator with 70% peak
efficiency using bottom-plate charge recycling. In Custom Integrated Circuits Conference (CICC), 2013 IEEE, pages 1–4. IEEE, 2013.
[34] Leland Chang, Robert K Montoye, Brian L Ji, Alan J Weger, Kevin G Stawiasz, and
Robert H Dennard. A fully-integrated switched-capacitor 2:1 voltage converter with
regulation capability and 90% efficiency at 2.3 A/mm². In VLSI Circuits (VLSIC),
2010 IEEE Symposium on, pages 55–56. IEEE, 2010.
[35] Hamid Reza Ghasemi, Abhishek A Sinkar, Michael J Schulte, and Nam Sung
Kim. Cost-effective power delivery to support per-core voltage domains for power-constrained processors. In Design Automation Conference (DAC), 2012 49th
ACM/EDAC/IEEE, pages 56–61. IEEE, 2012.
[36] Ulya R Karpuzcu, Abhishek Sinkar, Nam Sung Kim, and Josep Torrellas. Energysmart:
Toward energy-efficient manycores for near-threshold computing. In High Performance
Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on,
pages 542–553. IEEE, 2013.
[37] Guihai Yan, Yingmin Li, Yinhe Han, Xiaowei Li, Minyi Guo, and Xiaoyao Liang.
Agileregulator: A hybrid voltage regulator scheme redeeming dark silicon for power
efficiency in a multicore architecture. In High Performance Computer Architecture
(HPCA), 2012 IEEE 18th International Symposium on, pages 1–12. IEEE, 2012.
[38] Steven JE Wilton and Norman P Jouppi. Cacti: An enhanced cache access and cycle
time model. IEEE Journal of Solid-State Circuits, 31(5):677–688, 1996.
[39] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. Orion: a
power-performance simulator for interconnection networks. In Microarchitecture,
2002.(MICRO-35). Proceedings. 35th Annual IEEE/ACM International Symposium
on, pages 294–305. IEEE, 2002.
[40] Donald S Gardner, Gerhard Schrom, Fabrice Paillet, Brice Jamieson, Tanay Karnik,
and Shekhar Borkar. Review of on-chip inductor structures with magnetic films. IEEE
Transactions on Magnetics, 45(10):4760–4766, 2009.
[41] Wonyoung Kim, David Brooks, and Gu-Yeon Wei. A fully-integrated 3-level dc-dc
converter for nanosecond-scale dvfs. IEEE Journal of Solid-State Circuits, 47(1):206–
219, 2012.
[42] Rinkle Jain, Bibiche M Geuskens, Stephen T Kim, Muhammad M Khellah, Jaydeep
Kulkarni, James W Tschanz, and Vivek De. A 0.45–1 v fully-integrated distributed
switched capacitor dc-dc converter with high density mim capacitor in 22 nm tri-gate
cmos. IEEE Journal of Solid-State Circuits, 49(4):917–927, 2014.
[43] Texas Instruments. LMZ10501 1-A SIMPLE SWITCHER® Nano Module With 5.5-V
Maximum Input Voltage. http://www.ti.com/product/LMZ10501.
[44] Meeta S Gupta, Jarod L Oatley, Russ Joseph, Gu-Yeon Wei, and David M Brooks.
Understanding voltage variations in chip multiprocessors using a distributed power-delivery network. In 2007 Design, Automation & Test in Europe Conference & Exhibition, pages 1–6. IEEE, 2007.
[45] Intel Corp. Voltage Regulator Module, Enterprise Voltage Regulator-Down 10.0.
http://www.intel.com/content/www/us/en/power-management/voltage-regulator-module-enterprise-voltage-regulator-down-10-0-guidelines.html.
[46] Reddi, V.J. and Kanev, S. and Campanoni, S. and Smith, M.D. and Wei, G.Y.
and Brooks, D. Voltage Smoothing: Characterizing and Mitigating Voltage Noise
in Production Processors Using Software-Guided Thread Scheduling. In Proc. Annual
IEEE/ACM Int. Symp. on Microarchitecture, 2010.
[47] Youngtaek Kim and Lizy Kurian John. Automated di/dt stressmark generation for microprocessor power delivery networks. In Low Power Electronics and Design (ISLPED)
2011 International Symposium on. IEEE, 2011.
[48] Noah Sturcken, Eugene J O’Sullivan, Naigang Wang, Philipp Herget, Bucknell C Webb,
Lubomyr T Romankiw, Michele Petracca, Ryan Davies, Robert E Fontana, Gary M
Decad, et al. A 2.5 d integrated voltage regulator using coupled-magnetic-core inductors on silicon interposer. IEEE Journal of solid-state circuits, 48(1):244–254, 2013.
[49] Hanh-Phuc Le, Seth R Sanders, and Elad Alon. Design techniques for fully integrated
switched-capacitor dc-dc converters. IEEE Journal of Solid-State Circuits, 46(9):2120–
2131, 2011.
[50] Yongseok Choi, Naehyuck Chang, and Taewhan Kim. Dc–dc converter-aware power
management for low-power embedded systems. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 26(8):1367–1381, 2007.
[51] Michael D Seeman. Analytical and practical analysis of switched-capacitor dc-dc converters. Technical report, CALIFORNIA UNIV BERKELEY DEPT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE, 2006.
[52] Hanh-Phuc Le, John Crossley, Seth R Sanders, and Elad Alon. A sub-ns response
fully integrated battery-connected switched-capacitor voltage regulator delivering 0.19
W/mm² at 73% efficiency. In 2013 IEEE International Solid-State Circuits Conference
Digest of Technical Papers, pages 372–373. IEEE, 2013.
[53] Noah Sturcken, Michele Petracca, Steven Warren, Paolo Mantovani, Luca P Carloni,
Angel V Peterchev, and Kenneth L Shepard. A switched-inductor integrated voltage
regulator with nonlinear feedback and network-on-chip load in 45 nm soi. IEEE Journal
of Solid-State Circuits, 47(8):1935–1945, 2012.
[54] Robert J Milliken, Jose Silva-Martı́nez, and Edgar Sánchez-Sinencio. Full on-chip cmos
low-dropout voltage regulator. IEEE Transactions on Circuits and Systems I: Regular
Papers, 54(9):1879–1890, 2007.
[55] Toke Meyer Andersen, Florian Krismer, Johann Walter Kolar, Thomas Toifl, Christian
Menolfi, Lukas Kull, Thomas Morf, Marcel Kossel, Matthias Brändli, Peter Buchmann,
et al. 4.7 a sub-ns response on-chip switched-capacitor dc-dc voltage regulator delivering 3.7 w/mm 2 at 90% efficiency using deep-trench capacitors in 32nm soi cmos. In
2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers
(ISSCC), pages 90–91. IEEE, 2014.
[56] Edward NY Ho and Philip KT Mok. A capacitor-less cmos active feedback low-dropout
regulator with slew-rate enhancement for portable on-chip application. IEEE Transactions on Circuits and Systems II: Express Briefs, 57(2):80–84, 2010.
[57] James Myers, Anand Savanth, Rohan Gaddh, David Howard, Pranay Prabhat, and
David Flynn. A subthreshold arm cortex-m0+ subsystem in 65 nm cmos for wsn
applications with 14 power domains, 10t sram, and integrated voltage regulator. IEEE
Journal of Solid-State Circuits, 51(1):31–44, 2015.
[58] Stephen T Kim, Yi-Chun Shih, Kaushik Mazumdar, Rinkle Jain, Joseph F Ryan,
Carlos Tokunaga, Charles Augustine, Jaydeep P Kulkarni, Krishnan Ravichandran,
James W Tschanz, et al. Enabling wide autonomous dvfs in a 22 nm graphics execution
core using a digitally controlled fully integrated voltage regulator. IEEE Journal of
Solid-State Circuits, 51(1):18–30, 2015.
[59] Harish K Krishnamurthy, Vaibhav Vaidya, Pavan Kumar, Rinkle Jain, Sheldon Weng,
Stephen T Kim, George E Matthew, Nachiket Desai, Xiaosen Liu, Krishnan Ravichandran, et al. A digitally controlled fully integrated voltage regulator with on-die solenoid
inductor with planar magnetic core in 14-nm tri-gate cmos. IEEE Journal of Solid-State Circuits, 53(1):8–19, 2017.
[60] George Patounakis, Yee William Li, and Kenneth L Shepard. A fully integrated on-chip dc-dc conversion and power management system. IEEE Journal of Solid-State
Circuits, 39(3):443–451, 2004.
[61] Noah Sturcken, Eugene J O’Sullivan, Naigang Wang, Philipp Herget, Bucknell C Webb,
Lubomyr T Romankiw, Michele Petracca, Ryan Davies, Robert E Fontana, Gary M
Decad, et al. A 2.5D integrated voltage regulator using coupled-magnetic-core inductors on silicon interposer. IEEE Journal of Solid-State Circuits, 48(1):244–254, 2012.
[62] Tom Van Breussegem and Michiel Steyaert. A 82% efficiency 0.5% ripple 16-phase
fully integrated capacitive voltage doubler. In 2009 Symposium on VLSI Circuits,
pages 198–199. IEEE, 2009.
[63] Edward A Burton, Gerhard Schrom, Fabrice Paillet, Jonathan Douglas, William J
Lambert, Kaladhar Radhakrishnan, and Michael J Hill. Fivr—fully integrated voltage
regulators on 4th generation intel® core™ socs. In Applied Power Electronics Conference and Exposition (APEC), 2014 Twenty-Ninth Annual IEEE, pages 432–439. IEEE,
2014.
[64] Eric J Fluhr, Steve Baumgartner, David Boerstler, John F Bulzacchelli, Timothy
Diemoz, Daniel Dreps, George English, Joshua Friedrich, Anne Gattiker, Tilman
Gloekler, et al. The 12-core power8™ processor with 7.6 tb/s io bandwidth, integrated voltage regulation, and resonant clocking. IEEE Journal of Solid-State Circuits,
50(1):10–23, 2015.
[65] Brian Zimmer, Yunsup Lee, Alberto Puggelli, Jaehwa Kwak, Ruzica Jevtić, Ben Keller,
Steven Bailey, Milovan Blagojević, Pi-Feng Chiu, Hanh-Phuc Le, et al. A risc-v vector
processor with simultaneous-switching switched-capacitor dc–dc converters in 28 nm
fdsoi. IEEE Journal of Solid-State Circuits, 2016.
[66] Cheng Zhuo, Kassan Unda, Yiyu Shi, and Wei-Kai Shih. From layout to system:
Early stage power delivery and architecture co-exploration. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 38(7):1291–1304, 2018.
[67] Zhiyu Zeng, Xiaoji Ye, Zhuo Feng, and Peng Li. Tradeoff analysis and optimization
of power delivery networks with on-chip voltage regulation. In Proceedings of the 47th
Design Automation Conference, pages 831–836. ACM, 2010.
[68] Xuan Wang, Jiang Xu, Zhe Wang, Kevin J Chen, Xiaowen Wu, Zhehui Wang, Peng
Yang, and Luan HK Duong. An analytical study of power delivery systems for many-core processors using on-chip and off-chip voltage regulators. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 34(9):1401–1414, 2015.
[69] Xin Zhan, Jianhao Chen, Edgar Sánchez-Sinencio, and Peng Li. Power management
for multicore processors via heterogeneous voltage regulation and machine learning enabled adaptation. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
2019.
[70] Saurabh Sinha, Greg Yeric, Vikas Chandra, Brian Cline, and Yu Cao. Exploring sub-20nm finfet design with predictive technology models. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pages 283–288. IEEE, 2012.
[71] Mohammad Al-Shyoukh, Hoi Lee, and Raul Perez. A transient-enhanced low-quiescent
current low-dropout regulator with buffer impedance attenuation. IEEE journal of
solid-state circuits, 42(8):1732–1742, 2007.
[72] Richard D Middlebrook and Slobodan Cuk. A general unified approach to modelling
switching-converter power stages. In 1976 IEEE Power Electronics Specialists Conference, pages 18–34. IEEE, 1976.
[73] Riccardo Trinchero. Emi analysis and modeling of switching circuits. PhD thesis,
Politecnico di Torino, 2015.
[74] Ashis Maity, Amit Patra, Norihisa Yamamura, and Jonathan Knight. Design of a 20
mhz dc-dc buck converter with 84 percent efficiency for portable applications. In 2011
24th International Conference on VLSI Design, pages 316–321. IEEE, 2011.
[75] Siamak Abedinpour, Bertan Bakkaloglu, and Sayfe Kiaei. A multistage interleaved
synchronous buck converter with integrated output filter in 0.18 µm sige process.
IEEE Transactions on Power Electronics, 22(6):2164–2175, 2007.
[76] Gerhard Schrom, P Hazucha, Fabrice Paillet, DJ Rennie, ST Moon, DS Gardner,
T Karnik, P Sun, TT Nguyen, MJ Hill, et al. A 100MHz eight-phase buck converter
delivering 12A in 25mm² using air-core inductors. In APEC 07-Twenty-Second Annual
IEEE Applied Power Electronics Conference and Exposition, pages 727–730. IEEE,
2007.
[77] Jingwen Leng, Yazhou Zu, Minsoo Rhu, Meeta Gupta, and Vijay Janapa Reddi. Gpuvolt: Modeling and characterizing voltage noise in gpu architectures. In Proceedings of
the 2014 international symposium on Low power electronics and design, pages 141–146.
ACM, 2014.
[78] CHL8266. Digital multi-phase gpu buck controller. Technical report, Infineon, 2011.
[79] Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M Aamodt.
Analyzing cuda workloads using a detailed gpu simulator. In 2009 IEEE International
Symposium on Performance Analysis of Systems and Software, pages 163–174. IEEE,
2009.
[80] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim,
Tor M Aamodt, and Vijay Janapa Reddi. Gpuwattch: enabling energy optimizations in
gpgpus. In ACM SIGARCH Computer Architecture News, volume 41, pages 487–498.
ACM, 2013.
[81] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha
Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing.
In International Symposium on Workload Characterization, 2009.
[82] Jingwen Leng, Alper Buyuktosunoglu, Ramon Bertran, Pradip Bose, and Vijay Janapa
Reddi. Safe limits on voltage reduction efficiency in gpus: A direct measurement
approach. In Proceedings of the 48th International Symposium on Microarchitecture,
pages 294–307. ACM, 2015.
[83] Trevor E Carlson, Wim Heirman, and Lieven Eeckhout. Sniper: Exploring the level of
abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of
2011 International Conference for High Performance Computing, Networking, Storage
and Analysis, pages 1–12, 2011.
[84] Ben Keller, Martin Cochet, Brian Zimmer, Jaehwa Kwak, Alberto Puggelli, Yunsup
Lee, Milovan Blagojević, Stevo Bailey, Pi-Feng Chiu, Palmer Dabbelt, et al. A risc-v
processor soc with integrated power management at submicrosecond timescales in 28
nm fd-soi. IEEE Journal of Solid-State Circuits, 52(7):1863–1875, 2017.
[85] Xuan Zhang, Tao Tong, S. Kanev, Sae Kyu Lee, Gu-Yeon Wei, and D. Brooks. Characterizing and evaluating voltage noise in multi-core near-threshold processors. In Low
Power Electronics and Design (ISLPED), 2013 IEEE International Symposium on,
pages 82–87, 2013.
[86] Wonyoung Kim et al. System level analysis of fast, per-core dvfs using on-chip switching
regulators. In HPCA, 2008.
[87] Sae Kyu Lee, David Brooks, and Gu-Yeon Wei. Evaluation of voltage stacking for near-threshold multicore computing. In Proceedings of the 2012 ACM/IEEE international
symposium on Low power electronics and design, pages 373–378. ACM, 2012.
[88] Kazuhiro Ueda, Fukashi Morishita, Shunsuke Okura, Leona Okamura, Tsutomu Yoshihara, and Kazutami Arimoto. Low-power on-chip charge-recycling dc-dc conversion
circuit and system. IEEE Journal of Solid-State Circuits, 48(11):2608–2617, 2013.
[89] Pablo Castro Lisboa, Pablo Pérez-Nicoli, Francisco Veirano, and Fernando Silveira.
General top/bottom-plate charge recycling technique for integrated switched capacitor dc-dc converters. IEEE Transactions on Circuits and Systems I: Regular Papers,
63(4):470–481, 2016.
[90] Saravanan Rajapandian, Zheng Xu, and Kenneth L Shepard. Implicit dc-dc downconversion through charge-recycling. IEEE journal of solid-state circuits, 40(4):846–852,
2005.
[91] Pulkit Jain, Tae-Hyoung Kim, John Keane, and Chris H Kim. A multi-story power
delivery technique for 3d integrated circuits. In Low Power Electronics and Design
(ISLPED), 2008 ACM/IEEE International Symposium on, pages 57–62. IEEE, 2008.
[92] Sae Kyu Lee, David Brooks, and Gu-Yeon Wei. Evaluation of voltage stacking for near-threshold multicore computing. In Proceedings of the 2012 ACM/IEEE international
symposium on Low power electronics and design, pages 373–378. ACM, 2012.
[93] Kristof Blutman, Ajay Kapoor, Arjun Majumdar, Jacinto Garcia Martinez, Juan
Echeverri, Leo Sevat, Arnoud P van der Wel, Hamed Fatemi, Kofi AA Makinwa,
and José Pineda de Gyvez. A low-power microcontroller in a 40-nm cmos using charge
recycling. IEEE Journal of Solid-State Circuits, 52(4), 2017.
[94] Sae Kyu Lee, Tao Tong, Xuan Zhang, David Brooks, and Gu-Yeon Wei. A 16-core
voltage-stacked system with adaptive clocking and an integrated switched-capacitor
dc–dc converter. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
25(4):1271–1284, 2017.
[95] Kristof Blutman, Ajay Kapoor, Jacinto Garcia Martinez, Hamed Fatemi, and
José Pineda de Gyvez. Lower power by voltage stacking: A fine-grained system design
approach. In Design Automation Conference (DAC), 2016 53rd ACM/EDAC/IEEE,
pages 1–5. IEEE, 2016.
[96] Ehsan K Ardestani, Rafael Trapani Possignolo, Jose Luis Briz, and Jose Renau. Managing mismatches in voltage stacking with coreunfolding. ACM Transactions on Architecture and Code Optimization (TACO), 12(4), 2016.
[97] Rafael Trapani Possignolo, Elnaz Ebrahimi, Ehsan Khish Ardestani, Alamelu Sankaranarayanan, Jose Luis Briz, and Jose Renau. Gpu ntc process variation compensation
with voltage stacking. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, (99):1–14, 2018.
[98] Sae Kyu Lee, Tao Tong, Xuan Zhang, David Brooks, and Gu-Yeon Wei. A 16-core
voltage-stacked system with an integrated switched-capacitor dc-dc converter. In VLSI
Circuits (VLSI Circuits), 2015 Symposium on, pages C318–C319. IEEE, 2015.
[99] Tao Tong, Sae Kyu Lee, Xuan Zhang, David Brooks, and Gu-Yeon Wei. A fully
integrated reconfigurable switched-capacitor dc-dc converter with four stacked output channels for voltage stacking applications. IEEE Journal of Solid-State Circuits,
51(9):2142–2152, 2016.
[100] Ed Grochowski, David Ayers, and Vivek Tiwari. Microarchitectural simulation and
control of di/dt-induced power supply voltage variation. In High-Performance Computer Architecture, 2002. Proceedings. Eighth International Symposium on, pages 7–16.
IEEE, 2002.
[101] Meeta S Gupta, Vijay Janapa Reddi, Glenn Holloway, Gu-Yeon Wei, and David M
Brooks. An event-guided approach to reducing voltage noise in processors. In Design,
Automation & Test in Europe Conference & Exhibition, 2009. DATE’09., pages 160–
165. IEEE, 2009.
[102] Vijay Janapa Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael D
Smith, Gu-Yeon Wei, and David Brooks. Voltage smoothing: Characterizing and mitigating voltage noise in production processors via software-guided thread scheduling. In
Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM International Symposium
on, pages 77–88. IEEE, 2010.
[103] Vijay Janapa Reddi, Meeta S Gupta, Glenn Holloway, Gu-Yeon Wei, Michael D Smith,
and David Brooks. Voltage emergency prediction: Using signatures to reduce operating
margins. In High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th
International Symposium on. IEEE, 2009.
[104] Jingwen Leng, Yazhou Zu, and Vijay Janapa Reddi. Gpu voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in
gpu architectures. In High Performance Computer Architecture (HPCA), 2015 IEEE
21st International Symposium on, pages 161–173. IEEE, 2015.
[105] Renji Thomas, Kristin Barber, Naser Sedaghati, Li Zhou, and Radu Teodorescu. Core
tunneling: Variation-aware voltage noise mitigation in gpus. In High Performance
Computer Architecture (HPCA), 2016 IEEE International Symposium on, pages 151–
162. IEEE, 2016.
[106] Renji Thomas, Naser Sedaghati, and Radu Teodorescu. Emergpu: Understanding
and mitigating resonance-induced voltage noise in gpu architectures. In Performance
Analysis of Systems and Software (ISPASS), 2016 IEEE International Symposium on,
pages 79–89. IEEE, 2016.
[107] Jae-Pyo Lee, Ho-Sik Jeon, Dong-Sung Moon, and Byung Seong Bae. Threshold voltage
and ir drop compensation of an amoled pixel circuit without a VDD line. IEEE Electron
Device Letters, 35(1):72–74, 2014.
[108] Xuan Zhang, Tao Tong, David Brooks, and Gu-Yeon Wei. Supply-noise resilient adaptive clocking for battery-powered aerial microrobotic system-on-chip in 40nm cmos.
In Proceedings of the IEEE 2013 Custom Integrated Circuits Conference, pages 1–4.
IEEE, 2013.
[109] Xuan Zhang, Tao Tong, David Brooks, and Gu Yeon Wei. Evaluating adaptive clocking
for supply-noise resilience in battery-powered aerial microrobotic system-on-chip. IEEE
Transactions on Circuits and Systems I: Regular Papers, 61(8):2309–2317, 2014.
[110] Rafael T. Possignolo. Gpu ntc process variation compensation with voltage stacking. In
Parallel Architectures and Compilation Techniques (PACT), International Conference
on., 2015.
[111] Elnaz Ebrahimi, Rafael Trapani Possignolo, and Jose Renau. Sram voltage stacking. In
Circuits and Systems (ISCAS), 2016 IEEE International Symposium on, pages 1634–
1637. IEEE, 2016.
[112] Tao Tong, Sae Kyu Lee, Xuan Zhang, David Brooks, and Gu-Yeon Wei. A fully
integrated reconfigurable switched-capacitor dc-dc converter with four stacked output channels for voltage stacking applications. IEEE Journal of Solid-State Circuits,
51(9):2142–2152, 2016.
[113] Kristof Blutman, Ajay Kapoor, Arjun Majumdar, Jacinto Garcia Martinez, Juan
Echeverri, Leo Sevat, Arnoud Van Der Wel, Hamed Fatemi, José Pineda de Gyvez, and
Kofi Makinwa. A microcontroller with 96% power-conversion efficiency using stacked
voltage domains. In VLSI Circuits (VLSI-Circuits), 2016 IEEE Symposium on, pages
1–2. IEEE, 2016.
[114] Kristof Blutman, Hamed Fatemi, Andrew B Kahng, Ajay Kapoor, Jiajia Li, and
José Pineda de Gyvez. Floorplan and placement methodology for improved energy
reduction in stacked power-domain design. In Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific, pages 444–449. IEEE, 2017.
[115] Kristof Blutman, Hamed Fatemi, Ajay Kapoor, Andrew B Kahng, Jiajia Li, and
José Pineda de Gyvez. Logic design partitioning for stacked power domains. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 2017.
[116] Runjie Zhang, Kaushik Mazumdar, Brett H Meyer, Ke Wang, Kevin Skadron,
and Mircea Stan. A cross-layer design exploration of charge-recycled power-delivery
in many-layer 3d-ic. In Design Automation Conference (DAC), 2015 52nd
ACM/EDAC/IEEE, pages 1–6. IEEE, 2015.
[117] Kaushik Mazumdar and Mircea Stan. Breaking the power delivery wall using voltage
stacking. In Proceedings of the great lakes symposium on VLSI, pages 51–54. ACM,
2012.
[118] Qixiang Zhang, Liangzhen Lai, Mark Gottscho, and Puneet Gupta. Multi-story power
distribution networks for gpus. In Design, Automation & Test in Europe Conference
& Exhibition (DATE), 2016, pages 451–456. IEEE, 2016.
[119] An Zou, Jingwen Leng, Xin He, Yazhou Zu, Vijay Janapa Reddi, and Xuan Zhang.
Efficient and reliable power delivery in voltage-stacked manycore system with hybrid
charge-recycling regulators. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2018.
[120] An Zou, Jingwen Leng, Xin He, Yazhou Zu, Christopher D Gill, Vijay Janapa Reddi,
and Xuan Zhang. Voltage-stacked gpus: A control theory driven cross-layer solution
for practical voltage stacking in gpus. In 2018 51st Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), pages 390–402. IEEE, 2018.
[121] Weidong Cao, Xin He, Ayan Chakrabarti, and Xuan Zhang. Neuadc: Neural network-inspired synthesizable analog-to-digital conversion. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.
[122] NVIDIA. Whitepaper: NVIDIA's next generation CUDA™ compute architecture: Fermi.
[123] Joseph Ervin, Asha Balijepalli, Punarvasu Joshi, Vadim Kushner, Jinman Yang, and
Trevor J Thornton. Cmos-compatible soi mesfets with high breakdown voltage. IEEE
Transactions on Electron Devices, 53(12):3129–3135, 2006.
[124] Andrew Suchanek, Zhong Chen, and Jia Di. Asynchronous circuit stacking for simplified power management. IEEE, 2018.
[125] Jie Gu and Chris H Kim. Multi-story power delivery for supply noise reduction and
low voltage operation. In Proceedings of the 2005 international symposium on Low
power electronics and design, pages 192–197. ACM, 2005.
[126] Jun Zhou, Chao Wang, Xin Liu, and Minkyu Je. Fast and energy-efficient low-voltage
level shifters. Microelectronics Journal, 46(1), 2015.
[127] Shien-Chun Luo, Ching-Ji Huang, and Yuan-Hua Chu. A wide-range level shifter using
a modified wilson current mirror hybrid buffer. IEEE Transactions on Circuits and
Systems I: Regular Papers, 61(6), 2014.
[128] Kyoung-Hoi Koo, Jin-Ho Seo, Myeong-Lyong Ko, and Jae-Whui Kim. A new level-up
shifter for high speed and wide range interface in ultra deep sub-micron. In Circuits
and Systems, 2005. ISCAS 2005. IEEE International Symposium on, pages 1063–1065.
IEEE, 2005.
[129] Tejas S Joshi and Priya M Ravale Nerkar. A wide range level shifter using a self
biased cascode current mirror with ptl based buffer. In IJCA Proceedings on National
Conference on Emerging Trends in Advanced Communication Technologies, pages 8–
12, 2015.
[130] Amir Hasanbegovic and Snorre Aunet. Low-power subthreshold to above threshold
level shifters in 90nm and 65nm process. Microprocessors and Microsystems, 35(1):1–
9, 2011.
[131] Bhawna Aggarwal, Maneesha Gupta, and Anil Kumar Gupta. A comparative study of
various current mirror configurations: Topologies and characteristics. Microelectronics
Journal, 53:134–155, 2016.
[132] Manoj Kumar, Sandeep K Arya, and Sujata Pandey. Level shifter design for low power
applications. arXiv preprint arXiv:1011.0507, 2010.
[133] Elnaz Ebrahimi, Rafael Trapani Possignolo, and Jose Renau. Level shifter design for
voltage stacking.
[134] Michael Douglas Seeman. Analytical and practical analysis of switched-capacitor dc-dc
converters. Master’s thesis, EECS Department, University of California, Berkeley, Sep
2006.
[135] Michael Douglas Seeman. A design methodology for switched-capacitor DC-DC converters. University of California, Berkeley, 2009.
[136] Haoran Li, Jiang Xu, Zhe Wang, Rafael KV Maeda, Peng Yang, and Zhongyuan
Tian. Workload-aware adaptive power delivery system management for many-core
processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 2017.
[137] Pablo Mendoza Ponce, Dietmar Schröder, and Wolfgang H Krautschneider. Trade-off
study on switched capacitor regulators for implantable medical devices.
[138] Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M Sleiman, Ronald
Dreslinski, Thomas F Wenisch, and Scott Mahlke. Composite cores: Pushing heterogeneity into a core. In Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM
International Symposium on, pages 317–328. IEEE, 2012.
[139] Krishna K Rangan, Gu-Yeon Wei, and David Brooks. Thread motion: fine-grained
power management for multi-core systems. In ACM SIGARCH Computer Architecture
News, volume 37, pages 302–313. ACM, 2009.
[140] Miguel Rodrigues, Nuno Roma, and Pedro Tomás. Fast and scalable thread migration
for multi-core architectures. In 2015 IEEE 13th International Conference on Embedded
and Ubiquitous Computing. IEEE, 2015.
[141] Zhigang Hu, Alper Buyuktosunoglu, Viji Srinivasan, Victor Zyuban, Hans Jacobson,
and Pradip Bose. Microarchitectural techniques for power gating of execution units. In
Proceedings of the 2004 international symposium on Low power electronics and design.
ACM, 2004.
[142] Manish Arora, Srilatha Manne, Indrani Paul, Nuwan Jayasena, and Dean M Tullsen.
Understanding idle behavior and power gating mechanisms in the context of modern
benchmarks on cpu-gpu integrated systems. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pages 366–377. IEEE,
2015.
[143] Jaehyun Park, Donghwa Shin, Naehyuck Chang, and Massoud Pedram. Accurate
modeling and calculation of delay and energy overheads of dynamic voltage scaling
in modern high-performance microprocessors. In Proceedings of the 16th ACM/IEEE
international symposium on Low power electronics and design, pages 419–424. ACM,
2010.
[144] Amir Bashir, Jing Li, Kiran Ivatury, Naveed Khan, Nirav Gala, Noam Familia, and
Zulfiqar Mohammed. Fast lock scheme for phase-locked loops. In Custom Integrated
Circuits Conference. CICC’09. IEEE, 2009.
[145] Ali Muhtaroglu, Greg Taylor, and Tawfik Rahal-Arabi. On-die droop detector for
analog sensing of power supply noise. IEEE Journal of solid-state circuits, 39(4):651–
660, 2004.
[146] Russ Joseph, David Brooks, and Margaret Martonosi. Control techniques to eliminate
voltage emergencies in high performance processors. In High-Performance Computer
Architecture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium
on, pages 79–90. IEEE, 2003.
[147] Jaeha Kim and Mark A Horowitz. An efficient digital sliding controller for adaptive
power-supply regulation. IEEE Journal of solid-state circuits, 37(5):639–647, 2002.
[148] Bruce Fleischer, Christos Vezyrtzis, Karthik Balakrishnan, and Keith A Jenkins. A
statistical critical path monitor in 14nm cmos. In Computer Design (ICCD), 2016
IEEE 34th International Conference on, 2016.
[149] Maxim Integrated. Max19506 data sheet. https://www.maximintegrated.com/en/products/analog/data-converters/analog-to-digital-converters/MAX19506.html/.
[150] Muhammad Husni Santriaji and Henry Hoffmann. Grape: Minimizing energy for
gpu applications with performance requirements. In 2016 49th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), 2016.
[151] Pietro Mercati, Raid Ayoub, Michael Kishinevsky, Eric Samson, Marc Beuchat,
Francesco Paterna, and Tajana Šimunić Rosing. Multi-variable dynamic power management for the gpu subsystem. In Design Automation Conference (DAC), 2017 54th
ACM/EDAC/IEEE, pages 1–6. IEEE, 2017.
[152] Rong Ge, Ryan Vogt, Jahangir Majumder, Arif Alam, Martin Burtscher, and Ziliang
Zong. Effects of dynamic voltage and frequency scaling on a k20 gpu. In Parallel
Processing (ICPP), 2013 42nd International Conference on, pages 826–833. IEEE,
2013.
[153] Onur Kayıran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. Neither
more nor less: optimizing thread-level parallelism for gpgpus. In Proceedings of the
22nd international conference on Parallel architectures and compilation techniques.
IEEE Press, 2013.
[154] Po-Han Wang, Chia-Lin Yang, Yen-Ming Chen, and Yu-Jung Cheng. Power gating strategies on gpus. ACM Transactions on Architecture and Code Optimization
(TACO), 8(3):13, 2011.
[155] Mohammad Abdel-Majeed, Daniel Wong, and Murali Annavaram. Warped gates:
gating aware scheduling and power gating for gpgpus. In Proceedings of the 46th
Annual IEEE/ACM International Symposium on Microarchitecture, pages 111–122.
ACM, 2013.
[156] Yue Wang, Soumyaroop Roy, and Nagarajan Ranganathan. Run-time power-gating in
caches of gpus for leakage energy savings. In Design, Automation & Test in Europe
Conference & Exhibition (DATE), 2012, pages 300–303. IEEE, 2012.
[157] Enver Candan. A series-stacked power delivery architecture with isolated converters
for energy efficient data centers. 2014.
[158] Ngspice. http://ngspice.sourceforge.net/. Accessed: 2018-12-31.
[159] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim,
Tor M. Aamodt, and Vijay Janapa Reddi. Gpuwattch: Enabling energy optimizations
in gpgpus. In Proceedings of the 40th Annual International Symposium on Computer
Architecture, ISCA ’13, pages 487–498, New York, NY, USA, 2013. ACM.
[160] NVIDIA. https://developer.nvidia.com/. Accessed: 2018-12-31.
[161] Graphics cards voltage regulator modules (vrm) explained. https://www.geeks3d.com/20100504/tutorial-graphics-cards-voltage-regulator-modules-vrm-explained.
[162] An Zou, Jingwen Leng, Yazhou Zu, Tao Tong, Vijay Janapa Reddi, David Brooks,
Gu-Yeon Wei, and Xuan Zhang. Ivory: Early-stage design space exploration tool for
integrated voltage regulators. In Proceedings of the 54th Annual Design Automation
Conference 2017, pages 1–6, 2017.
[163] Abhinandan Majumdar, Leonardo Piga, Indrani Paul, Joseph L Greathouse, Wei
Huang, and David H Albonesi. Dynamic gpgpu power management using adaptive
model predictive control. In 2017 IEEE International Symposium on High-Performance
Computer Architecture (HPCA), pages 613–624. IEEE, 2017.
[164] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat
Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang,
et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[165] Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md E Haque, Lingjia
Tang, and Jason Mars. The architectural implications of autonomous driving: Constraints and acceleration. In Proceedings of the Twenty-Third International Conference
on Architectural Support for Programming Languages and Operating Systems, pages
751–766. ACM, 2018.
[166] Nvidia accelerates race to autonomous driving at ces. https://blogs.nvidia.com/blog/2016/01/04/drive-px-ces-recap/. Accessed: 2019-11-23.
[167] Omid Hosseini Jafari, Dennis Mitzel, and Bastian Leibe. Real-time rgb-d based people detection and tracking for mobile robots and head-worn cameras. In 2014 IEEE
international conference on robotics and automation (ICRA), pages 5636–5643. IEEE,
2014.
[168] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv, 2018.
[169] Christopher J Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett
Witchel. Ptask: operating system abstractions to manage gpus as compute devices. In
Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles,
pages 233–248, 2011.
[170] Shinpei Kato, Michael McThrow, Carlos Maltzahn, and Scott Brandt. Gdev: First-class GPU resource management in the operating system. In 2012 USENIX Annual
Technical Conference (USENIX ATC 12), pages 401–412,
2012.
[171] Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata
Ausavarungnirun, Mahmut T Kandemir, Gabriel H Loh, Onur Mutlu, and Chita R
Das. Managing gpu concurrency in heterogeneous architectures. In Microarchitecture
(MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 114–126.
IEEE, 2014.
[172] Chao-Tung Yang, Chih-Lin Huang, and Cheng-Fang Lin. Hybrid cuda, openmp, and
mpi parallel programming on multicore gpu clusters. Computer Physics Communications, 182(1):266–269, 2011.
[173] Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H Anderson, and
F Donelson Smith. Avoiding pitfalls when using nvidia gpus for real-time tasks in
autonomous systems. In 30th Euromicro Conference on Real-Time Systems (ECRTS
2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
[174] Husheng Zhou, Soroush Bateni, and Cong Liu. S^3DNN: Supervised streaming and
scheduling for gpu-accelerated real-time dnn workloads. In 2018 IEEE Real-Time and
Embedded Technology and Applications Symposium (RTAS), pages 190–201. IEEE,
2018.
[175] Hyeonsu Lee, Jaehun Roh, and Euiseong Seo. A gpu kernel transactionization scheme
for preemptive priority scheduling. In 2018 IEEE Real-Time and Embedded Technology
and Applications Symposium (RTAS), pages 202–213. IEEE, 2018.
[176] Shinpei Kato, Karthik Lakshmanan, Raj Rajkumar, and Yutaka Ishikawa. Timegraph:
Gpu scheduling for real-time multi-tasking environments. In Proc. USENIX ATC,
pages 17–30, 2011.
[177] Glenn A Elliott and James H Anderson. Globally scheduled real-time multiprocessor
systems with gpus. Real-Time Systems, 48(1):34–74, 2012.
[178] Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. Chimera: Collaborative
preemption for multitasking on a shared gpu. ACM SIGARCH Computer Architecture
News, 43(1):593–606, 2015.

[179] Can Basaran and Kyoung-Don Kang. Supporting preemptive task executions and
memory copies in gpgpus. In 24th Euromicro Conference on Real-Time Systems
(ECRTS 2012), pages 287–296. IEEE, 2012.
[180] Ivan Tanasic, Isaac Gelado, Javier Cabezas, Alex Ramirez, Nacho Navarro, and Mateo
Valero. Enabling preemptive multiprogramming on gpus. In Computer Architecture
(ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 193–204. IEEE,
2014.
[181] Husheng Zhou, Guangmo Tong, and Cong Liu. Gpes: A preemptive execution system for gpgpu computing. In Real-Time and Embedded Technology and Applications
Symposium (RTAS), 2015 IEEE, pages 87–97. IEEE, 2015.
[182] Chao Yu, Yuebin Bai, Hailong Yang, Kun Cheng, Yuhao Gu, Zhongzhi Luan, and
Depei Qian. Smguard: A flexible and fine-grained resource management framework
for gpus. IEEE Transactions on Parallel and Distributed Systems, 2018.
[183] Kshitij Gupta, Jeff A Stuart, and John D Owens. A study of persistent threads style
gpu programming for gpgpu workloads. In Innovative Parallel Computing-Foundations
& Applications of GPU, Manycore, and Heterogeneous Systems (INPAR 2012), pages
1–14. IEEE, 2012.
[184] Bo Wu, Guoyang Chen, Dong Li, Xipeng Shen, and Jeffrey Vetter. Enabling and
exploiting flexible task assignment on gpu through sm-centric program transformations.
In Proceedings of the 29th ACM on International Conference on Supercomputing, pages
119–130. ACM, 2015.
[185] J. Li, Jian-Jia Chen, K. Agrawal, C. Lu, C. D. Gill, and Abusayeed Saifullah. Analysis
of federated and global scheduling for parallel real-time tasks. In Real-Time Systems
(ECRTS), 26th Euromicro Conference on, pages 85–96, 2014.
[186] Wen-Hung Huang and Jian-Jia Chen. Schedulability and priority assignment for multi-segment self-suspending real-time tasks under fixed-priority scheduling. Technical
report, Technical University of Dortmund, 2015.
[187] Olivier Valery, Pangfeng Liu, and Jan-Jan Wu. A collaborative cpu–gpu approach for
principal component analysis on mobile heterogeneous platforms. Journal of Parallel
and Distributed Computing, 120:44–61, 2018.
[188] Bin Wang, Ruhui Ma, Zhengwei Qi, Jianguo Yao, and Haibing Guan. A user mode
cpu–gpu scheduling framework for hybrid workloads. Future Generation Computer
Systems, 63:25–36, 2016.
[189] Guoyang Chen, Yue Zhao, Xipeng Shen, and Huiyang Zhou. Effisha: A software
framework for enabling efficient preemptive scheduling of gpu. In Proceedings of the
22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,
pages 3–16, 2017.
[190] Cen Chen, Kenli Li, Aijia Ouyang, Zeng Zeng, and Keqin Li. Gflink: An in-memory
computing architecture on heterogeneous cpu-gpu clusters for big data. IEEE Transactions on Parallel and Distributed Systems, 29(6):1275–1288, 2018.
[191] Muhammad Husni Santriaji and Henry Hoffmann. Merlot: Architectural support for
energy-efficient real-time processing in gpus. In 2018 IEEE Real-Time and Embedded
Technology and Applications Symposium (RTAS), pages 214–226. IEEE, 2018.
[192] Seyedmehdi Hosseinimotlagh and Hyoseung Kim. Thermal-aware servers for real-time
tasks on multi-core gpu-integrated embedded systems. In 2019 IEEE Real-Time and
Embedded Technology and Applications Symposium (RTAS), pages 254–266. IEEE,
2019.
[193] Glenn A Elliott, Bryan C Ward, and James H Anderson. Gpusync: A framework for
real-time gpu management. In 2013 IEEE 34th Real-Time Systems Symposium, pages
33–44. IEEE, 2013.
[194] Vladislav Golyanik, Mitra Nasri, and Didier Stricker. Towards scheduling hard real-time image processing tasks on a single gpu. In 2017 IEEE International Conference
on Image Processing (ICIP), pages 4382–4386. IEEE, 2017.
[195] Christoph Gerum, Oliver Bringmann, and Wolfgang Rosenstiel. Source level performance simulation of gpu cores. In Proceedings of the 2015 Design, Automation & Test
in Europe Conference & Exhibition, pages 217–222. EDA Consortium, 2015.
[196] Kostiantyn Berezovskyi, Konstantinos Bletsas, and Björn Andersson. Makespan computation for gpu threads running on a single streaming multiprocessor. In Real-Time
Systems (ECRTS), 2012 24th Euromicro Conference on, pages 277–286. IEEE, 2012.
[197] A. Bakhoda, G.L. Yuan, W.W.L. Fung, H. Wong, and T.M. Aamodt.
Analyzing CUDA workloads using a detailed GPU simulator. In Proc. Annual IEEE
Int. Symp. on Performance Analysis of Systems and Software, 2009.
[198] Zhenning Wang, Jun Yang, Rami Melhem, Bruce Childers, Youtao Zhang, and Minyi
Guo. Simultaneous multikernel gpu: Multi-tasking throughput processors via fine-grained sharing. In High Performance Computer Architecture (HPCA), 2016 IEEE
International Symposium on, pages 358–369. IEEE, 2016.
[199] Yunlong Xu, Rui Wang, Tao Li, Mingcong Song, Lan Gao, Zhongzhi Luan, and Depei
Qian. Scheduling tasks with mixed timing constraints in gpu-powered real-time systems. In Proceedings of the 2016 International Conference on Supercomputing, page 30.
ACM, 2016.
[200] Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeongon Cho, and
Soojung Ryu. Improving gpgpu resource utilization through alternative thread block
scheduling. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th
International Symposium on, pages 260–271. IEEE, 2014.
[201] Sujan Kumar Saha, Yecheng Xiang, and Hyoseung Kim. Stgm: Spatio-temporal gpu
management for real-time tasks. In 2019 IEEE 25th International Conference on
Embedded and Real-Time Computing Systems and Applications (RTCSA), pages 1–6.
IEEE, 2019.
[202] Jinghao Sun, Jing Li, Zhishan Guo, An Zou, Xuan Zhang, Kunal Agrawal, and Sanjoy Baruah. Real-time scheduling upon a host-centric acceleration architecture with
data offloading. In 2020 IEEE Real-Time and Embedded Technology and Applications
Symposium (RTAS), pages 56–69. IEEE, 2020.
[203] Tanya Amert, Nathan Otterness, Ming Yang, James H Anderson, and F Donelson
Smith. Gpu scheduling on the nvidia tx2: Hidden details revealed. In 2017 IEEE
Real-Time Systems Symposium (RTSS), pages 104–115. IEEE, 2017.
[204] Nathan Otterness, Ming Yang, Sarah Rust, Eunbyung Park, James H Anderson,
F Donelson Smith, Alex Berg, and Shige Wang. An evaluation of the nvidia tx1 for supporting real-time computer-vision workloads. In 2017 IEEE Real-Time and Embedded
Technology and Applications Symposium (RTAS), pages 353–364. IEEE, 2017.
[205] Steven Chien, Ivy Peng, and Stefano Markidis. Performance evaluation of advanced
features in cuda unified memory. In 2019 IEEE/ACM Workshop on Memory Centric
High Performance Computing (MCHPC), pages 50–57. IEEE, 2019.
[206] Saksham Jain, Iljoo Baek, Shige Wang, and Ragunathan Rajkumar. Fractional gpus:
Software-based compute and memory bandwidth reservation for gpus. In 2019 IEEE
Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 29–41.
IEEE, 2019.
[207] NVIDIA. Nvidia tesla p100: The most advanced datacenter accelerator ever built
featuring pascal gp100, the world’s fastest gpu. Whitepaper, 2016.
[208] How to utilize compute preemption in the new pascal architecture (tesla p100 and
gtx1080)? https://devtalk.nvidia.com/default/topic/973140/how-to-utilize-compute-preemption-in-the-new-pascal-architecture-tesla-p100-and-gtx1080-/.
[209] Gwangsun Kim, Jiyun Jeong, John Kim, and Mark Stephenson. Automatically exploiting implicit pipeline parallelism from multiple dependent kernels for gpus. In Parallel
Architecture and Compilation Techniques (PACT), 2016 International Conference on,
pages 339–350. IEEE, 2016.
[210] Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, and Tao Li. Gaas workload
characterization under numa architecture for virtualized gpu. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages
65–76. IEEE, 2017.
[211] Mohammad-Hashem Haghbayan, Amir-Mohammad Rahmani, Awet Yemane
Weldezion, Pasi Liljeberg, Juha Plosila, Axel Jantsch, and Hannu Tenhunen. Dark
silicon aware power management for manycore systems under dynamic workloads. In
32nd International Conference on Computer Design (ICCD). IEEE, 2014.
[212] Amir-Mohammad Rahmani, Mohammad-Hashem Haghbayan, Anil Kanduri, Awet Yemane Weldezion, Pasi Liljeberg, Juha Plosila, Axel Jantsch, and Hannu Tenhunen.
Dynamic power management for many-core platforms in the dark silicon era: A multi-objective control approach. In 2015 IEEE/ACM International Symposium on Low
Power Electronics and Design (ISLPED). IEEE, 2015.
[213] Wikipedia. Speedstep. [EB/OL]. https://en.wikipedia.org/wiki/SpeedStep/.
[214] AMD Staff. Amd powernow! technology brief. Advanced Micro Devices, Inc.
[215] Teodor Neagoe, Ernest Karjala, and Logica Banica. Why arm processors are the
best choice for embedded low-power applications? In 2010 IEEE 16th International
Symposium for Design and Technology in Electronic Packaging (SIITME).
[216] The FPS Review. Nvidia power. [EB/OL]. https://www.thefpsreview.com/2019/
12/04/nvidia-geforce-driver-power-mode-settings-compared/.
[217] Zhuo Chen, Dimitrios Stamoulis, and Diana Marculescu. Profit: priority and
power/performance optimization for many-core systems. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 2017.
[218] Muhammad Shafique, Benjamin Vogel, and Jörg Henkel. Self-adaptive hybrid dynamic
power management for many-core systems. In 2013 Design, Automation & Test in
Europe Conference & Exhibition (DATE), pages 51–56. IEEE, 2013.
[219] Hwisung Jung and Massoud Pedram. Supervised learning based power management
for multicore processors. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 29(9):1395–1408, 2010.
[220] Zhiyuan Ren, Bruce H Krogh, and Radu Marculescu. Hierarchical adaptive dynamic
power management. IEEE Transactions on Computers, 54(4):409–420, 2005.
[221] Abhinandan Majumdar, Leonardo Piga, Indrani Paul, Joseph L Greathouse, Wei
Huang, and David H Albonesi. Dynamic gpgpu power management using adaptive
model predictive control. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 613–624. IEEE, 2017.
[222] Hao Shen, Jun Lu, and Qinru Qiu. Learning based dvfs for simultaneous temperature, performance and energy management. In Thirteenth International Symposium
on Quality Electronic Design (ISQED), pages 747–754. IEEE, 2012.
[223] Yanzhi Wang, Qing Xie, Ahmed Ammari, and Massoud Pedram. Deriving a
near-optimal power management policy using model-free reinforcement learning and
bayesian classification. In Proceedings of the 48th Design Automation Conference,
pages 41–46, 2011.
[224] Wonyoung Kim, David M Brooks, and Gu-Yeon Wei. A fully-integrated 3-level dc/dc
converter for nanosecond-scale dvs with fast shunt regulation. In 2011 IEEE International Solid-State Circuits Conference, pages 268–270. IEEE, 2011.
[225] Zeynep Toprak-Deniz, Michael Sperling, John Bulzacchelli, Gregory Still, Ryan Kruse,
Seongwon Kim, David Boerstler, Tilman Gloekler, Raphael Robertazzi, Kevin Stawiasz, et al. 5.2 distributed system of digitally controlled microregulators enabling per-core dvfs for the power8™ microprocessor. In 2014 IEEE International Solid-State
Circuits Conference Digest of Technical Papers (ISSCC).
[226] Chun-Yen Tseng, Li-Wen Wang, and Po-Chiun Huang. An integrated linear regulator
with fast output voltage transition for dual-supply srams in dvfs systems. IEEE journal
of solid-state circuits, 45(11):2239–2249, 2010.
[227] Ben Keller, Martin Cochet, Brian Zimmer, Yunsup Lee, Milovan Blagojevic, Jaehwa
Kwak, Alberto Puggelli, Stevo Bailey, Pi-Feng Chiu, Palmer Dabbelt, et al. Sub-microsecond adaptive voltage scaling in a 28nm fd-soi processor soc. In ESSCIRC
Conference 2016: 42nd European Solid-State Circuits Conference, 2016.
[228] Jonathan A Winter, David H Albonesi, and Christine A Shoemaker. Scalable thread
scheduling and global power management for heterogeneous many-core architectures.
In 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 29–39. IEEE, 2010.
[229] John Sartori and Rakesh Kumar. Distributed peak power management for many-core
architectures. In 2009 Design, Automation & Test in Europe Conference & Exhibition,
pages 1556–1559. IEEE, 2009.
[230] Hao Shen, Ying Tan, Jun Lu, Qing Wu, and Qinru Qiu. Achieving autonomous power
management using reinforcement learning. ACM Transactions on Design Automation
of Electronic Systems (TODAES), 18(2):1–32, 2013.
[231] Zhuo Chen and Diana Marculescu. Distributed reinforcement learning for power limited
many-core system performance optimization. In 2015 Design, Automation & Test in
Europe Conference & Exhibition (DATE). IEEE, 2015.
[232] Martin Rapp, Anuj Pathania, Tulika Mitra, and Jörg Henkel. Prediction-based task
migration on s-nuca many-cores. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1579–1582. IEEE, 2019.
[233] Zheqi Yu, Pedro Machado, Adnan Zahid, Amir M Abdulghani, Kia Dashtipour, Hadi
Heidari, Muhammad A Imran, and Qammer H Abbasi. Energy and performance trade-off optimization in heterogeneous computing via reinforcement learning. Electronics,
9(11):1812, 2020.
[234] Amir M Rahmani, Mohammad-Hashem Haghbayan, Antonio Miele, Pasi Liljeberg,
Axel Jantsch, and Hannu Tenhunen. Reliability-aware runtime power management for
many-core systems in the dark silicon era. IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, 25(2):427–440, 2016.
[235] Thomas Ebi, Mohammad Abdullah Al Faruque, and Jörg Henkel. Tape: Thermal-aware agent-based power economy for multi/many-core architectures. In 2009 IEEE/ACM
International Conference on Computer-Aided Design-Digest of Technical Papers, pages
302–309. IEEE, 2009.
[236] Zhiquan Lai, King Tin Lam, Cho-Li Wang, and Jinshu Su. Latency-aware dvfs for
efficient power state transitions on many-core architectures. The Journal of Supercomputing, 71(7):2720–2747, 2015.
[237] Anil Kanduri, Mohammad-Hashem Haghbayan, Amir M Rahmani, Pasi Liljeberg, Axel
Jantsch, Hannu Tenhunen, and Nikil Dutt. Accuracy-aware power management for
many-core systems running error-resilient applications. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 25(10), 2017.
[238] Peng Rong and Massoud Pedram. Hierarchical power management with application to
scheduling. In ISLPED’05: Proceedings of the 2005 International Symposium on Low
Power Electronics and Design, pages 269–274. IEEE, 2005.
[239] Jacob Sorber, Nilanjan Banerjee, Mark D Corner, and Sami Rollins. Turducken: hierarchical power management for mobile devices. In Proceedings of the 3rd international
conference on Mobile systems, applications, and services, 2005.
[240] Nikolas Ioannou, Michael Kauschke, Matthias Gries, and Marcelo Cintra. Phase-based
application-driven hierarchical power management on the single-chip cloud computer.
In 2011 International Conference on Parallel Architectures and Compilation Techniques, pages 131–142. IEEE, 2011.
[241] Thannirmalai Somu Muthukaruppan, Mihai Pricopi, Vanchinathan Venkataramani,
Tulika Mitra, and Sanjay Vishin. Hierarchical power management for asymmetric
multi-core in dark silicon era. In 2013 50th ACM/EDAC/IEEE Design Automation
Conference (DAC), pages 1–9. IEEE, 2013.
[242] Canturk Isci, Alper Buyuktosunoglu, Chen-Yong Cher, Pradip Bose, and Margaret
Martonosi. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06), 2006.
[243] Mohammad Ghasemazar, Ehsan Pakbaznia, and Massoud Pedram. Minimizing the
power consumption of a chip multiprocessor under an average throughput constraint.
In 2010 11th International Symposium on Quality Electronic Design (ISQED), pages
362–371. IEEE, 2010.
[244] Eran Shifer and Shlomo Weiss. Low-latency adaptive mode transitions and hierarchical
power management in asymmetric clustered cores. ACM Transactions on Architecture
and Code Optimization (TACO), 10(3):1–25, 2013.
[245] Pascal Meinerzhagen, Carlos Tokunaga, Andres Malavasi, Vaibhav Vaidya, Ashwin
Mendon, Deepak Mathaikutty, Jaydeep Kulkarni, Charles Augustine, Minki Cho,
Stephen Kim, et al. An energy-efficient graphics processor featuring fine-grain dvfs
with integrated voltage regulators, execution-unit turbo, and retentive sleep in 14nm
tri-gate cmos. In 2018 IEEE International Solid-State Circuits Conference-(ISSCC),
pages 38–40. IEEE, 2018.
[246] Stijn Eyerman and Lieven Eeckhout. Fine-grained dvfs using on-chip regulators. ACM
Transactions on Architecture and Code Optimization (TACO), 8(1):1–24, 2011.
[247] Sebastian Höppner, Chenming Shao, Holger Eisenreich, Georg Ellguth, Mario Ander,
and René Schüffny. A power management architecture for fast per-core dvfs in heterogeneous mpsocs. In 2012 IEEE International Symposium on Circuits and Systems.
IEEE, 2012.
[248] Harshad Kasture, Davide B Bartolini, Nathan Beckmann, and Daniel Sanchez. Rubik:
Fast analytical power management for latency-critical systems. In 2015 48th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 598–610.
IEEE, 2015.
[249] Yuxin Bai, Victor W Lee, and Engin Ipek. Voltage regulator efficiency aware power
management. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 825–838,
2017.
[250] Chih-Hsun Chou, Laxmi N Bhuyan, and Daniel Wong. µdpm: Dynamic power management for the microsecond era. In 2019 IEEE International Symposium on High
Performance Computer Architecture (HPCA), pages 120–132. IEEE, 2019.

[251] Alan Roth, Charlie Zhou, Mei Wong, Eric Soenen, Tze-Chiang Huang, Paul Ranucci,
Ying-Chih Hsu, Hung-Chih Lin, Chester Kuo, Min-Jer Wang, et al. Heterogeneous
power delivery for 7nm high-performance chiplet-based processors using integrated
passive device and in-package voltage regulator. In 2020 IEEE Symposium on VLSI
Technology, pages 1–2. IEEE, 2020.
[252] Waclaw Godycki, Christopher Torng, Ivan Bukreyev, Alyssa Apsel, and Christopher
Batten. Enabling realistic fine-grain voltage scaling with reconfigurable power distribution networks. In 2014 47th Annual IEEE/ACM International Symposium on
Microarchitecture, pages 381–393. IEEE, 2014.
[253] Soraya Ghiasi, Jason Casmira, and Dirk Grunwald. Using ipc variation in workloads
with externally specified rates to reduce power consumption. In Workshop on
Complexity Effective Design. Citeseer, 2000.
[254] Gregor Von Laszewski, Lizhe Wang, Andrew J Younge, and Xi He. Power-aware
scheduling of virtual machines in dvfs-enabled clusters. In 2009 IEEE International
Conference on Cluster Computing and Workshops, pages 1–10. IEEE, 2009.
[255] Qiang Wu, Margaret Martonosi, Douglas W Clark, Vijay Janapa Reddi, Dan Connors, Youfeng Wu, Jin Lee, and David Brooks. Dynamic-compiler-driven control for
microprocessor energy and performance. IEEE Micro, 26(1):119–129, 2006.
[256] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
A Bradford Book, Cambridge, MA, USA, 2018.
[257] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P.
Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods
for deep reinforcement learning. 2016.
[258] Ilya K Ganusov, Mahesh A Iyer, Ning Cheng, and Alon Meisler. Agilex™ generation
of intel® fpgas. In 2020 IEEE Hot Chips 32 Symposium (HCS), pages 1–26. IEEE
Computer Society, 2020.
[259] Jeffrey Chromczak, Mark Wheeler, Charles Chiasson, Dana How, Martin Langhammer,
Tim Vanderhoek, Grace Zgheib, and Ilya Ganusov. Architectural enhancements in
intel® agilex™ fpgas. In The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 140–149, 2020.
[260] Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen, and
Norman P Jouppi. Mcpat: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual
IEEE/ACM International Symposium on Microarchitecture, 2009.