Energy Reduction Through Voltage Scaling and Lightweight Checking by Kadric, Edin
University of Pennsylvania
ScholarlyCommons
Publicly Accessible Penn Dissertations
1-1-2016
Energy Reduction Through Voltage Scaling and
Lightweight Checking
Edin Kadric
University of Pennsylvania, ekadric@seas.upenn.edu
Recommended Citation
Kadric, Edin, "Energy Reduction Through Voltage Scaling and Lightweight Checking" (2016). Publicly Accessible Penn Dissertations. 1794.
http://repository.upenn.edu/edissertations/1794
Energy Reduction Through Voltage Scaling and Lightweight Checking
Abstract
As the semiconductor roadmap reaches smaller feature sizes and the end of Dennard Scaling, design goals
change, and managing the power envelope often dominates delay minimization. Voltage scaling remains a
powerful tool to reduce energy. We find that it results in about 60% geomean energy reduction on top of other
common low-energy optimizations with 22nm CMOS technology. However, when voltage is reduced, it
becomes easier for noise and particle strikes to upset a node, potentially causing Silent Data Corruption
(SDC). The 60% energy reduction, therefore, comes with a significant drop in reliability. Duplication with
checking and triple-modular redundancy are traditional approaches used to combat transient errors, but
spending 2–3x the energy for redundant computation can diminish or reverse the benefits of voltage scaling.
As an alternative, we explore the opportunity to use checking operations that are cheaper than the base
computation they are guarding. We devise a classification system for applications and their lightweight
checking characteristics. In particular, we identify and evaluate the effectiveness of lightweight checks in a
broad set of common tasks in scientific computing and signal processing. We find that the lightweight checks
cost only a fraction of the base computation (0-25%) and allow us to recover the reliability losses from voltage
scaling. Overall, we show about 50% net energy reduction without compromising reliability compared to
operation at the nominal voltage. We use FPGAs (Field-Programmable Gate Arrays) in our work, although
the same ideas can be applied to different systems. On top of voltage scaling, we explore other common low-
energy techniques for FPGAs: transmission gates, gate boosting, power gating, low-leakage (high-Vth)
processes, and dual-Vdd architectures.
We do not scale voltage for memories, so lower voltages help us reduce logic and interconnect energy, but not
memory energy. At lower voltages, memories become dominant, and we get diminishing returns from
continuing to scale voltage. To ensure that memories do not become a bottleneck, we also design an energy-
robust FPGA memory architecture, which attempts to minimize communication energy due to mismatches
between application and architecture. We do this alongside application parallelism tuning. We show our
techniques on a wide range of applications, including a large real-time system used for Wide-Area Motion
Imaging (WAMI).
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Graduate Group
Electrical & Systems Engineering
First Advisor
André DeHon
Second Advisor
Jan Van der Spiegel
This dissertation is available at ScholarlyCommons: http://repository.upenn.edu/edissertations/1794
Keywords
Communication, Energy, FPGA, Lightweight Check, Memory, Power
Subject Categories
Computer Engineering | Electrical and Electronics
ENERGY REDUCTION THROUGH VOLTAGE
SCALING AND LIGHTWEIGHT CHECKING
Edin Kadric
A DISSERTATION
in Electrical and Systems Engineering
Presented to the Faculties of the University of Pennsylvania
in Partial Fulfillment of the Requirements for the
Degree of Doctor of Philosophy
2016
Supervisor of Dissertation
Signature
André DeHon, Professor of Electrical and Systems Engineering
Graduate Group Chair
Signature
Alejandro Ribeiro, Rosenbluth Associate Professor of Electrical and Systems Engineering
Dissertation Committee:
André DeHon, Professor of Electrical and Systems Engineering
Randy Huang, Intel
Jonathan Smith, Olga and Alberico Pompa Professor of Engineering and Applied Science
Jan Van der Spiegel, Professor of Electrical and Systems Engineering
ENERGY REDUCTION THROUGH VOLTAGE SCALING
AND LIGHTWEIGHT CHECKING
COPYRIGHT
2016
Edin Kadric
Acknowledgements
I would like to thank André DeHon for so many things. I have been very fortunate
to get to know him and work with him over the last few years. His expertise, insight
and dedication have guided this work in the best of ways. He also contributed to
much of the written content. He is the best advisor and mentor I could have asked
for.
I would like to thank the members of my defense committee: Randy Huang,
Jonathan M. Smith, and Jan Van der Spiegel. Their advice and feedback have greatly
helped strengthen this dissertation.
I would like to thank the members of the IC lab, who have contributed to shaping
this work through discussion and feedback. In particular, I thank Udit Dhawan, Benjamin
Gojman, and Rafi Rubin, who have taught me many important tricks, shared ideas,
and provided much technical help and tool support.
I would also like to thank my parents and my brother for the invaluable support
and encouragement they have always provided me.
Finally, I would like to thank my wife and best friend Emina. Her constant love,
patience, and belief in me have been of tremendous help over the last few years.
Contents
Acknowledgements
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Thesis
1.2 Motivation
1.3 Outline
1.4 Common low-energy optimizations (Chapter 3)
1.5 Parallelism and communication energy (Chapter 4)
1.6 Maintaining reliability using LWCs (Chapters 5 and 6)
1.7 WAMI case study (Chapter 7)
1.8 Contributions and scope
1.9 Publications
2 Background
2.1 FPGA models
2.1.1 FPGA architecture
2.1.2 FPGA tools and our modifications to them
2.1.3 Benchmarks
2.1.4 FPGA energy modeling
2.1.5 FPGA memory energy
2.2 Previous work on energy optimizations
2.2.1 Technology scaling
2.2.2 Transmission gates, gate boosting
2.2.3 Power gating
2.2.4 Multiple supply voltages
2.2.4.1 Lower voltage off the critical path
2.2.4.2 Differential reliability
2.3 Previous work on lightweight checking
2.4 Previous work on reliability
2.4.1 Upset phenomena and voltage scaling
2.4.2 Prior work on soft-error upsets and timing failures
2.4.3 Configuration upsets
2.4.4 Process variation
2.4.5 Aging
3 Common Low-Energy Optimization Techniques
3.1 Introduction
3.2 Architecture and benchmarks
3.3 Simple voltage scaling
3.4 Transmission gates or gate boosting
3.4.1 Pass-transistor and level restorer failure
3.4.2 Transmission gate, gate boosting results
3.4.3 Technology trends
3.5 Power gating
3.5.1 Power switches for power gating
3.5.2 Power gating results
3.6 Dual-Vdd architectures
3.6.1 Motivation and approach
3.6.2 Power switches for dual-Vdd
3.6.3 Level converters for dual-Vdd
3.6.4 Vdd assignment algorithm
3.6.5 Dual-Vdd results
3.7 Discussion
3.7.1 Relative benefits
3.7.2 Results spread
4 Communication Energy: Adjusting Parallelism and Memory Organization
4.1 Introduction
4.2 Parallelism and data movement energy
4.2.1 Memory energy
4.2.2 Between computations
4.2.3 Analysis
4.3 Architecture mismatch energy
4.4 Background on FPGA memories
4.4.1 FPGA memory architecture
4.4.2 Memory energy modeling
4.5 Methodology for memory exploration
4.5.1 Power-optimized memory mapping
4.5.2 Energy and area of memory blocks
4.5.3 Benchmarks
4.5.4 Limit study and mismatch lower bound
4.6 Parallelism tuning
4.6.1 Example: MMul
4.6.2 Parallelism tuning with limit-study architecture
4.6.3 Parallelism tuning with concrete FPGA architecture
4.7 Memory exploration
4.7.1 Memory block size sweep
4.7.2 Impact of dm
4.7.3 Impact of memory width
4.7.4 Sensitivity
4.8 Chapter conclusions
5 Lightweight Checking Classification
5.1 Introduction
5.2 Computational model
5.3 Differential reliability
5.4 LWC definition
5.5 Classification system for LWCs
5.6 Class #1: Checksums
5.6.1 LWC for sorting
5.6.2 LWC for matrix multiplication
5.6.3 LWC for FFT
5.6.4 LWC Gaussian elimination
5.6.5 LWC for window filtering
5.6.6 LWC for integer multiplication, division, modulo
5.6.7 LWC for data integrity
5.7 Class #2: Probabilistic
5.7.1 Probabilistic LWC for matrix multiplication
5.7.2 Probabilistic LWC for matrix inversion
5.8 Class #3: Convergent algorithms
5.8.1 LWC for conjugate gradient
5.9 Class #5: Error-tolerant applications
5.10 Class #6: No checks possible
5.11 Using context
5.12 Impact of communication complexity
5.13 Combining different LWCs
5.14 Chapter conclusions
6 Lightweight Checking Models and Results
6.1 Introduction
6.2 Reliability models
6.2.1 Energy savings analysis
6.2.2 Silent data corruption (SDC) rate
6.2.3 Fault injection runtime
6.2.4 Bit flip rate versus voltage
6.3 Lightweight checking results
6.3.1 Detailed LWC example: sorting
6.3.1.1 Impact on energy when considering reliability
6.3.1.2 Analysis
6.3.1.3 Parameter sweep
6.3.1.4 Effect of problem size on LWC results
6.3.2 LWC results for the other applications
6.3.2.1 Basic LWC results for all benchmarks
6.3.2.2 Improving delay
6.3.2.3 Sensitivity analysis
6.3.2.4 Fault injection results
6.4 Discussion
7 Full System Case Study: WAMI
7.1 Introduction
7.2 Top level: Wide-Area Motion Imaging, WAMI
7.3 Stage 1: DeBayer
7.3.1 Bayer filtering
7.3.2 LWC for DeBayer
7.4 Stage 2: Lucas-Kanade
7.4.1 LK algorithm
7.4.2 LK implementation in hardware
7.4.2.1 Fixed-point arithmetic
7.4.2.2 Parallel LK
7.4.2.3 Caching scheme for the warp
7.4.2.4 Structure of LK in hardware
7.4.3 LWC for LK
7.5 Stage 3: Gaussian Mixture Modeling
7.5.1 GMM overview
7.5.2 GMM implementation
7.5.3 LWC for GMM
7.6 WAMI results
7.6.1 Basic WAMI results
7.6.2 Parallelism benefits
7.6.3 Other metrics
7.7 Chapter conclusions
8 Future Work
8.1 Technology
8.2 Automation
8.3 Reliability
8.3.1 Error modeling
8.3.2 Variation
8.3.3 Fault injection simulation
8.3.4 Limits on larger applications
8.3.5 Improving reliability
8.4 Architecture
8.4.1 FPGA structure
8.4.2 Multi-Vdd
8.4.3 Joint exploration of memory architecture and voltage scaling
9 Conclusions
Appendices
A Detailed Results for Chapter 3
Bibliography
List of Tables
2.1 Taxonomy of small-feature-size reliability problems
4.1 Memory requirements for the benchmarks
4.2 Area comparison of select memory organizations
5.1 LWC classes and example applications (problems with P complexity)
6.1 Possible outcomes of the protected design (CMP+LWC)
6.2 Fault injection results for Sort (N = 1024, P = 1)
6.3 Ratio results for CMP-only single-Vdd normalized to “CMP at Vnominal”
6.4 Ratio results for CMP+LWC single-Vdd normalized to “CMP at Vnominal” (Tab. 6.3 baseline)
6.5 Ratio results for CMP+LWC dual-Vdd normalized to “CMP at Vnominal” (Tab. 6.3 baseline)
6.6 LWC fault injection results (no bit flip in LWC)
7.1 Window coefficients for the DeBayer filter
List of Figures
1.1 Effect of voltage scaling on energy, delay, and reliability; example from Sec. 6.3.1.1 with HP (High-Performance) process
1.2 Resulting energy when applying the different low-energy techniques, normalized to a 22 nm HP baseline at nominal Vdd with pass-gates and no power gating
2.1 Basic FPGA model used for logic and routing
2.2 Tool flow
3.1 Effect of voltage scaling on energy and delay for the baseline architecture (ignoring pass-gate failures)
3.2 Pass-gate versus transmission gate 2-mux
3.3 Energy achieved under delay constraints
3.4 Minimum-energy trends versus technology
3.5 Comparison of PTM technologies
3.6 Power gating allows us to avoid the cost of unused resources (apex2.blif shown)
3.7 Power switches for power gating and dual-Vdd
3.8 Power gating reduces leakage energy
3.9 Level converter circuit
3.10 Placing level converters at logic cluster inputs or outputs
3.11 Breakdown of energy components in dual-Vdd voltage sweep (apex2.blif shown)
3.12 Relative benefits of the different low-energy techniques for the min-energy goal
3.13 Energy overhead spread across benchmarks (single-Vdd, with transmission gates and power gating)
4.1 Energy versus PE Count (Npe) for the window filter benchmark (WinF)
4.2 Column-oriented embedded memories
4.3 Energy overhead due to architectural mismatch for matrix-multiply
4.4 Internal banking of memory block
4.5 Internal banking for 1024×32 memory
4.6 Effect of memory block activation and output width selection on energy consumption
4.7 GMM structure and parallelization
4.8 Window filter configurations
4.9 FFT butterfly, radix R = 2 and 4
4.10 Basic FFT network for N = 16, R = 2, P = 2
4.11 Parallelism impact on memory and interconnect requirements for a (4×4)² matrix-multiply
4.12 Optimum parallelism level versus problem size
4.13 Sweep of physical memory block size at fixed [dm=7, width=32]
4.14 Detailed breakdown of energy vs memory block size [dm=7, width=32]
4.15 Energy overhead versus memory block size and dm
4.16 Energy overhead versus memory block size and data width
4.17 Sensitivity of Fig. 4.16b (worst-case overheads) to CACTI estimates
5.1 CMP and LWC computational structure
5.2 Sort kernel with its LWC
5.3 Conjugate Gradient algorithm
5.4 Conceptual view of LWC existence and complexity
6.1 Effect of voltage on energy and system-level FIT rate for sort (N = 1024, P = 1)
6.2 Effect of voltage on system-level FIT rate for sort (N = 1024, P = 1)
6.3 Effect of problem size on energy, delay and area for Sort
6.4 Geomean energy and delay versus voltage
6.5 The energy results are robust to a sweep of the exponential parameter α
6.6 Effect of the α parameter on the reliability results
7.1 Three stages of the WAMI system
7.2 Bayer color pattern
7.3 Forward additive Lucas-Kanade algorithm
7.4 Parallel LK problem and solution with duplication
7.5 Flow diagram for LK in hardware
7.6 Energy benefits of voltage scaling for WAMI stages
7.7 Memory contribution to energy when changing parallelism for WAMI
7.8 Comparison of gains due to voltage scaling for WAMI depending on parallelism
A.1 Effect of low-energy techniques on energy and delay (90 nm and 65 nm)
A.2 Effect of low-energy techniques on energy and delay (45 nm)
A.3 Effect of low-energy techniques on energy and delay (22 nm)
A.4 Effect of low-energy techniques on energy and delay (14 nm)
A.5 Effect of low-energy techniques on energy and delay (7 nm)
Chapter 1
Introduction
1.1 Thesis
Voltage reduction allows net energy savings of 60% compared to operation at the
nominal voltage, but also causes a significant drop in reliability. By adding lightweight
checks that cost only a fraction of the base computation, we can maintain reliability
as we reduce voltage and still achieve about 50% energy savings.
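As a rough consistency check on the two figures in the thesis statement, consider the
following back-of-the-envelope sketch. It is not the reliability or energy model
developed later in the dissertation. The symbols s (fraction of nominal energy saved by
voltage scaling) and c (check cost as a fraction of the scaled base computation) are
introduced here only for illustration, and we assume that the check cost simply adds
to the base computation at the same scaled voltage.

```latex
\begin{align}
\frac{E_{net}}{E_{nominal}} &= (1 - s)\,(1 + c) \\
s = 0.60,\; c = 0.25 \;&\Rightarrow\; 0.40 \times 1.25 = 0.50
\end{align}
```

Under these assumptions, a check at the upper end of the 0–25% cost range quoted in
the abstract still leaves about half of the nominal energy as savings, whereas full
duplication (c = 1) would leave only 20%, since 0.40 × 2 = 0.80.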
1.2 Motivation
Energy consumption is a key design limiter in many of today’s systems. Mobile devices
are limited by the energy that can be stored in batteries. Operating power density
limits (e.g., 100 W/cm² for forced-air-cooled machines or 1–10 W/cm² for ambient
cooling) have put even wired systems in an energy-dominated regime. We can
now place more transistors on an integrated circuit die than we can afford to switch
[52, 91], a phenomenon known as dark silicon [37]. If we could reduce energy per
operation, we could perform more operations within the limited power envelope we
have available.
The most straightforward way to reduce energy consumption is to reduce the
supply voltage, Vdd. This reduces both dynamic and, initially, leakage energy:
\begin{align}
E_{dyn} &= \tfrac{1}{2} \cdot \alpha \cdot C \cdot V_{dd}^{2} \tag{1.1} \\
E_{lkg} &= T_{crit}(V_{dd}) \cdot I_{lkg}(V_{dd}) \cdot V_{dd} \tag{1.2} \\
E_{total} &= E_{dyn} + E_{lkg} \tag{1.3}
\end{align}
α is the activity factor, C is the switched capacitance, Ilkg is the leakage current at
Vdd, and Tcrit is the critical path delay at Vdd. However, we cannot make Vdd
arbitrarily low:
• Lower Vdd means larger Tcrit, with an exponential degradation in the sub-threshold
region of operation (Vdd < Vth). This impacts circuit delay and could be unacceptable
if real-time constraints need to be met, see Fig. 1.1a.
• Larger Tcrit also means that Elkg is not strictly decreasing as Vdd is reduced. Instead,
Elkg will eventually increase exponentially due to the Tcrit term in Eq. 1.2,
resulting in a net increase in total energy (Edyn + Elkg), see Fig. 1.1b.
• Modern systems use high Vdd margins in order to protect against many causes
of failure, including aging, process variation, supply noise, and ionizing particle
strikes. Therefore, reducing Vdd makes a system more susceptible to failure, see
Fig. 1.1c. The FIT rate is a measure of reliability (Failures In Time, defined as
the number of errors in 1 billion hours of operation).
We certainly do not want to reduce Vdd past the point where it would increase
both energy and delay. However, as we first reduce Vdd, there is a trade-off between
energy gains and delay increases, as well as reliability degradation. As a result, energy
and reliability requirements are at odds, and together, may limit our exploitation of
scaled technology.
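The trade-off expressed by Eqs. 1.1–1.3 can be explored numerically. The following is
a minimal sketch, not the dissertation's model: the delay function (an EKV-style smooth
drive current), the leakage-current function (an exponential DIBL term), and every
constant (V_NOM, V_TH, N_VT, DIBL, ALPHA, CAP, LEAK_FRACTION) are illustrative
assumptions chosen only to reproduce the qualitative shape of the curves.

```python
import math

# All constants are illustrative assumptions, not values from the dissertation.
V_NOM = 0.8            # nominal supply voltage (V)
V_TH = 0.3             # threshold voltage (V)
N_VT = 1.5 * 0.0259    # subthreshold slope factor times thermal voltage (V)
DIBL = 0.1             # drain-induced barrier lowering coefficient
ALPHA = 0.1            # activity factor (alpha in Eq. 1.1)
CAP = 1.0              # switched capacitance C, arbitrary units
LEAK_FRACTION = 0.1    # assumed E_lkg / E_dyn at V_NOM


def on_current(vdd):
    """Assumed smooth drive current, usable above and below threshold."""
    return math.log1p(math.exp((vdd - V_TH) / (2 * N_VT))) ** 2


def t_crit(vdd):
    """Critical path delay, normalized to 1 at V_NOM (delay ~ V / I_on)."""
    return (vdd / on_current(vdd)) / (V_NOM / on_current(V_NOM))


def e_dyn(vdd):
    """Eq. 1.1: dynamic energy."""
    return 0.5 * ALPHA * CAP * vdd ** 2


def i_lkg(vdd):
    """Assumed leakage current, calibrated via LEAK_FRACTION at V_NOM."""
    i_nom = LEAK_FRACTION * e_dyn(V_NOM) / (t_crit(V_NOM) * V_NOM)
    return i_nom * math.exp(DIBL * (vdd - V_NOM) / N_VT)


def e_lkg(vdd):
    """Eq. 1.2: leakage energy accumulated over one critical path delay."""
    return t_crit(vdd) * i_lkg(vdd) * vdd


def e_total(vdd):
    """Eq. 1.3: total energy."""
    return e_dyn(vdd) + e_lkg(vdd)


def sweep(v_lo=0.15, v_hi=V_NOM, step=0.01):
    """Evenly spaced supply voltages from v_lo to v_hi inclusive."""
    n = round((v_hi - v_lo) / step)
    return [round(v_lo + i * step, 4) for i in range(n + 1)]


def min_energy_point(voltages):
    """Voltage in the sweep that minimizes total energy."""
    return min(voltages, key=e_total)
```

With these assumed parameters, `min_energy_point(sweep())` lands strictly inside the
swept range: below that voltage the growth of Tcrit makes leakage energy rise faster
than dynamic energy falls, which is the behavior described in the second bullet above.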
[Figure 1.1 panels, each plotted against Vdd (V) from 0.3 to 0.75: (a) Delay ratio, with Vth marked; (b) Energy ratio (total, dynamic, leakage); (c) Error rate ratio (FIT rate)]
Figure 1.1: Effect of voltage scaling on energy, delay, and reliability; example from
Sec. 6.3.1.1 with HP (High-Performance) process
1.3 Outline
In the rest of this chapter we introduce our work in more detail. In Chapter 2 we cover
background work. The rest of the dissertation is then organized in five core chapters:
1. We need to ensure that we are applying our novel techniques on top of designs
that are already optimized for energy efficiency. We thus start by reviewing
some common low-energy techniques in Chapter 3: Vdd scaling, transmission
gates, gate boosting, power gating, low-leakage (high-Vth) processes, and dual-
Vdd. We review these in a unified framework and across different technology
generations, allowing us to assess the trends and changes in relative importance
of each technique.
2. Chapter 4 focuses on reducing communication energy. We show how tuning
the parallelism of an application can change its communication requirements
and reduce its energy. We also design a robust FPGA memory architecture to
minimize the mismatch between an application’s needs and what the architecture
provides.
3. Chapter 5 provides examples of applications and their lightweight checks (LWCs),
together with a classification system to help understand the space.
4. Chapter 6 develops reliability models and shows the benefits of LWCs empirically.
Specifically, we show that we can maintain reliability as we reduce Vdd by
leveraging LWCs.
5. Chapter 7 demonstrates the ideas from our work on a large-scale, real-time system
for a WAMI application (Wide-Area Motion Imaging).
We show limitations and make suggestions for future work in Chapter 8. Finally, we
present our conclusions in Chapter 9.
1.4 Common low-energy optimizations (Chapter 3)
Our main goal is to explore the energy benefits of reducing voltage, but as we do so,
we must be mindful of other possible low-energy techniques and we must explore a
larger space of optimizations. Otherwise, we might end up claiming certain benefits
for a particular case, but not the actual energy-optimum case. For instance, scaling
voltage on low-Vth (high-performance) processes typically yields larger energy savings
than scaling voltage on high-Vth (low-power) processes because there is more room to
reduce voltage before hitting the sub-threshold region, but the high-Vth process still
achieves lower energy.
Furthermore, the relative benefits of those low-energy techniques change with
technology. Many new challenges arise as we approach atomic-scale feature sizes
and the end of the silicon roadmap. New materials (copper, metal gates, high-κ
dielectrics, strained silicon) and device structures (FinFETs) have been introduced
to somewhat mitigate scaling effects. We therefore revisit Vdd scaling on top of the
following design options to understand how technology scaling has changed their
importance and benefits:
• Process selection: For low-power applications, designers typically use a high-Vth
(low-power, or LP) process, which significantly reduces leakage energy, at the cost
of lowering the clock frequency. They choose an LP process instead of the more
common HP process (high-performance, or low-Vth process); the HP process runs
faster but also consumes more leakage energy. We expect an LP process to be
more appropriate for low-energy applications, but we still explore both LP and
HP as a design option because we want to quantify their impact and understand
how their importance changes with technology. Furthermore, as we reduce Vdd
past Vth for both HP and LP, we might observe a turnaround where leakage
for the LP process increases much faster than that of the HP process because
the latter has lower delay. Note that we did not observe this phenomenon, as
we will see in the experimental results of Chapter 3: LP always reaches lower
energy than HP. Still, quantifying the results for the HP process is useful for
the designer who is not looking for the absolute energy-minimum point, but is
looking to reduce energy without impacting the delay too much (keeping an HP
process).
• Transmission gates: Multiplexers on FPGAs are typically built out of pass-gate
logic in order to save about 50% area compared to building them out of trans-
mission gates. However, pass-gate multiplexers are also less robust because they
do not output a full Vdd at their output, where the voltage swing is reduced to
a peak of about Vdd-Vth. This is even more problematic as we scale technology,
Vdd, or both, and we may be forced to switch to transmission gate designs, which
output a full Vdd swing. Even at full Vdd, it is not clear which of pass gates or
transmission gates have lower energy, especially across technologies, where their
relative advantages may change.
• Gate boosting: As an alternative to transmission gates, we also explore the possi-
bility of raising the voltage of the NMOS gate terminal above Vdd (gate boosting);
the output of the logic can then drop down to Vdd, instead of falling below Vdd.
• Power gating: In order to allow for a more efficient place-and-route process and
broad applicability, resources on an FPGA are significantly over-provisioned,
especially in the interconnect. Most resources end up unused and consume extra
leakage power. As we reduce Vdd, delay increases, and leakage gets even worse. To
counter this, we can augment the FPGA with power gating capabilities and turn
off unused resources so they do not leak. This allows us to reduce leakage and
extend the minimum-energy point to lower Vdd levels. The goal is to gain enough
extra savings to overcome the overhead of adding the power gating capability.
[Figure 1.2 (bar chart, 22nm): y-axis "Energy Ratio" from 0.0 to 1.0; bars for HP, HP TransG, HP GateBoost, HP PowerG, HP TransG PowerG, HP GateBoost PowerG, LP, LP TransG, LP GateBoost, LP PowerG, LP TransG PowerG, LP GateBoost PowerG]
Figure 1.2: Resulting energy when applying the different low-energy techniques, normalized
to a 22 nm HP baseline at nominal Vdd with pass-gates and no power gating
• Dual-Vdd: Most of the resources end up off the critical path, suggesting that
we could use two Vdd levels: a high Vdd for resources close to the critical path
and a low Vdd for resources far from it. This way we reduce energy for most
of the resources without impacting the delay. However, implementing dual-Vdd
introduces new energy overhead due to power supply programmability and level
converters. We find that a dual-Vdd design consumes 24% more energy when both supplies
are at nominal Vdd. We then need to reduce the low supply enough to recover
that overhead before we start seeing savings. Dual-Vdd architectures will also be
useful in Chapter 6, where we will be interested in reliability, which is directly
affected by the Vdd level: high-Vdd resources will be more reliable than low-Vdd
ones.
The resulting energy gains from applying these techniques (except dual-Vdd) are
highlighted in Fig. 1.2 for 22 nm technology. We find that simply scaling Vdd to the
energy-optimum point reduces energy by 9% compared to operation at the nominal
Vdd of the technology. Using transmission gates instead of pass-gates provides another
5% reduction, and power gating provides another 60% reduction, for 65% total sav-
ings. This is in contrast to older technologies (90 nm and 65 nm) that do not benefit
from transmission gates. These large energy savings come with a delay penalty, but
if we prevent the delay from increasing beyond the baseline, we can trade the delay
gains from using transmission gates and still get 45% total savings (not shown in
Fig. 1.2, covered in Chapter 3). We can further reduce leakage energy significantly by
selecting a high-Vth process. At 22 nm (again in Fig. 1.2) this leads to 51% additional
savings, for 83% total savings. We also find that dual-Vdd architectures do not reduce
energy more than a single-Vdd design with power gating. We find similar results when
using gate boosting instead of transmission gates.
1.5 Parallelism and communication energy (Chapter 4)
Chapter 4 minimizes communication energy. It is centered around two different but
ultimately related ideas: optimizing application parallelism and finding a robust mem-
ory architecture. Chapter 4 accomplishes a similar goal to Chapter 3 in that it identi-
fies energy-optimized designs to make sure that we do not claim LWC benefits based
on a poorly chosen baseline (in Chapter 6). Furthermore, we do not scale voltage for
memories in Chapter 6, so it is useful to explore memory architecture design on top
of voltage scaling for logic and interconnect, allowing us to avoid diminishing returns
as we keep scaling voltage due to the memories becoming more dominant.
Can we reduce energy by changing an application’s parallelism? To first order,
the answer would be no. If we double parallelism, we double the energy per second,
but we also complete the task twice as fast, resulting in no overall difference in
energy. However, this first order argument ignores the effects of communication. In
general, we can think of our computations as having two components: the logic, which
performs the actual operations we are interested in, and the communication, which
takes the form of memory references or data movement on interconnect. Total energy
can be divided as follows:
Etotal = Elogic + Ecommunication (1.4)
Ecommunication = Einterconnect + Ememory (1.5)
We could think of communication as a necessary overhead, and yet, on FPGAs,
energy consumption is often dominated by data communication energy, especially
over the interconnect ([113], Sec. 4.2). As we change parallelism, Elogic remains the
same, as suggested by the first order argument above, but Ememory and Einterconnect
could change, so it is possible to reduce energy by changing parallelism. Too little
parallelism forces us to pay higher Ememory costs, whereas too much parallelism may
increase Einterconnect too much. In Chapter 4, we explore how to use data placement
and parallelism to reduce communication energy. We show that parallelism can reduce
energy, that the optimum level of parallelism increases with the problem size, and that
it usually lies between the two extreme fully-sequential and fully-spatial design points.
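The trade-off above can be illustrated with a toy model. The sketch below is not the model developed in Chapter 4; it assumes, purely for illustration, that memory energy per access grows as the square root of the per-PE memory size and that interconnect energy per transfer grows as the square root of the number of PEs. The function names and the constants c_mem and c_wire are hypothetical.

```python
import math

def total_energy(p, n, e_logic_per_op=1.0, c_mem=0.05, c_wire=0.02):
    """Toy energy model for n operations spread over p processing elements.

    Assumptions (hypothetical, for illustration only):
      - logic energy does not depend on p
      - each PE owns a memory of n/p words, and energy per access
        grows as sqrt(n/p)
      - interconnect energy per transfer grows as sqrt(p)
    """
    e_logic = n * e_logic_per_op
    e_memory = n * c_mem * math.sqrt(n / p)
    e_interconnect = n * c_wire * math.sqrt(p)
    return e_logic + e_memory + e_interconnect

def best_parallelism(n):
    """Exhaustively search power-of-two PE counts from 1 to n."""
    candidates = []
    p = 1
    while p <= n:
        candidates.append(p)
        p *= 2
    return min(candidates, key=lambda pe: total_energy(pe, n))

for size in (1024, 65536):
    print(size, best_parallelism(size))
```

Under these assumptions the minimum lies strictly between the fully-sequential (p = 1) and fully-spatial (p = n) points, and it moves to larger p as n grows, which is the qualitative behavior described above.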
How should we design an FPGA’s memory architecture to minimize energy? We do not
have prior knowledge of the application for which an FPGA will be used, but we need
to make a decision on the chip’s memory organization. This will generally cause a
mismatch between what an application needs and what is provided for it on the chip,
leading to energy overheads that would have been avoided had we had the “right”
memory organization. Chapter 4 thus also explores how FPGA memory architecture
(memory block size(s), memory banking, and spacing between memory banks) can
impact communication energy, and determines how to organize the memory archi-
tecture to guarantee that the energy overhead compared to the optimally-matched
architecture for the design is never more than 60%. We propose a novel method for
banking memories on FPGAs, allowing us to avoid some of the application/architecture
mismatch. We call this method “internal banking” (Sec. 4.3). We specifically
show that an architecture with 32-bit wide, 16Kb internally-banked memories placed
every 8 columns of 10 4-LUT Logic Blocks is within 61% of the optimally-matched
architecture across our large benchmark set. Without internal banking, the worst-
case overhead is 98%, achieved with an architecture with 32-bit wide, 8Kb memories
placed every 9 columns, roughly comparable to the memory organization on the Cy-
clone V (where memories are placed about every 10 columns). Monolithic 32-bit wide,
16Kb memories placed every 10 columns (comparable to 18Kb and 20Kb memories
used in Virtex 4 and Stratix V FPGAs) have a 180% worst-case energy overhead.
Combined with the idea of changing parallelism to reduce energy, we show practical
cases where designs mapped for optimal parallelism use 4.7× less energy than designs
using a single processing element (PE).
1.6 Maintaining reliability using LWCs (Chapters 5 and 6)
If we reduce Vdd as suggested in Chapter 3, we reduce energy, but we also decrease
reliability. Can we tolerate higher rates of upsets in our designs to maintain high
reliability even as we scale Vdd? For example, we could use triple modular redundancy
(TMR), which would perform each computation three times, and decide on the actual
result using majority voting. This is a common way to drastically increase a system’s
reliability [24, 97]. However, we must guarantee that the cure is not worse than the
problem—the design changes we make must not consume more energy than we save
by voltage reduction. TMR performs computations three times, thereby increasing
the energy by 3×. Since dynamic energy scales as CV^2, we must achieve voltage
reductions of over √3× in order to achieve any dynamic energy benefits, and even
when there is a benefit, it is limited by the TMR overhead. DMR (Dual Modular
Redundancy) is a cheaper alternative to TMR: we only compute the result twice [55].
If the results match, we proceed; if not, we have detected an error, but we cannot
correct it since we do not know which of the two results is the correct one. Instead, we
must use a system that stores incoming data and rolls back when an error is detected.
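The break-even argument for redundancy can be made concrete with a short sketch. It considers dynamic energy only (proportional to CV^2, as stated above) and ignores leakage, voters, comparators, and rollback buffers; the function names are hypothetical.

```python
import math

def dynamic_energy_ratio(copies, v_scaled, v_nominal):
    """Dynamic energy of `copies` redundant computations at v_scaled,
    relative to a single computation at v_nominal (E proportional to C*V^2)."""
    return copies * (v_scaled / v_nominal) ** 2

def break_even_voltage_reduction(copies):
    """Factor by which the voltage must drop before redundancy
    stops costing extra dynamic energy."""
    return math.sqrt(copies)

print(break_even_voltage_reduction(3))  # TMR: sqrt(3)
print(break_even_voltage_reduction(2))  # DMR: sqrt(2)
```

In this first-order view, DMR only needs a √2× voltage reduction to break even, compared with √3× for TMR, which is why it is the cheaper of the two baselines.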
Can we do better than DMR? Does an operation really need to be duplicated
in order to be checked? As an alternative, we explore the use of application-specific
checks that are less expensive than the computation itself—lightweight checks (LWC).
We exploit the fact that, for many computations, it is asymptotically and absolutely
cheaper to check that a proposed answer for a computational task is correct than it is
to perform the computation. Throughout this work, we often refer to an application
as the “compute”, or CMP, and its associated lightweight check as the LWC. Chap-
ter 5 identifies different applications and classifies them according to their lightweight
checking characteristics. The goal is to have a framework that helps us understand
any application, classify it, and decide whether it is amenable to lightweight checking.
In Chapter 6 we run fault-injection experiments and evaluate the error rate,
recomputation rate, and energy impact of using lightweight checks. We find that
for many applications of interest, we can identify lightweight checks that cost only a
fraction of the base computation (0-25%) and allow for voltage to be scaled down to
the energy-minimum point while maintaining reliability. This translates into about
50% overall energy savings.
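As a generic illustration of the CMP/LWC idea, consider sorting: producing a sorted list takes O(n log n) work, while checking that a proposed result is a sorted permutation of the input takes O(n). Sorting is used here only as a familiar stand-in and is not claimed to be one of the applications studied in Chapter 5; the function names and the retry limit are hypothetical.

```python
from collections import Counter

def compute(data):
    """CMP: the base computation, O(n log n)."""
    return sorted(data)

def lightweight_check(data, result):
    """LWC: O(n) check that `result` is a sorted permutation of `data`."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and Counter(data) == Counter(result)

def checked_compute(data, compute_fn=compute, max_retries=3):
    """Run the computation, check it, and recompute from the saved
    input (rollback) if the check fails."""
    for _ in range(max_retries):
        result = compute_fn(list(data))  # `data` is retained for rollback
        if lightweight_check(data, result):
            return result
    raise RuntimeError("lightweight check failed after all retries")

print(checked_compute([3, 1, 2]))
```

The check never re-runs the sort; it only inspects the proposed answer, so an upset in the computation is detected at a fraction of the cost of duplicating it.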
1.7 WAMI case study (Chapter 7)
We present empirical results throughout Chapters 3, 4, 5 and 6 based on many bench-
marks, and we identify a robust energy-efficient FPGA architecture. In Chapter 7
we perform a case study for a large, real-time system on a 22 nm chip composed of
multiple computational kernels. This allows us to exploit the context in which the
kernels are used and explore additional issues that are not obvious when examining
the benchmarks in isolation. In particular:
• Can we optimize all components of the system, or at least the dominant ones?
Or are we only able to significantly optimize small portions of it, leading to low
overall savings?
• Identifying the application of interest provides information on the accuracy needed
at the output and noise-tolerance level of the application. This is information
that we may be able to leverage to simplify our LWCs. Do we need LWCs for
all parts of the system? Or can some of them get away without LWCs if they are
followed by stages that tolerate errors?
• If a kernel of interest does not store its incoming data, using LWCs also means
that we need to add buffers for checkpoint and rollback, adding to the cost. This
may or may not be necessary depending on the interface with other stages of the
system.
• Reducing voltage reduces performance; can we compensate for this by increasing
parallelism?
We find that the sequential WAMI for a 512 × 512 image is amenable to very
low-cost checks that add less than 1% energy to the base computation. However, it
is also dominated by memory energy (41% of the total), and since we do not scale
voltage on memories, we only achieve 34% energy reduction. If we instead increase
parallelism as suggested in Chapter 4, we reduce the memory contribution down to
14% with 32 PEs. When we scale voltage on top of the optimum number of PEs, we
achieve 44% energy savings.
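The reason the memory share matters can be seen from a simple bound: if a fraction of the energy is not voltage scaled, total savings cannot exceed the remaining fraction. The sketch below is a first-order illustration only. It ignores LWC cost and any change in the unscaled part, so it is not expected to reproduce the 34% and 44% figures above; the function names are hypothetical.

```python
def max_savings(unscaled_fraction):
    """Upper bound on total savings when `unscaled_fraction` of the
    energy (e.g., memory) is left untouched."""
    return 1.0 - unscaled_fraction

def total_savings(unscaled_fraction, scaled_part_savings):
    """Total savings when only the scaled part is reduced by
    `scaled_part_savings` (a fraction between 0 and 1)."""
    return (1.0 - unscaled_fraction) * scaled_part_savings

print(max_savings(0.41))  # memory at 41% of total (sequential WAMI)
print(max_savings(0.14))  # memory at 14% of total (32 PEs)
```

Reducing the memory share from 41% to 14% raises the ceiling on what voltage scaling of logic and interconnect can deliver.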
1.8 Contributions and scope
We have designed a large unified framework to explore many low-energy design ideas
and their impact on all of energy, delay, area, and reliability. Our contributions
include work on applications, architecture, models and tools.
We think of energy as being composed of three elements: logic, interconnect,
and memory (Sec. 1.5, Eq. 1.4). Chapter 3 addresses logic and interconnect. It
performs an empirical study of the impacts of different common low-energy circuit
and architecture techniques. In particular, we evaluate the energy impact of scaling
voltage on logic and interconnect.
In Chapter 4, we focus on communication energy, that is, the interconnect and
memory components. We provide both analytic and experimental characterizations
of the mismatches between the logical memory organization needed by an applica-
tion and the physical memory organization provided by an FPGA. We find a robust
memory architecture that minimizes that mismatch across a large set of benchmarks,
keeping it under 1.6× energy overhead. We also characterize how parallelism impacts
energy consumption, including a demonstration of how parallelism tuning can reduce
energy.
In Chapter 6, we bring together the previous optimizations on all of logic, in-
terconnect, and memory, and we see combined benefits. We address the reliability
impact of scaling voltage by adding LWCs. We present a fault estimation framework
to assess an application’s reliability, and an empirical evaluation of the LWCs’ costs
and benefits. Our results depend on our ability to identify computational kernels that
have LWCs. In Chapter 5, we propose a classification system for applications and
their LWCs.
The contributions throughout our work are further solidified in the Chapter 7 case
study, which proposes an efficient implementation of Wide-Area Motion Imaging, and
evaluates the impact of adding LWCs to a large, real-time system.
Our tool flow (shown in Fig. 2.2) provides the following capabilities:
• Fast gate-level simulator that computes activity factors and performs fault in-
jection experiments
• Power-optimization to select the most energy-efficient memory implementation
(Sec. 4.5.1)
• LWC voltage optimization to select the energy-minimizing operating voltage
• Parallelism optimization to select the energy-minimizing level of parallelism
• Augmented VPR [82] and Versapower [44] to support voltage scaling, dual-Vdd,
transmission gates, gate boosting, and power gating
• Reliability analysis, with and without LWCs
1.9 Publications
We published our first work on LWCs to save energy in [59]. Chapter 6 is an im-
provement over the work in [59], and Chapter 5 complements it.
We performed the first study to address the impact of FPGA on-chip memory
architecture on energy consumption in [57]. Our first work on parallelism tuning for
energy was published in [60]. These two ideas are related, and they came together
within the theme of communication energy minimization in [58]. Chapter 4 mostly
covers [58], with a few additions.
Chapter 2
Background
2.1 FPGA models
2.1.1 FPGA architecture
Unlike full-custom designs or ASICs (Application-Specific Integrated Circuits), FP-
GAs provide post-fabrication programmability, so the same chip can be used to imple-
ment different applications. This programmability allows for easier debugging, lower
production costs, and faster time to market.
We build on the standard Island-Style FPGA model [12] shown in Fig. 2.1. Pro-
grammable logic is provided by k-input LookUp Tables (k-LUTs). The basic logic
tile is a cluster of k-LUTs with a local depopulated crossbar providing connectivity
within the cluster. Each LUT’s output is either registered or not before being fed to
the output of the cluster, and to the cluster’s input crossbar. There are n LUTs per
cluster (Fig. 2.1b). These clusters are arranged in a regular mesh and connected by
segmented routing channels (Fig. 2.1a). They are connected to the routing network
through their input/output (I/O) pins placed around the cluster. These connections
are programmed through the connection boxes: each pin has the ability to connect
to some of the routing segments, and when the FPGA is programmed, one of those
connections is chosen. I/O pads are placed on the periphery of the FPGA.
[Figure 2.1, panel (a) Standard Island-Style FPGA model: mesh of clusters with IO on the periphery, switch boxes (Wilton), routing channels (width=4 shown), length 1 routing segments. Panel (b) Zoom on an FPGA tile: cluster of n k-LUTs, each LUT followed by an optional register, depopulated crossbar, Cin cluster inputs, Cout = n cluster outputs, I/O pins going all around the cluster; connection boxes are also depopulated]
Figure 2.1: Basic FPGA model used for logic and routing
Routing segments are part of routing channels; a channel with W segments is said to have
width W. We typically use the same channel width throughout the FPGA. Two routing
segments can communicate through switch boxes, placed at each routing channel
intersection. The switch boxes are also depopulated, using the Wilton pattern [118]
with fs = 3. Actually, we can avoid using a switch box every time a segment goes
to a different channel. We do so by increasing the segment’s length [11]. A length
of 1 means that the segment spans only one logic tile, and must go through a switch
box to get to the next channel. A length of 2 means that the segment spans two tiles
and skips the switch box in between them. Segments can further be uni-directional
or bi-directional [74, 70]. Segment length and directionality, LUT size, cluster size,
channel width, and the depopulation pattern of inter-cluster crossbars, switch boxes, and
connection boxes are all parameters that affect the energy/area/delay and
ease of place-and-route trade-offs of the FPGA [12, 71].
FPGAs can store data in registers, but applications often benefit from having on-
chip memory that provides denser storage. FPGA on-chip memory traditionally takes
the form of columns within the mesh that are dedicated to having memory blocks.
For example, we can reserve a whole column to hold 8Kb memories instead of logic
clusters. These columns are placed every so often on the FPGA, and communicate
with the rest of the chip in the same way that logic clusters do: through their I/O pins
communicating to the routing interconnect. Memory blocks are typically wider and
taller than logic clusters, which is one of the reasons they are placed together in the
same column (Fig. 4.2). Other blocks can be placed on the FPGA in the same fashion,
most commonly multiplier blocks. Even though multiplication can be implemented
using a combination of LUTs, it is more efficient to have dedicated columns with
multiplier blocks [65]. Chapter 4 explores optimum memory block organization. We
do not explore optimum multiplier design; we keep the default models in VTR 7.0
[81] and place a column of 36× 36 multipliers every 20 columns. Each multiplier can
be decomposed into two 18×18 multipliers, or four 9×9 multipliers. Each multiplier
and its interconnect is as high as four tiles (one tile is one logic cluster and its
interconnect).
2.1.2 FPGA tools and our modifications to them
To describe a custom FPGA architecture and map designs to it, we use VPR (Versa-
tile Placement-and-Routing) [82]. VPR is the fourth and last step of the VTR tool
(Verilog-To-Routing). As shown in Fig. 2.2, the first step is logic synthesis, which
converts from Verilog to BLIF (Berkeley Logic Interchange Format) using Odin [54].
The second step consists in optimizing and mapping the BLIF to the FPGA architec-
ture of interest using ABC [87]. We code our applications in Bluespec SystemVerilog
[15]; they then get compiled to Verilog, the input to Odin.
The third step consists in evaluating activity factors for the application. The VTR
flow uses ACE [67] and performs vectorless estimation to estimate activity factors and
static probabilities of a design. These have a major impact on the estimated energy.
Common ways to estimate activity include assigning a uniform activity to all nets
(e.g., 15%), or performing vectorless estimation with tools such as ACE, as done by
VTR. For better accuracy, our flow obtains activity factors by performing gate-level
simulations of the designs. We run a logic simulation on the BLIF output of ABC
(pre-vpr.blif file). Depending on the benchmark, we either simulate it with random
input data or with a custom data set.
Note that VPR’s place-and-route task is not exactly the same as that of place-
and-route tools for commercial FPGAs. Commercial tools have a fixed architecture
and try to find a route that works and that minimizes some metric, typically the
delay. We could make VPR work the same way by supplying it the dimensions of the
FPGA (e.g. 4×4 in Fig. 2.1), and the routing channel width (4 in Fig. 2.1). However,
this would give a poor indication of the tool’s place-and-route performance in cases
where the architecture is much larger than the application (it would be too easy to
route) or vice versa. Instead, VPR performs a binary search to find the minimum
size array that allows for a successful placement of all the blocks. It then performs
another binary search to find the minimum channel width that produces a successful
route. This gives the smallest achievable area for the design. Once VPR finds the
minimum channel width, VTR calls VPR a second time with a channel width set to
1.3× the minimum, allowing for a less constrained route and better results. We use
VTR with this default behavior.
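The minimum-channel-width search described above can be sketched as a binary search over a routability predicate. This is a generic illustration, not VPR's actual implementation: `can_route` is a hypothetical callback standing in for a full routing attempt, the search bounds are arbitrary, routability is assumed to be monotone in the width, and rounding the relaxed width up is also an assumption.

```python
import math

def min_channel_width(can_route, lo=1, hi=1024):
    """Smallest width w in [lo, hi] for which can_route(w) is True,
    assuming that if w routes then every larger width routes too."""
    if not can_route(hi):
        raise ValueError("design does not route even at the maximum width")
    while lo < hi:
        mid = (lo + hi) // 2
        if can_route(mid):
            hi = mid        # mid works: the minimum is mid or smaller
        else:
            lo = mid + 1    # mid fails: the minimum is larger than mid
    return lo

def relaxed_channel_width(w_min, slack=1.3):
    """Channel width for the second, less constrained run (1.3x the minimum)."""
    return math.ceil(w_min * slack)

# Stand-in predicate: pretend the design routes at width 37 or more.
w_min = min_channel_width(lambda w: w >= 37)
print(w_min, relaxed_channel_width(w_min))
```

Each probe of `can_route` corresponds to a complete routing attempt, so the logarithmic number of probes is what makes the search practical.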
We modify VPR’s area, delay and power estimation to support power-supply pro-
grammability with two voltage levels and power gating. To implement multiplexers,
VPR’s power tool, Versapower [44], uses NMOS pass-gate logic and level restorers.
As an alternative, we add support for transmission gates, for which we evaluate low-
level circuit characteristics using SPICE (Simulation Program with Integrated Circuit
Emphasis), within the same framework used by Versapower. The CMOS technology
models that we feed into Versapower’s SPICE scripts are Predictive Technology Mod-
els (PTM) [23]. We use 90 nm down to 7 nm technologies. We modify Versapower to
support the FinFET technology. We use CACTI 6.5 [88] to model memories (more
in Sec. 4.4.2).
We use ITRS [2] parameters for constants such as the unit capacitance of a wire
(e.g. Cwire = 180 pF/m at 22 nm). Then:
Cmetal = Cwire × tile-length (2.1)
We evaluate interconnect energy based on this Cmetal, instead of the constant one that
is provided in the VTR architecture files. This way, the actual size of the low-level
components of the given architecture and technology, as well as the computed channel
width, are taken into account when evaluating energy. It is important to model this
accurately since routing energy dominates total FPGA energy.
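As a worked example of Eq. 2.1, the sketch below computes Cmetal for one tile and the corresponding switching energy. The tile length (20 µm), the supply voltage (0.8 V), and the activity factor are hypothetical values chosen only for illustration; the energy expression 0.5·α·C·V^2 is the usual first-order dynamic-energy formula and is an assumption here rather than the exact expression used inside Versapower.

```python
C_WIRE_F_PER_M = 180e-12  # unit wire capacitance at 22 nm quoted in the text

def c_metal(tile_length_m, c_wire=C_WIRE_F_PER_M):
    """Eq. 2.1: wire capacitance of a segment spanning one tile."""
    return c_wire * tile_length_m

def switching_energy(capacitance_f, vdd, activity=1.0):
    """First-order dynamic energy per cycle: 0.5 * activity * C * Vdd^2."""
    return 0.5 * activity * capacitance_f * vdd ** 2

tile_length = 20e-6               # hypothetical tile length: 20 um
cap = c_metal(tile_length)        # about 3.6e-15 F (3.6 fF)
energy = switching_energy(cap, vdd=0.8, activity=0.15)
print(cap, energy)
```

Because Cmetal scales with the tile length, any architectural choice that enlarges the tile (wider channels, larger clusters) directly increases the energy of every routed segment.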
Our tool flow is shown in Fig. 2.2, including modifications for fault injection and
reliability calculations described in Sec. 6.2.
2.1.3 Benchmarks
Toronto20 is the classic FPGA benchmark set. It contains BLIF benchmarks that are
already in the right format for a k = 4 FPGA architecture; they are read directly by
VPR. However, Toronto20 does not contain any memory or multipliers, so VTR also
provides a set of Verilog benchmarks containing those and that first need to go through
Odin and ABC before VPR. From the VTR benchmark set, we exclude spree.v
because of an Odin segmentation fault, and LU32PEEng.v and LU64PEEng.v because they
take too long through VPR and are similar to LU8PEEng.v.
In our work we use both of those benchmark sets, plus some custom benchmarks
described in Sec. 4.5.3, Chapter 5, and Chapter 7. These custom benchmarks are useful
because they use memories, their parallelism can be tuned, and because we know
their logic, so we can design LWCs for them. When simulating them, we also use
realistic custom data sets. When simulating the Toronto20 and VTR benchmark
sets, we have no choice but to use random data, since they do not have associated
data sets. Note that the VTR benchmarks do not come with clock-enable for the
memories, so we set them to be always on.
[Figure 2.2 (flowchart). Recoverable box labels: parametric description of application, for both CMP and LWC; generate Verilog for given application parallelism P and size N; Bluespec SV; Verilog; Odin pre-processing (adapt Verilog syntax to match Odin assumptions); power-opt (power-opt memories); Modified VTR 7.0 (Odin, ABC, VPR, Versapower); BLIF file; activity factor simulation (gate-level simulation with dataset, or ACE); activity factors; FPGA architecture (generated based on technology and memory organization; CACTI, ITRS, SPICE); fault injection simulation; reliability and LWC equations; data on energy, delay, area, reliability. Annotations: alternatively, we can start from the Verilog benchmarks from VTR 7, but will not be able to tune parallelism and application size; alternatively, we can start from the BLIF benchmarks from Toronto20, but will not have memories and multipliers]
Figure 2.2: Tool flow
2.1.4 FPGA energy modeling
Poon [92] developed extensive energy modeling for FPGAs and identified how to size
LUTs, clusters, and segments to minimize energy. Poon found that clusters of n = 8–10
LUTs with k = 4 inputs and segments of length 1 were best for energy. We use these
conclusions (with n = 10) for our basic FPGA structure. Since Poon’s work, FPGA
energy modeling has also been expanded to modern direct-drive architectures and
integrated into the more flexible framework of VTR (Versapower, [44]). We assume
direct-drive interconnect. We use 22 inputs per cluster as in the default VTR example
architecture files (instead of the maximum of k × n = 4× 10 = 40). We also use the
default depopulation patterns from VTR’s example architecture files.
2.1.5 FPGA memory energy
Poon’s study did not include memories. Recent work on memory architecture has
focused on area optimization rather than energy. Luu examined the area-efficiency of
memory packing and concluded that it was valuable to support two different memory
block sizes in FPGAs [80]. Lewis showed how to size memories for area optimization
in the Stratix V and concluded that a single 20Kb memory was superior to the
combination of 9Kb and 144Kb memories in previous Stratix architectures [75], but
did not address energy consumption, leaving open the question of whether energy-
optimized memory architectures would be different from area-optimized ones. Chin
explored the energy impact of embedded memory sizes when they are used to map
logic, but not when they are used as read-write memories for application data [29].
We performed the first study to address the impact of FPGA on-chip memory
architecture on energy consumption in [57]. The results are presented in detail in
Chapter 4 of this dissertation. Chapter 4 also covers work on reducing energy through
parallelism, published in [60]. We combined [60] and [57] and expanded upon them
in [58].
2.2 Previous work on energy optimizations
2.2.1 Technology scaling
During the long era of Dennard Scaling [35], if we scaled Vdd and Vth with feature
size, the power density of our circuits remained constant. This worked well to keep
leakage current small as long as Vth was sufficiently large.
I_{sd,leak} \approx I_S \left( \frac{W}{L} \right) e^{\frac{V_{gs} - V_{th}}{nkT/q}}    (2.2)
In the off state in CMOS, Vgs ≈ 0, so the off-state leakage current is driven by the exponent −Vth/(nkT/q).
The sub-threshold slope, S = (nkT/q) ln(10), tells us how large a voltage change is
necessary to cause a 10× reduction in sub-threshold leakage current. With finite
sub-threshold slope, typically around 110 mV per decade, we cannot drop Vth below a few hundred
millivolts if we want to keep the on-off current ratio sufficiently high. This puts an end to
Vth scaling, and, consequently, puts an end to Dennard Scaling [17], limiting our
opportunity to further reduce voltage. As a result, we can now place more transistors
on a chip than we can afford to use at any point in time [52, 91]. This gives rise to the
era of dark silicon [115, 37] where we can often maximize performance by minimizing
energy rather than by minimizing circuit delay. It also means that leakage plays an
increased role in the energy consumption of our devices. Among other things, this
motivated the move to multi-gate devices, such as FinFETs [51] that reduce the sub-
threshold slope (to around S=90mV), thereby decreasing the leakage current at a
given threshold voltage.
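The effect of the sub-threshold slope on leakage can be checked numerically. The sketch below evaluates S = (nkT/q) ln(10) and the off-state factor from Eq. 2.2 with Vgs = 0, rewritten as 10^(−Vth/S). The threshold voltage of 0.3 V and the temperature of 300 K are hypothetical values chosen for illustration; the function names are ours.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23
ELECTRON_CHARGE_C = 1.602176634e-19

def thermal_voltage(temp_k=300.0):
    """kT/q in volts."""
    return BOLTZMANN_J_PER_K * temp_k / ELECTRON_CHARGE_C

def subthreshold_slope(n, temp_k=300.0):
    """S = (n*k*T/q) * ln(10), in volts per decade of current."""
    return n * thermal_voltage(temp_k) * math.log(10)

def off_current_factor(v_th, slope):
    """exp(-Vth / (n*k*T/q)) expressed through S: 10 ** (-Vth / S)."""
    return 10.0 ** (-v_th / slope)

print(subthreshold_slope(1.0))  # ideal limit, roughly 60 mV/decade at 300 K
ratio = off_current_factor(0.3, 0.110) / off_current_factor(0.3, 0.090)
print(ratio)  # leakage at S = 110 mV relative to S = 90 mV, same Vth
```

At the same threshold voltage, moving from S = 110 mV to S = 90 mV lowers the off-state factor by roughly 4× in this example, which illustrates why a steeper slope helps at a given Vth.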
2.2.2 Transmission gates, gate boosting
Due to the overhead of programmability, FPGAs are less energy-efficient than ASICs
[65]. Programmability is provided using LUTs for logic, and multiplexers (MUXes)
for routing, both of which have been built out of NMOS pass-transistor logic instead
of transmission gates [13] (shown in Fig. 3.2). The advantage of pass-transistor logic
is reduced area, since fewer transistors are used to implement the same function [120].
The disadvantage of pass-transistor logic is the loss in signal strength, which comes
out as Vout = Vgate−Vth. Vgate is typically set to Vdd, and the resulting Vdd−Vth leads
to increased static power dissipation in subsequent stages. With newer technologies,
Vdd − Vth becomes smaller (Fig. 3.5a), and the problem only grows in importance.
To counter this effect without moving to transmission gates, we can use higher
voltage configuration bits, or gate boosting: set Vconfig on > Vdd and ensure the output
is at a full Vdd (this is a form of dual-Vdd architecture). However, over-volting the gate
drive makes the device age faster, while pass-gate FPGAs are already more sensitive
to aging, especially with newer technologies that use high-κ dielectrics [62, 5]. On the
other hand, if we run the gate at lower voltage, as we intend to do, we can over-drive
the pass-gates without exceeding nominal voltage. Furthermore, some previous work
showed reduced aging for newer FinFET technologies [90].
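To make the first-order relation Vout = Vgate − Vth concrete, the following Python sketch computes the degraded pass-gate output and the smallest configuration voltage that restores a full Vdd swing. The function names and example voltages are hypothetical; body effect, leakage, and aging are ignored.

```python
def pass_gate_output(v_in, v_gate, v_th):
    """First-order NMOS pass-gate model: output cannot rise above v_gate - v_th."""
    return min(v_in, v_gate - v_th)


def min_boosted_gate_voltage(v_dd, v_th):
    """Smallest gate (configuration) voltage that still passes a full v_dd."""
    return v_dd + v_th


# Hypothetical values, for illustration only (not taken from any PTM model).
v_dd, v_th = 0.8, 0.3
degraded = pass_gate_output(v_dd, v_gate=v_dd, v_th=v_th)       # v_dd - v_th
v_config = min_boosted_gate_voltage(v_dd, v_th)                 # boosted gate drive
restored = pass_gate_output(v_dd, v_gate=v_config, v_th=v_th)   # full v_dd
```

Under this simplified model, the boost needed above Vdd is one Vth, which is why lowering the logic supply leaves room to over-drive the pass-gates without exceeding the nominal voltage.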
An alternative to gate boosting is to add level restorers to recover a full Vdd,
typically an inverter with a pull-up PMOS in feedback. However, newer technologies
reduce Vdd − Vth, making these level restorers slower and less reliable: If the voltage
coming out of the pass-transistor logic is too low, it may not pass the inverter trip
point and would not get restored; instead it would be interpreted as a logic 0: a
circuit failure. Chiasson explored whether FPGAs should switch from pass-gates to
transmission gates [28] and found a 25% delay advantage and a 3.8% power penalty for
doing so at 22 nm. We get similar results in Sec. 3.4.2. The most important advantage
we get out of transmission gates or gate boosting is the ability to scale voltage more
aggressively. Without them, circuits fail early, and we cannot achieve the benefits of
running at the minimum-energy point. These techniques are thus especially useful
in combination with power gating, which pushes the minimum energy point to lower
voltages.
2.2.3 Power gating
In order to reduce leakage energy, FPGA configuration bits are usually implemented
with high-Vth transistors [42, 77]. The drawback of high-Vth is higher dynamic energy
and delay, but neither of those is a concern for configuration bits since they do not
switch during operation. Most of the leakage energy therefore lies in the logic and
interconnect of the FPGA, both of which are over-provisioned to allow for more
applications to be mapped, and make room for easier placement and routing. This
suggests that we could power-gate the resources that end up being unused after place-
and-route, thereby avoiding their leakage energy consumption. Power gating reduces
leakage energy, but increases both area and delay due to the addition of power switches
that cut off the voltage supplies. These can be implemented at different granularities
(switch, LUT, cluster). Coarser granularities yield lower savings, but help amortize
the cost of power switches [21, 42, 96]. In our work, we focus on power gating at the
level of routing switches and logic clusters. That is, each routing segment is either
power-gated or not, and each logic cluster is either power-gated or not.
2.2.4 Multiple supply voltages
2.2.4.1 Lower voltage off the critical path
ASICs were first to use two Vdd levels to reduce power consumption [114]. Li first
explored dual-Vdd for FPGAs [78] using an architecture with pre-defined high- and
low-Vdd clusters. However, the focus was on dynamic energy only (not leakage), and
it did not include routing, the primary source of energy consumption on FPGAs
[113]. [78] found that having a fixed architecture with pre-defined high- and low-Vdd
regions limited the tools’ ability to map designs to it, and thus limited the energy
savings achieved; the conclusion was that a Vdd programmable architecture would
save more energy; that is, an architecture where the Vdd level of a resource can be
configured post-fabrication. Gayasen proposed a programmable dual-Vdd architecture
[41] and Vdd assignment algorithms. Li also extended his work to programmable
architectures, including routing, and confirmed that they perform better than pre-
defined architectures [76]. Our work uses a programmable architecture and algorithms
similar to the ones suggested in [41] and [76], but also supports more modern FPGA
features such as multipliers, memories, and direct-drive interconnect. We also explore
more modern technologies (down to 7 nm, in addition to the 65 nm used by [41] and
90 nm, which is close to the 100 nm used by [76]).
Our work reproduces the results from previous work at older technologies, and con-
firms that dual-Vdd still provides savings with more modern architectures. However,
a large portion of the savings in dual-Vdd comes from the power gating capabilities
added by the Vdd programmability. When we compare fine-grained, programmable
dual-Vdd architectures against single-Vdd architectures with power gating added, we
find that dual-Vdd is inferior (Sec. 3.6). This illustrates why it is important for us to
explore a larger space of energy optimizations.
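The idea of lowering the voltage off the critical path can be sketched as a greedy assignment. The Python below is only an illustration under simplifying assumptions and is not the algorithm of [41] or [76]: each resource has a fixed delay and energy at high and low Vdd, paths are given explicitly, and level-converter costs are ignored. All names and numbers are hypothetical.

```python
def assign_low_vdd(resources, paths, target_delay):
    """Greedily move resources to low Vdd while every path meets target_delay.

    resources: {name: (delay_high, delay_low, energy_high, energy_low)}
    paths:     list of lists of resource names
    Returns the set of resource names assigned to low Vdd.
    """
    low = set()

    def path_delay(path):
        return sum(resources[r][1] if r in low else resources[r][0] for r in path)

    # Try the resources with the largest energy savings first.
    order = sorted(resources, key=lambda r: resources[r][2] - resources[r][3],
                   reverse=True)
    for r in order:
        low.add(r)
        if any(path_delay(p) > target_delay for p in paths):
            low.remove(r)  # would violate timing: keep this resource at high Vdd
    return low


# Hypothetical three-resource design: path a->b is critical at high Vdd.
resources = {
    "a": (1.0, 1.5, 4.0, 2.0),
    "b": (1.0, 1.5, 3.0, 1.5),
    "c": (1.0, 1.5, 2.0, 1.0),
}
paths = [["a", "b"], ["c"]]
tight = assign_low_vdd(resources, paths, target_delay=2.0)    # only "c" has slack
relaxed = assign_low_vdd(resources, paths, target_delay=2.5)  # "a" can also drop
```

With no slack on the critical path, only the off-critical resource moves to low Vdd; relaxing the delay target frees more resources, which mirrors the delay goals used in Chapter 3.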
2.2.4.2 Differential reliability
Given the choice between two voltage levels, we can select between two reliability
levels for a given component. Therefore, another use for multiple supply voltages is
the implementation of differential reliability [59], which says that we may not need
the same level of reliability from all components of a design. For instance, we could
use more reliable components to oversee the computation of less reliable ones. That
is, we can run the LWC at high voltage, high energy, high reliability, and since the
LWC is lightweight, this only consumes a fraction of the total energy. Indeed, the
total energy is dominated by the CMP, which we run at low voltage, low energy, low
reliability. Error-correction on memories and checksums on packet data transmission
are familiar and commonly used forms of differential reliability where a reliable com-
putation guards less reliable operations. In the memory case, the peripheral circuitry
for error detection and correction is typically of a larger, more reliable feature size
than the memory core. Also, the way that FPGAs keep the voltage on configuration
bits high is a form of differential reliability. As we will see in Chapter 6, differential
reliability does not constitute an improvement over a basic design where CMP and
LWC run at the same, reduced voltage.
2.3 Previous work on lightweight checking
The idea that a computation’s result can be checked inexpensively has been used in
different contexts. We review the previous work in Chapter 5 instead of here in order
to better contrast it with our work. The space of applications that are amenable to
lightweight checking is large, and many examples have been given in previous work.
In Chapter 5, we characterize the space and design a classification system.
The core idea in Chapter 6 is to reduce voltage to save energy, and compensate
for the loss in reliability with lightweight checks. The idea of using a form of low-cost
result checking to save energy has been used by Shin and Shanbhag [106] in digital
signal processing (DSP) for applications that can tolerate some level of Signal to
Noise Ratio (SNR). The Razor latch [9, 19] also saves energy by reducing voltage, and
addresses the resulting increase in error rate due to timing violations inexpensively
by detecting late changes in signal values, but it only catches violations in a narrow
timing window. Razor can be a complementary technique to our work to tune the
frequency of operation, but it does not address the broader classes of single-event
upsets (SEUs).
We published our first work on LWCs to save energy in [59]. Chapter 6 significantly
improves on the work in [59] by using the more advanced tool flow shown in Fig. 2.2
with Versapower-based energy and area results compared to the more crude energy
model from [59]. Furthermore, Chapter 6 covers the delay impact of LWCs, as well as
their impact on leakage energy, whereas [59] only reported dynamic energy, ignoring
the total minimum-energy point dictated by leakage.
Table 2.1: Taxonomy of small-feature-size reliability problems
                       Detect
Challenge              Immediate?   How?               Response           Sec.    This Work
Logic & Latch          Y            Concurrent Check   Rollback, Retry    2.4.1   focus
Configuration Upset    N            Checksum           Reload             2.4.3
                       Y            Concurrent Check   Reload, Rollback   2.4.3   detect
Aging                  N            Offline Test       Remap to Avoid     2.4.5
                       Y            Concurrent Check   Remap, Rollback    2.4.5   detect
Manufacture            N/A          Offline Test       Map to Avoid       2.4.4
2.4 Previous work on reliability
To clarify the goals of our work on reliability and put it into context, this section and
Tab. 2.1 review common, small-feature-size reliability problems. We also review prior
work on fault-tolerance in FPGAs.
2.4.1 Upset phenomena and voltage scaling
This dissertation primarily addresses single-event logic and latch upsets (SEUs),
which become increasingly important with scaled technology [26]. These may be
caused by ionizing particles that disrupt the voltage on nodes causing wrong values
in latches. SEUs may result directly from upsetting the stored state in a latch or
from upsetting logic that is then sampled into a latch at a clock edge. They may also
be caused by thermal fluctuations or shot noise [63].
Memories are known to be more prone to SEUs than logic, and they have been
protected with error-correcting codes (ECC) for many years. Logic has been less of
a problem because of several forms of masking [107]: logical, electrical, and latching-
window masking (more in Sec. 6.2.4). However, over the last few years SEUs in logic
have become more important, and were estimated to be as frequent as memory upsets
in [107].
2.4.2 Prior work on soft-error upsets and timing failures
Space and avionics applications have long had to deal with higher upset rates than
ground-based systems, spawning a host of prior work on SEU tolerance in FPGAs.
Our LWCs are an optimization over DMR [55] since our checkers are small compared
to the base computation. TMR has been the typical mitigation mechanism on FPGAs
[24, 97]. However, this comes at a high energy overhead (>200%). When the appli-
cation can accept errors in the output, previous work shows that this can be reduced
by applying TMR selectively [94]. In contrast, our solution catches errors as they
occur, before they corrupt the output, and is significantly more lightweight than
TMR. Unlike TMR, our detection and correction scheme does impact the through-
put of results and may not be suitable when there is no timing slack available for
recomputation. Still, as we will see from the low rate of recomputation (Sec. 6.3), the
impact on aggregate throughput due to recomputation is small.
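The energy argument above can be illustrated with a back-of-the-envelope model. This Python sketch rests on stated assumptions: voter and comparator costs are ignored, the check costs a fixed fraction of the base computation, and each attempt is retried independently with a fixed probability. The function names and example numbers are hypothetical and are not results from this work.

```python
def relative_energy_tmr():
    """Three copies of the computation (voter cost ignored)."""
    return 3.0


def relative_energy_dmr():
    """Two copies of the computation (comparator cost ignored)."""
    return 2.0


def relative_energy_lwc(check_fraction, retry_probability):
    """Base computation plus a lightweight check, repeated until the check passes.

    With independent attempts, the expected number of attempts is
    1 / (1 - retry_probability).
    """
    expected_attempts = 1.0 / (1.0 - retry_probability)
    return (1.0 + check_fraction) * expected_attempts


# Hypothetical example: the check costs 10% of the computation, 1% of attempts retry.
lwc = relative_energy_lwc(check_fraction=0.10, retry_probability=0.01)
```

As long as the check is small and retries are rare, the guarded computation stays close to 1× the base energy, well below the 2× and 3× of duplication and triplication.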
2.4.3 Configuration upsets
As we reduce voltage, different parts of the chip get affected to different extents, and
we need to make sure that we identify all weak points and have a way to protect
them. FPGAs are particularly sensitive to transient events that upset configuration
bits. This can result in a persistent change in logic behavior until the configuration
is repaired. This is not the primary concern of this work for two reasons:
1. FPGA vendors already provide checksums and scrubbing logic to detect when
configurations need to be reloaded [25].
2. There is no need to aggressively scale down configuration voltages since they
do not switch dynamically during operation and do not contribute to dynamic
energy consumption. This is gate boosting, similar to what we used for
pass-gates (Sec. 2.2.2). Also, since they do not switch, they can be implemented
with high-Vth technology to reduce their leakage.
The output can be corrupted for a large number of cycles before errors are detected.
A common recovery strategy is checkpoint and rollback [6].
Our LWCs can catch configuration upsets as they occur. In contrast, checksum
and scrubbing schemes take millions of cycles to detect upsets, resulting in a large
number of erroneous outputs. Our LWC scheme validates every output and detects
errors immediately. If the error is an SEU (Sec. 2.4.1), the error will not persist and
a retry will most likely not see it again. If the error is a configuration upset or a
lifetime aging failure (Sec. 2.4.5), the retry will fail as well, indicating the need to
reload the configuration or repair the logic [85, 66, 100].
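The escalation described above (retry, then reload, then repair) can be sketched as follows. This is a minimal illustration, not the implementation used in this work: `compute`, `check`, and `reload_configuration` are hypothetical callables, and the example check only verifies ordering, so it is a partial check used purely for demonstration.

```python
def run_with_check(compute, check, reload_configuration, inputs):
    """Return (result, diagnosis) using a retry-then-reload escalation."""
    result = compute(inputs)
    if check(inputs, result):
        return result, "ok"
    result = compute(inputs)        # retry: a transient upset should not repeat
    if check(inputs, result):
        return result, "transient upset"
    reload_configuration()          # error persisted: restore the configuration
    result = compute(inputs)
    if check(inputs, result):
        return result, "configuration upset"
    return None, "persistent fault: repair needed"


# Hypothetical example: a sort guarded by a linear-time ordering check.
def is_sorted_check(inputs, result):
    return len(result) == len(inputs) and all(
        x <= y for x, y in zip(result, result[1:]))


faults = [True]  # inject a single transient fault on the first attempt


def flaky_sort(inputs):
    out = sorted(inputs)
    if faults and faults.pop():
        out[0], out[-1] = out[-1], out[0]  # simulated upset
    return out


result, diagnosis = run_with_check(flaky_sort, is_sorted_check, lambda: None, [3, 1, 2])
```

The diagnosis string distinguishes the three cases in Tab. 2.1: an error that disappears on retry, one that disappears after a reload, and one that survives both.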
2.4.4 Process variation
High Vth variation in small feature size transistors could also prevent aggressive scaling
of component operating voltages. That is, if our voltage scaling were limited by the
worst-case Vth on a multi-billion transistor 22 nm, or smaller, device, we would have
limited room to reduce the voltage. This is not the primary concern of this work since
prior work [85] shows that it should be possible to avoid high variation transistors in
FPGAs and operate down to 150 mV—much lower than the ITRS suggested operating
point of 800mV at 22 nm. Significantly, these variation-avoidance techniques allow
us to operate at the well-defined minimum energy point (Section 1.2, Fig. 1.1b).
Nonetheless, [85] does not address the impact of low-voltage operation on transient
faults, which is the primary concern of our work.
This post-fabrication, component-specific mapping allows FPGAs to operate at
lower voltages than ASICs. It is a key reason why FPGAs may have a greater need to
tolerate low-voltage operation of small-feature-size devices than ASICs, which are
stuck operating at higher voltages where transient upsets are less of an issue. By
combining the upset tolerance enabled by this work and the variation tolerance in
[85], we could close some of the traditional energy gap between FPGAs and ASICs
[65].
Note that variation impacts Vth, which in turn affects the SEU rate of a given node,
resulting in non-uniform upset rates throughout the design. A higher Vth relative to
the same nominal voltage means a lower charge needed to upset the node, or an
increased SEU rate.
2.4.5 Aging
Small feature size devices are also susceptible to aging faults [110] that can result in
permanent rather than transient circuit errors. Many of the major aging effects have
voltage dependence, such that low voltage operation reduces the rate of aging. As
noted above (Sec. 2.4.3), the LWCs we describe can also immediately detect those
aging faults. If retry-and-configuration-reload does not resolve the error, this is an
indication that an aging error has occurred. This can serve as a trigger for repair
mechanisms such as [66, 100]. Note that aging decreases the charge needed to upset
a node, so it also increases the SEU rate, similar to Vth variation.
Chapter 3
Common Low-Energy
Optimization Techniques
3.1 Introduction
As mentioned in Sec. 1.4, our main goal is to explore the energy benefits of reducing
voltage, but we need to make sure that we assess these benefits within a larger design
space that includes other low-energy techniques. Otherwise our comparisons may not
be fair: for example, an HP process will benefit from voltage scaling more than an
LP process, but still would not achieve lower energy. In this chapter, we evaluate the
impact of five different low-energy techniques that have been proposed in the past,
and apply Vdd scaling on top of them:
1. Low leakage (high-Vth) process technology
2. Gate boosting
3. Transmission gates
4. Power-gating unused resources
5. Dual-Vdd architectures
The relative benefits of the above techniques change as we scale process technology,
particularly in the decananometer range (90-7 nm) with the end of Dennard scaling
and switch to FinFET technology.
As we apply these different techniques, we observe regions of the design space with
a trade-off between energy gains and delay increase. We identify two distinct goals
that a designer may pursue:
1. Getting the lowest energy possible at a given performance target (delay goal).
2. Getting the overall energy-minimum point, even if it increases delay (min-energy
goal).
Clarifying the designer’s goal allows us to properly assess which combination of low-
energy techniques is appropriate.
Previous work (Sec. 2.2) typically explored only one or two of the above tech-
niques, and focused only on one of the two goals stated above. In our work, we
construct a unified framework to explore them together. We first examine the basic
energy and delay impact of scaling Vdd in Sec. 3.3. Then, as we scale technology,
we find the need to use gate boosting, or switch to full transmission gates instead of
pass-gate logic (Sec. 3.4). Sec. 3.5 shows the added benefits of power-gating unused
resources, particularly useful as we scale Vdd. Sec. 3.6 revisits the impact of dual-Vdd
architectures. In addition to technology, FPGAs have undergone many architectural
changes, including the addition of multipliers, memories and uni-directional intercon-
nect, since the last published studies of dual-Vdd FPGA architectures. We find that
dual-Vdd architectures are still able to reduce energy with only a small delay penalty
(delay goal). However, they do not perform better than a single-Vdd architecture with
power gating. Sec. 3.7 discusses our results.
3.2 Architecture and benchmarks
We use the basic Island-Style FPGA architecture described in Sec. 2.1 (k = 4, n = 10,
length 1 segments, a column of multipliers every 20 columns). In this chapter we do
not explore memory architecture, but simply choose the robust memory architecture
that we will find in Chapter 4, where we place a column of 256×32 memories every
9 columns. This is close to that of the Cyclone V, a 28 nm commercial low-power
FPGA. Chapter 4 also provides more details on how we model memories.
In this chapter we use the Toronto20 benchmark set, used by previous dual-Vdd
studies, as well as the VTR Verilog benchmarks. We report the geomean over all
these benchmarks.
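For reference, the geomean of a set of normalized ratios can be computed as in the short Python sketch below. The function name and the example ratios are hypothetical.

```python
import math


def geomean(values):
    """Geometric mean of positive ratios, computed in the log domain."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical normalized energies for three benchmarks.
ratios = [0.5, 2.0, 1.0]
summary = geomean(ratios)
```

Unlike the arithmetic mean, the geomean treats a 2× increase and a 2× decrease symmetrically, which suits results normalized to a baseline.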
3.3 Simple voltage scaling
Our baseline architecture for this chapter is a standard, single-Vdd FPGA using pass-
transistor switches. We use its energy and delay results at nominal Vdd to normalize
other results throughout the chapter. We expect newer technologies to be faster and
consume less energy, so in the rest of the work, we normalize each process to its own
baseline. Starting at 45 nm, PTM models provide a choice between high-performance
(HP) and low-power (LP) or low-standby power (LSTP) processes. HP uses a lower
Vth than LP/LSTP, resulting in faster switching, but higher leakage power.
Without doing any circuit work, we can save energy simply by scaling the single
supply voltage of the baseline architecture (see Fig. 3.1a). As expected [47], the total
energy curves (Etotal = Edyn + Elkg, Eq. 1.3) reach an energy-minimizing point, at
Vdd = 0.7 V for HP and Vdd = 0.75 V for LP. Below that point, leakage (Elkg) dominates, due
to the exponential increase in delay in subthreshold (Vdd < Vth); the delay increase is
shown in Fig. 3.1b. Vth =667 mV for the 22 nm LP process and Vth =465 mV for the
22 nm HP process (Fig. 3.5).
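The existence of an energy-minimizing Vdd can be reproduced with a toy model. The Python sketch below is not Eq. 1.3 and does not use PTM data; it assumes a smooth, normalized drive-current expression (exponential below threshold, roughly quadratic above), a C·V/I delay, and a single weight that lumps idle leaking devices and logic depth. All constants are illustrative assumptions.

```python
import math

N_VT = 0.039        # assumed n * kT/q in volts (illustrative)
VTH = 0.45          # assumed threshold voltage in volts (illustrative)
LEAK_WEIGHT = 1e5   # assumed weight of idle, leaking devices per switching event


def drive_current(vgs):
    """Normalized on-current: exponential below VTH, roughly quadratic above."""
    return math.log1p(math.exp((vgs - VTH) / (2 * N_VT))) ** 2


def energy_terms(vdd):
    """Return (e_dyn, e_lkg, e_total) in normalized units."""
    delay = vdd / drive_current(vdd)                         # ~ C * V / I
    e_dyn = vdd ** 2                                         # ~ C * V^2
    e_lkg = LEAK_WEIGHT * drive_current(0.0) * vdd * delay   # I_off * V * time
    return e_dyn, e_lkg, e_dyn + e_lkg


def min_energy_vdd(vdd_values):
    """Supply voltage in the sweep with the lowest total energy."""
    return min(vdd_values, key=lambda v: energy_terms(v)[2])


sweep = [round(0.30 + 0.01 * i, 2) for i in range(71)]  # 0.30 V to 1.00 V
best = min_energy_vdd(sweep)
```

In this toy model the dynamic term falls quadratically with Vdd while the leakage term grows with the delay, so the total has an interior minimum; the location of that minimum depends entirely on the assumed constants.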
Fig. 3.1a is normalized to the energy value at nominal Vdd for LP. Fig. 3.1b is
normalized to the delay value at nominal Vdd for HP. We can see that for the delay
goal, we would choose the HP process since it is faster. This is what previous work has
typically done. However, for the min-energy goal, we can trade off delay for energy
savings, and the LP process is better, since the energy-minimum point with HP has
higher energy than even the baseline point with LP. For LP, simply scaling Vdd yields
22% energy savings. Choosing LP instead of HP reduces the minimum energy by
68%.
[Figure 3.1 panels: (a) Energy overhead: energy ratio versus Vdd (V), with total,
dynamic, and leakage curves for LP and HP and the baseline and minimum points marked;
(b) Delay overhead: delay ratio (logarithmic axis) versus Vdd (V) for HP and LP]
Figure 3.1: Effect of voltage scaling on energy and delay for the baseline architecture
(ignoring pass-gate failures)
In the rest of the chapter, we will try different techniques to reduce energy. The
results are compiled in Figures A.1 through A.5, with curves similar to the ones
in Fig. 3.1, divided into HP vs LP and energy vs delay, across different technology
points. We use 10 different PTM technology files, divided into two sets. The HP
set: 90nmbulk, 65nmbulk, 45nmHP, 22nmHP, 14nmHP, 7nmHP. The LP/LSTP
set: 90nmbulk (repeated), 65nmbulk (repeated), 45nmLP, 22nmLP, 14nmLSTP,
7nmLSTP. If we do nothing but scale Vdd, we can reduce energy by 7.5-29% for the
LP/LSTP set, 8.8-29% for the HP set.
3.4 Transmission gates or gate boosting
3.4.1 Pass-transistor and level restorer failure
[Figure 3.2 panels: (a) Pass-gates and level restorer: the pass-gate output reaches
Vdd-Vth and is pulled up to Vdd (transistors Mfb, Mn, Mp); (b) Pass-gates and gate
boosting: gate driven at Vdd+Vboost; (c) Transmission gates: full Vdd swing]
Figure 3.2: Pass-gate versus transmission gate 2-mux
Fig. 3.1 is useful to illustrate our approach, but in reality, simply scaling the voltage of
the baseline as in Sec. 3.3 eventually causes a failure in the level restorers used to bring
the output of pass-gates back to Vdd. This failure usually happens before we reach
the energy-minimum point. NMOS pass-transistors are used in FPGAs to implement
multiplexers used for LUTs and routing switches. They only output up to Vdd − Vth
and are followed by level restorers consisting of an inverter with a PMOS transistor
in feedback (Fig. 3.2a). To first order, if Vdd − Vth < Vdd/2, the inverter of the level
restorer will not switch and the output of the multiplexer will be interpreted as a logic
0: a circuit failure. To counter this, we can increase the size of the inverter’s NMOS
and the feedback PMOS (Mn and Mfb, respectively), which will allow for lower voltage
levels to be pulled up, at the expense of higher delay and noise susceptibility (including
SEUs). This pushes the breaking point further, but the circuit eventually still breaks.
We set Mfb and Mn to 2 minimum-width transistors, the same as in Versapower.
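The first-order failure condition stated above (Vdd − Vth < Vdd/2, equivalently Vdd < 2·Vth) can be written as a small Python check. This is a sketch under that first-order assumption only: it ignores transistor sizing, which the text notes shifts the real breaking point, and the voltages used are hypothetical.

```python
def level_restorer_fails(v_dd, v_th):
    """First-order test: restorer input (v_dd - v_th) is below the trip point v_dd / 2."""
    return (v_dd - v_th) < v_dd / 2


def lowest_working_vdd(vdd_sweep, v_th):
    """Lowest supply in the sweep at which the restorer still switches, or None."""
    working = [v for v in vdd_sweep if not level_restorer_fails(v, v_th)]
    return min(working) if working else None


# Hypothetical threshold and sweep, for illustration only.
sweep = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
floor = lowest_working_vdd(sweep, v_th=0.33)
```

Under this model, the voltage floor of a pass-gate design is set by Vth rather than by the energy-minimum point, which is why the "Base" curves stop early.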
Transmission gates are an alternative to pass-transistor logic; they output a full
Vdd swing (Fig. 3.2c). This avoids early failure and allows us to actually scale the
voltage down to the minimum-energy point. Compared to the pass-gate multiplexers
and level restorers, transmission-gate multiplexers are faster, but they consume more
area.
The other alternative is “gate boosting” (Fig. 3.2b): we can operate the pass
transistor gates at voltages higher than nominal, such that the output does not
drop below the full nominal Vdd swing. Since the value of the gate comes out of a
configuration SRAM bit and does not switch during operation, it does not increase
dynamic energy.
3.4.2 Transmission gate, gate boosting results
Figures A.1a through A.5a show results similar to Fig. 3.1a, but they do not continue
past the point of failure for pass-transistor designs (“Base”). Transmission gates
(or gate boosting) do not have much of an effect on older technologies (90 nm and
65 nm): we can reach the minimum-energy point even without them. Starting at
45 nm LP (Fig. A.2a), the “Base” curve stops before we reach minimum-energy, and
transmission gates (or gate boosting) are needed to reduce Vdd further and reach
minimum-energy. For example, at 45 nm LP, even though “Trans Gate” has 23%
higher energy at nominal Vdd, the ability to reduce Vdd further allows it to eventually
reach a lower energy point than “Base”. By the time we recover the “Trans Gate”
overhead, the energy savings compared to the energy-minimum of “Base” are only
9%. However, transmission gates are especially useful in combination with power
gating (more in Sec. 3.5), which further extends the energy-minimum point to lower
Vdd levels. Without transmission gates, those lower Vdd levels could not be reached.
Alternatively, we can also use gate boosting to reach the lower energy levels. At
45 nm LP the overhead of gate boosting is lower than the overhead of transmission
gates, and we can reduce the energy at the minimum point by 25% compared to 9%
previously.
For the delay goal, we look at Fig. A.1a and Fig. A.1b together (or Fig. A.2a and
Fig. A.2b, Fig. A.3a and Fig. A.3b, Fig. A.4a and Fig. A.4b, Fig. A.5a and Fig. A.5b).
We can set delay goals that are more or less aggressive to change the headroom we
get to reduce energy. We set six such goals: “Base CP ×1.0”, which does not allow
the critical path to become slower than the baseline; “Base CP ×1.25”, which
allows the critical path to increase by 1.25×; and “Dual Vdd TG CP ×1.0”,
“Dual Vdd TG CP ×1.25”, “Dual Vdd GB CP ×1.0”, and “Dual Vdd GB CP ×1.25”,
which will be explained in Sec. 3.6. We now look for the minimum-energy point in
Fig. A.1a that does not make the delay worse than the delay target we set (seen in
Fig. A.1b). To explain the delay goal energy optimizations, the 90 nm bulk energy
and delay graphs are repeated in Fig. 3.3a with extra annotations. Looking at the
“Trans Gate” curve, we start at nominal Vdd (1), and we reduce Vdd until we hit the
target delay (2). This gives us the lowest voltage that still meets the delay target
(Vdd = 0.95 V). Then, we look at the energy savings at that voltage in the energy
figure (3). Therefore, because we had a delay target, we had to stop at (3) instead of
reducing Vdd further down to (4), which would have been the energy-minimum. We
repeat this process for each curve and show the energy-minimum points in Fig. 3.4a
and Fig. 3.4b. Fig. 3.4 digests the key trends in Figures A.1 through A.5 by plotting
the minimum-energy points against technology. In addition to the six delay goals
mentioned above, Fig. 3.4 shows the absolute minimum-energy that can be achieved
if we do not care about delay (“No Delay Target”).
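The selection procedure for the two goals can be sketched in Python as follows. The sweep data are hypothetical normalized values chosen only to illustrate the steps; they are not measurements from Figures A.1 through A.5.

```python
def min_energy_point(sweep):
    """Overall energy minimum (min-energy goal). sweep: list of (vdd, energy, delay)."""
    return min(sweep, key=lambda p: p[1])


def min_energy_under_delay(sweep, target_delay):
    """Lowest-energy point whose delay does not exceed target_delay (delay goal)."""
    feasible = [p for p in sweep if p[2] <= target_delay]
    return min(feasible, key=lambda p: p[1]) if feasible else None


# Hypothetical sweep: (Vdd in V, energy ratio, delay ratio), normalized to a baseline.
sweep = [
    (1.00, 0.80, 0.80),
    (0.95, 0.72, 1.00),
    (0.80, 0.60, 1.60),
    (0.60, 0.55, 4.00),
    (0.50, 0.70, 12.0),
]
delay_goal = min_energy_under_delay(sweep, target_delay=1.0)  # stops at the target delay
energy_goal = min_energy_point(sweep)                         # ignores delay entirely
```

The delay goal stops at the lowest voltage that still meets the target, corresponding to point (3), while the min-energy goal continues down to the overall minimum, corresponding to point (4).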
Our experiments show 22% reduction in geomean delay at nominal Vdd for using
transmission gates at 22 nm (Fig. A.3b). This comes with no change in geomean
energy (Fig. A.3a). This is close to what previous work found at 22 nm when switching
to transmission gates: a 25% reduction in delay for a 4% increase in power [28]. Over
all HP technologies, we find 16-25% reduction in delay at nominal Vdd, for 0.84−1.1×
geomean energy. Then, we can trade some of that delay gain for energy, as long as
we do not increase delay beyond the targets we set. Fig. 3.4 shows that we end
up with 0.69 − 0.97× geomean energy (Base CP ×1.0), or 0.69 − 0.87× geomean
energy (Base CP ×1.25). Compared to “Base” with optimum Vdd scaling (i.e. taking
the ratio “Trans Gate”/“Base” instead of just looking at “Trans Gate”, which is
normalized to “Base” at nominal Vdd), this is still 0.69 − 0.97× geomean energy for
Base CP ×1.0 (because in this case Vdd cannot be reduced without increasing delay
for “Base”, leading to the flat line at 1.0), but only 0.86 − 1.03× geomean energy for
Base CP ×1.25. Therefore, transmission gates alone do not help much with the delay
goal, but as mentioned above, they will contribute to larger savings in combination
with other techniques in the next sections.
Once again, the results above are very similar for gate boosting instead of
transmission gates, except transmission gates do slightly better for newer HP technologies
(7 nm and 14 nm), while gate boosting does slightly better for older ones (22 nm,
45 nm, 65 nm, 90 nm). For LP/LSTP technologies gate boosting is always slightly
better. The differences are not significant enough to draw reliable conclusions.
[Figure 3.3 panels: (a) Lowest energy given “Base CP ×1.0” delay target for “Trans
Gate”: start at (1), nominal Vdd; (2) gives Vdd at target delay; (3) gives energy at
target delay; (3) is larger than (4), the min-energy. (b) (Sec. 3.6.5) Lowest energy
given “Dual CP ×1.0” delay target for “Trans Gate + Power Gate”: same steps; (3) is
larger than (4), the min-energy, but (3) is lower than (5). Legend: Base, Trans Gate,
Gate Boost, Power Gate, Trans Gate + Power Gate, Gate Boost + Power Gate,
Dual-Vdd TG CP x1.0, Dual-Vdd TG CP x1.25, Dual-Vdd GB CP x1.0, Dual-Vdd GB CP x1.25]
Figure 3.3: Energy achieved under delay constraints
As expected, we found the geomean tile area (logic cluster plus interconnect) for
transmission gates to be larger than the tile area of pass gates with gate boosting, with
a slow and steady increase in the difference with scaled technology, from 25% larger
at 90 nm, to 31% larger at 7 nm. This contributes to larger power dissipated in the
interconnect (17% larger at 90 nm, 26% larger at 7 nm). However, transmission gates
still result in a lower critical path than pass gates with gate boosting, about 7.6%
lower at 90 nm and 14% lower at 7 nm, slowly increasing in between. The reduction in
critical path is due to faster logic; it offsets the higher power and results in a similar
energy consumption between the two cases, as reported above. Transmission gates
are faster because of the increased voltage swing and lower resistance due to the two
switch transistors in parallel (Fig. 3.2c).
[Figure 3.4 panels: (a) LP/LSTP processes; (b) HP processes. Each panel plots energy
ratio against technology (7, 14, 22, 45, 65, 90 nm) for the delay targets No Delay
Target, Base CP x1.0, Base CP x1.25, Dual-Vdd TG CP x1.0, Dual-Vdd TG CP x1.25,
Dual-Vdd GB CP x1.0, and Dual-Vdd GB CP x1.25. Legend: Base, Trans Gate, Gate Boost,
Power Gate, Trans Gate + Power Gate, Gate Boost + Power Gate, Dual-Vdd TG CP x1.0,
Dual-Vdd TG CP x1.25, Dual-Vdd GB CP x1.0, Dual-Vdd GB CP x1.25]
Figure 3.4: Minimum-energy trends versus technology
3.4.3 Technology trends
From Fig. 3.4 we saw that the early failure problem gets worse with newer tech-
nologies, emphasizing the importance of switching to transmission gate designs, or
using gate boosting. This is happening because as we scale technology, Vth does not
follow the reduction in Vdd (Fig. 3.5a). Vth sometimes even increases with newer
technologies (end of Dennard scaling, Sec. 2.2.1). A smaller Vdd − Vth means that we have
less headroom to reduce voltage before the circuit breaks. In fact, 7nmLSTP does
not even work with pass gates and without gate boosting at its nominal Vdd of 0.7V
(Fig. A.5a).
Even with transmission gates (or gate boosting), the reduction in Vdd − Vth also
means that we have less headroom to reduce Vdd before delay starts increasing ex-
ponentially, i.e. the energy-minimum point happens earlier with newer technologies,
and the energy savings are reduced. This can be seen from the increasing level of the
minimum point of the “Trans Gate” (or “Gate Boost”) curve in Figures A.1a through
A.5a as we scale technology. Fig. 3.4a “No Delay Target” shows this more directly
by plotting the minimum energy point versus technology feature size.
Our technology files cover three “ranges” of CMOS PTM technology: bulk CMOS
(bulk), CMOS with high-κ dielectrics, metal-gate, stress effects (mgks), and FinFETs
(finfet). Overall, in the decananometer range, Vdd is reduced with newer technologies,
and Vth is increased, until we reach FinFET technology, at which point Vth starts
decreasing again (Fig. 3.5a). Fig. 3.5b shows three sets of curves, for the three
technology ranges, that differ by the voltage at which we reach sub-threshold. We
can see that LP processes offer very little headroom before sub-threshold is reached
and lead to the lowest energy savings when scaling Vdd. HP processes have more
headroom and offer more savings, but still require more energy at the minimum
energy point than the LP processes as we saw in Fig. 3.1a. FinFETs shift the LSTP
curves to have the same delay as the non-FinFET HP processes, and the FinFET HP
processes go down even lower, suggesting that FinFETs will help reduce energy.
3.5 Power gating
As shown in Fig. 3.1a, the minimum-energy point is caused by an exponentially-
increasing leakage energy that dominates at lower voltages. If we could reduce the
leakage energy, we could push back the point where it starts dominating and hence
improve the minimum-energy point.
Power gating is an effective method to reduce leakage energy on FPGAs, since
most of the energy goes into the programmable interconnect, most of which ends
up unused, since it is significantly over-provisioned to allow for better routability
[32, 83]. We show this in Fig. 3.6 for the apex2.blif benchmark. Power gating
is able to eliminate the cost of unused resources, which is significant, especially in
leakage, and allows us to push the minimum-energy point further.
3.5.1 Power switches for power gating
Fig. 3.7a shows the scheme we use for power gating. We use a power switch between
the voltage supply Vdd and the circuit, consisting of a PMOS transistor, with gate
voltage set by an SRAM bit, part of the FPGA configuration. The SRAM bit does not
switch during operation and is off the critical path; it can therefore be implemented
with a high-Vth process to drastically reduce its leakage, while incurring no delay
or dynamic energy penalty [42, 77]. The PMOS transistor does not switch either
during operation, so it has no dynamic energy penalty. However, it adds resistance
to the path between the voltage supply and the circuit, thereby increasing delay.
Furthermore, both the SRAM bit and the PMOS increase area. In fact, there is a
trade-off between the W/L sizing of the PMOS (Msw) and the delay penalty of power
gating: making Msw larger increases area but reduces the delay overhead. This area
increase also means that the FPGA tiles are larger, causing longer wires, longer delays,
and more energy consumption. We find Msw = 4×Mp to be a good trade-off point,
with results detailed next.
Figure 3.5: Comparison of PTM technologies. (a) Vdd and Vth: voltage (V) versus
technology (nm), showing Vdd, Vth, and Vfail for HP/bulk and LP/LSTP processes across
the finfet, mgks, and bulk ranges. (b) Single inverter delay: inverter delay (s) versus
Vdd (V) for the 90 nm through 7 nm processes.
Figure 3.6: Power gating allows us to avoid the cost of unused resources (apex2.blif
shown). (a) Before power gating. (b) After power gating. Both panels plot energy ratio
versus Vdd (V) for 22 nm HP, split into leakage and dynamic energy for used and unused
logic and routing, plus clock energy.
Figure 3.7: Power switches for power gating and dual-Vdd. (a) Single-Vdd power gating.
(b) Dual-Vdd voltage selection.
3.5.2 Power gating results
Fig. 3.8 shows the effect of power gating for 22 nm HP. Unlike Fig. 3.1a, this one
is normalized to the 22 nm HP energy at nominal Vdd with transmission gates. We
can see that power gating reduces the leakage component significantly, leading to a
reduction in total energy and a shift in the minimum-energy point: from Vdd = 0.7 V
and a ratio of 0.872 down to Vdd = 0.5 V and a ratio of 0.296, a ratio difference of
0.58, a 66% energy reduction.
Figure 3.8: Power gating reduces leakage energy. Energy ratio versus Vdd (V) for
22 nm HP, showing total, dynamic, and leakage energy for “Trans Gate” and
“Trans Gate + Power Gate”, with the minimum of each total curve marked (∆ = 0.58).
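As a quick check of the numbers quoted for Fig. 3.8, the ratio difference and the percentage reduction follow directly from the two minimum-energy ratios. The helper name below is illustrative only and is not part of the original tool flow.

```python
from typing import Tuple


def energy_reduction(ratio_before: float, ratio_after: float) -> Tuple[float, float]:
    """Return (absolute ratio difference, fractional reduction)."""
    diff = ratio_before - ratio_after
    return diff, diff / ratio_before


# Minimum-energy ratios quoted in the text for 22 nm HP:
# 0.872 at Vdd = 0.7 V without power gating, 0.296 at Vdd = 0.5 V with it.
diff, fraction = energy_reduction(0.872, 0.296)
print(f"difference = {diff:.2f}, reduction = {fraction:.0%}")
```

The fractional reduction is taken relative to the "Trans Gate" minimum, which is how the 0.58 difference maps to the 66% figure.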
We can now look at the power gating results in Figures A.1 through A.5 and
Fig. 3.4. We show results for power gating alone, for power gating in combination with
transmission gates, and for power gating in combination with gate boosting. Power
gating increases geomean delay at nominal Vdd, thereby reducing the headroom we
get to reduce Vdd before reaching our delay target, though the energy at a given Vdd
is also reduced. As expected, power gating together with transmission gates or gate
boosting reduces energy significantly. Compared to “Base” with optimum Vdd scaling
(i.e. taking the ratio “Trans Gate + Power Gate”/“Base” instead of just looking at
“Trans Gate + Power Gate”, which is normalized to “Base” at nominal Vdd), we get
the following geomean energy reduction: 41-73% for no delay target (LP/LSTP), 32-
71% for Base CP ×1.0 (HP), 35-73% for Base CP ×1.25 (HP). Once again, we can see
the importance of having transmission gates, without which we would have stopped
earlier for some cases and we would have only been able to reduce energy by (“Power
Gate”/“Base”): 19-73% for no delay target (LP/LSTP), 28-58% for Base CP ×1.25
(HP). The Base CP ×1.0 (HP) delay target cannot be met with power gating only
(without transmission gates), because the delay at nominal Vdd is already larger than
the target. Furthermore, these results do not include 7 nm LSTP, which fails even
at nominal Vdd without transmission gates. We can see that the benefit of power
gating is reduced as we scale technology for LP processes, whereas it is increased for
HP processes. This is because power gating is able to reduce the increased dominance
of leakage as we scale technology in HP processes.
Once again, we get similar results if we use gate boosting instead of transmission
gates.
3.6 Dual-Vdd architectures
As we reduce Vdd below its nominal value, we reduce energy, but we also increase delay.
In previous sections, the headroom we got to increase delay was either from the delay
gains due to transmission gates (we could then decrease Vdd until we got the original
delay back) or from setting a delay target larger than the baseline (Base CP ×1.25).
We now further increase this headroom by using two Vdd levels and only scaling Vdd
for resources not affecting the critical path.
3.6.1 Motivation and approach
We use a high Vdd at nominal voltage (Vhigh = Vnominal) and a low Vdd that we can
reduce (Vlow < Vnominal). The idea is to use Vlow for resources that are not on the
critical path: their energy decreases, and their delay increases, but not enough to
affect the critical path. On the other hand, resources on the critical path, or close
to it, are kept at Vhigh. Dual-Vdd architectures come with an overhead (see next
paragraph), but if we can move enough resources to the Vlow domain, we may be able
to save enough energy to compensate for that overhead, and overall save energy.
Dual-Vdd designs have been successfully used in ASICs, but FPGAs have the
disadvantage that the application is not known at fabrication time, so we do not
know in advance where the high and low Vdd domains should be. We have two
choices to implement dual-Vdd on FPGAs. We can decide which physical resources
use Vhigh and which use Vlow ahead of time: a fixed dual-Vdd architecture. Or we
can add circuitry to select a different Vdd post-fabrication: a programmable dual-Vdd
architecture. As mentioned in Sec. 2.2.4, previous work has found that mapping
a netlist to a fixed dual-Vdd architecture incurred too high a mapping overhead,
and that programmable architectures were better due to their flexibility. Of course,
programmable dual-Vdd also incurs an overhead (Sec. 3.6.2). The granularity at which
the Vdd of resources is programmable also matters. We stick to the same approach
as previous work [41, 76], by allowing each cluster and each routing segment to be
programmed to a different Vdd. We leave granularity exploration and revisiting
pre-defined versus programmable Vdd for future work.
3.6.2 Power switches for dual-Vdd
In order to select the Vdd level, we use a switch similar to the one for power gating
in Sec. 3.5, shown in Fig. 3.7b (also similar to the ones from previous work [41, 76]).
For dual-Vdd, we need to select between two voltage supplies, so we need two power
switch PMOS transistors. The area overhead is therefore larger than it was for power
gating, leading to larger tiles, longer wires, and higher delay and energy. The additional
PMOS transistor does not increase the delay of the circuit that it is driving (it is
still the same as with only one power switch). Once again we use Msw = 4×Mp. We
could have used only one SRAM bit to set the voltage either high or low in Fig. 3.7b.
However, at the cost of only one extra SRAM bit, we get the ability to power-gate
the resource if it is unused, thus benefiting from some of the same gains as in Sec. 3.5.
Note that in our work, the impact of the added power switches Msw is higher than
in previous work because we use modern direct-drive segments [74, 70], which have
more drive strength for a given width (are more efficient) than the bi-directional
drives used in older FPGAs [41, 76]. For bi-directional drives, the inverter is followed
by a pass-transistor NMOS that selects the driver and increases delay, as shown in
Fig. 3.7b.
Figure 3.9: Level converter circuit
3.6.3 Level converters for dual-Vdd
The two Vdd domains are not independent; some signals need to cross between them
to communicate information. A signal can easily switch from the Vhigh to the Vlow
domain, since the rail-to-rail voltage swing on the Vhigh domain is at or above the
Vlow domain power rails. However, when going from the Vlow to the Vhigh domain, a
low-voltage signal driving a high-voltage inverter does not drive to the high logic rail,
causing both the NMOS and PMOS transistors of the inverter to turn on partially,
resulting in DC short-circuit currents and excessive leakage. To avoid this, we use
level converters to bring a Vlow signal up to Vhigh when crossing domains. We use the
same level converter as previous work [76]; it was proposed by [95] and is shown in
Fig. 3.9. It consists of two inverters with a feedback path between the output of the
second inverter and the power supply selection of the first one. This allows the first
one to either operate with full Vdd when its input is 0, or to operate from a virtual
low-voltage supply created by the threshold drop across n1.
As suggested in [41], we place level converters at the inputs of clusters instead of
outputs. Our clusters have more inputs than outputs (22 versus 10), so this leads to a
higher area overhead. However, it provides more flexibility and leads to better results,
since the whole route out of a cluster can have the same voltage, including all fanouts,
regardless of the destination cluster’s domain. On the other hand, if we placed the
level converters at cluster outputs, the routing would tend to stay in the Vhigh domain
more often because it would be limited by only one of its multiple destination clusters
being in the Vhigh domain. We illustrate this in Fig. 3.10. This is important since most
of the energy in FPGAs is in the routing [113]. We further place level converters at
multiplier and memory block inputs. We scale the voltage in multipliers the same way
that we do for clusters. We keep memories in the Vhigh domain because of reliability
concerns when scaling memory voltages: memories are designed with transistors close
to the minimum size to increase their density, which makes them more susceptible
to failure due to variation as we scale voltage, for example due to intra-die random
dopant fluctuation [3, 22].
Figure 3.10: Placing level converters at logic cluster inputs or outputs. (a) Level
converters at logic cluster outputs: the route is forced to stay at high Vdd because
logic cluster 3 is at high Vdd. (b) Level converters at logic cluster inputs: the route
can move to the low-Vdd domain.
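The placement argument above can be illustrated with a deliberately simplified model of which domain a route ends up in. The function and its two rules are assumptions made for this sketch (they ignore routing-segment granularity and timing) and are not taken from the original tool flow.

```python
from typing import Iterable


def route_domain(source: str, sinks: Iterable[str], converters_at: str) -> str:
    """Voltage domain ("high" or "low") of the route leaving a cluster.

    Simplified rules assumed by this sketch:
    - converters at cluster inputs: the route follows its source cluster,
      because every sink converts the signal on arrival if needed;
    - converters at cluster outputs: the route can be low only if the source
      and every sink are low, since a low route cannot drive a high sink.
    """
    if converters_at == "inputs":
        return source
    if converters_at == "outputs":
        all_low = source == "low" and all(s == "low" for s in sinks)
        return "low" if all_low else "high"
    raise ValueError("converters_at must be 'inputs' or 'outputs'")


# One low-Vdd source cluster fanning out to three clusters, one of them high-Vdd.
sinks = ["low", "low", "high"]
print(route_domain("low", sinks, "outputs"))  # one high sink forces a high route
print(route_domain("low", sinks, "inputs"))   # route stays in the low domain
```

Under these assumed rules a single high-Vdd destination is enough to pull an output-converted route into the Vhigh domain, which is the effect described for Fig. 3.10.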
3.6.4 Vdd assignment algorithm
Our dual-Vdd mapping flow is similar to [76]. We first perform place-and-route on a
single-Vdd netlist with Vdd = Vhigh, which gives us our design’s critical path, the
lowest critical path we will be able to achieve (Dual-Vdd CP). Then, we use the greedy
algorithm from [76] to assign cluster blocks to either the Vlow or Vhigh domain. The
algorithm tries to move the blocks one by one to the Vlow domain. If a move increases
the critical path by more than a given threshold, the algorithm moves the block back
to Vhigh; otherwise, it keeps the move and proceeds to the next block. Moving a block
to a different domain also means that all routes going out of that block change domains.
This is because level conversion only happens at cluster inputs (to limit its overhead).
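The greedy loop described above can be sketched as follows. This is a minimal illustration, not the algorithm of [76] itself: the `critical_path` callback stands in for a real timing analysis, and the toy timing model, block names, and delay values are all hypothetical.

```python
from typing import Callable, Iterable, List, Set


def greedy_vdd_assignment(
    blocks: Iterable[str],
    critical_path: Callable[[Set[str]], float],
    cp_scale: float = 1.0,
) -> Set[str]:
    """Greedily move blocks to Vlow while the critical path stays in budget.

    critical_path(low_blocks) returns the critical-path delay when exactly
    the blocks in low_blocks run at Vlow (a stand-in for timing analysis).
    """
    low: Set[str] = set()
    budget = cp_scale * critical_path(low)  # all-Vhigh delay times allowed slack
    for block in blocks:
        low.add(block)                      # tentatively move the block to Vlow
        if critical_path(low) > budget:
            low.remove(block)               # too slow: move it back to Vhigh
    return low


# Toy timing model (hypothetical): each path is a chain of blocks.
PATHS: List[List[str]] = [["a", "b", "c"], ["d", "e"]]
D_HIGH, D_LOW = 1.0, 1.5  # made-up per-block delays at Vhigh and Vlow


def toy_critical_path(low: Set[str]) -> float:
    return max(sum(D_LOW if b in low else D_HIGH for b in path) for path in PATHS)


blocks = ["a", "b", "c", "d", "e"]
print(sorted(greedy_vdd_assignment(blocks, toy_critical_path, cp_scale=1.0)))
print(sorted(greedy_vdd_assignment(blocks, toy_critical_path, cp_scale=1.25)))
```

With `cp_scale=1.0` only the blocks on the shorter path can move to Vlow; relaxing the budget to 1.25 also lets one block on the critical path move, mirroring the CP ×1.0 and CP ×1.25 cases.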
3.6.5 Dual-Vdd results
In Figures A.1 through A.5, we can now look at the “Dual-Vdd CP” curves. We show
the effect of dual-Vdd architectures on total energy and delay (the x-axis shows Vlow
scaling). Dual-Vdd CP ×1.0 is the case where we do not allow the critical path to
increase in the Vdd assignment algorithm. Dual-Vdd CP ×1.25 is the case where we
allow it to increase by 25%, enabling more energy savings. They are implemented
either with transmission gates or gate boosting because of the benefits found in the
previous sections.
The “Dual-Vdd CP” curves have a different shape than the other ones: they get
back to the original energy/delay characteristics at very low Vlow, instead of
overshooting to infinity. At nominal Vdd, we actually have Vlow = Vhigh = Vdd−nominal,
so the Vdd assignment algorithm is able to switch all resources to Vlow without
increasing CP. This is shown in Fig. 3.11. Then, as we reduce Vlow, the resources on
the critical path cannot switch to Vlow and stay at high Vdd, but most of the circuit
consumes a bit less energy, so the energy goes down. As we keep reducing Vlow, more
and more resources stay in the Vhigh domain (their delay is farther and farther away
from that of the critical path), until a point where Vlow is so low that no more blocks
can be switched to it and energy starts going back up: more of the circuit is in the
Vhigh domain than in the Vlow domain. We eventually get back to the energy level of
Vdd−nominal when no block switches to Vlow.
As we can see from Fig. A.1b at 90 nm and nominal Vdd, even though transmission
gates reduce delay by 16%, the dual-Vdd scheme brings it back up to a 3.6% overhead
due to the programmability overhead. Part of the delay increase is due to the Msw
power switch (Fig. 3.7a), while the rest is due to the 58% larger tile area (for “Dual-Vdd
TG CP ×1.0” compared to “Trans Gate”, not shown in Fig. A.1b). Because it
increases the delay above the baseline’s delay, the “Dual-Vdd CP ×1.0” point does not
show up in Fig. 3.4b “Base CP ×1.0” at 90 nm: it does not even meet the delay target
at nominal Vdd. However, we can look at the last four cases of Fig. 3.4: “Dual-Vdd TG
CP ×1.0” is the case where the different techniques are allowed to reduce Vdd until
they reach the delay set by the “Dual-Vdd TG CP ×1.0” curve, and similarly for the
“Dual-Vdd TG CP ×1.25” case. “Dual-Vdd GB CP ×1.0” and “Dual-Vdd GB CP ×1.25”
correspond to gate boosting instead of transmission gates. At 90 nm, we get 43%
energy savings for only 3.6% delay overhead; this is looking at the “Dual-Vdd TG CP
×1.0” curve and the “Dual-Vdd TG CP ×1.0” delay target in Fig. 3.4b, normalized to
“Base” at nominal Vdd. We could also normalize this to “Trans Gate” at nominal Vdd
to isolate the effect of dual-Vdd without the effect of transmission gates: at 90 nm, we
get 49% energy savings for 18% delay overhead. This is close to the 48% reduction
in power for 18% increase in delay reported by previous work at 100 nm [76].
Figure 3.11: Breakdown of energy components in dual-Vdd voltage sweep (apex2.blif
shown). Energy ratio versus Vdd (V) for 90 nm bulk, split into compute and route
dynamic and leakage energy in the Vhigh and Vlow domains, plus clock energy.
Back to normalizing to “Base” at nominal Vdd, we note that the energy savings
of “Dual-Vdd TG CP ×1.0” increase as we scale technology for the HP case (49% at
90 nm, 66% at 7 nm). For LP/LSTP technology, the dual-Vdd savings are low (3.9% at
45 nm, 0.2% at 22 nm), though they start increasing with the switch to FinFETs: 14%
at 14 nm, 16% at 7 nm. Therefore, for LP/LSTP, the reduced headroom we have in
reducing Vdd (Sec. 3.4.3) does not allow the energy to drop low enough to significantly
offset the Vdd programmability overhead. “Dual-Vdd TG CP ×1.25” reduces energy
further by trading off delay. The best energy we can achieve is shown in “No Delay
Target”: this is our min-energy goal.
Even though the dual-Vdd architecture (with transmission gates) shows savings
compared to the baseline case, we observe that it does not help reduce energy more
than a single-Vdd architecture with power gating and transmission gates, whether it is
for HP or LP/LSTP processes, older or newer technologies, the min-energy or any of
the delay goals. Indeed, even though the “Dual-Vdd TG CP ×1.0” curves in Figures
A.1 through A.5 reduce energy without changing delay as we scale voltage, the “Trans
Gate + Power Gate” curves start at a lower delay, so they also get headroom to reduce
Vdd (until they hit the delay target). This, combined with the fact that they start at
a lower energy because they avoid the added dual-Vdd overhead, makes them perform
about as well as the dual-Vdd scheme, if not better. This is illustrated in Fig. 3.3b.
Once again, we get similar results and conclusions when using gate boosting in-
stead of transmission gates.
3.7 Discussion
3.7.1 Relative benefits
As noted in Sec. 3.3, since HP processes are significantly faster than LP/LSTP ones,
we would use them for the delay goal. Fig. 3.1a showed that the energy-optimum HP
point had higher energy than even the baseline LP point; in general, we expect that
LP/LSTP processes will be better for the min-energy goal. Still, since the HP process
sees a higher energy gain than the LP/LSTP, we might wonder if an HP process
ends up consuming less energy at lower voltages, especially with power-gating, which
significantly reduces leakage energy, a component that is more dominant in the HP
case. Fig. 3.12 shows the benefits we get out of picking LP/LSTP instead of HP. It
also shows the benefits of transmission gates and power gating, and their different
combinations. The results are normalized to the baseline at nominal Vdd, so the
first bar (“bulk” or “HP”) shows the benefits from simply scaling Vdd to its energy
optimum. We notice that most of the energy savings at 90 nm, 65 nm, and 45 nm
come from power gating, whereas most of the savings at 22 nm, 14 nm, and 7 nm
come from using an LP/LSTP process. Note that 14 nm and 7 nm are shown on a
log scale due to the large difference between the HP and LSTP processes: leakage is
very high for the HP case, and LSTP reduces it significantly. Over all technologies,
and using all the techniques mentioned, we are able to reduce energy by 74-98%.
3.7.2 Results spread
So far we have only shown the geomean across all benchmarks. Fig. 3.13 shows a
boxplot for the energy-optimum curves at 22 nm, and separates the Toronto20 benchmark
set (used by previous work) and the newer VTR 7.0 set that includes memories.
The rectangular box shows the lower and upper quartiles; half of the data lies within it.
The heavy line in the box shows the median. The horizontal lines above and below
the box denote the limits of the data’s nominal range as inferred from the upper and
lower quartiles, and the outliers are denoted with open circles. The actual minimum
energy voltage differs among the designs, with some designs continuing their energy
reduction past the geomean minimum energy point. The newer VTR 7.0 set has a
larger spread in characteristics driven by the larger diversity in resource usage. The
designs with the highest memory requirements show up as high energy outliers since
we did not reduce Vdd for the memories.
Figure 3.12: Relative benefits of the different low-energy techniques for the min-energy
goal. One panel each for 90 nm, 65 nm, 45 nm, 22 nm, 14 nm, and 7 nm, plotting the
energy ratio of each process (bulk at 90 nm and 65 nm; HP and LP from 45 nm onward)
alone and with TransG, GateBoost, PowerG, TransG PowerG, and GateBoost PowerG.
Note: the y-axis is in log scale for the 14 nm and 7 nm panels.
Figure 3.13: Energy overhead spread across benchmarks (single-Vdd, with transmission
gates and power gating). Boxplots of energy ratio versus Vdd (V) for 22 nm HP and
22 nm LP. (a) Toronto20 benchmark set. (b) VTR benchmark set.
Chapter 4
Communication Energy: Adjusting Parallelism and Memory Organization
4.1 Introduction
Within FPGAs, for many applications, data movement energy—the energy for moving
data from one physical location on the FPGA to another—can dominate computa-
tional energy. Data movement includes both energy for accessing memory and energy
for moving bits over interconnect segments between processing elements. In this
chapter, we explore two inter-related issues that significantly impact data movement
energy in FPGA applications: parallelism and memory organization.
How does parallelism impact the energy required for computation? In sequential
designs, the energy for reading and writing data in large memories is typically the
dominant component of energy. Since the energy to read from a memory grows with
memory size, parallel designs that use many smaller memories local to each processing
element rather than one large memory can reduce memory energy. As long as the
parallel design does not incur too much energy communicating among its processing
elements, the decomposition can provide a net energy reduction. We formulate
the trade-offs involved and show analytically how they lead to an optimum level of
parallelism that minimizes communication energy (Sec. 4.2). We then show empirically
how parallelism tuning can be used to minimize energy both for an ideal, limit-study
architecture and for designs mapped to a specific, energy-optimized FPGA
architecture (Sec. 4.6).
This result underscores the importance of the architectural question: How do we
organize memories in FPGAs to minimize the energy required for a computation? We
have several choices: What are the sizes of memory blocks? Where (how frequently)
are memory blocks placed in the FPGA? How are the memories activated? How
are they decomposed into sub-block banks? What read and write widths should the
memory blocks use? Then, there are choices available to the RTL mapping flow:
When mapping a logical memory to multiple blocks, should they each get a subset of
the data width and be activated simultaneously, or should they each get a subset of
the address range and be activated exclusively? We develop simple analytic relations
to reason about these choices (Sec. 4.3). After setting up some background (Sec. 4.4)
and methodology (Sec. 4.5), we perform an empirical, benchmark-based exploration
to identify the most energy-efficient organization for memories in FPGAs and quantify
the trade-offs between area- and energy-optimized mappings (Sec. 4.7). We show how
to choose a one-size-fits-all FPGA memory architecture that contains the worst-case
architectural mismatch overhead from memory block size and placement below 60%.
4.2 Parallelism and data movement energy
To perform any computation, we must communicate data between the point in time
and space where each intermediate data item is computed and where it is consumed.
This communication can occur either through interconnect wires, in case the operators
are spatially located at different places, or through memories, in case the operators
are sequentialized on a common physical operator. At the extremes, the design could
be fully spatial (e.g., a spatial FFT) or fully sequential (e.g., a single processor that
computes the same FFT, storing intermediate data in a single, large memory). Either
way, we spend energy for communication, either toggling long wires or reading and
writing from large memories. We show that, in the general case, neither extreme is the
most energy-efficient. That is, we can optimize the energy required for communication
by tuning the parallelism in the task, and there is typically a level of parallelism
that minimizes total communication energy (see Fig. 4.1). The phenomenon here is
closely related to the ones explored in [33], and we similarly show it is often better
to distribute the data and computation than to centralize it in a single memory.
GraphStep [34] provides one concrete model for how applications might be defined
to allow this form of parallelism tuning. In the remainder of this chapter, we explain
and model the opposing communication energy effects and show how they give rise
to this optimum energy point.
Figure 4.1: Energy versus PE Count (Npe) for the window filter benchmark (WinF).
(a) Size and PE sweep: energy in nJ/(pixel/cycle) versus number of PEs (Npe, 1 to 32)
for N = 128, 256, 512, and 1024. (b) Detailed breakdown for N = 512 into logic, route,
and mem energy.
4.2.1 Memory energy
The energy required to access a memory depends on its capacity and the number of
output bits. This is driven almost directly by the length of the wires the bits must
traverse to move between the input/output port for the memory and the memory cell
location within the array. Roughly, the energy (and delay) minimizing organization
for an M-bit memory is a $\sqrt{M} \times \sqrt{M}$ array, meaning that all the main
wires in the memory are of length $\sqrt{M}$, so that their capacitance scales as
$\sqrt{M}$ and, consequently, their switching energy scales as $\sqrt{M}$. Everything
else being equal, a memory of four times the capacity will cost twice the energy.
Since memory energy is driven by wire lengths, we cannot “cheat” the $\sqrt{M}$ energy
growth by decomposing the memory into smaller memories and wiring to them. We
would still end up with address wires of length $\sqrt{M}$ and input/output wires of
length $\sqrt{M}$. If we were dominated by memory cell capacitance, breaking the large
memory into smaller memory banks and activating only one bank at a time could reduce
the memory cell energy, but we are still left with wiring energy that also has a
$\sqrt{M}$ dependence. We will see later (Sec. 4.3) that banking can help us reduce
mismatch energy.
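The square-root scaling can be made concrete with a one-line model. This is a minimal sketch that assumes access energy is purely wire-dominated and proportional to the square root of capacity; the function name is illustrative.

```python
import math


def relative_access_energy(bits: float, reference_bits: float) -> float:
    """Access energy of a `bits`-sized memory relative to a reference memory,
    assuming wire-dominated energy that scales as sqrt(capacity)."""
    return math.sqrt(bits / reference_bits)


# Quadrupling the capacity doubles the per-access energy in this model.
print(relative_access_energy(4 * 1024, 1024))
# A 16x larger memory costs 4x the energy per access.
print(relative_access_energy(16 * 1024, 1024))
```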
4.2.2 Between computations
For many computations, we can reduce the size of the memory by performing the
computation in parallel. Rather than having a single processing element (PE) or
computational datapath with a large memory to hold all the data, we can have
multiple PEs, each with its own, smaller memory. Ideally, for a problem with data
size N , the size of the memories scales with the number of PEs, Npe, as S = N/Npe.
Smaller memories reduce the energy for each memory operation as noted above. In
the extreme, we may be able to eliminate the memories altogether and simply connect
datapath elements. For example, we could build a completely spatial FFT network
with no internal memories. However, increasing the number of PEs also increases the
physical size of the computation, potentially increasing the length of the wires in the
system and hence increasing energy. In tasks like the FFT, data must now be moved
from the PE where it is produced to the PE where it is consumed, and this data
movement costs energy.
4.2.3 Analysis
To understand how these effects interact, we develop simple energy models for the
computation. Since communication energy depends on interconnect lengths, which
in turn are driven by the design size, we first need a rough understanding of how the
design grows with the number of PEs, Npe, and the problem size, N. We pay area
for memory to hold the state of the computation, Amem, area for the logic for each
PE, Alogic, and area for the interconnect.
\begin{align}
A_{mem}(M) &\propto M \tag{4.1} \\
A_{pe}(S) &= A_{mem}(S) + A_{logic} \tag{4.2} \\
S &= \frac{N}{N_{pe}} \tag{4.3} \\
p' &= \max(0.5, p) \tag{4.4} \\
A &= N_{pe} \cdot A_{pe}(S) + \max\left( C_1 \left( \frac{N^{p'}}{S} \right)^{2},\; C_2 \, (N_{pe})^{2 p_{net}} \right) \tag{4.5}
\end{align}
The maximum term in Eq. 4.5 uses a Rent’s Rule [68] wiring model, in the spirit of
[112, 14], to estimate the growth rate contribution from wiring based on the Rent
exponent. The interconnect area in this max term can be driven by either the wiring
requirement for the physical substrate (pnet) or the communication requirement of the
application (p) and the extent to which the design is serialized, S. $N^{p}$ captures the
volume of traffic that must cross the bisection width of the chip and the denominator
S accounts for the fact that it can be serialized over S cycles, reducing the volume
of physical wires required. $N^{p}/S$ or $(N_{pe})^{p_{net}}$ defines one side length of
the chip; the other will be proportional to this, such that the area is proportional to
the square of this length term. This term deals with the fact that wiring may drive
total area, forcing the design to be spread out just to get the required bandwidth for
highly parallel computations when p > 0.5. As we will see, this can have an effect
limiting the amount of parallelism that should be exploited to minimize energy. p′
(Eq. 4.4) deals with the fact that the wiring requirement remains O(N) even when
p < 0.5.
The energy at the PE, Epe, is driven by the size of the memories and represents
the total energy over the PE processing all S computations assigned to the PE. The
communication energy, Ecomm, deals with both the volume of traffic and the length of
the wires that the traffic must traverse. Rent’s Rule captures the number of wires at
each level of a hierarchical decomposition ((2i)
p
), and the areas computed above allow
us to associate a wire length with each level of the hierarchy (2⌈i⌉
√
A/N). The N
2i
term
captures the number of sub-units at a given level of the hierarchical decomposition.
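E_{mem}(M) \propto \sqrt{M}    (4.6)
E_{pe}(S) = S \cdot (E_{mem}(S) + E_{logic})    (4.7)
E_{comm} \le \sum_{i=0}^{\log_2(N)} \left( \frac{N}{2^i} \times c \left( 2^i \right)^p \times 2^{\lceil i/2 \rceil} \sqrt{A/N} \right) \propto N^{p'} \sqrt{A}    (4.8)
E = N_{pe} \cdot E_{pe} + E_{comm}    (4.9)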
Eq. 4.9 captures the total energy to perform the task over the entire problem of size
N. Substituting Eqs. 4.5, 4.7, and 4.8 into Eq. 4.9, we get a total energy that scales
asymptotically as:
E = O\left( \frac{N^{1.5}}{\sqrt{N_{pe}}} + N^{p'} \sqrt{ N + \max\left( (N_{pe})^2 N^{2p'-2}, (N_{pe})^{2 p_{net}} \right) } \right)    (4.10)
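The substitution behind Eq. 4.10 can be spelled out term by term. The steps below are a sketch that drops constant factors and assumes Npe ≤ N, so that Npe · Alogic and N · Elogic are both O(N) and are absorbed by the other terms:

```latex
\begin{align*}
N_{pe} E_{pe}(S) &= N_{pe} S \left( E_{mem}(S) + E_{logic} \right)
  \propto N \sqrt{N / N_{pe}} + N E_{logic}
  = O\!\left( N^{1.5} / \sqrt{N_{pe}} \right) \\
N_{pe} A_{pe}(S) &= N_{pe} \left( A_{mem}(S) + A_{logic} \right)
  \propto N + N_{pe} A_{logic} = O(N) \\
\left( N^{p'} / S \right)^2 &= (N_{pe})^2 N^{2p'-2} \\
E_{comm} &\propto N^{p'} \sqrt{A}
  = O\!\left( N^{p'} \sqrt{ N + \max\left( (N_{pe})^2 N^{2p'-2},
      (N_{pe})^{2 p_{net}} \right) } \right)
\end{align*}
```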
The first term represents the memory operations, and it decreases with Npe as we make
the size of the memories in each PE smaller. The second term is the interconnect term,
and it increases with Npe. The first component of the area (N in the square root) is
the space to hold all the memory, which is independent of Npe and sets a potential
lower-bound on the asymptotic energy of O(N^{p′+0.5}). This gives us a goal of selecting
an Npe that does not force any of the other components to exceed O(N^{p′+0.5}). To
achieve this:
N^{2-2p'} \le N_{pe} \le \min\left( N^{1.5-p'}, N^{\frac{1}{2 p_{net}}} \right)    (4.11)
For pnet and p ≤ 0.5, Eq. 4.11 says that Npe = O(N)—we should accommodate larger
designs by increasing the number of PEs proportionally with the problem size. This
arises because the wire lengths are not growing with N for p < 0.5 [36]. The example
shown in Fig. 4.1 illustrates this with the energy minimizing number of PEs growing
linearly with the problem size. For p > 0.5, the wire lengths do grow and the number
of PEs should grow more slowly as dictated by the terms on the right. Nonetheless,
the memory term on the left dictates that the number of PEs must be growing for
any p < 1.0. The terms on the right allow growth of O(√N) even when pnet or p
become 1.0; the allowed growth increases as p and pnet reduce from 1.0. When pnet is
large, we can get cases where the bounds in Eq. 4.11 cross; this reflects cases where
O(N^{p′+0.5}) is not achievable, and the actual minimum must be formulated differently.
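The cases discussed above follow directly from the exponents in Eq. 4.11. As a small sketch (the function name is ours, introduced only for illustration), writing Npe = N^e and returning the allowed range of e:

```python
def npe_exponent_bounds(p, p_net):
    """Return (lower, upper) bounds on e, where Npe = N**e, per Eq. 4.11."""
    p_eff = max(0.5, p)                          # Eq. 4.4
    lower = 2 - 2 * p_eff                        # from the memory term
    upper = min(1.5 - p_eff, 1 / (2 * p_net))    # from the two wiring terms
    return lower, upper


if __name__ == "__main__":
    for p, p_net in [(0.5, 0.5), (0.75, 0.5), (1.0, 1.0), (0.5, 1.0)]:
        print(p, p_net, npe_exponent_bounds(p, p_net))
```

For p = pnet = 0.5 both bounds equal 1, giving Npe = O(N); for p = pnet = 1.0 the range is 0 to 0.5, giving at most O(√N); and for p = 0.5 with pnet = 1.0 the lower bound (1) exceeds the upper bound (0.5), which is the crossing case mentioned in the text.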
When we select Npe within Eq. 4.11, the total area remains O(N) and is, asymptotically
at least, not wire dominated. For the cases where Npe = O(N), the serialization
factor becomes a constant S = N/Npe that is independent of N. If we select
S such that Amem (S) = Alogic, the design is only twice as large as a fully serialized
design, meaning the chip-crossing wires are at most √2 times longer than they might be
in a fully serialized design. Making S smaller than this increases the total area, and
hence, the length of chip-crossing communications, while making S larger increases
the length of the nearest neighbor connections. Hence, this distribution is, at least,
within a small constant factor of the energy-optimal S ratio. Unfortunately, the logic
for a PE, Alogic, is application-specific, while an FPGA must typically pick a single,
one-size-fits-all organization for memory and logic that must satisfy all applications.
As we will see in the next section, we can design the FPGA memory organization so
that the overhead energy due to this mismatch is bounded by a small constant.
Additional parallelism can enable a secondary benefit. Assuming we are trying to
meet a fixed throughput requirement, the more parallel solution will meet the requirement
with lower clock rates. A lower clock rate, in turn, will allow the computation
to run at a lower voltage. Since dynamic energy is proportional to CV^2, this can save
computational energy as well as data movement energy [27]. In order to keep this
chapter focused on memory energy, we will not also turn the operating voltage knob.
However, we will do so in Chapter 6 and Chapter 7.
Figure 4.2: Column-oriented embedded memories (labels: Memory Block, LB, dm = 4, Lseg, Lmseg)
4.3 Architecture mismatch energy
FPGA embedded memories generally improve area- and energy-efficiency [65]. The
previous section showed how this works for one class of applications. When the
embedded memory perfectly matches the size and organization needed by the appli-
cation, an FPGA embedded memory can be as energy-efficient as the same memory
in a custom ASIC. Nonetheless, the FPGA has a fixed-size memory that is often
mismatched with the task, and this mismatch can be a source of energy overhead.
As noted (Eq. 4.6), memory energy scales as the square-root of the capacity. When
the FPGA memory block (March) is larger than the application memory (Mapp), there
is an energy overhead that arises directly from reading from a memory bank that is
too large (E(March)/E(Mapp)).
There is also a mismatch overhead when the memory block is smaller than the
application memory. To understand this, we must also consider the routing segments
needed to link up the smaller memory blocks into a larger memory block. To build
a larger block, we take a number of memory blocks (⌈Mapp/March⌉) and wire them
together, with some additional logic, to behave as the desired application memory
block. In modern FPGAs it is common to arrange the memory blocks into periodic
columns within the FPGA logic fabric (See Fig. 4.2). Assuming square memory and
logic blocks, the set of smaller memory blocks used to realize the large memory block
might roughly be organized into a square of side length ⌈√(Mapp/March)⌉, demanding
that each address bit and data line connected to the memory cross roughly
dm × ⌈√(Mapp/March)⌉ horizontal interconnect segments to address the memory, where dm
is the distance between memory columns in the FPGA architecture. Since there is
an asymmetry that we cross logic blocks in the horizontal direction and not in the
vertical direction, we can reduce the overhead by a constant factor by composing the
large memory from an h× v rectangle with v ≥ h, where h and v are the respective
horizontal and vertical dimensions (h ·v ·March = Mapp). If Eseg is the energy to cross
a length-1 segment over a logic island in the FPGA, and Emseg(March) is the energy
to cross a length-1 segment over a memory block of capacity March, the horizontal
and vertical routing energy to reach across the memory is:
Eh = (dmEseg + Emseg(March))× h (4.12)
Ev = Emseg(March)× v (4.13)
We define φ = dm · Eseg / Emseg(March), allowing us to restate:
Eh = (φ+ 1)Emseg(March)× h (4.14)
Since the routing energy of wires comprises most of the energy in a memory read, and
since each bit must travel the height of the memory block (bit lines) and the width
(output select), per bit, the energy of a memory read is roughly the energy of the
wires crossing it. For a native memory block:
Ebit(M) ≈ 2Emseg(M) (4.15)
Therefore:
E_{mseg}(M_{arch}) \left\lceil \sqrt{\frac{M_{app}}{M_{arch}}} \right\rceil \approx E_{mseg}(M_{app}) \approx 0.5 E_{bit}(M_{app})    (4.16)
For this composed case the per bit energy becomes:
Ebit(Mapp > March) = Eh + Ev = (v + h (φ+ 1))Emseg(March) (4.17)
We set the derivative of Eq. 4.17 with respect to v equal to 0 and solve for the v
that minimizes Ebit(Mapp > March), recalling that h = (1/v) · Mapp/March, and get
v = √((φ + 1) Mapp/March), which results in:
E_{bit}(M_{app} > M_{arch}) = 2 \sqrt{(\phi + 1) \frac{M_{app}}{M_{arch}}} E_{mseg}(M_{arch})    (4.18)
Eqs. 4.16 and 4.18 give the following memory mismatch ratio, driven by the ratio of
the energy for routing between memory banks to the energy for routing over memory
banks:
\frac{E_{bit}(M_{app} > M_{arch})}{E_{bit}(M_{app})} \approx \sqrt{\phi + 1}    (4.19)
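The optimization leading to Eqs. 4.18 and 4.19 can be checked numerically. The following sketch treats h and v as real numbers and drops the ceiling in Eq. 4.16, as the derivation itself does; the function names and the sample values (Mapp/March = 16, φ = 3) are illustrative assumptions.

```python
import math


def composed_bit_energy(v, ratio, phi, e_mseg=1.0):
    """Eq. 4.17 with h = ratio / v, where ratio = Mapp / March."""
    h = ratio / v
    return (v + h * (phi + 1)) * e_mseg


def optimal_v(ratio, phi):
    """The v that zeroes the derivative of Eq. 4.17."""
    return math.sqrt((phi + 1) * ratio)


def mismatch_ratio(ratio, phi, e_mseg=1.0):
    """Eq. 4.18 divided by the native energy 2 * Emseg(Mapp) (Eqs. 4.15-4.16)."""
    composed = composed_bit_energy(optimal_v(ratio, phi), ratio, phi, e_mseg)
    native = 2 * e_mseg * math.sqrt(ratio)
    return composed / native


if __name__ == "__main__":
    print(optimal_v(16, 3), composed_bit_energy(8, 16, 3), mismatch_ratio(16, 3))
```

For these sample values the optimum is v = 8 and h = 2, a rectangle four times taller than it is wide, and the mismatch ratio is √(φ + 1) = 2 regardless of Mapp/March.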
To illustrate the mismatch effects when memories are both too large and too
small, Fig. 4.3a shows the result of an experiment where we quantify how the energy
compares between various matched and mismatched designs. Each of the curves
represents a single-PE matrix-multiply design that uses a single memory size; the
size of the memory varies with the size of the matrices being multiplied. Each curve
shows the energy mismatch ratio (Y-axis) between the energy required on a particular
memory block size (X-axis) and the energy required at the energy-minimizing block
size (typically the matched size); hence all curves go to 1.0 at one memory block size
and increase away from that point. In contrast to the previous paragraph where we
used deliberately simplified approximations to provide intuition, Fig. 4.3a is based
on energy from placed-and-routed designs using tools and models detailed in the
following sections; Fig. 4.3 also makes no a priori assumption about large memory
mapping, allowing VTR [81] to place memories to minimize wiring. The figure shows
how the energy mismatch ratio grows when the memory block size is larger or smaller
than the matched memory block size. In practice, designs typically demand a mix
of memory sizes, making it even harder to pick a single size that is good for all the
memory needs of an application. Nonetheless, this single-memory size experiment is
useful in understanding how each of the mismatched memories will contribute to the
total memory energy overhead in a heterogeneous memory application.
There is also a potential energy overhead due to a mismatch in memory place-
ment. Assuming we accept a column-oriented memory model, this can be stated as
a mismatch between the appropriate spacing of memories for the application (dmapp)
and the spacing provided by the architecture (dmarch). If the memories are too fre-
quent, non-memory routes may become longer due to the need to route over unused
memories. This gives rise to a worst-case mismatch ratio:
E_{route\ overhead\ unused\ memories} \approx \frac{d_{m_{arch}} \cdot E_{seg} + E_{mseg}(M_{arch})}{d_{m_{arch}} \cdot E_{seg}} = \frac{\phi + 1}{\phi} = 1 + \frac{1}{\phi}    (4.20)
If the memories are not placed frequently enough, the logic may need to be spread
out, by at most a factor of φ + 1. This effectively forces routes to be longer by a
factor of at most √(φ + 1) as they run over unused logic clusters.
E_{route\ overhead\ unused\ logic} \le \sqrt{\phi + 1}    (4.21)
If we make √(φ + 1) = 1 + 1/φ and solve for φ, we get φ = (√5 + 1)/2 ≈ 1.6, which is the
Golden Ratio [49]. The mismatch ratio due to route mismatch (Eq. 4.20) is never
greater than 1 + 1/φ = φ ≈ 1.6. Similarly, the mismatch ratio due to memories being too
small (Eq. 4.19) or memories being placed too infrequently (Eq. 4.21) is never greater
than √(φ + 1) = φ ≈ 1.6. We can observe this phenomenon in Fig. 4.3a by looking at
the 32Kb memory size that never has an overhead greater than 1.2×. In Fig. 4.3, we
also identify the dmarch that minimizes max-overhead (shown between square brackets
for each memory size in Fig. 4.3). This approximately corresponds to the intuitive
explanation above, where the energy for routing across memories is balanced with
the energy for routing across logic. The 32Kb case has Emseg(March)/Eseg = 2.53,
suggesting a dmarch around 4. For this 32Kb case, we found dmarch = 2 experimentally.
Since segment energy is driven by wire length, dmarch · Eseg = φ · Emseg(March) roughly
means dmarch · Lseg = φ · Lmseg(March); when we populate memories this way, 1/(1+φ) ≈ 40%
of the FPGA area is in memory blocks. This design point is robust in that it guarantees
the worst-case mismatch overhead is small for any design. This design point gives us
an energy-balanced FPGA that makes no a priori assumptions about the mix of logic
and memory in the design. In contrast, today’s typical commercial FPGAs could be
considered logic-rich, making sure the energy (and area) impact of added memories
is small on designs that do not use memories heavily.
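The balance point can be reproduced with a few lines of arithmetic. This is a sketch of the calculation only; the helper names are ours, and the 2.53 ratio is the Emseg(March)/Eseg value quoted above for the 32Kb case.

```python
import math

PHI = (1 + math.sqrt(5)) / 2    # golden ratio, the solution of sqrt(phi + 1) = 1 + 1/phi


def too_frequent_overhead(phi):
    """Worst-case routing overhead when memories are too frequent (Eq. 4.20)."""
    return 1 + 1 / phi


def too_sparse_overhead(phi):
    """Worst-case overhead when memories are too small or too sparse (Eqs. 4.19, 4.21)."""
    return math.sqrt(phi + 1)


memory_area_fraction = 1 / (1 + PHI)    # share of FPGA area in memory blocks
suggested_dm_32kb = PHI * 2.53          # dm = phi * Emseg(March) / Eseg

if __name__ == "__main__":
    print(PHI, too_frequent_overhead(PHI), too_sparse_overhead(PHI))
    print(memory_area_fraction, suggested_dm_32kb)
```

Both overheads equal φ at the balance point. Moving φ in either direction lowers one bound but raises the other, so the larger of the two always exceeds φ, which is why this choice minimizes the worst case.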
While the dmarchEseg = φ ·Emseg(March) balance can limit the overhead when the
memories are too small, we can still have large overhead when the memory blocks are
too large (E(March)/E(Mapp)). One way to combat this problem is to use internal
banking, or Continuous Hierarchy Memories (CHM): We can bank the memory blocks
internally so that we do not pay for the full cost of a large memory block when we only
need a small one. For example, if we cut the memory block into four, quarter-sized
memory banks, and only use the memory bank closest to the routing fabric when the
application only uses one fourth (or less) of the memory capacity, we only pay the
memory energy of the smaller memory bank (See Fig. 4.4). This banking scheme is
biased such that the cost of accessing the bank closer to the I/O is lower than the cost
of accessing the other ones. This is useful on FPGAs since it reduces the mismatch,
whereas on an ASIC, where the sizes can be matched, there is often little reason to
prefer one bank over another. In the extreme, we might recursively decompose the
memory by powers-of-two so that we are never required to use a memory more than
twice the size of the memory demanded by the application, keeping the mismatch
energy ratio down to √2. There are some overheads for this banking which may
suggest stopping short of this extreme. Fig. 4.3b performs the same experiment as
Fig. 4.3a, except with memory blocks that can be decomposed into one-quarter and
one-sixteenth capacity sub-banks. With this optimization, the curves flatten out for
larger memory sizes. The physical size with smallest max-overhead is now shifted to
128Kb, still at 1.2×.
Figure 4.3: Energy overhead due to architectural mismatch for matrix-multiply.
Panels: (a) Normal Memories; (b) with Internal Banking (1/4th, 1/16th). X-axis:
Physical Memory Size (Kb) [dm]; Y-axis: Normalized Total Energy (ticks 1 to 3); one
curve per App Mem Size (Kb) from 0.5 to 512. Physical sizes and [dm] in both panels:
0.5 [1], 1 [1], 2 [1], 4 [1], 8 [1], 16 [2], 32 [2], 64 [3], 128 [3], 256 [7], 512 [7].
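The effect of internal banking on the oversized-memory penalty can be sketched with the square-root model of Eq. 4.6. The code below is a simplified illustration, not the experimental flow: it ignores the wiring and banking overheads mentioned above, and the function names are introduced here.

```python
import math


def smallest_usable_bank(m_app, m_arch, fractions):
    """Pick the smallest sub-bank capacity (m_arch * fraction) that holds m_app."""
    candidates = [m_arch * f for f in fractions if m_arch * f >= m_app]
    if not candidates:
        raise ValueError("application memory does not fit in the block")
    return min(candidates)


def oversize_energy_ratio(m_app, m_arch, fractions=(1.0,)):
    """E(bank) / E(Mapp) under Eq. 4.6, ignoring wiring and banking overheads."""
    return math.sqrt(smallest_usable_bank(m_app, m_arch, fractions) / m_app)


if __name__ == "__main__":
    quarter_sixteenth = (1.0, 1 / 4, 1 / 16)
    print(oversize_energy_ratio(2048, 32768))                     # no banking
    print(oversize_energy_ratio(2048, 32768, quarter_sixteenth))  # 1/16 bank fits
    print(oversize_energy_ratio(4096, 32768, quarter_sixteenth))  # must use 1/4 bank
```

Under this model, a 2Kb application memory on an unbanked 32Kb block pays a factor of 4, while the one-sixteenth bank removes the penalty entirely. With a full power-of-two decomposition the ratio stays at or below √2 for every application size.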
Another way to reduce the impact of memory block size mismatch is to include
memory blocks of multiple sizes in the architecture. This way, the design can use the
smallest memory block that will support the application memory. For example, if we
had both 1Kb and 64Kb memories, we could map the 2Kb and smaller application
memories to the 1Kb memory block and the 4Kb and larger application memories to
the 64Kb block and reduce the worst-case overhead to 1.1× (Fig. 4.3a). The impact
of multiple memory sizes is explored experimentally in [57].
Figure 4.4: Internal banking of memory block (labels: Data, Addr[n-1], Addr[n-2],
Addr[n-3:0]; segment lengths Lmseg(M/4) and Lmseg(M))
Another point of mismatch between architecture and application is the width of
the data written or read from the memory block. Memory energy also scales with
the data-width. In particular, energizing twice as many bit lines costs roughly twice
the energy. While FPGA memory blocks can be configured to supply less data than
the maximum width, this is typically implemented by multiplexing the wider data
down to smaller data after reading the full width—the same number of bit lines
are energized as the maximum width case, so these smaller data reads are just as
expensive as the maximum width read, and hence more expensive than they could
have been with a matched-width memory. For example, Altera’s tools report the
same energy for Stratix III M9K memory reads regardless of the width configured
[4]. Asymptotically, we expect memory read energy to scale as W√M. This means
an additional mismatch factor that could be as large as Warch/Wapp. However, the small
memory blocks that are appropriate for FPGA embedded memories are dominated
by peripheral effects (e.g., address decode, sense amplifiers) such that the width
mismatch effect when Warch > Wapp is less severe in practice. To minimize energy,
the memory layout should be square, making cases where Warch > √March relatively
more expensive.
Another potential point of mismatch is the simultaneous ports provided by the
memories. We assume dual-ported memories (2 read/write ports) throughout this
work, consistent with commercial FPGAs.
4.4 Background on FPGA memories
4.4.1 FPGA memory architecture
We use the basic Island-Style FPGA architecture described in Sec. 2.1 (k = 4, n =
10, length 1 segments, a column of multipliers every 20 columns). To incorporate
memories into this mesh, we follow the model used by VTR [81], Xilinx, and Altera,
where select columns are designated as memory columns rather than logic columns—
all the tiles in the designated columns are populated with memory tiles rather than
logic tiles (Fig. 4.2). To the logical mesh, the memory tiles and logic tiles look the
same—they have the same network connectivity. Beyond the routing, the inputs
selected for the memory tile become address and data inputs to the memory, and the
data output from the memory are the outputs from the memory tile supplied into the
routing network. Organizing the memory tiles into a homogeneous column rather than
placing them more freely in the mesh (e.g., checkerboard pattern), allows memories
the freedom to have a different size than the logic tiles. For example, if the memory
block requires more area than the logic cluster, we can make the memory column
wider without creating irregularity within rows or columns or demanding unused
space to pitch match the smaller tile type. Altera uses this column memory model in
their Cyclone and Stratix architectures, and the M9K blocks in the Stratix III [73]
are roughly 3× the area of the logic clusters [119], while being logically organized in
the mesh as a single tile. Large memories can span multiple rows, such as the M144K
blocks in the Stratix III, which are 8-rows tall while remaining one logical row wide,
accommodated by making the column wider as detailed above.
Within this architectural framework, we can vary the proportion of memory tiles
to logic tiles by selecting the fraction of columns that are assigned to memory tiles
rather than logic tiles. We control this by setting the number of logic columns between
memory columns, dm. VTR identifies this as a repeat parameter (repeat=dm + 1).
It is possible to have multiple memory types, each with their own dm, and other hard
logic, such as multipliers, can also be incorporated using this heterogeneous column
model.
4.4.2 Memory energy modeling
We use CACTI 6.5 [88] to model the physical parameters (area, energy, delay) of
memories as a function of capacity, organization, and technology. In addition to
modeling capacity and datapath width, CACTI explores internal implementation
parameters to perform trade-offs among area, delay, throughput, and power. We use
it to supply the memory block characteristics for VTR architecture files. We set
it to optimize for the energy-delay-squared product. We use an LSTP process for
memories.
For internal banking (Fig. 4.4), CACTI gives us the area and energy (Emem) of
the memory banks, and we compute wire signaling energy (Ewires) to communicate
data and addresses between the referenced memory bank and the memory block I/O.
For example, consider the data in Fig. 4.5 for a 1024×32b (32Kb) internally-banked
memory. A monolithic 32Kb memory block is 113µm×67µm, which is high enough
to contain the 31µm×2 = 62µm required for the height of the two 256× 32 memories
of size 66µm×31µm (plus room for extra logic), as shown in Fig. 4.5. The total
width in Fig. 4.5 is 38× 2 + 66 = 142µm, or 142/113 = 1.26× that of the monolithic
32Kb memory block. We therefore adjust Emseg(32K-banked)= 1.26 × Emseg(32K).
CACTI directly provides Emem. The I/O is placed close to the first bank (lower
left in Fig. 4.5), so that reading the first 64 × 32 bits comes at no additional wiring
cost. Accessing the next small bank has the same memory cost, but also a wiring
cost of Ewires = (15µm)(1 + 30 + 64) Cwire (Vdd)^2, with Cwire = 180pF/m and Vdd = 0.95V.
(1 + 30 + 64) corresponds to one signal for the enable, 30 = 2 × 15 for the address
bits (15b covers the 32K × 1 operating mode), and 64 for the 32b input and 32b
output. 15µm is the distance to reach the bank. Similarly, we can compute the
other Ewires costs. For example, to reach the upper right bank, we need to pay a
distance of (66 + 31)µm. Then, the energy of an internally-banked memory is given
by Ebanked = Emem + αEwires, where α is the average activity factor over all signaling
wires.
Addresses     0–63    64–127   128–255      256–511   512–1023
Shape         64×32   64×32    64×32 (×2)   256×32    256×32 (×2)
Size (µm^2)   38×15   38×15    38×15 (×2)   66×31     66×31 (×2)
Emem (pJ)     0.24    0.24     2×0.24       0.51      2×0.51
Ewires (pJ)   0.00    0.23     0.82         0.46      1.5
Figure 4.5: Internal banking for 1024×32 memory (floorplan dimensions in µm: 64×32
banks are 38 wide by 15 high, 256×32 banks are 66 wide by 31 high; I/O at lower left)
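The wiring-energy arithmetic above is easy to reproduce. The sketch below uses only the constants stated in the text (Cwire, Vdd, and the 1 + 30 + 64 signal count); the function names are ours, and the activity factor α passed in the example is a hypothetical value, since the text does not give one.

```python
C_WIRE_F_PER_M = 180e-12    # Cwire = 180 pF/m
VDD_V = 0.95                # Vdd = 0.95 V
N_SIGNALS = 1 + 30 + 64     # enable + address (2 x 15) + data in/out (32 + 32)


def wire_energy_pj(distance_um, n_signals=N_SIGNALS):
    """Ewires in pJ for signalling all wires over distance_um micrometres."""
    joules = distance_um * 1e-6 * n_signals * C_WIRE_F_PER_M * VDD_V ** 2
    return joules * 1e12


def banked_energy_pj(e_mem_pj, e_wires_pj, alpha):
    """Ebanked = Emem + alpha * Ewires."""
    return e_mem_pj + alpha * e_wires_pj


if __name__ == "__main__":
    second_bank = wire_energy_pj(15)          # next 64x32 bank, 15 um away
    upper_right = wire_energy_pj(66 + 31)     # upper right 256x32 bank
    print(round(second_bank, 2), round(upper_right, 1))
    # alpha = 0.5 is a placeholder activity factor, not a value from the text.
    print(banked_energy_pj(0.24, second_bank, alpha=0.5))
```

The two distances given in the text reproduce the 0.23 pJ and 1.5 pJ entries of the Ewires row in Fig. 4.5.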
4.5 Methodology for memory exploration
This chapter introduces many architectural and design parameters that exponentially
increase the search space. In order to keep the chapter focused on communication
energy, we choose to fix some of the parameters from Chapter 3. In particular, we
focus on a 22 nm LP process (LSTP for memories), with pass-gate logic, no power
gating, and a single-Vdd. Since we do not scale Vdd and we use an LP process, there
is not much benefit to transmission gates or power gating, and this architecture is
close to commercial low-power FPGAs such as the Cyclone V (except for the memory
architecture, which we explore in this chapter).
Figure 4.6: Effect of memory block activation and output width selection on energy
consumption. Panels: (a) 2K×32 (W=32); (b) 2K×32 (W=4); (c) sweep of W (4, 8, 16,
32) with energy (pJ) broken down into logic, route, and mem. Diagram labels: 8K
memory blocks, addr[10:0], addr[10:8], addr[7:0], en, 4b and 32b data.
4.5.1 Power-optimized memory mapping
When mapping logical memories onto physical memories, FPGA tools can often
choose to optimize for either delay or energy using power-aware memory balanc-
ing [111]. For example, when implementing a 2K×32b logical memory using eight
256×32b physical memories, we could choose to read W = 4b from each memory
(delay-optimized, Fig. 4.6b). Since each memory internally reads at the full, native
width, the cost of the memory operation is multiplied by the number of memory
blocks used. Alternatively, we could read W = 32b from only one of the memories
(Fig. 4.6a), in which case only one memory is activated at a time (reducing mem-
ory energy), but extra logic and routing overhead is added to select the appropriate
memory and data. The power-optimized case often lies between these extremes. For
example, in the experiment in Fig. 4.6c, the optimum is to activate 2 memories at
once and read W = 16b from each. In the figure, the output width, W, corresponds
to the number of bits used from each memory per read (in this case, there are 32/W
memories activated per read). The figure also shows a diagram for a 1K×32 memory
using four 8K memory blocks.
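The shape of this trade-off can be illustrated with a toy sweep. The cost model below is a deliberately simple assumption of ours, not the model used in the experiments: each activated block pays one full-width read, and the extra select logic and routing is approximated as a linear cost in the number of block groups. The energy values are placeholders chosen only to exercise the code.

```python
def sweep_output_width(logical_width, n_blocks, e_block_read, select_cost_per_group):
    """Return {W: estimated energy} for each power-of-two output width W.

    Assumes logical_width and n_blocks are powers of two, and that the
    select overhead grows linearly with the number of block groups
    (a hypothetical stand-in for the logic and routing overhead).
    """
    energies = {}
    w = logical_width // n_blocks          # narrowest slice: every block is active
    while w <= logical_width:
        active = logical_width // w        # blocks read per access (32/W in the text)
        groups = n_blocks // active        # block groups that must be selected among
        energies[w] = active * e_block_read + select_cost_per_group * (groups - 1)
        w *= 2
    return energies


if __name__ == "__main__":
    # 2K x 32b logical memory built from eight 256 x 32b blocks.
    result = sweep_output_width(32, 8, e_block_read=1.0, select_cost_per_group=0.5)
    print(result, min(result, key=result.get))
```

With these placeholder costs the minimum falls at an intermediate width, but the location of the optimum depends entirely on the relative cost of a block read and the select overhead, which is why the sweep is performed per application memory.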
Unfortunately, the VTR flow does not perform this kind of trade-off: it always
optimizes for delay. Odin decomposes the memories into individual output bits [98],
and the packer packs together these 1-bit slices as much as possible within the memory
blocks to achieve the intended width [80]. In fact, VTR memories do not have a
clock-enable, so they must be activated all the time. Instead, we use VTR architectures
with special memory block instantiations that contain a clock-enable (VTR does
not know that they are memories; it treats them like black boxes). We modify
VTR’s architecture-generation script (arch_gen.py) to support these blocks and add
a p-opt stage before Odin to perform power-optimized memory mapping based on the
memories available in the architecture. This includes performing memory sweeps as
illustrated in Fig. 4.6c to select the appropriate mapping for each application memory.
For the robust architecture identified in Sec. 4.7, we find that mapping without p-
opt adds 10% geomean energy overhead, comparable to the 6% benefit reported in
[111]. Not using p-opt adds 41% worst-case energy overhead, suggesting that this
optimization is more important for the designs with high memory overhead. Our
p-opt code and associated VTR architecture generation script can be found online
[56].
The example architectures in VTR have two modes for memories: single-ported
and dual-ported. When the data width is 32b, single-ported memories can read 32b
at once, but dual-ported ones can only read 16b at once per port (since the total
number of bits read must be 32b). However, this limitation is unnecessary when we
know that the two ports do not both read (or write) at the same time. Therefore,
we have added a third mode for our memories: simplex dual-ported. Then, we can
identify when a benchmark’s dual-ported memory is actually in simplex mode and
avoid doubling the number of memories used unnecessarily.
4.5.2 Energy and area of memory blocks
As it is, VTR assigns one type of block to each column on the FPGA (logic cluster,
multiplier, or memory), and can give them different heights, but assumes the same
horizontal segment length crossing each column. However, some memories can occupy
a much larger area than a logic tile, and laying them out vertically to fit in one
logic tile width would be inefficient. For energy efficiency, the memories should be
closer to a square shape, and to that end, we modify VTR to allow the horizontal
segment length crossing memories to be longer (which costs more routing energy,
hence Emseg(M) ≠ Eseg in Sec. 4.3). We fix the height of the memory (h, an integer,
the number of basic logic tile heights) ahead of time, but keep the horizontal memory
segment length (Lmseg) floating:
h = \left\lceil \frac{\sqrt{A_{sw}(W_0) + A_{mem}}}{\sqrt{A_{logic}(W_0)}} \right\rceil    (4.22)
Here W0 is a typical channel width for the architecture and benchmark set. We use
W0 = 80. Then, when VPR finds the exact channel width, Wact, and hence the
tile-length and area (Alogic), we can adjust Lmseg accordingly:
L_{mseg} = \frac{A_{sw}(W_{act}) + A_{mem}}{h \sqrt{A_{logic}(W_{act})}}    (4.23)
Amem is the area for the memory obtained from CACTI, and Asw is the switch area
required to connect the memory to the FPGA interconnect. We obtain Asw from
VPR’s low-level models, similar to the way it computes Alogic = Aluts + Asw.
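Eqs. 4.22 and 4.23 amount to a two-step calculation: fix an integer height from the typical channel width, then let the segment length absorb the remaining area once the actual channel width is known. A minimal sketch follows; the function names and all area values in the example are hypothetical, not taken from CACTI or VPR.

```python
import math


def memory_tile_height(a_sw_w0, a_mem, a_logic_w0):
    """Eq. 4.22: memory height in logic-tile heights, fixed ahead of time."""
    return math.ceil(math.sqrt(a_sw_w0 + a_mem) / math.sqrt(a_logic_w0))


def memory_segment_length(a_sw_wact, a_mem, a_logic_wact, h):
    """Eq. 4.23: horizontal segment length over the memory for the actual width."""
    return (a_sw_wact + a_mem) / (h * math.sqrt(a_logic_wact))


if __name__ == "__main__":
    # Hypothetical areas in um^2: memory 9500, switches 500 at W0, logic tile 900 at W0.
    h = memory_tile_height(a_sw_w0=500, a_mem=9500, a_logic_w0=900)
    # After routing, assume switches grow to 640 and the logic tile to 1024.
    l_mseg = memory_segment_length(a_sw_wact=640, a_mem=9500, a_logic_wact=1024, h=h)
    print(h, l_mseg)
```

In this example the memory tile would ideally be 100 µm on a side, which rounds up to 4 logic-tile heights of 30 µm, and the segment length then comes out in the same length units as the square root of the tile area.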
4.5.3 Benchmarks
To explore the impact of memory architecture, we use the VTR 7 Verilog benchmarks
[81] and a set of tunable benchmarks that allow us to change the parallelism level,
P , in Sec. 4.6. Tab. 4.1 summarizes the benchmarks that have memories. We expect
future FPGA applications to use more memory than the VTR 7 benchmarks. Some of
them, such as stereovision, only model the compute part of the application and assume
off-chip memory. We expect this memory to move on chip in future FPGAs. The
tunable benchmarks allow us to explore parallelism tuning and provide better coverage
of the large memory applications we think will be more typical of future FPGA
applications. For this reason, we do not expect a simple average of the benchmarks,
such as the geometric mean, to be the most meaningful metric for the design of future
FPGAs: it is weighted too heavily by memory-free and memory-poor applications.
Table 4.1: Memory requirements for the benchmarks
Benchmark             Mem Bits          # Memories      Largest Mem
VTR
boundtop              32K               1               1K×32
ch_intrinsics         256               1               32×8
LU8PEEng              45.5K             9               256×32
mcml                  5088K             10              64K×36
mkDelayWorker32B      520K              9               1K×256
mkPktMerge            7.2K              3               16×153
mkSMAdapter4B         4.35K             3               64×60
or1200                2K                2               32×32
raygentop             5.25K             1               256×21
Tunable
MMul                  (N^2+N)*32        2P              (N^2/P)×32
GMM                   160 N^2           P               (N^2/P)×160
Sort                  ≈(128N+4NlogN)    log(N/P)-1+2P   (N/P)×[32+logN]
FFT (-twiddle)        ≈224N             4P              (N/2P)×56
WinF (-line buffer)   16 N^2            P               (N^2/P)×16
We implemented the tunable benchmarks in Bluespec SystemVerilog [15]; they
are the following:
GMM: Gaussian Mixture Modeling [43] for an N × N pixel image, with 8b per
pixel and M = 5 models. P pixels are computed every cycle (Npe = P ). More details
on GMM are provided in Chapter 7. This operation is embarrassingly parallel, since
each PE is independent of the other ones (See Fig. 4.7). This benchmark has very
high locality (prent = 0).
Figure 4.7: GMM structure and parallelization. Panels: (a) GMM operation (labels:
Registered image, Read pixel, Read params, Params mem, Update params, Decide if
pixel is foreground, PE); (b) Parallel GMM with 4 PEs (exploit locality, memory
sizes /4, routes shorter).
WinF: 5×5 Gaussian window filter for an N × N pixel image, with 16b per pixel
and power-of-2 coefficients, see Fig. 4.8. P = 1 uses line buffers so that 1 main
memory read and 4 line buffer reads and writes allow a throughput of 1 pixel per
cycle. P = 2 and P = 4 extend the filter’s window, share line buffers, and compute
2 and 4 pixels per cycle, respectively. For P > 4, every time P is doubled, the image
is divided into two sub-images, similar to the GMM benchmark (Npe = P). This
benchmark has medium locality, since each pixel needs to share information with its
neighbors, but not with pixels that are farther away (prent = 0.5).
MMul: N × N matrix-multiply (A × B = C), with 32b integer values and
datapaths (See Fig. 4.11, Sec. 4.6.1, Npe = P ). This benchmark has low locality
(prent = 2/3).
FFT: N-point 28b fixed-point complex streaming Radix-2 Fast Fourier Transform,
with P × log(N/P)-stage FFTs followed by log(P) recombining stages (Npe =
P log(N)), see Fig. 4.10 and Fig. 4.9. The streaming portion has high locality
(prent = 0), while the spatial combining section has high communication requirements
(prent = 1).
Sort: N-point 32b streaming mergesort [64], where each datapoint also has a
log(N)-bit index. One value is processed per cycle, and the parallelism comes from
implementing the last log(P) stages spatially (Npe = log(N/P) + (P − 1)). We build
a binary reduce tree to select the final output, so that this streaming implementation
has prent = 0. An example with N = 128 and P = 1 is shown in Fig. 5.2.
Figure 4.8: Window filter configurations. Panels: (a) 5x5 Gaussian Filter (5x5 window
of neighbor pixels gives 1 output pixel); (b) Single memory (5 cycles per pixel);
(c) Add 4 Line buffers (1 pixel per cycle); (d) Add 3 PEs (total of 4) (4 pixels per
cycle). Legend: Read from Register, Read from memory, Read from line buffer.
Figure 4.9: FFT butterfly, radix R = 2 and 4. Radix 2 butterfly: inputs a, b; outputs
a + bw and a - bw (w: input twiddle factor). Radix 4 butterfly: inputs a, b, c, d;
outputs a+bw+cx+dy, a-jbw-cx+jdy, a-bw+cx-dy, a+jbw-cx-jdy (w, x, y: twiddle factors).
Figure 4.10: Basic FFT network for N = 16, R = 2, P = 2 (Stage 0 to Stage 3 between
inputs and outputs; memories A0, B0, A1, B1; PE1 and PE2)
4.5.4 Limit study and mismatch lower bound
Sec. 4.6 and 4.7 show the energy consumption for different applications and memory
architectures. As we change the architecture, the energy of a benchmark changes,
and for each of them, we can identify a minimum energy point: an energy-minimizing
architecture. In order to identify bounds on the mismatch ratio, we also set up limit-
study experiments. Our limit-study assumes that each benchmark gets exactly the
physical memory depth it needs, as if the FPGA were an ASIC. Therefore, there is
no overhead for using memories that are too large (no need for internal banking as in
Sec. 4.3) or too small (no need to combine multiple memory blocks as in Sec. 4.5.1).
We further assume that the limit-study memories have the same height as that of
a logic tile, making them widely available and keeping the interconnect energy low
for vertical memory crossings. Finally, we place memory blocks every 2 columns
(dm = 1), so that place-and-route tools can always find a memory right where they
need one. To avoid overcharging for unnecessary memory columns, we modify routing
energy calculations, and ignore horizontal memory-column crossings for the limit
study (Emseg = 0). We also want to minimize the effect of width mismatch, for which
we could use a similar trick as for the depth: use a large width (e.g. 1024), but only
pay for what the memory actually uses (e.g. 32). However, this artificially increases
the channel width since each input and output data bit needs to access the routing
network (this is also the case with the depth trick, but to a lesser extent, since each
doubling in depth only adds one bit to the address). Therefore, in order to minimize
the effect of width mismatch, we run multiple limit studies, with memory widths of
8, 16, 32, 64, and 128, and choose the one that achieves the lowest energy. CACTI energy
does not decrease for width below 8.
4.6 Parallelism tuning
In Sec. 4.2, we saw that the amount of parallelism used to perform a task affects its
energy consumption and that the optimum level of parallelism grows with problem
size (Eq. 4.11 and its implications). In this section, we show experimental results from
optimizing the parallelism for the tunable benchmarks at different data set sizes.
4.6.1 Example: MMul
Let us first review a specific example of a task where parallelism can be tuned to
illustrate how memory is decomposed and how communication requirements change.
Fig. 4.11 shows the shape of an N × N by N × N matrix-multiply A × B = C for
different parallelism levels P (N = 4 is shown). The computation is decomposed by
columns, with each PE performing the computation for N/P columns of the matrix.
The B data is streamed in first and stored in P memories of size N^2/P, then A is
streamed in row-major order. Each A datapoint (A[i, k]) is stored in a register, data
for each column (j) is read from each B memory, a multiply-accumulate is computed
(C[i, j] = C[i, j] + A[i, k] · B[k, j]), and the result is stored in a C memory of size
N/P.¹ Once all the A datapoints of a row have been processed, the results of the
multiply-accumulates can be streamed out, and the C memories can be used for the
next row. When P = N , C does not need memories. Either way, increasing P keeps
the total number of multiply-accumulates and memory operations constant. However,
since the memories are organized in smaller banks, each memory access now costs less,
and energy is reduced, as long as the interconnect-per-PE does not increase too much.
¹This is different from the matrix-multiply in Sec. 4.3, where C was stored in an output memory
of size N^2 (P = 1), keeping only one size of memory for the application.
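As an illustration of this decomposition, the following Python sketch models the dataflow in software. It is a minimal sketch under our own assumptions (P divides N, memories are plain lists, and the names matmul_column_parallel, b_mem, c_mem, and macs are hypothetical); it is not the hardware implementation evaluated in this chapter. It shows that the multiply-accumulate count stays at N^3 for any P, while each B bank shrinks to N^2/P words.

```python
def matmul_column_parallel(A, B, P):
    """Multiply two N x N matrices with P PEs; each PE owns N/P columns of B and C.

    Returns (C, macs, b_bank_words), where macs is the total number of
    multiply-accumulates and b_bank_words is the size of one B memory.
    """
    N = len(A)
    assert N % P == 0, "this sketch assumes P divides N"
    cols = N // P

    # B is streamed in first: the B memory of PE p holds its N/P columns (N*N/P words).
    b_mem = [[[B[k][p * cols + c] for c in range(cols)] for k in range(N)]
             for p in range(P)]

    C = [[0] * N for _ in range(N)]
    macs = 0
    for i in range(N):                                  # A is streamed in row-major order
        c_mem = [[0] * cols for _ in range(P)]          # C memory: N/P words per PE
        for k in range(N):
            a_reg = A[i][k]                             # A datapoint held in a register
            for p in range(P):
                for c in range(cols):
                    c_mem[p][c] += a_reg * b_mem[p][k][c]
                    macs += 1
        for p in range(P):                              # row finished: stream results out
            for c in range(cols):
                C[i][p * cols + c] = c_mem[p][c]
    return C, macs, N * cols


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    for P in (1, 2):
        C, macs, bank = matmul_column_parallel(A, B, P)
        print(f"P={P}: C={C}, MACs={macs}, words per B bank={bank}")
```

The sketch only counts operations; the energy argument in the text comes from the fact that an access to a smaller bank costs less, which a software model like this does not capture.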
[Figure 4.11 graphic: PE arrangements for P=1, P=2, and P=4. Legend: register for matrix A data; multiply-accumulate; memory for matrix B data; memory for matrix C data.]
Figure 4.11: Parallelism impact on memory and interconnect requirements for a
(4 × 4)^2 matrix-multiply
4.6.2 Parallelism tuning with limit-study architecture
In order to explore the effect of parallelism on energy without the bias introduced
by a fixed memory architecture, we first sweep the size and parallelism of our five
tunable benchmarks using the limit-study architecture described in Sec. 4.5.4. This
is how we obtain Fig. 4.1 for WinF. We can see that as the number of PEs increases,
the total energy decreases, reaches a minimum, then starts increasing. The decrease
is due to decreasing memory energy (Fig. 4.1b). The increase is due to increasing
interconnect energy. We also notice that the optimum number of PEs (Popt) increases
with increasing problem size. We run the same experiment for the other benchmarks,
normalize the energy at Popt to the energy at P = 1, and we get Fig. 4.12a. We see
that the other benchmarks follow the same trend. The benefit of tuning parallelism to
the optimum number of PEs grows with the problem size. For the largest sizes shown,
tuning parallelism provides energy savings of 2.1× for WinF, 1.7× for MMul, 1.1×
for FFT, 2.4× for Sort, and 1.4× for GMM. It should be clear from both Fig. 4.12a
and Eq. 4.10 that the parallelism benefit will continue to grow with problem size.
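The sweep itself is simple to express in code. The sketch below is illustrative only: find_popt and toy_energy are hypothetical names, and the toy energy function (a memory term that falls as sqrt(1/P) plus an interconnect term that rises as sqrt(P)) is a placeholder we assume for demonstration, not the model of Sec. 4.2 and not the measured data behind Fig. 4.12. It merely reproduces the qualitative trend: Popt grows with problem size, and E(P = Popt)/E(P = 1) shrinks.

```python
import math


def find_popt(energy_fn, candidates):
    """Sweep candidate PE counts; return (Popt, E(Popt) / E(P=1)).

    `candidates` must include 1 so the result can be normalized to P = 1.
    """
    energies = {P: energy_fn(P) for P in candidates}
    popt = min(energies, key=energies.get)
    return popt, energies[popt] / energies[1]


def toy_energy(mem_words, wire_cost=4.0):
    """Hypothetical per-operation energy model (an assumption, not Eq. 4.11)."""
    def energy(P):
        memory = math.sqrt(mem_words / P)        # smaller banks cost less per access
        interconnect = wire_cost * math.sqrt(P)  # more PEs mean longer routes
        return memory + interconnect
    return energy


CANDIDATES = [1, 2, 4, 8, 16, 32, 64]

if __name__ == "__main__":
    for mem_words in (1024, 4096, 16384):
        popt, ratio = find_popt(toy_energy(mem_words), CANDIDATES)
        print(f"mem_words={mem_words}: Popt={popt}, E(Popt)/E(1)={ratio:.2f}")
```

In an actual flow, energy_fn would be replaced by the energy reported by the place-and-route and power-estimation tools for each value of P.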
4.6.3 Parallelism tuning with concrete FPGA architecture
We now explore the benefits of parallelism with a concrete FPGA architecture using
internally-banked 16Kb memory blocks with width = 32 and dm = 7 (the robust
architecture identified in Sec. 4.7). The optimum level of parallelism may change
from the limit study to adapt to the given physical architecture, but the major trend
[Figure 4.12 graphic: two panels, a) Limit Study and b) Robust Architecture. x-axis: Problem Size Multiplier (1, 4, 16, 64, 256); y-axis: E(P = Popt)/E(P = 1); one curve per benchmark (MMul, WinF, Sort, FFT, GMM). The optimum PE count is shown next to each point.]
Base problem sizes: MMul N = 16 × 16; FFT N = 1024; WinF N = 128 × 128; GMM N = 128 × 128; Sort N = 512
Figure 4.12: Optimum parallelism level versus problem size
still holds. Fig. 4.12b shows the same experiments as before, on this concrete FPGA
memory organization. This time, the energy savings at Popt compared to P = 1 are
larger than in the limit study case: 4.7× for WinF, 2.6× for MMul, 1.1× for FFT,
3.3× for Sort, and 3.0× for GMM. In addition to finding the right level of parallelism,
these concrete designs benefit from selecting a logic-memory balance that reduces the
mismatch overhead.
The largest design in Fig. 4.12b is the 512 × 512, 8 PEs GMM, using 250,000
LUTs, 90,000 registers, 40Mb of memory, a 206 × 206 array and a channel width of
128. This fits comfortably on modern FPGAs. In fact, [60] showed this phenomenon
for a subset of the benchmarks (and fixed problem sizes) on the Stratix IV. In cases
where the ideal parallelism level is too large, we may be limited by the FPGA size.
This is, for example, the case for larger GMM sizes, not shown in Fig. 4.12b, which
require more PEs, and hence larger arrays, than our tools can currently support.
4.7 Memory exploration
Sec. 4.3 suggested that we could build a robust FPGA that bounds the worst-case
energy mismatch. In this section, we explore the different memory architecture pa-
rameters (depth, width, dm, internal banking) and identify optimum regions of oper-
ation.
4.7.1 Memory block size sweep
We start with the simplest memory organization that uses a single memory block size
and no internal banking (Fig. 4.13) at a fixed dm = 7 and width = 32. For comparison,
energy is normalized to the lower-bound obtained using the limit study. We include all
the benchmarks from Sec. 4.5.3, including multiple sizes for the tunable benchmarks,
where we set the number of PEs to the optimum values found in Fig. 4.12a. Most
of the curves have an energy-minimizing memory size between the two extreme ends
(1Kb and 256Kb), including the geomean curve. Benchmarks with little memory
have an energy-minimizing point at the smallest memory size (1Kb). Benchmarks
with no memory have a close-to-flat curve, paying only to route over memories, but
not for reads from large memories. The 4Kb memory architecture minimizes the
geometric mean energy overhead of all the benchmarks at 37%. As noted (Sec. 4.5.3),
the geometric mean is weighted heavily by the many benchmarks with little or no
memory, so may not be the ideal optimization target for future FPGA applications.
gmm_N256_P4 and mkPktMerge define the maximum energy overhead curve and suggest
that an 8Kb memory minimizes worst-case energy overhead at 110% of the lower
bound. This can be seen by looking at both Fig. 4.13a and Fig. 4.13b at the same
time (putting them on top of each other). We highlight the curves that set the
maximum in Fig. 4.13c.
Fig. 4.14 shows the detailed breakdown of energy components for three bench-
marks, where the memory capacity is varied, both for the normal case (top row) and
[Figure 4.13 graphic: three panels plotting % Energy Overhead Compared to Limit Study (y-axis, 0 to 400) against Physical Memory Size (Kb), dm=7 (x-axis: 1, 2, 4, 8, 16, 32, 64, 128, 256).
a) VTR benchmarks. With memory: boundtop, ch_intrinsics, LU8PEEng, mcml, mkDelayWorker32B, mkPktMerge, mkSMAdapter4B, or1200, raygentop. Without memory: bgm, blob_merge, diffeq1, diffeq2, sha, stereovision0, stereovision1, stereovision2, stereovision3. Also shown: geomean.
b) Tunable benchmarks: fft_N1K_P1, fft_N4K_P2, fft_N16K_P4, sort_N512_P1, sort_N2K_P2, sort_N8K_P4, sort_N32K_P8, winf_N128_P8, winf_N256_P16, winf_N512_P32, gmm_N128_P4, gmm_N256_P4, mmul_N16_P2, mmul_N32_P4, mmul_N64_P8, mmul_N128_P8, mmul_N256_P16. Also shown: geomean.
c) Highlighting the maximum line: mkPktMerge, sort_N32K_P8, gmm_N256_P4.]
Figure 4.13: Sweep of physical memory block size at fixed [dm=7, width=32]
[Figure 4.14 graphic: Energy (pJ) versus Mem Sizes (Kb) (1, 2, 4, 8, 16, 32, 64, 128, 256) for Sort_N2K_P2, mkSMAdapter4B, and WinF_N256_P16. Top row: normal memories; bottom row: internal banking. Energy components: logic, route, mem; the line marks the limit-study lower bound.]
Figure 4.14: Detailed breakdown of energy vs memory block size [dm=7, width=32]
for the internally-banked memories (bottom row). We also show a blue line highlighting
the lower bound obtained from the limit study. Most benchmarks have the shape
of Sort_N2K_P2, with an energy-minimizing memory size between 4Kb and 32Kb.
Small benchmarks with small memories have the shape of mkSMAdapter4B, with large
increases in memory energy with increasing memory block size. The bottom row
shows how internal banking reduces this effect. It may even allow the minimum energy
point to shift. For example, in WinF_N256_P16 the minimum shifts from 16Kb
to 64Kb, reducing total energy at the energy-minimizing block size from 410 pJ to
370 pJ, or by 11%.
4.7.2 Impact of dm
In Sec. 4.3 we showed analytically why the spacing between memory columns, dm,
should be chosen to balance logic and memory in order to minimize worst-case en-
ergy consumption. For simplicity, we limited Fig. 4.13 to only use dm = 7. Since
the optimal values of dm may vary among benchmarks, Fig. 4.15 shows geomean (a)
and worst-case (b) energy overheads when varying both memory block size and dm,
still at a fixed width = 32. Without adding internal banking, sweeping dm allows
us to identify a lower energy point with 98% worst-case overhead (dm = 8), versus
the 110% we found previously when only looking at dm = 7. We see broad ranges
of values that achieve near the lowest geometric mean point, with narrower regions
that minimize the worst-case overhead. The heatmap shows that overhead has a
stronger dependence on memory block size than memory spacing. If we approximate
the Cyclone V as 8Kb and dm = 9, and the Stratix V as 16Kb and dm = 9,² we can
see that these two commercial architectures are around the energy-minimizing valley
for both geometric mean and worst-case. However, we can improve energy further
by using internal banking, which broadens the energy-minimizing valleys, shifts them
towards larger memory sizes, and overall reduces the overhead. Compared to the
16Kb, non-internally-banked commercial architectures (≈ Stratix V), we can reduce
the worst-case by 46% ((180-98)/180) by tuning for robust energy, and using an ar-
chitecture closer to that of the Cyclone V. Then, we can reduce the energy by another
38% ((98-61)/98) by using internal banking and re-tuning the memory organization.
This also reduces the geomean by 36% ((42-27)/42). We achieve these benefits with
the 16Kb, internally-banked, dm = 7 architecture. Since our logic block is smaller,
our energy minimizing cases tend to place the memories more frequently than the
commercial architectures, closer to the robust balance point identified analytically in
Sec. 4.3. In [57] we also explored architectures with two memory sizes, and found
that without internal banking, they achieved a similar reduction in worst-case over-
head to one memory with internal banking. Combining internal banking and two
memory sizes can further reduce energy overheads, at the expense of higher area.
Tab. 4.2 shows that our robust, internally-banked memory architecture has modest
²Modeled points have square logic clusters and memories, whereas real Stratix and Cyclone
devices are rectangular. Modern Cyclone and Stratix logic blocks have 20 6-LUTs, which are
larger logic blocks than we use here.
Table 4.2: Area comparison of select memory organizations

Architecture          Internal   Size   Width   dm   Maximum    Relative
                      banking?                       Overhead   Area
robust, energy-min.   yes        16Kb   32      7    61%        1.09×
≈ “Cyclone V”         no         8Kb    32      9    98%        1.00×
≈ “Stratix V”         no         16Kb   32      9    180%       1.07×

area impact compared to the alternatives.
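The relative reductions quoted in this section follow directly from the overhead percentages. The short Python sketch below reproduces the arithmetic; the function name relative_reduction is ours, and the inputs are the overhead values stated in the text and in Tab. 4.2.

```python
def relative_reduction(before, after):
    """Fractional reduction when an overhead drops from `before` to `after`."""
    return (before - after) / before


# Worst-case overheads (% above the limit-study lower bound), from Tab. 4.2.
WORST_STRATIX_LIKE = 180   # 16Kb, no internal banking, dm = 9
WORST_CYCLONE_LIKE = 98    # 8Kb, no internal banking
WORST_ROBUST = 61          # 16Kb, internal banking, dm = 7

# Geomean overheads quoted in the text.
GEOMEAN_BEFORE = 42
GEOMEAN_AFTER = 27

tuning = relative_reduction(WORST_STRATIX_LIKE, WORST_CYCLONE_LIKE)
banking = relative_reduction(WORST_CYCLONE_LIKE, WORST_ROBUST)
geomean = relative_reduction(GEOMEAN_BEFORE, GEOMEAN_AFTER)

if __name__ == "__main__":
    print(f"tuning for robust energy: {tuning:.0%}")
    print(f"adding internal banking:  {banking:.0%}")
    print(f"geomean improvement:      {geomean:.0%}")
```

Note that these are reductions in overhead relative to the limit-study lower bound, not reductions in total energy.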
4.7.3 Impact of memory width
Fig. 4.16 shows geomean (a) and worst-case (b) energy overheads when varying mem-
ory capacity and data width, with and without internal banking. Each point on the
heatmap shows the energy overhead at the dm that minimizes it. The lower right
corner is missing because we do not explore the cases where width > depth. Each
heatmap shows an energy-minimizing valley running along the bottom-left to top-
right axis. Once again, we observe that using internal banking broadens the valley,
highlighting the most robust architectural point at 16Kb and width = 32, with inter-
nal banking. Not shown on the heatmap, this point has dm = 7. This architecture
keeps the worst-case overhead below 61% and the geomean overhead below 27% across
mismatches in memory block size, memory column spacing, and memory width.
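Picking the most robust point amounts to an argmin over the explored grid. The following Python sketch does this for the worst-case, internally-banked panel of Fig. 4.16, using the overhead values as they appear in that heatmap (each already taken at its best dm). The names WIDTHS, WORST_CASE_INT_BANK, and most_robust are hypothetical, and rows are shorter where width > depth was not explored.

```python
WIDTHS = [8, 16, 32, 64, 128]

# Worst-case % overhead, internal banking; key = memory size in Kb,
# list index = position in WIDTHS.
WORST_CASE_INT_BANK = {
    1:   [170, 180, 210],
    2:   [150, 130, 160],
    4:   [110, 110, 120, 200],
    8:   [87, 87, 96, 140],
    16:  [110, 65, 61, 88, 260],
    32:  [210, 120, 84, 110, 210],
    64:  [390, 180, 100, 120, 210],
    128: [570, 340, 200, 170, 240],
    256: [1000, 600, 380, 280, 290],
}


def most_robust(table, widths):
    """Return (overhead, size_kb, width) for the cell with the lowest overhead."""
    return min(
        (overhead, size_kb, widths[i])
        for size_kb, row in table.items()
        for i, overhead in enumerate(row)
    )


if __name__ == "__main__":
    overhead, size_kb, width = most_robust(WORST_CASE_INT_BANK, WIDTHS)
    print(f"min worst-case overhead {overhead}% at {size_kb}Kb, width {width}")
```

The result matches the 16Kb, width = 32 point named in the text.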
4.7.4 Sensitivity
The best memory sizes and the magnitude of benefits achievable are sensitive to the
relative cost of memory energy compared to interconnect energy. Since PowerPlay
[4] estimates that the Altera memories are more expensive (about 3× the energy—
perhaps because the Altera memories are optimized for delay and robustness rather
than energy) than the energy-delay-squared-optimized memories CACTI predicts are
possible, it is useful to understand how this effect might change the selection of
[Figure 4.15 data, reconstructed from the heatmaps. Rows: dm; columns: Mem Size (Kb); width = 32; entries are % overhead.]

a) Geomean (% overhead), no internal banking
dm \ Kb     1     2     4     8    16    32    64   128   256
 1         74    66    68    98    91   150   140   260   360
 2         59    51    50    65    63    99   100   180   250
 3         58    47    41    52    54    80    84   150   210
 4         60    45    38    46    48    72    77   140   190
 5         62    46    37    43    44    66    71   130   180
 6         63    47    37    40    41    62    66   120   160
 7         66    47    37    39    41    60    65   120   170
 8         64    47    36    39    41    57    66   110   160
 9         73    49    39    39    42    57    66   110   160
10         68    51    39    39    42    56    64   110   160

a) Geomean (% overhead), internal banking
dm \ Kb     1     2     4     8    16    32    64   128   256
 1         77    68    74   100    84   130   120   210   270
 2         61    51    52    63    52    78    73   120   160
 3         60    47    42    48    42    58    54    89   120
 4         61    45    39    42    35    49    46    75   100
 5         63    45    38    38    31    41    39    64    86
 6         64    46    37    34    28    37    35    55    77
 7         68    46    37    33    27    35    34    54    75
 8         65    47    36    33    28    32    34    51    74
 9         74    49    39    33    29    33    34    51    72
10         70    50    39    33    29    31    33    50    71

b) Worst-case (% overhead), no internal banking
dm \ Kb     1     2     4     8    16    32    64   128   256
 1        250   190   140   170   200   390   460  1100  1700
 2        200   160   130   140   180   360   440  1000  1800
 3        230   150   120   120   160   330   420  1000  1700
 4        270   190   120   110   160   320   450  1000  1700
 5        300   200   120   100   150   310   430  1100  1800
 6        260   170   130   100   150   320   440  1000  1700
 7        290   180   140   110   160   320   460  1000  1800
 8        310   150   110    98   170   340   480  1100  1800
 9        340   220   140   100   180   340   500  1100  1800
10        360   180   130   100   190   360   500  1100  1800

b) Worst-case (% overhead), internal banking
dm \ Kb     1     2     4     8    16    32    64   128   256
 1        260   200   150   160   140   230   220   400   500
 2        210   170   150   130   110   160   150   250   410
 3        230   160   140   110    87   130   120   220   380
 4        270   200   140   100    78   130   130   200   380
 5        310   210   140   100    76   110   120   240   430
 6        270   200   150   100    77    84   100   200   380
 7        290   210   150   110    61    86   120   200   410
 8        310   160   120    96    74    85   140   240   420
 9        340   230   160   100    75    93   160   240   470
10        360   200   150   100    81   100   150   270   430

Figure 4.15: Energy overhead versus memory block size and dm
architecture. Therefore, we perform a sensitivity analysis where we multiply the
energy numbers reported by CACTI by factors of 2× and 3× (Fig. 4.17). Without
internal banking, the relative overhead cost of using an oversized memory is increased,
shifting the energy-minimizing bank size down to 4Kb (instead of 8Kb previously).
With internal-banking, the overhead remains around 60% for all cases.
4.8 Chapter conclusions
Communication energy dominates computations, whether the communication is mov-
ing data in and out of memories or moving data over wires to different processing
points on the chip. Tuning the level of parallelism exploited for an application can
shift this communication energy between memories and interconnect, often changing
[Figure 4.16 data, reconstructed from the heatmaps. Rows: Mem Size (Kb); columns: memory width; entries are % overhead at the best dm; "-" marks unexplored points where width > depth.]

a) Geomean (% overhead), no internal banking
Kb \ width     8    16    32    64   128
  1           51    50    58     -     -
  2           45    42    45     -     -
  4           46    37    36    47     -
  8           50    40    39    43     -
 16           62    46    41    42    56
 32           91    67    56    44    55
 64          130    92    64    59    71
128          180   130   110    90   110
256          250   180   160   150   130

a) Geomean (% overhead), internal banking
Kb \ width     8    16    32    64   128
  1           46    47    60     -     -
  2           35    37    45     -     -
  4           31    30    36    50     -
  8           32    29    33    45     -
 16           34    27    27    35    59
 32           46    35    31    35    57
 64           65    42    33    35    54
128           87    63    50    41    59
256          130    90    71    63    69

b) Worst-case (% overhead), no internal banking
Kb \ width     8    16    32    64   128
  1          170   190   200     -     -
  2          140   120   150     -     -
  4          170   110   110   150     -
  8          260   130    98   110     -
 16          480   260   150   150   200
 32          920   520   310   190   230
 64         1600   910   420   350   350
128         2600  1600  1000   660   680
256         4000  2500  1700  1300   860

b) Worst-case (% overhead), internal banking
Kb \ width     8    16    32    64   128
  1          170   180   210     -     -
  2          150   130   160     -     -
  4          110   110   120   200     -
  8           87    87    96   140     -
 16          110    65    61    88   260
 32          210   120    84   110   210
 64          390   180   100   120   210
128          570   340   200   170   240
256         1000   600   380   280   290

Figure 4.16: Energy overhead versus memory block size and data width
[Figure 4.17 data, reconstructed from the heatmaps. Rows: Mem Size (Kb); columns: memory width; entries are worst-case % overhead; "-" marks unexplored points where width > depth.]

a) Normal Memories (% overhead), CACTI ×2
Kb \ width     8    16    32    64   128
  1          110   130   180     -     -
  2          120    98   110     -     -
  4          260    95    89   130     -
  8          380   190   120   120     -
 16          700   380   230   200   250
 32         1400   770   470   260   290
 64         2300  1400   640   510   510
128         3900  2300  1500  1000  1000
256         5900  3600  2600  2000  1300

a) Normal Memories (% overhead), CACTI ×3
Kb \ width     8    16    32    64   128
  1           94   100   170     -     -
  2          150    92   100     -     -
  4          310   120    76   120     -
  8          460   240   140   130     -
 16          850   460   270   240   290
 32         1600   930   570   310   330
 64         2800  1600   770   610   610
128         4700  2800  1900  1200  1200
256         7100  4400  3100  2400  1600

b) Internal Banking (% overhead), CACTI ×2
Kb \ width     8    16    32    64   128
  1          110   120   180     -     -
  2          100   100   140     -     -
  4           82    85   110   220     -
  8           71    61    76   130     -
 16          120    66    56    89   300
 32          240   130    83   130   250
 64          450   200   110   140   270
128          650   370   220   190   310
256         1200   680   430   330   370

b) Internal Banking (% overhead), CACTI ×3
Kb \ width     8    16    32    64   128
  1           78    84   170     -     -
  2           74    81   120     -     -
  4           61    68   100   220     -
  8           67    49    63   130     -
 16          130    66    56    97   330
 32          250   130    82   140   290
 64          480   210   120   150   330
128          700   400   240   190   360
256         1300   730   460   370   420

Figure 4.17: Sensitivity of Fig. 4.16b (worst-case overheads) to CACTI estimates
the energy required for the computation. As a result, for each application and dataset
size, there is an optimal level of parallelism that minimizes energy. This minimum
energy point balances the energy spent communicating between processing elements
with the energy spent reading from local data memories. The optimal level of paral-
lelism grows with problem size, as does the energy benefit compared to a non-parallel
design. We show up to 4.7× energy reduction compared to the non-parallel design.
These communication results underscore the need for an energy-optimized on-chip
memory system, and the need to support flexible memory and processing systems that
can be tuned to the application and dataset size. We have shown how to size and
place embedded memory blocks to guarantee that energy is within a factor of 1.6
of the optimal organization for the application. On the benchmark set, we have
seen that a 32-bit-wide, 16Kb, internally-banked memory block keeps the worst-case
mismatch energy overhead below 61% compared to an optimistic limit-study lower
bound. Without internal banking, the Cyclone V memory organization of 32-bit
wide, 8Kb memories with dm = 9 achieves close to the smallest worst-case overhead
of 98%. While the memory organization is similar, the Cyclone V logic block is closer
to 10 6-LUTs, so represents a more logic rich design. Commercial architectures with
32-bit-wide, 16Kb, non-internally-banked memory blocks have a worst-case overhead
of 180%. Tuning for robust energy cuts the worst-case mismatch overhead by 46%,
and internal banking provides 38% savings on top of that.
Chapter 5
Lightweight Checking Classification
5.1 Introduction
The technique we are proposing relies on finding operations that have lightweight
checks. This raises several important questions: Why do we expect operations to have
lightweight checks? Which operations have lightweight checks, and which do not?
For those that do, how lightweight are the checks? Are they practical? Are they
lightweight enough to yield low overheads and large savings? This chapter addresses
these questions.
We first clarify our computational model in Sec. 5.2, and we suggest the possibility
of using differential reliability in combination with LWCs (Sec. 5.3). We define what
we mean by a “lightweight” check in Sec. 5.4. We then devise a classification system
for LWCs and contrast it with previous work in Sec. 5.5. Then, sections 5.6 through
5.10 each correspond to one of these classes and show in more detail some of the
applications they cover.
5.2 Computational model
To tolerate the fact that the computation will have errors at some rate, we checkpoint
its inputs and keep them until the LWC validates that the computation was correct
[Figure 5.1 graphic: block diagram with Input, Compute (at voltage Vcmp), Output, LWC (at voltage Vlwc), check, Commit previous, and Commit current.]
Figure 5.1: CMP and LWC computational structure
and the outputs are themselves checkpointed (Fig. 5.1). When the LWC sees an
error, the erroneous output is discarded and recomputed from the checkpointed input.
This scheme fits very naturally into a streaming compute model (e.g., [20]). If the
voltage is not decreased too much and the error rate remains reasonable, the rate
of recomputation will be low, the rollback costs negligible and the overall energy
reduced. We potentially set a different voltage for the CMP and the LWC (differential
reliability, Sec. 5.3).
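The checkpoint, compute, check, and recompute loop can be sketched in a few lines of Python. This is a software analogy under our own assumptions, not the hardware structure of Fig. 5.1: run_with_lwc, make_flaky_double, and checksum_check are hypothetical names, the retry bound is arbitrary, and the transient upset is simulated deterministically on the first call.

```python
def run_with_lwc(compute, check, x, max_retries=3):
    """Run compute(x); recompute from the checkpointed input until check passes.

    Returns (output, number_of_recomputations).
    """
    checkpoint = x                       # inputs are kept until the check validates
    for attempt in range(max_retries + 1):
        y = compute(checkpoint)
        if check(checkpoint, y):         # LWC validates: the output can be committed
            return y, attempt
        # LWC saw an error: discard y and recompute from the checkpoint
    raise RuntimeError("check kept failing; error rate is too high")


def make_flaky_double():
    """Element-wise doubling with one simulated bit flip on the first call."""
    calls = {"n": 0}

    def compute(xs):
        calls["n"] += 1
        ys = [2 * v for v in xs]
        if calls["n"] == 1:
            ys[0] ^= 4                   # simulated transient upset
        return ys

    return compute


def checksum_check(xs, ys):
    """Cheap check: the output sum must equal twice the input sum."""
    return sum(ys) == 2 * sum(xs)


if __name__ == "__main__":
    y, retries = run_with_lwc(make_flaky_double(), checksum_check, [1, 2, 3])
    print(f"output={y}, recomputations={retries}")
```

When errors are rare, the loop body almost always runs once, which is why the rollback cost stays small.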
Since we maintain a low rate of recomputation, the cost of recomputing is low,
but there can still be added cost to store the checkpointed data. This will depend on
whether the application of interest already stored its data or not. We will explore these
concerns with a large system consisting of three computational stages in Chapter 7.
Note however that the LWCs we will find will all require at least O(data size) work
(Sec. 5.10, Tab. 5.1), where data size is the number of inputs to the computation,
so storing an extra data size data points does not change the asymptotics—does not
make the LWC more complex.
5.3 Differential reliability
Differential reliability observes that we do not need the same reliability out of all the
components of our design. In particular, we can often use more reliable components to
oversee the computation of less reliable components. One option we have for selecting
the reliability of components is the voltage at which we operate them.
LWCs are one way to exploit differential reliability. If a check can validate the
correctness of a computation, then we can tolerate low reliability in the computation
as long as the check reliably identifies when the computation is in error. Specifically,
this means we can run the computation at low voltage, VCMP = Vlow, and the check
at high voltage, VLWC = Vhigh. To the extent that the check is cheaper than the
computation, we have an opportunity to save energy.
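A rough first-order estimate shows when this opportunity exists. The Python sketch below assumes that energy scales with the square of the supply voltage (the standard dynamic-energy approximation) and ignores leakage, checkpoint storage, and recomputation, all of which the evaluation in later chapters accounts for. The function names and the example voltages and cost ratio are hypothetical values chosen for illustration, not measurements.

```python
def relative_energy(v_cmp, v_lwc, cost_ratio, v_nom=1.0):
    """Energy of CMP + LWC relative to the unprotected CMP at nominal voltage.

    cost_ratio = cost(c) / cost(f) at equal voltage.
    Assumes energy is proportional to V^2 (dynamic energy only).
    """
    cmp_energy = (v_cmp / v_nom) ** 2
    lwc_energy = cost_ratio * (v_lwc / v_nom) ** 2
    return cmp_energy + lwc_energy


def saves_energy(v_cmp, v_lwc, cost_ratio, v_nom=1.0):
    """True when the protected design uses less energy than the nominal baseline."""
    return relative_energy(v_cmp, v_lwc, cost_ratio, v_nom) < 1.0


if __name__ == "__main__":
    # Hypothetical operating point: CMP at 0.7x nominal, LWC at nominal,
    # and a check that costs 10% of the computation.
    print(relative_energy(v_cmp=0.7, v_lwc=1.0, cost_ratio=0.1))
    # Duplication (cost_ratio = 1) with the checker at nominal voltage cannot save energy.
    print(saves_energy(v_cmp=0.7, v_lwc=1.0, cost_ratio=1.0))
```

Under these assumptions, the cheaper the check is relative to the computation, the less the CMP voltage has to drop before the combination pays off.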
The dual-Vdd architectures developed in Chapter 3 allow us to operate applications
in a differential reliability mode. In Sec. 6.3 we will compare applications augmented
with LWCs with and without differential reliability (the LWC stays at nominal Vdd,
or gets reduced as well).
Furthermore, note that we generally have some amount of differential reliability
even when CMP and LWC have the same voltage. Indeed, the LWC typically performs
fewer operations than the CMP for a given data set, giving it fewer opportunities to
get upset, i.e. making it more reliable.
5.4 LWC definition
Given a computational task F , and an input x to the task, we can compute the
output F (x). Given an implementation of that task f , and the same input x, we can
compute the output f(x). A checker c(x, f(x)) takes the input x and output f(x)
as arguments, and returns a boolean value: 1 when it thinks that f(x) = F (x), 0
otherwise. 1 indicates that the computation was performed successfully, 0 indicates
an error, either permanent or transient. This is a broad definition that imposes no
useful constraints on the checker, so any program can have a checker. In practice, we
are interested in checkers that are designed properly such that:
• they catch errors with high probability,
• they cost less than the base computation.
We want to design c(x, f(x)) such that, if it is fault-free, it returns a correct output
with high probability (close to 1), i.e. such that it has “high coverage”. For example,
consider an n-bit adder:
    f(a, b) = (a + b) % 2^n
The function
    c(a, b, f(a, b)) = (a[0] ⊕ b[0] == f(a, b)[0]) ? 1 : 0
is a boolean of the form c(x, f(x)), but it has low coverage: it only verifies the
least-significant bit, and for uniformly random inputs, it will only return a correct decision
with probability 1/2^(n−1) (assuming the checker is fault-free).
We also want to design c(x, f(x)) such that it has low cost: Computing c(x, f(x))
must be cheaper than computing f(x). This is the case in the example above. In
contrast, if
    c(a, b, f(a, b)) = ((a + b) % 2^n == f(a, b)) ? 1 : 0
then assuming c is fault-free, it will never return an erroneous decision, but it will also
not be low-cost: it is a duplication of the original computation. Lower cost typically
means smaller area, lower energy and lower runtime.
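The two checkers above can be compared directly in code. The Python sketch below implements the n-bit adder, the least-significant-bit check, and the duplication check, then injects a single bit flip into the adder output. The function names and the operand values are our own choices for illustration.

```python
def adder(a, b, n):
    """n-bit adder: f(a, b) = (a + b) % 2^n."""
    return (a + b) % (1 << n)


def check_lsb(a, b, s):
    """Low-cost, low-coverage check: compares only the least-significant bit."""
    return ((a ^ b) & 1) == (s & 1)


def check_dup(a, b, s, n):
    """Full-coverage check that simply repeats the addition (not lightweight)."""
    return (a + b) % (1 << n) == s


def flip_bit(value, bit):
    """Model a single-bit upset in an output word."""
    return value ^ (1 << bit)


if __name__ == "__main__":
    n, a, b = 8, 100, 57
    good = adder(a, b, n)
    for bit in (0, 3):
        bad = flip_bit(good, bit)
        print(f"flip bit {bit}: lsb check passes={check_lsb(a, b, bad)}, "
              f"dup check passes={check_dup(a, b, bad, n)}")
```

A flip in bit 0 is caught by both checkers, while a flip in any other bit slips past the least-significant-bit check and is caught only by the duplicate.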
Definition: A “lightweight” check (LWC) is a checker c(x, f(x)) with high cov-
erage and cost strictly less than the cost of the function f(x) that it is guarding:
cost(c) < cost(f).
Ideally we want to find checks that verify cost(c) < cost(f) both asymptotically
and absolutely. In practice, if a check has the same asymptotic complexity as the
computation, but it is cheaper because of lower constants, we still consider it to be
an LWC.
Remembering our goal of reducing energy, note that the definition above does not
mean that any LWC will necessarily yield savings: even if cost(c) < cost(f), cost(c)
could still be a large fraction of cost(f). However, if O(cost(c)) < O(cost(f)), then
as the application gets larger the LWC becomes cheaper and more likely to yield
savings, i.e. we would not need to reduce the voltage as much to recover the cost
of adding the LWC on top of the computation. On the other hand, note that even
if f itself is excluded from being its own LWC by the definition above, it could still
be used as a (non-low-cost) checker for itself and potentially still yield savings; for
example if we can reduce voltage enough to reduce the energy of f by more than 2×,
and if duplicating f is enough to recover reliability. That is, the CMP module is f ,
the LWC module is f as well, and both run at the low voltage.
We thus have a clear definition for the cost requirements of an LWC, but what
level of coverage is high enough? We decide to answer this question empirically. In
Sec. 6.3 we will evaluate the coverage of our checkers; higher probabilities of a correct
decision will translate into lower SDC rates. We consider an LWC to have good
enough coverage as long as it allows us to achieve our goal: reduce energy without
impacting reliability. LWCs with poor coverage will not help us achieve our goal.
Specifically, we note that the best-performing LWCs are those that can at least catch
any single bit flip (see Sec. 6.2.2, Eq. 6.20).
5.5 Classification system for LWCs
One can think of many examples of tasks that are harder to accomplish than they
are to verify. For instance, sorting a shuffled deck of cards takes longer than checking
whether a given deck of cards is sorted. Finding the two prime factors p and q of a
large number N = pq is much harder than verifying that N = pq once we are given
p and q.
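Both examples are easy to state in code. The Python sketch below uses our own function names and a small semiprime chosen for illustration. Note that is_sorted only verifies ordering; a complete check on a sorter would also have to confirm that the output is a permutation of the input, which this sketch does not attempt.

```python
def is_sorted(deck):
    """O(N) check that a sequence is in non-decreasing order."""
    return all(deck[i] <= deck[i + 1] for i in range(len(deck) - 1))


def is_factorization(N, p, q):
    """One multiplication verifies a claimed factorization N = p * q.

    Primality of p and q is not tested here.
    """
    return p > 1 and q > 1 and p * q == N


if __name__ == "__main__":
    deck = [7, 2, 9, 4]
    print(is_sorted(deck), is_sorted(sorted(deck)))   # O(N) check vs. O(N log N) sort
    print(is_factorization(8051, 83, 97))             # 83 * 97 = 8051
```

In both cases the check touches each input once, while producing the answer requires substantially more work.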
Finding these examples is not hard, and their existence as a fact of life makes
intuitive sense, yet formally explaining why we expect some tasks to have lightweight
checks is not easy.
Blum and Kannan [16] formalized the concept of result-checking. They did this
within the context of applications that can have faults, not due to single event upsets
(SEUs), but because the implementation may be erroneous. They presented their
work as an alternative to testing and verification to ensure that a program is bug-
free. Result-checking does not prove the program to be correct for all inputs, but
only that the answer that was just produced was correct. As such, it is easier to do
than verification.
Because of this difference in context, our approach differs from [16] in important
ways:
• We are concerned with the detection of errors due to physical upsets, not design
errors. This allows us to assume that in the absence of upsets, the computation
will be correct.
• The work in [16] is inspired by interactive proof systems, similar to how they
are used in cryptography. Most of the result-checkers proposed in [16] follow an
interactive proof model, where a faulty computation is equivalent to a dishonest
prover. However, these checkers typically rely on calling the program f being
checked multiple times, which defeats our definition of an LWC: Our LWCs work
based on a single call to the program.
Rubinfeld [101] provides a mathematical framework for some of the ideas in [16], but
the focus is on self-testing and self-correcting programs, which also rely on calling the
program f multiple times.
Low-cost error-detection mechanisms can also be constructed for more complex
systems such as processor cores, for example DIVA [8] and Argus [86], which are
implemented at the architectural level.
In an attempt to understand the space of applications that have LWCs, we can
think about the different complexity classes, and note that some of them have LWCs
by definition. In particular, any problem that is in NP but not in P has an LWC by
definition (a polynomial-time check for a non-polynomial time computation). This
of course assumes P ≠ NP. Similarly, problems in NEXPTIME but not EXPTIME
have LWCs. We focus on identifying LWCs for problems in P , arguably the class with
the most importance and most common problems. We identify the following classes
for problems in P :
1. Checksum
2. Probabilistic
3. Iterative convergent
4. Error-tolerant
5. No check possible
This classification is analogous to Berkeley’s 7+ dwarfs [7], a classification system for
algorithmic methods that capture a particular pattern of computation and commu-
nication. We created this list of classes by reviewing a large list of applications and
assigning them to a particular class. Tab. 5.1 summarizes the applications we have
considered and the LWC classes to which they belong. The following five sections
describe each LWC class and provide details on some of the applications they cover,
including those that we have implemented for the experimental results in Sec. 6.3.
Note that most of the applications covered here work well with integers, including
integers mod 2^n (e.g. 32-bit and 64-bit integers). They also work for floating-point
arithmetic if we make the (very common) assumption that the LSBs do not need to
be exact since we expect rounding errors; i.e., the LSBs do not need to be checked,
so, for example, a checksum comparison would check that two numbers do not differ
by more than a small number ε instead of checking for equality.
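A tolerance-based comparison of this kind can be sketched as follows. The function name checksums_match and the default value of eps are assumptions made for illustration; a suitable ε depends on the data range and on the number of accumulated operations.

```python
def checksums_match(expected, actual, eps=1e-9):
    """Floating-point checksum comparison: equal up to a small absolute error eps."""
    return abs(expected - actual) <= eps


if __name__ == "__main__":
    total = sum([0.1] * 10)          # accumulates rounding error in the LSBs
    print(total == 1.0)              # exact equality is too strict for floats
    print(checksums_match(1.0, total))
```

Choosing ε trades coverage for false alarms: a larger ε tolerates more rounding noise but lets small real errors in the low-order bits go undetected.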
5.6 Class #1: Checksums
Table 5.1: LWC classes and example applications (problems with P complexity)
Note: the complexities shown for matrix operations assume square dense matrices;
similar LWCs exist for non-square and sparse matrices

LWC class        Application                            CMP            LWC
Checksum         Matrix multiply                        O(N^3)         O(N^2)
                 Matrix-vector multiply                 O(N^2)         O(N^2)
                 LU decomposition                       O(N^3)         O(N^2)
                 Gaussian elimination                   O(N^3)         O(N^2)
                 QR factorization                       O(N^3)         O(N^2)
                 Matrix inverse                         O(N^3)         O(N^2)
                 Sort                                   O(N log(N))    O(N)
                 Fast Fourier Transform                 O(N log(N))    O(N)
                 Window filter                          O(k^2 N^2)     O(N^2)
                 Integer multiplication                 O(N^2)         O(N^2)
                 Integer division                       O(N^2)         O(N^2)
                 Modulo                                 O(N^2)         O(N^2)
                 Data integrity (ECC)                   O(N)           O(N)
Probabilistic    Matrix multiply                        O(N^3)         O(N^2)
                 QR factorization                       O(N^3)         O(N^2)
                 Matrix inverse                         O(N^3)         O(N^2)
Convergent       Conjugate gradient                     O(N^2 √k)      O(N^2)
                 Lucas-Kanade                           O(kN)          O(k)
                 Extended GCD                           O(N)           O(N)
Error-tolerant   Gaussian Mixture Modeling              O(N)           -
                 Digital Signal Processing (ASET)       -              -
No LWC           addition, subtraction                  O(N)           -
                 basic boolean logic                    O(N)           -
                 Matrix transposition (see Sec. 5.12)   O(N^2)         -
                 Cryptographic hashing (e.g. MD5)       O(N)           -

Many tasks in signal processing, image processing and scientific computing are based
on linear weighted sums. As such, it is often possible to identify sums that remain
invariant between the input data and the output data, or, at least, change in easily
predictable ways. These sums serve as an LWC on the operation. Algorithm-Based
Fault Tolerance (ABFT) [53] proposed lightweight schemes for detecting and correcting
errors for many common linear algebra computations, including matrix multiplication
(Sec. 5.6.2), LU decomposition, QR factorization, and matrix inversion. These
techniques augment the input data with a checksum, resulting in output data that
is produced with an augmented checksum, and that provides enough redundancy
to check the data’s validity. Other examples of applications that have checksum-based
LWCs include sorting (Sec. 5.6.1), spectral methods such as the Fast Fourier
Transform (FFT) (Sec. 5.6.3), window filtering (Sec. 5.6.5), integer
multiplication/division/modulo (Sec. 5.6.6), and data integrity checks (Sec. 5.6.7).
5.6.1 LWC for sorting
Our implementation of sort and its LWC are shown in Fig. 5.2 for N = 128. This is
an N-element streaming merge-sort [64] where each element is composed of a
single-precision, floating-point number for the sort key and a log(N) = 7-bit index that
can serve as a pointer to the payload. The merge-sort is implemented using log(N) = 7
merge elements with appropriately sized buffers between them, as shown in Fig. 5.2.
A new input is presented to the first unit on every cycle, and an output is produced
on every cycle, with a latency of N. The $i$th merge element performs an in-order merge
on two ordered $2^i$-element sequences in alternating $2^{i+1}$-cycle phases. On every cycle,
each merge element consumes the smallest of its inputs and sends it to the top output
on phase-1 and the bottom output on phase-2. Once one of the input buffers has been
read $2^i$ times, all subsequent consumed values are from the other input, until that
one too has been used $2^i$ times.
An LWC for the sorting kernel performs a simple pairwise comparison of the
outputs to confirm their order (Fig. 5.2). The check must also confirm that no element
was lost or modified, which it does by computing a checksum of all the inputs and
confirming that it matches that of the outputs. We choose to simply compute an
integer sum of all elements, which only fails to detect an error in rare cases where
two simultaneous flips are complementary, do not affect the order, and preserve the
sum. The probability of such events is quantified experimentally in Sec. 6.3.

[Figure 5.2: Sort kernel with its LWC. Compute area: the input stream feeds a chain of
merge units (M) separated by FIFOs of size n (2 through 128), producing the output
stream. LWC area: input-sum and output-sum accumulators, a pairwise order comparison
(<?) on the outputs, and an equality test (=?) of the two sums once the input and
output streams are complete.]
Performing the sort requires O(N log(N)) work, whereas performing N compar-
isons and sums requires O(N) work. The check is asymptotically and absolutely
easier to perform than the computation, making it a lightweight check. In general,
our merge sort can have size N and parallelism P , as explained in Sec. 4.5.3.
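
The two parts of this check can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: keys are plain integers rather than the single-precision keys of our hardware kernel, and the function name `sort_lwc` is hypothetical.

```python
def sort_lwc(inputs, outputs):
    """Lightweight check for a sort: O(N) work instead of O(N log N).

    Assumes integer keys; `inputs` is the unsorted stream and `outputs`
    is the stream produced by the (possibly faulty) sorter.
    """
    # Order check: one comparison per adjacent pair of outputs.
    in_order = all(a <= b for a, b in zip(outputs, outputs[1:]))
    # Checksum: an integer sum over each stream detects lost or modified keys.
    same_sum = sum(inputs) == sum(outputs)
    return in_order and same_sum


data = [5, 3, 9, 1]
print(sort_lwc(data, sorted(data)))   # correct result passes
print(sort_lwc(data, [1, 3, 5, 8]))   # modified key changes the sum
print(sort_lwc(data, [2, 3, 4, 9]))   # complementary errors slip through
```

The last call shows the rare undetected case described above: two complementary changes (+1 and −1) that keep both the order and the sum intact.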
5.6.2 LWC for matrix multiplication
Multiplying a dense $m_1 \times n$ matrix $A$ by a dense $n \times m_2$ matrix $B$ results in an
$m_1 \times m_2$ matrix $C$, and requires $n \cdot m_1 \cdot m_2$ additions and multiplications:

$$
\begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m_1,1} & A_{m_1,2} & \cdots & A_{m_1,n}
\end{bmatrix}
\begin{bmatrix}
B_{1,1} & B_{1,2} & \cdots & B_{1,m_2} \\
B_{2,1} & B_{2,2} & \cdots & B_{2,m_2} \\
\vdots & \vdots & \ddots & \vdots \\
B_{n,1} & B_{n,2} & \cdots & B_{n,m_2}
\end{bmatrix}
=
\begin{bmatrix}
C_{1,1} & C_{1,2} & \cdots & C_{1,m_2} \\
C_{2,1} & C_{2,2} & \cdots & C_{2,m_2} \\
\vdots & \vdots & \ddots & \vdots \\
C_{m_1,1} & C_{m_1,2} & \cdots & C_{m_1,m_2}
\end{bmatrix}
$$

As suggested in ABFT [53], an LWC for the matrix multiplication operation $A \times B = C$
would be to compute a checksum row for $A$: $A_{m_1+1,j} = \sum_{i=1}^{m_1} A_{i,j}$, as well as a
checksum column for $B$: $B_{i,m_2+1} = \sum_{j=1}^{m_2} B_{i,j}$, which are then included as part of the
computation, resulting in a product matrix $C$ with an extra row and column, where
the common element, the bottom-right corner, is the sum of all the elements of the
$m_1 \times m_2$ matrix $C$, thus the LWC:

$$
C_{m_1+1,m_2+1} = \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} C_{i,j}.
$$

$$
\begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m_1,1} & A_{m_1,2} & \cdots & A_{m_1,n} \\
\sum_{i=1}^{m_1} A_{i,1} & \sum_{i=1}^{m_1} A_{i,2} & \cdots & \sum_{i=1}^{m_1} A_{i,n}
\end{bmatrix}
\begin{bmatrix}
B_{1,1} & B_{1,2} & \cdots & B_{1,m_2} & \sum_{j=1}^{m_2} B_{1,j} \\
B_{2,1} & B_{2,2} & \cdots & B_{2,m_2} & \sum_{j=1}^{m_2} B_{2,j} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
B_{n,1} & B_{n,2} & \cdots & B_{n,m_2} & \sum_{j=1}^{m_2} B_{n,j}
\end{bmatrix}
=
\begin{bmatrix}
C_{1,1} & C_{1,2} & \cdots & C_{1,m_2} & \\
C_{2,1} & C_{2,2} & \cdots & C_{2,m_2} & \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
C_{m_1,1} & C_{m_1,2} & \cdots & C_{m_1,m_2} & \\
 & & \cdots & & C_{m_1+1,m_2+1}
\end{bmatrix}
$$

The LWC requires $n \cdot m_1 + n \cdot m_2$ additions to compute the checksums for the inputs,
and $n$ additions and multiplications for the corner element $C_{m_1+1,m_2+1}$, which is less
work than computing $A \times B = C$. For example, in the case of square matrices we
have $m_1 = m_2 = n$. The cost of the compute is $O(n^3)$ whereas the cost of the LWC
is $O(n^2)$.
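
As a rough software sketch of this corner-element check (assuming integer matrices stored as lists of rows; the names `matmul` and `abft_corner_check` are hypothetical and do not reflect our hardware implementation):

```python
def matmul(a, b):
    """Reference O(n^3) product of an m1-by-n and an n-by-m2 matrix."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def abft_corner_check(a, b, c):
    """Compare the checksum corner element with the sum of all entries of c."""
    inner = len(b)
    # Checksum row of A: one column sum per inner index.
    checksum_row_a = [sum(row[k] for row in a) for k in range(inner)]
    # Checksum column of B: one row sum per inner index.
    checksum_col_b = [sum(b[k]) for k in range(inner)]
    # Corner element: a single dot product of the two checksum vectors.
    corner = sum(x * y for x, y in zip(checksum_row_a, checksum_col_b))
    return corner == sum(sum(row) for row in c)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)
print(abft_corner_check(a, b, c))      # fault-free product passes
c[0][1] += 1                           # inject a single-element error
print(abft_corner_check(a, b, c))      # the corner no longer matches
```

For floating-point data, the equality test would be replaced by a comparison against a small tolerance ε.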
For matrix-vector multiplication, $m_2 = 1$, and we do not need a checksum for the
vector; assuming $m_1 = n$, we need $n^2$ additions to compute the checksum of the input
matrix. We still need $n$ additions and multiplications for the corner element. This is
$O(n^2)$, the same as actually computing the matrix-vector multiplication, so the LWC
is not asymptotically cheaper, but it is still cheaper in practice, since the compute
requires $n^2$ multiplications and $n^2$ additions.
As mentioned previously, the same checksum technique can be applied to LU
decomposition, QR factorization, and matrix inversion. For vector-vector multiplication
($m_1 = m_2 = 1$), this would not be lightweight any more since the check would need
to repeat the exact same operations as the compute. As suggested in [53], ABFT also
preserves the matrix checksum property for matrix addition and matrix
transposition¹, but the resulting checks are not lightweight, so we categorize these
operations in the “no-check” category (Sec. 5.10).
Note that ABFT sometimes provides error-correction capabilities. This is the case
for matrix multiplication if we also compute the full checksum row and column of the
result matrix C [53]. However, not all ABFT-based LWCs allow for error-correction,
and it does not directly fit in our general framework, where we simply re-compute.
We implement matrix multiplication and its LWC and evaluate its error coverage
empirically in Sec. 6.3. Our implementation is detailed in Sec. 4.6.1 (Fig. 4.11).
Note that even though the LWC above focused on dense matrix multiplication, it
often works for sparse matrix multiplication as well. A square matrix multiplication
of two $n \times n$ matrices where each has at most $m$ non-zero elements takes $O(mn)$ work
(with $m \le n^2$). Computing the checksum row and column for the two input matrices
takes $O(m)$ work, and the extra multiply-adds due to them take $O(m)$ work as well.
Now, at the output end, the complexity of computing the checksum of all the outputs
will depend on the sparsity of the output matrix. If the number of non-zero elements
at the output is $m_o$, the complexity of the LWC will be $O(m + m_o)$. In the worst
case, we can have $m_o = n^2$, leading to a check with complexity $O(n^2)$ (since $m \le n^2$).
Whether this makes the check lightweight depends on $m$: if $m > n$, then we do have
an LWC, but if $m < n$, then the check is asymptotically more expensive than the
compute, so it is not an LWC. However, the output matrix will typically be about as
sparse as the input matrices ($m_o \approx m$), making the check lightweight ($O(m)$ for the
check, $O(mn)$ for the compute).

¹ At least to first order; see Sec. 5.12.
5.6.3 LWC for FFT
We are concerned with computing the Discrete Fourier Transform (DFT) of a vector
$\vec{x}$ of dimension $N$, which results in a vector $\vec{X}$, also of dimension $N$. The DFT can
be viewed as a change of basis and formulated as a matrix-vector multiplication:

$$
\vec{X} = A_N \vec{x} \tag{5.1}
$$

$$
A_N =
\begin{bmatrix}
W^0 & W^0 & \cdots & W^0 \\
W^0 & W^1 & \cdots & W^{N-1} \\
W^0 & W^2 & \cdots & W^{2(N-1)} \\
\vdots & \vdots & \ddots & \vdots \\
W^0 & W^{N-1} & \cdots & W^{(N-1)^2}
\end{bmatrix}
$$

$$
W^{rs} = e^{-j(2\pi rs/N)} \tag{5.2}
$$

This matrix-vector multiplication requires $O(N^2)$ work. The Fast Fourier Transform
(FFT) is a well-known way to compute the DFT with $O(N \log(N))$ work.
In terms of finding an LWC, we could treat the FFT (or DFT) as an example of the
matrix-vector multiplication in the previous section and use a similar check. However,
due to the symmetry and cancellations within the FFT, the most straightforward
version does not protect against errors in the butterfly multiplications.
For example, a 2-input FFT takes $a$ and $b$ as input, and returns $A = a + b$ and
$B = a - b$. In matrix form, this can be written as:

$$
\begin{bmatrix} A \\ B \end{bmatrix}
=
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
$$

With the checksum row, this becomes:

$$
\begin{bmatrix} A \\ B \\ c_0 \end{bmatrix}
=
\begin{bmatrix} 1 & 1 \\ 1 & -1 \\ 2 & 0 \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
$$

We check if $c_0 = A + B = 2a$; this is indeed the case when there is no fault, but
notice that this does not check the effect of $b$ at all. In general, if we augment $A_N$ in
Eq. 5.1 with a checksum row, we will be checking $c_0 = \sum_{i=0}^{N-1} X_i = N x_0$; this looks
at all the outputs, but only one of the inputs. In other words, we only check the DC
(direct-current) component of the FFT input.
A variant from Wang and Jha [116] addresses this problem by encoding the inputs
and outputs to assign each of them a non-trivial weight in the checksum. The result
is that we simply need to perform a dot product on the input and output of the FFT
and compare them. This means that the LWC requires O(N) operations, while the
FFT requires O(N log(N)).
For the previous example with $N = 2$, the checksum at the input would have been
$cs_i = (1 + e^{-j2\pi/3}) \cdot a + (1 - e^{-j2\pi/3}) \cdot b$, and the checksum at the output would have
been $cs_o = (1) \cdot A + (e^{-j2\pi/3}) \cdot B$, and we indeed have $cs_i = cs_o$ when there is no
fault, with both inputs having an effect.
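
The $N = 2$ example can be generalized in a short Python sketch. The weights below are an assumption made for illustration: the output weights are taken as powers of $w = e^{-j2\pi/3}$, and the matching input weights follow from substituting Eq. 5.1 into the output checksum, which gives $c_s = \sum_{r=0}^{N-1} (w W^s)^r$. This reproduces the $N = 2$ weights above but is not claimed to be the exact encoding of [116]. The naive `dft` stands in for the FFT under test, and the input weights are constants that would be precomputed, so the check itself is two dot products.

```python
import cmath


def dft(x):
    """Naive O(N^2) DFT, standing in for the FFT being checked."""
    n = len(x)
    return [sum(x[s] * cmath.exp(-2j * cmath.pi * r * s / n) for s in range(n))
            for r in range(n)]


def input_weights(n, w):
    """Input weight c_s = sum_r (w * W^s)^r, with W = exp(-j*2*pi/N)."""
    weights = []
    for s in range(n):
        q = w * cmath.exp(-2j * cmath.pi * s / n)
        weights.append(sum(q ** r for r in range(n)))
    return weights


def fft_lwc(x, big_x, weights, w, eps=1e-9):
    """Compare the weighted input checksum with the weighted output checksum."""
    cs_in = sum(c * v for c, v in zip(weights, x))
    cs_out = sum((w ** r) * v for r, v in enumerate(big_x))
    # Floating-point data: compare against a small tolerance, not for equality.
    return abs(cs_in - cs_out) <= eps


w = cmath.exp(-2j * cmath.pi / 3)
x = [1.0, 2.0, 3.0, 4.0]
big_x = dft(x)
weights = input_weights(len(x), w)
print(fft_lwc(x, big_x, weights, w))          # fault-free transform passes
big_x[2] += 0.5                               # corrupt one output
print(fft_lwc(x, big_x, weights, w))          # mismatch is flagged
```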
We implement the FFT and its LWC and evaluate its error coverage empirically
in Sec. 6.3. Our implementation is a streaming FFT as explained in Sec. 4.5.3.
5.6.4 LWC for Gaussian elimination
Gaussian elimination is one of the methods to solve for the $N$-element vector $\vec{x}$ in the
system of linear equations $A\vec{x} = \vec{b}$. This is an $O(N^3)$ operation that computes the
unique reduced row echelon form of the augmented matrix $[A|\vec{b}]$. It can be checked
by simply computing the matrix-vector multiplication $A\vec{x}$ at the end and checking
whether it is indeed equal to $\vec{b}$. The check is an $O(N^2)$ operation, making it an LWC.
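
A minimal sketch of this check in Python follows. It assumes floating-point data, so the equality test becomes a comparison against a small tolerance `eps`; the function name `residual_check` is hypothetical, and the solver itself is not shown since any solver can be checked this way.

```python
def residual_check(a, x, b, eps=1e-9):
    """Return True if A x matches b within eps; O(N^2) multiply-adds."""
    for row, b_i in zip(a, b):
        # One dot product per row of A, compared against the matching entry of b.
        if abs(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i) > eps:
            return False
    return True


a = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
print(residual_check(a, [0.8, 1.4], b))   # the true solution passes
print(residual_check(a, [0.8, 1.5], b))   # a corrupted solution is rejected
```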
5.6.5 LWC for window filtering
Window filters are common in image processing. Each output pixel is computed as
the weighted sum of a number of neighboring pixels.
$$
\mathrm{out}[x, y] = \sum_{i=dx_{low}}^{dx_{high}} \sum_{j=dy_{low}}^{dy_{high}} w[i, j] \times \mathrm{in}[x + i, y + j] \tag{5.3}
$$
For example, a $k \times k = 5 \times 5$ filter would have $dx_{low} = dy_{low} = -2$ and $dx_{high} =
dy_{high} = 2$. Gaussian Filtering for edge detection is a common example of a window
filter (Sec. 4.5.3, Fig. 4.8). Since each pixel in the original image contributes to
the output image weighted by coefficients in the window mask, $w$, the total sum of
the final image is the same as the sum of the original image times the weight of the
window mask. Note that only computing the $(x_{max}, y_{max})$ pixels for the output would
not weigh all the edge pixels fully by the weight mask and the sums would not be
exact. However, we can make them exact by treating the edge pixels of the original
image as 0 and computing the edge pixels for the output image as well.
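$$
\left( \sum_{x=0}^{x_{max}} \sum_{y=0}^{y_{max}} \mathrm{in}[x, y] \right)
\left( \sum_{i=dx_{low}}^{dx_{high}} \sum_{j=dy_{low}}^{dy_{high}} w[i, j] \right)
=
\left( \sum_{x=dx_{low}}^{x_{max}+dx_{high}} \sum_{y=dy_{low}}^{y_{max}+dy_{high}} \mathrm{out}[x, y] \right) \tag{5.4}
$$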
Except for edging effects, the LWC requires two additions for each pixel, one for the
input sum and one for the output sum, while the window filter operation requires:

$$
W_{entry} = (dx_{high} - dx_{low} + 1)(dy_{high} - dy_{low} + 1) = k^2 \tag{5.5}
$$

multiplications and additions. For a $k \times k = 5 \times 5$ window filter, this is 25 multiplications
and additions. Therefore, window filtering takes $O(k^2N^2)$ work. For a constant $k$,
this is $O(N^2)$ work, the same as the check. In that case, what makes it lightweight is
that the constants for checking are much lower.
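
The sum identity of Eq. 5.4 can be illustrated with a small Python sketch. It assumes a symmetric window of radius `r` (so $dx_{low} = dy_{low} = -r$ and $dx_{high} = dy_{high} = r$), integer pixels, and an output stored as a dictionary keyed by coordinates; the function names are hypothetical.

```python
def window_filter_full(image, mask, r):
    """Apply Eq. 5.3 over the extended output range, padding the input with 0."""
    xmax, ymax = len(image) - 1, len(image[0]) - 1

    def pixel(x, y):
        # Pixels outside the original image are treated as 0.
        return image[x][y] if 0 <= x <= xmax and 0 <= y <= ymax else 0

    out = {}
    for x in range(-r, xmax + r + 1):
        for y in range(-r, ymax + r + 1):
            out[(x, y)] = sum(mask[i + r][j + r] * pixel(x + i, y + j)
                              for i in range(-r, r + 1)
                              for j in range(-r, r + 1))
    return out


def window_lwc(image, mask, out):
    """Eq. 5.4: (sum of input) * (sum of mask) must equal (sum of output)."""
    in_sum = sum(sum(row) for row in image)      # one addition per input pixel
    mask_sum = sum(sum(row) for row in mask)     # constant for a given filter
    return in_sum * mask_sum == sum(out.values())


image = [[1, 2, 3], [4, 5, 6]]
mask = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
out = window_filter_full(image, mask, 1)
print(window_lwc(image, mask, out))   # fault-free output passes
out[(0, 0)] += 1                      # corrupt one output pixel
print(window_lwc(image, mask, out))   # the sums no longer agree
```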
We implement a window filtering application and its LWC and evaluate its error
coverage empirically in Sec. 6.3. Instead of reporting results for the window filter from
Sec. 4.5.3, we report results for a slightly more involved window filter, the DeBayer,
described in more detail in Sec. 7.3. We do not include both in the results of Sec. 6.3
because they have similar results, so we avoid skewing the geomean.
5.6.6 LWC for integer multiplication, division, modulo
“Casting out nines” is a well-known sanity check for basic arithmetic operations
such as addition, subtraction, multiplication and division. This check is over 1000
years old, and is described in [31] among others. For example, in the case where we
want to multiply $a \times b = c$, it works by checking whether checksum(checksum(a) ×
checksum(b)) = checksum(c). The checksum consists of repeatedly adding all
the base-10 digits together until there is only one base-10 digit. In other words:
checksum(a) = a%9 (with a digit of 9 treated as equivalent to 0). In general, we can
construct a checker with this approach using the checksum checksum(a) = a%M and
base-2 computations. This check works for both addition and subtraction, but is not
lightweight. For multiplication and division, the check can be lightweight, and its
coverage depends on the choice of $M$. For example, setting $M$ to $2^n$ (a power of 2)
makes the modulo easy to compute, so it is lightweight, but it is a poor choice for
coverage since it would only involve looking at the $n$ least significant bits. If we set
$M = 2^n - 1$ ($n > 1$), then we can easily compute the modulo of a number $a$ by repeatedly
adding the $n$-bit groups of $a$ together, similar to the trick used in “Casting out nines” in
base 10. For instance, with $n = 2$, 1010 × 11 = 11110 would be checked by computing
checksum(1010) = checksum(10 + 10) = checksum(01 + 00) = 1, checksum(11) = 11,
checksum(11110) = checksum(01 + 11 + 10) = checksum(01 + 10) = 11, and
we indeed have checksum(1010) × checksum(11) = checksum(11110). This check has
better coverage since it uses information from all input and output bits.
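
The group-adding trick for $M = 2^n - 1$ can be sketched in Python as follows. This is an illustrative software version assuming non-negative integers; the names `checksum` and `mul_lwc` are hypothetical. As in the worked example above, the all-ones pattern is returned as-is, even though it is equivalent to 0 modulo $M$.

```python
def checksum(a, n):
    """Residue of a modulo 2**n - 1, by repeatedly adding its n-bit groups.

    A nonzero multiple of 2**n - 1 folds to the all-ones pattern, which is
    equivalent to 0, just as 9 is equivalent to 0 when casting out nines.
    """
    mask = (1 << n) - 1
    while a > mask:
        total = 0
        while a:
            total += a & mask   # add the lowest n-bit group
            a >>= n             # move on to the next group
        a = total
    return a


def mul_lwc(a, b, c, n):
    """Check a * b == c using only the n-bit residues of a, b and c."""
    return checksum(checksum(a, n) * checksum(b, n), n) == checksum(c, n)


print(bin(checksum(0b1010, 2)), bin(checksum(0b11110, 2)))
print(mul_lwc(0b1010, 0b11, 0b11110, 2))   # 10 * 3 = 30 passes
print(mul_lwc(0b1010, 0b11, 0b11111, 2))   # a wrong product is rejected
```

The residue multiplication works on $n$-bit operands, which is what makes the check cheaper than the full-width multiplication it guards.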
A survey of existing checkers for basic ALU operations (Arithmetic Logic Unit)
can be found in [104]. Some of them have been used in practice to protect processors
[86].
5.6.7 LWC for data integrity
The simplest way to detect an error in data at low cost is to augment it with a parity
bit. This detects any single error and any odd number of errors. This is a special
case of the more general idea of Hamming codes, which can detect up to d− 1 errors
when they have a minimum Hamming distance of d (e.g. Hamming (7,4) [46]). This
is an example of error-correcting codes (ECC), which are commonly used to protect
memory. ECC is also used when sending digital data over a noisy network, allowing
us to significantly improve the SNR (Signal to Noise Ratio) at the receiver end by
augmenting the data at the sender end (the assumption is that the noisy channel is
the most expensive part of the system, and we want to send as little data as possible).
These techniques are all lightweight compared to storing or sending the data twice,
and they require O(N) work.
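
The parity-bit case can be sketched as follows, assuming non-negative integer data words and an even-parity convention; the function names are hypothetical. The last line shows why an even number of errors goes undetected.

```python
def parity(word):
    """Return 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1


def add_parity(word):
    """Append an even-parity bit as the new least significant bit."""
    return (word << 1) | parity(word)


def parity_ok(codeword):
    """A valid codeword always contains an even number of 1 bits."""
    return parity(codeword) == 0


codeword = add_parity(0b1011001)
print(parity_ok(codeword))            # stored word is intact
print(parity_ok(codeword ^ 0b100))    # a single flipped bit is detected
print(parity_ok(codeword ^ 0b110))    # two flipped bits cancel out
```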
In a different context, we can also check for data integrity by computing Cyclic
Redundancy Checks (CRC) to detect accidental errors, or even cryptographic hash
functions (e.g. MD5, [84]) to also detect malicious modifications.
5.7 Class #2: Probabilistic
Our second class of LWCs consists of probabilistic checkers that take a random pa-
rameter r and return a correct answer with a given probability. As we repeat the
checking process more times, with different random parameters r, we get more con-
fidence (a higher probability) that the result was correct. Many applications that
can be checked with a checksum, or ABFT, can also be checked with a probabilistic
algorithm, so there is significant overlap between this class and the checksum one in
terms of the applications they cover.
5.7.1 Probabilistic LWC for matrix multiplication
We can check an $AB = C$ matrix multiplication probabilistically using Freivalds’
algorithm [40]: we generate a random vector $r$ and compare $A(Br)$ to $Cr$. If $AB = C$,
then $A(Br) = Cr$ holds as well. If $AB \neq C$, then $A(Br)$ will
most likely come out as different from $Cr$, indicating an error. However, in some
cases we could have $AB \neq C$ and still get $A(Br) = Cr$, a false negative. To counter
this we can repeat the process multiple times with different random vectors $r$, thereby
decreasing the overall probability of a false negative. Evaluating $A(Br)$ and $Cr$ is
an $O(n^2)$ operation for dense matrices, which is lightweight compared to the $O(n^3)$
matrix multiplication of $A$ and $B$. We can make this LWC even more lightweight by
choosing $r$ as a vector of values in $\{-1, 0, 1\}$ and avoiding multiplications. For sparse
matrices, the reasoning is similar to the ABFT check in Sec. 5.6.2. It takes $O(m)$
work to compute $A(Br)$ if $A$ and $B$ have at most $m$ non-zero elements, and it takes
$O(m_o)$ work to compute $Cr$ if $C$ has at most $m_o$ non-zero elements, so the complexity
of the check is $O(m + m_o)$. We typically have $m_o \approx m$, making the $O(m)$ check
lightweight compared to the $O(mn)$ computation. In the worst case, $m_o = n^2$, so the
$O(n^2)$ check is only lightweight if $m > n$. Prata and Silva [93] performed a comparison
study between ABFT and simple checkers for matrix multiplication, QR decomposition
and matrix inversion; they found that they had similar coverage.
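
A minimal Python sketch of this probabilistic check follows, assuming integer matrices stored as lists of rows and random vectors drawn from $\{-1, 0, 1\}$. The function names and the `rounds` and `rng` parameters are hypothetical conveniences, not part of [40].

```python
import random


def mat_vec(m, v):
    """Matrix-vector product: O(n^2) multiply-adds for a dense matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def freivalds_check(a, b, c, rounds=10, rng=None):
    """Probabilistically test whether A B == C without forming A B."""
    rng = rng or random.Random()
    for _ in range(rounds):
        # With entries in {-1, 0, 1}, hardware could avoid the multiplications.
        r = [rng.choice((-1, 0, 1)) for _ in range(len(c[0]))]
        if mat_vec(a, mat_vec(b, r)) != mat_vec(c, r):
            return False      # a mismatch proves that A B != C
    return True               # no mismatch found in any round


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(freivalds_check(a, b, [[19, 22], [43, 50]]))   # correct product
print(freivalds_check(a, b, [[19, 23], [43, 50]], rounds=40))
```

A correct product always passes. A faulty one escapes a given round only if the random vector happens to mask the error, so the escape probability shrinks geometrically with the number of rounds.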
Probabilistic result checkers rely on generating random inputs [103], which could
be done ahead of time to minimize runtime costs [117].
5.7.2 Probabilistic LWC for matrix inversion
We have different options to construct an LWC for matrix inversion, an $O(N^3)$
operation. First, we can multiply the input matrix $A$ and the output matrix $A^{-1}$ and
check whether the result is equal to the identity matrix: $AA^{-1} = I$. This involves a
matrix multiplication, also an $O(N^3)$ operation, but it has a smaller constant than
matrix inversion, so it is still lightweight. Another option consists of only computing
the diagonal of the output matrix $I$ and checking that all its elements are 1. This still
involves looking at all the input and output data, and is an $O(N^2)$ operation. Yet
another option consists of using ABFT, in the same way as for matrix multiplication,
also resulting in an $O(N^2)$ check.
Finally, we can also use a probabilistic check as in Freivalds’ algorithm for matrix
multiplication (Sec. 5.7.1). Since we implement ABFT for matrix multiplication, we
choose to implement the probabilistic check for matrix inversion. As explained in [93],
ABFT has the drawback that its coverage depends on the underlying implementation
of the matrix inversion, whereas a probabilistic check does not; this sometimes results
in better coverage for the probabilistic check. The check works by testing the equality
$A(A^{-1}r) = r$ given $A$, $A^{-1}$, and a random vector $r$. This is an $O(N^2)$ operation that
we can repeat $k$ times with different $r$ vectors to improve coverage. We report results
for a 6 × 6 matrix ($N = 6$) and $k = 3$, similar to the size of the matrix inversion
required in Lucas-Kanade (Sec. 7.4).
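
The equality test $A(A^{-1}r) = r$ can be sketched in Python as below. The sketch assumes random vectors with entries in $\{-1, 0, 1\}$ (as in Sec. 5.7.1) and a tolerance `eps` for floating-point data; the example uses exact fractions on a 2 × 2 matrix purely for readability, and the function names are hypothetical.

```python
import random
from fractions import Fraction


def mat_vec(m, v):
    """Matrix-vector product: O(N^2) multiply-adds."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def inverse_check(a, a_inv, rounds=3, rng=None, eps=1e-9):
    """Probabilistically test that a_inv is the inverse of a."""
    rng = rng or random.Random()
    for _ in range(rounds):
        r = [rng.choice((-1, 0, 1)) for _ in range(len(a))]
        result = mat_vec(a, mat_vec(a_inv, r))      # A (A^-1 r)
        if any(abs(x - y) > eps for x, y in zip(result, r)):
            return False
    return True


a = [[4, 7], [2, 6]]
a_inv = [[Fraction(3, 5), Fraction(-7, 10)],
         [Fraction(-1, 5), Fraction(2, 5)]]
print(inverse_check(a, a_inv))                       # exact inverse passes
a_inv[1][1] = Fraction(1, 2)                         # corrupt one entry
print(inverse_check(a, a_inv, rounds=40))
```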
5.8 Class #3: Convergent algorithms
Iterative convergent algorithms are an important class of applications in both image
processing (e.g., Lucas-Kanade [10]) and scientific computing [102]. They often have
the useful property that an LWC is built into the algorithm—the convergence accep-
tance test. Assuming we protect the acceptance check, it guarantees that no result is
produced until correct. In many cases, the algorithm will self-correct if errors occur
in an iterative improvement computation. This means that, for correctness, we only
need to focus on the convergence test regardless of the iterative improvement compu-
tation. However, even if an error eventually does not affect the result, it could cause
a significant slowdown by causing more iterations than necessary. To counter this, we
can use the fact that such algorithms often have a monotonically-decreasing metric,
and we can check whether we got closer to the solution after an iteration based on
whether the metric was reduced.
5.8.1 LWC for conjugate gradient
In Sec. 5.6.4, we used Gaussian elimination to solve for the system of linear equations
$A\vec{x} = \vec{b}$. As an alternative, and to avoid the $O(N^3)$ operation, consider the conjugate
gradient iterative solution to $A\vec{x} = \vec{b}$. Fig. 5.3 shows an efficient version for it. We
can use a check after each iteration that is both high-level and lightweight: we check
whether the residue of the current iteration is smaller than the one obtained from
the previous iteration ($\vec{r}_{k+1}^T \vec{r}_{k+1} < \vec{r}_k^T \vec{r}_k$). If it is, then the monotonic convergence
property of the conjugate gradient was preserved: we got closer to the answer, and
we can continue with the next iteration. Otherwise, we know that an error occurred
during the last iteration, so we roll back to the previous state which had an estimate
closer to the actual solution. This lightweight check alone is useful to avoid getting
stuck or slowed down. However, this LWC does not examine the output vector $\vec{x}$
of the CMP module directly, and so does not protect against upsets happening on
$\vec{x}$ at the end of the last iteration right before leaving the CMP module. We can
detect these cases by actually computing $A\vec{x}_k$ as part of the check and confirming
that it is equal to $\vec{b}$ after convergence is achieved. This is similar to the LWC we used
for Gaussian elimination (Sec. 5.6.4). Convergence takes $O(\sqrt{k})$ iterations for an $n \times n$
matrix $A$ with condition number $k$ [105], so the conjugate gradient algorithm requires
$O(n^2\sqrt{k})$ work, whereas the check is only an $O(n^2)$ operation.
In Sec. 6.3, instead of reporting results for the simple conjugate gradient as
described above, we report results for another, more complex conjugate-gradient-based
algorithm, Lucas-Kanade, described in Sec. 7.4.
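Init: $\vec{r}_0 = \vec{b} - A\vec{x}_0$; $\vec{p}_0 = \vec{r}_0$;
for k = 0; k < MaxIterations; k++ do
    $\alpha_k = \dfrac{\vec{r}_k^T \vec{r}_k}{\vec{p}_k^T A \vec{p}_k}$
    $\vec{x}_{k+1} = \vec{x}_k + \alpha_k \vec{p}_k$
    $\vec{r}_{k+1} = \vec{r}_k - \alpha_k A \vec{p}_k$
    if $\vec{r}_{k+1} \simeq 0$ then break
    $\beta_k = \dfrac{\vec{r}_{k+1}^T \vec{r}_{k+1}}{\vec{r}_k^T \vec{r}_k}$
    $\vec{p}_{k+1} = \vec{r}_{k+1} + \beta_k \vec{p}_k$
end for
Figure 5.3: Conjugate Gradient algorithm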
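
The following Python sketch combines the loop of Fig. 5.3 with the per-iteration residue test and the final $A\vec{x} = \vec{b}$ check described in this section. It is an illustration under stated assumptions: dense list-of-rows matrices, a rollback that simply discards the suspect iteration and redoes it from the saved state, a convergence test placed at the top of the loop, and hypothetical names and tolerances throughout.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


def conjugate_gradient(a, b, x0, max_iterations=100, tol=1e-20):
    """Conjugate gradient with a residue-decrease check after each iteration."""
    x = list(x0)
    r = [b_i - ax_i for b_i, ax_i in zip(b, mat_vec(a, x))]
    p = list(r)
    rr = dot(r, r)
    rollbacks = 0
    for _ in range(max_iterations):
        if rr <= tol:                       # convergence acceptance test
            break
        ap = mat_vec(a, p)
        alpha = rr / dot(p, ap)
        x_new = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r_new = [r_i - alpha * ap_i for r_i, ap_i in zip(r, ap)]
        rr_new = dot(r_new, r_new)
        if not rr_new < rr:                 # LWC: the residue must shrink
            rollbacks += 1                  # keep the old state and redo
            continue
        beta = rr_new / rr
        p = [r_i + beta * p_i for r_i, p_i in zip(r_new, p)]
        x, r, rr = x_new, r_new, rr_new
    return x, rollbacks


def final_check(a, x, b, eps=1e-9):
    """End-to-end check: A x must equal b within eps (O(n^2) work)."""
    return all(abs(ax_i - b_i) <= eps for ax_i, b_i in zip(mat_vec(a, x), b))


a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, rollbacks = conjugate_gradient(a, b, [0.0, 0.0])
print(x, rollbacks, final_check(a, x, b))
```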
5.9 Class #4: Error-tolerant applications
In many applications, particularly in signal processing, we can often increase the error
rate while still maintaining proper operation. This is for applications that already
tolerate a level of Signal to Noise Ratio (SNR), such as compression, object matching
and feature detection (handwriting and speech recognition). Therefore, we often have
a margin to decrease voltage and SNR, yet maintain an acceptable signal, image or
sound quality, even with more errors at the output, especially when they are isolated.
ASET (Algorithmic Soft-Error Tolerance) uses this idea to design low-cost checkers
for DSP applications [106]. Another example is Gaussian Mixture Models (GMM)
[43], which we implement as part of the WAMI, with details in Sec. 7.5.
5.10 Class #5: No checks possible
Every deterministic application has a checker since it suffices to duplicate the
computation to check it. Therefore, a checker’s overhead has an upper bound of 100%.
Sometimes we can design a checker that is cheaper asymptotically, but it could turn
out to be more expensive than the computation in practice because of constant
factors. In those cases, we can choose to use duplication instead of the LWC, so the upper
bound for checker overhead is still 100%. This leads to the conceptual LWC-space
curve shown in Fig. 5.4. We argue below that all applications have LWCs, except at
the two ends of that curve: the “very simple” and “very complex” applications, to
be defined below.

[Figure 5.4: Conceptual view of LWC existence and complexity. Axes: application
complexity; LWC cost (0% to 100%).]
We have already seen some applications that do not possess LWCs, such as
addition and matrix transposition². What do these applications have in common?
Informally, we notice that they are “very simple”. This is also the case of integer addition,
subtraction, and basic boolean operations such as logical and, or, xor, not, nand, nor.
In general, the more complex the task, the more structured redundancy there will be
in the data usage, the more likely it is that the operation will have an LWC.
What does it mean for a task to be “complex”? Note that each of the applications
mentioned above that does not have an LWC uses each bit of data only once. When
matrix transposition reads a bit, it knows exactly where it needs to move it, it does
not access it again. The same is true for addition, subtraction, and basic boolean
operations such as xor, where we look at each bit only once. This leaves no room for
redundancy to be created. For an LWC to have good coverage, it also needs to look at
every bit, making it as costly as the original computation. In contrast, applications
that have LWCs look at the input bits multiple times and combine them in a way
that creates redundancy that the LWC can then check without accessing each of the
original bits of information as many times.
² At least to first order; see Sec. 5.12.
Remember the addition example from Sec. 5.4; we could not find a check that
is lightweight and has good coverage at the same time. For the check to have good
coverage, we had to look at each bit of data once; but this is also what the compute
was doing, so we could not come out ahead. For computing the sort, each of the
N data points to be sorted was compared log(N) times, whereas the LWC only had
to look at each input and output data point once. For an h × h window filter, the
compute had to access each pixel ≈ $h^2$ times, whereas the LWC had to access each
pixel only once. For the $N \times N$ matrix multiplication, the compute had to access each
data point N times, whereas the LWC only had to access each data point once.
In general, if we access each data point more times, we create more redundancy
that can be exploited by a potential LWC. This rule of thumb is eventually broken as
an application becomes even more “complex”, and we lose the lightweight checking
ability. This corresponds to the turnaround point in Fig. 5.4. It happens when too
much redundancy is stored in a finite amount of space, causing the information to
combine and get lost.
Reaching the far end of the high-complexity designs in Fig. 5.4 requires careful
design. Popular block ciphers such as AES [1] or cryptographic hash functions such
as MD5 [84] are examples of it. These encryption schemes are specifically designed
to require a full recomputation to check the answer. They often consist of a series
of simple operations, and they are repeated many times over the same input data,
so they are not “low complexity”. For example, MD5 consists of four rounds of 16
operations that each involve boolean operations, a non-linear mapping, additions, and
permutations. Each bit of the input ends up being read four times, and mixed in with
the other data in a structure that achieves an “avalanche effect” [38]. This means that
we expect a small change in the input to have a drastic effect at the output. Typically,
one bit flip at the input should result in half the bits being flipped at the output.
The design intent is that small changes are quickly propagated through the different
iterations, leading to each output bit being dependent on each input bit. In other
words, the data is “scrambled” as well as possible, and the output does not provide
any information about the input, unless we recompute the whole MD5. Therefore,
when we are given the output of the computation, we do not get any “tricks” to check
whether it is correct based on the input.
5.11 Using context
It is often possible to exploit the context within which a computation is executed
to create a check that previously did not exist. For instance, by thinking about the
precision needed at the output, we can often be satisfied with errors in the lower
bits, and we can create approximation-based LWCs, such as in [106]. In fact, the
error-tolerant class from Sec. 5.9 is an example of using context to simplify an LWC,
or even remove the need for it. As that class shows, we can often get away with
approximate computations. We could even say that any computation could have an
LWC, at least if it is used within a context that allows us to relax constraints at
the output. The WAMI system in Chapter 7 will also exploit context to simplify its
LWCs.
In Sec. 5.10, we claimed that MD5 did not have an LWC. However, even for an
extreme example such as that one, we can sometimes find an LWC depending on the
application. For instance, consider a system where MD5 is used to hash a password
entered by a user and check it against a database to decide whether to grant access.
If the MD5 matches what is in the database, then there is no need to check it:
chances are extremely high that both the entered password and the MD5 computation
were correct. On the other hand, if the MD5 does not match what is in the database,
we can recompute it. If it fails again, we do not grant access. This is not itself a
lightweight operation, but a duplication. However, if most passwords entered are
correct (as we might expect), then the overhead of recomputing becomes very small.
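
The recompute-on-mismatch policy can be sketched with Python's standard `hashlib` module. MD5 is used here only because it is the example in the text; the function names and the `hash_fn` parameter (which lets a test substitute a faulty hash unit) are hypothetical.

```python
import hashlib


def md5_hex(data):
    """Hex digest of `data` (bytes), standing in for the hash unit under test."""
    return hashlib.md5(data).hexdigest()


def grant_access(password, stored_digest, hash_fn=md5_hex):
    """Accept on a match; recompute once on a mismatch before rejecting."""
    if hash_fn(password) == stored_digest:
        return True     # common case: no extra work is spent on checking
    # Mismatch: either a wrong password or a faulty computation, so recompute.
    return hash_fn(password) == stored_digest


stored = md5_hex(b"correct horse")
print(grant_access(b"correct horse", stored))   # one hash computation
print(grant_access(b"wrong guess", stored))     # two hash computations
```

The second hash is computed only when the first comparison fails, so the average overhead scales with the fraction of rejected attempts.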
AES was another example given in Sec. 5.10 as having no LWC. In the general
case, this is true; but in the common case we will usually expect the deciphered
plaintext to have a certain structure to it. For example, we may know that we are
expecting an English sentence, and if what comes out looks like random bits instead
of letters of the alphabet, then we can recompute.
5.12 Impact of communication complexity
Throughout this chapter, we have reported the basic asymptotic complexities of the
operations. However, once we take into account communication complexity, things
may change, usually in favor of the LWC, which is typically smaller and incurs a
smaller communication overhead. For instance, the FFT is an $O(N \log N)$ operation,
but once we include its communication complexity, it becomes $O(N\sqrt{N}\log N)$. This
is easiest to see in the fully sequential case, with one PE and a large memory, for
which each access costs $O(\sqrt{N})$. $O(N\sqrt{N}\log N)$ is an even greater difference with the
$O(N)$ LWC, which is a simple accumulated summation and does not incur additional
communication complexity (no communication between the data points).
Another example is the matrix transposition, which we have categorized as not
having an LWC in Tab. 5.1, because both the computation and the ABFT-based check
have $O(N^2)$ complexity, and because each data point is only accessed once, leaving
no room for a constant difference either. However, if we include communication
complexity, we notice that each data point needs to travel a distance of $N$ on average,
making the CMP $O(N^3)$. The check on the other hand remains $O(N^2)$, turning it
into a lightweight check.
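As an illustration, one possible checksum-style check for transposition compares the row sums of the input with the column sums of the output, and vice versa. The text does not spell out the exact ABFT check used, so the function names and the specific checksum below are assumptions made for this sketch.

```python
def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def transpose_check(a, b):
    """Checksum test for b == transpose(a), using O(N^2) additions.

    Row sums of a must equal column sums of b, and column sums of a
    must equal row sums of b. A single corrupted element of b changes
    one row sum and one column sum, so it is always detected.
    """
    row_a = [sum(r) for r in a]
    col_a = [sum(c) for c in zip(*a)]
    row_b = [sum(r) for r in b]
    col_b = [sum(c) for c in zip(*b)]
    return row_a == col_b and col_a == row_b
```

Integer data is assumed here so that the sums can be compared exactly; floating-point data would need a tolerance.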
5.13 Combining different LWCs
Convergent algorithms are higher-level constructs that potentially instantiate one or
more applications already discussed, leaving degrees of freedom to the designer or
optimization tools when deciding where to apply checks. For example, the conjugate
gradient algorithm (Sec. 5.8.1) computes a matrix-vector multiplication on every
iteration, for which we can choose to use an LWC or not as discussed in Sec. 5.6.2. On
the other hand, we can also use a final matrix-vector multiplication only as a check
(compute $A\vec{x}$ and see if it equals $\vec{b}$). This decreases the cost of the check itself but
potentially increases the rollback cost, since an error caught only at the very end means
that we need to roll back the whole computation. With the end-to-end convergence
test in place, the use of smaller checks on the component computations is purely a
performance optimization. If they reduce energy by saving more energy on wasted
computations than they add by performing extra checks, they can be included.
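The end-to-end test mentioned above (compute $A\vec{x}$ and compare it with $\vec{b}$) can be sketched as below. The function names and the tolerance `tol` are assumptions introduced for illustration; a real design would choose the tolerance from the convergence criterion of the solver.

```python
def matvec(a, x):
    """Dense matrix-vector product for a matrix given as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def end_to_end_check(a, x, b, tol=1e-9):
    """Accept the solution x if every component of A x is within tol of b."""
    return all(abs(ri - bi) <= tol for ri, bi in zip(matvec(a, x), b))
```

The check costs one matrix-vector multiplication, regardless of how many iterations the solver needed to produce x.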
5.14 Chapter conclusions
This chapter identified six classes of applications in the P complexity class that have
LWCs. These LWCs are required to both be cheap and have good error coverage,
two constraints that are often in opposition and constitute a trade-off knob for the
designer. We showed many examples of important applications throughout the six
categories. The LWCs we found were either asymptotically cheaper than the
computation, or at least cheaper by a constant factor. We also saw some operations
that are not amenable to lightweight checking; these were either too simple (e.g.
simple boolean logic) or too complex (e.g. cryptographic hashes). The next chapter
implements some of these LWCs and evaluates the energy benefits of augmenting the
computations with them.
Chapter 6
Lightweight Checking Models and
Results
6.1 Introduction
In Chapter 3 we were able to reduce energy by reducing the operating voltage Vdd, but
we did not consider the reliability impact of doing so. In reality, a lower Vdd means
that our circuits are more prone to various kinds of upsets (Sec. 2.4). This chapter
provides a framework to quantify this effect, and proposes the use of lightweight checks
as a way to recover this reliability loss while maintaining energy savings overall.
We develop reliability models and a fault injection methodology in Sec. 6.2. We
then present LWC experimental results in Sec. 6.3. We find that the lightweight
checks cost only a fraction of the base computation (0-23%), and that they allow us
to achieve 44% energy savings while maintaining reliability, compared to 50% (footnote 1) if we
simply reduced Vdd without using LWCs and ignored reliability.
Footnote 1: We previously reported that voltage scaling allows 60% energy savings if we ignore reliability.
This was based on the results from Chapter 3 where we included a larger benchmark set, including
many benchmarks from VTR 7.0 and Toronto20 for which we do not know how to construct LWCs
because we do not know what the applications do exactly. The 50% we report here is based on a
subset of the benchmarks, specifically those for which we have also constructed LWCs.
6.2 Reliability models
Our goal is to compare a “base” design with a CMP module and without an LWC
module, to a “protected” design that has both the CMP module and an LWC module.
We know how to evaluate energy, delay and area from the tool flow shown in previous
sections (Fig. 2.2). The SDC rate is calculated as shown in this section. Our main goal
is to reduce energy without impacting the SDC rate. Therefore, we reduce voltage,
and we compensate for the rise in SDC by adding the LWC.
Note that we distinguish between a bit flip that is due to the environment and
does not necessarily propagate, and an error, which corrupts data. Errors can further
be detected by the LWC, or silent, causing the SDC rate to rise.
6.2.1 Energy savings analysis
We potentially allow two different voltage levels to operate the LWC and CMP
modules: Vhigh and Vlow respectively (differential reliability, Sec. 5.3). Therefore, energy
consumption has the form:
\[ E_{CMP} = E_{dyn-CMP} + E_{lkg-CMP} = \beta_{CMP} \cdot V_{low}^2 + T_{CMP} \cdot I_{lkg-CMP} \cdot V_{low}, \tag{6.1} \]
\[ E_{LWC} = E_{dyn-LWC} + E_{lkg-LWC} = \beta_{LWC} \cdot V_{high}^2 + T_{LWC} \cdot I_{lkg-LWC} \cdot V_{high}, \tag{6.2} \]
where β = 1/2 · αact · C can be approximated as a constant for a given application,
αact is the weighted average activity factor, and C is the weighted average capacitance
for all the nodes. The LWC module typically has a shorter critical path than the CMP
module. Once we put them together on the same chip, they both share the same critical
path Tcrit. At nominal Vdd we expect Tcrit ≈ max(TCMP, TLWC). The energy consumption of the
base design is simply that of the CMP module at nominal Vdd (VCMP = Vnom):
\[ E_{base} = \beta_{CMP} \cdot V_{nom}^2 + T_{crit-nom} \cdot I_{lkg-CMP-nom} \cdot V_{nom}. \tag{6.3} \]
For the protected design, when an error is detected by the LWC mechanism, we
need to recompute the operation. If prcmp is the probability of recomputation due
to the LWC having detected an error, then the expected number of times we will
compute the operation is:
\[ tries = 1 + p_{rcmp} + p_{rcmp}^2 + p_{rcmp}^3 + \dots = \frac{1}{1 - p_{rcmp}}. \tag{6.4} \]
The energy consumption of the protected design is the sum of that of the CMP and
LWC modules, multiplied by the expected number of times the computation will need
to be performed.
\[ E_{protected} = (E_{CMP} + E_{LWC}) \cdot \frac{1}{1 - p_{rcmp}}. \tag{6.5} \]
This leads to the following energy savings:
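\[ E_{sav} = 1 - \frac{E_{protected}}{E_{base}}, \tag{6.6} \]
\[ E_{sav} = 1 - \frac{\beta_{CMP} \cdot V_{low}^2 + \beta_{LWC} \cdot V_{high}^2 + T_{crit} (I_{lkg-CMP} V_{low} + I_{lkg-LWC} V_{high})}{\beta_{CMP} \cdot V_{nom}^2 + T_{crit-nom} \cdot I_{lkg-CMP-nom} \cdot V_{nom}} \cdot \frac{1}{1 - p_{rcmp}}. \tag{6.7} \]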
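The energy model of Eqs. 6.1 to 6.6 can be written as a few small functions. This is a minimal sketch: the function names are hypothetical, and the caller is assumed to supply the voltage-dependent quantities (critical path time and leakage current) for the operating point being evaluated.

```python
def module_energy(beta, v, t_crit, i_lkg):
    """Dynamic plus leakage energy of one module (form of Eqs. 6.1 and 6.2)."""
    return beta * v ** 2 + t_crit * i_lkg * v


def expected_tries(p_rcmp):
    """Eq. 6.4: expected number of executions with recomputation probability p_rcmp."""
    return 1.0 / (1.0 - p_rcmp)


def energy_savings(e_cmp, e_lwc, e_base, p_rcmp=0.0):
    """Eqs. 6.5 and 6.6: savings of the protected design relative to the base design."""
    e_protected = (e_cmp + e_lwc) * expected_tries(p_rcmp)
    return 1.0 - e_protected / e_base


# Hypothetical, unitless numbers chosen only to exercise the formulas.
e_base = module_energy(beta=1.0, v=1.0, t_crit=1.0, i_lkg=0.1)
e_cmp_low = module_energy(beta=1.0, v=0.5, t_crit=2.0, i_lkg=0.05)
e_lwc_low = module_energy(beta=0.1, v=0.5, t_crit=2.0, i_lkg=0.005)
savings_scaled = energy_savings(e_cmp_low, e_lwc_low, e_base)
```

With both modules left at the nominal voltage, `energy_savings` returns a value at or below zero, which matches the observation that the LWC overhead has to be recovered through voltage scaling.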
Of course, if we do not scale Vdd (Vhigh = Vlow = Vnom), the protected design will cost
more energy than the base case: Esav reduces to Esav = 1 − (1 + ELWC/ECMP) · 1/(1 − prcmp) ≤ 0.
This can still be useful if the error rate at nominal Vdd is already unacceptable.
Otherwise, we need to scale Vlow to see savings. Note that prcmp is typically kept very
low (prcmp ≈ 0), so 1/(1 − prcmp) can be approximated as 1 in Eq. 6.7. Furthermore, note
that in the extreme, even if we could reduce Vlow down to 0 and operate reliably, we
would still not get 100% savings because of the cost of the LWC, which stays at Vhigh:
\[ E_{sav}(V_{low} = 0) \approx 1 - \frac{\beta_{LWC} \cdot V_{high}^2 + T_{crit} \cdot I_{lkg-LWC} \cdot V_{high}}{\beta_{CMP} \cdot V_{nom}^2 + T_{crit-nom} \cdot I_{lkg-CMP-nom} \cdot V_{nom}}. \tag{6.8} \]
If the check is lightweight, then βLWC < βCMP, and assuming Vhigh stays at nominal
Vdd (Vhigh = Vnom), then Ilkg−LWC < Ilkg−CMP and Tcrit ≈ Tcrit−nom. We get an upper
bound on the energy savings achievable:
\[ E_{sav-best}(V_{high} = V_{nom}) \approx 1 - \frac{\beta_{LWC} \cdot V_{nom} + T_{crit} \cdot I_{lkg-LWC}}{\beta_{CMP} \cdot V_{nom} + T_{crit} \cdot I_{lkg-CMP}} > 0. \tag{6.9} \]
Keeping the lightweight check at nominal Vdd (Vhigh = Vnom) is conceptually useful
to maintain at least part of the chip at “nominal reliability”, but it forces a lower
bound on how much we can reduce energy (Eq. 6.9). In practice, as we will show in
the next sections, we can still maintain high reliability even if we reduce the voltage
of the LWC (Vhigh < Vnom), and this will enable larger savings. In the extreme,
at Vlow = Vhigh = 0, we would get Esav = 1, since both energy terms of the protected design in Eq. 6.7 vanish. Keeping Vlow = Vhigh and scaling
them together is a particularly good choice since it avoids extra overhead for voltage
programmability and conversion. In fact, even when they run at the same voltage,
the LWC is typically more robust to upsets than the CMP because it is typically
smaller, or takes fewer cycles to complete. That is, the product Nt · d0 is smaller
for the LWC, resulting in fewer opportunities to get upset for a given computation.
This means that we can often reduce VLWC a bit below Vnominal and still maintain
the “nominal reliability” that the CMP had at nominal Vdd.
6.2.2 Silent data corruption (SDC) rate
Errors are rare events, so simulating them is not straightforward. If we expect an
error every ≈ 10^20 cycles, we cannot simulate for ≈ 10^20 cycles until the error is
observed: the simulation runtime would be prohibitive. Instead, we inject errors on
purpose to evaluate system behavior when an error happens, and we evaluate the
SDC rate according to the probability of the fault injection we forced. Our approach
is outlined below.
Each application has the following characteristics that can be obtained experimentally:
• d0, the depth, or the number of cycles needed between the first input showing
up and the last output leaving. The depth can be different for CMP and LWC,
so we use d0,LWC and d0,CMP . For example, a streaming mergesort that sorts N
elements and processes P inputs per cycle will have a depth d0 = N/P , because
it takes 2N/P cycles between the first P inputs entering the sorter and the last
P outputs leaving the sorter, but during a given cycle half the nodes are being
used by another set of data (so we divide by 2).
• Nt, the total number of physical locations where a bit flip can occur. Again, it
can be different for CMP and LWC, so we use Nt,LWC and Nt,CMP . Since the
check is lightweight, we will typically have Nt,LWC < Nt,CMP . Nt includes many
types of different physical locations (wire segments, registers, LUTs, I/O). To
simplify our analysis, we assume an equally likely error rate for each resource
type. This will not be the case in practice, but we will perform a sensitivity
analysis to ensure we are properly capturing the worst case (Sec. 6.3).
• pbf (V ), the bit flip rate at voltage V , or the probability that the logical level of
a resource is misinterpreted (0 instead of 1 or 1 instead of 0) in one clock cycle.
This alone does not necessarily lead to an error since the fault could be masked
by subsequent logic. For example if one input to an AND gate is misinterpreted
as 1 instead of 0, and the other input is correctly interpreted as 0, the output
would correctly be interpreted as 0, even though the first input experienced a bit
flip.
• pprop(i), the probability that the output is corrupted given i simultaneous bit
flips. pprop(i) captures the logical masking effects explained above and is obtained
through simulation.
pbf(V) depends exponentially on voltage and is given by the following equation
(derived in Sec. 6.2.4):
\[ p_{bf}(V_{low}) = T(V_{low}) \cdot SER_0(V_{nominal}) \cdot e^{\alpha (V_{nominal} - V_{low})} \tag{6.10} \]
The probability of i simultaneous bit flips while the application is executing its task,
psbf (i), depends on the size of the design (Nt) and how many cycles it takes to
complete its task (d0). Throughout the whole process, there are d0Nt nodes that can
get upset, leading to:
\[ p_{sbf}(i, V) = \binom{d_0 N_t}{i} \, (p_{bf}(V))^{i} \, (1 - p_{bf}(V))^{d_0 N_t - i}. \tag{6.11} \]
Note that psbf (i, V ) gives simultaneous bit flips while the application is executing its
task, not just in one clock cycle, as opposed to pbf . So we have pbf (V ) ≤ psbf (1, V ),
since a psbf (1, V ) event can happen on any of the Nt nodes during any of the d0 cycles.
We can now simulate these cases separately and combine the results to compute the
SDC rate (probability of SDC per cycle) for the “base” design:
\[ p_{sdc,base} = \sum_{i=1}^{d_0 N_t} p_{sbf,CMP}(i, V_{nom}) \, p_{prop,CMP}(i). \tag{6.12} \]
i.e. for each possible number of simultaneous bit flips 1 ≤ i ≤ d0Nt we multiply the
probability of i simultaneous bit flips by the probability that i simultaneous bit flips
cause an error.
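Eqs. 6.11 and 6.12 translate directly into code. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the sum is truncated at a small number of simultaneous bit flips (`max_flips`), and the `p_prop` values passed in would come from fault injection simulations.

```python
import math


def p_sbf(i, p_bf, d0, nt):
    """Eq. 6.11: probability of exactly i bit flips among d0*nt opportunities."""
    n = d0 * nt
    return math.comb(n, i) * p_bf ** i * (1.0 - p_bf) ** (n - i)


def p_sdc_base(p_bf, d0, nt, p_prop, max_flips=3):
    """Eq. 6.12 truncated at max_flips simultaneous bit flips.

    p_prop maps a bit flip count i to the propagation probability p_prop(i).
    """
    return sum(p_sbf(i, p_bf, d0, nt) * p_prop[i]
               for i in range(1, max_flips + 1))
```

Truncating the sum is reasonable only while p_bf is small enough that terms with more bit flips are negligible.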
For the “protected” design we follow a similar approach, but since CMP and
LWC potentially use a different voltage, we cannot do it as directly. The probability
of simultaneously seeing i bit flips in CMP and j bit flips in LWC is:
psbf2(i, j, VCMP , VLWC) = psbf2(i, j) = psbf,CMP (i, VCMP ) · psbf,LWC(j, VLWC). (6.13)
Furthermore, for the protected design, we need to capture how well the LWC catches
an error at the CMP output: when an error occurs in CMP and LWC catches it, we
do not consider it to be an SDC event. Instead of the two possible outcomes for the
“base” design captured by pprop(i), we now have four possible outcomes as shown in
Tab. 6.1.
Table 6.1: Possible outcomes of the protected design (CMP+LWC)
Notation   CMP        LWC        Outcome
p00        no error   no catch   expected behavior → proceed
p01        no error   catch      false positive → recompute
p10        error      no catch   SDC → proceed
p11        error      catch      successful catch → recompute
When we have i simultaneous bit flips in CMP while the LWC itself sees j
simultaneous bit flips, we have:
• p00(i, j) is the probability that there is no error in CMP and that LWC does not
catch an error; this is the common case.
• p01(i, j) is the probability that there is no error in CMP and that LWC thinks
there was an error; this is a false positive.
• p10(i, j) is the probability that there is an error in CMP and that LWC does not
catch an error; this is an SDC event.
• p11(i, j) is the probability that there is an error in CMP and that LWC successfully
catches the error.
Of course, we have:
p00(i, j) + p01(i, j) + p10(i, j) + p11(i, j) = 1. (6.14)
We can now write an expression similar to Eq. 6.12 for the four outcomes of Tab. 6.1.
The SDC rate of the protected design is given by:
\[ p_{sdc,protected} = \sum_{i=0}^{d_{0,CMP} N_{t,CMP}} \; \sum_{j=0}^{d_{0,LWC} N_{t,LWC}} p_{sbf2}(i, j) \, p_{10}(i, j). \tag{6.15} \]
The probability of a false positive event:
\[ p_{false+} = \sum_{i=0}^{d_{0,CMP} N_{t,CMP}} \; \sum_{j=0}^{d_{0,LWC} N_{t,LWC}} p_{sbf2}(i, j) \, p_{01}(i, j). \tag{6.16} \]
The probability of an error correctly caught by the LWC:
\[ p_{good\,catch} = \sum_{i=0}^{d_{0,CMP} N_{t,CMP}} \; \sum_{j=0}^{d_{0,LWC} N_{t,LWC}} p_{sbf2}(i, j) \, p_{11}(i, j). \tag{6.17} \]
The probability of no error and normal continuation:
\[ p_{expected} = \sum_{i=0}^{d_{0,CMP} N_{t,CMP}} \; \sum_{j=0}^{d_{0,LWC} N_{t,LWC}} p_{sbf2}(i, j) \, p_{00}(i, j). \tag{6.18} \]
Finally, the probability of recomputation:
\[ p_{rcmp} = p_{false+} + p_{good\,catch} = \sum_{i=0}^{d_{0,CMP} N_{t,CMP}} \; \sum_{j=0}^{d_{0,LWC} N_{t,LWC}} p_{sbf2}(i, j) \, (p_{01}(i, j) + p_{11}(i, j)). \tag{6.19} \]
As we will see in Sec. 6.3, it will be an important design characteristic of our
LWCs to have
\[ p_{10}(1, 0) = 0, \tag{6.20} \]
i.e. the LWC catches any single error in CMP. This will allow us to operate at
exponentially higher bit flip rates and thus drop the voltage and energy further (more
in Sec. 6.3.2.3). This is because when Eq. 6.20 is satisfied, there needs to be at least
two simultaneous errors in order to cause an SDC, a much less likely event than having
a single error. Indeed, since we will keep pbf relatively low, the probability of n + 1
simultaneous bit flips is exponentially smaller than the probability of n simultaneous
bit flips, so we actually do not really need to simulate up to d0Nt simultaneous bit
flips. In the experiments that will follow, we simulate up to 3 simultaneous bit flips,
sufficient to observe all the interesting results. We do not need to drop the voltage
enough to reach the region where pbf is too high and requires us to simulate more
than 3 simultaneous bit flips.
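The truncated double sums of Eqs. 6.15 and 6.19 can be sketched as follows. The function names and the dictionary layout are assumptions made for this illustration; the outcome probabilities in the example are the first three rows of Tab. 6.2, while the bit flip rate and the opportunity counts are hypothetical placeholders.

```python
import math


def p_sbf(i, p_bf, n):
    """Eq. 6.11 with n = d0 * Nt opportunities for a bit flip."""
    return math.comb(n, i) * p_bf ** i * (1.0 - p_bf) ** (n - i)


def protected_rates(outcomes, p_bf_cmp, n_cmp, p_bf_lwc, n_lwc):
    """Return (p_sdc, p_rcmp) summed over the (i, j) pairs present in outcomes.

    outcomes maps (i, j) to the tuple (p00, p01, p10, p11) of Tab. 6.1.
    """
    p_sdc = 0.0
    p_rcmp = 0.0
    for (i, j), (p00, p01, p10, p11) in outcomes.items():
        weight = p_sbf(i, p_bf_cmp, n_cmp) * p_sbf(j, p_bf_lwc, n_lwc)  # Eq. 6.13
        p_sdc += weight * p10            # Eq. 6.15
        p_rcmp += weight * (p01 + p11)   # Eq. 6.19
    return p_sdc, p_rcmp


# Outcome probabilities from the first rows of Tab. 6.2 (sort, N = 1024, P = 1).
outcomes = {
    (0, 0): (1.0, 0.0, 0.0, 0.0),
    (1, 0): (0.5685, 0.0, 0.0, 0.4315),
    (0, 1): (0.4418, 0.5527, 0.0, 0.0056),
}
# Hypothetical bit flip rate and opportunity counts, for illustration only.
p_sdc, p_rcmp = protected_rates(outcomes, 1e-12, 1000, 1e-12, 100)
```

Because p10 is zero for every single-fault row, the SDC contribution of these rows is exactly zero, which is the property expressed by Eq. 6.20.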
6.2.3 Fault injection runtime
To simulate one bit flip throughout one round of computation, we need to perform
Ntd0 different experiments to exhaustively test every possible location for the bit
flip. For n simultaneous errors, we would need to perform (Ntd0)^n experiments. This
quickly becomes prohibitive, so we only simulate the n = 1 case exhaustively. For
n > 1, we inject at random locations and run as many experiments as possible. This
is not an issue because small inaccuracies in the fault injection results do not affect
our results, unless we claim 100% coverage (p10(1, 0) = 0) while in reality it is not
100% (e.g. a coverage of 99.9999%, and we claim 100% because we missed the one
bad case due to random tests). However, we never find 100% coverage for n > 1, and
the cases with n = 1 that have 100% coverage are tested exhaustively.
Running one of the experiments described above requires us to evaluate each of
the Nt nodes on each clock cycle, and there are O(d0) clock cycles, so simulation time
for one experiment is O(Ntd0). Therefore, the total runtime grows even faster as the
application size increases: not only are there more experiments to perform, but each
experiment takes longer. Overall, for exhaustive tests, the runtime is O((Ntd0)^(n+1)).
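To give a feel for these counts, the short sketch below evaluates them for the sort CMP module of Sec. 6.3.1 (Nt = 10888, d0 = 1024). The function names are hypothetical, and the "work" figure counts node evaluations rather than wall-clock time.

```python
def exhaustive_experiments(nt, d0, n):
    """Number of fault injection experiments for n simultaneous bit flips: (nt*d0)**n."""
    return (nt * d0) ** n


def exhaustive_work(nt, d0, n):
    """Experiments times the per-experiment cost nt*d0, i.e. (nt*d0)**(n+1) node evaluations."""
    return exhaustive_experiments(nt, d0, n) * nt * d0


single_fault_runs = exhaustive_experiments(10888, 1024, 1)  # 11149312 experiments
double_fault_runs = exhaustive_experiments(10888, 1024, 2)  # the square of the above
```

The jump from about 1.1 · 10^7 experiments for n = 1 to about 1.2 · 10^14 for n = 2 is why only the single-fault case is tested exhaustively.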
6.2.4 Bit flip rate versus voltage
Transient upsets in logic do not always get captured in a register or memory because
of one of the following masking mechanisms [107]:
• Logical masking: The upset may get masked by subsequent logic, for example a
2-AND gate propagates a 0 when its two inputs are 0. If one of them gets upset
and becomes 1, the output of the 2-AND is still 0. The extent to which faults
get masked by logic is application-specific; we capture it experimentally using
our gate-level fault injection simulations.
• Electrical masking: The effect of the upset may get attenuated as it propagates
through a logic chain. We do not model this masking directly, and its effect
should get captured in the parameters α and FIT0(Vnominal) (see below).
• Latching-window masking: If the upset reaches a latch outside of its “window
of vulnerability”, i.e. not at the clock transition where the value gets captured,
then the upset gets masked. We do not model this either; instead we make the
worst-case assumption that an upset happening at any time during the clock
cycle will get latched (see below).
As suggested in the model developed by Hazucha and Svensson [48] and used by
previous work [107, 106, 39], the SER (soft-error rate) is an exponential function of
Qcrit, the minimum charge that needs to be deposited by a particle strike to cause an
upset:
\[ SER = \frac{1}{2} \cdot F \cdot A \cdot e^{-Q_{crit}/Q_s} \tag{6.21} \]
F is the neutron flux (units of upsets/(m^2 s)), A is the area of the circuit affected
by particle strikes (units of m^2), Qs is the charge collection efficiency (units of C).
We multiplied by 1/2 to convert from upset to bit flip, assuming half the upsets are
in the “right direction”. The resulting SER is in bit flips/s. For a given design, we
can estimate the probability of a bit flip in one unit resource in one clock cycle T as
follows:
\[ p_{bf}(V) = T(V) \cdot SER_0(V), \quad \text{as long as } p_{bf}(V) \ll 1 \tag{6.22} \]
SER0 refers to the SER in a unit resource. We multiplied by T to get the expected
number of upsets in one clock cycle. As long as that number is significantly less than
1, the probability of an upset in one clock cycle can be approximated as the expected
number of upsets in one clock cycle. Hence the condition pbf << 1. The above
assumes a worst-case window of vulnerability: it assumes that a bit flip at any point
in the clock cycle will get latched (no latching-window masking). This leads to:
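\[ p_{bf}(V_{low}) = \frac{1}{2} \cdot T(V_{low}) \cdot F \cdot A \cdot e^{-Q_{crit,low}/Q_s} \tag{6.23} \]
\[ p_{bf}(V_{high}) = \frac{1}{2} \cdot T(V_{high}) \cdot F \cdot A \cdot e^{-Q_{crit,high}/Q_s} \tag{6.24} \]
Taking the ratio of the two equations above, and using Qcrit ∝ C · V (so that the
exponent (Qcrit,high − Qcrit,low)/Qs is proportional to Vhigh − Vlow, with α denoting
the proportionality constant), we get:
\[ p_{bf}(V_{low}) = p_{bf}(V_{high}) \cdot \frac{T(V_{low})}{T(V_{high})} \cdot e^{\alpha (V_{high} - V_{low})} \tag{6.25} \]
Substituting pbf(Vhigh) = T(Vhigh) · SER0(Vhigh) from Eq. 6.22:
\[ p_{bf}(V_{low}) = T(V_{low}) \cdot SER_0(V_{high}) \cdot e^{\alpha (V_{high} - V_{low})} \tag{6.26} \]
With Vhigh = Vnominal:
\[ p_{bf}(V_{low}) = T(V_{low}) \cdot SER_0(V_{nominal}) \cdot e^{\alpha (V_{nominal} - V_{low})} \tag{6.27} \]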
This equation is similar to the one suggested and confirmed experimentally in [39].
Remember that we require pbf (Vlow) << 1, otherwise the equation above is not valid
any more.
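Eq. 6.27 can be evaluated with a small helper. This is a sketch under stated assumptions: the function name, the argument names and the validity threshold `max_valid` (a stand-in for the condition pbf << 1) are hypothetical, and the clock period at the reduced voltage has to be supplied by the caller, for example from timing analysis.

```python
import math


def p_bf(v_low, v_nominal, t_low, ser0_nominal, alpha, max_valid=1e-3):
    """Eq. 6.27: per-resource, per-cycle bit flip probability at v_low.

    t_low is the clock period at v_low in seconds and ser0_nominal is the
    unit-resource soft-error rate at nominal voltage in bit flips per second.
    max_valid is an assumed cutoff for the requirement that p_bf << 1.
    """
    p = t_low * ser0_nominal * math.exp(alpha * (v_nominal - v_low))
    if p >= max_valid:
        raise ValueError("p_bf is too large for Eq. 6.27 to be valid")
    return p
```

For a fixed clock period, lowering the voltage by 0.1 V with α = 10 multiplies the bit flip rate by e, about 2.7×.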
Note that we can easily convert SER0(Vnominal) into a FIT rate (Failures In Time,
defined as the number of failures in 1 billion hours):
\[ FIT_0(V_{nominal}) = 3600 \cdot 10^9 \cdot SER_0(V_{nominal}) \tag{6.28} \]
Once we obtain the SDC rate of the system (probability of uncaught error per
cycle) as in Sec. 6.2.2, we can convert it to a system-wide SER rate, and then a
system-wide FIT rate:
\[ FIT_{system}(V_{low}) = 3600 \cdot 10^9 \cdot p_{sdc}(V_{low}) / T(V_{low}) \tag{6.29} \]
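The unit conversions of Eqs. 6.28 and 6.29 amount to multiplying a per-second rate by the number of seconds in one billion hours. The helper names below are hypothetical and the numeric inputs in the usage line are placeholders, not measured values.

```python
SECONDS_PER_BILLION_HOURS = 3600 * 10 ** 9


def ser_to_fit(ser_per_second):
    """Eq. 6.28: convert a soft-error rate in events/s into a FIT rate."""
    return SECONDS_PER_BILLION_HOURS * ser_per_second


def fit_to_ser(fit):
    """Inverse of Eq. 6.28: convert a FIT rate into events/s."""
    return fit / SECONDS_PER_BILLION_HOURS


def system_fit(p_sdc, clock_period_s):
    """Eq. 6.29: FIT rate from the SDC probability per cycle and the clock period in seconds."""
    return SECONDS_PER_BILLION_HOURS * p_sdc / clock_period_s


example_fit = system_fit(p_sdc=1e-21, clock_period_s=1e-9)  # placeholder inputs
```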
This section derived the equation we use to relate bit flip rate and voltage. It is
a rather high-level equation; an accurate low-level model is very hard to obtain, in
part because of the scarcity of experimental data on bit flip rates versus voltage in
the literature. For our purposes, this equation properly captures the fact that bit flip
rates increase exponentially fast as we reduce voltage. How fast depends on
the parameter α. It is also hard to obtain accurate values for SER0(Vnominal), so we
resort to sweeping those two parameters around regions where they could realistically
be. We thus report a range of results, which should cover for inaccuracies in modeling.
We can find estimates of FIT rate for logic, latches and SRAM in [107] across different
technologies. For modern technologies, the FIT rate is between 10^-5 and 10^-3
(Figure 5 in [107]). Looking at different sources, [108] predicts and [89] experimentally
confirms a FIT rate of about 4 × 10^-5 for an inverter running at 1 GHz in 90 nm
technology. Furthermore, Fig. 5 in [108] suggests an order of magnitude increase in
FIT rate when scaling feature size by a factor of 2, so about 4 × 10^-3 at 22 nm. This
is consistent with the 10^-5 to 10^-3 range of FIT rate estimated previously. Still, to
cover for potential inaccuracies, we will use a larger range for our experiments, with
FIT0(Vnominal) up to 1.
We estimate α = 10 from [50]: Figures 2 and 4 in [50] show the dependence
on voltage, and we use the steepest voltage dependence case (FF5, clk=low,
data=1) for a conservative estimate. Still, to cover for inaccuracies, we will use a
range for α between 1 and 50.
6.3 Lightweight checking results
This section explores the trade-offs between energy and reliability that result from
scaling Vdd. It repeats some of the experiments from Chapter 3 but also takes reli-
ability into account. We first look at detailed results for the sort in Sec. 6.3.1. We
then show summarized results for more applications in Sec. 6.3.2.
Throughout this section, we assume the robust memory architecture we found in
Chapter 4 without internal banking (a column of 256 × 32 memories placed every 9
columns). We assume gate boosting as in Chapter 3, and either a single-Vdd archi-
tecture with power gating, or a dual-Vdd architecture. We also assume a 22 nm LP
process.
Table 6.2: Fault injection results for Sort (N = 1024, P = 1)
Bit flips    Outcome
CMP  LWC     p00      p01      p10      p11      Comment
0    0       1        0        0        0        No bit flip, common case
1    0       0.5685   0        0        0.4315   p10(1, 0) = 0 satisfies Eq. 6.20
0    1       0.4418   0.5527   0        0.0056   p11(0, 1) ≠ 0 (if CMP stalls due to LWC fault)
2    0       0.3238   0        0.0005   0.6757
1    1       0.2505   0.3121   0.0007   0.4367
0    2       0.1970   0.7912   0        0.0118
3    0       0.1815   0        0.0008   0.8177
2    1       0.1430   0.1785   0.0011   0.6774
1    2       0.1114   0.4512   0.0005   0.4369
0    3       0.0878   0.8947   0        0.0175
6.3.1 Detailed LWC example: sorting
6.3.1.1 Impact on energy when considering reliability
To illustrate how we obtain our results, let us first focus on the streaming merge-sort
application with N = 1024 and P = 1. The results of the fault injection experiments
are shown in Tab. 6.2. As expected, when no fault is injected the outcome is p00 = 1
(no CMP error, and the LWC thinks there is no error). We notice that this appli-
cation satisfies Eq. 6.20 (p10(1, 0) = 0), which will lead to better results (Sec. 6.2.2,
Sec. 6.3.2.3). One could expect to see p11(0, 1) = 0 since no fault is injected in CMP,
so it should not produce an error. However, there are cases where a fault in the LWC
causes the CMP to stall because the control gets corrupted. These cases are included
in p11.
This application has d0 = 1024, Nt = 10888 for the CMP, Nt = 1652 for the
LWC. At this point, if we know FIT0(Vnominal) and α, we can compute the SDC rate
for the unprotected design using Eq. 6.12, and the SDC rate for the protected design
using Eq. 6.15. The FIT rate of the system is obtained using Eq. 6.29. Assuming
FIT0(Vnominal) = 10^-4, we find the FIT rate versus VCMP relationship shown in
Fig. 6.1b (we sweep the value of VCMP, the voltage we are changing). There are three
curves: the unprotected design “baseline (CMP only)”, the protected design with
VLWC = Vnominal “CMP+LWC dual-Vdd”, and the protected design with VLWC =
VCMP “CMP+LWC single-Vdd”. As expected, the FIT rates increase exponentially as
VCMP is reduced, and the protected designs have lower FIT rates than the unprotected
one. Among the protected ones, keeping the LWC at nominal Vdd helps lower the FIT
rate, but not by much: the curves follow each other closely.
We can also evaluate the recomputation rate for the protected designs using
Eq. 6.19. This tells us the impact of recomputation on the total energy spent us-
ing Eq. 6.5. Fig. 6.1a shows the effect of scaling voltage on energy consumption for
the three designs in Fig. 6.1b. At nominal Vdd, the curves start as non-dotted lines.
At the point where voltage is low enough that reliability becomes worse than that of
the base design, the curves become dotted. We call this voltage level Vopt. Of course,
the unprotected design starts off dotted since reducing Vdd by any amount already
makes it less reliable than it is at nominal Vdd. Looking at Fig. 6.1b and
Fig. 6.1a together, we can see that the voltages where the Fig. 6.1a curves become
dotted (Vopt) correspond to the voltages where the reliability curves intersect the
baseline reliability in Fig. 6.1b. Whether this intersection happens after or before we
have reached the minimum energy point tells us whether we can claim the full savings
of the energy-minimum point or if we have to stop beforehand. In this case, we can
reach the minimum energy point. This is the case even though we have assumed a
very pessimistic α = 40 (instead of the expected value of α = 10). This will also be
the case at least for all α < 40.
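The search for Vopt described above can be sketched as a scan from nominal voltage downward that stops at the first point where the protected design becomes less reliable than the base design at nominal Vdd. The function name and the sample data in the usage lines are hypothetical; real inputs would be the FIT values computed with Eq. 6.29 at each swept voltage.

```python
def find_v_opt(voltages, fit_protected, fit_base_nominal):
    """Lowest swept voltage whose protected FIT rate does not exceed the baseline.

    Scans from the highest voltage downward and stops at the first violation.
    Returns None if even the highest voltage is less reliable than the baseline.
    """
    v_opt = None
    for v, fit in sorted(zip(voltages, fit_protected), reverse=True):
        if fit > fit_base_nominal:
            break
        v_opt = v
    return v_opt


# Hypothetical sweep: FIT grows quickly as the voltage drops.
sweep_v = [0.9, 0.8, 0.7, 0.6, 0.5]
sweep_fit = [1e-20, 1e-15, 1e-8, 1e-2, 1e3]
v_opt = find_v_opt(sweep_v, sweep_fit, fit_base_nominal=1.0)
```

If the returned voltage is at or below the voltage of the energy minimum point, the full savings of that point can be claimed; otherwise scaling has to stop at Vopt.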
6.3.1.2 Analysis
At nominal Vdd, the LWC only costs 13% more energy for the single-Vdd case, and
only 22% more energy for the dual-Vdd case. Here the voltages are the same, and
the difference in overhead comes from the extra architectural overhead for supporting
dual-Vdd. As we reduce Vdd, both cases are able to recover their LWC overheads
while maintaining a higher reliability than the baseline. However, the dual-Vdd case only
achieves 13% energy savings at its energy minimum point, while the single-Vdd case
achieves 42% energy savings, much closer to the maximum savings of 47% for the
baseline case. These results suggest that we can indeed achieve high savings without
compromising reliability thanks to voltage scaling and LWCs. In fact, we even get
an improvement in reliability of 23000× at the minimum-energy point (FIT rate of
0.021 instead of 480).
[Figure 6.1: Effect of voltage on energy and system-level FIT rate for sort (N = 1024, P = 1). Panels: (a) Energy ratio versus Vdd (V), FIT0(Vnominal) = 10^-4, α = 40; legend: reliability better than nominal, reliability worse than nominal. (b) System-level FIT rate versus Vdd (V), FIT0(Vnominal) = 10^-4, α = 40; curves: baseline (CMP only), CMP+LWC single-Vdd, CMP+LWC dual-Vdd.]
The results also suggest that “differential reliability” (dual-Vdd case) does not
justify its high cost: maintaining the LWC at high voltage makes it more expensive
as we reduce VCMP , and the reliability advantage we gain from it is not significant
enough. Why do we not see significant reliability gains with differential reliability?
Consider the case where there is only one upset in the CMP+LWC combination.
This either means one fault in the CMP ((i, j) = (1, 0) in the notation of Sec. 6.2.2),
or one fault in the LWC ((i, j) = (0, 1)). The first case can cause an SDC, but
since the LWC does not observe a fault, it does not matter whether it is protected
more by keeping its voltage higher. The second case does not cause SDCs, only
false positives, since the CMP does not observe a fault. Keeping VLWC high thus reduces
the rate of false positives, but since we ensure that upsets remain rare events, the
resulting change in prcmp does not alter the fact that prcmp ≈ 0, and does not affect
energy consumption. Furthermore, since upsets remain rare events, the likelihood of
observing two simultaneous faults is orders of magnitude lower than the cases we just
described, so the 2-fault case does not change the answer.
6.3.1.3 Parameter sweep
We observe that as long as Vopt is less than the energy minimum point, the energy
savings we achieve are the same no matter what the parameters α and FIT0(Vnominal)
are. When FIT0(Vnominal) or α get too high, Vopt may happen before we reach the
energy minimum point, and the savings achieved are reduced. However, this does
not happen unless we use extreme values for the parameters. Fig. 6.2 illustrates this.
The expected value of α is about 10, and the expected value of FIT0(Vnominal) is
10^-4. This case can be seen in Fig. 6.2a, together with cases with larger α, up to
50. Fig. 6.2b shows the pessimistic case where FIT0(Vnominal) = 1. Out of all these
cases, the only one that does not achieve the full savings of the energy minimum point is α = 50
and FIT0(Vnominal) = 1, an extreme case that we do not expect to see in practice.
This will be true for the other applications as well (Vopt is always below the energy
minimum point), so we can report only one number for energy savings. The impact
of α and FIT0(Vnominal) will therefore mainly be on the improvement in reliability
achieved at the energy minimum point (we do not expect it to get worse), it will not
be on energy, delay, or area.
6.3.1.4 Effect of problem size on LWC results
Fig. 6.3a shows the area overhead of adding an LWC on top of sort for different
application sizes (16 to 16K). The LWC adds 65% extra area when N = 16, but as
the application size increases, the area of the LWC grows slower than the area of
the CMP. At N = 16K, the LWC only adds 5% extra area. Fig. 6.3a also shows
the energy overhead of the LWC at nominal Vdd, which follows a similar trend. At
N = 16 the energy overhead is 34%, whereas at N = 16K it drops down to 6%.
Fig. 6.3b shows the energy savings achieved when scaling voltage in two cases: “CMP
only”, which also comes with lower reliability, and “CMP + LWC”, which achieves
the savings without reducing reliability compared to “CMP only” at nominal Vdd. We notice
that “CMP + LWC” is very close to “CMP only”: close to the maximum achievable
savings. The curves get closer as application size increases, but overall the maximum
achievable savings decrease slowly as application size increases. At N = 16 using
LWCs leads to 42% energy savings (out of 54% possible at most). At N = 16K using
LWCs leads to 34% energy savings (out of 37% possible at most).
[Figure 6.2: Effect of voltage on system-level FIT rate for sort (N = 1024, P = 1). Panels: (a) FIT0(Vnominal) = 10^-4, (b) FIT0(Vnominal) = 1. Axes: FIT versus Vdd (V). Curves: baseline and protected designs for α = 1, 10, 20, 30, 40, 50.]
6.3.2 LWC results for the other applications
6.3.2.1 Basic LWC results for all benchmarks
Tables 6.3, 6.4 and 6.5 summarize the energy, reliability, delay and area results for
the rest of the applications we considered (described in Sec. 5.1). We show ratios of
those metrics compared to the baseline architecture—the single-Vdd with no LWC and
VCMP = Vnom, the nominal voltage for the technology. Tab. 6.3 shows the baseline
case, for which all the ratios are 1×. It also shows the case where we simply scale
voltage to the minimum energy point, without worrying about reliability. We assume
α = 10 and FIT0(Vnominal) = 10^-4 as suggested in Sec. 6.2.4. We find that we can
reduce the energy by 41-59%, for a 50% geomean reduction, but with a 42× geomean
degradation in both reliability and delay.
[Figure 6.3: Effect of problem size on energy, delay and area for Sort. Panels: (a) Energy and area overhead: % overhead of LWC (energy at nominal Vdd, area) versus problem size N (16 to 16K). (b) Energy savings achievable: energy savings achieved (CMP only, CMP + LWC) versus problem size N (16 to 16K).]
Tab. 6.4 shows the result of adding LWCs. The VCMP = VLWC = Vnom case shows
the energy overhead of the LWCs at nominal Vdd: 0-23%, 9% geomean. This overhead
needs to be recovered before we start claiming benefits from voltage scaling. We also
see that the area overhead of adding LWCs is 0-53%, 22% geomean. Simply
adding the LWCs improves reliability by 2.0 · 10^20×. We can now trade off some
of that added reliability for energy savings in the VCMP = VLWC = Vopt case, also
shown in Tab. 6.4. We find that we can still achieve 31-56% energy savings, 44%
geomean, while losing some reliability, but still maintaining it 8.8 · 10^14× higher than
the original. However, this comes with a 40× increase in delay.
Finally, Tab. 6.5 shows the results for the dual-Vdd case. The overhead of adding
LWCs is higher, both in terms of area (25-85%, 51% geomean), and energy (5-47%, 21%
geomean). This means that there is a larger gap that needs to be recovered before
we start seeing benefits. Furthermore, the dual-Vdd case reaches its energy-minimum
point faster due to the LWC that stays at high voltage and sees significant increases
in leakage energy. This leads to low savings, with 16% geomean reduction (compared
to 44% with single-Vdd). Because the turn-around point happens earlier, the achieved
reliability (1.4 · 10^17×) and delay (8.4×) are better than in the single-Vdd case.
Table 6.3: Ratio results for CMP-only single-Vdd normalized to “CMP at Vnominal”
                       | VCMP = Vmin                   | VCMP = Vnom (baseline)     | Both
Application            | Energy  Reliability    Delay  | Energy  Reliability  Delay | Area
Sort N=1K              | 0.53×   1.8 · 10^-2×   69×    | 1×      1×           1×    | 1×
MMul N=32              | 0.41×   1.8 · 10^-2×   65×    | 1×      1×           1×    | 1×
MInv N=6               | 0.50×   3.0 · 10^-2×   28×    | 1×      1×           1×    | 1×
FFT N=1K               | 0.45×   1.8 · 10^-2×   69×    | 1×      1×           1×    | 1×
DeBayer N=128 (WinF)   | 0.45×   1.8 · 10^-2×   69×    | 1×      1×           1×    | 1×
LK N=128 (CGrad)       | 0.59×   5.0 · 10^-2×   13×    | 1×      1×           1×    | 1×
GMM N=128              | 0.59×   3.0 · 10^-2×   28×    | 1×      1×           1×    | 1×
Geomean                | 0.50×   2.4 · 10^-2×   42×    | 1×      1×           1×    | 1×
Table 6.4: Ratio results for CMP+LWC single-Vdd normalized to “CMP at Vnominal”
(Tab. 6.3 baseline)
                       | VCMP = VLWC = Vopt            | VCMP = VLWC = Vnom            | Both
Application            | Energy  Reliability    Delay  | Energy  Reliability    Delay  | Area
Sort N=1K              | 0.58×   6.1 · 10^14×   66×    | 1.13×   1.3 · 10^20×   0.96×  | 1.06×
MMul N=32              | 0.44×   2.0 · 10^13×   63×    | 1.05×   3.9 · 10^18×   0.97×  | 1.40×
MInv N=6               | 0.69×   8.2 · 10^11×   26×    | 1.19×   2.5 · 10^16×   0.94×  | 1.23×
FFT N=1K               | 0.55×   7.0 · 10^10×   68×    | 1.23×   1.4 · 10^16×   0.99×  | 1.37×
DeBayer N=128 (WinF)   | 0.49×   6.2 · 10^25×   69×    | 1.04×   4.8 · 10^34×   0.99×  | 1.53×
LK N=128 (CGrad)       | 0.59×   1.8 · 10^15×   12×    | 1.04×   9.1 · 10^18×   0.99×  | 1.06×
GMM N=128              | 0.59×   5.3 · 10^12×   28×    | 1×      1.7 · 10^17×   1×     | 1×
Geomean                | 0.56×   8.8 · 10^14×   40×    | 1.09×   2.0 · 10^20×   0.98×  | 1.22×
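The geomean rows can be reproduced directly from the per-application ratios. The short Python sketch below recomputes the energy geomeans of Tab. 6.3 and Tab. 6.4 from the values listed in those tables. Python and the helper name `geomean` are choices made for this illustration only.

```python
import math

def geomean(ratios):
    """Geometric mean of a list of positive ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Energy ratios copied from Tab. 6.3 (VCMP = Vmin) and Tab. 6.4
# (VCMP = VLWC = Vopt), in the order Sort, MMul, MInv, FFT, DeBayer, LK, GMM.
energy_cmp_only = [0.53, 0.41, 0.50, 0.45, 0.45, 0.59, 0.59]
energy_with_lwc = [0.58, 0.44, 0.69, 0.55, 0.49, 0.59, 0.59]

cmp_only_geomean = geomean(energy_cmp_only)   # compare with the 0.50x geomean row
with_lwc_geomean = geomean(energy_with_lwc)   # compare with the 0.56x geomean row

print(f"CMP only:  {cmp_only_geomean:.2f}x energy, "
      f"{100 * (1 - cmp_only_geomean):.0f}% savings")
print(f"CMP + LWC: {with_lwc_geomean:.2f}x energy, "
      f"{100 * (1 - with_lwc_geomean):.0f}% savings")
```

The savings quoted in the text are one minus these ratios, which is how a 0.56× geomean energy ratio corresponds to 44% geomean savings.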
6.3.2.2 Improving delay
Going back to the single-Vdd case (VCMP = VLWC = Vopt in Tab. 6.4), we saw that
even though we achieved significant energy savings without compromising reliability
(44% geomean), we also degraded the delay significantly (40×). We can limit this
effect by not reducing Vdd all the way to the minimum-energy voltage, at which point
we are already hitting the exponential delay increase curve that causes leakage to
dominate, which causes the turnaround point (Fig. 1.1). If we back up on voltage
reduction just a little bit, we should be able to benefit from an exponential reduction
in delay. Since this is also the point where the energy curve flattens out, we should
also be able to back up on voltage reduction just a little bit without losing too much
of the energy savings. This is indeed the case, as we can see from Fig. 6.4a, which
shows the geomean delay and energy ratio versus voltage. If we stop reducing voltage
at 0.7V, we still get 35% geomean energy improvement, for only 6× worse geomean
delay. Note that we can actually do a bit better if we do not stop all applications
at 0.7V, since they do not all get the same Vopt. Fig. 6.4a only shows the geomean
energy and delay given the same voltage for all applications; this is why its maximum
geomean savings (42%) are slightly less than reported previously (44%), when looking
at each benchmark’s unique Vopt.
Table 6.5: Ratio results for CMP+LWC dual-Vdd normalized to “CMP at Vnominal”
(Tab. 6.3 baseline)
                       | VCMP = Vopt, VLWC = Vnom      | VCMP = VLWC = Vnom            | Both
Application            | Energy  Reliability    Delay  | Energy  Reliability    Delay  | Area
Sort N=1K              | 0.87×   3.0 · 10^16×   14×    | 1.22×   1.1 · 10^20×   1.14×  | 1.35×
MMul N=32              | 0.55×   6.2 · 10^16×   23×    | 1.12×   3.6 · 10^18×   1.04×  | 1.75×
MInv N=6               | 1.01×   2.0 · 10^14×   3.8×   | 1.47×   1.8 · 10^16×   1.30×  | 1.49×
FFT N=1K               | 0.95×   1.9 · 10^13×   6.3×   | 1.36×   1.3 · 10^16×   1.06×  | 1.71×
DeBayer N=128 (WinF)   | 0.84×   1.0 · 10^31×   6.6×   | 1.12×   5.2 · 10^34×   1.09×  | 1.85×
LK N=128 (CGrad)       | 0.99×   1.0 · 10^15×   3.8×   | 1.21×   4.6 · 10^16×   1.14×  | 1.25×
GMM N=128              | 0.76×   1.8 · 10^13×   15×    | 1.05×   9.1 · 10^16×   1.18×  | 1.28×
Geomean                | 0.84×   1.4 · 10^17×   8.4×   | 1.21×   8.0 · 10^19×   1.13×  | 1.51×
We have so far focused on an LP process in order to find the overall minimum-
energy point. Alternatively, a designer may want to use an HP process to focus on
performance, then get some energy savings on top of that by using the combination
of voltage scaling and LWC. Since the nominal Vdd of the HP process operates farther
from its Vth (i.e. farther from the exponential increase in delay and minimum-energy
point, Fig. 3.5), there is more opportunity to save energy. Over the same set of
benchmarks, we find 59% geomean energy savings for the HP process (compared to
44% for LP), and that only comes with a 5× increase in delay. This is compared to a
baseline that is also an HP process. Comparing the absolute value of the HP at Vopt
to the LP at Vopt, LP still has 69% lower energy, and HP is 70× faster. Note that we
can further reduce the delay of the HP case by using the same “backing up” trick as
described above. Fig. 6.4b shows that we can stop reducing voltage at 0.6V to limit
the geomean delay increase to 2×, and still achieve 48% geomean energy savings.
[Figure 6.4 (plots): geomean energy ratio and delay ratio versus Vdd (V). Panels: (a) LP, with Vdd from 0.4 V to 0.9 V; (b) HP, with Vdd from 0.4 V to 0.8 V.]
Figure 6.4: Geomean energy and delay versus voltage
Finally, if we can increase the parallelism of the application, then we have another
opportunity to recover some of the lost delay from scaling voltage. Even if we started
with the maximum amount of parallelism possible, we may exploit the fact that we
reduced energy to recover some of the transistors that we could not previously use
due to power density limits. That is, we retrieve some of the dark silicon (Sec. 2.2.1, [37])
because we have lowered energy, and we can now use it to increase parallelism.
6.3.2.3 Sensitivity analysis
Fig. 6.5 shows the achievable energy savings versus the exponential parameter α
that relates voltage to upset rate (Eq. 6.10). This is the most important parameter,
or assumption, that we have made. As we can see, the achievable energy savings
are very robust to the value of α: a wide range of α, from 0 to 40, allows us to satisfy
Vopt = Vmin, and claim the maximum possible benefits (the constant part of the
curves in Fig. 6.5). As suggested in Sec. 6.2.4, we expect α ≈ 10, well within the
range. Past that range, most applications still see part of the benefits, even with
α up to 100. This robustness stems from the orders-of-magnitude increase in
reliability provided by the LWCs, which itself follows from the fact that all the LWCs
we used satisfy Eq. 6.20 (p10(1, 0) = 0, more in Sec. 6.3.2.4).
[Figure 6.5 (plot): % energy savings versus parameter α (axis from 0 to 70), with the expected α = 10 marked. Curves: sort, matmul, fft, matinv, debayer, lk, gmm.]
Figure 6.5: The energy results are robust to a sweep of the exponential parameter α
Another effect of increasing α is a change in the reliability we reported for scaled
voltage in Tables 6.3, 6.4 and 6.5. We show the geomean reliability versus α in
Fig. 6.6 for four cases: CMP only, at nominal voltage or at Vmin, and CMP+LWC,
at nominal voltage or at Vmin. As expected, adding the LWC improves reliability
by orders of magnitude. The Vnom curves are constant because they do not depend
on voltage. The “CMP+LWC Vmin” curve increases with α, but as long as it is
below “CMP Vnom”, we can achieve Vopt = Vmin. This happens as long as α ≤ 50,
covering the expected value of α = 10 with a large margin. We previously found
that we degraded reliability by 42× if we simply scaled voltage to Vmin and did not
add LWCs (Tab. 6.3). This may not have seemed significant, but it gets worse with
larger α (“CMP Vopt” curve in Fig. 6.6): without LWCs, we get a reliability decrease
of 1.1 × 10^3 for α = 20, and 3.6 × 10^4 for α = 30.
[Figure 6.6 (plot): geomean FIT rate (log scale) versus parameter α (axis from 0 to 70), with the expected α = 10 marked. Curves: CMP Vnom, CMP Vopt, CMP+LWC Vnom, CMP+LWC Vopt.]
Figure 6.6: Effect of the α parameter on the reliability results
6.3.2.4 Fault injection results
Tab. 6.6 shows some of the detailed results from the fault injection experiments for the
different benchmarks. The same observations as in Tab. 6.2 still hold. In particular,
we emphasize once again the importance of the p10(1, 0) = 0 condition (Eq. 6.20),
allowing us to cover all 1-fault cases, and not observe SDCs unless there are at least
2 simultaneous faults, a much less likely event. This is a major contributor to the
orders of magnitude increase in reliability we observed. This is also a major reason
why our results are insensitive to the parameter α as shown in Sec. 6.3.2.3. If we had
p10(1, 0) > 0, but still small, say 10^-3, we could still achieve similar energy results, but
they would rely more heavily on the α = 10 assumption (and FIT0(Vnominal) = 10^-4
to a lesser extent). It would also not be as robust to other assumptions we have made
in Sec. 6.2.2, such as the uniform upset rate across different circuit components.
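To see why requiring two simultaneous faults matters so much, the following minimal sketch compares the probability of exactly one and exactly two upsets among Nt sites. It is only an illustration: it assumes independent upsets with a hypothetical per-site probability `Q_UPSET`, it reads p10 as the fraction of injected faults that escape the check, and it is not the reliability model of Sec. 6.2. Nt (CMP) and the 2-fault p10 value are taken from the Sort row of Tab. 6.6.

```python
import math

def prob_exact_faults(n_sites, q, k):
    """Binomial probability of exactly k upsets among n_sites independent sites."""
    return math.comb(n_sites, k) * q**k * (1.0 - q) ** (n_sites - k)

N_T_CMP = 10_888        # Nt (CMP) for Sort, Tab. 6.6
P10_TWO_FLIPS = 0.0005  # p10 with 2 bit flips in CMP for Sort, Tab. 6.6
Q_UPSET = 1e-12         # hypothetical per-site upset probability (assumption)

p_one = prob_exact_faults(N_T_CMP, Q_UPSET, 1)
p_two = prob_exact_faults(N_T_CMP, Q_UPSET, 2)

# With p10(1, 0) = 0, single faults never cause an SDC, so the leading SDC
# term comes from double faults that escape the check.
sdc_two_fault_term = P10_TWO_FLIPS * p_two

print(f"P(1 fault)       = {p_one:.3e}")
print(f"P(2 faults)      = {p_two:.3e}")
print(f"leading SDC term = {sdc_two_fault_term:.3e}")
```

Under these assumptions the two-fault probability is smaller than the one-fault probability by a factor of roughly (Nt − 1)·q/2, and only a small fraction of those double faults escape detection.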
Table 6.6: LWC fault injection results (no bit flip in LWC)
               | 1 bit flip in CMP   | 2 bit flips in CMP     |
Application    | p00    p10  p11     | p00    p10     p11     | Nt (CMP)  Nt (LWC)  ≈ d0
Sort N=1K      | 0.569  0    0.432   | 0.324  0.0005  0.676   | 10,888    1,652     N
MMul N=32      | 0.576  0    0.424   | 0.334  0       0.667   | 4,477     2,379     N^3
MInv N=6       | 0.948  0    0.052   | 0.875  0.0012  0.126   | 19,453    5,628     N^3
FFT N=1K       | 0.563  0    0.437   | 0.318  0.1838  0.499   | 46,912    12,031    N/2
DeBayer N=128  | 0.392  0    0.608   | 0.168  0.0001  0.832   | 7,578     2,364     N^2
LK N=128       | 0.947  0    0.053   | 0.938  0.0020  0.060   | 202,947   2,137     N^2 k
GMM N=128      | 0.950  0    0.050   | 0.875  0.0010  0.124   | 42,119    0         N^2
6.4 Discussion
We have developed reliability models and a fault injection workflow that allowed us
to evaluate the effects of adding LWCs and reducing voltage. We found that we could
achieve around 50% energy savings without compromising reliability. We saw that
the bottleneck that limits Vopt and the energy savings is not the loss in reliability, at
least as long as p10(1, 0) = 0 (Eq. 6.20). Instead, it is the minimum-energy point,
driven by the exponential increase in delay and leakage energy.
Note that we chose relatively small sizes for the benchmarks. As we saw in Fig. 6.3,
we expect that as the benchmark size increases, the overhead of the LWC decreases,
particularly for the LWCs that are asymptotically cheaper than the compute, leading
to savings that are closer to the maximum savings achievable by the CMP-only case.
We get smaller savings for applications that are more memory-intensive, since we
do not scale voltage for memories. We can reduce the contribution of memories to
total energy by changing parallelism as suggested in Chapter 4. We do this for the
WAMI benchmarks in Chapter 7. We can also reduce the contribution of memories
by jointly optimizing with voltage scaling and the memory architecture exploration
from Chapter 4. We leave this as future work.
Chapter 7
Full System Case Study: WAMI
7.1 Introduction
As shown in Chapter 6, voltage scaling in combination with LWCs allows energy
savings of 50% on many representative applications. These applications are
computational kernels that are often used as building blocks to construct larger systems.
The fact that we got savings for the individual kernels suggests that we should be
able to get similar savings for the larger systems.
However, building a larger system out of sub-components that have LWCs presents
unique challenges as well as new opportunities. For instance, when connecting two
kernels, instead of just streaming data through, using LWCs may force us to add
extra storage at the interface so that we can roll back when an error is detected. This
is an overhead that would otherwise not be present. On the other hand, we may be
able to optimize the LWC of a kernel depending on how it is used within the system.
For instance, we may not even need to use an LWC for a particular stage if adequate
protection is provided by the subsequent stage.
In this chapter, we aim to address these questions with the aid of a large system
case study: Wide-Area Motion Imaging (WAMI). We already used the system and
its three components as applications in the results of Chapter 6. In this chapter,
we explain the three components and see in more detail how they come together to
form the WAMI system. We also explore the components’ parallelism in the spirit of
Chapter 4, and analyze the impact of our techniques on delay and power density.
[Figure 7.1 (block diagram): the WAMI system and its three stages: DeBayer, LK, GMM.]
Figure 7.1: Three stages of the WAMI system
7.2 Top level: Wide-Area Motion Imaging, WAMI
The goal of WAMI is to identify and track moving objects over a wide background
area. For example, a camera mounted on a plane or a drone would continuously
capture images of the ground, which serve as the input of our system [69] and then
undergo three consecutive stages of processing (Fig. 7.1):
• DeBayer, for color filtering (Section 7.3)
• Lucas-Kanade (LK), for image alignment (Section 7.4)
• Gaussian Mixture Modeling (GMM), for detecting moving objects (Section 7.5)
7.3 Stage 1: DeBayer
7.3.1 Bayer filtering
The input data to the system is an nx × ny array coming from a camera with a color
filter array arranged in a Bayer pattern. This means that at each pixel location, we
sample only one basic color intensity (red, green, or blue), according to the layout
shown in Fig. 7.2.
[Figure 7.2 (diagram): a 6×6 grid of pixels, indexed 0 to 5 along both nx and ny, tiled with repeating 2×2 cells that each hold one blue (B), one red (R) and two green (G) samples.]
Figure 7.2: Bayer color pattern
The DeBayer stage interpolates the values of the missing colors at each pixel
location and returns an nx × ny image with RGB information at each location. In
essence this is a 5×5 window filter, with different coefficients depending on the current
pixel’s location. The operation can be written as:
\bar{y}(r, c) = \mathrm{clamp}\left[ \sum_{i=0}^{4} \sum_{j=0}^{4} \bar{w}(i, j)\, \bar{x}(r - 2 + i,\; c - 2 + j),\; 0,\; \mathrm{MAX} \right]    (7.1)
Here x̄ is the incoming DeBayer data and ȳ is the interpolated color value associated with the
window coefficients w̄. The values need to be checked for underflow and overflow and
must saturate between 0 and a maximum value MAX. The window coefficients are
shown in Table 7.1.
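As an illustration of Eq. 7.1, the following minimal Python sketch evaluates one output value of the 5×5 window filter. It is not the hardware implementation: the function name `debayer_window`, the choice of MAX = 255 (8-bit data), and the treatment of out-of-image pixels as 0 are assumptions of this sketch, and the coefficients of Tab. 7.1 are used as listed, without any normalization.

```python
def debayer_window(x, w, r, c, max_val=255):
    """Evaluate Eq. 7.1 at pixel (r, c): 5x5 weighted sum, clamped to [0, max_val].

    Pixels outside the image are treated as 0 (an assumption of this sketch).
    """
    acc = 0.0
    for i in range(5):
        for j in range(5):
            rr, cc = r - 2 + i, c - 2 + j
            if 0 <= rr < len(x) and 0 <= cc < len(x[0]):
                acc += w[i][j] * x[rr][cc]
    return min(max(acc, 0.0), max_val)

# "G at R location" coefficients from Tab. 7.1, used as listed.
W_G_AT_R = [
    [0, 0, -1, 0, 0],
    [0, 0, 2, 0, 0],
    [-1, 2, 4, 2, -1],
    [0, 0, 2, 0, 0],
    [0, 0, -1, 0, 0],
]

flat_image = [[10] * 6 for _ in range(6)]      # uniform 6x6 test image
bright_image = [[100] * 6 for _ in range(6)]   # large values to trigger the clamp

center_value = debayer_window(flat_image, W_G_AT_R, 2, 2)
bright_value = debayer_window(bright_image, W_G_AT_R, 2, 2)
```

On the uniform image the result is simply the pixel value times the sum of the coefficients, and on the brighter image the weighted sum exceeds MAX, so the clamp saturates it.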
The DeBayer filter is not costly to parallelize: it can be done in the same fashion
as for the basic window filter described in Sec. 4.5.3 and shown in Fig. 4.8.
Table 7.1: Window coefficients for the DeBayer filter
G at R location, G at B location:
     0      0     -1      0      0
     0      0      2      0      0
    -1      2      4      2     -1
     0      0      2      0      0
     0      0     -1      0      0
R at B location, B at R location:
     0      0     -1.5    0      0
     0      2      0      2      0
    -1.5    0      6      0     -1.5
     0      2      0      2      0
     0      0     -1.5    0      0
R at G, even row; B at G, odd row:
     0      0     -1      0      0
     0     -1      4     -1      0
     0.5    0      5      0      0.5
     0     -1      4     -1      0
     0      0     -1      0      0
R at G, odd row; B at G, even row:
     0      0      0.5    0      0
     0     -1      0     -1      0
    -1      4      5      4     -1
     0     -1      0     -1      0
     0      0      0.5    0      0
7.3.2 LWC for DeBayer
The LWC for DeBayer is similar to the one for the basic window filter described in
Sec. 5.6.5. It is based on computing a checksum at the input and at the output,
and making sure that they match. The only minor difference is that for DeBayer,
we need to maintain three separate checksums, for each of the red, green and blue
components. Indeed, after applying the weights shown in Tab. 7.1, each blue pixel
ends up being weighted by the same amount as each red pixel (32), and by twice the
amount of each green pixel (16). Both CMP and LWC require O(N) work, but the
LWC has smaller constants.
The idea is simple, but in practice, we need to carefully think about our problem
to decide exactly how we want to implement the check.
First, we note that the incoming data consists of 8-bit pixel values. If we simply
multiply some of them by 0.5 as suggested in Tab. 7.1, we lose some bits of precision,
and end up with an inexact comparison of checksums. Instead, we need to add a
low-order bit for extra precision. Furthermore, the clamp operation suggested in Eq. 7.1
will also lose information in some cases, so we need to hold off from it until the input
of the subsequent stage (i.e. DeBayer must output the full precision values, without
saturating them). This way, no information is lost at the output of DeBayer, and the
LWC can use it properly.
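The precision issue can be illustrated with a small sketch. Since all coefficients in Tab. 7.1 are multiples of 0.5, keeping one extra low-order bit (that is, working in units of 1/2) makes every product an exact integer. The function names and the three-pixel example below are hypothetical and only meant to show the effect; they do not model the actual datapath.

```python
def checksum_truncating(pixels, weights):
    """Sum of products where each product is truncated to an integer (precision lost)."""
    return sum(int(w * p) for w, p in zip(weights, pixels))

def checksum_half_units(pixels, weights):
    """Exact sum expressed in units of 1/2.

    Weights are pre-scaled by 2 so that all terms are integers
    (0.5 -> 1, -1.5 -> -3, 4 -> 8).
    """
    scaled = [int(round(2 * w)) for w in weights]
    return sum(s * p for s, p in zip(scaled, pixels))

pixels = [3, 5, 7]            # 8-bit style integer pixel values
weights = [0.5, 0.5, -1.5]    # multiples of 0.5, as in Tab. 7.1

truncated = checksum_truncating(pixels, weights)       # drops the fractional halves
exact_half_units = checksum_half_units(pixels, weights)  # exact, in units of 1/2
```

Here the exact weighted sum is -6.5, which the half-unit representation stores as the integer -13, whereas truncating each product yields -7. A checksum comparison built on the truncated values could therefore mismatch even in the absence of any fault.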
The next issue is the edges of the image. The pixel at (2,2) is the first one that has
full information from its neighbors to compute its value. The pixels above it and to
its left do not. In a standard implementation of WAMI, this is not an issue, since the
subsequent stage does not need to use the outer edges (Sec. 7.4), so we do not even
need to compute an output for them: we simply return the rectangle between (2,2)
and (nx − 3, ny − 3) instead of the full rectangle between (0,0) and (nx − 1, ny − 1).
When we use an LWC, we cannot ignore the edges any more since that would be lost
information, and the checksums would not match any more. We have two ways to
make the checksums match again:
1. Since the outer edges are not used the same number of times as the other pixels,
we cannot simply add them to the sum. However, we can apply a scaling factor
for them, since we still know how many times they are used. For example, pixels
(0,0), (0,1), (1,0) are never used, so they would not be added. Pixel (2,0) is
multiplied by -1.5 to get the red component of pixel (2,2), and multiplied by -1
to get the green component of pixel (2,2). We can find the corresponding weights
for each of the edge pixels and make the LWC checksums match; however, this
results in too many special cases, and significantly increases the constant in the
O(N) cost of computing the LWC.
2. The alternative is to compute the full DeBayer output between (0,0) and
(nx − 1, ny − 1), assuming pixels of value 0 surrounding the image. That is, pixel (0,0)
(for example) is now used as many times as every other pixel of the image. The
special cases mentioned above can now be ignored since they are multiplied by
0. We choose this technique because it is cheaper. Note that it still involves
modifying the CMP module to compute the outer edges that it otherwise would
not have had to compute.
Finally, we note that in a standard implementation of WAMI, there is no need
to store the raw DeBayer data coming in. However, once we add an LWC to it, we
need to add extra storage in case the LWC finds an error and we need to roll back
and recompute. Note however, that there is an alternative to re-computing, which is
to drop the whole frame and wait for the next one. If we do not lower the voltage
enough to expect two consecutive frames to be faulty, we can still perform the 3
WAMI stages on frame i based on frame i − 2 instead of frame i − 1. We do not
explore this alternative, because as we will see in Sec. 7.6, the DeBayer stage is the
cheapest of all WAMI stages, and a small increase in its cost will not be significant
overall.
7.4 Stage 2: Lucas-Kanade
7.4.1 LK algorithm
In the WAMI system, we want to compare each two consecutive frames and identify
which pixels belong to the background, and which are from a moving object. However,
the camera is not assumed to be fixed, so it may move between two frames, and two
consecutive frames out of the DeBayer stage are not “aligned”: A given point in the
physical background that was captured in the pixel at coordinate (x0, y0) will now be
captured by the pixel at (x1, y1), with (x0, y0) ≠ (x1, y1) in general. Therefore, before
we perform background detection (Stage 3, GMM, Section 7.5), we need to align (or
warp) the incoming image to the coordinate frame of the previous image. We perform
image alignment in stage 2 of the WAMI using a variation of the Lucas-Kanade (LK)
algorithm [79], a Gauss-Newton gradient descent non-linear optimization algorithm.
Within the framework of Baker and Matthews [10], the LK variation we use is called
“inverse compositional”; we choose it because it requires fewer computations per LK
iteration.
Let x̄ = (x, y) be the set of coordinates in the space. Let T (x̄) be the template
image (the current reference image), and let I(x̄) be the new, incoming image, to
be aligned to the reference frame of T (x̄). Let W (x̄; p̄) be the warp, or the function
that defines the mapping for a given coordinate x̄0 onto a new coordinate x̄1. p̄ is
a vector of parameters to the warp function. The warped version of I(x̄) is denoted
as I(W (x̄; p̄)). The goal of the LK algorithm is to minimize the error between the
template image and the warped image (with respect to p̄):
\sum_{\bar{x}} \left[ I(W(\bar{x}; \bar{p})) - T(\bar{x}) \right]^2    (7.2)
Since the pixel values I(x̄) are non-linear in x̄, minimizing Eq.7.2 is a non-linear
optimization task. The LK algorithm linearizes the equation using two steps: First,
we assume that a current estimate of p̄ is known, and we solve for an increment ∆p̄:
\sum_{\bar{x}} \left[ T(W(\bar{x}; \Delta\bar{p})) - I(W(\bar{x}; \bar{p})) \right]^2    (7.3)
Then, we notice that we do not need to solve Eq.7.3 exactly; an approximate solution
is enough, so we compute the first-order Taylor expansion of Eq.7.3, and our goal is
now to minimize:
\sum_{\bar{x}} \left[ T(W(\bar{x}; \bar{0})) + \nabla T \frac{\delta W}{\delta \bar{p}} \Delta\bar{p} - I(W(\bar{x}; \bar{p})) \right]^2    (7.4)
Notice that Eq.7.4 is linear in ∆p̄ (p̄ is now a constant). ∇T is the gradient of the
template image T(x̄), δW/δp̄ is the Jacobian of the warp, and is evaluated at (x̄; 0̄). Once
we have solved for ∆p̄, we update the warp parameters:
W(\bar{x}; \bar{p}) \leftarrow W(\bar{x}; \bar{p}) \circ W(\bar{x}; \Delta\bar{p})^{-1}    (7.5)
Eq.7.4 is then solved again. The process is repeated until the change in p̄ is small
enough:
\| \Delta\bar{p} \| < \epsilon    (7.6)
Minimizing Eq.7.4 is a least squares problem, and has the following closed-form
solution:
\Delta\bar{p} = H^{-1} D = H^{-1} \sum_{\bar{x}} \left[ \nabla T \frac{\delta W}{\delta \bar{p}} \right]^{T} \left[ I(W(\bar{x}; \bar{p})) - T(\bar{x}) \right]    (7.7)
Pre-compute:
1. Compute the gradient of the template image ∇T
2. Compute the Jacobian δW/δp̄ with p̄ = 0̄
3. Compute the steepest descent images ∇T δW/δp̄
4. Compute the Hessian matrix H: Eq.7.8
5. Compute the inverse Hessian H^-1
Iterate:
1. Warp I(x̄): Compute I(W(x̄; p̄))
2. Compute the error between the two images: I(W(x̄; p̄)) − T(x̄)
3. Compute the parameter update ∆p̄: Eq.7.7
4. Update the warp p̄: Eq.7.5
Figure 7.3: Inverse compositional Lucas-Kanade algorithm
Where H is the Hessian matrix:
H = \sum_{\bar{x}} \left[ \nabla T \frac{\delta W}{\delta \bar{p}} \right]^{T} \left[ \nabla T \frac{\delta W}{\delta \bar{p}} \right]    (7.8)
Note that evaluating the gradient ∇T, Jacobian δW/δp̄, Hessian H, and Hessian inverse
H^-1 is independent of p̄: we can pre-compute each of those once, before the first
iteration, then only perform the warp I(W(x̄; p̄)) and parameter update (Eq.7.5) at
each iteration. The “inverse compositional” LK algorithm is shown in Fig. 7.3.
Note that using the more efficient “inverse compositional” algorithm requires us to
use a set of warps W (x̄; p̄) that form a group. This is the case for most applications,
including ours. In fact, we choose the following warp:
W(\bar{x}; \bar{p}) = \begin{pmatrix} (1 + p_1)\,x + p_3\,y + p_5 \\ p_2\,x + (1 + p_4)\,y + p_6 \end{pmatrix}    (7.9)
We thus choose W (x̄; p̄) to be an affine warp, which means that it preserves points,
straight lines, planes, as well as ratios of distances between points lying on a straight
line. Our warp supports transformations of the image such as translation, scaling and
rotation.
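The warp of Eq. 7.9 and the compositional update of Eq. 7.5 can be sketched in a few lines of Python. This is an illustrative floating-point sketch, not the fixed-point hardware datapath: the helper names are hypothetical, and the use of 3×3 homogeneous matrices to compose and invert warps is one convenient way to exploit the fact that affine warps form a group.

```python
def affine_warp(x, y, p):
    """Apply the affine warp of Eq. 7.9 to the point (x, y); p = (p1, ..., p6)."""
    p1, p2, p3, p4, p5, p6 = p
    return ((1 + p1) * x + p3 * y + p5, p2 * x + (1 + p4) * y + p6)

def to_matrix(p):
    """3x3 homogeneous matrix of the warp, so that composition is a matrix product."""
    p1, p2, p3, p4, p5, p6 = p
    return [[1 + p1, p3, p5], [p2, 1 + p4, p6], [0.0, 0.0, 1.0]]

def from_matrix(m):
    """Recover (p1, ..., p6) from a 3x3 affine matrix."""
    return (m[0][0] - 1, m[1][0], m[0][1], m[1][1] - 1, m[0][2], m[1][2])

def invert_affine(m):
    """Inverse of a 3x3 affine matrix (assumes a non-zero determinant)."""
    a, b, e = m[0]
    c, d, f = m[1]
    det = a * d - b * c
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * e + ib * f)],
            [ic, id_, -(ic * e + id_ * f)],
            [0.0, 0.0, 1.0]]

def matmul3(m, n):
    """Product of two 3x3 matrices."""
    return [[sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def update_warp(p, delta_p):
    """Eq. 7.5: W(x; p) <- W(x; p) o W(x; delta_p)^-1."""
    return from_matrix(matmul3(to_matrix(p), invert_affine(to_matrix(delta_p))))

p = (0.1, 0.02, -0.03, 0.05, 2.0, -1.0)       # arbitrary example parameters
unchanged = update_warp(p, (0.0,) * 6)        # a zero increment leaves p unchanged
identity = update_warp(p, p)                  # composing with its own inverse gives p = 0
```

With all parameters at zero the warp is the identity, p5 and p6 alone give a pure translation, and the remaining four parameters cover scaling, rotation and shear.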
Finally, note that the template image T in the above equations does not need to
be the whole image, but it could be only a subset of it (e.g. excluding boundaries).
This is because there can be significant information redundancy as T gets larger. For
example, if we only look at 80% of the pixels and they all get warped the same way,
then the other 20% are likely warped the same way as well. There is thus a trade-off
between the size of the template image T (the computational cost) and the accuracy
of the warp estimate. That is, if T is larger, then we spend more energy aligning two
images, but the alignment we find is more accurate. We choose T to be the whole
image minus 10% of the image depth/width for each boundary.
7.4.2 LK implementation in hardware
7.4.2.1 Fixed-point arithmetic
Floating-point formats allow designers to easily develop algorithms without worrying
too much about their data’s dynamic range. However, implementing floating-point
operators in hardware is expensive due to exponent alignment, leading-one detection,
and rounding operations, especially when also handling sub-normal numbers. There-
fore, designers often explore ways to map their systems to only use fixed-point data
types. This often comes out cheaper than using floating-point, especially when the
dynamic range of the data is predictable and relatively small.
The LK algorithm turns out to have a large dynamic range throughout its pipeline.
For example, the Hessian is an accumulation over many pixels and ends up having
large numbers, whereas the inverse Hessian ends up having small numbers. However,
at each stage, the dynamic range is smaller, and predictable. The different stages
can therefore be adapted to use fixed-point types of different sizes. The exact sizes
depend on the size of the input images; for example: the larger the image, the more
accumulation there is, the larger the Hessian needs to be.
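As a rough illustration of how accumulation drives the required width, the sketch below computes the number of integer bits needed to hold a worst-case sum. The helper `signed_accumulator_bits` and the bound of 255 per term are hypothetical: the actual terms accumulated into the Hessian are products of gradients and coordinates, and fractional bits are not modeled here.

```python
def signed_accumulator_bits(n_terms, max_abs_term):
    """Integer bits (including sign) needed to sum n_terms values,
    each bounded in magnitude by max_abs_term."""
    worst_case = n_terms * max_abs_term
    return worst_case.bit_length() + 1  # +1 for the sign bit

bits_small = signed_accumulator_bits(128 * 128, 255)     # 128x128 image
bits_large = signed_accumulator_bits(1024 * 1024, 255)   # 1024x1024 image
extra_bits = bits_large - bits_small
```

Each doubling of the number of accumulated pixels adds one bit to the accumulator, so going from a 128×128 image to a 1024×1024 image (64 times more pixels) adds six bits under these assumptions.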
7.4.2.2 Parallel LK
As described in Chapter 4, parallelizing an application may help us reduce total
energy consumption, since it reduces the cost of each memory access (because each
PE’s memory is smaller); but for energy to be reduced, we must also ensure that the
communication between PEs is minimized. However, parallelizing LK is not a trivial
task because of the warping step: the target and source pixels may belong to different
PEs. Other than that, most operations are very local, and we can separate the image
into equal subsets, each processed by a different PE. Aside from the warping problem,
the only part where PEs need to communicate is when computing the sum for the
H matrix (Eq.7.8, step 4 of the pre-computation in Fig. 7.3), and the sum for the D
vector (Eq. 7.7, step 3 of the loop in Fig. 7.3). These are summations over all pixel
locations x̄, and are only used once at the end of the pre-computation (H) or after
each iteration (D). We can therefore accumulate these sums separately for each PE
and obtain a matrix Hi and a vector Di for each PE i. We then compute the sums
over the whole image: H = Σ_i H_i and D = Σ_i D_i.
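The per-PE accumulation can be sketched as follows. The steepest-descent row used here follows from differentiating the affine warp of Eq. 7.9 with respect to p̄ at p̄ = 0̄; the function names, the toy 4×4 image with made-up integer gradients, and the split into four row bands are assumptions of this sketch, not the partitioning used in the actual design.

```python
def steepest_descent_row(gx, gy, x, y):
    """Row vector grad(T) * dW/dp for the affine warp of Eq. 7.9, evaluated at p = 0.

    dWx/dp = (x, 0, y, 0, 1, 0) and dWy/dp = (0, x, 0, y, 0, 1).
    """
    return [gx * x, gy * x, gx * y, gy * y, gx, gy]

def accumulate_hessian(pixels):
    """Sum of outer products sd^T sd over (gx, gy, x, y) tuples (Eq. 7.8)."""
    h = [[0] * 6 for _ in range(6)]
    for gx, gy, x, y in pixels:
        sd = steepest_descent_row(gx, gy, x, y)
        for i in range(6):
            for j in range(6):
                h[i][j] += sd[i] * sd[j]
    return h

def add_matrices(a, b):
    """Element-wise sum of two 6x6 matrices."""
    return [[a[i][j] + b[i][j] for j in range(6)] for i in range(6)]

# Toy 4x4 image with made-up integer gradients, split into P = 4 row bands.
pixels = [((x + y) % 3 - 1, (2 * x + y) % 5 - 2, x, y)
          for y in range(4) for x in range(4)]
bands = [pixels[4 * i: 4 * (i + 1)] for i in range(4)]   # one band per PE

h_global = accumulate_hessian(pixels)          # single-PE reference
h_from_pes = [[0] * 6 for _ in range(6)]
for band in bands:                             # each PE i contributes its own H_i
    h_from_pes = add_matrices(h_from_pes, accumulate_hessian(band))
```

Because the sum over pixels is associative, the only inter-PE traffic is one 6×6 matrix per PE for H, and the same argument applies to the 6-element vector D at each iteration.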
The problem for the warping step is illustrated in Fig. 7.4a with P = 4 PEs. Each
of the four blue rectangles corresponds to the image captured by one PE. Together,
they form the full image that we are observing (dark gray). Over time, the camera
that we are using may move through space, as shown for frame j, where the image
covered by the four PEs is now offset compared to the original image (dark gray). As
shown, on frame j, part of the pixels from the fixed object (red rectangle) will get
stored in PE #2, whereas they were previously stored in PE #3. If PE #2 and PE
#3 cannot communicate, PE #2 will not have the information it needs to properly
compare the pixels it currently sees to their previous values.
This will prevent us from accumulating the Hi and Di matrices for every single
pixel of the PE. This on its own is not a problem, since we do not even need to do that
(we can choose a sub-image within the PE as the template T , see the end of Sec. 7.4).
The problem is when we have converged and we want to compute the warped output
image. Then, we need to warp every single pixel from each PE.
One option we have to ensure that a pixel can get all the information it needs is
to allow for inter-PE communication, for example using a packet-switched or time-
multiplexed network [61]. However, this may add significant communication overhead.
Instead, we suggest using two simple modifications to LK:
• keep track of an “overall shift”,
• duplicate boundaries between PEs.
The “overall shift” keeps track of how much the frame has been shifted between
the first frame and the current frame. We do this approximately, for example by
keeping track of how much the middle pixel’s location has shifted, (δx, δy) (that is,
shifted from our perspective; the point is actually fixed on the ground). This way, on frame
j, we do not necessarily store the pixel we see at location (x, y) into the same PE we
would on frame 1; rather, we store it in the same location where we would have stored
(x+δx, y+δy) on frame 1. That way, each PE still gets approximately the same pixels
that it had on frame 1, and we avoid the problem from Fig. 7.4a. To make this scheme
work better, we duplicate some of the boundary pixels (store them redundantly across
two neighboring PEs). This offsets the approximation in our calculation of the “overall
shift”, and allows us to avoid having the PEs communicate with one another. We
find that duplicating only 5% of the image depth/width works well to make this
scheme work, as long as rotations of the image are not too pronounced in the data
set. Fig. 7.4b shows how the fixed object that was previously not part of PE #3
once we get to frame j is now included in PE #3, at least due to the duplicated
boundaries.¹
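The two modifications can be sketched together as a function that decides which PEs store a given pixel of the current frame. Everything below is a hypothetical illustration: the function name `owning_pes`, the 2×2 PE grid numbered in row-major order as in Fig. 7.4, and the interpretation of the 5% duplication as a margin added on every side of each PE tile are assumptions, not details confirmed by the text.

```python
def owning_pes(x, y, shift, nx, ny, pes_x=2, pes_y=2, overlap=0.05):
    """PEs that store pixel (x, y) of the current frame.

    shift = (dx, dy) is the running "overall shift"; each PE tile is widened by
    overlap * image size on every side to model the duplicated boundaries.
    """
    sx, sy = x + shift[0], y + shift[1]      # position in frame-1 coordinates
    tile_w, tile_h = nx / pes_x, ny / pes_y
    margin_x, margin_y = overlap * nx, overlap * ny
    owners = []
    for row in range(pes_y):
        for col in range(pes_x):
            x_lo, x_hi = col * tile_w - margin_x, (col + 1) * tile_w + margin_x
            y_lo, y_hi = row * tile_h - margin_y, (row + 1) * tile_h + margin_y
            if x_lo <= sx < x_hi and y_lo <= sy < y_hi:
                owners.append(row * pes_x + col)
    return owners

interior = owning_pes(10, 10, (0, 0), 100, 100)    # well inside PE #0
boundary = owning_pes(52, 10, (0, 0), 100, 100)    # in the duplicated band
shifted = owning_pes(40, 10, (30, 0), 100, 100)    # moved by the overall shift
```

A pixel in the duplicated band is stored in both neighboring PEs, and a pixel that the camera motion has displaced is routed to the PE that held the corresponding ground location on frame 1, without any PE-to-PE communication.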
¹ If all we had was uniform translation throughout the image, keeping track of the “overall shift”
would have been enough, but this cannot be seen in Fig. 7.4b.
[Figure 7.4 (diagrams): frames 1 and j over four PEs (PE #0 to PE #3). Legend: fixed object; fixed image; image seen in frame 1; image seen by a PE. Panels: (a) Parallel LK problem: the fixed object is now partially stored in PE #2 instead of PE #3. (b) Parallel LK based on duplicated PE boundaries: duplicated boundaries allow for the fixed object to still be covered by PE #3.]
Figure 7.4: Parallel LK problem and solution with duplication
7.4.2.3 Caching scheme for the warp
When computing the warp I(W (x̄; p̄)) (step 1 of the loop, Fig 7.3), we traverse all the
coordinates of interest from left to right, top to bottom, from (0,0) to (nx − 1, ny − 1).
For each coordinate pair x̄T = (xT , yT ), we compute the warped coordinates (xI , yI) =
W (x̄T ; p̄); this is a coordinate pair that indicates the pixel in the new image that
corresponds to the location (xT , yT ) in the previous image. Of course, (xI , yI) will
generally not fall right on a pixel that we know in I, but rather, it will fall somewhere
between four corner pixels. We then need to retrieve these four pixels and compute
a linear interpolation as an estimate of the value of I at (xI , yI). We therefore need
four I values for each coordinate (xT , yT ), but we want to minimize reads from the
large memory I. Ideally, we want to only read each I pixel once throughout this
process, similar to what we would have done in a standard, regular window filter. For
instance, our window filter from Chapter 4 (Fig. 4.8) stores the latest data in registers,
and knows the next column of the window from line buffers that are populated while
running over the previous row, except for the bottom-right element, which it reads
from memory. Overall, this allows us to access the large memory only once per pixel.
In the case of warping in LK, if the image were just observing translations (constant
shifts in the x or y direction), then adding registers and a line buffer would also work
to guarantee that we only read one pixel per cycle. However, because of rotations
and scaling, also supported in our warp (Eq. 7.9), we do not, in general, read from
I in a regular fashion. Still, the reads are usually “close to regular”. For instance,
while scanning row 7 by looking at the following (xT , yT ) coordinates: (0,7) (1,7)
(2,7) (3,7) (4,7), we may observe this sequence of target (xI , yI) coordinates due to
rotation: (0,13.2) (1,13.1) (2,13.0) (3,12.9) (4,12.8). Starting at (xT , yT ) =(3,7), we
start needing pixels in row 12, instead of row 13. We expect a similar switch to happen
around (xT , yT ) =(3,8), while scanning the next row. We exploit this by changing
the line buffer into a “cache”, augmented with the row index of the data stored at
each location. This allows the line buffer to store data that does not all belong to the
same row, and exploit the expected regularity described above.
For this scheme to be efficient, the images should not rotate or scale too much
between two frames. This is usually a safe assumption, especially with higher FPS
rates (frames per second). The scheme works well for our data sets, and only produces
a 0.5% cache miss rate.
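The row-tagged line buffer can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `TaggedLineBuffer`, the two-buffer organization indexed by row parity, and the `interpolate` helper are hypothetical choices made for illustration, not the organization of the hardware described here.

```python
import math


class TaggedLineBuffer:
    """Software model of a line buffer whose entries carry a row tag.

    Assumption: two buffers (even rows, odd rows), one entry per column.
    """

    def __init__(self, image):
        self.image = image  # stands in for the large I memory
        width = len(image[0])
        # Each entry is [row_tag, value]; a tag of None means "empty".
        self.lines = [[[None, 0] for _ in range(width)] for _ in range(2)]
        self.hits = 0
        self.misses = 0

    def read(self, col, row):
        entry = self.lines[row % 2][col]
        if entry[0] == row:
            self.hits += 1
        else:
            # Tag mismatch: fetch from the large memory and retag the entry.
            self.misses += 1
            entry[0] = row
            entry[1] = self.image[row][col]
        return entry[1]


def interpolate(cache, x_i, y_i):
    """Bilinear interpolation of I at (x_i, y_i) from its four corner pixels."""
    x0, y0 = math.floor(x_i), math.floor(y_i)
    fx, fy = x_i - x0, y_i - y0
    top = (1 - fx) * cache.read(x0, y0) + fx * cache.read(x0 + 1, y0)
    bottom = (1 - fx) * cache.read(x0, y0 + 1) + fx * cache.read(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom
```

For a pure translation, every pixel of I misses exactly once in this model, which matches the "one read per pixel" goal; with a small rotation, the tags let neighboring columns hold different rows at the same time.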
7.4.2.4 Structure of LK in hardware
Fig. 7.5 shows a flow diagram for the LK algorithm, as implemented in hardware.
The steps from Fig. 7.3 are labeled on the diagram. LK mostly consists of sequences
of operations that can be streamed together, which maps well to hardware. Both
for the pre-computation step, and for the loop’s iterations, the same operations are
repeated for each pixel, and each pixel contributes to a final accumulated matrix:
the Hessian for the pre-computation, D for the loop’s iterations. Each PE thus
computes one component Hi and one component Di independent of the other PEs.
The only communication between PEs happens when calculating $H = \sum_i H_i$ (once)
and $D = \sum_i D_i$ (once per loop iteration) (Sec. 7.4.2.2).
For each pixel coordinate (x, y), within one PE, we fetch the template’s intensity
value from the T memory. We compute the two gradient values ∇T = (Gx, Gy) using
a “Gradient” module, which is a basic 3× 3 Sobel window filter with a line buffer, so
that we only need to access T once per pixel, even though each pixel needs its eight
immediate neighbors. The Jacobian is easy to compute based on the coordinates
(x, y):
$$\frac{\delta W}{\delta \bar{p}} = \begin{bmatrix} x & 0 & y & 0 & 1 & 0 \\ 0 & x & 0 & y & 0 & 1 \end{bmatrix} \qquad (7.10)$$
We then multiply (Gx, Gy) by the Jacobian to obtain the 1×6 steepest descent vector.
The steepest descent vectors are needed both in the pre-computation and in the
loop, so at this point we have two options: The first one is to perform the above
operations once during pre-computation and store the steepest descent vectors in an
(nx · ny) × 6 matrix S, living in memory. We then fetch a value from S for each pixel at
each loop iteration. The second option is to repeat the above steps and re-compute
the steepest descent vectors each time. This is better if these computations cost less
than accessing the large S memory. We only explore the first option.
During the pre-computation, we multiply the transpose of each steepest descent
vector by the vector itself (the “Inner product” block of Fig. 7.5; for a 1 × 6 vector
this is an outer product), and we obtain a 6 × 6 Hessian matrix $H_{(x,y)}$ for the
particular pixel at (x, y). This Hessian is accumulated over all pixels of the given PE
to obtain $H_i = \sum_{\text{pixels in PE } i} H_{(x,y)}$. We then add $H_i$ to the Hessian matrices
obtained from other PEs and get $H = \sum_i H_i$. This is the first time that there is
communication between PEs.
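The pre-computation described above can be sketched in a few lines. This is an illustrative software model, not the hardware pipeline: the function names are hypothetical, the gradients (Gx, Gy) are taken as given inputs rather than computed by a Sobel filter, and each pixel is passed as a `(gx, gy, x, y)` tuple by assumption.

```python
def steepest_descent(gx, gy, x, y):
    """Multiply (Gx, Gy) by the 2x6 Jacobian of Eq. 7.10.

    Jacobian rows: [x, 0, y, 0, 1, 0] and [0, x, 0, y, 0, 1].
    """
    return [gx * x, gy * x, gx * y, gy * y, gx, gy]


def pe_hessian(pixels):
    """Accumulate the 6x6 per-pixel Hessians of one PE into H_i."""
    h = [[0.0] * 6 for _ in range(6)]
    for gx, gy, x, y in pixels:
        sd = steepest_descent(gx, gy, x, y)
        for r in range(6):
            for c in range(6):
                h[r][c] += sd[r] * sd[c]
    return h


def reduce_hessians(partials):
    """Sum the per-PE matrices H_i into H (the only inter-PE communication)."""
    total = [[0.0] * 6 for _ in range(6)]
    for h_i in partials:
        for r in range(6):
            for c in range(6):
                total[r][c] += h_i[r][c]
    return total
```

Because the accumulation is a plain sum over pixels, splitting the pixels across PEs and reducing the partial matrices gives the same H as a single PE processing every pixel.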
The main Hessian H is then inverted using LU decomposition, backward substitution,
and forward substitution. The resulting H−1 matrix is the inverse Hessian,
and will be used once at the end of each loop iteration. Matrix inversion can have
a large latency, but it is masked by the larger latency of computing the first whole
iteration (which can be done in parallel).
For the loop iterations, within each PE, we first compute the warp at the given
coordinate (x, y) (Eq. 7.9). Then, we call the warp module, which communicates
with the new image I and computes the warped pixel I(W (x̄, p̄)). We then subtract
I(W (x̄, p̄)) from the value of the pixel at the template T (x̄) and multiply by the
steepest descent vector at the (x̄) location: $\nabla T \frac{\delta W}{\delta \bar{p}}(\bar{x})$. This gives a 1×6 $D_{(x,y)}$ vector
for the current pixel at (x, y). This gets accumulated into a total $D_i = \sum_{\text{pixels in PE } i} D_{(x,y)}$
for the current PE, before being summed with the other PEs’ $D_i$ vectors: $D = \sum_i D_i$.
This is the first time that there is communication between the PEs in the loop.
The final D vector is then multiplied by H−1 to give the update to the warp
parameters ∆p̄. We decide if we have converged based on ∆p̄, and we broadcast the
decision to each PE.
7.4.3 LWC for LK
As mentioned in Sec. 7.4, there is significant redundancy in the information retrieved
from the different pixels (remember the trade-off in choosing the size of T ). This
suggests that LK should be tolerant to upsets affecting a single pixel’s output: it
is usually not enough to affect the final H and D matrices significantly. Still, some
upsets could have a more dramatic effect, such as changing the most significant bit
of a value in H or D. This in turn leads to a wrong value for ∆p̄. Similarly to the
LWC for conjugate gradient from Sec. 5.8.1, we use the convergence acceptance test
as a built-in LWC: If ∆p̄ increases between two iterations, then we know that there
was an error and we roll back. Indeed, we expect ∆p̄ to monotonically decrease until
convergence.
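The acceptance test can be sketched as a wrapper around the iteration. This is a toy model under stated assumptions: `run_with_lwc` and `make_toy_step` are hypothetical names, the "iteration" is a scalar update that halves the distance to zero rather than an LK iteration, and the injected fault simply scales one update to stand in for an upset in a high-order bit.

```python
def run_with_lwc(step, p0, eps, max_iters=100):
    """Iterate `step` and roll back whenever the update norm increases.

    `step(p)` returns (candidate_p, update_norm). The state p is only
    committed when the check passes, so a rollback is just a retry from p.
    """
    p = p0
    prev_norm = float("inf")
    rollbacks = 0
    for _ in range(max_iters):
        candidate, norm = step(p)
        if norm > prev_norm:
            rollbacks += 1  # check failed: discard the candidate and recompute
            continue
        p, prev_norm = candidate, norm
        if norm < eps:
            break
    return p, rollbacks


def make_toy_step(fault_at_call):
    """Build a toy iteration; the call numbered `fault_at_call` is corrupted."""
    calls = {"n": 0}

    def step(p):
        calls["n"] += 1
        delta = -0.5 * p  # fault-free update: halve the distance to 0
        if calls["n"] == fault_at_call:
            delta = 64.0 * delta  # injected error: update is far too large
        return p + delta, abs(delta)

    return step
```

In this model a single corrupted iteration costs one extra iteration and leaves the final result unchanged. The check only catches errors that make the update grow; an error that shrinks the update passes this test.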
Similarly to conjugate gradient (Sec. 5.8.1), this LWC alone does not guard against
an error in the final warp following the evaluation of ∆p̄ and convergence. We could
cover this case in a manner similar to conjugate gradient by directly evaluating Eq. 7.7
at the end as part of the check. The LWC then costs about the same as a full iteration
of the compute, so as long as the expected number of iterations is greater than 1,
the LWC is lightweight. However, depending on the data set, the typical number of
iterations may be very small (about 1-2 for a fixed camera, 5-10 for a slowly moving
camera), making the LWC more costly in comparison.
Figure 7.5: Flow diagram for LK in hardware. The diagram is split into a pre-comp part and a loop part, with steps labeled [1] to [9]. Pre-comp blocks: fetch pixel at (x, y) from the nx × ny memory T(x,y); Gradient (Gx, Gy); multiplication by δW/δp to form ∇T δW/δp; inner product giving the 6 × 6 H(x,y); accumulation over (x, y) into Hpe; addition of Hpe from other PEs to form H; inversion to H−1. Loop blocks, for each pixel (x, y): warp parameters p0 to p5; W(x,p) = (Wx, Wy); warp of the nx × ny memory I(x,y); sub; mul; D(x,y); accumulation over (x, y) into Dpe; addition of Dpe from other PEs; mul giving ∆p0 to ∆p5; test (||∆p|| < ε). Yes: next frame, update T, load new I. No: next iteration. Other labels: update pixel at (x, y) if converged; W(x,p) ∘ W(x,∆p)−1. Legend: memory, op, reg/val.
We can actually avoid the extra cost of re-evaluating Eq. 7.7 as part of the LWC,
by noting that we are using LK as part of a larger system, which provides two helpful
features:
• As we will see in Sec. 7.5, we will be able to tolerate isolated errors at the output
of the third stage (GMM), which also means that we will be able to tolerate
isolated errors at the output of LK.
• As long as the error rate is not too high, LK has a self-correcting behavior: a
pixel that comes out with the wrong warped intensity in one cycle will not have
a significant effect on the H and D matrices of the next iteration, and since ∆p̄ is
expected to be correct because of the LWC, the wrong pixel will typically get the
correct value from warping the new image on the next iteration. In other words,
this self-correcting behavior allows us to tolerate errors in the output image as
long as ∆p̄ is correct, since in the extreme, we could even skip a whole frame
and still be able to perform alignment (end of Sec. 7.3).
Given the approximations in the computation and the checking operations, we will
say that LK is correct not if its output is bit-exact with the output of the fault-free
LK, but if the output of the whole WAMI system (i.e. of the subsequent GMM
stage) is correct (more in Sec. 7.5.3).
Note that we could have also protected the pre-computation operation by checking
the result of the matrix inversion as in Sec. 5.7.2, but this is not necessary since
significant errors will get captured by the other check, and non-significant errors will
get self-corrected.
Finally, note that unlike DeBayer, LK already stores the incoming data in memory,
so when an error occurs we can easily roll back, and there is no need for extra storage
cost.
7.5 Stage 3: Gaussian Mixture Modeling
7.5.1 GMM overview
After aligning the newest image to the original frame of reference, we can compare
the changes in pixel values, estimate the locations that are part of the background,
and track moving objects.
In its simplest form, a background detection algorithm would record the current
value of each pixel, and on the next frame, the algorithm would consider each pixel
that has changed as part of the foreground. Of course, this simple scheme does not
work. The pixel intensities are inherently noisy, so we need to account for small
variations around the previously recorded pixel value. This is why we store the
mean pixel value over time at each pixel location, as well as its standard deviation
(a Gaussian model), allowing us to check whether a new pixel falls within a given
number of standard deviations of the mean value. If it does, then the new pixel is said
to match the given Gaussian distribution. In fact, we need to store more than one
of these Gaussian models for each pixel, and associate a weight with each of them,
indicating how confident we are in the given model. This is what the GMM algorithm
does [109]. This way, we can learn more than one possible value for the background.
This is useful because of variation in nature (e.g. light reflecting on water, rotating
leaves of a tree). When a new pixel value does not match any distribution, we update
the least likely model (the one with the lowest weight) with the new pixel value. Over
time, the more we see that pixel value, the larger the weight of the model will become,
the lower its standard deviation will become, the more likely it will match, and be
considered part of the background. On the other hand, if the pixel belonged to a
moving object, we would forget about that unlikely distribution after a few frames.
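The per-pixel behavior described above can be sketched as follows. This is a simplified illustration, not the algorithm of [109] or [43]: the matching threshold `k`, the learning rate `alpha`, the initial and minimum variances, the background weight threshold, and the exact update rules are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    weight: float
    mean: float
    var: float


def update_pixel(models, value, k=2.5, alpha=0.05, init_var=100.0,
                 init_weight=0.05, min_var=4.0, bg_weight=0.3):
    """Update one pixel's mixture; return True if `value` looks like background."""
    matched = None
    for m in models:
        # Within k standard deviations, compared on squared values (no sqrt).
        if (value - m.mean) ** 2 <= k * k * m.var:
            matched = m
            break
    if matched is None:
        # No match: overwrite the least likely model with the new value.
        worst = min(models, key=lambda m: m.weight)
        worst.weight, worst.mean, worst.var = init_weight, float(value), init_var
    else:
        diff = value - matched.mean
        matched.mean += alpha * diff
        matched.var = max(min_var, matched.var + alpha * (diff * diff - matched.var))
    for m in models:
        # The matched model gains weight; all the others decay.
        m.weight += alpha * ((1.0 if m is matched else 0.0) - m.weight)
    total = sum(m.weight for m in models)
    for m in models:
        m.weight /= total
    return matched is not None and matched.weight >= bg_weight
```

A value that keeps reappearing gains weight and a tighter variance until it is reported as background, while a value seen once only occupies the least likely slot and decays if it is not seen again.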
7.5.2 GMM implementation
We implemented the hardware-efficient GMM algorithm described in [43]. It improves
upon the basic GMM algorithms by requiring fewer non-linear operations (which are
looked up in Read-Only Memories (ROMs)). It modifies the basic operations to
avoid computing a square root, and thus manipulates variance instead of standard
deviation. The result is an implementation that is cheap in terms of logic, but that is
dominated by the memory accesses, where each pixel needs to store 32M bits of data
and access them on each new frame (per model: 8 bits for the weight, 8 bits for the mean,
and 16 bits for the variance; M is the number of Gaussian models stored, and we use M = 5).
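As a quick sanity check on the storage requirement, the arithmetic can be written out. The helper name `gmm_storage_bits` is hypothetical, and the 512 × 512 image size is the baseline size used in Sec. 7.6.1.

```python
def gmm_storage_bits(width, height, num_models,
                     weight_bits=8, mean_bits=8, var_bits=16):
    """Total model storage: one (weight, mean, variance) triple per model per pixel."""
    bits_per_model = weight_bits + mean_bits + var_bits  # 32 bits
    return width * height * num_models * bits_per_model


if __name__ == "__main__":
    bits = gmm_storage_bits(512, 512, num_models=5)
    print(bits, "bits =", bits // 8 // 1024 // 1024, "MiB")
```

With M = 5, each pixel stores 160 bits, so a 512 × 512 image needs 41,943,040 bits (5 MiB) of model storage, all of which is read and written on every frame.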
GMM is thus a memory-intensive application where the number of Gaussian models
stored per pixel sets a trade-off between detection accuracy and lower energy consumption.
GMM is often implemented on FPGAs (e.g. [43]), but because of the
large memory requirements, usually uses external memory to store the models. This
certainly hurts energy, and our suggestion is to move to on-chip memory for GMM.
This is not feasible for larger size images on current commercial FPGAs, because of
the limited amount of on-chip memory. However, as Chapter 4 argues, we need to
build more memory-balanced FPGAs, with more on-chip memory than what is currently
fabricated. Furthermore, as we move to more advanced technology sizes, our
transistors become smaller, and we have more room to store larger memories on-chip
(see Sec. 7.6).
GMM is an embarrassingly parallel problem (Fig. 4.7), since each pixel operation
is completely independent of its neighbors. There is a dependency in time, since the
result from time ti is needed to compute the result at time ti+1, but GMM is still
easy to pipeline, since the time it takes to process the rest of the pixels of the image
is typically much longer than the latency of a single pixel’s processing through the
GMM pipeline.
7.5.3 LWC for GMM
The data going into GMM and coming out of it is inherently noisy. We do not
get 100% accuracy in detecting whether a given pixel is part of the foreground or
background. Therefore, even in a standard WAMI without LWCs, we use some noise-
tolerance techniques. In particular, GMM is typically followed by a morphological
operation, a simple window filter that “smoothens” the output: if a single pixel is
identified as foreground but is surrounded with background pixels, then it will be
switched back to being part of the background, and vice versa. This way we end up
with patches of foreground objects, and a single error does not corrupt the output.
We can think of the morphological operation as an error-correcting LWC. At an even
higher level, we end up tracking moving objects through time, and there is consistency
between the locations of a moving object through different frames. This provides yet
another way of checking for mistakes at a high-level.
We thus do not implement an LWC for GMM directly, since errors are tolerated
in the same way they would be in a non-LWC implementation of WAMI. Instead,
we implement a morphological operation to remove isolated pixels and consider the
output to have an SDC if the resulting image post-morphological operation does not
match the image when GMM is fault-free (but we do not count this as an “extra
cost” of the LWC version).
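A minimal sketch of such a morphological operation and of the SDC criterion follows. The function names are hypothetical, the mask is assumed to be a list of rows of 0 (background) and 1 (foreground), the window is assumed to be 3 × 3, and border pixels are left unchanged for simplicity.

```python
def remove_isolated(mask):
    """Flip any interior pixel whose eight neighbors all have the opposite value."""
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbors = [mask[y + dy][x + dx]
                         for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                         if (dy, dx) != (0, 0)]
            if all(n != mask[y][x] for n in neighbors):
                out[y][x] = 1 - mask[y][x]
    return out


def has_sdc(faulty_mask, golden_mask):
    """Count an SDC only if the filtered outputs differ."""
    return remove_isolated(faulty_mask) != remove_isolated(golden_mask)
```

Under this criterion, a single flipped pixel far from any foreground patch is filtered out and is not counted as an SDC, whereas a cluster of flipped pixels survives the filter and is counted.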
7.6 WAMI results
7.6.1 Basic WAMI results
We show results for the same architecture as in the previous chapter: 22 nm LP with
a single-Vdd, gate boosting, power gating, and the robust memory architecture from
Chapter 4 (a column of 256 × 32 memories placed every 9 columns). We fix the
channel width to 200, a common over-provisioned channel width, so that place and
route completes faster (Sec. 2.1.2).
Figure 7.6: Energy benefits of voltage scaling for WAMI stages (energy ratio versus Vdd, from 0.3 to 0.9, for debayer, lk, gmm, and wami)
The baseline is a WAMI system for a 512 × 512
pixel image, at nominal Vdd, without LWCs, and with one PE (minimum parallelism,
as it would normally be implemented). The result of adding LWCs and sweeping
voltage is shown in Fig. 7.6. The LWCs we use are very low cost; at nominal Vdd,
they cost 3% for DeBayer, 4% for LK, 0% for GMM, 1% for WAMI. This leaves
significant room to reduce energy as we scale voltage. However, 41% of the energy for
the N = 512 and P = 1 WAMI is in the memories, for which we do not scale voltage.
This results in 48% energy savings for DeBayer, 40% for LK, 29% for GMM, 34%
for WAMI. DeBayer contributes to 5% of the total WAMI energy, LK contributes to
35%, and GMM contributes to 60%.
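The per-stage numbers can be cross-checked against the overall WAMI figure with a weighted sum. This is a back-of-the-envelope sketch: it assumes that the stage contributions are fractions of the baseline WAMI energy and that each stage's savings are relative to that stage's own baseline energy.

```python
# Fraction of baseline WAMI energy spent in each stage (from the text above).
CONTRIBUTION = {"debayer": 0.05, "lk": 0.35, "gmm": 0.60}
# Energy savings of each stage after adding LWCs and scaling voltage.
SAVINGS = {"debayer": 0.48, "lk": 0.40, "gmm": 0.29}


def weighted_savings(contribution, savings):
    """Overall savings as the contribution-weighted sum of per-stage savings."""
    return sum(contribution[stage] * savings[stage] for stage in contribution)


if __name__ == "__main__":
    print(f"{weighted_savings(CONTRIBUTION, SAVINGS):.1%}")
```

The weighted sum is 0.05 × 0.48 + 0.35 × 0.40 + 0.60 × 0.29 = 0.338, consistent with the 34% reported for the whole WAMI, and it shows that GMM, the most memory-bound stage, dominates the total.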
7.6.2 Parallelism benefits
Using the ideas from Chapter 4, we can work on reducing memory energy and further
increasing the benefits we get out of voltage scaling. In particular, we could perform
a new memory study to find a new energy-minimizing architecture, given the voltage
scaling ability. However, the limited set of benchmarks with LWCs would likely skew
the results, so we leave this as future work.
Figure 7.7: Memory contribution to energy when changing parallelism for WAMI. Panels: (a) Vnominal, (b) Vopt. Each panel plots the energy ratio (0.0 to 1.0) versus PE count (1 to 128), split into mem and other.
The other technique we saw to reduce the
contribution of memories to total energy consumption is parallelism. Even at nominal
Vdd, we can reduce energy by 8% by optimizing parallelism (Fig. 7.7a). If we further
optimize voltage (operate at Vopt), parallelism is even more helpful in reducing energy
because of the decreased contribution of memories as the parallelism increases. As
Fig. 7.7b shows, in that case we get 22% total energy savings between P = 1 and
P = 32. At P = 1, 62% of the energy is due to memories, and this drops down to
14% at P = 32. Fig. 7.8 highlights the increased gains due to voltage scaling at the
optimum number of PEs: At P = 1 we could only claim 34% energy savings. At the
optimum PE count, we get 44% energy savings.
7.6.3 Other metrics
Our baseline WAMI has minimum parallelism P = 1 at nominal voltage Vdd = 0.95 V,
and without LWCs. It occupies an area of 1.6 cm². Assuming a 2.5 cm × 2.5 cm
maximum die size for the chip, we actually cannot fit more than a P = 4, 512 × 512
WAMI with LWC, which occupies an area of about 6 cm². This is lower than the
optimum PE count we found above (32). This limits our savings from voltage scaling
to 42%, versus 44% previously.
Figure 7.8: Comparison of gains due to voltage scaling for WAMI depending on parallelism (energy ratio versus PE count, from 1 to 64, for Vnom and Vopt; annotated ratios: 0.66x and 0.56x)
Our WAMI design has room to improve pipelining and clock speed, and our tools
do not perform automatic pipelining. As it is, the P = 1 WAMI runs at 81 MHz at
nominal-Vdd. The P = 4 WAMI also runs at 81 MHz at nominal-Vdd, but produces
about 4× the throughput. When scaling to Vopt = 0.65 V, the clock frequency drops to 7.9 MHz.
We can use the “voltage backing up” trick from Sec. 6.3.2.2, and operate at Vdd = 0.7 V
instead. This gets us back to 16 MHz. The resulting throughput is only 81/(16 ×
4) − 1 = 27% slower than the baseline, even with scaled voltage. It still comfortably
achieves real-time processing, with about 80 frames per second. The resulting energy
of the P = 4 case is 45% lower than that of the baseline, without any reduction in
reliability.
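The throughput comparison can be written out explicitly. The sketch assumes that throughput is proportional to clock frequency times PE count, which is an idealization of the "about 4×" scaling reported above; the function name is hypothetical.

```python
def slowdown(base_mhz, base_pes, scaled_mhz, scaled_pes):
    """Relative slowdown of the scaled design versus the baseline.

    Assumes throughput is proportional to (clock frequency x PE count).
    """
    return (base_mhz * base_pes) / (scaled_mhz * scaled_pes) - 1.0


if __name__ == "__main__":
    # Baseline: P = 1 at 81 MHz. Scaled: P = 4 at 16 MHz (Vdd = 0.7 V).
    print(f"{slowdown(81, 1, 16, 4):.1%} slower")
```

Here 81/(16 × 4) − 1 = 0.266, which rounds to the 27% quoted above: the fourfold parallelism recovers most of the fivefold loss in clock frequency.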
We expect that as feature sizes shrink in future technologies, we should be able to
fit more PEs on a die. We also expect to be able to bring in more memory on chip,
and cover larger image sizes without accessing costly off-chip memory.
7.7 Chapter conclusions
The switch from computational kernels to a larger application did add some overhead
in the form of extra buffers at the output of DeBayer, but it was far surpassed by the
benefits we got from using the context in which the application is running. We were
able to eliminate the need to check GMM, and design a very low cost LK check. With
less than 1% LWC overhead, we were able to reduce the voltage for the WAMI and
save energy without reducing reliability. To maximize savings, we further increased
parallelism and brought the memory cost down, allowing more room for the voltage
scaling to yield benefits: 44% reduction in total energy between the PE-optimum
nominal-Vdd case and the PE-optimum Vopt case. Parallelism also helped compensate
for the loss in throughput due to voltage scaling. Finally, area considerations forced
us to limit the parallelism level, but we were still able to achieve 42% reduction.
Chapter 8
Future Work
8.1 Technology
Unlike our study on low-energy techniques in Chapter 3, our LWC exploration did
not estimate the savings across different technology processes. We focused on 22 nm
and found around 50% energy savings, a number that is likely to change as we scale
technology. We expect these savings to change not because newer technologies
would be more susceptible to faults, but because of the shift in the minimum-energy
operating point. Indeed, the improvement in reliability due to LWCs is so large that
the bottleneck for energy savings ends up being the minimum-energy point, not the
reliability degradation (Sec. 6.4). In Chapter 3, we saw that Vdd gets closer to Vth
with newer technologies (Fig. 3.5a), reducing the room we have to scale voltage and
save energy. This is especially the case for the LP/LSTP processes, for which we
saved 60% geomean energy at 22 nm, but only 27% at 7 nm (Fig. A.3a and Fig. A.5a,
“Gate Boost + Power Gate”). We expect the savings with LWCs to get reduced in
similar proportions. For the HP processes, we found 59% savings from voltage scaling
at 22 nm, 50% and 7 nm, again we expect the savings with LWCs to get reduced in
similar proportions.
8.2 Automation
Throughout our work, we have manually implemented our benchmarks and their associated
LWCs in Bluespec SystemVerilog. In the WAMI, we have also manually
decided the interface between the different stages and added buffers for the rollback
mechanism. Instead, we can imagine having tools to automate the process of identifying
which parts of a design can be augmented with LWCs, at what granularity
they should be inserted, and how that changes the implementation of the computations.
This could work at different abstraction levels. For instance, a designer could
instantiate high-level computational kernels from a library, for which the automation
tool would have a list of LWCs to choose from. More involved, but no less interesting,
would be the ability to extract LWCs from a new, unknown, or “compiled”
computation.
Automation would also be of great use for communication energy minimization.
At a minimum, it could automatically exploit locality in computations and find the
parallelism balance point that optimizes energy; it might even automatically extract
parallelism in applications. In terms of parallelism, we did not explore the possibility of having
non-uniform parallelism across a system. For instance, in the WAMI it might have
been optimal to have 4 DeBayer PEs, 2 LK PEs, and 8 GMM PEs, instead of
setting them all to the same number of PEs. An automated tool should be able to
explore this.
8.3 Reliability
8.3.1 Error modeling
As mentioned in Sec. 6.2.4, the equation we use to relate bit flip rate and voltage
(Eq. 6.10) is a rather high-level one. We performed parameter sweeps and sensitivity
analyses to make sure we cover the worst-case, but it would be valuable to obtain
empirical data on the shape of the equation, which may change depending on different
regions of operation (similar to MOSFETs, which have saturation, linear, and sub-
threshold regions), including the value of α, as well as data on the absolute value of
the bit flip rates observed physically (e.g. FIT0(Vnominal)). This data will differ
across technologies, and also across the circuit of interest: whether it is a LUT, a register, a
buffer, a wire, or a memory cell.
Because we expect memories to fail faster than logic, we did not scale voltage on
memories. Instead we explored other ways to reduce communication energy (including
memory energy). However, reducing voltage on memories could yield more savings,
and it is something we could try in future work. In particular, we can leverage the
already-existing LWCs for memories (ECC) and find the optimum number of ECC
bits that allows us to reduce voltage on memories and save the most energy possible.
8.3.2 Variation
Voltage scaling may also be limited by variation, which we did not directly address
in our analysis. Bol shows that variation may cause the minimum energy to actually
increase as we scale down feature size [18]. This demands aggressive variation tolerance
techniques as well, such as DVFS [9, 30, 72]. Another one is component-specific
mapping, which Mehta et al. used to avoid the slow transistors due to Vth variation
when scaling voltage [85] (see Sec. 2.4.4). Component-specific mapping allowed them
to scale voltage more aggressively, and achieve larger energy savings than if they had
been limited by variation. For component-specific mapping to be efficient, we need
to know the actual delays of the different FPGA components. We can measure these
accurately using the techniques described in [45]. For component-specific mapping to
be practical, we cannot perform a full place-and-route flow for each chip that is fabricated,
since this would take too long. Place-and-route is usually done once per design
and shared across all chips, given that they have the same estimated (not measured)
timing information. To counter this, we can use the Choose-Your-own-Adventure
router [99], which loads a bitstream with a given routing, together with alternatives.
The chips are then free to pick the alternatives that work for them, and can do so
much faster than if we were to perform a unique place-and-route run.
8.3.3 Fault injection simulation
Our gate-level fault injection simulation methodology described in Sec. 6.2.2 allows
us to evaluate reliability orders of magnitude faster than if we injected errors at
the rate at which they actually happen. However, performing the fault injection
experiments still takes a long time because of the large number of experiments to
perform, especially as the application size increases (Sec. 6.2.3).
To speed up simulation time, we could use hybrid simulations, where we simulate
most of the design at the high-level RTL (Register Transfer Level), or even at the
software level. For instance, when simulating the WAMI, if we only wanted to inject
faults in the LK stage, we could keep the DeBayer and the GMM in software, and
interface them with the gate-level LK. When simulating a large number of repeated
PEs, we could decide to only inject faults in one of them, and use a software version for
the other ones, then combine the results to get the whole system’s reliability profile.
This exploits the inherent symmetry obtained from parallelizing many applications.
For instance, in the case of FFT with random data, we could get the reliability profile
of one butterfly stage at each depth level, instead of redundantly simulating
all the similar butterflies of a given stage, which we expect to behave the same way.
8.3.4 Limits on larger applications
Larger applications have more nodes that can get upset (a larger Ntd0 product). Even
with a constant bit flip rate, this means that there is a higher probability of a bit
flip somewhere in the design during the computation (Eq. 6.11), to the point where
it could drive the recomputation rate up and the energy savings down. In this case,
it may be beneficial to decompose the application into smaller subtasks with lower
error probabilities for each: we check the computation more often and roll back less
often. Identifying the subtask sizes at which decomposition is necessary is left as
future work, more suitable for an automated design exploration tool (Sec. 8.2).
As an example of such a decomposition, consider a large image on which we apply
a window filter as in Sec. 5.6.5. If the likelihood of an
error occurring in the full computation is too large, we can decompose the image into
a number of sub-images, and run the LWC for each one separately. We then have
finer granularity information on which sub-images need to be recomputed, and there
is no need to recompute the whole image.
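The benefit of decomposition can be illustrated with a simple retry model. This sketch is not Eq. 6.11: it assumes errors are independent and spread uniformly over the computation, that the LWC detects every error, that a failed subtask is recomputed until it passes, and that the cost of the checks themselves is negligible. The function name is hypothetical.

```python
def expected_work(p_error_full, num_subtasks):
    """Expected total compute, relative to one fault-free run of the full task.

    If the full task survives with probability s, each of k equal subtasks
    survives with probability s ** (1 / k) under the independence assumption.
    A subtask of size 1/k that is retried until it passes costs
    (1 / k) / s ** (1 / k) on average, so k subtasks cost 1 / s ** (1 / k).
    """
    survive_full = 1.0 - p_error_full
    survive_sub = survive_full ** (1.0 / num_subtasks)
    return 1.0 / survive_sub


if __name__ == "__main__":
    for k in (1, 2, 4, 16):
        print(k, round(expected_work(0.75, k), 3))
```

With a 75% chance of an error somewhere in the full computation, the undivided task costs 4× a fault-free run on average, two sub-images cost 2×, and finer decompositions approach 1×. In practice the per-subtask check overhead, which this model ignores, sets a lower bound on the useful subtask size.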
8.3.5 Improving reliability
We saw that our LWCs allow us to save energy while at least maintaining the same
reliability as the baseline, and we saw that they often improve reliability. We leave as
future work the exploration of the freedom we have in trading some of the achievable
energy benefits for a further increase in reliability over unprotected designs. This is
particularly useful when the FIT rate at nominal Vdd at a given technology is already
unacceptably high for a given application.
8.4 Architecture
8.4.1 FPGA structure
We have used the basic energy-efficient FPGA structure from Poon [92]: 10 4-LUTs
per cluster and interconnect segments of length 1. However, that study was performed
with older technologies and bi-directional interconnect. It would be interesting to
explore that space as well with more modern architectures. We also have not explored
interconnect depopulation patterns other than the default one in VPR.
We have not explored multiplier (or DSP block) size and placement. We have
explored memory size and placement, but we have not explored the possibility of
non-uniform spacing between the memories. We have proposed internal banking as
a technique to reduce application/architecture mismatch and reduce energy, but we
have not explored it beyond three levels, and we have not extended the idea to banking
the memory width as well.
Even though CACTI can be configured to tune memories for energy, its basic
architecture is delay-oriented and does not include many energy-oriented optimizations
that could be relevant to energy-minimizing FPGA embedded memories. There
is considerable room for future work to optimize these memories, and impact both
the energy of the ideally-matched, limit study memories, and the fixed-size FPGA
memory blocks.
8.4.2 Multi-Vdd
For dual-Vdd, we have not revisited the relative benefits of fixed versus programmable
Vdd and the granularity for Vdd selection. It may be possible to achieve greater benefits
with other organizations.
8.4.3 Joint exploration of memory architecture and voltage
scaling
In our work, we first found a robust memory architecture at nominal Vdd (Chapter 4),
then used it to explore the benefits of voltage scaling in combination with LWCs
(Chapter 6). We did not, however, perform a joint study, where we explore the
best possible memory architecture, while knowing that we will also scale the voltage
down to Vopt. This could find a different minimum point, and further improve the
combined benefits of our techniques. This would be especially useful given that we
do not scale voltage for memories, potentially making them bottlenecks that prevent
larger system-wide savings.
Chapter 9
Conclusions
Our work shows that we can reduce voltage and save energy without compromising
reliability if we augment our computations with lightweight checks (LWCs). For a
22 nm low-power CMOS process, we can reduce energy by 60% simply by scaling
voltage to the minimum-energy point, driven by the exponential increase in leakage
at lower voltages. To compensate for the resulting loss in reliability, we augmented our
computations with low-cost LWCs, which completely recovered at least the original,
nominal reliability, and reduced the total energy savings compared to nominal Vdd to
about 50%, which is not far from the 60% obtained without LWCs. We saw that our
results are robust to many assumptions, and that they are dictated by the minimum-
energy point, rather than the reliability impact of reducing Vdd, at least if the LWC
can catch any error resulting from a single upset.
To demonstrate this, we relied both on analytical models to understand the different
phenomena at play, and on experimental explorations to collect empirical data and
ground our ideas in practice. We developed a large tool flow, and extensive models
for architecture, energy, delay, area and reliability. We performed studies based on
common benchmark sets (VTR 7.0, Toronto20). We also added our own applications
for which we can tune the size and parallelism level, and we augmented them with
LWCs. We showed a case study example of a large, real-time system for wide-area
motion imaging. We saw that we can often reduce the overhead of LWCs by exploiting
the context in which the computations are carried out.
The extent to which we are able to save energy depends on how cheap the LWCs
are compared to the base computations. This raises the question of when we can
expect a computation to have an LWC. We developed a classification system to help
answer this question, covering some of the most common computational kernels and
their complexities, including areas such as signal processing and scientific computing.
To ensure that we reported gains on designs that were already optimized for energy,
we also explored many common low-energy circuit techniques to identify the
proper baseline: a single-Vdd FPGA architecture with gate boosting and power gating.
We did not scale voltage for memories, so lower voltages helped us reduce logic
and interconnect energy, but not memory energy. At lower voltages, memories become
dominant, and we get diminishing returns from continuing to scale voltage.
To ensure that memories do not become a bottleneck, we also designed a memory
architecture that is robust to mismatches between application and architecture. We
optimized communication energy, including memory accesses and data movement on
interconnect, specifically by tuning application parallelism to reduce the memory energy
cost per PE. This further allowed larger savings from voltage scaling because
we did not scale voltage on memories.
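The diminishing returns from unscaled memories can be seen in a toy model, sketched below. It assumes that logic and interconnect energy scales as Vdd^2 (dynamic energy only, ignoring leakage and therefore the minimum-energy point) while memory energy stays fixed. The function name, the nominal voltage, and the memory fractions are hypothetical, not values from this work.

```python
def total_energy_ratio(vdd, vdd_nominal=1.0, mem_fraction=0.3):
    """Total energy at vdd relative to nominal, with memory voltage held fixed.

    Assumes dynamic energy proportional to Vdd^2 for logic and interconnect;
    leakage is ignored, so this toy model has no minimum-energy point.
    """
    logic_fraction = 1.0 - mem_fraction
    scaled_logic = logic_fraction * (vdd / vdd_nominal) ** 2
    return scaled_logic + mem_fraction


# Compare a memory-heavy design with one whose memory share has been reduced.
for mem_fraction in (0.3, 0.1):
    ratios = [total_energy_ratio(v, mem_fraction=mem_fraction) for v in (1.0, 0.7, 0.5)]
    print(mem_fraction, [round(r, 3) for r in ratios])
```

In this model the total can never drop below the memory fraction, which is why lowering the memory energy per PE increases what voltage scaling can recover.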
In conclusion, we end up spending less energy by allowing errors to occur and
correcting them than by demanding that no error ever be made in the first place.
To err is human.
Appendices
Appendix A
Detailed Results for Chapter 3
The following figures are used in Chapter 3. We show a digested version with the
minimums in Fig. 3.4.
[Figure: (a) Energy Ratio and (b) Delay Ratio versus Vdd (V) from 0.4 to 1.2, with panels for 90nm bulk and 65nm bulk. Energy Ratio axis: 0 to 1.25; Delay Ratio axis: 0.75 to 1.5. Legend: Base; Trans Gate; Gate Boost; Power Gate; Trans Gate + Power Gate; Gate Boost + Power Gate; Dual-Vdd TG CP x1.0; Dual-Vdd TG CP x1.25; Dual-Vdd GB CP x1.0; Dual-Vdd GB CP x1.25.]
Figure A.1: Effect of low-energy techniques on energy and delay (90 nm and 65 nm)
[Figure: (a) Energy Ratio and (b) Delay Ratio versus Vdd (V) from 0.4 to 1.2, with panels for 45nm HP and 45nm LP. Energy Ratio axis: 0 to 1.25; Delay Ratio axis: 0.75 to 1.5. Legend: Base; Trans Gate; Gate Boost; Power Gate; Trans Gate + Power Gate; Gate Boost + Power Gate; Dual-Vdd TG CP x1.0; Dual-Vdd TG CP x1.25; Dual-Vdd GB CP x1.0; Dual-Vdd GB CP x1.25.]
Figure A.2: Effect of low-energy techniques on energy and delay (45 nm)
[Figure: (a) Energy Ratio and (b) Delay Ratio versus Vdd (V) from 0.4 to 1.2, with panels for 22nm HP and 22nm LP. Energy Ratio axis: 0 to 1.25; Delay Ratio axis: 0.75 to 1.5. Legend: Base; Trans Gate; Gate Boost; Power Gate; Trans Gate + Power Gate; Gate Boost + Power Gate; Dual-Vdd TG CP x1.0; Dual-Vdd TG CP x1.25; Dual-Vdd GB CP x1.0; Dual-Vdd GB CP x1.25.]
Figure A.3: Effect of low-energy techniques on energy and delay (22 nm)
[Figure: (a) Energy Ratio and (b) Delay Ratio versus Vdd (V) from 0.4 to 1.2, with panels for 14nm HP and 14nm LSTP. Energy Ratio axis: 0 to 1.25; Delay Ratio axis: 0.75 to 1.5. Legend: Base; Trans Gate; Gate Boost; Power Gate; Trans Gate + Power Gate; Gate Boost + Power Gate; Dual-Vdd TG CP x1.0; Dual-Vdd TG CP x1.25; Dual-Vdd GB CP x1.0; Dual-Vdd GB CP x1.25.]
Figure A.4: Effect of low-energy techniques on energy and delay (14 nm)
[Figure: (a) Energy Ratio and (b) Delay Ratio versus Vdd (V) from 0.4 to 1.2, with panels for 7nm HP and 7nm LSTP. Energy Ratio axis: 0 to 1.25; Delay Ratio axis: 0.75 to 1.5. Legend: Base; Trans Gate; Gate Boost; Power Gate; Trans Gate + Power Gate; Gate Boost + Power Gate; Dual-Vdd TG CP x1.0; Dual-Vdd TG CP x1.25; Dual-Vdd GB CP x1.0; Dual-Vdd GB CP x1.25.]
Figure A.5: Effect of low-energy techniques on energy and delay (7 nm)
Bibliography
[1] Specification for the Advanced Encryption Standard (AES). Federal In-
formation Processing Standards Publication 197, http://csrc.nist.gov/
publications/fips/fips197/fips-197.pdf, 2001. 5.10
[2] International Technology Roadmap for Semiconductors. <www.itrs.net/
Links/2012ITRS/Home2012.htm> , 2012. 2.1.2
[3] A. Agarwal, B. Paul, S. Mukhopadhyay, and K. Roy. Process variation in
embedded memories: failure analysis and variation aware architecture. Solid-
State Circuits, IEEE Journal of, 40(9):1804–1814, Sept 2005. 3.6.3
[4] Altera Corporation. PowerPlay Early Power Estimator, 2013. 4.3, 4.7.4
[5] A. Amouri, S. Kiamehr, and M. Tahoori. Investigation of aging effects in dif-
ferent implementations and structures of programmable routing resources of
FPGAs. In Field-Programmable Technology (FPT), 2012 International Confer-
ence on, pages 215–219, Dec 2012. 2.2.2
[6] G.-H. Asadi and M. B. Tahoori. Soft Error Mitigation for SRAM-based FPGAs.
In Proceedings of the VLSI Test Symposium, pages 207–212, 2005. 2.4.3
[7] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer,
D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick.
The Landscape of Parallel Computing Research: A View from Berkeley. Techni-
179
cal Report UCB/EECS-2006-183, EECS Department, University of California,
Berkeley, Dec 2006. 5.5
[8] T. Austin. DIVA: a reliable substrate for deep submicron microarchitecture
design. In Microarchitecture, 1999. MICRO-32. Proceedings. 32nd Annual In-
ternational Symposium on, pages 196–207, 1999. 5.5
[9] T. Austin, D. Blaauw, T. Mudge, and K. Flautner. Making Typical Silicon
Matter with Razor. IEEE Computer, 37(3):57–65, March 2004. 2.3, 8.3.2
[10] S. Baker and I. Matthews. Lucas-Kanade 20 Years On: A Unifying Framework.
International Journal of Computer Vision, 56(3):221–255, Febuary 2004. 5.8,
7.4.1
[11] V. Betz and J. Rose. FPGA Routing Architecture: Segmentation and Buffer-
ing to Optimize Speed and Density. In Proceedings of the 1999 ACM/SIGDA
Seventh International Symposium on Field Programmable Gate Arrays, FPGA
’99, pages 59–68, New York, NY, USA, 1999. ACM. 2.1.1
[12] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron
FPGAs. Kluwer Academic Publishers, Norwell, MA, 02061 USA, 1999. 2.1.1
[13] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron
FPGAs. Kluwer Academic Publishers, Norwell, Massachusetts, 02061 USA,
1999. 2.2.2
[14] S. Bhatt and F. T. Leighton. A Framework for Solving VLSI Graph Layout
Problems. Journal of Computer System Sciences, 28:300–343, 1984. 4.2.3
[15] Bluespec, Inc. Bluespec SystemVerilog 2012.01.A, 2012. 2.1.2, 4.5.3
[16] M. Blum and S. Kannan. Designing Programs that Check Their Work. Journal
of the ACM, 42(1):269–291, January 1995. 5.5
180
[17] M. Bohr. A 30 Year Retrospective on Dennard’s MOSFET Scaling Paper.
Solid-State Circuits Society Newsletter, IEEE, 12(1):11–13, Winter 2007. 2.2.1
[18] D. Bol, R. Ambroise, D. Flandre, and J.-D. Legat. Interests and Limitations of
Technology Scaling for Subthreshold Logic. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 17(10):1508–1519, 2009. 8.3.2
[19] A. Brant, A. Abdelhadi, D. Sim, S. L. Tang, M. Yue, and G. Lemieux. Safe
Overclocking of Tightly Coupled CGRAs and Processor Arrays using Razor. In
Proceedings of the IEEE Symposium on Field-Programmable Custom Computing
Machines, pages 37–44, 2013. 2.3
[20] M. Butts, A. Jones, and P. Wasson. A Structural Object Programming Model,
Architecture, Chip and Tools for Reconfigurable Computing. In Proceedings
of the IEEE Symposium on Field-Programmable Custom Computing Machines,
pages 55–64, 2007. 5.2
[21] B. Calhoun, F. Honore, and A. Chandrakasan. Design methodology for fine-
grained leakage control in MTCMOS. In Low Power Electronics and Design,
2003. ISLPED ’03. Proceedings of the 2003 International Symposium on, pages
104–109, Aug 2003. 2.2.3
[22] B. Calhoun, S. Khanna, R. Mann, and J. Wang. Sub-threshold circuit design
with shrinking CMOS devices. In Circuits and Systems, 2009. ISCAS 2009.
IEEE International Symposium on, pages 2541–2544, May 2009. 3.6.3
[23] Y. Cao. Predictive Technology Model. <http://ptm.asu.edu> , 2014. 2.1.2
[24] C. Carmichael. Triple Module Redundancy Design Techniques for Virtex FP-
GAs. Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124, 2006. XAPP 197
<http://www.xilinx.com/bvdocs/appnotes/xapp197.pdf> . 1.6, 2.4.2
181
[25] C. Carmichael and C.-W. Tseng. Correcting Single-Event Upsets in Virtex-4
Configuration Memory. Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124,
2009. XAPP 1008. 1
[26] V. Chandra and V. R. Aitken. Impact of Technology and Voltage Scaling on
the Soft Error Susceptibility in Nanoscale CMOS. In Proceedings of the IEEE
International Symposium on Defect and Fault Tolerance in VLSI Systems, DFT
’08, pages 114–122, 2008. 2.4.1
[27] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen. Low-Power CMOS Digital
Design. IEEE Journal of Solid-State Circuits, 27(4):473–484, 1992. 4.2.3
[28] C. Chiasson and V. Betz. Should FPGAs abandon the pass-gate? In Field Pro-
grammable Logic and Applications (FPL), 2013 23rd International Conference
on, pages 1–8, Sept 2013. 2.2.2, 3.4.2
[29] S. Chin, C. Lee, and S. J. Wilton. Power Implications of Implementing Logic Us-
ing FPGA Embedded Memory Arrays. In Proceedings of the International Con-
ference on Field-Programmable Logic and Applications, pages 1–8, Aug 2006.
2.1.5
[30] C. T. Chow, L. S. M. Tsui, P. H. W. Leong, W. Luk, and S. J. E. Wilton.
Dynamic voltage scaling for commercial FPGAs. In Proceedings of the Inter-
national Conference on Field-Programmable Technology, pages 173–180, 2005.
8.3.2
[31] J. Conway and R. Guy. The Book of Numbers. Copernicus Series. Springer,
1996. 5.6.6
[32] A. DeHon. Balancing Interconnect and Computation in a Reconfigurable Com-
puting Array (or, Why You Don’T Really Want 100In Proceedings of the 1999
ACM/SIGDA Seventh International Symposium on Field Programmable Gate
Arrays, FPGA ’99, pages 69–78, New York, NY, USA, 1999. ACM. 3.5
182
[33] A. DeHon. Fundamental Underpinnings of Reconfigurable Computing Archi-
tectures. Proceedings of the IEEE, 103(3):355–378, March 2015. 4.2
[34] M. deLorimier, N. Kapre, N. Mehta, and A. DeHon. Spatial Hardware Imple-
mentation for Sparse Graph Algorithms in GraphStep. ACM Transactions on
Autonomous and Adaptive Systems (TAAS), 6(3), September 2011. 4.2
[35] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and
A. R. LeBlanc. Design of Ion-Implanted MOSFET’s with Very Small Physical
Dimensions. IEEE Journal of Solid-State Circuits, 9(5):256–268, October 1974.
2.2.1
[36] W. E. Donath. Placement and Average Interconnection Lengths of Computer
Logic. IEEE Transactions on Circuits and Systems, 26(4):272–277, April 1979.
4.2.3
[37] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger.
Dark silicon and the end of multicore scaling. In Proceedings of the International
Symposium on Computer Architecture, pages 365–376, 2011. 1.2, 2.2.1, 6.3.2.2
[38] H. Feistel. Cryptography and Computer Privacy. 228(5):15–23, May 1973. 5.10
[39] F. Firouzi, M. E. Salehi, F. Wang, and S. M. Fakhraie. An accurate model for
soft error rate estimation considering dynamic voltage and frequency scaling
effects. Microelectronics Reliability, 51(2):460–467, 2011. 6.2.4, 6.2.4
[40] R. Freivalds. Fast probabilistic algorithms. In J. Bev, editor, Mathematical
Foundations of Computer Science 1979, volume 74 of Lecture Notes in Com-
puter Science, pages 57–69. Springer Berlin Heidelberg, 1979. 5.7.1
[41] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M. Irwin, and T. Tuan.
A Dual-Vdd Low Power FPGA Architecture. In Proceedings of the Interna-
183
tional Conference on Field-Programmable Logic and Applications, pages 145–
157. Springer, 2004. 2.2.4.1, 3.6.1, 3.6.2, 3.6.3
[42] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan.
Reducing Leakage Energy in FPGAs Using Region-Constrained Placement. In
in Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, pages 51–58, 2004.
2.2.3, 3.5.1
[43] M. Genovese and E. Napoli. ASIC and FPGA Implementation of the Gaus-
sian Mixture Model Algorithm for Real-Time Segmentation of High Definition
Video. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
22(3):537–547, March 2014. 4.5.3, 5.9, 7.5.2
[44] J. Goeders and S. J. Wilton. VersaPower: Power estimation for diverse
FPGA architectures. In Proceedings of the International Conference on Field-
Programmable Technology, pages 229–234, 2012. 1.8, 2.1.2, 2.1.4
[45] B. Gojman, S. Nalmela, N. Mehta, N. Howarth, and A. Dehon. GROK-LAB:
Generating Real On-chip Knowledge for Intra-cluster Delays Using Timing Ex-
traction. ACM Trans. Reconfigurable Technol. Syst., 7(4):32:1–32:23, Dec. 2014.
8.3.2
[46] R. Hamming. Error Detecting and Error Correcting Codes. Bell System Tech-
incal Journal, 29:147–160, 1950. 5.6.7
[47] S. Hanson, B. Zhai, K. Bernstein, D. T. Blaauw, A. Bryant, L. Chang, K. K.
Das, W. Haensch, E. J. Nowak, and D. Sylvester. Ultralow-voltage, minimum-
energy CMOS. IBM Journal of Research and Development, 50(4–5):469–490,
2006. 3.3
[48] P. Hazucha and C. Svensson. Impact of CMOS technology scaling on the at-
mospheric neutron soft error rate. Nuclear Science, IEEE Transactions on,
47(6):2586–2594, Dec 2000. 6.2.4
184
[49] T. L. Heath and Euclid. The Thirteen Books of Euclid’s Elements, Books 1 and
2. Dover Publications, Incorporated, 1956. 4.3
[50] T. Heijmen, P. Roche, G. Gasiot, K. Forbes, and D. Giot. A Comprehensive
Study on the Soft-Error Rate of Flip-Flops From 90-nm Production Libraries.
IEEE Transactions on Device and Materials Reliability, 7(1):84–96, 2007. 6.2.4
[51] D. Hisamoto, W.-C. Lee, J. Kedzierski, H. Takeuchi, K. Asano, C. Kuo,
E. Anderson, T.-J. King, J. Bokor, and C. Hu. FinFET-a self-aligned double-
gate MOSFET scalable to 20 nm. IEEE Transactions on Electron Devices,
47(12):2320–2325, Dec 2000. 2.2.1
[52] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein.
Scaling, power, and the future of CMOS. In Technical Digest of the IEEE
International Electron Device Meeting, pages 7–15, December 2005. 1.2, 2.2.1
[53] K.-H. Huang and J. Abraham. Algorithm-Based Fault Tolerance for Matrix
Operations. IEEE Transactions on Computers, C-33(6):518–528, 1984. 5.6,
5.6.2
[54] P. Jamieson, K. B. Kent, F. Gharibian, and L. Shannon. Odin II - An Open-
Source Verilog HDL Synthesis Tool for CAD Research. In Proceedings of the
IEEE Symposium on Field-Programmable Custom Computing Machines, pages
149–156, 2010. 2.1.2
[55] J. Johnson, W. Howes, M. Wirthlin, D. McMurtrey, M. Caffrey, P. Graham,
and K. Morgan. Using duplication with compare for on-line error detection in
FPGA-based designs. In Proceedings of the IEEE Aerospace Conference, pages
1–11, 2008. 1.6, 2.4.2
[56] E. Kadric. Power Optimization (P-opt) code and architecture files. http:
//ic.ese.upenn.edu/distributions/meme_fpga2015/, 2015. 4.5.1
185
[57] E. Kadric, D. Lakata, and A. DeHon. Impact of Memory Architecture on
FPGA Energy Consumption. In Proceedings of the International Symposium
on Field-Programmable Gate Arrays, pages 146–155, 2015. 1.9, 2.1.5, 4.3, 4.7.2
[58] E. Kadric, D. Lakata, and A. Dehon. Impact of Parallelism and Memory Archi-
tecture on FPGA Communication Energy. ACM Trans. Reconfigurable Technol.
Syst., 2016. 1.9, 2.1.5
[59] E. Kadric, K. Mahajan, and A. DeHon. Energy Reduction through Differential
Reliability and Lightweight Checking. In Proceedings of the IEEE Symposium
on Field-Programmable Custom Computing Machines, 2014. 1.9, 2.2.4.2, 2.3
[60] E. Kadric, K. Mahajan, and A. DeHon. Kung Fu Data Energy—Minimizing
Communication Energy in FPGA Computations. In Proceedings of the IEEE
Symposium on Field-Programmable Custom Computing Machines, 2014. 1.9,
2.1.5, 4.6.3
[61] N. Kapre, N. Mehta, M. deLorimier, R. Rubin, H. Barnor, M. Wilson,
M. Wrighton, and A. DeHon. Packet Switched vs. Time Multiplexed FPGA
Overlay Networks. In Field-Programmable Custom Computing Machines, 2006.
FCCM ’06. 14th Annual IEEE Symposium on, pages 205–216, April 2006.
7.4.2.2
[62] S. Kiamehr, A. Amouri, and M. Tahoori. Investigation of NBTI and PBTI
induced aging in different LUT implementations. In Field-Programmable Tech-
nology (FPT), 2011 International Conference on, pages 1–8, Dec 2011. 2.2.2
[63] J. Kim and L. Kish. Error Rate In Current-Controlled Logic Processors With
Shot Noise. Fluctuation and Noise Letters, 4(1):83–86, 2004. 2.4.1
[64] D. Koch and J. Torresen. FPGASort: A High Performance Sorting Architecture
Exploiting Run-time Reconfiguration on FPGAs for Large Problem Sorting.
186
In Proceedings of the International Symposium on Field-Programmable Gate
Arrays, pages 45–54, 2011. 4.5.3, 5.6.1
[65] I. Kuon and J. Rose. Measuring the Gap Between FPGAs and ASICs. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
26(2):203–215, February 2007. 2.1.1, 2.2.2, 2.4.4, 4.3
[66] V. Lakamraju and R. Tessier. Tolerating Operational Faults in Cluster-based
FPGAs. In Proceedings of the International Symposium on Field-Programmable
Gate Arrays, pages 187–194, 2000. 2.4.3, 2.4.5
[67] J. Lamoureux and S. J. E. Wilton. Activity Estimation for Field-Programmable
Gate Arrays. In Proceedings of the International Conference on Field-
Programmable Logic and Applications, pages 1–8, 2006. 2.1.2
[68] B. S. Landman and R. L. Russo. On Pin Versus Block Relationship for Parti-
tions of Logic Circuits. IEEE Transactions on Computers, 20:1469–1479, 1971.
4.2.3
[69] B. Leininger, J. Edwards, J. Antoniades, D. Chester, D. Haas, E. Liu,
M. Stevens, C. Gershfield, M. Braun, J. D. Targove, S. Wein, P. Brewer,
D. G. Madden, and K. H. Shafique. Autonomous real-time ground ubiqui-
tous surveillance-imaging system (ARGUS-IS). volume 6981, pages 69810H–
69810H–11, 2008. 7.2
[70] G. Lemieux, E. Lee, M. Tom, and A. Yu. Directional and Single-Driver Wires in
FPGA Interconnect. In Proceedings of the International Conference on Field-
Programmable Technology, pages 41–48, 2004. 2.1.1, 3.6.2
[71] G. Lemieux and D. Lewis. Design of Interconnection Networks for Pro-
grammable Logic. Kluwer Academic Publishers, Norwell, MA, USA, 2004. 2.1.1
187
[72] J. M. Levine, E. Stott, and P. Y. Cheung. Dynamic Voltage &#38; Fre-
quency Scaling with Online Slack Measurement. In Proceedings of the 2014
ACM/SIGDA International Symposium on Field-programmable Gate Arrays,
FPGA ’14, pages 65–74, New York, NY, USA, 2014. ACM. 8.3.2
[73] D. Lewis, E. Ahmed, D. Cashman, T. Vanderhoek, C. Lane, A. Lee, and P. Pan.
Architectural enhancements in Stratix-III and Stratix-IV. In Proceedings of the
International Symposium on Field-Programmable Gate Arrays, pages 33–42,
2009. 4.4.1
[74] D. Lewis, V. Betz, D. Jefferson, A. Lee, C. Lane, P. Leventis, S. Marquardt,
C. McClintock, B. Pedersen, G. Powell, S. Reddy, C. Wysocki, R. Cliff, and
J. Rose. The Stratix Routing and Logic Architecture. In Proceedings of the
International Symposium on Field-Programmable Gate Arrays, pages 12–20,
2003. 2.1.1, 3.6.2
[75] D. Lewis, D. Cashman, M. Chan, J. Chromczak, G. Lai, A. Lee, T. Vanderhoek,
and H. Yu. Architectural Enhancements in Stratix V. In Proceedings of the
International Symposium on Field-Programmable Gate Arrays, pages 147–156,
2013. 2.1.5
[76] F. Li, Y. Lin, and L. He. Field Programmability of Supply Voltages for FPGA
Power Reduction. Computer-Aided Design of Integrated Circuits and Systems,
IEEE Transactions on, 26(4):752–764, April 2007. 2.2.4.1, 3.6.1, 3.6.2, 3.6.3,
3.6.4, 3.6.5
[77] F. Li, Y. Lin, L. He, and J. Cong. Low-power FPGA using pre-defined dual-
Vdd/dual-Vt fabrics. In Proceedings of the International Symposium on Field-
Programmable Gate Arrays, pages 42–50, 2004. 2.2.3, 3.5.1
[78] F. Li, Y. Lin, L. He, and J. Cong. Low-power FPGA Using Pre-defined dual-
Vdd/dual-Vt Fabrics. In Proceedings of the 2004 ACM/SIGDA 12th Interna-
188
tional Symposium on Field Programmable Gate Arrays, FPGA ’04, pages 42–50,
New York, NY, USA, 2004. ACM. 2.2.4.1
[79] B. D. Lucas and T. Kanade. An Iterative Image Registration Technique with
an Application to Stereo Vision. pages 674–679, 1981. 7.4.1
[80] J. Luu, J. H. Anderson, and J. S. Rose. Architecture Description and Packing
for Logic Blocks with Hierarchy, Modes and Complex Interconnect. In Pro-
ceedings of the International Symposium on Field-Programmable Gate Arrays,
pages 227–236, 2011. 2.1.5, 4.5.1
[81] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk,
M. Nasr, S. Wang, T. Liu, N. Ahmed, K. B. Kent, J. Anderson, J. Rose, and
V. Betz. VTR 7.0: Next Generation Architecture and CAD System for FPGAs.
ACM Transactions on Reconfigurable Technology and Systems, 7(2):6:1–6:30,
July 2014. 2.1.1, 4.3, 4.4.1, 4.5.3
[82] J. Luun, J. H. Anderson, and J. S. Rose. Architecture description and packing
for logic blocks with hierarchy, modes and complex interconnect. In Proceed-
ings of the International Symposium on Field-Programmable Gate Arrays, pages
227–236, 2011. 1.8, 2.1.2
[83] N. Mehta. An ultra-low energy, variation tolerant FPGA architec-
ture using component-specific mapping. http://resolver.caltech.edu/
CaltechTHESIS:10072012-230900231, 2013. 3.5
[84] N. Mehta. The MD5 Message-Digest Algorithm. https://www.ietf.org/rfc/
rfc1321.txt, April 1992. 5.6.7, 5.10
[85] N. Mehta, R. Rubin, and A. DeHon. Limit Study of Energy & Delay Benefits
of Component-Specific Routing. In Proceedings of the International Symposium
on Field-Programmable Gate Arrays, pages 97–106, 2012. 2.4.3, 2.4.4, 8.3.2
189
[86] A. Meixner, M. E. Bauer, and D. J. Sorin. Argus: Low-Cost, Comprehensive
Error Detection in Simple Cores. IEEE Micro, 28:52–59, January/February
2008. 5.5, 5.6.6
[87] A. Mishchenko, S. Chatterjee, and R. K. Brayton. Improvements to Technol-
ogy Mapping for LUT-Based FPGAs. IEEE Transactions on Computed-Aided
Design for Integrated Circuits and Systems, 26(2):240–253, February 2007. 2.1.2
[88] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. CACTI 6.0: A Tool
to Model Large Caches. HPL 2009-85, HP Labs, Palo Alto, CA, April 2009.
Latest code release for CACTI 6 is 6.5. 2.1.2, 4.4.2
[89] B. Narasimham, M. Gadlage, B. Bhuva, R. Schrimpf, L. Massengill, W. Holman,
A. Witulski, R. Reed, R. Weller, and X. Zhu. Characterization of Neutron-
and Alpha-Particle-Induced Transients Leading to Soft Errors in 90-nm CMOS
Technology. IEEE Transactions on Device and Materials Reliability, 9(2):325–
333, 2009. 6.2.4
[90] S. Natarajan, M. Agostinelli, S. Akbar, M. Bost, A. Bowonder, V. Chikarmane,
S. Chouksey, A. Dasgupta, K. Fischer, Q. Fu, T. Ghani, M. Giles, S. Govin-
daraju, R. Grover, W. Han, D. Hanken, E. Haralson, M. Haran, M. Heckscher,
R. Heussner, P. Jain, R. James, R. Jhaveri, I. Jin, H. Kam, E. Karl, C. Kenyon,
M. Liu, Y. Luo, R. Mehandru, S. Morarka, L. Neiberg, P. Packan, A. Paliwal,
C. Parker, P. Patel, R. Patel, C. Pelto, L. Pipes, P. Plekhanov, M. Prince,
S. Rajamani, J. Sandford, B. Sell, S. Sivakumar, P. Smith, B. Song, K. Tone,
T. Troeger, J. Wiedemer, M. Yang, and K. Zhang. A 14nm logic technology
featuring 2nd-generation FinFET, air-gapped interconnects, self-aligned dou-
ble patterning and a 0.0588um2 SRAM cell size. In Electron Devices Meeting
(IEDM), 2014 IEEE International, pages 3.7.1–3.7.3, Dec 2014. 2.2.2
190
[91] B. Nikolic. Design in the Power-Limited Scaling Regime. IEEE Transactions
on Electron Devices, 55(1):71–83, January 2008. 1.2, 2.2.1
[92] K. K. W. Poon, S. J. E. Wilton, and A. Yan. A detailed power model for
field-programmable gate arrays. ACM Transactions on Design Automation of
Electronic Systems, 10:279–302, 2005. 2.1.4, 8.4.1
[93] P. Prata and J. Silva. Algorithm based fault tolerance versus result-checking
for matrix computations. In Fault-Tolerant Computing, 1999. Digest of Papers.
Twenty-Ninth Annual International Symposium on, pages 4–11, June 1999.
5.7.1, 5.7.2
[94] B. Pratt, M. Caffrey, P. Graham, K. Morgan, and M. Wirthlin. Improving
FPGA Design Robustness with Partial TMR. In Proceedings of the IEEE In-
ternational Reliability Physics Symposium, pages 226–232, 2006. 2.4.2
[95] R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and
S. Kulkarni. Pushing ASIC performance in a power envelope. In Design Au-
tomation Conference, 2003. Proceedings, pages 788–793, June 2003. 3.6.3
[96] A. Rahman, S. Das, T. Tuan, and S. Trimberger. Determination of Power
Gating Granularity for FPGA Fabric. In Proceedings of the IEEE Custom
Integrated Circuits Conference, pages 9–12, Sept 2006. 2.2.3
[97] N. Rollins, M. Wirthlin, P. Graham, and M. Caffrey. Evaluating TMR Tech-
niques in the Presence of Single Event Upsets. In Proceedings of the Inter-
national Conference on Military and Aerospace Programmable Logic Devices,
2003. 1.6, 2.4.2
[98] J. Rose, J. Luu, C. W. Yu, O. Densmore, J. Goeders, A. Somerville, K. B. Kent,
P. Jamieson, and J. Anderson. The VTR project: architecture and CAD for
FPGAs from verilog to routing. In Proceedings of the International Symposium
191
on Field-Programmable Gate Arrays, pages 77–86, New York, NY, USA, 2012.
ACM. 4.5.1
[99] R. Rubin and A. Dehon. Choose-your-own-adventure Routing: Lightweight
Load-time Defect Avoidance. ACM Trans. Reconfigurable Technol. Syst.,
4(4):33:1–33:24, Dec. 2011. 8.3.2
[100] R. Rubin and A. DeHon. Choose-Your-Own-Adventure Routing: Lightweight
Load-Time Defect Avoidance. Transactions on Reconfigurable Technology and
Systems, 4(4), December 2011. 2.4.3, 2.4.5
[101] R. A. Rubinfeld. A Mathematical Theory of Self-checking, Self-testing and Self-
correcting Programs. PhD thesis, Berkeley, CA, USA, 1991. UMI Order No.
GAX91-26752. 5.5
[102] Y. Saad. Iterative Methods for Sparse, Linear Systems. SIAM, 2nd edition,
2003. 5.8
[103] D. Schellekens, B. Preneel, and I. Verbauwhede. FPGA Vendor Agnostic True
Random Number Generator. In Field Programmable Logic and Applications,
2006. FPL ’06. International Conference on, pages 1–6, Aug 2006. 5.7.1
[104] F. Sellers, M. Xiao, and L. Bearnson. Error detecting logic for digital computers.
McGraw-Hill, 1968. 5.6.6
[105] J. R. Shewchuk. An Introduction to the Conjugate Gradient Method Without
the Agonizing Pain. Technical report, Pittsburgh, PA, USA, 1994. 5.8.1
[106] B. Shim and N. R. Shanbhag. Energy-Efficient Soft Error-Tolerant Digital
Signal Processing. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 14(4):336–348, April 2006. 2.3, 5.9, 5.11, 6.2.4
[107] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi. Modeling
the effect of technology trends on the soft error rate of combinational logic. In
192
Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International
Conference on, pages 389–398, 2002. 2.4.1, 6.2.4, 6.2.4
[108] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi. Modeling the
effect of technology trends on the soft error rate of combinational logic. In Pro-
ceedings of the International Conference on Dependable Systems and Networks,
pages 389–398, 2002. 6.2.4
[109] C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for
real-time tracking. In Computer Vision and Pattern Recognition, 1999. IEEE
Computer Society Conference on., volume 2, pages 246–252 Vol. 2, Los Alami-
tos, CA, USA, Aug. 1999. IEEE. 7.5.1
[110] E. A. Stott, J. S. J. Wong, P. Pete Sedcole, and P. Y. K. Cheung. Degradation
in FPGAs: measurement and modelling. In Proceedings of the International
Symposium on Field-Programmable Gate Arrays, page 229, 2010. 2.4.5
[111] R. Tessier, V. Betz, D. Neto, A. Egier, and T. Gopalsamy. Power-Efficient RAM
Mapping Algorithms for FPGA Embedded Memory Blocks. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 26(2):278–290,
Feb 2007. 4.5.1, 4.5.1
[112] C. Thompson. Area-Time Complexity for VLSI. In Proceedings of the ACM
Symposium on Theory of Computing, pages 81–88, May 1979. 4.2.3
[113] T. Tuan, A. Rahman, S. Das, S. Trimberger, and S. Kao. A 90-nm Low-Power
FPGA for Battery-Powered Applications. IEEE Transactions on Computed-
Aided Design for Integrated Circuits and Systems, 26(2):296–300, 2007. 1.5,
2.2.4.1, 3.6.3
[114] K. Usami and M. Horowitz. Clustered Voltage Scaling Technique for Low-power
Design. In Proceedings of the 1995 International Symposium on Low Power
Design, ISLPED ’95, pages 3–8, New York, NY, USA, 1995. ACM. 2.2.4.1
193
[115] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-
Martinez, S. Swanson, and M. B. Taylor. Conservation cores: reducing the
energy of mature computations. In Proceedings of the International Conference
on Architectural Support for Programming Languages and Operating Systems,
pages 205–218, 2010. 2.2.1
[116] S.-J. Wang and N. K. Jha. Algorithm-Based Fault Tolerance for FFT Networks.
IEEE Transactions on Computers, 43(7):849–854, July 1994. 5.6.3
[117] H. Wasserman and M. Blum. Software Reliability via Run-time Result-checking.
J. ACM, 44(6):826–849, Nov. 1997. 5.7.1
[118] S. J. E. Wilton. Architectures and Algorithms for Field-Programmable Gate
Arrays with Embedded Memory. Technical report, 1997. 2.1.1
[119] H. Wong, V. Betz, and J. Rose. Comparing FPGA vs. Custom CMOS and
the Impact on Processor Microarchitecture. In Proceedings of the International
Symposium on Field-Programmable Gate Arrays, pages 5–14, 2011. 4.4.1
[120] R. Zimmermann and W. Fichtner. Low-power logic styles: CMOS versus pass-
transistor logic. IEEE Journal of Solid-State Circuits, 32(7):1079–1090, 1997.
2.2.2
194
