Novel many-core architectures for energy-efficiency by Karpuzcu, Rahmet
© 2012 Rahmet Ulya Karpuzcu
NOVEL MANY-CORE ARCHITECTURES FOR ENERGY-EFFICIENCY
BY
RAHMET ULYA KARPUZCU
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2012
Urbana, Illinois
Doctoral Committee:
Professor Josep Torrellas, Chair
Professor Wen-Mei W. Hwu
Professor Sanjay Jeram Patel
Professor Naresh R. Shanbhag
Assistant Professor Nam Sung Kim, University of Wisconsin, Madison
Chris Wilkerson, Intel Corporation
ABSTRACT
Ideal CMOS device scaling relies on scaling voltages down with lithographic dimensions at every
technology generation. This gives rise to faster circuits due to higher frequency and smaller silicon
area for the same functionality. The dynamic power density – equivalently, dynamic power, if the
chip area is fixed – stays constant. Static power density, on the other hand, increases. In early
generations, however, since the share of static power was practically negligible, dynamic power
density staying constant translated to total power density staying constant.
This picture has changed recently. To keep the growth of the static power under control, the
decrease in the threshold voltage has practically stopped. This, in turn, has prevented the supply
voltage from scaling. The end effect is an increasing power density over generations, giving rise
to the power wall: Processor chips can include more cores and accelerators than can be active at
any given time – and the situation is getting worse. This effect, known as the utilization wall or
dark silicon, is induced by the power wall and presents a fundamental challenge that is
transforming the many-core architecture landscape.
This dissertation attempts to address the key implication of the power wall problem, dark sil-
icon, in two novel and promising ways: By (1) trading off the processor service life for power
and performance – the BubbleWrap many-core, and (2) exploring near-threshold voltage operation
from an architectural perspective – the Polyomino many-core.
The BubbleWrap many-core assumes as many cores on chip as CMOS transistor density scaling
trends suggest, and exploits the resulting implicit redundancy – as not all of the cores can be pow-
ered on simultaneously – to extract maximum performance by trading off power and service life
on a per-core basis. To achieve this, BubbleWrap continuously tunes the supply voltage within the
course of each core’s service life, leveraging any aging-induced guard-band instantaneously left,
rendering one of the following regimes of operation: Minimize power at the same performance
level and processor service life; attain the highest performance for the same service life while
respecting the given power budget; or attain even higher performance for a shorter service life
while respecting the given power budget. Effectively, BubbleWrap runs each core at a closer-to-
optimal operating point by always aggressively using up all the aging-induced guard-band that
the designers have included – preventing any waste of it.
Another way to dim dark silicon is reducing the supply voltage to a value only slightly higher
than the threshold voltage. This regime is called near-threshold voltage (NTV) computing (NTC),
as opposed to conventional super-threshold voltage (STV) computing (STC). A major drawback
of NTC is the higher susceptibility to parametric variations, namely the deviation of device pa-
rameters from their nominal values. To address parametric variations in present and future NTV
designs, this dissertation builds on an existing model of variations at STV and develops the first ar-
chitectural model of process variations at NTV. Further, using the model, this dissertation demon-
strates that facilitating multiple on-chip voltage domains to handle parametric variations will
not be cost effective in near-future NTV designs. With this insight, this dissertation introduces
Polyomino, a simple many-core architecture which can effectively cope with variations at NTV.
Polyomino eschews multiple voltage domains and relies on fine-grain frequency domains to op-
timize execution under variations. Thanks to Polyomino’s simplicity, a variation-aware scheduler
can effectively assign clusters of cores to jobs.
To my parents, Mehmet Karpuzcu and Fatma Karpuzcu
ACKNOWLEDGMENTS
I know one thing, that I know nothing.
I grew up in an extended family full of academicians mainly from İstanbul Technical University
(İTÜ), including my father, now a retired professor of environmental engineering. This is
how I dared to attend graduate school. I could not even come close to see the finish line in a
consistent state without the tireless support from my parents, Prof. Dr. Mehmet Karpuzcu and
Fatma Karpuzcu. My brothers, Mahmut Ekrem Karpuzcu and Mübin Karpuzcu, too, never got
exhausted in parenting their annoying elder sister.
I have been processed extensively along a dense academic pipeline so far, and I fear that I will
vanish if this comes to stall one day. My first stop was Austrian School, İstanbul, where I spent
eight long years to incarnate exactness. For discerning a fine appreciation of mathematics, I will
remain indebted to my teachers OSR Friederike D’Isidoro, OStR Mag. Paul Steiner and OStR Mag.
Gerhard Ender.
The next stop was, upon the recommendation of Prof. Dr. Alinur Büyükaksoy and my father,
İTÜ. There I was lucky to attend lectures of Prof. Dr. Fuat Anday, who catalyzed my interest in
circuits and systems. I would like to extend my gratitude to my undergraduate advisers Prof. Dr.
Ali Zeki, who acquainted me with VLSI design; Assoc. Prof. Dr. Şima Etaner Uyar, who introduced
me to nature-inspired computing; and Prof. Dr. Ali Toker, who has since been graciously
guiding me.
Then, I proceeded to UIUC. In my first years, Prof. Sarita Adve kindly provided me with aca-
demic advice. The previous generation iacomers, Karin Strauss, Radu Teodorescu, Pablo Mon-
tesinos, and Wonsun Ahn were always there to give me a hand whenever I stumbled. Our efforts
in solving the research problem I took over from Abhishek Tiwari resulted in the first part of this
thesis. As a junior iacomer, I was fortunate to get attached to Brian Greskamp, who deserves
at least equal credit for this thesis. Along the way, my peer, Abdullah Muzahid, and the next
generation iacomers have borne with me while combating graduation induced anxieties.
My committee members, Prof. Wen-Mei Hwu, Prof. Sanjay Patel and Prof. Naresh Shanbhag
did not spare their invaluable feedback regarding this thesis. Nor did my mentors from Intel, Dr.
Alaa Alameldeen, Dr. Chris Wilkerson and Dr. Shih-Lien Lu. I owe a large credit to Prof. Nam
Sung Kim from Wisconsin for his significant contributions and continuous support.
I am grateful to my trusted Doktorvater,¹ Prof. Josep Torrellas, for academic direction, and specifically,
for his limitless patience with me, since most of the time we operated at a phase difference
of 180°. Without his encouragement, I would not be able to return to the arena upon each setback
we survived. We never came to resonance due to my glacial progress. However, I hope I will
manage to pass on the principles he taught me to my own students.
¹ German: Thesis adviser. Word-for-word translation: Doctorate father.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 MOTIVATION
1.1 The Many-Core Power Wall
1.2 Pushing Back the Many-Core Power Wall: Trading Off Processor Service Life for Power and Performance
1.3 Dimming Dark Silicon: Near-Threshold Voltage Operation
1.4 Overview
CHAPTER 2 TRADING OFF THE AGING RATE FOR POWER AND PERFORMANCE: THE BUBBLEWRAP MANY-CORE
2.1 Introduction
2.2 Background
2.2.1 Modeling Aging
2.2.2 Impact of Aging
2.2.3 Slowing Down Aging
2.3 DVSAM: Dynamic Voltage Scaling for Aging Management
2.3.1 Main Idea
2.3.2 Detailed Operation
2.4 The BubbleWrap Many-Core
2.4.1 Overview
2.4.2 BubbleWrap Environments
2.4.3 Hardware Support for BubbleWrap
2.5 Evaluation Setup
2.5.1 Power and Thermal Model
2.5.2 Process Technology
2.5.3 Workload
2.5.4 Many-Core Microarchitecture
2.6 Evaluation
2.6.1 Enhancing Throughput: DVSAM-Pow
2.6.2 Enhancing Frequency: DVSAM-Perf
2.6.3 Core Popping: DVSAM-Short & VSAM-Short
2.6.4 BubbleWrap Environments
2.6.5 Discussion
2.7 Related Work
2.8 Summary
CHAPTER 3 VARIUS-NTV: A MICROARCHITECTURAL MODEL OF PROCESS VARIATIONS FOR NEAR-THRESHOLD VOLTAGE COMPUTING
3.1 Introduction
3.2 Background
3.2.1 Process Variation Basics
3.2.2 SRAM Operation in the Presence of Variation
3.2.3 The Impact of Process Variations at NTV
3.2.4 Modeling Process Variations at STV
3.3 VARIUS-NTV: A Microarchitectural Model of Process Variations Tailored for NTC
3.3.1 Gate Delay
3.3.2 Impact of Leakage
3.3.3 SRAM Cell
3.3.4 Memory Failure Modes
3.4 Many-Core Architecture Modeled
3.5 Experimental Setup
3.6 Model Validation
3.6.1 Validation of Model Parameters
3.6.2 Comparison to Silicon Measurements
3.7 Evaluation
3.7.1 Computing the Operating Point
3.7.2 Impact of Process Variations at NTV and STV
3.7.3 Design Space Exploration
3.8 Related Work
3.9 Summary
CHAPTER 4 ESCHEWING MULTIPLE VOLTAGE DOMAINS AT NEAR-THRESHOLD VOLTAGES
4.1 Introduction
4.2 Eschewing Multiple Vdd Domains at NTV
4.2.1 Limitations of Multiple Vdd Domains at NTV
4.2.2 An Alternative Approach for Future NTC: Polyomino
4.3 The Challenge of Core Assignment
4.3.1 Rationale: Simplicity and Effectiveness
4.3.2 Core-Assignment Algorithm
4.4 The Challenge of Applying Fine-Grain DVFS
4.5 Evaluation Setup
4.6 Evaluation
4.6.1 Variation Observed in NTC Polyomino Chips
4.6.2 Effect of Not Having Multiple Vdd Domains
4.6.3 Core Assignment in Polyomino
4.6.4 Implications for Fine-Grain DVFS
4.6.5 Sensitivity Analysis
4.7 Discussion
4.8 Related Work
4.9 Summary
CHAPTER 5 CONCLUSION
5.1 Summary
5.2 Looking Forward
APPENDIX A QUANTITATIVE CHARACTERIZATION OF THE MANY-CORE POWER WALL
APPENDIX B VARIUS TIMING MODEL
REFERENCES
LIST OF TABLES
1.1 Classical scaling theory basics [1].
2.1 DVSAM modes. P denotes total power consumption, with PNOM corresponding to the power consumption of a core clocked at fNOM under nominal operating conditions.
2.2 Voltage applied to the Expendable cores (VddE) and to the Throughput cores (VddT) in each of the BubbleWrap environments.
2.3 Technology parameters.
2.4 Microarchitectural parameters.
2.5 Benefits of DVSAM-Pow and DVSAM-Perf.
3.1 Technology and architecture parameters.
3.2 Configurations for the NTC many-core.
4.1 Why multiple Vdd domains become less attractive in a future NTC environment.
4.2 Technology and architecture parameters.
4.3 Environments analyzed to assess Vdd domains at NTV.
LIST OF FIGURES
1.1 The many-core power wall.
1.2 The BubbleWrap many-core.
1.3 Processor service life as a function of supply voltage [7].
1.4 Impact of Vdd on energy efficiency and delay [9].
1.5 Parameter scaling under three scenarios [8].
2.1 Effects of aging on Vth degradation (a) and on critical path delay degradation (b)-(e).
2.2 Changes in Vdd (top row) and critical path delay (bottom row) as a function of time for the different DVSAM modes.
2.3 BubbleWrap chip (a) and operation (b).
2.4 BubbleWrap chips corresponding to different environments. In each environment, Throughput cores are in gray and Expendable cores in white. Recall that popping means applying DVSAM-Short or VSAM-Short to the Expendable cores.
2.5 BubbleWrap power (a) and clock (b) distribution.
2.6 Overview of the BubbleWrap controller.
2.7 Temporal evolution of Vdd (a), normalized critical path delay (b), power consumption (c), and temperature for a nominal core under DVSAM-Pow (d).
2.8 Temporal evolution of Vdd (a), normalized critical path delay (b), power consumption (c), and temperature for a nominal core under DVSAM-Perf (d).
2.9 Impact of core popping with DVSAM-Short.
2.10 Impact of core popping with VSAM-Short.
2.11 Frequency of the sequential section for each environment.
2.12 Speedup of BubbleWrap environments.
2.13 Power consumption of BubbleWrap environments.
3.1 Conventional 6-transistor (6T) SRAM cell architecture: VR and VL are the voltages at the nodes R and L, respectively.
3.2 Increased sensitivity of circuit timing to variation at NTV.
3.3 Impact of leakage.
3.4 8-transistor (8T) SRAM cell architecture.
3.5 Many-core architecture used to evaluate VARIUS-NTV.
3.6 Data generated by VARIUS-NTV that replicates the data presented in [47]: (a) histogram of the ratios of highest core frequency to lowest core frequency over 100 dies; (b) the frequency map for one of the sample dies.
3.7 Impact of variations at NTV and STV.
3.8 Values of VddMIN for all the clusters of a representative chip at NTV.
3.9 Performance of our 288-core chip at NTV with different cluster sizes and configurations: under 100% use (a), and under ≈ 50% use (b).
3.10 Performance of our 288-core chip at STV with different cluster sizes and configurations: under 100% use (a), and under ≈ 50% use (b).
4.1 Example Polyomino architecture (a), its operation (b), the core-assignment algorithm (c), and distance of clusters to cluster i (d).
4.2 Variation of VddMIN within a representative Polyomino chip (a), across 100 chips analyzed (b).
4.3 Kernel density of f within a representative Polyomino chip (a), across 100 chips analyzed (b).
4.4 Kernel density of PSTA within a representative Polyomino chip (a), across 100 chips analyzed (b).
4.5 Variation of f within a representative chip (a), across 100 chips analyzed (b).
4.6 Variation of PSTA within a representative chip (a), across 100 chips analyzed (b).
4.7 Normalized MIPS/W in the different environments for workloads that use all 36 clusters (a) or only 18 (b). We consider different Vdd regulator inefficiencies.
4.8 MIPS/W attained by different core-assignment algorithms if 0%, 25%, 50% of the clusters were already busy initially. For the latter two, the top (bottom) error bar depicts the MIPS/W if the least (most) energy-efficient clusters were initially busy.
4.9 Performance of fine-grain DVFS under different environments.
4.10 MIPS/W across different environments for a 288-core chip with 4, 8 (default), and 16 cores per cluster; under full utilization (a) and 50% utilization (b).
4.11 MIPS/W across different environments for 72-, 144-, and 288-core chips with 8 cores per cluster; under full utilization (a) and 50% utilization (b).
CHAPTER 1
MOTIVATION
1.1 The Many-Core Power Wall
Ideal CMOS device scaling [1] relies on scaling voltages down with lithographic dimensions at
every technology generation by a constant scaling factor κ (Table 1.1), which gives rise to faster
circuits due to higher frequency and smaller silicon area for the same functionality. Further, the
dynamic power per unit area – equivalently, dynamic power of a fixed-area chip – stays constant.
This is because the energy per switching event, C × Vdd², decreases enough to compensate for
having more devices in the same chip area and switching them faster. Since the impact of static
power remained negligible in early generations of CMOS, the table excludes leakage. However,
across technology generations, static power (and its share of total power) has been increasing.
Table 1.1: Classical scaling theory basics [1].
Transistor/circuit parameter            Scaling factor
Vdd                                     1/κ
Vth                                     1/κ
C                                       1/κ
I                                       1/κ
Area of same functionality              1/κ²
Delay, C × Vdd / I                      1/κ
(Dynamic) power dissipation, Vdd × I    1/κ²
(Dynamic) power density, Vdd × I / A    1
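The derived rows of Table 1.1 (delay, power dissipation, power density) follow from the first rows by substitution. As a worked check, with A denoting the area of the same functionality and primed symbols denoting values after one generation (the primed notation is introduced here for illustration only):

```latex
\begin{align*}
\text{Delay:} \quad
  & \frac{C'\,V_{dd}'}{I'}
    = \frac{(C/\kappa)\,(V_{dd}/\kappa)}{I/\kappa}
    = \frac{1}{\kappa}\cdot\frac{C\,V_{dd}}{I} \\
\text{Power dissipation:} \quad
  & V_{dd}'\,I'
    = \frac{V_{dd}}{\kappa}\cdot\frac{I}{\kappa}
    = \frac{1}{\kappa^{2}}\,V_{dd}\,I \\
\text{Power density:} \quad
  & \frac{V_{dd}'\,I'}{A'}
    = \frac{V_{dd}\,I/\kappa^{2}}{A/\kappa^{2}}
    = \frac{V_{dd}\,I}{A}
\end{align*}
```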
In recent generations, to keep the growth in static power under control, the decrease in the
threshold voltage, Vth, has practically stopped, which in turn has prevented the supply voltage,
Vdd, from scaling [2]. As a consequence, the compensation effect has vanished, rendering an
increasing power density over technology generations. Transistor density scaling trends, on the
other hand, have been closely following ideal scaling projections. The divergence between power
density and transistor density scaling has been getting more and more pronounced. For a
fixed-area chip, if the power budget is fixed,¹ a growing gap emerges between what can be placed on
chip and what can be powered on simultaneously.
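To make the vanished compensation concrete, the following first-order sketch computes the relative dynamic power density after one technology generation as (devices per unit area) × (energy per switching event) × (switching rate). The function name, the value κ = 1.4, and the two assumed frequency trends for the stalled-Vdd case are illustrative assumptions rather than figures from this dissertation; static power is ignored.

```python
def dynamic_power_density(kappa, vdd_scale, freq_scale):
    """Relative dynamic power density after one generation (1.0 = unchanged).

    kappa      -- dimensional scaling factor of Table 1.1
    vdd_scale  -- factor applied to Vdd (1/kappa under classical scaling)
    freq_scale -- factor applied to the switching rate (an assumption here)
    """
    capacitance = 1.0 / kappa                        # C scales as 1/kappa
    energy_per_event = capacitance * vdd_scale ** 2  # C * Vdd^2
    devices_per_area = kappa ** 2                    # area per function scales as 1/kappa^2
    return devices_per_area * energy_per_event * freq_scale


KAPPA = 1.4  # hypothetical per-generation scaling factor

# Classical scaling: Vdd shrinks by 1/kappa, circuits switch kappa times faster.
classical = dynamic_power_density(KAPPA, vdd_scale=1 / KAPPA, freq_scale=KAPPA)

# Vdd stalled, switching rate still assumed to grow by kappa.
stalled_faster = dynamic_power_density(KAPPA, vdd_scale=1.0, freq_scale=KAPPA)

# Vdd stalled, switching rate assumed flat.
stalled_same_f = dynamic_power_density(KAPPA, vdd_scale=1.0, freq_scale=1.0)

print(f"classical scaling:          {classical:.2f}x")
print(f"Vdd stalled, faster clock:  {stalled_faster:.2f}x")
print(f"Vdd stalled, same clock:    {stalled_same_f:.2f}x")
```

Under either frequency assumption, the density grows once Vdd stops scaling, which is the qualitative point of the paragraph above; the exact growth rate depends on assumptions this sketch does not settle.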
Figure 1.1 demonstrates the key implication of the power wall, dark silicon, based on projections
from the ITRS 2011 edition [3]. The x-axis corresponds to the year of introduction of a technology
generation, while the y-axis depicts the number of cores that can be placed on chip (blue curve)
vs. the number of cores that can be powered on simultaneously (red curve), all normalized to
2011 data.² Chip area is fixed along with per-core functionality (as quantified by transistor count),
but not the core area. Figure 1.1 starts at 2011 with 16 cores per chip, assuming a homogeneous
many-core similar to UltraSPARC T3 [4].
[Figure 1.1 plot: number of cores per chip (normalized to 2011; y-axis 0 to 45) vs. year (2011 to 2026). Curves: Total, Powered-on.]
Figure 1.1: The many-core power wall.
The growing gap between the two curves demonstrates the many-core power wall. This trend
remains universal, specifically in a homogeneous many-core setting. Similar characteristics would
emerge if different baselines were adopted with cores of potentially different complexity. In [5],
for example, the same analysis is conducted assuming beefy cores for an i7-like [6] system.
Due to the power wall, in future designs, the majority of cores will be dormant in order to meet
the power budget. How can this surplus of cores be exploited for more energy-efficient execution?
¹ A viable scaling scenario due to system cooling constraints and the associated cost. If the available power budget
increased at the pace the power density is increasing, there would not be dark silicon. Unfortunately, predicted
heat-sink improvements are far from facilitating such expansion in the available power budget. The problem in fact
stems from the cooling wall rather than the power wall.
² The details of this analysis are covered in Appendix A.
1.2 Pushing Back the Many-Core Power Wall: Trading Off Processor
Service Life for Power and Performance
The future provides more cores than can possibly be deployed. How can this surplus of cores be
exploited to maximize energy efficiency?
Process variation in future many-cores will render some cores more energy-efficient than others.
The most efficient cores are precious because parallel workloads can use them to run as many
threads as possible under the power budget. The less-efficient cores can be used as Bubble Wrap
to protect the precious cores when high single-thread performance is required. During sequential
phases, BubbleWrap cores will intentionally sacrifice themselves by operating at much higher than
nominal voltage and frequency to provide improved single-thread performance. The elevated
voltages and temperatures will quickly wear out or pop BubbleWrap cores. This is not a problem,
as BubbleWrap cores can be replaced from a large pool of dormant spares. Such a many-core is
expected to unlock more energy-efficient execution than a conventional many-core by delivering
higher (sequential and throughput) performance within the same power budget.
Figure 1.2 illustrates a sample BubbleWrap chip in mid-life, when some of the BubbleWrap –
Expendable – cores are already popped (black). The BubbleWrap cores are microarchitecturally
identical to the precious cores but are destined to live fast and die young. The precious – Through-
put – cores operate only at nominal voltage and are guaranteed not to burn out over the service
life of the chip.
[Figure 1.2 diagram: the BubbleWrap many-core. Labels: Sequential Accelerator, Throughput cores, Expendable cores, VddT, VddE.]
Figure 1.2: The BubbleWrap many-core.
The system software, responding to user demands, has full discretion on how to use the cores.
All could be spent in one day if, facing a critical deadline, the user requests performance at all
costs. If, instead, battery life is valued, the cores might never be popped. The system continually
tracks the aging state of each core, and when one is about to pop, migrates its work elsewhere.
When all Expendable cores are popped, the system can still provide a satisfactory and guaranteed
performance level over the remaining service life using only the Throughput cores.
If the Expendable cores are used to execute a single critical thread and are rationed out progres-
sively over the service life of the processor, the expected service life per Expendable core would
decrease, as more and more Expendable cores become available. If |Expendable| Expendable cores
reside on chip, each needs to last for 1/|Expendable|th of the nominal processor service life. The
shorter the expected service life, the larger the admissible increase in supply voltage, and hence in
frequency. Figure 1.3 demonstrates service life as a function of supply voltage at 32 nm. If the
nominal processor service life is seven years, 34 Expendable cores with a 2.5-month service life each
would facilitate a 15% increase in operating voltage over that admitted by a seven-year service
life, whereas 84 Expendable cores with a one-month service life each would accommodate a 23%
increase.
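The rationing arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes the Expendable cores are consumed strictly one after another, so that their individual service lives add up to the nominal processor service life.

```python
NOMINAL_SERVICE_LIFE_YEARS = 7  # nominal processor service life used in the text


def months_per_expendable_core(num_expendable,
                               nominal_years=NOMINAL_SERVICE_LIFE_YEARS):
    """Service life each Expendable core must cover, in months."""
    if num_expendable <= 0:
        raise ValueError("need at least one Expendable core")
    return nominal_years * 12 / num_expendable


for cores in (34, 84):
    months = months_per_expendable_core(cores)
    print(f"{cores} Expendable cores: {months:.1f} months per core")
```

The 15% and 23% voltage increases quoted above come from the service-life curve of Figure 1.3 and are not derived by this sketch.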
[Figure 1.3 plot: lifetime (years; log scale, 1E-05 to 1E+01) vs. Vdd (V; 1.1 to 1.6). Marked points: nominal operation, 7 years per core; BubbleWrap operation with 34 Expendable cores, 2.5 months per core; BubbleWrap operation with 84 Expendable cores, 1 month per core.]
Figure 1.3: Processor service life as a function of supply voltage [7].
As process technology scales, the fraction of Expendable (dormant) cores increases and the ser-
vice life demanded per Expendable core declines. Consequently, the benefits of BubbleWrap will
increase as technology advances.
1.3 Dimming Dark Silicon: Near-Threshold Voltage Operation
Near-threshold voltage computing (NTC) refers to an environment where the supply voltage Vdd
assumes a value only slightly higher than the threshold voltage Vth [8, 9]. For the current technology
generation, this corresponds to Vdd ≈ 0.5 V, whereas Vdd ≈ 1 V applies for conventional,
super-threshold voltage computing (STC).
NTC dims [10] dark silicon by reducing the energy per operation by about 2-4× over STC –
at the expense of a frequency degradation of about 5-10× [11]. As a result, power reduces by
approximately 10-40×, allowing more cores to operate simultaneously within the same many-
core power envelope.
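The power figure follows from multiplying the two factors, since power is energy per operation times operations per second. A minimal sketch of this arithmetic, assuming the operation rate scales with clock frequency (the function name is hypothetical):

```python
def ntc_power_reduction(energy_reduction, frequency_degradation):
    """Factor by which power drops when moving from STC to NTC.

    power = (energy per operation) * (operations per second), so a k_e-fold
    drop in energy per operation combined with a k_f-fold drop in frequency
    gives a (k_e * k_f)-fold drop in power, assuming the number of
    operations per cycle stays the same.
    """
    return energy_reduction * frequency_degradation


low = ntc_power_reduction(2, 5)     # lower ends of both quoted ranges
high = ntc_power_reduction(4, 10)   # upper ends of both quoted ranges
print(f"power reduces by about {low}-{high}x")
```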
The NTC range of operation is close to a sweet spot in energy efficiency: Figure 1.4 charac-
terizes the energy efficiency as MIPS/watt (left y-axis) and the transistor delay (right y-axis) as
a function of Vdd. In a narrow range of voltages around Vth, the energy efficiency peaks. Out
of this range, higher voltages quickly result in substantially lower energy efficiency, since power
consumption increases more than the delay decreases. Lower voltages, on the other hand, quickly
make transistors unacceptably slow, even as power consumption decreases significantly.
Note that the operating voltage cannot be lowered indefinitely due to reliability and
performance constraints.
[Figure 1.4 plot: energy efficiency (MIPS/Watt) and log(transistor delay) vs. supply voltage, with Vth, VddNTC, and VddSTC marked on the x-axis and the NTC and STC regions labeled. Annotations: ~10x, ~100x, ~2x, ~10x.]
Figure 1.4: Impact of Vdd on energy efficiency and delay [9].
Figure 1.5 compares scaling trends of three key parameters under NTC, STC, and as imposed by
classical CMOS theory [1]: Supply voltage, transistor delay and power density. The x-axis denotes
gate length to demarcate each technology generation. Classical scaling relies on scaling voltages
down with lithographic dimensions at every technology generation by a constant scaling factor κ
(Table 1.1). Both Vdd and transistor delay scale by 1/κ each generation, giving rise to a constant
power density. Conventional STC scaling deviates from classical scaling in that the decrease in the
threshold voltage Vth has stopped (in order to keep static power under control), which in turn has
prevented the supply voltage Vdd from scaling [2]. A direct consequence is that power density
no longer stays constant. The curves experience substantial vertical shifts when NTC scaling is
considered. Supply voltage (Figure 1.5(a)) and power density (Figure 1.5(b)) reduce significantly,
while transistor delay increases.
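The constant power density under classical scaling can be checked with a short calculation. The sketch below assumes the standard constant-field scaling factors for per-transistor capacitance (1/κ) and area (1/κ²), which are not restated in this section; the function name and the value of κ are illustrative.

```python
def classical_scaling(kappa):
    """Per-generation scaling factors under classical (constant-field) scaling."""
    vdd = 1 / kappa          # supply voltage
    delay = 1 / kappa        # transistor delay, so frequency scales by kappa
    capacitance = 1 / kappa  # per-transistor capacitance (assumed)
    area = 1 / kappa ** 2    # per-transistor area (assumed)
    # Dynamic power per transistor is proportional to C * Vdd^2 * f.
    power = capacitance * vdd ** 2 * (1 / delay)
    return {"vdd": vdd, "delay": delay, "power_density": power / area}

factors = classical_scaling(1.4)
print(factors)
```

Per-transistor power and area both shrink by 1/κ², so their ratio stays at 1. If Vdd stops scaling, as under STC scaling, the Vdd² term no longer shrinks and the ratio grows.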
[Figure 1.5 plots: (a) supply voltage (V, log scale) versus gate length (nm); (b) power density and transistor delay (arbitrary units, log scale) versus gate length (nm). Each plot shows Classical Scaling, STC Scaling, and NTC Scaling.]
Figure 1.5: Parameter scaling under three scenarios [8].
1.4 Overview
This dissertation explores how to dim dark silicon along two novel and promising axes: By (1)
trading off the processor aging (or wear-out) rate for power and performance – the BubbleWrap
many-core, and (2) exploring architectural implications of near-threshold voltage operation – the
Polyomino many-core.
The BubbleWrap many-core manages processor aging (or wear-out) rate and trades it off for
more performance within the same power budget or power savings at the same performance
level. BubbleWrap introduces the novel concept of core popping and deploys the Dynamic Voltage
Scaling for Aging Management (DVSAM) toolset to this end. The idea is to continuously tune the
supply voltage on a per core basis within the course of its service life, exploiting any aging guard-
band instantaneously. This renders the following regimes of operation: Consume the least power
for the same performance and processor service life; attain the highest performance for the same
service life while respecting the given power budget; or attain even higher performance for a
shorter service life while respecting the given power budget. BubbleWrap takes its name from the
third regime of operation. Effectively, BubbleWrap extracts a closer-to-optimal operating point
by aggressively using up, at all times, all the aging-induced guard-band that the designers have
included – preventing any waste of it.
BubbleWrap first identifies the set of most energy-efficient cores in a variation-affected chip, that is,
the largest set that can be simultaneously powered on. These cores are reserved as Throughput cores
dedicated to parallel section execution. The rest of the cores are designated as Expendable and are
dedicated to accelerating sequential sections. Expendable cores are sacrificed one at a time, by
running each of them at an elevated supply voltage for a short, few-month-long service life, until
the core completely wears out and is discarded – figuratively, as if popping bubbles in a bubble
wrap that protects Throughput cores. As a result, BubbleWrap provides substantial performance
gains over a plain chip under the same power budget.
The second axis explores architectural implications of near-threshold voltage (NTV) operation
in an attempt to dim dark silicon. One drawback of NTV is a degradation in core frequency,
which may be tolerable through more parallelism (to the extent the application domain permits),
by using more on-chip cores. A more important problem at NTV is the increased sensitivity to
parametric variations – the deviation of device parameters from their nominal values.
Already at conventional super-threshold voltages (STV), parametric variations result in sub-
stantial power and performance loss. At NTV, parametric variations of the same magnitude cause
even larger fluctuations in speed and power across the transistors in a chip. Effectively coping
with parametric variations at NTV at the architecture level is difficult for at least two reasons:
First, all of the existing architectural models of process variations apply only to STV rather than
NTV. Second, conventional techniques to address parametric variations at STV typically rely on
supply voltage tuning, where the effectiveness increases with voltage domain granularity. How-
ever, considering practical overheads and physical limitations, the energy efficiency decreases the
finer the on-chip voltage domain granularity becomes. Such efficiency degradation is not afford-
able in any energy conscious NTV setting.
To handle parametric variations in present and future NTV designs, this dissertation first builds
on an existing model of variations at STV and develops the first architectural model of process
variations at NTV. Secondly, using the model, this study shows that supporting multiple on-chip
voltage domains to address parametric variations will not be cost-effective in near-future NTV
designs, due to (i) the on-chip voltage regulators' power losses, (ii) the increased supply voltage
guard-band needed to tolerate deeper voltage droops (as induced by lower capacitance per domain),
and (iii) the practical fact that a domain still includes many cores. With this insight, this
dissertation introduces Polyomino, a simple many-core architecture which can effectively cope
with variations at NTV. Polyomino eschews multiple voltage domains and relies on fine-grain
frequency domains to optimize execution under variations.
CHAPTER 2
TRADING OFF THE AGING RATE FOR POWER AND
PERFORMANCE: THE BUBBLEWRAP MANY-CORE
2.1 Introduction
As more transistors are integrated on a fixed-sized chip at every generation, the chip power in-
creases rapidly. If the chip power budget is fixed due to system cooling constraints and the as-
sociated costs, a growing gap emerges between the number of cores that can be placed on a chip
and the number of cores that can be powered on simultaneously. Soon, many cores may have to
remain powered off.
At the same time, aggressive scaling exacerbates parametric variation [12]. Variation can man-
ifest as static, spatial fluctuations across the chip or as dynamic, temporal changes. A significant
contributor to the latter is device wearout or aging. Aging induces a progressive slowdown in
logic.
Processor aging has been the subject of many proposals (e.g., [13–20]). The aging rate increases
with Vdd and temperature T. Consequently, approaches to slow down aging by operating at
lower Vdd or T have been introduced [20]. Such techniques to change the aging rate typically
affect performance and power. Hence, aging rate can be traded off for performance and power in
an attempt to push back the many-core power wall.
Based on this observation, this chapter proposes a novel scheme for managing processor aging
that unlocks higher performance or power savings: Dynamic Voltage Scaling for Aging Management
(DVSAM). The idea is to continuously tune Vdd (but not the frequency), exploiting any instanta-
neous aging guard-band. The goal can be one of the following: consume the least power for the
same performance and processor service life; attain the highest performance for the same service
life while respecting the power budget; or attain even higher performance for a shorter service life
while respecting the power budget.
Further, this chapter covers BubbleWrap, a novel many-core architecture relying on the DVSAM
toolset in dimming dark silicon. BubbleWrap identifies the most energy-efficient set of cores in
a variation affected die — the largest set that can be simultaneously powered on. It designates
them as Throughput cores dedicated to parallel-section execution. The rest of the cores are desig-
nated as Expendable and are dedicated to sequential acceleration. BubbleWrap attains maximum
sequential performance by sacrificing Expendable cores one at a time, running them at elevated
Vdd for a significantly shorter service life each, until they completely wear out and are discarded
– figuratively, as if popping bubbles in bubble wrap that protects Throughput cores.
In simulated 32-core chips, BubbleWrap provides substantial gains over a plain chip. For exam-
ple, on average, one design runs fully sequential applications at a 16% higher frequency, and fully
parallel ones with a 30% higher throughput.
This chapter (1) introduces Dynamic Voltage Scaling for Aging Management (DVSAM), a novel
toolset for managing processor aging to attain higher performance or power savings, and (2)
presents the BubbleWrap many-core which makes extensive use of the DVSAM toolset in an at-
tempt to push back the many-core power wall.
In the following, Section 2.2 provides a background; Section 2.3 introduces DVSAM; Section 2.4
presents the BubbleWrap many-core; Sections 2.5 and 2.6 evaluate BubbleWrap; and Section 2.7
discusses related work.
2.2 Background
2.2.1 Modeling Aging
Our analysis focuses on aging induced by bias temperature instability (BTI), which causes tran-
sistors to become slower in the course of their normal use. BTI-induced degradation leads to
increases in the threshold voltage (Vth) of the form ∆VthBTI ∝ t^a, where t is time and a is a
time-slope constant. Constant a is strongly related to process characteristics, and generally takes
a value between 0 and 0.5 for recent process generations [21–23].
To model ∆VthBTI , we adopt the framework of Wang et al. [23]. Vth only increases when the
voltage between the gate and the source of the transistor is set to a given logic value, and decreases
more slowly when the voltage is set to the opposite logic value. These conditions are called Stress
and Recovery conditions, respectively. Equation 2.1 shows ∆VthBTI as a function of the time the
transistor is under stress (tSTRESS) and recovery (tRECOVERY). In the equation, ABTI , a, Eo and η are
model fitting parameters. Importantly, the equation shows that ∆VthBTI depends exponentially
on the supply voltage (Vdd) and the temperature (T). Therefore, high values of Vdd or T will
substantially increase the aging rate.
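\[
\Delta Vth_{BTI} = \Delta Vth_{STRESS} \times \left(1 - \sqrt{\frac{\eta \times t_{RECOVERY}}{t_{RECOVERY} + t_{STRESS}}}\right)
\]
where
\[
\Delta Vth_{STRESS} = A_{BTI} \times \left[\frac{q^{3}}{C_{ox}^{2}} \times (Vdd - Vth_{NOM}) \times \exp\left(-\frac{E_{a}}{2kT} + \frac{Vdd - Vth_{NOM}}{t_{ox} \times 0.5E_{o}}\right)\right]^{2a} \times (t_{STRESS})^{a} \tag{2.1}
\]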
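To make the structure of Equation 2.1 concrete, the following sketch evaluates it with a lumped prefactor. The parameter k_stress stands in for everything in ∆VthSTRESS except the (tSTRESS)^a term, and the numeric values of k_stress, a, and η are placeholders chosen for illustration, not fitted values.

```python
import math

def delta_vth_bti(t_stress, t_recovery, k_stress, a, eta):
    """Threshold-voltage shift in the form of Equation 2.1.

    k_stress lumps A_BTI and the bracketed voltage/temperature term
    raised to 2a (hypothetical value, fixed Vdd and T assumed).
    """
    if t_stress <= 0:
        return 0.0
    stress = k_stress * t_stress ** a
    recovery = 1 - math.sqrt(eta * t_recovery / (t_recovery + t_stress))
    return stress * recovery

# Placeholder parameters; equal stress and recovery time.
shift_t = delta_vth_bti(1.0, 1.0, k_stress=1.0, a=0.2, eta=0.35)
shift_2t = delta_vth_bti(2.0, 2.0, k_stress=1.0, a=0.2, eta=0.35)
growth = shift_2t / shift_t  # equals 2**a when the stress:recovery ratio is fixed
```

With a fixed stress-to-recovery ratio, the recovery factor is constant and the shift grows as t^a, so doubling the elapsed time multiplies the shift by only 2^a.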
With this effect, Vth at a given Vdd and T is given by Equation 2.2, where VthNOM, VddNOM, and
TNOM are the nominal values of these parameters, and kDIBL and kT are constants.
\[
Vth = Vth_{NOM} + k_{DIBL} \times (Vdd - Vdd_{NOM}) + k_{T} \times (T - T_{NOM}) + \Delta Vth_{BTI} \tag{2.2}
\]
The increase in Vth translates into an increase in transistor switching delay (τ) as per the alpha-
power law [24] (Equation 2.3). The result is a slowdown in the processor’s critical paths and,
hence, a decrease in its operating frequency. In the formula, α > 1 and µ ∝ T^(−1.5).
\[
\tau \propto \frac{Vdd}{\mu (Vdd - Vth)^{\alpha}} \tag{2.3}
\]
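As an illustration of Equation 2.3, the sketch below computes the relative slowdown caused by a Vth increase at fixed Vdd and T. The values of α, Vdd, and Vth are hypothetical (the text only requires α > 1), and the mobility is normalized to 1.

```python
def relative_delay(vdd, vth, alpha=1.3, mobility=1.0):
    """Switching delay up to a constant factor, per the alpha-power law."""
    if vdd <= vth:
        raise ValueError("Vdd must exceed Vth")
    return vdd / (mobility * (vdd - vth) ** alpha)

fresh = relative_delay(1.0, 0.30)  # hypothetical Vth at start of life
aged = relative_delay(1.0, 0.34)   # hypothetical Vth after aging
slowdown = aged / fresh
print(slowdown)
```

Because Vdd and the mobility are unchanged, the slowdown reduces to ((Vdd − Vth_fresh) / (Vdd − Vth_aged))^α, which is greater than 1 whenever Vth has increased.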
The aging model from Equation 2.1 was derived for devices with silicon-based dielectrics and poly
gates (Poly + SiON devices), and verified against an industrial 65 nm node [23]. To reduce gate
leakage, starting from 45 nm, manufacturers have introduced a new generation of devices with
higher gate dielectric constants (high-k) and metal gates (HK + MG devices) [21, 22]. However,
we argue that the model is still applicable.
To see why, we refer to the reliability characterization of Intel’s 45 nm node [25], one of the
first HK + MG processes in production. The authors report that (1) PMOS Negative BTI (NBTI)
remains a major reliability concern, as it was in the predecessor Poly + SiON 65 nm node, and
that (2) at higher electric fields (as induced by higher Vdd), NMOS Positive BTI (PBTI) becomes
significant. Further, they demonstrate that the BTI characteristics closely follow the BTI behavior
of Poly + SiON devices. Specifically, the same physical phenomena cause this behavior [26].
We modify the process parameters and time-slope characteristics of the base model to reflect the
new technology node in light of the observations from [26]. We assume cores like the Intel Core
i7 [6], which are based on Intel’s 45 nm HK + MG technology. Another finding from [25] is that
the hot-carrier injection (HCI) effect is not that significant for any realistic use condition. Hence,
we focus on BTI-based aging only.
We acknowledge, however, that advances in devices in future nodes may require reconsidering
the formulae. In the contemporary era of CMOS scaling, as we face more scaling limits, the pace
at which new materials and features are introduced will increase [27, 28], as demonstrated by the
introduction of HK + MG devices.
2.2.2 Impact of Aging
Equation 2.1 shows that ∆VthBTI follows a power law with time. Since a typical value of the time
slope a is between 0 and 0.5, ∆VthBTI increases rapidly first and then more slowly. Let us assume a
fixed ratio of stress to recovery time, and that stress and recovery periods are finely interleaved. If
VthNOM is the value of Vth at the beginning of the service life, the curve Vth = VthNOM +∆VthBTI
as a function of time is shown in Figure 2.1(a). At the end of the service life, which the figure
assumes is seven years, Vth has reached VthD.
The switching delay of a transistor is given by Equation 2.3. As Vth increases with time due
to aging, so does the switching delay τ for the same supply voltage and temperature conditions.
Both PMOS and NMOS transistors suffer from BTI-induced aging — called NBTI and PBTI, re-
spectively [25].
To determine the delay of a logic path in a processor, we follow the approach of Tiwari and
Torrellas [20]. Specifically, when the path is activated, we identify the set of transistors that switch.
The path delay is given by the switching delays of such transistors plus the wire delays. As
individual transistors age and their ∆VthBTI increases following a power law, the total path delay
can be shown to also increase following a similar curve. Specifically, the path delay increases fast
at the beginning of the service life and then progressively more slowly.
The actual path delay increase depends on many issues, including the path composition in
terms of PMOS or NMOS transistors, the degree of wiring, the ratio of stress to recovery periods,
the T, and the Vdd. However, the increase follows the general shape described. If we assume that
the delay of the critical path of the processor increases 10% in a seven-year service life, we attain
[Figure 2.1 panels: (a) Vth versus time (months), rising from VthNOM to VthD; (b) normalized critical path delay versus time (months), rising from 1 to 1.1; (c)-(e) variants of (b) under Low Vdd and High Vdd.]
Figure 2.1: Effects of aging on Vth degradation (a) and on critical path delay degradation (b)-(e).
a normalized curve like the one in Figure 2.1(b). In the figure, the normalized critical path delay
is 1 at the beginning, and 1.1 at the end of the service life.
Let us call the delay of the processor’s critical path at the beginning of service τZG (where ZG
stands for zero guard-band). The same path will have a delay of τNOM = τZG × (1 + G) at the
end of the service life, where G is the timing guard-band that the path will consume during its life
(10% in our example). Consequently, the processor cannot be clocked at a frequency fZG = 1/τZG
because it would soon suffer timing errors. It is clocked at the lower frequency fNOM during all
its service life:
\[
f_{NOM} = \frac{1}{\tau_{NOM}} = \frac{1}{\tau_{ZG} \times (1 + G)} = \frac{f_{ZG}}{1 + G} \tag{2.4}
\]
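A short numeric check of Equation 2.4 for the 10% guard-band used in our example; the function name is illustrative and fZG is normalized to 1.

```python
def nominal_frequency(f_zg, guard_band):
    """Frequency sustainable over the whole service life (Equation 2.4)."""
    return f_zg / (1 + guard_band)

f_nom = nominal_frequency(1.0, 0.10)  # f_ZG normalized to 1, G = 10%
frequency_loss = 1 - f_nom
print(f_nom, frequency_loss)
```

With G = 10%, fNOM is about 9% below fZG. This is the frequency left unused at the beginning of the service life.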
2.2.3 Slowing Down Aging
In recent work called Facelift [20], Tiwari and Torrellas attempt to slow down aging by perturbing
the curve in Figure 2.1(b). A major knob they use is Vdd. Vdd affects transistor aging as per
Equation 2.1 and transistor delay as per Equation 2.3. Specifically, if we increase Vdd (without
changing any other parameter), transistors become faster (from Equation 2.3, since α >1) and also
age faster (from Equation 2.1). If, instead, we decrease Vdd, transistors age more slowly but they
also become slower.
They then use timely Vdd changes to slow down aging. Graphically, this means forcing the
curve in Figure 2.1(b) to reach a lower y-coordinate value at the end of the service life. To see the
impact of timely Vdd changes, we repeat Figure 2.1(b) in Figures 2.1(c) and 2.1(d). Figure 2.1(c)
shows the effect of increasing Vdd: it pushes the curve down (faster critical paths) but increases
the slope of the curve (faster aging). Figure 2.1(d) shows the effect of decreasing Vdd: it pushes
the curve up (slower critical paths) but reduces the slope of the curve (slower aging).
They observe that changing Vdd impacts (i) the aging rate and (ii) the critical path delay differ-
ently within the course of the processor service life. Specifically, it impacts the aging rate strongly
(positively or negatively) at the beginning of the service life, and little toward the end. In con-
trast, it impacts the delay more uniformly across time. Consequently, they propose to apply a low
Vdd toward the beginning of the service life. At that time, it reduces the aging rate the most and,
therefore, slows down aging the most. Moreover, there is still substantial guard-band available to
tolerate the lengthening of the critical path. They propose to apply a high Vdd toward the end of
the service life. At that time, it still speeds up the critical path, while it increases the aging rate the
least. The result of using this strategy is shown in Figure 2.1(e). At the end of the service life, the
wearout of the processor (and therefore the critical path delay) is lower. Consequently, the authors
can run the processor at a constant frequency throughout the service life that is higher than fNOM.
Tiwari and Torrellas [20] also use this strategy to configure cores for a shorter service life. They
consolidate all the aging into the shorter life and run the processor at an even higher, constant
frequency throughout the short life.
2.3 DVSAM: Dynamic Voltage Scaling for Aging Management
2.3.1 Main Idea
While Tiwari and Torrellas [20] change the Vdd of a processor only once or twice in its service
life, we observe that we can manage aging better if we continuously tune Vdd in small steps
over the whole service life — keeping the frequency constant as these authors do. Moreover,
we observe that these changes can be done not only to improve performance, but also to reduce
power consumption.
Based on these two observations, we propose Dynamic Voltage Scaling for Aging Management
(DVSAM). The idea is to manage the aging rate by continuously tuning Vdd (but not the fre-
quency), exploiting any currently-left aging guard-band. DVSAM is a novel approach to trade off
processor performance, power consumption, and service life for one another.
We propose the four DVSAM modes of Table 2.1. DVSAM-Pow attempts to consume the mini-
mum power for the same performance and service life. DVSAM-Perf tries to attain the maximum
performance for the same service life and within power constraints. DVSAM-Short tries to attain
even higher performance for a shorter service life and within power constraints. Finally, VSAM-
Short is the same as DVSAM-Short but without changing Vdd with time.
Table 2.1: DVSAM modes. P denotes total power consumption, with PNOM corresponding to the
power consumption of a core clocked at fNOM under nominal operating conditions.
Mode | Goal | Vdd Values
DVSAM-Pow | Consume minimum power for the same performance and service life (f = fNOM, P < PNOM) | At t = 0: Vdd << VddNOM; at t = SNOM: Vdd < VddNOM
DVSAM-Perf | Attain maximum performance for the same service life and a given power budget (f > fNOM, P > PNOM) | At t = 0: Vdd < VddNOM; at t = SNOM: Vdd > VddNOM
DVSAM-Short | Attain even higher performance for a shorter service life and a given power budget (f >> fNOM, P >> PNOM) | At t = 0: Vdd > VddNOM; at t = SSH: Vdd >> VddNOM
VSAM-Short | Special case: same as DVSAM-Short but no Vdd changes with time (f >> fNOM, P >> PNOM) | ∀ t ∈ [0, SSH]: Vdd >> VddNOM
DVSAM operates by aggressively trying to consume, at any given time, all the aging guard-
band that would be otherwise available. Consequently, the design assumes the existence of aging
sensor circuits that reliably measure the guard-band available at all times [14, 15, 17, 29–32] (Sec-
tion 2.4.3.2). Note that circuits have additional guard-bands to protect themselves against other
effects such as thermal and voltage fluctuations.
2.3.2 Detailed Operation
To understand the DVSAM operation, note that the nominal supply voltage VddNOM used in a
processor is “over-designed” for the early phases of the processor’s service life. It is designed so
that, by the end of the service life, the processor’s critical path is just fast enough to avert any
timing error. However, earlier on in the service life, before this path aged, this path used to take
less time than the cycle time — and its speed was enabled by the VddNOM. Clearly, at that time,
we could have used a lower Vdd.
This effect is seen analytically by assuming a critical path of identical gates and summing up
the switching delays of all the transistors in the path using Equation 2.3. The critical path delay at
the end of the service life (τNOM), when Vth = VthD, is supported by VddNOM:
\[
\tau_{NOM} \propto \frac{Vdd_{NOM}}{\mu (Vdd_{NOM} - Vth_{D})^{\alpha}} \tag{2.5}
\]
The same VddNOM is used at the beginning of the service life when, because Vth = VthNOM, the
same path only takes τZG:
\[
\tau_{ZG} \propto \frac{Vdd_{NOM}}{\mu (Vdd_{NOM} - Vth_{NOM})^{\alpha}} \tag{2.6}
\]
However, since the processor is clocked at the same frequency throughout the whole service life,
keeping these paths so fast is unnecessary. Consequently, DVSAM-Pow reduces Vdd in the early
stages of the service life, slowing these paths but ensuring that they do not take longer than τNOM.
The result is power savings. Note that the Vdd reduction becomes gradually smaller as the paths
age. It may be that, by the end of the service life, we can still use a lower Vdd than VddNOM. The
reason is that, thanks to having applied lower-than-usual Vdd to the processor over a long period,
its paths have aged less than usual. Table 2.1 shows these Vdd values. In the table, the nominal
service life is denoted as SNOM.
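The Vdd choice made by DVSAM-Pow can be sketched as a search for the lowest Vdd whose critical path delay does not exceed τNOM. The sketch below uses Equation 2.3 at fixed temperature and takes the Vth value as given. It deliberately ignores two effects described in this chapter: the dependence of Vth on Vdd (Equation 2.2), and the fact that a lower Vdd slows down aging. All numeric values and function names are hypothetical.

```python
def relative_delay(vdd, vth, alpha=1.3):
    """Critical path delay up to a constant factor (Equation 2.3, fixed T)."""
    return vdd / (vdd - vth) ** alpha

def min_vdd_for_delay(target_delay, vth, vdd_max, alpha=1.3, tol=1e-9):
    """Smallest Vdd in (vth, vdd_max] whose delay does not exceed target_delay.

    Bisection is valid because, for alpha > 1, the delay decreases
    monotonically as Vdd increases above Vth.
    """
    if relative_delay(vdd_max, vth, alpha) > target_delay:
        raise ValueError("target delay not reachable at vdd_max")
    lo, hi = vth, vdd_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if relative_delay(mid, vth, alpha) <= target_delay:
            hi = mid   # mid is fast enough; try a lower Vdd
        else:
            lo = mid   # mid is too slow; need a higher Vdd
    return hi

VDD_NOM, VTH_NOM, VTH_D = 1.0, 0.30, 0.34     # hypothetical values
tau_nom = relative_delay(VDD_NOM, VTH_D)      # end-of-life delay (Equation 2.5)
vdd_start = min_vdd_for_delay(tau_nom, VTH_NOM, VDD_NOM)  # fresh transistors
vdd_end = min_vdd_for_delay(tau_nom, VTH_D, VDD_NOM)      # fully aged transistors
```

In this simplified setting, vdd_start lies below VddNOM while vdd_end equals VddNOM. In the actual DVSAM-Pow operation, the end-of-life Vdd can remain below VddNOM because the paths age less under the lower Vdd, an effect this sketch does not model.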
Alternatively, since the paths have timing slack in the early stages of the service life, DVSAM-
Perf increases the frequency of the processor, hence, delivering higher performance. However,
for simplicity in our design, we want to keep the elevated frequency of the processor constant
over the whole service life. To do so, DVSAM-Perf also changes Vdd. Toward the early stages of
the service life, to slow down aging as in Tiwari and Torrellas [20], Vdd will be set slightly below
VddNOM. Toward the end of the service life, to keep up with the higher frequency of the processor,
Vdd will be set above VddNOM. Table 2.1 shows these Vdd values.
Figures 2.2(a) and 2.2(b) show the operation of the DVSAM-Pow and DVSAM-Perf modes, re-
spectively. The top row shows the changes in Vdd as a function of time, while the bottom one
shows the changes in critical path delay (τ) as a function of time. Time goes from t = 0 to the end
of the service life SNOM (e.g., 7 years). Each chart also shows, with a dotted line, the evolution of
the parameter value if DVSAM was not applied. Specifically, Vdd stays constant at VddNOM (top
[Figure 2.2 panels: (a) DVSAM-Pow, (b) DVSAM-Perf, (c) DVSAM-Short, (d) VSAM-Short. Top row: Vdd versus service life, relative to VddNOM. Bottom row: critical path delay τ versus service life, with τZG, τNOM, and τOP marked. Time axes start at t=0, with SNOM and SSH marked.]
Figure 2.2: Changes in Vdd (top row) and critical path delay (bottom row) as a function of time
for the different DVSAM modes.
row), and τ changes from τZG at t = 0 to τNOM at SNOM — first quickly and then slowly.
Consider the Vdd charts first. As indicated above, under DVSAM-Pow, Vdd starts significantly
lower than VddNOM, gradually increases, and reaches a value below VddNOM at SNOM. Under
DVSAM-Perf, Vdd starts slightly lower than VddNOM, increases faster, and ends up significantly
higher than VddNOM.
In the critical path delay charts (bottom row), both modes show a constant critical path delay
(labeled τOP). This is because they dynamically tune the Vdd so that the critical path always takes
exactly the same time — balancing the natural lengthening of the critical path due to aging with
progressively higher Vdd. In both cases, the processor is clocked at constant frequency fOP =
1/τOP over the whole service life. Under DVSAM-Pow, τOP is equal to τNOM; under DVSAM-
Perf, τOP is kept smaller than τNOM to enable a higher frequency. DVSAM-Perf can keep pushing
τOP lower at progressively higher power costs. However, we will reach a point where constraints
in Vdd, T, or service life duration will prevent any further reduction in τOP.
To attain higher performance beyond such points, we have the last two DVSAM modes. They
deliver higher performance than DVSAM-Perf (at higher power) by giving up service life (Ta-
ble 2.1). This means that the processor will become unusable and be discarded at a time SSH (for
short service life), much earlier than SNOM.
Figure 2.2(c) shows the operation of DVSAM-Short. Already at the start of the service life, Vdd
is set to a value higher than VddNOM (top chart of the figure), dramatically reducing the delay
of the critical path to τOP (bottom chart), and enabling the processor to cycle at a high frequency.
To make up for the lengthening of the critical path with time due to aging, Vdd has to continue
to increase with time (top chart). The result is that the critical path takes the same time (bottom
chart) and, therefore, the high frequency is maintained. However, since aging is exponential in
Vdd and T (Equation 2.1), the high Vdd and resulting high T rapidly age the critical path. Soon,
the aging is such that, to keep up with the high frequency required, Vdd (or T) would have to go
above allowed values. At that point, shown as SSH in Figure 2.2(c), the processor is discarded.
Finally, VSAM-Short is a simpler design than DVSAM-Short. Figure 2.2(d) shows its operation.
Vdd is set to a value higher than VddNOM (top chart of the figure). However, instead of dynami-
cally compensating the increase in critical path delay due to aging with higher Vdd, we keep Vdd
constant (top chart). As a result, the critical path delay increases (bottom chart). After a relatively
short duration SSH (different than for DVSAM-Short), the processor has aged too much and is discarded.
Note that the critical path delay is not constant; however, we want to keep the frequency
constant. Consequently, the processor can only be clocked at the frequency allowed by the path
delay at SSH (τOP in the figure), namely fOP = 1/τOP.
2.4 The BubbleWrap Many-Core
2.4.1 Overview
BubbleWrap is a novel many-core that uses DVSAM to address the problem described in Section 2.1
of not being able to power on all the on-chip cores simultaneously. While BubbleWrap uses
all DVSAM modes, its most novel characteristic is the use of DVSAM-Short and VSAM-Short.
BubbleWrap has N architecturally homogeneous cores, of which only NT can be powered on
simultaneously. The architecture distinguishes two groups of cores: Throughput (T) and Expendable
(E) cores. In a die affected by process variation, we select NT cores as the Throughput cores.
We choose the ones that consume the least power at the target frequency. They are used to run
the parallel sections of applications where, typically, we want throughput. The rest of the cores
(NE = N − NT) are designated as Expendable cores. They are dedicated to run the sequential
sections of applications, where we want per-thread performance.
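The partitioning step can be sketched as follows, assuming that per-core power at the target frequency has already been measured and that NT is given. How NT is derived from the power budget is not modeled here, and the function name and power values are illustrative.

```python
def partition_cores(core_power, n_throughput):
    """Split core indices into Throughput and Expendable groups.

    core_power[i] is the power core i consumes at the target frequency.
    The n_throughput least power-hungry cores become Throughput cores.
    """
    if not 0 <= n_throughput <= len(core_power):
        raise ValueError("n_throughput out of range")
    ranked = sorted(range(len(core_power)), key=lambda i: core_power[i])
    throughput = sorted(ranked[:n_throughput])
    expendable = sorted(ranked[n_throughput:])
    return throughput, expendable

# Hypothetical per-core power (arbitrary units) for a 6-core die, N_T = 4.
throughput, expendable = partition_cores([1.2, 0.9, 1.5, 1.0, 1.1, 0.8], 4)
print(throughput, expendable)
```

The two groups are returned as index lists because, under process variation, the cores of each group are generally not physically contiguous.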
The Expendable cores form a sequential accelerator. In our most novel designs, we attain ac-
celeration by sacrificing one Expendable core at a time. Specifically, each Expendable core runs
under DVSAM-Short (or VSAM-Short) mode, at elevated Vdd and frequency. In this manner, an
Expendable core delivers high performance, but it ages fast until it can no longer sustain such con-
ditions. At that point, we say that the core pops. It is discarded and replaced by another core from
the Expendable group. This process of popping cores by applying DVSAM-Short (or VSAM-Short)
mode gives its name to the BubbleWrap many-core.
Figure 2.3(a) shows a logical view of the BubbleWrap many-core. The figure depicts a mid-life
chip, when some of the Expendable cores have already popped (in black). Although the Through-
put and Expendable cores form two logical groups, process variation determines their actual loca-
tion on the die and, therefore, the cores of each group are typically not physically contiguous. All
Throughput cores, however, receive the same supply voltage VddT. When active, an Expendable
core receives VddE.
[Figure 2.3 panels: (a) the BubbleWrap many-core, with Throughput cores supplied by VddT and Expendable cores, which form the sequential accelerator, supplied by VddE; (b) frequency gain over fNOM versus expected service life SSH, up to SNOM.]
Figure 2.3: BubbleWrap chip (a) and operation (b).
To estimate how quickly BubbleWrap can afford to pop Expendable cores, we add up the
sequential-execution time of all the applications that are expected to run on the chip over its ser-
vice life. This number, as a fraction of the nominal service life SNOM of the chip, is called the
Sequential Load (LSEQ). For example, if we expect to run 21 applications, each of which runs a
sequential section for 6 months, and the nominal service life of the chip is 7 years, then LSEQ =
21× 0.5/7 = 1.5. Knowing LSEQ and the number of Expendable cores NE, we can conservatively
estimate the short service life of each individual Expendable core SSH as follows
SSH =
SNOM × LSEQ
NE
(2.7)
In our example, if we have 32 Expendable cores, each has to last SSH = 7 × 1.5/32, which is
approximately 4 months. In reality, each core will take less than 4 months to execute its load
because, thanks to its higher Vdd, it runs faster.
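The example above can be reproduced with a few lines, working in years and converting to months at the end. The function names are illustrative.

```python
def sequential_load(n_apps, seq_years_per_app, s_nom_years):
    """Sequential Load L_SEQ: total sequential time as a fraction of S_NOM."""
    return n_apps * seq_years_per_app / s_nom_years

def short_service_life(s_nom_years, l_seq, n_expendable):
    """Conservative per-core short service life S_SH (Equation 2.7), in years."""
    return s_nom_years * l_seq / n_expendable

l_seq = sequential_load(21, 0.5, 7)            # 21 applications, 6 months each
s_sh_years = short_service_life(7, l_seq, 32)  # 32 Expendable cores
s_sh_months = 12 * s_sh_years
print(l_seq, s_sh_months)
```

The result is slightly under 4 months per Expendable core, in line with the approximate figure quoted above.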
Figure 2.3(b) qualitatively shows how an Expendable core’s shorter service life (SSH) permits
operation at increasingly higher frequencies. The figure plots the frequency gain over the nominal
frequency as a function of SSH. The curve is generated with representative parameter values. It
can be shown that the frequency gain increases exponentially with decreasing SSH. Consequently,
for (D)VSAM-Short to be profitable, it is required that SSH << SNOM. Fortunately, we expect that
SSH will continue to shrink with time, since technology scaling is providing more Expendable
cores with each generation.
2.4.2 BubbleWrap Environments
The application of the different DVSAM modes of Section 2.3 to the Throughput or Expend-
able cores gives rise to six different BubbleWrap environments. The environments are pictorially
shown in Figure 2.4 and described in Table 2.2.
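[Figure 2.4 panels: (a) Base, (b) BaseE+PowT, (c) PerfE+PowT, (d) SShortE+BaseT, (e) SShortE+PowT, (f) DShortE+PowT. Panels (b), (c), (e), and (f) apply DVSAM-Pow to the Throughput cores; (c) applies DVSAM-Perf to the Expendable cores; popping uses VSAM-Short in (d) and (e) and DVSAM-Short in (f).]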
Figure 2.4: BubbleWrap chips corresponding to different environments. In each environment,
Throughput cores are in gray and Expendable cores in white. Recall that popping means applying
DVSAM-Short or VSAM-Short to the Expendable cores.
Chip (a) in Figure 2.4 shows the Base environment, which serves as the baseline. In Base, we
disable the Expendable cores and operate the Throughput cores at VddNOM and fNOM.
Chip (b) shows the BaseE+PowT environment, where we keep the Expendable cores disabled
and apply DVSAM-Pow to the Throughput cores. Throughput cores operate at fNOM and at a
Table 2.2: Voltage applied to the Expendable cores (VddE) and to the Throughput cores (VddT) in
each of the BubbleWrap environments.
Environment | VddE | VddT
Base | N/A (Cores disabled) | VddNOM
BaseE+PowT | N/A (Cores disabled) | < VddNOM (DVSAM-Pow)
PerfE+PowT | Variable (DVSAM-Perf) | < VddNOM (DVSAM-Pow)
SShortE+BaseT | > VddNOM (VSAM-Short) | VddNOM
SShortE+PowT | > VddNOM (VSAM-Short) | < VddNOM (DVSAM-Pow)
DShortE+PowT | > VddNOM (DVSAM-Short) | < VddNOM (DVSAM-Pow)
supply voltage less than VddNOM. Since each Throughput core now consumes less power, we can
expand the set of Throughput cores and power on all of them at the same time. This is shown in
Figure 2.4(b) with the arrows. Overall, this environment increases throughput while clocking all
the processors at fNOM.
Chip (c) shows the PerfE+PowT environment, where we apply DVSAM-Perf to the Expendable
cores (one at a time, during sequential sections) and, as before, DVSAM-Pow to the Throughput
cores (to all of them, under expansion, during parallel sections). Expendable cores now deliver
higher sequential performance, while Throughput cores deliver higher throughput.
The environments in chips (d), (e), and (f) are the same as the ones in (a), (b), and (c) except that,
in addition, we apply core popping. Recall that this means applying VSAM-Short or DVSAM-
Short to the Expendable cores. As a result, these environments further increase sequential perfor-
mance.
The type of popping we perform (VSAM-Short or DVSAM-Short) in each case depends on
whether the corresponding chip in Figure 2.4 already supported DVSAM on Expendable cores
to start with. Specifically, if it did not, we apply VSAM-Short; if it did, we apply DVSAM-Short.
Consequently, in chip (d), we take Base and apply VSAM-Short to the Expendable cores. The re-
sult is SShortE+BaseT. In chip (e), we take BaseE+PowT and apply VSAM-Short to the Expendable
cores. The result is called SShortE+PowT. Finally, in chip (f), we take PerfE+PowT and apply
DVSAM-Short to the Expendable cores, creating the DShortE+PowT environment.
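The popping rule above can be summarized in a few lines of code. The following is a minimal sketch only: the dictionary layout and the function name `pop` are illustrative assumptions and do not come from the BubbleWrap design itself.

```python
# Sketch of the popping rule; all names here are illustrative assumptions.
BASE_ENVIRONMENTS = {
    # name: (mode on Expendable cores, mode on Throughput cores)
    "Base": (None, None),
    "BaseE+PowT": (None, "DVSAM-Pow"),
    "PerfE+PowT": ("DVSAM-Perf", "DVSAM-Pow"),
}


def pop(env_name):
    """Return (name, Expendable mode, Throughput mode) after adding core popping."""
    mode_e, mode_t = BASE_ENVIRONMENTS[env_name]
    # DVSAM-Short is used only if the Expendable cores already supported DVSAM.
    uses_dvsam_on_e = mode_e is not None and mode_e.startswith("DVSAM")
    popped_mode = "DVSAM-Short" if uses_dvsam_on_e else "VSAM-Short"
    prefix = "DShortE" if uses_dvsam_on_e else "SShortE"
    suffix = "PowT" if mode_t == "DVSAM-Pow" else "BaseT"
    return f"{prefix}+{suffix}", popped_mode, mode_t


for name in BASE_ENVIRONMENTS:
    print(name, "->", pop(name))
```

Applying `pop` to the three environments without popping yields the three environments of chips (d), (e), and (f).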
2.4.3 Hardware Support for BubbleWrap
We describe three hardware components of BubbleWrap, namely the power and clock distribution
system, the aging sensors, and the optimizing controller.
2.4.3.1 Power and Clock Distribution System
Our proposed BubbleWrap design has two voltage and frequency domains in the chip, namely
one for the set of Throughput cores and one for the set of Expendable ones. This simple im-
plementation provides enough functionality for our environments of Figure 2.4. Indeed, in such
environments, all the cores in the Throughput set operate under the same conditions, and these
conditions are potentially different than those for the Expendable cores. Note that other designs
are also possible. For example, in a scenario with high within-die process variation, it may make
sense to have one voltage and frequency domain per core. Alternatively, in a scenario where the
chip runs a single application at a time, which alternates between parallel and sequential phases,
BubbleWrap would only need a single voltage and frequency domain. This simpler design would
suffice because we would not have Throughput cores and Expendable cores busy at the same
time. However, we do not consider such designs here.
Let us consider the power distribution first. Since the physical location of the Throughput
and Expendable cores on chip is unknown until manufacturing test time, we need a design that
can supply either VddE or VddT to any core on the die. The most flexible solution is to include
two independent supply networks [33], and connect all cores to both grids through power-gating
transistors. In this manner, a controller can select, for each core, whether to connect to the VddE
grid or the VddT grid by turning on the appropriate transistor.
Figure 2.5(a) shows such a design. The figure shows a chip with a global controller and one core.
The core has a multiplexer that can select one of the two power grids. The controller determines
the values of VddE and VddT, along with which grid each core should be connected to. This design
has the advantage of allowing cores to move from one grid to the other dynamically, possibly
based on their aging conditions.
Figure 2.5(b) shows the clock distribution network. The figure shows the grid with a global
controller and four cores — one Expendable and three Throughput ones. Each core contains a
PLL for signal integrity. The controller manages each PLL so that the correct frequency is supplied.
This design allows a core to move from one of the core sets to the other dynamically. Moreover, if
needed, it can also support per-core frequency domains [6, 34].
Figure 2.5: BubbleWrap power (a) and clock (b) distribution.
2.4.3.2 Aging Sensors
The effective application of the different DVSAM modes requires that we have a way to reliably
estimate the aging that each core experiences. To this end, aging sensors can be deployed that
dynamically measure the increase in critical path delays due to aging. The literature proposes a
variety of techniques for this purpose, including canary paths distributed throughout the cores
or periodic BIST (e.g., [14, 15, 17, 29–32]). It is claimed that such techniques can measure critical
path delays with sub-µs measurement times and sub-ps precision [30]. Moreover, Intel Core i7 [6]
already includes some of this circuitry to determine the level of Turbo Boost that can be applied.
2.4.3.3 Optimizing Dynamic Controller
The BubbleWrap controller interacts with the rest of the chip as shown in Figure 2.6(a). The con-
troller performs several tasks. First, it dynamically adjusts the VddT(t) and VddE(t) supplied to
the two sets of cores. Second, it keeps a table of which cores are Throughput, which are Expend-
able, and which ones are currently running. Based on this information, it sends the core selection
signals described in Section 2.4.3.1. Note that, if it uses any of the BubbleWrap environments with
core popping, the controller also determines when an Expendable core can be considered popped
and keeps track of which cores are already popped. To perform all these tasks, the controller needs
age information from all the cores.
We envision a simple hardware-based implementation of the controller. Every time that the
operating system (OS) changes the threads running on the chip, it passes information to the
controller on which DVSAM mode is most appropriate for each thread, or at least which threads
require acceleration. Based on this information and core age condition, the controller outputs its
initial VddT, VddE, and the core selection signals.
Figure 2.6: Overview of the BubbleWrap controller: (a) interaction with the rest of the chip and
(b) controller internals.
As the threads execute, the controller dynamically tunes the VddT and VddE values. At any
time, its goal is to supply the minimum VddT (or VddE) value that enables the cores to keep up
with the target operating frequency ( fOP) for the current DVSAM mode. Note that such frequency
is not set by the controller; it is set statically by the manufacturer.
To tune the voltages, the controller relies on the age information provided by the cores in Fig-
ure 2.6(a). To see how this works, Figure 2.6(b) shows the internals of the controller. The figure
also shows three Throughput cores, one of which is expanded.
In each core, aging sensors detect aging-induced increases in critical path delays. The figure
sketches a design similar to the one by Teodorescu et al. [35]. It uses multiple critical path replicas
distributed across the core along with a phase detector. If each replica satisfies the frequency
specification by a certain margin, the core asserts signal Fast, to indicate that this core may operate
at a lower Vdd and still be clocked at the target frequency. If at least one of the replicas does
not satisfy the frequency specification, signal Slow is asserted to demand a higher Vdd for this
core to support the target frequency. If no signal is set, all of the replicas satisfy the frequency
specification with the lowest margin. The combination of the Fast and Slow signals is the age
information passed from each core to the controller.
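The per-core signal generation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the representation of replica delays as numbers, and the way the margin is subtracted from the clock period are all hypothetical, since the text does not specify them.

```python
def age_signals(replica_delays, clock_period, margin):
    """Return the (fast, slow) pair that one core reports to the controller.

    replica_delays: measured delays of the critical path replicas (assumed
    to share the time unit of clock_period). margin is the assumed slack a
    replica needs before the core may ask for a lower Vdd.
    """
    # Slow: at least one replica misses the frequency specification.
    slow = any(d > clock_period for d in replica_delays)
    # Fast: every replica meets the specification with the required margin.
    fast = all(d <= clock_period - margin for d in replica_delays)
    return fast, slow


print(age_signals([0.90, 0.92], clock_period=1.0, margin=0.05))
print(age_signals([0.97, 0.92], clock_period=1.0, margin=0.05))
print(age_signals([1.02, 0.90], clock_period=1.0, margin=0.05))
```

With a non-negative margin the two signals are never asserted together, which matches the three cases in the text: Fast, Slow, or neither.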
The controller collects all Fast and Slow signals from all Throughput cores and combines them
using a similar circuit inside the controller (Figure 2.6(b)). The circuit asserts signal DOWN when
a lower Vdd may be feasible to cycle at the target frequency; it asserts signal UP if at least one of
the cores requires a higher Vdd. If no signal is asserted, all of the cores satisfy the frequency at
the lowest possible power budget. Finally, DOWN and UP form the control inputs to a counter.
The counter is then connected to a digital-to-analog converter (DAC), which converts the counter
output to VddT.
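The DOWN/UP counter and DAC chain can be modeled in software as a rough sketch. Several details below are assumptions and not part of the described hardware: DOWN is taken to mean that every core asserts Fast, the DAC is modeled as a linear mapping over the 0.8 to 1.3 V range of Table 2.3, and the 12.5 mV step and the class name are hypothetical.

```python
class VddController:
    """Software model of the DOWN/UP counter feeding a DAC (illustrative only)."""

    def __init__(self, v_min=0.8, v_max=1.3, step=0.0125, v_init=1.0):
        self.v_min = v_min
        self.step = step  # assumed DAC resolution in volts
        self.max_count = round((v_max - v_min) / step)
        self.count = round((v_init - v_min) / step)

    def update(self, signals):
        """signals: list of (fast, slow) pairs, one per Throughput core."""
        up = any(slow for _, slow in signals)  # some core needs a higher Vdd
        down = not up and all(fast for fast, _ in signals)  # all cores have slack
        if up:
            self.count = min(self.count + 1, self.max_count)
        elif down:
            self.count = max(self.count - 1, 0)
        return self.vdd()

    def vdd(self):
        """DAC: convert the counter value to a supply voltage."""
        return self.v_min + self.count * self.step


ctrl = VddController()
print(ctrl.update([(True, False)] * 3))                 # all Fast: one step down
print(ctrl.update([(True, False), (False, True)]))      # one Slow: one step up
```

Giving UP priority over DOWN is a design choice of this sketch; it reflects the requirement that a single slow core is enough to demand a higher Vdd.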
Although not shown in the figure, an analogous circuit exists for Expendable cores. In addition,
the figure shows the table that records the mapping of cores to Throughput or Expendable type,
and which cores are already popped.
The characteristics of the DAC, together with the magnitude of the voltage noise, determine the
grain size at which the DVSAM modes can tune Vdd.
2.5 Evaluation Setup
In this section, we present the evaluation methodology that we use to characterize a BubbleWrap
chip in a near-future process technology.
2.5.1 Power and Thermal Model
We use a simple model to estimate power and temperature values in a BubbleWrap chip. We
estimate the dynamic power consumed by a core using Wattch [36]; for the static power, we use
the following equations:
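\begin{equation}
\begin{aligned}
P_{STA} &= V_{dd} \times I_{LEAK} \\
I_{LEAK} &\propto \mu \times T^{2} \times e^{-q V_{th} / (k T n)}, \quad \text{where } \mu \propto T^{-1.5}
\end{aligned}
\tag{2.8}
\end{equation}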
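To illustrate how Equation 2.8 responds to Vth and temperature, the sketch below evaluates the static power up to its constant of proportionality. It is a simplified illustration: the function name is hypothetical, temperature is assumed to be in kelvin, n = 1.5 is taken from Table 2.3, and any dependence of Vth itself on temperature or Vdd is ignored here.

```python
import math

K_OVER_Q = 8.617333262e-5  # Boltzmann constant over electron charge, in V/K


def relative_leakage_power(vdd, vth, temp_k, n=1.5):
    """Static power of Equation 2.8 up to a constant factor (temp_k in kelvin)."""
    mobility = temp_k ** -1.5  # mu is proportional to T^-1.5
    i_leak = mobility * temp_k ** 2 * math.exp(-vth / (n * K_OVER_Q * temp_k))
    return vdd * i_leak


# Compare a lower-Vth core against a nominal-Vth core at 70 C (343.15 K).
leaky = relative_leakage_power(1.0, 0.2125, 343.15)
nominal = relative_leakage_power(1.0, 0.250, 343.15)
print(leaky / nominal)
```

Only ratios of the returned values are meaningful, because the constant of proportionality is left out.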
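Denoting the total power consumed by all of the on-chip cores as PTOT, we use the following
equations to estimate the (junction) temperature TJ of a core:
\begin{equation}
\begin{aligned}
T_J &= T_S + \theta_{JS} \times (P_{STA} + P_{DYN}) \quad \text{(per core)} \\
T_S &= T_A + \theta_{SA} \times P_{TOT}
\end{aligned}
\tag{2.9}
\end{equation}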
In the equations, θSA is the spreader to ambient thermal resistance, which is modeled as a
lumped equivalent of spreader to heat-sink and heat-sink to ambient thermal resistances. More-
over, TA is the ambient temperature, TS is the spreader temperature, PSTA and PDYN are the static
and dynamic power consumptions of the core, and θJS is the junction to spreader thermal
resistance. We assume an ambient temperature TA of 45 °C and a θSA of 0.222 K/W [6], while θJS is
calibrated for the worst case operating conditions.
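The two-level thermal model of Equation 2.9 is simple enough to evaluate directly. The sketch below is illustrative only: the function names are hypothetical, and the θJS value of 5 K/W in the usage lines is an assumed placeholder, since the text only states that θJS is calibrated for worst-case conditions.

```python
def spreader_temperature(p_tot, t_ambient=45.0, theta_sa=0.222):
    """T_S = T_A + theta_SA * P_TOT (temperatures in degrees C, power in W)."""
    return t_ambient + theta_sa * p_tot


def junction_temperature(p_sta, p_dyn, p_tot, theta_js,
                         t_ambient=45.0, theta_sa=0.222):
    """T_J = T_S + theta_JS * (P_STA + P_DYN) for one core."""
    t_s = spreader_temperature(p_tot, t_ambient, theta_sa)
    return t_s + theta_js * (p_sta + p_dyn)


# Example: a core drawing 1 W static + 4 W dynamic on a chip drawing 80 W,
# with an assumed (hypothetical) theta_JS of 5 K/W.
print(spreader_temperature(80.0))
print(junction_temperature(1.0, 4.0, 80.0, theta_js=5.0))
```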
For the near-future technology node we are assuming, the core area is already so small that
intra-core spatial thermal variation becomes negligible. Therefore, we model temperature at core
granularity, which offers reasonable fidelity for many-core designs [37]. Moreover, since we as-
sume a checkerboard core-cache layout, which reduces core-to-core thermal coupling significantly,
we neglect inter-core lateral conduction. The cache power density is significantly lower than the
core power density. Hence, caches placed between cores act as virtual lateral heat-sinks.
We impose a maximum total power of 80 W for the cores on the chip, and a maximum power of 5 W
per Throughput core. A core is designed to support a power limit of 10 W; going above that could
damage its power-distribution system. Consequently, 10 W is the effective maximum power for
an Expendable core.
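These three limits can be read as a simple feasibility check on any set of active cores. The sketch below is an illustration only; the function name and the list-based interface are assumptions.

```python
P_CHIP_MAX = 80.0  # W, all cores together
P_T_MAX = 5.0      # W, per Throughput core
P_E_MAX = 10.0     # W, per Expendable core


def within_budget(throughput_powers, expendable_powers):
    """Check per-core and chip-wide power limits for the active cores."""
    per_core_ok = (all(p <= P_T_MAX for p in throughput_powers)
                   and all(p <= P_E_MAX for p in expendable_powers))
    total = sum(throughput_powers) + sum(expendable_powers)
    return per_core_ok and total <= P_CHIP_MAX


print(within_budget([5.0] * 16, []))   # 16 Throughput cores at their limit
print(within_budget([], [7.0]))        # one Expendable core in a sequential section
```

Note that 16 Throughput cores at 5 W each exactly exhaust the 80 W chip budget.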
2.5.2 Process Technology
We assume a 22 nm technology node based on the predictive technology model’s bulk HK-MG
process [38], which incorporates recent corrections [39]. The technology parameters are given in
Table 2.3.
Table 2.3: Technology parameters.
Parameter   Value                   Parameter   Value
Cox         2.5 × 10^-20 F/nm^2     n           1.5
kT          -1 mV/K                 E0          0.08 V/nm
T0          70 °C                   Ea          0.56 eV
TMAX        100 °C                  a           0.2
Vdd         0.8 – 1.3 V             SNOM        7 years
VddNOM      1 V                     α           1.3
VthNOM      250 mV                  G           10% at TMAX
kDIBL       -150 mV/V               η           0.35
We use a very simple model to estimate the effects of Vth process variation. Power and perfor-
mance asymmetry due to process variation at the core granularity stems mainly from systematic
variation, which shows strong spatial dependence [40]. Hence, given the small area taken by a
core, we neglect any intra-core Vth variation. Moreover, we assume a normally-distributed
core-to-core variation in Vth with σ/µ = 5%. In our evaluation, we perform some experiments on cores
with [−3σ, +3σ] deviation from VthNOM. In our experiments, such range causes a variation of
approximately 12% in power consumption between the most power consuming core and the least
consuming one. Moreover, it leads to a variation of approximately 14% in frequency between the
fastest core and the slowest one.
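For concreteness, the Vth corners implied by these numbers can be computed as follows. This small sketch assumes that the mean µ of the distribution equals VthNOM = 250 mV (Table 2.3); the names are illustrative.

```python
VTH_NOM_MV = 250.0    # VthNOM from Table 2.3, in mV
SIGMA_OVER_MU = 0.05  # sigma/mu = 5%


def vth_at(k_sigma):
    """Vth (in mV) of a core that deviates k_sigma standard deviations from nominal."""
    sigma = SIGMA_OVER_MU * VTH_NOM_MV  # assumes mu = VthNOM
    return VTH_NOM_MV + k_sigma * sigma


print(vth_at(-3), vth_at(0), vth_at(3))
```

Under this assumption, σ is 12.5 mV, so the −3σ and +3σ corners sit at 212.5 mV and 287.5 mV.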
In our experiments, we assume that processors have a nominal service life SNOM of seven years.
The constant of proportionality of Equation 2.1, ABTI, is calibrated so that a core with a +3σ Vth
value (i.e., the slow corner) slows down by G = 10% at the end of SNOM if operated continuously
at VddNOM and TMAX. The constant of proportionality for the alpha-power law (Equation 2.3)
is calibrated to guarantee operation at fZG for a core with a +3σ Vth value at the beginning of
the service life, at VddNOM and T0. Finally, the constant of proportionality for leakage current
(Equation 2.8) is set so that leakage accounts for 25% of the total power consumption for a core
with a −3σ Vth value (i.e., the leaky corner) at VddNOM and T0.
2.5.3 Workload
Each non-idle core repeatedly cycles through the SPECint2000 applications in a round-robin fash-
ion (switching every ≈45 minutes of simulated time), therefore experiencing a diverse range of
work over its service life. We create parallel sections, where all Throughput cores are busy running
this load (or an expanded number of Throughput cores). We also create sequential sections, where
only one Expendable core is running. To assess how BubbleWrap performs with different propor-
tions of parallel and sequential sections, we parameterize the workload with the fraction WSEQ
of time spent in the sequential section — for the default number of Throughput cores, without
expansion. The evaluation characterizes BubbleWrap for WSEQ ∈ [0, 1].
2.5.4 Many-Core Microarchitecture
We model a 32-core many-core microarchitecture as described in Table 2.4. Based on ITRS projec-
tions [41], we use 16 Throughput and 16 Expendable cores as the default (not counting expansion
for the set of Throughput cores). We simulate the core microarchitecture with the SESC simula-
tor [42] instrumented with Wattch models [36].
Table 2.4: Microarchitectural parameters.
Technology node:   22 nm
fNOM:              4.5 GHz
Cores per chip:    16 T + 16 E
PMAX:              80 W (all cores), 5 W (T core), 10 W (E core)
Width:             6-fetch 4-issue 4-retire OoO
ROB:               152 entries
Issue window:      40 fp, 80 int
LSQ size:          54 LD, 46 ST
Branch pred:       80 kb tournament
L1 D Cache:        16 kB WT, 0.44 ns round trip, 4 way, 64 B line
L1 I Cache:        16 kB, 0.44 ns round trip, 2 way, 64 B line
L2 Cache:          2 MB WB per core, 2 ns round trip, 8-way, 64 B line, has stride prefetcher
Memory:            80 ns round trip
2.6 Evaluation
In this section, we assess the impact of BubbleWrap in terms of power and performance. The
evaluation optimistically assumes that the aging sensors always give correct information on the
critical path length and that the BubbleWrap controller adds no overhead. In the following, we
analyze DVSAM-Pow, DVSAM-Perf, DVSAM-Short, and VSAM-Short, and finally the different
BubbleWrap environments.
2.6.1 Enhancing Throughput: DVSAM-Pow
For the evaluation, we consider three cores to cover the whole Vth spectrum: one with VthNOM,
one with Vth at −3σ, and one with Vth at +3σ. They are all clocked at fNOM and work at
workload temperature conditions. To each of these cores, we apply DVSAM-Pow. The second
column of Table 2.5 shows, for each of the cores, the energy reductions attained with DVSAM-
Pow over the entire service life of the core. From the table, we see that DVSAM-Pow reduces the
energy consumption by 13–31%. The savings are more pronounced for the cores with low Vth,
which suffer from higher static energy consumption. For the core with VthNOM, the savings are a
substantial 23%.
Table 2.5: Benefits of DVSAM-Pow and DVSAM-Perf.
Deviation in Vth   Energy savings due to DVSAM-Pow (%)   Frequency increase due to DVSAM-Perf (%)
−3σ                31                                    18
0                  23                                    14
+3σ                13                                    10
We now take the core with VthNOM before and after the optimization and plot its temporal
evolution over the whole service life. We call the two resulting cores Base and DVSAM-Pow,
respectively, and show the Vdd evolution (Figure 2.7(a)), normalized critical path delay τ/τZG evolution
(Figure 2.7(b)), power evolution (Figure 2.7(c)), and temperature evolution (Figure 2.7(d)). The
plots show banded structures for the curves. They are due to temporal variations in the workload
as we execute the SPECint2000 applications in a round-robin manner.
Figure 2.7: Temporal evolution of Vdd (a), normalized critical path delay (b), power consumption
(c), and temperature (d) for a nominal core under DVSAM-Pow.
[Panels (a)-(d): BASE and DVSAM-Pow curves versus time (months); y-axes: Vdd (V), τ/τZG, P (W), T (°C).]
Figure 2.7(a) corresponds to the top row of Figure 2.2(a). We see that DVSAM-Pow keeps Vdd
about 0.15V lower than Base. Moreover, the difference does not decrease much over the whole
lifetime. To understand why, consider Figure 2.7(b), which corresponds to the second row of
Figure 2.2(a). While DVSAM-Pow’s critical path consumes all the guard-band from the beginning
(by design), Base’s critical path delay increases only a little, never consuming much of the
guard-band. The reason is that the guard-band is dimensioned for the worst-case conditions, namely
a core with Vth at +3σ operating at TMAX. In our case, Base has VthNOM and operates at workload
temperatures. As a result, it does not age as much and does not use most of the guard-band.
Overall, the across-the-board gap in Figure 2.7(b) induces the resulting gap in Figure 2.7(a). Note
that DVSAM-Pow only consumes the guard-band set aside for aging (and variation). There is
additional guard-banding present for other reasons, such as voltage noise. That guard-band remains.
Figure 2.7(c) shows that the lower operating voltage of DVSAM-Pow saves significant power
compared to the core operating continuously at VddNOM. Finally, Figure 2.7(d) shows that the
temperature also decreases due to the lower operating voltage.
2.6.2 Enhancing Frequency: DVSAM-Perf
We now take the three cores covering the Vth spectrum as explained in the beginning of Sec-
tion 2.6.1 and apply DVSAM-Perf. The third column of Table 2.5 shows, for each of the cores, the
frequency increases attained with DVSAM-Perf over fNOM. From the table, we see that DVSAM-
Perf increases the frequency by 10–18%. The increase is larger for the core with low Vth, since this
core can cycle faster for any given Vdd. For the core with VthNOM, the increase is a significant
14%.
Using the core with VthNOM before and after the optimization, we plot its temporal evolution
over the whole service life. We call the resulting two cores Base and DVSAM-Perf, respectively, and
show the evolution of Vdd (Figure 2.8(a)), normalized critical path delay τ/τZG (Figure 2.8(b)),
power (Figure 2.8(c)), and temperature (Figure 2.8(d)).
Figure 2.8(a) corresponds to the top row of Figure 2.2(b). We see that, with DVSAM-Perf, Vdd
starts around VddNOM at the beginning of the service life and increases beyond VddNOM there-
after. At the end of the service life, it reaches around 1.25 V. Consider now Figure 2.8(b), which
corresponds to the second row of Figure 2.2(b). Thanks to DVSAM-Perf’s high Vdd operation,
DVSAM-Perf keeps the critical path delay around 0.96 × τZG. As a result, DVSAM-Perf operates
at a constant frequency that is 14% higher than fNOM over the whole service life. Recall that fNOM
is the frequency of the Base core. It has a period of τZG × (1 + G) = τZG × 1.1. This substantial
frequency increase, even over the frequency of no guard-band (1/τZG), is possible because the
guard-band is dimensioned for a core with Vth at +3σ operating at TMAX.
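As a quick consistency check of the frequency gain, the ratio of the two clock periods quoted above can be worked out directly. The 0.96 factor is read off Figure 2.8(b) and is itself approximate, so the result is only indicative:

```latex
\[
\frac{f_{DVSAM\text{-}Perf}}{f_{NOM}}
  = \frac{\tau_{ZG} \times (1 + G)}{0.96 \times \tau_{ZG}}
  = \frac{1.1}{0.96} \approx 1.15
\]
```

This is in line with the roughly 14% increase reported for the core with VthNOM in Table 2.5.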
Figure 2.8(c) shows that the higher voltages of DVSAM-Perf cause a continuous increase in core
power. By the end of the service life, the power consumed by a core is 7 W. This is a high value,
but still less than the 10 W reserved for Expendable cores. As shown in Figure 2.8(d), the junction
temperature goes above 90 °C at the end of the service life.
Figure 2.8: Temporal evolution of Vdd (a), normalized critical path delay (b), power consumption
(c), and temperature (d) for a nominal core under DVSAM-Perf.
[Panels (a)-(d): BASE and DVSAM-Perf curves versus time (months); y-axes: Vdd (V), τ/τZG, P (W), T (°C).]
2.6.3 Core Popping: DVSAM-Short & VSAM-Short
We now consider the effect of popping Expendable cores, first with DVSAM-Short and then with
VSAM-Short. As before, we use a core with VthNOM. We are interested in the evolution of the
core at elevated voltage and frequency conditions for a period SSH that ranges from 0 to SNOM.
We consider DVSAM-Short first. Figure 2.9 considers all possible values of SSH and shows the
maximum Vdd that gets applied to the core (a), the frequency of the core relative to fNOM (b), the
maximum power consumed by the core (c), and the maximum temperature attained by the core
(d). Each chart highlights two data points, namely one for a one-month service life (labeled 1mo)
and one for a nominal service life (labeled SNOM). The latter corresponds to the DVSAM-Perf
mode.
Figure 2.9: Impact of core popping with DVSAM-Short: (a) maximum Vdd, (b) frequency gain
relative to fNOM, (c) maximum power, and (d) maximum temperature, each as a function of the
service life SSH (years).
Recall from the top row of Figure 2.2(c) that, under DVSAM-Short, Vdd starts off elevated and
continues to increase until the core pops (i.e., until we would need to violate VddMAX, PMAX, or
TMAX to maintain the frequency). Figure 2.9(a) shows that, in all cases of SSH, Vdd practically
reaches VddMAX, which is 1.3 V. Figures 2.9(c) and 2.9(d) show that the maximum core power and
temperature, respectively, are fairly similar for different SSH, and have not reached their allowed
limits: core power is 6.5–7 W and temperature is 90–95 °C.
Finally, Figure 2.9(b) shows that the frequency at which we can clock the core increases as SSH
becomes smaller. For a service life of one month, the core can be clocked at a frequency of 1.19
relative to fNOM. However, as SSH increases, the frequency quickly goes down. Very soon, we
attain frequencies not much higher than the one reached by DVSAM-Perf, namely 1.14 relative to
fNOM. This is because the Vdd conditions required to deliver high frequency quickly age the core,
limiting its service life.
We now consider VSAM-Short. This mode represents a simpler approach than DVSAM-Short
because we do not need to repeatedly adjust Vdd based on measurements of the critical path
delays. We simply set a constant, elevated VddSH for the duration of the shorter service life SSH.
This was shown in the top row of Figure 2.2(d). However, this mode is less effective than DVSAM-
Short because we need to set VddSH conservatively — with the same conservative assumptions
as we set VddNOM when we want the processor to last for SNOM. Specifically, VddSH should
guarantee that, if operated continuously at TMAX, the core with +3σ Vth deviation (the slow
corner) consumes the entire guard-band only at the end of its presumed short service life (SSH).
With these assumptions, we generate Figure 2.10(a), which shows for each short service life SSH,
the elevated VddSH that we need to apply. The figure highlights two data points, namely one for
a one-month service life (labeled 1mo) and one for the nominal service life (labeled SNOM). The
latter corresponds to a core operating at VddNOM. We can see that, for SSH = 1 month, we get a
VddSH of only 1.1 V.
Figures 2.10(b), (c), and (d) repeat Figures 2.9(b), (c), and (d) for VSAM-Short. As usual, we char-
acterize a core with VthNOM. From Figure 2.10(b), we see that the frequency at which we can clock
the core increases as SSH becomes smaller. However, VSAM-Short does not reach the frequency
values that DVSAM-Short attains in Figure 2.9(b). At a service life of one month, VSAM-Short
clocks the core at a frequency of 1.1 relative to fNOM, in contrast to DVSAM-Short’s 1.19.
Figure 2.10: Impact of core popping with VSAM-Short: (a) maximum Vdd (VddSH), (b) frequency gain
relative to fNOM, (c) maximum power, and (d) maximum temperature, each as a function of the
service life SSH (years).
Figures 2.10(c) and 2.10(d) show that the maximum power and temperature reached by an
Expendable core with a short service life of one month are around 5.2 W and around 82 °C,
respectively. This is in contrast to the higher values attained with DVSAM-Short, namely 7 W and 93 °C
(Figures 2.9(c) and 2.9(d)).
2.6.4 BubbleWrap Environments
We now estimate the performance and power impact of each of the BubbleWrap environments
described in Figure 2.4 and Table 2.2. BubbleWrap increases the frequency of sequential sections
by popping Expendable cores, and the throughput of parallel sections by expanding the set of
Throughput cores. Next, we examine the frequency of sequential sections, the throughput of
parallel sections, and finally the performance and power of the application as a whole.
2.6.4.1 Sequential Section Frequency
The frequency increase achievable by popping cores is a function of the service life per Expendable
core (SSH). From Equation 2.7, SSH depends on the sequential load (LSEQ). The smaller LSEQ is,
the shorter is SSH, and the larger is the frequency boost provided by each Expendable core.
To study a range of SSH, we take our workload from Section 2.5.3 and vary the fraction of its
original execution time in the sequential section (WSEQ), from 1 to 0. Note that, for our workload,
LSEQ = WSEQ. Then, for different values of WSEQ, Figure 2.11 shows the frequency attained by the
sequential section relative to fNOM in all the BubbleWrap environments of Figure 2.4.
Figure 2.11: Frequency of the sequential section for each environment.
[Plot: frequency relative to fNOM versus WSEQ (from 1 to 0) for Base, SShortE+BaseT, BaseE+PowT,
SShortE+PowT, PerfE+PowT, and DShortE+PowT.]
There are three groups of environments. First, there are the environments that do not use
Expendable cores (Base and BaseE+PowT); these cannot get any increase in frequency. The second
group is the environment that uses Expendable cores but does not pop them (PerfE+PowT).
Expendable cores are expected to last for SNOM and, for performance, are run in DVSAM-Perf mode.
As a result, PerfE+PowT increases the frequency of the sequential section by a fixed amount
irrespective of WSEQ. This increase is 14%, as already shown in Table 2.5.
Finally, there are the environments that pop Expendable cores (SShortE+BaseT, SShortE+PowT,
and DShortE+PowT). These environments increase the sequential section frequency. The
lower WSEQ is, the higher their impact is, except for WSEQ = 0, where they have no impact.
Among the three environments, the one that uses DVSAM-Short (rather than VSAM-Short) is the
one with the highest impact: DShortE+PowT. For a fully sequential execution (WSEQ = 1), we
see that this environment provides a 16% frequency increase. Finally, SShortE+PowT increases
the frequency less than SShortE+BaseT because, in SShortE+PowT, some Expendable cores
are turned into Throughput ones during the parallel section when the set of Throughput cores
expands.
2.6.4.2 Parallel Section Throughput
The throughput increase achievable by BubbleWrap environments during parallel sections de-
pends on whether or not they can expand the set of Throughput cores. There are two groups of
environments. First, there are the environments that do not expand the set of Throughput cores
(Base and SShortE+BaseT); these cannot get any increase in throughput. The other group is the
rest of the environments (BaseE+PowT, SShortE+PowT, PerfE+PowT, and DShortE+PowT);
they all use the DVSAM-Pow mode on Throughput cores and, therefore, expand them equally
during parallel sections. Assuming that the parallel section has enough parallelism, we can in-
crease the number of Throughput cores to make up for the power saved by DVSAM-Pow. As
shown in Figure 2.7(c), DVSAM-Pow reduces the power of a core from 4.1 W to 3.1 W on average.
This is a reduction of 24%. Consequently, for constant power, the number of Throughput cores
to execute the parallel section can be increased from the original 16 cores to 16 × 4.1/3.1 ≈ 21
cores. Overall, the per-core savings lead to a throughput increase of about 30% (st ≈ 1.3) for
these BubbleWrap environments.
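The expansion arithmetic above can be reproduced with a short calculation. The sketch below is illustrative; the function name is hypothetical, and it assumes that the expanded set is the largest whole number of cores that fits in the power originally drawn by the default set.

```python
def expanded_throughput_cores(n_cores, p_before, p_after):
    """Largest core count at p_after watts each that fits the original power."""
    budget = n_cores * p_before       # power drawn by the default set
    return int(budget // p_after)     # whole cores only


n_expanded = expanded_throughput_cores(16, p_before=4.1, p_after=3.1)
throughput_gain = n_expanded / 16     # assumes throughput scales with core count
print(n_expanded, throughput_gain)
```

With the average powers read from Figure 2.7(c), this gives 21 cores and a relative throughput of about 1.31, consistent with the roughly 30% increase quoted above.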
2.6.4.3 Estimated Overall Speedup and Power Cost
From the previous two sections, we can now roughly estimate the overall speedup and the power
cost of each BubbleWrap environment. We consider a workload that has a fraction of sequential
time WSEQ. For each BubbleWrap environment, we assume that the time of the parallel section
scales down perfectly with the corresponding throughput increase st of Section 2.6.4.2; similarly,
we assume that the time of the sequential section scales down perfectly with the relative frequency
increase fr of Figure 2.11. With these optimistic assumptions, we obtain the following execution
time speedup over Base:
\begin{equation}
\text{Speedup} = \frac{\text{Time}_{unoptimized}}{\text{Time}_{optimized}}
  = \frac{1}{W_{SEQ}/f_r + (1 - W_{SEQ})/s_t}
\tag{2.10}
\end{equation}
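Equation 2.10 can be evaluated directly once the relative frequency and relative throughput are known. The sketch below is a minimal illustration with a hypothetical function name; f_r and s_t are taken as ratios relative to Base (for example, 1.16 and 1.3), and in practice f_r itself varies with WSEQ as in Figure 2.11.

```python
def speedup(w_seq, f_r, s_t):
    """Execution time speedup over Base per Equation 2.10.

    w_seq: fraction of the original time spent in sequential sections.
    f_r:   relative frequency of the sequential section (1.0 = no gain).
    s_t:   relative throughput of the parallel section (1.0 = no gain).
    """
    return 1.0 / (w_seq / f_r + (1.0 - w_seq) / s_t)


print(speedup(1.0, f_r=1.16, s_t=1.3))  # fully sequential: limited by f_r
print(speedup(0.0, f_r=1.16, s_t=1.3))  # fully parallel: limited by s_t
print(speedup(0.5, f_r=1.0, s_t=1.3))   # only parallel sections accelerated
```

The two end points show why the best environment's speedup is bounded by the frequency gain when WSEQ = 1 and by the throughput gain when WSEQ = 0.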
Figure 2.12 shows these speedups for all the BubbleWrap environments and different WSEQ val-
ues. The speedups are normalized to Base. There are three groups of environments. First,
SShortE+BaseT only speeds up sequential sections; its impact shrinks as WSEQ decreases. Second,
BaseE+PowT only speeds up parallel sections and, therefore, its effect goes down as WSEQ
increases. Finally, the remaining three environments speed up both sequential and parallel
sections. They are the environments that show the highest speedups across most of the WSEQ range.
Their relative speedups follow their relative increase in sequential section frequency in Figure 2.11.
Overall, DShortE+PowT gives the highest speedups, which range from 1.16 to 1.30.
Figure 2.12: Speedup of BubbleWrap environments.
[Plot: overall speedup versus WSEQ (from 1 to 0) for Base, SShortE+BaseT, BaseE+PowT,
SShortE+PowT, PerfE+PowT, and DShortE+PowT.]
Note that the best-performing environment (DShortE+PowT, which supports core popping) is
not any more complex than the next best-performing one (PerfE+PowT, which does not support
core popping). Both environments use two supply networks, tune VddE and VddT, and have
aging sensors in both Expendable and Throughput cores. However, the third best-performing
environment (SShortE+PowT) is significantly simpler, since it does not require tuning VddE or
keeping aging sensors for Expendable cores.
Finally, Figure 2.13 shows the power consumption of the different BubbleWrap environments
normalized to the power of Base. The power cost of BubbleWrap operation stems only from se-
quential section acceleration — since the expansion in the set of Throughput cores has no power
cost. Consequently, the normalized power increases with WSEQ. Moreover, the figure shows
that the most power-consuming environments are those that use the DVSAM-Short (or DVSAM-Perf)
modes, namely DShortE + PowT and PerfE + PowT. This is because these modes
raise VddE to high values. On the other hand, the environments that use the VSAM-Short mode
(SShortE + BaseT and SShortE + PowT) consume much less power. This is because VSAM-Short
does not tune VddE and, therefore, has to set VddE to conservatively low values.
[Figure 2.13 plot: normalized power consumption (y-axis, 0 to 2) versus WSEQ (x-axis, 1 down to 0) for Base, SShortE+BaseT, BaseE+PowT, SShortE+PowT, PerfE+PowT, and DShortE+PowT.]
Figure 2.13: Power consumption of BubbleWrap environments.
2.6.5 Discussion
All the data shown in our evaluation except for Table 2.5 corresponds to processors with VthNOM.
In reality, the cores in a BubbleWrap chip will exhibit process variation, which will affect how
they respond to the BubbleWrap modes as shown in Table 2.5. This makes some of our results
slightly optimistic while others, slightly pessimistic. Specifically, the best cores will be chosen
for the Throughput set. This is likely to result in higher parallel section speedup than reported
here. However, it will also leave the worst cores for the Expendable set, which will result in lower
sequential section speedup (or higher power consumption) than reported here. A further study
should be done to address this issue.
We constrained the operation of all cores, including those that pop, to be within the TMAX and
VddMAX envelopes. Respecting TMAX and VddMAX ensures a reliable service life of SNOM. In
reality, for a core that will have a short life, it may be possible to go above TMAX or VddMAX. This
could allow core popping to deliver higher performance than reported here — perhaps at the cost
of risking new types of hard or soft failures. This topic will be studied in future work.
We relied on DVSAM for core popping and did not explore the design space. For instance, we
excluded changes in frequency within the course of the service life. DVSAM can be extended
to also include changes in frequency with time. This improvement should deliver better design
points. Moreover, with technology scaling, specifically, with the introduction of novel device ar-
chitectures, new wear-out mechanisms can be identified, which may give rise to radically different
means for core popping.
Finally, the proof-of-concept analysis we provided for BubbleWrap relied conservatively on
a very small scale many-core. As process technology scales, the fraction of Expendable cores
increases and the service life demanded per Expendable core declines. Consequently, the benefits
of BubbleWrap will increase as technology advances.
2.7 Related Work
The impending many-core power wall is well-known in industry and reflected in recent ITRS
projections. However, our suggestion to expend (or pop) the excess cores that cannot be powered
on is novel, as far as we know. Another proposal to address the many-core power wall is to use
extremely low supply voltages, possibly close to the threshold voltage [43, 44]. This environment
would allow all cores to operate simultaneously, albeit at severely reduced frequencies. Yet an-
other alternative is a design comprising power-efficient, heterogeneous, application-specific accel-
erators [45]. BubbleWrap is unique in extending the scaling of homogeneous many-cores without
requiring core modifications.
Recently, processor aging has been the subject of interest in the computer architecture commu-
nity (e.g., [13–20]). Some of this work attempts to reduce aging by setting the logic to recovery
mode [13, 16]. However, the aging work most closely related to ours is Facelift [20]. Facelift
proposed applying a few discrete voltage levels to minimize aging and showed how shorter service
lives could be exploited to increase core frequency. BubbleWrap’s DVSAM framework improves
on these techniques with continuous voltage tuning and with the power-saving DVSAM-Pow
mode. Additionally, BubbleWrap proposes a set of novel architectures (or environments) that
use DVSAM. Finally, several authors have designed circuits that detect when a critical path has
slowed down [14, 15, 17, 29–32]. Some of the BubbleWrap environments use these aging sensors.
2.8 Summary
To push back the many-core power wall, we made two main contributions. First, we introduced
Dynamic Voltage Scaling for Aging Management (DVSAM) — a new scheme for managing processor
aging by tuning Vdd (but not the frequency), exploiting any instantaneous aging guard-band. The
goal can be one of the following: consume the least power for the same performance and service
life; attain the highest performance for the same service life and within power constraints; or
attain even higher performance for a shorter service life and within power constraints.
Second, we presented BubbleWrap, a novel many-core architecture that makes extensive use of
DVSAM to push back the many-core power wall. BubbleWrap selects the most power-efficient set
of cores in the die — the largest set that can be simultaneously powered on — and designates them
as Throughput cores. They are dedicated to parallel-section execution. The rest of the cores are
designated as Expendable. They are dedicated to accelerating sequential sections. BubbleWrap
applies DVSAM in several environments. In some of them, BubbleWrap attains maximum se-
quential acceleration by sacrificing Expendable cores one at a time, running them at elevated Vdd
for a significantly shorter service life each, until they completely wear out. In simulated 32-core
chips, BubbleWrap provides substantial gains over a plain chip with the same power envelope.
On average, our most aggressive design runs fully-sequential applications at a 16% higher fre-
quency, and fully-parallel ones, with a 30% higher throughput.
CHAPTER 3
VARIUS-NTV: A MICROARCHITECTURAL MODEL OF
PROCESS VARIATIONS FOR NEAR-THRESHOLD
VOLTAGE COMPUTING
3.1 Introduction
Power consumption is typically the primary concern in today’s computer platforms, ranging from
datacenters to handhelds. This is because CMOS technology has long ago stopped scaling close
to perfectly and, as a result, power density has been increasing significantly each technology gen-
eration. To be able to continue delivering scalable computing performance, unconventional ways
to compute more energy-efficiently should be explored.
One way to attain high energy efficiency is to reduce the supply voltage Vdd to a value only
slightly higher than the threshold voltage Vth. This environment is called near-threshold volt-
age computing (NTC) [8, 9, 46] — as opposed to conventional super-threshold voltage computing
(STC). Vdd represents a powerful knob because both dynamic and static power reduce with Vdd
super-linearly. Current indications suggest that NTC can decrease the energy per operation by
several times over STC [8, 9]. A drawback is that NTC imposes a frequency degradation, which
may be tolerable through more parallelism – to the extent the application domain permits. For
parallel loads, since more cores can be running concurrently within the chip’s power envelope,
the end result is a higher throughput.
A roadblock for NTC is the increased sensitivity to process (parametric) variations — i.e., the
deviation of device parameters from their nominal values. Already in current-technology STC
multi-cores, process variations result in noticeable differences in power and performance across
the chip [47]. Due to the low operating Vdd [46], process variations of the same magnitude cause
a substantially larger shift in transistor power consumption and speed at NTV. Process variations
are undesirable because they result in chips that consume more static power, cycle at lower fre-
quencies, and can even be faulty.
Process variations should be holistically addressed across all levels of the system stack, includ-
ing the computer architecture. To confront variations at the architecture level, models of paramet-
ric variations are required at a level of abstraction that is useful to microarchitects. Such models
exist for STC (e.g., [48–52]). Unfortunately, none of these is applicable to NTC — NTC requires
new memory structures along with new power and performance models.
This chapter presents the first microarchitectural model of process variations for NTC, VARIUS-
NTV, which extends the VARIUS model for STC [52]. The key aspects include: (i) Adopting a gate
delay model and an SRAM cell type that are tailored to NTC, (ii) modeling SRAM failure modes
emerging at NTV, and (iii) accounting for the impact of leakage current in SRAM timing and sta-
bility models. VARIUS-NTV captures how variation affects power consumed and the frequency
attained by cores and memories in an NTC many-core, and models the timing and stability faults
in SRAM cells at NTV. The model is validated against an experimental 80-core prototype chip [47].
For a simulated 288-core clustered many-core at 11nm, the following analysis shows that the
expected process variations induce higher differences in frequency f and power at NTV. For ex-
ample, the maximum difference in cluster f within a chip is ≈ 3.7× at NTV and only ≈ 2.3× at
STV. Different cluster organizations in the chip and different configurations of on-chip Vdd- and
f-domains are considered. This study demonstrates that variation management is critical at NTV.
In the following, Section 3.2 provides a background; Section 3.3 presents the VARIUS-NTV
variation model; Section 3.4 describes the many-core architecture evaluated; Section 3.6 outlines
the initial validation of VARIUS-NTV; Sections 3.5 and 3.7 cover the variation analysis deploying
VARIUS-NTV; and Section 3.8 discusses related work.
3.2 Background
3.2.1 Process Variation Basics
With each technology generation, manufacturing restrictions exacerbate vulnerability to process
variations, which manifest across the chip as static, spatial fluctuations in transistor parameters
around the nominal values [12, 53]. Process variation is usually analyzed at global (die-to-die,
D2D) and local scopes (within-die, WID). D2D variation can be approximately captured by a
global offset per die, on a per-parameter basis. WID variation, on the other hand, is responsible
for the heterogeneity in transistor parameters across the building blocks of any given die, and
forms the focus of this chapter.1 WID variation is caused by systematic effects due to lithographic
irregularities and random effects due to, for example, varying dopant concentrations [54]. Two
important vulnerable parameters are the threshold voltage Vth and the effective channel length
Leff, which directly determine a transistor’s switching speed and leakage power.
Since systematic and random effects are triggered by different physical phenomena, their im-
pact can be captured by two independent random variables per vulnerable parameter. The end
impact is additive as given by Equation 3.1, where Vth0 and Leff0 represent the nominal values
of Vth and Leff; ∆VthSYS and ∆LeffSYS, the parametric shifts due to systematic effects; and ∆VthRAND
and ∆LeffRAND, the parametric shifts due to random effects, respectively.
\[
\begin{aligned}
V_{th} &= V_{th0} + \Delta V_{thWID} = V_{th0} + \Delta V_{thSYS} + \Delta V_{thRAND} \\
L_{eff} &= L_{eff0} + \Delta L_{effWID} = L_{eff0} + \Delta L_{effSYS} + \Delta L_{effRAND}
\end{aligned}
\tag{3.1}
\]
The higher the Vth and Leff variation, the higher the variation in transistor speed
and, hence, in path delay and frequency across the chip. Such parametric variation results in
slower processors, since variation-afflicted path delay generally follows an asymmetric distribu-
tion, and the slowest path ends up determining the frequency of the whole processor.2 In ad-
dition, the difference between the frequencies of different cores of a many-core increases. Also,
as Vth varies, transistor leakage varies across the chip. However, low-Vth transistors consume
more power than high-Vth ones save, due to the exponential dependence of leakage power on
−Vth [55]. As a result, under variation, chips consume substantially more leakage power. In a
many-core, different cores leak different quantities.
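The asymmetry between low-Vth and high-Vth transistors can be illustrated with a small Monte Carlo sketch. It assumes that leakage scales as exp(−Vth/(n × vt)), that the Vth shift is Gaussian with zero mean, and that n = 1.5, vt ≈ 25.85 mV, and σ = 30 mV; these values and all function names are illustrative assumptions rather than parameters of the dissertation's models.

```python
import math
import random

N_SLOPE = 1.5      # assumed sub-threshold slope factor n
V_T = 0.02585      # thermal voltage near 300 K, in volts
SIGMA_VTH = 0.03   # assumed standard deviation of the Vth shift, in volts


def relative_leakage(delta_vth: float) -> float:
    """Leakage relative to a nominal transistor, for a given Vth shift."""
    return math.exp(-delta_vth / (N_SLOPE * V_T))


def mean_relative_leakage(samples: int = 100_000, seed: int = 1) -> float:
    """Average relative leakage over zero-mean Gaussian Vth shifts."""
    rng = random.Random(seed)
    total = sum(relative_leakage(rng.gauss(0.0, SIGMA_VTH))
                for _ in range(samples))
    return total / samples


def closed_form_mean() -> float:
    """Mean of the resulting log-normal: exp(s^2 / 2), s = sigma / (n * vt)."""
    s = SIGMA_VTH / (N_SLOPE * V_T)
    return math.exp(0.5 * s * s)


if __name__ == "__main__":
    print("Monte Carlo mean :", mean_relative_leakage())
    print("Closed-form mean :", closed_form_mean())
```

Although the Vth shifts average to zero, the mean relative leakage exceeds 1 under these assumptions, because the extra leakage of the low-Vth transistors outweighs the savings of the high-Vth ones.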
When compared to logic, on-chip memory blocks are more prone to variation, as triggered by
(i) aggressive transistor sizing to satisfy high density requirements, and (ii) greater sensitivity to
transistor mismatch. With technology scaling, the minimum operating voltage of a conventional
SRAM array tends to increase to accommodate higher safety margins as induced by exacerbated
variability.
1 Note that D2D and WID variation may be correlated.
2 Under process variation, some paths get faster, while others become slower. Theoretically, it could be the case that
none of the slower paths exhibit a delay higher than the nominal critical path – the ideal critical path were there no
variation; then, the nominal operating frequency would be preserved. Practically, this scenario is excluded since most
of the paths are optimized to take a value very close to the critical path. Assume that for a given variation-afflicted
path delay distribution, a standard deviation of σ0 results in some of the paths becoming slower when compared to the
nominal critical path. More variation translates to a larger standard deviation. What is the impact of a larger standard
deviation σ > σ0? Slower paths would be slower, and faster paths would be faster. Hence, the operating frequency
would be lower.
3.2.2 SRAM Operation in the Presence of Variation
Figure 3.1 depicts a conventional 6-transistor (6T) SRAM cell. The cell consists of two inverters,
formed by transistors PR-NR and PL-NL, connected in a positive feedback loop, and two access
transistors, AXR and AXL. VR stores the cell’s value and VL, its complement. To read from or write
to the cell, word-line WL is driven high to connect the cell to the bit-lines BL and BR. To read, the
bit-lines are pre-charged to logic high. To write, BR is pre-conditioned to the value to be written
and BL, to its complement. This cell is liable to five types of failures [56]. In the following discus-
sion, VR = 0 is assumed without loss of generality.3
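[Figure 3.1 schematic, panels (a) and (b): 6T cell with word-line WL, bit-lines BL and BR, access transistors AXL and AXR, storage nodes VL and VR, and the PL/NL and PR/NR inverters.]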
Figure 3.1: Conventional 6-transistor (6T) SRAM cell architecture: VR and VL are the voltages at
the nodes R and L, respectively.
A read upset failure occurs if, during a read operation, the cell flips its contents. Reading starts
by setting WL = BR = BL = 1. While reading VR = 0 (VL = 1), BL keeps its value and BR
discharges over AXR and NR, giving rise to a voltage difference between the bit-lines, which is
captured by the sense amplifiers to extract the cell content. Sense amplifiers accelerate the read by
detecting small voltage differences between the bit-lines.
While BR discharges, due to charge sharing between AXR and NR, VR increases. NR should be
stronger 4 than AXR such that VR does not reach the switching threshold of the PL/NL inverter
during the read, otherwise a read upset flips the cell content [56]. Read upsets can be triggered
by variation leading to an increase in VthNR while VthAXR decreases (i.e., AXR becomes stronger
than NR), and/or causing shifts in VthPL and/or VthNL such that the switching threshold of the
PL/NL inverter decreases.
3 Since the cell is symmetric, this discussion applies directly to the case where VR = 1.
4 Current drive capability represents a measure of strength. If NR is stronger than AXR, VR would assume a value
closer to logic low than logic high.
A read access failure occurs if, during a read, the time needed to produce a voltage difference
between the two bit-lines exceeds the period that WL stays high. This failure occurs when the
discharge transistors (AXR and NR) are too slow: Access failures are triggered by increasing Vth
of AXR and/or NR in the presence of variation, such that the transistors become slower.
A write stability failure occurs if, during a write, the cell cannot change its logic state even if
the write duration is extended to infinity. To write a 1 to a cell storing 0, i.e., VR = 0 (VL = 1),
BL is driven low while BR is pre-conditioned high. Once WL gets triggered, BR discharges over
AXR-NR, as was the case for the read. However, because the cell is sized to ensure read stability,
AXR cannot drive VR high enough to reach the switching threshold of the PL/NL inverter. Hence, VL can only be driven low
through AXL. Since BL is 0, current flows through PL and AXL. Once VL decreases enough to
reach the switching threshold of the inverter PR/NR, the positive feedback loop completes the
write by pulling up VR high. This is opposed by PL, pulling VL to Vdd. Depending on the strength
of these transistors, and that of PR and NR, VL may never reach a value low enough to trigger the
switching of the PR-NR inverter, hence, the flipping of the cell contents. To ensure write stability,
PL should be weaker than AXL.
In a cell without the write stability failure, a write timing failure occurs if the write is unable to
change the logic state of the cell by the end of the designated write duration. Write failures are
triggered by variation-induced shifts in the switching threshold of the PR/NR inverter and/or
PL becoming stronger than AXL.
A hold failure occurs if the logic value held by the cell is distorted by excessive leakage of the
transistors forming the core of the cell while the cell is not being accessed. Hold failures are
triggered by variation-induced shifts in Vth of the transistors forming the core of the storage cell.
While write stability improves as the access transistors get stronger (PL should be weaker than AXL),
read stability demands weaker access transistors (NR should be stronger than AXR). Due to the circuit
symmetry, analogous semantics apply to reading from and writing to a cell storing 1, i.e., VR = 1 (VL = 0).
3.2.3 The Impact of Process Variations at NTV
At near-threshold voltages, the sensitivity of circuit timing and power consumption to variations
in Vth and Leff is drastically higher than at conventional super-threshold voltages. The impact
of a given ∆VthWID and ∆LeffWID on an environment with low voltage, such as NTC, is much
higher than on one with high voltage.
Severe frequency variation at NTV stems from increased sensitivity of transistor delay to Vth [57],
as depicted in Figure 3.2. The transistor delay is obtained from the model of Markovic et al. [46]
as Vth is varied.
[Figure 3.2 plots. (a) Transistor delay for different Vth: transistor delay (ns) versus Vth (V) at 11nm, for Vdd = 0.4 V, 0.5 V, and 0.6 V. (b) Delay penalty due to variations: delay penalty (%) versus Vdd (V), for 22nm and 11nm.]
Figure 3.2: Increased sensitivity of circuit timing to variation at NTV.
The Vth range is obtained as follows: First, it is assumed that Vth follows a Gaussian distribu-
tion [54]. The nominal Vth suggested by ITRS [3] at 11 nm constitutes the mean, µ = Vth0 = 0.3
V. (σ/µ)Vth = 10% applies for the total (systematic and random) variation in Vth; hence, the
variation-induced interval Vth0 × (1 ± 3σ/µ) covers the Vth range of [0.21, 0.39] V. We see that, for Vdd = 0.6
V, the difference in delay between transistors of Vth = 0.25 V and 0.35 V is only 30 ps, while for
Vdd = 0.4 V, it jumps to over 200 ps, increasing by ≈7×. As a result of this larger delay variation,
the frequency that a processor can support at NTV varies more.
Figure 3.2(b) from Chang et al. [8] shows the guard-band of the critical path due to variations
for different Vdd. As Vdd decreases toward Vth, the guard-band required quickly increases, di-
minishing the potential benefit of NTC. Overall, therefore, it is crucial to develop techniques to
cope with process variations in an NTC environment.
Variation in power consumption at NTV increases over STV for two reasons: (1) variation in dynamic
power increases due to the increase in variation in frequency; and (2) the share of static power,
which is more sensitive to variation, increases at NTV.
3.2.4 Modeling Process Variations at STV
There are several microarchitectural models that analyze the impact of process variations on the
frequency and power of cores and memories at a level that is useful to microarchitects (e.g., [48–
52]). However, these proposals only apply to STC, and not to NTC. This dissertation builds on one of
these models, VARIUS [52], and substantially extends it so that it applies to NTC. To understand
the contributions, this section provides an overview of VARIUS.
VARIUS starts with characterization of the variation in Vth and Leff, which exhibits a systematic
and a random component. VARIUS captures systematic WID variation by dividing the die
into a grid, where each grid point is assigned a Vth and Leff value as sampled from a multivariate
Gaussian distribution. Systematic variation shows spatial correlation, in that parameters of
transistors in immediate vicinity show less diversity. VARIUS captures spatial correlation by a
position- and direction-independent (spherical) correlation function. The correlation between two
points on-die only depends on their Euclidean distance; the correlation decreases first linearly,
then sub-linearly, as the distance increases. This function converges to zero – to demarcate negli-
gible correlation – beyond the correlation distance, φ. The grid granularity should be tailored as a
function of φ. Due to spatial correlation, specific regions of the die exhibit a systematic shift of the
same quantity and in the same direction on a per parameter basis; the grid-granularity should be
set to sample at least one point from each such region.
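A correlation function with the properties described above can be sketched as follows. The code assumes the standard spherical model from geostatistics, ρ(r) = 1 − 3r/(2φ) + r³/(2φ³) for r ≤ φ and 0 beyond; whether VARIUS uses exactly this expression should be checked against [52], and the function name is hypothetical.

```python
def spherical_correlation(r: float, phi: float) -> float:
    """Correlation between two die points at Euclidean distance r.

    Assumed spherical model: close to linear decay for small r,
    flattening as r approaches the correlation distance phi,
    and exactly zero for r >= phi.
    """
    if phi <= 0.0:
        raise ValueError("phi must be positive")
    if r < 0.0:
        raise ValueError("r must be non-negative")
    if r >= phi:
        return 0.0
    x = r / phi
    return 1.0 - 1.5 * x + 0.5 * x ** 3


if __name__ == "__main__":
    PHI = 0.5  # correlation distance as a fraction of the die width (assumed)
    for r in (0.0, 0.1, 0.25, 0.4, 0.5, 0.8):
        print(f"r={r:.2f}  rho={spherical_correlation(r, PHI):.4f}")
```

Because the function depends only on the distance r, it is independent of both position and direction, which is the property the text attributes to the VARIUS correlation model.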
Random variation, on the other hand, does not exhibit spatial correlation. Hence, each transis-
tor should be considered individually. Unless each grid point corresponds to a transistor, random
variation cannot be captured at grid granularity. VARIUS models the random component analyt-
ically per variation-afflicted parameter.
VARIUS deploys Equation 3.2 to extract the variation in gate delay, and the static power model
from [58] to capture the variation in static power, as induced by variation in Vth and Leff.
\[
\tau \propto \frac{V_{dd} \times L_{eff}}{\mu (V_{dd} - V_{th})^{\alpha}} \tag{3.2}
\]
To find the path delay distribution of a pipeline stage, VARIUS proceeds differently depending
on whether the stage has only logic, only SRAM memory access, or a combination of both. For
logic, it assumes that wire delays do not suffer from variations and, knowing the number of gates
in a logic path, it uses the gate delay variations, along with distribution of path delays per stage
were there no variation (as extracted from, e.g., Razor data [59]), to compute the path delay vari-
ation. For a stage with memory access, VARIUS relies on the 6T SRAM cell of Figure 3.1. From
the variation in transistor delays, it computes the variation in cell read access time. It assumes
that the read access time determines memory timing. Then, using the cell access time, it computes
the memory line access time. Note that the pipeline stage also contains some logic, namely the
decoder, the logic at the intersection of word- and bit-line, and the logic at the sense amplifier. The
delay through all this logic is modeled by relying on the VARIUS logic stage model. The distri-
bution of total path delay in the stage is composed by linear combination of the memory access
delay and the delay of the peripheral logic components.
For pipeline stages that combine both logic and memory access, VARIUS estimates the delay
distribution by appropriately weighting the delay of logic and memory components.
The pipeline stage with the longest delay determines the safe frequency of the processor. The
static power (PSTA) of the processor, on the other hand, is found by integrating the PSTA of all of
its transistors.
3.3 VARIUS-NTV: A Microarchitectural Model of Process Variations Tailored for NTC
Since NTC does not impose radical changes in the manufacturing process, the VARIUS methodology
to extract Vth and Leff variation maps still applies, although the values of parameters such
as σ may change. These parameters are directly affected by technology scaling trends.
Technology scaling came along with frequency improvement – though at a progressively slower
rate each generation – while the share of static power has been rapidly increasing. NTC is most
effective for throughput-critical environments as opposed to latency-critical. Also, the significant
reduction in dynamic power due to NTC is not accompanied by a corresponding reduction in the
static power, thus NTC induces a much higher share of static power when compared to STC. Based
on these observations, it can be argued that older technology generations – as characterized by a
lower frequency and lower leakage – conform better to NTC; however, in [60] it is demonstrated
that for (throughput) performance-critical application spaces, new technologies are still favorable.
On the other hand, the VARIUS performance model and timing error infrastructure cannot be
directly deployed at NTV, for the following reasons. (1) The performance (frequency) model is
based on the alpha-power law [24], which does not accurately capture operation near the threshold
voltage. (2) The VARIUS timing model neglects the impact of (sub-threshold) leakage. At
NTV, the magnitude of off-current (sub-threshold leakage current) decreases due to lower Vdd;
however, the magnitude of the on-current decreases more. Hence, the relative impact of leakage
current (i.e., the ratio of off-current to on-current) increases. (3) The memory timing model relies
on a conventional 6T SRAM cell, which cannot reliably operate at near-threshold voltages. (4) The
only memory failure mode considered is read access; however, at NTV, write stability and write
timing failures become critical along with hold failures. The following details how VARIUS-NTV
differs from VARIUS regarding these four aspects.
3.3.1 Gate Delay
To model the gate delay (τg), VARIUS uses the alpha-power law (Equation 3.2), where α is a
process parameter capturing carrier velocity saturation and µ identifies the carrier mobility as a
function of the temperature, T. This equation does not model the region near the threshold voltage
accurately. Alpha-power law variants [61–64] attempt to extend the model to the sub-threshold
region. Usually, these come along with an increased number of fitting parameters that have no
direct physical interpretation. Further, covering the sub-threshold region does not necessarily
imply that the near-threshold region is properly modeled.
Consequently, VARIUS-NTV deploys the EKV-based [65] model proposed by Markovic et al. [46].
The formula for the on-current is given in Equation 3.3, where vt represents thermal voltage and
n characterizes a process dependent parameter determined by sub-threshold characteristics.
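\[
I = \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\left(e^{\frac{V_{gs} - V_{th}}{2 \times n \times v_t}} + 1\right) \quad \text{where } \mu \propto T^{-1.5} \tag{3.3}
\]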
The resulting gate delay from CV/I is shown in Equation 3.4. Since the EKV model covers all
regions of operation, Equation 3.4 remains equally valid at ST and at NT voltages. In all cases, Vth
is a function of Vdd and temperature as per Equation 3.5, where Vth0, Vdd0 and T0 are the nominal
values of these parameters, and kT and kDIBL represent constants of proportionality capturing the
impact of T and DIBL (drain induced barrier lowering) on Vth, respectively.
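\[
\tau_g \propto \frac{V_{dd} \times L_{eff}}{\mu \times n \times v_t^2 \times \ln^2\left(e^{\frac{V_{dd} - V_{th}}{2 \times n \times v_t}} + 1\right)} \quad \text{where } \mu \propto T^{-1.5} \tag{3.4}
\]
\[
V_{th} = V_{th0} + k_{DIBL}(V_{dd} - V_{dd0}) + k_T(T - T_0) \tag{3.5}
\]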
VARIUS-NTV captures variation in gate delay by Equation 3.4 as a function of the underlying
Vth and Leff variation maps, process parameters, and operating conditions. Appendix B covers
the details of the logic timing model as adopted from VARIUS.
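The sensitivity captured by Equations 3.4 and 3.5 can be explored with a minimal Python sketch. The values n = 1.5 and vt ≈ 25.85 mV, the normalization of µ and Leff to 1, and all function names are assumptions made for illustration; the result is a relative delay in arbitrary units, not a calibrated model.

```python
import math

V_T = 0.02585    # thermal voltage near 300 K, in volts
N_SLOPE = 1.5    # assumed sub-threshold slope factor n


def vth_effective(vth0, vdd, vdd0, temp, temp0, k_dibl, k_t):
    """Equation 3.5: Vth as a function of Vdd and temperature."""
    return vth0 + k_dibl * (vdd - vdd0) + k_t * (temp - temp0)


def relative_gate_delay(vdd, vth, leff=1.0, mu=1.0):
    """Equation 3.4 up to a constant factor (arbitrary units)."""
    x = (vdd - vth) / (2.0 * N_SLOPE * V_T)
    drive = mu * N_SLOPE * V_T ** 2 * math.log1p(math.exp(x)) ** 2
    return vdd * leff / drive


def delay_ratio(vdd, vth_low=0.25, vth_high=0.35):
    """Delay of a high-Vth gate relative to a low-Vth gate at the same Vdd."""
    return relative_gate_delay(vdd, vth_high) / relative_gate_delay(vdd, vth_low)


if __name__ == "__main__":
    for vdd in (0.4, 0.5, 0.6, 0.8, 1.0):
        print(f"Vdd={vdd:.1f} V  delay(Vth=0.35)/delay(Vth=0.25) = {delay_ratio(vdd):.2f}")
```

Under these assumptions, the delay ratio between the high-Vth and the low-Vth gate grows as Vdd is lowered toward Vth, which is the qualitative trend that motivates modeling variation explicitly at NTV.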
3.3.2 Impact of Leakage
At NTV, the magnitude of off-current (leakage current), IOFF, decreases when compared to STV.
On-current, ION , on the other hand, decreases more due to lower Vdd. Hence, the relative impact
of IOFF increases. Figure 3.3 points to the drastic decrease in ION/IOFF in an NTC environment
of Vdd = 0.5 V, when compared to a STC environment of Vdd = 0.8 V. Consequently, unlike
VARIUS, VARIUS-NTV takes into account the impact of the leakage current on SRAM timing and
stability, as covered in later sections. As part of leakage current, only sub-threshold leakage is
considered; gate leakage is excluded because high-k metal gate devices like the ones currently in
use are assumed.
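To get a feel for why the ION/IOFF ratio shrinks at low Vdd, the sketch below evaluates the EKV-style current of Equation 3.3 at Vgs = Vdd for the on-current and at Vgs = 0 for the off-current. Using the same expression for both currents, ignoring DIBL and temperature effects, and the values n = 1.5 and vt ≈ 25.85 mV are all simplifying assumptions, so the numbers only indicate the trend.

```python
import math

V_T = 0.02585    # thermal voltage near 300 K, in volts
N_SLOPE = 1.5    # assumed sub-threshold slope factor n


def ekv_current(vgs: float, vth: float) -> float:
    """Equation 3.3 with mu / Leff normalized to 1 (arbitrary units)."""
    x = (vgs - vth) / (2.0 * N_SLOPE * V_T)
    return N_SLOPE * V_T ** 2 * math.log1p(math.exp(x)) ** 2


def log_on_off_ratio(vdd: float, vth: float) -> float:
    """log10(ION / IOFF), with ION at Vgs = Vdd and IOFF at Vgs = 0."""
    return math.log10(ekv_current(vdd, vth) / ekv_current(0.0, vth))


if __name__ == "__main__":
    for vth in (0.20, 0.25, 0.30, 0.35, 0.40):
        stc = log_on_off_ratio(0.8, vth)
        ntc = log_on_off_ratio(0.5, vth)
        print(f"Vth={vth:.2f} V  log(ION/IOFF): Vdd=0.8 V -> {stc:.2f}, Vdd=0.5 V -> {ntc:.2f}")
```

For every Vth in this range, the ratio at Vdd = 0.5 V is lower than at Vdd = 0.8 V, consistent with the increased relative impact of leakage at NTV.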
[Figure 3.3 plot: log(ION/IOFF) (y-axis, 3.0 to 5.0) versus Vth (V) (x-axis, 0.20 to 0.40), for Vdd = 0.8 V and Vdd = 0.5 V.]
Figure 3.3: Impact of leakage.
3.3.3 SRAM Cell
The 6T cell of Figure 3.1 requires careful sizing of the transistors, with conflicting requirements to
prevent both read and write failures. In an environment where the impact of process variations
on transistors is more considerable, such as in NTV, this is difficult. To address this problem,
one approach is to power SRAMs at a higher Vdd than the processor logic. Unfortunately, this
approach is costly, since (i) cache memory and logic blocks are often highly interleaved in the
layout, (ii) extra voltage regulators are required in the platform, and (iii) many design, validation,
and testing issues can be incurred by the voltage-domain crossing. Moreover, this approach is
not scalable: With technology scaling, the relative difference between the safe SRAM and logic
voltages increases, diminishing the power savings incurred by NTC.
Consequently, VARIUS-NTV follows other work [8, 66] in using the 8-transistor (8T) cell of Fig-
ure 3.4 [67, 68]. This cell is easier to design reliably because it decouples the transistors used for
reading from those used for writing.
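[Figure 3.4 schematic: 8T cell with the write word-line and the write bit-lines BL and BR driving AXL and AXR, the PL/NL and PR/NR inverters storing VL and VR, and a read port formed by NRD and AXRD, connected to the read word-line and the (inverted) read bit-line.]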
Figure 3.4: 8-transistor (8T) SRAM cell architecture.
Specifically, the two additional transistors NRD and AXRD are only responsible for reads, while
the remaining transistors support the writes as in the 6T cell. To read, the read bit-line is pre-
charged and the read word-line is driven high. If the cell stores a 0, the bit-line keeps its value;
otherwise, it gets discharged to 0 through NRD and AXRD. The writes proceed as in the 6T cell.
Compared to 6T cells, read and write timing margins can be independently optimized, enhancing
the reliability margin significantly with marginal increase in cell area [67]. This is because the 6T
cell has to accommodate larger transistors to enhance reliability.
Of the five failure modes of Section 3.2.2, the 8T design eliminates the read upset failure because
the cell internal nodes are decoupled from the read bit-line. The other failure modes are possible,
but they can be mitigated at a reasonable cost by separately optimizing some transistors for writes
and others for reads.
3.3.4 Memory Failure Modes
While VARIUS only considers read timing failures, VARIUS-NTV models all of the SRAM failure
modes (except read upsets, which cannot occur in the 8T cell because a read cannot flip the cell
contents by construction). The following describes how VARIUS-NTV models these NTV-specific
memory failure modes.
3.3.4.1 Hold Failure
For a cell storing 0 (VR = 0, VL = 1), at low Vdd, the voltage VL reduces by construction. When
the cell is not accessed, the access transistors, as well as NL and PR, are off. Due to leakage through NL and
AXL, if VL reduces enough to reach the VSWITCH of the PR-NR inverter, the cell content would be
distorted. A hold failure occurs when the leakage current through the NL and AXL transistors in
Figure 3.4 reduces VL below the VSWITCH of the PR-NR inverter while the cell is not being accessed.
At that point, the cell’s state gets lost. To model these failures at a given Vdd, VARIUS-NTV uses
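Kirchhoff's current law (KCL) to compute VL and VSWITCH at Vdd. VL is extracted from
IPL(VL) − INL(VL) − IAXL(VL) = 0, where
\[
\begin{aligned}
I_{PL}(V_L) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\left(e^{\frac{V_{dd} - V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{dd} - V_L \\
I_{NL}(V_L) &\propto \frac{\mu}{L_{eff}} \times T^2 \times e^{-\frac{V_{th}}{n \times v_t}} && \text{where } V_{ds} = V_L \\
I_{AXL}(V_L) &\propto \frac{\mu}{L_{eff}} \times T^2 \times e^{-\frac{V_{th}}{n \times v_t}} && \text{where } V_{ds} = V_L
\end{aligned}
\tag{3.6}
\]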
VSWITCH is extracted from IPR(VSWITCH)− INR(VSWITCH) + IAXR(VSWITCH) = 0, where
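\[
\begin{aligned}
I_{PR}(V_{SWITCH}) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{SWITCH}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{dd} - V_{SWITCH} \\
I_{NR}(V_{SWITCH}) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{SWITCH}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{SWITCH} \\
I_{AXR}(V_{SWITCH}) &\propto \frac{\mu}{L_{eff}} \times T^2 \times e^{-\frac{V_{th}}{n \times v_t}} && \text{where } V_{ds} = V_{dd} - V_{SWITCH}
\end{aligned}
\tag{3.7}
\]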
The threshold voltage is derived from:
Vth = Vth0 + kDIBL × (Vds−Vdd0) + kT × (T − T0) (3.8)
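As an illustration, Equation 3.8 can be evaluated directly. The sketch below takes kT and kDIBL from Table 3.1. The reference values VTH0, VDD0, and T0 are assumptions made only for this example (here set to the NTV nominal Vth and Vdd and to 80 °C), and the function name is hypothetical.

```python
# Sketch of Equation 3.8. K_T and K_DIBL follow Table 3.1; VTH0, VDD0 and T0
# are illustrative assumptions, not values stated alongside the equation.
K_T = -1.5e-3     # V/K
K_DIBL = -0.150   # V/V
VTH0 = 0.33       # V, assumed reference threshold voltage
VDD0 = 0.55       # V, assumed reference supply voltage
T0 = 353.15       # K, assumed reference temperature (80 C)


def vth(vds, temp_k, vth0=VTH0):
    """Threshold voltage (V) at drain-source voltage vds (V) and temperature temp_k (K)."""
    return vth0 + K_DIBL * (vds - VDD0) + K_T * (temp_k - T0)
```

With these assumed references, a smaller Vds yields a larger Vth (weaker DIBL), which is why each current in Equations 3.6 and 3.7 is annotated with its own Vds.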
The hold failure probability per cell is PCell,Hold = P[VL(Vdd) − VSWITCH(Vdd) < 0]. A line is
faulty if at least one of its cells is faulty. If no redundant cells are provided, the hold
failure probability of a line is therefore
\[
P_{Line,Hold} = 1 - (1 - P_{Cell,Hold})^{\text{line size}}
\]
where line size denotes the number of cells per line. This expression assumes that cell failures
are independent. The approximation is valid if systematic variation within a line can be
neglected, since random variation per transistor (hence per cell) is independent.
\[
P_{Mem,Hold} = \sum_{i=1}^{\text{number of lines}} \binom{\text{number of lines}}{i} \times P_{Line,Hold}^{\,i} \times (1 - P_{Line,Hold})^{\text{number of lines} - i}
\tag{3.9}
\]
\[
P_{Mem,Hold} = 1 - (1 - P_{Line,Hold})^{\text{number of lines}}
\tag{3.10}
\]
To avoid hold failures, the minimum allowable supply voltage, VddMIN,Cell , can be obtained
by solving VL(VddMIN,Cell) = VSWITCH(VddMIN,Cell) under variation. The minimum allowable
supply voltage per line follows from VddMIN,Line = max(VddMIN,Cell).
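The per-line and per-memory hold failure probabilities can be checked numerically. The following is a minimal sketch under the same independence assumption as the text; the function names are hypothetical. It also confirms that the binomial sum of Equation 3.9 and the closed form of Equation 3.10 agree.

```python
from math import comb


def p_line(p_cell, line_size):
    """Probability that at least one of line_size independent cells fails."""
    return 1.0 - (1.0 - p_cell) ** line_size


def p_mem_closed_form(p_line_fail, n_lines):
    """Equation 3.10: probability of at least one faulty line among n_lines."""
    return 1.0 - (1.0 - p_line_fail) ** n_lines


def p_mem_binomial_sum(p_line_fail, n_lines):
    """Equation 3.9: binomial probabilities summed over i >= 1 faulty lines."""
    return sum(
        comb(n_lines, i) * p_line_fail**i * (1.0 - p_line_fail) ** (n_lines - i)
        for i in range(1, n_lines + 1)
    )
```

The two memory-level expressions are equal because the binomial probabilities over i = 0, ..., number of lines sum to 1, and the i = 0 term is (1 − PLine,Hold) raised to the number of lines.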
3.3.4.2 Write Stability Failure
Write failures are symmetric. Without loss of generality, a cell that stores a 0 (VR = 0 and
VL = 1) is considered. To model a write stability failure, VARIUS-NTV computes the voltage (VLW)
that node L reaches when the write BL is set to 0 (where BR = 1) and the write duration is extended
to infinity. If this value is above the switching threshold of the PR-NR inverter (VSWITCH), then a
write stability failure occurs. VLW distribution is computed using KCL at node L, from IPL(VLW)−
INL(VLW)− IAXL(VLW) = 0, where
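\[
\begin{aligned}
I_{PL}(V_{LW}) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{dd} - V_{LW} \\
I_{NL}(V_{LW}) &\propto \frac{\mu}{L_{eff}} \times T^2 \times e^{-\frac{V_{th}}{n \times v_t}} && \text{where } V_{ds} = V_{LW} \\
I_{AXL}(V_{LW}) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{LW}
\end{aligned}
\tag{3.11}
\]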
On the other hand, VSWITCH distribution is extracted by using KCL for the PR-NR inverter when
VIN = VOUT [56], from IPR(VSWITCH)− INR(VSWITCH) + IAXR(VSWITCH) = 0, where
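\[
\begin{aligned}
I_{PR}(V_{SWITCH}) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{SWITCH}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{dd} - V_{SWITCH} \\
I_{NR}(V_{SWITCH}) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{SWITCH}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{SWITCH} \\
I_{AXR}(V_{SWITCH}) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{SWITCH}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{dd} - V_{SWITCH}
\end{aligned}
\tag{3.12}
\]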
In all cases, transistor parameters are subjected to the variation model. Finally, the per-cell
probability of write stability failure becomes PCell,WrStab = P[VLW − VSWITCH > 0]. A memory
block suffers from write stability failure if there is at least one (non-redundant) cell suffering from
write stability failure.
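To illustrate how a per-cell probability of the form P[VLW − VSWITCH > 0] can be evaluated, the sketch below assumes that VLW and VSWITCH are independent and normally distributed. This is a simplifying assumption made here for illustration; the text only states that both distributions follow from the variation model. The function names and the numbers in the usage note are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def p_write_stability_fail(mu_vlw, sigma_vlw, mu_vsw, sigma_vsw):
    """P[V_LW - V_SWITCH > 0] for independent normal V_LW and V_SWITCH (assumed)."""
    mu = mu_vlw - mu_vsw                          # mean of the difference
    sigma = sqrt(sigma_vlw**2 + sigma_vsw**2)     # std. dev. of the difference
    return 1.0 - normal_cdf(-mu / sigma)
```

For example, with means of 0.10 V and 0.25 V and standard deviations of 0.03 V and 0.04 V, the difference has mean −0.15 V and standard deviation 0.05 V, so a failure corresponds to a 3σ excursion.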
3.3.4.3 Read Timing Failure
To model a read access failure, VARIUS-NTV computes DVarReadCell , the random variable captur-
ing the time taken to generate a detectable voltage drop on the read bit-line
\[
D_{VarReadCell} \propto \frac{1}{I_{AXRD} + \sum I_{STA}}
\tag{3.13}
\]
where IAXRD is the bit-line discharge current through the AXRD transistor in Figure 3.4, and ∑ ISTA
is the leakage over all of the cells attached to the bit-line. To calculate the distribution of 1/IAXRD ,
first, the source voltage of AXRD, VRD, should be extracted by solving KCL at this node, from
IAXRD(VRD) = INRD(VRD). When reading from a cell storing 1 (VR = 1 and VL = 0), transistor
currents follow from Equation 3.3:
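\[
\begin{aligned}
I_{AXRD} &= \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{RD}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{dd} - V_{RD} \\
I_{NRD} &= \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{RD}
\end{aligned}
\tag{3.14}
\]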
When reading from a cell storing 0 (VR = 0 and VL = 1), NRD is cut off:
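\[
\begin{aligned}
I_{AXRD} &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{RD}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{dd} - V_{RD} \\
I_{NRD} &\propto \frac{\mu}{L_{eff}} \times T^2 \times e^{-\frac{V_{th}}{n \times v_t}} && \text{where } V_{ds} = V_{RD}
\end{aligned}
\tag{3.15}
\]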
Then, the probability distribution of DVarReadCell can be obtained by applying those of Vth and
Leff given by the variation model to 1/(IAXRD(VRD) + ∑ ISTA). DVarReadCell can be conservatively set to
max(DVarReadCell(VR = 0), DVarReadCell(VR = 1)). Alternatively, if signal probabilities are known,
DVarReadCell would constitute a weighted sum of the underlying probabilities. Following the VAR-
IUS methodology, the maximum of DVarReadCell over all the cells in a line is the time to read
an entire memory line DVarReadLine. Finally, the probability of read access failure (PReadAccess) is
P[DVarReadLine > tREAD], where tREAD is the designated read duration.
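The step from per-cell delay to per-line failure probability can be sketched as follows. The sketch assumes that the per-cell delays are independent and identically normally distributed, so that the maximum over a line has the per-cell CDF raised to the number of cells. The VARIUS procedure for the distribution of the maximum may differ in detail, and all names here are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def p_read_access_fail(mu_cell, sigma_cell, cells_per_line, t_read):
    """P[max over cells of D_VarReadCell > t_read] for i.i.d. normal cell delays."""
    p_cell_ok = normal_cdf(t_read, mu_cell, sigma_cell)   # one cell meets t_read
    return 1.0 - p_cell_ok ** cells_per_line              # at least one cell misses it
```

Under these assumptions, a read duration that sits 3σ above the mean cell delay still leaves a failure probability of about 0.5 for a 512-cell (64 B) line, which shows why the line-level maximum, rather than the cell-level delay, sets tREAD.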
3.3.4.4 Write Timing Failure
Given a cell without write stability failure, VARIUS-NTV models a write timing failure by com-
puting DVarWriteCell . This is the time that node L takes to reach the switching threshold (VSWITCH)
of the PR-NR inverter.
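\[
\begin{aligned}
D_{VarWriteCell} &\propto \frac{1}{I_L} = \int_{V_{dd}}^{V_{SWITCH}} \frac{dv_L}{i_L(v_L)} \\
i_L(v_L) &= i_{PL}(v_L) - i_{NL}(v_L) - i_{AXL}(v_L)
\end{aligned}
\tag{3.16}
\]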
where IL is the discharge current at node L during the write, obtained following [56]. With iL(vL)
representing a function of Gaussian random variables Vth and Le f f under process variation,
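\[
\begin{aligned}
i_{PL}(v_L) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = V_{dd} - v_L \\
i_{NL}(v_L) &\propto \frac{\mu}{L_{eff}} \times T^2 \times e^{-\frac{V_{th}}{n \times v_t}} && \text{where } V_{ds} = v_L \\
i_{AXL}(v_L) &\propto \frac{\mu}{L_{eff}} \times n \times v_t^2 \times \ln^2\!\left(e^{\frac{V_{dd}-V_{th}}{2 \times n \times v_t}} + 1\right) && \text{where } V_{ds} = v_L
\end{aligned}
\tag{3.17}
\]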
further applies. After obtaining the probability distribution for DVarWriteCell , the distribution of the
maximum of DVarWriteCell over all the cells in a line DVarWriteLine is computed, following VARIUS.
Finally, the probability of write timing failure (PWriteTiming) is P[DVarWriteLine > tWRITE], where
tWRITE is the designated write duration.
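The integral in Equation 3.16 can be evaluated numerically for any given current function. The sketch below uses the trapezoidal rule. The function name, the step count, and the toy current used in the usage note are assumptions for illustration; the result is in arbitrary units because Equation 3.16 is a proportionality.

```python
def write_delay(i_l, vdd, v_switch, steps=1000):
    """Trapezoidal estimate of the integral of dv / i_L(v) from vdd down to v_switch.

    i_l(v) is the net current into node L. It is negative while the node
    discharges, so the integral taken from vdd down to v_switch is positive.
    """
    h = (v_switch - vdd) / steps          # negative step: integrating downwards
    total = 0.0
    for k in range(steps):
        v0 = vdd + k * h
        v1 = v0 + h
        total += 0.5 * h * (1.0 / i_l(v0) + 1.0 / i_l(v1))
    return total
```

For a constant discharge current, e.g. `write_delay(lambda v: -2.0, 0.55, 0.25)`, the result reduces to (Vdd − VSWITCH) divided by the current magnitude, as expected.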
3.4 Many-Core Architecture Modeled
To evaluate VARIUS-NTV, we model an 11 nm many-core architecture that operates at NTV. The
many-core is organized in clusters (36 in our default configuration) for ease of design (Figure 3.5).
[Figure 3.5 labels: Core + Private Memory; Cluster; Cluster Memory; dimensions 0.83 mm, 0.67 mm, and 1.99 mm.]
Figure 3.5: Many-core architecture used to evaluate VARIUS-NTV.
Each cluster has a cluster memory and several cores (8 in our default configuration), each with
a per-core memory. Each core is a single-issue engine where memory accesses can be overlapped
with each other and with computation. Each cluster memory is a bank of a shared L2 cache, while
the per-core memories are L1 caches. Data in the L1 caches is kept coherent with a directory-based
MESI coherence protocol where each pointer corresponds to one cluster. The cores are connected
with a bus inside each cluster and with a 2D torus across clusters. Table 3.1 shows the default
architecture and technology parameters. In the table, all of the parameters that are not labeled
with STV refer to the NTC environment.
We evaluate an STC version of the many-core and three NTC versions of it. The three NTC
versions differ based on the use of voltage and frequency domains, as listed in Table 3.2.
Table 3.1: Technology and architecture parameters.
System parameters
  Technology node: 11 nm
  Number of cores: 288
  Number of clusters: 36 (8 cores/cluster)
  PMAX = 100 W
  TMAX = 80 °C
  Chip area ≈ 20 mm × 20 mm
Variation parameters
  Correlation range: φ = 0.1
  Sample size: 100 chips
  Total (σ/µ)Vth = 20% (equal contribution of systematic & random)
  Total (σ/µ)Leff = 10% (equal contribution of systematic & random)
Technology parameters
  VddNOM at STV = 0.77 V; VddNOM at NTV = 0.55 V
  VthNOM at STV = 0.30 V; VthNOM at NTV = 0.33 V
  fNOM at STV = 3.3 GHz; fNOM at NTV = 1.0 GHz
  finterconnect at STV = 2.5 GHz; finterconnect at NTV = 0.8 GHz
  kT = −1.5 mV/K; kDIBL = −150 mV/V; n = 1.5
Architectural parameters
  Per-core memory: 64 kB WT, 4-way, 2 ns access, 64 B line
  Cluster memory: 2 MB WB, 16-way, 10 ns access, 64 B line
  On-chip network: bus inside cluster and 2D-torus across clusters
  Coherence: directory-based MESI
  Crossing an f domain boundary: 2 ns
  Average memory round-trip access time: ≈ 80 ns (before contention)
The technology parameters used in Table 3.1 are derived from ITRS [3] and projected trends
from industry. Every single experiment is repeated for 100 chips with different variation profiles,
and we present the average. More samples beyond 100 do not change the results noticeably.
Table 3.2: Configurations for the NTC many-core.
Name NTC Many-core configuration
MVMF Multiple Vdd and multiple f domains (one per cluster).
SVMF Single chip-wide Vdd domain and one f domain per cluster.
SVSF Single chip-wide Vdd and f domains.
3.5 Experimental Setup
We evaluate VARIUS-NTV by using it to estimate the performance and power consumption of the
many-core architecture of Section 3.4. We interface Pin [69] over a user-level pthreads library to
the SESC [42] cycle-level architectural simulator. SESC estimates both execution time and energy
consumed. The energy analysis relies on McPAT [70] scaled to 11 nm. HotSpot [71] takes the
detailed layout of the chip and models the temperature, which in turn affects the leakage energy
in a feedback loop. VARIUS-NTV is implemented in R [72].
In our experiments, we run multi-programmed workloads that contain some or all of the fol-
lowing 8 PARSEC applications: blackscholes, ferret, fluidanimate, raytrace, swaptions, canneal,
dedup, and streamcluster. Each application can run with 4, 8, or 16 threads. For each application,
we measure the complete parallel section (called region of interest or ROI) running the simsmall
input data set.
3.6 Model Validation
Our initial validation of VARIUS-NTV involves a validation of the parameters used and a
comparison to the results reported for an experimental chip.
3.6.1 Validation of Model Parameters
VARIUS-NTV builds on the VARIUS variation and timing error model which, as explained in [52],
was calibrated with experimental data from Friedberg et al. [73] and Razor [59], and validated
with error rates in logic and memory [52].
To validate the new VARIUS-NTV formulae, we start with Vth, which is a complex function of
Vdd, Le f f , and other technology parameters. We obtained a version of the 12 nm predictive tech-
nology model (PTM) from Yu Cao at Arizona State University [74]. We compared the Vth values
generated by VARIUS-NTV to those generated by the BSIM analytical model [55] and HSPICE.
The Vth values from VARIUS-NTV closely track those from both HSPICE and BSIM with less
than 1% error over the designated Vdd range. The main source of discrepancy is the accuracy of
modeling the DIBL effect.
We then used Vth values from VARIUS-NTV to extract gate delay and static power. We compared
gate delay and static power values to HSPICE measurements of a FO4 inverter chain. The
delay and static power trends of VARIUS-NTV follow HSPICE within 10% error over our Vdd
range.
3.6.2 Comparison to Silicon Measurements
To further validate VARIUS-NTV, we compare its outputs to the variation measurements from
Intel's 80-core TeraFLOPS processor [47]. To this end, we experimented with a 12 mm × 20 mm chip
that mimics the TeraFLOPS processor, where each core (which they call a tile) has 2 floating point
units, a 3 kB instruction memory, and a 2 kB data memory. According to the chip micrograph, the
chip organizes the 80 cores into 10 rows and 8 columns. To match their technology parameters,
we adapted VARIUS-NTV to a 65 nm CMOS technology with a VddNOM of 1.2 V.
Dighe et al. [47, Figure 8] depict the measured variation in core frequency ( fMAX) for the 80 cores
of a single die at 50 ◦C and Vdd = 0.8 V. At 0.8 V, the authors report a ratio of highest core frequency
to lowest core frequency equal to 1.62.
We repeat the conditions in which these measurements were taken to the extent that we can.
We generate VARIUS-NTV frequency maps for 100 sample dies, assuming (σ/µ)Vth = 5% for
the 65 nm technology, with an equal contribution of random and systematic variation. The his-
togram of the resulting ratios of highest core frequency to lowest core frequency as generated by
VARIUS-NTV is shown in Figure 3.6(a). As shown in the histogram, VARIUS-NTV produces an
average value of ≈ 1.48 for the ratio of frequencies, with a 95% confidence interval of (1.452, 1.483).
Further, Figure 3.6(b) shows the frequency distribution of the cores in one of the dies, as gener-
ated by VARIUS-NTV at 0.8 V. For this particular die, the ratio of highest core frequency to lowest
core frequency is ≈ 1.4. This figure is very similar to Figure 8 in [47].
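The per-die statistic and its summary over the sample dies can be sketched as follows. The normal-approximation interval (mean ± 1.96 standard errors) is an assumption made here; the text does not state how the 95% confidence interval was computed. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def fmax_ratio(core_freqs):
    """Ratio of highest to lowest core frequency on one die."""
    return max(core_freqs) / min(core_freqs)


def mean_with_ci95(values):
    """Sample mean and a normal-approximation 95% confidence interval for it."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return m, (m - half_width, m + half_width)
```

In use, `fmax_ratio` would be applied to each of the 100 generated frequency maps, and `mean_with_ci95` to the resulting list of ratios.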
Recall that the 80-core processor does not represent an NTC design. However, our validation
experiments are run at the relatively low 0.8 V (where the nominal Vdd at 65 nm is 1.2 V). No
further measured data is provided in [47] below 0.8 V. To our knowledge, there is no detailed
variation characterization of any NTC chip available.
[Figure 3.6 plots: (a) histogram of the number of dies (y-axis, 0 to 25) versus the f ratio of fastest to slowest core (x-axis, 1.3 to 1.7); (b) scatter plot of f in GHz (y-axis, 2.6 to 3.4) versus core index (x-axis, 0 to 80).]
Figure 3.6: Data generated by VARIUS-NTV that replicates the data presented in [47]: (a)
histogram of the ratios of highest core frequency to lowest core frequency over 100 dies; (b) the
frequency map for one of the sample dies.
3.7 Evaluation
In our evaluation, we first describe how we set the operating voltages and frequencies of the
many-core, then assess the impact of process variations in NTC and STC environments, and finally
explore some design parameters.
3.7.1 Computing the Operating Point
To determine the operating Vdd and f at NTV, our model starts with SRAM blocks. Our goal is
to estimate VddMIN , the minimum sustainable Vdd. It is set by hold and write stability failure
analyses.
Our model first finds the minimum Vdd needed to avoid hold failures, namely Vddhold. The
Vddhold distribution is obtained by solving VL(Vddhold) = VSWITCH(Vddhold), where the former is the
voltage at node L (Figure 3.4(b)), while the latter is the switching threshold of the PR-NR inverter.
The chosen Vddhold value is taken at the 3σ point of the distribution, after approximating it by a
normal distribution. Our model then proceeds with write stability failure analysis, to guarantee
that the chosen Vddhold also avoids write stability failures. At this step, a higher Vdd may emerge
if the write stability failure rate at Vddhold remains higher than the target tolerable error rate. The
resulting Vdd is VddMIN .
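The two-step selection of VddMIN described above can be sketched as follows. The sketch assumes that Vddhold samples are available, and it takes the write stability analysis as a caller-supplied function. The function names, the voltage step, and the upper cap are hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def three_sigma(samples):
    """mu + 3*sigma of the samples, i.e. the 3-sigma point of a fitted normal."""
    return mean(samples) + 3.0 * stdev(samples)


def vdd_min(vdd_hold_samples, write_stab_fail_rate, target_rate,
            step=0.005, vdd_cap=1.0):
    """Start at the 3-sigma hold voltage; raise Vdd until write stability is acceptable.

    write_stab_fail_rate maps a Vdd value to a failure rate. It stands in for
    the write stability analysis of Section 3.3.4.2 (an assumed interface).
    """
    vdd = three_sigma(vdd_hold_samples)
    while write_stab_fail_rate(vdd) > target_rate and vdd < vdd_cap:
        vdd += step
    return vdd
```

If the write stability failure rate at Vddhold already meets the target, the loop does not execute and VddMIN equals Vddhold.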
Once VddMIN is picked, VARIUS-NTV considers timing issues in order to set the f. The selected
f is determined by the slowest component of the chip, based on our model’s analysis of path delay
distributions at VddMIN . For logic blocks, the analysis follows that of VARIUS [52]. For SRAMs, it
can be shown that, for the parameters considered, write timing requires longer delays than read
timing for the same Vdd. This is consistent with the work of Abella et al. [66]. Hence, write timing
analysis determines the path delays in each SRAM block. To determine the maximum path delay,
VARIUS-NTV approximates the path delay distribution to a normal distribution and picks the 3σ
cutoff point. This maximum delay determines the f at VddMIN .
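The frequency selection can be sketched in the same spirit: each block's period is the 3σ point of its path delay distribution, and the chip frequency follows from the slowest block. The dictionary layout and function names below are assumptions for illustration.

```python
from statistics import mean, stdev


def block_period(path_delays):
    """3-sigma cutoff of a block's path delays under a normal approximation."""
    return mean(path_delays) + 3.0 * stdev(path_delays)


def chip_frequency(delays_by_block):
    """f set by the slowest block: the inverse of the largest 3-sigma period."""
    return 1.0 / max(block_period(d) for d in delays_by_block.values())
```

With delays expressed in ns, the returned frequency is in GHz. For example, blocks with 3σ periods of 1.2 ns and 1.4 ns give f = 1/1.4 ≈ 0.71 GHz.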
3.7.2 Impact of Process Variations at NTV and STV
To examine the impact of WID process variations on the f and power consumption at NTV and
STV, we consider three types of on-chip blocks separately: logic (the core pipelines), small mem-
ories (the per-core local memories) and large memories (the per-cluster memories). We do this
because they have different critical path distributions. In all cases, the f for a block is determined
by finding the distribution of the path delays in the block at VddNOM and then picking, as the
period for the block, the delay at the 3σ of the distribution. The power of the block is the sum of
the static and dynamic components.
We consider intra-cluster variations first. In each cluster, we compute the ratio of the frequencies
of the fastest and slowest pipelines in the cluster. We then take the average of the ratios across all
clusters (Intra Pipe). We repeat the same process for local memories in the cluster to calculate Intra
Mem. Finally, for the power consumption, we take the power ratio of highest to lowest consuming
pipelines, and highest to lowest consuming local memories, to compute Intra Pipe and Intra Mem,
respectively.
For inter-cluster variations, we measure the ratio of the frequencies of the fastest and slowest
cluster memories on chip (Inter Mem). We then consider the frequency that each cluster can sup-
port (the lowest frequency of its pipelines, local memories, and cluster memory), and compute the
ratio of the frequencies of the fastest and slowest clusters (Inter Pipe + Mem). Finally, we repeat the
computations for power (Inter Mem and Inter Pipe + Mem). We report the mean of the experiments
for 100 chips.
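The intra-cluster and inter-cluster metrics defined above reduce to a few max/min ratios. The following minimal sketch shows them for frequency; the same functions apply to power. The function names and the list-of-lists data layout are assumptions.

```python
from statistics import mean


def intra_ratio(per_cluster_values):
    """Average over clusters of max/min within each cluster (Intra Pipe, Intra Mem)."""
    return mean(max(v) / min(v) for v in per_cluster_values)


def cluster_frequency(pipe_f, local_mem_f, cluster_mem_f):
    """Frequency a cluster can support: the lowest f among its components."""
    return min(min(pipe_f), min(local_mem_f), cluster_mem_f)


def inter_ratio(per_cluster_scalar):
    """Max/min across clusters (Inter Mem, or Inter Pipe + Mem on cluster frequencies)."""
    return max(per_cluster_scalar) / min(per_cluster_scalar)
```

Inter Pipe + Mem is obtained by applying `inter_ratio` to the list of `cluster_frequency` values, one per cluster.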
Figure 3.7 compares these ratios for NTC and STC. Figure 3.7(a) shows the f ratios. We observe
that the frequency ratio of the fastest to the slowest blocks is substantially higher at NTV than
at STV — for the same process variation profile. For example, Inter Pipe + Mem at NTV is 3.7,
while it is only 2.3 at STV (Figure 3.7(a)). This is because a low Vdd amplifies the effect of process
variations on delay.
Figure 3.7(b) shows the power ratios. The variation in total power also increases at NTV. How-
ever, the relative difference in power ratios between NTC and STC is generally smaller than the
relative difference in frequency ratios. The reason is that power includes both dynamic and static
power, and the ratios for static power are the same for NTC and STC. Consequently, the relative
difference in power ratios is smaller. Still, the absolute difference is significant. The chip becomes
more heterogeneous at NTV.
[Figure 3.7 plots: max/min ratio for STC and NTC across Intra Pipe, Intra Mem, Inter Mem, and Inter Pipe + Mem; panel (a) Frequency (y-axis, 0 to 5), panel (b) Power (y-axis, 0 to 25).]
Figure 3.7: Impact of variations at NTV and STV.
These experiments have used a fixed, safe VddNOM for the whole chip. In reality, process vari-
ations in the SRAM cells result in each cluster supporting a different VddMIN , the minimum sus-
tainable Vdd to avoid failures. Such VddMIN values are lower than VddNOM for many clusters.
Figure 3.8 shows the distribution of the VddMIN values for all the clusters in a sample chip at
NTV. The data is shown as a histogram. We can see that the VddMIN values of clusters in a chip
vary along a significant 0.46-0.58 V range.
3.7.3 Design Space Exploration
A promising way to combat the increased impact of process variations is to rely on fine-grain,
per-cluster Vdd and f tuning. To quantify the effect, we compare the many-core configurations
of Table 3.2 across different cluster granularities ranging from 4 cores per cluster to 16 cores per
[Figure 3.8 plot: histogram of the number of clusters (y-axis, 0 to 12) versus VddMIN in V (x-axis, 0.46 to 0.58).]
Figure 3.8: Values of VddMIN for all the clusters of a representative chip at NTV.
cluster. MVMF is an environment with a Vdd and an f domain per cluster; SVMF has a single Vdd
domain in the chip but one f domain per cluster; finally, SVSF characterizes a variation-oblivious
environment, with a single Vdd and f domain per chip.
Figure 3.9 compares the performance (in normalized MIPS) of our 288-core NTC chip for the
different environments. We consider two workload scenarios: one where we use all the clusters
in the chip (Figure 3.9(a)) and one where we only use about half of the clusters (Figure 3.9(b)).
Specifically, we use 128 out of the 288 cores and leave the others idle. Figure 3.10 repeats the
analysis for STC.
[Figure 3.9 plots: normalized MIPS (y-axis, 0.0 to 2.5) versus number of cores per cluster (x-axis: 4, 8, 16) for MVMF, SVMF, and SVSF; panel (a) 100% use, panel (b) ≈ 50% use.]
Figure 3.9: Performance of our 288-core chip at NTV with different cluster sizes and
configurations: under 100% use (a), and under ≈ 50% use (b).
In each figure, we keep the total number of cores in the chip constant, and perform a sensitivity
analysis of different cluster granularities: 4, 8, or 16 cores per cluster. In each case, the workload
consists of 4-threaded, 8-threaded, or 16-threaded parallel applications, respectively, from PAR-
SEC. Each application uses one cluster, and we report the average performance of the workload in
MIPS. In each plot, to make the comparison fair, the power consumed by all of the environments
is kept constant. In MVMF, the per-domain Vdd and f are set as per Section 3.7.1. Specifically,
each cluster runs at the cluster-specific VddMIN , and at the maximum f that it can support at this
voltage. In SVMF, all the clusters in the chip run at the maximum of the VddMINs across all clus-
ters. The per-cluster frequencies are increased accordingly. Finally, in SVSF, the chip uses the
same voltage as SVMF but runs at the chip-wide minimum of the per-cluster frequencies. Recall
that the VddMIN of a cluster is the maximum VddMIN across its components, while the
f of a cluster is the minimum f across its components at the designated cluster Vdd.
Applications are assigned to clusters in order: the application with the highest average IPC
goes to the cluster with the highest f, the next one to the next-fastest cluster, and so on.
After the MIPS of each environment is computed, it is normalized to that of MVMF for an
8-core cluster in each plot.
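The voltage and frequency settings of the three environments, and the IPC-ordered assignment, can be sketched as follows. The function `f_at(c, vdd)`, which returns the maximum frequency of cluster c at a given Vdd, is a hypothetical stand-in for the analysis of Section 3.7.1. The sketch does not model the equal-power constraint applied in the plots.

```python
def operating_points(vdd_min, f_at):
    """Per-cluster (Vdd, f) pairs for the MVMF, SVMF and SVSF environments.

    vdd_min: list of per-cluster VddMIN values.
    f_at(c, vdd): assumed function giving the maximum f of cluster c at vdd.
    """
    n = len(vdd_min)
    mvmf = [(vdd_min[c], f_at(c, vdd_min[c])) for c in range(n)]
    v_chip = max(vdd_min)                      # single Vdd must be safe for every cluster
    svmf = [(v_chip, f_at(c, v_chip)) for c in range(n)]
    f_chip = min(f for _, f in svmf)           # single f is set by the slowest cluster
    svsf = [(v_chip, f_chip)] * n
    return {"MVMF": mvmf, "SVMF": svmf, "SVSF": svsf}


def assign(app_ipc, cluster_f):
    """Map the highest-IPC application to the highest-f cluster, and so on."""
    apps = sorted(app_ipc, key=app_ipc.get, reverse=True)
    clusters = sorted(range(len(cluster_f)), key=lambda c: cluster_f[c], reverse=True)
    return dict(zip(apps, clusters))
```

The sketch makes the ordering of the environments explicit: SVMF gives up per-cluster Vdd but keeps per-cluster f, while SVSF additionally pins every cluster to the frequency of the slowest one.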
Starting with the fully-utilized chip (Figure 3.9(a)), we observe that SVMF only attains 59%,
71%, and 81% MIPS of MVMF, for 4-core, 8-core, and 16-core clusters, respectively. This is because
it does not exploit the multiple Vdd domains of MVMF. The difference between the two bars gets
larger as the cluster granularity becomes finer, as MVMF tracks core-to-core variations closer.
SVSF in this case only reaches 32%, 46%, and 61% MIPS of MVMF, for 4-core, 8-core, and 16-core
clusters, respectively. As the clusters become coarser (more cores per cluster), the differences
between the configurations diminish.
Figure 3.9(b) repeats the experiment when only approximately half of the clusters are busy.
For MVMF, we pick the 32, 16, and 8 most MIPS/W(att)-efficient clusters for 4-, 8-, and 16-cores-
per-cluster granularity, respectively, and then assign the applications of higher IPC to the faster
clusters in turn. The resulting power consumption is the power budget that we allow to the other
environments. The other environments pick their 32, 16, or 8 most MIPS/W-efficient clusters that
satisfy the budget. We see similar trends as in Figure 3.9(a) except that the drop in MIPS is not
as large. The reason is that each environment now picks a subset of energy-efficient clusters —
leaving energy-inefficient ones idle.
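The cluster selection step can be illustrated with a short sketch that ranks clusters by MIPS/W and reports the power of the chosen subset. How the other environments enforce the budget is not detailed in the text, so the sketch covers only the ranking; the function name and inputs are hypothetical.

```python
def most_efficient(mips, watts, k):
    """Indices of the k clusters with the highest MIPS/W, plus their total power."""
    order = sorted(range(len(mips)), key=lambda c: mips[c] / watts[c], reverse=True)
    chosen = order[:k]
    return chosen, sum(watts[c] for c in chosen)
```

For MVMF, the returned total power would serve as the budget that the other environments must satisfy.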
Finally, in Figure 3.10, the experiments are repeated for STC. For STC, MVMF and SVMF become
equivalent, since the nominal STC Vdd is high enough to produce a safe operating point
across all of the clusters. There is no need to set the Vdd of some clusters higher or lower
depending on their VddMIN. Apart from this, while generally the same trends apply as under NTC
operation, the MIPS loss incurred by SVSF operation is much smaller.
[Figure 3.10 plots: normalized MIPS (y-axis, 0.0 to 2.5) versus number of cores per cluster (x-axis: 4, 8, 16) for MVMF, SVMF, and SVSF; panel (a) 100% use, panel (b) ≈ 50% use.]
Figure 3.10: Performance of our 288-core chip at STV with different cluster sizes and
configurations: under 100% use (a), and under ≈ 50% use (b).
3.8 Related Work
Several microarchitectural models analyze the impact of process variations on the frequency and
power of processors and memories at a level that is useful to microarchitects. They include the
work of Humenay et al. [48], Liang and Brooks [49], Marculescu and Talpes [50], Romanescu et
al. [51], and Sarangi et al. [52] (on which this work builds) among others. As indicated before,
these works only apply to STC and not to NTC.
A few papers include a good description of the challenges and issues at NTV [8, 9, 46].
Many other works are related to evaluating the impact of process variation, mostly in STC
environments. We list some of the most relevant here. Humenay et al. demonstrate that WID
process variations lead to considerable performance and power consumption asymmetry among
the cores in a CMP [48]. To minimize such asymmetry, they propose per-core ABB and ASV.
Donald and Martonosi analyze core-to-core power variations in a CMP due to WID variation [75].
They propose to turn off cores when they consume excessive leakage power in order to maximize
the chip-wide energy efficiency. Herbert and Marculescu examine the impact of core size on the
throughput of a fixed area chip in the presence of WID variations [76]. They find that smaller
cores (thus more cores per chip) running at independent f lead to higher throughput than larger
ones. Li and Martinez propose to optimize the number of active cores and their Vdds and fs
jointly while running a workload on a CMP [77] where they apply DVFS chip-wide rather than
independently per core. In [78], Rangan et al. propose a throughput-driven scheduling scheme
to guarantee that a variation-afflicted chip performs very close to a perfect chip operating at the
average frequency of the former. Rotem et al. [79] analyze the impact of single and multiple
voltage and frequency domains in a CMP environment, considering power delivery limitations.
They propose a clustered topology to maximize performance. The authors ignore the impact of
variation. Finally, Teodorescu and Torrellas [40] examine the impact of process scheduling in
the context of a many-core with variation. They provide heuristics to schedule the workload for
performance or for power efficiency. It would be interesting to reproduce these works in the
context of NTC.
3.9 Summary
To help confront process variations at the architecture level at NTV, we presented the first mi-
croarchitectural model of process variations for NTC. The model, called VARIUS-NTV, extends
an existing variation model for STC. It models how variation affects the frequency attained and
power consumed by cores and memories in an NTC many-core, and the timing and stability faults
in SRAM cells at NTV. The key aspects include: (i) adopting a gate-delay model and an SRAM cell
type that are tailored to NTC, (ii) modeling SRAM failure modes emerging at NTV, and (iii) ac-
counting for the impact of leakage in SRAM failure models.
We evaluated a simulated 11 nm many-core at both NTV and STV. Our results showed that
the expected process variations induce higher differences in f and power at NTV than at STV. For
example, the maximum difference in cluster f within a chip is ≈3.7× at NTV and only ≈2.3× at
STV. We evaluated different core-clustering organizations in the chip and different configurations
of on-chip Vdd- and f-domains. Our experiments showed that variation management is more
crucial at NTC. Finally, we validated our model against an experimental 80-core prototype chip.
CHAPTER 4
ESCHEWING MULTIPLE VOLTAGE DOMAINS AT
NEAR-THRESHOLD VOLTAGES
4.1 Introduction
Many-core scaling now faces a power wall. Successive technology generations have been increas-
ing chip power density, resulting in a situation where more cores can be placed on a chip than can
be concurrently operating. We urgently need new ways for more energy-efficient execution.
One way to attain energy-efficient execution is to reduce the supply voltage (Vdd) to a value
only slightly higher than a transistor’s threshold voltage (Vth). This environment is called near-
threshold voltage (NTV) computing (NTC) [8, 9, 46] – as opposed to the conventional super-
threshold voltage (STV) computing (STC). Vdd is a powerful knob because it has a strong impact
on both dynamic and static energy. According to initial data [8, 9], NTC can decrease the energy
per operation by several times over STC. One drawback is a degradation in frequency, which
may be tolerable through more parallelism in the application. Still, since many more cores can
be executing concurrently within the chip’s power envelope, the result is a higher throughput for
parallel codes.
A major roadblock for NTC is process variations, namely the deviation of device parameters
from their nominal specifications. Already at STV, a chip has regions with different speed and
power consumption. At NTV, the same magnitude of process variations causes larger changes in
transistor speed and power due to the low Vdd [46].
It is important to find techniques to cope with process variations in future NTC chips. Possibly,
general approaches currently used to handle variations could be applied. They include, among
other techniques, adaptive body biasing (ABB) and, especially, Vdd tuning (in the form of adaptive
supply voltage (ASV) or dynamic voltage scaling (DVS)). For the latter, the effectiveness increases
with support for multiple finer grain Vdd and frequency (f) domains. With fine-grain Vdd and f
domains, we can separately adapt to the parameter values in the different regions of the chip.
Unfortunately, simply applying these techniques will not work at NTV. There is consensus that
ABB will likely be ineffective in new technologies. Most importantly, supporting many on-chip
Vdd domains with conventional on-chip Vdd regulators is not power-efficient — hence hardly
compatible with a highly energy-efficient environment such as NTC. The reason is that on-chip
regulators — the only type we currently know how to build to provide the desired functionality
— have a typical power efficiency of 70-90% [80,81]. Moreover, given the many cores that an NTC
chip can include, many such regulators would be needed, multiplying the cost.
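To make the cost of these efficiencies concrete, consider a regulator of efficiency η that delivers power P_load to its domain. The symbols are introduced here only for this worked example.

```latex
\[
P_{in} = \frac{P_{load}}{\eta},
\qquad
P_{loss} = P_{in} - P_{load} = P_{load}\left(\frac{1}{\eta} - 1\right)
\]
\[
\eta = 0.9 \;\Rightarrow\; P_{loss} \approx 0.11\,P_{load},
\qquad
\eta = 0.7 \;\Rightarrow\; P_{loss} \approx 0.43\,P_{load}
\]
```

Across the quoted 70-90% range, the regulator therefore dissipates roughly 11% to 43% of the power that the domain itself consumes.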
In this chapter, we examine how significant the problem of power-inefficient Vdd domains is
at NTV. We also explore a potential alternative that tackles variation with a single chip-wide Vdd
domain and multiple f domains. We call this new alternative Polyomino. Having f domains intro-
duces an execution overhead (when a boundary is crossed) but not a significant power overhead.
However, for such an approach to be competitive, we need to address two issues. The first one
is to provide core-to-job assignment algorithms that deliver high performance per watt without
relying on multiple Vdd domains. The second is to support chip-wide DVFS that adapts Vdd in
intervals short enough to be competitive with an environment with on-chip Vdd regulators.
First, we show that, at NTV, a Polyomino chip delivers higher performance per watt than one
with multiple on-chip Vdd domains supported by on-chip Vdd regulators. The reasons are: (i)
the regulators’ power inefficiencies, (ii) the increased Vdd guard-band induced by fine-grain Vdd
domains (to handle deeper Vdd droops due to lower capacitance per Vdd domain), and (iii) the
practical fact that any Vdd domain still has to include several cores. Second, we introduce core
assignment algorithms for the Polyomino architecture that deliver high performance per watt
while being simple. Finally, we show that the lower speed of Vdd changes during DVFS without
on-chip Vdd regulators is perfectly tolerable.
In the following, Section 4.2 discusses multiple Vdd domains at NTV and presents Polyomino;
Sections 4.3 and 4.4 discuss core assignment and fine-grain DVFS for Polyomino; Sections 4.5
and 4.6 evaluate the ideas; Section 4.7 presents a discussion; and Section 4.8 covers related work.
4.2 Eschewing Multiple Vdd Domains at NTV
Having multiple on-chip Vdd domains with independent voltage scaling can increase the energy
efficiency of a many-core, e.g., by executing non-latency-critical applications in low-Vdd
domains. Since NTC is an energy-conscious environment, such support appears to suit it. In
addition, NTC has two further reasons to benefit from multiple Vdd domains. First, since WID
process variations are more significant at NTV, chip neighborhoods at NTV are expected to benefit
more from the decoupling of Vdd values across the chip. The second reason is that, at NTV, Vdd
is a particularly strong lever to affect the power and performance conditions of core operation.
Specifically, small changes in Vdd have a relatively large impact on the performance and power
consumption of a core.
However, we uncover several limitations that make the use of multiple Vdd domains at NTV
less attractive. The presence of such limitations suggests a different type of many-core architec-
ture at NTV, and two challenges that need to be overcome for cost-effective operation. Next, we
describe these limitations, the architecture, and the challenges.
4.2.1 Limitations of Multiple Vdd Domains at NTV
Several effects limit the cost-effectiveness of multiple on-chip Vdd domains in a power-conscious
environment such as a future NTC chip (Table 4.1). The first one is the power loss in conventional
on-chip Vdd regulators. Having Vdd regulators on chip is likely the only realistic way to support
many domains — utilizing many off-chip regulators (e.g., 36) is too expensive. Unfortunately,
such regulators have typical power efficiencies of only 70-90%, be they switching or low-dropout
(LDO) regulators. For example, Ghasemi et al. [80] discuss LDO regulators with a range of efficien-
cies, while Kim et al. [81] discuss on-chip switching regulators that have a peak power efficiency
of 77%. This is hardly compatible with a very power-conscious environment such as NTC.
Table 4.1: Why multiple Vdd domains become less attractive in a future NTC environment.
Limitation | Reason
Power loss in on-chip Vdd regulators | Regulators have power efficiencies of 70-90%
Increased Vdd guard-band to tolerate larger dynamic Vdd droops | Each domain has a lower capacitance than a large chip with a single Vdd domain
Inability to fully fine-tune Vdd for individual cores in one domain | To reduce cost and complexity, a domain still includes several cores
The second effect is the likely need to increase the Vdd guard-band in the finer grain Vdd do-
mains to tolerate more accentuated dynamic Vdd droops. Such deeper droops may be induced by
the lower capacitance of a domain, compared to a large chip with a single Vdd domain. James et
al. [82] discuss this problem for the IBM POWER6 processor. Increased Vdd guard-bands imply
lower power efficiencies.
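As a rough illustration of why a larger guard-band costs power, assume that the guard-band is applied as a fractional increase g over the minimum safe Vdd and that dynamic power follows the standard C × Vdd² × f relation at unchanged f. Both are simplifying assumptions made here for illustration, not results of the evaluation; the 5% and 10% values are the guard-bands used later in the evaluation (Table 4.2).

```latex
\[
\frac{P_{dyn}(g = 0.10)}{P_{dyn}(g = 0.05)}
  = \left( \frac{1 + 0.10}{1 + 0.05} \right)^{2}
  \approx 1.10
\]
```

Under these assumptions, the extra 5% guard-band costs roughly 10% more dynamic power for the same work, before accounting for any increase in static power.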
Finally, there is the practical aspect that the low-power operation of NTC chips will result in
chips with many cores. Attempting to control Vdd on a per-core basis is expensive in hardware.
Therefore, it is likely that each Vdd domain will include several cores. The differences between
the cores in the same Vdd domain, as induced by the increased sensitivity to process variations,
will lead to a Vdd setting that is suboptimal for individual cores — hence missing out on part of
the potential gains.
4.2.2 An Alternative Approach for Future NTC: Polyomino
Since supporting multiple Vdd domains is costly at NTV, we propose an alternative many-core
architecture. The architecture, called Polyomino, eschews multiple on-chip Vdd domains for hard-
ware simplicity and energy efficiency. It keeps a single Vdd across the whole chip. DVFS can be
used, but it is applied globally across the chip.
To handle process variations more inexpensively than with fine-grain Vdd and frequency (f)
domains, Polyomino uses only f domains. Polyomino is organized in clusters of cores, where each
cluster is an f domain. A cluster (domain) is characterized by the maximum frequency ( fMAX) that
it can support at the lowest possible chip-wide safe voltage (VddNOM). VddNOM and the set of
fMAX are set at manufacturing-testing time. With many f domains, the chip has many degrees of
freedom in an environment highly affected by process variation.
Figure 4.1: Example Polyomino architecture (a), its operation (b), the core-assignment algorithm
(c), and distance of clusters to cluster i (d). Panel (a) labels a cluster, its cluster memory, and
a core with its local memory. Panel (b) numbers the 36 clusters from 1 to 36, row by row, and marks
three ensembles with fA = min(f8, f9, f14, f15), fB = min(f5, f6, f11, f12, f17, f18, f23, f24), and
fC = min(f20, f26, f27, f28, f29, f30, f32, f33). Panel (c) shows the Polyomino assignment taking
the per-cluster variation profile (PSTA, fMAX), the chip load and P/T headroom, the number of cores
requested, and a rough IPC estimate as inputs, and producing an ensemble of clusters plus the f of
the ensemble. Panel (d) labels the clusters around cluster i with wavefront distances 1, 2, and 3.
Figure 4.1(a) shows a Polyomino many-core. This design has 6 × 6 clusters, where each cluster
has a cluster memory and 8 cores with local memories.
To determine VddNOM and the set of fMAX, we proceed as follows. Each cluster’s minimum
sustainable Vdd, VddMIN , is set by the cluster’s SRAM hold and write stability failure analyses to
ensure reliable operation. Then, the chip-wide VddNOM becomes the maximum over all clusters’
VddMIN . After VddNOM is set, timing tests in the SRAM and logic for each cluster i determine the
maximum frequency fMAX i that the cluster can support at VddNOM. Such frequency will be the
default frequency of the cluster. It can be increased if Vdd increases over VddNOM.
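The characterization steps above can be sketched as follows. This is a minimal illustration, not the actual test flow: the VddMIN values, the `timing_test` callable, and the frequency table are hypothetical stand-ins for the SRAM stability analyses and the timing tests run at manufacturing-testing time.

```python
# Hypothetical per-cluster VddMIN values in volts (illustrative, not measured data).
VDD_MIN = {1: 0.48, 2: 0.52, 3: 0.55, 4: 0.50}

# Hypothetical timing-test outcomes in GHz, standing in for real SRAM/logic tests.
FAKE_TIMING_RESULTS = {1: 1.10, 2: 0.95, 3: 0.80, 4: 1.05}


def characterize_chip(vdd_min_by_cluster, timing_test):
    """Return (VddNOM, {cluster: fMAX}) for one chip.

    VddNOM is the maximum over all clusters' VddMIN, so that every cluster
    operates at or above its own minimum sustainable Vdd. fMAX of each
    cluster is whatever the timing test reports at VddNOM.
    """
    vdd_nom = max(vdd_min_by_cluster.values())
    f_max = {cid: timing_test(cid, vdd_nom) for cid in vdd_min_by_cluster}
    return vdd_nom, f_max


def fake_timing_test(cluster_id, vdd):
    # A table lookup that ignores vdd; a real test would measure at this Vdd.
    return FAKE_TIMING_RESULTS[cluster_id]


vdd_nom, f_max = characterize_chip(VDD_MIN, fake_timing_test)
print(vdd_nom, f_max)
```

Note that the cluster with the highest VddMIN sets the voltage for the whole chip, which is why a single weak cluster raises the operating Vdd of all others.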
In Polyomino, where we give up multiple Vdd domains and, instead, rely on fine-grain f do-
mains, two challenges appear: the need to (1) carefully control core-to-job assignment, and (2)
effectively support fine-grain (i.e., short-interval) DVFS. We consider them next.
4.3 The Challenge of Core Assignment
4.3.1 Rationale: Simplicity and Effectiveness
Attaining energy-efficient performance in a many-core with hundreds of cores as enabled by NTC
requires a good core-assignment algorithm. In such a many-core, the number of degrees of free-
dom in the assignment is vast. Hence, the challenge is to attain effective assignment while man-
aging such complexity. Polyomino eases the complexity by eliminating Vdd domains: the as-
signment algorithm only needs to select the f of the chosen cores, rather than both Vdd and f.
However, the inability to set multiple Vdds in the chip disables a potentially important knob for
energy efficiency. Hence, Polyomino’s assignment algorithm needs careful design.
Recall that the chip is necessarily organized in clusters of cores to exploit WID variations and
to enhance scalability. A cluster is the smallest f domain. Clocking all the cores in a cluster at the
same f is reasonable, since the whole cluster is likely to have a similar value of the systematic com-
ponent of process variations. In this environment, Polyomino simplifies core assignment further
by assigning all the cores in a cluster as a group to a job. Any resulting unused cores in the cluster
are power-gated. Leaving them unused is typically not a problem because, in our environment,
there is likely a surplus of cores. However, if cores are scarce, a cluster can take in multiple jobs.
A single parallel job may take multiple clusters. Such a set of clusters is called an Ensemble. For
assignment simplicity, an ensemble runs at a single f, which is equal to the lowest of the fMAX
of the constituting clusters. Moreover, running at a single f ensures that all the threads of the job
make similar progress and typically results in faster overall execution [83] — especially in highly
synchronized codes. When multiple jobs are running concurrently, each is assigned to a different
ensemble, which forms a separate f domain. An example is shown in Figure 4.1(b), where fMAX
for cluster i is depicted as fi. Three ensembles are allocated to three jobs (A, B, and C), giving
rise to three independent f domains characterized by frequencies fA, fB, and fC. Each ensemble
operates at the f of its slowest component cluster.
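The ensemble-frequency rule can be sketched as below. The cluster memberships follow Figure 4.1(b); the per-cluster fMAX values are synthetic numbers generated only for illustration and are an assumption of this sketch.

```python
# Synthetic fMAX values in MHz for clusters 1..36 (illustrative only).
F_MAX_MHZ = {i: 800 + 10 * ((7 * i) % 36) for i in range(1, 37)}

# Ensemble membership as drawn in Figure 4.1(b).
ENSEMBLES = {
    "A": [8, 9, 14, 15],
    "B": [5, 6, 11, 12, 17, 18, 23, 24],
    "C": [20, 26, 27, 28, 29, 30, 32, 33],
}


def ensemble_frequency(clusters, f_max):
    """An ensemble runs at the lowest fMAX of its constituent clusters."""
    return min(f_max[c] for c in clusters)


f_by_job = {job: ensemble_frequency(cl, F_MAX_MHZ) for job, cl in ENSEMBLES.items()}
print(f_by_job)
```

Every cluster in an ensemble can sustain the chosen f, at the cost of leaving some frequency headroom unused in the faster clusters.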
There are various design alternatives for the network that interconnects the clusters. One alter-
native is to have a single, separate f domain for the network [47]. This is a simple design, but re-
quires that every single communication between clusters in an ensemble or between a cluster and
memory crosses domains. An alternative is to break the network into multiple f domains — pos-
sibly including each router in the f domain of its neighbor cluster. This design would allow clus-
ters within an ensemble to communicate without crossing f domains. However, longer-distance
messages would still suffer. Moreover, the network is more complicated to design. Given that
crossing an f domain only adds about 2 ns [84], we choose the simple, single f domain for the net-
work. Hence, Polyomino’s design does not eliminate f-domain crossings when clusters within the
same ensemble communicate. However, a baseline chip that uses multiple Vdd domains would
also have the same issue: communicating clusters would have to cross f-domain boundaries.
One degree of freedom in Polyomino’s assignment algorithm is whether the clusters that form
the ensemble assigned to a job have to be physically contiguous. Not worrying about the physical
layout simplifies the algorithm, but using non-contiguous clusters lengthens inter-thread com-
munication. In our design, we consider two algorithms that give either high or low priority to
choosing contiguous clusters for an ensemble. In practice, the closer clusters are to each other,
the smaller the difference in their systematic-variation-induced f and PSTA values. As a result, both
algorithms try to pick contiguous clusters implicitly or explicitly. We consider the algorithms in
the next section.
4.3.2 Core-Assignment Algorithm
We call our core-assignment algorithm P Assign. When a new job arrives, P Assign assigns an
ensemble of clusters to it at a single f and, typically, does not revisit the assignment during the
job’s lifetime. In this discussion, P Assign tries to maximize MIPS/W(att); other related metrics
can be used instead.
P Assign uses information from both hardware and application. The hardware information in-
cludes each cluster’s static power (PSTA) and maximum frequency supported ( fMAX) at VddNOM at
a reference temperature (T0). This information is generated at manufacturing-testing time. Other
information includes the instantaneous load of the chip (which clusters are busy) to determine the
headroom to the maximum power and T constraints. The application information includes the
number of cores requested (equal to the number of threads) and an estimate of the average IPC of
these threads. This IPC only needs to be roughly approximate. It can be obtained from previous
runs of the application or from the current run, and does not include long-duration spinning. The
output of P Assign is the chosen ensemble of clusters to run the job plus the f at which these clus-
ters should run — equal to the minimum of the fMAX of the chosen clusters. The general approach
is depicted in Figure 4.1(c).
To see how P Assign works, assume that a job requests n cores. P Assign must return an ensemble
E of |E| = ⌈n/ClSize⌉ clusters, where ClSize denotes the cluster size. Naively, P Assign
could simply check all possible groups of |E| free clusters and pick the group that delivers the
maximum MIPS/W at VddNOM. P Assign does rely on such an exhaustive search, but conducts it in a
way that prunes the search space and significantly reduces the runtime complexity. Specifically,
P Assign repeatedly picks one free cluster i (which can cycle at most at fMAX i) and combines it
with the best selection of |E| − 1 clusters among those that can cycle faster than i, to arrive
at the ensemble E that maximizes MIPS/W:
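\[
\max_E \left( \frac{MIPS}{Watt} \right)
\equiv \min_E \left( \frac{Watt}{MIPS} \right)
\equiv \min_E \left( \frac{\sum_E P_{STA} + \sum_E P_{dyn}}{IPC \times |E| \times ClSize \times f_{MAX\,i}} \right)
\equiv \min_E \left( \frac{\sum_E P_{STA} + C \times V_{ddNOM}^{2} \times |E| \times f_{MAX\,i}}{IPC \times |E| \times ClSize \times f_{MAX\,i}} \right)
\tag{4.1}
\]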
At the time cluster i is considered, all variables of this formula are known except ∑E PSTA, the
total PSTA of the ensemble E to be formed. We know the f of E, fMAX i, as set by the slowest cluster,
namely cluster i; the operating Vdd is fixed chip-wide to VddNOM; the number of cores requested
determines ensemble size |E|; an estimate of IPC( fMAX i) is already available; and finally, C, aver-
age cluster capacitance, is proportional to the area, and does not depend on the selection. ∑E PSTA,
on the other hand, changes with the selection of the clusters to form E. Thus, for each cluster i
considered, the ensemble that maximizes MIPS/W, maxE(MIPS/W), reduces to the ensemble of
the clusters that deliver min(∑E PSTA). The pseudo-code for P Assign is given in Algorithm 1: E∗
is the optimal ensemble of |E| clusters, while E is one selected candidate ensemble of |E| clusters.
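To make the reduction explicit, the objective of Equation 4.1 can be split into two terms for a fixed slowest cluster i. This is only a rearrangement of the formula above, with IPC standing for the estimate at fMAX i:

```latex
\[
\frac{\sum_E P_{STA} + C \times V_{ddNOM}^{2} \times |E| \times f_{MAX\,i}}
     {IPC \times |E| \times ClSize \times f_{MAX\,i}}
=
\frac{\sum_E P_{STA}}{IPC \times |E| \times ClSize \times f_{MAX\,i}}
+
\frac{C \times V_{ddNOM}^{2}}{IPC \times ClSize}
\]
```

The second term and the denominator of the first term are the same for every candidate ensemble whose slowest cluster is i, so only ∑E PSTA remains to be minimized.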
Algorithm 1 Pseudo-code for P Assign.
E∗ ← ∅
for each free cluster i in the chip do
    /* cluster i cycles at fMAX i */
    find the other |E| − 1 free clusters faster than i that have the minimum ∑E−1 PSTA
    /* these clusters can all cycle at fMAX i or higher */
    E ← { i ∪ these |E| − 1 clusters }
    if (MIPS/W(E) > MIPS/W(E∗)) then
        E∗ ← E
    end if
end for

P Assign can be implemented inexpensively if the clusters are ordered offline from lowest to
highest PSTA, and from highest to lowest fMAX. As P Assign picks one cluster i at a time, it only
needs to select, among those with higher fMAX, the |E| − 1 ones that have the lowest PSTA. It then
computes the MIPS/W of the ensemble. This process is repeated once for each available cluster i,
and the ensemble with the highest MIPS/W is picked.
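The search described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the evaluation: all function and parameter names are hypothetical, a single IPC estimate is used for every candidate f, clusters with equal fMAX are treated as "faster", the capacitance `c_eff` is expressed in W per MHz per V² so that f can be given in MHz, and the clusters are sorted at call time rather than ordered offline. The input values are made up.

```python
import math


def mips_per_watt(ensemble, f_mhz, p_sta, ipc, cl_size, c_eff, vdd_nom):
    """MIPS/W of an ensemble clocked at f_mhz, following Equation 4.1."""
    size = len(ensemble)
    static_w = sum(p_sta[c] for c in ensemble)
    dynamic_w = c_eff * vdd_nom ** 2 * size * f_mhz  # C x Vdd^2 x |E| x f
    mips = ipc * size * cl_size * f_mhz              # IPC x cores x MHz
    return mips / (static_w + dynamic_w)


def p_assign(free, f_max, p_sta, n_cores, ipc, cl_size, c_eff, vdd_nom):
    """Return (ensemble, f_mhz, score) with the highest MIPS/W, or None."""
    size = math.ceil(n_cores / cl_size)
    best = None
    for i in free:
        # Candidate companions: free clusters that can cycle at fMAX_i or higher.
        faster = [c for c in free if c != i and f_max[c] >= f_max[i]]
        if len(faster) < size - 1:
            continue  # cluster i cannot be the slowest member of a full ensemble
        # Among those, the |E| - 1 clusters with the lowest static power.
        companions = sorted(faster, key=lambda c: p_sta[c])[: size - 1]
        ensemble = [i] + companions
        score = mips_per_watt(ensemble, f_max[i], p_sta, ipc, cl_size, c_eff, vdd_nom)
        if best is None or score > best[2]:
            best = (ensemble, f_max[i], score)
    return best


# Made-up inputs: fMAX in MHz and PSTA in W for six free clusters.
F_MAX = {1: 1000, 2: 900, 3: 1100, 4: 800, 5: 950, 6: 1050}
P_STA = {1: 1.0, 2: 0.6, 3: 1.5, 4: 0.4, 5: 0.7, 6: 1.2}

result = p_assign(sorted(F_MAX), F_MAX, P_STA, n_cores=16, ipc=1.0,
                  cl_size=8, c_eff=0.002, vdd_nom=0.55)
print(result)
```

With these made-up numbers the two slowest, least leaky clusters win. That outcome depends entirely on the chosen PSTA and capacitance values, so it should not be read as a general trend.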
We design two P Assign algorithms that differ in the priority given to choosing contiguous
clusters for an ensemble. P Assign NC (for non-contiguous) is the algorithm just described, which
gives low priority to selecting contiguous clusters; P Assign C (for contiguous) gives high priority
to picking contiguous clusters. It does so by picking the ensemble of the clusters that deliver
min(∑E Di × PSTA) as opposed to min(∑E PSTA). In this formula, Di is the wavefront distance
between a given component cluster of E and cluster i (Figure 4.1(d)). Consequently, clusters that
are far apart are penalized and avoided. In this case, a practical implementation would order
clusters not based on PSTA, but on Di × PSTA.
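The distance-weighted selection of P Assign C can be sketched as below. Two assumptions are made here that the text does not state: the wavefront distance is taken to be the Chebyshev (ring) distance on the cluster grid, which is one reading of Figure 4.1(d), and torus wrap-around is ignored. Clusters are numbered row by row starting at 1, as in Figure 4.1(b); all names and values are hypothetical.

```python
def wavefront_distance(a, b, cols=6):
    """Ring distance between clusters a and b on a row-major grid (assumed metric)."""
    row_a, col_a = divmod(a - 1, cols)
    row_b, col_b = divmod(b - 1, cols)
    return max(abs(row_a - row_b), abs(col_a - col_b))


def pick_companions(i, free, f_max, p_sta, count, cols=6):
    """Pick `count` free clusters at least as fast as i, minimizing D_i x PSTA."""
    faster = [c for c in free if c != i and f_max[c] >= f_max[i]]
    if len(faster) < count:
        return None

    def weighted_cost(c):
        return wavefront_distance(i, c, cols) * p_sta[c]

    return sorted(faster, key=weighted_cost)[:count]


# Made-up example: clusters 8 and 9 are adjacent to cluster 15, while
# clusters 1 and 36 leak less but are farther away.
F_MAX = {1: 1000, 8: 1000, 9: 1000, 15: 1000, 36: 1000}
P_STA = {1: 0.7, 8: 1.0, 9: 1.2, 15: 0.9, 36: 0.5}

companions = pick_companions(15, [1, 8, 9, 36], F_MAX, P_STA, count=2)
print(companions)
```

In this example a pure PSTA ordering would have picked clusters 36 and 1. The distance weighting is a soft preference, so a distant cluster can still be chosen when its PSTA is low enough.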
Both P Assign NC and P Assign C have a low overhead (Section 4.6.3 quantifies the number of
instructions) and scale with O(N²), where N is the number of clusters per chip. Hence, they are
fully practical.
For reference, we will compare them to two greedy algorithms: The non-contiguous version,
Greedy NC, picks free clusters in increasing order of PSTA/ fMAX. The contiguous version, Greedy C,
on the other hand, first assigns the free cluster of minimum PSTA/ fMAX as the center, and ex-
pands the ensemble along wavefronts of progressively increasing distance from the center. At
each wavefront, Greedy C picks clusters in increasing order of PSTA/ fMAX. These algorithms scale
with O(N).
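The two greedy baselines can be sketched as follows. As before, the wavefront distance is assumed to be the ring (Chebyshev) distance on a row-major grid, and all names and values are hypothetical. This sketch sorts at call time for brevity, so it does not reproduce the O(N) cost quoted above, which would presumably rely on orderings prepared ahead of time.

```python
def ring_distance(a, b, cols):
    """Ring distance on a row-major grid numbered from 1 (assumed metric)."""
    row_a, col_a = divmod(a - 1, cols)
    row_b, col_b = divmod(b - 1, cols)
    return max(abs(row_a - row_b), abs(col_a - col_b))


def greedy_nc(free, f_max, p_sta, size):
    """Pick free clusters in increasing order of PSTA / fMAX."""
    if len(free) < size:
        return None
    return sorted(free, key=lambda c: p_sta[c] / f_max[c])[:size]


def greedy_c(free, f_max, p_sta, size, cols):
    """Start at the cluster of minimum PSTA / fMAX and expand by wavefronts."""
    if len(free) < size:
        return None

    def ratio(c):
        return p_sta[c] / f_max[c]

    center = min(free, key=ratio)
    rest = [c for c in free if c != center]
    # Sorting by (distance, ratio) visits wavefronts in order of increasing
    # distance and, within a wavefront, clusters in increasing PSTA / fMAX.
    rest.sort(key=lambda c: (ring_distance(center, c, cols), ratio(c)))
    return [center] + rest[: size - 1]


# Made-up 3 x 3 chip: equal fMAX (MHz), varying PSTA (W).
F_MAX = {c: 1000 for c in range(1, 10)}
P_STA = {1: 0.1, 2: 0.8, 3: 0.2, 4: 0.7, 5: 0.9, 6: 0.6, 7: 0.3, 8: 0.5, 9: 0.4}
FREE = list(range(1, 10))

nc = greedy_nc(FREE, F_MAX, P_STA, size=3)
co = greedy_c(FREE, F_MAX, P_STA, size=3, cols=3)
print(nc, co)
```

In either case the ensemble would then run at the minimum fMAX of the selected clusters, as for any ensemble.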
In contrast, a baseline many-core with per-cluster Vdd and f domains needs more complicated
assignment algorithms. Specifically, assume that a job needs an ensemble of |E| clusters. We need
to find the set of clusters for the ensemble and the ensemble’s Vdd and f that maximize MIPS/W,
assuming a single Vdd and f domain per ensemble to reduce complexity. To do so, we repeatedly
pick one cluster i (which needs at least VddMIN i and can only cycle at fMAX i at VddMIN i). Then,
we try to combine it with all of the possible groups of |E| − 1 clusters among those that have
a VddMIN lower than VddMIN i. Recall that no cluster can operate safely below its designated
VddMIN . For each of these combinations, we set the ensemble’s Vdd to VddMIN i (which is the
highest one), raise the f of each of the |E| − 1 clusters accordingly, and set the ensemble’s f to
the minimum of the resulting f of all of the |E| clusters. We then compute the MIPS/W. As we
proceed, we pick the best selection for this i, and then the best selection over all possible i. We call
this algorithm MultipleVf NC. It has a higher overhead than P Assign and scales with O(N³).
4.4 The Challenge of Applying Fine-Grain DVFS
A second challenge of not using multiple Vdd domains is that, without on-chip voltage regulators,
the speed at which a DVFS algorithm can change Vdd is lower. Specifically, according to Kim et
al. [85], on-chip regulators can change Vdd at a rate of about 30 mV/ns. On the other hand, off-
chip regulators take over two orders of magnitude longer to change the same Vdd magnitude. For
example, a very conservative estimate is given by Intel’s guidelines, which assume that off-chip
regulators take 1.25 µs to change Vdd by 6.25 mV [86]. The reason for the lower speed is the higher
latency of the communication between the CPU and the off-chip Vdd regulator. A Vdd regulator
must sense the Vdd applied to cores to adjust its output Vdd accordingly; however, sensing the
Vdd from off-chip is much slower than from on-chip. This limits the speed of Vdd regulators and
thus requires a bulkier inductor and capacitor to support high efficiency. Overall, this inability to
change Vdd fast could result in less energy-efficient DVFS operation under Polyomino than under
multiple Vdd domains.
In practice, however, this issue may not have a significant effect on the execution’s energy effi-
ciency. One reason is that the Vdd changes needed at NTV are likely to be small most of the time.
This is because even modest Vdd changes quickly bring the execution to regimes that are either
too energy-inefficient or too slow. Moreover, the value of Vdd at NTV likely needs to be capped
for reliability reasons. In addition, the algorithm for selecting the DVFS levels in an environment
with multiple Vdd domains will be more complicated and have non-negligible overhead. Finally,
the speed of DVFS changes is limited by the PLL re-locking time for f change, which is at least 10
µs [87]. Very fast Vdd regulators do not eliminate this critical path. All of these effects may hide
the higher latencies of off-chip regulators. Section 4.6.4 quantifies the tradeoffs.
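The latencies quoted above can be put side by side with a short calculation. The rates come from the text; the 50 mV step is a hypothetical example chosen here, and the off-chip figure assumes that the conservative guideline scales linearly with the size of the Vdd change.

```python
ON_CHIP_MV_PER_NS = 30.0   # on-chip regulator slew rate, Kim et al. [85]
OFF_CHIP_STEP_MV = 6.25    # off-chip regulator step, Intel guideline [86]
OFF_CHIP_STEP_US = 1.25    # time per off-chip step
PLL_RELOCK_US = 10.0       # lower bound on PLL re-locking time [87]


def on_chip_time_us(delta_mv):
    """Time for an on-chip regulator to move Vdd by delta_mv."""
    return delta_mv / ON_CHIP_MV_PER_NS / 1000.0


def off_chip_time_us(delta_mv):
    """Time for an off-chip regulator, assuming linear scaling of the guideline."""
    return delta_mv / OFF_CHIP_STEP_MV * OFF_CHIP_STEP_US


delta_mv = 50.0  # hypothetical Vdd change
t_on = on_chip_time_us(delta_mv)
t_off = off_chip_time_us(delta_mv)
print(f"on-chip:  {t_on * 1000:.2f} ns")
print(f"off-chip: {t_off:.2f} us ({t_off / t_on:.0f}x slower)")
print(f"PLL re-lock: at least {PLL_RELOCK_US:.0f} us")
```

For a step of this size, the off-chip transition time is of the same order as the PLL re-locking time, which is the point made above about fast regulators not removing the critical path.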
4.5 Evaluation Setup
We evaluate Polyomino by modeling an 11 nm NTC chip with 288 cores. The chip is organized in
clusters. A cluster has 8 cores (each with a per-core private memory) and a cluster memory. The
technology parameters are derived from ITRS and from projected trends from industry. Table 4.2
shows the technology and architecture parameters. The nominal values of Vdd and f are 0.55 V
and 1.0 GHz (which would correspond to about 0.77 V and 3.3 GHz for STC). To model variation,
we use the VARIUS-NTV [88] model, using (σ/µ)Vth = 15%, (σ/µ)Leff = 7.5%, and φ = 0.1.
Every single experiment is repeated for 100 chips with the same variation parameters but different
variation profiles, and we present the average. More samples beyond 100 do not change the results
noticeably.
Table 4.2: Technology and architecture parameters.
System parameters
    Technology node: 11 nm
    Number of cores: 288
    Number of clusters: 36 (8 cores/cluster)
    PMAX = 100 W
    TMAX = 80 °C
    Chip area ≈ 20 mm × 20 mm
Variation parameters
    Correlation range: φ = 0.1
    Sample size: 100 chips
    Total (σ/µ)Vth = 15% (equal contribution of systematic & random)
    Total (σ/µ)Leff = 7.5% (equal contribution of systematic & random)
Technology parameters
    VddNOM = 0.55 V
    VthNOM = 0.33 V
    fNOM = 1.0 GHz
    f network = 0.8 GHz
    On-chip Vdd regulator: 15% P loss
    Vdd guard-band for V noise: 5% base, + 5% if multiple Vdd domains
Architectural parameters
    Core-private mem: 64 kB WT, 4-way, 2 ns access, 64 B line
    Cluster mem: 2 MB WB, 16-way, 10 ns access, 64 B line
    Network: bus inside cluster and 2D-torus across clusters
    Coherence: directory-based MESI
    Crossing an f domain boundary: 2 ns
    Average mem round-trip access time (before contention): ≈80 ns
    Number of memory controllers: 8
Each core is a single-issue engine where memory accesses can be overlapped with each other
and with computation. The per-core memories are private L1 caches, while the cluster memories
are shared L2 caches. We use a full-mapped directory-based MESI coherence protocol where each
pointer corresponds to one cluster. The on-chip network is a bus inside a cluster and a 2D-torus
across clusters. The chip has eight memory controllers.
To evaluate performance and power consumption, we interface Pin over a user-level pthreads
library to the SESC [42] cycle-level architectural simulator. The power analysis relies on McPAT [70]
scaled to 11 nm. HotSpot is used to model the temperature [71]. The algorithms for core assign-
ment in the many-core are implemented in R [72].
For our experiments, we run multi-programmed workloads that contain the following PARSEC
applications: blackscholes, ferret, fluidanimate, raytrace, swaptions, canneal, dedup, and
streamcluster. Each application runs in parallel with a thread count that can range from 8 to 64 threads.
We measure the complete parallel sections of the applications (i.e., region of interest) running the
standard simsmall input data set. We report the average performance (e.g., in MIPS) or average
energy efficiency (e.g., in MIPS/W).
4.6 Evaluation
In this section, we first show the variation observed in NTC Polyomino chips, and then examine
the effect of not having multiple Vdd domains, the impact of the core-allocation algorithms, and
the implications on fine-grain DVFS. Finally, we perform a sensitivity analysis of the architecture.
4.6.1 Variation Observed in NTC Polyomino Chips
To understand the results in the rest of the evaluation, this section assesses variation in VddMIN ,
frequency (f), and static power (PSTA) in Polyomino chips. Each experiment covers three different
values of process variation, (σ/µ)Vth = 12, 15, and 17%, at two different scopes: (i) across the clus-
ters within a representative Polyomino chip, namely, the chip with the median chip-wide maximum
VddMIN among the 100 chips analyzed; and (ii) across 100 Polyomino chips.
Figure 4.2(a) depicts variation in cluster-wide maximum VddMIN across the 36 clusters of a
representative Polyomino chip. Higher values of (σ/µ)Vth render not only a higher VddMIN per
cluster, but also a higher variation in VddMIN across clusters (Figure 4.2(a)). Figure 4.2(b) char-
acterizes variation in the chip-wide maximum VddMIN across the 100 Polyomino chips analyzed.
Both maximum chip-wide VddMIN and variation in maximum VddMIN across chips increase with
increasing (σ/µ)Vth (Figure 4.2(b)).
Figure 4.2: Variation of VddMIN within a representative Polyomino chip (a), across 100 chips
analyzed (b). Panel (a) plots VddMIN (V) against cluster ID and panel (b) plots VddMIN (V) against
chip ID, each for (σ/µ)Vth = 0.12, 0.15, and 0.17.
Maximum VddMIN at different scopes (i.e., cluster-wide or chip-wide) evolves as a function of
(σ/µ)Vth, and determines the safe operating Vdd(s) per chip. Variation in f and PSTA, in turn,
depends tightly on the operating Vdd(s). To demonstrate the impact of (σ/µ)Vth on f and PSTA
variation under a given fixed operating Vdd, Figures 4.3 and 4.4 deploy the chip-wide maximum
VddMIN for (σ/µ)Vth = 0.17 as the safe, fixed operating Vdd per chip across all configurations.
We show kernel density estimates for each parameter.¹ In each plot, the x-axis shows f or PSTA
normalized to its variation-free counterpart.
Figure 4.3(a) shows variation in cluster f, as determined by the slowest unit within each cluster,
across the 36 clusters within a representative Polyomino chip. Across-cluster variation in cluster
PSTA is given in Figure 4.4(a). On the other hand, Figure 4.3(b) depicts variation in chip-wide
median cluster f; Figure 4.4(b), in chip PSTA, across 100 Polyomino chips. We observe that, as
(σ/µ)Vth increases, the spread of the distributions, hence the variation in f or PSTA, increases.
Let us next explore how f and PSTA vary at different scopes considering the actual operating Vdd
for different (σ/µ)Vth, as opposed to a fixed Vdd across all configurations. Figure 4.5(a) shows
variation in cluster f; Figure 4.6(a), in cluster PSTA, across the 36 clusters within a representative
Polyomino chip. On the other hand, Figure 4.5(b) depicts variation in chip-wide median cluster f;
Figure 4.6(b), in chip PSTA, across 100 Polyomino chips.
For each chip, all f and PSTA are evaluated at the chip-wide maximum VddMIN ; for the rep-
resentative chip, VddMIN = 416.1 mV, 560.4 mV, and 677.5 mV, for (σ/µ)Vth = 12, 15, and 17%,
respectively. Chip-wide maximum VddMIN represents the safe operating Vdd across all clusters.
¹ Kernel density estimates can be regarded as smooth correspondents of histograms. While the area
under each curve should be equal to 1, the density function can assume values >1.
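The footnote's point can be checked numerically with a small Gaussian kernel density estimate. This is a generic sketch in plain Python, not the tool used to produce the plots; the sample values and the bandwidth are made up for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels of fixed bandwidth."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up normalized-f samples, tightly clustered around 1.0.
samples = [0.90, 0.95, 0.97, 1.00, 1.00, 1.02, 1.03, 1.05, 1.10]
density = gaussian_kde(samples, bandwidth=0.04)

# Evaluate on a grid from 0.5 to 1.5 and integrate with the trapezoid rule.
step = 0.001
grid = [0.5 + step * k for k in range(1001)]
values = [density(x) for x in grid]
area = sum(0.5 * (values[k] + values[k + 1]) * step for k in range(1000))
peak = max(values)

print(f"area under curve: {area:.4f}, peak density: {peak:.2f}")
```

Because the samples span a range much narrower than 1, the curve has to rise well above 1 for its area to reach 1.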
Figure 4.3: Kernel density of f within a representative Polyomino chip (a), across 100 chips
analyzed (b). Both panels plot density against normalized f for (σ/µ)Vth = 0.12, 0.15, and 0.17.
Figure 4.4: Kernel density of PSTA within a representative Polyomino chip (a), across 100 chips
analyzed (b). Both panels plot density against normalized PSTA for (σ/µ)Vth = 0.12, 0.15, and 0.17.
Figure 4.5: Variation of f within a representative chip (a), across 100 chips analyzed (b).
Panel (a) plots f (GHz) against cluster ID and panel (b) plots f (GHz) against chip ID, each for
(σ/µ)Vth = 0.12, 0.15, and 0.17.
Recall that no Polyomino chip can operate at the high Vdd and fs imposed by (σ/µ)Vth = 17%
without exceeding the power budget. Due to the higher values of chip-wide maximum VddMIN ,
progressively higher values of (σ/µ)Vth not only induce higher cluster f and cluster PSTA, but also
increase variation in cluster f and cluster PSTA across clusters (Figure 4.5(a), Figure 4.6(a)).
Similarly, with increasing (σ/µ)Vth, both the chip-wide median cluster f and the chip PSTA
increase, as does their variation across chips (Figure 4.5(b), Figure 4.6(b)).
Figure 4.6: Variation of PSTA within a representative chip (a), across 100 chips analyzed (b).
Panel (a) plots PSTA (W) against cluster ID and panel (b) plots PSTA (W) against chip ID, each for
(σ/µ)Vth = 0.12, 0.15, and 0.17.
The rest of the evaluation uses the default (σ/µ)Vth = 15% and the representative chip.
4.6.2 Effect of Not Having Multiple Vdd Domains
To see the effect of not supporting multiple Vdd domains in Polyomino, we compare the NTC
environments of Table 4.3. Perf is a perfect environment with a Vdd and an f domain per cluster,
without Vdd overheads. Eff is Perf plus the power inefficiencies of the on-chip Vdd regulators. By
default, we use a 15% power loss in the regulators, which is in the ballpark of the designs
analyzed by Kim et al. [81], which have a peak power efficiency of 77%. Recall that the power loss
of on-chip regulation changes as a function of the output Vdd. However, throughout the experiments, we
deploy the peak efficiency independent of the output Vdd to favor environments with multiple
Vdd domains. EffCoa is Eff with coarser Vdd domains (four clusters per domain) due to practical
implementation issues. EffCoaDyn is EffCoa plus a larger Vdd guard-band. This is to tolerate po-
tentially deeper dynamic Vdd droops — resulting from the likely lower capacitance of individual
domains, compared to a large chip with a single Vdd domain. Extrapolating from James et al. [82],
we add an additional 5% guard-band over the base 5% Vdd-noise guard-band used for the chip
with a single Vdd domain [85]. Finally, Polyomino has a single Vdd domain. Note that all of the
environments have one off-chip Vdd regulator.
Table 4.3: Environments analyzed to assess Vdd domains at NTV.
Environment | Multiple Vdd domains? | Description
Perf | Yes | Perfect environment with per-cluster Vdd and f domains. No Vdd overheads.
Eff | Yes | Perf + power loss due to Vdd regulation.
EffCoa | Yes | Eff with coarse Vdd domains. Each one includes four clusters.
EffCoaDyn | Yes | EffCoa + a larger Vdd guard-band to handle deeper dynamic Vdd droops.
Polyomino | No | Single Vdd domain. Each cluster is an f domain.
Figure 4.7 compares the MIPS/W of our 288-core NTC chip for the different environments.
We consider two scenarios: one where we use all 36 clusters (Figure 4.7(a)) and one where we
use only 18 (Figure 4.7(b)). The workload consists of combinations of our PARSEC applications,
where each PARSEC code runs with eight threads on one cluster. We report the geometric mean
of MIPS/W across clusters. In each plot, we perform a sensitivity analysis of different power
inefficiencies for the Vdd regulators (5%, 10%, 15%, 20%, and 25%, where 15% is the default). To
make the comparison fair, the total power consumed by the chip and Vdd regulators in all of the
environments and efficiency points is kept constant.
Figure 4.7: Normalized MIPS/W in the different environments for workloads that use all 36
clusters (a) or only 18 (b). We consider different Vdd regulator inefficiencies. Each panel plots
normalized MIPS/W for Perf, Eff, EffCoa, EffCoaDyn, and Polyomino at regulator power losses of 5%,
10%, 15%, 20%, and 25%; panel (a) shows full utilization and panel (b) 50% utilization.
For each environment, we compute the sustainable per-domain Vdd and f as per Section 4.2.2:
SRAM hold and write stability determine Vdd, and then timing tests in the SRAM and logic de-
termine f. In this section, we are after the maximum MIPS/W each environment can achieve.
To exclude sub-optimal operation due to algorithmic imperfections, we deploy oracular core-
assignment algorithms tailored for each environment. The combinatorial optimization algorithms
are based on the Hungarian method [89]. We ignore all algorithmic overheads. Implicitly, this ap-
proach favors the non-Polyomino environments, which require more complex core-assignment
algorithms due to the larger number of degrees of freedom (as induced by multiple voltage val-
ues per chip). Finally, after we compute the MIPS/W of each environment, we plot it normalized
to that of Perf for the given utilization profile (100% or 50%).
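To illustrate the kind of oracular core assignment described above, the following minimal sketch
picks the application-to-cluster mapping that maximizes the summed MIPS/W. It is not the
implementation used in this work: for brevity, it relies on an exhaustive search over permutations,
which finds the same optimum as the Hungarian method [89] on tiny instances but does not scale.
The efficiency matrix and all function names are hypothetical.

```python
from itertools import permutations


def best_assignment(efficiency):
    """Return (best_total, assignment) maximizing the summed MIPS/W.

    efficiency[a][c] is a hypothetical MIPS/W estimate of application a
    on cluster c; assignment[a] is the cluster chosen for application a.
    Exhaustive search stands in for the Hungarian method here.
    """
    n_apps = len(efficiency)
    n_clusters = len(efficiency[0])
    best_total, best_perm = float("-inf"), None
    # Enumerate every way of giving each application its own cluster.
    for perm in permutations(range(n_clusters), n_apps):
        total = sum(efficiency[a][perm[a]] for a in range(n_apps))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical 3-application, 3-cluster instance.
eff = [
    [9.0, 7.0, 4.0],
    [8.0, 5.0, 5.0],
    [3.0, 6.0, 7.0],
]
total, assignment = best_assignment(eff)
```

In this made-up instance, giving application 0 its individually best cluster (cluster 0) is not
optimal overall, which is why a combinatorial method is needed rather than a per-application choice.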
Consider first the fully utilized chip (Figure 4.7(a)). Polyomino only attains about 81% of the
MIPS/W of Perf. This is because it does not exploit the multiple Vdd domains of Perf and, there-
fore, operates less energy-efficiently. However, in the environments with multiple Vdd domains,
as we go from Perf to Eff, EffCoa, and EffCoaDyn, progressively adding more realistic
assumptions, the MIPS/W keeps decreasing. By the time we reach a realistic environment (EffCoaDyn), its
MIPS/W at 15% regulator power loss is 64% of Perf’s, significantly lower than Polyomino’s.
As we vary the regulator power losses between 5% and 25%, we see that Polyomino always
has a higher MIPS/W than the realistic EffCoaDyn. For the multiple Vdd domain chip to beat
Polyomino, it must not require any increase in Vdd guard-band (EffCoa) and use regulators with
only 5% power losses. Alternatively, it must support per-cluster Vdd domains and no increase in
Vdd guard-band (Eff) and use regulators with at most 10% power losses.
Figure 4.7(b) repeats the experiment when we only need to use half of the clusters. We see
similar trends for all the non-Perf environments except that the drop in MIPS/W is not as large.
The reason is that each environment now picks a subset of energy-efficient clusters — leaving
energy-inefficient ones idle. Polyomino attains 95% of the MIPS/W of Perf, while EffCoaDyn only
attains 79% at 15% regulator power loss.
4.6.3 Core Assignment in Polyomino
The previous section showed that Polyomino delivers higher MIPS/W than a realistic imple-
mentation of multiple Vdd domains. In this section, we explore how to best use a Polyomino
environment by evaluating the different Polyomino core-assignment algorithms of Section 4.3.2:
P_Assign_NC, P_Assign_C, Greedy_NC, and Greedy_C. As a reference, we also show MultipleVf_NC
applied to EffCoaDyn for three different Vdd regulator inefficiencies (5%, 10%, and 15%). In the
previous section, the workload contained mixes of 8-threaded programs only, to expose the im-
pact of per-cluster Vdd and f domains. Now, we consider a flexible workload composed of a mix
of applications from PARSEC, such that one runs with 64 threads, one with 32, one with 16, and
one with 8. The mix uses a total of 15 clusters, and we measure MIPS/W. We run many experi-
ments, forming similar mixes with all of the possible permutations of the PARSEC programs, and
report the average MIPS/W. The power budget is set to the worst-case power consumed by all 36
clusters.
Our Polyomino algorithms are light-weight. For these experiments, P_Assign_NC, P_Assign_C,
Greedy_NC, and Greedy_C execute, on average, 7835, 7912, 233, and 326 assembly instructions per
run, respectively. The corresponding number for MultipleVf_NC is 153,100.
Figure 4.8 shows the MIPS/W attained by the algorithms for three different many-core utiliza-
tion scenarios: In the first one, all the clusters are available at the time the workload is launched
(0% busy clusters). Hence, the whole power budget is available. The second and third scenarios
assume that 25% and 50% of the clusters, respectively, are already busy running another, existing
load when we launch our workload. The existing load consists of 8-threaded runs of our
highest-IPC application, namely dedup. For the 25% and 50% scenarios, the figure augments the
bars with ranges. The range top corresponds to when the busy clusters were the least MIPS/W-
efficient ones; the range bottom, when they were the most MIPS/W-efficient ones. This choice
matters because it sets the power budget available, and our load runs on the remaining clusters.
In the figure, all the bars are normalized to P_Assign_NC with 0% busy clusters.
Figure 4.8 shows that P_Assign_NC delivers the highest MIPS/W. P_Assign_C is about 5% worse.
The reason is that the advantages of picking close-by clusters are offset by the fact that some of
these clusters are not the most energy-efficient ones available. On the other hand, the energy
efficiencies of the greedy algorithms remain, on average, within 5% of those
of the corresponding P_Assign variants. Hence, while P_Assign’s more accurate algorithms are
worthwhile, lighter-weight greedy algorithms can also be safely adopted.
The algorithm for the multiple Vdd domains (MultipleVf_NC) delivers a lower MIPS/W than
P_Assign for 5%, 10%, and, especially, 15% power inefficiency of the Vdd regulator. For the 15%
inefficiency, MultipleVf_NC delivers around 78% of the MIPS/W of P_Assign_NC. Even though
MultipleVf_NC’s overhead is still tolerable for our chip size, it suffers because the many-core’s
energy efficiency is low. Overall, P_Assign_NC (or P_Assign_C) is the most effective despite not
supporting multiple Vdd domains.
[Figure 4.8 plot omitted. Y-axis: Normalized MIPS/W (0.5 to 1.0). X-axis: % Already busy clusters
(0%, 25%, 50%). Series: P_Assign_NC, P_Assign_C, Greedy_NC, Greedy_C, MultipleVf_NC_5,
MultipleVf_NC_10, MultipleVf_NC_15.]
Figure 4.8: MIPS/W attained by different core-assignment algorithms if 0%, 25%, or 50% of the
clusters were already busy initially. For the latter two, the top (bottom) error bar depicts the
MIPS/W if the least (most) energy-efficient clusters were initially busy.
As we increase the utilization of the chip, the impact of initial load increases, and the rate of
MIPS/W delivered goes down. However, the relative trends across environments remain. One
observation is that the MIPS/W delivered depends significantly on what clusters were busy with
other work.
4.6.4 Implications for Fine-Grain DVFS
We next compare the application of fine-grain DVFS in the different environments of Table 4.3.
We want to show the difference in the maximum MIPS that each environment can attain; hence,
we rely on perfect algorithms, ignoring all algorithmic overheads. We consider different intervals
between DVFS adaptations (from 0.1 ms to 10 ms) and two classes of workloads (one that uses
16 clusters and one that uses 24). A workload consists of a group of 8-threaded applications from
PARSEC, where each application runs on a cluster. First, we identify the most MIPS/W-efficient
(16 or 24) clusters, because these are expected to respond with the highest ∆f increase to a given
∆Vdd increase (and to the corresponding increase in power). Hence, these clusters are expected
to deliver the maximum MIPS by application of DVFS. Next, the applications are statically
assigned to this subset of most MIPS/W-efficient clusters using the Hungarian algorithm [89]. The
assignment is then fixed.
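The first step above, identifying the most MIPS/W-efficient clusters, amounts to a ranking. A
minimal sketch follows; the per-cluster efficiency values and the function name are hypothetical,
and the tie-breaking rule is an assumption made only to keep the result deterministic.

```python
def most_efficient_clusters(mips_per_watt, k):
    """Return the indices of the k clusters with the highest MIPS/W.

    mips_per_watt[i] is a hypothetical efficiency estimate for cluster i.
    Ties are broken in favor of the lower cluster index (an assumption).
    """
    if not 0 <= k <= len(mips_per_watt):
        raise ValueError("k must be between 0 and the number of clusters")
    # Sort cluster indices by descending efficiency, then ascending index.
    ranked = sorted(range(len(mips_per_watt)),
                    key=lambda i: (-mips_per_watt[i], i))
    return sorted(ranked[:k])


# Hypothetical normalized MIPS/W for six clusters.
efficiency = [0.82, 1.10, 0.95, 1.05, 0.70, 1.00]
chosen = most_efficient_clusters(efficiency, 3)
```

The selected subset would then be handed to the assignment step, which maps applications onto
these clusters.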
Since using a specific DVFS algorithm to dynamically tune Vdd and f can render sub-optimal
operating points, we use a best-case Oracle algorithm which relies on non-linear optimization
in finding fs and Vdds [90]. Moreover, we disregard any overhead involved in computing the
next interval’s Vdd and f — which favors the more complex multi-Vdd environments with more
degrees of freedom such as EffCoaDyn. Specifically, at the beginning of each interval, we assume
we know the average IPCs of the applications in the interval. Then, we find the f and Vdd for each
cluster (or a single Vdd in Polyomino) that delivers the highest overall MIPS — while remaining
within the chip’s power envelope and within the Vdd cap (aggressively set to 20% higher than the
VddNOM of Polyomino).
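For the single-Vdd (Polyomino) case, the per-interval search described above can be sketched as
a one-dimensional grid search. This is only an illustration under stated assumptions: it replaces
the non-linear optimization of [90] with a brute-force scan, and the frequency and power models
(frequency linear in Vdd minus Vth, dynamic power proportional to Vdd² × f, static power
proportional to Vdd) as well as every parameter name are hypothetical simplifications.

```python
def best_single_vdd(clusters, power_budget, vdd_nom, vdd_min, steps=200):
    """Scan one chip-wide Vdd and keep the value that maximizes total MIPS.

    Each cluster is a dict of hypothetical parameters:
      ipc   - average IPC in the interval (assumed known, as for the Oracle)
      vth   - effective threshold voltage
      k_f   - frequency gained per volt of (Vdd - Vth)
      c_dyn - dynamic power coefficient (power = c_dyn * Vdd^2 * f)
      leak  - static power coefficient (power = leak * Vdd)
    Returns (vdd, total_mips), or None if no candidate fits the budget.
    """
    vdd_cap = 1.2 * vdd_nom  # Vdd cap: 20% above nominal
    best = None
    for step in range(steps + 1):
        vdd = vdd_min + (vdd_cap - vdd_min) * step / steps
        # Per-cluster frequency: each cluster is its own f domain.
        freqs = [c["k_f"] * max(vdd - c["vth"], 0.0) for c in clusters]
        power = sum(c["c_dyn"] * vdd ** 2 * f + c["leak"] * vdd
                    for c, f in zip(clusters, freqs))
        if power > power_budget:
            continue  # violates the chip power envelope
        mips = sum(c["ipc"] * f for c, f in zip(clusters, freqs))
        if best is None or mips > best[1]:
            best = (vdd, mips)
    return best


# Two hypothetical clusters with slightly different Vth due to variation.
clusters = [
    {"ipc": 1.0, "vth": 0.30, "k_f": 2000.0, "c_dyn": 1e-3, "leak": 0.5},
    {"ipc": 0.8, "vth": 0.35, "k_f": 1800.0, "c_dyn": 1e-3, "leak": 0.6},
]
unconstrained = best_single_vdd(clusters, 10.0, vdd_nom=0.55, vdd_min=0.45)
constrained = best_single_vdd(clusters, 1.0, vdd_nom=0.55, vdd_min=0.45)
infeasible = best_single_vdd(clusters, 0.5, vdd_nom=0.55, vdd_min=0.45)
```

With a generous budget the scan settles at the Vdd cap; with a tighter budget it stops at the
highest Vdd that still fits the power envelope. A multi-Vdd environment would need one such
variable per domain, which is where the extra degrees of freedom come from.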
Figure 4.9 shows the resulting performance of fine-grain DVFS for the different environments
and adaptation intervals. Figure 4.9(a) and (b) correspond to loads that use 16 or 24 clusters,
respectively. The performance for all the environments is normalized to Perf for 0.1 ms and 16
clusters. As the DVFS adaptation interval decreases, higher MIPS would be expected due to more
accurate tracking of application characteristics, were there no adaptation overhead. The impact
of adaptation overhead on execution time, however, increases with decreasing DVFS interval. At
progressively shorter intervals, as adaptation overhead becomes dominant, MIPS would start to
deteriorate.
The figures show that, while Polyomino’s performance is lower than that of the perfect Perf
environment, it is higher than that of the realistic EffCoaDyn environment. This is the case for all the adaptation
intervals and the two chip utilization scenarios. For example, for 0.5–5 ms intervals, Polyomino’s
performance is 20-28% higher than EffCoaDyn for 16 clusters. This means that the higher latency of
off-chip Vdd regulation in Polyomino is effectively hidden. There are two reasons for this. First,
the magnitude of the Vdd changes needed is modest — in our case, they are capped at 20% of
VddNOM. Second, the overhead of PLL re-locking is in the critical path.
Only when the adaptation interval is very small (0.1 ms), is Polyomino’s overhead more visible
and, therefore, its performance becomes comparable to EffCoaDyn’s. Finally, as we go from 16 to
24 clusters, the performance decreases across the board. The reason is that Vdd cannot go as high
as for the 16-cluster case because the power headroom is lower due to the increased number of
clusters. Still, Polyomino’s performance is higher than EffCoaDyn’s.
[Figure 4.9 plots omitted. Y-axis: Normalized MIPS (0.5 to 1.1). X-axis: Time Interval (ms):
0.1, 0.5, 1, 5, 10. Series: Perf, Eff, EffCoa, EffCoaDyn, Polyomino.
Panels: (a) 16 clusters; (b) 24 clusters.]
Figure 4.9: Performance of fine-grain DVFS under different environments.
4.6.5 Sensitivity Analysis
Finally, we assess the impact of cluster granularity (core count per cluster) for fixed core count per
chip, and of core count per chip for fixed cluster granularity on energy efficiency. We deploy the
default regulator power loss of 15% across experiments. The number of coarse-grain Vdd domains
per chip (for EffCoa and EffCoaDyn) is fixed to nine in order not to increase complexity and Vdd
guard-bands any further.
First, we compare the energy efficiency of the environments in Table 4.3 for the 288-core Poly-
omino chip with either 4, 8 (default), or 16 cores per cluster. The results, shown in Figure 4.10,
are organized as in Figure 4.7. The workload is the same as in Figure 4.7, except that the PARSEC
applications run with 4 threads or with 16 threads for 4 and 16 cores per cluster, respectively. All clusters
are busy in Figure 4.10(a), and only half, in Figure 4.10(b). Each plot is normalized by the MIPS/W
of Perf for the corresponding cluster granularity.
In each case, Polyomino’s energy efficiency remains closer to Perf’s than EffCoaDyn’s does.
As the clusters become coarser (more cores per cluster), the difference in MIPS/W between Perf and
the rest of the environments shrinks, since the difference between Vdd domains of different
granularity becomes less pronounced under a large core count per cluster. Specifically, under full
utilization, the MIPS/W of EffCoaDyn is 61%, 64%, and 68% of Perf’s, whereas Polyomino attains
77%, 81%, and 86% for 4, 8, and 16 cores per cluster, respectively. Under 50% utilization, the
MIPS/W drop over Perf across different environments shrinks, since each environment can now pick
an energy-efficient subset of clusters, leaving energy-inefficient ones idle. For both utilization
profiles, a similar MIPS/W drop over Perf applies for Eff independent of cluster granularity,
because Eff differs from Perf only by the regulator power loss, not cluster organization.
[Figure 4.10 plots omitted. Y-axis: Normalized MIPS/W (0.6 to 1.0). X-axis: Perf, Eff, EffCoa,
EffCoaDyn, Polyomino. Series: 16, 8, and 4 cores per cluster.
Panels: (a) Full utilization; (b) 50% utilization.]
Figure 4.10: MIPS/W across different environments for a 288-core chip with 4, 8 (default), and 16
cores per cluster; under full utilization (a) and 50% utilization (b).
Second, we compare the energy efficiency of the environments in Table 4.3 for a 288-, 144-, and
72-core Polyomino chip with 8 cores per cluster. The results, shown in Figure 4.11, are organized as
in Figure 4.7. The workload is the same as in Figure 4.7. All clusters are busy in Figure 4.11(a), and
only half, in Figure 4.11(b). Each plot is normalized by the MIPS/W of Perf for the corresponding
core count.
In each case, Polyomino’s energy efficiency remains closer to Perf’s than EffCoaDyn’s does.
The MIPS/W of EffCoaDyn relative to Perf decreases with increasing core count. This is because
variation across the chip increases with increasing core count. In addition, the core count per
coarse-grain domain increases as the core count per chip increases, since we fix the number of
coarse-grain domains at nine. On the other hand, for 72 cores per chip, each Vdd domain consists
of one cluster only under EffCoaDyn. Even then, EffCoaDyn cannot beat Polyomino. Under full
utilization, the MIPS/W of EffCoaDyn is 70.7%, 67.4%, and 64.0% of Perf’s, whereas Polyomino
attains 82.5%, 80.6%, and 81.3% for 72, 144, and 288 cores, respectively. Under 50% utilization,
the MIPS/W drop over Perf across different environments shrinks, since each environment can now
pick an energy-efficient subset of clusters.
[Figure 4.11 plots omitted. Y-axis: Normalized MIPS/W (0.6 to 1.0). X-axis: Perf, Eff, EffCoa,
EffCoaDyn, Polyomino. Series: 288, 144, and 72 cores.
Panels: (a) Full utilization; (b) 50% utilization.]
Figure 4.11: MIPS/W across different environments for 72-, 144-, and 288-core chips with 8 cores
per cluster; under full utilization (a) and 50% utilization (b).
4.7 Discussion
Our data supports that a large NTC many-core with a single Vdd domain can be more energy-
efficient than one with many Vdd domains using on-chip Vdd regulators. This observation may
change if new technologies come along that affect the fundamental tradeoffs involved.
One frequently raised point is that the upcoming use of FinFETs may reduce the amount of
leakage power and process variations. However, both the share of leakage power and the impact of
process variations will still be considerable in future technologies. For example, FinFETs can
reduce the fraction of leakage power in total power only to 18% when we assume that the share
of leakage power in STC processors with planar FETs is 30% in a 20 nm technology [91].
Furthermore, the Vth sensitivity of FinFETs to systematic Leff variations is still comparable to that of
planar FETs, while the Vth sensitivity to Wstrip, i.e., the width of each fin, is far more significant
than in planar FETs, potentially leading to much higher leakage and delay variations [92].
Consequently, even with FinFETs, the impact of process variations on leakage and delay will continue
to be important, and energy-efficient solutions to deal with them will be crucial.
4.8 Related Work
The most relevant proposals apply typically to non-NTC environments: Rotem et al. analyzed the
impact of multiple Vdd/f domains, along with policies to maximize performance in CMPs, under
power delivery limitations [79], and proposed a clustered topology to maximize performance, but
ignored the impact of variation. Yan et al. suggested deploying fast but power-inefficient on-
chip Vdd regulation only for applications requesting fast DVFS, and relying on slow but power-
efficient off-chip regulation otherwise [93]. Although a much smaller chip than Polyomino was
considered, the area overhead had to be justified by exploitation of area which would otherwise
remain as dark silicon. In addition, the metal complexity is significantly higher than in Polyomino.
All of this body of work assumes an STC environment. On the other hand, Zhai et al. proposed a
clustered CMP for NTC [44]. Since SRAM has a higher VddMIN than logic, they let caches operate
at a higher f than logic. This approach is costly, since (i) SRAM and logic are highly interleaved in
the layout, and (ii) extra voltage regulators are required. Nor is this solution scalable into smaller
technologies where the difference between SRAM and logic VddMIN increases, diminishing the
power savings at NTV. Jain et al. demonstrated an experimental IA chip capable of operating at
NTV [11], but the chip has a single core only. Miller et al. deployed dual Vdd rails to handle
variations at NTV [94]. Depending on variation profile and chip load, cores switch between two
rails in a 32-core chip. The scalability into a larger number of cores may be questionable due to
increased complexity.
4.9 Summary
We examined how to effectively cope with process variations in energy-efficient, future NTC
chips. Our analysis suggests a novel, counter-intuitive approach to design such chips, without
multiple Vdd domains, and with only f domains.
First, we showed that a chip with only f domains like Polyomino can deliver higher performance
per watt than a chip with multiple Vdd and f domains supported by conventional on-chip Vdd
regulators. The reasons are: (i) the regulators’ power inefficiency, (ii) the increased Vdd guard-
band required by the finer grain Vdd domains to handle deeper dynamic Vdd droops (due to
lower capacitance per Vdd domain), and (iii) the practical fact that a Vdd domain still includes
many cores.
Second, we introduced core-assignment algorithms for a Polyomino-type architecture that de-
liver high performance per watt while being simple. Finally, we showed that the higher latency of
Vdd changes during DVFS without on-chip Vdd regulators is effectively hidden. Key reasons are
that the magnitude of the Vdd changes needed is modest and that PLL re-locking time is in the
critical path.
CHAPTER 5
CONCLUSION
5.1 Summary
To push back the many-core power wall, this dissertation made the following contributions:
First, we presented BubbleWrap, a novel many-core architecture that makes extensive use of the
DVSAM (Dynamic Voltage Scaling for Aging Management) toolset to dim dark silicon. Process
variation in future many-cores will render some cores more energy-efficient than others. Parallel
workloads can use the most efficient cores to run as many threads as possible under the power
budget. The less-efficient cores can be used as Bubble Wrap to protect the most efficient cores
when high single-thread performance is required. During sequential phases, BubbleWrap cores
will intentionally sacrifice themselves by operating at much higher than nominal voltage and fre-
quency to provide improved single-thread performance. The elevated voltages and temperatures
will quickly wear out or pop BubbleWrap cores. This is not a problem, as BubbleWrap cores can
be replaced from a large pool of dormant spares. Such a many-core is expected to unlock more
energy-efficient execution than a conventional many-core by delivering higher (sequential and
throughput) performance within the same power budget. DVSAM represents an effective toolset
for managing processor aging by tuning Vdd (but not the frequency), exploiting any instantaneous
aging guard-band. The goal can be one of the following: consume the least power for the same
performance and service life; attain the highest performance for the same service life and within
power constraints; or attain even higher performance for a shorter service life and within power
constraints.
Another way to dim dark silicon is reducing the supply voltage to a value only slightly higher
than the threshold voltage. This regime is called near-threshold voltage (NTV) computing (NTC),
as opposed to conventional super-threshold voltage (STV) computing (STC). A major drawback
of NTC is the higher susceptibility to parametric variations, namely the deviation of device
parameters from their nominal values. To help confront process variations at the architecture level
at NTV, we then presented the first microarchitectural model of process variations for NTC. The
model, called VARIUS-NTV, extends an existing variation model for STC. It models how variation
affects the frequency attained and power consumed by cores and memories in an NTC many-core,
and the timing and stability faults in SRAM cells at NTV. The key aspects include: (i) adopting
a gate-delay model and an SRAM cell type that are tailored to NTC, (ii) modeling SRAM failure
modes emerging at NTV, and (iii) accounting for the impact of leakage in SRAM timing and sta-
bility analysis. We validated our model against an experimental 80-core prototype chip. Next, we
examined how to effectively cope with process variations in energy-efficient, future NTC chips.
Our analysis suggested a novel, counter-intuitive approach to design such chips, without multiple
Vdd domains, and with only f domains. Specifically, we showed that a chip with only f domains
like Polyomino can deliver higher performance per watt than a chip with multiple Vdd and f
domains supported by conventional on-chip Vdd regulators. The reasons are: (i) the regulators’
power inefficiency, (ii) the increased Vdd guard-band required by the finer grain Vdd domains to
handle deeper dynamic Vdd droops (due to lower capacitance per Vdd domain), and (iii) the prac-
tical fact that a Vdd domain still includes many cores. We introduced core-assignment algorithms
for a Polyomino-type architecture that deliver high performance per watt while being simple. Fi-
nally, we demonstrated that the higher latency of Vdd changes during DVFS without on-chip Vdd
regulators is effectively hidden because the magnitude of the Vdd changes needed is modest, and
PLL re-locking time is in the critical path.
5.2 Looking Forward
Many interesting research questions on architectural implications of NTC remain open:
(I) Since NTV operation involves a reduction in frequency, it incurs a drastic loss in sequential
performance. In practice, even highly parallel applications tend to spend significant time
in serial regions. Thus, architectural techniques for sequential acceleration tailored for NTC
should be explored.
(II) At NTV, static power dominates the overall power consumption. Moreover, most of the on-
chip static power is consumed by memories. Consequently, organizing the on-chip mem-
ories with static power awareness becomes crucial. How SRAM-based cache hierarchies
should be organized at NTV, and how the NTC eco-system can benefit from non-volatile
memory technologies, represent open questions.
(III) NTC results in lower thermal density than STC. This property can be useful in 3D die stack-
ing. Whether and how 3D stacking needs to be re-designed to take advantage of the lower
thermal density at NTV should be analyzed.
(IV) NTC shows an increased sensitivity to PVT variations. Depending on how aggressive the
design is, this fact may lead to higher (timing) error rates. Whether/how this likely in-
crease in error rate at NTV can be mitigated by exploiting application-level fault tolerance,
represents an interesting problem. A broad and important class of applications that covers
recognition, mining, and synthesis (RMS) can tolerate inaccurate computation by construc-
tion. A key step for this analysis lies in understanding how timing faults at NTV manifest at
higher levels in the system stack.
APPENDIX A
QUANTITATIVE CHARACTERIZATION OF THE
MANY-CORE POWER WALL
The analysis rendering Figure 1.1 is conducted as follows: Gate capacitance per unit width, Cg,
metal-1 pitch, m, transistor density (transistors per unit area), δ, frequency, f , and supply voltage,
Vdd are collected from ITRS tables. Chip area is fixed; hence (i) δ is representative of the number
of transistors on-chip; (ii) number of cores, N, is proportional to δ. Recall that fixing the chip area
does not imply that the core area is constant: Per core functionality (as quantified by transistor
count) is fixed, not the per core area. In this case, dynamic power consumption across all cores, P,
becomes proportional to Cg × m × δ × f × Vdd².
Figure 1.1 starts at 2011 with N2011 = 16 cores per chip, assuming a homogeneous many-core
similar to UltraSPARC T3 [4]. The blue curve shows the evolution of δ, equivalently N, over
technology generations. The red curve, on the other hand, is obtained by a simple proportion:
N2011 cores consume P2011 at year 2011, where all the cores can be active; hence N2011,active = N2011.
At year Y, NY cores are expected to consume PY. Excluding the – practically negligible – heat sink
improvements for a fixed area chip without loss of generality, the available power budget, P2011,
can accommodate only NY,active cores: NY,active = NY × P2011/PY. P2011/PY < 1 applies due to the
increasing power density over technology generations. A perfect heat-sink, on the other hand,
permitting the available per chip power budget to scale with PY, would render NY = NY,active for
all Y considered.
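The proportion above can be written out as a short sketch. The scaling factors below are made up
for illustration and are not ITRS data; only the two relations, P proportional to
Cg × m × δ × f × Vdd² and NY,active = NY × P2011/PY, come from the analysis. All function and
variable names are hypothetical.

```python
def relative_dynamic_power(cg, m, delta, f, vdd):
    """Dynamic power across all cores, up to a constant: Cg * m * delta * f * Vdd^2."""
    return cg * m * delta * f * vdd ** 2


def active_cores(n_y, p_2011, p_y):
    """Cores that fit in the 2011 power budget: N_Y,active = N_Y * P_2011 / P_Y."""
    if p_2011 <= 0 or p_y <= 0:
        raise ValueError("power values must be positive")
    return n_y * p_2011 / p_y


# Baseline year: every parameter normalized to 1, with 16 cores per chip.
n_2011 = 16
p_2011 = relative_dynamic_power(1.0, 1.0, 1.0, 1.0, 1.0)

# Hypothetical later year Y (illustrative scaling factors only).
delta_y = 4.0                # 4x transistor density, hence 4x cores
n_y = n_2011 * delta_y       # N is proportional to delta
p_y = relative_dynamic_power(cg=0.8, m=0.5, delta=delta_y, f=1.2, vdd=0.9)

n_active = active_cores(n_y, p_2011, p_y)
```

With these illustrative factors, PY is about 1.56 × P2011, so only roughly 41 of the 64 cores
could be active at once. The gap between NY and NY,active is the quantity Figure 1.1 tracks.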
The growing gap between the two curves demonstrates the many-core power wall. The hops cor-
respond to years of introduction of a new device architecture. The basic analysis excludes the
impact of leakage. Since static power keeps increasing over technology generations, accounting
for static power can only make the many-core power wall look more dramatic: The actual number
of active cores would be less than predicted by Figure 1.1. Moreover, the transistor density, δ, cor-
responds to logic density. Logic represents the delimiter for integration; SRAM transistor density
increases at a faster pace than logic.
This trend remains universal, specifically in a homogeneous many-core setting. Similar char-
acteristics would emerge if different baselines were adopted, with cores of potentially different
complexity. In [5], for example, the same analysis is conducted assuming beefy cores, for an i7-
like [6] system.
APPENDIX B
VARIUS TIMING MODEL
VARIUS-NTV characterizes variation-induced shift in the path delay distribution of a pipeline
stage following VARIUS’ framework. The variation-afflicted path delay distribution, DVar, can be
used to extract either (1) the maximum path delay, max(DVar), to designate a proper clock period;
or (2) the timing error rate per cycle at a designated clock period, tCLK, from 1 − cdf_DVar(tCLK).
A path causes a timing error if and only if it is exercised and its delay exceeds the designated clock
period. It is assumed that there exists at least one path that has a delay equal to the clock period
tCLK if there was no variation.
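As an illustration of use (2), the sketch below evaluates 1 − cdf_DVar(tCLK) under the assumption
that DVar is Gaussian, with delays normalized to the no-variation clock period. The mean and
standard deviation are hypothetical, and the result is the probability that a path's delay exceeds
tCLK, which equals the error rate only if that path is exercised in the cycle.

```python
import math


def gaussian_cdf(x, mean, std):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def timing_error_rate(t_clk, mean, std):
    """1 - cdf_DVar(t_clk), assuming a Gaussian DVar (an assumption of this sketch)."""
    if std <= 0:
        raise ValueError("std must be positive")
    return 1.0 - gaussian_cdf(t_clk, mean, std)


# Hypothetical variation-afflicted delay distribution, normalized to the
# no-variation clock period: mean 1.05, standard deviation 0.03.
mean, std = 1.05, 0.03
rate_at_mean = timing_error_rate(mean, mean, std)            # clock period equal to the mean delay
rate_guarded = timing_error_rate(mean + 3 * std, mean, std)  # three-sigma guard-band
```

Lengthening the clock period moves tCLK into the tail of the distribution, trading frequency
for a lower error rate.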
A purely-logic pipeline stage comprises a multitude of paths of various delays. Even if there
were no variation, there would be a specific distribution of path delays, DLogic. The question is how
this distribution changes under process variation to give rise to DVarLogic. By fixing the share of
wire delay in the path delay by the multiplier kW,
$$D_{Logic} = D_{Gates} + D_{Wire} \quad\text{and}\quad D_{Wire} = k_W \times D_{Logic} \implies D_{Gates} = (1 - k_W) \times D_{Logic} \tag{B.1}$$
applies, where DGates corresponds to the delay of a sequence of gates along a path, excluding wires.
Variation in interconnect is neglected.
The path delay distribution of a logic stage under variation is given in Equation B.2. If the path
delay distribution, DLogic, is normalized to the clock period, it would be clustered at 1 due to state-
of-the-art design optimization. In this case, Equation B.2 can be approximated by Equation B.3.
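$$
\begin{aligned}
D_{VarLogic} &= D_{VarGates} + D_{Wire} = D_{SysGates} + D_{RandGates} + D_{Wire} \\
D_{SysGates} &= k_{Sys} \times D_{Gates} = k_{Sys} \times (1 - k_W) D_{Logic} \\
D_{RandGates} &= D_{Rand} \times D_{Gates} = D_{Rand} \times (1 - k_W) D_{Logic} \\
\implies D_{VarLogic} &= (1 - k_W)(k_{Sys} + D_{Rand}) \times D_{Logic} + k_W D_{Logic}
\end{aligned}
\tag{B.2}
$$
$$D_{VarLogic} \approx (1 - k_W)(k_{Sys} \times D_{Logic} + D_{Rand}) + k_W D_{Logic} \tag{B.3}$$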
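To see why Equation B.3 is a reasonable stand-in for Equation B.2, the sketch below evaluates both
for a single sample path. The two expressions differ by (1 − kW) × DRand × (DLogic − 1), which
vanishes when the normalized path delay is 1. The numeric values of kW, kSys, and DRand are
hypothetical, and treating the distributions as single sample values is a simplification of this
sketch.

```python
def d_var_logic_exact(d_logic, k_w, k_sys, d_rand):
    """Equation B.2: (1 - kW)(kSys + DRand) * DLogic + kW * DLogic."""
    return (1.0 - k_w) * (k_sys + d_rand) * d_logic + k_w * d_logic


def d_var_logic_approx(d_logic, k_w, k_sys, d_rand):
    """Equation B.3: (1 - kW)(kSys * DLogic + DRand) + kW * DLogic."""
    return (1.0 - k_w) * (k_sys * d_logic + d_rand) + k_w * d_logic


# Hypothetical sample values for one path.
k_w, k_sys, d_rand = 0.3, 1.1, 0.05

# A path whose normalized delay is exactly 1: both forms agree.
exact_at_one = d_var_logic_exact(1.0, k_w, k_sys, d_rand)
approx_at_one = d_var_logic_approx(1.0, k_w, k_sys, d_rand)

# A slightly shorter path: the gap equals (1 - kW) * DRand * (DLogic - 1).
gap = d_var_logic_exact(0.98, k_w, k_sys, d_rand) - d_var_logic_approx(0.98, k_w, k_sys, d_rand)
```

For path delays clustered near 1, the gap stays small relative to the delay itself, which is the
condition under which the approximation is applied.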
The systematic variation in DGates, DSysGates, consists of inter- and intra-stage components. The
inter-stage systematic component represents the stage systematic mean, the average shift in delay
over all the paths in the stage, lumped into the multiplier kSys of DGates. The intra-stage systematic
deviation, on the other hand, remains much smaller due to the high degree of spatial correlation.
Already at 45 nm, the length of a typical pipeline stage remains less than 0.1× the chip-length
for a 4-core design [52]. Hence, for typical values of φ around 0.5× the chip-length, intra-stage
systematic deviation can be neglected. VARIUS-NTV does not model intra-stage systematic
deviation. Under this assumption, Vth and Leff per transistor (hence gate) along each path in a
pipeline stage change in the same direction and, moreover, by the same amount. In this respect, a
pipeline stage is assumed atomic: each pipeline stage represents a grid point.
The random variation in DGates, DRandGates, on the other hand, stems from non-correlated, inde-
pendent shifts in transistor – hence gate – delay distribution. The random shift in transistor delay
distribution is not necessarily by the same quantity and in the same direction over all transistors.
Hence, the random shift in path delay distribution cannot be lumped into a constant coefficient
such as kSys of the systematic variation. If each grid point corresponded to a transistor, each point
per sampled die would be characterized by a specific value of the random shift superimposed on
the systematic shift. However, operation at transistor granularity is not feasible. Hence, while
systematic variation is captured by a constant coefficient of DGates per grid point per die, the co-
efficient of DGates corresponding to the random component, DRand, is modeled analytically as a
random variable of a specific distribution. DRand represents a coefficient of DGates, which lumps
the shift in path delay due to random variation.
According to Equation B.2, to arrive at the variation-afflicted path delay distribution, DVarLogic,
kW , DLogic, kSys and DRand should be known. kW , the (average) share of wire delay in the path
delay (were there no variation) represents a design-specific constant. It is assumed that DLogic is
normalized by tCLK and follows a Gaussian distribution. This is justified by data collected for
representative circuitry. Due to state-of-the-art design optimization, the distribution is
impulse-like, with the majority of path delays accumulated at 1.
Extracting kSys: Following VARIUS, VARIUS-NTV works with normalized gate delay, tnorm, as
captured by Equation B.4.
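$$
\begin{aligned}
t_{norm} &\propto t_g / t_0 \\
t_g &\propto \frac{V_{dd} \times L_{eff}}{\mu \times n \times v_t^2 \times \ln^2\left(e^{\frac{V_{dd} - V_{th}}{2 \times n \times v_t}} + 1\right)} \\
t_0 &\propto \frac{V_{dd0} \times L_{eff0}}{\mu \times n \times v_t^2 \times \ln^2\left(e^{\frac{V_{dd0} - V_{th0}}{2 \times n \times v_t}} + 1\right)}
\end{aligned}
\tag{B.4}
$$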
kSys is set by tnorm, where Vth and Leff represent systematic variation-afflicted parameters per
grid point (i.e., the center of each pipeline stage): kSys = tnorm(Vth = VthSys, Leff = LeffSys). Vth0
and Leff0 denote nominal values, were there no variation. Equation B.4 is apparently a gate-delay
formulation; however, DVarLogic corresponds to path delay. Recall that kSys represents systematic
variation and that, being highly correlated, Vth and Leff of all gates along a path in a pipeline
stage change in the same direction and, moreover, by the same amount; a pipeline stage is assumed
atomic in this respect. Hence, the stage (path) delay can be simply captured by l × tg, with l
being the length of the critical path (in gates). Similarly, the nominal path delay corresponds to
l × t0. In normalizing path delays, l cancels out, but the quantity of interest still corresponds
to paths as opposed to gates.
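A minimal numerical sketch of this extraction follows. It evaluates the gate-delay expression of
Equation B.4 up to its proportionality constant and forms kSys as the ratio tg/t0. The mobility µ
is dropped because the same µ appears in both tg and t0 and cancels in the ratio. The values chosen
for n, the thermal voltage vt, the voltages, and Leff are assumptions for illustration only.

```python
import math


def relative_gate_delay(vdd, vth, leff, n=1.5, vt=0.026):
    """Gate delay from Equation B.4, up to a constant factor (mobility omitted).

    n (subthreshold slope factor) and vt (thermal voltage, in volts)
    are assumed values, not parameters taken from the text.
    """
    overdrive = (vdd - vth) / (2.0 * n * vt)
    drive = n * vt ** 2 * math.log(math.exp(overdrive) + 1.0) ** 2
    return vdd * leff / drive


def k_sys(vdd, vth_sys, leff_sys, vth0, leff0, n=1.5, vt=0.026):
    """kSys = tnorm evaluated at the systematic Vth and Leff of a grid point."""
    t_g = relative_gate_delay(vdd, vth_sys, leff_sys, n, vt)
    t_0 = relative_gate_delay(vdd, vth0, leff0, n, vt)
    return t_g / t_0


# Hypothetical near-threshold operating point and nominal device parameters.
vdd, vth0, leff0 = 0.55, 0.33, 1.0

k_nominal = k_sys(vdd, vth0, leff0, vth0, leff0)             # no systematic shift
k_slow = k_sys(vdd, vth0 + 0.02, leff0, vth0, leff0)         # Vth 20 mV higher
k_long = k_sys(vdd, vth0, 1.1 * leff0, vth0, leff0)          # Leff 10% longer
```

Without any systematic shift the ratio is 1; a higher Vth or a longer Leff yields a kSys above 1,
i.e., a slower stage.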
Extracting DRand: To determine DRand, the distribution of this random variable¹ should be known
along with its parameters (mean µ and standard deviation σ). Mathematically, the parameters of
the distribution can be estimated independent of the underlying distribution by relying on the
principles of functions of independent, identically distributed (iid) random variables.
In this case, each gate’s Le f f and Vth is affected independently. Hence, the path delay should be
formed by composition of l independent gate delays; path delay = l × gate delay does not hold
any more. Instead, path delay should be determined from the sum of l iid random variables.
Assume that G represents the absolute delay of a FO4 gate, i.e., the random variable corresponding
to tg. Let P be the absolute path delay, composed as the sum of l independent Gs. In this case,
σ(P) = √l × σ(G), µ(P) = l × µ(G). However, VARIUS-NTV is after the normalized path delay,
PN. The parameters of PN = P/(l × t0) would be σ(PN) = σ(P)/(l × t0), µ(PN) = µ(P)/(l × t0).
¹ A random variable X is a function that assigns a real number X(ψ) to each outcome ψ in the
sample space of a random experiment.
Hence,
\[
\sigma(P_N) = \sqrt{l} \times \sigma(G)/(l \times t_0) \implies \sigma(P_N) = \sigma(G/t_0)/\sqrt{l} \tag{B.5}
\]
\[
\mu(P_N) = l \times \mu(G)/(l \times t_0) \implies \mu(P_N) = \mu(G/t_0) \tag{B.6}
\]
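Equations B.5 and B.6 can be checked numerically with a small Monte Carlo experiment. The sketch below assumes, purely for illustration, that the gate delay G is Gaussian with hypothetical parameters; the two relations themselves only rely on the l gate delays being iid, not on this particular distribution.

```python
import math
import random
import statistics

def sample_normalized_path_delay(l, mu_g, sigma_g, t0, rng):
    """One draw of PN = P / (l * t0), with P the sum of l iid gate delays."""
    p = sum(rng.gauss(mu_g, sigma_g) for _ in range(l))
    return p / (l * t0)

rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical values: path length, gate delay mean/sigma, nominal gate delay.
L_PATH, MU_G, SIGMA_G, T0 = 12, 1.1, 0.2, 1.0

samples = [
    sample_normalized_path_delay(L_PATH, MU_G, SIGMA_G, T0, rng)
    for _ in range(20000)
]

mc_mean = statistics.fmean(samples)
mc_std = statistics.stdev(samples)

pred_mean = MU_G / T0                           # Equation B.6
pred_std = (SIGMA_G / T0) / math.sqrt(L_PATH)   # Equation B.5

print(f"mean: sampled {mc_mean:.4f}, predicted {pred_mean:.4f}")
print(f"std:  sampled {mc_std:.4f}, predicted {pred_std:.4f}")
```

The sampled standard deviation shrinks with √l, reflecting that independent per-gate deviations partially average out along a path.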
G/t0 corresponds to the distribution of normalized gate delay, as captured by tnorm. G/t0
represents a function of the random variables Leff and Vth. Independently of the determination of
its distribution, the mean and standard deviation can be estimated from formulae for functions of
random variables. If, for example, Leff and Vth are assumed to follow a Gaussian distribution [54],
Equation 5 in [52]² can be deployed to extract σ(G/t0) and µ(G/t0). For Gaussian Vth and Leff,
µ(G/t0) can be calculated by plugging the systematically shifted Vth0 and Leff0 in as means, along
with the σRand of Vth and of Leff as standard deviations, into Equation 5 from [52]. DRand represents
an additive term to be superimposed on the systematic variation-afflicted DLogic, i.e., kSys × DLogic,
according to Equation B.3. The shift in delay due to systematic variation should not be accounted
for twice; this is why, once µ(G/t0) is extracted, the stage systematic mean should be subtracted
out to arrive at the mean of DRand, µ(DRand).³ ⁴ σ(DRand), on the other hand, corresponds to
σ(G/t0)/√l:
\[
\begin{aligned}
\mu(D_{Rand}) &= \mu(P_N) - k_{Sys} = \mu(G/t_0) - k_{Sys} \\
\sigma(D_{Rand}) &= \sigma(P_N) = \sigma(G/t_0)/\sqrt{l}
\end{aligned}
\tag{B.7}
\]
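The steps leading to Equation B.7 can be sketched numerically. Note the substitution: the text extracts µ(G/t0) and σ(G/t0) analytically via Equation 5 of [52], which is not reproduced here, so this sketch estimates them by Monte Carlo sampling of Gaussian Vth and Leff instead. The function names and all numeric values are hypothetical.

```python
import math
import random
import statistics

def gate_delay(vdd, vth, leff, n=1.5, vt=0.026):
    """Relative gate delay per Equation B.4; mu is omitted since it cancels in tg/t0."""
    x = (vdd - vth) / (2.0 * n * vt)
    return vdd * leff / (n * vt**2 * math.log(math.exp(x) + 1.0) ** 2)

def drand_parameters(vdd, vth_sys, leff_sys, vth0, leff0,
                     sigma_vth, sigma_leff, l, samples=20000, seed=0):
    """Return (kSys, mu(DRand), sigma(DRand)) following Equation B.7.

    Means of Vth/Leff are the systematically shifted values; standard
    deviations are the random-variation sigmas. Monte Carlo stands in for
    Equation 5 of [52] (an assumption of this sketch).
    """
    rng = random.Random(seed)
    t0 = gate_delay(vdd, vth0, leff0)
    k_sys = gate_delay(vdd, vth_sys, leff_sys) / t0
    draws = [
        gate_delay(vdd, rng.gauss(vth_sys, sigma_vth), rng.gauss(leff_sys, sigma_leff)) / t0
        for _ in range(samples)
    ]
    mu_g = statistics.fmean(draws)      # estimate of mu(G/t0)
    sigma_g = statistics.stdev(draws)   # estimate of sigma(G/t0)
    # Subtract the systematic mean so it is not counted twice; scale sigma by sqrt(l).
    return k_sys, mu_g - k_sys, sigma_g / math.sqrt(l)

k_sys, mu_drand, sigma_drand = drand_parameters(
    vdd=0.55, vth_sys=0.34, leff_sys=1.02, vth0=0.33, leff0=1.0,
    sigma_vth=0.01, sigma_leff=0.02, l=12,
)
print(f"kSys = {k_sys:.4f}, mu(DRand) = {mu_drand:.4f}, sigma(DRand) = {sigma_drand:.4f}")
```

With both random sigmas set to zero, every draw equals kSys, so µ(DRand) and σ(DRand) vanish, which is the expected degenerate case.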
How does the picture change in the extraction of the stage systematic mean, kSys? kSys = µ(PN) =
µ(P)/(l × t0). For systematic variation, µ(P) = l × µ(G). Hence, µ(PN) = l × µ(G)/(l × t0) =
µ(G/t0). Since intra-stage systematic deviation in Vth and Leff is neglected, this µ(G/t0) can
be extracted directly from tnorm, by plugging in the systematically shifted Vth0 and Leff0 as
obtained from physical variation maps; there is no need to deploy Equation 5 from [52].
Determining DVar Distribution: Since (1) the underlying zero-variation path delay, DLogic, is
assumed to follow a Gaussian distribution [54] and (2) DRand, which captures random variation, is
approximated by a Gaussian distribution, DVarLogic also follows a Gaussian distribution.
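A minimal sketch of how the Gaussian parameters might be combined is given below. It assumes that Equation B.3 has the additive form DVarLogic = kSys × DLogic + DRand, as suggested by the description of DRand as an additive term superimposed on kSys × DLogic, and that DLogic and DRand are independent. The wire delay share kW of Equation B.2 is left out. The function name and the numeric inputs are hypothetical.

```python
import math

def dvarlogic_parameters(k_sys, mu_logic, sigma_logic, mu_rand, sigma_rand):
    """Mean and standard deviation of kSys * DLogic + DRand.

    Assumes DLogic and DRand are independent, so means add and variances add
    (scaling DLogic by kSys scales its standard deviation by kSys).
    """
    mean = k_sys * mu_logic + mu_rand
    std = math.sqrt((k_sys * sigma_logic) ** 2 + sigma_rand ** 2)
    return mean, std

# Hypothetical inputs: an impulse-like DLogic around 1, a slow stage, small random spread.
mean, std = dvarlogic_parameters(
    k_sys=1.08, mu_logic=1.0, sigma_logic=0.01, mu_rand=0.005, sigma_rand=0.03
)
print(f"DVarLogic ~ Gaussian(mean={mean:.4f}, std={std:.4f})")
```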
² A formula to extract σ and µ of a function of Gaussian random variables. The random variables
here correspond to Vth and Leff, and hence are not necessarily independent and identically
distributed. See [95, p. 130] for a more general discussion.
³ Note that, since systematic and random components are independent, they give rise to independent
random variables. The cumulative impact is additive. Regardless of the underlying distributions,
the cumulative mean/variance is the sum of the means/variances.
⁴ Ideally, zeros should be plugged in as means into Equation 5 from [52] to avoid the corrective
subtraction; however, this approach is observed to lead to numerical instability.
A similar analysis applies to pipeline stages of pure memory access: both read and write timing
represent non-linear functions of the underlying Vth and Leff distributions. To extract the
parameters of the random variables capturing read and write timing, the systematic variation-induced
Vth and Leff values are deployed as means, and the random variation-induced σs as standard
deviations, in equivalents of Equation 5 from [52].
REFERENCES
[1] R. Dennard, F. Gaensslen, V. Rideout, E. Bassous, and A. LeBlanc, “Design of Ion-Implanted
MOSFETs with Very Small Physical Dimensions,” Journal of Solid-State Circuits, vol. 9, no. 5,
pp. 256–268, October 1974.
[2] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein, “Scaling, Power, and
the Future of CMOS,” in International Electron Devices Meeting, December 2005, pp. 7–15.
[3] International Technology Roadmap for Semiconductors, 2011 Edition. [Online]. Available:
http://www.itrs.net.
[4] J. Shin et al., “A 40nm 16-core 128-thread CMT SPARC SoC Processor,” in International Solid-
State Circuits Conference, February 2010, pp. 98–99.
[5] U. R. Karpuzcu, B. Greskamp, and J. Torrellas, “The BubbleWrap Many-Core: Popping Cores
for Sequential Acceleration,” in International Symposium on Microarchitecture, December 2009,
pp. 447–458.
[6] Intel Core i7 Processor Extreme Edition and Intel Core i7 Processor Datasheet,
2008. [Online]. Available: http://www.intel.com/content/www/us/en/processors/core/
core-i7-900-ee-and-desktop-processor-series-datasheet-vol-1.html.
[7] C. Diaz et al., “32nm Gate-first High-K/Metal-Gate Technology for High Performance Low
Power Applications,” in International Electron Devices Meeting, December 2008, pp. 1–4.
[8] L. Chang, D. J. Frank, R. K. Montoye, S. J. Koester, B. L. Ji, P. W. Coteus, R. H. Dennard, and
W. Haensch, “Practical Strategies for Power-Efficient Computing Technologies,” Proceedings
of the IEEE, vol. 98, no. 2, pp. 215–236, February 2010.
[9] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, “Near-Threshold
Computing: Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits,” Pro-
ceedings of the IEEE, vol. 98, no. 2, pp. 253–266, February 2010.
[10] M. B. Taylor, “Is Dark Silicon Useful?: Harnessing the Four Horsemen of the Coming Dark
Silicon Apocalypse,” in Design Automation Conference, June 2012, pp. 1131–1136.
[11] S. Jain et al., “A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS,” in
International Solid-State Circuits Conference, February 2012, pp. 66–68.
[12] K. Bernstein et al., “High-Performance CMOS Variability in the 65-nm Regime and Beyond,”
IBM Journal of Research and Development, vol. 50, no. 4/5, pp. 433–449, July/September 2006.
[13] J. Abella, X. Vera, and A. González, “Penelope: The NBTI-Aware Processor,” in International
Symposium on Microarchitecture, December 2007, pp. 85–96.
[14] M. Agarwal, B. Paul, M. Zhang, and S. Mitra, “Circuit Failure Prediction and Its Application
to Transistor Aging,” in VLSI Test Symposium, May 2007, pp. 277–286.
[15] J. Blome, S. Feng, S. Gupta, and S. Mahlke, “Self-Calibrating Online Wearout Detection,” in
International Symposium on Microarchitecture, December 2007, pp. 109–122.
[16] J. Shin, V. Zyuban, P. Bose, and T. M. Pinkston, “A Proactive Wearout Recovery Approach for
Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime,” in International
Symposium on Computer Architecture, June 2008, pp. 353–362.
[17] J. C. Smolens, B. T. Gold, J. C. Hoe, B. Falsafi, and K. Mai, “Detecting Emerging Wearout
Faults,” in Workshop on Silicon Errors in Logic - System Effects, March 2007.
[18] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “The Case for Lifetime Reliability-Aware
Microprocessors,” in International Symposium on Computer Architecture, June 2004, pp. 276–
287.
[19] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “Exploiting Structural Duplication for
Lifetime Reliability Enhancement,” in International Symposium on Computer Architecture, June
2005, pp. 520–531.
[20] A. Tiwari and J. Torrellas, “Facelift: Hiding and Slowing Down Aging in Multicores,” in
International Symposium on Microarchitecture, November 2008, pp. 129–140.
[21] F. Arnaud et al., “32nm General Purpose Bulk CMOS Technology for High Performance Ap-
plications at Low Voltage,” in Electron Devices Meeting, December 2008, pp. 1–4.
[22] C. Auth, “45nm High-K + Metal Gate Strain-Enhanced CMOS Transistors,” in Custom Inte-
grated Circuits Conference, September 2008, pp. 379–386.
[23] W. Wang, V. Reddy, A. T. Krishnan, R. Vattikonda, S. Krishnan, and Y. Cao, “Compact Model-
ing and Simulation of Circuit Reliability for 65-nm CMOS Technology,” Transactions on Device
and Materials Reliability, vol. 7, no. 4, pp. 509–517, December 2007.
[24] T. Sakurai and A. Newton, “Alpha-Power Law MOSFET Model and Its Applications to
CMOS Inverter Delay and Other Formulas,” Journal of Solid-State Circuits, vol. 25, no. 2, pp.
584–594, April 1990.
[25] D. Bergstrom, M. Hattendorf, J. Hicks, J. Jopling, J. Maiz, S. Pae, C. Prasad, and J. Wiedemer,
“45nm Transistor Reliability,” Intel Technology Journal, vol. 12, no. 2, pp. 131–144, June 2008.
[26] S. Pae et al., “BTI Reliability of 45 nm High-K + Metal-Gate Process Technology,” in Interna-
tional Reliability Physics Symposium, May 2008, pp. 352–357.
[27] J. W. McPherson, “Reliability Challenges for 45nm and Beyond,” in Design Automation Con-
ference, July 2006, pp. 176–181.
[28] E. J. Nowak, “Maintaining the Benefits of CMOS Scaling When Scaling Bogs Down,” IBM
Journal of Research and Development, vol. 46, no. 2/3, pp. 169–180, March/May 2002.
[29] E. Karl, P. Singh, D. Blaauw, and D. Sylvester, “Compact In-Situ Sensors for Monitoring
Negative-Bias-Temperature-Instability Effect and Oxide Degradation,” in International Solid-
State Circuits Conference, February 2008, pp. 410–623.
[30] J. Keane, D. Persaud, and C. H. Kim, “An All-In-One Silicon Odometer for Separately Moni-
toring HCI, BTI, and TDDB,” in Symposium on VLSI Circuits, June 2009, pp. 817,829.
[31] T.-H. Kim, R. Persaud, and C. Kim, “Silicon Odometer: An On-Chip Reliability Monitor for
Measuring Frequency Degradation of Digital Circuits,” Journal of Solid-State Circuits, vol. 43,
no. 4, pp. 874–880, April 2008.
[32] K. Stawiasz, K. Jenkins, and P.-F. Lu, “On-Chip Circuit for Monitoring Frequency Degrada-
tion Due to NBTI,” in International Reliability Physics Symposium, May 2008, pp. 532–535.
[33] M. Popovich, A. V. Mezhiba, and E. G. Friedman, “On-Chip Power Distribution Grids with
Multiple Supply Voltages,” in Power Distribution Networks with On-Chip Decoupling Capacitors.
New York: Springer, 2008, pp. 323–359.
[34] J. Dorsey et al., “An Integrated Quad-Core Opteron Processor,” in International Solid-State
Circuits Conference, February 2007, pp. 102–103.
[35] R. Teodorescu, J. Nakano, A. Tiwari, and J. Torrellas, “Mitigating Parameter Variation with
Dynamic Fine-Grain Body Biasing,” in International Symposium on Microarchitecture, Decem-
ber 2007, pp. 27–42.
[36] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A Framework for Architectural-level Power
Analysis and Optimizations,” SIGARCH Computer Architecture News, vol. 28, no. 2, pp. 83–94,
2000.
[37] W. Huang, M. R. Stant, K. Sankaranarayanan, R. J. Ribando, and K. Skadron, “Many-Core
Design from a Thermal Perspective,” in Design Automation Conference, July 2008, pp. 746–749.
[38] W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45nm Design
Exploration,” in International Symposium on Quality Electronic Design, 2006, pp. 585–590.
[39] A. Khakifirooz and D. Antoniadis, “MOSFET Performance Scaling Part II: Future Directions,”
in Transactions on Electron Devices, June 2008, pp. 1401–1408.
[40] R. Teodorescu and J. Torrellas, “Variation-Aware Application Scheduling and Power Man-
agement for Chip Multiprocessors,” in International Symposium on Computer Architecture, June
2008, pp. 363–374.
[41] International Technology Roadmap for Semiconductors, 2008 Update. [Online]. Available:
http://www.itrs.net.
[42] J. Renau et al., 2008, SESC Simulator. [Online]. Available: http://sesc.sourceforge.net.
[43] R. Gonzalez, B. Gordon, and M. Horowitz, “Supply and Threshold Voltage Scaling for Low
Power CMOS,” Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1210–1216, August 1997.
[44] B. Zhai, R. G. Dreslinski, D. Blaauw, T. Mudge, and D. Sylvester, “Energy Efficient Near-
Threshold Chip Multi-Processing,” in International Symposium on Low Power Electronics and
Design, 2007, pp. 32–37.
[45] K. Fan, M. Kudlur, G. Dasika, and S. Mahlke, “Bridging the Computation Gap between Pro-
grammable Processors and Hardwired Accelerators,” in International Symposium on High Per-
formance Computer Architecture, February 2009, pp. 313–322.
[46] D. Markovic, C. C. Wang, L. P. Alarcon, T.-T. Liu, and J. M. Rabaey, “Ultralow-Power Design
in Near-Threshold Region,” Proceedings of the IEEE, vol. 98, no. 2, pp. 237–252, February 2010.
[47] S. Dighe et al., “Within-Die Variation-Aware Dynamic-Voltage-Frequency-Scaling With Optimal
Core Allocation and Thread Hopping for the 80-Core TeraFLOPS Processor,” Journal of
Solid-State Circuits, pp. 184–193, January 2011.
[48] E. Humenay, D. Tarjan, and K. Skadron, “Impact of Process Variations on Multicore Perfor-
mance Symmetry,” in Conference on Design, Automation and Test in Europe, April 2007, pp.
1653–1658.
[49] X. Liang and D. Brooks, “Mitigating the Impact of Process Variations on Processor Register
Files and Execution Units,” in International Symposium on Microarchitecture, December 2006,
pp. 504–514.
[50] D. Marculescu and E. Talpes, “Variability and Energy Awareness: A Microarchitecture-Level
Perspective,” in Design Automation Conference, June 2005, pp. 11–16.
[51] B. F. Romanescu, S. Ozev, and D. J. Sorin, “Quantifying the Impact of Process Variability on
Microprocessor Behavior,” in Workshop on Architectural Reliability, December 2006.
[52] S. Sarangi, B. Greskamp, R. Teodorescu, J. Nakano, A. Tiwari, and J. Torrellas, “VARIUS: A
Model of Process Variation and Resulting Timing Errors for Microarchitects,” Transactions on
Semiconductor Manufacturing, vol. 21, no. 1, pp. 3–13, February 2008.
[53] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, “Parameter Variations
and Impact on Circuits and Microarchitecture,” in Design Automation Conference, June 2003,
pp. 338–342.
[54] A. Srivastava, D. Sylvester, and D. Blaauw, “Statistical Models and Techniques,” in Statistical
Analysis and Optimization for VLSI: Timing and Power. New York: Springer, 2005, pp. 13–77.
[55] BSIM. [Online]. Available: http://www-device.eecs.berkeley.edu/~bsim/BSIM4/BSIM460.
[56] S. Mukhopadhyay, H. Mahmoodi, and K. Roy, “Modeling of Failure Probability and Statis-
tical Design of SRAM Array for Yield Enhancement in Nanoscaled CMOS,” Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 12, pp. 1859–1880, De-
cember 2005.
[57] M. Eisele, J. Berthold, D. Schmitt-Landsiedel, and R. Mahnkopf, “The Impact of Intra-die
Device Parameter Variations on Path Delays and on the Design for Yield of Low Voltage
Digital Circuits,” Transactions on VLSI Systems, vol. 5, no. 4, pp. 360–368, December 1997.
[58] Y. Cheng and C. Hu, MOSFET Modeling and BSIM3 User’s Guide. New York: Kluwer Academic
Publishers, 1999.
[59] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Zeisler, D. Blaauw, T. Austin, K. Flaut-
ner, and T. Mudge, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Specula-
tion,” in International Symposium on Microarchitecture, December 2003, pp. 7–18.
[60] M. Seok, D. Sylvester, and D. Blaauw, “Optimal Technology Selection for Minimizing En-
ergy and Variability in Low Voltage Applications,” in International Symposium on Low Power
Electronics and Design, 2008, pp. 9–14.
[61] K. A. Bowman, B. L. Austin, J. C. Eble, X. Tang, and J. D. Meindl, “A Physical Alpha-power
Law MOSFET Model,” in International Symposium on Low Power Electronics and Design, 1999,
pp. 218–222.
[62] Y. Cao and L. T. Clark, “Mapping Statistical Process Variations Toward Circuit Performance
Variability: An Analytical Modeling Approach,” in Design Automation Conference, 2005, pp.
658–663.
[63] H. Im, “Physical Insight Into Fractional Power Dependence of Saturation Current on Gate
Voltage in Advanced Short Channel MOSFETS (Alpha-power Law Model),” in International
Symposium on Low Power Electronics and Design, 2002, pp. 13–18.
[64] M. Orshansky, J. Chen, and C. Hu, “Direct Sampling Methodology for Statistical Analysis
of Scaled CMOS Technologies,” Transactions on Semiconductor Manufacturing, pp. 403–408,
November 1999.
[65] C. C. Enz et al., “An Analytical MOS Transistor Model Valid in All Regions of Operation
and Dedicated to Low-voltage and Low-current Applications,” Analog Integrated Circuits and
Signal Processing, pp. 83–114, July 1995.
[66] J. Abella, P. Chaparro, X. Vera, J. Carretero, and A. Gonzalez, “High-Performance Low-Vcc
In-Order Core,” in International Symposium on High Performance Computer Architecture, January
2010, pp. 1–11.
[67] L. Chang et al., “An 8T-SRAM for Variability Tolerance and Low-Voltage Operation in High-
Performance Caches,” Journal of Solid-State Circuits, vol. 43, no. 4, pp. 956–963, April 2008.
[68] Y. Morita et al., “An Area-Conscious Low-Voltage-Oriented 8T-SRAM Design under DVS
Environment,” in Symposium on VLSI Circuits, June 2007, pp. 256–257.
[69] C.-K. Luk et al., “Pin: Building Customized Program Analysis Tools With Dynamic Instru-
mentation,” in Conference on Programming Language Design and Implementation, June 2005, pp.
190–200.
[70] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: An
Integrated Power, Area, and Timing Modeling Framework for Multi-Core and Many-Core
Architectures,” in International Symposium on Microarchitecture, December 2009, pp. 469–480.
[71] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M. Stan, “HotSpot:
A Compact Thermal Modeling Methodology for Early-Stage VLSI Design,” Transactions on
VLSI Systems, vol. 14, no. 5, pp. 501–513, May 2006.
[72] The R Project for Statistical Computing. [Online]. Available: http://www.r-project.org/.
[73] P. Friedberg, Y. Cao, J. Cain, R. Wang, J. Rabaey, and C. Spanos, “Modeling Within-die Spa-
tial Correlation Effects for Process-design Co-optimization,” in International Symposium on
Quality of Electronic Design, March 2005, pp. 516–521.
[74] Predictive Technology Model (PTM). [Online]. Available: http://ptm.asu.edu/.
[75] J. Donald and M. Martonosi, “Power Efficiency for Variation-tolerant Multicore Processors,”
in International Symposium on Low Power Electronics and Design, October 2006, pp. 304–309.
[76] S. Herbert and D. Marculescu, “Characterizing Chip-multiprocessor Variability-tolerance,”
in Design Automation Conference, June 2008, pp. 313–318.
[77] J. Li and J. Martinez, “Dynamic Power-performance Adaptation of Parallel Computation on
Chip Multiprocessors,” in International Symposium on High-Performance Computer Architecture,
February 2006, pp. 77–87.
[78] K. Rangan, M. Powell, G.-Y. Wei, and D. Brooks, “Achieving Uniform Performance and Max-
imizing Throughput in the Presence of Heterogeneity,” in International Symposium on High
Performance Computer Architecture, February 2011, pp. 3–14.
[79] E. Rotem, R. Ginosar, A. Mendelson, and U. Weiser, “Multiple Clock and Voltage Domains
for Chip Multiprocessors,” in International Symposium on Microarchitecture, December 2009,
pp. 459–468.
[80] H. A. Ghasemi, A. Sinkar, M. Schulte, and N. S. Kim, “Cost-Effective Power Delivery for Sup-
porting Per-Core Voltage Domains for Power-Constrained Processors,” in Design Automation
Conference, June 2012.
[81] W. Kim, D. Brooks, and G.-Y. Wei, “A Fully-integrated 3-Level DC/DC Converter for
Nanosecond-scale DVS with Fast Shunt Regulation,” in International Solid-State Circuits Con-
ference, February 2011, pp. 268–270.
[82] N. James, P. Restle, J. Friedrich, B. Huott, and B. McCredie, “Comparison of Split-Versus
Connected-Core Supplies in the POWER6 Microprocessor,” in International Solid-State Cir-
cuits Conference, February 2007, pp. 298–604.
[83] J. Lee and N. S. Kim, “Optimizing Total Power of Many-core Processors Considering Voltage
Scaling Limit and Process Variations,” in International Symposium on Low Power Electronics and
Design, March 2009, pp. 201–206.
[84] R. R. Dobkin and R. Ginosar, “Two-phase Synchronization with Sub-cycle Latency,” Integra-
tion, the VLSI Journal, vol. 42, no. 3, pp. 367–375, June 2009.
[85] W. Kim, M. Gupta, G.-Y. Wei, and D. Brooks, “System Level Analysis of Fast, Per-core DVFS
Using On-chip Switching Regulators,” in International Symposium on High Performance Com-
puter Architecture, February 2008, pp. 123–134.
[86] Voltage Regulator Module (VRM) and Enterprise Voltage Regulator-Down (EVRD) 11.1,
2009. [Online]. Available: http://www.intel.com/Assets/en_US/PDF/designguide/321736.pdf.
[87] A. Bashir, J. Li, K. Ivatury, N. Khan, N. Gala, N. Familia, and Z. Mohammed, “Fast Lock
Scheme for Phase-Locked Loops,” in Custom Integrated Circuits Conference, September 2009,
pp. 319–322.
[88] U. R. Karpuzcu, K. B. Kolluru, N. S. Kim, and J. Torrellas, “VARIUS-NTV: A Microarchitec-
tural Model to Capture the Increased Sensitivity of Manycores to Process Variations at Near-
Threshold Voltages,” in International Conference on Dependable Systems and Networks, June 2012,
pp. 1–12.
[89] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity.
Upper Saddle River: Prentice-Hall, 1982.
[90] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A Limited Memory Algorithm for Bound
Constrained Optimization,” SIAM Journal on Scientific Computing, vol. 16, no. 5, pp. 1190–1208,
September 1995.
[91] A. Keshavarzi et al., “Architecting Advanced Technologies for 14nm and Beyond with 3D
FinFET Transistors for the Future SoC Applications,” in International Electron Devices Meeting,
December 2011, pp. 4.1.1–4.1.4.
[92] H. Khan, D. Mamaluy, and D. Vasileska, “Simulation of the impact of process variation on the
optimized 10-nm FinFET,” Transactions on Electron Devices, vol. 55, no. 8, pp. 2134–2141, August
2008.
[93] G. Yan, Y. Li, Y. Han, X. Li, M. Guo, and X. Liang, “AgileRegulator: A Hybrid Voltage Reg-
ulator Scheme Redeeming Dark Silicon for Power Efficiency in a Multicore Architecture,” in
International Symposium on High Performance Computer Architecture, February 2012, pp. 1–12.
[94] T. Miller, X. Pan, R. Thomas, N. Sedaghati, and R. Teodorescu, “Booster: Reactive Core Ac-
celeration for Mitigating the Effects of Process Variation and Application Imbalance in Low-
voltage Chips,” in International Symposium on High Performance Computer Architecture, Febru-
ary 2012, pp. 1–12.
