Improving Per-Thread Performance on CMPs through Timing Speculation by Greskamp, Brian
© 2009 Brian L. Greskamp
IMPROVING PER-THREAD PERFORMANCE
ON CMPS THROUGH TIMING SPECULATION
BY
BRIAN L. GRESKAMP
B.S. Clemson University, 2003
M.S. University of Illinois at Urbana-Champaign, 2005
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2009
Urbana, Illinois
Doctoral Committee:
Professor Josep Torrellas, Chair
Shekhar Borkar, Intel Corporation
Assistant Professor Deming Chen
Professor Sanjay Patel
Assistant Professor Craig Zilles
Abstract
The future of performance scaling lies in massively parallel workloads, but less-parallel applica-
tions will remain important. Unfortunately, future process technologies and core microarchitec-
tures no longer promise major per-thread performance improvements, so microarchitects must find
new ways to address a growing per-thread performance deficit. Moreover, they must do so without
sacrificing parallel throughput. To meet these apparently conflicting demands, this dissertation
proposes a Timing Speculation (TS) system for multicores that boosts core clock frequencies past
their normal limits when an application demands per-thread performance and operates efficiently
at nominal frequency when it demands throughput. This work’s contributions are organized into
three interlocking proposals.
This work begins by introducing Paceline, the first TS microarchitecture designed specifically
for multicores. Paceline enables two cores to work together to execute a single thread at high
speed under TS or independently to execute two threads at the rated frequency. In single-thread
mode, one core in the pair — the “Leader” — executes at higher-than-normal frequency, while a
“Checker” runs at the rated, safe frequency. The Leader runs the program faster but may experience
timing errors. To detect and correct these errors, the Checker periodically compares a hash of its
architectural state with that of the Leader. The Leader helps the Checker keep up by passing it
branch results and prefetches.
Next, this dissertation enhances Paceline with BlueShift, a circuit design method for TS ar-
chitectures that improves a circuit’s common-case delay rather than focusing on worst-case delay
like traditional design flows. BlueShift profiles a gate-level design as it runs real benchmark ap-
plications to identify the frequently-exercised circuit paths and then applies speed optimizations
to those paths only. These optimizations can be implemented so that they can be enabled and
disabled at run time, and thus exact no power cost when they are not needed (i.e., when the
processor is executing a throughput workload).
Finally, this work presents LeadOut, a multicore design that combines Paceline with an ad-
ditional per-thread performance enhancement: the ability to increase core supply voltage above
nominal. LeadOut evaluates the performance gains that are possible with Paceline alone, voltage
boosting alone, and both together. It shows major gains from applying the two techniques together
when feasible and also shows that, in many cases, future multicores have power and tempera-
ture headroom to exploit still more per-thread enhancements as long as they can be enabled and
disabled dynamically according to application demand.
To my parents, Mike and Debbie Greskamp.
Acknowledgements
My parents, Mike and Debbie, have been most instrumental in this dissertation. Who else would
have bought a fourth-grader thousands of dollars’ worth of electronic components and test
equipment to encourage a burgeoning interest in electronics design? Their continued support and
encouragement ultimately gave me the audacity to attempt the Ph.D., and their counsel throughout
kept me in mental health to see it through.
For sparking my interest in electronics, I’d like to thank Wayne Weise, my first great mentor.
Thanks, Wayne, for taking the time to put together lessons and kits for a student who was only
beginning to understand. For introducing me to the joys (and trials) of research and pointing me
toward graduate school, I thank Prof. Ron Sass, who welcomed me into his research group as a
still-naïve undergraduate.
Next, I owe a great debt to my peer–mentors: Smruti R. Sarangi, Abhishek Tiwari, James Tuck,
Radu Teodorescu, Luis Ceze, Karin Strauss, Wonsun Ahn, and Pablo Montesinos. Thanks for your
critiques and insights, but above all else, thanks for your example. Pablo deserves special thanks
for sharing his experiences, hints, and encouragement during an extremely difficult job search.
Thanks also to the rest of the i-acoma group — past and present — for the discussion (whether
research-related or not) and the occasional levity.
All students owe a large credit to their advisors, but I owe even more than most. Above all
else, Prof. Torrellas taught me to stay focused and to develop a mental toughness that helps me
remain motivated even after repeated setbacks. He also taught me to work analytically — to
systematically exhaust a problem and to distill insights instead of getting lost in the complexities
of implementation. In none of these lessons have I succeeded completely, but under his guidance,
I have progressed.
As for this document itself, the work is not solely my own. For their significant contributions,
I thank my co-authors: Lu Wan, Jeffrey J. Cook, Ulya Karpuzcu, Prof. Deming Chen, and Prof.
Craig Zilles. They provided invaluable assistance at all levels: developing the seminal BlueShift
ideas (Jeffrey, Craig, Lu, and Deming), performing implementation (Lu), and tirelessly
questioning and verifying my methods and assumptions (Ulya).
Of course, graduate school isn’t all about academics, and I’d like to thank my room-mates
and friends, Lee Baugh and Kiran Lakkaraju, for putting up with my idiosyncrasies and providing
a relaxing and fulfilling “home” life away from the lab. I’d also like to thank my friend and
colleague, Ulya Karpuzcu, for her continued support — both academic and personal.
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 The Per-Thread Performance Crisis
1.2 Barriers to Per-Thread Performance
1.2.1 Technology Barriers
1.2.2 Core Microarchitecture Barriers
1.3 A Case for Configurable Multicores
1.4 Timing Speculation
1.5 Contributions and Overview
Chapter 2 Paceline: Timing Speculation for Multicores
2.1 Introduction
2.2 Characterizing Overclockability
2.2.1 Grading Artifacts
2.2.2 Safety Margins
2.2.3 Error Tolerance
2.2.4 Exploiting Overclockability
2.3 Paceline Architecture
2.3.1 Overview of the Microarchitecture
2.3.2 Types of Errors
2.3.3 Detailed Microarchitecture
2.3.4 Additional Issues
2.3.5 Implementation Feasibility
2.4 Analytical Performance Model
2.5 Summary
Chapter 3 BlueShift: Designing Pipelines for Timing Speculation
3.1 Introduction
3.2 Taxonomy of Design for TS
3.2.1 Classification of TS Microarchitectures
3.2.2 General Approaches to Enhance TS
3.2.3 Putting It All Together
3.3 A Common-Case Optimization Flow
3.3.1 The BlueShift Framework
3.3.2 Example BlueShift Techniques
3.4 Summary
Chapter 4 LeadOut: Combining Timing Speculation with V–f Boosting
4.1 Introduction
4.1.1 V–f Boosting
4.2 V–f Boosting and TS are Synergistic
4.2.1 Power Increase in VBoost and Paceline
4.2.2 Composing the Techniques
4.2.3 A Highly-Configurable Multicore
4.3 Dynamic Controller Design
4.3.1 Problem Formulation
4.3.2 Thread Controllers
4.3.3 Global Controller
4.4 Summary
Chapter 5 Paceline Evaluation
5.1 Experimental Setup
5.2 Results
5.2.1 Performance
5.2.2 Dynamic Power
5.2.3 Power Density
5.2.4 Sensitivity to Errors
Chapter 6 BlueShift Evaluation
6.1 Experimental Setup
6.1.1 Modeling Overview
6.1.2 Technology Model
6.1.3 BlueShifted Module Implementation
6.1.4 Module-Level PE and Power
6.1.5 Microarchitecture-Level PE and Power
6.2 Results
6.2.1 Error Curve Transformations
6.2.2 Paceline+OSB Performance and Power
6.2.3 Razor+PCT Performance and Power
6.2.4 Computational Overhead
Chapter 7 LeadOut Evaluation
7.1 Experimental Setup
7.2 Results
7.2.1 Performance
7.2.2 Power–Performance Tradeoffs
7.2.3 Sensitivity Analysis
7.2.4 Configurability
Chapter 8 Related Work
8.1 Timing Speculation
8.1.1 TS Microarchitectures
8.1.2 TS Design Methodologies
8.2 Configurable Microarchitectures
8.2.1 Leader-Checker Execution
8.2.2 Voltage-Frequency Boosting
8.2.3 Other Proposals
Chapter 9 Conclusion
9.1 Summary of Contributions
9.2 Looking Forward
References
Author’s Biography
List of Tables
1.1 Microprocessor evolution over the past 15 years.
2.1 Types of errors considered.
2.2 Estimated versus measured Paceline speedups.
3.1 Classification of existing proposals of TS microarchitectures.
3.2 How TS microarchitectural choices impact what TS-enhancing approaches are most appropriate.
4.1 Multicore constraint regimes.
5.1 Microarchitecture parameters.
6.1 BlueShift TS microarchitecture parameters.
6.2 Technology model.
6.3 BlueShift parameters.
6.4 OpenSPARC modules used to evaluate BlueShift.
6.5 Static power consumption (Psta) and switching energy per cycle (Edyn) for each module implementation in 130nm.
7.1 Variation parameters.
7.2 Power/thermal environments for the sensitivity study.
List of Figures
1.1 Top SPECint92 score by year. Dotted trend-line shows pre-2003 scaling. Solid line estimates future scaling.
1.2 A homogeneous multicore without reconfigurability (a) gives constant per-thread performance regardless of system loading. A configurable homogeneous multicore (b) provides enhancements to improve per-thread performance when few threads are running.
1.3 Error rate (a) and performance (b) versus frequency under TS.
2.1 Qualitative comparison of the potential of the three architectural approaches to exploit overclocking.
2.2 Paceline multicore with 16 cores. Added components are highlighted in red.
2.3 The Paceline microarchitecture.
2.4 Speedup traces for parser (first row) and ammp (second row) with a leader overclocking factor of 1.3.
3.1 General approaches to enhance TS by reshaping the PE(f) curve. Each approach shows the curve before reshaping (in dashes) and after (solid), and the operating point of a processor before (a) and after (b).
3.2 Circuit annotated with net transition times, showing two overshooting paths for this cycle.
3.3 The BlueShift optimization flow.
3.4 On-demand Selective Biasing (OSB): application to a chip (a) and pseudo code of the algorithm (b).
3.5 Transforming a circuit to reduce the delay of A → Z at the expense of that of the other paths. The numbers represent the gate size.
4.1 Qualitative depiction of increases in processor f when the VBoost and Paceline techniques are applied.
4.2 Thread controller examples.
4.3 Global controller.
5.1 Application speedups with overclocking factors (oc) from 1.1–1.4 without a BQ.
5.2 Application speedups with overclocking factors (oc) from 1.1–1.4 with a BQ.
5.3 Dynamic power breakdown in two base cores sharing an L2 (leftmost bar U in each group) and in Paceline mode for overclocking factors of 1.1–1.3 (right three bars in each group).
5.4 Error rate and geometric mean of the Paceline speedup in SPECint applications as the overclocking factor changes.
6.1 Whole-pipeline PE(f) curves for the four implementations. The frequencies are given relative to the Rated Frequency (fr) of Paceline Base.
6.2 Performance (a) and power consumption (b) of Paceline before (Paceline Base) and after (Paceline+OSB) BlueShift. BS and NonBS refer to BlueShiftable and non-BlueShiftable modules, respectively.
6.3 Performance (a) and power consumption (b) of Razor before (Razor Base) and after (Razor+PCT) BlueShift. BS and NonBS refer to BlueShiftable and non-BlueShiftable modules, respectively.
7.1 Performance improvements relative to Unoptimized at various multicore loading levels.
7.2 Relative incidence of limiting constraints across all applications and die samples. The numbers above the bars show the average power consumption of the enhanced thread on each system.
7.3 Per-thread power consumption vs performance for each technique under different load conditions.
7.4 Sensitivity of performance improvements to the thermal environment (a), process guardband (b), and power grid design (c).
7.5 Percentages of performance improvements experienced by an S thread when the application demands different numbers of R and S threads. For each technique, there is an area of infeasible configurations. Iso-performance contours are shown in the feasible regions.
Chapter 1
Introduction
1.1 The Per-Thread Performance Crisis
Semiconductor process technology scaling has driven microprocessor performance for the past
thirty years. Until very recently, each new technology generation brought exponentially more
and exponentially faster transistors, which microarchitects put to innovative use in building
ever-larger and faster processor cores. These designs furnished predictable, exponential
performance gains on existing single-threaded, compute-intensive applications. For example,
Figure 1.1 shows the scaling of SPECint performance since 1993, where each data point is the
highest score reported in that quarter. As the dashed trendline shows, SPECint (and essentially
all compute-intensive applications) enjoyed aggressive performance improvements until about
2003, when technology scaling and microarchitecture innovation began to falter. Since then,
per-thread performance scaling (solid trendline) has fallen far, and probably permanently,
behind the historical rate. The grand challenge for the computing industry today is to find and
implement an alternative method of scaling application performance that matches the
aggressive historical trend.
As of 2009, the processor industry’s near-unanimous answer is a many-core roadmap that
provides exponentially more cores at each technology generation with relatively little increase
in per-thread performance. To keep up with this hardware scaling model, software
implementations must also scale to ever-increasing levels of parallelism. While many important
applications (e.g., enterprise workloads and visual computing) scale readily to large numbers of
cores, others do not. Even for problems that are scalable at least in theory, highly parallel
implementations are frequently more costly (i.e., effort-intensive) than less-parallel ones,
meaning that program
[Figure 1.1 plot: SPECint92 score (logarithmic axis, 100 to 100000) versus year (1993 to 2012).]
Figure 1.1: Top SPECint92 score by year. Dotted trend-line shows pre-2003 scaling. Solid line
estimates future scaling.
scalability will often be limited by practical considerations such as development cost and time-
to-market. This dissertation attempts to hedge the risk inherent in the many-core roadmap by
proposing simple, low-overhead microarchitecture enhancements that improve per-thread perfor-
mance on multicores without compromising parallel throughput.
1.2 Barriers to Per-Thread Performance
To address per-thread performance, we must first understand its technology and
microarchitecture sources. As a starting point, consider the two microarchitectures shown in
Table 1.1, released roughly fifteen years apart and bracketing a period in which SPECint
performance improved by 168x. The table shows a roughly 35x improvement in clock rate between
the two designs, of which a factor of 2 has come from deeper pipelining (the FO4 delay per
stage was halved) and the remaining factor of about 17 from the technology (faster gates).
Assuming that compiler quality remained constant over the interval¹, the rest of the
168x improvement (about 4.9x) has come from gains in IPC even as memory latency increased.
¹ A gross approximation, but for integer applications, the impact of the compiler is expected to be far less than
that of the process and microarchitecture.
                     Intel Pentium    Intel Core i7
Year Introduced      1994             2009
Process              600 nm           45 nm
Die Size             147 mm²          263 mm²
# Cores              1                4
# Transistors        3.2 M            731 M
Clock Frequency      100 MHz          3.46 GHz (w/Turbo)
FO4 per stage        46               23 (est.)
SPECint92            130              21.8 K
Total Power          11 W             130 W
Table 1.1: Microprocessor evolution over the past 15 years.
Together, the 2x deeper pipelining² and 4.9x higher IPC mean that microarchitecture innovation
has contributed a roughly 10x performance improvement over the last fifteen years. The remain-
ing 17x is due to technology advancement. The following subsections examine the technology
and microarchitecture limitations on per-thread performance and explain why, despite this im-
pressive history, further major improvements are unlikely.
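The decomposition above can be checked directly against the entries of Table 1.1. The following is a minimal sketch, not part of the original analysis; the variable names are illustrative, and it assumes (as the text does) that compiler quality is constant, so that any speedup not explained by clock rate is attributed to IPC.

```python
# Values copied from Table 1.1 (Intel Pentium, 1994 vs. Intel Core i7, 2009).
pentium = {"clock_mhz": 100.0, "fo4_per_stage": 46, "specint92": 130.0}
core_i7 = {"clock_mhz": 3460.0, "fo4_per_stage": 23, "specint92": 21800.0}

# Overall per-thread speedup on SPECint92 (about 168x).
total = core_i7["specint92"] / pentium["specint92"]

# Clock-rate improvement (about 35x).
clock = core_i7["clock_mhz"] / pentium["clock_mhz"]

# Share of the clock gain due to deeper pipelining: fewer FO4 delays per stage (2x).
pipelining = pentium["fo4_per_stage"] / core_i7["fo4_per_stage"]

# What remains of the clock gain is attributed to faster gates (about 17x).
technology = clock / pipelining

# Assumption: constant compiler quality, so the residual speedup is IPC
# (about 4.8x with these inputs; the text quotes about 4.9x).
ipc = total / clock

# Microarchitecture contribution = pipelining x IPC (roughly 10x).
microarchitecture = pipelining * ipc

print(f"total {total:.0f}x = technology {technology:.1f}x "
      f"* microarchitecture {microarchitecture:.1f}x")
```

Because the factors are multiplicative, the technology and microarchitecture contributions multiply back to the total speedup.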
1.2.1 Technology Barriers
Classical technology scaling [18] exponentially increases the number and speed of devices on a
fixed-size die while maintaining constant power density. For the seven process generations from
1994 (600nm) to 2009 (45nm), classical scaling predicts a 178x increase in transistor density and
a 13x increase in device speed. While actual scaling closely matched those projections (128x
and 17x, respectively, according to the data in Table 1.1), this was achieved at the cost of a 7x in-
crease in power density — not the constant density classical scaling predicted. The main cause
for the discrepancy is that supply voltage did not decrease as rapidly as the classical models
require. Even worse, supply voltage scaling has reached an impasse below 1 V, and supply voltage
is expected to decrease by only 1.6x over the next thirteen years [1]. This is even less than the
3x reduction that occurred over the preceding thirteen-year period (which saw an explosion in
power density).
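The predicted and actual scaling figures quoted above can be reproduced from the feature sizes and the Table 1.1 entries. This is a sketch under one stated assumption: that classical scaling is read in the usual way, with transistor density growing as the square of the linear shrink and device speed growing linearly with it. That reading matches the 178x and 13x quoted in the text; the variable names are illustrative.

```python
# Feature sizes of the two process nodes (nm).
feature_1994_nm, feature_2009_nm = 600.0, 45.0
linear_shrink = feature_1994_nm / feature_2009_nm

# Assumed classical scaling rules: density ~ shrink^2, speed ~ shrink.
predicted_density = linear_shrink ** 2      # about 178x
predicted_speed = linear_shrink             # about 13x

# Actual values derived from Table 1.1.
transistor_ratio = 731e6 / 3.2e6            # transistor count growth
area_ratio = 263.0 / 147.0                  # die area growth (mm^2)
actual_density = transistor_ratio / area_ratio        # about 128x

clock_ratio = 3460.0 / 100.0                # clock frequency growth
pipelining_ratio = 46.0 / 23.0              # FO4 per stage, Pentium vs. Core i7
actual_speed = clock_ratio / pipelining_ratio         # about 17x

# Power density = total power / die area; classical scaling predicts 1x.
power_density_growth = (130.0 / 263.0) / (11.0 / 147.0)  # about 6.6x, quoted as 7x

print(f"density: predicted {predicted_density:.0f}x, actual {actual_density:.0f}x")
print(f"speed:   predicted {predicted_speed:.0f}x, actual {actual_speed:.0f}x")
print(f"power density growth: {power_density_growth:.1f}x")
```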
² The Core i7’s per-stage FO4 delay is estimated to be the same as that of the preceding Core 2
microarchitecture [16], which is half that of the Pentium.
Meanwhile, the difficulty of precisely manufacturing nanoscale transistors has given rise to
worsening process variation: the deviation of transistor parameters from their nominal specifi-
cation [68, 11]. This variation affects transistor switching delay and leakage power [41, 45, 57],
with the microarchitectural consequence that some cores on a multicore are slower or leakier than
designed. In general, process variation erodes per-thread performance by forcing engineers to de-
sign defensively, sacrificing performance and efficiency. For example, it forces device engineers
to employ higher threshold voltages than would otherwise be optimal in order to control static
power consumption and forces circuit designers to include larger guardbands to accommodate
timing uncertainty. Process variation thus greatly complicates the tradeoffs between power, per-
formance, and reliability.
Because of the power density problem, process variation, and a host of other interrelated
causes, the International Technology Roadmap for Semiconductors now projects that circuit de-
lay will scale down by only 7% per year [1] instead of the historical 17% per year. This means
that processor microarchitects can expect at most a 2.6x increase in clock frequency due to tech-
nology over the next thirteen years. The actual increase may be far less if designers decide to use
the new technology nodes to recover power density that has been lost in the course of scaling in-
stead of pushing clock frequency.
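The 2.6x bound follows from compounding the projected annual delay reduction. The sketch below assumes that circuit delay shrinks by a fixed fraction each year and that clock frequency is inversely proportional to circuit delay; the function name is illustrative and does not come from the original text.

```python
def frequency_gain(delay_reduction_per_year: float, years: int) -> float:
    """Clock frequency gain if circuit delay shrinks by a fixed fraction per year.

    Assumes frequency is inversely proportional to circuit delay.
    """
    remaining_delay = (1.0 - delay_reduction_per_year) ** years
    return 1.0 / remaining_delay


# Projected 7% per year delay scaling over thirteen years (about 2.6x).
projected = frequency_gain(0.07, 13)

# For comparison, the historical 17% per year rate over the same span.
historical = frequency_gain(0.17, 13)

print(f"projected: {projected:.1f}x, historical rate: {historical:.1f}x")
```

Under the same assumptions, the historical rate sustained over thirteen years would have given a gain of more than 11x, which shows how large the gap is between the two regimes.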
1.2.2 Core Microarchitecture Barriers
Although microarchitecture innovation has historically contributed major IPC and frequency
gains (nearly 5x and 2x cumulative gain, respectively, since 1994 according to Table 1.1), tight-
ening power constraints severely limit future microarchitectures. Existing approaches for ex-
tracting higher IPC involve some combination of wider instruction issue and deeper instruction
windows, both of which inherently degrade power efficiency. Wider issue increases power con-
sumption by requiring more heavily-ported structures and forcing operands to travel larger phys-
ical distances on die for bypassing. Deeper instruction windows share similar problems at the
circuit level and also raise the spectre of aggressive speculation, which is necessary to fill the
window with work but inherently wastes more power on incorrect, extraneous instructions. Deep
pipelines (the traditional approach to high-frequency design) are also unattractive due to their
higher degree of speculation and the power consumption of the additional pipeline latches them-
selves.
Heterogeneous designs [40] that contain multiple core types (i.e., a few large, fast cores and
many simple, efficient cores) mitigate the power concern somewhat because most of the cores
will be idle during the sequential sections in which the large cores are active, freeing more room
for microarchitecture innovation on the large cores. However, they incur extra design cost to im-
plement several core types and force a static partitioning of the silicon area between the types.
While heterogeneous designs may indeed prevail, this dissertation assumes a simpler homo-
geneous multicore environment where all of the cores are microarchitecturally identical and of
moderate size.
1.3 A Case for Configurable Multicores
The challenge taken up in this dissertation is to boost per-thread performance on future homoge-
neous multicores where clock frequency and IPC are stagnant. Moreover, this must be done with
minimal area and power cost to avoid compromising performance on the highly-parallel appli-
cations that drive the new scaling roadmap. These criteria are in apparent contradiction because
they seem to demand something (per-thread performance) for nothing (no compromise in peak
throughput), but this is not so. The key to resolving the dilemma is to note that workloads do not
simultaneously demand both per-thread performance and throughput from a given set of threads.
Instead, at any given instant, high per-thread performance is needed only for a subset of threads.
For example, parallel applications tend to alternate between highly-parallel phases that demand
throughput and less-parallel phases that demand per-thread performance from a few threads.
Additionally, multiprogrammed workloads may contain a mix of latency-sensitive (e.g., interactive)
and throughput-sensitive programs.
What is really required then is the ability to configure the microarchitecture on-the-fly to
match the workload’s current demands (i.e., high throughput, per-thread performance, or some
combination of the two). This dissertation refers to this ability to dynamically trade off
per-thread performance and throughput (which others have called reconfigurability [35],
dynamism [27], or composability [36]) as configurability. In the most general terms,
configurability allows the amount of resources (e.g., core pipelines and power budget) dedicated
to each thread to vary according to the workload demand. A configurable multicore speeds up
performance-critical threads by allocating them more than their fair share of resources, for
example by executing them on multiple cores employing Thread Level Speculation [39, 67], by
allowing adjacent execution pipelines to fuse together to achieve wider issue [35], or by raising
their core clock rate and supply voltage [34]. During parallel phases, resources are allocated
uniformly to all threads and each core executes in its most power-efficient mode.
To see the benefits of configurability more clearly, consider an application that has fewer
active threads than the chip has cores. On a traditional (non-configurable) homogeneous multi-
core, no matter how many of the on-chip cores are idle, the per-thread performance of the active
cores is unchanged as shown in Figure 1.2(a). The result is that, for applications with limited
parallelism, adding more cores has no impact on performance. A configurable system, on the
other hand, is able to reallocate resources from the idle cores and put them to work on the active
threads. Figure 1.2(b) shows performance curves for two hypothetical configurable
microarchitectures. Note that the configurable systems provide higher performance when few
threads are active while matching³ the throughput of a non-configurable system when the chip is
running its maximum number of threads.
An important observation from Chapter 4 is that microarchitectural enhancements for con-
figurability can be composable. In other words, two different configurable enhancements such as
those shown in Figure 1.2(b) can combine to give greater performance than either alone. Since
the system still retains the ability to use only a subset (or none) of the per-thread enhancements at
3Assuming the power and area costs of the configurability enhancements are negligible.
6
# Active Threads
Pe
r-T
hr
ea
d 
Pe
rf
# Active Threads
Pe
r-T
hr
ea
d 
Pe
rf
(a) Traditional multicore (b) Reconfigurable multicore
Figure 1.2: A homogeneous multicore without reconfigurability (a) gives constant per-thread
performance regardless of system loading. A configurable homogeneous multicore (b) provides
enhancements to improve per-thread performance when few threads are running.
any give time, performance is never worse than a non-configurable system or a configurable sys-
tem with only a subset of the enhancements. Configurability provides a “knob” that the runtime
environment or operating system can turn to trade off performance and power among the threads
in the system. The more configurability the system provides, the more useful the knob becomes.
1.4 Timing Speculation
This dissertation proposes novel configurable enhancements that recover some of the clock fre-
quency lost to process variation and guardbands in recent technology generations by employing
Timing Speculation (TS). TS safely operates a core above its nominal, rated frequency by detect-
ing and correcting any timing faults. The main contribution of this dissertation is to demonstrate
a configurable, low-overhead TS microarchitecture that is composable with other configurable
enhancements for near-future multicores.
To understand how TS can increase clock frequency, note that the processor’s rated frequency
fr is conservative because it includes guardbands that guarantee safe operation under worst-case
conditions of aging, temperature, supply voltage, and process variation even though the pro-
cessor rarely if ever operates under such extreme conditions. Under nominal conditions, clock
frequency can increase beyond fr without generating timing faults simply by consuming the
guardband. As frequency increases further, it eventually reaches a Limit Frequency f0 at which
the guardband has been fully consumed and beyond which faults begin to occur. As frequency
continues to increase, the number of timing faults per cycle PE grows exponentially [57]. Fig-
ure 1.3(a) shows how the rate of timing faults varies with frequency. Without TS, a processor can
typically only operate at point (a) or at best, point (b). TS enables operation at point (c).
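[Figure 1.3 panels: (a) error rate PE(f) versus frequency f; (b) performance perf(f) versus
frequency f. Both panels mark fr and f0 on the frequency axis, Regions 1, 2, and 3, and the
operating points a, b, and c.]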
Figure 1.3: Error rate (a) and performance (b) versus frequency under TS.
TS results in an overall performance improvement when the gains from increased clock fre-
quency outweigh the execution time overhead of fault recovery. To see how, consider the perfor-
mance perf(f) of a processor clocked at frequency f, in instructions per second:
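\[
\mathit{perf}(f) = \frac{f}{\mathit{CPI}_{norc}(f) + \mathit{CPI}_{rc}(f)}
= \frac{f}{\mathit{CPI}_{norc}(f) \times (1 + P_E(f) \times r_p)}
= \frac{f \times \mathit{IPC}_{norc}(f)}{1 + P_E(f) \times r_p} \tag{1.1}
\]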
where, for the average instruction, CPInorc(f) is the number of cycles taken without considering
any time lost to fault recovery, and CPIrc(f) is the number of cycles lost to recovery
from timing errors. In addition, PE is the probability of error (or error rate), measured in errors
per non-recovery cycle. Finally, rp is the microarchitecture’s recovery penalty per error, measured
in cycles. With these definitions, CPIrc(f) = CPInorc(f) × PE(f) × rp, which gives the second
form, and IPCnorc(f) = 1/CPInorc(f) gives the third.
Figure 1.3(b) illustrates the performance tradeoff. The plots show three regions. In Region 1,
f < f0, so PE is zero and perf increases consistently, impeded only by the application’s increas-
ing memory CPI. In Region 2, errors begin to manifest, but perf continues to increase because the
recovery penalty is small enough compared to the frequency gains. Finally, in Region 3, recovery
overhead becomes the limiting factor, and perf falls off abruptly as f increases. The goal of TS is
to operate as close as possible to point (c).
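The three regions can be reproduced numerically from Equation 1.1. The following is a minimal sketch, not the model evaluated in this dissertation: the exponential form and constants of `error_rate`, the recovery penalty `RP`, and the constant `IPC_NORC` (which ignores the growth in memory CPI noted for Region 1) are all assumptions chosen only for illustration.

```python
import math

F0 = 1.0          # Limit Frequency f0, normalized (assumed)
RP = 100.0        # recovery penalty rp in cycles (assumed)
IPC_NORC = 1.0    # IPC without recovery, held constant (assumed)

def error_rate(f, base=1e-9, growth=150.0):
    """Assumed PE(f): zero up to f0, then exponential growth, capped at 1."""
    if f <= F0:
        return 0.0
    return min(1.0, base * math.exp(growth * (f / F0 - 1.0)))

def perf(f):
    """Equation 1.1: f * IPCnorc / (1 + PE(f) * rp)."""
    return f * IPC_NORC / (1.0 + error_rate(f) * RP)

# Sweep f from f0 to 1.15 * f0 and pick the best operating point, i.e. point (c).
grid = [F0 + 0.001 * i for i in range(151)]
best_f = max(grid, key=perf)
gain = perf(best_f) / perf(F0) - 1.0
print(f"best f / f0 = {best_f:.3f}, gain over f0 = {gain:.1%}")
```

With these assumed parameters the optimum lies a few percent above f0. A steeper error onset (larger `growth`) or a larger `RP` moves it closer to f0.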
1.5 Contributions and Overview
This dissertation makes three interlocking contributions to the design of configurable multicores
at the microarchitecture and circuit levels. First, Chapter 2 introduces Paceline, the first TS mi-
croarchitecture designed specifically for multicores. Paceline enables two cores to work together
to execute a single thread at high speed under TS or independently to execute two threads at
the rated frequency. In single-thread mode, one core in the pair — the “Leader” — executes at
higher-than-normal frequency, while a “Checker” runs at the rated, safe frequency. The Leader
runs the program faster but may experience timing errors. To detect and correct these errors,
the Checker periodically compares a hash of its architectural state with that of the Leader. The
Leader helps the Checker keep up by passing it branch results and prefetches.
Next, Chapter 3 proposes BlueShift, a circuit design method for TS architectures that im-
proves a circuit’s common-case delay rather than focusing on worst-case delay like traditional
design flows. BlueShift profiles a gate-level design as it runs real benchmark applications to iden-
tify the frequently-exercised circuit paths and then applies speed optimizations to those paths
only. These optimizations can be implemented so that they can be enabled and disabled at run
time, so they do not exact a power cost when they are not needed (i.e., when the processor is
executing a throughput workload).
Chapter 4 introduces LeadOut, a multicore design that combines Paceline with an additional
per-thread performance enhancement: the ability to increase core supply voltage above nominal.
LeadOut evaluates the performance gains that are possible with Paceline alone, voltage boosting
alone, and both together. It shows major gains from applying the two techniques together when
feasible and also shows that, in many cases, future multicores have power and temperature head-
room to exploit still more per-thread enhancements as long as they can be enabled and disabled
dynamically according to application demand.
Chapters 5 through 7 evaluate the three contributions in turn. Finally, Chapter 8 compares
Paceline, BlueShift, and LeadOut with the large body of related work, and Chapter 9 concludes.
Chapter 2
Paceline: Timing Speculation for
Multicores
2.1 Introduction
This chapter presents Paceline, a configurable TS microarchitecture for multicores. Paceline im-
proves per-thread performance by running a thread redundantly on two cores of a multicore chip,
called Leader and Checker. The leader is “overclocked” to a higher-than-rated frequency while
still running at the rated, nominal supply voltage. It thus exploits the safety margin for frequency
and even experiences occasional timing errors. Meanwhile, the checker core runs at the rated,
safe frequency. The leader prefetches data into the L2 cache that it shares with the checker and
also passes branch outcomes to the checker. This improves the checker’s IPC and allows it to
keep up with the accelerated leader. The result is that the thread executes faster than on a single
baseline core.
We envision a Paceline multicore containing multiple of these leader–checker core pairs.
Each core pair shares an L2 and includes simple hardware to periodically compare state and to
pass branch outcomes. This hardware requires only small design modifications with respect to
a contemporary multicore — mostly confined to the interface between the L1 and L2 caches —
and can be easily disabled, returning the core pair to the standard multicore operating mode. Our
simulation results show that Paceline substantially improves the performance of a thread without
significantly increasing the power density or the hardware design complexity of the chip. If the
leader can overclock by 30%, Paceline improves SPECint and SPECfp performance by a geomet-
ric mean of 21% and 8%, respectively.
2.2 Characterizing Overclockability
Like all TS schemes, Paceline exploits three sources of “overclockability” in the leader core that
allow it to operate above the rated frequency fr: grading artifacts arising from the way proces-
sors are binned and marked for sale; process and environmental safety margins; and error toler-
ance at frequencies beyond the safe one. This section characterizes these sources.
2.2.1 Grading Artifacts
After fabrication, each die is tested for functionality and speed. The latter process is called speed
binning, where the objective is to assign one of several pre-determined speed grades to each part.
For example, a manufacturer might offer 4, 4.5, and 5 GHz speed grades, and will label each part
with the highest speed grade at which it can safely and reliably operate under worst case condi-
tions.
The binning process introduces overclockability in two ways. The first arises from the fact
that bin frequencies are discrete. For example, under the scheme described above, a part that
passes all tests at 4.8 GHz will be placed in the 4.5 GHz bin. As of early 2007, the Intel E6000
series and the AMD Athlon 64 X2 series space their bin frequencies by 7-14% [6, 32]. If we as-
sume a spacing of 10% as an example, a processor can safely run on average 5% faster than its
binned frequency specification.
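The arithmetic behind the 5% figure can be checked with a short sketch. It assumes that parts' true safe frequencies are spread uniformly between one bin and the next, which the text does not state. The function names `assign_bin` and `average_headroom` are hypothetical.

```python
def assign_bin(max_safe_ghz, bins_ghz):
    """Return the highest speed grade that does not exceed the part's safe frequency."""
    eligible = [b for b in bins_ghz if b <= max_safe_ghz]
    if not eligible:
        raise ValueError("part is slower than the lowest bin")
    return max(eligible)

def average_headroom(spacing, samples=10_000):
    """Mean of (safe frequency / bin frequency - 1) for parts spread uniformly
    between one bin (normalized to 1.0) and the next (1.0 + spacing)."""
    total = 0.0
    for i in range(samples):
        safe = 1.0 + spacing * (i + 0.5) / samples   # midpoint sampling (assumed uniform)
        total += safe / 1.0 - 1.0
    return total / samples

bins = [4.0, 4.5, 5.0]          # the example speed grades from the text, in GHz
part = 4.8                      # a part that passes all tests at 4.8 GHz
grade = assign_bin(part, bins)
print(grade, f"{part / grade - 1:.1%}")
print(f"{average_headroom(0.10):.1%}")
```

Under the uniform assumption the average headroom is half the bin spacing, which is where the 5% for a 10% spacing comes from.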
Secondly, binning contributes to overclockability because within-die process variation causes
some cores on a given multicore to be faster than others. In current practice, the slowest core dic-
tates the bin for the entire die. While it is possible to bin each core on the die individually, this
may not be cost-effective. If multicores continue to be binned according to the slowest core, each
die will contain many underrated cores.
To get a feel for this effect, using the process variation model of Teodorescu et al. [57] with
σ/µ = 9% for the threshold voltage, we find that the σ/µ of on-die core frequency is 4%. Monte
Carlo simulations show that this corresponds to an average 16% difference in frequency between
the fastest and the slowest cores on a 16-core die. As another example, Humenay et al. [29] esti-
mate a 17% difference in frequency between the fastest and slowest core on a 9-core 45nm die.
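A rough feel for the fastest-versus-slowest gap can be obtained with a much simpler Monte Carlo experiment than the one cited above. This sketch assumes that core frequencies are independent and normally distributed with σ/µ = 4%. It is not the spatially correlated variation model of [57], so it should only be expected to land in the same general range as the quoted figure. The function name `mean_fast_slow_gap` is hypothetical.

```python
import random

def mean_fast_slow_gap(n_cores=16, sigma_over_mu=0.04, trials=20_000, seed=1):
    """Average of (fastest / slowest - 1) over many simulated dies.

    Assumption: per-core frequencies are i.i.d. normal with mean 1.0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        freqs = [rng.gauss(1.0, sigma_over_mu) for _ in range(n_cores)]
        total += max(freqs) / min(freqs) - 1.0
    return total / trials

print(f"{mean_fast_slow_gap():.1%}")
```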
2.2.2 Safety Margins
Process and environmental margins for device aging, operating temperature, and supply voltage
can also be exploited for overclocking. For example, device aging due to Negative Bias Temper-
ature Instability (NBTI) [51] and Hot Carrier Injection (HCI) [71] causes critical path delays to
increase over the lifetime of the processor — and this is especially severe in near-future technolo-
gies. A typical high-performance processor’s operational lifetime is seven to ten years [4]. Ac-
cording to [51], the delay increase due to NBTI alone over that period is 8% in 70nm technology,
and HCI adds additional slowdown on top of that. Since processors are guaranteed to operate at
the rated frequency for the entire design lifetime, significant overclockability exists in fresh
processors in which aging has not yet run its full course.
Processors are typically rated for maximum device junction temperatures in the 85–100°C
range (e.g., [31]) even though operating temperatures are often lower. At lower temperatures,
transistors are faster, so overclocking is possible. For example, consider a chip where the hottest
unit is 20°C below the maximum temperature. In this case, the analysis in [28] and data from [38]
show that the safe frequency increases by approximately 5% for 180nm technology.
Finally, while a processor’s off-chip voltage regulator typically has a tight tolerance, on-chip
logic is subject to sizable local voltage drops. These drops are due to rapid current changes that
cause L dI/dt supply noise and sustained periods of high activity that cause IR drops. Since these
drops are difficult to predict at design time, designers assume conservative margins to guarantee
safe operation. For example, the IBM POWER4 processor is designed to provide voltages within
±10% of nominal at all points in the worst case [78]. The logic must therefore be designed to
operate correctly under a 10% supply voltage droop even though this is not the common case.
This fact can offer some limited room to overclock the processor.
2.2.3 Error Tolerance
Timing Speculation can push the core’s clock frequency past the Limit Frequency f0 into a regime
where it will experience occasional timing errors. The paths within a processor have a variety of
delays, and not all paths are exercised with a given input. Consequently, as the frequency climbs
past f0, error rate increases continuously as more and more paths fail. As long as only a few
paths fail, the error rate will be tolerable, and TS can extract performance gains. The slope of
the PE vs f curve in the vicinity of f0 varies widely from design to design. Some designs exhibit
a “critical path wall” where many paths have long delay, and these are antagonistic to TS. Others
exhibit a gradual error onset that results in a tolerable error rate even at frequencies far past f0.
For example, the Razor project fabricated 180nm in-order Alpha microprocessors and mea-
sured the error rates under different frequencies and voltages [17]. It showed the error rate versus
voltage for two specimens at 120 MHz and 140 MHz. If the supply voltage is reduced so that the
chip begins to experience timing errors at 120 MHz (i.e., all safety margins have been removed),
increasing the frequency to 140 MHz yields less than one error per ten thousand cycles for one
specimen and one error per million cycles for the other. This corresponds to a 17% frequency
improvement in exchange for an error rate of roughly 10^-6 to 10^-4 per instruction. On the other
hand, experiments with the OpenSPARC [69] pipeline in Chapter 6 of this dissertation show a
much more rapid error onset, allowing less overclocking.
2.2.4 Exploiting Overclockability
Exploiting the above factors for overclocking requires various levels of microarchitecture sup-
port. Removing the Grading Artifacts requires the fewest microarchitecture changes: The manu-
facturer must speed test each core individually and populate a one-time-programmable, architecturally-
visible table with the core frequencies. The OS can then use this table to set operating conditions.
We call this solution Fine Grain Binning (FGB).
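As an illustration of FGB, the sketch below builds such a per-core table and shows how system software might consult it. The table layout, the function names `build_fgb_table` and `fastest_core`, and the example frequencies are hypothetical; the text specifies only that the table is one-time-programmable, architecturally visible, and holds the core frequencies.

```python
def build_fgb_table(core_max_ghz):
    """One entry per core: its individually tested safe frequency and its
    headroom over the die-level bin, which the slowest core dictates."""
    die_bin = min(core_max_ghz)
    return [
        {"core": core, "safe_ghz": f, "headroom": f / die_bin - 1.0}
        for core, f in enumerate(core_max_ghz)
    ]

def fastest_core(table):
    """Pick the core with the highest tested safe frequency (one possible OS policy)."""
    return max(table, key=lambda entry: entry["safe_ghz"])["core"]

# Hypothetical speed-test results for a four-core die, in GHz.
table = build_fgb_table([4.5, 4.7, 4.9, 5.1])
for entry in table:
    print(entry["core"], entry["safe_ghz"], f"{entry['headroom']:.1%}")
print("fastest core:", fastest_core(table))
```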
At the next level of microarchitecture complexity, Timing Error Avoidance (TEA) schemes
(e.g., [4, 12, 38, 76]) can estimate the maximum safe frequency dynamically as the processor
runs. These techniques either embed additional “canary” critical paths [76], which have delays
slightly longer than the actual critical paths, or they directly monitor the delay of the existing crit-
ical paths [4, 12, 38]. Either way, they are able to determine when the clock period is too close
to the actual critical path delay. Although in theory TEA can push the clock frequency arbitrar-
ily close to f0, practice requires that some safety margins be maintained to avoid errors. Con-
sequently, TEA schemes can remove the Grading Artifacts and some but not all of the Safety
Margins.
With an additional increase in microarchitecture complexity, we can exploit all three fac-
tors for overclocking, including TS, where the processor experiences occasional errors. The re-
quired support is an error detection and correction mechanism. Some examples are Razor [20],
DIVA [7], TIMERRTOL [75], and X-Pipe [77]. In practice, all of these schemes require fairly
invasive microarchitectural changes, either by modifying all processor pipeline latches and the
corresponding control logic or by adding a specialized checker backend for the processor core. In
the next section, we describe the proposed Paceline microarchitecture, which exploits the three
factors for overclocking while requiring minimal modifications to the processor cores and caches.
Figure 2.1 compares the factors for overclocking exploited by each microarchitecture ap-
proach. The figure shows only qualitative data because the factors described above are hard to
quantify and not fully orthogonal.
[Figure 2.1 labels: one bar each for FGB, TEA, and TS, drawn against the three factors Grading
Artifacts, Safety Margins, and Error Tolerance.]
Figure 2.1: Qualitative comparison of the potential of the three architectural approaches to ex-
ploit overclocking.
2.3 Paceline Architecture
Paceline is a leader–checker architecture that improves the performance of a thread by running
it redundantly on two cores of a multicore. The leader core is clocked at a frequency higher than
nominal, exploiting the three factors for overclocking described in Section 2.2. Meanwhile, the
checker core runs at the rated, safe frequency. The leader prefetches data into the L2 cache that
it shares with the checker, and also passes branch outcomes to the checker. This improves the
checker’s IPC and allows it to keep up with the accelerated leader. The hardware periodically
compares the architectural state of the leader and the checker, and is able to detect and recover
from errors due to overclocking or other effects. The result is that the thread executes faster than
on a single baseline core.
As it executes, the overclocked leader dissipates higher power than a baseline core. Mean-
while, the checker, by leveraging the prefetched data and branch outcomes from the leader, spends
less power than a baseline core. To be able to sustain higher than baseline speed without over-
heating, the two cores periodically alternate the leader position. The result is that the chip’s power
density and maximum temperature are not expected to increase substantially over a baseline sys-
tem. Intuitively, the operation is analogous to a paceline of two bicycle riders where riders take
turns to lead. The leader expends more effort while sheltering the other rider.
A Paceline multicore contains several of these leader–checker core pairs, as shown in Figure 2.2.
Each core pair shares an L2 cache and includes simple hardware to periodically compare state
and to pass branch outcomes. This hardware requires only very small core modifications and can
be easily disabled, returning the core pair to the standard multicore operating mode.
Overall, a Paceline pair speeds up a thread (of a serial or a parallel program) without sig-
nificantly increasing multicore power density or hardware design complexity. In the following,
we first give an overview of the microarchitecture and characterize the types of errors it can en-
counter. Then, we present two different Paceline variations, each specialized for handling differ-
ent types of errors.
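[Figure 2.2 labels: a floorplan of cores 0–15 grouped in pairs, each pair attached to an L2 cache
($), with an interconnect; a detail of one pair showing Core 0 and Core 1, each with an L1 I$ and
an L1 D$, the Shared L2, and the added VQ and BQ. Dimension labels: 21 mm, 14 mm, 6 mm, 5.2 mm.]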
Figure 2.2: Paceline multicore with 16 cores. Added components are highlighted in red.
2.3.1 Overview of the Microarchitecture
In Paceline, the leader and the checker cores operate at different frequencies, with the checker
lagging behind and receiving branch outcomes and memory prefetches into the shared L2 from
the leader. Figure 2.3 shows the microarchitecture. The region in the dashed boundary is over-
clocked, while everything else runs at the rated, safe frequency.
The shaded components are the new hardware modules added in Paceline. Specifically, the
outcomes of the leader’s branches are passed to the checker through the Branch Queue (BQ).
Moreover, the hardware in both leader and checker takes register checkpoints every n instruc-
tions and saves them locally in ECC-protected safe storage. In addition, the hardware hashes the
checkpoints into signatures and sends them to the ECC-protected Validation Queue (VQ). As ex-
ecution continues, the VQ checks for agreement between the hashed register checkpoints of the
leader and the checker. The VQ sits in the cache hierarchy between the L1 and L2 caches. The
L1 caches operate in write-through mode as in the Pentium 4 [62], so the VQ sees all the memory
writes in order. Such capability allows it to provide extra functionality that depends on the types
of errors handled. We will see the details in Section 2.3.3.
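The checkpoint-and-compare mechanism can be sketched in software as follows. This is only an illustration under stated assumptions: the text does not name the hash function, so CRC-32 truncated to the signature width stands in for it, and the names `signature` and `SignatureQueue` are hypothetical.

```python
import struct
import zlib
from collections import deque

def signature(registers, bits):
    """Hash a register checkpoint into a `bits`-wide signature (CRC-32 is an assumed stand-in)."""
    packed = struct.pack(f"<{len(registers)}Q", *registers)   # 64-bit registers
    return zlib.crc32(packed) & ((1 << bits) - 1)

class SignatureQueue:
    """Pairs up signatures that the two cores deposit for the same checkpoint."""

    def __init__(self):
        self.pending = {"leader": deque(), "checker": deque()}

    def deposit(self, core, sig):
        self.pending[core].append(sig)

    def compare_next(self):
        """True or False for the oldest checkpoint, or None if a core has not reached it yet."""
        if not self.pending["leader"] or not self.pending["checker"]:
            return None
        return self.pending["leader"].popleft() == self.pending["checker"].popleft()

vq = SignatureQueue()
vq.deposit("leader", signature([1, 2, 3, 4], bits=16))
print(vq.compare_next())     # None: the checker lags behind the leader
vq.deposit("checker", signature([1, 2, 3, 4], bits=16))
print(vq.compare_next())     # True: identical checkpoints always agree
```

Because the checker lags the leader, the queue must hold leader signatures until the checker deposits the matching one, which is why the comparison can return "not yet".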
Although the leader and checker cores redundantly execute the same thread, they do not ex-
ecute in lock-step. Since they are out-of-order processors, they execute instructions in different
orders and even execute different instructions — due to branch misprediction. However, in the
absence of errors, their retirement streams are identical.
[Figure 2.3 labels: Leader (P0) and Checker (P1), each with an L1 cache, a register checkpoint
(Reg ckpt), and a Hash unit; the BQ; the VQ; and the coherent L2 interconnect.]
Figure 2.3: The Paceline microarchitecture.
Given a dynamic write instruction in a program, the leader and the checker will issue the store
to the L1 at different times, when each retires the write instruction. The VQ will only allow one
of the two stores to propagate to the L2, possibly after performing some validation. For reads,
however, there is no such filtering. A load issued by the leader that misses in the L1 is immedi-
ately sent to the L2. If and when the checker issues the corresponding load, it also sends a read
request to L2. The advantage of this approach is that it does not require any read buffering at all
and, therefore, it is easy to support in hardware. However, it may result in the two loads returning
different values — if the location being read is modified in between the reads by, for example,
a write from another thread, a DMA action, or a fault. Smolens et al. [64] call this problem the
Input Incoherence Problem. This approach is also used in Slipstream [53] and Reunion [64].
2.3.2 Types of Errors
To design Paceline, we consider the three potential sources of error shown in Table 2.1: timing
errors due to overclocking, errors due to the input incoherence problem, and soft errors. Ta-
ble 2.1 characterizes them based on (1) whether they re-appear as the code section with the error
is rolled back and re-executed, and (2) what part of the Paceline architecture they can affect.
Type of Error        Repeats in      Can It Affect...
                     Re-Execution?   Leader?                           Checker?
Timing Error Due     Likely          Yes: Register state, data read    No
to Overclock                         into L1, data written to L1
Input Incoherence    Possibly        Yes: Register state, data read    No
                                     into L1, data written to L1
Soft Error           No              Yes: Register state, data         Yes: Same as Leader
                                     written to L1
Table 2.1: Types of errors considered.
These errors behave differently. Timing errors are likely to repeat after rollback because the
same critical paths are likely to be exercised during re-execution. Input incoherence errors may
re-occur when other threads repeatedly update the location that the leader and checker are trying
to read [64]. Indeed, one can construct a pathological scenario where, in each re-execution, a
third processor updates the location between the leader and checker reads. On the other hand,
soft errors do not re-occur.
From the last two columns, we see that both timing and incoherence errors can affect the
same parts of Paceline. Specifically, they can affect the state in the leader’s registers, and the state
in the leader’s L1 that has been read from memory or been written by the processor. However,
they cannot affect the checker. On the other hand, soft errors can affect both the leader and the
checker — both the register state and the data written by the processor into the L1.
A final key difference not shown in the table is that soft errors are much less frequent than
the other two types. For example, based on data in [57], this dissertation will estimate in
Chapter 5 that the expected timing error rate at optimal performance is approximately one per 10^5
instructions. A similar incidence of incoherence errors is shown in the environment considered
by Reunion [64]. On the other hand, the soft error rate may be in the range of one per 10^15 to
10^20 instructions.
2.3.3 Detailed Microarchitecture
Based on the previous discussion, we propose two different levels of microarchitecture support
for Paceline. The first one is a simple design that targets only the frequent types of errors —
namely the timing and input incoherence errors. The second level is a high-reliability design that
targets all three types of errors. We call these designs Simple and High-Reliability, respectively.
They can both be supported by a single microarchitecture, with some of the features being dis-
abled for the Simple design.
There is a key difference between the two designs. Specifically, since timing and incoherence
errors can only affect the leader, Simple can recover without rollback — it recovers simply by
copying the checker’s state to the leader’s. On the other hand, since soft errors can affect both
cores, High-Reliability must recover by rolling back. Unfortunately, the timing and incoherence
errors will likely or possibly repeat during re-execution. To avoid this, the re-execution has to
be performed under different conditions. As we will see, this complicates the High-Reliability
design over the Simple one. Overall, these two designs correspond to different cost-functionality
tradeoffs.
In this section, we describe the two designs. In both cases, the Paceline microarchitecture
must provide two mechanisms: one to periodically compare the state of the leader and the checker,
and another to repair the state of the leader-checker pair when the comparison mismatches.
The Simple Design
In this design, the checker is assumed to be always correct. Ensuring that the leader is also cor-
rect is a performance issue, not a correctness one. If the leader diverges from correct execution,
it will not help speed up the checker; the leader will prefetch useless data into the L2 and pass
useless branch outcomes to the checker, whose branch outcomes will not match the leader’s pre-
dictions.
Since correctness is not at stake, the checkpoint signature can be short — e.g., four bits. The
VQ has a Signature Queue that contains signatures that cores deposit when they take checkpoints
and hash them. The VQ hardware compares signatures from the two cores corresponding to the
same point in the program. Moreover, writes from the checker’s L1 are immediately routed to the
L2, while those coming from the leader’s L1 are discarded. If the leader and checker checkpoints
are different but hash into the same signature due to aliasing, correctness does not suffer. More-
over, since the divergence is likely to be detected in subsequent signature comparisons, the leader
will not continue on a wrong path for long.
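To see why a short signature suffices here, consider the chance that a divergence escapes detection. The sketch below assumes that the signature of a divergent checkpoint is uniformly distributed and that successive comparisons are independent; neither assumption comes from the text, and the function names are hypothetical.

```python
def alias_probability(signature_bits):
    """Chance that two different checkpoints hash to the same signature (uniform hash assumed)."""
    return 2.0 ** -signature_bits

def undetected_after(signature_bits, comparisons):
    """Chance that a divergence survives `comparisons` consecutive signature checks."""
    return alias_probability(signature_bits) ** comparisons

for k in (1, 2, 3):
    print(k, undetected_after(4, k))
```

Under these assumptions a four-bit signature misses a given divergence 1 time in 16 at the first comparison and 1 time in 4096 after three comparisons.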
When a mismatch is detected, we know that a timing or incoherence error has happened and
that its effect is confined to the leader’s processor and L1 cache. At this point, the VQ state is in-
validated and we need to repair the register and L1 cache states of the leader. To repair the regis-
ter state, the checker’s checkpoint is copied to the leader’s registers, effectively rolling the leader
forward past the error. To repair the L1 state, the simplest option is to invalidate the leader’s L1
contents. As execution resumes, the leader will re-fetch the correct data from the L2 on demand.
This approach does not involve a large performance overhead because all the lines invalidated
from the L1 are in the L2.
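The recovery steps for the Simple design can be summarized in a few lines. This is a software sketch of a hardware procedure: the `LeaderState` class, the dictionary used as an L1, and the list used for the VQ's pending signatures are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class LeaderState:
    registers: list
    l1: dict = field(default_factory=dict)   # address -> data

def simple_recover(leader, checker_checkpoint, vq_signatures):
    """Roll the leader forward past the error using the checker's checkpoint."""
    vq_signatures.clear()                          # invalidate the VQ state
    leader.registers = list(checker_checkpoint)    # copy the checker's register checkpoint
    leader.l1.clear()                              # invalidate the leader's L1 contents

leader = LeaderState(registers=[7, 0, 99], l1={0x40: 123})
pending = [0x3, 0xA]
simple_recover(leader, checker_checkpoint=[7, 0, 42], vq_signatures=pending)
print(leader, pending)
```

No rollback is needed because the checker's state is taken as correct; the leader simply resumes from where the checker is.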
A more sophisticated option, which we do not adopt, is to have the VQ record the read miss
and write addresses emanating from the leader L1 during each checkpoint interval and to selec-
tively invalidate only those lines when a signature mismatches. This approach is not guaranteed
to repair the L1 state, so an incoherence may resurface in future signature comparisons. If a cer-
tain number of signature comparisons in a row are found to mismatch, the recovery procedure
can revert to flushing the leader’s L1, which eliminates any lingering incoherence.
As an optimization, the VQ may also include a Write Queue that buffers all leader writes until
they are performed by the checker and sent to the L2. This prevents the leader from suffering in-
coherence when a recently-written line is evicted from its L1 and read again before the checker
commits the corresponding write; instead, the correct value for the read is found in the Write
Queue. As an alternative, a standard victim cache could serve the same purpose.
The High-Reliability Design
In this design, the leader or the checker may suffer an error. We need to make sure that any di-
vergence between the leader and the checker states is detected. Missing a divergence leads to a
correctness problem, since erroneous data can then propagate to the L2.
Paceline compares the register and L1 state of leader and checker. The register state is checked
by comparing the checkpoint signatures of the cores, which are stored in the VQ’s Signature
Queue. In the High-Reliability design, the signature has more bits (e.g., sixteen) than in the Sim-
ple design in order to minimize aliasing. In addition, the L1 state is checked by buffering the
leader writes in an in-order circular Write Queue in the VQ. Each Write Queue entry contains the
address and data of a leader write, together with a Validated bit. When a checker write arrives,
the hardware compares its address and data to the next non-validated entry in the Write Queue
and, if they match, the Validated bit for the entry is set. Note that writes issued by each core ar-
rive at the VQ in program order. For a checkpoint interval to check successfully, the leader and
checker register signatures at the end of the interval must match, and the VQ must have success-
fully validated all writes in the interval. When this is the case, the Validated writes are removed
from the VQ and released to the L2 cache.
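The Write Queue check described above can be sketched as follows. The class and method names are hypothetical, and the sketch makes two simplifying assumptions that are not in the text: the queue holds only the writes of the interval being checked, and a checker write that finds no pending leader write counts as a mismatch.

```python
from collections import deque

class WriteQueue:
    """In-order queue of leader writes awaiting validation by the checker."""

    def __init__(self):
        self.entries = deque()          # each entry: [address, data, validated]

    def leader_write(self, address, data):
        self.entries.append([address, data, False])

    def checker_write(self, address, data):
        """Compare against the next non-validated entry; set its Validated bit on a match."""
        for entry in self.entries:
            if not entry[2]:
                if (entry[0], entry[1]) == (address, data):
                    entry[2] = True
                    return True
                return False
        return False                    # no pending leader write (assumed to be a mismatch)

    def end_interval(self, signatures_match):
        """Return the writes released to the L2, or None if the interval must roll back."""
        ok = signatures_match and all(e[2] for e in self.entries)
        released = [(a, d) for a, d, _ in self.entries] if ok else None
        self.entries.clear()            # released on success, invalidated on rollback
        return released

wq = WriteQueue()
wq.leader_write(0x10, 1)
wq.leader_write(0x20, 2)
wq.checker_write(0x10, 1)
wq.checker_write(0x20, 2)
print(wq.end_interval(signatures_match=True))
```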
If, instead, the signatures mismatch or the writes are not all validated, Paceline rolls the check-
point interval back and re-executes it. Rolling back the interval involves invalidating the VQ’s
state, restoring the register state of the last successfully-compared checkpoint in both cores, and
repairing the L1 state in both cores. To repair the L1 state, there are two alternatives. The sim-
plest is to invalidate both L1 caches as in Simple. In this case, as the leader and checker re-execute,
they will naturally re-populate their L1 caches with lines from the L2. A more advanced alterna-
tive, which we do not adopt, is to re-fetch only those lines accessed during the failing checkpoint
interval (i.e., those which could have contributed to the failure). This can be done by operating
the two L1 caches in a special mode during the re-execution that explicitly forces all accesses
to miss and re-fetch from the L2. At the next successful checkpoint comparison, re-execution is
complete, and the L1s may return to normal mode.
Regardless of which L1 cleanup method is chosen, the VQ operation during re-execution is
no different than its normal-mode operation; it buffers leader writes and sends load requests to
the L2 as usual. However, some additional precautions are needed to handle timing errors and
persistent input incoherence errors. Specifically, recall from Table 2.1 that a timing error will
reappear in a normal re-execution. To avoid this, in a re-execution, the leader is clocked at a
lower, safe frequency until the next checkpoint. We envision the high-frequency clock and the
safe-frequency clock to be always available to the leader. When the leader enters re-execution
mode, the high-frequency clock is disconnected from the leader and the safe-frequency one is fed
to it.
A persistent input incoherence error is one that repeatedly occurs every time that the inter-
val is re-executed [64]. Since this event is expected to be rare, the High-Reliability design uses a
simple approach to handle it. Specifically, when the hardware detects that an interval has been re-
executed more than a few times in a row, an interrupt is sent to the other cores to stall them until
the local core pair successfully proceeds past the next checkpoint.
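The two precautions can be expressed as a small decision function. This is a sketch only: the function name, the string labels for the two clocks, and the default threshold of three consecutive re-executions (standing in for "more than a few times") are assumptions.

```python
def reexecution_controls(consecutive_reexecutions, stall_threshold=3):
    """Return (leader_clock, stall_other_cores) for the current checkpoint interval."""
    # During any re-execution the leader runs from the safe-frequency clock,
    # so that a timing error cannot simply repeat.
    leader_clock = "safe" if consecutive_reexecutions > 0 else "high"
    # A persistent input incoherence error is handled by stalling the other cores.
    stall_other_cores = consecutive_reexecutions > stall_threshold
    return leader_clock, stall_other_cores

for attempts in (0, 1, 4):
    print(attempts, reexecution_controls(attempts))
```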
2.3.4 Additional Issues
Special Accesses
There are a few types of memory accesses, including atomic Read-Modify-Write (RMW) opera-
tions such as atomic increment, and non-idempotent reads such as I/O reads, that require special
support in a paired architecture such as Paceline. These accesses are called serializing operations
in [64]. In this section, we briefly describe the basic support required.
At the simplest level, serializing operations can be implemented by checkpointing the leader
before issuing the operation and then stalling the leader until the checker reaches the same point.
Once this happens, the checker also checkpoints. Then, the checkpoint interval is validated by
comparing the checkpoint signatures and, in the High-Reliability design, also comparing all the
writes in the interval. After that, both cores issue the serializing operation, which is merged in
the VQ and issued as a single access to memory. Let us call r the data that this operation returns,
such as the data read in the RMW operation or in the non-idempotent read. The value r is pro-
vided to both cores and also temporarily buffered in the VQ. After that, both cores take a new
checkpoint and the checkpoint interval is validated. In case of a match, the cores continue.
If, instead, a mismatch occurs, an error has happened since the previous checkpoint — for
example a soft error has affected one of the register files. In this case, the hardware uses the same
recovery procedures as before except that the serializing operation cannot be redone. Instead, the
buffered r value is provided to the processors.
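The key property of this protocol is that the serializing operation reaches memory exactly once, even if the surrounding interval is replayed. A minimal sketch of that behavior follows, assuming a hypothetical `SerializingOpBuffer` class that stands in for the VQ logic; the class, its methods, and the toy atomic increment are illustrative names, not part of the Paceline design.

```python
from typing import Callable, Optional


class SerializingOpBuffer:
    """Toy model of how the VQ could handle one serializing operation."""

    def __init__(self, memory_op: Callable[[], int]) -> None:
        self._memory_op = memory_op            # the real, non-repeatable access
        self._buffered_r: Optional[int] = None

    def issue(self) -> int:
        # Leader and checker requests are merged into one memory access.
        # On a replay after a failed checkpoint comparison, the buffered value
        # r is returned instead of touching memory again.
        if self._buffered_r is None:
            self._buffered_r = self._memory_op()
        return self._buffered_r

    def checkpoint_validated(self) -> None:
        # The interval containing the operation matched, so r can be dropped.
        self._buffered_r = None


counter = {"value": 41}


def atomic_increment() -> int:
    """Hypothetical RMW operation: returns the old value and increments."""
    old = counter["value"]
    counter["value"] = old + 1
    return old


vq = SerializingOpBuffer(atomic_increment)
r_first = vq.issue()    # both cores receive r; memory is updated once
r_replay = vq.issue()   # replay after a mismatch: served from the buffer
vq.checkpoint_validated()
```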
As a simple performance optimization in the case of atomic RMW operations, the leader
can issue an exclusive prefetch for the RMW variable into the L2 before stalling to wait for the
checker. With this support, by the time the leader and checker together issue the RMW operation,
the data is likely to be in the desired state in the L2. Note that this optimization cannot be applied
for non-idempotent reads.
It is possible to design higher performance implementations where leader and checker do not
need to wait for one another. For example, in an RMW operation, the leader could issue the read
and buffer the write in the VQ, rather than waiting for the checker. The checker will later perform
its RMW operation and, if both operations agree, a single operation is made visible to the mem-
ory. Otherwise, the recovery procedure is followed. Since this dissertation focuses on compute-
intensive sequential sections, we do not evaluate these issues.
Interrupts and Exceptions
Delaying interrupt delivery is correct and allowable as long as the delay is bounded. When an
interrupt arrives, Paceline must find some point in the future when the leader and checker have
identical architectural state and then deliver the interrupt simultaneously to both. Checkpoint
boundaries provide ideal points at which to deliver interrupts. Consequently, when an interrupt
signal arrives at the processor pair, it goes first to the VQ. The VQ marks a pending interrupt
bit and waits until the next checkpoint is released. It then initiates a rollback to the just-released
checkpoint and asserts the interrupt line on both cores before restarting them. Both cores then
wake up in the same architectural state and see the interrupt simultaneously. The VQ remembers
the interrupt that was just delivered so that it can re-deliver the same interrupt in the event of a
future rollback recovery.
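The interrupt path can be sketched as follows. This is a simplified model under stated assumptions: `InterruptGate` and its action strings are hypothetical, only one interrupt is tracked at a time, and the rule for forgetting a delivered interrupt (when a later checkpoint is released) is our assumption, since the text does not say when the VQ discards it.

```python
from typing import List, Optional


class InterruptGate:
    """Toy model of VQ logic that aligns interrupt delivery with checkpoints."""

    def __init__(self) -> None:
        self.pending: Optional[int] = None          # waiting for a checkpoint
        self.last_delivered: Optional[int] = None   # kept for re-delivery

    def interrupt_arrives(self, irq: int) -> None:
        # Only mark the pending bit; the cores are not disturbed yet.
        self.pending = irq

    def checkpoint_released(self) -> List[str]:
        if self.pending is None:
            # Assumption: once a later checkpoint is released, a rollback can
            # no longer cross the delivery point, so the record is dropped.
            self.last_delivered = None
            return []
        irq, self.pending = self.pending, None
        self.last_delivered = irq
        return ["rollback_to_released_checkpoint",
                f"assert_irq_{irq}_on_leader_and_checker",
                "restart_both_cores"]

    def rollback_recovery(self) -> List[str]:
        # Re-deliver the remembered interrupt so both cores see it again.
        if self.last_delivered is None:
            return ["restart_both_cores"]
        return [f"assert_irq_{self.last_delivered}_on_leader_and_checker",
                "restart_both_cores"]


gate = InterruptGate()
gate.interrupt_arrives(7)
delivery = gate.checkpoint_released()
replay = gate.rollback_recovery()
```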
We assume that synchronous, program-initiated exceptions have precise state semantics in
the processor, as is the case in most current designs. Intuitively, an instruction with an exception
of this type behaves like a conditional jump with the additional effect of setting one or more sta-
tus registers. This does not demand any special handling from Paceline; exceptions are handled
just as any other instruction. Barring an error, the exception code will execute identically in both
cores just as all other instructions do. In the case of an error causing the execution of the excep-
tion code in one core only, the next checkpoint comparison will fail and initiate recovery.
2.3.5 Implementation Feasibility
Minimizing changes to the processor cores and the cache coherence system has been a key goal
for Paceline, and we believe that the resulting microarchitecture can integrate easily in a com-
mercial CMP. Note that the VQ, the principal Paceline structure, falls outside the core boundary
and occupies the relatively non-critical path between the L1 and L2 caches. Moreover, cache co-
herence, which happens at the L2 level, is completely oblivious to Paceline. Although the core
itself does require some modifications (shown shaded in Figure 2.3), they affect only the fetch
and retirement stages. The renaming, scheduling, and execution logic of the core is completely
unaffected. The checkpointing hardware required for Paceline will likely be present in a future
baseline system by default because several proposals for future core features (e.g., thread-level
speculation and transactional memory) rely on the ability to perform frequent register check-
points.
Perhaps the most demanding feature of Paceline in terms of implementation and verification
effort is the requirement that, at a given time, different cores be able to run at different frequen-
cies. Adding clock domains to a design complicates verification, so until recently, multicore
designs ran all cores at the same frequency. However, pressure to improve power efficiency has
forced chip makers to implement per-core frequency scaling, with the Barcelona microarchitec-
ture from AMD being the first to support it [19]. In Paceline, clock domain crossings occur in the
BQ, VQ, and (for loads) between the leader L1 and the L2.
2.4 Analytical Performance Model
In order to enjoy speedup under Paceline, an application must have two properties: (i) overclock-
ing the leader must increase its performance; and (ii) the improved behavior of the branch predic-
tor and cache subsystems in the checker must increase its IPC. A program that is totally memory-
bound or limited by the L2 cache latency will not satisfy (i), while a program that fits completely
within the L2 cache and already achieves high branch prediction accuracy without Paceline will
not meet requirement (ii).
Here, we present an intuitive method of estimating how much speedup Paceline can generate
for a given overclocking factor and application. The method uses a standard (not Paceline-enabled)
microarchitecture, so it can easily be added to early architectural design-space explorations.
To get the required data, we instrument the simulator to output an IPC trace that records
the average IPC for each chunk of ten thousand dynamic instructions. We then run the applica-
tion on the simulator under two configurations. The first represents an overclocked leader with a
perfect (infinite performance) checker, while the second represents a checker in the presence of a
perfect (infinite performance) leader.
To obtain these leader and checker configurations, we start with the baseline core configu-
ration. The leader configuration differs from the baseline only in being overclocked, so that the
latencies of all memory components beyond the L1 cache increase by the overclocking factor.
The checker configuration differs from the baseline only in having a perfect branch predictor and
a perfect L2 cache (thanks to the BQ and prefetching effects). Running the simulator with each
configuration produces IPC traces. From these, we obtain performance traces by multiplying
the entries in the leader IPC trace by the leader frequency and those of the checker trace by the
checker frequency. Finally, we normalize these numbers to the performance of a baseline core.
The two resulting speedup traces give the leader’s speedup (Li) and the checker’s speedup (Ci)
relative to a baseline core during the ith dynamic instruction chunk. For example, the leftmost
two columns of Figure 2.4(a) show the Li and Ci traces for the SPECint application parser.
The overclocking factor used for the leader is 1.3.
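The conversion from IPC traces to speedup traces can be sketched in a few lines. The function name and the sample numbers below are hypothetical, and the sketch assumes that normalization is done chunk by chunk against a baseline IPC trace; the text does not state whether the baseline is per-chunk or an average.

```python
from typing import List, Sequence


def speedup_trace(ipc: Sequence[float], freq: float,
                  base_ipc: Sequence[float], base_freq: float) -> List[float]:
    """Per-chunk speedup: (IPC x frequency) normalized to the baseline core."""
    if len(ipc) != len(base_ipc):
        raise ValueError("traces must cover the same instruction chunks")
    return [(i * freq) / (b * base_freq) for i, b in zip(ipc, base_ipc)]


# Hypothetical three-chunk IPC traces (not measured data).
base_ipc = [1.0, 0.8, 0.5]
leader_ipc = [1.0, 0.8, 0.4]    # IPC drops where off-core latency dominates
checker_ipc = [1.5, 1.0, 1.0]   # perfect branch predictor and perfect L2

L = speedup_trace(leader_ipc, 1.3, base_ipc, 1.0)   # leader overclocked by 1.3
C = speedup_trace(checker_ipc, 1.0, base_ipc, 1.0)  # checker at rated frequency
```

In this toy trace the leader reaches the full 1.3 speedup only in the chunks where its IPC matches the baseline, which mirrors the observation that memory-bound phases limit the leader.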
[Figure 2.4 (plots omitted): for each application, four panels show Li, Ci, min(Li, Ci), and Sj as speedup (y-axis, 1 to 2) versus dynamic instruction number (x-axis); row (a) is parser and row (b) is ammp.]
Figure 2.4: Speedup traces for parser (first row) and ammp (second row) with a leader overclocking factor of 1.3.
We can use Li and Ci to estimate the speedup of Paceline by observing that, at any given
time, the slower of the leader and checker cores determines the overall speed. Since the chunk
size that we have chosen (10K instructions) is significantly larger than the typical lag between
leader and checker (< 1000 instructions), both leader and checker will typically be executing the
same chunk. Consequently, Pi = min(Li, Ci), shown in the next column of Figure 2.4, is a good
estimate of the instantaneous speedup under Paceline. The speedup Sj over the interval from pro-
gram start to chunk j is then given by the harmonic mean of all Pi in the interval as shown in
Equation 2.1. Referring to the last column of Figure 2.4, we see how Sj evolves as the applica-
tion executes.
\[ S_j = \frac{j}{\sum_{i=1}^{j} \frac{1}{P_i}} \qquad (2.1) \]
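As a minimal sketch of Equation 2.1, the following Python function computes the running speedup Sj from the two speedup traces. The function name and the sample traces are hypothetical; the harmonic mean is the appropriate average here because every chunk contains the same number of instructions, so the time spent in chunk i is proportional to 1/Pi.

```python
from typing import List, Sequence


def paceline_speedup(L: Sequence[float], C: Sequence[float]) -> List[float]:
    """Return [S_1, ..., S_n], the running harmonic mean of P_i = min(L_i, C_i)."""
    speedups: List[float] = []
    inv_sum = 0.0
    for j, (l_i, c_i) in enumerate(zip(L, C), start=1):
        p_i = min(l_i, c_i)      # the slower core sets the pace in chunk i
        inv_sum += 1.0 / p_i     # accumulate relative time spent in chunk i
        speedups.append(j / inv_sum)
    return speedups


# Hypothetical traces for four chunks, not taken from Figure 2.4.
S = paceline_speedup([1.3, 1.3, 1.0, 1.3], [2.0, 1.0, 1.5, 1.3])
```

Here the checker limits the second chunk and the leader limits the third, so the cumulative speedup settles well below the overclocking factor of 1.3.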
For an example of an application on which Paceline performs well, see parser in Figure 2.4(a).
Here, the Li speedup trace is consistently near the maximum possible value of 1.3 because the
leader experiences few L1 cache misses, so almost all of the execution takes place inside of the
overclocked domain. The Ci speedup in the checker from perfect branch and cache subsystems is
high most of the time, allowing the checker to keep up with the leader most of the time. Conse-
quently, the speedup converges to 1.25 with an overclocking factor of 1.3.
In contrast, ammp in Figure 2.4(b) does not show much potential for speedup. Early in the
execution, the application has good cache and branch behavior in a baseline processor and, there-
fore, the Ci speedups are often modest. The result is that the checker limits the Paceline speedup.
Later, the application enters a memory-bound phase, in which Li speedups are low, and the leader
limits the Paceline speedup.
We have predicted the Paceline speedups for our applications using this model. Table 2.2
compares the predicted speedups for the SPECint applications to speedups measured on the cycle-
accurate Paceline-enabled simulator of Chapter 5. Given the strong agreement between the two
sets of numbers, we believe that our trace-based model provides an attractive alternative to cus-
tom simulator implementation when evaluating Paceline.
Appl.    Estimated  Actual    Appl.    Estimated  Actual
bzip2    22%        21%       mcf      6%         6%
crafty   26%        25%       parser   25%        24%
gap      22%        20%       twolf    29%        29%
gcc      28%        28%       vortex   16%        13%
gzip     15%        18%       vpr      29%        29%
Table 2.2: Estimated versus measured Paceline speedups.
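To quantify the agreement in Table 2.2, the per-application differences can be tallied directly. The short script below only transcribes the table entries; the summary statistics (mean and maximum absolute difference, in percentage points) are a convenience computation added here and are not reported in the original evaluation.

```python
# (estimated %, measured %) pairs transcribed from Table 2.2.
table_2_2 = {
    "bzip2": (22, 21), "crafty": (26, 25), "gap": (22, 20), "gcc": (28, 28),
    "gzip": (15, 18), "mcf": (6, 6), "parser": (25, 24), "twolf": (29, 29),
    "vortex": (16, 13), "vpr": (29, 29),
}

# Signed difference per application, in percentage points.
errors = {app: est - act for app, (est, act) in table_2_2.items()}
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)
max_abs_error = max(abs(e) for e in errors.values())
```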
2.5 Summary
This chapter demonstrated a substantial potential for overclocking through TS in near-future mul-
ticores. It introduced the Paceline Simple microarchitecture, which provides a safe means of ex-
ploiting this overclockability to improve the performance of a thread (of a serial or a parallel ap-
plication). It also presented a High-Reliability variant that additionally provides tolerance to tran-
sient faults such as soft errors. In either variant, Paceline has minimal impact on multicore power
density and hardware design complexity. Moreover, it is a configurable scheme that can be en-
abled or disabled on-demand at runtime. When disabled, it has no impact on power consumption
and a very small die area footprint. With an overclocking factor of 1.3, Chapter 5 shows SPECint
and SPECfp speedups of 21% and 8%, respectively, from outfitting a typical near-future core de-
sign with Paceline.
Chapter 3
BlueShift: Designing Pipelines for Timing
Speculation
3.1 Introduction
A key limitation of current TS proposals (including Paceline) is that they assume traditional de-
sign methodologies, which are tuned for worst-case conditions and deliver suboptimal perfor-
mance under TS. Specifically, existing methodologies strive to eliminate slack from all timing
paths in order to minimize power consumption at the target frequency. Unfortunately, this can
create a critical path wall that impedes overclocking. If the clock frequency increases slightly
beyond the target frequency, the many paths that make up the wall fail all at once. The error re-
covery penalty then quickly overwhelms any performance gains from higher frequency.
In this chapter, we present a novel approach where the processor itself is designed from the
ground up for TS. The idea is to identify the most frequently-exercised critical paths in the design
and speed them up enough so that the error rate grows much more slowly as frequency increases.
The majority of the static critical paths, which are rarely exercised, are left unoptimized or even
deoptimized — relying on the TS microarchitecture to detect and correct the infrequent errors in
them. In other words, we optimize the design for the common case, possibly at the expense of the
uncommon ones. We call our approach and design optimization algorithm BlueShift.
This chapter also introduces two techniques that, when applied under BlueShift, improve pro-
cessor performance. These techniques, called On-demand Selective Biasing (OSB) and Path Con-
straint Tuning (PCT), utilize BlueShift’s approach and design optimization algorithm. Both tech-
niques target the paths that would cause the most frequent timing violations under TS and add
slack by either forward body biasing some of their gates (in OSB) or by applying strong timing
constraints on them (in PCT).
Later, Chapter 6 evaluates BlueShift by applying it with OSB and PCT on modules of the
OpenSPARC T1 processor. It finds that applying BlueShift with OSB to a Paceline core con-
tributes an 8% speedup over a traditionally-designed core with little inherent overclockability.
PCT achieves similar speedups, but targets a different microarchitecture. Although both tech-
niques come with a significant power overhead, they are remarkable for providing a means of
scaling logic delay that is orthogonal to voltage scaling. Moreover, OSB is a fully configurable
technique that can be disabled at runtime. When disabled, it imposes no power overhead.
3.2 Taxonomy of Design for TS
This chapter aims to develop general design techniques that are compatible with a variety of TS
schemes — not just Paceline. To that end, this section begins by outlining the key differences and
similarities of previous TS proposals. Next, it constructs a taxonomy to encompass existing and
future TS microarchitectures. Finally, it uses this microarchitecture taxonomy to develop a set of
general approaches to enhance TS, each of which applies to different points in the taxonomy.
Stage-Level TS Microarchitectures
Razor [20], TIMERRTOL [75], CTV [59], and X-Pipe [77] detect faults at pipeline-stage bound-
aries by comparing the values latched from speculatively-clocked logic to known good values
generated by a checker. This checker logic can be an entire copy of the circuit that is safely clocked [59,
75]. A more efficient option, proposed in Razor [20], is to use a single copy of the logic to do
both speculation and checking. This approach works by wave-pipelining the logic [15] and latch-
ing the output values of the pipeline stage twice: once in the normal pipeline latch, and a fraction
of a cycle later in a shadow latch. The shadow latch is guaranteed to receive the correct value. At
the end of each cycle, the shadow and normal latch values are compared. If they agree, no action
is taken. Otherwise, the values in the shadow latches are used to repair the pipeline state.
Another stage-level scheme, Circuit Level Speculation (CLS) [42], accelerates critical blocks
(rename, adder, and issue) by including a custom-designed speculative “approximation” version
of each. For each approximation block, CLS also includes two fully correct checker instances
clocked at half speed. Comparison occurs on the cycle after the approximation block generates
its result, and recovery may involve re-issuing errant instructions.
Leader-Checker TS Microarchitectures
In multicores, two cores can be paired in a leader-checker organization, with both running the
same (or very similar) code, as in Paceline, Slipstream [70], and Reunion [64]. The leader runs
speculatively and can relax functional correctness. The checker executes correctly and may be
sped up by hints from the leader as it checks the leader’s work. Because the two cores are loosely
coupled, configurability comes naturally; they can be disconnected and used independently in
workloads that demand throughput.
One type of leader-checker microarchitecture sacrifices configurability in pursuit of higher
frequency by making the leader core functionally incorrect by design. Optimistic Tandem [47]
achieves this by pruning infrequently-used functionality from the leader. DIVA [7] can also be
used in this manner by using a functionally incorrect main pipeline. This approach requires the
checker to be dedicated and always on.
3.2.1 Classification of TS Microarchitectures
We classify existing proposals of TS microarchitectures according to: (1) whether the fault detec-
tion and correction hardware is always on (Configurability), (2) whether functional correctness is
sacrificed to maximize speedup regardless of the operating frequency (Functional Correctness),
and (3) whether checking is done at pipeline-stage boundaries or upon retirement of one or more
instructions (Checking Granularity). In the following, we discuss these axes. Table 3.1 classifies
existing proposals of TS microarchitectures according to these axes.
Microarchitecture     Configurability   Functional Correctness   Checking Granularity
Razor [20]            Always-on         Correct                  Stage
Paceline              On-demand         Correct                  Retirement
X-Pipe [77]           Always-on         Correct                  Stage
CTV [59]              Always-on         Correct                  Stage
TIMERRTOL [75]        Always-on         Correct                  Stage
CLS [42]              Always-on         Relaxed                  Stage
Slipstream [70]       Always-on         Relaxed                  Retirement
Optim. Tandem [47]    Always-on         Relaxed                  Retirement
DIVA [7]              Always-on         Relaxed                  Retirement
Table 3.1: Classification of existing proposals of TS microarchitectures.
Configurability
The checker hardware that performs fault detection and correction can be kept Always-on or just
On-demand. If per-thread performance is crucial all the time, the processor will always operate
at a speculative frequency. Consequently, an Always-on checker suffices. This is the approach
of most existing proposals. However, future multicores must manage a mix of throughput- and
latency-oriented tasks. To save power when executing throughput-oriented tasks, it is desirable
to disable the checker logic and operate at fr. We refer to schemes where the checker can be en-
gaged and disengaged with minimal power and area overhead as On-demand checkers.
Functional Correctness
Relaxing functional correctness can lead to higher clock frequencies. This can be accomplished
by not implementing rarely-used logic, such as in Optimistic Tandem [47] and CLS [42], by not
running the full program, such as in Slipstream [70], or even by tolerating processors with design
bugs, such as in DIVA [7]. These Relaxed schemes suffer from errors regardless of the clock fre-
quency. This is in contrast to Correct schemes, which guarantee error-free operation at and below
the Limit Frequency f0.
Relaxing functional correctness imposes a single (speculative) mode of operation, demanding
an Always-on checker. Correctness at the Limit Frequency f0 and below is a necessary condi-
tion for checker schemes based on wave pipelining [15] like Razor [20], or On-demand checker
schemes like Paceline.
Checking Granularity
Checking can be performed at pipeline-stage boundaries (Stage) or upon retirement of one or
more instructions (Retirement). In Stage schemes, speculative results are verified at each pipeline
register before propagating to the next stage. Because faults are detected within one cycle of their
occurrence, the recovery entails, at worst, a pipeline flush. The small recovery penalty enables
these schemes to deliver performance even at high fault rates. However, eager fault detection pre-
vents them from exploiting masking across pipeline stages.
The alternative is to defer checking until retirement. In this case, because detection is de-
layed, and because recovery may involve heavier-weight operations, the recovery penalty is higher.
On the other hand, Retirement schemes do not need to recover on faults that are microarchitec-
turally masked, and the loosely-coupled checker may be easier to build.
3.2.2 General Approaches to Enhance TS
Given a TS microarchitecture, Equation 1.1 shows that we can improve its performance by reduc-
ing PE(f). To accomplish this, we propose four general approaches. They are graphically shown
in Figure 3.1. Each of the approaches is shown as a way of reshaping the original PE(f) curve of
Figure 1.3(a) (now in dashes) into a more favorable one (solid). For each approach, we show that
a processor that initially worked at point a now works at b, which has a lower PE for the same f.
Delay Trading (Figure 3.1(a)) slows down infrequently-exercised paths and uses the resources
saved in this way to speed up frequently-exercised paths for a given design budget. This leads to
a lower Limit Frequency f′0 than the f0 of the base design, in exchange for a
higher frequency under TS.
Pruning or Circuit-level Speculation (Figure 3.1(b)) removes the infrequently-exercised paths
from the circuit in order to speed up the common case. For example, the carry chain of the adder
is only partially implemented to reduce the response time for most input values [42]. Pruning
results in a higher frequency for a given PE, but sacrifices the ability to operate error-free at any
frequency.
[Figure 3.1 (plots omitted): four panels, (a) Delay Trading, (b) Pruning, (c) Delay Scaling, and (d) Targeted Acceleration, each plotting PE (y-axis) against frequency (x-axis), with f0 and f′0 marked on the frequency axis and operating points a and b marked on the curves.]
Figure 3.1: General approaches to enhance TS by reshaping the PE(f) curve. Each approach
shows the curve before reshaping (in dashes) and after (solid), and the operating point of a
processor before (a) and after (b).
Delay Scaling (Figure 3.1(c)) and Targeted Acceleration (Figure 3.1(d)) speed up paths and,
therefore, shift the curve toward higher frequencies. The approaches differ in which paths are
sped up. Delay Scaling speeds up nearly all paths, while Targeted Acceleration targets the
common-case paths. As a result, while Delay Scaling always increases the Limit Frequency, Targeted
Acceleration does not, as f′0 may be determined by the infrequently-exercised critical paths.
However, Targeted Acceleration is more energy-efficient. Both approaches can be accomplished with
techniques such as supply voltage scaling or body biasing [73].
3.2.3 Putting It All Together
The choice of a TS microarchitecture directly impacts which TS-enhancing approaches are most
appropriate. Table 3.2 summarizes how TS microarchitectures and TS-enhancing approaches
relate.
Microarchitecture Characteristic   Implication on TS-Enhancing Approach
Configurability                    Delay Trading is undesirable with On-demand microarchitectures
Functional Correctness             Pruning is incompatible with Correct microarchitectures
Checking Granularity               All approaches are applied more aggressively to Stage microarchitectures
Table 3.2: How TS microarchitectural choices impact what TS-enhancing approaches are most
appropriate.
Configurability directly impacts the applicability of Delay Trading. Recall that Delay Trading
results in a lower f0 than the base case. This would force On-demand checking architectures to
operate at a lower frequency in the non-TS mode than in the base design, leading to sub-optimal
operation. Consequently, Delay Trading is undesirable with On-demand checkers.
The Functional Correctness of the microarchitecture impacts the applicability of Pruning.
Pruning results in a non-zero PE regardless of the frequency. Consequently, Pruning is incompat-
ible with Correct TS microarchitectures, such as those based on wave pipelining (e.g., Razor) or
on-demand checking (e.g., Paceline).
Checking Granularity dictates how aggressively any of the TS-enhancing approaches can be
applied. An approach is considered more aggressive if it allows more errors at a given frequency.
Since Stage microarchitectures have a smaller recovery penalty than Retirement ones, all the TS-
enhancing approaches can be applied more aggressively to Stage microarchitectures.
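The three rules above are simple enough to encode as a lookup. The sketch below is one possible encoding, assuming hypothetical names (`TSMicroarchitecture`, `approach_guidance`) and plain strings for the axis values; the two sample entries are taken from Table 3.1.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TSMicroarchitecture:
    name: str
    configurability: str   # "Always-on" or "On-demand"
    correctness: str       # "Correct" or "Relaxed"
    granularity: str       # "Stage" or "Retirement"


def approach_guidance(uarch: TSMicroarchitecture) -> Dict[str, str]:
    """Apply the three rules of Table 3.2 to one microarchitecture."""
    guidance = {a: "applicable" for a in
                ("Delay Trading", "Pruning", "Delay Scaling", "Targeted Acceleration")}
    if uarch.configurability == "On-demand":
        # Delay Trading lowers f0, penalizing the non-TS mode.
        guidance["Delay Trading"] = "undesirable"
    if uarch.correctness == "Correct":
        # Pruning yields a non-zero PE at every frequency.
        guidance["Pruning"] = "incompatible"
    # Stage schemes have a smaller recovery penalty and tolerate more errors.
    guidance["aggressiveness"] = "higher" if uarch.granularity == "Stage" else "lower"
    return guidance


paceline = TSMicroarchitecture("Paceline", "On-demand", "Correct", "Retirement")
razor = TSMicroarchitecture("Razor", "Always-on", "Correct", "Stage")
paceline_guidance = approach_guidance(paceline)
razor_guidance = approach_guidance(razor)
```

For Paceline, this leaves Delay Scaling and Targeted Acceleration as the applicable approaches, to be applied conservatively because checking happens at retirement.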
3.3 A Common-Case Optimization Flow
Our goal is to design processors that are especially suited for TS. Based on the insights from the
previous section, we propose: (1) a novel processor design methodology that we call BlueShift
and (2) two techniques that, when applied under BlueShift, improve processor frequency. These
two techniques are instantiations of the approaches introduced in Section 3.2.2. Next, we present
BlueShift and then the two techniques.
3.3.1 The BlueShift Framework
Conventional design methods use timing analysis to identify the static critical paths in the design.
Since these paths would determine the cycle time, they are then optimized to reduce their latency.
The result of this process is that designs end up having a critical path wall, where many paths
have a latency equal to or only slightly below the clock period.
We propose a different design method for TS processors, where it is fine if some paths take
longer than the period. When these paths are exercised and induce an error, a recovery mecha-
nism is invoked. We call the paths that take longer than the period Overshooting paths. They are
not critical because they do not determine the period. However, they hurt performance in propor-
tion to how often they are exercised and cause errors.
Consequently, a key principle when designing processors for TS is that, rather than working
with static distributions of path delays, we need to work with dynamic distributions of path delays.
Moreover, we need to focus on optimizing the paths that dynamically overshoot most frequently,
by trying to reduce their latency. Finally, we can leave unoptimized many overshooting
paths that are exercised only infrequently, since we have a fault correction mechanism.
BlueShift is a design methodology for TS processors that uses these principles. In the follow-
ing, we describe how BlueShift identifies dynamic overshooting paths and its iterative approach
to optimization.
Identifying Dynamic Overshooting Paths
BlueShift begins with a gate-level implementation of the circuit from a traditional design flow.
A representative set of benchmarks is then executed on a simulator of the circuit. At each cycle
of the simulation, BlueShift looks for latch inputs that change after the cycle has elapsed. Such
endpoints are referred to as overshooting. As an example, Figure 3.2 shows a circuit with a target
period of 500ps. The numbers on the nets represent their switching times on a given cycle. Note
that a net may switch more than once per cycle. Since endpoints X and Y both transition after
500ps, they are designated as overshooting for this cycle. Endpoint Z has completed all of its
transitions before 500ps, so it is non-overshooting for this cycle.
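The per-cycle endpoint test reduces to comparing the last transition time of each endpoint against the clock period. A minimal sketch follows; the function name is hypothetical, and the assignment of transition times to endpoints X, Y, and Z is assumed for illustration rather than read off Figure 3.2.

```python
from typing import Dict, List, Sequence


def overshooting_endpoints(transitions_ps: Dict[str, Sequence[float]],
                           period_ps: float) -> List[str]:
    """Endpoints whose last transition in this cycle occurs after the period."""
    return sorted(ep for ep, times in transitions_ps.items()
                  if times and max(times) > period_ps)


# Illustrative transition times (ps) for one cycle; a net may switch more
# than once, so each endpoint carries a list of times.
cycle = {"X": [447.0, 520.0], "Y": [529.0], "Z": [318.0, 458.0]}
late = overshooting_endpoints(cycle, period_ps=500.0)
```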
Once the overshooting endpoints for a cycle are known, BlueShift determines the path of
gates that produced their transitions. These are the overshooting paths for the cycle, and are the
objects on which any optimization will operate. To identify these paths, BlueShift annotates all
nets with their transition times. It then backtraces from each overshooting endpoint. As it backtraces
from a net with transition time tn, it locates the driving gate and its input whose transition
at time ti caused the change at tn. For example, in Figure 3.2, the algorithm backtraces from X
and finds the path b → c → e. Therefore, path b → c → e is overshooting for the cycle shown.
[Figure 3.2 (schematic omitted): a circuit with nodes a through f and endpoints X, Y, and Z; each net is annotated with the times of its rising (↑) and falling (↓) transitions in the cycle.]
Figure 3.2: Circuit annotated with net transition times, showing two overshooting paths for this
cycle.
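The backtrace can be sketched as a walk from the endpoint toward the inputs. The code below is a simplified model, not the actual BlueShift implementation: it assumes each gate has a single fixed propagation delay and identifies the causing input as the one with a recorded transition at ti = tn − delay, whereas a real gate-level simulator would track causality directly. The netlist connectivity, delays, and transition times are hypothetical values chosen so that the result matches the b → c → e example.

```python
from typing import Dict, List, Tuple

# Gate name -> (input nets, propagation delay in ps). Each gate is assumed to
# drive a net of the same name; nets with no entry are primary inputs.
Netlist = Dict[str, Tuple[List[str], float]]


def backtrace(netlist: Netlist, transitions: Dict[str, List[float]],
              net: str, t_n: float, tol: float = 0.5) -> List[str]:
    """Return the gates, source first, whose transitions produced (net, t_n)."""
    path: List[str] = []
    while net in netlist:
        inputs, delay = netlist[net]
        path.append(net)
        t_i = t_n - delay  # time at which the causing input must have switched
        cause = next((src for src in inputs
                      if any(abs(t - t_i) <= tol for t in transitions.get(src, []))),
                     None)
        if cause is None:
            break  # no recorded input transition explains this output change
        net, t_n = cause, t_i
    return path[::-1]


netlist: Netlist = {
    "b": (["in0"], 87.0),
    "c": (["a", "b"], 73.0),
    "e": (["c", "d"], 72.0),   # gate e drives endpoint X in this toy example
}
transitions = {"in0": [288.0], "a": [216.0], "b": [375.0], "c": [448.0],
               "d": [303.0], "e": [520.0]}
path = backtrace(netlist, transitions, net="e", t_n=520.0)
```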
For each path p in the circuit, the analysis creates a set of cycles D(p) in which that path
overshoots. If Ncycles is the number of simulated cycles, we define the Frequency of Overshooting
of path p as d(p) = |D(p)|/Ncycles. Then, the rate of errors per cycle in the circuit (PE) is
upper-bounded by $\min(1, \sum_p d(p))$. To reduce PE, BlueShift focuses on the paths with the highest
frequency of overshooting first. Once enough of these paths have been accelerated and PE
drops below a pre-set target, optimization is complete; the remaining overshooting paths are ignored.
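A short numerical sketch shows why the sum of d(p) is only an upper bound on PE: when two paths overshoot in the same cycle, the sum counts that cycle twice. The function names and the profile below are hypothetical, and the "actual" rate assumes that a cycle counts as one error no matter how many paths overshoot in it.

```python
from typing import Dict, Set


def overshoot_frequency(D: Dict[str, Set[int]], n_cycles: int) -> Dict[str, float]:
    """d(p) = |D(p)| / Ncycles for every path p."""
    return {p: len(cycles) / n_cycles for p, cycles in D.items()}


def pe_upper_bound(d: Dict[str, float]) -> float:
    """Upper bound on the error rate per cycle: min(1, sum over p of d(p))."""
    return min(1.0, sum(d.values()))


# Hypothetical profile: cycles (out of 1000) in which each path overshoots.
N_CYCLES = 1000
D = {"b->c->e": {3, 17, 250, 900}, "a->d->f": {17}}
d = overshoot_frequency(D, N_CYCLES)
bound = pe_upper_bound(d)

# Fraction of cycles with at least one overshooting path (cycle 17 counts once).
actual = len(set().union(*D.values())) / N_CYCLES
```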
Iterative Optimization Flow
BlueShift makes iterative optimizations to the design, addressing the paths with the highest
frequency of overshooting first. As the design is transformed, new dynamic overshooting paths are
generated and addressed in subsequent iterations. This iterative process stops when PE falls below
the target. Figure 3.3 illustrates the full process. It takes as inputs an initial gate-level design and
the designer’s target speculative frequency and PE.
At the head of the loop (Step 1), a physical-aware optimization flow takes a list of design
changes from the previous iteration and applies them as it performs aggressive logical and physi-
cal optimizations. The output of Step 1 is a fully placed and routed physical design suitable for
fabrication. Step 2 begins the embarrassingly-parallel profiling phase by selecting n training
benchmarks. In Step 3, one gate-level timing simulation is initiated for each benchmark. Each
simulation runs as many instructions as is economical and then computes the frequencies of
overshooting for all paths exercised during the execution. Before Step 4, a global barrier waits for
all of the individual simulations to finish. Then, the overall frequency of overshooting for each
path is computed by averaging the measure for that path over the individual simulation instances.
BlueShift also computes the average PE across all simulation instances.
[Figure 3.3 (flowchart omitted): the initial netlist enters Step 1, physical-aware optimization (restructuring, placement, clock tree synthesis, routing, leakage minimization), which produces the physical design; Step 2 selects training benchmarks 0 through n-1; Step 3 runs a gate-level simulation per benchmark to produce the path profile; Step 4 computes the training set error rate; if PE < target the flow emits the final design, and if PE > target Step 5 speeds up the paths with the highest frequency of overshooting and feeds the design changes back to Step 1.]
Figure 3.3: The BlueShift optimization flow.
BlueShift then performs the exit test. If PE is less than the designer’s target, then optimiza-
tion is complete; the physical design after Step 1 of the current iteration is ready for production.
As a final validation, BlueShift executes another set of timing simulations using a different set of
benchmarks (the Evaluation set) to produce the final PE versus f curve. This is the curve that we
use to evaluate the design.
If, on the other hand, PE exceeds the target, we collect the set of paths with the highest
frequency of overshooting, and use an optimization technique to generate a list of design changes
to speed up these paths (Step 5). Different optimization techniques can be used to generate these
changes. We present two next.
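The control structure of the flow can be sketched as a loop over the five steps. This is a schematic under stated assumptions, not the actual tool flow: the `optimize`, `simulate`, and `speed_up` callables are hypothetical stand-ins for the physical design tools, the gate-level simulator, and the optimization technique; the sequential list comprehension stands in for the parallel simulations and their barrier; and the toy design simply maps each path to its frequency of overshooting.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Design = Dict[str, float]                 # toy design: path -> frequency of overshooting
Profile = Tuple[Dict[str, float], float]  # (d(p) per exercised path, PE) for one benchmark


def blueshift(design: Design, benchmarks: List[str], pe_target: float,
              optimize: Callable[[Design, List[str]], Design],
              simulate: Callable[[Design, str], Profile],
              speed_up: Callable[[List[str]], List[str]],
              max_iters: int = 20) -> Tuple[Design, float]:
    changes: List[str] = []
    for _ in range(max_iters):
        design = optimize(design, changes)                      # Step 1
        profiles = [simulate(design, b) for b in benchmarks]    # Steps 2 and 3
        paths = {p for d, _ in profiles for p in d}
        d_avg = {p: mean(d.get(p, 0.0) for d, _ in profiles) for p in paths}  # Step 4
        pe = mean(pe_b for _, pe_b in profiles)
        if pe < pe_target:                                      # exit test
            return design, pe
        ranked = sorted(d_avg, key=d_avg.get, reverse=True)
        changes = speed_up(ranked)                              # Step 5
    raise RuntimeError("PE target not reached within max_iters")


def toy_optimize(design: Design, changes: List[str]) -> Design:
    # Pretend that every requested change fully fixes its path.
    return {p: (0.0 if p in changes else d) for p, d in design.items()}


def toy_simulate(design: Design, benchmark: str) -> Profile:
    exercised = {p: d for p, d in design.items() if d > 0.0}
    return exercised, min(1.0, sum(exercised.values()))


def toy_speed_up(ranked: List[str]) -> List[str]:
    return ranked[:1]   # address only the worst path in each iteration


final_design, final_pe = blueshift(
    {"p1": 0.02, "p2": 0.005, "p3": 0.001}, ["bench0", "bench1"],
    pe_target=0.002, optimize=toy_optimize, simulate=toy_simulate,
    speed_up=toy_speed_up)
```

In this toy run the loop fixes p1 and then p2, and stops with p3 still overshooting because the remaining error rate is already below the target.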
3.3.2 Example BlueShift Techniques
To speed up processor paths, we propose two techniques that we call On-demand Selective
Biasing (OSB) and Path Constraint Tuning (PCT). They are specific implementations of two of
the general approaches to enhance TS discussed in Section 3.2.2, namely Targeted Acceleration
and Delay Trading, respectively. We do not consider techniques for the other approaches
in Figure 3.1 because a technique for Pruning was already proposed in [47] and Delay Scaling is
a degenerate, less energy-efficient variant of Targeted Acceleration that lacks path targeting.
On-Demand Selective Biasing (OSB)
On-demand Selective Biasing (OSB) applies forward body biasing (FBB) [73] to one or more
of the gates of each of the paths with the highest frequency of overshooting. Each gate that re-
ceives FBB speeds up, reducing the path’s frequency of overshooting. With OSB, we push the
PE versus f curve as in Figure 3.1(d), making the processor faster under TS. However, by apply-
ing FBB, we also increase the leakage power consumed.
Figure 3.4(a) shows how OSB is applied, while Figure 3.4(b) shows pseudo code for the al-
gorithm of Step 5 in Figure 3.3 for OSB. The algorithm takes as input a constant k, which is the
fraction of all the dynamic overshooting in the design that will remain un-addressed after the al-
gorithm of Figure 3.4(b) completes.
The algorithm proceeds as follows. At any time, the algorithm maintains a set of paths that
are eligible for speedup (Pelig). Initially, at entry to Step 5 in Figure 3.3, Line 1 of the pseudo
code in Figure 3.4(b) sets all the dynamic overshooting paths (Poversh) to be eligible for speedup.
(a) [Diagram: the Original Design, built from standard gates, and the Resulting Design, in which
selected gates are FBB gates connected to a Bias line.]
(b)
1  Pelig ← Poversh
2  repeat
3      gsel ← argmax over g of ∑ d(p) for p ∈ (Pelig ∩ paths(g))
4      Pelig ← Pelig − paths(gsel)
5      GFBB ← GFBB + gsel
6  while ( ∑ d(p) for p ∈ Pelig ) / ( ∑ d(p) for p ∈ Poversh ) > k
Figure 3.4: On-demand Selective Biasing (OSB): application to a chip (a) and pseudo code of the
algorithm (b).
Next, in Line 2 of Figure 3.4(b), a loop begins in which one gate will be selected in each iteration
to receive FBB. In each iteration, we start by considering all paths p in Pelig weighted by their
frequency of overshooting d(p). We also define the weight of a gate g as the sum of the weights
of all the paths in which it participates (paths(g)). Then, Line 3 of Figure 3.4(b) greedily se-
lects the gate (gsel) with the highest weight. Line 4 removes from Pelig all the paths in which the
selected gate participates. Next, Line 5 adds the selected gate to the set of gates that will receive
FBB (GFBB). Finally, in Line 6, the loop terminates when the fraction of all the original dynamic
overshooting that remains un-addressed is no higher than k.
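The greedy loop of Figure 3.4(b) can be sketched in Python as follows. This is an illustrative sketch, not the dissertation's implementation: the function and argument names are hypothetical, and the two guards (an empty overshooting set, and no remaining gate that covers an eligible path) are assumptions added so the loop always terminates.

```python
def select_fbb_gates(d, paths_of_gate, k):
    """Greedy FBB gate selection in the spirit of Figure 3.4(b).

    d: dict mapping each path in Poversh to its frequency of overshooting d(p).
    paths_of_gate: dict mapping each gate to the set of paths it participates in.
    k: fraction of the original dynamic overshooting allowed to remain.
    """
    p_elig = set(d)                      # Line 1: every overshooting path is eligible
    total = sum(d.values())
    g_fbb = []
    if total == 0:                       # assumed guard: nothing overshoots
        return g_fbb

    def weight(g):
        # Weight of a gate: summed d(p) over its still-eligible paths.
        return sum(d[p] for p in paths_of_gate[g] if p in p_elig)

    while True:                          # Line 2: repeat
        g_sel = max(paths_of_gate, key=weight)       # Line 3
        if weight(g_sel) == 0:           # assumed guard: no gate covers what is left
            break
        p_elig -= paths_of_gate[g_sel]   # Line 4
        g_fbb.append(g_sel)              # Line 5
        if sum(d[p] for p in p_elig) / total <= k:   # Line 6 exit test
            break
    return g_fbb


d = {"p1": 0.5, "p2": 0.3, "p3": 0.2}
paths_of_gate = {"g1": {"p1", "p2"}, "g2": {"p2", "p3"}, "g3": {"p3"}}
gates = select_fbb_gates(d, paths_of_gate, k=0.1)
```

In this toy input, g1 is chosen first because it covers the two most frequently overshooting paths; a second gate is then needed to bring the remaining fraction down to k.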
After this algorithm is executed in Step 5 of Figure 3.3, the design changes are passed to Step
1, where the physical design flow regenerates the netlist using FBB gates where instructed. In the
next iteration of Figure 3.3, all timing simulations assume that those gates have FBB. We may
later get to Step 5 again, in which case we will take the current dynamic overshooting paths and
re-apply the algorithm. Note that the selection of FBB gates across iterations is monotonic; once
a gate has been identified for acceleration, it is never reverted to standard implementation in sub-
sequent iterations.
After the algorithm of Figure 3.3 completes, the chip is designed with body-bias signal lines
that connect to the gates in GFBB. The overhead of OSB is the extra static power dissipated by
the gates with FBB and the extra area needed to route the body-bias lines and to implement the
body-bias generator [73].
In TS architectures with On-demand checkers like Paceline (Table 3.1), it is best to be able
to disable OSB when the checker is not present. Indeed, the architecture without a checker cannot
benefit from OSB anyway, and disabling OSB also saves all the extra energy. Fortunately, this
technique is easily and quickly disabled by removing the bias voltage. Hence the “on-demand”
part of this technique’s name.
Path Constraint Tuning (PCT)
Path Constraint Tuning (PCT) applies stronger timing constraints on the paths with the highest
frequency of overshooting, at the expense of the timing constraints on the other paths. The result
is that, compared to the period T0 of a processor without TS at the Limit Frequency f0, the paths
that initially had the highest frequency of overshooting now take less than T0, while the remain-
ing ones take longer than T0. PCT improves the performance of the common-case paths at the
expense of the uncommon ones. With PCT, we change the PE versus f curve as in Figure 3.1(a),
making the processor faster under TS — although slower if it were to run without TS. This tech-
nique does not intrinsically have a power cost for the processor.
Existing design tools can transfer slack between connected paths in several ways, exhibited in
Figure 3.5. The figure shows an excerpt from a larger circuit in which we want to speed up path
A → Z by transferring slack from other paths. Figure 3.5(a) shows the original circuit, and fol-
lowing to the right are successive transformations to speed up A → Z at the expense of other
paths. First, Figure 3.5(b) refactors the six-input AND tree to reduce the number of logic levels
between A and Z. This transformation lengthens the paths that now have to pass through two 3-
input ANDs. Figure 3.5(c) further accelerates A → Z by increasing the drive strength of the
critical AND. However, we have to downsize the connected buffer to avoid increasing the capac-
itive load on A and, therefore, we slow down A → X . Figure 3.5(d) refines the gate layout to
shorten the long wire on path A → Z at the expense of lengthening the wire on A → X . Finally,
Figure 3.5(e) allocates a reduced-Vt gate (or an FBB gate) along the A→ Z path. This speeds up
the path but has a power cost, which may need to be recovered by slowing down another path.
[Figure 3.5 panels: (a) Original, (b) Restructure, (c) Resize, (d) Place, (e) Assign Low-Vt. Each
panel shows the circuit excerpt containing inputs A and outputs Z and X, annotated with gate sizes.]
Figure 3.5: Transforming a circuit to reduce the delay of A → Z at the expense of that of the
other paths. The numbers represent the gate size.
The implementation of PCT is simplified by the fact that existing design tools already imple-
ment the transformations shown in Figure 3.5. However, they do all of their optimizations based
on static path information. Fortunately, they provide a way of specifying “timing overrides” that
increase or decrease the allowable delay of a specific path. PCT uses these timing overrides to
specify timing constraints equal to the speculative clock period for paths with high frequency of
overshooting, and longer constraints for the rest of the paths.
The task of Step 5 in Figure 3.3 for PCT is simply to generate a list of timing constraints for
a subset of the paths. These constraints will be processed in Step 1. To understand the PCT al-
gorithm, assume that the designer has a target period with TS equal to Tts. In the first iteration
of the BlueShift framework of Figure 3.3, Step 1 assigns a relaxed timing constraint to all paths.
This constraint sets the path delays to r × Tts (where r is a relaxation factor), making them even
longer than a period that would be reasonable without TS. When we get to Step 5, the algorithm
first sorts all paths in order of descending frequency of overshooting at Tts. Then, it greedily selects
paths from the top of this list until the paths left unselected have a combined frequency of
overshooting below the target PE. To these selected paths, it assigns a timing constraint equal to Tts. Later, when the
next iteration of Step 1 processes these constraints, it will ensure that these paths all fit within
Tts, possibly at the expense of slowing down the other paths.
At each successive iteration of BlueShift, Step 5 assigns the Tts timing constraint to those
paths that account for a combined frequency of overshooting greater than the target PE at Tts.
Note that once a path is constrained, that constraint persists for all future BlueShift iterations.
Eventually, after several iterations, a sufficient number of paths are constrained to meet the target
PE .
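The constraint-generation task of Step 5 for PCT can be sketched as follows. This is one reading of the description above, under stated assumptions: the function name and data layout are hypothetical, paths are selected until the unselected remainder falls below the target PE, and any path not returned is assumed to keep the relaxed r × Tts constraint from Step 1.

```python
def pct_constraints(d, t_ts, target_pe, constrained=frozenset()):
    """Return {path: Tts} for every path that carries the tight constraint.

    d: dict mapping each path to its frequency of overshooting at Tts in this iteration.
    t_ts: target clock period with TS.
    target_pe: the designer's target error rate.
    constrained: paths constrained in earlier iterations (constraints persist).
    """
    order = sorted(d, key=d.get, reverse=True)   # descending frequency of overshooting
    remaining = sum(d.values())                  # overshooting not yet addressed
    selected = set(constrained)
    for path in order:
        if remaining < target_pe:                # the rest fits within the target
            break
        selected.add(path)
        remaining -= d[path]
    return {path: t_ts for path in selected}


d = {"a": 0.006, "b": 0.003, "c": 0.0005}
constraints = pct_constraints(d, t_ts=1.0, target_pe=0.001)
```

Passing the previous iteration's keys as `constrained` models the rule that a constraint, once assigned, persists for all later BlueShift iterations.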
3.4 Summary
This chapter introduced BlueShift, a new design approach and optimization algorithm in which
the processor is designed from the ground up for TS. The idea is to identify and optimize the
most frequently-exercised critical paths in the design, at the expense of the majority of the static
critical paths. It then proposed two specific optimization techniques that, when applied under
BlueShift, improve processor performance: On-demand Selective Biasing (OSB) and Path Con-
straint Tuning (PCT). These techniques target the most frequently-exercised critical paths, and
either add forward body bias to some of their gates or apply strong timing constraints on them.
When applied to modules from the OpenSPARC T1 processor, the evaluation in Chapter 6
shows that OSB and PCT provide average speedups of 8% and 6% on top of traditionally-designed
Paceline and Razor cores, respectively. BlueShift thus provides a new way to speed up logic
modules that is orthogonal to voltage scaling.
Chapter 4
LeadOut: Combining Timing Speculation
with V–f Boosting
4.1 Introduction
Having described a timing speculation microarchitecture and design methodology, we now ex-
pand our search for configurable techniques to enhance per-thread performance. This chapter
focuses on combining TS with a voltage and frequency-boosting technique recently introduced
in Intel’s Nehalem processor as Turbo Boost [34, 5]. The idea is to increase the voltage and
frequency of the active cores on a multicore beyond nominal until the chip reaches its power
envelope. The hardware overhead is modest because it uses the same mechanisms as dynamic
voltage-frequency scaling (DVFS), which is already widely deployed for power savings. This
makes voltage–frequency boosting an especially attractive option for adding configurability to a
multicore.
This chapter observes that individual application of either voltage–frequency boosting or
TS is suboptimal. Alone, neither technique will be able to consistently bring the multicore to
its power envelope or limit temperature. Instead, the technique will be limited by other factors,
such as the maximum supply voltage (beyond which reliability suffers) or maximum error rate
(beyond which performance drops due to frequent recovery), leaving some of the available power
and temperature headroom untapped. This means that the two techniques can combine synergisti-
cally to unlock much higher levels of single-thread performance than either can alone. Moreover,
we demonstrate a dynamic controller that co-optimizes the two techniques. Under a variety of
loading conditions, the two techniques together bring the multicore to the extremes of its power
and temperature envelope in an aggressive pursuit of performance without violating any con-
straints.
Chapter 7 evaluates the controlled combination of these techniques on a 16-core configurable
desktop multicore where pairs of cores can be coupled into Paceline mode or decoupled, and
where each core has its own voltage and frequency domain. The results show that when half of
the cores are busy and the goal is to optimize a new performance-critical thread, the combined
approach enables the thread to attain 34% higher performance than before while consuming
220% more power. Moreover, the configurability of the proposed multicore enables similar gains
under various load conditions.
4.1.1 V–f Boosting
The Intel Core i7 (“Nehalem”) processor introduces the Intel Turbo Boost technology [34], which
increases both supply voltage (Vdd) and frequency (f) on the active (non-idle) cores when high
performance is desired and the chip is operating below its power and thermal limits. To
support this feature, the processor includes sensors and an on-chip controller that continuously
monitors the power, temperature, and current in the chip.
When the operating system requests performance state P0 (highest performance) for one or
more cores and places other cores into an inactive state, the on-chip controller transparently at-
tempts to scale up Vdd and f on the active cores. As long as the on-chip sensors report safe condi-
tions, the controller increases the frequency in steps of 133.33 MHz with a corresponding voltage
increase. The maximum number of steps is determined by the number of inactive cores. If, at any
time, the power, temperature, or current conditions reach unsafe values, the hardware automatically
decreases Vdd and f by one 133.33 MHz step.
Note that in Intel’s Turbo Boost, all active cores increase and decrease Vdd and f at the same
time. In this work, we use a more advanced design where each core increases its Vdd and f independently
of the other cores. Moreover, the Vdd and f increases are not a fixed function of the
number of inactive cores; instead, they continue until the application hits a power or temperature
constraint or another limitation.
4.2 V–f Boosting and TS are Synergistic
The key insight of this chapter is that TS and voltage-frequency boosting are synergistic and mul-
tiply their speedups on per-thread performance. This provides great potential for boosting per-
thread performance in multicores.
4.2.1 Power Increase in VBoost and Paceline
We call the technique described in Section 4.1.1 VBoost and compare it with Paceline from Chap-
ter 2. Of the two, VBoost is simpler to implement, but Paceline still has lower complexity than
other per-thread performance enhancements (e.g., thread-level speculation, core fusion, or very
wide superscalars) because it requires few changes to the core. The modifications needed to in-
tegrate the BQ, VQ, and signature/checkpointing logic are concentrated in the fetch and retire
stages, where they are unlikely to affect the critical path.
Our goal is to apply these two techniques to threads that demand per-thread performance un-
til a core reaches its maximum allowed power (Pmax) or temperature (Tmax). These techniques
have different power behaviors; per-core power (and, therefore, temperature) increases faster with
VBoost than with Paceline. This arises from the fact that VBoost increases both Vdd and f, while
Paceline only increases f. We can see this in the formulas for static power due to leakage (Pleak
in Equation 4.6), dynamic power (Pdyn in Equation 4.5), and logic gate delay [55] (Tg in Equa-
tion 4.4).
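$V_t = V_{t0} + K_{temp}\,(T - T_0) + K_{DIBL}\,(V_{dd} - V_{dd0})$ (4.1)
$\mu \propto T^{-1.5}$ (4.2)
$I_{leak} \propto \frac{\mu}{L_{eff}}\, T^{2}\, e^{-q V_t/(k T n)}$ (4.3)
$T_g \propto \frac{V_{dd}\, L_{eff}}{\mu\, (V_{dd} - V_t)^{\alpha}}$ (4.4)
$P_{dyn} \propto C\, V_{dd}^{2}\, f$ (4.5)
$P_{leak} = V_{dd}\, I_{leak}$ (4.6)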
As VBoost increases Vdd, both Pdyn and Pleak increase rapidly. Indeed, Pleak depends directly
on Vdd (Equation 4.6) and, through Ileak (Equation 4.3), exponentially on Vdd (since Vt is a func-
tion of Vdd in Equation 4.1). Meanwhile, Pdyn has a quadratic dependence on Vdd (Equation 4.5)
and a linear dependence on f (which itself depends on Vdd).
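To make these dependences concrete, the sketch below evaluates Equations 4.1, 4.3, 4.5, and 4.6 as ratios against a baseline operating point. All numeric values are illustrative assumptions and are not taken from the evaluation: a 20% frequency increase, a 10% Vdd increase for VBoost, a DIBL coefficient of -0.1 V/V, n = 1.5, and a fixed temperature of 350 K (so the temperature-dependent terms cancel in the ratio).

```python
import math

K_OVER_Q = 8.617e-5  # Boltzmann constant divided by electron charge, in V/K


def dynamic_power_ratio(v_scale, f_scale):
    """Relative dynamic power from Equation 4.5 with the capacitance C held fixed."""
    return v_scale ** 2 * f_scale


def leakage_power_ratio(vdd0, vdd, k_dibl, n, temp_k):
    """Relative leakage power from Equations 4.1, 4.3, and 4.6 at constant temperature."""
    delta_vt = k_dibl * (vdd - vdd0)                          # Eq. 4.1, T = T0
    ileak_ratio = math.exp(-delta_vt / (K_OVER_Q * temp_k * n))  # Eq. 4.3
    return (vdd / vdd0) * ileak_ratio                         # Eq. 4.6


# Paceline raises only f on the leader; VBoost raises both Vdd and f (assumed values).
paceline_dyn = dynamic_power_ratio(1.0, 1.2)
vboost_dyn = dynamic_power_ratio(1.1, 1.2)
vboost_leak = leakage_power_ratio(1.0, 1.1, k_dibl=-0.1, n=1.5, temp_k=350.0)
```

Under these assumed numbers, the same frequency gain costs more dynamic power with VBoost than with Paceline, and the leakage ratio exceeds the plain Vdd ratio because of the exponential term. The sketch looks only at the boosted core and ignores the temperature feedback on leakage.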
In general, when speeding up a single thread with either of these techniques, VBoost will
reach power or temperature constraints at lower frequencies than Paceline— assuming, of course,
that no other constraint limits the frequency increase earlier. One way to improve the effective-
ness of VBoost is to add activity migration [26], which moves the thread among nearby cores
to reduce the formation of hotspots. We call the combination of VBoost and migration with the
neighboring core VBo+Mig.
4.2.2 Composing the Techniques
The two techniques are synergistic because, in principle, the application of each one is limited by
a different constraint. Specifically, suppose that we have as much power and temperature head-
room as we want and use one of the techniques to gradually increase the frequency of a core. If
we apply Paceline, the limiting factor will be the value of the error rate PEmax beyond which
the performance begins to drop because of the frequency of error recovery. Instead, if we apply
VBoost, the limiting factor will be the value of the process maximum supply voltage (Vddmax),
beyond which the devices become unreliable.
Pictorially, this is shown in Figure 4.1. Figure 4.1(a) shows a typical PE versus f curve for
a core. The core works at f=A. If we apply Paceline, we eliminate guardbands and even allow
the processor to tolerate some errors. We work at f=B in Figure 4.1(b), and accomplish an f in-
crease of ∆fL. Instead, if we apply VBoost alone, the entire PE versus f curve shifts to the right
— while changing its slope in some way. Figure 4.1(c) shows that the processor can now work at
f=C while still respecting the guardband. We have obtained an f increase of ∆fB. Finally, if we
combine both VBoost and Paceline, we still have the same curve as Figure 4.1(c) but, as shown
in Figure 4.1(d), we remove the guardband and tolerate some errors to operate at f=D for an f in-
crease of ∆fL&B.
[Figure 4.1 panels (a) through (e): each plots PE against f, marking operating points A through E,
the frequency gains ∆fL, ∆fB, and ∆fL&B, and, in panel (e), the constraint that limits f.]
Figure 4.1: Qualitative depiction of increases in processor f when the VBoost and Paceline
techniques are applied.
In practice, the core may reach its power or temperature limits before the two techniques are
applied to their full extent. Figure 4.1(e) shows such a case where the f reaches only E. Note that
the core running the critical thread reaches the temperature or power constraint sooner or later
depending, in part, on the load in the rest of the multicore. Specifically, if the load is high, the
temperature of the chip and, therefore, of the core is higher to start with. Moreover, a higher tem-
perature induces higher leakage power, reducing the room to Pmax. Overall, we define three oper-
ational regimes, as shown in Table 4.1.
In the Individual regime, the application of Paceline alone is sufficient to bring a core to its
temperature or power limit. The combination of the two techniques (which we refer to as VBo+Pl)
cannot deliver gains because the voltage of the Paceline cores cannot be increased without
violating the temperature or power constraints. This regime has a higher chance to occur when the
rest of the chip is heavily loaded, as the other active cores generate a background of high tem-
perature that reduces the room to Tmax and Pmax in this core. Our evaluation finds this case to be
rare.
In the Synergistic regime, Paceline alone does not bring a core to its power or temperature
limits. Instead, it reaches the maximum timing error rate first. However, the application of both
techniques together does bring the core to its temperature or power limits. In this regime, the
combination of techniques is likely to deliver gains. Chapter 7 finds that this regime dominates
when the chip is under high to moderate load.
Finally, in the Unfulfilled regime, even when both techniques are applied together, the processor
reaches the maximum supply voltage and limiting timing error rate before arriving at the
temperature or power constraint. In this case, the combination of techniques delivers gains, but
there is power available for per-thread performance improvement that goes unused and could be
spent on additional techniques if they were available. Chapter 7 finds that this regime is common
in lightly-loaded multicores.
Regime | Bounded by T or P? | Gain from combining? | Multicore load | When seen in our evaluation (Chapter 7)?
Individual | Paceline bounded, combination bounded | No | Very High | Rarely
Synergistic | Paceline not bounded, combination bounded | Likely | High to Moderate | 2, 4, 8 proc. available for use
Unfulfilled | Combination not bounded | Yes | Low | 16 proc. available
[Bounding Constraints columns (VBoost: T/P, Vdd; Paceline: T/P, PE; VBo+Pl: T/P, Vdd/PE): the
per-cell marks are not legible in this copy.]
Table 4.1: Multicore constraint regimes.
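The three regimes can be summarized as a small decision function. This is only a restatement of the definitions above; the function name and the two boolean inputs are hypothetical, and the case where Paceline alone is bounded is assumed to imply that the combination is bounded as well.

```python
def classify_regime(paceline_hits_tp, combined_hits_tp):
    """Classify the operating regime of the core running the critical thread.

    paceline_hits_tp: True if Paceline alone reaches the temperature or power limit.
    combined_hits_tp: True if VBo+Pl reaches the temperature or power limit.
    """
    if paceline_hits_tp:
        return "Individual"    # combining cannot help: no room to raise Vdd
    if combined_hits_tp:
        return "Synergistic"   # Paceline stops at PEmax, the combination uses the headroom
    return "Unfulfilled"       # Vddmax and PEmax are reached with headroom left over


regime = classify_regime(paceline_hits_tp=False, combined_hits_tp=True)
```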
4.2.3 A Highly-Configurable Multicore
To support VBoost and VBo+Mig, we give each core from Figure 2.2 its own voltage domain. To
support Paceline, each core already has its own frequency domain. When per-thread performance
matters, the core’s voltage and frequency are ramped up. Under VBo+Mig, the thread bounces
between the core and its neighbor to equalize temperatures. In this case, each of the two cores
oscillates between a high-voltage, high-frequency mode and a shut-down mode. While we could
migrate the thread to a remote core, the benefits of a shared cache more than compensate for the
slightly higher temperature resulting from lateral heat conduction between the adjacent cores.
The configurable multicore operates as follows. At any time, there are R threads where through-
put matters and S threads where per-thread performance is paramount. The multicore can assign
each of the S threads either to a single core using VBoost, or to a core pair using VBo+Mig, Pace-
line, or VBo+Pl— the decision is made by an optimizing controller. The R threads run normally
on remaining cores, and any unused cores are shut down. As the load (R, S) changes, the con-
troller reconfigures the mode, f , and Vdd of each core to maximize overall performance.
4.3 Dynamic Controller Design
To manage the reconfigurability of the multicore, we design a dynamic controller that runs in
software. In this section, we formulate the problem, present the per-thread controllers, and finally
present the global controller.
4.3.1 Problem Formulation
Our goal is to maximize the speed of the S threads, which are performance-critical. We do this by
consuming all the power and thermal headroom available in the cores that run them. To ensure
reliable operation, we must respect two constraints: the per-core power consumption must be less
than Pmax, and the hottest point in the core must have a temperature less than Tmax. Additionally,
to simplify the control, when using the Paceline technique, the timing error rate PE is constrained
to be less than PEmax = 10^-5. Note that we use per-core constraints rather than chip-wide ones
because our hardware has independent per-core power grids and supplies, with no sharing of the
current load between cores.
In this environment, we propose a thread controller in software per S thread (which may con-
trol one or two cores, depending on the technique used), and a simple global controller also in
software that oversees the thread controllers. The inputs and outputs of each thread controller
are:
Inputs: The controller reads a power and a temperature sensor in the core (or each of the two
cores) that it controls. This is like the Intel Foxton [46]. Additionally, if it controls two cores run-
ning in leader-checker mode, it reads two other sensors. The first one is a counter with the rate
of checkpoint rollbacks observed — which gives the error rate PE . The second one is the aver-
age occupancy of the leader-checker coupling queues — which gives an indication of whether the
checker is falling behind the leader.
Outputs: The controller sets the voltage and frequency of the core (or each of the two cores)
that it controls. We assume that V and f can be modified arbitrarily at every control interval (1
ms) with a minimum granularity of 10 mV and 100 MHz, respectively. The controller also sets
the microarchitectural configuration in the cores depending on what technique is to be applied
(VBoost, VBo+Mig, Paceline, VBo+Pl, or none).
4.3.2 Thread Controllers
A thread controller is implemented with two software modules: the f-subcontroller that sets the
core frequency and the V-subcontroller that sets the core voltage. Next, we describe the way they
work for each of the techniques.
Thread Controller for VBoost: In this technique, the thread controller increases the core’s f
as much as possible while applying an elevated V. Figure 4.2(a) shows the organization. The f-
subcontroller drives the optimization by executing a simple search for the maximum f as it mon-
itors the core’s P and T for constraint violations. In the absence of violations, it attempts to in-
crease f at the rate of 100 MHz every 2 ms. Otherwise, it initiates exponential backoff starting
with a reduction of 100 MHz for the next control step.
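[Figure 4.2 panels: (a) VBoost, in which a thread controller with f and V subcontrollers reads T,P
from one core; (b) VBo+Pl, in which a thread controller with f, Vldr, fckr, and V subcontrollers
reads T,P, PE, and queue occupancy Q from a leader (Ldr) and a checker (Ckr).]
Figure 4.2: Thread controller examples.
[Figure 4.3: a global controller above one thread controller per S thread; each thread controller
sets f, V for a pair of cores.]
Figure 4.3: Global controller.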
Whenever the f-subcontroller generates an f change, it sends the new f to the V-subcontroller,
which determines whether a feasible V exists to support it. If so, the V-subcontroller acknowl-
edges the new f and instructs the core PLLs and supplies to move to the new operating point.
Otherwise, it notifies the f-subcontroller that its f choice was infeasible and the f-subcontroller
responds by proposing successively lower f until one is accepted. The V-subcontroller imple-
mentation is simple — just a lookup table indexed by f. At manufacturing test time, the table is
populated with V values that guarantee error-free (fully guardbanded) operation of the core for each
possible f setting. Frequencies that cannot be supported within the allowable Vdd range are marked as
infeasible.
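The interaction between the two VBoost subcontrollers can be sketched as follows. This is a simplified model under stated assumptions, not the controller evaluated in Chapter 7: the class and method names are hypothetical, the backoff is assumed to double on each consecutive violation and reset after a clean interval, and the table is assumed to hold contiguous entries on the 100 MHz grid.

```python
STEP_MHZ = 100  # frequency granularity stated in the text


class VSubcontroller:
    """Lookup table indexed by f; frequencies marked infeasible are simply absent."""

    def __init__(self, table):
        self.table = dict(table)  # f in MHz -> fully guardbanded V in volts

    def lookup(self, f_mhz):
        return self.table.get(f_mhz)  # None means "infeasible"


class FSubcontroller:
    def __init__(self, f_mhz, v_sub):
        self.f = f_mhz
        self.v_sub = v_sub
        self.backoff = STEP_MHZ
        self.ms_clean = 0  # control intervals since the last change or violation

    def step(self, violation):
        """Run one 1 ms control interval and return the resulting (f, V)."""
        if violation:
            target = self.f - self.backoff
            self.backoff *= 2          # assumed doubling for the exponential backoff
            self.ms_clean = 0
        else:
            self.backoff = STEP_MHZ    # assumed reset after a clean interval
            self.ms_clean += 1
            if self.ms_clean < 2:      # raise f only once every 2 ms
                return self.f, self.v_sub.lookup(self.f)
            target = self.f + STEP_MHZ
            self.ms_clean = 0
        # Propose successively lower f until the V-subcontroller accepts one.
        target = max(target, min(self.v_sub.table))
        while self.v_sub.lookup(target) is None:
            target -= STEP_MHZ
        self.f = target
        return self.f, self.v_sub.lookup(self.f)


ctl = FSubcontroller(3800, VSubcontroller({3800: 1.00, 3900: 1.05, 4000: 1.10}))
trace = [ctl.step(False), ctl.step(False), ctl.step(True), ctl.step(True)]
```

The example table values are made up; in the text, the table is populated at manufacturing test time.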
Thread Controller for VBo+Pl: In this technique, the thread controller drives the leader f as
high as possible while keeping the error rate below PEmax. Simultaneously, the checker fre-
quency is adjusted to ensure that the checker does not fall behind and limit performance. Both
the leader and checker can receive elevated voltages. However, at all times, the checker is oper-
ated at a fully-guardbanded voltage that ensures error-free operation just like a VBoosted core.
As shown in Figure 4.2(b), this thread controller uses an f-subcontroller and a V-subcontroller
for each of the two cores. Compared to VBoost, the leader’s V-subcontroller (Vldr) and the checker’s
f-subcontroller (fckr) are slightly different. The reason is that the Vldr-subcontroller provides a V
that accommodates a certain error rate, and the fckr-subcontroller has the goal of enabling the
checker to catch up with the leader.
The controller works as follows. The leader uses the same f-subcontroller as in VBoost to
continuously search for the highest f. However, the leader’s Vldr-subcontroller provides a V that
accommodates the f target with an acceptable error rate PE . Like in VBoost, the Vldr-subcontroller
is implemented as a lookup table indexed by f, but unlike in VBoost, it learns its entries dynami-
cally in the field — since they depend on process variation, application, and environmental pa-
rameters. At power-on, all entries default to a minimum Vdd. Whenever a frequency fviol is re-
quested and PEmax is exceeded, all table entries for f ≥ fviol increment their V by 10 mV. In this
way, the table entries monotonically converge toward the minimal Vs that ensure PE < PEmax.
However, because of the monotonicity, transient bursts of errors can drive V to values that are
unnecessarily high in steady state. To combat this tendency, all entries are decremented by 10
mV once per second.
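The learning behavior of the Vldr lookup table can be sketched as follows. The class and method names are hypothetical, voltages are kept in integer millivolts to avoid floating-point drift, and the clamping to a minimum and maximum Vdd is an assumption that the text does not spell out.

```python
class LeaderVTable:
    """Per-frequency leader voltage table that is learned in the field."""

    STEP_MV = 10  # voltage granularity stated in the text

    def __init__(self, freqs_mhz, v_min_mv, v_max_mv):
        self.v_min_mv = v_min_mv
        self.v_max_mv = v_max_mv
        # At power-on, all entries default to the minimum Vdd.
        self.mv = {f: v_min_mv for f in freqs_mhz}

    def on_violation(self, f_viol):
        """PEmax was exceeded at f_viol: raise every entry with f >= f_viol."""
        for f in self.mv:
            if f >= f_viol:
                self.mv[f] = min(self.mv[f] + self.STEP_MV, self.v_max_mv)

    def decay(self):
        """Called once per second: lower every entry to undo transient bumps."""
        for f in self.mv:
            self.mv[f] = max(self.mv[f] - self.STEP_MV, self.v_min_mv)


table = LeaderVTable([3800, 3900, 4000], v_min_mv=900, v_max_mv=1200)
table.on_violation(3900)
after_violation = dict(table.mv)
table.decay()
```

Between decays the entries only move upward, which matches the monotonic convergence described above; the periodic decay lets them drift back down after a burst of errors.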
The fckr-subcontroller in the checker, rather than maximizing the checker’s f, attempts to
keep the leader–checker coupling queues no more than half and no less than one quarter full. It
does so by treating the leader–checker pair as a GALS system and applying the attack-decay con-
trol algorithm [60] as follows. Whenever the queues are more than half full, the checker f is in-
creased (attack) by 200 MHz per ms. When they are less than one quarter full, the f is decreased
(negative attack) by 100 MHz per ms. Otherwise, the f is decreased by 100 MHz every 7 ms (de-
cay).
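The three cases of the checker's attack-decay rule can be written as a single per-interval update. This is a sketch under assumptions: the function name is hypothetical, occupancy is expressed as a fraction of queue capacity, one call corresponds to one 1 ms control interval, and the decay timer is assumed to restart after any attack.

```python
def checker_freq_step(f_mhz, occupancy, ms_since_decay):
    """Return (new_f_mhz, new_ms_since_decay) after one 1 ms control interval.

    occupancy: average fill level of the leader-checker queues, from 0.0 to 1.0.
    ms_since_decay: intervals spent in the target band since the last f change.
    """
    if occupancy > 0.5:                 # checker falling behind: attack
        return f_mhz + 200, 0
    if occupancy < 0.25:                # checker faster than needed: negative attack
        return f_mhz - 100, 0
    ms_since_decay += 1                 # inside the band: slow decay
    if ms_since_decay >= 7:
        return f_mhz - 100, 0
    return f_mhz, ms_since_decay


f, timer = checker_freq_step(3800, occupancy=0.6, ms_since_decay=3)
```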
Thread Controllers for VBo+Mig and Paceline: These controllers are special cases of the two
organizations already presented. In VBo+Mig, each of the two cores has the f-subcontroller and
V-subcontroller of VBoost. The only difference is that, in the inactive core, the V-subcontroller
turns off the power supply to conserve power. The thread controller for Paceline is a degenerate
case of VBo+Pl’s where the two V-subcontrollers veto any f that is infeasible at the nominal V.
4.3.3 Global Controller
The global controller receives information from a higher level of the system (e.g., the operating
system or the runtime) about which threads are S-threads. It then attempts to speed up all of these
threads uniformly. To keep the control simple, the global controller selects a single technique
(VBoost, VBo+Mig, Paceline, VBo+Pl, or none) that all the thread controllers apply to the S
threads. Figure 4.3 shows the overall organization. In the figure’s example, the global controller
has decided to apply VBo+Pl to the S threads.
The algorithm used by the global controller is based on a heuristic that works well in our ex-
periments of Chapter 7. Specifically, if the chip has enough idle cores to dedicate two cores to
each S thread, then the global controller selects VBo+Pl for application. Otherwise, it selects
VBoost. This heuristic is shown in Chapter 7 to be suboptimal only in rare cases when the chip
is in the Individual regime of Section 4.2.2. Finally, the global controller reruns its algorithm ev-
ery time that the number of S or R threads changes.
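The heuristic can be sketched in a few lines. The function name and its inputs are hypothetical; in particular, `idle_cores` is assumed to mean the cores not occupied by R threads, and returning "none" when there are no S threads is an added assumption.

```python
def choose_technique(idle_cores, s_threads):
    """Pick the single technique that all thread controllers will apply.

    idle_cores: cores not running R threads (assumed definition).
    s_threads: number of performance-critical threads.
    """
    if s_threads == 0:
        return "none"                   # assumed: nothing to speed up
    if idle_cores >= 2 * s_threads:     # two cores can be dedicated to each S thread
        return "VBo+Pl"
    return "VBoost"


choice = choose_technique(idle_cores=8, s_threads=1)
```

The function would be called again whenever the number of S or R threads changes.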
4.4 Summary
This chapter augmented a TS-enabled multicore with the ability to boost voltage and frequency
of critical threads. In a large multicore with varying numbers of busy cores and critical threads,
this chapter observed that applying either TS or V–f boosting alone is suboptimal.
Due to supply voltage or error rate limitations, they are often unable to bring the multicore
all the way to its power or temperature envelope. This chapter argued that the two techniques
are complementary and can be synergistically combined to unlock much higher levels of single-
thread performance. Moreover, it presented an example dynamic controller that co-optimizes the
two techniques.
Among other experiments, Chapter 7 evaluates these techniques and their combination on a
simulated 16-core, desktop-style configurable multicore where half of the cores are busy. When
the goal is to run one additional thread at high performance, application of either Paceline or
voltage boosting increases the thread’s performance by 13% or 20%, respectively, while increas-
ing its power by 100%. However, if we combine both techniques, we unlock much higher single-
thread performance. Specifically, the performance-critical thread attains 34% higher performance
than before, while consuming 220% more power. The high configurability of a multicore with
both techniques enables similar gains for various load conditions.
Chapter 5
Paceline Evaluation
Paceline is the central feature of the proposed configurable CMP. The evaluation therefore begins
by considering Paceline by itself — without any of its subsequent BlueShift or voltage–frequency
boosting enhancements.
5.1 Experimental Setup
We use a modified version of the cycle-accurate Wattch-enabled [13] SESC [2] simulator to eval-
uate the performance and power impact of Paceline. The evaluation assumes a cache-coherent
multicore comprising 16 high-performance out-of-order cores, configured as shown in Table 5.1.
The core design is slightly more aggressive than current designs like the Intel Core i7 [33], and
pairs of cores share an L2 cache. The bottom of the table shows the parameters for the Paceline
features, not present in the baseline architecture.
5.2 Results
5.2.1 Performance
To get a feel for Paceline’s performance potential, we first consider performance in the absence
of errors. In other words, we assume that the leader frequency can be increased by some over-
clocking factor oc above fr without causing timing errors and see how much performance Pace-
line can extract. As the overclocking factor increases, performance relative to a single baseline
core increases monotonically toward an asymptote. Figure 5.1 shows the speedups for the individual
SPECint and SPECfp applications without a BQ as the leader overclocking factor (oc)
varies from 1.1 to 1.4 in increments of 0.1. The figure also includes bars for the geometric mean
of the integer and floating-point applications. The figure shows that the SPECint applications do
not experience much speedup from prefetching alone. In fact, only gap and mcf achieve any
speedup at all without a BQ; at an overclocking factor of 1.3, they reach speedups of 2% and 4%,
respectively. The remaining SPECint applications experience no speedups or even small slow-
downs. The reason is that almost all of the SPECint2000 working sets fit within the L2 cache.
When there are few L2 misses, there can be no prefetching benefit. Consequently, the leader only
wastes L2 bandwidth.
Core Parameters
General | 32nm, Out-of-Order, 3.8 GHz
Pipe width | 6-fetch, 4-issue, 4-retire
Pipe depth | Min. 13 cycle mispredict penalty
ROB size | 152
Scheduler size | 40 fp, 80 int
LSQ size | 54 LD, 46 ST
Branch pred | 80Kb local/global tournament, unbounded RAS
L1 I cache | 16KB, 2 cyc, 2 port, 2 way
L1 D cache | 16KB WT, 2 cyc, 2 port, 4 way
L2 cache | 2MB WB, 2 ns round trip, 1 port, 8 way, shared by two cores, has stride prefetcher
Cache line size | 64 bytes
Memory | 80 ns round trip, 10 GB/s max per core pair
Paceline Parameters
Design evaluated | Simple (Section 2.3.3)
VQ write queue | 64 entries, 2 ports per core, 8 cycles
Ckpt interval | 100 instructions
Ckpt restore cost | 100 cycles
VQ ckpt queue | 5 entries
Migration policy | Swap leader and checker every 250µs
Table 5.1: Microarchitecture parameters.
On the other hand, with a BQ, Figure 5.2 shows large but application-dependent gains from
Paceline, with speedups for the oc-1.3 case ranging from 1.03 (equake) to 1.29 (vpr). More-
over, while some applications (e.g., vortex) top out at modest overclocking factors, others
could benefit from factors in excess of 1.4.
[Figure 5.1: Application speedups with overclocking factors (oc) from 1.1–1.4 without a BQ. Bar chart; x axis: bzip2, crafty, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, int-hmean, ammp, art, equake, mesa, swim, wupwise, fp-hmean; y axis: Speedup (%), 0 to 40; bars: oc1.1, oc1.2, oc1.3, oc1.4.]
With the BQ, applications with low L2 miss rates are at an advantage, since the leader is able to do useful work when overclocked rather than stalling
for main memory accesses. The BQ is a major benefit for most SPECint applications, where the
dynamic branch count is high and the branch prediction accuracy relatively low. Here, the near-
perfect branch information from the BQ can vastly increase the checker IPC (sometimes by a fac-
tor of two or more) — as was shown in the parser example of Figure 2.4. The dramatic impact
of the BQ is in accord with previous work [22], which reported similar speedups from perfect
branch prediction.
In contrast, the SPECfp applications present behavior that is difficult for Paceline to exploit,
whether a BQ is present or not. Most of the time, one of the following two conditions prevails:
Either the program is memory-bound so that overclocking the leader has little effect, or the pro-
gram is hitting in the L2 cache and generating good branch predictions so that branch and cache
hints do not help the checker keep up with the leader. This was the behavior seen from ammp in
Figure 2.4.
[Figure 5.2: Application speedups with overclocking factors (oc) from 1.1–1.4 with a BQ. Bar chart; x axis: bzip2, crafty, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, int-hmean, ammp, art, equake, mesa, swim, wupwise, fp-hmean; y axis: Speedup (%), 0 to 40; bars: oc1.1, oc1.2, oc1.3, oc1.4.]
5.2.2 Dynamic Power
We compare the dynamic power consumed by a Paceline pair to that consumed by two baseline
cores sharing an L2 and independently executing the same program. Figure 5.3 shows the abso-
lute dynamic power consumption for a pair of cores under five different configurations. The left-
most bar (U) represents the base unpaired case of two cores independently executing the same
application. Moving to the right, the following four bars represent a Paceline system running
with an overclocking factor of 1.1, 1.2, 1.3, and 1.4, respectively. The three segments within each
bar show the power consumed in the checker core (Checker), leader core (Leader), and in the
added Paceline VQ, BQ, checkpointing, and hashing structures (Extra). The core powers include
the L1 instruction and data cache access power. Off-chip power is not considered.
A key observation is that the power of a Paceline pair is usually approximately equal to that
of two baseline cores. Due to its improved branch prediction, the checker fetches and executes
far fewer instructions in the Paceline system than it would in the base system. This is consistent
with results from other work [49], which indicate that for the SPECint applications, perfect
branch prediction reduces the number of fetched instructions by 53% on average. Using this
information, we can obtain an intuitive feel for the relationship between baseline and Paceline
power.
[Figure 5.3: Dynamic power breakdown in two base cores sharing an L2 (leftmost bar U in each group) and in Paceline mode for overclocking factors of 1.1–1.3 (right three bars in each group). Stacked bars; axis: Power (W), 0 to 14; groups: bzip2, crafty, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, int-mean, ammp, art, equake, mesa, swim, wupwise, fp-mean; bar labels: U, 1.1, 1.2, 1.3, 1.4; segments: Checker, Leader, Extra.]
Consider gzip, which attains a speedup $S_{1.3} = 1.18$ with an overclocking factor of 1.3. Our
simulations show that the checker fetches 51% fewer instructions than a baseline core. Wattch
shows that in total, it consumes 39% less energy than the baseline core. The leader core, mean-
while, is executing approximately the same number of total instructions as the baseline core.
However, the number of wrong-path instructions increases slightly because the longer relative
memory access time leads to slower branch resolution. Additionally, the lower IPC of the leader
reduces its energy efficiency, as parts of the chip (especially the clock network) remain active
even on idle cycles. The net result is that the leader consumes 1% more energy than the baseline
core.
Given the preceding, we can easily compute the total dynamic core power for gzip under
Paceline. When the application runs on the baseline core, let $E_B$ be the dynamic energy
consumed and $T_B$ the time taken. According to the above, the energy of the checker is
$E_C = 0.61\,E_B$, and the energy of the leader is $E_L = 1.01\,E_B$. However, due to the
Paceline speedup, the leader and checker energies are dissipated over a shorter interval than the
baseline execution. The total core power under Paceline is then
$$(E_L + E_C)\,S_{1.3}/T_B = (1.01 + 0.61) \times 1.18\,P_B \approx 1.91\,P_B,$$
where $P_B = E_B/T_B$ is the power of a baseline core. In this case, the total core dynamic
power is less than that of two baseline cores executing the same application.
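The gzip arithmetic above can be checked in a few lines of Python. This is only a minimal sketch and not part of the simulation flow: the function name `paceline_core_power` is hypothetical, and energies are expressed relative to the baseline energy so that the result comes out in units of the baseline core power.

```python
def paceline_core_power(e_checker, e_leader, speedup):
    """Return total Paceline core power in units of P_B = E_B / T_B.

    e_checker and e_leader are energies relative to the baseline energy E_B.
    Both cores dissipate that energy over T_B / speedup, so the power is the
    summed relative energy multiplied by the speedup.
    """
    return (e_leader + e_checker) * speedup


# gzip at an overclocking factor of 1.3 (values quoted in the text).
gzip_power = paceline_core_power(e_checker=0.61, e_leader=1.01, speedup=1.18)
two_baseline_cores = 2.0  # two unpaired cores, each drawing P_B

print(f"Paceline pair: {gzip_power:.2f} P_B, unpaired pair: {two_baseline_cores:.2f} P_B")
```

The same function can be reused for other applications once their relative checker and leader energies and their speedup are known.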
5.2.3 Power Density
Although total dynamic power may not increase under Paceline, the leading core power clearly
does. This thermal imbalance is a concern because it could worsen hot spots that erode over-
clockability and reduce device lifetime. Paceline avoids this problem through activity migra-
tion [26, 52], periodically switching the core on which the hot thread runs. In Paceline, core
swapping allows both the leader and checker core temperatures to roughly equal the temperature
of a baseline core. This is because each core is dissipating approximately the same dynamic
power as a baseline core on average, and the package’s thermal RC network has an averaging ef-
fect on temperature. Since static power is dependent only on temperature, chip static power also
does not change after applying Paceline.
Swapping the leader and checker is trivial in Paceline, and it has negligible effect on perfor-
mance even if swaps are performed roughly once per million instructions. To see why, consider
the simplest swapping mechanism: A swap starts with both cores rolling back to the most recent
checkpoint as though an error had just been detected. Then, a mode bit change in the VQ and in
the BQ effectively switches the identities of the two cores. Finally, switching the leader and checker
clock frequencies completes the swap. In this simple scheme, both cores begin with an empty L1
after the swap, but they repopulate it quickly from the shared L2 [48, 52].
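A rough calculation suggests why swapping has so little effect on performance. The sketch below uses the 250µs migration period and the 3.8 GHz clock from Table 5.1. The per-swap cost of about 1K cycles is an assumption, borrowed from the L1 repopulation cost reported for error recovery in Section 5.2.4, since a swap begins with the same rollback. All constant names are illustrative.

```python
CLOCK_HZ = 3.8e9          # nominal core frequency (Table 5.1)
SWAP_PERIOD_S = 250e-6    # migration policy: swap every 250 microseconds (Table 5.1)
SWAP_COST_CYCLES = 1_000  # ASSUMPTION: a swap costs about as much as one recovery

# Number of cycles that elapse between two consecutive swaps.
cycles_between_swaps = CLOCK_HZ * SWAP_PERIOD_S

# Fraction of execution time spent swapping under the assumed cost.
swap_overhead = SWAP_COST_CYCLES / cycles_between_swaps

print(f"cycles between swaps: {cycles_between_swaps:,.0f}")
print(f"time lost to swapping: {swap_overhead:.2%}")
```

Under these assumptions a swap occurs roughly once per million cycles, and the time lost stays near a tenth of a percent.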
5.2.4 Sensitivity to Errors
Until now, we have assumed a zero error rate, which models the effect of overclocking up to the
safe zero-guardband frequency f0. To see what happens as frequency passes f0, we adopt the
timing error model from [57] using its default parameters, which predict a gradual error onset.¹
To model the rate of timing errors, we assume that the onset of timing errors occurs at an aggressive
overclocking factor of 1.3, where we set the error rate to one per $10^{12}$ cycles. This represents
a far-future process where guardbands are large, providing an ideal environment for Paceline.
The data in [57] predicts that a 9% frequency increase pushes the error rate from one per $10^{12}$
to one per $10^{4}$ instructions. It also shows that the increase is approximately exponential.
Consequently, using this information, we set the error rate at an overclocking factor of 1.3 × 1.09 ≈ 1.42
to be one per $10^{4}$ cycles, and use a straight line in a logarithmic axis to join the $10^{-12}$ and $10^{-4}$
error rate points. The resulting error rate line is shown in Figure 5.4.
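The straight line on a logarithmic axis corresponds to a log-linear interpolation between the two anchor points. The sketch below is illustrative only: the function name `error_rate` and its parameter names are not from the original flow, and outside the two anchors it simply extrapolates the same line.

```python
import math


def error_rate(oc, oc_lo=1.3, rate_lo=1e-12, oc_hi=1.3 * 1.09, rate_hi=1e-4):
    """Errors per cycle at overclocking factor `oc`.

    The rate is log-linear in `oc`: it passes through (oc_lo, rate_lo) and
    (oc_hi, rate_hi) and is a straight line when plotted on a log axis.
    """
    decades_per_oc = (math.log10(rate_hi) - math.log10(rate_lo)) / (oc_hi - oc_lo)
    return 10 ** (math.log10(rate_lo) + decades_per_oc * (oc - oc_lo))


for oc in (1.30, 1.34, 1.38, 1.42):
    print(f"oc = {oc:.2f}: {error_rate(oc):.1e} errors/cycle")
```

With these anchors the rate climbs by eight orders of magnitude over a frequency increase of about 9%.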
¹ The error onset predicted by the default VARIUS parameters is significantly more gradual than that observed in our experiments with OpenSPARC in the following section, but it serves to illustrate the concepts clearly.
[Figure 5.4: Error rate and geometric mean of the Paceline speedup in SPECint applications as the overclocking factor changes. x axis: Overclocking Factor, 1.3 to 1.44; left y axis: Speedup, 1 to 1.3; right y axis: Errors / cycle, log scale from 1e-12 to 1; curves: Error rate, 100 cycle recovery, 1K cycle recovery, 10K cycle recovery.]
To estimate the impact of these errors on Paceline speedups, we test three different recovery penalties. These penalties are modeled as a number of cycles that we force the checker to
stall after every timing error. The recovery penalties modeled are 100, 1K, or 10K cycles. In ac-
tuality, simulations show that the Paceline recovery time is dominated by the time required to
re-populate the L1 caches following the flush at the start of recovery. For the simulated microar-
chitecture, this overhead averages ≈ 1K cycles amortized over the post-recovery execution.
Figure 5.4 shows the resulting Paceline speedups over the baseline processor for each of the
three penalties. The speedups correspond to the geometric mean of the SPECint applications.
From the figure, we see that at low overclocking factors, the speedups are roughly proportional to
the factors. However, the exponential increase in error rates causes the linear region to give way
to a “cliff” where speedup plummets if the overclocking factor is pushed even 1–2% past the op-
timal point. We need, therefore, to be careful not to increase overclocking past the optimal point.
Later, we will show that the controller designed for LeadOut (Section 4.3) solves this problem
adequately.
The figure also demonstrates that the peak Paceline speedup is only weakly dependent on the
recovery penalty; the topmost point of the curves is similar across different recovery penalties.
Consequently, we conclude that — at least for the error rates targeted here — optimizing recov-
ery time need not be a design priority.
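A toy model reproduces both the cliff and the weak dependence of the peak on the recovery penalty. It is a sketch under stated assumptions, not the simulated result. The error rate follows the log-linear line of Figure 5.4. The error-free speedup is assumed to grow linearly with the overclocking factor, with a hypothetical slope `gain`. Each error is assumed to cost a fixed number of stall cycles.

```python
import math


def error_rate(oc):
    """Errors per cycle: log-linear from 1e-12 at oc = 1.3 to 1e-4 at oc = 1.3 * 1.09."""
    decades_per_oc = 8.0 / (1.3 * 1.09 - 1.3)
    return 10 ** (-12.0 + decades_per_oc * (oc - 1.3))


def speedup(oc, penalty_cycles, gain=0.6):
    """Speedup including recovery stalls.

    ASSUMPTION: the error-free speedup is 1 + gain * (oc - 1); `gain` is a
    hypothetical slope, not a measured value.
    """
    error_free = 1.0 + gain * (oc - 1.0)
    return error_free / (1.0 + error_rate(oc) * penalty_cycles)


def best_operating_point(penalty_cycles):
    """Grid-search the overclocking factor in [1.30, 1.44] that maximizes speedup."""
    grid = [1.30 + 0.001 * i for i in range(141)]
    best_oc = max(grid, key=lambda oc: speedup(oc, penalty_cycles))
    return best_oc, speedup(best_oc, penalty_cycles)


for penalty in (100, 1_000, 10_000):
    oc, s = best_operating_point(penalty)
    print(f"penalty {penalty:>6} cycles: best oc = {oc:.3f}, peak speedup = {s:.3f}")
```

Because the error rate grows by orders of magnitude over a few percent of frequency, a 100x larger penalty moves the optimal overclocking factor only slightly, and the peak speedup moves with it by a small amount.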
Chapter 6
BlueShift Evaluation
The preceding chapter demonstrated that Paceline is capable of generating large speedups if the
leader overclocking factor is large. The purpose of BlueShift is to ensure leader overclockability
even when the process guardbands are small and when a traditional design exhibits a critical path
wall that is antagonistic to TS. This chapter demonstrates significant performance enhancement
from applying BlueShift to just such a design.
The BlueShift PCT and OSB techniques are both applicable to a variety of TS microarchitec-
tures. However, to focus our evaluation, we mate each technique with a single TS microarchitec-
ture that, according to Section 3.2.3, emphasizes its strengths. Specifically, an Always-on checker
is ideal for PCT because it lacks a non-speculative mode of operation where PCT’s longer worst-
case paths would force a reduction in frequency. Conversely, an On-demand microarchitecture
is suited to OSB because it does have a non-speculative mode where worst-case delay must re-
main short. The PCT design, where TS is on all the time, adopts an extremely aggressive per-
formance target, while the OSB one targets a more power-efficient, configurable environment.
Overall, we evaluate a high-performance Always-on Stage microarchitecture (Razor [20]) for
PCT and a power-efficient On-demand Retirement one (Paceline) for OSB. We call the resulting
BlueShift-optimized designs Razor+PCT and Paceline+OSB respectively. These are compared
with the Razor Base and Paceline Base designs, which are microarchitecturally identical but use
a traditional design methodology.
6.1 Experimental Setup
The core microarchitecture parameters are the same as in the preceding chapter (Table 5.1) for
both Razor+PCT and Paceline+OSB. Table 6.1 shows the timing speculation support assumed
for the two example architectures. The target error rates for Paceline and Razor have been set so
that the processor will spend at most 1% of its time in error recovery.
Paceline+OSB Parameters
  Design evaluated     Simple (Section 2.3.3)
  VQ write queue       64 entries, 2 ports per core, 8 cycles
  Ckpt interval        100 instructions
  Ckpt restore cost    100 cycles
  VQ ckpt queue        5 entries
  Total target PE      $10^{-5}$ err/cyc
  Migration policy     Swap leader and checker every 250µs
Razor+PCT Parameters
  Error recovery cost  5 cycles
  Total target PE      $10^{-3}$ err/cyc
Table 6.1: BlueShift TS microarchitecture parameters.
6.1.1 Modeling Overview
To accurately model the performance and power consumption of a gate-level BlueShifted proces-
sor, we partition the modeling task into two loosely-coupled levels. The lower level comprises
the BlueShift circuit implementation, while the higher level consists of microarchitecture-level
power and performance estimation. At the circuit-modeling level, we sample modules from the
OpenSPARC T1 processor [69]. We apply BlueShift to these modules and use them to compute
PE and power estimates before and after BlueShift. At the microarchitecture level, we want to
model a more sophisticated core than the OpenSPARC. To this end, we simulate the out-of-order
core of Table 5.1 in the SESC timing simulator.
The main modeling challenge lies in incorporating the circuit-level PE and power estimates
into the microarchitectural simulation. Our approach is to assume that the modules from the
OpenSPARC are representative of those in any other high-performance processor. In other words,
we assume that BlueShift would induce roughly the same PE and power characteristics on the
out-of-order microarchitecture that we simulate as it does on the in-order processor that we can
measure directly. With this assumption, we annotate the microarchitectural power models with
the power parameters derived from the low-level circuit analysis to accurately capture the power
impact of BlueShift. We also combine the module-level PEs to estimate the frequency at which
the whole pipeline would meet the target error rate and perform microarchitecture simulation at
that frequency.
The following subsections first describe the technology model (needed to accurately capture
leakage power). Next, they show how to generate the BlueShifted circuits. Finally, they detail
how PE and power estimates are extracted from these circuits and used to annotate the microar-
chitectural simulation.
6.1.2 Technology Model
We model a 32nm technology that is an incremental improvement on Intel’s current high-performance
45nm process [9]. ITRS [1] projects that neither threshold nor supply voltage is scaling in the
near term, so we use most values directly from [9] in our model, as shown in Table 6.2. The constant
of proportionality for leakage in Equation 4.3 is set so that the die expends 25W of leakage
power under uniformly nominal conditions (Vdd = 1V, T = 85°C, and no process variation).
Finally, because BlueShift is intended to inject overclockability into designs that otherwise
are not amenable to TS, the evaluation assumes a core timing guardband of only 10% using Fine
Grain Binning (Section 2.2.4). The guardband is applied during manufacturing test by measuring
the critical path delay 1/f0 of each core at T0. This delay is then multiplied by the
guardband factor (1.1) to obtain the rated clock period 1/fr.
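The guardband step can be written out as a short calculation. This is a minimal sketch: the function name `rated_frequency` is illustrative, and the measured zero-guardband frequency used in the example is a hypothetical value chosen so that the rated frequency matches the 3.8 GHz of Table 5.1.

```python
def rated_frequency(f0_hz, guardband=0.10):
    """Return the rated frequency f_r given the zero-guardband frequency f_0.

    The measured critical path delay 1/f_0 is stretched by the guardband
    to obtain the rated clock period 1/f_r.
    """
    critical_path_delay = 1.0 / f0_hz
    rated_period = critical_path_delay * (1.0 + guardband)
    return 1.0 / rated_period


f0 = 4.18e9  # HYPOTHETICAL zero-guardband frequency measured for one core
fr = rated_frequency(f0)
print(f"f_r = {fr / 1e9:.2f} GHz, error-free headroom f_0 / f_r = {f0 / fr:.2f}")
```

With a 10% guardband, a core can therefore be overclocked by a factor of 1.1 relative to its rated frequency before any timing error is expected.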
Tech node   32nm            KDIBL   −150mV/V
Vdd         1V (nom.)       Ktemp   −1.5mV/K
Vt0         250mV (nom.)    n       2
T0          85°C            α       1.3
Tmax        100°C
Table 6.2: Technology model.
6.1.3 BlueShifted Module Implementation
Using Synopsys Design Compiler 2007.03 and Cadence Encounter 6.2, we perform full physical
(placed and routed) implementations of sample modules from the OpenSPARC T1 core in a real
130nm commercial standard cell process. We modified the cell library by adding low-Vt gates
that have a 10x higher leakage and a 20% lower delay than normal gates [61, 74]. To make the
130nm results more representative of the near-future 32nm technology that we target (Table 6.2),
we scale the cell power and delay to match.¹
For each module, we perform four different implementations — one for each design (Pace-
line Base, Paceline+OSB, Razor Base, and Razor+PCT). Because Paceline Base requires low
power consumption in its non-TS mode, we do not allow it to use low-Vt devices, but otherwise,
it uses the most aggressive per-module timing targets that a traditional design flow can meet.
Paceline+OSB starts from the Paceline Base design, converting some of its gates to receive FBB.
When FBB is enabled on a gate, that gate performs identically to a low-Vt gate. Razor Base tar-
gets the most aggressive possible per-module timing target (at a power cost) by allowing the tra-
ditional design flow to freely assign low-Vt gates. Razor+PCT also allows free assignment of
low-Vt, but uses BlueShift PCT to set the timing constraints. Additionally, in the Razor experiments,
we add hold-time delay constraints to the paths² to accommodate shadow latches. Moreover,
shadow latches are inserted wherever worst-case delays exceed the speculative clock period.
Table 6.3 lists the BlueShift parameters. Each profiling phase (Step 2 of Figure 3.3) comprises a
parallel run of 200 (or 400 for OSB) benchmark samples, each one running for 25K cycles.
¹ Admittedly, dealing with two technologies invites ambiguity. When following sections report power or delay numbers, they specify which technology (130nm or 32nm) that measurement assumes.
² For some modules, the commercial design tools that we use are unable to meet the minimum path delay constraints, but we make a best effort to honor them.

# Benchmarks run per iteration                     200 (PCT), 400 (OSB)
# Cycles per benchmark                             25K
r: PCT relaxation factor                           1.5
k: Fraction of all the dynamic overshooting that   0.01
   remains un-addressed after each OSB iteration
   of Figure 3.3
Table 6.3: BlueShift parameters.
For all implementations, we use the unmodified RTL sources from OpenSPARC, but we sim-
plify the physical design by modeling the register file and the 64-bit adder as black boxes. In a
real implementation, these components would be designed in full-custom logic. We use timing
information supplied with the OpenSPARC to build a detailed 900MHz black box timing model
(in 130nm) for the register file; then, we use CACTI [72] to obtain an area estimate and build a
realistic physical footprint. The 64-bit adder is modeled on [82], and has a worst-case delay of
500ps in 130nm.
The sample modules are taken from throughout the pipeline and are shown in Table 6.4. Taken
together, these modules provide a representative profile of the various pipeline stages. For each
module, the Stage column of Table 6.4 shows where in the pipeline (Fetch/Decode, EXEcute, or
MEMory) the module resides. The next two columns show the size in number of standard cells
and the shortest worst-case delay Tr achievable under a traditional CAD flow without using any
low-Vt cells (the Paceline Base design) in the 130nm process.
Module         Stage  Num.    Tr    Target PE (Errors/Cycle)  Description
Name                  Cells   (ns)  PCT        OSB
sparc_exu      EXE    21,896  1.50  $10^{-4}$  $10^{-6}$      Integer FUs, control, bypass
lsu_stb_ctl    MEM       765  1.11  $10^{-5}$  $10^{-7}$      Store buffer control
lsu_qctl1      MEM     2,336  1.50  $10^{-5}$  $10^{-7}$      Load/Store queue control
lsu_dctl       MEM     3,682  1.00  $10^{-5}$  $10^{-7}$      L1 D-cache control
sparc_ifu_dec  F/D       727  0.75  $10^{-5}$  $10^{-7}$      Instruction decoder
sparc_ifu_fdp  F/D     7,434  0.94  $10^{-5}$  $10^{-7}$      Fetch datapath and PC maintenance
sparc_ifu_fcl  F/D     2,299  0.96  $10^{-5}$  $10^{-7}$      L1 I-cache and PC control
Table 6.4: OpenSPARC modules used to evaluate BlueShift.
The next two columns in Table 6.4 show the per-module error rate targets under PCT and
OSB. This is the PE that BlueShift will try to ensure for each module. We obtain these numbers
by apportioning a “fair share” of the total processor PE to each module, roughly according to
its size. With these PE targets, when the full pipeline is assembled (including modules not in the
sample set), the total processor PE will be roughly $10^{-3}$ errors/cycle for PCT and $10^{-5}$ for OSB.
These were the target total PE numbers in Table 6.1. They are appropriate for the average recovery
overhead of the corresponding architectures: 5 cycles for Razor (Table 6.1) and about 1,000
cycles for Paceline (which include 100 cycles spent in checkpoint restoration as per Table 6.1).
Indeed, with these values of PE and recovery overhead, the total performance lost in recovery is
1% or less.
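The 1% bound follows directly from multiplying each target error rate by its recovery cost. The sketch below assumes that recoveries never overlap and that every error costs the average recovery overhead; the function and dictionary names are illustrative.

```python
def recovery_time_fraction(errors_per_cycle, recovery_cycles):
    """Fraction of all cycles spent in recovery.

    ASSUMPTION: recoveries do not overlap, so each useful cycle is followed
    on average by errors_per_cycle * recovery_cycles stall cycles.
    """
    lost_per_useful_cycle = errors_per_cycle * recovery_cycles
    return lost_per_useful_cycle / (1.0 + lost_per_useful_cycle)


# (target whole-pipeline PE in errors/cycle, average recovery overhead in cycles)
designs = {
    "Razor+PCT": (1e-3, 5),
    "Paceline+OSB": (1e-5, 1_000),
}

fractions = {name: recovery_time_fraction(pe, cost) for name, (pe, cost) in designs.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2%} of time in recovery")
```

Razor tolerates a 100x higher error rate than Paceline because its recovery is 200x cheaper.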
The largest and most complex module is sparc_exu. It contains the integer register file, the integer
arithmetic and logic datapaths along with the address generation, bypass, and control logic.
It also performs other control duties including exception detection, save/restore control for the
SPARC register windows, and error detection and correction using ECC. This module alone is
larger than many lightweight embedded processor cores.
Although we find that BlueShift is widely applicable to logic modules, it is not effective on
array structures where all paths are exercised with approximately equal frequency. As a result,
we classify caches, register files, branch predictor, TLBs, and other memory blocks in the pro-
cessor as Non-BlueShiftable. We assume that these modules attain performance scaling without
timing errors through some other method (e.g. increased supply voltage) and account for the at-
tendant power overhead.
6.1.4 Module-Level PE and Power
For each benchmark, we use Simics [44] to fast-forward execution over 1B cycles, then check-
point the state and transfer the checkpoint to the gate-level simulator. To perform the transfer,
we use the CMU Transplant tool [66]. This enables us to execute many small, randomly-selected
benchmark samples in gate-level detail. Further, only the modules from Table 6.4 need to be sim-
ulated at the gate level. Functional, RTL-only simulation suffices for the remaining modules of
the processor.
                Paceline Base   Paceline+OSB    Razor Base      Razor+PCT
Module          Psta    Edyn    Psta    Edyn    Psta    Edyn    Psta    Edyn
                (mW)    (pJ)    (mW)    (pJ)    (mW)    (pJ)    (mW)    (pJ)
sparc_exu        68.5   207.8    75.1   207.8   175.1   217.2   130.3   257.8
lsu_stb_ctl       2.1     5.6     2.1     5.6     4.3     6.0     3.9     9.5
lsu_qctl1         4.7    18.8     4.7    18.8    12.4    18.8    15.4    35.9
lsu_dctl          8.8    33.3     9.2    33.3    20.7    35.1    21.3    54.5
sparc_ifu_dec     2.1     1.4     3.3     1.4     5.9     3.9     5.1     5.3
sparc_ifu_fdp    21.7   117.6    23.8   117.6    36.1   119.6    30.0   146.6
sparc_ifu_fcl     5.8    15.7     6.4    15.7    16.5    15.7    10.6    24.3
Total           113.7   400.3   124.7   400.3   271.0   416.1   216.6   533.7
Table 6.5: Static power consumption (Psta) and switching energy per cycle (Edyn) for each module
implementation in 130nm.
The experiments use SPECint2006 applications as the Training set in the BlueShift flow
(Steps 1–5 of Figure 3.3). After BlueShift terminates, we measure the error rate for each mod-
ule using SPECint2000 applications as the Evaluation set. From the latter measurements, we
construct a PE versus f curve for each SPECint2000 application on each module. All PE mea-
surements are recorded in terms of the fraction of cycles on which at least one flip-flop or register
receives the wrong value. This is an accurate strategy for the Razor-based evaluation, but because
it ignores architectural and microarchitectural masking across stages, it is highly pessimistic for
Paceline.
Circuit-level power estimation for the sample modules is done using Cadence Encounter. We
perform detailed capacitance extraction and then use the tool’s default leakage and switching
analysis. Table 6.5 shows the estimated switching energy per clock (Edyn) and static power con-
sumption (Psta) for each module under each implementation in the 130nm ASIC technology.
6.1.5 Microarchitecture-Level PE and Power
We compute the performance and power consumption of the Paceline- and Razor-based microar-
chitectures using the SESC [2] simulator augmented with Wattch [13], HotLeakage [79], and
HotSpot [63] power and temperature models. For evaluation, we use the SPECint2000 applications,
which were also used to evaluate the per-module PE in the preceding subsection. The simulator
needs only a few key parameters derived from the low-level circuit analysis to accurately
capture the PE and power impact of BlueShift.
To estimate the PE for the entire pipeline, we first sum up the PE from all of the sampled
modules of Table 6.4. Then, we take the resulting PE and scale it so that it also includes the es-
timated contribution of all the other BlueShiftable components in the pipeline. We assume that
the PE of each of these modules is roughly proportional to the size of the module. Note that
by adding up the contributions of all the modules, we are assuming that the pipeline is a series-
failure system with independent failures, and that there is no error masking across modules. The
result is a whole-pipeline PE versus frequency curve for each application. We use this curve to
initiate error recoveries at the appropriate rate in the microarchitectural simulator.
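The aggregation can be sketched as a sum followed by a size-proportional scale-up. The example below is illustrative only. It uses the OSB per-module targets of Table 6.4 as stand-ins for measured PE values at one frequency, and the total BlueShiftable cell count is a hypothetical number, since the text does not give it. The `series_failure` helper shows that, for such small probabilities, the plain sum is a close approximation of an independent series-failure system.

```python
import math

# Cell counts from Table 6.4.
SAMPLED_CELLS = {
    "sparc_exu": 21_896, "lsu_stb_ctl": 765, "lsu_qctl1": 2_336, "lsu_dctl": 3_682,
    "sparc_ifu_dec": 727, "sparc_ifu_fdp": 7_434, "sparc_ifu_fcl": 2_299,
}
# Stand-in PE values (errors/cycle): the OSB targets from Table 6.4.
SAMPLED_PE = {name: 1e-7 for name in SAMPLED_CELLS}
SAMPLED_PE["sparc_exu"] = 1e-6

TOTAL_BLUESHIFTABLE_CELLS = 250_000  # HYPOTHETICAL size of all BlueShiftable logic


def series_failure(pes):
    """Exact failure probability of independent modules in series."""
    return 1.0 - math.prod(1.0 - p for p in pes)


def pipeline_pe(sampled_pe, sampled_cells, total_cells):
    """Sum the sampled PEs, then scale by size to cover the unsampled modules."""
    sampled_total = sum(sampled_pe.values())
    scale = total_cells / sum(sampled_cells.values())
    return sampled_total * scale


whole_pipeline = pipeline_pe(SAMPLED_PE, SAMPLED_CELLS, TOTAL_BLUESHIFTABLE_CELLS)
print(f"sampled modules: {sum(SAMPLED_PE.values()):.2e} errors/cycle")
print(f"series-failure check: {series_failure(SAMPLED_PE.values()):.2e} errors/cycle")
print(f"whole pipeline: {whole_pipeline:.2e} errors/cycle")
```

In the actual flow this computation would be repeated at each frequency point and for each application to trace out the whole-pipeline curve.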
For power estimation, we start with the dynamic power estimations from Wattch for the simu-
lated pipeline. We then scale up these Raw power numbers to take into account the higher power
consumption induced by the BlueShift optimization. The scale factor is different for the BlueShiftable
and the Non-BlueShiftable components of the pipeline. Specifically, we first measure the dy-
namic power consumed in all of the sampled OpenSPARC modules as given by Cadence En-
counter (Table 6.5). The ratio of the power after BlueShift to the power before BlueShift is the
factor that we use to scale up the Raw power numbers in the BlueShiftable components. For the
Non-BlueShiftable components, we first compute the increase in supply voltage that is necessary
for them to keep up with the frequency of the rest of the pipeline, and then scale their Raw power
numbers accordingly.
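The two scale factors can be illustrated with the totals of Table 6.5. This is a sketch, not the annotated Wattch model. The BlueShiftable factor is taken as the ratio of total switching energy per cycle after and before BlueShift. For the Non-BlueShiftable components, the classic C·V²·f dynamic power relation is an assumption made here for illustration, and the voltage and frequency ratios in the example are hypothetical.

```python
# Total switching energy per cycle (pJ, 130nm) from Table 6.5.
TOTAL_EDYN_PJ = {
    "Paceline Base": 400.3,
    "Paceline+OSB": 400.3,
    "Razor Base": 416.1,
    "Razor+PCT": 533.7,
}


def blueshiftable_dynamic_scale(base, blueshifted):
    """Ratio of switching energy per cycle after BlueShift to that before."""
    return TOTAL_EDYN_PJ[blueshifted] / TOTAL_EDYN_PJ[base]


def non_blueshiftable_dynamic_scale(vdd_ratio, freq_ratio):
    """ASSUMPTION: dynamic power follows C * V^2 * f."""
    return vdd_ratio ** 2 * freq_ratio


osb_scale = blueshiftable_dynamic_scale("Paceline Base", "Paceline+OSB")
pct_scale = blueshiftable_dynamic_scale("Razor Base", "Razor+PCT")
array_scale = non_blueshiftable_dynamic_scale(vdd_ratio=1.05, freq_ratio=1.14)  # hypothetical

print(f"OSB: {osb_scale:.2f}x, PCT: {pct_scale:.2f}x, arrays (example): {array_scale:.2f}x")
```

The per-cycle energy ratio isolates the effect of the implementation change; the higher clock frequency then multiplies the resulting power separately.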
For the static power, we use a similar approach based on HotLeakage and Cadence Encounter.
However, we modify the HotLeakage model to account for the differing numbers of low-Vt gates
in each environment of our experiments. As a thermal environment, microarchitectural power
simulations assume the 16-core floorplan of Figure 2.2, where each core is based on the default
EV6 floorplan from HotSpot [63]. We optimistically assume that only one thread is running on
the processor, yielding thermal headroom that allows per-thread performance to reach the highest
possible levels. Heatsink thermal resistance Rth is 0.33K/W, and maximum temperature constraints
are enforced.
6.2 Results
6.2.1 Error Curve Transformations
We begin by examining BlueShift’s effect on the whole-pipeline PE(f) curve. Figures 6.1(a)-(d)
show the curves for each of the four implementations. These curves do not include the effect of
the non-BlueShiftable modules. In each plot, the x axis shows the frequency relative to the Rated
Frequency (fr) of Paceline Base. Each plot has one curve for each SPECint2000 application. In
addition, there is a horizontal dashed line that marks the whole-pipeline target PE, namely $10^{-5}$
errors per cycle for the Paceline-based environments and $10^{-3}$ for the Razor-based ones.
[Figure 6.1 panels: (a) Paceline Base, (b) Paceline+OSB, (c) Razor Base, (d) Razor+PCT. x axis: Normalized f, 1.0 to 1.5; y axis: PE (errors / cycle), log scale from 1e-07 to 1e-01; labeled curves: vpr, twolf.]
Figure 6.1: Whole-pipeline PE(f) curves for the four implementations. The frequencies are
given relative to the Rated Frequency (fr) of Paceline Base.
Figure 6.1(a) shows the curve for Paceline Base. Here, we assume a 10% guardband, so the
maximum error-free frequency f0 is 1.1 in the plot. As we increase f, PE takes non-zero val-
ues past f0 (although it is invisible in the figure) and quickly reaches high values. This is due
to the critical path wall of conventional designs. Consequently, using TS on a non-BlueShift
OpenSPARC design can only manage frequencies barely above f0 before PE becomes prohibitive.
Figure 6.1(b) shows the curve for Paceline+OSB. This plot follows the Targeted Accelera-
tion shape of Figure 3.1. Specifically, PE starts taking non-zero values past f0 like in Paceline
Base. A static timing analysis shows that the worst-case delays are unchanged from Paceline
Base. However, the rise in PE is delayed until higher frequencies. Indeed, PE remains negligi-
ble until a relative frequency of 1.27, compared to about 1.11 in Paceline Base. This shows that
the application of BlueShift with OSB enables an increase in processor frequency of 14%.
Figure 6.1(b) also shows a dashed vertical line. This was OSB’s frequency target, namely a
20% increase over Paceline Base (Section 6.1.3), or 1.2 × 1.1 = 1.32. However, we see that
OSB did not meet its target. This is because the Training application set (which was used in the
optimization algorithm of Figure 3.3) failed to capture some key behavior of the Evaluation set
(which was used to generate the PE curves). Consequently, Paceline+OSB, like other TS mi-
croarchitectures, will rely on its control mechanism to operate at the frequency that maximizes
performance (1.27 in this case) rather than at its target. While higher frequencies may be possible
with more comprehensive training, the obtained 14% frequency increase is substantial.
Figure 6.1(c) shows the curve for Razor Base. Since this is not a BlueShifted design, it ex-
hibits a rapid PE increase as in Paceline Base. The difference here is that, because it targets a
high-performance (and power) design point, it attains a higher frequency than Paceline Base.
Finally, Figure 6.1(d) shows the curve for Razor+PCT. The plot follows the Delay Trading
shape of Figure 3.1. Specifically, PE starts taking non-zero values at lower frequencies than
even Paceline Base, but the curve rises more gradually than in any of the other designs. The
dashed vertical line shows the target frequency, which was 30% higher than Paceline Base
(Section 6.1.3), or 1.3 × 1.1 = 1.43. We can see that most applications reach this frequency at the
whole-pipeline target PE. Compared to the frequency of 1.28 attained by Razor Base, this means
that BlueShift with PCT enables an increase in processor frequency of 12%. The two exceptions
are the twolf and vpr applications, which fail to meet the target PE due to discrepancies between
the Training and Evaluation application sets. For these applications, the Razor+PCT architecture
will adapt to run at a lower frequency, so as to maximize performance.
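The frequency targets and gains quoted in this subsection can be re-derived from the normalized frequencies read off Figure 6.1. The sketch below only repeats that arithmetic; the names are illustrative.

```python
def percent_gain(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


F0 = 1.1  # zero-guardband frequency of Paceline Base, normalized to its rated frequency

osb_target = 1.2 * F0  # OSB aims 20% above Paceline Base
pct_target = 1.3 * F0  # PCT aims 30% above Paceline Base

osb_gain = percent_gain(1.27, 1.11)  # Paceline+OSB versus Paceline Base
pct_gain = percent_gain(1.43, 1.28)  # Razor+PCT versus Razor Base

print(f"OSB target {osb_target:.2f}, achieved gain {osb_gain:.0f}%")
print(f"PCT target {pct_target:.2f}, achieved gain {pct_gain:.0f}%")
```

Both gains are measured against the frequency at which the corresponding base design reaches its error-rate limit, not against the rated frequency.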
6.2.2 Paceline+OSB Performance and Power
We compare three Paceline-based architectures. First, Unpaired uses the Paceline Base module
implementation and one core runs at the Rated Frequency while the other is idle. Secondly, Pace-
line Base uses the Paceline Base module implementation and the cores run paired under Pace-
line. Finally, Paceline+OSB uses the Paceline+OSB module implementation and the cores run
paired under Paceline. For each application, Paceline Base runs at the frequency that maximizes
performance. For the same application, Paceline+OSB runs at the frequency that maximizes per-
formance considering only the PE curves of the BlueShiftable components; then, we apply tradi-
tional voltage scaling to the non-BlueShiftable components so that they can catch up — always
subject to temperature constraints.
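To make "the frequency that maximizes performance" concrete, the sketch below sweeps a frequency range and picks the best point under a deliberately simple model. Everything in it is an assumption for illustration: the exponential shape of the hypothetical `error_rate` curve, the onset frequency, and the fixed `recovery_cycles` penalty are not taken from the measured PE curves of Figure 6.1 or from the Paceline simulator.

```python
def error_rate(f: float, f_onset: float = 1.10,
               decades_per_unit_f: float = 60.0, floor: float = 1e-9) -> float:
    """Hypothetical PE(f) in errors per instruction.

    Zero below the onset frequency, then exponential growth, capped at 1.
    """
    if f < f_onset:
        return 0.0
    return min(1.0, floor * 10 ** (decades_per_unit_f * (f - f_onset)))


def performance(f: float, cpi: float = 1.0, recovery_cycles: float = 100.0) -> float:
    """Relative instruction throughput when every error costs `recovery_cycles`."""
    return f / (cpi + error_rate(f) * recovery_cycles)


def best_frequency(f_min: float = 1.0, f_max: float = 1.5, step: float = 0.001) -> float:
    """Exhaustive sweep over [f_min, f_max]; returns the throughput-maximizing f."""
    n = int(round((f_max - f_min) / step))
    candidates = [f_min + i * step for i in range(n + 1)]
    return max(candidates, key=performance)


best = best_frequency()
print(f"best relative frequency: {best:.3f}")
print(f"error rate there: {error_rate(best):.2e} errors/instruction")
```

Because the assumed error rate rises by orders of magnitude over a narrow frequency band, the optimum sits only slightly above the point where errors first appear, which is the qualitative behavior the text describes for the non-BlueShifted designs.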
Figure 6.2(a) shows the speedup of the Paceline Base and Paceline+OSB architectures over
Unpaired for the different applications. We see that Paceline+OSB delivers a performance that is,
on average, 8% higher than that of Paceline Base. Therefore, the impact of BlueShift with OSB
is significant. The figure also shows that, on average, Paceline+OSB improves the performance
by 21% over Unpaired. Finally, given that all applications cycle at approximately the same fre-
quency for the same architecture (Figures 6.1(a) and 6.1(b)), the difference in performance across
applications is largely a function of how well individual applications work under Paceline. For
example, applications with highly-predictable branches such as vortex cause the checker to be a
bottleneck and, therefore, the speedups in Figure 6.2(a) are small.
Figure 6.2(b) shows the power consumed by the processor and L1 caches in Paceline Base,
Paceline+OSB, and two instances of Unpaired. The power is broken down into power consumed
by the checker core (which is never BlueShifted), non-BlueShiftable modules in the leader, BlueShiftable
modules in the leader, and extra Paceline structures (checkpointing, VQ, and BQ). On average,
the power consumed by Paceline+OSB is 14% higher than that of Paceline Base. Consequently,
BlueShift with OSB does add to the power consumption, but delivers a significant performance
gain.
[Figure 6.2 omitted: bar charts per application (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf,
vortex, vpr, and the mean). Panel (a): Speedup (% over Unpaired) for Paceline Base and
Paceline+OSB. Panel (b): Power (W) for 2x Unpaired, Paceline Base, and Paceline+OSB, broken
down into Checker, Leader NonBS, Leader BS, and Extra.]
Figure 6.2: Performance (a) and power consumption (b) of Paceline before (Paceline Base) and
after (Paceline+OSB) BlueShift. BS and NonBS refer to BlueShiftable and non-BlueShiftable
modules, respectively.
6.2.3 Razor+PCT Performance and Power
We now compare two Razor-based architectures. Razor Base uses the Razor Base module im-
plementation, while Razor+PCT uses the Razor+PCT one obtained by applying BlueShift with
PCT. As before, Razor+PCT runs at the frequency given by the PE curves of the BlueShiftable
components; then, we apply traditional voltage scaling to the non-BlueShiftable components so
that they can catch up.
Figure 6.3(a) shows the speedup of the Razor Base and Razor+PCT architectures over the
Unpaired one used as a baseline in Figure 6.2(a). Since these Razor-based designs target high
performance, they deliver higher speedups. We see that, on average, Razor+PCT’s performance
is 6% higher than that of Razor Base. This is the impact of BlueShift with PCT in this design —
which is not negligible considering that Razor Base was already designed for high performance.
We also see that vpr and, to a lesser extent, twolf do not perform as well as the other applications
under Razor+PCT. This is the result of the unfavorable PE curve for these applications in Fig-
ure 6.1(d).
Figure 6.3(b) shows the power consumed by the two designs. Recall that all other cores on
die are assumed idle, providing a favorable thermal environment. The power is broken down
into the contributions of the non-BlueShiftable and the BlueShiftable modules. On average, Ra-
zor+PCT consumes 28% more power than Razor Base. This is because it runs at a higher fre-
quency, uses a higher supply voltage for the non-BlueShiftable modules, and needs more shadow
latches and hold-time buffers. In general, the Razor designs consume high power because of their
high frequencies and widespread use of low-Vt devices, which consume considerable leakage
power at high temperatures. Consequently, as chip loading increases, all of the Razor designs
quickly become temperature-limited and the performance gains from BlueShift evaporate.
[Figure 6.3 omitted: bar charts per application (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf,
vortex, vpr, and the mean). Panel (a): Speedup (% over Unpaired) for Razor Base and Razor+PCT.
Panel (b): Power (W) for Razor Base and Razor+PCT, broken down into NonBS and BS.]
Figure 6.3: Performance (a) and power consumption (b) of Razor before (Razor Base) and after
(Razor+PCT) BlueShift. BS and NonBS refer to BlueShiftable and non-BlueShiftable modules,
respectively.
Given Razor+PCT’s delivered speedup and power cost, we see that BlueShift with PCT is not
compelling from an E × D² perspective. Instead, we see it as a technique to further speed up a
high-performance design (at a power cost) when conventional techniques such as voltage scaling
or body biasing do not provide further performance. Specifically, for logic (i.e., BlueShiftable)
modules, BlueShift with PCT provides an orthogonal means of improving performance when
further voltage scaling or body biasing becomes infeasible. In this case, however, for the pipeline
as a whole, non-BlueShiftable stages remain a bottleneck that must be addressed using some
other technique.
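The E × D² judgment can be checked from the averages reported in this section. The sketch below is a rough calculation, not the dissertation's own analysis: it assumes that energy is average power times delay, that delay scales inversely with the reported average speedup, and that the averaged power and performance ratios may be combined directly. The function name is hypothetical.

```python
def ed2_ratio(power_ratio: float, speedup: float) -> float:
    """Relative E x D^2 of a new design versus its baseline.

    With E = P * D, the metric E * D^2 equals P * D^3, and delay scales
    as 1 / speedup. A value below 1.0 means the new design is better.
    """
    return power_ratio / speedup ** 3


# Razor+PCT vs. Razor Base: 28% more power for 6% more performance.
razor_pct = ed2_ratio(power_ratio=1.28, speedup=1.06)

# Paceline+OSB vs. Paceline Base: 14% more power for 8% more performance.
paceline_osb = ed2_ratio(power_ratio=1.14, speedup=1.08)

print(f"Razor+PCT    E x D^2 relative to Razor Base:    {razor_pct:.3f}")
print(f"Paceline+OSB E x D^2 relative to Paceline Base: {paceline_osb:.3f}")
```

Under these assumptions the Razor+PCT ratio comes out above 1.0, consistent with the statement that PCT is not compelling on this metric, while the Paceline+OSB ratio falls below 1.0.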
6.2.4 Computational Overhead
Although most modules of Table 6.4 were fully optimized with BlueShift in one day on our
100-core cluster, the optimization of sparc_exu took about one week. Such long turnaround times
during the frantic timing closure process would be unacceptable in industry. Fortunately, the current
implementation is only a prototype, and drastic improvements in runtime are possible. Specifi-
cally, referring to Figure 3.3, a roughly equal amount of wall time is spent in physical implemen-
tation (Step 1) and profiling (Step 3). Luckily, the profiling phase is embarrassingly parallel, so
simply adding more processors can speed it up. However, the CAD tools in Step 1 are mostly se-
quential. To reduce the overall runtime, the number of BlueShift iterations in Figure 3.3 must be
reduced. This can be done, for example, by adding more constraints at each iteration. Our experi-
ments added few constraints per iteration to avoid overloading the commercial CAD tools, which
have a tendency to crash if given too many constraints.
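The argument above is essentially an Amdahl's-law bound, which the following sketch makes explicit. It assumes, for illustration only, that at 100 cores Step 1 and Step 3 each take one normalized time unit per iteration, that profiling scales perfectly with core count, and that physical implementation does not scale at all.

```python
def iteration_time(cores: int, impl_time: float = 1.0,
                   profile_time_at_ref: float = 1.0, ref_cores: int = 100) -> float:
    """Wall time of one BlueShift iteration in normalized units.

    Step 1 (physical implementation) is treated as sequential; Step 3
    (profiling) is treated as perfectly parallel.
    """
    return impl_time + profile_time_at_ref * ref_cores / cores


def total_time(iterations: int, cores: int) -> float:
    """Wall time of a whole optimization run."""
    return iterations * iteration_time(cores)


baseline = total_time(iterations=20, cores=100)
for cores in (100, 200, 1000, 10000):
    print(f"{cores:>6} cores: speedup {baseline / total_time(20, cores):.2f}x")
print(f"half the iterations at 100 cores: {baseline / total_time(10, 100):.2f}x")
```

With an equal split between the two steps, adding processors can never deliver more than a 2x speedup, whereas halving the number of iterations gives 2x immediately and composes with the extra processors.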
Chapter 7
LeadOut Evaluation
This chapter explores the benefits of extreme configurability from combining Paceline¹ with
VBoost. It first examines the impact of each of the techniques when the goal is to accelerate a
single thread on the processor in the presence of other throughput-oriented threads. In the process,
it identifies the most performance-limiting constraints (e.g., maximum temperature Tmax, per-core
maximum power Pmax, and maximum supply voltage Vddmax) under different loading conditions.
Next, it characterizes the power-performance tradeoffs available with the different techniques. It
also provides a sensitivity study (Section 7.2.3). Finally, it charts the performance of the config-
urable multicore on a workload requiring multiple high-speed threads and multiple throughput
threads.
7.1 Experimental Setup
Except where noted, all technology and microarchitecture parameters are identical to those in the
preceding BlueShift evaluation. The main difference is that this evaluation adds the ability to dy-
namically vary voltage and frequency on a per-core basis. In other words, each core has its own
voltage and frequency domain (although the L2 caches still run at nominal, fixed voltage and fre-
quency). The maximum Vdd allowed by the process is an especially critical parameter that limits
performance scaling of VBoost. In Intel’s current 45nm process, the maximum Vdd ranges from
1.26V – 1.38V depending on the processor’s current draw [33] (lower values at higher currents).
These values are likely to decrease slightly in future technologies in order to meet reliability
requirements. Therefore, for simplicity, we assume a maximum Vdd of 1.3V. To maintain signal
integrity and SRAM stability, the minimum Vdd is set at 800mV. Likewise, to ensure power
integrity, we limit the maximum per-core power consumption Pmax to 12W, which is roughly twice
a core’s average power consumption.

¹ This chapter excludes BlueShift for simplicity and clarity of evaluation.
The Paceline experiments in this chapter use the pre-BlueShift PE(f) curves measured in
the preceding chapter except that they assume a 15% process guardband (instead of 10%) by de-
fault. As before, the guardband is applied during manufacturing test assuming Fine Grain Bin-
ning (Section 2.2.4), where the critical path delay of each core is measured at Tmax. This delay is
then multiplied by the guardband to obtain the rated clock period. Because the exact guardbands
for current processes are not publicly disclosed and may vary significantly as technology scales,
Section 7.2.3 of this chapter undertakes a guardband sensitivity study.
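The rating procedure can be written as a one-line calculation. The sketch below reads "multiplied by the guardband" as multiplication by (1 + guardband), which is an interpretation rather than a quoted formula. The 0.25 ns critical path delay is a made-up example value.

```python
def rated_period(critical_path_delay_ns: float, guardband: float = 0.15) -> float:
    """Rated clock period: the delay measured at Tmax, padded by the guardband."""
    return critical_path_delay_ns * (1.0 + guardband)


def guardband_headroom(critical_path_delay_ns: float, guardband: float) -> float:
    """Ratio of the frequency at which the measured critical path just meets
    timing to the rated frequency; this equals 1 + guardband."""
    return rated_period(critical_path_delay_ns, guardband) / critical_path_delay_ns


delay_ns = 0.25  # hypothetical critical path delay measured at Tmax
for gb in (0.10, 0.15, 0.20):
    period = rated_period(delay_ns, gb)
    print(f"guardband {gb:.0%}: rated period {period:.4f} ns, "
          f"rated frequency {1.0 / period:.2f} GHz, "
          f"headroom {guardband_headroom(delay_ns, gb):.2f}x")
```

The headroom column is the frequency range that timing speculation can try to reclaim, which is why the guardband size matters so much to Paceline.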
Finally, the preceding Paceline and BlueShift evaluations ignored process variation for simplicity,
but these variations have a profound impact on delay and power consumption. Consequently,
this section adds the variation model from VARIUS [57] to the device parameters of Table 6.2².
VARIUS models within-die variation comprising systematic (spatially correlated) and
random (spatially uncorrelated) deviations in Vt0 and Leff , both of which impact the effective-
ness of the proposed techniques — e.g., by increasing leakage power for some cores and caus-
ing them to quickly reach their power or temperature limits as voltage increases. For the magni-
tude of variation, we assume the default parameters from VARIUS [57] as shown in Table 7.1,
which are representative of near-future technologies; the magnitude of the random Vt0 variation
assumed agrees with recent measurements from a 65nm technology [3]. VARIUS is also used
to adjust the measured per-core error rates from Section 6.2.1 for the current supply voltage and
temperature conditions.
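As a rough illustration of what a sample die variation map involves, the sketch below draws per-core Vt0 deviations with a spatially correlated systematic part and an independent random part, using the σ/µ values and correlation range of Table 7.1. It is not the VARIUS implementation. The spherical correlation function, the 4 × 4 core grid on a hypothetical 2 cm die, and the lumping of random variation into a single per-core value are all simplifying assumptions.

```python
import math
import random


def spherical_corr(r: float, phi: float) -> float:
    """Spherical correlation: 1 at distance 0, falling to 0 at r >= phi."""
    if r >= phi:
        return 0.0
    x = r / phi
    return 1.0 - 1.5 * x + 0.5 * x ** 3


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_vt0_map(rng, grid=4, die_cm=2.0, phi_cm=1.0,
                   sigma_sys=0.064, sigma_rand=0.064):
    """Per-core relative Vt0 deviation (fraction of the mean) for one die."""
    pitch = die_cm / grid
    pts = [((i + 0.5) * pitch, (j + 0.5) * pitch)
           for i in range(grid) for j in range(grid)]
    n = len(pts)
    corr = [[spherical_corr(math.dist(p, q), phi_cm) for q in pts] for p in pts]
    for i in range(n):
        corr[i][i] += 1e-9  # small jitter for numerical safety
    low = cholesky(corr)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Correlated systematic component: sigma_sys * (L @ z).
    systematic = [sigma_sys * sum(low[i][k] * z[k] for k in range(i + 1))
                  for i in range(n)]
    # Independent random component added per core.
    return [s + rng.gauss(0.0, sigma_rand) for s in systematic]


# Fifty sample dies, one per seed, mirroring the 50 Monte Carlo samples.
dies = [sample_vt0_map(random.Random(seed)) for seed in range(50)]
print(f"die 0, core 0 relative Vt0 deviation: {dies[0][0]:+.3f}")
```

Seeding each die separately keeps the maps reproducible, so the same "typical" die can be picked again later for a sensitivity study.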
Given this model, we perform three types of experiments. During the initial characterization
of performance and power for each technique, we incorporate the effects of variation by running
50 Monte Carlo simulations, each using a different sample die variation map, while running all
the SPECint2000 codes. Additionally, we construct a pseudo-oracle against which to compare
the results of our proposed dynamic controller. The pseudo-oracle uses Nelder-Mead optimization
[50] to generate static (i.e., not time-varying) values for the controller outputs from
Section 4.3.1. Finally, to keep the computation reasonable, when we conduct sensitivity studies in
Section 7.2.3, we use only one typical die selected from among the 50 samples created before.

² The experiments in this section use n = 1.5 for historical reasons, but this is not expected to
have a tangible effect on the outcomes.

Parameter               Value
Vt0 random σ/µ          6.4%
Vt0 systematic σ/µ      6.4%
Leff random σ/µ         3.2%
Leff systematic σ/µ     3.2%
Corr. range (φ)         1cm
Table 7.1: Variation parameters.
7.2 Results
7.2.1 Performance
The performance of both Paceline and VBoost is determined by the application’s responsiveness
to frequency increase (i.e., the degree to which the application is CPU-bound) and the amount
of frequency increase that is possible within the Pmax, Tmax, Vddmax, and PEmax constraints. We
begin with the goal of speeding up a single target thread. This speedup will be highly dependent
on the chip’s loading, which we measure in terms of the number of idle cores that are available
to run the target thread: an un-loaded chip would have all 16 cores available, while a maximally-
loaded chip would have only one core available and the other cores would be active running other
load. Note that Paceline, VBo+Mig, and VBo+Pl require a core pair to run, so are only applicable
when there are at least two cores available.
Figure 7.1 shows the performance improvements attained by the SPECint applications af-
ter applying each of the techniques. The improvements are over the application running on a
plain core — an environment we call Unoptimized. For each application, the individual bars cor-
respond to the different techniques. In a given bar, we stack the improvements attained under
different amounts of load in the multicore — measured in number of available cores. The top
segment represents the lightest possible load, while the bottom segment represents the heaviest.
The rightmost group of bars (p-oracle) shows the performance of the pseudo-oracular optimizer
for comparison. Note that because its outputs are static, the pseudo-oracle does not provide a
strict upper bound on performance. Nevertheless, the real controller compares favorably with the
pseudo-oracle.
[Figure 7.1 omitted: stacked bar chart. Y-axis: Performance Improvement (%), 0 to 60. One group
of bars per application (bzip2, crafty, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr,
hmean) plus hmean (p-oracle); the bars within a group are the techniques B: VBoost,
BM: VBo+Mig, P: Paceline, and BP: VBo+Pl; bar segments correspond to # Available Cores
(16, 8, 4, 2, 1).]
Figure 7.1: Performance improvements relative to Unoptimized at various multicore loading
levels.
In situations with just one core available, the only applicable technique is VBoost (because
the others require a free core pair), and it improves performance by 12% on average. If two cores
are available, Paceline comes online to provide a performance gain of 12%, which is compara-
ble to VBoost’s 14%. Note that even at such high loading, the system is at least in the Synergistic
regime. This is seen by combining Paceline and VBoost, which deliver a 26% average perfor-
mance improvement. In fact, the Individual regime occurs only once, namely for vortex, when
just two cores are free. In this case, VBo+Mig outperforms VBo+Pl because of tight thermal con-
straints.
For all loads, VBo+Mig provides a consistent and compelling improvement of 20%. In some
lightly-loaded systems, it is outperformed by VBoost. This is due to an artifact of our thread con-
troller implementation: for simplicity, we set the f of both cores in the VBo+Mig pair to be that of
the slower core in the pair. This makes it slightly slower than VBoost under best-case conditions.
Finally, VBo+Pl performs much better than either technique alone — especially as the number
of available cores increases. When all 16 cores are available, VBo+Pl’s average performance im-
provement reaches 38%.
To explain these performance trends, Figure 7.2 shows the limiting constraints for each tech-
nique and loading level. The size of each bar segment represents the fraction of samples (one for
each of the 50 dies and 11 benchmarks, for a total of 550 samples) from the pseudo-oracular ex-
periments in which the particular constraint was the limiting one. The constraints can be PEmax,
Pmax, Tmax, or Vddmax. The number above each bar is the average power consumption of an S
(performance-enhanced) thread under that configuration.
The three regimes of Table 4.1 appear in the figure as follows. When Paceline is limited by
Pmax or Tmax, the system is in the Individual regime. There is insufficient headroom to apply
two techniques simultaneously. This regime is seen rarely in Figure 7.2 when only two cores are
available. When Paceline is limited by PEmax and VBo+Pl is mostly limited by Pmax or Tmax,
the system is in the Synergistic regime. This is the dominant regime for 2-8 available cores. Fi-
nally, when VBo+Pl is often limited by PEmax or Vddmax, it is Unfulfilled. This is commonly seen
for 16 available cores.
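The mapping from limiting constraints to regimes can be summarized as a small decision function. The sketch below classifies a single sample, whereas the text describes tendencies over many samples ("mostly", "often"). The function name and the string labels for the constraints are illustrative choices, not identifiers from the controller of Chapter 4.

```python
def classify_regime(paceline_limit: str, vbo_pl_limit: str) -> str:
    """Classify one sample by the constraint that limits each technique.

    Each argument is one of "PEmax", "Pmax", "Tmax", or "Vddmax".
    """
    power_or_thermal = {"Pmax", "Tmax"}
    valid = power_or_thermal | {"PEmax", "Vddmax"}
    if paceline_limit not in valid or vbo_pl_limit not in valid:
        raise ValueError("unknown constraint name")
    if paceline_limit in power_or_thermal:
        # No headroom left to stack a second technique on top of Paceline.
        return "Individual"
    if vbo_pl_limit in power_or_thermal:
        # The combination uses up the available power/thermal headroom.
        return "Synergistic"
    # The combination stops at PEmax or Vddmax with headroom left over.
    return "Unfulfilled"


print(classify_regime("Tmax", "Tmax"))     # Individual
print(classify_regime("PEmax", "Pmax"))    # Synergistic
print(classify_regime("PEmax", "Vddmax"))  # Unfulfilled
```

Applying such a rule to each of the 550 pseudo-oracle samples and counting the results per loading level would reproduce the kind of summary given in the paragraph above.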
More specifically, for the fully-loaded system (one core available), Tmax is the key constraint.
The high total power consumption of all cores on chip raises the heatsink temperature, reducing
thermal headroom. Consequently, VBoost can only increase voltage slightly before hitting the
temperature limit. The result is that VBoost is not able to provide large performance improve-
ments on a fully-loaded system (Figure 7.1).
When the system has two available cores, VBoost remains temperature limited, able to dissipate
only 9W on average. VBo+Mig addresses the thermal problem effectively, allowing the
thread to dissipate 12W on average and reach the supply voltage limit in many cases. Referring
to Figure 7.1, this provides substantial marginal performance gains over VBoost. Meanwhile,
Paceline opens a new avenue for performance enhancement: Paceline is limited by PEmax and
has ample room to grow in power and temperature. In other words, Paceline provides an
orthogonal means of increasing performance. When VBoost and Paceline are used together in VBo+Pl,
they work in the Synergistic regime. They attain substantial improvements (Figure 7.1) and are
bounded mainly by temperature.

[Figure 7.2 omitted: stacked bar chart. X-axis: # Available Cores (1, 2, 4, 8, 16), with one bar
per technique (U: Unoptimized, B: VBoost, BM: VBo+Mig, P: Paceline, BP: VBo+Pl). Y-axis:
Limiting Constraints (%), 0 to 100. Legend: PEmax, Pmax, Tmax, Vddmax. Average S thread
power (W), as printed above the bars:

# Available Cores    U    B    BM    P    BP
1                    6    8    n/a   n/a  n/a
2                    6    9    12    12   15
4                    6    9    12    12   16
8                    6    11   11    11   19
16                   5    11   10    10   20
]
Figure 7.2: Relative incidence of limiting constraints across all applications and die samples.
The numbers above the bars show the average power consumption of the enhanced thread on
each system.
The cases of four to eight available cores show that Paceline is usually limited by PEmax and
VBo+Mig is often limited by the maximum supply voltage, but by applying VBo+Pl, we usually
arrive at the power or temperature limits. This is still the Synergistic regime.
In lightly loaded systems (16 available cores), VBoost and VBo+Mig are practically supply
voltage limited, while Paceline is PEmax-limited. After combining them in VBo+Pl, the system
is still often limited by supply voltage. Therefore, the system works in the Unfulfilled regime. It has
left-over power and temperature headroom to accommodate other techniques that increase per-
core power.
Overall, the key benefit of combining Paceline with VBoost is that they provide orthogonal
means of improving performance because they have different limiting constraints. This leads to a
Synergistic regime (2-8 available cores) and an Unfulfilled one (16 available cores).
7.2.2 Power–Performance Tradeoffs
To fully assess the different techniques, Figure 7.3 shows the performance and power consump-
tion of an S thread under the different techniques and a range of loading conditions. The perfor-
mance is normalized to the Unoptimized environment. In the figure, each curve corresponds to
one technique. The diamond point on each curve represents an unloaded system with 16 available
cores, while the circle (or the triangle in the case of Paceline, VBo+Mig, and VBo+Pl) represents
a heavily loaded system. Other markers on each curve identify the remaining loading points
from Figures 7.1 and 7.2.
This plot clearly shows the composability of the Paceline and VBoost (or VBo+Mig) tech-
niques, and the great capability of the VBo+Pl technique. Indeed, the figure shows that, roughly
speaking, Paceline, VBoost, or VBo+Mig deliver at best a 20% improvement in single-thread per-
formance at the cost of doubling the power of the thread. However, VBo+Pl delivers up to 38%
improvement in single-thread performance at the cost of trebling the thread power. This is a re-
markably high performance improvement. It is unreachable with either base technique alone,
and shows that the two base techniques are orthogonal. For the arguably more likely case when
only 8 cores are available, VBo+Pl improves single-thread performance by 34% while consum-
ing 220% more power than Unoptimized.
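One way to compare these operating points is to look at the performance gained per unit of added power. The sketch below uses only the rounded figures quoted in the paragraph above ("doubling", "trebling", "220% more"), so the results are approximate. The metric itself is an illustrative choice, not one used in the dissertation.

```python
def gain_per_added_power(perf_gain: float, power_ratio: float) -> float:
    """Fractional performance gain per unit of added power.

    `power_ratio` is the S thread's power relative to Unoptimized.
    """
    return perf_gain / (power_ratio - 1.0)


points = {
    "single technique, best case": (0.20, 2.0),  # 20% gain at double the power
    "VBo+Pl, 16 cores available": (0.38, 3.0),   # 38% gain at treble the power
    "VBo+Pl, 8 cores available": (0.34, 3.2),    # 34% gain at 220% more power
}

for label, (gain, ratio) in points.items():
    print(f"{label}: {gain_per_added_power(gain, ratio):.3f} gain per added power unit")
```

By this rough measure, the combined technique at light load converts additional power into performance at nearly the same rate as a single technique at its best, which is consistent with the two techniques being orthogonal.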
The figure also shows that VBoost (and VBo+Mig) are more power-efficient than Paceline. At
approximately the same power, VBoost and VBo+Mig deliver higher performance than Paceline.
Finally, we note that power consumption does not increase monotonically with performance.
[Figure 7.3 omitted: line plot. X-axis: Performance (1.0 to 1.4, normalized to Unoptimized).
Y-axis: S Thread Power (W), 4 to 18. One curve per technique (Unoptimized, VBoost, VBo+Mig,
Paceline, VBo+Pl), with markers for # Available Cores (1, 2, 4, 8, 16).]
Figure 7.3: Per-thread power consumption vs performance for each technique under different
load conditions.
For most techniques, power decreases slightly for the highest performance point — namely, the
one with the lowest load. This is due to the decrease in leakage power at the lower die temper-
atures that prevail under light load. When most of the cores on the die are idle, the total chip
power dissipation is low, so the heatsink is cold. As a result, all points on the die surface decrease
in temperature, and leakage goes down.
7.2.3 Sensitivity Analysis
We now characterize the sensitivity of the performance improvements of VBoost and Paceline.
In many cases, Tmax is a limiting factor, so we examine different thermal design points. Addi-
tionally, the Paceline systems achieve their performance improvements primarily by consuming
guardband, so we evaluate several different guardband sizes. Finally, we consider different per-
core power limits that correspond to more robust and more frail per-core power grids. Figure 7.4
shows the three sensitivity studies varying the parameters that have been assumed for the preced-
ing experiments (labeled as “default”). All numbers are normalized to the performance of Un-
optimized with the same parameters under the same conditions. Note that the default-parameter
bars are not exactly like those in Figure 7.1 because here, we only simulate a single (typical) die,
rather than 50 samples. In the following, we consider each plot in turn.
[Figure 7.4 omitted: three panels of stacked bar charts. Y-axis: Performance Improvement (%),
0 to 40. Techniques: B: VBoost, BM: VBo+Mig, P: Paceline, BP: VBo+Pl; bar segments
correspond to # Available Cores. Panel (a) Thermal envelope: laptop, desktop (default), server.
Panel (b) Guardband: 10%, 15% (default), 20%. Panel (c) Per-core Max Power: 8W,
12W (default), 16W.]
Figure 7.4: Sensitivity of performance improvements to the thermal environment (a), process
guardband (b), and power grid design (c).
Power/Thermal Envelope
We consider three different power/thermal environments, namely laptop, desktop, and server,
as shown in Table 7.2. Configurations vary the total number of cores on the chip to fit into their
respective thermal envelopes: 4 cores in laptop, 16 cores in desktop, and 24 cores in server.
Additionally, server assumes a powerful high-airflow cooling system typical of datacenter
installations, while laptop is constrained by a relatively weak heatsink due to its physical confines³.

³ Laptop, desktop, and server heatsinks are modeled on Aavid Thermalloy types 3680, 1002/D,
and 037704, respectively.

                Laptop      Desktop     Server
# Cores         4           16          24
TDP             24W         96W         144W
Heatsink Rth    1.3 K/W     0.33 K/W    0.30 K/W
Ambient T       313K        313K        305K
Table 7.2: Power/thermal environments for the sensitivity study.

Figure 7.4(a) shows that the performance trends are largely identical in all three thermal
environments. On an unloaded system, the performance gains for each technique differ by less than
2% across all environments. Moreover, for all environments, performance falls off at roughly the
same rate as system load increases.
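A back-of-the-envelope check on the parameters of Table 7.2 suggests why the three environments behave so similarly. The sketch below assumes a lumped steady-state model in which the heatsink temperature is the ambient temperature plus Rth times the dissipated power. That model is an assumption made here for illustration and is much simpler than the thermal simulation behind the reported results.

```python
ENVIRONMENTS = {
    "laptop":  {"cores": 4,  "tdp_w": 24.0,  "rth_k_per_w": 1.30, "ambient_k": 313.0},
    "desktop": {"cores": 16, "tdp_w": 96.0,  "rth_k_per_w": 0.33, "ambient_k": 313.0},
    "server":  {"cores": 24, "tdp_w": 144.0, "rth_k_per_w": 0.30, "ambient_k": 305.0},
}


def heatsink_temperature(env: dict, load_fraction: float = 1.0) -> float:
    """Steady-state heatsink temperature (K) at a given fraction of TDP."""
    return env["ambient_k"] + env["rth_k_per_w"] * env["tdp_w"] * load_fraction


def tdp_per_core(env: dict) -> float:
    """Thermal design power budget per core (W)."""
    return env["tdp_w"] / env["cores"]


for name, env in ENVIRONMENTS.items():
    print(f"{name:8s} TDP/core {tdp_per_core(env):.1f} W, "
          f"heatsink at full TDP {heatsink_temperature(env):.1f} K")
```

All three configurations budget the same TDP per core, and under this simple model their full-load heatsink temperatures land within a few kelvin of each other, so similar thermal headroom per core is to be expected.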
Guardband
Figure 7.4(b) shows that, as the process guardband increases from 10% to 20%, the effective-
ness of Paceline also grows. This is because Paceline gets much of its speedup from removing
the guardband (refer to Figure 4.1(b)). As the guardband increases from 10% to 15%, the returns
from Paceline grow by roughly 0.04. However, if the guardband increases above 15%, the perfor-
mance benefits from Paceline rise at a slower rate — tempered by the application’s responsive-
ness to f increases and the limited bandwidth of the checker core.
Power Grid Capacity
Figure 7.4(c) shows how increasing the per-core power constraint Pmax impacts performance.
From a design perspective, supporting a higher Pmax requires a more robust per-core supply grid
with lower resistance, higher decoupling capacitance, and lower inductance, which can be diffi-
cult to achieve. The default design assumed a grid capable of supplying a worst-case power equal
to twice the average power consumption of the SPECint applications, which we believe represents
a typical design. Decreasing that budget by one third to arrive at Pmax = 8W degrades
all techniques except Paceline, but they still provide compelling speedups. Increasing the budget
by one third to 16W benefits only the most power-hungry cases and even then, only slightly. We
conclude that none of the techniques requires more than a standard power distribution system.
7.2.4 Configurability
The previous evaluations focused on the goal of optimizing a single thread, but when executing a
multiprogrammed workload or an application with some, limited parallelism, it will often be nec-
essary to speed up a larger number S of optimized, performance-critical threads (Section 4.2.3).
Here, the key question is how much performance can be delivered to the S optimized threads
while the chip concurrently runs a specified number R of unoptimized threads. Note that Pace-
line, VBo+Mig, and VBo+Pl all use two cores to optimize a single S thread. Therefore, they can
execute at most S = 8 optimized threads when the system is otherwise unloaded. Only VBoost is
able to optimize S > 8 threads.
Figure 7.5 shows the percentages of performance improvement experienced by an S thread
for each technique when the application demands different numbers of R and S threads. Plots
(a)-(d) correspond to one technique each, while plot (e) corresponds to the algorithm used by the
global controller (Section 4.3.3), namely choose VBo+Pl if there are at least twice as many idle cores as
S threads, or VBoost otherwise. Each plot shows iso-performance contours for the S threads in
increments of 2%. Performance is normalized to that of an Unoptimized chip running the same
number of threads. For techniques that use two cores per S thread (plots (b), (c), and (d)), the
region where R + 2S > 16 is infeasible; for VBoost (plot (a)), the region where R + S > 16
is infeasible. Finally, plot (e) is an “overlay” of plot (d) when VBo+Pl is feasible and plot (a)
otherwise.
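The feasibility regions and the selection rule of plot (e) can be expressed in a few lines. The sketch below is a paraphrase of the rule as described in the preceding paragraph, not the controller of Section 4.3.3 itself. The function names are hypothetical, and "idle cores" is taken to mean the cores not occupied by R threads.

```python
TOTAL_CORES = 16


def feasible(technique: str, s: int, r: int, total: int = TOTAL_CORES) -> bool:
    """Whether `s` optimized and `r` unoptimized threads fit on the chip.

    VBoost uses one core per S thread; VBo+Mig, Paceline, and VBo+Pl use two.
    """
    cores_per_s = 1 if technique == "VBoost" else 2
    return r + cores_per_s * s <= total


def choose_technique(s: int, r: int, total: int = TOTAL_CORES):
    """Pick VBo+Pl when every S thread can get a core pair, else VBoost.

    Returns None when even VBoost cannot accommodate the request.
    """
    idle = total - r
    if idle >= 2 * s:
        return "VBo+Pl"
    if feasible("VBoost", s, r, total):
        return "VBoost"
    return None


for s, r in [(4, 8), (8, 0), (9, 0), (6, 8), (10, 8)]:
    print(f"S={s:2d}, R={r:2d}: {choose_technique(s, r)}")
```

The boundary between the two choices is the line R + 2S = 16, which is exactly the edge of the infeasible region in plots (b) through (d).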
For their feasible configurations, both VBo+Mig and Paceline offer consistent per-thread per-
formance improvement, varying by less than 2% over the entire feasible range of S and R. When
available, VBo+Pl always offers superior performance improvements to either technique alone.
However, that performance is more sensitive to the particular value of S and R demanded. Like-
wise, the performance of VBoost falls off as the number of S threads approaches the limit.
Plot (e) demonstrates that a configurable multicore benefits greatly from having both VBoost
and VBo+Pl available. A chip with only VBoost realizes the full performance gains when the
system must run many S and R threads at the same time (darker region), but it is not able to provide
optimal improvements when the number of S threads is low or the load is light (light region).
Likewise, a system with only Paceline would provide less performance increase in the light
region and would be helpless in the dark region.

[Figure 7.5 omitted: five contour plots. X-axis: # Optimized Threads (S), 0 to 15. Y-axis:
# Unoptimized Threads (R), 0 to 15. Panels: (a) VBoost, (b) VBo+Mig, (c) Paceline, (d) VBo+Pl,
(e) Global controller algorithm, with regions labeled VBoost and VBo+Pl. Each panel marks a
Not Feasible region.]
Figure 7.5: Percentages of performance improvements experienced by an S thread when the
application demands different numbers of R and S threads. For each technique, there is an area of
infeasible configurations. Iso-performance contours are shown in the feasible regions.
Chapter 8
Related Work
8.1 Timing Speculation
Timing speculation is a relatively recent development, originating with TIMERRTOL [75] less
than a decade ago. Since then, many approaches and refinements have been proposed with the
goal of either saving power or enabling higher clock frequencies, but Paceline and BlueShift
each fill important gaps in this prior work: Paceline is the first proposal to be Configurable in the
sense of Section 1.3, and BlueShift provides much-needed CAD support, which has only been
addressed in a couple of previous proposals and never in a way that applies to such a wide range
of microarchitectures. The following subsections review the previous proposals from a microar-
chitecture and CAD perspective, contrasting them with Paceline and BlueShift.
8.1.1 TS Microarchitectures
Stage-Level Checking
The original TS proposal, TIMERRTOL [75], was less of a microarchitecture than a general-purpose
pipeline latching scheme. It proposed three ways of using dual-phase clocking to detect
and correct timing faults at the register level. The most efficient of the schemes works by
doubling the pipeline depth, inserting an additional pipeline register in the middle of each combi-
national logic block. Adjacent registers are then clocked 180 degrees out of phase and at higher
frequency than the original circuit. Errors are detected by comparing each register’s input to its
output just before the clock edge of the preceding register. In the event of a miscomparison, indi-
cating a timing fault, a global recovery signal is asserted and the contents of the preceding regis-
ter are used to restore the pipeline state.
TIMERRTOL’s main advantage is that it does not require any change in the design of the
combinational logic aside from inserting the extra pipeline registers. For example, the
designer does not need to ensure longer than normal hold times, unlike in Razor [20]. However,
TIMERRTOL does add one register delay to the critical path of each pipeline stage, which re-
duces clock frequency substantially. Nevertheless, Uht et al. successfully demonstrated the tech-
nique on a 32-bit ripple-carry adder, achieving a 70% throughput increase over a baseline, non-
speculative design.
Constructive Timing Violation (CTV) [59] and Circuit Level Speculation (CLS) [42] are
independently-developed proposals similar to TIMERRTOL that evaluate the performance and
power impacts of TS on an out-of-order processor pipeline. At the circuit level, their technique
is identical to one of TIMERRTOL’s proposed schemes, which requires three identical copies of
each speculative logic block. One copy runs at the full pipeline frequency and produces specu-
lative results. Two other “checker” copies run at half the pipeline frequency. CLS showed that
using this technique on the adder unit — along with other optimizations — could decrease pro-
cessor cycle time. Moreover, it simplified the error recovery procedure by using the pipeline’s
built-in speculation support, simply re-issuing errant instructions. CTV focused on power reduc-
tion by reducing supply voltage. It argued that although replicating functional units may seem
wasteful, this type of TS can actually improve power efficiency if the error rate is low enough to
allow a significant reduction in supply voltage.
Razor [17, 20] is another scheme that performs checking at each pipeline register, but unlike
its predecessors, it relies on wave-pipelining [15] to detect and correct errors. It augments each
pipeline register with a “shadow” register. Both the shadow and main register accept the same
data input, but the shadow is clocked 180 degrees out of phase with the main one. A minimum
hold time of one half cycle is imposed on all logic paths in the design to prevent values from
“racing through” to the shadow latch. Thus, the value latched by the shadow should be identical
to the value in the main latch but has had a half cycle longer to propagate through the logic and is
therefore nonspeculative. A comparator monitors the contents of the shadow and main registers,
and if they disagree, the shadow contents are used to reinitialize the pipeline.
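The detection window implied by this scheme can be illustrated with a minimal Python sketch. It is a simplified timing model rather than Razor's actual circuit: the function names are hypothetical, setup times and clock skew are ignored, and the shadow is assumed to sample exactly half a cycle after the main register.

```python
def razor_outcome(path_delay: float, period: float) -> str:
    """Classify one register capture under a Razor-style shadow latch.

    Simplified model (an assumption of this sketch): the main register
    samples at `period`, the shadow samples at 1.5 * `period`.
    """
    if path_delay <= period:
        # Data arrived before the main clock edge: both latches agree.
        return "correct"
    if path_delay <= 1.5 * period:
        # Main latch captured stale data, shadow captured the right value:
        # the comparator flags the mismatch and recovery can begin.
        return "detected"
    # Data missed even the shadow latch, so the error goes unnoticed.
    return "undetected"


def meets_hold_constraint(min_path_delay: float, period: float) -> bool:
    """Check the half-cycle minimum delay that keeps next-cycle data
    from racing through to the shadow latch."""
    return min_path_delay >= 0.5 * period
```

In this model, raising the frequency (shrinking `period`) is safe only while the slowest exercised path stays within 1.5 periods and the fastest path still satisfies the half-cycle hold constraint.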
Perhaps Razor’s most important contribution is its rigorous evaluation, which includes a real
silicon implementation of a full Alpha processor [17]. The goal of the evaluation was to demon-
strate power savings from reduced supply voltage, and this was successful. Razor can also in-
crease frequency at nominal voltage, and this dissertation has focused on that use.
Note that none of these microarchitectures provides configurability. It is true that any TS sup-
port can be “turned off” simply by powering down the checker logic and registers, but substantial
overhead remains relative to a non-TS design: First, the area overhead of the additional regis-
ters and checking logic reduces core count on a CMP. In the case of CLS [42] and CTV [59], the
replicated functional units increase power consumption by lengthening bypass paths. The most
advanced TIMERRTOL scheme adds to the critical path delay, which increases delay when run-
ning at a nonspeculative frequency. Finally, Razor imposes a power overhead by requiring extra
buffers to guarantee the half-cycle hold time.
At-Retirement Checking
DIVA [7] was developed to simplify processor design and provide tolerance of permanent and
transient errors. Although TS was not one of its design goals, timing faults are similar to the
soft errors the microarchitecture was designed to correct. DIVA works by augmenting the main
pipeline with a checker unit that re-executes every stage (fetch, decode, execute, and writeback)
of each instruction before retirement. The checker keeps up with the main pipeline by performing
each stage of its checking in parallel after the instruction completes in the main pipeline. DIVA
thus isolates the checking logic into an at-retirement module that is off the main pipeline’s critical
path. Additionally, DIVA’s ability to correct permanent faults allows the main pipeline to relax
functional correctness.
Optimistic Tandem [47] is a proposal that builds on DIVA by embracing functional errors in
the main pipeline for the purpose of maximizing performance under TS. Specifically, it designs
the main pipeline so that the common cases are fast and uncommon cases are incorrect. It then
integrates a checker core that — like DIVA — is much smaller and less powerful than the main
core. The way in which the checker is integrated with the main core, however, is closer to Pace-
line. As in Paceline, the leader passes branch results to the checker and prefetches data into a
shared L2 cache. Meanwhile, the two cores periodically compare execution signatures to detect
faults. Unlike Paceline, however, the leader and checker are microarchitecturally different. Con-
sequently, Optimistic Tandem does not perform activity migration. The method of branch passing
also differs slightly from Paceline, as Optimistic Tandem [47] allows the leader core to train the
checker’s branch predictor directly. This provides an interesting and possibly lower-overhead al-
ternative to Paceline’s BQ.
Unfortunately, these at-retirement checkers also lack configurability. TS can never be dis-
abled in Optimistic Tandem because the main pipeline is not functionally correct. DIVA provides
more opportunity for configurability, as the checker could be disconnected and used indepen-
dently as a low-performance processor when TS is disabled. However, this introduces hetero-
geneity into the design, which may not be desirable.
8.1.2 TS Design Methodologies
Several proposals have analyzed or improved the performance of specific functional blocks —
usually adders — under TS [25, 8]. Others have also pointed out an unfulfilled need for general-
purpose design flows that optimize performance under TS [8]. Of existing work, Optimistic Tan-
dem [47] (which inspired BlueShift) is the most closely related. Like BlueShift, it uses a profile-
based approach, running training benchmarks on an RTL simulation of the system and record-
ing which source RTL statements are only rarely used. However, unlike BlueShift, it achieves
its speedup by sacrificing functional correctness, pruning the infrequently-used statements from
the design of the main pipeline. The fact that it operates on the RTL (rather than the gate-level
design that BlueShift requires) is an advantage of Optimistic Tandem, offering much faster pro-
filing speed. However, because it is based on functional pruning, the Optimistic Tandem design
methodology cannot be used to build configurable systems.
BTWMap [37] is a general design tool that optimizes common-case performance without
sacrificing functional correctness by implementing a novel gate-mapping algorithm (part of the
logic synthesis flow) that uses a switching activity profile to minimize the common-case delay.
BTWMap requires modification of the CAD algorithms, but it is applicable to any synthesis-
based design flow. The main disadvantage is that it does not account for physical (layout, routing,
and clock skew) effects, which can significantly affect circuit timing. Because BTWMap operates
at an early stage of the design flow, many of its gains may be lost in the physical design backend
if those tools re-factor the design using more traditional objectives. Integrating BTWMap with
the rest of the design flow is therefore not trivial. In contrast, BlueShift sacrifices some ability
to do aggressive optimization by not intervening in the synthesis stage but controls the physical
design (place and route) to ensure that the final design is suitable for TS.
8.2 Configurable Microarchitectures
8.2.1 Leader-Checker Execution
PIPE [24] was the first configurable architecture to apply two cores to the execution of a single
thread. In PIPE, two cores — called the Access Processor (AP) and Execution Processor (EP)
— execute different instruction streams generated by a specialized compiler. One stream per-
forms all of the memory operations (and their associated slices) and executes on the AP. The
other stream executes the non-memory instructions on the EP. The effect is that the AP can ex-
ecute ahead, buffering its load values in a special load queue for later consumption by the EP,
thereby reducing stalls and increasing execution bandwidth.
Slipstream [53, 70] was another seminal leader–checker proposal, and Paceline is heavily
based on its microarchitecture. (Even the names are similar, both referring to the same aerodynamic
effect.) Like Paceline, Slipstream uses two cores in coupled leader–checker mode to boost
per-thread performance. The leader generates prefetches and branch results for the checker just
as in Paceline. The key difference is the way in which the leader achieves
its speedup. Slipstream is actually more clever than Paceline in this respect, speculatively eliding
dynamically dead code, highly predictable branches, and synchronization operations from the
leader’s instruction stream. In fact, Slipstream could perform TS on top of this optimization to
realize even higher performance than Paceline. Slipstream also argued for configurability, show-
ing that even for the highly-parallel SPLASH-2 benchmarks, a 32-processor system running 16
threads in coupled leader–checker mode usually outperforms the same system running twice as
many threads in uncoupled mode [30].
Master/Slave Speculative Parallelization (MSSP) [81] took instruction elision a step further
by creating an aggressively reduced “distilled” program binary to run on a Master (leader) pro-
cessor while employing multiple checker processors. A special MSSP compiler partitions the
program into a sequence of tasks and generates both full and distilled implementations of each.
As the distilled tasks complete on the Master, they distribute predicted live-in values to Slave
processors, which then execute the full implementation of the successor tasks starting from those
live-ins. Slaves keep up with the Master by checking in parallel, with each Slave core executing
one task at a time. A Slave commits its results to architectural state in-order after verifying the
live-ins from the preceding task.
More conservatively, other proposals such as Dual-Core Execution [80] and Flea-Flicker [10]
pipelining have focused exclusively on prefetching without using compiler support to alter the
leader’s program binary. In these proposals, the leader core fetches and executes instructions
without stalling for long-latency cache misses. Instead, it executes what it can and passes the
complete stream of dynamic instructions to the other core, which executes the full program and
integrates the missing loads. Future Execution [21] is similar except that instead of skipping
cache-missing loads, it attempts to predict the values that instructions will produce in future loop
iterations, thereby allowing it to run several iterations ahead of the checker.
Aside from performance enhancements, the other major application of leader–checker archi-
tectures has been in reliability and especially soft-error tolerance. For example, CRTR [23] re-
dundantly executes one thread on two cores with the leader providing branch outcomes, load val-
ues and register results to the checker. The checker re-executes all instructions using the values
in the queues for comparison. Madan and Balasubramonian [43] refine this approach by running
the checker core at a lower frequency and voltage to provide energy savings. Reunion [64] is yet
another high reliability leader–checker microarchitecture. Unlike the others, but like Paceline, it
uses checkpoint signatures or fingerprints [65] to compare the leader and checker executions.
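The signature comparison used by Reunion and Paceline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: CRC-32 stands in for whatever hash the hardware actually computes, each checkpoint interval is represented as a list of hypothetical (register, value) updates, and the function names are invented for this example.

```python
import zlib
from typing import Iterable, List, Optional, Tuple

Update = Tuple[str, int]  # hypothetical (register name, retired value) pair


def fingerprint(updates: Iterable[Update]) -> int:
    """Fold one checkpoint interval's retired updates into a 32-bit hash.

    CRC-32 is a stand-in chosen for this sketch, not the hardware's hash.
    """
    crc = 0
    for reg, value in updates:
        crc = zlib.crc32(f"{reg}={value};".encode(), crc)
    return crc


def first_mismatch(leader: List[List[Update]],
                   checker: List[List[Update]]) -> Optional[int]:
    """Return the index of the first interval whose fingerprints differ,
    or None if the two executions agree on every compared interval."""
    for i, (l_int, c_int) in enumerate(zip(leader, checker)):
        if fingerprint(l_int) != fingerprint(c_int):
            return i  # recovery would roll back to the checkpoint before i
    return None
```

Comparing one short hash per interval, instead of every individual result, is what keeps the communication between the two cores small.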
8.2.2 Voltage-Frequency Boosting
Many previous works have used dynamic voltage and frequency scaling (DVFS [14]) to save
power by reducing voltage and frequency below nominal. However, the idea of increasing volt-
age and frequency above nominal (when the power and thermal budget allows) is relatively re-
cent. Intel’s currently shipping Turbo Boost [34, 5] technology and its predecessor, Foxton [46],
are the best examples. They both include on-die controllers that continuously monitor the power
consumption and temperature of the cores, increasing voltage and frequency up to an aggressive
maximum value when constraints allow. LeadOut (Chapter 4) uses a similar control algorithm.
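A controller of this general kind can be sketched as a simple step function. The Python below is a hypothetical illustration, not the algorithm used by Turbo Boost, Foxton, or LeadOut: the discrete voltage-frequency levels, the single-step policy, and all parameter names are assumptions of this example.

```python
def boost_step(level: int, power_w: float, temp_c: float, *,
               max_level: int, power_limit_w: float,
               temp_limit_c: float) -> int:
    """Choose the next voltage-frequency level for one control interval.

    `level` indexes a hypothetical table of (voltage, frequency) pairs,
    where 0 is nominal and `max_level` is the most aggressive setting.
    """
    if power_w > power_limit_w or temp_c > temp_limit_c:
        # A constraint is violated: back off one level, never below nominal.
        return max(level - 1, 0)
    # Headroom is available: step up toward the aggressive maximum.
    return min(level + 1, max_level)
```

Calling `boost_step` once per monitoring interval yields a controller that climbs while power and temperature stay within budget and retreats as soon as either limit is exceeded.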
The combination of voltage-frequency boosting and TS in LeadOut is also not novel. EVAL [58]
proposed trading off voltage, error rate, and power to maximize core performance under varia-
tion. As part of this tradeoff, it includes the ability to increase voltage above nominal. Unlike the
Turbo and Foxton controllers, EVAL uses a fuzzy control system to co-optimize supply voltages
(and body-bias settings) for every functional unit in the processor, with each unit potentially re-
ceiving a different voltage. Unlike LeadOut, EVAL’s checker architecture is based on DIVA [7]
and is therefore not configurable.
8.2.3 Other Proposals
Thread-Level Speculation (TLS) [67, 39, 54] provides another configurable way for multiple
cores to work together on one thread. In TLS, a sequential program is partitioned into tasks that
speculatively execute concurrently on multiple cores. As it executes, each task watches for data
dependences from preceding tasks in program order. Whenever the system detects that a data de-
pendence was not honored in the parallel execution, it squashes any tasks that might have been
affected. In the absence of such dependence violations, tasks first buffer their updates and then
commit them atomically when all earlier tasks have completed. TLS offers the advantage that,
like MSSP, it can apply an arbitrary number of cores to the execution of a single thread. The
main disadvantage is the modification of the caches needed to detect memory dependences be-
tween tasks and to hold speculative data until a task commits.
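The buffer, squash, and in-order commit behavior described above can be illustrated with a toy Python model. It is a sketch under simplifying assumptions: all names are hypothetical, tasks are modeled as callbacks that receive `load` and `store` functions, every task speculates against the same initial memory snapshot, and violations are checked at commit time, whereas real TLS hardware detects them in the caches as they occur.

```python
from typing import Callable, Dict, List, Set, Tuple

Memory = Dict[str, int]
Load = Callable[[str], int]
Store = Callable[[str, int], None]
Task = Callable[[Load, Store], None]


def execute(task: Task, memory: Memory) -> Tuple[Set[str], Memory]:
    """Run one task, recording its read set and buffering its writes."""
    reads: Set[str] = set()
    writes: Memory = {}

    def load(addr: str) -> int:
        if addr in writes:          # forward from the task's own buffer
            return writes[addr]
        reads.add(addr)             # value came from outside the task
        return memory[addr]

    def store(addr: str, value: int) -> None:
        writes[addr] = value        # buffered until commit

    task(load, store)
    return reads, writes


def run_tls(tasks: List[Task], memory: Memory) -> int:
    """Speculatively run all tasks, then commit in program order.

    Returns the number of squashed (re-executed) tasks.
    """
    snapshot = dict(memory)
    speculative = [execute(t, snapshot) for t in tasks]  # "parallel" phase
    dirty: Set[str] = set()   # locations written by already-committed tasks
    squashes = 0
    for task, (reads, writes) in zip(tasks, speculative):
        if reads & dirty:
            # The task consumed a stale value: squash and re-execute it
            # against the now up-to-date memory.
            squashes += 1
            reads, writes = execute(task, memory)
        memory.update(writes)       # atomic commit of buffered updates
        dirty.update(writes)
    return squashes
```

With three tasks where the second reads a location the first writes, only the second task is squashed, and the final memory matches a sequential execution.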
Core Fusion [35] takes a much finer-grain approach to configurability. It lays out multiple
dual-issue out-of-order pipelines side-by-side on the die and allows adjacent pipelines to fuse to-
gether into a single wider-issue processor. This approach allows an arbitrary number of cores to
work together, with the benefit of adding more cores limited only by the application’s ILP. How-
ever, it requires the core pipeline to be redesigned from the ground up to incorporate the extra
multiplexers and control logic that orchestrate fetch, bypassing, and commit in the fused mode.
The hardware overhead is estimated at 8% of the core area.
TFlex [36] is another finer-grained approach to configurability. Like its predecessor, TRIPS [56],
it is an explicit dataflow architecture that distributes operators across a fabric of small execution
tiles. These tiles exchange operands (dataflow tokens) under the explicit control of the compiler.
Since the tiles are only loosely coupled in the microarchitecture, they can easily be spatially par-
titioned to allow different threads to run on different segments of the fabric. When per-thread
performance is required, the dataflow operators spread out across all of the tiles to maximize ILP,
and when TLP is demanded, more operators are placed on a single tile to make room for more
threads.
Chapter 9
Conclusion
9.1 Summary of Contributions
The future of performance scaling lies in massively parallel workloads, but less-parallel applica-
tions will remain important. Unfortunately, future process technologies and core microarchitec-
tures no longer promise major per-thread performance improvements, so microarchitects must
find other ways to address a growing per-thread performance deficit. Moreover, they must do
so without sacrificing parallel throughput. To meet these apparently conflicting demands, this
dissertation joined with previous works in advocating a configurable multicore. It contributed
a novel configurable enhancement based on timing speculation, which can compose with other
configurable enhancements to provide large per-thread performance gains. Our example LeadOut
multicore achieved mean speedups of 34% on SPECint by combining timing speculation with
voltage-frequency boosting.
This dissertation made three interlocking contributions to the design of configurable multi-
cores at the microarchitecture and circuit levels. First, Chapter 2 introduced Paceline, the first
TS microarchitecture designed specifically for configurable multicores. Next, it enhanced Pace-
line with BlueShift, a circuit design method for TS architectures that aims to improve the circuit’s
common-case delay rather than focusing on worst-case delay as traditional design flows do. Fi-
nally, it introduced LeadOut, a multicore design that combines Paceline with the ability to in-
crease core supply voltage above nominal. It showed major gains from applying the two tech-
niques together when feasible and argued that, in many cases, future multicores can benefit from
composing still more configurable enhancements.
9.2 Looking Forward
This dissertation’s contributions were aimed at homogeneous multicores, assuming a scaling
model in which the number of cores grows exponentially as technology advances. However,
technology experts are now expressing growing concern that power scaling may not support this
model. Just as Chapter 1 explained that power density limits future frequency increases, it is pos-
sible that even if clock frequency does not scale at all, the power problem could become so acute
that it will limit the number of cores that can be simultaneously powered on. This leads to a case
where most of the cores on a die must be idle at any given time in order to meet the power and
thermal budget. The likely architectural consequence is extreme heterogeneity, where the die in-
cludes many customized cores and application-specific accelerators, only a few of which — the
best for the phase at hand — are active at any given time.
Such a design is less appealing than a homogeneous one from a design effort and time-to-
market perspective, but it may be the only way forward. It is important to note that all of the
techniques proposed in this dissertation still apply in a heterogeneous environment; configura-
bility is still valuable within a given class of cores — especially the highest-performance cores
where Paceline, BlueShift, and LeadOut can be used to boost per-thread performance to even
higher levels than would otherwise be achievable, even with arbitrarily complex cores.
References
[1] International Technology Roadmap for Semiconductors (2008 Update).
[2] The Super ESCalar simulator. http://sourceforge.net/projects/sesc/.
[3] K. Agarwal and S. Nassif. Characterizing process variation in nanometer CMOS. In Design
Automation Conference, pages 396–399, June 2007.
[4] M. Agarwal, B. C. Paul, M. Zhang, and S. Mitra. Circuit failure prediction and its applica-
tion to transistor aging. In VLSI Test Symposium, pages 277–286, May 2007.
[5] P. Altevogt, H. Boettiger, W. M. Felter, C. R. Lefurgy, L. Stiege, and M. S. Ware. Method
for autonomous dynamic voltage and frequency scaling of microprocessors, April 2008. US
Patent Application #20080098254.
[6] AMD Corporation. AMD NPT family 0Fh desktop processor power and thermal data sheet,
June 2007.
[7] T. Austin. DIVA: A reliable substrate for deep submicron microarchitecture design. In
International Symposium on Microarchitecture, pages 196–207, November 1999.
[8] T. Austin, V. Bertacco, D. Blaauw, and T. Mudge. Opportunities and challenges for better
than worst case design. In Asia-South Pacific Design Automation Conference, pages 2–7,
January 2005.
[9] C. Auth, M. Buehler, A. Cappellani, C. Choi, G. Ding, W. Han, S. Joshi, B. McIntyre,
M. Prince, P. Ranade, J. Sandford, and C. Thomas. 45nm high-k+metal gate strain-
enhanced transistors. Intel Technology Journal, 12(2):77–85, June 2008.
[10] R. Barnes, J. Sias, E. Nystrom, S. J. Patel, J. Navarro, and W. W. Hwu. Beating in-order
stalls with “flea-flicker” two-pass pipelining. IEEE Transactions on Computers, 55(1):18–
33, 2006.
[11] K. Bernstein, D. J. Frank, A. E. Gattiker, W. Haensch, B. L. Ji, S. R. Nassif, E. J. Nowak,
D. J. Pearson, and N. J. Rohrer. High-performance CMOS variability in the 65-nm regime
and beyond. In IBM Journal of Research and Development, volume 50, pages 433–449,
July/September 2006.
[12] J. A. Blome, S. Feng, S. Gupta, and S. Mahlke. Online timing analysis for wearout detec-
tion. In Workshop on Architectural Reliability, November 2006.
[13] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power
analysis and optimizations. In International Symposium on Computer Architecture, pages
83–94, June 2000.
[14] T. Burd and R. Brodersen. Processor design for portable systems. Journal of VLSI Signal
Processing Systems, 13(2):203–221, August 1996.
[15] W. Burleson, M. Ciesielski, F. Klass, and W. Liu. Wave-pipelining: A tutorial and research
survey. IEEE Transactions on VLSI Systems, 6(3):464–474, September 1998.
[16] D. Chinnery and K. Keutzer. Closing the Power Gap Between ASIC and Custom, chapter 3,
page 62. Springer, 2007.
[17] S. Das, S. Pant, D. Roberts, S. Lee, D. Blaauw, T. Austin, T. Mudge, and K. Flautner. A
self-tuning DVS processor using delay-error detection and correction. In IEEE Symposium
on VLSI Circuits, pages 258–261, June 2005.
[18] R. Dennard, F. Gaensslen, H. Yu, V. L. Rideout, E. Bassous, and A. LeBlanc. Design of
ion-implanted MOSFETs with very small physical dimensions. IEEE Journal of Solid State
Circuits, (5):256–268, October 1974.
[19] J. Dorsey, S. Searles, M. Ciraula, S. Johnson, N. Bujanos, D. Wu, M. Braganza, S. Meyers,
E. Fang, and R. Kumar. An integrated quad-core Opteron processor. In International Solid
State Circuits Conference, pages 102–103, February 2007.
[20] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Zeisler, D. Blaauw, T. Austin,
K. Flautner, and T. Mudge. Razor: A low-power pipeline based on circuit-level timing spec-
ulation. In International Symposium on Microarchitecture, pages 7–18, December 2003.
[21] I. Ganusov and M. Burtscher. Future execution: A prefetching mechanism that uses mul-
tiple cores to speed up single threads. ACM Transactions on Architecture and Code Opti-
mization, 3(4):424–449, December 2006.
[22] A. Gandhi, H. Akkary, and S. T. Srinivasan. Reducing branch misprediction penalty via
selective branch recovery. In International Symposium on High Performance Computer
Architecture, pages 254–264, February 2004.
[23] M. Gomaa, C. Scarborough, T. N. Vijaykumar, and I. Pomeranz. Transient fault recovery
for chip multiprocessors. In International Symposium on Computer Architecture, pages
98–109, June 2003.
[24] J. Goodman, J. Hsieh, K. Liou, A. Pleszkun, P. Schechter, and H. Young. PIPE: A VLSI
decoupled architecture. Computer Architecture News, 13(3):20–27, June 1985.
[25] R. Hegde and N. Shanbhag. Soft digital signal processing. IEEE Transactions on VLSI
Systems, 9(6):813–823, December 2001.
[26] S. Heo, K. Barr, and K. Asanović. Reducing power density through activity migration. In
International Symposium on Low Power Electronics and Design, pages 217–222, August
2003.
[27] M. D. Hill and M. R. Marty. Amdahl’s law in the multicore era. IEEE Computer, 41(7):33–
38, July 2008.
[28] H. Hua, C. Mineo, K. Schoenfliess, A. Sule, S. Melamed, R. Jenkal, and W. R. Davis. Ex-
ploring compromises among timing, power, and temperature in three-dimensional inte-
grated circuits. In Design Automation Conference, pages 997–1002, July 2006.
[29] E. Humenay, D. Tarjan, and K. Skadron. Impact of process variations on multicore per-
formance symmetry. In Design, Automation, and Test in Europe, pages 1653–1658, April
2007.
[30] K. Ibrahim, T. Byrd, and E. Rotenberg. Slipstream execution mode for slipstream-based
multiprocessors. In International Symposium on High Performance Computer Architecture,
pages 179–190, February 2003.
[31] Intel Corporation. Intel Core 2 Duo Mobile processor for Intel Centrino Duo Mobile tech-
nology, January 2007.
[32] Intel Corporation. Intel Core 2 Extreme Processor X6800 and Intel Core 2 Duo Desktop
Processor E6000 and E4000 sequences, April 2007.
[33] Intel Corporation. Intel Core i7 processor Extreme Edition and Intel Core i7 processor
datasheet, November 2008.
[34] Intel Corporation. Intel Turbo Boost technology in Intel Core microarchitecture (Nehalem)
based processors, November 2008.
[35] E. İpek, M. Kırman, N. Kırman, and J. F. Martínez. Core fusion: Accommodating software
diversity in chip multiprocessors. In International Symposium on Computer Architecture,
pages 186–197, May 2007.
[36] C. Kim, S. Sethumadhavan, M. S. Govindan, N. Ranganathan, D. Gulati, D. Burger, and
S. Keckler. Composable lightweight processors. In International Symposium on Microar-
chitecture, pages 381–393, December 2007.
[37] J. Cong and K. Minkovich. Mapping for better than worst-case delays in LUT-based FPGA
designs. In International Symposium on Field Programmable Gate Arrays, pages 56–64,
February 2008.
[38] S. Krishnamurthy, S. Paul, and S. Bhunia. Adaptation to temperature-induced delay varia-
tions in logic circuits using low-overhead online delay calibration. In International Sympo-
sium on Quality Electronic Design, pages 755–760, March 2007.
[39] V. Krishnan and J. Torrellas. A Chip-Multiprocessor Architecture with Speculative Multi-
threading. IEEE Transactions on Computers, 48(9):866–880, September 1999.
[40] R. Kumar, D. Tullsen, P. Ranganathan, N. Jouppi, and K. Farkas. Single-ISA heterogeneous
multi-core architectures for multithreaded workload performance. In International Sympo-
sium on Computer Architecture, pages 64–75, June 2004.
[41] X. Liang and D. Brooks. Mitigating the impact of process variations on CPU register file
and execution units. In International Symposium on Microarchitecture, pages 504–514,
December 2006.
[42] T. Liu and S. Lu. Performance improvement with circuit-level speculation. In International
Symposium on Microarchitecture, pages 348–355, December 2000.
[43] N. Madan and R. Balasubramonian. Power-efficient approaches to reliability. Technical
Report UUCS-05-010, University of Utah School of Computing, December 2005.
[44] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg,
F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. IEEE
Computer, 35(2):50–58, February 2002.
[45] D. Marculescu and E. Talpes. Variability and energy awareness: A microarchitecture-level
perspective. In Design Automation Conference, pages 11–16, June 2005.
[46] R. McGowen, C. A. Poirier, C. Bostak, J. Ignowski, M. Millican, W. H. Parks, and S. Naf-
fziger. Power and temperature control on a 90-nm Itanium family processor. IEEE Journal
of Solid State Circuits, 41(1):229–237, January 2006.
[47] F. Mesa-Martinez and J. Renau. Effective optimistic-checker tandem core design through
architectural pruning. In International Symposium on Microarchitecture, pages 236–248,
December 2007.
[48] P. Michaud, A. Seznec, and D. Fetis. A study of thread migration in temperature-
constrained multicores. ACM Transactions on Architecture and Code Optimization, 4(2),
June 2007.
[49] O. Mutlu, H. Kim, D. Armstrong, and Y. N. Patt. An analysis of the performance impact of
wrong-path memory references on out-of-order and runahead execution processors. IEEE
Transactions on Computers, 54(12):1556–1571, December 2005.
[50] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer
Journal, 7(4):308–313, January 1965.
[51] B. C. Paul, K. Kang, H. Kufluoglu, M. A. Alam, and K. Roy. Temporal performance degra-
dation under NBTI: Estimation and design for improved reliability of nanoscale circuits. In
Design, Automation, and Test in Europe, pages 1–6, March 2006.
[52] M. Powell, M. Gomaa, and T. N. Vijaykumar. Heat-and-run: Leveraging SMT and CMP
to manage power density through the operating system. In International Conference on
Architectural Support for Programming Languages and Operating Systems, pages 260–270,
October 2004.
[53] Z. Purser, K. Sundaramoorthy, and E. Rotenberg. Slipstream memory hierarchies. Tech-
nical Report CESR-TR-02-3, North Carolina State University Department of Electrical and
Computer Engineering, February 2002.
[54] J. Renau, K. Strauss, L. Ceze, W. Liu, S. R. Sarangi, J. Tuck, and J. Torrellas. Thread-level
speculation on a CMP can be energy efficient. In International Conference on Supercom-
puting, pages 219–228, June 2005.
[55] T. Sakurai and R. Newton. Alpha-power law MOSFET model and its applications to CMOS
inverter delay and other formulas. IEEE Journal of Solid State Circuits, 25(2):584–594,
April 1990.
[56] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. Keckler, and
C. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture.
Computer Architecture News, 31(2):422–433, May 2003.
[57] S. R. Sarangi, B. Greskamp, R. Teodorescu, A. Tiwari, and J. Torrellas. VARIUS: A model
of process variation and resulting timing errors for microarchitects. IEEE Transactions on
Semiconductor Manufacturing, 21(1):3–13, February 2008.
[58] S. R. Sarangi, B. Greskamp, A. Tiwari, and J. Torrellas. EVAL: Utilizing processors with
variation-induced timing errors. In International Symposium on Microarchitecture, pages
423–434, November 2008.
[59] T. Sato and I. Arita. Constructive timing violation for improving energy efficiency. In Com-
pilers and Operating Systems for Low Power, pages 137–153, 2003.
[60] G. Semeraro, D. Albonesi, S. Dropsho, G. Magklis, S. Dwarkadas, and M. L. Scott. Dy-
namic frequency and voltage control for a multiple clock domain microarchitecture. In
International Symposium on Microarchitecture, pages 356–367, November 2002.
[61] G. Sery, S. Borkar, and V. De. Life is CMOS: Why chase the life after. In Design Automa-
tion Conference, pages 78–83, June 2002.
[62] T. Shanley and B. Colwell. The Unabridged Pentium 4. MindShare Inc., 2005.
[63] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan.
Temperature-aware microarchitecture. In International Symposium on Computer Archi-
tecture, pages 2–13, June 2003.
[64] J. Smolens, B. Gold, B. Falsafi, and J. Hoe. Reunion: Complexity-effective multicore re-
dundancy. In International Symposium on Microarchitecture, pages 223–234, December
2006.
[65] J. Smolens, B. Gold, J. Kim, B. Falsafi, J. Hoe, and A. Nowatzyk. Fingerprinting: Bounding
soft-error detection latency and bandwidth. In International Conference on Architectural
Support for Programming Languages and Operating Systems, pages 224–234, October
2004.
[66] J. Smolens and E. Chung. Architectural transplant, 2007.
http://transplant.sunsource.net/.
[67] G. Sohi, S. Breach, and T. N. Vijaykumar. Multiscalar processors. In International Symposium
on Computer Architecture, pages 414–425, June 1995.
[68] A. Srivastava, D. Sylvester, and D. Blaauw. Statistical Analysis and Optimization for VLSI:
Timing and Power. Springer, 2005.
[69] Sun Microsystems. OpenSPARC T1 RTL release 1.5. http://www.opensparc.net/opensparc-
t1/index.html.
[70] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream processors: Improving both
performance and fault tolerance. In International Conference on Architectural Support for
Programming Languages and Operating Systems, pages 257–268, November 2000.
[71] E. Takeda, C. Y. Yang, and A. Miura-Hamada. Hot-Carrier Effects in MOS Devices. Aca-
demic Press, 1995.
[72] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. Jouppi. CACTI 5.1. Technical Report
HPL-2008-20, Hewlett Packard Labs, April 2008.
[73] J. Tschanz, J. Kao, S. Narendra, R. Nair, D. Antoniadis, A. Chandrakasan, and V. De.
Adaptive body bias for reducing impacts of die-to-die and within-die parameter varia-
tions on microprocessor frequency and leakage. IEEE Journal of Solid State Circuits,
37(11):1396–1402, November 2002.
[74] S. Tyagi, M. Alavi, R. Bigwood, T. Bramblett, J. Brandenburg, W. Chen, B. Crew, M. Hus-
sein, P. Jacob, C. Kenyon, C. Lo, B. McIntyre, Z. Ma, P. Moon, P. Nguyen, L. Rumaner,
R. Schweinfurth, S. Sivakumar, M. Stettler, S. Thompson, B. Tufts, J. Xu, S. Yang, and
M. Bohr. A 130 nm generation logic technology featuring 70 nm transistors, dual Vt
transistors and 6 layers of Cu interconnects. In IEDM Technical Digest, pages 567–570, December
2000.
[75] A. Uht. Achieving typical delays in synchronous systems via timing error toleration. Tech-
nical Report 032000-0100, University of Rhode Island Department of Electrical and Com-
puter Engineering, March 2000.
[76] A. Uht. Going beyond worst-case specs with TEAtime. IEEE Computer, 37(3):51–56,
March 2004.
[77] X. Vera, O. Unsal, and A. Gonzalez. X-pipe: An adaptive resilient microarchitecture for
parameter variations. In Workshop on Architectural Support for Gigascale Integration, June
2006.
[78] J. Warnock, J. Keaty, J. Petrovick, J. Clabes, C. Kircher, B. Krauter, P. Restle, B. Zoric, and
C. Anderson. The circuit and physical design of the POWER4 microprocessor. IBM Journal
of Research and Development, 46(1):27–51, January 2002.
[79] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. HotLeakage: A
temperature-aware model of subthreshold and gate leakage for architects. Technical Report
CS-2003-05, University of Virginia, March 2003.
[80] H. Zhou. Dual-core execution: Building a highly scalable single-thread instruction window.
In International Conference on Parallel Architecture and Compilation Techniques, pages
231–242, September 2005.
[81] C. Zilles and G. Sohi. Master/slave speculative parallelization. In International Symposium
on Microarchitecture, pages 85–96, November 2002.
[82] R. Zlatanovici and B. Nikolic. Power–performance optimal 64-bit carry-lookahead adders.
In European Solid State Circuits Conference, pages 321–324, September 2003.
Author’s Biography
Brian was born in 1980 and entered directly into the engineering profession. He received his first
tools (a screwdriver and hammer) at the age of two and used them on anything within reach. In
third grade, he met Wayne Weise, a Bell Labs engineer who introduced him to electronics design
and nurtured the hobby by supplying hand-written lessons and parts kits. Through junior high
and high school, Brian tinkered with electronics and created many one-off gadgets.
He obtained a B.S. in Computer Engineering from Clemson University, where he also
received an introduction to research under Prof. Ron Sass. Inspired by this experience, he joined
the University of Illinois and completed an M.S. in Electrical Engineering and a Ph.D. in
Computer Science. His research has covered FPGA-based computing, modeling and mitigating
process variation, simplifying processor design and verification, and improving per-thread
performance through timing speculation.
After receiving his Ph.D., he joined D. E. Shaw Research to assist in high-speed molecular
dynamics simulation for biochemistry. Although research papers have been his main product for
the last six years, he still has an oscilloscope on his desk and knows how to use it.
