Hardware atomicity for compiler-directed control speculation by Neelakantam, Naveen
© 2011 by Naveen Neelakantam. All rights reserved.




Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2011
Urbana, Illinois
Doctoral Committee:
Professor Craig B. Zilles, Chair & Director of Research
Professor Vikram S. Adve
Professor Wen-mei W. Hwu
Doctor Ravi Rajwar, Intel Corporation
Doctor Suresh Srinivas, Intel Corporation
Professor Marc Snir
Abstract
This dissertation introduces the atomic region as a novel compiler abstraction which eases
the development of speculative compiler optimizations. As this dissertation will show,
speculation enables a compiler writer to exploit dynamically occurring opportunities which
would otherwise be difficult or even impossible to expose. Despite their potential, speculative
optimizations typically involve complex implementation and significant compiler re-engineering.
In comparison, the atomic region abstraction is simple to incorporate into a compiler
infrastructure and exposes speculative opportunity to existing, unmodified optimizations.
The utility of the atomic region abstraction largely derives from its use of hardware
atomicity: the execution of a region of code either completely, and as if all operations in
the region occurred at one instant, or not at all. Hardware atomicity is an architectural
primitive which provides software with a simple and intuitive model of execution, namely
the ability to either explicitly commit or roll back a region of code.
The atomic region abstraction leverages hardware atomicity to enable a compiler writer
to easily reason about and implement speculative optimization. In the atomic region
abstraction, the compiler encapsulates commonly executed regions of the program using the
hardware atomicity primitive. This permits the compiler to generate a speculative version of
the code where uncommonly executed code paths are completely removed such that they do
not need to be considered in (and hence do not constrain) a region’s optimization. Pruned
paths are converted into assert operations that trigger an abort in the uncommon case that
one of these paths is needed. On an abort, hardware reverts state back to the beginning of
the region and transfers control to a non-speculative version of the code.
Two implementations of the atomic region abstraction are also presented, the first of
which demonstrates potential for a 10-15% average performance improvement in the context
of a Java Virtual Machine, albeit running on simulated hardware. Incorporation of the
atomic region abstraction is shown to be a simple and, thereby, a cost-effective means for
exposing speculative optimization opportunities to the compiler.
The second implementation leverages a real system to demonstrate that at least some of
these gains are achievable in practice. By incorporating the atomic region abstraction into
the dynamic translator of the Transmeta Efficeon processor, practical concerns related to the
identification of speculative opportunities and adaptation to misspeculations are explored.
In the context of this real system, a straightforward implementation using simple control
mechanisms is shown to be sufficient to achieve a 3% average performance improvement.
Acknowledgments
Writing a dissertation is a personal journey, but it is not undertaken alone. Along the way,
I have been honored to gain the advice, feedback, encouragement, and support of a number
of people. Not all influenced the technical content of this dissertation, but all were essential
to its completion. To everyone involved, I would like to extend my heartfelt gratitude.
Throughout, I have benefited from the guidance and intellectual inspiration provided by
my adviser, Prof. Craig Zilles. His hard work and dedication to his students is clear for all
to see. It is something more to experience it firsthand. This dissertation, as well as the work
that fueled it, had as an embarkation point a stimulating and inviting research environment,
the creation of which can be attributed solely to Craig.
In this environment, I had the opportunity to observe the research process in an
exemplary form. Specifically, I watched Pierre Salverda develop, under Craig’s advisement,
his research ideas into several sound publications and a superb dissertation. Imitation is
the sincerest form of flattery, and in that regard I have striven to emulate the balance of
exploration and completeness that I observed in Pierre’s work.
I have also benefited from several important experiences that each shaped the path I
would follow. Early on, I had the pleasure of working closely with Prof. Sanjay Patel and
his research group. I gained much from those experiences, and my interests in computer
architecture truly began there.
Later, the committee for my first qualifying exam provided me with what, in hindsight,
has turned out to be an inflection point in my intellectual development. I have come to
believe that failure merely provides an opportunity for growth. It motivates change, it
motivates increased effort, and it motivates self-determination. I am proud of my failure;
it led me to all of my successes.
Along the way, I was fortunate for the many discussions and shared academic experiences
that the excellent students attending the University of Illinois have to offer. From reading
groups to ad-hoc exchanges at a whiteboard, these experiences shaped my intellectual
development in ways too numerous to mention. As peers, we also provided one another with
much-needed support, encouragement, and compassion. I will always reflect fondly on my
time at UIUC. There are surely numerous students to whom I owe my thanks, but I would like to
explicitly acknowledge Mayank Agarwal, Lee Baugh, Luis Ceze, Jeffrey Cook, Mike Fertig,
John Kelm, Alex Li, Pablo Ortego, Pradeep Ramachandran, Nicholas Riley, James Roberts,
Pierre Salverda, Francesco Spadini, Sam Stone, and Karin Strauss.
A special thanks also goes to Reverend Dr. Tipp Moseley, and the $924 bet that I lost
to him. In his own way, Tipp tried to encourage me to complete my dissertation more
quickly and move on to the next stages of my life. In the end, I am happy to have given this
dissertation the time I felt it needed, despite the growing sum of our bet.
I would like to thank my Ph.D. committee members, Prof. Craig Zilles, Prof. Vikram
Adve, Prof. Wen-mei Hwu, Dr. Ravi Rajwar, Dr. Suresh Srinivas, and, in particular,
Prof. Marc Snir, for their patience, attention to detail and helpful feedback. I would also
like to thank my managers at Intel, David Ditzel, Stephen Lee, and Suresh Srinivas, who
have always supported my academic pursuit. They simultaneously encouraged my ideas
and exposed me to the real world concerns that are often overlooked in our field. This
dissertation and my research benefited from the exposure.
Finally, I would like to thank my family and my friends for the unwavering support that
they have provided. Above all, I would like to thank Kyle Hartzell. A Ph.D. is not only a
very personal journey but is also a very selfish one. Kyle has excused my absences without
question, and she has always encouraged my efforts. I am a lucky man.
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Preface
Chapter 1 Introduction
Chapter 2 The Atomic Region Abstraction
2.1 Hardware Atomicity
2.2 Software Speculation Using Atomic Regions
Chapter 3 Limitations of Classical Optimization
3.1 Aggressive Compiler Optimization Opportunities
3.1.1 Partial Redundancy Elimination
3.1.2 Partial Dead Code Elimination
3.1.3 Procedure Inlining
3.1.4 Control Flow Restructuring
3.1.5 Partially Redundant Array Bounds Check Elimination
3.2 Opportunities in Multithreaded Systems
3.2.1 Synchronization Elimination
3.2.2 Synchronization Optimization
3.2.3 Constraints on Binary Optimization
Chapter 4 Speculative Compiler Optimization
4.1 Region Shape
4.2 Misspeculation Recovery
4.3 Speculation Strategy
4.3.1 Data Speculation
Chapter 5 Study of Control Bias in Integer Programs
5.1 Cost-benefit Tradeoff for Control Speculation
5.2 Previous Techniques for Detecting Branch Bias
5.3 Characterization of Changing Branches
5.4 Requirements for Robust Control Speculation
5.4.1 A Simple Effective Model
5.4.2 Reactive Model Performance
5.4.3 Sensitivity Analysis
Chapter 6 Atomic Regions for Managed Languages
6.1 Opportunities in Managed Languages
6.1.1 DaCapo Xalan example
6.1.2 DaCapo Jython example
6.2 Providing Hardware Atomicity
6.2.1 Checkpoints and Hardware Atomicity
6.2.2 Microarchitectural Implications
6.3 Forming and optimizing regions
6.4 Experimental method
6.5 Results
6.5.1 Understanding the Variation
6.5.2 Architectural Analysis of Atomic Regions
6.5.3 Microarchitectural sensitivity
6.5.4 Limitations of the existing compiler
6.6 Conclusion
Chapter 7 Atomic Regions for Dynamic Translation
7.1 SPECint 2000 Vortex example
7.2 SPECint 2000 GCC example
7.3 Background
7.3.1 Efficeon Processor Architecture
7.3.2 CMS Overview
7.4 Atomic Regions in CMS
7.4.1 Hardware Atomicity
7.4.2 Incorporating Atomic Regions into CMS
7.4.3 Monitoring Speculations
7.4.4 Eliminating Redundant Asserts
7.5 Evaluation
7.5.1 System Configuration
7.5.2 Experimental Results
7.6 Conclusions
Chapter 8 Atomic Region Memory Model
8.1 Formal Specification of Multithreaded Programs
8.2 Formal Specification of Atomic Regions
8.3 Constraints on Atomic Region Formation
8.4 Reorderings Permitted with Atomic Regions
8.4.1 Control Speculation and Safe Reordering
Chapter 9 Conclusion
References
List of Tables
5.1 Simulation data sets and run length.
5.2 Model Parameters.
5.3 Model Transition Data.
5.4 Model Sensitivity.
6.1 Baseline processor parameters
6.2 DaCapo benchmarks used in evaluation.
6.3 Atomic region statistics.
7.1 Efficeon atomicity primitives.
7.2 Evaluation system configuration
7.3 Atomic region statistics.
7.4 Static code statistics.
List of Figures
1 Normalized SPECint rate scores for Intel x86 processors.
2 Normalized SPECint scores for Intel x86 processors.
2.1 Software usage of hardware atomicity.
2.2 Atomic region formation by the compiler.
3.1 Example of an unsafe partial redundancy elimination opportunity.
3.2 Example of partial dead code elimination opportunities.
3.3 Example of subroutine inlining opportunities.
3.4 Example of partially-redundant conditional control flow.
3.5 Partial dead code elimination that requires control flow restructuring.
3.6 Partial redundancy elimination that requires control flow restructuring.
3.7 Fully and partially redundant bounds check elimination opportunities.
3.8 Example of optimizations disallowed by memory model constraints.
4.1 Example of trace scheduling bookkeeping complexity.
5.1 Speculative removal of biased branches versus misspeculation rate.
5.2 Five static branches with initially invariant behavior.
5.3 A finite-state machine model for branch behavior characterization.
5.4 Reactive control performs comparably with self-training.
5.5 Misprediction rate when a biased branch transitions from being biased.
6.1 An example Java method with hot and cold paths.
6.2 Compiler-based redundancy removal.
6.3 Atomic region optimization.
6.4 Complexity of Compiler Optimizations.
6.5 Region formation:
6.6 Performance analysis infrastructure.
6.7 Execution time speedups.
6.8 Micro-operation (uop) reduction.
6.9 Sensitivity to hardware atomicity implementation.
7.1 Potential for atomic region optimizations.
7.2 CMS baseline optimization.
7.3 Optimizations enabled by atomic regions.
7.4 Unbiased control flow in an atomic region.
7.5 Assert merge optimization.
7.6 The Efficeon architecture.
7.7 Atomic region example using the Efficeon atomicity primitives.
7.8 SPEC CPU2000 integer results.
8.1 Example reorderings in an atomic region that contains an unlock.
8.2 Example reorderings in an atomic region that contains a lock.
8.3 Control dependences restrict optimization across atomic region boundaries.
List of Abbreviations
CFG Control Flow Graph
CMS Code Morphing Software
DFG Dataflow Graph
DMA Direct Memory Access
DRLVM Dynamic Runtime Layer Virtual Machine
ILP Instruction Level Parallelism
ISA Instruction Set Architecture
IPC Instructions Per Cycle
IR Intermediate Representation
JIT Just-In-Time
JVM Java Virtual Machine
LIT Long Instruction Trace
LOC Lines Of Code
OOO Out-Of-Order
PC Program Counter
PDE Partial Dead-code Elimination
PRE Partial Redundancy Elimination
SLE Speculative Lock Elision
SMT Simultaneous Multithreading
SSA Static Single Assignment
TLP Thread Level Parallelism
TLB Translation Lookaside Buffer
VLIW Very Long Instruction Word
Preface
High-performance processors enable software innovation and increased program functionality.
Historically, software has relied on hardware to provide continual performance improvements
through technology, frequency, and architectural innovations. However, shrinking
power envelopes and diminishing returns from the pursuit of instruction-level parallelism
(ILP) have raised serious questions about where such performance will come from in the
future. The computer industry envisions a future of concurrency and increased core counts per
die. Unfortunately, the lack of a viable concurrent programming model for general-purpose
programs means that, at least for the foreseeable future, achieving high performance by
improving single-thread execution remains critical. Furthermore, efficiency in achieving
performance has become a first-class constraint, and hardware research has re-focused towards
efficient processor designs and will continue to innovate in that direction.
As a result, I believe that software, and in particular the compiler, will play an
increasingly critical role in efficiently enabling performance, especially as part of the trend
toward richer, safer, and more dynamic programming environments. Likewise, hardware will play a
complementary role by providing features to enable simpler and more powerful compiler
designs. To that end, this dissertation demonstrates that one such feature, hardware atomicity,
both improves the effectiveness of classical compiler optimizations and simplifies the
implementation of new compiler optimizations. In particular, hardware atomicity permits the
compiler to trivially and effectively employ speculative optimizations, an otherwise complex
undertaking.
Recent Trends in Processor Performance
The semiconductor manufacturing trend predicted by Gordon Moore in 1965 [78], the
oft-quoted Moore’s Law (that the number of transistors per integrated circuit will double every
two years¹), has been sustained by the semiconductor industry for the past four decades.
This doubling of transistor density has enabled designers of high-performance processors
to translate Moore’s Law into a similar performance trend in which processor performance
roughly doubled every eighteen months², equivalent to a growth rate of 60% per year.
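The link between an eighteen-month doubling period and a 60% annual growth rate follows from compounding: a quantity that doubles every d months grows by a factor of 2^(12/d) per year. The following is a minimal sketch of that arithmetic; the function name is illustrative and does not come from the dissertation.

```python
def annual_growth_rate(doubling_period_months: float) -> float:
    """Return the compound annual growth rate implied by a doubling period."""
    if doubling_period_months <= 0:
        raise ValueError("doubling period must be positive")
    # Doubling every d months means growing by a factor of 2**(12/d) each year.
    return 2.0 ** (12.0 / doubling_period_months) - 1.0


# Performance doubling every eighteen months (just under 60% per year).
print(f"{annual_growth_rate(18):.1%}")
# Transistor count doubling every two years (about 41% per year).
print(f"{annual_growth_rate(24):.1%}")
```

Under this calculation the eighteen-month doubling works out to slightly less than 60% per year, so the 60% figure is best read as a rounded value.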
Until recent years, these gains were achieved by improving the performance of an
individual processor core through a combination of frequency scaling and an increased number
of executed instructions per cycle (IPC). However, semiconductor scaling trends and the
quadratic circuit complexity of further exploiting ILP have now made it impractical to
sustain as rapid a pace of performance improvements per processor core. The computer industry
has instead chosen to increase the number of processor cores contained on a single die, thereby
relying on thread-level parallelism (TLP) to sustain historical performance trends.
As a demonstration, Figure 1 shows SPECint rate scores collected in recent years by
Intel for their x86 processors. All data was taken from the SPEC website. Scores through
2006 are from the SPECint 2000 benchmark suite [96], and scores after 2006 are from the
SPECint 2006 benchmark suite [97]. All scores have been normalized using a conversion
factor computed by comparing SPECint 2000 and 2006 scores of identical systems. Also
shown on the graph is a 60% per year improvement trendline and transition points where
core counts were increased or Simultaneous Multithreading (SMT) [100] was added in the
tested systems.
The SPECint rate score measures the throughput capabilities of a system by measuring
the time it takes to run multiple concurrent instances of each benchmark. Therefore, the
“rate” score benefits from increases in both single-thread performance and TLP. As the
figure shows, Intel processors have been able to sustain historical performance
trends by increasing core count and by adding multithreading features.
¹ His original prediction was that the number of transistors per integrated circuit would
double every year, but it was later revised [56].
² First mention of this trend is often attributed to Moore himself. However, Moore attributes
it to Dave House [56].
[Figure 1 plot omitted; label: Normalized SPECint rate performance]
Figure 1: Normalized SPECint rate scores for Intel x86 processors. The vertical
lines depict a change of core count or multithreading support in the measured systems.
Unfortunately, this shift in focus has a tradeoff. Whereas increasing the performance
of a single processor core enables unmodified programs to execute more quickly, increasing
the number of processor cores per die only improves the performance of parallel programs.
In general, this requires converting a single-thread program into a multithreaded, parallel
implementation, a far from trivial undertaking.
Figure 2 shows SPECint scores collected for Intel x86 processors over the past decade
to demonstrate recent single-thread performance trends. As in Figure 1, data was taken
from the SPEC website, all scores through 2006 are from the SPECint 2000 benchmark
suite, and all scores have been normalized using a conversion factor computed by comparing
SPECint 2000 and 2006 scores of identical systems. In addition to illustrative trendlines,
the figure also includes SPECint scores after performance benefits attributable to the compiler
have been factored away (using a conversion factor computed by comparing the performance
of identical hardware systems evaluated using different Intel compiler versions).
[Figure 2 plot omitted; legend: Normalized SPECint performance (w/o compiler benefits),
30% per year trend, 25% per year trend, 10% per year trend]
Figure 2: Normalized SPECint scores for Intel x86 processors. Also shown are
normalized scores once compiler-based performance benefits have been factored away and
approximated trend lines.
The figure shows that recent improvements in single-thread performance have indeed
fallen short of the 60% per year trend of the past. On the other hand, performance has not
stagnated and single-thread performance continues to improve at a rate of 30% per year.
Such a performance trend, roughly equivalent to a doubling in performance every three years,
is worth sustaining, especially considering its generality: single-thread performance benefits
both sequential and parallel applications.
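As a quick check on the growth rates discussed here, an annual rate r doubles performance in ln 2 / ln(1 + r) years. The sketch below applies that formula to the rates that appear in the text and in Figure 2; the function name is an illustrative assumption and does not come from the dissertation.

```python
import math


def doubling_time_years(annual_rate: float) -> float:
    """Years needed for performance to double at a compound annual growth rate."""
    if annual_rate <= 0:
        raise ValueError("annual growth rate must be positive")
    return math.log(2.0) / math.log(1.0 + annual_rate)


# Rates taken from the surrounding discussion: 60%, 30%, 25% and 10% per year.
for rate in (0.60, 0.30, 0.25, 0.10):
    print(f"{rate:.0%} per year doubles in {doubling_time_years(rate):.1f} years")
```

At 30% per year the doubling time is about 2.6 years, so the three-year figure is a conservative round number. At 10% per year it stretches to more than seven years.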
After adjusting SPECint scores by removing benefits provided by the compiler, it becomes
apparent that this recent trend in single-thread performance improvements results from
more than just changes in the hardware. More specifically, hardware changes alone appear
to have been providing a 25% per year improvement until 2007 and only a 10% per year
improvement afterward. If it were not for changes in the compiler, two-thirds of the
single-thread performance improvements achieved in the course of the past decade would have
been lost. The performance lost in the past two years would have been even more severe.
Therefore, the compiler has already become a key factor in providing sustained
performance improvements. Furthermore, it seems that in the past two years hardware has
become even more constrained in its ability to improve single-thread performance and Intel,
at least, has relied on compiler advancements to fill the gap. Some of these performance
improvements result from specialized hardware extensions, such as new vector instructions,
but even these require improvements in compiler technology, such as auto-vectorization, to
expose their performance potential. I believe this trend will continue and that the compiler
will play an increasingly central role in the quest for ever greater single-thread performance.
Factors Influencing Processor Performance
The recent emphasis by the computing industry on multicore processors and concurrent
programming has largely been born of necessity, not choice. There are two major reasons
for this paradigm shift: unfavorable technology trends and diminishing returns from ILP.
Technology trends: Improvements in semiconductor manufacturing technology have
historically enabled dramatic and consistent frequency increases every process generation [22].
Unfortunately, recent years have witnessed a fundamental shift in these trends for two major
reasons. First, wire delay is becoming an increasingly dominant and constant factor in
determining cycle time. Even if transistors were to continue to scale and exhibit reduced delays,
the delay of the wires used to interconnect them is not scaling proportionally. Second,
sub-threshold leakage has further constrained designs. Processors are designed to operate below
fixed power dissipation ceilings, and, in the past, transistor scaling enabled a reduction of
gate threshold voltages which thereby provided the power headroom necessary for increasing
the processor frequency. Recently, however, gate threshold voltages have been kept relatively
constant to keep sub-threshold leakage under control. The causes are quite different, but the
result is the same: transistor technology scaling can no longer enable dramatic increases in
processor frequency.
Diminishing returns from ILP: The pursuit of instruction-level parallelism drove many
of the architectural innovations of the past decades. This pursuit manifested itself in three
complementary ways, each sharing the same goal: execute an increasing number of
instructions concurrently and thereby reduce the effective latency of each individual instruction.
However, all three of these techniques are plagued by diminishing returns and have limited
ability to further improve performance. First, lengthening a processor’s pipeline depth is
increasingly difficult to accomplish without also lengthening performance-critical loops such
as the branch-mispredict penalty and load-to-use penalty [21]. Second, wide-issue processors
are infeasible due to the quadratic interconnect complexity of widening bypass and
scheduling logic and adding cache and register ports. Third, building larger scheduling windows
for out-of-order processors is difficult because doing so increases the size of already
timing-critical structures such as the re-order buffer, scheduler, load-store queue and register file.
Therefore, dramatic increases in ILP are unlikely to be achievable on future processors.
As a result of these trends, the computing industry has refocused away from
single-thread performance to multithreaded computing. The industry has chosen to invest growing
transistor budgets by incorporating multiple cores onto a single die and improving
single-core throughput with hardware multithreading [68,100].
However, I believe this hardware-centric view is overly pessimistic. There are two software
factors which bear mention: a growing interest in managed languages and dynamic binary
optimizations.
Managed Languages: Languages such as Java and C# have become pervasive over the
past decade. These languages are typically compiled to an intermediate binary
representation, or bytecode, and are executed inside of a virtual machine [35,70]. These languages are
managed because they require sophisticated runtime support from the virtual machine to
support features such as portability, garbage collection, dynamic class loading and reflection.
These features, in addition to others also found in non-managed languages such as bounds
checking and polymorphism, enable programmers to write expressive and maintainable
applications quickly and more easily. Furthermore, the underlying virtual machines have recently
been adapted to support dynamic languages such as Python, Ruby and JavaScript which
add yet more programmability features such as dynamic typing [28,60].
However, these features impose a performance cost on the managed language
implementation. To compensate, high performance implementations of managed languages use
just-in-time (JIT) compilation techniques to enable a runtime compiler to optimize for
common case behavior while still retaining programmability features. Though already
sophisticated, JIT compilers continue to introduce increasingly aggressive techniques to improve
performance. Efficient hardware features to ease managed language implementation or
enable more aggressive JIT compilation could enable significant single-thread performance
improvements, potentially avoiding the hardware complexity issues mentioned above.
Dynamic Binary Optimization: Several recent systems have integrated a dynamic
compiler, also known as a translator, to enable runtime program optimization. Similar to the
JIT compilers mentioned above, dynamic translators such as Transmeta’s Code Morphing
Software (CMS) [32], IA-32 EL [8] and the Godson x86 translator [52] also incorporate
sophisticated runtime optimizers to enable high-performance execution of x86 binaries on
dissimilar hardware. These systems are capable of identifying frequently executed pieces
of a running program, analyzing control flow and dataflow structure, optimizing for
observed common case behavior, and deploying optimized code into the running system. Here
too, the opportunity exists for single-thread performance improvement by providing efficient
hardware features to enable more effective runtime compiler optimizations.
Given these software factors, I believe that hardware architects should embrace the needs
of the language and nurture the optimization potential of a runtime compiler. Even with
limited hardware features and without the ability to adapt to runtime behavior, static
compilers have been instrumental in sustaining single-thread performance growth, particularly
in the past few years. It stands to reason that with more general hardware support and with
runtime information the compiler would be capable of continued performance improvements.
To that end, the focus of my dissertation is on one such feature: hardware atomicity. The
dissertation introduces hardware atomicity as a powerful feature which enables a compiler
to easily and effectively optimize for common case program behavior. It is viable in any
runtime optimization system, and I will describe how it was incorporated into both a JIT
compiler and dynamic binary optimizer.
To be clear, I do believe that industry and academia have made sound and pragmatic
decisions. If we are to sustain historical trends of 60% performance growth per year, then
multicore and multithreaded processors demand attention. That being said, I do not believe
that the quest for single-threaded performance has reached an impasse, and this dissertation




In his landmark paper “Compilers and Computer Architecture,” William Wulf identifies
three principles (regularity, orthogonality, and composability) that instruction sets should
adhere to in order to simplify compiler implementations, thereby improving the code quality
that is practically achievable [106]. Each time these principles are not observed, an additional
set of special cases must be considered during compilation in order to generate the best
possible code for a given program. While architectures that ignore these principles do not,
in theory, preclude the building of compilers that generate the highest performance code,
in practice the quality of code suffers as many compiler implementations will be unable to
justify the additional software complexity required.
Akin to the simplifying principles set forth by Wulf, I see hardware atomicity—the exe-
cution of a region of code either completely, and as if all operations in the region occurred
at one instant, or not at all—as a fundamental architectural feature that enables a range of
sophisticated uses by software. Hardware atomicity provides a simple and intuitive model
of execution which enables software writers to reason about implementations that would
otherwise be too complex to consider.
This dissertation focuses on one possible use of hardware atomicity: the atomic region.
An atomic region is a compiler abstraction in which hardware atomicity primitives are used
to encapsulate a program region being optimized. In doing so, the region appears to execute
completely and instantaneously or not at all. This permits the compiler to generate a specu-
lative version of the code where uncommonly executed code paths are completely removed,
so that they need not be considered in (and hence do not constrain) a region’s optimization.
If one of these pruned paths needs to be executed, the region will be aborted—reverting
back to the state at the beginning of the region—and control will be transferred to a non-
speculative version of the code.
Speculative optimizations are important for achieving high performance in many inte-
ger and enterprise applications because the control flow intensive nature of these programs
prevents non-speculative compiler approaches from generating efficient code. In fact, the
presence of frequent control flow can be a significant inhibitor of compiler optimizations,
even when a significant fraction of the control flow is strongly-biased and compilation is
performed with an accurate profile.
This dissertation will demonstrate both the ability of atomic regions to expose speculative
opportunity and the simplicity with which the atomic region abstraction can be incorporated
into software optimization frameworks. The atomic region abstraction not only exposes
speculative opportunities to unmodified classical compiler optimizations but can also form
the basis of new optimizations, some of which would otherwise be prohibitively difficult to
implement and others that require atomicity guarantees. In Chapter 2, I will describe
how the atomic region abstraction is used by software and the necessary requirements of a
hardware atomicity implementation.
In Chapter 3, I will present an informal survey of optimization opportunities that con-
found classical compiler techniques. Many specific strategies have been proposed to exploit
these opportunities but they are often computationally complex and non-trivial to imple-
ment. The atomic region abstraction offers the ability to exploit the practically occurring
instances of these opportunities.
At a cursory level, the atomic region abstraction shares many similarities with previously
proposed speculative optimization opportunities. However, by characterizing the atomic
region abstraction and previously proposed speculative optimization according to three
dimensions—region shape, misspeculation recovery, and speculation strategy—Chapter 4
will show that atomic regions provide a novel, powerful, and simple tool to the compiler
writer.
Nevertheless, the utility of the atomic region abstraction depends upon the occurrence
of strongly-biased control flow in programs. The manner and frequency of such occurrences
is the subject of Chapter 5. As the chapter will show, biased control flow is common but
can be short lived. Exploiting biased control flow therefore requires mechanisms to react
to changing program behaviors. Reacting to changing program behaviors also enables a
dynamic compilation system to identify biased control flow as effectively as, and sometimes
better than, a self-trained static compilation system.
The simplicity of the atomic region abstraction makes it well suited for a variety of com-
piler frameworks and processor architectures. In demonstration of this point, I incorporated
the atomic region abstraction into two different dynamic compilation systems.
Chapter 6 describes how the atomic region abstraction was incorporated into the JIT
compiler of a Java virtual machine (JVM) for an out-of-order processor. Evaluating the
resulting hardware and software co-designed system required developing a new simulation
methodology; a description of this methodology is provided as a secondary contribution of
this dissertation.
Chapter 7 describes how the atomic region abstraction was also incorporated into the
dynamic binary translator of a very long instruction word (VLIW) processor. The translator
is evaluated on a real system, and the chapter also describes an implementation of the
reactive control mechanisms called for in Chapter 5. The control mechanism described is
sufficient to achieve speedups on a real system.
The execution guarantees of the atomic region abstraction also enable its use in multi-
processor systems. However, any such use must still satisfy the constraints specified by the
memory consistency model of a system. Chapter 8 formally specifies the execution semantics
of the atomic region abstraction. It also shows that the atomic region is compatible with
modern memory consistency models and enables optimization.
A retrospective of this work is provided in Chapter 9, including a discussion of open
questions left for future research.
Chapter 2
The Atomic Region Abstraction†
The central focus of this dissertation is the atomic region and its effectiveness as an ab-
straction for speculative compiler optimization. The atomic region abstraction simplifies
many optimizations, and this simplification derives from the execution guarantees and clean
interface provided by hardware.
In this chapter, I describe these hardware requirements, which are collectively referred
to as hardware atomicity, and mention potential implementations that satisfy them. I then
describe how a software optimizer leverages hardware atomicity to implement the atomic
region abstraction and provide an abstract example of the optimization opportunities
exposed.
2.1 Hardware Atomicity
An implementation of hardware atomicity must satisfy three important criteria. First, hard-
ware must provide the illusion of atomic execution—that a region of code either executes
completely and instantaneously or does not execute at all. Second, hardware must provide
this illusion without introducing performance overheads in the common case. Third, a set
of simple hardware atomicity primitives should be provided that enables intuitive software
control.
Providing atomic execution: The illusion of atomic execution provides important
guarantees to software. Foremost, it guarantees that a region of code, as specified by software,
will either completely execute or will appear as if it never executed. This enables trivial
recovery in case of a misspeculation because software simply relies on hardware to discard
all updates from the current region and then restarts execution elsewhere (e.g., in a
non-speculative version of the same region).
†The content of this chapter derives from work published in the January 2008 issue of IEEE Micro [84].
In addition, hardware must guarantee that the same region of code will appear to execute
instantaneously or, in case of a misspeculation, will have no observable effects. This enables
a software optimizer to easily satisfy the memory ordering rules of the system. To maintain
this illusion, hardware must not permit other processors or devices to observe any of the
effects of an atomically executed region until it has committed (see Chapter 8). Likewise,
none of the effects of a region should be observable in the case of a misspeculation.
The implementations discussed in this dissertation provide these guarantees through the
following steps: 1) creating a register checkpoint upon entering the region, 2) tracking all
memory addresses accessed by instructions in the region, 3) buffering all updates performed
by these instructions, 4) using an ownership-based cache coherence protocol to detect con-
flicting accesses from other agents, 5) discarding updates on a conflict, and 6) committing
the updates in the cache atomically.
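The six steps above can be illustrated with a small software model. The following Python sketch is only a toy under stated assumptions: the class `AtomicRegionModel`, its method names, the dictionary standing in for the cache, and the default value of 0 for untouched addresses are all hypothetical, and it does not describe either hardware implementation explored in this dissertation.

```python
class RegionAborted(Exception):
    """Raised when the modeled region is rolled back."""


class AtomicRegionModel:
    """Toy single-core model of the six steps; all names are hypothetical."""

    def __init__(self, registers, memory):
        self.registers = registers   # architectural register state
        self.memory = memory         # committed memory (stands in for the cache)
        self.checkpoint = None
        self.read_set = set()
        self.write_buffer = {}

    def begin(self):
        # Step 1: checkpoint the registers on region entry.
        self.checkpoint = dict(self.registers)
        self.read_set.clear()
        self.write_buffer.clear()

    def load(self, addr):
        # Step 2: track every address the region reads.
        self.read_set.add(addr)
        # The region must observe its own buffered stores.
        return self.write_buffer.get(addr, self.memory.get(addr, 0))

    def store(self, addr, value):
        # Steps 2 and 3: track the address and buffer the update.
        self.write_buffer[addr] = value

    def remote_write(self, addr, value):
        # Step 4: a write by another agent conflicts if it touches an
        # address the region has read or written (remote reads of written
        # addresses would also conflict, but are not modeled here).
        conflict = self.checkpoint is not None and (
            addr in self.read_set or addr in self.write_buffer)
        self.memory[addr] = value
        if conflict:
            self.abort()

    def abort(self):
        # Step 5: discard buffered updates and restore the checkpoint.
        self.registers.clear()
        self.registers.update(self.checkpoint)
        self.write_buffer.clear()
        self.read_set.clear()
        self.checkpoint = None
        raise RegionAborted()

    def commit(self):
        # Step 6: make all buffered updates visible at once.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()
        self.read_set.clear()
        self.checkpoint = None
```

In this model a store becomes visible in `memory` only at `commit()`, and a conflicting `remote_write()` restores the register checkpoint before raising `RegionAborted`, which mirrors the all-or-nothing behavior described above.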
The mechanisms necessary for each of these steps have much in common with prior
hardware proposals ranging from efficient management of resources in out-of-order proces-
sors [4, 29, 55, 75] to hardware transactional memory [48, 67]. A number of implementations
are possible, of which this dissertation explores two. The first is based on hardware similar
to that proposed for speculative lock elision [88] and is described in Section 6.2. The sec-
ond leverages hardware already provided by the Transmeta Efficeon processor [89,90] and is
described in Section 7.3.1.
Good common case performance: Though various implementations of atomic execution
are possible, not all are suitable for the atomic region abstraction. Because the atomic region
abstraction is intended to improve single-thread performance, hardware must introduce as
little performance overhead as possible. Commonly occurring overheads could overwhelm
the performance opportunities exposed by the abstraction.
Most importantly, taking and committing atomic region checkpoints should incur minimal
overhead. The speculative opportunities exposed by the atomic region abstraction may only
provide a small performance improvement on each execution of a region; if the overhead of
using the atomic execution hardware is significant, the net result could be a performance
penalty.
Fortunately, some aspects of the atomic execution hardware are tolerant of performance
overhead. For example, the atomic region abstraction is intended to exploit highly-biased
opportunities, and misspeculations are assumed to be rare. Therefore, a hardware rollback
(and the corresponding redirection to non-speculative code) can tolerate some performance
overhead.
Likewise, a hardware implementation of atomic execution need not support all possible
uses. Best effort hardware is sufficient, as long as it covers common uses of the atomic
region abstraction. For example, hardware does not need to support unbounded region
sizes, I/O operations or exceptions. In these situations, hardware is expected to implicitly
abort the region. This hardware flexibility is afforded by software control mechanisms that
rein in excessive misspeculation and also enable software to reoptimize regions that exceed
hardware support.
Simple primitives: Atomic execution hardware should ideally be exposed using a set
of simple and concise primitives. This simplicity has two purposes: it eases the design
of the software that must use it, and it provides an interface to hardware that allows for
implementation flexibility.
These primitives should be both intuitive and general. For example, hardware could
expose three simple instructions: one for beginning an atomic region, another for completing
it, and a final instruction to abort the region in case of a misspeculation:
• aregion begin <alternate PC>. This instruction signals the start of a speculatively-
optimized region and creates a recovery checkpoint. Subsequent register and memory
updates are speculative. The alternate PC (program counter) specifies the code address
at which execution will resume after an abort.
• aregion end. This instruction ends the region and atomically commits all speculative
register and memory updates.
• aregion abort. This instruction permits software to explicitly roll back to the
previously taken recovery checkpoint by discarding all speculative register and memory
updates. Execution resumes at the alternate PC specified by the most recently
executed aregion begin.
Though intended for the atomic region abstraction, these instructions merely manage
the underlying mechanisms of atomic execution. They presume knowledge of neither how
software might utilize nor how hardware might implement atomic execution. Their generality
provides design flexibility to both hardware and software.
Figure 2.1 illustrates the use of these new instructions. Causes for atomic region aborts
are communicated to the software via two additional registers. The first register encodes
the reasons for an abort (e.g., explicit abort, interrupt, data conflict, exception, etc.). The
second register records the program counter of the instruction responsible for an abort (if
any). This information enables software to diagnose the cause of aborts and adaptively
recompile when necessary.
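The usage pattern described above (begin a region, abort on a failed speculation, resume at the alternate PC, and consult the abort information) can be modeled in software. The Python sketch below is an analogy only: the names `Abort`, `aregion_abort`, `execute_region`, the example address `0x40`, and the dictionary-based machine state are assumptions introduced for illustration and are not part of the proposed instruction set.

```python
import copy


class Abort(Exception):
    """Models a region abort; fields stand in for the two abort registers."""

    def __init__(self, reason, pc):
        super().__init__(reason)
        self.reason = reason   # models the abort-reason register
        self.pc = pc           # models the abort-PC register


def aregion_abort(pc):
    """Models an explicit abort executed at the (hypothetical) address pc."""
    raise Abort("explicit", pc)


def execute_region(state, speculative, alt_code, abort_log):
    """Run `speculative` atomically; on abort, resume at `alt_code`."""
    checkpoint = copy.deepcopy(state)       # region begin: take a checkpoint
    try:
        speculative(state)
    except Abort as abort:
        state.clear()                       # discard speculative updates
        state.update(checkpoint)
        abort_log.append((abort.reason, abort.pc))
        alt_code(state)                     # resume at the alternate PC
        return False
    return True                             # region end: updates are kept


def speculative_increment(state):
    # Speculative version: the saturating cold path has been removed and
    # replaced by an abort condition.
    state["count"] += 1
    if state["count"] > state["limit"]:
        aregion_abort(pc=0x40)


def general_increment(state):
    # Non-speculative version that handles every path.
    state["count"] = min(state["count"] + 1, state["limit"])
```

A runtime system could inspect `abort_log` to decide when a region aborts often enough to warrant recompilation; the policy for doing so is outside the scope of this sketch.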
2.2 Software Speculation Using Atomic Regions
This section describes how the compiler exploits hardware atomic regions to improve the
quality of generated code, using an illustrative but representative example. Figure 2.2(a) shows
the control flow graphs (CFGs) of two example methods that depict a common Java idiom:
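[Figure 2.1 listing not recoverable from the source. Surviving fragments: the comment "// consult abort_info and", the outcome labels "speculation: success / failure", and the label "alt_code:".]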
Figure 2.1: Software usage of hardware atomicity. If a speculation succeeds, no abort
conditions will be invoked and the execution will reach an aregion end that commits the
atomic region. Speculation fails when an abort condition evaluates to true, causing a branch
to be taken to an unconditional abort instruction. The abort instruction restores regis-
ter state, invalidates speculatively written cache lines, and transfers control to the address
specified by the aregion begin instruction.
foo to expose optimization opportunities. The compiler has annotated the CFG edges for
both methods with frequencies derived from an edge profile, as most runtime optimizers
do prior to optimization. The monitor enter and monitor exit intrinsics in the method
bar provide the mutual exclusion property that the Java synchronized keyword requires.
Basic block Y contains an operation that could incur an exception and therefore has an
outgoing exception edge that the profile indicates has never been taken. The exception edge
is connected to another monitor exit intrinsic, which will free the synchronization lock
before invoking exception dispatch.
The example depicts several common optimization obstacles. In both methods, ex-
tremely rare execution paths (C→B, X→call, and Y→exception) limit the optimization
of more frequently executed paths because these infrequent paths tend to use or redefine
variables that obscure optimization opportunity. The static size of infrequently executed
Figure 2.2: Atomic region formation by the compiler. (a) Initial control flow graphs
(CFGs) for the methods annotated with a control flow profile, (b) a replicated version of the
hot paths after partial inlining and trimming of cold paths, (c) and the final CFG for the
method foo.
monitor enter and monitor exit intrinsics will expand into the relatively complex CFGs
required of high-performance lock implementations [61]. Furthermore, monitor enter and
monitor exit are synchronization actions in the Java memory model, which restrict the
compiler’s ability to perform optimizations across them [74].
The atomic region abstraction provides the compiler writer with a simple, yet effective
means of overcoming these common performance obstacles. It enables the compiler writer to
reason about exploiting available optimization opportunity without being concerned about
infrequently executed paths or multiprocessor memory models. In essence, the abstraction
provides a compiler with the ability to “undo” a region of speculatively optimized code.
For example, in Figure 2.2, the compiler can replace each infrequently executed path with
an operation to assert¹ that the path is not followed. Similarly, the compiler can replace the
balanced pair of synchronization actions with an operation to assert that no other thread
holds the lock. The all-or-nothing property of atomic regions aids both transformations; if
any assertion does not hold, the processor discards the updates performed by the
speculative optimizations, making it appear as if the speculative execution had never happened.
Furthermore, the instantaneous commit provided by an atomic region allows the compiler
to remove synchronization actions (i.e., a software controlled implementation of speculative
lock elision [88]) because hardware prevents illegal interleavings of memory operations from
other threads.
¹An assert can be implemented with an unconditional abort, as depicted in Figure 2.1.
To exploit these opportunities, the compiler first selects an optimization region as de-
picted by the CFG in Figure 2.2(b). The region is a subset of the CFG in Figure 2.2(a),
where bar has been partially inlined (i.e., infrequent paths have not been inlined) into foo
and infrequently executed paths have been removed. The compiler converts the optimization
region into an atomic region by duplicating the CFG and placing an aregion begin at each
region entry and an aregion end at each region exit. The atomic region may contain arbi-
trary control flow, such as the depicted branch-over, but any path through the region must
encounter a single balanced pair of aregion begin and aregion end instructions. Specifi-
cally, any path through the region must encounter exactly one aregion begin and exactly
one aregion end.
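The balance requirement stated above can be checked mechanically. The following Python sketch is one possible checker, written under assumptions: the CFG encoding (a successor dictionary plus a dictionary marking blocks that hold a begin or end) and the function name `paths_are_balanced` are hypothetical, and aborts are deliberately not modeled as CFG edges.

```python
def paths_are_balanced(succs, ops, entry):
    """Return True if every path from `entry` to an exit executes exactly
    one "begin" followed by exactly one "end".

    succs: dict mapping each block to a list of successor blocks
    ops:   dict mapping a block to "begin" or "end" (other blocks omitted)
    """
    BEFORE, INSIDE, AFTER = 0, 1, 2
    seen = set()
    worklist = [(entry, BEFORE)]
    while worklist:
        block, state = worklist.pop()
        if (block, state) in seen:
            continue                    # already explored in this state
        seen.add((block, state))
        op = ops.get(block)
        if op == "begin":
            if state != BEFORE:
                return False            # nested or repeated begin
            state = INSIDE
        elif op == "end":
            if state != INSIDE:
                return False            # end without a matching begin
            state = AFTER
        successors = succs.get(block, [])
        if not successors and state != AFTER:
            return False                # path exits without committing
        worklist.extend((s, state) for s in successors)
    return True
```

Because each (block, state) pair is visited at most once, the check terminates on regions that contain loops, such as arbitrary control flow between the begin and the end.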
The resulting atomic region is reconnected into the flow graph for foo by pointing all of
A’s in-edges to the aregion begin and adding an exception edge from the aregion begin
to A. For each infrequent path that has been removed, the compiler ensures correctness
by replacing the branch to the removed path with an assert operation to verify that the
expected path was taken. By traversing the CFG for this region, the compiler can identify
the balanced pair of monitor enter and monitor exit intrinsics and convert them into an
assertion that the monitor is free. Figure 2.2(c) shows the atomic region formed and the
complete CFG for a speculatively optimized foo.
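The selection of paths to remove can be sketched as a simple pass over an edge profile. The Python below is illustrative only: the function name `split_edges_by_bias`, the edge-count encoding, and especially the 1% cutoff are assumptions made for this sketch and are not the heuristics used by the compilers evaluated later in this dissertation.

```python
def split_edges_by_bias(edge_counts, cold_fraction=0.01):
    """Partition profiled CFG edges into hot edges and asserted-cold edges.

    edge_counts:   dict mapping (src, dst) to a profiled execution count
    cold_fraction: hypothetical cutoff; an edge is cold if it accounts for
                   less than this share of its source block's out-flow
    """
    out_flow = {}
    for (src, _dst), count in edge_counts.items():
        out_flow[src] = out_flow.get(src, 0) + count

    hot_edges, asserts = [], []
    for (src, dst), count in sorted(edge_counts.items()):
        total = out_flow[src]
        if total > 0 and count / total < cold_fraction:
            # The branch to dst would be replaced by an assert that the
            # edge is not taken.
            asserts.append((src, dst))
        else:
            hot_edges.append((src, dst))
    return hot_edges, asserts
```

Edges leaving a block that the profile never observed are kept rather than pruned, since the profile offers no evidence about their bias.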
The simplified CFG contained within the atomic region enables the compiler to transform
and schedule the common program paths without having to generate compensation code.
Importantly, the compiler uses the existing exception handling mechanisms in its interme-
diate representation to represent an atomic region. To the compiler, an atomic region abort
appears as if an exception occurred in the aregion begin block and transferred control to
block A. By inserting the exception edge between these two blocks, the compiler preserves
the values needed by the abort path and performs register allocation appropriately.
The assert operations are a simple addition to the intermediate representation and are
represented as arithmetic operations that have source values but produce no output. In
particular, an assert is neither a control nor an exception producing operation in the compiler
intermediate representation: if an assert triggers an abort, hardware will roll back such that
an exception appears to occur at the aregion begin. As a result, existing compiler passes
can eliminate redundant assertions and schedule them. By using existing compiler mecha-
nisms to represent atomic regions, existing optimization passes do not require modification
to exploit the exposed speculative optimization opportunity.
Chapter 3
Limitations of Classical Optimization
Throughout this dissertation, I use the term classical optimization to refer to compiler
techniques which consider all possible program paths in their analysis. These techniques are
fundamentally non-speculative in that they generate code which is safe for all program paths
and does not need recovery code to maintain correctness. They rely instead on proofs and
heuristics (possibly augmented by profiling information) to identify and exploit optimization
opportunities. These techniques are commonly accepted into standard practice and form
an essential part of most compilers. Classical optimizations readily exploit many of the
optimization opportunities available in a program, but this chapter will show that other
opportunities remain.
I first present a set of examples which demonstrate optimization opportunities that are
often difficult to exploit using classical compiler techniques. For each example, I will refer
to previously proposed techniques to exploit the available opportunities. In a few cases, a
prior proposal has proven to be effective and is used in standard practice. However, many
of the examples remain difficult to exploit.
I then describe optimization obstacles that are introduced by multiprocessor memory
consistency models. Memory model requirements of both modern programming languages
and hardware systems place stringent requirements upon the observability of memory values
specified by a programmer and by an application binary. Operations that might be safe to
remove or reorder in a single-thread program might not be safe to remove or reorder in a
multithreaded program.
To improve readability, abstract and simplified examples are used throughout this chap-
ter. In each of the examples, a control flow graph is annotated with branch frequencies
indicating a possible hot path. The intent is simply to suggest that speculatively optimizing
for the common case has the potential to exploit each opportunity.
In practice, many of the opportunities discussed are intertwined with one another, serving
only to increase the difficulty of exploiting them using classical techniques. In other words,
realistic examples could strengthen the arguments being made but with significant loss of
clarity. Real examples will be presented in Chapter 6 and Chapter 7.
3.1 Aggressive Compiler Optimization Opportunities
In the discussion that follows, five types of opportunities are explored. In each case, previous
proposals for exploiting these opportunities are presented, including explanations as to why
many have not been adopted into the standard optimization vernacular.
3.1.1 Partial Redundancy Elimination
Partial redundancy elimination (PRE) is a mature topic with several well understood classi-
cal solutions [79]. Optimizations such as lazy code motion [65] and SSAPRE [63] are able to
efficiently identify and exploit all partial redundancies that can be removed safely. In fact,
these techniques produce both computationally optimal and lifetime optimal placements
(i.e., no further redundancies can be safely removed and they do not introduce any more
register pressure than necessary).
However, they only are able to perform safe reorderings, i.e., introduce computation onto
paths on which they would have otherwise already been computed [62]. The necessity of
this restriction is demonstrated in Figure 3.1. Hoisting the load in block D eliminates the
partial redundancy in the loop but may lead to incorrect program behavior. Specifically, it
may introduce an invalid address exception onto the path A→B→C→D.
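The hazard can be reproduced in miniature. In the Python sketch below, a dictionary lookup stands in for a load and a `KeyError` stands in for an invalid address exception; the functions `load`, `original`, and `hoisted` are hypothetical and are not a reconstruction of the code in Figure 3.1.

```python
def load(memory, addr):
    """Models a load that faults (KeyError) on an invalid address."""
    return memory[addr]


def original(memory, addr, values):
    total = 0
    for v in values:
        if v < 10:                        # frequently taken path
            total += load(memory, addr)   # partially redundant load
        else:                             # rarely taken path
            total += v
    return total


def hoisted(memory, addr, values):
    y = load(memory, addr)                # unsafe: now executes on every path
    total = 0
    for v in values:
        total += y if v < 10 else v
    return total
```

When the guarded path executes, both versions agree and `hoisted` performs a single load. When only the rare path executes and `addr` is invalid, `original` completes normally while `hoisted` faults, which is precisely the behavior a safe PRE algorithm must avoid introducing.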
The opportunity demonstrated in Figure 3.1 could be exploited with hardware support
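[Figure 3.1 graphic not recoverable from the source. Surviving fragments from panels (a) and (b): the edge annotations "T, 99%" and "F, 1%", the branch "if (a < 10)", the block labels B and C, and the statement "x = y".]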
Figure 3.1: Example of an unsafe partial redundancy elimination opportunity.
Hoisting the partially redundant load may introduce unexpected program behaviors such as
an invalid address exception.
for speculation. A mechanism that enables suppression of speculatively raised faults would
enable a compiler to hoist the potentially excepting load. Chapter 4 provides a thorough
discussion of speculation mechanisms.
Note that speculative partial redundancy elimination is not guaranteed to be computa-
tionally optimal—additional computations can be introduced into a program’s execution.
However, if guided by accurate profile information, speculative partial redundancy elimi-
nation can enable a program to execute fewer operations than non-speculative PRE [71].
Figure 3.1 and the remaining figures in this chapter incorporate precise profile information
in order to motivate the possibilities for speculative optimization.
3.1.2 Partial Dead Code Elimination
Figure 3.2 depicts the logical dual of partial redundancy elimination: partial dead-code
elimination (PDE). Shown in Figure 3.2(a) is a control flow graph in which several values
are only consumed in a subset of the possible paths, indicating that performance could be
[Figure 3.2 graphic not recoverable from the source. Panels (a) through (c) survive only as block labels A, B, C, and D and the statement fragments "a = a + 1", "c = a + b", "a = b + d", "c = a + 2", and "d = a + b".]
Figure 3.2: Example of partial dead code elimination opportunities. Classical re-
moval of all partially dead operations requires more complex algorithms and code duplication.
improved by eliminating these partially dead operations.
In order to render these operations fully dead (i.e., compute them only on paths on
which they are used) a software optimizer must sink (i.e., move in the direction of control
flow) them onto the cold paths. However, this is insufficient to eliminate all of the partially
dead operations as shown in Figure 3.2(b). A problem arises because of a dependence chain
through the partially dead operations, which can only be broken by inserting additional
copies into the flow graph as shown in Figure 3.2(c).
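The effect of sinking with copies can be illustrated by counting executed operations. The Python sketch below uses a small flow graph of its own: the block names, the `path` parameter that selects a path, and the helper `add` are assumptions for this illustration, and the code is not a reconstruction of Figure 3.2.

```python
def add(x, y, ops):
    ops[0] += 1                      # count one executed arithmetic operation
    return x + y


def original(a, b, d, path, ops):
    a = add(a, 1, ops)               # block A: a = a + 1
    c = add(a, b, ops)               # block A: c = a + b (depends on new a)
    if path == "C":
        return add(a, b, ops)        # block C: uses a only
    if path == "E":
        return c                     # block E: uses c
    a = add(b, d, ops)               # block D: overwrites a ...
    return add(a, 2, ops)            # block D: ... and c


def sunk_with_copies(a, b, d, path, ops):
    if path == "C":
        a = add(a, 1, ops)           # first copy of a = a + 1
        return add(a, b, ops)
    if path == "E":
        a = add(a, 1, ops)           # second copy of a = a + 1
        return add(a, b, ops)        # sunk copy of c = a + b
    a = add(b, d, ops)               # block D no longer pays for dead code
    return add(a, 2, ops)
```

Because `c = a + b` depends on `a = a + 1`, neither operation can leave block A unless both are duplicated onto each path that still needs them. In this sketch the path through D drops from four executed operations to two, at the cost of a second static copy of `a = a + 1`.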
This example intuitively shows how classical software optimization techniques can be
used to eliminate partially dead code, but belies their complexity and negative side ef-
fects. Specifically, a classical implementation of partial dead code elimination requires a
polynomial-time algorithm and can introduce significant code bloat.
For example, the partial dead code elimination algorithm introduced by Knoop et al. is
optimal in the sense that it eliminates all of the partially dead code that could be removed
without modifying the control flow or semantics of the program [66]. However, despite its
effectiveness the algorithm is essentially a bidirectional iterative dataflow pass (i.e., on each
iteration a backwards pass is first run followed by a forward pass). This has a worst-case
time complexity of O(n^4), and even using optimistic assumptions Knoop estimates the time
complexity to be O(n^2).
Furthermore, eliminating all partially dead operations requires inserting additional copies,
as demonstrated in Figure 3.2(c). The static code growth due to these additional copies is
bounded by O(b) where b is the number of basic blocks in the optimization region, which
using optimistic assumptions may be bounded by a constant factor [66]. However, these
additional copies not only bloat the static code size, but they effectively increase the compi-
lation cost of all subsequent optimizations (i.e., the variable component of many algorithms
increases). Therefore, even with optimistic assumptions the effect on overall compilation
time is expected to be significant. This additional cost in compilation time is a serious
deterrent to runtime optimization systems.
Within the context of dynamic binary optimization, the compilation cost becomes
prohibitive because the effective number of basic blocks in an optimization region is increased
by the need to maintain precise exceptions. Each potentially excepting instruction
effectively terminates a basic block by adding an exception edge to the CFG. If interrupts
are enabled within an optimization region, each instruction effectively resides in its own
basic block.
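The growth in the effective block count can be made concrete with a short sketch. The Python below splits a straight-line instruction sequence according to the two rules just described; the function name `effective_blocks`, the string encoding of instructions, and the `may_fault` predicate are assumptions made for this illustration.

```python
def effective_blocks(instructions, may_fault, interrupts_enabled=False):
    """Split straight-line code into effective basic blocks.

    instructions:       list of instructions (any hashable encoding)
    may_fault:          predicate returning True for potentially
                        excepting instructions
    interrupts_enabled: if True, every instruction ends its own block
    """
    blocks, current = [], []
    for inst in instructions:
        current.append(inst)
        if interrupts_enabled or may_fault(inst):
            blocks.append(current)   # the exception edge ends the block here
            current = []
    if current:
        blocks.append(current)
    return blocks
```

For a five-instruction sequence containing one load and one store, this yields three effective blocks with precise exceptions alone and five once interrupts are enabled, which is the value of n that a superlinear dataflow algorithm must then process.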
It may seem that both the time complexity and space expansion problems of Knoop’s
partial dead code elimination algorithm can be alleviated by reframing the optimization as a
form of global code motion [26]. Framing the problem in this manner trivially eliminates all
code bloat effects—code is relocated rather than copied—and enables linear time algorithms
(assuming that dominator tree and loop depth information has already been computed for
other optimization passes, as both typically are). However, such a formulation is insufficient
for removing all partially dead code and is merely capable of the optimization shown in
Figure 3.2(b).
Speculative optimization, and more specifically the atomic region abstraction, enables a
compiler to exploit dynamically dead opportunities with fewer copies (see Section 4.3). Par-
tially dead code which occurs in practice is simply a type of dynamically dead opportunity.
Whether or not such an opportunity is practical to exploit depends on the execution profile.
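[Figure 3.3 graphic not recoverable from the source. Surviving fragments: the load "r1 = ld [x]" in both panels, and in block F the store "st [z] = r2" in one panel and "st [z] = r1" in the other.]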
Figure 3.3: Example of subroutine inlining opportunities. Subroutine inlining removes
calling convention overheads and also increases the scope of global optimizations. However,
it typically increases static code size and overuse can degrade performance.
In other words, if either of the paths A→C or B→E in Figure 3.2 is rarely executed, then
all of the opportunities shown could be exploited by a combination of cutting cold paths and
global code motion. If both paths are rarely executed then the global code motion pass is
also unnecessary.
3.1.3 Procedure Inlining
The pervasive use of procedures in modern programs obscures many optimization oppor-
tunities. Shown in Figure 3.3(a) is an interprocedural control flow graph annotated with
dynamic branch biases. When invoked from the callsite shown, the procedure foo contains a
redundant operation that is obscured by the call boundary. Furthermore, the procedure call
incurs overheads such as saving and restoring registers as specified by the calling convention.
A classical approach to exploiting these types of optimization opportunities is procedure
inlining.
Procedure inlining essentially expands a called procedure by copying it into the flowgraph
of its caller as shown in Figure 3.3(b). In doing so, it obviates call related overheads and
exposes opportunities that would otherwise be obscured by the call boundary to global
optimization passes. However, this benefit comes with three costs: increased pressure on the
register allocator, increased compilation time and an increase in static code size.
First, the expansion caused by procedure inlining has the effect of increasing the number
of statements—and likewise the number of variables—contained in the optimization flow-
graph. This can stress the register allocator and result in suboptimal register allocations.
Sias, in his analysis of compiler optimization techniques for explicitly parallel instruction
computing (EPIC), found that procedure inlining can stress the register stack engine (a hard-
ware mechanism for managing register spills and fills) and thereby hinder performance [93].
Second, increasing the number of statements in the optimization flowgraph also increases
compilation time cost. The cost of many compiler optimizations grows super-linearly with
the number of statements in the flowgraph. To prevent compilation cost from growing out
of control, compilers use heuristics to tightly limit code expansion caused by inlining. In
designing their aggressive inliner, Ayers et al. found that limiting code expansion to 20%
enabled them to keep the total compilation time to within 200% of the total compilation
time with inlining disabled [6].
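A budget-limited inlining heuristic of this general kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Ayers et al.: the function select_inlines, the ranking by call frequency per added statement, and the example call sites are all hypothetical. Only the 20% expansion limit comes from the text above.

```python
def select_inlines(program_size, candidates, max_expansion=0.20):
    """Greedily pick call sites to inline under a code-expansion budget.

    candidates: list of (name, callee_size, call_frequency) tuples.
    Returns the names of the chosen call sites and the resulting size.
    """
    budget = program_size * max_expansion
    # Prefer call sites with the most dynamic calls per added statement.
    ranked = sorted(candidates, key=lambda c: c[2] / c[1], reverse=True)
    chosen, added = [], 0
    for name, size, _freq in ranked:
        if added + size <= budget:        # skip anything that busts the budget
            chosen.append(name)
            added += size
    return chosen, program_size + added

sites = [("a", 50, 1000), ("b", 150, 1200), ("c", 40, 10), ("d", 100, 900)]
chosen, new_size = select_inlines(1000, sites)
```

With a 1000-statement program, the budget admits at most 200 added statements, so the large callee at site b is rejected even though it is called most often.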
Although compilation time may be less of an issue for static compilers, it is of primary
concern in runtime optimization systems. In most runtime optimizers, extra time spent com-
piling must be recovered by producing and sufficiently executing higher performance code.
In their study of JVM inlining heuristics, Cavazos and O’Boyle show that an example inlin-
ing heuristic nearly always produces higher performance code but increases in compilation
cost occasionally result in a reduction of Java system performance [24].
Third, the code expansion incurred by procedure inlining also increases the overall static
size of the program being optimized. This increase in static program size, known as code
bloat, can be significant. Hank et al. show that aggressive inlining increases the static code
size by a ratio of 4.0x on average and by as much as 17.4x [46].
Increased static code size can cause performance degradation by increasing the number
of instruction cache misses. Similar to the heuristics used to control compilation time,
compilers may restrict code expansion to control detrimental side-effects in the instruction
cache. For example, the inliner in the IMPACT compiler is tuned to restrict code expansion
to a ratio of 2.0x and in practice expands the static code size of a program by roughly 1.5x.
Despite these restrictions, the increased static code size can still increase instruction cache
misses sufficiently to incur a net program slowdown [93].
Furthermore, increases in static code size may pose further problems to dynamic bi-
nary optimization systems. These systems place optimized code translations into software-
managed memory known as a translation cache. The translation cache is typically main-
tained in a fixed-size region of memory that must periodically be reclaimed to make room
for new translations [7, 32]. If code bloat is not restrained, the increased pressure upon the
translation cache could cause it to thrash. If a program’s entire working set of translations
does not fit in the translation cache, then performance of the system will suffer dramatically.
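The thrashing effect can be illustrated with a small simulation. The sketch below assumes a flush-when-full reclamation policy and made-up region sizes. Neither is taken from the systems cited above, and the function name count_translations is hypothetical.

```python
def count_translations(trace, capacity, sizes):
    """Count (re)translations for a flush-when-full translation cache.

    trace: sequence of region names in execution order.
    sizes: region name -> size of its translation.
    """
    cache, used, translations = set(), 0, 0
    for region in trace:
        if region in cache:
            continue                      # hit: reuse the existing translation
        if used + sizes[region] > capacity:
            cache.clear()                 # reclaim the whole cache
            used = 0
        cache.add(region)
        used += sizes[region]
        translations += 1
    return translations

trace = ["A", "B", "C"] * 100             # working set of three regions
small = {"A": 10, "B": 10, "C": 10}
bloated = {"A": 40, "B": 40, "C": 40}     # same regions after code bloat

fits = count_translations(trace, capacity=100, sizes=small)
thrashes = count_translations(trace, capacity=100, sizes=bloated)
```

When the working set fits, each region is translated once. When code bloat pushes the working set past the cache capacity, every execution of a region in this example triggers a retranslation.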
Speculative optimization techniques provide viable alternatives to classical procedure
inlining. Several of the techniques described in Chapter 4 enable a compiler to inline only
the commonly executed portions of a method into a callsite. In doing so, they expose similar
opportunities to classical inlining but with far less code expansion. Some of these techniques
are already common to JVM implementations, and the atomic region abstraction enables
yet further improvements.
3.1.4 Control Flow Restructuring
The opportunities depicted thus far can all be exploited without changing the logical
control flow structure of the program being optimized. Control flow operations may be
replicated, but they still occur in the same order as in the original program. However,
additional opportunities exist which cannot be exploited without enabling control flow graph
changes. Optimizations based on control flow restructuring—reordering and elimination
[Figure 3.4 graphic: control flow graphs with branches on (x < 10) and (x < 11), each biased T 99% / F 1%; blocks A, B', D', E', D'', E'', G, H, I]
Figure 3.4: Example of partially-redundant conditional control flow. (a) The control
flow operations are redundant along several of the possible paths, (b) but eliminating all
redundancies would require significant code duplication.
of partially redundant control flow operations—have been proposed which expose further
optimization opportunities [14, 16, 17, 38, 80, 91].
Unfortunately, as this section will show, prior proposals for identifying and exposing these
opportunities are both complex and time consuming. Furthermore, control flow restructuring
requires code duplication to maintain program correctness, which has already been shown
to introduce additional compilation costs and can degrade performance.
A simple opportunity for control flow restructuring is presented in Figure 3.4(a). It
depicts two opportunities to remove branches that depend on conditions that are partially
redundant with earlier branches. The first opportunity is a branch which, along one path,
reuses the exact condition computed for an earlier branch thereby making the reused condi-
tion partially redundant. The second opportunity is a branch condition that is not partially
redundant but will always be true if a prior branch condition is also true. In other words,
[Figure 3.5 graphic: panels (a) and (b); blocks A: if (x < 10), B: a = b · c, C: if (y < 10), D: a = b + d, E: a = b + a; branch biases T 99% / F 1%; panel (b) adds a duplicate block C']
Figure 3.5: Partial dead code elimination that requires control flow restructuring.
(a) A multiply operation is redundant along one of the possible paths through the region,
(b) and classical optimization of the redundancy requires restructuring the CFG.
the earlier branch condition subsumes the later branch condition [80].
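The two kinds of opportunity, exact reuse and subsumption, can be expressed as a single implication test. The sketch below is a deliberately narrow illustration: it handles only conditions of the form (variable < constant), assumes the variable is not redefined between the two branches, and uses a hypothetical function name. The algorithms in [16, 80] are considerably more general.

```python
def subsumes(earlier, later):
    """Return True if `earlier` being true guarantees `later` is true.

    Conditions are (variable, "<", integer_constant) tuples.
    """
    var_e, op_e, bound_e = earlier
    var_l, op_l, bound_l = later
    if (op_e, op_l) != ("<", "<") or var_e != var_l:
        return False                      # outside what this sketch handles
    return bound_e <= bound_l             # x < 10 implies x < 11

same = subsumes(("x", "<", 10), ("x", "<", 10))      # exact reuse
weaker = subsumes(("x", "<", 10), ("x", "<", 11))    # subsumed condition
stronger = subsumes(("x", "<", 11), ("x", "<", 10))  # not implied
```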
Prior work has introduced both intraprocedural and interprocedural algorithms which
can identify and exploit these conditional branch opportunities [16, 80]. Though effective,
they require complex polynomial time algorithms to identify optimization opportunities for
control flow restructuring and require code duplication to exploit them. Figure 3.4(b) shows
that exploiting even the example opportunities requires significant code duplication. Because
of these issues, control flow restructuring is of questionable practicality in a static compiler
and is simply untenable in a dynamic compilation framework. One of the authors of the
control flow restructuring works more recently viewed “restructuring as too expensive for a
dynamic compiler” [15].
Control flow restructuring can also expose additional opportunities for PDE and PRE.
Figure 3.5(a) first shows an opportunity for partial dead-code elimination which shares
similarities with the opportunities shown in Section 3.1.2. The difference is that the partially
dead code in block C cannot be eliminated without changing the control dependence
structure of the optimization region, as shown in Figure 3.5(b). By making the
partially dead operation control dependent upon a later branch, it has been rendered totally
live (i.e., the operation is only executed on the paths on which it is consumed).
Figure 3.6: Partial redundancy elimination that requires control flow restructuring.
(a) A partially redundant divide operation is shown. (b) Safely eliminating it requires
code duplication and restructuring the CFG.
Figure 3.6(a) shows an example of a PRE opportunity which is obscured by a control
flow obstacle. As already mentioned in Section 3.1.1, safe PRE techniques would normally
be unable to exploit this opportunity. By duplicating code and restructuring the control
flow, as shown in Figure 3.6(b), the opportunity can be exposed to safe optimization.
Techniques have been proposed that are capable of identifying and eliminating PDE op-
portunities like the one shown in Figure 3.5 [14, 38, 91]. However, similar to the techniques
used for eliminating partially redundant or subsumed branches, these techniques require
complex analysis algorithms and significant code duplication.¹ A similarly complex
technique that also relies on code duplication has been proposed for exploiting PRE opportunities
such as the one in Figure 3.6 [17].
¹ The revival transformation does not require code duplication, but is unable to expose some PDE
opportunities obscured by control flow. The authors themselves note that exposing other interesting PDE
opportunities requires code duplication, which they call decision node copying [38].
All of these restructuring techniques also introduce additional complexity to compiler
development. Verifying the correctness of optimizations which restructure the control flow
semantics of a program is a non-trivial undertaking. In particular, maintaining precise
exception semantics while still allowing control flow to be reordered and eliminated can be
difficult or even impossible without the hardware support described in Section 4.2.
Furthermore, many of these techniques can produce irreducible flowgraphs [17]. There-
fore, they interfere with many existing compiler analysis and optimization routines which
rely on reducibility. Reconstructing a reducible flowgraph requires additional compilation
time and incurs further code duplication [58].
Speculative optimization offers an easier and more practical approach to exploiting these
opportunities, one without the complex analysis and code duplication required by control
flow restructuring. For example, the atomic region abstraction converts highly biased control
flow into assert operations, thereby converting control operations into dataflow (see Chap-
ter 2.2). Because assert operations are simple dataflow operations, they can be redundancy
eliminated. Similarly, they remove the control obstacles to the PDE and PRE opportunities
described in this section.
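The effect of converting biased branches into asserts can be sketched on a toy region. The representation below, a list of (kind, payload, bias) tuples, along with the function names and the 99% threshold, are assumptions made for illustration and do not reflect the dissertation's implementation.

```python
def to_asserts(region, threshold=0.99):
    """Replace highly biased branches with asserts on the hot direction."""
    out = []
    for kind, payload, bias in region:
        if kind == "branch" and bias >= threshold:
            out.append(("assert", payload, None))
        else:
            out.append((kind, payload, bias))
    return out

def eliminate_redundant_asserts(region):
    """Drop an assert whose condition was already asserted.

    An "assign" to a variable invalidates earlier asserts that test it.
    """
    seen, out = set(), []
    for kind, payload, bias in region:
        if kind == "assert":
            if payload in seen:
                continue                  # same check already performed
            seen.add(payload)
        elif kind == "assign":
            var = payload[0]
            seen = {cond for cond in seen if cond[0] != var}
        out.append((kind, payload, bias))
    return out

region = [
    ("branch", ("x", "<", 10), 0.99),
    ("assign", ("a", "b + d"), None),
    ("branch", ("x", "<", 10), 0.99),     # redundant with the first branch
    ("assign", ("x", "x + 1"), None),
    ("branch", ("x", "<", 10), 0.99),     # x changed, so this one must stay
]
optimized = eliminate_redundant_asserts(to_asserts(region))
```

Because an assert has no successors to maintain, removing the second check requires no code duplication. This is the contrast with the restructuring techniques discussed above.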
3.1.5 Partially Redundant Array Bounds Check Elimination
More recently, a specialized variation of control flow restructuring has been proposed to
optimize a common idiom in managed languages, array bounds checking [15]. Array bounds
checks can often be eliminated because the array indices are known to reside within the limits
of an array, as shown in Figure 3.7(a). The ABCD optimization proposed by Bodik et al.
provides a cost-effective approach to identifying and eliminating these fully redundant bounds
checks.
Eliminating partially redundant bounds checks, shown in Figure 3.7(b), is more complex
and requires control flow restructuring. Bodik et al. show that ABCD can be extended to
identify partially redundant bounds checks and further propose a transformation to eliminate
[Figure 3.7 graphic: panels (a), (b) and (c); the only recoverable operation is i = i + 1 in block D]
Figure 3.7: Fully and partially redundant bounds check elimination opportunities.
(a) Bounds checks can be eliminated when the array index can be proven to reside within
the array bounds. (b) In cases when such a proof is not possible, (c) eliminating partially
redundant checks requires control flow restructuring.
them based on inserting stronger compensating checks. However, simply inserting a stronger
check is insufficient as it would change the exception behavior of a program.
Instead, the authors propose an extension to ABCD that uses speculation to optimize the
partially redundant bounds check. First, a compensating check is speculatively placed early
in the flowgraph which enables eliminating the partially redundant bounds checks. Second,
an unoptimized version of the region is generated which contains all of the bounds checks
in their original program positions. At runtime, if the compensating check does not detect
an out of bounds access then the optimized version of the region is executed, otherwise the
unoptimized version is executed so that the out of bounds exception is thrown precisely. This
transformation is shown in Figure 3.7(c). Note that in the case of managed languages, further
speculation can be used to defer generating the unoptimized version until the compensating
check detects a potential out of bounds exception (as described in Section 4.2).
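The two-version scheme can be sketched in Python. The function names and the summation loop are hypothetical, and Python performs its own index checks, so the "eliminated" check is only notional here. The point of the sketch is the control structure: a hoisted compensating check selects between a check-free version and an unoptimized version that throws at the original program point.

```python
def checked_get(a, i):
    """Array access with the bounds check a managed language performs."""
    if not 0 <= i < len(a):
        raise IndexError(i)
    return a[i]

def sum_unoptimized(a, start, count, log):
    total = 0
    for i in range(start, start + count):
        total += checked_get(a, i)        # check in its original position
        log.append(i)                     # side effect visible before a throw
    return total

def sum_optimized(a, start, count, log):
    # Compensating check, placed once before the loop.
    if count > 0 and not (0 <= start and start + count <= len(a)):
        return sum_unoptimized(a, start, count, log)   # precise exception
    total = 0
    for i in range(start, start + count):
        total += a[i]                     # per-iteration check eliminated
        log.append(i)
    return total
```

If the compensating check fails, the unoptimized version performs the same side effects as the original program before raising the exception, so the exception behavior is unchanged.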
3.2 Opportunities in Multithreaded Systems
Modern multiprocessor systems share physical memory and provide well-defined consistency
models which dictate the order in which memory operations appear to execute. Stated
differently, memory consistency models restrict the set of values that a particular memory
load may return [1]. Memory consistency models have been well-defined from a hardware
system perspective for a number of years and have recently been incorporated into modern
programming language definitions [19, 59, 74].
Memory models can range from strict models, such as sequential consistency, to relaxed
models, such as release consistency. From a performance point of view, the flexibility of
relaxed models is preferable because they permit several important optimizations. From a
programming and debugging perspective, strict models, particularly sequential consistency,
are preferable because they are more intuitive and easier to understand. In order to satisfy
these competing goals, programming languages tend towards data-race-free memory models.
Data-race-free memory models are indistinguishable from sequential consistency in programs
which are synchronized sufficiently to eliminate all data races [3]. These models enable
implementations which can use optimizations such as hardware write buffers and compiler-
driven register allocation of memory variables. Implementations need only be restricted by
the ordering rules imposed via synchronization actions.
Nonetheless, three significant optimization opportunities remain. The first two pertain
to the synchronization actions themselves: frequent synchronization introduces excessive
performance overhead, and synchronization is often used to protect against data races that
rarely manifest. The last is specific to optimization at the instruction set architecture (ISA)
level: when optimizing a program binary, the memory model of the ISA must be upheld.
3.2.1 Synchronization Elimination
One way to achieve sufficient synchronization is to excessively synchronize a program. For
example, the class libraries which are defined as part of the Java runtime environment
frequently define methods as synchronized. Synchronized methods acquire a lock on their
parent object when they are entered and only release the lock when the method is exited.
As a result, invocations of a synchronized method are mutually exclusive with one another
and with invocations of other methods in the same object.
Excessively synchronizing methods reduces the possibility of race conditions and makes
it easier to write correct multithreaded Java programs. However, it can also introduce
unnecessary synchronization overhead as discovered by Heydon and Najork while designing
a high-performance Java-based web crawler [49]. For example, they noticed that using the
Java class libraries to write a single line to a log file involves 67 lock acquisitions.
Two strategies to remove excessive synchronization in object-oriented languages have
been proposed. The first strategy uses interprocedural escape and pointer analysis techniques
to determine if an object can ever be accessed by more than a single thread. If an object
can only be accessed by a single thread, then it can be allocated as a thread-local and all
synchronization related to the object can be eliminated [12, 20]. This strategy is effective
when static whole-program analysis can be performed, but is challenging in the context of
managed runtimes, because dynamic class loading obscures interprocedural information, and
in dynamic binary optimization systems.
The second strategy combines related lock synchronization by coarsening the granularity
of the locks. Lock coarsening was first proposed as a static compilation technique [33] and
has more recently been implemented in a dynamic compilation environment [98]. Lock
coarsening can alleviate excessive locking overhead, but may reduce program scalability. A
coarsened lock is held for longer and may cover previously unprotected regions of a program.
Both effects may result in increased lock contention and reduce program scalability.
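The trade-off can be made concrete with a small sketch. The CountingLock wrapper and the two append functions below are hypothetical and only illustrate the transformation. They are not drawn from the implementations cited above.

```python
import threading

class CountingLock:
    """A lock that counts how many times it is acquired."""
    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1            # updated while holding the lock
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False

def append_fine(buf, lock, items):
    for item in items:
        with lock:                        # one acquire/release per item
            buf.append(item)

def append_coarse(buf, lock, items):
    with lock:                            # a single, longer critical section
        for item in items:
            buf.append(item)

fine, coarse = CountingLock(), CountingLock()
fine_buf, coarse_buf = [], []
append_fine(fine_buf, fine, range(100))
append_coarse(coarse_buf, coarse, range(100))
```

The coarsened version acquires the lock once instead of 100 times, but it holds the lock for the whole loop, which is the source of the contention risk described above.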
Coarse-grained locks are sometimes employed explicitly by programmers as another
conservative strategy for achieving sufficient synchronization. Rather than locking only
localized regions of a program which can incur a data race, coarse-grained locks are used to
guarantee mutual exclusion of large regions of the program. It is often simpler to write
a correct parallel program when using coarse-grained locks, but it may also be harder to
provide good scalability. Coarse-grained locks obscure parallelism which might otherwise be
available.
3.2.2 Synchronization Optimization
An alternative approach to reducing synchronization overheads is to make locks more efficient
without eliminating the underlying locks themselves. There are two sources of inefficiency
that these techniques strive to reduce: dynamic synchronization complexity and serialization
bottlenecks.
In Java, synchronization operations are specified using monitors which are semantically
richer than traditional locks. In addition to mutual exclusion, Java monitors also provide
communication mechanisms for inter-thread waiting and notification. However, these
communication mechanisms are rarely used, yet they require large and complex monitor
implementations. As a result, Bacon et al. introduced thin locks, which are only used to provide mutual
exclusion for rarely contended locks, to optimize the performance of Java monitors. Further-
more, as mentioned in Section 3.2.1, many locks in Java are excessive and are never accessed
by more than a single thread. In these cases, lock reservations further eliminate the need
for expensive atomic acquire and release operations [61].
A hardware speculation technique to reduce the serialization cost of synchronization has
also been proposed. Rajwar and Goodman note that lock-based synchronization uses
mutual exclusion to provide an illusion of atomicity and isolation. In many cases, mutual
exclusion is overly conservative because concurrently executing threads affect disjoint data
sets. Therefore, hardware support for speculative lock elision (SLE) enables multiple threads
to simultaneously execute code protected by the same lock by eliding all synchronization
operations [88]. Hardware mechanisms are proposed for monitoring memory accesses and
speculatively buffering all memory updates. Hardware also detects common acquire and
release idioms and replaces them with a simple load to check if the lock is available. If
the lock is available and the threads do not conflict, then the SLE hardware instantaneously commits
all buffered memory updates at the lock release point. Otherwise, if the lock is taken or
hardware detects a memory conflict, all buffered updates will be discarded, and the locked
region will be restarted (possibly acquiring the lock if conflicts continue).
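The elision protocol can be approximated in software for illustration. The sketch below is a single-threaded model, not the hardware design of [88]: the Memory and ElidedRegion classes are hypothetical, and per-address version counters stand in for the coherence-based conflict detection that real hardware would use.

```python
class Memory:
    def __init__(self):
        self.data = {}
        self.version = {}                 # address -> number of committed writes

    def write(self, addr, value):
        self.data[addr] = value
        self.version[addr] = self.version.get(addr, 0) + 1

class ElidedRegion:
    """Speculative execution of one critical section without taking the lock."""
    def __init__(self, mem, lock_addr):
        self.mem, self.lock_addr = mem, lock_addr
        self.reads, self.writes = {}, {}
        self.load(lock_addr)              # the lock is only read, never written

    def load(self, addr):
        if addr in self.writes:
            return self.writes[addr]      # forward from the speculative buffer
        self.reads.setdefault(addr, self.mem.version.get(addr, 0))
        return self.mem.data.get(addr, 0)

    def store(self, addr, value):
        self.writes[addr] = value         # buffered, invisible to other threads

    def commit(self):
        """Commit all buffered stores, or discard them on a conflict."""
        lock_free = self.mem.data.get(self.lock_addr, 0) == 0
        unchanged = all(self.mem.version.get(a, 0) == v
                        for a, v in self.reads.items())
        if not (lock_free and unchanged):
            return False                  # caller restarts the region
        for addr, value in self.writes.items():
            self.mem.write(addr, value)
        return True

mem = Memory()
t1, t2 = ElidedRegion(mem, "L"), ElidedRegion(mem, "L")
t1.store("a", t1.load("a") + 1)           # the two regions touch disjoint data
t2.store("b", t2.load("b") + 1)
disjoint = (t1.commit(), t2.commit())

t3, t4 = ElidedRegion(mem, "L"), ElidedRegion(mem, "L")
t3.store("a", t3.load("a") + 1)           # both regions update the same address
t4.store("a", t4.load("a") + 1)
conflicting = (t3.commit(), t4.commit())
```

Two regions guarded by the same lock both commit when they touch disjoint data. When they update the same address, the second commit fails and that region would have to be restarted.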
The hardware provided in the atomic region abstraction enables a software-controlled
implementation of SLE. This optimization enables SLE to complement the thin lock and lock
reservation implementations used in modern JVM implementations by removing redundant
locks and enabling optimizations across synchronization actions (see Section 6.3).
3.2.3 Constraints on Binary Optimization
The data-race-free memory models adopted by Java and C++ enable compiler and hardware
optimizations in programs that are sufficiently synchronized. In such programs, constraints
on program behavior are largely restricted to the synchronization operations themselves.
Nonetheless, once a binary has been generated the memory model of the underlying ISA
obscures all flexibility provided at the language level.
For example, the x86 memory model provides strict ordering guarantees with respect
to loads. Load operations may not be reordered with one another and stores may not be
reordered with other stores or with preceding loads. Essentially, the only reordering allowed
by the x86 memory model is the reordering of loads with preceding stores [57].
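The rule just stated can be written down as a one-line predicate. This is only a restatement of the paragraph above for ordinary loads and stores to different addresses. The function name is hypothetical, and the sketch ignores fences, locked instructions, and other special cases of the architecture.

```python
def x86_may_reorder(earlier, later):
    """May `later` appear to execute before `earlier`, per the rule above?

    Both arguments are "load" or "store"; the two operations are assumed
    to access different addresses.
    """
    return (earlier, later) == ("store", "load")

allowed = [(a, b) for a in ("load", "store") for b in ("load", "store")
           if x86_may_reorder(a, b)]
```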
To alleviate these limitations, hardware techniques are used to enable speculative re-
ordering of load operations [44]. For example, modern out-of-order x86 implementations
may speculatively execute a load operation before a load that precedes it in program order.





[Figure 3.8 graphic. (a) Code in block B: st [sp+4] = r1; r1 = 0; st [a] = r1; r1 = ld [sp+4]; followed by st [sp+4] = r2. (b) The same sequence with rtemp = 0; st [a] = rtemp in place of r1 = 0; st [a] = r1.]




Figure 3.8: Example of optimizations disallowed by memory model constraints.
Seemingly redundant memory operations may be illegal to eliminate, depending on the
memory model of the system. The depicted store removal is illegal under the x86 memory
model.
Such a speculatively executed load remains valid only if no conflicting coherence requests have been received for the loaded memory address. If a conflicting
access does occur, then the load must be re-executed along with all of its consumers.
Despite these techniques, other optimization opportunities remain as shown in Figure 3.8.
In the example, the register r1 is spilled to the stack but then soon restored. In addition, the
stack location used to spill r1 is overwritten by a later store. In a single-threaded context,
it would be legal to perform the optimizations shown in Figure 3.8(b). However, when
optimizing a multithreaded binary for x86 it is illegal to eliminate the seemingly dead store.
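The single-threaded equivalence can be checked with a toy interpreter. The instruction encoding and the run function are hypothetical, and the optimized sequence is one reading of Figure 3.8(b), in which the spill and restore of r1 are removed and a scratch register rtemp holds the constant.

```python
def run(code, regs, mem):
    """Execute a list of toy instructions on copies of regs and mem."""
    regs, mem = dict(regs), dict(mem)
    for op, dst, src in code:
        if op == "st":
            mem[dst] = regs[src]          # st [dst] = src
        elif op == "ld":
            regs[dst] = mem[src]          # dst = ld [src]
        elif op == "li":
            regs[dst] = src               # dst = immediate
    return regs, mem

original = [
    ("st", "sp+4", "r1"),                 # spill r1
    ("li", "r1", 0),
    ("st", "a", "r1"),
    ("ld", "r1", "sp+4"),                 # restore r1
    ("st", "sp+4", "r2"),                 # overwrites the spill slot
]
optimized = [
    ("li", "rtemp", 0),                   # r1 is never clobbered
    ("st", "a", "rtemp"),
    ("st", "sp+4", "r2"),
]

regs = {"r1": 7, "r2": 9}
mem = {"a": 1, "sp+4": 3}
regs_o, mem_o = run(original, regs, mem)
regs_p, mem_p = run(optimized, regs, mem)
regs_p.pop("rtemp")                       # rtemp is a scratch register
```

Both sequences leave identical register and memory state when run in isolation. The interpreter cannot show what another thread might observe between the individual stores, which is exactly the behavior the ISA memory model constrains.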
Speculative compiler optimization alone is unable to relax these constraints. The atomic
execution guarantees provided by hardware enable a compiler using the atomic region ab-




This chapter provides an overview of speculative compiler optimization techniques and char-
acterizes them according to the usage models they support. Prior approaches to speculative
optimization, as well as the atomic region abstraction, can be characterized according to
three aspects: region shape, misspeculation recovery, and speculation strategy.
Region shape describes the types of control flow that a speculative optimization tech-
nique can encompass, ranging from a trace of basic blocks to arbitrary control flow graphs.
Misspeculation recovery describes the mechanisms provided in hardware and software to
enable the recovery of correct state in the event of a misspeculation. Speculation strategy
describes the types of control and data speculation that are enabled by a technique.
4.1 Region Shape
This section characterizes speculative optimization approaches according to the control flow
that they support. In general, the goal of each approach is to support regions which are
more general than a single basic block, but are more focused than traditional global and
interprocedural techniques. In doing so, profile information can be used to generate regions
with sufficient scope to expose opportunity and yet enable optimizations tailored towards
common program paths.
Trace scheduling was the first to introduce the notion of a compilation unit that crosses
basic block boundaries and can easily incorporate profile information [41]. Assembling
sequences of commonly executed basic blocks into a larger trace provides a simple abstraction
[Figure 4.1 graphic: panels (a), (b) and (c); blocks A, B and C, plus blocks X and Y in the final panel, containing the operations a = a + 1, b = c · d, a = b + c and d = a + b; branch bias 99% / 1%]
Figure 4.1: Example of trace scheduling bookkeeping complexity. (a) Removing the
partially redundant operations shown (b) requires code rescheduling (c) and a combination
of control flow fixup and compensation code.
for a compiler to implement cross-basic block scheduling optimizations. When scheduling
a trace, the compiler temporarily ignores control flow entrances and exits and therefore is
unconstrained by them.
However, maintaining correctness during trace scheduling involves bookkeeping com-
plexities that prove difficult to manage. For example, Figure 4.1 demonstrates very simple
opportunities that a compiler would attempt to exploit. Though trace scheduling exposes
such opportunities, exploiting them is difficult.
The superblock solved this problem by introducing a single-entry multiple-exit compila-
tion unit, free of convergent control flow [54]. By performing tail-duplication, reconvergent
control flow can be removed from traces. This trivially eliminates many of the bookkeeping
complexities of traces. The hyperblock extended the superblock by including small control
flow “hammocks” and predicating them [73].
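Tail duplication itself is simple to sketch. The version below is a minimal illustration under stated assumptions rather than the algorithm of [54]: the control flow graph is a dictionary from block names to successor lists, the function name tail_duplicate is hypothetical, and the whole tail of the trace is duplicated starting at the first block with a side entrance.

```python
def tail_duplicate(cfg, trace):
    """Remove side entrances from `trace` by duplicating its tail.

    cfg maps each block name to a list of successor names. In the result,
    every trace block after the head is entered only from its trace
    predecessor.
    """
    cfg = {b: list(s) for b, s in cfg.items()}
    trace_pred = {trace[i + 1]: trace[i] for i in range(len(trace) - 1)}

    def side_entered(block):
        return any(block in succs and pred != trace_pred[block]
                   for pred, succs in cfg.items())

    # Index of the first non-head trace block that has a side entrance.
    k = next((i for i in range(1, len(trace)) if side_entered(trace[i])), None)
    if k is None:
        return cfg                        # already a superblock
    copy = {b: b + "'" for b in trace[k:]}
    originals = list(cfg)
    # Duplicate the tail; edges that stay inside the tail point into the copy.
    for b in trace[k:]:
        cfg[copy[b]] = [copy.get(s, s) for s in cfg[b]]
    # Redirect every side entrance to the duplicated tail.
    for pred in originals:
        cfg[pred] = [copy[s] if s in copy and pred != trace_pred[s] else s
                     for s in cfg[pred]]
    return cfg

cfg = {"A": ["B", "X"], "X": ["C"], "B": ["C"], "C": ["D"], "D": []}
superblock = tail_duplicate(cfg, ["A", "B", "C", "D"])
```

In the example, the off-trace block X originally reconverges into C. After the transformation X branches to the copy C', so the trace A, B, C, D has a single entry.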
A more general shape abstraction for compiler optimization was introduced by Chambers
and Ungar. Using deferred compilation, the Self dynamic optimizing compiler could arbi-
trarily remove program paths from an optimization region [25]. In their system uncommon
paths are replaced with stubs that invoke the dynamic compiler. If invoked, a stub directs
the compiler to generate code for an uncommon path that both reuses the existing stack
frame and reconverges into the original optimization region.
Though effective at reducing compilation costs and enabling optimizations such as reg-
ister allocation, the deferred compilation abstraction limits other optimizations because
stubbed control flow paths potentially reconverge. Whaley introduced the partial method
compilation abstraction which eliminates this restriction [104]. The partial method com-
pilation technique also replaces uncommon program paths with stubs, but these stubs are
assumed to never reconverge. As a result, conventional compilation techniques can exploit
some speculative redundancy opportunities. A similar abstraction is provided by the un-
common traps mechanism used in the HotSpot server compiler [85].
Hank et al. introduced an equally general compilation unit for static compilers. Their
region-based compilation enables a compiler to select arbitrarily shaped pieces of a program
as the unit of compilation [46], and a similar partial inlining approach has been proposed by
Muth and Debray [81]. Their partial inlining technique provides many of the optimization
opportunities of region-based compilation, while mitigating compilation time and code bloat
issues.
Atomic regions support a similarly general region shape, but offer more powerful
semantic guarantees. Similar to the partial method compilation techniques developed by Whaley,
an atomic region enables a compiler to speculatively remove cold paths from an optimization
region. Likewise, even if taken, these cold paths can be assumed to never reconverge.
Furthermore, atomic regions enable a compiler to make more aggressive assumptions
because of hardware atomicity. Cold paths in the atomic region can be converted into asserts
which roll back the entire region if triggered. This enables a compiler to simply remove all
dependences related to the asserted path resulting in simplified control flow and dataflow.
These simplifications expose speculative opportunities as conventional optimizations.
4.2 Misspeculation Recovery
This section overviews the techniques and mechanisms that have been proposed to recover
from a misspeculation in speculatively optimized code. I will initially focus on software-only
techniques but will progressively introduce techniques that rely on hardware mechanisms.
A software-only recovery mechanism was pioneered by Hölzle et al. in their work on
a dynamic optimizing compiler for the Self language. To support source-level debugging,
they developed a technique for dynamically deoptimizing methods [50]. Their technique
replaces the stack frame generated by an optimized program region with the stack frame
expected by a programmer and transparently relocates execution from optimized code into
a corresponding unoptimized code location. Hölzle further generalized this technique into
what is known today as on-stack replacement [51].
Some Java virtual machine implementations use on-stack replacement to provide recov-
ery support for unexpected program behaviors [40, 85]. Doing so requires annotating any
optimized code location that could require recovery as a safe point. Each safe point is treated
as an opaque call which consumes all live variables and sufficiently constrains optimization
to enable on-stack replacement. These constraints become increasingly common when safe
points for exceptions and interrupts are considered.
These software-only approaches to recovery, though effective for other uses, overly limit
speculative compiler optimization. A major concern of any speculatively optimizing compiler
is guaranteeing that speculatively executed operations will not adversely affect correctness
by computing incorrect values or triggering spurious exceptions. Satisfying this concern using
software-only techniques requires sophisticated analysis and conservative optimization. As
a result, a range of hardware mechanisms has been proposed to aid the compiler.
The simplest and most commonly used hardware feature might be obvious but bears
mentioning: extra registers. Extra registers provide an optimizer with hidden locations
in which to hold speculatively computed values without the need to spill non-speculative
recovery values to memory. Extra registers are essential and an assumed component of each
of the following hardware mechanisms.
Several hardware mechanisms have been proposed to support speculatively hoisting op-
erations past branches, in other words executing operations “early”. These mechanisms
not only support improved scheduling for in-order architectures, but also support aggressive
partial redundancy elimination (see Section 3.1.1 for an example).
Instruction boosting [95] supports speculatively hoisting operations above branches by
enabling the compiler to communicate its decisions to the hardware. The result of a specula-
tively hoisted operation is labeled with the future branch outcomes upon which it is control
dependent. The hardware uses this information—in conjunction with shadowed register files
and store buffers—to only commit the result or any raised exception once the operation
becomes non-speculative.
Sentinel scheduling [72] and write-back suppression [23] provide hardware support to
simplify exception handling for speculatively hoisted instructions. In sentinel scheduling,
instructions which may speculatively cause an exception are annotated and are paired with
another instruction that non-speculatively checks for exceptions. With write-back suppres-
sion, all speculatively executed instructions are annotated so that if one raises an exception
all speculative updates can be suppressed.
All three of these schemes specifically focus on handling speculative results and specu-
lative exceptions but ignore other operations. For these schemes, misspeculation recovery
remains a complex and non-trivial problem. When a misspeculation occurs, recovery code
must be generated which reconstructs non-speculative state. However, a multitude of specu-
lative optimizations are possible and implementing a compiler capable of generating correct
recovery code for every possibility is a daunting task.¹
A more general misspeculation recovery mechanism can be provided by fast checkpoint
hardware. A hardware checkpoint takes a logical snapshot of register and memory state
that can later be restored or replaced by a subsequent checkpoint. Several proposals use
checkpoint hardware for misspeculation recovery. A checkpoint is taken prior to entering
¹ The general speculation model does not require compensation code. Speculatively hoisted operations
are labeled as such so that any exception is suppressed and instead the output of the operation is “poisoned”
so that non-speculative consumers can raise the exception. However, the general speculation model does not
support precise exceptions [93]. Therefore, general speculation violates correctness on the systems considered
in this dissertation and is not further considered.
an optimization region, and the checkpoint is restored in the case of a misspeculation or
exception in the optimization region. If an optimization region executes successfully, the
checkpoint is discarded, typically by taking a new checkpoint.
Melvin and Patt were the first to introduce a checkpoint-based recovery mechanism for
use in speculative compiler optimization. They proposed a new block-structured ISA to
provide single-entry and single-exit atomic blocks [76,77]. Atomic blocks can be aggressively
optimized for a predicted control flow path, and unexpected exits can be made to discard all
speculative state by a hardware-supported fault assert operation. When triggered, a fault
assert discards all register and memory updates performed in an atomic block and redirects
control flow to an alternate implementation of the asserted code.
The rePLay framework [86] refined the block-structured ISA and introduced the frame.
The frame is a single-entry and single-exit atomic trace of basic blocks in which all side
exits have been converted into assert instructions. A hardware-only system, rePLay relies
on a modified branch predictor to identify predictable instruction traces and place them
into frames which are processed by a hardware code optimizer. These optimized frames are
then stored in a frame cache, and execution is redirected to them by a frame predictor. If
an assert detects a misspeculation, the frame is rolled back, and execution resumes using
normal instructions.
Transmeta’s CMS was the first software-based dynamic optimization system to rely on
hardware atomicity primitives for recovery [32]. Hardware atomicity enables simple recovery
because if a misspeculation occurs hardware rolls back all state to a well-defined point.
Execution is then simply redirected to a non-speculative implementation of the same code
region (see Chapter 2). The atomic region abstraction relies on an identical checkpoint




This section focuses on the types of speculative optimizations that are easily implemented by
different abstractions. In theory, each speculative optimization abstraction supports a com-
plete range of control speculative optimizations. In practice, the choice of abstraction makes
some types easier than others. Primary focus is placed on the types of control speculation
that each abstraction supports. Although not a focus of this dissertation, the complementary
issue of data speculation is discussed in Section 4.3.1 for completeness.
Control speculative optimizations are those that optimistically violate control depen-
dences as expressed by a program or binary. There are two general forms of control spec-
ulation: hoisting and sinking. Hoisting occurs when an operation is moved above a branch
and occurs implicitly as part of some optimizations (e.g., partial redundancy elimination).
Sinking occurs when an operation is moved below a branch and also occurs as part of some
optimizations (e.g., partial dead code elimination).
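The two forms can be illustrated with a minimal Python sketch. The function names and the arithmetic are hypothetical and chosen only for illustration; they are not taken from any compiler discussed here. The hoisted version also shows why potentially excepting instructions are an obstacle: the division now executes on a path where the original program never performed it.

```python
def sink_before(a, b, cond):
    t = a * b              # computed on every path, used only when cond holds
    if cond:
        return t + 1
    return 0


def sink_after(a, b, cond):
    if cond:
        t = a * b          # sunk below the branch; no longer partially dead
        return t + 1
    return 0


def hoist_before(a, b, cond):
    if cond:
        return a // b      # division is control dependent on cond
    return 0


def hoist_after(a, b, cond):
    t = a // b             # hoisted above the branch; runs even if cond is false
    if cond:
        return t
    return 0
```

With a non-zero divisor each pair returns the same result. With `b == 0` and `cond` false, `hoist_before` returns 0 while `hoist_after` raises `ZeroDivisionError`, which is the kind of speculative exception that the hardware schemes of Section 4.2 are designed to hide.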
The speculation strategy used in most modern compilers is based around traces, su-
perblocks, or hyperblocks. The trace naturally encapsulates frequently executed control
flow in a single compilation unit. Because a trace only includes a single control flow path,
it trivially exposes speculative opportunity to a compiler. However, exploiting these oppor-
tunities requires custom compiler optimizations that maintain correctness, as discussed in
Section 4.1. Even so, some hoisting opportunities cannot be exploited because of potentially
excepting instructions.
By removing side entrances, the superblock and hyperblock convert some partially redun-
dant operations into fully redundant operations enabling a compiler to trivially exploit these
hoisting opportunities. Furthermore, systems designed to support the superblock and hyper-
block typically include mechanisms to enable speculative hoisting of potentially excepting
instructions, as discussed in Section 4.2.
Using predication, the hyperblock also enables a compiler to if-convert short unbiased
paths. Predication enables otherwise control-dependent instructions to be speculatively
promoted past their predicate producer by using speculative code motion techniques derived
from superblocks.
The region-based [46] and partial-method [104] compilation frameworks provide alterna-
tive support for reducing cold-path optimization obstacles. In these frameworks, cold paths
in an optimization region are cut and replaced by exit stubs. If a cold path is reconvergent,
then its removal may expose speculative hoisting opportunities to classical optimizations,
without needing tail-duplication.
In each of these frameworks, an exit implicitly uses all live variables. As a result, sinking
opportunities such as elimination of partially dead code remain obscured. Identifying and
eliminating these opportunities is complex and expensive (see Section 3.1.2).
The atomic region speculation model is the first to enable a software optimizer to effec-
tively remove exits. In the atomic region abstraction, a compiler can convert cold exits into
assert operations, and, if taken, these asserted exits will trigger a hardware rollback and
restart execution in an alternate implementation of the same region. Therefore, an asserted
exit is not a consumer of any live variables (other than those used to compute the assert
condition itself). Furthermore, an asserted exit does not impose any control dependences,
which trivially enables sinking opportunities such as partial dead code elimination.
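The commit-or-rollback behavior of an asserted exit can be emulated in software to make the semantics concrete. The following is only a sketch under stated assumptions: a deep copy of a dictionary stands in for the hardware checkpoint, and the names `run_atomic`, `spec_assert`, `region_speculative`, and `region_fallback` are hypothetical. Real hardware atomicity would not copy state in this way.

```python
import copy


class AssertFailed(Exception):
    """Raised when an asserted (cold) exit would have been taken."""


def spec_assert(condition):
    if not condition:
        raise AssertFailed()


def run_atomic(state, speculative, fallback):
    checkpoint = copy.deepcopy(state)   # software stand-in for a hardware checkpoint
    try:
        speculative(state)
        return "commit"
    except AssertFailed:
        state.clear()
        state.update(checkpoint)        # discard every speculative update
        fallback(state)                 # non-speculative version of the same region
        return "rollback"


def region_fallback(state):
    state["count"] += 1
    if state["value"] < 0:              # cold path, kept only in this version
        state["log"].append("negative")
        state["value"] = 0
    state["total"] += state["value"]


def region_speculative(state):
    state["count"] += 1
    spec_assert(state["value"] >= 0)    # cold exit converted into an assert
    state["total"] += state["value"]
```

In this sketch the assert reads only `value`, so nothing else in the region has to be kept live for the cold path. The increment of `count` that precedes a failing assert is undone by the rollback and then re-executed by the fallback version.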
In addition, an atomic region relies on hardware providing an illusion of instantaneous
execution. This enables additional optimizations that would not otherwise be possible.
For example, memory model constraints can be relaxed, which enables additional memory
reordering and redundancy elimination. This enables new optimizations such as software-
controlled speculative lock elision (see Chapter 6.3).
On the other hand, the atomic region abstraction incurs a significant performance penalty
in the case of a misspeculation. Only highly biased control flow can be practically converted
into assert operations. Trace-based and superblock-based speculation strategies have much
lower misspeculation penalties and can therefore exploit moderately biased control flow.
That being said, these strategies can complement one another as shown in Chapter 7.
4.3.1 Data Speculation
Although this dissertation primarily focuses on control speculation, it is important to note its
relationship to data speculation. Data speculative optimizations are those that optimistically
execute an operation before all of its data dependences are resolved. Of particular interest
are optimizations that reorder memory operations that may alias. Note that optimizations
such as redundancy or dead code elimination may implicitly reorder loads and stores.
The control speculation abstractions above all assume protection from incorrectly re-
ordering memory aliases. Before applying aggressive optimizations, memory alias analysis
must be performed to prove the correctness of optimizations that reorder memory operations.
However, it is not always possible to prove that two memory accesses do not alias.
To address this limitation, Gallagher et al. introduced the memory conflict buffer which
dynamically disambiguates potential memory aliases [43]. A hardware structure, the memory
conflict buffer enables a software optimizer to protect the address of reordered memory
operations against unexpected memory alias violations. Practical application of aggressive
control speculation typically depends on hardware support for data speculation via structures
like the memory conflict buffer. For example, recent processors which support superblock
and hyperblock optimizations provide alias protection support in hardware [53,64].
It should be noted that superblock-based designs are based on in-order, statically-scheduled
processors. They rely on the compiler to identify instruction-level parallelism and to reorder
and schedule instructions to exploit it. As a result, aggressive software scheduling is critical
to achieving good performance and therefore more instruction reorderings may be necessary.
These reorderings often require the aid of hardware support for data speculation.
Dynamic binary optimization also benefits from hardware support for data speculation
because precise alias analysis is infeasible. Tight compilation budgets prevent the use of
sophisticated analysis techniques and the lack of high-level source code information limits the
precision that can be attained. Therefore, even classical optimizations such as redundancy
elimination typically require hardware support for data speculation.
Similarly, the alias imprecision inherent to type-unsafe languages such as C obscures
optimization opportunities. Hardware support for data speculation can also aid in the
optimization of programs written in these languages. For example, Lin et al. introduce
an extended static single assignment (SSA) framework that enables control speculation and
data speculation [69]. They demonstrate the usefulness of their framework by implementing
SSA-based partial-redundancy elimination [63] that removes likely redundancies that are
obscured by possible memory aliases. To maintain correctness, their framework presumes
hardware support for data speculation. Dai et al. recently introduced a more general
compilation framework which enables classical compiler optimizations to employ hardware
supported data speculation [30].
However, it is unclear whether hardware support for data speculation is needed in the
context of statically-typed, type-safe languages that are compiled for out-of-order processors.
The type information and safety provided by these languages enables simple type-based
alias analysis with sufficient precision to support classical optimization [34]. The dynamic
scheduling offered by the out-of-order processor obviates the need for aggressive software
scheduling.
Likewise, aggressive control speculative optimizations on these systems do not require
data speculation support in order to be beneficial (as demonstrated in Section 6). Neverthe-
less, data speculation could expose additional opportunities and further study is warranted.
Chapter 5
Study of Control Bias in Integer
Programs†
This chapter explores the feasibility of exploiting control bias as a proxy for aggressive,
but safe, compiler analysis and optimizations. As suggested by the examples in Chapter 3,
control-speculative optimizations—those which speculatively optimize common paths at the
expense of uncommon ones—have the potential to trivially expose otherwise complex op-
timization opportunities. Nonetheless, the extent to which this is possible depends on the
manner and frequency of biased control flow in a program.
To gain insight into these questions, the following sections analyze biased control flow
behavior in the SPECint 2000 benchmarks. An abstract model of a software optimization
system that speculatively exploits biased control flow behavior is also presented. Studies
of this abstract model identify the mechanisms necessary to make such an optimization
system practical.
†The content of this chapter derives from work published at the 2005 International Symposium on Code
Generation and Optimization [108].
5.1 Cost-benefit Tradeoff for Control Speculation
The potential for aggressive control speculation can be seen by looking at branch bias across
complete program runs. In a software optimization system capable of exploiting this po-
tential, a decision to speculate or not is made when the code is generated. In making this
decision, the ratio of correct to incorrect speculations must be considered. For a branch,
this ratio is the branch’s bias; for example, speculatively removing the branch in Figure 3.8
will benefit execution whenever the common path is followed but incur misspeculation costs
whenever the infrequent path is executed. Speculation will improve performance whenever
the aggregate benefit exceeds the aggregate penalty:
(correct preds × benefit) > (incorrect preds × penalty)
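The inequality can be checked numerically with a short Python sketch. The function names are hypothetical, and the benefit and penalty are assumed to be expressed in the same unit (for example, cycles saved or lost per dynamic branch); neither assumption comes from the measurements in this chapter.

```python
def speculation_profitable(correct, incorrect, benefit, penalty):
    """Direct evaluation of (correct x benefit) > (incorrect x penalty)."""
    return correct * benefit > incorrect * penalty


def break_even_bias(benefit, penalty):
    """Bias b at which b * benefit equals (1 - b) * penalty."""
    return penalty / (benefit + penalty)
```

For a penalty one hundred times larger than the benefit, `break_even_bias(1, 100)` evaluates to 100/101, or roughly 99.0%, so under these assumptions only branches biased beyond about 99% repay the cost of speculation.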
Thus speculation should be applied to all branches whose bias (or more precisely the ratio







The opportunity for control speculation can be estimated by studying branch profiles
from complete program runs. For each of the SPECint 2000 benchmarks, Figure 5.1 plots
the cumulative distribution of dynamic branches that can be speculatively removed (e.g.,
assumed to follow a single path) as the overall misspeculation rate increases. Following a
curve away from the origin yields an increasing percentage of dynamic branches that could
be removed (y axis) with a resultant increase in the overall misspeculation rate (x axis). On
each curve, a circle is placed at the point corresponding to speculatively removing branches
with an average bias of 99% or greater; for example, the circle for gcc indicates that over 70
percent of dynamic branches could be eliminated with a net misspeculation rate of less than
0.1%. In general, this 99% threshold sits at or near the knee of the curve in each benchmark,
allowing correct speculation on between 25 and 90 percent (average 46 percent) of dynamic
branches with an average of about one misspeculation every 20,000 instructions. Clearly with
such misspeculation rates, even very aggressive speculation (i.e., where the misspeculation
penalty is two orders of magnitude larger than the benefit of correct speculation) can be
profitable.
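A curve of this kind can be derived from a whole-run branch profile. The sketch below is one plausible construction, not necessarily the procedure used to produce Figure 5.1: it assumes the profile is a mapping from a branch identifier to a pair of (taken, not-taken) counts, and it normalizes both axes by the total number of dynamic branches. The name `speculation_curve` is hypothetical.

```python
def speculation_curve(profile):
    """Return (misspeculation rate, correct speculation rate) points.

    profile maps a branch id to (taken_count, not_taken_count).
    Branches are added from most biased to least biased.
    """
    counts = [c for c in profile.values() if c[0] + c[1] > 0]
    total = sum(t + n for t, n in counts)
    ranked = sorted(counts, key=lambda c: min(c) / (c[0] + c[1]))
    points, correct, incorrect = [], 0, 0
    for taken, not_taken in ranked:
        correct += max(taken, not_taken)     # executions on the speculated path
        incorrect += min(taken, not_taken)   # executions that would misspeculate
        points.append((incorrect / total, correct / total))
    return points
```

Adding branches in order of increasing minority fraction yields the largest number of correct speculations for each misspeculation budget, which is why the resulting points trace a Pareto frontier.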
While these results demonstrate significant opportunity, they are potentially optimistic.
In selecting the set of branches for speculation, the behavior of the whole program’s run
(representing future knowledge) has been used. The next section explores the effectiveness
of using conventional profiling mechanisms to predict biased branches.

Figure 5.1: Speculative removal of biased branches versus misspeculation rate.
The line represents the Pareto optimal correct speculation rate that could be achieved for a
given misspeculation rate with perfect knowledge of future branch outcomes (self-training).
•: a 99% threshold which is usually at the knee of the curve. As discussed in Section 5.2,
△: results from using a training input (using a 99% threshold), +: results from using initial
behavior to predict bias (using a 99% threshold and initial periods of 1k, 10k, 100k, 300k,
and 1 million executions). Points that fall off the graph are labeled with their (x,y) location.
5.2 Previous Techniques for Detecting Branch Bias
Despite significant potential, mechanisms are needed for deciding which branches can be
speculatively removed. Further analysis demonstrates that this is a non-trivial issue. Specif-
ically, two conventional techniques—using profile data from a training run and using profile
data from the beginning of a run—have significant limitations.
Bmark    Profile Input          Evaluation Input          Len
bzip2    input.compressed       input.source 10           19B
crafty   ponder=on ver 0        ponder=off ver 5 sd=12    45B
eon      rushmeier input        kajiya input              9B
gap      (test input)           (train input)             10B
gcc      -O0 cp-decl.i *        -O3 integrate.i           13B
gzip     input.compressed 4     input.source 10           14B
mcf      (test input)           (train input)             9B
parser   (test input)           (train input)             13B
perl     scrabbl.pl             diffmail.pl               35B
twolf    (train input) fast 3   (ref input) fast 1        36B
vortex   (train input)          (reduced ref input)       32B
vpr      -bend cost 2.0         -bend cost 1.0            21B
Table 5.1: Simulation data sets and run length. As our intention was to demonstrate
the fragility of offline profiling, we attempted to find reasonable inputs whose behavior
differed from the evaluation set. In some cases, we diverged from the standard SPEC training
sets for profiling, which in most cases are unrealistically similar to the ref inputs. All
benchmarks were compiled for the Alpha architecture using peak compiler optimization.
* Since the optimization level of gcc is hard coded, we had to modify its execution to give the
appearance of -O0.

Profiling from a previous run: Many aspects of program behavior are consistent from one
data set to the next; so using the behavior of one input to predict the behavior of another is
often effective [42,103]. Nevertheless, some program behaviors are entirely input dependent;
many programs are parametrized (for example the optimization level of a compiler) and
the input parameters become predicates for frequently executed branches. This presents
a problem for aggressive control speculation: for one input a branch may be 100% biased
in one direction and for another input the same branch may be 100% biased in the other
direction. Furthermore, if the profile input and the evaluation input do not exercise the
same regions of code, there will be branches with missing profile information. In general,
because of these two effects, selecting biased branches from a previous input may have both
lower benefit and more misspeculations than self-training.
If the training input differs materially from the evaluation input, the difference in program
behavior can be substantial. In Figure 5.1, the benefit and misspeculation rates achieved
from selecting biased branches (using a 99% threshold) from a differing input are plotted as
triangles; the set of inputs used is described in Table 5.1. For these inputs, the benefit is
reduced by a factor of 3 on average and the misspeculation rate increases by a factor of 10.
Using a higher threshold does not significantly reduce the misspeculation rate for some of
the worst offenders (crafty, perl and vpr) and achieves only approximately three quarters of the
benefit. The misspeculation rate can be reduced by averaging together a number of profiles;
while this does reduce misspeculation rate it also reduces opportunity as input-dependent
branches will no longer be considered (data not shown). Overall, this form of speculation
control does not do a good job of approximating self-training, an observation also made by
Wall [103].
Profiling from initial behavior: Another approach is to use initial branch bias behavior
to predict the overall behavior of a branch. A recent study shows initial behavior is a
generally more effective predictor of branch bias than having a profile from a training data
set. In some programs, however, a significant number of executions need to be recorded in
order to reliably predict branch behavior [105].
Since most highly-biased branches exhibit identical behavior during their entire lifetime,
the bias of an initial segment of execution is an effective predictor of which branches will be
highly biased. In fact, 80 percent of the benefit of self-training can be captured by choosing
to speculate only on branches whose bias exceeds 99% for their first 1,000 executions. The
remaining 20 percent of benefit is derived from branches that are not initially biased, but
whose overall behavior is biased.
The difficulty with this approach is the same one observed by Wu et al.: some branches
change their behavior—sometimes drastically so—during their execution [105]. For the
benchmarks studied in this chapter, 7 percent of the static branches selected as highly
biased (99% biased or greater) from their initial 1,000 executions had an average bias for
the whole run that fell below 99%; more than a third of these branches had average biases
less than 90%. The inclusion of these false positives results in a misspeculation rate of 2.6%;
without them, the misspeculation rate is only 0.13%.
It is tempting to think that by observing a longer initial sequence before making a
decision, misspeculations can be eliminated; however this is not particularly effective. The
crosses in Figure 5.1 show the benefit/misspeculation trade-offs for 5 different training period
lengths: 1k, 10k, 100k, 300k, and 1 million executions. While increasing the initial sequence
length does reduce misspeculation rate (points farthest from the y-axis correspond to the
shorter training period), in some cases (bzip2, perl) it takes more than 300k executions
to reach a rate comparable to self-training. In one case, mcf, even 1 million executions are
insufficient, resulting in a 3% misspeculation rate. Furthermore, the cost of a longer training
period is a reduction in the achievable benefit.
The problem with both of these conventional mechanisms is that they lack robustness.
While each works well in certain circumstances, misspeculation rates as high as one per
100 instructions executed remain. Clearly such misspeculation rates are unacceptable for
aggressive control speculation, where misspeculation detection and recovery could take hun-
dreds of cycles.
This lack of robustness derives from the fact that once a decision to speculate is made it
is never reconsidered. Section 5.4 demonstrates that, by adding a small amount of reactivity,
the system can be made quite robust. The next section more closely investigates branches
that change behavior over their lifetimes.
5.3 Characterization of Changing Branches
When classifying branches from their initial bias, there are two challenging possibilities: 1)
branches that start biased but become unbiased or biased in the opposite direction, and
2) branches that are initially unbiased and later become biased. The first category is the
most serious because it represents potential misspeculations; the second category merely
represents lost opportunity and, as shown in Section 5.4, the loss is modest.
Figure 5.2: Five static branches with initially invariant behavior. Branch bias averaged
over blocks of 1000 dynamic instances. In all of these cases, the branch can be
considered highly biased for at least the first 20,000 branch instances.

Figure 5.2 provides some insight into the difficulty of this problem; five static branches
from the benchmark gap are shown that are characterized as biased for at least the initial
20,000 executions (most of which are initially 100% biased) then change their behavior, in
some cases completely reversing their bias. By solely looking at a sequence of initial branch
outcomes, these branches are indistinguishable from branches that remain biased throughout
a program’s execution.
Manual inspection of the source code provides little insight. In some cases, the branch’s
behavior is correlated to a path or calling context and the branch’s initial behavior is deter-
mined by the control flow of early executions. In other cases, no obvious correlation exists,
leaving only the unsatisfactory explanation that the branch’s behavior is data dependent.
In one case, the branch condition is purely a function of a loop induction variable so that
it is false for the first 32,768 executions and true for the rest. No simple features appear to
exist that would enable these branches to be distinguished from truly biased branches.
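The induction-variable case shows concretely why an initial window cannot separate such branches from truly biased ones. The sketch below is illustrative only: the helper names are hypothetical, the 1,000-execution window and 99% threshold follow Section 5.2, and the total run length of 65,536 executions is an assumption made so that the example is self-contained.

```python
def bias(outcomes):
    """Fraction of outcomes that fall in the majority direction."""
    taken = sum(outcomes)
    return max(taken, len(outcomes) - taken) / len(outcomes)


def looks_biased(outcomes, window=1000, threshold=0.99):
    """Classify a branch using only its first `window` outcomes."""
    return bias(outcomes[:window]) >= threshold


# False for the first 32,768 executions and true afterwards (assumed run length).
flipping = [False] * 32768 + [True] * 32768
# A branch that never changes direction over the same number of executions.
stable = [False] * 65536
```

Both `looks_biased(flipping)` and `looks_biased(stable)` return `True`, yet `bias(flipping)` is 0.5 while `bias(stable)` is 1.0. Any window shorter than 32,768 executions gives the same answer for the two branches.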
5.4 Requirements for Robust Control Speculation
Despite the lack of a static heuristic for classifying branches according to their branch bias
behavior, a robust control-speculation system can be built. This section describes a
simple model for controlling speculation that addresses the aforementioned realities of branch
bias behavior. Despite its simplicity, this model is effective enough that its performance
is comparable to, and occasionally exceeds, static self training (i.e., using the same input
for profiling as evaluation). The model is studied in the abstract in order to explore its
fundamental requirements independent of implementation.
5.4.1 A Simple Effective Model
The fundamental requirement of any robust model is an ability to tolerate branches with
biases that vary over time. It is necessary for branches to be reclassified when their behavior
changes. Figure 5.3(a) depicts an abstract model for the systems described in Section 5.2,
which classifies branches according to earlier profile information. This model lacks robustness








Figure 5.3: A finite-state machine model for branch behavior characterization.
Figure 5.3(b) shows a model with two additional transitions, both back to the monitor
state. From the biased state, the transition should be taken when the branch is resulting
in an undesirable rate of misspeculations. From the unbiased state, it is merely necessary
to periodically revisit the monitor state. As a following sensitivity analysis will show, the
existence of these transitions is fundamental; most every other attribute of this model is of
secondary importance.
Monitor period              10,000 executions
Selection threshold         99.5 percent
Misspeculation threshold    10,000 (+50 on misp., -1 otherwise)
Wait period                 1,000,000 executions
Oscillation threshold       will not optimize a sixth time
Optimization latency        1,000,000 instructions
Table 5.2: Model Parameters.
Nevertheless, to evaluate the model, various model parameters must be assigned. Ta-
ble 5.2 shows the parameters used, which roughly approximate latencies and thresholds
practical for implementation.
To a large degree, the model parameters are chosen to reduce the effort required by the
optimization system. In particular, every transition into or out of the biased state requires
code to be re-optimized. The main drawback of using a model like Figure 5.3(b) is the
potential for oscillating in and out of the biased state. Therefore, four strategies are used to
dampen this oscillation:
1. First, a moderately long monitoring period (10,000 executions) provides a simple filter
for reducing the number of false positives.
2. Second, hysteresis is introduced by using a stricter threshold for entry into the biased
state than for eviction. For example, to target branches with average bias of greater
than 99%, the bias must be greater than 99.5% to begin speculation, and branches
are only evicted when their bias falls below 98% for a non-trivial time period. This is
implemented in the model using a saturating counter that counts up 50 on a misspec-
ulation and down by one on a correct speculation; the branch is evicted if the counter
reaches 10,000 (requiring at least 200 misspeculations). This hysteresis is necessary to
tolerate short bursts of misspeculations by otherwise biased branches.
3. Third, the model spends a relatively long waiting period (1 million executions) in the
unbiased state. In addition to reducing the frequency at which a branch’s classification
needs to be reconsidered, increasing this period reduces the likelihood that a branch
which is only temporarily biased will be selected for speculation.
4. Fourth, there is an absolute limit to the number of times each branch can oscillate.
This is a necessity for the small number (∼50 of over 7000) of branches that otherwise
oscillate hundreds or thousands of times, even for relatively short program runs. Af-
ter a threshold number of oscillations, a branch is permanently classified as unbiased.
This limit has little impact on the number of branches that can be speculatively op-
timized but provides a two-thirds (on average) reduction in the number of requested
re-optimizations.
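The four damping strategies can be combined into a small per-branch controller. The following Python sketch is one reading of the model and not the simulator used in this chapter: the class and method names are hypothetical, the defaults are taken from Table 5.2, the oscillation threshold is interpreted as at most five entries into the biased state, and optimization latency is deliberately omitted.

```python
MONITOR, BIASED, UNBIASED = "monitor", "biased", "unbiased"


class BranchController:
    """Per-branch speculation control with Table 5.2 values as defaults."""

    def __init__(self, monitor_period=10_000, select_threshold=0.995,
                 evict_threshold=10_000, misp_step=50,
                 wait_period=1_000_000, max_optimizations=5):
        self.monitor_period = monitor_period
        self.select_threshold = select_threshold
        self.evict_threshold = evict_threshold
        self.misp_step = misp_step
        self.wait_period = wait_period
        self.max_optimizations = max_optimizations
        self.direction = None      # speculated outcome while in BIASED
        self.optimizations = 0     # number of entries into BIASED so far
        self._enter(MONITOR)

    def _enter(self, state):
        self.state = state
        self.seen = 0              # executions observed since entering the state
        self.taken = 0             # taken outcomes counted while monitoring
        self.counter = 0           # misspeculation counter, floored at zero

    def _classify(self):
        majority = max(self.taken, self.seen - self.taken)
        if (majority / self.seen > self.select_threshold
                and self.optimizations < self.max_optimizations):
            self.direction = 2 * self.taken >= self.seen
            self.optimizations += 1
            self._enter(BIASED)
        else:
            self._enter(UNBIASED)

    def observe(self, taken):
        """Record one execution; return True if it was a misspeculation."""
        taken = bool(taken)
        if self.state == MONITOR:
            self.seen += 1
            self.taken += taken
            if self.seen >= self.monitor_period:
                self._classify()
            return False
        if self.state == UNBIASED:
            self.seen += 1
            if (self.seen >= self.wait_period
                    and self.optimizations < self.max_optimizations):
                self._enter(MONITOR)   # periodic revisit
            return False
        misspeculated = taken != self.direction
        if misspeculated:
            self.counter += self.misp_step
        else:
            self.counter = max(0, self.counter - 1)
        if self.counter >= self.evict_threshold:
            self._enter(MONITOR)       # eviction from the biased state
        return misspeculated
```

With a step of +50 and -1, the counter drifts upward only when more than one execution in 51 misspeculates, that is, when the bias falls below roughly 98%, and at least 200 misspeculations are needed to reach 10,000. A branch that reaches the optimization limit stays in the unbiased state permanently because the revisit transition is disabled.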
For transitions into or out of the biased state, which are accompanied by re-optimization
requests, the latency required to make modifications to the code is modeled. Given the abun-
dance of thread-parallel resources in current and future processors, the model assumes that
re-optimization is performed in parallel with execution and hence has latency, but no over-
head. A latency of 1 million instructions is used (the functional simulations described below
have no notion of time). Thus, after a branch has been selected for speculation, the model
waits 1 million instructions before counting correct and incorrect speculations. Likewise,
when a branch is evicted from the biased state, correct and incorrect speculations continue
to be counted for the following 1 million instructions until the repaired code fragment takes
effect. The value of this latency represents an estimate of the compilation cost of a dynamic
optimizer for median-sized optimization regions (∼100 instructions).
5.4.2 Reactive Model Performance
This section demonstrates two characteristics of the model: 1) its ability to select a set of
branches on which to speculate is comparable to what is achievable by self training, and 2)
the model is rather forgiving in regard to its parameters—except that all of the transitions
must be present.
Figure 5.4: Reactive control performs comparably with self-training. The line
still represents the correct/incorrect speculation trade-off achievable through self-training.
The other marks are results from the reactive control model. square: baseline, x: no
eviction (without biased→monitor transition), +: no revisit (without unbiased→monitor
transition), circle: eviction by bias sampling, ellipse: shorter revisit period, diamond:
lower (1,000) eviction threshold, triangle: sampling. As all of the points except the x and
+ are collocated, the behavior of the model is primarily only sensitive to the presence of all
of the transitions.
As some of the changes of program behavior are only observed in long runs of the pro-
grams, the following experiments were performed in the context of a fast functional simulator
that simulates each benchmark to completion. These runs explore the behavior of a specula-
tion control mechanism in an abstract context, independent of implementation. The behavior
of each branch is tracked independently, except when modeling optimization latency.
         Static branches                total    %        misspec
Bmark    executed   biased   evicted    evicts   spec.    dist.
bzip2    282        109      6          15       44.1%    26,400
crafty   1124       396      138        276      25.1%    109,366
eon      403        95       3          3        38.3%    105,552
gap      3011       1045     167        201      52.5%    36,728
gcc      7943       2068     11         12       66.3%    20,802
gzip     314        66       7          12       35.4%    43,043
mcf      366        210      22         47       33.6%    12,896
parser   1552       284      53         124      26.3%    50,643
perl     1968       1075     58         64       63.4%    55,382
twolf    1542       440      19         22       32.1%    165,711
vortex   3484       1671     67         104      88.5%    92,163
vpr      758        340      16         38       31.6%    65,588
ave                 34%      2%         76       44.8%    65,000
Table 5.3: Model Transition Data. Only a small fraction of branches need to be evicted
from the biased state and mispredictions can be very far apart.

Figure 5.4 plots the results of these simulations in the same format as Figure 5.1. For
reference, the self-training line is shown. The performance of the model with the parameters
shown in Table 5.2 is shown by a square dot. In all benchmark runs, the performance is
competitive with self training. In gzip and mcf, the model outperforms static self training,
because it can adapt to the low frequency time-varying behavior of branches; for example,
the average bias of the middle branch in Figure 5.2 is about 60% so it should not be selected
for speculation by a static mechanism, but a reactive model can discern that its behavior
consists of two highly-biased regions, each of which can be exploited.
Table 5.3 presents results regarding how often branches transition into and out of the
biased state. Of the static (conditional) branches executed, 34 percent enter the biased state
some time during the benchmark run. Of these, about 7 percent (34% × 7% ≈ 2 percent of
executed static branches) are later evicted from the biased state. Some of these branches are
evicted more than once; the average evicted branch transitions back to the monitor state 1.6
times (total evictions/static branches evicted). Almost half of all dynamic branches can be




The above results are surprisingly insensitive to exactly how the model is implemented.
Exploration of a number of configurations showed that most changes merely shift the model’s
performance up or down along the self-training curve. Some of these sensitivity results are
included in Figure 5.4; in many cases the points in the figure overlap, emphasizing the
insensitivity:
1. Lower Misspeculation Threshold: Lowering this threshold from 10,000 to 1,000
makes the system less tolerant of branches with varying biases, leading to a more
conservative system.
2. Misspeculation Sampling: Rather than tracking each branch’s misspeculation rate
continuously, this experiment periodically re-samples the branch’s bias to make the
eviction decision. Computing the bias of 1,000 samples every 10,000 executions (a
10% duty cycle) ends up evicting more branches, resulting in a slight reduction of both
correct and incorrect speculations.
3. Sampling in “monitor” State: Using a 1-in-8 sampling rate adds a little additional
uncertainty causing a few unbiased branches to be declared biased. Larger sampling
rates can be tolerated as well by lengthening the monitor period to keep the number
of samples constant.
4. More Frequent Revisit: Lowering the revisit wait period by an order of magnitude
to 100,000 executions introduces two competing factors: 1) a reduction of time spent
by biased branches in the unbiased state (increased opportunity), and 2) branches that
are only momentarily biased are more likely to be selected and later evicted (increased
re-optimization cost).
5. Optimization Latency: All of the results discussed include a latency for optimizing
















































Figure 5.5: Misprediction rate when a biased branch transitions from being biased.
Two behaviors are common when a branch leaves the biased state. First, the branch bias
softens (bias direction stays the same, but the percentage reduces). Second, the branch
becomes perfectly biased in the other direction.
of correct speculations increases only an additional 0.1% and the number of misspec-
ulations is reduced by a factor of 1.1. This latency tolerance arises from two factors:
1) the branch in question may not be executed again for many instructions, and 2)
although the branch may not be considered highly biased, it still may be biased in the
same direction; as a result only a fraction of future executions will cause misspecula-
tions. Figure 5.5 shows the misprediction rate (fraction of branches not in the original
bias direction) in the vicinity (up to 64 branches) of a transition out of a highly biased
state. Over 50 percent of the static branches have a 30% or lower misspeculation rate
during the transition period. It is really only the 20 percent of branches that become
perfectly biased in the other direction that require quick action.
However, if the transitions back to the monitor state are removed, behavior changes significantly.
If the revisit transition (unbiased→monitor) is eliminated, the model
achieves only a little more than 80% of the correct speculations of the baseline. Removing
the eviction transition (biased→monitor) increases misspeculation rate by almost two or-




no revisit 35.8% 0.007%
lower eviction threshold 42.9% 0.015%
eviction by sampling 43.6% 0.021%
baseline 44.8% 0.023%
sampling in monitor 44.8% 0.025%
more frequent revisit (100k) 46.1% 0.033%
no eviction 53.9% 1.979%
Table 5.4: Model Sensitivity. Only the no revisit and no eviction configurations truly
differ from the baseline.
The fact that the model is so insensitive to its parameters relaxes the demands placed on
a real implementation. It means that the model can be implemented in a simplistic manner
without significant impact on performance.
Chapter 6
Atomic Regions for Managed
Languages†
This chapter focuses on common optimization opportunities in managed languages and shows
that the atomic region abstraction both trivially exposes them and is easy to implement.
Using two examples from the DaCapo Java benchmark suite, I will first depict some of
these speculative optimization opportunities and the complexity of exploiting them without
atomic regions.
I will then describe a proposed implementation of hardware atomicity for modern out-of-
order processors and how it can be used to provide the necessary primitives for the atomic
region abstraction. Likewise, I describe how the atomic region abstraction can easily be
incorporated into a modern Java Virtual Machine (JVM) to enable aggressive speculative
optimization. Finally, an evaluation of the proposed system, running Java programs from
the DaCapo benchmark suite, demonstrates potential for a 10-15% average performance
improvement.
6.1 Opportunities in Managed Languages
Even high quality managed language code has significant inefficiencies resulting from two
sources: good software engineering practice and the safety mechanisms provided by modern
languages. Good software engineering practices and an emphasis on programmer produc-
tivity demand that source code be readable, debuggable, maintainable, and reusable, which
often translates to frequent control flow and many invocations of small virtual methods.
†The content of this chapter derives from work published at the 34th International Symposium on
Computer Architecture [83].
Modern language safety features include performing NULL checks on dereferenced pointers,
array bounds checks to catch array overruns, and checked dynamic casts to ensure type
safety. While these checks rarely fail, their frequency significantly impacts the average basic
block size, as observed by the compiler.
In principle, compilers can be quite effective at mitigating these inefficiencies. Much of the
inefficiency results from having to recompute values and perform checks that are redundant or
are subsumed by other checks. Classical optimizations such as value numbering and partial
redundancy elimination address these inefficiencies when they are within an optimization
region. However, because of the frequency of branches, only a fraction of the redundancy is
within a single basic block, necessitating a larger optimization scope.
In general, the optimization opportunity exposed to a compiler increases as the scope of
optimization is grown. In other words, optimizing across multiple basic blocks (i.e., global
optimization) tends to expose more opportunities than simply optimizing within a single
basic block (i.e., local optimization). Likewise, inter-procedural optimizations expose much
more opportunity than intra-procedural optimizations.
Unfortunately, the software complexity necessary to exploit opportunities tends to grow
along with the scope of optimization and the difficulty of identifying more complex oppor-
tunities. Global optimizations depend on and must correctly interpret information from
a slew of supporting passes such as def-use, interval and alias analysis. Extending these
optimizations towards more complex opportunities is often impractical, as Chapter 3 shows.
Speculative optimization compounds the problem, as it fundamentally involves identi-
fying operations necessary only on cold paths and then moving them away from hot paths
without violating correct program execution. This delicate balancing of performance and
correctness concerns can quickly become untenable as optimization scope is widened and
the number of paths considered increases. In contrast, hardware atomicity primitives enable
atomic-region-based optimization that scales trivially to large optimization scopes containing
numerous paths. The reason is simple: the atomic region trivially exposes speculative





























Figure 6.1: An example Java method with hot and cold paths. (a) The hot path
simply checks that the index is within the current cached array segment, writes an array
element, and increments an index. (b) The control flow graph when two method calls are
inlined. (c) Superblock formation removes incoming edges from the hot path through code
replication. (d) Optimizations on the hot path can require the insertion of compensation
code blocks on exits from the hot path.
optimization opportunities to conventional compiler passes.
To help motivate this point, the remainder of this section will describe a representative
optimization opportunity from the DaCapo [11] benchmark Xalan and a more sophisti-
cated optimization opportunity taken from the DaCapo benchmark Jython. Each example
illustrates the issues which confound traditional speculative optimization approaches to ex-
ploiting these opportunities.
6.1.1 DaCapo Xalan example
Figure 6.1(a) shows a simplified control flow graph for the most frequently invoked method in
Xalan, addElement. The method inserts an integer into a SuballocatedIntVector object,
which provides an efficient implementation of an extensible vector of integers. To avoid
having to reallocate and copy the whole vector whenever the vector extends beyond its
current allocation, the object maintains an array of integer sub-arrays so that the vector can
be extended simply by allocating a new integer sub-array.
Similar to much of the managed code I analyzed, the function addElement has a fast
hot path and a slower cold path. The fast path is invoked whenever an element is inserted
into the same sub-array as was previously accessed (the software caches the most recent
sub-array); since insertions are generally to sequential elements and the sub-arrays are large,
this fast path ends up handling 99.8% of the calls. The slow path handles the rare cases
when an access is performed to a segment other than the cached one, including when new
segments are allocated. At the hottest call site, the function is called twice sequentially on
the same object, as shown below:
m_data.addElement(m_textPendingStart);
m_data.addElement(length);
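To make the hot/cold split concrete, the following is a minimal sketch of a vector that caches its most recent sub-array. It is not the Xalan source: the class `SegmentedIntVector`, its field names, the tiny chunk size, and the `slowPathCalls` counter are all assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for a suballocated int vector.
// Names and sizes are illustrative assumptions, not the Xalan implementation.
class SegmentedIntVector {
    static final int CHUNK = 4;       // tiny chunk so the slow path is easy to trigger
    int[][] chunks = new int[2][];    // table of sub-arrays, grown on demand
    int[] cached;                     // most recently used sub-array
    int cachedBase = 0;               // index of the first element held by `cached`
    int size = 0;                     // next free index
    int slowPathCalls = 0;            // demo instrumentation only

    SegmentedIntVector() {
        chunks[0] = new int[CHUNK];
        cached = chunks[0];
    }

    void addElement(int value) {
        int offset = size - cachedBase;
        if (offset >= 0 && offset < CHUNK) {
            // Hot path: check against the cached segment, store, increment.
            cached[offset] = value;
            size++;
        } else {
            // Cold path: switch to (or allocate) another segment.
            addElementSlow(value);
        }
    }

    private void addElementSlow(int value) {
        slowPathCalls++;
        int chunkIndex = size / CHUNK;
        if (chunkIndex >= chunks.length) {
            chunks = Arrays.copyOf(chunks, chunks.length * 2);
        }
        if (chunks[chunkIndex] == null) {
            chunks[chunkIndex] = new int[CHUNK];
        }
        cached = chunks[chunkIndex];
        cachedBase = chunkIndex * CHUNK;
        cached[size - cachedBase] = value;
        size++;
    }

    int get(int index) {
        return chunks[index / CHUNK][index % CHUNK];
    }
}
```

With a realistic (large) chunk size, almost every call stays on the first branch of `addElement`, which is the behavior the text attributes to the fast path.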
Inlining this method at both call sites (as shown in Figure 6.1(b)) can expose some
redundancy to the compiler. Figure 6.2(a) shows the code for the hot path a1→b1→a2→b2.
By performing superblock formation [54], which involves code replication, the compiler can
remove the incoming edge c1→a2 (shown in Figure 6.1(c)), so that it can guarantee that
execution of block b2 only occurs if block b1 was executed (i.e., b1 dominates b2). This
restructuring enables the compiler to trivially remove those operations from b2 that are
redundant with those in b1 (shown in Figure 6.2(b)). One of the optimizations applied
(constant propagation of the first ++i) effectively removes an instruction from the path
a1→b1→a2→c2; to correct for this, however, the compiler must insert a compensation
block, C, into the control flow graph as shown in Figure 6.1(d). The block C holds the
removed code as shown in Figure 6.2(c).
While these optimizations can be relatively effective, their implementation introduces a
certain amount of compiler complexity. The compiler must guarantee that any exit from the
hot path, however unlikely, will generate correct results. It must provide two key assurances
to fulfill this guarantee. First, the compiler must ensure that sufficient program state is
kept live in the hot path such that at each exit the precise program state required by the
cold path can be reconstructed. Second, it must maintain mappings from the optimized hot
path’s state to that of the cold path so that compensation code can be generated, for every
































[Figure 6.2 panels: (a) replicated, unoptimized code; (b) optimized code; (c) compensation code (block C: i += 1)]
Figure 6.2: Compiler-based redundancy removal. (a) Unoptimized code after inlining.
(b) Through superblock formation, the second copy of blocks a and b can be optimized
knowing that the first copies will already have been executed, enabling constant folding of
the first increment of i and removal of the redundant NULL check and load of the vector’s
length field. (c) The constant folding of the increment to i effectively involves downward
code motion of i += 1 past the branch in block a2, requiring compensation code to be
inserted in block C.
exit from the hot path, to undo any hot-path specific optimizations (e.g., constant folding
the increment of i in Figure 6.2).
In contrast, with a hardware atomicity primitive, the same hot path code can be generated
without encountering such complexities. Figure 6.3 shows the atomic region optimization
process for the same method. The compiler merely replicates the hot code for
execution in an atomic region; the entry to the atomic region is delimited by an instruction
(aregion_begin) that communicates the beginning of speculative execution to the hardware,
and exits from the atomic region are delimited by an instruction (aregion_end) that
instructs the hardware to commit the region’s results atomically. The compiler converts
branches to cold paths into conditional abort instructions (aregion_abort); if an abort
condition evaluates such that control should transfer to a cold path, the hardware rolls back to
the state prior to the aregion_begin, and transfers control to the original (non-speculative)
version of the code, as if speculative execution of the hot path had never occurred.
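The contract just described can be mimicked in plain Java to see why the compiler's job becomes simple. The sketch below is a software emulation only: the `RegionAbort` exception, the explicit `savedSize` snapshot, and the `IntVec` class are assumptions standing in for the hardware checkpoint and the region begin, end, and abort instructions. Real hardware would buffer and discard stores rather than rely on recovery code overwriting them.

```java
import java.util.Arrays;

// Software emulation of the atomic region contract (illustrative only).
class RegionAbort extends RuntimeException {
    static final RegionAbort INSTANCE = new RegionAbort();
    private RegionAbort() { super("region abort", null, false, false); } // no stack trace
}

class IntVec {
    int[] data = new int[8];
    int size = 0;
    int aborts = 0;   // demo instrumentation only

    // Original, non-speculative code: handles every case, including growth.
    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);   // cold path
        }
        data[size] = value;
        size++;
    }

    // Speculative copy of two inlined add() calls. Each cold branch has
    // become an abort check, so the body is straight-line code that a
    // conventional optimizer could simplify further.
    private void addTwoSpeculative(int a, int b) {
        if (size == data.length) throw RegionAbort.INSTANCE;
        data[size] = a;
        size++;
        if (size == data.length) throw RegionAbort.INSTANCE;
        data[size] = b;
        size++;
    }

    void addTwo(int a, int b) {
        int savedSize = size;            // stands in for the checkpoint at region entry
        try {
            addTwoSpeculative(a, b);     // returning normally models the commit
        } catch (RegionAbort abort) {
            aborts++;
            size = savedSize;            // emulate rollback of the region's effects
            add(a);                      // then run the original code, which also
            add(b);                      // overwrites any stale speculative store
        }
    }
}
```

Note that the speculative body never needs compensation code: any failed check simply discards the whole attempt and falls back to the unmodified `add`.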
This example is rather simple due to the relatively small scope of optimization. The next
example shows that as the scope of the optimization grows, so does the complexity disparity
between traditional speculative optimization approaches and atomic region optimization.


































[Figure 6.3 panels: (a) replicated, unoptimized code; (b) optimized code]
Figure 6.3: Atomic region optimization. (a) Unoptimized code after inlining. (b)
Through atomic region formation the hot region is wrapped with hardware atomicity primitives.
After converting cold exits into assert operations, unmodified classical constant folding
and redundancy elimination optimizations are able to constant fold the first increment of i
and remove the dynamically redundant NULL check and load of length.
6.1.2 DaCapo Jython example
Consider the inter-procedural control flow graph of the fully-optimized code generated by
a leading commercial JVM1 for the most frequently executed loop from Jython, shown
schematically in Figure 6.4(a). Despite showing only commonly taken paths, significant
complexity remains in the depicted loop. Few paths through this loop are ever executed, yet
the hottest of those paths executes 109 conditional branches and over 600 instructions. In
addition, the hottest path makes twenty calls to eight different methods.
Note that the loop shown already benefits from aggressive optimization by the JVM. The
JVM has correctly identified the hot paths and has aggressively inlined away method calls
(unoptimized, the hottest path would include eight more calls to two additional methods), but
the remaining method calls have not been inlined for two reasons. First, some of the method
calls are to virtual methods that are truly polymorphic despite being monomorphic along
the hot path. Second, a method which is called four times during each loop iteration is not
inlined because doing so would cause the static code footprint of the loop to grow to tens of
1It was politely suggested that I refrain from naming the JVM analyzed to avoid potential legal impli-









Figure 6.4: Complexity of Compiler Optimizations. Abstract inter-procedural control
flow graph from Jython (only executed paths shown). (a) As optimized by a commercial
JVM. (b) The hot path if optimized in isolation. (c) The control flow/call graph resulting
from partial-inlining and superblock formation to optimize the hot paths. (d) The control
flow/call graph using the proposed hardware support for atomic regions. Atomic regions
enable the compiler to isolate the hot path from the cold paths for the purpose of optimiza-
tion; if one of the compiler’s speculations should fail, state is rolled back to the beginning of
the atomic region and control is transferred to a non-speculative version of the code.
thousands of instructions—the static footprint of the JVM optimized loop is already in the
thousands of instructions. Similarly, the JVM has identified and eliminated many redundant
operations along the hot path. Nevertheless, many dynamically redundant type and NULL
checks (demanded by Java language semantics) remain.
Manual analysis of the hot path shows that aggressive speculative optimizations can
eliminate more than two-thirds of the dynamic instructions (Figure 6.4(b)), which translates
into both improved performance and reduced power consumption. Traditional approaches
to implementing these speculative optimizations, however, come at a significant cost in
complexity, because correct execution along all of the potential paths has to be preserved
by the compiler. For example, achieving these benefits requires a correct implementation
of partial inlining—in which only hot paths are inlined to alleviate static code bloat issues.
When discussing the correctness of their partial inliner implementation (which did not use
atomic regions), Muth and Debray remarked [81], “The flow of control in the program
resulting from partial inlining is sufficiently complex that it is no longer obvious that the
resulting program is semantically equivalent to the original.”
Figure 6.4(c) shows the simplest possible control flow graph—having aggressively elim-
inated 67 branches with redundant conditions—that achieves the desired optimization of
the hot path. Because of the difficulty of verifying the correctness of these radical program
transformations, many commercial systems do not perform speculative optimizations to this
extent.
The atomic region abstraction enables a compiler to achieve the desired optimization
with trivial complexity, as shown in Figure 6.4(d). The reason is simple: hardware atomicity
enables a software compiler to enter an intuitive optimization contract with the hardware.
The compiler identifies an optimization region and the hardware in turn guarantees to either
execute the region in its entirety or not at all. This, in turn, enables the compiler to optimize
the hot program in isolation after simply replacing the cold paths with conditional abort
instructions that verify a cold path would not have been taken. In the infrequent event that
a cold path is traversed, a conditional abort will be triggered and hardware will roll back
execution to the state prior to entering the atomic region. Hardware also transfers control
to a compiler-specified alternate version of the code that includes the eliminated cold paths.
Replacing cold paths with conditional asserts also eliminates all the dataflow dependences
from the cold path. This trivially enables classical compiler optimizations to exploit spec-
ulative optimization opportunity without being rewritten. In addition, the atomic region
abstraction simplifies the implementation of new optimizations. For example, I was able to
produce a working implementation of partial inlining in six hours, a feat I would dare not
attempt without hardware atomicity primitives.
6.2 Providing Hardware Atomicity
This section proposes an implementation for hardware atomicity which is synergistic with
other recent microarchitecture proposals. It satisfies the requirements described in Chap-
ter 2.1 while also being compatible with the out-of-order microarchitecture used as a base-
line. The implementation of hardware atomicity is similar to prior hardware checkpointing
proposals for the implementation of resource-efficient high-performance processors.
6.2.1 Checkpoints and Hardware Atomicity
Modern processors employ speculative execution and typically record information at a fine
granularity for when speculation fails and execution state needs to be restored. However,
speculation mostly succeeds, and the recorded information is not frequently needed. Check-
point processors use this observation to optimize recovery information management [4, 29,
55, 75]. They record recovery state at coarse intervals (100s of instructions) instead of at
every instruction. When a misspeculation does occur, the processor restores the checkpoint
and restarts execution, adaptively tracking information at a finer granularity after a mis-
speculation. This checkpoint abstraction obviates much of the fine-grain bookkeeping, since
execution can always restore to a safe point.
An extension of the checkpoint abstraction provides hardware atomicity, namely ensuring
that memory updates also appear to occur atomically. Hardware provides atomicity for a
sequence of instructions by ensuring that either all the instructions appear to be committed
at the same time or that none are. Specifically, the memory operations performed by an
atomic region appear to occur instantaneously, with all other memory operations in the
system appearing to occur either before or after.
Checkpoint processors do not otherwise provide the appearance of instantaneity required
by hardware atomicity. Previous proposals generally provide an execution that satisfies
the underlying memory model, the requirements of which may be weaker than atomicity.
One possible implementation that provides hardware atomicity involves the following steps:
1) creating a register checkpoint at the recovery point, 2) tracking all memory addresses
accessed by the instructions, 3) buffering all updates performed by the instructions, 4) using
an ownership-based cache coherence protocol to detect conflicting accesses from other agents,
5) discarding updates on a conflict, and 6) committing the updates in the cache atomically.
These requirements are similar to those for speculative lock elision (SLE) [88].
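The six steps can be illustrated with a toy software model. This is a sketch under stated assumptions, not the proposed hardware: memory is a map from address to value, registers are an `int[]`, and the class `ToyAtomicRegion` with its methods `load`, `store`, `observeRemoteAccess`, and `end` are hypothetical names. A real implementation would track cache lines and rely on the coherence protocol rather than explicit calls.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy software model of the six steps; every name here is a hypothetical illustration.
class ToyAtomicRegion {
    private final Map<Integer, Integer> memory;                         // shared memory: address -> value
    private final Set<Integer> readSet = new HashSet<>();               // step 2: addresses read
    private final Map<Integer, Integer> writeBuffer = new HashMap<>();  // steps 2 and 3: buffered updates
    private final int[] registers;
    private final int[] checkpoint;                                     // step 1: register checkpoint
    private boolean conflicted = false;

    ToyAtomicRegion(Map<Integer, Integer> memory, int[] registers) {
        this.memory = memory;
        this.registers = registers;
        this.checkpoint = registers.clone();
    }

    int load(int addr) {
        readSet.add(addr);
        Integer buffered = writeBuffer.get(addr);   // a region observes its own stores
        return buffered != null ? buffered : memory.getOrDefault(addr, 0);
    }

    void store(int addr, int value) {
        writeBuffer.put(addr, value);               // invisible to other agents until commit
    }

    // Step 4: models a coherence request from another agent.
    // A remote write conflicts with our reads and writes; a remote read
    // conflicts only with our buffered writes.
    void observeRemoteAccess(int addr, boolean isWrite) {
        if (writeBuffer.containsKey(addr) || (isWrite && readSet.contains(addr))) {
            conflicted = true;
        }
    }

    // Step 5 (discard on conflict) and step 6 (publish all updates together).
    // Returns true if the region committed.
    boolean end() {
        if (conflicted) {
            writeBuffer.clear();
            System.arraycopy(checkpoint, 0, registers, 0, registers.length);
            return false;
        }
        memory.putAll(writeBuffer);
        return true;
    }
}
```

In this model a caller that sees `end()` return false would re-execute the non-speculative version of the code from the restored register state.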
A practical implementation of hardware atomicity should expose a simple set of primi-
tives to software. The proposed implementation uses exactly the instruction set primitives
proposed in Chapter 2.1.
6.2.2 Microarchitectural Implications
A critical aspect of hardware atomicity is its synergy with recent microarchitectural propos-
als. This is important since any proposal for exposing hardware mechanisms to software must
also be amenable to high-performance implementation. A simple abstraction also provides
significant flexibility to hardware designers. The atomic region abstraction allows hardware
to execute the code region in whatever way seems fit, as long as when an abort condition
occurs, the execution restores to the beginning of the region with appropriate information
in the appropriate registers.
The proposed implementation of hardware atomicity leverages similarities with a check-
point processor. The common and fast path execution for hardware atomicity must be fast
and have low overhead, and these requirements can be satisfied by a checkpoint processor
with atomicity support.
Various implementation strategies based on checkpoint architectures exist for providing
the memory requirements of hardware atomicity. In this implementation, the data cache
retains the data footprint of the atomic region and a register rename table checkpoint is
used for recovering register state. Each cache line is extended with two bits for tracking
which addresses have been read and written in the atomic region. These addresses are
exposed to the coherency mechanism to observe invalidations. Flash clear operations are
used to commit and/or abort speculative state.
While support for hardware atomicity may appear similar to hardware support for trans-
actional memory [67], significant differences in requirements and usage exist, resulting in
different hardware implementation requirements. Transactional memory is proposed pri-
marily for scalability and can potentially tolerate some loss of single-thread performance to
achieve this scalability. In contrast, the atomic region abstraction uses hardware atomicity
to improve single-thread performance, and any hardware execution overheads will reduce
the benefit of these optimizations. Section 5 discusses the implications of simplified imple-
mentations on the usage model. Use of a checkpoint execution substrate for implementing
hardware atomicity enables a nearly no-overhead common case execution and permits mul-
tiple atomic regions to be in-flight simultaneously.
Apart from the performance goal of fast common case execution, the atomic region
abstraction simplifies the functionality required from the hardware implementation. Since
the hardware is used opportunistically to improve the performance of a single thread, a best
effort implementation is sufficient.
6.3 Forming and optimizing regions
This section focuses on how the compiler uses the atomic region abstraction to generate better
code. Specifically, it shows how support for atomic regions can be introduced into
a compiler without significant changes, presents an algorithm for selecting appropriate atomic regions
while achieving good program coverage, explains why assertions constrain optimization significantly
less than branches, and describes the optimizations enabled by atomic regions.
Atomic regions and abort as try/catch: Modern languages like Java generally provide
support for structured exception handling, which in Java takes the form of try and catch
blocks. These primitives enable the programmer to specify one block of code that should be
executed assuming that no exceptions occur and another one to be executed on an excep-
tion. To support these language features, a compiler must be able to represent them in its
intermediate representation (IR).
One of the most important observations made in this work is that the IR support for
try and catch is similar to what is required to represent both atomic regions and the abort
path to non-speculative recovery code. This observation reduces the problem of supporting
software speculation within the compiler to that of simply transforming the program’s con-
trol flow graph so that atomic regions look like try blocks and non-speculative recovery code
looks like a catch block. As a result, unmodified optimizations can exploit the speculative
optimization opportunities exposed by the atomic region. The entire implementation of the
atomic region abstraction (including all transformations and optimizations) required approx-
imately 3,000 lines of code (LOC) (∼3% of the optimizing compiler), roughly two-thirds of
which is the region selection algorithm. While the complexity of atomic region formation
corresponds closely to that reported by Hwu et al. for superblock formation (2,000 LOC),
their superblock optimizations incurred an additional 12,000 LOC [54].
Region formation: In selecting regions for optimization, three properties are maintained:
1) overly large regions must be avoided, 2) atomic regions must not be nested, and 3) atomic
regions will be single-entry, multiple-exit subgraphs, containing arbitrary intraprocedural
control flow. The first property permits a best-effort implementation of atomicity (i.e.,
atomic regions that overflow the cache or receive an interrupt will abort) and bounds the
lost effort when a region aborts. Nesting is avoided, in part, to demonstrate that its support
is not a hardware requirement. In addition, nesting only occurs as a result of encapsulating
a non-inlined call within an atomic region, and I have yet to observe a case where this will
significantly improve optimization. The last property simplifies region formation by building
upon other well understood single-entry techniques, without the control flow limitations
imposed by building regions from traces [37,54,86].
The process of region formation is fundamentally a profile-driven one. The goal is to select
regions for optimization that exclude infrequently executed (or “cold”) code paths. As is
typically done in JVMs, the first-pass compiler inserts instrumentation to profile program
behaviors (e.g., branches, virtual calls). For this implementation, cold paths are those with
a branch bias of less than 1%; these paths will be removed from atomic regions. All other
paths are non-cold.
The region formation algorithm has five steps:
Step 1. Aggressively inline methods
Step 2. Select region boundaries (See Algorithm 6.1)
Step 3. Replicate flowgraphs for selected regions
Step 4. Convert cold edges into asserts
Step 5. Remove all inlined methods from non-speculative paths
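As an illustration of Step 4, the sketch below classifies the out-edges of a block using the 1% cold threshold stated above and prints a textual stand-in for the rewritten terminator. The classes `ProfiledEdge` and `ColdEdgeSplitter`, and the string output format, are assumptions for illustration; they do not reflect the compiler's actual IR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical profiled CFG edge: a target block name and an execution count.
class ProfiledEdge {
    final String target;
    final long count;
    ProfiledEdge(String target, long count) {
        this.target = target;
        this.count = count;
    }
}

class ColdEdgeSplitter {
    // An edge is cold when it carries less than 1% of its source block's
    // executions. Integer arithmetic avoids floating-point boundary issues.
    // Treating edges of a never-executed block as cold is an assumption.
    static boolean isCold(long edgeCount, long blockCount) {
        return blockCount == 0 || edgeCount * 100 < blockCount;
    }

    // Describes what happens to each out-edge in the speculative copy:
    // non-cold edges remain branches, cold edges become abort checks.
    static List<String> rewrite(List<ProfiledEdge> outEdges) {
        long total = 0;
        for (ProfiledEdge e : outEdges) {
            total += e.count;
        }
        List<String> result = new ArrayList<>();
        for (ProfiledEdge e : outEdges) {
            if (isCold(e.count, total)) {
                result.add("assert not taken: " + e.target);
            } else {
                result.add("branch: " + e.target);
            }
        }
        return result;
    }
}
```

For a block whose edges were profiled at 998 and 2 executions, the second edge falls below the threshold and would be replaced by an assert, leaving a single-successor block in the speculative copy.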
The first step enlarges the optimization scope by aggressively inlining methods. This
can be done without fear of the “code bloat” typically associated with inlining for two
reasons. First, an inlined method along a speculative path will only be retained if it is contained
entirely within an atomic region (inlined methods that do not satisfy this criterion are
pruned away). Second, the remaining inlined methods will have their infrequently executed
paths speculatively removed, enabling the retained paths to be further reduced in size by
optimization.
The next step, selection of region boundaries, is the crux of region formation and is
detailed in Algorithm 6.1; here I give an overview of its operation. The goal of boundary selection is
to identify a set of blocks that will become the entry and exit points for atomic regions. To
be more precise, it focuses on identifying blocks that should become atomic region entries.
The placement of atomic region exits is largely born of necessity (i.e., atomic regions may
be terminated to maintain the three previously mentioned invariants).
Placement of atomic region boundaries starts by considering loops to decide whether
individual loop iterations should be executed in atomic regions or whether the whole loop
Algorithm 6.1 Selection of atomic region boundaries
procedure SelectBoundaries(method)
    selectedBoundaries ← ∅    ▷ Set of blocks
    // Place region boundaries at the headers of large loops (i.e. those with long iterations
    // or high trip counts) and loops containing calls reachable along non-cold paths
    L ← LoopsInPostOrder(method)    ▷ Process loops from innermost to outermost





        // LoopWeight defined in Algorithm 6.2
        loopPathLength ← LoopWeight(loop) / GetExecCount(loopPreHeader)
        if (loopPathLength ≥ LoopPathThreshold) or hasWarmCall then
            selectedBoundaries ← selectedBoundaries ∪ {loopHeader}
    // Prune inlined methods that contain selected loops or calls reachable along non-cold paths.
    // This limits unnecessary code bloat and is part of my partial inlining implementation





        hasSelectedLoop ← (selectedBoundaries ∩ inlinedBlocks) ≠ ∅
        if hasWarmCall or hasSelectedLoop then
            UnInlineMethod(inlinedMethod)
    // Place region boundaries along acyclic paths
    visited ← ∅
    traceBoundaries ← GetEntryBlock(method) ∪ GetExitBlock(method)
    traceBoundaries ← traceBoundaries ∪ GetCallBlocks(method)
    maxBlockExecCount ← GetMaxBlockExecCount(method)
    B ← BlocksSortedByExecCount(method)    ▷ Process by block execution frequency
    foreach block in B do
        if block ∉ visited and GetExecCount(block) ≥ (maxBlockExecCount/100) then
            traceBoundaries ← traceBoundaries ∪ selectedBoundaries
            // TraceDominantPath defined in Algorithm 6.2
            dominantPath ← TraceDominantPath(block, traceBoundaries)
            // Selects boundaries that minimize Equation 6.1
            acyclicBoundaries ← SelectAcyclicBoundaries(dominantPath)
            selectedBoundaries ← selectedBoundaries ∪ acyclicBoundaries
            visited ← visited ∪ dominantPath
    return selectedBoundaries
Algorithm 6.2 Used during selection of atomic region boundaries
// Generate an ordered list containing the most frequently executed
// path through the specified block. Stop tracing at selected
// boundaries and trace boundaries
procedure TraceDominantPath(seedBlock, traceBoundaries)
    dominantPath ← [seedBlock]
    traceBlock ← seedBlock; done ← false
    while ¬done do
        traceBlock ← GetDominantOutEdge(traceBlock)
        dominantPath ← dominantPath + [traceBlock]
        if traceBlock ∈ traceBoundaries then
            done ← true
    traceBlock ← seedBlock; done ← false
    while ¬done do
        traceBlock ← GetDominantInEdge(traceBlock)
        dominantPath ← [traceBlock] + dominantPath










        foreach edge in GetOutEdges(currBlock) do
            if GetProb(edge) ≥ coldThreshold then
                successorBlock ← GetTarget(edge)
                if successorBlock ∈ searchScope then




    foreach block in GetBlocks(loop) do
        blockExecCount ← GetExecCount(block)
        numBlockOps ← GetNumOperations(block)
        weight ← weight + (blockExecCount ∗ numBlockOps)
    return weight
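The two helpers of Algorithm 6.2 can be approximated by the runnable sketch below. It is an illustration under assumptions rather than the implementation used in this work: the `ToyCfg` class, its string block names, and its edge-count maps are hypothetical, and the trace additionally stops at dead ends and at blocks already on the path so that it always terminates. Ties between equally hot edges are broken by block name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy CFG with profiled edge counts. All names are hypothetical.
class ToyCfg {
    private final Map<String, Map<String, Long>> succ = new TreeMap<>();
    private final Map<String, Map<String, Long>> pred = new TreeMap<>();

    void addEdge(String from, String to, long count) {
        succ.computeIfAbsent(from, k -> new TreeMap<>()).put(to, count);
        pred.computeIfAbsent(to, k -> new TreeMap<>()).put(from, count);
    }

    // Neighbor reached over the most frequently executed edge, or null if none.
    private static String dominant(Map<String, Long> edges) {
        String best = null;
        long bestCount = -1;
        if (edges != null) {
            for (Map.Entry<String, Long> e : edges.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
        }
        return best;
    }

    // Ordered list of blocks on the hottest path through `seed`,
    // traced forward and then backward until a trace boundary is reached.
    List<String> traceDominantPath(String seed, Set<String> traceBoundaries) {
        List<String> path = new ArrayList<>();
        path.add(seed);
        extend(path, seed, succ, traceBoundaries, true);
        extend(path, seed, pred, traceBoundaries, false);
        return path;
    }

    private static void extend(List<String> path, String seed,
                               Map<String, Map<String, Long>> edges,
                               Set<String> traceBoundaries, boolean forward) {
        String block = seed;
        while (true) {
            String next = dominant(edges.get(block));
            if (next == null || path.contains(next)) {
                return;   // dead end or cycle: extra stop conditions (assumption)
            }
            if (forward) {
                path.add(next);
            } else {
                path.add(0, next);
            }
            if (traceBoundaries.contains(next)) {
                return;
            }
            block = next;
        }
    }

    // Sum over the loop's blocks of (execution count x operation count).
    static long loopWeight(long[] blockExecCounts, int[] blockOpCounts) {
        long weight = 0;
        for (int i = 0; i < blockExecCounts.length; i++) {
            weight += blockExecCounts[i] * blockOpCounts[i];
        }
        return weight;
    }
}
```

Dividing `loopWeight` by the execution count of the loop pre-header gives the average dynamic path length per loop entry, which is the quantity Algorithm 6.1 compares against its threshold.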
should be encapsulated within a single atomic region. There are two factors which influence
the decision: the dynamic path length through the loop and whether the loop contains a call
statement which will not be inlined or speculatively removed. The algorithm chooses per-
iteration atomic regions when loop iterations are large or if the average number of iterations
executed is high enough that the region might overflow the cache. Because atomic regions
are terminated at non-inlined calls, to avoid nesting, and a new atomic region often begins
immediately after a call return, an atomic region boundary is also inserted in the header of
any loop containing a call, to prevent the creation of irreducible flowgraphs [47]. Note that
calls which are inlined or which will be speculatively removed are ignored.
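The loop decision described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the names Block, Loop, loop_weight, and needs_header_boundary are hypothetical, the weight follows the execution-count times operation-count sum that appears in the surviving algorithm fragment, and comparing the average dynamic operations per loop entry against LoopPathThreshold is an assumed way of capturing both "large iterations" and "many iterations".

```python
from dataclasses import dataclass
from typing import List

# Assumed value: the chapter later reports setting LoopPathThreshold to
# 200 high-level IR operations.
LOOP_PATH_THRESHOLD = 200


@dataclass
class Block:
    exec_count: int  # profiled execution count of the block
    num_ops: int     # IR operations in the block


@dataclass
class Loop:
    blocks: List[Block]
    entry_count: int         # profiled number of times the loop was entered
    has_residual_call: bool  # a call that is neither inlined nor speculatively removed


def loop_weight(loop: Loop) -> int:
    """Dynamic operation count of the loop: sum of exec_count * num_ops."""
    return sum(b.exec_count * b.num_ops for b in loop.blocks)


def needs_header_boundary(loop: Loop) -> bool:
    """Return True to place a boundary in the loop header (per-iteration regions)."""
    if loop.has_residual_call:
        # Regions end at non-inlined calls, so the header also gets a boundary.
        return True
    # Hypothetical test: average dynamic operations executed per loop entry.
    ops_per_entry = loop_weight(loop) / max(loop.entry_count, 1)
    return ops_per_entry > LOOP_PATH_THRESHOLD
```

A short loop with few iterations stays inside one region, while a loop whose dynamic path grows past the threshold, or that contains a residual call, is split at its header.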
Next, the region selection un-inlines any aggressively inlined methods that will not be
completely encapsulated in an atomic region; this step prevents code bloat resulting from
the method needing to be fully duplicated on an atomic region’s non-speculative path. If an
aggressively inlined method includes an atomic region boundary (from the previous step) or
a call statement which will not be speculatively removed, it is un-inlined.
The last part of the boundary selection algorithm places boundaries along acyclic paths.
The algorithm iteratively selects the hottest block that has not already been visited and
traces the dominant path through the block, terminating the trace at already selected region
boundaries or at the CFG entry and exit or any call continuation. All loop pre-headers
and loop exits contained on the dominant path, as well as the start and end of the path,
become candidates for boundary selection. The algorithm selects the subset of the candidate
boundaries that minimizes Π in Equation 6.1 where R is the desired region size and rn is
the size of the nth candidate region. Equation 6.1 increases in value for regions further
from the desired size but is also biased towards selecting oversized regions by the rn term in
the denominator. This equation was originally proposed for the task selection algorithm of




















































R · rn (6.1)
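The dominant-path trace used by this last step (Algorithm 6.2) can be sketched in Python as follows. This is an illustrative sketch only: the CFG encoding (dictionaries mapping a block to a list of (neighbor, edge count) pairs), the helper dominant, and the guards that stop the trace when a block has no edges or the trace would revisit a block are all assumptions added to keep the example self-contained.

```python
def dominant(edges):
    """Return the neighbor reached by the most frequently executed edge, or None."""
    if not edges:
        return None
    return max(edges, key=lambda e: e[1])[0]


def trace_dominant_path(seed, succs, preds, boundaries):
    """Trace forward and then backward from seed along dominant edges.

    Tracing stops in each direction once a boundary block has been added.
    """
    path = [seed]
    on_path = {seed}

    # Forward trace: follow the dominant out-edge.
    block = seed
    while True:
        nxt = dominant(succs.get(block, []))
        if nxt is None or nxt in on_path:  # assumed guard against dead ends and cycles
            break
        path.append(nxt)
        on_path.add(nxt)
        if nxt in boundaries:
            break
        block = nxt

    # Backward trace: follow the dominant in-edge.
    block = seed
    while True:
        prv = dominant(preds.get(block, []))
        if prv is None or prv in on_path:  # assumed guard, as above
            break
        path.insert(0, prv)
        on_path.add(prv)
        if prv in boundaries:
            break
        block = prv

    return path
```

The returned list is ordered from the earliest block on the path to the latest, which is the form the boundary selection step needs in order to enumerate candidate regions between consecutive candidate boundaries.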
Once region boundaries have been selected, Step 3 creates atomic regions by performing
a depth-first search, ignoring cold paths, starting from each selected region boundary
and stopping at other selected region boundaries, the CFG exit, and any call statements.
The visited blocks are copied, an aregion_begin is placed at the entry of the duplicated
region, and an aregion_end is placed at each region exit. All edges into the original region
entry block are redirected to the aregion_begin, and an exception edge is added from the
aregion_begin to the original region entry block. Figure 6.5(b) shows the result of this step
(partial loop unrolling has also been applied to the outer loop).
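The search in Step 3 can be sketched in Python. This is a hedged illustration, not the dissertation's code: the function name collect_region, the edge encoding as (target, probability) pairs, and the cold_threshold default of 0.01 are assumptions chosen for the example. Call blocks and the CFG exit are assumed to be passed in as members of stops.

```python
def collect_region(start, out_edges, stops, cold_threshold=0.01):
    """Depth-first search from a region boundary.

    Returns (blocks, exits, cold):
      blocks: blocks to copy into the atomic region
      exits:  edges leaving the region (an aregion_end would go on each)
      cold:   edges skipped as cold (later converted into asserts)
    """
    blocks, exits, cold = [], [], []
    seen = set()
    stack = [start]
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        blocks.append(block)
        for target, prob in out_edges.get(block, []):
            if prob < cold_threshold:
                cold.append((block, target))   # ignore cold paths
            elif target in stops:
                exits.append((block, target))  # stop at boundaries, exit, calls
            else:
                stack.append(target)
    return blocks, exits, cold
```

Keeping the cold edges and exit edges as separate results mirrors the later steps: exits receive an aregion_end, while cold edges are the branches that Step 4 converts into asserts.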
The remaining steps of atomic region formation convert cold branches into asserts (Step 4)
and replace aggressively inlined methods on non-speculative paths with call statements
(Step 5).
As initially stated, region formation should avoid generating large atomic regions. I
found that setting LoopPathThreshold and R to a value of 200 high-level intermediate
representation operations² satisfies this property without sacrificing much opportunity.
Why asserts constrain optimization less than branches do: One of the final steps of
region formation converts branches from the hot path to the cold paths into assertions in the
compiler’s intermediate representation. These assertions constrain optimizations less than
branches, because the assertion operations are implemented in the high-level IR as simple
operations that have only source operands and no side effects. Like an ALU operation
that produces no value and unlike branches, an assert can be completely ignored when
optimizing other data-independent instructions. Furthermore, asserts can be optimized by
existing passes: they can be freely scheduled across branches, limited only by their data
dependences and the boundaries of the atomic region, and redundant asserts are eliminated
by existing redundancy elimination passes such as global value numbering. Only dead code
elimination needs a slight modification to consider asserts essential so that they will not be
removed despite having no dataflow consumers.
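The one change to dead code elimination mentioned above can be illustrated with a small mark-and-sweep pass in Python. This is a sketch over a hypothetical straight-line IR (the Op record and the operation kinds are invented for the example), not the compiler's actual pass. The only point it demonstrates is that listing assert among the essential kinds keeps the assert, and the values it depends on, alive even though the assert has no dataflow consumers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Op:
    dest: str                    # name of the value produced ("" if none)
    kind: str                    # e.g. "load", "cmp", "add", "assert", "ret"
    srcs: Tuple[str, ...] = ()   # names of the values consumed


def eliminate_dead_code(ops: List[Op], essential=("store", "ret", "assert")) -> List[Op]:
    """Keep essential ops and everything they transitively depend on."""
    producer = {op.dest: i for i, op in enumerate(ops) if op.dest}
    live = set()
    worklist = [i for i, op in enumerate(ops) if op.kind in essential]
    while worklist:
        i = worklist.pop()
        if i in live:
            continue
        live.add(i)
        for src in ops[i].srcs:
            if src in producer:          # values defined outside the list are ignored
                worklist.append(producer[src])
    return [op for i, op in enumerate(ops) if i in live]
```

With assert treated as essential, a null check such as load, compare, assert survives while an unused add is removed; dropping assert from the essential set would silently delete the check.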
Atomic regions enable optimizations: The guarantees provided by atomic regions
enabled us to implement several additional optimizations: partial inlining, partial loop
unrolling and speculative lock elision³ [88]. The implementations of partial inlining and partial
loop unrolling were enabled by the design simplicity offered by atomic execution, and
speculative lock elision was enabled by the atomicity and isolation guarantees provided by
hardware. The relatively small amount of code required to implement these optimizations (∼200
LOC each for partial inlining and partial loop unrolling, and ∼400 LOC for speculative lock
elision) demonstrates the simplicity offered by atomic regions.
Partial inlining exposes additional opportunity by enlarging the optimization scope, but
limits static code expansion by obviating the need to inline infrequently executed paths in
the method. Partial loop unrolling has similar benefits. However, implementing either
optimization without atomic regions overly burdens the compiler writer with the responsibility
of guaranteeing that correct program state can be recovered and forward progress made if
an infrequent path is executed. With atomic region support, the implementation of both
partial inlining and partial loop unrolling is greatly simplified. The hot paths of inlined
methods and loops are simply wrapped in atomic regions and the infrequent paths are converted
into assertions. If an infrequent path is executed, an assert will fire and hardware will redirect
execution to the corresponding non-speculative code, which has not been inlined or unrolled.
² There is a loose correspondence between IR operations and the number of hardware instructions
actually generated.
³ SLE is used to reduce monitor overhead, but this optimization would also reduce
monitor-induced serialization in multithreaded workloads.
Speculative lock elision (SLE) exploits opportunity exposed by my atomic region forma-
tion. Atomic regions often contain balanced pairs of Java monitor enter and exit operations,
and these monitors are typically uncontended. The JVM used already provides fast-path im-
plementations for common lock behaviors using reservation locks [61], but even the fastest
path must still check the status of the lock and update it with a store (both at monitor
entry and monitor exit) to track lock nesting depth. This monitor overhead can be
eliminated with atomic regions: when a balanced pair of monitor operations is contained within
an atomic region, my implementation of SLE need only load the value of the lock upon
monitor entry and verify (with a compare and branch) that it is not held by another thread.
In the common case, no action is needed at the monitor exit. This improvement to
single-thread performance is in addition to any concurrency benefits from optimistically executing
a synchronized method/block.
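The difference between the ordinary monitor fast path and the elided form can be sketched in Python. This is a toy model under loud assumptions: Monitor, RegionAbort, and both helper functions are hypothetical, the stores counter exists only to make the saved lock-word updates visible, and raising RegionAbort merely stands in for a hardware abort (a real abort would also roll back the region's effects and rely on hardware isolation to detect a conflicting acquire, neither of which is modeled here).

```python
class RegionAbort(Exception):
    """Stands in for a hardware abort that redirects to the recovery path."""


class Monitor:
    def __init__(self):
        self.owner = None  # id of the thread holding the lock, or None
        self.depth = 0     # lock nesting depth
        self.stores = 0    # stores to the lock word, counted for illustration

    def enter(self, tid):
        # Non-speculative fast path: check the lock, then update it with a store.
        if self.owner not in (None, tid):
            raise RuntimeError("contended; a real JVM would block here")
        self.owner = tid
        self.depth += 1
        self.stores += 1

    def exit(self, tid):
        self.depth -= 1
        self.stores += 1
        if self.depth == 0:
            self.owner = None


def synchronized_nonspeculative(mon, tid, body):
    mon.enter(tid)
    try:
        return body()
    finally:
        mon.exit(tid)


def synchronized_elided(mon, tid, body):
    # Elided entry: a load plus a compare and branch, with no store.
    if mon.owner not in (None, tid):
        raise RegionAbort()
    # Elided exit: nothing to do in the common case.
    return body()
```

In the uncontended case the elided version performs no stores to the lock word, whereas the non-speculative version performs two (one at entry, one at exit).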
6.4 Experimental method
Evaluating the performance impact of run-time compiler enhancements using new hardware
features presents a number of challenges. First, in the absence of real hardware, a full-system


















Figure 6.6: Performance analysis infrastructure.
many system features. Second, because the compilation is performed during the program
run, the benchmark runs have to be sufficiently long for the staged optimizer to produce the
fully optimized code. Third, in order to compare the performance of two different compilers
it is necessary to select equivalent regions of the program’s execution to make an “apples-to-
apples” comparison. Figure 6.6 depicts the evaluation infrastructure, which was developed
to overcome these obstacles.
The benchmark being evaluated is executed using a modified version of the Apache
Harmony Dynamic Runtime Layer Virtual Machine (DRLVM) for Java [5] on the SoftSDV
full-system simulator [101], which has been extended to support the ISA extensions for
atomic regions⁴ discussed in Section 6.2. I use the DRLVM server execution manager
configuration to maximize code quality; the whole process is completely automatic and
profile driven. A functional simulation is run for a sufficiently long duration to allow for all
initial compilation to be performed and for the staged optimizer to generate fully optimized
code for commonly executed methods.
Once a representative portion of the execution is reached, the state of the functional
simulation is recorded for use in a timing simulation. The format of the state recorded,
known as a long instruction trace (LIT) [94], contains a snapshot of the initial processor
architectural state and memory as well as a trace of all system interrupts necessary to
simulate system events such as DMA traffic. The LIT is consumed by a detailed
execution-driven simulator coupled to a micro-operation (uop) level x86 architecture model⁵. This
simulator accurately models a detailed memory subsystem, wrong path execution, interrupts,
system interactions, and DMA events. The baseline 4-wide OOO (out-of-order) processor
parameters are shown in Table 6.1. A checkpoint execution substrate, similar to that of
checkpoint processors, provides atomic execution.
⁴ For debugging the compiler, I also developed a means to test on real machines by registering
a signal handler for invalid opcode exceptions (triggered by the unrecognized aregion_begin
instructions) that inspects the faulting instruction and branches immediately to the
(non-speculative) recovery path.

Parameter                    Value
Processor frequency          4.0 GHz
Rename/issue/retire width    4/4/4
Branch mispred. penalty      20 cycles
Instruction window size      128
Scheduling window size       64
Load/store buffer sizes      60/40
Functional units             Pentium® 4 equivalent
Branch predictor             combine: 64K gshare/16K bimod
Hardware data prefetcher     Stream-based (16 streams)
Trace Cache                  64 K-uops, 8-way
I-TLB                        128 entries
D-TLB                        64 entries, 4-way
L1 Data Cache                32 KB, 4-way, 4 cycle hit, 64B line
L2 Unified Cache             4 MB, 8-way, 20 cycle hit, 64B line
L1/L2 Line size              64 bytes
Memory latency               100 ns
Table 6.1: Baseline processor parameters
Because different compilation approaches are being compared, equivalent regions of the
benchmark must be selected for evaluation. To accomplish this, the compiler has been
modified to insert special markers that are understood by the full-system simulator. These
markers bound equal work at the Java bytecode level, thus allowing for a fair comparison.
Selecting good marker locations first requires collecting a complete trace of method invoca-
tions from the benchmark’s execution. This trace can then be divided into groups of 10,000
methods and run through the SimPoint 3.0 phase classification tool [45] to identify phases.
⁵ This work was performed while on an internship at Intel and used a proprietary simulator.

Benchmark   Description                                # samples
antlr       Generates parser/lexical analyzer          4
bloat       Bytecode analysis and optimization tool    4
fop         Parses/formats XSL*-FO to generate PDF*    2
hsqldb      Executes JDBCbench*-like benchmark         1
jython      Interprets pybench Python benchmark        1
pmd         Analyzes a set of Java classes             4
xalan       Converts XML* documents into HTML*         1
Table 6.2: DaCapo benchmarks used in evaluation. # samples = number of samples used in
evaluation.
For up to four phases per benchmark, a marker method is also identified that can be used to
bound a simulation sample and that is infrequently invoked (so that it minimally perturbs
the execution). Three dynamic invocations of this marker method are used to identify the
sample: i) the beginning of the warm-up period, ii) the end of warm-up/the beginning of
timing simulation, and iii) the end of timing simulation. This method shares similarities
with concurrent work [87].
When the JVM is invoked it is also passed a marker method identifier (the class name,
method name, and call signature). The JVM compares this identifier to each method it compiles
and inserts the marker into the matching method’s prologue. While the exact number varies,
warm-up and simulation intervals are selected to contain on the order of millions to tens
of millions of instructions. For benchmarks with multiple phases, results are weighted for
each sample by its phase’s contribution to the overall execution. Note that non-deterministic
benchmarks are not supported because the number of marker method invocations can change
for different compiler configurations.
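The phase weighting described above amounts to a weighted combination of per-sample results. A minimal Python sketch follows; the function name weighted_result and the choice of a weighted arithmetic mean are assumptions, since the text does not state which averaging formula was used, and the numbers in the usage note are made up for illustration.

```python
def weighted_result(samples):
    """Combine per-sample metrics using each phase's share of execution.

    samples: list of (phase_weight, metric) pairs. Weights are normalized
    here, so they need not sum to one.
    """
    total = sum(weight for weight, _ in samples)
    if total <= 0:
        raise ValueError("phase weights must sum to a positive value")
    return sum(weight * metric for weight, metric in samples) / total
```

For example, three hypothetical samples with weights 0.5, 0.25, and 0.25 and metrics 1.2, 1.0, and 1.4 combine to a weighted result of 1.2.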
The DaCapo benchmark suite [11] is used for the evaluation (version dacapo-2006-10).
The suite is intended for evaluation of JVMs by the programming languages, memory man-
agement, and computer architecture communities and consists of a set of open source, real
world applications with non-trivial memory working sets. Table 6.2 lists the benchmarks
used and their descriptions. The remaining benchmarks are not included due to experimental
method limitations: chart and eclipse were too long running, luindex’s samples could
not be validated in time, and lusearch is non-deterministic. To avoid non-determinism in
xalan, the single-threaded version from the beta-2006-08 release of the benchmarks is used.
To work around a bug in DRLVM, the version of jython is from the same beta release.
6.5 Results
The following analysis focuses on two metrics: performance and dynamic micro-operation
(uop) count reduction. The performance metric simply compares execution time of the
sampled regions. Reduction in dynamic uop counts is also measured because, in general, it
will translate into improved energy efficiency. Fewer uops flowing down the pipeline result in
less switching activity, which in turn results in a reduction in the amount of energy consumed
to perform a given unit of program work.
First, two compiler configurations are compared: no-atomic, a baseline set of optimiza-
tions that corresponds closely to Harmony’s default server configuration, and atomic, which
exploits hardware-supported atomic regions. The optimization passes enabled are the same
for both, except atomic performs atomic region formation, partial inlining, partial loop
unrolling, and speculative lock elision.
As shown in Figure 6.7, these optimizations enable a significant (12% average) speedup
across the evaluated benchmarks. Furthermore, these speedups are accompanied by a nearly
comparable reduction in the number of uops retired, as shown in Figure 6.8. By providing
a simple recovery abstraction (atomic regions) the hardware has facilitated the compiler’s
generation of higher performance and more efficient code.
Hand inspection of the generated atomic regions uncovered clear evidence of speculative
optimization, even though the compiler does not explicitly implement any. For example,
in one atomic region elimination of cold paths enabled the compiler to simplify an indirect




























[Bar chart. Legend entries recovered: no-atomic + aggr. inline; atomic + aggr. inline.
X-axis: antlr, bloat, fop, hsqldb, jython, pmd, xalan, average.]
Figure 6.7: Execution time speedups. All runs use the same hardware configuration, and
performance differences result from increased optimization effectiveness. The second set of
bars for jython demonstrate a benefit of more optimistic region formation (Section 6.5.1).
asserts), eliminate branches via constant propagation previously inhibited by cold control
flow, eliminate partially redundant loads, and eliminate partially redundant checks.
Some of the benefits in other regions, however, were merely the result of increasing
the scope of optimization through partial inlining and partial loop unrolling beyond what
the baseline inliner and loop unroller were exposing. In order to demonstrate that this
scope enlargement is not responsible for all of the benefits, two additional configurations
are compared: no-atomic and atomic further configured with unrealistically large inlining
thresholds (a factor of five larger than the baseline), which should achieve the optimization
potential resulting simply from scope enlargement. The results for these configurations are
also shown in Figures 6.7 and 6.8. Note that both partial inlining and partial loop unrolling
are disabled in the atomic+aggressive inlining configuration.
From this data it can be seen that the atomic region-based optimizations achieve
more than just scope enlargement: the speedup of no-atomic+aggressive
inlining is less than half that of the atomic+aggressive inlining case. In fact,
increasing the optimization scope appears to disproportionately benefit the atomic+aggressive
inlining case, as its speedup (25.3%) is more than the sum of those of atomic and


























[Bar chart. Legend entries recovered: no-atomic + aggr. inline; atomic + aggr. inline.
X-axis: antlr, bloat, fop, hsqldb, jython, pmd, xalan, average.]
Figure 6.8: Micro-operation (uop) reduction. All runs use the same hardware
configuration, and differences result from increased optimization effectiveness. The second set of
bars for jython demonstrate a benefit of more optimistic region formation (Section 6.5.1).
6.5.1 Understanding the Variation
Clearly, atomic regions do not uniformly benefit all of the benchmarks; the speedup achieved
by the atomic+aggressive inlining configuration ranges from 56% (hsqldb) to 2% (pmd).
This section explores the sources of this variation.
Across the benchmarks, a strong correlation exists between uop reduction and speedup,
which is not surprising as both generally occur when code is optimized more effectively.
It should be noted that the observed uop reduction stems not only from the removal of
instructions (as is done in SlipStream [99]); in many cases uops have been replaced with other,
simpler uops (e.g., SLE replaces compare-and-swap primitives and Java monitor updates
with a load and a branch) and the critical path through the code has been shortened. As a
result many of the benchmarks exhibit superlinear speedups relative to their uop reduction.
The biggest factor affecting the degree of optimization seems to be coverage. Table 6.3
shows that four of the benchmarks with high speedups (bloat, hsqldb, jython, and xalan)
execute most (upwards of 69%) of their uops in atomic regions. As I am reporting coverage
after optimization and most of the reduction in dynamic uop count occurs in the atomic
regions, an even larger fraction of the program is actually encapsulated in atomic regions
than these coverage numbers suggest; this effect also explains how antlr can achieve a 17%
uop reduction with only 9% coverage. These four benchmarks also have the largest atomic
regions. With average region sizes ranging from roughly 75 to 225 instructions after
optimization there is significant scope for optimization.

            Atomic Regions              Region Abort Rate
Bench.      coverage   unique   size    %       per 1k uop
antlr       9%         96       47      0.02    0.0004
bloat       69%        93       128     4.3     0.12
fop         20%        73       32      0.01    0.0007
hsqldb      76%        75       88      2.74    0.24
jython      87%        14       227     0.69    0.27
pmd         32%        32       42      2.2     0.18
xalan       78%        37       78      0.28    0.03
Table 6.3: Atomic region statistics. coverage: fraction of executed uops in atomic
regions, unique: average number of unique atomic regions in execution sample(s), size:
average size of atomic regions (in dynamic instructions), abort %: percentage of
regions aborting, aborts/1k uop: number of aborts per 1,000 uops. Data shown for the
atomic+aggressive inlining configuration.
The outlier from this trend is antlr, which manages to achieve significant speedups
despite low coverage. This occurs because a large fraction of the instructions are eliminated
from the atomic regions it does form. On average, two-thirds of the instructions in antlr’s
atomic regions get optimized away. Like the other benchmarks that get significant speedups
antlr gets most of its benefits from two main sources: partial redundancy elimination and
elimination of monitor overhead in the Java class library.
The pmd benchmark actually slows down in the atomic configuration because it has
relatively low coverage and yet incurs a 2.2% abort rate for its atomic regions. This relatively
high abort rate is the result of a behavioral change in four atomic regions that occurs between
when profiling occurs and when the optimized code is deployed and an execution sample is
taken. Frequent misspeculations result because a path that initially appears cold is removed
from its atomic region but is later frequently executed. As described in Chapter 5,
reacting to these behavior changes through adaptive recompilation can eliminate their negative
performance impacts. In Chapter 7, I describe an implementation of a reactive mechanism
which is sufficient to address these behavior changes.
Two other benchmarks—hsqldb and bloat—also have non-trivial abort rates, but achieve
significant speedups despite them. In hsqldb, the aborts occur very early in the atomic re-
gion so they have little negative impact beyond a pipeline flush. In bloat, they do have a
large impact; almost all of bloat’s aborts occur in one of its four execution samples—the one
representing the least dominant phase—and that sample incurs a 33% slowdown. Discount-
ing that phase, bloat’s speedup would be 40% (up from 32%) for the atomic+aggressive
inlining configuration.
Despite performing well in the atomic+aggressive inlining configuration, the jython
benchmark incurs a slowdown in the atomic configuration. The source of this discrepancy is
an important method, __getitem__ (called four times in a hot loop), that is not inlined by
the partial inliner in the atomic configuration. As a result, a large number of small atomic
regions are formed that incur more overhead than they provide optimization opportunity.
The __getitem__ method is not inlined because it contains what appears to be a polymorphic
call site, and the region formation algorithm will not partially inline methods containing
polymorphic calls. Once __getitem__ is inlined, however, this call site is perfectly monomorphic.
When the implementation is forced to recognize this fact, __getitem__ is inlined and the atomic
configuration’s 9% slowdown becomes a 10% speedup, as shown by the gray bars in Figure 6.7.
These performance benefits may also be achieved by implementing an adaptive
recompilation strategy that performs aggressive speculation (e.g., assumes polymorphic call sites
are monomorphic) and recompiles methods containing frequently misspeculating asserts.
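One way to picture "recompiles methods containing frequently misspeculating asserts" is a small per-region abort monitor, sketched below in Python. Everything here is hypothetical: the class name AbortMonitor, its methods, and the default thresholds (a 1% abort rate measured over at least 1,000 region executions) are illustrative assumptions and do not describe the mechanism implemented in Chapter 7.

```python
from collections import Counter


class AbortMonitor:
    """Tracks which asserts fire often enough to justify recompilation."""

    def __init__(self, max_abort_rate=0.01, min_executions=1000):
        self.max_abort_rate = max_abort_rate   # assumed tolerance per assert
        self.min_executions = min_executions   # assumed warm-up before judging
        self.executions = Counter()            # region -> executions observed
        self.aborts = Counter()                # (region, assert) -> aborts observed

    def record(self, region, failed_assert=None):
        """Record one region execution; failed_assert names the assert that fired."""
        self.executions[region] += 1
        if failed_assert is not None:
            self.aborts[(region, failed_assert)] += 1

    def asserts_to_restore(self, region):
        """Asserts whose cold path should be restored when the method is recompiled."""
        n = self.executions[region]
        if n < self.min_executions:
            return []
        return sorted(
            a for (r, a), count in self.aborts.items()
            if r == region and count / n > self.max_abort_rate
        )
```

A runtime using such a monitor would recompile the enclosing method with the listed asserts turned back into ordinary branches, leaving the remaining speculation in place.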
6.5.2 Architectural Analysis of Atomic Regions
This section studies the atomic regions generated by the compiler from the hardware’s per-
spective. In terms of implementing atomicity in hardware, it is important to understand the
size of atomic regions in terms of dynamic instruction count and data footprint.
If the compiler-generated atomic regions were consistently small, they could be buffered
completely within the pipeline; this is not the case. A 128-entry reorder buffer would be
unable to support nearly 25% of the atomic regions executed, resulting in frequent aborts and
significant performance degradation. A small fraction of atomic regions even contain over
1,000 uops. By using register checkpoints for recovery (similar to branch checkpoints except
that they live past speculative retirement), hardware enables the compiler to construct such
regions.
The evaluated hardware implementation uses the data cache to buffer the reads and
writes in the atomic regions, similar to prior work [31,75]. Modern L1 data caches are easily
large enough to hold the read and write sets of an atomic region. The majority of dynamically
executed atomic regions access fewer than 10 cache blocks, and 50 cache blocks are sufficient for
99% of the atomic regions (for reference, a 32KB cache with 64B blocks holds 512 blocks).
Only 110 out of the 1.7 million dynamically executed atomic regions touched more than 100
cache blocks and only one overflowed the cache. Clearly, the region selection algorithm is
effective at tolerating the constraints of a bounded atomic primitive.
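The arithmetic behind these footprint figures is easy to check. The short Python sketch below only restates numbers given in the text (a 32KB cache, 64B blocks, 50 blocks at the 99th percentile, and 110 of 1.7 million regions exceeding 100 blocks); the variable and function names are illustrative.

```python
def cache_blocks(cache_bytes, block_bytes):
    """Number of blocks a cache of the given size holds."""
    return cache_bytes // block_bytes


L1_BLOCKS = cache_blocks(32 * 1024, 64)   # capacity of the 32KB, 64B-line L1
share_at_99th = 50 / L1_BLOCKS            # cache share used by 99% of regions
frac_over_100 = 110 / 1_700_000           # regions touching more than 100 blocks

print(f"L1 holds {L1_BLOCKS} blocks")
print(f"50 blocks is {share_at_99th:.1%} of the cache")
print(f"{frac_over_100:.4%} of regions touch more than 100 blocks")
```

In other words, 99% of regions need less than a tenth of the L1 data cache, and well under one region in ten thousand touches more than 100 blocks.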
Once again, the read and write sets of the atomic regions fit easily within the cache, but
the number of loads and stores, which tends to be proportional to the number of uops in the
atomic region, is generally too large to fit in the load and store buffers. Even if the region
formation algorithm described in Section 6.3 were tuned to select smaller regions, a compiler
would have a difficult time guaranteeing that a region would satisfy more limited hardware.
Smaller region sizes would also limit optimization opportunity.
6.5.3 Microarchitectural sensitivity
Because atomic regions are intended to improve single-thread performance, they must be
implemented with minimal overhead in order to preserve the benefits achieved by compiler
optimizations. All the experiments thus far have assumed a checkpoint execution substrate

























[Bar chart. Legend entries recovered: chkpt + 20-cycle overhead; chkpt, single-inflight.
X-axis: antlr, bloat, fop, hsqldb, jython, pmd, xalan, average.]
Figure 6.9: Sensitivity to hardware atomicity implementation. All runs use the same
code (atomic+aggressive inlining) on different hardware configurations: chkpt: the
base high-performance non-stalling checkpoint execution substrate, + 20-cycle: stalls the
pipeline for 20 cycles at every aregion_begin, and single-inflight: stalls an aregion_begin
at decode if another uncommitted atomic region is already in the pipeline.
However, it is worth investigating the performance of alternate or simplified implementations
in the absence of a high-performance checkpoint substrate to provide atomicity. Alternate
schemes can incur two additional sources of overhead: serializing operations that may occur
as part of the aregion_begin to record recovery state, and serialization that may result from
simplified implementations of the aregion_begin and aregion_end instructions in the absence
of a checkpoint substrate. The performance sensitivity to these effects can be explored
by modeling two ways such overheads may be exposed. First, the performance of the
atomic+aggressive inlining compiler configuration was measured with a simulator configured
to stall the pipeline for 20 cycles when processing an aregion_begin. Second, to study
implementations that permit only a single atomic region to be in flight at a time, an
aregion_begin is stalled at decode until all preceding atomic regions commit.
As shown in Figure 6.9, each of these configurations eliminates or drastically reduces the
benefit of atomic regions in most benchmarks. The sole exception is antlr, which shows
limited sensitivity because of its sparing use of atomic regions.
In addition to the baseline processor configuration, performance was also measured on
two more-modest microarchitectures, as might be incorporated into future multiprocessors:
a 2-wide OOO version of the baseline machine (pipeline widths reduced to 2/2/2) and a
2-wide half OOO configuration that halves the superscalar width and all other processor
structures (including caches and TLBs). The relative speedups achieved by the atomic
region-based optimizations closely tracked the 4-wide OOO results shown in Figure 6.7,
generally within a percent or two.
6.5.4 Limitations of the existing compiler
In the process of implementing my atomic region-based optimizations, I found that the
benefits of optimization were sometimes mitigated by limitations in the compiler’s other
optimizations and code generation. One particularly spectacular example of this effect occurred
when I tried to remove the garbage collection safe point from loops completely encapsulated
in atomic regions, replacing it with a single load of the thread’s local yield flag in the loop’s
pre-header. As it turns out, the JVM’s register allocator implicitly relied on the call to the
yield() function to prevent the registers within the loop from being assigned to variables
only used outside the loop. When this call is removed, performance degrades because many of
the frequently accessed variables within loops are spilled to the stack.
As such, these performance results should not be considered a definitive characterization
of the potential of atomic regions. I believe that significant further optimization potential
exists. Nevertheless, it is important to recognize that atomic regions primarily facilitate
the optimization phase of the compiler, and must be complemented by high quality code
generation and run-time services to achieve high performance.
6.6 Conclusion
This chapter demonstrates that the introduction of architectural support for atomic regions
greatly facilitates the implementation of speculative optimizations in a JIT compiler. The
atomic region abstraction permits the compiler to isolate the hot paths for the purpose of
optimization by replacing infrequently-executed paths with asserts. As a result, speculative
optimizations can be performed without compensation code, enabling a great reduction
in compiler complexity to achieve a given code quality. The prototype implementation
achieves average speedups of 12% across a suite of DaCapo benchmarks, with commensurate
reductions in the number of micro-operations flowing through the pipeline.
However, these experiments used short simulations using a modified JVM and therefore
suffered from several shortcomings. For example, the extra runtime compilation costs were
not measured and these might overwhelm the benefits achieved. Likewise, an atomic region
optimization can become unprofitable as a program changes phases and complete program
runs are necessary to fully observe and properly react to such effects. Finally, the prototype
focuses on managed languages and ignores pre-compiled native binaries.
The next chapter addresses these shortcomings and describes the integration of atomic
regions into a real dynamic binary translator system.
Chapter 7
Atomic Regions for Dynamic
Translation†
Even though managed languages have gained popularity, statically compiled workloads will
likely continue to be commonplace. Further, many of these programs are shipped as binaries
and are unlikely to be recompiled for future systems. Therefore, this chapter explores the
atomic region abstraction in dynamic binary translation systems. Such systems enable trans-
parent compiler-based optimization of a program while it runs and can transparently make
use of hardware atomicity primitives to support atomic region optimization. In addition, the
dynamic nature of these systems provides access to accurate profile information necessary
for speculative optimizations similar to those already described for managed languages.
To illustrate this point, I will show two examples from the SPECint 2000 benchmark
suite [96], one from 255.vortex and one from 176.gcc. In each of these examples I will
first describe how they are optimized by a commercial dynamic translation system for x86,
specifically Code Morphing Software (CMS) [32] for the Transmeta Efficeon processor, and
then describe how the atomic region paradigm enables their further optimization.
I will then provide background on CMS for the Transmeta Efficeon processor and describe
how the atomic region abstraction was incorporated into a real system. With moderate effort,
the atomic region abstraction is able to provide up to a 9% performance improvement (3% on
average) on full runs of the SPECint 2000 benchmarks. In order to achieve these performance
gains, it is necessary to identify and respond to frequent misspeculation, but I will show that
even a simple control mechanism is sufficient to rein in all detrimental side-effects.
†The content of this chapter derives from work published at the 15th International Conference on




































Figure 7.1: Potential for atomic region optimizations. Optimization region for method
OaGetObject from vortex. (a) Control flow graph for optimization region with the hot
path highlighted, (b) dataflow graph and schedule for region as optimized by CMS using
superblocks, (c) atomic region representation of the same control flow graph with cold paths
converted into asserts, (d) dataflow graph and schedule for region optimized with atomic
regions. Atomic regions enable CMS to trivially exploit additional optimization opportunity
beyond that provided by superblocks alone.
7.1 SPECint 2000 Vortex example
Shown in Figure 7.1(a) is the control flow graph (CFG) for a portion of the method
OaGetObject. The portion shown is the optimization region selected by the Transmeta
CMS translator. This is one of the hottest regions in vortex and accounts for approximately
10% of the overall execution time. A single hot path exists through this optimization region,
including 56 x86 instructions and nine cold exits.
CMS uses superblock scheduling in conjunction with a suite of classical optimizations
to generate the Efficeon code and schedule shown in Figure 7.1(b). This code has been
aggressively scheduled and optimized, including the speculative hoisting of loads, the
reordering of memory operations, and the removal of redundant operations. As described in previous
work [32], this requires no compensation code because CMS makes use of the atomic execu-
tion hardware provided by Efficeon.
Ignoring cache misses, this optimized code emulates the original 56 x86 instructions by
executing 72 Efficeon operations in 26 cycles or an average of 2.15 x86 instructions per cycle
(IPC) and 2.77 IPC in Efficeon operations. The code generated by CMS is of high quality:
the static schedule produced has a similar height to the dynamic schedule achieved by a




























































Figure 7.2: CMS baseline optimization. CMS uses temporaries to eliminate three re-
dundant loads and eliminate a fourth load by forwarding a previously stored value. Original
x86 code is shown on the left. The code on the right illustrates the effect of optimized CMS
code.
Figure 7.2 shows a few of the optimizations that CMS applies to the region from Fig-
ure 7.1(a). In particular, CMS is able to identify three redundant load operations and
eliminate them by buffering previously loaded values in temporary registers. Likewise, CMS
forwards a value stored to the stack to a later consumer, which obviates a fourth load. In
each of these cases, CMS uses liveness analysis to prevent unnecessary register copying (for
example, the temporaries introduced in block G are never copied to register edi because of
a subsequent kill).























































Figure 7.3: Optimizations enabled by atomic regions. Asserts trivially expose specu-
lative opportunities to classical optimizations. For example, a partially dead store gains the
appearance of a fully dead store and a partially redundant exit computation appears fully
redundant.
that significant optimization opportunity remains. Put simply, superblock scheduling and
optimization have enabled CMS to eliminate some redundancies, hoist critical operations
past exits and generate a nearly optimal schedule, but they have not enabled CMS to remove
operations that are only needed along cold exit paths.
Because the code generated by CMS already uses hardware primitives to execute the
region atomically, these additional optimization opportunities can be trivially exposed by
converting the cold exits into asserts, as advocated by the atomic region abstraction (shown
in Figure 7.1(c)).
An assert operation simply performs a conditional check to verify that a cold exit has
not been followed. If the cold exit is followed, the assert triggers an abort which causes the
entire atomic region to be rolled back and redirects execution to code which includes the cold
exit. An assert therefore enables the compiler to speculatively isolate frequently occurring
paths in the CFG from rarely taken exits.
Figure 7.3 demonstrates a few of the additional opportunities the atomic region abstrac-
tion exposes in the same region from Figure 7.1(a). By converting cold exits into simple
dataflow operations, an assert provides speculative opportunities with a non-speculative ap-
pearance. For example, after converting the cold exits in blocks F, G, and H the partially
dead store in block F appears fully dead to classical optimizations in CMS and is
thereby eliminated. In addition, the partially redundant branch exit computation in block
D becomes fully redundant after assert conversion (enabling the removal of the redundancies
in block I).
Figure 7.1(d) shows the optimized code and schedule that results after converting cold
exits into asserts. By removing the cold exits, CMS is able to remove 15 additional Efficeon
operations from the region. Given fewer operations and fewer control flow constraints, the
superblock scheduler generates a 27% shorter schedule. This more aggressively optimized
code now executes in 19 cycles at an average x86 IPC of 2.95.
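The figures quoted for this region can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper name `ipc` is hypothetical, and the post-optimization operation count (72 - 15 = 57) is inferred from the numbers in the text rather than stated directly.

```python
def ipc(count: int, cycles: int) -> float:
    """Average instructions (or operations) completed per cycle."""
    return count / cycles


X86_INSTRUCTIONS = 56                 # x86 instructions on the hot path
BASE_OPS, BASE_CYCLES = 72, 26        # superblock-only schedule
AR_OPS, AR_CYCLES = BASE_OPS - 15, 19 # after converting cold exits into asserts

baseline_x86_ipc = ipc(X86_INSTRUCTIONS, BASE_CYCLES)
baseline_efficeon_ipc = ipc(BASE_OPS, BASE_CYCLES)
atomic_x86_ipc = ipc(X86_INSTRUCTIONS, AR_CYCLES)
schedule_reduction = 1 - AR_CYCLES / BASE_CYCLES

print(f"baseline x86 IPC:      {baseline_x86_ipc:.2f}")
print(f"baseline Efficeon IPC: {baseline_efficeon_ipc:.2f}")
print(f"atomic region x86 IPC: {atomic_x86_ipc:.2f}")
print(f"schedule reduction:    {schedule_reduction:.0%}")
```

The computed values agree with the 2.15, 2.77, 2.95 and 27% figures reported above.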
In contrast, implementing the same optimization using a software-only approach would
require complex and difficult-to-implement techniques. First, operations only consumed
along cold exits would need to be identified and eliminated. Next, compensation code for
these eliminated operations would need to be pushed onto the cold exit paths to maintain
correctness. Finally, the implementation of these analyses and code transformations must be
fast enough to be practical for CMS.
This is not only a complex undertaking but it can result in significant code duplication
(see Chapter 3). For example, one of the speculative optimization opportunities available in
Figure 7.1(a) is a store in basic block J that is made dynamically dead by a store in block
O. Eliminating the store in basic block J requires placing compensation copies along each
of the exit paths of blocks J, K and M. Without atomic regions, correctly identifying all
the available opportunities and properly placing the necessary compensation code—all in a
fast and efficient implementation—is a non-trivial proposition.
This vortex example is a rather simple one. The next example introduces further com-










































Figure 7.4: Unbiased control flow in an atomic region. Optimization region for method
combine_movables from gcc. (a) Control flow graph for optimization region with the hot
path highlighted and control flow paths annotated with their frequency, (b) atomic region
representation of the same control flow graph with highly biased paths converted into asserts.
need for a fast but intelligent misspeculation recovery and control mechanism.
7.2 SPECint 2000 GCC example
The following example, taken from gcc, demonstrates an additional optimization opportunity
enabled by atomic execution. It also shows that some cold paths are occasionally taken.
Speculating profitably on these paths therefore requires both fast misspeculation recovery
and an intelligent control mechanism that is capable of tolerating infrequent misspeculations.
Shown in Figure 7.4(a) is the control flow graph (CFG) for a portion of the method
combine_movables as selected for optimization by the Transmeta CMS translator. This is a
commonly executed region in gcc and accounts for approximately 1% of the overall execution
time. The region includes 32 x86 instructions and eight exit branches. Of the nine possible
paths through the region three are common (i.e., executed greater than 1% of the time), one
is relatively uncommon (i.e., executed less than 1% but still more than 0.01% of the time),
three are rarely executed (i.e., executed 0.01% of the time or less) and two more are never
executed.
Using superblock formation and scheduling, CMS generates aggressively scheduled code,
including the speculative hoisting of loads and reordering of memory operations. In the
absence of cache misses, the generated code will execute 57 Efficeon operations to emulate
the original 32 x86 instructions in 19 cycles, for an average x86 IPC of 1.68 and Efficeon IPC
of 3. Similar to the vortex example, CMS has been able to generate high quality and well
scheduled code but optimization opportunity still remains. A number of operations are only
needed along rarely or never taken paths and two of the cold conditional exit computations
can be merged.
Eliminating operations only consumed along cold paths is trivial using atomic region
asserts. By converting the cold exits into asserts (shown in Figure 7.4(b)), the results of






















Figure 7.5: Assert merge optimization. Asserts convert biased control flow into dataflow
operations which can be further optimized. For example, a pair of similar conditional checks
can be combined into a single assert operation which subsumes the originals.
Merging exit condition computations depends on the all-or-nothing property of atomic
execution. As shown in Figure 7.5, two of the condition checks in this region are computed
using the same logical operation (i.e., bitwise AND) and register operand but have differing
immediates. Once these exits have been converted into asserts it becomes apparent that the
atomic region will only commit if both assert checks pass. Therefore, a simple extension to
classical constant folding enables CMS to identify and exploit this opportunity by merging
similar assert operations. Also shown in Figure 7.5, the first assert is strengthened so that it
subsumes the second assert and enables it to be eliminated. Recall that an assert is a purely
dataflow operation in the IR and therefore classical redundancy elimination techniques are
sufficient to identify and eliminate the second assert.
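To make the merge concrete, the following sketch assumes that each assert passes when the bitwise AND of a register with its immediate is zero. That form of the check, the function names, and the two mask values are assumptions for illustration, since the exact conditions from Figure 7.5 are not reproduced here. Under that assumption, the strengthened assert simply tests the union of the two masks.

```python
def assert_passes(value: int, mask: int) -> bool:
    """Hypothetical assert check: passes when none of the masked bits are set."""
    return (value & mask) == 0


def merge_masks(mask_a: int, mask_b: int) -> int:
    """Strengthen one assert so that it subsumes the other."""
    return mask_a | mask_b


# Illustrative immediates only; these are not the values used in gcc.
MASK_A, MASK_B = 0x04, 0x10
merged_mask = merge_masks(MASK_A, MASK_B)

# The atomic region commits only if both original asserts pass, so the merged
# assert must pass for exactly the same register values.
equivalent = all(
    (assert_passes(v, MASK_A) and assert_passes(v, MASK_B))
    == assert_passes(v, merged_mask)
    for v in range(256)
)
print(hex(merged_mask), equivalent)
```

The equivalence only holds because the region is all-or-nothing: it no longer matters which of the two original checks would have failed first.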
After converting cold exits into asserts and after merging similar assert operations, CMS
is able to remove 9 additional Efficeon operations from the region (not shown). In addition,
converting six cold exits into asserts leaves fewer control flow constraints and the superblock
scheduler is thereby able to generate a 32% shorter schedule. This more aggressively opti-
mized code now executes in 13 cycles at an average x86 IPC of 2.46.
However, converting the uncommon but taken exits from this region into asserts poses a
challenge to the atomic region paradigm. If taken, these exits will first cause their equivalent
assert to trigger an abort, causing hardware to roll back and restart execution at nonspeculative
recovery code. Misspeculating on these asserts incurs a penalty of the time spent
partially executing the atomic region, rolling back and restarting at the recovery code. In
contrast, the performance benefit of successfully speculating on each assert operation is likely
to be small. For example, converting six cold branches in combine_movables into asserts
only saves six cycles.
As already described in Chapter 5, in order for an assert to be profitable the inequality
in Equation 7.1 must be satisfied. Specifically, it is only profitable to speculatively remove







It is therefore important to minimize the cost of misspeculation so that a larger set of cold
paths are profitable to convert into asserts. For example, assuming a misspeculation cost of
20,000 cycles and a speculation benefit of 1 cycle it would not be profitable to generate two
of the asserts shown in Figure 7.4(b). On the other hand, reducing the misspeculation cost
to 200 cycles would make all of the asserts shown profitable.
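The sensitivity to misspeculation cost can be illustrated with a small calculation. The sketch below does not reproduce Equation 7.1. It assumes a simple expected-value form in which an assert is profitable when the expected misspeculation loss is smaller than the expected speculation gain, and the function names and the 0.1% example rate are hypothetical.

```python
def is_profitable(misspec_rate: float, misspec_cost: float, benefit: float) -> bool:
    """Assumed profitability test: expected loss must be below expected gain."""
    expected_loss = misspec_rate * misspec_cost
    expected_gain = (1.0 - misspec_rate) * benefit
    return expected_loss < expected_gain


def break_even_rate(misspec_cost: float, benefit: float) -> float:
    """Misspeculation rate at which expected loss equals expected gain."""
    return benefit / (benefit + misspec_cost)


BENEFIT = 1.0  # cycles saved per successful speculation
for cost in (20_000, 200):
    rate = break_even_rate(cost, BENEFIT)
    print(f"cost {cost:>6} cycles: break-even misspeculation rate {rate:.4%}")

# A path taken 0.1% of the time (an illustrative rate) flips from
# unprofitable to profitable when the cost drops from 20,000 to 200 cycles.
print(is_profitable(0.001, 20_000, BENEFIT), is_profitable(0.001, 200, BENEFIT))
```

Under this assumed model, a hundredfold reduction in misspeculation cost raises the tolerable misspeculation rate by roughly the same factor, which is why a cheaper recovery path widens the set of cold paths worth converting.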
Likewise, a control mechanism is necessary to identify and disable unprofitable asserts.
Even with an accurate profile used to select branches to convert into asserts, poor selections
will be made simply due to changes in program behavior. A branch that was extremely biased
when a profile was collected may exhibit unbiased behavior later on. Profile inaccuracies
only worsen the situation.
In addition, the control mechanism must be intelligent. A naive threshold-based control
mechanism would be unable to differentiate between profitable and unprofitable asserts. An
assert which misspeculates a large number of times is tolerable (and even desirable) so long
as it is correctly speculated often enough to satisfy the inequality in Equation 7.1.
Both of the previous examples imply that a compiler must choose the same optimization
scope for atomic regions as it would for superblocks. For this work, I indeed engineer the
compiler in this way. However, atomic regions are more general than superblocks because
they can contain arbitrary control flow and therefore can encapsulate larger optimization
scopes. While there could be a benefit to taking advantage of this difference, such an
exploration is beyond the scope of this dissertation.
7.3 Background
The Transmeta Efficeon, first released in 2003, utilizes a low complexity design to provide
high-performance and low-power x86 compatibility. The Efficeon hardware is a very long
instruction word (VLIW) processor that executes an instruction set dissimilar to x86. The
instruction set is designed to enable the Code Morphing Software (CMS) system to
faithfully execute x86 code through interpretation and by dynamically translating x86 code
into high-performance native code.
In this section, I first describe the architecture of the Efficeon processor. A brief overview






[Figure 7.6 packet slots: M0, M1 (memory); I0, I1 (integer); F0, F1 (floating point); B (branch)]
Figure 7.6: The Efficeon architecture. Efficeon executes molecules composed of a variable
number of 32-bit packets. Each packet includes a stop bit as well as a type field that specifies
which functional unit it uses.
Efficeon then follows.
7.3.1 Efficeon Processor Architecture
The Efficeon is an in-order VLIW processor, which is designed to provide high-frequency
execution of software-scheduled code. To further simplify the design and reduce power, it
does not provide hardware interlocks or register scoreboarding and therefore relies completely
upon a compiler to correctly schedule dependent and independent operations. To simplify
the compiler’s task, Efficeon provides hardware support for taking fast register and memory
checkpoints and for reordering memory operations.
Depicted in Figure 7.6, an Efficeon VLIW instruction, or molecule, is variable length and
composed of 32-bit packets. Each packet includes a stop bit, which denotes the end of a
molecule. A molecule may contain up to eight packets.
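The role of the stop bit can be sketched as a simple decoder that groups a stream of 32-bit packets into molecules. The position of the stop bit within a packet is not given in the text, so the sketch assumes (purely for illustration) that it is the most significant bit. The function name is also hypothetical.

```python
STOP_BIT = 1 << 31   # assumed bit position; the real encoding may differ
MAX_PACKETS = 8      # a molecule may contain up to eight packets


def split_molecules(packets):
    """Group 32-bit packets into molecules, ending each at a set stop bit."""
    molecules, current = [], []
    for packet in packets:
        current.append(packet)
        if packet & STOP_BIT:
            molecules.append(current)
            current = []
        elif len(current) == MAX_PACKETS:
            raise ValueError("molecule exceeds eight packets without a stop bit")
    if current:
        raise ValueError("trailing packets without a stop bit")
    return molecules


stream = [0x1, 0x2, STOP_BIT | 0x3, STOP_BIT | 0x4]
molecules = split_molecules(stream)
print([len(m) for m in molecules])
```

Because the end of a molecule is marked in-band, short molecules occupy only as many packets as they need, which is the code density benefit of the variable-length format.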
A packet typically encodes a functional operation, or atom, but may also encode auxiliary
information such as longer immediates or memory alias protection. An Efficeon atom has a
three-address format and is analogous to an instruction from a load-store architecture. An
atom is statically assigned to one of seven functional units: two memory, two integer, two
floating point and one branch.
The Efficeon processor provides hardware support for fast register and memory check-
points. The Efficeon has two copies of each register: a shadowed and a working copy.
Likewise, the Efficeon includes a speculative bit for each line in its data cache [90]. Between
checkpoints, all updates are speculatively written to either working registers or to the data
cache. If a cache line is speculatively written, its speculative bit is set and it is transitioned
to the dirty state (after first evicting any non-speculative dirty data on the line into a victim
cache). The hardware can commit speculative work in a single cycle by copying the working
registers onto their shadowed counterparts and flash clearing all speculative bits in the data
cache. Alternatively, the hardware can roll back all speculative work by restoring the working
registers from their shadowed counterparts and flash invalidating all speculative lines in the
data cache. Section 7.4.1 describes the primitives used by software to control this commit
and rollback hardware.
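The checkpoint behavior described above can be modeled in a few lines. This is a toy software model under simplifying assumptions, not a description of the actual hardware: speculative state is tracked per address rather than per cache line, the victim cache is omitted, and the class and method names are hypothetical.

```python
class CheckpointedMachine:
    """Toy model of shadowed/working registers plus speculative stores."""

    def __init__(self, registers, memory):
        self.shadow = dict(registers)    # last committed register values
        self.working = dict(registers)   # registers seen by executing code
        self.memory = dict(memory)       # committed (non-speculative) data
        self.speculative = {}            # address -> speculatively written value

    def write_register(self, name, value):
        self.working[name] = value

    def store(self, address, value):
        # Speculative stores are buffered (in the data cache on Efficeon).
        self.speculative[address] = value

    def load(self, address):
        # Speculative data shadows committed data until commit or rollback.
        if address in self.speculative:
            return self.speculative[address]
        return self.memory[address]

    def commit(self):
        # Working registers become the new checkpoint; stores become permanent.
        self.shadow = dict(self.working)
        self.memory.update(self.speculative)
        self.speculative.clear()

    def rollback(self):
        # Restore the checkpoint and discard all speculative stores.
        self.working = dict(self.shadow)
        self.speculative.clear()


machine = CheckpointedMachine({"eax": 1}, {0x100: 10})
machine.write_register("eax", 2)
machine.store(0x100, 20)
machine.rollback()
after_rollback = (machine.working["eax"], machine.load(0x100))

machine.write_register("eax", 3)
machine.store(0x100, 30)
machine.commit()
after_commit = (machine.shadow["eax"], machine.memory[0x100])
print(after_rollback, after_commit)
```

In the model, as in the hardware description, neither operation depends on how much speculative work was performed: both act on the whole set of speculative state at once.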
The Efficeon also provides memory alias detection hardware, which a compiler can use
to guarantee the correctness of reordered memory operations. Often memory operations do
not alias, but the compiler cannot statically prove their independence. In these situations,
the compiler can generate code that includes alias packets. If used to initiate protection
of a coupled load or store atom, an alias packet captures the memory address used by the
atom into an alias register. If used to detect an alias with a coupled load or store atom,
the alias packet compares the memory address against the contents of one or more alias
registers, and, if a match is made, an alias fault is triggered. In this way, software can
check if speculatively-reordered loads alias with the stores they were hoisted above. The
alias hardware also enables the compiler to eliminate redundant loads and stores in more
situations [64].
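A minimal software model of this protect-and-check protocol is sketched below. It assumes exact address matching and a dictionary of alias registers, both of which are simplifications, and all class and function names are hypothetical.

```python
class AliasFault(Exception):
    """Raised when a checked access matches a protected address."""


class AliasUnit:
    def __init__(self):
        self.alias_registers = {}   # alias register index -> captured address

    def protect(self, index, address):
        # Capture the address used by the coupled load or store atom.
        self.alias_registers[index] = address

    def check(self, address, indices):
        # Compare the coupled atom's address against the named alias registers.
        for index in indices:
            if self.alias_registers.get(index) == address:
                raise AliasFault(f"address {address:#x} matches alias register {index}")


def hoisted_load_is_safe(load_address, store_address):
    """Model a load that was speculatively hoisted above a store."""
    unit = AliasUnit()
    unit.protect(0, load_address)        # the hoisted load records its address
    try:
        unit.check(store_address, [0])   # the later store checks against it
    except AliasFault:
        return False                     # a real system would roll back here
    return True


print(hoisted_load_is_safe(0x1000, 0x2000), hoisted_load_is_safe(0x1000, 0x1000))
```

The compiler pays for the reordering only when an alias actually occurs, which matches the observation that most reordered memory operations do not alias.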
The Efficeon shares many of the above architectural traits with the Transmeta Crusoe
that preceded it, but several key differences exist [64]. Both processors have statically
scheduled VLIW architectures but the Efficeon can issue seven atoms per cycle versus four
atoms in the Crusoe. To provide better code density, an Efficeon molecule is variable in
length whereas a Crusoe molecule may only be two or four packets long. Most relevant to this
work, the Efficeon has more relaxed speculation support because it buffers all speculative
memory updates in a 64-KB first level data cache whereas the Crusoe buffers all speculative
memory updates in a gated store buffer.
7.3.2 CMS Overview
The Transmeta Code Morphing Software [32] is a software system designed to provide high-
performance execution of x86 binaries on Efficeon hardware. To accomplish this goal, it
includes a robust and high-performance dynamic binary translator, supported by a software
x86 interpreter. The translator is a large and well-tuned software system which includes
components to identify commonly-executed regions of x86 code, convert the corresponding
x86 instructions into a three-address intermediate representation (IR), and then optimize,
schedule, and deploy each translated region.
The first several times a group of x86 instructions is encountered by CMS they will not
be translated. Rather, they will be emulated by the CMS interpreter. In doing so, CMS
is able to collect a dynamic execution profile of the x86 instructions as well as provide a
low-latency “cold start” response. If a group of x86 instructions is executed enough times,
CMS will generate a translation for them.
A CMS translation is a multiple-entry, multiple-exit region of x86 instructions that can
contain arbitrary control flow such as indirect branches, divergences, and loops. As with
other dynamic translation systems, exits from a CMS translation are directly linked, or
chained, to other translations [7, 27]. In CMS, chaining is lazily performed the first time an
exit is executed.
The CMS translator uses a staged optimization strategy in managing its translations.
Translations are first lightly-optimized, but later promoted to an aggressively-optimized
translation if executed frequently enough. This staged optimization strategy enables CMS
to focus its compilation efforts on the small subset of an x86 program’s instructions where
the majority of execution time is spent.
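The staged strategy can be pictured as a small state machine per region. In the sketch below a plain execution count stands in for whatever triggers CMS actually uses, and both threshold values are hypothetical placeholders rather than CMS parameters.

```python
from enum import Enum


class Tier(Enum):
    INTERPRETED = 0
    LIGHT = 1
    AGGRESSIVE = 2


TRANSLATE_THRESHOLD = 50     # hypothetical: executions before first translation
PROMOTE_THRESHOLD = 1000     # hypothetical: executions before promotion


class Region:
    def __init__(self):
        self.executions = 0
        self.tier = Tier.INTERPRETED

    def execute(self):
        """Record one execution and promote the region if it is hot enough."""
        self.executions += 1
        if self.tier is Tier.INTERPRETED and self.executions >= TRANSLATE_THRESHOLD:
            self.tier = Tier.LIGHT
        elif self.tier is Tier.LIGHT and self.executions >= PROMOTE_THRESHOLD:
            self.tier = Tier.AGGRESSIVE
        return self.tier


region = Region()
tiers_seen = []
for _ in range(PROMOTE_THRESHOLD):
    tier = region.execute()
    if not tiers_seen or tiers_seen[-1] is not tier:
        tiers_seen.append(tier)
print([t.name for t in tiers_seen])
```

Only regions that keep executing reach the most expensive tier, so compilation effort tracks where execution time is actually spent.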
In this chapter, I focus my efforts on improving the quality of these aggressive transla-
tions, and the remainder of this section focuses on the design of the aggressive optimizer.
The aggressive optimizer is designed to enable compiler optimizations to exploit as much
available opportunity as possible, while keeping the total compilation time to a minimum.
It primarily accomplishes these goals through a careful organization of compilation steps as
described below:
1. Region preparation: Decode the selected x86 region into a three-address code in-
termediate representation (IR).
2. Flow analysis: Generate a topological ordering of basic blocks in the region, compute
dominators and post-dominators, and rename operands into a static single-assignment
(SSA) form.
3. Control flow manipulation: Unroll loops, if-convert short branch-overs and con-
trol flow divergences. Also create single-entry multiple-exit sub-regions (hyperblocks)
that are wrapped with checkpoint commit points to provide atomicity. Incrementally
update the already computed flow analysis as necessary.
4. Forward dataflow pass: In a single forward pass, apply a suite of optimizations such
as constant folding and propagation, common subexpression elimination, and several
peephole optimizations. Also perform a simple alias analysis to guide redundant load
elimination and later memory optimization and scheduling passes.
5. Backward dataflow pass: In a single backward pass, perform a liveness analysis to
guide dead-code elimination and dead-store elimination.
6. Schedule and lower: Perform loop-invariant code motion, hoist critical operations,
allocate registers, perform code lowering, and schedule each hyperblock.
7. Emit: Assemble all instructions and update branch targets.
There are two key differences between this organization and that typically employed by a
static compiler. First, the CMS translator is broken into distinct phases which constrain the
types of changes that can be made to the IR at any given point. For example, modifications
to the control flow graph are only performed in Step 3, meaning that later passes can rely
on an immutable control flow structure. Likewise, flow analysis is performed early in Step 2
and is properly updated by the control flow manipulation passes so that later phases can
rely on accurate information about loop structure, dominators, and post-dominators.
Second, dataflow optimizations that are typically implemented as separate passes in a
static compiler are instead folded into a single forward dataflow pass and a single backward
dataflow pass. For example, the forward dataflow pass processes the region in topological
order and applies a suite of global analysis and optimizations as it visits each statement in a
basic block. In doing so, the benefits of several (in the case of CMS, seven) forward dataflow
passes can be achieved in roughly the same amount of time as a single forward pass.
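The idea of folding several forward optimizations into one traversal can be sketched on straight-line code. The example below is not the CMS implementation: it assumes SSA-form three-address tuples of the form (dest, op, a, b), handles only constant folding, constant propagation and common subexpression elimination, and uses hypothetical names throughout.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul, "and": operator.and_}


def forward_pass(code):
    """Single forward traversal applying folding, propagation and CSE together."""
    constants = {}   # name -> known constant value
    copies = {}      # name -> earlier name holding the same value
    available = {}   # (op, a, b) -> name that already computes it
    out = []

    def resolve(operand):
        operand = copies.get(operand, operand)   # follow CSE replacements
        return constants.get(operand, operand)   # then substitute constants

    for dest, op, a, b in code:
        a, b = resolve(a), resolve(b)
        if isinstance(a, int) and isinstance(b, int):
            constants[dest] = OPS[op](a, b)      # folded: nothing is emitted
            continue
        key = (op, a, b)
        if key in available:
            copies[dest] = available[key]        # redundant: reuse earlier result
            continue
        available[key] = dest
        out.append((dest, op, a, b))
    return out, constants


code = [
    ("t1", "add", 2, 3),        # folds to the constant 5
    ("t2", "mul", "x", "t1"),   # becomes mul x, 5 after propagation
    ("t3", "mul", "x", 5),      # now redundant with t2
    ("t4", "add", "t3", "y"),   # rewritten to use t2
]
optimized, known_constants = forward_pass(code)
print(optimized, known_constants)
```

Note that the third statement only becomes redundant because propagation has already rewritten the second one in the same visit order. Running the optimizations as separate passes would need an extra traversal to find it.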
These design differences are key to the efficiency of the translator and, thereby, the
performance of CMS as a whole. In adding additional optimizations to CMS, it is important
to respect these efficiency considerations. In the context of a dynamic optimizer, a powerful
but computationally complex optimization is untenable. As the next section will show,
incorporating the atomic region abstraction is not only easy to do, but can be done without
adding significant overheads.
7.4 Atomic Regions in CMS
This section describes the modifications made to CMS in order to incorporate the atomic
region abstraction. I first discuss the hardware atomicity primitives that Efficeon provides
and how they can be used by a software compiler to implement the atomic region abstraction.
I then describe how atomic regions were integrated into CMS. Lastly, I introduce a simple
mechanism to rein in frequent misspeculations and an optimization for removing redundant
asserts.

Operation   Effect
commit      Copy working registers into shadowed registers.
            Mark speculative lines in the data cache as dirty.
rollback    Copy shadowed registers into working registers.
            Invalidate speculative lines in the data cache.
Table 7.1: Efficeon atomicity primitives. Software uses these operations to control the
Efficeon commit and rollback hardware.
7.4.1 Hardware Atomicity
The Efficeon processor exposes its support for fast hardware checkpoints through the two
operations shown in Table 7.1. Software can use these operations to provide the illusion of
atomic execution—the execution of a region of code completely or not at all.
The commit operation is used to denote both the beginning and the end of an atomic
region. It is used at the beginning of an atomic region to take a register checkpoint and
to treat all future register and memory updates as speculative. It is used at the end of
an atomic region to commit all speculative updates and discard the last checkpoint. The
rollback operation is used to unconditionally abort an atomic region by restoring the last
checkpoint. A rollback does not affect the program counter, so an instruction following the
rollback can be used to redirect control flow as necessary.
Figure 7.7 illustrates how the CMS translator could use these operations to speculatively
optimize a region of code. The optimizer first wraps an optimization region with commit
points, and then speculatively removes cold paths from the region. To guarantee correctness,
the optimizer inserts a check, i.e., assert, to verify that the cold path is not taken. If the
assert determines that the cold path is needed, a rollback is executed that instructs the
hardware to discard all speculative state. Control is then redirected to a non-speculative
[Figure 7.7 listing, partially recovered. Legend: speculation success path, speculation failure path. Recovered operations: commit; ...; p1 ← tst.ne eax, 0]




















Figure 7.7: Atomic region example using the Efficeon atomicity primitives. If
speculation succeeds, the assert path will not be taken and execution will reach the commit
at the end of the region. If speculation fails, the abort path executes a rollback before
invoking recovery code to restart execution at a non-speculative version of the same code
(e.g., via the CMS interpreter).
version of the same region. In this example, execution resumes in the CMS interpreter.
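The control flow of this scheme can be sketched in software. The example below is a loose analogy rather than Efficeon code: a dictionary copy stands in for the checkpoint taken by the opening commit, a Python exception stands in for the assert branch to rollback code, and the region bodies and all names are hypothetical.

```python
class AssertFailed(Exception):
    """Stands in for an assert branching to its rollback routine."""


def run_atomic(state, speculative, non_speculative):
    checkpoint = dict(state)          # models the checkpoint at region entry
    try:
        speculative(state)            # optimized code with cold paths removed
        return "committed"            # models the commit at region exit
    except AssertFailed:
        state.clear()
        state.update(checkpoint)      # models rollback of speculative state
        non_speculative(state)        # e.g., re-execute in the interpreter
        return "recovered"


def speculative_region(s):
    s["ebx"] += 1
    if s["eax"] != 0:                 # assert: the cold path must not be needed
        raise AssertFailed()
    s["ecx"] = 7


def original_region(s):
    s["ebx"] += 1
    s["ecx"] = 7 if s["eax"] == 0 else -1   # cold path retained


hot = {"eax": 0, "ebx": 0, "ecx": 0}
cold = {"eax": 5, "ebx": 0, "ecx": 0}
hot_result = run_atomic(hot, speculative_region, original_region)
cold_result = run_atomic(cold, speculative_region, original_region)
print(hot_result, hot, cold_result, cold)
```

In the failing case `ebx` ends at 1 rather than 2, which shows that the partial work of the speculative version was discarded before the non-speculative version ran.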
It should be noted that the Efficeon hardware is designed to provide atomicity in a
uniprocessor environment. In a multiprocessor environment, additional support is necessary
to provide an illusion of atomicity to other threads. Essentially, loads must also be handled
speculatively and coherence traffic must be monitored to detect atomicity violations. The
necessary support has previously been proposed [13,83,89].
7.4.2 Incorporating Atomic Regions into CMS
The CMS translator already uses the hardware atomicity primitives to obviate the need
for recovery code in superblocks. The CMS optimizer wraps each superblock with commit
points to simplify the recovery of precise state in the case of a misspeculation or excep-
tion. For example, if a speculatively hoisted load incurs a memory fault, CMS relies on the
hardware to discard all speculative state and afterward redirect execution to a more con-
servative implementation of the same code (by dispatching to the interpreter in the current
implementation).
However, the CMS translator does not use hardware atomicity to expose speculative
optimization opportunities resulting from biased control flow. As shown in Sections 7.1
and 7.2, the translator can be made to better optimize code by simply generating an atomic
region with rarely executed paths removed. Extending the CMS translator to use the atomic
region abstraction required three additions: representing an assert operation in the IR, a
mechanism for converting biased branches into asserts, and a mechanism for recovering from
misspeculations.
Assert operations: The assert operation is represented in the compiler IR as a pseudo
operation. The assert is used to speculatively convert highly-biased conditional branches
into straight-line code. The assert consumes the same condition as a biased branch and—
like the branch it replaces—has no dataflow consumers. Unlike a branch, no operations are
control dependent on an assert, which means that it is not an optimization obstacle for later
passes. The assert is treated as a potentially-excepting operation in the IR to prevent it
from being hoisted out of its atomic region, and an assert is annotated with a numerical
identifier to distinguish it from other asserts in the same region.
Converting biased branches into asserts: An accurate execution profile is necessary to
identify which conditional branches are good candidates to convert into asserts. Misspecula-
tions can be very costly, so only highly-biased branches should be converted. However, CMS
does not collect a profile that is sufficient to properly distinguish good candidate branches.
The execution profile collected by the CMS interpreter simply does not include enough
samples to be useful for the purpose of identifying assert candidates.
Rather than forcing the interpreter to collect more samples or adding instrumentation
code to lightly-optimized translations, both of which could incur costly performance over-
heads, I instead turned my attention to the translation chaining mechanism.
As described in Section 7.3.2, translation exits are lazily chained to other translations.
Therefore, when a lightly-optimized translation is promoted and retranslated, rarely taken
translation exits are unlikely to have been chained. Similarly, the conditional branches
corresponding to these unchained exits are likely to be biased. I have implemented a heuristic
based on this observation that strikes a reasonable balance between being able to identify
good assert candidates and minimizing profiling overheads.
The CMS translator was modified so that it consults chaining information when promot-
ing a lightly-optimized translation. All unchained exits are considered assert candidates,
and this information is provided to a new flow manipulation pass added to Step 3 of the
optimizer.
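The chaining heuristic amounts to a simple filter over a translation's exits. The sketch below uses a hypothetical `TranslationExit` record to show the selection. The real CMS data structures are not described in the text.

```python
from dataclasses import dataclass

MAX_CHAINABLE_EXITS = 31   # a translation has at most 31 chainable exits


@dataclass
class TranslationExit:
    number: int      # exit number, later reused as the assert identifier
    chained: bool    # True once the exit has executed and been linked


def assert_candidates(exits):
    """Return the exit numbers of all exits that were never chained."""
    if len(exits) > MAX_CHAINABLE_EXITS:
        raise ValueError("too many chainable exits for one translation")
    return [e.number for e in exits if not e.chained]


exits = [
    TranslationExit(0, chained=True),    # hot exit, taken at least once
    TranslationExit(1, chained=False),   # never taken while lightly optimized
    TranslationExit(2, chained=False),
]
candidates = assert_candidates(exits)
print(candidates)
```

Because chaining happens lazily on first execution, the filter costs nothing at run time beyond reading state that the translator already maintains.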
Misspeculation recovery: Throughout most of the optimizer, the assert is represented
as a single dataflow pseudo-operation. In the final code emit step this changes, and the assert
is emitted as a conditional branch which targets rollback code. The exact rollback routine
that the assert targets depends on the numerical identifier of the assert. Each identifier is
associated with a separate rollback routine to simplify misspeculation monitoring (described
in Section 7.4.3).
There are a maximum of 31 chainable exits in a translation, and the numerical identifier
assigned to an assert is the same as the exit number of the cold branch it replaces. I therefore
added 31 rollback routines to CMS which are shared by all asserts. As shown in Figure 7.7,
a conditional branch is emitted for each assert that targets the rollback routine with the
corresponding numerical identifier.
When an assert fires, control is directed to its rollback routine, which first executes a
rollback instruction to discard all speculative register and memory state. It then loads
the identifier of the triggered assert into a register and jumps to the misspeculation recovery
routine. This misspeculation recovery routine is responsible for recovering the x86 instruction
pointer and dispatching to the interpreter (to execute the same code non-speculatively).
The misspeculation recovery routine is also responsible for monitoring each assert, which I
describe next.
7.4.3 Monitoring Speculations
Even though the heuristic for identifying biased branches is reasonably accurate, it is still
fallible. It occasionally leads CMS to convert branches into asserts that fire frequently.
Often the cause is a change in program behavior: a path that was rarely executed early
in the program becomes a common path later in the program. If these problematic asserts
are left untended, they will adversely affect performance because of the relatively high cost
associated with misspeculation. Therefore, a mechanism is necessary to identify and disable
problematic asserts [108].
I developed a simple solution by augmenting the misspeculation recovery routine. The
routine updates a misspeculation counter corresponding to the assert that fired (see footnote 1). If this
counter exceeds a threshold, then the assert is designated misbehaving, and the translation
will be reoptimized with the corresponding assert disabled.
Furthermore, it is desirable to tolerate asserts that fire infrequently relative to the total
number of times they execute. For these asserts, the performance improvements provided
by each successful execution of the assert outweigh the infrequent misspeculation costs. To
distinguish between asserts which are problematic and asserts that are tolerable, it is ideal
to know the local assert rate, or the number of times an assert fires relative to the number
of times it executes.
Discovering the precise execution frequency of an assert is difficult, as it would require
intrusive profiling. In the interest of minimizing overheads, I use an alternative approach
based on hardware sampling.
1 The counter does not increase the size of the metadata associated with a translation because
an assert replaces what would have otherwise been a translation exit. Each translation already
includes eight bytes of metadata for each translation exit and I simply reappropriate the same
storage for each assert.
By default, CMS takes a program counter sample every 200,000 cycles so that it can
identify and promote frequently executing translations. If a sample is taken while a transla-
tion is executing, a counter in the translation metadata is incremented. I can therefore use
this counter as an approximation for translation execution frequency.
My assert monitoring mechanism is shown in Algorithm 7.1. Essentially, whenever an
assert misspeculation counter is updated I also capture the value of the translation sample
counter. When the next misspeculation occurs, the code checks whether a sample has
been received since the last time the assert fired—by comparing the captured sample value
to the current translation sample counter value. A changed sample counter value implies
that the translation is commonly executed, and by proxy so is the assert being monitored.
To reflect that misspeculations from commonly executed asserts should be tolerated, the
misspeculation counter is reset if the sample values do not match. Otherwise, when the
assert count exceeds a threshold it is disabled through retranslation.
However, workloads with a large number of commonly executed translations will have an
increased latency to detect misbehaving asserts. To prevent this increased detection latency
from adversely affecting performance, I also incorporated a mechanism to monitor the global
assert rate, or the total number of asserts firing per cycle.
Algorithm 7.1 also shows this global monitoring mechanism. Every 100 million cycles the
global monitor is invoked to check if the global assert count exceeds a global assert threshold.
If the threshold is exceeded, the parameters of the local assert monitoring mechanism are
tightened: either by increasing the assert sample shift parameter (to require sample counter
values to differ in more significant bits before considering a translation commonly executed)
or by reducing the local assert threshold.
Algorithm 7.1 Assert misspeculation monitoring
// Monitors the behavior of a misspeculating assert.
[...]
    GlobalAssertCount ← GlobalAssertCount + 1
    if AssertSampleMatches(currSample, lastSample) then
        assertCount ← GetAssertCount(assertID) + 1
[...]
// Compares two sample values. Returns true if they are equivalent
[...]
    return mismatchBits ≡ 0
// Global assert monitoring. If the global assert rate is too high,
// tighten the local sample shift or threshold. Otherwise, loosen them.
procedure GlobalMonitor
    if GlobalAssertCount > GlobalAssertThresh then
        if AssertSampleShift < MaxSampleShift then
            AssertSampleShift ← AssertSampleShift + 1
        else if AssertThresh > 0 then
            AssertThresh ← AssertThresh − 1
    else
        if AssertThresh < LocalAssertThresh then
            AssertThresh ← AssertThresh + 1
        else if AssertSampleShift > 0 then
            AssertSampleShift ← AssertSampleShift − 1
    GlobalAssertCount ← 0
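As a rough illustration of the monitoring scheme described in Section 7.4.3, the following Python sketch models the local and global monitors. It is not the CMS implementation. The class and function names, the reset of the counter to one on a sample mismatch, the shift-based sample comparison, and the value of max_sample_shift are all assumptions made for the example; only the two thresholds come from Table 7.2.

```python
from dataclasses import dataclass


@dataclass
class AssertState:
    """Per-assert metadata (hypothetical layout, not the CMS format)."""
    count: int = 0          # misspeculations counted since the last reset
    last_sample: int = 0    # translation sample counter seen at the last firing
    disabled: bool = False


@dataclass
class MonitorParams:
    local_assert_thresh: int = 8   # upper bound for assert_thresh (Table 7.2)
    assert_thresh: int = 8         # current local threshold
    sample_shift: int = 0          # low-order sample bits ignored when comparing
    max_sample_shift: int = 8      # assumed bound; the text gives no value
    global_thresh: int = 16        # per monitoring interval (Table 7.2)
    global_count: int = 0


def sample_matches(curr, last, shift):
    """True if the samples agree once the low `shift` bits are dropped."""
    mismatch_bits = (curr ^ last) >> shift
    return mismatch_bits == 0


def on_misspeculation(state, curr_sample, params):
    """Run when an assert fires. Returns True if the assert should be disabled."""
    params.global_count += 1
    if sample_matches(curr_sample, state.last_sample, params.sample_shift):
        state.count += 1   # no new sample since the last firing
    else:
        state.count = 1    # translation is commonly executed: start over
    state.last_sample = curr_sample
    if state.count > params.assert_thresh:
        state.disabled = True
    return state.disabled


def global_monitor(params):
    """Periodic check that tightens or loosens the local parameters."""
    if params.global_count > params.global_thresh:
        if params.sample_shift < params.max_sample_shift:
            params.sample_shift += 1
        elif params.assert_thresh > 0:
            params.assert_thresh -= 1
    else:
        if params.assert_thresh < params.local_assert_thresh:
            params.assert_thresh += 1
        elif params.sample_shift > 0:
            params.sample_shift -= 1
    params.global_count = 0
```

In this model an assert that keeps firing without an intervening sample is disabled on its ninth consecutive misspeculation, while an assert whose translation is sampled between firings is never disabled.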
7.4.4 Eliminating Redundant Asserts
After biased branches have been converted into asserts, opportunities exist to remove some of
these asserts from the CFG. These opportunities arise either because an assert is redundant—
another assert in the same atomic region implements the same (or subsuming) check—or
because an assert can be proven to never fire.
The existing common subexpression elimination and constant evaluation optimizations
were easily modified to recognize assert operations. I also added an optimization that can
eliminate an assert if it is rendered unnecessary by a stronger assert. One assert subsumes
another if it fires in every situation in which the other would fire. For example,
an assert that fires whenever r1 < 5 subsumes an assert that fires whenever r1 < 4.
I also extended common subexpression elimination to allow an assert to be removed if
it is post-dominated by an equivalent or subsuming assert. Typically an operation must be
dominated by an equivalent operation to be removed, but atomicity makes post-dominance
a sufficient proxy for dominance.
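The subsumption rule can be sketched for the simple family of asserts used in the example above, those that fire whenever a register is less than a constant. The representation below is hypothetical and assumes that the register is not redefined between the asserts within the atomic region. Because atomicity makes the relative order of asserts in a region irrelevant, the sketch ignores order when deciding which asserts to drop.

```python
from typing import NamedTuple


class LessThanAssert(NamedTuple):
    """Hypothetical assert that fires (aborts the region) when reg < bound."""
    reg: str
    bound: int


def subsumes(a, b):
    """True if `a` fires in every situation in which `b` fires."""
    return a.reg == b.reg and a.bound >= b.bound


def eliminate_subsumed(asserts):
    """Drop asserts made redundant by an equivalent or stronger assert
    anywhere in the same atomic region, keeping one copy of duplicates."""
    kept = []
    for i, a in enumerate(asserts):
        redundant = any(
            subsumes(b, a) and (b != a or j < i)
            for j, b in enumerate(asserts)
            if j != i
        )
        if not redundant:
            kept.append(a)
    return kept
```

For a region containing asserts on r1 < 4, r1 < 5, r2 < 3 and a second r1 < 5, only one r1 < 5 assert and the r2 < 3 assert remain.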
7.5 Evaluation
In this section, I present the results of incorporating the atomic region abstraction into
CMS, evaluated on an Efficeon hardware platform. I first describe the configuration of the
evaluation system and provide compilation details for the benchmarks used. I then present
and interpret the experimental results. Overall, I find that incorporating atomic regions into
CMS provides a 3% average performance improvement, and that a simple assert monitoring
mechanism is sufficient to prevent slowdowns in any individual benchmark.
Processor                 Transmeta Efficeon 2 (TM8800)
Processor frequency       1.2 GHz
Dynamic Translator        CMS 7.0 (pre-release)
Registers                 64 integer, 64 FP
Translation Cache         32 MB (of physical memory)
L1 Instruction Cache      128 KB, 4-way, 64B line
L1 Data Cache             64 KB, 8-way, 32B line
Victim Cache              1 KB, fully-associative, 32B line
L2 Unified Cache          1024 KB, 4-way, 128B line
Physical Memory           1 GB DDR-400
Operating System          Linux 2.6.19
Compiler (for SPEC)       Intel C++ Compiler 11.0
Compilation options       -O3 -ipo -no-prec-div -prof use
Local Assert Threshold    8 per translation sample
Global Assert Threshold   16 per 100 million cycles
Table 7.2: Evaluation system configuration
7.5.1 System Configuration
All of the experiments are run using a Transmeta development system. The system
configuration, shown in Table 7.2, is intended to closely represent retail Efficeon hardware².
Of particular note is the pre-release version of CMS used for this evaluation; this CMS version
includes significant enhancements over the last retail version of CMS and is in many ways
superior. In terms of raw performance, the pre-release version is marginally faster than the
retail version on the benchmarks studied.
For this evaluation, all of the SPEC CPU2000 integer benchmarks were run to completion
using the reference inputs. SPEC CPU2006 was not used because the evaluation system—
representative of a computer circa 2004—does not satisfy the system requirements. The
benchmarks are compiled with the Intel C++ Compiler using the highest performing SPEC
“base” compiler options, including profile guided optimizations. All efforts have been made
to use the best possible baseline.
Figure 7.8: SPEC CPU2000 integer results. Results for three atomic region configura-
tions. (d)isabled misspeculation monitoring, (m)onitoring enabled but assert optimizations
disabled, (a)tomic region optimizations and monitoring fully enabled. All results have been
normalized to the baseline CMS configuration, which has no atomic region support.
7.5.2 Experimental Results
The following experiments focus on understanding both the dynamic and static impacts of
atomic regions. Broadly, atomic regions are able to improve performance over a baseline
CMS by an average of 3% and by up to 9%. This performance improvement is achievable
because frequently misspeculating asserts are identified and disabled by the simple assert
monitoring mechanism described earlier. Likewise, the compilation overheads introduced by
atomic regions are minimal. Finally, atomic regions do not suffer from static code bloat
problems but rather they reduce static code size.
Figure 7.8 shows the performance of three configurations of the atomic region imple-
mentation in CMS, normalized to the runtime of the baseline configuration. The first con-
figuration shows the performance of a system without the assert monitoring mechanism
enabled. The second configuration enables the assert monitoring mechanism but does not
use assert operations when speculating on biased exits (i.e., biased-exit branches are simply
re-targeted at rollback and recovery code). The third configuration is a complete
implementation of atomic regions, including assert monitoring and speculative conversion
of biased exits into assert operations.

Benchmark   Dynamic Biased   Misspeculation   Misspec. / 1M cyc.   Misspec. / 1M cyc.
            Branches (%)     Cost (cycles)    (no monitoring)      (w/ monitoring)
gzip        4.3              1790             1337                 0.06
vpr         9.8              6188             87                   0.08
gcc         5.8              1816             575                  0.1
mcf         8.1              1167             968                  0.07
crafty      14.9             2985             1529                 0.06
parser      14.3             2135             604                  0.1
eon         13.0             961              1127                 0.04
perlbmk     30.7             1878             942                  0.08
gap         26.9             1738             1879                 0.08
vortex      38.7             1863             1236                 0.1
bzip2       19.0             1179             1497                 0.09
twolf       13.6             2598             252                  0.07
average     16.6             2195             1003                 0.08
Table 7.3: Atomic region statistics. Lists the percentage of dynamic branches that are
99.999% biased or greater, the estimated cost of a misspeculating assert, and the overall
misspeculation rates both with and without assert monitoring.

The complete atomic regions implementation provides an overall performance improvement
in nearly every benchmark, and none of the benchmarks exhibited a slowdown. Atomic
regions provide a 3% average improvement over the CMS baseline that uses superblocks.
Seven of the benchmarks exhibit greater than a 2% performance improvement and three of
these exhibit a greater than 5% performance improvement. The benchmark with the largest
performance improvement is vortex at 9.3%.
The performance improvements exhibited roughly correlate with the percentage of dynamic
branches which are considered highly biased. The first column of Table 7.3 shows
the percentage of branches executed in each benchmark which are 99.999% biased or
greater (i.e., branches for which fewer than 1 in 100,000 dynamic instances oppose the bias).
The eight benchmarks with greater than 10% of their executed branches being biased exhibit
a performance improvement of 2% or more (with the exception of eon which improves by
1.7%). Similarly, the three benchmarks which exhibit a 5% or greater performance improvement
have greater than 25% of their branches being biased.
Targeting a 99.999% observed bias rather than a 100% observed bias is important because
it broadens the set of branches considered profitable speculation candidates.
Requiring a 100% bias eliminates profitable opportunities and can significantly reduce the
achievable performance. For example, when only 100% biased branches are considered
profitable, the percentage of viable dynamic branches in gap drops to 11.2% and the
performance improvement achieved falls to 2.3% (from 5.9%).
The current implementation of atomic regions is unable to profitably speculate on
branches less than 99.999% biased, due to a high cost for misspeculation. Shown in the sec-
ond column of Table 7.3 is the estimated cost of an assert misspeculation in each benchmark,
which was measured using an instrumented version of CMS. Because the current implemen-
tation uses the CMS interpreter for recovery, each assert misspeculation costs thousands of
cycles. Therefore, it is only worthwhile to speculate on branches that go against their bias
significantly less than one out of every thousand executions. The selection of assert thresh-
olds, shown in Table 7.2, satisfies this goal although the thresholds have not been highly
tuned.
The high cost of an assert misspeculation can also cause severe performance degradation
in a naive implementation of atomic regions. The first configuration in Figure 7.8 shows
the performance lost if frequently misspeculating asserts are left untended. If problematic
asserts are not disabled, such a configuration will incur a misspeculation once every thousand
cycles on average. Combined with the high cost for misspeculations this results in a greater
than factor of two slowdown for most benchmarks.
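The slowdown estimate can be checked with back-of-the-envelope arithmetic using the averages in Table 7.3. The sketch below assumes that the misspeculation rates are expressed per million cycles of useful (non-recovery) execution, which the text does not state explicitly, and the function name is illustrative only.

```python
def misspeculation_overhead(misspecs_per_million_cycles, cost_cycles):
    """Recovery cycles spent per cycle of useful execution (rough estimate)."""
    return misspecs_per_million_cycles * cost_cycles / 1_000_000


# Average values taken from Table 7.3.
no_monitoring = misspeculation_overhead(1003, 2195)
with_monitoring = misspeculation_overhead(0.08, 2195)

print(f"without monitoring: {no_monitoring:.2f} recovery cycles per useful cycle")
print(f"with monitoring:    {with_monitoring:.6f} recovery cycles per useful cycle")
```

Under that assumption, recovery work without monitoring exceeds twice the useful work on average, whereas with monitoring it amounts to a small fraction of one percent.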
The simple assert monitoring mechanism introduced in Section 7.4.3 is sufficient to identify
problematic asserts so that they can be disabled after retranslation. As shown in Table 7.3,
this simple mechanism is able to reduce the misspeculation rate to fewer than one
misspeculation every ten million cycles. In doing so, approximately one in five asserts are
disabled through retranslation (shown in Table 7.4).

Benchmark   Static Asserts   Static Code     Static Asserts
            Disabled (%)     Reduction (%)   Eliminated (%)
gzip        33.0             -0.1            0.9
vpr         1.1              0.6             1.2
gcc         4.5              0.9             1.2
mcf         24.0             0.4             1.2
crafty      28.7             0.4             0.7
parser      30.1             0.2             2.2
eon         21.4             0.6             1.1
perlbmk     17.9             1.9             2.1
gap         16.1             1.4             1.0
vortex      3.6              3.5             10.1
bzip2       34.2             0.4             0.9
twolf       9.4              0.6             1.0
average     18.7             0.9             2.0
Table 7.4: Static code statistics. Lists the percentage of static asserts disabled because
they are misbehaving, the static reduction in translation code size enabled by atomic regions,
and the percentage of static asserts that have been redundancy eliminated.

However, the additional retranslation costs are minor. The second configuration in Fig-
ure 7.8 measures the performance costs associated with atomic regions by enabling assert
monitoring and retranslation but disabling all the optimization benefits of assert operations.
The performance costs never exceed 1.5% and generally amount to less than 1% of overhead.
Overall, this simple assert monitoring mechanism is sufficient and perhaps conservative.
In general, atomic regions also improve static code characteristics. Whereas superblocks
can incur significant code bloat due to transformations such as tail duplication, atomic
regions do not require duplication to expose additional opportunities. As Table 7.4 shows,
static translation size generally decreases by a small amount. The reduction in static code
is mostly the result of extra classical optimization opportunities exposed by atomic regions.
The redundant assert elimination optimizations described in Section 7.4.4 are also beneficial
as they are able to eliminate 2% of asserts on average (up to 10% in vortex ).
These results demonstrate that atomic regions are able to offer significant performance
improvements on a real machine and that these performance improvements can be achieved
without detrimental side-effects. In addition, they serve as motivation for future work. As
already mentioned, the high misspeculation cost prevents the current implementation from
considering branches which are less than 99.999% biased. However, if the misspeculation
cost could be reduced significantly, it should be possible to target a lower bias threshold and
thereby broaden the set of branches that are profitable to convert into asserts.
To reduce the misspeculation cost, it is possible to implement a misspeculation recovery
mechanism that redirects execution to a non-speculative translation rather than the CMS
interpreter. Doing so will incur some code duplication, but so long as the duplication is
incurred judiciously it could make for a worthwhile trade-off.
7.6 Conclusions
In this chapter, I have demonstrated that the previously proposed atomic region abstraction
truly is simple and intuitive to integrate into a mature compilation system. I have also
shown that atomic regions expose real performance opportunities even in a well-engineered
commercial system.
My experience in this work has also resulted in several opinions on the relative merit
and utility of atomic regions, especially in comparison to superblocks. Any view of atomic
regions and superblocks as purely competitive abstractions is overly simplistic. Instead, in
my experience atomic regions and superblocks are complementary and synergistic with one
another.
Specifically, the key advantage of the atomic region abstraction lies in exposing oppor-
tunities to remove operations that are partially redundant or partially dead along hot paths
(i.e., operations that are fully redundant or fully dead once cold paths are removed). The
strength of the approach is the relative simplicity with which it can expose these opportunities.
Although superblocks could be used to expose such optimization opportunities, doing so
requires significantly more effort. Despite being a mature and highly engineered implementation,
CMS does not exploit superblocks for partial redundancy or dead code elimination.
These observations have led me to believe that the real benefits of the superblock ab-
straction are scheduling optimizations that reduce the critical path height and increase
instruction-level parallelism (ILP). On a wide in-order superscalar such as Efficeon, gen-
erating good static schedules is key to performance and therefore the superblock plays a
critical role on these types of machines.
Looking forward, physical constraints portend a re-emergence of simple, in-order pro-
cessor designs. To provide good single-thread performance, these designs necessitate so-
phisticated compiler infrastructures, and I believe that hardware support for speculative
optimizations is an energy and complexity effective approach to achieving that performance.
Specifically, I believe support for the atomic region, along with the superblock, is a strong
candidate for incorporation into these future designs.
Chapter 8
Atomic Region Memory Model
One of the salient features of the atomic region abstraction is its simple and intuitive ex-
ecution model, namely atomic execution. An atomic region will appear either to execute
completely and instantaneously or will appear as if it never executed. As claimed in Chap-
ter 2.1, this illusion simplifies speculative compiler optimization even in a multiprocessor
system.
However, the atomic region is not a panacea. Modern programming languages specify
memory consistency models that define observability rules for shared memory in a multipro-
cessor system. In simple terms, these memory consistency models restrict the values that
can be observed by memory read operations [1]. From the point of view of a compiler, the
memory consistency model constrains the statement reorderings which are available when
optimizing a program. To maintain correctness, any use of the atomic region abstraction
must also satisfy these constraints.
As this chapter will prove, atomic region formation trivially satisfies memory model
constraints. A program region can be converted into an atomic region without changing the
semantics of the program. Furthermore, within an atomic region, memory model constraints
can be relaxed, which may expose additional optimization opportunity.
At the same time, the atomic region abstraction should not obscure other optimization
opportunities. The boundaries of an atomic region, in particular, should not unnecessar-
ily obstruct optimization. Therefore, this chapter will also prove that the atomic region
abstraction continues to permit optimizations across atomic region boundaries.
The next section introduces the formal specification used throughout the chapter. The
remainder of the chapter contains proofs demonstrating the compatibility of atomic region
formation with memory model constraints and the optimizations permitted by atomic re-
gions.
8.1 Formal Specification of Multithreaded Programs
The terms, definitions and formal notation used in the remainder of this chapter borrow
from Shasha and Snir [92], Manson et al. [74] and Effinger-Dean et al. [36].
The following specification considers programs as expressed by a programming language.
A program is composed of a set of statements which express the behavior of the program
and adhere to the semantics of the programming language. For the purposes of this chapter,
both high-level languages such as Java and low-level machine languages such as the x86
ISA are considered programming languages. A program P is formally specified by the tuple
P = 〈S,CFG,DFG〉, where:
• S is a set of statements.
• CFG is the control flow graph and is a directed graph over the set of vertices defined
by S. Each edge in CFG expresses a control flow ordering between statements in S.
• DFG is the dataflow graph and is a directed graph over the set of vertices defined by
S. Each edge in DFG expresses a dataflow ordering between statements in S.
Definition Suppose sx, sy ∈ S such that a path exists from sx to sy in CFG, then sx is a
predecessor of sy. Likewise, sy is a successor of sx.
Definition Suppose sx, sy ∈ S such that an edge exists from sx to sy in CFG, then sx is
an immediate predecessor of sy. In particular, a path of length one exists from sx to sy in
CFG. Likewise, sy is an immediate successor of sx.
Definition Suppose sx, sy ∈ S such that a path exists from sx to sy in DFG, then sy is
data dependent on sx.
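These three definitions reduce to reachability queries on the two graphs. A minimal sketch follows, assuming each graph is given as a set of (source, destination) edge pairs over statement names; the function names are illustrative and not part of the formal notation.

```python
def has_path(edges, src, dst):
    """True if a path of length >= 1 leads from src to dst."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    stack = list(succ.get(src, ()))
    seen = set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(succ.get(n, ()))
    return False


def is_predecessor(cfg, sx, sy):
    """sx is a predecessor of sy if a CFG path leads from sx to sy."""
    return has_path(cfg, sx, sy)


def is_immediate_predecessor(cfg, sx, sy):
    """sx is an immediate predecessor of sy if the CFG has an edge sx -> sy."""
    return (sx, sy) in cfg


def is_data_dependent(dfg, sy, sx):
    """sy is data dependent on sx if a DFG path leads from sx to sy."""
    return has_path(dfg, sx, sy)
```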
The definition for control dependence is determined by language semantics. Typically
the classical definition of control dependence is used, which is determined by the post-
dominance relation [39]. As mentioned by Manson et al. [74], the Java definition is loop
control dependence, which additionally specifies that operations following any potentially
infinite loop are control dependent on the loop [10]. For the purposes of this chapter, it is
sufficient to note that control dependence can be derived from the control flow graph and
language semantics of a program.
When a statement is executed, it becomes an action which may include reading or writing
a memory location. For simplicity, statements are assumed to correspond with a single
action. Two actions are said to conflict if at least one is a write and they both reference the
same memory location. A multithreaded program may have different possible executions,
each with a different interleaving of conflicting actions. An execution E of a program P is
formally specified by the tuple $E = \langle A, \xrightarrow{po}, \xrightarrow{so}, \xrightarrow{sw}, \xrightarrow{hb} \rangle$, where:
• A is a set of actions. Each action is a tuple a = 〈t, k, u〉, where:
◦ t is the thread identifier.
◦ k specifies the kind of action, which in particular includes reads and writes to
shared memory. A read of shared memory location x is denoted by read(x) and a
write to a shared memory location x is denoted by write(x). Also pertinent to this
chapter are the synchronization actions lock(x) and unlock(x) that, respectively,
are used to enter and exit a critical section coordinated through shared memory
location x.
◦ u is a unique identifier of the action.
• $\xrightarrow{po}$ is the program order relation over the set of unique identifiers and specifies a total
order over actions performed by a single thread. Program order is consistent with the
control flow and dataflow semantics of program $P$. For each $\langle t_i, k_i, u_i \rangle, \langle t_j, k_j, u_j \rangle \in A$
such that $u_i \neq u_j$:
◦ If $t_i = t_j$, then either $u_i \xrightarrow{po} u_j$ or $u_j \xrightarrow{po} u_i$.
◦ If $u_i \xrightarrow{po} u_j$ or $u_j \xrightarrow{po} u_i$, then $t_i = t_j$.
◦ If $t_i \neq t_j$, then $u_i \overset{po}{\nrightarrow} u_j$ and $u_j \overset{po}{\nrightarrow} u_i$.
• $\xrightarrow{so}$ is the synchronization order relation over the set of unique identifiers and specifies
a total order over synchronization actions performed by any thread. Furthermore, for
$\langle t_i, k_i, u_i \rangle, \langle t_j, k_j, u_j \rangle \in A$ such that $u_i \neq u_j$:
◦ If $k_i$ and $k_j$ are synchronization actions, then either $u_i \xrightarrow{so} u_j$ or $u_j \xrightarrow{so} u_i$.
◦ If either $k_i$ or $k_j$ is not a synchronization action, then $u_i \overset{so}{\nrightarrow} u_j$ and $u_j \overset{so}{\nrightarrow} u_i$.
• $\xrightarrow{sw}$ is the synchronizes-with relation over the set of unique identifiers that specifies a
partial order over synchronization actions performed by any thread and is consistent
with synchronization-order. The details of this relation are dependent on the memory
consistency model. For example, the Java memory model [59,74] defines synchronizes-with
relations for lock(x) and unlock(x) actions. For each $\langle t_i, \text{unlock}(x), u_i \rangle$,
$\langle t_j, \text{lock}(x), u_j \rangle \in A$:
◦ If $u_i \xrightarrow{so} u_j$, then $u_i \xrightarrow{sw} u_j$.
• $\xrightarrow{hb}$ is the happens-before relation over the set of unique identifiers and is the transitive
closure of the program order and synchronizes-with relations.
An execution is said to contain a data race if and only if it contains a conflict which is
not ordered by the happens-before relation, i.e., there exist $\langle t_i, k_i, u_i \rangle, \langle t_j, k_j, u_j \rangle \in A$ such
that $k_i$ and $k_j$ conflict (they reference the same memory location and at least one is a
write), $u_i \overset{hb}{\nrightarrow} u_j$ and $u_j \overset{hb}{\nrightarrow} u_i$. A program $P$ is said to be
correctly synchronized if and only if every execution $E$ of $P$ is free of data races.
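The data race definition can be made concrete with a small checker for a single execution. This is a sketch under assumed encodings: actions are a dictionary from unique identifier to a (thread, kind) pair, a kind is an (operation, location) pair, and the program order and synchronizes-with relations are sets of identifier pairs. None of these encodings come from the formal specification.

```python
from itertools import combinations


def transitive_closure(edges, ids):
    """Map each identifier to the set of identifiers reachable from it."""
    reach = {u: set() for u in ids}
    for a, b in edges:
        reach[a].add(b)
    changed = True
    while changed:
        changed = False
        for u in reach:
            new = set()
            for v in reach[u]:
                new |= reach[v]
            if not new <= reach[u]:
                reach[u] |= new
                changed = True
    return reach


def conflicts(kind_a, kind_b):
    """Two actions conflict if they access the same location and one writes."""
    (op_a, loc_a), (op_b, loc_b) = kind_a, kind_b
    memory_ops = ("read", "write")
    return (op_a in memory_ops and op_b in memory_ops
            and loc_a == loc_b and "write" in (op_a, op_b))


def has_data_race(actions, po, sw):
    """True if some conflicting pair is unordered by happens-before."""
    hb = transitive_closure(po | sw, actions.keys())
    for u, v in combinations(actions, 2):
        if conflicts(actions[u][1], actions[v][1]):
            if v not in hb[u] and u not in hb[v]:
                return True
    return False
```

For two threads that access x inside critical sections on the same lock, the unlock-to-lock synchronizes-with edge orders the conflict; dropping that edge leaves the pair unordered and the checker reports a race.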
The remainder of this chapter focuses on correctly-synchronized programs. Formal rea-
soning about programs containing data races is both complex and beyond the scope of this
dissertation. Furthermore, modern languages such as Java and C++ emphasize correct syn-
chronization [19, 74]. In fact, the behavior of programs containing data races is undefined
by C++ and POSIX threads [18]. Only the Java memory model defines the behavior of programs
containing data races, and even its current definition inadvertently prohibits some
optimizations that the model was intended to permit [102].
By restricting the discussion to correctly-synchronized programs, a valid execution E of
program P must be equivalent to some sequential execution of P [2]. Therefore, the happens-
before relation of E defines the order of all memory conflicts and, thereby, the value that
each memory read returns as well as the final value of each memory location.
The behavior of an execution can be characterized by the value that each memory read re-
turns and the final value of each memory location. The execution of a correctly-synchronized
program can therefore be characterized by the happens-before relation between memory con-
flicts of an execution. Intuitively, two executions have the same behavior if each conflicting
read observes the same write and the last conflicting write to each memory location is the
same. This yields the following definition of equivalent executions.
Definition Let $A$ be the set of actions for a valid execution $E$ of correctly-synchronized
program $P$, and let $A'$ be the set of actions for a valid execution $E'$ of $P$. Executions $E$ and
$E'$ are equivalent if and only if $A = A'$ and for all $\langle t_i, k_i, u_i \rangle, \langle t_j, k_j, u_j \rangle \in A$ such that $k_i$ and
$k_j$ conflict, there exists $\langle t_i, k_i, u_i \rangle, \langle t_j, k_j, u_j \rangle \in A'$ such that:
• If $u_i \xrightarrow{hb} u_j$ in $E$, then $u_i \xrightarrow{hb} u_j$ in $E'$.
• If $u_j \xrightarrow{hb} u_i$ in $E$, then $u_j \xrightarrow{hb} u_i$ in $E'$.
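The equivalence test can be sketched directly from this definition. The encoding is an assumption made for the example: each action set is a dictionary from unique identifier to a (thread, kind) pair, a kind is an (operation, location) pair, and each happens-before relation is supplied as an already transitively closed set of identifier pairs.

```python
from itertools import permutations


def conflicts(kind_a, kind_b):
    """Two actions conflict if they access the same location and one writes."""
    (op_a, loc_a), (op_b, loc_b) = kind_a, kind_b
    memory_ops = ("read", "write")
    return (op_a in memory_ops and op_b in memory_ops
            and loc_a == loc_b and "write" in (op_a, op_b))


def equivalent(actions_e, hb_e, actions_e2, hb_e2):
    """True if both executions have the same actions and every conflict
    ordered by happens-before in E is ordered the same way in E'."""
    if actions_e != actions_e2:
        return False
    # permutations yields both (ui, uj) and (uj, ui), covering both bullets.
    for ui, uj in permutations(actions_e, 2):
        if conflicts(actions_e[ui][1], actions_e[uj][1]):
            if (ui, uj) in hb_e and (ui, uj) not in hb_e2:
                return False
    return True
```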
Similarly, two programs have the same behaviors if each possible execution in one program
has an equivalent execution in the other program. This concept is useful when considering
the correctness of transforming one program representation into another.
Definition Program Pi is said to be equivalent to Pj if and only if:
• For any execution Ei of Pi an equivalent execution Ej exists of Pj
• For any execution Ej of Pj an equivalent execution Ei exists of Pi.
In some cases, transforming one program into another may require introducing new
statements. To simplify the comparison of these programs the following notion of statement
equivalence is provided.
Definition Statement sx is said to be equivalent to statement sy if and only if sx and sy
produce the same kind of action when executed.
8.2 Formal Specification of Atomic Regions
Next, the formal notation is extended to include the atomic region abstraction by defin-
ing the statements aregion begin, aregion end, and aregion abort. These statements (and
actions by the same name) correspond to the hardware atomicity primitives introduced in
Chapter 2.1. Together, they provide atomic execution semantics and define an atomic re-
gion. The statements aregion begin, aregion end and aregion abort are collectively called
atomic statements and the related actions are called atomic actions. Note that hardware
may implicitly abort an atomic region, and therefore an aregion abort action may not always
correspond to a statement in the program.
Informally, any path through an atomic region must include an aregion begin action
followed by either an aregion end or aregion abort action, and atomic regions are never
nested. Paths through an atomic region that include an aregion end are said to commit.
Paths through an atomic region that include an aregion abort are said to abort and do not
include any action other than the preceding aregion begin (hardware discards all intervening
speculative actions). In addition, all actions outside of the atomic region appear to occur
either before or after the atomic region.
The following constraints are placed on the program order relation of any valid execution
of a program P , which contains atomic regions. Let E be a valid execution of P , with a set
of actions A:
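• For each $\langle t_b, \text{aregion begin}, u_b \rangle \in A$, there exists some $\langle t_b, k_c, u_c \rangle \in A$ such that $u_b \xrightarrow{po} u_c$
and either $k_c = \text{aregion end}$ or $k_c = \text{aregion abort}$. Furthermore, for any $\langle t_b, k_i, u_i \rangle \in A$:
◦ If $k_c = \text{aregion end}$ and $u_b \xrightarrow{po} u_i \xrightarrow{po} u_c$, then $k_i$ is not an atomic action.
◦ If $k_c = \text{aregion abort}$, then $u_b \overset{po}{\nrightarrow} u_i$ or $u_i \overset{po}{\nrightarrow} u_c$.
• For each $\langle t_c, k_c, u_c \rangle \in A$ such that $k_c = \text{aregion end}$ or $k_c = \text{aregion abort}$, there exists
some $\langle t_c, \text{aregion begin}, u_b \rangle \in A$ such that $u_b \xrightarrow{po} u_c$. Furthermore, for any $\langle t_c, k_i, u_i \rangle \in A$,
if $u_b \xrightarrow{po} u_i \xrightarrow{po} u_c$, then $k_i$ is not an atomic action.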
For convenience, the notation $[u_b, u_c]$ is used to denote an atomic region as specified by
actions $u_b$ and $u_c$, which satisfy the constraints above. For any action $\langle t_b, k_i, u_i \rangle \in A$ where
$u_b \xrightarrow{po} u_i \xrightarrow{po} u_c$, $u_i$ is further said to be contained in the atomic region $[u_b, u_c]$. Note that an
atomic region which aborts does not contain any actions.
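The program order constraints can be checked mechanically on the actions of a single thread. The sketch below assumes a hypothetical encoding in which a thread's actions are listed in program order by kind, with atomic actions spelled as the strings shown; it returns each region as a pair of list positions.

```python
def atomic_regions(kinds):
    """Check one thread's action kinds (in program order) against the
    program order constraints and return the (begin, close) index pairs."""
    regions = []
    open_at = None
    for i, kind in enumerate(kinds):
        if kind == "aregion begin":
            if open_at is not None:
                raise ValueError("atomic regions are never nested")
            open_at = i
        elif kind in ("aregion end", "aregion abort"):
            if open_at is None:
                raise ValueError("region closed without a matching begin")
            if kind == "aregion abort" and i != open_at + 1:
                raise ValueError("an aborted region contains no actions")
            regions.append((open_at, i))
            open_at = None
    if open_at is not None:
        raise ValueError("region begun but never ended or aborted")
    return regions
```

A trace with one committed region and one aborted region yields two pairs, and the aborted pair is always adjacent because hardware discards the intervening speculative actions.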
The following constraints are placed on the synchronizes-with relation of any valid exe-
cution of a program P , which contains atomic regions. For any valid execution E of P , with
a set of actions A:
• Let $\langle t_i, k_i, u_i \rangle, \langle t_j, k_j, u_j \rangle \in A$ such that $k_i$ conflicts with $k_j$ and $t_i \neq t_j$. Also let $u_i$
be contained in the atomic region specified by $[u_b, u_e]$, where $\langle t_b, \text{aregion begin}, u_b \rangle$,
$\langle t_b, \text{aregion end}, u_e \rangle \in A$. Then, atomic execution orients the conflict by imposing
either $u_j \xrightarrow{sw} u_b$ or $u_e \xrightarrow{sw} u_j$ on the synchronizes-with relation.
• Let $\langle t_i, k_i, u_i \rangle, \langle t_j, k_j, u_j \rangle \in A$ such that $k_i$ and $k_j$ are synchronization actions. Also let
$u_i$ be contained in the atomic region specified by $[u_b, u_e]$, where $\langle t_b, \text{aregion begin}, u_b \rangle$,
$\langle t_b, \text{aregion end}, u_e \rangle \in A$. Then the following synchronizes-with relations are imposed
by atomic execution semantics:
◦ If $u_i \xrightarrow{sw} u_j$, then $u_e \xrightarrow{sw} u_j$.
◦ If $u_j \xrightarrow{sw} u_i$, then $u_j \xrightarrow{sw} u_b$.
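The second constraint lifts synchronizes-with edges from an action inside a committed atomic region to the region boundaries. A minimal sketch, assuming the relation is a set of identifier pairs and that a hypothetical dictionary maps each contained action to its region's (begin, end) identifiers:

```python
def lift_sync_edges(sw, containing_region):
    """Return the synchronizes-with edges imposed on region boundaries.

    sw: set of (ui, uj) pairs.
    containing_region: dict mapping a contained action's identifier to the
    (ub, ue) identifiers of its committed atomic region.
    """
    imposed = set()
    for src, dst in sw:
        if src in containing_region:
            # A contained action synchronizes with dst: the region end does too.
            imposed.add((containing_region[src][1], dst))
        if dst in containing_region:
            # src synchronizes with a contained action: it also does so with the begin.
            imposed.add((src, containing_region[dst][0]))
    return imposed
```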
8.3 Constraints on Atomic Region Formation
Assume a program P , which does not contain atomic regions, is transformed into program P ′
by inserting aregion begin and aregion end statements to form atomic regions. For any be-
havior that is speculatively removed from an atomic region in P ′ an aregion abort statement
is also inserted. This transformation is called atomic region formation (see Chapter 2.2).
Informally, atomic region formation is only valid if the transformed program is equivalent
to the original program. First, the transformed program must contain atomic regions which
satisfy the constraints of Section 8.2. Second, any execution of the transformed program
must have an equivalent execution in the original program. Third, any execution of the
original program must have an equivalent execution in the transformed program.
To form an atomic region, a compiler must first select a region of a program to trans-
form. A region R is a subset of a program and is similarly defined by the tuple R =
〈S,CFG,DFG〉. Before specifying constraints on region selection, definitions for an entry
and exit of a region are provided.
Definition Let $s_x$ be an immediate predecessor of $s_y$ in program $P$. If $s_x \notin R$ and $s_y \in R$,
then $s_y$ is an entry of region $R$.
Definition Let $s_y$ be an immediate successor of $s_x$ in $P$. If $s_x \in R$ and $s_y \notin R$, then $s_x$ is
an exit of region $R$.
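Both definitions amount to scanning the control flow edges that cross the region boundary. A short sketch, assuming the CFG is a set of (source, destination) pairs and the region is a set of statement names:

```python
def region_entries_and_exits(cfg, region):
    """Return (entries, exits) of `region` within the control flow graph."""
    entries = {sy for sx, sy in cfg if sx not in region and sy in region}
    exits = {sx for sx, sy in cfg if sx in region and sy not in region}
    return entries, exits
```

In a diamond-shaped CFG a -> b -> {c, d} -> e -> f, the region {b, c, d, e} has the single entry b and the single exit e, while the region {b, c} has entry b and exits b and c.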
To satisfy the program order constraints of Section 8.2, a compiler must first select
a region that does not already contain any atomic statements (atomic regions are never
nested). The region should also specify a connected set of statements because a disjoint
set of statements can trivially be split into separate regions. Furthermore, the region must
have well defined entry and exit points such that any path through the region will encounter
exactly one entry and one exit statement.
In order to be considered for atomic region formation, a region R selected from program P
must satisfy the following constraints. Let CFGP be the control flow graph of the statements
in P and CFGR be the control flow graph of the statements in R, then:
• For any sx ∈ R, sx is not an atomic statement.
• CFGR is a weakly connected subgraph of CFGP , i.e., if each edge of CFGR were
undirected, then CFGR would be a connected graph.
• Let sx, sy ∈ R such that sx and sy are both entries of R. If sx is a predecessor of sy in
P , then for each path τ from sx to sy in CFGP , there exists some sn ∈ R such that
sn is an exit of R and sn is a vertex in τ .
• Let sx, sy ∈ R such that sx and sy are both exits of R. If sx is a predecessor of sy in
P , then for each path τ from sx to sy in CFGP , there exists some sn ∈ R such that
sn is an entry of R and sn is a vertex in τ .
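The four selection constraints above can be sketched as a checker over the same successor-map representation. This is only an illustration under stated assumptions: the helper names are hypothetical, the endpoints of a path are counted as vertices on that path, and only pairs of distinct entries (or distinct exits) are examined.

```python
from collections import deque


def entries_and_exits(succ, region):
    entries, exits = set(), set()
    for sx, targets in succ.items():
        for sy in targets:
            if sx not in region and sy in region:
                entries.add(sy)
            if sx in region and sy not in region:
                exits.add(sx)
    return entries, exits


def weakly_connected(succ, region):
    """Check connectivity of CFG_R with every edge treated as undirected."""
    if not region:
        return True
    adj = {s: set() for s in region}
    for sx in region:
        for sy in succ.get(sx, ()):
            if sy in region:
                adj[sx].add(sy)
                adj[sy].add(sx)
    start = next(iter(region))
    seen, work = {start}, deque([start])
    while work:
        for n in adj[work.popleft()]:
            if n not in seen:
                seen.add(n)
                work.append(n)
    return seen == region


def path_avoiding(succ, src, dst, blocked):
    """True if some path src -> ... -> dst in CFG_P touches no blocked vertex."""
    if src in blocked or dst in blocked:
        return False
    seen, work = set(), deque(succ.get(src, ()))
    while work:
        s = work.popleft()
        if s in seen or s in blocked:
            continue
        if s == dst:
            return True
        seen.add(s)
        work.extend(succ.get(s, ()))
    return False


def valid_region(succ, region, atomic_statements):
    region = set(region)
    if region & set(atomic_statements):
        return False                      # constraint 1: no atomic statements
    if not weakly_connected(succ, region):
        return False                      # constraint 2: weakly connected
    entries, exits = entries_and_exits(succ, region)
    for group, separators in ((entries, exits), (exits, entries)):
        for sx in group:
            for sy in group:
                # constraints 3 and 4: every path between two entries (exits)
                # must pass through an exit (entry)
                if sx != sy and path_avoiding(succ, sx, sy, separators):
                    return False
    return True


# Hypothetical diamond: s0 -> s1 -> {s2, s3} -> s4 -> s5.
cfg = {"s0": {"s1"}, "s1": {"s2", "s3"}, "s2": {"s4"},
       "s3": {"s4"}, "s4": {"s5"}, "s5": set()}
whole_diamond = valid_region(cfg, {"s1", "s2", "s3", "s4"}, set())
two_arms_only = valid_region(cfg, {"s2", "s3"}, set())
print(whole_diamond, two_arms_only)
```

In this example the whole diamond is accepted, while the two arms alone are rejected because no edge of the region's control flow graph connects them.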
Once a compiler has selected an appropriate region, it can then transform it into an
atomic region. To do so, an aregion begin must be placed at each entry of the region, an
aregion end must be placed at each exit, and aregion abort statements must be inserted
on any control flow path that has been speculatively removed. Also, if the atomic region
aborts then execution must be resumed on an alternate control flow path that is equivalent
to executing the original region prior to atomic region formation.
Atomic region formation is formally specified as follows. Given a region R in program
P , which satisfies the constraints on region selection, a compiler may transform P into P ′
by generating an atomic region Ra that satisfies the following constraints:
• For each sx ∈ P , sx ∈ Ra if and only if sx ∈ R.
• For each sy ∈ R such that sy is an entry of R, there exists a statement sb =
aregion begin where:
◦ sb ∈ Ra and sb is the only immediate predecessor of sy.
◦ For each sx ∈ P such that sx is an immediate predecessor of sy in P , sx is an
immediate predecessor of sb in P ′. Thus, sb replaces sy as an entry of Ra.
◦ An alternate control flow path is provided as the parameter to sb (see Section 2.1).
This alternate control flow path contains statements that are equivalent to a
control flow path starting at entry sy of R and continuing through some exit of
R. It will be executed by hardware if the atomic region starting at sb aborts.
• For each sx ∈ R such that sx is an exit of R, there exists a statement se = aregion end
where:
◦ se ∈ Ra and se is the only immediate successor of sx.
◦ For each sy ∈ P such that sy is an immediate successor of sx in P , sy is an
immediate successor of se in P ′. Thus, se replaces sx as an exit of Ra.
• If any behaviors are speculatively removed from a path through Ra, a statement sa ∈
Ra is used to terminate the path such that sa = aregion abort. Furthermore, sa has
no successors in the control flow graph of P ′.
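The entry and exit constraints above amount to a small edge rewrite on the control flow graph. The sketch below illustrates only that rewrite, under simplifying assumptions: the CFG is a successor map, each inserted statement is represented by a hypothetical tuple label, and the alternate control flow path passed to aregion begin and any aregion abort statements are not modeled.

```python
def form_atomic_region(succ, region):
    """Return a new successor map with aregion begin/end statements inserted."""
    region = set(region)
    new_succ = {s: set(t) for s, t in succ.items()}
    preds_outside = {}   # entry -> immediate predecessors outside R
    succs_outside = {}   # exit  -> immediate successors outside R
    for sx, targets in succ.items():
        for sy in targets:
            if sx not in region and sy in region:
                preds_outside.setdefault(sy, set()).add(sx)
            if sx in region and sy not in region:
                succs_outside.setdefault(sx, set()).add(sy)

    for sy, preds in preds_outside.items():
        sb = ("aregion begin", sy)
        new_succ[sb] = {sy}               # sb immediately precedes the entry
        for sx in preds:
            new_succ[sx].discard(sy)      # predecessors of sy in P ...
            new_succ[sx].add(sb)          # ... are predecessors of sb in P'

    for sx, outs in succs_outside.items():
        se = ("aregion end", sx)
        new_succ[se] = set(outs)          # successors of sx in P follow se in P'
        new_succ[sx] = (new_succ[sx] - outs) | {se}
    return new_succ


# Hypothetical straight-line program s0 -> s1 -> s2 -> s3 with R = {s1, s2}.
cfg = {"s0": {"s1"}, "s1": {"s2"}, "s2": {"s3"}, "s3": set()}
new_cfg = form_atomic_region(cfg, {"s1", "s2"})
```

After the rewrite, s0 leads to the inserted aregion begin, which leads to s1, and s2 leads to the inserted aregion end, which leads to s3.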
It is now possible to prove that atomic region formation satisfies the constraints on
program order from Section 8.2. In particular, each control flow path through the atomic
region must include an aregion begin followed by either an aregion end or aregion abort,
and atomic regions are never nested.
Lemma 8.1. Suppose atomic region formation is used to transform region R, which satis-
fies the constraints on region selection, into an atomic region Ra. Then, any control flow
path through Ra satisfies the atomic region constraints on the program order relation (see
Section 8.2).
Proof. By the constraints on region selection, any control flow path through R will pass
through a single entry and a single exit. By the constraints on atomic region formation,
every entry of Ra is an aregion begin and every exit of Ra is an aregion end. Likewise, any
control flow path through Ra will pass through a single entry and at most a single exit (a
path may instead terminate at an aregion abort).
Therefore, every control flow path through Ra starts with an aregion begin and ends
with either an aregion end or aregion abort. Likewise, every control flow path includes
exactly one aregion begin and exactly one of either aregion end or aregion abort.
Thus, any control flow path through Ra satisfies the program order constraints on atomic
region execution.
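The property established by Lemma 8.1 can be stated as a check on a single control flow path. The following is a small sketch, assuming a path is given as a list of statement labels; the labels and the function name are hypothetical.

```python
BEGIN, END, ABORT = "aregion begin", "aregion end", "aregion abort"


def satisfies_program_order(path):
    """Check one control flow path through an atomic region."""
    if len(path) < 2:
        return False
    return (path[0] == BEGIN
            and path[-1] in (END, ABORT)
            and path.count(BEGIN) == 1                      # never nested
            and path.count(END) + path.count(ABORT) == 1)   # one terminator


committing = [BEGIN, "r1 = ld [a]", "st[b] = 1", END]
aborting = [BEGIN, "assert", ABORT]
nested = [BEGIN, "r1 = ld [a]", BEGIN, END]
print(satisfies_program_order(committing),
      satisfies_program_order(aborting),
      satisfies_program_order(nested))
```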
Provided that atomic region formation satisfies the program order constraints of
Section 8.2, the next two proofs show that atomic regions do not introduce new executions to a
transformed program and that forming an atomic region containing a single statement does
not eliminate any executions from a transformed program.
Lemma 8.2. Suppose atomic region formation is used to transform a correctly-synchronized
program P into program P ′. Then, for any valid execution E ′ of P ′, an equivalent execution
E exists of P .
Proof. Suppose E ′ is a valid execution of P ′, then the happens-before relation of each conflict
in E ′ can be shown to exist in some execution E of program P .
It is necessary to demonstrate that transforming P into P ′ does not introduce new
happens-before relations to any conflicts. The proof first considers atomic regions that
abort and then considers atomic regions that commit. In both cases, let A be the set of
actions for execution E and let A′ be the set of actions for execution E ′. For the set of
actions A′ of E ′, A′ ⊇ A and the actions in A′ − A are the atomic actions produced by the
atomic regions in P ′.
Let 〈tb, aregion begin, ub〉, 〈tb, aregion abort, ua〉 ∈ A′ such that [ub, ua] specifies an atomic
region that aborts. By the constraints on atomic region execution, [ub, ua] does not contain
any actions and, therefore, does not introduce any new happens-before relations. By the
constraints on atomic region formation, the actions following ua in program order must be
equivalent to an execution of the original region in P .
Suppose there exists a pair of conflicting actions $\langle t_i, k_i, u_i \rangle, \langle t_j, k_j, u_j \rangle \in A'$
such that $t_i \neq t_j$, $k_i$ conflicts with $k_j$, and $u_i \xrightarrow{hb} u_j$.
Let $\langle t_b, \text{aregion begin}, u_b \rangle, \langle t_b, \text{aregion end}, u_e \rangle \in A'$ such
that $[u_b, u_e]$ specifies an atomic region that commits. Because $P$ is correctly synchronized,
there must also exist $\langle t_i, k_{i+1}, u_{i+1} \rangle, \langle t_j, k_{j-1}, u_{j-1} \rangle \in A'$ such that
$u_i \xrightarrow{po} u_{i+1} \xrightarrow{sw} u_{j-1} \xrightarrow{po} u_j$.
There are two cases to consider:
1. A synchronizes-with relation introduced by $u_e$ such that:
• If $u_i$ is contained in $[u_b, u_e]$, then $u_i \xrightarrow{po} u_e \xrightarrow{sw} u_j$
• If $u_{i+1}$ is contained in $[u_b, u_e]$, then $u_{i+1} \xrightarrow{po} u_e \xrightarrow{sw} u_{j-1}$
2. A synchronizes-with relation introduced by $u_b$ such that:
• If $u_j$ is contained in $[u_b, u_e]$, then $u_i \xrightarrow{sw} u_b \xrightarrow{po} u_j$
• If $u_{j-1}$ is contained in $[u_b, u_e]$, then $u_{i+1} \xrightarrow{sw} u_b \xrightarrow{po} u_{j-1}$
Each of these relations enforces $u_i \xrightarrow{hb} u_j$ but, clearly, this relation was already
enforced by $u_{i+1} \xrightarrow{sw} u_{j-1}$. Simply removing all atomic actions from $E'$ yields a
valid execution $E$ of $P$.
The same argument holds for any pair of conflicts and therefore for any execution of P ′ an
equivalent execution exists of P .
Lemma 8.3. If atomic region formation is used to transform a region R containing a single
statement si of correctly-synchronized program P into program P ′, then program P ′ is
equivalent to program P .
Proof. By Lemma 8.2, for any valid execution of P ′ an equivalent execution exists of P .
Therefore, it is only necessary to prove that for any valid execution E of P an equivalent
execution E ′ exists of P ′.
Suppose E is a valid execution of P , then the happens-before relation of each conflict in
E can be shown to exist in some execution E ′ of program P ′. Let A be the set of actions
for execution E and A′ be the set of actions for execution E ′. For the set of actions A′ of
E ′, A′ ⊇ A and the actions in A′−A are the atomic actions produced by the atomic regions
in P ′.
Let $\langle t_i, s_i, u_i \rangle \in A$ be an action produced by the single statement $s_i$ contained in $R$.
Suppose there exist some $\langle t_{i-1}, k_{i-1}, u_{i-1} \rangle, \langle t_{i+1}, k_{i+1}, u_{i+1} \rangle \in A$ such that
$u_{i-1} \xrightarrow{hb} u_i \xrightarrow{hb} u_{i+1}$.
Suppose $s_i$ is executed in an atomic region which commits. Then, there exist
$\langle t_i, \text{aregion begin}, u_b \rangle, \langle t_i, \text{aregion end}, u_e \rangle \in A'$ such that $[u_b, u_e]$ specifies
an atomic region and $u_b \xrightarrow{po} u_i \xrightarrow{po} u_e$. This implies $u_i$ is contained in $[u_b, u_e]$.
Let $\langle t_{i-1}, k_{i-1}, u_{i-1} \rangle, \langle t_{i+1}, k_{i+1}, u_{i+1} \rangle \in A'$ such that $k_i$ conflicts with
$k_{i-1}$ and $k_{i+1}$, and $u_{i-1} \xrightarrow{hb} u_i \xrightarrow{hb} u_{i+1}$. Atomic region constraints may impose
$u_{i-1} \xrightarrow{sw} u_b$ or $u_e \xrightarrow{sw} u_{i+1}$, both of which are trivially consistent with
$u_{i-1} \xrightarrow{hb} u_i \xrightarrow{hb} u_{i+1}$ in $E'$.
Suppose $s_i$ is executed in an atomic region which aborts. Then, there exist
$\langle t_i, \text{aregion begin}, u_b \rangle, \langle t_i, \text{aregion abort}, u_a \rangle \in A'$ such that $[u_b, u_a]$ specifies
an atomic region and $u_b \xrightarrow{po} u_a \xrightarrow{po} u_i$ (i.e., an equivalent action for $s_i$ occurs in
the alternate code executed after the atomic region abort). Because $u_i$ is not contained in
$[u_b, u_a]$, $u_{i-1} \xrightarrow{hb} u_i \xrightarrow{hb} u_{i+1}$ is trivially consistent with $E'$.
Therefore, for any execution E of P an equivalent execution E ′ of P ′ exists. Thus,
program P ′ is equivalent to program P .
As a result, atomic region formation can be used to form atomic regions containing a
single statement without changing the behavior of a program. The next section discusses
reorderings allowed for these atomic regions, and its first two proofs show that an atomic
region can be incrementally grown from a single statement, albeit with constraints, without
changing the behavior of a program.
8.4 Reorderings Permitted with Atomic Regions
This section will prove that statements can be reordered both across atomic region bound-
aries and within atomic regions. Statement reordering is important for two reasons. First,
the previous section merely proved that it is safe to form single-statement atomic regions.
By enabling a compiler to reorder statements into an atomic region, larger atomic regions
are supported. Second, compiler optimizations (such as partial redundancy elimination and
dead code elimination) can generally be thought of as reordering redundant statements to
the same location and then deleting a redundant statement. The remaining proofs assume
the following definition for statement reordering.
Definition Two adjacent statements sx and sy in program P are said to be reordered in
program P ′ if their relative control flow order is reversed in P ′. These statements are further
said to be safely reordered if and only if:
• sx and sy are reordered
• sx and sy belong to the same thread in program P
• Reordering sx and sy does not violate any control or data dependence in program P
• Reordering sx and sy does not violate the atomic region formation constraints of Sec-
tion 8.3.
Stated informally, two statements can be safely reordered if doing so would not violate
the single-thread semantics of a program. However, a safe reordering might violate the
memory ordering constraints of a program and could still be incorrect.
Therefore, the following proofs also consider statements which may have inter-thread
dependences (control or data). The next two definitions introduce the terms in-synchronize
and out-synchronize to simplify discussion of these statements. Informally, a statement in-
synchronizes with an atomic region if the statement must execute before some statement
in the atomic region because of an inter-thread dependence. Likewise, a statement out-
synchronizes with an atomic region if the statement must execute after some statement in
the atomic region because of an inter-thread dependence.
Definition Suppose $R_a$ is an atomic region in program $P$, statement $s_i$ is a predecessor of
$R_a$, $s_i \notin R_a$, and statement $s_j \in R_a$. Further suppose $E$ is a valid execution of $P$ with a set
of actions $A$, where $\langle t_i, s_i, u_i \rangle, \langle t_j, s_j, u_j \rangle, \langle t_j, \text{aregion begin}, u_b \rangle \in A$ such that
$u_i$ identifies the action produced by $s_i$, $u_j$ identifies the action produced by statement $s_j$,
and $u_b$ identifies the entry of the atomic region containing $u_j$. Consider a (potentially
invalid) execution $E'$ which is derived by removing $u_i \xrightarrow{po} u_b$ from the program order
relation of $E$ (if it exists). Then, statement $s_i$ is said to in-synchronize with $R_a$ if and only
if $u_i \xrightarrow{hb} u_j$ in some $E'$ of $P$.
Definition Suppose $R_a$ is an atomic region in program $P$, statement $s_j$ is a successor of
$R_a$, $s_j \notin R_a$, and statement $s_i \in R_a$. Further suppose $E$ is a valid execution of $P$ with a
set of actions $A$, where $\langle t_i, s_i, u_i \rangle, \langle t_i, \text{aregion end}, u_e \rangle, \langle t_j, s_j, u_j \rangle \in A$ such that
$u_i$ identifies the action produced by $s_i$, $u_e$ identifies the exit of the atomic region
containing $u_i$, and $u_j$ identifies the action produced by statement $s_j$. Consider a (potentially
invalid) execution $E'$ which is derived by removing $u_e \xrightarrow{po} u_j$ from the program order
relation of $E$ (if it exists). Then, statement $s_j$ is said to out-synchronize with $R_a$ if and
only if $u_i \xrightarrow{hb} u_j$ in some $E'$ of $P$.
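The in-synchronize test can be illustrated for one concrete execution. The sketch below assumes that happens-before is the transitive closure of program order and synchronizes-with, and that program order is recorded as edges between consecutive actions of a thread, so that removing the edge from u_i to u_b cuts the only same-thread route into the region. All action names are hypothetical.

```python
def happens_before(po, sw):
    """Transitive closure of the union of po and sw, both sets of (a, b) edges."""
    succ = {}
    for a, b in po | sw:
        succ.setdefault(a, set()).add(b)
    hb = set()
    for start in succ:
        seen, stack = set(), list(succ[start])
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            hb.add((start, n))
            stack.extend(succ.get(n, ()))
    return hb


def in_synchronizes(po, sw, u_i, u_b, u_j):
    """Drop the program-order edge u_i -> u_b and test whether u_i still hb u_j."""
    po_without = po - {(u_i, u_b)}
    return (u_i, u_j) in happens_before(po_without, sw)


# One thread executing: u_i ; aregion begin (u_b) ; u_j ; aregion end (u_e).
po_local = {("u_i", "u_b"), ("u_b", "u_j"), ("u_j", "u_e")}
local_only = in_synchronizes(po_local, set(), "u_i", "u_b", "u_j")

# Same thread, but u_i is an unlock observed by a second thread, which later
# performs an unlock that the in-region lock u_j acquires.
po_sync = po_local | {("v_lock", "v_unlock")}
sw_sync = {("u_i", "v_lock"), ("v_unlock", "u_j")}
via_other_thread = in_synchronizes(po_sync, sw_sync, "u_i", "u_b", "u_j")
print(local_only, via_other_thread)
```

In the first execution the only ordering between u_i and u_j is program order, so s_i does not in-synchronize in that execution. In the second, an inter-thread chain still orders u_i before u_j after the program-order edge is removed, so s_i in-synchronizes with the region. The out-synchronize test is symmetric: remove the edge from u_e to u_j instead.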
Provided with these definitions, it is possible to prove that statements can be reordered
into an atomic region (either for optimization purposes or to grow the region) by considering
statements that in-synchronize or out-synchronize with an atomic region. In particular,
statements that are predecessors of an atomic region and do not in-synchronize with the
atomic region can be reordered into the atomic region. The symmetric claim also holds: statements
that are successors of an atomic region and do not out-synchronize with the atomic region
can be reordered into the atomic region.
Lemma 8.4. Suppose P is a correctly synchronized program which contains an atomic
region Ra. Reordering statement si, which is an immediate predecessor of Ra, with an
aregion begin of Ra results in a transformed program P ′. Program P ′ is equivalent to
program P if si does not in-synchronize with Ra.
Proof. By Lemma 8.2, adding statements to an atomic region does not introduce any new
executions to P ′ that were not possible in P . Therefore, it is only necessary to prove that
P ′ does not prohibit executions which were possible in P .
Suppose there exists a valid execution $E$ of $P$ which is prohibited in $P'$. Let
$\langle t_i, s_i, u_i \rangle, \langle t_i, \text{aregion begin}, u_b \rangle \in A$ such that $u_i$ identifies the action produced
by executing $s_i$ and $u_b$ identifies the action produced by an entry of region $R_a$. In $P'$,
statement $s_i$ is reordered after the aregion begin and therefore $u_i \overset{po}{\nrightarrow} u_b$ in any
execution $E'$ of program $P'$ with a set of actions $A' = A$.
Therefore, if execution $E$ is not equivalent to any valid execution $E'$ of program $P'$,
there must exist some $\langle t_{i+1}, k_{i+1}, u_{i+1} \rangle \in A$ such that $u_i \xrightarrow{hb} u_{i+1}$ in $E$ but
$u_i \overset{hb}{\nrightarrow} u_{i+1}$ in any $E'$.
This implies that statement $s_i$ in-synchronizes with $R_a$, which is a contradiction.
Lemma 8.5. Suppose P is a correctly synchronized program which contains an atomic
region Ra. Reordering statement si, which is an immediate successor of Ra, with an
aregion end of Ra results in a transformed program P ′. Program P ′ is equivalent to program
P if si does not out-synchronize with Ra.
Proof. As with Lemma 8.4, it is only necessary to prove that P ′ does not prohibit executions
which were possible in P .
Suppose there exists a valid execution $E$ of $P$ which is prohibited in $P'$. Let
$\langle t_i, s_i, u_i \rangle, \langle t_i, \text{aregion end}, u_e \rangle \in A$ such that $u_i$ identifies the action produced
by executing $s_i$ and $u_e$ identifies the action produced by an exit of region $R_a$. In $P'$,
statement $s_i$ is reordered before the aregion end and therefore $u_e \overset{po}{\nrightarrow} u_i$ in any
execution $E'$ of program $P'$ with a set of actions $A' = A$.
Therefore, if execution $E$ is not equivalent to any valid execution $E'$ of program $P'$,
there must exist some $\langle t_{i-1}, k_{i-1}, u_{i-1} \rangle \in A$ such that $u_{i-1} \xrightarrow{hb} u_i$ in $E$ but
$u_{i-1} \overset{hb}{\nrightarrow} u_i$ in any $E'$.
This implies that statement $s_i$ out-synchronizes with $R_a$, which is a contradiction.
Although Lemma 8.4 and Lemma 8.5 enable some statements to be reordered into an
atomic region, a compiler is still constrained. Limiting reorderings to statements that do
not in-synchronize or out-synchronize with an atomic region is overly prohibitive under
strict memory models, such as that of x86.
Hardware can enable a relaxation of these constraints, specifically by permitting any
safe reordering of a statement into an atomic region. However, this requires hardware to
abort an atomic region if a speculative memory conflict is detected during atomic region
execution. For example, the hardware assumed by the implementations of Chapter 6 and
Chapter 7 monitors the cache coherence protocol and eagerly detects violations of atomic
region execution. If a violation is detected, hardware implicitly aborts the atomic region
and resumes execution in a non-atomic version of the region. However, proving this claim
requires formal reasoning about speculative actions and is left as future work.
Nonetheless, for languages with weak memory models (or assuming that the previous
claim holds), the next proof demonstrates that statements can be freely reordered within
an atomic region. Specifically, safely reordering two statements which are contained in an
atomic region is guaranteed to be correct.
Lemma 8.6. If program P , which contains atomic regions, is transformed into program
P ′ by safely reordering two adjacent statements inside an atomic region of P , then P ′ is
equivalent to P .
Proof. Let $A$ be the set of actions for some valid execution $E$ of program $P$. Suppose that
$\langle t_i, s_i, u_i \rangle, \langle t_i, s_j, u_j \rangle \in A$ correspond to adjacent statements $s_i$ and $s_j$ in $P$ such
that $s_i$ and $s_j$ can be safely reordered. Further suppose $u_i$ and $u_j$ are contained in an atomic
region specified by $[u_b, u_e]$, where
$\langle t_i, \text{aregion begin}, u_b \rangle, \langle t_i, \text{aregion end}, u_e \rangle \in A$, and $u_i \xrightarrow{po} u_j$.
For any $\langle t_n, k_n, u_n \rangle \in A$ such that $k_n$ conflicts with $s_j$, the atomic region constraints
require that:
• Either $u_n \xrightarrow{sw} u_b$, which implies $u_n \xrightarrow{hb} u_j$
• Or $u_e \xrightarrow{sw} u_n$, which implies $u_j \xrightarrow{hb} u_n$
An execution $E'$ of program $P'$ exists, which differs from $E$ only in that $s_i$ and $s_j$ are
reordered and $u_j \xrightarrow{po} u_i$ in $E'$. Atomic region constraints require that:
• Either $u_n \xrightarrow{sw} u_b$, which implies $u_n \xrightarrow{hb} u_j$
• Or $u_e \xrightarrow{sw} u_n$, which implies $u_j \xrightarrow{hb} u_n$
Thus for any conflict contained in an atomic region in $E'$, a conflict with an equivalent
happens-before relation can be found in some execution $E$ of $P$. The converse argument
holds and $P$ is therefore equivalent to $P'$.
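The two cases of this proof can be written out as explicit chains. The following is a worked restatement rather than an additional result, assuming (as containment in $[u_b, u_e]$ suggests) that $u_b$ precedes and $u_e$ follows every contained action in program order:

```latex
\begin{align*}
\text{Case 1, in } E:\quad
  & u_n \xrightarrow{sw} u_b \xrightarrow{po} u_i \xrightarrow{po} u_j
  && \Longrightarrow\; u_n \xrightarrow{hb} u_j \\
\text{Case 1, in } E':\quad
  & u_n \xrightarrow{sw} u_b \xrightarrow{po} u_j \xrightarrow{po} u_i
  && \Longrightarrow\; u_n \xrightarrow{hb} u_j \\
\text{Case 2, in } E:\quad
  & u_i \xrightarrow{po} u_j \xrightarrow{po} u_e \xrightarrow{sw} u_n
  && \Longrightarrow\; u_j \xrightarrow{hb} u_n \\
\text{Case 2, in } E':\quad
  & u_j \xrightarrow{po} u_i \xrightarrow{po} u_e \xrightarrow{sw} u_n
  && \Longrightarrow\; u_j \xrightarrow{hb} u_n
\end{align*}
```

In each case the happens-before edge between $u_n$ and $u_j$ is routed through a region boundary, so the relative order of $u_i$ and $u_j$ inside the region does not affect it.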
As a result, a compiler can ignore memory model constraints when reordering statements
within an atomic region. This may provide additional reordering freedom and thereby en-
able additional optimization (see Chapter 3.2). This is a key benefit of the atomic region
abstraction.
However, in regions of a program which do not contain synchronization statements a
compiler is already free of memory model constraints. If atomic regions are inserted into such
a program region, new constraints should not be introduced. Lemma 8.4 and Lemma 8.5
already proved this for optimizations which logically reorder statements into an atomic
region. The following proofs will show that similar freedom exists for optimizations which
logically reorder statements out of an atomic region.
Note that the definition of safe reordering must be refined to include control specula-
tion inside of an atomic region. The following proofs momentarily assume that no control
speculation occurs within an atomic region, but this restriction will be lifted by the refined
definition of safe reordering provided in Section 8.4.1.
Lemma 8.7. Let P be a program with atomic regions. Also let P contain an atomic
region Ra such that no statement in-synchronizes with Ra. If program P is transformed
into program P ′ by safely reordering the aregion begin statement of Ra with the immediate
successor of the aregion begin, then program P ′ is equivalent to program P .
Proof. Let $A$ be the set of actions of an execution $E$ of program $P$. Assume that
$s_b = \text{aregion begin}$ and $s_e = \text{aregion end}$ are an entry and exit of atomic region $R_a$,
respectively, and that statement $s_i \in R_a$ is an immediate successor of $s_b$. Suppose $s_i$ can
be safely reordered with $s_b$ and no statement in-synchronizes with $R_a$.
Further suppose that $\langle t_i, s_i, u_i \rangle, \langle t_i, s_b, u_b \rangle, \langle t_i, s_e, u_e \rangle \in A$ are actions which
correspond to statements $s_i$, $s_b$ and $s_e$, respectively, and $u_i$ is contained in the atomic
region $[u_b, u_e]$.
Let $\langle t_j, k_j, u_j \rangle \in A$ such that $t_j \neq t_i$ and $u_j \xrightarrow{hb} u_i$. Because no statement
in-synchronizes with $R_a$, there must be some $\langle t_i, k_{i-1}, u_{i-1} \rangle \in A$ such that
$u_j \xrightarrow{hb} u_{i-1} \xrightarrow{po} u_b \xrightarrow{po} u_i$.
Suppose that the statements $s_b$ and $s_i$ have been reordered in program $P'$. Then in any
execution $E'$ of $P'$, $u_j \xrightarrow{hb} u_{i-1} \xrightarrow{po} u_i \xrightarrow{po} u_b$ and $u_j \xrightarrow{hb} u_i$. Thus,
execution $E'$ is equivalent to $E$.
The converse is also true by the same argument and program $P'$ is equivalent to program
$P$.
Lemma 8.8. Let P be a program with atomic regions. Also let P contain an atomic
region Ra such that no statement out-synchronizes with Ra. If program P is transformed
into program P ′ by safely reordering the aregion end statement of Ra with the immediate
predecessor of the aregion end, then program P ′ is equivalent to program P .
Proof. Let $A$ be the set of actions of an execution $E$ of program $P$. Assume that
$s_b = \text{aregion begin}$ and $s_e = \text{aregion end}$ are an entry and exit of atomic region $R_a$,
respectively, and that statement $s_i \in R_a$ is an immediate predecessor of $s_e$. Suppose $s_i$ can
be safely reordered with $s_e$ and no statement out-synchronizes with $R_a$.
Further suppose that $\langle t_i, s_i, u_i \rangle, \langle t_i, s_b, u_b \rangle, \langle t_i, s_e, u_e \rangle \in A$ are actions which
correspond to statements $s_i$, $s_b$ and $s_e$, respectively, and $u_i$ is contained in the atomic
region $[u_b, u_e]$.
Let $\langle t_j, k_j, u_j \rangle \in A$ such that $t_j \neq t_i$ and $u_i \xrightarrow{hb} u_j$. Because no statement
out-synchronizes with $R_a$, there must be some $\langle t_i, k_{i+1}, u_{i+1} \rangle \in A$ such that
$u_i \xrightarrow{po} u_e \xrightarrow{po} u_{i+1} \xrightarrow{hb} u_j$.
Suppose that the statements $s_i$ and $s_e$ have been reordered in program $P'$. Then in any
execution $E'$ of $P'$, $u_e \xrightarrow{po} u_i \xrightarrow{po} u_{i+1} \xrightarrow{hb} u_j$ and $u_i \xrightarrow{hb} u_j$. Thus,
execution $E'$ is equivalent to $E$.
The converse is also true by the same argument and program $P'$ is equivalent to program
$P$.
The following two examples depict optimization opportunities which are enabled by the
reorderings permitted by atomic region semantics. Figure 8.1 depicts an atomic region which
contains an unlock statement. In the Java memory model, an unlock action synchronizes-
with all subsequently executed lock actions, and, therefore, no statement in-synchronizes
with the depicted atomic region but there could be a statement that out-synchronizes with
the atomic region.
Lemma 8.6 enables reordering the statements within the atomic region. Because no
statement in-synchronizes with the atomic region, Lemma 8.4 permits reordering a state-
ment immediately preceding an atomic region with the aregion begin, which enables the
elimination of the dead store in block A. Likewise, Lemma 8.7 permits a statement inside
the atomic region to be reordered with the aregion begin. This further enables the elimination
of the redundant load in block B. By Lemma 8.5, a statement that immediately follows
and does not out-synchronize with an atomic region may be reordered into the atomic region.
No synchronization statement exists on the control flow path from the aregion end to the
redundant load in block C. Therefore, the redundant load does not out-synchronize with the
atomic region and can be eliminated.
[Figure 8.1: graphic omitted; both panels list the statements r1 = ld [a]; st[b] = 1; unlock(x); r1 = ld [a]; st[b] = 2.]
Figure 8.1: Example reorderings in an atomic region that contains an unlock. (a)
The atomic region contains a Java unlock statement. Therefore, there could be a statement
in the program that out-synchronizes with the region. (b) This prevents reordering
and, thereby, optimization of the dead store in block B but otherwise exposes optimization
opportunity by permitting other reorderings.
Figure 8.2 depicts a similar atomic region, except that the atomic region contains a
lock statement instead of an unlock statement. In the Java memory model, any previ-
ously executed unlock action synchronizes-with a lock action. Therefore, no statement
out-synchronizes with the depicted atomic region but there could be a statement that in-
synchronizes with the atomic region.
Because no statement out-synchronizes with the atomic region, Lemma 8.5 enables the
elimination of the redundant load in block C. Furthermore, Lemma 8.8 enables a statement
inside the atomic region to be reordered with the aregion end. This further enables the
elimination of the dead store in block B. By Lemma 8.4, a statement that immediately
precedes and does not in-synchronize with an atomic region may be reordered into the
atomic region. No synchronization statement exists on the control flow path from the dead
store in block A to the aregion begin. Therefore, the dead store does not in-synchronize
with the atomic region and can be eliminated.
[Figure 8.2: graphic omitted; both panels list the statements r1 = ld [a]; st[b] = 1; lock(x); r1 = ld [a]; st[b] = 2.]
Figure 8.2: Example reorderings in an atomic region that contains a lock. (a) The
atomic region contains a Java lock statement. Therefore, there could be a statement in the
program that in-synchronizes with the region. (b) This prevents reordering and, thereby,
optimization of the redundant load in block B, but otherwise exposes optimization
opportunity by permitting other reorderings. For example, the dead store from Figure 8.1
can be optimized.
8.4.1 Control Speculation and Safe Reordering
The definition of safe reordering of program statements has thus far ignored control speculation
within an atomic region, but the primary motivation for the atomic region abstraction is the
simplicity with which it enables a compiler to employ control speculation.
Figure 8.3 depicts an example atomic region in which a highly-biased branch is converted
into an assert statement. As described in Chapter 2.2, an assert statement is not considered
a control operation by a compiler and it therefore enables a speculative relaxation of control
dependences. This control speculation is made safe by the atomic region guarantees provided
by hardware.
[Figure 8.3: graphic omitted; the recoverable statements are if (r1 != 0) and st[b] = 42.]
Figure 8.3: Control dependences restrict optimization across atomic region boundaries.
(a) The store in basic block B is control dependent on the null-check in basic block
A. (b) When an atomic region is created from these basic blocks the null-check is converted
into an assert operation. The assert operation relaxes control dependences, but it is still
invalid to hoist the store out of the atomic region. (c) An example of an invalid hoisting
optimization is shown.
Therefore, statements which were previously control-dependent on a branch which has
been converted into an assert must not escape an atomic region. For example, Figure 8.3(c)
depicts an invalid reordering which should not be permitted. Because a branch in the atomic
region has been converted into an assert, a store appears control independent and has been
reordered out of the atomic region. This transformation is clearly incorrect because the store
has become non-speculative even though it should depend on the successful commit of the
atomic region. Even if the assert fires and the region is aborted the store will still execute
and thereby expose a speculative value. This example motivates the following addition to
the definition of control dependence.
Definition If statement sx is control dependent on statement sy and both sx and sy
are contained in an atomic region, then statement sx is also control dependent on every
aregion begin of the atomic region. This control dependence persists even if sy is converted
into an assert statement.
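The effect of this refined definition can be sketched as a small dependence-table update. The code below is illustrative only: the function names and statement labels are hypothetical, control dependences are assumed to be available as a map from each statement to the statements it depends on, and the example mirrors the store and converted null-check of Figure 8.3.

```python
def refined_control_deps(control_deps, region_members, region_begins):
    """Add a dependence on every aregion begin for control-dependent statements.

    control_deps maps a statement to the set of statements it is control
    dependent on. A dependence recorded on a branch is kept when that branch
    is converted into an assert.
    """
    refined = {s: set(d) for s, d in control_deps.items()}
    for sx, deps in control_deps.items():
        if sx in region_members and any(sy in region_members for sy in deps):
            refined[sx] |= set(region_begins)
    return refined


def may_hoist_above(stmt, target, deps):
    """A statement may not move above something it is control dependent on."""
    return target not in deps.get(stmt, set())


store, check, begin = "st[b] = 42", "assert (r1 != 0)", "aregion begin"
original = {store: {check}}
refined = refined_control_deps(original, {store, check}, {begin})

hoist_without_refinement = may_hoist_above(store, begin, original)
hoist_with_refinement = may_hoist_above(store, begin, refined)
print(hoist_without_refinement, hoist_with_refinement)
```

Without the added dependence the store appears free to move above the aregion begin, which is the invalid hoist of Figure 8.3(c); with it, the move is rejected while reordering inside the region is unaffected.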
This refined definition prevents a control-speculative statement from being reordered
out of an atomic region, because the definition of safe reordering requires that control
dependences be preserved. Reordering a control-speculative statement within an atomic
region is still permitted, however. The following proves the correctness of reordering
control-speculative statements within an atomic region.
Lemma 8.9. Let program $P$ contain a statement $s_x$ which is control dependent on $s_y$, where
$s_x$ and $s_y$ are adjacent and contained in the same atomic region. If $P$ is transformed into
program $P'$ by converting $s_y$ into $s'_y$, where $s'_y = \text{assert}$, and by safely reordering $s_x$
and $s'_y$, then program $P'$ is equivalent to program $P$.
Proof. Let $E'$ be a valid execution of program $P'$. Suppose that the atomic region containing
$s_x$ and $s'_y$ does not abort in $E'$; then $s'_y$ did not misspeculate. By Lemma 8.6, there must
exist some execution $E$ of program $P$ such that $E'$ is equivalent to $E$.
Suppose that the atomic region $R_a$ containing $s_x$ and $s'_y$ does abort in $E'$; then the
executed atomic region will contain no actions in $E'$. By the constraints on atomic region
formation, the actions following the aborted region in program order must be equivalent to
an execution of the original region from which $R_a$ was formed. In this case, an equivalent
execution of the same region exists in $E$ (by Lemma 8.3, Lemma 8.4, and Lemma 8.5).
Thus, for any execution E ′ of P ′ there exists an equivalent execution E of P . The




Despite the growing emphasis on parallel processing and multiprocessors, I believe that im-
proving single-thread performance remains paramount. The emphasis on parallelism is a
practical response to limits in process scaling and the difficulty of building larger-window,
wider-issue or more deeply pipelined processors. Certainly, recovering historical improve-
ments in single-thread performance is unlikely. Nonetheless, opportunities remain.
This dissertation is evidence of this simple fact. It demonstrates that hardware and
software can play complementary roles in exposing these additional opportunities. Hardware
primitives designed for use by software can provide power- and complexity-efficient
improvements in performance. Specifically, the hardware atomicity primitive provides the
means by which the atomic region abstraction enables software to more easily exploit spec-
ulative optimization opportunities.
I have detailed many of the performance opportunities that compiler writers pursue as
well as the obstacles which often prevent them from being exploited. Speculative compiler
optimization attempts to exploit many of these opportunities in the common case, and this
concept is not new. Likewise, others have proposed hardware to enable software to more
effectively employ speculative optimization.
The atomic region abstraction is but a furthering of this line of thinking, albeit a demon-
strably effective one. It has been incorporated into two different dynamic optimization
frameworks and shown to improve performance both in simulation and on a real hardware
system. Practical solutions have been presented for all the major technical issues ranging
from IR representation to misspeculation management.
That said, important questions are left unanswered. I have demonstrated the utility of
the atomic region abstraction in improving single-thread performance, but it is unclear whether
further performance potential remains. A limit study or an exploration of the design space,
including choices such as region formation, would help determine whether additional
opportunity exists.
In Chapter 6, the atomic region abstraction is prototyped in a JIT compiler and is shown
to improve performance. The baseline hardware is an out-of-order x86 processor extended
with the hardware atomicity primitive. The combination of type-based alias analysis [34]
and out-of-order scheduling hardware means that hardware for memory alias detection is not
strictly needed. However, aliases may still exist and alias detection hardware could further
enable the removal of partially redundant memory operations.
In Chapter 7, I argue that the atomic region and superblock abstractions are synergistic.
This is certainly true for a VLIW processor because of its need for effective static scheduling.
However, other static scheduling techniques might make for a better match. For example,
wavefront scheduling [9] is amenable to arbitrary region shapes and could allow for a more
effective implementation of the atomic region abstraction.
Even more fundamentally, the atomic region abstraction has only been explored in the
context of dynamic optimization frameworks. Whether it is viable in a static compiler
is not known. At a minimum, some runtime support would be necessary for monitoring
misspeculation and disabling unprofitable asserts. Other requirements may also exist,
and other issues may need to be solved.
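As a rough illustration of the kind of runtime support meant here, the following minimal sketch tracks how often each assert forces a rollback and disables an assert once its abort rate exceeds a threshold. It is not the mechanism used in this dissertation: the class names, the `record` interface, and both thresholds are hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field


@dataclass
class AssertStats:
    """Per-assert counters kept by the hypothetical runtime monitor."""
    executions: int = 0   # times the enclosing atomic region was entered
    aborts: int = 0       # times this assert fired and forced a rollback
    enabled: bool = True  # False once the assert is judged unprofitable


@dataclass
class MisspeculationMonitor:
    # Both thresholds are illustrative assumptions, not measured values.
    max_abort_rate: float = 0.01
    min_samples: int = 100
    stats: dict = field(default_factory=dict)

    def record(self, assert_id: str, aborted: bool) -> bool:
        """Record one region execution; return True if the assert stays enabled."""
        s = self.stats.setdefault(assert_id, AssertStats())
        if not s.enabled:
            return False
        s.executions += 1
        if aborted:
            s.aborts += 1
        # Only judge profitability once enough samples have been collected.
        if s.executions >= self.min_samples:
            if s.aborts / s.executions > self.max_abort_rate:
                # A real system would now fall back to non-speculative code,
                # for example by recompiling the region without this assert.
                s.enabled = False
        return s.enabled
```

A dynamic optimizer can call something like `record` from its abort handler. A static compiler would need to ship an equivalent monitor, along with a non-speculative fallback version of each region, in the program's runtime library.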
These questions are left for derivative work. It is my hope that others might further the
understanding of the atomic region abstraction and its performance potential. Barring that,
I hope that the success of the atomic region abstraction encourages others to continue the
pursuit of single-thread performance improvements.
References
[1] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial.
Computer, 29(12):66–76, 1996.
[2] S. V. Adve and M. D. Hill. Weak ordering–a new definition. In Proceedings of the 17th
Annual International Symposium on Computer Architecture, pages 2–14, June 1990.
[3] S. V. Adve and M. D. Hill. A unified formalization of four shared-memory models.
IEEE Transactions on Parallel and Distributed Systems, 4(6):613–624, June 1993.
[4] H. Akkary, R. Rajwar, and S. T. Srinivasan. Checkpoint Processing and Recovery: To-
wards scalable large instruction window processors. In Proceedings of the 36th Annual
International Symposium on Microarchitecture, pages 423–434, December 2003.
[5] Apache. Harmony Dynamic Runtime Layer Virtual Machine (DRLVM).
http://harmony.apache.org/subcomponents/drlvm.
[6] A. Ayers, R. Schooler, and R. Gottlieb. Aggressive inlining. In Proceedings of the ACM
SIGPLAN 1997 Conference on Programming Language Design and Implementation,
pages 134–145, June 1997.
[7] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimiza-
tion system. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming
Language Design and Implementation, pages 1–12, June 2000.
[8] L. Baraz, T. Devor, O. Etzion, S. Goldenberg, A. Skaletsky, Y. Wang, and Y. Zemach.
IA-32 Execution Layer: A two-phase dynamic translator designed to support IA-32
applications on Itanium-based systems. In Proceedings of the 36th Annual International
Symposium on Microarchitecture, pages 191–204, December 2003.
[9] J. Bharadwaj, K. Menezes, and C. McKinsey. Wavefront Scheduling: Path based
data representation and scheduling of subgraphs. In Proceedings of the 32nd Annual
International Symposium on Microarchitecture, pages 262–271, November 1999.
[10] G. Bilardi and K. Pingali. A framework for generalized control dependence. In Pro-
ceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design
and Implementation, pages 291–300, May 1996.
[11] S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khang, K. S. McKinley, R. Bentzur,
A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump,
H. Lee, J. E. B. Moss, B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von
Dincklage, and B. Wiedermann. The DaCapo Benchmarks: Java benchmarking devel-
opment and analysis. In Proceedings of the 21st Annual Conference on Object-Oriented
Programming, Systems, Languages, and Applications, pages 169–190, October 2006.
[12] B. Blanchet. Escape Analysis for Object-Oriented Languages: Application to Java. In
Proceedings of the 14th Annual Conference on Object-Oriented Programming, Systems,
Languages, and Applications, pages 20–34, November 1999.
[13] C. Blundell, M. M. K. Martin, and T. F. Wenisch. InvisiFence: Performance-
transparent memory ordering in conventional multiprocessors. In Proceedings of the
36th Annual International Symposium on Computer Architecture, pages 233–244, June
2009.
[14] R. Bodík and R. Gupta. Partial dead code elimination using slicing transformations.
In Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language
Design and Implementation, pages 159–170, June 1997.
[15] R. Bodík, R. Gupta, and V. Sarkar. ABCD: Eliminating array bounds checks on
demand. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming
Language Design and Implementation, pages 321–333, June 2000.
[16] R. Bodík, R. Gupta, and M. L. Soffa. Interprocedural conditional branch elimination.
In Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language
Design and Implementation, pages 146–158, June 1997.
[17] R. Bodík, R. Gupta, and M. L. Soffa. Complete removal of redundant expressions.
In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language
Design and Implementation, pages 1–14, June 1998.
[18] H.-J. Boehm. Reordering constraints for pthread-style locks. In Proceedings of the
12th Symposium on Principles and Practice of Parallel Programming, pages 173–182,
March 2007.
[19] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model.
In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language
Design and Implementation, pages 68–78, June 2008.
[20] J. Bogda and U. Hölzle. Removing unnecessary synchronization in Java. In Proceedings
of the 14th Annual Conference on Object-Oriented Programming, Systems, Languages,
and Applications, pages 35–46, November 1999.
[21] E. Borch, S. Manne, J. Emer, and E. Tune. Loose loops sink chips. In Proceedings of
the 8th International Symposium on High-Performance Computer Architecture, pages
299–310, February 2002.
[22] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23–29, 1999.
[23] R. A. Bringmann, S. A. Mahlke, R. E. Hank, J. C. Gyllenhaal, and W.-M. W. Hwu.
Speculative execution exception recovery using write-back suppression. In Proceedings
of the 26th Annual International Symposium on Microarchitecture, pages 214–223,
November 1993.
[24] J. Cavazos and M. F. P. O’Boyle. Automatic tuning of inlining heuristics. In Proceed-
ings of the 2005 ACM/IEEE Conference on Supercomputing, page 14, 2005.
[25] C. Chambers and D. Ungar. Making pure object-oriented languages practical. In
Proceedings of the 6th Annual Conference on Object-Oriented Programming, Systems,
Languages, and Applications, pages 1–15, October 1991.
[26] C. Click. Global code motion/global value numbering. In Proceedings of the ACM
SIGPLAN 1995 Conference on Programming Language Design and Implementation,
pages 246–257, June 1995.
[27] B. Cmelik and D. Keppel. Shade: A fast instruction-set simulator for execution profil-
ing. In Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and
Modeling of Computer Systems, pages 128–137, May 1994.
[28] CodePlex. Dynamic language runtime. http://dlr.codeplex.com.
[29] A. Cristal, D. Ortega, J. Llosa, and M. Valero. Out-of-order commit processors. In
Proceedings of the 9th International Symposium on High-Performance Computer Ar-
chitecture, pages 48–59, February 2004.
[30] X. Dai, A. Zhai, W.-C. Hsu, and P.-C. Yew. A general compiler framework for specu-
lative optimizations using data speculative code motion. In Proceedings of the 3rd In-
ternational Symposium on Code Generation and Optimization, pages 280–290, March
2005.
[31] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid
transactional memory. In Proceedings of the 12th International Conference on Archi-
tectural Support for Programming Languages and Operating Systems, pages 336–346,
October 2006.
[32] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and
J. Mattson. The Transmeta Code Morphing Software: Using speculation, recovery,
and adaptive retranslation to address real-life challenges. In Proceedings of the 1st
International Symposium on Code Generation and Optimization, pages 15–24, March
2003.
[33] P. C. Diniz and M. C. Rinard. Lock Coarsening: Eliminating lock overhead in au-
tomatically parallelized object-based programs. Journal of Parallel and Distributed
Computing, 49(2):218–244, 1998.
[34] A. Diwan, K. S. McKinley, and J. E. B. Moss. Type-based alias analysis. In Proceed-
ings of the ACM SIGPLAN 1998 Conference on Programming Language Design and
Implementation, pages 106–117, June 1998.
[35] ECMA International. ECMA-335 - Common Language Infrastructure (CLI), December
2010.
[36] L. Effinger-Dean, H.-J. Boehm, D. Chakrabarti, and P. Joisha. Extended sequential
reasoning for data-race-free programs. In Proceedings of the 2011 ACM SIGPLAN
Workshop on Memory Systems Performance and Correctness, pages 22–29, June 2011.
[37] B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. J. Patel, and S. S.
Lumetta. Performance characterization of a hardware mechanism for dynamic opti-
mization. In Proceedings of the 34th Annual International Symposium on Microarchi-
tecture, pages 16–27, December 2001.
[38] L. Feigen, D. Klappholz, R. Casazza, and X. Xue. The revival transformation. In
Proceedings of the 21st Symposium on Principles of Programming Languages, pages
421–434, January 1994.
[39] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and
its use in optimization. ACM Transactions on Programming Languages and Systems,
9(3):319–349, 1987.
[40] S. J. Fink and F. Qian. Design, implementation and evaluation of adaptive recompi-
lation with on-stack replacement. In Proceedings of the 1st International Symposium
on Code Generation and Optimization, pages 241–252, March 2003.
[41] J. A. Fisher. Trace Scheduling: A technique for global microcode compaction. IEEE
Transactions on Computers, 30(7):478–490, 1981.
[42] J. A. Fisher and S. M. Freudenberger. Predicting conditional branch directions from
previous runs of a program. In Proceedings of the 5th International Conference on
Architectural Support for Programming Languages and Operating Systems, pages 85–
95, October 1992.
[43] D. M. Gallagher, W. Y. Chen, S. A. Mahlke, J. C. Gyllenhaal, and W.-M. W. Hwu.
Dynamic memory disambiguation using the memory conflict buffer. In Proceedings of
the 6th International Conference on Architectural Support for Programming Languages
and Operating Systems, pages 183–193, October 1994.
[44] K. Gharachorloo, A. Gupta, and J. Hennessy. Two techniques to enhance the
performance of memory consistency models. In Proceedings of the 1991 International
Conference on Parallel Processing, pages 355–364, August 1991.
[45] G. Hamerly, E. Perelman, J. Lau, and B. Calder. Simpoint 3.0: Faster and more
flexible program analysis. Journal of Instruction Level Parallelism, 7:1–28, 2005.
[46] R. E. Hank, W.-M. W. Hwu, and B. R. Rau. Region-Based Compilation: An intro-
duction and motivation. In Proceedings of the 28th Annual International Symposium
on Microarchitecture, pages 158–168, November 1995.
[47] M. S. Hecht and J. D. Ullman. Flow graph reducibility. In Proceedings of the 4th
Annual Symposium on Theory of Computing, pages 238–250, May 1972.
[48] M. Herlihy and J. E. B. Moss. Transactional Memory: Architectural support for lock-
free data structures. In Proceedings of the 20th Annual International Symposium on
Computer Architecture, pages 289–300, May 1993.
[49] A. Heydon and M. Najork. Performance limitations of the Java core libraries. In
Proceedings of the ACM 1999 Conference on Java Grande, pages 35–41, June 1999.
[50] U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic
deoptimization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming
Language Design and Implementation, pages 32–43, June 1992.
[51] U. Hölzle and D. Ungar. A Third-Generation SELF Implementation: Reconciling
responsiveness with performance. In Proceedings of the 9th Annual Conference on
Object-Oriented Programming, Systems, Languages, and Applications, pages 229–243,
October 1994.
[52] W. Hu, Q. Liu, J. Wang, S. Cai, M. Su, and X. Li. Efficient binary translation system
with low hardware cost. In 27th International Conference on Computer Design, pages
305–312, October 2009.
[53] J. Huck, D. Morris, J. Ross, A. Knies, H. Mulder, and R. Zahir. Introducing the IA-64
architecture. IEEE Micro, 20(5):12–23, 2000.
[54] W.-M. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bring-
mann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M.
Lavery. The Superblock: An effective technique for VLIW and superscalar compilation.
Journal of Supercomputing, 7(1-2):229–248, 1993.
[55] W.-M. W. Hwu and Y. N. Patt. Checkpoint repair for out-of-order execution machines.
In Proceedings of the 14th Annual International Symposium on Computer Architecture,
pages 18–26, June 1987.
[56] Intel Corporation. Excerpts from A Conversation with Gordon Moore: Moore’s Law.
Video Transcript, 2005.
[57] Intel Corporation. Intel 64 architecture memory ordering white paper, August 2007.
[58] J. Janssen and H. Corporaal. Making graphs reducible with controlled node splitting.
ACM Transactions on Programming Languages and Systems, 19(6):1031–1052, 1997.
[59] JSR 133 Expert Group. JSR-133: Java memory model and thread specification, August
2004.
[60] JSR 292 Expert Group. JSR-292: Supporting dynamically typed languages on the
Java platform, February 2006.
[61] K. Kawachiya, A. Koseki, and T. Onodera. Lock Reservation: Java locks can mostly do
without atomic operations. In Proceedings of the 17th Annual Conference on Object-
Oriented Programming, Systems, Languages, and Applications, pages 130–141, Novem-
ber 2002.
[62] K. Kennedy. Safety of code motion. International Journal of Computer Mathematics,
3:117–130, 1972.
[63] R. Kennedy, S. Chan, S.-M. Liu, R. Lo, P. Tu, and F. Chow. Partial redundancy
elimination in SSA form. ACM Transactions on Programming Languages and Systems,
21(3):627–676, 1999.
[64] A. Klaiber. The Technology Behind Crusoe Processors. Transmeta Corporation White
Paper, January 2000.
[65] J. Knoop, O. Rüthing, and B. Steffen. Lazy code motion. In Proceedings of the ACM
SIGPLAN 1992 Conference on Programming Language Design and Implementation,
pages 224–234, June 1992.
[66] J. Knoop, O. Rüthing, and B. Steffen. Partial dead code elimination. In
Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and
Implementation, pages 147–158, June 1994.
[67] J. R. Larus and R. Rajwar. Transactional Memory. Morgan and Claypool, December
2006.
[68] J. Laudon and L. Spracklen. The coming wave of multithreaded chip multiprocessors.
International Journal of Parallel Programming, 35(3):299–330, 2007.
[69] J. Lin, T. Chen, W.-C. Hsu, P.-C. Yew, R. D.-C. Ju, T.-F. Ngai, and S. Chan. A com-
piler framework for speculative analysis and optimizations. In Proceedings of the ACM
SIGPLAN 2003 Conference on Programming Language Design and Implementation,
pages 289–299, June 2003.
[70] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. Prentice Hall,
April 1999.
[71] R. Lo, F. Chow, R. Kennedy, S.-M. Liu, and P. Tu. Register promotion by sparse par-
tial redundancy elimination of loads and stores. In Proceedings of the ACM SIGPLAN
1998 Conference on Programming Language Design and Implementation, pages 26–37,
June 1998.
[72] S. A. Mahlke, W. Y. Chen, R. A. Bringmann, R. E. Hank, W.-M. W. Hwu, B. R. Rau,
and M. S. Schlansker. Sentinel Scheduling: A model for compiler-controlled speculative
execution. ACM Transactions on Computer Systems, 11(4):376–408, 1993.
[73] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective
compiler support for predicated execution using the hyperblock. In Proceedings of the
25th Annual International Symposium on Microarchitecture, pages 45–54, November
1992.
[74] J. Manson, W. Pugh, and S. V. Adve. The Java memory model. In Proceedings of the
32nd Symposium on Principles of Programming Languages, pages 378–391, January
2005.
[75] J. F. Martínez, J. Renau, M. C. Huang, M. Prvulovic, and J. Torrellas. Cherry:
Checkpointed early resource recycling in out-of-order microprocessors. In Proceedings of the
35th Annual International Symposium on Microarchitecture, pages 3–14, November
2002.
[76] S. Melvin and Y. Patt. Exploiting fine-grained parallelism through a combination of
hardware and software techniques. In Proceedings of the 18th Annual International
Symposium on Computer Architecture, pages 287–296, May 1991.
[77] S. Melvin and Y. Patt. Enhancing instruction scheduling with a block-structured ISA.
International Journal of Parallel Programming, 23(3):221–243, 1995.
[78] G. E. Moore. Cramming more components onto integrated circuits. Electronics,
38(8):82–85, 1965.
[79] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann,
August 1997.
[80] F. Mueller and D. B. Whalley. Avoiding conditional branches by code replication.
In Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language
Design and Implementation, pages 56–66, June 1995.
[81] R. Muth and S. Debray. Partial Inlining. Technical report, Department of Computer
Science, University of Arizona, 1997.
[82] N. Neelakantam, D. R. Ditzel, and C. Zilles. A real system evaluation of hardware
atomicity for software speculation. In Proceedings of the 15th International Conference
on Architectural Support for Programming Languages and Operating Systems, pages
29–38, March 2010.
[83] N. Neelakantam, R. Rajwar, S. Srinivas, U. Srinivasan, and C. Zilles. Hardware atom-
icity for reliable software speculation. In Proceedings of the 34th Annual International
Symposium on Computer Architecture, pages 174–185, June 2007.
[84] N. Neelakantam, R. Rajwar, S. Srinivas, U. Srinivasan, and C. Zilles. Hardware Atom-
icity: An effective abstraction for reliable software speculation. IEEE Micro, 28(1):21–
31, 2008.
[85] M. Paleczny, C. Vick, and C. Click. The Java Hotspot server compiler. In USENIX
2001 Java Virtual Machine Research and Technology Symposium, pages 1–12, April
2001.
[86] S. J. Patel and S. S. Lumetta. rePLay: A hardware framework for dynamic optimiza-
tion. IEEE Transactions on Computers, 50(6):590–608, 2001.
[87] E. Perelman, J. Lau, H. Patil, A. Jaleel, G. Hamerly, and B. Calder. Cross binary
simulation points. In IEEE 2007 International Symposium on Performance Analysis
of Systems and Software, pages 179–189, April 2007.
[88] R. Rajwar and J. R. Goodman. Speculative Lock Elision: Enabling highly concurrent
multithreaded execution. In Proceedings of the 34th Annual International Symposium
on Microarchitecture, pages 294–305, December 2001.
[89] G. Rozas. Memory management methods and systems that support cache consistency.
United States Patent 7,376,798, May 2008.
[90] G. Rozas, A. Klaiber, D. Dunn, P. Serris, and L. Shah. Supporting speculative modi-
fication in a data cache. United States Patent 7,225,299, May 2007.
[91] M. Schlansker and V. Kathail. Critical path reduction for scalar programs. In Proceed-
ings of the 28th Annual International Symposium on Microarchitecture, pages 57–69,
November 1995.
[92] D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share
memory. ACM Transactions on Programming Languages and Systems, 10(2):282–312,
1988.
[93] J. W. Sias. A Systematic Approach to Delivering Instruction-Level Parallelism in EPIC
Systems. PhD thesis, Department of Electrical and Computer Engineering, University
of Illinois at Urbana-Champaign, 2005.
[94] R. Singhal, K. S. Venkatraman, E. R. Cohn, J. G. Holm, D. A. Koufaty, M.-J. Lin,
M. J. Madhav, M. Mattwandel, N. Nidhi, J. D. Pearce, and M. Seshadri. Performance
analysis and validation of the Intel Pentium 4 processor on 90nm technology. Intel
Technology Journal, 8(1):33–42, 2004.
[95] M. D. Smith, M. Horowitz, and M. S. Lam. Efficient superscalar performance through
boosting. In Proceedings of the 5th International Conference on Architectural Support
for Programming Languages and Operating Systems, pages 248–259, October 1992.
[96] Standard Performance Evaluation Corporation (SPEC). CINT2000 (Integer Compo-
nent of SPEC CPU2000). http://www.spec.org/cpu2000/CINT2000.
[97] Standard Performance Evaluation Corporation (SPEC). CINT2006 (Integer Compo-
nent of SPEC CPU2006). http://www.spec.org/cpu2006/CINT2006.
[98] M. Stoodley and V. Sundaresan. Automatically reducing repetitive synchronization
with a just-in-time compiler for Java. In Proceedings of the 3rd International Sympo-
sium on Code Generation and Optimization, pages 27–36, March 2005.
[99] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream Processors: Improving
both performance and fault tolerance. In Proceedings of the 9th International Con-
ference on Architectural Support for Programming Languages and Operating Systems,
pages 257–268, November 2000.
[100] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing
on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on
Computer Architecture, pages 392–403, June 1995.
[101] R. Uhlig, R. Fishtein, O. Gershon, I. Hirsh, and H. Wang. SoftSDV: A presilicon soft-
ware development environment for the IA-64 architecture. Intel Technology Journal,
3(4):1–14, 1999.
[102] J. Ševčík and D. Aspinall. On validity of program transformations in the Java memory
model. In Proceedings of the 22nd European Conference on Object-Oriented Programming,
pages 27–51, July 2008.
[103] D. W. Wall. Predicting program behavior using real or estimated profiles. In Proceed-
ings of the ACM SIGPLAN 1991 Conference on Programming Language Design and
Implementation, pages 59–70, June 1991.
[104] J. Whaley. Partial method compilation using dynamic profile information. In Pro-
ceedings of the 16th Annual Conference on Object-Oriented Programming, Systems,
Languages, and Applications, pages 166–179, October 2001.
[105] Y. Wu, M. Breternitz, J. Quek, O. Etzion, and J. Fang. The accuracy of initial predic-
tion in two-phase dynamic binary translators. In Proceedings of the 2nd International
Symposium on Code Generation and Optimization, pages 227–238, March 2004.
[106] W. A. Wulf. Compilers and computer architecture. Computer, 14(7):41–47, 1981.
[107] C. Zilles. Master/Slave Speculative Parallelization and Approximate Code. PhD thesis,
Computer Sciences Department, University of Wisconsin at Madison, 2002.
[108] C. Zilles and N. Neelakantam. Reactive techniques for controlling software specu-
lation. In Proceedings of the 3rd International Symposium on Code Generation and
Optimization, pages 305–316, March 2005.
