Reference idempotency analysis: A framework for optimizing speculative execution by Kim, Seon Wook et al.
Appears in the Proceedings of the SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2001
Reference Idempotency Analysis: A Framework for
Optimizing Speculative Execution
Seon Wook Kim1, Chong-Liang Ooi1, Rudolf Eigenmann1,
Babak Falsafi2, T. N. Vijaykumar1
1School of Electrical and Computer Engineering, Purdue University†
2Department of Electrical and Computer Engineering, Carnegie Mellon University
mux@ecn.purdue.edu, http://www.ece.purdue.edu/~mux
ABSTRACT
Recent proposals for multithreaded architectures allow
threads with unknown dependences to execute speculatively
in parallel. These architectures use hardware speculative
storage to buffer uncertain data, track data dependences
and roll back incorrect executions. Because all memory
references access the speculative storage, current proposals
implement this storage using small memory structures for
fast access. The limited capacity of the speculative storage
causes considerable performance loss due to speculative storage
overflow whenever a thread's speculative state exceeds
the storage capacity. Larger threads exacerbate the overflow
problem but are preferable to smaller threads, as larger
threads uncover more parallelism.
In this paper, we discover a new program property called
memory reference idempotency. Idempotent references need
not be tracked in the speculative storage, and instead can di-
rectly access non-speculative storage (i.e., the conventional
memory hierarchy). Thus, we reduce the demand for speculative
storage space. We define a formal framework for reference
idempotency and present a novel compiler-assisted
speculative execution model. We prove the necessary and
sufficient conditions for reference idempotency using our
model. We present a compiler algorithm to label idempotent
memory references for the hardware. Experimental results
show that for our benchmarks, over 60% of the references in
non-parallelizable program sections are idempotent.
1. INTRODUCTION
Multithreaded and multiprocessor architectures are
emerging as attractive candidates for future high-performance
single-chip computers. As in shared-memory
multiprocessors, some of these single-chip architectures (e.g.,
the IBM Power 4) support the conventional parallel execution
models in which a programmer or compiler partitions
the program into distinct parallel threads. Unfortunately,
many programs include code sections that have dependences
unknown at compile time and are therefore not entirely
parallelizable [2, 5]. While runtime data dependence tests can
parallelize certain unanalyzable code sections [8, 4], these
tests are not applicable to general program patterns and/or
incur high software overhead.
† Seon Wook Kim is now with Intel Corp., Champaign, IL.
This work was supported in part by NSF grant #9974976-EIA.
PPOPP’01, June 18-20, 2001, Snowbird, Utah, USA.
Copyright 2001 ACM 1-58113-346-4/01/0006 ...$5.00.
Alternatively, recent proposals for multithreaded archi-
tectures (e.g., the Sun Microsystems' MAJC [11], and the
Multiplex [7], Multiscalar [9], Stampede [10], and Hydra [6]
chip multiprocessors) employ speculative execution to allow
threads with unknown dependences to execute speculatively
in parallel. To guarantee correct execution and to verify
speculation, these architectures provide hardware structures,
which we refer to as the speculative storage, to temporarily
maintain a thread's memory state. Speculative
storage records the data a thread produces/consumes and
the memory reference information necessary to track dependences
across the threads. Once speculation is verified, a
thread's produced data is transferred from speculative storage
to the conventional memory hierarchy, which we refer
to as the non-speculative storage.
A key bottleneck of speculative multithreaded architec-
tures is the limited capacity of the speculative storage. The
proposed systems require the hardware to track and enforce
dependence on all memory references in speculative storage
without regard to an application's memory access patterns
and actual data dependences. To eliminate adverse effects
on memory system performance, current proposals also use
small (i.e., kilobytes of) hardware structures to allow for fast
memory reference and dependence tracking [9, 3]. Unfortunately,
when a thread fills up its speculative storage, execution
halts until speculation is resolved, significantly reducing
parallelism in execution and performance. To avoid speculative
storage overflow, current proposals often execute small
thread sizes and granularities (e.g., inner loop iterations),
potentially limiting the opportunity and degree of extracted
parallelism [7].
In this paper, we discover a new program property, called
memory reference idempotency, which allows a large num-
ber of memory references to avoid dependence tracking and
placement in speculative storage. Idempotent references can
be placed in the non-speculative storage directly, reducing
the likelihood of speculative storage overflow and increasing
the opportunity for extracting higher degrees of parallelism.
Idempotent references span a large spectrum of memory ref-
erences including not only references to read-only and pri-
vate variables, but also certain references to shared variables
with cross-thread dependences.
Reference idempotency can be explained by the insight
that, in speculative execution, incorrect values are created
due to dependence violations, and propagated through sub-
sequent computation. Essentially, idempotent references are
those that never violate dependences, although they may
propagate incorrect values. Because only a speculative ref-
erence can originate an incorrect value, hardware speculative
mechanisms are guaranteed to correct this value eventually
and re-propagate it through subsequent computation. Idem-
potent references never violate dependences, and therefore
need not be tracked in speculative storage.
We focus our experiments in this paper on code sections
that are not fully analyzable by the compiler. In a recent
paper [7], we presented an architecture that unifies
both conventional and speculative multithreaded execution
on a single chip. Our architecture allows code sections that
are parallelizable to execute as conventional parallel pro-
grams (using non-speculative storage) without any specula-
tion overhead. The present paper targets code sections that
require hardware speculation support to execute in parallel.
The key contributions of this paper are:
• We define a formal framework for reference idempotency
to alleviate speculative execution overhead.
• We present a novel compiler-assisted speculative execution
model, in which the compiler communicates
idempotent references to the architecture.
• We prove the necessary and sufficient conditions for
reference idempotency using our model.
• We present a compiler algorithm to label idempotent
memory references, so that the hardware can place the
references directly in the non-speculative storage.
• We show results that, for our benchmarks, over 60%
of the references in non-parallelizable code sections are
idempotent.
This paper is organized as follows. Next, we present an
introductory example of hardware-only speculative execution
and idempotent references. Section 2 formally defines
and verifies the hardware-only model. Section 3 presents the
formal definition and proof of correctness of the compiler-assisted
speculative execution model. Section 4 introduces
reference idempotency, proves the necessary and sufficient
conditions for reference idempotency, and describes a compiler
algorithm for idempotency analysis. Section 5 shows
experimental results on the frequency of idempotent refer-
ences. Section 6 presents conclusions.
An Introductory Example
Current proposals for speculative multithreaded proces-
sors assume a hardware-only speculative execution (HOSE)
model. In HOSE, the software assumes sequential execution
semantics and sees the usual program state (i.e., the val-
ues of all program variables) in the memory system. The
hardware, which we call the speculation engine, selects program
segments identified by the compiler and executes them
speculatively in parallel. Segments can range in size from a
single instruction to entire subroutines.
Consider the program in Figure 1. The program is split
into two segments that are executed speculatively in parallel
by a two-processor system. Segment 2 follows segment 1 in
sequential program order and therefore all cross-segment dependences
must be satisfied in that order while the segments
are executing. The program has several read references to
variable B, a data dependence across the two segments in-
volving variable A, and a write and read reference to variable
C in segment 2.
A typical speculative execution scenario in HOSE is illus-
trated in Figure 1 (a). The system executes the two seg-
ments in parallel while keeping all data values produced or
referenced in speculative storage. The data values remain
in speculative storage until the speculation is verified and
all dependences are satisfied, hence the execution is known
to be correct. Upon verifying speculation, the data values
in the speculative storage are transferred or "committed" to
non-speculative storage. To track and enforce dependences
in program order, in addition to the data values, the specu-
lative storage also keeps information about every reference
type and the order in which references are made.
In the example shown, because the two program segments
execute concurrently, upon the write reference to A in seg-
ment 1, the processor may see that a later program-order
read reference to A by segment 2 has already happened. This
is a dependence violation, to which the system reacts by
aborting and re-starting segment 2. Since all accessed data
values have been buffered in the speculative storage, the restart
simply clears all the buffered references corresponding
to segment 2.
Figure 1 (b) illustrates several examples of idempotent
references, i.e., references that do not require buffering
in speculative storage and that can directly access
non-speculative storage. First, the compiler can identify all
references to variable B to be idempotent because B is a read-only
variable and as such, does not have any data dependences.
Second, the first write reference to A in segment 1 is idempotent
because there are no previous program-order references
to A in the segment. To enforce dependences, the write
reference, however, does look through speculative storage to
check for data dependence violations by segment 2's
references to A (i.e., the read reference), hence the sink must
remain in speculative storage. The actual value of the write
reference resides in non-speculative storage, without occupying
any space in speculative storage. Third, variable C
in this example is private to segment 2 (i.e., there are
no dependences across segments on this variable) and all
references to it are idempotent. Although segment 2 may
re-execute due to incorrect speculation, the write reference
to C always occurs first whenever the segment is re-executed.
Hence, even if an incorrect value were written initially, the
value of C will be corrected in the final execution of the
segment.
2. HARDWARE-ONLY SPECULATION
2.1 The HOSE Model
In the following, we formally define the structure of the
software and the execution model of hardware-only speculative
execution. We show that the execution produces the
same answer as a sequential program.
Definition 1 (Program Structure). A program is
structured into one or several regions, which are sub-
structured into several segments. A region has a single entry
and exit. A segment has a single entry, but may have mul-


























Figure 1: Basic idea of labeling idempotent references.
(a) In hardware-only speculative execution,
all data is placed in the speculative storage. (b)
Idempotent references can go directly to the non-speculative
storage (conventional memory hierarchy).
older segment would execute before a younger segment in a
sequential execution of the program. All older segments are
referred to as ancestors.
In this definition, segments represent speculative units.
These can be individual instructions in fine-grain or entire
subroutines in large-grain speculative execution models. For
HOSE the entire program is a single region. Multiple regions
will be important for the compiler-assisted speculative execution
model, introduced in Section 3.
Definition 2 (HOSE Mechanism). Hardware-Only
Speculative Execution is an execution mechanism for
programs given in Definition 1 with the following properties:
1. Overall Execution: Regions execute sequentially with
respect to other regions. Segments can be executed
speculatively in parallel with other segments within the
same region, that is, they may be started in an order
that is different from the sequential order and they may
execute concurrently. Internally, segments execute se-
quentially and perform memory references in program
order.
2. Segment Execution and Roll-Backs: Speculative
parallel execution of segments may violate data and
control dependences, resulting in incorrect values
generated and incorrect control paths taken. The spec-
ulation engine detects these violations (see Property 5)
and rolls back incorrect segments. Upon a roll-back,
all data generated by the segment are discarded (see
Property 4). This process may repeat several times.
3. Final Execution: A correct, final execution follows all
incorrect executions of a segment. The final execution
satisfies all cross-segment flow and control dependences.
If the segment was incorrectly started due to
misspeculation, the final execution may execute a different
segment or it may be empty.
4. Data Access: Each segment has its own speculative
storage. It is empty at the beginning of each segment's
execution and after each roll-back. During the execution
of a segment, all data references go to the speculative
storage. They do not affect the non-speculative
storage until the segment is committed (see Property
6). If a read reference accesses a location not yet
present in the speculative storage, then the value is
fetched from the youngest ancestor that contains a
value for this location, or from non-speculative storage
if no ancestor contains that location. A write reference
affects only the segment's own speculative storage.
5. Dependence Tracking: In addition to the actual data
values, the speculative storage contains access infor-
mation (time and type of reference), which allows the
speculation engine to track dependences. If a write reference
detects that a read reference to the same storage
location by a younger segment has prematurely happened,
then a data-dependence (flow dependence) violation
has occurred. If, at the completion of a segment,
the speculation engine detects that the successor
segment is different from the speculatively chosen
one, then a control dependence violation has occurred.
The speculation engine reacts to both violations
by rolling back all younger segments currently in
execution. Cross-segment anti and output dependences
are satisfied because the segments have separate speculative
storage (Property 4), which are committed in
sequential order (Property 6).
6. Segment Commit: When the oldest segment in execu-
tion has completed all instructions, it is ready to com-
mit (i.e., conceptually move) its speculative storage to
the non-speculative storage. A segment cannot commit
until all older segments have committed. Note that
only the values generated by the segment's final execution
are committed.
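The data-access, dependence-tracking, and commit properties above can be sketched as a small software model. This is only an illustrative sketch under simplifying assumptions, not the hardware design: the class `HoseRegion` and its method names are hypothetical, access information is reduced to a per-segment set of locations read from outside the segment, and a violation rolls back every younger segment.

```python
class HoseRegion:
    """Toy model of one region under HOSE Properties 4-6 (illustrative only)."""

    def __init__(self, num_segments, memory=None):
        self.memory = dict(memory or {})                   # non-speculative storage
        self.spec = [{} for _ in range(num_segments)]      # values written per segment
        self.reads = [set() for _ in range(num_segments)]  # locations read from outside
        self.committed = 0                                 # segments committed so far

    def read(self, seg, loc):
        """Speculative read (Property 4)."""
        if loc in self.spec[seg]:
            return self.spec[seg][loc]
        self.reads[seg].add(loc)  # access information kept for dependence tracking
        for anc in range(seg - 1, self.committed - 1, -1):  # youngest ancestor first
            if loc in self.spec[anc]:
                return self.spec[anc][loc]
        return self.memory.get(loc)

    def write(self, seg, loc, value):
        """Speculative write; returns the list of segments rolled back (Property 5)."""
        self.spec[seg][loc] = value
        younger = range(seg + 1, len(self.spec))
        if any(loc in self.reads[s] for s in younger):  # premature read detected
            for s in younger:
                self.rollback(s)
            return list(younger)
        return []

    def rollback(self, seg):
        """Discard all speculative state of a segment (Properties 2 and 4)."""
        self.spec[seg].clear()
        self.reads[seg].clear()

    def commit(self, seg):
        """Move a segment's values to non-speculative storage (Property 6)."""
        assert seg == self.committed, "segments commit in sequential order"
        self.memory.update(self.spec[seg])
        self.committed += 1


# Segment index 1 reads A too early; the write to A by segment index 0 rolls it back.
region = HoseRegion(2, {"A": 0, "B": 5})
stale = region.read(1, "A")
rolled_back = region.write(0, "A", 7)
region.write(1, "C", region.read(1, "A") + 1)  # re-execution sees the new A
region.commit(0)
region.commit(1)
```

In this model every reference, including reads of the read-only variable B, would occupy an entry in `spec` or `reads`, which is the storage pressure that idempotent references are meant to relieve.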
2.2 Correctness of HOSE
Definition 3 (Correct Program Execution). A
region R is executed correctly if, given that all older regions
are executed correctly, at their last reference in R all live
program variables in the non-speculative storage have the
same value as in a sequential execution of the program.
Similarly, a segment $R_x$ in region R is executed correctly
if, given that all older segments in R and all regions older
than R are executed correctly, at their last reference in $R_x$
all live program variables in the non-speculative storage have
the same value as in a sequential execution of the program.
Essentially, our definition of correctness says that any execution
must have the same effect on the memory as a sequential
program. It does not specify any order for the program
execution. For example, two accesses to different memory
locations could be reordered. Also, two write accesses to the
same location may be reordered if we can prevent the effect
of the originally first access (e.g., by renaming the reference
to a different location).
Lemma 1 (Correctness of HOSE). A region R and
all segments in R are executed correctly under the hardware-only
speculative execution (HOSE) model.
Proof. Let $R_1, \ldots, R_n$ be the segments in region R from
oldest to youngest. To satisfy the correctness criterion of
Definition 3, we need to show that, for any segment $R_x$,
$1 \le x \le n$, the values of the program variables generated and
committed to non-speculative storage locations at the end
of $R_x$ are correct. That is, they are the same as the values
of these variables in a sequential program execution. HOSE
discards all values generated by segments that are being
rolled back. The only values to be committed are those
generated in final executions. We show correctness of these
values in two steps. We show that (1) the final executions
of all segments produce correct values in the speculative
storage and (2) these values are committed correctly.
(1) Internally, segments execute sequentially (HOSE
Property 1). All data references use the segment's own speculative
storage, and this storage cannot be modified by any
other segment (HOSE Property 4). Hence, the segment executes
and produces the same final values as a sequential
program if we can show the following: upon a read reference,
a data value that is not yet present in the segment's speculative
storage is fetched in a way that yields the same value
as in a sequential program. This follows from two facts. (a)
By HOSE Property 5, all cross-segment time orderings are
satisfied. (b) By HOSE Property 4, values for locations not
yet present in the speculative storage are consumed either
from the youngest ancestor that contains a value for this
variable (which is the producer of this value in a sequential
execution) or from non-speculative storage (where they are
correct, given the preceding region's correct execution).
(2) By HOSE Property 6, all segments commit in sequen-
tial order. Therefore, all segments' values will be seen in
the non-speculative storage correctly after all ancestors have
placed their values.
Correctness of R follows directly from the correctness of the
segments in R. All segments write the same values as they
would in a sequential execution. Since the segments are com-
mitted in sequential order (HOSE Property 6), these values
appear in the non-speculative storage in the same order as
in a sequential execution of the region.
3. COMPILER-ASSISTED SPECULATION
3.1 The CASE Model
The Compiler-Assisted Speculative Execution (CASE)
model is an extension of the HOSE model introduced in Section 2.
The software structure is the same as in Definition 1.
As in HOSE, segments are the primary units of speculative
execution. Regions are important for enclosing code sections
in which certain data attributes hold (e.g., read-only,
or dependence-free). The execution mechanism is defined as
follows:
Definition 4 (CASE Mechanism). Compiler-
Assisted Speculative Execution is a program execution
mechanism with the basic properties of HOSE as given
in Definition 2. Certain data references are labeled as
idempotent, and the rest of the references are speculative
with the same properties as in HOSE. Idempotent references
have the following properties:
Idempotent read references completely bypass the spec-
ulative storage and instead directly reference the non-
speculative storage. Unlike speculative reads, idempo-
tent reads do not leave any information in the specu-
lative storage.
Idempotent write references enforce data dependences
by first checking in the speculative storage (for prematurely
executed speculative loads), much like speculative
write references. However, then their value is directly
placed in the non-speculative storage and no informa-
tion about the references is kept in the speculative stor-
age.
From the definition of idempotent references, we see that
the references access non-speculative storage, and do not
occupy any space in speculative storage. Thus, idempotent
references help reduce the likelihood of speculative storage
overflow, as motivated in Section 1. Note that for brevity we use
the term idempotency for both a program property (the referenced
variable is correct despite repeated accesses caused
by roll-back and re-execution) and a hardware property (the
memory reference accesses non-speculative storage).
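To make the difference between speculative and idempotent references concrete, the following sketch models the CASE access rules in software. It is a simplified illustration under stated assumptions, not the proposed hardware: the class `CaseRegion`, its method names, and the `occupancy` measure are hypothetical, and commit is omitted because only storage occupancy is of interest here.

```python
class CaseRegion:
    """Toy model of CASE data access (Definition 4); illustrative only."""

    def __init__(self, num_segments, memory=None):
        self.memory = dict(memory or {})                   # non-speculative storage
        self.spec = [{} for _ in range(num_segments)]      # speculative values
        self.reads = [set() for _ in range(num_segments)]  # tracked speculative reads

    def _roll_back_premature_readers(self, seg, loc):
        younger = range(seg + 1, len(self.spec))
        if any(loc in self.reads[s] for s in younger):
            for s in younger:
                self.spec[s].clear()
                self.reads[s].clear()
            return list(younger)
        return []

    def spec_read(self, seg, loc):
        """Speculative read: tracked, served by own storage, ancestors, or memory."""
        if loc in self.spec[seg]:
            return self.spec[seg][loc]
        self.reads[seg].add(loc)
        for anc in range(seg - 1, -1, -1):
            if loc in self.spec[anc]:
                return self.spec[anc][loc]
        return self.memory.get(loc)

    def spec_write(self, seg, loc, value):
        """Speculative write: buffered in the segment's speculative storage."""
        self.spec[seg][loc] = value
        return self._roll_back_premature_readers(seg, loc)

    def idem_read(self, seg, loc):
        """Idempotent read: bypasses speculative storage and leaves no trace."""
        return self.memory.get(loc)

    def idem_write(self, seg, loc, value):
        """Idempotent write: checks for premature reads, then writes memory."""
        rolled_back = self._roll_back_premature_readers(seg, loc)
        self.memory[loc] = value
        return rolled_back

    def occupancy(self, seg):
        """Number of speculative-storage entries held by a segment."""
        return len(self.spec[seg]) + len(self.reads[seg])


# Labels follow the introductory example: only the read of A stays speculative.
region = CaseRegion(2, {"A": 0, "B": 5})
region.spec_read(1, "A")                    # premature speculative read
rolled_back = region.idem_write(0, "A", 7)  # detects it, rolls back segment index 1
a = region.spec_read(1, "A")                # re-execution
region.idem_write(1, "C", a + region.idem_read(1, "B"))
c = region.idem_read(1, "C")
```

With these labels, the older segment holds no speculative entries and the younger one holds a single entry for its read of A, whereas the hardware-only model would track every reference.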
3.2 Correctness of CASE
In CASE, programs contain both speculative and idem-
potent references. The hardware guarantees correctness for
speculative references, like HOSE. But idempotent refer-
ences are not tracked by speculative storage, and therefore
correctness of idempotent references is no longer guaranteed
by the hardware. Instead, the compiler must correctly label
idempotent references to guarantee correct execution. To
that end, the following labeling conditions must be satisfied
by references to be identied as idempotent.
LC1: A write reference $\hat{a}$ to a variable x in region R is
correctly labeled as idempotent only if it is guaranteed
that x will eventually be correct, i.e., an incorrect
x must be overwritten with the correct value before
it is consumed by the final execution of any segment.
(Speculative read references may obtain incorrect values
in a misspeculated execution and propagate the incorrect
values to idempotent write references. Because
such incorrect idempotent writes are not discarded but
written to non-speculative storage, LC1 ensures that
the write reference is eventually corrected irrespective
of the control flow path taken.)
LC2: A reference $\hat{a}$ is correctly labeled as idempotent only if,
in the final execution, all time orderings as dictated by
data dependences involving $\hat{a}$ are satisfied. (An idempotent
reference does not keep any information about
the reference in speculative storage. Because the hardware
can no longer enforce data dependences for the
reference, LC2 ensures that the reference is ordered
correctly with respect to its dependences.)
LC3: A write reference to x is correctly labeled as idem-
potent only if any subsequent read reference to x con-
sumes this value from non-speculative storage. A read
reference is correctly labeled as idempotent only if it
obtains from non-speculative storage the value generated
there by any prior write reference. (If one of the
source and sink of a flow dependence is a speculative
reference and the other is an idempotent reference, the
source and sink access different storages. LC3 ensures
that the sink reference correctly obtains the value produced
by the source reference.)
Recall that in speculative execution, incorrect values are
created due to control and data dependence violations, and
propagated through subsequent computation. LC1 ensures
that even if such values are written to the non-speculative
storage, they do not persist and are eventually overwrit-
ten with the correct values. LC2 and LC3 together imply
that idempotent references never violate data dependences.
Thus, LC1, LC2, and LC3 together guarantee that idempo-
tent references do not generate incorrect values on their own.
However, LC1, LC2, and LC3 do not disallow an idempo-
tent reference from propagating an incorrect value. Because
only a speculative reference can originate such an incorrect
value, hardware speculative mechanisms are guaranteed to
correct this value eventually and re-propagate it through
subsequent computation. Therefore, idempotent references
need not be tracked in speculative storage, even though they
may write temporarily incorrect values.
Lemma 2 (Correctness of CASE). CASE is correct
under Definition 3 if and only if all idempotent references
satisfy the three labeling conditions LC1 through LC3.
Proof. The values in non-speculative storage generated
by a segment are those committed from speculative storage
and those written by idempotent references.
We proceed in two steps, (1) we show that the values
generated by idempotent references are correct and (2) we
show that the values generated in speculative storage and
then committed are correct.
(1) Because idempotent references directly write into non-speculative
storage, we must consider all segment executions.
This contrasts with HOSE, which considers only final
executions. By LC1, a segment produces correct values
for all variables that incur idempotent references. That is,
even though a variable x may be written in a misspeculated
segment, LC1 guarantees that, in all final executions of segments
referencing x, this variable is correct.
(2) The only difference to the values produced in speculative
storage in HOSE is that instructions may consume input
values through read references involved in idempotent references.
These values are correct as follows. By LC2, all time
orderings as dictated by data dependences are satisfied. By
LC3 values are correctly communicated even if the producer
or the consumer is an idempotent reference. Therefore, the
values committed from speculative storage are correct for
the same reason as they are correct in HOSE.
The proof of the converse is simple, and is only sketched.
The descriptions of the three labeling criteria make obvious
that if any of them is not satisfied then an incorrect value
is produced, consumed, or a data dependence may be vi-
olated. Hence, correct program execution would no longer
be guaranteed. Note, that in some degenerate case even a
misspeculated value might not lead to an incorrect execu-
tion (e.g., if the program multiplies this value with 0). Our
proof does not account for such cases, as do the correctness
proofs of most program transformations.
The proof of correctness of a region is identical to the one
for HOSE.
4. REFERENCE IDEMPOTENCY
In this section we present the methods and algorithms for
identifying variable references in a program that have the
idempotency property. Idempotent references do not need
to be buffered in speculative storage. To prove correctness
we will show that such references satisfy the labeling criteria
LC1 through LC3.
Theorems 1 and 2 give the necessary and sufficient conditions
for a data reference to be labeled as idempotent. The
following lemmas will be necessary to prove the two theorems.
In addition, the term re-occurring first write will be
used. It is defined as follows.
Definition 5 (Re-occurring First Write (RFW)).
A write reference to the variable x in segment $R_i$ is a RFW
if, following any roll-back of $R_i$, a live x is guaranteed to be
written before the end of the enclosing region R without a
preceding read reference.
Note that by Definition 2 the segment $R_i$ may get rolled
back to the end of any ancestor segment in R. Hence, a
write reference to x in $R_i$ is a RFW if x is first written on
all possible control flow paths p, where p is a path from the
end of any ancestor of $R_i$ to the end of R. If x is not live
then its value is irrelevant for correctness by Definition 3.
The RFW attribute will allow us to identify a write refer-
ence as idempotent, even though it may write an incorrect
value as a result of data or control misspeculation. The
RFW attribute ensures that a write reference to the same
variable x is guaranteed to re-occur with a correct value be-
fore the end of R. Hence, x's value will be corrected. It
further guarantees that no read reference can consume the
incorrect value before the correct value is written. Note, that
determining the RFW attribute is non-trivial in the presence
of pointers and subscripted subscripts. The compiler must
guarantee that the references to x in the misspeculated and
in all possible final executions go to the same storage location.
We will present a compiler algorithm in Section 4.2.
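The path-based reading of the RFW attribute can be illustrated with a small check. This is a minimal sketch and not the compiler algorithm of Section 4.2: it assumes the relevant control flow paths have already been enumerated, that each path is a list of hypothetical `(kind, variable)` pairs, that the variable is live, and that distinct names denote distinct storage locations (no aliasing).

```python
def first_reference_is_write(path, var):
    """True if the first reference to var on this path is a write."""
    for kind, name in path:
        if name == var:
            return kind == "write"
    return False  # var is never written on this path, so no write re-occurs


def is_rfw(paths, var):
    """Conservative RFW test over the given paths.

    paths: every control flow path from the end of any ancestor segment to
    the end of the region, each a list of (kind, variable) pairs with kind
    in {"read", "write"}.
    """
    return all(first_reference_is_write(path, var) for path in paths)


# Hypothetical paths through a segment that writes C before reading it,
# and that reads A before any write to A.
paths = [
    [("read", "A"), ("read", "B"), ("write", "C"), ("read", "C")],
    [("read", "A"), ("write", "C")],
]
c_is_rfw = is_rfw(paths, "C")
a_is_rfw = is_rfw(paths, "A")
```

Treating a path without any reference to the variable as failing the test is a deliberately conservative choice of this sketch, since the definition requires the write to be guaranteed.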
In the following presentation, we consider one region at a
time. This is sufficient, as regions execute sequentially with
respect to each other. Data dependences (may-dependences)
are assumed to have been analyzed for the region on a ref-
erence by reference basis. Note, that this means that there
are only data dependences between references to the same
variable. Only intra-region dependences are considered. We
will show examples at the end of the subsection.
Lemma 3 (Cross-Segment Dependence Sink).
The sink of a cross-segment dependence must be labeled
speculative.
Proof. Assume the dependence sink can be labeled
idempotent. Suppose the dependence source executes af-
ter the sink. If the sink is a read reference, no information
about its access time is kept in speculative storage. Hence,
the hardware will not enforce the dependence per HOSE
Property 5. If the sink is a write reference, it directly writes
to the non-speculative storage, violating the dependence. In
both cases the labeling criterion LC2 is not satised, which
contradicts the assumption.
Lemma 4 (Independent Read). A read reference $\hat{a}$
that is not the sink of any dependence can be labeled idempotent.
Proof. LC1 does not apply to read references. Considering
LC2, suppose the reference $\hat{a}$ is involved in a dependence
with sink $\hat{b}$. Intra-segment dependences are always satisfied
because of the sequential execution of segments. A cross-segment
dependence is also satisfied because $\hat{b}$ is labeled
speculative per Lemma 3. This means that the value of $\hat{b}$ is
committed at the end of the final execution of the enclosing
segment, which happens after $\hat{a}$ (HOSE Properties 4 and 6).
Hence LC2 is satisfied. LC3 is not applicable because there
is no write reference preceding $\hat{a}$.
Lemma 5 (Independent RFW). A re-occurring first
write (RFW) that is not the sink of a cross-segment dependence
can be labeled idempotent.
Proof. LC1 is satisfied because the write reference is a
re-occurring first write. By Definition 5, even after a misspeculated
value is written, a new value is guaranteed to be
written prior to all reads in any final execution, hence the
value is corrected.
For LC2, intra-segment dependences are always satisfied.
For cross-segment dependences we consider two cases.
Case 1: the reference $\hat{a}$ is the source of a flow dependence
with sink $\hat{b}$. This dependence is enforced per Definition 4 as
long as the sink is speculative. This is the case by Lemma 3.
Case 2: there is an output dependence from $\hat{a}$ to $\hat{b}$. This
dependence is also satisfied. Since $\hat{b}$ is speculative, it will
be written to the non-speculative storage upon the commit
of the segment containing $\hat{b}$, which is after the reference $\hat{a}$
(HOSE Property 6). Hence LC2 is satisfied.
LC3 needs to be considered for the case of a flow dependence
from $\hat{a}$ to $\hat{b}$. By HOSE Property 4, the speculative
read reference $\hat{b}$ consumes the value from the non-speculative
storage location if no ancestor segment contains a speculative
value for this location. This is indeed the case because
$\hat{a}$ is not the sink of any other dependence, which means it is
the first reference to this variable in the region. Hence LC3
is satisfied as well.
Lemma 6 (Covered Read). A read reference b̂ that is dependent
on an idempotent RFW reference â within the same segment can
be labeled idempotent.
Proof. LC2 and LC3 need to be considered. For LC2, all
intra-segment dependences are satisfied because of the
sequential execution of segments. Write references are only
labeled idempotent with Lemma 5. Such references do not
depend on older segments, hence b̂ cannot be the sink of a
cross-segment dependence. On the other hand, b̂ can be the
source of a cross-segment dependence. Such a dependence is
satisfied; the proof is the same as in Lemma 4. Hence, LC2 is
satisfied. LC3 is also satisfied, because an idempotent b̂
correctly reads the value generated by an idempotent â in
non-speculative storage.
For completeness, the following simple lemma deals with
fully-independent regions.
Lemma 7 (Fully-Independent). All references of a region whose
segments do not carry any data dependences or control
dependences can be labeled idempotent.
Proof. A region without any data and control dependences
across segments is completely non-speculative. That is, all
segments are executed only in their correct, final form
without any violations of data and control dependences. The
execution will not perform roll-backs. Hence all shared
references happen exactly once in their final and correct
form. Labeling them as idempotent satisfies all three
labeling criteria trivially.
Lemmas 3 through 6 provide the basis for proving necessary
and sufficient conditions for idempotent read and write
references in segments that include dependences.
Theorem 1 (Idempotent Write). A write reference is idempotent
if and only if it is a re-occurring first write and it is not
the sink of a cross-segment dependence.
Theorem 2 (Idempotent Read). A read reference is idempotent
if and only if it is not the sink of any data dependence or
it is dependent on an idempotent write reference within the
same segment.
Proof (Idempotent Write). By Lemma 5, a RFW that is not the
sink of a cross-segment dependence can be labeled idempotent.
We prove the converse by contradiction. We show that a write
reference that is the sink of a cross-segment dependence or
is not a RFW cannot be labeled idempotent. By Lemma 3, a
cross-segment dependence sink cannot be labeled idempotent.
If a reference â to variable x is not a RFW then, after the
enclosing segment rolls back, execution can take a path that
does not write x. Hence, the incorrect value written by â
will persist.
Proof (Idempotent Read). By Lemma 4, a read reference that is
not the sink of a data dependence can be labeled idempotent.
By Lemma 6, a read can also be labeled idempotent if it is
dependent on an idempotent write reference within the same
segment.
We prove the converse by contradiction. We show that a read
reference cannot be labeled idempotent if it is dependent on
a source that is not an idempotent write reference within the
same segment. There are two cases: (1) the source is in a
different segment, and (2) the source within the same segment
is labeled speculative. By Lemma 3, a dependence sink cannot
be labeled idempotent in case 1. Case 2 directly violates LC3
because an idempotent read from x will not consume the value
written by a preceding speculative write reference to x.
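As a compact restatement, the two theorems can be written as logical equivalences. The predicate names below are shorthand introduced here for illustration and are not the paper's notation; the block assumes the amsmath package.

```latex
% Assumed shorthand (not the paper's notation):
%   \mathrm{dep}(a,b)   : some data dependence with source a and sink b
%   \mathrm{cross}(a,b) : a cross-segment dependence with source a and sink b
%   \mathrm{intra}(a,b) : an intra-segment dependence with source a and sink b
%   \mathrm{RFW}(w)     : w is a re-occurring first write (Definition 5)
\begin{align*}
\text{write } w:\quad \mathrm{idem}(w) &\iff
  \mathrm{RFW}(w) \,\land\, \neg\exists a.\ \mathrm{cross}(a,w) \\
\text{read } r:\quad \mathrm{idem}(r) &\iff
  \bigl(\neg\exists a.\ \mathrm{dep}(a,r)\bigr) \,\lor\,
  \bigl(\exists w.\ \mathrm{intra}(w,r) \land \mathrm{idem}(w)\bigr)
\end{align*}
```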
Examples
Figure 2 shows several examples. By Definition 5,
RFW(R_0) = {C, N, J}, RFW(R_1) = {E, J}, RFW(R_2) = {A},
RFW(R_3) = {A}, and RFW(R_4) = {F}. The reference to B in R_2
is not a RFW, because the reference to B is not guaranteed to
happen after a possible roll-back of R_2. Similarly, the
reference to B in R_3 is not a RFW. The write reference to H
in R_4 [...] are not RFW because they are not guaranteed to
access the same address. All above RFW references are [...]
to J and F are not idempotent by Lemma 5 because they are the
sink of output and anti dependences from R_0. [...], and a
write reference to F in R_4 are speculative by Lemma 3
because they are sinks of cross-segment dependences. All
references to variable G in R, F in R_0 and the read of H in
R_4 are idempotent by Lemma 4 because they are independent
reads. The read references to N and C in R_0, A in R_3 and F
in R_4 are idempotent by Lemma 6 because they are covered
reads.
4.1 Discussion: Idempotency Categories
We can describe idempotent references in the form of the
following categories. The first category deals with the
simple case of program regions that can be detected as fully
parallel by a compiler.
Fully-independent: If there are no cross-segment data or
control dependences in a region, all references in the region
are idempotent. No individual access labeling would be
necessary for this category. No data needs to be placed in
speculative storage. Essentially this means that the region
can be run as in a conventional multiprocessor.
Figure 2: Example code with control and data dependence
graphs. The region R contains five segments, R_0, ..., R_4.
The next three categories are applicable to regions that have
data dependences.
Read-only: All references to read-only variables in a region
are idempotent. These references are not sinks of any depen-
dence. Note that, although very intuitive, the idempotency
property for read-only variables in partially-dependent code
sections is not trivial because of the interaction of idempo-
tent and speculative references.
Private: All references to segment-private data are idempo-
tent. This category is relevant for compilers that can recog-
nize private variables and express this information such that
the architecture or runtime system can provide a private
address space for each segment. Alternatively, the compiler
can apply data renaming, with the result that the references
will fall into the next category. Important in our analysis
are the facts that private variables do not have any cross-
segment dependences and are thus not live at the end of the
segment.
Shared-dependent: The fact that there are data-
dependent references that do not need to be placed in spec-
ulative storage is most remarkable. Essentially, only sinks of
cross-segment data dependences need to be labeled specula-
tive. Within a segment, all references following a write that
is guaranteed to happen, and happen again after a misspec-
ulation, can be labeled idempotent. It is important to note
that these write references may produce temporarily incor-
rect values in the non-speculative storage. The idempotency
property guarantees that correctness is still ensured.
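The claim that a temporarily incorrect value in non-speculative storage is harmless can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the paper's simulator: non-speculative memory is a plain dictionary, the hypothetical parameter `upstream` stands for a value obtained through a tracked speculative read, and roll-back is modeled as simply calling the segment again.

```python
# Toy model: non-speculative memory is a plain dict; all names are illustrative.
def run_segment(memory, upstream):
    """Segment body in which x is written before any read on every path (an RFW)."""
    memory["x"] = upstream * 10   # idempotent write: bypasses speculative storage
    return memory["x"] + 1        # covered read of x within the same segment


memory = {"x": 0}

# Speculative execution with a stale upstream value: memory briefly holds a wrong x.
speculative_result = run_segment(memory, upstream=1)
wrong_x = memory["x"]

# The misspeculation is detected; the segment rolls back and re-executes.
# No undo of x is needed because the re-occurring first write overwrites it.
final_result = run_segment(memory, upstream=2)

print(wrong_x, memory["x"], final_result)  # 10 20 21
```

The discarded speculative run leaves `x = 10` in memory, but the final execution rewrites `x` before any read can observe it, which is the behavior the RFW condition is meant to guarantee.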
4.2 Compiler Algorithms
4.2.1 Prerequisite Analysis
The prerequisites for our algorithm are as follows. The
compiler identifies regions and segments. The algorithm for
defining regions and segments is outside the scope of this
paper. In our evaluation, regions are loops and segments are
loop iterations. Furthermore, we assume that a
state-of-the-art compiler (e.g., [1, 12]) has analyzed
read-only and private variables, and also the data
dependences of every reference in each region. Data
dependences are may-dependences. The following algorithm
determines RFW references.
4.2.2 Analyzing Re-occurring First Write References
Recall that by Definition 5 a write reference to a variable x
in segment R_i is a RFW if x is first written on all possible
control flow paths p, where p is a path from the end of any
ancestor of R_i to the end of R. The basic goal of the
following graph algorithm is to mark all successors of a
segment as non-RFW to a given variable x, if any successor
has an exposed read reference to x.
Algorithm 1. Identifying re-occurring first write references
in a region R:
Let G be a graph with nodes V representing segments R_i and
edges E representing control paths between segments. An extra
node v_exit is placed at the exit of R. A node has the
following two attributes for each variable: color (Black,
White) and reference type (Write, Read, Null). For a given
node v and variable x, either all write references to x in v
are RFW (White) or none is RFW (Black). The algorithm finds
this property.
1. Initially, for each node v and for each variable x, set
   the color to White and set the reference type as follows:
   • If x is defined on all paths through segment v without
     exposed read², then set the reference type to Write.
   • Else, if there is an exposed read of x, then set Read.
   • Else (no reference to x in v), set Null.
   Set v_exit for x as Read if x is live-out of R, and Null
   otherwise.
2. Search G breadth-first. At each v, if it is White:
   • If v reaches any node marked as Read through zero or
     more Null nodes, then recursively color all White
     successors of v Black.
3. All write references to x in White nodes are re-occurring
   first writes.
Note that the algorithm relies on the compiler's ability to
identify references that go to the same address. Two
references â and b̂ cannot be assumed to access the same
variable if there is any execution scenario in which the
address may be different. Examples of such scenarios are
subscripted array subscripts or variables whose address
itself may be speculative. Both the programming language used
and the architecture may give guarantees that certain
addresses are always correct. In our initial implementation
we use Fortran programs, whose variable addresses are
statically known. In addition, we rely on our architecture's
ability to guarantee that loop variables are non-speculative
(this is implemented through proper synchronization).
Therefore, our compiler can assume that all array references
with affine subscript expressions have correct addresses and
are thus candidate RFWs.
² We refer to standard compiler techniques for analyzing [...]
Figure 3: Example of a re-occurring first write analysis.
(a) Segment control flow graph. (b) Graph marked for variable
x, (c) variable y, and (d) variable z.
Figure 3 shows an example of the analysis performed by
Algorithm 1. It shows colored graphs for the three variables
x, y, and z in the program region (a). In graph (b), the
write references to x in segments 6 and 7 are found not to
be RFW because there is an exposed read in segment 4.
Similarly, in (d), the write reference to z in segment 6 is not
RFW because segment 2 has an exposed read. In (c), all
write references to y are RFW.
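The coloring step of Algorithm 1 can be sketched as follows for a single variable. This is one reading of the algorithm, not the authors' implementation: the function name, the dictionary-based graph encoding, and the small example graph (which is not the graph of Figure 3) are assumptions made for illustration, and the reference types of step 1 are taken as given input.

```python
from collections import deque

WRITE, READ, NULL = "Write", "Read", "Null"


def find_white_nodes(succ, ref_type, entry):
    """Return the White nodes for one variable x.

    succ:     dict node -> list of successors (segment control flow graph,
              including the extra exit node)
    ref_type: dict node -> WRITE / READ / NULL for x (step 1 of Algorithm 1)
    entry:    first segment of the region
    """
    black = set()

    def reaches_exposed_read(v):
        # Follow the successors of v, passing only through Null nodes.
        stack, seen = list(succ.get(v, [])), set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            if ref_type[n] == READ:
                return True
            if ref_type[n] == NULL:
                stack.extend(succ.get(n, []))
            # A Write node covers x and ends the search along this path.
        return False

    def blacken_successors(v):
        # Recursively color all White successors of v Black.
        stack = list(succ.get(v, []))
        while stack:
            n = stack.pop()
            if n not in black:
                black.add(n)
                stack.extend(succ.get(n, []))

    # Step 2: breadth-first search over G.
    queue, visited = deque([entry]), {entry}
    while queue:
        v = queue.popleft()
        if v not in black and reaches_exposed_read(v):
            blacken_successors(v)
        for n in succ.get(v, []):
            if n not in visited:
                visited.add(n)
                queue.append(n)

    return {v for v in ref_type if v not in black}


# Hypothetical region: segment 0 branches to 1 or 2, which both lead to 3.
succ = {0: [1, 2], 1: [3], 2: [3], 3: ["exit"], "exit": []}

# x is written in 0 and 1, untouched in 2, and has an exposed read in 3.
x_types = {0: WRITE, 1: WRITE, 2: NULL, 3: READ, "exit": NULL}
# y is written in 0, 1 and 2, and has an exposed read in 3.
y_types = {0: WRITE, 1: WRITE, 2: WRITE, 3: READ, "exit": NULL}

white_x = find_white_nodes(succ, x_types, entry=0)
white_y = find_white_nodes(succ, y_types, entry=0)
```

In this example the write to x in segment 1 is not a RFW, because the path through segment 2 reaches the exposed read in segment 3 without writing x. For y, both branches write before the read, so the writes in segments 0, 1 and 2 all remain White.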
4.2.3 Labeling Idempotent References
Given a region R, the following algorithm labels all
idempotent references.
Algorithm 2. Identifying idempotent references in region R.
The following information is assumed to have been analyzed in
R: read-only and private variables, reference-by-reference
data dependences of shared variables, and the segment control
flow graph.
Initially, all references are labeled speculative.
1. Analyze RFW references with Algorithm 1.
2. If R is fully independent with respect to cross-segment
   data and control dependences, then
   • Label all references in R as idempotent.
3. Otherwise (dependent region),
   • Label all read-only references as idempotent.
   • Label all private references as idempotent.
   • For each RFW reference, if the reference is not the sink
     of a cross-segment dependence, label the reference
     idempotent.
   • For each read reference, label the reference idempotent
     if
     - the reference is not the sink of any dependence, OR
     - the reference is the sink of an intra-segment
       dependence AND the source is labeled idempotent.
do k = nz-1, 2, -1
do j = ny-1, 2, -1
do i = nx-1, 2, -1
do m = 1, 5
......
do l = 1, 5
[...]
do m = 1, 5
[...]
Figure 4: Idempotent and speculative references in
APPLU BUTS DO1.
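The labeling steps of Algorithm 2 can be sketched as below. This is a minimal illustration, not the Multiplex compiler's code: the `Ref` record, the `(source, sink, scope)` dependence triples, and the example references are hypothetical. Two conservative choices are assumptions of this sketch: a read that is also a cross-segment sink stays speculative (following Lemma 3), and a covered read requires all of its intra-segment sources to be idempotent.

```python
from dataclasses import dataclass


@dataclass
class Ref:
    name: str
    kind: str                   # "read" or "write"
    read_only: bool = False     # variable is read-only in the region
    private: bool = False       # variable is segment-private
    rfw: bool = False           # result of Algorithm 1 (writes only)
    label: str = "speculative"  # initially, all references are speculative


def label_region(refs, deps, fully_independent=False):
    """deps: list of (source_name, sink_name, scope), scope "intra" or "cross"."""
    if fully_independent:                       # step 2
        for r in refs:
            r.label = "idempotent"
        return {r.name: r.label for r in refs}

    by_name = {r.name: r for r in refs}
    cross_sinks = {snk for _, snk, scope in deps if scope == "cross"}
    all_sinks = {snk for _, snk, _ in deps}

    # Step 3: read-only, private, and independent RFW references.
    for r in refs:
        if r.read_only or r.private:
            r.label = "idempotent"
        elif r.kind == "write" and r.rfw and r.name not in cross_sinks:
            r.label = "idempotent"

    # Step 3, last bullet: independent reads and covered reads.
    for r in refs:
        if r.kind != "read" or r.label == "idempotent":
            continue
        if r.name not in all_sinks:
            r.label = "idempotent"
        elif r.name not in cross_sinks:
            sources = [by_name[s] for s, snk, scope in deps
                       if snk == r.name and scope == "intra"]
            if sources and all(s.label == "idempotent" for s in sources):
                r.label = "idempotent"
    return {r.name: r.label for r in refs}


# Hypothetical region with three variables A, B, C.
refs = [
    Ref("w_A", "write", rfw=True),
    Ref("r_A", "read"),
    Ref("w_B", "write"),                 # not a RFW
    Ref("r_B_same", "read"),
    Ref("r_B_next", "read"),
    Ref("r_C", "read", read_only=True),
]
deps = [
    ("w_A", "r_A", "intra"),
    ("w_B", "r_B_same", "intra"),
    ("w_B", "r_B_next", "cross"),
]
labels = label_region(refs, deps)
```

Here `w_A`, `r_A` and `r_C` end up idempotent, while the write to B and both reads of B remain speculative: the write is not a RFW, one read depends on that speculative write, and the other is a cross-segment sink.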
Example
Figure 4 shows a serial loop, BUTS DO1 in APPLU, which
includes many nested small loops. The outermost loop is
defined as our region and is parallelized speculatively by
selecting each iteration (k) as a segment. The loop contains
only one shared variable, v. Both references to v in
statement S2 are dependent on the three references in S1. All
of these three references are dependence sources only and
hence can be labeled as idempotent by Theorem 2. Since the
references in S2 are dependence sinks they must remain
speculative.
5. EVALUATION
We evaluate opportunities for labeling idempotent references
in code sections that the compiler is unable to parallelize
automatically. Recall that speculative storage overflow is a
critical limitation of the currently proposed speculative
architectures, and these overheads increase further as
advanced compilers identify large speculative threads. We
have quantified these overheads in prior work [7]. Here we
present performance results of applying our labeling
algorithm on a selected group of segments.
For our experiments we have developed a preliminary version
of our algorithm on top of the Multiplex compiler. Multiplex
is a proposal for a chip multiprocessor supporting both
conventional and speculative execution of threads (i.e.,
segments) [7]. The Multiplex compiler integrates Polaris [2]
and the Multiscalar compiler [13] into a single
infrastructure for generating conventional and speculative
threaded code.
We execute the code on a cycle-accurate simulator of
Multiplex. In the rest of this paper, we assume Multiplex
chips with four processors. Multiplex provides per-processor
speculative storage, which is backed up by a full memory
hierarchy serving as non-speculative storage. The compiler
communicates reference idempotency labels for memory
instructions to the hardware, to allow bypassing the
speculative storage and placing the data directly in the
non-speculative storage. As in conventional multiprocessors,
the runtime system allocates a private stack for every
segment. The compiler transforms and places the private
variables into these per-segment private stacks.
Figure 5: Fraction of idempotent references in code sections
that cannot be detected as parallel. It shows idempotent
references in the categories read-only, private, and
shared-dependent.
5.1 Labeling Idempotent References
We first evaluate the opportunity for labeling idempotent
references in all of our benchmarks. Next, we present
performance results on removing idempotent references from
speculative storage for selected groups of non-parallelizable
loops. Each group of loops exhibits a large opportunity for
labeling a specific category of idempotent references. The
loops are representative of the rest of the
non-parallelizable code sections.
A key question in this work is what fraction of the total
references our algorithm can identify as idempotent in
non-parallelizable code sections. To answer this question we
have extracted from all the benchmarks the code sections that
could not be automatically parallelized by our compiler.
Recall that the parallelizable code sections are
"uninteresting" from the point of view of this paper, because
all data references can be marked idempotent (shown in Lemma
7). Figure 5 shows the fraction of total references in
non-parallelizable code sections that our analysis detects as
idempotent. In 7 out of the 13 benchmarks more than 60% of
these references are idempotent. The largest fraction is
read-only idempotent variables. In four programs there is a
substantial fraction of private idempotent variables. Most
important is that the category of shared-dependent idempotent
variables is a significant fraction in 5 benchmarks. Note
that even a single reference that causes speculative storage
overflow will lead to large delays, essentially serializing
the execution. Therefore, even small increases in the number
of references that do not access speculative storage can lead
to significant performance gains. The benchmarks with few or
no idempotent variables fall into two opposite categories.
SWIM, TRFD and ARC2D are fully-parallel programs with no
unanalyzable variables, while FPPPP is known to be highly
unstructured and difficult to analyze.
Figure 6 shows a selection of loops, MAIN DO80 in TOMCATV,
PARMVR DO120 and PARMVR DO140 in WAVE5, that have idempotent
references in the read-only category. The figure shows the
distribution of the read-only references with respect to the
total memory references under CASE. The figure also shows
loop speedups relative to a uniprocessor. Labeling the
idempotent references in these loops reduces the pressure on
the speculative storage, allowing for significant reductions
in execution time.
Figure 6: Examples of loops for idempotency category
read-only references: (a) ratio of read-only references to
total memory references, and (b) loop speedups before (HOSE)
and after (CASE) reference labeling.
Figure 7 shows the fraction of references and speedups under
CASE in two loops, DRCFT DO2 in TURB3D and SETBV DO2 in
APPLU, that have idempotent references in the private
category. In SETBV DO2, a significant fraction (about half)
of the total memory references are private. Our
implementation sets up per-segment stacks for private
variables to use. The stack setup adds a substantial number
of instructions. Nevertheless, there are small speedup gains
under CASE as compared to HOSE.
Figure 7: Examples of loops for idempotency category private
references: (a) ratio of private references to total memory
references, and (b) loop speedups before (HOSE) and after
(CASE) reference labeling.
Figure 8 shows loops that include idempotent references in
the shared-dependent category. The figure shows idempotent
references as a fraction of the total number of references,
and the corresponding loop speedups under HOSE and CASE. The
ability to remove shared-dependent references from
speculative storage is one of the most advanced qualities of
the presented compiler techniques. The fact that there are
program sections with more than 50% idempotent
shared-dependent references is an important result. Note that
these loops are not independent and thus cannot be
parallelized by our current compiler technology.
Figure 8: Examples of loops for idempotency category
shared-dependent references: (a) ratio of idempotent
references to the total memory references, and (b) loop
speedups before (HOSE) and after (CASE) reference labeling.
Figure 9 includes all references in fully-independent regions
in three major loops of the program MGRID. This category
applies to do loops with fully-independent iterations. CASE
improves the performance significantly over HOSE, which
incurs significant speculative storage overflow. Figure 9(a)
shows that read-only references represent the major category
of idempotent references in RESID DO600 and PSINV DO600, and
write shared references represent the major category in
ZRAN3 DO400.
Figure 9: Examples of loops for idempotency category
fully-independent regions: (a) ratio of idempotent
references, and (b) loop speedups before (HOSE) and after
(CASE) reference labeling.
6. CONCLUSIONS
In this paper we have discovered a new program property
called memory reference idempotency. Idempotent references
can access non-speculative storage directly and, hence,
alleviate speculative storage overflow, a critical limitation
of speculative execution. A key feature of idempotent
references is that, during speculative parallel execution,
they do not cause any data-dependence violations on their
own. References with this property are guaranteed to be
eventually corrected, though the references may write
temporarily incorrect values during speculative execution.
Because speculative execution mechanisms eventually correct
and propagate these incorrect values, idempotent references
need not be tracked in speculative storage. By filtering out
idempotent references, we reduce the demand for speculative
storage space. This reduction is especially important for
large threads, allowing them to uncover more parallelism
without incurring substantial overflow.
We defined a formal framework for idempotency and presented a
novel compiler-assisted speculative execution model. We
proved the necessary and sufficient conditions for reference
idempotency under our model. We also presented a compiler
algorithm to label idempotent memory references for the
hardware. Experimental results show that, for our benchmarks,
over 60% of the references in non-parallelizable code
sections are idempotent.
Reference idempotency enables compilers to deal with code
sections that are unanalyzable by classical compiler
techniques. The current generation of compilers is most
capable of optimizing program sections for which the absence
of data and control dependences can be proven. While such
analysis applies to many regular programs, a large number of
programs are irregular in nature. Reference idempotency
applies to these very programs. With architectural support,
in the form of the proposed compiler-assisted speculative
execution model, it enables new optimizations where
conventional compiler techniques face hard limits.
7. REFERENCES
[1] U. Banerjee. Dependence Analysis for Supercomputing.
Kluwer, Boston, MA, 1988.
[2] W. Blume, R. Doallo, R. Eigenmann, J. Grout,
J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek,
B. Pottenger, L. Rauchwerger, and P. Tu. Parallel
programming with Polaris. IEEE Computer, pages 78-82,
Dec. 1996.
[3] S. Gopal, T. Vijaykumar, J. E. Smith, and G. S. Sohi.
Speculative versioning cache. In The Fourth IEEE Symposium
on High-Performance Computer Architecture (HPCA-4), pages
195-205, Jan. 1998.
[4] M. Gupta. Techniques for speculative run-time
parallelization of loops. In International Conference on
Supercomputing (ICS'98), 1998.
[5] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R.
Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing
multiprocessor performance with the SUIF compiler. IEEE
Computer, pages 84-89, Dec. 1996.
[6] L. Hammond, M. Willey, and K. Olukotun. Data speculation
support for a chip multiprocessor. The Eighth ACM Conference
on Architectural Support for Programming Languages and
Operating Systems (ASPLOS'98), October 1998.
[7] C.-L. Ooi, S. W. Kim, R. Eigenmann, B. Falsafi, and
T. N. Vijaykumar. Multiplex: Unifying conventional and
speculative thread-level parallelism on a chip
multiprocessor. In Proceedings of the 2001 International
Conference on Supercomputing, 2001.
[8] L. Rauchwerger and D. Padua. The LRPD test: Speculative
run-time parallelization of loops with privatization and
reduction parallelization. Proceedings of the SIGPLAN'95
Conference on Programming Language Design and
Implementation, June 1995.
[9] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar.
Multiscalar processors. The 22nd International Symposium on
Computer Architecture (ISCA-22), pages 414-425, June 1995.
[10] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry.
A scalable approach to thread-level speculation. The 27th
Annual International Symposium on Computer Architecture
(ISCA-27), June 2000.
[11] Sun Microsystems. MAJC architecture tutorial. White
Paper, September 1999.
[12] P. Tu and D. Padua. Automatic array privatization. In
U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua,
editors, Proceedings of Sixth Workshop on Languages and
Compilers for Parallel Computing, Portland, OR. Lecture
Notes in Computer Science, volume 768, pages 500-521,
August 1993.
[13] T. N. Vijaykumar and G. S. Sohi. Task selection for a
multiscalar processor. The 31st International Symposium on
Microarchitecture (MICRO-31), December 1998.
