Louvre: Lightweight Ordering Using Versioning for Release Consistency by Kumar, Pranith et al.
Louvre: Lightweight Ordering Using Versioning for
Release Consistency
Pranith Kumar, Prasun Gera, Hyojong Kim, Hyesoon Kim
Georgia Institute of Technology
School of Computer Science, College of Computing
Atlanta, GA, USA
{pranith, prasun.gera, hyojong.kim, hyesoon.kim}@gatech.edu
Abstract—Fence instructions are fundamental primitives that
ensure consistency in a weakly consistent shared memory multi-
core processor. The execution cost of these instructions is sig-
nificant and adds a non-trivial overhead to parallel programs.
In a naïve architecture implementation, we track the ordering
constraints imposed by a fence by its entry in the reorder
buffer and its execution overhead entails stalling the processor’s
pipeline until the store buffer is drained and also conservatively
invalidating speculative loads. These actions create a cascading
effect of increased overhead on the execution of the following
instructions in the program. We find these actions to be overly
restrictive and that they can be further relaxed thereby allowing
aggressive optimizations.
The current work proposes a lightweight mechanism in which
we assign ordering tags, called versions, to load and store in-
structions when they reside in the load/store queues and the write
buffer. The version assigned to a memory access allows us to fully
exploit the relaxation allowed by the weak consistency model and
restricts its execution in such a way that the ordering constraints
by the model are satisfied. We utilize the information captured
through the assigned versions to reduce stalls caused by waiting
for the store buffer to drain and to avoid unnecessary squashing
of speculative loads, thereby minimizing the re-execution penalty.
This method is particularly effective for the release consistency
model that employs uni-directional fence instructions. We show
that this mechanism reduces the ordering instruction latency by
39.6% and improves program performance by 11% on average
over the baseline implementation.
I. INTRODUCTION
As core counts have continued to increase over the last
decade, the shared memory programming model has become
an attractive choice for high performance, scientific, and
general purpose parallel applications. The use of this model
is facilitated by high-level libraries and frameworks such as
Cilk [1] and OpenMP [2]. The shared memory programming
model necessitates a consistent view of the memory for pro-
grammers to reason about the accuracy of parallel programs.
This consistent view is dictated by the memory consistency
model of the processor.
Higher level languages standardize on a consistency model
to maintain portability across architectures. Languages such as
C and C++ have standardized on data-race-free memory mod-
els [3] that guarantee that accesses to synchronization objects
(or atomics) are sequentially consistent. Sequential consistency
for such accesses is ensured by the compiler, which generates
appropriate ordering instructions. These instructions indicate
an ordering point in the instruction stream, and when it is
encountered by the processor, certain classes of optimizations
on the following memory accesses are disabled to ensure con-
sistency [4]–[6]. Optimizations such as speculative execution,
prefetch, and buffering of store instructions are implemented
by modern processors to reduce the cost of memory accesses
[7]–[9]. The fence instructions restrict optimizations which
exploit instruction-level parallelism, disable speculative exe-
cution of post-fence loads and stores, and introduce stalls to
drain the store buffer.
Researchers have proposed techniques that allow a proces-
sor to employ hardware optimizations when ordering instruc-
tions are scheduled, thereby reducing their execution over-
head [10], [11]. One such technique utilizes check-pointing
along with aggressive speculative execution of post-fence
loads and stores. The check-pointed state is tracked and
updated so that it enables roll back in case of ordering vi-
olations [12]–[14]. Another promising approach is to increase
the granularity of enforcing consistency, thereby amortizing
the overhead of ordering instructions. This technique has been
applied both in hardware [15] and at an algorithmic level [16],
[17]. Further, to increase the scope for speculative execution
based on the observation that ordering violations rarely occur
at runtime [12], studies have proposed techniques that identify
scenarios in which violations are likely to occur and handle
them in special ways. These techniques use annotations that
tag ordering instructions [18], hardware extensions [13], or
run-time information [19]. However, all these techniques were
proposed in the context of stronger memory consistency mod-
els such as that of x86 and little work has explored reducing
the overhead of ordering instructions, specifically in weak
memory model architectures.
Release consistency (RCsc) [20] is one of the most widely
implemented weak memory models in recent processors. Variants
of this model have been adopted in architectures such as
Itanium, ARM64, and PowerPC, and it is also being considered
for adoption in RISC-V processors [21]. In this memory
model, memory accesses are classified as ordinary, acquire,
or release operations. This classification is done using special
instructions in the ISA. A load-acquire operation places
ordering constraints on the memory accesses that follow it,
and a store-release operation places ordering constraints on
the memory accesses that precede it. This is in contrast to a
full fence, which places ordering constraints on the memory
accesses in both directions. These acquire and release
operations allow more leeway for possible re-ordering of the
memory accesses, thereby allowing a processor implementation
to reduce the overhead of ensuring memory consistency.
arXiv:1710.10746v3 [cs.AR] 5 Dec 2017
We propose LOUVRE, a low-overhead hardware extension
that ensures memory ordering using versioning and helps in
reducing the overhead of fence instructions in a processor
implementing the release consistency model. Our proposed
mechanism adds ordering tag fields to the load/store queues
and the write buffer. In the proposed technique, we assign
versions to memory instructions when they are issued. We
update the version being assigned upon issuing an ordering
instruction. Instead of tracking the ordering constraints by the
entry of the fence instruction in the reorder buffer (ROB),
we track them in a separate FIFO queue. This allows us to
retire the ordering instruction from the ROB while preserving
the constraint that needs to be enforced. Using the version as-
signed to memory accesses, we identify and rectify any consis-
tency violations at run-time without the need for maintaining
any global state or inter-core communication. In contrast to
versioning, traditional implementations of ordering instructions
stall the pipeline to drain the store buffer, which ensures order
among the stores in the buffer and the memory accesses following
the ordering instruction. In the proposed versioning mechanism,
we issue and execute instructions that follow the ordering
instruction without draining the store buffer. Since we version
the stores in the store buffer, we enforce the correct ordering
among them utilizing these versions. Furthermore, the ver-
sioning mechanism is efficient for one-way fence instructions,
since it allows the processor to speculatively issue and execute
instructions following a store-release instruction before the
store-release instruction itself has finished execution.
Although we did not find any hardware optimizations in
the literature specifically targeting one-way fence instructions,
the work which is closest in spirit to our current work is
zFence [22]. As in the current work, the authors of the zFence
paper focus on reducing the stalls caused by waiting for the
store buffer to drain. They achieve this by introducing a new
coherence state which indicates exclusive permission to a
cache line without the cache line being available. When all
the stores waiting in the store buffer acquire this exclusive
permission, the store buffer need not be drained since no
other processor can modify those cache lines, thereby reducing
the fence overhead. In contrast, the current proposal requires
only minor hardware changes to avoid the store buffer drain
stall.
The main contributions of our work are as follows:
• We present a detailed micro-architectural description for
implementing RCSC memory model semantics.
• We propose a hardware mechanism based on versioning
loads and stores that reduces the cost of ordering instruc-
tions in weak memory model architectures.
• We propose a low overhead mechanism that optimizes
the execution of uni-directional ordering instructions by
taking advantage of re-orderings allowed in a weak
memory model architecture.
[Figure 1: bar chart; y-axis from 0% to 100%; series: two-way and one-way fences.]
Fig. 1. Distribution of fences in ARM64
II. BACKGROUND AND MOTIVATION
A. Fence instructions
Ordering instructions can be classified as either uni-
directional (also known as one-way fences) or bi-directional
(also known as full fences). Instructions mfence and dmb are
bi-directional in that they do not allow any re-ordering across
them. The restrictions placed by such fence instructions on
memory accesses conflict with the optimization techniques
used to hide the latency of long memory accesses in modern
processors. Efforts to reduce this overhead have led to the
design of the release consistency model [20]. This model pre-
scribes acquire and release semantics, which are achieved by
uni-directional fence instructions. These instructions allow re-
ordering of memory accesses across them in only one direction
and have either acquire or release semantics. Slight variations
of this conventional consistency model are implemented in
architectures such as Itanium and ARM64. On ARM64, a
load-acquire instruction (also called a synchronizing load,
ldar) enforces an order between its load and all the following
memory accesses. A store-release instruction (also called
a synchronizing store, stlr) enforces an order between its
store and all the preceding memory accesses. A store-release
also ensures that the store is multi-copy atomic, that is, the
store is made visible to all other processors simultaneously [23].
Additionally, a fence instruction is sequentially consistent with
respect to all other fence instructions.
Since one-way fences remove constraints on possible mem-
ory access re-ordering, generating them is more efficient than
generating full fences whenever possible. Figure 1 shows
the frequency distribution of one-way and full fences in
benchmarks from the Parsec [24] benchmark suite compiled
for ARM64 processors. We observe that one-way fences, on
average, are twice as frequent as full fences in most of the
benchmarks. Exploiting reduced constraints enforced by a one-
way fence necessitates an efficient micro-architectural imple-
mentation. However, optimizations specific to one-way fences
are limited. As a result, a uni-directional fence that is treated
as a full fence limits the potential performance improvement.
To the best of our knowledge, our proposal is the first to
present a detailed optimal micro-architectural implementation
of one-way fences.
B. Definitions
In the rest of the paper, we use the following terminology.
An instruction is fetched (FE), decoded, and then issued (IS).
Once it is issued (possibly out-of-order), it can be executed
(EX) when all its dependencies are met. Once an instruction
reaches the head of the reorder buffer (ROB) and finishes
execution, it can commit or retire (RE). Upon retiring, a store
instruction moves from the ROB to the store buffer (SB), at
which point it is said to be a post-commit store. A store will
complete (CO) once it updates the value in the cache and
drains from the store buffer. A load is said to be satisfied
(SA) after it loads the value into a register. While a load
can retire only after it is satisfied, a store can retire before
it completes. An instruction remains in a speculative state as
long as it resides in the ROB. The life cycle of a memory
access instruction is shown in Figure 2. Please note that a
satisfied load in a speculative state can be squashed (SQ) and
re-executed along with all of its dependent instructions.
[Figure 2: state diagram over the stages FE, IS, EX, SA, RE, and CO, with edges labeled LOAD, STORE, and SQUASH.]
Fig. 2. Life-cycle of a memory access instruction.
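The life cycle defined above can be read as a small state machine. The following is a minimal sketch in Python, assuming the stage abbreviations from the text; the table and function names are hypothetical, and the squashing of instructions that depend on a squashed load is not modeled.

```python
# Hypothetical encoding of the stages FE, IS, EX, SA, RE, CO, and SQ.
LOAD_TRANSITIONS = {
    "FE": {"IS"},
    "IS": {"EX"},
    "EX": {"SA"},
    "SA": {"RE", "SQ"},  # a satisfied, still speculative load may retire or be squashed
    "SQ": {"EX"},        # a squashed load is re-executed
    "RE": set(),
}
STORE_TRANSITIONS = {
    "FE": {"IS"},
    "IS": {"EX"},
    "EX": {"RE"},        # a store can retire before it completes
    "RE": {"CO"},        # post-commit store drains from the store buffer
    "CO": set(),
}


def is_valid_trace(kind, trace):
    """Return True if `trace` is a legal stage sequence for a 'load' or 'store'."""
    table = LOAD_TRANSITIONS if kind == "load" else STORE_TRANSITIONS
    if not trace or trace[0] != "FE":
        return False
    return all(nxt in table.get(cur, set()) for cur, nxt in zip(trace, trace[1:]))
```

For instance, a load that is satisfied, squashed, and re-executed follows FE, IS, EX, SA, SQ, EX, SA, RE, whereas a load cannot move from EX to RE without first being satisfied.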
Program order (po) is the order in which instructions
are fetched and decoded by the same core. We use po-
before/previous to refer to instructions that are earlier in
program order and po-after/later/following to refer to instruc-
tions that are later in program order relative to the current
instruction.
C. RCSC Semantic Rules
The following semantic rules describe the constraints imposed
by RCSC [20]. Here $X$ and $Y$ are two instructions, and
$<_p$ and $<_m$ are the program order and the global memory
order respectively. $L(a)$ and $S(a)$ are a load from and a
store to address $a$ respectively. $FF(X)$ is true if $X$ is a
full fence, $LDAR(X)$ is true if $X$ is a load-acquire fence,
$STRL(X)$ is true if $X$ is a store-release fence, and
$LOAD(X)$ or $STORE(X)$ is true if $X$ is a load or store
instruction respectively.
RC1. $X <_p Y,\ FF(X) \implies X <_m Y$
RC2. $X <_p Y,\ FF(Y) \implies X <_m Y$
RC3. $X <_p Y,\ LDAR(X) \implies X <_m Y$
RC4. $X <_p Y,\ STRL(Y) \implies X <_m Y$
RC5. $X <_p Y,\ X, Y \in (FF, LDAR, STRL) \implies X <_m Y$
RC6. $\text{Value of } L(a) = \text{Value of } \mathrm{Max}_{<_m} \{ S(a) \mid S(a) <_p L(a) \text{ or } S(a) <_m L(a) \}$
In RCSC , unlike in sequential consistency (SC), program
order does not imply global order for ordinary memory ac-
cesses. A synchronizing load (load-acquire), a synchronizing
store (store-release), and a fence instruction enforce order
among the memory accesses.
RC1 and RC2 specify that memory accesses that are po-
before a full fence are ordered before the po-later memory
accesses.
RC3 specifies that the load of a load-acquire fence should
be ordered before any po-later memory accesses.
RC4 specifies that the store of a store-release fence should
be ordered after all the po-previous memory accesses.
RC5 specifies that all the fence instructions should be
globally ordered with respect to each other in program order.
Finally, RC6 specifies that the value of a load is satisfied
from the latest po-before store to the same address pending in
the store buffer. If there is no such store, the load is satisfied
from the latest global store to the same address.
We later show in Section IV-A that LOUVRE satisfies the
RCSC semantics listed here.
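To make rules RC1 to RC5 concrete, the following minimal Python sketch lists the pairs of instructions that the rules directly force into global order for a short program. The instruction kind strings ("ldr", "str", "ldar", "stlr", "ff") and the function name are hypothetical; RC6 (the value rule) and pairs that follow only by transitivity of the global order are not computed.

```python
from itertools import combinations

FENCES = {"ff", "ldar", "stlr"}


def required_order(program):
    """program: instruction kinds in program order.

    Returns the set of index pairs (i, j), i < j, for which RC1-RC5
    directly require instruction i to be globally ordered before j.
    """
    pairs = set()
    for i, j in combinations(range(len(program)), 2):
        x, y = program[i], program[j]
        if (
            x == "ff"                            # RC1: X is a full fence
            or y == "ff"                         # RC2: Y is a full fence
            or x == "ldar"                       # RC3: X is a load-acquire
            or y == "stlr"                       # RC4: Y is a store-release
            or (x in FENCES and y in FENCES)     # RC5: both are fences
        ):
            pairs.add((i, j))
    return pairs
```

For the sequence str, ldar, str this returns only the pair (1, 2), and for str, stlr, str only the pair (0, 1), which reflects the one-way nature of the two fences.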
D. Conventional Microarchitecture
In this section, we describe three actions that a conventional
processor performs to avoid ordering violations and to imple-
ment the semantics of RCSC model listed in the previous
section: retirement of a fence, draining the store buffer, and
squashing a speculative load on cache line invalidation. The
processor enforces RC1 for a full fence by draining the store
buffer before retiring the full fence instruction. This ensures
that post-fence stores do not complete before pre-fence stores
from the unordered store buffer. The processor enforces RC2
by squashing any speculative post-fence load whose cache line
receives an invalidation request before the full fence
is retired. The one-way fences act as two individual halves
of the full fence. The processor performs the same actions
separately to enforce rules RC3, RC4, and RC5. To enforce
RC3 for a synchronizing load, the processor needs to squash
any speculative po-later loads upon invalidation as long as
the synchronizing load is in-flight. To enforce RC4 for a
synchronizing store, the processor needs to drain the store
buffer before retiring the synchronizing store instruction. This
prevents the synchronizing store from completing ahead of the
stores in the store buffer. These actions also ensure that the
rule RC5 is followed. Finally, for rule RC6, the value of a
load is satisfied from the store buffer if a store to the same
address is pending in it; otherwise, a request is sent to the
memory hierarchy. We summarize the constraints that need to
be satisfied for each type of ordering instruction to be retired
in Table I.
TABLE I
CONSTRAINTS ENFORCED BY DIFFERENT ORDERING INSTRUCTIONS IN A
CONVENTIONAL RELEASE CONSISTENCY ARCHITECTURE
Fence type       Store buffer (before retirement of fence)   On cache invalidation
load-acquire     -                                            Squash load
store-release    Drain                                        -
full             Drain                                        Squash load
[Figure 3: execution timeline on Core #0 for the instructions below while I1 waits for the store buffer to drain. Core #1 executes STR W0, [A2]; the resulting invalidation of A2 (1) causes an unnecessary squash (2) and re-execution of I3, I4, and I5.]
I1: STLR W0, [A0]
I2: LDR  W2, [A1]      (cache miss)
I3: LDR  W3, [A2]      (cache hit)
I4: ADD  W4, W3, #1
I5: STR  W4, [A3]
Fig. 3. Inefficient load speculation in a weak consistency architecture.
III. MOTIVATION
In this section we discuss the ordering constraints imposed
by different fence instructions and identify two major sources
of overhead caused by these constraints.
A. Constraints of ordering instructions
A full fence and a load-acquire fence enforce the constraint
that the processor squash all speculative loads satisfied from a
cache line when that cache line is invalidated. This constraint
prevents speculative loads from using stale values, which
would otherwise not be used if the loads were executed non-
speculatively. A full fence and a store-release fence enforce
the constraint that the processor drain the store buffer of all
po-previous stores before retiring the fence. In the case of
a full fence, this constraint prevents a po-later store from
completing ahead of a po-previous store. In the case of a
store-release fence, this constraint prevents the synchronizing
store from completing ahead of any po-previous stores. The
inefficiencies caused by these constraints are discussed in the
following sections.
B. Inefficient load speculation
A load-store queue (LSQ) keeps track of all in-flight load
and store instructions in program order. To detect ordering
violations, the LSQ snoops the addresses of the incoming
cache invalidations. If the address of a cache invalidation
matches that of an in-flight satisfied load, then the load might
have read a stale value causing an ordering violation. In such
cases, the processor squashes the load and all later instructions,
and re-executes them.
The first inefficiency we identify is the unnecessary squashing
of speculative loads to prevent ordering violations in
a processor implementing release consistency. While loads
require squashing in stricter memory model processors, it
is not always necessary in weaker model processors. In
particular, the processor needs to squash a speculative load
only in the presence of an in-flight fence enforcing order on
the load. Since ordering instructions retire after they enforce
the required constraints, a speculative load no longer violates
ordering when all preceding fence instructions are retired. In
such cases we do not need to squash it on an invalidation.
These scenarios arise in the presence of all three kinds of
fences. We detail one such scenario with a synchronizing store
next. An example with a synchronizing load is illustrated in
Section V-F.
In Figure 3, instruction I1 is a store-release fence. Instruc-
tions I2 and I3 are independent loads that access addresses
A1 and A2 respectively. Instructions I4 and I5 are dependent
on the value loaded by instruction I3. The illustration shows a
situation in which I2 has a cache miss but I3 has a cache hit.
Because of the cache hit, the speculative load I3 is satisfied
earlier than the speculative load I2. The synchronizing store
I1 is waiting for the store buffer to drain to prevent possible
reordering with any pending stores residing in the store buffer.
At this point, marked (1) in Figure 3, an invalidation request
for the cache line containing A2 arrives (say, from core 1).
Note that this is after the speculative load I3 is satisfied.
Since I3 is still speculative, the processor squashes it and its
dependent instructions, I4 and I5, and re-executes them. However, since
there is no fence enforcing order on I3, the value loaded by
it does not violate any ordering constraints. So squashing and
re-execution of instructions I3, I4, and I5 is not necessary.
We identify such situations in LOUVRE and avoid the
squashing and re-execution penalty. This is explained later in
Section V-E.
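The observation in this section can be illustrated with a small Python sketch that contrasts the conservative policy (squash every satisfied speculative load that matches an invalidation) with a fence-aware policy (squash only when an in-flight po-earlier load-acquire or full fence enforces order on the load). This is only an illustration under stated assumptions, not the mechanism of Section V-E: the `InFlight` record, the `loads_to_squash` function, and the integer cache-line identifiers are hypothetical, and dependent instructions squashed along with a load are omitted.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    kind: str               # "ldr", "str", "ldar", "stlr", or "ff"
    line: int               # cache line touched (ignored for "ff")
    satisfied: bool = False


def loads_to_squash(rob, invalidated_line, fence_aware):
    """rob: in-flight instructions in program order, oldest first.

    Returns the indices of loads squashed when `invalidated_line`
    receives an invalidation.
    """
    victims = []
    ordering_fence_seen = False
    for i, ins in enumerate(rob):
        is_load = ins.kind in ("ldr", "ldar")
        if is_load and ins.satisfied and ins.line == invalidated_line:
            # Conservative policy: always squash. Fence-aware policy: squash
            # only if an older in-flight fence orders this load.
            if not fence_aware or ordering_fence_seen:
                victims.append(i)
        # A store-release places no constraint on po-later loads, so only
        # a load-acquire or a full fence sets the flag.
        if ins.kind in ("ldar", "ff"):
            ordering_fence_seen = True
    return victims
```

With the instructions of Figure 3 (I1 a store-release, I2 an unsatisfied load, I3 a satisfied load to the invalidated line), the conservative policy squashes I3 while the fence-aware policy squashes nothing.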
C. Inefficient store retirement
Typically, a store instruction that misses in the cache waits
in the ROB until the cache line is ready for an update. When
this store reaches the head of the ROB, instead of holding
up the following instructions waiting for the cache line, the
processor retires it and buffers it in the store buffer [7], [25]. A
store retires from the ROB and moves to the store buffer from
where it drains after completion [26], [27]. In architectures
implementing the TSO memory model, the store buffer is
ordered; the stores drain out from the buffer in FIFO order.
In weaker memory model architectures the store buffer can be
unordered, that is, the stores can drain out in any order. In our
experiments, we assume a weak memory model architecture
and an unordered store buffer. Additionally, the store buffer is
augmented with a CAM structure that tracks the destination
address and relative age of the store in the buffer. This is
necessary for store-to-load forwarding.
[Figure 4: pipeline timelines (FE, IS, EX, RE, CO) for S1: STR W0, [A0], S2: STLR W1, [A1], and S3: STR W3, [A2], where S1 has a cache miss. In (a), S2 incurs a fence delay while S1 waits in the store buffer. In (b), S1, S2, and S3 carry versions (v) 0, 1, and 0, which allows early retirement of S2 and S3 and early completion of S3.]
Fig. 4. Early retirement and completion of stores from an unordered store buffer (a) baseline and (b) optimal with versions (ver)
The second inefficiency we focus on is the stall of a full
fence or synchronizing store at the head of the ROB, while
waiting for the stores in the unordered store buffer to drain.
This stall prevents ordering violations in which post-fence
stores are ordered before pre-fence stores. Without this stall, a
processor, after retiring the fence, would retire post-fence stores
and place them in the store buffer along with pre-fence stores.
Since the store buffer is unordered, these post-fence stores can
drain before pre-fence stores resulting in an ordering violation.
The cost to drain the store buffer is the number of cycles it
takes to service all the cache misses of the pending stores in the
buffer. Stalling execution to drain the store buffer undermines
the whole reason for including a store buffer: to remove stores
from the critical path in the processor pipeline. Further, this
stall adds to the latency of all instructions that follow this
fence instruction in the ROB. If the ROB is full, this delay
may even propagate to the instruction fetch stage.
A synchronizing store allows memory accesses following
the instruction to be ordered before it. Figure 4 illustrates an
example scenario in which S1, S2, and S3 are independent
stores with S2 being a synchronizing store. Figure 4(a) shows
how, in a conventional implementation, S2 blocks store S3
from retiring from the ROB while waiting to drain the store
buffer containing store S1. Here store S1 has a cache miss, so
the cost of draining the store buffer is the latency of fetching
this cache line. Once store S1 completes, S2 and S3 can retire.
Since no ordering constraint is imposed by the synchronizing
store S2 on S3, it can complete earlier. Using versioning,
we retire stores S1, S2, and S3 once they reach the head of
ROB and place them in the store buffer. The assigned version
ensures that S2 completes only after S1 and that we place no
constraint on S3. As shown in Figure 4(b), versioning allows
store S3 to complete earlier than was possible in Figure 4(a).
Also, note that retiring S2 and S3 without any stall removes the
bottleneck on all following instructions.
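The constraint described for Figure 4(b) can be sketched as a drain-eligibility check on the store buffer. This is a minimal Python sketch, assuming that a pending store may complete only when no po-earlier pending store carries a lower version; the function names and the (name, version) tuple layout are hypothetical and do not describe the actual hardware comparison logic.

```python
def may_complete(store_buffer, idx):
    """store_buffer: pending (name, version) entries in program order."""
    _, version = store_buffer[idx]
    # Blocked only by an older pending store with a strictly lower version.
    return all(older_version >= version for _, older_version in store_buffer[:idx])


def completion_candidates(store_buffer):
    """Names of the stores that are currently allowed to complete."""
    return [
        name
        for i, (name, _) in enumerate(store_buffer)
        if may_complete(store_buffer, i)
    ]
```

With the versions from Figure 4(b), `completion_candidates([("S1", 0), ("S2", 1), ("S3", 0)])` yields S1 and S3: S3 may complete early, while S2 becomes eligible only once S1 has left the buffer.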
IV. LOUVRE
In this section we describe the semantics of LOUVRE and
explain how we employ the versioning rules to order loads
and stores.
A. Semantic Rules
This section details the versioning semantic rules (VSR)
that describe the global order enforced by versions. Here $X$
and $Y$ are two instructions; $v_x$ and $v_y$ are the versions
assigned to those instructions. $LDAR(X)$ is true if $X$ is a
load-acquire instruction. $L(a)$ and $S(a)$ are a load from and
a store to address $a$ respectively.
VSR1. $X <_p Y,\ v_x < v_y \implies X <_m Y$
VSR2. $X <_p Y,\ LDAR(X) \text{ and } v_x = v_y \implies X <_m Y$
VSR3. $\text{Value of } L(a) = \text{Value of } \mathrm{Max}_{<_m} \{ S(a) \mid S(a) <_p L(a) \text{ or } S(a) <_m L(a) \}$
The crux of LOUVRE is the rule VSR1. Intuitively, it states
that, if two memory accesses that are in program order have
increasing versions, then those accesses must appear in the same
order in the global memory order.
VSR2 states that in versioning, if X is po-before Y and
X is a load-acquire instruction, then it is ordered before Y,
even though the versions assigned to X and Y are the same.
We explain this rule in detail in Section IV-B2.
VSR3 states that the value of a load is satisfied from the
latest po-before store to the same address pending in the store
buffer. Otherwise the load is satisfied by the latest store to that
address from the memory system.
It is straightforward to see that versioning semantic rules
are equivalent to the semantic rules of RCSC listed in
Section II-C. VSR1 satisfies RCSC rules RC1, RC2, RC4, and
RC5. VSR2 satisfies RC3 whereas VSR3 is the same as
RC6.
B. Versioning Rules
We explain the rules used to assign versions to memory
accesses in this section. These rules describe how the versions
assigned to memory accesses differ based on the ordering that
needs to be enforced. Given two regular memory accesses a
and b that are issued from the same processor, va and vb are
the versions assigned to them. Program order (po) is the order
in which instructions are issued by the processor. $a <_p b$ implies
that a is before b in program order.
1) Full Fence: A full fence should prevent any reordering
across the fence (RC1 and RC2 listed in Section II-C). When
we issue a full fence instruction, we assign versions such that
all po-later memory accesses get a higher version than the
version assigned to all the po-previous memory accesses. This
ensures that we order all post-fence memory accesses after the
pre-fence memory accesses.
Versioning Rule 1: If a full fence separates two regular
memory accesses a and b, then the versions assigned to the
memory accesses increase as follows:
$a <_p \mathit{fence} <_p b \implies v_a < v_b$
2) Synchronizing Load: A synchronizing load prevents any
po-later memory accesses from being ordered before it (Rule
RC3 in Section II-C). It differs from a full fence in that the
processor enforces the ordering constraint only between the
synchronizing load and all po-later memory accesses and not
on po-previous memory accesses. However, if we increase the
version assigned to po-later memory accesses like we do in
the case of a full fence, we will impose unnecessary ordering
constraints on these accesses. The ordering constraint of this
fence should be valid only as long as the synchronizing load is
speculative. Once the synchronizing load retires from the ROB,
the corresponding ordering constraint is satisfied. No further
restrictions should exist between the po-later memory accesses
and the po-previous memory accesses, i.e., stores residing in
the store buffer. We illustrate an example of such a situation
in Figure 5. In this example, if we assign a higher version to
store I3 than the version assigned to store I1, we prohibit a
valid reordering of I1 and I3 as listed in Figure 5 (c). So as to
not place such unnecessary constraints, we do not increase the
version assigned to I3. Instead, we rely on the retirement order
of instructions and speculative load squashing in versioning to
ensure the required memory order.
(a)                  (b)                  (c)
I1: STR  W0, [A1]    I2: LDAR W1, [A2]    I2: LDAR W1, [A2]
I2: LDAR W1, [A2]    I1: STR  W0, [A1]    I3: STR  W2, [A3]
I3: STR  W2, [A3]    I3: STR  W2, [A3]    I1: STR  W0, [A1]
Fig. 5. The valid reorders for instructions in program order in (a) are (b) and
(c).
Consider X <p Y where X is a load-acquire instruction.
If Y is a store instruction, it will naturally be ordered after
X. This is because a store becomes globally visible, and hence
ordered, only after it retires, which in turn happens after the
retirement of the instruction X. Now if Y is a load instruction that
is speculatively satisfied before X retires, we squash Y and
re-execute it upon receiving a cache invalidation. This re-
execution will ensure that Y is ordered after X .
Versioning Rule 2: If a load-acquire fence separates two
regular memory accesses a and b, then the versions assigned
for all the three memory accesses are the same.
$a <_p \mathit{ldar} <_p b \implies v_a = v_{\mathit{ldar}} = v_b$
3) Synchronizing Store: A synchronizing store needs to
be ordered after all the po-previous memory accesses (Rule
RC4 in Section II-C). It places no constraint on the po-later
memory accesses. Our reasoning for not increasing the version
on issuing a store-release fence is similar to the one presented
for the synchronizing load. If we increase the version assigned
to the memory accesses that are po-later than the synchronizing
store, we place an unnecessary constraint on them that should not
remain once the synchronizing store is complete. We illustrate an
example of such a situation in Figure 6. In this example, if
we assign a higher version to store I3 that follows the store-
release fence than the version assigned to store I1, we prohibit
the valid reordering of stores I1 and I3 listed in Figure 6 (c).
(a)                  (b)                  (c)
I1: STR  W0, [A1]    I1: STR  W0, [A1]    I3: STR  W2, [A3]
I2: STLR W1, [A2]    I3: STR  W2, [A3]    I1: STR  W0, [A1]
I3: STR  W2, [A3]    I2: STLR W1, [A2]    I2: STLR W1, [A2]
Fig. 6. The valid reorders for the instructions in program order in (a) are (b)
and (c).
To ensure that we place the required minimal ordering
constraint, we assign a higher version only to the synchro-
nizing store. All po-later memory accesses continue receiving
the same version as the po-previous memory accesses. This
assignment ensures that we place the constraint only on the
synchronizing store. Since all the post-fence stores receive the
same version as pre-fence stores, no constraint is placed on
them and they can complete in any order from the store buffer.
Versioning Rule 3: If a store-release fence separates two
regular memory accesses a and b, then the versions assigned
for the three memory accesses are as follows:
$a <_p \mathit{stlr} <_p b \implies v_a = v_b < v_{\mathit{stlr}}$
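The effect of the versioning rules can be checked by enumerating the global orders that VSR1 and VSR2 permit for a short program. The following is a minimal Python sketch under that reading of the rules; the function name and the (name, kind, version) tuples are hypothetical, and VSR3 (the value rule) is not modeled.

```python
from itertools import permutations


def valid_orders(program):
    """program: list of (name, kind, version) tuples in program order.

    Returns every global order (as a list of names) that respects
    VSR1 (increasing versions) and VSR2 (load-acquire, equal versions).
    """
    n = len(program)
    result = []
    for perm in permutations(range(n)):
        position = {idx: p for p, idx in enumerate(perm)}
        ok = True
        for i in range(n):
            for j in range(i + 1, n):
                _, kind_i, v_i = program[i]
                _, _, v_j = program[j]
                must_order = v_i < v_j or (kind_i == "ldar" and v_i == v_j)
                if must_order and position[i] > position[j]:
                    ok = False
        if ok:
            result.append([program[k][0] for k in perm])
    return result
```

Applied to the program of Figure 5 with versions 0, 0, 0 (Versioning Rule 2), the sketch returns exactly the three orders (a), (b), and (c) of that figure; applied to the program of Figure 6 with versions 0, 1, 0 (Versioning Rule 3), it returns the three orders of Figure 6.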
V. ARCHITECTURE
This section details a low overhead implementation that
utilizes LOUVRE to enforce the prescribed ordering. In the
current work, we assume an architecture that implements the
conventional RCSC memory model [20] and employs a load-
store queue and a store buffer. We augment these structures to
hold version information for each entry.
A. Hardware Structures
We use the following structures to support versioning.
Ordering version: A tag added to each load and store that
tracks the assigned version.
Version registers (vr and lfvr): We assign a version to ordinary
memory accesses during issue using the vr register. The lfvr
register is the last fence version register and keeps track of the
version of the last issued fence. This is used to assign versions
to ordering instructions such that they are in sequential order
(RC5 in Section II-C).
Min version registers (vmin,sb and vmin,lsq): These registers
keep track of the lowest version of the entries in the store
buffer and LSQ respectively.
[Figure 7: block diagram with the components Fetch/Decode/Rename, ROB, ORQ, issue queue, functional units, load queue, store queue, L1D cache, write buffer, the vr and lfvr registers, and version comparison logic; a legend marks the structures carrying ordering tags.]
Fig. 7. Overview of a simplified architecture implementing LOUVRE.
ORQ (orq): We need to track in-flight synchronizing loads
and full fences to squash speculative post-fence loads upon
invalidation. We use an ordering queue to track these fences.
A simplified architecture with the above structures is de-
tailed in Fig. 7.
B. Version Registers Update
We increment the last fence version register, lfvr, after
issuing any fence instruction. This is done to ensure that the
ordering instructions are properly ordered.
We update the version register vr to lfvr when we issue a
full fence instruction. The updated version register ensures
that later memory accesses receive a higher version and are
therefore ordered after the previous memory accesses.
Table II summarizes the assignment of versions and updates
to the version register vr after issuing an instruction. Since we
continuously increment the version registers, they will
eventually overflow, at a rate that depends on their size. To handle this overflow, we
stop issuing any further instructions and wait for the pipeline to
drain. After draining the pipeline, we reset the version registers
and start reissuing instructions. Assuming that we issue 10
fence instructions per kilo instructions, with a 10-bit version
register we expect to see an overflow approximately once every
hundred thousand instructions.
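The overflow estimate above can be checked with a short calculation. This is a minimal sketch that assumes each fence consumes exactly one version (one lfvr increment per fence); the function name and parameters are illustrative and not part of the original design.

```python
def instructions_between_overflows(version_bits: int,
                                   fences_per_kilo_inst: float) -> float:
    """Estimate how many instructions issue before the version register wraps.

    Assumption: every fence consumes exactly one version number.
    """
    distinct_versions = 2 ** version_bits
    instructions_per_fence = 1000 / fences_per_kilo_inst
    return distinct_versions * instructions_per_fence


# 10-bit register, 10 fences per kilo instructions, as in the text.
estimate = instructions_between_overflows(version_bits=10, fences_per_kilo_inst=10)
print(estimate)  # 102400.0
```

With these inputs the estimate is 1024 versions times 100 instructions per fence, which agrees with the "approximately once every hundred thousand instructions" figure.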
TABLE II
VERSION ASSIGNMENT AND REGISTER UPDATE ON ISSUING AN INSTRUCTION.
| Issued inst.  | Version assignment | Register update   |
|---------------|--------------------|-------------------|
| load/store    | version = vr       | vr is not updated |
| load-acquire  | version = vr       | vr is not updated |
| store-release | version = vr + 1   | vr is not updated |
| fence         | -                  | vr = lfvr         |
We illustrate an example of version assignment to a se-
quence of memory operations in Table III.
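The assignment rules of Table II, together with the lfvr increment described at the start of this subsection, can be sketched as a small software model. The class and function names below are hypothetical, and the model assumes that lfvr is incremented for load-acquire, store-release and full fence instructions alike, and that for a full fence the increment happens before vr is updated.

```python
from dataclasses import dataclass


@dataclass
class VersionRegisters:
    vr: int = 0    # version handed to ordinary memory accesses
    lfvr: int = 0  # version of the last issued fence


def issue(regs: VersionRegisters, kind: str):
    """Return the version assigned to the issued instruction (None for a full fence)."""
    if kind in ("load", "store", "ldar"):
        version = regs.vr
    elif kind == "stlr":
        version = regs.vr + 1
    elif kind == "fence":
        version = None
    else:
        raise ValueError(f"unknown instruction kind: {kind}")
    if kind in ("ldar", "stlr", "fence"):
        regs.lfvr += 1       # assumption: every ordering instruction bumps lfvr
    if kind == "fence":
        regs.vr = regs.lfvr  # only a full fence raises vr
    return version


regs = VersionRegisters()
trace = []
# m1, ldar(m2), m3, stlr(m4), m5, fence, m6
for kind in ["load", "ldar", "load", "stlr", "store", "fence", "load"]:
    version = issue(regs, kind)
    trace.append((kind, version, regs.lfvr, regs.vr))

for row in trace:
    print(row)
```

Under these assumptions the printed rows carry the same v(m), lfvr and vr values as the worked example in Table III.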
C. Retirement
In this section we detail the conditions required for the
retirement of different kinds of instructions in LOUVRE.
TABLE III
WORKED OUT EXAMPLE OF VERSIONED MEMORY OPERATIONS. OPERATION mi CAN EITHER BE A LOAD OR STORE. v(m) IS THE VERSION ASSIGNED TO EACH MEMORY ACCESS.
| Operation     | v(m) | lfvr | vr |
|---------------|------|------|----|
| Initial value | -    | 0    | 0  |
| m1            | 0    | 0    | 0  |
| ldar(m2)      | 0    | 1    | 0  |
| m3            | 0    | 1    | 0  |
| stlr(m4)      | 1    | 2    | 0  |
| m5            | 0    | 2    | 0  |
| fence         | -    | 3    | 3  |
| m6            | 3    | 3    | 3  |
Stores: A store reaching the head of ROB will retire im-
mediately. Retired stores move to the store buffer awaiting
completion.
Loads: In conventional implementations, a satisfied load that
reaches the head of the ROB can retire. However, in LOUVRE,
we need to consider the stores residing in the store buffer
because of the relaxed constraints on fence retirement. In the
proposed mechanism, a load cannot retire as long as a store
exists in the store buffer with a lower version. This is to ensure
that any store pending in the store buffer with a lower version
completes before this load and hence is ordered before it.
Consider the case where a storeA; fence; loadB sequence
of instructions execute. We assign version 1 to store A and
version 2 to load B. load B should not retire until store A
completes from the store buffer even if it is at the head of
ROB.
To check for a store versioned lower than the load at the
head of the ROB without probing the store buffer, we use the
vmin,sb. As explained in Section V-A, the version in vmin,sb is
the minimum version of the stores in the store buffer. Hence,
if the version of the load at the head of the ROB is not greater
than the vmin,sb, then we satisfy all the ordering constraints
and can retire the load at the head of the ROB.
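The retirement check for loads can be sketched as follows. This is an illustrative model only: the function names are hypothetical, the load is assumed to be satisfied and at the head of the ROB, and an empty store buffer is assumed to impose no constraint (the text does not say what vmin,sb holds in that case).

```python
from typing import Optional, Sequence


def vmin_sb(store_buffer_versions: Sequence[int]) -> Optional[int]:
    """Lowest version among stores still in the store buffer (None if empty)."""
    return min(store_buffer_versions) if store_buffer_versions else None


def load_can_retire(load_version: int,
                    store_buffer_versions: Sequence[int]) -> bool:
    """Retirement check for a satisfied load at the head of the ROB."""
    lowest = vmin_sb(store_buffer_versions)
    if lowest is None:
        # Assumption: an empty store buffer imposes no constraint.
        return True
    # The load may retire only if no buffered store has a lower version.
    return load_version <= lowest


# storeA (version 1); fence; loadB (version 2), as in the example above.
blocked = load_can_retire(2, [1])       # store A still pending
released = load_can_retire(2, [])       # store A has completed
same_version = load_can_retire(1, [1])  # no fence between store and load
print(blocked, released, same_version)
```

The single comparison against the minimum stands in for probing every store buffer entry, which is the point of maintaining vmin,sb in hardware.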
Ordering Instructions: A synchronizing store, similar to a
store, retires once it reaches the head of the ROB and then
moves to the store buffer. There is no other constraint on it
since the version assigned to its store ensures the required
ordering. A satisfied synchronizing load, similar to a load,
retires once it reaches the head of the ROB and its version
is not greater than the version in the vmin,sb. A full fence
instruction retires once it reaches the head of the ROB. There
is no delay in retiring a full fence instruction in the proposed
technique, since versioning ensures ordering on the memory
accesses for which these fences are responsible.
D. Store Completion
Stores complete from the store buffer once they update
the cache line. Because the store buffer is unordered, stores
can complete out of order, subject to the order implied by
their versions. Among the stores whose cache line is
available, one with the lowest version completes first.
This ensures that the
stores complete in the order enforced by the versions. We
also complete the oldest store residing in the store buffer
irrespective of its version, since no ordering constraint can
exist on the oldest store.
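One possible reading of this selection policy is sketched below. It is an interpretation, not the paper's implementation: the StoreEntry fields and the function name are hypothetical, and the model simply returns the stores that are eligible to complete next (ready stores with the lowest version among ready stores, plus the oldest store if its line is ready).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoreEntry:
    age: int          # smaller means older in program order
    version: int      # ordering version assigned at issue
    line_ready: bool  # cache line available for the update


def completion_candidates(store_buffer: List[StoreEntry]) -> List[StoreEntry]:
    """Stores that may complete next under the reading described above."""
    ready = [s for s in store_buffer if s.line_ready]
    if not ready:
        return []
    lowest = min(s.version for s in ready)
    candidates = [s for s in ready if s.version == lowest]
    # The oldest store has no older store to be ordered after.
    oldest = min(store_buffer, key=lambda s: s.age)
    if oldest.line_ready and oldest not in candidates:
        candidates.append(oldest)
    return candidates


buffer = [
    StoreEntry(age=0, version=1, line_ready=True),   # store-release, oldest
    StoreEntry(age=1, version=0, line_ready=True),   # post-fence store
    StoreEntry(age=2, version=0, line_ready=False),  # still waiting for its line
]
eligible_ages = [s.age for s in completion_candidates(buffer)]
print(eligible_ages)  # [1, 0]
```

In this example the post-fence store with version 0 and the oldest store (the store-release) are both eligible, while the store whose line has not arrived is not.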
E. Speculative Execution and Invalidation
As discussed in Section III-B, we do not need to squash
and re-execute all speculative loads when we invalidate the
cache line that a load reads from. We need to squash loads
only in cases where there is an in-flight full fence or a load-
acquire instruction enforcing order on it. We ensure this order
by comparing the versions of the loads that need invalidation
and the versions vmin,sb and vmin,lsq. We squash and re-
execute speculative loads only if the version of the load is
greater than these versions. Only post-fence loads with an
active fence (detected using orq) have a version that is higher
than the version in vmin,sb and vmin,lsq. Squashing such loads
ensures that they are not ordered ahead of the pre-fence stores,
which may reside in the store buffer. In all the other cases, we
do not squash loads upon a cache invalidation. This increases
the effectiveness of speculation in the presence of ordering
instructions.
We illustrate an optimal execution of the scenario shown in
Figure 3 when using versioning in Figure 8. In this example,
instruction I1 is a store-release fence. Once it retires, the
ordering store moves to the store buffer with a version of
1. The version assigned to loads I2 and I3 is 0. When the
invalidation message for the cache line containing A2 arrives at
point 1, we can determine that no ordering constraint exists by
comparing the version of the load to the vmin,sb. Since the
version of the load is lower than vmin,sb, it allows us to skip
squashing I3 thereby saving the re-execution penalty.
[Figure 8: execution timeline on Core #0 for I1: STLR W0, [A0]; I2: LDR W2, [A1] (cache miss); I3: LDR W3, [A2] (cache hit); I4: ADD W4, W3, #1; I5: STR W4, [A3]. The A2 invalidation from Core #1 arrives at point 1. I3 is not squashed, and there is no wait for the store buffer to drain.]
Fig. 8. Efficient load speculation in LOUVRE
F. Versioning Example
We present a simple two core execution scenario for a
sample instruction sequence that has a synchronizing load in
the base architecture in Figure 9 (a) and execution for the same
instruction sequence in LOUVRE in Figure 9 (b).
In this example, instruction I1 on core 1 generates an
invalidation request when it writes to address X1. In the base
architecture, this invalidation request squashes the satisfied
load I2 on core 0 since it is speculative, as shown in Figure 9
(a). However, since the synchronizing load I1 on core 0 has
retired by this point, squashing I2 is unnecessary. In LOUVRE,
we can detect this scenario by comparing the version of the
load that is a candidate for squashing with the minimum version
registers. Since the version of this load is not greater than
the minimum version registers, we avoid squashing the load as
shown in Figure 9 (b). Note that both executions produce
a valid order even though the values read are different. In the
base architecture, core 0 reads (W0 : 0, W1 : 1) whereas in
the LOUVRE architecture, core 0 reads (W0 : 0, W1 : 0). The
only invalid combination is (W0 : 1, W1 : 0).
TABLE IV
SIMULATED ARCHITECTURE PARAMETERS
| Parameter    | Value                                                          |
|--------------|----------------------------------------------------------------|
| Processor    | 8-core, 2 GHz with 192 entry ROB                               |
| Fetch        | 6 instructions, 2 loads or stores per cycle                    |
| Issue        | Out-of-order schedule                                          |
| Store Buffer | 16 and 32 entry unordered buffer                               |
| L1 Cache     | 64 KB I-cache and D-cache per core, 2 cycle access latency     |
| L2 Cache     | 512 KB shared L2 cache (4 core) (LLC), 10 cycle access latency |
| L3 Cache     | 8 MB shared L3 cache (LLC), 25 cycle access latency            |
| Coherence    | MESI protocol                                                  |
| Memory       | 4 GB, 100 cycle access latency                                 |
G. Satisfying RCSC
In this section, we discuss how the versioning rules de-
scribed in Section IV-B satisfy the ordering rules of RCSC
listed in Section II-C. We satisfy RC1 and RC2 by versioning
rule 1 which ensures that the assigned version is incremented
when a full fence instruction is issued. We satisfy RC3 by the
restriction on retirement of a synchronizing load as described
in Section IV-B2. We satisfy RC4 by the versioning rule 3
which assigns a higher version to the synchronizing store than
the po-before memory accesses. Finally, we satisfy RC5 by
the squashing and retirement semantics of LOUVRE described
in Section V.
VI. ANALYSIS AND EXPERIMENTS
We implement and analyze LOUVRE using 10-bit version
tags. Each entry in the LSQ and the store buffer is augmented
with these tags. For a 16 entry store buffer and a 64-entry
LSQ the storage cost of these tags is (16 + 64) * 10 = 800
bits or 100 bytes.
To implement the tracking of lowest versions in the store
buffer, we use a hierarchical comparator network [28]. This
network for a store buffer of 16 entry size uses 15 comparators.
We also use two additional comparators to compare versions
in these version registers.
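The storage and comparator counts above can be reproduced with a short sketch. The pairwise reduction below is a generic software model of a hierarchical minimum network, not the specific design of [28]; function names are illustrative.

```python
from typing import Sequence, Tuple


def tag_storage_bits(store_buffer_entries: int, lsq_entries: int,
                     tag_bits: int) -> int:
    """Total bits needed to attach one version tag to every entry."""
    return (store_buffer_entries + lsq_entries) * tag_bits


def tree_min(versions: Sequence[int]) -> Tuple[int, int]:
    """Pairwise (hierarchical) minimum; returns (minimum, comparators used)."""
    if not versions:
        raise ValueError("need at least one version")
    level = list(versions)
    comparators = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(min(level[i], level[i + 1]))
            comparators += 1           # one comparator per pair
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # odd element passes through
        level = next_level
    return level[0], comparators


bits = tag_storage_bits(store_buffer_entries=16, lsq_entries=64, tag_bits=10)
print(bits, bits // 8)                     # 800 100
print(tree_min(list(range(16, 0, -1))))    # (1, 15)
```

For 16 entries the reduction uses 8 + 4 + 2 + 1 = 15 comparators, matching the count stated above.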
In our experiments we assume a release consistency archi-
tecture loosely modeled on an in-production ARM64 system
in Apple’s Cyclone processor [29]. ARM64 follows the RCsc
[20], [30] consistency model where the ordering operations
are sequentially consistent. We consider a 192 entry reorder
buffer which is interfaced to the data cache through the
load-store queue. We assume an out-of-order execution core
which can speculatively execute loads and stores. Other
micro-architectural details of the simulation are listed in Table IV.
[Figure 9: two-core execution timelines. Core #0 executes I1: LDAR W0, [X0] and I2: LDR W1, [X1]; Core #1 executes I1: STR #1, [X1], I2: FULL FENCE and I3: STR #1, [X0]. Initially [X0] = 0 and [X1] = 0; the outcome W0: 1, W1: 0 is invalid. (a) In the base architecture, the X1 invalidation squashes I2, which re-executes, giving W0: 0, W1: 1. (b) In LOUVRE, I2 is not squashed, giving W0: 0, W1: 0.]
Fig. 9. A synchronizing load execution in (a) the base architecture (b) LOUVRE
We assume an invalidation based coherence protocol and use
MESI protocol in our experiments. So as to not complicate
our simulation, we assume that all the caches participate in
the same coherence domain, i.e., any updates in one cache
will be propagated to all processors. Please note that this
assumption is stricter than that in the ARM memory model
which allows non-multi-copy atomic behavior. We assume that
stores are multi-copy atomic in our model. We evaluate our
implementation on architectures using an unordered 16 entry
store buffer.
We model our baseline architecture after the O3 CPU model
of the gem5 simulator [31]. In this model, speculatively executed loads are
squashed immediately upon invalidation and all the ordering
instructions disable speculative execution of post-fence loads
and stores.
We evaluate our implementations in a cycle-accurate simu-
lator implemented using the Qsim [32] framework which in-
cludes SST [33] and MacSim [34]. We ran parallel benchmarks
from two suites, PARSEC [24] and SPLASH-2 [35].
VII. RESULTS
In this section we present our evaluation of the proposed
versioning mechanism. We compare our results against the
baseline with an equal store buffer capacity. We focus on the
following characteristics to show the efficiency
of the proposed technique:
1) Performance evaluation (IPC).
2) Reduction in stalls caused by an ordering instruction.
3) Reduction in store latency.
4) The average residency of a fence in ROB.
A. Performance Evaluation
The main performance benefit of the proposed mechanism
is the ability to retire a fence instruction without waiting for
the store buffer to drain. Speculatively executing loads
and stores past the ordering instruction without incurring any
delay provides additional benefit. We track the ordering
constraints using versions and the version registers, which
enables us to retire fences earlier than a naive implementation
can. This early retirement reduces the stalls in the pipeline.
Also note that, since this technique is implemented primarily
within the core, the performance benefit should increase with
the number of cores, depending on the contention for shared lines.
Figure 10 shows the performance of our versioning scheme
when compared to a recently proposed fence implementation
zFence [22]. In the zFence mechanism, the stores with LLC
misses request exclusive permission, which when granted
allows execution past the fence. This mechanism, similar to
ours, tries to reduce the delay of fence caused by draining
the store buffer. However, there is still a significant delay
in acknowledging this permission request by the directory
controller since you need to get the permission for all the stores
residing in the store buffer before proceeding with further
execution. Whereas in LOUVRE, no such permission request
needs to be issued. On average we improve performance
by 11% with streamcluster gaining the most at 18% IPC
over the baseline. When compared to zFence we improve
the performance by 4% on average. Although we did not
perform any power consumption studies, we believe that the
performance gains will translate to energy savings.
B. Scheduling Stalls
As we explained in Section III-A, an ordering instruction
introduces bubbles in the pipeline, causing a stall. One of the
major components of this stall is the wait for the store buffer to
drain. Using a versioned store buffer eliminates this stall. The
reduction in the stalls when compared to a conventional store
buffer is shown in Figure 11. We see an average reduction of
19% in these stalls.
[Figure 10: bar chart of % IPC improvement for zfence and louvre on blackscholes, canneal, dedup, fluidanimate, stream, swaptions, barnes, fmm, ocean-cp, radiosity and HM.]
Fig. 10. Performance of LOUVRE when compared to zFence for a 16-entry store buffer.
[Figure 11: bar chart of % reduction in scheduling stalls for 16-entry and 32-entry store buffers.]
Fig. 11. Reduction in scheduling stalls in LOUVRE.
C. Store latency
Scheduling stores without any fence-induced delay allows
us to issue their memory requests earlier. Also, since
stores with the same version can complete out of order from
the store buffer, the store latency is reduced from 63 cycles
to 46 cycles on average, which is a 26% reduction as shown
in Figure 12. It should be noted that savings in store latency
do not translate directly into increased performance. This is
because other architectural optimizations that hide memory
access latency, such as prefetching and speculative execution,
are present even in the baseline architecture.
[Figure 12: bar chart of % reduction in store latency for 16-entry and 32-entry store buffers.]
Fig. 12. Reduced store latency caused by early retirement
D. Early Execution
In a naive implementation, whenever a full fence or an acquire
fence is active, post-fence memory accesses are not
speculatively executed, as this can violate the memory
consistency guarantees. Using
a versioned store buffer allows us to relax this constraint,
thereby enabling us to speculatively execute post-fence memory
accesses. The reduction in stalled cycles in comparison
to a naive store buffer implementation is shown in Fig. 13.
On average we find that a post-fence store in a versioned
store buffer implementation starts executing 8 cycles early
(contributing to a 9% reduction in latency) when compared to
an implementation with a naive store buffer.
[Figure 13: horizontal bar chart (axis 0 to 12) for blackscholes, canneal, dedup, fluidanimate, streamcluster, swaptions, barnes, fmm, ocean, radiosity and avg.]
Fig. 13. Average reduction in stalled cycles for post-fence memory accesses
when using a 16-entry versioned store buffer.
E. Fence Residency in ROB
An instruction is allocated an entry in the ROB once it is
scheduled. This entry remains until the instruction is retired
from the ROB. A fence instruction retires once it enforces all
its consistency guarantees. A naive store buffer needs to be
drained completely before a fence can be retired and removed
from the ROB. A versioned store buffer, on the other hand,
allows us to retire the fence without any delays (as explained
in Section V-C) from the ROB as the version assigned to
the memory access will be responsible for ensuring ordering.
This reduces the lifetime of a fence in the ROB. In other
words, it reduces the number of cycles a fence resides in the
ROB. This fence residency goes down from an average of 142
cycles in the base implementation to 82 cycles in the versioned
implementation, a reduction of 39.6% as shown in Fig. 14.
[Figure 14: bar chart of % reduction in fence stalls for 16-entry and 32-entry store buffers.]
Fig. 14. Reduced residency of fences in LOUVRE.
F. Congested Store Buffer
One side-effect of delaying the scheduling of stores past an
ordering instruction is not being able to issue memory
requests for them. Prefetching can mitigate this issue, whereas
LOUVRE avoids it as a side-effect of early scheduling. When
memory requests are not issued early, there might be multiple
post-fence stores with pending memory requests. When these
reach the head of the ROB, they are retired and placed in the
store buffer. This causes congestion and leads to further stalls.
Figure 15 shows the reduction in such stalls when we employ
versioning. In all but two benchmarks, we see a decrease in
the number of stalls caused by a congested write buffer. In the
two benchmarks ocean and radiosity, using a 32-entry store
buffer increases the number of stalls. This is because these
benchmarks are store intensive, and we issue many memory
requests, causing greater cache pressure.
[Figure 15: bar chart of % store buffer full stalls (axis 0 to 160) for versioning(entries=16) and versioning(entries=32).]
Fig. 15. Stalls caused by the store buffer being full
VIII. DISCUSSION
A. Multi-copy Atomicity
Architectures like ARM and POWER are not multi-copy
atomic [23], [36]. This property allows a store to become
visible only to some cores and not others. One of the reasons
for this behavior is the sharing of cache by some cores. An
update from one core can become visible to the cores sharing
cache with it before the update propagates to the memory.
Intuitively, the cache coherence protocol should ensure that
a reader sees the latest update in the entire cache hierarchy.
However, if we restrict the coherence domain to the caches of
a core cluster, it will give rise to this behavior. Another reason
for this behavior is when multiple hardware threads share
the same core using simultaneous multi-threading (SMT) as
seen in POWER processors. In such processors, the hardware
threads share the store buffer from where they can read the
value of an earlier write from another hardware thread before
others. However, in our modeling, to keep our simulation
simple, we assume that all the caches participate in the same
coherence domain and that there are no shared store buffers.
We plan to explore the non-multi-copy atomic behavior in our
future work.
B. Branch Prediction
Since we introduce two version registers vr and lfvr that
are updated on instruction issue, we need to consider the
scenario when branch mis-prediction occurs. We need to make
sure that the version registers have correct values when the
processor starts fetching instructions from the correct path
after recovering from a mis-prediction. We ensure this by
check-pointing the version registers along with the processor state on
speculation. On a mis-prediction, we flush the pipeline and restore
the version registers from the check-pointed state.
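A minimal sketch of this check-point and restore behaviour is shown below. The class, its methods and the branch identifiers are hypothetical; the model assumes one check-point per predicted branch and that check-points taken after a mis-predicted branch belong to the wrong path and are discarded.

```python
class VersionRegisterFile:
    """Software model of vr and lfvr with per-branch check-points."""

    def __init__(self) -> None:
        self.vr = 0
        self.lfvr = 0
        self._checkpoints = []  # (branch_id, (vr, lfvr)), oldest first

    def checkpoint(self, branch_id: str) -> None:
        """Save the version registers when a branch is predicted."""
        self._checkpoints.append((branch_id, (self.vr, self.lfvr)))

    def issue_full_fence(self) -> None:
        """Full fence: bump lfvr, then copy it into vr."""
        self.lfvr += 1
        self.vr = self.lfvr

    def resolve(self, branch_id: str, mispredicted: bool) -> None:
        """Drop the check-point, restoring it first on a mis-prediction."""
        index = next(i for i, (bid, _) in enumerate(self._checkpoints)
                     if bid == branch_id)
        _, saved = self._checkpoints[index]
        if mispredicted:
            self.vr, self.lfvr = saved
            # Younger check-points were taken on the wrong path.
            del self._checkpoints[index:]
        else:
            del self._checkpoints[index]


regs = VersionRegisterFile()
regs.issue_full_fence()           # correct path: vr = lfvr = 1
regs.checkpoint("b0")
regs.issue_full_fence()           # wrong path: vr = lfvr = 2
regs.resolve("b0", mispredicted=True)
print(regs.vr, regs.lfvr)         # 1 1
```

After the mis-prediction is resolved, instructions fetched from the correct path are versioned as if the wrong-path fence had never issued.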
C. Load-Store Queue Design
We discuss two possible implementation choices for the
LSQ: the snoopy and the insulated queue [37]. In a snoopy
queue implementation, when a cache line is invalidated, the
LSQ is probed to squash all speculative loads that were
satisfied from the invalidated cache line to prevent ordering
violations. In an insulated queue implementation, this violation
is prevented at the time of retiring a load. During retirement,
the LSQ conservatively squashes all later loads in program
order satisfied ahead of the current load that can potentially
violate consistency. Whereas processors with a stricter consis-
tency model usually employ a snoopy queue design, those with
a weaker consistency model use an insulated queue design.
After all, in a strict consistency model, we need to enforce
ordering on a large number of in-flight loads and stores.
Since ordering violations in a weak consistency processor
are infrequent, an insulated load queue is usually utilized
at the cost of conservative performance. However, since a
fence instruction enforces consistency, an insulated queue
implementation can be used to avoid unnecessary squashes.
In such a case, at the time of load retirement, we check for
an active fence instruction, and if such a fence is found, we
identify all the satisfied loads that are later in program order
and squash them.
D. Write Combining
A write combining buffer merges multiple pending writes
to the same address to reduce the number of memory trans-
actions. However, in LOUVRE, we ensure that the buffer only
merges stores with the same version. If it combines stores with
different versions, then an ordering violation can occur.
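The merge restriction can be sketched as follows. The PendingWrite type and insert_write function are hypothetical, and the sketch adds one assumption not stated in the text: a new write is only merged into the youngest pending write to the same address, so that same-address writes are never reordered around an entry with a different version.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingWrite:
    address: int
    value: int
    version: int


def insert_write(buffer: List[PendingWrite], new: PendingWrite) -> None:
    """Merge `new` into the buffer only if the versions match."""
    for entry in reversed(buffer):
        if entry.address == new.address:
            if entry.version == new.version:
                entry.value = new.value  # combine: the later value wins
                return
            break  # different version: keep the writes separate
    buffer.append(new)


wc_buffer: List[PendingWrite] = []
insert_write(wc_buffer, PendingWrite(address=0x40, value=1, version=0))
insert_write(wc_buffer, PendingWrite(address=0x40, value=2, version=0))  # merged
insert_write(wc_buffer, PendingWrite(address=0x40, value=3, version=1))  # kept apart
print([(w.value, w.version) for w in wc_buffer])  # [(2, 0), (3, 1)]
```

Keeping the version 1 write separate preserves the ordering that the higher version is meant to enforce, at the cost of one extra memory transaction.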
IX. RELATED WORK
Considerable previous work has been done with a focus on
reducing the cost of fences in both software and hardware.
The main impetus for research in this area is that a consid-
erable fraction of fences are determined to be unnecessary
at run-time, or that they are ensuring ordering on memory
accesses for which ordering is unnecessary, for example,
thread-local memory accesses. Various hardware and software
techniques were proposed to remove the fence overhead.
Software techniques focus mainly on compiler techniques
which automatically infer fences [38], identify and remove
redundant fences [39], or reduce the strength of a
fence, i.e., convert a stronger fence to a weaker fence.
Hardware techniques focus on utilizing the dynamic information
available at the time of execution of the fence to reduce or
elide the overhead of ensuring ordering. Ordering needs to be
ensured only for shared memory locations since sequentially
consistent (SC) semantics are guaranteed for local accesses
i.e., a processor always has a sequentially consistent view of
its own memory accesses. Taking the help of the compiler
to determine such memory accesses and using an un-ordered
store buffer for the identified accesses reduces the overhead
of fence instruction [40].
Lin et al. proposed address-aware fences [18] in which
the shared memory addresses on which ordering is to be
ensured are tracked in a watch list. Ordering is only ensured
for these memory locations. All other memory accesses are
unordered. The authors extend the idea to order memory
accesses only within the scope of the fence [41]. Singh et al.
in [40] proposed an approach to design sequentially consistent
hardware in which they classify memory accesses as either
local or shared and try to maintain order only for the shared
accesses. They do this by using an unordered store buffer
for local memory accesses in addition to a regular FIFO
store buffer for shared memory accesses. The local memory
accesses are processed through the unordered store buffer
avoiding the ordering overhead on them.
X. CONCLUSION
A processor implementing a weak memory model requires
a higher number of fence instructions to ensure memory
consistency for parallel programs than a processor implementing
stronger memory models. However, the weaker semantics
provide new opportunities for extracting performance by allowing
us more leeway to reorder memory operations. Uni-directional
fences restrict only certain reorderings while still
ensuring release consistency. In this paper, we propose a
lightweight mechanism which assigns versions to in-flight memory
accesses. We detailed how fences can be retired without
causing any significant delays utilizing versions. With minor
modifications to the micro-architecture where we add version
holding structures, we significantly reduce the fence overhead.
We also studied the various architectural optimizations that can
be implemented in terms of memory re-orderings and spec-
ulative execution of memory accesses when uni-directional
fence instructions are employed and presented a design for
such an optimal implementation. We found that implementing
LOUVRE reduces pipeline stalls caused by fences and store
latency thereby increasing performance by 11% on average
over the baseline and 4% over zFence.
REFERENCES
[1] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H.
Randall, and Y. Zhou, “Cilk: an efficient multithreaded runtime system,”
in PPOPP ’95: Proc. of the fifth ACM SIGPLAN Symp. on Principles
and practice of parallel programming, 1995.
[2] A. J. Dorta, C. Rodríguez, F. de Sande, and A. Gonzalez-Escribano, “The
OpenMP Source Code Repository,” Parallel, Distributed, and Network-
Based Processing, Euromicro Conf. on, 2005.
[3] S. V. Adve and M. D. Hill, “Weak ordering — a new definition,” in
Proceedings of the 17th Annual International Symposium on Computer
Architecture, ser. ISCA ’90. New York, NY, USA: ACM, 1990, pp.
2–14. [Online]. Available: http://doi.acm.org/10.1145/325164.325100
[4] J. Ševčík, “Safe optimisations for shared-memory concurrent programs,”
in Proceedings of the 32Nd ACM SIGPLAN Conference on Programming
Language Design and Implementation, ser. PLDI ’11. New York,
NY, USA: ACM, 2011, pp. 306–316. [Online]. Available: http:
//doi.acm.org/10.1145/1993498.1993534
[5] E. Eide and J. Regehr, “Volatiles are miscompiled, and what
to do about it,” in Proceedings of the 8th ACM International
Conference on Embedded Software, ser. EMSOFT ’08. New
York, NY, USA: ACM, 2008, pp. 255–264. [Online]. Available:
http://doi.acm.org/10.1145/1450058.1450093
[6] O. Trachsel, C. von Praun, and T. R. Gross, “On the effectiveness of
speculative and selective memory fences,” in Proceedings of the 20th
International Conference on Parallel and Distributed Processing, ser.
IPDPS’06. Washington, DC, USA: IEEE Computer Society, 2006, pp.
33–33. [Online]. Available: http://dl.acm.org/citation.cfm?id=1898953.
1898968
[7] Y. Chou, B. Fahs, and S. Abraham, “Microarchitecture optimizations
for exploiting memory-level parallelism,” SIGARCH Comput. Archit.
News, vol. 32, no. 2, pp. 76–, Mar. 2004. [Online]. Available:
http://doi.acm.org/10.1145/1028176.1006708
[8] A. R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, “A
large, fast instruction window for tolerating cache misses,” in ISCA-29,
2002, pp. 59–70.
[9] C.-K. Luk, “Tolerating memory latency through software-Controlled pre-
execution in simultaneous multithreading processors,” in ISCA-23. New
York, NY, USA: ACM, 2001, pp. 40–51.
[10] J. F. Martínez and J. Torrellas, “Speculative synchronization: Applying
thread-level speculation to explicitly parallel applications,” SIGOPS
Oper. Syst. Rev., vol. 36, no. 5, pp. 18–29, Oct. 2002. [Online].
Available: http://doi.acm.org/10.1145/635508.605400
[11] P. Ranganathan, V. S. Pai, and S. V. Adve, “Using speculative
retirement and larger instruction windows to narrow the performance
gap between memory consistency models,” in Proceedings of the Ninth
Annual ACM Symposium on Parallel Algorithms and Architectures, ser.
SPAA ’97. New York, NY, USA: ACM, 1997, pp. 199–210. [Online].
Available: http://doi.acm.org/10.1145/258492.258512
[12] C. Gniady, B. Falsafi, and T. N. Vijaykumar, “Is sc + ilp = rc?”
SIGARCH Comput. Archit. News, vol. 27, no. 2, pp. 162–171, May
1999. [Online]. Available: http://doi.acm.org/10.1145/307338.300993
[13] Y. Duan, A. Muzahid, and J. Torrellas, “Weefence: Toward making
fences free in tso,” SIGARCH Comput. Archit. News, vol. 41, no. 3, pp.
213–224, Jun. 2013. [Online]. Available: http://doi.acm.org/10.1145/
2508148.2485941
[14] Y. Duan, X. Feng, L. Wang, C. Zhang, and P.-C. Yew, “Detecting and
eliminating potential violations of sequential consistency for concurrent
c/c++ programs,” in Proceedings of the 7th Annual IEEE/ACM
International Symposium on Code Generation and Optimization, ser.
CGO ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp.
25–34. [Online]. Available: http://dx.doi.org/10.1109/CGO.2009.29
[15] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas, “Bulksc: Bulk
enforcement of sequential consistency,” SIGARCH Comput. Archit.
News, vol. 35, no. 2, pp. 278–289, Jun. 2007. [Online]. Available:
http://doi.acm.org/10.1145/1273440.1250697
[16] J. H. Lee, J. Sim, and H. Kim, “Bssync: Processing near memory
for machine learning workloads with bounded staleness consistency
models,” in Proceedings of the 2015 International Conference on
Parallel Architecture and Compilation (PACT), ser. PACT ’15.
Washington, DC, USA: IEEE Computer Society, 2015, pp. 241–252.
[Online]. Available: http://dx.doi.org/10.1109/PACT.2015.42
[17] K. Vora, S. C. Koduru, and R. Gupta, “Aspire: Exploiting asynchronous
parallelism in iterative algorithms using a relaxed consistency based
dsm,” SIGPLAN Not., vol. 49, no. 10, pp. 861–878, Oct. 2014. [Online].
Available: http://doi.acm.org/10.1145/2714064.2660227
[18] C. Lin, V. Nagarajan, and R. Gupta, “Address-aware fences,”
in Proceedings of the 27th International ACM Conference on
International Conference on Supercomputing, ser. ICS ’13. New
York, NY, USA: ACM, 2013, pp. 313–324. [Online]. Available:
http://doi.acm.org/10.1145/2464996.2465015
[19] Y. Duan, N. Honarmand, and J. Torrellas, “Asymmetric memory fences:
Optimizing both performance and implementability,” in Proceedings of
the Twentieth International Conference on Architectural Support for
Programming Languages and Operating Systems. ACM, 2015, pp.
531–543.
[20] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and
J. Hennessy, “Memory consistency and event ordering in scalable
shared-memory multiprocessors,” in Proceedings of the 17th Annual
International Symposium on Computer Architecture, ser. ISCA ’90.
New York, NY, USA: ACM, 1990, pp. 15–26. [Online]. Available:
http://doi.acm.org/10.1145/325164.325102
[21] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanović, “The
risc-v instruction set manual, volume i: User-level isa, version
2.1,” EECS Department, University of California, Berkeley, Tech.
Rep. UCB/EECS-2016-118, May 2016. [Online]. Available: http:
//www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-118.html
[22] S. Aga, A. Singh, and S. Narayanasamy, “zfence: Data-less
coherence for efficient fences,” in Proceedings of the 29th ACM
on International Conference on Supercomputing, ser. ICS ’15. New
York, NY, USA: ACM, 2015, pp. 295–305. [Online]. Available:
http://doi.acm.org/10.1145/2751205.2751211
[23] S. Flur, K. E. Gray, C. Pulte, S. Sarkar, A. Sezgin, L. Maranget,
W. Deacon, and P. Sewell, “Modelling the armv8 architecture,
operationally: Concurrency and isa,” in Proceedings of the 43rd Annual
ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages, ser. POPL ’16. New York, NY, USA: ACM, 2016, pp. 608–
621. [Online]. Available: http://doi.acm.org/10.1145/2837614.2837615
[24] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC Bench-
mark Suite: Characterization and Architectural Implications,” Princeton
University, Tech. Rep. TR-811-08, 2008.
[25] Y. Chou, L. Spracklen, and S. G. Abraham, “Store memory-
level parallelism optimizations for commercial applications,” in
Proceedings of the 38th Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO 38. Washington, DC, USA:
IEEE Computer Society, 2005, pp. 183–196. [Online]. Available:
http://dx.doi.org/10.1109/MICRO.2005.31
[26] T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, “Mechanisms
for store-wait-free multiprocessors,” in Proceedings of the 34th Annual
International Symposium on Computer Architecture, ser. ISCA ’07.
New York, NY, USA: ACM, 2007, pp. 266–277. [Online]. Available:
http://doi.acm.org/10.1145/1250662.1250696
[27] R. Bhargava and L. K. John, “Issues in the design of store buffers in
dynamically scheduled processors,” in Performance Analysis of Systems
and Software, 2000. ISPASS. 2000 IEEE International Symposium on,
2000, pp. 76–87.
[28] D. Koch and J. Torresen, “FPGASort: A high performance sorting
architecture exploiting run-time reconfiguration on FPGAs for large
problem sorting,” in Proceedings of the 19th ACM/SIGDA International
Symposium on Field Programmable Gate Arrays, ser. FPGA ’11.
New York, NY, USA: ACM, 2011, pp. 45–54. [Online]. Available:
http://doi.acm.org/10.1145/1950413.1950427
[29] A. L. Shimpi, “Apple’s cyclone microarchitecture detailed,”
http://www.anandtech.com/show/7910/apples-cyclone-microarchitecture-detailed,
2014.
[30] R. Grisenthwaite, “ARMv8 technology preview,” 2011.
[31] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,
J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell,
M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,”
SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011.
[Online]. Available: http://doi.acm.org/10.1145/2024716.2024718
[32] C. D. Kersey, A. Rodrigues, and S. Yalamanchili, “A universal
parallel front-end for execution driven microarchitecture simulation,”
in Proceedings of the 2012 Workshop on Rapid Simulation and
Performance Evaluation: Methods and Tools, ser. RAPIDO ’12.
New York, NY, USA: ACM, 2012, pp. 25–32. [Online]. Available:
http://doi.acm.org/10.1145/2162131.2162135
[33] Sandia National Laboratories, “SST,” http://sst.sandia.gov, 2015.
[34] H. Kim, J. Lee, N. B. Lakshminarayana, J. Sim, J. Lim, and T. Pho,
“MacSim: A CPU-GPU heterogeneous simulation framework user guide,”
2012.
[35] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The
SPLASH-2 programs: Characterization and methodological considerations,” in
Proceedings of the 22nd Annual International Symposium on Computer
Architecture, ser. ISCA ’95. New York, NY, USA: ACM, 1995, pp.
24–36. [Online]. Available: http://doi.acm.org/10.1145/223982.223990
[36] “ARM architecture reference manual for ARMv8-A architecture profile,”
ARM. [Online]. Available:
https://developer.arm.com/docs/ddi0487/a/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
[37] H. W. Cain and M. H. Lipasti, “Memory ordering: a value-based
approach,” in Computer Architecture, 2004. Proceedings. 31st Annual
International Symposium on, June 2004, pp. 90–101.
[38] M. Kuperstein, M. Vechev, and E. Yahav, “Automatic inference
of memory fences,” in Proceedings of the 2010 Conference on
Formal Methods in Computer-Aided Design, ser. FMCAD ’10.
Austin, TX: FMCAD Inc, 2010, pp. 111–120. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1998496.1998518
[39] V. Vafeiadis and F. Z. Nardelli, “Verifying fence elimination
optimisations,” in Proceedings of the 18th International Conference
on Static Analysis, ser. SAS’11. Berlin, Heidelberg: Springer-Verlag,
2011, pp. 146–162. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2041552.2041566
[40] A. Singh, S. Narayanasamy, D. Marino, T. Millstein, and M. Musuvathi,
“End-to-end sequential consistency,” in Proceedings of the 39th Annual
International Symposium on Computer Architecture, ser. ISCA ’12.
Washington, DC, USA: IEEE Computer Society, 2012, pp. 524–535.
[Online]. Available: http://dl.acm.org/citation.cfm?id=2337159.2337220
[41] C. Lin, V. Nagarajan, and R. Gupta, “Fence scoping,” in Proceedings
of the International Conference for High Performance Computing,
Networking, Storage and Analysis, ser. SC ’14. Piscataway, NJ,
USA: IEEE Press, 2014, pp. 105–116. [Online]. Available:
http://dx.doi.org/10.1109/SC.2014.14
