The Impact of Memory Models on Software Reliability in Multiprocessors by Jaffe, Alexander et al.
arXiv:1103.6114v2 [cs.DC] 6 Apr 2011
The Impact of Memory Models on Software Reliability
in Multiprocessors ∗
Alexander Jaffe
University of Washington
ajaffe@cs.washington.edu
Thomas Moscibroda
Microsoft Research
moscitho@microsoft.com
Laura Effinger-Dean
University of Washington
effinger@cs.washington.edu
Luis Ceze
University of Washington
luisceze@cs.washington.edu
Karin Strauss
Microsoft Research
kstrauss@microsoft.com
ABSTRACT
The memory consistency model is a fundamental system
property characterizing a multiprocessor. The relative mer-
its of strict versus relaxed memory models have been widely
debated in terms of their impact on performance, hardware
complexity and programmability. This paper adds a new
dimension to this discussion: the impact of memory mod-
els on software reliability. By allowing some instructions
to reorder, weak memory models may expand the window
between critical memory operations. This can increase the
chance of an undesirable thread-interleaving, thus allowing
an otherwise-unlikely concurrency bug to manifest. To ex-
plore this phenomenon, we define and study a probabilistic
model of shared-memory parallel programs that takes into
account such reordering. We use this model to formally
derive bounds on the vulnerability to concurrency bugs of
different memory models. Our results show that for 2 con-
current threads, weaker memory models do indeed have a
higher likelihood of allowing bugs. On the other hand, we
show that as the number of parallel, buggy threads increases,
the gap between the different memory models becomes pro-
portionally insignificant, and thus the importance of using
a strict memory model diminishes.
Categories and Subject Descriptors
F.1.2 [Computation by Abstract Devices]: Modes of
Computation—parallelism and concurrency ; G.3 [Proba-
bility and Statistics]: Stochastic processes; B.3.4 [Mem-
ory Structures]: Reliability, Testing, and Fault-Tolerance
General Terms
Theory, Reliability
∗This paper is a full version of an extended abstract that
appears in PODC 2011.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
PODC’11, June 6–8, 2011, San Jose, California, USA.
Copyright 2011 ACM 978-1-4503-0719-2/11/06 ...$10.00.
Keywords
Memory consistency models, probabilistic analysis, sequen-
tial consistency, total store order, weak ordering, software
reliability
1. INTRODUCTION
A critically important property of a shared-memory multi-
processor is its memory consistency model. There has been
an enormous amount of work on this subject, both in in-
dustry and academia. The memory consistency model de-
scribes which values may be returned by a load operation
in a parallel or multi-threaded program. The strongest and
most intuitive model is Sequential Consistency (SC) [15]. SC
imposes two requirements on the execution of parallel pro-
grams: first, all processors must see the same global order
of memory operations, and second, the operations for a par-
ticular processor must appear to execute in program order.
This model is attractive for its high level of programmability,
but the strict constraints on memory operation reordering
rule out important optimizations such as access buffering,
pipelining, or dynamic scheduling, which improve perfor-
mance by hiding the latency of memory accesses. In order
to enable these aggressive optimizations, a wide variety of
relaxed memory models have been proposed. Relaxed mem-
ory models allow the reordering of certain types of memory
operations at the cost of increased programming complex-
ity, since programmers need to explicitly encode reordering
restrictions to ensure correctness.
Historically, the vast literature on memory consistency
models has discussed a three-way trade-off between perfor-
mance, hardware complexity, and programmability. In this
paper, we bring a new axis to this discussion: software re-
liability. Software is inherently unreliable, and is arguably
becoming less reliable with pervasive concurrency. Concur-
rency bugs such as data races and deadlocks are extremely
common in practice, and can cause unexpected failures in
even production-level code.
In this paper, we investigate to what extent relaxed mem-
ory consistency models further contribute to the unreliabil-
ity of parallel software by increasing the likelihood that con-
currency bugs will manifest during an execution. For this
purpose, we study a new probabilistic model for the instruc-
tion reordering introduced by relaxed memory models, and
analyze a canonical buggy program (specifically, an atom-
icity violation [9, 4, 17]) with respect to this model. We
compare three important memory consistency models: Se-
quential Consistency, Weak Ordering, and Total Store Or-
der. We derive two interesting results for our model:
• We show that for 2 (or any small constant number of)
parallel threads, the bug is indeed more likely to mani-
fest under weaker memory models. This is intuitive and
follows from the following high-level argument: A typical
concurrency bug, such as a data race, can manifest only
during a short window of time. The reordering of opera-
tions caused by relaxed memory models may increase the
size of this critical window, thus making the bug more
likely to manifest. In the paper, we give precise bounds
on this vulnerability of the three memory models.
• On the other hand, we show that as the number of par-
allel, buggy threads increases, the gap between the dif-
ferent memory models shrinks in proportion to the risk
for even the strongest memory model. This implies that
as the number of parallel threads in the system increases,
the importance of using a strict memory model dimin-
ishes (with regard to the software reliability metric we
study in this paper).
Notice that the latter result could have far-reaching impli-
cations on the choice of memory consistency models in future
multi-core and massively parallel systems. Intuitively, one
might expect that with more and more concurrent threads,
stronger memory consistency models should be used in or-
der to counter the generally increased likelihood of bugs.
However, our results indicate that the opposite is the case:
As the number of threads increases, the relative importance
of having stronger memory models reduces to a minimum.
The underlying reason is that the larger number of threads
causes the likelihood that bugs occur to increase much more
quickly than what even the strictest memory model is able
to contain. That is, the asymptotic growth fundamentally
works against using strict memory models as we increase the
number of threads.
The technical content of our paper proceeds as follows.
In Section 3, we introduce two distinct random processes,
each of which is a natural object of inquiry in isolation. By
combining them—treating the output of the first process as
the input to the second—we model the end-to-end behavior
of program execution. This allows us to answer our central
question: how does the probability that a canonical data
race manifests vary across memory models and quantity of
threads?
The first process models the generation of a random pro-
gram, and the subsequent randomized reordering of instruc-
tions. Specifically, in Section 4, we derive the probability
that a certain essential window of vulnerability between two
instructions widens. The second process enacts a random series
of shifts on a set of heterogeneous segments of the integer
line. We use the positions of these line segments to model
the interleaving of the vulnerable windows of the threads. In
Section 5, we estimate the probability that each of these seg-
ments is shifted to mutually disjoint positions. Finally, the
two processes are combined together in Section 6 to derive
overall bounds on the probability of bug manifestation, first
for two threads, then for a large number of threads. Due to
lack of space, several proofs are omitted and deferred to the
appendix.
2. BACKGROUND & RELATED WORK
2.1 Memory Consistency Models
Memory models are a key aspect of the hardware/software
interface in shared-memory multicore/multiprocessor sys-
tems. They determine what values read memory operations
are allowed to return by dictating how memory operations
are allowed to be reordered, as well as when writes become
visible to other processors. They have major implications
on the performance, design complexity and programmabil-
ity of multiprocessor systems and the programs that run on
them. Common misunderstandings about memory models
often lead to bugs that are very difficult to find and fix,
and can also lead to major performance issues. There ex-
ists a vast and rich line of literature on memory models (a
good tutorial overview is presented in [1]). Most of the past
work has focused on new memory models [11, 2, 13], hard-
ware implementations [10, 12, 7], memory models for popu-
lar languages such as Java [18] and C++ [6], and compiler
optimizations [16] and their relative merits [1, 5].
Relaxed memory models: The strongest memory model
is Lamport’s Sequential Consistency (SC) [15]. In order to
enable important performance optimizations, a number of
relaxed memory models have been proposed in the literature,
with varying degrees of guarantees. One of the strongest
examples is known as Total Store Order (TSO) [19]. In
TSO, loads may execute before stores that precede them
in program order, as long as no data dependency is vio-
lated. All other pairs of instructions must maintain strict
program order. This model encapsulates the natural case in
which stores are observed by remote processors in program
order. Some stores may take extra time to be observed after
their execution, but the local program is allowed to proceed.
A similar, but slightly weaker consistency model is Partial
Store Order (PSO) [19], which also allows the reordering
of stores with respect to each other as long as they access
distinct memory locations. A significantly weaker consis-
tency model is Weak Ordering (WO) [8, 2]. The opposite
extreme from Sequential Consistency, WO allows any mem-
ory operations to reorder with one another, as long as no
data dependencies are violated. This model allows for an
equal amount of optimization as a uniprocessor, but is also
the most vulnerable to programmer error, since it requires
explicit fences to prevent unwanted reorderings. Modern
processors typically support relaxed models. For example,
the x86 memory model [3, 14] supports a model similar to
TSO and the IBM POWER architecture supports a form of
WO.
The above memory consistency models follow a pattern:
they can be defined by a subset of the four ordered mem-
ory operation pairs, specifying which pairs are allowed to
reorder: For example, in the WO model, any two mem-
ory operations are allowed to be reordered; in SC, no two
memory operations are allowed to be reordered; and in the
TSO model, no two memory operations are allowed to be
reordered, except that loads can reorder before stores (see
Table 1).
Note that since in this paper we analyze a concurrency
bug involving multiple threads, we ignore store atomicity [5],
which is tangential to our present analysis. Moreover, we
do not currently handle fence operations explicitly,1 which
1However, our shift process in Section 5 can be used to simu-
late a behavior similar to that arising from the use of fences.
ST/ST  ST/LD  LD/ST  LD/LD  Name
  -      -      -      -    Sequential Consistency
  -      X      -      -    Total Store Order
  X      X      -      -    Partial Store Order
  X      X      X      X    Weak Ordering
Table 1: Important memory models. A “X” in col-
umn ST/LD means that the ordering restriction from
stores to later loads can be relaxed, i.e., loads can
complete before stores that precede them in pro-
gram order. With regard to our model in Sec-
tion 3.1.2, this means that a LD can settle past (swap
with) a preceding ST. Other columns are analogous.
are used to restrict reorderings and are typically used for
synchronization. For that reason, we do not consider models
such as Release Consistency (RC) [11], which differs mainly
in the types of fences supported. As we discuss in Section 7,
it will be interesting to extend our process to distinguish
such memory models.
2.2 Race Conditions
A common type of bug in shared-memory multithreaded
programming is a race condition, which occurs when cor-
rectness depends on an assumption about the order in which
instructions from two or more threads interleave. In partic-
ular, an atomicity violation [9] occurs when the programmer
assumes that multiple instructions will execute as an atomic
unit, but fails to insert the proper synchronization. A re-
cent study showed that atomicity violations are extremely
common in “real world” programs [17]. Race conditions are
often difficult to identify due to nondeterminism: the program
may behave correctly on most runs and fail only for
specific thread interleavings.
A canonical example of an atomicity violation is as follows:
Thread 1 Thread 2
1: int loc = x; 1: int loc = x;
2: loc = loc + 1; 2: loc = loc + 1;
3: x = loc; 3: x = loc;
Here x is a shared variable (with x = 0 initially) and loc
is local to each thread. Two threads simultaneously try to
increment x by loading its value into a local variable, incre-
menting that local variable, then storing the updated value
back to x. The programmer’s intent is that x = 2 after both
threads finish executing. However, the program has a race
condition that can result in the spurious outcome x = 1. For
instance, suppose that the two threads interleave as follows:
(1) Thread 1 executes Lines 1 and 2; (2) Thread 2 executes
Lines 1 and 2; (3) Thread 1 executes Line 3; (4) Thread 2
executes Line 3. This interleaving produces the final result
x = 1. We say that the bug manifests because the result did
not match programmer intent.
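To make the interleaving argument concrete, the following minimal Python sketch replays the three-line program under an explicit schedule. The names `run` and `interleavings`, and the assumption that each numbered line executes as one indivisible step, are illustrative choices and not part of the paper's model.

```python
from itertools import combinations

def run(schedule):
    """Replay a schedule of (thread_id, line) steps and return the final x."""
    x = 0                      # shared variable, initially 0
    loc = {1: 0, 2: 0}         # one private `loc` per thread
    for tid, line in schedule:
        if line == 1:          # int loc = x;
            loc[tid] = x
        elif line == 2:        # loc = loc + 1;
            loc[tid] += 1
        elif line == 3:        # x = loc;
            x = loc[tid]
    return x

def interleavings():
    """Yield every interleaving of the two threads that respects program order."""
    for t1_steps in combinations(range(6), 3):
        next_line = {1: 1, 2: 1}
        schedule = []
        for step in range(6):
            tid = 1 if step in t1_steps else 2
            schedule.append((tid, next_line[tid]))
            next_line[tid] += 1
        yield schedule

# The interleaving described in the text: both loads happen before either store.
buggy = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (2, 3)]
# A serial execution: Thread 1 finishes before Thread 2 starts.
serial = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]

if __name__ == "__main__":
    print(run(buggy), run(serial))
    print(sum(run(s) == 1 for s in interleavings()), "schedules lose an update")
```

Under this step-level abstraction, only the two fully serial schedules out of the 20 possible ones leave x = 2; every schedule in which the two load-to-store windows overlap loses an update.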
The standard solution for race conditions like the exam-
ple above is to protect the variable x with a lock. However,
locking protocols can be extremely complicated in large pro-
grams, and in practice, a concurrency bug may easily slip
past even the most experienced programmers. Note that
such bugs can manifest in any memory model, even Sequen-
tial Consistency.
3. MODEL
Our goal is to study how the use of different memory
models impacts the likelihood of an error occurring given
a canonical atomicity violation. In this section, we describe
a model that allows us to formally analyze these likelihoods.
It is a probabilistic model of parallel program executions
under memory models that may permit reordering. At a
high level, we consider two or more threads which execute a
simple program containing an atomicity bug. The program
consists of basic memory operations (stores and loads). De-
pending on the memory model under consideration, the op-
erations in each thread are then independently reordered via
a random process we call the settling process. Finally, we use
a thread interleaving model—the shift model— to model the
execution of the program by interleaving the instructions
of different threads. The probability of the bug manifest-
ing is determined by analyzing how the operations from the
threads interleave. We show in this paper that, when exe-
cuting two threads, this probability crucially depends on the
underlying memory model. Yet, perhaps counter-intuitively,
we show that as the number of threads grows larger, the rela-
tive difference between the memory models becomes smaller
and smaller.
3.1 Program Model
We first describe a process for modeling a typical, ran-
domly reordered program. The process proceeds in two
phases: program generation and program reordering.
3.1.1 Program Generation
We model an initial program based on the canonical atomicity
violation bug described in §2.2. The program is a sequence
$S$ of memory operations $x_1, x_2, \ldots, x_m, x_{m+1}, x_{m+2}$,
where each $x_i$ has type $\tau(x_i) \in \{\mathrm{LD}, \mathrm{ST}\}$. $x_{m+1}$ and $x_{m+2}$
are Lines 1 and 3 of the canonical bug, respectively. Since we
are only concerned with memory operations, we omit Line 2
(which accesses only the local variable loc), and we will use
the terms instruction and memory operation synonymously
in this paper. We assume for simplicity that only $x_{m+1}$
and $x_{m+2}$ access the same location (see footnote 2). We will call $x_{m+1}$ the
critical load and $x_{m+2}$ the critical store. An initial program
order $S_0$ starts with a random sequence of $m$ independently
distributed LD and ST operations; $\tau(x_i) = \mathrm{ST}$ with probability
$p$ and LD with probability $1 - p$. Furthermore, for
convenience in the analysis, it will be useful to approximate
a very long program by letting $m \to \infty$.
3.1.2 Instruction Reordering: The Settling Process
Different memory models allow for different forms of in-
struction reorderings. We model this relaxation of program
order using a probabilistic settling process. This random
process models instruction reordering by taking a (random)
initial program order as input, and producing a reordering of
that initial program. The settling process takes into account
which kinds of reorderings are allowed by the memory con-
sistency model under consideration, and generates a random
program order that is allowed to occur given the kinds of re-
orderings. In this section, we give an informal description
of the settling process; a formal definition is given in Ap-
pendix A.2. Figure 1 presents a visualization of the settling
process.
Given an initial program order S0, the settling process
proceeds in m+2 rounds. In the rth round, (1) the program
order Sr−1 from the end of the (r − 1)st round is taken as
2If two instructions access the same location, they cannot
reorder, so this assumption simplifies our analysis.
Figure 1: An instantiation of the settling process under TSO. LDs repeatedly settle upward with probability
1/2. If they fail to settle, or encounter another LD, they stop permanently, and the next-lowest LD begins. The
black boxes represent the critical instructions. The grey outlines indicate the currently settling instruction.
The bottom four instructions in the final order form the critical window.
the input, and (2) the rth instruction is settled in this pro-
gram order, which (3) creates the new program order Sr.
The final output of the settling process is the program or-
der Sm+2 after settling the critical store xm+2. Settling the
rth instruction in round r of the process works as follows.
Instruction xr is recursively reordered (that is, swapped in
the current program order) with its preceding instruction
(initially, this is the instruction at position r − 1), until a
reordering “fails,” in which case xr remains at its current
position in the program order. A reordering always fails if
the memory consistency model does not allow two opera-
tions of this type to be reordered. Otherwise, the reordering
succeeds with some fixed probability s, and fails with prob-
ability 1 − s.3 When a reordering fails, we move onto the
next round.
For ease of exposition, we will set both probabilities p
(from program model) and s to be 1/2 in subsequent sec-
tions. However, note that as long as s and p are constant,
the key theorems and conclusions derived in this paper re-
main fundamentally the same (though some of the numerical
values change somewhat).
Examples: In SC, no instructions are allowed to be re-
ordered; hence Sm+2 = S0. In WO, all types of reorderings
are allowed, so, starting from instruction 2 in the initial pro-
gram order, each instruction is settled using a series of swaps
with its preceding instructions, until with probability 1 − s
a swap fails. Then the next instruction is settled, and so
forth. TSO relaxes only the ST → LD ordering, which in
our model implies that a LD may reorder with a preceding
ST with probability s, but all other types of reorderings fail.
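The generation and settling steps described above can be sketched as a small simulation. This is a hedged illustration rather than the formal definition of Appendix A.2: the names `RELAXED`, `generate_program`, `settle`, `window_growth`, and `estimate_growth_probability` are hypothetical, the program length is a finite `m` instead of the limit, and the rule that the two critical instructions never swap follows footnote 2.

```python
import random

# Pairs (earlier_type, later_type) whose program order may be relaxed (Table 1).
RELAXED = {
    "SC":  set(),
    "TSO": {("ST", "LD")},
    "PSO": {("ST", "ST"), ("ST", "LD")},
    "WO":  {("ST", "ST"), ("ST", "LD"), ("LD", "ST"), ("LD", "LD")},
}

def generate_program(m, p, rng):
    """Return m random operations followed by the critical LD and critical ST.

    Each operation is a (type, is_critical) pair; ST is drawn with probability p.
    """
    body = [("ST" if rng.random() < p else "LD", False) for _ in range(m)]
    return body + [("LD", True), ("ST", True)]

def settle(program, model, s, rng):
    """Settle instructions one round at a time, from the top of the program down."""
    order = list(program)
    relaxed = RELAXED[model]
    for r in range(1, len(order)):        # the first instruction has nothing above it
        pos = r
        while pos > 0:
            above, current = order[pos - 1], order[pos]
            if above[1] and current[1]:
                break                     # critical pair shares a location (footnote 2)
            if (above[0], current[0]) not in relaxed:
                break                     # the memory model forbids this swap
            if rng.random() >= s:
                break                     # an allowed swap fails with probability 1 - s
            order[pos - 1], order[pos] = current, above
            pos -= 1
    return order

def window_growth(order):
    """Number of instructions strictly between the critical LD and critical ST."""
    ld = order.index(("LD", True))
    st = order.index(("ST", True))
    return st - ld - 1

def estimate_growth_probability(model, gamma, m=16, p=0.5, s=0.5,
                                trials=20000, seed=0):
    """Monte Carlo estimate of the probability that the window grows by gamma."""
    rng = random.Random(seed)
    hits = sum(
        window_growth(settle(generate_program(m, p, rng), model, s, rng)) == gamma
        for _ in range(trials)
    )
    return hits / trials
```

Calling `estimate_growth_probability("TSO", 1)` or `estimate_growth_probability("WO", 1)` gives an empirical view of how often a single instruction ends up between the critical load and store under each model.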
We will represent the result of a settling process by a permutation
on the indices. For thread $k$, $\pi^{(k)} : \{1, 2, \ldots, m+2\} \to \{1, 2, \ldots, m+2\}$
maps the instruction starting at position $i$ to its final settled position $\pi^{(k)}(i)$.
The settling process has two key features: (1) memory
model constraints are enforced (two operations can reorder
only if allowed by the memory model), and (2) reorderings
that are allowed occur with a fixed likelihood. One effect of
the latter property is that in the final program order, most
3A more general form of the settling model allows different
nonzero probabilities for different kinds of reorderings, depending
on the types of memory operations involved. For
example, $s_{\mathrm{LD,LD}}$ can be different from $s_{\mathrm{LD,ST}}$, even if both are
nonzero.
instructions will not move too far from their position in
the initial program order. The critical property of a memory
consistency model that we seek to capture is the degree to
which individual instructions can reorder beyond other in-
structions, and thus move further away from their original
position.
3.2 Thread Interleaving Model
We describe a second high-level random process, which is
used to determine the interleaving of n threads when they
are executed simultaneously on a multiprocessor. In fact,
the process is quite general, and may be of independent in-
terest as a probabilistic model. We first describe it in the
abstract, then discuss how it will be used to determine the
effect of the program model’s output on the probability of
bug manifestation.
Definition 1. Consider a sequence of $n$ positive line segments
originating at 0, having integer lengths $\bar{\gamma} = \gamma_1, \ldots, \gamma_n$.
A shift process translates the segments by i.i.d. geometric
random variables $s_1, \ldots, s_n$. Then the random event of interest,
called $A(\bar{\gamma})$, is the event that the segments are shifted
such that all are mutually disjoint. That is,
\[ A(\bar{\gamma}) := [s_i, s_i + \gamma_i] \cap [s_j, s_j + \gamma_j] = \emptyset \quad \forall\, i \neq j. \]
In Section 5, we will analyze the probability of $A(\bar{\gamma})$ for arbitrary
segment lengths $\bar{\gamma}$. However, to connect this model
to the task at hand, we will go on to think of these seg-
ment lengths as the critical windows of reordered programs
generated by the program model.
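The event in Definition 1 can be estimated numerically with a short Monte Carlo sketch. The parameterization of the geometric shifts is an assumption here (success probability `q = 0.5`, support starting at 0), since the formal definition is deferred to the appendix; the function names are likewise hypothetical.

```python
import random

def geometric(q, rng):
    """Sample the number of failures before the first success (support 0, 1, 2, ...)."""
    k = 0
    while rng.random() >= q:
        k += 1
    return k

def mutually_disjoint(shifts, lengths):
    """True if the closed segments [s_i, s_i + gamma_i] are pairwise disjoint."""
    segments = sorted((s, s + g) for s, g in zip(shifts, lengths))
    # After sorting by left endpoint, it suffices to compare neighbours.
    return all(left[1] < right[0] for left, right in zip(segments, segments[1:]))

def estimate_disjoint_probability(lengths, q=0.5, trials=50000, seed=0):
    """Monte Carlo estimate of the probability that all shifted segments are disjoint."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        shifts = [geometric(q, rng) for _ in lengths]
        hits += mutually_disjoint(shifts, lengths)
    return hits / trials
```

For example, `estimate_disjoint_probability([0, 0])` estimates the chance that two zero-length windows land on different time steps, and longer windows such as `[3, 3]` lower that chance because the shifts must differ by more than the window length.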
Recall that we study a canonical data race, for which correct
execution requires that each thread's pair of critical LD
and critical ST be executed atomically. We thus refer to
the sequence of instructions between the critical LD and ST
(inclusively) as the critical window of a thread. We let $B^k_\gamma$
be the event that the final ordering of thread $T_k$ inserts $\gamma$
instructions between the critical LD and ST (sometimes referred
to as the critical window growth of a memory model).
Manifestation of the data race corresponds exactly to the
event that when the reordered threads are executed in parallel,
some pair of critical windows are not executed disjointly.
We let $A$ refer to the event that critical windows are disjoint.
One can then think of $\Pr[B^k_\gamma]$ and $\Pr[A]$ as the two fundamental
values we seek to characterize in this paper, each
a measure of the vulnerability of a memory model to this
canonical data race.
The shift model is used to simulate the parallel execution
of the critical windows of each thread, under the following
assumptions. All threads are assumed to initially be identical
copies of a single program, generated randomly as in
Section 3.1.1. Each thread is then independently reordered
according to the process of Section 3.1.2. We then simulate
the parallel execution of the reordered threads by placing
the final instruction of each critical window at the origin of the
number line (here representing time in reverse, with 0 being
the final time step of execution), and using the shift model
of Definition 1 to model the varying rates of execution of
each thread. After shifting, the execution of each instruc-
tion is assumed to take one unit of time; instructions begin
and end synchronously across all threads, in lock-step. We
assume that instructions instantaneously read the current
state of the system at the beginning of the time step, and
instantaneously commit their changes at the end of the time
step. In this way we ensure a clear semantics for the state
of the system at any given time: when a LD executes, it
observes all the effects of any ST that completed in a time
step preceding it.
We can now observe the circumstances in which a data
race manifests. There must be two threads such that, sub-
sequent to reordering, the final regions of time steps between
the critical LD and ST (inclusive) overlap with one another.
In this case the data race must manifest, because one of the
LDs must observe a value after (or simultaneous to) the other
LD being observed, but before the other ST has committed.
A formal definition and a graphical visualization of the
shift process are given in Appendix A.3 (see Figure 2).
4. THE CRITICAL WINDOW
In this section, we study what is perhaps the core com-
ponent of our random process, and the only one that di-
rectly distinguishes the memory models: the reordering of
instructions within an individual thread. In particular, we
are interested in the final distribution of the size of the crit-
ical window between the critical LD and ST. For the ex-
treme memory models of Sequential Consistency and Weak
Ordering, we are easily able to exactly characterize this dis-
tribution. The bulk of the technical challenge of this section
(and consequently of later sections) is in establishing results
for the more subtle model, Total Store Order. By carefully
conditioning on several auxiliary random variables, lower
bounding complex algebraic terms by their low-indexed val-
ues, and utilizing a bound on the partition number of certain
integers, we derive rather sharp approximations for the dis-
tribution of the critical window size. These bounds will in
subsequent sections be plugged into derived formulae for the
probability of bug manifestation, as a function of the thread
interleaving process. Though the results in this section are
tailored specifically to the thread generation and reordering
processes specified in the previous section, it is worthwhile
to observe how the asymptotics of the overall bug manifes-
tation probability will not depend delicately on the details
of this process.
We will be estimating the critical window growth, $\Pr[B^k_\gamma]$,
for a select set of memory models. Recall that $B^k_\gamma$ is the
event that the thread $T_k$ inserts $\gamma$ instructions between the
critical LD and ST in reordering. Because we will be considering
a single fixed thread in this subsection, we will refer
to the event $B^k_\gamma$ by $B_\gamma$, and the permutation $\pi^{(k)}$ by $\pi$. The
first two memory models can be considered a warmup for
the substantially more challenging case of Total Store Order.
All of these results are captured in the following theorem.
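Theorem 4.1. The critical window growth behaves according
to the following functions:
• Sequential Consistency:
\[ \Pr[B_\gamma] = \begin{cases} 1 & \text{if } \gamma = 0, \\ 0 & \text{if } \gamma > 0. \end{cases} \]
• Weak Ordering:
\[ \Pr[B_\gamma] = \begin{cases} 2/3 & \text{if } \gamma = 0, \\ 2^{-\gamma}/3 & \text{if } \gamma > 0. \end{cases} \]
• Total Store Order:
\[ \Pr[B_\gamma] = \begin{cases} 2/3 & \text{if } \gamma = 0, \\ \frac{6}{7} \cdot 4^{-\gamma} + R(\gamma) \cdot 2^{-\gamma} & \text{if } \gamma > 0, \end{cases} \]
for a non-negative approximation term $R(\gamma) \le \frac{2}{21}$.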
Observe that the critical window grows at vastly different
rates across the models. Up to lower-order terms, the probability
of a window size $\gamma$ is $2^{-\gamma}$ in Weak Ordering, $(2^{-\gamma})^2$
in Total Store Order, and 0 in Sequential Consistency. It
remains to be seen in later sections the extent to which this
window size affects bug manifestation.
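As a quick way to tabulate the distributions of Theorem 4.1, the sketch below evaluates them with exact rational arithmetic. The function names are hypothetical, and because the theorem only bounds $R(\gamma)$, the TSO case is returned as a (lower, upper) pair obtained by setting $R(\gamma)$ to 0 and to 2/21.

```python
from fractions import Fraction

R_MAX = Fraction(2, 21)   # upper bound on R(gamma) stated in Theorem 4.1

def sc(gamma):
    """Pr[B_gamma] under Sequential Consistency."""
    return Fraction(1) if gamma == 0 else Fraction(0)

def wo(gamma):
    """Pr[B_gamma] under Weak Ordering."""
    return Fraction(2, 3) if gamma == 0 else Fraction(1, 3 * 2**gamma)

def tso_bounds(gamma):
    """Return (lower, upper) bounds on Pr[B_gamma] under Total Store Order."""
    if gamma == 0:
        return Fraction(2, 3), Fraction(2, 3)
    lower = Fraction(6, 7) / 4**gamma          # taking R(gamma) = 0
    upper = lower + R_MAX / 2**gamma           # taking R(gamma) = 2/21
    return lower, upper

if __name__ == "__main__":
    for gamma in range(5):
        print(gamma, sc(gamma), wo(gamma), tso_bounds(gamma))
```

A useful sanity check is that the Weak Ordering values sum to 1 over all window sizes, as a probability distribution should.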
Proof (Theorem 4.1—Sequential Consistency).
Under sequential consistency, no instruction is ever allowed
to reorder. Hence $\Pr[B_0] = 1$, and $\Pr[B_\gamma] = 0$ for all $\gamma \neq 0$.
We next consider the case of intermediate difficulty: Weak
Ordering.
Proof (Theorem 4.1—Weak Ordering).
Under weak ordering, all four ordered pairs of instruction
types are allowed to pass one another. Recall that we as-
sume a strong normal form, in which all possible swaps occur
with probability 1/2. Hence in weak ordering, each subsequent
instruction continually moves up with probability 1/2
until a swap fails. This applies to the critical load
and critical store as well, with the exception that the critical
store will never pass the critical load (because they access
the same address). To calculate the probability, we condition
on the resting position of the critical LD, which entails
a given resting position for the critical ST, for any $\gamma > 0$.
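\[
\begin{aligned}
\Pr[B_\gamma] &= \Pr[\pi(m+2) - \pi(m+1) = \gamma + 1] \\
&= \sum_{i=\gamma}^{\infty} \Pr[\pi(m+1) = m+1-i] \cdot \Pr[\pi(m+2) = m+2-i+\gamma \mid \pi(m+1) = m+1-i] \\
&= \sum_{i=\gamma}^{\infty} 2^{-(i+1)} \, 2^{-(i+1-\gamma)} = \frac{2^{-\gamma}}{3}.
\end{aligned}
\]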
We must handle the case of γ = 0 separately, because here
the critical ST stops moving “automatically,” when it runs
up against the critical LD.
\[
\begin{aligned}
\Pr[B_0] &= \sum_{i=0}^{\infty} \Pr[\pi(m+1) = m+1-i] \cdot \Pr[\pi(m+2) = m+2-i \mid \pi(m+1) = m+1-i] \\
&= \sum_{i=0}^{\infty} 2^{-(i+1)} \, 2^{-i} = 2/3.
\end{aligned}
\]
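For readers who want the intermediate algebra, both sums in the Weak Ordering proof are geometric series in $4^{-i}$. The following worked step is an expansion of the closed forms stated above and uses only the standard identity $\sum_{i=a}^{\infty} 4^{-i} = \frac{4}{3} \cdot 4^{-a}$.

```latex
\[
\sum_{i=\gamma}^{\infty} 2^{-(i+1)} \, 2^{-(i+1-\gamma)}
  = 2^{\gamma-2} \sum_{i=\gamma}^{\infty} 4^{-i}
  = 2^{\gamma-2} \cdot \frac{4}{3} \cdot 4^{-\gamma}
  = \frac{2^{-\gamma}}{3},
\qquad
\sum_{i=0}^{\infty} 2^{-(i+1)} \, 2^{-i}
  = \frac{1}{2} \sum_{i=0}^{\infty} 4^{-i}
  = \frac{1}{2} \cdot \frac{4}{3}
  = \frac{2}{3}.
\]
```

Summing over all window sizes gives $\frac{2}{3} + \sum_{\gamma \ge 1} \frac{2^{-\gamma}}{3} = 1$, as expected.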
Finally we turn to the far more challenging setting of Total
Store Order.
Proof (Theorem 4.1—Total Store Order).
One of the strongest and most commonly used relaxed mem-
ory models, Total Store Order (TSO) only permits loads to
swap with stores. Hence in calculating the distribution of
window size, we need only consider the number of stores
located directly before the critical load. Those stores will
never move themselves, and the critical load can never swap
past the first load above it. Moreover, the critical store never
swaps with anything, so its final position is fixed.
However, deriving bounds on Pr[Bγ ] is difficult. LD oper-
ations may reorder past ST operations, thus pushing longer
sequences of ST operations together. In this section we de-
rive bounds on the critical window growth for TSO, which
is a core technical contribution of this paper. The proof is
quite involved. Much difficulty arises in gaining control over
the relative positions of LDs and STs. We outline the steps
taken to estimate the critical window growth below. The
majority of these steps are non-trivial, and often involve
delicate case analyses.
Proof Outline.
1. Express the critical window probability in terms of a se-
ries of new random variables, Lµ: the event that the
second-to-last reordering leaves exactly µ contiguous STs
above the critical LD.
2. To calculate the probability of Lµ, condition on the value
of another series of random variables, Ψµ: the number
of LDs initially between the critical LD and the (µ+1)th-lowest
ST.
3. Express the Ψµ-conditioned probability of Lµ in terms
of the limit of the fraction of STs near the bottom of a
reordered thread, and another probability, Pr[Fµ|Ψµ =
q]: the chance of q LDs all reordering out of a region of
at least µ STs.
4. To estimate Pr[Fµ|Ψµ = q], condition on a new random
variable, ∆: the sum, over STs, of the number of LDs
below each ST. Express the probability of ∆ in terms
of the weighted sum of several integer partition numbers,
and estimate these by a simple lower bound.
5. After combining the above elements to bound the proba-
bility of Lµ, lower bound an ugly term of this expression
by its value at µ = 1, checking via the derivative that
this term is increasing in µ.
6. Use the lower bound on the probability of Lµ to finally
lower bound the probability of a given window size. To
achieve an upper bound, calculate the total probability
not attributed to some Lµ in the lower bound, and at-
tribute it to the worst-possible case.
We now move on to execute this plan in detail.
Step 1—Number of contiguous STs above the criti-
cal LD: Recall that S0 (Sm+2) denotes the initial (final)
instruction order, and that Sm refers to the instruction or-
der just before the critical load is settled. For convenience,
we define the following basic random events. Let $S_{LD,i}(j)$ be
the event that the $j$th instruction of $S_i$ is a LD. Furthermore,
we define $S_{LD,i}(j,k) = \bigwedge_{\ell=j}^{k} S_{LD,i}(\ell)$ as the event
that the entire contiguous range from $j$ to $k$ in $S_i$ consists
of LDs. $S_{ST,i}(j)$ and $S_{ST,i}(j,k)$ are defined accordingly.
For µ ∈ N, we define Lµ as the event that in Sm, there are
exactly µ ST operations immediately preceding the critical
LD. In other words,
\[ L_\mu = S_{LD,m}(m-\mu) \wedge S_{ST,m}(m-\mu+1,\, m). \]
The critical LD may only move γ positions if there are at
least γ contiguous ST operations above it. Hence for any γ,
we have
\[ \Pr[B_\gamma] = \sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr[L_\mu]. \]
Deriving $\Pr[B_\gamma \mid L_\mu]$ is straightforward. If $\mu = \gamma$, we have
$\Pr[B_\gamma \mid L_\gamma] = 2^{-\gamma}$, as the critical LD must pass all $\gamma$ STs.
After that, it stops because the next instruction is a LD. For
$\mu > \gamma$, we have $\Pr[B_\gamma \mid L_\mu] = 2^{-(\gamma+1)}$, because the instruction
above the $\gamma$th ST is also a ST. Hence there is only a
1/2 probability of the reordering completing when it reaches
that point.
It remains to derive bounds for Pr[Lµ] for all µ. This is
the primary technical lemma of the proof.
Lemma 4.2. For any $\mu > 0$, $\Pr[L_\mu] \ge \frac{4}{7} \cdot 2^{-\mu}$. Moreover,
$\Pr[L_0] = 1/3$ exactly.
Proof. We will approach this lemma by asking (1) how
many LDs are interspersed among the first µ STs above the
critical LD, and (2) what is the probability that all of those
LDs settle such that we are left with µ contiguous STs above
the critical LD. Because STs cannot settle past LDs in this
model, nothing happens during rounds in which a ST can
move; the technical difficulty arises in the motion of the LDs.
Step 2—Number of interspersed LDs: In the initial
program order S0, let Φµ refer to the position of the µth-
lowest non-critical ST. Formally,
Φµ = min{i : |{j ≥ i : SST,0(j)}| = µ+ 1}.
Furthermore, let Ψµ refer to the number of LD operations
above the critical LD but below the µth-lowest non-critical
ST. That is,
Ψµ = m+ 1− µ− Φµ.
Note that as the program length goes to infinity, the prob-
ability that such a Φµ and Ψµ exist goes to 1. Now we can
express Pr[Lµ] as
\[ \Pr[L_\mu] = \sum_{q=0}^{\infty} \Pr[L_\mu \mid \Psi_\mu = q] \cdot \Pr[\Psi_\mu = q]. \tag{1} \]
We have $\Pr[\Psi_\mu = q] = 2^{-\mu} 2^{-q} \binom{\mu+q-1}{q}$ because there are
$\binom{\mu+q-1}{q}$ ways to build a string of $\mu$ STs and $q$ LDs such that
the top instruction is a ST.
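As a sanity check on this distribution, the following minimal sketch evaluates the stated formula for Pr[Ψµ = q] with exact rational arithmetic and accumulates its mass over q. The function names `pr_psi` and `partial_mass` are illustrative and not from the paper, and the check assumes µ ≥ 1.

```python
from fractions import Fraction
from math import comb


def pr_psi(mu: int, q: int) -> Fraction:
    """Pr[Psi_mu = q] = 2^-mu * 2^-q * C(mu+q-1, q), assuming mu >= 1."""
    return Fraction(comb(mu + q - 1, q), 2 ** (mu + q))


def partial_mass(mu: int, q_max: int) -> Fraction:
    """Total probability assigned to q = 0, 1, ..., q_max."""
    return sum((pr_psi(mu, q) for q in range(q_max + 1)), Fraction(0))


if __name__ == "__main__":
    for mu in (1, 2, 5):
        # The partial sums should approach 1 as q_max grows.
        print(mu, float(partial_mass(mu, 200)))
```

The partial sums approaching 1 is what one would expect if the formula defines a proper distribution over q in the infinite-program limit.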
Step 3—Probability of interspersed LDs settling out:
The difficult part of bound (1) is Pr[Lµ|Ψµ = q]. This is the
probability that
(A) All q LDs between the ST at Φµ and the critical LD
settle up until they pass the ST at Φµ,
(B) but do not settle so far that the settled instruction
above the ST at Φµ is another ST.
(B) is due to the fact that Lµ specifies that there be exactly
µ STs above the critical LD. The probability of (B) relies
on the instruction directly above Φµ in SΦµ−1. If it is a LD,
then (B) holds automatically, since all the LDs must stop
settling. However, if it is a ST, then (B) only holds if not
all of the q LDs that have passed the ST at Φµ also pass the
next-highest ST. Hence this is the first property on which
we condition.
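\[ \Pr[L_\mu \mid \Psi_\mu = q] = \Pr[L_\mu \wedge S_{LD,\Phi_\mu-1}(\Phi_\mu-1) \mid \Psi_\mu = q] + \Pr[L_\mu \wedge S_{ST,\Phi_\mu-1}(\Phi_\mu-1) \mid \Psi_\mu = q]. \]
By Bayes' Law,
\[ \Pr[L_\mu \wedge S_{LD,\Phi_\mu-1}(\Phi_\mu-1) \mid \Psi_\mu = q] = \Pr[S_{LD,\Phi_\mu-1}(\Phi_\mu-1) \mid \Psi_\mu = q] \cdot \Pr[L_\mu \mid S_{LD,\Phi_\mu-1}(\Phi_\mu-1) \wedge \Psi_\mu = q]. \]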
We first consider the latter term. Because the final instruc-
tion that settles above Φµ will be a LD under these condi-
tions, this depends only on the bottom µ instructions settled
above the critical LD being STs. For shorthand, let
Fµ = SST,m(m− µ+ 1, m).
Then
Pr[Lµ|SLD,Φµ−1(Φµ − 1) ∧Ψµ = q] = Pr[Fµ|Ψµ = q].
In contrast, for $L_\mu$ to hold given $S_{ST,\Phi_\mu-1}(\Phi_\mu-1)$, it does
not suffice for the $q$ LDs to move past $\Phi_\mu$. They must also
not all settle past the next-highest instruction. All $q$ of them
settle past it with probability $2^{-q}$. Hence
\[ \Pr[L_\mu \mid S_{ST,\Phi_\mu-1}(\Phi_\mu-1) \wedge \Psi_\mu = q] = \Pr[F_\mu \mid \Psi_\mu = q] \cdot (1 - 2^{-q}). \]
Putting these expressions together, we find that
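\[
\begin{aligned}
\Pr[L_\mu \mid \Psi_\mu = q]
&= \Pr[F_\mu \mid \Psi_\mu = q] \cdot \Pr[S_{LD,\Phi_\mu-1}(\Phi_\mu-1)] \\
&\quad + \Pr[F_\mu \mid \Psi_\mu = q] \cdot \Pr[S_{ST,\Phi_\mu-1}(\Phi_\mu-1)] \cdot (1 - 2^{-q}) \\
&= \Pr[F_\mu \mid \Psi_\mu = q] \cdot \bigl(1 - 2^{-q} \cdot \Pr[S_{ST,\Phi_\mu-1}(\Phi_\mu-1)]\bigr).
\end{aligned}
\]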
We first derive an exact value for Pr[SST,i(i)]. Though it
is difficult to determine the probability that a given instruc-
tion is a ST in general, this particular value can be derived
exactly through a recurrence relation.
Claim 4.3.
\[ \lim_{i \to \infty} \Pr[S_{ST,i}(i)] = 2/3. \]
Proof. After reordering stage i, instruction i can be a
ST in one of two ways. Either it can initially be a ST, (in
which case it never reorders) or it can initially be a LD, the
instruction above it can be settled as a ST, and the two can
swap. Hence
\[ \Pr[S_{ST,i}(i)] = \frac{1}{2} + \frac{1}{2} \cdot \Pr[S_{ST,i-1}(i-1)] \cdot \frac{1}{2}. \]
This is a recurrence relation of the form $X_i = b + aX_{i-1}$, which
has the solution $X_i = \frac{b}{1-a} + a^{i-1}\bigl(X_1 - \frac{b}{1-a}\bigr)$. Plugging in
$X_1 = 1/2$, $a = 1/4$, $b = 1/2$, we find
\[
\Pr[S_{ST,i}(i)] = \frac{1/2}{1 - 1/4} + (1/4)^{i-1}\left(\frac{1}{2} - \frac{1/2}{1 - 1/4}\right)
= 2/3 + (1/4)^{i-1}(1/2 - 2/3).
\]
The resulting probability is a function of $i$, but we are interested
in the steady state as the size of the program goes
to infinity. Hence the second term vanishes:
\[ \lim_{i \to \infty} \Pr[S_{ST,i}(i)] = 2/3. \]
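The recurrence of Claim 4.3 is easy to check numerically. The sketch below iterates X_i = 1/2 + (1/4) X_{i-1} from X_1 = 1/2 in exact arithmetic and compares it with the closed form; the helper names are illustrative only.

```python
from fractions import Fraction


def st_probability(i: int) -> Fraction:
    """Iterate X_i = 1/2 + (1/4) * X_{i-1}, starting from X_1 = 1/2."""
    x = Fraction(1, 2)
    for _ in range(i - 1):
        x = Fraction(1, 2) + Fraction(1, 4) * x
    return x


def closed_form(i: int) -> Fraction:
    """X_i = b/(1-a) + a^(i-1) * (X_1 - b/(1-a)) with a = 1/4, b = 1/2."""
    a, b, x1 = Fraction(1, 4), Fraction(1, 2), Fraction(1, 2)
    fixed_point = b / (1 - a)  # equals 2/3
    return fixed_point + a ** (i - 1) * (x1 - fixed_point)


if __name__ == "__main__":
    for i in (1, 2, 5, 10, 30):
        print(i, float(st_probability(i)), float(closed_form(i)))
```

Because the correction term shrinks by a factor of 4 per step, the value is already very close to 2/3 for instructions a handful of positions into the program.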
Now that we know the typical fraction of instructions near
the bottom of the program that are STs after reordering, we
can derive a bound on Pr[Fµ|Ψµ = q].
Step 4—Estimating Pr[Fµ|Ψµ = q]:
Claim 4.4.
\[ \Pr[F_\mu \mid \Psi_\mu = q] \ge \frac{2^{-(q-1)} - 2^{-\mu q}}{\binom{\mu+q-1}{q}}. \]
Proof. Everything in this proof is implicitly conditioned
on the event Ψµ = q. Let the random variable
\[ \Delta = \sum_{\Phi_\mu < i \le m \,:\, \tau_{LD,0}(i)} \bigl|\{\Phi_\mu \le j < i : \tau_{ST,0}(j)\}\bigr| \]
represent the total number of positions that LDs from $\Phi_\mu$ to
$m$ must move up, in order to leave a sequence of $\mu$ STs immediately
above the critical LD. It must be that $\Delta \ge q$, because
at least instruction $\Phi_\mu$ is a ST, and $\Delta \le \mu q$, because no LD
can be required to pass more than $\mu$ STs. With this definition,
we may write
\[ \Pr[F_\mu \mid \Psi_\mu = q] = \sum_{\delta=q}^{\mu q} \Pr[\Delta = \delta] \cdot 2^{-\delta}. \]
The exact value of Pr[∆ = δ] can be stated formally, but
not in a closed form. Namely, let φ(x, y, z) be the number
of distinct multi-sets of y positive integers summing to x,
such that each integer is at most z. This is a variant on
the much-studied partition number of x. Then φ(δ, q, µ) is
exactly the number of arrangements of q LDs and µ STs (be-
ginning with a ST) such that δ is the sum of the number of
STs above each of the LDs. (For each LD, we simply select
how many STs to place it below—the relative order of the
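LDs is immaterial.) There are $\binom{\mu+q-1}{q}$ total arrangements
of LDs and STs beginning with a ST. Hence
\[ \Pr[\Delta = \delta] = \frac{\varphi(\delta, q, \mu)}{\binom{\mu+q-1}{q}}, \]
and
\[ \Pr[F_\mu \mid \Psi_\mu = q] = \sum_{\delta=q}^{\mu q} \frac{\varphi(\delta, q, \mu)}{\binom{\mu+q-1}{q}} \cdot 2^{-\delta}. \]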
Simple forms for φ(x, y, z) are not known. Asymptotic
results exist, but are not helpful here because the terms with
small parameters have the largest contributions. However,
to achieve a good bound it suffices to show that φ(δ, q, µ) ≥ 1
when q ≤ δ ≤ µq. To show that a partition exists that
achieves any number in this range, consider the following
construction. Set δ mod q of the integers to ⌈δ/q⌉, and set
the rest of the integers to ⌊δ/q⌋. We can set the integers this
large, because δ/q ≤ (µq)/q = µ. Then the chosen integers
sum to (δ mod q)⌈δ/q⌉+(q− (δ mod q))⌊δ/q⌋ which can be
shown to be exactly δ. Hence we may write
\[ \Pr[F_\mu \mid \Psi_\mu = q] \ge \frac{1}{\binom{\mu+q-1}{q}} \sum_{\delta=q}^{\mu q} 2^{-\delta} = \frac{2^{-(q-1)} - 2^{-\mu q}}{\binom{\mu+q-1}{q}}. \]
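To make the counting argument concrete, the following sketch computes the restricted partition number φ(x, y, z) by direct recursion, evaluates the expression for Pr[Fµ|Ψµ = q] given in the proof, and compares it with the lower bound of Claim 4.4. The recursion and the names `phi`, `pr_f_exact`, and `pr_f_bound` are an illustration written for this check, not the authors' code.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def phi(x: int, y: int, z: int) -> int:
    """Number of multisets of y positive integers, each at most z, summing to x."""
    if y == 0:
        return 1 if x == 0 else 0
    if x < y or z < 1:
        return 0
    # Choose the largest element k, then fill the rest with values <= k.
    return sum(phi(x - k, y - 1, k) for k in range(1, min(z, x) + 1))


def pr_f_exact(mu: int, q: int) -> Fraction:
    """Sum over delta of phi(delta, q, mu) / C(mu+q-1, q) * 2^-delta."""
    total = comb(mu + q - 1, q)
    return sum(
        (Fraction(phi(d, q, mu), total) / 2 ** d for d in range(q, mu * q + 1)),
        Fraction(0),
    )


def pr_f_bound(mu: int, q: int) -> Fraction:
    """Lower bound (2^-(q-1) - 2^-(mu*q)) / C(mu+q-1, q) from Claim 4.4."""
    return (Fraction(2, 2 ** q) - Fraction(1, 2 ** (mu * q))) / comb(mu + q - 1, q)


if __name__ == "__main__":
    for mu in range(1, 5):
        for q in range(0, 5):
            print(mu, q, float(pr_f_exact(mu, q)), float(pr_f_bound(mu, q)))
```

For small µ and q the bound is tight or nearly so, which is consistent with the remark that the terms with small parameters contribute the most.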
Having derived a bound for Pr[Fµ|Ψµ = q], we are now in
a position to conclude the proof of Lemma 4.2. First note
that Pr[L0] = 1/3, by Claim 4.3. For values of µ greater
than 0, Claim 4.4 will be the central tool in the proof, which
is left to Appendix B.1.
The remainder of the proof of Theorem 4.1, steps 5 and 6,
is deferred to Appendix B.1.
5. SHIFT PROCESS
Here we discuss the next component of our analysis: a
“shift process” meant to capture the interleaving of reordered
threads. We refer the reader back to the definition in Section
3.2. This process is where the critical windows derived from
the reordering process come into effect.
In the analysis that follows, we assume that each critical
window’s shift is distributed geometrically, representing the
intuition that threads are exponentially less likely to execute
at progressively increasing offsets from one another. Let
γ¯ = (γ1, γ2, . . . , γn) ∈ Nn be a sequence of integral “segment
lengths.” In subsequent sections, γk will be used to represent
the length of the critical window of thread Tk. We define a
shift process on γ¯ as follows. Consider n segments of the line,
of lengths γ1, γ2, . . . , γn, and let the starting point of each
segment be shifted up from 0 by an i.i.d. positive random
variable si. We are interested in the probability that the
resulting set of shifted segments is non-overlapping. In other
words, we would like to bound $\Pr[A(\bar\gamma)]$, where $A(\bar\gamma)$ is the
event that $\forall i \ne j \in \{1, 2, \ldots, n\}$, we have $[s_i, s_i + \gamma_i] \cap
[s_j, s_j + \gamma_j] = \emptyset$.
The following theorem states this probability precisely,
and as such is not particularly enlightening on its own. How-
ever, when the segment lengths are random variables drawn
from a well-understood distribution (as they are in the case
of reordered random threads), we will be able to state the
probability concisely.
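Theorem 5.1.
\[
\Pr[A(\bar\gamma)] = \frac{2^{-\left(\binom{n+1}{2} - 1\right)}}{\prod_{i=1}^{n-1} \bigl(1 - 2^{-(n+1-i)}\bigr)} \sum_{\sigma \in \mathrm{Sym}_n} \prod_{i=1}^{n-1} 2^{-(n-i)\gamma_{\sigma(i)}},
\]
where $\mathrm{Sym}_n$ is the symmetric group of degree $n$: the set of
all permutations on $n$ elements.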
The following corollary simplifies this expression:
Corollary 5.2. For some $c(n) \in [2, 4]$,
\[
\Pr[A(\bar\gamma)] = c(n) \cdot 2^{-\binom{n+1}{2}} \cdot \sum_{\sigma \in \mathrm{Sym}_n} \prod_{i=1}^{n-1} 2^{-(n-i)\gamma_{\sigma(i)}}.
\]
In particular, $c(2) = \frac{8}{3}$ exactly.
The proof of the corollary is in Appendix B.2. We now turn
to the proof of the main theorem. The challenge is to char-
acterize the probability that the next segment is shifted to a
position disjoint from all previous segments. At first glance,
it is difficult to handle the huge and diverse set of legal place-
ments for a set of segments. Our key insight is to condition
on the relative order of the magnitude of the shifts. We
then consider the probability that each segment is disjoint
from the previous threads in this order. In so doing, we are
able to exploit the memorylessness of the geometric distri-
bution. Let t be an arbitrary segment, and t′ be the segment
immediately preceding it in this order. To understand the
distribution of disjoint placements for t, we need only know
the distribution of the origin of t′. Then by assuming that
the segments are disjoint, we can infer that the origin of t is
distributed according to the origin of t′, plus the length of t′,
plus an independent geometric random variable.
Proof (Theorem 5.1). Let si be a geometric random
variable with expectation 2 (i.e., $s_i = k$ with probability
$2^{-(k+1)}$ $\forall k \in \mathbb{N}$). In order to analyze the probability of
A(γ¯), we will take the following steps. We will first con-
dition on the ordering of the segments. Then for a given
ordering, we will use the memorylessness of the shift vari-
ables to calculate the probability of each successive segment
being disjoint from each previous.
For a permutation σ on {1, 2, . . . , n}, let Yσ be the event
that for all i, the ith largest shift occurs on segment σ(i).
That is, $s_{\sigma(1)} \ge s_{\sigma(2)} \ge \cdots \ge s_{\sigma(n)}$. Then
\[ \Pr[A(\bar\gamma)] = \sum_{\sigma \in \mathrm{Sym}_n} \Pr[A(\bar\gamma) \wedge Y_\sigma]. \]
We now analyze Pr[A(γ¯)∧Yσ]. We will refer to this event
by A(γ¯, σ). For all segments to be disjoint, it must be the
case that each segment begins after the end of every seg-
ment that began before it. σ captures exactly the order in
which segments begin. So disjointness means that for all i,
j s.t. σ(j) > σ(i), segment j begins after the end of seg-
ment i. Hence for each i, we may condition on the shift
of the segment with the ith largest shift, and consider the
probability that each segment with a smaller shift follows its
completion.
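\[
\begin{aligned}
\Pr[A(\bar\gamma, \sigma)]
&= \sum_{\ell_1=0}^{\infty} \Pr[A(\bar\gamma, \sigma) \wedge s_{\sigma(1)} = \ell_1] \\
&= \sum_{\ell_1=0}^{\infty} \Pr\Bigl[A(\bar\gamma, \sigma) \wedge s_{\sigma(1)} = \ell_1 \wedge \bigwedge_{i=2}^{n} s_{\sigma(i)} \ge \ell_1 + \gamma_{\sigma(1)}\Bigr] \\
&= \sum_{\ell_1=0}^{\infty} \Pr\Bigl[A(\bar\gamma, \sigma) \Bigm| s_{\sigma(1)} = \ell_1 \wedge \bigwedge_{i=2}^{n} s_{\sigma(i)} \ge \ell_1 + \gamma_{\sigma(1)}\Bigr]
\cdot \Pr[s_{\sigma(1)} = \ell_1] \cdot \prod_{i=2}^{n} \Pr[s_{\sigma(i)} \ge \ell_1 + \gamma_{\sigma(1)}].
\end{aligned}
\]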
The third equality is due to the independence of the shift
variables. Let $\bar\gamma^i$ refer to the restriction of $\bar\gamma$ to the segment
indices with the $n - i + 1$ smallest shifts (i.e., $\bar\gamma^i =
\bar\gamma|_{[n] \setminus \bigcup_{j=i}^{n} \sigma(j)}$). Similarly, let $\sigma^i$ refer to the restriction of $\sigma$
to the $n - i + 1$ smallest shifts (i.e., $\sigma^i = \sigma|_{[n] \setminus [i-1]}$). We
define these structures so that we can express the disjointness
event in terms of a new disjointness event on a smaller
set of unconditioned segments. In particular, let $A(\bar\gamma^i, \sigma^i)$
be the disjointness event for an independent random shift
process on segments $\sigma(i), \sigma(i+1), \ldots, \sigma(n)$, with permutation
$\sigma^i$ pointing to the new indices of these segments. We
will see that we are permitted to condition on such a prior
event, because of the memorylessness of the shift variables.
Conditioned on the first segment being disjoint from all
the following segments, we need only consider the event
A(γ¯2, σ). Then due to the memorylessness of the shifts,
we have
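\[
\begin{aligned}
\Pr\Bigl[A(\bar\gamma^i, \sigma^i) \Bigm| s_{\sigma^i(1)} = \ell_1 \wedge \bigwedge_{j=2}^{n} s_{\sigma^i(j)} \ge \ell_1 + \gamma_{\sigma^i(1)}\Bigr]
&= \Pr\Bigl[A(\bar\gamma^{i+1}, \sigma^{i+1}) \Bigm| \bigwedge_{j=2}^{n} s_{\sigma^i(j)} \ge \ell_1 + \gamma_{\sigma^i(1)}\Bigr] \\
&= \Pr\Bigl[A(\bar\gamma^{i+1}, \sigma^{i+1}) \Bigm| \bigwedge_{j=2}^{n} s_{\sigma^i(j)} \ge 0\Bigr] = \Pr[A(\bar\gamma^{i+1}, \sigma^{i+1})].
\end{aligned}
\]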
We now observe a simple recurrence relation that defines
Pr[A(γ¯i, σi)].
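\[
\begin{aligned}
\Pr[A(\bar\gamma^i, \sigma^i)]
&= \sum_{\ell_1=0}^{\infty} \Pr[A(\bar\gamma^{i+1}, \sigma^{i+1})] \cdot \Pr[s_{\sigma^i(1)} = \ell_1] \cdot \prod_{j=i+1}^{n} \Pr[s_{\sigma^i(j)} \ge \ell_1 + \gamma_{\sigma^i(1)}] \\
&= \sum_{\ell_1=0}^{\infty} \Pr[A(\bar\gamma^{i+1}, \sigma^{i+1})] \cdot \tfrac{1}{2} 2^{-\ell_1} \cdot \prod_{j=i+1}^{n} \tfrac{1}{2} \cdot 2^{-(\ell_1 + \gamma_{\sigma^i(1)})} \\
&= \sum_{\ell_1=0}^{\infty} \Pr[A(\bar\gamma^{i+1}, \sigma^{i+1})] \cdot 2^{-(\ell_1 + 1 + (n-i)(\ell_1 + \gamma_{\sigma^i(1)} + 1))} \\
&= 2^{-(1 + (n-i)(\gamma_{\sigma^i(1)} + 1))} \cdot \Pr[A(\bar\gamma^{i+1}, \sigma^{i+1})] \sum_{\ell_1=0}^{\infty} \bigl(2^{-(n-i+1)}\bigr)^{\ell_1} \\
&= \frac{2^{-(1 + (n-i)(\gamma_{\sigma^i(1)} + 1))}}{1 - 2^{-(n-i+1)}} \cdot \Pr[A(\bar\gamma^{i+1}, \sigma^{i+1})].
\end{aligned}
\]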
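Moreover, it is clear that $\Pr[A(\bar\gamma^n, \sigma^n)] = 1$. Then noting
that $\sigma^i(1) = \sigma(i)$, the solution is immediate:
\[
\Pr[A(\bar\gamma^1, \sigma^1)] = \prod_{i=1}^{n-1} \frac{2^{-(n+1-i) - (n-i)\gamma_{\sigma(i)}}}{1 - 2^{-(n+1-i)}}
= \frac{2^{-\left(\binom{n+1}{2} - 1\right)}}{\prod_{i=1}^{n-1} \bigl(1 - 2^{-(n+1-i)}\bigr)} \cdot \prod_{i=1}^{n-1} 2^{-(n-i)\gamma_{\sigma(i)}}.
\]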
Finally, plugging these terms into the overall probability of
disjointness yields the expression in the theorem. We will
use this expression in the next section to calculate the prob-
ability of bug manifestation.
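The closed form of Theorem 5.1 can be cross-checked against a brute-force sum over shifts for small n. The sketch below is one such check under stated assumptions: shifts take value k with probability 2^-(k+1), segments are the closed integer intervals [s_i, s_i + γ_i] from the definition of A(γ̄), and the brute-force sum truncates shifts at a hypothetical cutoff `max_shift`. Function names are illustrative.

```python
from itertools import permutations, product
from math import comb


def pr_disjoint_formula(gamma):
    """Evaluate the expression of Theorem 5.1 for segment lengths gamma."""
    n = len(gamma)
    denom = 1.0
    for i in range(1, n):
        denom *= 1 - 2.0 ** -(n + 1 - i)
    prefactor = 2.0 ** -(comb(n + 1, 2) - 1) / denom
    total = 0.0
    for sigma in permutations(range(n)):
        term = 1.0
        for i in range(1, n):  # i = 1, ..., n-1 as in the theorem
            term *= 2.0 ** (-(n - i) * gamma[sigma[i - 1]])
        total += term
    return prefactor * total


def pr_disjoint_bruteforce(gamma, max_shift=30):
    """Sum the probability of every disjoint shift vector below max_shift."""
    n = len(gamma)
    total = 0.0
    for shifts in product(range(max_shift), repeat=n):
        disjoint = all(
            shifts[i] + gamma[i] < shifts[j] or shifts[j] + gamma[j] < shifts[i]
            for i in range(n)
            for j in range(i + 1, n)
        )
        if disjoint:
            # Each shift s has probability 2^-(s+1), independently.
            total += 2.0 ** -(sum(shifts) + n)
    return total


if __name__ == "__main__":
    for gamma in [(1, 2), (0, 0, 0), (1, 0, 2)]:
        print(gamma, pr_disjoint_formula(gamma), pr_disjoint_bruteforce(gamma))
```

The truncation discards only shift vectors with some coordinate at or above the cutoff, so the two values should agree up to an error on the order of n · 2^-max_shift.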
6. JOINING THE MODELS
We have now described the two fundamental random pro-
cesses of our work. Though the two are interesting in isola-
tion, it is by combining them that we will achieve our overall
goal: to characterize the probability of the canonical data
race manifesting, under various memory models.
Our first observation is to note that Corollary 5.2 can be
further simplified, provided the segment lengths are drawn
from a distribution with a very weak condition.
Theorem 6.1. Let $\bar\Gamma = \Gamma_1, \ldots, \Gamma_n$ be a distribution over
segment lengths, drawn from $\mathbb{N}^n$. Assume that the marginal
distribution of each segment length is identical (i.e., $\Gamma_i \sim
\Gamma_j$ $\forall\, i \ne j$); they needn't be independent. Then all permutations
of segment shifts are equivalent, and
\[
\Pr[A(\bar\Gamma)] = c(n) \cdot 2^{-\binom{n+1}{2}} \cdot n! \cdot E_{\bar\Gamma}\Bigl[\prod_{i=1}^{n-1} 2^{-i\Gamma_i}\Bigr].
\]
The proof is given in Appendix B.3. Because the identical-
ity condition holds for the critical window size, the theorem
gives an indication of how it is that we can analyze the
overall bug manifestation concretely. Recall that the pro-
cess of Section 4 generates a uniformly random program of
STs and LDs, then randomly “settles” each instruction in
turn, according to the rules of the memory model. The pro-
cess of Section 5 applies a random “shift” to a series of line
segments, the key event for which is the mutual disjointness
of all the segments. We now combine these two processes by
letting the line segment lengths of the shift process be dis-
tributed as the critical window size of the settling process.
An important subtlety is that we generate a single initial
random program, then independently reorder n copies of this
program. Though this makes the analysis more complex, it
adds a degree of realism: with n identical threads, it is more
natural that the same data race would be present in the
same position of every pair of threads. The following two
theorems summarize our key results.
Theorem 6.2. For n = 2 threads, the probability that the
canonical data race does not manifest is the following, in
each of the three main models.
Sequential Consistency: Pr[A] ≈ 0.1666
Total Store Order (see footnote 4): 0.1369 > Pr[A] > 0.1315
Weak Ordering: Pr[A] ≈ 0.1296
Theorem 6.3. As n grows, the probability of successful
execution is identical in all models, up to lower order terms
in the exponent. In particular, $\Pr[A] = e^{-n^2(1+o(1))}$.
The first tightly bounds the probability of successful exe-
cution for the case of n = 2 threads; the second gives an
asymptotic bound on this probability for large n. We leave
the proofs of these theorems to Appendix B.3. Both proofs
are rather technical and build upon the theorems of the pre-
vious two sections. The only surprising observation nec-
essary is that, when lower bounding a certain expectation
over the critical window for n threads, it suffices to use only
a single term of this expectation. Doing so achieves the
asymptotic behavior we seek.
Key Observations: Interpreting Theorems 6.2 and 6.3
yields remarkable insights. Though the case of n = 2 sub-
stantively distinguishes the memory models, we find that as
n grows, the probability in all memory models approaches
the same value, up to lower order terms in the exponent.
This dichotomy is a fundamental take-away for informing
computer architecture decisions. Though the use of weaker
memory models does increase the risk of program error, as
the number of threads grows this risk grows negligibly com-
pared to growth of risk of error in even sequential consis-
tency. This is of particular importance given the trends
towards ever larger multicores that enable more and more
concurrent threads.
7. DISCUSSION
Limitations and possible extensions: Our analysis assumes
that the program consists solely of loads and stores,
when real programs include synchronization, arithmetic, etc.
These instructions can affect the timing of the program,
introduce data dependencies that limit reordering, or disallow
certain types of reorderings. An important item for future
work is to include acquire/release fences, which are necessary
to simulate memory models such as Release Consistency [11].
These fences act as one-way barriers, allowing instructions
to reorder into, but not out of, a critical section. This
behavior can be easily modeled using settling (§3.1.2). Fences
make concurrency bugs less likely to manifest, as programs
with fences have fewer legal reorderings. However, we
conjecture that adding fences will not significantly change the
main conclusions derived in this paper.

Footnote 4: A very similar analysis achieves a similar result for Partial
Store Order (PSO). We omit the result for brevity.
Optimized implementations of SC: Our model of Se-
quential Consistency assumes a relatively simple implemen-
tation wherein each processor executes only one memory
instruction at a time. Many SC implementations use ag-
gressive optimizations such as speculative execution to com-
pete with the performance of weaker memory models [10, 12,
7]. We do not consider this simplifying assumption to be a
weakness of our model; rather, we believe our results about
weak memory models can be extended to address optimizing
implementations of strong memory models. In other words,
concurrency bugs are more likely to manifest in an imple-
mentation of SC that uses aggressive reordering than in a
simple (and slow) implementation.
Generality of Results: In this paper, we propose and
study one specific probabilistic process to model program
execution and thread interleaving. Clearly, there are other
plausible models that can be studied. Our intuition is that
the results in this paper have a certain robustness with re-
gard to changes to the parameters in our models as well as
to changes in the model. However, future work is required
to formally validate this conjecture.
8. CONCLUSION
With the ubiquity of multicore systems and the trend towards
integrating ever more cores on a single chip, multiprocessor
programmability has become one of the key challenges
in computer science. Even with improvements in programmability,
we are likely to see an increase in software
defects, given the inherent difficulty of concurrent programming.
Memory consistency models are at the center of the
programmability discussion, since they determine the memory
access semantics of parallel programs. The debate over
memory models has historically revolved around the tradeoffs
between programmability, performance and complexity.
In this paper we bring a new axis to this discussion: software
reliability. We study an analytical model and show that
concurrency bugs are indeed more likely to manifest themselves
in relaxed memory models, but surprisingly, that as
the number of parallel threads increases, the difference between
strict and weak memory models diminishes. The latter
observation can have important consequences for system
designers when developing new memory models.
9. REFERENCES
[1] S. V. Adve and K. Gharachorloo. Shared memory
consistency models: A tutorial. IEEE Computer,
29(12):66–76, December 1996.
[2] S. V. Adve and M. D. Hill. Weak ordering—a new
definition. In Proc. of the 17th International
Symposium on Computer Architecture (ISCA), 1990.
[3] AMD Corp. AMD64 Architecture Programmer’s
Manual - Volume 2: System Programming, July 2007.
[4] C. Artho, K. Havelund, and A. Biere. High-level data
races. Journal on Software Testing, Verification &
Reliability, 13(4):220–227, 2003.
[5] Arvind and J.-W. Maessen. Memory model =
instruction reordering + store atomicity. In Proc. of
the 33rd International Symposium on Computer
Architecture (ISCA), 2006.
[6] H.-J. Boehm and S. V. Adve. Foundations of the C++
concurrency memory model. In Proc. of the 29th
Conference on Programming Language Design and
Implementation (PLDI), 2008.
[7] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas.
BulkSC: Bulk enforcement of sequential consistency.
In Proc. of the 34th International Symposium on
Computer Architecture (ISCA), 2007.
[8] M. Dubois, C. Scheurich, and F. A. Briggs. Memory
access buffering in multiprocessors. In Proc. of the
13th International Symposium on Computer
Architecture (ISCA), 1986.
[9] C. Flanagan and S. Qadeer. A type and effect system
for atomicity. In Proc. of the 24th Conference on
Programming Language Design and Implementation
(PLDI), 2003.
[10] K. Gharachorloo, A. Gupta, and J. Hennessy. Two
techniques to enhance the performance of memory
consistency models. In International Conference on
Parallel Processing, 1991.
[11] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons,
A. Gupta, and J. Hennessy. Memory consistency and
event ordering in scalable shared-memory
multiprocessors. In Proc. of the 17th International
Symposium on Computer Architecture (ISCA), 1990.
[12] C. Gniady, B. Falsafi, and T. N. Vijaykumar. Is SC +
ILP = RC? In Proc. of the 26th International
Symposium on Computer Architecture (ISCA), 1999.
[13] J. R. Goodman. Cache consistency and sequential
consistency. Technical Report 1006, University of
Wisconsin-Madison, 1989.
[14] Intel Corp. Intel 64 and IA-32 Architectures Software
Developer’s Manual—Volume 3A: System
Programming Guide, Part 1, December 2009.
[15] L. Lamport. How to make a multiprocessor computer
that correctly executes multiprocess programs. IEEE
Transactions on Computers, 28(9):690–691, September
1979.
[16] J. Lee and D. Padua. Hiding relaxed memory
consistency with compilers. In Proc. of the 9th
Conference on Parallel Architectures and Compilation
Techniques (PACT), 2000.
[17] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from
mistakes—a comprehensive study on real world
concurrency bug characteristics. In Proc. of the 13th
Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), 2008.
[18] J. Manson, W. Pugh, and S. V. Adve. The Java
memory model. In Proc. of the 32nd Symposium on
Principles of Programming Languages (POPL), 2005.
[19] SPARC International, Inc. The SPARC Architecture
Manual—Version 8, 1992.
APPENDIX
A. MODEL DEFINITION
A.1 Initial Program Order
The initial program order S0 consists of n+2 instructions:
x1, x2, . . . , xn, LD X,ST X
where for $1 \le i \le n$, $x_i$ has type $\tau(x_i) = \mathrm{ST}$ with probability
$p$, and type $\tau(x_i) = \mathrm{LD}$ with probability $(1 - p)$. Each $x_i$
accesses a location $X_i$ such that $X_i = X_j$ only if $i = j$, and
$X_i \ne X$. For the purposes of defining the model we assume
that $n$ is finite, but in the analysis it is useful to approximate
a very long program by letting $n \to \infty$.
A.2 Definition of Settling Process
We model instruction reordering as a random process con-
sisting of n+2 rounds. This process produces a permutation
of S0 which we call Sn+2; round i produces the intermediate
permutation Si. During round i, instruction xi is inserted
into the permuted ordering of instructions x1 through xi−1.
We decide where to insert instruction xi by repeatedly swap-
ping xi with the instruction directly before it. Each swap
succeeds with probability ρτ1,τ2 , where τ1 is the type of the
instruction directly before xi’s current location and τ2 is the
instruction type of xi, and fails with probability 1− ρτ1,τ2 .
$\rho_{\tau_1,\tau_2}$ is always either 0 or $s$, depending on the memory
model. The single exception is for the critical LD and ST,
$x_{m+1}$ and $x_{m+2}$ (here $m$ denotes the program length, written $n$ in A.1).
If $x_{m+2}$ ever tries to swap with $x_{m+1}$,
it automatically fails, because they access the same memory
address. The round completes when a swap fails or $x_i$
reaches position 1. This recursive random process is called
settling.
Let πi(j) be a function from positions in S0 to positions
in Si. (Note that π0(j) = j for all j.) We formally define
the insertion point of instruction xi using the probability
distribution βi.
Definition 2. Given the intermediate permutation $S_{i-1}$,
we define a probability distribution $\beta_{i,k}$ as follows:
• If $k = 1$, the outcome is position 1 with probability 1.
• Else let $j = \pi_{i-1}^{-1}(k - 1)$ and let $q = \rho_{\tau(x_j),\tau(x_i)}$. The outcome is
– $k$ with probability $1 - q$;
– a draw from $\beta_{i,k-1}$ with probability $q$.
We also define $\beta_i$ to be $\beta_{i,i}$.
βi describes the distribution of possible positions for in-
struction i after round i of settling. βi,k describes the distri-
bution of the possible positions for instruction i given that
i moves up at least as high as position k.
The result of round i is the permutation πi, in which
the instructions following xi’s new location are each pushed
down by one, and the instructions preceding xi’s new loca-
tion do not move.
Definition 3. Recall that πi is a function mapping posi-
tions in S0 to Si. Given permutation Si−1, we draw k from
βi and construct the permutation Si as follows:
πi(i) = k
πi(j) = πi−1(j) for πi−1(j) < k
πi(j) = πi−1(j) + 1 for πi−1(j) ≥ k
We use definitions 2 and 3 to get a probability distribution
over permutations of S0. We refer to the final permutation
πm+2 as π.
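The settling process can be simulated directly. The sketch below is a Monte Carlo illustration under explicit assumptions that are not all stated in this appendix: it uses the TSO rule described in Section 4 (only a LD may swap past a ST directly above it; STs never move), a swap probability s = 1/2, and a store probability p = 1/2. It tracks only how far the critical LD moves, which equals the number of instructions that end up between the critical LD and the critical ST. All names are hypothetical.

```python
import random


def critical_ld_displacement(m, rng, p_store=0.5, swap_prob=0.5):
    """Settle x_1..x_m and then the critical LD; return how far the critical LD moved."""
    order = []  # settled instruction types, top of the program first
    gamma = 0
    for i in range(m + 1):
        is_critical = (i == m)  # instruction m+1 is the critical LD
        kind = "LD" if is_critical or rng.random() >= p_store else "ST"
        pos = len(order)
        if kind == "LD":
            # Assumed TSO rule: a LD may pass a ST directly above it.
            while pos > 0 and order[pos - 1] == "ST" and rng.random() < swap_prob:
                pos -= 1
        if is_critical:
            gamma = len(order) - pos
        order.insert(pos, kind)
    # The critical ST is settled last and never moves under this rule.
    return gamma


def estimate_pr_b0(trials=20000, m=30, seed=0):
    """Estimate Pr[B_0], the chance that the critical LD does not move."""
    rng = random.Random(seed)
    hits = sum(critical_ld_displacement(m, rng) == 0 for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print(estimate_pr_b0())
```

Under these assumptions the estimate should land near the value Pr[B0] = 2/3 derived in Appendix B.1, up to sampling noise and the finite program length.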
A.3 Definition of Interleaving Model
Formally, the thread interleaving model is defined as fol-
lows. Let threads T1, . . . , Tn be n identical threads, dis-
tributed as described in A.1.
We allow each initially-identical thread to reorder independently,
using the settling process of A.2. We refer to
the final permutation $\pi_{m+2}$ of thread $T_k$ as $\pi^{(k)}$. Define the
“critical window” $W_k$ for a reordered thread $T_k$ to be the set
of indices (inclusively) between the settled positions of the
critical instructions. That is,
\[ W_k = \{\pi^{(k)}(m+1),\ \pi^{(k)}(m+1) + 1,\ \ldots,\ \pi^{(k)}(m+2) - 1,\ \pi^{(k)}(m+2)\}. \]
Finally, we independently allow each thread to “shift up”
with respect to one another.
For each $k$, we allow thread $T_k$ to shift exactly $i$ positions
up with probability $2^{-(i+1)}$. Observe that $\sum_{i=0}^{\infty} 2^{-(i+1)} = 1$,
so as $n \to \infty$, this gives a probability distribution over the
positions of each instruction in each thread. Let $\eta_k$ be the
shift of $T_k$.
We then say that the bug manifests if there exist $k \ne
\ell$ such that the critical windows of reordered $T_k$ and $T_\ell$
overlap at all. In other words, define the bug non-manifestation
event $A$ by
\[
A = \neg\exists k \ne \ell : \bigl(\pi^{(\ell)}(m+2) - \eta_\ell \ge \pi^{(k)}(m+1) - \eta_k\bigr)
\wedge \bigl(\pi^{(\ell)}(m+2) - \eta_\ell \le \pi^{(k)}(m+2) - \eta_k\bigr).
\]
(Note that for any overlapping pair of ranges, the bottom
of one of the two windows will necessarily overlap with the
other window.)
Expressed alternately, let $W'_k$ be the shifted window
$\{\pi^{(k)}(m+1) - \eta_k,\ \pi^{(k)}(m+1) + 1 - \eta_k,\ \ldots,\ \pi^{(k)}(m+2) -
1 - \eta_k,\ \pi^{(k)}(m+2) - \eta_k\}$. Then
\[ A = \neg\exists k \ne \ell : W'_k \cap W'_\ell \ne \emptyset. \]
B. PROOFS
B.1 Proofs for Theorem 4.1: Critical Window
Growth
In this section, we finish up the proofs for Lemma 4.2, and
the Total Store Order case of Theorem 4.1.
Proof (Remainder of Lemma 4.2).
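\[
\begin{aligned}
\Pr[L_\mu] &= \sum_{q=0}^{\infty} \Pr[L_\mu \mid \Psi_\mu = q] \cdot \Pr[\Psi_\mu = q] \\
&= \sum_{q=0}^{\infty} 2^{-\mu} 2^{-q} \binom{\mu+q-1}{q} \Pr[F_\mu \mid \Psi_\mu = q] \cdot \bigl(1 - (1 - g(\mu, q)) \cdot \Pr[S_{ST,\Phi_\mu-1}(\Phi_\mu-1)]\bigr) \\
&= \sum_{q=0}^{\infty} 2^{-\mu} 2^{-q} \binom{\mu+q-1}{q} \Pr[F_\mu \mid \Psi_\mu = q] \cdot \Bigl(1 - 2^{-q} \cdot \frac{2}{3}\Bigr).
\end{aligned}
\]
Here $1 - g(\mu, q) = 2^{-q}$, and the last line substitutes the limiting value 2/3 from Claim 4.3.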
[Figure 2 omitted: both axes run from 0 to 11, with segments labeled γ1, γ2, γ3.]
Figure 2: An instantiation of the shift process. Three segments of lengths $\bar\gamma = (\gamma_1, \gamma_2, \gamma_3) = (3, 2, 5)$ are
independently shifted. This particular shift occurs with probability $2^{-8-1} \cdot 2^{-0-1} \cdot 2^{-2-1} = 2^{-13}$. The disjointness
event $A(\bar\gamma)$ does indeed hold here.
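By Claim 4.4:
\[
\begin{aligned}
\Pr[L_\mu] &\ge 2^{-\mu} \sum_{q=0}^{\infty} 2^{-q} \binom{\mu+q-1}{q} \cdot \frac{2^{-(q-1)} - 2^{-\mu q}}{\binom{\mu+q-1}{q}} \Bigl(1 - 2^{-q} \frac{2}{3}\Bigr) \\
&= 2^{-\mu} \sum_{q=0}^{\infty} \Bigl(2^{-(2q-1)} - \frac{2}{3} 2^{-(3q-1)} - 2^{-\mu q - q} + \frac{2}{3} 2^{-\mu q - 2q}\Bigr) \\
&= 2^{-\mu} \left(\frac{2}{1 - \frac{1}{4}} - \frac{4/3}{1 - \frac{1}{8}} - \frac{1}{1 - \frac{1}{2^{\mu+1}}} + \frac{2/3}{1 - \frac{1}{2^{\mu+2}}}\right) \\
&= 2^{-\mu} \left(\frac{8}{7} - \frac{1}{1 - \frac{1}{2^{\mu+1}}} + \frac{2/3}{1 - \frac{1}{2^{\mu+2}}}\right).
\end{aligned}
\]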
The above expression is difficult to work with, so we give
a simpler lower bound that holds for all µ ≥ 1. Let h(µ) be
the parenthesized expression above (such that the bound is
Pr[Lµ] ≥ 2−µ · h(µ)).
We differentiate h(µ) and show that it is increasing, and
compute a small value explicitly, so that we may lower bound
all higher values by this value.
We will show that for all $\mu \ge 1$,
\[ h(\mu) \ge 4/7. \]
We first note that
\[
h(1) = \frac{8}{7} - \frac{1}{1 - 1/4} + \frac{2}{3} \cdot \frac{1}{1 - 1/8}
= \frac{8}{7} - \frac{4}{3} + \frac{16}{21}
= \frac{4}{7}.
\]
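We now show that $h(\mu)$ is increasing in $\mu$. The function
$h(\mu)$ is defined as $h(\mu) = \frac{8}{7} - (1 - 2^{-(\mu+1)})^{-1} + \frac{2}{3} \cdot (1 -
2^{-(\mu+2)})^{-1}$. To see that $h(\mu)$ is increasing, we differentiate
w.r.t. $\mu$.
\[
\frac{d}{d\mu} h(\mu) = \frac{2^{-(\mu+1)} (\ln 2)}{(1 - 2^{-(\mu+1)})^2} - \frac{2}{3} \cdot \frac{2^{-(\mu+2)} (\ln 2)}{(1 - 2^{-(\mu+2)})^2}.
\]
Since $\frac{2}{3} \cdot 2^{-(\mu+2)} = \frac{1}{3} \cdot 2^{-(\mu+1)}$, this expression is positive when
\[
\frac{1}{(1 - 2^{-(\mu+1)})^2} > \frac{1}{3 (1 - 2^{-(\mu+2)})^2},
\]
i.e. $\frac{1 - 2^{-(\mu+2)}}{1 - 2^{-(\mu+1)}} > \sqrt{\frac{1}{3}}$. This holds for all $\mu \ge 0$, because the left-hand side exceeds 1. Hence $h(\mu)$
is increasing for non-negative $\mu$.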
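The behavior of h(µ) can also be checked directly. This small sketch evaluates the parenthesized expression from the bound on Pr[Lµ] in exact arithmetic; the function name `h` mirrors the text, and the range of µ tested is an arbitrary choice.

```python
from fractions import Fraction


def h(mu: int) -> Fraction:
    """h(mu) = 8/7 - 1/(1 - 2^-(mu+1)) + (2/3)/(1 - 2^-(mu+2))."""
    return (
        Fraction(8, 7)
        - 1 / (1 - Fraction(1, 2 ** (mu + 1)))
        + Fraction(2, 3) / (1 - Fraction(1, 2 ** (mu + 2)))
    )


if __name__ == "__main__":
    for mu in range(1, 11):
        print(mu, h(mu), float(h(mu)))
```

The values start at 4/7 for µ = 1 and grow toward 8/7 - 1 + 2/3 = 17/21, so using h(1) as a uniform lower bound for µ ≥ 1 is safe.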
Proof (Remainder of Theorem 4.1).
It will be useful to calculate the total slack in our probability
bounds. Let $\Pr_\ell[L_\mu]$ be the lower bound for $\Pr[L_\mu]$ computed
in this section, and $\Pr_r[L_\mu] = \Pr[L_\mu] - \Pr_\ell[L_\mu]$ be the
remainder. The “missing” probability $R = \sum_{\mu=0}^{\infty} \Pr_r[L_\mu]$ is
computed below.
Claim B.1.
\[ R = 2/21. \]
Proof. Note that $\Pr_r[L_\mu]$ is nonnegative for all $\mu$, because
$\Pr_\ell[L_\mu]$ is a lower bound. $L_\mu$ must hold for exactly
one $\mu$, hence $\sum_{\mu=0}^{\infty} \Pr[L_\mu] = 1$.
\[
\begin{aligned}
R &= 1 - \sum_{\mu=0}^{\infty} \Pr_\ell[L_\mu] \\
&= 1 - \Pr_\ell[L_0] - \sum_{\mu=1}^{\infty} h(1) \cdot 2^{-\mu} \\
&= 1 - 1/3 - 4/7 \cdot 1 \\
&= 2/3 - 4/7 \\
&= 2/21.
\end{aligned}
\]
Now, using the derivation of Pr[L0] from Lemma 4.2,
Pr[B0] = Pr[B0|L0] · Pr[L0] + Pr[B0|¬L0] · Pr[¬L0]
= 1 · (1/3) + (1/2) · (2/3)
= 2/3.
Moving on to the case of $\gamma \ge 1$, we rewrite $\Pr[B_\gamma]$ as
\[
\Pr[B_\gamma] = \sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_\ell[L_\mu] + \sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_r[L_\mu].
\]
We compute the first sum exactly, and provide upper and
lower bounds for the second sum, in order to upper and
lower bound $\Pr[B_\gamma]$.
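First we compute the value of $\sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_\ell[L_\mu]$
exactly:
\[
\begin{aligned}
\sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_\ell[L_\mu]
&= \Pr[B_\gamma \mid L_\gamma] \cdot \Pr_\ell[L_\gamma] + \sum_{\mu=\gamma+1}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_\ell[L_\mu] \\
&= 2^{-\gamma} h(1) 2^{-\gamma} + \sum_{\mu=\gamma+1}^{\infty} 2^{-\gamma} (1/2) h(1) 2^{-\mu} \\
&= h(1) 2^{-\gamma} \Bigl(2^{-\gamma} + \sum_{\mu=\gamma+1}^{\infty} (1/2) 2^{-\mu}\Bigr) \\
&= h(1) 2^{-\gamma} \Bigl(2^{-\gamma} + (1/2) \frac{2^{-(\gamma+1)}}{1/2}\Bigr) \\
&= h(1) 2^{-\gamma} \cdot 3 \cdot 2^{-(\gamma+1)} \\
&= 3 h(1) 2^{-(2\gamma+1)} \\
&= \frac{6}{7} \cdot 4^{-\gamma}.
\end{aligned}
\]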
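Next we upper bound $\sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_r[L_\mu]$:
\[
\begin{aligned}
\sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_r[L_\mu]
&= 2^{-\gamma} \cdot \Pr_r[L_\gamma] + \sum_{\mu=\gamma+1}^{\infty} 2^{-\gamma} (1/2) \cdot \Pr_r[L_\mu] \\
&= 2^{-\gamma} \cdot \Bigl(\Pr_r[L_\gamma] + (1/2) \sum_{\mu=\gamma+1}^{\infty} \Pr_r[L_\mu]\Bigr).
\end{aligned}
\]
To upper bound the above expression, observe that
\[ \sum_{\mu=\gamma}^{\infty} \Pr_r[L_\mu] \le R. \]
It is clear that allocating all of this probability mass to $L_\gamma$
maximizes the above expression:
\[
\sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_r[L_\mu] \le 2^{-\gamma} \cdot \Bigl(R + (1/2) \sum_{\mu=\gamma+1}^{\infty} 0\Bigr) = R\, 2^{-\gamma}.
\]
We cannot ensure that $\sum_{\mu=\gamma}^{\infty} \Pr_r[L_\mu]$ is positive for $\gamma > 0$,
because all of $R$ could be allocated to $\Pr_r[L_0]$. Hence the
best lower bound we can expect here is $\sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot
\Pr_r[L_\mu] \ge 0$.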
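The arithmetic of this subsection can be replayed with exact fractions. The sketch below encodes the lower bounds on Pr[Lµ] (1/3 for µ = 0 and h(1) · 2^-µ otherwise) and the conditional probabilities Pr[Bγ|Lµ] from Section 4, then checks the slack R and the first sum. The function names and the truncation length `terms` are illustrative choices.

```python
from fractions import Fraction

H1 = Fraction(4, 7)  # h(1), the uniform lower-bound constant


def pr_lower_l(mu: int) -> Fraction:
    """Lower bound on Pr[L_mu]: exact 1/3 for mu = 0, else h(1) * 2^-mu."""
    return Fraction(1, 3) if mu == 0 else H1 / 2 ** mu


def pr_b_given_l(gamma: int, mu: int) -> Fraction:
    """Pr[B_gamma | L_mu]: 2^-gamma if mu == gamma, 2^-(gamma+1) if mu > gamma."""
    if mu < gamma:
        return Fraction(0)
    return Fraction(1, 2 ** gamma) if mu == gamma else Fraction(1, 2 ** (gamma + 1))


def slack() -> Fraction:
    """R = 1 - Pr_l[L_0] - h(1) * sum_{mu>=1} 2^-mu, where the geometric sum is 1."""
    return 1 - pr_lower_l(0) - H1 * 1


def first_sum(gamma: int, terms: int = 200) -> Fraction:
    """Truncated sum over mu >= gamma of Pr[B_gamma | L_mu] * Pr_l[L_mu]."""
    return sum(
        (pr_b_given_l(gamma, mu) * pr_lower_l(mu) for mu in range(gamma, gamma + terms)),
        Fraction(0),
    )


def tso_bounds(gamma: int):
    """Lower and upper bounds on Pr[B_gamma] for gamma >= 1."""
    lower = Fraction(6, 7) / 4 ** gamma
    return lower, lower + slack() / 2 ** gamma


if __name__ == "__main__":
    print("R =", slack())
    for g in range(1, 5):
        print(g, float(first_sum(g)), [float(b) for b in tso_bounds(g)])
```

The truncated sum differs from (6/7) · 4^-γ only by a geometric tail that is far below any practical tolerance at 200 terms.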
B.2 Shift Model Proofs
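Proof (Corollary 5.2). For general $n$, it suffices to show that
$\prod_{i=1}^{n-1} (1 - 2^{-(n+1-i)}) \ge \frac{1}{2}$.
\[
\begin{aligned}
\prod_{i=1}^{n-1} (1 - 2^{-(n+1-i)})
 &= \prod_{i=2}^{n} (1 - 2^{-i}) \\
 &= \prod_{i=2}^{n} \frac{1}{1 + \frac{2^{-i}}{1 - 2^{-i}}} \\
 &\ge \prod_{i=2}^{n} \frac{1}{\exp\bigl( \frac{2^{-i}}{1 - 2^{-i}} \bigr)} \\
 &= \exp\Bigl( - \sum_{i=2}^{n} \frac{2^{-i}}{1 - 2^{-i}} \Bigr) \\
 &\ge \exp\Bigl( - \sum_{i=2}^{n} \frac{2^{-i}}{1 - 2^{-2}} \Bigr) \\
 &= \exp\Bigl( - \frac{4}{3} \cdot \frac{1}{4} \sum_{i=0}^{n-2} 2^{-i} \Bigr) \\
 &\ge \exp\Bigl( - \frac{2}{3} \Bigr) \\
 &> \frac{1}{2}.
\end{aligned}
\]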
To check the value of $c(2)$, we simply plug in $n = 2$. That is,
$\frac{2}{1 - 2^{-(2+1-1)}} = \frac{8}{3}$.
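The product bound can also be observed numerically. This is a small sketch, not part of the proof: it evaluates the product for a range of $n$ in floating point and compares it with $\exp(-2/3)$, and it repeats the arithmetic for $c(2)$ as written above. The function name `shift_product` is illustrative.

```python
import math
from fractions import Fraction


def shift_product(n: int) -> float:
    """prod_{i=1}^{n-1} (1 - 2^-(n+1-i)), i.e. prod_{i=2}^{n} (1 - 2^-i)."""
    result = 1.0
    for i in range(1, n):
        result *= 1.0 - 2.0 ** -(n + 1 - i)
    return result


bound = math.exp(-2.0 / 3.0)
for n in (2, 3, 5, 10, 50):
    print(n, shift_product(n), shift_product(n) >= bound > 0.5)

# The value quoted for c(2): 2 / (1 - 2^-(2+1-1)).
c2 = Fraction(2) / (1 - Fraction(1, 2 ** (2 + 1 - 1)))
print(c2)  # 8/3
```

The product decreases as $n$ grows but stays above $\exp(-2/3)$, which is itself above $1/2$.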
B.3 Proofs of Final Theorems
Proof (Theorem 6.1). Recall from Corollary 5.2 that
\[
\Pr[A(\Gamma)] = c(n) \cdot 2^{-\binom{n+1}{2}} \cdot
  \sum_{\sigma \in S_n} \prod_{i=1}^{n-1} 2^{-(n-i)\gamma_{\sigma(i)}}.
\]
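Our goal is to average over the summation of permutations.
Since we are treating $\Gamma$ as a random variable,
\[
\begin{aligned}
\Pr[A] = E_\Gamma[\Pr[A(\Gamma)]]
 &= E_\Gamma\Bigl[ c(n) \cdot 2^{-\binom{n+1}{2}} \cdot
    \sum_{\sigma \in S_n} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \Bigr] \\
 &= c(n) \cdot 2^{-\binom{n+1}{2}} \cdot
    \sum_{\sigma \in S_n} E_\Gamma\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \Bigr].
\end{aligned}
\]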
Then
\[
E_\Gamma\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \Bigr]
 = \sum_{\Gamma} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \cdot \Pr[B_\Gamma].
\]
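Let $\sigma(\Gamma) : \mathbb{N}^n \to \mathbb{N}^n$ be the operation
mapping $\Gamma$ to $\Gamma'$ with entries permuted by $\sigma$. Define
the inverse $\sigma^{-1}(\Gamma)$ accordingly. Then note
\[
\sum_{\Gamma} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \cdot \Pr[B_\Gamma]
 = \sum_{\Gamma'} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma'_i} \cdot \Pr[B_{\sigma^{-1}(\Gamma')}].
\]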
But because threads are distributed identically, the distribution of
$\Gamma$ is symmetric over any ordering $\sigma$:
\[
\Pr[B_\Gamma] = \prod_{k=1}^{n} \Pr[B^{(k)}_{\Gamma_k}]
 = \prod_{k=1}^{n} \Pr[B^{(k)}_{\sigma(\Gamma)_k}] = \Pr[B_{\sigma(\Gamma)}].
\]
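Hence we may write
\[
\begin{aligned}
\sum_{\Gamma'} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma'_i} \cdot \Pr[B_{\sigma^{-1}(\Gamma')}]
 &= \sum_{\Gamma'} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma'_i} \cdot \Pr[B_{\Gamma'}] \\
 &= E_{\Gamma'}\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma'_i} \Bigr].
\end{aligned}
\]
There are $n!$ permutations, hence this proves the claim.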
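The symmetry argument can be illustrated by brute force for a small number of threads. The sketch below assumes independent, identically distributed window sizes drawn from a made-up distribution (`WINDOW_DIST` is purely hypothetical and is not the distribution of any memory model in the paper). It checks that every permutation $\sigma$ yields the same expectation, so the sum over $S_n$ equals $n!$ times the expectation for the identity ordering.

```python
from fractions import Fraction
from itertools import permutations, product
from math import factorial

# Hypothetical per-thread distribution over window sizes (illustration only).
WINDOW_DIST = {2: Fraction(2, 3), 3: Fraction(1, 6), 4: Fraction(1, 6)}


def weight(gammas, sigma):
    """prod_{i=1}^{n-1} 2^{-(n-i) * gamma_{sigma(i)}}, with sigma 0-based."""
    n = len(gammas)
    exponent = sum((n - i) * gammas[sigma[i - 1]] for i in range(1, n))
    return Fraction(1, 2**exponent)


def expected_weight(n, sigma):
    """Expectation of weight(Gamma, sigma) over i.i.d. window sizes."""
    total = Fraction(0)
    for gammas in product(WINDOW_DIST, repeat=n):
        prob = Fraction(1)
        for g in gammas:
            prob *= WINDOW_DIST[g]
        total += prob * weight(gammas, sigma)
    return total


n = 3
identity = tuple(range(n))
per_sigma = [expected_weight(n, s) for s in permutations(range(n))]

# Every ordering gives the same expectation, so the sum collapses to n! copies.
print(all(v == per_sigma[0] for v in per_sigma))
print(sum(per_sigma) == factorial(n) * expected_weight(n, identity))
```

Exact fractions are used so that the equality between orderings is tested without rounding error.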
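Proof (Theorem 6.2). First observe that for any $\Gamma$ consisting
of two segments, by Corollary 5.2,
\[
\begin{aligned}
\Pr[A(\Gamma)]
 &= c(2) \cdot 2^{-\binom{2+1}{2}} \cdot \sum_{\sigma} \prod_{i=1}^{2-1} 2^{-(2-i)\gamma_{\sigma(i)}} \\
 &= \frac{8}{3} \cdot 2^{-3} \cdot \sum_{\sigma} 2^{-\gamma_{\sigma(1)}} \\
 &= \frac{1}{3} \cdot (2^{-\gamma_1} + 2^{-\gamma_2}).
\end{aligned}
\]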
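We then let $\Gamma$ be distributed as the critical windows of two
reordered copies of an identical random program.
\[
\Pr[A] = E_\Gamma[\Pr[A(\Gamma)]]
 = E_\Gamma\Bigl[ \frac{1}{3} \cdot (2^{-\Gamma_1} + 2^{-\Gamma_2}) \Bigr]
 = \frac{2}{3} \cdot E_\Gamma[2^{-\Gamma_1}].
\]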
Note that $B_\gamma$ is the event that $\gamma$ instructions end up
between the critical LD and ST exclusively, yet the critical window
includes the critical LD and ST. Hence
\[
E_\Gamma[2^{-\Gamma_1}] = \sum_{k=2}^{\infty} 2^{-k} \cdot \Pr[B_{k-2}].
\]
Sequential Consistency: We first analyze the prob-
ability of bug manifestation in sequential consistency, the
strictest of all memory models. In sequential consistency,
no thread ever reorders. Hence no instructions ever appear
between the critical LD and ST in a given thread. Thus
\[
\Pr[A] = \frac{2}{3} \cdot E_\Gamma[2^{-\Gamma_1}] = \frac{2}{3} \cdot \frac{1}{4} = \frac{1}{6}.
\]
Weak Ordering: Under the weakest memory model, an instruction has a
chance to bubble up regardless of whether the preceding instruction is
a LD or ST. For this reason, the final size of the critical window is
independent of the original program, as it was in sequential
consistency.
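Recall from Theorem 4.1 that under Weak Ordering,
$\Pr[B_t] = \frac{2^{-t}}{3}$ if $t > 0$, and $\Pr[B_0] = \frac{2}{3}$. Thus
\[
\begin{aligned}
E[2^{-\Gamma_1}]
 &= \sum_{t=2}^{\infty} \Pr[B_{t-2}] \cdot 2^{-t} \\
 &= (2/3) \cdot 2^{-2} + \sum_{t=3}^{\infty} \frac{2^{-(t-2)}}{3} \cdot 2^{-t} \\
 &= 2/12 + \frac{4}{3} \cdot \sum_{t=3}^{\infty} 4^{-t} \\
 &= 1/6 + \frac{4}{3} \cdot \frac{1}{64} \cdot \frac{4}{3} \\
 &= \frac{7}{36}.
\end{aligned}
\]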
Then
\[
\Pr[A] = \frac{2}{3} \cdot \frac{7}{36} = \frac{7}{54}.
\]
Note that the probability of the bug not manifesting has indeed
decreased from sequential consistency: $\frac{1/6}{7/54} = 9/7$.
Correct behavior is therefore somewhat less likely than under
sequential consistency.
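The two-thread values for Sequential Consistency and Weak Ordering can be reproduced from the formula $\Pr[A] = \frac{2}{3} \sum_{k \ge 2} 2^{-k} \Pr[B_{k-2}]$. The following sketch truncates the infinite sum and uses the distributions quoted above; the function names are illustrative rather than taken from the paper.

```python
from fractions import Fraction


def pr_b_sc(t: int) -> Fraction:
    """Sequential consistency: nothing ever lands between the LD and ST."""
    return Fraction(1) if t == 0 else Fraction(0)


def pr_b_wo(t: int) -> Fraction:
    """Weak ordering, as quoted from Theorem 4.1."""
    return Fraction(2, 3) if t == 0 else Fraction(1, 3 * 2**t)


def pr_correct(pr_b, terms: int = 120) -> Fraction:
    """(2/3) * sum_{k >= 2} 2^-k * Pr[B_{k-2}], truncated after `terms` terms."""
    expectation = sum(
        Fraction(1, 2**k) * pr_b(k - 2) for k in range(2, 2 + terms)
    )
    return Fraction(2, 3) * expectation


sc = pr_correct(pr_b_sc)
wo = pr_correct(pr_b_wo)
print(sc)  # 1/6
print(float(wo), float(Fraction(7, 54)))
print(float(sc / wo))
```

The last line prints the ratio of the two probabilities, which should be close to $9/7$.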
Total Store Order: We take advantage of the symme-
try for n = 2. We need not characterize the joint distribu-
tion of the lengths of two critical windows, because only the
starting position of the lower window matters.
Recall that Theorem 4.1 shows that
\[
\Pr[B_0] = \frac{2}{3},
\]
and
\[
\Pr[B_\gamma] = \frac{6}{7} \cdot 4^{-\gamma} + R(\gamma) \cdot 2^{-\gamma},
\]
for some positive $R(\gamma) < \frac{2}{21}$.
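Hence
\[
\begin{aligned}
E[2^{-\Gamma_1}]
 &= \sum_{t=2}^{\infty} \Pr[B_{t-2}] \cdot 2^{-t} \\
 &= (2/3) \cdot 2^{-2}
    + \sum_{t=3}^{\infty} \Bigl( \frac{6}{7} \cdot 4^{-(t-2)} + R(t-2) \cdot 2^{-(t-2)} \Bigr) \cdot 2^{-t} \\
 &= (1/6) + \Bigl( \frac{6}{7} \cdot 16 \cdot 8^{-3} \sum_{t=0}^{\infty} 8^{-t} \Bigr)
    + \Bigl( 4 \sum_{t=3}^{\infty} R(t-2) 4^{-t} \Bigr) \\
 &= (1/6) + \frac{6}{7} \cdot 16 \cdot 8^{-3} \cdot \frac{8}{7}
    + 4 \sum_{t=3}^{\infty} R(t-2) 4^{-t} \\
 &= (1/6) + \frac{3}{98} + 4 \sum_{t=3}^{\infty} R(t-2) 4^{-t}.
\end{aligned}
\]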
Plugging in $R(t) > 0$ gives
\[
\Pr[A] = \frac{2}{3} \cdot E[2^{-\Gamma_1}]
 > \frac{2}{3} \Bigl( \frac{1}{6} + \frac{3}{98} \Bigr) = 58/441 > 0.1315.
\]
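Similarly, plugging in $R(t) < 2/21$ gives
\[
\begin{aligned}
\Pr[A] &= \frac{2}{3} \cdot E[2^{-\Gamma_1}] \\
 &< 58/441 + \frac{2}{3} \cdot \Bigl( 4 \cdot \frac{2}{21} \cdot 4^{-3} \cdot \frac{4}{3} \Bigr) \\
 &= 58/441 + 1/189 \\
 &< 0.1369.
\end{aligned}
\]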
We now see that with two threads, the probability of reliable
execution under Total Store Order is substantially closer to that of
weak ordering (0.1296) than that of sequential consistency (0.1667).
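The Total Store Order interval can be recomputed exactly from the pieces of the derivation. This is a sketch under the stated bounds $0 < R(\gamma) < 2/21$; the helper `geometric_tail` and the variable names are illustrative.

```python
from fractions import Fraction

R_MAX = Fraction(2, 21)  # upper limit on R(gamma), from Claim B.1


def geometric_tail(ratio: Fraction, start: int) -> Fraction:
    """sum_{t >= start} ratio^t for 0 < ratio < 1."""
    return ratio**start / (1 - ratio)


# Part of E[2^-Gamma_1] that does not depend on R: 1/6 + 3/98.
exact_part = (
    Fraction(2, 3) * Fraction(1, 4)
    + Fraction(6, 7) * 16 * geometric_tail(Fraction(1, 8), 3)
)
# Largest possible contribution of the R-dependent sum.
slack_part = 4 * R_MAX * geometric_tail(Fraction(1, 4), 3)

tso_lower = Fraction(2, 3) * exact_part
tso_upper = Fraction(2, 3) * (exact_part + slack_part)
print(tso_lower, tso_upper)  # 58/441 181/1323
print(float(tso_lower), float(tso_upper))

# Reference points from the other two models.
weak_ordering = Fraction(7, 54)
sequential = Fraction(1, 6)
print(weak_ordering < tso_lower and tso_upper < sequential)
```

Here `181/1323` is simply $58/441 + 1/189$ over a common denominator. The whole interval lies between the weak ordering and sequential consistency values, and sits much nearer the former.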
Proof (Theorem 6.3). We first analyze the probability for Sequential
Consistency. Because it is the strongest memory model, its probability
of successful execution serves as an upper bound for every other
model. This is because the likelihood that the shift process results
in an overlap is monotonically increasing in the distribution of
critical window size.
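Sequential Consistency: Again, recall from Corollary 5.2 that
\[
\begin{aligned}
\Pr[A] &= E_\Gamma[\Pr[A(\Gamma)]] \\
 &= c(n) \cdot 2^{-\binom{n+1}{2}} \cdot
    \sum_{\sigma} E_\Gamma\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\gamma_{\sigma(i)}} \Bigr].
\end{aligned}
\]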
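Under sequential consistency, $\gamma_{\sigma(i)} = 2$ always. Hence
\[
\begin{aligned}
\Pr[A] &= c(n) \cdot 2^{-\binom{n+1}{2}} \cdot \sum_{\sigma} \prod_{i=1}^{n-1} 2^{-(n-i) \cdot 2} \\
 &= c(n) \cdot 2^{-\binom{n+1}{2}} \cdot n! \cdot 2^{-2\binom{n}{2}} \\
 &= 2^{-n^2 (3/2 + o(1))},
\end{aligned}
\]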
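where the last line follows from Stirling’s formula:
\[
\begin{aligned}
n! &= \sqrt{2\pi n} \Bigl( \frac{n}{e} \Bigr)^{n} (1 + o(1)) \\
 &= \exp\Bigl( \frac{\ln(2\pi)}{2} + \frac{\ln n}{2} + (\ln n - 1) n \Bigr) (1 + o(1)) \\
 &= e^{(n \ln n)(1 + o(1))} \\
 &= e^{n^2 \cdot o(1)}.
\end{aligned}
\]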
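The asymptotic exponent can be observed numerically. The sketch below computes $\log_2$ of $n! \cdot 2^{-\binom{n+1}{2}} \cdot 2^{-2\binom{n}{2}}$ and divides by $n^2$. The factor $c(n)$ is deliberately left out, since its definition is not restated in this appendix; the sketch assumes it does not affect the leading $n^2$ term. The function name is illustrative.

```python
import math


def log2_sc_factors(n: int) -> float:
    """log2 of n! * 2^{-C(n+1,2)} * 2^{-2*C(n,2)}; the factor c(n) is omitted."""
    log2_factorial = math.lgamma(n + 1) / math.log(2)
    return log2_factorial - math.comb(n + 1, 2) - 2 * math.comb(n, 2)


for n in (10, 100, 1000, 10000):
    # The ratio should drift towards -3/2 as n grows.
    print(n, log2_sc_factors(n) / n**2)
```

The $n!$ term contributes only on the order of $n \log n$ to the exponent, which is why it is absorbed into the $o(1)$ term.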
Other Models: Surprisingly, to achieve the same bound
for any model, all we need is a lower bound on the proba-
bility of generating a small critical window.
Claim B.2. In every memory model,
\[
\Pr[B_0] \ge \frac{1}{2}.
\]
Proof. This can be observed by the fact that no matter
the model, the critical LD must move up to have any chance
of the critical window growing. But even if the critical LD
is allowed to pass the instruction above it, this only occurs
with probability 1/2.
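The claim has the following consequences.
\[
\Pr\Bigl[ \bigwedge_{i=1}^{n-1} \gamma_i = 2 \Bigr] \ge 2^{-(n-1)},
\]
hence
\[
\Pr\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\gamma_{\sigma(i)}} = 2^{-2\binom{n}{2}} \Bigr] \ge 2^{-(n-1)},
\]
thus
\[
E\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\gamma_{\sigma(i)}} \Bigr] \ge 2^{-2\binom{n}{2} - (n-1)}.
\]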
Plugging this expectation into the probability of correct execution
again gives
\[
\Pr[A] > c(n) \cdot 2^{-\binom{n+1}{2}} \cdot n! \cdot 2^{-2\binom{n}{2} - (n-1)} = 2^{-n^2 (3/2 + o(1))}.
\]
Recall that Sequential Consistency offers the largest proba-
bility of correct execution of any model. Hence upper bound-
ing the above value by the probability for Sequential Con-
sistency, we have
\[
\Pr[A] < 2^{-n^2 (3/2 + o(1))}.
\]
This completes the proof.
