Explicitly Parallel Programming with Shared-Memory is Insane:
At Least Make it Deterministic!
Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin
Department of Computer Science and Engineering
University of Washington
Abstract
The root of all (or most) evil in programming shared memory
multiprocessors is that execution is not deterministic. Bugs are
hard to find and non-repeatable. This makes debugging a nightmare
and gives little assurance in testing: there is no way to
know how the program will behave in a different environment.
Testing and debugging are already difficult with a single thread
on uniprocessors. Pervasive parallel programs and chip
multiprocessors will make this difficulty worse, and more widespread
throughout our industry.
In this paper we make the case for fully deterministic shared
memory multiprocessing. The idea is to make memory interleaving
fully deterministic, in contrast to past approaches of simply
replaying an execution based on a memory interleaving log. This
causes the execution of a parallel program to be a function only
of its inputs, making a parallel program effectively behave like a
sequential program. We show that determinism can be provided
at a reasonable performance cost, and we also discuss the
benefits. Finally, we propose and evaluate a range of implementation
approaches.
1. Introduction
History has shown that developing multithreaded software is in-
herently more difﬁcult than writing single-threaded code. Among
the many new software engineering challenges multithreading
creates is the fact that even correctly written code can execute
non-deterministically: given the same input, threads can inter-
leave their memory and I/O operations in unique ways each time
an application executes. This fact seems both obvious and, upon
further reflection, ridiculous. What kind of programming model
can be taken seriously that executes applications differently each
time they are run?
Non-determinism in multithreaded execution arises from
small perturbations in the execution environment from run to
run. These changes include other processes executing
simultaneously, differences in how the operating system allocates
resources, minor differences in the cache, TLB, bus and resource
priority control mechanism states, and differences in the initial
micro-architectural states of the processor cores themselves.
Non-determinism also enters into program execution from the
operating system itself. Even the most basic system calls, such as
READ, can legitimately return a different result from program run
to program run.
Non-determinism complicates the software development process.
Defective software might execute correctly hundreds of
times before a subtle synchronization bug appears. The current
solution to this is based on hope: simply execute the application
several times and hope that the significant thread interleavings
that will occur have occurred. Test suites, the bedrock of reliable
software development, have diminished value with multithreading.
If a program executes differently each time it is run, what
does any particular test suite actually test; i.e., what is the
coverage? If the program output varies from run to run, is that
program wrong, or is it just a different, legitimate outcome? Finally,
how can developers have confidence that the particular thread
interleavings that occur in a software product deployed in the wild
have been tested in-house?
In this paper we argue that shared memory multiprocessors can
be deterministic, and with little performance penalty. To ground
the discussion we begin by providing a precise definition for what
it means to execute deterministically. A key insight of this
definition is that multithreaded execution can be deterministic if the
communication between threads is deterministic, and this leaves
ample room in the microarchitecture design space in which to
achieve deterministic behavior efficiently.
Next, we study what gives rise to non-deterministic execution,
and explore just how non-deterministic shared memory
multiprocessors actually are. We find, as anyone would have expected, that
current generation multicore devices are highly non-deterministic.
Program outcomes could diverge from run to run exponentially in
the number of instructions they contain. They tend not to, because
threads synchronize and because small perturbations in execution
at the instruction-by-instruction level tend to cancel each other
out over time, creating far less apparent non-determinism than is
theoretically possible.
The remainder of this paper presents our study into how to
build an efficient deterministic multiprocessor. We propose three
basic mechanisms for doing so: (1) using locks to transform
multithreaded execution into single-threaded execution, a naive
but useful technique; (2) using cache-line states to control when
communication occurs between threads; and (3) using transactions to
communicate between threads speculatively but deterministically.
This study is a limit study of these three techniques, and some
useful implementation variations of them. From this study, we
find that determinism can, theoretically at least, be had for
practically no added performance cost. Stated differently, assuming
efficient hardware can be built, being deterministic in execution
will not artificially serialize parallel execution.
2. Background
2.1. A Deﬁnition of Deterministic Parallel Execution
At a high level, our goal is to make multi-threaded execution as
deterministic as single-threaded execution. Hence, given the same
input, a multi-threaded program should produce the same
output. How can this be achieved? Let us consider all memory
and system call operations from all threads merged together into
a global ordering. First, it is not important which particular ordering
is chosen; any ordering that is a valid program execution will do.
For a multiprocessor to be deterministic it is only critical that the
same ordering is achieved each time the same program is run with
the same input.
Second, what about this global ordering makes a program
produce the same output given the same input? For instance, if two
adjacent memory operations from different threads operate on
different memory addresses, can they be swapped in the ordering
without affecting the program outcome? The answer is yes, because that
swap has no observable effect on thread execution. What turns
out to be key for achieving deterministic execution in this global
ordering is that each consumer of data (load instruction) reads data
from the same producer (store instruction). Moreover, I/O
operations should be considered as having read and write sets and, in
order to operate correctly with the outside non-deterministic
environment, should be executed in the same order from program run
to program run.
In summary, a multiprocessor is deterministic if two constraints
hold: (1) Each dynamic instance of a consumer (load)
instruction, regardless of thread, reads data written by the same
dynamic instance of a producer (store) instruction, regardless of
which thread it is executed in; (2) All system I/O calls occur in a
global ordering and are considered both producers and consumers
of data to all addresses. This latter constraint is clearly overly
broad, but narrowing it is the subject of future work. Such a
definition of execution provides for deterministic behavior and, as we
shall see in the next few sections, ample opportunity for efficient
implementation.
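Constraint (1) can be made concrete with a small checker. The following is a minimal sketch, not part of the original study: it assumes a trace is an array of memory operations over small integer addresses, and every name in it (mem_op, reads_from, same_communication, the MAX_* limits) is hypothetical. It records, for each dynamic load, which dynamic store produced its value, and calls two interleavings equivalent when those records match. Constraint (2) on I/O calls is not modeled, and both traces are assumed to contain the same per-thread operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 4
#define MAX_SEQ     64   /* max operations per thread (assumed bound) */
#define MAX_ADDR    16   /* addresses are small integers in this sketch */

enum op_kind { OP_LOAD, OP_STORE };

/* One dynamic memory operation in a global interleaving. */
struct mem_op {
    int thread;          /* issuing thread */
    int seq;             /* index of this operation within its thread */
    enum op_kind kind;
    int addr;
};

/* For every load, record the dynamic store that produced its value, encoded
 * as thread * MAX_SEQ + seq. Entries stay -1 for loads that read the initial
 * memory contents and for slots that are not loads. */
static void reads_from(const struct mem_op *trace, size_t n,
                       int producer[MAX_THREADS][MAX_SEQ])
{
    int last_store[MAX_ADDR];
    for (int a = 0; a < MAX_ADDR; a++)
        last_store[a] = -1;
    for (int t = 0; t < MAX_THREADS; t++)
        for (int s = 0; s < MAX_SEQ; s++)
            producer[t][s] = -1;

    for (size_t i = 0; i < n; i++) {
        const struct mem_op *op = &trace[i];
        assert(op->thread >= 0 && op->thread < MAX_THREADS);
        assert(op->seq >= 0 && op->seq < MAX_SEQ);
        assert(op->addr >= 0 && op->addr < MAX_ADDR);
        if (op->kind == OP_STORE)
            last_store[op->addr] = op->thread * MAX_SEQ + op->seq;
        else
            producer[op->thread][op->seq] = last_store[op->addr];
    }
}

/* True if every dynamic load reads from the same dynamic store in both
 * interleavings, i.e. the two have the same inter-thread communication. */
bool same_communication(const struct mem_op *a, size_t na,
                        const struct mem_op *b, size_t nb)
{
    int pa[MAX_THREADS][MAX_SEQ], pb[MAX_THREADS][MAX_SEQ];
    reads_from(a, na, pa);
    reads_from(b, nb, pb);
    for (int t = 0; t < MAX_THREADS; t++)
        for (int s = 0; s < MAX_SEQ; s++)
            if (pa[t][s] != pb[t][s])
                return false;
    return true;
}
```

Swapping two adjacent operations on different addresses leaves the result of same_communication unchanged, whereas moving a load ahead of the store it used to read from does not.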
2.2. Non-determinism in existing systems
Two questions arise: what are the sources of non-determinism
in the execution environment that give rise to non-determinism
in program output, and how much non-determinism exists,
quantitatively, in the execution environment?
In this section, we attempt to address these two questions.
2.2.1. Sources of non-determinism
Multiprocessor systems are non-deterministic in their execution
environment for two broad reasons: (1) The software
environment changes from program run to program run; (2) Non-ISA
micro-architectural state changes from program run to program
run. Both of these effects manifest themselves as perturbations
in the timings between events in different threads of execution.
As the ordering of small events changes, the ultimate effect on the
application is to change which dynamic instance of a store
instruction produces data for a dynamic instance of a load
instruction. Once this occurs, the program execution diverges from
previous execution runs at the ISA level, and the program output may
vary. Below we enumerate just a few of the software and hardware
sources of non-determinism.
Several aspects of the software environment create
non-determinism in program output. Among them are: other processes
executing concurrently and competing for resources, the state of
memory page tables, disk and I/O buffers, and the state of any
global speed-heuristic data structures (e.g., hash tables) in the
operating system. In addition, several operating system calls have
interfaces with legitimate non-deterministic behavior. For
example, the read system call can legitimately take a variable amount
of time to complete and return a variable amount of data!
At the hardware level, a variety of components that are not
visible at the ISA level vary from program run to program run.
Among them are: the state of any physically mapped caches, the
state of any predictor tables (branch, data, aliasing, etc.), the
state of any bus priority controllers, and any micro-architectural
state that may eventually manifest as a timing difference (physical
register usage, etc.). Certain hardware components, such as bus
arbiters, can indeed change their outcome from program run to
program run purely because of environmental factors. If priority is
given to whichever signal is detected first, the outcome can vary
with differing temperature and load characteristics.
Collectively, current generation software and hardware sys-
tems are not built to be deterministic. In the next section, we
measure just how non-deterministic they are.
2.2.2. Quantifying non-determinism
clear cache?   same processor?   thread 0 wins
no             yes               99.89%
no             no                83.18%
yes            yes               64.05%
yes            no                33.92%
Table 1. Data race outcomes under various code/scheduling
configurations
In this section, we quantify the amount of non-determinism
that exists in ordinary multiprocessor systems. We begin with a
simple experiment that illustrates how small changes to the ini-
tial state of the system can lead to entirely different program out-
comes.
A simple illustration of non-deterministic behavior: Figure 2
depicts a simple program with a data race. Table 1 lists the
changes we induced in the execution environment and, for each,
the measured program outcome. The point is that program
behavior is, as we have all known, non-deterministic on a
multicore system.
Measuring non-determinism in real applications: The
previous example illustrates how simple changes to the initial
conditions of a multiprocessor system can lead to different program
outcomes for a simple toy example. But how much non-determinism
exists in real application execution?
To answer this question we first have to define a measurement
technique. Returning to the definition of deterministic
execution above, non-determinism in program execution occurs when
a particular dynamic instance of a program load reads data
written by a different dynamic store instance. We instrumented
SPLASH2 to track communication among threads in order to
quantify producer-consumer differences.
[Figure 1 diagram: memory operations (st, ld) issued by two
processors, P0 and P1, with arrows marking the flow of data, next to
several interleavings with the same shared memory communication.]
Figure 1. Making the flow of data between shared memory operations deterministic.
    int race = 1;  /* global variable */

    void thread0(void) {
        if (doClearCache)
            clearCache();
        barrier_wait();
        race = 0;
    }

    void thread1(void) {
        if (doClearCache)
            clearCache();
        barrier_wait();
        race = 1;
    }

    /* program result, read once both threads have run */
    return race;
Figure 2. Simple program with a data race between 2 threads.
Figure 3 shows two illustrative examples of the results (barnes,
ocean-contig). The x axis is time. The y axis is the fraction
of loads, in a 100,000 memory-instruction window, that source
their data from a different dynamic store instance. To compute
the y axis, we take two traces of loads and stores and compute the
edit distance between corresponding sets of producer-consumer
store/load pairs. The plotted value is the average of this edit
distance, normalized against the total number of memory operations.
Edit distance is used because it correctly accounts for extra
load instructions inserted for synchronization.
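The text does not spell out the exact edit-distance computation, so the following is only a sketch of one plausible reading. It assumes each trace window has been reduced to the sequence of producer (store) identifiers observed by its loads, and it uses the standard Levenshtein distance. The names edit_distance and nondeterminism, and the MAX_LEN bound, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEN 256  /* assumed upper bound on the second sequence's length */

static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Levenshtein distance between two sequences of producer IDs, where a[i] is
 * the ID of the dynamic store that the i-th load of a window read from. */
size_t edit_distance(const int *a, size_t na, const int *b, size_t nb)
{
    assert(nb <= MAX_LEN);
    size_t prev[MAX_LEN + 1], cur[MAX_LEN + 1];

    for (size_t j = 0; j <= nb; j++)
        prev[j] = j;                      /* distance from the empty prefix */
    for (size_t i = 1; i <= na; i++) {
        cur[0] = i;
        for (size_t j = 1; j <= nb; j++) {
            size_t subst = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            cur[j] = min3(prev[j] + 1,    /* delete a[i-1] */
                          cur[j - 1] + 1, /* insert b[j-1] */
                          subst);         /* match or substitute */
        }
        for (size_t j = 0; j <= nb; j++)
            prev[j] = cur[j];
    }
    return prev[nb];
}

/* Distance normalized by the number of memory operations in the window. */
double nondeterminism(const int *a, size_t na, const int *b, size_t nb,
                      size_t window_ops)
{
    assert(window_ops > 0);
    return (double)edit_distance(a, na, b, nb) / (double)window_ops;
}
```

An extra spinning load in one run shows up as a single insertion, so it costs 1 instead of misaligning every later load, which is the property the text attributes to edit distance.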
From these graphs, three things are immediately apparent. First,
programs have significant non-deterministic behavior. Second,
both graphs depict phases of execution where non-determinism
drops to nearly zero. These are created by barrier operations,
which synchronize the threads so that subsequent execution is
more deterministic. Third, ocean-contig never shows 100%
non-determinism, and in fact a significant fraction of load operations
are deterministic. These load operations access private data. The
last two facts about program execution are significant and can be
exploited in a system that actually is deterministic.
3. Enforcing Deterministic Multiprocessing
In this section we describe how to build a deterministic multipro-
cessor. We begin with a naive approach (which is still extremely
useful for debugging), and then reﬁne this simple technique into
ever more efﬁcient implementations.
3.1. Basic Idea
In the previous section we saw that the key to making multipro-
cessors deterministic was to ensure that the communication (via
shared memory or otherwise) between threads was deterministic.
Conceptually, the easiest way to do that is to allow only one pro-
cessor to access memory at a time in a deterministic order. This
process can be thought of as a memory access token being passed
around between processors in a deterministic order. We call this
deterministic serialization of a parallel execution, shown in Fig-
ure 4(b). Deterministic serialization guarantees that inter-thread
communication is deterministic because it preserves all pairs of
communicating memory instructions (see Section 2.1).
The simplest way of implementing such serialization is by
having each processor synchronize before every memory opera-
tion to acquire the token and then, when the memory operation is
completed, pass it to the next processor in the deterministic order.
From now on we will call this token the deterministic token. A
processor blocks whenever it needs to access memory but does not
have the deterministic token. This token can be implemented as a
queue lock or ﬂag variable.
Waiting for the token at every memory operation is certainly
expensive and will cause signiﬁcant performance degradation
when compared to the original parallel execution (Figure 4(a)).
The performance degradation stems from (i) overhead introduced
by waiting for and passing the deterministic token and (ii) the
serialization itself, which removes the benefits of parallel execution.
[Figure 3 plots: two panels, Barnes and Ocean-Contig. x axis: Time
(Insns.), 0 to 5,000,000. y axis: Non-Determinism, 0 to 1.0.]
Figure 3. Profile of non-determinism over time in two applications from SPLASH2, Ocean-Contig and Barnes
[Figure 4 diagram: timelines for processors P0 and P1 in three
panels: (a) parallel execution; (b) deterministic serialized execution
at a fine grain; (c) deterministic serialized execution at a coarse
grain. Legend: memory operation; happens-before synchronization.]
Figure 4. Deterministic serialization of memory operations.
The overhead of synchronization can be mitigated by synchronizing
at a coarser granularity, allowing each processor to execute
a finite, deterministic number of memory operations, henceforth
called a quantum, before passing the token to the next processor.
Reducing the impact of serialization requires enabling parallel
execution while preserving the same execution behavior as
deterministic serialization. We propose three main techniques to
recover parallelism. The first technique exploits the fact that
synchronization is only necessary before accesses to shared pieces
of memory, allowing concurrent execution of private memory
accesses. The second uses speculation to optimize parallel
execution of quanta from different processors. The trade-off between
these two techniques is one between performance, complexity and
energy waste. The first technique requires fewer additions to a
typical multiprocessor system and does not suffer from wasted
work due to misspeculation, but it yields lower performance than
speculation, which, on the other hand, is more complex to
implement and uses more energy. Finally, the third main
technique provides a more convenient breakdown of program
execution into quanta in order to reduce the lengthening of the critical
path of execution of a parallel program. Below, we describe
each technique in detail.
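The token-passing scheme of Section 3.1, at both granularities, can be illustrated with a single-threaded simulation. This is a sketch under stated assumptions and not the simulator used in this paper: threads are visited round-robin starting from thread 0, a thread that has finished simply passes the token on, and the function name serialize and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SIM_THREADS 64  /* assumed bound for this sketch */

/* Simulate deterministic serialization. The token visits threads round-robin,
 * and its holder executes up to `quantum` memory operations before passing it
 * on. order[i] receives the ID of the thread that performs the i-th memory
 * operation in the resulting global order. Returns the number of operations. */
size_t serialize(const size_t *ops_per_thread, size_t nthreads,
                 size_t quantum, int *order, size_t cap)
{
    assert(quantum > 0 && nthreads > 0 && nthreads <= MAX_SIM_THREADS);
    size_t done[MAX_SIM_THREADS] = {0};
    size_t remaining = 0, out = 0, holder = 0;

    for (size_t t = 0; t < nthreads; t++)
        remaining += ops_per_thread[t];
    assert(remaining <= cap);

    while (remaining > 0) {
        /* The token holder runs one quantum (or whatever work it has left). */
        for (size_t k = 0;
             k < quantum && done[holder] < ops_per_thread[holder]; k++) {
            order[out++] = (int)holder;
            done[holder]++;
            remaining--;
        }
        holder = (holder + 1) % nthreads;  /* pass the token on */
    }
    return out;
}
```

With quantum set to 1 this corresponds to the fine-grain case of Figure 4(b), and larger values correspond to the coarse-grain case of Figure 4(c). The output depends only on the arguments, which is the property the scheme relies on.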
3.2. Leveraging Private Data Information
The performance of deterministic parallel execution can be im-
proved by leveraging the observation that only memory accesses
to shared data need to be deterministically serialized. This im-
plies that a thread only needs to wait for the token when it ac-
cesses shared data. If the thread is able to identify when an access
is private, it can issue the access even if it does not hold the to-
ken. Figure 5 illustrates this concept. As a result, the system can
execute memory operations that access private data concurrently
with memory operations from other processors.
[Figure 5 diagram: timelines for processors P0 and P1. Legend: shared
memory operation; private memory operation; happens-before
synchronization.]
Figure 5. Recovering parallelism with deterministic serialization
of shared memory operations only.
The key to making this efficient is to provide a low-overhead
mechanism for determining which memory operations in a thread
access private data. There are two main possibilities: (i)
dynamically determining which pieces of data are shared by maintaining
a sharing table (a table which tracks global cache-line states); (ii)
statically determining private data accesses with a compiler and
implementing ISA support for private loads. In this paper, we do not
explore option (ii). Below we explore how to implement the first
option.
A sharing table is a data structure in memory that contains
sharing information for each memory location (usefully, but not
necessarily, aggregated into cache line size blocks). A thread can
access its own private data without holding the deterministic to-
ken. A thread can also read shared data without holding the token.
However, in order to write to shared data or read data regarded as
private by another thread, a thread needs to hold the token and
wait until all other threads are blocked waiting for the token. This
guarantees that the sharing information is kept consistent. When a
thread t writes to a piece of data, the data is set to private with t as
the owner. Similarly, when a thread t reads data for the ﬁrst time,
it is set as private owned by t. Figure 6 illustrates this process
with a ﬂowchart.
[Figure 6 flowchart: thread t is about to access address A. Decision
nodes: "A owned by t?", "is A shared?", "is read?", "is write?". Action
nodes: "proceed access", "t waits for token and for all other threads
to be blocked", "set A private and owned by thread t", "set A shared".
Paths are numbered 1 to 4.]
Figure 6. Flowchart for deterministic serialization of shared
memory operations only.
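The access rules described in this section can be written down as a small decision procedure. The sketch below follows the prose and is not a transcription of Figure 6. The names (line_state, access_line, SHARED_LINE) are hypothetical, and the function only reports whether the token is required; the actual waiting is left to the caller. First-touch initialization, where a line becomes private to its first accessor, is left out because the text does not say whether it needs the token.

```c
#include <assert.h>
#include <stdbool.h>

#define SHARED_LINE (-1)  /* no owner: the line is shared and read-only */

/* Sharing-table entry for one block of memory (e.g. one cache line). */
struct line_state {
    int owner;            /* owning thread ID, or SHARED_LINE */
};

enum access_kind { ACC_READ, ACC_WRITE };

/* Apply the sharing-table rules for thread t accessing `line`.
 * Returns true if the access requires t to hold the deterministic token and
 * to wait until all other threads are blocked, false if it may proceed
 * immediately. The state update is performed here for brevity; a real
 * implementation would perform it only after the wait has completed. */
bool access_line(struct line_state *line, int t, enum access_kind kind)
{
    assert(t >= 0);

    if (line->owner == t)
        return false;     /* own private data: proceed */
    if (line->owner == SHARED_LINE && kind == ACC_READ)
        return false;     /* read of shared data: proceed */

    /* Write to shared data, or any access to another thread's private data. */
    if (kind == ACC_WRITE)
        line->owner = t;            /* becomes private to the writer */
    else
        line->owner = SHARED_LINE;  /* becomes shared and read-only */
    return true;
}
```

Only the two transitions between private and shared go through the token, so reads of shared data and accesses to a thread's own private data never block.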
The sharing information itself can be implemented in several
ways. In some respects, the sharing table is the state of cache lines
in a multiprocessor system. Determinism is achieved by carefully
orchestrating when a cache line changes state from
exclusive to shared or the reverse. One particular implementation
detail of this table is that, just as with the shared state of MESI, it is
not important which processors have the cache line, only that the
cache line is in a shared, read-only state. Similarly, a sharing table
need only track whether a memory address is in an exclusive state
(and which thread owns it), or whether no thread owns the address
and it is shared. A sharing table has no need for the M (modified)
or I (invalid) states. Figure 6 depicts a finite-state machine for
processing a memory request with a sharing table.
The sharing table itself can be implemented in either software
or hardware. A software implementation simply inlines the
decision tree from Figure 6 into the application. A hardware
implementation should be piggy-backed onto the coherence protocol.
The changes required to the base MESI protocol are
straightforward: cache lines cannot be converted from exclusive to shared
or shared to exclusive unless the processor holds the
deterministic token. All other state transitions remain the same. A hybrid
HW/SW approach is also possible. Instead of maintaining a
separate sharing table for cache lines, if a processor can simply trap to
software when it misses in the cache, then a flexible software solution
can be implemented, with the common-case cache hit requiring
no additional overhead.
3.3. Leveraging Support for Transactional Memory
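[Figure 7 diagram: memory operations of processors P0 and P1 grouped
into atomic quanta numbered 1 to 4 in deterministic commit order; one
quantum is squashed and re-executed. Legend: memory operation; commit
token passing; N = deterministic commit order; atomic quantum.]
Figure 7. Recovering parallelism by executing quanta as
atomic transactions.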
It is easy to see that executing quanta atomically and in iso-
lation in a deterministic total order would be equivalent to deter-
ministic serialization of memory operations. This implies that as
long as quanta appear to execute atomically and in isolation, the
execution will be equivalent to deterministic serialization. Trans-
actional memory systems can be leveraged for this purpose.
We can leverage support for transactional memory by treating
each quantum as a transaction. In addition to support for trans-
actional memory, we also need a mechanism to form quanta de-
terministically, as well as a mechanism to enforce a pre-deﬁned
commit order. As Figure 7 illustrates, a quantum runs concur-
rently with other quanta in the system as long as there are no
overlapping memory accesses that would violate the original de-
terministic serialization of memory operations. In case a conﬂict
happens, the quantum later in the total deterministic order gets
squashed and re-executed. Note that the total deterministic order
of quantum commits is a key component in guaranteeing deter-
ministic serialization of memory operations.
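The squash rule can be sketched as a conflict check between two quanta that ran concurrently. This is an illustration under assumptions and not the implementation evaluated later. Read and write sets are modeled as 64-bit masks over a toy address space, and only read-after-write overlap counts as a conflict, on the assumption that buffered transactional writes make WAR and WAW overlaps harmless (the "memory renaming" effect mentioned in Section 5). The check is conservative: it ignores whether the later quantum's read actually happened before the earlier quantum committed. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A quantum executed as a transaction. Addresses 0..63 are represented as
 * bits of a mask, so bit k is set if address k was read (or written). */
struct quantum {
    int      order;       /* position in the deterministic commit order */
    uint64_t read_set;
    uint64_t write_set;
};

/* True if `later` must be squashed and re-executed because it may have read
 * a location that `earlier`, which commits first, writes. */
bool must_squash(const struct quantum *earlier, const struct quantum *later)
{
    assert(earlier->order < later->order);
    return (earlier->write_set & later->read_set) != 0;
}
```

Because the commit order is fixed in advance, it is always the quantum later in that order that is squashed, never the earlier one, so the outcome of a conflict does not depend on timing.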
As mentioned earlier, a quantum is a unit of work with a de-
terministic number of instructions. In order to achieve that, the
process of breaking the execution of a thread into quanta has to
be deterministic. The choice of where to break the execution can
be done in software (compiler, or binary instrumentation) or by
the architecture. A key fact here, however, is that these transactions
are not programmer-directed transactions. In our implementation,
we keep a private counter for each thread. At the start of any
hyperblock of execution, this counter is incremented, and when
it reaches a predeﬁned value, the transaction is committed and
a new one started. Several alternatives are possible (tail end of
loops, etc).
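The counter-based quantum formation described above can be sketched as follows. The structure and function names are hypothetical, and the instrumentation that would call hyperblock_start at the start of every hyperblock is not shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-thread state for forming quanta (one instance per thread). */
struct quantum_counter {
    unsigned      count;             /* hyperblocks entered in this quantum */
    unsigned      limit;             /* predefined value that ends a quantum */
    unsigned long quanta_committed;  /* quanta ended so far */
};

/* Call at the start of every hyperblock. Returns true when the current
 * transaction should be committed and a new one started. */
bool hyperblock_start(struct quantum_counter *qc)
{
    assert(qc->limit > 0);
    if (++qc->count < qc->limit)
        return false;
    qc->count = 0;
    qc->quanta_committed++;
    return true;
}
```

The counter is private and advances only with the thread's own control flow, so quantum boundaries fall at the same points on every run with the same input, provided that control flow is itself deterministic.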
A hardware implementation would clearly be more efficient
for forming transactions and can be implemented with a simple
counter of completed instructions. Note that bounded TM systems
can be used for deterministic execution. So long as the
bounds of the TM are reached deterministically for each
transaction, it is appropriate to simply end the transaction at the
hardware bound and begin a new one. This means even simple
schemes that hold onto cache lines are able to implement
deterministic execution.
Having a total predeﬁned commit order allows uncommitted
(or speculative) data to flow between quanta. This can potentially
save a large number of squashes in applications that have more
intensive inter-thread communication. The idea is to allow a quantum
to fetch speculative data from an uncommitted quantum that
happened earlier in the deterministic total order. This is illustrated in
Figure 8, where quantum (2) fetched an uncommitted version of
A from (1). Note that without support for forwarding, (2) would
have been squashed. In order to guarantee correctness, though,
if a quantum that provided data to other quanta is later squashed,
all later quanta also need to be squashed, since they might have
consumed misspeculated data.
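[Figure 8 diagram: memory operations of processors P0 and P1 grouped
into atomic quanta numbered 1 to 4 in deterministic commit order, with
an uncommitted value flowing between quanta. Legend: memory operation;
commit token passing; N = deterministic commit order; atomic quantum;
uncommitted value flow.]
Figure 8. Avoiding unnecessary squashes with un-committed
data forwarding.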
3.4. Exploiting the Critical Path
The critical path of a parallel application is composed of multiple
sections of different threads. It can be thought of as the path of
a “criticality” token that gets passed between threads as the ap-
plication execution progresses. Intuitively, the criticality token is
passed around between threads as they synchronize and commu-
nicate with each other.
We can exploit knowledge of how programs typically are writ-
ten to adapt the size of quanta to make more efﬁcient progress on
the critical path of execution. We devised two heuristics to create
quanta that better match the critical path of a program. The first
heuristic, called sync-follow, simply ends a quantum when an
unlock operation is performed. The rationale is that when a thread
releases a lock, other threads might be spinning waiting for that
lock, so the deterministic token should be sent forward as early
as possible to allow the waiting threads to make progress. Figure 9
illustrates such a scenario.
The second heuristic relies on information about data sharing
in order to identify when a thread has potentially completed work
on shared data, and consequently ends a quantum at that time.
It does so by determining when a thread hasn't issued memory
operations to shared locations in some time, e.g., in the last 30
memory operations. The rationale is that when a thread is
working on shared data, it is expected that other threads will access
that data soon. By ending a quantum early and passing the
deterministic token, the consumer thread potentially consumes the
data earlier than if the quantum in the producer thread ran longer.
This not only has an effect on performance, but also is likely to
reduce the amount of work wasted by squashes in the transactional
memory-based implementation discussed earlier in this section.
Other quanta breaking techniques are possible (system call
boundaries, exposing the breaking to the programmer, etc). The
key to any of them, however, is that they must break in deterministic
ways. The two described above do so, and any new ones explored
must as well.
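The two heuristics of this section, combined with a plain operation-count bound, can be sketched as one quantum-breaking decision. This is an illustration and not the quantum builders used in the evaluation. All names are hypothetical. The requirement that a quantum must have touched shared data before the sharing heuristic can fire is an assumption added here, so that purely private work is not cut up every few operations.

```c
#include <assert.h>
#include <stdbool.h>

enum op_type { OP_PRIVATE_MEM, OP_SHARED_MEM, OP_UNLOCK };

struct quantum_builder {
    unsigned ops_in_quantum;   /* operations seen in the current quantum */
    unsigned max_ops;          /* base bound on quantum length */
    unsigned since_shared;     /* private memory ops since last shared access */
    unsigned idle_threshold;   /* e.g. 30 in the text */
    bool     touched_shared;   /* assumption: heuristic arms on first shared op */
};

/* Feed one operation to the builder. Returns true if the quantum should end
 * right after this operation. Every input is part of the thread's own
 * execution, so the break points are deterministic. */
bool end_quantum_after(struct quantum_builder *b, enum op_type op)
{
    assert(b->max_ops > 0 && b->idle_threshold > 0);
    bool end = false;

    b->ops_in_quantum++;
    if (op == OP_SHARED_MEM) {
        b->touched_shared = true;
        b->since_shared = 0;
    } else if (op == OP_PRIVATE_MEM) {
        b->since_shared++;
    }

    if (op == OP_UNLOCK)
        end = true;   /* sync-follow: pass the token on right after an unlock */
    else if (b->touched_shared && b->since_shared >= b->idle_threshold)
        end = true;   /* sharing heuristic: shared work appears to be finished */
    else if (b->ops_in_quantum >= b->max_ops)
        end = true;   /* base bound on quantum length */

    if (end) {
        b->ops_in_quantum = 0;
        b->since_shared = 0;
        b->touched_shared = false;
    }
    return end;
}
```

Whether an access counts as shared would come from a mechanism such as the sharing table of Section 3.2, and that classification must itself be deterministic for the break points to be.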
4. Experimental Setup
We evaluate the performance impact of deterministic execution
with a simulator written as a tool for the PIN [3] binary instru-
mentation infrastructure from Intel. Our simulator monitors the
execution of an application, builds quanta based on the instruc-
tions that are executed and then builds a schedule of their exe-
cution. The model includes the effects of serialization, memory
conﬂicts and limited buffering in the transactional memory sup-
port.
We evaluate all strategies described in Section 3: Lock refers to
the basic approach of complete serialization (Section 3.1);
SharingTable refers to the approach that recovers parallelism by
overlapping execution of private memory accesses (Section 3.2);
TM-Bounded refers to the transactional-memory based approach with
support for a single outstanding transaction; and TM-Forward
refers to the transactional-memory based approach with
unlimited buffering and support for speculative value forwarding. We
also evaluate the effect of conflict detection granularity by
supporting conflict detection at the granularity of word and 32-byte
cache-line sizes.
The simulator also models different quantum builder strate-
gies. The Base quantum builder just builds quanta based on in-
struction count (every 1,000 or 10,000 instructions). The Shar-
ingMonitor builds quanta based on the sharing monitoring heuris-
tic described in Section 3.4. For the preliminary results included
in this submission, the data for the SyncFollow quantum building
was not ready.
For this preliminary study we chose applications from the
SPLASH2 [8] benchmark suite, although this tool supports any
application running on a Linux-x86 machine. Table 2 describes
the benchmarks in more detail.
Application   Description
fft           Fast Fourier transformation
lu            LU matrix factorization
ocean         Ocean movements simulation
radix         Sorting algorithm
volrend       Volume rendering
water-ns      Water molecule system simulation
water-sp      Water molecule system simulation
Table 2. Benchmarks used in our evaluation
5. Evaluation
Figure 10 shows the scalability of our techniques compared to
the original parallel baseline. We ran the SPLASH-2 [8] parallel
benchmark suite with 2, 4 and 8 threads, using word-granularity
conflict detection and 10,000-instruction quanta. The simple
lock-based deterministic scheme shows the poorest scalability,
degrading nearly linearly with the number of threads for most
benchmarks (as one would expect). Where the performance degradation
is sublinear, it is because our deterministic scheme affects only the
multithreaded portion of an application's execution. Some of the
benchmarks with a substantial amount of single-threaded work
(e.g. lu and fft) exhibit this behavior. The scalability of the
sharing table scheme depends on the amount of data sharing in
the program, which is generally low for well-engineered
applications like SPLASH2, but is quite high in the case of radix and
ocean.
[Figure 9 diagram: processors P0 and P1 contending for lock L across
quanta numbered 1 to 4, in two panels: (a) regular quantum break;
(b) quantum break following critical path approximation. Annotations:
lock L, unlock L, spin, grab lock L, (wasted work). Legend: memory
operation; commit token passing; N = deterministic commit order;
atomic quantum.]
Figure 9. Example of a situation where better quantum breaking policies lead to better performance.
Using transactional memory helps reduce overheads substan-
tially, by allowing “memory renaming” which avoids unnecessary
conﬂicts on WAR and WAW dependences. The TM-Bounded
scheme allows one transaction commit to be buffered, showing
that high performance can be achieved with modest buffering
requirements. The TM-Forward scheme represents a very
aggressive design that speculatively forwards writes to reduce the
latency of RAW conflicts. The overhead of the TM-based schemes
is low (< 50% even with 8 threads), showing that for well-
designed parallel programs, it is possible to obtain determinism
and performance.
Decreasing the size of quanta from 10,000 to 1,000 instructions
reduces overheads for all deterministic schemes (Figure 11). We use
8 threads and word-granularity conflict detection. Smaller quanta
would in practice increase the amortized overheads of starting and
committing a quantum, but also lead to lower abort costs due to the
decreased amount of lost work. Our model ignores these costs, but
does take into account the reduced probability of conflicts due to
fewer loads and stores being inside each quantum. The sharing table
scheme, with its inability to speculatively proceed past potential
conflicts, gains the most from smaller quanta. The performance of
TM-Bounded converges with that of TM-Forward as the probability of
having a conflict drops; the latter scheme’s ability to avoid
conflicts is significant only with larger quanta.
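As a rough illustration of why smaller quanta lower the conflict probability, the sketch below uses a toy model that is not the paper's simulator: it assumes each memory access in a quantum independently conflicts with another thread's quantum with a fixed probability. The function name, the `mem_fraction` parameter, and all numeric values are assumptions chosen only for illustration.

```python
def conflict_probability(quantum_size: int, mem_fraction: float,
                         p_per_access: float) -> float:
    """Probability that a quantum sees at least one conflict (toy model).

    quantum_size:  instructions per quantum
    mem_fraction:  assumed fraction of instructions that are loads or stores
    p_per_access:  assumed independent conflict probability per access
    """
    accesses = quantum_size * mem_fraction
    return 1.0 - (1.0 - p_per_access) ** accesses


if __name__ == "__main__":
    # Compare the two quantum sizes evaluated in Figure 11 under made-up rates.
    for size in (10_000, 1_000):
        p = conflict_probability(size, mem_fraction=0.3, p_per_access=1e-4)
        print(f"{size:>6}-instruction quantum: P(conflict) = {p:.3f}")
```

Under any fixed per-access rate, this model gives the shorter quantum the lower conflict probability, which matches the qualitative trend described above. It says nothing about the start, commit, and abort costs that the text notes are ignored.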
Coarsening the granularity of conflict detection allows for a
simpler hardware implementation, at the expense of a greater number
of false conflicts. False conflicts in any scheme occur for two
reasons. The first source of false conflicts is cache line aliasing
of memory addresses. The second source is that in the non-TM based
sharing monitor scheduling technique, WAR and WAW hazards are viewed
as conflicts, when in fact they could safely be ignored with TM’s
“memory renaming”. Figure 12 compares conflict detection at word
(4-byte) and cache line (32-byte) granularity, with 8 threads and
10,000-instruction quanta. Granularity affects the overhead of the
TM-based schemes, but not to a large extent: none of the TM schemes
has an overhead of more than 2x, and with speculative forwarding
the overhead is less than 1.4x.
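The first source of false conflicts can be illustrated with a short Python sketch. It maps byte addresses to detection blocks at the two granularities compared in Figure 12 (4-byte words and 32-byte lines) and reports which blocks two quanta share. The helper names and the example addresses are hypothetical.

```python
WORD_BYTES = 4    # word-granularity conflict detection
LINE_BYTES = 32   # cache-line-granularity conflict detection


def blocks(addresses, granularity):
    """Map byte addresses to the detection blocks that contain them."""
    return {addr // granularity for addr in addresses}


def shared_blocks(writes, reads, granularity):
    """Blocks written by one quantum and read by another."""
    return blocks(writes, granularity) & blocks(reads, granularity)


# Two threads touch adjacent words that fall in the same 32-byte line.
writes_a = {0x1000}
reads_b = {0x1004}
word_conflict = bool(shared_blocks(writes_a, reads_b, WORD_BYTES))  # False
line_conflict = bool(shared_blocks(writes_a, reads_b, LINE_BYTES))  # True
```

The two accesses never touch the same word, so the conflict reported at line granularity is a false one introduced purely by the coarser detection unit.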
Using different quantum builders (Figure 13) has a mixed effect on
overhead. We use 8 threads and 10,000-instruction
quanta. Monitoring the ﬂow of shared data and using heuristics
to anticipate inter-thread communication helps some applications
(e.g. volrend, ocean) while hurting others (fmm). As these
results are preliminary, we do not evaluate the effect of producing
quanta according to observed synchronization events.
[Figure 11 plot. x-axis: benchmark (water-ns, radix, lu-nc, lu-c, fft), each with 1k- and 10k-instruction quanta; y-axis: runtime normalized to non-deterministic parallel execution (0.5 to 7.0); series: TM-Forward, TM-Bounded, SharingTable, Lock.]
Figure 11. 1,000- versus 10,000-instruction quanta.
[Figure 10 plot. x-axis: benchmark (water-ns, volrend, radix, ocean-nc, ocean-c, lu-nc, lu-c, fft), each with 2, 4 and 8 threads; y-axis: runtime normalized to non-deterministic parallel execution (0.5 to 7.0); series: TM-Forward, TM-Bounded, SharingTable, Lock.]
Figure 10. Runtime overheads with 2, 4 and 8 threads.
[Figure 12 plot. x-axis: benchmark (water-ns, volrend, radix, ocean-nc, ocean-c, lu-nc, lu-c, fft), each with word (W) and line (L) granularity; y-axis: runtime normalized to non-deterministic parallel execution (0.5 to 7.0); series: TM-Forward, TM-Bounded, SharingTable, Lock.]
Figure 12. Word- versus line-granularity conflict detection.
[Figure 13 plot. x-axis: benchmark (water-ns, volrend, radix, ocean-nc, ocean-c, lu-nc, lu-c, fft), each with the Base (B) and SharingMonitor (SM) quantum builders; y-axis: runtime normalized to non-deterministic parallel execution (0.5 to 7.0); series: TM-Forward, TM-Bounded, SharingTable, Lock.]
Figure 13. Base (B) versus SharingMonitor (SM) quantum builders.
6. Related Work
This work addresses the problem of eliminating non-determinism
from parallel executions. In some sense, the advent of
synchronization primitives was an attempt at this goal: programmers
eliminate troublesome non-determinism by preventing problematic
interleavings with synchronization. Since then, very little work has
been done to truly eliminate non-determinism from the execution of
parallel programs.
Architects and systems researchers have recognized the difficulty
non-determinism presents to software developers. For this reason, a
variety of replay techniques [9, 6, 4, 1, 2, 5] have been proposed.
Replay allows programmers to execute applications in a special
“debug mode” that enables reliable replaying of a particular
sequence of execution. Among recent advances in hardware support
for deterministic replay are ReRun [1] and DeLorean [4], which aim
at reducing log size and hardware complexity. ReRun is a hardware
memory race recording mechanism, which uses episodes to enable
replay of an execution. ReRun constructs episodes by recording
portions of an execution during which no conflicts are detected.
This novel strategy consumes little hardware state, and produces a
small race log from which executions equivalent to the original can
be replayed. DeLorean is another hardware approach to supporting
deterministic replay, in which instructions are executed as blocks
(or chunks), and the commit order of blocks of instructions is
recorded, as opposed to the order of each instruction. DeLorean is
able to record this data efficiently on-the-fly and produces very
small logs. DeLorean also uses pre-defined ordering to some extent
to further reduce the memory ordering log.
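The idea of logging chunk commit order rather than individual memory operations can be sketched in a few lines of Python. This is a software toy and not DeLorean's hardware mechanism: chunks are modeled as dictionaries of buffered writes, and `record_commit_order`, `replay`, and `round_robin_order` are hypothetical helpers. The last one illustrates how a pre-defined commit order would remove the need for an ordering log in this toy setting.

```python
def record_commit_order(commits):
    """Log only the committing thread's ID for each chunk, in commit order."""
    return [thread_id for thread_id, _chunk in commits]


def replay(log, chunks_by_thread):
    """Re-apply each thread's chunks in the logged commit order."""
    cursors = {tid: iter(chunks) for tid, chunks in chunks_by_thread.items()}
    memory = {}
    for tid in log:
        # A chunk commits all of its buffered writes at once.
        memory.update(next(cursors[tid]))
    return memory


def round_robin_order(chunks_by_thread):
    """A pre-defined order: threads commit in turn, so no log is needed."""
    remaining = {tid: len(c) for tid, c in sorted(chunks_by_thread.items())}
    order = []
    while any(remaining.values()):
        for tid in remaining:
            if remaining[tid]:
                order.append(tid)
                remaining[tid] -= 1
    return order
```

With `chunks = {0: [{"x": 1}, {"y": 2}], 1: [{"x": 5}]}`, replaying the log `[0, 1, 0]` leaves `x` at 5, while replaying `[1, 0, 0]` leaves `x` at 1. The final memory state therefore depends only on the commit order, which is why one log entry per chunk is enough in this model.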
Software approaches to deterministic replay have also shown a great
degree of success. DejaVu [2] provides deterministic replay for
applications running in a Java Virtual Machine, efficiently and
independently of the thread scheduler being used in the underlying
VM. RecPlay [7] combines memory access recording and replaying with
online data-race detection. The combination of these techniques
makes the record phase of this process very efficient, and permits
data race detection to be postponed to the replay phase, also
contributing to efficiency.
Overall, replaying has its place, but it also has its limitations.
First among them, replay does not make multithreaded execution
deterministic; it merely enables the deterministic replay of a
particular non-deterministic execution. While far more useful than
nothing at all, replay does not help give precise meaning to a
regression test. Nor does it provide confidence that the
interleavings that occur during deployment can be tested during
development.
7. Conclusion
This paper makes a bold claim: the explicitly parallel shared
memory programming model is fundamentally broken due to
non-determinism. A program's output should not vary each time it is
executed, essentially at random. The reason it does is that, up
until now, non-determinism was viewed as the only way to make
efficient parallel processors. Making the execution model
deterministic was thought to cost too much in terms of performance.
In this paper we have shown that conventional wisdom to be
false: a deterministic multiprocessor can be built to be fast. The
key is to understand what it really takes to be deterministic, which
is deterministic communication between threads. Building on re-
cent research into shared memory cache coherence systems and
transactional memory support, this paper has outlined a variety of
implementation strategies to build a fully deterministic multipro-
cessor.
Our results have shown a variety of viable techniques to achieve
determinism. These techniques have various micro-architectural
trade-offs, such as quantum size and speculation versus locking.
These trade-offs will need to be more thoroughly explored and
matched to whatever specific micro-architectural solution is
implemented. But the high-level result of this paper is that
multiprocessor systems can be deterministic with no performance
impact due intrinsically to deterministic behavior. Any performance
impact will be because of micro-architectural implementation
details. It is the subject of our future work to make these as
minimal as possible. Given that these techniques are largely built
on transactional memory support and cache coherence schemes, and
that these schemes have minimal performance impact, we expect the
same to hold for determinism.
In the future, we expect all shared memory systems will be
deterministic. This paper points the way. With little performance
impact, and obvious beneﬁts to software development and pro-
grammer sanity, there is no reason not to.
References
[1] D. Hower and M. Hill. Rerun: Exploiting Episodes for Lightweight
Memory Race Recording. In ISCA, 2008.
[2] J. Choi and H. Srinivasan. Deterministic Replay of Java
Multithreaded Applications. In SIGMETRICS SPDT, 1998.
[3] C. K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney,
S. Wallace, V. Janapa Reddi, and K. Hazelwood. PIN: Building
Customized Program Analysis Tools with Dynamic Instrumentation. In
PLDI, 2005.
[4] P. Montesinos, L. Ceze, and J. Torrellas. DeLorean: Recording and
Deterministically Replaying Shared-Memory Multiprocessor Execution
Efficiently. In ISCA, 2008.
[5] S. Narayanasamy, C. Pereira, and B. Calder. Recording Shared
Memory Dependencies Using Strata. In ASPLOS, 2006.
[6] S. Narayanasamy, G. Pokam, and B. Calder. BugNet: Continuously
Recording Program Execution for Deterministic Replay Debugging.
In ISCA, 2005.
[7] M. Ronsse and K. De Bosschere. RecPlay: A Fully Integrated
Practical Record/Replay System. ACM TOCS, 1999.
[8] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2
Programs: Characterization and Methodological Considerations. In
ISCA, 1995.
[9] M. Xu, R. Bodik, and M. Hill. A “Flight Data Recorder” for Enabling
Full-System Multiprocessor Deterministic Replay. In ISCA, 2003.