OPR by Qian, Xuehai et al.
Lawrence Berkeley National Laboratory
Recent Work
Title
OPR
Permalink
https://escholarship.org/uc/item/6fd8s143
Journal
ACM SIGPLAN Notices, 51(8)
ISSN
0362-1340
Authors
Qian, Xuehai
Sen, Koushik
Hargrove, Paul
et al.
Publication Date
2016-02-27
DOI
10.1145/3016078.2851179
 
Peer reviewed
OPR: Partial Deterministic Record and Replay for
One-Sided Communication
Xuehai Qian Koushik Sen
University of California, Berkeley
{xuehaiq,ksen}@cs.berkeley.edu
Paul Hargrove Costin Iancu
Lawrence Berkeley National Laboratory
{phhargrove,cciancu}@lbl.gov
ABSTRACT
Deterministic replay of parallel execution and communication
operations is required both by HPC debuggers and resilience
mechanisms. Although one-sided communication offers potential
performance advantages, its inherent nondeterminism makes
replaying difficult. The essential problem is that the readers of updated
shared data do not have any information on which remote threads
produced the updates. This paper presents OPR (One-sided com-
munication Partial Record and Replay), the first known software
tool for record and deterministic replay for one-sided communica-
tion. We have designed OPR from first principles with scalability
as its main goal. OPR allows the user to specify a set of tasks of
interest and then “records” their execution. The tasks in this set
can be replayed, while any other task from the original execution
can be abstracted away. OPR provides determinism by using a
combination of data- and order-replay. To ensure scalability with
the value and the order logs, we carefully optimize the recording
stage: values are logged on the first read or only when changed;
ordering is imprecisely maintained using a tailored vector clock
algorithm. Our evaluation on deterministic and non-deterministic
UPC programs shows that OPR introduced an overhead ranging
from 1.3× to 27×, when running on 1,024 cores and tracking up
to 16 tasks.
1. INTRODUCTION
High-performance computing (HPC) has been the enabler for
many science and engineering breakthroughs. Large-scale parallel
HPC simulations run detailed numerical simulations that model the
real world. They have been used for tasks ranging from understanding
protein folding to estimating climate change. The software
reliability of these applications is therefore a major concern.
Due to the overwhelmingly large number of concurrent events,
debugging parallel programs is significantly more difficult than de-
bugging serial programs. The problem becomes more challenging
at large scale due to the need to maintain and reason about the state
and interaction of different threads or processes.
Deterministic Record and Replay (R&R) is an effective approach
to debugging parallel programs. An R&R system logs sufficient
information about all sources of nondeterminism and replays the
same execution based on the log. Non-deterministic inter-thread
communication is a major source of nondeterminism. For two-
sided communication in message passing (e.g. MPI), the sender
and receiver of a communication are well-defined and are matched
at runtime according to source code specification. Therefore, each
communication could be naturally intercepted and logged at run-
time. For one-sided communication, identifying communication
is more challenging. In this paradigm, a thread could write to
any shared memory location without notifying others. Later, when
another thread reads the new value produced by an earlier writer,
the reader thread is not aware of which thread produced the value.
Compared with two-sided communication, one-sided communica-
tion removes the implicit synchronization between sender and re-
ceiver and can potentially offer better performance. This model
is used in several Partitioned Global Address Space (PGAS) lan-
guages, including UPC (Unified Parallel C) [5], Co-Array For-
tran [14], Chapel [3] and X10 [7]. Moreover, the new MPI-3 [8]
also introduced efficient support for one-sided communication for
better performance. Debugging programs based on one-sided com-
munication is more challenging due to their implicit nature. Similar
challenges are faced by resilience techniques [10, 13] using unco-
ordinated or quasi-synchronous checkpointing and recovery. Cur-
rently, no deterministic R&R tool is available for one-sided com-
munication.
In this paper, we present the first general tool, OPR (One-sided
communication Partial Record and Replay), to support deterministic
R&R for one-sided communication. Partial replay allows users to
focus on events within a small, specified set of threads. It eases
debugging and relieves users from monitoring all concurrent events
from potentially thousands of threads. OPR is built on Berkeley
UPC [1], a typical PGAS language based on one-sided
communication. In OPR, the user specifies the replay
set (R_Set), containing threads that need to be replayed. OPR then
records all relevant information related to threads in R_Set and re-
plays only those threads without executing the others. Therefore,
OPR makes it possible to debug a large-scale execution on a smaller
(potentially local) machine.
To build a tool like OPR, we face two challenges:
• How to detect and log communication in the record phase?
• How to support partial replay?
The essence of one-sided communication is that only the initiator
is aware of the operation. In a simple case where a read (load)
consumes the value produced by a remote write (Put), the reader
thread can discover that the communication happened only when it
observes the updated value. Unlike explicit message passing in MPI,
one-sided communication decouples data transfer from inter-task
synchronization. This property makes it difficult to extract
communication order from source code, and it is challenging even
to detect the threads involved in a communication at runtime.
Therefore, we use the principle of data-replay [12] to correctly detect
and replay communication. Specifically, by instrumenting shared
memory accesses, OPR logs the input values to all shared read
accesses, since those values could have been produced by remote
write accesses. To mitigate the large log size that is common to
data-replay based schemes, each thread maintains a shadow memory
through instrumentation. The shadow memory serves as a "software
cache" that keeps the view of shared memory observed by the thread
so far. It is used as a filter so that a value is logged only when
it is read by a thread for the first time or when the previous
value has changed. With value logging, each thread in R_Set
can be replayed in isolation. This data-replay based design
naturally supports partial replay because the execution of threads in
R_Set can be fully replayed using their own value logs.
Although the data-replay based approach enables replay in isola-
tion, it does not provide sufficient insights on how communications
happened between threads. To eliminate this drawback, OPR at-
tempts to infer communication during a replay phase using a com-
bination of order-replay [12] and value matching. In the record
phase, OPR runs a simplified and scalable vector clock algorithm
among threads within R_Set to get an approximation of event or-
ders of accesses to shared addresses. In the replay phase, OPR
enforces the same event order and infers the communication by
matching values of local writes and remote reads (in the value log
of remote threads). OPR does not rely on the event order for
correct replay, because the value log is still needed to simulate the
effects of threads that are not in R_Set. More importantly, our simple
vector clock algorithm runs in the instrumentation around memory
accesses, which is not guaranteed to execute atomically with the
actual memory accesses. This can compromise the correctness
of the detected event orders. For instance, a read may get an updated
value produced by a remote write while the vector clock algorithm
concludes that the write happens after the read. If replay followed
the detected order, the same read would incorrectly get the
previous value from before the remote write. In OPR, since the replay
of each thread is still driven by its value log, such imprecision can
only result in occasionally incorrect communication inference, and
never affects replay correctness.
Essentially, OPR is a hybrid R&R scheme that supports deter-
ministic R&R for one-sided communication. The data-replay prin-
ciple ensures replay correctness and is complemented with order-
replay to infer inter-thread communication based on value match-
ing. To the best of our knowledge, OPR is the first software tool to
support deterministic partial replay for one-sided communication.
The evaluation is conducted on Edison, a Cray XC30 super-
computer at NERSC. We evaluate OPR using eight NAS Parallel
Benchmarks (BT, CG, EP, FT, IS, LU, MG, SP), two applications
using work stealing from the UPC Task Library (fib, nqueens),
three applications in the UPC test suite (guppie, laplace, mcop),
Unbalanced Tree Search (UTS) and Parallel De Bruijn Graph
Construction and Traversal for De Novo Genome Assembly
(Meraculous). OPR incurs overhead ranging from 1.39× to 29.4× across
all applications and R_Set sizes (2, 4, 8, 16 threads) when
running the original program on 1,024 cores. Such overhead is
moderate and acceptable for a software-only R&R scheme.
The main contributions of this paper are:
• We introduce a novel partial deterministic R&R scheme for
one-sided communication. It allows users to deterministically
replay a subgroup of threads from a full execution without
executing the remaining threads. Previously, no software tool
supported deterministic R&R for one-sided communication.
• We implement our mechanisms on UPC in a tool called OPR
and demonstrate its practicality by evaluating the tool using
15 applications.
The rest of the paper is organized as follows. Section 2 presents
background for UPC and deterministic R&R. Section 3 explains
each step in OPR by a concrete example. Section 4 shows the value
logging algorithm based on shadow memory. Section 5 describes
the simplified vector clock algorithm used in the record phase.
Section 6 describes the communication inference mechanisms and the
whole partial replay algorithm. Section 7 discusses the
implementation details, followed by the evaluation in Section 8. The
paper concludes in Section 9.
2. BACKGROUND
2.1 Unified Parallel C
Unified Parallel C (UPC) [5] is an extension to ISO C99 that
provides a Partitioned Global Address Space (PGAS) abstraction using
Single Program Multiple Data (SPMD) parallelism. Memory is
partitioned into a task-local heap (a task is the unit of execution
in UPC) and a global heap. All tasks can access memory residing in
the global heap, while access to the local heap is allowed only for
the owner. The global heap is logically partitioned between tasks and
each task is said to have local affinity with its sub-partition.
Global memory can be accessed either using pointer dereferences (load
and store) or using bulk communication primitives (memget(),
memput()). The language provides synchronization primitives, namely
locks, barriers and split-phase barriers. Most existing UPC
implementations also provide non-blocking communication primitives,
e.g. upc_memget_nb(). The language also provides a memory
consistency model which imposes constraints on message ordering.
One-sided communication [2] is the primary mode of communication
in UPC and has also been integrated into MPI-3 [8]. Although
these two one-sided models differ semantically and operationally
in the mechanisms used to enforce synchronization, they
both aim to improve performance by decoupling synchronization
from data movement. The one-sided communication model is generally
believed to be better suited for unstructured computations and
irregular communication patterns because of its better performance
and programmability. The distinctive feature of one-sided
communication is that only the initiator is aware of a communication
and the consumer of the data is not aware of the initiator. Despite
its potential performance advantages, one-sided communication is
inherently nondeterministic, which makes program debugging very
challenging.
2.2 Deterministic Record and Replay
Deterministic Record and Replay (R&R) consists of monitoring
the execution of a multithreaded application on a parallel machine,
and then exactly reproducing the execution later. R&R requires
recording in a log all the nondeterministic events that occurred dur-
ing the initial execution. They include the inputs to the execution
(e.g., return values from system calls) and the order of the inter-
thread communications (e.g., the interleaving of the inter-thread
data dependences). During the replay phase, the logged inputs are
fed back to the execution at the correct times, and the memory ac-
cesses are forced to interleave according to the log.
Deterministic replay is a powerful technique for debugging HPC
applications. During the record phase, the tool records application
inputs, such as messages. During the replay phase, the tool replays
the faulty processes to any state of a recorded execution so that
developers can investigate how these processes reached that state.
Replay tools
for HPC applications typically fall into two categories [12]. Data-
replay tools record all incoming messages to each process during
program execution, and provide the recorded messages to processes
during replay and debugging. With this approach, developers can
replay just faulty processes rather than having to replay the entire
parallel application. In contrast, order-replay tools only record the
outcome of nondeterministic events in inter-process communica-
tion during program execution. Since order-replay only records the
ordering of nondeterministic events, it records far less data than
data-replay.
Previous research has focused on MPI R&R debugging [9].
Subgroup reproducible replay (SRR) [21] tries to find a good
balance between data-replay and order-replay by considering a hybrid
approach. SRR divides all processes into disjoint replay groups.
During the record phase, SRR records the contents of messages
across group boundaries using data-replay but records just
message orderings for communications within a group. In the replay
phase, each group can be replayed independently.
Despite having design goals similar to those of OPR, SRR is based
on MPI and two-sided communication. In this context, the source and
destination of communications are clearly specified. Therefore, the
communications between different subgroups and communications
within a subgroup can be clearly distinguished. Unfortunately,
this is not the case for one-sided communication. OPR solves this
more challenging problem with a hybrid approach in which
data-replay and order-replay are combined in a different manner. OPR
relies purely on data-replay to ensure replay correctness; therefore,
each thread records all data inputs, no matter whether they are
produced inside R_Set or not. SRR only logs data inputs produced
by processes outside the subgroup. This is not possible in OPR due
to the nature of one-sided communication (the producers of new
values are unknown to the consumer). OPR approximately detects
event orders in the record phase, enforces the same orders during
the replay phase and infers communications by comparing values.
SRR can precisely record the matching of explicit senders and
receivers and use such information to ensure the same behavior inside
a subgroup. In SRR, the correctness of replay inside a subgroup is
ensured purely by the recorded message ordering.
MPReplay [20] proposes architectural support for deterministic
R&R of MPI programs. The hardware support focuses on
nondeterministic synchronization events such as wildcard receives (e.g.
MPI_ANY_SOURCE, MPI_ANY_TAG, etc.). These are MPI-specific
mechanisms and are not directly applicable in our context. However,
architectural support for one-sided communication is critical
to reducing the overhead. We leave it as future work.
3. OVERVIEW OF OPR
In this section, we first show an example taken verbatim from
the UTS benchmark that employs nondeterminism by design, using
one-sided communication and data races in synchronization. Then
we explain the workflow of OPR based on this example.
3.1 Example: Communication in UTS
The Unbalanced Tree Search (UTS) benchmark presents a syn-
thetic tree-structured search space that is highly imbalanced. Paral-
lel implementation of the search requires continuous dynamic load
balancing to keep all processors engaged in the search. We con-
sider a dynamic load balance implementation using asynchronous
work-stealing. In the shared memory algorithm in UPC, the depth-
first search (DFS) stack is partitioned into two regions: local and
shared. Steal operations are necessary to accomplish load
balancing; nodes are transferred through one-sided communication. To
amortize the manipulation overheads, nodes can only be moved in
chunks of size k between the local and shared regions or between
the shared regions of two different threads' stacks. A more detailed
description of the algorithms can be found in [16].
Listing 1 shows two important functions related to work stealing.
checkSteal is called by a thread that may share a certain amount
of its own work with another thread. The thread first
checks whether it has enough work to share (line 28). If so, it
updates the local stack information (lines 32 to 38). Finally, it
publishes the work using one-sided communication by writing directly
to the work stack of the remote thread that requested the work
(lines 40 to 43). The first write (line 41) indicates the stolen
work amount. The second write (line 43) indicates the stolen work
address. These two variables are later read by the remote thread to
complete the work stealing. The upc_fence between the two writes
ensures that the remote thread reads the updates in the correct order.
 1 int ss_steal(StealStack *s, int victim, int k) {
 2   long stealIndex;
 3   long stealAmt;
 4
 5   stealIndex = WAITING_FOR_WORK;
 6   while (stealIndex == WAITING_FOR_WORK) {
 7     stealIndex = s->stolen_work_addr;
 8   }
 9
10   if (stealIndex>=0) {
11     upc_fence;
12     stealAmt=s->stolen_work_amt;
13     SMEMCPY(&((s->stack)[s->top]),
14             &(stealStack[victim]->stack_g)[stealIndex],
15             stealAmt * sizeof(Node));
16     s->nSteal += stealAmt;
17   }
18   ....
19 }
20
21 void checkSteal(StealStack *ss){
22   long d, position;
23   int stealAmt;
24   int requestor;
25
26   if (doSteal) {
27     int d = ss_localDepth(ss);
28     if (d > 2 * chunkSize) {
29       //enough work to share
30       requestor = ss->req_thread;
31       if (requestor >= 0){
32         stealAmt = (d/2/chunkSize)*chunkSize;
33         //make chunk(s) available
34         position = ss->local;
35         ss->local += stealAmt;
36         ss->nRelease++;
37         //advertise correct amount of work left locally
38         ss->workAvail = d - stealAmt;
39       }
40       ss->req_thread = REQ_AVAILABLE;
41       stealStack[requestor]->stolen_work_amt = stealAmt;
42       upc_fence;
43       stealStack[requestor]->stolen_work_addr = position;
44       return;
45     }
46   }
47   ....
48 }
Listing 1: Communication in UTS Algorithm
ss_steal is called by a thread that has already posted the stealing
request and is waiting for stolen work to be granted by
a remote thread. stealIndex is initially WAITING_FOR_WORK,
indicating that the thread is waiting. The thread then busy-waits in
a while-loop until s->stolen_work_addr is updated by a remote
thread using one-sided communication. The local thread observes
the update through a local read into stealIndex (line 7) and then
leaves the loop. If some work was successfully stolen, the local
thread then reads stolen_work_amt, which the remote thread wrote
before stolen_work_addr, to find out the amount of stolen work.
Finally, it completes the work stealing by copying data from the
stack of the remote thread to its local stack.
This example illustrates a typical use case for one-sided
communication. Its essential properties are: (1) a thread can update
shared addresses of remote threads directly without any involvement
of those threads; and (2) only the initiator is aware of a
communication, so there is no explicit match between sender and
receiver. Specifically, a thread that receives the stolen data can
only implicitly find the thread that provided the stolen work through
the owner of the address (s->stolen_work_addr), but there is no
explicit send and receive operation posted for this communication.
This example also illustrates nondeterministic behavior. In different
executions, a thread may receive the stolen work from different
remote threads at different execution points. It is challenging
to debug large-scale executions with such nondeterminism,
since developers will be overwhelmed by the different thread
interactions across executions.
Figure 1: Overview of OPR. [Three stages: (1) Record with full
execution: the original UPC program is instrumented into a record
binary and executed on a modified UPC runtime, producing a value log
for the replay group and a distributed event order log. (2) Offline
log processing: produces a replay order log with wait and wake vector
clocks (e.g. T3: SN(34), wait: [12,0,0,0], wake: [1,0,0,0]) and a
write check log (e.g. T1: SN(14), check T0 SN(30), check T1 SN(28)).
(3) Partial replay.]
3.2 OPR: Deterministic Partial R&R
OPR involves the following steps (see Figure 1). Overall, OPR
can deterministically reproduce the same execution of threads in
R_Set as in a full execution.
Record with full execution. The user first specifies the replay
set, R_Set, the subset of threads that need to be replayed. A modified
compiler is used to build a binary with recording instrumentation.
The record binary is then executed at full scale on a modified UPC
runtime system that intercepts Get and Put operations to the shared
memory space. This is called the record phase. This step generates
two kinds of logs: a value log and a distributed event
order log. The value log for each thread in R_Set contains the
inputs for reads at different points. In the replay phase, the values
are fed into the same threads at the correct points. The event order
log records an approximation of the order of conflicting operations
accessing the shared addresses. After log processing, this information
is used to guide execution and infer communication order in the
replay phase.
In Figure 1, the shaded region indicates the replay group. In
each thread, the white dots indicate read accesses that do not have
value log entries; the black dots indicate read accesses that generate
value log entries; the grey dots indicate write accesses. The arrows
indicate detected event orders. Some orders exist between write and
read accesses, but the reads may not consume the values produced by
the writes; such relationships need to be checked in the
replay phase. Also, some read accesses could get values produced
by threads outside R_Set, such as the second black dot in the last
thread in R_Set.
Offline log processing. Based on the distributed event order
log, the offline pass generates a replay order log for each thread in
R_Set. The event orders are translated into wait and wake vector
clocks for the relevant operations so that threads in R_Set could
collaboratively enforce detected event orders. In addition, a write
check log is generated for each thread so that it could try to match
its own written values with remote read values in certain ranges at
correct points in replay phase. We use this value based approach
to infer communications between threads in R_Set because there
is no explicit matching between senders and receivers in one-sided
communication.
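As a rough illustration of the offline pass, the sketch below groups detected event orders by the operation that has to wait and turns them into one wait vector per operation. The tuple layout, the names build_wait_clocks and may_run, and the reading of a wait vector (entry j is the SN that thread j must have reached before the operation may run) are assumptions made for this example. The wake side and the write check log are not modelled.

```python
from collections import defaultdict


def build_wait_clocks(event_orders, r):
    """Group event orders into one wait vector per waiting operation.

    event_orders: iterable of (src_thread, src_sn, dst_thread, dst_sn), meaning
        "access src_sn of src_thread precedes access dst_sn of dst_thread".
    r: number of threads in R_Set (threads are numbered 0..r-1 here).
    Returns {(dst_thread, dst_sn): wait_vector}.
    """
    waits = defaultdict(lambda: [0] * r)
    for src, src_sn, dst, dst_sn in event_orders:
        vec = waits[(dst, dst_sn)]
        # Only the latest required access of each source thread matters.
        vec[src] = max(vec[src], src_sn)
    return dict(waits)


def may_run(wait_vector, progress):
    """True once every thread has reached the SN required by wait_vector."""
    return all(p >= w for p, w in zip(progress, wait_vector))


# Hypothetical event orders among three threads.
orders = [(1, 2, 0, 2), (2, 1, 0, 2), (0, 1, 1, 2)]
waits = build_wait_clocks(orders, r=3)
print(waits[(0, 2)])
print(may_run(waits[(0, 2)], [1, 1, 1]))
print(may_run(waits[(0, 2)], [1, 2, 1]))
```

Keeping a single vector per waiting operation means a replaying thread only needs to compare it against the current progress of the other threads in R_Set.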
Partial replay. OPR only executes the threads in R_Set in par-
tial replay phase. Each thread reproduces the same execution by
injecting the values in its value log at correct points. The opera-
tions from different threads are scheduled to execute in an order
according to the replay order log. In addition, after a thread per-
forms certain writes, it needs to check whether all the local writes
so far could contribute to some read value log entries of remote
threads. On a value match, a communication is assumed to happen
between the two threads. This process is driven by the write check
log. For each read log entry of a thread in R_Set, OPR infers
one of two possibilities: (a) the value was produced by a thread
inside R_Set, in which case the specific thread is identified; or
(b) the value was not produced by any thread inside R_Set. In
Figure 1, the question marks indicate the value matching operation.
Now let us consider how OPR works for the UTS example
in Listing 1. Assume R_Set is {T0, T2} and that, in a period of
execution, T0 steals from T2 and T3. In the record phase, for both
steals, OPR logs the values of s->stolen_work_addr and
s->stolen_work_amt at the correct time. In the replay phase,
these values are fed into T0 at the same execution points. This
ensures that T0 is replayed correctly in isolation. In addition, based
on the logs generated by the offline processing step, the write
operations in T2 are executed before the read operations in T0 that
caused the exit of the while-loop. Moreover, after its writes
are performed, T2 checks whether the writes performed so far
match a read value log entry in T0. In our case, since T0
indeed steals work from T2, there will be matches for both values of
s->stolen_work_addr and s->stolen_work_amt. Based
on the matched values, OPR infers that the communication
happened from T2 to T0.
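The value matching step in this example can be sketched as follows. This is a simplified illustration and not OPR's actual inference procedure, which is driven by the write check log. The function name infer_producers, the tuple layouts and all concrete SNs and values are hypothetical.

```python
def infer_producers(read_entries, writes_in_rset):
    """For each logged read, name a matching writer inside R_Set or None.

    read_entries: list of (sn, addr, value) for the consumer thread; the
        address is assumed to be re-derived during replay, since the value
        log itself does not store it.
    writes_in_rset: list of (thread, addr, value) for writes already replayed
        by threads in R_Set, in replay order.
    """
    inferred = []
    for sn, addr, value in read_entries:
        producer = None
        for thread, w_addr, w_value in writes_in_rset:
            if w_addr == addr and w_value == value:
                producer = thread  # a later matching write overrides an earlier one
        inferred.append((sn, producer))
    return inferred


# T0 steals first from T2 (inside R_Set) and then from T3 (outside R_Set).
t0_reads = [
    (7, "stolen_work_addr", 128),
    (9, "stolen_work_amt", 4),
    (21, "stolen_work_addr", 64),
    (23, "stolen_work_amt", 2),
]
t2_writes = [("T2", "stolen_work_amt", 4), ("T2", "stolen_work_addr", 128)]

result = infer_producers(t0_reads, t2_writes)
for sn, producer in result:
    print(sn, producer if producer else "outside R_Set")
```

Because the match is purely value based, two threads that happen to write the same value to the same address cannot be told apart. This is one way the false positives mentioned in the text can arise.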
In OPR, we use the principle of data-replay (based on value log)
to ensure the correct replay of each thread in R_Set. We use order-
replay and value matching to infer the communications between
threads in R_Set. This design principle is critical since relying
purely on order-replay would require replaying all threads, which does
not satisfy the requirement of partial replay. More importantly, the
instrumentation-based approach makes it extremely challenging to
produce precise event orders. The current approach tolerates such
imprecision, as it only leads to false positives or negatives in
communication inference and does not affect replay correctness.
4. VALUE LOGGING
OPR uses data-replay to ensure that each thread in R_Set can
be replayed in isolation. Specifically, we could log the values of
each read of threads in R_Set and then feed them to the same read
operations during replay. To specify the execution point of reads,
each thread in R_Set maintains a sequence number (SN) that is
locally increased on each memory access. The SN of T_i is denoted as
V_i[i]; we use this notation because it is also used in the vector
clock algorithm.
The above simple algorithm can lead to large log size since a
value needs to be logged for each read. To mitigate this prob-
lem, OPR maintains a shadow memory in each thread in R_Set.
Algorithm 1: Value Logging by thread T_i in R_Set.
Data: V(a, len): values of (a, len) in T_i
      V_sm(a, len): values of (a, len) in shadow memory of T_i
      V_i[i]: the sequence number (SN) of T_i
Output: ValLog_i: read value log of T_i
        Value log entry format: (V_i[i], len, val)
1 switch type of an access e_i do
2   case e_i is a read of range (a, len)
3     if V(a, len) ≠ V_sm(a, len) then
4       new ValLog_i entry: (V_i[i], len, V(a, len))
5       V_sm(a, len) ← V(a, len)
      end
6   case e_i is a write of range (a, len)
7     V_sm(a, len) ← V(a, len)
  endsw
8 V_i[i] ← V_i[i] + 1
The shadow memory keeps the values of addresses that a thread has
observed so far. In essence, it represents the thread's current local
view of shared memory. Using the shadow memory, OPR logs a value
only when it is read for the first time or when it has
changed.
Algorithm 1 shows the details of the value logging mechanism
in OPR. Each thread maintains its local shadow memory, V_sm, which
is initially empty. On each read, V(a, len) is the value obtained
from the current shared memory. If this value is the same as the
current value in V_sm, no log entry is generated. If not, a new value
log entry is generated and V_sm is updated, so that next time T_i
will not log the same value again. On each write, V(a, len) is the
written value, and it also updates the shadow memory. This avoids
logging the values generated by the local thread and also avoids
logging addresses of dynamically allocated objects (see Section 7 for
more details). The SN (V_i[i]) is updated on both read and write
accesses; this value is part of the vector clock that is used in
tracking event orders.
Each value log entry includes three fields. V_i[i] indicates that
the value should be consumed by T_i in the replay phase when its SN
reaches the same number; len and val give the length and the content
of the value. We do not include addresses in the log since they are
available during replay. Note that some read addresses could differ
between the record and replay phases, as a thread may access
dynamically allocated memory objects. This does not affect replay
correctness and is discussed in Section 7.
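The record-side filter of Algorithm 1 and the corresponding value injection during replay can be sketched as below. This is a minimal model under stated assumptions: shared memory is represented by the values passed to on_read, the shadow memory is byte granular, and the class names ShadowMemory, ValueLogger and ValueReplayer are invented for the example rather than taken from OPR.

```python
class ShadowMemory:
    """Byte-granular view of shared memory as seen by one thread."""

    def __init__(self):
        self.bytes_seen = {}  # address -> last byte value observed or written

    def matches(self, addr, value):
        return all(self.bytes_seen.get(addr + k) == b for k, b in enumerate(value))

    def update(self, addr, value):
        for k, b in enumerate(value):
            self.bytes_seen[addr + k] = b

    def load(self, addr, length):
        return bytes(self.bytes_seen[addr + k] for k in range(length))


class ValueLogger:
    """Record side: log a read only if it differs from the shadow memory."""

    def __init__(self):
        self.sn = 0                  # sequence number of this thread
        self.shadow = ShadowMemory()
        self.log = []                # entries (sn, len, val); no address stored

    def on_read(self, addr, value):
        if not self.shadow.matches(addr, value):
            self.log.append((self.sn, len(value), value))
            self.shadow.update(addr, value)
        self.sn += 1

    def on_write(self, addr, value):
        self.shadow.update(addr, value)
        self.sn += 1


class ValueReplayer:
    """Replay side: inject a logged value when the SN matches."""

    def __init__(self, log):
        self.sn = 0
        self.shadow = ShadowMemory()
        self.pending = list(log)

    def read(self, addr, length):
        if self.pending and self.pending[0][0] == self.sn:
            _, _, value = self.pending.pop(0)
            self.shadow.update(addr, value)
        self.sn += 1
        return self.shadow.load(addr, length)

    def write(self, addr, value):
        self.shadow.update(addr, value)
        self.sn += 1


rec = ValueLogger()
rec.on_read(0, b"\x01\x00")   # first read: logged at SN 0
rec.on_read(0, b"\x01\x00")   # unchanged: filtered
rec.on_write(0, b"\x05\x00")  # local write updates the shadow memory
rec.on_read(0, b"\x05\x00")   # own value read back: filtered
rec.on_read(0, b"\x09\x00")   # changed by a remote thread: logged at SN 4

rep = ValueReplayer(rec.log)
replayed = [rep.read(0, 2), rep.read(0, 2)]
rep.write(0, b"\x05\x00")
replayed += [rep.read(0, 2), rep.read(0, 2)]
print(rec.log)
print(replayed)
```

Five accesses produce only two log entries here, and the replayer reproduces all four read values without any other thread running.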
5. TRACKING EVENT ORDERS
5.1 Vector Clock Algorithm
We use a vector clock to obtain the event orders of conflicting
accesses in the record phase. This information is used to schedule the
conflicting accesses within R_Set in the replay phase and to infer
communications. The vector clock [17] is a powerful tool for tracking
causal relationships of events in concurrent systems. Conventional
vector clock algorithms assume an explicit sender and receiver that
are matched when a communication happens. We present a vector
clock algorithm based on [18] and propose mechanisms to generate
event orders of conflicting accesses in one-sided communication.
The algorithm is shown in Algorithm 2 as a function OnMemAcc.
Let V_i be an n-dimensional vector of natural numbers for thread
T_i, 1 ≤ i ≤ n. Let V^a_x and V^w_x be two additional n-dimensional
vectors for each shared address x; we call V^a_x and V^w_x the access
vector clock and the write vector clock, respectively. All the vector
clocks are initialized to 0 at the beginning of computation. For two
n-dimensional vectors we say that V ≤ V′ if and only if V[j] ≤
V′[j] for all 1 ≤ j ≤ n; max{V, V′} is defined as the vector
with max{V, V′}[j] = max{V[j], V′[j]} for each 1 ≤ j ≤ n.
V_i[i] also represents the SN of the event in T_i which caused V_i[i]
to increase to its current value. In OPR, we only run the vector clock
algorithm within R_Set, therefore n = r, where r is the size of R_Set.
Algorithm 2: Vector Clock for Shared Memory
Procedure OnMemAcc(e_i in T_i, AccRange)
Data: V_i: vector clock of thread T_i
      V^w_x: write vector clock of address x
      V^a_x: access vector clock of address x
      All vector clocks have r entries, where r is the size of R_Set.
Output: O_i: event orders that must be obeyed in replay
1 V_i[i] ← V_i[i] + 1
2 switch type of e_i do
3   case e_i is a read
4     foreach x ∈ AccRange do
5       O_i ← O_i ∪ GO(V_i, V^w_x, i)
6       V_i ← max{V_i, V^w_x}
7       V^a_x ← max{V^a_x, V_i}
      end
8   case e_i is a write
9     foreach x ∈ AccRange do
10      O_i ← O_i ∪ GO(V_i, V^a_x, i)
11      V^w_x ← V^a_x ← V_i ← max{V^a_x, V_i}
      end
  endsw
Procedure GO
Input: V_my, V_m, my_pid
Output: O_n: new event orders
12 foreach 1 ≤ i ≤ r, i ≠ my_pid do
13   if V_m[i] > V_my[i] then
14     O_n ← O_n ∪ (T_i : V_m[i] → T_my : V_my[my])
     end
   end
15 return O_n
Figure 2: Running Example of Algorithm 2.
Thread timelines (V_i after each access):
  T1: w(x) [1,0,0]; r(x) [2,2,1]
  T2: r(y) [0,1,1]; w(x) [1,2,1]
  T3: w(y) [0,0,1]
Trace of vector clock updates on shared addresses:
  Access     V^a_x     V^w_x     V^a_y     V^w_y
  (initial)  [0,0,0]   [0,0,0]   [0,0,0]   [0,0,0]
  T3: w(y)   ---       ---       [0,0,1]   [0,0,1]
  T1: w(x)   [1,0,0]   [1,0,0]   ---       ---
  T2: r(y)   ---       ---       [0,1,1]   ---
  T2: w(x)   [1,2,1]   [1,2,1]   ---       ---
  T1: r(x)   [2,2,1]   ---       ---       ---
It has been proved that OnMemAcc ensures e_i → e_j (→ indicates
a causal relationship) if and only if V(e_i) < V(e_j) [19]. Using this
property, by keeping and comparing the vector clocks of all memory
accesses in an external observer, we can obtain the complete causal
relationship of events. However, this algorithm needs to be adapted
to generate orders of conflicting accesses in our scenario.
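The comparison an external observer would perform can be sketched in a few lines. The helper names are invented for this example, and the strict order V < V′ is taken in the usual vector clock sense (V ≤ V′ componentwise and V ≠ V′), which the text does not spell out.

```python
def leq(v, w):
    """Componentwise V <= V'."""
    return all(a <= b for a, b in zip(v, w))


def lt(v, w):
    """Strict order: V <= V' componentwise and V != V'."""
    return leq(v, w) and list(v) != list(w)


def vmax(v, w):
    """Componentwise maximum max{V, V'}."""
    return [max(a, b) for a, b in zip(v, w)]


def relation(v, w):
    """Classify two event clocks as seen by an external observer."""
    if lt(v, w):
        return "before"
    if lt(w, v):
        return "after"
    if list(v) == list(w):
        return "same"
    return "concurrent"


print(relation([1, 2, 1], [2, 2, 1]))  # clocks of causally ordered accesses
print(relation([1, 0, 0], [0, 0, 1]))  # clocks of unrelated accesses
print(vmax([2, 0, 0], [1, 2, 1]))
```

Two clocks that are incomparable in both directions correspond to concurrent accesses, for which no order needs to be enforced.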
Our goal is to generate the order of conflicting accesses observed
during record, and to replay these conflicting memory accesses
according to the recorded order. When a thread performs a memory
access to a shared address, it can only obtain the current vector
clocks associated with this location but cannot observe the vector
clocks of remote memory accesses. After each access e_i in T_i, two
vector clocks are available to T_i: one is the updated V_i after the
access (denoted as V_i(e_i)) according to Algorithm 2; the other is
V^a_x (if e_i is a write) or V^w_x (if e_i is a read) from shared
memory, assuming e_i accesses x. Based on this information, T_i can
only infer whether there is a causal relationship between e_i and the
most recent access to x (and the accesses that are causally ordered
before it). However, from the vector clock of the most recent access,
V^a_x or V^w_x, T_i cannot identify the specific remote access and
cannot generate orders between two specific accesses. Unlike in [18],
there is no "external observer" that keeps the vector clocks of
previous memory accesses in all threads.
Figure 3: Event Order Detection. [Timeline of threads T1 to T4 with
accesses r(x), r(x), w(x) and w(z), and markers GL0 and GL1.]
Figure 2 shows a running example of Algorithm 2. We consider three
threads and two shared memory addresses (x and y). Vi (i=1,2,3) after
each memory access is indicated below the memory accesses. On the right,
we show the trace of the Va{x,y} and Vw{x,y} updates. Consider the
second access in T1 (i.e., r(x)): V1(r(x)) is [2,2,1] and Vwx is
[1,2,1]. T1 can infer that the current operation r(x) is ordered after
the most recent write to address x. However, from [1,2,1] it does not
know which remote access previously wrote to x. The issue resembles the
one in one-sided communication, where a read does not know the most
recent writer of a memory location. It is impractical to let threads
keep the vector clocks of previous memory accesses and pass such
information around. Therefore, the event order has to be inferred from
limited information.
We propose a simplified mechanism to generate the causal relationship of
events conservatively. Consider Vi(ei0): it captures the set of all
accesses from all threads that causally happened before ei0. We can
consider it as a global layer, denoted as GL[ei0]. It captures the
boundary of the most recent accesses in all threads that are causally
ordered before ei0. When Ti performs the next memory access ei1, Vi(ei1)
similarly represents a different global layer GL[ei1]. To reproduce the
event orders in an execution, it is sufficient to execute ei1 after the
accesses in each remote thread on GL[ei1]. These accesses are denoted as
Vi(ei1)[j], j ≠ i. It is possible that Vi(ei1)[j] = Vi(ei0)[j] for some
j, which means that Tj did not perform any access after ei0 that
causally happened before ei1. In this case, no new causal relationship
needs to be generated. Therefore, the condition for generating a causal
relationship is: Vi(ei1)[j] → ei1 if j ≠ i and Vi(ei1)[j] ≠ Vi(ei0)[j].
The advantage of this approach is that we can generate causal
relationships between individual accesses, so that these event orders
can be reproduced in the replay phase.
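The global-layer rule can be illustrated with a few lines of Python. This is a sketch under our own conventions, not OPR code: the function name new_orders is hypothetical, threads are indexed from 0 (so T1 is index 0), and an order is returned as a tuple (remote thread, remote SN, local thread, local SN).

```python
def new_orders(i, v_prev, v_next):
    """Orders implied when thread i moves from global layer v_prev to v_next.

    An order is emitted for every remote thread j whose entry changed,
    i.e. j != i and v_next[j] != v_prev[j].
    """
    return [
        (j, v_next[j], i, v_next[i])
        for j in range(len(v_next))
        if j != i and v_next[j] != v_prev[j]
    ]


# T1 in Figure 2: clock [1,0,0] before r(x), [2,2,1] after it.
orders = new_orders(0, [1, 0, 0], [2, 2, 1])
```

For this input the rule yields two orders: the access with SN 2 in T1 must wait for SN 2 of T2 and for SN 1 of T3.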
Figure 3 shows the concept. From the vector clocks, T2 can identify the
difference between GL0 and GL1. According to our rule, the second r(x)
in T2 is causally ordered after w(x) in T1. In T3, there is no memory
access performed between the two global layers, so no order is
generated. T4 performs a memory access w(z), but it does not conflict
with r(x) in T2, so there is no causal relationship between the two and
no order is generated either. Now let us apply this mechanism to the
example in Figure 2. Before r(x) in T1 is performed, the current vector
clock in the thread is [1,0,0]; after the operation, the vector clock
becomes [2,2,1]. According to the rule, r(x) needs to be ordered after
w(x) in T2 and w(y) in T3. Note that w(y) in T3 does not conflict with
r(x) in T1, but it is causally ordered before r(x) in T1. Specifically,
the vector clock obtained in T1 at r(x) (most recently updated by w(x)
in T2) includes w(y) in T3 because of T2's r(y); w(y) in T3 and r(y) in
T2 are indeed conflicting accesses.
The example reveals the relation between causal relationships and the
order between conflicting accesses. Algorithm 2 can produce the causal
relationship between events in different threads precisely. However, not
all pairs of causally ordered accesses are conflicting accesses, because
program order also contributes to the causal relationship. This is
exactly why, in Figure 2, r(x) in T1 is causally ordered after w(y) in
T3: w(y) in T3 conflicts with r(y) in T2; r(y) and w(x) in T2 are
ordered by program order; and w(x) in T2 conflicts with r(x) in T1. By
transitivity, r(x) in T1 is also causally ordered after w(y) in T3.
Therefore, our order generation rule produces a superset of the orders
between conflicting accesses.
Concretely, the order generation rule is implemented by GO in
Algorithm 2. It takes two vector clocks (Vmy and Vm) and the thread id
of the calling thread as inputs. Vmy is the vector clock of Ti before
the current memory access merges in the clock from shared memory (its
own entry has already been incremented at line 1). Vm is the vector
clock obtained from shared memory; it is either Vax (for writes) or Vwx
(for reads). This function is called before the vector clock updates in
the local thread and in shared memory (lines 6-7 and 11). GO checks
exactly the condition shown above (line 13) and emits an order when it
holds (line 14). An event order in OPR has the format
(Ti : SNi → Tj : SNj). In the replay phase, it enforces that the access
in Tj with SNj is executed after the access in Ti with SNi.
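To make the interplay of OnMemAcc and GO concrete, the sketch below runs Algorithm 2 sequentially in Python on the accesses of Figure 2. It is an illustration only: the class name Recorder, the 0-based thread indices, and the order tuples (Ti, SNi, Tj, SNj) are our assumptions, and a single-process simulation sidesteps the atomicity issues discussed in Section 5.2.

```python
from collections import defaultdict


def vmax(v, w):
    """Entry-wise maximum of two vector clocks."""
    return [max(a, b) for a, b in zip(v, w)]


class Recorder:
    """Sequential simulation of Algorithm 2 for r threads (0-based ids)."""

    def __init__(self, r):
        self.r = r
        self.thread_vc = [[0] * r for _ in range(r)]    # Vi
        self.write_vc = defaultdict(lambda: [0] * r)    # Vwx per address
        self.access_vc = defaultdict(lambda: [0] * r)   # Vax per address
        self.orders = []                                # (Ti, SNi, Tj, SNj)

    def _go(self, v_my, v_m, my):
        # Lines 12-15: one order per remote entry that is ahead of ours.
        return [(j, v_m[j], my, v_my[my])
                for j in range(self.r) if j != my and v_m[j] > v_my[j]]

    def on_mem_acc(self, i, kind, addrs):
        v = self.thread_vc[i]
        v[i] += 1                                       # line 1
        for x in addrs:
            if kind == "read":
                self.orders += self._go(v, self.write_vc[x], i)    # line 5
                v[:] = vmax(v, self.write_vc[x])                   # line 6
                self.access_vc[x] = vmax(self.access_vc[x], v)     # line 7
            else:
                self.orders += self._go(v, self.access_vc[x], i)   # line 10
                v[:] = vmax(self.access_vc[x], v)                  # line 11
                self.access_vc[x] = list(v)
                self.write_vc[x] = list(v)
        return list(v)


# Replay the access sequence of Figure 2 (T1, T2, T3 are ids 0, 1, 2).
rec = Recorder(3)
trace = [
    rec.on_mem_acc(2, "write", ["y"]),
    rec.on_mem_acc(0, "write", ["x"]),
    rec.on_mem_acc(1, "read", ["y"]),
    rec.on_mem_acc(1, "write", ["x"]),
    rec.on_mem_acc(0, "read", ["x"]),
]
```

The clocks collected in trace match the values printed in Figure 2, and the last two entries of rec.orders are the orders that place r(x) in T1 after w(x) in T2 and after w(y) in T3.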
5.2 Scalability Enhancements
Algorithm 2 is able to capture all causal relationships between
accesses to shared memory. However, its overhead is high for the
following reasons.
Storage overhead. Two vectors (Vax and Vwx) are associated with each
shared memory location. This makes the algorithm impractical to
implement.
Atomic vector clock updates. The algorithm implicitly requires that the
updates to vector clocks happen atomically with the actual memory
accesses. To satisfy this requirement with software instrumentation,
each memory access must be paired with a lock operation when modifying
the vector clocks. This is challenging to achieve at large scale on
distributed memory without hardware support.
Update order requirement. The updates of the vector clocks associated
with memory addresses (Vwx and Vax) (lines 7 and 11) should be
consistent with program order. This seems obvious, but in reality the
updates to vector clocks are ordinary accesses to shared memory, and the
UPC runtime may reorder them. Strictly enforcing the order requires
fences, which leads to extra overhead.
To make Algorithm 2 practical, we apply several scalability enhancements
that compromise precision. This is not an issue for OPR: replay
correctness is ensured by data-replay, so unnecessary event orders can
be tolerated.
To reduce storage overhead, we make a set of shared addresses share the
same vectors (Vax and Vwx). All vector clock updates due to this set of
addresses are performed on the same vector clocks. We naturally
partition the shared address space according to the affinity (owner) of
each shared address in UPC. Specifically, shared addresses with the same
owner use a common pair of vector clocks. In effect, this makes accesses
to addresses with the same owner "conflicting", which causes unnecessary
event orders.
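The storage reduction can be pictured as replacing the per-address clock tables with per-owner tables. The Python sketch below is hypothetical: OwnerClocks and owner_of are our names, and the block-cyclic owner function with BLOCK_BYTES and THREADS merely stands in for UPC affinity; it is not how the UPC runtime computes an owner.

```python
class OwnerClocks:
    """One (Vwx, Vax) pair per owner instead of one pair per address."""

    def __init__(self, r, owner_of):
        self.r = r
        self.owner_of = owner_of   # maps a shared address to its owner
        self.write_vc = {}         # owner -> shared write clock
        self.access_vc = {}        # owner -> shared access clock

    def clocks_for(self, addr):
        owner = self.owner_of(addr)
        w = self.write_vc.setdefault(owner, [0] * self.r)
        a = self.access_vc.setdefault(owner, [0] * self.r)
        return w, a


# Illustrative owner function: 1 KiB blocks dealt round-robin to 4 threads.
BLOCK_BYTES = 1024
THREADS = 4


def owner_of(addr):
    return (addr // BLOCK_BYTES) % THREADS


clocks = OwnerClocks(r=3, owner_of=owner_of)
w_a, a_a = clocks.clocks_for(0)      # owner 0
w_b, a_b = clocks.clocks_for(100)    # also owner 0: same clock objects
w_c, a_c = clocks.clocks_for(1024)   # owner 1: a different pair
```

With this layout the number of clock pairs is bounded by the number of owners, at the cost that any two accesses to the same owner's memory now look like accesses to the same location.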
Algorithm 3: Value check log generation
Procedure ValCheckGen(ValLogi, i ∈ 1, ..., r)
  Output: VCLi: a map from local SN to remote SN, i ∈ 1, ..., r
  1  foreach i ∈ 1, ..., r do
  2    foreach val ∈ ValLogi do
  3      foreach j ∈ 1, ..., r do
  4        if j ≠ i then
  5          VCLj[Vval[j]] ← Vval[i]
           end
         end
       end
     end
Regarding the atomic vector clock update requirement, there is no
efficient way to ensure that the actual memory accesses happen
atomically with the instrumentation functions without introducing huge
overhead. We choose to give up this requirement. The consequence is that
the generated event orders could be incorrect (e.g., a read happens
after a write, but according to the generated order the write happens
after the read). This only causes some incorrectly inferred
communications and does not affect replay correctness. For a similar
reason, we do not use fences to enforce the order of vector clock
updates. To eliminate some false orderings, an order is generated for a
read only when a new value is logged because of a value change.
6. PARTIAL REPLAY
In this section, we first describe two offline log processing steps
to generate the order log and the value check log. Then the partial
replay algorithm is presented.
6.1 Order Log Generation
The order log is used to reproduce the orders generated in the record
phase. For each memory access ei in Ti with SNi, we introduce two maps:
the wake_up map (wake) and the wait_for map (wait). Each of them maps an
SN to a vector whose size equals that of R_Set. wake[SNi][j] (the j-th
element in the vector mapped from SNi) requires that, after the memory
access with SNi in Ti is executed, Ti sends its sequence number SNi to
Tj, which is supposed to wait for SNi. wait[SNj][i] holds a sequence
number SNi from Ti: before the memory access with SNj in Tj can be
executed, Tj needs to wait for SNi, which is supposed to be sent by Ti.
With this notion, each order (Ti : SNi → Tj : SNj) generated in the
record phase naturally incurs the following updates to the two maps:
wake[SNi][j]=1 and wait[SNj][i]=SNi. After all distributed event order
logs are processed, the maps generated for each thread in R_Set are
written to an order log used during replay.
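The offline step that turns recorded orders into the wake and wait maps can be sketched as follows. The function build_order_log, the tuple layout (Ti, SNi, Tj, SNj) and the 0-based thread ids are our assumptions. Taking the maximum when two orders target the same wait entry is also our choice; it coincides with the plain assignment wait[SNj][i]=SNi whenever there is a single order per entry.

```python
from collections import defaultdict


def build_order_log(orders, r):
    """Build per-thread wake/wait maps from (Ti, SNi, Tj, SNj) orders."""
    wake = [defaultdict(lambda: [0] * r) for _ in range(r)]
    wait = [defaultdict(lambda: [0] * r) for _ in range(r)]
    for ti, sn_i, tj, sn_j in orders:
        wake[ti][sn_i][tj] = 1                          # Ti must notify Tj
        wait[tj][sn_j][ti] = max(wait[tj][sn_j][ti], sn_i)
    return wake, wait


# Orders produced by Algorithm 2 for the example in Figure 2.
figure2_orders = [(2, 1, 1, 1), (0, 1, 1, 2), (1, 2, 0, 2), (2, 1, 0, 2)]
wake, wait = build_order_log(figure2_orders, 3)
```

In this example, wait[0][2] tells T1 that its second access must wait for SN 2 of T2 and SN 1 of T3, while wake[2][1] tells T3 to notify both T1 and T2 after its first access.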
6.2 Value Check Log Generation
In OPR, communication is inferred by matching the values written by a
potential producer with the new values logged in remote threads' value
logs. Consider the scenario in Figure 4, and first imagine that it takes
place in the record phase. Three read accesses from T2 cause new values
to be logged (e21, e22, e23). The number indicates the return value of
each read. When each read is performed, its vector clock represents a
global layer that indicates the set of remote accesses ordered before
it. Such global layers are denoted by dashed lines. The arrows indicate
the remote accesses that produced the newly logged values. The goal of
value matching is to infer the solid arrows in the replay phase.
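[Figure 4: timelines of threads T1 to T4. T2 performs three reads that
log new values, e21: r(x,1), e22: r(y,2) and e23: r(z,4); dashed lines
mark the global layers V(e21), V(e22) and V(e23). The other threads
perform the writes z=3, y=2, x=1, x=2, z=4, z=5 and y=4, and arrows
connect each logged value to the write that produced it. The read
e33: r(w,4) obtains a value produced outside R_Set.]
Figure 4: Inferring Communication in Replay.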
During replay, by following the orders in the order log, we can order
the three read accesses after the accesses that lie before the global
layers specified by their vector clocks. The value matching can be done
naturally at the producer side as follows. Consider e21: both T1 and T3
can compare their last value written to x with the value in T2's value
log. The communication is inferred when the two values match. In the
example, T3 will conclude that its written value is consumed by T2.
Therefore, the purpose of the value check log is to tell each potential
producer thread at which point it should match its written values, and
against which newly read values in remote threads' value logs.
Algorithm 3 shows the value check log generation algorithm. The input is
the value logs of all threads in R_Set. The output is a value check log
(VCLi) for each thread. VCLi is a map from local SN to remote SN. For
Ti, VCLj[SNi]=SNj indicates that after Ti finishes the access with SNi,
it needs to match all its locally written values up to SNi (inclusive)
with the logged values in Tj, from the value following the previous
match (by Ti) up to the value with SNj. The algorithm processes all
entries in the value logs of all threads in R_Set and continuously
updates the VCLs of remote threads. To simplify the notation, we assume
that the full vector is available for each value in the value log.
However, as Algorithm 4 shows, each value only has the local SN
associated with it. In the implementation, we maintain some extra
information in the record phase from which the full vector needed for
value check log generation can be recovered. Due to space limits, we do
not describe the details.
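A direct transcription of Algorithm 3 into Python is given below as a sketch. It relies on the same simplifying assumption as the text, namely that the full vector clock of every logged value is available. The function name val_check_gen and the nested layout vcl[j][i][SNj] = SNi (producer j, consumer i, 0-based ids) are ours, and the vectors in the example are made-up numbers, not values taken from Figure 4.

```python
def val_check_gen(val_logs):
    """val_logs[i] lists the full vector clock of each value logged by Ti.

    Returns vcl with vcl[j][i][sn_j] = sn_i: after its access sn_j,
    producer Tj should check its writes against Ti's logged values up to
    the one with SN sn_i. Later log entries overwrite earlier ones that
    share the same sn_j, as in Algorithm 3.
    """
    r = len(val_logs)
    vcl = [[dict() for _ in range(r)] for _ in range(r)]
    for i in range(r):
        for v in val_logs[i]:
            for j in range(r):
                if j != i:
                    vcl[j][i][v[j]] = v[i]
    return vcl


# Hypothetical logs: only thread 1 logs values, at local SNs 1, 2 and 3.
logs = [[], [[0, 1, 2], [0, 2, 2], [0, 3, 5]], []]
vcl = val_check_gen(logs)
```

Because the first two logged values share the same entry for thread 2, they collapse into a single check point (after SN 2 of thread 2, match up to SN 2 of thread 1), while the third value creates a second check point at SN 5.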
Let us apply Algorithm 3 to the scenario in Figure 4 and consider the
value check log (VCL) entries that concern the values logged by T2.
V(e21)[3] and V(e22)[3] are the same, so according to the algorithm we
eventually have VCL3[V(e22)[3]]=V(e22)[2]. This ensures that after T3
finishes the x = 1 operation, it tries to match its previously written
values with the values of both e21 and e22. Since V(e23)[3] is larger
than V(e22)[3], a new map entry is generated, which ensures that all
writes in T3 up to the boundary specified by V(e23) are matched with the
new value log entries in T2 from the one after e22 up to e23. Each
thread keeps the most recent locally written value for each shared
address, and value matching is always performed against the most recent
values. For example, T1 performs two writes to z, but only the second
one is matched with e23. It is important that value matching considers
all previous writes performed by a thread, not only the accesses on a
global layer or between two global layers. For example, T4 performed the
write y = 2 before V(e21), but it is only matched with e22 after V(e22).
When a value cannot be matched by any write in R_Set, it is deemed to be
produced by a thread outside R_Set. This is the case for e33.
In summary, the value matching procedure identifies the producer of a
new value in the value log if the value is produced by some thread in
R_Set. Otherwise, OPR concludes that the value is produced outside
R_Set.
Algorithm 4: Partial Replay
Procedure OnMemAcc(ei in Ti, AccRange, ValLogi)
  Data: Vi: vector clock of thread Ti
        ShMem: actual shared memory in execution
        Wsm: shadow memory for local written values
        Rsm: shadow memory for values read from log
        SNnext_val: SN of the next new value from ValLogi
        Rval: return value of a read
        Wval: written value of a write
        VC: a vector indicating the most recent SN of remote new
            value checked
        notify: data structure in shared memory to enforce order
   1  Vi[i] ← Vi[i] + 1
   2  repeat
   3    block ← false
   4    foreach j ∈ 1, ..., r do
   5      block ← block ∨ (wait[Vi[i]][j] > notify[i][j])
        end
      until block == false
   6  switch type of ei do
   7    case ei is a read
   8      if Vi[i] == SNnext_val then
   9        Rval ← ValLogi[Vi[i]]
  10        ShMem[AccRange] ← ValLogi[Vi[i]]
  11        Rsm[AccRange] ← ValLogi[Vi[i]]
          else
  12        if ShMem[AccRange] == Rsm[AccRange] then
  13          Rval ← ShMem[AccRange]
            else
  14          Rval ← Rsm[AccRange]
            end
          end
  15    case ei is a write
  16      Wsm[AccRange] ← (Wval, Vi[i])
          foreach j ∈ 1, ..., r do
  17        if VCLj[Vi[i]] ≠ 0 then
  18          CheckComm(Wsm[AccRange], VC[j], VCLj[Vi[i]])
  19          VC[j] ← Vi[i]
            end
          end
      endsw
      foreach j ∈ 1, ..., r do
  20    if wake[Vi[i]][j] ≠ 0 then
  21      notify[j][i] ← Vi[i]
        end
      end
6.3 Replay Algorithm
Using the value log, order log and the value check log, OPR can
replay the threads in R_Set without executing any other threads.
The partial replay algorithm is shown in Algorithm 4. In the replay
phase, OPR executes the memory accesses according to the order
log. The correctness is always ensured by the value log.
The order of memory accesses in different threads is enforced
by a logically shared data structure notify. It has r × r entries,
each entry is an SN that will be set by remote threads by one-sided
update. The i-th row of notify is used by Ti to check whether its
next access needs to wait due to event order. Physically, the i-th
row is associated with the local shared memory of Ti.
If Ti needs to wait at Vi[i], then for some j, wait[Vi[i]][j] is
non-zero and indicates the SN of the remote access from Tj that Ti needs
to wait for. Before an access can be executed, Ti needs to make sure
that every wait[Vi[i]][j] entry is less than or equal to notify[i][j]
(strictly less can occur because wait[Vi[i]][j] is zero if Ti's current
access does not need to wait for Tj) (lines 4-5). If the condition does
not hold, then block is true and the thread blocks at this point.
Similarly, after an access from Ti is executed, if wake[Vi[i]][j] is
set, Ti updates the i-th entry in Tj's row of notify using one-sided
communication (lines 20-21).
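The blocking test and the wake-up step can be sketched in Python as follows. This is a single-process illustration under our own naming (must_block, after_access) with 0-based thread ids; in OPR, notify lives in shared memory and is updated by one-sided communication, whereas here it is a plain list of lists. The wait and wake maps are the ones derived for the example in Figure 2.

```python
def must_block(i, sn, wait, notify):
    """True while some awaited remote SN has not been announced to Ti."""
    row = wait[i].get(sn)
    if row is None:              # no order constrains this access
        return False
    return any(row[j] > notify[i][j] for j in range(len(row)))


def after_access(i, sn, wake, notify):
    """Announce SN sn of Ti to every thread that is waiting for it."""
    row = wake[i].get(sn)
    if row is None:
        return
    for j, flag in enumerate(row):
        if flag:
            notify[j][i] = sn    # a one-sided put in the real system


r = 3
notify = [[0] * r for _ in range(r)]
wait = [{2: [0, 2, 1]}, {1: [0, 0, 1], 2: [1, 0, 0]}, {}]
wake = [{1: [0, 1, 0]}, {2: [1, 0, 0]}, {1: [1, 1, 0]}]

blocked_at_start = must_block(0, 2, wait, notify)   # T1's r(x) must wait
after_access(2, 1, wake, notify)                    # T3 finishes w(y)
after_access(1, 2, wake, notify)                    # T2 finishes w(x)
blocked_at_end = must_block(0, 2, wait, notify)     # now free to run
```

An access without an entry in wait never blocks, which corresponds to the all-zero wait vector in the text.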
For a read access, if there is a value log entry for it, then the
value from value log is used (line 8 ∼ 9). The value is written to
shared memory (line 10). Such value may or may not be the same
as the current values in shared memory. If the value is produced
by a thread not in R_Set, then shared memory does not contain it
because that thread does not execute in replay. In this case, value
log is used to construct the partial states in shared memory.
Each thread still maintains a shadow memory for the values read from the
value log (line 11). Its purpose is to tolerate the incorrect event
orders generated in the record phase. When there is no value log entry
for a read access, the thread accesses the corresponding values in both
shared memory and the read shadow memory (Rsm) (line 12). If they
disagree, the value in the read shadow memory is used (lines 13-14). The
reason is that, in the record phase, a conflicting remote write could
happen after the read and change the value in shared memory, yet this
order could be incorrectly detected as the remote write happening before
the read. If this order is followed in the replay phase, the value in
shared memory has already been updated by the remote write when the read
executes. However, to replay correctly, the read should still get the
old value. Our mechanism ensures that the read always gets the correct
value from the read shadow memory.
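The read path of Algorithm 4 (lines 8-14) can be sketched as a small Python function. The function name replay_read and the use of dictionaries keyed by address for ShMem, Rsm and the value log are our simplifications. The extra check that an address is present in Rsm before preferring it is also our assumption, since the pseudocode does not say what happens for addresses that were never read from the log.

```python
def replay_read(addr, sn, val_log, sh_mem, rsm):
    """Return the value a replayed read must observe."""
    if sn in val_log:                    # lines 8-11: a logged value exists
        value = val_log[sn]
        sh_mem[addr] = value             # rebuild partial shared state
        rsm[addr] = value                # remember it in read shadow memory
        return value
    if addr in rsm and sh_mem.get(addr) != rsm[addr]:
        return rsm[addr]                 # line 14: shared memory ran ahead
    return sh_mem.get(addr)              # line 13: both copies agree


val_log = {3: 7}                         # value 7 was logged at SN 3
sh_mem, rsm = {"x": 0}, {}
first = replay_read("x", 3, val_log, sh_mem, rsm)

# A remote write that really happened later is replayed too early.
sh_mem["x"] = 9
second = replay_read("x", 4, val_log, sh_mem, rsm)
```

The second read has no log entry and finds that shared memory disagrees with the shadow copy, so it returns the shadow value 7 rather than the prematurely written 9.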
Finally, for write accesses, each thread updates a write shadow memory
(Wsm) (line 16). It keeps the most recent values written by the local
thread and is used in communication inference. After a write access, a
value check is performed when the thread's VCL indicates that the local
writes so far need to be checked against a set of remote read value log
entries (lines 17-19). Due to space limits, we do not show the details
of the CheckComm function. Its operation is straightforward: the
relevant values in Wsm are checked against some value entries in remote
threads' value logs.
7. IMPLEMENTATION
The instrumentation of memory accesses is implemented in both the UPC
runtime and the UPC compiler. For each memory access (load, store), we
add "before" and "after" instrumentation. Both increase the SN of the
thread. For Put/Get operations, we modify the UPC runtime to intercept
them. We also modify the compiler to instrument the local accesses made
through pointers that are cast from shared pointers.
Shadow memory is implemented as a hash map. Shared addresses are used to
generate the hash keys. Each entry maps a key to a block of consecutive
bytes. The key is the start address of the byte block. The size of the
block is configurable; we choose 64-byte blocks. On an access to the
shadow memory, the key is generated from the start address of the byte
block that the access belongs to. Depending on the size of the accessed
address range, multiple blocks may be accessed for value comparison. The
same data structure and implementation are used for both the read and
the write shadow memory, in both the record and the replay phase.
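A block-based shadow memory of this kind could look roughly like the Python sketch below. It is not OPR's implementation: the class ShadowMemory, its methods write and differs, and the treatment of missing blocks (a block that was never stored counts as different, and a new block is zero-filled) are our assumptions.

```python
BLOCK = 64  # block size in bytes, as chosen in the paper


class ShadowMemory:
    """Hash map from block start address to a fixed-size byte block."""

    def __init__(self, block_size=BLOCK):
        self.block_size = block_size
        self.blocks = {}

    def _spans(self, addr, size):
        # Split [addr, addr + size) into per-block (key, offset, length).
        end = addr + size
        while addr < end:
            base = addr - addr % self.block_size
            n = min(end, base + self.block_size) - addr
            yield base, addr - base, n
            addr += n

    def write(self, addr, data):
        pos = 0
        for base, off, n in self._spans(addr, len(data)):
            blk = self.blocks.setdefault(base, bytearray(self.block_size))
            blk[off:off + n] = data[pos:pos + n]
            pos += n

    def differs(self, addr, data):
        """True if data is absent from, or unequal to, the shadow copy."""
        pos = 0
        for base, off, n in self._spans(addr, len(data)):
            blk = self.blocks.get(base)
            if blk is None or blk[off:off + n] != data[pos:pos + n]:
                return True
            pos += n
        return False


shadow = ShadowMemory()
payload = bytes(range(100))
shadow.write(40, payload)     # touches the blocks keyed 0, 64 and 128
```

A 100-byte access starting at address 40 straddles three 64-byte blocks, which illustrates why a single access may require several block lookups for value comparison.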
OPR detects value changes at the instrumentation points ("before" and
"after" each shared memory access). However, the instrumentation
functions are not executed atomically with the memory accesses. In most
cases this is not an issue, but when data races are used for
synchronization it may affect the execution path. Consider Listing 1:
the thread waiting for stolen data busy-waits in a while-loop (see
ss_steal in Listing 1). The change of stealIndex is detected at either
the "before" or the "after" instrumentation point once a remote thread
writes the address. The problem is that a value change detected at the
"after" instrumentation point could in fact have happened before the
memory access but after the "before" instrumentation point. In the
replay phase, if we inject the new value at the "after" instrumentation
point accordingly, its effect is only visible in the next iteration. In
the record phase, however, the value change actually happened before the
memory access, so the code left the while-loop in the current iteration.
The extra iteration makes the execution path diverge in the subsequent
execution, and SNs can no longer be matched correctly with the value log
entries. To handle this case, we also encode source code line
information in the value log and detect the diverged execution when it
happens. In those cases, the diverged execution does not consume any log
entries until the execution converges again. Due to space limits, we
cannot describe all details. In practice, we found that our solution
worked well.
Some applications also have dynamically allocated objects in shared
memory. Their addresses could differ between the record and the replay
phase. This does not cause a problem as long as we inject data values
from the value log at the right points and do not expect an exact
address match. However, we should avoid logging any shared address as a
value; otherwise bad pointers will be generated in the replay phase. We
solve this problem by putting a thread's locally written values into
shadow memory in the record phase. When the thread later reads an
address written by itself, no value log entry is generated, because the
values from shared memory and shadow memory are equal. This avoids
logging pointers as values. In the translated code, the addresses of
dynamically allocated objects are normally first written to temporary
variables and then read into program variables. In essence, we write the
dynamically allocated addresses into shadow memory, so they are not
logged later. This technique also reduces the value log size, as it
avoids logging values produced by the local thread.
Finally, we also instrument the shared memory allocation function and
always set the content of a newly allocated object to zero. Otherwise,
the object may contain values that are the same as those of previous
objects at the same addresses. Those old values may already be in shadow
memory. This could lead to side effects when we need to log the values
of the new object: we may miss some values that should have been logged,
because they equal the old values in shadow memory.
8. EVALUATION
8.1 Evaluated Applications
In the evaluation, we use fifteen UPC benchmarks. Eight NAS Parallel
Benchmarks [4] (BT, CG, EP, FT, IS, LU, MG, SP) and three applications
in the UPC test suite (guppie, laplace, mcop) are deterministic. The
rest are non-deterministic by design: two applications in the UPC Task
Library [15, 6] (fib, nqueens), Unbalanced Tree Search (UTS) [16], and
Parallel De Bruijn Graph Construction and Traversal for De Novo Genome
Assembly (Meraculous) [11]. Table 1 indicates the parameters and data
sets used in each experiment.
Set    Apps        Description
NAS    BT          class=D, NP=1024
NAS    CG          class=D, NP=256
NAS    EP          class=D, NP=1024
NAS    FT          class=D, NP=512, -shared-heap=512
NAS    IS          class=C, NP=256
NAS    LU          class=D, NP=1024
NAS    MG          class=D, NP=1024
NAS    SP          class=D, NP=1024
Tests  guppie      NP=1024
Tests  laplace     NP=1024
Tests  mcop        NP=1024, problem size: 4000
Task   fib         NP=1024, fib(60)
Task   nqueens     NP=1024, 8 × 8
       uts-upc     NP=1024, $T3XXL
       meraculous  NP=480, human genomes
Table 1: Application Parameters. NP denotes the number of cores used
for the original execution.
De novo whole genome assembly reconstructs a genomic sequence from
short, overlapping, and potentially erroneous fragments called reads. We
use an optimized parallel implementation of the most time-consuming
phases of Meraculous, a state-of-the-art production assembler [11]. It
is a novel algorithm that leverages the one-sided communication
capabilities of UPC to facilitate the requisite fine-grained parallelism
and the avoidance of data hazards. Nondeterminism is a main feature of
the data-driven synchronization in de Bruijn graph traversal. To
traverse the graph, all threads independently start building subcontigs,
and no synchronization is required unless two threads pick k-mer seeds
that eventually belong to the same contig. In this case, the threads
have to collaborate and resolve the conflict in order to avoid redundant
work. A lightweight synchronization scheme is at the heart of the
parallel de Bruijn graph traversal. Essentially, the synchronization
protocol maintains a distributed state machine. Readers can refer to
[11] for more details.
In UTS, nondeterminism exists in dynamic work stealing: when a thread
needs to steal a certain amount of work from other threads, the thread
that provides the stolen work depends on the current status of each
thread and on the order in which steal requests arrive. fib and nqueens
run on top of a work stealing task library.
8.2 Experiment Setup
Partial record and replay experiments are conducted on Edison, a Cray
XC30 supercomputer at NERSC. Edison has a peak performance of 2.57
petaflops, with 5576 compute nodes, each equipped with 64 GB RAM and two
12-core 2.4GHz Intel Ivy Bridge processors for a total of 133,824
compute cores, interconnected with the Cray Aries network using a
Dragonfly topology.
We are mainly interested in the record overhead and how it is affected
by different replay group sizes. For each experiment, we choose four
different R_Set sizes: 2, 4, 8 and 16. Since each node in Edison
contains 24 cores, we make sure that the threads in R_Set execute on
different nodes (e.g., when the size of R_Set is 2, the threads are T24
and T48). In total, we conduct 60 executions (4 for each application).
The concurrency during the initial program run and the recording phase
is given by the parameter NP in Table 1. Ideally, for the replay phase,
we would modify the UPC runtime so that we can execute just the threads
in R_Set using a smaller number of cores. We have not added this support
yet, as it involves nontrivial modifications to the UPC runtime system.
For our current evaluation, we still start the same number of threads in
replay as in the full execution, but modify the source code so that only
the threads in R_Set execute after the program starts. Threads not of
interest wait in barriers. Also note that we use only one node of Edison
(24 cores) for the replay phase, down from the original 1,024 cores in
most cases.
App         Native Exec.  R_Set=2  R_Set=4  R_Set=8  R_Set=16  Shadow Memory  Log Size
BT          363s          8.38x    8.48x    8.35x    8.41x     9.73 MB        1.6 GB
CG          508s          5.79x    5.84x    5.93x    6.16x     7.51 MB        16.9 GB
EP          4s            5.79x    3.98x    3.97x    4.03x     0.13 MB        0.12 MB
FT          35s           27.5x    28.1x    28.5x    29.4x     703.12 MB      15 GB
IS          26s           1.39x    1.44x    1.51x    1.57x     13.08 MB       13 MB
LU          56s           13.03x   13.89x   14.32x   15.04x    1.75 MB        770 MB
MG          176s          11.20x   11.38x   11.64x   12.18x    58.20 MB       759 MB
SP          1229s         1.82x    1.83x    1.83x    1.82x     9.65 MB        2.8 GB
guppie      160s          4.49x    4.67x    4.74x    4.89x     64 MB          519 MB
laplace     154s          8.55x    12.84x   14.76x   13.14x    0.52 MB        0.15 MB
mcop        247s          0.24x    0.52x    0.31x    0.29x     86.05 MB       121 MB
fib         13s           0.98x    0.99x    0.98x    1.14x     0.26 MB        1.31 MB
nqueens     123s          12.2x    12.8x    12.9x    13.4x     0.28 MB        85 MB
uts-upc     5s            25.4x    25.3x    26.0x    26.4x     40 MB          204 MB
Meraculous  216s          5.18x    5.44x    5.17x    5.79x     5.3 GB         2.1 GB
Table 2: OPR Overhead
8.3 Experimental Results
Table 2 shows our results. For each application, we show the
native execution time without any instrumentation, the overhead
for different R_Set sizes, size of shadow memory allocated and the
largest log size among all logs generated by threads in R_Set.
8.3.1 Record Overhead
We first consider the overhead for the smallest replay group size
(R_Set=2). OPR introduces an overhead ranging from 1.39x to 27.5x. For
FT, the high overhead (27.5x) is due to the large ratio between log size
and shadow memory size; more details are given later. For uts-upc, the
high overhead (25.4x) is due to the large number of shared memory
accesses. They occur when a thread polls (busy-waits) on remote
variables while waiting for stolen work from remote threads (e.g., line
7 in Listing 1). The overheads for the other applications are mostly
under 10x. Note that for two applications (mcop and fib) the
instrumented execution runs faster than the native one, as the overheads
below 1x in Table 2 show. This is caused by nondeterministic behavior in
the algorithms. For example, mcop's data distribution depends on the
random numbers generated. Therefore, we observed different execution
characteristics in different executions. Note that we do not expect the
native execution to have the same behavior as the recorded executions.
8.3.2 Overhead vs. R_Set Size
With different replay group sizes (R_Set=2,4,8,16), the record overhead
increases only slightly or stays almost the same. The reason is twofold.
First, the main overhead is introduced by the instrumentation of read
and write accesses. This is a local overhead and does not increase when
the number of threads in the replay group increases. Second, the
overhead due to vector clocks does increase with the replay group size.
However, because the replay group size is normally not large (we expect
bugs to be localized among a small number of threads) and because of the
scalability enhancements in our simplified vector clock algorithm, the
overhead increase is almost negligible.
8.3.3 Shadow Memory
For each application, we also show the size of the shadow memory
allocated. This includes both the read and the write shadow memory.
Different applications show drastically different characteristics. For
all applications, we found that the shadow memory size increases when
the execution starts and then becomes stable after a certain point. The
largest shadow memory size appears in Meraculous. Essentially, the
shadow memory of each thread captures the data read and written by it.
In this experiment, the input data is around 150 GB and we use 480
threads. Because OPR also uses a separate shadow memory to keep written
values, the total size grows to about 5 GB.
8.3.4 Log Size
The final column shows the largest log size generated by a thread in
R_Set for each application. The log sizes also vary considerably. A
naive implementation performs a log file write on each access, which
obviously incurs huge overhead. In our implementation, we use a 1 GB log
buffer in memory and write the logged read values to the log file only
when the buffer is full. After this optimization, the record overhead
became reasonable.
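The buffering optimization amounts to batching log records in memory and writing them out in large chunks. The Python sketch below is a hypothetical illustration, not OPR's logger: the class BufferedLog and its interface are ours, the tiny capacity is chosen only to make the example observable, and the buffer is flushed just before a record would overflow it.

```python
import io


class BufferedLog:
    """Accumulate log records in memory; write to the sink in batches."""

    def __init__(self, sink, capacity):
        self.sink = sink              # any object with a write(bytes) method
        self.capacity = capacity      # 1 GB in the paper; tiny in this demo
        self.buf = bytearray()
        self.flushes = 0

    def append(self, record):
        if len(self.buf) + len(record) > self.capacity:
            self.flush()              # make room before buffering the record
        self.buf += record

    def flush(self):
        if self.buf:
            self.sink.write(bytes(self.buf))
            self.buf.clear()
            self.flushes += 1


sink = io.BytesIO()
log = BufferedLog(sink, capacity=8)
for _ in range(5):
    log.append(b"abc")
log.flush()                           # final flush at the end of the run
```

Five 3-byte records reach the sink in three writes instead of five; with a 1 GB buffer the reduction in file system calls is correspondingly larger.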
Besides the instrumentation overhead, we found that the log size and the
shadow memory size are also related to the record overhead. In general,
the larger the ratio between log size and shadow memory size, the larger
the record overhead tends to be. This is particularly true if the shadow
memory size is large. The intuition is that shadow memory is a "filter"
that decides whether values need to be logged. Therefore, it needs to be
accessed on all memory accesses. When the ratio between the two sizes is
large, it indicates that value comparisons are needed for most accesses.
Such byte-level comparisons contribute to the record overhead. This is
the case for FT, where the ratio is around 22. For Meraculous, although
the shadow memory is much larger than for FT, the log size is in fact
smaller than the shadow memory size. This suggests that the data in
shadow memory are mostly allocated and written once. In other words,
when deciding whether some values need to be logged, we mostly find that
the chunk of data does not yet appear in shadow memory. Therefore, no
byte-level comparisons are needed in those cases. This observation also
suggests future optimizations that could avoid comparing values in some
scenarios.
9. CONCLUSION
One-sided communication is widely used in Partitioned Global Address
Space (PGAS) programming models. Despite its potential performance
advantages, its inherent nondeterminism makes debugging even more
difficult. In this paper, we present a general tool, OPR (One-sided
communication Partial Record and Replay), to support deterministic R&R
for one-sided communication. Partial replay allows users to focus on
events within a specified small set of threads. It can ease the
debugging experience and relieve users from monitoring all concurrent
events from potentially thousands of threads. OPR is built on Berkeley
UPC. OPR allows users to deterministically replay a subset of the
threads of a full execution without executing the remaining threads. The
principle of data-replay is used to ensure replay correctness;
inter-thread communications among the threads in the replay group are
inferred in the replay phase based on value matching. To the best of our
knowledge, OPR is the first software tool that supports deterministic
R&R for one-sided communication. We demonstrate the practicality of our
approach by evaluating the tool using 15 applications.
10. REFERENCES
[1] Berkeley UPC. http://upc.lbl.gov.
[2] GASNet Communication System. http://gasnet.lbl.gov.
[3] The Chapel Parallel Programming Language.
http://chapel.cray.com/index.html.
[4] The NAS Parallel Benchmarks. Available at
http://www.nas.nasa.gov/Software/NPB.
[5] UPC Home Page. http://upc-lang.org.
[6] UPC Task Library. http://upc.lbl.gov/task.shtml.
[7] X10: Performance and Productivity at Scale.
http://x10-lang.org.
[8] MPI: A Message-Passing Interface Standard. Version 3.0.
Message Passing Interface Forum, 2012.
[9] A. Bouteiller, G. Bosilca, and J. Dongarra. Retrospect:
Deterministic Replay of MPI Applications for Interactive
Distributed Debugging. In EuroPVM/MPI, pages 297–306.
LNCS, 2007.
[10] E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B.
Johnson. A survey of rollback-recovery protocols in
message-passing systems. ACM Comput. Surv., 34.
[11] E. Georganas, A. Buluç, J. Chapman, L. Oliker,
D. Rokhsar, and K. Yelick. Parallel De Bruijn Graph
Construction and Traversal for De Novo Genome Assembly.
In Proceedings of the 26th ACM/IEEE International
Conference on High Performance Computing, Networking,
Storage and Analysis (SC), November 2014.
[12] T. J. LeBlanc and J. M. Mellor-Crummey. Debugging
Parallel Programs with Instant Replay. IEEE Transactions on
Computers, 36(4):471–482, April 1987.
[13] D. Manivannan and M. Singhal. Quasi-synchronous
checkpointing: Models, characterization, and classification.
IEEE Trans. Parallel Distrib. Syst., 10(7).
[14] J. Mellor-Crummey, L. Adhianto, G. Jin, and W. N. Scherer III. A
New Vision for Coarray Fortran. In The Third Conference on
Partitioned Global Address Space Programming Models
(PGAS), October 2009.
[15] S.-J. Min, C. Iancu, and K. Yelick. Hierarchical Work
Stealing on Manycore Clusters. In Proceedings of the Fifth
Conference on Partitioned Global Address Space
Programming Models (PGAS), Oct 2011.
[16] S. Olivier and J. Prins. Scalable Dynamic Load Balancing
Using UPC. In Proceedings of 37th International Conference
on Parallel Processing (ICPP), September 2008.
[17] R. Schwarz and F. Mattern. Detecting Causal Relationships
in Distributed Computations: In Search of the Holy Grail.
Distributed Computing, 7(3):149–174, March 1994.
[18] K. Sen, G. Rosu, and G. Agha. Runtime safety analysis of
multithreaded programs. In ESEC/SIGSOFT FSE, pages
337–346, 2003.
[19] C. Svensson, D. Kesler, R. Kumar, and G. Pokam. Scalable
Automated Methods for Dynamic Program Analysis. In Ph.D
Thesis. University of Illinois, Urbana-Champaign, 2006.
[20] C. Svensson, D. Kesler, R. Kumar, and G. Pokam. MPreplay:
Architecture Support for Deterministic Replay of Message
Passing Programs on Message Passing Many-core
Processors. In UIUC Technical Report UILU-09-2209, Apr
2009.
[21] R. Xue, X. Liu, M. Wu, Z. Guo, W. Chen, W. Zheng, and
G. Voelker. MPIWiz: Subgroup Reproducible Replay of MPI
Applications. In Proceedings of the 14th ACM SIGPLAN
symposium on Principles and practice of parallel
programming, pages 251–260. ACM, February 2009.
