A Theory of Partitioned Global Address Spaces by Calin, Georgel et al.
A Theory of Partitioned Global Address Spaces*
Georgel Calin1, Egor Derevenetc2, Rupak Majumdar3, and
Roland Meyer1
1 University of Kaiserslautern, Germany, {calin,meyer}@cs.uni-kl.de
2 Fraunhofer ITWM, Germany, egor.derevenetc@itwm.fraunhofer.de
3 MPI-SWS, Germany, rupak@mpi-sws.org
Abstract
Partitioned global address space (PGAS) is a parallel programming model for the development of
high-performance applications on clusters. It provides a global address space partitioned among
the cluster nodes, and is supported in programming languages like C, C++, and Fortran by
means of APIs. Our first contribution is a formal model for the semantics of single program,
multiple data programs that use PGAS APIs. Our model reflects the main features of popular
real-world APIs such as SHMEM, ARMCI, GASNet, GPI, and GASPI.
A key feature of PGAS is the support for one-sided communication: a node may directly
read and write the memory located at a remote node, without explicit synchronization with
the processes running on the remote side. One-sided communication increases performance by
decoupling process synchronization from data transfer, but requires the programmer to reason
about appropriate synchronizations between reads and writes. As a second contribution, we
propose and investigate robustness, a criterion for correct synchronization of PGAS programs.
Robustness corresponds to acyclicity of a suitable happens-before relation defined on PGAS
computations. The requirement is finer than classical data race freedom and rules out most false
error reports.
Our main technical result is an algorithm for checking robustness of PGAS programs. The
algorithm makes use of two insights. We first show that, if a PGAS program is not robust,
then there are computations in a certain normal form that violate happens-before acyclicity.
Intuitively, normal-form computations delay remote accesses in an ordered way. We then devise
an algorithm that checks for cyclic normal-form computations. Essentially, the algorithm is
an emptiness check for a novel automaton model that accepts normal-form computations in
streaming fashion. Altogether, we prove that the robustness problem is PSpace-complete.
1998 ACM Subject Classification D.2.4 Software/Program Verification, D.1.3 Concurrent Pro-
gramming, F.4.3 Formal Languages
Keywords and phrases PGAS, SC preservation, Robustness, Semantics, Formal languages
Digital Object Identifier 10.4230/LIPIcs.FSTTCS.2013.127
1 Introduction
Partitioned global address space (PGAS) is a parallel programming model for the develop-
ment of high-performance software on clusters. The PGAS model provides a global address
space to the programmer that is partitioned among the cluster nodes (see Figure 1(b)).
Nodes can read and write their local memories, but additionally access the remote address
* The second author was funded by the Competence Center High Performance Computing and Visualization (CC-HPC) of the Fraunhofer Institute for Industrial Mathematics (ITWM). The work was
partially supported by the PROCOPE project ROIS: Robustness under Realistic Instruction Sets.
© Georgel Calin, Egor Derevenetc, Rupak Majumdar, and Roland Meyer;
licensed under Creative Commons License CC-BY
33rd Int’l Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2013).
Editors: Anil Seth and Nisheeth K. Vishnoi; pp. 127–139
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
int x = 1, y = 0;
write(x, rightNeighborRank,
y, myWriteQ);
barrier();
assert(y == 1);
[Figure 1(b) diagram: Nodes 1 to N connected through a network; each node hosts one process with local registers, its share of the shared memory (ADDR), and a NIC.]
Figure 1 (a) Program 1to1 is the compute and exchange results idiom often found in PGAS
applications. Each process copies an integer value to its neighbor. write asks the hardware to copy
the value of address 𝑥 to 𝑦 on the right neighboring node. barrier blocks until all processes reach
the barrier. The assertion can fail, as the barrier may return before the write completes. (b)
PGAS architecture — NIC stands for network interface controller.
space through (synchronous or asynchronous) API calls. PGAS is a popular programming
model, and supported by many PGAS APIs, such as SHMEM [10], ARMCI [21], GASNet [4], GPI [19], and GASPI [15], as well as by languages for high-performance computing,
such as UPC [11], Titanium [16], and Co-Array Fortran [23].
A key ingredient of PGAS APIs is their support for one-sided communication: a node
may directly read and write the memory located at a remote node without explicit synchron-
ization with the remote side, unlike in traditional message passing interfaces. One-sided
communication can be efficiently implemented on top of networking hardware featuring
remote direct memory access (RDMA), and increases performance of PGAS programs by
avoiding unnecessary synchronization between the sender and the receiver [19, 14].
However, the use of one-sided communication introduces additional non-determinism in
the ordering of memory reads and writes, and makes reasoning about program correctness
harder. Figure 1(a) demonstrates a subtle bug arising out of improper synchronizations:
while the barriers ensure all processes are at the same control location, the remote writes
may or may not have completed when address y is accessed after the barrier.
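The failure mode of Figure 1(a) can be reproduced with a small simulation. The following is a minimal sketch, not the semantics defined later in the paper: the dictionary-based memory, the single `in_flight` queue, and the function name `run_1to1` are assumptions made for illustration only.

```python
from collections import deque

def run_1to1(complete_write_before_load: bool) -> bool:
    """Run 1to1 on two nodes; return True iff the assertion on rank 1 holds."""
    # Hypothetical flat memory mem[rank][address], initialised as in Figure 1(a).
    mem = {1: {"x": 1, "y": 0}, 2: {"x": 1, "y": 0}}
    right = {1: 2, 2: 1}          # right neighbour on a ring of two nodes
    in_flight = deque()           # remote writes issued but not yet completed

    # Both processes issue write(x, rightNeighborRank, y, myWriteQ).
    for rank in (1, 2):
        in_flight.append((rank, "x", right[rank], "y"))

    def complete_all():
        # Copy the source value to the destination for every pending write.
        while in_flight:
            src_rank, src_addr, dst_rank, dst_addr = in_flight.popleft()
            mem[dst_rank][dst_addr] = mem[src_rank][src_addr]

    # barrier(): both processes have reached it, but it does not wait for in_flight.
    if complete_write_before_load:
        complete_all()
    assertion_holds = mem[1]["y"] == 1   # assert(y == 1) on rank 1
    complete_all()                       # the writes complete eventually
    return assertion_holds

ok_when_completed = run_1to1(True)
ok_when_delayed = run_1to1(False)
```

In this sketch the assertion holds only when the remote writes happen to complete before the load, which is exactly the scheduling choice the barrier does not control.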
We make two contributions in this paper.
First, we provide a core calculus of PGAS APIs that models concurrent processes sharing
a global address space and accessing remote memory through one-sided reads and writes.
Despite the popularity of PGAS APIs in the high-performance computing community, to
the best of our knowledge, there were no formal models for common PGAS APIs.
Second, we define and study a correctness criterion called robustness for PGAS programs.
To understand robustness, we begin with a classical and intuitive correctness condition, se-
quential consistency [18]. A computation is sequentially consistent if its memory accesses
happen atomically and in the order in which they were issued. Sequential consistency is
too strong a criterion for PGAS programs, where time is required to access remote memory
and accesses themselves can be reordered. Robustness is the weaker notion that all compu-
tations of the program have the same happens-before (data and control) dependencies [26]
as some sequentially consistent computation. Our notion of robustness captures common
programming error patterns [13, 20], and is derived from a similar notion in shared memory
multiprocessing [26]. Related correctness criteria have been proposed for weak memory
models [2, 3, 5, 6, 7, 8, 24].
A simpler correctness property would be data race freedom (DRF), in which no two
processes access the same address at the same time, with at least one access being a write [1].
Indeed, programs free of data races are sequentially consistent. Unfortunately, DRF is too
strong a requirement in practice [25], and leads to numerous false alarms. Many common
synchronization idioms for PGAS programs, such as producer-consumer synchronization,
and many concurrent data structure implementations, contain benign data races. Instead,
the notion of robustness captures the intuitive requirement that, even when events are
reordered in a computation, there are no causality cycles. Our notion of causality is the
standard happens-before relation from [26].
We study the algorithmic verification of robustness. Our main result is that robustness is
decidable (actually PSpace-complete) for PGAS programs, assuming a finite data domain
and finite memory. Note that our model of PGAS programs is infinite-state even when the
data domain is finite: one-sided communication allows unboundedly many requests to be in
flight simultaneously (a feature modeled in our formalism using unbounded queues).
Our decidability result uses two technical ingredients. First, we show that among all
computations violating robustness, there is always one in a certain normal form. The normal
form partitions the violating computation into phases: the first phase initiates memory reads
and writes, and the latter phases complete the reads and writes in the same order in which
they were initiated.
Second, we provide an algorithm to detect violating computations in this normal form.
We take a language-theoretic view, and introduce a multiheaded automaton model which can
accept violating computations in normal form. Then the problem of checking robustness
reduces to checking emptiness for multiheaded automata. Interestingly, since the normal
form maintains orderings of accesses, the multiple heads can be exploited to accept violating
computations without explicitly modeling unbounded queues of memory access requests.
The resulting class of languages contains non-context-free ones (such as $a^n b^n c^n$), but retains
sufficient decidability properties. Altogether this yields a PSpace decision procedure for
checking robustness of programs using PGAS APIs.
For lack of space, full constructions and proofs are given in [9].
Related Work. Although PGAS APIs are popular in the high-performance computing
community [4, 10, 15, 19, 21], no previous work provides a unifying formal semantics that
incorporates one-sided asynchronous communication. As for synchronization correctness,
Park et al. proposed a testing framework for data race detection and implemented it for
the UPC language [25]. However, these authors note that many data races are actually not
harmful, and support the statement by the analysis of the NAS Parallel Benchmarks [22].
For this reason, in contrast to data race freedom [1], we consider robustness as a more precise
notion of appropriate synchronization. Several examples from [25] show that harmful data
races (like in the knapsack example) lead to non-robustness, while benign data races (like
in the examples NPB 3.3 BT and SP) do not.
The robustness problem was posed by Shasha and Snir [26] for shared memory multipro-
cessing. They showed that non-sequentially consistent computations have a happens-before
cycle. Alglave and Maranget [2, 3] extended this result. They developed a general theory
for reasoning about robustness problems, even among different architectures. Owens [24]
proposed a notion of appropriate synchronization that is based on triangular data races.
Compared to robustness, triangular race freedom requires heavier synchronization, which is
undesirable for performance reasons.
We consider here the algorithmic problem of checking robustness. For programs running
on weak memory models the problem has been addressed in [3, 7, 8], but none of these
works provides a (sound and complete) decision procedure. The first complete algorithm for
checking robustness of programs running on Total Store Ordering (TSO) architectures was
given in [6]. It is based on the following locality property. If a TSO program is not robust,
then there is a violating computation where only one process delays commands. This insight
leads to a reduction of robustness to reachability in the sequential consistency model [5].
PGAS programs allow more reorderings than TSO ones and, as a consequence, locality
does not hold. Instead, our decision procedure relies on a more complex normal form for
computations and on an automata-theoretic algorithm to look for normal-form violations.
2 PGAS Programs
2.1 Features of PGAS Programs
PGAS programs are single program, multiple data programs running on a cluster (see Fig-
ure 1(b)). At run time, a PGAS program consists of multiple processes executing the same
code on different nodes. Each process has a rank, which is the index of the node it runs on.
The processes can access a global address space partitioned into local address spaces for each
process. Local addresses can be accessed directly. Remote addresses (addresses belonging
to different processes) are accessed using API calls, which come in different flavors.
SHMEM [10] provides synchronous remote reads where the invoking process waits for
completion of the command. Remote write commands are asynchronous, and no ordering is
guaranteed between writes, even to the same remote node. The ordering can, however, be
enforced by a special fence command.
ARMCI [21] features synchronous as well as asynchronous read and write commands.
The asynchronous variants of the commands return a handle that can be waited upon. When
the wait on a read handle is over, the data being read has arrived and is accessible. When
the wait on a write handle is over, the data being written has been sent to the network but
might not have reached its destination. Unlike operations to different nodes, operations to
the same remote node are executed in their issuing order.
GASNet [4], like ARMCI, provides both synchronous and asynchronous versions of reads
and writes. Commands return a handle that can be waited upon, and a return from a
wait implies full completion of the operation. The order in which asynchronous operations
complete is intentionally left unspecified.
GPI [19] and GASPI [15] only support asynchronous read and write commands. Each
read or write operation is assigned a queue identifier. In GPI, operations with the same
queue id and to the same remote node are executed in the order in which they were issued;
in GASPI this guarantee does not hold. One can wait on a queue id, and the wait returns
when all commands in the queue are fully completed, on both the local and the remote side.
Summing up, in a uniform PGAS programming model it should be possible to
- perform synchronous and asynchronous data transfers,
- assign an asynchronous operation a handle or a queue id,
- wait for completion of an individual command or of all commands in a given queue,
- enforce ordering between operations.
We define a core model for PGAS that supports all these features. Our model only uses
asynchronous remote reads and writes with explicit queues, but is flexible enough to accom-
modate all the above idioms. Moreover, it is not limited to single program, multiple data
programs common in PGAS applications, but can model ordinary concurrent programs with
different processes as well.
2.2 Syntax of PGAS Programs
We define PGAS programs and their semantics in terms of automata. A (non-deterministic)
automaton is a tuple 𝐴 = (𝑆,Σ,Δ, 𝑠0, 𝐹 ), where 𝑆 is a set of states, Σ is a finite alphabet,
Δ ⊆ 𝑆 × (Σ ∪ {𝜀})× 𝑆 is a set of transitions, 𝑠0 ∈ 𝑆 is an initial state, and 𝐹 ⊆ 𝑆 is a set
of final states. We call the automaton finite if the set of states is finite. We write $s_1 \xrightarrow{a} s_2$
⟨cmd⟩ ::= ⟨reg⟩ ← mem[⟨expr⟩]
| mem[⟨expr⟩] ← ⟨expr⟩
| ⟨reg⟩ ← ⟨expr⟩
| assume(⟨expr⟩)
| read(⟨local-addr⟩,⟨rank⟩,⟨remote-addr⟩,⟨que-id⟩)
| write(⟨local-addr⟩,⟨rank⟩,⟨remote-addr⟩,⟨que-id⟩)
| barrier
Figure 2 Syntax of commands. ⟨reg⟩ ranges over REG; expressions ⟨expr⟩, local addresses ⟨local-addr⟩, remote addresses ⟨remote-addr⟩, and queue identifiers ⟨que-id⟩ range over expressions; ranks ⟨rank⟩ over $\overline{1,N}$-valued expressions.
[Figure 3 diagram: the write, popa, popb, and bar events of both processes and the load event, connected by po and cf edges.]
Figure 3 Happens-before relation
of computation 𝜏1to1 (Example 1).
𝜏1to1 violates robustness.
if (𝑠1, 𝑎, 𝑠2) ∈ Δ, and extend the relation to computations 𝜎 ∈ Σ* in the expected way. The
language of the automaton is $\mathcal{L}(A) := \{\sigma \in \Sigma^* \mid s_0 \xrightarrow{\sigma} s \text{ for some } s \in F\}$. We write |𝜎| for
the length of a computation 𝜎 ∈ Σ*, and use succ(𝜎) to denote the successor relation among
the letters in 𝜎. We write 𝑎 <𝜎 𝑏 if 𝜎 = 𝜎1 · 𝑎 · 𝜎2 · 𝑏 · 𝜎3 for some 𝜎1, 𝜎2, 𝜎3 ∈ Σ*.
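To make the definitions concrete, here is a minimal sketch of language membership for a non-deterministic automaton with 𝜀-transitions, together with the ordering $a <_\sigma b$. The encoding (transitions as triples, `None` for 𝜀) and the function names are assumptions chosen for illustration.

```python
def accepts(delta, s0, final, word):
    """Check whether word is in L(A) for A = (S, Sigma, delta, s0, final).

    delta is a set of triples (s1, a, s2); a is None for an epsilon move.
    """
    def eps_closure(states):
        # All states reachable from `states` via epsilon moves only.
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for (p, a, q) in delta:
                if p == s and a is None and q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    current = eps_closure({s0})
    for letter in word:
        step = {q for (p, a, q) in delta if p in current and a == letter}
        current = eps_closure(step)
    return bool(current & set(final))

def before(word, a, b):
    """a <_sigma b: some occurrence of a precedes some occurrence of b."""
    return any(x == a and b in word[i + 1:] for i, x in enumerate(word))

# A three-state automaton accepting (ab)*.
delta = {(0, "a", 1), (1, None, 2), (2, "b", 0)}
```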
A PGAS program (𝒫, 𝑁) consists of a program code 𝒫 and a fixed number 𝑁 ≥ 1 of
cluster nodes. The program code 𝒫 := (𝑄,CMD, ℐ, 𝑞0, 𝑄) is a finite automaton with a set of
control states 𝑄, all of them final, an initial state 𝑞0, and a set of transitions ℐ labeled with
commands CMD defined as follows.
Let DOM, ADDR, and QUE be finite domains of values (containing a value 0), addresses,
and queue identifiers, respectively. Let REG be a finite set of registers that take values from
DOM. The grammar of commands is given in Figure 2. For simplicity, we will assume
DOM = ADDR = QUE. The set of expressions is defined over constants from DOM, registers
from REG, and (unspecified) operators over DOM. The set of commands CMD includes local
assignments and conditionals (assume), remote read and write API calls read and write
respectively, and barriers barrier.
At run time, there is a process on each node in $\overline{1,N}$ that executes program 𝒫, where
$\overline{M,N} := \{M, M+1, \ldots, N\}$. We will identify each process with its rank from $\mathsf{RNK} := \overline{1,N}$.
For modeling purposes, one may assume there are special constant expressions that let a
process learn about its rank in RNK and about the total number of processes 𝑁 .
2.3 Semantics of PGAS Programs
The semantics of a PGAS program (𝒫, 𝑁) is defined using a state-space automaton $X(\mathcal{P}, N) := (S_X, \mathsf{E}, \Delta_X, s_{0X}, F_X)$. A state $s \in S_X$ is a tuple s = (st, m, fa, fb), where state configuration st : RNK → 𝑄 maps each process to its current control state, memory configuration
m : RNK × (REG ∪ ADDR) → DOM maps each process to the values stored in each register
and at each address, queue configuration fa : RNK×QUE→ (RNK×ADDR×RNK×ADDR)*
maps each process to remote read and write requests that were issued, and fb : RNK×QUE→
(RNK × ADDR × DOM)* contains values to be transferred. The two queue configurations
capture the delays between creating a request, reading data, and writing data.
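The initial state is $s_{0X} := (\mathsf{st}_0, \mathsf{m}_0, \mathsf{fa}_0, \mathsf{fb}_0)$, where for all ranks r ∈ RNK, registers and addresses a ∈ REG ∪ ADDR, and queue identifiers q ∈ QUE, we have $\mathsf{st}_0(\mathsf{r}) := q_0$, $\mathsf{m}_0(\mathsf{r}, \mathsf{a}) := 0$, and $\mathsf{fa}_0(\mathsf{r}, \mathsf{q}) := \varepsilon =: \mathsf{fb}_0(\mathsf{r}, \mathsf{q})$. The set of final states is $F_X := \{(\mathsf{st}, \mathsf{m}, \mathsf{fa}, \mathsf{fb}) \in S_X \mid \mathsf{fa}(\mathsf{r}, \mathsf{q}) = \varepsilon = \mathsf{fb}(\mathsf{r}, \mathsf{q}) \text{ for all } \mathsf{r} \in \mathsf{RNK}, \mathsf{q} \in \mathsf{QUE}\}$. The semantics of commands ensures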
queues can always be emptied, so acceptance with empty queues is not a restriction.
The alphabet of 𝑋(𝒫, 𝑁) is the set of events E := K×RNK×((RNK×ADDR)∪{⊥}) with
event kinds K := {load, store, assign, assume, read,write, popa, popb, bar}. Consider an event
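Table 1 Transition rules for $X(\mathcal{P}, N)$, given $q_1 \xrightarrow{\mathit{cmd}} q_2$ and current state s = (st, m, fa, fb) with $\mathsf{st}(\mathsf{r}) = q_1$. We set $\mathsf{st}' := \mathsf{st}[\mathsf{r} := q_2]$ to update st so that process r is at $q_2$. $\hat{e}$ denotes the evaluation of expression $e$ in memory configuration m.

(load)
$$\frac{\mathit{cmd} = r \leftarrow \mathsf{mem}[e_a]}{s \xrightarrow{(\mathsf{load},\, \mathsf{r},\, (\mathsf{r}, \hat{e}_a))} (\mathsf{st}', \mathsf{m}[(\mathsf{r}, r) := \mathsf{m}(\mathsf{r}, \hat{e}_a)], \mathsf{fa}, \mathsf{fb})}$$

(write)
$$\frac{\mathit{cmd} = \mathsf{write}(e^{loc}_a, e^{rem}_r, e^{rem}_a, e_q) \qquad \mathsf{fa}(\mathsf{r}, \hat{e}_q) = \alpha}{s \xrightarrow{(\mathsf{write},\, \mathsf{r},\, \bot)} (\mathsf{st}', \mathsf{m}, \mathsf{fa}[(\mathsf{r}, \hat{e}_q) := \alpha \cdot (\mathsf{r}, \hat{e}^{loc}_a, \hat{e}^{rem}_r, \hat{e}^{rem}_a)], \mathsf{fb})}$$

(popa)
$$\frac{\mathsf{fa}(\mathsf{r}, \mathsf{q}) = (\mathsf{r}_s, \mathsf{a}_s, \mathsf{r}_d, \mathsf{a}_d) \cdot \alpha \qquad \mathsf{fb}(\mathsf{r}, \mathsf{q}) = \beta}{s \xrightarrow{(\mathsf{popa},\, \mathsf{r},\, (\mathsf{r}_s, \mathsf{a}_s))} (\mathsf{st}, \mathsf{m}, \mathsf{fa}[(\mathsf{r}, \mathsf{q}) := \alpha], \mathsf{fb}[(\mathsf{r}, \mathsf{q}) := \beta \cdot (\mathsf{r}_d, \mathsf{a}_d, \mathsf{m}(\mathsf{r}_s, \mathsf{a}_s))])}$$

(popb)
$$\frac{\mathsf{fb}(\mathsf{r}, \mathsf{q}) = (\mathsf{r}_d, \mathsf{a}_d, v) \cdot \beta}{s \xrightarrow{(\mathsf{popb},\, \mathsf{r},\, (\mathsf{r}_d, \mathsf{a}_d))} (\mathsf{st}, \mathsf{m}[(\mathsf{r}_d, \mathsf{a}_d) := v], \mathsf{fa}, \mathsf{fb}[(\mathsf{r}, \mathsf{q}) := \beta])}$$

(bar)
$$\frac{\mathsf{st}(\mathsf{r}) \xrightarrow{\mathsf{barrier}} \mathsf{st}'(\mathsf{r}) \text{ for each } \mathsf{r} \in \mathsf{RNK}}{s \xrightarrow{(\mathsf{bar}, 1, \bot) \cdot (\mathsf{bar}, 2, \bot) \cdots (\mathsf{bar}, N, \bot)} (\mathsf{st}', \mathsf{m}, \mathsf{fa}, \mathsf{fb})}$$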
e = (k, r, (ra, a)) ∈ E. We use kind(e) = k to determine the kind of the event, rank(e) = r for
the rank of the process that produced the event, and addr(e) = (ra, a) to obtain the rank
and the address that are accessed by the event. If kind(e) ∈ {load, popa}, then e is said to
be a read of (ra, a). If kind(e) ∈ {store, popb}, then e is a write of address addr(e).
Table 1 shows a subset of the transition relation Δ𝑋 ; the remaining rules are similar.
When a process executes a remote write command, Rule (write), a new item is added to a
queue in fa. This item contains the source rank and source address from which the data will
be copied, together with the destination rank and destination address to which the data will
be copied. Eventually, the item is popped from the queue in fa, Rule (popa), the value is
read from the source address, and a new item is pushed into the corresponding queue in fb.
The new item contains the destination rank and destination address, and the value that was
read from the source address. Eventually, this item is popped from the queue, Rule (popb),
and the value is written to the destination address in the destination rank. Modeling two
queue configurations yields a symmetry between remote writes and reads: a read can be
interpreted as a write that comes upon request.
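The life cycle of a remote write through the two queue configurations can be sketched in a few lines. This is an illustrative model of Rules (write), (popa), and (popb) only; the class name `State`, the use of dictionaries and deques, and the omission of control states and the remaining rules are assumptions of this sketch.

```python
from collections import defaultdict, deque

class State:
    """Memory m plus the queue configurations fa and fb."""

    def __init__(self):
        self.m = defaultdict(int)       # (rank, address) -> value, initially 0
        self.fa = defaultdict(deque)    # (rank, queue) -> pending requests
        self.fb = defaultdict(deque)    # (rank, queue) -> values in transfer

    def write(self, r, loc_a, rem_r, rem_a, q):
        # Rule (write): enqueue (source rank, source addr, dest rank, dest addr).
        self.fa[(r, q)].append((r, loc_a, rem_r, rem_a))
        return ("write", r, None)

    def popa(self, r, q):
        # Rule (popa): read the source value now and move the request to fb.
        rs, a_s, rd, ad = self.fa[(r, q)].popleft()
        self.fb[(r, q)].append((rd, ad, self.m[(rs, a_s)]))
        return ("popa", r, (rs, a_s))

    def popb(self, r, q):
        # Rule (popb): store the transferred value at the destination.
        rd, ad, v = self.fb[(r, q)].popleft()
        self.m[(rd, ad)] = v
        return ("popb", r, (rd, ad))

s = State()
s.m[(1, "x")] = 1
events = [s.write(1, "x", 2, "y", 0), s.popa(1, 0)]
s.m[(1, "x")] = 7    # a later local store does not affect the value in transfer
events.append(s.popb(1, 0))
```

The value is captured at the popa step, so the store of 7 after popa does not reach the destination; this is the delay between reading and writing data that the two queues are meant to capture.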
The semantics of a PGAS program C(𝒫, 𝑁) := ℒ(𝑋(𝒫, 𝑁)) ⊆ E* is the set of computa-
tions of the state-space automaton.
▶ Example 1. Consider PGAS program (1to1, 2) with the program code from Figure 1(a) being run on two nodes. It has the following computation:
$$\tau_{1to1} = \mathsf{write} \cdot \mathbf{write} \cdot \mathsf{popa} \cdot \mathbf{popa} \cdot \mathsf{bar} \cdot \mathbf{bar} \cdot \mathsf{load} \cdot \mathsf{popb} \cdot \mathbf{popb}.$$
Bold events belong to the process with rank 2, the other events to the process with rank 1. We have $\mathrm{addr}(\mathsf{popa}) = (1, x)$, $\mathrm{addr}(\mathsf{popb}) = (2, y)$. Symmetrically, $\mathrm{addr}(\mathbf{popa}) = (2, x)$ and $\mathrm{addr}(\mathbf{popb}) = (1, y)$. The assert in Figure 1 is a shortcut for a combination of load and assume, and in this computation $\mathrm{addr}(\mathsf{load}) = (1, y)$.
2.4 Simulating PGAS APIs
Our formalism natively supports asynchronous data transfers and queues. Operations in the
same queue are completed in the order in which they were issued. Using this, we can model
the ordering guarantees given by ARMCI and GPI – by putting ordered operations into the
same queue.
To model waiting on individual operations (waiting on a handle), we associate a shadow
memory address with each operation. Before issuing the operation, the value at this address
is set to 0. After issuing the operation, the process places in the same queue a read request that sets the value at the shadow address to 1. Now waiting on the
individual operation can be implemented by polling on the shadow address associated with
the operation. Waiting on all operations in a given queue is done similarly. Synchronous
data transfers are modeled by asynchronous transfers, immediately followed by a wait.
3 Robustness: A Notion of Appropriate Synchronization
We now define robustness, a correctness condition for PGAS programs. Robustness is a
weaker criterion than requiring all computations to be sequentially consistent [18]: it allows
for reordering of events as long as there are no causality cycles. As causality relation, we
adopt the happens-before relation [26]. Fix a computation 𝜏 ∈ C(𝒫, 𝑁). Its happens-before
relation is the union of the three relations we define next, $\to_{hb}(\tau) := \to_{po} \cup \to_{cf} \cup \leftrightarrow$.
The program order relation $\to_{po}$ is the union of the program order relations for all processes: $\to_{po} := \bigcup_{\mathsf{r} \in \mathsf{RNK}} \to^{\mathsf{r}}_{po}$. Relation $\to^{\mathsf{r}}_{po}$ gives the order in which events were issued in process r. Formally, let 𝜏′ be the subsequence of all events e in 𝜏 such that rank(e) = r and kind(e) ∉ {popa, popb}. Then $\to^{\mathsf{r}}_{po} := \mathit{succ}(\tau')$.
The conflict relation →cf orders conflicting accesses to the same address. Let 𝜏 =
𝛼 · e1 ·𝛽 · e2 · 𝛾, where e1 and e2 access the same address, and at least one of them is a write:
addr(e1) = addr(e2) = (r, a), kind(e1) ∈ {store, popb} or kind(e2) ∈ {store, popb}. If there is
no e ∈ 𝛽 such that addr(e) = (r, a) and kind(e) ∈ {store, popb}, then e1 →cf e2.
The identity relation ↔ identifies events corresponding to the same command. Let e be
a remote read or write event, kind(e) ∈ {read,write}, and e1 and e2 be the corresponding
requests, kind(e1) = popa and kind(e2) = popb. Then we have e ↔ e1 ↔ e2. In a similar
way, ↔ identifies matching barrier events in different processes.
We say a computation 𝜏 is violating if the associated happens-before relation contains
a non-trivial cycle, i.e., a cycle that is not included in ↔. Violating computations violate
sequential consistency. The robustness problem amounts to proving the absence of violations:
ROB Given a program (𝒫, 𝑁), show that no computation 𝜏 ∈ C(𝒫, 𝑁) is violating.
▶ Example 2. The happens-before relation of computation $\tau_{1to1}$ is depicted in Figure 3.
It is cyclic, therefore 𝜏1to1 is violating and (1to1, 2) is not robust. Indeed, no sequentially
consistent execution of 1to1 allows the assert statements to load the initial value of 𝑦.
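The cycle check on the example can be sketched as follows. Events are encoded as tuples (kind, rank, addr, cmd), where the `cmd` label groups ↔-related events; this encoding, the assignment of events to ranks, and the simplification of merging ↔-related events into one node (ignoring edges inside a group) are assumptions of this sketch rather than the paper's construction.

```python
def happens_before_cycle(tau):
    """Return True iff po ∪ cf has a cycle after merging <->-related events."""
    cls = [e[3] for e in tau]          # group label of each position
    edges = set()

    # Program order: successor relation per rank, ignoring popa/popb.
    last = {}
    for i, (kind, rank, _, _) in enumerate(tau):
        if kind in ("popa", "popb"):
            continue
        if rank in last:
            edges.add((cls[last[rank]], cls[i]))
        last[rank] = i

    # Conflict order: same address, one access is a write, no write in between.
    writes = ("store", "popb")
    for i, (k1, _, a1, _) in enumerate(tau):
        if a1 is None:
            continue
        for j in range(i + 1, len(tau)):
            k2, _, a2, _ = tau[j]
            if a2 != a1:
                continue
            if k1 in writes or k2 in writes:
                edges.add((cls[i], cls[j]))
            if k2 in writes:
                break                  # any later access has this write in between

    edges = {(u, v) for (u, v) in edges if u != v}

    # Repeatedly remove nodes without incoming edges; leftovers lie on or after a cycle.
    nodes = set(cls)
    while True:
        sources = {n for n in nodes
                   if not any(v == n and u in nodes for (u, v) in edges)}
        if not sources:
            return bool(nodes)
        nodes -= sources

w1 = [("write", 1, None, "w1"), ("popa", 1, (1, "x"), "w1"), ("popb", 1, (2, "y"), "w1")]
w2 = [("write", 2, None, "w2"), ("popa", 2, (2, "x"), "w2"), ("popb", 2, (1, "y"), "w2")]
bars = [("bar", 1, None, "b"), ("bar", 2, None, "b")]
load = ("load", 1, (1, "y"), "l1")

tau_1to1 = [w1[0], w2[0], w1[1], w2[1], *bars, load, w1[2], w2[2]]
tau_sc = [*w1, *w2, *bars, load]     # pops directly follow their writes

violating = happens_before_cycle(tau_1to1)
sc_is_violating = happens_before_cycle(tau_sc)
```

On $\tau_{1to1}$ the sketch finds the cycle through the rank-2 write, the barrier, and the load; when each pop directly follows its write, the conflict edge points from the write to the load and no cycle remains.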
Our main result is the following.
▶ Theorem 3. ROB is PSpace-complete.
The PSpace lower bound follows from PSpace-hardness of control state reachability in
sequentially consistent programs [17]. To reduce to robustness, we add an artificial happens-
before cycle starting in the target control state. The rest of the paper shows a PSpace
algorithm, and hence upper bound, for the problem.
4 Normal-Form Violations
We show that a PGAS program is not robust if and only if it has a violating computation
of the following normal form.
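▶ Definition 4. Computation $\tau = \tau_1 \cdot \tau_2 \cdot \tau_3 \cdot \tau_4 \in C(\mathcal{P}, N)$ is in normal form if all $e \in \tau_2 \cdot \tau_3 \cdot \tau_4$ satisfy kind(e) ∈ {popa, popb} and for all $a, b \in \tau_1$ with kind(a), kind(b) ∉ {popa, popb} and all $a', b' \in \tau_i$ with $i \in \overline{1,4}$ we have:
$$a <_{\tau_1} b,\quad a \not\leftrightarrow^* b,\quad a \leftrightarrow^* a',\quad b \leftrightarrow^* b' \quad\text{implies}\quad a' <_{\tau_i} b'. \tag{NF}$$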
We explain the normal-form requirement (NF). Consider two accesses 𝑎 and 𝑏 to remote
processes that can be found in the first part of the computation 𝜏1. Assume corresponding
pop events 𝑎′ and 𝑏′ are delayed and can both be found in a later part of the computation, say
𝜏2. Then the ordering of 𝑎′ and 𝑏′ in 𝜏2 coincides with the order of 𝑎 and 𝑏 in 𝜏1. Computation
$\tau_{1to1}$ is not in normal form whereas $\tau^{nf}_{1to1}$ in Figure 4 is. The following theorem guarantees
that, in case of non-robustness, normal-form violations always exist.
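A direct check of the normal-form requirement can be sketched as follows. Events are tuples (kind, rank, addr, cmd), where equal `cmd` labels stand for ↔*-related events; this encoding, the rank assignment in the example data, and the function name are assumptions of this sketch.

```python
POPS = ("popa", "popb")

def is_normal_form(parts):
    """parts = [tau1, tau2, tau3, tau4]; check Definition 4 literally."""
    tau1 = parts[0]
    # tau2..tau4 may only contain popa and popb events.
    if not all(e[0] in POPS for part in parts[1:] for e in part):
        return False
    issued = [e for e in tau1 if e[0] not in POPS]   # in tau1 order
    for part in parts:
        for i, a in enumerate(issued):
            for b in issued[i + 1:]:                 # a <_{tau1} b
                if a[3] == b[3]:
                    continue                         # a <->* b: no requirement
                a_pos = [k for k, e in enumerate(part) if e[3] == a[3]]
                b_pos = [k for k, e in enumerate(part) if e[3] == b[3]]
                # Every a' must precede every b' within this part.
                if a_pos and b_pos and max(a_pos) > min(b_pos):
                    return False
    return True

w1, pa1, pb1 = ("write", 1, None, "w1"), ("popa", 1, (1, "x"), "w1"), ("popb", 1, (2, "y"), "w1")
w2, pa2, pb2 = ("write", 2, None, "w2"), ("popa", 2, (2, "x"), "w2"), ("popb", 2, (1, "y"), "w2")
bar1, bar2 = ("bar", 1, None, "b"), ("bar", 2, None, "b")
load = ("load", 1, (1, "y"), "l1")

# tau_1to1: both writes are issued before either popa.
tau_1to1_parts = [[w1, w2, pa1, pa2, bar1, bar2, load], [pb1, pb2], [], []]
# The reordering shown in Figure 4: each popa directly follows its write.
tau_nf_parts = [[w1, pa1, w2, pa2, bar1, bar2, load], [pb1, pb2], [], []]
```

The first split fails (NF) because the popa of the first write occurs after the second write inside $\tau_1$; the second split satisfies it.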
▶ Theorem 5. A PGAS program (𝒫, 𝑁) is robust iff it has no normal-form violation.
Phrased differently, to decide robustness our procedure should look for normal-form
violations. The remainder of the section is devoted to proving Theorem 5. We make use of
the following property of PGAS programs: every computation contains an event that can be
deleted, in the sense that the result is again a computation of the program, i.e., in C(𝒫, 𝑁).
▶ Lemma 6 (Cancellation). Consider a computation $\varepsilon \neq \tau \in C(\mathcal{P}, N)$ and let e be the last event in 𝜏 with kind(e) ∉ {popa, popb}. Then $\tau \setminus e \in C(\mathcal{P}, N)$, where computation 𝜏 ∖ e is defined to remove e and all ↔-related events from 𝜏.
Proof. All events to the right of e are unconditionally executable. Moreover, 𝜏 does not have $\to_{po}$-successors following e. Therefore, the resulting computation 𝜏 ∖ e is in C(𝒫, 𝑁). ◀
A PGAS program is not robust if and only if it has a violating computation 𝜏 of minimal
length. Let e ∈ 𝜏 be the event determined by Lemma 6. If kind(e) ̸∈ {read,write}, then
𝜏 = 𝜏1 · e · 𝜏2. Otherwise 𝜏 = 𝜏1 · e · 𝜏2 · e′ · 𝜏3 · e′′ · 𝜏4 with e↔ e′ ↔ e′′. Consider the latter
case where 𝜏 ∖ e = 𝜏1 · 𝜏2 · 𝜏3 · 𝜏4. Since |𝜏 ∖ e| < |𝜏 |, the new computation is not violating and
→hb (𝜏 ∖ e) is acyclic. This acyclicity guarantees that we find a computation 𝜎 ∈ E* with
the same happens-before relation as 𝜏 ∖ e and where pop events directly follow their remote
accesses. Intuitively, 𝜎 is a sequentially consistent computation corresponding to 𝜏 ∖ e.
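▶ Lemma 7 ([26]). There is $\sigma \in C(\mathcal{P}, N)$ with $\to_{hb}(\sigma) = \to_{hb}(\tau \setminus e)$ and $\sigma = \sigma_1 \cdot e_1 \cdots e_n \cdot \sigma_2$ for all $e_1 \leftrightarrow \cdots \leftrightarrow e_n$.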
We now use 𝜎 to rearrange the events in 𝜏 ∖e and guarantee the normal-form requirement.
The idea is to project 𝜎 to the events in 𝜏1 to 𝜏4. Reinserting e yields a normal-form violation:
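$$\tau^{nf} := (\sigma{\downarrow}\tau_1) \cdot e \cdot (\sigma{\downarrow}\tau_2) \cdot e' \cdot (\sigma{\downarrow}\tau_3) \cdot e'' \cdot (\sigma{\downarrow}\tau_4).$$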
The following lemma concludes the proof of Theorem 5.
▶ Lemma 8 (Reinsertion). $\tau^{nf} \in C(\mathcal{P}, N)$, $\to_{hb}(\tau^{nf}) = \to_{hb}(\tau)$, and $\tau^{nf}$ is in normal form.
▶ Example 9. Computation $\tau_{1to1}$ in Example 1 is a shortest violation. The event determined
by Lemma 6 is e = load. Therefore, 𝜏 ∖ e = 𝜏1 · 𝜏2 with
𝜏1 = write ·write · popa · popa · bar · bar and 𝜏2 = popb · popb.
[Figure 4 diagram: the computation write · popa · write · popa · bar · bar · load · popb · popb, annotated with po and cf edges.]
Figure 4 Normal-form violation 𝜏nf1to1 from Example 9. The edges indicate the dependencies in
the computation and coincide with the relations in Figure 3.
A sequentially consistent computation corresponding to 𝜏 ∖ e is
𝜎 = write · popa · popb ·write · popa · popb · bar · bar.
The normal-form violation $\tau^{nf}_{1to1}$ is depicted in Figure 4. Note that $\tau^{nf}_{1to1}$ is indeed in C(1to1, 2). Moreover, $\mathsf{popa}$ and $\mathbf{popa}$ immediately follow $\mathsf{write}$ and $\mathbf{write}$, respectively. Similarly, the $\mathsf{popb}$ and $\mathbf{popb}$ events in the second part of the computation respect the order of $\mathsf{write}$ and $\mathbf{write}$ in the first part of the computation. This means the property (NF) holds.
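The projection-and-reinsertion step of this example can be replayed mechanically. The sketch below uses plain strings with rank suffixes as event names; these names and the pairing of events with ranks are assumptions made for illustration.

```python
def project(sigma, part):
    """sigma↓part: the events of sigma that occur in part, in sigma's order."""
    members = set(part)
    return [e for e in sigma if e in members]

# tau \ e = tau1 · tau2 after cancelling e = load (a local event, so no e', e'').
tau1 = ["write1", "write2", "popa1", "popa2", "bar1", "bar2"]
tau2 = ["popb1", "popb2"]

# A sequentially consistent computation with the same happens-before relation.
sigma = ["write1", "popa1", "popb1", "write2", "popa2", "popb2", "bar1", "bar2"]

# Reinsert e between the projected parts.
tau_nf = project(sigma, tau1) + ["load1"] + project(sigma, tau2)
```

The result lists each popa directly after its write and keeps both popb events after the load, matching the event order of Figure 4.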
5 From Normal-Form Violations to Language Emptiness
We now reduce checking the absence of normal-form violations to the emptiness problem in
a suitable automaton model. We introduce multiheaded automata and construct, for each
program (𝒫, 𝑁), a multiheaded automaton accepting all normal-form computations. To
verify robustness, we check that the intersection of this automaton with regular languages
accepting cyclic happens-before relations is empty.
5.1 Multiheaded Automata
Multiheaded automata are an extension of finite automata. Intuitively, instead of generating
just a single computation, they generate several computations in one pass, each by a separate
head. The language of the multiheaded automaton then consists of the concatenations of
the computations generated by each head.
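Syntactically, an 𝑛-headed finite automaton over Σ is a finite automaton that uses the extended alphabet $\overline{1,n} \times \Sigma$. So we have $A = (S, (\overline{1,n} \times \Sigma), \Delta, s_0, F)$. The semantics, however, is different from finite automata. Given $\sigma \in (\overline{1,n} \times \Sigma)^*$, we use $\sigma{\downarrow}k$ to project 𝜎 to the letters (𝑘, 𝑎), and afterwards cut away the index 𝑘. So $((1, a) \cdot (2, b) \cdot (1, c)){\downarrow}1 = a \cdot c$. The language of 𝐴 is $\mathcal{L}(A) := \{\mathit{comp}(\sigma) \mid s_0 \xrightarrow{\sigma} s \text{ for some } s \in F\}$ where $\mathit{comp}(\sigma) := \sigma{\downarrow}1 \cdots \sigma{\downarrow}n$.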
Multiheaded automata are closed under regular intersection, and emptiness is decidable
in non-deterministic logarithmic space. Indeed, checking emptiness reduces to finding a path
from an initial to a final node in a directed graph.
▶ Lemma 10. Consider an 𝑛-headed automaton 𝑈 and a finite automaton 𝑉 over a common alphabet Σ. There is an 𝑛-headed automaton 𝑊 with ℒ(𝑊) = ℒ(𝑈) ∩ ℒ(𝑉).
▶ Lemma 11. Emptiness for 𝑛-headed automata is NL-complete.
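The head-wise concatenation and the reachability view of emptiness can be illustrated with a short sketch. The encoding of runs as lists of (head, letter) pairs and of transitions as triples is an assumption of this sketch; the emptiness check is an ordinary breadth-first search and makes no attempt to meet the logarithmic space bound.

```python
from collections import deque

def comp(sigma, n):
    """comp(sigma) = sigma↓1 · ... · sigma↓n for sigma over {1..n} x Sigma."""
    return [a for k in range(1, n + 1) for (head, a) in sigma if head == k]

def is_empty(delta, s0, final):
    """Emptiness: no final state is reachable from s0 in the transition graph."""
    seen, todo = {s0}, deque([s0])
    while todo:
        s = todo.popleft()
        if s in final:
            return False
        for (p, _label, q) in delta:
            if p == s and q not in seen:
                seen.add(q)
                todo.append(q)
    return True

# The projection example from the text, with two heads.
example = comp([(1, "a"), (2, "b"), (1, "c")], 2)

# A run of a 3-headed automaton that emits a, b, c on heads 1, 2, 3 in a loop.
sigma = [(1, "a"), (2, "b"), (3, "c")] * 2
word = "".join(comp(sigma, 3))
```

Because each head contributes its own segment of the output, a single loop over three heads yields words of the shape $a^n b^n c^n$, which no context-free grammar generates.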
Multiheaded automata are incomparable with context-free grammars, and indeed the normal-form computations of a program may be non-context-free.¹
¹ Consider $\mathcal{P} := (\{q_0\}, \mathsf{CMD}, \{q_0 \xrightarrow{\mathsf{read}(0,0,0,0)} q_0\}, \{q_0\})$ running on a single node. The language C(𝒫, 1) is not context-free. To see this, let kind(𝑎) = read, kind(𝑏) = popa, and kind(𝑐) = popb. Then $C(\mathcal{P}, 1) \cap a^* b^* c^*$ is the non-context-free language $\{a^p b^p c^p \mid p \geq 0\}$.
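Table 2 Transition rules for $Y(\mathcal{P}, N)$, given $q_1 \xrightarrow{\mathit{cmd}} q_2$ and current state s = (st, m, pa, pb) with $\mathsf{st}(\mathsf{r}) = q_1$. The target is s′ = (st′, m′, pa′, pb′) where, unless otherwise stated, st′ = st, m′ = m, pa′ = pa, pb′ = pb. The auxiliary states $s_{aux1}, s_{aux2} \in S^{aux}_Y$ are unique for each rule application.

(gpa′)
$$\frac{\mathsf{pa}(\mathsf{r}, \mathsf{q}) < \mathsf{pb}(\mathsf{r}, \mathsf{q})}{s \xrightarrow{\varepsilon} s'} \qquad \mathsf{pa}' := \mathsf{pa}[(\mathsf{r}, \mathsf{q}) := \mathsf{pa}(\mathsf{r}, \mathsf{q}) + 1]$$

(gpb′)
$$\frac{\mathsf{pb}(\mathsf{r}, \mathsf{q}) < 4}{s \xrightarrow{\varepsilon} s'} \qquad \mathsf{pb}' := \mathsf{pb}[(\mathsf{r}, \mathsf{q}) := \mathsf{pb}(\mathsf{r}, \mathsf{q}) + 1]$$

(write′)
$$\frac{\mathit{cmd} = \mathsf{write}(e^{loc}_a, e^{rem}_r, e^{rem}_a, e_q) \qquad \mathsf{pa}(\mathsf{r}, \hat{e}_q) = m \qquad \mathsf{pb}(\mathsf{r}, \hat{e}_q) = n}{s \xrightarrow{1,\, (\mathsf{write}, \mathsf{r}, \bot)} s_{aux1} \xrightarrow{m,\, (\mathsf{popa}, \mathsf{r}, (\mathsf{r}, \hat{e}^{loc}_a))} s_{aux2} \xrightarrow{n,\, (\mathsf{popb}, \mathsf{r}, (\hat{e}^{rem}_r, \hat{e}^{rem}_a))} s'}$$
with $\mathsf{st}' := \mathsf{st}[\mathsf{r} := q_2]$ and, if $n = 1$, $\mathsf{m}' := \mathsf{m}[(\hat{e}^{rem}_r, \hat{e}^{rem}_a) := \mathsf{m}(\mathsf{r}, \hat{e}^{loc}_a)]$.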
Multiheaded automata can be understood as a restriction of matrix grammars [12]. In
matrix grammars, productions simultaneously rewrite multiple non-terminals. Roughly,
each production can be understood as a Petri net transition, and emptiness is decidable
as Petri net reachability is. Since we target a PSpace result, matrix grammars are too
expressive for our purposes.
5.2 Detecting Normal-Form Computations
We define a 4-headed automaton 𝑌(𝒫, 𝑁) := (𝑆𝑌 ⊎ 𝑆aux𝑌, E, Δ𝑌, 𝑠0𝑌, 𝑆𝑌) that accepts all
normal-form computations 𝜏 = 𝜏1 · 𝜏2 · 𝜏3 · 𝜏4 ∈ C(𝒫, 𝑁). In order to accept 𝜏1, the new
automaton tracks the control and memory configurations in the way 𝑋(𝒫, 𝑁) does. For
the remainder of the computation, these configurations are not needed. Indeed, 𝜏2 to 𝜏4
only consist of popa and popb events that are executable independently of the control and
memory configurations. However, 𝑌 (𝒫, 𝑁) has to take care of the ordering of popa and popb
events from the same queue. In particular, if e1 handles a request issued before the request
of e2 with kind(e1) = kind(e2), then it cannot be the case that e1 ∈ 𝜏𝑗 and e2 ∈ 𝜏𝑖 with 𝑖 < 𝑗.
Guided by this discussion, we define a state 𝑠 ∈ 𝑆𝑌 as a tuple 𝑠 := (st,m, pa, pb). The
state and memory configurations st and m are defined as in Section 2. They reflect the state
of the program after it has generated a prefix of 𝜏1. The functions pa, pb : RNK × QUE → [1, 4]
give, for each process and each queue, the part 𝜏1 to 𝜏4 of the computation where the next
popa resp. popb event will be generated. The initial state is 𝑠0 := (st0,m0, pa0, pb0) with
pa0(r, q) := 1 =: pb0(r, q) for all r ∈ RNK and q ∈ QUE.
The transition relation Δ𝑌 is the smallest relation defined by the rules in Table 2.
Rule (gpa′) lets the automaton choose the part of the computation to which the next popa
event will be appended. The first restriction is that the index of the part can only increase,
as events from the same queue are processed in order. The second restriction is that popa
events cannot be generated to the right of popb events from the same queue. Rule (gpb′) is
the analogous rule for popb events; its only restriction is that the index cannot exceed 4.
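The two guessing rules can be sketched as updates on the counters pa and pb. This is a minimal illustration under the assumption that the counters are stored as dictionaries keyed by (rank, queue); the function names and the convention of returning `None` for a disabled rule are hypothetical.

```python
PARTS = 4  # a normal-form computation is split into tau_1 ... tau_4


def initial_counters(ranks, queues):
    """pa_0(r, q) = pb_0(r, q) = 1 for every process r and queue q."""
    pa = {(r, q): 1 for r in ranks for q in queues}
    pb = dict(pa)
    return pa, pb


def gpa(pa, pb, r, q):
    """Rule (gpa'): move the next popa of (r, q) one part to the right.

    Enabled only if pa(r, q) < pb(r, q), so popa never overtakes popb.
    Returns the updated pa, or None if the rule is not enabled.
    """
    if pa[(r, q)] < pb[(r, q)]:
        return {**pa, (r, q): pa[(r, q)] + 1}
    return None


def gpb(pb, r, q):
    """Rule (gpb'): move the next popb of (r, q) one part to the right.

    Enabled only if pb(r, q) < 4. Returns the updated pb, or None.
    """
    if pb[(r, q)] < PARTS:
        return {**pb, (r, q): pb[(r, q)] + 1}
    return None
```

Both rules only ever increment, and the guards preserve the invariant 1 ≤ pa(r, q) ≤ pb(r, q) ≤ 4 that holds in the initial state.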
By Rule (write′), the automaton appends a write event to 𝜏1 and the corresponding popa
and popb events in one shot to the parts determined by pa and pb. Since a single transition of
a multiheaded automaton can generate at most one letter, the rule makes use of intermediary
states from 𝑆aux𝑌 . If popb is added to 𝜏1, the memory configuration is updated accordingly.
Note that the generation in one shot causes pop events within the same part 𝜏𝑖 to follow in
the order of the corresponding read/write events in 𝜏1. Fortunately, this is always the case in
normal-form computations by (NF). Computations that are not in normal form, e.g. 𝜏1to1,
cannot be generated by 𝑌 (𝒫, 𝑁).
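The effect of Rule (write′) can be sketched as a single function that returns the three generated letters together with the successor state. This is an illustration only: it assumes the expressions of the write command have already been evaluated, it collapses the two auxiliary states into one call, and the dictionary encoding of st, m, pa, and pb as well as the name `apply_write` are hypothetical.

```python
def apply_write(state, r, q2, loc_a, rem_r, rem_a, q):
    """Sketch of rule (write') for process r.

    state = (st, m, pa, pb), where st maps ranks to control states, m maps
    (rank, address) to values, and pa/pb map (rank, queue) to a part 1..4.
    Returns (letters, new_state); each letter is a pair (head, event).
    """
    st, m, pa, pb = state
    head_a = pa[(r, q)]  # part that receives the popa event
    head_b = pb[(r, q)]  # part that receives the popb event
    letters = [
        (1, ("write", r, None)),                 # None stands for the bottom symbol
        (head_a, ("popa", r, (r, loc_a))),
        (head_b, ("popb", r, (rem_r, rem_a))),
    ]
    new_st = {**st, r: q2}
    new_m = dict(m)
    if head_b == 1:
        # The remote write completes within tau_1, so memory is updated.
        new_m[(rem_r, rem_a)] = m[(r, loc_a)]
    return letters, (new_st, new_m, pa, pb)
```

When pb points to a later part, the memory is left untouched, matching the observation that control and memory configurations are only tracked while 𝜏1 is being generated.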
The set of final states of 𝑌 (𝒫, 𝑁) is 𝑆𝑌 . The auxiliary states 𝑆aux𝑌 are not included in
the set of final states to forbid computations with pending remote requests.
▶ Lemma 12. {𝜏 ∈ C(𝒫, 𝑁) | 𝜏 is in normal form} = ℒ(𝑌(𝒫, 𝑁)).
5.3 Detecting Violations
The multiheaded automaton accepts all normal-form computations, and we would like to
check if one of these computations is violating. In general, violating computations can
contain complicated cycles in the happens-before relation. However, we now show that
whenever a computation has a happens-before cycle, it has a cycle in which each process is
entered and left at most once. Our algorithm for robustness will look for happens-before
cycles of this special form that, as we will show, can be captured by a regular language.
▶ Lemma 13. Computation 𝜏 ∈ C(𝒫, 𝑁) is violating iff there is a cycle
𝑎1 ↔* 𝑏1 →*po 𝑐1 ↔* 𝑑1 ⇝ . . . 𝑎𝑘 ↔* 𝑏𝑘 →*po 𝑐𝑘 ↔* 𝑑𝑘 ⇝ 𝑎1 (CYC)
where rank(𝑥𝑖) = rank(𝑦𝑗) iff 𝑖 = 𝑗, for all 𝑥𝑖, 𝑦𝑗 ∈ {𝑎1, . . . , 𝑑𝑘}, and ⇝ := →cf ∪ ↔.
▶ Example 14. The computations 𝜏1to1 (Example 1) and 𝜏nf1to1 (Example 9) have a cycle
of the form (CYC) depicted in Figure 3: 𝑘 = 2, 𝑎1 = 𝑏1 = bar, 𝑐1 = 𝑑1 = load, 𝑎2 = popb,
𝑏2 = write, 𝑐2 = 𝑑2 = bar.
Note that 𝑑𝑖 ↔ 𝑎𝑖+1 means both are barriers, kind(𝑑𝑖) = bar = kind(𝑎𝑖+1). This holds as
the ranks are different. In spite of the additional restrictions, cycles (CYC) are not trivial to
recognize. The reason is that the events constituting the cycle are not necessarily contained
in the computation in the order in which they appear in the cycle, see Figure 4. The idea
of our cycle detection is to first guess the events 𝑎𝑖 and 𝑑𝑖 for each process and then check
that 𝑑𝑖 and 𝑎𝑖+1 are related by →cf ∪ ↔. The former can be accomplished by an extension 𝑌M(𝒫, 𝑁) of the
multiheaded automaton 𝑌 (𝒫, 𝑁), the latter by a regular intersection.
The automaton 𝑌M(𝒫, 𝑁) accepts computations over the alphabet E × M with M :=
2^{enter,leave}, the set of subsets of {enter, leave}. The events marked by enter are the guessed 𝑎𝑖 events in (CYC) and those
marked by leave are the 𝑑𝑖 events in (CYC). We still have to guarantee we only mark 𝑎𝑖
and 𝑑𝑖 that satisfy 𝑎𝑖 ↔* 𝑏𝑖 →*po 𝑐𝑖 ↔* 𝑑𝑖. This is straightforward thanks to the fact that
𝑌 (𝒫, 𝑁) generates the events of each process in program order, and generates events related
by ↔ in one shot. The full construction of 𝑌 M(𝒫, 𝑁) is given in [9].
▶ Example 15. Consider the normal-form computation 𝜏nf1to1 (Example 9) that has the
cycle (CYC) given in Figure 3. A corresponding marked computation of 𝑌 M(𝒫, 𝑁) is
(write, ∅) · (popa, ∅) · (write, ∅) · (popa, ∅)·
(bar, {enter}) · (bar, {leave}) · (load, {leave}) · (popb, ∅) · (popb, {enter}).
Every cycle of the form (CYC) has a cycle type cyc, which is a sequence cyc = r1 . . . r𝑘
of ranks from [1, 𝑁] with r𝑖 ≠ r𝑗 for 𝑖 ≠ 𝑗. The idea is that the events 𝑎𝑖, 𝑏𝑖, 𝑐𝑖, 𝑑𝑖 belong
to rank r𝑖. For each pair r𝑖, r𝑖+1 in this sequence, we construct a finite automaton 𝑍_{r𝑖,r𝑖+1}
over the alphabet E × M. It checks whether there is a conflict or identity edge from the
leave-marked event of process r𝑖 to the enter-marked event of process r𝑖+1. Consider the case
of conflicts. The automaton looks for a marked event (e𝑖,𝑚𝑖) with rank(e𝑖) = r𝑖 marked
by leave ∈ 𝑚𝑖. It remembers the kind and the address of this event. Then, it seeks a
marked event (e𝑖+1,𝑚𝑖+1) with rank(e𝑖+1) = r𝑖+1 marked by enter ∈ 𝑚𝑖+1. If both events
are found, they touch the same address, and one of them is a write, the automaton reaches
the accepting state. Since finite automata are closed under intersection, we can define the
finite automaton of cycle type cyc as 𝑍cyc := 𝑍_{r1,r2} ∩ . . . ∩ 𝑍_{r𝑘−1,r𝑘} ∩ 𝑍_{r𝑘,r1}.
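The conflict case of the check between two consecutive ranks can be sketched as a left-to-right scan that remembers only a kind and an address, as a finite automaton would. Several points are assumptions of this example: events are encoded as (kind, rank, address) triples, each rank carries at most one leave mark and one enter mark, the leave-marked event precedes the enter-marked one in the word, and the caller supplies the predicate `is_write` that decides which kinds write to memory. Identity edges are not covered.

```python
def conflict_edge(word, r_i, r_next, is_write):
    """Check for a conflict from the leave-marked event of rank r_i to the
    enter-marked event of rank r_next.

    word: sequence of ((kind, rank, address), marks) pairs in computation
          order, where marks is a subset of {"enter", "leave"}.
    is_write: predicate on kinds (hypothetical, supplied by the caller).
    """
    remembered = None  # (kind, address) of the leave-marked event, once seen
    for (kind, rank, addr), marks in word:
        if remembered is None:
            if rank == r_i and "leave" in marks:
                remembered = (kind, addr)
        elif rank == r_next and "enter" in marks:
            left_kind, left_addr = remembered
            return (addr is not None
                    and addr == left_addr
                    and (is_write(left_kind) or is_write(kind)))
    return False
```

A word shaped like the marked computation of Example 15, with a leave-marked load followed by an enter-marked popb on the same address, is accepted when popb is declared to be a writing kind.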
▶ Theorem 16. 𝒫 is robust iff ℒ(𝑌M(𝒫, 𝑁)) ∩ ℒ(𝑍cyc) = ∅ for all cycle types cyc.
We can now prove Theorem 3. To check whether (𝒫, 𝑁) is robust, we go over all cycle
types cyc = r1 . . . r𝑘. This enumeration of cycle types can be done in space that is polynomial
in 𝑁. For each such sequence, we check if ℒ(𝑌M(𝒫, 𝑁)) ∩ ℒ(𝑍cyc) = ∅. By Theorem 16, the
program is robust iff all intersections are empty. By Lemma 10, there is a 4-headed finite
state automaton 𝑊 with ℒ(𝑊) = ℒ(𝑌M(𝒫, 𝑁)) ∩ ℒ(𝑍cyc). Since the size of 𝑊 is at most exponential
in the size of (𝒫, 𝑁) and emptiness is in NL by Lemma 11, deciding ℒ(𝑊) = ∅ takes space
logarithmic in the size of 𝑊, hence polynomial in the size of (𝒫, 𝑁). This shows robustness is in PSpace.
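The outer loop of this procedure can be sketched as follows. The enumeration is written as a generator, so only one cycle type is held in memory at a time. The emptiness check for a given cycle type is passed in as a callback, since its construction is not part of this sketch; the names `cycle_types`, `is_robust`, and `intersection_is_empty` are hypothetical, and whether cycle types of length one need to be considered is left as a parameter.

```python
from itertools import permutations


def cycle_types(n, min_len=1):
    """Yield every sequence r_1 ... r_k of pairwise distinct ranks from 1..n.

    min_len is an assumption of this sketch: raise it to 2 if cycles that
    stay within a single process are excluded.
    """
    for k in range(min_len, n + 1):
        yield from permutations(range(1, n + 1), k)


def is_robust(n, intersection_is_empty, min_len=1):
    """Robust iff the intersection is empty for every cycle type.

    intersection_is_empty(cyc) is a caller-supplied check standing in for
    the emptiness test of L(Y^M(P, N)) ∩ L(Z_cyc).
    """
    return all(intersection_is_empty(cyc) for cyc in cycle_types(n, min_len))
```

Because `all` stops at the first non-empty intersection, a violating cycle type ends the search early.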
References
1 S. V. Adve and M. D. Hill. A unified formalization of four shared-memory models. IEEE
Transactions on Parallel and Distributed Systems, 4(6):613–624, 1993.
2 J. Alglave. A Shared Memory Poetics. PhD thesis, University Paris 7, 2010.
3 J. Alglave and L. Maranget. Stability in weak memory models. In CAV, volume 6806 of
LNCS, pages 50–66. Springer, 2011.
4 D. Bonachea. GASNet specification, v1.1. Technical Report UCB/CSD-02-1207, University
of California, Berkeley, 2002.
5 A. Bouajjani, E. Derevenetc, and R. Meyer. Checking and enforcing robustness against
TSO. In ESOP, volume 7792 of LNCS, pages 533–553. Springer, 2013.
6 A. Bouajjani, R. Meyer, and E. Möhlmann. Deciding robustness against Total Store
Ordering. In ICALP, volume 6756 of LNCS, pages 428–440. Springer, 2011.
7 S. Burckhardt and M. Musuvathi. Effective program verification for relaxed memory
models. In CAV, volume 5123 of LNCS, pages 107–120. Springer, 2008.
8 J. Burnim, C. Stergiou, and K. Sen. Sound and complete monitoring of sequential
consistency for relaxed memory models. In TACAS, volume 6605 of LNCS, pages 11–25. Springer,
2011.
9 G. Calin, E. Derevenetc, R. Majumdar, and R. Meyer. A theory of partitioned global
address spaces. CoRR, abs/1307.6590, 2013. http://arxiv.org/abs/1307.6590.
10 B. Chapman, T. Curtis, S. Pophale, S. Poole, J. Kuehn, C. Koelbel, and L. Smith.
Introducing OpenSHMEM: SHMEM for the PGAS community. In PGAS, page 2. ACM,
2010.
11 UPC Consortium. UPC language specification v1.2. Technical report, 2005.
12 J. Dassow and G. Păun. Regulated Rewriting in Formal Language Theory, volume 18 of
Monographs in Theoretical Computer Science. An EATCS Series. Springer, 1989.
13 D. Dice. A race in locksupport park() arising from weak memory models.
https://blogs.oracle.com/dave/entry/a_race_in_locksupport_park, Nov 2009.
14 J. Dinan, P. Balaji, D. Buntinas, D. Goodell, W. Gropp, and R. Thakur. An implementation
and evaluation of the MPI 3.0 one-sided communication interface.
www.mcs.anl.gov/uploads/cels/papers/P4014-0113.pdf.
15 Global address space programming interface. http://www.gaspi.de/.
16 P. N. Hilfinger, D. O. Bonachea, K. Datta, D. Gay, S. L. Graham, B. R. Liblit, G. Pike,
J. Zh. Su, and K. A. Yelick. Titanium language reference manual, version 2.19. Technical
Report UCB/EECS-2005-15, UC Berkeley, 2005.
17 D. Kozen. Lower bounds for natural proof systems. In FOCS, pages 254–266. IEEE, 1977.
18 L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess
programs. IEEE Transactions on Computers, 28(9):690–691, 1979.
19 R. Machado and C. Lojewski. The Fraunhofer virtual machine: a communication library
and runtime system based on the RDMA model. Computer Science — Research and
Development, 23(3-4):125–132, 2009.
20 A. Muzahid, S. Qi, and J. Torrellas. Vulcan: Hardware support for detecting sequential
consistency violations dynamically. In MICRO, pages 363–375. IEEE, 2012.
21 J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for
distributed array libraries and compiler run-time systems. In Parallel and Distributed
Processing, volume 1586 of LNCS, pages 533–546. Springer, 1999.
22 The UPC NAS parallel benchmarks. http://upc.gwu.edu/download.html.
23 R. W. Numrich and J. Reid. Co-array Fortran for parallel programming. In ACM Sigplan
Fortran Forum, volume 17, pages 1–31. ACM, 1998.
24 S. Owens. Reasoning about the implementation of concurrency abstractions on x86-TSO.
In ECOOP, volume 6183 of LNCS, pages 478–503. Springer, 2010.
25 C.-S. Park, K. Sen, P. Hargrove, and C. Iancu. Efficient data race detection for distributed
memory parallel programs. In SC’11, page 51. ACM, 2011.
26 D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share
memory. ACM TOPLAS, 10(2):282–312, 1988.
