Partitioned global address space (PGAS) is a parallel programming model for the development of high-performance applications on clusters. It provides a global address space partitioned among the cluster nodes, and is supported in programming languages like C, C++, and Fortran by means of APIs. In this paper we provide a formal model for the semantics of single instruction, multiple data programs using PGAS APIs. Our model reflects the main features of popular real-world APIs such as SHMEM, ARMCI, GASNet, GPI, and GASPI.
remote direct memory access (RDMA), and increases performance of PGAS programs by avoiding unnecessary synchronization between the sender and the receiver [18, 13].
However, the use of one-sided communication introduces additional non-determinism in the ordering of memory reads and writes, and makes reasoning about program correctness harder. Figure 1 (a) demonstrates a subtle bug arising from improper synchronization: while the barriers ensure that all processes are at the same control location, the remote writes may or may not have completed when address y is accessed after the barrier.
We make two contributions in this paper. First, we provide a core calculus of PGAS APIs that models concurrent processes sharing a global address space and accessing remote memory through one-sided reads and writes. Despite the popularity of PGAS APIs in the high-performance computing community, to the best of our knowledge, there are no formal models for common PGAS APIs.
Second, we define and study a correctness criterion called robustness for PGAS programs. To understand robustness, we begin with a classical and intuitive correctness condition, sequential consistency [17]. A computation is sequentially consistent if its memory accesses happen atomically and in the order in which they are issued. Sequential consistency is too strong a criterion for PGAS programs, where time is required to access remote memory and accesses themselves can be reordered. Robustness is the weaker notion that all computations of the program have the same happens-before (data and control) dependencies [25] as some sequentially consistent computation. Our notion of robustness captures common programming error patterns [12, 19], and is derived from a similar notion in shared memory multiprocessing [25]. Related correctness criteria have been proposed for weak memory models [7, 23, 2, 3, 6, 8, 5].
A simpler correctness property would be data race freedom (DRF), in which no two processes access the same address at the same time, with at least one access being a write [1]. Indeed, data race free programs are sequentially consistent. Unfortunately, DRF is too strong a requirement in practice [24], and leads to numerous false alarms. Many common synchronization idioms for PGAS programs, such as producer-consumer synchronization, and many concurrent data structure implementations, contain benign data races. Instead, the notion of robustness captures the intuitive requirement that, even when events are reordered in a computation, there are no causality cycles. Our notion of causality is the standard happens-before relation from [25].
We study the algorithmic verification of robustness. Our main result is that robustness is decidable (actually PSpace-complete) for PGAS programs, assuming a finite data domain and finite memory. Note that our model of PGAS programs is infinite-state even when the data domain is finite: one-sided communication allows unboundedly many requests to be in flight simultaneously (a feature modeled in our formalism using unbounded queues).
Our decidability result uses two technical ingredients. First, we show that among all computations violating robustness, there is always one in a certain normal form. The normal form partitions the violating computation into phases: the first phase initiates memory reads and writes, and the latter phases complete the reads and writes in the same order in which they were initiated.
Second, we provide an algorithm to detect violating computations in this normal form. We take a language-theoretic view, and introduce a multiheaded automaton model which can accept violating computations in normal form. Then the problem of checking robustness reduces to checking emptiness for multiheaded automata. Interestingly, since the normal form maintains orderings of accesses, the multiple heads can be exploited to accept violating computations without explicitly modeling unbounded queues of memory access requests. The resulting class of languages contains non-context-free ones (such as $a^n b^n c^n$), but retains sufficient decidability properties. Altogether this yields a PSpace decision procedure for checking robustness of programs using PGAS APIs.
For lack of space, full constructions and proofs are given in the appendix.
Related Work
Although PGAS APIs are popular in the high-performance computing community [4, 9, 14, 18, 20], to the best of our knowledge, no previous work provides a unifying formal semantics that incorporates one-sided asynchronous communication. As for synchronization correctness, only recently Park et al. proposed a testing framework for data race detection and implemented it for the UPC language [24]. However, the authors argue that many data races are actually not harmful, a claim they support through the analysis of the NAS Parallel Benchmarks [21]. For this reason, in contrast to data race freedom [1], we consider robustness as a more precise notion of appropriate synchronization. The robustness problem was posed by Shasha and Snir [25] for shared memory multiprocessing. They showed that computations that are not sequentially consistent have a happens-before cycle. Alglave and Maranget [2, 3] extended this result. They developed a general theory for reasoning about robustness problems, even among different architectures. Owens [23] proposed a notion of appropriate synchronization that is based on triangular data races. Compared to robustness, triangular race freedom requires heavier synchronization, which is undesirable for performance reasons.
We consider here the algorithmic problem of checking robustness. For programs running on weak memory models the problem has been addressed in [7, 8, 3], but none of these works provides a (sound and complete) decision procedure. The first complete algorithm for checking robustness of programs running on Total Store Ordering (TSO) architectures was given in [6]. It is based on the following locality property: if a TSO program is not robust, then there is a violating computation where only one process delays commands. This insight leads to a reduction of robustness to reachability in the sequential consistency model [5]. PGAS programs allow more reorderings than TSO ones and, as a consequence, locality does not hold. Instead, our decision procedure relies on a complex normal form for computations and on a sophisticated automata-theoretic algorithm to look for normal-form violations.
PGAS Programs

Features of PGAS Programs
PGAS programs are single instruction, multiple data programs running on a cluster (see Figure 1 (b)). At run time, a PGAS program consists of multiple processes executing the same code on different nodes. Each process has a rank, which is the index of the node it runs on. The processes can access a global address space partitioned into local address spaces for each process. Local addresses can be accessed directly. Remote addresses (addresses belonging to different processes) are accessed using API calls, which come in different flavors. SHMEM [9] provides synchronous remote reads where the invoking process waits for completion of the command. Remote write commands are asynchronous, and no ordering is guaranteed between writes, even to the same remote node. The ordering can, however, be enforced by a special fence command.
ARMCI [20] features synchronous as well as asynchronous read and write commands. The asynchronous variants of the commands return a handle that can be waited upon. When the wait on a read handle is over, the data being read has arrived and is accessible. When the wait on a write handle is over, the data being written has been sent to the network but might not have reached its destination. Unlike operations to different nodes, operations to the same remote node are executed in their issuing order.
GASNet [4], like ARMCI, provides both synchronous and asynchronous versions of reads and writes. Commands return a handle that can be waited upon, and a return from a wait implies full completion of the operation. The order in which asynchronous operations complete is intentionally left unspecified.
GPI [18] and GASPI [14] only support asynchronous read and write commands. Each read or write operation is assigned a queue identifier. In GPI, operations with the same queue id and to the same remote node are executed in the order in which they were issued; in GASPI this guarantee does not hold. One can wait on a queue id, and the wait returns when all commands in the queue are fully completed, on both the local and the remote side.
Summing up, in a uniform PGAS programming model it should be possible to perform synchronous and asynchronous data transfers, assign an asynchronous operation a handle or a queue id, wait for completion of an individual command or of all commands in a given queue, and enforce ordering between operations. We define a core model for PGAS that supports all these features. Our model only uses asynchronous remote reads and writes with explicit queues, but is flexible enough to accommodate all the above idioms.
Syntax of PGAS Programs
We define PGAS programs and their semantics in terms of automata. A (non-deterministic) automaton is a tuple A = (S, Σ, ∆, s0, F), where S is a set of states, Σ is a finite alphabet, ∆ ⊆ S × (Σ ∪ {ε}) × S is a set of transitions, s0 ∈ S is an initial state, and F ⊆ S is a set of final states. We call the automaton finite if the set of states is finite. We write $s_1 \xrightarrow{a} s_2$ if (s1, a, s2) ∈ ∆, and extend the relation to computations σ ∈ Σ* in the expected way. The language of the automaton is $L(A) := \{\sigma \in \Sigma^* \mid s_0 \xrightarrow{\sigma} s \text{ for some } s \in F\}$.
[...] assignments and conditionals (assume), remote read and write API calls read and write respectively, and barriers barrier.
At run time, there is a process on each node in [1, N] that executes program P, where [M, N] := {M, M + 1, ..., N}. We will identify each process with its rank from RNK := [1, N]. For modeling purposes, one may assume there are special constant expressions that let a process learn about its rank in RNK and about the total number of processes N.
Semantics of PGAS Programs
The semantics of a PGAS program (P, N) is defined using a state-space automaton X(P, N) := (S_X, E, ∆_X, s_0X, F_X). A state s ∈ S_X is a tuple s = (st, m, fa, fb), where state configuration st : RNK → Q maps each process to its current control state, memory configuration m : RNK × (REG ∪ ADR) → DOM maps each process to the values stored in each register and at each address, queue configuration fa : RNK × QUE → (RNK × ADR × RNK × ADR)* maps each process to remote read and write requests that were issued, and fb : RNK × QUE → (RNK × ADR × DOM)* contains values to be transferred.
The initial state is s 0X := (st 0 , m 0 , fa 0 , fb 0 ), where for all ranks r ∈ RNK, registers and addresses a ∈ REG ∪ ADR, and queue identifiers q ∈ QUE, we have st 0 (r) := q 0 , m 0 (r, a) := 0, and fa 0 (r, q) := ε =: fb 0 (r, q). The set of final states is F X := {(st, m, fa, fb) ∈ S X | fa(r, q) = ε = fb(r, q) for all r ∈ RNK, q ∈ QUE}. The semantics of commands ensures queues can always be emptied, so acceptance with empty queues is not a restriction.
The alphabet of X(P, N) is the set of events E := K × RNK × ((RNK × ADR) ∪ {⊥}) with event kinds K := {load, store, assign, assume, read, write, popa, popb, bar}. Consider an event e = (k, r, (r_a, a)) ∈ E. We use kind(e) = k to determine the kind of the event, rank(e) = r for the rank of the process that produced the event, and addr(e) = (r_a, a) to obtain the rank and the address that are accessed by the event. If kind(e) ∈ {load, popa}, then e is said to be a read of (r_a, a). If kind(e) ∈ {store, popb}, then e is a write of address addr(e).
Table 1 shows a subset of the transition relation ∆_X; other rules are similar. When a process executes a remote write command, Rule (write), a new item is added to a queue in fa. This item contains the source rank and source address from which the data will be copied, together with the destination rank and destination address to which the data will be copied. Eventually, the item is popped from the queue in fa, Rule (popa), the value is read from the source address, and a new item is pushed into the corresponding queue in fb. The new item contains the destination rank and destination address, and the value that was read from the source address. Eventually, this item is popped from the queue, Rule (popb), and the value is written to the destination address in the destination rank. Modeling two queue configurations yields a symmetry between remote writes and reads: a read can be interpreted as a write that comes upon request. Moreover, two queue configurations capture well the delays between request creation, reading of the data, and writing of the data.
Table 1. Transition rules for X(P, N), given $q_1 \xrightarrow{cmd} q_2$ and current state s = (st, m, fa, fb) with st(r) = q1. We set st' := st[r := q2] to update st so that process r is at q2. Expressions are evaluated in memory configuration m.
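The three-step life cycle of a remote write described above (write, popa, popb) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `Cluster`, its method names, and the use of a dictionary with default value 0 for memory are hypothetical and not part of the formal model; control states and registers are omitted.

```python
from collections import deque

class Cluster:
    """Toy version of the components (m, fa, fb) of a state; hypothetical helper."""

    def __init__(self):
        self.mem = {}  # (rank, address) -> value; missing entries read as 0
        self.fa = {}   # (rank, queue) -> deque of (src_rank, src_addr, dst_rank, dst_addr)
        self.fb = {}   # (rank, queue) -> deque of (dst_rank, dst_addr, value)

    def load(self, rank, addr):
        return self.mem.get((rank, addr), 0)

    def store(self, rank, addr, value):
        self.mem[(rank, addr)] = value

    def write(self, rank, queue, src_addr, dst_rank, dst_addr):
        # Rule (write): only a request is enqueued in fa, no data moves yet.
        item = (rank, src_addr, dst_rank, dst_addr)
        self.fa.setdefault((rank, queue), deque()).append(item)

    def read(self, rank, queue, src_rank, src_addr, dst_addr):
        # A read is a write that comes upon request: remote source, local destination.
        item = (src_rank, src_addr, rank, dst_addr)
        self.fa.setdefault((rank, queue), deque()).append(item)

    def popa(self, rank, queue):
        # Rule (popa): the value is read from the source and moved to fb.
        src_rank, src_addr, dst_rank, dst_addr = self.fa[(rank, queue)].popleft()
        value = self.load(src_rank, src_addr)
        self.fb.setdefault((rank, queue), deque()).append((dst_rank, dst_addr, value))

    def popb(self, rank, queue):
        # Rule (popb): the value finally reaches the destination address.
        dst_rank, dst_addr, value = self.fb[(rank, queue)].popleft()
        self.store(dst_rank, dst_addr, value)

c = Cluster()
c.store(1, "x", 7)
c.write(1, 0, "x", 2, "y")   # process 1 asks to copy its x to y at process 2
c.popa(1, 0)                 # value 7 is picked up at the source
stale = c.load(2, "y")       # popb has not happened: process 2 still sees 0
c.popb(1, 0)
fresh = c.load(2, "y")       # now the write is visible
```

The gap between `stale` and `fresh` is the delay that the two queue configurations are meant to capture: a load placed between popa and popb observes the old value.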
The semantics of a PGAS program (P, N) is $C(P, N) := L(X(P, N)) \subseteq E^*$, the set of computations of the state-space automaton.
Example 1.
Consider PGAS program (1to1, 2) with the program code from Figure 1 (a) being run on two nodes. It has the following computation:
Bold events belong to the process with rank 2, the other events to the process with rank 1. We have addr(popa) = (1, x), addr(popb) = (2, y). Symmetrically, addr(popa) = (2, x) and addr(popb) = (1, y). The assert in Figure 1 is a shortcut for a combination of load and assume, and in this computation addr(load) = (1, y).
Simulating PGAS APIs
Our formalism natively supports asynchronous data transfers and queues. Operations in the same queue are completed in the order in which they were issued. Using this, we can model the ordering guarantees given by ARMCI and GPI by putting ordered operations into the same queue.
To model waiting on individual operations (waiting on a handle), we associate a shadow memory address with each operation. Before issuing the operation, the value at this address is set to 0. When the operation has been issued, the process sends to the same queue a read request which overwrites the shadow memory to 1. Now waiting on the individual operation can be implemented by polling on the shadow address associated with the operation. Waiting on all operations in a given queue is done similarly. Synchronous data transfers are modeled by asynchronous transfers, immediately followed by a wait.
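The shadow-address encoding of a handle can be illustrated with a small simulation. It is a sketch under simplifying assumptions: a single FIFO queue, popa and popb merged into one completion step, and a remote cell `(2, "one")` that holds the constant 1 for the flag-setting read. All function and address names are hypothetical.

```python
from collections import deque

memory = {}       # (rank, address) -> value; missing entries read as 0
queue = deque()   # one FIFO queue of pending transfers (source, destination)

def issue(src, dst):
    """Enqueue an asynchronous transfer from src to dst."""
    queue.append((src, dst))

def complete_one():
    """The network completes the oldest pending transfer (popa and popb merged)."""
    src, dst = queue.popleft()
    memory[dst] = memory.get(src, 0)

def write_with_handle(src, dst, shadow, one):
    memory[shadow] = 0   # reset the shadow address before issuing
    issue(src, dst)      # the actual remote write
    issue(one, shadow)   # read request in the same queue that sets the shadow to 1
    return shadow        # the shadow address plays the role of the handle

def wait(handle):
    # Poll the shadow address; each poll lets the network make one step.
    while memory.get(handle, 0) != 1:
        complete_one()

memory[(1, "x")] = 42
memory[(2, "one")] = 1
h = write_with_handle((1, "x"), (2, "y"), shadow=(1, "h"), one=(2, "one"))
before = memory.get((2, "y"), 0)   # write still pending
wait(h)
after = memory[(2, "y")]           # write completed before the flag was set
```

Because both requests travel through the same queue, the flag can only become 1 after the data transfer in front of it has completed, which is the ordering the encoding relies on.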
Robustness: A Notion of Appropriate Synchronization
We now define robustness, a correctness condition for PGAS programs. Robustness is a weaker criterion than requiring all computations to be sequentially consistent [17]: it allows for reordering of events as long as there are no causality cycles. As causality relation, we adopt the happens-before relation [25]. Fix a computation τ ∈ C(P, N). Its happens-before relation is the union of the three relations we define next: $\to_{hb}(\tau) := \to_{po} \cup \to_{cf} \cup \leftrightarrow$.
The program order relation →po is the union of the program order relations for all processes: $\to_{po} := \bigcup_{r \in RNK} \to^{r}_{po}$. Relation $\to^{r}_{po}$ gives the order in which events were issued in process r. Formally, let τ' be the subsequence of all events e in τ such that rank(e) = r and kind(e) ∉ {popa, popb}. Then $\to^{r}_{po} := succ(\tau')$.
The conflict relation →cf orders conflicting accesses to the same address. Let τ = α · e1 · β · e2 · γ, where e1 and e2 access the same address, and at least one of them is a write: addr(e1) = addr(e2) = (r, a), and kind(e1) ∈ {store, popb} or kind(e2) ∈ {store, popb}. If there is no e ∈ β such that addr(e) = (r, a) and kind(e) ∈ {store, popb}, then e1 →cf e2.
The identity relation ↔ identifies events corresponding to the same command. Let e be a remote read or write event, kind(e) ∈ {read, write}, and e 1 and e 2 be the corresponding requests, kind(e 1 ) = popa and kind(e 2 ) = popb. Then we have e ↔ e 1 ↔ e 2 . In a similar way, ↔ identifies matching barrier events in different processes.
We say a computation τ is violating if the associated happens-before relation contains a non-trivial cycle, i.e., a cycle that is not included in ↔. Violating computations violate sequential consistency. The robustness problem amounts to proving the absence of violations.
ROB
Given a program (P, N ), show that no computation τ ∈ C(P, N ) is violating.
Example 2. The happens-before relation of computation τ 1to1 is depicted in Figure 3 . It is cyclic, therefore τ 1to1 is violating and (1to1, 2) is not robust. Indeed, no sequentially consistent execution of 1to1 allows the assert statements to load the initial value of y.
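To make the definition of violating computations concrete, the following sketch computes program order and conflict edges for a computation, collapses ↔-related events into one node, and searches the resulting graph for a cycle. Several points are assumptions: events are encoded as triples `(kind, rank, addr)` with `addr = None` for ⊥, the identity relation is passed in explicitly as sets of positions, and the sequence `tau_1to1` is a reconstruction from the kinds listed in Example 9 with an assumed attribution of events to ranks. The function names are hypothetical.

```python
from collections import defaultdict

POPS = {"popa", "popb"}
WRITES = {"store", "popb"}
ACCESSES = {"load", "popa", "store", "popb"}

def hb_edges(tau):
    """Program order and conflict edges as pairs of positions in tau."""
    edges, last = set(), {}
    for i, (kind, rank, _) in enumerate(tau):      # program order, pops excluded
        if kind in POPS:
            continue
        if rank in last:
            edges.add((last[rank], i))
        last[rank] = i
    for i, (k1, _, a1) in enumerate(tau):          # conflict relation
        if k1 not in ACCESSES:
            continue
        for j in range(i + 1, len(tau)):
            k2, _, a2 = tau[j]
            if k2 not in ACCESSES or a2 != a1:
                continue
            if k1 in WRITES or k2 in WRITES:
                edges.add((i, j))
            if k2 in WRITES:                       # later events see this write in between
                break
    return edges

def is_violating(tau, identity_classes):
    """True iff the happens-before relation has a cycle not included in <->."""
    cls = list(range(len(tau)))
    for c in identity_classes:                     # collapse <->-related events
        for i in c:
            cls[i] = min(c)
    graph = defaultdict(set)
    for i, j in hb_edges(tau):
        graph[cls[i]].add(cls[j])
    color = defaultdict(int)                       # 0 = new, 1 = on stack, 2 = done
    def dfs(u):
        color[u] = 1
        for v in graph[u]:
            if color[v] == 1 or (color[v] == 0 and dfs(v)):
                return True
        color[u] = 2
        return False
    return any(color[u] == 0 and dfs(u) for u in list(graph))

tau_1to1 = [("write", 1, None), ("write", 2, None),
            ("popa", 1, (1, "x")), ("popa", 2, (2, "x")),
            ("bar", 1, None), ("bar", 2, None),
            ("load", 1, (1, "y")),
            ("popb", 1, (2, "y")), ("popb", 2, (1, "y"))]
ident_1to1 = [{0, 2, 7}, {1, 3, 8}, {4, 5}]

# A reordering in which every pop directly follows its write.
sigma = [("write", 1, None), ("popa", 1, (1, "x")), ("popb", 1, (2, "y")),
         ("write", 2, None), ("popa", 2, (2, "x")), ("popb", 2, (1, "y")),
         ("bar", 1, None), ("bar", 2, None), ("load", 1, (1, "y"))]
ident_sigma = [{0, 1, 2}, {3, 4, 5}, {6, 7}]

violating = is_violating(tau_1to1, ident_1to1)
consistent = is_violating(sigma, ident_sigma)
```

In `tau_1to1` the load conflicts with the later popb to (1, y), which closes the cycle through the barrier; in `sigma` the same popb precedes the load and the relation is acyclic.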
Our main result is the following.
Theorem 3. ROB is PSpace-complete.
The PSpace lower bound follows from PSpace-hardness of control state reachability in sequentially consistent programs [16]. To reduce to robustness, we add an artificial happens-before cycle starting in the target control state. The rest of the paper shows a PSpace algorithm, and hence upper bound, for the problem.
Normal-Form Violations
We show that a PGAS program is not robust if and only if it has a violating computation of the following normal form.
Definition 4.
Computation τ = τ1 · τ2 · τ3 · τ4 ∈ C(P, N) is in normal form if all e ∈ τ2 · τ3 · τ4 satisfy kind(e) ∈ {popa, popb} and for all a, b ∈ τ1 with kind(a), kind(b) ∉ {popa, popb} and all a', b' ∈ τi with i ∈ [1, 4] we have:
We explain the normal-form requirement (NF). Consider two accesses a and b to remote processes that can be found in the first part of the computation τ1. Assume corresponding pop events a' and b' are delayed and can both be found in a later part of the computation, say τ2. Then the ordering of a' and b' in τ2 coincides with the order of a and b in τ1. Computation τ_1to1 is not in normal form whereas τ^nf_1to1 in Figure 4 is. The following theorem guarantees that, in case of non-robustness, normal-form violations always exist.
Theorem 5. A PGAS program is not robust if and only if it has a violating computation in normal form.
Phrased differently, to decide robustness our procedure should look for normal-form violations. The remainder of the section is devoted to proving Theorem 5. We make use of the following property of PGAS programs: every computation contains an event that can be deleted, in the sense that the result is again a computation of the program.
Lemma 6 (Cancellation).
Consider ε ≠ τ ∈ C(P, N). There is an event e ∈ τ so that τ \ e ∈ C(P, N). Computation τ \ e is defined to remove e and all ↔-related events from τ.
Proof.
Take as e the last event in τ with kind(e) ∉ {popa, popb}. All events to the right of e are pop events, which are unconditionally executable. Moreover, e does not have →po-successors in τ. Therefore, the resulting computation τ \ e is in C(P, N).
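The choice of e in this proof is easy to replay on an example. The sketch below assumes that e is the last event that is not a pop (as in Example 9, where e = load), encodes events as pairs `(kind, rank)`, and takes the identity relation as explicit sets of positions. The function name `cancel` and the rank attribution in the sample computation are hypothetical.

```python
POPS = {"popa", "popb"}

def cancel(tau, identity_classes):
    """Return the position of e and the computation tau \\ e."""
    # e is the last event that is not a pop; everything to its right is a pop.
    last = max(i for i, (kind, _) in enumerate(tau) if kind not in POPS)
    removed = {last}
    for cls in identity_classes:   # also drop all <->-related events
        if last in cls:
            removed |= cls
    return last, [ev for i, ev in enumerate(tau) if i not in removed]

tau_1to1 = [("write", 1), ("write", 2), ("popa", 1), ("popa", 2),
            ("bar", 1), ("bar", 2), ("load", 1), ("popb", 1), ("popb", 2)]
ident_1to1 = [{0, 2, 7}, {1, 3, 8}, {4, 5}]

position, shorter = cancel(tau_1to1, ident_1to1)
kinds = [kind for kind, _ in shorter]
```

On this input the load is removed and the remaining kinds are write · write · popa · popa · bar · bar · popb · popb, matching the decomposition τ \ e = τ1 · τ2 given in Example 9.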
A PGAS program is not robust if and only if it has a violating computation τ of minimal length. Let e ∈ τ be the event determined by Lemma 6. If kind(e) ∉ {read, write}, then τ = τ1 · e · τ2. Otherwise, τ = τ1 · e · τ2 · e' · τ3 · e'' · τ4 with e ↔ e' ↔ e''. Consider the latter case where τ \ e = τ1 · τ2 · τ3 · τ4. Since |τ \ e| < |τ|, the new computation is not violating and →hb(τ \ e) is acyclic. This acyclicity guarantees we find a computation σ ∈ E* with the same happens-before relation as τ \ e and where pop events directly follow their remote accesses. Intuitively, σ is a sequentially consistent computation corresponding to τ \ e.
Lemma 7 ([25]). There is σ ∈ C(P, N) with →hb(σ) = →hb(τ \ e) and σ = σ1 · e1 ... en · σ2 for all e1 ↔ ... ↔ en.
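We now use σ to rearrange the events in τ \ e and guarantee the normal-form requirement. The idea is to project σ to the events in τ1 to τ4. Reinserting e yields a normal-form violation: $\tau^{nf} := (\sigma \downarrow \tau_1) \cdot e \cdot (\sigma \downarrow \tau_2) \cdot e' \cdot (\sigma \downarrow \tau_3) \cdot e'' \cdot (\sigma \downarrow \tau_4)$.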
The following lemma concludes the proof of Theorem 5.
Lemma 8. τ^nf ∈ C(P, N), →hb(τ^nf) = →hb(τ), and τ^nf is in normal form.
Example 9.
Computation τ 1to1 in Example 1 is a shortest violation. The event determined by Lemma 6 is e = load. Therefore, τ \ e = τ 1 · τ 2 with τ 1 = write · write · popa · popa · bar · bar and τ 2 = popb · popb.
A sequentially consistent computation corresponding to τ \ e is σ = write · popa · popb · write · popa · popb · bar · bar.
The normal-form violation τ nf 1to1 is depicted in Figure 4 . Note that τ nf 1to1 is indeed in C(1to1, 2). Moreover, popa and popa immediately follow write and write, respectively. Similarly, the popb and popb events in the second part of the computation respect the order of write and write in the first part of the computation. This means, (NF) holds.
From Normal-Form Violations to Language Emptiness
We now reduce checking the absence of normal-form violations to the emptiness problem in a suitable automaton model. We introduce multiheaded automata and construct, for each program (P, N ), a multiheaded automaton accepting all normal-form computations. To verify robustness, we check that the intersection of this automaton with regular languages accepting cyclic happens-before relations is empty.
Multiheaded Automata
Multiheaded automata are an extension of finite automata. Intuitively, instead of generating just a single computation, they generate several computations in one pass, each by a separate head. The language of the multiheaded automaton then consists of the concatenations of the computations generated by each head. Syntactically, an n-headed finite automaton over alphabet Σ is a finite automaton that uses the extended alphabet [1, n] × Σ. So we have A = (S, [1, n] × Σ, ∆, s0, F). The semantics, however, is different from finite automata. Given σ ∈ ([1, n] × Σ)*, we use σ ↓ k to project σ to the letters (k, a), and afterwards cut away the index k.
Multiheaded automata are closed under regular intersection, and emptiness is decidable in non-deterministic logarithmic space. Indeed, checking emptiness reduces to finding a path from an initial to a final node in a directed graph.
Lemma 10. Consider an n-headed automaton U and a finite automaton V over a common alphabet Σ. There is an n-headed automaton W with L(W ) = L(U ) ∩ L(V ).
Lemma 11. Emptiness for n-headed automata is NL-complete.
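As an illustration of the model, the following sketch implements an n-headed automaton whose words are the concatenations σ ↓ 1 · ... · σ ↓ n of accepting runs σ, together with an emptiness check by graph reachability. The class name, the bounded enumeration `words`, and the three-state example are hypothetical and chosen only to show how a non-context-free language such as $a^n b^n c^n$ arises from three heads.

```python
from collections import deque

class MultiheadedAutomaton:
    def __init__(self, heads, transitions, initial, finals):
        # transitions: triples (state, label, state); label is (head, letter) or None for epsilon
        self.heads, self.initial, self.finals = heads, initial, set(finals)
        self.succ = {}
        for source, label, target in transitions:
            self.succ.setdefault(source, []).append((label, target))

    def words(self, max_steps):
        """Words generated by accepting runs with at most max_steps transitions."""
        result = set()
        frontier = [(self.initial, ("",) * self.heads)]
        for _ in range(max_steps + 1):
            following = []
            for state, parts in frontier:
                if state in self.finals:
                    result.add("".join(parts))    # concatenate the heads' outputs
                for label, target in self.succ.get(state, []):
                    if label is None:
                        following.append((target, parts))
                    else:
                        head, letter = label
                        updated = (parts[:head - 1] + (parts[head - 1] + letter,)
                                   + parts[head:])
                        following.append((target, updated))
            frontier = following
        return result

    def is_empty(self):
        """Emptiness as reachability of a final state from the initial state."""
        seen, todo = {self.initial}, deque([self.initial])
        while todo:
            state = todo.popleft()
            if state in self.finals:
                return False
            for _, target in self.succ.get(state, []):
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
        return True

# Each round appends one a to head 1, one b to head 2, and one c to head 3.
abc = MultiheadedAutomaton(
    heads=3,
    transitions=[("q0", (1, "a"), "q1"), ("q1", (2, "b"), "q2"), ("q2", (3, "c"), "q0")],
    initial="q0",
    finals={"q0"},
)
generated = abc.words(6)
empty = abc.is_empty()
```

The emptiness check ignores the labels altogether, which is why it reduces to finding a path in a directed graph.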
Multiheaded automata are incomparable with context-free grammars, and indeed the normal-form computations of a program may be non-context-free.
Multiheaded automata can be understood as a restriction of matrix grammars [11]. In matrix grammars, productions simultaneously rewrite multiple non-terminals. Roughly, each production can be understood as a Petri net transition, and emptiness is decidable as Petri net reachability is. Since we target a PSpace result, matrix grammars are too expressive for our purposes.
Detecting Normal-Form Computations
We define a 4-headed automaton Y(P, N) that accepts the normal-form computations in C(P, N). In order to accept τ1, the new automaton tracks the control and memory configurations in the way X(P, N) does. For the remainder of the computation, these configurations are not needed. Indeed, τ2 to τ4 only consist of popa and popb events that are executable independently of the control and memory configurations. However, Y(P, N) has to take care of the ordering of popa and popb events from the same queue. In particular, if e1 handles a request issued before the request of e2 with kind(e1) = kind(e2), then it cannot be the case that e1 ∈ τj and e2 ∈ τi with i < j.
Guided by this discussion, we define a state s ∈ S_Y as a tuple s := (st, m, pa, pb). The state and memory configurations st and m are defined as in Section 2. They reflect the state of the program after it has generated a prefix of τ1. The functions pa, pb : RNK × QUE → [1, 4] give, for each process and each queue, the part τ1 to τ4 of the computation where the next popa resp. popb event will be generated. The initial state is s0 := (st0, m0, pa0, pb0) with pa0(r, q) := 1 =: pb0(r, q) for all r ∈ RNK and q ∈ QUE.
The transition relation ∆ Y is the smallest relation defined by the rules in the Tables 2  and 4 . Rule (gpa ) lets the automaton choose the part of the computation to which the next popa event will be appended. The first restriction is that the index of the part can only increase, as events from the same queue are processed in order. The second restriction is that popa events cannot be generated to the right of popb events from the same queue. Rule (gpb ) is the similar rule for popb events.
By Rule (write ), the automaton appends a write event to τ 1 and the corresponding popa and popb events in one shot to the parts determined by pa and pb. Since a single transition of a multiheaded automaton can generate at most one letter, the rule makes use of intermediary states from S aux Y . If popb is added to τ 1 , the memory configuration is updated accordingly. Note that the generation in one shot causes pop events within the same part τ i to follow in the order of the corresponding read/write events in τ 1 . Fortunately, this is always the case in normal-form computations by (NF). Computations that are not in normal form, e.g. τ 1to1 , cannot be generated by Y (P, N ) .
The intermediary states from S^aux_Y are not included in the set of final states of Y(P, N), to forbid computations with pending remote requests.
Lemma 12. {τ ∈ C(P, N) | τ is in normal form} = L(Y(P, N)).
Detecting Violations
The multiheaded automaton accepts all normal form computations, and we would like to check if one of those computations is violating. In general, violating computations can contain complicated cycles in the happens-before relation. However, we now show that whenever a computation has a happens-before cycle, it has a cycle in which each process is entered and left at most once. Our algorithm for robustness will look for happens-before cycles of this special form that, as we will show, can be captured by a regular language.
Lemma 13. Computation τ ∈ C(P, N ) is violating iff there is a cycle
where rank( over the alphabet E × M. It checks whether there is a conflict or identity edge from the leave-marked event of process r i to the enter-marked event of process r i+1 . Consider the case of conflicts. The automaton looks for a marked event (e i , m i ) with rank(e i ) = r i marked by leave ∈ m i . It remembers the kind and the address of this event. Then, it seeks a marked event (e i+1 , m i+1 ) with rank(e i+1 ) = r i+1 marked by enter ∈ m i+1 . If both events are found, they touch the same address, and one of them is a write, the automaton reaches the accepting state. Since finite automata are closed under intersection, we can define the finite automaton of cycle type cyc as
We can now prove Theorem 3. To check whether (P, N) is robust, we go over all cycle types cyc = r1 ... rk. This enumeration of cycle types can be done in space that is polynomial in N. For each such sequence, we check whether L(W) = ∅, where W is the multiheaded automaton obtained via Lemma 10 by intersecting the automaton for normal-form computations with the finite automaton of cycle type cyc. Since the size of W is exponential in the size of (P, N) and emptiness is in NL by Lemma 11, deciding L(W) = ∅ can be done in space that is polynomial in the size of (P, N). This shows robustness is in PSpace.
Table 3. Transition rules for X(P, N), given $q_1 \xrightarrow{cmd} q_2$ and current state s = (st, m, fa, fb) with st(r) = q1. We set st' := st[r := q2] to update st so that process r is at q2. Expressions are evaluated in memory configuration m.
A Missing Proofs
For some of the following proofs, we assume that Table 3 and Table 4 associate with each event e the transition in the program that produced this event: instr(e). Also, for a read, write, popa, or popb event we write que(e) to denote the id of the queue being modified by this event e.
Proof of Lemma 8.
To relieve the reader from the burden of syntax, we consider the case when τ \e = τ 1 ·τ 2 . We start with the program order. Let e 1 , e 2 ∈ τ 1 with e 1 → po e 2 in τ and, consequently, in τ \ e. By definition of σ, we have e 1 → po e 2 in σ. Since σ ↓ τ 1 contains e 1 and e 2 and does not add events between them, e 1 → po e 2 holds for σ ↓ τ 1 and, consequently, τ nf . Assume e 1 ∈ τ 1 and e 2 ∈ τ 2 with e 1 → po e 2 in τ and in τ \ e. Then e 1 is the rightmost element in τ 1 with its rank that is different from a pop. Similarly, e 2 is the leftmost element in τ 2 with its rank and different from a pop. The same is valid for their positions in σ ↓ τ 1 and σ ↓ τ 2 , which leads to e 1 → po e 2 in τ nf . The case when e 1 ∈ τ 1 and e 2 = e is similar. Since τ and τ nf consist of the same events, the cardinalities of the respective → po relations are equal, and the above inclusion already means the program orders in both computations are equal. Now we consider the conflict relation. Let e 1 , e 2 ∈ τ 1 with e 1 → cf e 2 in τ and hence in τ \ e. By definition of σ, we have e 1 → cf e 2 in σ. Since σ ↓ τ 1 contains e 1 and e 2 and does not add new actions between them, e 1 → cf e 2 holds for σ ↓ τ 1 and, consequently, for τ nf .
Assume e1, e2 ∈ τ1 and e1 ↛cf e2 in τ. One option is that e1 and e2 do not access the same address or both are reads. Then they still will not conflict in τ^nf. The other option is that e1 →cf e3 in τ, where e3 is a write to addr(e1) = addr(e2) that is located between e1 and e2 in τ1. Then, as already proven, e1 →cf e3 will hold in τ^nf. Consequently, e1 →cf e2 will not hold in τ^nf. The case when e1, e2 ∈ τ2 is similar.
Assume e1 ∈ τ1, e2 ∈ τ2, and e1 →cf e2 in τ. Then, e is not a write to addr(e1) = addr(e2), and e1 →cf e2 in τ \ e. Note that σ ↓ τ1 does not contain a write to addr(e1) to the right of e1. Otherwise, τ1 would contain a write e3 to addr(e1), and e1 →+cf e3, which contradicts e1 →cf e2 in τ. With a similar argument, σ ↓ τ2 does not contain a write to addr(e1) to the left of e2. Therefore, e1 →cf e2 in τ^nf.
Assume e1 ∈ τ1, e2 ∈ τ2, and e1 ↛cf e2 in τ. The proof of e1 ↛cf e2 in τ^nf is as in the case when e1, e2 ∈ τ1.
The case when e1 = e or e2 = e is no harder.
The formal definition of the identity relation takes a computation α and determines the three projections α ↓ {write, read}, α ↓ popa, and α ↓ popb. The identity relation then relates the i-th elements in these projections. To show that the identity relations in τ and τ^nf coincide, one shows that the three projections coincide, using the same technique as for the program order. Therefore, the identity relations of both computations match. Also note that for each read or write event sequence e1 ↔ e2 ↔ e3, the events occur in the order e1, e2, e3 in τ^nf. This holds by the fact that they occur in this order in τ, and the fact that σ = σ1 · e1 · e2 · e3 · σ2 for some σ1 and σ2.
To prove that τ nf ∈ C(P, N ), we proceed by contradiction. Let α = τ nf be the longest prefix of τ nf so that s (ẽ) ). But since st(rank(ẽ)) is unambiguously determined by the instr() of → po -predecessor ofẽ, which is the same in τ nf and in τ due to the matching program-order relations, this is not the case. The last opportunity why s ẽ − → may hold is because the transition producingẽ reads different values from registers or memory, e.g.ẽ is an assertion assume(e) and e = 0 in s. But since τ nf consists of the same events as τ , has the same program and conflict relations (i.e. reads receive values from the same writes in both computations), and τ ∈ C(P), this cannot be the case.
Finally, τ nf is in normal-form. The condition on the shape of τ nf is immediate, (NF) holds by the definitions of τ nf and σ.
Proof of Lemma 10. Let
Let Ω be the set of functions [1, n] → S_V. Then, the set of states is S_W := {s_0W} ∪ (S_U × Ω × Ω). The set of final states is
The automaton has the following transitions:
, where α_k is produced by the k-th head of U. By the ε-transition from the initial state, W guesses, for each k, the state ω(k) that the automaton V will reach after processing the prefix α_1 · · · α_{k−1} of α. The other transitions effectively execute the automaton U synchronously with n copies of the automaton V, each matching its own α_k subword of α, starting from the guessed initial state ω(k). The set of final states F_W makes sure that the guess was done correctly, which means the k-th copy of V has reached the initial state of the (k+1)-th copy, and the n-th copy has reached a final state in F_V.
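The guess-and-check idea of this construction can be sketched as a search over the states of W. The code below explores, for every guess ω with ω(1) equal to the initial state of V, the pairs of a state of U and the current states of the n copies of V, and accepts exactly when the guess is confirmed. It is a simplified illustration: V is assumed to have no ε-transitions, W is explored on the fly rather than built, and all names are hypothetical.

```python
from itertools import product

def intersection_nonempty(n, u_trans, u_init, u_finals,
                          v_states, v_trans, v_init, v_finals):
    """True iff L(U) and L(V) intersect, for an n-headed U and a finite automaton V.

    u_trans: triples (state, label, state), label is (head, letter) or None
    v_trans: dict mapping (state, letter) to a set of successor states
    """
    u_succ = {}
    for source, label, target in u_trans:
        u_succ.setdefault(source, []).append((label, target))
    # Epsilon step from the initial state of W: guess omega with omega(1) = v_init.
    for rest in product(v_states, repeat=n - 1):
        omega = (v_init,) + rest
        start = (u_init, omega)          # copy k of V starts in omega(k)
        seen, todo = {start}, [start]
        while todo:
            state, current = todo.pop()
            guess_confirmed = all(current[k] == omega[k + 1] for k in range(n - 1))
            if state in u_finals and guess_confirmed and current[n - 1] in v_finals:
                return True
            for label, target in u_succ.get(state, []):
                if label is None:
                    options = [current]
                else:
                    head, letter = label   # only the copy of this head moves
                    options = [current[:head - 1] + (v,) + current[head:]
                               for v in v_trans.get((current[head - 1], letter), ())]
                for option in options:
                    if (target, option) not in seen:
                        seen.add((target, option))
                        todo.append((target, option))
    return False

# U generates a^k b^k c^k with three heads.
u_trans = [("q0", (1, "a"), "q1"), ("q1", (2, "b"), "q2"), ("q2", (3, "c"), "q0")]

def contains(first, second):
    """Finite automaton for words over {a, b, c} containing first followed by second."""
    trans = {}
    for letter in "abc":
        trans[(0, letter)] = {1} if letter == first else {0}
        if letter == second:
            trans[(1, letter)] = {2}
        else:
            trans[(1, letter)] = {1} if letter == first else {0}
        trans[(2, letter)] = {2}
    return trans

has_aa = intersection_nonempty(3, u_trans, "q0", {"q0"}, [0, 1, 2], contains("a", "a"), 0, {2})
has_ba = intersection_nonempty(3, u_trans, "q0", {"q0"}, [0, 1, 2], contains("b", "a"), 0, {2})
```

The word aabbcc contains aa, so the first intersection is non-empty, while no word of the form a^k b^k c^k contains ba.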
Lemma 17. Y (P, N ) only generates computations of
with $\tau_i = \sigma \downarrow i$. We prove the following by induction on the length of the computation.
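$\to (\mathit{st}, m, \mathit{fa}, \mathit{fb})$ for some $\mathit{fa}$, $\mathit{fb}$, but with the same $\mathit{st}$, $m$ as in $s_Y$ above.
IS3 Let $\mathit{pa}(r, q) = k$. Then no $\tau_i$ with $i > k$ contains an event $e$ with $\mathit{kind}(e) = \mathit{popa}$, $\mathit{rank}(e) = r$, and $\mathit{que}(e) = q$. A similar statement holds for fb.
IS4 For all $e \in \tau_2 \cdot \tau_3 \cdot \tau_4$ we have $\mathit{kind}(e) \in \{\mathit{popa}, \mathit{popb}\}$.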
In the base case with σ = ε the inductive statement trivially holds. 
kind(e 1 ) = read, kind(e 2 ) = popa, kind(e 3 ) = popb.
Then $m' = m$, $\mathit{pa}' = \mathit{pa}$, $\mathit{pb}' = \mathit{pb}$, and
, and $\tau'_4 = \tau_4$. Since IS4 and IS3 hold for $\sigma$, they also hold for $\sigma'$ by definition of $\sigma'$ and Rule (read'). It remains to check the behaviour of the state-space automaton. By IS2 from the induction hypothesis and the Rules (read) and (read'), we have $s_{0X} \xrightarrow{\tau_1 \cdot e_1} (\mathit{st}', m, \mathit{fa}', \mathit{fb})$. So IS2 holds for $\sigma'$ as well. To check IS1 for $\sigma'$, we consider the content of $\mathit{fa}'$. According to Rule (read), we have $\mathit{fa}' := \mathit{fa}[(\mathit{rank}(e_1), \mathit{que}(e_1)) := \mathit{fa}(\mathit{rank}(e_1), \mathit{que}(e_1)) \cdot (r_{rem}, a_{rem}, r_{loc}, a_{loc})]$. By the induction hypothesis, we can generate $\tau_2$ from $(\mathit{st}, m, \mathit{fa}, \mathit{fb})$. In $(\mathit{st}', m, \mathit{fa}', \mathit{fb})$, we append an action to $\mathit{fa}$. Since $\tau_2$ only consists of popa and popb events, we can still generate the computation from $(\mathit{st}', m, \mathit{fa}', \mathit{fb})$. This yields $s_{0X} \xrightarrow{\tau_1 \cdot e_1 \cdot \tau_2} s_1$ for some $s_1$. We now show that $s_1 \xrightarrow{e_2} s_2$ for some $s_2$. Let $s_1 = (\mathit{st}'', m'', \mathit{fa}'', \mathit{fb}'')$. When checking IS3 for $\sigma'$, we noted that $\tau_3 \cdot \tau_4$ does not contain popa events $\tilde{e}$ with rank $\mathit{rank}(\tilde{e}) = \mathit{rank}(e_1)$ and queue id $\mathit{que}(\tilde{e}) = \mathit{que}(e_1)$. Therefore, by IS1 from the induction hypothesis, all elements in $\mathit{fa}(\mathit{rank}(e_1), \mathit{que}(e_1))$ are popped by popa transitions in $\tau_2$. As a result, $\mathit{fa}''(\mathit{rank}(e_1), \mathit{que}(e_1))$ contains only the single element added by $e_1$. Comparing Rules (read), (popa), and (read') shows $s_1 \xrightarrow{e_2} s_2$. Note that we need to take the read rules into account to make sure the contents of the tuple $e_2$ coincide for $Y(P, N)$ and $X(P, N)$.
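The queue argument in this step can be replayed on a toy example. The following Python sketch is only an illustration under assumptions: `fa` is modelled as a dictionary from (rank, queue id) pairs to FIFO queues, the helper names `do_read` and `do_popa` are hypothetical, and the queued tuples are placeholder strings rather than real addresses.

```python
from collections import defaultdict, deque

# fa maps (rank, queue id) to a FIFO queue of pending actions.
fa = defaultdict(deque)

def do_read(fa, rank, que, action):
    """Effect of a read on fa: append the action to the queue (rank, que)."""
    fa[(rank, que)].append(action)

def do_popa(fa, rank, que):
    """Effect of a popa step: remove and return the oldest pending action."""
    return fa[(rank, que)].popleft()

old_action = ("r_rem", "a_rem", "r_loc", "a_loc")      # placeholder, from an earlier read in tau_1
new_action = ("r_rem'", "a_rem'", "r_loc'", "a_loc'")  # placeholder, the action added by e_1

do_read(fa, 0, 0, old_action)   # earlier read in tau_1
do_read(fa, 0, 0, new_action)   # e_1 is appended behind it
popped = do_popa(fa, 0, 0)      # a popa step of tau_2 removes the older action
remaining = list(fa[(0, 0)])    # only the element added by e_1 is left
last = do_popa(fa, 0, 0)        # e_2 pops exactly that element
```

Appending the action of $e_1$ does not disturb the pops of $\tau_2$, since they remove older elements from the front of the queue, and after $\tau_2$ only the new element remains for $e_2$.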
The fact that $X(P, N)$ can accept the rest of the computation $\tau$ (that is, $s_2 \xrightarrow{\tau_3 \cdot e_3 \cdot \tau_4} s_3$ for some $s_3$) is proven similarly. Emptiness of the queues in $s_3$ follows from Rule (read') and IS1 for $\tau$.
The argument for write events, $\mathit{kind}(e_1) = \mathit{write}$, is the same. For the remaining kinds of events $e_1$, the proofs are simpler. There, we only need the state and memory configurations, which coincide in $Y(P, N)$ and $X(P, N)$.
Lemma 18. Automaton $Y(P, N)$ generates all normal-form computations of the program: $\{\tau \in C(P, N) \mid \tau \text{ is in normal form}\} \subseteq L(Y(P, N))$.
Proof. Consider a normal-form computation $\tau = \tau_1 \cdot \tau_2 \cdot \tau_3 \cdot \tau_4 \in C(P, N)$ with $s_{0X} \xrightarrow{\tau_1} s_X$ for some $s_X = (\mathit{st}, m, \mathit{fa}, \mathit{fb})$. To prove that $Y(P, N)$ can generate $\tau$, we show the following by induction on the length of the computation. (Note that by (NF) we can extend normal-form computations inductively.)
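IS1 $s_{0Y} \xrightarrow{\sigma} s_Y = (\mathit{st}, m, \mathit{pa}, \mathit{pb})$ with $\mathit{st}$ and $m$ from $s_X$ above.
IS2 We have $\sigma \downarrow i = \tau_i$ for all $i \in [1, 4]$.
IS3 Let the last $e$ with $\mathit{kind}(e) = \mathit{popa}$, $\mathit{rank}(e) = r$, $\mathit{que}(e) = q$ be in $\tau_k$. Then $\mathit{pa}(r, q) = k$. If there is no such event, $\mathit{pa}(r, q) = 1$. There is a similar requirement for popb events.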
Note that the computation $\varepsilon$ satisfies all the constraints. Assume the constraints hold for a computation $\tau$. We extend $\tau$ to a computation $\tau' = \tau'_1 \cdot \tau'_2 \cdot \tau'_3 \cdot \tau'_4$ and show that it also satisfies IS1 to IS3. Extending $\tau$ adds an event to the first part of the computation, $s_X \xrightarrow{e_1} s'_X$. We distinguish cases based on $\mathit{kind}(e_1)$.
Consider the case $\mathit{kind}(e_1) = \mathit{read}$. Let $e_1 \leftrightarrow e_2 \leftrightarrow e_3$ with $\tau'_2 = \tau_2 \cdot e_2$ and $\tau'_3 = \tau_3 \cdot e_3$.
Assume $e_1$ was generated by the transition $q_1 \xrightarrow{\mathit{cmd}} q_2$. This means $\mathit{st}(\mathit{rank}(e_1)) = q_1$. By IS1 in the induction hypothesis, $s_X$ and $s_Y$ share the same $\mathit{st}$ and $m$. Therefore, by Rules (read) and (read'), $Y(P, N)$ can mimic the read in $X(P, N)$. To make sure we can append $e_2$ to $\tau_2$, we have to check the requirements on $\mathit{pa}$. If $\mathit{pa}(\mathit{rank}(e_2), \mathit{que}(e_2)) < 2$, we can use Rule (gpa') to adapt the counter; if it equals 2, nothing needs to be adapted. If we assume that $\mathit{pa}(\mathit{rank}(e_2), \mathit{que}(e_2)) = k > 2$, we derive a contradiction as follows. By the induction hypothesis, there is an event $e'$ in $\tau_k$ with $\mathit{rank}(e') = \mathit{rank}(e_2)$, $\mathit{que}(e') = \mathit{que}(e_2)$, and $\mathit{kind}(e') = \mathit{kind}(e_2) = \mathit{popa}$. This event has a corresponding event $e \leftrightarrow e'$ in $\tau_1$. Summing up, $e$, $e_1$, $e_2$, $e'$ are contained in $\tau'$ in this order. Moreover, the latter two events pop from the same queue in reverse order: $e_2$, which belongs to $e_1$, precedes $e'$, which belongs to $e$, although $e$ was added to the queue before $e_1$. This contradicts the definition of FIFO. We conclude 
