Abstract. We investigate the decidability of the state reachability problem in finite-state programs running under weak memory models. In [3], we have shown that this problem is decidable for TSO and its extension with the write-to-write order relaxation, but beyond these models nothing is known to be decidable. Moreover, we have shown that relaxing the program order by allowing reads or writes to overtake reads leads to undecidability. In this paper, we refine these results by sharpening the (un)decidability frontiers on both sides. On the positive side, we introduce a new memory model NSW (for non-speculative writes) that extends TSO with the write-to-write relaxation, the read-to-read relaxation, and support for partial fences. We present a backtrack-free operational model for NSW, and prove that it does not allow causal cycles (thus barring pathological out-of-thin-air effects). On the negative side, we show that adding the read-to-write relaxation to TSO causes undecidability, and that adding non-atomic writes to NSW also causes undecidability. Our results establish that NSW is the first known hardware-centric memory model that is relaxed enough to permit both delayed execution of writes and early execution of reads for which the reachability problem is decidable.
Introduction
The memory consistency model (or simply, the memory model) of a shared-memory multiprocessor is a low-level programming abstraction that defines when and in what order writes performed by one processor become visible to other processors. The simplest memory model, sequential consistency [16] , requires that the operations performed by the processors should appear as if these operations are interleaved in a consistent global order. Despite its simplicity and appeal, most contemporary hardware platforms support weak (relaxed) memory models for performance reasons [2, 13] .
The effects of weak memory models can be counterintuitive and difficult to understand even for very small programs. Not surprisingly, relaxed memory models are an active research area today. Much progress has been made to aid programmers, in the form of verification or model-checking algorithms [8, 15, 26, 4] , testing tools [11, 18] , analyses that check whether programs are exposed to specific relaxations [7, 9, 20] , fence insertion tools [14, 15, 17] , verified compilation [10, 24, 23] , and formal models that closely approximate commercial multiprocessors [21, 22, 25] .
Nevertheless, many foundational questions about weak memory models remain. For instance, given a finite-state concurrent program under a weak memory model, what is the complexity of deciding whether a particular erroneous state can be reached? What is the most relaxed model for which the safety verification problem is decidable? Understanding the answers to these questions is crucial for model checking safety properties of programs under a relaxed memory model and for checking if a program exhibits the same behavior under different memory models.
w → r (Write-to-read order). The effect of a write may be delayed past a subsequent read. This relaxation enables the use of per-processor write buffers. Specifically, when executing a write, a processor may buffer the value to be written in its local buffer and continue executing before the buffered value becomes globally visible.
w → w (Write-to-write order). A processor may swap the order of two writes. For instance, if using a write buffer as described above, writes may exit the buffer in a different order than they entered.
r → r/w (Read-to-read/write order). A processor may change the order of a read and a subsequent read or write. This enables out-of-order execution techniques that help to hide the latency of memory accesses. We further distinguish between r → r (read-to-read) and r → w (read-to-write) relaxations.
RLWE (Read local writes early). A processor may read its own writes even if they are not globally visible yet (i.e., before they exit the buffer). For example, if a processor executes a read from a location for which there are pending writes in the local buffer, it can immediately forward the value of the last such write from the buffer to the read.
RRWE (Read remote writes early). A processor may read other processors' writes even if they are not globally visible yet. For example, a write in a local buffer may be directly forwarded to some remote processors before it exits the buffer.
RWF (read-read and write-write fences). A processor may issue a read-read (write-write) fence to prevent reordering of reads (writes) that precede the fence with reads (writes) that succeed it.
In prior work [3], we have presented some early decidability results for relaxed memory models. In this paper, we refine these results with a precise study of relaxations that lead to the undecidability of memory models. Our results show (perhaps surprisingly) that relaxations that are commonly considered counter-intuitive by programmers coincide with those that lead to undecidability. For instance, we show that adding the read-to-write relaxation to TSO (total store order) results in an undecidable memory model. In such a relaxation, a processor eagerly makes a write visible to other processors before a prior read has completed. Such speculative writes can result in causal cycles, a well-known memory model hazard [12, 19]. On the other hand, a memory model that avoids this relaxation but otherwise remains general by allowing read-to-read, write-to-read, and write-to-write relaxations together with read-read and write-write fences is actually decidable. We call this memory model NSW (non-speculative writes) and study its properties. Finally, we show that adding non-atomic writes to NSW leads to undecidability. Such non-atomic writes can lead to counter-intuitive IRIW (independent reads of independent writes) effects [6].
In the same vein, we show that NSW, which is the most relaxed model known to be decidable, exhibits the following desirable properties:
- NSW enables significant optimizations; specifically, (1) it permits a write to be moved down (later) in the program execution past any other read or write (by delaying it in a buffer), and (2) it permits reads to be moved up (earlier) in the program execution, before any read or write (even before a read on whose value it depends).
- The performance impact of prohibiting the read-to-write relaxation (the only read/write ordering relaxation that NSW does not allow) can be ameliorated by write buffers: even if we disallow writes to become visible to other processors (i.e., exit the write buffer) before all preceding reads have completed, we may still allow writes to enter the buffer while older reads are still pending.
- Since NSW does not permit writes to become visible to other processors before all older reads by the same processor have completed, causal cycles and out-of-thin-air behaviors are impossible. We formalize and prove this fact in Section 3.6.
- In operational memory models, reordering of dependent memory accesses is usually modeled by nondeterministically guessing the read value and validating it later. In some sense, such models are not very constructive as they may require backtracking if a guess cannot be validated later on. We discovered a way to eliminate all such guesses from our operational model for NSW, obtaining an alternative operational model that is backtrack-free (Section 5).
- The relaxations in NSW do not depend on any notion of data/control-dependencies. Not only does this greatly simplify the formalism, but it also avoids subtle soundness problems with compiler optimizations that may break dependencies [5].
To establish that the state reachability problem for NSW is decidable, we proceed in two steps. First, we define an operational model for NSW where reads do not need to be stored, while still allowing the precise simulation of all their possible reorderings due to the read-to-read relaxation (section 5). The key idea for tackling this issue consists, roughly speaking, in using a buffer storing the history of all the past memory states, in addition to information about the most recent value read by each process on each variable. The whole model actually has three levels of buffers, each of them related to one of the considered relaxations (write-to-write, write-to-read, and finally read-to-read). We think that this step has its own interest from the point of view of modeling and of understanding the effects of each of the considered relaxations, regardless of the decidability issue. Then, in a second step (section 6), we prove that the defined operational model can be transformed, while preserving state reachability, into a system that is monotonic w.r.t. a well quasi-ordering on the set of its configurations. This allows us to deduce that the model has a decidable state reachability problem, using [1]. Both steps are nontrivial and are based on new and quite subtle constructions.
Preliminary definitions and notations
Let k ∈ N such that k ≥ 1. Then, we denote by [k] the set {1, . . . , k}. Let Σ be a finite alphabet. We denote by Σ * the set of all words over Σ, and by ε the empty word. The length of a word w ∈ Σ * is denoted by length(w). (We assume that length(ε) = 0.) For every i ∈ [length(w)], let w(i) denote the symbol at position i in w. For a ∈ Σ and w ∈ Σ * , we write a ∈ w if a appears in w, i.e., ∃i ∈ [length(w)] such that a = w(i).
Given a sub-alphabet Θ ⊆ Σ and a word u ∈ Σ * , we denote by u| Θ the projection of u over Θ, i.e., the word obtained from u by erasing all the symbols that are not in Θ.
Let k ≥ 1 be an integer and E be a set. Let e = (e_1, . . . , e_k) ∈ E^k be a k-dim vector over E. For every i ∈ [k], we use e[i] to denote the i-th component of e (i.e., e[i] = e_i). For every j ∈ [k] and e′ ∈ E, we denote by e[j ← e′] the k-dim vector e″ over E defined as follows: e″[j] = e′ and e″[l] = e[l] for all l ≠ j.
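The projection u|Θ and the vector update e[j ← e′] can be illustrated by a minimal Python sketch. The function names `project` and `update` are chosen here for illustration only, and words and vectors are represented as Python sequences.

```python
from typing import Iterable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def project(u: Sequence[T], theta: Iterable[T]) -> Tuple[T, ...]:
    """Return the projection of u over theta: erase every symbol not in theta."""
    keep = set(theta)
    return tuple(a for a in u if a in keep)


def update(e: Sequence[T], j: int, value: T) -> Tuple[T, ...]:
    """Return the vector e[j <- value]; positions are 1-based, as in the text."""
    if not 1 <= j <= len(e):
        raise IndexError("j must lie in [k]")
    return tuple(value if l == j else x for l, x in enumerate(e, start=1))
```

For instance, `project("abcab", "ab")` keeps only the occurrences of a and b, and `update((1, 2, 3), 2, 9)` replaces the second component and leaves the others unchanged.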
Let E and F be two sets. We denote by [E → F] the set of all mappings from E to F. Assume that E is finite and that E = {e 1 , . . . , e k } for some integer k ≥ 1. Then, we sometimes identify a mapping g ∈ [E → F] with a k-dim vector over F. Intuitively, r(i, j, d) (resp. w(i, j, d)) means that process i reads (resp. writes) the data d from (resp. to) the variable x j . The semantics of atomic read-writes and of read/write fences will be explained in section 3.2.
A concurrent system over D and X is a tuple N = (P 1 , . . . , P n ) such that for every i ∈ [n], P i = (P i , ∆ i ) is a finite-state process where (1) P i is a finite set of control states, and (2) ∆ i ⊆ P i × Ω({i}, X, D) × P i is a finite set of labeled transitions.
Let P = P_1 × . . . × P_n. For convenience, we write p −op→_i p′ instead of (p, op, p′) ∈ ∆_i, for any p, p′ ∈ P_i and op ∈ Ω({i}, X, D). We denote by Ω(N) ⊆ Ω([n], X, D) the set of operations used in N. Given an operation ω = op(i, j, d) with op ∈ {r, w}, i ∈ [n], j ∈ [m], and d ∈ D, let proc(ω) = i, var(ω) = j, and data(ω) = d.
Memory models
The executions of a concurrent system are obtained by interleaving the operations issued by its different processes. In the Sequential Consistency (SC) model, the order between operations of the same process is preserved. Relaxations of this program order lead to the definition of various weak memory models. However, fences (i.e., barriers) can be used to impose the serialization of some operations at some execution points. An operation arw(i, j, d, d′) is equivalent to the atomic execution of the sequence r(i, j, d); w(i, j, d′), with the additional assumption that this operation is never reordered with any other operation of the same process. Therefore, this operation can emulate a full fence, i.e., a fence such that any two operations by the same process occurring before and after (in program order) the full fence cannot be swapped. The operation wfence(i) (resp. rfence(i)) is a fence for writes (resp. reads) only, i.e., writes (resp. reads) that occur before and after a write fence (resp. read fence) cannot be swapped.
A Semantics based on Rewrite Rules
We consider memory models corresponding to a set of program order relaxations defined by permutation rules between the operations. Given read/write operations op 1 , op 2 ∈ {w, r}, relaxing the op 1 to op 2 order consists in allowing operations of the class op 2 to overtake operations of the class op 1 in a computation, provided that these operations are issued by the same process and that they act on different variables. This corresponds to defining a set of rewrite rules:
In addition to permutations between reads and writes, we consider that reads and write fences issued by the same process can always be swapped, and the same holds concerning writes and read fences. Then, we consider the following set of rewrite rules RWF defining the semantics of read/write fences:
We also consider the following set RLWE (Read Local Write Early) of rewrite rules:
These rules say that a read that occurs after a write of the same value on the same variable by the same process can be validated immediately. Then, we consider that a memory model M is defined by the choice of a set of rewrite rules defining the allowed relaxations of the program order. For instance, we define in this framework the two well known models TSO and PSO as follows:
Clearly, TSO can be simulated under PSO by inserting a wfence before each write operation. Notice that using read fences in TSO and PSO is not relevant since reads cannot be swapped in these models. Similarly, using write fences in TSO is not relevant. But the possibility of using write fences in PSO is important. Without write fences, it is not possible to simulate TSO under PSO.
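The permutation rules described above can be sketched as a predicate on adjacent operations. This is only an illustration under stated assumptions: operations are encoded as tuples, the sets `TSO`, `PSO`, and `NSW` follow the informal description in the abstract (TSO relaxes w → r, PSO adds w → w, NSW adds r → r), the RLWE rules are left out, and the name `may_swap` is hypothetical.

```python
from typing import FrozenSet, Tuple

Op = Tuple  # ("r" | "w", proc, var, data) or ("rfence" | "wfence", proc)
Relaxation = Tuple[str, str]  # (op1, op2): op2 may overtake an earlier op1

# Assumed rule sets, following the informal description of the models.
TSO: FrozenSet[Relaxation] = frozenset({("w", "r")})
PSO: FrozenSet[Relaxation] = TSO | {("w", "w")}
NSW: FrozenSet[Relaxation] = PSO | {("r", "r")}


def may_swap(first: Op, second: Op, model: FrozenSet[Relaxation]) -> bool:
    """Can `second` overtake `first` when they are adjacent in a trace?"""
    if first[1] != second[1]:  # rules only apply within one process
        return False
    k1, k2 = first[0], second[0]
    # Reads commute with write fences, and writes with read fences.
    if {k1, k2} in ({"r", "wfence"}, {"w", "rfence"}):
        return True
    if k1 in ("r", "w") and k2 in ("r", "w"):
        # The relaxation must be allowed and the variables must differ.
        return (k1, k2) in model and first[2] != second[2]
    return False
```

Under this encoding a write followed by a read on another variable may be swapped in all three models, while a read followed by a write is never swapped, which is the r → w relaxation that NSW deliberately excludes.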
Given a process P i of N, and two control states p, p′ ∈ P i , a computation trace of P i from p to p′ is a finite sequence τ = ω_0 · · · ω_{ℓ−1} ∈ Ω({i}, X, D)* such that there are
The set of computation traces of P i from p to p′ is denoted by T (P i , p, p′).
Let R be a set of rewrite rules over traces defining a memory model M. Given a rewrite rule ρ = α → β, where α, β ∈ Ω(N)*, and a computation trace τ ∈ Ω(N)*, we define a rewriting relation →_ρ between traces as follows: τ →_ρ τ′ if τ = τ_1 α τ_2 and τ′ = τ_1 β τ_2 for some τ_1, τ_2 ∈ Ω(N)*. As usual, →*_ρ denotes the reflexive-transitive closure of →_ρ. These definitions are generalized in the obvious way to sets of rules and sets of computation traces. Given a set of rewrite rules R, the closure of a set of traces T, denoted by [T]_R, is the smallest set containing T and which is closed under the application of the rules in R, i.e., [T]_R = {τ′ ∈ Ω(N)* : ∃τ ∈ T. τ →*_R τ′}. Given two traces τ_1 and τ_2, the shuffle of the two traces is the set of traces obtained by interleaving the elements of τ_1 and τ_2 while preserving the original order between elements of each trace. Formally, the operator ⧢ is defined inductively as follows: (1) ε ⧢ τ = τ ⧢ ε = {τ}, and (2) ω_1 τ_1 ⧢ ω_2 τ_2 = ω_1 (τ_1 ⧢ ω_2 τ_2) ∪ ω_2 (ω_1 τ_1 ⧢ τ_2), for every ω_1, ω_2 ∈ Ω(N), and for every τ, τ_1, τ_2 ∈ Ω(N)*. The definition can be extended in a straightforward manner to a finite number of traces.
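The closure of a set of traces under rewrite rules and the shuffle of two traces can be sketched as follows. This is a minimal illustration rather than the construction used in the proofs: traces are Python tuples, only permutation rules that swap two adjacent operations are handled (not general rules α → β), and the predicate `may_swap` passed to `closure` is a hypothetical parameter deciding which adjacent pairs may be permuted.

```python
from typing import Callable, Iterable, Set, Tuple

Trace = Tuple  # a trace is a tuple of operations


def shuffle(t1: Trace, t2: Trace) -> Set[Trace]:
    """All interleavings of t1 and t2 preserving the order within each trace."""
    if not t1:
        return {t2}
    if not t2:
        return {t1}
    left = {(t1[0],) + rest for rest in shuffle(t1[1:], t2)}
    right = {(t2[0],) + rest for rest in shuffle(t1, t2[1:])}
    return left | right


def closure(traces: Iterable[Trace],
            may_swap: Callable[[object, object], bool]) -> Set[Trace]:
    """Smallest superset of `traces` closed under rewriting an adjacent
    pair (a, b) into (b, a) whenever may_swap(a, b) holds."""
    seen = set(traces)
    todo = list(seen)
    while todo:
        t = todo.pop()
        for k in range(len(t) - 1):
            if may_swap(t[k], t[k + 1]):
                s = t[:k] + (t[k + 1], t[k]) + t[k + 2:]
                if s not in seen:
                    seen.add(s)
                    todo.append(s)
    return seen
```

The worklist loop terminates because every rewritten trace is a permutation of a trace in the input, so only finitely many traces can ever be added.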
Given two vectors of control states p, p′ ∈ P, the set of computation traces in N from p to p′ in the memory model M (defined by R), denoted by T M (N, p, p′), is defined by
We define a relation between memory states corresponding to the execution of operations in Ω(N). Given d, d′ ∈ M, we have, for every i ∈ [n] and for every j ∈ [m]:
We extend this definition to sequences of operations, and therefore to computation 
The State Reachability Problem
The state reachability problem for a memory model M consists in, given a concurrent system N and two states s and s′ of N, checking whether Reach M N (s, s′) holds. We have:
Theorem 1 ([3]). The state reachability problem for TSO is decidable.
We also proved in [3] the decidability of the state reachability problem for a model with both w → w and w → r relaxations, but without considering write fences. Therefore, the so-called PSO in [3] is incomparable with TSO (since write fences are necessary to simulate TSO under that model), and is strictly less expressive (w.r.t. the set of computation traces) than the PSO as defined in this paper. We also showed in [3] that the state reachability problem is undecidable for the model where all four read/write relaxations are considered. We prove, using a reduction from Post's Correspondence Problem, the following stronger result:
Theorem 2. The state reachability problem for TSO ∪ {r → w} is undecidable.
NSW: A Model with Non Speculative Writes
We have seen in Section 3.4 that adding the r → w relaxation to TSO results in a memory model with an undecidable state reachability problem. Motivated by this, we introduce a memory model called NSW (for Non Speculative Writes) obtained by discarding this relaxation, i.e., by considering the following set of rules:
Clearly, the NSW model subsumes TSO and PSO, and since it allows out-of-order reads, it is actually a strictly more relaxed model than PSO. Notice that PSO can be simulated under NSW by inserting an rfence after each read operation. We show later that the state reachability problem for NSW is decidable. In the next section, we discuss another desirable property of the NSW memory model.
Absence of Causality Cycles in NSW
Let po denote the program order relation corresponding to the order in which operations of each thread are issued by the program. Then, one can define a dependency relation between operations of the same process that reflects the data and control dependencies. We adopt here a conservative definition by considering that all operations occurring after a read operation, in the program order, are dependent on that read. Formally, this corresponds to the following dependency relation.
Second, we define a read-from relation, denoted rf, that associates with each read event of the computation a write event such that
operation issued by process P j takes the value d that has been written by the operation w(i, k, d) issued by process P i on the variable x k . Then, the causality relation corresponding to the considered computation is defined by c = dep ∪ rf.
It can be seen that under the model SC ∪ {r → w}, there are programs having computations with a cyclic causality relation. An example of such a program is given on the right. It is clear that under the SC model, the four operations of this program cannot belong to the same computation from x = y = 0 to x = y = 1. However, using the r → w relaxation, it is possible, by permuting (1) and (2), to execute the four operations in the following order: (2), (3), (4), (1). This computation contains the causality cycle (2) → (3) → (4) → (1) → (2). We prove that by discarding the r → w relaxation, NSW avoids causal cycles.
Theorem 3. Every computation of any concurrent system under the NSW model has an acyclic causality relation.
Notice that since this theorem relies on the conservative definition of dependency given above (4), it also holds for any refinement of the dependency relation.
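The causality relation c = dep ∪ rf and the cycle check can be made concrete with a short Python sketch. The two-process program used below is an assumption made for illustration (the figure of the paper is not reproduced here): process 1 issues (1) r(x, 1) then (2) w(y, 1), and process 2 issues (3) r(y, 1) then (4) w(x, 1). The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def has_cycle(edges: Set[Tuple[int, int]]) -> bool:
    """Detect a directed cycle by depth-first search."""
    graph: Dict[int, List[int]] = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state: Dict[int, int] = {}  # 1 = on the DFS stack, 2 = finished

    def visit(node: int) -> bool:
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1 or (nxt not in state and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(n not in state and visit(n) for n in graph)


def dependencies(program: List[List[Tuple[int, str]]]) -> Set[Tuple[int, int]]:
    """Conservative dep: every operation after a read depends on that read."""
    dep: Set[Tuple[int, int]] = set()
    for thread in program:
        for k, (label, kind) in enumerate(thread):
            if kind == "r":
                dep |= {(label, later) for later, _ in thread[k + 1:]}
    return dep


# Assumed program: (1) r(x,1); (2) w(y,1)  ||  (3) r(y,1); (4) w(x,1)
program = [[(1, "r"), (2, "w")], [(3, "r"), (4, "w")]]
dep = dependencies(program)
# Read-from edges (write, read) of the execution order (2), (3), (4), (1).
rf_relaxed = {(2, 3), (4, 1)}
cyclic = has_cycle(dep | rf_relaxed)
```

With these edges the causality relation contains the cycle through (1), (2), (3), and (4), whereas `dep` alone, or `dep` together with the single read-from edge (2, 3), is acyclic.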
An Operational Model for NSW
We provide an operational model for NSW where configurations are formed by a vector of control states, one per process, a memory state giving the valuation of the shared variables, and an event structure where pending operations, issued by the different processes but not yet executed, are stored. This event structure defines a partial order between these operations reflecting the constraints imposed by the memory model on the order of their execution. We start by defining the notion of event structure. Then, we define a first operational model where the stored operations can be reads, writes, or write fences. (Nop's, atomic read-writes, and read fences do not need to be stored.)
Event structures
Let ℰ be an enumerable set of events. An event structure over an alphabet Σ is a tuple S = (E, Y, λ) where E is a finite subset of ℰ, Y ⊆ E × E is a partial order over E, and λ : E → Σ is a mapping associating with each event a symbol in Σ.
Given an event e ∈ ℰ \ E and a symbol a ∈ Σ, we denote by S ¡ [e ← a] the structure (E ∪ {e}, Y, λ′) such that λ′(e) = a and λ′(e′) = λ(e′) for all e′ ∈ E. Given an event e ∈ E, we denote by S £ e the structure (E′ = E \ {e}, Y|E′, λ|E′). Moreover, given e, e′ ∈ E, we denote by S ⊕ e Y e′ the event structure (E, (Y ∪ {(e, e′)})*, λ). These notations can be generalized to sets (of events and transitions) in the obvious way.
Given a concurrent system N = (P 1 , . . . , P n ), an event structure S over N is an event structure over Ω(N). Given i ∈ [n] and j ∈ [m], let E (i, j) = {e ∈ E : ∃d ∈ D. ∃op ∈ {w, r}. λ(e) = op(i, j, d)}. An event structure over Ω(N) is well-formed if, for every i and j, the relation Y | E (i, j) is a total order. We assume in the rest of the paper that all event structures over N are well-formed. This condition corresponds to the fact that read/write operations on the same variable should not be reordered.
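The event structures and the three operations on them (adding a labelled event, removing an event, and adding an ordering constraint) can be sketched in Python as follows. This is an illustrative data structure, not the paper's formal object: events are integers, the order is stored as a strict (irreflexive) transitively closed relation instead of a reflexive one, and all class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Set, Tuple


@dataclass
class EventStructure:
    """A finite set of labelled events with a strict partial order."""
    label: Dict[int, Hashable] = field(default_factory=dict)    # the mapping λ
    before: Set[Tuple[int, int]] = field(default_factory=set)   # (e, e') means e precedes e'

    def add_event(self, e: int, a: Hashable) -> None:
        """Add a fresh event e labelled a, unordered w.r.t. existing events."""
        if e in self.label:
            raise ValueError("event already present")
        self.label[e] = a

    def remove_event(self, e: int) -> None:
        """Remove e and restrict the order to the remaining events."""
        del self.label[e]
        self.before = {(x, y) for x, y in self.before if e not in (x, y)}

    def order(self, e1: int, e2: int) -> None:
        """Add the constraint e1 before e2 and re-close under transitivity."""
        if e1 == e2 or (e2, e1) in self.before:
            raise ValueError("would create a cycle")
        preds = {x for x, y in self.before if y == e1} | {e1}
        succs = {y for x, y in self.before if x == e2} | {e2}
        self.before |= {(x, y) for x in preds for y in succs}

    def minimal(self) -> Set[int]:
        """Events without predecessor."""
        return set(self.label) - {y for _, y in self.before}
```

Keeping `before` transitively closed makes the cycle test in `order` a single membership check and lets `remove_event` restrict the relation without losing constraints between the remaining events.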
Let
For every e ∈ E, we use data(e) to denote data(λ(e)).
An Operational Model with Stored Reads
We associate with the concurrent system N a transition system (Conf N , ⇒ N ) where Conf N is a set of configurations, and ⇒ N ⊆ Conf N × Conf N is a transition relation between configurations. A configuration of N (an element of Conf N ) is any triple (p, d, S) where p ∈ P, d ∈ M, and S is an event structure over N. The transition relation ⇒ N is the smallest relation such that for every p, p′ ∈ P, for every d, d′ ∈ M, and for every two event structures S = (E, Y, λ) and S′ = (E′, Y′, λ′) over N, we have
, and one of the following cases hold:
0 with e m = max(WE(i, j)), e ∈ RE(i, j). e m Y e, and data(e m ) = d. Let us explain each case. A write operation w(i, j, d) is simply added to the structure by introducing a new event e labelled with this operation, which is inserted after all write fences issued by P i as well as all the write/read operations of P i on x j .
A read operation r(i, j, d) can be validated immediately (point 3) if S still contains a write of P i on x j (and there is no read of P i on x j after this write), and the last such operation writes precisely the value d on x j . Otherwise (in point 4), a read operation r(i, j, d) is simply added to the structure S after all reads/writes of P i on x j . Notice that the event associated with this read operation is not ordered w.r.t. write fences that are maximal in S (i.e., the read is allowed to overtake such write fences). Moreover, a new write fence is inserted after the read. This ensures that, as long as this read has not been validated, it cannot be overtaken by any write.
An atomic read-write operation, which acts as a fence on all operations of the process P i , can be executed only when all events before it have been executed. A read fence issued by P i is executed immediately (it is not stored in S) if there are no reads in S issued by P i . A write fence is inserted in S after all the events issued by P i . Writes are removed from S and used to update the main memory when these operations correspond to minimal events of S. Similarly, reads are validated w.r.t. the main memory and removed from S if they correspond to minimal events. Finally, a write fence can simply be removed from S when it becomes minimal.
Let S_∅ denote the empty event structure. Then, we have:
From Event Structures to FIFO Buffers
We provide in this section a model for NSW using FIFO buffers where reads and fences are never stored. We proceed in two steps. First, we provide an alternative operational model for NSW where reads can be immediately validated using information about the sequence of states that the memory had in the past. The history of the memory states is stored in an additional FIFO buffer. Then, we show that it is also possible to get rid of wfences by converting event structures into two-level structures of write buffers.
Eliminating reads from event structures
We present hereafter a new operational model where reads are validated using an additional buffer storing memory states, called the history buffer. The idea is the following.
Consider a read operation r(i, j, d) issued by process P i that can be validated during a computation from a write operation w(k, j, d) issued by process P k . Then, if at the moment r(i, j, d) is issued the write w(k, j, d) has not yet been issued, it is actually possible for P i to wait until P k produces w(k, j, d). The reason is that issuing w(k, j, d) by P k cannot depend on the actions of P i after r(i, j, d), because otherwise, this would mean that there is a read by P k before w(k, j, d) which needs (i.e., is causally dependent on) a write of P i occurring after r(i, j, d). But this would imply the existence of a causality cycle, which contradicts the fact that such cycles do not exist in NSW computations since writes cannot overtake reads (see Thm. 3). Therefore, it is always possible to consider computations where reads are validated w.r.t. writes that have been issued in the past.
However, since some actions must exit the event structure of the system configuration (due to fences), we need to maintain the history of all past memory states in a buffer. We use a buffer such that the last element represents the current state of the memory, and where the other elements represent the preceding states of the memory in the order they have been produced. Notice that a history buffer is never empty since it must contain at least one element representing the state of the memory.
Now, since reads can be swapped, their validation can use writes that might be issued in a different order. However, reads by the same process on the same variable must be done in a coherent way, i.e., they should read from states occurring in the same order. To ensure that, we introduce pointers π(i, j) on the history buffer defining for each process P i and each variable x j the oldest memory state that can be observed. Then, to validate a read on x j by P i , we should find a memory state that occurs after π(i, j) in the buffer where x j has the right value. Actually, to simplify the construction, we allow a pointer to move in a nondeterministic way toward the tail of the buffer (i.e., the most recent element).
Then, to validate an operation r(i, j, d), we simply require that the value of x j in the element pointed to by π(i, j) is precisely d. Also, when a write event w(i, j, d) exits the event structure and is used to update the memory, the pointer π(i, j) is moved to the last element of the history buffer (i.e., the current state of the memory) since this is the only value of x j that is visible to P i .
Notice that the relevant part of the history buffer at any moment is formed by the elements between the last element (current state of the memory) and the oldest element that is pointed to by π.
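The history buffer with its pointers can be sketched in Python as follows. This is a simplified illustration under stated assumptions: memory states are dictionaries from variable names to values, variables are named by strings rather than indices, pointers are list indices, and the class and method names are hypothetical. The nondeterministic pointer move of the model is exposed as an explicit `advance` call.

```python
from typing import Dict, List, Tuple


class HistoryBuffer:
    """Past memory states, oldest first; the last entry is the current memory."""

    def __init__(self, initial: Dict[str, int], procs: List[int]) -> None:
        self.states: List[Dict[str, int]] = [dict(initial)]
        # pointer[(i, x)] is the index of the state that process i observes for x.
        self.pointer: Dict[Tuple[int, str], int] = {
            (i, x): 0 for i in procs for x in initial
        }

    def advance(self, i: int, x: str, steps: int = 1) -> None:
        """Move the pointer of (i, x) toward the tail of the buffer."""
        new = self.pointer[(i, x)] + steps
        if steps < 0 or new >= len(self.states):
            raise ValueError("pointers only move forward, within the buffer")
        self.pointer[(i, x)] = new

    def can_read(self, i: int, x: str, d: int) -> bool:
        """A read of x by i returning d is valid iff the pointed state has x = d."""
        return self.states[self.pointer[(i, x)]][x] == d

    def commit_write(self, i: int, x: str, d: int) -> None:
        """A write of process i reaches memory: append the new memory state
        and move the pointer of (i, x) to it."""
        current = dict(self.states[-1])
        current[x] = d
        self.states.append(current)
        self.pointer[(i, x)] = len(self.states) - 1
```

In this sketch, after process 1 commits a write of 1 to x, process 2 can still read the old value 0 of x until its own pointer on x is advanced, while process 1 only sees the new value.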
To give the formal description of our model, we need to introduce some definitions concerning buffers and their manipulation. An event structure (E, Y, λ) is totally ordered when Y is a total order. We use such structures to encode FIFO buffers. Given a buffer B = (E, Y, λ) over an alphabet Σ, and a symbol a ∈ Σ, let add(B, a) be the buffer (E′, Y′, λ′) such that (1) E′ = E ∪ {e} for some fresh event e ∉ E, (2) if E = ∅ then Y′ = {(e, e)}, otherwise Y′ = (Y ∪ {(max(E), e)})*, and (3) λ′ = λ ∪ [e ↦ a]. Then, if λ(min(E)) = a, let remove(B, a) be the buffer (E′, Y′, λ′) such that (1) E′ = E \ {min(E)}, (2) Y′ = Y|E′, and (3) λ′ = λ|E′. We also define the predicate Empty which is true when the buffer has an empty set of events. When the buffer B is not empty, we denote by tail(B) (resp. head(B)) the element λ(max(E)) (resp. λ(min(E))).
Given a concurrent system N, a history buffer of memory states is a tuple H = (E, Y, λ, π) where (E, Y, λ) is a buffer over M (the set of all memory states) such that E ≠ ∅, and π : [n] × [m] → E is a mapping associating with each process and each variable an event in E. We say that a history buffer is unitary if H is reduced to a singleton (i.e.,
Then, we are ready to define the transition system of the new model. A configuration is a tuple ⟨p, S, H⟩ where, as in the previous model, p ∈ P is a vector of control states of each of the processes and S is an event structure, and where H is a history buffer over M. The new transition relation of N is the smallest relation such that for every p, p′ ∈ P,
{e Y e : e ∈ max( E (i, j) )}. 
10. Write fence elimination: p′ = p, H′ = H, d′ = d, and ∃e ∈ min(E) such that λ(e) = wfence(i), and S′ = S £ e.
Eliminating write fences from event structures
We show in this section that we can avoid storing write fences and convert event structures into write buffers. The idea is the following. We observe that the projection of the event structure on the events of the same process is, roughly speaking, a sequence of partial orders, each of these partial orders corresponding to the set of write events occurring between two successive write fences. These partial orders also have the property that they are unions of m total orders, each of them corresponding to the set of writes to the same variable. These total orders can naturally be manipulated using m FIFO buffers WB(i,1), . . . , WB(i,m). Then, to simulate the whole sequence of partial orders corresponding to the events of a process, we need to reuse the same buffers after each write fence, while ensuring that all writes occurring before the write fence are executed before all those occurring after it. The solution for that is to introduce for each process P i an additional buffer WB(i,m+1) used to flush the buffers WB(i,1), . . . , WB(i,m) after each write fence without imposing that their content is directly written in the memory.
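The two-level organization of write buffers for a single process can be sketched in Python as follows. This is an illustration under assumptions, not the formal model: variables are named by strings, the serialization buffer `serial` plays the role of WB(i,m+1), a write fence is rendered as an eager flush of all per-variable buffers (one possible scheduling of the transfer steps), and the class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class ProcessBuffers:
    """Write buffers of one process: one FIFO per variable plus a serialization FIFO."""

    def __init__(self, variables: List[str]) -> None:
        self.per_var: Dict[str, Deque[Tuple[str, int]]] = {x: deque() for x in variables}
        self.serial: Deque[Tuple[str, int]] = deque()  # second-level buffer

    def write(self, x: str, d: int) -> None:
        """Buffer a write of d to x in the first-level buffer of x."""
        self.per_var[x].append((x, d))

    def transfer(self, x: str) -> None:
        """Move the oldest pending write on x to the serialization buffer."""
        self.serial.append(self.per_var[x].popleft())

    def write_fence(self) -> None:
        """Flush every per-variable buffer into the serialization buffer.
        The order in which variables are drained is arbitrary here."""
        for x, buf in self.per_var.items():
            while buf:
                self.transfer(x)

    def commit(self, memory: Dict[str, int]) -> None:
        """Apply the oldest serialized write to the main memory."""
        x, d = self.serial.popleft()
        memory[x] = d
```

Because the fence empties the first level before any later write is buffered, every write issued before the fence sits ahead of every later write in the serialization buffer, and therefore reaches the memory first.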
Then, the architecture of our model is as follows. Each process $P_i$ has two levels of buffers: a first level with $m$ write buffers, one per variable, storing the writes to that variable, and a second level with a single buffer used to serialize the writes before committing them to the main memory. In addition, there is a history buffer whose last element represents the current state of the memory and whose remaining elements represent the history of all past memory states. Pointers into this buffer tell each process the oldest value it can read for each variable. We give hereafter the formal definition of our model.
A configuration in this model is a tuple of the form $\langle p, (WB_{(i,j)})_{i \in [n], j \in [m+1]}, H \rangle$ where $p \in P$, for every $i \in [n]$ and every $j \in [m+1]$, $WB_{(i,j)}$ is a write buffer, and $H$ is a history buffer over $M$. Then, we define the transition relation $\to_N$ between configurations as the smallest relation such that for every $p, p' \in P$, for every two vectors of store buffers $(WB_{(i,j)})$ and $(WB'_{(i,j)})$ with $WB_{(i,j)} = (B_{(i,j)}, Y_{(i,j)}, \lambda_{(i,j)})$ for all $i$ and $j$, and for every two history buffers $H = (B, \pi)$ and $H' = (B', \pi')$, where $B = (H, Y_H, \lambda_H)$ and $B' = (H', Y_{H'}, \lambda_{H'})$ are two buffers over $M$, we have […]
4. Transfer write: $p = p'$, $H = H'$, $\exists j \in [m]$. $WB'_{(i,k)} = WB_{(i,k)}$ for every $k \in ([m] \setminus \{j\})$, and $\exists \omega = head(WB_{(i,j)})$. $WB'_{(i,j)} = remove(WB_{(i,j)}, \omega)$ and $WB'_{(i,m+1)} = add(WB_{(i,m+1)}, \omega)$.
5. RLWE from $WB_{(i,j)}$, $j \in [m]$: $p \xrightarrow{r(i,j,d)}_i p'$, $H = H'$, $WB'_{(i,k)} = WB_{(i,k)}$ for every $k \in [m+1]$, and $data(tail(WB_{(i,j)})) = d$.
6. RLWE from $WB_{(i,m+1)}$: $p \xrightarrow{r(i,j,d)}_i p'$, $H = H'$, $WB'_{(i,k)} = WB_{(i,k)}$ for every $k \in [m+1]$, $Empty(WB_{(i,j)})$, the set $W_{(i,m+1)} = \{e \in B_{(i,m+1)} : \exists d' \in D.\ \lambda_{(i,m+1)}(e) = w(i,j,d')\}$ is not empty, and $data(max(W_{(i,m+1)})) = d$.
7. Read: $p \xrightarrow{r(i,j,d)}_i p'$, $H = H'$, $WB'_{(i,k)} = WB_{(i,k)}$ for every $k \in [m+1]$, $Empty(WB_{(i,j)})$, the set $W_{(i,m+1)}$ defined above is empty, and $\exists d \in M$ such that […]
It is worth noting that for PSO, i.e., when read fences are systematically inserted after reads, the operational model we define always has a history buffer of size 1 (i.e., reduced to the memory state). Notice that two levels of write buffers are still needed for PSO due to the use of write fences. For TSO, the per-variable write buffers ($WB_{(i,j)}$ for $j \in [m]$) are not needed since writes are immediately followed by write fences. This coincides with the operational model defined, e.g., in [3].
6 The State Reachability Problem of NSW
We show hereafter that the state reachability problem of NSW is decidable. For that, we use the framework defined in [1], which establishes that state reachability can be solved using backward reachability analysis in the following case: given a well-quasi-ordering (WQO) $\preceq$ on configurations, if the system is monotonic w.r.t. $\preceq$, i.e., larger configurations w.r.t. $\preceq$ can always simulate smaller ones, then backward reachability in this system is guaranteed to terminate if it starts from $\preceq$-upward closed sets, i.e., sets that, whenever they contain a configuration $c$, also contain all configurations that are $\preceq$-larger than $c$.
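The backward reachability scheme of [1] can be illustrated on a toy system. The following is a minimal sketch, not the NSW construction itself: it assumes configurations are vectors of counters ordered componentwise (a WQO by Dickson's lemma) and Petri-net-style transitions, which are monotonic w.r.t. that order. All names (`leq`, `minimal_pre`, `insert_minimal`, `backward_reachable`) are hypothetical. Upward closed sets are represented by their finitely many minimal elements, and the loop stops when no new minimal element appears.

```python
from typing import Iterable, List, Tuple

Vec = Tuple[int, ...]
# Assumed toy transition: consume `take`, then produce `give` (Petri-net style).
Transition = Tuple[Vec, Vec]


def leq(a: Vec, b: Vec) -> bool:
    """Componentwise order on counter vectors (the WQO of this toy system)."""
    return all(x <= y for x, y in zip(a, b))


def minimal_pre(target: Vec, t: Transition) -> Vec:
    """Least vector from which firing t lands in the upward closure of target."""
    take, give = t
    return tuple(tk + max(0, m - g) for m, tk, g in zip(target, take, give))


def insert_minimal(basis: List[Vec], v: Vec) -> bool:
    """Add v to the basis unless some basis element is already below it."""
    if any(leq(b, v) for b in basis):
        return False
    basis[:] = [b for b in basis if not leq(v, b)]  # drop elements above v
    basis.append(v)
    return True


def backward_reachable(init: Vec, bad: Iterable[Vec], ts: List[Transition]) -> bool:
    """Decide whether init can reach the upward closure of `bad`."""
    basis: List[Vec] = []
    work: List[Vec] = []
    for b in bad:
        if insert_minimal(basis, b):
            work.append(b)
    while work:  # terminates because the order is a WQO
        m = work.pop()
        for t in ts:
            p = minimal_pre(m, t)
            if insert_minimal(basis, p):
                work.append(p)
    return any(leq(b, init) for b in basis)
```

For instance, with the single transition `((1, 0), (0, 1))`, which moves one token from the first counter to the second, the target `(0, 2)` is covered from `(2, 0)` but not from `(1, 0)`.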
To define such an ordering, we observe that a value written to the memory by some process might be overwritten by other write operations of the same process before any other process has had time to read it. Therefore, the effect of a write operation sent by a process to its store buffer may never be used, which suggests defining $\preceq$ to reflect the subword relation between the buffer contents. However, this intuition cannot be exploited directly. As we will see below, NSWs are not monotonic in general w.r.t. such a subword-based relation. To circumvent this problem, we introduce another model called NSW$^+$, obtained from NSW, where, roughly, the serialization buffers $WB_{(i,m+1)}$ contain memory states (corresponding to the cumulated effects of write operations) instead of write operations, and where one history buffer is associated with each process. We show that (1) the state reachability problem in a given NSW is reducible to the one in its corresponding NSW$^+$, and (2) every NSW$^+$ is monotonic w.r.t. a subword-based relation on buffers. Notice that the translation from NSW to NSW$^+$ preserves reachability, but the resulting model is not bisimilar to the original one (and therefore monotonicity cannot be transferred).
Informal introduction to NSW$^+$: We explain hereafter how an NSW$^+$ model is defined starting from a given NSW. Let us first see why NSWs are not monotonic w.r.t. the subword relation, i.e., why it is not sound to consider that all the buffers in NSW are lossy. More precisely, while it can be shown that the write buffers $WB_{(i,j)}$ for all $i \in [n]$ and $j \in [m]$, as well as the history buffer, can safely be considered lossy, the serialization buffers $WB_{(i,m+1)}$ for $i \in [n]$ cannot simply be turned into lossy buffers. Consider first a sequence of write operations $w(i,j,d')\,w(i,j,d)$ in the write buffer $WB_{(i,j)}$, for some $j \in [m]$, where $w(i,j,d)$ is the oldest operation. Since both operations are on the same variable $x_j$, losing the operation $w(i,j,d)$, i.e., replacing this sequence by just $w(i,j,d')$, yields a valid computation corresponding to the compaction of the two operations. Indeed, it is possible to overwrite the value $d$ by $d'$ before any process is able to read $d$. Therefore, it is possible to lose any operation in a write buffer corresponding to a variable, except the last operation. This is especially important for the read-local-write-early operation. Then, considering the last symbol in each write buffer $WB_{(i,j)}$ as a strong symbol (one that cannot be lost) and turning $WB_{(i,j)}$ into a lossy channel does not introduce computations that are not possible in the original program. Observe that the number of such strong symbols is finite (one per write buffer $WB_{(i,j)}$).
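The lossy reading of a per-variable write buffer with a strong last symbol can be sketched as an ordering check. This is an illustrative sketch under stated assumptions: a buffer is a Python list with the oldest write first and the newest (strong) write last, and the names `is_subword` and `lossy_leq` are hypothetical, not part of the formal model.

```python
from typing import Sequence, Tuple

# A write w(i, j, d): process i, variable j, value d.
Write = Tuple[int, int, str]


def is_subword(small: Sequence[Write], big: Sequence[Write]) -> bool:
    """True if small can be obtained from big by deleting elements."""
    it = iter(big)
    return all(any(x == y for y in it) for x in small)


def lossy_leq(small: Sequence[Write], big: Sequence[Write]) -> bool:
    """small is below big if it results from losing writes of big,
    except the newest write, which is strong and must be kept."""
    if not small or not big:
        # Losing every write would also lose the strong symbol.
        return not small and not big
    return small[-1] == big[-1] and is_subword(small[:-1], big[:-1])
```

With `old = (1, 1, "d")` and `new = (1, 1, "d'")`, the buffer `[new]` is below `[old, new]` (the compaction described above), whereas `[old]` is not, since it would lose the last write.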
Consider now a sequence of memory states $d' \cdot d$ in the history buffer $H$, where $d$ is the oldest state. Then, losing the memory state $d$ in $M_i$ is similar to considering that this state has not been observed by $P_i$. This is perfectly valid since processes observe the states of the memory in an asynchronous way, and therefore they may miss some states. However, memory states in $H$ that are pointed to by some pointer $\pi(i,j)$ should not be lost, and they must be considered as strong symbols. Indeed, without these pointed states, reads cannot be validated. In addition, we should not lose the tail of $H$ (which corresponds to the current memory state) since it is used to compute the next memory state. Then, the pointed elements as well as the last element of the history buffer must be considered as strong symbols (again, the number of such symbols is finite).
It remains to consider the case of the serialization write buffer $WB_{(i,m+1)}$. Consider a sequence of operations $w(i,j,d')\,w(i,k,d)$ in $WB_{(i,m+1)}$. Since these two operations are on different variables, losing $w(i,k,d)$ does not correspond to the compaction of the two operations. To encode the compaction (or the summary) of such a sequence of operations, we need to use a vector of values defining the last value written to each variable by the operations in the sequence. Then, an idea is to replace the content $\omega_\ell \cdots \omega_1$ of $WB_{(i,m+1)}$ by the sequence of summaries $\sigma_\ell \cdots \sigma_1$, where $\sigma_h$ is the summary of the sequence $\omega_h \cdots \omega_1$. For instance, in our example, the sequence of summaries is $(x_j = d', x_k = d) \cdot (x_k = d)$. Losing $(x_k = d)$ does not correspond to losing the effect of the operation $w(i,k,d)$ since this effect is still visible in $(x_j = d', x_k = d)$. Assume now that $(x_k = d)$ has not been lost and has been updated to the main memory. This value of $x_k$ in the main memory can be overwritten by a write operation of a different process from $P_i$. Then, when the system decides to update $(x_j = d', x_k = d)$ to the main memory, we should not reset the value of $x_k$ to $d$ (since the write operation $(x_k = d)$ has already taken effect). This shows that $WB_{(i,m+1)}$ (under NSW$^+$) must contain a valid sequence of memory states (that will be used to update the memory in the future). Then, we can formulate a similar argument as in the case of the history buffer to allow some of the memory states in $WB_{(i,m+1)}$ to be lost.
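The summary construction discussed above, and the reason it is not sufficient on its own, can be sketched as follows. This is a hedged illustration: writes are modeled as (variable, value) pairs listed oldest first, memory states as dictionaries, and the names `summaries`, `apply_summary`, and `stale_overwrite_demo` are hypothetical. The demo replays the scenario of the text: after the first summary reaches the memory, another process overwrites $x_k$, and applying the second summary then wrongly resets $x_k$.

```python
from typing import Dict, List, Tuple

Write = Tuple[str, str]  # (variable name, value); an assumed encoding


def summaries(writes_oldest_first: List[Write]) -> List[Dict[str, str]]:
    """Cumulative effect of each prefix of the buffer, oldest prefix first."""
    out: List[Dict[str, str]] = []
    acc: Dict[str, str] = {}
    for var, val in writes_oldest_first:
        acc = {**acc, var: val}  # last value written to each variable so far
        out.append(acc)
    return out


def apply_summary(memory: Dict[str, str], summary: Dict[str, str]) -> Dict[str, str]:
    """Naively push a summary to the main memory."""
    return {**memory, **summary}


def stale_overwrite_demo() -> Dict[str, str]:
    # Buffer of P_i: w(i, k, d) is the oldest write, w(i, j, d') the newest.
    s = summaries([("xk", "d"), ("xj", "d'")])
    mem = apply_summary({"xj": "0", "xk": "0"}, s[0])  # (x_k = d) is committed
    mem["xk"] = "e"  # a different process overwrites x_k in the main memory
    return apply_summary(mem, s[1])  # x_k is wrongly reset to d
```

Here `stale_overwrite_demo()` ends with `xk` equal to `"d"` although the later value `"e"` should have survived, which is why the serialization buffer has to carry full, valid memory states rather than per-process summaries.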
However, in order to have a valid sequence of memory states, the serialization buffer $WB_{(i,m+1)}$ under NSW$^+$ should simulate the contributions of the other processes. Therefore, the memory states resulting from writes performed by other processes have to be inserted in $WB_{(i,m+1)}$. This implies that the system should guess in advance in which order the write operations will be updated to the main memory. This is performed under NSW$^+$ as follows: (1) a write is removed from some write buffer $WB_{(k,j)}$ (chosen nondeterministically), (2) a new memory state is then computed from the last state added to $WB_{(k,m+1)}$, and (3) this new state is added to all the serialization buffers. Observe that a memory state in $WB_{(i,m+1)}$ resulting from a write operation of a process $P_k$ (with $k \neq i$) should not be detected by $P_i$ (since it has not yet been committed to the main memory).
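Steps (1) to (3) above can be sketched as a single function. This is only an illustration under several assumptions that are not part of the formal model: buffers are Python lists with the oldest element first, variables are indexed from 0, the last state added to the serialization buffers is passed explicitly as `last_state`, and each entry records its owner $k$ so that other processes can ignore it until it is committed. The name `serialize_one` is hypothetical, and the nondeterministic choice of $(k, j)$ is left to the caller.

```python
from typing import Dict, List, Tuple

State = Tuple[str, ...]          # one value per variable
Entry = Tuple[int, int, State]   # (owner process k, variable j, memory state)


def serialize_one(write_buffers: Dict[Tuple[int, int], List[str]],
                  serial_buffers: Dict[int, List[Entry]],
                  last_state: State, k: int, j: int) -> State:
    """Move the oldest write of P_k on x_j into every serialization buffer."""
    # (1) remove the write from WB_(k, j)
    value = write_buffers[(k, j)].pop(0)
    # (2) compute the new memory state from the last state added so far
    new_state = last_state[:j] + (value,) + last_state[j + 1:]
    # (3) add the new state to all the serialization buffers
    for buf in serial_buffers.values():
        buf.append((k, j, new_state))
    return new_state
```

Calling the function repeatedly with different pairs $(k, j)$ fixes, ahead of time, the order in which the writes will reach the main memory; every process then sees the same sequence of guessed memory states in its own serialization buffer.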
Observe that the execution of each process is totally determined by the sequence of memory states and its local configuration (i.e., its control state, its store buffer contents, and its serialization buffer content). Therefore, under NSW$^+$, each process $P_i$ has its own private copy $H_i$ of the history buffer (without any need for synchronization with the other processes), since it already has the sequence of memory states in its serialization buffer. Now, if a memory state is at the head of the serialization buffer $WB_{(i,m+1)}$ of the process $P_i$, then this state is removed from this buffer and a copy is transferred to its history buffer $H_i$.
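Formal definition of NSW$^+$: A configuration is a tuple of the form $\langle p, (WB_{(i,j)})_{i \in [n], j \in [m+1]}, (H_i)_{i \in [n]} \rangle$ where $p$ and $(WB_{(i,j)})_{i \in [n], j \in [m]}$ are defined as in the previous section, $(WB_{(i,m+1)})_{i \in [n]}$ are write buffers over $F = \{w(i,j,d) : j \in [m] \wedge d \in M\}$, and the $H_i$ are history buffers over $M$. Then, we define the transition relation $\to_N$ as the smallest relation such that for every $p, p' \in P$, for every two vectors of buffers $(WB_{(i,j)})$ and $(WB'_{(i,j)})$ with $WB_{(i,j)} = (B_{(i,j)}, Y_{(i,j)}, \lambda_{(i,j)})$ for all $i \in [n]$ and $j \in [m+1]$, and for every two vectors of history buffers $(H_i)_{i \in [n]}$ and $(H'_i)_{i \in [n]}$, we have $\langle p, (WB_{(i,j)}), (H_i)_{i \in [n]} \rangle \to_N \langle p', (WB'_{(i,j)}), (H'_i)_{i \in [n]} \rangle$ if there are $i \in [n]$ and $p, p' \in P_i$ such that […] $\setminus \{i\}$, and one of the following cases holds: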
3. Write fence: $p$ […]
[…] $\in [m+1]$, and $data(tail(WB_{(i,j)})) = d$.
[…] $Empty(WB_{(i,j)})$, the set $W_{(i,m+1)}$ defined above is empty, and $\exists d \in M$ such that […]
10. Read fence: $p$ […]
Memory update […]
We prove that the state reachability problem for a concurrent system $N$ under NSW can be reduced to the corresponding problem for $N$ under NSW$^+$. Let $N$ be an NSW$^+$, and let us define the relation $\preceq$ on the configurations […] for all $i$ and $j$, and […]. Then, we consider that $c \preceq c'$ if […]
Then, from the three lemmas above and [1], we deduce the following fact:
Theorem 11. The state reachability problem for NSW$^+$ is decidable.
As a corollary of Theorem 7 and Theorem 11, we obtain our main result:
Corollary 12. The state reachability problem for NSW is decidable.
7 Nonatomic Writes Cause Undecidability
P1             P2             P3             P4
(1) r(x, 1)    (4) r(y, 1)    (7) w(x, 1)    (8) w(y, 1)
(2) rfence     (5) rfence
(3) r(y, 0)    (6) r(x, 0)
Final memory state: x = y = 1
Fig. 3. The IRIW (Independent Reads of Independent Writes) Litmus Test. $P_3$ writes 1 to $x$ and $P_4$ writes 1 to $y$. In parallel, $P_1$ observes that $x$ has been modified before $y$, whereas $P_2$ observes that $y$ is modified before $x$.
So far, we have considered only models that do not contain the RRWE (read remote writes early) relaxation. In this section, we show that adding RRWE to NSW makes the reachability problem undecidable. The RRWE relaxation allows a processor to read other processors' writes even if they are not globally visible yet. This makes writes non-atomic and can be detected by the IRIW litmus test (Fig. 3) . IRIW is not possible in NSW as defined earlier.
However, if we change the model to allow a read operation of $P_i$ from a variable $x_j$ to be validated by the last write issued by $P_k$ (with $k \neq i$) on $x_j$, although this write has not yet been committed, IRIW becomes possible.
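The contrast between atomic and non-atomic writes on IRIW can be checked on a small model. The following is a simplified sketch, not the operational model of the paper: `atomic_outcomes` assumes that every write becomes visible to all processes at a single commit point and that the rfence keeps each reader's two reads in program order, so it simply enumerates interleavings; `rrwe_schedule` hard-codes one schedule in which reads (1) and (4) take their values from writes that are still pending. Both function names and the register names `r1` to `r4` are hypothetical.

```python
from itertools import permutations

# Program order per process; reads carry a register name, writes a value.
P1 = [("r", "x", "r1"), ("r", "y", "r2")]   # (1), (3), kept in order by rfence
P2 = [("r", "y", "r3"), ("r", "x", "r4")]   # (4), (6), kept in order by rfence
P3 = [("w", "x", 1)]                        # (7)
P4 = [("w", "y", 1)]                        # (8)


def atomic_outcomes():
    """All (r1, r2, r3, r4) outcomes when writes hit the memory atomically."""
    procs = [P1, P2, P3, P4]
    labels = [p for p, ops in enumerate(procs) for _ in ops]
    results = set()
    for order in set(permutations(labels)):
        mem = {"x": 0, "y": 0}
        regs = {}
        pc = [0] * len(procs)
        for p in order:
            kind, var, arg = procs[p][pc[p]]
            pc[p] += 1
            if kind == "w":
                mem[var] = arg
            else:
                regs[arg] = mem[var]
        results.add((regs["r1"], regs["r2"], regs["r3"], regs["r4"]))
    return results


def rrwe_schedule():
    """One schedule where reads see other processes' uncommitted writes."""
    mem = {"x": 0, "y": 0}
    pending = {"x": 1, "y": 1}   # writes (7) and (8) issued, not yet committed
    r1 = pending["x"]            # (1) validated early by the pending write of P3
    r3 = pending["y"]            # (4) validated early by the pending write of P4
    r2 = mem["y"]                # (3) reads the main memory, still 0
    r4 = mem["x"]                # (6) reads the main memory, still 0
    mem.update(pending)          # both writes are finally committed
    return (r1, r2, r3, r4), mem
```

The IRIW outcome corresponds to the tuple `(1, 0, 1, 0)`. It is absent from `atomic_outcomes()`, while `rrwe_schedule()` produces it and ends in the memory state $x = y = 1$.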
An operational model: An operational model for NSW with the RRWE relaxation can be defined as an extension of the one defined in Sec. 4. The idea is to add to the event structure $S = (E, Y, \lambda)$ a mapping $\sigma : [n] \times [m] \to E \cup \{\bot\}$, with $\bot \notin E$, that associates with each process and variable either a pointer to some event of the structure, or $\bot$ when it is not defined. The pointer $\sigma(i,j)$ defines an event $e$ such that every future read operation of $P_i$ on the variable $x_j$ should not take its value from a write event that is $Y$-smaller than $e$. The intuition is that the validation of successive reads by the same process on the same variable should be done in a coherent way, i.e., the writes from which they read their values should occur in the same order. If $\sigma(i,j)$ points to some event $e$ in the event structure, then $e$ corresponds to the write event from which the last read performed by the process $P_i$ on the variable $x_j$ took its value. The fact that $\sigma(i,j) = \bot$ means that either $P_i$ has never read a value from $x_j$, or the last write operation on $x_j$ (issued by some other process) that has validated a read of $P_i$ has already been updated.
Then, to validate a read operation of $P_i$ on $x_j$ using RRWE, an event $e$ must be found such that (1) $e$ does not occur before the event $e' = \sigma(i,j)$ or any read/write event of $P_i$ on $x_j$, and (2) $e$ is the last write operation on $x_j$ of some $P_k$ different from $P_i$. If this is the case, then $\sigma(i,j)$ is updated to $e$ and constraints are added to ensure that (i) $e$ is executed after the event $e'$ and any read/write event of $P_i$ on $x_j$, and (ii) $e$ is executed before all writes/reads by $P_i$ on $x_j$ coming after the validated read operation. When a write event is executed and exits the event structure $S$, if this write event is pointed to by $\sigma(i,j)$, then $\sigma(i,j)$ is set to $\bot$. $P_i$ can perform an RLWE on $x_j$ only if the event associated with the last write operation of $P_i$ on $x_j$ does not occur before $\sigma(i,j)$. An atomic read-write operation $arw(i,j,d,d')$ can be executed only when no pending reads on the same variable still exist in the structure $S$, i.e., $\sigma(i,j) = \bot$. The reason is that operations on the same variable cannot be reordered. Finally, all the other operations are defined as in Sec. 4 while keeping the pointers unchanged.
As an example, consider the IRIW litmus test (Fig. 3). Starting from $(x = 0, y = 0)$ and an empty event structure $S$, the execution of the writes (7) and (8) by $P_3$ and $P_4$ adds two events $e_1$ and $e_2$ to $S$, labeled by $w(3, x, 1)$ and $w(4, y, 1)$, respectively. Then, $P_1$ and $P_2$ can execute their reads (1) and (4), which are validated using the RRWE relaxation, and set the pointers $\sigma(1, x)$ and $\sigma(2, y)$ to $e_1$ and $e_2$. At this point, (2) and (5) can be executed, and then the reads (3) and (6) can be validated w.r.t. the content of the main memory. Finally, the writes corresponding to $e_1$ and $e_2$ in $S$ are committed to the main memory, and this yields the memory state $(x = 1, y = 1)$. We can prove the following fact by a reduction from Post's Correspondence Problem:
Theorem 13. The state reachability problem for NSW ∪ {RRWE} is undecidable.
Conclusion and Future Work
We have sharpened the decidability boundary of the reachability problem for weak memory models by (1) introducing a model NSW which supports many important relaxations (delaying writes, performing reads early, allowing partial fences) yet has a decidable reachability problem, and (2) showing that the read-to-write relaxation and the non-atomic-stores relaxation are problematic (they cause undecidability) if added to TSO or NSW, respectively. Besides decidability, our work contributes to clarifying the effects and the power of common relaxations existing in weak memory models. It provides insight into the formal models needed to reason about these relaxations, which can be useful for other formal algorithmic verification approaches, including approximate analyses. Notice that the models we introduce in Sections 4 and 5 can also be considered in the case of an infinite data domain, and the relationship between them still holds in the same manner. It is only when we address the decidability issue that we need to restrict ourselves to a finite data domain. Future work may address the question of further sharpening the boundary by considering finer distinctions of the $r \to w$ relaxation, say by making it conditional on the absence of control or data dependencies. Moreover, we would like to explore the effect of non-atomic stores in more detail, such as whether it causes undecidability in weaker forms (e.g., if caused by static memory hierarchies) or if added to TSO rather than NSW.
