We present an algorithm for automatically checking robustness of concurrent programs against C/C++11 release/acquire semantics, namely verifying that all program behaviors under release/acquire are allowed by sequential consistency. Our approach reduces robustness verification to a reachability problem under (instrumented) sequential consistency. We have implemented our algorithm in a prototype tool called Rocker and applied it to several challenging concurrent algorithms. To the best of our knowledge, this is the first precise method for verifying robustness against a high-level programming language weak memory semantics.
Introduction
Release/acquire (RA), the fragment of the C/C++11 memory model [14] consisting of release stores, acquire loads and acquire-release read-modify-writes (RMWs), is a particularly useful and well-behaved weak memory model [36]. It is weaker than sequential consistency (SC) [40] and allows higher performance implementations. For example, x86-TSO [50] provides RA "for free" (its memory model is stronger than RA), and POWER [45] implements RA using 'lightweight sync' instructions rather than the more expensive 'full sync' instructions that are needed for SC.
At the same time, since RA is designed to support the common "message passing" synchronization idiom, the guarantees provided by RA suffice to implement various fundamental concurrent algorithms and synchronization mechanisms. In fact, many useful programs are actually robust against RA (the behaviors they exhibit under RA semantics are also allowed under SC), or can be made robust by placing a few SC-fences or by strengthening certain reads and writes to be RMW operations. Such modifications are sometimes necessary, with the best known example being Dekker's mutual exclusion algorithm, whose RA (non-SC) behavior is harmful for its correctness.
A natural question is thus whether one can automatically verify robustness against RA. Our main contribution is a decision procedure for this problem. Beyond its theoretical interest, we believe that this result can facilitate the development of concurrent algorithms for RA. In particular, if we are able to verify robustness against RA, various programs designed for SC may be directly ported and verified with more ordinary techniques assuming SC. Further, robustness of non-robust programs may be enforced (by placing SC-fences or RMW operations) and then established by verifying the robustness of the strengthened program.
To precisely state our result, it is crucial to carefully define what constitutes a behavior of a concurrent program under SC and under RA, which in turn determines what robustness means. Here, it is natural to use operational presentations of SC and RA as memory subsystems, formulated as labeled transition systems (for RA one could use the timestamp machine introduced in [33]). Then, program behaviors correspond to program states that are reachable when linked with each of the memory subsystems. More precisely, thinking about a concurrent program as a labeled transition system (whose states consist of the values of the thread-local program counters and variables), one may identify SC (RA) program behaviors with the set of states of the program that are reachable in its runs when synchronized with runs of the SC (RA) memory subsystems. This definition of program behavior leads to what is known as state robustness, and corresponds to typical safety properties verification using local assertions and global invariants that relate values of local variables and program counters.
Nevertheless, following [24, Thm. 2.12], it is easy to show that verifying state robustness against RA is as hard as the general state reachability problem under RA. The latter problem was recently shown to be undecidable [2]. Thus we resort to a more informative definition of a behavior, leading to a stronger notion of robustness. By doing so, we follow works on robustness against hardware models, TSO in particular (e.g., [17, 19]), where state robustness, like state reachability, is non-primitive recursive [11, 12]. For this matter, we use formulations of SC and RA as labeled transition systems whose states are (C/C++11-like) execution graphs. Execution graphs keep track of the full partially ordered history of the run (and thus in this presentation both SC and RA are infinite state systems), including the reads-from mapping (mapping each read to the write it read from) and the modification order (a total order on writes to the same location). The difference between SC and RA is then reduced to the transitions they allow. For instance, when adding a read to the execution graph, SC requires that it reads from the write that is maximal in the modification order, while RA places much weaker restrictions. Now, we can identify program behaviors with pairs of states of both the program and the memory subsystem that are reachable in their synchronized runs. We refer to the robustness notion induced by this definition as execution-graph robustness.
Our main contribution is a decision procedure that checks whether a given concurrent program is execution-graph robust against RA. To achieve this, we show how this verification problem can be reduced to a state reachability problem under a (finite state) instrumented SC memory. Roughly speaking, this memory keeps track of the relevant parts of the generated execution graph and uses this information for monitoring that RA execution graphs cannot diverge from SC ones. We prove that our approach is sound and precise. In particular, it follows that this verification problem for programs with bounded data domain is PSPACE-complete.
Our approach can be straightforwardly extended to handle C/C++11's non-atomic accesses. A data-race on a non-atomic access is considered an undefined behavior, and, thus, robustness of a program should also imply that it has no data-races on non-atomic accesses. Since robust programs have only SC executions, checking for data-races can be done using standard techniques. For completeness, we incorporated these checks in our method simultaneously to the verification of robustness against RA.
We have implemented our method in a prototype tool, called Rocker, using Spin [31] as a back-end model checker under SC. We used Rocker to verify the robustness of several concurrent algorithms, including adaptations of Peterson's mutual exclusion algorithm for RA [57], sequence locks [16] and user-mode read-copy-update (RCU) implementations [26]. In particular, we observe that execution-graph robustness is a useful property, allowing one, in many cases, to think in terms of SC while running on a weaker model.
The rest of this paper is structured as follows. In §2 we formally present the programming language and the notion of state robustness. In §3 we present the RA concurrency semantics. In §4 we define execution-graph robustness against RA. In §5 we present our decision procedure. In §6 we extend the decision procedure to support non-atomic accesses. In §7 we discuss the implementation and our experiments with it. In §8 we discuss related work. Finally, in §9 we conclude and outline directions for future work. Additional material and proofs for the claims of this paper are available in [1]. The prototype implementation and the examples it was tested on are available in the artifact accompanying this paper.
Preliminaries: State Robustness
Given a (binary) relation R, dom(R) and codom(R) denote its domain and codomain, and $R^?$, $R^+$, and $R^*$ denote its reflexive, transitive, and reflexive-transitive closures. The inverse of a relation R is denoted by $R^{-1}$, and the (left) composition of two relations $R_1$, $R_2$ is denoted by $R_1 ; R_2$. We denote by [A] the identity relation on a set A. In particular, $[A] ; R ; [B] = R \cap (A \times B)$. For a strict total order R, we write $R|_{imm}$ to denote the set of immediate R-edges, i.e., $R|_{imm} = R \setminus (R ; R)$.
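The relational notation above can be made concrete with a minimal sketch that represents a relation as a Python set of pairs. The function names (compose, identity, inverse, transitive_closure, immediate) are our own illustrative choices and do not come from the paper or from Rocker.

```python
def compose(r1, r2):
    """Left composition r1 ; r2 of two relations given as sets of pairs."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


def identity(s):
    """The identity relation [A] on a set A."""
    return {(a, a) for a in s}


def inverse(r):
    """The inverse relation R^{-1}."""
    return {(b, a) for (a, b) in r}


def transitive_closure(r):
    """R^+ computed by iterating composition until a fixpoint is reached."""
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def immediate(r):
    """R|imm, i.e. R minus (R ; R): the immediate edges of a strict total order."""
    return set(r) - compose(r, r)


# A strict total order on {1, 2, 3}.
lt = {(1, 2), (1, 3), (2, 3)}
# [A] ; R ; [B] with A = {1} and B = {3} keeps only the pairs in A x B.
restricted = compose(identity({1}), compose(lt, identity({3})))
```

Here immediate(lt) keeps only the edges 1 to 2 and 2 to 3, and restricted contains the single pair (1, 3), matching the identity $[A] ; R ; [B] = R \cap (A \times B)$.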
Programming Language
Let Val, Loc, Reg be finite sets of values, (shared) locations, and register names. We assume that Val contains a distinguished value 0, used as the initial value for all locations. Figure 1 presents our toy programming language. Its expressions are constructed from registers (local variables) and values. Instructions include assignments and conditional branching, as well as memory operations. Intuitively speaking, an assignment r := e assigns the value of e to register r (involving no memory access); if e goto n jumps to line n of the program iff the value of e is not 0; a write x := e stores the value of e in x; a read r := x loads the value of x to register r ; r := FADD(x, e) atomically increments x by the value of e and loads the old value of x to r ; and r := CAS(x, e R e W ) atomically loads the value of x to r , compares it to the value of e R , and if the two values are equal, replaces the value of x by the value of e W .
The less standard instructions wait and BCAS are blocking: wait(x = e) blocks the current thread until it manages to load the value of e from x; and BCAS(x, e R e W ) blocks the current thread until it performs a successful CAS of x from the value of e R (to the value of e W ). These instructions can be easily implemented using loops (e.g., L : r := x; if r ≠ e goto L with fresh r for wait(x = e)). Nevertheless, as we demonstrate at the end of this section, including them as primitives leads to a more expressive notion of robustness.
In turn, a sequential program S ∈ SProg is a finite map from N to instructions (we assume that 0 ∈ dom(S)), and a concurrent program P is a top-level parallel composition of sequential programs, defined as a mapping from a finite set Tid ⊆ N of thread identifiers to SProg. In our examples, we often write sequential programs as sequences of instructions delimited by line breaks, use '∥' for parallel composition, and refer to the program threads as τ 1 , τ 2 , ... following their left-to-right order in the program listing.
Exp ∋ e ::= r | v | e + e | e = e | e ≠ e | ...
Inst ∋ inst ::= r := e | if e goto n | x := e | r := x | r := FADD(x, e) | r := CAS(x, e e) | wait(x = e) | BCAS(x, e e)
Figure 1. Domains and programming language syntax.
From Programs to Transition Systems
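A labeled transition system (LTS) A over an alphabet Σ is a tuple $\langle Q, q_0, \rightarrow \rangle$, where Q is a set of states, $q_0 \in Q$ is the initial state, and $\rightarrow \subseteq Q \times \Sigma \times Q$ is a set of transitions. We write $\xrightarrow{\sigma}$ for the relation $\{\langle q, q' \rangle \mid \langle q, \sigma, q' \rangle \in \rightarrow\}$, and $\rightarrow$ for $\bigcup_{\sigma \in \Sigma} \xrightarrow{\sigma}$. We denote by A.Q, $A.q_0$ and $\rightarrow_A$ the three components of an LTS A.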
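Definition 2.1. A label l ∈ Lab takes one of the forms R(x, v R ), W(x, v W ), or RMW(x, v R , v W ), where x ∈ Loc and v R , v W ∈ Val. The functions typ, loc, val R , and val W return (when applicable) the type (R/W/RMW), location, read value, and written value of a given label.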
A sequential program S ∈ SProg induces an LTS over Lab ∪ {ϵ }, whose states are pairs ⟨pc, Φ⟩ where pc ∈ N (called program counter) and Φ : Reg → Val (called store, and extended to expressions in the obvious way). Its initial state is ⟨0, λr . 0⟩, and its transitions are given in Fig. 2 , following the informal description above of the language constructs. In the sequel we identify sequential programs with their induced LTSs (when writing, e.g., S.Q and − →S ).
Example 2.2. We present the LTS induced by a simple sequential program S. Let Val = {0, ... ,4}, Loc = {x } and Reg = {r }. We use + to denote the possibly overflowing sum (e.g., 2 + 4 = 1), and evaluate expressions of the form r < e to be 1 if Φ(r ) < Φ(e) and 0 otherwise.
A concurrent program P induces an LTS over the alphabet Tid × (Lab ∪ {ϵ}). Its states are tuples in $\prod_{\tau \in Tid} P(\tau).Q$; its initial state is $\lambda\tau.\ P(\tau).q_0$; and its transitions are interleaved transitions of P's components, given by:
In the sequel we identify concurrent programs with their induced LTSs. We often use vector notation (e.g., q) to denote states of concurrent programs.
Concurrent Systems and State Robustness
To give semantics to concurrent programs, we synchronize them with memory subsystems, as defined next. The labels here are pairs in N×Lab representing the thread identifier and the label of the performed operation. The most well-known memory subsystem is the one of sequential consistency, denoted here by SC. This memory subsystem simply tracks the most recent value written to each location. Formally, it is defined by SC.Q ≜ Loc → Val, SC.q 0 ≜ λx . 0, and − →SC is given by:
Note that SC is oblivious to the thread that takes the action (we have M
By synchronizing a concurrent program and a memory subsystem, we obtain a concurrent system as defined next. Definition 2.4. A concurrent system is a pair, denoted P M , where P is a concurrent program and M is a memory subsystem. A concurrent system P M induces an LTS over Tid × Lab whose states are pairs in P .Q × M.Q; its initial state is ⟨P .q 0 , M.q 0 ⟩; and its transitions are given by:
In the sequel we identify concurrent systems with their induced state machines.
We can now define state robustness against a given memory subsystem. This definition essentially identifies the behaviors of a program P under a memory subsystem M with the first projection of the states that are reachable in P M . Definition 2.5. A state q of a concurrent program P is reachable under a memory subsystem M if ⟨q, q M ⟩ is reachable in the concurrent system P M for some q M ∈ M.Q. Definition 2.6. A concurrent program P is state robust against a memory subsystem M if every reachable state of P under M is also reachable under SC.
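To illustrate Def. 2.5 for M = SC, the following minimal sketch enumerates the program states reachable in a tiny concurrent system by breadth-first search. It is not the authors' implementation: the instruction encoding (tuples tagged "write" and "read"), the restriction to constant writes and plain reads, the use of an empty initial register store, and all function names are simplifying assumptions made here for illustration only.

```python
from collections import deque


def freeze(d):
    """Turn a dict into a hashable, order-independent tuple."""
    return tuple(sorted(d.items()))


def sc_step(mem, instr, regs):
    """One step of the SC memory: mem maps each location to its latest value."""
    kind = instr[0]
    if kind == "write":              # ("write", x, v): store the constant v to x
        _, x, v = instr
        return {**mem, x: v}, regs
    if kind == "read":               # ("read", r, x): load the latest value of x into r
        _, r, x = instr
        return mem, {**regs, r: mem[x]}
    raise ValueError(f"unsupported instruction: {kind}")


def reachable_program_states(program, locs):
    """First projections of the states reachable in the system 'program under SC'."""
    init_threads = {tid: (0, freeze({})) for tid in program}
    init = (freeze(init_threads), freeze({x: 0 for x in locs}))
    seen, todo = {init}, deque([init])
    while todo:
        threads_f, mem_f = todo.popleft()
        threads, mem = dict(threads_f), dict(mem_f)
        for tid, (pc, regs_f) in threads.items():
            if pc >= len(program[tid]):
                continue             # this thread has terminated
            new_mem, new_regs = sc_step(mem, program[tid][pc], dict(regs_f))
            nxt_threads = {**threads, tid: (pc + 1, freeze(new_regs))}
            nxt = (freeze(nxt_threads), freeze(new_mem))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return {threads_f for threads_f, _ in seen}


def final_state(a, b):
    """Program state in which both threads finished with registers a and b."""
    return freeze({1: (2, freeze({"a": a})), 2: (2, freeze({"b": b}))})


# A two-thread program: each thread writes one location and reads the other.
SB = {
    1: [("write", "x", 1), ("read", "a", "y")],
    2: [("write", "y", 1), ("read", "b", "x")],
}
states = reachable_program_states(SB, ["x", "y"])
```

In this sketch, final_state(0, 0) is not a member of states, while the other three combinations of register values are. A program is state robust against M (Def. 2.6) exactly when the analogous set computed under M adds no new program states.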
We can now demonstrate the reason for including the blocking instructions wait and BCAS as primitives. Consider the following implementations of a "global barrier": (BAR) While the two programs are functionally equivalent, only the right program may be state robust against memory subsystems M that allow reading of "stale values" (such as RA and TSO). Indeed, the state in which both threads are in their last program line (pc 1 = pc 2 = 2) after reading 0 (Φ 1 (r 1 ) = Φ 2 (r 2 ) = 0) is reachable for the program on the left under such a memory subsystem, but clearly not under SC. In many cases, such robustness violations are not harmful for the safety of the program, as they only imply that under weak memory the program may remain longer waiting in the busy loop. A corresponding state is not reachable for the program on the right, and thus, using the blocking wait instruction, one may mask such benign robustness violations.
Similar benign robustness violations when using CAS, e.g., in spin loops, can be avoided using the BCAS primitive. Handling blocking instructions is essential to establish robustness of some interesting examples (e.g., RCU), without having more fences than actually necessary for program correctness.
Release/Acquire Semantics
In this section, we introduce the RA memory subsystem. RA's original presentation, as a fragment of C/C++11 [14], is declarative (a.k.a. axiomatic), i.e., it is formulated as a collection of formal consistency constraints that are used to filter candidate execution graphs. In our proofs we use such a presentation (see [1, §A]), but for the current purpose we need to define RA as an LTS. The declarative RA semantics can be easily "operationalized", as was done, e.g., in [54], so that consistent execution graphs are incrementally constructed. We will need this presentation as well (see §4.2), but since execution-graph semantics is often considered unintuitive, we present here an equivalent operational model, due to [33], which is perhaps more natural as an operational semantics for readers unfamiliar with the declarative style.
The memory in the RA operational model is a set of timestamped messages, which record all previously executed writes. Timestamps are taken to be natural numbers, Time ≜ N. A timestamp and a location uniquely identify a message (that is, there cannot coexist in memory two messages of the same location and timestamp). Each thread maintains its view of the memory, where T ∈ View is a function Loc → Time. The thread's view places lower bounds on the set of messages that the thread may read, as well as the timestamps it may pick when adding new messages to memory. Messages carry views as well, which record the thread's view at the time the message was added to memory. When a message is read, its view is incorporated into the thread view, which, roughly speaking, ensures that the thread becomes aware of whatever the message it reads was aware of.
Formally, a message m ∈ Msg is a tuple of the form ⟨x=v@t, T ⟩ where x ∈ Loc, v ∈ Val, t ∈ Time, and T ∈ View. The states of the RA memory subsystem are given by RA.Q ≜ P(Msg) × (N → View) (it consists of memory and thread views), with the initial state being RA.q 0 ≜ ⟨{⟨x=0@0, T 0 ⟩ | x ∈ Loc}, λn. T 0 ⟩, where T 0 ≜ λx . 0 denotes the initial view.
Figure 3. Transitions of the RA memory subsystem.
The transitions of RA are given in Fig. 3 , where ⊔ denotes pointwise maximum (
To perform a write to x, thread τ (1) picks a timestamp that is available for x in the current memory and is greater than the timestamp in τ 's view for x; (2) updates its view to include the new timestamp; (3) adds a message to the memory carrying τ 's (updated) view. In turn, to read from x, τ may pick any message of x in the memory whose timestamp is not lower than the timestamp in τ 's view for x. The view of the read message is incorporated in τ 's view. Finally, RMWs are obtained as an atomic combination of a read and a write, but crucially require that the timestamp of the added message is the successor of the timestamp of the read message. This guarantees that distinct RMWs never read from the same message (see Ex. 3.5 below).
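The informal description of the write, read, and RMW steps can be sketched as a small Python class. This is a hedged illustration rather than the transition rules of Fig. 3: the class name RAMemory, its method names, and the encoding of a message as a tuple (location, value, timestamp, view) are assumptions made here, and the timestamp of a write is chosen explicitly by the caller.

```python
class RAMemory:
    """A toy timestamp machine: a pool of messages plus one view per thread."""

    def __init__(self, locs, tids):
        zero = {x: 0 for x in locs}
        # Each message is (location, value, timestamp, view of the writer).
        self.msgs = [(x, 0, 0, dict(zero)) for x in locs]
        self.views = {tid: dict(zero) for tid in tids}

    def write(self, tid, x, v, t):
        view = self.views[tid]
        if t <= view[x]:
            raise ValueError("timestamp must exceed the thread's view of x")
        if any(m[0] == x and m[2] == t for m in self.msgs):
            raise ValueError("timestamp already used for x")
        view[x] = t                                 # (2) update the thread view
        self.msgs.append((x, v, t, dict(view)))     # (3) message carries the view

    def readable(self, tid, x):
        """Messages of x whose timestamp is not below the thread's view of x."""
        low = self.views[tid][x]
        return [m for m in self.msgs if m[0] == x and m[2] >= low]

    def read(self, tid, msg):
        x, v, _, msg_view = msg
        if msg not in self.readable(tid, x):
            raise ValueError("message is below the thread's view")
        view = self.views[tid]
        for y in view:                              # pointwise maximum of the views
            view[y] = max(view[y], msg_view[y])
        return v

    def rmw(self, tid, msg, v_new):
        """Read msg and write v_new at exactly the successor timestamp."""
        x, v_old, t, _ = msg
        if any(m[0] == x and m[2] == t + 1 for m in self.msgs):
            raise ValueError("successor timestamp already taken")
        self.read(tid, msg)
        self.write(tid, x, v_new, t + 1)
        return v_old


# Run 1: each thread writes one location and then reads the other one.
sb = RAMemory(["x", "y"], [1, 2])
sb.write(1, "x", 1, 1)
a = sb.read(1, sb.readable(1, "y")[0])      # only the initial message of y exists
sb.write(2, "y", 1, 1)
b_choices = sorted(m[1] for m in sb.readable(2, "x"))

# Run 2: thread 1 writes data and then a flag; thread 2 reads the flag first.
mp = RAMemory(["x", "y"], [1, 2])
mp.write(1, "x", 1, 1)
mp.write(1, "y", 1, 1)
flag = [m for m in mp.readable(2, "y") if m[1] == 1][0]
mp.read(2, flag)
x_choices = [m[1] for m in mp.readable(2, "x")]
```

In the first run thread 2 may still read the initial value of x, since nothing raised its view of x. In the second run, reading the flag message raises thread 2's view of x, so the initial message of x is no longer readable.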
Next, we provide simple examples of runs of concurrent programs under the RA memory subsystem, and analyze their robustness. When writing views, we often write only their non-zero elements.
Example 3.1 (Store buffer). The following program is the simplest example of a weak behavior allowed by RA:
Here and henceforth, we use comment annotations to denote a particular program state. In this example, the annotations denote the state in which both program counters point to the end of the program, and the values of a and b are both 0.
To reach this state under RA (cf. Def. 2.5), τ 1 may run first: add ⟨x=1@1, [x → 1]⟩ to the memory (this does not affect the view of τ 2 ), and read the initial message ⟨y=0@0, T 0 ⟩. Then, τ 2 adds ⟨y=1@1, [y → 1]⟩ to the memory, and it is free to read the initial message ⟨x=0@0, T 0 ⟩. Under SC, this state is clearly unreachable, and thus, this program is not state robust against RA (cf. Def. 2.6).
Example 3.2 (Message passing).
RA is designed to support "flag-based" synchronization. That is, the following annotated behavior is disallowed under RA:
Indeed, τ 2 can read 1 for y, only after τ 1 executed the two writes adding messages m x = ⟨x=1@t x , [x → t x ]⟩ and m y = ⟨y=1@t y , [x → t x , y → t y ]⟩ to the memory with t x , t y > 0. When reading m y , τ 2 increases its view of x to be t x , and then, since t x > 0, it is unable to read the initial message of x, and must read m x . Hence, it can be easily seen that this program is state robust against RA. This example also shows that a stronger definition of robustness, which requires that P SC and P RA have the same traces, is too strong to be of any use. Indeed, the transition ⟨τ 2 , R(y, 0)⟩ is allowed under RA also after τ 1 performed its two writes, and thus, such stronger condition would deem this program as non-robust.
Example 3.3 (Independent reads of independent writes).
Unlike TSO, RA is non-multi-copy-atomic. That is, different threads may observe different stores in different orders. Thus, RA allows the following behavior:
Indeed, nothing in RA forbids a run in which the two writers finished their execution, and then τ 2 picks the message written by τ 1 for x and the initialization message for y, while τ 3 picks the message written by τ 4 for y and the initialization message for x. The corresponding program state is unreachable under SC, and, thus, this program is not state robust against RA. (It is, nevertheless, robust against TSO.)
Example 3.4. Unlike the SRA model [36] , under RA, write steps do not have to choose a globally maximal timestamp. Thus, the following outcome is allowed [56] , and the program is not state robust against RA (it is robust against TSO):
Indeed, to execute both writes, τ 1 may add the messages , and τ 2 's view for x is 1 and it may read m x 1 . Example 3.5. A crucial property of RMWs is that two (successful) RMWs never read from the same message. Indeed, this allows the standard implementation of lock acquisition using RMWs. This property is guaranteed in RA by forcing RMWs to use t + 1 as the timestamp for the added message, where t is the timestamp of the message that was read. To see how this works consider the following (robust) program (the annotated behavior is disallowed under RA):
W.l.o.g., if τ 1 runs first, it reads from the initialization message ⟨x=0@0, T 0 ⟩ (it is the only message of x in the memory), and it is forced to add a message with timestamp 1, namely ⟨x=1@1, [x → 1]⟩. When τ 2 runs, it may not read from the initialization message, as it will again require adding a message of x with timestamp 1, but such a message already exists in memory. Thus, it may only read from the message that was added by τ 1 , and the CAS will fail.
Example 3.6. RMW operations to a distinguished otherwise-unused location can force synchronization, practically serving as SC-fences [36, 37] (in fact, this is how we encode SC-fences in our programming language). To see this, consider the following modification of the SB program:
Here, the annotated program behavior is disallowed under RA, and, consequently, this program is state robust against RA. Indeed, suppose, w.l.o.g., that τ 1 executes the FADD first and adds the message m = ⟨f =0@1, [x → t x , f → 1]⟩ (where t x > 0). When τ 2 executes its FADD, it has to read m, and update its view of x to t x . Then, when it reads x it may not pick the initial message. It is crucial to use the same location in both FADDs: unlike TSO, under RA a single barrier (equivalently, a single FADD instruction to an otherwise-unused location) has no effect.
Finally, note that SC is clearly stronger than RA: Lemma 3.7. If a state q of a concurrent program P is reachable under SC, then it is also reachable under RA.
Proof. RA can simulate SC: in read (and RMW) steps, read the message with the maximal timestamp; and in write steps, pick t to be greater than the maximal timestamp of the messages of the written location. □
Execution-Graph Robustness
Execution-graph robustness is based on different presentations of the SC and RA memory subsystems, which we denote by SCG and RAG, whose states are execution graphs capturing (partially ordered) histories of executed actions. The fact that the states of SCG and RAG are the same mathematical objects allows us to easily compare program behaviors under the two memory subsystems. In the rest of this section, we present SCG and RAG, and define execution-graph robustness. First, we define execution graphs, starting with their nodes, called events. Definition 4.1. An event e ∈ Event is a tuple ⟨τ , s, l⟩ ∈ (N ⊎ {⊥}) × N × Lab, where τ is a thread identifier (or ⊥ for initialization events), s is a serial number inside each thread (0 for initialization events), and l is a label (as defined in Def. 2.1). The functions tid, sn, and lab return the thread identifier, serial number, and label of an event. The functions typ, loc, val R , and val W are lifted to events in the obvious way. We use R, W, RMW for the following sets of events:
We employ subscripts and superscripts to restrict sets of events to certain location and thread identifier (e.g.,
Definition 4.2. The set Init of initialization events is given by Init ≜ {⟨⊥, 0, W(x, 0)⟩ | x ∈ Loc}. We say that a set E ⊆ Event is initialized if Init ⊆ E, and tid(e) ≠ ⊥ and sn(e) ≠ 0 for every e ∈ E \ Init.
Our representation of events induces a sequenced-before partial order on events, where e 1 < e 2 holds iff (e 1 ∈ Init and e 2 ∉ Init) or (e 1 ∉ Init, e 2 ∉ Init, tid(e 1 ) = tid(e 2 ), and sn(e 1 ) < sn(e 2 )). That is, initialization events precede all non-initialization events, while events of the same thread are ordered according to their serial numbers.
In turn, an execution graph consists of a set of events, a reads-from mapping that determines the write event from which each read reads its value, and a modification order which totally orders the writes to each location. In terms of the model in §3, the modification order represents the timestamp order on messages to each location. Definition 4.3. An execution graph G ∈ EGraph is a tuple ⟨E, rf , mo⟩ where:
1. E is an initialized finite set of events.
2. rf , called reads-from, is a relation on E satisfying:
• If ⟨w, r ⟩ ∈ rf then w ∈ W, r ∈ R, loc(w) = loc(r ), val W (w) = val R (r ), and w ≠ r .
• w 1 = w 2 whenever ⟨w 1 , r ⟩, ⟨w 2 , r ⟩ ∈ rf (each read reads from at most one write).
• E ∩ R ⊆ codom(rf ) (each read reads from some write).
3. mo, called modification order, is a disjoint union of relations {mo x } x ∈Loc , such that each mo x is a strict total order on E ∩ W x .
Figure 4. Illustrations of runs: (i) of SCG for the MP program (Ex. 3.2); and (ii) of RAG for the SB program (Ex. 3.1). Each illustration is followed by the corresponding run of SCM for monitoring robustness as described in §5 (deltas from the previous state are colored).
We denote the components of G by G.E, G.rf and G.mo. We use G.po to denote the restriction of the order on events to
Next, we define a general execution-graph-based memory subsystem, called FG (standing for "Free Graphs"). Later, the memory subsystems SCG and RAG are defined as restrictions of FG. To define FG, we use the following notation that extends a given graph G with a new event e, placed last in its thread, and either reading from a designated event w or placed as the immediate successor of w in the modification order. When e is an RMW event, it is both reading from w, and placed as the immediate successor of w in the modification order. This is in accordance with the usual atomicity restriction in declarative semantics, according to which RMWs read from their immediate mo-predecessors. Notation 4.4. For G ∈ EGraph, τ ∈ N, l ∈ Lab and w ∈ W, add(G, τ , l, w) denotes the triple ⟨E ′ , rf ′ , mo ′ ⟩ defined as follows, where e = ⟨τ , max{sn(e) | e ∈ G.E τ } + 1, l⟩:
G.mo otherwise Definition 4.5. The initial execution graph G 0 is given by G 0 ≜ ⟨Init, ∅, ∅⟩. The memory subsystem FG is defined by FG.Q ≜ EGraph, FG.q 0 ≜ G 0 , and − →FG is defined as follows:
The conditions in FG's step ensure that add(G, τ , l, w) is indeed an execution graph: mo should only relate events in W of the same location; and rf goes from W to R only between events of the same location and matching values. Below, we refer to the write w in such steps as the predecessor write.
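The graph extension add(G, τ, l, w) and the side conditions of FG's step can be sketched as follows. This is an illustrative reading of the prose description, not the paper's formal definition: the encoding of an event as a tuple (tid, sn, label), of a label as (typ, loc, val_r, val_w), the use of None for the initialization thread ⊥, the choice of serial number 1 for the first event of a thread, and the names Graph, initial_graph, fg_enabled, and add are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    E: frozenset    # events (tid, sn, label); tid is None for initialization
    rf: frozenset   # pairs (write, read)
    mo: frozenset   # pairs (earlier write, later write) on the same location


def initial_graph(locs):
    """G_0: only the initialization writes, with empty rf and mo."""
    init = frozenset((None, 0, ("W", x, None, 0)) for x in locs)
    return Graph(init, frozenset(), frozenset())


def fg_enabled(G, label, w):
    """Side conditions of an FG step: w writes loc(l), with a matching value for reads."""
    typ, loc, val_r, _ = label
    w_typ, w_loc, _, w_val = w[2]
    if w not in G.E or w_typ not in ("W", "RMW") or w_loc != loc:
        return False
    return typ == "W" or w_val == val_r


def add(G, tid, label, w):
    """Append a new event to thread tid, using w as the predecessor write."""
    typ = label[0]
    # Assumption: the first event of a thread receives serial number 1.
    sn = max((e[1] for e in G.E if e[0] == tid), default=0) + 1
    e = (tid, sn, label)
    rf, mo = set(G.rf), set(G.mo)
    if typ in ("R", "RMW"):
        rf.add((w, e))                                   # e reads from w
    if typ in ("W", "RMW"):
        before = {a for (a, b) in G.mo if b == w} | {w}  # w and its mo-predecessors
        after = {b for (a, b) in G.mo if a == w}         # mo-successors of w
        mo |= {(a, e) for a in before} | {(e, b) for b in after}
    return Graph(G.E | {e}, frozenset(rf), frozenset(mo))


G0 = initial_graph(["x"])
w0 = (None, 0, ("W", "x", None, 0))
G1 = add(G0, 1, ("W", "x", None, 1), w0)   # thread 1 writes right after w0
G2 = add(G1, 2, ("W", "x", None, 2), w0)   # thread 2 is inserted between w0 and e1
```

Note that the second write is placed immediately after w0 and therefore before the write of thread 1 in mo. FG permits this; SCG and RAG then restrict which predecessor writes may be chosen.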
Next, we present the memory subsystems SCG and RAG. 
The Memory Subsystem SCG
The steps of SCG are uniformly given by:
SCG steps require the predecessor write to be $G.w^{\max}_{loc(l)}$: added write events are placed last in G.mo, and read events read from the latest added write. Figure 4 illustrates an example of a run of the MP program (Ex. 3.2) under SCG. Lemma 4.6. SCG and SC have the same traces.
Proof (outline). Define the memory of a given
It is easy to show that SC.q 0 = M(G 0 ) and {⟨M(G), G⟩ | G ∈ EGraph} is a bisimulation relation between SC and SCG. □
The Memory Subsystem RAG
To define the transitions of RAG, we use the following standard derived "happens-before" relation: $G.hb \triangleq (G.po \cup G.rf)^+$.
Roughly speaking, G.hb abstracts RA's execution order: any run of the timestamp machine in §3 follows some linearization of hb, and, conversely, all linearizations of hb induce runs of the timestamp machine. Using hb, the steps of RAG are uniformly given by:
The first two conditions in the step are the general conditions of FG (see Def. 4.5). The third and fourth conditions restrict the choice of the predecessor write w. Unlike in SCG, w is not necessarily $G.w^{\max}_{loc(l)}$. Instead, it is subject to two conditions. First, the thread that takes the action must not have observed an mo-later write, where observed writes are writes that have a (possibly empty) hb-path to (some event of) the thread ($w \notin dom(G.mo ; G.hb^? ; [G.E^\tau])$). Referring to the timestamp machine, this is in accordance with the choice of the message to read in read steps and the new added messages in write steps (their timestamp cannot be smaller than the last timestamp observed by the thread for the location). Second, when writing (by a write or an RMW), the predecessor write w cannot be the immediate mo-predecessor of some (other) RMW event ($w \notin dom(G.mo|_{imm} ; [RMW])$). In the timestamp machine, this corresponds to the fact that the timestamp of the message added by an RMW must be the immediate successor of the timestamp of the message read by the RMW. Note that in graphs generated by RAG, RMWs always read from their immediate mo-predecessor ($G.rf ; [RMW] = G.mo|_{imm} ; [RMW]$), which is the usual atomicity condition in declarative weak memory semantics.
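The two RAG-specific restrictions on the predecessor write can be sketched as a Python predicate over a graph given by its event set and its rf and mo relations. This is only an illustration under stated assumptions: happens-before is taken to be the transitive closure of sequenced-before together with reads-from, events are encoded as tuples (tid, sn, label) with tid None for initialization events, labels as (typ, loc, val_r, val_w), and the name rag_allows is hypothetical. The general FG conditions (matching location and value) are not rechecked here.

```python
def compose(r1, r2):
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


def tclosure(r):
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def po(E):
    """Sequenced-before: init events first, then per-thread serial order."""
    def before(a, b):
        if a[0] is None:
            return b[0] is not None
        return a[0] == b[0] and a[1] < b[1]
    return {(a, b) for a in E for b in E if before(a, b)}


def rag_allows(E, rf, mo, tid, label, w):
    """Third and fourth conditions of a RAG step for predecessor write w."""
    thread_events = {e for e in E if e[0] == tid}
    hb_refl = tclosure(po(E) | set(rf)) | {(e, e) for e in E}
    # Writes that are mo-before some write the thread has already observed.
    overwritten = {a for (a, b) in mo
                   if any((b, e) in hb_refl for e in thread_events)}
    if w in overwritten:
        return False
    if label[0] in ("W", "RMW"):
        mo_imm = set(mo) - compose(mo, mo)
        # w must not already be the immediate mo-predecessor of an RMW.
        if any(a == w and b[2][0] == "RMW" for (a, b) in mo_imm):
            return False
    return True


INIT_X = (None, 0, ("W", "x", None, 0))
INIT_Y = (None, 0, ("W", "y", None, 0))
W_X = (1, 1, ("W", "x", None, 1))
R_Y0 = (1, 2, ("R", "y", 0, None))
W_Y = (2, 1, ("W", "y", None, 1))
E = {INIT_X, INIT_Y, W_X, R_Y0, W_Y}
rf = {(INIT_Y, R_Y0)}
mo = {(INIT_X, W_X), (INIT_Y, W_Y)}
# Thread 2 has not observed W_X, so it may still read x from INIT_X.
stale_read_ok = rag_allows(E, rf, mo, 2, ("R", "x", 0, None), INIT_X)
```

In this graph thread 2 has no hb-path from the write of x by thread 1, so the stale initialization write remains an admissible predecessor, whereas SCG would insist on the mo-maximal write W_X.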
It is easy to see that SCG is more restrictive than RAG (and thus, the run of SCG for the MP program in Fig. 4 is also allowed under RAG):
as the predecessor write. By definition we have $w \in G.W_{loc(l)}$, $w \notin dom(G.mo ; G.hb^? ; [G.E^\tau])$, and $w \notin dom(G.mo|_{imm} ; [RMW])$. □
Figure 4 illustrates an example of a run of the SB program (Ex. 3.1) under RAG. The last step there is disallowed by SCG, since the predecessor write is not the mo-maximal one.
Next, we state the equivalence between RAG and RA. A proof outline is provided in [1, §B].
Execution-Graph Robustness
Next, we define execution-graph robustness and show that it implies state robustness. Definition 4.9. A concurrent program P is execution-graph robust against RA if every reachable state ⟨q, G⟩ in the concurrent system P RAG is also reachable in P SCG . Proposition 4.10. If P is execution-graph robust against RA then it is state robust against RA.
Proof. Let q be a state of P that is reachable under RA. Let ⟨M, T ⟩ ∈ RA.Q such that ⟨q, ⟨M, T ⟩⟩ is reachable in P RA . By Lemma 4.8, ⟨q, G⟩ is reachable in P RAG for some G. Since P is execution-graph robust against RA, it follows that ⟨q, G⟩ is reachable in P SCG . By Lemma 4.6, ⟨q, M⟩ is reachable in P SC for some M ∈ SC.Q, and so q is reachable under SC. □
Execution-graph robustness, as we demonstrate below, is not overly strong for establishing state robustness in a variety of concurrent algorithms. In particular, the state robust litmus tests mentioned in §3 (MP, 2RMW, SB+RMWs) are also execution-graph robust.
Verifying Execution-Graph Robustness
In this section, we present our approach to the verification of execution-graph robustness against RA. First, Thm. 5.1 below reduces this problem to reachability of certain configurations in P SCG . To state this theorem, we require two more standard derived relations in execution graphs:
The from-read relation, fr, relates every read event r to all writes that are mo-later than the write that r reads from (identity is subtracted to avoid self loops in RMW events). The SC-happens-before relation, G.hb SC , following [51] , abstracts SC's execution order: to yield certain execution G, the SCG memory subsystem must follow G.hb SC . Thus, runs of SCG can yield an execution graph G iff G.hb SC is irreflexive.
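The characterization of SCG-generated graphs by irreflexivity of the SC-happens-before relation can be checked mechanically on small graphs. The sketch below assumes the usual definitions, namely that fr is the composition of the inverse of rf with mo minus the identity, and that the SC-happens-before relation is the transitive closure of the union of po, rf, mo, and fr; these formulas, the event encoding (tid, sn, label) with tid None for initialization, and the function names are assumptions of this illustration.

```python
def compose(r1, r2):
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


def tclosure(r):
    closure = set(r)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def po(E):
    """Sequenced-before: init events first, then per-thread serial order."""
    def before(a, b):
        if a[0] is None:
            return b[0] is not None
        return a[0] == b[0] and a[1] < b[1]
    return {(a, b) for a in E for b in E if before(a, b)}


def fr(rf, mo):
    """From-read: each read is ordered before the writes mo-after its source."""
    rf_inv = {(r, w) for (w, r) in rf}
    return {(a, b) for (a, b) in compose(rf_inv, mo) if a != b}


def hb_sc(E, rf, mo):
    return tclosure(po(E) | set(rf) | set(mo) | fr(rf, mo))


def sc_consistent(E, rf, mo):
    """True iff the assumed SC-happens-before relation is irreflexive."""
    return all(a != b for (a, b) in hb_sc(E, rf, mo))


INIT_X = (None, 0, ("W", "x", None, 0))
INIT_Y = (None, 0, ("W", "y", None, 0))
W_X = (1, 1, ("W", "x", None, 1))
R_Y0 = (1, 2, ("R", "y", 0, None))
W_Y = (2, 1, ("W", "y", None, 1))
R_X0 = (2, 2, ("R", "x", 0, None))
R_X1 = (2, 2, ("R", "x", 1, None))
mo = {(INIT_X, W_X), (INIT_Y, W_Y)}

# Store-buffering graph in which both reads read from the initialization writes.
weak = sc_consistent({INIT_X, INIT_Y, W_X, R_Y0, W_Y, R_X0},
                     {(INIT_Y, R_Y0), (INIT_X, R_X0)}, mo)
# Variant in which the second thread reads the value written by the first.
strong = sc_consistent({INIT_X, INIT_Y, W_X, R_Y0, W_Y, R_X1},
                       {(INIT_Y, R_Y0), (W_X, R_X1)}, mo)
```

In the first graph the two from-read edges close a cycle through both threads, so the check fails; in the second graph no such cycle exists.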
Theorem 5.1. Let P be a concurrent program. Call a tuple ⟨q, G, τ , l, w⟩ ∈ P .Q × EGraph × Tid × Lab × Event a nonrobustness witness for P if the following hold:
• ⟨q, G⟩ is reachable in the concurrent system P SCG .
• q enables ⟨τ , l⟩ (in the LTS induced by P).
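• G enables the transition ⟨τ , l⟩ in RAG with predecessor write w, where $w \neq G.w^{\max}_{loc(l)}$.
• There is an hb SC -path in G from $G.w^{\max}_{loc(l)}$ to (some event of) thread τ .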
Then, P is execution-graph robust against RA iff there does not exist a non-robustness witness for P.
Theorem 5.1 reduces execution-graph robustness of a program P to the existence of a reachable state in the concurrent system P SCG that satisfies certain properties. More precisely, P is not robust iff there exist a reachable state ⟨q, G⟩ of P SCG and a transition ⟨τ , l⟩ that is enabled in q, such that: (a) there is an hb SC -path in G from $G.w^{\max}_{loc(l)}$ to (some event of) thread τ ; and (b) G enables the transition ⟨τ , l⟩ in RAG with a predecessor write $w \neq G.w^{\max}_{loc(l)}$.
The proof is given in [1, §A]. Roughly speaking, we utilize purely declarative presentations of SCG and RAG, and show that the existence of a non-robustness witness allows RA executions to diverge w.r.t. SC ones, and that given a “minimal” such divergence, one can construct a non-robustness witness. The latter generally has a similar structure to proofs establishing the DRF (data-race-freedom) guarantee [6, 29].
We note that DRF for RA can be easily obtained as a corollary of Thm. 5.1. Indeed, if a program P is race-free (under SC), then all reachable states ⟨q, G⟩ in $P_{SCG}$ satisfy G.mo ∪ G.fr ⊆ G.hb. It follows that $G.hb_{SC} \subseteq G.hb$, and thus, $G.w^{max}_{loc(l)}$ may serve as the predecessor write in RAG transitions from G. Therefore, P cannot have a non-robustness witness, and Thm. 5.1 ensures that it is execution-graph robust.
Similarly, it follows that a program with no concurrent writes under SC cannot have weak behaviors allowed by RA (as was established in [7] for a certain variant of causal consistency). Indeed, if P has no concurrent writes (under SC), then all reachable states ⟨q, G⟩ in $P_{SCG}$ satisfy $[W] ; G.hb_{SC} \subseteq G.hb$ (use hb to reach the last write in the $hb_{SC}$-path; from that point on, no mo and fr edges are used). Again, Thm. 5.1 ensures that P is execution-graph robust.
It remains to show that the condition in Thm. 5.1 can be automatically checked. Since SCG is not finite (execution graphs of programs with loops may grow unboundedly), we cannot naively explore traces of $P_{SCG}$. The key idea is to define a finite memory subsystem, which we call SCM (for SC with Monitors), that simulates SCG (so that they have the same traces) and precisely tracks the properties of SCG's execution graphs that are needed for monitoring the above condition.
Next, we gradually present SCM's states, which are composed of eight components in total, and the transitions between them. Figure 4 provides detailed examples of runs of SCM for the MP and SB programs, together with the corresponding runs of SCG. Below, we use I as a metavariable for states of SCM and write I (G) for the SCM state that corresponds to an execution graph G.
Memory (I .M).
The basic building block for SCM is the (finite) memory subsystem SC whose states are simple location-value mappings (see §2.3). Thus, a state I of SCM has a memory component, denoted I.M, which is a function from Loc to Val storing the value written by $G.w^{max}_x$ for every location x. Formally, we have

$$I(G).M = \lambda x.\ val_W(G.w^{max}_x)$$
The transitions of SCM are subject to the same constraints as SC with respect to this component. The other components of the states of SCM are used to track more properties of G, and do not restrict SCM's traces. Thus, the fact that SCM has the same traces as SCG directly follows from Lemma 4.6.
$hb_{SC}$-tracking ($I.V_{SC}$, $I.M_{SC}$, $I.W_{SC}$). For checking condition (a) above, we need to know for every thread τ and location x whether τ is “$hb_{SC}$-aware” of $w^{max}_x$. To include and maintain this information in a state I of SCM, we use three components, denoted by $I.V_{SC}$, $I.M_{SC}$ and $I.W_{SC}$.
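The first, $I.V_{SC}$, is a function in Tid → P(Loc) tracking precisely this property. Formally, we have:

$$I(G).V_{SC} = \lambda\tau.\ \{x \mid G.w^{max}_x \text{ is an initialization write, or } \langle G.w^{max}_x, e\rangle \in G.hb_{SC}^{?} \text{ for some } e \in G.E^{\tau}\}$$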
Having $x \in I(G).V_{SC}(\tau)$ means that τ is $hb_{SC}$-aware of $w^{max}_x$, i.e., $G.w^{max}_x$ is an initialization write (of which all threads are aware) or $\langle G.w^{max}_x, e\rangle \in G.hb_{SC}^{?}$ for some $e \in G.E^{\tau}$. In turn, to maintain $I.V_{SC}$, we include two additional components, $I.M_{SC}$ and $I.W_{SC}$, both of which are functions in Loc → P(Loc). Consider first an SCG-step that adds a write (or RMW) event w to location x in thread τ. Following SCG, w is placed last in mo, which means that every event accessing x becomes $hb_{SC}$-before w (writes to x have mo to w and reads from x have fr to w). In turn, the thread τ in which w is added will have (additional) $hb_{SC}$-paths from every $w^{max}_y$ that previously had an $hb_{SC}$-path to some event accessing x.

Figure 5. Maintaining $V_{SC}$, $M_{SC}$ and $W_{SC}$ in SCM transitions.
To properly reflect this in $I.V_{SC}(\tau)$, we maintain $I.M_{SC}$ that tracks for every x ∈ Loc the set of locations y such that $w^{max}_y$ has an $hb_{SC}$-path to some event accessing x. In steps that write to x in thread τ, we incorporate $I.M_{SC}(x)$ into $I.V_{SC}(\tau)$. Second, similarly, when an SCG-step adds a read event r of location x in thread τ, it reads from $w^{max}_x$, and so we have $\langle w^{max}_x, r\rangle \in hb_{SC}$. In turn, thread τ will have (additional) $hb_{SC}$-paths from every $w^{max}_y$ that previously had an $hb_{SC}$-path to $w^{max}_x$. Accordingly, the $I.W_{SC}$ component tracks for every x ∈ Loc the set of locations y such that $G.w^{max}_y$ has an $hb_{SC}$-path to $w^{max}_x$. In steps that read from x in thread τ, we incorporate $I.W_{SC}(x)$ into $I.V_{SC}(\tau)$. Note that, while $y \in I.M_{SC}(x)$ iff $w^{max}_y$ has an $hb_{SC}$-path to some event accessing x, we have $y \in I.W_{SC}(x)$ iff $w^{max}_y$ has an $hb_{SC}$-path to $w^{max}_x$ (equivalently, to some write event accessing x). This implies, in particular, that we always have $I.W_{SC}(x) \subseteq I.M_{SC}(x)$.
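Formally, the meaning of these two “helper” components is given by:

$$I(G).M_{SC} = \lambda x.\ \{y \mid \langle G.w^{max}_y, e\rangle \in G.hb_{SC}^{?} \text{ for some event } e \text{ of } G \text{ accessing } x\}$$

$$I(G).W_{SC} = \lambda x.\ \{y \mid \langle G.w^{max}_y, G.w^{max}_x\rangle \in G.hb_{SC}^{?}\}$$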
Initially, we take $SCM.q_0.V_{SC} = \lambda\tau.\ Loc$ and $SCM.q_0.M_{SC} = SCM.q_0.W_{SC} = \lambda x.\ \{x\}$. Figure 5 presents the maintenance of $I.V_{SC}$, $I.M_{SC}$ and $I.W_{SC}$ (primed components denote the corresponding components after the transition mentioned in the column headers). In particular, note that when a write (or RMW) to x is performed it becomes the new $w^{max}_x$ and it has no $hb_{SC}$-paths to other events in the execution graph. Thus, we remove x from $I.V_{SC}(\pi)$ for every thread π except for the one that performed the write, as well as from $I.M_{SC}(y)$ and $I.W_{SC}(y)$ for every $y \neq x$. In addition, when accessing location x in thread τ, $I.M_{SC}(x)$ inherits $I.V_{SC}(\tau)$ (every event that had an $hb_{SC}$-path to thread τ now has an $hb_{SC}$-path to an event accessing x), and when writing to location x in thread τ, $I.W_{SC}(x)$ inherits $I.V_{SC}(\tau)$ (every event that had an $hb_{SC}$-path to thread τ now has an $hb_{SC}$-path to $w^{max}_x$).

RAG-tracking ($I.V$, $I.W$, $I.V_{RMW}$, $I.W_{RMW}$). It remains to extend the instrumentation, so that we can check for every thread τ and label l, whether the transition ⟨τ, l⟩ is enabled in RAG with a predecessor write that is not $w^{max}_{loc(l)}$ (condition (b) above). For this matter, we include four additional components in the state I of SCM. Two of them, $I.V$ and $I.V_{RMW}$, are functions in (Tid × Loc) → P(Val) and are the ones used to check the above condition. The other two, $I.W$ and $I.W_{RMW}$, are functions in (Loc × Loc) → P(Val) and, as before, are used to properly maintain $I.V$ and $I.V_{RMW}$.
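The maintenance of the three hb SC -tracking components described above can be sketched in code. The following is a minimal sketch derived from the prose description only (it is not a transcription of Figure 5), assuming a state is a dictionary of Python sets; the function names, the dictionary keys, and the access kinds `"R"`, `"W"`, `"RMW"` are hypothetical choices made for this illustration.

```python
import copy


def initial_state(threads, locs):
    """Initial monitors: every thread is aware of every initialization write."""
    return {
        "V_SC": {t: set(locs) for t in threads},
        "M_SC": {x: {x} for x in locs},
        "W_SC": {x: {x} for x in locs},
    }


def step(state, tau, kind, x):
    """Update the monitors for an access of the given kind to x by thread tau."""
    new = copy.deepcopy(state)
    v, m, w = new["V_SC"], new["M_SC"], new["W_SC"]
    if kind in ("W", "RMW"):
        # The new write is mo-last: tau inherits awareness from every
        # event that accessed x, and is aware of its own write.
        v[tau] |= m[x] | {x}
        # The new maximal write to x has no outgoing hb_SC-paths yet.
        for pi in v:
            if pi != tau:
                v[pi].discard(x)
        for y in m:
            if y != x:
                m[y].discard(x)
                w[y].discard(x)
        # Whatever reached tau now reaches the new maximal write to x.
        w[x] |= v[tau]
    else:
        # A read reads from the maximal write to x.
        v[tau] |= w[x]
    # Whatever reached tau now reaches an event accessing x.
    m[x] |= v[tau]
    return new


def run(threads, locs, trace):
    """Apply a sequence of (thread, kind, location) steps to the initial state."""
    state = initial_state(threads, locs)
    for tau, kind, x in trace:
        state = step(state, tau, kind, x)
    return state
```

For SB, after thread `t1` writes x and reads y and thread `t2` writes y, this sketch reports that `t2` is hb SC -aware of the latest write to x, which is the situation in which condition (a) holds for the read of x by `t2`. For MP, `t2` becomes aware of both locations only after it reads y from the writer thread.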
To understand these components, recall the transitions of RAG in §4.2.
Read: Consider first a read transition ⟨τ, l⟩ with $l = R(x, v_R)$. By definition, an execution graph G enables ⟨τ, l⟩ with a predecessor write $w \in G.W_x$ if $val_W(w) = v_R$ and $w \notin dom(G.mo ; G.hb^{?} ; [G.E^{\tau}])$. To be able to check this condition, we use $I.V$ to track for every τ ∈ Tid and x ∈ Loc the set of values that are written by some $w \in G.W_x$ that is not $G.w^{max}_x$ and satisfies $w \notin dom(G.mo ; G.hb^{?} ; [G.E^{\tau}])$. Then, to check condition (b) above, we check whether $v_R \in I.V(\tau)(x)$. In other words, $I.V(\tau)(x)$ tracks the set of values that can be read by thread τ from x under RAG, excluding the case of reading from $w^{max}_x$ (which is also allowed by SCG). As before, to maintain $I.V$, we use another component in I. When thread τ reads (or performs an RMW to) x in a transition of SCG, it induces an mo ; hb-path to thread τ from any write that had an mo ; hb-path to $w^{max}_x$. Thus, after such a transition, $I.V(\tau)(y)$ should be restricted to values written by some $w \in G.W_y$ such that $\langle w, G.w^{max}_x\rangle \notin G.mo ; G.hb^{?}$. Accordingly, $I.W$ tracks for every pair x, y ∈ Loc the set of values that are written by some write $w \in G.W_y$ that is not $G.w^{max}_y$ and satisfies $\langle w, G.w^{max}_x\rangle \notin G.mo ; G.hb^{?}$.
Write and RMW: A write (or RMW) transition is similar, but it is subject to an additional constraint in RAG: the predecessor write w should not be an mo-immediate predecessor of an RMW event in G (equivalently, w should not be read by an RMW event). For this condition, we use $I.V_{RMW}$, that, as $I.V$, tracks for every τ ∈ Tid and x ∈ Loc the set of values that are written by some $w \in G.W_x$ that is not $G.w^{max}_x$ and satisfies $w \notin dom(G.mo ; G.hb^{?} ; [G.E^{\tau}])$, but further requires that $w \notin dom(G.mo|_{imm} ; [RMW])$. To maintain $I.V_{RMW}$, we use $I.W_{RMW}$, which is similar to $I.W$ with the same additional condition on w (i.e., $w \notin dom(G.mo|_{imm} ; [RMW])$).
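Formally, the meaning of these components is given by:

$$I(G).V(\tau)(x) = val_W(W \setminus dom(R ; [G.E^{\tau}]))$$

$$I(G).V_{RMW}(\tau)(x) = val_W(W \setminus (dom(R ; [G.E^{\tau}]) \cup dom(R_{RMW})))$$

$$I(G).W(y)(x) = val_W(W \setminus dom(R ; [\{G.w^{max}_y\}]))$$

$$I(G).W_{RMW}(y)(x) = val_W(W \setminus (dom(R ; [\{G.w^{max}_y\}]) \cup dom(R_{RMW})))$$

where $W = G.W_x \setminus \{G.w^{max}_x\}$, $R = G.mo ; G.hb^{?}$ and $R_{RMW} = G.mo|_{imm} ; [RMW]$ (the function $val_W$ is extended to sets of events in the obvious way). Initially, since each location has only one write in the initial graph, these four components all return the empty set of values. Figure 6 presents our maintenance of these components.

Putting all pieces together, the states of SCM are tuples $I = \langle M, V_{SC}, M_{SC}, W_{SC}, V, W, V_{RMW}, W_{RMW}\rangle$, with $SCM.q_0 = I(G_0)$.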
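To make the intended meaning of I.V concrete, it can be computed directly from an explicit execution graph. The following is a minimal sketch, assuming a graph is given as a dictionary with hypothetical keys `writes` (event to location and value), `thread` (event to thread identifier), and the relations `po`, `rf`, `mo` as sets of pairs; it evaluates the set of values written by non-mo-maximal writes to x that lie outside the domain of mo ; hb? ; [E τ]. SCM itself never builds such graphs; this is only a reference computation for small examples.

```python
def compose(r1, r2):
    """Relational composition r1 ; r2 on sets of pairs."""
    return {(a, d) for (a, b) in r1 for (c, d) in r2 if b == c}


def transitive_closure(rel):
    """Smallest transitive relation containing rel."""
    closure = set(rel)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def stale_readable_values(graph, tau, x):
    """Values thread tau may read from x under RAG from a non-maximal write."""
    mo = graph["mo"]
    hb = transitive_closure(graph["po"] | graph["rf"])
    writes_x = {e for e, (loc, _) in graph["writes"].items() if loc == x}
    # The mo-maximal write to x has no outgoing mo edge.
    w_max = next(w for w in writes_x if not any(a == w for (a, _) in mo))
    events_tau = {e for e, t in graph["thread"].items() if t == tau}
    # Events that reach some event of tau via hb? (reflexive closure of hb).
    reach_tau = events_tau | {a for (a, b) in hb if b in events_tau}
    # dom(mo ; hb? ; [E^tau]): writes overwritten by something tau has observed.
    blocked = {a for (a, b) in mo if b in reach_tau}
    return {graph["writes"][w][1] for w in writes_x - {w_max} - blocked}


WRITES = {"ix": ("x", 0), "iy": ("y", 0), "wx": ("x", 1), "wy": ("y", 1)}
MO = {("ix", "wx"), ("iy", "wy")}

# SB prefix: t1 wrote x and read y (initial value); t2 wrote y.
sb = {
    "writes": WRITES,
    "thread": {"ix": "init", "iy": "init", "wx": "t1", "ry": "t1", "wy": "t2"},
    "po": {(i, e) for i in ("ix", "iy") for e in ("wx", "ry", "wy")} | {("wx", "ry")},
    "rf": {("iy", "ry")},
    "mo": MO,
}

# MP prefix: t1 wrote x then y; t2 read y from t1's write.
mp = {
    "writes": WRITES,
    "thread": {"ix": "init", "iy": "init", "wx": "t1", "wy": "t1", "ry": "t2"},
    "po": {(i, e) for i in ("ix", "iy") for e in ("wx", "wy", "ry")} | {("wx", "wy")},
    "rf": {("wy", "ry")},
    "mo": MO,
}
```

In the SB prefix, thread `t2` may still read the initial value of x under RAG, so the computed set is `{0}`; in the MP prefix the initial write to x is mo-before a write that `t2` has observed through the read of y, so the set is empty.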
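Our main result easily follows from Thm. 5.1 and Lemma 5.2:

Theorem 5.3. P is execution-graph robust against RA iff for every reachable state ⟨q, I⟩ in $P_{SCM}$, the following hold for every ⟨τ, l⟩ that is enabled in q and satisfies $loc(l) \in I.V_{SC}(\tau)$, where x = loc(l) and $v_R = val_R(l)$:
• if typ(l) = R then $v_R \notin I.V(\tau)(x)$.
• if typ(l) = W then $I.V_{RMW}(\tau)(x) = \emptyset$.
• if typ(l) = RMW then $v_R \notin I.V_{RMW}(\tau)(x)$.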
PSPACE-completeness (assuming a bounded data domain as we defined in §2) easily follows:
Corollary 5.4. Verifying execution-graph robustness against RA for a given input program is PSPACE-complete.
Proof (outline). For the upper bound, we can (gradually) guess a run of $P_{SCM}$ and check the conditions of Thm. 5.3 at each step. The memory required for storing a state is polynomial in the size of P. The lower bound is established in the same way as the one in [19] for TSO, by a reduction from reachability under SC (which is PSPACE-complete [35]): a program can be made robust by adding fences (as in Ex. 3.6) between every two instructions, and an artificial robustness violation (e.g., in the form of SB) can be added when the target state is reached. □

Note that for verifying robustness we generate one reachability query, and since we only monitor traces, we do not add additional non-determinism w.r.t. reachability under SC. However, the instrumentation in SCM creates dependencies between instructions (e.g., both a write to x and a write to $y \neq x$ require updating the bit representing $y \in M_{SC}(x)$), which may hinder partial order reduction.
Abstract Value Management
The $V$ and $V_{RMW}$ (and, consequently, $W$ and $W_{RMW}$) components in SCM states are often “too elaborate” for what is actually needed to verify robustness. For example, for a program P without CAS, wait and BCAS instructions, whether $P_{RAG}$ enables a transition or not does not depend on the value being read. In such a case, we only need to check whether $I.V(\tau)(x)$ is empty (for reads) and whether $I.V_{RMW}(\tau)(x)$ is empty (for writes and RMWs). More generally, we only need to track values that may affect $P_{RAG}$ transitions (e.g., block a thread from executing or make an RMW succeed). Next, we use this observation to reduce the metadata size in SCM. To do so, we first define critical values.

Definition 5.5. A value v ∈ Val is called a critical value of x ∈ Loc in a sequential program S if at least one of the following holds for some q ∈ S.Q: (1) q enables R(x, v) but there exists v′ such that q does not enable R(x, v′) and $RMW(x, v′, v_W)$ for every $v_W$; (2) q enables $RMW(x, v, v_W)$ for some $v_W \in Val$ but there exists v′ such that q does not enable $RMW(x, v′, v′_W)$ for every $v′_W$. We call v a critical value of x in a (concurrent) program P if it is a critical value of x in P(τ) for some τ ∈ Tid, and denote by Val(P, x) the set of critical values of x in P.
For instance, if wait(x = 1) is included in a program P then 1 is a critical value of x in P. Similarly, r := CAS(x, 0 ⇒ 1) (e.g., for implementing spin locks) makes 0 a critical value of x. A program without CAS, wait and BCAS instructions has no critical values. On the other hand, in a program including an instruction like r := CAS(x, r′ ⇒ e) (where the expected value is not a constant), we have Val(P, x) = Val (in which case, our proposed optimization does not change anything). Now, the $V$, $V_{RMW}$, $W$, $W_{RMW}$ components can be restricted to record information only about the critical values (so, we have $V, V_{RMW} : Tid \to \prod_{x \in Loc} P(Val(P, x))$ and $W, W_{RMW} : Loc \to \prod_{x \in Loc} P(Val(P, x))$), and additional components $CV, CV_{RMW} : Tid \to P(Loc)$ and $CW, CW_{RMW} : Loc \to P(Loc)$ (disjunctively) summarize all non-critical values. The latter are formally interpreted as follows (using the interpretations above, before their restriction to critical values):

$$I(G).CV(\tau) = \{x \mid I(G).V(\tau)(x) \not\subseteq Val(P, x)\} \qquad I(G).CV_{RMW}(\tau) = \{x \mid I(G).V_{RMW}(\tau)(x) \not\subseteq Val(P, x)\}$$

$$I(G).CW(y) = \{x \mid I(G).W(y)(x) \not\subseteq Val(P, x)\} \qquad I(G).CW_{RMW}(y) = \{x \mid I(G).W_{RMW}(y)(x) \not\subseteq Val(P, x)\}$$
That is, CV(τ) (respectively, $CV_{RMW}(\tau)$) contains all locations y for which there exists at least one non-critical value that is written by a non-mo-maximal write to y that can serve as the predecessor write in an RAG read (respectively, write or RMW) step. The maintenance of these components (given in [1, §C]) is straightforwardly derived from the maintenance of $V$, $V_{RMW}$, $W$, $W_{RMW}$.
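The examples of critical values given above suggest a simple syntactic way to collect them. The following is a minimal sketch of such a syntactic over-approximation; it is not Definition 5.5 itself (which is stated on the states of the program's LTS), and the tuple encoding of instructions (`("wait", x, c)`, `("cas", x, expected)`, `("bcas", x, expected)`, with `None` standing for a non-constant expected value) is a hypothetical representation chosen for this illustration.

```python
def critical_values(program, all_values):
    """Collect, per location, the values that wait/CAS/BCAS instructions test for.

    program: maps a thread identifier to a list of instruction tuples.
    all_values: the (finite) data domain Val.
    """
    critical = {}
    for instructions in program.values():
        for instr in instructions:
            kind = instr[0]
            if kind not in ("wait", "cas", "bcas"):
                # Plain reads and writes never depend on a specific value.
                continue
            x, expected = instr[1], instr[2]
            bucket = critical.setdefault(x, set())
            if expected is None:
                # Non-constant expected value: every value may matter.
                bucket |= set(all_values)
            else:
                bucket.add(expected)
    return critical


# A spin lock acquired with CAS(lock, 0 => 1) and a flag awaited with wait(flag = 1).
example = {
    "t1": [("cas", "lock", 0), ("write", "data", 1), ("write", "flag", 1)],
    "t2": [("wait", "flag", 1), ("read", "data", None)],
}
```

On `example` with data domain {0, 1, 2}, the sketch yields 0 as the only critical value of `lock` and 1 as the only critical value of `flag`, while `data` has none.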
In turn, three conditions are added to Thm. 5.3:
• if typ(l) = R and $v_R \notin Val(P, x)$ then $x \notin I.CV(\tau)$.
• if typ(l) = W then $x \notin I.CV_{RMW}(\tau)$.
• if typ(l) = RMW and $v_R \notin Val(P, x)$ then $x \notin I.CV_{RMW}(\tau)$.

This construction results in smaller instrumentation (and fewer operations to maintain the instrumentation), where the size (number of bits) of the monitoring metadata is

$$3|Tid||Loc| + 4|Loc|^2 + 2(|Tid| + |Loc|) \sum_{x \in Loc} |Val(P, x)|.$$
In particular, for programs without CAS, wait and BCAS instructions the metadata size is $3|Tid||Loc| + 4|Loc|^2$, while in the worst case (when all values are critical) we will have $|Loc|(|Tid| + 2|Loc| + 2|Val|(|Tid| + |Loc|))$. In some of the examples we checked, this optimization dramatically reduces the verification time (e.g., the 'ticketlock4' example in §7 is 9 times faster). In addition, it may be beneficial for programs with infinite data domains but finite sets of critical values, where the (generally undecidable) reachability problem in $P_{SCM}$ can be solved using abstraction techniques. (This is left for future work.)
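The two size formulas are easy to evaluate for concrete parameters. The following is a small sketch that simply transcribes them; the function and parameter names are hypothetical, and `critical_counts` is assumed to list |Val(P, x)| for each location x.

```python
def metadata_bits(n_threads, n_locs, critical_counts):
    """Bits of monitoring metadata with abstract value management.

    critical_counts: iterable with the number of critical values of each location.
    """
    base = 3 * n_threads * n_locs + 4 * n_locs ** 2
    per_value = 2 * (n_threads + n_locs) * sum(critical_counts)
    return base + per_value


def worst_case_bits(n_threads, n_locs, n_values):
    """Bits of monitoring metadata when every value is critical."""
    return n_locs * (
        n_threads + 2 * n_locs + 2 * n_values * (n_threads + n_locs)
    )
```

For two threads and three locations without any critical value, the first formula gives 3·2·3 + 4·9 = 54 bits; a single critical value on one location adds 2·(2 + 3) = 10 bits.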
Extension with Non-atomic Accesses
In this section, we describe an extension of our approach to handle C/C++11's non-atomic accesses, typically used for “data variables” (unlike “synchronization variables”). A data-race on a non-atomic access is considered an undefined behavior, and thus non-atomic accesses allow very efficient implementation. In turn, robustness of a program should imply that it has no data-races on non-atomic accesses.
For this extension, we assume that $Loc = Loc_{ra} \uplus Loc_{na}$ is composed of a set of release/acquire locations and a disjoint set of non-atomic locations (we do not consider release/acquire and non-atomic accesses to the same location). The programming language of Fig. 1 is extended with instructions $x_{na} := e$ and $r := x_{na}$ for $x_{na} \in Loc_{na}$, e ∈ Exp, and r ∈ Reg. The rest of the instructions only apply to locations in $Loc_{ra}$ (in particular, there are no RMW instructions for non-atomic locations).
The SC and SCG systems ignore the type of the location, while RAG is extended to detect races on non-atomic locations. We refer to the extended memory subsystem as RAG+NA. The states of RAG+NA are execution graphs (as in RAG) as well as a special state, denoted by ⊥, that the system enters once a race is detected. To define RAG+NA's transitions, hb is modified so that only rf-edges on release/acquire accesses synchronize, i.e., hb is the transitive closure of po together with the rf-edges between accesses to locations in $Loc_{ra}$.
Now, the transitions of RAG+NA extend the transitions of RAG (which govern the release/acquire locations) with the following steps for non-atomic accesses:
Thus, for a thread to successfully perform a non-atomic access to location $x_{na}$, it must have observed (in hb) the mo-maximal (equivalently, hb-maximal) write to $x_{na}$. Otherwise, the system moves to the ⊥ state. Execution-graph robustness against RAG+NA is defined just as against RA (cf. Def. 4.9), and it implies state robustness against RAG+NA. Since $P_{SCG}$ never reaches states of the form ⟨q, ⊥⟩, execution-graph robustness against RAG+NA implies that such states are not reachable in $P_{RAG+NA}$. Next, Theorem 5.1 is extended as follows:

Definition 6.1. A state q of a concurrent program is racy if q enables both ⟨τ, $l_1$⟩ and ⟨π, $l_2$⟩ for some $\tau \neq \pi$ and $l_1, l_2 \in Lab$ with $loc(l_1) = loc(l_2) \in Loc_{na}$ and $W \in \{typ(l_1), typ(l_2)\}$.

Theorem 6.2. A concurrent program P is execution-graph robust against RAG+NA iff there does not exist a non-robustness witness ⟨q, G, τ, l, w⟩ for P with $loc(l) \in Loc_{ra}$ (as defined in Thm. 5.1), and there does not exist a reachable state ⟨q, G⟩ in $P_{SCG}$ such that q is racy.
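The racy-state condition of Definition 6.1 is a purely local check on the transitions enabled in a program state. The following is a minimal sketch, assuming the enabled transitions are given as triples of thread identifier, access type (`"R"` or `"W"`) and location; this encoding and the function name are hypothetical.

```python
from itertools import combinations


def is_racy(enabled, na_locs):
    """True iff two distinct threads enable accesses to the same non-atomic
    location and at least one of the two accesses is a write."""
    for (t1, typ1, loc1), (t2, typ2, loc2) in combinations(list(enabled), 2):
        if (
            t1 != t2
            and loc1 == loc2
            and loc1 in na_locs
            and "W" in (typ1, typ2)
        ):
            return True
    return False
```

For example, a state enabling a write to a non-atomic location `d` in one thread and a read of `d` in another is racy, whereas two concurrent reads of `d`, or concurrent accesses to a release/acquire location, are not.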
The SCM system can be easily adapted for monitoring the conditions of Thm. 6.2. The memory component of its states is extended in the obvious way to track the latest value of non-atomic locations as well. Since non-atomic instructions do not affect inter-thread synchronization, the monitoring instrumentation in §5 requires no change (it only applies to the locations in $Loc_{ra}$). Since SCM and SCG have the same traces, the additional condition about races can be checked on SCM runs.

Figure 7. Experiments with Rocker.
Implementation and Evaluation
We implemented our algorithm in a prototype tool called Rocker (for RObustness ChecKER), which uses Spin [31] as a back-end model checker. The implementation and the examples it was tested on are available in the artifact accompanying this paper. Rocker takes as input a program in our toy programming language, and converts it to Promela code (Spin's input language) with appropriate instrumentation and assertions that check for execution-graph robustness against RA. Thus, our implementation is actually using the SC memory subsystem, and implements the monitoring of SCM by instrumenting the input program. When a robustness violation is detected, one can use Spin's output to see the trace leading to this violation. In addition, since in any case we explore traces of the input program under SC, Rocker allows one to include standard assertions, which will be verified as well by the model checker.
We performed a series of experiments on litmus tests, examples from [5, 17] , and additional concurrent algorithms. Figure 7 summarizes the running times on some of the examples when executed on an Intel ® Core™ i5-6300U CPU @ 2.40GHz GNU/Linux machine. Columns 'Res', '#T', and 'LoC' respectively present the robustness of the input program, the number of threads, and total number of lines of code. Column 'Time' shows the verification time (in seconds), and the percentage of that time that was dedicated to compiling Spin's verifier (using gcc with -O2). The latter often completely dominates the total time. Generating the input for Spin is negligibly fast (< 0.1s), as well as Spin's verifier generation in C (< 0.2s). Column 'SC' provides, for the sake of comparison, the verification duration using Spin with no instrumentation whatsoever. In this mode, only the assertions in the input are verified assuming SC semantics.
For some of the examples Fig. 7 provides several versions of the same algorithm: the '-sc' suffix denotes an original algorithm as designed for SC; the '-tso' suffix denotes its strengthening with fences to ensure robustness against TSO; and, when needed, the '-ra' suffix is a further strengthening that ensures robustness against RA. For instance, it is well known that Peterson's mutual exclusion algorithm ('peterson-sc') is not robust against relaxed memory. For TSO, placing one fence in each thread suffices to ensure robustness. For RA, more fences are needed ('peterson-ra'). Alternatively, as noted in [57], one may replace certain write operations by RMWs ('peterson-dmitriy'). The choice of these writes is critical: Rocker correctly identified that a different version is incorrect ('peterson-bratosz'). Other algorithms, which were designed with relaxed memory considerations in mind, e.g., Seqlocks [16] and a user-level RCU [26], do not require fences at all. Note that we have also verified robustness of a more involved RCU implementation ('rcu-offline'), where the writer is not a unique thread, and threads may declare that they are going offline, stop the communication with the writer and return online later on.
Finally, column 'Trencher' provides the (total) running time of Trencher, a tool for verifying robustness against TSO [17], which also uses Spin for model checking. (A newer version of Trencher that implements its own model checker crashed on some of these examples.) Their notion of robustness is similar to execution-graph robustness, but it should be noted that Rocker and Trencher solve different problems: TSO and RA are fundamentally different models, where RA is weaker and non-multi-copy atomic. Thus, this comparison is of limited significance (see also §8). The input language is different as well. In particular, Trencher does not handle blocking instructions. For this reason, Trencher reports some examples as non-robust (marked with ⋆), while no additional fences are needed for them to function correctly under TSO. We note that Trencher can be used in parallel to Rocker for verifying robustness against RA: a violation detected by Trencher implies non-robustness against RA.

53–55], which require manual proofs, and (bounded) model checkers, e.g., [3, 34, 48], which provide limited guarantees for programs with loops. These methods can be used to verify programs that are not necessarily robust against RA. The verification problem of programs with loops under RA (i.e., given a program P and a state q ∈ P.Q, is q reachable under the concurrent system $P_{RA}$?) was recently shown to be undecidable [2]. (For TSO, this problem is decidable but non-primitive recursive [11, 12].) As shown in [24, Thm. 2.12], this immediately entails the undecidability of state robustness.
Finally, robustness was also studied, e.g., in [15, 20, 28, 47] , in the context of distributed systems, where SC is replaced by serializability. Unlike the current work, these works are focused on practical over-approximations, and do not provide provably precise general verification methods.
Conclusion
We have presented a method to verify execution-graph robustness against release/acquire concurrency semantics, in particular, establishing the decidability of this problem. Our method works by exploring only runs of the program under SC while monitoring certain properties for the detection of robustness violations. We believe that our result can play an important role in verification and development of concurrent algorithms for weak memory semantics, alongside with other existing methods.
In the future, we plan to study the applicability of our approach for different and extended models, such as RC11 [39], WeakRC11 [34], SRA [36], as well as transactional consistency models, such as PSI [52]. In addition, we are interested in deriving efficient and precise methods for automatic robustness enforcement (such as fence insertion), as were developed before for hardware models, as well as in handling parametrized programs with an arbitrary number of threads.
