We present an approach for automatic verification and fence inference in concurrent programs running under relaxed memory models. Verification under relaxed memory models is a hard problem. Given a finite state program and a safety specification, verifying that the program satisfies the specification under a sufficiently relaxed memory model is undecidable. For stronger models, the problem is decidable but has non-primitive recursive complexity.
Introduction
In the early 1990s, features like out-of-order execution and multilevel caches became common in commodity CPU architectures. These features drastically improved performance in a programmer-transparent fashion: their introduction did not change the semantics of existing (sequential) programs. With the advent of symmetric multiprocessing and multi-core CPUs, preserving the illusion of in-order memory operations became more difficult. One possible approach is to keep the illusion (known as sequential consistency [20] in a multi-processor setting) and sacrifice performance. The other is to define architectural relaxed memory models (RMMs) that allow improved performance at the cost of weaker semantics.
A relaxed memory model allows observable executions that cannot occur if instructions running on different processors are simply interleaved. As a result, a program that runs correctly on the sequentially-consistent model may violate its specification when running on a relaxed memory model. In practice, relaxed memory models are used by all major CPU designs, among them Intel x86 [33], SPARC [36] and PowerPC [16]. To enforce order between memory operations, these architectures provide special memory fence instructions. Informally, inserting a fence instruction prohibits certain re-orderings, thus restricting the set of relaxed executions. For example, a fence inserted between two store instructions will force these two stores to appear to execute in-order. It is the programmer's (or the compiler's) responsibility to correctly place fences.
Program Verification and Fence Inference
To place fences, the programmer must first be able to know whether the program is correct given a fence placement. In other words, she needs to be able to verify the program. However, verification of concurrent programs is not an easy task even under the sequentially consistent memory model. Relaxed memory models make reasoning about program correctness, both manually and automatically, even harder, as they require reasoning about executions that are not sequentially consistent. Even for finite-state programs, automatic verification under relaxed memory models is a hard problem. Given a finite-state program and a safety specification, verifying that the program satisfies the specification under a sufficiently relaxed memory model (e.g., SPARC RMO) is undecidable. For somewhat stronger memory models (e.g., SPARC TSO, PSO), the problem is decidable but has non-primitive recursive complexity [2].
Even given a verification procedure, inserting fences is still nontrivial. On the one hand, since each fence incurs a heavy performance penalty, the programmer should not insert fences unless they are strictly required for correctness. On the other hand, missing a fence may lead to subtle concurrency bugs.
Store Buffers Relaxed memory models allow two basic relaxations of sequential consistency: memory operations may be reordered with respect to each other, and stores may be executed non-atomically across processors [1]. Some relaxations can be naturally modeled using store buffers [25], emulating the actual hardware implementation. In store-buffer based semantics, one or more FIFO queues ("store buffers") are associated with each processor. Memory writes are split into two phases: a "store" phase and a "flush" phase. The store phase adds a value into a local store buffer, and the flush phase propagates the stored value to main memory (or directly to other processors).
The basic hurdle for automatic verification under those models is that store buffers can grow without a bound, even for programs that are otherwise finite state. To enable automatic program verification and fence inference on relaxed memory models, we need a technique that can represent those buffers in a bounded way.
Existing Approaches Existing approaches either employ under-approximations such as bounded checking [5] and testing [8], or side-step the problem by focusing on a restricted class of programs. For instance, [32] considers data-race free programs, and [28] focuses on programs free from a particular ("triangular") type of data races. Bounded checking and testing are valuable, but cannot establish that the program satisfies its specification on all executions. When used for automatic fence inference, bounded techniques (e.g., [5, 18]) might miss required fences. Targeting only race-free programs simplifies the problem by allowing consideration of only sequentially-consistent executions. However, it is often unrealistic, as many concurrent programs contain "benign data races" [27]. For example, some mutual exclusion algorithms, such as Dekker's algorithm, contain benign triangular data races. Thus we cannot apply the results of [28], even if we restrict attention only to the TSO model.
In contrast to these approaches, our technique over-approximates possible program behaviors and is able to verify programs executing under RMMs. When used for fence inference, our technique is guaranteed to produce all required fences.
Our Approach We present a technique for automatic verification and fence inference in finite-state programs running on relaxed memory models. Based on abstract interpretation [9], we introduce a family of partial-coherence abstractions for store buffers. Our abstractions provide a bounded representation for (potentially) unbounded store buffers. We use the term abstract memory model to refer to a memory model that uses an abstract structure to represent store buffers. Our approach provides a range of abstractions with varying precision, enabling successive abstraction refinements of a given abstract memory model $M_A$.
Given a program $P$, a specification $S$ and an abstract memory model $M_A$, the question we are trying to answer is whether $P \models_{M_A} S$, that is, whether the program satisfies the specification under the given abstract memory model. When $P \not\models_{M_A} S$, it is possible to:
• Refine the abstraction: refine $M_A$ and try to find a more precise memory model $M'_A$ under which $P \models_{M'_A} S$.
• Restrict the program: find a program $P'$ obtained from $P$ by adding memory fences that restrict the permitted re-orderings during execution, such that $P' \models_{M_A} S$.
In this work, we focus on restricting the program by inserting fences, and show how using different abstract memory models affects the precision of the resulting fence placement. We focus on a family of abstractions for the TSO and PSO memory models, as those models are implemented in common hardware (e.g., Intel x86 [29], SPARC) and have simple concrete operational semantics.
Partial-Coherence Abstractions
The challenge for abstractions of store buffers is to provide a bounded representation that (partially) preserves the following three key properties (described in more detail in Section 2):
• Intra-process memory coherence: a process should only see its own most recently written value to a variable.
• Inter-process memory coherence: a process should observe values written by another process in the order they were written.
• Fence semantics: a fence executed by a process writes to memory the most recent value written by the process.
The main idea behind our abstractions is to preserve only a limited amount of order inherent in a store buffer. In particular, our abstract buffer representation preserves information about: i) the most recent store to a buffer, and ii) the order between a bounded number of the oldest stores in the buffer. While inter-process coherence is only partially preserved, we show this choice is particularly effective for verifying concurrent algorithms running on relaxed memory models (see Section 4 for details).
Figure 2. Peterson's Algorithm with fences that guarantee mutual exclusion under the PSO memory model. Fences were automatically inferred by our approach.
Main Contributions
The main contributions of this paper are as follows:
• We describe a family of parametric abstractions that enable automatic verification of safety properties for programs under relaxed memory models.
• When a program violates its specification but can be fixed by adding fences, our approach can automatically infer a correct fence placement that is optimal under the given abstraction.
• We have implemented our approach in a tool called BLENDER and applied it for verification and fence inference of several challenging concurrent algorithms.
Overview

Motivating Example -Peterson's Algorithm
Fig. 1 shows the code of Peterson's mutual exclusion algorithm [31]. In this algorithm, two processes repeatedly enter and exit a critical section. We would like to show that the algorithm satisfies the mutual exclusion property: it is impossible for both processes to be in the critical section simultaneously. Indeed, Peterson's algorithm satisfies mutual exclusion under a sequentially-consistent (SC) memory model. Unfortunately, under relaxed memory models, such as "Partial Store Order" (PSO), the algorithm does not satisfy the property. To see why, we first give a brief explanation of the PSO memory model.
The Partial Store Order (PSO) Memory Model PSO is one of three memory consistency models defined for the SPARC architecture [36]. In PSO, a store to some memory location l may become visible to other processes only after the storing process executes later loads and stores to different memory locations.
The PSO model can be formalized operationally by associating with each processor a set of FIFO queues (store buffers), one for each variable, as shown in Fig. 3 . The informal semantics of store buffers for PSO can be summarized as follows:
• Store buffering: A store issued by process pi to variable x is written into the store buffer associated with (pi, x).
• Store forwarding: A load by pj from y is performed from its local store buffer (associated with (pj, y)) if it is not empty, or from the global memory otherwise.
• Flushing: The oldest value stored in the buffer may be written to the global memory and removed from the buffer at nondeterministic points in the execution.
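The three informal rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the formal semantics given later: the class name `PSOMemory` and its method names are hypothetical, and flushes are triggered explicitly by the caller instead of nondeterministically.

```python
from collections import deque


class PSOMemory:
    """Toy PSO memory: one FIFO store buffer per (process, variable) pair."""

    def __init__(self, initial):
        self.memory = dict(initial)   # global memory
        self.buffers = {}             # (pid, var) -> deque, oldest value first

    def store(self, pid, var, value):
        # Store buffering: the value goes to the local buffer, not to memory.
        self.buffers.setdefault((pid, var), deque()).append(value)

    def load(self, pid, var):
        # Store forwarding: read the newest buffered value if the buffer is
        # non-empty, otherwise read global memory.
        buf = self.buffers.get((pid, var))
        if buf:
            return buf[-1]
        return self.memory[var]

    def flush(self, pid, var):
        # Flushing: move the oldest buffered value to global memory.
        buf = self.buffers.get((pid, var))
        if not buf:
            raise ValueError("nothing to flush")
        self.memory[var] = buf.popleft()


mem = PSOMemory({"ent0": False, "ent1": False})
mem.store(0, "ent0", True)
seen_by_p0 = mem.load(0, "ent0")        # forwarded from p0's own buffer
seen_by_p1 = mem.load(1, "ent0")        # p1 still reads global memory
mem.flush(0, "ent0")
seen_after_flush = mem.load(1, "ent0")  # the store is now globally visible
```

In the usage lines, process 0 sees its own store immediately, while process 1 keeps reading the old value until the flush happens.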
The Problem: Delayed Stores Under the PSO model, the following execution of Peterson's algorithm is possible:
• p0 runs alone until line 11; however, the store to ent0 in line 3 is written only to the buffer and is not flushed.
• p1 runs. Since the store to ent0 is delayed, it is not visible to p1. p1 enters the critical section, and mutual exclusion is violated.
Peterson's algorithm relies on ordering of loads and stores for synchronization. It requires p0's store to ent0 to be visible to p1 before p0 loads ent1 and, symmetrically, p1's store to ent1 to be visible to p0 before p1 loads ent0. When the underlying memory model does not preserve this order, Peterson's algorithm, as it appears in Fig. 1, does not satisfy mutual exclusion.
Restoring Order with Fences To allow programmer control over ordering in relaxed memory models, processors provide special memory fence instructions. Intuitively, the semantics of a fence are that memory operations issued before the fence must take global effect before memory operations after the fence may execute. In general, there are different kinds of fences (e.g., store-load, store-store) that impose order between different types of operations. A store-load fence executed by a processor forces all stores issued by that processor to complete before any new loads by the same processor start. In this paper we assume the model provides the strongest type of fence (a "full memory barrier") that restricts reordering of any memory operations. In Fig. 2 the fences in lines 4 and 6 prevent the erroneous execution above (and other possible related bugs) by forcing the stores in lines 3 and 5 to take global effect before the storing process can advance. Unfortunately, fence instructions are very costly in terms of CPU cycles. Thus, we wish to place fences only when they are required for correctness.
Efficient Fence Placement The programmer's challenge is, then, to find a fence placement that permits as much re-ordering as possible but does not allow the specification to be violated. To find an efficient placement of fences, we need to observe which re-orderings lead to violation of the specification, and find a minimal placement (often, there are multiple non-comparable solutions) that prevents these re-orderings. When the program is finite-state, we can enumerate all reachable program states, identify error states and find fences that prevent execution from reaching those states (cf. [18]). Unfortunately, Peterson's algorithm without fences running on PSO has an infinite state-space. The length of the store buffers generated by the program is not bounded: running p0 alone for t iterations of the outer loop without flushing will generate a buffer of length 2t for the ent0 variable.
Abstraction
To handle programs that have an unbounded state-space, we introduce a family of parametric abstractions that provide a conservative bounded representation. Our abstractions induce a hierarchy of (abstract) memory models with varying degrees of consistency. Before describing the abstraction, we note that concrete PSO semantics preserve the following three properties.
1. Intra-process coherence: If a process stores several values to shared variable x, and then performs a load from x, it should not see any value it has itself stored except the most recent one.
2. Inter-process coherence: A process pi should not observe values written to shared variable x by process pj in an order different from the order in which they were written.
3. Fence semantics: If a process pi executes a fence when its buffer for variable x is non-empty, the value of x visible to other processes immediately after the fence should be the most recent value pi wrote.
The properties above are phrased in terms of PSO semantics (store buffer per variable), but it is easy to formulate similar properties for other memory models. For example, for TSO, the only change is that inter-process coherence is global and not per variable. In that case, the desired property may be called inter-process consistency.
Partial-Coherence Abstraction The challenge in designing an abstraction for store-buffer based memory models lies in preserving properties 1-3 (to the greatest possible extent) using a bounded representation of each buffer. To preserve intra-process coherence, our abstractions maintain recency information per variable. To preserve inter-process coherence, our abstractions preserve order between stores up to some constant bound (a parameter of our abstraction), and treat the remaining stores as an unordered set. While property 2 is not fully preserved, this partial coherence is often sufficient in practice. The intuition is that if a process stores many (possibly different) values to the same shared variable without an intervening fence, the order in which they become visible is not important for the correctness of the algorithm. Fig. 4 shows a schematic view of a partial-coherence abstraction of PSO store buffers for the Peterson algorithm of Fig. 1 . In this abstraction, a store buffer is represented by treating items after a bounded head (of length k) of the buffer as a set, and additionally recording the most recently stored value for each buffer.
In Section 3, we give a formal concrete semantics for the PSO memory model and in Section 4 present our abstract semantics. Using this abstraction with the fence inference algorithm of Section 5, our approach automatically infers the fences shown in Fig. 2. In Section 6, we show that we can use different parameters to achieve more scalable abstractions and still get reasonable results. We also show that there is a tradeoff between the precision of the abstraction and the quality of the inferred fences. Finer abstractions lead to successful inference with fewer fences, while restricting the program by adding fences enables verification with a coarser abstraction. In particular, our partially disjunctive abstraction (see Section 4.3) produces non-trivial fence placements for programs for which the fully disjunctive abstraction leads to state-space explosion.
Figure 4. Schematic view of a partial-coherence abstraction of PSO store buffers for the Peterson algorithm of Fig. 1. In this abstraction, a store buffer is given a bounded representation by representing items after a bounded head of the buffer as a set, and recording the most recently stored value for each buffer.
Operational Semantics for Relaxed Memory Models
In this section, we present an operational semantics for the PSO memory model. It is easy to give similar semantics for other conceptually close models such as TSO, NTSO and NPSO [24] .
Preliminaries
Sequence Notation Given a finite domain $D$, we use $Seq_n(D)$ to denote the set of all sequences of length $n$ over $D$, $Seq_{\le n}(D)$ to denote the set of all sequences of length at most $n$ over $D$, $Seq(D)$ to denote the set of all finite sequences over $D$, $|w|$ to denote the length of a sequence $w$, and $\epsilon$ to denote the empty sequence. We denote the concatenation of two sequences $w_1, w_2$ by $w_1 \cdot w_2$. For $0 < k < |w|$, we define $head(w, k)$ as the subsequence consisting of the first $k$ items in $w$ and $tail(w, k)$ as the subsequence consisting of the last $k$ items in $w$. For $k \ge |w|$ we define $head(w, k) = tail(w, k) = w$, and for $k \le 0$, $head(w, k) = tail(w, k) = \epsilon$. We define $last(w)$ to be the only element in $tail(w, 1)$, or $\bot$ if $tail(w, 1) = \epsilon$. We define $Set(w)$ to be the set of elements in the sequence $w$. Finally, we define $UTail(w, k)$ as $Set(tail(w, |w| - k))$, the set of all but the first $k$ elements of $w$.
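The sequence operators can be checked on small inputs with the following Python sketch. It assumes sequences are represented as tuples (oldest element first) and uses `None` in place of ⊥; the lowercase function names are illustrative only.

```python
def head(w, k):
    """First k items of w; all of w if k >= len(w); empty if k <= 0."""
    return w[:k] if k > 0 else w[:0]


def tail(w, k):
    """Last k items of w; all of w if k >= len(w); empty if k <= 0."""
    if k <= 0:
        return w[:0]
    if k >= len(w):
        return w
    return w[len(w) - k:]


def last(w):
    """The only element of tail(w, 1), or None (standing in for bottom)."""
    t = tail(w, 1)
    return t[0] if t else None


def utail(w, k):
    """Set of all but the first k elements of w."""
    return set(tail(w, len(w) - k))


w = ("a", "b", "c")
examples = {
    "head(w, 2)": head(w, 2),
    "tail(w, 2)": tail(w, 2),
    "last(w)": last(w),
    "utail(w, 1)": utail(w, 1),
}
```

Note how `utail(w, k)` becomes the empty set as soon as `k` reaches the length of `w`, which is the case in which a bounded head captures the whole buffer.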
Program Syntax We consider programs written in a simple assembly-like programming language with the operations load, store, branch, CAS (compare and swap) and sequential and parallel composition. Our language also contains a full fence operation. We assume that instructions in our programs are labeled, and that the labels used in the code of each process are unique. We denote the set of program labels by Labs.
Program Semantics A transition system for a program $P$ under a memory model $M$ is a tuple $\langle \sigma_0, \Sigma, T \rangle$, where $\Sigma$ is a set of states, $\sigma_0 \in \Sigma$ is the initial state of $P$, and $T$ is a set of transitions $\sigma \to \sigma'$ such that $\sigma, \sigma' \in \Sigma$ and execution from state $\sigma$ according to the semantics of $M$ can result in state $\sigma'$. A trace $\pi$ of the program is a (possibly infinite) sequence of transitions $\sigma_0 \to \sigma_1 \to \sigma_2 \to \ldots$, where for all $i$, $\sigma_i \to \sigma_{i+1} \in T$. In all of our semantics, a single transition will correspond to an action taken by a single process. Thus we will associate a transition $t$ with that process, and denote the associated process $proc(t)$. A transition $t_p$ is enabled for process $p$ in state $\sigma$ if $p = proc(t_p)$ and there exists some $\sigma_2$ such that $\sigma \xrightarrow{t_p} \sigma_2 \in T$. Throughout the paper we present the semantics in a standard operational style as a set of inference rules. To simplify presentation, when updating mappings, we use $M'(x) = v$ as a shorthand for $M' = M[x \mapsto v]$. Components not updated in the inference rule are assumed to be left unchanged.
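For a finite-state transition system, checking a safety property amounts to exploring the reachable states. The following Python sketch shows a generic breadth-first exploration; the function `reachable_errors` and its parameters (`successors`, `is_error`) are hypothetical names introduced here, not part of the paper's tool. The loop terminates only if the set of reachable states is finite, which is why unbounded store buffers are an obstacle.

```python
from collections import deque


def reachable_errors(initial, successors, is_error):
    """Return every reachable state that satisfies is_error.

    successors(state) must return an iterable of hashable next states.
    """
    seen = {initial}
    frontier = deque([initial])
    errors = set()
    while frontier:
        state = frontier.popleft()
        if is_error(state):
            errors.add(state)
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return errors


# Toy system: states are integers modulo 5, with two transitions per state.
toy_errors = reachable_errors(
    0,
    successors=lambda s: [(s + 1) % 5, (s + 2) % 5],
    is_error=lambda s: s == 3,
)
```

An empty result means that no error state is reachable, so the safety property holds for the explored system.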
Store Buffers
In our memory model semantics we follow [2, 7, 25] and assume that processes execute their programs sequentially, and any violations of sequential consistency happen within the memory subsystem. This is in contrast to other formulations that combine the memory and processor effects (e.g., [30, 35, 41]).
Our formulation is based on store-buffers, and our concrete semantics uses the following semantic domains:
• $G \in SVar$ where $SVar = Shared \to D$. Valuation of shared variables from the domain $D$.
• $L \in Env$ where $Env = PID \to (Local \to D)$. Valuation of local variables for each process.
• $pc \in PC$ where $PC = PID \to Labs$. Program counters.
• $B \in SB$ where $SB$ differs between memory models, and is intentionally left unspecified at this stage. A representation of the store buffers.
Here $D$ represents the domain from which the variables in the program take values.
A state $\sigma = \langle G_\sigma, L_\sigma, pc_\sigma, B_\sigma \rangle \in C$ is a tuple where $C = SVar \times Env \times PC \times SB$. We use $next(pc(p))$ to mean the instruction following $pc(p)$ in the program code. Furthermore, we will omit the $p$ when the referenced process is clear from the context.
Partial Store Order (PSO) Model
Concrete Semantics For PSO, a separate FIFO store buffer is maintained for every (process, variable) pair. That is, formally, $SB = PID \to (Shared \to Seq(D))$.
Sem. 1 shows the concrete operational semantics of the PSO model. Each inference rule applies only to a single process. Thus, the p parameter is consistently omitted in all inference rules presented. However it is always implicitly existentially quantified. For example, the premise of the LOAD-G rule should be read as
The semantics show the role played by the store buffer for storing and loading values to/from main memory (STORE, LOAD-G, LOAD-B, FLUSH). The FENCE and CAS rules have memory fence semantics. These two rules are enabled only when all store buffers of the executing process are empty. This means that when a process encounters, e.g., a fence instruction, it cannot continue execution until all of its buffers are flushed. For simplicity we omit the semantics of instructions that do not access shared memory (register operations, branches) and leave expression evaluation implicit. That is, L(r) is extended to the evaluation of complex expressions r. Such a complex expression may only depend on local variables: expression evaluation may not cause a memory access.
The premise of all rules except FLUSH depends on the program counter of the process. They are enabled only if pc(p) points to an instruction of a specific type. The FLUSH rule, on the other hand, is always enabled for a given buffer B(p)(x) if that buffer is not empty. This captures the fact that flushes can be performed nondeterministically at any stage of program execution.
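The rules discussed in this subsection can be written out roughly as follows. This is a reconstruction from the prose only, not the rules of Sem. 1: the instruction syntax (`load r, x`, `store x, r`, `fence`), the primed notation for updated components, and the exact shape of each premise are assumptions. Each line is implicitly quantified over some process $p \in PID$, every rule except FLUSH also advances $pc(p)$ to $next(pc(p))$, and CAS is omitted.

```latex
\begin{align*}
\textsc{Load-G}:&\quad pc(p) = \texttt{load } r, x \;\wedge\; B(p)(x) = \epsilon
  \;\Longrightarrow\; L'(p)(r) = G(x) \\
\textsc{Load-B}:&\quad pc(p) = \texttt{load } r, x \;\wedge\; B(p)(x) \neq \epsilon
  \;\Longrightarrow\; L'(p)(r) = last(B(p)(x)) \\
\textsc{Store}:&\quad pc(p) = \texttt{store } x, r
  \;\Longrightarrow\; B'(p)(x) = B(p)(x) \cdot L(p)(r) \\
\textsc{Flush}:&\quad B(p)(x) = v \cdot w
  \;\Longrightarrow\; G'(x) = v \;\wedge\; B'(p)(x) = w \\
\textsc{Fence}:&\quad pc(p) = \texttt{fence} \;\wedge\; \forall x.\; B(p)(x) = \epsilon
  \;\Longrightarrow\; pc'(p) = next(pc(p))
\end{align*}
```

Under this reading, FLUSH is the only rule whose premise does not mention the program counter, which matches the statement that flushes may happen at any point of the execution.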
Total Store Order (TSO) Model
The TSO concrete state differs from the PSO concrete state only in the definition of the store buffer. For TSO, there is only a single buffer for all variables of a process. That is, $SB = PID \to Seq(Shared \times D)$. The semantics must also be updated to take the difference into account. The flavor of the required changes can be seen in the TSO version of the LOAD-G rule in Sem. 2. Note that as the difference between PSO and TSO lies purely in the grouping of shared variables into store buffers, we can treat them as special cases of the same general model.
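The effect of grouping all variables of a process into one buffer can be seen in a small Python sketch. The class `TSOMemory` and its methods are hypothetical names; the sketch assumes a load returns the newest buffered store to the requested variable and falls back to global memory when the buffer holds no store to it.

```python
from collections import deque


class TSOMemory:
    """Toy TSO memory: one FIFO buffer of (variable, value) pairs per process."""

    def __init__(self, initial):
        self.memory = dict(initial)
        self.buffers = {}   # pid -> deque of (var, value), oldest first

    def store(self, pid, var, value):
        self.buffers.setdefault(pid, deque()).append((var, value))

    def load(self, pid, var):
        # Newest buffered store to `var` by this process, if there is one.
        for buffered_var, value in reversed(self.buffers.get(pid, ())):
            if buffered_var == var:
                return value
        return self.memory[var]

    def flush(self, pid):
        # Stores to different variables leave the buffer in program order.
        var, value = self.buffers[pid].popleft()
        self.memory[var] = value


tso = TSOMemory({"x": 0, "y": 0})
tso.store(0, "x", 1)
tso.store(0, "y", 1)
tso.flush(0)                      # only the store to x can be flushed first
memory_after_one_flush = dict(tso.memory)
```

After one flush the memory holds the new `x` and the old `y`; the opposite outcome, which PSO permits, cannot arise with a single buffer per process.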
Partial-Coherence Abstractions
In this section, we present a family of abstract memory models that abstract the concrete semantics of Section 3. The presentation focuses on abstractions of the SPARC PSO model, but the adaptation to TSO is straightforward. The main idea behind our partial-coherence abstractions is to vary how much of the order between memory operations we preserve. The ability to vary the precision is useful as different algorithms can be verified with different levels of precision and cost. When the abstraction is used for fence inference, Section 6 shows that there exists a trade-off between the precision of the analysis (which affects the state-space size) and the quality of inferred fences.
Abstract Domain
The abstract domain is designed to represent store buffers in a bounded way by losing order information between items past a certain bound. To achieve this goal, we represent a concrete buffer $B$ by a tuple $\langle l, S, H \rangle$. The element $l \in D$ records the latest (most recent) value that was written into the buffer. $H \in Seq_{\le k}(D)$ records the $k$ oldest values in the buffer (in the original order) if those are known. $S \subseteq D$ records a set of values that were written into the buffer, abstracting away the order between them as well as the number of times each element appears in the buffer. Formally, we define, for a buffer $B$: $\beta_B(B) = \langle last(B),\ UTail(B, k),\ head(B, k) \rangle$.
An abstract state $\sigma$ is a tuple $\langle G, L, pc, B \rangle$ where $G$, $L$ and $pc$ are defined as in the concrete semantics. $B$ maps a (process, shared variable) pair to the tuple $\langle l, S, H \rangle$ defined earlier. To simplify notation, we will use shorthands such as $l_p(x)$ to represent the $l$ element of $B(p)(x)$. As in the concrete semantics, we will often omit the $p$. We denote by $A$ the set of all abstract states.
To define the abstract domain, we define several order relations. The order $\sqsubseteq_b$ is defined on $\langle l, S, H \rangle$ tuples: $\langle l_1, S_1, H_1 \rangle \sqsubseteq_b \langle l_2, S_2, H_2 \rangle$ if $l_1 = l_2$ and:
Intuitively, $\langle l_2, S_2, H_2 \rangle$ is produced from $\langle l_1, S_1, H_1 \rangle$ by removing part of the tail end of $H_1$ and adding all the removed elements into the set. We then use $\sqsubseteq_b$ to define a partial order $\sqsubseteq_s$ on abstract states $\sigma_1 = \langle G_1, L_1, pc_1, B_1 \rangle$, $\sigma_2 = \langle G_2, L_2, pc_2, B_2 \rangle$: $\sigma_1 \sqsubseteq_s \sigma_2$ if the two states coincide on $G$, $L$, $pc$ and:
Finally, we define our abstract domain $\mathcal{A} \subseteq 2^{A}$ as the set of all antichains of $A$. That is:
The order relation $\sqsubseteq$ on $\mathcal{A} \times \mathcal{A}$ is defined as:
The join operator over $\mathcal{A}$, implied by the above order, is:
We define the abstraction function $\alpha : 2^{C} \to \mathcal{A}$ using an extraction function $\beta : C \to A$:
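One way to write out the definitions of this subsection is sketched below. These formulas are a reconstruction, not the authors' displayed definitions: $\sqsubseteq_b$ is read directly off the intuition stated above, $\sqsubseteq_s$ is assumed to be pointwise, the order and join on antichains are assumed to be the standard ones for an antichain domain, and $\max_{\sqsubseteq_s}$ (the set of $\sqsubseteq_s$-maximal elements) and $\beta_B$ (the extraction function on a single buffer) are notation assumed here.

```latex
\begin{align*}
\langle l_1, S_1, H_1 \rangle \sqsubseteq_b \langle l_2, S_2, H_2 \rangle
  &\iff l_1 = l_2 \;\wedge\; \exists w.\; H_1 = H_2 \cdot w \;\wedge\; S_2 = S_1 \cup Set(w) \\
\sigma_1 \sqsubseteq_s \sigma_2
  &\iff G_1 = G_2 \;\wedge\; L_1 = L_2 \;\wedge\; pc_1 = pc_2 \;\wedge\;
        \forall p, x.\; B_1(p)(x) \sqsubseteq_b B_2(p)(x) \\
\mathcal{A}
  &= \{\, X \subseteq A \mid \forall \sigma_1, \sigma_2 \in X.\;
        \sigma_1 \sqsubseteq_s \sigma_2 \Rightarrow \sigma_1 = \sigma_2 \,\} \\
X_1 \sqsubseteq X_2
  &\iff \forall \sigma_1 \in X_1.\; \exists \sigma_2 \in X_2.\; \sigma_1 \sqsubseteq_s \sigma_2 \\
X_1 \sqcup X_2
  &= \max\nolimits_{\sqsubseteq_s}(X_1 \cup X_2) \\
\beta(\langle G, L, pc, B \rangle)
  &= \langle G, L, pc, \lambda p.\, \lambda x.\, \beta_B(B(p)(x)) \rangle \\
\alpha(C)
  &= \max\nolimits_{\sqsubseteq_s} \{\, \beta(\sigma) \mid \sigma \in C \,\}
\end{align*}
```

Under this reading, a larger element in $\sqsubseteq_b$ has forgotten more ordering information, and the join keeps only the most abstract states of the union.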
The intuition behind this abstraction is shown in Fig. 4. That figure can, however, be slightly misleading: it is important to note that the entire concrete buffer is covered by the concatenation of the values in $H$ and $S$. Specifically, $l$ does not need to be concatenated to the end. Thus if $S = \emptyset$, then $l$ is always equal to the last element of $H$ and is in fact redundant. If $S \neq \emptyset$, then $l \in S$ is an invariant. A more precise representation of the abstraction is given in Fig. 5. Fig. 5(a) shows the case in which the concrete buffer is longer than $k$ and Fig. 5(b) shows the case in which the concrete buffer is of length at most $k$.
The Importance of Recency Our abstraction uses l to record the most recent value stored for a variable. This is motivated by the need to preserve the intra-process coherence requirement that a process storing several values to a shared variable x, and then performing a load from x, should not see any value it has itself stored except the most recent one. This is a very basic property and abstractions that do not preserve this information will fail to verify many reasonable programs.
Partial Inter-Process Coherence The abstract domain only partially preserves the inter-process coherence requirement. For example, suppose processor $p$ stores the values $a$ and then $b$ to the variable $x$. The resulting concrete buffer $B(p)(x)$ is "ab". Taking $k = 0$, the abstract buffer is $B^\sharp = \beta_B(\text{"ab"}) = \langle b, \{a, b\}, \epsilon \rangle$. Note that, for example, $\beta_B(\text{"ab"}) = \beta_B(\text{"bab"})$. So given the abstract buffer $B^\sharp$ we must allow a different process $q$ to observe the values being written in the opposite order. Worse, since, for example, $\beta_B(\text{"ab"}) = \beta_B(\text{"abab"})$, it is possible for process $q$ to observe the stores in the opposite order, and for a third process $r$ to observe them in the original order.
This behavior occurs only if a process performs more than k stores to the same memory location without an intervening fence or CAS. As long as at most k stores to a memory location are performed without an intervening fence, the abstract domain is fully precise.
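The loss and preservation of order can be checked with a small Python sketch of the buffer abstraction. It assumes the abstraction maps a concrete buffer (oldest value first) to the triple of its most recent value, the set of values after the first k, and the first k values in order; the function name `beta_buffer` is illustrative, `None` stands for the most recent value of an empty buffer, and k is assumed to be non-negative.

```python
def beta_buffer(buf, k):
    """Abstract a concrete buffer (oldest value first) to a triple (l, S, H)."""
    latest = buf[-1] if buf else None       # l: most recent value
    unordered_rest = frozenset(buf[k:])     # S: values past the first k, unordered
    ordered_head = tuple(buf[:k])           # H: the k oldest values, in order
    return latest, unordered_rest, ordered_head


# With k = 0 all order is lost: these concrete buffers become indistinguishable.
k0 = [beta_buffer(buf, 0) for buf in ("ab", "bab", "abab")]

# With k = 2, buffers of length at most 2 are represented exactly.
exact_ab = beta_buffer("ab", 2)
exact_ba = beta_buffer("ba", 2)
```

In the exact case the set component is empty and the most recent value coincides with the last element of the ordered head; once the buffer is longer than k, the most recent value is always a member of the set component.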
Abstract Domain Design One of our observations is that in a correct program a process rarely performs an unbounded number of stores to a memory location without a fence. When a process does perform a large number of stores to a memory location without a fence, it means that the order in which these stores are performed is not important for program correctness. Following this observation, our abstraction is designed to: (i) preserve the order between a small number of stores to the same memory location; (ii) abstract away the order in long sequences of stores but preserve the stored values, such that values cannot appear "out of thin air".
Given that programmers do not normally think in terms of unordered stores, we expect many correct programs to only utilize very short buffers. This is validated by our results (cf. Section 6): correct versions of the benchmark algorithms could be verified using k = 1. However, there are examples of correct programs where buffers may become very long. One such example is the Sieve of Eratosthenes implementation in [3], which only requires that values do not appear out of thin air.
Furthermore, we wish our verification procedure to remain sound for arbitrary programs. This is impossible using bounded buffers, as it is trivial to construct an incorrect program which would appear to be correct using bounded buffers of some given length k. Similarly, when the abstraction is used for fence inference, we wish to always infer a correct placement. This is impossible if we simply bound the buffer length.
Domain of Variables
Throughout this paper we do not specify the domain $D$ from which local and shared variables take values. However, the domain is in fact critically important to the effectiveness of the abstraction. Let $d$ be the number of values a shared variable may potentially take during the execution of a program. Then the number of possible $H$ values for that variable is $d^k$, and the number of possible $S$ values is $2^d$. Since in the worst case $d = |D|$, we may expect an increase in state-space size (with respect to the state-space under SC) that is exponential in $|D|$. The examples we used for our experiments do not suffer from this problem, as the variables only take a small number of values in any given execution. However, for different programs this may become a real issue. Note, however, that our abstraction can be trivially composed with value abstractions (e.g., sign abstraction, parity abstraction, interval abstraction [10]). Instead of using the concrete domain $D$, we can replace it with an abstract value domain $D^\sharp$. We can then store those abstract values in the buffer and perform local operations on values according to the semantics dictated by $D^\sharp$, retaining a sound abstraction.
Abstract Semantics
Sem. 3 shows the abstract semantics with partial coherence parameterized by k. In the figure, we use the shorthand emp(x)
Loading Values and Recency In the concrete semantics, a process may load the latest value it wrote by reading its own store buffer. Correspondingly, in the abstract semantics, the rule LOAD-B reads the most recent value recorded in $l$. Had we not recorded the most recent value $l$ that a process wrote, a process $p$ that performs a load when $S_p(x) \neq \emptyset$ would have to conservatively explore all possible values in the set $S_p(x)$. The rule LOAD-G is similar to the concrete semantics: when the buffer is known to be empty, the value is loaded from the global store.
Storing Values
In the abstract semantics, store is split into two cases, STORE-H and STORE-S, based on whether the size of the buffer $H(x)$ has reached the bound $k$. As long as $|H(x)| < k$ and $S(x) = \emptyset$, the contents of the buffer are known precisely. Thus, similarly to the concrete semantics, the effect of a store follows STORE-H, adding the value to the tail of the buffer $H(x)$ and updating the most recent value $l(x)$. When $|H(x)| = k$, the bound on the size of $H$ has been reached and no more values can be stored in $H$. Therefore, the new value is stored in the (unordered) set of values $S(x)$ (as shown in the rule STORE-S) and the most recent value $l(x)$ is updated accordingly. When $S(x) \neq \emptyset$ we have lost the information on the precise number of elements in the buffer, and thus are also forced to keep updating the set.
Flushing Values
In the abstract semantics, flush is split into three cases: FLUSH-H, FLUSH-SN and FLUSH-SD. When we have H(x) = then FLUSH-H behaves as the FLUSH rule in the concrete semantics: it selects the oldest element in H(x), writes it to G(x) and updates H(x). However, when H(x) = and S(x) = ∅, any of the values in S(x) become possible candidates for flushing (since S(x) is unordered, we do not know which value is the oldest one). The rules FLUSH-SD (flush from set, destructive) and FLUSH-SN (flush from set, non-destructive) then only differ on whether the value selected to be flushed is removed from S(x) or kept in it. This is required since we do not know how many times every value appears in the buffer. Thus, in the concrete domain, FLUSH-SD of a value v represents a flush of the last occurrence of v in the buffer. In contrast, FLUSH-SN represents the situation in which more instances of v remain.
We improve the precision of the analysis by disabling the FLUSH-SD rule when we know that the resulting abstract states do not represent any possible concrete states and will only introduce imprecision. In particular, if v = l(x) and S(x) ≠ {v}, FLUSH-SD need not fire. If we apply the concrete FLUSH rule to any concretization of such a state, the value v will stay in the (concrete) buffer: v is the most recently written value, so its last occurrence sits at the tail of the buffer and cannot be flushed while other values remain. Flushing v from the abstract state using FLUSH-SD would instead remove v from the abstract buffer, leading to abstract states that could not arise in the concrete semantics.
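The store, load and flush rules described above can be sketched for a single process and a single variable as follows. This is a minimal illustration rather than the tool's implementation: the class name `AbstractBuffer` and its field names are hypothetical, and the sketch assumes that FLUSH-SD is disabled exactly when the flushed value is the most recent one while other values remain in S(x).

```python
from dataclasses import dataclass, field

@dataclass
class AbstractBuffer:
    """Abstract store buffer <l, S, H> for one process and one variable."""
    k: int                                      # bound on |H(x)|
    last: object = None                         # l(x): most recently stored value
    head: list = field(default_factory=list)    # H(x): oldest values, in order
    rest: set = field(default_factory=set)      # S(x): unordered overflow

    def is_empty(self):
        return not self.head and not self.rest

    def store(self, v):
        if len(self.head) < self.k and not self.rest:
            self.head.append(v)      # STORE-H: contents still known precisely
        else:
            self.rest.add(v)         # STORE-S: order and multiplicity are lost
        self.last = v

    def load(self, global_value):
        # LOAD-G when the buffer is empty, LOAD-B otherwise
        return global_value if self.is_empty() else self.last

    def flush_successors(self):
        """All (rule, value written to G(x), successor buffer) triples."""
        if self.head:                # FLUSH-H: the oldest value is known
            nxt = AbstractBuffer(self.k, self.last, self.head[1:], set(self.rest))
            return [("FLUSH-H", self.head[0], nxt)]
        succs = []
        for v in sorted(self.rest, key=repr):
            keep = AbstractBuffer(self.k, self.last, [], set(self.rest))
            succs.append(("FLUSH-SN", v, keep))
            # Assumed reading of the text: FLUSH-SD is disabled when v is the
            # most recent value and other values are still buffered.
            if not (v == self.last and self.rest != {v}):
                drop = AbstractBuffer(self.k, self.last, [], self.rest - {v})
                succs.append(("FLUSH-SD", v, drop))
        return succs


# Three stores with k = 1: the first goes to H(x), the others overflow to S(x).
buf = AbstractBuffer(k=1)
for value in (1, 2, 3):
    buf.store(value)
after_head = buf.flush_successors()[0][2]
rules = [(rule, v) for rule, v, _ in after_head.flush_successors()]
```

In this run, once the head has been flushed, the value 3 can only be flushed non-destructively, because it is the most recent value and 2 is still buffered.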
Example: Motivating Recency and Order Next, we illustrate via an example why maintaining recency and order is important for verification and inference. Consider a naive set abstraction for the store buffers, and a version of Peterson's algorithm with fences shown in Fig. 2 (in Section 2). Under the standard concrete semantics of PSO, those fences guarantee that it is impossible for both processes to be concurrently executing line 13. Let us consider an abstract memory model where order and recency are not maintained, that is, we only maintain Sp(x), without lp(x) and Hp(x). Then, we cannot show that the algorithm is correct. Consider the following execution:
1. Initially both processes start with empty buffers, and ent0 = ent1 = turn = 0.
2. Process 0 runs through one iteration of the outer loop (executes lines 1-14 inclusively), without performing a flush after line 14.
3. Process 0 then tries to enter the critical section again and executes lines 1-3 inclusively. At this stage, Sp0(ent0) = {true, false}.
4. Two flush actions are performed on Sp0(ent0), first flushing true and then false. At this point G(ent0) = false.
5. Process 0 completes entering the critical section.
6. Process 1 loads ent0 from the global store and, since ent0 is false, process 1 also enters the critical section.
The above example would not have been possible had we kept either: i) ordering information via Hp(ent0) for at least two values (i.e., k = 2) or ii) recency information via lp(ent0). In the first case, the order in which {true, false} are flushed would have been consistent with the order in which the values were written: we would have first flushed false and then true. In the second case, the fence in line 4 would have forced fully flushing Sp0(ent0), resulting in writing out the most recent value (i.e., G(ent0) = true). While in this case we could have used either lp(ent0) or Hp(ent0) with k = 2, in other examples both of these refinements with respect to a set are required.
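The effect of recency on this example can be illustrated with a small sketch. It simplifies the abstract semantics by assuming each buffered value is flushed exactly once, and the function names are hypothetical. It compares the values that may be left in the global store after a full flush, with and without the most recent value l(x).

```python
from itertools import permutations

def final_values_naive_set(buffered):
    """Values that can end up in G(x) once an unordered set buffer is drained."""
    return {order[-1] for order in permutations(buffered)}

def final_values_with_recency(buffered, last):
    """Same, but the value flushed last must be the most recent one, l(x)."""
    return {order[-1] for order in permutations(buffered) if order[-1] == last}

# ent0 was set to false on exit and then to true on re-entry.
naive = final_values_naive_set({True, False})
recent = final_values_with_recency({True, False}, last=True)
```

With the naive set, G(ent0) = false remains possible after the fence, which is what lets process 1 enter the critical section; with recency only G(ent0) = true survives.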
A Partially Disjunctive Abstraction for Store Buffers
The abstraction of Section 4.1 distinguishes two abstract buffers B1 = ⟨l1, S1, H1⟩ and B2 = ⟨l2, S2, H2⟩ even when they differ only in the contents of their unordered sets, S1 ≠ S2. This leads to distinctions between abstract states that are often more precise than necessary. We observe that a more efficient abstraction can be obtained without a significant sacrifice in precision by merging such states. In Section 6, we show that combining such states leads to a more scalable abstraction, while keeping a sufficient level of precision. The one distinction that we do wish to preserve regarding the S component is the difference between an empty set and a non-empty set, as many of the rules in Sem. 3 distinguish between these two cases. To capture this, we change the definition of ⊑_b as follows: The order ⊑^τ_b is defined on ⟨l, S, H⟩ tuples. ⟨l1, S1, H1⟩ ⊑^τ_b ⟨l2, S2, H2⟩ if l1 = l2 and one of the two following conditions holds:
The orders ⊑^τ_s and ⊑^τ are then defined exactly as before but with respect to ⊑^τ_b instead of ⊑_b. α^τ is also defined as before, but using ⊑^τ instead of ⊑. Note that this small change in the formalism drastically changes the intuitive meaning of the set S. Let B = ⟨l, S, ε⟩ be an abstract buffer, and B♮ a concrete buffer such that
In the fully-disjunctive abstraction, this implies that Set(B♮) = S. This means that a value v is in S if and only if it appears at least once in B♮. In the partially-disjunctive abstraction, this is no longer true. Consider the concrete buffer "a". Assuming k = 0, β_B("a") = ⟨a, {a}, ε⟩ ⊑^τ_b ⟨a, {a, b}, ε⟩. If a value appears at least once in B♮ then it is necessarily in S, but the converse does not hold.
The new abstraction also implies a change to the abstract transformer. In the fully disjunctive abstraction, flushes from S were split into two cases: FLUSH-SD to represent flushing the last instance of a value from the buffer, and FLUSH-SN to represent an instance that is not the last one. The case split for the partially disjunctive abstraction is slightly different. The new flush semantics are shown in Sem. 4. The rule FLUSH-NE covers the case in which a flush leaves S non-empty, while FLUSH-E represents flushing the only remaining element of the concretization of the abstract buffer. Note that it's possible for both types of flush rules to be enabled for the same buffer.
Semantics 4 Partially-disjunctive flush semantics
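\[
\frac{H(x) = \epsilon \qquad v \in S(x)}{G'(x) = v}\ \text{(FLUSH-NE)}
\qquad\qquad
\frac{H(x) = \epsilon \qquad S(x) \neq \emptyset}{G'(x) = l(x) \qquad S'(x) = \emptyset}\ \text{(FLUSH-E)}
\]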
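As a rough illustration of these two rules, the sketch below enumerates the flush successors of a partially disjunctive buffer whose ordered part H(x) is empty. The function name and the tuple layout are assumptions made for this example only.

```python
def pd_flush_successors(last, rest):
    """Flush successors of a partially disjunctive buffer with H(x) empty.

    Returns (rule, value written to G(x), successor S(x)) triples.
    """
    # FLUSH-NE: any value in S(x) may be written; S(x) itself is unchanged.
    succs = [("FLUSH-NE", v, frozenset(rest)) for v in sorted(rest, key=repr)]
    # FLUSH-E: the last remaining element is flushed, so G(x) gets l(x)
    # and S(x) becomes empty.
    if rest:
        succs.append(("FLUSH-E", last, frozenset()))
    return succs

steps = pd_flush_successors(last="b", rest={"a", "b"})
```

Both kinds of rule are enabled for the same buffer here, matching the remark that FLUSH-NE and FLUSH-E may fire from the same abstract state.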
Fence Inference
In this section, we introduce a new technique for inferring memory fences under store-buffer based abstract memory models. For our algorithm, we follow the same general recipe as outlined in [40] : (i) Construct (a possibly abstract) transition system and find the reachable error states.
(ii) Construct a boolean formula that describes how traces leading to those error states can be avoided.
(iii) Implement satisfying assignments of the formula using syntactic constructs.
The main challenge in implementing this general recipe is in defining which transitions may be avoided, and how they can be avoided syntactically. Next, we present the definitions that are appropriate for each of our semantics: concrete, fully-disjunctive and partially-disjunctive. We also show an example demonstrating the interplay between the precision of the abstraction and the quality of inferred fences.
Recoverability of Sequential Consistency
Before we introduce the details of the inference algorithm, we note that the problem always has a trivial (inefficient) solution under concrete memory models. However, for some abstract memory models, the problem no longer has a solution. We would like to restrict attention to abstractions in which the existence of a solution is guaranteed. Consider a program P that satisfies its specification S under the sequentially consistent memory model, P ⊨SC S, but violates it under a weaker memory model M, P ⊭M S. We say that M is SC-Recoverable (SCR) when for any such P there exists a program P′ obtained from P by adding fences such that P′ ⊨M S. For SC-Recoverable memory models, when P ⊨SC S, the trivial solution in which fences are added after every memory store in P always exists. This property might seem trivial; however, it is easy to design seemingly reasonable abstract models for which it does not hold. For instance, as demonstrated in Section 6, a partial-coherence abstraction with recency and k = 0 does not satisfy the SCR property.
For the partial coherence abstractions of Section 4, k ≥ 1 guarantees SC-Recoverability. If we place a fence immediately after every store instruction, then i) |H| can never grow above 1 so stores cannot become visible out of order and ii) the store cannot be observed by the process itself before it is flushed. In effect this makes the store and flush operations atomic, reducing the program's behaviors to those possible under sequential consistency.
Fence Inference under Concrete Semantics
Buffers of Labeled Stores The semantics given in Section 3 do not preserve enough information about program execution to enable fence inference. Using those semantics, it is not possible to determine that a given memory operation was delayed by examining only the source state and the transition associated with the operation. Therefore, we instrument the concrete semantics with additional information about the instruction that stored each value. To achieve this, for a process p and variable x, we extend the store buffer Bp(x) ∈ Seq(Labs × D) to be a sequence of pairs ⟨l, v⟩: for every value stored we also record the label of the program instruction that stored the value.
Avoiding Error States Let P be a program, and ⟨σ0, Σ, T⟩ be the program's transition system. Every transition t ∈ T that is not a flush transition is associated with an instruction in the code that caused the transition. We denote by lt the label of this instruction. Our goal is to construct a program P′ by inserting fences into P such that the state-space of P′ does not contain any error states. To "remove" a state from the state-space, we must prohibit all program traces that contain it. The question then becomes "how can a program trace be prohibited?"
The intuition behind the recipe of [40] is that program traces can be classified as either avoidable or unavoidable. The classification is performed according to the syntactic device we have to eliminate traces. In our setting, the syntactic device used to prohibit traces is a memory fence. If a trace π contains a store transition tp from σi to σi+1 by process p which is not immediately followed by a flush of the stored value, we can prohibit π by placing a fence immediately after the store. This means a trace is unavoidable if and only if every store is immediately followed by a flush. We can refine the concept of an avoidable trace and talk about avoidable transitions. A transition tp performed by process p is avoidable if it is a memory operation (store, load or CAS), and some store buffer associated with p is non-empty.
Formally, let t be a transition from σ to τ executed by process p. Let v be a value written by the instruction at label lv such that ⟨lv, v⟩ appears in some buffer Bσ,p(x). Then t is avoidable, and can be avoided by placing fences on all program paths between lv and lt, forcing the value v to be flushed before lt is reached. We formalize this by defining ordering constraints: we say an ordering constraint [lv ≺ lt] is enforced if a fence is placed on all program paths between lv and lt. We say the constraint is violated by a transition if lt is executed while a value stored by lv is in the buffer. A constraint is violated by a trace if it is violated by one of its transitions. Note that if a constraint is enforced by a fence, it cannot be violated by any transition. This implies that a trace π of P cannot appear in P′ if at least one of the constraints it violates is enforced in P′. This fact gives us a complete characterization of how a state can be removed from the state-space: by enforcing at least one constraint that is violated by each trace leading to that state. We call this characterization the avoid formula of a state.
A direct implementation of the method described above to compute the avoid formulae would be very inefficient, as it requires enumerating all program traces. Below we give a brief description of a more efficient algorithm. A fuller description, albeit in a different setting, is given in [18] .
As a first stage in the algorithm we construct the transition system ⟨σ0, Σ, T⟩. We then label every state σ ∈ Σ with a propositional formula that captures how σ can be made unreachable (avoided) through the use of ordering constraints. Intuitively, a state σ can be avoided by avoiding all incoming transitions to σ in the program's transition system. In turn, a single transition µ → σ can be avoided by either avoiding its source state µ or by prohibiting the transition itself. We associate with each transition t ∈ T a formula:
Here, we use proc(t) to denote the process that executes the transition t and src(t) to denote the source state of the transition. This formula captures all possible ordering constraints that would prohibit execution of t. Formally, it is a disjunction because it is enough to enforce one of the constraints to make t unreachable.
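A formula matching this description could be written as follows. This is a hedged reconstruction from the surrounding text, not necessarily the exact notation of the original; it assumes that B_{σ,p}(x) denotes the labeled buffer of process p for variable x in state σ.

\[
\mathit{prevent}(t) \;=\; \bigvee \bigl\{\, [l_v \prec l_t] \;\bigm|\; \exists x.\ \langle l_v, v \rangle \in B_{\mathit{src}(t),\,\mathit{proc}(t)}(x) \,\bigr\}
\]

Under this reading, when every buffer of proc(t) is empty in src(t) the disjunction is empty and evaluates to false, so no fence can prevent t.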
To compute how a state σ ∈ Σ can be avoided, we define a labeling function L and:
We then define a transformer that updates the labeling function:
Given an initial mapping L0 that maps all unavoidable states to false and the rest to true, the greatest fixed point of infer(L0) describes all the possible ways in which any state σ can be avoided. The greatest fixed point is computed with respect to the implication partial order L1 ⊑ L2 ≡ ∀σ ∈ Σ. L1(σ) ⇒ L2(σ). Using the provided specification, we identify a set E ⊆ Σ of reachable error states. We then compute the overall constraint formula ψ by taking the conjunction of avoid constraints for all error states: ψ = ⋀{L(σ) | σ ∈ E}. A satisfying assignment to this formula is guaranteed to represent a correct fence placement.
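The labeling computation can be sketched in a few lines of Python. The sketch makes several assumptions that are not taken from the text: a transition from µ to σ is avoided by L(µ) ∨ prevent(t), a state is avoided by the conjunction of this formula over its incoming transitions, the initial state is the only unavoidable state, and formulas are kept as minimized monotone DNF (sets of constraint sets) rather than BDDs. All names are hypothetical.

```python
FALSE = frozenset()                  # no set of constraints suffices
TRUE = frozenset({frozenset()})      # satisfied without enforcing anything

def minimize(dnf):
    """Absorption: drop clauses that strictly contain another clause."""
    return frozenset(c for c in dnf if not any(d < c for d in dnf))

def f_or(a, b):
    return minimize(a | b)

def f_and(a, b):
    return minimize(frozenset(c | d for c in a for d in b))

def infer_fixpoint(states, initial, transitions):
    """transitions: (source, target, prevent formula) triples."""
    label = {s: (FALSE if s == initial else TRUE) for s in states}
    changed = True
    while changed:
        changed = False
        for s in states:
            if s == initial:
                continue
            new = TRUE
            for src, dst, prevent in transitions:
                if dst == s:
                    # avoid the source state or prohibit the transition
                    new = f_and(new, f_or(label[src], prevent))
            if new != label[s]:
                label[s], changed = new, True
    return label

def one(l_v, l_t):
    """Formula consisting of the single constraint [l_v < l_t]."""
    return frozenset({frozenset({(l_v, l_t)})})

# Toy system: the store at L1 is unavoidable; the error state is reachable
# either through a second store at L2 or directly by the load at L3.
transitions = [
    ("s0", "s1", FALSE),
    ("s1", "s2", one("L1", "L2")),
    ("s2", "err", f_or(one("L1", "L3"), one("L2", "L3"))),
    ("s1", "err", one("L1", "L3")),
]
labels = infer_fixpoint(["s0", "s1", "s2", "err"], "s0", transitions)
psi = labels["err"]                  # only one error state in this example
```

In the toy system the only way to avoid the error state is to enforce [L1 ≺ L3], since the direct path from s1 violates no other constraint.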
Inference under Abstract Semantics
We can extend the abstract model in the same way we extended the concrete model. That is, Hp(x) and Sp(x) will contain label, value pairs.
Inference under Disjunctive Abstraction
Using the abstract semantics of Sec. 4.2, we can construct an abstract transition system for the program, and apply the same reasoning as in the concrete semantics, except that we adjust prevent(t):
This adjustment is safe because we know that if ⟨lv, v⟩ ∈ Qp(x), then for any concretization σ♮ of σ, Bσ♮,p(x) must contain ⟨lv, v⟩ at least once. This means that placing a fence between any such lv and lt is sufficient to avoid t from σ♮.
Note that it is possible to infer more fences than necessary due to the imprecision of the abstraction. Consider the simple example in Fig. 6 , with the specification that in a final state r1 ≤ r2. If we attempt to execute this program under partial-coherence semantics with k = 0, we may get a trace where in the final state we have r1 = 2, r2 = 1: (a) Process 1 performs both stores. The single avoidable transition in this trace is the execution of the second store by process 1. The only way to avoid this transition is by placing a fence between the two stores. However, if we increase the precision of the abstraction and use k = 1, we will not produce this (spurious) trace and will not infer the redundant fence.
Process 1:
  store x = 1;
  store x = 2;
Process 2:
  load r1 = x;
  load r2 = x;
Figure 6. Fully disjunctive partial-coherence abstraction with k = 0 leads to a redundant fence between the stores in Process 1, while with k = 1 the inference algorithm determines that no fences are necessary.
Inference under Partially Disjunctive Abstraction
For the abstract semantics of Sec. 4.3, we need to adjust prevent(t):
The only change from the fully disjunctive abstraction is in replacing ∨ with ∧. The reason for this change becomes clear once we examine the concretization function for the partially disjunctive abstraction. As pointed out in the previous section, given an abstract state σ and a non-empty Sσ,p(x), there exist concretized states which do not contain all values in Sσ,p(x). Since prohibiting a transition from σ requires prohibiting that transition from all concrete states represented by σ, prevent(t) must be a conjunction over the possible prevent formulas in the concrete domain. For many transitions, this formula will be stronger than the optimal one, potentially leading to a fence placement worse than the one produced by the fully disjunctive abstraction with the same k value.
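As a small worked illustration, consider a hypothetical abstract state in which the buffers of the executing process hold values stored at labels l_1 and l_2, and a transition t executes the instruction at label l_3. Under the reading given in the text, the two abstractions would yield:

\[
\mathit{prevent}_{FD}(t) \;=\; [l_1 \prec l_3] \,\vee\, [l_2 \prec l_3]
\qquad\qquad
\mathit{prevent}_{PD}(t) \;=\; [l_1 \prec l_3] \,\wedge\, [l_2 \prec l_3]
\]

The first formula can be satisfied by enforcing a single constraint, while the second requires both, which shows how the partially disjunctive abstraction may demand more fences for the same transition.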
Fine-grained fence inference
The inference algorithm described above generates sets of constraints that must be enforced so that the specification is satisfied. One simple way to enforce a constraint [l1 ≺ l2] is by placing a full fence on every path between l1 and l2 on the control-flow graph of the program. If finer-grained fences are available on the architecture we can use information encoded in the constraint to implement it more efficiently. For example if the architecture provides separate store-store and store-load fences we can place the appropriate fence based on whether the instruction at l2 is a store or a load. If the architecture provides fences that enforce flushing only one variable (e.g., CAS in our concrete PSO semantics) then we can place the correct fence type based on the variable written to by l1. For simplicity, in Section 6 we assume the only fence available is a full fence. However, whenever inference succeeds we could trivially place finer-grained fences.
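The choice of fence for a constraint [l1 ≺ l2] could be sketched as below. The fence names and the `choose_fence` helper are hypothetical placeholders, not instructions of any particular architecture.

```python
def choose_fence(l2_kind):
    """Pick a fence flavour for a constraint [l1 < l2], where l1 is a store.

    l2_kind is the kind of the instruction at l2: "store", "load", or other.
    """
    if l2_kind == "store":
        return "store-store"     # order the store at l1 before the store at l2
    if l2_kind == "load":
        return "store-load"      # order the store at l1 before the load at l2
    return "full"                # e.g. CAS or unknown: fall back to a full fence

placements = {kind: choose_fence(kind) for kind in ("store", "load", "cas")}
```

Falling back to a full fence keeps the placement sound whenever no finer-grained fence applies.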
Experience
We implemented our abstractions together with the verification and inference algorithms in a tool called BLENDER. Using BLENDER, we demonstrate the effectiveness of our abstractions by successfully verifying and inferring the required fences in a number of challenging algorithms. None of these algorithms could be handled by existing approaches. Further, we illustrate an inherent trade-off between the optimality of fence inference and the state-space size dictated by the abstraction.
BLENDER is implemented in Java and uses the JavaBDD library to represent avoid formulae as BDDs. All experiments were conducted on an 8-CPU Xeon 1.6GHz with 16GB memory running a 64-bit Sun JVM on Red Hat Linux.
Abstractions In our experiments, we consider a range of abstract memory models, all of which are abstractions of the concrete PSO memory model:
• Set: an abstraction of the store buffer to a set, without any additional information such as recency.
• FD: the partial coherence abstraction shown in Sem. 3, with varying k.
• PD: the partially disjunctive abstraction described in Sec. 4.3.
Note that the Set abstraction and FD/PD with k = 0 are generally not SC-Recoverable. Thus, it is possible that during fence inference, BLENDER will report the program as impossible to fix.
Benchmarks
To evaluate our tool, we chose various classic concurrent algorithms such as well-known mutual exclusion algorithms (mutex) and synchronization barrier algorithms. All algorithms were exercised in a loop by two concurrent processes ("repeated entry"):
• Dekker's Algorithm [11]. To evaluate both inference and verification we used two versions: Dek0 has no added fences and is incorrect under the PSO memory model; Dek2 has two added fences and is known to be correct.
• Peterson's Algorithm [31], using two versions, Pet0 and Pet2.
• A variation of Lamport's Bakery [19], using two versions, Lam0 and Lam2. To make this algorithm finite-space we manually bounded the maximum ticket number at 2.
• Lamport's Fast Mutex [21], using two versions, Fast0 and Fast3.
• CLH queue lock [26], using two versions, CLH0 and CLH2.
• Centralized sense-reversing synchronization barrier [14], using two versions, Sense0 and Sense1.
For the mutual exclusion algorithms, the specification is that there cannot be more than one process inside the critical section. "Release semantics" for operations within the critical section are not enforced. The benchmarks were selected based on two criteria:
Novelty The benchmarks could not be handled by any of the previous approaches. For instance, as mutual exclusion algorithms inherently contain benign data races, using techniques like delay set analysis [34] would result in a gross over-estimation of the required fences. Furthermore, some of the benchmarks, for instance Dek and Fast, contain benign triangular data races (as defined in [28]). Thus, even if we focus squarely on the TSO memory model, we could not use the results of [28] to establish the correctness of the algorithms by focusing only on sequentially consistent executions. Finally, all of the benchmarks contain unbounded spin-loops, and as such, they cannot be handled directly using the bounded techniques of [18] or [5].
Simplicity Our focus in this work has been abstracting the effect of the relaxed memory model in isolation from other sources of unboundedness. Hence, we chose our algorithms to be finite-state when executed under the SC model. We defer the problem of verifying infinite-state programs using abstract memory models (e.g., by composing our abstractions with heap or predicate abstractions) to future work.
Verification
Tab. 1 shows the verification results produced by BLENDER with three abstractions. The programs we used here are the ones we know to be correct under the concrete PSO semantics, that is, appropriate fences have been placed in advance. All verification runs completed within 30 seconds. Each entry in the table contains the total number of states explored (in thousands), together with a mark indicating whether verification succeeded or a spurious (abstract) counter-example was found. In some of the runs of CLH2, BLENDER exhausted the available memory, and thus we do not report the state-space size. However, in both those cases an (abstract) counter-example was found before BLENDER ran out of memory.
Table 1. Verification results and number of states (in thousands).
Discussion
As Tab. 1 shows, none of the correct examples could be verified using the naive set abstraction; however, all of them could be verified using FD with k = 1. Since verification of all examples successfully completed with FD, there was no need to use the PD abstraction. The table also shows that FD_{k=0} generated spurious counterexamples for CLH2 and Fast3 but not for the other algorithms. When k = 0, the partial-coherence abstraction (FD_{k=0}) reduces to the set abstraction with recency information. This is enough to verify the simpler algorithms, but it fails on the more complex ones. The example of Lamport's fast mutex is particularly interesting, as it demonstrates the type of executions possible with non-SC-recoverable abstractions. Consider the code in Fig. 7. In this implementation a process can enter the critical section either along the fast path (the if condition in line 15 is false) or along the slow path (the condition is true). Under an abstract model with k = 0, the following execution is possible:
• Process 1 enters the critical section along the fast path, executes it, and runs until line 29.
• Process 1 executes line 29. At this point S1(y) = {0}.
• Process 1 flushes y non-destructively, using the FLUSH-SN rule. Now G(y) = 0.
• Process 2 enters the critical section. Since G(y) = 0 it enters along the fast-path setting y = 2 in the process. This is flushed destructively using the FLUSH-SD rule. At this point G(y) = 2, S1(y) = {0}, S2(y) = ∅.
• Process 1 resumes. It first performs a flush of y, setting G(y) = 0. Then it proceeds to enter the critical section again, using the fast path.
This execution relies on the fact that p1 only stored the value 0 to y once, but this store is flushed twice. In effect, p2 observed this store as if it happened before its own, and p1 observed it as if it happened after the store of p2. This coherence violation would have been prevented if we kept more information about the content of the buffer, by using k > 0. Indeed, with k = 1, Fast3 passes verification.
Figure 7. A version of Lamport's fast mutex algorithm for 2 processors. The code given is for process i.
Inference
In Tab. 2 we show the state-space size and inference results for 5 of the under-fenced implementations. Each entry is marked with one of three outcomes: the optimal fences were inferred, sub-optimal fences were inferred, or BLENDER was unable to infer fences because, according to the analysis, any fence placement would leave the program incorrect. M appears if BLENDER ran out of memory.
Table 2. Inference results and number of states (in thousands).
Discussion
Initially, we used BLENDER to perform fence inference with abstractions FD_{k=0} and FD_{k=1}. However, BLENDER ran out of memory for Lam0, Fast0, and Fast1c. Using the partially disjunctive abstraction PD_{k=0} enabled us to run the inference algorithm for both Lam0 and Fast1c and obtain a sound fence placement for both. Furthermore, despite the loss of precision in the PD abstraction, in both cases the inferred fences are not trivial.
Peterson's Algorithm
Our results for Peterson's algorithm demonstrate the inherent tradeoffs between inference optimality and abstraction precision:
• With the FD abstraction BLENDER was able to infer the optimal fence placement with k = 0. With the PD abstraction it required k = 2 and a much larger state-space.
• With the PD_{k=0} abstraction we can produce a smaller state space but the result is suboptimal: 3 fences are required instead of 2. In addition to the two fences shown in Fig. 2, another fence, immediately after the store in line 14, is inferred.
The same trade-off can also be observed when using a similar partial-coherence abstraction of the TSO model. For k = 0 and k = 1 a suboptimal fence placement is generated, while with k = 2 the result is optimal (for TSO).
Lamport's Fast Mutex
For both Fast0 and Sense0, we experienced a loss of precision when using a k value that is too small. In the case of Fast0, the inference algorithm reported the program as unfixable when using PD_{k=0}. This is because the counter-example presented for Fast3 under this abstract model cannot be fixed with any number of fences. Unfortunately, BLENDER was unable to build the state-space of Fast0 under PD_{k=1}. Thus, we ran a complementary set of experiments in which 1 of the 3 required fences was placed. Each of the 3 versions of Lamport's fast mutex (Fig. 7) that we ran had a single fence inserted: (i) between lines 5 and 6 (Fast1a), (ii) between lines 13 and 14 (Fast1b), (iii) between lines 29 and 30 (Fast1c). As expected, for all 3 programs, when running under PD_{k=0} the program was unfixable. However, in all 3 cases we were able to infer a correct fence placement using PD_{k=1}. Furthermore, for Fast1a and Fast1b the optimal placement of the two other fences was found when using PD_{k=2}. For Fast1c even with k = 2 the placement was still suboptimal. This demonstrates another example of the interplay between the placed fences and the precision of the required abstraction. Even though for Fast1c we could not infer the optimal fence placement using PD_{k=1}, had we placed the fences manually, this abstraction could have been used to verify them.
Related Work
Data-Race Freedom Guarantee A common technique to reduce the complexity of analyzing programs under relaxed memory models is to focus only on programs that have no data-races under sequentially consistent executions. For such programs the "fundamental property of memory models" [32] (also known as the DRF theorem) ensures that there can be no sequentially inconsistent executions. Owens (in [28]) studies a generalization of this theorem for the x86-TSO model. To guarantee correctness under this model one needs only to prove that sequentially consistent executions satisfy "triangular race-freedom", a property weaker than general data race-freedom. In our work, we focus on abstractions of arbitrary programs and, unlike these methods, we can handle programs that contain data-races, such as common lock-free algorithms and mutual exclusion primitives.
Checking Equivalence To Sequential Consistency In [6] and [7] algorithms are presented that can, based only on sequentially consistent executions, find violations of sequential consistency under the TSO and PSO memory models. Similarly, it is possible to place fences to preserve only SC executions using Delay Set Analysis [34], for instance as implemented in the Pensieve compiler [12, 22]. However, a violation of SC does not necessarily cause a violation of any high-level properties. Thus those algorithms are often needlessly conservative. Our approach, on the other hand, uses a high-level specification and allows a trade-off between precision and optimality of the solution.
Explicit Model Checking for Relaxed Memory Models
[17, 18, 30] describe explicit-state model checking under the Sparc RMO model. Among those, [18] focuses on fence inference. [15] also describes an explicit-state model checking and inference technique for the .NET memory model, but it suffers from significant technical drawbacks (cf. [18]). However, the techniques presented in [18] are not applicable in our setting. (i) The FENDER algorithm described in [18] can only infer fences for programs that are finite-state under the relaxed memory model, not merely under the sequentially consistent model. While this distinction might seem subtle, it is in fact significant. For example, FENDER relies on finite clients of lock-free data structures being finite-state. Unfortunately, there is no guarantee that a data structure that is lock-free under SC will stay lock-free under a relaxed model. More generally, any code that uses a spin-loop with a store in the loop body will always be infinite-state under a relaxed model, unless the store is followed by a fence. Since classical implementations of synchronization primitives use this code pattern, it is not possible to use FENDER to infer fences in those implementations. In contrast, the technique we present in this paper requires the input to be finite-state only under SC.
(ii) More technically, the algorithm is phrased in terms of execution buffer semantics. Adapting it to store-buffer semantics is challenging, especially in the abstract case.
In [5], Burckhardt et al. take a different approach to verification under RMM. Instead of working with operational memory models and explicit model-checking, they convert programs into a form that can be checked against an axiomatic model specification. This technique still suffers from the same limitation: it must unroll loops at a preprocessing stage. Thus it cannot verify programs that contain unbounded spinning. In contrast, our verification approach is based on abstract interpretation and is sound for any input program and any values of the abstraction parameters. An alternative approach, in the spirit of [5, 18], is to assume a bound on the size of buffers or the number of loop iterations. Combined with iterative increase of the bound, this may work for some examples, but not in the general case. In addition, using an abstraction is beneficial not only when the buffers are unbounded. Even if the buffer is bounded, the concrete state-space may simply be too large, while representing the buffers with full precision is in fact not important to the correctness of the algorithm.
Synchronization Inference In [38, 40] , the authors propose algorithms that automatically infer synchronization constructs such as atomic sections and conditional critical regions. These works assume sequential consistency and do not support weak memory models. The approach of [38] is close to our work in spirit, and deals with inferring synchronization under abstraction, but it enumerates traces explicitly, which does not scale to our setting.
In [37, 39] , inference of synchronization is performed by syntactic exploration of placements of atomic sections to create candidate algorithms, and using a backing verifier to attempt verification of each candidate. In principle, a similar approach can be employed for fence inference by exploring candidate algorithms with all possible fence placements. In contrast, our constraint-based approach lays the ground for inference of more advanced fences, such as fence per variable, and conditional fences, for which syntactic exploration will yield a non-feasible number of candidates.
Alternative Buffer Abstractions In [23] the authors use automata as symbolic representation of store buffers in the TSO memory model. Their approach uses an acceleration technique that does not guarantee termination. Furthermore, their automata-based representation preserves redundant information, and as noted by the authors themselves, ends up being too expensive to be of practical interest. Since store buffers are similar to FIFO channels in communicating FSMs (CFSMs), it is tempting to employ techniques from CFSMs in our context as well. These techniques include algorithms based on symbolic representation of channel content (e.g., [4] ), and conservative abstractions for FIFO channels (e.g., [13] ). Abstraction of FIFO channels, as presented in [13] , is similar in spirit to our approach in that it guarantees termination by using approximation. However, their abstraction preserves a slightly different kind of information than the information required for reasoning about store buffers. They use a regular abstraction of queue content and an expensive widening operation to establish correct usage of protocols. This is more than what is required in our setting in terms of characterization of buffer content, and often less than needed in terms of recency information. In contrast to these, our technique guarantees termination by using conservative approximation, and our abstractions are tailored to relaxed memory models.
Conclusions and Future Work
We present an approach for automatic verification and fence inference for concurrent programs running under relaxed memory models. Our approach is based on abstract interpretation, and its technical core is a family of partial-coherence abstractions that provide a (parametric) bounded representation for potentially unbounded store buffers. Our abstractions enable us to automatically verify concurrent algorithms without worrying about the size of the underlying store buffers. Because partial-coherence abstractions are designed to be SC-Recoverable, they can be used for automatic inference of memory fences. We have implemented our approach in a tool called BLENDER and applied it to verify several correctly-fenced concurrent algorithms and automatically infer fences in under-fenced versions of these algorithms. In the future, we plan to combine our abstractions with heap abstractions to enable verification of heap-manipulating programs under RMMs.
