Abstract. Symbolic model checking of parallel programs stands and falls with effective methods of dealing with the explosion of interleavings. We propose a dynamic reduction technique to avoid unnecessary interleavings. By extending Lipton's original work with a notion of bisimilarity, we accommodate dynamic transactions, and thereby reduce dependence on the accuracy of static analysis, which is a severe bottleneck in other reduction techniques. The combination of symbolic model checking and dynamic reduction techniques has proven to be challenging in the past. Our generic reduction theorem nonetheless enables us to derive an efficient symbolic encoding, which we implemented for IC3 and BMC. The experiments demonstrate the power of dynamic reduction on several case studies and a large set of SVCOMP benchmarks.
Introduction
The rise of multi-threaded software, a consequence of a necessary technological shift from ever higher frequencies to multi-core architectures, exacerbates the challenge of verifying programs automatically. While automated software verification has made impressive advances recently thanks to novel symbolic model checking techniques, such as lazy abstraction [26, 6], interpolation [33], and IC3 [9] for software [7, 10], multi-threaded programs still pose a formidable challenge.
The effectiveness of model checking in the presence of concurrency is severely limited by the state explosion caused by thread interleavings. Consequently, techniques that avoid thread interleavings, such as partial order reduction (POR) [38, 41, 19] or Lipton's reduction [32], are crucial to the scalability of model checking, while also benefiting other verification approaches [18, 12, 15].
These reduction techniques, however, rely heavily on the identification of statements that are either independent or commute with the statements of all other threads, i.e. those that are globally independent. For instance, the single-action rule [31], a primitive precursor of Lipton reduction, states that a sequential block of statements can be considered an atomic transaction if all but one of the statements are globally independent. Inside an atomic block, all interleavings of other threads can be discarded, thus yielding the reduction.
Identifying these globally independent statements requires non-local static analyses. In the presence of pointers, arrays, and complicated branching structures, however, the results of an up-front static analysis are typically extremely conservative, and thus a severe bottleneck for good reduction. Fig. 1 shows an example with two threads (T1 and T2). Let's assume static analysis can establish that pointers p and q never point to the same memory throughout the program's (parallel) execution. This means that statements involving the pointers are globally independent, hence they globally commute, e.g. an interleaving *p++; *q = 1 always yields the same result as *q = 1; *p++;. Assuming that *p++; is also independent of the other statements from T2 (b = 2 and c = 3), we can reorder any trace of the parallel program to a trace where *p++ and *q = 2 occur consecutively without affecting the resulting state. The figure shows one example. Therefore, a syntactic transformation from *p++; *q = 2 to atomic{*p++; *q = 2} is a valid static reduction.
Still, it is often hard to prove that pointers do not overlap throughout a program's execution. Moreover, in many cases, pointers might temporarily overlap at some point in the execution. For instance, assume that initially p points to the variable b. This means that statements b = 2 and *p++ no longer commute, because b = 2; b++ yields a different result than b++; b = 2. Nevertheless, if b = 2 already happened, then we can still swap instructions and achieve the reduction as shown in Fig. 1. Traditional, static reduction methods cannot distinguish whether b = 2 already happened and therefore yield no reduction. Sec. 2 provides various other real-world examples of hard cases for static analysis.
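The order dependence between b = 2 and the increment of b through p can be checked directly. The following is a minimal sketch, not part of the paper's tooling: the two statements are modelled as functions on the value of b, and the helper names assign_two, increment and commute_from are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Models "b = 2" from thread T2 as a function on the value of b. */
int assign_two(int b) { (void)b; return 2; }

/* Models the increment of b through p, for the case that p points to b. */
int increment(int b) { return b + 1; }

/* Returns true if running f then g from b0 gives the same value of b
 * as running g then f from b0. */
bool commute_from(int (*f)(int), int (*g)(int), int b0) {
    assert(f != NULL && g != NULL);
    return g(f(b0)) == f(g(b0));
}
```

Starting from b = 0, the assignment followed by the increment yields 3, whereas the increment followed by the assignment yields 2, so commute_from(assign_two, increment, 0) is false. Two increments, by contrast, commute from any starting value.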
In Sec. 4.2, we propose a dynamic reduction method that is still based on a similar syntactic transformation. Instead of merely making sequences of statements atomic, it introduces branches as shown in Fig. 1 (T1'). A dynamic commutativity condition determines whether the branch with or without reduction is taken. In our example, the condition checks whether the program counter of T2 ($pc_{T2}$) still points to the statement b = 2 ($pc_{T2} == 1$). In that case, no reduction is performed; otherwise the branch with reduction is taken. In addition to conditions on the program counters, we provide other heuristics comparing pointer and array values dynamically.
The instrumented code (T1'), however, poses one problem: the branching condition no longer commutes with the statement that enables it. In this case, the execution of b = 2 disables the condition, thus before executing b = 2, T1' ends up at Line 2, whereas after b = 2 it ends up at Line 4 (see Fig. 2). To remedy this, we require in Sec. 4.3 that the instrumentation guarantees that the target states in both branches are bisimilar. Fig. 3 shows that locations 2 and 4 of T1' are bisimilar, written $2 \cong 4$, which implies that any statement executable from the one is also executable from the other, ending again in a bisimilar location ($3 \cong 5$). As bisimilarity is preserved under parallel composition, e.g. $(4, 2) \cong (2, 2)$, we can prove the correctness of our dynamic reduction method in Sec. A.
The benefit of our syntactic approach is that the technique can be combined with symbolic model checking techniques (Sec. 5 provides an encoding for our lean instrumentation). Thus far, symbolic model checkers only supported more limited and static versions of reduction techniques, as discussed in Sec. 7.
We implemented the dynamic reduction and encoding for LLVM bitcode, mainly to enable support for C/C++ programs without dealing with their intricate semantics (the increased instruction count of LLVM bitcode is mitigated by the reduction). The encoded transition relation is then passed to the Vienna Verification Tool (VVT) [24], which implements both BMC and IC3 algorithms extended with abstractions [7]. Experimental evaluation (Sec. 6) shows that dynamic reduction can yield several orders of magnitude gains in verification times.
Motivating Examples
Lazy initialization. We illustrate our method with the code in Fig. 4. The main function starts two threads executing the worker thread function, which processes the contents of data in the for loop at the end of the function. Using a common pattern, a worker thread lazily delays the initialization of the global data pointer until it is needed. It does this by reading some content from disc and setting the pointer atomically via a compare-and-swap operation (CAS) at label W (whose semantics here is an atomic C-statement: if (data==NULL) { data = tmp; return 1; } else return 0;). If it fails (returns 0), the locally allocated data is freed as the other thread has won the race.
The subsequent read access at label R is only reachable once data has been initialized. Consequently, the write access at W cannot possibly interfere with the read accesses at R, and the many interleavings caused by both threads executing the for loop can safely be ignored by the model checker. This typical pattern is, however, too complex for static analysis to identify efficiently, causing the model checker to conservatively assume conflicting accesses and preventing any reduction.
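The lazy initialization pattern described above can be sketched with C11 atomics. This is an illustrative reconstruction under stated assumptions, not the listing of Fig. 4: the function names lazy_init and read_item, the element type, and the use of calloc in place of reading from disc are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Shared pointer, zero-initialized (NULL) and set by whichever caller wins. */
_Atomic(int *) data;

/* Tries to publish a freshly allocated buffer of n ints.
 * Returns 1 if this call initialized data, 0 if data was already set
 * (the local buffer is then freed), and -1 if the allocation failed. */
int lazy_init(size_t n) {
    int *tmp = calloc(n, sizeof *tmp);   /* stands in for reading from disc */
    if (tmp == NULL) return -1;
    int *expected = NULL;
    /* Label W: atomic "if (data==NULL) { data = tmp; return 1; } else return 0;" */
    if (atomic_compare_exchange_strong(&data, &expected, tmp)) return 1;
    free(tmp);                           /* the other thread has won the race */
    return 0;
}

/* Label R: read access, only meaningful after data has been initialized. */
int read_item(size_t i) {
    int *d = atomic_load(&data);
    assert(d != NULL);
    return d[i];
}
```

Once lazy_init has returned in a thread, data is non-NULL and no later CAS at W can succeed, which is why the reads at R cannot be disturbed by the write at W.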
Hash table. The code in Fig. 5 implements a lockless hash table (from [30]) inserting a value v by scanning the bucket array T starting from hash, the hash value calculated from v. If an empty bucket is found (T[index]==E), then v is atomically inserted using the CAS operation. If the bucket is not empty, the operation checks whether the value was already inserted in the bucket (T[index] == v). If that is not the case, then it probes the next bucket of T until either v is found to be already in the table, or it is inserted in an empty slot, or the table is full. This basic bucket search order is called a linear probe sequence.
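The find-or-put operation with a linear probe sequence can be sketched as follows. This is a minimal illustration under stated assumptions, not the listing of Fig. 5: the table size N, the empty marker E = 0, the modulo hash function and the enum result are hypothetical, and the emptiness check is folded into the CAS, which succeeds only on an empty bucket.

```c
#include <assert.h>
#include <stdatomic.h>

#define N 8          /* table size (assumption) */
#define E 0          /* marker for an empty bucket; stored values are nonzero */

enum result { FOUND, INSERTED, FULL };

_Atomic int T[N];    /* zero-initialized, so every bucket starts out as E */

enum result find_or_put(int v) {
    assert(v != E);
    unsigned hash = (unsigned)v % N;          /* hypothetical hash function */
    for (unsigned i = 0; i < N; i++) {
        unsigned index = (hash + i) % N;      /* linear probe sequence */
        int expected = E;
        /* Atomically claim the bucket if it is still empty. */
        if (atomic_compare_exchange_strong(&T[index], &expected, v))
            return INSERTED;
        /* CAS failed: expected now holds the current bucket content. */
        if (expected == v)
            return FOUND;
    }
    return FULL;                              /* every bucket holds another value */
}
```

Note that a bucket is written only by a successful CAS from E, so a non-empty bucket never changes again; this is the monotonic behavior that the dynamic reduction exploits.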
A thread performing find-or-put(25), for instance, merely reads buckets T[2] to T[5]. However, other threads might write an empty bucket, thus causing interference. To show that these reads are independent, the static analysis would have to demonstrate that the writes happen to different buckets. Normally this is done via alias analysis that tries to identify the buckets that are written to (by the CAS operation). However, because of the hashing and the probe sequence, such an analysis can only conclude that all buckets may be written. So all operations involving T will be classified as non-movers, including the reads. However, if we look at the state of individual buckets, it turns out that a common pattern is followed using the CAS operation: a bucket is only written when it is empty; thereafter it doesn't change. In other words, when a bucket T[i] does not contain E, then any operation on it is a read and consequently is independent.
Fig. 6 (surviving fragment of the listing):
    { p1 = &x; p2 = &y; } else c: { p1 = &y; p2 = &x; }
    pthread_create(worker, p1); // T1
    pthread_create(worker, p2); // T2
    pthread_join(t1); pthread_join(t2);
    return x+y; }
Load balancing. Fig. 6 shows a simplified example of a common pattern in multi-threaded software, load balancing: the work to be done (counting to 2048) is split up between two threads (each of which counts to 1024). The work assignment is represented by pointers p1 and p2, and a dynamic hand-off to one of the two threads is simulated using non-determinism (the first if branch). Static analysis cannot establish the fact that the partitions are independent, because they are assigned dynamically. But because the pointer is unmodified after assignment, its dereference commutes with that in other worker threads.
In Sec. 4, we show how the discussed operations in all three examples can become dynamic movers, allowing for more reduction.
Preliminaries
A concurrent program consists of a finite number of sequential procedures, one for each thread i. We model the syntax of each thread i by a control flow graph (CFG) $G_i = (V_i, \delta_i)$ with $\delta_i \subseteq V_i \times A \times V_i$ and A being the set of actions, i.e., statements. $V_i$ is a finite set of locations, and $(l, \alpha, l') \in \delta_i$ are (CFG) edges. We abbreviate the actions for a thread i with $\Delta_i$.
Domains. A state of the concurrent system is composed of (1) a location for each thread, i.e., a tuple of thread locations (the set Locs contains all such tuples), and (2) a data valuation, i.e., a mapping from variables (Vars) to data values (Vals). We take Data to be the set of all data valuations. Hence, a state is a pair $\sigma = (pc, d)$ where $pc \in \prod_i V_i$ and $d \in Data$. The locations in each CFG correspond to the values of the thread-local program counters. In particular, the global locations correspond to the global program counter pc, a tuple with $pc_i \in V_i$ the thread-local program counter for thread i. We use
Each possible action α semantically corresponds to a binary relation over Data × Data representing the evolution of the data part of a state under the transition labelled by α. We call this relation the transition relation of the statement α and refer to both the statement and its relation simply as α. We also use several simple statements from programming languages, such as C, as actions.
The semantics of a concurrent program consisting of a finite number of threads, each with CFG $G_i = (V_i, \delta_i)$, is a transition system with data (TS) $C = (S, \to)$ with $S = Locs \times Data$, $Locs = \prod_i V_i$ and $\to\, = \bigcup_i \to_i$ where $\to_i$ is given by $(pc, d)$
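The thread-local step relation can be spelled out as follows. This is a standard formulation that agrees with the surrounding definitions (thread i follows one of its CFG edges, the data evolves according to the action, and all other program counters stay unchanged); it is stated here as an assumption rather than as the authors' exact formula.

```latex
(pc, d) \to_i (pc', d')
\;\iff\;
\exists \alpha :\;
(pc_i, \alpha, pc'_i) \in \delta_i
\;\wedge\; (d, d') \in \alpha
\;\wedge\; \forall j \neq i :\; pc'_j = pc_j
```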
Hence, the concurrent program is an asynchronous execution of the parallel composition of all its threads. Each step (transition) is a local step of one of the threads. Each thread i has a unique initial location $pc_{0,i}$, and hence the TS has one initial location $pc_0$. Moreover, there is an initial data valuation $d_0$ as well. Hence, the initial state of a TS is $\sigma_0 \triangleq (pc_0, d_0)$. Since we focus on preserving simple safety properties (e.g. assertions) in our reduction, w.l.o.g., we require one sink location per thread, $l_{sink}$, to represent errors (it has no outgoing edges and no self-loop). Correspondingly, error states of a TS are those in which at least one thread is in the error location.
In the following, we introduce additional notation for states and relations. Let $R \subseteq S \times S$ and $X \subseteq S$. Then the left restriction of R to X is the relation $R \cap (X \times S)$ and the right restriction is $R \cap (S \times X)$. Finally, the complement of X is denoted $\overline{X} \triangleq S \setminus X$ (the universe of all states remains implicit in this notation).
Commutativity. We let $R \circ Q$ denote the sequential composition of two binary relations R and Q, defined as: $\{(x, z) \mid \exists y : (x, y) \in R \wedge (y, z) \in Q\}$. Moreover, let:
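Right and left commutativity of two relations R and Q can be written with the sequential composition defined above. The formulation below is the standard one and matches how the terms are used in the rest of this section; it is given as an assumption, and no particular operator symbol is implied.

```latex
R \text{ right commutes with } Q \;\iff\; R \circ Q \subseteq Q \circ R
\\
R \text{ left commutes with } Q \;\iff\; Q \circ R \subseteq R \circ Q
\\
R \text{ and } Q \text{ commute} \;\iff\; R \circ Q = Q \circ R
```

Read operationally, the first line says that whenever an R-step is followed by a Q-step, the same end state can also be reached by doing the Q-step first and the R-step second.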
Illustrated graphically for transition relations, $\to_i$ right commutes with $\to_j$ (diagram not reproduced). Conversely, $\to_j$ left commutes with $\to_i$. The typical example of (both) commuting operations $\xrightarrow{\alpha}_i$ and $\xrightarrow{\beta}_j$ is when α and β access a disjoint set of variables. Two operations may commute even if both access the same variables, e.g., if both only read or both (atomically) increment/decrement the same variable.
Lipton Reduction. Lipton [32] devised a method that merges multiple sequential statements into one atomic operation, thereby radically reducing the number of states reachable from the initial state, as Fig. 7 shows for a transition system composed of two (independent, thus commuting) threads.
Both-movers are transitions that are both left and right movers, whereas non-movers are neither. The sequential composition of two movers is also a corresponding mover, and vice versa. Moreover, one may always safely classify an action as a non-mover, although having more movers yields better reductions.
Lipton reduction only preserves halting. We present Lamport's [31] version, which preserves safety properties such as ϕ: a sequence of statements $\alpha_1; \ldots; \alpha_n$ can be replaced by atomic{$\alpha_1; \ldots; \alpha_n$} (i.e. a compound statement with the same local behavior), if for some $1 \le k < n$:
L1. statements before $\alpha_k$ are right movers, i.e.: $\alpha_1$
L2. statements after $\alpha_k$ are left movers, i.e.:
The action $\alpha_k$ might interact with other threads and is therefore called the commit in database terminology [36]. Actions preceding it are called pre-commit actions and gather resources, such as locks. The remaining actions are post-commit actions that (should) release these resources. We refer to pre(/post)-commit transitions including source and target states as the pre(/post) phase.
Dynamic Reduction
The reduction outlined above depends on the identification of movers. To determine whether a statement is a mover, the analysis has to consider all other statements in all other threads. Why is the definition of movers so strong? The answer is that 'movability' has to be preserved in all future computations for the reduction not to miss any relevant behavior. For instance, consider the system composed of x=0; y=2 and y=1; x=y with initial state $\sigma_0 = (pc_0, d_0)$, $d_0 = (x = 0, y = 0)$ and $pc_0 = (1, 1)$, using line numbers as program counters. Fig. 8 shows the TS of this system, from which we can derive that x=0 and y=1 do not commute except in the initial state (see the diamond structure of the top 3 and the middle state). Now assume we have a dynamic version of Lipton reduction that allows us to apply the reduction atomic{x=0; y=2;} and atomic{y=1; x=y;}, but only in the initial state where both x=0 and y=1 commute. The resulting reduced system, as shown with bold arrows, now discards various states. Clearly, a safety property such as ¬(x = 1 ∧ y = 2) is no longer preserved by such a reduction, even though x=0 and y=1 never disable the property (L4 in Sec. 3 holds).
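The loss of the safety property in this example can be reproduced by brute force. The sketch below is an illustration only, with hypothetical names: it explores every interleaving of the two threads and, separately, the reduced system in which each thread runs as a single atomic block, and it reports whether a state with x = 1 and y = 2 is reachable.

```c
#include <assert.h>
#include <stdbool.h>

struct state { int x, y; };

const struct state initial_state = {0, 0};

/* Executes statement number pc (0 or 1) of thread t on s.
 * Thread 0: x=0; y=2      Thread 1: y=1; x=y */
void step(struct state *s, int t, int pc) {
    assert((t == 0 || t == 1) && (pc == 0 || pc == 1));
    if (t == 0) { if (pc == 0) s->x = 0; else s->y = 2; }
    else        { if (pc == 0) s->y = 1; else s->x = s->y; }
}

/* The property is violated in a state with x = 1 and y = 2. */
bool bad(struct state s) { return s.x == 1 && s.y == 2; }

/* Explores all interleavings from s, where pc0 and pc1 count the
 * statements each thread has already executed. */
bool violates_full(struct state s, int pc0, int pc1) {
    if (bad(s)) return true;
    bool v = false;
    if (pc0 < 2) { struct state n = s; step(&n, 0, pc0); v = v || violates_full(n, pc0 + 1, pc1); }
    if (pc1 < 2) { struct state n = s; step(&n, 1, pc1); v = v || violates_full(n, pc0, pc1 + 1); }
    return v;
}

/* Same search, but each thread runs both statements as one atomic block,
 * so only the states between blocks are visited. */
bool violates_reduced(struct state s, bool done0, bool done1) {
    if (bad(s)) return true;
    bool v = false;
    if (!done0) { struct state n = s; step(&n, 0, 0); step(&n, 0, 1); v = v || violates_reduced(n, true, done1); }
    if (!done1) { struct state n = s; step(&n, 1, 0); step(&n, 1, 1); v = v || violates_reduced(n, done0, true); }
    return v;
}
```

In the full system the interleaving x=0; y=1; x=y; y=2 ends in x = 1, y = 2, so violates_full(initial_state, 0, 0) is true, while violates_reduced(initial_state, false, false) is false because the reduced system only reaches (0, 2) and (1, 1).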
The mover definition comparing all behaviors of all other threads is thus merely a way to (over)estimate the computational future. But we can do better, without precisely calculating the future computations (which would amount to the very task that the reduction is trying to avoid in the first place). For example, unreachable code should not negatively impact movability of statements in the entire program. By the same line of reasoning, we can conclude that lazy initialization procedures (e.g. Fig. 4) should not eliminate movers in the remainder of the program. Intuitively, one can run the program until after initialization, then remove the initialization procedure and restart the verification using that state as the new initial state. Similarly, reading unchanging buckets in the hash table of Fig. 5 should not cause interference, and dynamically assigned, yet disjoint, pointers still do not overlap, so these bucket reads and pointer dereferences could also become movers after initialization. The current section provides a dynamic notion of movability and a generalized reduction theorem that can use this new notion. Proofs of all lemmas and theorems can be found in Sec. A.
Dynamic Movers
Recall from the example of Fig. 1 that we introduce branches in order to guide the dynamic reductions. This section formalizes the concept of a dynamic both-moving condition, guarding these branches. We only consider both-movers for ease of explanation. Nonetheless, Sec. A considers left and right movers.
Definition 1 (Dynamic both-moving conditions). A state predicate (a subset of states) $c_\alpha$ is a dynamic both-moving condition for a CFG edge $(l, \alpha, l') \in \delta_i$, if for all $j \neq i$:
One key property of a dynamic both-moving condition for $\alpha \in \Delta_i$ is its monotonicity: in the transition system, the condition $c_\alpha$ can be enabled by remote threads ($j \neq i$), but never disabled. While the definition allows us to define many practical heuristics, we have identified the following both-moving conditions as useful. When these fail, $c_\alpha$ := false can be taken to designate α as a non-mover statically. Although our heuristics still rely on static analysis, the required information is easier to establish (e.g. with basic control-flow analysis and the identification of CAS instructions) than for the global mover condition.
Reachability. As in Fig. 4, interfering actions, such as the write at label W, may become unreachable once a certain program location has been reached. The dynamic condition for the read $\alpha \triangleq$ process(data[i + tid * 512]) of thread i therefore becomes: $c_\alpha := \bigwedge_{j \neq i} \bigwedge_{l \in L(j)} pc_j \neq l$, where L(j) is the set of all locations in $V_j$ that can reach the location with label W in $V_j$. For example, for thread T1 we obtain $c_\alpha$ := pc_T2 != a,b,c,d,W (abbreviated). Deriving this condition merely requires a simple reachability check on the CFG.
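The reachability check behind this heuristic can be sketched as a backward fixed point on the CFG. The CFG below is hypothetical (a chain a, b, c, d, W, R with a self-loop on R), chosen only so that the result matches the abbreviated condition in the text; the function names can_reach and c_alpha are likewise assumptions.

```c
#include <assert.h>
#include <stdbool.h>

enum loc { A, B, C, D, W, R, NLOC };

/* Hypothetical worker CFG: a -> b -> c -> d -> W -> R, and R -> R. */
const bool edge[NLOC][NLOC] = {
    [A][B] = true, [B][C] = true, [C][D] = true,
    [D][W] = true, [W][R] = true, [R][R] = true,
};

/* Returns a bitmask of all locations from which target is reachable
 * (including target itself), computed as a backward fixed point. */
unsigned can_reach(enum loc target) {
    unsigned set = 1u << target;
    bool changed = true;
    while (changed) {
        changed = false;
        for (int u = 0; u < NLOC; u++)
            for (int v = 0; v < NLOC; v++)
                if (edge[u][v] && (set & (1u << v)) && !(set & (1u << u))) {
                    set |= 1u << u;
                    changed = true;
                }
    }
    return set;
}

/* Dynamic both-moving condition for the read at R in one thread: the
 * other thread must not be at a location that can still reach W. */
bool c_alpha(enum loc pc_other) {
    assert(pc_other < NLOC);
    return !(can_reach(W) & (1u << pc_other));
}
```

For this CFG, can_reach(W) contains exactly a, b, c, d and W, so the condition holds only once the other thread has moved on to R. Because no edge leads from R back into that set, the condition cannot be disabled again, which is the monotonicity required of a dynamic both-moving condition.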
Static pointer dereference. If pointers are not modified in the future, then their dereferences commute if they point to different memory locations. For thread T1 in the pointer example in Fig. 6, we obtain $c_\alpha$ := p1 != p2 && pc_T2 != a,b,c (here *p++ is the pointer dereference with p = p1).
Monotonic atomic. A CAS instruction CAS(p, a, b) is monotonic if its expected value a is never equal to the value b that it tries to write. Assuming that no other instructions write to the location to which p refers, this means that once it is set to b, it never changes again.
In the hash table example in Fig. 5 , there is only a CAS instruction writing to the array T. The dynamic moving condition is:
Lemma 1. The above conditions are dynamic both-moving conditions.
Instrumentation
Fig. 1 demonstrated how our instrumentation adds branches to dynamically implement the basic single-action rule. Lipton reduction, however, distinguishes between pre- and post-commit phases. Here, we provide an instrumentation that satisfies the constraints on these phases (see L1-L4 in Sec. 3). Roughly, we transform each
Respectively, there are external, pre-, and post-locations, plus two auxiliary pre- and post-locations for along branches.
2. Add edges/branches with dynamic moving conditions according to Table 1.
The rules in Table 1 precisely describe the instrumented edges in $G'_i$: for each graph part in the original $G_i$ (middle column), the resulting parts of $G'_i$ are shown (right column). As no non-movers are allowed in the post phase, R4 only checks the dynamic moving condition for all outgoing transitions of a post-location $l^L$ ... with c(la) ... true actual action can be executed (R1). If it succeeds, then the action commutes and can safely be executed while remaining in the post phase (R5). We do this from an intermediary post location $l^L_a$. Since transitions α thus need to be split up into two steps in the post phase, dummy steps need to be introduced in the pre phase (R1 and R2) to match this (R3); otherwise we lose bisimilarity (see the subsequent subsection). As an intermediary pre location, we use $l^R_a$. All new paths in the instrumented $G'_i$ adhere to the pattern: ...
Moreover, using the notion of location feedback sets (LFS) defined in Def. 2, R4 and R6 ensure that all cycles in the post phase contain an external state. This is because our reduction theorem (introduced later) allows non-terminating transactions as long as they remain in the pre-commit phase (it thus generalizes L3). Fig. 9 shows a simple example CFG with its instrumentation. The subsequent reduction will completely hide the internal states, avoiding exponential blowup in the TS (see Sec. 4.3).
Definition 2 (LFS). A location feedback set (LFS) for thread i is a subset $LFS_i \subseteq V_i$ such that for each cycle $C = l_1, \ldots, l_n, l_1$ in $G_i$ it holds that $LFS_i \cap C \neq \emptyset$. The corresponding (state) feedback set (FS) is: $C_i \triangleq \{(pc, d) \mid pc_i \in LFS_i\}$.
Corollary 1 ([29]). $\bigcup_i C_i$ is a feedback set in the TS.
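Whether a candidate set of locations is a location feedback set can be checked mechanically: every cycle contains a location of the set exactly when the CFG without those locations is acyclic. The sketch below illustrates this check on a small hypothetical CFG; the graph, the two candidate sets and all function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NLOC 4

/* Hypothetical CFG: 0 -> 1 -> 2 -> 1 (a loop) and 2 -> 3. */
const bool cfg_edge[NLOC][NLOC] = {
    [0][1] = true, [1][2] = true, [2][1] = true, [2][3] = true,
};

/* Two candidate sets: the loop head {1}, and the exit location {3}. */
const bool lfs_loop_head[NLOC] = {false, true, false, false};
const bool lfs_exit_only[NLOC] = {false, false, false, true};

/* DFS that ignores locations in lfs; returns true if a cycle avoiding lfs
 * is reachable from u. color: 0 = unvisited, 1 = on stack, 2 = done. */
bool dfs_finds_cycle(int u, const bool lfs[NLOC], int color[NLOC]) {
    assert(u >= 0 && u < NLOC);
    color[u] = 1;
    for (int v = 0; v < NLOC; v++) {
        if (!cfg_edge[u][v] || lfs[v]) continue;
        if (color[v] == 1) return true;
        if (color[v] == 0 && dfs_finds_cycle(v, lfs, color)) return true;
    }
    color[u] = 2;
    return false;
}

/* lfs is a location feedback set iff the CFG restricted to the locations
 * outside lfs contains no cycle. */
bool is_feedback_set(const bool lfs[NLOC]) {
    int color[NLOC] = {0};
    for (int u = 0; u < NLOC; u++)
        if (!lfs[u] && color[u] == 0 && dfs_finds_cycle(u, lfs, color))
            return false;
    return true;
}
```

Here is_feedback_set(lfs_loop_head) is true because the only cycle passes through location 1, whereas is_feedback_set(lfs_exit_only) is false because the cycle between locations 1 and 2 avoids location 3.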
The instrumentation yields the following 3/4-partition of states for all threads i:
Lemma 2. An error state is $\to$-reachable in the original system iff an error state is $\to'$-reachable in the instrumented system.
Recall the situation illustrated in Fig. 3 within the example in Fig. 1 . Rules R1, R2, and R4 of our instrumentation in Table 1 give rise to a similar problem as illustrated in the following.
Hence, our instrumentation introduces non-movers. Nevertheless, we can prove that the target states are bisimilar. This enables us to introduce a weaker notion of commutativity up to bisimilarity which effectively will enable a reduction along one branch (where reduction was not originally possible). The details of the reduction are presented in the following section. We emphasize that our implementation does not introduce any unnecessary non-movers.
Reduction
We now formally define the notion of thread bisimulation required for the reduction, as well as commutativity up to bisimilarity.
Definition 3 (thread bisimulation). An equivalence relation R on the states of a TS (S, →) is a thread bisimulation iff
Standard bisimulation [34,37] is an equivalence relation R which satisfies the property from Def. 3 when the indexes i of the transitions are removed. Hence, in a thread bisimulation, in contrast to standard bisimulation, the transitions performed by thread i will be matched by transitions performed by the same thread i. As we only make use of thread bisimulations, we will often refer to them simply as bisimulations.
Definition 4 (commutativity up to bisimulation). Let R be a thread bisimulation on a TS (S, →). The right and left commutativity up to R of the transition relation
Our reduction works on parallel transaction systems (PT), a specialized TS. While its definition (Def. 5) looks complicated, most rules are concerned with ensuring that all paths in the underlying TS form transactions, i.e. that they conform to the pattern $\sigma_1 \xrightarrow{\alpha_1}_i \cdots \xrightarrow{\alpha_k}_i \cdots \xrightarrow{\alpha_n}_i \sigma_{n+1}$, where $\alpha_k$ is the non-mover, etc. We have from the perspective of thread i that: $\sigma_1$ and $\sigma_{n+1}$ are external, $\forall 1 < x \le k : \sigma_x$ pre-commit, and $\forall k < x \le n : \sigma_x$ post-commit states. The rest of the conditions ensure bisimilarity and constrain error locations.
The reduction theorem, Th. 1, then tells us that reachability of error states is preserved (and reflected) if we consider only PT-paths between globally external states N. The reduction thus removes all internal states I where
Definition 5 (transaction system). A parallel transaction system PT is a transition system $TS = (S, \to)$ whose states are partitioned in three sets of phases and error states in one of the phases, for each thread i. For each thread i, there exists a thread bisimulation relation $\cong_i$. Additionally, the following properties hold (for all i, all $j \neq i$):
Theorem 1. The block-reduced transition relation of a parallel transaction system $PT = (S, \to)$ is defined in two steps:
We have $p \to^* q$ for $p \in N$ and $q \in E$ if and only if q is reachable from p under the block-reduced transition relation.
Our instrumentation from Table 1 in Sec. 4.2 indeed gives rise to a PT (Lemma 3) with the state partitioning from (Eq. 2-6). The following equivalence relation $\sim_i$ over locations becomes the needed bisimulation $\cong_i$ when lifted to states. (The locations in the rightmost column of Table 1 are intentionally positioned such that vertically aligned locations are bisimilar.)
The dynamic both-moving condition in Def. 1 is sufficient to prove (item 8-9). The LFS notion in Def. 2 is sufficient to prove post-phase termination (item 2).
All of the apparent exponential blowup of the added phases ($5^{|Threads|}$) is hidden by the reduction, as it only reveals the external states $N \triangleq \bigcap_i N_i$ (note that $S = I \uplus N$), and there is only one instrumented external location (replicated sinks can be eliminated easily with a more verbose instrumentation).
Block Encoding of Transition Relations
We implement the reduction by encoding a transition relation for symbolic model checking. Transitions encoded as SMT formulas may not contain cycles. Although our instrumentation conservatively eliminates cycles in the post-commit phase of transactions with external states, cycles (without external locations) can still occur in the pre-phase. To break these remaining cycles, we use a refined location feedback set $LFS_i$ of the instrumented CFG without external locations, $G'_i \setminus \{l^N \in V'_i\}$ (this also removes edges incident to external locations). Now, we can construct a new block-reduced relation. It resembles the block-reduced relation defined in Th. 1, except for the fact that the execution of thread i can be interrupted in an internal state $C_i$ ($LFS_i$ lifted to states) in order to break the remaining cycles.
Here, the use of $\to_i$ (from Th. 1) warrants that only thread i can transit from the newly exposed internal states $C_i \subseteq \bigcap_{j \neq i} N_j$. Therefore, by carefully selecting the exposed locations of $C_i$, e.g. only $l^R_a$, the overhead is limited to a factor of two. To encode the new relation, we identify blocks of paths that start and end in external or LFS locations, but do not traverse external or LFS locations, and encode them using large blocks [5]. This automatically takes care of disallowing intermediate states, except for the states $C_i$ exposed by the breaking of cycles. At the corresponding locations, we thus add constraints to the scheduler encoding to only allow the current thread to execute. To support pthreads constructs, such as locks and thread management, we use similar scheduling mechanisms.
Experiments
We implemented the encoding with dynamic reduction in the Vienna Verification Tool (VVT) [24, 23]. VVT implements CTIGAR [7], an IC3 [9] algorithm with predicate abstraction, and bounded model checking (BMC) [25]. VVT came fourth in the concurrency category of SVComp 2016 [3], the first year it participated, only surpassed by tools based on BMC or symbolic simulation.
We evaluated our dynamic reductions on the running examples and compared the running time of the following configurations:
- BMC with all dynamic reductions (BMC-dyn in the graphs);
- BMC with only static reductions and phase variables from [17] (BMC-phase);
- IC3 with all dynamic reductions (IC3-dyn); and
- IC3 with only static reductions and phase variables from [17] (IC3-phase).
We used a time limit of one hour for each run and ran each instance 4 times to even out the effects of non-determinism. Variation over the 4 runs was insignificant, so we omit plotting it. Missing values in the graphs indicate a timeout.
Hashtable. The lockless hash table of Fig. 5 is used together with the monotonic atomic heuristic in three experiments:
1. Each thread inserts one element into an empty hash table. The verification condition is that all inserted elements are present in the hash table after all threads have finished executing. We can see in Fig. 10a that the dynamic reduction benefits neither BMC nor IC3. This is because every thread changes the hash table, thus forcing an exploration of all interleavings. The overhead of using dynamic reductions, while significant in the BMC case, seems to be non-existent in IC3.
2. Every thread attempts to insert an already-present element into the hash table. The verification condition is that every find-or-put operation reports that the element is already present. Since a successful lookup operation doesn't change the hash table, the dynamic reduction now takes full effect: while the static reduction can only verify two threads for BMC and four for IC3, the dynamic reduction can handle six threads for BMC and more than seven for IC3.
3. Since both of the previous cases can be considered corner cases (the hash table being either empty or full), this configuration has half of the threads inserting values already present while the other half insert new values. While the difference between static and dynamic reductions is not as extreme as before, we can still see that we profit from dynamic reductions, being able to verify two more threads in the IC3 case.
Dynamic locking. To study the effect of lock pointer analysis, we use a benchmark in which multiple threads use a common lock to access two global variables. The single lock these threads use is randomly picked from a set of locks at the beginning of the program. Each thread writes the same value four times to both global variables. Because all threads use the same lock, after all threads terminate, the value of both global variables must be the same, which is the verification condition. We use our static pointer heuristic to determine that all global variable accesses are protected by the same lock, potentially allowing the critical section to become a single transaction. Fig. 10d shows that the dynamic reduction indeed kicks in and benefits both IC3 and BMC.
Load balancing. We use the load-balancing example (Fig. 6). It relies on the static pointer heuristic. We verified that the computed sum of the counters is indeed the expected result. Our experiment revealed that dynamic reductions reduce the runtime from 15 minutes to 97 seconds for two threads already.
SVComp. We also ran our IC3 implementation on the pthread-ext and pthread-atomic categories of the software verification competition (SVComp) benchmarks [4, 2]. In instances with an unbounded number of threads, we introduced a limit of three threads. To check the effect of different reduction strategies on the verification time, we tested the following reductions:
- dyn: Dynamic with all heuristics from Table 2.
- phase: Dynamic phases only (equal to [17]).
- static: Static (as in Sec. 3).
- nored: No reduction, all interleavings considered.
Fig. 11 shows that static Lipton reduction yields an average six-fold decrease in runtime when compared to no reduction. Enabling the various dynamic improvements (dyn, phase) does not show improvement over the static case (static), since most of the benchmarks are either too small or do not have opportunities for reductions, but it also adds little overhead (up to 7%). Comparing the nored case with the other cases shows the benefit of removing intermediate states.
Related Work
Lipton's reduction was refined multiple times [31, 20, 13, 11, 40]. It has recently been applied in the context of compositional verification [39]. Qadeer and Flanagan proposed reductions with dynamic phases [17] using phase variables to identify internal and external states and also provided a dynamic solution for determining locked regions. Their approach, however, does not solve the examples featured in the current paper and also relies on specialized deductions that are incompatible with IC3. Moreover, in [17], the phases of target locations of non-deterministic conditions are required to agree. This restriction is not present in our encoding.
Grumberg et al. [21] present underapproximation-widening, which iteratively refines an under-approximated encoding of the system. In their implementation, interleavings are constrained to achieve the under-approximation. Because refinement is done based on verification proofs, irrelevant interleavings will never be considered. The technique currently only supports BMC and the implementation is not available, so we did not compare against it.
Kahlon et al. [27] extend the dynamic solution of [17] by supporting a strictly more general set of lock patterns. They incorporate the transactions into the stubborn set POR method [42] and encode these in the transition relation in a similar fashion to [1]. Unlike ours, their technique does not support constructs other than locks.
While in fact it is sufficient for item 2 of Def. 5 to pinpoint a single state in each bottom SCC of the CFG, we use feedback sets because the encoding in Sec. 5 also requires them. Moreover, we take a syntactical definition for ease of explanation. Semantic heuristics for better feedback sets can be found in [29] and can easily be supported via state predicates.

Elmas et al. [15] propose dynamic reductions for type systems, where the invariant is used to weaken the mover definition. The over-approximations performed in IC3, however, decrease the effectiveness of such approaches.
In POR, similar techniques have been employed in [14] and the earlier-mentioned necessary enabling sets of [19, 41]. Completely dynamic approaches exist [16], but symbolic versions remain highly static [1]. Notable exceptions are peephole and monotonic POR by Wang et al. [44, 28]. Like sleep sets [19], however, these only reduce the number of transitions, not the number of states, which is crucial, e.g., in IC3 to cut counterexamples to induction [9]. Cartesian POR [22] is a dynamic form of Lipton reduction for explicit-state model checking.
Conclusions
Our work provides a novel dynamic reduction for symbolic software model checking. To accomplish this, we presented a reduction theorem generalized with bisimulation, facilitating various dynamic instrumentations as our heuristics show. We demonstrated its effectiveness with an encoding used by the BMC and IC3 algorithms in VVT. 
A Dynamic Reduction
The current appendix provides proofs for Sec. 4. Some definitions presented here are more general in that they support both dynamic right and left movers, whereas Sec. 4 only considers dynamic both movers for ease of explanation. The instrumentation is consequently also adapted to support dynamic left and right movers. Moreover, Def. 5 is decomposed into Def. 6, Def. 7 and Def. 11 in order to introduce these concepts in a top-down, incremental fashion. Several lemmas are proved along the way, to finally conclude soundness and completeness of the axiomatized reduction in Th. 4. Lemma 2 goes on to show that our instrumentation preserves errors, and Lemma 3 shows that it fulfills the reduction axioms. The instrumentation can therefore be used as a valid basis for obtaining reductions.
The inspiration for our reduction theorem comes from [17] , which in turn is based on a string of earlier works (see Sec. 7). Its generalization with bisimulation is necessary to accommodate the dynamic behavior of movers, which causes entire atomic sections to "switch phase" as explained in Sec. 4.3.
Definition 6 (phase-annotated transition system).
A parallel phase-annotated transition system P T T S is a transition system $TS = (S, \to)$ with a (parallel) transition relation $\to\ = \bigcup_i \to_i$ whose states are partitioned in three sets (of phases), for each thread $i$, and for all threads $i$ there exists a thread bisimulation $\cong_i$. Additionally, we require the following properties (for all $i$ and all $j \neq i$):
($\cong_i$ preserves phase-equality for $j$)
Note that all transitions in a parallel phase-annotated transition system are distinguishable, i.e., can be assigned to unique threads performing them (due to the parallel transition relation property and item 2). We will apply this feature silently and assign threads to steps in paths whenever needed.
We next define equivalence relations $\cong_X$ for $X \subseteq Threads$ and let $\cong\ \triangleq\ \cong_{Threads}$. We put
Hence, $\cong_X$ is the equivalence closure of the union of all $\cong_i$. Note that (1)
The following properties are immediate.
Corollary 2. The relation $\cong_X$ is a thread bisimulation. As a consequence, it is also a standard bisimulation of $(S, \to)$.
Corollary 3. For any path starting in $\sigma$, if $\sigma'$ is such that $\sigma \cong_X \sigma'$ for $X \subseteq Threads$, then there is a (bisimilar) path from $\sigma'$ where the same threads transit in the same order.
Corollary 4. For a path $\sigma_1 \to \sigma_2 \to \dots$, let the phase trace for $i$ be a sequence $X_1, X_2, \dots$ with $X_x = R \Leftrightarrow \sigma_x \in R_i$, etc. Bisimilar states $\sigma \cong_X \sigma'$ have equal phase traces for all $i$ such that $i \notin X$.
Let $N_T \triangleq \bigcap_{i \in T} N_i$ and $N = N_{Threads}$.
Definition 7 (parallel transaction system). A parallel transaction system T S T S is a phase-annotated transition system P T T S (S, →) s.t. for all $i$ and all $j \neq i$:
The transaction relation is $\hookrightarrow\ \triangleq \bigcup_i \hookrightarrow_i$ with $\hookrightarrow_i\ \triangleq \{(\sigma, \sigma') \in\ \to_i \mid \sigma \in \bigcap_{j \neq i} N_j\}$ ($i$ only transits when all $j$ are in external).
Th. 2 shows that for all paths in a transaction system there also exists a bisimilar transaction path ($\hookrightarrow$). This enables reduction by removing interleavings. But first, we prove several intermediate lemmas, before we can introduce the error states used in the theorem to connect divergent paths between the two transition relations.
The above definition differs on some key points from that of Flanagan and Qadeer: Most importantly, it lifts the commutativity condition from the phase change. This means that while transitions should still commute, their phase change might not, crucially enabling the dynamic instrumentation. To preserve correctness while allowing phase changes when moving transitions, the system is constructed in such a way that these phase-unequal states are still bisimilar as the above corollaries demonstrate.
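To make the restriction behind the transaction relation concrete, the following explicit-state sketch keeps only those steps of a thread that start in a state where every other thread is in its external phase. It is an illustration under simplifying assumptions: a state is reduced to a tuple of per-thread phases, and the names `State`, `Step`, `all_others_external` and `transaction_relation` are hypothetical.

```python
from typing import List, Tuple

# Assumption: a state is modeled only by the phase of each thread: "N", "R" or "L".
State = Tuple[str, ...]
Step = Tuple[State, int, State]  # (source state, transiting thread, target state)


def all_others_external(state: State, i: int) -> bool:
    """True if every thread other than i is in the external phase N."""
    return all(phase == "N" for j, phase in enumerate(state) if j != i)


def transaction_relation(steps: List[Step]) -> List[Step]:
    """Keep the steps of thread i that start while all other threads are external."""
    return [(s, i, t) for (s, i, t) in steps if all_others_external(s, i)]


steps = [
    (("N", "N"), 0, ("R", "N")),  # kept: thread 1 is external
    (("R", "N"), 1, ("R", "R")),  # dropped: thread 0 is inside a transaction
    (("R", "N"), 0, ("L", "N")),  # kept: thread 0 continues its own transaction
]
reduced = transaction_relation(steps)
```

In this toy example the interleaving of thread 1 in the middle of thread 0's transaction is the step that the restriction removes.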
A.1 Incomplete Transaction Paths
To reason over (transaction) paths where some threads are still in the middle of their transaction (open), we introduce the notion of open transaction sequences (ots). An ots consists of two concatenated subpaths: The first (called cts) has only transactions in committed form (ending in post-commit or external phase), while the second (called uts) only contains uncommitted transitions (ending in the pre-commit phase). For example, the phase trace for 3 different threads of the same path could look like:
1: NNNRRLLLNRRLLLLLLNNNNNNNNNNnRRRRRR
2: NNNNNNNNNNNNNNNNNNRRRRRLLLLnNNNNRR
3: NLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL
Here the small letters indicate phases of states that are part of both the cts (where it is the last state of the cts) and the uts (the first state). We start with a definition of the uts, as it is easier to understand. We also prove two lemmas that are needed for correctness of the reduction in Sec. A.2.
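The split of a phase trace into its cts and uts parts can be sketched as follows. This is an illustration only: the helper names `split_at_marker` and `has_uts_shape` are hypothetical, the input trace is a shortened, made-up one in the style of the example above, and the shape check assumes that a thread transiting in the uts starts in its external phase and then stays in the pre-commit phase.

```python
import re


def split_at_marker(trace: str):
    """Split a phase trace at its first lowercase letter.

    The marked state belongs to both parts, so it is included in each.
    """
    idx = next(k for k, ch in enumerate(trace) if ch.islower())
    return trace[:idx + 1].upper(), trace[idx:].upper()


def has_uts_shape(uts_phases: str) -> bool:
    """True if the thread stays external and then, at most once, enters R."""
    return re.fullmatch(r"N+R*", uts_phases) is not None


# Hypothetical, shortened trace (not taken from the example above).
cts_part, uts_part = split_at_marker("NRLLNnRR")
```

A trace without a lowercase marker makes `split_at_marker` raise `StopIteration`; a fuller version would handle threads that do not mark the boundary.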
Definition 8 (Uncommitted transaction sequence (uts)).
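A uts under a set of threads $T$ is a sequence of transactions
$$q_1 \to^+_{\psi(1)} q_2 \to^+_{\psi(2)} \cdots \to^+_{\psi(l)} q_{l+1}$$
(where $l \geq 0$ and $\psi$ is a mapping from indices to threads) such that for all $1 \leq i \leq l$, we have
$$q_i = q_{i,1} \to_{\psi(i)} q_{i,2} \to_{\psi(i)} \cdots \to_{\psi(i)} q_{i,x_i} = q_{i+1}$$
with 1. $\psi(i) \in T$ and $q_1 \in N_T$, 2. $q_{i,1} \in N_{\psi(i)}$, and 3. $q_{i,2}, \dots, q_{i,x_i} \in R_{\psi(i)}$.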
Corollary 5. For every uts as defined in Def. 8 it holds that
$\forall i, j \in \{1, \dots, l\} : i \neq j \Rightarrow \psi(i) \neq \psi(j)$.
Corollary 6. Every uts q
under T can be split at an arbitrary index 1 ≤ k ≤ l + 1 into two uts's:
a uts under T , and
a uts under $\{i \in T \mid q_k \in N_i\}$.
Lemma 4. Given a uts q
under $T_u \triangleq range(\psi)$ and a transition $q_{l+1} \to_i q'$ with $i \notin T_u$, then there exists a transition $q_1 \to_i q'_1$ and a uts q = q 1 →
Illustratively:
Proof. The proof proceeds by nested induction over the transaction blocks in the uts and the steps inside each block.
The outer induction is performed over the length l (i.e., the number of transactions) of the uts.
- Base case ($l = 0$): Choose $q'_1 = q'_{l+1} = q'$, which is trivially a uts of length 0.
Furthermore, $q'_{l+1} \cong q'$.
l−1 q l → i q, for short). -Inductive case: Using the induction hypothesis for the uts q 1 (→ + ) l−1 q l of length l − 1, we show that the lemma must hold equally for a sequence
(where also $\psi(l) \neq i$). Let $j \triangleq \psi(l)$ be the last thread executing before $i$. We thus have (gray parts are the proof obligation):
The nested induction is over the steps in the transaction $\to^+_j$. By induction over the length $m$ of the path $q_l = q_{l,1} \to_j \dots \to_j q_{l,m} = q_{l+1}$ with $q_{l,2}, \dots, q_{l,m} \in R_j$, we show that there is a path $q_{l,1} \to_i q'_{l,1} \to_j \dots \to_j q'_{l,m}$ with $q'_{l,2}, \dots, q'_{l,m} \in R_j$ and $q'_{l,m} \cong_{\{i\}} q'$, if there is a transition $q_{l,m} \to_i q'$:
• Base case ($m = 2$ and $q_{l,1} \to_j q_{l,m} \to_i q'$): By item 2 of the definition of transaction systems (Def. 7), we obtain $q_{l,1} \to_i q'_{l,1} \to_j q'_{l,2}$ with $q'_{l,2} \cong_{\{i\}} q'$ and $q'_{l,2} \in R_j$.
• Induction hypothesis (IH2): Assume that the hypothesis holds for $(m - 1) > 0$.
• Induction step: We show that our claim holds for a path up to $m$ ending with $q_{l,m-1} \to_j q_{l,m} \to_i q'$ with $q_{l,m} \in R_j$. By item 2 of Def. 7, we obtain $q_{l,m-1} \to_i q''_{l,m-1} \to_j q''_{l,m}$ with $q''_{l,m} \cong_{\{i\}} q'$ and $q''_{l,m} \in R_j$. We then apply IH2 on the path of length $m - 1$ to obtain a path $q_{l,1} \to_i q'_{l,1} \to_j \dots \to_j q'_{l,m-2} \to_j q'_{l,m-1}$ with $q'_{l,2}, \dots, q'_{l,m-1} \in R_j$ and $q'_{l,m-1} \cong_{\{i\}} q''_{l,m-1}$. This implies $q'_{l,m-1} \to_j q'_{l,m}$ with $q'_{l,m} \cong_{\{i\}} q''_{l,m}$ (since $\cong_{\{i\}}$ is a bisimulation relation). Since $\cong_{\{i\}}$ preserves the phase of other threads, it follows that $q'_{l,m} \in R_j$ from $q''_{l,m} \in R_j$.
The resulting path satisfies the required property. Note that q l,1 → j . . . → j q l,m is a uts. We thus obtain a transition sequence:
with $\psi(1), \dots, \psi(l - 1) \neq i$. Applying the IH to A yields a path:
Since $q'_l \cong_{\{i\}} q_l$, we also have $q'_l \to^+_j q'_{l+1}$ by Cor. 3 and $i \neq j$. This is a uts under $\{j\}$. Since B remains a uts under $T_u$, the path $q'_1 (\to^+)^n q'_{l+1}$ satisfies the needed property (it is a uts under $T_u \cup \{j\}$).
Definition 9 (Committed transaction sequence (cts)).
A cts under T ⊆ Threads is a sequence of transactions with k ≥ 0:
Corollary 7. An analogous "splitting" property to the one from Cor. 6 also holds for cts's.
Definition 10 (Open transaction sequence (ots)).
If A is a cts under T A and B is a uts under T B such that A ends in the same state in which B starts, then the sequence of transactions A • B obtained by appending B to A is an ots under T A ∪ T B .
We sometimes write AB for A • B. We refer to the cts part of AB by using states p x , p x+1 (transactions start/end states), p x,y , p x,y+1 (internal transitions) and threads ϕ(x) for a transaction index x and an internal transition index y (as in Def. 9). Similarly, we refer to the uts part by using states q x , q x+1 (transactions start/end states), q x,y , q x,y+1 (internal transitions) and threads ψ(x) for a transaction index x and an internal transition index y (as in Def. 8).
Corollary 8. For an ots containing a sub-trace $\dots r \to_i \dots$ with $r \notin N_j$ for $j \neq i$, the suffix from $r$ does not contain $j$ transitions.
Corollary 9. An analogous "splitting" property to the one from Cor. 6 also holds for ots's.
Corollary 10. Every path p → * i q with p ∈ N i in a transaction system is an ots under {i}.
Lemma 5 (From transaction system path to ots).
Let $TS$ be a transaction system and $\to_T\ \triangleq \bigcup_{i \in T} \to_i$. Suppose that $p \to^*_T q$ is a $TS$ path with $p \in N_T$; then there exists an ots $p \to^* q'$ under $T$ s.t. $q' \cong q$.
Proof. We prove the hypothesis by induction on the length of the path p → * q.
The induction hypothesis (IH) is:
For every path p → * T q with p ∈ N T there is a path
such that $q_{l+1} \cong q$, A is a cts under $T_C \triangleq range(\varphi)$, and B a uts under $T_U \triangleq range(\psi)$ (so that AB is an ots under $T_C \cup T_U \subseteq T$). For the base case, $k = l = 0$, take $q' \triangleq p = q$ and the conclusion holds trivially. Let $p \to^* q \to_i r$ be the path extension with $i$, so the IH holds for $p \to^* q$. We show that there is an $r' \cong r$ and a path $p \to^+ r'$ that is an ots (under $T_C \cup T_U \cup \{i\}$).
We do case analysis over the $i$-phase of $q$:
$q \in N_i$: It follows that $i \notin T_U$, because otherwise we would have $q \in R_i$ (by an inductive argument using the definition of the uts and item 3 of Def. 6, also needed in the proof of Cor. 8). Using Lemma 4, one can move the transition to the end of the cts phase:
uts such that q l+1 ∼ = {i} q l+1 . Depending on the phase of transition i's target state q 1 , the transition can either be part of the cts (q 1 ∈ N i or q 1 ∈ L i ) or of the uts (q 1 ∈ R i ). We pick r q l+1 . By Lemma 4, we have r ∼ = {i} r. The new path has the same transitions as the original path plus i (if i was not in the path of the induction hypothesis). Hence, the property is satisfied.
$q \in L_i$: As in the previous case, we have $i \notin T_U$. There is at least one $x$ such that $\varphi(x) = i$, otherwise we would have $q \in N_i$ (again applying the same inductive reasoning over the cts prefix path). Let $x$ be the largest index for which $\varphi(x) = i$, so that $\to_{\varphi(x)}$ is the rightmost occurrence of $i$ in the ots. Therefore, the path has the form:
Using item 3 from Def. 7, one can construct a path
with q ∼ = r. By Cor. 7, the prefix path including p x+1 is a cts, so the prefix path including p x+1 is also a cts. Moreover, since by Cor. 9 the original suffix path p x+1 → * q is an ots under T for some T ⊆ T \ {i}, the threads that transit in A are still those in T by virtue of item 3 and Cor. 3. By contraposition of Cor. 8, we also have p x+1 ∈ N T and therefore also p x+1 ∈ N T (by item 3 of Def. 6 and i / ∈ T ). Since 1 ≤ x ≤ k, the suffix path A is shorter than the original path. Hence one can apply the (strong) induction hypothesis to A with T to obtain an ots under T starting in p x+1 ∈ N T . This yields an ots under T :
ots under T with q ∼ = q . Let r q . By transitivity of ∼ =, we have r ∼ = r. Thus the property is satisfied. q ∈ R i : There is at least one x such that ψ(x) = i, otherwise we would have q ∈ N i ∪ L i (again applying the same inductive reasoning over the ots). Let x be the largest index for which ψ(x) = i, so that → ψ(x) is the rightmost occurrence of i in the ots. One can then split the path into a cts A under T (Cor. 6), a uts B under T B ⊆ T and a uts C under T C = T \ T B by Cor. 5:
From the above, we have i ∈ T B , but i / ∈ T C . Hence, one can apply Lemma 4 to C and i to obtain a uts C under T C (ignore D for now):
If q x+1 ∈ R i , we are done as this path is an ots (under T i). Otherwise, by repeatedly applying Lemma 4 to each of the i-transitions q x → + i q x+1 and D (a path without i), we right-commute all i-transitions to the end of the cts:
with q x = q x+1 ∼ = {i} q x+1 . By Cor. 10, we obtain a new cts B under T ∪ {i}. Moreover, by bisimilarity up to i (Cor. 4), we also have a path:
This path is still a uts under T C , because i / ∈ T C . Therefore, the resulting path p 1 → + T r is an ots under T with r ∼ =i r by transitivity of ∼ =i. This path satisfies the property.
Theorem 2 (reduction of interleavings).
Let $TS = (S, \to)$ be a parallel transaction system as in Def. 7 with errors as in Def. 11 and $L \triangleq \bigcup_i L_i$. Suppose that $p \to^* q$, $p \in N$ and $q \in E$; then there is a transaction path from $p$ to some $q'$ s.t. $q' \cong q$ and $q' \in E$.
Proof. By Lemma 5, the hypothesis implies that there is an ots under Threads:
with q l+1 ∼ = q and T c ∪ T u = Threads. First, we show that there also is an ots ending in q l+1 with q l+1 / ∈ L. We do this by completing paths that are stuck in the post phase (L), starting with the thread that gets stuck the first (left-most in the path). Let x be the lowest index, such that there exists a subpath:
. By item 3 of Def. 6, this subpath must exist when q l+1 ∈ L i . Hence, by construction, we have p x ∈ N (trivially from the cts definition, we have p x ∈ N i and for j ≠ i anything but p x ∈ N j would contradict the assumption of having chosen the lowest x). Also, by Cor. 8, the suffix from p x+1 does not contain i transitions.
Now we extend the path with i: L i q l+1 → i q . This is possible by item 4 of Def. 7. Using again strong induction with the same IH as in the proof of Lemma 5, we may move the new i transition 'in place' as a first step towards constructing a new ots. Thus using item 3 from Def. 7 and Cor. 3, one can construct a path:
By Cor. 7, the prefix path including p x+1 is a cts, so the prefix path including p x+1 is also a cts (by item 1 of Def. 7). Moreover, since by Cor. 9 the original suffix path p x+1 → * q l+1 is an ots under T = Threads \ {i}. The threads that transit in A are still contained in T by virtue of item 3 and Cor. 3. Since p x ∈ N , we also have p x+1 , p x+1 ∈ N T (by item 3 of Def. 6 and i / ∈ T ). Since 1 ≤ x ≤ k, the suffix path A is shorter than the original path. Hence one can apply the (strong) induction hypothesis to A with T to obtain an ots under T starting in p x+1 ∈ N T . This yields a new ots under Threads:
Because the bisimilarity preserves error states, we have q ∈ E.
We must further have q x+1 ∈ L i or q x+1 ∈ N i (note that the phase may have changed compared to q by the moving operation). In the former case, we repeat this process of extending the path with i until q x+1 ∈ N i (since the system is finite this happens eventually). In the latter case, we are done for this i, because q ∈ N i . It follows that for all other $j \notin T$, we also have q ∈ N j . If $q' \notin L$, we are done showing that there is also an ots path leading to an error without ending in a post phase. Otherwise, we pick a new left-most $x'$ and a new $i'$ to repeat the process. Because the number of threads is finite, $i' \neq i$ and $x' > x$, this process eventually terminates.
Let
* q l+1 ∉ L be the ots after the above extension. Since we reduced the case to an ots where q l+1 ∉ L, it remains to show that there exists a transaction path to some q ∈ E. We know that q ∈ E i ⊆ N i for some thread i. Therefore, we have i ∉ T u and q 1 ∈ E i ⊆ N i by item 3 of Def. 6. Also, because q l+1 ∉ L, we have q 1 ∈ N . Let1 = p k+1 . It is easy to show that this cts is also a transaction path. Hence, the conclusion is satisfied.
The following theorem shows that internal states may just as well be skipped. This theorem is found in Sec. 4 as Th. 1.
Theorem 3 (Atomic block reduction).
The block-reduced transition relation is defined as:
If there is a path $p \to^* q$ with $p \in N$ and $q \in E$, then there is a block-reduced path from $p$ to some $q'$ s.t. $q' \cong q$ and $q' \in E$.
Proof. By Th. 2, the premise gives us that there is a transaction path from $p$ to some $q'$ s.t. $q' \cong q$ and $q' \in E$. We proceed by backward induction over that transaction path to show that it is a block transition path. Let $x$ be an index s.t. the path contains a transition $p_x \to_i p_{x+1}$ with $p_x \notin E$ and $p_{x+1} \in E$. We have $p_{x+1} \in E_i \subseteq N_i$, or else a contradiction with item 3 (invariance). By definition of the transaction relation, we also have $p_{x+1} \in N$ and $p_x \in N_{Threads \setminus \{i\}}$ by invariance. If $p_x = p$, we are done, since with $p \in N$ this is a block transition. Else either $p_x \in N_i$, and again we have a block transition, or not. In the first case, we repeat the process until eventually we hit $p_z = p$ with $z < x$. In the latter case, there exists a $y < x$ s.t. $p_y \in N$ and all intermediate states are in $N_{Threads \setminus \{i\}}$ by invariance. Clearly, this is also a block transition step and one can repeat the process until we find some $p_z = p$ with $z < y$. Taking $q' \triangleq p_{x+1}$ therefore satisfies the induction hypothesis.
Theorem 4 (Th. 1 in the main part). Dynamic transaction reduction is sound and complete for the reachability of error states.
Proof. Completeness (no error states are missed) follows from Th. 2(/Th. 3). Soundness (no errors are introduced) follows immediately from the fact that the reduced transition relation of Def. 7 is a subset of the original transition relation → (and the reduced relations of Th. 3 are subsets of the closure of the original transition relation: → + ).
A.3 Instrumentation
First, we define sufficient criteria for dynamic moving conditions as a simple interface to easily design other heuristics. Then we give a formal definition of the heuristics discussed in Sec. 4.1 and show that they satisfy the criteria.
Definition 12 (Dynamic moving conditions).
A state predicate (a subset of states) $c^L_\alpha$ is a dynamic left-moving condition for a CFG edge $(l, \alpha, l') \in \delta_i$, if for all
To formulate the heuristics, we use some basic static analysis on the CFG similar to e.g. [19, Sec. 3.4] . The relation Conflict on actions (relations on data) relates dependent actions, i.e., Conflict(α, β) holds if α accesses (reads or writes) variables that are written by β, or vice versa. The requirement on the implementation of Conflict is such that:
We also write Conflict(α) = {β | Conflict(α, β)}.
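A minimal sketch of such a conflict check is given below. It assumes that an action is summarized by the sets of variables it reads and writes; the `Action` class and the function names are hypothetical and only mirror the textual description (two actions conflict if one accesses a variable that the other writes).

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Action:
    # Assumption: an action is abstracted to its read and write sets.
    name: str
    reads: FrozenSet[str] = frozenset()
    writes: FrozenSet[str] = frozenset()


def conflict(a: Action, b: Action) -> bool:
    """True if a accesses a variable written by b, or vice versa."""
    a_accesses = a.reads | a.writes
    b_accesses = b.reads | b.writes
    return bool(a_accesses & b.writes) or bool(b_accesses & a.writes)


def conflict_set(a: Action, others: Iterable[Action]) -> Set[Action]:
    """The set of actions among `others` that conflict with a."""
    return {b for b in others if conflict(a, b)}


read_x = Action("read_x", reads=frozenset({"x"}))
write_x = Action("write_x", writes=frozenset({"x"}))
read_y = Action("read_y", reads=frozenset({"y"}))
```

A static analysis would typically compute larger read and write sets than the ones actually touched at run time (for example because of aliasing), so a check of this kind can only over-approximate the real dependencies.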
The conflict relation/function might over-estimate the set of conflicts (see the implication in Eq. 7), as static analysis can be imprecise. It should be noted that static analysis runs with a tight constraint on computational resources. Typically, it is run once over the syntactical program structure to derive all the aliasing constraints, and it completes in time polynomial in the input size (often quadratic or cubic), much unlike the expensive model checking procedure. The heuristics we provide deal with the consequential imprecision of static analysis by deferring various judgements to the model checker.
To reason over the control flow of the threads, we define location reachability. Abusing notation slightly, let $\delta^*_i$ be the transitive and reflexive closure of the CFG location relation $\{(l, l') \mid \exists \alpha : (l, \alpha, l') \in \delta_i\}$. , where d is a data assignment. For simplicity, we assume that p ≠ p' ensures that dereferencing these pointers does not conflict, i.e., references may not partly overlap. However, for clarity, we write C-style expressions, such as *p.

Table 2: Some heuristics to establish dynamic commutativity conditions for a CFG edge $(l_a, \alpha, l_b) \in \delta_i$ of thread $i$. As the reachability heuristic is always applicable, it can be considered used when the restrictions of the other conditions do not hold on the CFGs.
Name
Dynamic condition $c_\alpha$ and static analysis on $\alpha$ and $G_j$ for $j \neq i$
Reachability (for Fig. 4): The condition $c^{reach}_\alpha \triangleq \bigwedge_{j \neq i} \bigwedge_{l \in L(j)} (pc_j \neq l)$ guarantees that remote threads $j$ are not in certain locations $l \in L(j)$. The locations $l \in V_j$ considered are all locations that either: (1) have outgoing edges $\beta$ conflicting with $\alpha$, or (2) can reach another location $l_\beta \in V_j$ through $G_j$ where (1) holds. Therefore, $L(j)$ is defined as follows:
This heuristic can be applied for any action α ∈ ∆i. When there are no conflicts, one obtains L(j) = ∅ and c reach α = true, yielding a static mover through our instrumentation (see Table 3 and explanations).
Static pointer dereference (for Fig. 6): The condition $c^{deref}_\alpha$ is only applicable when $\alpha$ is a pointer dereference of some pointer p, written here as α = *p (the action might also be a modifying dereference such as α = *p++ from Fig. 6). The condition further requires that all potentially conflicting actions $\beta$ from other threads $j \neq i$ are also pointer dereferences of some pointer p'. It guarantees that the pointers: (1) have a different value p ≠ p', and (2) the value of p' is not modified by some thread $k \neq i$ in the future. The condition is defined as: c
Here, Cj(α) is the set of all conflicting actions that are pointer dereferences with their corresponding pointers:
And the set Fi(p') holds pairs of thread ids k and locations l ∈ V k , such that l can reach an action modifying the value of p':
k } Note that the condition is restricted to actions α that are pointer dereferences and that all its conflicting actions β must be pointer dereferences as well. It is easy to lose the second conjunct of this requirement, by carefully incorporating c reach β in the condition. However, we did not do so to keep the presentation simple. Our implementation also only supports the restrictive, simpler version of this condition, making it less often applicable in real-world programs. For our examples, however, the simple version sufficed.
Monotonic atomic (for Fig. 5): The condition requires that $\alpha$ and all its conflicting operations $\beta$ are CAS operations checking for the same expected value c, i.e.: α, β = cas(p, c, x) for some p and x. It is defined as $c^{atomic}_\alpha \triangleq (*p \neq c)$ and guarantees that the CAS operation won't write to the location that p references.
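The location set $L(j)$ of the reachability heuristic in Table 2 can be computed by a backward closure over the CFG of thread $j$, as in the following sketch. It follows the textual description only (locations with a conflicting outgoing edge, plus all locations that can reach one); the edge encoding and the names `unsafe_locations` and `c_reach` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# Assumption: a CFG edge is (source location, action name, target location).
Edge = Tuple[str, str, str]


def unsafe_locations(edges_j: Iterable[Edge],
                     conflicts_with_alpha: Callable[[str], bool]) -> Set[str]:
    """Locations of thread j that have, or can reach, a conflicting edge."""
    edges = list(edges_j)
    # (1) locations with an outgoing edge whose action conflicts with alpha
    result = {src for (src, act, _) in edges if conflicts_with_alpha(act)}
    # (2) backward closure: locations that can reach a location from (1)
    changed = True
    while changed:
        changed = False
        for (src, _, dst) in edges:
            if dst in result and src not in result:
                result.add(src)
                changed = True
    return result


def c_reach(pc: Dict[str, str], unsafe: Dict[str, Set[str]]) -> bool:
    """True if no remote thread j is currently at a location in L(j)."""
    return all(pc[j] not in locs for j, locs in unsafe.items())


edges = [("l0", "a", "l1"), ("l1", "w", "l2"), ("l2", "b", "l3")]
L_j = unsafe_locations(edges, lambda act: act == "w")
```

Once thread $j$ has moved past `l1` in this toy CFG, it can no longer execute the conflicting action, so the condition becomes true and stays true.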
Lemma 6 (Lemma 1 in the main part). The conditions in Table 2 are dynamic both-movers (Def. 1).
Proof sketch. We show that all dynamic both-mover constraints from Def. 1 hold for all three heuristic dynamic moving conditions $c_\alpha$ in the table, i.e., for all $i$, $j \neq i$ and all $\alpha \in \Delta_i$, $\beta \in \Delta_j$:
- (1) the predicate $c_\alpha$ should not be disabled by transition $\xrightarrow{\alpha}_i$,
- (2) the predicate $c_\alpha$ should not be disabled by transition $\xrightarrow{\beta}_j$, and
- (3) the transitions $\xrightarrow{\alpha}_i$ and $\xrightarrow{\beta}_j$, restricted to states in $c_\alpha$, should (both) commute.
In the following cases, let (l α , α, l α ) ∈ δ i and (l β , β, l β ) ∈ δ j . Reachability :
( ensures that the expected value check of the compare and swap operation fails, no write will occur and c α remains enabled after α.
(2) For all states σ ∈ c α , it holds that there is no β-transition to a state σ / ∈ c α , as the only conflicting operations with α are compare and swap operations that check for the same constant value. Since the expected value of those CAS operations does not agree with the pointer dereference, these operations cannot conflict with it. 
Follows from a similar argument as in Case 2.
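The argument for the monotonic atomic condition rests on the fact that a CAS whose expected value differs from the stored value performs no write. The sketch below illustrates this on a dictionary-based memory; it reads the condition as "the stored value differs from the expected value", which is what the stated guarantee requires, and the names `cas` and `c_atomic` are hypothetical.

```python
def cas(memory: dict, p: str, expected, new) -> bool:
    """Compare-and-swap on a dictionary-based memory; returns True on success."""
    if memory[p] == expected:
        memory[p] = new
        return True
    return False


def c_atomic(memory: dict, p: str, expected) -> bool:
    """Assumed reading of the condition: the CAS on p would fail."""
    return memory[p] != expected


mem = {"p": 5}
holds_before = c_atomic(mem, "p", 0)

# Two CAS operations with the same expected value 0, in both orders.
first = dict(mem)
cas(first, "p", 0, 1)
cas(first, "p", 0, 2)
second = dict(mem)
cas(second, "p", 0, 2)
cas(second, "p", 0, 1)
```

While the condition holds, both orders leave the memory untouched, so the operations commute and neither can disable the condition.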
Corollary 12.
The conditions in Table 2 are dynamic left and right movers (Def. 12).
Proof. Def. 1 is stronger than the conjunction of dynamic left and right movers from Def. 12 (taking c
(Note that the proof of Th. 5 uses Def. 12, but also holds with the stronger Def. 1 -a hint is provided in the figure in the proof of Th. 5 case Def. 7.2.R1. This is because in addition to a stronger commutativity condition, which also requires that (c α α
, it also requires that c α α − → i c α .)

Let $G_i = (V_i, \delta_i)$ be the CFG for thread $i$. We transform this into an instrumented CFG $G'_i \triangleq (V'_i, \delta'_i)$ by copying all locations $l_a \in V_i$ to pre-commit, post-commit, and external locations: l Table 3 . Note that all discussion on the instrumentation from the main paper holds here as well, and additionally we distinguish here between dynamic left/right/both-moving conditions. They are defined in the same way as dynamic both-moving conditions (Def. 1) by replacing the commuting condition with left/right-commuting for left/right-moving conditions.

Theorem 5 (Lemma 3 in the main part). Let → be the transition relation of an instrumented system. 3 Let:
We define an equivalence relation ∼ =i over the locations
and lift it to semantic states:
We show that Def. 6, Def. 7 and Def. 11 hold.
Proof. We show that all five conditions of Def. 6 are satisfied:
1. By definition, $R_i$, $L_i$ and $N_i$ are pairwise disjoint.
2. By definition, all $V_i$ are pairwise disjoint.
3. By definition of $\to_i$, the phases of $j \neq i$ remain the same.
4. By definition of $\cong_i$, equivalent states must have the same phase for all $j \neq i$.
where l R and l L are bisimilar. Moreover, we have 
L , because both edges (R3 and R5) perform no action and are always enabled, we getd
L with l / ∈ LFS i , then we have: Table 3 . By definition, we have l
None of the actions from above edges modifies any data. Hence, we also have
is l L and l ∈ LFS i , we may apply a similar argument as for R3.
We show that all four conditions of Def. 7 are satisfied using the bisimulations established above for all $i$ and $j \neq i$:
1. Since the instrumentation doesn't contain any edge from some l L , l L ∈ V i to some l R , l R ∈ V i , this condition is fulfilled by the definition of → i . 2. We look at all CFG edges ending in locations that constitute states (pc, d) ∈ R i and see if the edges are indeed right-movers up to ∼ =j. Note that we reason separately over transitions and their actions, since the actions might commute, but the transitions may not perfectly commute if they end up taking a different branch of the instrumentation (e.g. see R1, R2, R4 and R5 in Table 3 ). Within this case and the next case only, we use → , δ i , V i , G i again to refer to the instrumented CFG to distinguish it from its original →, δ, V i , G i .
R1
We have pc i = l The action associated with the edge performs no state modification and is always enabled. It is therefore a both-mover. Because the edge is an internal edge added by our instrumentation, i.e. from original location l a (copied to l R a ) to l a (copied to l R a ), the conditional movers of other threads cannot detect it and the corresponding transition is a (perfect) both-mover, i.e., up to ∼ = {} . As Def. 1 is stronger than both the left and the right mover case in Def. 12, this proof also holds for the latter.
3. Similarly, we look at all the edges from locations constituting states (pc, d) ∈ L i and show that they are left-movers up to ∼ = {i,j} . Note again that we reason separately over transitions and their actions. Within this case and the previous case only, we use → , δ i , V i , G i again to refer to the instrumented CFG to distinguish it from its original →, δ, V i , G i .
R4 We have pc i = l
R6 By a similar argument as for case R3 (Condition 2 of Def. 7), this edge is a perfect both-mover. As Def. 1 is stronger than both the left and the right mover case in Def. 12, this proof also holds for the latter.
4. Because we require for every cycle in the CFG i that there is at least one state in the location feedback set LFS i , we know that there is no cycle in CFG i such that every location is an L i -state.
We show that all two conditions of Def. 11 are satisfied:
we have E i ⊆ N i . 2. By assumption, there is no transition from a sink state to a non-sink state, so all states following an error state are error states as well. 3. Same argument as before.
If the internal/conditional step does not immediately precede the last action, then (1) if the conditional is enabled, then this step can move to the right, right before the last action; (2) if the conditional is disabled, then there is an internal/conditional step for the same action that is enabled and leads to a bisimilar state (recall that bisimilarity preserves error states). By the inductive hypothesis, from (pc 0 , d 0 ) → * (pc n , d n ) via n actions, we get a path (pc 0 , d 0 ) → * (pc n , d n ) performing n actions, and from (pc n , d n ) → * (pc n+1 , d n+1 ) in (at most) two steps, and the shape of the instrumentation, we get a step (pc n , d n ) → (pc n+1 , d n+1 ) with pc n+1 = strip(pc n+1 ).
A.4 With Instrumentation, Left/Right Moving Affects Phases
The following composition of left/right moving in the presence of dynamic movers is exhaustive and refers to the rows in Table 3. Left mover i, when j becomes dynamically right / left moving by R1,2 and R4,5: We conclude that the instrumentation allows other threads j to lose their phase (end up in a different location) when a transition of i is moved to the right (over j) in a global trace. On the other hand, when i moves a transition to the left over j, both i and j could change their phase. This explains why Def. 7 defines the commutativity the way it does, to wit: (→ i R i ) → j → j and (L i → i ) ← {i,j} → j . Note that we have equality for all threads $k \neq i, j$ (not participating in the move) from the definition of $\cong_{\{i,j\}}$.
