Abstract. We present a new method that combines the efficiency of testing with the reasoning power of satisfiability modulo theory (SMT) solvers for the verification of multithreaded programs under a user-specified test vector. Our method performs dynamic executions to obtain both under- and over-approximations of the program, represented as quantifier-free first-order logic formulas. The formulas are then analyzed by an SMT solver, which implicitly considers all possible thread interleavings. The symbolic analysis may return one of the following results: (1) it reports a real bug, (2) it proves that the program has no bug under the given input, or (3) it remains inconclusive because the analysis is based on abstractions. In the last case, we present a refinement procedure that uses symbolic analysis to guide further executions.
Introduction
One of the main challenges in testing multithreaded programs is that the absence of bugs in a particular execution does not necessarily imply error-free operation under that input. To completely verify program behavior for a given test input, all executions permissible under that input must be examined. However, this is often infeasible given the exponentially large number of possible interleavings of a typical multithreaded program. A program with n threads, each executing k statements, can have up to $(nk)!/(k!)^n \ge (n!)^k$ thread interleavings, a number that is exponential in both n and k.
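As a quick sanity check on this bound, the count can be computed directly. The sketch below is illustrative only: the names `interleaving_count` and `lower_bound` are ours, and the count is generalized (an assumption beyond the formula above) to threads of unequal length via the multinomial coefficient.

```python
from math import factorial, prod


def interleaving_count(lengths):
    """Number of ways to interleave threads whose statement counts are `lengths`.

    For n threads of k statements each this is (n*k)! / (k!)**n.
    """
    return factorial(sum(lengths)) // prod(factorial(k) for k in lengths)


def lower_bound(n, k):
    """The (n!)**k lower bound quoted for n threads of k statements each."""
    return factorial(n) ** k


if __name__ == "__main__":
    for n, k in [(2, 2), (2, 3), (3, 3), (4, 5)]:
        total = interleaving_count([k] * n)
        assert total >= lower_bound(n, k)
        print(f"n={n} k={k}: interleavings={total}, bound={lower_bound(n, k)}")
```

With threads of 3 and 2 statements, `interleaving_count([3, 2])` gives 10, which is the number of interleavings listed for the running example.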
In this paper we address this challenge by an approach called Trace-Driven Verification (TDV) that combines the efficiency of testing with the reasoning power of satisfiability modulo theory (SMT) solvers. TDV performs dynamic executions to obtain approximations, represented as quantifier-free first order logic (FOL) formulas, of the program under verification. The formulas are then analyzed by an SMT solver which implicitly considers all possible thread interleavings. The symbolic analysis may return one of the following results: (1) it reports a real bug, (2) it proves that the program has no bug under the given input, or (3) it remains inconclusive because the analysis is based on abstractions. In the last case, we present a refinement procedure that uses symbolic analysis to guide further executions. The features of TDV include:
The work was supported by NSF Grant CCF-0811287.
-Implicit consideration of thread interleavings. As explicit enumeration of executions is intractable, the alternative we present is to capture thread interleavings implicitly as a set of constraints in a satisfiability formula. These constraints belong to the family of quantifier-free first-order logic formulas for which efficient SMT solvers are available.
-Integration of dynamic executions and symbolic analysis. At any given time, TDV analyzes only the statements that appear in a particular execution under a user-specified test vector. It may report a real bug, or prove that the program behaves as expected under all thread interleavings stimulated by the given input. In either case, TDV avoids the analysis of statements that do not appear in an execution. However, it is also possible that the symbolic analysis, being an abstraction of program behavior, remains inconclusive. In such a case, TDV uses the symbolic analysis result to guide future concrete executions.
-Abstraction with both under- and over-approximations. Based on an execution, TDV infers both under- and over-approximations of the entire program. The under-approximation is complete, so that any bug detected in the model is a real bug, while the over-approximation is sound, so that it can be used to prove the absence of bugs.
The rest of the paper is organized as follows. After giving the algorithm overview in Section 2, we present the symbolic encoding of program traces in Section 3. The refinement procedure is illustrated in Section 4. In Section 5 we outline several encoding and algorithmic optimizations to improve scalability. We discuss related work in Section 6. Finally we present experimental results in Section 7 and conclude the paper in Section 8.
Algorithm Overview
Consider a multithreaded program P where threads communicate via shared variables. Without loss of generality, we assume there is at most one shared variable access at a program statement. Then each statement constitutes an atomic computational step, at which granularity the thread scheduler can switch control between threads during the execution.
Consider the program shown in Fig. 1, which consists of two concurrently running threads. In a typical testing environment, even if we run the program multiple times under the test input a = 1, b = 0, we may obtain the same executed trace $\pi_1 = \langle 1, 2, 5, 6, 7 \rangle$, where the integer values indicate the line numbers. In general, an executed trace is an ordered sequence of program statements executed by the different threads. Although $\pi_1$ does not cause an assertion failure on Line 5, we cannot conclude the absence of assertion failures in this program, as this input admits other interleavings of these two threads. Table 1 shows the set $\Pi(\pi_1)$ of all 10 possible interleavings of $\pi_1$. For each trace in the table, the bottom row indicates whether the assertion on Line 5 holds (h) or fails (f). However, not all the interleavings in $\Pi(\pi_1)$ are valid executions. Closer examination of $\pi_6$ and $\pi_9$ shows that they are infeasible traces, due to the violation of program semantics. In particular, after y is updated by Thread 2 on Line 7, it is not possible for Thread 1 to follow the Else branch on Line 2. Let $\Pi_P(\pi_1)$ be the set of interleavings derived from $\pi_1$ that are consistent with the semantics of the program P. We have $\Pi_P(\pi_1) = \Pi(\pi_1) \setminus \{\pi_6, \pi_9\}$.

Table 1. The interleavings in $\Pi(\pi_1)$, given as line numbers per step; * marks an infeasible trace.

| Step | π1 | π2 | π3 | π4 | π5 | π6* | π7 | π8 | π9* | π10 |
|------|----|----|----|----|----|-----|----|----|-----|-----|
| 1    | 1  | 1  | 1  | 1  | 1  | 1   | 6  | 6  | 6   | 6   |
| 2    | 2  | 2  | 2  | 6  | 6  | 6   | 1  | 1  | 1   | 7   |
| 3    | 5  | 6  | 6  | 2  | 2  | 7   | 2  | 2  | 7   | 1   |
| 4    | 6  | 5  | 7  | 5  | 7  | 2   | 5  | 7  | 2   | 2   |
| 5    | 7  | 7  | 5  | 7  | 5  | 5   | 7  | 5  | 5   | 5   |
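The enumeration behind Table 1 can be reproduced in a few lines. This is a minimal sketch of explicit enumeration, which is exactly what TDV avoids doing at scale; the helper name `interleavings` is ours, and the code only merges the two per-thread sequences of $\pi_1$ without modeling the semantics of Fig. 1, so it does not tell feasible traces from infeasible ones.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    total = len(a) + len(b)
    # Choose which positions of the merged trace are taken by steps of `a`.
    for slots in combinations(range(total), len(a)):
        chosen = set(slots)
        ia, ib = iter(a), iter(b)
        yield tuple(next(ia) if i in chosen else next(ib) for i in range(total))


thread1 = (1, 2, 5)  # lines executed by Thread 1 in pi_1
thread2 = (6, 7)     # lines executed by Thread 2 in pi_1
traces = sorted(interleavings(thread1, thread2))

if __name__ == "__main__":
    for number, trace in enumerate(traces, start=1):
        print(f"pi_{number}: {trace}")
```

Sorting the merged traces lexicographically happens to reproduce the numbering used in the text: the first entry is the executed trace $\langle 1, 2, 5, 6, 7 \rangle$, and the sixth and ninth entries are the two traces identified as infeasible.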
In order to check for assertion failures not only in $\pi_1$ but also in its induced traces, we construct an FOL formula ϕ(π) that implicitly models all the traces in $\Pi_P(\pi)$ (see Section 3.1 for details). A satisfying assignment to ϕ(π) indicates a true assertion failure and can be used to identify the particular thread interleaving that produces it. If ϕ(π) is unsatisfiable, however, we cannot conclude correctness, because ϕ(π) is an under-approximation of program behavior. To understand the reason, consider a statement assert($C_{complexA}$) inside complexA() on Line 3 in Fig. 1. Given the executed trace $\pi_1 = \langle 1, 2, 5, 6, 7 \rangle$, ϕ($\pi_1$) itself cannot reveal any assertion failure inside complexA(), since the assert($C_{complexA}$) statement does not appear in any trace of $\Pi_P(\pi_1)$. On the other hand, there exist valid executions that execute complexA() (e.g., $\pi = \langle 1, 6, 7, 2, 3, \ldots \rangle$). Thus an assertion failure is still possible under the test input a = 1, b = 0. To ensure correctness (absence of assertion failures), all execution traces permissible under that input must be examined. We relax, or abstract, ϕ(π) by modifying some of its constraints and dropping others (see Section 3.2 for details). This leads to ψ(π), an FOL formula that represents an over-approximation of the program behavior under the specified input. If ψ(π) is unsatisfiable, we can conclude the absence of assertion failures for all thread interleavings under the specified input. Otherwise we need to check whether the reported violation is real or spurious. In the latter case, TDV performs refinement by modifying the control flow in order to examine other executions of P under the same test input.
As illustrated in Fig. 2, TDV consists of the following steps:
1. Run the program under a given user input to obtain an initial execution trace π.
2. Using an encoding along the lines illustrated in Section 3.1, construct an FOL formula ϕ(π).
3. Using an SMT solver, check the satisfiability of ϕ(π).
-If ϕ(π) is found to be satisfiable, a real bug is found. Based on the solution to ϕ(π) we can report to the user the specific scheduling that exposes the bug.
-If ϕ(π) is found to be unsatisfiable, we relax ϕ(π) to obtain ψ(π). This allows us to examine sibling traces, i.e., traces that conform to the same input but cover different statements.
• If ψ(π) is found to be unsatisfiable, we can conclude that the property holds under all possible thread interleavings under the given test input.
• If ψ(π) is found to be satisfiable, the SMT solver returns a counterexample, which is used to guide new executions that are guaranteed to touch new statements that have not appeared in previous executions.
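The control flow of these steps, together with the guided re-execution described in the refinement section, can be sketched as a small driver loop. This is a sketch under stated assumptions: the four callables (`run`, `solve_under`, `solve_over`, `guided_run`) and the `max_iters` cutoff are hypothetical placeholders for the instrumented execution and the SMT queries, not the interface of the TDV prototype.

```python
def tdv(run, solve_under, solve_over, guided_run, max_iters=100):
    """Control-flow sketch of the TDV loop (all callables are hypothetical).

    run()                   -> initial executed trace
    solve_under(trace)      -> schedule satisfying phi(trace), or None if unsat
    solve_over(trace, seen) -> schedule satisfying psi(trace) that avoids the
                               traces in `seen`, or None if unsat
    guided_run(schedule)    -> (followed_to_completion, executed_trace)
    """
    trace = run()
    seen = []
    for _ in range(max_iters):  # safety cutoff, an addition of this sketch
        seen.append(trace)
        bug = solve_under(trace)
        if bug is not None:
            return ("bug", bug)              # phi satisfiable: real bug
        candidate = solve_over(trace, seen)
        if candidate is None:
            return ("verified", None)        # psi unsatisfiable: property holds
        completed, new_trace = guided_run(candidate)
        if completed:
            return ("bug", candidate)        # counterexample is feasible
        trace = new_trace                    # spurious: analyze the new trace
    return ("inconclusive", None)
```

The loop returns as soon as either approximation is decisive; only a spurious counterexample triggers another concrete execution.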
the line number for the statement, o is an occurrence index that distinguishes the different executions of the same statement, and Q is the statement type, which can be one of assign, branch, jump, fork, join, or assert. In this paper we assume that all executions eventually terminate. We consider three basic types of statements: assignment v = E, where E is an arithmetic expression; branch C?l.o, where C is a relational expression; and jump goto l.o. Note that C?l.o lists only the destination taken when C holds, because no two branches can be taken simultaneously in an executed trace. Besides the basic types, we also allow assert(C) for checking assertions, exit for signaling the termination of a thread, and the synchronization primitives fork(t) and join(t), which allow a thread to dispatch and wait for the completion of another Thread t. Given a program written in a full-fledged programming language like C, one can use pre-processing [21] to simplify its executed traces into the basic statements described above.
Under-Approximation FOL Formula ϕ(π)
The key to the TDV algorithm is the construction of appropriate FOL formulas that can be easily checked with SMT solvers.
Let $V_G$ denote the set of global variables and $V_L(t)$ the set of local variables of Thread t. The set of variables visible to t is then $V_G \cup V_L(t)$. The program transition constraint $\delta_\pi$ is defined as
(1)
thread termination: (t, l.o, exit)
thread fork: (t, l.o, fork(t'))
thread join: (t, l.o, join(t'))
) substitute all variables in E (C) by their corresponding versions at step i;
denotes all visible variables in t keep their value except variable v.
-Initial condition constraint $\iota_\pi$ that specifies the starting locations for each thread as well as the initial values of program variables, including the values set by the input vector.
-Trace enforcement constraint $\varepsilon_\pi$ that restricts the encoded behavior to include only the statements appearing in an executed trace π. For each $(t, l.o, C?l'.o') \in \pi$ we assume condition C holds on line l at its o-th occurrence in π. Then we have
-Thread control constraint $\tau_\pi$ that (1) ensures that the local state of a thread (the values of its local variables) remains unchanged when the thread is not executing, and (2) ensures that the thread cannot be selected for execution after it has terminated. These two constraints are specified in Equation 3.
The thread control constraint is defined as follows:
In $\tau_{other}$, additional optional constraints can be included to model a particular scheduling policy.
-Property constraint $\rho_P$ that indicates the correctness conditions, specified as assertions within the program in this paper, that we would like to check for validity under all possible executions. Note that many common programming errors can be modeled as assertions [21]. Let (t, l, assert(C)) be an assertion on line l in Thread t. The property constraint can be specified as follows:
Note that properties encoded by ρ P are not necessarily the assertions appearing in π only; the assertions may appear anywhere in the program P . This is a crucial requirement for our trace-based method to find real failures anywhere in the program, or to prove the absence of assertion failures of the program.
Whether the property ρ P holds for all possible thread interleavings in Π P (π) is determined by checking the validity of the formula:
which is equivalent to checking the satisfiability of the formula
Equation 6, which implicitly represents all thread interleavings of Π P (π), is still an under-approximation of the behavior of program P under the given test input. Therefore, a solution to ϕ(π) reveals real errors in the program, but the unsatisfiability of ϕ(π) does not prove the absence of errors.
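The step from validity to satisfiability is the standard refutation argument. A sketch in LaTeX, assuming (this is our reading, not stated explicitly above) that the initial, transition, trace enforcement, and thread control constraints are simply conjoined into one formula, here called $F_\pi$:

```latex
% Assumption: F_\pi is shorthand introduced here for the conjunction
% of the constraints defined in this section.
\begin{align*}
F_\pi &= \iota_\pi \wedge \delta_\pi \wedge \varepsilon_\pi \wedge \tau_\pi \\
F_\pi \rightarrow \rho_P \ \text{is valid}
  &\iff F_\pi \wedge \neg \rho_P \ \text{is unsatisfiable}
\end{align*}
```

Under this reading, a satisfying assignment to $F_\pi \wedge \neg\rho_P$ is a schedule within $\Pi_P(\pi)$ that violates an assertion, which is why a solution to ϕ(π) is always a real error.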
Over-Approximation FOL Formula ψ(π)
Let $\Pi_P(\vec{v})$ be the set of all possible execution traces of program P under the test input $\vec{v}$. The set of interleavings considered by
To catch assertion violations in branches not yet executed in π, or to establish the absence of such violations in all traces, we need an over-approximation of $\Pi_P(\vec{v})$. The over-approximated encoding can be obtained from ϕ(π) with the following changes:
-Remove the trace enforcement constraint $\varepsilon_\pi$, which prohibits any trace $\pi' \notin \Pi_P(\pi)$ from being considered in ϕ(π). In Fig. 1, for example, a trace starting from 1, 6, 7, 2, 3, . . . can be a valid execution according to the program. However, the $\varepsilon_\pi$ constraint 
Similarly, for an assignment statement (t, l, v = E) ∈ π that executes l 1 next, the constraint added to λ π [i] is
After the modifications above we obtain the following over-approximation:
Let Ω(π) be the set of interleavings considered by ψ(π); then
is an over-approximation of the program behavior under the test vector $\vec{v}$. In general, the unsatisfiability of ψ(π) proves that P has no assertion failures under the test vector $\vec{v}$. The downside of using ψ(π) is that it inevitably admits invalid executions, which need to be filtered out afterwards. In the running example in Fig. 1, the SMT solver may report $\pi_6$ in Table 1 as a satisfying solution of ψ(π). However, it is not a feasible trace, since the behavior of the step on Line 2 is unspecified in ψ(π) when y < 2.
Refinement

Analysis-Guided Execution
Let CEX π be a satisfying assignment to all variables in ψ(π); it is called a potential counterexample. In the counterexample-guided abstraction refinement (CEGAR) framework, a decision procedure (theorem prover, SAT solver, or BDDs) [8, 2, 1, 27] is used to check whether CEX π is feasible in P and, if not, to refine the over-approximation. Such an approach may not scale to multithreaded software because of the program complexity and the length of the counterexamples. Instead, we use guided concrete execution rather than a theorem prover or a SAT solver. Let T = ∪ Given CEX π , we first extract a thread schedule SCH π = ∃ v∈{T ∪L} .CEX π , and organize it as a sequence
Note that the occurrence index is not needed as the sequence uniquely identifies a trace (although it may be infeasible). The program is then re-executed by trying to follow π SCH ; this is implemented by using check-point and restart techniques as in [30] . If the re-execution can follow π SCH to completion, then π SCH represents a real bug. Otherwise, we obtain a new executed trace
$\pi'$ and $\pi_{SCH}$ have the same thread ids and line numbers for the first k − 1 steps. But starting from the k-th step, $\pi'$ can no longer follow $\pi_{SCH}$ and completes the execution on its own.
To sum up, by performing a guided execution after analyzing the over-approximation ψ(π), we are able to either validate the potential counterexample $CEX_\pi$ or obtain a new execution $\pi'$ for further analysis.
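The guided execution can be sketched on a toy model. Everything below is hypothetical scaffolding: the program is represented by an explicit list of feasible complete traces (`make_model`), whereas the approach described above re-executes the real program with check-point and restart; steps are (thread id, line number) pairs.

```python
def make_model(feasible_traces):
    """Toy stand-in for the program: which steps can follow a given prefix."""
    def next_steps(prefix):
        n = len(prefix)
        return {t[n] for t in feasible_traces if len(t) > n and t[:n] == prefix}
    return next_steps


def guided_run(schedule, next_steps):
    """Follow `schedule` as far as the program allows, then run freely.

    Returns (followed_to_completion, executed_trace, k), where k is the
    1-based step at which the schedule could no longer be followed, or None.
    """
    trace = []
    for k, step in enumerate(schedule, start=1):
        if step not in next_steps(tuple(trace)):
            # Divergence at step k: complete the execution on its own.
            while True:
                options = next_steps(tuple(trace))
                if not options:
                    return False, tuple(trace), k
                trace.append(min(options))  # arbitrary but deterministic choice
        trace.append(step)
    return True, tuple(trace), None


# Hypothetical feasible executions of a two-thread program.
A = ((1, 1), (1, 2), (2, 6))
B = ((1, 1), (2, 6), (1, 3))
model = make_model([A, B])
spurious = guided_run(((1, 1), (1, 2), (1, 3)), model)
real = guided_run(B, model)
```

In the spurious case the returned trace agrees with the schedule on its first k − 1 steps and then continues on its own, which yields the new executed trace that is analyzed in the next iteration.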
Avoid Redundant Checks
To avoid performing symbolic analysis on executed traces that have been analyzed before, we maintain a set χ of already inspected traces. Let $\{\pi_1, \ldots, \pi_m\}$ be the set of executed traces in the first m iterations that have been analyzed. If ψ($\pi_m$) is satisfiable, we are only interested in a solution $\vec{S}$ such that the trace
This requirement matters not only for performance but also for the termination of the algorithm: without χ, our algorithm may analyze the same executed trace infinitely often.
Let $\pi_t$ be the subsequence of π that is executed by Thread t. For two such subsequences π Therefore, the trace enforcement constraint $\varepsilon_{\pi_t}$ uniquely identifies a trace $\pi_t$ in Thread t. As $\Pi_P(\pi)$ consists of the interleavings among the traces $\pi_{t_1}, \ldots, \pi_{t_N}$, they are identified by $\varepsilon_\pi = \varepsilon_{\pi_{t_1}} \wedge \ldots \wedge \varepsilon_{\pi_{t_N}}$. In other words, in order to find a trace not in $\Pi_P(\pi)$, we must add the constraint $\neg\varepsilon_\pi$. Assuming $\{\pi_1, \ldots, \pi_m\}$ are the traces that have been executed so far, we have
The over-approximation formula at the (m + 1)-th iteration becomes
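The effect of excluding already analyzed traces can be illustrated outside the solver. The sketch below is an explicit, non-symbolic analogue of adding $\neg\varepsilon_\pi$: it treats two traces as belonging to the same interleaving set when their per-thread subsequences coincide. The names `project` and `is_new` are ours, steps are (thread id, line number) pairs, and occurrence indices are omitted on the assumption that the example is loop-free.

```python
from collections import defaultdict


def project(trace):
    """Split a trace of (thread, line) steps into per-thread subsequences."""
    per_thread = defaultdict(list)
    for thread, line in trace:
        per_thread[thread].append(line)
    return {t: tuple(lines) for t, lines in per_thread.items()}


def is_new(candidate, seen):
    """True if `candidate` is not an interleaving of any trace in `seen`."""
    proj = project(candidate)
    return all(proj != project(old) for old in seen)


# Running example: Thread 1 executes lines 1, 2, 5 and Thread 2 lines 6, 7.
pi_1 = [(1, 1), (1, 2), (1, 5), (2, 6), (2, 7)]
pi_4 = [(1, 1), (2, 6), (1, 2), (1, 5), (2, 7)]
# A trace prefix that takes the other branch in Thread 1 (line 3).
other = [(1, 1), (2, 6), (2, 7), (1, 2), (1, 3)]
```

Here `is_new(pi_4, [pi_1])` is false because $\pi_4$ only reorders the steps of $\pi_1$, while `is_new(other, [pi_1])` is true because Thread 1 follows a different path.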
An Illustrative Example
Assume the first executed trace is $\pi_1 = \langle (1, 0.1), (1, 1.1), (1, 2.1), (1, 3), (1, 0.2), (1, 1.2), (1, 2.2), (1, 5), (1, 6), (2, 13), (2, 14), (2, 15), (2, 16), (3, 13), (3, 14), (3, 15), (3, 16), (1, 12.1), (1, 12.2) \rangle$, in which Thread 1 creates Threads 2 and 3 that execute bar (1). Note that in $\pi_1$ we drop the occurrence index if a statement of a thread occurs only once. An under-approximated symbolic analysis on $\pi_1$ does not yield an assertion violation, but the over-approximated symbolic analysis produces a counterexample $CEX_1 = \langle (1, 0), (1, 1), (2, 13), (2, 14), (2, 15), (1, 2), (1, 5), (1, 7), (1, 10), (1, 11) \rangle$, which leads to an assertion failure on Line 11. An execution following $CEX_1$ shows that the counterexample is spurious: it can only be followed up to (1, 5), because the else branch on Line 5 cannot be taken. The complete executed trace is $\pi_2 = \langle (1, 0), (1, 1), (2, 13), (2, 14), (2, 15), (1, 2), (1, 5), (1, 6), (2, 16), (1, 12) \rangle$. There is no assertion failure in $\pi_2$, but the counterexample obtained from the over-approximated analysis is $CEX_2 = \langle (1, 0), (1, 1), (2, 13), (2, 14), (2, 15), (2, 16), (1, 2), (1, 5), (1, 7), (1, 10), (1, 11) \rangle$. A further execution is able to follow the complete trace of $CEX_2$ and therefore reveals a real assertion failure on Line 11.
Optimizations
We apply peephole partial order reduction (PPOR) [29] to exploit the equivalence of interleavings due to independent transitions. Unlike classical partial order reduction [17, 15] , PPOR is able to reduce the search space symbolically in an SMT solver.
Given an executed trace π = (t 1 , l 1 . 
which prohibits $Q_p$ from being executed immediately after $Q_q$. Similar constraints can be added to the over-approximated satisfiability formula ψ(π).
Another optimization is a new thread-local static single assignment (TL-SSA) form to efficiently encode the thread-local statements. TL-SSA can significantly reduce the number of variables and the number of constraints needed in ϕ(π) and ψ(π), which is crucial since these numbers often directly affect the performance of an SMT solver. Our observation is that the encoding in Section 3 may produce many redundant variables and constraints, because it has to assign a fresh copy to every variable at every step. However, statements involving only local variables do not need a fresh copy of the local variables and constraints at every step. Furthermore, in a typical program execution, each statement writes to one variable at a time; a vast number of constraints, in the form of v[i + 1] = v[i], are used to keep the current values of the uninvolved variables.
In a purely sequential program, one can use Static Single Assignment (SSA) form [9] to simplify the encoding of a SAT formula [7]. However, SSA is not meant to be used in multithreaded programs (it remains an open problem what an SSA-style IR should be for concurrent programs), since a use-define chain for any shared variable cannot be established at compile time. Our observation here is that, while shared global variables cannot take advantage of the SSA form, local variables can still utilize the reduction power of SSA. The proposed TL-SSA form exploits the fact that, in any particular execution trace, the use-define chain of every local variable can be determined. Consider an executed trace snippet . . . (y=a+1), . . ., (a=y), . . ., (y=y+a), where y is a shared variable and a is a local variable. In addition, no other statements in the trace access a. The corresponding sequence of TL-SSA statements is . . . (y=a0+1), . . ., (a1=y), . . ., (y=y+a1). Instead of creating fresh copies for local variables at every step, the TL-SSA form creates only two copies of a. In addition, there is no need for the constraints a[i + 1] = a[i] to keep the value of a at each step where a is not assigned.
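The renaming in this example can be sketched in a few lines. This is a simplified illustration, not the encoder used in the prototype: statements are hypothetical `(lhs, rhs)` string pairs, `local_vars` holds the thread-local names of a single thread, and the sketch assumes no existing identifier collides with a generated name such as `a1`.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")


def tl_ssa(trace, local_vars):
    """Rename thread-local variables in a trace to single-assignment versions.

    Shared variables are left untouched; each write to a local variable
    creates the next numbered version, and each read uses the current one.
    """
    version = {v: 0 for v in local_vars}

    def rename_use(match):
        name = match.group()
        return f"{name}{version[name]}" if name in version else name

    result = []
    for lhs, rhs in trace:
        new_rhs = IDENT.sub(rename_use, rhs)  # reads see the current version
        if lhs in version:
            version[lhs] += 1                 # a write starts a new version
            lhs = f"{lhs}{version[lhs]}"
        result.append((lhs, new_rhs))
    return result


# The snippet from the text: y is shared, a is thread-local.
snippet = [("y", "a+1"), ("a", "y"), ("y", "y+a")]
renamed = tl_ssa(snippet, {"a"})
```

For the snippet above, `renamed` contains the statements y=a0+1, a1=y, y=y+a1, so only two copies of a are created regardless of how many steps lie between the three statements.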
Related Work
Since we are not the first to model high-level source code semantics using a constraint language, it is helpful to briefly mention some of the successful approaches that have been reported. Noting the large gap between high-level programming languages and formal logics, existing symbolic model checking tools, including [2, 7, 21], often restrict their representations to the pure Boolean domain; that is, they extract a Boolean-level model from the given program and then apply Binary Decision Diagrams (BDDs) [4] or SAT solvers (e.g., [11]) to perform verification. Although modeling all variables as bit-vectors is accurate, such high-precision approaches are often not needed and may generate models that are too large. In addition, bit vectors cannot model floating point arithmetic. In [32], sequential C programs are modeled at the word level, as opposed to the bit level, using polyhedral analysis. This approach was shown to be very competitive for handling sequential C programs of non-trivial sizes. Unlike [32], which uses the polyhedra library Omega [26] to perform reachability computation, we leverage the recently demonstrated performance advances of SMT solvers to perform satisfiability checking.
Approaches based on similar ideas that augment testing with formal analysis include [20, 18, 6, 24, 28] . While Synergy [20] considers only sequential programs, we concentrate on multithreaded programs. Concolic testing [18, 6, 24] runs symbolic executions simultaneously with concrete executions, but the purpose is to generate new test inputs for better path coverage. In our approach, the purpose of symbolic analysis is to consider all related feasible thread interleavings implicitly, and in the event of inconclusive results, to guide the next concrete execution to follow a different thread schedule (that obeys program control flow semantics) under the same test vector. Predictive analysis [28] encodes a single execution symbolically without further refinement. The approach that augments testing with formal analysis has also been applied in other domains such as MCAPI [13, 12] and service computing [14] .
Although integrated under- and over-approximations have been used in a decision procedure [3] for bit-vector arithmetic, most previous work on hardware and software model checking follows the paradigm of CEGAR [22, 8], which is based solely on over-approximations and uses spurious counterexamples to refine them. In [19], Grumberg et al. presented a software model checking procedure based on a series of under-approximations. There are several research projects that target concurrent program verification directly. Inspect [30] and CHESS [25] can check multithreaded C/C++ programs by explicitly executing different interleavings using dynamic partial order reduction [16]. However, explicitly exploring the thread interleavings does not scale well in the presence of a large number of (equivalence classes of) interleavings. Recent developments in CHESS [25] also allow the tool to perform context-bounded model checking. However, it is not intuitive to ask the user for a preset bound on the number of context switches. CheckFence [5] checks all concurrent executions of a given C program on a relaxed memory model and verifies that they are observationally equivalent to a sequential execution, which targets a different application than ours.
Experiments
We have implemented a prototype of TDV using the Yices SMT solver [10], which is capable of deciding formulas with a combination of theories including propositional logic, linear arithmetic, and arrays. We performed two case studies. The first case study is on the example shown in Fig. 4 with multiple threads and recursion, and the second case study is on a file system implementation, which was previously used in [16]. Our experiments were conducted on a workstation with a Pentium D 2.8 GHz CPU and 4 GB of memory running Red Hat Linux 7.2. Table 2 shows the results of the first case study. By changing the value of the test variable a, we can increase the number of threads and the level of recursion. Column 1 lists the number of threads. Columns 2 and 3 show the peak memory and total time usage for Bounded Model Checking (BMC) without dynamic execution and abstraction. Columns 4 and 5 show the peak memory and total time usage for TDV. Note that optimizations have been applied to both methods. The last column shows the speedup of the new method. A one-hour timeout limit is used in all the experiments. BMC ran out of time for test cases with more than 50 threads, while our method took only 407 seconds to complete 80 threads.
We also performed the experiments on the file system example, which is derived from a synchronization idiom found in the Frangipani file system. Table 3 shows the results we obtained by comparing BMC and TDV, both without and with optimizations. The results show that TDV gains a speedup of 1.46 to 77.33 over BMC, and TDV with optimizations gains a speedup of 5.87 to 1171.44 over BMC, with an average speedup of 299.
Conclusion and Future Work
We have presented a new method to combine the efficiency of dynamic executions with the reasoning power of an SMT solver for the verification of safety properties of multithreaded programs. The main contributions are (1) a new symbolic encoding of executions of a multithreaded program, and (2) the use of both under- and over-approximations in the same trace-driven abstraction framework, where refinement involves mutual guidance between concrete program execution and symbolic analysis. For future work, we plan to investigate performance enhancement techniques, such as minimal unsatisfiable core analysis [23] and dynamic path reduction [31], to allow TDV to scale to larger programs.
