Programming environments for sequential programs provide strong debugging support. However, concurrent programs, especially under relaxed memory models, lack powerful interactive debugging tools. In this work, we present Gambit, an interactive debugging environment that uses gdb to run a concrete debugging session on a concurrent program, while employing a symbolic execution on the program trace in the background simultaneously. The symbolic execution is analysed by a theorem prover to answer queries from the programmer on possible scenarios resulting from alternate thread interleavings or due to reorderings on other relaxed memory models.
Introduction
Debugging programs is hard. Interactive debugging environments, like gdb, are a powerful asset in understanding program behavior. However, although many environments (including gdb) offer support for concurrency, debugging concurrent programs remains an arduous task.
Concurrent programs under relaxed memory models pose the additional dual challenges of thread interference and memory reorderings. Thread interference arises because the thread scheduler is free to interleave the instructions of different threads, so that other threads may write concurrently to shared memory locations. Memory reorderings arise from the relaxed semantics of weak memory models, that is, relaxations on the order in which different memory operations take effect (due to microarchitectural features like store buffers [41] or compiler optimizations). There is a significant gap in the debugging support for such program environments.
We present Gambit, an interactive debugging environment for concurrent programs under relaxed memory models. Gambit leverages gdb to provide powerful debugging features (like setting breakpoints, continuing to a breakpoint and watching variables) while running a symbolic execution under the hood to collect constraints on the program path being debugged. These constraints are analyzed by a Satisfiability Modulo Theories (SMT) solver to answer questions about alternate instruction reorderings (due to relaxed memory semantics) or other thread interleavings (due to scheduler choices). Gambit provides the following debugging options:
• Isolated Debugging: It allows the user to explore the effect of running only one thread (or a few) in isolation (i.e., with no interference from any other thread). This allows the user to untwine the effect of weak memory semantics from behaviors exhibited due to interleavings with other threads.
• Differential Debugging: It allows the user to understand the impact of running the program under different settings (scheduler choices, memory model reorderings). For example, it can ask questions like "Can this set of variables give a different set of values under a different memory model?"
• Scenario Exploration: It allows the programmer to answer "what if" questions to understand the program better. For example, a programmer can ask: "Can the program produce a certain state over a set of variables under a specified memory model? If yes, what must be the reordering that can lead to it? Or, what is the culprit thread-switch that causes it?"
• Repair Suggestions: It offers suggestions on creating atomic blocks or inserting fences (of certain types).
Gambit offers a wide range of concurrency primitives to model programs: atomic blocks (weak and strong), fences (load-load, load-store, store-store, store-load, full) and lock variables. Additionally, it provides assume and assert statements to model failure scenarios that guide debugging. Gambit can be used to understand the runtime behavior of critical parts of a program on client environments that are different from the developer's host environment. Our current implementation supports sequential consistency (SC), partial store order (PSO) and total store order (TSO); however, Gambit is extensible, allowing the addition of other memory models. Additionally, Gambit provides a visualization of the heap, allowing the programmer to debug concurrent data-structure manipulations.
The contributions of the paper are as follows:
• We propose an interactive debugging environment that combines a powerful debugger (gdb) with the reasoning ability of a theorem prover to provide new debugging features (isolated debugging, differential debugging, scenario exploration and repair suggestions) that ease the challenging task of debugging concurrent programs over diverse memory models and scheduler choices.
• We propose proof-guided underapproximation-widening driven algorithms for identifying culprit interleavings (unintended interleavings that cause atomicity violation bugs) and culprit reorderings (reorderings allowed by relaxed memory models that cause unintended/buggy program behaviors), which, though not optimal, are fast.
• Our experiments indicate that Gambit answers interesting queries quickly, responding in less than 1 second for almost all queries.
Motivating Example
Our motivating example (Listing 1) is an implementation of a concurrent stack data structure. The push() (pop()) function pushes (pops) a node to (from) the stack. We demonstrate a scenario of debugging reported bugs: a crash, as well as some lost updates (pushes). Moreover, the bugs were observed in an environment with the PSO memory model, while the programmer's (user's) environment supports SC. Possibly because of this, although the user could reproduce the lost updates in some cases, the crash could not be reproduced in the programmer's environment. Hence, the user resorts to Gambit to understand the situation better. The crash was reported on the concurrent use of the push and pop routines. We launch Gambit with the reported failing inputs.
$gambit stack
Welcome to Gambit 1.0
(gbt)
Gambit inserts requisite breakpoints (like at the thread entry points) to ensure a deterministic, controlled interleaved execution. Running the program and enquiring about the threads shows that the threads are created:
(gbt) run; info threads
* 1 Thread 0x7ffff7fd6740 (LWP 5630) "stack" main () at stack.c:44
2 Thread 0x7ffff600b700 (LWP 5634) "stack" push (value=0x14) at stack.c:21
3 Thread 0x7ffff580a700 (LWP 5635) "stack" push (value=0x1e) at stack.c:21
4 Thread 0x7ffff5009700 (LWP 5636) "stack" pop () at stack.c:33
Using the viz command displays visualizations of the heap for the convenience of the programmer (Figure 1a ).
(gbt) viz arr success
Bug#1: Program Crash (Null-Pointer Dereference)
We run only the push() and pop() routines and then ask Gambit to explore all possible alternative scenarios on this set of instructions. We first run all the instructions in the push() routine by switching to the respective thread (using the thread command), setting a breakpoint at the return statement (line 31), and then continuing the execution to that location.
(gbt) thread 2; break stack.c:31; continue
Breakpoint 4, push() at stack.c:31
31 return;
The same is repeated for the pop() routine. The crash logs had shown the involvement of ptr and result, local variables of the fourth thread (reading from arr[t]->v and ptr->val respectively). It is natural to suspect that the bug is due to the difference in the behaviors of the host and the target memory models. So, we apply the differential debugging facility in Gambit to query for states of ptr and result that are possible in PSO (the environment on which the bugs were reported) but not in SC (the host environment). The diffmm command queries for the possible values of ptr and result in thread 4 (pop) under the given memory model constraints. In this case, Gambit finds a scenario that produces NULL for the desired pointer. The buggy heap state can be seen by running the viz command again. It shows that the NULL pointer violation was due to a reordering of statements in PSO that is prohibited in SC. Finally, we can use the repair suggestions feature in Gambit to get a list of fences that can fix this trace for PSO.
(gbt) sgt_fence_se {PSO,!SC} 4:ptr=0x0
store-store fence between line 27, 30
We may apply the fence suggestions now (Figure 1c). While these suggestions are guaranteed to work on the trace being debugged, they may not be the best with respect to performance. Once we understand the problem using Gambit, we may use other tools (see Section 6) for performant suggestions. Gambit is not meant to replace the large set of tools for analyzing, testing and repairing concurrent programs, but provides a fast, interactive debugging option for concurrency bugs.
As errors due to memory model relaxations may not require interference from other threads to manifest, Gambit provides an isolated debugging mode that allows inspection of states in any of the threads being debugged. This mode enables the user to look at a desired thread in isolation and uncover the inconsistencies caused by the reordering of statements alone, that is, without any interferences from other threads. For example, observing thread 2 in isolation can be enabled by:
(gbt) setisolation 2
Bug#2: Lost Updates (Pushes)
Now, we switch to the other bug, i.e. lost updates. Figure 1d shows the heap state illustrating the scenario. This bug can be reproduced in Gambit by running two instances of push(). We hypothesize that the bug is due to the two push() calls conflicting on the same index. So, we use the dbg_atomic (debug atomicity) command to search for a pair of statements that must contain an interleaving to create the undesirable state (the variable idx in both threads points to the same location in the array, say 0).
(gbt) dbg_atomic 2:idx=0 3:idx=0
Culprit interleaving between lines 23, 24
This shows that interference from the other thread between lines 23 and 24 may be causing the failure. The sgt_atomic (suggest atomic section) command suggests that lines 23-24 must be enclosed within an atomic block. After applying the suggestion, the correct heap state is observed, as shown in Figure 1e.
Figure 2 shows the layout of Gambit. We use the Intermediate Representation generator (IR generator) to identify the shared memory accesses, and accordingly compile the program into an intermediate representation. Table 1 shows the core intermediate language supported by Gambit, consisting of shared memory read and write instructions, operations on local variables, along with conditional and unconditional branching instructions. Additionally, the intermediate representation supports fence and atomic instructions for concurrency and the assume and assert instructions for modeling interesting scenarios. Many other concurrency constructs, like compare-and-swap (CAS) and mutex locks, are implemented using the core intermediate language as part of an accompanying library; for example, CAS and mutex locks are implemented using the assume and atomic instructions. Gambit is capable of handling third-party library calls (whose source is not available) via the well-known technique of concretization [21].
The IR is fed into the driver which, under commands issued by the user (via the user-shell), runs a controlled execution of the program through the execution controller. The execution controller, under the strict supervision of the driver module, fetches the program instructions and feeds them simultaneously to both gdb (for concrete execution) and the symbolic execution engine (to collect a symbolic event trace).
On query commands from the user-shell, the driver launches the reasoning module to handle isolated debugging, differential debugging, scenario exploration and repair suggestion related features. The reasoning module fetches the event trace from the symbolic execution engine, and uses an SMT solver to push a response to the user via the driver module.
The symbolic execution and the reasoning engines share common symbol tables that maintain the symbolic expression bindings of each program variable. The symbol tables are populated by the symbolic execution engine. The reasoning engine, operating completely in the symbolic domain, uses the symbol tables to extract the symbolic expression bindings of the variables involved in a query.
Notations
We use small letters (often $r$) to indicate local variables and capital letters to indicate shared variables. We use the notation $stmt_t$ to indicate that the statement $stmt$ is executed in thread $t$; gdb_val refers to the concrete value of a variable (extracted via gdb).
Our intermediate representation includes shared memory reads ($r := X$), shared memory writes ($X := r$), operations on local variables ($r_1 = r_2 \; op \; r_3$), control transfer instructions (if (r) then goto L or goto L), modeling constructs (assume and assert) and concurrency constructs (atomicBegin/atomicEnd and fence).
We define the program state at a program location l as a map from variables to their respective values at l during a program execution. A projected program state is simply a projection of the program state on a (provided) subset of variables.
We define a memory model constraint as a set of memory model labels in their positive or negated form; it specifies behaviors to be explored by the reasoning engine. For example, {PSO,!SC} forces the reasoning engine to only explore behaviors that are allowed by PSO but not allowed by SC. This forms our core specification construct for differential debugging queries.
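As a minimal sketch of this specification construct, the following Python fragment parses a constraint such as {PSO,!SC} into its positive and negated labels and checks whether a behavior satisfies it. The names ModelConstraint, parse_constraint and admits are hypothetical and are not taken from Gambit; only the three model labels come from the text.

```python
from typing import FrozenSet, NamedTuple

# Memory models named in the text; the set itself is an illustrative assumption.
KNOWN_MODELS = frozenset({"SC", "TSO", "PSO"})


class ModelConstraint(NamedTuple):
    allowed: FrozenSet[str]    # the behavior must be allowed by each of these
    forbidden: FrozenSet[str]  # the behavior must be allowed by none of these


def parse_constraint(text: str) -> ModelConstraint:
    """Parse a constraint such as '{PSO,!SC}' into positive and negated labels."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError(f"expected '{{...}}', got {text!r}")
    allowed, forbidden = set(), set()
    for label in body[1:-1].split(","):
        label = label.strip()
        negated = label.startswith("!")
        name = label[1:] if negated else label
        if name not in KNOWN_MODELS:
            raise ValueError(f"unknown memory model {name!r}")
        (forbidden if negated else allowed).add(name)
    if allowed & forbidden:
        raise ValueError("a label cannot be both required and negated")
    return ModelConstraint(frozenset(allowed), frozenset(forbidden))


def admits(constraint: ModelConstraint, models_allowing: set) -> bool:
    """True if a behavior allowed exactly by `models_allowing` meets the constraint."""
    return (constraint.allowed <= models_allowing
            and not (constraint.forbidden & models_allowing))
```

For example, a behavior allowed only by PSO satisfies {PSO,!SC}, while a behavior that SC also allows does not.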
$W_X(s_i, t_i)$ ($R_X(s_i, t_i)$) denotes the write (read) of the global variable $X$ by thread $t_i$ at sequence-id $s_i$; an asterisk (*) in place of the global variable indicates a write (read) to any global variable.
Symbolic Execution Engine
The symbolic execution engine operates on the set of program statements as they pass through the debugger (gdb); the symbolic execution engine maintains:
• a set of program counters ($loc_t$), one for each thread $t$;
• a global memory map ($\Gamma$) for the shared variables, and a set of thread-local memory maps ($\Gamma_t$) for local variables in each thread $t$.
The memory maps bind the variables to their respective symbolic states; the symbol table is the set of all the memory maps. Table 1 shows how each program construct is translated into a dynamic event trace via symbolic execution. Column 2 shows how the symbolic state in the symbolic execution engine is updated for each statement. As the symbolic execution proceeds, a symbolic trace is captured (column 3), which is used by the reasoning module to answer the programmer's queries.
A trace entry is recorded in the format: ⟨s, t, op, X , Φ⟩ where s is the trace sequence identifier used to uniquely identify this trace entry that records that an operation op is performed on the global variable X in the thread t with a symbolic expression Φ.
The shared read in thread $t$ (denoted as $(r = X)_t$) increments the program counter and updates the memory map corresponding to thread $t$ to bind $r$ to a fresh variable $\alpha$. All shared memory reads are bound to fresh variables, which are resolved to relevant writes in accordance with the interleaving and memory model semantics. A shared memory write ($(X = r)_t$) simply updates the shared memory map to bind the variable $X$ to the symbolic expression mapped to the local variable $r$ in the map $\Gamma_t$.
The other operations are self-explanatory. Note that local operations produce no trace entries and are, thus, opaque to the reasoning engine. The effect of these operations is captured via the symbolic execution engine that constructs the relevant symbolic expressions.
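The update rules described above can be illustrated with a small Python sketch. It is not Gambit's implementation (which the paper builds on CREST); the class and method names are hypothetical, symbolic expressions are kept as plain strings, and sequence identifiers are assumed to start at 0.

```python
import itertools
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TraceEntry:
    s: int    # trace sequence identifier
    t: int    # thread performing the operation
    op: str   # "R" (shared read) or "W" (shared write)
    var: str  # global variable X
    phi: str  # symbolic expression


@dataclass
class SymExec:
    shared: dict = field(default_factory=dict)  # global memory map (Gamma)
    local: dict = field(default_factory=dict)   # thread -> local map (Gamma_t)
    loc: dict = field(default_factory=dict)     # thread -> program counter
    trace: list = field(default_factory=list)
    fresh: itertools.count = field(default_factory=itertools.count)

    def _step(self, t):
        self.loc[t] = self.loc.get(t, 0) + 1

    def read(self, t, r, X):
        """r = X in thread t: bind r to a fresh symbol and record a read event."""
        alpha = f"alpha{next(self.fresh)}"
        self.local.setdefault(t, {})[r] = alpha
        self.trace.append(TraceEntry(len(self.trace), t, "R", X, alpha))
        self._step(t)

    def write(self, t, X, r):
        """X = r in thread t: bind X to r's expression and record a write event."""
        phi = self.local[t][r]
        self.shared[X] = phi
        self.trace.append(TraceEntry(len(self.trace), t, "W", X, phi))
        self._step(t)

    def assign(self, t, r1, r2, op, r3):
        """r1 = r2 op r3: a local operation, so no trace entry is produced."""
        env = self.local[t]
        env[r1] = f"({env[r2]} {op} {env[r3]})"
        self._step(t)


engine = SymExec()
engine.read(1, "r1", "X")
engine.read(1, "r2", "Y")
engine.assign(1, "r3", "r1", "+", "r2")
engine.write(1, "Z", "r3")
```

After these four statements the trace holds three entries (two reads and one write); the local addition is visible only through the expression attached to the write.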
Reasoning Engine
We refer to the trace of events captured from the symbolic execution as the program (event) trace. As programmers are interested in alternate event sequences resulting from different interleavings or reorderings, the reasoning engine uses an SMT engine to capture all possible sequences allowed by the target memory models on the same set of events as captured in the trace (i.e. for the same interprocedural control-flow). An event sequence is ordered by the logical timestamp function $\tau(\cdot)$, such that $\tau(s_i)$ is the logical time of the event with sequence number $s_i$ in the event trace. If, in an event sequence $S_a$, $\tau(s_i) < \tau(s_k)$, we say that $s_i$ happens before $s_k$ in $S_a$. We define two additional relational operators: $s_i \sqsubset_T s_k$ (trace order) states that $s_i$ happens before $s_k$ in the program event trace, and $\tau(s_i) \sqsubset_M \tau(s_k)$ (memory order) states that $\tau(s_i)$ happens before $\tau(s_k)$ in a synthesized event sequence.
We define the notion of a write-free (sub)sequence of events: in a program event trace, W F (t, X , s i , s k ) implies that there are no writes to the variable X in thread t between sequence numbers s i and s k . We overload this notation for an event sequence, where W F (t, X , τ (s i ), τ (s k )) implies that there are no writes to the variable X in thread t between event timestamps τ (s i ) and τ (s k ). If t is absent in W F (), then it is true for all threads.
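The write-free predicate on a program event trace can be sketched as below. The tuple layout (s, t, op, X) and the function name are assumptions made for illustration, and the interval is taken to be inclusive at both ends, which the text does not state explicitly.

```python
from typing import Iterable, Optional, Tuple

# (sequence id, thread, operation "R"/"W", global variable): assumed layout
Event = Tuple[int, int, str, str]


def write_free(trace: Iterable[Event], X: str, s_i: int, s_k: int,
               t: Optional[int] = None) -> bool:
    """WF(t, X, s_i, s_k): no write to X by thread t with s_i <= s <= s_k.

    When t is None the check covers all threads, mirroring the case where
    t is absent in WF().
    """
    for s, thread, op, var in trace:
        if op != "W" or var != X or not (s_i <= s <= s_k):
            continue
        if t is None or thread == t:
            return False
    return True


trace = [
    (0, 1, "W", "X"),
    (1, 2, "W", "Y"),
    (2, 2, "W", "X"),
    (3, 1, "R", "X"),
]
```

With this trace, the interval 1..3 is write-free for X with respect to thread 1, but not with respect to all threads, because thread 2 writes X at sequence number 2.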
To explore an alternate sequence, Gambit builds the following set of constraints:
• Program constraints (P): For each event in the trace, the reasoning engine builds constraints as described in Table 2. At a symbolic read at sequence number $s_i$, it ensures that the symbolic value read ($\alpha$) binds as follows:
- A read on $X$ can read from a shared write in the same thread $t_1$ (i.e. $\alpha = W_X(s_2, t_1) \wedge s_2 < s_i$) if $s_2$ is the nearest preceding write in the (program) trace order ($WF(t_1, X, \tau(s_2) + 1, \tau(s_i))$) and there is no interim write between $\tau(s_2)$ and $\tau(s_i)$ in another thread ($WF(X, \tau(s_2) + 1, \tau(s_i))$);
- A read on $X$ can read from a shared write in another thread $t_k$ (i.e. $\alpha = W_X(s_2, t_k)$) if there is no interim write between these statements ($WF(X, \tau(s_2) + 1, \tau(s_i))$).
At a symbolic write, no additional constraint is generated. The fence instructions ensure that certain reorderings are disallowed; for example, the store-store fence ensures that two writes in any event sequence follow the program order. The weak atomicity constraints disallow any intervening event from an atomic block in another thread, while the strong atomicity constraint ensures that all the atomic events share contiguous timestamps.
• Memory model constraints (M (.)): The encoding of memory model constraints is listed in Table 3. Each memory model must satisfy the program order (PO) constraints, which ensure that all the dependencies on the same variable are obeyed within each thread.
The sequentially consistent memory model (SC) ensures that all events in an event sequence follow the program order. The weaker memory models, PSO and TSO are more relaxed; TSO allows reordering of (write→ read) of different memory locations (due to mechanisms like store buffers [41] ) while PSO further allows reordering of two writes of distinct memory locations.
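The relaxations just described can be summarized in a short Python sketch that enumerates the orders in which the events of a single thread may take effect. It is a simplified pairwise model written for illustration (it ignores store-buffer forwarding and fences) and is not the encoding of Table 3; the names RELAXED, must_stay_ordered and legal_orders are hypothetical.

```python
from itertools import permutations

# (earlier op, later op) pairs on different locations whose program order
# may be relaxed, following the informal description in the text.
RELAXED = {
    "SC": set(),
    "TSO": {("W", "R")},
    "PSO": {("W", "R"), ("W", "W")},
}


def must_stay_ordered(model, first, second):
    """first and second are (op, var) pairs in program order within one thread."""
    (op1, var1), (op2, var2) = first, second
    if var1 == var2:
        return True  # PO constraint: same-variable dependencies are kept
    return (op1, op2) not in RELAXED[model]


def legal_orders(model, thread):
    """All permutations of the thread's events that the model permits."""
    n = len(thread)
    result = []
    for perm in permutations(range(n)):
        pos = {event: p for p, event in enumerate(perm)}
        if all(pos[i] < pos[j]
               for i in range(n) for j in range(i + 1, n)
               if must_stay_ordered(model, thread[i], thread[j])):
            result.append(perm)
    return result


# A hypothetical thread: store x; store y; load z
thread = [("W", "x"), ("W", "y"), ("R", "z")]
counts = {m: len(legal_orders(m, thread)) for m in ("SC", "TSO", "PSO")}
```

For this thread the sketch admits one order under SC, three under TSO (the load may overtake either store) and six under PSO (the two stores may also swap).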
Algorithms

Isolated Debugging
Inspecting the values of shared variables due to reorderings in alternate memory models is challenging, due to two reasons:
• Inspection of a shared location should generate corresponding trace events so that the reasoning engine has access to these additional (inspection) read events;
• If we simply add additional inspection read events in the same thread (under inspection), these read events would introduce pseudo-dependencies that would inhibit certain potential reorderings.
However, if such code is run under the PSO memory model, which relaxes the program ordering over store-load and store-store statements when the memory accesses are to different memory locations, an unexpected state (0,1) also appears. This is possible because statements L1 and L2, being a store-store pair over two different memory locations, might get reordered. Due to this reordering, the write of the value 1 into x is delayed while the write of 1 into y is reflected in global memory first.
To inspect the values of x and y just after the statement y=1 in thread thr1(), we can (via gdb) simply print the concrete values corresponding to the same execution, but no symbolic constraints would be generated. This would prevent the reasoning engine from providing support on queries regarding executions on other memory models and (differently) interleaved executions. Hence, we need to issue trace events corresponding to the shared memory reads encountered during such intermittent state inspections. However, if the requisite shared memory read events are issued from the thread under inspection, undesired data dependencies would be introduced that would restrict reorderings even under relaxed memory models.
Intuitively, we circumvent this problem by issuing the read events corresponding to state inspection from a distinct observer thread. The interleaving of the observer thread with the program statements is important.
For example, if the observer read events are allowed to interleave in any manner, the state inspection may fetch values from past program points; for example, in Figure 4(a) , the observer thread will be able to observe the value x=0,y=0,z=0 though the programmer issued a query at L4. On the other hand, if the observer thread is constrained at the end of the event sequence, it will not allow reordering of the instructions.
We enable the following constraints on the interleaving of the observer thread in an event sequence:
• all the observer inspection (read) events must be atomic;
• the observer events must happen after the current location, i.e. after the last event in the event trace where the programmer has issued the query;
• to allow for reorderings, we allow a bounded number of reorder slots (say k); only k events can happen after the observer events in any event sequence.
As the value of k is increased, more reorderings are enabled and more behaviors are possible. The idea of bounded reorder buffers has also been suggested in [27] in the context of bounded model checking. Figure 4 shows the effect of increasing the number of reorder slots. In Figure 4(a), setting k=1, the observer can witness x=1,y=0,z=29 (Figure 4(b)) or x=1,y=5,z=0 (not shown) under PSO. Setting k=2 would allow the observer to witness an additional state as well, x=1,y=0,z=0 (Figure 4(c)).
The observer is particularly useful at enabling isolated debugging where only a single thread is run in isolation. This can allow the study of weak memory model behaviors without any interference from other threads. Gambit provides the setisolation command to enter the isolation mode for a thread. Isolated debugging in Gambit leverages the ability of gdb to pause and run threads selectively, while also providing support on the reasoning side to answer queries from the user.
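The effect of the reorder-slot bound k can be explored with a small brute-force sketch. It rests on several assumptions that are not from the paper: the thread consists only of stores, the last store marks the current location and therefore always precedes the observer, at most k earlier stores may be delayed past the observer, and a store may be delayed only if the memory model lets it swap with every later store that is not delayed. The three-store program is hypothetical and is not the program of Figure 4.

```python
from itertools import combinations


def stores_may_swap(model, var_a, var_b):
    """May two stores of one thread (var_a, then var_b) take effect out of order?"""
    return model == "PSO" and var_a != var_b


def observer_states(model, stores, k, init=0):
    """States an atomic observer can witness with at most k delayed stores."""
    n = len(stores)
    variables = sorted({var for var, _ in stores})
    seen = set()
    movable = range(n - 1)  # the last store must precede the observer
    for size in range(min(k, n - 1) + 1):
        for delayed in combinations(movable, size):
            late = set(delayed)
            legal = all(stores_may_swap(model, stores[i][0], stores[j][0])
                        for i in late
                        for j in range(i + 1, n) if j not in late)
            if not legal:
                continue
            memory = dict.fromkeys(variables, init)
            for idx, (var, value) in enumerate(stores):
                if idx not in late:
                    memory[var] = value
            seen.add(tuple(memory[v] for v in variables))
    return seen


# Hypothetical thread: x = 1; y = 5; z = 29, queried after the last store
stores = [("x", 1), ("y", 5), ("z", 29)]
```

Under this reading, observer_states("PSO", stores, 1) contains (1, 5, 29), (0, 5, 29) and (1, 0, 29); raising k to 2 adds (0, 0, 29), while SC yields only (1, 5, 29) for any k.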
Instead of spawning an observer thread, our implementation uses a more efficient solution enabled by gdb that produces the same effect: we launch an external function, using gdb.execute() (from the gdb API) that acts like an asynchronous call, thereby simulating the effect of the observer thread.
Differential Debugging
Given the base memory model m 2 and a target memory model m 1 , the DiffDebug algorithm (Algorithm 1) attempts to synthesize a projected state over a set of variables V that is feasible in m 1 but not in m 2 , thereby distinguishing the behaviors of the memory models m 1 and m 2 .
[Algorithm 1: DiffDebug. The listing is only partially recovered; surviving lines: 5: if $r_1$ == UNSAT then; 6: break; 7: end if; 9: if $r_2$ == UNSAT then; 11: if $r_3$ == UNSAT then; 12: $L \leftarrow L \cup \langle \Omega_1$ …]
To enumerate all possible ways shared read instructions can read from different writes, we use a selector vector that matches every shared read with all possible writes it can read from. Each shared memory read ($r = X$), say at a trace sequence identifier $i$, uses $\sigma_i$ to define a selector such that $\sigma_i = s$ if $r$ reads from a shared memory write to $X$ at the trace identifier $s$. The algorithm groups the interleavings of a program by the selector vector $\vec{\sigma}$, where each component $\sigma_i$ of the vector $\vec{\sigma}$ corresponds to a shared memory read at trace event $i$. In the example shown in Listing 2, the constraints for each component of the selector vector $\vec{\sigma} = \langle \sigma_4, \sigma_5 \rangle$, for the reads of y and x with labels L3 and L4, will be: L3:
The selector ⃗ σ = ⟨3, 0⟩ will read y from event sequence number 3 and x from event sequence number 0 ( Figure 3) .
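The space of selector vectors can be enumerated directly for a small trace. The sketch below assumes a trace in which events 0 and 1 are initial writes to x and y, events 2 and 3 are the writes labelled L1 and L2, and events 4 and 5 are the reads labelled L3 and L4; this layout is inferred from the sequence numbers quoted in the text and is an assumption, as are the tuple format and the function name.

```python
from itertools import product

# (sequence id, thread, op, variable); thread 0 stands for initialization
TRACE = [
    (0, 0, "W", "x"),
    (1, 0, "W", "y"),
    (2, 1, "W", "x"),  # L1
    (3, 1, "W", "y"),  # L2
    (4, 2, "R", "y"),  # L3
    (5, 2, "R", "x"),  # L4
]


def selector_vectors(trace):
    """Every way of matching each shared read with a write to the same variable."""
    reads = [e for e in trace if e[2] == "R"]
    choices = [[w[0] for w in trace if w[2] == "W" and w[3] == r[3]]
               for r in reads]
    read_ids = [r[0] for r in reads]
    return [dict(zip(read_ids, combo)) for combo in product(*choices)]


vectors = selector_vectors(TRACE)
```

Here the enumeration yields four candidate vectors, among them {4: 3, 5: 0}, which corresponds to reading y from event 3 and x from event 0. Whether a given vector is realizable under a memory model is a separate question that the solver decides.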
The algorithm starts by exploring all possible assignments of the selector vector to find an interleaving that exists in m 1 but not in m 2 (line 4) (the selector vector is initially empty). The algorithm exits when it is not able to find any more such interleavings (lines 5-6), returning the set of such distinguishing values (along with their reorderings) found so far (line 17).
Next, it issues another satisfiability query to check if the candidate interleaving can produce the desired (projected) state ($\Omega_1[V]$) in the memory model $m_2$ (line 8). As the model $\Omega$ essentially returns the resulting state containing variable bindings as well as the logical timestamp function $\tau$, we use the notation $\Omega[V]$ to denote that only the variable bindings are extracted from the model. If no such execution exists in $m_2$, we extract an unsatisfiable core (unsat core) on the variable-to-value bindings (labelled as the box uc) to identify a subset of variables that form the distinguishing state. Further, the reorderings that allow the required projected state are returned as another unsat core $uc_3$ (line 10).
[Algorithm listing fragment recovered at this position: 3: return ∅; 4: end if; 5: return uc]
In the example shown in Listing 2, let us invoke Algorithm 1 with m 1 as PSO, m 2 as SC and V = (a, b). The first Check query (line 4) returns r 1 as SAT. In the second Check query (line 8), we get r 2 as unsatisfiable (UNSAT), and uc 2 gives the distinguishing values for the given variable set V, i.e., a=1, b=0. The unsat core uc 3 gives τ (2) < τ (3), stating that the reordered statements are at trace locations 2 and 3 (the statements with labels L1 and L2). The values seen in uc 2, i.e., a=1, b=0, are not possible in SC but are possible in PSO due to the reordering of the statements shown in uc 3, which is a store-store reordering.
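The outcome of this differential query can be cross-checked on a tiny litmus test by exhaustive enumeration instead of SMT solving. The sketch below is not Algorithm 1: it enumerates every event order that a simplified pairwise model of SC, TSO and PSO permits, executes it against a flat memory, and reports projected states reachable under m1 but not under m2. The program (thread 1: x=1; y=1 and thread 2: a=y; b=x) is assumed from the surrounding discussion, since Listing 2 itself is not reproduced here, and all names are hypothetical.

```python
from itertools import permutations

RELAXED = {"SC": set(), "TSO": {("W", "R")}, "PSO": {("W", "R"), ("W", "W")}}

# (thread, op, shared variable, stored value or destination register)
EVENTS = [
    (1, "W", "x", 1),    # L1
    (1, "W", "y", 1),    # L2
    (2, "R", "y", "a"),  # L3
    (2, "R", "x", "b"),  # L4
]


def ordered(model, e1, e2):
    """True if e1, which precedes e2 in the trace, must also take effect first."""
    if e1[0] != e2[0]:
        return False  # different threads: any interleaving is allowed
    if e1[2] == e2[2]:
        return True   # same variable: program order is kept
    return (e1[1], e2[1]) not in RELAXED[model]


def states(model, events, observed):
    """Map each reachable projected state to one event order that witnesses it."""
    n = len(events)
    found = {}
    for perm in permutations(range(n)):
        pos = {e: p for p, e in enumerate(perm)}
        if any(ordered(model, events[i], events[j]) and pos[i] > pos[j]
               for i in range(n) for j in range(i + 1, n)):
            continue
        memory, regs = {}, {}
        for idx in perm:
            _, op, var, arg = events[idx]
            if op == "W":
                memory[var] = arg
            else:
                regs[arg] = memory.get(var, 0)  # shared variables start at 0
        found.setdefault(tuple(regs[r] for r in observed), perm)
    return found


def diff_debug(m1, m2, events, observed):
    """Projected states feasible under m1 but not under m2, with a witness order."""
    s1, s2 = states(m1, events, observed), states(m2, events, observed)
    return {state: order for state, order in s1.items() if state not in s2}


result = diff_debug("PSO", "SC", EVENTS, ("a", "b"))
```

On this program the only distinguishing state is a=1, b=0, witnessed by the order in which the store to y takes effect before the store to x; the same query with TSO in place of PSO returns no distinguishing state, since the two stores cannot swap under TSO.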
Reordering Explorer
The reordering explorer is shown in Algorithm 2. The reordering explorer allows the programmer to check if a set of variable-value bindings (V P ) is present in one memory model (m 1 ) but not in another (m 2 ); if such a scenario is feasible, it also shows the required reordering of statements with respect to m 2 .
In the two Check queries (lines 2, 3), the algorithm checks if the given (projected) state is feasible in m 1 but not in m 2. If so, we assert the candidate trace extracted from the first satisfiability query (Ω[τ ]) to extract the ordering of events that inhibits the desired (projected) state in m 2 (captured in the unsat core). For Listing 2, the algorithm, with ⟨m 1 = PSO, m 2 = SC, V P = {a : 1, b : 0}⟩, returns (τ (2) < τ (3)) in the unsat core, indicating that trace locations 2 and 3 (statements {L1, L2}) require reordering for the desired state to be feasible.
Atomicity Explorer
The atomicity explorer is shown in Algorithm 3: it enforces atomicity for all instructions in a thread (i.e. disallows any interference from other threads). If the target (projected) state $V_P$ is not feasible on the sequentially consistent memory model (constraints in blue box) with full atomicity constraints (constraints in red box with subscript uc), then the unsat core on the atomicity constraints returns a pair of trace events that are responsible for this infeasibility.
For Listing 2, invoking Algorithm 3 with $V_P = \{a : 0, b : 1\}$ returns $((\tau(3) = \tau(2) + 1), (\tau(5) = \tau(4) + 1))$, indicating that one or more of trace locations 2, 3 (L1, L2) or locations 4, 5 (L3, L4) must be atomic to prevent the violation of the assert (L5), i.e., a=0, b=1 under sequential consistency.

Table 4: Important commands in Gambit (the rows preceding dbg_reorder_d were not recovered).
Category | Command | Description
… | dbg_reorder_d | Returns the set of culprit reorderings over a subset of queried variables under the given memory model constraints.
Scenario Exploration | verify | Verifies the given projected state under the given memory model constraints.
Scenario Exploration | dbg_reorder_se | Returns the set of culprit reorderings responsible for the given projected state under the given memory model constraints.
Scenario Exploration | dbg_atomic | Returns program location pairs involved in interleavings that led to the given projected state.
Repair Suggestions | sgt_fence_d | Suggests appropriate fences to restrict the culprit reorderings over a subset of queried variables under the given memory model constraints.
Repair Suggestions | sgt_fence_se | Suggests fences to restrict the culprit reorderings responsible for the given projected state under the given memory model constraints.
Repair Suggestions | sgt_atomic | Suggests locations where atomic blocks should be inserted to prevent the given projected state.
Miscellaneous | setisolation | Enables isolated debugging for the given thread.
Miscellaneous | replay | Debugging sessions can be recorded and replayed.
Miscellaneous | thread [freeze] | The thread command enforces a context switch to the given thread. Using the freeze option along with the thread command of gdb introduces an additional constraint to enforce the context switch during the reordering of the event trace.
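The idea behind the atomicity search can be imitated by brute force on a lost-update example. The sketch below is not Algorithm 3 (which extracts an unsat core from an SMT query); it enumerates SC interleavings of a hypothetical two-thread increment, and looks for the smallest sets of adjacent same-thread event pairs that, once forced to occupy contiguous timestamps, make the unwanted state unreachable. All names are illustrative.

```python
from itertools import combinations, permutations

# Each thread performs r = X; X = r + 1 on a shared X that starts at 0.
PROGRAM = [(1, "R"), (1, "W"), (2, "R"), (2, "W")]


def interleavings(program, atomic_pairs):
    """SC interleavings in which every pair in atomic_pairs stays contiguous."""
    n = len(program)
    for perm in permutations(range(n)):
        pos = {e: p for p, e in enumerate(perm)}
        in_order = all(pos[i] < pos[j]
                       for i in range(n) for j in range(i + 1, n)
                       if program[i][0] == program[j][0])
        contiguous = all(pos[j] == pos[i] + 1 for i, j in atomic_pairs)
        if in_order and contiguous:
            yield perm


def final_x(program, order):
    x, regs = 0, {}
    for idx in order:
        thread, op = program[idx]
        if op == "R":
            regs[thread] = x
        else:
            x = regs[thread] + 1
    return x


def feasible(program, target, atomic_pairs=()):
    return any(final_x(program, order) == target
               for order in interleavings(program, atomic_pairs))


def culprit_pairs(program, target):
    """Smallest sets of adjacent same-thread pairs whose atomicity rules out target."""
    if not feasible(program, target):
        return []
    candidates = [(i, i + 1) for i in range(len(program) - 1)
                  if program[i][0] == program[i + 1][0]]
    for size in range(1, len(candidates) + 1):
        fixes = [c for c in combinations(candidates, size)
                 if not feasible(program, target, c)]
        if fixes:
            return fixes
    return []


fixes = culprit_pairs(PROGRAM, target=1)
```

In this example the lost update (final X equal to 1) disappears only when the read-write pairs of both threads are made atomic, so the search returns the single set containing the pairs (0, 1) and (2, 3); making just one pair atomic still leaves an interleaving that loses an update.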
The SuggestFence algorithm iteratively calls ReorderingExplorer (Algorithm 2) to suggest possible fence placements that can repair the program. Similarly, the SuggestAtomic algorithm iteratively calls AtomicityExplorer (Algorithm 3) to suggest atomic block placements that can repair the program. Due to space restrictions, we omit the detailed descriptions of these algorithms.
Experiments
Gambit uses INSPECT [57] as a preprocessor to identify the shared memory operations. The symbolic execution is implemented on top of CREST [12] and the reasoning engine uses Z3 [17] as the SMT solver.
We ran our experiments on an Intel i7-4770 CPU @ 3.40GHz with 32GB RAM on a set of 13 programs taken from [16]: Peterson (pe), Parker (pa), Dekker (de), Lamport (la), Szymanski (sz), Chase_Lev_WSQ (ch), Cilk_WSQ (ci), Harris_set (ha), Fifo (fi), Lifo (li), Anchor (an), MSN_queue (mn), MS2_queue (m2). We created multiple debug scenarios and ran Gambit on these programs; many of these are actual buggy scenarios in the benchmark programs. Our experiments are directed at determining whether Gambit can answer different types of queries within reasonable time. Table 4 provides a brief description of the important commands in Gambit.
We ran Gambit on 107 traces of the above-mentioned benchmarks. Table 5 shows the performance of Gambit on some of these queries; we provide the average time (Avg) and the standard deviation (Sd) in seconds, corresponding to each command. The Benchmark column indicates the number of queries that we executed on a given benchmark. For example, the first row implies that diffmm was executed on three traces from Peterson (pe(3)), one trace from Parker (pa(1)), and so on; the average time (standard deviation) across all these queries was about 0.23s (0.20s). Our experiments show that Gambit is quite fast, responding in less than 1 second for almost all queries. The standard deviation is high, which is not surprising for SMT-solver-based solutions. Table 6 shows a glimpse of some of our queries. We report the size of the event traces (#Ev); the number of program statements actually executed is larger, typically about 3.5× the number of events. For instance, the command diffmm {PSO,!SC} 3:task 3:t performs differential debugging on the variables task and t of thread 3, looking for behaviors allowed by PSO but not by SC.
Gambit could infer multiple fences in programs like Peterson, Parker and Dekker; the detected fences are consistent with previous work [4]. Gambit could infer an atomic section in Harris_set, MSN_queue and MS2_queue.
Related Work
Verifiers for weak memory models (that are required by fence synthesis engines) range from exploration-based techniques [26, 40] to symbolic techniques [2, 10, 27]. Fence synthesis algorithms range from graph-theoretic [26], automata-theoretic [35, 36] and linear-programming-based [4, 8] techniques to constraint-solving-based techniques [2]. Shasha and Snir [52] pioneered fence synthesis using the notions of delay and critical cycle. Delay refers to a pair of instructions that may be reordered (in a weak memory model) while a critical cycle represents a minimal violation of SC. This work has inspired multiple proposals [4, 5, 8, 9]. Huynh et al. [26] build a model-checker for the C# memory model that issues the bytecode stream in all possible sequences allowed by the memory model; it also suggests a minimal set of barriers to disallow program executions that are not sequentially consistent with respect to a user-provided specification. They essentially build a reachability graph annotated with edges that are possible through SC and its weak semantics, and then use the maxflow-mincut algorithm to detect a minimal set of edges that rule out states that are unreachable through sequential consistency. Demsky et al. [18] capture concrete executions of a program, and use a SAT solver to find new behaviors under the TSO and SC memory models. Kuperstein et al. [31] build a symbolic representation of the program and run a fixpoint iteration to learn an avoids formula that prevents the program from reaching undesirable states in the weak memory environment. Counterexample-guided fence reordering algorithms [2, 27, 35] drive counterexample-based refinement techniques to insert fences, eliminating culprit reorderings till the desired solutions are found. Our approach is closest to these approaches: we build a symbolic representation and attempt to detect such culprit reorderings to help debugging.
Norris et al. [44] propose an algorithm for discovering new behaviors of a program under the C/C++ memory model by enumerating behaviors using stateless model checking; the implementation, CDSChecker, reports information on deadlocks, data races, uninitialized atomic loads and user-provided assertion failures. Ou and Demsky [46] propose an intuitive correctness model that allows the user to express the specifications of a concurrent data structure on relaxed memory models using non-deterministic behaviors on a sequential data structure; their tool, CDSSpec, is capable of checking concurrent data structures against such specifications. Kusano et al. [32] propose an abstract interpretation-based static analysis of concurrent programs under relaxed memory models (TSO, PSO, RMO); their thread interference feasibility model is formulated as a constraint problem, expressed as a set of Datalog rules. Kim et al. [28] extend the model-checker Java PathFinder (JPF) [23] into a tool named Java RaceFinder (JRF). JRF performs a happens-before analysis to obtain ordering relations in the relaxed Java memory model [40], relying on JPF to uncover interleavings. JRF uses data race specific search heuristics (write-first, watch-written, avoid release/acquire, acquire-first) to prioritize state-space exploration. RELAXER [13] uses dynamic predictive analysis to detect data races in relaxed memory models: in the first phase, it analyses a sequentially consistent execution trace to predict potential data races; in the second phase, armed with operational models of relaxed memory models (that delay certain operations to simulate the behavior of a relaxed memory model), it attempts to recreate the scenarios predicted in the first phase.
Sezgin and Tasiran [51] generalize Lipton's theory of reductions [37] (that is limited to SC) to check programs operating under the TSO relaxed memory model. Lahav et al. [34] propose a new memory model, RC11 (repaired C11), with better SC atomic access semantics, providing stronger guarantees for SC fences. Kokologiannakis et al. [29] build a stateless model checker for RC11 that works directly on the execution graphs. Yin et al. [59] extend scheduling constraint based abstraction refinement (SCAR) [58] to verify concurrent programs under weak memory models (like TSO and PSO). Nidhugg [1], an efficient stateless model checker to find bugs in C/C++ programs under relaxed memory models, uses chronological traces that capture dependencies of the shared variables on relaxed memory executions using partial order relations, allowing for an optimal dynamic partial order reduction algorithm. Huang [24] proposes maximal causality reduction (MCR) for stateless model checking that guarantees a minimal number of executions to explore the state-space of concurrent programs. Huang and Huang [25] extend MCR to verify concurrent programs under the TSO and PSO memory models. MultiRace [49] detects data races in the relaxed C++ memory model at both the variable and the object granularity level in the program, combining lockset and happens-before analysis. TRACER [3] implements a stateless model checker for the Release-Acquire (RA) fragment of the C/C++11 memory model. It uses read-from relations to explore the space of possible orderings of instructions in a thread. Kokologiannakis et al. [30] design a sound, complete and optimal model-checker, parametric in the choice of memory model, for verifying clients of concurrent libraries.
CHESS [42] and PCT [11] are directed at finding bugs in concurrent programs. RR [45], a record-replay tool, provides a chaos mode (that fuzzes interleavings) as a bug-finding option. Such tools are orthogonal to our work, as Gambit is a debugging tool that allows debugging across memory models rather than a bug-finding tool. One can use tools like RR to discover bugs, which can then be debugged or better understood using Gambit.
Analysis and synthesis of synchronization, like locks and atomic sections, has also received considerable attention. Deshmukh et al. [19] attempt to extract fine-grained locking from the sequential proofs of correctness; any violation of the proofs in the concurrent setting indicates malevolent interference and, thus, becomes a candidate for synchronization. Their work closely resembles the Owicki & Gries [48] methodology for verifying concurrent programs. Vechev et al. [54] use a counterexample-guided abstraction refinement technique to synthesize atomic blocks; starting with a coarse abstraction, they either refine the abstract model (by using a better abstraction) or correct the program (by inserting an atomic section) till the model is verified. Vaziri et al. [53] present an interprocedural static analysis to determine the places in the code where synchronization must be performed in order to ensure that there are no data races for a given atomic set of fields. Race detection has also seen some dynamic techniques: Eraser [50] performs binary rewriting to monitor every shared memory reference, verifying consistent locking behavior. Atomizer [20] uses Lipton's theory of reductions [37] to infer atomic sections. As with fences, in contrast to all the above work, we again target fast response rather than toiling towards an optimal solution.
Distinguishing the behavior of memory models has also been an interesting research direction, where the algorithms [6, 7, 33, 38, 39, 43, 47, 56] are designed to synthesize a litmus test ⟨P, σ⟩, such that the program P can reach the state σ in one memory model but not in another. One possible solution to this problem is to synthesize program executions that behave differently on different models, and then recover a program from them [56]. This differs from our agenda, where we attempt to help the user understand why the current execution trace on the provided input behaves differently on different memory models.
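To make the notion of a litmus test ⟨P, σ⟩ concrete, the following Python sketch enumerates the outcomes of the classic store-buffering program under SC and under a deliberately simplified relaxed model. The relaxed model here is an assumption made for illustration: it only allows a store to be reordered after a later load to a different address within the same thread, which is enough to expose the extra state for this program but is not a complete TSO model. All names (`T0`, `T1`, `outcomes`, and so on) are hypothetical and do not come from any of the cited tools.

```python
# Store-buffering litmus test (x and y are initially 0):
#   Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
# Each operation is (thread id, kind, shared variable).
T0 = [(0, "st", "x"), (0, "ld", "y")]
T1 = [(1, "st", "y"), (1, "ld", "x")]

def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves their internal order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one global schedule and return the final state (r0, r1)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for tid, kind, var in schedule:
        if kind == "st":
            mem[var] = 1
        else:
            regs[tid] = mem[var]
    return (regs[0], regs[1])

def thread_orders(ops, relaxed):
    """Program order, plus the store/load swap when the relaxation is enabled."""
    orders = [ops]
    if relaxed:
        # Assumed relaxation: the store may take effect after the later load.
        orders.append(ops[::-1])
    return orders

def outcomes(relaxed):
    """Collect every reachable (r0, r1) over all thread orders and interleavings."""
    result = set()
    for o0 in thread_orders(T0, relaxed):
        for o1 in thread_orders(T1, relaxed):
            for schedule in interleavings(o0, o1):
                result.add(run(schedule))
    return result

sc_states = outcomes(relaxed=False)
relaxed_states = outcomes(relaxed=True)
distinguishing = relaxed_states - sc_states  # the state σ of the litmus test
```

In this sketch, `distinguishing` holds the single state in which both loads read 0, which is reachable only when the relaxation is enabled. A debugging tool in the spirit of Gambit starts from the opposite end: given a concrete trace that already exhibits such a state, it asks which reordering made it possible.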
In contrast to all the above proposals, our approach does not attempt to be optimal, but fast, so that it is usable in an interactive debugging environment. Our algorithms for detecting culprit reorderings and culprit interference are based on a proof-guided underapproximation-widening technique, first introduced for the verification of concurrent programs [22]. Grumberg et al. [22], in contrast to the abstraction-refinement techniques [14, 15], start with an underapproximation by adding underapproximation constraints that inhibit behaviors (thread interleavings). A counterexample indicates that a bug is found. If, instead, a proof is found that does not rely on the underapproximation constraints, the program is verified. Otherwise, the underapproximation constraints used by the proof are relaxed to allow more behaviors (interleavings), and the process repeats. Thus, the algorithm is capable of finding proofs from underapproximate models without needing to create abstractions. Such underapproximation-widening based techniques have also been used in the repair of heap-manipulating programs [55]. However, to the best of our knowledge, ours is the first attempt at applying underapproximation-widening to fence and atomicity suggestions.
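The underapproximation-widening loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Gambit's implementation or the algorithm of [22] as published: a brute-force search over Boolean variables stands in for the SMT solver, a deletion-based procedure stands in for the solver's unsat-core extraction, and all names (`find_model`, `unsat_core`, `widen`, `r1`, `r2`) are hypothetical.

```python
from itertools import product

def find_model(variables, constraints):
    """Brute-force stand-in for a solver: return a satisfying assignment or None."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(c(env) for c in constraints):
            return env
    return None

def unsat_core(variables, hard, soft):
    """Deletion-based core extraction.

    Assumes hard + soft is unsatisfiable. Returns the names of a subset of
    the soft constraints that, together with hard, is still unsatisfiable.
    """
    core = dict(soft)
    for name in list(core):
        trial = {k: v for k, v in core.items() if k != name}
        if find_model(variables, hard + list(trial.values())) is None:
            core = trial  # this constraint is not needed for the proof
    return set(core)

def widen(variables, bug, under):
    """Underapproximation-widening loop.

    bug   : predicate that holds on assignments reaching the bad state
    under : dict mapping a name to a constraint that inhibits one behavior
    """
    active = dict(under)
    while True:
        model = find_model(variables, [bug] + list(active.values()))
        if model is not None:
            return "bug", model        # counterexample found
        core = unsat_core(variables, [bug], active)
        if not core:
            return "verified", None    # proof does not rely on the underapproximation
        for name in core:
            del active[name]           # widen: allow the inhibited behaviors

# Toy instance: the bad state needs both reorderings r1 and r2 to be enabled,
# and the initial underapproximation forbids both.
variables = ["r1", "r2"]
under = {
    "no_r1": lambda e: not e["r1"],
    "no_r2": lambda e: not e["r2"],
}
status, model = widen(variables, lambda e: e["r1"] and e["r2"], under)
safe_status, _ = widen(variables, lambda e: e["r1"] and not e["r1"], under)
```

On the toy instance the loop widens twice, dropping one inhibiting constraint per round, before it reports the assignment with both reorderings enabled; the second call shows the case where the proof needs no underapproximation constraint at all. The loop terminates because every round either returns or removes at least one constraint from `active`.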
Discussion
The primary contribution of our work is that it brings formal methods support to debugging tools (gdb), helping users understand the behavior of their programs better. We believe that ours is the first work that successfully marries concrete debugging with formal techniques (symbolic exploration) to answer interesting queries on alternate memory models and thread interleavings on-the-go. Most earlier analysis techniques had different objectives, like hunting for bugs or proving program correctness. To be effective in an interactive environment, our algorithms were required to be fast. Owing to these differing objectives, we use fast proof-based artifacts (unsat-cores), whereas the prior proposals were based on model-based techniques (like model enumeration, MAXSAT or maximum hitting set computation [27]) that tend to be slow but can provide optimal answers (see Section 6). To summarize, both the end (symbolic exploration enabled interactive debugging of concurrent programs on weak memory models) and the means (proof-directed algorithms) are novel contributions of our work.
Finally, programmers still fear regression faults from automated repairs; a developer confidently applies an automated repair patch only when they understand the bug well. Hence, enhancing debugging support for programs will remain an important research direction.
