Abstract. Linearizability is the standard correctness criterion for fine-grained, non-atomic concurrent algorithms, and a variety of methods for verifying linearizability have been developed. However, most approaches assume a sequentially consistent memory model, which is not always realised in practice. In this paper we define linearizability on a weak memory model: the TSO (Total Store Order) memory model, which is implemented in the x86 multicore architecture. We also show how a simulation-based proof method can be adapted to verify linearizability for algorithms running on TSO architectures. We demonstrate our approach on a typical concurrent algorithm, spinlock, and prove it linearizable using our simulation-based approach. Previous approaches to proving linearizability on TSO architectures have required a modification to the algorithm's natural abstract specification. Our proof method is the first, to our knowledge, for proving correctness without the need for such modification.
Introduction
The correctness of concurrent algorithms has received considerable attention over the last few years. For algorithms that have fine-grained concurrent implementations, correctness has focussed on a condition called linearizability [12]. This requires that the fine-grained operations (e.g., insertion or removal of an element of a data structure) appear as though they take effect "instantaneously at some point in time within their intervals of execution" [12], thereby achieving the same effect as an atomic operation.
Such fine-grained implementations are becoming increasingly commonplace, and are now standard in libraries such as java.util.concurrent. To increase efficiency, these algorithms dispense with locking, or only lock small parts of a shared data structure. Therefore the shared data structure might be concurrently accessed by different processors executing different operations. This complexity makes the correctness of such algorithms, i.e., their proofs of linearizability, a key issue.
Because linearizability is such an important condition, there has been a large amount of interest in proof methods for verifying whether an algorithm is linearizable. However, the vast majority of this work has assumed a particular memory model; in particular a sequentially consistent (SC) memory model, whereby program instructions are executed by the hardware in the order specified by the program. This is in contrast to multiprocessor architectures such as x86 [15] , Power [1] or ARM [1] that only provide weaker guarantees in order to allow efficient executions.
Processor cores within modern multicore systems often communicate via shared memory and use (local) store buffers to improve performance. Whilst this does give greater scope for optimisation, the order in which instructions are executed by the hardware is no longer the same as that specified by the program. In this paper we focus on one such memory model, the TSO (Total Store Order) model, which is implemented in the x86 multicore processor architecture. Rather surprisingly given Intel's and AMD's use of x86, and in contrast to the many different approaches and techniques for linearizability on sequentially consistent architectures, there have only been three approaches to the question of linearizability on a TSO model: [4], [11] and [17].
The proof approaches in both [4] and [11] , however, require the natural abstract specification of a concurrent implementation to be modified. Burckhardt et al. [4] define so-called TSO-to-TSO linearizability, which as the name implies compares a concurrent implementation with an abstract specification that executes in TSO memory. Their definition of linearizability thus compares two specifications with local buffers. In our approach, we aim to compare the execution of an implementation in TSO against its natural SC abstraction. In [11] Gotsman et al. define a more sophisticated mapping between the TSO model and a sequentially consistent one (this is called TSO-to-SC linearizability), but to verify the linearizability of an example like spinlock (our running example) they weaken the abstract specification to allow non-deterministic behaviour when one would not naturally expect it. The approach closest to ours is [17] which uses the same principles that we discuss, but does not provide a proof method (rather they use SPIN to model check particular runs of the algorithm, much like testing).
The purpose of this paper is to investigate linearizability of concurrent algorithms on a TSO memory model which avoids the compromises of [4, 11] , and also defines a proof method. We begin in Section 2 by introducing the TSO model as well as our running example, the spinlock algorithm. In Section 3 we introduce linearizability and discuss how we adapt the definition to the TSO model. In Section 4 we explain an existing simulation-based method for verifying linearizability and show how we can adapt this to the TSO model. This method is then applied to the spinlock example in Section 5 before we conclude in Section 6.
The TSO memory model
In the TSO (Total Store Order) architecture (see [16] for an introduction), each processor core uses a write buffer (as shown in Figure 1 ), which is a FIFO queue that stores pending writes to memory. A processor core (from this point on referred to as a process) performing a write to a memory location enqueues the write to the buffer and continues computation without waiting for the write to be committed to memory. Pending writes do not become visible to other processes until the buffer is flushed, which commits (some or all) pending writes to memory.
The value of a memory location read by a process is the most recent value written to that location in the process's local buffer. If there is no such value (e.g., initially or when all writes corresponding to the location have been flushed), the value of the location is fetched from memory. The use of local buffers allows a read by one process, occurring after a write by another, to return an older value as if it occurred before the write.
In general, flushes are controlled by the CPU. However, a programmer may explicitly include a fence, or memory barrier, instruction in a program's code to force a flush to occur. Therefore, although TSO allows non-sequentially consistent executions, it is used in many modern architectures on the basis that these can be prevented, where necessary, by programmers using fence instructions.
A pair of lock and unlock commands in TSO allows a process to acquire sole access to the memory. Both commands include a memory barrier which forces the store buffer of that process to be flushed completely (via a sequence of atomic flushes).
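The buffer behaviour described above can be illustrated with a small executable model. This is a minimal sketch rather than a faithful hardware model: the class name TSOMemory and its methods write, read, flush and fence are hypothetical names chosen for illustration, and the mutual exclusion provided by lock and unlock is not modelled.

```python
from collections import deque


class TSOMemory:
    """Toy model of TSO: one shared memory plus a FIFO write buffer per process."""

    def __init__(self, initial):
        self.memory = dict(initial)   # committed values, visible to every process
        self.buffers = {}             # process id -> deque of (location, value)

    def write(self, pid, loc, value):
        # A write is only enqueued; the process carries on without waiting.
        self.buffers.setdefault(pid, deque()).append((loc, value))

    def read(self, pid, loc):
        # Prefer the most recent pending write to loc in the process's own buffer.
        for pending_loc, value in reversed(self.buffers.get(pid, ())):
            if pending_loc == loc:
                return value
        return self.memory[loc]

    def flush(self, pid):
        # Commit the oldest pending write of pid to memory (one atomic flush).
        loc, value = self.buffers[pid].popleft()
        self.memory[loc] = value

    def fence(self, pid):
        # A memory barrier drains the whole buffer via a sequence of flushes.
        while self.buffers.get(pid):
            self.flush(pid)


tso = TSOMemory({"x": 1})
tso.write("p", "x", 0)
own_view = tso.read("p", "x")      # p sees its own buffered write
other_view = tso.read("q", "x")    # q still sees the old value in memory
tso.fence("p")
after_fence = tso.read("q", "x")   # the write is now globally visible
```

In this sketch the stale read by q is exactly the non-sequentially-consistent behaviour that a fence rules out.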
Example - spinlock
Spinlock [3] is a locking mechanism designed to avoid operating system overhead associated with process scheduling and context switching. A typical implementation of spinlock is shown in Fig. 2 , where a global variable x represents the lock and is set to 0 when the lock is held by a process, and 1 otherwise. A process trying to acquire the lock x spins, i.e., waits in a loop and repeatedly checks the lock for availability. It is particularly efficient when processes only spin for short periods of time and is often used in operating system kernels.
The acquire operation only terminates if it successfully acquires the lock. It will lock the global memory so that no other process can write to x. If, however, another process has already acquired the lock (i.e., x==0) then it will unlock the global memory and spin, i.e., loop in the while-loop until the lock becomes free, before starting over. Otherwise, it acquires the lock by setting x to 0. The operation release releases the lock by setting x to 1. We assume that only a process that has acquired the lock will call this operation. The operation tryacquire differs from acquire in that it only makes one attempt to acquire the lock. If this is successful it returns 1, otherwise it returns 0.
The lock and unlock commands act as memory barriers. Hence, writes to x by the acquire and tryacquire operations are not delayed. For efficiency, however, release does not have a memory barrier and so its write to x can be delayed until a flush occurs. This leads to the possibility of a tryacquire operation of a process q returning 0 after the lock has been released by another process p. For example, the following concrete execution is possible, where we write (q, tryacquire(0)) to denote process q performing a tryacquire operation and returning 0, and flush(p) to denote the CPU flushing a value from process p's buffer:
⟨(p, acquire), (p, release), (q, tryacquire(0)), flush(p)⟩ (1)
Thus p performs an acquire, then a release, and then q performs a tryacquire that returns 0 even though it occurs immediately after the release. This is because the flush(p), which sets the value of x in memory to 1, has not yet occurred.
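The delayed release can be reproduced in a few lines of code. The following is a minimal sketch under stated assumptions: the class SpinlockTSO and its helper methods are hypothetical, operations are run one after another (so neither spinning nor the exclusion given by the memory lock is modelled), and lock and unlock are reduced to their barrier effect of draining the caller's buffer.

```python
class SpinlockTSO:
    """Spinlock over a single shared variable x, with per-process FIFO buffers."""

    def __init__(self, processes):
        self.x = 1                                 # 1 = free, 0 = held
        self.buffer = {p: [] for p in processes}   # pending writes to x

    def flush(self, p):
        # CPU-controlled step: commit p's oldest pending write to memory.
        self.x = self.buffer[p].pop(0)

    def _drain(self, p):
        # lock and unlock act as barriers: flush everything p has pending.
        while self.buffer[p]:
            self.flush(p)

    def _read_x(self, p):
        # Latest own pending write if any, else the value in memory.
        return self.buffer[p][-1] if self.buffer[p] else self.x

    def tryacquire(self, p):
        self._drain(p)                   # lock
        if self._read_x(p) == 1:
            self.buffer[p].append(0)     # x = 0
            self._drain(p)               # unlock
            return 1
        self._drain(p)                   # unlock
        return 0

    def acquire(self, p):
        # Sequential sketch: assumes the lock is free, so no spinning is modelled.
        assert self.tryacquire(p) == 1

    def release(self, p):
        self.buffer[p].append(1)         # x = 1, no barrier: the write may linger


lock = SpinlockTSO(["p", "q"])
lock.acquire("p")
lock.release("p")
before_flush = lock.tryacquire("q")   # release's write is still in p's buffer
lock.flush("p")
after_flush = lock.tryacquire("q")    # now the release is visible to q
```

Here before_flush corresponds to the failing tryacquire of the execution above, and after_flush shows that the same call succeeds once the flush has occurred.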
At an abstract level, the operations are captured by the following Z specification, which has parameterised operations Acquire_p, Release_p and TryAcquire_p, the parameter p denoting the identifier of the process performing the operation.
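As an informal companion to the Z specification, the intended abstract behaviour can be sketched in executable form. This sketch is an assumption based on the prose description of the operations, not a transcription of the Z schemas: the class AbstractSpinlock is a hypothetical name, Acquire is modelled as blocked (an exception) when x is 0, and Release is given no precondition.

```python
class AbstractSpinlock:
    """Atomic, sequentially consistent view of the lock: x is 1 when free."""

    def __init__(self):
        self.x = 1

    def acquire(self, p):
        # Acquire_p is only enabled when the lock is free.
        if self.x != 1:
            raise RuntimeError("Acquire is not enabled while x = 0")
        self.x = 0

    def release(self, p):
        # Release_p makes the lock free again, with immediate effect.
        self.x = 1

    def tryacquire(self, p):
        # TryAcquire_p takes the lock and returns 1 if it is free, else returns 0.
        if self.x == 1:
            self.x = 0
            return 1
        return 0


spec = AbstractSpinlock()
spec.acquire("p")
spec.release("p")
out = spec.tryacquire("q")   # at this level a TryAcquire after Release succeeds
```

At this level there are no buffers, so a TryAcquire that follows a Release can never return 0; this is the mismatch with the TSO execution that the following sections resolve.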
The question is now: "Is the behaviour of spinlock under TSO comparable to this abstract specification?". We answer this question in the subsequent sections by proving that spinlock under TSO is actually linearizable with respect to the specification.
Linearizability on TSO
Linearizability [12] is the standard notion of correctness for concurrent algorithms, and allows one to compare a fine-grained implementation against its abstract specification.
The comparison is made at the level of invocations and returns of operations as the fine-grained nature of some operations means that an operation's steps might be interleaved with steps of another operation executed by another process. For example, process p might start a release, but then process q invokes its tryacquire before p's release has returned. The key idea of linearizability is as follows.
Linearizability provides the illusion that each operation applied by concurrent processes takes effect instantaneously at some point between its invocation and its return. This point is known as the linearization point.
In other words, if two operations overlap, then they may take effect in any order from an abstract perspective, but otherwise they must take effect in real-time order.
The original definition in [12] (for a formalisation, see [7] ) is based on the concept of possibilities, however there are now a number of different proof strategies which have been applied to a number of algorithms. These range from using shape analysis [2, 5] and separation logic [5] to rely-guarantee reasoning [18] and refinement-based simulation methods [10, 14, 7] . The simulation-based methods, which we will adapt for use in this paper, show that an abstraction (or simulation or refinement) relation exists between the abstract specification of the data structure and its concurrent implementation.
We will return to the proof method in Section 4. In this section we address the question: Is spinlock linearizable on TSO? The definition of linearizability is architecture-neutral, so we should be able to answer the question on a TSO memory model. However, the presence of local buffers, and operations under control of the CPU (i.e., the flushes) complicate the answer.
Consider the execution of spinlock in (1). Obviously, such an execution has no corresponding behaviour at the abstract level since, looking at the Z specification, the value of x after Release is 1 and thus TryAcquire returns 1. Hence standard approaches to proving linearizability will fail. There are three alternative approaches to tackling the issue of linearizability on TSO: [4], [11] and [17]. Of these, both [4] and [11] involve changes to the natural abstract specification. For example, in [4] the abstract specification is described with local buffers and flushes. Linearizability as they define it (which they call TSO-to-TSO linearizability) then compares two specifications both with local buffers, but this seems to miss the essential nature of the abstract to concrete transformation. On the other hand, in [11] Gotsman et al. weaken the abstract specification to allow tryacquire to nondeterministically either fail or succeed when x is 1.
The nondeterminism in the abstract operation models that introduced to the concrete system by hardware-controlled flushes. Since the abstract specification does not have local buffers in it, the authors call it TSO-to-SC linearizability. Again, changing the abstract specification seems to weaken what one has achieved with the proof. The approach closest to ours is [17] which uses the same principles that we use here, but only considers model checking of the algorithm. Our aim is to formalise this intuition and provide a refinement-based proof method for it.
Although at first sight it would seem that linearizability simply fails without changing the abstract specification, one needs to take into account the role of the local buffers. Since the flush of a process's buffer is sometimes the point at which the effect of an operation's changes to memory becomes globally visible, the flush can be viewed as being the final part of the operation. For example, the flush of a variable, such as x, after an operation, such as release, can be taken as the return of that operation. Under this interpretation, the release operation extends from its invocation to the flush which writes its change to x to the global memory.
In the standard definition of linearizability, histories are sequences of events which can be invocations or returns of operations. Invocations and returns of operations from a set I are performed by a particular process from a set P. Invocations have an associated input from domain In, and returns an output from domain Out (we assume both domains contain an element ⊥ denoting no input or output, respectively). TSO events can also be flushes, which are performed by the CPU and operate on a particular process's buffer:
Event ::= inv⟨⟨P × I × In⟩⟩ | ret⟨⟨P × I × Out⟩⟩ | flush⟨⟨P⟩⟩
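The history corresponding to the execution (1) is:
⟨inv(p, acquire, ), ret(p, acquire, ), inv(p, release, ), ret(p, release, ), inv(q, tryacquire, ), ret(q, tryacquire, 0), flush(p)⟩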
To prove linearizability on a TSO architecture we transform this history to one where the flush on p is the return of the release, since in TSO it is the flush that makes the effect visible. The original return of the release in the history above is removed. That is, the above history is transformed to:
⟨inv(p, acquire, ), ret(p, acquire, ), inv(p, release, ), inv(q, tryacquire, ), ret(q, tryacquire, 0), ret(p, release, )⟩
In general, we need to transform a history h consisting of invokes, returns and flushes to a history h′ which replaces flushes by the appropriate returns whilst removing returns that are no longer required. (The transformation can be formalised but its definition is elided here due to space constraints.) This new history consists just of invokes and returns, the latter indicating when the effect of an operation is made visible globally.
The standard definition of linearizability can then be applied to the history h′. For example, since release and tryacquire now overlap in the transformed history above, a potential linearization in terms of the abstract specification is
inv(p, Acquire, ), ret(p, Acquire, ), inv(q, TryAcquire, ), ret(q, TryAcquire, 0), inv(p, Release, ), ret(p, Release, )
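To make the transformation concrete, the following sketch implements one possible version of it for spinlock histories. It is not the elided formal definition. It assumes that only release can return with a write still buffered (acquire and tryacquire drain the buffer before returning), so each flush(p) is matched with the oldest release of p that is still awaiting a flush; any other flush is treated as internal and dropped. The event encoding as tuples and the function name transform are hypothetical.

```python
from collections import deque


def transform(history):
    """Replace each flush by the return of the release whose write it commits."""
    waiting = {}      # process -> positions of release returns not yet flushed
    moved = {}        # position of a flush -> the return event it stands for
    dropped = set()   # positions of returns superseded by a later flush
    for j, event in enumerate(history):
        kind, proc = event[0], event[1]
        if kind == "ret" and event[2] == "release":
            waiting.setdefault(proc, deque()).append(j)
        elif kind == "flush" and waiting.get(proc):
            i = waiting[proc].popleft()
            moved[j] = history[i]
            dropped.add(i)
    result = []
    for j, event in enumerate(history):
        if j in dropped:
            continue                          # the original return is removed
        if event[0] == "flush":
            if j in moved:
                result.append(moved[j])       # the flush becomes the return
        else:
            result.append(event)
    return result


# History of execution (1); None stands for "no input or output".
h = [
    ("inv", "p", "acquire", None), ("ret", "p", "acquire", None),
    ("inv", "p", "release", None), ("ret", "p", "release", None),
    ("inv", "q", "tryacquire", None), ("ret", "q", "tryacquire", 0),
    ("flush", "p"),
]
h_prime = transform(h)
```

In the resulting h_prime the return of release comes last, so release overlaps the tryacquire of q and the two may be linearized in either order.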
Under our interpretation of the return of release, spinlock is linearizable with respect to the abstract specification. The rest of the paper describes the framework necessary to prove this. To do so we adapt the simulation-based approach for verifying linearizability on a sequentially consistent architecture as defined by Derrick et al. in [7]. This requires careful modelling of the local TSO buffers, but the essential point that allows the simulation rules to work is the following principle:
The return point of an operation on a TSO architecture is not necessarily the point where the operation ceases execution, but can be any point up to the last flush of the variables written by that operation.
A formal definition of linearizability is given in [7] . For an abstract history hs to be a linearization of a concrete history h, this definition relies on the existence of a bijective function f : 1..#h → 1..#hs such that ∀ n : 1..#h • h(n) = hs(f (n)), which describes the allowed reorderings between concrete and abstract histories. The key part of adapting this definition to TSO is what we mean by a matching pair of invocations and returns. This is denoted mp(m, n, h) when h(m) is an operation invocation and h(n) its matching return in the history h.
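With this in place, f must satisfy the following:
∀ m, n : 1..#h • mp(m, n, h) ⇒ f(n) = f(m) + 1
∀ m, n, m′, n′ : 1..#h • n < m′ ∧ mp(m, n, h) ∧ mp(m′, n′, h) ⇒ f(n) < f(m′)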
That is, operations in hs do not overlap (each invocation is followed immediately by its matching return) and the order of non-overlapping operations in h is preserved in hs.
The formal definition of the function mp in [7] requires that h(m) and h(n) are executed by the same process p and are an invocation and return event, respectively, of the same operation. Additionally, it requires that for all k between m and n, h(k) is not an invocation or return event of p. That is, given inv?(e) and ret?(e) denote that the event e is an invocation and return event, respectively, e.p denotes the process executing e, and e.i the operation being executed, mp(m, n, h) holds iff
0 < m < n ≤ #h ∧ inv?(h(m)) ∧ ret?(h(n)) ∧ h(m).p = h(n).p ∧ h(m).i = h(n).i ∧ (∀ k • m < k < n ∧ h(k).p = h(m).p ⇒ ¬ inv?(h(k)) ∧ ¬ ret?(h(k)))
The latter condition is not true for our transformed histories on TSO since operations by the same process may overlap. Therefore, to adopt the definition of linearizability of [7] , we need to redefine mp. The way we do this is to use the existing definition of mp on the histories before transformation to identify, and then uniquely label, the events of each matching pair. After the transformation the labels are used to identify matching pairs. We don't give the details here, but they are straightforward to describe and define.
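The conditions on f, together with label-based matching pairs, can be checked mechanically on small histories. The sketch below is an illustration under assumptions: events are tuples whose last component is a hypothetical unique label per operation instance, histories are complete (every invocation has a return), positions are 0-based, and f is given as a list. The two properties checked are the ones stated in the prose: matching pairs are adjacent in hs, and the order of non-overlapping operations is preserved.

```python
def matching_pairs(h):
    """Pairs (m, n) of positions whose events carry the same operation label."""
    inv_at = {}
    pairs = []
    for pos, event in enumerate(h):
        kind, label = event[0], event[4]
        if kind == "inv":
            inv_at[label] = pos
        else:
            pairs.append((inv_at[label], pos))
    return pairs


def is_linearization(h, hs, f):
    """Check that f maps the history h onto the sequential history hs."""
    if len(h) != len(hs) or sorted(f) != list(range(len(hs))):
        return False                         # f must be a bijection
    if any(h[n] != hs[f[n]] for n in range(len(h))):
        return False                         # f only reorders events
    pairs = matching_pairs(h)
    if any(f[n] != f[m] + 1 for m, n in pairs):
        return False                         # operations in hs do not overlap
    return all(f[n] < f[m2]                  # real-time order is preserved
               for m, n in pairs for m2, _ in pairs if n < m2)


# Events are (kind, process, operation, value, label); None stands for no value.
ia, ra = ("inv", "p", "acquire", None, "A"), ("ret", "p", "acquire", None, "A")
ir, rr = ("inv", "p", "release", None, "R"), ("ret", "p", "release", None, "R")
it, rt = ("inv", "q", "tryacquire", None, "T"), ("ret", "q", "tryacquire", 0, "T")

hs = [ia, ra, it, rt, ir, rr]            # TryAcquire linearized before Release
transformed = [ia, ra, ir, it, rt, rr]   # release returns at the flush
untransformed = [ia, ra, ir, rr, it, rt] # release returns before tryacquire starts

ok = is_linearization(transformed, hs, [0, 1, 4, 2, 3, 5])
not_ok = is_linearization(untransformed, hs, [0, 1, 4, 5, 2, 3])
```

The second check fails because, without the transformation, release returns before tryacquire is invoked, so their order cannot be swapped.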
A proof method for linearizability on TSO
We do not work directly with the definition of linearizability, but rather use a refinement-based proof method for verifying linearizability as defined in [6, 7, 8, 14]. This approach defines simulation rules that form a sound (and complete) proof method for verifying linearizability. Different classes of algorithm use slightly different rules, where the difference depends on how easy it is to identify the linearization points: in some algorithms these can't be identified directly, and depend on the behaviour of other processes [8]. However, for the example in this paper we can use the simplest set of rules found in [7] and described below.
General approach. The approach is based on proving that a concrete specification, which has one operation for each line of code, is a non-atomic refinement [9] of the abstract specification capturing the code's intent. Each allowable sequence of concrete steps must simulate a sequence of abstract operations despite the interleaving of concrete steps performed by different processes. Let P be the set of processes. Let our abstract and concrete specifications be given as A = (AState, AInit, (AOP_{p,i})_{p∈P,i∈I}) and C = (CState, CInit, (COP_{p,j})_{p∈P,j∈J}) where the sets I and J are used to index the abstract operations and concrete steps, respectively. The function abs : J → I maps each concrete step to the abstract operation it (together with other steps) implements. We assume the concrete state space CState is composed of a global state GS (the shared memory) and the local state LS of one process (the program counter, local variables and, on TSO, the local buffer). Following [7], linearizability is then shown by:
1. Defining a status function that identifies the linearization points of operations.
Let STATUS ::= IDLE | IN⟨⟨In⟩⟩ | OUT⟨⟨Out⟩⟩ where In and Out are the domains of inputs and outputs, respectively, as defined in Section 3. We define a function status : GS × LS → STATUS such that the following hold. If a process has no pending operation then the status of the process is IDLE. If it is executing an operation and has not passed the linearization point, then the status of the process is IN(in) where in is the input of the operation, if any, and ⊥ otherwise. If it is executing an operation and has passed the linearization point, the status is OUT(out) where out is the output of the operation if any, and ⊥ otherwise.
2. Showing individual concrete runs of a process correctly implement the abstract operations using non-atomic refinement.
We find a forward simulation R relating the global state and the local state of a process to the abstract state, i.e., R ⊆ AState × (GS × LS), and a set of simulation rules which additionally update the status function appropriately as shown in the example in Fig. 3. In this example, the input in of an invocation step INVOP(in) is used to establish a status of IN(in). After the invocation an internal operation implements skip and leaves the status unchanged. Then the linearization point that implements AOp(in, out) is passed and the status changes to OUT(out). Finally the status is used to compute the output of a return step RETOP(out) and the status returns to IDLE. There are 5 different simulation rules depending on whether the particular concrete step being considered is an invocation step, a return step, or an internal step. In the rule for an invocation step, the first and second disjuncts in the consequent capture invocations that do and do not correspond to a linearization, respectively.
3. Constraining the interaction between processes.
A symmetric predicate D ⊆ LS × LS is used to constrain the relationship between the local states of any two processes. This predicate must also be preserved by the steps of all processes.
4. Showing the concrete initialisation satisfies the abstract initialisation, where GSInit and LSInit are the initial states of the global and local state spaces, respectively.
The status function captures the status of a single pending operation. Under TSO, however, we may have several pending operations: that operation currently being executed by the process, if any, and those that have completed apart from the flushing of their writes to memory. In our extension to the above approach, we let status capture the status of the operation which the process is currently executing. If there is no such operation, the status is IDLE. All other pending operations, i.e., those completed apart from flushes, will necessarily be before their linearization points.
The other role of the status function is to carry the inputs of the operation until they are needed at the linearization point where they, along with the operation's outputs, must match those of the associated abstract operation. To adapt the approach to TSO, we need to be able to keep track of the inputs of completed, but pending, operations. We also need to keep track of the abstract operation associated with each completed, but pending, operation and the completed operation's outputs (since the operation has already completed, its outputs will have already occurred).
To do this we add four auxiliary variables to the local concrete state space LS. The first of these lin : seq((I ∪ {null}) × In × Out) records, for each buffer entry, the abstract operation for which its flush is a linearization point (null indicates its flush is not a linearization point), the abstract operation's input, and the abstract operation's output. When a flush occurs the values corresponding to the flushed entry are read into the other three auxiliary variables op : I ∪ {null}, in : In and out : Out.
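The role of the four auxiliary variables can be seen in a small executable sketch. The names LocalState, buffered_write and flush are hypothetical, the variable in is renamed inp because in is a Python keyword, and None plays the role of both null and ⊥. The order of writes in the usage example is chosen only to show both kinds of buffer entry.

```python
from dataclasses import dataclass, field
from typing import Optional

ACQUIRE, RELEASE, TRYACQUIRE = 1, 2, 3   # indices of the abstract operations


@dataclass
class LocalState:
    buffer: list = field(default_factory=list)   # pending values for x
    lin: list = field(default_factory=list)      # one (op, in, out) per buffer entry
    op: Optional[int] = None                     # set by the most recent flush
    inp: Optional[int] = None
    out: Optional[int] = None


def buffered_write(ls, value, op=None, inp=None, out=None):
    # Adding a buffer entry also records whether its flush will be a
    # linearization point and, if so, of which abstract operation.
    ls.buffer.append(value)
    ls.lin.append((op, inp, out))


def flush(ls, memory):
    # Commit the oldest entry and read its lin record into op, inp and out.
    memory["x"] = ls.buffer.pop(0)
    ls.op, ls.inp, ls.out = ls.lin.pop(0)
    return ls.op is not None   # True when this flush is a linearization point


memory = {"x": 1}
ls = LocalState()
buffered_write(ls, 0)               # e.g. the write x=0: its flush does not linearize
buffered_write(ls, 1, op=RELEASE)   # the write of release: its flush linearizes Release
first = flush(ls, memory)
x_after_first = memory["x"]
second = flush(ls, memory)
```

Keeping lin in step with buffer is what lets a flush, which is a CPU step with no inputs of its own, be matched against the right abstract operation.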
We also need to introduce two new simulation rules. The first corresponds to a process with status IN returning to IDLE without linearizing. This would be the case where the operation is to be linearized by a flush which is yet to occur.
Return without Lin. ∀ as : AState; gs, gs′ : GS; ls, ls′ : LS; in : In • R(as, gs, ls) ∧ status(gs, ls) = IN(in) ∧ RETOP_j(gs, ls, gs′, ls′, out) ⇒ status(gs′, ls′) = IDLE ∧ R(as, gs′, ls′)
The second corresponds to the occurrence of a flush. A flush acts as either an internal step or a linearizing step as shown in cases (a) and (b) of Fig. 4, respectively. Case (a) can occur when the process has any status and its status is not changed. When the status is IN or OUT the internal step may be of the pending operation, and for any status it may be of an operation which has previously completed.
Case (b) also occurs from any status and when the status is IDLE or OUT it remains unchanged. Such a flush in these statuses corresponds to the linearization of an operation which has already completed. When the status is IN the flush may also be linearizing an operation which has already completed, in which case the status is unchanged, or it may be linearizing the pending operation, in which case the status becomes OUT.
The rule refers to the post-states of the auxiliary variables op, in and out via ls′.op, ls′.in and ls′.out, respectively.
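The status changes permitted by the original rules and the two new ones can be summarised as a small transition function. This is only a sketch of the status bookkeeping, not of the simulation rules themselves: the function next_status, the step names and the flag linearizes_current (true when a flush linearizes the operation the process is currently executing) are hypothetical, and the relation R is not modelled.

```python
def next_status(status, step, value=None, linearizes_current=False):
    """Status of a process after one concrete step (None stands for ⊥)."""
    tag = status[0]
    if step == "invoke" and tag == "IDLE":
        return ("IN", value)
    if step == "internal" and tag in ("IN", "OUT"):
        return status
    if step == "linearize" and tag == "IN":
        return ("OUT", value)
    if step == "return" and tag in ("IN", "OUT"):
        # From OUT this is the usual return; from IN it is "Return without Lin".
        return ("IDLE",)
    if step == "flush":
        # Only a flush that linearizes the operation in progress moves IN to OUT;
        # every other flush leaves the status unchanged.
        if tag == "IN" and linearizes_current:
            return ("OUT", value)
        return status
    raise ValueError(f"step {step!r} is not allowed in status {tag}")


# release: returns before its linearizing flush has occurred.
s = ("IDLE",)
release_trace = []
for step in ["invoke", "internal", "return", "flush"]:
    s = next_status(s, step)
    release_trace.append(s)

# tryacquire: linearizes inside the operation with output 1.
t = next_status(("IDLE",), "invoke")
t = next_status(t, "linearize", value=1)
t_before_return = t
t = next_status(t, "return")
```

In the release trace the flush arrives when the status is already IDLE, which is case (b) for an operation that has already completed.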
Spinlock is linearizable on TSO
To show that spinlock is linearizable using the approach described in Section 4, we produce a concrete specification of the algorithm. Given P is the set of all process identifiers, the global state of the concrete specification includes the value of the shared variable x which is initially 1, and a variable lock denoting which process, if any, currently has the global memory locked.
The local state of a given process is specified in terms of its process identifier from P, a program counter indicating which operation (i.e., line of code) can next be performed, and the process's buffer.
PC ::= 1 | 2 | a1 | ... | a8 | ta1 | ... | ta7 | r1
where the value 1 denotes the process is idle when it has not acquired the spinlock, the value 2 denotes the process is idle when it has acquired the spinlock, the values ai, for i ∈ 1..8, denote the process is ready to perform the ith line of code of acquire, the values tai, for i ∈ 1..7, denote the process is ready to perform the ith line of code of tryacquire, and the value r1 denotes the process is ready to perform the first line of release.
LS_0 ≙ [id : P; pc : PC; buffer : seq{0, 1}]
As detailed in Section 4, we add auxiliary variables to our local state to keep track of information required at linearization points corresponding to a flush. Let I == {1, 2, 3} be the indices of the abstract operations such that 1 denotes Acquire, 2 denotes Release and 3 denotes TryAcquire. Let In == {⊥} be the set of input values of operations, and Out == {0, 1, ⊥} be the set of output values. Given this specification, the lines of code are formalised as Z operations. For example, for the acquire operation we have an operation A0 corresponding to the invocation of the operation, an operation A1 corresponding to the line of code while (1), and an operation A2 corresponding to the line of code lock.
To model the fact that A2 also results in all entries of the process's buffer being flushed, the operation A3 corresponding to the following line of code, the test x==1, is not enabled unless buffer = ⟨ ⟩. It will become enabled after the required number of Flush operations have occurred. These remove an entry from the buffer and update the auxiliary variables op, in and out according to the information in lin. This information is added to lin when the buffer entries are added. For example, the operation A4, corresponding to the line x=0, updates lin to indicate that the flush of this value will not be a linearization point.
The other concrete operations are modelled similarly. The operations corresponding to the unlock statements are only enabled when buffer = ⟨ ⟩, modelling that the buffer must be completely flushed before the memory is unlocked.
Such a concrete specification is well-formed only if any sequence of operations corresponding to an abstract operation has exactly one linearization point. It is important, therefore, when modelling operations which change lin (i.e., those that write to the buffer) to ensure this. If a buffer entry is marked as a linearization point, the sequence of operations in which it occurs should not be linearized by a change in status from IN to OUT, nor by any other buffer entry. Checking this notion of well-formedness as part of our proof method is an area of future work.
Linearizability can now be proved by defining the function status so that the linearization points of Acquire and TryAcquire are the operations which release the memory lock, and the linearization point of Release is the flush that commits the associated value of 1 to the global variable x.
The required relations ABS, INV and D are as follows.
• (ls.pc ∈ {1, a1, a2, a3, ta1, ta2} ∧ ls.buffer ≠ ⟨ ⟩ ⇒ gs.x = 0) ∧
(ls.pc ∈ {a4, ta3} ⇒ gs.x = 1) ∧
(ls.pc ∈ {a5, ta4} ∧ ls.buffer = ⟨ ⟩ ⇒ gs.x = 0) ∧
(ls.pc ∈ {a5, ta4} ∧ ls.buffer ≠ ⟨ ⟩ ⇒ gs.x = 1 ∧ ls.buffer = ⟨0⟩ ∧ (head ls.lin).1 = null) ∧
(ls.pc ∈ {2, a6, ta3, ta6, r1} ⇒ gs.x = 0) ∧
(ls.pc ∈ {a6, ta6} ∧ ls.buffer ≠ ⟨ ⟩ ⇒ ls.buffer = ⟨1⟩ ∧ (head ls.lin).1 = 2) ∧
(ls.pc ∈ {2, a4, a6, a7, a8, ta3, ta5, ta6, ta7, r1} ⇒ ls.buffer = ⟨ ⟩) ∧
(ls.pc ∈ {a3, a4, ta2, ta3} ⇒ gs.lock = {ls.id}) ∧
(gs.x = 1 ∧ ls.buffer ≠ ⟨ ⟩ ⇒ gs.lock = {ls.id})
D : LS ↔ LS
∀ ls, lsq : LS | D(ls, lsq) •
(ls.buffer ≠ ⟨ ⟩ ⇒ lsq.buffer = ⟨ ⟩) ∧
(lsq.buffer ≠ ⟨ ⟩ ⇒ ls.buffer = ⟨ ⟩) ∧
(ls.pc ∈ {2, a4, a6, ta3, ta5, r1} ⇒ ls.buffer = lsq.buffer = ⟨ ⟩) ∧
(lsq.pc ∈ {2, a4, a6, ta3, ta5, r1} ⇒ ls.buffer = lsq.buffer = ⟨ ⟩) ∧
(ls.pc ∈ {a5, ta4} ⇒ lsq.buffer = ⟨ ⟩) ∧
(lsq.pc ∈ {a5, ta4} ⇒ ls.buffer = ⟨ ⟩)
Conclusions
This paper has presented a definition and simulation-based proof method for linearizability on the TSO memory model. The key to our definition is the treatment of flushes of local buffer entries as part of the operation which made the entries. This enables a proof method which, unlike existing methods, can be used to show implementations of algorithms are linearizable with respect to their natural abstract specifications.
This work has applied state-based methods to program verification, and is part of a larger effort on the verification of linearizability on sequentially consistent architectures as well as weaker memory models. This larger effort mechanises the proofs of linearizability by integrating the state-based reasoning into the KIV theorem prover, see [6, 7, 8], and in [14] we prove (and mechanise the proof) that our approach is complete, in that all linearizable algorithms can be verified by such simulation-based methods. Given we are using the same simulation-based approach, mechanisation and integration into KIV of the theory in this paper will be relatively straightforward.
One interesting consequence of our approach is that operations on a single process may overlap and hence be reordered under linearizability. This is in contrast to earlier work on sequentially consistent architectures where linearizability implies the additional correctness criterion of sequential consistency [13], i.e., that operations on a single process occur in the order that they are called. Sequential consistency will only hold on TSO when, in addition to linearizability, memory barriers are included in all operations which (a) do not write to memory, but (b) occur in a process with other operations which do write to memory. This can be checked by inspection of the code.
Investigating alternative definitions of linearizability on TSO which maintain sequential consistency is an area of future work.
