A standard design pattern found in many concurrent data structures, such as hash tables or ordered containers, is an alternation of parallelizable sections that incur no data conflicts and critical sections that must run sequentially and are protected with locks. A lock can be viewed as a queue that arbitrates the order in which the critical sections are executed, and a natural question is whether we can use stochastic analysis to predict the resulting throughput. As preliminary evidence in the affirmative, we describe a simple model that can be used to predict the throughput of coarse-grained lock-based algorithms. We show that our model works well for the CLH lock, and we expect it to work for other popular lock designs such as TTAS and MCS.
MODEL ASSUMPTIONS
Below we list basic assumptions on the abstract machine used for our analytical throughput prediction.
First, we assume that cache coherence is maintained by a variant of the MESI protocol [7]. Each cache line can be in one of four states: Modified (M), Exclusive (E), Shared (S) and Invalid (I). MESI regulates the transitions between the states of a cache line, and the corresponding responses, depending on the request (read or write) issued to the cache line by a process or on a request observed on the memory bus. The important transitions for us are: (1) upon reading, the state of the cache line changes from any state to S, and, if the state was I, a read request is sent to the bus; (2) upon writing, the state of the cache line becomes M, and, if the state was S or I, an invalidation request is sent to the bus.
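The two transitions listed above can be written down as a small state function. This is only a sketch of the simplified rules as stated in the text, not of a full MESI implementation; the names `State`, `on_read` and `on_write` are hypothetical.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def on_read(state):
    """Rule (1): a read moves the line to S; a read from I also issues a bus read."""
    bus_request = "read" if state is State.I else None
    return State.S, bus_request


def on_write(state):
    """Rule (2): a write moves the line to M; from S or I it also invalidates other copies."""
    bus_request = "invalidate" if state in (State.S, State.I) else None
    return State.M, bus_request
```

For instance, `on_write(State.S)` yields the state M together with an invalidation request, which is the transition that later causes the cache misses counted in the cost of an operation.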
We assume that the caches are symmetric: for each MESI state $st$, there exist two constants $R_{st}$ and $W_{st}$ such that any read from a cache line in state $st$ takes $R_{st}$ work and any write to a cache line in state $st$ takes $W_{st}$ work. David et al. [4] showed that for an Intel Xeon machine (similar to the one we use in our experimental validation below), given the relative location of a cache line with respect to the process (whether they are located on the same socket or not), the following hypotheses hold: (1) writes induce the same work, regardless of the state of the cache line; (2) swaps that are not concurrent with other swaps induce the same work as writes. Therefore, we assume that (1) $W = W_M = W_E = W_S = W_I$ and (2) any contention-free swap induces work of size $W$.
CLH LOCK
Multiple lock implementations have been proposed, from simple spinlocks and TTAS to the more advanced MCS [6] and CLH [3]. For our analysis, we choose CLH, as the simplest lock among those considered to be efficient. In Figure 2, we inline the lock and unlock calls of the CLH lock into our abstract coarse-grained operation.
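For readers unfamiliar with CLH, the following is a minimal sketch of a textbook CLH queue lock. It is not the listing of Figure 2, and its line numbers do not correspond to those referenced in the text. Python has no atomic swap primitive, so the swap is emulated here with a mutex (an assumption of this sketch); the names `Node`, `CLHLock` and `run_clh_demo` are hypothetical.

```python
import threading
import time


class Node:
    """Queue node; `locked` is True while its owner holds or waits for the lock."""

    def __init__(self, locked=False):
        self.locked = locked


class CLHLock:
    def __init__(self):
        self._tail = Node(locked=False)      # dummy node: the lock starts free
        self._swap_guard = threading.Lock()  # assumption: emulates an atomic swap
        self._local = threading.local()      # per-thread node and predecessor

    def _swap_tail(self, node):
        with self._swap_guard:
            previous, self._tail = self._tail, node
        return previous

    def acquire(self):
        node = getattr(self._local, "node", None) or Node()
        node.locked = True                   # announce intent before enqueueing
        pred = self._swap_tail(node)         # enqueue behind the current tail
        while pred.locked:                   # spin on the predecessor's flag
            time.sleep(0)                    # yield so the lock holder can run
        self._local.node, self._local.pred = node, pred

    def release(self):
        self._local.node.locked = False      # hand the lock to the successor
        self._local.node = self._local.pred  # recycle the predecessor's node


def run_clh_demo(n_threads=4, iterations=200):
    lock, counter = CLHLock(), [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            value = counter[0]               # critical section: read-modify-write
            counter[0] = value + 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Each process spins only on the flag of its predecessor, which is why the cost analysis below can count a bounded number of cache misses per operation.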
Cost of an operation
Let us zoom into what happens during the execution of the operation.
Note that at the beginning of an operation (unless it is the very first invocation), my_node.locked is loaded into the cache and the corresponding cache line is in state M, because it was set in Line 15 during the previous operation by the same process.
(1) The operation starts with a swap (Line 9) that induces work of size $W$ if it is not concurrent with other swaps, and work of size at most $X$ otherwise. (2) In Line 10, the algorithm loops on the field next.locked. During this loop, one or two cache misses happen.
One cache miss can happen at the first iteration of the loop, if the read of locked returns true. The last process that grabbed the lock already invalidated this cache line in Line 15 during its penultimate operation. MESI reloads the cache line and changes its state from I (or none, if it was not loaded previously) to S. The other cache miss happens in every execution, when the operation reads next.locked and gets false. In this case, the cache line was invalidated in Line 13 during the last operation of the last process that grabbed the lock. MESI reloads the cache line and changes its state from I (or none) to S.
Each of the described cache misses induces work of size $R_I$. Thus, the work induced in Line 10 is of size $R_I$ (if only the second miss happens) or $2 \cdot R_I$ (if both misses happen). (7) In Lines 16-17, the parallel work of size $P$ is performed.
Evaluating throughput
To evaluate the throughput of the resulting program under the uniform scheduler, we take a closer look at how $N$ processes continuously perform the operation from Figure 2.
Process 1 executes: its first swap (taking at most $X$ units); the critical section (blue, Lines 10-13): it acknowledges the ownership of the lock by reading false in Line 10 (takes $R_I$ units), performs the work of size $C$ and releases the lock in Line 13 (takes $W$ units); the parallel section (red, Lines 15-17 and 9): it sets my_node.locked to true (takes $W$), performs the work of size $P$, performs a non-contended swap (takes $W$) and, possibly, reads true in Line 10 (takes $R_I$). (Here, the swap operation performed after the very first completed critical section is counted in the parallel work, as it is executed in the absence of contention.) Every other process $i$ operates in the same way: it swaps as early as possible (taking at most $X$), waits until process $i - 1$ releases the lock, and then performs its critical (blue) and parallel (red) sections.
Depending on the parameters $N$, $C$, $P$, $W$, and $R_I$, two types of executions are possible.
(1) Each process enters the critical section without waiting in the queue.
(2) Each process waits in the queue before entering the critical section.
In case 1 (Figure 3a), at the moment when process 1 finishes its parallel section, process $N$ has already finished its critical section, i.e., $P + 2 \cdot W > (N - 1) \cdot (C + R_I + W)$.
Therefore, in the steady state, at every moment in time no process waits: each one executes either its parallel or its critical section, and the read in Line 10 cannot return true because the lock is already released.
Thus, the throughput, measured as the number of operations completed in a unit of time, equals $\frac{N \cdot \alpha}{(P + 2 \cdot W) + (C + R_I + W)}$. In case 2 (Figure 3b), before proceeding to the next operation, process 1 has to wait until process $N$ completes its critical section from the previous round of operations; process 2 waits for process 1, process 3 waits for process 2, etc. Thus, there is always some process in the critical section, giving a throughput of $\frac{\alpha}{C + R_I + W}$. Therefore, given the number of processes $N$ and the sizes $C$ and $P$ of the critical and parallel sections, the throughput can be calculated as follows:
\[
\text{throughput} =
\begin{cases}
\dfrac{N \cdot \alpha}{(P + 2 \cdot W) + (C + R_I + W)} & \text{if } P + 2 \cdot W > (N - 1) \cdot (C + R_I + W), \\[2ex]
\dfrac{\alpha}{C + R_I + W} & \text{otherwise.}
\end{cases}
\]
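The two cases described above can be combined into one small function. This is a sketch that follows the case condition and the two throughput expressions given in the text; the function name and parameter names are hypothetical.

```python
def predicted_throughput(n, c, p, w, r_i, alpha):
    """Predicted operations per unit of time for n processes.

    c, p  -- sizes of the critical and parallel sections
    w     -- cost of a write (and of a contention-free swap)
    r_i   -- cost of a read from a cache line in state I
    alpha -- machine-dependent scaling constant
    """
    critical = c + r_i + w   # read of `false`, critical work, lock release
    parallel = p + 2 * w     # flag reset, parallel work, non-contended swap
    if parallel > (n - 1) * critical:
        # Case 1: nobody waits, every process completes one operation per cycle.
        return n * alpha / (parallel + critical)
    # Case 2: the critical section is always occupied.
    return alpha / critical
```

With the constants reported in the experimental section ($\alpha \approx 3.5 \cdot 10^5$, $W \approx 40$, $R_I \approx 80$), a call such as `predicted_throughput(39, 100, 100, 40, 80, 3.5e5)` falls into case 2, while `predicted_throughput(5, 100, 15000, 40, 80, 3.5e5)` falls into case 1.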
EXPERIMENTS
For our measurements, we used a server with four 10-core Intel Xeon E7-4870 chips clocked at 2.4 GHz (yielding 40 hardware processes in total), running Ubuntu Linux kernel v3.13.0-66-generic. We compiled the code with MinGW GCC 5.2.0 (with the -O0 flag to avoid compiler optimizations, such as function inlining, that could distort our benchmarking environment). The code is available at https://github.com/Aksenov239/complexity-lock-with-libslock.
We considered the following experimental settings: the number of processes N ∈ {5, 10, 20, 30, 39}; the size of the critical section C ∈ {100, 500, 1000, 5000, 10000}; and the multiplier x ∈ [1, 150] (we took all integer values) that determined the size of the parallel section P = x · C. For each setting, we measured the throughput for 10 seconds. Our experimental evaluation gives $\alpha \approx 3.5 \cdot 10^5$, $W \approx 40$, and $R_I \approx 80$. The ratio between $W$ and $R_I$ correlates with the experimental results provided by David et al. [4].
In Figure 4 we show our experimental results for three settings with N = 39 and C ∈ {100, 500, 5000} (blue curves) compared with our theoretical prediction (red curves). The two curves match very closely, except for the case of small C and P, where our predicted throughput underestimates the real one. We attribute this to the fact that we oversimplified the abstract machine: any write induces work of constant size $W$, regardless of the relative location of the cache line with respect to the process. For small C and P, two processes from the same socket are more likely to take the lock one after the other and, thus, on average, a write might induce less work than $W$ and, consequently, the throughput can be higher than predicted.
COMPARISON WITH PRIOR WORK
In this work, we proposed a very simple, not to say simplistic, analytical framework intended to predict the performance of a class of lock-based algorithms. A more involved analysis was proposed earlier by Atalar et al. [1] for a similar class of lock-free concurrent data structures (Figure 5).
There, the concurrent processes alternate constant-size parallel work with constant-size critical work and synchronize critical operations on the shared data using read and compare&set operations on a dedicated access point.
By adapting the code to our notation, we get: In order to have critical work of size C, we had to use a critical loop of size C − 1, because in each iteration of the loop we increment the variable i, and after the loop we increment the thread-local variable it.
We argue that the two approaches, ours and that of Atalar et al. [1], though seemingly quite similar, bear some important differences. In particular, these differences do not allow us to treat our analytical framework as a special case of that in [1].
Note that we do not consider here the more general analysis in a later paper by Atalar [2], in which the amounts of parallel and critical work are treated as random variables obeying specific distributions. The analysis in [2] is a probabilistic generalization of that in [1]. Therefore, for the sake of comparison, we can focus on the deterministic framework of [1].
Two types of conflicts happen in the described lock-free algorithms:
• logical conflicts: unsuccessful retries of Lines 4-6, i.e., a "fast" process succeeds in updating the access point variable AP, causing "slower" processes to fail in their compare&set operations;
• hardware conflicts: the serialization of concurrent reads at Line 4 and compare&set operations at Line 6 on AP.
At the same time, our lock-based algorithms are subject only to hardware conflicts on the head variable (Figure 2, Line 9).
First, let us look at the two types of algorithms from a high-level point of view. Lock-based algorithms are conservative in the sense that the critical section is performed only when the lock is taken, and the actions of the critical section always "take place". In particular, this kind of algorithm is subject only to hardware conflicts.
In contrast, lock-free algorithms are speculative: a critical section can be performed several times before it succeeds, and only the actions of the successful instance are effective. In analysing these algorithms, we should account for both logical and hardware conflicts.
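The speculative pattern can be illustrated with a retry loop around a compare&set. This is a hedged sketch, not the code analysed in [1]: the class `AccessPoint` and the functions below are hypothetical, and compare&set is emulated with a mutex because Python does not expose a hardware compare-and-swap.

```python
import threading


class AccessPoint:
    """Hypothetical shared cell; compare_and_set is emulated with a mutex."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def read(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def speculative_increment(ap):
    """Retry until compare&set succeeds; return the number of failed attempts."""
    retries = 0
    while True:
        seen = ap.read()
        candidate = seen + 1                     # critical work on a private copy
        if ap.compare_and_set(seen, candidate):  # only this attempt takes effect
            return retries
        retries += 1                             # logical conflict: redo the work


def run_speculative_demo(n_threads=4, iterations=500):
    ap, retries = AccessPoint(), [0] * n_threads

    def worker(index):
        for _ in range(iterations):
            retries[index] += speculative_increment(ap)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ap.read(), sum(retries)
```

Every failed attempt repeats the critical work, which is the logical conflict that a lock-based operation never experiences.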
Under high contention, speculative data structures perform worse than conservative ones due to the overwhelming number of retried critical sections. Intuitively, this suggests that we should use different analyses to reason about the throughput of these two classes of algorithms.
In what follows, we suppose that P and C exceed the cost of the swap operation. This condition greatly simplifies the analysis of lock-based data structures, since we do not have to deal with hardware conflicts.
Under high contention, i.e., when P is comparatively small, $P \ll (N - 1) \cdot C$, we use a special property of lock-based algorithms: with these parameters there is always some process in the critical section, which allows us to easily evaluate the resulting throughput.
In contrast, the analysis of the performance of lock-free algorithms under high contention in [1] is considerably more involved, due to the intrinsic interleaving of hardware and logical conflicts.
When contention is low and conflicts are unlikely, the two analyses for fixed P and C should coincide. Both of them yield a throughput approximately equal to $\frac{\alpha \cdot N}{P + C}$, where $\alpha$ is some constant and $N$ is the number of threads.
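To see where this approximation comes from on the lock-based side, the no-waiting throughput derived in the throughput evaluation can be expanded. The step below is a sketch that assumes $P + C$ dominates the constant cache costs:

```latex
\[
\frac{N \cdot \alpha}{(P + 2 \cdot W) + (C + R_I + W)}
  = \frac{\alpha \cdot N}{P + C + (3 \cdot W + R_I)}
  \approx \frac{\alpha \cdot N}{P + C}
  \qquad \text{when } P + C \gg 3 \cdot W + R_I .
\]
```

With the measured values $W \approx 40$ and $R_I \approx 80$, the neglected term $3 \cdot W + R_I$ is about 200 units.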
To summarize, in the case of the CLH lock, the analysis for lock-based algorithms coincides with the analysis for lock-free programs in settings with low contention, while in other settings our analysis is much simpler due to the special properties of lock-based algorithms.
Furthermore, we consider the MESI cache-coherence protocol [4]. Our analysis is further simplified by assuming that writes take the same time no matter which state a cache line is in: there is evidence that this is indeed the case for our machine. However, the situation might be more complicated on other machines in which, e.g., the cost of a write depends on the cache state, which might result in a more complicated analysis.
If the CLH lock in conservative programs is replaced with a lock of another type, e.g., test&test&set, ticket, spin lock, MCS, etc., the analysis becomes somewhat more complicated, but it still shares the part where $P \geq (N - 1) \cdot C$.
For example, suppose that we replace CLH lock with spin lock:
operation():
Here we have hardware conflicts not only on compare&set, but also on the write locked = 0. These conflicts are not considered by the analysis of lock-free programs and, thus, a different analysis is needed for coarse-grained programs with a spin lock.
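As an illustration of where the two kinds of hardware conflicts arise, here is a minimal sketch of a coarse-grained operation protected by a spin lock. It is not the authors' listing: the names `SpinLock`, `operation` and `run_spin_demo` are hypothetical, and compare&set is emulated with a mutex because Python does not expose one.

```python
import threading
import time


class SpinLock:
    def __init__(self):
        self.locked = 0
        self._guard = threading.Lock()   # assumption: emulates atomic compare&set

    def _compare_and_set(self, expected, new):
        with self._guard:
            if self.locked == expected:
                self.locked = new
                return True
            return False

    def acquire(self):
        # All waiting processes repeatedly hit the same variable here.
        while not self._compare_and_set(0, 1):
            time.sleep(0)                # yield so the lock holder can run

    def release(self):
        # Plain write; it competes with the compare&set calls of the spinners.
        self.locked = 0


def operation(lock, counter):
    lock.acquire()
    value = counter[0]                   # critical section
    counter[0] = value + 1
    lock.release()
    # the parallel work of size P would follow here


def run_spin_demo(n_threads=4, iterations=200):
    lock, counter = SpinLock(), [0]

    def worker():
        for _ in range(iterations):
            operation(lock, counter)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Unlike in CLH, every waiting process targets the single `locked` variable, so both the acquiring compare&set and the releasing write contend on the same cache line.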
To summarize, the two analyses, though designed using similar arguments, are distinct.
CONCLUSION
In this short note, we showed that a simple theoretical analysis can quite accurately predict the throughput of data structures implemented using coarse-grained synchronization. For the moment, our analysis is restricted to algorithms using CLH-based locking in systems obeying the uniform scheduler. In upcoming work, we intend to extend the analysis to more realistic algorithm designs, lock implementations and architectures.
