This paper considers the modelling and the analysis of the performance of lock-free concurrent search data structures. Our analysis considers such lock-free data structures that are utilized through a sequence of operations which are generated with a memoryless and stationary access pattern. Our main contribution is a new way of analysing lock-free search data structures: our execution model matches the behavior that we observe in practice and achieves good throughput predictions. Search data structures are formed of linked basic blocks, usually referred to as nodes, that can be accessed by two kinds of events, characterized by their latencies: (i) CAS events, originating from modifications of the search data structure; (ii) Read events, originating during traversals. This type of data structure is usually designed to accommodate a large number of data nodes, which makes the occurrence of an event on a given node rare at any given time. The throughput is defined by the number of events per operation in conjunction with the factors that impact the latencies of these events. We frame these impacting factors under capacity and coherence cache misses.
I. INTRODUCTION
A search data structure is a collection of (key, value) pairs which are stored in an organized way to allow efficient search, delete and insert operations. Linked lists, hash tables and binary trees are some widely known examples. Lock-free implementations of such concurrent data structures are known to be strongly competitive at tackling scalability by allowing processors to operate asynchronously on the data structure.
Performance (here throughput, i.e. number of operations per unit of time) is ruled by the number of events in a search data structure operation (e.g. $O(\log N)$ for the expected number of steps in a skip list or a binary tree). Practical performance estimation requires an additional layer, as the cost (latency) of these events needs to be mapped onto the hardware platform; typical latency values vary from 4 cycles for an access to the first level of cache to 350 cycles for the last level of remote cache. To estimate the latency of events, one needs to consider the misses, which are sensitive to the interleaving of these events on the timeline. On the one hand, a capacity miss in data or TLB (Translation Lookaside Buffer) caches with LRU (Least Recently Used) policy arises when the interleaving of memory accesses has evicted a cacheline. On the other hand, coherence cache misses arise as a result of the modifications, often realized with Compare-and-Swap (CAS) instructions, in the lock-free search data structure. The interleaving of events that originate from different threads determines the frequency and severity of these misses, hence the latencies of the events.
In the literature, there exist many asymptotic analyses on the time complexity of sequential search data structures and amortized analyses for the concurrent lock-free variants that involve the interaction between multiple threads. But they only consider the number of events, ignoring the latency. On the other side, there are performance analyses that aim to estimate the coherence and capacity misses for the programs on a given platform, with no view on data structures. We will mention them in the related work. However, there is a lack of results that merge these approaches in the context of lock-free data structures to analytically predict the practical performance.
An analytical performance prediction framework could be useful in many ways: (i) to facilitate design decisions by providing an extensive understanding; (ii) to rank different designs in various contexts; (iii) to help the tuning process. On this last point, lock-free data structures come with specific parameters, e.g. padding, back-off and memory management related parameters, and become competitive only after picking their hopefully optimal values.
In this paper, we aim to compute the average throughput of search data structures for a sequence of operations, generated by a memoryless and stationary access pattern. Since the threads execute the same piece of code on the same platform, throughput $T$ can be estimated in the long term as the number of threads $P$ divided by the expected latency of an operation (subject to the distribution of the operations). As the traversal of a search data structure is light in computation, the latency of an operation is dominated by the memory access costs to the nodes that belong to the path from the entry of the data structure to the targeted node.
Therefore, part of this paper is dedicated to the discovery of the route(s) followed by a thread on its way to reach any node in the data structure. In other words, what is the sequence of nodes that are accessed when a given node is targeted by an operation.
As the latency of an operation is the sum of the latency of each memory access to the nodes that are on the path, we obviously need to estimate the individual latency of each traversed node. Even if, in the end, we are interested in the average throughput, this part of the analysis cannot be satisfied with a high-level approach, where we would ignore which thread accesses which node across time. For instance, the cache, whose misses are expected to greatly impact throughput, should be taken carefully into account. This can only be done in a framework from which the interleaving of memory accesses among threads can be extracted. That is why we model the distribution of the memory accesses for every thread.
More precisely, a memory access (traversal) can be either the read or the modification of a node, and two point distributions per node represent the triggering instant of either a Read or a CAS. These point distributions are modelled as Poisson point processes, since they can be approximated by Bernoulli processes, in the context of rare events. Knowing the probabilistic ordering of these events gives decisive information that is used in the estimate of the traversal latency associated with the triggered event. Once this information is obtained, we roll back to the expectation of the traversal of a node, then to the expectation of the latency of an operation.
We validate our approach through a large set of experiments on several lock-free search data structures based on various algorithmic designs, namely linked lists, hash tables, skip lists and binary trees. We feed our experiments with different key distributions, and show that our framework is able to predict and explain the observed phenomena.
The rest of the paper is organized as follows. We discuss related work in Section II, then the problem is formulated in Section III. We present the framework in Section IV and the computation of throughput in Section V. In Section VI, we show how to initiate our model by considering the particularity of different search data structures. Finally, we describe the experimental results in Sections VII and VIII.
II. RELATED WORK
The search path length of skip lists is analysed in [16], [21]. In [16], the search path length is split into vertical and horizontal components, where the horizontal cost is modelled with the number of right-to-left maxima (which correspond to the traversed nodes) in a sequence of nodes with random heights. In [9], [22], [18], various performance shapers for randomized trees are studied, such as the time complexity of operations, and the expectation and distribution of the depth of the nodes based on their keys.
The previously mentioned studies are not concerned with the interaction between the algorithms and the hardware. The following approaches rely on the independent reference model (IRM) for memory references and derive theoretical results or performance analysis. In [24], data reuse distance patterns are modelled and then exploited to predict the cache miss ratio. In [11], the exact cache miss ratio is derived analytically (at a high computational cost) for LRU caches under IRM. As an outcome of this approach, the cache miss ratio of a static binary tree is estimated in [10] by assigning independent reference probabilities to the nodes.
For the time complexity of lock-free search data structures, asymptotic amortized analyses [12], [5] are conducted since it is not possible to bound the execution time of a single operation, by definition. Apart from these theoretical studies, the performance of concurrent lock-free search data structures is investigated through empirical studies in [14], [8]. In [7], it is shown experimentally that conflicts between threads occur very rarely in the context of concurrent search data structures, which is confirmed by our analysis.
III. PROBLEM STATEMENT
We describe in this section the structure of the algorithm and the system that is covered by our model. We target a multicore platform where the communication between threads takes place through asynchronous shared memory accesses. The threads are pinned to separate cores and call AbstractAlgorithm (see Figure 1) when they are spawned. A concurrent search data structure is a shared collection of data elements, each associated with a key, that supports three basic operations holding a key as a parameter. A Search (resp. Insert, Delete) operation returns (resp. inserts, deletes) the element if the associated key is present (resp. absent, present) in the search data structure, otherwise it returns null.
The applications that use a search data structure can be seen as a sequence of operations on the structure, interleaved by application-specific code containing at least the key and operation selection, as reflected in AbstractAlgorithm.
The access pattern (i.e. the output of the key and operation selections) should be considered with care since it plays a decisive role in the throughput value. An application that always looks for the first element of a linked list will obviously lead to very high throughput rates. In this study, we consider a memoryless and stationary key and operation selection process i.e. such that the probability of selecting a key (resp. an operation type) is a constant.
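As an illustration of a memoryless and stationary selection process, the following Python sketch draws each (operation, key) pair independently from fixed distributions. The function name `make_access_pattern`, the weights and the seed are hypothetical choices made for this example; they are not part of AbstractAlgorithm as defined in the paper.

```python
import random


def make_access_pattern(key_weights, op_weights, seed=0):
    """Yield (operation, key) pairs drawn i.i.d. from fixed distributions.

    key_weights[j] is the constant probability weight of key j + 1, and
    op_weights maps an operation name to its constant probability weight.
    Each draw ignores the history (memoryless) and the distributions never
    change over time (stationary).
    """
    rng = random.Random(seed)
    keys = list(range(1, len(key_weights) + 1))
    ops = list(op_weights)
    op_w = [op_weights[o] for o in ops]
    while True:
        key = rng.choices(keys, weights=key_weights)[0]
        op = rng.choices(ops, weights=op_w)[0]
        yield op, key


# Hypothetical workload: 4 equally likely keys, 50% searches, 50% updates.
pattern = make_access_pattern(
    [0.25] * 4, {"Search": 0.5, "Insert": 0.25, "Delete": 0.25}
)
sample = [next(pattern) for _ in range(5)]
```

A workload that always targets the first key of a linked list corresponds to putting all the weight on key 1, which is still memoryless and stationary but leads to a very different throughput.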
A search data structure is modelled as a set of basic blocks called nodes, which either contain a value (valued nodes) or routes towards nodes (router nodes). W.l.o.g. the key set can be reduced to $[1..R]$, where $R$ is the number of possible keys. We denote by $(N_i)_{i \in [1..N]}$ the set of $N$ potential nodes, and by $K_i$ the key associated with $N_i$. Until further notice (see Section VIII), we assume that we have exactly one node per cacheline.
An operation can trigger two types of events in a node. We distinguish these events as Read and CAS events. The latency of an event is based on the state of the hardware platform at the time that the event occurs, e.g. the level of the cache where a node belongs to for a Read request. We summarize the parameters of our model as follows:
• Algorithm parameters: expected latency of the application call $t^{app}$, expected computational cost to traverse a node $t^{cmp}$, probability mass functions for the key and operation selection.
• Platform parameters: cache hit latencies (resp. capacities) at level $\ell$: $t^{dat}_\ell$ (resp. $C^{dat}_\ell$) for the data caches and $t^{tlb}_\ell$ (resp. $C^{tlb}_\ell$) for the TLB caches; other memory instruction latencies (which depend on $P$): $t^{cas}$ for a CAS execution and $t^{rec}$ to recover from an invalid state; number of threads $P$.
IV. FRAMEWORK
A. Event Distributions
We consider first a single thread running AbstractAlgorithm on a data structure where only search operations happen, and we observe the distribution of the Read triggering events on a given node N i . The execution is composed of a sequence of search operations, where each operation is associated with a set of traversed nodes, which potentially includes N i . If we slice the time into consecutive intervals, where an interval begins with a call to an operation, we can model the Read events as a Bernoulli process (where a success means that a Read event on N i occurs), where the probability of having a Read event during an interval depends on the associated operation (recall that the operation generating process is stationary and memoryless).
Search data structures have been designed as a way to store large data sets while still being able to reach any node within a short time: the set of traversed nodes is then expected to be small compared to the set of all nodes. This implies that, given an operation, the probability that $N_i$ belongs to the set of traversed nodes is small. Therefore we can map the Bernoulli process on the timeline with constant-sized intervals of length $T^{-1}$ instead of mapping it with the actual operation intervals: as the probability of having a Read event within an operation is small, the duration between two events is long, and this duration is close to the number of initial intervals within this duration, multiplied by $T^{-1}$ (with high probability, because of the Central Limit Theorem).
When we increase the scope of the operations to insertion and deletion, the structure is no longer static and the probability for a node to appear in an interval is no longer uniform, since it can move inside the data structure. There exists a long line of research in approximating Bernoulli processes by Poisson point processes [3] , [6] , [1] . In particular, [4] has dealt with non-uniform Bernoulli processes. Their error bounds, which are proportional to the success probability, strengthen the use of Poisson processes in our context: the events on N i are rare, thus the probabilities in Bernoulli processes are small and the approximation is well-conditioned.
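To make the rare-event argument concrete, the sketch below compares the tail of the gap between two successes of a uniform Bernoulli process with success probability `p` per interval (a geometric law) against the tail of the exponential law of rate `p` per interval. The function name and the tested values of `p` are assumptions of this illustration, and it only covers the uniform case, not the non-uniform processes treated in [4].

```python
import math


def max_tail_gap(p, n_max=None):
    """Largest difference between P[gap > n] under the two models.

    Bernoulli process: P[gap > n intervals] = (1 - p) ** n.
    Poisson process of rate p per interval: P[gap > n] = exp(-p * n).
    The maximum is taken over n = 0 .. n_max (about 20 mean gaps by default).
    """
    if n_max is None:
        n_max = int(20 / p)
    return max(
        abs((1.0 - p) ** n - math.exp(-p * n)) for n in range(n_max + 1)
    )


# The rarer the event, the closer the two inter-arrival distributions.
gaps = {p: max_tail_gap(p) for p in (0.1, 0.01, 0.001)}
```

In this sketch the gap shrinks roughly in proportion to `p`, which is in line with error bounds proportional to the success probability.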
Once the Read and CAS triggering events are modelled as Poisson processes for a single thread, the merge of several Poisson processes models the multi-thread execution.
Lastly, we specify a point on the dynamicity: since we have insertions and deletions, nodes can enter and leave the data structure. This is modelled by the masking random variable $P_i$, which expresses the presence of $N_i$ in the structure. At a random time, we denote by $D$ the set of nodes that are inside the data structure, and $P_i$ is set to 1 iff $N_i \in D$. We denote by $p_i$ its probability of success ($p_i = P[P_i = 1]$). Its evaluation will often rely on the probability that the last update operation on key $k$ was an Insert; we denote it by $q_k$, and, the selection process being memoryless and stationary, $q_k = P[\text{Insert on } k] / (P[\text{Insert on } k] + P[\text{Delete on } k])$, where $P[\cdot]$ is the probability that the selection process picks the given operation on key $k$.
Note that search data structures generally contain several sentinel nodes, which define the boundaries of the structure and are never removed from it: their presence probability is 1.
Recall for later that Poisson processes have useful properties, e.g. merging two Poisson processes produces another Poisson process whose rate is the sum of the two initial rates. This implies in particular that the traversal triggering events follow a Poisson process with rate $\lambda^{trav}_i = \lambda^{read}_i + \lambda^{cas}_i$, and that the Read triggering events that originate from $P$ different threads and occur at $N_i$ follow a Poisson process with rate $P \times \lambda^{read}_i$.
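The merging property can be checked numerically. The sketch below simulates independent Poisson processes, one per thread, merges their arrival times and estimates the rate of the merged process. The helper names, the per-thread rate and the horizon are hypothetical values picked for the example.

```python
import random


def poisson_arrivals(rate, horizon, rng):
    """Arrival times of a Poisson process with the given rate on [0, horizon)."""
    times = []
    t = rng.expovariate(rate)  # exponential inter-arrival times
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def merged_rate_estimate(rates, horizon, seed=0):
    """Merge one process per rate and return the observed events per time unit."""
    rng = random.Random(seed)
    merged = sorted(
        t for rate in rates for t in poisson_arrivals(rate, horizon, rng)
    )
    return len(merged) / horizon


# Hypothetical setting: P threads, each reading a node at rate lam_read.
P, lam_read = 4, 0.5
estimate = merged_rate_estimate([lam_read] * P, horizon=20000.0)
# The estimate should be close to P * lam_read.
```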
B. Validity of Poisson Process Hypothesis

[Figure: (a) Read Events for Skiplist, (b) Read Events for Hash Table, (c) Read Events for Binary Tree. Range: 16384, threads=4, Ins-Del: 0-0]
To illustrate the validity of modeling the events as Poisson processes, we experimentally extract the cumulative distribution function of the inter-arrival latency of Read events that occur on a given node in a skip list and we compare it against the corresponding exponential distribution (recall that the time between events in a Poisson process is exponentially distributed).
We consider a search-only scenario and a 50/50 search/update scenario. Each thread initially picks a random key and tracks the instants when a node associated with the chosen key is traversed during the execution. To facilitate the recording of the inter-arrival times, we disable the deletion of these particular keys (deletion is still enabled for any other key).
[Figure: (a) Read Events for Skiplist, (b) Read Events for Hash Table, (c) Read Events for Binary Tree]
In Figure 2 and Figure 3, we illustrate the results, where the dots represent the experimental measurements and the lines are generated by exponential distributions. The mean of each distribution is instantiated as the mean of the experimental measurements. One can observe, a posteriori, the grounds for our Poisson process modeling, and the variation of the event rates across keys, issuing from the differences between the node characteristics (key, height, location; see Section VI).
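The comparison described above can be sketched as follows: given recorded inter-arrival times, build their empirical cumulative distribution function and measure the largest gap to the exponential distribution that has the same mean. The function name and the two synthetic data sets are assumptions of this example; they stand in for real measurements, which are not reproduced here.

```python
import math
import random


def ks_distance_to_exponential(samples):
    """Largest gap between the empirical CDF of the samples and the CDF of
    the exponential distribution whose mean is the sample mean."""
    xs = sorted(samples)
    n = len(xs)
    mean = sum(xs) / n
    worst = 0.0
    for idx, x in enumerate(xs):
        model = 1.0 - math.exp(-x / mean)
        # The empirical CDF jumps from idx / n to (idx + 1) / n at x.
        worst = max(worst, abs(model - idx / n), abs(model - (idx + 1) / n))
    return worst


rng = random.Random(1)
# Synthetic inter-arrival times: Poisson-like versus nearly periodic events.
exp_like = [rng.expovariate(0.01) for _ in range(5000)]
periodic = [100.0 + rng.uniform(-1.0, 1.0) for _ in range(5000)]

d_exp = ks_distance_to_exponential(exp_like)   # small: good exponential fit
d_per = ks_distance_to_exponential(periodic)   # large: poor exponential fit
```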
C. Impacting Factors
We have identified five factors that dominate the traversal latency of a node, distributed into two sets. On the one hand, the first set of factors only emerges in parallel executions, as a result of the coherence issues on the search data structures. Atomic primitives, such as CAS, are used to modify the shared search data structures asynchronously. To execute a CAS in multi-core architectures, the cache coherence protocol enforces exclusive ownership of the target cacheline by a thread (pinned to a core) through the invalidation of all the other copies of the cacheline in the system, if needed. This process triggers back-and-forth communication among the cores, with the following performance implications. As the first factor, the CAS instruction has a significant latency, which is paid by the thread that executes the CAS. Secondly, any other thread has to stall until the end of the CAS execution if it attempts to access (read or modify) the node while the CAS is being executed. Last and most importantly, any thread pays a cost to bring a cacheline to a valid state if it attempts to access a node that resides in this cacheline and that has been modified by another thread after its previous access to this node.
On the other hand, the capacity misses in the data and TLB caches are other performance impacting factors for the node traversals. Consider a cache of size $C$ (fully associative), assume a node is traversed by a thread at time $t$ and the next traversal (same thread and node) occurs at time $t'$. The thread would experience a capacity miss for the traversal at time $t'$ if it has traversed at least $C$ distinct nodes in the interval $(t, t')$. The same applies for TLB caches, where the references to distinct pages are counted instead of the nodes.
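The capacity miss condition above can be sketched with a small simulation of a fully associative LRU cache for a single thread. The function name and the traces are hypothetical; first-time accesses are not counted, so that only capacity misses are reported.

```python
from collections import OrderedDict


def capacity_misses(trace, capacity):
    """Count the capacity misses of a fully associative LRU cache.

    trace is the sequence of node identifiers traversed by one thread.
    A re-access is a capacity miss when at least `capacity` distinct other
    nodes were traversed since the previous access to the same node.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    seen = set()
    misses = 0
    for node in trace:
        if node in cache:
            cache.move_to_end(node)  # hit: node becomes most recently used
        else:
            if node in seen:
                misses += 1  # the node was cached once and got evicted
            cache[node] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
        seen.add(node)
    return misses


# Two distinct nodes (2 and 3) are traversed between the two accesses to
# node 1, so the second access misses in a cache of size 2 but not of size 3.
miss_small = capacity_misses([1, 2, 3, 1], capacity=2)
miss_large = capacity_misses([1, 2, 3, 1], capacity=3)
```

The same function applies to TLB caches if the trace lists page identifiers instead of node identifiers.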
At a given instant, we denote by Traverse i the latency of traversing node N i , either due to a Read event or a CAS event, for a given thread. This latency is the sum of random variables that correspond to the previous respective five impacting factors:
where, at a random time, CAS 
D. Solving Process
The solving decomposes into three main steps. Firstly, we can notice that Equation 1 exposes 2R + 1 unknowns (the 2R access rates and throughput) against 2R equations. To end up with a unique solution, a last equation is necessary. The first two steps provide a last sufficient equation thanks to Little's law (see Section V-B), which links throughput with the expectation of the traversal latency of a node, computed from Sections V-A1 to V-A6. We show in these sections that they can be expressed according to the access rates λ read i and λ cas i . The last step focuses on the values of the probabilities in Equation 1, which are strongly related with the particular data structure under consideration; they are instantiated in Section VI-A (resp. VI-B, VI-C, VI-D) for linked lists (resp. hash tables, skip lists, binary trees).
V. THROUGHPUT ESTIMATION
A. Traversal Latency

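Applying expectation to Equation 2 expresses $E[Traverse_i]$, by linearity, as the sum of the expectations of the five impacting factors. We express here each term according to the rates at every node, $\lambda^{cas}$ and $\lambda^{read}$.
1) CAS Execution:
Naturally, among all traversal events, only the events originating from a CAS event contribute, with the latency $t^{cas}$ of a CAS: the expected CAS execution latency at $N_i$ is $t^{cas} \, \lambda^{cas}_i / (\lambda^{cas}_i + \lambda^{read}_i)$.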
2) Stall Time:
A thread experiences stall time while traversing $N_i$ when a thread, among the $(P - 1)$ remaining threads, is currently executing a CAS on the same node. As a first approximation, supported by the rareness of the events, we assume that at most one thread will wait for the access to the node.
Firstly, we obtain the rate of CAS events generated by the $(P - 1)$ other threads through the merge of their Poisson processes. Consider a traversal of $N_i$ at a random time: (i) the probability of being stalled is the fraction of time during which $N_i$ is occupied by a CAS of one of the $(P - 1)$ threads, given by $\lambda^{cas}_i (P - 1) t^{cas}$; (ii) the stall time that the thread would experience is distributed uniformly in the interval $[0, t^{cas}]$. Then, we obtain: $E[CAS^{stall}_i] = \lambda^{cas}_i (P - 1) t^{cas} (t^{cas}/2)$.
3) Invalidation Recovery:
Given a thread, a coherence cache miss occurs if $N_i$ is modified by any other thread in between two consecutive traversals of $N_i$. The events that are concerned are: (i) the CAS events from any thread; (ii) the Read events from the given thread. When $N_i$ is traversed, we look back at these events, and if, among them, the last event was a CAS from another thread, a coherence miss occurs; this happens with probability $(P - 1) \lambda^{cas}_i / (P \lambda^{cas}_i + \lambda^{read}_i)$. We derive the expected latency of this factor during a traversal at $N_i$ by multiplying this probability with the latency penalty $t^{rec}$ of a coherence cache miss.
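The three coherence-related terms can be evaluated numerically as sketched below. The stall expression follows the formula given in the text; the CAS execution share and the coherence miss probability are this sketch's reading of the prose (share of CAS events among traversals, and probability that the last relevant event was a CAS from another thread), so they should be taken as assumptions. The function name and the numerical values are hypothetical.

```python
def cas_latency_terms(lam_read, lam_cas, P, t_cas, t_rec):
    """Expected CAS execution, stall and invalidation recovery latencies
    for one traversal of a node, with per-thread rates lam_read and lam_cas."""
    # Share of traversals that are CAS events, each costing t_cas.
    exec_cost = (lam_cas / (lam_read + lam_cas)) * t_cas

    # Probability of hitting a CAS of one of the P - 1 other threads,
    # times the mean of a stall uniformly distributed in [0, t_cas].
    stall_cost = lam_cas * (P - 1) * t_cas * (t_cas / 2.0)

    # Competing Poisson processes: CAS of the other threads versus all CAS
    # events plus the Read events of the traversing thread (assumption).
    p_coherence_miss = lam_cas * (P - 1) / (lam_cas * P + lam_read)
    recovery_cost = p_coherence_miss * t_rec

    return exec_cost, stall_cost, recovery_cost


# Hypothetical values: rates in events per cycle, latencies in cycles.
exec_cost, stall_cost, recovery_cost = cas_latency_terms(
    lam_read=9e-4, lam_cas=1e-4, P=4, t_cas=100.0, t_rec=200.0
)
```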
4) Che's Approximation:
Che's approximation is a technique to estimate the hit ratio of an LRU cache, where the object (node, in our case) accesses follow the IRM (Independent Reference Model). It is concerned with the capacity misses in a cache. We apply the approximation to the search data structures to estimate the $E[Hit]$ terms. In this part, we give a brief discussion of Che's approximation; in the following sections (see V-A5, V-A6), we show how we adapt this scheme for our purposes.
IRM is based on the assumption that the object references occur in an infinite sequence from a fixed catalog of $N$ objects. The probability of referencing object $i$ at any point in the sequence (denoted by $s_i$, where $i \in [1..N]$) is a constant that does not depend on the reference history and does not vary over time. Under the LRU policy with a cache of size $C^{dat}$ and subject to an IRM demand of $N$ objects, an object reference leads to a capacity miss if at least $C^{dat}$ unique object references take place after the previous reference to the same object. Let a reference to object $i$ ($O_i$) occur at time $t_0$; the characteristic time for object $i$ is defined by the random variable:
Briefly, Che's approximation first combines all $T_i$, where $i \in [1..N]$, in a single variable by assuming that $s_i$ is negligible compared to $\sum_{j=1}^{N} s_j$, and then approximates $T_i$ with a constant $T^{dat}$ over objects. Consider a sequence of references that follows an IRM demand for $N$ objects, with reference probability $s_i$, where $i \in [1..N]$. The characteristic time $T^{dat}$ of a cache with size $C^{dat}$ is the unique solution of the following equation: $\sum_{i=1}^{N} \left(1 - e^{-s_i T^{dat}}\right) = C^{dat}$.
In [13], the authors analyse and illustrate the reason behind the accuracy of the approximation for a quite large spectrum of object reference distributions. Their argument relies on the random variable $X(t) = \sum_{j=1}^{N} \mathbb{1}_{t_0 < O_j \le t}$, which provides the number of unique object references that have occurred in the interval $[0, t]$. As the crucial property, $X(t)$ is defined as the sum of independent random variables. Based on the central limit theorem, they show that a Gaussian approximation for this sum is quite reasonable, for all $t$.
Without loss of generality, let object $i$ be referenced consecutively at times 0 and $t$. We know that the second reference would be a cache miss, in a cache of size $C^{dat}$, if $X(t) > C^{dat}$, where by assumption $X(t)$ is a Gaussian random variable. The cache hit ratio of a cacheline is given by:
Che's approximation basically approximates the cumulative distribution function of $X(t)$ with a step function that cuts this S-shaped cumulative distribution function at the expectation of $X(t)$, denoted by $m(t)$. Thus, it approximates $hit_i$ in Equation 3 with: $hit_i \approx 1 - e^{-s_i T^{dat}}$.
In this study, we have exploited Che's approximation to estimate the data and TLB cache hit ratios with a slight modification by keeping our arguments along the same lines with the ones presented above.
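As a sketch of the standard form of Che's approximation, the code below solves $\sum_i (1 - e^{-s_i T}) = C$ for the characteristic time $T$ by bisection and then returns the hit ratios $1 - e^{-s_i T}$. The function names, the bisection solver and the popularity vectors are assumptions of this example; they are not taken from the paper.

```python
import math


def characteristic_time(s, C, tol=1e-12):
    """Solve sum_i (1 - exp(-s_i * T)) = C for T by bisection.

    s holds strictly positive reference popularities (probabilities or
    rates) and C is the cache size, with 0 < C < len(s).
    """
    if not 0 < C < len(s):
        raise ValueError("the cache size must lie strictly between 0 and len(s)")

    def excess(T):
        return sum(1.0 - math.exp(-x * T) for x in s) - C

    lo, hi = 0.0, 1.0
    while excess(hi) < 0.0:  # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def che_hit_ratios(s, C):
    """Per-object hit ratio 1 - exp(-s_i * T) of an LRU cache of size C."""
    T = characteristic_time(s, C)
    return [1.0 - math.exp(-x * T) for x in s]


# With uniform popularity every object gets the same hit ratio C / N.
uniform_hits = che_hit_ratios([0.01] * 100, C=10)
# With a skewed (Zipf-like) popularity, popular objects hit more often.
zipf_hits = che_hit_ratios([1.0 / (i + 1) for i in range(50)], C=5)
```

By construction, the hit ratios sum to the cache size, which gives a quick sanity check of the solver.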
5) Cache Misses:
We consider a data cache at level $\ell$ of size $C^{dat}_\ell$ and compute the hit latency due to Read events on this cache. We assume that $N_i$ is either present in the search data structure or not during the characteristic time of the cache. Read events at $N_i$ are indeed much more frequent than the removal or insertion of $N_i$. This implies that if the characteristic time is long enough to accommodate the intervals where $N_i \in D$ and $N_i \notin D$, then the cache miss ratio of $N_i$ should be quite low, which would be underestimated due to our assumption. We can employ the Read rates as popularities, i.e. $s_i = \lambda^{read}_i$, and modify Che's approximation to discriminate whether, at a random time, $N_i$ is inside the data structure or not.
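We integrate the masking variable $P_i$ into Che's approximation. We have: $X^{cache}(t) = \sum_{i=1}^{N} P_i \, \mathbb{1}_{0 < O_i \le t}$, where $O_i$ denotes the reference time of $N_i$. We can still assume that $X^{cache}(t)$ is Gaussian, as a sum of many independent random variables. We estimate the characteristic time as follows, with the linearity of expectation and the independence of the random variables: $E[X^{cache}(t)] = \sum_{i=1}^{N} p_i \left(1 - e^{-\lambda^{read}_i t}\right)$.
Lastly, we solve the equation for the characteristic time $T^{dat}_\ell$ of the level-$\ell$ cache: $\sum_{i=1}^{N} p_i \left(1 - e^{-\lambda^{read}_i T^{dat}_\ell}\right) = C^{dat}_\ell$, thanks to a fixed-point approach. After computing $T^{dat}_\ell$, we estimate the cache hit ratio (on level $\ell$) of $N_i$: $1 - e^{-\lambda^{read}_i T^{dat}_\ell}$.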
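A possible way to compute the characteristic time when the masking variables are taken into account is sketched below. It assumes that the expected number of distinct cached nodes after time $T$ is $\sum_i p_i (1 - e^{-\lambda^{read}_i T})$, which is this sketch's reading of the text, and it uses a Newton iteration as one iterative solver among others; the fixed-point scheme actually used by the authors is not specified here. Names and numerical values are hypothetical.

```python
import math


def masked_characteristic_time(lam_read, p, C, max_iter=200):
    """Solve sum_i p_i * (1 - exp(-lam_i * T)) = C for T.

    lam_read[i] is the Read rate of node i, p[i] its presence probability,
    and C the cache size, with 0 < C < sum(p). The left-hand side is
    increasing and concave in T, so Newton steps started at T = 0 approach
    the solution from below.
    """
    if not 0 < C < sum(p):
        raise ValueError("the cache size must lie strictly between 0 and sum(p)")
    T = 0.0
    for _ in range(max_iter):
        value = sum(pi * (1.0 - math.exp(-l * T)) for l, pi in zip(lam_read, p)) - C
        slope = sum(pi * l * math.exp(-l * T) for l, pi in zip(lam_read, p))
        step = -value / slope
        T += step
        if abs(step) <= 1e-12 * max(T, 1.0):
            break
    return T


def masked_hit_ratio(lam_i, T):
    """Hit ratio of a present node with Read rate lam_i (assumed form)."""
    return 1.0 - math.exp(-lam_i * T)


# Hypothetical setting: 100 nodes, each present half of the time.
T_dat = masked_characteristic_time([0.01] * 100, [0.5] * 100, C=10)
hit = masked_hit_ratio(0.01, T_dat)
```

In this uniform example, the equation reduces to $50 (1 - e^{-0.01 T}) = 10$, so the hit ratio of a present node is 0.2.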
6) Page Misses:
In this paragraph, we aim at computing the page hit ratio of $N_i$ for the TLB cache at level $\ell$ of size $C^{tlb}_\ell$. The total number $M$ of pages that are used by the search data structure can be regulated by a parameter of the memory management scheme (frequency of recycling attempts for the deleted nodes), as the total number of nodes is a function of $R$. Unlike the cachelines (corresponding to the nodes), we can safely assume that a page accommodates at least a single node that is present in the structure at any time.
We cannot apply Che's approximation straightforwardly since the page reference probabilities are unknown. However, we are given the cacheline reference probabilities $(s_i)_{i \in [1..N]}$ and we assume that the $N$ cachelines are mapped uniformly to the $M$ pages, indexed by $[1..M]$.
Under these assumptions, we know that the resulting page references would follow IRM because aggregated Poisson processes form again a Poisson process.
We follow the same line of reasoning as in the cache miss estimation. First, we consider a set of Bernoulli random variables (Y j i ), leading to a success if N i is mapped into page j, with probability
does not depend on j). Under IRM, we can then express the page references as point processes with rate
Similar to the previous section, we denote the time of a reference to page $j$ with $O_j$ and we define the random variable $X^{page}(t) = \sum_{j=1}^{M} \mathbb{1}_{0 < O_j \le t}$ and compute its expectation:
Assuming that $X^{page}(t)$ is Gaussian, as it is a sum of many independent random variables, we solve the following equation for the constant $T^{tlb}_\ell$ (characteristic time of a TLB cache of size $C^{tlb}_\ell$): $E[X^{page}(T^{tlb}_\ell)] = C^{tlb}_\ell$. Lastly, we obtain the TLB hit rate for $N_i$ by relying on the average Read rate of the page that $N_i$ belongs to; we should add to the contributions of $N_i$ the references to the nodes that belong to the same page as $N_i$. Then follows the TLB hit ratio:
, where
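The aggregation of node references into page references can be sketched as follows, for one given mapping of nodes to pages. Weighting each node's Read rate by its presence probability, and using $1 - e^{-\text{rate} \times T}$ for the page hit ratio, are assumptions of this sketch made by analogy with the data cache case; the function names and the values are hypothetical.

```python
import math
from collections import defaultdict


def page_rates(lam_read, p, page_of):
    """Aggregate Read rate of each page.

    lam_read[i] is the Read rate of node i, p[i] its presence probability
    and page_of[i] the page that holds it. Merged Poisson processes add up
    their rates, so a page rate is the sum over the nodes mapped to it.
    """
    rates = defaultdict(float)
    for lam, pi, page in zip(lam_read, p, page_of):
        rates[page] += pi * lam
    return dict(rates)


def expected_distinct_pages(rates, t):
    """Expected number of distinct pages referenced within a window t."""
    return sum(1.0 - math.exp(-r * t) for r in rates.values())


def tlb_hit_ratio(page_rate, T_tlb):
    """Hit ratio of a page with the given aggregate rate (assumed form)."""
    return 1.0 - math.exp(-page_rate * T_tlb)


# Hypothetical mapping: nodes 0 and 1 share page 0, node 2 sits on page 1.
rates = page_rates([0.1, 0.2, 0.3], [1.0, 1.0, 0.5], [0, 0, 1])
```

The characteristic time of the TLB is then the window `t` for which `expected_distinct_pages(rates, t)` equals the TLB size.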
7) Interactions:
To be complete, we mention the interaction between impacting factors and the possibility of latency overlaps in the pipeline. Firstly, the traversal latencies of different nodes cannot be overlapped due to the semantic dependency between linked nodes. For a single node traversal, the latency of the CAS execution and the stall time cannot be overlapped with any other factor. We consider inclusive data and TLB caches. It is not possible to have a cache hit on level $\ell$ if the cache on level $\ell - 1$ is hit, and we do not consider any cost for the data cache hit if an invalidation recovery (coherence) cost is induced (i.e. E Hit
B. Latency vs. Throughput
In the previous sections, we have shown how to compute the expected traversal latency for a given node. There remains to combine these traversal latencies in order to obtain the throughput of the search data structure. Given N i ∈ D, the average arrival rate of threads to N i is λ . It can then be passed to Little's Law [17] 
This second order equation has a unique positive solution that provides the expected throughput, T .
VI. INSTANTIATING THE THROUGHPUT MODEL
In this section, we show how to initialize our model with widely known lock-free search data structures, that have different operation time complexities. In order to obtain a throughput estimate for a structure, we need to compute the rates λ read and λ cas , and
, i.e. the probability that, at a random time, an operation of type o on key k leads to a memory instruction of type e on node N i , knowing that N i is in the data structure. For the ease of notation, nodes will sometimes be doubly or triply indexed, and when the context is clear, we will omit |N i ∈ D in the probabilities.
We first estimate the throughput of linked lists and hash tables, on which we can directly apply our method; then we move on to more involved search data structures, namely skip lists and binary trees, which need particular attention.
A. Linked List
We start with the lock-free linked list implementation of Harris [15]. All operations in the linked list start with a search phase in which the linked list is traversed until a key is reached. At this point all operations terminate, except the successful update operations, which proceed by modifying a subset of nodes in the structure with CAS instructions. The structure contains only valued nodes and two sentinel nodes $N_0$ and $N_{R+1}$, so that $N = R + 2$ and, for all $i \in [1..R]$, $N_i$ holds key $i$, i.e. $K_i = i$.
First, we need to compute the probabilities of triggering a Read event and CAS event on a node, given that the node is in the search data structure, for all operations of type t ∈ {Insert, Delete, Search} targeted to key k.
At a random time, N k , for k ∈ [1..R], is in the linked list iff the last update operation on key k is an insert: p k = q k , by definition of q k . Moreover, when N k is in the structure (condition that we omit in the notation), op
CAS events can only be triggered by successful Insert and Delete operations. A successful Insert operation, targeted to $N_k$, is realized with a CAS that is executed on $N_{k'}$, where $k' = \sup\{\ell < k : N_\ell \in D\}$. The probability of success, which conditions the CAS's, follows from the presence probabilities:
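To illustrate how the node receiving the CAS of an Insert depends on the presence probabilities, the sketch below computes the distribution of the predecessor index $\sup\{\ell < k : N_\ell \in D\}$. It assumes that the presences of different nodes are independent, which is an assumption of this example, and the function name and probabilities are hypothetical.

```python
def cas_target_distribution(p, k):
    """Distribution of the predecessor of key k in the linked list.

    p[j] is the presence probability of node N_j, with p[0] = 1.0 for the
    head sentinel. N_j is the predecessor when it is present and every node
    strictly between j and k is absent (presences assumed independent).
    Returns a dict mapping j to that probability.
    """
    distribution = {}
    all_absent = 1.0  # probability that N_{j+1} .. N_{k-1} are all absent
    for j in range(k - 1, -1, -1):
        distribution[j] = p[j] * all_absent
        all_absent *= 1.0 - p[j]
    return distribution


# Hypothetical list with a sentinel N_0 and three keys present half the time.
dist = cas_target_distribution([1.0, 0.5, 0.5, 0.5], k=3)
# dist[2] == 0.5, dist[1] == 0.25, dist[0] == 0.25
```

Since the head sentinel is always present, the probabilities sum to 1.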
B. Hash Table
We analyse here a chaining-based hash table where elements are hashed to $B$ buckets implemented with the lock-free linked list of Harris [15]. The structure is parametrized with a load factor $lf$ which determines $B$ through $B = R/lf$. The hash function $h : k \to k/lf$ maps the keys sequentially to the buckets, so that, after including the sentinel nodes (2 per bucket), we can doubly index the nodes: $N_{b,k}$ is the node in bucket $b$ with key $k$, where $b \in [1..B]$ and $k \in [1..lf]$ (the last bucket may contain fewer elements).
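The sequential mapping of keys to buckets can be sketched as below. The text writes the ratios without rounding; this sketch assumes a ceiling, which is one reading that yields bucket indices in $[1..B]$ for keys in $[1..R]$ and a possibly shorter last bucket. The function names are hypothetical.

```python
def num_buckets(R, lf):
    """Number of buckets for R keys and load factor lf (ceiling assumed)."""
    return (R + lf - 1) // lf


def bucket_of(key, lf):
    """1-based bucket index of a key in [1..R] (ceiling of key / lf assumed)."""
    return (key + lf - 1) // lf


def index_in_bucket(key, lf):
    """1-based position of the key inside its bucket, in [1..lf]."""
    return (key - 1) % lf + 1


# Hypothetical instance: 10 keys with load factor 4 give 3 buckets,
# the last one holding only keys 9 and 10.
R, lf = 10, 4
layout = {k: (bucket_of(k, lf), index_in_bucket(k, lf)) for k in range(1, R + 1)}
```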
In the previous two data structures, we do observe differences in the traversal rate from node to node, but the node associated with a given key does not show significant variation in its traversal rate during the course of the execution: inside the structure, the number of nodes preceding (and following) this node is indeed rather stable. In the next two data structures, node traversal rates can change dramatically according to node characteristics, that may include its position in the structure. In a skip list, a node N i containing key K i with maximum height will be traversed by any operation targeting a node with a higher key. However, N i can later be deleted and inserted back with the minimum height; the operations that traverse it will then be extremely rare. The same reasoning holds when comparing an internal node with key K i of a binary tree located at the root or close to the leaves.
As explained before, an accurate cache miss analysis cannot be satisfied with average access rates. Therefore, the information on the possible significant variations of rates should not be diluted into a single access rate of the node. To avoid that, we pass the information through virtual nodes: a node of the structure is divided into a set of virtual nodes, each of them holding a different flavor of the initial node (height of the node in the skip list or subtree size in the binary tree). The virtual nodes go through the whole analysis instead of the initial nodes, before we extract the average behavior of the system hence throughput.
C. Skip List
There exist various lock-free skip list implementations and we study here the lock-free skip list [23] . Skip lists offer layers of linked lists. Each layer is a sparser version of the layer below where the bottom layer is a linked list that includes all the elements that are present in the search data structure. An element that is present in the layer at height h appears in layer at height h + 1 with a fixed appearance probability (1/2 for our case) up to some maximum layer h max that is a parameter of the skip list.
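As an illustration of the layered structure, the following sketch samples the height reached by an element and gives the corresponding probability mass function. It assumes that height 0 denotes an element present only in the bottom layer and that the promotion probability is 1/2; the function names are hypothetical.

```python
import random


def sample_height(h_max: int, p: float = 0.5, rng=random) -> int:
    """Number of layers above the bottom one that an element reaches (0..h_max)."""
    h = 0
    while h < h_max and rng.random() < p:
        h += 1
    return h


def height_pmf(h_max: int, p: float = 0.5) -> list:
    """P[height == h] for h = 0..h_max; the cap at h_max absorbs the geometric tail."""
    pmf = [(p ** h) * (1 - p) for h in range(h_max)]
    pmf.append(p ** h_max)
    return pmf


if __name__ == "__main__":
    print(height_pmf(4))
    print([sample_height(4, rng=random.Random(1)) for _ in range(5)])
```

Under these assumptions and with p = 1/2, an element reaches exactly height h < h max with probability 2^-(h+1).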
Skip list implementations are often realized by distinguishing two types of nodes: (i) valued nodes reside at the bottom layer and they hold the key-value pair in addition to two pointers, one to the next node at the bottom layer and one to the corresponding routing node (which could be null); (ii) routing nodes are used to route the threads towards the search key. Being coupled with a valued node, a routing node does not replicate the key-value pair. Instead, only a set of pointers, corresponding to the valued node containing the next key in different layers, are packed together in a single routing node (that fits in a cacheline with high probability). Every Read event in a routing node is preceded by a Read in the corresponding valued node.
otherwise. By decomposing into three cases, we compute the probability that an operation op Figure 4, no node in the skip list overlaps with the red frame). Let us now assume k′ < k. The occurrence of a Read event requires that, for all (x, y) such that y ≥ h and k′ ≤ x < k, N z x,y is not present in the structure. Lastly, a Read event is certainly triggered if k′ = k. The final formula is given by:
Next, we apply a similar approach for CAS events. In Figure 5 , we illustrate an example. A CAS event occurs at the green pointer, as a result of the removal (or insertion) of K k if there is no node in the red frame. For all node and operation couples, P op 2 −(h+1) when the maximum height. The data node is linked to the list at the bottom layer with a CAS that is executed on the previous data node. If a routing node is introduced, it is linked to lists at h different layers, thus leads to h CAS instructions that are applied on the other nodes.
The deletion of an element is composed of two phases. The first phase is to mark the data node, N dat k ,h and the pointers in the routing node with height k , if it exists. If the height of the routing node is more than one, it is possible that multiple CAS instructions are executed on the same routing node. However, we only consider the first one. The latency and the effect of the remaining ones are negligible, as they are applied on the same cacheline one after another. This repetitive behavior guarantees that the cacheline is already exclusively owned before the next CAS instructions run. To recall, this is consistent with our assumption that an event can occur at most once per operation on a node. The second phase of the deletion follows the same path as the insertion: a CAS on the previous node is executed for each layer that the data and routing nodes span.
We have denoted the success probability of an Insert operation with q k = P[op=op
. Also, the factor 2 −(h+1) provides the probability of the insertion of a routing node with height h, coupled with its data node. Based on the non-existence of any node that overlaps with the area that is enclosed with the red frame in Figure 5 , we obtain:
We show here how to estimate the throughput of external binary trees. They are composed of two types of nodes: internal nodes route the search towards the leaves (routing nodes) and store just a key, while leaves, referred to as external nodes, contain the key-value pair (valued nodes). We use the external binary tree of Natarajan [19] to initialize our model. The search traversal starts and continues with a set of internal nodes and ends with an external node. We denote by N int k (resp. N
Our first aim is to find the paths followed by any operation through the binary tree, in order to obtain the access triggering rates, thanks to Equation 1. Binary trees are more complex than the previous structures since the order of the operations impacts the positioning of the nodes. The random permutation model proposes a framework for randomized constructions in which we can develop our model. Each key is associated with a priority, which determines its insertion order: the key with the highest priority is inserted first. The performance characteristics of randomized binary trees are studied in [22]. In the same vein, we compute the traversal probability of the internal node with key k in an operation that targets key k′.
Lemma 1. Given an external binary tree, the probability of traversing N int k in an operation that targets key K k′ is given by:
Proof. N int k is traversed if it is on the search path to the external node with key k′. Given k′ ≥ k, this happens iff N int k has the highest priority among the internal nodes in the interval [k, k′]. This interval contains f(k, k′) internal nodes; thus, the probability that N int k possesses the highest priority is 1/f(k, k′).
If k′ < k, N int k is traversed iff it has the highest priority in the interval (k′, k]. Hence the lemma.
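The priority argument of the proof can be checked on a small instance. The sketch below is an illustration under simplifying assumptions, not the external tree of the paper: it uses a plain (internal) binary search tree in which all keys 1..n are present, so that the number of nodes in the interval between k and k′ is |k − k′| + 1, and it enumerates every insertion order exactly. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def search_path(order, target):
    """Insert keys in the given order into a plain BST; return the keys visited when searching target."""
    left, right = {}, {}
    root = order[0]
    for key in order[1:]:
        node = root
        while True:
            child = left if key < node else right
            if node in child:
                node = child[node]
            else:
                child[node] = key
                break
    path, node = [], root
    while node != target:
        path.append(node)
        node = left[node] if target < node else right[node]
    path.append(target)
    return path


def traversal_probability(n, k, target):
    """Exact probability, over all n! insertion orders, that key k is on the search path to target."""
    hits = total = 0
    for order in permutations(range(1, n + 1)):
        total += 1
        hits += k in search_path(order, target)
    return Fraction(hits, total)


if __name__ == "__main__":
    print(traversal_probability(5, 2, 4))  # interval [2, 4] holds 3 keys
```

In this simplified setting, the enumerated probability equals one over the number of keys between k and the target (both included), which is the counterpart of 1/f(k, k′).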
Even though nodes are inserted into and deleted from the binary tree an infinite number of times, Lemma 1 can still be of use. The number of internal nodes in the interval [k, k′] (or (k′, k] if k′ < k) is indeed a random variable: it is the sum of independent Bernoulli random variables that model the presence of the nodes. As a sum of many independent Bernoulli variables, it is expected to show low variation because of its asymptotic normality. Therefore, we replace this random variable with its expected value and stick to this approximation in the rest of this section. The number of internal nodes in any interval follows from the presence probabilities: p z k = q k , where z ∈ {int, ext}.
When an operation targets key k′, a single external node is traversed (if any): N ext k′, if present; else the external node with the biggest key smaller than k′, if it exists; else the external node with the smallest key. Then, we have:
These probabilities finally lead to the computation of the Read (resp. CAS) rates λ read z,k (resp. λ cas z,k ) of N z k , where z ∈ {int, ext}, which will be used in the last step that follows.
We focus now on the Read rate of the internal nodes. We have found the average behavior of each node in the previous step; however, the node can follow different behaviors during the execution since the Read rate of N int k depends on the size of the subtree whose root is N int k , which is expected to vary with the update operations on the tree. We dig more into this and reflect these variations by decomposing
We define the Read rate λ read int,k,h of these virtual nodes as a weighted sum of the initial node rate thanks to the two equations p
We connect the virtual nodes to the initial nodes in two ways. On the one hand, one can remark that the Read rate is proportional to the subtree size: λ read int,k,h ∝ hλ read int,k . On the other hand, based on the probability mass function of the random variable Sub k representing the size of the subtree rooted at N int k , we can evaluate the weight of the virtual nodes:
We have computed λ read int,k . These values reflect the average behavior along the whole execution. However, the average behavior is not enough to compute the traversal latency accurately for the internal nodes. In the execution, there are different time intervals where λ read int,k shows significant variation depending on the part of the tree in which the node is located. For instance, it is quite improbable to observe a cache miss at N int k when it is positioned at the root of the tree. One would observe a very high rate of traversals with low latency in this case, which significantly decreases the expected traversal latency of N int k . An accurate estimation of the cache misses requires the consideration of this particularity of the binary tree. To approximate the impact of this variation, we split N int k into a number (let H k denote this number for N int k ) of independent virtual nodes (along the lines of the independent reference model), each representing the behavior of N int k with a different Read rate. The virtual node with Read rate λ read int,k,h is denoted by N k h,int . We will obtain the Read rates λ read int,k,h and presence probabilities p int k,h for these virtual nodes by requiring that the average behaviors are still valid:
For an external binary tree with N internal nodes, generated with the random permutation of insertions, the probability mass function of the size of the subtree (the random variable concerns only the number of the internal nodes and denoted by Sub k ) that is rooted at N is the root of subtree that includes all N int y , such that σ j < σ y < σ i ) can happen with probability, such that σ j = s + 1. In addition, there can be at least 0 and at most s − 2 distinct pairs of nodes (N int j , N int i ) such that σ i − σ j = s + 1 and σ j < σ k < σ i . Similar to (i) and (ii), we obtain and sum the probabilities lead to Sub k = s. We have:
We start with an observation. The Read rate of N int k is proportional to the size of the subtree that is rooted at N int k . Given a binary tree of N internal nodes, the size of the subtree can vary in the interval 2 for the majority of different values of h and k. Therefore, we approximate
, with a single constant c 1 for all k and h < H k . We know,
2 ) and p
Now, we consider the CAS events. Delete and Insert operations start with the search phase. An Insert operation finishes with a CAS executed at the grandparent internal node of the inserted external key. A Delete operation contains three CAS instructions: (i) one at the grandparent internal node of the deleted external key; (ii) two that are executed consecutively at the parent node of the external key. We consider the latter two as a single CAS instruction, since the second of the consecutive ones has a negligible cost because the cacheline is already exclusively owned by the thread.
Similar to Read events, we first find the rate of CAS events for N int k and split these events among the virtual nodes by requiring that the average behavior remains valid:
To determine the target of a CAS event, we need to determine the probability that an internal node N int k is the grandparent or parent of the targeted N ext k′ . We examine four different cases, as illustrated in Figure 6. Given that we are in the first case, we look for the probability that N int k , k′ < k, possesses the smallest or second smallest key that is bigger than k′ among the internal nodes that are present in the tree. The internal nodes with the smallest and the second smallest such key correspond to the parent and grandparent of N ext k′ , respectively. For case 1, it is possible that the grandparent is the node with the xth, x > 1, smallest key bigger than k′ that is present in the tree. However, this probability decreases exponentially as x increases. That is why we have attributed the CAS events that take place at the grandparent node to the node with the second smallest key that is bigger than k′. For case 2, the parent corresponds to the smallest key that is bigger than k′ and the grandparent corresponds to the biggest key that is smaller than k′, among those present in the tree.
Formally, let P
For k ≥ k′ we have: (these probabilities are zero if k < k′)
And for k < k′: (these probabilities are zero if k ≥ k′)
Based on Lemma 1 (assuming a constant tree size), we obtain the expected number of internal nodes that route the search to their left child (c k′,l ) and right child (c k′,r ) for an operation that targets key k′. On this route, we compute the probability that a random node is the left (right) child of its parent, with l k′ = c k′,l /(c k′,l + c k′,r ) (and similarly r k′ = c k′,r /(c k′,l + c k′,r )). We then estimate the probability of observing a case at a random time by using these values (i.e. (l k′ )² for Case 1, l k′ r k′ for Case 2). Finally, we obtain:
Lastly, we split the CAS events to the virtual nodes. CAS events can happen at the internal nodes only when they are in the last two levels of the tree (or similarly when the size of the subtree that is rooted at the concerned internal node is in the interval [1, 3] ). We required the average behaviour to be valid and set λ
, ∀x ∈ {1, 2, 3}. For the cases where the operation key selection follows a zipf distribution, there exists a small region of the tree on which most operations concentrate. The update operations concentrate on that region, so the nodes are expected to change levels frequently. This means that the impact of the invalidation recovery factor can be seen while the node is at any level. For this impacting factor, under the zipf distribution, we split the events evenly among the virtual nodes, ∀h, λ
VII. EXPERIMENTAL EVALUATION
We validate our model through a set of well-known lock-free search data structure designs, mentioned in the previous section. We stress the model with various access patterns and numbers of threads to cover a considerable number of scenarios where the data structures could be exploited. For the key selection process, we vary the key ranges and the distribution: from uniform (i.e. the probability of targeting any key is constant for each operation) to zipf (with α = 1.1, where the probability of targeting a key decreases with the value of the key). Regarding the operation types, we start with various balanced update ratios, i.e. such that the ratio of Insert (among all operations) equals the ratio of Delete. Then, we also consider asymmetric cases where the ratios of Insert and Delete operations are not equal, which changes the expected size of the structure.
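The two key selection processes can be sketched as follows. This is an illustrative sketch, not the benchmark code: it assumes that keys are the integers 1..R and that the zipf probability of key k is proportional to k^(-α), which matches the statement that the probability decreases with the value of the key. The function names and the seed parameter are hypothetical.

```python
import random


def uniform_pmf(key_range: int) -> list:
    """Every key in 1..key_range is targeted with the same probability."""
    return [1.0 / key_range] * key_range


def zipf_pmf(key_range: int, alpha: float = 1.1) -> list:
    """Probability of key k proportional to k ** -alpha (assumed rank = key value)."""
    weights = [k ** -alpha for k in range(1, key_range + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_keys(pmf, count, seed=0):
    """Draw `count` independent keys, i.e. a memoryless and stationary access pattern."""
    rng = random.Random(seed)
    return rng.choices(range(1, len(pmf) + 1), weights=pmf, k=count)


if __name__ == "__main__":
    pmf = zipf_pmf(1000)
    print(pmf[0], pmf[-1])
    print(sample_keys(pmf, 10))
```

Drawing each key independently from a fixed distribution is what makes the generated access pattern memoryless and stationary.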
A. Setting
We have conducted experiments on an Intel ccNUMA workstation system. The system is composed of two sockets, each containing eight physical cores. The system is equipped with Intel Xeon E5-2687W v2 CPUs. Threads are pinned to separate cores. One can observe the performance change when the number of threads exceeds 8, which activates the second socket.
In all the figures, the y-axis provides the throughput, while the number of threads is represented on the x-axis. The dots provide the results of the experiments and the lines provide the estimates of our framework. The key range of the data structure is given at the top of the figures and the percentage of update operations is color coded.
We instantiate all the algorithm- and architecture-related latencies following the methodologies described in [20], [2]. In line with these studies, we observed that the latencies t cas and t rec depend on thread placement. We distinguish two different costs for t cas according to the number of active sockets. Similarly, given a thread accessing a node N i , the recovery latency is low (resp. high), denoted by t rec low (resp. t rec high ), if the modification has been performed by a thread that is pinned to the same (resp. another) socket. Before the execution, we measure both t rec low and t rec high , and instantiate t rec with the average recovery latency, computed in the following way for a two-socket chip. For s ∈ {1, 2}, we denote by P s the number of threads that are pinned to socket s. By taking into account all combinations, we have
$$t^{rec} = \frac{P_1 \left(P_1 t^{rec}_{low} + P_2 t^{rec}_{high}\right) + P_2 \left(P_2 t^{rec}_{low} + P_1 t^{rec}_{high}\right)}{P^2}.$$
Since P = P 1 + P 2 , we have $P_1^2 + P_2^2 = P^2 - 2 P_1 P_2$, and we obtain
$$t^{rec} = t^{rec}_{low} + 2\,\frac{P_1}{P}\left(1 - \frac{P_1}{P}\right)\left(t^{rec}_{high} - t^{rec}_{low}\right).$$
For the data structure implementation, we have used the ASCYLIB library [8], which is coupled with an epoch-based memory management mechanism that introduces negligible latency.
B. Search Data Structures
1) Linked List: Figures 7, 8 and 9 illustrate the results for the lock-free linked list, for the various scenarios described before (see VII). For the majority of the cases, our estimates look reasonable, except for the cases where the cache miss ratios are underestimated due to the limitations of the independent reference assumption. The assumption in the Independent Reference Model is that the events at the different nodes are independent Poisson processes. A linked list operation reveals a high degree of spatial locality, implying that the Poisson processes for the different nodes are in fact dependent. This inaccuracy indeed illustrates the importance of accurate estimates of the event latencies, which are needed to capture the practical performance.
2) Hash Table: Figure 13 shows the results for a case where the selection process follows a zipf distribution. Lastly, Figure 14 reveals the results for asymmetric Delete and Insert operation ratios where the key selection is done with a uniform distribution. For the hash table, our estimates capture the real behavior with satisfactory precision in almost all cases.
3) Skip List: Figures 15, 16 and 17 illustrate the results for the lock-free skip list, for the various scenarios described before (see VII), where the estimates often closely follow the real behavior. In Figure 17, we observe that our estimates show some deviation from the real behavior for the cases where the key range is small and the Delete ratio is higher than the Insert ratio. For such cases, the expected size of the search data structure tends to be very small, which might lead to inaccuracies.
In a non-padded (packed) configuration, multiple nodes are packed together into a single cacheline. This implies that a modification done at a node could lead to a coherence cache miss in the traversal of the other nodes. This is often referred to as false sharing. On the other hand, packed configurations benefit from their compact representation by reducing the capacity misses.
Until now, we have assumed that the nodes are padded. Here, we extend the framework to estimate the performance of a packed configuration to facilitate the tuning process. In such a setting, where the nodes are inserted and deleted repeatedly, N i can be alone in its cacheline with the old versions of a set of nodes that are not present any more in the data structure. Alternatively, it might be mapped to the same cacheline with some number of active nodes that are present in the search data structure and they all together contribute to the event rates that are originating from the same cacheline.
Firstly, we assume that at most two nodes can be packed into a cacheline (which is the case for the data structures that we consider) and we denote the total number of slots for the node allocations by S = 2M · pageSize/cacheLineSize (recall that M is the number of pages that are used by the structure). We assume that the nodes are assigned uniformly to the slots; given that N i and N j are present in the structure, N j is mapped to the same cacheline as N i with probability 1/(S − 1). By the linearity of expectation, the expected additional event rate for the cacheline that N i is mapped to is given by the sum of the event rates originating from the different nodes.
With the node packing, we obtain additive components for CAS and Read events. Now, we show the integration of these additive components into the process.
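The additive component can be sketched numerically. The snippet below is a minimal sketch under stated assumptions rather than the exact expression of the framework: it takes, for every node j, a presence probability and an event rate conditioned on presence, and weights each other node by the co-location probability 1/(S − 1). The function and parameter names are hypothetical.

```python
def num_slots(num_pages: int, page_size: int, cacheline_size: int, nodes_per_line: int = 2) -> int:
    """Total number of node slots S = 2 * M * pageSize / cacheLineSize."""
    return nodes_per_line * num_pages * page_size // cacheline_size


def expected_additional_rate(i, presence, rates, slots):
    """Expected event rate that other nodes add to the cacheline of node i.

    Assumption: node j is present with probability presence[j] and, when present,
    shares the cacheline of node i with probability 1 / (slots - 1).
    """
    others = sum(presence[j] * rates[j] for j in range(len(rates)) if j != i)
    return others / (slots - 1)


if __name__ == "__main__":
    S = num_slots(num_pages=4, page_size=4096, cacheline_size=64)
    print(S)
    print(expected_additional_rate(0, [1.0, 0.5, 0.5], [5.0, 2.0, 4.0], S))
```

The same helper can be applied separately to the Read rates and to the CAS rates to obtain the two additive components.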
1) Cache Misses: To begin with, packing has a positive impact on the cache misses, as it increases the characteristic time (T) of the cache, that is, the duration of C unique cacheline references. To recall, N i could contribute to these C references only if N i ∈ D, and we have embedded this effect into the process by introducing the random variable P i (see V-A5). With packing, this contribution becomes less probable, as it occurs only if the reference to N i occurs before the references to the other node that is mapped to the same cacheline as N i . Otherwise, the reference to N i is ineffective for the characteristic time. To recall, the characteristic time is the solution of the following equation:
Having obtained the characteristic time, we involve the additive factor to estimate the cache miss rate of N i . This is because a reference leads to a cache miss (in a cache of size C) only if the previous C cacheline references do not include the cacheline that N i is mapped to.
2) Page Misses: Secondly, packing can improve the TLB cache hit ratios. This simply happens because it reduces the total number of pages that the search data structure spans. To recall, the total number of pages is a parameter of the process that computes the expected latency for the impacting factor (Hit tlb i ). Packing does not influence the process itself, so we just need to update the value of the parameter.
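To make the notion of characteristic time recalled in 1) Cache Misses concrete, the following sketch solves the textbook form of the characteristic-time equation for an LRU cache under independent Poisson references, namely sum over i of (1 − exp(−λ i T)) = C. This is an assumption for illustration: the equation of the framework additionally involves the presence variable P i and the packing-related additive rates, which are not reproduced here. The function names are hypothetical.

```python
import math


def characteristic_time(rates, cache_size, tol=1e-9):
    """Solve sum_i (1 - exp(-rate_i * T)) = cache_size for T by bisection.

    Assumes more lines with a positive rate than cache slots; otherwise
    everything fits in the cache and T is infinite.
    """
    if cache_size >= sum(1 for r in rates if r > 0):
        return math.inf

    def filled(t):
        # Expected number of distinct cachelines referenced within a window of length t.
        return sum(1.0 - math.exp(-r * t) for r in rates)

    lo, hi = 0.0, 1.0
    while filled(hi) < cache_size:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if filled(mid) < cache_size:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def miss_probability(rate, t_char):
    """Probability that a line with the given rate was not referenced during the last t_char."""
    return math.exp(-rate * t_char)


if __name__ == "__main__":
    T = characteristic_time([1.0] * 10, cache_size=5)
    print(T, miss_probability(1.0, T))
```

In this simplified form, raising the reference rate of a cacheline (as packing does by merging the rates of two nodes) lowers its miss probability for a fixed T.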
3) CAS Execution: On the downside, packing is expected to reduce the performance through the CAS-related impacting factors. To recall, CAS reco i represents the expected latency per traversal at N i for executing CAS instructions targeted to N i . This factor is proportional to the throughput, and packing does not change the probability of executing a CAS at N i while traversing it. So, packing does not have a direct impact on this component.
4) Invalidation Recovery:
The most important CAS-related performance impacting factor is the invalidation recovery. For each traversal of N i , there exists a possibility to pay for a coherence cache miss due to the previous CAS executions at the cacheline that N i is mapped to. To compute the probability of a coherence miss, one needs to consider the previous events on the cacheline. The traversal (by a thread at N i ) would not experience the coherence miss if the previous traversal (on the cacheline that N i is mapped to) by the same thread is not followed by a CAS event of another thread. Thus, we consider the additive factor for both types of events and modify the process as follows:
In Figures 21 and 22, the results are depicted for configurations with padding (dashed lines), packing (dots) and our packing-based estimates (lines), for the linked list and hash table (nodes for the tree and the skip list are either too large to be packed in a single cacheline or already packed). The key selection is done with the uniform distribution. For almost every case, we observe that packing increases the performance and that the performance does not degrade due to false sharing, even when the update rate is high. The stall time (E CAS stall i ) is often not significant and the invalidation recovery (E [CAS This might explain why false sharing does not degrade the performance, contrary to what one might expect. However, the cache and page misses influence the performance positively, as expected.
Our estimates show that these effects are captured by our framework. We observe a slight increase in almost all the curves that is coupled with a slight increase in our estimates, due to the reduced capacity cache misses.
IX. CONCLUSION
In this paper, we have modelled and analysed the performance of search data structures under a stationary and memoryless access pattern. We have distinguished two types of events that occur in the search data structure nodes and have modelled the arrival of events with Poisson processes. The properties of the Poisson process allowed us to consider the thread-wise and system-wise interleaving of events, which are crucial for the estimation of the throughput. For the validation, we have used several fundamental lock-free search data structures.
As future work, it would be of interest to study to what extent the application workload can be distorted while still giving satisfactory results. Putting aside the non-memoryless access patterns, non-stationary workloads, such as bursty access patterns, could be covered by splitting the time interval into alternating phases and assuming a stationary behavior for each phase. Furthermore, we foresee that the framework can capture the performance of lock-based search data structures and can also be exploited to predict the energy efficiency of concurrent search data structures.
