Classic test-and-set (TS) mutual exclusion locks are simple, and enjoy high performance and low latency of ownership transfer under light or no contention. However, they do not scale gracefully under high contention and do not provide any admission order guarantees. Such concerns led to the development of scalable queue-based locks, such as the recent Compact NUMA-aware (CNA) lock, a variant of the popular queue-based MCS lock. CNA scales well under load and provides certain admission guarantees, but has more complicated lock handover operations than TS and incurs higher latencies at low contention.
Introduction
TS: Test-and-set locks (TS) [3] are compact (consisting of a single lock word) and simple, and provide excellent latency under light or no contention. They fail to scale, however, as contention increases.
Acquiring threads simply busy-wait, or spin, attempting to change the lock word state from unlocked to locked with an atomic read-modify-write instruction, such as compare-and-swap (CAS) or exchange (SWAP). If the atomic operation was successful, then the thread has acquired the lock and may enter the critical section. Releasing the lock requires only a simple store to set the state to unlocked. So-called "polite" test-and-test-and-set locks (TTS), a variation on TS, first fetch the lock value and only attempt the atomic instruction if the lock was observed to be not held. That is, acquiring threads busy-wait until the lock is clear, at which point they execute an atomic instruction to try to gain ownership. TTS acts to avoid unnecessary write invalidation arising from failed atomic operations. Simple "impolite" TS locks do not bother to first load the value, so each probe of the lock causes writing via the atomic instruction. TS and TTS locks are usually augmented with back-off (delays between probes) to moderate contention.
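The difference between the impolite and polite variants can be illustrated with a minimal C++ sketch. The class names TSLock and TTSLock are hypothetical, std::this_thread::yield() stands in for a PAUSE-based back-off, and default sequentially consistent atomics are used throughout; this is an illustration under those assumptions rather than a tuned implementation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative sketch; class and member names are hypothetical.

// "Impolite" test-and-set lock: every probe is an atomic write (SWAP).
class TSLock {
    std::atomic<int> word{0};               // 0 = unlocked, 1 = locked
public:
    bool try_lock() { return word.exchange(1) == 0; }
    void lock() {
        while (word.exchange(1) != 0) {
            std::this_thread::yield();      // stand-in for PAUSE or back-off
        }
    }
    void unlock() {
        assert(word.load() == 1);           // releasing a free lock is a bug
        word.store(0);                      // release is a simple store
    }
};

// "Polite" test-and-test-and-set lock: read first, and attempt the
// atomic instruction only when the lock was observed to be free.
class TTSLock {
    std::atomic<int> word{0};
public:
    bool try_lock() { return word.load() == 0 && word.exchange(1) == 0; }
    void lock() {
        for (;;) {
            while (word.load() != 0) {
                std::this_thread::yield();  // wait until the lock looks clear
            }
            if (word.exchange(1) == 0) return;
        }
    }
    void unlock() {
        assert(word.load() == 1);
        word.store(0);
    }
};
```

In the TTS variant a waiting thread only reads the lock word while the lock is held, so failed probes do not invalidate the line in other caches.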
In our descriptions we will assume a sequentially consistent memory model and not consider the need for memory fence or barrier instructions.
MCS: The MCS lock [34] is the usual alternative to simple test-and-set-based locks, performing better under high contention, but also having a more complex path and often lagging behind simple locks under no or light contention. In MCS, arriving threads use an atomic operation to append an element to the tail of a linked list of waiting threads, and then busy-wait on a field within that element, avoiding global spinning as found in TS. The list forms a queue of waiting threads. The lock's tail variable is explicit and the head (the current owner) is implicit. When the owner releases the lock it reclaims the element it originally enqueued and sets the flag in the next element, passing ownership. To convey ownership, the MCS unlock operator must identify the successor, if any, and then store to the location where the successor busy-waits. The list forms a multiple-producer-single-consumer (MPSC) queue where any thread can enqueue but only the current owner can dequeue itself and pass ownership. The handover path is longer than that of TS locks and accesses more distinct shared locations.
MCS uses so-called local waiting where at most one thread is waiting on a given location at any one time. As such, an unlock operation will normally need to invalidate just one cache line (the line underlying the flag where the successor busy-waits) in one remote cache. Under contention, the unlock operator must fetch the address of the successor element from its own element, and then store into the flag in the successor's element, accessing two distinct cache lines, and incurring a dependent memory access to reach the successor. Absent contention, the unlock operator uses an atomic compare-and-swap (CAS) operator to try to detach the owner's element and set the tail variable to null.
MCS locks provide strict FIFO order. They are also compact, with the lock body requiring just a pointer to the tail of the chain of queue elements.
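The enqueue and handover steps described above can be sketched as follows. This is a minimal illustration, not a production implementation: the names MCSNode, MCSLock and granted are hypothetical, the queue element is passed explicitly to both operations (so this is not the POSIX interface), and std::this_thread::yield() stands in for PAUSE.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative sketch; names are hypothetical.
struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> granted{false};       // the flag the waiter spins on
};

class MCSLock {
    std::atomic<MCSNode*> tail{nullptr};    // explicit tail; the head is implicit
public:
    void lock(MCSNode* me) {
        assert(me != nullptr);
        me->next.store(nullptr);
        me->granted.store(false);
        MCSNode* prev = tail.exchange(me);  // atomically append to the chain
        if (prev == nullptr) return;        // queue was empty: we are the owner
        prev->next.store(me);               // link behind our predecessor
        while (!me->granted.load()) {
            std::this_thread::yield();      // local waiting on our own element
        }
    }
    void unlock(MCSNode* me) {
        assert(me != nullptr);
        MCSNode* succ = me->next.load();
        if (succ == nullptr) {
            // No visible successor: try to detach our element from the tail.
            MCSNode* expected = me;
            if (tail.compare_exchange_strong(expected, nullptr)) return;
            // A successor swapped the tail but has not linked itself yet.
            while ((succ = me->next.load()) == nullptr) {
                std::this_thread::yield();
            }
        }
        succ->granted.store(true);          // pass ownership directly
    }
};
```

A caller would typically declare an MCSNode in the frame that brackets the critical section and pass its address to both lock and unlock, which is only possible when locking is lexically scoped.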
One MCS queue element instance is required for each lock a thread currently holds, and an additional queue element is required while a thread is waiting on a lock. Queue elements cannot be shared concurrently and can appear on at most one queue (that is, be associated with at most one lock) at a given time. The standard POSIX pthread_mutex_lock and pthread_mutex_unlock operators do not require scoped or lexically balanced locking. As such, queue elements cannot be allocated on the stack. Instead, MCS implementations that expose a standard POSIX interface will typically allocate elements from thread-local free lists, populated on demand 1 .
MCS requires the address of the queue element inserted by the owner to be passed to the corresponding unlock operator, where it will be used to identify a successor, if any.
The standard POSIX interface does not provide any means to pass information from a lock operation to the corresponding unlock operator. As such, the address of the MCS queue element inserted by the owner thread is usually recorded in the lock instance so it can be conveyed to the subsequent unlock operation to identify the successor, if any. That field is protected by the lock itself and accessed within the critical section. Accesses to the field that records the owner's queue element address may themselves generate additional coherence traffic, although some implementations may avoid such accesses to shared fields by storing the queue element address in a thread-local associative structure that maps lock addresses to the owner's queue element address.
In short, because the standard POSIX pthread mutex APIs have no provision to pass data from the acquire operation to the corresponding unlock, typical MCS or CNA implementations adhering to that API are forced either to store the address of the owner's queue element in the lock structure itself, generating additional coherence traffic, or to use supplementary per-thread maps that associate held locks with the owning queue element.
CNA: Compact NUMA-Aware locks (CNA) [19] are based on MCS, but add NUMA-awareness. At arrival time, threads annotate their queue element with their NUMA node number. At unlock-time, the owner scans forward into the primary MCS chain and culls remote elements, transferring them to a secondary chain of remote threads. That secondary chain is propagated from the unlock operator to the successor via the queue elements, so the lock structure remains compact. Reducing the NUMA diversity of the primary chain acts to reduce lock migration [21] and improve performance. To avoid indefinite starvation of threads on the secondary chain, the unlock operator periodically flushes the secondary chain back into the primary chain to shift the currently preferred NUMA node. At unlock-time, if the primary chain is found empty, the secondary is flushed back into the primary to reprovision the primary chain. CNA unlock prefers to dispatch to threads on the primary, but will revert to the secondary list if the primary is empty. The secondary chain is manipulated under the lock itself, in the unlock operation. While CNA is NUMA-aware, compared to MCS, a number of additional CNA-specific administrative steps (culling, reprovisioning, periodic flushing) execute while the lock is held and are subsumed into the critical section, potentially increasing the effective hold time of the lock. We observe that all NUMA-aware locks trade off short-term fairness for improved overall throughput.
The Fissile Algorithm
Fissile augments CNA with a TS fast-path using the LOITER lock construction (Locking : Outer-Inner Transformation) [14] where the outer lock is a TS lock and the inner lock is a CNA lock. Acquiring ownership of the outer TS lock confers ownership of the compound Fissile lock. Arriving threads first try the fast-path TS lock and, if successful, immediately enter the critical section. Otherwise control diverts into the slow path where the thread acquires the inner CNA lock. We refer to the owner of the inner CNA lock as the alpha thread. Once the CNA lock has been acquired, the alpha thread then busy-waits on the TS outer lock. At most one thread at any one time busy-waits on the outer TS lock, avoiding the scalability impact of global spinning, where multiple threads simultaneously busy-wait on a given location. As there is at most one thread busy-waiting on the outer lock, we use TS instead of TTS. Once the outer lock has been acquired, we release the inner lock and enter the critical section. To release a Fissile lock, we simply release the outer TS lock, regardless of whether the corresponding acquisition took the fast path or slow path. A thread holds the inner CNA lock only within the Fissile lock acquisition operator. Specifically, Fissile releases the inner CNA lock within the Fissile acquire operation, but while still holding the outer TS lock, potentially extending the hold-time of the outer lock. This choice, however, allows us to allocate the MCS queue element on-stack, which is a distinct advantage, avoiding MCS queue element allocation and deallocation. (Classic MCS requires one allocated queue element for each lock concurrently held by a thread, whereas our approach avoids that expense.) Furthermore, the queue element of the alpha thread does not need to be communicated from the Fissile acquire operation to the unlock operation, as is the case for normal MCS and CNA.
We employ a specialized CNA implementation, described below, which shifts much of the administrative overhead specific to CNA and normally found in the unlock operator to run before we acquire the outer TS lock, so the overhead of releasing the CNA inner lock while holding the outer TS lock is minimized.
In Listing-1 we provide a sketch of the Fissile algorithm. The Outer field is a TS lock word which can take on 3 values: 0 indicates unlocked; 1 indicates locked; and 2 encodes a special locked state used when the alpha thread is impatient and the previous owner is transferring ownership of the outer TS lock directly to the alpha thread. Inner is the CNA inner lock, and Impatient reflects the state of the alpha thread. The design balances the quality of the NUMA admission schedule against reorganization latency, and fairness against throughput, since unfairness may be faster. Absent remediation, simple TS allows indefinite bypass and starvation of waiting threads. To avoid this issue, the alpha thread busy-waits on the TS lock for a short grace period but will then become "impatient" and cue direct handover of ownership the next time the TS lock is released, bounding bypass.
When the alpha thread becomes impatient, having failed to acquire the outer lock within the grace period, it sets the Impatient field from the normal state of 0 to 2. The unlock operator fetches from Impatient and stores that value into the TS lock word. In typical circumstances when unlock runs after the alpha has become impatient, it will observe and fetch 2 from Impatient and store that value into the TS lock word. The alpha will then notice that the value 2 has propagated from Impatient into the lock word, and takes direct handoff of ownership from that previous owner, restoring the lock word from 2 back to 1. If the unlock operation happens to run concurrently with the alpha thread becoming impatient, the unlock may race and fetch 0 from Impatient instead of 2. In this case either the alpha manages to seize the TS lock and acquire it when it becomes 0, or some other thread manages to pounce on the TS lock, in which case the alpha thread must wait one more lock cycle to take ownership. At worst, impatient handover is delayed by one acquire-release cycle. Once the value of 2 is visible to threads in unlock, immediate handover to the alpha is assured. Threads arriving in the fast-path that observe 2 will divert immediately into the CNA slow-path.
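The protocol just described can be approximated in a few lines of C++. The following is a reconstruction from the prose only and should not be read as the paper's own listing. Two substitutions are assumptions made for brevity: a plain FIFO ticket lock (TicketLock, a hypothetical name) stands in for the CNA inner lock, so no NUMA-aware reordering is shown, and std::this_thread::yield() stands in for PAUSE. The constant kGracePeriod is likewise hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Stand-in for the CNA inner lock (assumption): a simple FIFO ticket lock.
class TicketLock {
    std::atomic<unsigned> next{0};
    std::atomic<unsigned> serving{0};
public:
    void lock() {
        unsigned ticket = next.fetch_add(1);
        while (serving.load() != ticket) std::this_thread::yield();
    }
    void unlock() { serving.fetch_add(1); }
};

class FissileSketch {
    std::atomic<int> Outer{0};      // 0 unlocked, 1 locked, 2 locked with handover to alpha
    std::atomic<int> Impatient{0};  // 0 normal, 2 when the alpha thread is impatient
    TicketLock Inner;
    static constexpr int kGracePeriod = 50;  // hypothetical tunable
public:
    void lock() {
        int expected = 0;
        // Fast path: CAS fails if Outer is 1 or 2, so a pending handover
        // (value 2) is never overwritten by an arriving thread.
        if (Outer.compare_exchange_strong(expected, 1)) return;

        Inner.lock();               // slow path: this thread is now the alpha
        int spins = 0;
        for (;;) {
            // Only the alpha spins here, so a plain SWAP (TS) is used.
            int prev = Outer.exchange(1);
            if (prev != 1) break;   // 0: lock was free; 2: direct handover
            if (spins < kGracePeriod && ++spins == kGracePeriod) {
                Impatient.store(2); // grace period expired: request handover
            }
            std::this_thread::yield();
        }
        Impatient.store(0);         // we own Outer; return to the normal state
        Inner.unlock();             // let the next waiter become the alpha
    }
    void unlock() {
        assert(Outer.load() != 0);  // caller must hold the lock
        // Stores 0 normally, or 2 to hand ownership directly to the alpha.
        Outer.store(Impatient.load());
    }
};
```

Because the alpha's SWAP writes 1 regardless of the previous value, observing 2 both accepts the handover and restores the lock word to the ordinary locked state in a single step.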
The grace period serves as a tunable parameter reflecting the trade-off and tension between throughput and short-term fairness. A shorter grace period yields less bypass and fairer admission, while longer periods may allow better throughput but worse short-term fairness.
Fissile provides hybrid succession, employing competitive succession [14] when there is no contention, but switching to more conservative direct succession when the alpha thread becomes impatient. Under competitive succession, the owner releases the lock, allowing other waiting or arriving threads to acquire the lock. Unfettered competitive succession admits undesirable long-term unfairness and starvation but typically performs well under light load. In addition, competitive succession tends to provide more graceful throughput under preemption. In direct succession, as used by MCS, for instance, the lock holder directly transfers ownership to a waiting successor without any intermediate or intervening transition to an unlocked state. All strict FIFO locks employ direct succession. Direct succession suffers under preemption, however, as ownership may be conveyed to a preempted thread, and we have to wait for operating system time-slicing to dispatch the owner onto a processor. By analogy, competitive succession resembles CSMA/CD, while direct succession resembles Token Ring. By restricting the number of threads competing for the outer TS lock, we improve the odds that an arriving thread will find the lock clear and manage to acquire via the TS outer fast path. Under fixed load, the system will reach a "balanced" steady state where many circulating threads tend to acquire the TS lock without waiting.
As shown in [17], as more threads busy-wait on a given location, as is the case in TS, stores to that location take longer to propagate. (Concurrent reads to a given location scale, but concurrent writes or atomics do not [36].) Fissile addresses that concern by ensuring that only the alpha thread busy-waits on the outer TS lock at any given time, accelerating handover.
The TS fast path provides the following benefits. First, latency is reduced, relative to MCS and CNA, for the uncontended case. Acquisition requires an atomic instruction and release requires just a simple store. Second, the slow-path CNA MCS nodes can be allocated on-stack, simplifying the CNA implementation and avoiding the need to communicate or convey the owner's MCS node from the lock operation to the corresponding unlock. Third, TS with bounded bypass performs well under preemption, relative to MCS. Finally, and less obviously, the TS fast path provides benefit in the contended case. Fissile provides significant improvement over CNA when the critical section is small, and CNA has a hard time "keeping up" with the flow of arriving threads. That is, for very short critical sections, CNA itself (its overheads) becomes the bottleneck for throughput [22]. Under intense contention the TS lock allows more throughput, serving as an alternative bypass channel, giving contention "pressure" a way to get around CNA when CNA becomes the bottleneck. When the critical sections are longer, Fissile performs like CNA. Allowing some threads to pass through the CNA slow path and some fraction over the TS fast path would appear to dilute CNA's NUMA benefits, but in practice, we find that CNA still quickly acts to filter out remote threads from a set of threads circulating over a contended lock.
The result is a highly scalable NUMA-aware lock that performs like TS at low contention and as well as or better than CNA at high contention. Fissile provides short-term concurrency restriction [14] which may improve overall throughput over a contended lock. Fissile locks are compact and also tolerate preemption, by virtue of the TS outer lock, more gracefully than do CNA or MCS.
In fact the performance of Fissile often exceeds that of either of its underlying lock types.
Specialized CNA
Classic CNA performs reorganization of the MCS chain, to be more NUMA-friendly and reduce NUMA lock transitions, while holding the CNA lock itself, extending the effective critical section length and delaying handover to a successor. Handover time impacts scalability as the lock is held throughout handover, increasing the effective length of the critical section. At extreme contention, the critical section length determines throughput [2, 22]. Fissile uses a specialized variant of CNA which reorganizes the chain of waiting threads early, immediately after acquiring the CNA lock. As such, reorganization runs outside and before the TS critical section, off the critical path, and potentially allows pipelining and overlap with the critical section execution. (Arguably, earlier reorganization may suffer as there are fewer threads enqueued from which to schedule, but we have not observed any performance penalty related to this concern.) Our variant also differs from the original [19] as follows. Classic CNA, at unlock-time, culls the entire remote suffix of the primary chain into the remote list. Our variant looks ahead only one thread into the primary MCS chain, and provides constant-time culling costs, yielding less potentially futile scanning of the chain, and more predictable overheads. In addition, our look-ahead-one policy generates less coherence traffic accessing the MCS chain elements, as the element examined for potential culling would also be accessed in the near future when we subsequently release the CNA lock.
Finally, our version of CNA performs CNA administrative duties (flushing and culling) immediately after the owner acquires the CNA lock, whereas classic CNA defers those operations until unlock-time. Specifically, we reorganize outside and before the outer TS critical section, allowing more overlap between CNA administrative duties and the execution of the critical section, and accelerating CNA lock handover.
All the changes above are optional optimizations and are not required to use CNA within Fissile, but they serve to enhance performance.
Related Work
While mutual exclusion remains an active research topic [2, 4-6, 13, 18, 22-24, 26-28] we focus on locks closely related to our design.
NUMA-aware locks attempt to restrict ownership of a lock to threads on a given NUMA node over the short term, reducing so-called lock migration, which can result in expensive inter-node coherence traffic. The first NUMA-aware lock was HBO (Hierarchical Back Off) [35], a test-and-set lock where busy-waiting threads running on the same NUMA node as the current owner would use shorter back-off durations, favoring the odds of handover to such proximal threads relative to more distant threads. While simple, HBO suffers from the same issues as other TS locks.
Luchangco et al. [31] introduced HCLH, a NUMA-aware hierarchical version of the CLH queue-lock [12, 33]. The HCLH algorithm collects requests on each node into a local CLH-style queue, and then has the thread at the head of the queue integrate each node's queue into a single global queue. This avoids the overhead of spinning on a shared location and eliminates fairness and starvation issues. HCLH intentionally inserts non-work-conserving combining delays to increase the size of groups of threads to be merged into the global queue. It was subsequently discovered that HCLH required threads to be bound to one processor for the thread's lifetime. Failure to bind could result in exclusion and progress failures, and as such we will not consider HCLH further.
NUMA-aware Cohort locks [20, 21] spawned various derivatives [7, 8]. While cohort locks scale well, they have a large variable-sized footprint. The size of a cohort lock instance is a function of the number of NUMA nodes, and is thus not generally known until run-time, complicating static allocation of cohort locks. Being hierarchical in nature, they suffer increased latency under low or no contention as acquisition requires acquiring both node-level locks and the top-level lock. CNA avoids all these concerns and is superior to cohort locks. A changeset to convert the Linux kernel qspinlock low-level spin lock [10, 30] implementation from an MCS-based design to CNA is under submission at the time of writing 2 . Similarly, Fissile locks are readily portable into the kernel environment.
The Shuffling Lock of Kashyap et al. [27] also performs NUMA-aware reorganization of MCS chains of waiters off the critical path, by waiting threads. They also use a LOITER-based design, but do not allow bypass. In the evaluation section, below, we compare Fissile against their user-mode implementation.
LOITER-based designs [14] first appeared, to our knowledge, in the HotSpot Java Virtual Machine implementation 3 in 2007. The "Go" language runtime mutex [1] uses a LOITER-based scheme where the inner lock is implemented via a semaphore and time-bounded bypass is allowed. The Linux kernel qspinlock [30] construct also has a dual path TS and MCS lock, but does not allow bypass. The TS fast-path avoids MCS latency overheads in the uncontended case.
Various authors [4, 25] have suggested switching adaptively between MCS and lower latency locks depending on the contention level. While workable, this adds considerable algorithmic complexity, particularly for the changeover phase, and requires tuning. Lim and Agarwal [29] suggested a more general framework for switching locks at runtime. 
Empirical Evaluation
Unless otherwise noted, all data was collected on an Oracle X5-2 system. The system has 2 sockets, each populated with an Intel Xeon E5-2699 v3 CPU running at 2.30GHz. Each socket has 18 cores, and each core is 2-way hyperthreaded, yielding 72 logical CPUs in total. The system was running Ubuntu 18.04 with a stock Linux version 4.15 kernel, and all software was compiled using the provided GCC version 7.3 toolchain at optimization level "-O3". 64-bit C or C++ code was used for all experiments. Factory-provided system defaults were used in all cases, and Turbo mode [39] was left enabled. In all cases default free-range unbound threads were used.
We implemented all user-mode locks within LD_PRELOAD interposition libraries that expose the standard POSIX pthread_mutex_t programming interface using the framework from [21]. This allows us to change lock implementations by varying the LD_PRELOAD environment variable and without modifying the application code that uses locks. The C++ std::mutex construct maps directly to pthread_mutex primitives, so interposition works for both C and C++ code. All busy-wait loops used the Intel PAUSE instruction. We note that user-mode locks are not typically implemented as pure spin locks, instead often using a spin-then-park waiting policy which voluntarily surrenders the CPUs of waiting threads after a brief optimistic spinning period designed to reduce the context switching rate. In our case, we find that user-mode is a convenient venue for experiments, and note in passing that threads in the CNA slow-path are easily made to park.
We use a 128 byte sector size on Intel processors for alignment to avoid false sharing. The unit of coherence is 64 bytes throughout the cache hierarchy, but 128 bytes is required because of the adjacent cache line prefetch facility where pairs of lines are automatically fetched together.
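As an illustration of this alignment policy, a lock word can be padded and aligned to a 128-byte sector in C++ as sketched below. The names kSectorSize, PaddedLockWord and is_sector_aligned are hypothetical and chosen only for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sketch; names are hypothetical.
constexpr std::size_t kSectorSize = 128;   // two adjacent 64-byte cache lines

// alignas rounds both the alignment and the size of the struct up to a
// full sector, so neighbouring instances never share a prefetched pair.
struct alignas(kSectorSize) PaddedLockWord {
    std::atomic<int> word{0};
};

static_assert(alignof(PaddedLockWord) == kSectorSize, "sector aligned");
static_assert(sizeof(PaddedLockWord) == kSectorSize, "occupies a whole sector");

// Helper to check at run time that an address starts a sector.
inline bool is_sector_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kSectorSize == 0;
}
```

With this layout, adjacent elements of an array of PaddedLockWord fall in distinct sectors.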
MutexBench
The MutexBench benchmark spawns T concurrent threads. Each thread loops as follows: acquire a central lock L; execute a critical section; release L; execute a non-critical section. At the end of a 10 second measurement interval the benchmark reports the total number of aggregate iterations completed by all the threads. We report the median of 7 independent runs in Figure-1. The critical section advances a C++ std::mt19937 pseudo-random generator (PRNG) 2 steps. The non-critical section is empty. For clarity and to convey the maximum amount of information to allow a comparison of the algorithms, the X-axis is offset to the minimum score and the Y-axis is logarithmic.
Immediately before acquiring the lock, each thread fetches the value of a shared lock clock. The critical section advances that value. Subtracting the value fetched before acquiring the lock from the clock value fetched in the critical section gives a useful approximation of the thread's waiting time, given in units of lock acquisitions. Within the critical section, we log that waiting time value into a global array. After the measurement interval the benchmark harness post-processes the log to produce statistics describing the distribution of the waiting time values, which reflect short-term fairness of the lock algorithm. The critical section also tallies lock migrations. These activities increase the effective length of the critical section.
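The lock clock bookkeeping can be sketched as below. This is a simplified illustration rather than the benchmark's actual harness: std::mutex stands in for the lock under test, and the names WaitLogger, lockClock and one_iteration are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Illustrative sketch; names are hypothetical and std::mutex is a
// stand-in for whichever lock algorithm is being measured.
struct WaitLogger {
    std::mutex L;                             // the central lock
    std::atomic<std::uint64_t> lockClock{0};  // counts completed acquisitions
    std::vector<std::uint64_t> log;           // written only while holding L

    void one_iteration() {
        // Sample the clock immediately before trying to acquire.
        std::uint64_t before = lockClock.load();
        std::lock_guard<std::mutex> guard(L);
        // Inside the critical section: sample again and advance the clock.
        std::uint64_t now = lockClock.load();
        assert(now >= before);                // the clock only moves forward
        lockClock.store(now + 1);
        // Number of acquisitions by other threads while this one waited.
        log.push_back(now - before);
    }
};
```

A thread that acquires without being bypassed logs 0, while larger values indicate how many other acquisitions were admitted ahead of it.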
We ran the benchmark under the following lock algorithms: TTS is a simple test-and-test-and-set lock using classic truncated randomized binary exponential back-off [3, 34] with the back-off duration capped to 100000 iterations of a PAUSE loop; MCS is classic MCS; CNA is described in [19] with the probability of flushing the secondary chain into the primary configured as P = 1/256 4 ; Shuffle is Kashyap's Shuffle Lock [27] aqswonode variant 5 ; Fissile is the Fissile algorithm described above with the grace period configured as 50 steps of the TS loop executed by the alpha thread and the CNA flush probability configured for P = 1/256. The CNA flush interval is set to 50000 iterations of the waiting loop, with the PAUSE instruction, executed by the head of the CNA secondary chain.
4 We picked P = 1/256 to match the default value used by the Shuffle Lock, allowing a fair comparison between that lock and CNA.
5 Taken verbatim from https://github.com/sslab-gatech/shfllock/blob/master/ulocks/src/litl/src/aqswonode.c and integrated into our LD_PRELOAD framework.
In Figure-1 we make the following observations regarding operation at maximal contention with an empty non-critical section 6 .
• At 1 thread the benchmark measures the latency of uncontended acquire and release operations. MCS and CNA lag behind TTS, Shuffle and Fissile as they lack a fast-path.
• At or above 2 threads, most algorithms fall behind TTS as TTS starves all but one thread for long periods, effectively yielding performance near that found at just one thread.
In Table-1 we provide additional details for execution at 10 threads. Throughput is given in units of millions of acquires per second, aggregated over all threads; Spread reflects long-term fairness between threads, computed as the maximum number of iterations completed by any thread within the measurement interval divided by the minimum; Migration is the reciprocal of the NUMA lock migration rate. (A Migration value of N indicates that the lock migrated between NUMA nodes 1 out of every N lock acquisitions, on average.) The remaining columns describe the distribution of the observed waiting times, which we use to measure short-term fairness. RSTDDEV is the relative standard deviation [40]; Theil-T is the normalized Theil-T index [38, 42], used in the field of econometrics as a metric of income disparity and unfairness, where a value of 0 is ideally fair and 1 is maximally unfair.
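For concreteness, the fairness metrics named above can be computed as in the following sketch. The text defines Spread directly; for the other two we assume the standard definitions, namely the population standard deviation divided by the mean for RSTDDEV, and the Theil-T index (1/N) Σ (x_i/μ) ln(x_i/μ) divided by ln N for the normalized form. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative sketch; function names are hypothetical and the formulas
// are the standard textbook definitions (an assumption, see lead-in).

// Spread: maximum per-thread iteration count divided by the minimum.
// A thread that completed zero iterations yields an infinite spread.
double spread(const std::vector<double>& iters) {
    assert(!iters.empty());
    auto mm = std::minmax_element(iters.begin(), iters.end());
    return *mm.second / *mm.first;
}

double mean(const std::vector<double>& x) {
    assert(!x.empty());
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum / static_cast<double>(x.size());
}

// Relative standard deviation: population standard deviation over the mean.
double rstddev(const std::vector<double>& x) {
    double mu = mean(x);
    if (mu == 0.0) return 0.0;                 // all samples are zero
    double ss = 0.0;
    for (double v : x) ss += (v - mu) * (v - mu);
    return std::sqrt(ss / static_cast<double>(x.size())) / mu;
}

// Normalized Theil-T index: 0 is ideally fair, 1 is maximally unfair.
double theil_t_normalized(const std::vector<double>& x) {
    double mu = mean(x);
    if (x.size() < 2 || mu == 0.0) return 0.0; // index is undefined; report fair
    double sum = 0.0;
    for (double v : x) {
        if (v > 0.0) sum += (v / mu) * std::log(v / mu);  // 0 * ln 0 taken as 0
    }
    double n = static_cast<double>(x.size());
    return (sum / n) / std::log(n);
}
```

For example, four equal waiting times give a Theil-T of 0, whereas a sample in which a single thread accounts for all of the waiting gives a value of 1.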
We observe that TTS is deeply unfair over the long term and short term. TTS also exhibits a surprisingly low lock migration rate (on average 1 migration per 323 acquisitions), presumably arising from platform-specific cache line arbitration phenomena. Somewhat perversely, this makes TTS implicitly NUMA-friendly, reducing migration rates. TTS is vulnerable to the Matthew Effect [41]: once a thread has entered deeper back-off, it is less likely to acquire the lock in unit time, amplifying subsequent unfairness. The remaining locks show reasonable long-term and short-term fairness.
In Figure-2 we configure the benchmark so the non-critical section generates a uniformly distributed random value in [0, 200) and steps the thread-local random number generator that many steps, admitting potential positive scalability. In this moderate contention case we can see that Fissile and TTS locks tend to provide the best performance, although the TTS lock is again unfair.
A methodological trap arises when negative scaling occurs. Tuning the lock in isolation in this mode, at maximal contention with an empty non-critical section, can lead us to incorrectly tune toward making the lock "worse", as the ensuing inefficient delays may reduce the arrival rate and throttle concurrency, thus making aggregate performance better in the maximal contention case as we worsen the lock for the general case. Such slower-is-faster (SIF) results are deceptive and confounding (see https://blogs.oracle.com/dave/the-perils-of-negative-scalability and https://en.wikipedia.org/wiki/Brooks's_law and https://arxiv.org/abs/1506.06796 ).
std::atomic
In Figure-3 we use a benchmark harness similar to that of MutexBench but with the following differences. The non-critical section uses a thread-local std::mt19937 pseudo-random number generator (PRNG) to compute a value distributed uniformly in [0, 200) and then advances the PRNG that many steps. Instead of an explicit critical section, each iteration executes A.load() where A is a shared instance of std::atomic<T> and T is a simple struct containing five 32-bit integer fields. The C++ compiler and runtime implement std::atomic for such objects by hashing the address of the instance into an array of mutexes, and acquiring those as needed to implement the desired atomic action. Interestingly, the NUMA-aware locks, CNA, Shuffle and Fissile, exhibit fading performance between 5 and 10 threads, but performance recovers at higher thread counts when there are sufficient waiting threads to profitably reorder for a NUMA-friendly admission schedule. In Figure-4 we repeat the experiment in Figure-3 on an Oracle X5-4, which has 4 NUMA nodes, 18 cores per socket and 2 hyperthreads per core, for 144 logical CPUs, demonstrating that our approach generalizes to larger NUMA systems. The onset of benefit provided by NUMA-aware locks is somewhat delayed as we have 4 nodes instead of 2 and, at a given thread count, threads are more dispersed and the socket is less populated.
Figure 3. C++ std::atomic
FIFO Support
Fissile allows bypass both over the outer lock and within the CNA inner lock. We can, however, easily modify Fissile to provide expedited FIFO-like admission service as follows. First, FIFO locking requests that pass into the slow path mark their CNA MCS queue element with a "FIFO" flag. CNA culling refrains from shifting such elements into the CNA secondary list. Critically, if element S is marked as FIFO, then no requests that arrive after S on the inner CNA lock will acquire that lock before S. We also suppress bypass over the outer lock while FIFO requests are waiting. To that end, instead of setting and clearing the Impatient field we modify Fissile slightly to atomically fetch-and-add Impatient by 2 or -2, respectively. (We also make a corresponding change to the comparison in the grace period loop from == 0 to != 1.) When a FIFO request diverts into the slow path, it increments Impatient by 2 before acquiring the CNA inner lock, and decrements by 2 after acquiring the outer lock. The request will be serviced in FIFO order, without being bypassed by more recently arrived threads, once it has incremented Impatient (and that value has become visible to threads in the unlock path) and has executed the SWAP instruction that appends the request to the CNA MCS chain.
To demonstrate the efficacy of FIFO-enabled Fissile, we extended the MutexBench benchmark harness to allow a mixture of normal and FIFO-designated threads, both competing for a common lock. We used 25 normal threads and 2 FIFO threads. Normal threads advance the global PRNG 2 times in the critical section, as described above, and in the non-critical section compute a uniformly distributed random number in [0, 100) and advance a thread-local PRNG instance that many steps. FIFO threads execute the same critical section, but use a non-critical section duration randomly selected from the range [0, 2000), reflecting intermittent low duty-cycle FIFO operations. The FIFO attribute is per-thread (but could also be specified for individual locking operations) and is ignored by all lock implementations except FIFO-enabled Fissile. All FIFO data was taken on the X5-2.
Table-2 shows the results, comparing Fissile, FIFO-enabled Fissile, and MCS. We report throughput over a 10 second measurement interval broken out for the normal threads and the FIFO threads. We also report statistics describing the observed wait times, computed in logical lock clock units, for the FIFO threads in isolation. As we can see, Fissile+FIFO provides wait times very close to those afforded by MCS, and with greater throughput for both normal and FIFO threads.
Conclusion
Fissile locks are compact, NUMA-aware, preemption tolerant, and scalable, but also provide excellent latency at low or no contention. The algorithm is straightforward, imposing no surprising or onerous requirements. Bypass over the outer lock via the fast path is the key to Fissile. While the slow path provides a higher quality NUMA-friendly admission schedule, it also suffers higher latency arising from the more complex lock mechanism. The fast path allows for low latency in the uncontended case, but also improves scalability under contention by augmenting the slow path with an alternative if the slow path lock overheads prove a bottleneck.
In the Appendix we identify a number of variations on the basic Fissile algorithm that we plan to explore in the future.
Appendix : Algorithmic and Implementation Variations
▶ CNA with support for expedited admission : CNA can trivially be extended to better support real-time or FIFO acquisition requests by marking the MCS queue element as expedited. Such expedited elements will not be culled into the secondary list of remote threads. Thus, if S is the MCS queue element associated with an expedited lock acquisition operation, then no threads that arrived after S will be admitted before S.
In the context of Fissile locks, once an expedited thread acquires the CNA inner lock, it can then immediately assert impatience, inhibiting bypass over the outer TS lock, and providing the desired admission service.
▶ CNA -triggering flushes: Classic CNA runs Bernoulli trials in the unlock path to decide whether to flush the remote chain into primary and change the preferred NUMA node, in order to provide long-term fairness. We have experimented with variants where the head of the remote chain monitors how long it has waited and, if necessary, sets a flag in its MCS queue element to cue flushing the remote list into the primary MCS chain, in order to avoid starvation. The CNA unlock operator checks that flag, and if set, flushes the remote queue and changes the preferred NUMA node. This approach yields a time-based anti-starvation policy instead of the count-based Bernoulli trials as found in the original CNA and shifts the Bernoulli trial out of the unlock path, replacing it with a fetch of a location that is usually in cache. In addition, we can use polite constructs such as MONITOR-MWAIT for timekeeping.
▶ Probabilistic bounded bypass : We can provide bounded bypass over the outer TS lock as follows, without requiring an explicit "Impatient" state to be encoded or stored in the lock structure. Briefly, arriving threads run a biased Bernoulli trial with probability of success P = 1/256. On success, threads immediately divert into the CNA slow-path and skip trying the outer TS fast-path. If we have a set of threads dominating the outer TS lock and starving the CNA inner lock owner, they will eventually self-decimate and pass through the CNA inner lock, providing the desired anti-starvation. This approach is simple and effective, and avoids both the extra state and the coherence traffic needed to update it. Bypass also ensures a "trickle" of threads will pass over the CNA inner lock, acting to homogenize the set of active circulating threads and reduce lock migration.
With probabilistic bounded bypass (PrbBB), Unlock() remains a simple store of 0, and impatience is count-based instead of time-based. The CNA inner lock acts as a sieve that filters the threads circulating over the outer TS lock in the active circulating set (ACS). We conjecture that this lazy PrbBB filter suffices to eventually homogenize the ACS.
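As an illustration of probabilistic bounded bypass, the following is a minimal sketch and not the reference implementation. It assumes a plain spin lock (`InnerLockStandIn`) in place of the CNA inner lock, a thread-local xorshift32 generator, and a trial that succeeds when the low 8 bits of the random value are zero, giving P = 1/256. All names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Stand-in for the CNA inner lock (assumption: a plain spin lock that is
// neither queue-based nor NUMA-aware).
class InnerLockStandIn {
    std::atomic<bool> held_{false};
public:
    void lock()   { while (held_.exchange(true)) std::this_thread::yield(); }
    void unlock() { held_.store(false); }
};

// Biased Bernoulli trial: succeeds when the low 8 bits of r are zero,
// which is P = 1/256 for a uniformly distributed r.
inline bool divert_to_slow_path(std::uint32_t r) { return (r & 0xFFu) == 0; }

// Thread-local xorshift32 generator (hypothetical choice; every thread
// starts from the same seed, which is acceptable for a sketch).
inline std::uint32_t next_random() {
    thread_local std::uint32_t state = 0x9E3779B9u;
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

class PrbBBLockSketch {
    std::atomic<int> outer_{0};   // TS lock: 0 = unlocked, 1 = locked
    InnerLockStandIn inner_;
public:
    // Acquire using the supplied random value; returns true when the lock
    // was taken via the fast path.
    bool lock_with(std::uint32_t r) {
        // Skip the fast path entirely when the trial succeeds.
        if (!divert_to_slow_path(r) && outer_.exchange(1) == 0) return true;
        inner_.lock();                                   // slow path
        while (outer_.exchange(1) != 0) std::this_thread::yield();
        inner_.unlock();
        return false;
    }
    bool lock() { return lock_with(next_random()); }

    // Unlock stays a simple store of 0; no Impatient state is consulted.
    void unlock() {
        assert(outer_.load() == 1);
        outer_.store(0);
    }
    bool is_locked() const { return outer_.load() != 0; }
};
```

Because every arriving thread occasionally diverts regardless of the lock state, no thread can monopolize the fast path indefinitely, and the unlock path needs no extra loads.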
▶ Compact single-word form : We can construct a single-word compact form of Fissile by collapsing the Outer, Inner, and Impatient fields into a single word. This condensed version is appropriate to replace the Linux kernel's qspinlock construct [10, 17, 30]. Briefly, the least significant byte serves as the outer lock, the next most significant bit encodes the impatient state, and the remaining higher order bits encode the tail of the CNA MCS queue. The Fissile unlock operator stores 0 into the separately addressable low-order byte to release the lock. We note that this encoding requires mixed-size atomic accesses to the same location, the safety of which is platform-dependent. The current qspinlock implementation also depends on mixed-size accesses.
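The bit layout described above can be sketched as follows. This is an illustration under stated assumptions: a 32-bit word, a tail stored as an opaque index with 0 meaning an empty queue, and hypothetical helper names. Because a byte-sized store into a wider atomic is not expressible in portable C++, the sketch substitutes an atomic AND that clears the low byte; this is a different instruction from the simple byte store in the text.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed layout of the 32-bit lock word:
//   bits 0..7   outer TS lock (least significant byte)
//   bit  8      impatient state
//   bits 9..31  tail of the CNA MCS queue (opaque index, 0 = empty)
constexpr std::uint32_t kOuterMask    = 0xFFu;
constexpr std::uint32_t kImpatientBit = 1u << 8;
constexpr unsigned      kTailShift    = 9;

constexpr std::uint32_t pack(std::uint32_t tail, bool impatient, bool locked) {
    return (tail << kTailShift) | (impatient ? kImpatientBit : 0u) |
           (locked ? 1u : 0u);
}
constexpr bool is_locked(std::uint32_t w)    { return (w & kOuterMask) != 0; }
constexpr bool is_impatient(std::uint32_t w) { return (w & kImpatientBit) != 0; }
constexpr std::uint32_t tail_of(std::uint32_t w) { return w >> kTailShift; }

// Portable substitute for the low-order byte store: clear only the outer
// lock byte, leaving the impatient bit and the tail untouched.
inline void release_outer(std::atomic<std::uint32_t>& word) {
    assert(is_locked(word.load()));
    word.fetch_and(~kOuterMask);
}
```

For example, `pack(5, true, true)` yields a word whose tail is 5 with both the impatient bit and the lock byte set, and `release_outer` clears only the lock byte.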
▶ Deferred release of the CNA inner lock : We have investigated deferring the release of the CNA inner lock until we unlock the Fissile lock proper, and specifically until after dropping the outer TS lock. This may improve scalability by shifting CNA administrative work (culling and flushing activities) outside and after the TS critical section. While appealing, this change means that MCS queue elements cannot be allocated on-stack, necessitating more complex queue element lifecycle memory management. Each thread must have one allocated queue element for each lock that it holds, whereas with on-stack allocation each thread has at most one active queue element. Furthermore, we must convey the address of the MCS queue element (the CNA owner's element) from the Fissile acquire operation to the corresponding unlock operation. As noted above, typical locking APIs do not have provisions to pass such information from acquire to unlock. A viable solution is to implement a thread-local cache that contains at most one element, a reference to the MCS queue element for the most recently acquired inner lock held by that thread.
In the Fissile slow path, either before or after acquiring the CNA inner lock, a thread checks its cache. If a pending deferred queue element is present, it releases the associated CNA lock and clears the cache. It can then install the current queue element into its cache. In the Fissile unlock operation, after releasing the TS outer lock, the thread again checks and clears its cache, releasing any locks associated with a pending queue element found in the cache.
The following benign scenario can arise. Thread T1 acquires contended lock L1 via the slow path using MCS queue element E1. Our approach defers the release of the CNA inner lock of L1 and stores E1 into T1's cache. T1 then goes on to acquire another contended lock L2 using MCS element E2, displacing E1 from T1's cache and forcing the early release of L1's inner CNA lock via E1. E1 can then be reclaimed and E2 is installed in the cache. As noted, the premature release of L1's inner CNA lock is harmless.
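The one-element thread-local cache and the displacement scenario above can be modeled with the short sketch below. It is an assumption-laden illustration: `InnerLockStandIn` only tracks whether an inner lock is held and performs no synchronization, `QueueElement` is a heap-allocated placeholder for an MCS queue element, and the function names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a CNA inner lock: it models only the held/free
// state and performs no synchronization.
struct InnerLockStandIn {
    bool held = false;
    void acquire() { assert(!held); held = true; }
    void release() { assert(held);  held = false; }
};

// Placeholder for a heap-allocated MCS queue element; it remembers which
// inner lock it owns so that the release can be performed later.
struct QueueElement {
    InnerLockStandIn* lock;
};

// Thread-local cache holding at most one deferred queue element.
thread_local std::unique_ptr<QueueElement> deferred;

// Release the inner lock of any pending element and reclaim the element.
inline void flush_deferred() {
    if (deferred) {
        deferred->lock->release();
        deferred.reset();
    }
}

// Install a new element, displacing (and releasing) any pending one.
inline void defer_release(std::unique_ptr<QueueElement> e) {
    flush_deferred();
    deferred = std::move(e);
}
```

In this model the slow path would call `defer_release` after acquiring an inner lock, and the unlock operation would call `flush_deferred` after releasing the outer TS lock. Installing E2 while E1 is pending releases L1's inner lock early, which matches the benign scenario.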
▶ Deferred wakeup of the CNA successor : Instead of deferring the release of the CNA inner lock until the Fissile unlock, and passing the queue element reference from the Fissile acquire operation to the corresponding unlock operation, we can perform a partial MCS release in the Fissile lock operation. The partial release operator clears the MCS Tail variable if there are no other elements on the chain, and otherwise identifies the successor, the next element in the MCS chain. If there is a successor in the chain, the partial release does not notify that thread. Instead, we pass a reference to the successor queue element to the corresponding Fissile unlock operator, which in turn passes ownership to the successor by setting the usual MCS flag in the successor's queue element. This again allows the queue elements to be allocated on-stack and simplifies memory management.
As above, we can employ a thread-local cache of 1 element to help convey the successor reference to the unlock operator. The deferred notification can occur at any point between acquiring the TS outer lock and immediately after releasing the TS outer lock. Thread T1 acquires contended lock L1 via the slow path, having performed a partial release on the inner lock of L1, identifying thread T2 (or more precisely the MCS node associated with T2) as the successor. T1 then sets its thread-local cache to refer to the successor on L1, thread T2. When T1 releases L1, it notifies T2 via T2's queue element, so T2 starts running as the owner of L1's inner lock. However T1, while holding L1, might acquire contended lock L2 via the slow path. During acquisition of L2, T1 will need to displace T2 from its cache, waking T2 early, and installing the successor for L2, T3, in its cache. Such early wakeup is benign.
▶ Simplified encoding of the "Impatient" state : We remove the explicit Impatient field from the Fissile lock structure. The outer lock word encoding changes as follows : 0 indicates unlocked; 1 indicates locked; 2 indicates locked but with an impatient alpha thread. When the alpha thread becomes impatient, it executes an atomic fetch-and-increment on the outer TS lock word. If the value advanced from 0 to 1, then the alpha thread acquired the outer TS lock. Otherwise the value advanced from 1 to 2, indicating that the TS lock was held by some other thread. The alpha thread waits impatiently for the word to change back to 1, at which point it has gained ownership of the outer TS lock. The Fissile unlock operator simply executes an atomic decrement on the lock word, shifting the value from either 2 to 1 or 1 to 0. While more compact, this variant requires an atomic decrement in the unlock path, instead of a simple store. We note that the form that uses an explicit store depends on eventual consistency, where threads in unlock will eventually observe the Impatient = 2 value set by an impatiently waiting alpha thread.
▶ 3-Stage Gated : We note that we can use a 3-Stage construction [17]. The embodiment described above has 2 stages of waiting for threads that encounter contention : the inner lock and the outer lock. Adding an extra Gate stage can, paradoxically, act to reduce handover latency. Contended threads in the slow path proceed as follows : acquire the CNA inner lock; wait until the Gate becomes 0; set the Gate to 1; release the inner CNA lock; wait for and acquire the outer TS lock; clear the Gate; execute the critical section; and finally release the outer TS lock. At any given time there can be at most one thread waiting on the gate and at most one thread busy-waiting on the outer lock. Clearing the Gate entails low latency as there is at most one thread waiting on the gate, allowing faster handover and improved pipelining of lock acquisition operations.
Note that acquiring the Gate occurs only while holding the inner lock, so atomics are not required.
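The gated slow path can be sketched as below. This is a minimal illustration, not the authors' code: a plain spin lock (`InnerLockStandIn`) stands in for the CNA inner lock, all names are hypothetical, and the Gate is a `std::atomic<int>` accessed only with plain loads and stores (no read-modify-write), which is how C++ expresses "atomics are not required" without introducing a data race.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Stand-in for the CNA inner lock (assumption: a plain spin lock that is
// neither queue-based nor NUMA-aware).
class InnerLockStandIn {
    std::atomic<bool> held_{false};
public:
    void lock()   { while (held_.exchange(true)) std::this_thread::yield(); }
    void unlock() { held_.store(false); }
};

class GatedFissileSketch {
    std::atomic<int> outer_{0};   // TS lock: 0 = unlocked, 1 = locked
    std::atomic<int> gate_{0};    // 0 = open, 1 = occupied
    InnerLockStandIn inner_;
public:
    // Fast path: one atomic exchange on the outer TS lock.
    bool try_fast() { return outer_.exchange(1) == 0; }

    // Slow path, following the sequence of steps given in the text.
    void lock_slow() {
        inner_.lock();                                        // acquire inner lock
        while (gate_.load() != 0) std::this_thread::yield();  // wait for Gate == 0
        gate_.store(1);       // set Gate; only the inner lock holder does this
        inner_.unlock();                                      // release inner lock
        while (outer_.exchange(1) != 0) std::this_thread::yield();  // acquire outer
        gate_.store(0);                                       // clear the Gate
    }

    void lock() { if (!try_fast()) lock_slow(); }

    void unlock() {
        assert(outer_.load() == 1);
        outer_.store(0);
    }
    bool outer_held() const  { return outer_.load() != 0; }
    bool gate_closed() const { return gate_.load() != 0; }
};
```

Since the Gate is set only by the inner lock holder and cleared only by the thread that set it, a load followed by a store is sufficient.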
Instead of a simple gate, which allows at most one thread, we could also employ a semaphore to allow a very small number of threads to busy-wait on the outer TS lock.
▶ 3-Stage with outer ticket lock : Borrowing from TWA [17] (TWA-Staged and 3-Stage variations) we can easily construct a 3-Stage Fissile lock where the outer lock is a ticket lock, with differentiated near and far waiting on the ticket lock. Threads in the slow path proceed as follows: acquire the CNA inner lock; fetch and increment the ticket variable to assign a unique ticket value to the locking request; busy-wait while the assigned ticket value differs from the ticket lock's grant field by 2 or more (far waiting); release the CNA inner lock; busy-wait until the assigned ticket equals the grant variable (near waiting); enter and execute the critical section; release the outer ticket lock by incrementing the grant variable. A non-atomic increment suffices in the unlock path. In this formulation no bypass is allowed, although a fast path is feasible. The CNA inner lock completely dictates the order of admission, and we use the ticket lock, which has efficient handover under light contention, when a thread nears the front of the conceptual queue of waiting threads. A very small number of threads wait on the ticket lock at any given time, leveraging its excellent behavior in that mode.
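The ticket-lock variant, with its far and near waiting phases, can be sketched as follows. As before, this is an illustration under assumptions: `InnerLockStandIn` is a plain spin lock in place of CNA, the class and accessor names are hypothetical, and no fast path is included.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Stand-in for the CNA inner lock (assumption: a plain spin lock that is
// neither queue-based nor NUMA-aware).
class InnerLockStandIn {
    std::atomic<bool> held_{false};
public:
    void lock()   { while (held_.exchange(true)) std::this_thread::yield(); }
    void unlock() { held_.store(false); }
};

class StagedTicketSketch {
    InnerLockStandIn inner_;
    std::atomic<std::uint32_t> ticket_{0};  // next ticket to hand out
    std::atomic<std::uint32_t> grant_{0};   // ticket currently admitted
public:
    void lock() {
        inner_.lock();
        const std::uint32_t my = ticket_.fetch_add(1);
        // Far waiting: hold the inner lock while 2 or more tickets away.
        // Unsigned subtraction keeps the distance correct across wraparound.
        while (my - grant_.load() >= 2) std::this_thread::yield();
        inner_.unlock();
        // Near waiting: spin until our ticket is granted.
        while (grant_.load() != my) std::this_thread::yield();
    }

    void unlock() {
        // Only the owner writes grant, so a load followed by a store
        // suffices; no atomic read-modify-write is needed.
        grant_.store(grant_.load() + 1);
    }

    std::uint32_t tickets_issued() const { return ticket_.load(); }
    std::uint32_t granted() const        { return grant_.load(); }
};
```

At most two threads hold tickets that are not yet granted's successor distance away: the owner and one near waiter, while a far waiter blocks all later arrivals on the inner lock.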
▶ Impatience policies: Myriad policies are possible for setting and clearing the "impatient" state. Our reference implementation uses a simple form where the alpha thread waits, if necessary becomes impatient, and then cancels the impatient state once it obtains the outer TS lock. Other possibilities include leaving the impatient state set for all threads currently waiting on the inner lock, or until the inner lock drains to empty state [1] .
The "Go" Mutex uses a mostly LIFO semaphore in a manner similar to that of the Fissile inner lock, and similar to the LIFO-CR construct described in [14] .
▶ TS Tunables: Our current implementation allows an arriving thread to try the TS lock just once. Experiments suggest, however, some benefit to allowing a brief bounded polite spinning phase, potentially with back-off, before diverting into the slow-path, as long as the lock remains impatient. In addition, in our current implementation, the alpha thread uses a simple busy-wait loop with no back-off. Employing some moderate back-off policy may be useful. While these avenues show promise, our current implementation has no TTS tunable values and exhibits desirable parameter parsimony.
▶ Spin-then-Park waiting: As described, stalled threads use simple busy-waiting. We note, however, that it is relatively simple to modify Fissile locks so that threads waiting on the CNA inner lock will park, descheduling themselves via the operating system so the CPU where they were running can dispatch other ready threads or become idle.
Shuffle, for instance, delegates NUMA reorganization to other waiting threads, allowing parallelism between reorganization and execution of the critical section. While elegant, for short critical sections with intense contention the coherent communications cost can come to dominate performance, and make this approach unprofitable. That is, the granularity of work being delegated does not overcome the communication costs to delegate and coordinate. And when the critical sections are longer, any additional CNA overheads in the critical path are less important, so delegated execution often has no appreciable benefit relative to CNA.
▶ TS : polite vs impolite : Anderson [3] observed that "polite" test-and-test-and-set locks may be a better choice for contended locks than simple "impolite" test-and-set locks. Test-and-test-and-set locks are polite in the sense that they first load and check the lock word before conditionally attempting the atomic operation to acquire the lock. Simple test-and-set locks are optimistic and forego that load and check and simply try the atomic operation. But if the lock is already held, such futile atomic operators may generate unnecessary coherence and write invalidation. A test-and-test-and-set strategy acts to reduce those overheads and the rate of failed atomic operations.
We have found, however, that a simple impolite test-and-set policy is appropriate for the outer Fissile TS lock, ostensibly as there is at most one thread busy-waiting on the outer lock at any given time. Relatedly, polite test-and-test-and-set (TTS) acquisition may incur more coherence bus transactions in the case where the lock is not held but the lock word is not in cache or is in dirty state in a remote cache, as would be the case if the previous owner ran on a CPU with a different cache. Absent coherence probe speculation, one bus operation is needed to load the lock and a second to upgrade the cache line to written state. This scenario is common if the lock instance is promiscuous, being locked by multiple different threads, but with little or no contention.
Performant test-and-set locks will typically probe the lock directly with an atomic on arrival, optimistically assuming they can acquire the lock, but then shift to a polite TTS mode as they busy-wait. This strategy reduces latency in the uncontended case but acts to reduce unnecessary write invalidation and coherence traffic in the contended case.
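The hybrid strategy just described, an impolite probe on arrival followed by polite waiting, can be sketched as below. This is a generic illustration with hypothetical names, using sequentially consistent atomics and no back-off; it is not the outer lock of the reference implementation, which the text says is simply impolite.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class HybridTSLockSketch {
    std::atomic<int> word_{0};   // 0 = unlocked, 1 = locked
public:
    // Impolite probe: attempt the atomic exchange without loading first.
    bool try_lock() { return word_.exchange(1) == 0; }

    void lock() {
        // Optimistic probe on arrival keeps uncontended latency low.
        if (try_lock()) return;
        for (;;) {
            // Polite phase: wait with loads only, so that failed attempts
            // do not cause write invalidations.
            while (word_.load() != 0) std::this_thread::yield();
            // The lock was observed free; now attempt the atomic.
            if (word_.exchange(1) == 0) return;
        }
    }

    // Release is a simple store.
    void unlock() {
        assert(word_.load() == 1);
        word_.store(0);
    }
};
```

A thread that finds the lock free pays for exactly one atomic operation, while waiters spin on a locally cached copy of the lock word until it changes.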
Polite lock attempts avoid futile atomic operators which generate unnecessary coherence traffic via write invalidation if the lock is found held.
▶ Anti-starvation: Fissile requires two types of anti-starvation. The first is in CNA, to ensure that lock operations from remote NUMA nodes are eventually serviced, and the second is managed by the alpha thread to avoid indefinite bypass over the TS outer lock.
▶ Linux kernel qspinlock:
The Linux qspinlock construct [9, 10, 30] is a compact 32-bit lock, even on 64-bit architectures. The low-order bits of the lock word constitute a simple test-and-set lock while the upper bits encode the tail of an MCS chain. The result is a hybrid of MCS and test-and-set. In order to fit into a 32-bit word, a critical requirement, the chain is formed by logical CPU identifiers instead of traditional MCS queue node pointers. Arriving threads attempt to acquire the test-and-set lock embedded in the low-order bits of the lock word. This attempt fails if the test-and-set lock is held or if the MCS chain is populated. If successful, they enter the critical section, otherwise they join the MCS chain embedded in the upper bits of the lock word. When a thread becomes an owner of the MCS lock, it can wait for the test-and-set lock to become clear, at which point it claims the test-and-set lock, releases the MCS lock, and then enters the critical section. The MCS aspect of qspinlock is used only when there is contention. The unlock operator simply clears the test-and-set lock. The MCS lock is never held over the critical section, but only during contended acquisition. Only the owner of the MCS lock spins on the test-and-set lock, reducing coherence traffic. Qspinlock is strictly FIFO. While the technique employs local spinning on the MCS chain, unlike traditional MCS, arriving and departing threads will both update the common lock word, increasing coherence traffic and degrading performance relative to classic MCS. Qspinlock incorporates an additional optimization where the first contending thread spins on the test-and-set lock instead of using the MCS path. Traditional MCS does not fit well in the Linux kernel as (a) the constraint that a low-level spin lock instance be only 32 bits is a firm requirement, and (b) the lock-unlock API does not provide a convenient way to pass the owner's MCS queue node address from lock to unlock.
We note that qspinlocks replaced classic ticket locks as the kernel's primary low-level spin lock mechanism in 2014, and ticket locks replaced test-and-set locks, which are unfair and allow unbounded bypass, in 2008 [11] .
We note that the kernel provides a specialized qspinlock form for paravirtualized environments within virtual machines, the so-called "PV-friendly" qspinlock. We believe the same properties that make Fissile tolerant of preemption also make it inherently PV-friendly, and, if used in the kernel, Fissile may obviate the need for a different PV-friendly form.
https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c
Qspinlock provides a LOITER-style [15] lock with the outer lock consisting of a test-and-set lock and the inner lock consisting of the MCS lock, with both locks embedded in the same 32-bit word.
▶ Original abstract : Classic test-and-set (TS) mutual exclusion locks [3] are simple, and enjoy high performance and low latency of ownership transfer under light or no contention, but do not scale gracefully under high contention. Furthermore, TS locks do not provide any admission order guarantees, and may allow sustained starvation of waiting threads and long-term unfairness.
Such concerns led to the development of scalable queue-based locks such as MCS locks (Mellor-Crummey and Scott) [34] and NUMA-aware variants thereof such as Compact NUMA-aware Locks (CNA) [16, 19]. Both MCS and CNA scale under load, but have more complicated lock handover operations than TS and suffer higher latencies at low contention.
We propose Fissile locks, which capture the most desirable properties of both TS and CNA. A Fissile lock consists of two underlying locks: a TS lock and a CNA lock. Acquiring ownership of the TS lock confers ownership of the compound Fissile Lock. Arriving threads first use an atomic instruction to try to acquire the TS lock. If successful, they immediately enter the critical section, and we say the Fissile lock was acquired via the fast path. If the fast path attempt fails, the thread then acquires the CNA lock and busy-waits on the TS lock, releases the CNA lock, and enters the critical section. Releasing a Fissile lock entails just releasing the TS lock. Contended locking uses the CNA lock while uncontended operations use the TS lock. Fissile locks deflect contention away from TS lock into the CNA lock.
To avoid TS-based starvation, the thread holding the CNA lock and waiting on the TS lock can become "impatient" and cue direct handover of ownership the next time the TS lock is released, bounding bypass. The result is a highly scalable NUMA-aware lock that performs like TS at low contention, enjoying low latency, and like CNA at high contention.
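The two-lock composition summarized in this abstract can be sketched as follows. The sketch is deliberately incomplete and rests on assumptions: a plain spin lock (`InnerLockStandIn`) replaces CNA, the impatience mechanism that bounds bypass is omitted, and all names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Stand-in for the CNA inner lock (assumption: a plain spin lock that is
// neither queue-based nor NUMA-aware; it only serializes slow-path threads).
class InnerLockStandIn {
    std::atomic<bool> held_{false};
public:
    void lock()   { while (held_.exchange(true)) std::this_thread::yield(); }
    void unlock() { held_.store(false); }
    bool held() const { return held_.load(); }
};

class FissileSketch {
    std::atomic<int> outer_{0};   // TS lock: 0 = unlocked, 1 = locked
    InnerLockStandIn inner_;
public:
    // Fast path: a single atomic exchange on the TS lock.
    bool try_fast() { return outer_.exchange(1) == 0; }

    // Slow path: acquire the inner lock, busy-wait on the TS lock, then
    // release the inner lock before entering the critical section.
    // Impatience (bounded bypass) is not modeled here.
    void lock_slow() {
        inner_.lock();
        while (outer_.exchange(1) != 0) std::this_thread::yield();
        inner_.unlock();
    }

    // Returns true when the lock was acquired via the fast path.
    bool lock() {
        if (try_fast()) return true;
        lock_slow();
        return false;
    }

    // Releasing the Fissile lock entails just releasing the TS lock.
    void unlock() {
        assert(outer_.load() == 1);
        outer_.store(0);
    }
    bool is_locked() const  { return outer_.load() != 0; }
    bool inner_held() const { return inner_.held(); }
};
```

Because only the inner lock holder spins on the outer word, at most one slow-path thread competes with fast-path arrivals at any time, and the inner lock is never held across the critical section.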
▶ Comparison of lock properties: 
