Applications running in modern multithreaded environments are sometimes overthreaded 1 . The excess threads do not improve performance, and in fact may act to degrade performance via scalability collapse 2 . Often, such software also has highly contended locks. We opportunistically leverage the existence of such locks by modifying the lock admission policy so as to intentionally limit the number of distinct threads circulating over the lock in a given period. Specifically, if there are more threads circulating than are necessary to keep the lock saturated (continuously held), our approach will selectively cull and passivate some of those excess threads. We borrow the concept of swapping from the field of memory management and impose concurrency restriction (CR) if a lock is oversubscribed 3 . The resultant admission order is unfair over the short term but we explicitly provide long-term fairness by periodically shifting threads between the set of passivated threads and those actively circulating. Our approach is palliative, but is often effective at avoiding scalability collapse, and in the worst case does no harm. Specifically, throughput is either unaffected or improved, and unfairness is bounded, relative to common TAS locks which allow unbounded bypass 4 .
Introduction
The scaling collapse phenomenon mentioned above arises variously from communication and coordination overheads or from competition for any one of a number of shared resources. This paper focuses on the latter; we explore the etiology of scaling collapse via resource competition in more detail below. For example, one such resource is the shared last-level cache (LLC) on a single-socket system. All the cores on the socket compete for residency in the LLC, and concurrent requests from those cores may cause destructive interference in the LLC, continuously eroding the residency of the data from any one core.
1 Robert Malthus [63] argued for population control, cautioning that societies would collapse as increasing populations competed for resources. His dire predictions did not come to pass, as food production, which had previously been stagnant, managed to improve and keep pace with population growth.
2 Increased concurrency resulting in decreased throughput appears in other contexts. Brooks [7] observed that increasing the number of workers on a project could slow delivery.
3 An extended version of this paper is available at http://arxiv.org/abs/1511.06035.
4 Bypass occurs when a thread T acquires a lock but there exist other waiting threads that arrived earlier than T.
Typically we think of resource competition as a zero-sum game, but in practice it can be a negative-sum game.
The effect is similar to that of thrashing 5 as described in Denning's working set model of memory pressure [18]. The solution in that context is swapping: the transient deactivation of some subset of the concurrently running programs. The medium-term scheduler responds to excessive paging and potential thrashing by swapping out selected "victim" processes until the thrashing abates. This closely models our approach, where we transiently deactivate excess contending threads that do not contribute to improved throughput. CR responds to contention instead of memory pressure. We extend Denning's ideas from memory management to locks, defining the lock working set (LWS) as the set of distinct threads that have acquired a given lock in some time interval. We use the ordinal acquisition number of the lock to define the interval instead of wall-clock time. Suppose threads A, B, C, D and E contend for lock L and we have an admission order 6 of A B C A B C D A E for admission times 0-8, respectively. The LWS for L for the period 0-5 inclusive is threads A B C and the lock working set size (LWSS) for the period is thus 3 threads.
CR may be unfair over the short term, but our admission policies intentionally impose long-term fairness 7. To help gauge the trade-off between throughput and fairness we introduce two metrics for short-term fairness. We partition the acquisition history of a lock into W-sized disjoint windows, compute the LWSS of each window, and take the average of those values. We refer to this value as the average LWSS over the measurement interval; it gives an intuitive and useful measure of short-term fairness. In this paper we use a window size of 1000 acquisitions, well above the maximum number of participating threads. A related measure of short-term fairness is the median time to reacquire (MTTR), computed over the entire acquisition history. Time to reacquire is determined at admission time, and is the number of admissions since the current thread last acquired the lock. Time to reacquire is analogous to reuse distance in memory management.
CR acts to reduce the number of distinct threads circulating through the lock over short intervals and thus tends to reduce the LWSS, while still providing long-term fairness. The CR admission policy must also be work conserving and never under-provision the lock. It should never be the case that the critical section remains intentionally unoccupied if there are waiting or arriving threads that might enter. Finally, the admission policy must keep the lock fully subscribed but not oversubscribed.
5 A system is said to thrash when memory is overcommitted and the operating system spends an inordinate amount of time servicing page faults, reducing overall progress.
6 Also called the admission history.
7 Fairness measures how admission order deviates from arrival order or from strict FIFO order.
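The fairness metrics defined above can be illustrated with a short Python sketch over a recorded admission history. The function names are hypothetical, and two details are our assumptions rather than the paper's: a trailing partial window is ignored when averaging, and a thread's first acquisition contributes no time-to-reacquire sample.

```python
from statistics import median

def lws(history, start, end):
    """Lock working set: distinct threads admitted at logical times start..end inclusive."""
    return set(history[start:end + 1])

def average_lwss(history, window):
    """Mean LWSS over disjoint windows of `window` admissions.

    Assumption: a trailing partial window is ignored.
    """
    sizes = [len(set(history[i:i + window]))
             for i in range(0, len(history) - window + 1, window)]
    return sum(sizes) / len(sizes) if sizes else 0.0

def times_to_reacquire(history):
    """Admissions elapsed since each thread last acquired the lock.

    Assumption: a thread's first acquisition yields no sample.
    """
    last_seen = {}
    gaps = []
    for t, thread in enumerate(history):
        if thread in last_seen:
            gaps.append(t - last_seen[thread])
        last_seen[thread] = t
    return gaps

def mttr(history):
    """Median time to reacquire over the whole history, or None if no thread reacquired."""
    gaps = times_to_reacquire(history)
    return median(gaps) if gaps else None

# The admission order from the text, at logical times 0..8.
history = list("ABCABCDAE")
print(sorted(lws(history, 0, 5)))  # the LWS for the period 0-5
print(len(lws(history, 0, 5)))     # the LWSS for that period
print(mttr(history))
```

For the example history, the LWS over times 0-5 is {A, B, C}, matching the LWSS of 3 stated in the text.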
As noted above, CR partitions and segregates the set of threads attempting to circulate over the lock into the ACS (active circulating set) and the PS (passive set) 8. Threads in the ACS circulate normally. We desire to minimize the size of the ACS (and thus the LWSS) while still remaining work-conserving. The goal is to have sufficient threads in the ACS to saturate the lock, ensuring the critical section enjoys maximum occupancy, but no more. Surplus threads are culled from the ACS and transferred into the PS where they remain quiesced. Conversely, a deficit in the ACS prompts threads to be transferred from the PS back into the ACS as necessary to sustain saturation. To ensure long-term fairness our algorithm periodically shifts threads between the ACS and PS. Ideally, threads in the ACS will have to wait only briefly to acquire the lock and at most one thread in the ACS will be waiting at any moment. That is, at unlock-time we expect there is typically at most one thread from the ACS waiting to take the lock. Intuitively, threads in the ACS remain "enabled" and operate normally while threads in the PS are "disabled" and do not circulate over the lock. Threads sequestered in the PS typically spin in a polite fashion on a thread-local flag, or block in the operating system, surrendering their CPU. Our approach constrains and regulates the degree of concurrency over critical sections guarded by a contended lock in order to conserve shared resources such as residency in shared caches. Specifically, we minimize the number of distinct threads acquiring or transiting the lock over the short term.
For instance, assume a simplified execution model with 10 threads contending for a common lock. The threads loop as follows: acquire the lock; execute the critical section (CS); release the lock; execute their respective non-critical section (NCS). Each such iteration reflects circulation over the lock. In our example the NCS or flow length is 5 microseconds and the CS length is 1 microsecond. For the purposes of explication we assume an ideal lock with no administrative overheads. In this case we reach saturation (Amdahl peak speedup) at 6 threads. At any given time 1 thread is in the CS and 5 execute in their respective NCS. Thus under ideal CR the ACS would have 6 threads and 4 of the 10 threads would reside in the PS, transiently made passive. The 6 circulating threads in the ACS would enjoy a round-robin cyclic admission schedule.
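The arithmetic behind this example can be written down as a small sketch. It assumes the same idealized model as the text (fixed CS and NCS durations, no lock overhead); the function names are ours.

```python
import math

def saturation_threads(cs_time, ncs_time):
    """Minimum thread count that keeps an ideal lock continuously held.

    Each thread spends cs_time in the CS per (cs_time + ncs_time) cycle,
    so ceil((NCS + CS) / CS) threads are needed to cover the CS at all times.
    """
    return math.ceil((ncs_time + cs_time) / cs_time)

def split_acs_ps(total_threads, cs_time, ncs_time):
    """Ideal CR partition: (ACS size, PS size) for a given thread count."""
    acs = min(total_threads, saturation_threads(cs_time, ncs_time))
    return acs, total_threads - acs

# Example from the text: CS = 1 microsecond, NCS = 5 microseconds, 10 threads.
print(saturation_threads(1, 5))  # 6
print(split_acs_ps(10, 1, 5))    # (6, 4)
```

With fewer threads than the saturation point, the sketch places every thread in the ACS and leaves the PS empty, which corresponds to the regime where CR does not operate.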
Scalability Collapse
The scalability collapse phenomenon involves competition for shared hardware resources. A classic example is residency in a shared LLC. As more distinct threads circulate over the lock in unit time, cache pressure and miss rates increase. Critically, as the cache is shared, residency of the data accessed by a thread decays over time due to the action of other concurrently running threads that share the LLC. The application may start to thrash in the LLC and become memory-bound. As the LLC miss rate rises from cache pressure, contention for the DRAM channels increases, making LLC misses even more expensive and compounding the deleterious effect. CR can serve to reduce such destructive interference in shared caches. By reducing the number of threads circulating, we reduce cache pressure and retain residency for longer periods, reducing the miss rate and DRAM channel congestion.
In Figure 1 we show the impact of CR via an idealized aggregate throughput graph. Thread count appears on the X-axis and aggregate throughput on the Y-axis. Such concave scaling graphs are common in practice, and reflect scalability collapse [14, 59]. Lock implementations themselves are sometimes a causative factor for collapse 9 but we show a properly designed lock with CR can also act to reduce collapse stemming from competition for shared hardware resources. Assume an execution model with one contended lock L, where each thread repeatedly acquires L, executes a critical section, releases L, and then executes a non-critical section. All threads start at the same time and run concurrently throughout the measurement interval. We use a fixed-time-report-work methodology: throughput on the Y-axis reflects the total number of iterations completed by the threads in the measurement interval. Maximum throughput appears at the threading level corresponding to Peak, representing the key inflection point where performance drops as thread counts increase. Beyond peak, additional threads do not contribute to performance, and in fact may degrade performance 10. Saturation reflects the minimum threading level where there is always at least one waiting thread when the owner releases L. We say threads beyond saturation are excess threads; excess threads are not necessary to achieve saturation. Peak will always be less than or equal to saturation. CR can begin to operate and provide benefit when the thread count exceeds saturation. The value for Peak is imposed by software and platform architectural factors and is unrelated and orthogonal to saturation 11.
Figure 1: Impact of Concurrency Restriction
8 The ACS corresponds to the balance set in the working set model.
9 For instance via induced coherence traffic on lock metadata.
We note two regions of interest. First, when the thread count is less than saturation, CR would be ineffective and does not operate. CR does not impact performance in this region, providing neither harm nor benefit. Second, when the thread count exceeds saturation, CR can operate, ideally avoiding the subadditive "scaling fade" evident in the graph when CR is not enabled. CR acts by clamping the effective thread count -over the short term -to saturation. Under fixed load we expect the LWSS to always be greater than or equal to saturation.
Arguably, CR does not improve performance in this region, but it avoids worsening performance arising from scalability collapse.
Taxonomy of Shared Resources
We provide a limited taxonomy of inter-thread shared resources that are subject to competition and are amenable to conservation via CR. Each of the following shared resources identifies a potential mode of benefit for CR. Competition for core-level resources such as pipelines typically starts to manifest when the number of ready threads exceeds the number of cores, and more than one thread is running on a core. The onset of competition for socket-level resources may start at lower thread counts. Contention for CPUs occurs when the number of ready threads exceeds the number of logical CPUs, at which point preemption starts.
A key socket-level shared resource is LLC residency. Suppose we have a contended lock that is fully saturated. In this mode the critical section duration solely dictates throughput [30] . Data accessed in non-critical sections is thread-private and multiple independent non-critical sections may execute concurrently with a single CS. NCS accesses displace and evict critical data 12 . As the set of threads circulating over the lock grows, the total non-critical footprint increases, and we find more cache pressure in the LLC. In turn, the critical section suffers more LLC misses, increasing the duration of the CS and decreasing throughput over the contended lock. CR can afford benefit in this circumstance by restricting the set of circulating threads, reducing cache pressure and thus increasing throughput compared to a perfectly fair FIFO lock 13 .
We next provide a detailed example to motivate the benefit of CR on a single-socket SPARC T5 processor where the shared LLC (L3) is 8MB. We have a customer database that is 1MB, and each CS operation will access a record in that database. Each record resides on a single cache line. An individual CS will access only one record, but over time most records will be accessed repeatedly by subsequent operations. (The CS may be "short" in average duration but "wide" in the sense that a sequence of CS operations will eventually access a large fraction of the records). We have 16 threads, and on an otherwise unloaded system the NCS duration is 4 times that of the CS duration. The (NCS +CS )/CS ratio is such that only 5 threads are needed to fully saturate the lock and provision the ACS. Furthermore, the NCS footprint of each thread is 1MB. Even though an individual NCS operation might be short, over time a thread will access all 1MB of its thread-private data. Recall that the CS data is shared and the NCS data is per-thread and thread-private. Under a classic FIFO MCS lock, all 16 threads will circulate over the lock in round-robin cyclic order. The total footprint is 17MB : (16 threads * 1MB/thread) + 1MB for the CS, exceeding the 8MB capacity of the LLC. The NCS operations will erode and decay the residency of the CS data, slowing execution of the CS, and degrading overall throughput. But with CR the lock subsystem is able to limit the size of the ACS to 5 threads. In this mode, the total short-term footprint is 6MB : (5 threads * 1MB/thread) + 1MB for the CS. The total footprint -the CS data plus the NCS data of the ACS threads -fits comfortably within the LLC. Consequently, the NCS instances do not erode CS residency, the CS does not suffer from misses arising from destructive interference in the LLC, and throughput is improved. 
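The footprint arithmetic in this example can be checked with a minimal sketch. The sizes are those given in the text; the function name and the simple "sum of footprints versus LLC capacity" test are our simplification and ignore associativity and other cache effects.

```python
def short_term_footprint_mb(circulating_threads, ncs_mb_per_thread, cs_mb):
    """Total short-term footprint: per-thread NCS data of circulating threads plus shared CS data."""
    return circulating_threads * ncs_mb_per_thread + cs_mb

LLC_MB = 8  # shared L3 capacity in the example

# Classic FIFO MCS: all 16 threads circulate.
fifo_mb = short_term_footprint_mb(16, 1, 1)
# With CR: the ACS is limited to the 5 threads needed to saturate the lock.
cr_mb = short_term_footprint_mb(5, 1, 1)

print(fifo_mb, fifo_mb <= LLC_MB)  # 17 False
print(cr_mb, cr_mb <= LLC_MB)      # 6 True
```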
CR reduces cache pressure, particularly on CS data. "Hot" threads, those that have run recently and have residual LLC residency, tend to remain "hot".
Amdahl's law is not a perfect match for locking. Amdahl's model assumes a serial phase where no threads run, or just 1 thread runs, alternating with a pure parallel phase where all threads can run. Arguably, Amdahl models barriers, although barriers still allow some concurrency when some threads have arrived for rendezvous and other threads are still running. Amdahl does not faithfully model locking: when a thread holds the lock, other threads can concurrently execute their NCS regions. See Eyerman.
12 CS invocations under the same lock typically exhibit reference similarity: acquiring lock L is a good predictor that the critical section protected by L will access data that was accessed by recent prior critical sections protected by L. That is, CS invocations tend to access data accessed by prior CS invocations, exhibiting inter-CS inter-thread locality and reuse.
13 Various forms of competition for LLC residency are possible: CS-vs-CS, NCS-vs-CS, and NCS-vs-NCS. We assume sufficient contention that aggregate throughput is solely controlled by CS duration, in which case inter-NCS "fratricide" is the least important mode. Even so, NCS-vs-NCS competition can increase DRAM channel contention.
Another socket-level shared and rationed resource is thermal and energy headroom. By running fewer threads in a given interval relative to other locks, CR may reduce energy use and heat dissipation. Furthermore, by quiescing threads in the PS and allowing more processors to enter and remain in deeper low-power sleep states while idle, our approach can enable turbo mode [25, 61] for the remaining active threads, critically including the lock holder, accelerating their progress and improving throughput. Spinning also expends energy, which is rationed.
Space considerations do not allow us to present the details, but we used the RAPL (Running Average Power Limit) facility on modern Intel processors to measure Joules consumed and the enablement of turbo mode. We also used the RAPL facility to constrain power usage. In power-constrained configurations, CR provided even higher relative benefits.
This paper focuses on socket-level LLC residency, core-level cache and DTLB residency, core-level pipeline availability, and system-wide logical CPU availability.
The waiting policy of a lock implementation (discussed below) defines how a thread waits for admission, and can have a significant impact on competition for core-level resources such as pipelines, socket-level resources such as thermal and energy headroom, and global resources such as logical CPUs.
The MCSCR lock algorithm
We now describe the implementation of MCSCR, a classic MCS lock [52] modified to provide CR by adding an explicit list for members of the PS 14. At unlock-time, if there exist any intermediate nodes in the queue between the owner's node and the current tail (that is, the MCS chain holds 3 or more nodes, including the owner's node), then we have surplus threads in the ACS and we can unlink and excise one of those nodes and transfer it to the head of the passive list where excess "cold" threads reside. This constitutes the culling operation. Specifically, our approach simply looks forward into the MCS chain to detect the onset of contention. Conversely, at unlock-time if the main queue is empty except for the owner's node, we then extract a node from the head of the passive list, insert it into the main queue at the tail, and pass ownership to that thread, effectively transferring an element from the PS back into the ACS. This ensures MCSCR is work conserving and provides progress and liveness. The element at the head of the passive list is the most recently arrived member of the PS. Absent sufficient contention, MCSCR operates precisely like classic MCS. MCSCR directly edits 15 the MCS chain to shift threads back and forth between the main chain and the explicit list of passivated threads. The ACS list is implicit, while the PS (the excess list) is explicit. MCSCR detects contention and excess threads simply by inspecting the main MCS chain. We note that CR can be easily applied to a wide variety of existing locks.
To ensure long-term fairness, the unlock operator periodically selects the tail T of the excess list as the successor and then grafts T into the main MCS chain immediately after the lock-holder's element, passing ownership of the lock to T. Statistically, we cede ownership to the tail of the excess list, which is the least recently arrived thread, on average once every 1000 unlock operations. (The probability parameter is tunable and reflects the trade-off between fairness and throughput.) Transferring a thread from the PS into the ACS typically results in some other member of the ACS being displaced and shifted into the PS in subsequent culling operations. We use a thread-local pseudo-random number generator to implement Bernoulli trials with probability P = 0.001 to control passing ownership to the tail of the excess list. In the degenerate case where we always pull from the PS, the lock behaves as normal MCS.
Culling acts to minimize the size of the ACS. Under fixed load, aggressive culling causes the system to devolve to a desirable state where there is at most one member of the ACS waiting to acquire the lock. In this state, the ACS consists of that one waiting thread, the current owner of the lock, and a number of threads circulating through their respective non-critical sections. At unlock-time, the owner will usually pass ownership of the lock to that waiting thread. Subsequently, some member of the ACS will complete its non-critical section and wait for the lock. In this mode, admission order is effectively cyclic over the members of the ACS.
14 Under classic MCS, arriving threads append an element to the tail of the list of waiting threads and then wait on a flag within that element. The lock's tail variable is explicit and the head (the current owner) is implicit. When the owner releases the lock it reclaims the element it enqueued and sets the flag in the next element, passing ownership.
15 Editing the MCS chain was first suggested by Markatos et al [49] for the purposes of enforcing thread priorities.
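The unlock-time decisions described in this section can be sketched as a single-threaded model. This is not a working lock and not the authors' implementation: it only tracks which thread owns the lock and where the others sit. The class and method names are hypothetical, and three choices are our assumptions: the culled node is the one immediately after the owner, at most one node is culled per unlock, and the fairness trial is evaluated before culling.

```python
import random
from collections import deque

class McscrModel:
    """Single-threaded model of the MCSCR admission policy (illustrative only)."""

    def __init__(self, fairness_probability=0.001, rng=None):
        self.owner = None
        self.chain = deque()    # waiters in the main MCS chain, oldest first
        self.passive = deque()  # excess list: index 0 is the head, index -1 is the tail
        self.p = fairness_probability
        self.rng = rng or random.Random(0)  # stands in for the thread-local PRNG

    def arrive(self, thread):
        """A thread requests the lock: take it if free, else join the chain tail."""
        if self.owner is None:
            self.owner = thread
        else:
            self.chain.append(thread)

    def unlock(self):
        """The owner releases the lock; returns the next owner, or None if nobody waits."""
        if self.passive and self.rng.random() < self.p:
            # Long-term fairness: cede ownership to the tail of the excess list,
            # the least recently arrived passive thread.
            self.owner = self.passive.pop()
        else:
            if len(self.chain) >= 2:
                # Culling: a node between the owner and the tail is surplus;
                # move it to the head of the excess list.
                self.passive.appendleft(self.chain.popleft())
            if self.chain:
                # Direct handoff to the next waiter, as in classic MCS.
                self.owner = self.chain.popleft()
            elif self.passive:
                # Work conservation: the chain is empty, so reactivate the head
                # of the excess list.
                self.owner = self.passive.popleft()
            else:
                self.owner = None
        return self.owner

# Five threads arrive while T0 holds the lock; the fairness trial is disabled.
m = McscrModel(fairness_probability=0.0)
for name in ["T0", "T1", "T2", "T3", "T4"]:
    m.arrive(name)
print(m.unlock(), list(m.passive))
print(m.unlock(), list(m.passive))
```

In this trace each unlock culls one surplus waiter and hands the lock to the next thread in the chain, so the chain quickly shrinks to at most one waiter. A fuller model would have released threads re-arrive after their non-critical sections.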
All changes to support MCSCR are implemented in the unlock path; the MCS lock operator remains unchanged. Operations on the excess list occur within the unlock operator while the MCS lock is held -the excess list is protected by the MCS lock itself. This artificially increases the length of the critical section, but the additional manipulations are short and constant-time. The default POSIX pthread mutex lock specification does not dictate fairness properties 16 , giving significant latitude and license to implementors. In fact common mutex constructions, such as those found in Solaris or Linux, are based on test-and-set (TAS) locks [3] and allow unbounded bypass with potentially indefinite starvation and unfairness. Similarly, the synchronized implementation in the HotSpot Java Virtual Machine allows indefinite bypass as does java.util.concurrent ReentrantLock.
Lock Design Variations
The laxity of such specifications with respect to fairness also allows for CR. Interestingly, MCS [52] or other strictly FIFO locks appear rarely outside a few uses in operating system kernels.
We employ a park-unpark infrastructure for voluntary context switching. The park-unpark facility allows a waiting thread to surrender its CPU directly to the operating system while the thread waits for a contended lock. The park primitive blocks the caller, rendering it ineligible to be scheduled or dispatched onto a CPU; the thread "deschedules" itself via park. A corresponding unpark(T) system call wakes or resumes the target thread T, making it again ready for dispatch and causing control to return from park if T was blocked. An unpark(T) operation can occur before the corresponding park call by T, in which case park returns immediately and consumes the pending unpark action. Waiting for a lock via parking is "polite" in the sense that the waiting thread can make its CPU immediately available to other ready (runnable) threads 17. Solaris exposes lwp_park and lwp_unpark system calls while the futex facility can be used to park and unpark threads on Linux. The park-unpark facility is often implemented via a restricted-range semaphore, allowing values only of 0 (neutral) and 1 (unpark pending). The park-unpark interface moves the decision of which thread to wake out of the kernel and into the user-space lock subsystem. Explicit lists of parked threads must be maintained by the user-space lock subsystem.
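The restricted-range semaphore semantics described above can be sketched in Python with a condition variable. The `Parker` class is a hypothetical illustration, with one instance per thread assumed; it is not the Solaris or Linux facility and says nothing about their costs.

```python
import threading

class Parker:
    """Per-thread park/unpark with a single permit: 0 (neutral) or 1 (unpark pending)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._permit = False  # False = neutral, True = unpark pending

    def park(self):
        """Block until a permit is available, then consume it."""
        with self._cond:
            while not self._permit:
                self._cond.wait()
            self._permit = False

    def unpark(self):
        """Make the permit available; repeated calls do not accumulate."""
        with self._cond:
            self._permit = True
            self._cond.notify()

# An unpark that precedes park makes the later park return immediately.
p = Parker()
p.unpark()
p.park()

# A parked thread resumes once another thread unparks it.
resumed = threading.Event()

def waiter():
    p.park()
    resumed.set()

t = threading.Thread(target=waiter)
t.start()
p.unpark()
t.join(timeout=5)
print(resumed.is_set())
```

Because the permit saturates at 1, two unpark calls followed by two park calls would leave the second park blocked, which is the restricted range the text refers to.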
Broadly, lock implementations use one of two possible succession policies, which describe how ownership is transferred at unlock-time when threads are waiting. Under direct handoff the unlock operation identifies a waiting successor and then passes ownership to that thread, enabling it to enter the CS. If no successor exists then the lock is set to an available state. MCS employs direct handoff. Under competitive succession the owner sets the lock to an available state, and, if there are any waiters, picks at least one as the heir presumptive, enabling that thread to re-contend for the lock 18 19. Enabling an heir presumptive is necessary to ensure progress. The heir presumptive may compete with arriving threads for the lock. TAS-based locks use competitive succession and in the simplest forms all waiting threads act as heir presumptive and no specific enabling is needed.
Locks that use direct handoff can exhibit poor performance if there are more ready threads than CPUs and involuntary context switching (preemption) is in play. The successor may have been preempted, in which case the lock handover latency and responsiveness of the lock are impaired. Specifically, an unlock operation may pick thread T as a successor, but T has been preempted. Circulation stalls until the operating system eventually dispatches T 20. This leads to the undesirable convoying phenomenon [4] with transitive waiting. With competitive succession, the new owner must take explicit actions to acquire the lock, and is thus known to be running, albeit at just the moment of acquisition. Competitive succession reduces succession latency and works well in conditions of light contention. Direct handoff performs well under high contention [47], except when there are so many ready threads that successor preemption comes into play, in which case competitive succession may provide better throughput.
Direct handoff suffers from an additional performance concern related to the waiting policy. If the successor T parked itself by calling into the operating system, then the unlock operator needs to make a corresponding system call to wake and unpark T, making T eligible for dispatch. The time from an unpark(T) call until the corresponding blocked thread T returns and resumes from park can be considerable. Latencies of more than 30000 cycles are common even in the best case on an otherwise unloaded system where there are fewer ready threads than CPUs and an idle CPU is available on which to dispatch T 21. We have observed latencies in excess of 80000 cycles on modern Intel processors when unpark causes a thread to be dispatched onto a CPU idling in a deep low-power sleep state. Crucially, these administrative latencies required by succession to resume threads accrue while the lock is held, artificially lengthening the critical section. Such lock handover latency greatly impacts throughput over the contended lock, and can dominate performance under contention. Direct handoff is generally unsuitable for locks that wait via parking as context switch overheads artificially increase the critical section duration and effective lock hold times.
16 Fairness is considered a quality-of-implementation concern.
17 Threads can also wait via unbounded spinning (busy-waiting). In this case involuntary preemption by the operating system will eventually make sure other ready threads will run. However, time slices can be long, so it may take considerable time for a ready thread to be dispatched if there are no idle CPUs. Parking surrenders the caller's CPU in a prompt fashion.
18 The reader might note that competitive succession is analogous to the CSMA-CD communication protocol, while direct succession is analogous to token ring protocols. CSMA-CD is optimistic and exhibits low latency under light load but suffers under high load, whereas token ring is pessimistic but fair, and provides stable guaranteed performance under heavy load, but incurs more latency under light load.
19 Competitive succession is also called barging, as arriving threads can barge in front of other waiting threads, allowing unbounded bypass and grossly unfair admission.
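To make the cost of handover latency concrete, the following back-of-the-envelope sketch treats a saturated lock whose throughput is dictated solely by the effective hold time. The 30000-cycle handover figure comes from the text; the 1000-cycle critical section and the function names are hypothetical values chosen only for illustration.

```python
def effective_hold_cycles(cs_cycles, handover_cycles):
    """Effective lock hold time when handover latency accrues while the lock is held."""
    return cs_cycles + handover_cycles

def saturated_throughput(cs_cycles, handover_cycles, window_cycles=1_000_000):
    """Acquisitions completed per window on a saturated lock under this simple model."""
    return window_cycles / effective_hold_cycles(cs_cycles, handover_cycles)

cs = 1000  # hypothetical critical section length, in cycles
no_handover = saturated_throughput(cs, 0)
parked_handoff = saturated_throughput(cs, 30000)
print(no_handover / parked_handoff)  # slowdown factor under this model
```

Under these assumed numbers the handover latency, not the critical section itself, dominates the hold time, which is the effect the paragraph above describes.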
Hybrid spin-then-park [19] waiting strategies may reduce the rate of voluntary blocking and provide some relief from such voluntary context switching costs. However, spin-then-park tends not to work well with strict FIFO queue-based locks. The next thread to be granted the lock is also the one that has waited the longest, and is thus most likely to have exceeded its spin duration and reverted to parking. Conversely, the most recently arrived threads are the most likely to still be spinning, but they will be the last to be granted the lock. Generally, a waiting strategy that parks and unparks threads is inimical to locks that use direct handoff, and to FIFO locks specifically.
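A spin-then-park waiting policy can be sketched as follows. The `SpinThenParkFlag` class and its `spin_limit` parameter are hypothetical, and `threading.Event` stands in for a park-unpark facility. As a simplification, `grant` always signals the event, whereas a real lock would try to skip the wake-up call when the waiter is still spinning. Busy-waiting in Python is also not a realistic stand-in for polite hardware spinning.

```python
import threading

class SpinThenParkFlag:
    """A per-waiter flag that is first polled briefly, then waited on by blocking."""

    def __init__(self, spin_limit=1000):
        self._granted = False
        self._event = threading.Event()
        self._spin_limit = spin_limit

    def wait(self):
        """Return "spin" if the grant was seen during the spin phase, else "park"."""
        for _ in range(self._spin_limit):
            if self._granted:
                return "spin"
        # Spin budget exhausted: block until the grant arrives.
        self._event.wait()
        return "park"

    def grant(self):
        """Pass ownership to the waiter, waking it if it has parked."""
        self._granted = True
        self._event.set()

# A grant that arrives before the spin budget runs out avoids blocking.
early = SpinThenParkFlag()
early.grant()
print(early.wait())

# With no spin budget the waiter always falls back to parking.
late = SpinThenParkFlag(spin_limit=0)
result = []
t = threading.Thread(target=lambda: result.append(late.wait()))
t.start()
late.grant()
t.join(timeout=5)
print(result)
```

In a FIFO lock the longest waiter is the next to be granted, so it is the thread most likely to be on the "park" path here, which is why the text considers the combination a poor fit.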
Lock algorithms can provide succession either by direct handoff - where ownership of the lock is conveyed directly from the current owner to some waiting thread - or via so-called competitive succession, where the current owner, in unlock(), releases the lock and allows waiting threads to contend for the lock. Direct handoff is generally better under high contention while competitive succession is more optimistic and reduces succession latency in conditions of light contention. To provide progress and liveness, locks that use competitive succession may need to unpark an "heir presumptive" thread that had been waiting. The heir presumptive can then compete for the lock.
Direct handoff typically implies the existence of an explicit list of waiting threads. In turn, that allows local spinning by those waiters. This claim is not universally correct - ticket locks serve as a counter-example.
All strictly FIFO locks use direct handoff. Relatedly, all locks that use local spinning [26] , such as MCS, also use direct handoff. With local spinning, at most one waiting thread spins on a given location at any given time. Local spinning often implies the existence of an explicit list of waiting threads 22 . Depending on the platform, local spinning may reduce the "invalidation diameter" of the writes that transfer ownership, as the location to be written should be monitored by only one thread and thus reside in only one remote cache. Lock algorithms such as TAS use global spinning, where all waiting threads busy-wait on a single memory location 23 .
Invalidation diameter; number of participating caches;
Given its point-to-point nature where thread A directly unparks and wakes B, using park-unpark for locks requires the lock algorithm to maintain an explicit list of waiting threads, visible to the unlock operator. Most locks that use local spinning, such as MCS, can therefore be readily converted to use parking. A simple TAS lock with global spinning and competitive succession requires no such list be maintained - the set of waiting threads is implicit and invisible to the unlock operator. Lock algorithms that use global spinning, such as ticket locks or TAS locks, are more difficult to adapt to parking. As noted above, parking is typically inimical to locks that use direct handoff, as the context switch overheads artificially increase the critical section length.
20 Kontothanassis et al. suggested ways to mitigate this problem for MCS locks [44].
21 Unpark itself incurs a cost of more than 9000 cycles to the caller on our SPARC T5 system.
22 More precisely, at unlock-time the owner thread must be able to identify the next waiting thread - the successor.
23 Systems with MOESI-based cache coherence may be more tolerant of global spinning than those that use MESI [26].
We define lock handover latency as follows. Say thread A holds lock L and B waits for lock L. B is the next thread to acquire ownership when A releases L. The handover latency is the time between A's call to unlock and when B returns from lock and can enter the critical section. Handover latency reflects overheads required to convey ownership from A to B. Lock implementations attempt to minimize handover latency, also called responsiveness in the literature. Excessive handover latency increases latency and degrades scalability. As noted above, if A must resume B via calls into the kernel to transition B from blocked to ready, then the handover latency increases significantly.
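To make the cost of handover concrete, the following minimal sketch models a saturated lock, assuming that each acquisition occupies the lock for the critical section plus the handover latency. The function name, the 10000-cycle critical section and the 500-cycle spinning handover are illustrative assumptions; the 30000-cycle figure is the unpark-to-resume latency quoted above, and the 3.6 GHz clock is simply a convenient rate.

```python
def saturated_throughput(cs_cycles: int, handover_cycles: int, clock_hz: float) -> float:
    """Acquisitions per second for a continuously contended lock.

    Simplified model (an assumption, not a measurement): the lock is
    unavailable for the critical section plus the handover latency.
    """
    return clock_hz / (cs_cycles + handover_cycles)


CLOCK_HZ = 3.6e9          # assumed clock rate
CS_CYCLES = 10_000        # hypothetical critical section length
SPIN_HANDOVER = 500       # hypothetical handover to a spinning successor
PARK_HANDOVER = 30_000    # unpark-to-resume latency cited in the text

spin = saturated_throughput(CS_CYCLES, SPIN_HANDOVER, CLOCK_HZ)
park = saturated_throughput(CS_CYCLES, PARK_HANDOVER, CLOCK_HZ)
print(f"spinning successor: {spin:,.0f} acquisitions/s")
print(f"parked successor:   {park:,.0f} acquisitions/s")
```

Under these assumed numbers the handover term, not the critical section, dominates when the successor must be resumed through the kernel.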
Waiting Policies
The choice of waiting policy used by a lock implementation influences competition for CPUs, pipelines and thermal headroom, making the selection of a waiting policy critical for CR. The waiting policy also dictates key latencies, further informing our design. We identify a number of commonly used policies:
The manner in which a thread waits for a lock is the waiting policy.
Unbounded spinning
Classic MCS and TAS locks use unbounded spinning, also called busy-waiting or polling. Waiting threads simply loop, re-checking the variable of interest. While unbounded spinning appears often in academic literature, actual deployed software generally avoids indefinite spinning. At some point a spinning thread is expected to deschedule itself. While convenient and simple, unbounded spinning can interfere with the performance of other threads on the system by consuming pipeline resources. Spinning also expends energy and consumes available thermal headroom, possibly to the detriment of sibling cores that might otherwise enjoy turbo mode acceleration. In addition, a spinning thread occupies a processor, possibly prohibiting some other ready thread from running in a timely fashion 24 . If there are more ready threads than logical CPUs, then preemption by the kernel would eventually ensure those other threads run, but those ready threads may languish on dispatch queues until the spinners exhaust their time slice. Typical quanta durations far exceed the latency of a voluntary context switch. Despite those concerns, spinning remains appealing because it is simple and the lock handover latency -absent preemption -is low.
If a thread uses unbounded spinning then eventually involuntary preemption by the operating system will deschedule the spinner and allow those other ready threads to run, but quanta (time slices) can be relatively long, so depending on preemption is not prudent and can result in particularly poor performance when the number of ready threads exceeds the number of available processors. (Spinning is also called busy-waiting, polling, or active waiting; blocking is also called passive waiting, descheduling, suspending, or sleeping.) Spinning can be made more polite to sibling threads by using the PAUSE instruction on x86, or the RD CCR,G0 idiom, a long-latency no-op, on SPARC. These instructions cede pipeline resources to siblings - logical CPUs that share the core with the spinning thread - allowing those siblings to run faster. Such instructions may also reduce power usage.
SPARC also provides the WRPAUSE instruction with a parameterized delay period [20] 25. Longer pauses yield more benefit to siblings but may impact response latency by creating "dead time" and lag when ownership is passed to a waiting thread that happens to be in the middle of a WRPAUSE operation. This presents an altruism trade-off: longer delays are more polite and provide more benefit to siblings, but may also increase lock handover latency. Spinning can also be augmented with sched_yield calls, but these are advisory and in practice we find they perform worse than parking.
Polite spinning and parking both reflect altruism. WRPAUSE is useful for short-term waiting, and for both global and local spinning. Spinning policies are further determined by the choice of local spinning versus global spinning. A simple fixed back-off usually suffices for local spinning, while randomized back-off is more suitable for global spinning.
24 Spinning threads may wait for the lock holder which has itself been preempted.
25 When only one logical CPU is active in a core, the per-core pipelines automatically fuse and provide better performance for the single active CPU. Intel processors with hyperthreading exhibit similar behavior. Polite spinning via the WRPAUSE instruction or the RD CCR,G0 idiom also enables fusion.
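As a rough illustration of global spinning with randomized back-off, the sketch below builds a TAS-style lock in Python. It is a sketch under stated assumptions, not the implementation evaluated in this paper: the class and function names are hypothetical, a non-blocking `threading.Lock.acquire` stands in for an atomic test-and-set, `time.sleep` stands in for a polite pause, and the exponential growth of the delay bound is our assumption (the text only calls for randomization).

```python
import random
import threading
import time


class BackoffTASLock:
    """Test-and-set style lock with global spinning and randomized back-off."""

    def __init__(self, base_delay: float = 1e-6, max_delay: float = 1e-3):
        # A non-blocking acquire on threading.Lock stands in for atomic test-and-set.
        self._flag = threading.Lock()
        self._base = base_delay
        self._max = max_delay

    def acquire(self) -> None:
        limit = self._base
        while not self._flag.acquire(blocking=False):
            # Randomize the delay so waiters do not retry in lock step.
            time.sleep(random.uniform(0.0, limit))
            limit = min(2 * limit, self._max)  # exponential growth is an assumption

    def release(self) -> None:
        self._flag.release()


def run_workers(lock: BackoffTASLock, threads: int, iters: int) -> int:
    """Have several threads increment a shared counter under the lock."""
    total = 0

    def work() -> None:
        nonlocal total
        for _ in range(iters):
            lock.acquire()
            total += 1          # critical section
            lock.release()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total
```

Note that no list of waiters is kept: the set of waiting threads is implicit, which is what makes this style of lock awkward to adapt to parking.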
Spinning can be augmented with operating system sched_yield calls that attempt to yield the CPU to other ready threads. Unfortunately sched_yield is advisory and has almost no semantics. Often, it does nothing even when other runnable threads are available on other dispatch queues. Empirical results with sched_yield are dismal. We do not consider it further.
MWAIT
The MWAIT instruction, available on x86 26 and SPARC M7 systems, allows a thread to wait politely for a location to change. MWAIT "returns" promptly after a modification of a monitored location. While waiting, the thread still occupies a CPU, but MWAIT [2, 22] may allow the CPU to reach deeper sleep states. It also frees up pipeline resources more effectively than WRPAUSE. Latency to enter and exit MWAIT state is low, avoiding the trade-off inherent in picking WRPAUSE durations. Transferring ownership for locks that use local spinning is efficient and incurs little handover latency. MWAIT also avoids branch mispredict stalls that are otherwise inherent in exiting wait loops. MWAIT with a parameterized maximum time bound allows hybrid forms where a thread initially uses MWAIT but then falls back to parking. MWAIT is tantamount to spinning, but more polite and preferred when the instruction is available.
MWAIT is well-suited for local spinning and we prefer MWAIT over WRPAUSE where available. The case for using MWAIT with global TAS spinning is less clear. Absent additional randomization, all waiting threads will resume from MWAIT and try the lock, likely generating futile coherence traffic. Normally MWAIT is inappropriate for global spinning with a large number of threads, but our approach constrains the number of threads spinning on a given lock at any moment, making MWAIT a viable choice. We can also use hardware transactional memory to wait politely via TXPAUSE.
Parking
Parking suspends the calling thread and voluntarily surrenders the CPU on which the caller ran, making it immediately available to run other ready threads. If no other threads are ready, then the CPU may become idle and be able to drop to deeper sleep states, reducing power consumption and potentially enabling other ready threads on the same chip to run at faster speeds via turbo mode 27. Parking also reduces competition for pipeline resources. In turn, other threads - possibly including the lock holder running in its critical section - may run faster, improving scalability. Parking also allows the operating system to rebalance the set of running threads over the available cores via intra-socket migration. This is particularly useful for CR 28. Spinning does not allow such redistribution. Parking also reduces the number of concurrently ready threads, in turn reducing involuntary preemption by the operating system. However, the costs to enter and exit the parked state are high and require operating system services. Thus our policies strive to reduce the rate of voluntary context switching. Parking vacates the processor, allowing other ready threads to run on that CPU and making the caller ineligible for dispatch. If parking causes the CPU to become idle, the CPU can enter deeper sleep states, which in turn confers benefit to other running threads via turbo mode.
Modern CPUs support special hardware sleep states for idle CPUs. Deeper sleep states draw less power and dissipate less heat, and allow more aggressive turbo mode for sibling CPUs and cores on the same socket, permitting threads on those CPUs to enjoy faster execution.
Turbo mode is not controlled directly by software. Rather, it is automatically controlled by the CPU itself and requires sufficient energy headroom to be enabled.
CPUs transition to deeper (lower power) sleep states the longer they remain idle. Deeper sleep states, however, take longer to enter and exit. Exit latency significantly impacts unpark latency - the time between an unpark(T) operation and the time when T returns from park. Deeper sleep states, while useful for energy consumption and turbo mode, may also increase the time it takes to wake a thread. To effectively leverage the benefits of deeper sleep states, the CPU needs to stay in that state for some period to amortize the entry and exit costs. Frequent transitions between idle and running states also attenuate the turbo mode benefit for sibling CPUs as the CPU may not idle long enough to reach deeper states. Lock implementations that act to reduce thread park-unpark rates will also reduce CPU idle-running transitions and will incur less unpark latency - by avoiding sleep state exit latencies - and also allow better use of turbo mode. By keeping the ACS stable and minimal, CR reduces the park-unpark voluntary context switch rate, and in turn the idle-running CPU transition rate.
Frequent transitions between idle and running incur extra unpark latency and attenuate the turbo mode benefit for sibling CPUs. Reducing thread park-unpark rates also reduces CPU idle-running transitions and the concerns associated with deeper sleep states. As such, our locking policies prefer to avoid frequent transitions between idle and running states for CPUs. There is lag in dispatching threads when the CPU exits the idle state. Generally, the deeper the sleep state, the more power conserved while in that state, but the longer it takes the CPU to enter and exit that state. Reduced context switching rates can lead to reduced CPU transitions between idle and non-idle states, allowing deeper sleep states and lower transition overheads.
Park-unpark and waiting via local spinning typically require the lock algorithm to maintain explicit lists of waiting threads. Generally, it is easy to convert a lock that uses local spinning to use park-unpark or spin-then-park. Parking is point-to-point by nature and requires a visible list of waiting threads.
Park() admits spurious returns. A good litmus test of proper and safe park-unpark usage is to consider the degenerate but legal implementation where park() and unpark() are implemented as no-ops, in which case the algorithms that use park-unpark would simply degenerate to spinning. This reflects a legal but poor quality implementation. After returning from a park() call, the caller is expected to re-evaluate the conditions related to waiting. Park-unpark can be thought of as an optimized form of busy-waiting or polling. Specifically, control returning from park() does not imply a corresponding previous unpark() operation. By allowing spurious wakeups we afford more latitude to the park-unpark implementation, possibly enabling useful performance optimizations.
Optimized park-unpark implementations can often avoid calling into the kernel. Say thread S calls unpark(T) where T is not currently parked. The unpark(T) operation will record the available "permit" in T's thread structure and return immediately without calling into the kernel. When T eventually calls park, it will clear that permit flag and return immediately, again without calling into the kernel. Redundant unpark(T) operations - where a waiting thread T has previously been unparked but has not yet resumed - also have an optimized fast path to avoid calling into the kernel. The only case that requires calling the kernel is where an unpark follows the corresponding park operation.
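The permit protocol described above can be sketched as follows. This is a minimal illustration, assuming one `Parker` object per thread; the class name, the `slow_path_waits` counter and the use of `threading.Condition` as a stand-in for kernel blocking are our assumptions, not part of any particular park-unpark implementation.

```python
import threading


class Parker:
    """Per-thread permit in the style of park/unpark (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._permit = False
        self.slow_path_waits = 0   # number of times park() actually had to block

    def unpark(self) -> None:
        with self._cond:
            if self._permit:
                return             # redundant unpark: a permit is already pending
            self._permit = True
            self._cond.notify()    # only matters if the owner is blocked in park()

    def park(self, timeout=None) -> None:
        with self._cond:
            if not self._permit:
                self.slow_path_waits += 1
                # May return without a permit (timeout or spurious wakeup),
                # so callers must re-check their own waiting condition.
                self._cond.wait(timeout)
            self._permit = False   # consume the permit, if any


p = Parker()
p.unpark()                 # permit recorded; nobody is parked yet
p.unpark()                 # redundant: collapses into the pending permit
p.park()                   # fast path: consumes the permit without blocking
print(p.slow_path_waits)   # the slow path was never taken
```

Because `park` may return without a matching `unpark`, a caller would normally wrap it in a loop that re-evaluates the condition it is waiting for.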
Optimized park() implementations may spin briefly before reverting to blocking in the kernel. The spin period is brief and bounded, and acts to reduce the rate of expensive and potentially unscalable calls into the kernel to perform ready-blocked state transitions. This is the so-called spin-then-block waiting policy. The spin period reflects local spinning and can be implemented with a "polite" busy-wait loop or via MONITOR-MWAIT instructions.
Waiting in the kernel via blocking or via MONITOR-MWAIT on a thread-private local variable can free up pipeline resources or bring the CPU under thermal-energy caps, which in turn can accelerate the progress of the lock owner, increasing scalability. Recall that if the lock is contended and fully saturated, throughput is completely determined by the critical section duration. By potentially accelerating the lock owner, we may reduce the critical section duration and lock hold time.
As noted above, park-unpark incurs latencies for both the thread parking and the thread calling unpark. If the corresponding unpark occurs after the park operation, and the parked thread was blocked, there is considerable latency required to transition the wakee from blocked to ready to running. Park-unpark is appropriate for longer-term waiting: it relinquishes the CPU to other potentially ready threads; gives the kernel an opportunity to balance active ACS threads over the cores and pipelines via intra-node migration; provides relief for sibling threads that are running on the same cores; and reduces competition for pipelines.
To help reduce handover latency, we can use "anticipatory warmup" as follows. If we expect to unpark() thread T in the near future and T is blocked in the kernel, then we can preemptively unpark(T) so T becomes ready and starts spinning. An unpark(T) operation can impose considerable latency in the caller because of the need to invoke kernel operations.
As such, an anticipatory unpark(T) should be executed while the caller does not hold the lock for which T waits, otherwise we risk artificially increasing the critical section length and impacting throughput over the contended lock. Anticipatory unpark() operations are particularly well suited for locks that use succession by direct handoff, and act to increase the odds that an unlock() operation will transfer control to a thread that is spinning, instead of a thread that is blocked in the kernel. This optimization is optional, but helps to reduce lock handover latency.
Threads can also attempt to voluntarily surrender their CPU in a polite fashion - avoiding dependencies on longer-term involuntary preemption - by calling sched_yield or Sleep(D) where D is a duration to sleep. sched_yield attempts to transfer the CPU to some other ready thread while keeping the caller ready. Polling via sleep and sched_yield avoids the need to maintain explicit lists of waiting threads, as is required by the park-unpark facility. Both can be wasteful, however, because of futile context switching where a thread resumes to find the lock remains held. Furthermore, the semantics of sched_yield are extremely weak on modern operating systems. The choice of D presents another trade-off between response time and politeness. D values are quantized to the units of clock tick interrupt periods on many operating systems - typically between 1 and 10 milliseconds - meaning that short periods cannot be expressed.
26 Intel's MWAIT instruction is not currently available in user-mode, significantly impacting its adoption.
27 Turbo mode is controlled directly by hardware instead of software and requires sufficient energy headroom to be enabled. Software indirectly influences the availability of turbo mode via waiting policies.
28 If the operating system did not rebalance then we could easily extend CR to itself balance the ACS over the cores, intentionally picking ACS members based on where they run.
Spin-Then-Park
To reduce the impact of park-unpark overheads, lock designers may opt to use a hybrid two-phase spin-then-park strategy. Threads spin for a brief period - optimistically waiting - anticipating a corresponding unpark operation and then, if no unpark has occurred, they revert to parking as necessary. The maximum spin period is commonly set to the length of a context-switch round trip. A thread spins for either the spin period or until an unpark is observed 29 30 31. If no unpark occurs within the period, the thread deschedules itself by blocking in the kernel. Karlin et al. note that spinning for the length of a context switch and then, if necessary, parking, is 2-competitive [42, 46]. The spinning phase constitutes local spinning. If available, the spin phase in spin-then-park can be implemented via MWAIT 32. We prefer parking - passive waiting - over spinning - active waiting - when the latencies to unpark a thread exceed the expected waiting period.
More precisely, a thread spins until I steps have passed or until an unpark occurs. I can be expressed in either units of wall-clock time or iterations of a spin loop.
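The two-phase policy can be sketched in a few lines. This is a minimal illustration that expresses the spin bound in loop iterations; the function name and the use of `threading.Event` as a stand-in for the unpark signal are assumptions, and a production spin loop would use a polite pause instruction rather than a bare Python loop.

```python
import threading


def spin_then_park(granted: threading.Event, max_spins: int) -> str:
    """Wait for `granted` and report which phase observed it."""
    for _ in range(max_spins):       # bounded, optimistic spin phase
        if granted.is_set():
            return "spun"
    granted.wait()                   # park phase: block until signalled
    return "parked"


already = threading.Event()
already.set()
print(spin_then_park(already, 100))  # signal seen during the spin phase
```

If the signal arrives only after the spin budget is exhausted, the waiter pays the blocking cost instead, which is the case the spin phase is betting against.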
As such, we try to minimize these administrative costs.
Unless otherwise stated, all locks evaluated in this paper use a spin-then-park waiting policy with a maximum spin duration of approximately 20000 cycles. On SPARC the loop consists of a load and test followed by a single RD CCR,G0 instruction for polite spinning.
Evaluation
We used an Oracle SPARC T5-2 [54] for all experiments. The T5-2 has 2 sockets, each with a single T5 processor running at 3.6 GHz. Each processor has 16 cores, and each core has 2 pipelines supporting 8 logical CPUs ("strands"), yielding 128 logical CPUs per socket. If there is only one active CPU on a core, both pipelines promptly and automatically "fuse" to provide improved performance. The extra strands exist to exploit available memory-level parallelism (MLP) [13]. Each socket has an 8MB unified L3 LLC shared by all cores on that socket. Each core has a fully associative 128-entry data TLB shared by all logical CPUs on that core. Each TLB entry can support all the available page sizes. Each core also has a 16KB L1 data cache and a 128KB L2 unified cache. For all experiments we took all the CPUs on the second T5-2 socket offline, yielding a non-NUMA T5 system. All data for this paper was collected in maximum performance mode with power management disabled. The SPARC T5 processor exposes the sel 0 ready hardware performance counter which tallies the number of cycles where logical CPUs were ready to run, but pipelines were not available. We used this counter to detect and measure pipeline oversubscription and competition.
The system ran Solaris 5.11. Unless otherwise specified, all code was compiled with gcc 4.9.1 in 32-bit mode. We discovered that the performance and scalability of numerous benchmarks were sensitive to the quality of the malloc-free allocator. The default Solaris allocator protects the heap with a single global lock and scales poorly. The poor performance of the default allocator often dominated overall performance of applications, and masked any sensitivity to lock algorithms. We therefore used the scalable LD_PRELOAD CIA-Malloc allocator [1] for all experiments, except where noted. CIA-Malloc does not itself use the pthread_mutex primitives for synchronization.
29 Spinning can be further refined by techniques such as inverted schedctl [21] which reduces the impact of preemption on spinning.
30 The spinning period can also be made adaptive, based on success/failure ratio of recent spin attempts [19].
31 As a thought experiment, if parking and unparking had no or low latencies, then we would never use spinning or spin-then-park waiting strategies, but would instead simply park in a prompt fashion. Spinning is an optimistic attempt to avoid park-unpark overheads. Parking and spinning both reflect wasted administrative work - coordination overheads - that do not contribute directly to the forward progress of the application.
32 Spin-then-park waiting further admits the possibility of anticipatory warmup optimizations where the lock implementation unparks a thread in advance, shifting it from parked state to spinning state. The lock might also favor succession to spinning threads over parked threads.
All locks were implemented as LD_PRELOAD interposition libraries, exposing the standard POSIX pthread_mutex programming interface. LD_PRELOAD interposition allows us to change lock implementations by varying the LD_PRELOAD environment variable and without modifying the application code that uses locks.
We use the default free-range threading model, where the OS is free to migrate threads between processors and nodes in order to balance load or achieve other scheduling goals. Modern operating systems use intra-node migration to aggressively balance and disperse the set of ready threads equally over the available cores and pipelines, avoiding situations where some pipelines are overutilized and others underutilized 33. Inter-node migration is relatively expensive and is less frequent.
equally, equitably
We use a number of small carefully constructed benchmarks to exhibit various modes of contention for shared hardware resources.
The first examples are intentionally simple so as to be amenable to analysis. They are exemplars, contrived to illustrate and designed to show the existence of an effect.
We measure long-term fairness with the relative standard deviation (RSTDDEV), which describes the distribution of work completed by the set of participating threads. We also report the Gini Coefficient [23, 34], popular in the field of economics as an index of income disparity and unfairness. A value of 0 is ideally fair (FIFO), and 1 is maximally unfair.
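Both fairness metrics can be computed directly from per-thread iteration counts. The sketch below is illustrative: the function names are ours, the population form of the standard deviation is an assumption (the text does not say whether the sample or population form is used), and the Gini coefficient uses the standard rank-weighted formula over sorted values.

```python
from statistics import mean, pstdev


def rstddev(work):
    """Relative standard deviation: population std. dev. divided by the mean."""
    return pstdev(work) / mean(work)


def gini(work):
    """Gini coefficient of non-negative per-thread work counts."""
    xs = sorted(work)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


fair = [5, 5, 5, 5]        # every thread completed the same work
starved = [0, 0, 0, 10]    # one thread completed all of the work
print(gini(fair), gini(starved))
print(rstddev(fair), rstddev(starved))
```

With a finite number of threads n, the largest value this formula can produce is (n - 1)/n, which approaches 1 as n grows; the four-thread `starved` case therefore yields 0.75.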
We might also compute the Gini Coefficient over smaller intervals in the acquisition history and then combine those values to form an index that represents short-term fairness.
We believe, however, that LWSS is more intuitive as a measure for short-term fairness as the value is expressed in units of threads.
Random Access Array
The RandArray microbenchmark spawns N concurrent threads. Each thread loops as follows: acquire a central lock L; execute a critical section (CS); release L; execute a non-critical section (NCS). At the end of a 10 second measurement interval the benchmark reports the total number of aggregate iterations completed by all the threads. RandArray also reports average LWSS, median time to reacquire, and long-term fairness statistics. We vary N and the lock algorithm and report aggregate throughput results in Figure 2 , taking the median of 7 runs. The number of threads on the X-axis is shown in log scale.
The NCS consists of an inner loop of 400 iterations. Each iteration generates a uniformly distributed random index into a thread-private array of 256K 32-bit integers, and then fetches that value. To avoid the confounding effects of coherence traffic, we used only loads and no stores. The CS executes the same code, but has a duration of 100 iterations and accesses a shared array of 256K 32-bit integers. The random number generators are thread-local. The ideal speedup is 5x. The 1MB arrays reside on large pages to avoid DTLB concerns. We used random indexes to avoid the impact of automatic hardware prefetch mechanisms 34.
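The 5x ideal speedup and the 1MB array size follow from the parameters above. The short calculation below is a sketch that assumes each loop iteration costs the same in the CS and the NCS and that only the critical section is serialized; the function names are illustrative.

```python
NCS_ITERS = 400             # non-critical section iterations
CS_ITERS = 100              # critical section iterations
ARRAY_ELEMS = 256 * 1024    # 256K elements per array
ELEM_BYTES = 4              # 32-bit integers


def ideal_speedup(ncs: int, cs: int) -> float:
    """Upper bound on speedup when only the critical section is serialized."""
    return (ncs + cs) / cs


def array_bytes(elems: int, elem_bytes: int) -> int:
    """Size in bytes of one array."""
    return elems * elem_bytes


print(ideal_speedup(NCS_ITERS, CS_ITERS))                 # 5.0
print(array_bytes(ARRAY_ELEMS, ELEM_BYTES) // (1024 * 1024), "MB per array")
```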
MCS-S is classic MCS where the waiting loop is augmented to include a polite RD CCR,G0 instruction. MCS-STP uses spin-then-park waiting. MCSCR-S is MCSCR where the waiting loop uses the RD CCR,G0 instruction on every iteration, and MCSCR-STP is MCSCR with spin-then-park waiting. For reference, we include null where the lock acquire and release operators are degenerate and return immediately.
As we can see in Figure 2, ignoring null, the peak appears at about N = 5, where the maximum observed speedup is slightly more than 3 times that of a single thread. MCS-S and MCS-STP start to show evidence of collapse at 6 threads where the total NCS and CS footprint is 7MB, just short of the total 8MB LLC capacity. The LLC is not perfectly associative, so the onset of thrashing appears at footprints slightly below 8MB. Absent CR, the NCS instances erode LLC CS residency and impair scalability. As noted above, MCS-STP performs poorly because spin-then-park waiting is unsuitable for direct handoff FIFO locks such as MCS. Crucially, spin-then-park delivers good performance for MCSCR over all thread counts, but decreases performance of MCS except in the case where there are more ready threads than CPUs, where pure unbounded spinning breaks down. Interestingly, MCSCR-STP achieves better performance than null beyond 48 threads.
33 Explicit binding of threads to CPUs or indefinite spinning precludes this benefit.
34 Our benchmark was inspired by "new benchmark" from [57].
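The footprint argument can be checked with simple arithmetic. The sketch below assumes that, absent CR, every thread's private 1MB NCS array and the single shared 1MB CS array are all live at once; the constant and function names are illustrative.

```python
MB = 1024 * 1024
ARRAY_BYTES = 1 * MB     # each NCS array, and the shared CS array, is 1MB
LLC_BYTES = 8 * MB       # unified L3 capacity per socket


def total_footprint(threads: int) -> int:
    """Combined footprint of all thread-private NCS arrays plus the CS array."""
    return threads * ARRAY_BYTES + ARRAY_BYTES


for n in (5, 6, 7, 8):
    fp = total_footprint(n)
    status = "exceeds LLC" if fp > LLC_BYTES else "fits nominally"
    print(n, "threads:", fp // MB, "MB,", status)
```

Because the LLC is not perfectly associative, "fits nominally" should not be read as "no thrashing"; the text observes the onset of collapse at the 7MB point.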
While not immediately visible in the figure, at 256 threads MCS-STP yields 120x better throughput than MCS-S. Under MCS-S, as we increase the number of ready spinning threads, we increase the odds that the lock will be transferred to a preempted successor, degrading performance. Spinning threads must "burn through" their allotted time slice until the owner is eventually scheduled onto a CPU. At 256 threads, MCS-STP requires a context switch for each lock handover, but it sustains reliable and consistent - but relatively low - performance even if we increase the number of threads. This demonstrates why lock designers conservatively opt for parking over unbounded spinning. Typical time slice periods (quanta) used by modern operating systems are far longer than park-unpark latencies. As such, if forced to choose, we prefer progress via voluntary context switching over involuntary preemption.
In addition to competition for LLC residency, this graph reflects competition for pipelines. At 16 threads - recall that we have 16 cores - we see MCSCR-S fade. In this case the spinning threads in the PS compete for pipelines with the "working" threads in the ACS. (The "polite" spin loop helps reduce the impact of pipeline competition, which would otherwise be far worse.) Using a spin-then-park waiting strategy avoids this concern. MCSCR-STP manages to avoid collapse.
MCS-S and MCS-STP depart from MCSCR-S and MCSCR-STP at around 8 threads because of LLC thrashing. MCSCR-S departs from MCSCR-STP at 16 threads because of competition for pipelines. The slow-down arises from the spin-only waiting policy of those locks. MCS-S and MCSCR-S exhibit an abrupt cliff at 128 threads because of competition for logical CPU residency arising from unbounded spinning. Beyond 128 threads there is systemwide competition for logical processors. MCSCR-STP is the only algorithm that maintains performance in this region.
In Figure 3 we include more details of RandArray execution at 32 threads. The L3 miss rate is considerably lower under the CR forms. As would be expected, the average LWSS and the CPU utilization correspond closely under MCSCR-STP. Note too that the CPU utilization for MCSCR-STP is low, providing lower energy utilization and improved opportunities for multi-tenancy. Despite consuming the least CPU-time, MCSCR-STP yields the best performance. We use the Solaris ldmpower facility to measure the wattage above idle, showing that power consumption is also the lowest with MCSCR-STP.
Figure 2 demonstrated competition for socket-level LLC. In Figure 4 we now provide an illustration of core-level DTLB pressure. The structure of our RingWalker benchmark is similar to that of RandArray. Each thread has a private circularly linked list. Each list element is 8KB in length and resides on its own page. Each ring contains 50 elements. The non-critical section iterates over 50 thread-private elements. We record the last element at the end of the NCS and start the next NCS at the subsequent element. The critical section is similar, although the ring is shared, and each critical section advances only 10 elements. The inflection point at 16 threads for MCS-S and MCS-STP is attributable to DTLB misses. Recall that each SPARC core has 128 TLB entries. When two members of the ACS reside on the same core, we have a total DTLB span of 150 pages, which exceeds the number of TLB entries. (The CS ring has a span of 50 pages, and each of the 2 NCS instances has a span of 50 pages.) We can shift the inflection point for MCS-S and MCS-STP to the right by decreasing the number of elements in the rings. The cache footprint of a ring with N elements is just N cache lines, and the DTLB footprint is N entries. The offsets of elements within their respective pages were randomly colored to avoid cache index conflicts.
Figure 5 shows the performance of the stress latency benchmark from [17] 35.
The benchmark spawns the specified number of threads, which all run concurrently during a 10 second measurement interval. Each thread iterates as follows: acquire a central lock; execute 200 iterations of a delay loop; release the lock; execute 5000 iterations of the same delay loop. The benchmark reports the total number of iterations of the outer loop. This delay loop and thus the benchmark itself are cycle-bound, and the main inflection point appears at 16 threads, where threads that wait via spinning compete with working threads for core-level pipelines. This demonstrates the impact of waiting policy. Similar to many other lock microbenchmarks, very few distinct locations are accessed: there is only one shared variable and there are no memory accesses within the non-critical section.
DTLB Pressure
In practice, lock-based code would be expectedto display more variety. A lock algorithm would be expected to perform well on such a simple benchmark, but the benchmark is not likely to be reflective of common usage.
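The per-thread loop of the stress latency benchmark described above can be sketched as follows. This is a minimal illustration only, not the benchmark's actual source: the names `delay_loop` and `run_stress` are hypothetical, `std::mutex` stands in for whichever lock is under test, and the measurement interval is a parameter so that short runs are possible.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Cycle-bound delay loop; the volatile sink keeps the compiler from deleting it.
inline void delay_loop(int iterations) {
    volatile int sink = 0;
    for (int i = 0; i < iterations; ++i) sink = sink + 1;
}

// Runs `nthreads` workers for roughly `interval` and returns the total number
// of outer-loop iterations completed (hypothetical helper, for illustration).
inline std::uint64_t run_stress(int nthreads, std::chrono::milliseconds interval) {
    assert(nthreads > 0);
    std::mutex central_lock;                 // stand-in for the lock under test
    std::atomic<bool> stop{false};
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            std::uint64_t local = 0;
            do {
                central_lock.lock();
                delay_loop(200);             // critical section
                central_lock.unlock();
                delay_loop(5000);            // non-critical section
                ++local;
            } while (!stop.load(std::memory_order_relaxed));
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    std::this_thread::sleep_for(interval);
    stop.store(true, std::memory_order_relaxed);
    for (auto& w : workers) w.join();
    return total.load();
}
```

Calling `run_stress(32, std::chrono::seconds(10))` would approximate one data point of the experiment; the sketch touches no memory in the non-critical section beyond the stack, mirroring the property noted above.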
malloc scalability benchmarks
In Figure 6 we use the mmicro malloc-free scalability benchmark from [26] . In this case we use the default Solaris libc memory allocator, which is implemented as a splay tree protected by a central mutex. While not scalable, this allocator yields a dense heap and small footprint and thus remains the default. Mmicro spawns a set of concurrent threads, each of which iterates as follows: allocate and zero 1000 blocks of length 1000 bytes and then release those 1000 blocks. The measurement interval is 50 seconds and we report the median of 7 runs. The benchmark reports the aggregate malloc-free rate. Each malloc and free operation acquires the central mutex. The benchmark suffers from competition for LLC residency, and, at above 16 threads, from pipeline competition. Under CR, fewer threads circulating over the central mutex lock in a given period also yields fewer malloc-ed blocks in circulation which in turn yields better hit rates for core-level DTLBs and caches.
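As an illustration of the mmicro loop described above, the following sketch performs one allocate-zero-release pass per call. It is not the benchmark's source: `mmicro_episode` and `run_mmicro` are hypothetical names, and the driver runs a fixed number of passes per thread rather than a timed interval.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

// One pass: allocate and zero `count` blocks of `size` bytes, then free them.
// Returns the number of malloc-free pairs actually completed.
inline std::size_t mmicro_episode(std::size_t count = 1000, std::size_t size = 1000) {
    assert(size > 0);
    std::vector<void*> blocks;
    blocks.reserve(count);                    // bookkeeping only, not part of the workload
    for (std::size_t i = 0; i < count; ++i) {
        void* p = std::malloc(size);          // acquires the allocator's central mutex
        if (p == nullptr) break;
        std::memset(p, 0, size);
        blocks.push_back(p);
    }
    for (void* p : blocks) std::free(p);      // each free acquires the mutex again
    return blocks.size();
}

// Hypothetical driver: `nthreads` threads each run `episodes` passes.
inline std::size_t run_mmicro(int nthreads, int episodes) {
    std::atomic<std::size_t> pairs{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            for (int e = 0; e < episodes; ++e) pairs.fetch_add(mmicro_episode());
        });
    }
    for (auto& w : workers) w.join();
    return pairs.load();
}
```

With a single-mutex allocator, every `malloc` and `free` in the sketch contends for the same lock, which is the situation in which a CR lock limits the number of threads, and hence blocks, in circulation.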
In addition to the usual benefits of CR noted above, CR provides additional and surprising modes of benefit for simple malloc-free allocators. Assume we have an allocator that protects a unified heap with a single pthread mutex. The default Solaris libc allocator is of this design. Our application, mmicro, spawns N concurrent threads that each loop, allocating a set of blocks via malloc and then releasing those blocks. Furthermore assume the malloc lock is highly contended and that we are using an LD_PRELOAD interposition library to replace the normal pthread mutex with a lock providing CR. As such we will enjoy fewer threads circulating over the lock in unit-time. Fewer threads circulating implies that fewer malloc-ed blocks are circulating. In turn, this yields better hit rates in the core-level DTLBs and caches. Consider an "episode", a sequence of malloc operations performed by a given thread in the mmicro benchmark. CR provides the following benefits:
Kyoto Cabinet kccachetest
In Figure 7 we show the benefits of CR for the Kyoto Cabinet [31] kccachetest benchmark, which exercises an in-memory database. We modified the benchmark to use standard POSIX pthread mutex locks and to run for a fixed amount of time and then report the aggregate work completed. We used a 300 second measurement interval and took the median of 3 runs. Finally, the key range for a run was computed as a function of the number of threads, making it difficult to compare performance while varying the thread count. We fixed the key range at 10M elements.
Peak performance occurs at about 5 threads, dropping rapidly as we increase the number of threads. Analysis of the program with hardware performance counters shows the LLC miss rate is responsible for the drop between 5 and 16 threads. After 16 threads MCS-S and MCS-STP suffer from both increasing LLC misses and
producer-consumer benchmark
Figure 8 illustrates the benefits of CR on the producer consumer benchmark from the COZ package [16] . The benchmark implements a bounded blocking queue by means of a pthread mutex, a pair of pthread condition variables to signal not-empty and not-full conditions, and a standard C++ std::queue<int> container for the values. (This implementation idiom - a lock; a simple queue; and two condition variables - is common). Threads take on fixed roles, acting as either producers or consumers. The benchmark spawns N concurrent threads, each of which loops, producing or consuming according to its role. We fix the number of consumers at 3 threads and vary the number of producers on the X-axis. We report the number of messages conveyed at the end of a 10 second measurement interval, taking the median of 7 distinct trials.
When the arrival rate of producers exceeds that of consumer threads, and under a classic FIFO lock, producers will acquire the lock and then typically find the queue is full and thus block on the condition variable, releasing the lock. Eventually they reacquire the lock, insert the value into the queue, and finally release the lock 36 . Each conveyed message requires 3 lock acquisitions - 2 by the producer and one by the consumer. The critical section length for producers is artificially increased by futile acquisitions where the producer immediately surrenders the lock and blocks on the condition variable. When the condition variable is subsequently signaled, the producer moves to the tail of the lock queue. Producers typically block 3 times: first on arrival to acquire the lock; on the condition variable; and on reacquisition of the lock. Ownership of the lock circulates over all participating threads.
36 The condition variable implementation used in these experiments provides FIFO order.
Under a CR lock we find the system tends to enter a desirable "fast flow" mode where the futile acquisition by producers is avoided and each conveyed message requires only 2 lock acquisitions. Threads tend to wait on the mutex instead of on condition variables. Given sufficient threads, ownership continuously circulates over a small stable set of producers and consumers. (As usual, long-term fairness enforcement ensures eventual participation of all threads). We note that CR's mode of benefit for the other benchmarks involves competition for fixed shared resources, whereas producer consumer demonstrates benefits from reduced lock acquisition rates and hold times 37 .
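The queue idiom used by the producer consumer benchmark (a lock, a simple queue, and two condition variables) can be sketched as below. This is an illustrative reconstruction using the C++ standard library rather than the benchmark's pthread code; the class name `BoundedQueue` and its capacity parameter are assumptions.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Bounded blocking queue: one mutex, one std::queue<int>, and a pair of
// condition variables for the not-full and not-empty conditions.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void put(int value) {
        std::unique_lock<std::mutex> guard(lock_);
        // A producer that finds the queue full releases the lock and blocks
        // here; this is the "futile acquisition" discussed in the text.
        not_full_.wait(guard, [&] { return items_.size() < capacity_; });
        items_.push(value);
        not_empty_.notify_one();
    }

    int take() {
        std::unique_lock<std::mutex> guard(lock_);
        not_empty_.wait(guard, [&] { return !items_.empty(); });
        int value = items_.front();
        items_.pop();
        not_full_.notify_one();
        return value;
    }

private:
    const std::size_t capacity_;
    std::mutex lock_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::queue<int> items_;
};
```

In the "fast flow" mode, producers rarely reach the `not_full_.wait` blocking path, so each message costs one `put` acquisition and one `take` acquisition.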
keymap benchmark
The keymap benchmark in Figure 9 spawns a set of concurrent threads, each of which loops executing a critical section followed by a non-critical section. At the end of a 10-second measurement interval the benchmark reports the aggregate throughput as the total number of loop iterations completed by all the threads. The non-critical section advances a C++ std::mt19937 pseudo-random number generator 1000 times. The critical section acquires a central lock and then picks a random index into its thread-local keyset array. Each keyset array contains 1000 elements and is initialized to random keys before the measurement interval. With probability P = .9 the thread then extracts a key from its keyset and updates a central C++ std::unordered_map<int,int> instance with that key. With probability P = .1 the thread generates a new random key in the range [0, 10000000), updates the keyset index with that key, and then updates the shared map. All pseudo-random generators are thread-local and uniform. To reduce allocation and deallocation during the measurement interval, we initialize all 10000000 keys in the map prior to spawning the threads.
Keymap models server threads with short-lived session connections and moderate temporal key reuse and memory locality between critical sections executed by a given thread. There is little or no inter-thread CS access locality or similarity, however. Threads tend to access different regions of the CS data. The NCS accesses just a small amount of memory, and CR provides benefit by moderating inter-thread competition for occupancy of CS data in the shared LLC.
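A single keymap worker can be sketched as follows. This is a hedged illustration, not the benchmark's code: `KeymapWorker` is a hypothetical name, `std::mutex` stands in for the lock under test, and the "update" of the shared map is assumed here to be an increment of the mapped value, which the description does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <random>
#include <unordered_map>
#include <vector>

struct KeymapWorker {
    std::mt19937 rng;            // thread-local generator
    std::vector<int> keyset;     // thread-local keyset array
    int key_range;               // keys are drawn from [0, key_range)

    KeymapWorker(unsigned seed, std::size_t keyset_size, int range)
        : rng(seed), keyset(keyset_size), key_range(range) {
        assert(keyset_size > 0 && range > 0);
        std::uniform_int_distribution<int> key(0, key_range - 1);
        for (int& k : keyset) k = key(rng);   // initialize to random keys
    }

    // Non-critical section: advance the generator 1000 times.
    void non_critical_section() { rng.discard(1000); }

    // Critical section: pick a random keyset slot; with probability 0.1
    // replace it with a fresh random key; then update the shared map.
    void critical_section(std::mutex& lock, std::unordered_map<int, int>& map) {
        std::lock_guard<std::mutex> guard(lock);
        std::uniform_int_distribution<std::size_t> slot(0, keyset.size() - 1);
        std::size_t i = slot(rng);
        if (std::bernoulli_distribution(0.1)(rng)) {
            keyset[i] = std::uniform_int_distribution<int>(0, key_range - 1)(rng);
        }
        ++map[keyset[i]];        // assumed form of the "update"
    }
};
```

Because each worker mostly revisits its own 1000 keys, successive critical sections by the same thread touch overlapping map entries, while different threads touch largely disjoint ones; this is the locality pattern that CR exploits.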
leveldb benchmark
In Figure 10 we used the db_bench readwhilewriting benchmark in the leveldb version 1.18 database 38 , varying the number of threads. Both the central database lock and internal LRUCache locks are highly contended, and amenable to CR, which reduces the LLC miss rate.
Discussion
MCSCR is robust under varying load and adapts the size of the ACS quickly and automatically, providing predictable performance. For instance, if the system load increases and we find more ready threads than logical processors, then the system will start to multiplex the logical processors via involuntary preemption. Contention on an already contended lock may increase because of lock holder preemption. The MCS chain will grow, but MCSCR responds immediately by increased culling, reducing the size of the ACS to a level appropriate for the new system load.
The implementation of MCSCR is entirely in user-space and requires no special operating system support. No stateful adaptive mechanisms are employed, resulting in more predictable behavior and faster response to changing conditions. The only tunable parameter is how frequently the unlock operator should pick the eldest thread from the passive set, which controls fairness.
Performance does not degrade as load increases, mitigating the need for load-specific and application-specific tuning.
Parameter Parsimony
37 Medhat et al. [51] explored the interaction of waiting policies on CPU sleep states for producer-consumer applications.
38 leveldb.org
Concurrency restriction techniques may reduce involuntary preemption rates by reducing the number of ready threads competing for available CPUs. This also serves to reduce lock-holder preemption and convoying 39 .
Involuntary preemption, which typically manifests when there are more ready threads than logical CPUs and the operating system is forced to multiplex the CPUs via time slicing, can cause both lock holder preemption and lock waiter preemption. The former concern can be addressed in various ways [24] , including deferred preemption via schedctl. Lock holder preemption can also be mitigated by a "directed yield" facility, which allows a running thread to donate its time slice to a specified thread that is ready but preempted. This allows threads waiting on a lock to grant CPU time to the lock holder, reducing queueing and convoying. Lock waiter preemption entails handoff by direct succession to a thread that is waiting but was preempted. MCS, for instance, is vulnerable to lock waiter preemption 40 , whereas simple TAS locks are not.
Directed Yield = YieldTo or SwitchTo;
It is sometimes possible to use schedctl with direct succession locks to avoid handoff to a preempted waiter. The thread in unlock() examines the schedctl state of the tentative successor. If that thread was preempted, then it picks some other thread instead. Early experiments suggest that it is helpful to use the schedctl facility to detect preempted threads on the MCS chain. The unlock() operator can check for and evict such threads from the chain, forcing them to recirculate and "re-arrive" after they are eventually dispatched, making schedctl augmented MCS far more tolerant of waiter preemption and reducing the incidence of ownership being transferred to a preempted thread.
CR actively reduces the voluntary context switch rate. Since the passive set can remain stable for prolonged periods, threads in the passive set incur less context switching, which in turn means that the CPUs on which those threads were running may be eligible to use deeper sleep states and enjoy reduced power consumption and more thermal headroom for turbo mode 41 . Relatedly, CR acts to reduce the number of threads concurrently spinning on a given lock, reducing wastage of CPU cycles. Voluntary blocking reduces the involuntary preemption rate -having fewer ready threads results in less preemption.
CR actively reduces park-unpark voluntary context switching rates by keeping the ACS stable and minimal. Reducing the park-unpark rate in turn acts to reduce the CPU transitions between idle and running.
Absent CR, lock ownership can circulate over a larger number of threads (CPUs) in a given period. Some of those threads may wait by blocking in the kernel, potentially making their CPUs become idle. Rapid circulation of the lock over this larger circulating set may cause CPUs to shift between idle and non-idle more rapidly, both incurring latency in the idle to non-idle transition, and also prohibiting the CPUs underlying the ACS from reaching deeper energy-saving states. Those deeper sleep states also enable more aggressive turbo mode [61] for other sibling cores, allowing threads on those cores to run faster.
By minimizing the size of the ACS, we tend to fully utilize the set of CPUs hosting the ACS threads. These CPUs do not become idle, and thus do not incur latency exiting deeper sleep states. Furthermore other CPUs not hosting ACS threads may enjoy longer idle periods and deeper idle sleep states, thus improving energy consumption and possibly allowing more available thermal headroom for either other unrelated threads, or to allow members of the ACS to run at higher clock rates.
Idle state; sleep state; park CPU; C-state; P-state;
A common admonition is to never run with more threads than cores. This advice certainly avoids some types of scaling collapse related to resource competition, but is not generally true, ignoring the potential benefit of memory-level parallelism (MLP), threads that alternate between computation and blocking IO, etc. Many applications achieve peak throughput with far more threads than cores. Such advice also assumes a simplistic load with just one application, whereas servers may run in conditions of varying load and multi-tenancy, with multiple concurrent unrelated and mutually-unaware applications. Even within a single complex application we can find independent components with their own sets of threads, or thread pools. CR provides particular benefit in such real-world circumstances.
We observe that in some situations, no speedup is to be had even as we increase the number of threads, even though the NCS length might be far larger than the CS length. This happens frequently when the CS and NCS lengths are short, in which case lock administrative overheads may dominate throughput. In practice, locks are never ideal. Speedup can be impaired by coherence traffic - sharing that involves writes, resulting in invalidation and misses - because of data accesses in the critical section. Lock metadata access will also induce coherence traffic. Other overheads related to handover latency include the branch mispredict stalls that are incurred when a thread exits a busy-wait loop.
Related Work
Locks continue to underpin most applications and remain a key synchronization construct. They remain the topic of numerous recent papers [5, 8-10, 15, 24, 27, 32, 35, 36, 43, 62] .
Our work is most closely related to that of Johnson et al. [40] , which uses load and admission control to set the number of threads allowed to spin concurrently on a contended lock. Their approach requires a daemon thread to detect and respond to contention whereas ours uses timely decentralized decision making. Chadha et al. [11] identified cache-level thrashing as a scalability impediment and proposed system-wide concurrency throttling. Throttling concurrency to improve throughput was also suggested by Raman et al. [58] and Pusukuri et al. [56] . Chandra et al. [12] and Brett et al. [6] analyzed the impact of inter-thread cache contention. Heirman et al. [37] suggested intentional undersubscription of threads as a response to competition for shared caches. Mars et al. [50] proposed a runtime environment to reduce cross-core interference. Porterfield et al. [55] suggested throttling concurrency in order to constrain energy use. Zhuravlev et al. [65] studied the impact of kernel-level scheduling decisions - deciding which and where to dispatch ready threads - on shared resources, but did not investigate the decisions made by lock subsystems. Cui et al. [14] studied lock thrashing avoidance techniques in the Linux kernel where simple ticket locks with global spinning caused scalability collapse. They investigated using spin-then-park waiting and local spinning, but did not explore CR.
Like our approach, Cohort locks [26] explored the trade-off between throughput and short-term fairness. Cohort locks restrict the active circulating set to a preferred NUMA node over the short term. They sacrifice short-term fairness for aggregate throughput, but still enforce long-term fairness. NUMA-aware locks exploit the inter-socket topology, while our approach focuses on intra-socket resources. The NUMA-aware HCLH lock [48] edits the nodes of a queue-based lock in a fashion similar to that of MCSCR, but does not provide CR and was subsequently discovered to have an algorithmic flaw.
Johnson et al. [41] and Lim et al. [46] explored the trade-offs between spinning and blocking.
Ebrahimi et al. [28] proposed changes to the system scheduler, informed in part by lock contention and mutual inter-thread DRAM interference, to shuffle thread priorities in order to improve overall throughput.
Hardware and software transactional memory systems use contention managers to throttle concurrency in order to optimize throughput [64] . The issue is particularly acute for transactional memory as failed optimistic transactions are wasteful of resources.
Various hardware schemes have been proposed to mitigate LLC thrashing, but none are available in commonly available processors [60] . Intel [39] allows static partitioning of the LLC in certain models designed for real-time environments.
Conclusion
Modern multicore systems present the illusion of having a large number of individual independent "classic" processors, connected via shared memory. This abstraction, which underlies the SMP programming model, is a useful simplification for programmers. In practice, however, the logical processors comprising these multicore systems share considerable infrastructure and resources. Contention for those shared resources manifests in surprising performance issues.
* Destructive interference means we often face negative-sum situations. * sub-additive * Illusional; notional; illusory; * Locks are in the business of medium-term scheduling. * Performance isolation failure;
Multicore systems are fundamentally a deceit. Most of the time we live happily with the "SMP" illusion that we have a large number of independent processors. MLP and lowered communication costs act as palliative factors. But sometimes we have to face reality and deal with the fact that there are really lots of shared resources, subject to contention and even destructive interference, under the facade.
We describe a lock admission policy - concurrency restriction - that is intentionally unfair over the short term. Our algorithm intentionally culls excess threads into an explicit passive set. CR moderates and reduces the size of the active circulating set. Periodically, we reschedule, shifting threads between the active and passive sets, affording long-term fairness. CR conserves shared resources and can reduce thrashing effects and performance drop that can occur when too many threads compete for those resources. CR provides a number of modes of benefit for the various types of shared and contended resources. We further show the subtle interplay of waiting policy, which must be carefully selected to fully leverage CR.
While scalability collapse is not uncommon, it remains a challenge to characterize which shared resources underlie a drop in performance. The analysis is difficult and in our experience, multiple resources are often involved. It's also difficult to determine in advance if CR will provide any benefit. However, since CR typically does no harm, the decision to use it is simple. CR gates access to the resources involved in scalability collapse by moderating access to locks - an unrelated resource. In the future we hope to employ more direct means to measure and control scalability collapse. Locks remain convenient, however, and detecting oversubscription (contention) is relatively simple compared to determining when some of the complex hardware resources are oversubscribed. Contention is a convenient but imprecise proxy for overthreading.
* graceful; predictable; robust; automatic; autonomic; * adjusts automatically and promptly to varying load; * respond; response; react; adapt; adjust; lag; latency; promptly; reaction time; response time; * Unsatisfying; * CR works in concert with both operating system and hardware * under high load - handles preemption gracefully; under varying load - fast response time
10.1 Future Work
Throttling in current CR designs is driven by the detection of contention. In the future we hope to vary the admission rate (and the ACS size) in order to maximize lock transit rates, possibly allowing non-work-conserving admission [33] . This attempts to close the performance "gap" between saturation and peak shown in Figure 1. We also plan to apply intentionally unfair CR-based activation policies to the pthread_cond condition variable construct, waking the most recently arrived threads. This approach shows promise for pools of worker threads where idle threads wait on a central condition variable. We also intend to explore energy-efficient locking in more detail, and the performance advantages of CR on energy-capped systems.
Classic CR is concerned with the size of the ACS. But we can easily extend CR to be NUMA-aware by taking the demographics of the ACS into account in the culling criteria. For NUMA environments we prefer the ACS to be homogeneous and composed of threads from just one NUMA node. This reduces the NUMA-diversity of the ACS, reduces lock migrations and improves performance. Our MCSCRN design starts with MCSCR, but we add two new fields: the identity of the currently preferred "home" NUMA node, and a list of remote threads. At unlock-time, the owner thread inspects the next threads in the MCS chain and culls remote threads from the main chain to the remote list. A thread is considered remote if it runs on some node other than the currently preferred node. Periodically, the unlock operator also selects a new home node from the threads on the remote list, and drains threads from that node into the main MCS chain. This confers long-term fairness. If we encounter a deficit on the main list at unlock-time, then we simply reprovision from the remote list. Early experiments with NUMA-aware CR show that MCSCRN performs as well as or better than CPTLTKTD [26] , the best known cohort lock. In addition, cohort locks require one node-level lock for each NUMA node. Because of padding and alignment concerns to avoid false sharing, those node-level locks themselves are large. MCSCRN avoids that - the lock size is fixed and small. In the uncontended case, cohort locks require acquisition of both the node-level and top-level locks, although a fast-path can be implemented that tries to avoid that overhead by opportunistically bypassing the node-level locks under conditions of no or light contention when cohort formation is not feasible. MCSCRN is non-hierarchical, and avoids that concern, always using the fast-path.
The system tends to converge quickly to a steady-state where the arriving threads are largely from the home node, so accesses to lock metadata elements avoid inter-node coherence traffic. Finally, we note that it is a challenge to implement polite spin-then-park waiting in CPTLTKTD, but it is trivial to do so in MCSCRN. MCSCRN will be the topic of a subsequent publication.
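The unlock-time culling policy described for MCSCRN can be modeled sequentially as below. This is only a single-threaded model of the policy, not a lock and not the MCSCRN implementation: the type names, the use of `std::deque` in place of the MCS chain, and the choice to select a new home node whenever the main chain runs dry are all assumptions made for illustration.

```cpp
#include <cassert>
#include <deque>

struct Waiter {
    int id;     // hypothetical thread identifier
    int node;   // NUMA node the thread runs on
};

// Sequential model of the policy: `chain` plays the role of the main MCS
// chain and `remote` the list of culled remote threads.
struct McscrnModel {
    int home = 0;
    std::deque<Waiter> chain;
    std::deque<Waiter> remote;

    // Select a new home node from the remote list and drain its threads,
    // in order, into the main chain.
    void rehome() {
        assert(!remote.empty());
        home = remote.front().node;
        std::deque<Waiter> keep;
        for (const Waiter& w : remote) (w.node == home ? chain : keep).push_back(w);
        remote.swap(keep);
    }

    // Models the successor choice at unlock-time; returns -1 if nobody waits.
    int pick_successor() {
        // Cull remote threads at the head of the chain to the remote list.
        while (!chain.empty() && chain.front().node != home) {
            remote.push_back(chain.front());
            chain.pop_front();
        }
        // Deficit on the main chain: reprovision from the remote list.
        if (chain.empty() && !remote.empty()) rehome();
        if (chain.empty()) return -1;
        Waiter next = chain.front();
        chain.pop_front();
        return next.id;
    }
};
```

A periodic call to `rehome()` from the unlock path, even when the main chain is not empty, would model the long-term fairness mechanism; the period is the kind of tunable the text refers to.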
A. Additional lock formulations that provide CR
We provide additional examples to illustrate the ease with which locks providing CR can be constructed.
Construction of CR variants; Construction of other locks with CR; Variations that provide CR; Other lock formulations that provide CR; Additional lock formations that provide CR; Additional lock constructions that provide CR; Algorithmic sketches
A.1 LOITER Locks
Simple TAS or more polite test-and-test-and-set spin locks can be deeply unfair. A thread can repeatedly barge in front of and bypass threads that have waited longer. A simple TAS lock without backoff can also suffer from considerable futile coherence traffic when the owner releases the lock and the waiting threads observe the transition and N such spinning threads pounce, trying to obtain ownership via an atomic instruction, producing the thundering herd effect. N − 1 will fail, but in doing so force coherence traffic on the underlying cache line. As such, modern TAS locks are typically augmented with randomized back-off, which reduces coherence traffic from polling and also reduces the odds of futile attempts to acquire the lock. Back-off strives to balance those benefits against reduced lock responsiveness. Longer back-off periods entail longer possible "dead time" where the lock has been released but the waiting threads have not detected that transition 42 . Traditional randomized back-off for TAS locks is anti-FIFO in the sense that threads that have waited longer are less likely to acquire the lock in unit time. Absent remediation, such back-off may partition threads into those that wait for long periods and those that wait for short periods and circulate rapidly 43 .
* Unfair: admission order is decoupled from arrival order * Barge; Overtake; bypass; jump * Lunge; Pounce; * Dead time; lag; latency; hand-over responsiveness; conveyance; * Admission order is decoupled from arrival order
List of LOITER optimizations: inverted schedctl for spinners; Limit number of concurrent spinners; wakeup-unpark deferral and avoidance; fence elision; anti-spinning by standby thread; fairness enforcement;
42 Arguably, back-off is not work conserving.
43 The back-off can also provide inadvertent and unintended but beneficial concurrency restriction.
Anderson [3] suggested the following partial remedy for the thundering herd effect. After a spinning thread observes the lock is free, it delays for a short randomized period and then re-checks the lock. If the lock is not free then the thread resumes normal spinning. Otherwise the thread attempts the atomic operation to acquire the lock. This technique has lost favor in modern TAS locks, but is useful when used in conjunction with MWAIT.
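The delay-and-recheck remedy described above can be sketched as a test-and-test-and-set lock. This is a minimal sketch under stated assumptions, not Anderson's original code: the class name `TasLock`, the 0 to 63 iteration delay range, and the use of `std::this_thread::yield()` while polling are all illustrative choices.

```cpp
#include <atomic>
#include <cassert>
#include <random>
#include <thread>

class TasLock {
public:
    void lock() {
        thread_local std::minstd_rand rng(std::random_device{}());
        for (;;) {
            // Normal spinning: poll with loads only until the lock looks free.
            while (held_.load(std::memory_order_relaxed)) std::this_thread::yield();
            // Lock observed free: delay for a short randomized period ...
            int pause = std::uniform_int_distribution<int>(0, 63)(rng);
            for (volatile int i = 0; i < pause; i = i + 1) {}
            // ... then re-check. If it was taken meanwhile, resume spinning
            // without issuing a futile atomic operation.
            if (held_.load(std::memory_order_relaxed)) continue;
            // Otherwise attempt the atomic operation to acquire the lock.
            if (!held_.exchange(true, std::memory_order_acquire)) return;
        }
    }

    void unlock() {
        assert(held_.load(std::memory_order_relaxed));
        held_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> held_{false};
};
```

The randomized delay spreads out the waiters that observed the release at the same moment, so most of them see the lock already re-acquired on the re-check and never touch the cache line with a write.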
Fairness of TAS locks is further determined by platform-specific aspects of the system such as the underlying hardware arbitration mechanism for cache lines. On Intel and SPARC, threads running "near" the most recent owner - near in the system topology - may enjoy a persistent statistical advantage acquiring the lock, dominating ownership. On some platforms, threads on the home node of the memory underlying the lock will have a persistent advantage. Somewhat perversely, such behavior can be NUMA-friendly over the short-term as it tends to reduce lock migrations. The unfairness can persist for long periods, however.
* Lock Philopatry; propinquity; inertia * Matthew effect; * Fairness measures deviation of admission order from ideal strict FIFO.
On Intel and SPARC processors if a thread T releases a TAS lock, siblings of T from the same core or NUMA node enjoy a statistical advantage in next acquiring the lock. Threads from a given node may dominate ownership of a lock for extended periods.
Despite these disadvantages, TAS locks confer a key benefit: the lock is never passed to a preempted thread as might be the case with MCS. This reduces undesirable convoying behavior and latencies waiting for a ready but descheduled thread to again be dispatched onto a CPU. Furthermore, waiting threads do not need to "register" or otherwise make themselves visible to threads performing the unlock operation, reducing administrative overheads and coherence costs related to lock metadata. As such, these locks perform better under mixed load, and in particular when the number of runnable threads exceeds the number of logical CPUs. They also have very low latency hand-off under light or no contention.
We design a simple TAS lock enhanced with CR as follows. Our new LOITER (Locking : Outer-Inner with ThRottling) lock has an outer TAS lock. Arriving threads try to obtain the outer lock using a bounded spin phase - busy waiting - with randomized back-off. If they acquire the outer lock, they can enter the critical section. We refer to this as the fast-path. If the spinning attempt fails, control then reverts to an inner lock. An MCS lock with spin-then-park waiting is suitable for use as the inner lock. The thread that manages to acquire the inner lock is called the standby thread - there is at most one standby thread per lock at any given moment. The inner lock constitutes a so-called slow path. The standby thread then proceeds to contend for the outer lock. Again, it uses a spin-then-park waiting policy. When the standby thread ultimately acquires the outer lock it can enter the critical section. At unlock time, if the current owner acquired the lock via the slow path, it releases both the outer lock and the inner lock. Otherwise it releases the outer lock and, if a standby thread exists, it unparks that standby thread as the heir presumptive 44 .
The ACS consists of the owner, threads passing through their NCSes, and threads spinning in the fast path arrival phase. The PS consists of threads waiting for the inner lock. The standby thread is on the cusp and is transitional between the two sets. Under steady state the system converges to a mode where we have a stable set of threads circulating over the outer lock (the ACS), at most one thread spinning or parking in the standby position, and the remainder of the threads are blocked on the inner locks (the PS).
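The outer-inner structure described above can be sketched in simplified form as follows. This is a hedged illustration rather than the LOITER implementation: `std::mutex` stands in for the inner MCS lock, waiting uses `std::this_thread::yield()` in place of randomized back-off and spin-then-park (so no unpark step is needed in unlock), and the class name `LoiterSketch` and the spin bound of 64 are assumptions. No long-term fairness mechanism is included.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

class LoiterSketch {
public:
    void lock() {
        // Fast path: bounded polite spinning on the outer TAS lock.
        for (int attempt = 0; attempt < kArrivalSpins; ++attempt) {
            if (try_outer()) {
                via_slow_path_ = false;
                return;
            }
            std::this_thread::yield();
        }
        // Slow path: whoever holds the inner lock is the unique standby
        // thread; everyone else blocked on it forms the passive set.
        inner_.lock();
        while (!try_outer()) std::this_thread::yield();
        via_slow_path_ = true;
    }

    void unlock() {
        // Read the flag before releasing the outer lock, which protects it.
        const bool slow = via_slow_path_;
        outer_.store(false, std::memory_order_release);
        // An owner that arrived via the slow path also releases the inner
        // lock, admitting the next standby thread.
        if (slow) inner_.unlock();
    }

private:
    // Test-and-test-and-set attempt on the outer lock.
    bool try_outer() {
        return !outer_.load(std::memory_order_relaxed) &&
               !outer_.exchange(true, std::memory_order_acquire);
    }

    static constexpr int kArrivalSpins = 64;
    std::atomic<bool> outer_{false};
    std::mutex inner_;             // stand-in for the inner MCS lock
    bool via_slow_path_ = false;   // written and read only by the owner
};
```

In this sketch the threads that succeed on the fast path, together with the owner and threads in their NCS, play the role of the ACS, while threads blocked in `inner_.lock()` play the role of the PS.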
We impose long-term fairness by detecting that the standby thread has waited too long and is "impatient", in which case it requests direct handoff of ownership to the standby thread upon the next unlock operation. This construction attempts to retain the desirable properties of TAS-based locks while providing CR and long-term fairness. The result is a hybrid that uses competitive handoff in most cases, reverting to direct handoff as part of an anti-starvation mechanism when the standby thread languishes too long.
44 This design admits a minor but useful optimization. Normally the store to release the lock would need to be followed by a store-load memory barrier (fence) before the load that checks for existence and identity of the standby thread. That barrier can be safely elided if the standby thread uses a timed park operation that returns periodically. Instead of avoiding the race - which arises from architectural reordering and could result in progress failure because a thread in unlock fails to notice a just-arrived waiting thread - we tolerate the race but recover as necessary via periodic polling in the standby thread. On platforms with expensive barrier operations, this optimization can improve performance under no or moderate contention.
* CR makes choice of "main" outer lock less important - reduces sensitivity of performance to "main" lock. * Augmentation * Properties imposed by inner lock: CR; NUMA-awareness; fairness * General transformation like cohort locks; wrap abstract outer lock with CR "throttling" construct. Throttling provides K-exclusion. * patience: count of bypasses; time; integral of all waiters; maximum of all waiters
Ideally, we'd prefer to constrain the flow of threads from the PS into the ACS. A simple expedient is to make the standby thread less aggressive than arriving threads when it attempts to acquire the outer lock. A related optimization is to briefly defer waking the standby thread in the unlock path.
If the lock is acquired by some other thread in the interim, then there is no need to unpark the standby thread. In a similar fashion, the Solaris pthread mutex implementation attempts to defer and hopefully avoid the need to awake potential successors. This defer-and-avoid strategy tends to keep the ACS stable and also avoids unpark latencies. Finally, another simple optimization is to constrain the number of threads allowed to spin on the outer lock in the arrival phase. Similarly, the Solaris pthread mutex implementation bounds the number of threads concurrently spinning on a lock 45 .
Delegate; pass; shirk; transfer; convey; impart;
Arriving threads start with global spinning on the outer lock, and if they can't manage to obtain the lock within the arrival spinning phase, they then revert to the MCS lock, which uses local waiting. Global spinning allows more efficient lock hand-over, but local spinning generates less coherence traffic and provides graceful performance under high contention [47] . Threads waiting on the inner MCS lock simply spin or spin-then-park on the thread-local variable, avoiding concerns about back-off policies. All park-unpark activity takes place on paths outside the critical section. The inner lock provides succession by direct handoff via MCS, while the outer lock provides succession by competitive handoff.
The LOITER transformation allows us to convert a lock such as MCS, which uses direct handoff, into a composite form that allows a fast path with barging. The resultant composite LOITER lock enjoys the benefits of both direct handoff and competitive succession, while mitigating the undesirable aspects of each of those policies. Specifically, the new construct uses direct handoff for threads in the slow contention path, but allows competitive succession for threads circulating outside the slow path. It retains the best properties of both MCS and TAS locks.
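The overall shape of such a composite lock can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: `std::mutex` stands in for the inner MCS lock, `std::this_thread::yield()` stands in for park, default sequentially consistent atomics are used for simplicity, and the class name `LoiterSketch` and the constants `kArrivalSpins` and `kPatience` are arbitrary and hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

class LoiterSketch {
public:
    void lock() {
        // Arrival phase: bounded global spinning on the outer TAS lock.
        for (int i = 0; i < kArrivalSpins; ++i) {
            if (try_outer()) return;           // fast path: competitive succession
        }
        // Slow path: queue on the inner lock; its holder is the standby thread.
        std::lock_guard<std::mutex> standby(inner_);
        for (int failures = 0;; ++failures) {
            // Either ownership was handed to us directly, or we won the TAS.
            if (granted_.exchange(false) || try_outer()) {
                impatient_.store(false);       // we now own the outer lock
                return;                        // leaving scope releases inner_
            }
            if (failures == kPatience) {
                impatient_.store(true);        // request direct handoff
            }
            std::this_thread::yield();         // stand-in for park
        }
    }

    void unlock() {
        assert(outer_.load() && "caller must own the lock");
        if (impatient_.load()) {
            granted_.store(true);              // direct handoff: outer_ stays set
        } else {
            outer_.store(false);               // competitive handoff
        }
    }

private:
    bool try_outer() {
        // Polite test-and-test-and-set.
        return !outer_.load(std::memory_order_relaxed) && !outer_.exchange(true);
    }

    static constexpr int kArrivalSpins = 100;
    static constexpr int kPatience = 50;
    std::atomic<bool> outer_{false};
    std::atomic<bool> impatient_{false};
    std::atomic<bool> granted_{false};
    std::mutex inner_;
};
```

Only the standby thread sets or consumes the handoff flags, so a grant is issued only while a standby thread is waiting to receive it. Threads on the fast path never touch the inner lock.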
To further restrict and constrain concurrency, the implementation can restrict or cap the number of threads spinning on a lock at any given moment.
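One simple way to cap concurrent spinners is a per-lock counter that threads must increment before entering the spin phase. The sketch below assumes such a counter; the class name `SpinnerCap` and its methods are hypothetical. A thread that fails to enter would skip spinning and go directly to the slow path.

```cpp
#include <atomic>
#include <cassert>

class SpinnerCap {
public:
    explicit SpinnerCap(int max_spinners) : max_(max_spinners) {}

    // Returns true if the caller may spin; false if the cap is reached.
    bool try_begin_spin() {
        int cur = count_.load(std::memory_order_relaxed);
        while (cur < max_) {
            // On failure, cur is refreshed with the current count.
            if (count_.compare_exchange_weak(cur, cur + 1,
                                             std::memory_order_relaxed)) {
                return true;
            }
        }
        return false;
    }

    // Must be called once for every successful try_begin_spin().
    void end_spin() {
        int prev = count_.fetch_sub(1, std::memory_order_relaxed);
        assert(prev > 0);
        (void)prev;
    }

private:
    const int max_;
    std::atomic<int> count_{0};
};
```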
45 For reference, the Solaris pthread_mutex implementation uses a simple polite test-and-test-and-set lock with a bounded spin duration. If the spin attempt fails, the thread reverts to a slow path where it enqueues itself and parks, giving classic spin-then-park waiting. The "queue" is mostly-LIFO [53]. The schedctl facility is used to avoid spinning if the owner is itself not running on a CPU. The implementation also bounds the number of concurrent spinners. After releasing the lock in pthread_mutex_unlock, the implementation checks if the queue is empty. If so, it returns immediately. Otherwise it waits briefly to see if the lock happens to be acquired by some other thread. If so, the caller can return without needing to dequeue and unpark an heir presumptive. The responsibility for succession and progress is delegated to the new owner. Such unpark avoidance reduces the park-unpark rate and reduces the latency of the unlock operator. The policies of bounding the number of concurrent spinners and unpark avoidance act to effectively constrain the size of the ACS. The TAS lock admits indefinite bypass and unfairness. The implementation also provides wait morphing: if a pthread_cond_signal operation selects a thread that waits on a mutex held by the caller, then that thread is simply transferred from the condition variable's wait-set directly to the mutex's queue of blocked threads, avoiding the need to unpark the notifyee. This operation is fast, and reduces the hold time when pthread_cond_signal is called within a critical section. In addition, we avoid waking a thread while the lock that thread is trying to acquire is already held by the caller, reducing futile and unnecessary contention between the notifier and notifyee. Morphing leverages the observation that it is usually safe to shift a pthread_cond_signal call from within a critical section to immediately after the critical section.

A useful complementary thread-local policy in the spinning phase is to abandon the current spin episode if the TAS atomic operation on the outer lock fails too frequently. This condition indicates a sufficient flow of threads in the ACS over the lock. Another variation is to monitor either the traffic over the lock or the arrival of spinners, and to abandon the spin attempt if the rate or flux is too high. By abandoning the spin attempt early, the thread reverts from spinning to parking. This is tantamount to self-culling.
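The self-culling policy can be sketched as a spin loop that counts failed TAS attempts separately from the overall spin budget. The names `spin_with_self_culling`, `SpinOutcome`, and `TasLock`, as well as the threshold parameters, are hypothetical and chosen for illustration. With polite spinning, a TAS attempt is only issued when the lock appears free, so a failed attempt means another thread won the race.

```cpp
#include <atomic>
#include <cassert>

enum class SpinOutcome { Acquired, BudgetExhausted, SelfCulled };

// Lock must provide appears_held() and try_lock().
template <class Lock>
SpinOutcome spin_with_self_culling(Lock& lock, int spin_budget,
                                   int max_tas_failures) {
    assert(spin_budget >= 0 && max_tas_failures > 0);
    int tas_failures = 0;
    for (int i = 0; i < spin_budget; ++i) {
        if (lock.appears_held()) {
            continue;                          // polite: read-only while held
        }
        if (lock.try_lock()) {
            return SpinOutcome::Acquired;
        }
        // The lock looked free but another thread took it first: evidence
        // that enough threads are already circulating over the lock.
        if (++tas_failures >= max_tas_failures) {
            return SpinOutcome::SelfCulled;    // caller should park
        }
    }
    return SpinOutcome::BudgetExhausted;       // caller should park
}

// Minimal TAS lock usable with the helper above.
struct TasLock {
    std::atomic<bool> held{false};
    bool appears_held() const { return held.load(std::memory_order_relaxed); }
    bool try_lock() { return !held.exchange(true, std::memory_order_acquire); }
    void unlock() { held.store(false, std::memory_order_release); }
};
```

Both non-acquiring outcomes send the thread to the slow path. They are distinguished here only so that a caller could, for example, record how often self-culling triggers.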
If the inner lock is NUMA-friendly (say, a cohort lock), then the aggregate LOITER lock is NUMA-friendly. As threads circulate between the active and passive sets, the inner lock tends to filter out threads from different nodes, and the ACS then tends to converge toward a set of threads located on a given node. Decreased NUMA-diversity of the ACS decreases lock migration rates and yields better throughput.
A.2 LIFO-CR
This design starts with a pure LIFO lock 46 with an explicit stack of waiting threads. Contended threads push an MCS-like node onto the stack and then spin or spin-then-park on a thread-local flag. When threads are waiting, the unlock operator pops the head of the stack (the most recently arrived thread) and directly passes ownership to that thread. Both "push" and "pop" operations are implemented via atomic compare-and-swap (CAS) instructions. Only the lock holder can "pop" elements, so the approach is immune to ABA pathologies. The stack is multiple-producer but, by virtue of the lock itself, single-consumer. The ACS consists of the owner, the threads circulating through their respective NCS regions, and the top of the stack. The PS consists of the threads deeper on the stack. Admission order is effectively cyclic round-robin over the members of the ACS, regardless of the prevailing LIFO lock admission policy. We then augment the lock to periodically pick the tail of the stack (the eldest thread) to be the next owner. This imposes long-term fairness 47 . We refer to the resultant lock as LIFO-CR. LIFO admission order may improve temporal locality and reduce misses in shared caches. Both LIFO-CR and LOITER offer performance competitive with MCSCR.
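The successor-selection policy can be sketched on its own, separate from the lock word and the waiting mechanism. The sketch assumes a fixed fairness period counted in handoffs; the names `Waiter`, `LifoCrStack`, and `pick_successor` are hypothetical. The paper does not specify how "periodically" is measured, so the counter is an assumption. Default sequentially consistent atomics are used for simplicity.

```cpp
#include <atomic>
#include <cassert>

struct Waiter {
    explicit Waiter(int i) : id(i) {}
    int id;
    Waiter* next = nullptr;
};

class LifoCrStack {
public:
    explicit LifoCrStack(unsigned fairness_period) : period_(fairness_period) {
        assert(fairness_period > 0);
    }

    // Any contending thread may push (multiple producers).
    void push(Waiter* w) {
        w->next = head_.load();
        while (!head_.compare_exchange_weak(w->next, w)) {}
    }

    // Lock owner only (single consumer): usually the newest waiter,
    // but every period_-th handoff goes to the eldest waiter.
    Waiter* pick_successor() {
        ++handoffs_;
        return (handoffs_ % period_ == 0) ? take_eldest() : take_newest();
    }

private:
    Waiter* take_newest() {
        Waiter* h = head_.load();
        while (h != nullptr && !head_.compare_exchange_weak(h, h->next)) {}
        return h;
    }

    Waiter* take_eldest() {
        Waiter* h = head_.load();
        if (h == nullptr) return nullptr;
        if (h->next == nullptr && head_.compare_exchange_strong(h, nullptr)) {
            return h;                          // sole waiter
        }
        // At least two waiters: walk to the tail and unlink it. Pushers
        // only touch head_ and their own node, so interior links are ours.
        Waiter* prev = h;
        while (prev->next->next != nullptr) prev = prev->next;
        Waiter* eldest = prev->next;
        prev->next = nullptr;
        return eldest;
    }

    std::atomic<Waiter*> head_{nullptr};
    const unsigned period_;
    unsigned handoffs_ = 0;                    // touched only by the owner
};
```

Walking to the tail costs time proportional to the stack depth, but it happens only once per fairness period.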
Normally the "pop" operator would employ CAS in a loop. We can avoid the loop and implement a constant-time unlock operation by observing that if the CAS fails, then new threads have arrived and pushed themselves onto the stack, and there are at least two elements on the stack. We can thus implement a plausibly LIFO unlock by naively unlinking and passing ownership to the element that follows the thread identified by the failed CAS return value.
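A minimal sketch of this loop-free pop follows. It assumes a stack of intrusive nodes with an atomic head; the names `Node`, `ConstantTimeLifo`, and `pop_observed` are hypothetical. The head observation is split from the CAS only so that the failure path can be exercised deterministically. A strong CAS is required here, because a spurious failure would leave no newer node to step past.

```cpp
#include <atomic>
#include <cassert>

struct Node {
    explicit Node(int i) : id(i) {}
    int id;
    Node* next = nullptr;
};

class ConstantTimeLifo {
public:
    void push(Node* n) {
        n->next = head_.load();
        while (!head_.compare_exchange_weak(n->next, n)) {}
    }

    Node* peek() const { return head_.load(); }

    // Owner only. Equivalent to observing the head and then popping.
    Node* pop() { return pop_observed(peek()); }

    // Owner only. `seen` is a head value the owner observed earlier.
    Node* pop_observed(Node* seen) {
        if (seen == nullptr) return nullptr;   // no waiters were observed
        Node* current = seen;
        if (head_.compare_exchange_strong(current, seen->next)) {
            return seen;                       // common case: true LIFO pop
        }
        // CAS failed: `current` is a newer head pushed above `seen`, so at
        // least two nodes exist and current->next is non-null. Unlink the
        // node just below the new head instead of retrying.
        assert(current != nullptr && current->next != nullptr);
        Node* heir = current->next;
        current->next = heir->next;            // interior link: owner-private
        return heir;
    }

private:
    std::atomic<Node*> head_{nullptr};
};
```

The heir chosen on the failure path is not necessarily the most recent arrival, which is why the resulting order is only approximately LIFO.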
Under LIFO-CR both arriving and departing (unlocking) threads will update the head of the stack, potentially creating an undesirable coherence hot-spot. MCSCR avoids this concern. In practice, however, this does not seem to adversely impact performance. The performance properties of the inner-outer lock and LIFO-CR are approximately the same as MCSCR. This algorithm works particularly well with spin-then-park waiting policies, as the threads near the top of the stack are most likely to run next, but are also the most likely to be spinning instead of blocked, thus avoiding expensive context switching and unpark activity.

* Stack prefix; suffix
* epicycles

CR provides a number of modes of benefit. I'll start with the cache occupancy/residency argument you noted above. Let's assume our CS and NCS exhibit reasonable temporal locality and reuse patterns. Critically, the NCS operations can erode the cache residency of the data accessed in the CS. (I am assuming a single-socket environment with a large shared last-level cache. All the interesting cache effects happen in the LLC.) The more threads circulating, the greater the decay rate of the CS data from the cache. If you know the LLC capacity and can control the CS and NCS "length" and "width" (footprint: the amount of data accessed), it's then pretty easy to set up a contrived case to show the effect. Note that a CS "activation" doesn't have to be long or touch much data to see the impact. A nice example I'm using is where the CS looks up a randomly selected key in a red-black tree. Any individual CS is relatively short and does not touch much data, but the key is the size of the tree. If we restrict concurrency, then the CS suffers fewer misses arising from displacement caused by NCS accesses. The CS then runs faster, and we enjoy better throughput.
B. Scheduler Interactions
The operating system kernel scheduler provides three states for threads: running, ready, and blocked. Running indicates the thread is active on a processor. Ready indicates the thread is eligible to run, but has not been dispatched onto a processor. Blocked indicates the thread is suspended and ineligible for dispatch and execution; the thread is typically waiting for some condition. The park operator transitions a running thread to blocked state and unpark makes a blocked thread ready. The kernel typically manages all ready-running transitions while the lock subsystem, via park-unpark, controls the ready-blocked transitions associated with locking. The kernel scheduler's dispatch function shifts a thread from ready to running. Involuntary preemption via time-slicing shifts a thread from running to ready. Intuitively, park causes a thread to sleep and unpark wakes or resumes that thread, reenabling the thread for subsequent dispatch onto a processor. A parked thread is waiting for some event to occur and notification of that event occurs via a corresponding unpark. We expect the scheduler is itself work conserving with respect to idle CPUs and ready threads. In addition, ready threads will eventually be dispatched and make progress. If there are available idle CPUs, unpark(T) will dispatch T onto one of those CPUs, directly transitioning T from blocked to running. If there are more ready threads than CPUs then the kernel will use preemption to multiplex those threads over the set of CPUs. Threads that are ready but not running wait for a time slice on dispatch queues.

46 If we use a pure LIFO lock then the LWSS should correspond to the ACS size, giving an easy way to measure the ideally minimal ACS size and maximum benefit afforded by CR.

47 It is relatively simple to augment any given unfair lock so that starving threads are periodically given a turn via direct handoff. The Solaris and Windows schedulers employ similar anti-starvation policies. If threads languish too long on the run queue because their effective priority is too low, then they'll be given transient priority boosts until they run. By analogy, this policy can extend to locks, where waiting threads that languish too long can be explicitly granted ownership. This allows our locks to enjoy the benefits of short-term unfairness but explicitly manage long-term unfairness.
A thread is simply a software construct exposing a virtualized logical processor.
Preemption is controlled by the kernel and reflects an involuntary context switch. The victim is changed from running to ready and some other ready thread is dispatched on the CPU and made to run. Preemption is usually triggered by timer interrupts. Typically the kernel resorts to preemption when there are more runnable threads than CPUs. The kernel preempts one thread T running on CPU C in order to allow some other ready thread a chance to run on C. Preemption provides long-term fairness over the set of runnable threads competing for the CPUs. The kernel uses preemption to multiplex M ready threads over N CPUs, where M > N. When a thread is dispatched onto a CPU it receives a time slice (quantum). When the quantum expires, the thread is preempted in favor of some ready thread. Threads that have been preempted transition from running to ready state.
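The three thread states and the four transitions described in this section can be summarized as a small state machine. This is an illustrative model only: the names `ThreadState` and `SimThread` are hypothetical, and the assertions encode the simplifying assumption that each operation is applied only in the state the text describes. Real park-unpark interfaces typically tolerate an unpark that precedes the park.

```cpp
#include <cassert>

enum class ThreadState { Running, Ready, Blocked };

struct SimThread {
    ThreadState state = ThreadState::Ready;

    // Kernel: ready -> running.
    void dispatch() {
        assert(state == ThreadState::Ready);
        state = ThreadState::Running;
    }

    // Kernel: running -> ready, e.g. when the time slice expires.
    void preempt() {
        assert(state == ThreadState::Running);
        state = ThreadState::Ready;
    }

    // Lock subsystem: running -> blocked.
    void park() {
        assert(state == ThreadState::Running);
        state = ThreadState::Blocked;
    }

    // Lock subsystem: blocked -> ready, or directly to running when a
    // work-conserving scheduler has an idle CPU available.
    void unpark(bool idle_cpu_available) {
        assert(state == ThreadState::Blocked);
        state = idle_cpu_available ? ThreadState::Running : ThreadState::Ready;
    }
};
```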
A CPU is either idle or running. A CPU becomes idle when the operating system has no ready threads to dispatch onto that CPU. When a thread on a CPU parks itself and the operating system (OS) scheduler is unable to locate another suitable ready thread to dispatch onto that CPU, the CPU transitions to idle. Subsequently, an idle CPU C switches back to running when some blocked thread T is made ready via unpark and the OS dispatches T onto C. Thread park-unpark transitions can induce CPU running-idle transitions.
The longer a CPU remains idle, the deeper the reachable sleep state. Deeper idle (sleep) states draw less power, and allow more aggressive turbo mode for sibling cores, but such sleep states take longer to enter and exit. Specifically, to leverage the benefits of deeper sleep states, the CPU needs to stay in that state for some period to amortize the entry and exit costs. As such, we prefer to avoid frequent transitions between idle and running states for CPUs.
Frequent park-unpark activity may cause rapid transitions between idle and running CPU states, incurring latencies when unpark dispatches a thread onto an idle CPU and that CPU exits idle state. Furthermore, frequent transitions in and out of idle may prevent a CPU from reaching deeper power-saving idle (sleep) states 48 .
