TWA -- Ticket Locks Augmented with a Waiting Array by Dice, Dave & Kogan, Alex
TWA – Ticket Locks Augmented with a Waiting Array
Dave Dice
Oracle Labs
first.last@oracle.com
Alex Kogan
Oracle Labs
first.last@oracle.com
Abstract
The classic ticket lock [22, 35, 39] consists of ticket and
grant fields. Arriving threads atomically fetch-and-increment
ticket to obtain an assigned ticket value, and then wait for
grant to become equal to that value, at which point the
thread holds the lock. The corresponding unlock operation
simply increments grant. This simple design has short code
paths and fast handover (transfer of ownership) under light
contention, but may suffer degraded scalability under high
contention when multiple threads busy wait on the grant
field – so-called global spinning.
We propose a variation on ticket locks where long-term
waiting threads – those with an assigned ticket value far
larger than grant – wait on locations in a waiting array in-
stead of busy waiting on the grant field. The single waiting
array is shared among all locks. Short-term waiting is accom-
plished in the usual manner on the grant field. The resulting
algorithm, TWA, improves on ticket locks by limiting the
number of threads spinning on the grant field at any given
time, reducing the number of remote caches requiring in-
validation from the store that releases the lock. In turn, this
accelerates handover, and since the lock is held throughout
the handover operation, scalability improves. Under light or
no contention, TWA yields performance comparable to the
classic ticket lock. Under high contention, TWA is substan-
tially more scalable than the classic ticket lock, and provides
performance on par or beyond that of scalable queue-based
locks such as MCS [35] by avoiding the complexity and extra
accesses incurred by the MCS handover operation while also
obviating the need for maintaining queue elements.
We provide an empirical evaluation, comparing TWA against
ticket locks and MCS for various user-space applications, and
within the Linux kernel. TWA avoids the complexity and extra
accesses that scalable queue-based locks, such as MCS, incur in
the handover path, providing performance on par with or beyond
that of MCS at high contention. Relative to MCS we also incur
fewer coherence misses during handover. Under contention, MCS
will generally induce a coherence miss fetching the owner node’s
Next field, and another miss to set the flag field in the
successor’s node, whereas our approach causes only one miss in
the critical handover path.
CCS Concepts • Software and its engineering → Multithreading;
Mutual exclusion; Concurrency control; Process synchronization;
Keywords Locks, Mutexes, Mutual Exclusion, Synchronization,
Concurrency Control
Oracle ID accession number ORA190196
EuroSys 2019 submission number 257
1 Introduction
The classic ticket lock is compact and has a very simple
design. The acquisition path requires only one atomic op-
eration – a fetch-and-add to increment the ticket – and the
unlock path requires no atomics. On Intel systems, fetch-
and-add is implemented via the LOCK:XADD instruction so
the doorway phase [28] is wait-free. Under light or no con-
tention, the handover latency, defined as the time between
the call to unlock and the time a successor is enabled to enter
the critical section, is low. Handover time impacts the scal-
ability as the lock is held throughout handover, increasing
the effective length of the critical section [3, 21]. Typical
implementations use 32-bit integers for the ticket and grant
variables. Rollover is not a concern as long as the number of
concurrently waiting threads on a given lock never exceeds
$2^{32} - 1$ (footnote 1). A ticket lock is in the unlocked state
when ticket and grant are equal. Otherwise the lock is held, and
the number of waiters is given by ticket - grant - 1. Ignoring
numeric rollover, grant always lags or is equal to ticket. The
increment operation in unlock either passes ownership to
the immediate successor, if any, or otherwise sets the state
to unlocked.
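The description above can be made concrete with a minimal C++ sketch. It assumes sequentially consistent std::atomic operations throughout and omits any pause hint in the spin loop; the function names (ticket_acquire, ticket_release, ticket_waiters, ticket_is_unlocked) are ours, chosen for illustration, and are not taken from any particular implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sketch of a classic ticket lock; the field names follow the text.
struct TicketLock {
    std::atomic<std::uint32_t> ticket{0};  // next ticket to be assigned
    std::atomic<std::uint32_t> grant{0};   // ticket value now being served
};

// The lock is free exactly when ticket == grant.
inline bool ticket_is_unlocked(const TicketLock& l) {
    return l.ticket.load() == l.grant.load();
}

// Waiters behind the owner; only meaningful while the lock is held.
inline std::uint32_t ticket_waiters(const TicketLock& l) {
    return l.ticket.load() - l.grant.load() - 1;
}

inline std::uint32_t ticket_acquire(TicketLock& l) {
    // Doorway: one atomic fetch-and-add assigns our ticket.
    const std::uint32_t tx = l.ticket.fetch_add(1);
    // Global spinning: every waiter polls the same grant field.
    while (l.grant.load() != tx) {
        // busy wait
    }
    return tx;
}

inline void ticket_release(TicketLock& l) {
    assert(!ticket_is_unlocked(l));  // releasing a free lock is a caller bug
    // Only the owner writes grant, so a load followed by a store suffices;
    // no atomic read-modify-write instruction is needed.
    l.grant.store(l.grant.load() + 1);
}
```

Because unsigned arithmetic wraps, the ticket - grant - 1 expression stays correct across rollover as long as the number of waiters respects the bound stated above.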
Ticket locks suffer, however, from a key scalability impedi-
ment. All threads waiting for a particular lock will busy wait
on that lock’s grant field. An unlock operation, when it in-
crements grant, invalidates the cache line underlying grant
for all remote caches where waiting threads are scheduled. In
turn, this negatively impacts scalability by retarding the han-
dover step. Ticket locks use global spinning, as all waiting
threads monitor the central lock-specific grant variable.
In Figure-1 we show the impact of readers on a single
writer. We refer to the number of participating caches as the
invalidation diameter [19]. The Invalidation Diameter
benchmark spawns T concurrent threads, with T shown on
the X-axis. A single writer thread loops, using an atomic
fetch-and-add primitive to update a shared location. The
other T − 1 threads are readers. They loop, fetching the
value of that location. The shared variable is sequestered
to avoid false sharing and is the sole occupant of its under-
lying cache sector. We present the throughput rate of the
writer on the Y-axis. As we increase the number of concur-
rent readers, the writer’s progress is slowed. This scenario
models the situation in ticket locks where multiple waiting
threads monitor the grant field, which is updated by the
current owner during handover. The benchmark reports the
writer’s throughput at the end of a 10 second measurement
interval. The data exhibited high variance due to the NUMA
placement vagaries of the threads and the home node of
the variable. As such, for each data point shown, we took the
median of 100 individual runs, reflecting a realistic set of
samples. The system-under-test is described in detail in §4. As
can be seen in the graph, increasing the number of readers
degrades the performance of the writer.
What we really want to measure is store-to-load visibility
latency. That is, we want to know how multiple readers might
affect the lag between a store and when that store becomes
visible to a set of busy-waiting readers. This reflects the
ticket lock handover operation. In theory, for the “Invalidation
Diameter” experiments, the stores could become visible
quickly to the readers, but the store itself is delayed by the
readers, and does not retire or “return” until after some
delay.
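The benchmark structure described above can be sketched as follows. This is our own minimal reconstruction under stated assumptions, not the code used for Figure 1: the function name invalidation_diameter, the batching of 1024 updates between clock reads, and the use of relaxed loads in the readers are all illustrative choices.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// The shared location is aligned and padded so that it is the sole
// occupant of its 128-byte sector, avoiding false sharing.
struct alignas(128) SharedCell {
    std::atomic<std::uint64_t> value{0};
};

// Runs one writer plus (threads - 1) readers for `interval` and returns
// the number of fetch-and-add updates the writer completed.
inline std::uint64_t invalidation_diameter(unsigned threads,
                                           std::chrono::milliseconds interval) {
    assert(threads >= 1);
    SharedCell cell;
    std::atomic<bool> stop{false};

    std::vector<std::thread> readers;
    for (unsigned i = 1; i < threads; ++i) {
        readers.emplace_back([&cell, &stop] {
            std::uint64_t sink = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                sink += cell.value.load(std::memory_order_relaxed);
            }
            (void)sink;
        });
    }

    std::uint64_t writes = 0;
    const auto deadline = std::chrono::steady_clock::now() + interval;
    while (std::chrono::steady_clock::now() < deadline) {
        // Update in batches so that clock reads do not dominate the loop.
        for (int i = 0; i < 1024; ++i) {
            cell.value.fetch_add(1);
        }
        writes += 1024;
    }

    stop.store(true);
    for (std::thread& r : readers) r.join();
    return writes;
}
```

Dividing the returned count by the interval gives the writer's throughput for one run; the measurements reported in the text use a 10 second interval and the median of 100 such runs per data point.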
1 Developers assume this constraint is always satisfied; having more than
$2^{32} - 1$ waiting threads is not considered a practical concern.
2019-07-12 • Copyright Oracle and or its affiliates
arXiv:1810.01573v4 [cs.OS] 10 Jul 2019
Dave Dice and Alex Kogan
The MCS lock [35] is the usual alternative to ticket locks,
performing better under high contention, but also having
a more complex path and often lagging behind ticket locks
under no or light contention. In MCS, arriving threads use
an atomic operation to append an element, commonly called
a “queue node”, to the tail of a linked list of waiting threads,
and then busy wait on a field within that element, avoiding
global spinning. The list forms a queue of waiting threads.
The lock’s tail variable is explicit and the head – the current
owner – is implicit. When the owner releases the lock it
reclaims the element it originally enqueued and sets the flag
in the next element, passing ownership. To convey owner-
ship, the MCS unlock operator must identify the successor,
if any, and then store to the location where the successor
busy waits. The handover path is longer than that of ticket
locks and accesses more distinct shared locations. MCS uses
so-called local waiting where at most one thread is waiting
on a given location at any one time. As such, an unlock op-
eration will normally need to invalidate just one location –
the flag where the successor busy waits. Under contention,
the unlock operator must fetch the address of the successor
node from its own element, and then store into the flag in the
successor’s element, accessing two distinct cache lines, and
incurring a dependent memory access to reach the successor.
In the case of no contention, the unlock operator must use
an atomic compare-and-swap operator to detach the owner’s
element.
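To make the two handover paths concrete, the following is a minimal sketch of the textbook MCS lock in C++, assuming sequentially consistent atomics and caller-supplied queue nodes. The names McsNode, McsLock, mcs_acquire and mcs_release are ours and are not drawn from a specific implementation.

```cpp
#include <atomic>
#include <cassert>

struct McsNode {
    std::atomic<McsNode*> next{nullptr};  // successor in the queue, if any
    std::atomic<bool> flag{false};        // set by the predecessor on handover
};

struct McsLock {
    std::atomic<McsNode*> tail{nullptr};  // explicit tail; the head is implicit
};

inline void mcs_acquire(McsLock& l, McsNode& me) {
    me.next.store(nullptr);
    me.flag.store(false);
    McsNode* prev = l.tail.exchange(&me);  // atomically append our node
    if (prev == nullptr) return;           // queue was empty: we own the lock
    prev->next.store(&me);                 // link behind the predecessor
    while (!me.flag.load()) {
        // local spinning on our own node
    }
}

inline void mcs_release(McsLock& l, McsNode& me) {
    assert(l.tail.load() != nullptr);      // tail is non-null while held
    McsNode* succ = me.next.load();        // first access: our own node
    if (succ == nullptr) {
        // No visible successor: try to detach our node with compare-and-swap.
        McsNode* expected = &me;
        if (l.tail.compare_exchange_strong(expected, nullptr)) return;
        // A successor swapped the tail but has not linked yet; wait for it.
        while ((succ = me.next.load()) == nullptr) {
        }
    }
    succ->flag.store(true);                // second access: successor's node
}
```

The contended release path loads next from the owner's node and then stores into the successor's node, which is the dependent two-line access pattern noted above; the uncontended path falls back to the compare-and-swap on tail.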
[Figure 1 plot: update throughput rate of the writer (M writes/sec, Y-axis) versus Threads (X-axis)]
Figure 1. Invalidation Diameter
One MCS queue node instance is required for each lock
a thread currently holds, and an additional queue node is
required while a thread is waiting on a lock. Queue nodes
can not be shared concurrently and can appear on at most
one queue – be associated with at most one lock – at a
given time. The standard POSIX pthread_mutex_lock and
pthread_mutex_unlock operators do not require scoped or
lexically balanced locking. As such, queue nodes can not
be allocated on stack. Instead, MCS implementations that
expose a standard POSIX interface will typically allocate
elements from thread-local free lists, populated on demand
(footnote 2). MCS requires the address of the queue node inserted
by the owner to be passed to the corresponding unlock operator,
where it will be used to identify a successor, if any. The
standard POSIX interface does not provide any means
to pass information from a lock operation to the corresponding
unlock operator. As such, the address of the MCS queue
node inserted by the owner thread is usually recorded in
the lock instance so it can be conveyed to the subsequent
unlock operation. Accesses
to the field that records the owner’s queue node address may
themselves generate additional coherence traffic, although
some implementations may avoid such accesses to shared
fields by storing the queue node address in a thread-local
associative structure that maps lock addresses to the owner’s
queue node address.
Ticket locks and TWA require no indirection or dependent
accesses in the unlock path and also avoid the need for queue
elements and the management thereof. The queue of waiting
threads is implicit in ticket locks and TWA, and explicit in
MCS. MCS, ticket locks and TWA all provide strict FIFO
admission order.
Ticket locks are usually a better choice under light or
no contention, while MCS locks are more suitable under
heavy contention [6, 8]. By employing a waiting array for
long-term waiting, TWA achieves the best of the two worlds,
as demonstrated by our empirical evaluation with multiple
user-space applications and within the Linux kernel (footnote 3).
We note in passing that under conditions of intermittent
or no contention, when no successor is visible, MCS may
require an atomic compare-and-swap operation to detach
the owner’s queue node from the MCS tail pointer. (When
waiting threads are visible in the MCS chain, the MCS unlock
operator accomplishes succession with just a store into the
next thread’s queue node structure.) In contrast, ticket locks
never require an atomic read-modify-write instruction in
the unlock path. We do not believe, however, that the atomic
instruction accounts for the difference in performance
between MCS and ticket locks at low contention. For instance,
if we replace the load-increment-store sequence found in the
unlock path of the ticket lock algorithm with an atomic
fetch-and-add, we observe no appreciable change in performance.
Broadly, on modern processors, we find that coherence traffic
dictates performance, and that an atomic compare-and-swap
is the performance equivalent of a simple store.
Oddly, when incurring a coherence miss, fetch-and-add is faster than XADD; load-increment-store; add-register-to-
memory; etc.
An additional confounding fact is that under certain loads, the MCS unlock operator may execute futile CAS operations
that generate unnecessary coherence traffic, and that the unlock operator may need to busy-wait to allow an arriving
successor to update the next pointer in the owner’s qnode.
2 Threads might also malloc and free queue nodes as needed, but most
malloc allocators are not sufficiently scalable. Also, many malloc imple-
mentations themselves make use of POSIX locks, resulting in reentry and
recursion if a lock implementation were to try to call malloc which in turn
would need to acquire a lock. We note that the MCS “K42” variant [32, 40]
allows queue nodes to be allocated on stack – they are required only while a
thread waits – but at the cost of a longer path with more accesses to shared
locations.
3 The Linux kernel switched from ticket locks to MCS-based locks in 2014
[14].
2 The TWA Algorithm
TWA builds directly on ticket locks. We add a new waiting
array for long-term waiting. The array is shared amongst all
threads and TWA locks in an address space. Arriving threads
use an atomic fetch-and-increment instruction to advance
the ticket value, yielding the lock request’s assigned ticket
value, and then fetch grant. If the difference is 0 then we
have uncontended acquisition and the thread may enter the
critical section immediately. (This case is sometimes referred
to as the lock acquisition fast-path). Otherwise TWA com-
pares the difference to the LongTermThreshold parameter. If
the difference exceeds LongTermThreshold then the thread
enters the long-term waiting phase. Otherwise control pro-
ceeds to the short-term waiting phase, which is identical to
that of normal ticket locks; the waiting thread simply waits
for grant to become equal to the ticket value assigned to the
thread. While LongTermThreshold is a tunable parameter
in our implementation, we found a value of 1 to be suitable
for all environments, ensuring that only the immediate suc-
cessor waits in short-term mode. All data reported below
uses a value of 1.
A thread entering the long-term waiting phase first hashes
its assigned ticket value to form an index into the waiting
array. Using this index, it fetches the value from the array and
then rechecks the value of grant (footnote 4). If the observed grant
value changed, the thread rechecks the difference between
that new value and its assigned ticket value, and decides once
again on short-term versus long-term waiting. If grant was
unchanged, the thread then busy waits for the waiting array
value to change, at which point it reevaluates grant. When
grant is found to be sufficiently near the assigned ticket
value, the thread reverts to normal short-term waiting. The
values found in the waiting array have no particular meaning,
except to conservatively indicate that a grant value that
maps to that index has changed, and rechecking of grant is
required for waiters on that index. As rollover is a concern
in the waiting array, we use 64-bit integers, so in practice,
rollover never occurs.
The TWA unlock operator increments grant as usual from
U to U + 1 and then uses an atomic operator to increment
the location in the waiting array that corresponds to threads
waiting on ticket value U + 1 + LongTermThreshold, notifying
long-term threads, if any, that they should recheck
grant. An atomic operation is necessary as the location may
be subject to hash collisions. We observe that this change
increases the path length in the unlock operator, but crucially
the store that effects handover, which is accomplished
by a non-atomic increment of grant, happens first. Given
a LongTermThreshold value of 1, we expect at most one
thread, the immediate successor, to be waiting on grant.
Updating the waiting array occurs after handover and outside
the critical section, and does not influence scalability.
4 The recheck step is needed to avoid races between lock and unlock
operations. Specifically, in the window between the load of grant and the
recheck, the owner might have released the lock. Absent the recheck, the
algorithm would be vulnerable to lost wakeups. Similar recheck idioms
appear in other constructs, such as the Linux kernel futex mechanism, where
waiting threads check a condition, prepare for long-term waiting, and then
recheck the condition before committing to long-term waiting.
All our experiments use a waiting array with 4096 ele-
ments, although ideally, we believe the waiting array should
be sized as a function of the number of CPUs in the system.
(A similar approach is used to size the futex hash table array
in the Linux kernel; see footnote 5.) Hash collisions in the
table are benign, at worst causing unnecessary rechecking of
the grant field. Specifically, collisions are a performance and
quality-of-implementation concern that does not impact correctness.
A larger waiting array will reduce the collision rate
but might increase cache pressure. We note that the odds
of inter-lock collision are equivalent to those given by the
“Birthday Paradox” [43]. Our hash function is cache-aware
and intentionally designed to map adjacent ticket values to
different 128-byte cache sectors underlying the waiting ar-
ray, to reduce false sharing among long-term waiters. We
multiply the ticket value by 127 and then EXCLUSIVE-OR
that result with the address of the lock, and then mask with
$4096 - 1$ to form an index into the waiting array. We selected
a small prime P = 127 to provide the equidistribution properties
of a Weyl sequence [33] and also to thwart the automatic
stride-based hardware prefetch mechanism, which can
artificially induce false sharing. Multiplication by 127 is easily
strength-reduced to a shift and subtract. We include the lock
address into our deterministic hash to avoid the situation
where two locks might operate in an entrained fashion, with
ticket and grant values moving in near unison, and thus suf-
fer from excessive inter-lock collisions. A given lock address
and ticket value pair always hashes to the same index. The
hash computed in the unlock operator must target the same
index as the corresponding hash in the long-term waiting
path. We also note that near collisions can result in false
sharing, when two accesses map to distinct words in the
same cache sector. We claim that, in terms of collisions, 1 lock
with 10 waiting threads should be equivalent to 2 locks with
5 waiting threads each.
TWA leaves the structure of the ticket lock unchanged,
allowing for easy adoption. As the instance size remains the
same, the only additional space cost for TWA is the waiting
array, which is shared over all locks, reflecting a one-time
space cost.
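The hash described above, together with an idealized "Birthday Paradox" estimate, can be sketched as follows. The helper names (twa_slot, twa_slot_for, collision_probability) are hypothetical. The collision estimate assumes each waiter lands on a uniformly random, independent slot, which is a modeling assumption on our part rather than a measured property of the hash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kArraySize = 4096;  // waiting array size used in the text
static_assert((kArraySize & (kArraySize - 1)) == 0,
              "masking requires a power-of-two array size");

// Index for a (lock address, ticket) pair: multiply the ticket by 127,
// XOR with the lock address, then mask to the array size.
constexpr std::size_t twa_slot(std::uintptr_t lock_addr, std::uint64_t ticket) {
    return static_cast<std::size_t>(
        (((ticket << 7) - ticket) ^ lock_addr) & (kArraySize - 1));
}

// Convenience wrapper taking the lock's address as a pointer.
inline std::size_t twa_slot_for(const void* lock, std::uint64_t ticket) {
    return twa_slot(reinterpret_cast<std::uintptr_t>(lock), ticket);
}

// Probability that at least two of n waiters share a slot in an m-slot
// array, under the idealized uniform and independent hashing assumption.
inline double collision_probability(unsigned n, std::size_t m) {
    double none = 1.0;
    for (unsigned i = 0; i < n; ++i) {
        if (i >= m) return 1.0;  // more waiters than slots: pigeonhole
        none *= 1.0 - static_cast<double>(i) / static_cast<double>(m);
    }
    return 1.0 - none;
}
```

The expression (ticket << 7) - ticket is the strength-reduced form of ticket * 127. With 8-byte array elements, consecutive tickets land 127 elements (1016 bytes) apart, so they fall in different 128-byte sectors.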
Precautionary increment ... We say “precautionary” as there might not be any long-term waiting threads
The TWA fast-path for acquisition remains unchanged
relative to ticket locks. The unlock path adds an increment
of the waiting array, to notify any long-term waiters suffi-
ciently “near” the front of the queue that they should tran-
sition from long-term to short-term waiting. We note that
TWA doesn’t reduce overall coherence traffic, but does act
to reduce coherence traffic in the critical handover path,
constraining the invalidation diameter of the store in
unlock that accomplishes handover. TWA thus captures the
desirable performance aspects of both MCS locks and ticket
locks.
5 https://blog.stgolabs.net/2014/01/futexes-and-hash-table-collisions.html
Listing-1 depicts a pseudo-code implementation of the
TWA algorithm. Lines 7 through 16 reflect the classic ticket
lock algorithm and lines 20 through 71 show TWA. TWA
extends the existing ticket lock algorithm by adding lines
41 through 57 for long-term waiting, and line 71 to notify
long-term waiters to shift to classic short-term waiting. For the
purposes of explication, we assume the compiler avoids reordering
and emits fences as necessary to provide sequential consistency
for accesses to the ticket and grant fields, and the waiting array.
2.1 Example Scenario – TWA in Action
1 Initially the lock is in unlocked state with Ticket and
Grant both 0.
2 Thread T1 arrives at Listing-1 line 34 attempting to ac-
quire the lock. T1 increments Ticket from 0 to 1, and
the atomic FetchAdd operator returns the original value
of 0 into the local variable tx, which holds the assigned
ticket value for the locking request. At line 36 T1 then
fetches Grant observing a value of 0. Since tx equals
that fetched value, we have uncontended lock acquisition.
T1 now holds the lock and can enter the critical
section immediately, without waiting, via the fast path
at line 39.
3 Thread T2 now arrives and tries to acquire the lock. The
FetchAdd operator advances Ticket from 1 to 2 and re-
turns 1, the assigned ticket, into tx at line 35. T2 fetches
Grant and notes that tx differs from that value by 1.
The dx variable holds that computed difference, which
reflects the number of threads between the requester
and the head of the logical queue, which is the owner. T2
has encountered contention and must wait. The differ-
ence is only 1, and T2 will be the immediate successor,
so T2 proceeds to line 60 for short-term waiting similar
to that used in classic ticket locks shown at line 10. T2
waits for the Grant field to become 1.
4 Thread T3 arrives and advances Ticket from 2 to 3, with
the FetchAdd operator returning 2 as the assigned ticket.
The difference between that value (2) and the value of
Grant (0) fetched at line 36 exceeds the LongTermThreshold
(1), so T3 enters the path for long-term waiting at line
45. T3 hashes its assigned ticket value of 2 into an index
at, say, 100, in the long-term waiting array and then
fetches from WaitArray[at], observing U. To recover
from potential races with threads in the unlock path,
T3 rechecks that the Grant variable remains unchanged
(0) at line 49 and that the thread should continue with
long-term waiting. Thread T3 busy waits at lines 52-53
on the WaitArray value.
5 Thread T4 arrives, advances Ticket from 3 to 4, obtaining
a value in its tx variable of 3. Similar to T3,
T4 enters the long-term waiting path. T4 hashes its assigned
ticket value of 3 yielding an index of, say, 207, and fetches
WaitArray[207], observing V. T4 then busy waits, waiting
for WaitArray[207] to change from V to any other
value.
6 Thread T1 now releases the lock, calling TWARelease
at line 63. T1 increments Grant from 0 to 1 at line 67,
passing ownership to T2 and sets local variable k to the
new value (1).
7 Thread T2 waiting at lines 60-61 notices that Grant
changed to match its tx value. T2 is now the owner and
may enter the critical section.
8 Thread T1, still in TWARelease at line 71, then hashes
k + LongTermThreshold (the sum is 2) to yield index 100
and then increments WaitArray[100] from U to U + 1.
9 Thread T3 waiting at lines 52-53 observes that change,
rechecks Grant, sees that it is close to being granted
ownership, exits the long-term waiting loop and switches
to classic short-term waiting at lines 60-61. T1 has promoted
T3 from long-term to short-term waiting in anticipation
of the next unlock operation, to eventually be
performed by T2.
10 Thread T1 now exits the TWARelease operator.
11 Thread T2 is the current owner, thread T3 is waiting in
short-term mode, and thread T4 is waiting in long-term
mode.
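The scenario above can be replayed with a small single-threaded model that tracks only the lock state and never actually waits. This is an illustrative sketch: TwaModel and its members are hypothetical names, the hash omits the lock address because only one lock is modeled, and the slot indices it produces differ from the example values 100 and 207 used in the walkthrough.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint64_t kLongTermThreshold = 1;
constexpr std::size_t kSlots = 4096;

enum class Mode { Owner, ShortTerm, LongTerm };

// Single-threaded model of one TWA lock: it tracks state and never waits.
struct TwaModel {
    std::uint64_t ticket = 0;
    std::uint64_t grant = 0;
    std::uint64_t wait_array[kSlots] = {};

    // Simplified hash for a single lock: the lock address is omitted.
    static std::size_t slot(std::uint64_t tx) {
        return static_cast<std::size_t>((tx * 127u) & (kSlots - 1));
    }

    // Doorway: take the next ticket.
    std::uint64_t arrive() { return ticket++; }

    // Where a thread holding ticket tx currently stands.
    Mode mode(std::uint64_t tx) const {
        assert(tx >= grant && tx < ticket);
        const std::uint64_t dx = tx - grant;
        if (dx == 0) return Mode::Owner;
        return dx > kLongTermThreshold ? Mode::LongTerm : Mode::ShortTerm;
    }

    // Unlock: hand over first, then bump the slot of the ticket to promote.
    // Returns the index of the slot that was notified.
    std::size_t release() {
        assert(ticket != grant);
        const std::uint64_t k = ++grant;
        const std::size_t at = slot(k + kLongTermThreshold);
        ++wait_array[at];
        return at;
    }
};
```

After four arrivals, mode() reports T1 as owner, T2 in short-term waiting, and T3 and T4 in long-term waiting. A single release() then notifies exactly the slot that T3 hashed to, leaving T2 as owner, T3 in short-term waiting, and T4 still in long-term waiting, matching steps 6 through 11.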
3 Related Work
Mellor-Crummey and Scott [35] proposed ticket locks with
proportional backoff. Waiting threads compare the value of
their ticket against the grant field. The difference reflects the
number of intervening threads waiting. That value is then
multiplied by some tunable constant, and the thread delays
for that period before rechecking grant. The intention is to
reduce futile polling that might induce unnecessary coher-
ence traffic. The constant is platform- and load-dependent,
and requires tuning. In addition, while the approach de-
creases the futile polling rate on grant, and may be used in
conjunction with polite waiting techniques [19], it does not
decrease the invalidation diameter. TWA and ticket locks
with proportional backoff both make a distinction among
waiting threads based on their relative position in the queue.
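For comparison, proportional backoff can be sketched as below. This is a minimal illustration under our own assumptions: the delay is expressed in iterations of a simple spin loop, and the names BackoffTicketLock, backoff_spins, spin_delay, and the unit parameter (the tunable constant) are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct BackoffTicketLock {
    std::atomic<std::uint32_t> ticket{0};
    std::atomic<std::uint32_t> grant{0};
};

// Delay, in spin iterations, for a waiter with `ahead` threads in front.
// `unit` is the platform- and load-dependent tunable constant.
constexpr std::uint64_t backoff_spins(std::uint32_t ahead, std::uint32_t unit) {
    return static_cast<std::uint64_t>(ahead) * unit;
}

// Crude delay loop; the volatile counter keeps the compiler from removing it.
inline void spin_delay(std::uint64_t spins) {
    for (volatile std::uint64_t i = 0; i < spins; i = i + 1) {
    }
}

inline void backoff_acquire(BackoffTicketLock& l, std::uint32_t unit) {
    const std::uint32_t tx = l.ticket.fetch_add(1);
    for (;;) {
        const std::uint32_t ahead = tx - l.grant.load();
        if (ahead == 0) return;  // our turn
        // Wait in proportion to queue distance before polling grant again.
        spin_delay(backoff_spins(ahead, unit));
    }
}

inline void backoff_release(BackoffTicketLock& l) {
    assert(l.ticket.load() != l.grant.load());  // lock must be held
    l.grant.store(l.grant.load() + 1);
}
```

Note that every waiter still polls the same grant field, only less often, so the store in backoff_release can still invalidate the line in every waiter's cache.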
Partitioned Ticket Locks [18] augment each ticket lock
with a constant-length private array of grant fields, allow-
ing for semi-local waiting. Critically, the array is not shared
between locks, and to avoid false sharing within the array,
the memory footprint of each lock instance is significantly
increased. Ticket Lock “AWN” [37] also uses a per-lock array
for semi-local waiting. Anderson’s array-based queueing
lock [4, 5] is also based on ticket locks. It employs a wait-
ing array for each lock instance, sized to ensure there is at
least one array element for each potentially waiting thread,
yielding a potentially large footprint. The maximum number
of participating threads must be known in advance when
initializing the array. Such dynamic sizing also makes static
allocation of Anderson’s locks more difficult than would be
the case for a lock with a fixed size, such as TWA.
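The per-lock array approach can be illustrated with a sketch of a partitioned ticket lock. This reflects our reading of the description above and is not code taken from [18]; the partition count of 8, the 128-byte padding, and all identifiers are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPartitions = 8;  // assumed; a small power of two

// Each grant slot is padded to its own 128-byte sector to avoid false sharing.
struct alignas(128) PaddedGrant {
    std::atomic<std::uint32_t> value{0};
};

struct PartitionedTicketLock {
    std::atomic<std::uint32_t> ticket{0};
    std::uint32_t owner_ticket = 0;   // written and read only by the owner
    PaddedGrant grants[kPartitions];  // private to this lock instance
};

inline void ptl_acquire(PartitionedTicketLock& l) {
    const std::uint32_t tx = l.ticket.fetch_add(1);
    // Semi-local waiting: only tickets congruent modulo kPartitions share a slot.
    while (l.grants[tx % kPartitions].value.load() != tx) {
    }
    l.owner_ticket = tx;
}

inline void ptl_release(PartitionedTicketLock& l) {
    assert(l.ticket.load() - l.owner_ticket >= 1);  // lock must be held
    const std::uint32_t next = l.owner_ticket + 1;
    l.grants[next % kPartitions].value.store(next);  // hand over to ticket `next`
}
```

The padded array makes each lock instance occupy well over a kilobyte in this sketch, which illustrates the footprint concern raised above; TWA instead shares one waiting array across all locks.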
 1  ## Classic Ticket Lock
 2
 3  class TicketLock :
 4      int Ticket = 0    ## Next ticket to be assigned
 5      int Grant = 0     ## "Now Serving"
 6
 7  TicketAcquire (TicketLock * L) :
 8      ## Atomic fetch-and-add on L.Ticket
 9      auto tx = FetchAdd (L.Ticket, 1)
10      while tx != L.Grant :
11          Pause()
12
13  TicketRelease (TicketLock * L) :
14      ## succession via direct handoff ...
15      ## Increment does not require atomic instructions
16      L.Grant += 1
17
18  ## ===================================================
19
20  ## TWA : Ticket lock augmented with waiting array
21
22  ## tunable parameters
23  ## short-term vs long-term proximity threshold
24  LongTermThreshold = 1
25  ArraySize = 4096
26
27  ## Global variables :
28  ## Long-term waiting array, initially all 0
29  ## Shared by all locks and threads in the address space
30
31
32  uint64_t WaitArray [ArraySize]
33
34  TWAAcquire (TWA * L) :
35      auto tx = FetchAdd (L.Ticket, 1)
36      auto dx = tx - L.Grant
37      if dx == 0 :
38          ## fast-path return - uncontended case
39          return
40
41      ## slow path with contention -- need to wait
42      ## Select long-term vs short-term based on the number
43      ## of threads waiting in front of us
44      if dx > LongTermThreshold :
45          ## long-term waiting via WaitArray
46          auto at = Hash(L, tx)
47          for :    ## repeat until break
48              auto u = WaitArray[at]
49              dx = tx - L.Grant
50              assert dx >= 0
51              if dx <= LongTermThreshold : break
52              while WaitArray[at] == u :
53                  Pause()
54          ## This waiting thread is now "near" the front of
55          ## the logical queue of waiting threads
56          ## Transition from long-term waiting to
57          ## short-term waiting
58
59      ## classic short-term waiting on L.Grant field
60      while L.Grant != tx :
61          Pause()
62
63  TWARelease (TWA * L) :
64      ## Notify immediate successor, if any
65      ## such threads will be in short-term waiting phase
66      ## non-atomic increment
67      auto k = ++ L.Grant
68
69      ## Notify long-term waiters
70      ## atomic increment required
71      FetchAdd (WaitArray[Hash(L, k + LongTermThreshold)], 1)
Listing 1. Simplified Python-like Implementation of TWA
Fu et al. [24, 29] describe a mutual exclusion scheme that
avoids global spinning, but the technique has long paths and
accesses large numbers of shared variables.
Various authors [6, 26] have suggested switching adap-
tively between MCS and ticket locks depending on the con-
tention level. While workable, this adds considerable algo-
rithmic complexity, particularly for the changeover phase,
and requires tuning. Lim and Agarwal [30] suggested a more
general framework for switching locks at runtime. Concerns
with such adaptive schemes include reaction time, hysteresis
and damping, and oscillation (chasing, hunting, or ringing)
between modes.
4 Empirical Evaluation
Unless otherwise noted, all data was collected on an Oracle
X5-2 system. The system has 2 sockets, each populated with
an Intel Xeon E5-2699 v3 CPU running at 2.30GHz. Each
socket has 18 cores, and each core is 2-way hyperthreaded,
yielding 72 logical CPUs in total. The system was running
Ubuntu 18.04 with a stock Linux version 4.15 kernel, and
all software was compiled using the provided GCC version
7.3 toolchain at optimization level “-O3”. 64-bit C or C++
code was used for all experiments. Factory-provided sys-
tem defaults were used in all cases, and Turbo mode [42]
was left enabled. In all cases default free-range unbound
threads were used. SPARC and x86 both provide a strong
TSO memory model [41]. The atomic fetch-and-add primi-
tives, LOCK:XADD on x86 and a LD;ADD;CAS loop on SPARC,
have fence semantics. TWA is trivial to implement in C++
with std::atomic<> primitives.
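As a rough illustration of that claim, the following is a minimal sketch of Listing 1 in C++ with std::atomic<>, not the implementation used in our experiments. The names TWALock, twa_acquire, twa_release, twa_hash and cpu_pause are hypothetical, the hash function is an assumed placeholder, and every atomic operation uses the default sequentially consistent ordering for simplicity.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Tunables taken from Listing 1.
constexpr std::uint64_t kLongTermThreshold = 1;
constexpr std::size_t kArraySize = 4096;

// Waiting array shared by all locks; static storage zero-initializes it.
static std::atomic<std::uint64_t> g_wait_array[kArraySize];

struct TWALock {
    std::atomic<std::uint64_t> ticket{0};  // next ticket to be assigned
    std::atomic<std::uint64_t> grant{0};   // "now serving"
};

// Hypothetical hash (an assumption): mixes the lock address with the ticket.
inline std::size_t twa_hash(const TWALock* lock, std::uint64_t tx) {
    const auto addr = reinterpret_cast<std::uintptr_t>(lock);
    return static_cast<std::size_t>(((addr >> 4) * 127u + tx) % kArraySize);
}

// Polite busy-wait hint; compiles to nothing on non-x86 targets.
inline void cpu_pause() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_ia32_pause();
#endif
}

void twa_acquire(TWALock& lock) {
    const std::uint64_t tx = lock.ticket.fetch_add(1);
    std::uint64_t dx = tx - lock.grant.load();
    if (dx == 0) return;  // uncontended fast path

    if (dx > kLongTermThreshold) {
        // Long-term waiting: spin on one slot of the shared array.
        std::atomic<std::uint64_t>& slot = g_wait_array[twa_hash(&lock, tx)];
        for (;;) {
            // Sample the slot before grant so a concurrent notify is not missed.
            const std::uint64_t u = slot.load();
            dx = tx - lock.grant.load();
            if (dx <= kLongTermThreshold) break;
            while (slot.load() == u) cpu_pause();
        }
    }
    // Short-term waiting directly on the grant field.
    while (lock.grant.load() != tx) cpu_pause();
}

void twa_release(TWALock& lock) {
    assert(lock.ticket.load() != lock.grant.load());  // lock must be held
    // Only the owner writes grant, so a load followed by a store suffices.
    const std::uint64_t k = lock.grant.load() + 1;
    lock.grant.store(k);
    // Notify the long-term waiter that is now within the threshold.
    g_wait_array[twa_hash(&lock, k + kLongTermThreshold)].fetch_add(1);
}
```

A production version would likely relax the memory orderings and align the fields, which this sketch leaves out.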
We implemented all user-mode locks within LD_PRELOAD
interposition libraries that expose the standard POSIX
pthread_mutex_t programming interface using the framework from [20].
This allows us to change lock implementations by vary-
ing the LD_PRELOAD environment variable and without
modifying the application code that uses locks. The C++
std::mutex construct maps directly to pthread_mutex primitives,
so interposition works for both C and C++ code. All
busy-wait loops used the Intel PAUSE instruction for polite
waiting.
We use a 128 byte sector size on Intel processors for align-
ment to avoid false sharing. The unit of coherence is 64 bytes
throughout the cache hierarchy, but 128 bytes is required
because of the adjacent cache line prefetch facility where
pairs of lines are automatically fetched together.
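One way to express the 128-byte alignment in C++ is with alignas, as in the sketch below. The type name AlignedTicketLock and the constant kSectorSize are illustrative assumptions, and the layout is only an example of keeping distinct lock instances on distinct sectors, not necessarily the layout used in our experiments.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 128 bytes covers a pair of adjacent 64-byte cache lines.
constexpr std::size_t kSectorSize = 128;

// Illustrative lock type: alignas pads each instance to a full sector,
// so two different locks never share a sector.
struct alignas(kSectorSize) AlignedTicketLock {
    std::atomic<std::uint64_t> ticket{0};
    std::atomic<std::uint64_t> grant{0};
};

static_assert(sizeof(AlignedTicketLock) == kSectorSize,
              "each lock occupies exactly one sector");
static_assert(alignof(AlignedTicketLock) == kSectorSize,
              "each lock starts on a sector boundary");
```

Elements of an array of such locks are spaced 128 bytes apart, so updates to one lock do not invalidate the sector holding its neighbor.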
4.1 Sensitivity to Inter-Lock Interference
As the waiting array is shared over all locks and threads
within an address space, one potential concern is collisions
that might arise when multiple threads are using a large
set of locks. Near collisions are also of concern as they can
cause false sharing within the array. To determine TWA’s
performance sensitivity to such effects, we implemented a
benchmark program that spawns 64 concurrent threads. Each
thread loops as follows : randomly pick a lock from a pool of
such locks; acquire that lock; advance a thread-local pseudo-
random number generator 50 steps; release the lock; and
finally advance that random number generator 100 steps. At
the end of a 10 second measurement interval we report the
number of lock acquisitions. Each data point is the median of
7 distinct runs. We report the results in Figure-2 where the X-axis
reflects the number of locks in the pool (varying through
powers-of-two between 1 and 8192) and the Y-axis is the
number of acquisitions completed by TWA divided by the
number completed by a specialized version of TWA where
each lock instance has a private array of 4096 elements. This
fraction reflects the performance drop attributable to inter-
lock conflicts and near conflicts in the shared array, where
the modified form of TWA can be seen as an idealized form
that has a large per-instance footprint but which is immune
to inter-lock conflicts. The worst-case penalty arising from
inter-thread interference (the lowest fraction value) is always
under 8%.
Figure 2. Inter-Lock Interference (X-axis: Locks, logarithmic, 1 to 10000; Y-axis: Throughput Fraction, 0.0 to 1.0)
4.2 MutexBench
The MutexBench benchmark spawns T concurrent threads.
Each thread loops as follows: acquire a central lock L; execute
a critical section; release L; execute a non-critical section. At
the end of a 10 second measurement interval the benchmark
reports the total number of aggregate iterations completed by
all the threads. We show the median of 5 independent runs in
Figure-3. The critical section advances a C++ std::mt19937
pseudo-random generator (PRNG) 4 steps. The non-critical
section uses that same PRNG to compute a value distributed
uniformly in [0, 200) and then advances the PRNG that many
steps. For clarity and to convey the maximum amount of in-
formation to allow a comparison of the algorithms, the X-axis
is logarithmic and the Y-axis is offset to the minimum score.
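The per-thread loop described above can be sketched roughly as follows. This is an illustration rather than the benchmark source: the function name mutexbench_loop is hypothetical, the PRNG is assumed to be private to the calling thread, and a fixed iteration count stands in for the 10 second measurement interval so that the example is deterministic.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <random>

// Runs `iterations` MutexBench-style iterations on behalf of one thread.
// Lock is any type providing lock() and unlock(); returns iterations completed.
template <typename Lock>
std::uint64_t mutexbench_loop(Lock& central, std::mt19937& prng,
                              std::uint64_t iterations) {
    std::uniform_int_distribution<int> delay(0, 199);  // uniform in [0, 200)
    std::uint64_t completed = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        central.lock();
        prng.discard(4);  // critical section: advance the PRNG 4 steps
        central.unlock();

        // Non-critical section: draw a delay, then advance that many steps.
        const int steps = delay(prng);
        prng.discard(static_cast<unsigned long long>(steps));
        ++completed;
    }
    return completed;
}
```

Because Lock is a template parameter, the same loop can drive std::mutex or any hand-written lock that exposes lock() and unlock().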
As seen in the figure, ticket locks perform the best up
to 6 threads, with TWA lagging slightly behind. As we fur-
ther increase the thread count, however, ticket locks fail
to scale. MCS provides stable asymptotic performance that
surpasses ticket locks at 24 threads. TWA manages to al-
ways outperform MCS, freeing the developer from making a
choice between MCS locks and ticket locks.
Figure 3. MutexBench (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in M steps/sec; series: MCS, TKT, TWA)
To show that our approach is general and portable, we next
report MutexBench results on a Sun/Oracle T7-2 [15]. The
T7-2 has 2 sockets, each socket populated by an M7 SPARC
CPU running at 4.13GHz with 32 cores. Each core has 8
logical CPUs sharing 2 pipelines. The system has 512 logical
CPUs and was running Solaris 11. 64-bit SPARC does not
directly support atomic fetch-and-add or swap operations –
these are emulated by means of a 64-bit compare-and-swap
operator (CASX). The system uses MOESI cache coherency
instead of the MESIF [25] found in modern Intel-branded
processors, allowing more graceful handling of write sharing.
The graph in Figure-4 has the same shape as found in Figure-
3. The abrupt performance drop experienced by all locks
starting at 256 threads is caused by competition for pipeline
resources.
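The emulation of fetch-and-add by means of compare-and-swap mentioned above follows a standard retry pattern. The sketch below shows that pattern with std::atomic; the function name fetch_add_via_cas is hypothetical and the code is not taken from our SPARC implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Emulates fetch-and-add with a compare-and-swap retry loop.
// Returns the value observed immediately before the addition.
inline std::uint64_t fetch_add_via_cas(std::atomic<std::uint64_t>& target,
                                       std::uint64_t delta) {
    std::uint64_t observed = target.load();
    // On failure, compare_exchange_weak refreshes `observed` with the
    // current value, so the next attempt uses an up-to-date sum.
    while (!target.compare_exchange_weak(observed, observed + delta)) {
    }
    return observed;
}
```

Unlike a native fetch-and-add, this loop may retry under contention, which is one reason ticket acquisition can cost more on hardware that lacks the native instruction.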
Figure 4. MutexBench on Solaris/SPARC 64-bit (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in M steps/sec; series: MCS, TKT, TWA)
4.3 throw
The “throw” benchmark launches T threads, each of which
loops, executing the following line of C++ code:
try { throw 20 ;} catch (int e) {}.
Naively, this construct would be expected to scale linearly,
but the C++ runtime implementation acquires mutexes that
protect the list of dynamically loaded modules and their
exception tables 6. The problem is long-standing and has
proven difficult to fix given the concern that some applica-
tions might have come to depend on the serialization [1].
At the end of a 10 second measurement interval the bench-
mark reports the aggregate number of loops executed by all
threads. There is no non-critical section in this benchmark;
throw-catch operations are performed back-to-back with no
intervening delay. In Figure-5 we observe that performance
drops significantly between 1 and 2 threads. There is little
or no benefit from multiple threads, given that execution
is largely serialized, but coherent communication costs are
incurred. As we increase beyond two threads performance
improves slightly, but never exceeds that observed at one
thread. Beyond 2 threads, the shape of the graph recapitu-
lates that seen in MutexBench.
Figure 5. throw (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in M steps/sec; series: MCS, TKT, TWA)
4.4 Random Replacement Cache
The “Random Replacement Cache” benchmark creates a key-
value cache with a random replacement policy 7. All cache
operations are protected with a central lock. Both the keys
and values are 32-bit integers and we set values equal to a
hash of the key. The cache is configured with a capacity limit
of 10000 elements. The benchmark launches the specified
number of concurrent threads, each of which loops, accessing
the cache, and then executing a delay. At the end of a 10
second measurement interval the benchmark reports the
aggregate throughput rate. We plot the median of 5 runs
for each data point in Figure-6. To emulate locality and key
reuse, each thread has a private keyset of 10 recently used
keys. We pre-populate the keyset with random keys before
the measurement interval, using selection with replacement.
With probability P = 0.9 a thread picks a random index in its
keyset and then uses the corresponding key for its access.
Otherwise, the thread generates a new random key in the
range [0, 50000), installs that key into a random slot in the
keyset, and then proceeds to access the cache with that key.
We use thread-local C++ std::mt19937 pseudo-random number
generators with a uniform distribution. The inter-access delay
operation picks a random number in the range [0, 200) and
then steps the thread-local random number generator that
many times.
6 The C++ runtime libraries have specialized binding conventions to invoke
locking operators, which are not amenable to normal LD_PRELOAD
interposition. We instead intercepted the locking calls via the _rtld_global
facility.
7 We use the cache from https://github.com/ceph/ceph/blob/master/src/common/random_cache.hpp
in conjunction with a test harness of our own making.
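The key-selection policy described in this section can be sketched as below. This is a hedged illustration and not our test harness: the names Keyset, fill_keyset and next_key are hypothetical, and the constants simply restate the values given in the text.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

constexpr std::size_t kKeysetSize = 10;     // recently used keys per thread
constexpr std::uint32_t kKeyRange = 50000;  // keys are drawn from [0, 50000)
constexpr double kReuseProbability = 0.9;   // P in the text

using Keyset = std::array<std::uint32_t, kKeysetSize>;

// Pre-populates the keyset with random keys (selection with replacement).
inline void fill_keyset(Keyset& keys, std::mt19937& prng) {
    std::uniform_int_distribution<std::uint32_t> any_key(0, kKeyRange - 1);
    for (auto& k : keys) k = any_key(prng);
}

// Chooses the key for the next cache access.
inline std::uint32_t next_key(Keyset& keys, std::mt19937& prng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> slot(0, kKeysetSize - 1);
    std::uniform_int_distribution<std::uint32_t> any_key(0, kKeyRange - 1);

    if (coin(prng) < kReuseProbability) {
        return keys[slot(prng)];  // reuse a recently used key
    }
    const std::uint32_t fresh = any_key(prng);
    keys[slot(prng)] = fresh;  // install the new key in a random slot
    return fresh;
}
```

Whichever branch is taken, the returned key is present in the keyset after the call, which is what gives the access stream its locality.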
The cache implementation makes frequent use of malloc-
free operations. The default malloc allocator fails to fully
scale in this environment and attenuates the benefit con-
ferred by improved locks, so we instead used the index-aware
allocator from [2]. This allocator uses its own built-in syn-
chronization primitives instead of pthread operators, so
LD_PRELOAD interposition on the pthread mutex primi-
tives has no influence on malloc performance.
Figure 6. Random Replacement Cache (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in M ops/sec; series: MCS, TKT, TWA)
4.5 libslock stress_latency
Figure-7 shows the performance of the “stress latency” bench-
mark from [17] 8. The benchmark spawns the specified num-
ber of threads, which all run concurrently during a 10 second
measurement interval. Each thread iterates as follows: acquire
a central lock; execute 200 iterations of a delay loop; release
the lock; execute 5000 iterations of the same delay loop. The
benchmark reports the total number of iterations of the outer
loop.
8 We use the following command line: ./stress_latency -l 1 -d 10000 -a 200 -n threads -w 1 -c 1 -p 5000.
Figure 7. libslock stress_latency (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in ops/sec; series: MCS, TKT, TWA)
4.6 LevelDB readrandom
In Figure-8 we used the “readrandom” benchmark in Lev-
elDB version 1.20 database 9 varying the number of threads
and reporting throughput from the median of 5 runs of 50
seconds each. Each thread loops, generating random keys
and then trying to read the associated value from the data-
base. We first populated a database 10 and then collected data
11. We made a slight modification to the db_bench bench-
marking harness to allow runs with a fixed duration that
reported aggregate throughput. Ticket locks exhibit a very
slight advantage over MCS and TWA at low thread counts,
after which ticket locks fade and TWA matches or exceeds
the performance of MCS. LevelDB uses coarse-grained lock-
ing, protecting the database with a single central mutex:
DBImpl::Mutex. Profiling indicates contention on that lock
via leveldb::DBImpl::Get().
Figure 8. LevelDB readrandom (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in ops/sec; series: MCS, TKT, TWA)
9 leveldb.org
10 db_bench --threads=1 --benchmarks=fillseq --db=/tmp/db/
11 db_bench --threads=threads --benchmarks=readrandom --use_existing_db=1 --db=/tmp/db/ --duration=50
4.7 LevelDB readwhilewriting
The LevelDB “readwhilewriting” benchmark in Figure-9 spawns
T −1 random readers (identical to the “readrandom” threads)
and a single writer thread which writes to randomly selected
keys 12. The benchmark reports the aggregate throughput
completed in a 50 second measurement interval. Each data
point is taken as the median of 5 distinct runs. The same lock as in
“readrandom” is the source of contention in this benchmark.
Figure 9. LevelDB readwhilewriting (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in ops/sec; series: MCS, TKT, TWA)
4.8 RocksDB readwhilewriting
We next present results in Figure-10 from the RocksDB 13
version 5.14.2 database running their variant of the “readwhilewriting”
benchmark. The benchmark is similar to the
form found in LevelDB, above, but the underlying database
allows more concurrency and avoids the use of a single cen-
tral lock. We intentionally use a command-line configured to
stress the locks that protect the sharded LRU cache, causing
contention in LRUShard::lookup() 14.
Figure 10. RocksDB readwhilewriting (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in ops/sec; series: MCS, TKT, TWA)
12 db_bench --benchmarks=readwhilewriting --threads=threads --cache_size=50000 --num=100000 --duration=50
13 rocksdb.org
14 db_bench --duration=200 --threads=threads --benchmarks=readwhilewriting --compression_type=none --mmap_read=1 --mmap_write=1 --cache_size=100000 --cache_numshardbits=0 --sync=0 --verify_checksum=0
4.9 Linux kernel locktorture
We ported TWA into the Linux kernel environment and
evaluated its performance with the locktorture benchmark 15.
Locktorture is distributed as a part of the Linux kernel. It
is implemented as a loadable kernel module, and according
to its documentation, “runs torture tests on core kernel
locking primitives”, including qspinlock, the kernel spin lock. It
creates a given number of threads that repeatedly acquire
and release the lock, with occasional short delays (citing
the comment in the source code, “to emulate likely code”)
and occasional long delays (“to force massive contention”)
inside the critical section. At the end of the measurement
interval, it reports the overall throughput (lock acquisitions)
completed by the threads. We used locktorture to compare
TWA, classic ticket locks, and the default kernel qspinlock.
15 https://www.kernel.org/doc/Documentation/locking/locktorture.txt
The Linux qspinlock construct [11, 12, 31] is a compact
32-bit lock, even on 64-bit architectures. The low-order bits
of the lock word constitute a simple test-and-set lock while
the upper bits encode the tail of an MCS chain. The result
is a hybrid of MCS and test-and-set 16. In order to fit into a
32-bit word, a critical requirement, the chain is formed
by logical CPU identifiers instead of traditional MCS queue
node pointers. Arriving threads attempt to acquire the
test-and-set lock embedded in the low order bits of the lock word.
This attempt fails if the test-and-set lock is held or if the
MCS chain is populated. If successful, they enter the critical
section, otherwise they join the MCS chain embedded in
the upper bits of the lock word. When a thread becomes
an owner of the MCS lock, it can wait for the test-and-set
lock to become clear, at which point it claims the
test-and-set lock, releases the MCS lock, and then enters the critical
section. The MCS aspect of qspinlock is used only when
there is contention. The unlock operator simply clears the
test-and-set lock. The MCS lock is never held over the
critical section, but only during contended acquisition. Only the
owner of the MCS lock spins on the test-and-set lock,
reducing coherence traffic 17. Qspinlock is strictly FIFO. While the
technique employs local spinning on the MCS chain, unlike
traditional MCS, arriving and departing threads will both
update the common lock word, increasing coherence traffic
and degrading performance relative to classic MCS.
Qspinlock incorporates an additional optimization where the first
contending thread spins on the test-and-set lock instead of
using the MCS path. Traditional MCS does not fit well in the
Linux kernel as (a) the constraint that a low-level spin lock
instance be only 32-bits is a firm requirement, and (b) the
lock-unlock API does not provide a convenient way to pass
the owner’s MCS queue node address from lock to unlock.
We note that qspinlocks replaced classic ticket locks as the
kernel’s primary low-level spin lock mechanism in 2014, and
ticket locks replaced test-and-set locks, which are unfair and
allow unbounded bypass, in 2008 [13].
16 https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c
17 This provides a LOITER-style [19] lock with the outer lock consisting of a
test-and-set lock and the inner lock consisting of the MCS lock, with both
locks embedded in the same 32-bit word.
Regarding lock instance size, Bueso [10] notes:
Lock overhead. This is the resource cost of using a
particular lock in terms of both size and latency. Locks
embedded in data structures, for example, will bloat
that type. Larger structure sizes mean more CPU cache
and memory footprint. Thus, size is an important fac-
tor when a structure becomes frequently used through-
out the system. Implementers also need to consider
lock overhead when enlarging a lock type, after some
nontrivial modification; this can lead to performance
issues in unexpected places. For example, Linux ker-
nel file-system and memory-management developers
must take particular care of the size of VFS struct inode
(index node) and struct page, optimizing as much as
possible. These data structures represent, respectively,
information about each file on the system and each
of the physical page frames. As such, the more files
or memory present, the more instances of these struc-
tures are handled by the kernel. It is not uncommon to
see machines with tens of millions of cached inodes, so
increasing the size of the inode by 4 percent is signifi-
cant. That’s enough to go from having a well-balanced
workload to not being able to fit the working set of
inodes in memory. Implementers must always keep in
mind the size of the locking primitives.
Another example where the size of the lock is important is
in concurrent data structures, such as linked lists or binary
search trees, that use a lock per node or entry [9, 16, 27]. As
Bronson et al. observe, when a scalable lock is striped across
multiple cache lines to avoid contention in the coherence
fabric, it is “prohibitively expensive to store a separate lock
per node”[9].
In Table-1 we report the cyclomatic complexity [34] and
N-Path complexity [7, 36] measures, derived from the
complexity of the control flow graph, for the lock and unlock
methods for classic ticket locks (TKT), the kernel qspinlock
primitive 18 and TWA. To facilitate a fairer comparison, we
removed performance monitoring and debugging facilities
from qspinlock before running oclint 19 to compute the
complexity. In addition, the values reported for qspinlock reflect
only the top-level qspinlock acquisition method, and do
not include the complexity of helper methods that it calls.
The only helper methods used by TWA and ticket locks are
atomic operators and the PAUSE operator, which reduce to
just one instruction with no additional complexity. All three
lock algorithms have cyclomatic and N-Path complexity of
just 1 for the unlock operation. As can be seen in the table,
TWA’s lock operator is far less complex than is the qspinlock
acquisition method. N-Path complexity can grow as an
exponential function of cyclomatic complexity. Implicitly, we are
suggesting that complexity measures correlate with
understandability and bug rates, and thus with correctness. Code
with higher complexity is more apt to contain latent flaws and
is harder to analyze and understand. We use complexity
as a proxy for correctness in the absence of formal proofs.
18 https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c
19 https://github.com/oclint/oclint
                 NPath             Cyclomatic
                 Lock    Unlock    Lock    Unlock
Ticket Lock         2         1       2         1
QSpinLock        4320         1      18         1
TWA                28         1       6         1
Table 1. Complexity Measures
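As a reading aid for the cyclomatic column, the following worked example applies McCabe's definition [34] to TicketAcquire from Listing 1. It assumes the common decision-count form of the measure and is only a hand calculation that is consistent with the value in Table 1; the tool may count some constructs differently.

```latex
% Cyclomatic complexity of a control flow graph with
% E edges, N nodes, and P connected components:
M = E - N + 2P

% For a single structured routine with D binary decision points
% this reduces to:
M = D + 1

% TicketAcquire has one decision point (the while loop), hence:
M_{\mathrm{TicketAcquire}} = 1 + 1 = 2
```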
The average critical section duration used by locktorture
is a function of the number of concurrent threads. In or-
der to use the benchmark to measure and report scalabil-
ity, we augmented it to parameterize the critical and non-
critical section durations, which are expressed as steps of
the thread-local pseudo-random number generator provided
in the locktorture infrastructure. We used 20 steps for the
critical section. Each execution of the non-critical section
computes a uniformly distributed random number in [0, N]
and then steps the local random number generator that many
iterations. At the end of a run (lasting 30 seconds in our case),
the total number of lock operations performed by all threads
is reported. We report the median of 7 such runs. Figure-11
uses N = 20 and Figure-12 uses N = 400.
For the kernel versions of ticket locks and TWA we re-
duced the ticket and grant fields to 16 bits to allow the
composite lock word to fit in 32-bits, imposing a constraint
that the maximum number of waiting threads never exceeds
65535. Qspinlocks operate with preemption disabled, so the
constraint devolves to not having more than 65535 proces-
sors waiting on a given lock at any one time. By default, the
maximum number of CPUs allowed by the Linux kernel on
x86 architectures is 4096, satisfying the constraint.
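The effect of 16-bit fields on the ticket arithmetic can be sketched as follows. The names CompactTicketLock, waiters_ahead, compact_acquire and compact_release are hypothetical, and the code is a user-space illustration with std::atomic rather than the kernel port; it shows only that modulo-2^16 subtraction stays correct across wraparound as long as the stated bound on waiters holds.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative lock whose two 16-bit fields together occupy 32 bits.
struct CompactTicketLock {
    std::atomic<std::uint16_t> ticket{0};
    std::atomic<std::uint16_t> grant{0};
};
static_assert(sizeof(CompactTicketLock) == 4, "lock must fit in 32 bits");

// Distance from grant to ticket tx, computed modulo 2^16. The result is
// meaningful while no more than 65535 threads wait on the lock.
inline std::uint16_t waiters_ahead(std::uint16_t tx, std::uint16_t grant) {
    return static_cast<std::uint16_t>(tx - grant);
}

// Classic ticket acquire on the 16-bit fields; returns the ticket taken.
inline std::uint16_t compact_acquire(CompactTicketLock& lock) {
    const std::uint16_t tx = lock.ticket.fetch_add(1);  // wraps at 65535
    while (lock.grant.load() != tx) {
        // Busy wait; a PAUSE hint is omitted for brevity.
    }
    return tx;
}

// Only the owner writes grant, so a load followed by a store suffices.
inline void compact_release(CompactTicketLock& lock) {
    lock.grant.store(static_cast<std::uint16_t>(lock.grant.load() + 1));
}
```

For example, a thread holding ticket 2 while grant is 65534 is four positions from the front, even though the ticket counter has wrapped.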
As we can see in Figures 11 and 12, classic ticket locks
perform well at low concurrency but fade as the number
of threads increases. TWA and QSpinLock provide fairly
similar performance, although TWA is far simpler.
Figure 11. LockTorture : High Contention (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in M acquires/sec; series: QSpinlock, TKT, TWA)
Figure 12. LockTorture : Moderate Contention (X-axis: Threads, logarithmic; Y-axis: Aggregate throughput rate in M acquires/sec; series: QSpinlock, TKT, TWA)
5 Conclusion
While our approach is deterministic we note that
performance under TWA can be influenced by the activities of
other unrelated threads and locks by means of collisions in
the shared waiting array, potentially reducing predictability.
Other shared resources incur the same risk. Examples
include (a) competition for occupancy of shared hardware
caches and (b) collisions in the Linux futex hash table, where
lock addresses map to hash chains of blocked threads.
TWA is a straightforward extension to classic ticket locks,
providing the best performance properties of ticket locks and
MCS locks. Like ticket locks, it is simple, compact, and has a
fixed memory footprint. The key benefit conferred by TWA
arises from improved transfer of ownership (handover) in
the unlock path, by reducing the number of threads spinning
on the grant field at any given time. Even though TWA
increases the overall path length in the unlock operation,
adding an atomic fetch-and-increment operation compared
to the classic ticket lock, it decreases the effective critical
path duration for contended handover. In the Appendix we identify a number of variations on the basic TWA algorithm that we plan to explore in the future.
We plan to explore long-term waiting in the Linux kernel
via the MONITOR-MWAIT construct or by means of the
kernel futex [23] mechanism.
An extended version of this paper is available at https://arxiv.org/abs/1810.01573
Acknowledgments
We thank Shady Issa for reviewing drafts and helpful com-
ments.
References
[1] 2016. RFC : introduce dl_iterate_phdr_parallel. https://patchwork.
ozlabs.org/patch/652301/
[2] Yehuda Afek, Dave Dice, and Adam Morrison. 2011. Cache Index-aware
Memory Allocation. In Proceedings of the International Symposium on
Memory Management (ISMM ’11). ACM. http://doi.acm.org/10.1145/
1993478.1993486
[3] Vitaly Aksenov, Dan Alistarh, and Petr Kuznetsov. 2018. Brief An-
nouncement: Performance Prediction for Coarse-Grained Locking. In
Proceedings of the 2018 ACM Symposium on Principles of Distributed
Computing (PODC ’18). ACM. https://doi.org/10.1145/3212734.3212785
[4] J.H. Anderson, Y.J. Kim, and T. Herman. 2003. Shared-memory mutual
exclusion: major research trends since 1986. Distributed Computing
(2003). https://doi.org/10.1007/s00446-003-0088-6
[5] T. E. Anderson. 1990. The performance of spin lock alternatives for
shared-memory multiprocessors. IEEE Transactions on Parallel and
Distributed Systems (1990). https://doi.org/10.1109/71.80120
[6] Jelena Antić, Georgios Chatzopoulos, Rachid Guerraoui, and Vasileios
Trigonakis. 2016. Locking Made Easy. In Proceedings of the 17th In-
ternational Middleware Conference (Middleware ’16). ACM. https:
//doi.org/10.1145/2988336.2988357
[7] Lucas Bang, Abdulbaki Aydin, and Tevfik Bultan. 2015. Automatically
Computing Path Complexity of Programs. In Proceedings of the 2015
10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE
2015). ACM. http://doi.acm.org/10.1145/2786805.2786863
[8] Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nick-
olai Zeldovich. 2012. Non-scalable locks are dangerous. Ottawa
Linux Symposium (OLS) (2012). https://www.kernel.org/doc/ols/2012/
ols2012-zeldovich.pdf
[9] Nathan G. Bronson, Jared Casper, Hassan Chafi, and Kunle Olukotun.
2010. A Practical Concurrent Binary Search Tree. In Proceedings of the
15th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (PPoPP ’10). ACM. http://doi.acm.org/10.1145/1693453.
1693488
[10] Davidlohr Bueso. 2014. Scalability Techniques for Practical Synchro-
nization Primitives. Commun. ACM (2014). http://doi.acm.org/10.
1145/2687882
[11] Jonathan Corbet. [n. d.]. Cramming more into struct page. https:
//lwn.net/Articles/565097, August 28, 2013. Accessed: 2018-10-01.
[12] Jonathan Corbet. [n. d.]. MCS locks and qspinlocks. https://lwn.net/
Articles/590243, March 11, 2014. Accessed: 2018-09-12.
[13] Jonathan Corbet. [n. d.]. Ticket Spinlocks. https://lwn.net/Articles/
267968, February 6, 2008. Accessed: 2018-09-12.
[14] Jonathan Corbet. 2014. MCS locks and qspinlocks. https://lwn.net/
Articles/590243/
[15] Oracle Corporation. 2014. SPARC T7-2 Server - Oracle
Datasheet. http://www.oracle.com/us/products/servers-storage/
sparc-t7-2-server-ds-2687049.pdf
[16] Tyler Crain, Vincent Gramoli, and Michel Raynal. 2012. A Speculation-
friendly Binary Search Tree. In Proceedings of the 17th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming (PPoPP
’12). ACM. http://doi.acm.org/10.1145/2145816.2145837
[17] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. 2013. Every-
thing You Always Wanted to Know About Synchronization but Were
Afraid to Ask. In Proceedings of the Twenty-Fourth ACM Symposium on
Operating Systems Principles (SOSP ’13). http://doi.acm.org/10.1145/
2517349.2522714
[18] David Dice. 2011. Brief Announcement: A Partitioned Ticket Lock. In
Proceedings of the Twenty-third Annual ACM Symposium on Parallelism
in Algorithms and Architectures (SPAA ’11). http://doi.acm.org/10.
1145/1989493.1989543
[19] Dave Dice. 2017. Malthusian Locks. In Proceedings of the Twelfth
European Conference on Computer Systems (EuroSys ’17). http://doi.
acm.org/10.1145/3064176.3064203
[20] David Dice, Virendra J. Marathe, and Nir Shavit. 2015. Lock Cohorting:
A General Technique for Designing NUMA Locks. ACM Trans. Parallel
Comput. (2015). https://doi.org/10.1145/2686884
[21] Stijn Eyerman and Lieven Eeckhout. 2010. Modeling Critical Sec-
tions in Amdahl’s Law and Its Implications for Multicore Design. In
Proceedings of the 37th Annual International Symposium on Computer
Architecture (ISCA ’10). ACM. https://doi.org/10.1145/1815961.1816011
[22] M. J. Fischer, N. A. Lynch, J. E. Burns, and A. Borodin. 1979. Resource
allocation with immunity to limited process failure. In 20th Annual
Symposium on Foundations of Computer Science (FOCS 1979). http:
//dx.doi.org/10.1109/SFCS.1979.37
[23] Hubertus Franke, Rusty Russell, and Matthew Kirkwood. [n. d.]. Fuss,
Futexes and Furwocks: Fast User-level Locking in Linux. https://www.
kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf. Ottawa Linux
Symposium.
[24] S. S. Fu and Nian-Feng Tzeng. 1997. A circular list based mutual
exclusion scheme for large shared-memory multiprocessors. IEEE
Transactions on Parallel and Distributed Systems (1997). https://doi.
org/10.1109/71.595581
[25] J.R. Goodman and H.H.J. Hum. 2009. MESIF: A Two-Hop Cache Co-
herency Protocol for Point-to-Point Interconnects. https://www.cs.
auckland.ac.nz/~goodman/TechnicalReports/MESIF-2009.pdf
[26] P. H. Ha, M. Papatriantafilou, and P. Tsigas. 2005. Reactive spin-locks:
a self-tuning approach. In 8th International Symposium on Parallel
Architectures, Algorithms and Networks (ISPAN’05). https://doi.org/10.
1109/ISPAN.2005.73
[27] Steve Heller, Maurice Herlihy, Victor Luchangco, Mark Moir,
William N. Scherer, and Nir Shavit. 2006. A Lazy Concurrent List-
based Set Algorithm. In Proceedings of the 9th International Confer-
ence on Principles of Distributed Systems (OPODIS’05). Springer-Verlag.
http://dx.doi.org/10.1007/11795490_3
[28] Maurice Herlihy and Nir Shavit. 2008. The Art of Multiprocessor Pro-
gramming. Morgan Kaufmann Publishers Inc.
[29] Ting-Lu Huang and Chien-Hua Shann. 1998. A Comment on "A Cir-
cular List-Based Mutual Exclusion Scheme for Large Shared-Memory
Multiprocessors". IEEE Transactions on Parallel and Distributed Systems
(1998). http://dx.doi.org/10.1109/71.667901
[30] Beng-Hong Lim and Anant Agarwal. 1994. Reactive Synchronization
Algorithms for Multiprocessors. In Proceedings of the Sixth Interna-
tional Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS VI). ACM. https://doi.org/10.1145/
195473.195490
[31] Waiman Long. [n. d.]. qspinlock: Introducing a 4-byte queue spin-
lock implementation. https://lwn.net/Articles/561775, July 31, 2013.
Accessed: 2018-09-19.
[32] M. Auslander, D. Edelsohn, O. Krieger, B. Rosenburg, and R. Wisniewski.
2003. Enhancement to the MCS lock for increased functionality and im-
proved programmability – U.S. patent application number 20030200457.
https://patents.google.com/patent/US20030200457
[33] George Marsaglia. 2003. Xorshift RNGs. Journal of Statistical Software,
Articles (2003). https://doi.org/10.18637/jss.v008.i14
[34] T. J. McCabe. 1976. A Complexity Measure. IEEE Transactions on
Software Engineering (1976). https://doi.org/10.1109/TSE.1976.233837
[35] John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for
Scalable Synchronization on Shared-memory Multiprocessors. ACM
Trans. Comput. Syst. (1991). http://doi.acm.org/10.1145/103727.103729
[36] Brian A. Nejmeh. 1988. NPATH: A Measure of Execution Path
Complexity and Its Applications. Commun. ACM (1988). http:
//doi.acm.org/10.1145/42372.42379
[37] Pedro Ramalhete. 2015. Ticket Lock - Array of Waiting
Nodes (AWN). http://concurrencyfreaks.blogspot.com/2015/01/
ticket-lock-array-of-waiting-nodes-awn.html
[38] Pedro Ramalhete and Andreia Correia. 2016. Tidex: A Mutual Exclusion
Lock. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming (PPoPP ’16). http://doi.acm.org/
10.1145/2851141.2851171
[39] David P. Reed and Rajendra K. Kanodia. 1979. Synchronization with
Eventcounts and Sequencers. Commun. ACM (1979). http://doi.acm.
org/10.1145/359060.359076
[40] Michael L. Scott. 2013. Shared-Memory Synchronization. Morgan &
Claypool Publishers.
[41] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli,
and Magnus O. Myreen. 2010. X86-TSO: A Rigorous and Usable Pro-
grammer’s Model for x86 Multiprocessors. Commun. ACM (2010).
http://doi.acm.org/10.1145/1785414.1785443
[42] U. Verner, A. Mendelson, and A. Schuster. 2017. Extending Amdahl’s
Law for Multicores with Turbo Boost. IEEE Computer Architecture
Letters (2017). https://doi.org/10.1109/LCA.2015.2512982
[43] Wikipedia contributors. 2018. Birthday problem—Wikipedia, The Free
Encyclopedia. https://en.wikipedia.org/w/index.php?title=Birthday_
problem&oldid=853622452. [Online; accessed 8-August-2018].
6 Appendix: Algorithmic Variations
▶ TKT-Dual An interesting variation on TWA is to forgo
the waiting array and simply augment the ticket lock
structure to use two grant fields: one for short-term waiting,
used by the immediate successor and perhaps a small number
of other threads “near” the front of the conceptual queue, and
a second grant field for long-term waiting. To reduce
coherence traffic we isolated each grant field to be the sole
occupant of a cache sector, increasing the lock size. The
unlock operator first advances the short-term grant field, and,
as needed, may advance the long-term grant field to shift one
or more threads from long-term to short-term waiting,
constraining the number of short-term waiters and accelerating
handover. We refer to this form as TKT-Dual given the dual
encoding of the grant field. Initial experiments with this form
show promise, yielding results better than those of the baseline
ticket lock, although lagging slightly behind TWA. The
increased size means this form cannot be used in the Linux
kernel as a replacement for the qspinlock algorithm.
▶ TWA-Staged We are also exploring variations of TWA
where the fast uncontended path for both lock and unlock
operations would be identical to that of normal ticket locks,
specifically avoiding accesses to the waiting array in the
unlock operator. Briefly, we divide waiting threads into 3
groups: (A) Those that are 3 or more elements away from
the front of the conceptual queue. These “long-term” threads
wait via the waiting array in the usual TWA fashion. (B)
The thread that is 2 away from the head. This “transitional”
thread busy-waits on the global grant field. (C) The thread
that is 1 away from the head. This is the immediate successor,
and also busy-waits on the global grant field. Assume we
have a non-trivial set of waiting threads. Incrementing grant
in the unlock operator passes ownership to the thread T1
in (C) state above. T1 exits (C) state and becomes the owner.
Thread T2 in (B) state also observes the change in the grant
field, at which point it increments the waiting array slot
associated with the next ticket value – the ticket value one
after its assigned ticket – to transfer a thread from (A) state
to take T2’s place as the (B) thread. T2 then shifts from (B)
to (C) state. The downside to this approach is that we have
2 threads directly busy-waiting on grant, the (B) and (C)
threads, whereas in our other TWA variants we have at most
one thread busy-waiting on grant. The upside is that the
unlock operator does not access the waiting array, and all
the waiting array accesses – both loads and fetch-and-add
operations – are performed by waiting threads. As noted
above, the unlock operator simply increments the grant field.
This approach leverages those waiting threads to help drive
through the (A)→(B)→(C) transitions, reducing the path
length of the unlock operation. Relative to classic ticket
locks, all additional code is encapsulated in the locking slow
path. In the case of uncontended operations, the array is
never accessed and the path is the same as classic ticket
locks, improving uncontended unlock latency.
▶ TWA-ID We note that we can replace the atomic increment
of the waiting array element in the unlock path with
a simple store of 0 to that location by changing the waiting
array elements from counters to unique thread identity
references. Threads arriving in the long-term waiting state will
write their temporally unique non-zero identity value (which
can be as simple as the address of the stack pointer) into the
waiting array, recheck the grant value and then busy-wait
while the waiting array element remains equal to the thread
identity value they just stored. When the element changes,
the thread shifts to classic short-term waiting. While this
approach eliminates the atomic fetch-and-add in the unlock
path, it also increases write traffic into the shared array, as
threads entering the long-term waiting phase must store their
unique identity.
▶ TWA-Quantized To reduce the impact on the unlock
path we can arrange for the unlock operator to release a
quantum or gang of N threads – based on the ticket lock values
– into global spinning mode every Nth unlock operation.
This reduces the average cost of an unlock operation, but
increases the number of threads waiting via global spinning.
▶ 3-Stage A key observation in the design of TWA is the
importance of handover efficiency, improved by reducing the
number of threads concurrently busy-waiting on the grant
field. This insight informs the design of other locks that
perform gracefully under contention. One such variation is a
3-Stage mutex M which has 3 sub-locks: A, B and C. A would
typically be a FIFO lock such as MCS or ticket; B and C could
be simple test-and-set locks. To acquire M a thread uses the
following protocol: acquire A; acquire B; release A; acquire
C; release B. To release M a thread simply releases C. The
thread that holds C also holds M itself. (For efficiency, we
might shift the release of B to the unlock path.) The basic
operating theme is to get the expensive handoffs out of the
critical paths.
This approach, while seemingly complicated, confers a
number of advantages. If we have N waiting threads, then
N − 2 will typically wait on A, 1 on B and 1 on C. When we
release M, by releasing C, there is at most one thread waiting
on C, so handover costs arising from coherence traffic are
minimized. Once a thread has acquired C it still needs to
release B, and this occurs within the critical section of M.
But there is at most one waiter on B, so the cost to release
B isn’t excessive. A relatively large handover cost may be
incurred when we release A, but that latency overlaps with
execution of the critical section and does not manifest as
reduced throughput. The latency associated with the release
of A is subsumed into the waiting period for M and moved
out of the critical path, improving throughput and scalability.
Note that M is FIFO if A is FIFO. A protects the acquisition
of B, so as an optional optimization we can avoid atomics to
acquire B. And B likewise protects the acquisition of C.
To reduce the path length in the uncontended case, we can add a fast path where arriving threads trylock directly
on C and, if successful, bypass A and B, which exist only to mitigate contention. In this case, however, we need to
use atomics to acquire C in the slow path. Once we add a fast path with bypass we probably also need some
anti-starvation mechanism, but that is not hard: if the owner of B waits too long, it can become impatient and
require direct handoff the next time C is released. In the simple formulation without a trylock-based fast path,
B and C do not need atomics, as at most one thread attempts to acquire each of those sub-locks at any given time,
so plain loads and stores suffice.
The fetch-and-increment of the waiting array element in the unlock operator can be avoided as follows. The waiting
array contains unique thread identities instead of notification counters. Threads start a long-term waiting phase by
storing their non-zero identity into the array and then recheck the value of grant. They then busy-wait while the
array element remains equal to their own identity. The unlock operator simply stores 0 into the waiting array instead
of incrementing the location.
We also intend to look into ways to avoid the precautionary increment in the unlock path in the waiting array if there
are no waiters, or no long-term waiters. We hope to investigate the use of the external waiting array for other locks,
such as the “tidex” algorithm [38].
We also plan on kernel-level experiments to determine if TWA might be a viable replacement for the Linux kernel’s
existing qspinlock construct.
Finally, we believe that replacing the waiting array ele-
ments with pointers to chains of waiting threads may have
benefit. Briefly, each long-term waiting thread would have
an on-stack MCS-like queue node that it would push into the
appropriate chain in the waiting array, and then use local
spinning on a field within that node. Notification of long-
term waiters causes the chain to be detached via an atomic
SWAP instruction and all the elements are updated to reflect
that they should reevaluate the grant field. In the case of
collisions, waiting threads may need to re-enqueue on the
chain. This design recapitulates much of the Linux kernel
“futex” mechanism.
For environments that already use ticket locks, and where the size of the ticket lock is “baked” into binaries, TWA may
serve as a drop-in replacement if interposition is available. The same is not true of TKT-Dual, where an additional grant
field must be added to the lock structure.
Recall that for TWA, the locking fast path is exactly the same as TKT, but the unlock operator adds that precautionary
increment into the waiting array above and beyond the normal unlock code, so we’d like to get rid of that increment if
it’s not needed.
Under little or no contention, we can speed up unlock by trying to undo the increment of ticket with a CAS. That is, at
unlock time we try to revert the ticket value with a CAS instead of incrementing grant. If the CAS succeeds, then we
can skip bumping the notification indicator in the waiting array, and as a bonus, we don’t cyclically touch all the lines
under the waiting array with those precautionary increments, in turn decreasing D-cache pressure. This is slightly
faster than TWA at low or no contention, but falls about midway between TKT and TWA at higher levels of contention.
Another idea to eliminate the increment into the waiting array in unlock is as follows. We “stage” the waiting threads, so
the immediate successor, as usual, waits directly on the grant field. We’ll call the immediate successor S1. (S1 identifies
a specific role, not a thread.) That thread’s successor, which we’ll call S2, also waits directly on grant. Other waiters
deeper in the logical queue wait via the long-term waiting array. So we have 2 threads busy-waiting on grant, which isn’t
ideal. The subsequent unlock operation will pass ownership to S1. S2 also notices that grant changed, and assumes the
role of the immediate successor S1. As S2 transitions itself to the S1 position, it performs the fetch-and-add into the
waiting array to notify its successor to exit long-term waiting and to take over the role of S2. So we’ve eliminated the
increment into the waiting array from the unlock path, and made the waiting threads take over that responsibility. The
performance is just a tiny bit better than TWA under light or no contention, and under high contention performance
falls between classic TKT and TWA. This idea sounds good in principle, but it doesn’t really work. We suspect the issue
is that there’s still a real jump in the “invalidation diameter” between having 1 and 2 threads busy-waiting on the
location.
7 Appendix: Maximum Ideal Scalability
We note that our system-under-test does not exhibit ideal
linear scalability as we increase the number of threads. To
demonstrate this effect we use an IdealScalability
microbenchmark which spawns T concurrent threads, each of which
loops advancing a thread-local std::mt19937 pseudo-random
number generator. There is no sharing, no communication, and
no waiting; the threads are completely independent. At the
end of a 30 second measurement interval the benchmark
tallies and reports the aggregate number of random number
steps completed. We report the median of 5 such runs
for each data point. In Figure 13 we vary T on the X-axis,
and on the Y-axis we report the normalized throughput –
throughput divided by the throughput at T = 1 – divided by
the number of participating threads T. This yields a fraction
which reflects the slow-down caused by fraternal interference
for shared hardware compute resources, such as caches,
cores, pipelines, DRAM channels, etc. Specifically, we
observe that the progress rate of each thread is impeded by the
concurrent execution of unrelated threads.
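A microbenchmark of this shape might look like the following C++ sketch. It is our reconstruction from the description, not the authors' benchmark: `run_ideal`, `normalized_per_thread` and `g_sink` are hypothetical names, the measurement interval is a parameter rather than a fixed 30 seconds, and the only shared state is a stop flag polled once per 1024 steps plus a sink written once at thread exit to keep the compiler from eliding the generator.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

static std::atomic<std::uint_fast32_t> g_sink{0};  // defeats dead-code elimination

// Runs T independent threads for `interval`; returns the aggregate RNG step count.
std::uint64_t run_ideal(unsigned T, std::chrono::milliseconds interval) {
  std::atomic<bool> stop{false};
  std::vector<std::uint64_t> steps(T, 0);
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < T; ++i) {
    threads.emplace_back([&stop, &steps, i] {
      std::mt19937 rng(i + 1);  // thread-local generator
      std::uint64_t local = 0;
      std::uint_fast32_t acc = 0;
      while (!stop.load(std::memory_order_relaxed)) {
        for (int k = 0; k < 1024; ++k) acc ^= rng();
        local += 1024;
      }
      steps[i] = local;  // written once, read only after join
      g_sink.fetch_xor(acc, std::memory_order_relaxed);
    });
  }
  std::this_thread::sleep_for(interval);
  stop.store(true);
  for (std::thread& t : threads) t.join();
  std::uint64_t total = 0;
  for (std::uint64_t s : steps) total += s;
  return total;
}

// Y-axis value: (throughput at T / throughput at 1) / T.
double normalized_per_thread(std::uint64_t steps_T, std::uint64_t steps_1, unsigned T) {
  return (static_cast<double>(steps_T) / static_cast<double>(steps_1)) / T;
}
```

For example, if two threads together complete 1.5 times the steps of a single thread over the same interval, the reported value is 0.75, meaning each thread runs at three quarters of its solo rate.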
[Plot: Throughput per thread (Y-axis, 0.5 to 1.0) versus Threads (X-axis, 1 to 50)]
Figure 13. Maximum Ideal Scalability
